Reader for WikitableQuestions

class WikiTablesDatasetReader(lazy: bool = False, tables_directory: str = None, offline_logical_forms_directory: str = None, max_offline_logical_forms: int = 10, keep_if_no_logical_forms: bool = False, tokenizer: Tokenizer = None, question_token_indexers: Dict[str, TokenIndexer] = None, table_token_indexers: Dict[str, TokenIndexer] = None, use_table_for_vocab: bool = False, max_table_tokens: int = None, output_agendas: bool = False)


This DatasetReader takes WikiTableQuestions *.examples files and converts them into Instances suitable for use with the WikiTablesSemanticParser.

The *.examples files have pointers in them to two other files: a file that contains an associated table for each question, and a file that has pre-computed, possible logical forms. Because of how the DatasetReader API works, we need to take base directories for those other files in the constructor.

We initialize the dataset reader with paths to the tables directory and, if you are training, the directory where offline search output is stored. At test time, you can either provide existing table filenames or, if your question is about a new table, provide the content of the table as a dict (see TableQuestionContext.read_from_json() for the expected format). If you are doing the former, you still need to provide a tables_directory path here.

We lowercase the question and all table text, because the questions in the data are typically all lowercase, anyway. This makes it so that any live demo that you put up will have questions that match the data this was trained on. Lowercasing the table text makes matching the lowercased question text easier.
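
For example, here is a minimal sketch of constructing the reader for training and reading a dataset. The paths are hypothetical, and the import path is an assumption that may differ across AllenNLP versions:

    from allennlp.data.dataset_readers.semantic_parsing.wikitables import (
        WikiTablesDatasetReader,
    )

    # Hypothetical paths: tables_directory contains the csv/ and tagged/
    # directories, and the offline search directory holds the gzipped
    # logical form files.
    reader = WikiTablesDatasetReader(
        tables_directory="/data/WikiTableQuestions",
        offline_logical_forms_directory="/data/wikitables_offline_search",
    )
    train_instances = reader.read("/data/WikiTableQuestions/random-split-1-train.examples")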

lazy : bool, optional (default=False)

Passed to DatasetReader. If this is True, training will start sooner, but will take longer per batch.

tables_directory : str, optional

Prefix for the path to the directory in which the tables reside. For example, *.examples files contain paths like csv/204-csv/590.csv, and we will use the corresponding tagged files by manipulating the paths in the examples files. This is the directory that contains the csv and tagged directories. This is only optional for Predictors (i.e., in a demo), where you’re only calling text_to_instance().

offline_logical_forms_directory : str, optional

Directory that contains all the gzipped offline search output files. We assume the filenames match the example IDs (e.g.: nt-0.gz). This is required for training a model, but not required for prediction.

max_offline_logical_forms : int, optional (default=10)

We will use the first max_offline_logical_forms logical forms as our target label. Only applicable if offline_logical_forms_directory is given.

keep_if_no_logical_forms : bool, optional (default=False)

If True, we will keep instances we read that don’t have offline search output. If you want to compute denotation accuracy on the full dataset, you should set this to True. Otherwise, your accuracy numbers will only reflect the subset of the data that has offline search output.

tokenizer : Tokenizer, optional

Tokenizer to use for the questions. Will default to WordTokenizer() with Spacy’s tagger enabled, as we use lemma matches as features for entity linking.

question_token_indexers : Dict[str, TokenIndexer], optional

Token indexers for questions. Will default to {"tokens": SingleIdTokenIndexer()}.

table_token_indexers : Dict[str, TokenIndexer], optional

Token indexers for table entities. Will default to question_token_indexers (though you very likely want to use something different for these, as you can’t rely on having an embedding for every table entity at test time).

use_table_for_vocab : bool, optional (default=False)

If True, we will include table cell text in vocabulary creation. The original parser did not do this, because the same table can appear multiple times, messing with vocab counts, and making you include lots of rare entities in your vocab.

max_table_tokens : int, optional

If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.

output_agendas : bool, optional (default=False)

Should we output agenda fields? This needs to be True if you want to train a coverage-based parser.
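
Putting several of these parameters together, here is a hedged sketch of a fuller configuration; the values are illustrative, not recommendations:

    from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer

    reader = WikiTablesDatasetReader(
        tables_directory="/data/WikiTableQuestions",
        offline_logical_forms_directory="/data/wikitables_offline_search",
        max_offline_logical_forms=10,
        # Keep instances without offline search output, so denotation
        # accuracy is computed over the full dataset.
        keep_if_no_logical_forms=True,
        question_token_indexers={"tokens": SingleIdTokenIndexer()},
        # Character-level indexers for table entities, since we can't rely
        # on having an embedding for every entity at test time.
        table_token_indexers={"token_characters": TokenCharactersIndexer()},
        max_table_tokens=300,
    )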

text_to_instance(self, question: str, table_lines: List[List[str]], target_values: List[str] = None, offline_search_output: List[str] = None) → Instance

Reads text inputs and makes an instance. We pass the table_lines to TableQuestionContext, which accepts this field either as lines from the CoreNLP-processed tagged files that come with the dataset, or simply in a TSV format where each line corresponds to a row and the cells are tab-separated.


question : str

Input question.


table_lines : List[List[str]]

The table content, optionally preprocessed by CoreNLP. See TableQuestionContext.read_from_lines for the expected format.

target_values : List[str], optional

Target values for the denotations the logical forms should execute to. Not required for testing.

offline_search_output : List[str], optional

List of logical forms, produced by offline search. Not required during test.
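
For instance, here is a hedged sketch of building a single instance at prediction time. The table content is invented, and the row format (one inner list of cell strings per row, header first) is an assumption based on the List[List[str]] signature above; see TableQuestionContext.read_from_lines for the authoritative format:

    # One inner list per table row; plain (untagged) cell text.
    table_lines = [
        ["Nation", "Gold", "Silver", "Bronze"],
        ["Norway", "4", "3", "2"],
        ["Sweden", "2", "5", "1"],
    ]
    instance = reader.text_to_instance(
        question="how many silver medals did norway win?",
        table_lines=table_lines,
    )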
parse_example_line(lisp_string: str) → Dict

Training data in WikitableQuestions comes with examples in the form of lisp strings in the format:

    (example (id <example-id>)
             (utterance <question>)
             (context (graph tables.TableKnowledgeGraph <table-filename>))
             (targetValue (list (description <answer1>) (description <answer2>) ...)))

We parse such strings and return the parsed information here.
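
A hedged usage sketch follows; the example line is invented, and the keys of the returned dict are an assumption based on the fields in the lisp string above:

    line = (
        '(example (id nt-0) '
        '(utterance "what was the last year the team finished first?") '
        '(context (graph tables.TableKnowledgeGraph csv/204-csv/590.csv)) '
        '(targetValue (list (description "2004"))))'
    )
    parsed = parse_example_line(line)
    # Assumed structure, mirroring the fields in the lisp string:
    # parsed["id"]             -> "nt-0"
    # parsed["question"]       -> "what was the last year the team finished first?"
    # parsed["table_filename"] -> "csv/204-csv/590.csv"
    # parsed["target_values"]  -> ["2004"]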