allennlp.data.dataset_readers.semantic_parsing.wikitables
Reader for WikiTableQuestions (https://github.com/ppasupat/WikiTableQuestions/releases/tag/v1.0.2).
class allennlp.data.dataset_readers.semantic_parsing.wikitables.wikitables.WikiTablesDatasetReader(lazy: bool = False, tables_directory: str = None, offline_logical_forms_directory: str = None, max_offline_logical_forms: int = 10, keep_if_no_logical_forms: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, question_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, table_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, use_table_for_vocab: bool = False, max_table_tokens: int = None, output_agendas: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
This DatasetReader takes WikiTableQuestions *.examples files and converts them into Instances suitable for use with the WikiTablesSemanticParser.

The *.examples files have pointers in them to two other files: a file that contains an associated table for each question, and a file that has pre-computed, possible logical forms. Because of how the DatasetReader API works, we need to take base directories for those other files in the constructor.

We initialize the dataset reader with paths to the tables directory and the directory where offline search output is stored if you are training. While testing, you can either provide existing table filenames or, if your question is about a new table, provide the content of the table as a dict (see TableQuestionContext.read_from_json() for the expected format). If you are doing the former, you still need to provide a tables_directory path here.

We lowercase the question and all table text, because the questions in the data are typically all lowercase anyway. This makes it so that any live demo that you put up will have questions that match the data this was trained on. Lowercasing the table text makes matching the lowercased question text easier.
Parameters
- lazy : bool, optional (default=False)
  Passed to DatasetReader. If this is True, training will start sooner, but will take longer per batch.
- tables_directory : str, optional
  Prefix for the path to the directory in which the tables reside. For example, *.examples files contain paths like csv/204-csv/590.csv, and we will use the corresponding tagged files by manipulating the paths in the examples files. This is the directory that contains the csv and tagged directories. This is only optional for Predictors (i.e., in a demo), where you're only calling text_to_instance().
- offline_logical_forms_directory : str, optional
  Directory that contains all the gzipped offline search output files. We assume the filenames match the example IDs (e.g. nt-0.gz). This is required for training a model, but not required for prediction.
- max_offline_logical_forms : int, optional (default=10)
  We will use the first max_offline_logical_forms logical forms as our target label. Only applicable if offline_logical_forms_directory is given.
- keep_if_no_logical_forms : bool, optional (default=False)
  If True, we will keep instances we read that don't have offline search output. If you want to compute denotation accuracy on the full dataset, you should set this to True. Otherwise, your accuracy numbers will only reflect the subset of the data that has offline search output.
- tokenizer : Tokenizer, optional
  Tokenizer to use for the questions. Will default to WordTokenizer() with Spacy's tagger enabled, as we use lemma matches as features for entity linking.
- question_token_indexers : Dict[str, TokenIndexer], optional
  Token indexers for questions. Will default to {"tokens": SingleIdTokenIndexer()}.
- table_token_indexers : Dict[str, TokenIndexer], optional
  Token indexers for table entities. Will default to question_token_indexers (though you very likely want to use something different for these, as you can't rely on having an embedding for every table entity at test time).
- use_table_for_vocab : bool, optional (default=False)
  If True, we will include table cell text in vocabulary creation. The original parser did not do this, because the same table can appear multiple times, messing with vocab counts and making you include lots of rare entities in your vocab.
- max_table_tokens : int, optional
  If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.
- output_agendas : bool, optional (default=False)
  Should we output agenda fields? This needs to be true if you want to train a coverage-based parser.
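A minimal training-time usage sketch (not part of the official docs): the directory layout and file names below are hypothetical placeholders, and the import path simply follows the module path documented above.

    from allennlp.data.dataset_readers.semantic_parsing.wikitables.wikitables import (
        WikiTablesDatasetReader,
    )
    from allennlp.data.token_indexers import SingleIdTokenIndexer

    # Hypothetical paths: the tables directory is assumed to contain the csv/
    # and tagged/ subdirectories, and the offline search directory the nt-*.gz files.
    reader = WikiTablesDatasetReader(
        tables_directory="/data/WikiTableQuestions",
        offline_logical_forms_directory="/data/wikitables_offline_search",
        max_offline_logical_forms=10,
        question_token_indexers={"tokens": SingleIdTokenIndexer()},
    )

    # Reads the questions, looks up each question's table, and attaches up to
    # max_offline_logical_forms target logical forms per instance.
    instances = reader.read("/data/WikiTableQuestions/data/random-split-1-train.examples")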
text_to_instance(self, question: str, table_lines: List[List[str]], target_values: List[str] = None, offline_search_output: List[str] = None) → allennlp.data.instance.Instance

Reads text inputs and makes an instance. We pass the table_lines to TableQuestionContext, and that method accepts this field either as lines from the CoreNLP-processed tagged files that come with the dataset, or simply in a tsv format where each line corresponds to a row and the cells are tab-separated.

Parameters
- question : str
  Input question.
- table_lines : List[List[str]]
  The table content, optionally preprocessed by CoreNLP. See TableQuestionContext.read_from_lines for the expected format.
- target_values : List[str], optional
  Target values for the denotations the logical forms should execute to. Not required for testing.
- offline_search_output : List[str], optional
  List of logical forms, produced by offline search. Not required during test.
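A hedged prediction-time sketch: the table content and question are made up, and the way the table is passed (raw tab-separated row strings, header row first, rather than pre-split cell lists) is an assumption based on the tsv format described above; TableQuestionContext.read_from_lines is the authoritative reference for the expected layout.

    from allennlp.data.dataset_readers.semantic_parsing.wikitables.wikitables import (
        WikiTablesDatasetReader,
    )

    # For text_to_instance-only use (as in a demo Predictor), no directories are needed.
    reader = WikiTablesDatasetReader()

    # Assumed tsv-style layout: one entry per table row, cells separated by tabs.
    table_lines = [
        "Year\tCity\tCountry",
        "1896\tAthens\tGreece",
        "1900\tParis\tFrance",
    ]
    instance = reader.text_to_instance(
        question="in which city were the first games held?",
        table_lines=table_lines,
    )
    print(sorted(instance.fields.keys()))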
allennlp.data.dataset_readers.semantic_parsing.wikitables.util.parse_example_line(lisp_string: str) → Dict

Training data in WikiTableQuestions comes with examples in the form of lisp strings in the format:

    (example (id <example-id>)
             (utterance <question>)
             (context (graph tables.TableKnowledgeGraph <table-filename>))
             (targetValue (list (description <answer1>) (description <answer2>) ...)))

We parse such strings and return the parsed information here.
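A brief sketch of calling this function: the example lisp string below is made up, and the exact keys of the returned dict are an assumption (named after the fields of the format above), since they are not spelled out in this documentation.

    from allennlp.data.dataset_readers.semantic_parsing.wikitables.util import parse_example_line

    line = ('(example (id nt-0) (utterance "what was the last year?") '
            '(context (graph tables.TableKnowledgeGraph csv/204-csv/590.csv)) '
            '(targetValue (list (description "2004"))))')
    parsed = parse_example_line(line)
    # Assumed result shape, keyed by the fields of the lisp format, e.g.:
    # {'id': 'nt-0', 'question': 'what was the last year?',
    #  'table_filename': 'csv/204-csv/590.csv', 'target_values': ['2004']}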