allennlp.data.dataset_readers.semantic_parsing
class allennlp.data.dataset_readers.semantic_parsing.atis.AtisDatasetReader(token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, keep_if_unparseable: bool = False, lazy: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, database_file: str = None, num_turns_to_concatenate: int = 1)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
This DatasetReader takes json files and converts them into Instances for the AtisSemanticParser. Each line in the file is a JSON object that represents an interaction in the ATIS dataset, with the following keys and values:

- "id": The original filepath in the LDC corpus
- "interaction": A list where each element represents a turn in the interaction
- "scenario": A code that refers to the scenario that served as the prompt for this interaction
- "ut_date": Date of the interaction
- "zc09_path": Path that was used in the original paper "Learning Context-Dependent Mappings from Sentences to Logical Form" (https://www.semanticscholar.org/paper/Learning-Context-Dependent-Mappings-from-Sentences-Zettlemoyer-Collins/44a8fcee0741139fa15862dc4b6ce1e11444878f) by Zettlemoyer and Collins (ACL/IJCNLP 2009)
Each element in the interaction list has the following keys and values:

- "utterance": Natural language input
- "sql": A list of SQL queries that the utterance maps to; it may contain multiple SQL queries or none at all
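The format above can be sketched with a minimal interaction line. The field values here are invented for illustration; real lines come from the LDC corpus.

```python
import json

# A hypothetical ATIS-style interaction line, following the keys described above.
line = json.dumps({
    "id": "atis/raw_data/example",
    "scenario": "A-1",
    "ut_date": "930315",
    "interaction": [
        {"utterance": "show me flights from boston to denver",
         "sql": ["SELECT ... FROM flight ..."]},
        {"utterance": "which ones leave in the morning",
         "sql": []},
    ],
})

interaction = json.loads(line)
# Each turn pairs an utterance with zero or more equivalent SQL queries.
utterances = [turn["utterance"] for turn in interaction["interaction"]]
print(utterances)
```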
- Parameters

- token_indexers : ``Dict[str, TokenIndexer]``, optional
  Token indexers for the utterances. Will default to ``{"tokens": SingleIdTokenIndexer()}``.
- keep_if_unparseable : ``bool``, optional (default=False)
  Whether or not to keep examples that we can't parse.
- lazy : ``bool``, optional (default=False)
  Passed to ``DatasetReader``. If this is ``True``, training will start sooner, but will take longer per batch.
- tokenizer : ``Tokenizer``, optional
  Tokenizer to use for the utterances. Will default to ``WordTokenizer()`` with Spacy's tagger enabled.
- database_file : ``str``, optional
  The directory containing the sqlite database file. We query the sqlite database to find the strings that are allowed.
- num_turns_to_concatenate : ``int``, optional (default=1)
  The number of utterances to concatenate as the conversation context.
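As a rough illustration of what ``num_turns_to_concatenate`` controls: only the last N utterances are kept as conversation context. The joining convention used here (a plain space) is an assumption for illustration, not the reader's actual delimiter.

```python
def concatenate_context(utterances, num_turns_to_concatenate=1):
    """Keep only the last `num_turns_to_concatenate` utterances and join them.

    This mirrors the idea behind num_turns_to_concatenate described above;
    the space delimiter is an assumption for illustration.
    """
    return " ".join(utterances[-num_turns_to_concatenate:])

turns = ["show me flights from boston to denver",
         "which ones leave in the morning"]
print(concatenate_context(turns, num_turns_to_concatenate=2))
```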
text_to_instance(self, utterances: List[str], sql_query_labels: List[str] = None) → allennlp.data.instance.Instance

- Parameters

- utterances : ``List[str]``, required
  List of utterances in the interaction; the last element is the current utterance.
- sql_query_labels : ``List[str]``, optional
  The SQL queries that are given as labels during training or validation.
class allennlp.data.dataset_readers.semantic_parsing.nlvr.NlvrDatasetReader(lazy: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, sentence_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, nonterminal_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, terminal_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, output_agendas: bool = True)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
DatasetReader for the NLVR domain. In addition to the usual methods for reading files and instances from text, this class contains a method for creating an agenda of actions that each sentence triggers, if needed. Note that we deal with the version of the dataset with structured representations of the synthetic images instead of the actual images themselves.

We support multiple data formats here:

1) The original json version of the NLVR dataset (http://lic.nlp.cornell.edu/nlvr/), where the format of each line in the jsonl file is:

- "sentence": <sentence>
- "label": <true/false>
- "identifier": <id>
- "evals": <dict containing all annotations>
- "structured_rep": <list of three box representations, where each box is a list of object representation dicts, containing fields "x_loc", "y_loc", "color", "type", "size">
2) A grouped version (constructed using scripts/nlvr/group_nlvr_worlds.py) where we group all the worlds that a sentence appears in. We use the fields sentence, label and structured_rep. The format of the grouped files is:

- "sentence": <sentence>
- "labels": <list of labels corresponding to worlds the sentence appears in>
- "identifier": <id that is only the prefix from the original data>
- "worlds": <list of structured representations>
3) A processed version that contains action sequences that lead to the correct denotations (or not), using some search. This format is very similar to the grouped format, and has the following extra field:

- "correct_sequences": <list of lists of action sequences corresponding to logical forms that evaluate to the correct denotations>
- Parameters

- lazy : ``bool``, optional (default=False)
  Passed to ``DatasetReader``. If this is ``True``, training will start sooner, but will take longer per batch.
- tokenizer : ``Tokenizer``, optional
  The tokenizer used for sentences in NLVR. Default is ``WordTokenizer``.
- sentence_token_indexers : ``Dict[str, TokenIndexer]``, optional
  Token indexers for tokens in input sentences. Default is ``{"tokens": SingleIdTokenIndexer()}``.
- nonterminal_indexers : ``Dict[str, TokenIndexer]``, optional
  Indexers for non-terminals in production rules. The default is to index terminals and non-terminals in the same way, but you may want to change it. Default is ``{"tokens": SingleIdTokenIndexer("rule_labels")}``.
- terminal_indexers : ``Dict[str, TokenIndexer]``, optional
  Indexers for terminals in production rules. The default is to index terminals and non-terminals in the same way, but you may want to change it. Default is ``{"tokens": SingleIdTokenIndexer("rule_labels")}``.
- output_agendas : ``bool``, optional
  If preparing data for a trainer that uses agendas, set this flag and the dataset reader will output agendas.
text_to_instance(self, sentence: str, structured_representations: List[List[List[Dict[str, Any]]]], labels: List[str] = None, target_sequences: List[List[str]] = None, identifier: str = None) → allennlp.data.instance.Instance

- Parameters

- sentence : ``str``
  The query sentence.
- structured_representations : ``List[List[List[JsonDict]]]``
  A list of Json representations of all the worlds. See expected format in this class' docstring.
- labels : ``List[str]``, optional
  List of string representations of the labels (true or false) corresponding to the structured_representations. Not required while testing.
- target_sequences : ``List[List[str]]``, optional
  List of target action sequences for each element which lead to the correct denotation in worlds corresponding to the structured representations.
- identifier : ``str``, optional
  The identifier from the dataset, if available.
class allennlp.data.dataset_readers.semantic_parsing.template_text2sql.TemplateText2SqlDatasetReader(use_all_sql: bool = False, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, cross_validation_split_to_exclude: int = None, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads text2sql data for the sequence tagging and template prediction baseline from “Improving Text to SQL Evaluation Methodology”.
- Parameters

- use_all_sql : ``bool``, optional (default=False)
  Whether to use all of the sql queries which have identical semantics, or whether to just use the first one.
- token_indexers : ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)
  We use this to define the input representation for the text. See ``TokenIndexer``. Note that the output tags will always correspond to single token IDs based on how they are pre-tokenised in the data file.
- cross_validation_split_to_exclude : ``int``, optional (default=None)
  Some of the text2sql datasets are very small, so you may need to do cross validation. Here, you can specify an integer corresponding to a split_{int}.json file not to include in the training set.
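A sketch of the cross-validation convention described above: the splits are stored as split_{int}.json files, and one split is held out of training. The file-naming pattern comes from the parameter description; the helper itself is an illustration, not the reader's actual file handling.

```python
def training_splits(num_splits, cross_validation_split_to_exclude=None):
    """Return the split_{int}.json filenames to train on, excluding one split.

    Hypothetical helper mirroring the cross_validation_split_to_exclude
    parameter described above.
    """
    return [
        f"split_{i}.json"
        for i in range(num_splits)
        if i != cross_validation_split_to_exclude
    ]

print(training_splits(4, cross_validation_split_to_exclude=2))
```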
text_to_instance(self, query: List[str], slot_tags: List[str] = None, sql_template: str = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
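The sharing pattern described above can be sketched with a toy reader. The class and method names mirror the DatasetReader API conceptually, but this is a self-contained illustration, not AllenNLP code.

```python
class ToyReader:
    def text_to_instance(self, query):
        # The single place where preprocessing (here, just lowercasing and
        # whitespace tokenization) is defined.
        return {"tokens": query.lower().split()}

    def read(self, lines):
        # Training-time reading funnels through text_to_instance ...
        return [self.text_to_instance(line) for line in lines]

reader = ToyReader()
train_instances = reader.read(["SELECT flights FROM boston"])
# ... and serving time calls the same method, so preprocessing cannot drift
# between training and prediction.
serving_instance = reader.text_to_instance("SELECT flights FROM boston")
assert serving_instance == train_instances[0]
```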
class allennlp.data.dataset_readers.semantic_parsing.grammar_based_text2sql.GrammarBasedText2SqlDatasetReader(schema_path: str, database_file: str = None, use_all_sql: bool = False, remove_unneeded_aliases: bool = True, use_prelinked_entities: bool = True, use_untyped_entities: bool = True, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, cross_validation_split_to_exclude: int = None, keep_if_unparseable: bool = True, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads text2sql data from “Improving Text to SQL Evaluation Methodology” for a type constrained semantic parser.
- Parameters

- schema_path : ``str``, required
  The path to the database schema.
- database_file : ``str``, optional (default=None)
  The path to a database.
- use_all_sql : ``bool``, optional (default=False)
  Whether to use all of the sql queries which have identical semantics, or whether to just use the first one.
- remove_unneeded_aliases : ``bool``, optional (default=True)
  Whether or not to remove table aliases in the SQL which are not required.
- use_prelinked_entities : ``bool``, optional (default=True)
  Whether or not to use the pre-linked entities in the text2sql data.
- use_untyped_entities : ``bool``, optional (default=True)
  Whether or not to attempt to infer the pre-linked entity types.
- token_indexers : ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)
  We use this to define the input representation for the text. See ``TokenIndexer``. Note that the output tags will always correspond to single token IDs based on how they are pre-tokenised in the data file.
- cross_validation_split_to_exclude : ``int``, optional (default=None)
  Some of the text2sql datasets are very small, so you may need to do cross validation. Here, you can specify an integer corresponding to a split_{int}.json file not to include in the training set.
- keep_if_unparseable : ``bool``, optional (default=True)
  Whether or not to keep examples that we can't parse using the grammar.
text_to_instance(self, query: List[str], prelinked_entities: Dict[str, Dict[str, str]] = None, sql: List[str] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
Reader for the QuaRel dataset.

class allennlp.data.dataset_readers.semantic_parsing.quarel.QuarelDatasetReader(lazy: bool = False, sample: int = -1, lf_syntax: str = None, replace_world_entities: bool = False, align_world_extractions: bool = False, gold_world_extractions: bool = False, tagger_only: bool = False, denotation_only: bool = False, world_extraction_model: Optional[str] = None, skip_attributes_regex: Optional[str] = None, entity_bits_mode: Optional[str] = None, entity_types: Optional[List[str]] = None, lexical_cues: List[str] = None, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, question_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
- Parameters

- lazy : ``bool``, optional (default=False)
  Passed to ``DatasetReader``. If this is ``True``, training will start sooner, but will take longer per batch.
- replace_world_entities : ``bool``, optional (default=False)
  Replace world entities (with stemming) with "worldone" and "worldtwo" directly in the question.
- world_extraction_model : ``str``, optional (default=None)
  Reference (file or URL) to the world tagger model used to extract worlds.
- align_world_extractions : ``bool``, optional (default=False)
  Use alignment of extracted worlds with gold worlds, to pick the appropriate gold LF.
- gold_world_extractions : ``bool``, optional (default=False)
  Use gold worlds rather than the world extractor.
- tagger_only : ``bool``, optional (default=False)
  Only output tagging information, in the format for the CRF tagger.
- denotation_only : ``bool``, optional (default=False)
  Only output information needed for a denotation-only model (no LF).
- entity_bits_mode : ``str``, optional (default=None)
  If set, add a field for entity bits ("simple" = 1.0 value for world1 and world2, "simple_collapsed" = single 1.0 value for any world).
- entity_types : ``List[str]``, optional (default=None)
  List of entity types used for the tagger model.
- lexical_cues : ``List[str]``, optional (default=None)
  List of lexical cue categories to include when using dynamic attributes.
- skip_attributes_regex : ``str``, optional (default=None)
  Regex string for which examples and attributes to skip if the LF matches.
- lf_syntax : ``str``
  Which logical form formalism to use.
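As a rough illustration of what ``replace_world_entities`` does conceptually: two world-entity mentions in the question are rewritten as "worldone" and "worldtwo". The matching here is naive substring replacement with invented entity names; the reader's actual implementation identifies entities itself and uses stemming.

```python
def replace_worlds(question, world1, world2):
    # Hypothetical helper: rewrite two known world mentions in place.
    return question.replace(world1, "worldone").replace(world2, "worldtwo")

q = "Does the ball roll farther on the carpet or on the ice?"
print(replace_worlds(q, "carpet", "ice"))
```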
preprocess(self, question_data: Dict[str, Any], predict: bool = False) → List[Dict[str, Any]]
text_to_instance(self, question: str, logical_forms: List[str] = None, additional_metadata: Dict[str, Any] = None, world_extractions: Dict[str, Union[str, List[str]]] = None, entity_literals: Dict[str, Union[str, List[str]]] = None, tokenized_question: List[allennlp.data.tokenizers.token.Token] = None, debug_counter: int = None, qr_spec_override: List[Dict[str, int]] = None, dynamic_entities_override: Dict[str, str] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.