allennlp.data.dataset_readers.semantic_parsing
class allennlp.data.dataset_readers.semantic_parsing.atis.AtisDatasetReader(token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, keep_if_unparseable: bool = False, lazy: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, database_file: str = None, num_turns_to_concatenate: int = 1)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
This DatasetReader takes json files and converts them into Instances for the AtisSemanticParser. Each line in the file is a JSON object that represents an interaction in the ATIS dataset, with the following keys and values:

- "id": The original filepath in the LDC corpus
- "interaction": A list where each element represents a turn in the interaction
- "scenario": A code that refers to the scenario that served as the prompt for this interaction
- "ut_date": Date of the interaction
- "zc09_path": Path that was used in the original paper "Learning Context-Dependent Mappings from Sentences to Logical Form" (https://www.semanticscholar.org/paper/Learning-Context-Dependent-Mappings-from-Sentences-Zettlemoyer-Collins/44a8fcee0741139fa15862dc4b6ce1e11444878f) by Zettlemoyer and Collins (ACL/IJCNLP 2009)
Each element in the interaction list has the following keys and values:

- "utterance": Natural language input
- "sql": A list of SQL queries that the utterance maps to; it may contain multiple SQL queries or none at all
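The format above can be sketched with a minimal interaction line. The field values here are invented for illustration; real lines come from the LDC corpus.

```python
import json

# A hypothetical ATIS-style interaction line, following the keys described above.
line = json.dumps({
    "id": "atis/raw_data/example",
    "scenario": "A-1",
    "ut_date": "930315",
    "interaction": [
        {"utterance": "show me flights from boston to denver",
         "sql": ["SELECT ... FROM flight ..."]},
        {"utterance": "which ones leave in the morning",
         "sql": []},
    ],
})

interaction = json.loads(line)
# Each turn pairs an utterance with zero or more equivalent SQL queries.
utterances = [turn["utterance"] for turn in interaction["interaction"]]
print(utterances)
```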
- Parameters

- token_indexers : ``Dict[str, TokenIndexer]``, optional
  Token indexers for the utterances. Will default to ``{"tokens": SingleIdTokenIndexer()}``.
- keep_if_unparseable : ``bool``, optional (default=False)
  Whether or not to keep examples that we can't parse.
- lazy : ``bool``, optional (default=False)
  Passed to ``DatasetReader``. If this is ``True``, training will start sooner, but will take longer per batch.
- tokenizer : ``Tokenizer``, optional
  Tokenizer to use for the utterances. Will default to ``WordTokenizer()`` with Spacy's tagger enabled.
- database_file : ``str``, optional
  The directory containing the sqlite database file. We query the sqlite database to find the strings that are allowed.
- num_turns_to_concatenate : ``int``, optional (default=1)
  The number of utterances to concatenate as the conversation context.
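As a rough illustration of what ``num_turns_to_concatenate`` controls: only the last N utterances are kept as conversation context. The joining convention used here (a plain space) is an assumption for illustration, not the reader's actual delimiter.

```python
def concatenate_context(utterances, num_turns_to_concatenate=1):
    """Keep only the last `num_turns_to_concatenate` utterances and join them.

    This mirrors the idea behind num_turns_to_concatenate described above;
    the space delimiter is an assumption for illustration.
    """
    return " ".join(utterances[-num_turns_to_concatenate:])

turns = ["show me flights from boston to denver",
         "which ones leave in the morning"]
print(concatenate_context(turns, num_turns_to_concatenate=2))
```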
text_to_instance(self, utterances: List[str], sql_query_labels: List[str] = None) → allennlp.data.instance.Instance

- Parameters

- utterances : ``List[str]``, required
  List of utterances in the interaction; the last element is the current utterance.
- sql_query_labels : ``List[str]``, optional
  The SQL queries that are given as labels during training or validation.
class allennlp.data.dataset_readers.semantic_parsing.nlvr.NlvrDatasetReader(lazy: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, sentence_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, nonterminal_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, terminal_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, output_agendas: bool = True)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
DatasetReader for the NLVR domain. In addition to the usual methods for reading files and instances from text, this class contains a method for creating an agenda of actions that each sentence triggers, if needed. Note that we deal with the version of the dataset with structured representations of the synthetic images instead of the actual images themselves.

We support multiple data formats here:

1) The original json version of the NLVR dataset (http://lic.nlp.cornell.edu/nlvr/), where the format of each line in the jsonl file is:

- "sentence": <sentence>
- "label": <true/false>
- "identifier": <id>
- "evals": <dict containing all annotations>
- "structured_rep": <list of three box representations, where each box is a list of object representation dicts, containing fields "x_loc", "y_loc", "color", "type", "size">
2) A grouped version (constructed using scripts/nlvr/group_nlvr_worlds.py) where we group all the worlds that a sentence appears in. We use the fields sentence, label and structured_rep. The format of the grouped files is:

- "sentence": <sentence>
- "labels": <list of labels corresponding to worlds the sentence appears in>
- "identifier": <id that is only the prefix from the original data>
- "worlds": <list of structured representations>
3) A processed version that contains action sequences that lead to the correct denotations (or not), using some search. This format is very similar to the grouped format, and has the following extra field:

- "correct_sequences": <list of lists of action sequences corresponding to logical forms that evaluate to the correct denotations>
- Parameters

- lazy : ``bool``, optional (default=False)
  Passed to ``DatasetReader``. If this is ``True``, training will start sooner, but will take longer per batch.
- tokenizer : ``Tokenizer``, optional
  The tokenizer used for sentences in NLVR. Default is ``WordTokenizer``.
- sentence_token_indexers : ``Dict[str, TokenIndexer]``, optional
  Token indexers for tokens in input sentences. Default is ``{"tokens": SingleIdTokenIndexer()}``.
- nonterminal_indexers : ``Dict[str, TokenIndexer]``, optional
  Indexers for non-terminals in production rules. The default is to index terminals and non-terminals in the same way, but you may want to change it. Default is ``{"tokens": SingleIdTokenIndexer("rule_labels")}``.
- terminal_indexers : ``Dict[str, TokenIndexer]``, optional
  Indexers for terminals in production rules. The default is to index terminals and non-terminals in the same way, but you may want to change it. Default is ``{"tokens": SingleIdTokenIndexer("rule_labels")}``.
- output_agendas : ``bool``, optional
  If preparing data for a trainer that uses agendas, set this flag and the dataset reader will output agendas.
text_to_instance(self, sentence: str, structured_representations: List[List[List[Dict[str, Any]]]], labels: List[str] = None, target_sequences: List[List[str]] = None, identifier: str = None) → allennlp.data.instance.Instance

- Parameters

- sentence : ``str``
  The query sentence.
- structured_representations : ``List[List[List[JsonDict]]]``
  A list of Json representations of all the worlds. See expected format in this class' docstring.
- labels : ``List[str]``, optional
  List of string representations of the labels (true or false) corresponding to the structured_representations. Not required while testing.
- target_sequences : ``List[List[str]]``, optional
  List of target action sequences for each element which lead to the correct denotation in worlds corresponding to the structured representations.
- identifier : ``str``, optional
  The identifier from the dataset, if available.
class allennlp.data.dataset_readers.semantic_parsing.template_text2sql.TemplateText2SqlDatasetReader(use_all_sql: bool = False, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, cross_validation_split_to_exclude: int = None, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads text2sql data for the sequence tagging and template prediction baseline from “Improving Text to SQL Evaluation Methodology”.
- Parameters

- use_all_sql : ``bool``, optional (default=False)
  Whether to use all of the sql queries which have identical semantics, or whether to just use the first one.
- token_indexers : ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)
  We use this to define the input representation for the text. See ``TokenIndexer``. Note that the output tags will always correspond to single token IDs based on how they are pre-tokenised in the data file.
- cross_validation_split_to_exclude : ``int``, optional (default=None)
  Some of the text2sql datasets are very small, so you may need to do cross validation. Here, you can specify an integer corresponding to a split_{int}.json file not to include in the training set.
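A sketch of the cross-validation convention described above: the splits are stored as split_{int}.json files, and one split is held out of training. The file-naming pattern comes from the parameter description; the helper itself is an illustration, not the reader's actual file handling.

```python
def training_splits(num_splits, cross_validation_split_to_exclude=None):
    """Return the split_{int}.json filenames to train on, excluding one split.

    Hypothetical helper mirroring the cross_validation_split_to_exclude
    parameter described above.
    """
    return [
        f"split_{i}.json"
        for i in range(num_splits)
        if i != cross_validation_split_to_exclude
    ]

print(training_splits(4, cross_validation_split_to_exclude=2))
```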
text_to_instance(self, query: List[str], slot_tags: List[str] = None, sql_template: str = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
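The sharing pattern described above can be sketched with a toy reader. The class and method names mirror the DatasetReader API conceptually, but this is a self-contained illustration, not AllenNLP code.

```python
class ToyReader:
    def text_to_instance(self, query):
        # The single place where preprocessing (here, just lowercasing and
        # whitespace tokenization) is defined.
        return {"tokens": query.lower().split()}

    def read(self, lines):
        # Training-time reading funnels through text_to_instance ...
        return [self.text_to_instance(line) for line in lines]

reader = ToyReader()
train_instances = reader.read(["SELECT flights FROM boston"])
# ... and serving time calls the same method, so preprocessing cannot drift
# between training and prediction.
serving_instance = reader.text_to_instance("SELECT flights FROM boston")
assert serving_instance == train_instances[0]
```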
class allennlp.data.dataset_readers.semantic_parsing.grammar_based_text2sql.GrammarBasedText2SqlDatasetReader(schema_path: str, database_file: str = None, use_all_sql: bool = False, remove_unneeded_aliases: bool = True, use_prelinked_entities: bool = True, use_untyped_entities: bool = True, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, cross_validation_split_to_exclude: int = None, keep_if_unparseable: bool = True, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads text2sql data from “Improving Text to SQL Evaluation Methodology” for a type constrained semantic parser.
- Parameters

- schema_path : ``str``, required
  The path to the database schema.
- database_file : ``str``, optional (default=None)
  The path to a database.
- use_all_sql : ``bool``, optional (default=False)
  Whether to use all of the sql queries which have identical semantics, or whether to just use the first one.
- remove_unneeded_aliases : ``bool``, optional (default=True)
  Whether or not to remove table aliases in the SQL which are not required.
- use_prelinked_entities : ``bool``, optional (default=True)
  Whether or not to use the pre-linked entities in the text2sql data.
- use_untyped_entities : ``bool``, optional (default=True)
  Whether or not to attempt to infer the pre-linked entity types.
- token_indexers : ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)
  We use this to define the input representation for the text. See ``TokenIndexer``. Note that the output tags will always correspond to single token IDs based on how they are pre-tokenised in the data file.
- cross_validation_split_to_exclude : ``int``, optional (default=None)
  Some of the text2sql datasets are very small, so you may need to do cross validation. Here, you can specify an integer corresponding to a split_{int}.json file not to include in the training set.
- keep_if_unparseable : ``bool``, optional (default=True)
  Whether or not to keep examples that we can't parse using the grammar.
text_to_instance(self, query: List[str], prelinked_entities: Dict[str, Dict[str, str]] = None, sql: List[str] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
Reader for the QuaRel dataset.

class allennlp.data.dataset_readers.semantic_parsing.quarel.QuarelDatasetReader(lazy: bool = False, sample: int = -1, lf_syntax: str = None, replace_world_entities: bool = False, align_world_extractions: bool = False, gold_world_extractions: bool = False, tagger_only: bool = False, denotation_only: bool = False, world_extraction_model: Optional[str] = None, skip_attributes_regex: Optional[str] = None, entity_bits_mode: Optional[str] = None, entity_types: Optional[List[str]] = None, lexical_cues: List[str] = None, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, question_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
- Parameters

- lazy : ``bool``, optional (default=False)
  Passed to ``DatasetReader``. If this is ``True``, training will start sooner, but will take longer per batch.
- replace_world_entities : ``bool``, optional (default=False)
  Replace world entities (with stemming) with "worldone" and "worldtwo" directly in the question.
- world_extraction_model : ``str``, optional (default=None)
  Reference (file or URL) to the world tagger model used to extract worlds.
- align_world_extractions : ``bool``, optional (default=False)
  Use alignment of extracted worlds with gold worlds, to pick the appropriate gold LF.
- gold_world_extractions : ``bool``, optional (default=False)
  Use gold worlds rather than the world extractor.
- tagger_only : ``bool``, optional (default=False)
  Only output tagging information, in the format for the CRF tagger.
- denotation_only : ``bool``, optional (default=False)
  Only output information needed for a denotation-only model (no LF).
- entity_bits_mode : ``str``, optional (default=None)
  If set, add a field for entity bits ("simple" = 1.0 value for world1 and world2, "simple_collapsed" = single 1.0 value for any world).
- entity_types : ``List[str]``, optional (default=None)
  List of entity types used for the tagger model.
- lexical_cues : ``List[str]``, optional (default=None)
  List of lexical cue categories to include when using dynamic attributes.
- skip_attributes_regex : ``str``, optional (default=None)
  Regex string for which examples and attributes to skip if the LF matches.
- lf_syntax : ``str``
  Which logical form formalism to use.
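As a rough illustration of what ``replace_world_entities`` does conceptually: two world-entity mentions in the question are rewritten as "worldone" and "worldtwo". The matching here is naive substring replacement with invented entity names; the reader's actual implementation identifies entities itself and uses stemming.

```python
def replace_worlds(question, world1, world2):
    # Hypothetical helper: rewrite two known world mentions in place.
    return question.replace(world1, "worldone").replace(world2, "worldtwo")

q = "Does the ball roll farther on the carpet or on the ice?"
print(replace_worlds(q, "carpet", "ice"))
```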
preprocess(self, question_data: Dict[str, Any], predict: bool = False) → List[Dict[str, Any]]
text_to_instance(self, question: str, logical_forms: List[str] = None, additional_metadata: Dict[str, Any] = None, world_extractions: Dict[str, Union[str, List[str]]] = None, entity_literals: Dict[str, Union[str, List[str]]] = None, tokenized_question: List[allennlp.data.tokenizers.token.Token] = None, debug_counter: int = None, qr_spec_override: List[Dict[str, int]] = None, dynamic_entities_override: Dict[str, str] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.