allennlp.data.dataset_readers.reading_comprehension¶
Reading comprehension is loosely defined as follows: given a question and a passage of text that contains the answer, answer the question.
These submodules contain readers for things that are predominantly reading comprehension datasets.
- 
class allennlp.data.dataset_readers.reading_comprehension.drop.DropReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, passage_length_limit: int = None, question_length_limit: int = None, skip_when_all_empty: List[str] = None, instance_format: str = 'drop', relaxed_span_match_for_finding_labels: bool = True)[source]¶
- Bases: ``allennlp.data.dataset_readers.dataset_reader.DatasetReader``

Reads a JSON-formatted DROP dataset file and returns instances in a few different possible formats. The input format is complicated; see the test fixture for an example of what it looks like. The output formats all contain a question ``TextField``, a passage ``TextField``, and some kind of answer representation. Because DROP has instances with several different kinds of answers, this dataset reader allows you to filter out questions that do not have answers of a particular type (e.g., remove questions that have numbers as answers, if your model can only give passage spans as answers). We typically return all possible ways of arriving at a given answer string, and expect models to marginalize over these possibilities. A short usage sketch follows the parameter list below.

- Parameters
- tokenizer : ``Tokenizer``, optional (default=``WordTokenizer()``)
- We use this ``Tokenizer`` for both the question and the passage. See ``Tokenizer``. Default is ``WordTokenizer()``.
- token_indexers : ``Dict[str, TokenIndexer]``, optional
- We similarly use this for both the question and the passage. See ``TokenIndexer``. Default is ``{"tokens": SingleIdTokenIndexer()}``.
- lazy : ``bool``, optional (default=False)
- If this is true, ``instances()`` will return an object whose ``__iter__`` method reloads the dataset each time it's called. Otherwise, ``instances()`` returns a list.
- passage_length_limit : ``int``, optional (default=None)
- If specified, we will truncate the passage if its length exceeds this limit.
- question_length_limit : ``int``, optional (default=None)
- If specified, we will truncate the question if its length exceeds this limit.
- skip_when_all_empty : ``List[str]``, optional (default=None)
- In some cases, such as preparing training examples, you may want to skip examples that have no gold labels. You can specify the conditions under which examples should be skipped. Currently, you can put "passage_span", "question_span", "addition_subtraction", or "counting" in this list to tell the reader to skip an example when no such label is found for it. If not specified, we keep all examples.
- instance_format : ``str``, optional (default="drop")
- We try to be generous in providing a few different formats for the instances in DROP, in terms of the ``Fields`` that we return for each ``Instance``, to allow for several different kinds of models. The "drop" format detects numbers and the various ways those numbers can be arrived at from the passage, and returns ``Fields`` related to that. The "bert" format only allows passage spans as answers, and provides a "question_and_passage" field with the two pieces of text joined as BERT expects. The "squad" format provides the same fields that our BiDAF and other SQuAD models expect.
- relaxed_span_match_for_finding_labels : ``bool``, optional (default=True)
- DROP contains multi-span answers, and it is also usually hard to find exact span matches for its date-type answers. In order to use as many examples as possible for training, we may not want to require a strict match in such cases when finding the gold span labels. If this argument is true, we treat every span in a multi-span answer as correct, and every token in a date answer as correct, too. Because models trained on DROP typically marginalize over all possible answer positions, this is just being a little more generous about what is marginalized over. Note that this does not affect evaluation.
 
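A minimal usage sketch, not taken from the upstream docstring: the file path is a placeholder for a DROP-format JSON file, and the keyword arguments are illustrative values for the parameters described above.

```python
from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader

# Illustrative configuration; all values here are placeholders, not recommended settings.
reader = DropReader(
    passage_length_limit=400,     # truncate overly long passages
    question_length_limit=50,     # truncate overly long questions
    skip_when_all_empty=["passage_span", "question_span",
                         "addition_subtraction", "counting"],
    instance_format="drop",
)
instances = reader.read("drop_dataset_train.json")  # placeholder path
```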
 - 
static convert_word_to_number(word: str, try_to_include_more_numbers=False)[source]¶
- Currently we only support limited types of conversion. 
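For illustration, a few hedged examples of the kinds of conversions this static method supports; the values in the comments are indicative rather than guaranteed, since only limited conversions are promised.

```python
from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader

print(DropReader.convert_word_to_number("16"))     # a plain numeral -> 16
print(DropReader.convert_word_to_number("1,500"))  # comma-separated numerals are likely handled -> 1500
print(DropReader.convert_word_to_number("cat"))    # non-numeric words -> None
print(DropReader.convert_word_to_number("seven", try_to_include_more_numbers=True))
# with the flag set, simple number words may also be converted
```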
 - 
static extract_answer_info_from_annotation(answer_annotation: Dict[str, Any]) → Tuple[str, List[str]][source]¶
 - 
static find_valid_add_sub_expressions(numbers: List[int], targets: List[int], max_number_of_numbers_to_consider: int = 2) → List[List[int]][source]¶
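A hedged sketch of how this helper is typically called when preparing DROP arithmetic labels: ``numbers`` are the numbers extracted from the passage and ``targets`` the numeric gold answers. By assumption, each returned list is a per-number sign labeling for one valid signed combination that sums to a target.

```python
from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader

expressions = DropReader.find_valid_add_sub_expressions(
    numbers=[2, 5, 7], targets=[7], max_number_of_numbers_to_consider=2)
print(expressions)  # e.g. a labeling that marks 2 and 5 as added, since 2 + 5 = 7
```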
 - 
static find_valid_spans(passage_tokens: List[allennlp.data.tokenizers.token.Token], answer_texts: List[str]) → List[Tuple[int, int]][source]¶
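A small, hedged example: tokenize a toy passage and look up all inclusive token spans whose text matches one of the answer strings.

```python
from allennlp.data.tokenizers import WordTokenizer
from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader

passage_tokens = WordTokenizer().tokenize("The cat sat on the mat.")
print(DropReader.find_valid_spans(passage_tokens, ["mat"]))  # expected [(5, 5)]
```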
 - 
static make_bert_drop_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], question_concat_passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, answer_info: Dict[str, Any] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]¶
 - 
static make_marginal_drop_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], number_tokens: List[allennlp.data.tokenizers.token.Token], number_indices: List[int], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, answer_info: Dict[str, Any] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]¶
 - 
text_to_instance(self, question_text: str, passage_text: str, question_id: str = None, passage_id: str = None, answer_annotations: List[Dict] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → Union[allennlp.data.instance.Instance, NoneType][source]¶
- Does whatever tokenization or processing is necessary to go from textual input to an ``Instance``. The primary intended use for this is with a ``Predictor``, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between ``_read()`` and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the ``DatasetReader`` to process new text lets us accomplish this, as we can just call ``DatasetReader.text_to_instance`` when serving predictions.

The input type here is rather vaguely specified, unfortunately. The ``Predictor`` will have to make some assumptions about the kind of ``DatasetReader`` that it's using, in order to pass it the right information.
 
- 
class allennlp.data.dataset_readers.reading_comprehension.squad.SquadReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, passage_length_limit: int = None, question_length_limit: int = None, skip_invalid_examples: bool = False)[source]¶
- Bases: ``allennlp.data.dataset_readers.dataset_reader.DatasetReader``

Reads a JSON-formatted SQuAD file and returns a ``Dataset`` where the ``Instances`` have four fields: ``question``, a ``TextField``, ``passage``, another ``TextField``, and ``span_start`` and ``span_end``, both ``IndexFields`` into the ``passage`` ``TextField``. We also add a ``MetadataField`` that stores the instance's ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as ``metadata['id']``, ``metadata['original_passage']``, ``metadata['answer_texts']`` and ``metadata['token_offsets']``. This is so that we can more easily use the official SQuAD evaluation script to get metrics.

We also support limiting the maximum length of both the passage and the question. However, some gold answer spans may exceed the maximum passage length, which would cause errors when making instances. We simply skip these spans to avoid errors. If all of the gold answer spans of an example are skipped, we skip the example during training. During validation or testing we cannot skip examples, so we use the last token as a pseudo gold answer span instead. The computed loss will not be accurate as a result, but this does not affect answer evaluation, because we keep all the original gold answer texts. A short usage sketch follows the parameter list below.

- Parameters
- tokenizer : ``Tokenizer``, optional (default=``WordTokenizer()``)
- We use this ``Tokenizer`` for both the question and the passage. See ``Tokenizer``. Default is ``WordTokenizer()``.
- token_indexers : ``Dict[str, TokenIndexer]``, optional
- We similarly use this for both the question and the passage. See ``TokenIndexer``. Default is ``{"tokens": SingleIdTokenIndexer()}``.
- lazy : ``bool``, optional (default=False)
- If this is true, ``instances()`` will return an object whose ``__iter__`` method reloads the dataset each time it's called. Otherwise, ``instances()`` returns a list.
- passage_length_limit : ``int``, optional (default=None)
- If specified, we will truncate the passage if its length exceeds this limit.
- question_length_limit : ``int``, optional (default=None)
- If specified, we will truncate the question if its length exceeds this limit.
- skip_invalid_examples : ``bool``, optional (default=False)
- If this is true, we will skip invalid examples, i.e., examples whose gold answer spans are all skipped as described above.
 
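A minimal, hedged usage sketch, not taken from the upstream docstring: the character offsets are computed for a toy passage, and the constructor arguments are illustrative values for the parameters described above.

```python
from allennlp.data.dataset_readers.reading_comprehension.squad import SquadReader

reader = SquadReader(passage_length_limit=400, question_length_limit=50,
                     skip_invalid_examples=True)

passage = "The cat sat on the mat."
instance = reader.text_to_instance(
    question_text="Who sat on the mat?",
    passage_text=passage,
    # character span of the answer: (start, start + len(answer))
    char_spans=[(passage.index("cat"), passage.index("cat") + len("cat"))],
    answer_texts=["cat"],
)
print(instance.fields.keys())  # expected: question, passage, span_start, span_end, metadata
```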
 - 
text_to_instance(self, question_text: str, passage_text: str, char_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → Union[allennlp.data.instance.Instance, NoneType][source]¶
- Does whatever tokenization or processing is necessary to go from textual input to an ``Instance``. The primary intended use for this is with a ``Predictor``, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between ``_read()`` and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the ``DatasetReader`` to process new text lets us accomplish this, as we can just call ``DatasetReader.text_to_instance`` when serving predictions.

The input type here is rather vaguely specified, unfortunately. The ``Predictor`` will have to make some assumptions about the kind of ``DatasetReader`` that it's using, in order to pass it the right information.
 
- 
class allennlp.data.dataset_readers.reading_comprehension.triviaqa.TriviaQaReader(base_tarball_path: str, unfiltered_tarball_path: str = None, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)[source]¶
- Bases: ``allennlp.data.dataset_readers.dataset_reader.DatasetReader``

Reads the TriviaQA dataset into a ``Dataset`` containing ``Instances`` with four fields: ``question`` (a ``TextField``), ``passage`` (another ``TextField``), ``span_start``, and ``span_end`` (both ``IndexFields``).

TriviaQA is split up into several JSON files defining the questions, and a lot of text files containing crawled web documents. We read these from a gzipped tarball, to avoid having to have millions of individual files on a filesystem.

Because we need to read both train and validation files from the same tarball, we take the tarball itself as a constructor parameter, and take the question file as the argument to ``read``. This means that you should give the path to the tarball in the ``dataset_reader`` parameters in your experiment configuration file, and something like ``"wikipedia-train.json"`` for the ``train_data_path`` and ``validation_data_path``. A short usage sketch follows the parameter list below.

- Parameters
- base_tarball_path : ``str``
- This is the path to the main ``tar.gz`` file you can download from the TriviaQA website, with directories ``evidence`` and ``qa``.
- unfiltered_tarball_path : ``str``, optional
- This is the path to the "unfiltered" TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.
- tokenizer : ``Tokenizer``, optional
- We'll use this tokenizer on questions and evidence passages, defaulting to ``WordTokenizer`` if none is provided.
- token_indexers : ``Dict[str, TokenIndexer]``, optional
- Determines how both the question and the evidence passages are represented as arrays. See ``TokenIndexer``. Default is to have a single word ID for every token.
 
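Following the description above, a hedged sketch of how the tarball path and the question file are split between the constructor and ``read``. The tarball path is a placeholder, and the dev file name is assumed to follow the same pattern as the train file mentioned above.

```python
from allennlp.data.dataset_readers.reading_comprehension.triviaqa import TriviaQaReader

reader = TriviaQaReader(base_tarball_path="/path/to/triviaqa-rc.tar.gz")  # placeholder path
train_instances = reader.read("wikipedia-train.json")  # question file inside the tarball
dev_instances = reader.read("wikipedia-dev.json")      # assumed dev question file name
```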
 - 
pick_paragraphs(self, evidence_files: List[List[str]], question: str = None, answer_texts: List[str] = None) → List[str][source]¶
- Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.

To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that's likely cheating, depending on how you've defined the task.
 - 
text_to_instance(self, question_text: str, passage_text: str, token_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, question_tokens: List[allennlp.data.tokenizers.token.Token] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance[source]¶
- Does whatever tokenization or processing is necessary to go from textual input to an ``Instance``. The primary intended use for this is with a ``Predictor``, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between ``_read()`` and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the ``DatasetReader`` to process new text lets us accomplish this, as we can just call ``DatasetReader.text_to_instance`` when serving predictions.

The input type here is rather vaguely specified, unfortunately. The ``Predictor`` will have to make some assumptions about the kind of ``DatasetReader`` that it's using, in order to pass it the right information.
 
- 
class allennlp.data.dataset_readers.reading_comprehension.quac.QuACReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, num_context_answers: int = 0)[source]¶
- Bases: ``allennlp.data.dataset_readers.dataset_reader.DatasetReader``

Reads a JSON-formatted Question Answering in Context (QuAC) data file and returns a ``Dataset`` where the ``Instances`` have four fields: ``question``, a ``ListField``, ``passage``, another ``TextField``, and ``span_start`` and ``span_end``, both ``ListFields`` composed of ``IndexFields`` into the ``passage`` ``TextField``. Two more ``ListFields`` composed of ``LabelFields``, ``yesno_list`` and ``followup_list``, are also added. We also add a ``MetadataField`` that stores the instance's ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as ``metadata['id']``, ``metadata['original_passage']``, ``metadata['answer_text_lists']`` and ``metadata['token_offsets']``. A short usage sketch follows the parameter list below.

- Parameters
- tokenizer : ``Tokenizer``, optional (default=``WordTokenizer()``)
- We use this ``Tokenizer`` for both the question and the passage. See ``Tokenizer``. Default is ``WordTokenizer()``.
- token_indexers : ``Dict[str, TokenIndexer]``, optional
- We similarly use this for both the question and the passage. See ``TokenIndexer``. Default is ``{"tokens": SingleIdTokenIndexer()}``.
- num_context_answers : ``int``, optional
- How many previous answers in the dialog to consider as context.
 
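A minimal, hedged usage sketch; the file path is a placeholder for a QuAC-format JSON file, and ``num_context_answers=2`` is just an illustrative value.

```python
from allennlp.data.dataset_readers.reading_comprehension.quac import QuACReader

reader = QuACReader(num_context_answers=2)  # mark the two previous answers as context
instances = reader.read("quac_train.json")  # placeholder path
```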
 - 
text_to_instance(self, question_text_list: List[str], passage_text: str, start_span_list: List[List[int]] = None, end_span_list: List[List[int]] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None, yesno_list: List[int] = None, followup_list: List[int] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]¶
- Does whatever tokenization or processing is necessary to go from textual input to an ``Instance``. The primary intended use for this is with a ``Predictor``, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between ``_read()`` and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the ``DatasetReader`` to process new text lets us accomplish this, as we can just call ``DatasetReader.text_to_instance`` when serving predictions.

The input type here is rather vaguely specified, unfortunately. The ``Predictor`` will have to make some assumptions about the kind of ``DatasetReader`` that it's using, in order to pass it the right information.
 
- 
class allennlp.data.dataset_readers.reading_comprehension.qangaroo.QangarooReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)[source]¶
- Bases: ``allennlp.data.dataset_readers.dataset_reader.DatasetReader``

Reads a JSON-formatted Qangaroo file and returns a ``Dataset`` where the ``Instances`` have six fields: ``candidates``, a ``ListField[TextField]``, ``query``, a ``TextField``, ``supports``, a ``ListField[TextField]``, ``answer``, a ``TextField``, and ``answer_index``, an ``IndexField``. We also add a ``MetadataField`` that stores the instance's ID and annotations if they are present. A short usage sketch follows the parameter list below.

- Parameters
- tokenizer : ``Tokenizer``, optional (default=``WordTokenizer()``)
- We use this ``Tokenizer`` for both the question and the passage. See ``Tokenizer``. Default is ``WordTokenizer()``.
- token_indexers : ``Dict[str, TokenIndexer]``, optional
- We similarly use this for both the question and the passage. See ``TokenIndexer``. Default is ``{"tokens": SingleIdTokenIndexer()}``.
 
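A hedged sketch of building a single instance through ``text_to_instance`` (signature shown below); the candidates, query, and support text are toy strings rather than real Qangaroo data, and the field names in the comment follow the class description above.

```python
from allennlp.data.dataset_readers.reading_comprehension.qangaroo import QangarooReader

reader = QangarooReader()
instance = reader.text_to_instance(
    candidates=["paris", "london"],
    query="country capital france",
    supports=["Paris is the capital of France."],
    answer="paris",
)
print(instance.fields.keys())  # expected to include candidates, query, supports, answer, answer_index
```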
 - 
text_to_instance(self, candidates: List[str], query: str, supports: List[str], _id: str = None, answer: str = None, annotations: List[List[str]] = None) → allennlp.data.instance.Instance[source]¶
- Does whatever tokenization or processing is necessary to go from textual input to an ``Instance``. The primary intended use for this is with a ``Predictor``, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between ``_read()`` and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the ``DatasetReader`` to process new text lets us accomplish this, as we can just call ``DatasetReader.text_to_instance`` when serving predictions.

The input type here is rather vaguely specified, unfortunately. The ``Predictor`` will have to make some assumptions about the kind of ``DatasetReader`` that it's using, in order to pass it the right information.
 
Utilities for reading comprehension dataset readers.
- 
allennlp.data.dataset_readers.reading_comprehension.util.char_span_to_token_span(token_offsets: List[Tuple[int, int]], character_span: Tuple[int, int]) → Tuple[Tuple[int, int], bool][source]¶
- Converts a character span from a passage into the corresponding token span in the tokenized version of the passage. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we'll do our best, but the behavior is officially undefined. We return an error flag in this case, and have some debug logging so you can figure out the cause of this issue (in SQuAD, these are mostly either tokenization problems or annotation problems; there's a fair amount of both).

The basic outline of this method is to find the token span that has the same offsets as the input character span. If the tokenizer tokenized the passage correctly and has matching offsets, this is easy. We try to be a little smart about cases where they don't match exactly, but mostly just find the closest thing we can.

The returned ``(begin, end)`` indices are inclusive for both ``begin`` and ``end``. So, for example, ``(2, 2)`` is the one-word span beginning at token index 2, ``(3, 4)`` is the two-word span beginning at token index 3, and so on. A worked example follows the return-value descriptions below.

Returns
- token_span : ``Tuple[int, int]``
- Inclusive span start and end token indices that match as closely as possible to the input character span.
- error : ``bool``
- Whether there was a problem in the match. If this is ``True``, the returned token span does not correspond exactly to the input character span, which indicates an error in either the tokenization or the annotated character span.
 
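A worked example under the inclusive-token-span convention described above; the toy offsets correspond to tokenizing "The cat sat" with character offsets attached.

```python
from allennlp.data.dataset_readers.reading_comprehension import util

token_offsets = [(0, 3), (4, 7), (8, 11)]  # "The", "cat", "sat"
span, error = util.char_span_to_token_span(token_offsets, (4, 11))  # characters of "cat sat"
print(span, error)  # expected (1, 2) and False, since the offsets line up exactly with tokens
```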
 
- 
allennlp.data.dataset_readers.reading_comprehension.util.find_valid_answer_spans(passage_tokens: List[allennlp.data.tokenizers.token.Token], answer_texts: List[str]) → List[Tuple[int, int]][source]¶
- Finds a list of token spans in ``passage_tokens`` that match the given ``answer_texts``. This tries to find all spans that would evaluate to correct given the SQuAD and TriviaQA official evaluation scripts, which do some normalization of the input text.

Note that this could return duplicate spans! The caller is expected to be able to handle possible duplicates (as already happens in the SQuAD dev set, for instance).
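A hedged example showing that several spans can come back when the answer text appears more than once in the passage; the expected output assumes the simple token normalization described above.

```python
from allennlp.data.tokenizers import WordTokenizer
from allennlp.data.dataset_readers.reading_comprehension import util

passage_tokens = WordTokenizer().tokenize("The cat sat on the mat. The cat slept.")
print(util.find_valid_answer_spans(passage_tokens, ["cat"]))  # expected [(1, 1), (8, 8)]
```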
- 
allennlp.data.dataset_readers.reading_comprehension.util.handle_cannot(reference_answers: List[str])[source]¶
- Processes a list of reference answers. If at least half of the reference answers are "CANNOTANSWER", take it as gold. Otherwise, return the answers that are not "CANNOTANSWER".
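Two hedged examples of the majority rule described above; the expected outputs in the comments follow that description rather than being quoted from the library.

```python
from allennlp.data.dataset_readers.reading_comprehension import util

print(util.handle_cannot(["CANNOTANSWER", "CANNOTANSWER", "the cat"]))
# expected ["CANNOTANSWER"]: at least half of the references are CANNOTANSWER
print(util.handle_cannot(["the cat", "a cat", "CANNOTANSWER"]))
# expected ["the cat", "a cat"]: the CANNOTANSWER reference is dropped
```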
- 
allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]¶
- Converts a question, a passage, and an optional answer (or answers) to an ``Instance`` for use in a reading comprehension model.

Creates an ``Instance`` with at least these fields: ``question`` and ``passage``, both ``TextFields``; and ``metadata``, a ``MetadataField``. Additionally, if both ``answer_texts`` and ``token_spans`` are given, the ``Instance`` has ``span_start`` and ``span_end`` fields, which are both ``IndexFields``. A short end-to-end sketch follows the parameter list below.

- Parameters
- question_tokens : ``List[Token]``
- An already-tokenized question.
- passage_tokens : ``List[Token]``
- An already-tokenized passage that contains the answer to the given question.
- token_indexers : ``Dict[str, TokenIndexer]``
- Determines how the question and passage ``TextFields`` will be converted into tensors that get input to a model. See ``TokenIndexer``.
- passage_text : ``str``
- The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
- token_spans : ``List[Tuple[int, int]]``, optional
- Indices into ``passage_tokens`` to use as the answer to the question for training. This is a list because there might be several possible correct answer spans in the passage. Currently, we just select the most frequent span in this list (i.e., SQuAD has multiple annotations on the dev set; this will select the span that the most annotators gave as correct).
- answer_texts : ``List[str]``, optional
- All valid answer strings for the given question. In SQuAD, e.g., the training set has exactly one answer per question, but the dev and test sets have several. TriviaQA has many possible answers, which are the aliases for the known correct entity. This is put into the metadata for use with official evaluation scripts, but not used anywhere else.
- additional_metadata : ``Dict[str, Any]``, optional
- The constructed ``metadata`` field will by default contain ``original_passage``, ``token_offsets``, ``question_tokens``, ``passage_tokens``, and ``answer_texts`` keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the ``metadata`` dictionary we already construct.
 
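A short, hedged end-to-end sketch tying the utilities above together: tokenize a toy passage and question, find the answer spans, and build the ``Instance``. The ``"tokens"`` indexer name is the conventional default and not mandated by this function.

```python
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WordTokenizer
from allennlp.data.dataset_readers.reading_comprehension import util

tokenizer = WordTokenizer()
passage_text = "The cat sat on the mat."
passage_tokens = tokenizer.tokenize(passage_text)
question_tokens = tokenizer.tokenize("Who sat on the mat?")

instance = util.make_reading_comprehension_instance(
    question_tokens=question_tokens,
    passage_tokens=passage_tokens,
    token_indexers={"tokens": SingleIdTokenIndexer()},
    passage_text=passage_text,
    token_spans=util.find_valid_answer_spans(passage_tokens, ["cat"]),
    answer_texts=["cat"],
)
print(instance.fields.keys())  # expected: question, passage, span_start, span_end, metadata
```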
 
- 
allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance_quac(question_list_tokens: List[List[allennlp.data.tokenizers.token.Token]], passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_span_lists: List[List[Tuple[int, int]]] = None, yesno_list: List[int] = None, followup_list: List[int] = None, additional_metadata: Dict[str, Any] = None, num_context_answers: int = 0) → allennlp.data.instance.Instance[source]¶
- Converts a question, a passage, and an optional answer (or answers) to an ``Instance`` for use in a reading comprehension model.

Creates an ``Instance`` with at least these fields: ``question`` and ``passage``, both ``TextFields``; and ``metadata``, a ``MetadataField``. Additionally, if both ``answer_texts`` and ``token_span_lists`` are given, the ``Instance`` has ``span_start`` and ``span_end`` fields, which are both ``IndexFields``.

- Parameters
- question_list_tokens : ``List[List[Token]]``
- An already-tokenized list of questions. Each dialog has multiple questions.
- passage_tokens : ``List[Token]``
- An already-tokenized passage that contains the answer to the given question.
- token_indexers : ``Dict[str, TokenIndexer]``
- Determines how the question and passage ``TextFields`` will be converted into tensors that get input to a model. See ``TokenIndexer``.
- passage_text : ``str``
- The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
- token_span_lists : ``List[List[Tuple[int, int]]]``, optional
- Indices into ``passage_tokens`` to use as the answer to the question for training. This is a list of lists, first because there are multiple questions per dialog, and second because there might be several possible correct answer spans in the passage. Currently, we just select the last span in this list (i.e., QuAC has multiple annotations on the dev set; this will select the last span, which was given by the original annotator).
- yesno_list : ``List[int]``
- List of the affirmation bit for each question-answer pair.
- followup_list : ``List[int]``
- List of the continuation bit for each question-answer pair.
- num_context_answers : ``int``, optional
- How many answers to encode into the passage.
- additional_metadata : ``Dict[str, Any]``, optional
- The constructed ``metadata`` field will by default contain ``original_passage``, ``token_offsets``, ``question_tokens``, ``passage_tokens``, and ``answer_texts`` keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the ``metadata`` dictionary we already construct.
 
 
- 
allennlp.data.dataset_readers.reading_comprehension.util.normalize_text(text: str) → str[source]¶
- Performs a normalization that is very similar to that done by the normalization functions in SQuAD and TriviaQA.

This involves splitting and rejoining the text, and could be a somewhat expensive operation.
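A hedged example of the kind of normalization described above (lowercasing, dropping articles, and collapsing whitespace, in the style of the SQuAD evaluation script); the expected output is indicative rather than guaranteed.

```python
from allennlp.data.dataset_readers.reading_comprehension import util

print(util.normalize_text("The  Quick Brown Fox"))  # expected "quick brown fox"
```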