allennlp.data.dataset_readers.reading_comprehension

Reading comprehension is loosely defined as follows: given a question and a passage of text that contains the answer, answer the question.

These submodules contain readers for things that are predominantly reading comprehension datasets.

class allennlp.data.dataset_readers.reading_comprehension.drop.DropReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, passage_length_limit: int = None, question_length_limit: int = None, skip_when_all_empty: List[str] = None, instance_format: str = 'drop', relaxed_span_match_for_finding_labels: bool = True)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a JSON-formatted DROP dataset file and returns instances in a few different possible formats. The input format is complicated; see the test fixture for an example of what it looks like. The output formats all contain a question TextField, a passage TextField, and some kind of answer representation. Because DROP has instances with several different kinds of answers, this dataset reader allows you to filter out questions that do not have answers of a particular type (e.g., remove questions that have numbers as answers, if your model can only give passage spans as answers). We typically return all possible ways of arriving at a given answer string, and expect models to marginalize over these possibilities.

Parameters
tokenizer : Tokenizer, optional (default=WordTokenizer())

We use this Tokenizer for both the question and the passage. See Tokenizer. Default is WordTokenizer().

token_indexers : Dict[str, TokenIndexer], optional

We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

lazy : bool, optional (default=False)

If this is true, instances() will return an object whose __iter__ method reloads the dataset each time it’s called. Otherwise, instances() returns a list.

passage_length_limit : int, optional (default=None)

If specified, we will truncate the passage if its length exceeds this limit.

question_length_limit : int, optional (default=None)

If specified, we will truncate the question if its length exceeds this limit.

skip_when_all_empty : List[str], optional (default=None)

In some cases, such as preparing training examples, you may want to skip examples that have no gold labels. You can specify the conditions under which examples should be skipped. Currently, you can put "passage_span", "question_span", "addition_subtraction", or "counting" in this list, to tell the reader to skip an example when no such label is found. If not specified, we keep all examples.

instance_format : str, optional (default="drop")

We try to be generous in providing a few different formats for the instances in DROP, in terms of the Fields that we return for each Instance, to allow for several different kinds of models. “drop” format will do processing to detect numbers and various ways those numbers can be arrived at from the passage, and return Fields related to that. “bert” format only allows passage spans as answers, and provides a “question_and_passage” field with the two pieces of text joined as BERT expects. “squad” format provides the same fields that our BiDAF and other SQuAD models expect.

relaxed_span_match_for_finding_labels : bool, optional (default=True)

The DROP dataset contains multi-span answers, and exact span matches are also usually hard to find for date-type answers. In order to use as many examples as possible to train the model, we may not want to require a strict match in such cases when finding the gold span labels. If this argument is true, we treat every span in a multi-span answer as correct, and every token in a date answer as correct as well. Because models trained on DROP typically marginalize over all possible answer positions, this just makes what is being marginalized a little more generous. Note that this does not affect evaluation.
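For concreteness, here is a minimal sketch of constructing the reader directly in Python. The keyword arguments are the ones documented above; the length limits and file path are illustrative only, and the same keys would appear under dataset_reader in a Jsonnet experiment configuration.

    from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader
    from allennlp.data.token_indexers import SingleIdTokenIndexer

    # Illustrative settings; the file path is hypothetical.
    reader = DropReader(
        token_indexers={"tokens": SingleIdTokenIndexer()},
        passage_length_limit=400,
        question_length_limit=50,
        skip_when_all_empty=["passage_span", "question_span"],
        instance_format="drop",
    )
    instances = reader.read("/path/to/drop_dataset_train.json")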

static convert_word_to_number(word: str, try_to_include_more_numbers=False)[source]

Currently we only support limited types of conversion.

static extract_answer_info_from_annotation(answer_annotation: Dict[str, Any]) → Tuple[str, List[str]][source]
static find_valid_add_sub_expressions(numbers: List[int], targets: List[int], max_number_of_numbers_to_consider: int = 2) → List[List[int]][source]
static find_valid_counts(count_numbers: List[int], targets: List[int]) → List[int][source]
static find_valid_spans(passage_tokens: List[allennlp.data.tokenizers.token.Token], answer_texts: List[str]) → List[Tuple[int, int]][source]
static make_bert_drop_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], question_concat_passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, answer_info: Dict[str, Any] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]
static make_marginal_drop_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], number_tokens: List[allennlp.data.tokenizers.token.Token], number_indices: List[int], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, answer_info: Dict[str, Any] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]
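Of the static helpers above, find_valid_spans is the easiest to illustrate. The hedged sketch below tokenizes a short passage and asks for all spans matching an answer string; the exact indices returned depend on the tokenizer, so the comment is illustrative only.

    from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader
    from allennlp.data.tokenizers import WordTokenizer

    tokenizer = WordTokenizer()
    passage_tokens = tokenizer.tokenize("The Bears scored 21 points in the first quarter.")
    # Inclusive (start, end) token indices of every span matching "21 points",
    # e.g. [(3, 4)] under this tokenization.
    spans = DropReader.find_valid_spans(passage_tokens, ["21 points"])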
text_to_instance(self, question_text: str, passage_text: str, question_id: str = None, passage_id: str = None, answer_annotations: List[Dict] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → Union[allennlp.data.instance.Instance, NoneType][source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.

class allennlp.data.dataset_readers.reading_comprehension.squad.SquadReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, passage_length_limit: int = None, question_length_limit: int = None, skip_invalid_examples: bool = False)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a JSON-formatted SQuAD file and returns a Dataset where the Instances have four fields: question, a TextField, passage, another TextField, and span_start and span_end, both IndexFields into the passage TextField. We also add a MetadataField that stores the instance’s ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_texts'] and metadata['token_offsets']. This is so that we can more easily use the official SQuAD evaluation script to get metrics.

We also support limiting the maximum length of both the passage and the question. However, some gold answer spans may exceed the maximum passage length, which will cause errors when making instances. We simply skip these spans to avoid errors. If all of the gold answer spans of an example are skipped, we will skip the example during training. During validation or testing, since we cannot skip examples, we instead use the last token as a pseudo gold answer span. The computed loss will not be accurate as a result, but this does not affect answer evaluation, because we keep all the original gold answer texts.

Parameters
tokenizer : Tokenizer, optional (default=WordTokenizer())

We use this Tokenizer for both the question and the passage. See Tokenizer. Default is WordTokenizer().

token_indexers : Dict[str, TokenIndexer], optional

We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

lazy : bool, optional (default=False)

If this is true, instances() will return an object whose __iter__ method reloads the dataset each time it’s called. Otherwise, instances() returns a list.

passage_length_limit : int, optional (default=None)

If specified, we will truncate the passage if its length exceeds this limit.

question_length_limit : int, optional (default=None)

If specified, we will truncate the question if its length exceeds this limit.

skip_invalid_examples : bool, optional (default=False)

If this is true, we will skip examples whose gold answer spans are all invalid (for example, because they were cut off by the passage length limit).

text_to_instance(self, question_text: str, passage_text: str, char_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → Union[allennlp.data.instance.Instance, NoneType][source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.
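A minimal sketch of calling this method at serving time follows, assuming an already-constructed SquadReader. The character span marks "mat" in the passage and, like the example texts, is purely illustrative; at prediction time the span and answer arguments can simply be omitted.

    from allennlp.data.dataset_readers.reading_comprehension.squad import SquadReader

    reader = SquadReader()
    instance = reader.text_to_instance(
        question_text="Where did the cat sit?",
        passage_text="The cat sat on the mat.",
        char_spans=[(19, 22)],   # character offsets of "mat"; optional at prediction time
        answer_texts=["mat"],    # optional at prediction time
    )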

class allennlp.data.dataset_readers.reading_comprehension.triviaqa.TriviaQaReader(base_tarball_path: str, unfiltered_tarball_path: str = None, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads the TriviaQA dataset into a Dataset containing Instances with four fields: question (a TextField), passage (another TextField), span_start, and span_end (both IndexFields).

TriviaQA is split up into several JSON files defining the questions, and a lot of text files containing crawled web documents. We read these from a gzipped tarball, to avoid having to have millions of individual files on a filesystem.

Because we need to read both train and validation files from the same tarball, we take the tarball itself as a constructor parameter, and take the question file as the argument to read. This means that you should give the path to the tarball in the dataset_reader parameters in your experiment configuration file, and something like "wikipedia-train.json" for the train_data_path and validation_data_path.
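Here is a minimal sketch of that setup in Python, assuming the standard TriviaQA download; in a Jsonnet experiment configuration, base_tarball_path goes under dataset_reader while the JSON file names become train_data_path and validation_data_path. The paths and file names below are illustrative.

    from allennlp.data.dataset_readers.reading_comprehension.triviaqa import TriviaQaReader

    # Hypothetical local path to the tarball downloaded from the TriviaQA website.
    reader = TriviaQaReader(base_tarball_path="/path/to/triviaqa-rc.tar.gz")
    train_instances = reader.read("wikipedia-train.json")
    validation_instances = reader.read("wikipedia-dev.json")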

Parameters
base_tarball_path : str

This is the path to the main tar.gz file you can download from the TriviaQA website, with directories evidence and qa.

unfiltered_tarball_path : str, optional

This is the path to the “unfiltered” TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.

tokenizer : Tokenizer, optional

We’ll use this tokenizer on questions and evidence passages, defaulting to WordTokenizer if none is provided.

token_indexers : Dict[str, TokenIndexer], optional

Determines how both the question and the evidence passages are represented as arrays. See TokenIndexer. Default is to have a single word ID for every token.

pick_paragraphs(self, evidence_files: List[List[str]], question: str = None, answer_texts: List[str] = None) → List[str][source]

Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.

To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that’s likely cheating, depending on how you’ve defined the task.

text_to_instance(self, question_text: str, passage_text: str, token_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, question_tokens: List[allennlp.data.tokenizers.token.Token] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance[source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.

class allennlp.data.dataset_readers.reading_comprehension.quac.QuACReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, num_context_answers: int = 0)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a JSON-formatted Question Answering in Context (QuAC) data file and returns a Dataset where the Instances have four fields: question, a ListField; passage, a TextField; and span_start and span_end, both ListFields composed of IndexFields into the passage TextField. Two ListFields composed of LabelFields, yesno_list and followup_list, are also added. We also add a MetadataField that stores the instance’s ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_text_lists'] and metadata['token_offsets'].

Parameters
tokenizer : Tokenizer, optional (default=WordTokenizer())

We use this Tokenizer for both the question and the passage. See Tokenizer. Default is WordTokenizer().

token_indexers : Dict[str, TokenIndexer], optional

We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

num_context_answers : int, optional

How many previous question-answer pairs to consider as context.
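As a brief, hedged example of that setting, the reader below marks the two previous gold answers in the passage for each question; the file path is hypothetical.

    from allennlp.data.dataset_readers.reading_comprehension.quac import QuACReader

    # Consider the two previous question-answer pairs as context.
    reader = QuACReader(num_context_answers=2)
    instances = reader.read("/path/to/quac_train.json")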

text_to_instance(self, question_text_list: List[str], passage_text: str, start_span_list: List[List[int]] = None, end_span_list: List[List[int]] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None, yesno_list: List[int] = None, followup_list: List[int] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.

class allennlp.data.dataset_readers.reading_comprehension.qangaroo.QangarooReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a JSON-formatted Qangaroo file and returns a Dataset where the Instances have six fields: candidates, a ListField[TextField]; query, a TextField; supports, a ListField[TextField]; answer, a TextField; and answer_index, an IndexField. We also add a MetadataField that stores the instance’s ID and annotations if they are present.

Parameters
tokenizer : Tokenizer, optional (default=WordTokenizer())

We use this Tokenizer for both the question and the passage. See Tokenizer. Default is WordTokenizer().

token_indexers : Dict[str, TokenIndexer], optional

We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

text_to_instance(self, candidates: List[str], query: str, supports: List[str], _id: str = None, answer: str = None, annotations: List[List[str]] = None) → allennlp.data.instance.Instance[source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.
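A hedged sketch of building a single instance directly from raw strings follows; the candidates, query, support passage, and answer are invented for illustration and do not come from the Qangaroo data.

    from allennlp.data.dataset_readers.reading_comprehension.qangaroo import QangarooReader

    reader = QangarooReader()
    instance = reader.text_to_instance(
        candidates=["paris", "london", "berlin"],
        query="located_in eiffel tower",
        supports=["The Eiffel Tower is a wrought-iron lattice tower in Paris, France."],
        answer="paris",
    )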

Utilities for reading comprehension dataset readers.

allennlp.data.dataset_readers.reading_comprehension.util.char_span_to_token_span(token_offsets: List[Tuple[int, int]], character_span: Tuple[int, int]) → Tuple[Tuple[int, int], bool][source]

Converts a character span from a passage into the corresponding token span in the tokenized version of the passage. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we’ll do our best, but the behavior is officially undefined. We return an error flag in this case, and have some debug logging so you can figure out the cause of this issue (in SQuAD, these are mostly either tokenization problems or annotation problems; there’s a fair amount of both).

The basic outline of this method is to find the token span that has the same offsets as the input character span. If the tokenizer tokenized the passage correctly and has matching offsets, this is easy. We try to be a little smart about cases where they don’t match exactly, but mostly just find the closest thing we can.

The returned (begin, end) indices are inclusive for both begin and end. So, for example, (2, 2) is the one word span beginning at token index 2, (3, 4) is the two-word span beginning at token index 3, and so on.

Returns
token_span : Tuple[int, int]

Inclusive span start and end token indices that match as closely as possible to the input character spans.

error : bool

Whether there was a problem matching the token span to the input character span. If this is True, the returned token span does not match the input character span exactly, indicating an error in either the tokenization or the annotated character span.
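As a hedged illustration of the inclusive-index convention described above, the snippet below uses hand-written, whitespace-style token offsets; the expected result in the comment is illustrative rather than a guaranteed output.

    from allennlp.data.dataset_readers.reading_comprehension import util

    # Token offsets for "The cat sat on the mat." under a whitespace-style tokenization.
    token_offsets = [(0, 3), (4, 7), (8, 11), (12, 14), (15, 18), (19, 22), (22, 23)]
    # Character span covering "the mat" (characters 15 through 22).
    (start, end), error = util.char_span_to_token_span(token_offsets, (15, 22))
    # Expected: start == 4, end == 5 (inclusive token indices), error == False.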

allennlp.data.dataset_readers.reading_comprehension.util.find_valid_answer_spans(passage_tokens: List[allennlp.data.tokenizers.token.Token], answer_texts: List[str]) → List[Tuple[int, int]][source]

Finds a list of token spans in passage_tokens that match the given answer_texts. This tries to find all spans that would evaluate to correct given the SQuAD and TriviaQA official evaluation scripts, which do some normalization of the input text.

Note that this could return duplicate spans! The caller is expected to be able to handle possible duplicates (as already happens in the SQuAD dev set, for instance).

allennlp.data.dataset_readers.reading_comprehension.util.handle_cannot(reference_answers: List[str])[source]

Process a list of reference answers. If half or more of the reference answers are “CANNOTANSWER”, take that as the gold answer. Otherwise, return the answers that are not “CANNOTANSWER”.
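A small, hedged illustration of that rule with invented reference answers:

    from allennlp.data.dataset_readers.reading_comprehension import util

    # Half or more of the references say the question cannot be answered.
    util.handle_cannot(["CANNOTANSWER", "CANNOTANSWER", "a dog"])   # -> ["CANNOTANSWER"]
    # The CANNOTANSWER votes are in the minority, so they are dropped.
    util.handle_cannot(["a dog", "the dog", "CANNOTANSWER"])        # -> ["a dog", "the dog"]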

allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance[source]

Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if both token_spans and answer_texts are given, the Instance has span_start and span_end fields, which are both IndexFields.

Parameters
question_tokens : List[Token]

An already-tokenized question.

passage_tokens : List[Token]

An already-tokenized passage that contains the answer to the given question.

token_indexers : Dict[str, TokenIndexer]

Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.

passage_text : str

The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.

token_spans : List[Tuple[int, int]], optional

Indices into passage_tokens to use as the answer to the question for training. This is a list because there might be several possible correct answer spans in the passage. Currently, we just select the most frequent span in this list (i.e., SQuAD has multiple annotations on the dev set; this will select the span that the most annotators gave as correct).

answer_texts : List[str], optional

All valid answer strings for the given question. In SQuAD, e.g., the training set has exactly one answer per question, but the dev and test sets have several. TriviaQA has many possible answers, which are the aliases for the known correct entity. This is put into the metadata for use with official evaluation scripts, but not used anywhere else.

additional_metadata : Dict[str, Any], optional

The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
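Here is a minimal sketch of assembling an instance from pre-tokenized text using the parameters documented above; the token span points at "mat" and is illustrative, since the exact index depends on the tokenizer.

    from allennlp.data.dataset_readers.reading_comprehension import util
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import WordTokenizer

    tokenizer = WordTokenizer()
    passage_text = "The cat sat on the mat."
    passage_tokens = tokenizer.tokenize(passage_text)
    question_tokens = tokenizer.tokenize("Where did the cat sit?")

    instance = util.make_reading_comprehension_instance(
        question_tokens,
        passage_tokens,
        {"tokens": SingleIdTokenIndexer()},
        passage_text,
        token_spans=[(5, 5)],   # inclusive token indices of "mat" under this tokenization
        answer_texts=["mat"],
    )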

allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance_quac(question_list_tokens: List[List[allennlp.data.tokenizers.token.Token]], passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_span_lists: List[List[Tuple[int, int]]] = None, yesno_list: List[int] = None, followup_list: List[int] = None, additional_metadata: Dict[str, Any] = None, num_context_answers: int = 0) → allennlp.data.instance.Instance[source]

Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if token_span_lists is given, the Instance has span_start and span_end fields, which are both ListFields of IndexFields.

Parameters
question_list_tokens : List[List[Token]]

An already-tokenized list of questions. Each dialog can have multiple questions.

passage_tokens : List[Token]

An already-tokenized passage that contains the answer to the given question.

token_indexers : Dict[str, TokenIndexer]

Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.

passage_text : str

The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.

token_span_lists : List[List[Tuple[int, int]]], optional

Indices into passage_tokens to use as the answer to the question for training. This is a list of lists: first because there are multiple questions per dialog, and second because there might be several possible correct answer spans in the passage. Currently, we just select the last span in this list (i.e., QuAC has multiple annotations on the dev set; this will select the last span, which was given by the original annotator).

yesno_list : List[int]

List of the affirmation bit for each question-answer pair.

followup_list : List[int]

List of the continuation bit for each question-answer pair.

num_context_answers : int, optional

How many answers to encode into the passage.

additional_metadata : Dict[str, Any], optional

The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.

allennlp.data.dataset_readers.reading_comprehension.util.normalize_text(text: str) → str[source]

Performs a normalization that is very similar to that done by the normalization functions in SQuAD and TriviaQA.

This involves splitting and rejoining the text, and could be a somewhat expensive operation.
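A brief, hedged example of the effect; the exact output depends on the normalization rules, so the comment is illustrative.

    from allennlp.data.dataset_readers.reading_comprehension import util

    util.normalize_text("The  Mat!")
    # -> "mat": lowercased, the leading article dropped, trailing punctuation stripped,
    #    and whitespace collapsed, mirroring SQuAD/TriviaQA-style normalization.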

allennlp.data.dataset_readers.reading_comprehension.util.split_token_by_delimiter(token: allennlp.data.tokenizers.token.Token, delimiter: str) → List[allennlp.data.tokenizers.token.Token][source]
allennlp.data.dataset_readers.reading_comprehension.util.split_tokens_by_hyphen(tokens: List[allennlp.data.tokenizers.token.Token]) → List[allennlp.data.tokenizers.token.Token][source]