allennlp.data.dataset_readers.reading_comprehension
Reading comprehension is loosely defined as follows: given a question and a passage of text that contains the answer, answer the question.
These submodules contain readers for things that are predominantly reading comprehension datasets.
class allennlp.data.dataset_readers.reading_comprehension.drop.DropReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, passage_length_limit: int = None, question_length_limit: int = None, skip_when_all_empty: List[str] = None, instance_format: str = 'drop', relaxed_span_match_for_finding_labels: bool = True)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a JSON-formatted DROP dataset file and returns instances in a few different possible formats. The input format is complicated; see the test fixture for an example of what it looks like. The output formats all contain a question TextField, a passage TextField, and some kind of answer representation. Because DROP has instances with several different kinds of answers, this dataset reader allows you to filter out questions that do not have answers of a particular type (e.g., remove questions that have numbers as answers, if your model can only give passage spans as answers). We typically return all possible ways of arriving at a given answer string, and expect models to marginalize over these possibilities.

Parameters
- tokenizer : Tokenizer, optional (default=WordTokenizer())
  We use this Tokenizer for both the question and the passage. See Tokenizer.
- token_indexers : Dict[str, TokenIndexer], optional
  We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
- lazy : bool, optional (default=False)
  If this is true, instances() will return an object whose __iter__ method reloads the dataset each time it's called. Otherwise, instances() returns a list.
- passage_length_limit : int, optional (default=None)
  If specified, we will cut the passage if its length exceeds this limit.
- question_length_limit : int, optional (default=None)
  If specified, we will cut the question if its length exceeds this limit.
- skip_when_all_empty : List[str], optional (default=None)
  In some cases, such as preparing training examples, you may want to skip examples that have no gold labels. You can specify the conditions under which examples should be skipped. Currently, you can put "passage_span", "question_span", "addition_subtraction", or "counting" in this list to tell the reader to skip an example when no such label is found. If not specified, we keep all the examples.
- instance_format : str, optional (default="drop")
  We try to be generous in providing a few different formats for the instances in DROP, in terms of the Fields that we return for each Instance, to allow for several different kinds of models. The "drop" format does processing to detect numbers and the various ways those numbers can be arrived at from the passage, and returns Fields related to that. The "bert" format only allows passage spans as answers, and provides a "question_and_passage" field with the two pieces of text joined as BERT expects. The "squad" format provides the same fields that our BiDAF and other SQuAD models expect.
- relaxed_span_match_for_finding_labels : bool, optional (default=True)
  The DROP dataset contains multi-span answers, and the date-type answers are usually hard to find exact span matches for as well. In order to use as many examples as possible to train the model, we may not want a strict match for such cases when finding the gold span labels. If this argument is true, we will treat every span in a multi-span answer as correct, and every token in a date answer as correct, too. Because models trained on DROP typically marginalize over all possible answer positions, this is just being a little more generous in what is marginalized. Note that this will not affect evaluation.
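As a rough usage sketch, the reader can be constructed directly and pointed at a DROP JSON file; the file path and the particular option values below are illustrative, not canonical:

    from allennlp.data.dataset_readers.reading_comprehension.drop import DropReader

    reader = DropReader(
        passage_length_limit=400,
        question_length_limit=50,
        skip_when_all_empty=["passage_span", "question_span"],  # drop unlabeled training examples
        instance_format="drop",
    )
    # read() is the standard DatasetReader entry point; the path is hypothetical.
    instances = reader.read("drop_dataset_train.json")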
static convert_word_to_number(word: str, try_to_include_more_numbers=False)

Currently we only support limited types of conversion.
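A small illustrative call; the exact set of supported conversions is limited, so the behavior described in the comments is an assumption rather than a guarantee:

    # Plain digit strings are the simplest case and should come back as ints.
    DropReader.convert_word_to_number("35")
    # With try_to_include_more_numbers=True the method attempts a broader set of
    # conversions (e.g., number words); presumably it returns None when it cannot convert.
    DropReader.convert_word_to_number("seven", try_to_include_more_numbers=True)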
static extract_answer_info_from_annotation(answer_annotation: Dict[str, Any]) → Tuple[str, List[str]]
static find_valid_add_sub_expressions(numbers: List[int], targets: List[int], max_number_of_numbers_to_consider: int = 2) → List[List[int]]
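To make the idea concrete, here is a standalone sketch of what this method computes. It is not the library's implementation, and the actual label encoding returned by DropReader may differ; the sketch assumes a 0/1/2 encoding for unused/added/subtracted numbers:

    from itertools import combinations, product
    from typing import List

    def sketch_add_sub_expressions(numbers: List[int], targets: List[int],
                                   max_terms: int = 2) -> List[List[int]]:
        """Enumerate sign assignments over passage numbers whose sum equals a target."""
        valid = []
        for n_terms in range(2, max_terms + 1):
            for indices in combinations(range(len(numbers)), n_terms):
                for signs in product((1, -1), repeat=n_terms):
                    total = sum(sign * numbers[i] for sign, i in zip(signs, indices))
                    if total in targets:
                        labels = [0] * len(numbers)  # 0 = number not used
                        for sign, i in zip(signs, indices):
                            labels[i] = 1 if sign > 0 else 2  # 1 = added, 2 = subtracted
                        valid.append(labels)
        return valid

    # sketch_add_sub_expressions([2, 5, 7], [3]) -> [[2, 1, 0]], i.e. 5 - 2 = 3.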
static find_valid_spans(passage_tokens: List[allennlp.data.tokenizers.token.Token], answer_texts: List[str]) → List[Tuple[int, int]]
static make_bert_drop_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], question_concat_passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, answer_info: Dict[str, Any] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance
static make_marginal_drop_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], number_tokens: List[allennlp.data.tokenizers.token.Token], number_indices: List[int], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, answer_info: Dict[str, Any] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance
text_to_instance(self, question_text: str, passage_text: str, question_id: str = None, passage_id: str = None, answer_annotations: List[Dict] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → Union[allennlp.data.instance.Instance, NoneType]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
class allennlp.data.dataset_readers.reading_comprehension.squad.SquadReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, passage_length_limit: int = None, question_length_limit: int = None, skip_invalid_examples: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads a JSON-formatted SQuAD file and returns a Dataset where the Instances have four fields: question, a TextField; passage, another TextField; and span_start and span_end, both IndexFields into the passage TextField. We also add a MetadataField that stores the instance's ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_texts'] and metadata['token_offsets']. This is so that we can more easily use the official SQuAD evaluation script to get metrics.

We also support limiting the maximum length for both passage and question. However, some gold answer spans may exceed the maximum passage length, which will cause errors when making instances. We simply skip these spans to avoid errors. If all of the gold answer spans of an example are skipped, we skip that example during training. During validation or testing, since we cannot skip examples, we instead use the last token as a pseudo gold answer span. The computed loss will not be accurate as a result, but this does not affect answer evaluation, because we keep all the original gold answer texts.

Parameters
- tokenizer : Tokenizer, optional (default=WordTokenizer())
  We use this Tokenizer for both the question and the passage. See Tokenizer.
- token_indexers : Dict[str, TokenIndexer], optional
  We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
- lazy : bool, optional (default=False)
  If this is true, instances() will return an object whose __iter__ method reloads the dataset each time it's called. Otherwise, instances() returns a list.
- passage_length_limit : int, optional (default=None)
  If specified, we will cut the passage if its length exceeds this limit.
- question_length_limit : int, optional (default=None)
  If specified, we will cut the question if its length exceeds this limit.
- skip_invalid_examples : bool, optional (default=False)
  If this is true, we will skip the invalid examples described above (training examples whose gold answer spans all exceed the maximum passage length).
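As a rough usage sketch (the option values and the file path are illustrative, not canonical):

    from allennlp.data.dataset_readers.reading_comprehension.squad import SquadReader

    reader = SquadReader(
        passage_length_limit=400,
        question_length_limit=50,
        skip_invalid_examples=True,  # only sensible for training data
    )
    instances = reader.read("squad-train-v1.1.json")  # hypothetical path to a SQuAD file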
text_to_instance(self, question_text: str, passage_text: str, char_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → Union[allennlp.data.instance.Instance, NoneType]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
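For instance, at prediction time you might call this directly with raw strings; this is a minimal sketch in which the answer-related arguments are simply omitted:

    reader = SquadReader()
    instance = reader.text_to_instance(
        question_text="Who wrote the paper?",
        passage_text="The paper was written by the authors of the dataset.",
    )
    # instance now holds the tokenized question and passage (plus metadata), ready to be
    # indexed and fed to a model, e.g. via a Predictor.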
class allennlp.data.dataset_readers.reading_comprehension.triviaqa.TriviaQaReader(base_tarball_path: str, unfiltered_tarball_path: str = None, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads the TriviaQA dataset into a Dataset containing Instances with four fields: question (a TextField), passage (another TextField), span_start, and span_end (both IndexFields).

TriviaQA is split up into several JSON files defining the questions, and a lot of text files containing crawled web documents. We read these from a gzipped tarball, to avoid needing millions of individual files on a filesystem.

Because we need to read both train and validation files from the same tarball, we take the tarball itself as a constructor parameter, and take the question file as the argument to read. This means that you should give the path to the tarball in the dataset_reader parameters in your experiment configuration file, and something like "wikipedia-train.json" for the train_data_path and validation_data_path (a short usage sketch follows the parameter list below).

Parameters
- base_tarball_path : str
  This is the path to the main tar.gz file you can download from the TriviaQA website, with directories evidence and qa.
- unfiltered_tarball_path : str, optional
  This is the path to the "unfiltered" TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.
- tokenizer : Tokenizer, optional
  We'll use this tokenizer on questions and evidence passages, defaulting to WordTokenizer if none is provided.
- token_indexers : Dict[str, TokenIndexer], optional
  Determines how both the question and the evidence passages are represented as arrays. See TokenIndexer. Default is to have a single word ID for every token.
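To make the constructor/read split concrete, here is a minimal, illustrative sketch; the local tarball path is hypothetical, and "wikipedia-train.json" is the question file named above:

    from allennlp.data.dataset_readers.reading_comprehension.triviaqa import TriviaQaReader

    # The tarball location is a constructor argument...
    reader = TriviaQaReader(base_tarball_path="/data/triviaqa/triviaqa-rc.tar.gz")
    # ...while the per-split question file inside the tarball is what you pass to read().
    # In a configuration file, the same file name would go in train_data_path
    # (or validation_data_path, e.g. "wikipedia-dev.json").
    train_instances = reader.read("wikipedia-train.json")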
pick_paragraphs(self, evidence_files: List[List[str]], question: str = None, answer_texts: List[str] = None) → List[str]

Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.

To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that's likely cheating, depending on how you've defined the task.
text_to_instance(self, question_text: str, passage_text: str, token_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, question_tokens: List[allennlp.data.tokenizers.token.Token] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
class allennlp.data.dataset_readers.reading_comprehension.quac.QuACReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False, num_context_answers: int = 0)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads a JSON-formatted Question Answering in Context (QuAC) data file and returns a Dataset where the Instances have four fields: question, a ListField; passage, another TextField; and span_start and span_end, both ListFields composed of IndexFields into the passage TextField. Two ListFields composed of LabelFields, yesno_list and followup_list, are also added. We also add a MetadataField that stores the instance's ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_text_lists'] and metadata['token_offsets'].

Parameters
- tokenizer : Tokenizer, optional (default=WordTokenizer())
  We use this Tokenizer for both the question and the passage. See Tokenizer.
- token_indexers : Dict[str, TokenIndexer], optional
  We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
- num_context_answers : int, optional
  How many previous question answers to consider in a context.
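A rough construction sketch; the value of num_context_answers and the file path are just examples:

    from allennlp.data.dataset_readers.reading_comprehension.quac import QuACReader

    # Consider the two previous answers of the dialog as context for each question.
    reader = QuACReader(num_context_answers=2)
    instances = reader.read("quac_train.json")  # hypothetical path to a QuAC data file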
text_to_instance(self, question_text_list: List[str], passage_text: str, start_span_list: List[List[int]] = None, end_span_list: List[List[int]] = None, passage_tokens: List[allennlp.data.tokenizers.token.Token] = None, yesno_list: List[int] = None, followup_list: List[int] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
class allennlp.data.dataset_readers.reading_comprehension.qangaroo.QangarooReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader
Reads a JSON-formatted Qangaroo file and returns a Dataset where the Instances have six fields: candidates, a ListField[TextField]; query, a TextField; supports, a ListField[TextField]; answer, a TextField; and answer_index, an IndexField. We also add a MetadataField that stores the instance's ID and annotations if they are present.

Parameters
- tokenizer : Tokenizer, optional (default=WordTokenizer())
  We use this Tokenizer for both the question and the passage. See Tokenizer.
- token_indexers : Dict[str, TokenIndexer], optional
  We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
text_to_instance(self, candidates: List[str], query: str, supports: List[str], _id: str = None, answer: str = None, annotations: List[List[str]] = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it's using, in order to pass it the right information.
Utilities for reading comprehension dataset readers.
allennlp.data.dataset_readers.reading_comprehension.util.char_span_to_token_span(token_offsets: List[Tuple[int, int]], character_span: Tuple[int, int]) → Tuple[Tuple[int, int], bool]

Converts a character span from a passage into the corresponding token span in the tokenized version of the passage. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we'll do our best, but the behavior is officially undefined. We return an error flag in this case, and have some debug logging so you can figure out the cause of this issue (in SQuAD, these are mostly either tokenization problems or annotation problems; there's a fair amount of both).

The basic outline of this method is to find the token span that has the same offsets as the input character span. If the tokenizer tokenized the passage correctly and has matching offsets, this is easy. We try to be a little smart about cases where they don't match exactly, but mostly just find the closest thing we can.

The returned (begin, end) indices are inclusive for both begin and end. So, for example, (2, 2) is the one-word span beginning at token index 2, (3, 4) is the two-word span beginning at token index 3, and so on.

Returns
- token_span : Tuple[int, int]
  Inclusive span start and end token indices that match as closely as possible to the input character span.
- error : bool
  Whether there was a problem matching the token span to the input character span exactly. If this is True, it means there was an error in either the tokenization or the annotated character span.
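A small worked example; the passage, token offsets, and span below are made up for illustration:

    from allennlp.data.dataset_readers.reading_comprehension import util

    passage_text = "The cat sat on the mat."
    # (start, end) character offsets of each token in passage_text.
    token_offsets = [(0, 3), (4, 7), (8, 11), (12, 14), (15, 18), (19, 22), (22, 23)]
    # The character span (15, 22) covers "the mat"; it maps to tokens 4..5, inclusive.
    (span_start, span_end), error = util.char_span_to_token_span(token_offsets, (15, 22))
    # span_start == 4, span_end == 5, and error is False because the offsets line up exactly.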
allennlp.data.dataset_readers.reading_comprehension.util.find_valid_answer_spans(passage_tokens: List[allennlp.data.tokenizers.token.Token], answer_texts: List[str]) → List[Tuple[int, int]]

Finds a list of token spans in passage_tokens that match the given answer_texts. This tries to find all spans that would evaluate to correct given the SQuAD and TriviaQA official evaluation scripts, which do some normalization of the input text.

Note that this could return duplicate spans! The caller is expected to be able to handle possible duplicates (as already happens in the SQuAD dev set, for instance).
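A quick sketch of how this is typically used; the spans in the final comment are what this example is expected to produce, not a guarantee:

    from allennlp.data.dataset_readers.reading_comprehension import util
    from allennlp.data.tokenizers import WordTokenizer

    passage_tokens = WordTokenizer().tokenize("The cat sat on the mat near the cat door.")
    # Spans are inclusive token indices; "the cat" occurs twice, so two spans come back.
    spans = util.find_valid_answer_spans(passage_tokens, ["the cat"])
    # e.g. [(0, 1), (7, 8)] after SQuAD-style lowercasing/normalization.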
allennlp.data.dataset_readers.reading_comprehension.util.handle_cannot(reference_answers: List[str])

Process a list of reference answers. If half or more of the reference answers are "CANNOTANSWER", take that as the gold answer. Otherwise, return the answers that are not "CANNOTANSWER".
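For example, illustrating the majority rule just described (expected return values shown in comments):

    from allennlp.data.dataset_readers.reading_comprehension import util

    # Two of three references are CANNOTANSWER, so that becomes the gold answer.
    util.handle_cannot(["CANNOTANSWER", "CANNOTANSWER", "a span"])    # ["CANNOTANSWER"]
    # Only one of three is CANNOTANSWER, so it is dropped and the real answers are kept.
    util.handle_cannot(["a span", "CANNOTANSWER", "another span"])    # ["a span", "another span"]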
allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance(question_tokens: List[allennlp.data.tokenizers.token.Token], passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_spans: List[Tuple[int, int]] = None, answer_texts: List[str] = None, additional_metadata: Dict[str, Any] = None) → allennlp.data.instance.Instance

Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if both answer_texts and char_span_starts are given, the Instance has span_start and span_end fields, which are both IndexFields.

Parameters
- question_tokens : List[Token]
  An already-tokenized question.
- passage_tokens : List[Token]
  An already-tokenized passage that contains the answer to the given question.
- token_indexers : Dict[str, TokenIndexer]
  Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.
- passage_text : str
  The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
- token_spans : List[Tuple[int, int]], optional
  Indices into passage_tokens to use as the answer to the question for training. This is a list because there might be several possible correct answer spans in the passage. Currently, we just select the most frequent span in this list (i.e., SQuAD has multiple annotations on the dev set; this will select the span that the most annotators gave as correct).
- answer_texts : List[str], optional
  All valid answer strings for the given question. In SQuAD, e.g., the training set has exactly one answer per question, but the dev and test sets have several. TriviaQA has many possible answers, which are the aliases for the known correct entity. This is put into the metadata for use with official evaluation scripts, but not used anywhere else.
- additional_metadata : Dict[str, Any], optional
  The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
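A rough sketch of how these pieces fit together when building a single training instance; the tokenizer, the indexer choice, and the example spans are illustrative:

    from allennlp.data.dataset_readers.reading_comprehension import util
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import WordTokenizer

    tokenizer = WordTokenizer()
    passage_text = "The cat sat on the mat."
    instance = util.make_reading_comprehension_instance(
        question_tokens=tokenizer.tokenize("Where did the cat sit?"),
        passage_tokens=tokenizer.tokenize(passage_text),
        token_indexers={"tokens": SingleIdTokenIndexer()},
        passage_text=passage_text,
        token_spans=[(4, 5)],        # inclusive token indices of "the mat"
        answer_texts=["the mat"],
    )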
allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance_quac(question_list_tokens: List[List[allennlp.data.tokenizers.token.Token]], passage_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_span_lists: List[List[Tuple[int, int]]] = None, yesno_list: List[int] = None, followup_list: List[int] = None, additional_metadata: Dict[str, Any] = None, num_context_answers: int = 0) → allennlp.data.instance.Instance

Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if both answer_texts and char_span_starts are given, the Instance has span_start and span_end fields, which are both IndexFields.

Parameters
- question_list_tokens : List[List[Token]]
  An already-tokenized list of questions. Each dialog has multiple questions.
- passage_tokens : List[Token]
  An already-tokenized passage that contains the answer to the given question.
- token_indexers : Dict[str, TokenIndexer]
  Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.
- passage_text : str
  The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
- token_span_lists : List[List[Tuple[int, int]]], optional
  Indices into passage_tokens to use as the answer to the question for training. This is a list of lists, first because there are multiple questions per dialog, and second because there might be several possible correct answer spans in the passage. Currently, we just select the last span in this list (i.e., QuAC has multiple annotations on the dev set; this will select the last span, which was given by the original annotator).
- yesno_list : List[int]
  List of the affirmation bit for each question-answer pair.
- followup_list : List[int]
  List of the continuation bit for each question-answer pair.
- num_context_answers : int, optional
  How many answers to encode into the passage.
- additional_metadata : Dict[str, Any], optional
  The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
allennlp.data.dataset_readers.reading_comprehension.util.normalize_text(text: str) → str

Performs a normalization that is very similar to that done by the normalization functions in SQuAD and TriviaQA.

This involves splitting and rejoining the text, and could be a somewhat expensive operation.
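The SQuAD-style normalization this refers to roughly amounts to lowercasing, dropping punctuation and articles, and collapsing whitespace. The following is a standalone approximation of that idea, not this function's exact implementation:

    import re
    import string

    def approx_normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)  # drop punctuation
        text = re.sub(r"\b(a|an|the)\b", " ", text)                        # drop articles
        return " ".join(text.split())                                      # collapse whitespace

    approx_normalize("The  Cat, sat!")  # -> "cat sat"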