utils
allennlp_models.rc.dataset_readers.utils
Utilities for reading comprehension dataset readers.
IGNORED_TOKENS#
IGNORED_TOKENS = {"a", "an", "the"}
STRIPPED_CHARACTERS#
STRIPPED_CHARACTERS = string.punctuation + "".join(["‘", "’", "´", "`", "_"])
normalize_text#
def normalize_text(text: str) -> str
Performs a normalization that is very similar to that done by the normalization functions in SQuAD and TriviaQA.
This involves splitting and rejoining the text, and could be a somewhat expensive operation.
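As a rough illustration of the split-and-rejoin normalization described above, here is a minimal sketch. It assumes the text is lowercased, stripped of leading and trailing STRIPPED_CHARACTERS, and filtered against IGNORED_TOKENS; the function name normalize_text_sketch is hypothetical and the library's actual implementation may differ in detail.

```python
import string

# Same constants as documented above.
IGNORED_TOKENS = {"a", "an", "the"}
STRIPPED_CHARACTERS = string.punctuation + "".join(["‘", "’", "´", "`", "_"])


def normalize_text_sketch(text: str) -> str:
    """Hypothetical sketch: lowercase, strip outer punctuation, drop articles."""
    # Assumption: punctuation is stripped only from the two ends of the string.
    stripped = text.lower().strip(STRIPPED_CHARACTERS)
    # Splitting on whitespace and rejoining also collapses repeated spaces.
    kept = [token for token in stripped.split() if token not in IGNORED_TOKENS]
    return " ".join(kept)
```

Because every call splits and rejoins the string, callers that normalize the same passage repeatedly may want to cache the result.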
char_span_to_token_span#
def char_span_to_token_span(
token_offsets: List[Optional[Tuple[int, int]]],
character_span: Tuple[int, int]
) -> Tuple[Tuple[int, int], bool]
Converts a character span from a passage into the corresponding token span in the tokenized version of the passage. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we'll do our best, but the behavior is officially undefined. We return an error flag in this case, and have some debug logging so you can figure out the cause of this issue (in SQuAD, these are mostly either tokenization problems or annotation problems; there's a fair amount of both).
The basic outline of this method is to find the token span that has the same offsets as the input character span. If the tokenizer tokenized the passage correctly and has matching offsets, this is easy. We try to be a little smart about cases where they don't match exactly, but mostly just find the closest thing we can.
The returned (begin, end) indices are inclusive for both begin and end. So, for example, (2, 2) is the one-word span beginning at token index 2, (3, 4) is the two-word span beginning at token index 3, and so on.
Returns¶
- token_span : Tuple[int, int]
  Inclusive span start and end token indices that match as closely as possible to the input character spans.
- error : bool
  Whether there was an error while matching the token spans exactly. If this is True, it means there was an error in either the tokenization or the annotated character span. If this is False, it means that we found tokens that match the character span exactly.
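To make the inclusive-index convention and the error flag concrete, here is a simplified sketch of the matching idea. It is not the library's implementation: the name char_span_to_token_span_sketch is hypothetical, and it assumes a non-empty offset list, no None offsets, and (start, end) character offsets with an exclusive end.

```python
from typing import List, Tuple


def char_span_to_token_span_sketch(
    token_offsets: List[Tuple[int, int]],
    character_span: Tuple[int, int],
) -> Tuple[Tuple[int, int], bool]:
    """Hypothetical sketch: map a character span to an inclusive token span."""
    char_start, char_end = character_span

    # Advance to the first token that ends after the span starts.
    start = 0
    while start < len(token_offsets) - 1 and token_offsets[start][1] <= char_start:
        start += 1

    # Extend while the next token still begins inside the span.
    end = start
    while end < len(token_offsets) - 1 and token_offsets[end + 1][0] < char_end:
        end += 1

    # Flag an error when the token boundaries do not line up exactly.
    error = token_offsets[start][0] != char_start or token_offsets[end][1] != char_end
    return (start, end), error


# Offsets for the passage "The quick brown fox" (assumed whitespace tokenization).
offsets = [(0, 3), (4, 9), (10, 15), (16, 19)]
exact = char_span_to_token_span_sketch(offsets, (4, 15))      # "quick brown"
misaligned = char_span_to_token_span_sketch(offsets, (5, 15))  # "uick brown"
```

In this sketch the misaligned span still resolves to the closest covering tokens, and only the error flag tells the caller that the match was inexact.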
find_valid_answer_spans#
def find_valid_answer_spans(
passage_tokens: List[Token],
answer_texts: List[str]
) -> List[Tuple[int, int]]
Finds a list of token spans in passage_tokens that match the given answer_texts. This tries to find all spans that would evaluate to correct given the SQuAD and TriviaQA official evaluation scripts, which do some normalization of the input text.
Note that this could return duplicate spans! The caller is expected to be able to handle possible duplicates (as already happens in the SQuAD dev set, for instance).
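The following sketch shows how normalized matching can produce duplicate spans. It works on plain strings rather than Token objects, and the helper names (_norm, find_spans_sketch) are hypothetical. Skipping passage tokens that normalize to the empty string (articles, punctuation) is an assumption of this sketch, not a confirmed detail of the library.

```python
import string
from typing import List, Tuple

IGNORED = {"a", "an", "the"}
STRIPPED = string.punctuation + "‘’´`_"


def _norm(text: str) -> str:
    # Assumed normalization: lowercase, strip outer punctuation, drop articles.
    return " ".join(t for t in text.lower().strip(STRIPPED).split() if t not in IGNORED)


def find_spans_sketch(passage_words: List[str], answer_texts: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical sketch: inclusive token spans whose normalized text matches an answer."""
    normalized = [_norm(word) for word in passage_words]
    spans = []
    for answer in answer_texts:
        answer_words = _norm(answer).split()
        if not answer_words:
            continue
        for start, word in enumerate(normalized):
            if word != answer_words[0]:
                continue
            end, matched = start, 1
            while matched < len(answer_words) and end + 1 < len(normalized):
                nxt = normalized[end + 1]
                if nxt == answer_words[matched]:
                    matched += 1
                    end += 1
                elif nxt == "":
                    # Assumption: tokens that normalize away are skipped over.
                    end += 1
                else:
                    break
            if matched == len(answer_words):
                spans.append((start, end))
    return spans


passage = ["The", "Eiffel", "Tower", "is", "in", "Paris", ",", "France", "."]
spans = find_spans_sketch(passage, ["Eiffel Tower", "the eiffel tower", "Paris"])
```

Here the first two answers normalize to the same string, so the same span appears twice in the result; a caller that needs unique spans can deduplicate, for example with a set.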
make_reading_comprehension_instance#
def make_reading_comprehension_instance(
question_tokens: List[Token],
passage_tokens: List[Token],
token_indexers: Dict[str, TokenIndexer],
passage_text: str,
token_spans: List[Tuple[int, int]] = None,
answer_texts: List[str] = None,
additional_metadata: Dict[str, Any] = None
) -> Instance
Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.
Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if both answer_texts and token_spans are given, the Instance has span_start and span_end fields, which are both IndexFields.
Parameters¶
- question_tokens : List[Token]
  An already-tokenized question.
- passage_tokens : List[Token]
  An already-tokenized passage that contains the answer to the given question.
- token_indexers : Dict[str, TokenIndexer]
  Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.
- passage_text : str
  The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
- token_spans : List[Tuple[int, int]], optional
  Indices into passage_tokens to use as the answer to the question for training. This is a list because there might be several possible correct answer spans in the passage. Currently, we just select the most frequent span in this list (e.g., SQuAD has multiple annotations on the dev set; this will select the span that the most annotators gave as correct).
- answer_texts : List[str], optional
  All valid answer strings for the given question. In SQuAD, e.g., the training set has exactly one answer per question, but the dev and test sets have several. TriviaQA has many possible answers, which are the aliases for the known correct entity. This is put into the metadata for use with official evaluation scripts, but not used anywhere else.
- additional_metadata : Dict[str, Any], optional
  The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
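The token_spans selection rule described above (pick the span that most annotators agreed on) can be sketched with the standard library. The function name most_frequent_span is hypothetical, and the sketch assumes a non-empty list of spans.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_span(token_spans: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Hypothetical sketch: return the span that occurs most often."""
    # most_common(1) returns a list with one (span, count) pair.
    return Counter(token_spans).most_common(1)[0][0]


# Three annotators: two agree on (3, 4), one chose a longer span.
chosen = most_frequent_span([(3, 4), (3, 5), (3, 4)])
```

One consequence of this choice is that training uses a single gold span even when several annotations exist, so span-level training accuracy need not agree with the official multi-reference evaluation metrics.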
make_reading_comprehension_instance_quac#
def make_reading_comprehension_instance_quac(
question_list_tokens: List[List[Token]],
passage_tokens: List[Token],
token_indexers: Dict[str, TokenIndexer],
passage_text: str,
token_span_lists: List[List[Tuple[int, int]]] = None,
yesno_list: List[int] = None,
followup_list: List[int] = None,
additional_metadata: Dict[str, Any] = None,
num_context_answers: int = 0
) -> Instance
Converts the questions of one dialog, a passage, and optional answers to an Instance for use in a reading comprehension model.
Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if answer spans are given (token_span_lists), the Instance has span_start and span_end fields, which are built from IndexFields.
Parameters¶
- question_list_tokens : List[List[Token]]
  An already-tokenized list of questions. Each dialog has multiple questions.
- passage_tokens : List[Token]
  An already-tokenized passage that contains the answer to the given question.
- token_indexers : Dict[str, TokenIndexer]
  Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.
- passage_text : str
  The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
- token_span_lists : List[List[Tuple[int, int]]], optional
  Indices into passage_tokens to use as the answer to the question for training. This is a list of lists: the outer list because there are multiple questions per dialog, and the inner list because there might be several possible correct answer spans in the passage. Currently, we just select the last span in each inner list (e.g., QuAC has multiple annotations on the dev set; this will select the last span, which was given by the original annotator).
- yesno_list : List[int]
  List of the affirmation bits, one for each question-answer pair.
- followup_list : List[int]
  List of the continuation bits, one for each question-answer pair.
- num_context_answers : int, optional
  How many answers to encode into the passage.
- additional_metadata : Dict[str, Any], optional
  The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
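The nested shape of token_span_lists and the "last span wins" rule described above can be illustrated with a short sketch. The function name select_last_spans is hypothetical, and the sketch assumes every inner list is non-empty.

```python
from typing import List, Tuple


def select_last_spans(
    token_span_lists: List[List[Tuple[int, int]]],
) -> List[Tuple[int, int]]:
    """Hypothetical sketch: keep the last annotated span for each question."""
    # Outer list: one entry per question in the dialog.
    # Inner list: candidate answer spans for that question.
    return [spans[-1] for spans in token_span_lists]


# Two questions in one dialog; the first has two annotated spans.
selected = select_last_spans([[(0, 1), (2, 3)], [(5, 5)]])
```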
handle_cannot#
def handle_cannot(reference_answers: List[str])
Processes a list of reference answers. If at least half of the reference answers are "CANNOTANSWER", that is taken as the gold answer. Otherwise, returns the answers that are not "CANNOTANSWER".
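A minimal sketch of this rule follows. The name handle_cannot_sketch is hypothetical, and returning a single-element list ["CANNOTANSWER"] for the "take it as gold" case is an assumption of this sketch.

```python
from typing import List

CANNOT = "CANNOTANSWER"


def handle_cannot_sketch(reference_answers: List[str]) -> List[str]:
    """Hypothetical sketch of the majority rule for unanswerable questions."""
    num_cannot = sum(1 for answer in reference_answers if answer == CANNOT)
    num_spans = len(reference_answers) - num_cannot
    # "At least half" means the CANNOTANSWER count is not below the span count.
    if num_cannot >= num_spans:
        return [CANNOT]
    return [answer for answer in reference_answers if answer != CANNOT]


tie = handle_cannot_sketch(["CANNOTANSWER", "in Paris"])
minority = handle_cannot_sketch(["CANNOTANSWER", "in Paris", "Paris"])
```

Note that a tie counts as unanswerable under this reading of "at least half".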
split_token_by_delimiter#
def split_token_by_delimiter(
token: Token,
delimiter: str
) -> List[Token]
split_tokens_by_hyphen#
def split_tokens_by_hyphen(tokens: List[Token]) -> List[Token]