utils

allennlp_models.rc.dataset_readers.utils

Utilities for reading comprehension dataset readers.

IGNORED_TOKENS#

IGNORED_TOKENS = {"a", "an", "the"}

STRIPPED_CHARACTERS#

STRIPPED_CHARACTERS = string.punctuation + "".join(["‘", "’", "´", "`", "_"])

normalize_text#

def normalize_text(text: str) -> str

Performs a normalization very similar to the one applied by the official SQuAD and TriviaQA evaluation scripts.

This involves splitting and rejoining the text, and could be a somewhat expensive operation.
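As a rough illustration, the normalization can be sketched with the two constants documented above: lowercase the text, strip STRIPPED_CHARACTERS from both ends, split on whitespace, and drop any IGNORED_TOKENS. The name sketch_normalize_text is hypothetical, and the library function may differ in detail.

```python
import string

IGNORED_TOKENS = {"a", "an", "the"}
STRIPPED_CHARACTERS = string.punctuation + "".join(["‘", "’", "´", "`", "_"])


def sketch_normalize_text(text: str) -> str:
    # Hypothetical sketch, not the library implementation.
    # Lowercase, strip punctuation-like characters from both ends, split on whitespace.
    tokens = text.lower().strip(STRIPPED_CHARACTERS).split()
    # Rejoin while dropping articles.
    return " ".join(token for token in tokens if token not in IGNORED_TOKENS)


print(sketch_normalize_text("The Eiffel Tower!"))  # prints "eiffel tower"
```

The split-and-rejoin step is what makes this relatively costly when it is applied to every token of a long passage.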

char_span_to_token_span#

def char_span_to_token_span(
    token_offsets: List[Optional[Tuple[int, int]]],
    character_span: Tuple[int, int]
) -> Tuple[Tuple[int, int], bool]

Converts a character span from a passage into the corresponding token span in the tokenized version of the passage. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we'll do our best, but the behavior is officially undefined. We return an error flag in this case, and have some debug logging so you can figure out the cause of this issue (in SQuAD, these are mostly either tokenization problems or annotation problems; there's a fair amount of both).

The basic outline of this method is to find the token span that has the same offsets as the input character span. If the tokenizer tokenized the passage correctly and has matching offsets, this is easy. We try to be a little smart about cases where they don't match exactly, but mostly just find the closest thing we can.

The returned (begin, end) indices are inclusive for both begin and end. So, for example, (2, 2) is the one word span beginning at token index 2, (3, 4) is the two-word span beginning at token index 3, and so on.

Returns¶

token_span : Tuple[int, int]
Inclusive span start and end token indices that match as closely as possible to the input character span.
error : bool
Whether there was an error while matching the token spans exactly. If this is True, it means there was an error in either the tokenization or the annotated character span. If this is False, it means that we found tokens that match the character span exactly.
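The inclusive-index convention and the error flag can be illustrated with a simplified sketch. It assumes every token offset is a non-None (start, end) pair with an exclusive end, and that character_span is also (start, exclusive end); it takes every token overlapping the span, which is a simpler heuristic than the closest-match logic described above. The name sketch_char_span_to_token_span is hypothetical.

```python
from typing import List, Tuple


def sketch_char_span_to_token_span(
    token_offsets: List[Tuple[int, int]],
    character_span: Tuple[int, int],
) -> Tuple[Tuple[int, int], bool]:
    # Hypothetical sketch: assumes (start, exclusive end) offsets and no None entries.
    char_start, char_end = character_span
    # Collect every token that overlaps the character span at all.
    overlapping = [
        index
        for index, (token_start, token_end) in enumerate(token_offsets)
        if token_start < char_end and token_end > char_start
    ]
    if not overlapping:
        raise ValueError("character span does not overlap any token")
    begin, end = overlapping[0], overlapping[-1]  # both indices are inclusive
    # Flag an error when the span boundaries do not line up with token boundaries.
    error = token_offsets[begin][0] != char_start or token_offsets[end][1] != char_end
    return (begin, end), error


# Offsets for the passage "The quick brown fox".
offsets = [(0, 3), (4, 9), (10, 15), (16, 19)]
print(sketch_char_span_to_token_span(offsets, (4, 15)))  # prints ((1, 2), False)
print(sketch_char_span_to_token_span(offsets, (6, 15)))  # prints ((1, 2), True)
```

In the second call the character span starts in the middle of "quick", so the sketch still returns the enclosing tokens but sets the error flag.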

find_valid_answer_spans#

def find_valid_answer_spans(
    passage_tokens: List[Token],
    answer_texts: List[str]
) -> List[Tuple[int, int]]

Finds a list of token spans in passage_tokens that match the given answer_texts. This tries to find all spans that would evaluate to correct given the SQuAD and TriviaQA official evaluation scripts, which do some normalization of the input text.

Note that this could return duplicate spans! The caller is expected to be able to handle possible duplicates (as already happens in the SQuAD dev set, for instance).
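A minimal sketch of this kind of matching, operating on plain strings rather than Token objects, is shown here. It normalizes each passage word and each answer, then scans for matching runs while skipping words that normalize to the empty string (articles and bare punctuation). All names are hypothetical, and the library's matching rules may differ.

```python
import string
from typing import List, Tuple

IGNORED = {"a", "an", "the"}
STRIPPED = string.punctuation + "‘’´`_"


def _norm(text: str) -> str:
    # Same idea as the normalization described for normalize_text.
    return " ".join(t for t in text.lower().strip(STRIPPED).split() if t not in IGNORED)


def sketch_find_answer_spans(
    passage_words: List[str], answer_texts: List[str]
) -> List[Tuple[int, int]]:
    # Hypothetical sketch: works on strings, returns inclusive (start, end) indices.
    normalized = [_norm(word) for word in passage_words]
    spans = []
    for answer in answer_texts:
        answer_words = _norm(answer).split()
        if not answer_words:
            continue
        for start, word in enumerate(normalized):
            if word != answer_words[0]:
                continue
            end, matched = start, 1
            while matched < len(answer_words) and end + 1 < len(normalized):
                next_word = normalized[end + 1]
                if next_word == answer_words[matched]:
                    matched += 1
                    end += 1
                elif next_word == "":  # article or bare punctuation: skip over it
                    end += 1
                else:
                    break
            if matched == len(answer_words):
                spans.append((start, end))
    return spans


passage = ["Paris", "is", "the", "capital", "of", "France", "."]
print(sketch_find_answer_spans(passage, ["Paris", "paris."]))  # prints [(0, 0), (0, 0)]
```

The two answer strings normalize to the same text, so the same span is returned twice, which is the duplicate case the caller has to handle.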

make_reading_comprehension_instance#

def make_reading_comprehension_instance(
    question_tokens: List[Token],
    passage_tokens: List[Token],
    token_indexers: Dict[str, TokenIndexer],
    passage_text: str,
    token_spans: List[Tuple[int, int]] = None,
    answer_texts: List[str] = None,
    additional_metadata: Dict[str, Any] = None
) -> Instance

Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if token_spans is given, the Instance has span_start and span_end fields, which are both IndexFields.

Parameters¶

question_tokens : List[Token]
An already-tokenized question.
passage_tokens : List[Token]
An already-tokenized passage that contains the answer to the given question.
token_indexers : Dict[str, TokenIndexer]
Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.
passage_text : str
The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
token_spans : List[Tuple[int, int]], optional
Indices into passage_tokens to use as the answer to the question for training. This is a list because there might be several possible correct answer spans in the passage. Currently, we just select the most frequent span in this list (e.g., SQuAD has multiple annotations on the dev set; this will select the span that the most annotators gave as correct).
answer_texts : List[str], optional
All valid answer strings for the given question. In SQuAD, e.g., the training set has exactly one answer per question, but the dev and test sets have several. TriviaQA has many possible answers, which are the aliases for the known correct entity. This is put into the metadata for use with official evaluation scripts, but not used anywhere else.
additional_metadata : Dict[str, Any], optional
The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
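The "most frequent span" selection described for token_spans can be sketched with collections.Counter. This is only an illustration of the selection rule; the helper name most_frequent_span is hypothetical and is not part of this module.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_span(token_spans: List[Tuple[int, int]]) -> Tuple[int, int]:
    # Hypothetical helper: count identical (start, end) pairs and keep the most common one.
    counts = Counter(token_spans)
    return counts.most_common(1)[0][0]


# Three annotators: two agree on tokens 3..4, one chose 3..5.
print(most_frequent_span([(3, 4), (3, 5), (3, 4)]))  # prints (3, 4)
```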

make_reading_comprehension_instance_quac#

def make_reading_comprehension_instance_quac(
    question_list_tokens: List[List[Token]],
    passage_tokens: List[Token],
    token_indexers: Dict[str, TokenIndexer],
    passage_text: str,
    token_span_lists: List[List[Tuple[int, int]]] = None,
    yesno_list: List[int] = None,
    followup_list: List[int] = None,
    additional_metadata: Dict[str, Any] = None,
    num_context_answers: int = 0
) -> Instance

Converts a list of questions, a passage, and optional answers to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if token_span_lists is given, the Instance has span_start and span_end fields, which are both IndexFields.

Parameters¶

question_list_tokens : List[List[Token]]
An already-tokenized list of questions. Each dialog has multiple questions.
passage_tokens : List[Token]
An already-tokenized passage that contains the answer to the given question.
token_indexers : Dict[str, TokenIndexer]
Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.
passage_text : str
The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.
token_span_lists : List[List[Tuple[int, int]]], optional
Indices into passage_tokens to use as the answer to the question for training. This is a list of lists: the outer list exists because there are multiple questions per dialog, and the inner list because there might be several possible correct answer spans in the passage. Currently, we just select the last span in each inner list (e.g., QuAC has multiple annotations on the dev set; this will select the last span, which was given by the original annotator).
yesno_list : List[int]
List of the affirmation bits, one for each question-answer pair.
followup_list : List[int]
List of the continuation bits, one for each question-answer pair.
num_context_answers : int, optional
How many answers to encode into the passage.
additional_metadata : Dict[str, Any], optional
The constructed metadata field will by default contain original_passage, token_offsets, question_tokens, passage_tokens, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.

handle_cannot#

def handle_cannot(reference_answers: List[str])

Processes a list of reference answers. If half or more of the reference answers are "CANNOTANSWER", that is taken as the gold answer. Otherwise, returns the answers that are not "CANNOTANSWER".
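The rule can be sketched as follows. The name sketch_handle_cannot is hypothetical, and returning a single-element list when "CANNOTANSWER" wins is an assumption, since the return type is not documented here.

```python
from typing import List


def sketch_handle_cannot(reference_answers: List[str]) -> List[str]:
    # Hypothetical sketch of the majority rule described above.
    num_cannot = sum(1 for answer in reference_answers if answer == "CANNOTANSWER")
    # "Half or more" includes ties, so a tie resolves to CANNOTANSWER.
    if 2 * num_cannot >= len(reference_answers):
        return ["CANNOTANSWER"]  # assumption: gold is returned as a one-element list
    return [answer for answer in reference_answers if answer != "CANNOTANSWER"]


print(sketch_handle_cannot(["CANNOTANSWER", "in 1990", "CANNOTANSWER"]))  # prints ['CANNOTANSWER']
print(sketch_handle_cannot(["CANNOTANSWER", "in 1990", "1990"]))  # prints ['in 1990', '1990']
```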

split_token_by_delimiter#

def split_token_by_delimiter(
    token: Token,
    delimiter: str
) -> List[Token]

split_tokens_by_hyphen#

def split_tokens_by_hyphen(tokens: List[Token]) -> List[Token]