squad

allennlp_models.rc.dataset_readers.squad


SQUAD2_NO_ANSWER_TOKEN

SQUAD2_NO_ANSWER_TOKEN = "@@<NO_ANSWER>@@"

The default no_answer_token for the squad2 reader.

SquadReader

@DatasetReader.register("squad")
class SquadReader(DatasetReader):
 | def __init__(
 |     self,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     passage_length_limit: int = None,
 |     question_length_limit: int = None,
 |     skip_impossible_questions: bool = False,
 |     no_answer_token: Optional[str] = None,
 |     **kwargs
 | ) -> None

Note

If you're training on SQuAD v1.1 you should use the squad1() classmethod to instantiate this reader, and for SQuAD v2.0 you should use the squad2() classmethod.

For transformer-based models, use the TransformerSquadReader instead.

Dataset reader suitable for JSON-formatted SQuAD-like datasets. It will generate Instances with the following fields:

  • question, a TextField,
  • passage, another TextField,
  • span_start and span_end, both IndexFields into the passage TextField,
  • and metadata, a MetadataField that stores the instance's ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_texts'] and metadata['token_offsets'], respectively. This is so that we can more easily use the official SQuAD evaluation scripts to get metrics.
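
As a rough illustration of the input these fields are built from, the sketch below walks a SQuAD-style JSON document and derives character spans from each answer's answer_start and text. The helper iter_squad_examples and the keys of the records it yields are hypothetical names chosen for this example; this is not the reader's own code.

```python
import json
from typing import Any, Dict, Iterator

# A tiny SQuAD-style document, inlined so the sketch is self-contained.
SAMPLE = json.loads("""
{"data": [{"title": "Example", "paragraphs": [{
    "context": "AllenNLP is built on PyTorch.",
    "qas": [{"id": "q1",
             "question": "What is AllenNLP built on?",
             "answers": [{"text": "PyTorch", "answer_start": 21}]}]
}]}]}
""")


def iter_squad_examples(dataset: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat record per question (hypothetical helper)."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            passage = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "passage": passage,
                    # Only SQuAD 2.0-style files carry this flag.
                    "is_impossible": qa.get("is_impossible", False),
                    "answer_texts": [a["text"] for a in answers],
                    # Character span: start inclusive, end exclusive.
                    "char_spans": [
                        (a["answer_start"], a["answer_start"] + len(a["text"]))
                        for a in answers
                    ],
                }


examples = list(iter_squad_examples(SAMPLE))
```

Slicing the passage with a character span recovers the gold answer string, which is the same information kept in metadata['answer_texts'] for evaluation.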

We also support limiting the maximum length of both the passage and the question. However, some gold answer spans may fall beyond the maximum passage length, which would cause errors when creating instances, so we simply skip these spans. If all of the gold answer spans of an example are skipped, we skip the whole example during training. During validation or testing we cannot skip examples, so we use the last token as a pseudo gold answer span instead. The computed loss will not be accurate as a result, but answer evaluation is unaffected because we keep all of the original gold answer texts.
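
The skipping behaviour described above can be sketched as follows. This is a simplified illustration and not the reader's actual code: the function resolve_spans, its arguments, and the use of inclusive token indices are assumptions made for this example.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def resolve_spans(
    token_spans: List[Span],
    num_passage_tokens: int,
    is_training: bool,
) -> Optional[List[Span]]:
    """Return the usable gold spans, or None if the example should be skipped.

    `num_passage_tokens` is the passage length after any truncation.
    """
    # Drop gold spans that reach past the end of the truncated passage.
    kept = [(start, end) for start, end in token_spans if end < num_passage_tokens]
    if kept or not token_spans:
        return kept
    if is_training:
        # Every gold span was cut off: skip the example.
        return None
    # Validation/testing: fall back to the last token as a pseudo gold span.
    last = num_passage_tokens - 1
    return [(last, last)]
```

With a 10-token passage, a span ending at token 52 is dropped; if it was the only gold span, the example is skipped in training and mapped to the last token otherwise.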

Parameters

  • tokenizer : Tokenizer, optional (default = SpacyTokenizer())
    We use this Tokenizer for both the question and the passage. See Tokenizer.
  • token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
    We similarly use this for both the question and the passage. See TokenIndexer.
  • passage_length_limit : int, optional (default = None)
    If specified, we will truncate the passage if its length exceeds this limit.
  • question_length_limit : int, optional (default = None)
    If specified, we will truncate the question if its length exceeds this limit.
  • skip_impossible_questions : bool, optional (default = False)
    If this is true, we will skip examples whose questions cannot be answered from the passage, i.e. those that have no answer span.
  • no_answer_token : Optional[str], optional (default = None)
    A special token to append to each context. If using a SQuAD 2.0-style dataset, this should be set; otherwise an exception will be raised if an impossible question is encountered.
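
To make the role of no_answer_token more concrete, here is a minimal sketch of the behaviour described above, working on plain lists of token strings. The helper add_no_answer_token is hypothetical. Two details are assumptions of this sketch rather than statements from this reference: the gold span of an impossible question points at the appended token, and the exception type is ValueError.

```python
from typing import List, Optional, Tuple

SQUAD2_NO_ANSWER_TOKEN = "@@<NO_ANSWER>@@"


def add_no_answer_token(
    passage_tokens: List[str],
    is_impossible: bool,
    no_answer_token: Optional[str] = None,
) -> Tuple[List[str], Optional[Tuple[int, int]]]:
    """Return (tokens, span), where span is set only for impossible questions."""
    if no_answer_token is None:
        if is_impossible:
            # Assumption: the exact exception type is not given in the docs.
            raise ValueError("Impossible question found, but no_answer_token is not set.")
        return list(passage_tokens), None
    # Append the special token to the context.
    tokens = list(passage_tokens) + [no_answer_token]
    if is_impossible:
        # Assumption: an impossible question "answers" with the appended token.
        last = len(tokens) - 1
        return tokens, (last, last)
    return tokens, None
```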

text_to_instance

class SquadReader(DatasetReader):
 | ...
 | @overrides
 | def text_to_instance(
 |     self,
 |     question_text: str,
 |     passage_text: str,
 |     is_impossible: bool = None,
 |     char_spans: List[Tuple[int, int]] = None,
 |     answer_texts: List[str] = None,
 |     passage_tokens: List[Token] = None,
 |     additional_metadata: Dict[str, Any] = None
 | ) -> Optional[Instance]
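
The signature takes character spans (char_spans), while the span_start and span_end fields index into the passage tokens, so some conversion between the two is implied. Below is a minimal sketch of one way such a conversion could work, assuming whitespace tokenization and inclusive token indices. Both tokenize_with_offsets and to_token_span are hypothetical helpers written for this example, not functions of the library.

```python
import re
from typing import List, Optional, Tuple

Offset = Tuple[str, int, int]  # (token text, start char, end char exclusive)


def tokenize_with_offsets(text: str) -> List[Offset]:
    """Whitespace tokenizer that also records character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def to_token_span(
    offsets: List[Offset], char_span: Tuple[int, int]
) -> Optional[Tuple[int, int]]:
    """Map a character span to inclusive token indices, or None if nothing overlaps."""
    char_start, char_end = char_span
    start: Optional[int] = None
    end: Optional[int] = None
    for i, (_, tok_start, tok_end) in enumerate(offsets):
        # A token belongs to the span if the two character ranges overlap.
        if tok_end > char_start and tok_start < char_end:
            if start is None:
                start = i
            end = i
    if start is None or end is None:
        return None
    return (start, end)


offsets = tokenize_with_offsets("AllenNLP is built on PyTorch.")
```

A character span that lies entirely past the (possibly truncated) token list maps to None here, which corresponds to the case where a gold span has to be skipped.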

squad1

class SquadReader(DatasetReader):
 | ...
 | @classmethod
 | def squad1(
 |     cls,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     passage_length_limit: int = None,
 |     question_length_limit: int = None,
 |     skip_impossible_questions: bool = False,
 |     **kwargs
 | ) -> "SquadReader"

Gives a SquadReader suitable for SQuAD v1.1.

squad2

class SquadReader(DatasetReader):
 | ...
 | @classmethod
 | def squad2(
 |     cls,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     passage_length_limit: int = None,
 |     question_length_limit: int = None,
 |     skip_impossible_questions: bool = False,
 |     no_answer_token: str = SQUAD2_NO_ANSWER_TOKEN,
 |     **kwargs
 | ) -> "SquadReader"

Gives a SquadReader suitable for SQuAD v2.0.
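
Judging only from the signatures above, squad1() and squad2() differ from the plain constructor in how no_answer_token is handled: squad1() does not accept it, and squad2() defaults it to SQUAD2_NO_ANSWER_TOKEN. The pattern can be sketched with a stand-in class. MiniReader is hypothetical, and passing no_answer_token=None from squad1() is an assumption of this sketch.

```python
from typing import Optional

SQUAD2_NO_ANSWER_TOKEN = "@@<NO_ANSWER>@@"


class MiniReader:
    """Hypothetical stand-in showing the alternative-constructor pattern."""

    def __init__(
        self,
        skip_impossible_questions: bool = False,
        no_answer_token: Optional[str] = None,
    ) -> None:
        self.skip_impossible_questions = skip_impossible_questions
        self.no_answer_token = no_answer_token

    @classmethod
    def squad1(cls, skip_impossible_questions: bool = False) -> "MiniReader":
        # Assumption: SQuAD v1.1 has no unanswerable questions, so no token is set.
        return cls(
            skip_impossible_questions=skip_impossible_questions,
            no_answer_token=None,
        )

    @classmethod
    def squad2(
        cls,
        skip_impossible_questions: bool = False,
        no_answer_token: str = SQUAD2_NO_ANSWER_TOKEN,
    ) -> "MiniReader":
        return cls(
            skip_impossible_questions=skip_impossible_questions,
            no_answer_token=no_answer_token,
        )


v1_reader = MiniReader.squad1()
v2_reader = MiniReader.squad2()
```

Keeping the dataset-specific defaults in classmethods leaves the main constructor neutral while giving each SQuAD version a one-call entry point.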