squad
allennlp_models.rc.dataset_readers.squad
SQUAD2_NO_ANSWER_TOKEN
SQUAD2_NO_ANSWER_TOKEN = "@@<NO_ANSWER>@@"
The default no_answer_token for the squad2 reader.
SquadReader
@DatasetReader.register("squad")
class SquadReader(DatasetReader):
| def __init__(
| self,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| passage_length_limit: int = None,
| question_length_limit: int = None,
| skip_impossible_questions: bool = False,
| no_answer_token: Optional[str] = None,
| **kwargs
| ) -> None
Note
If you're training on SQuAD v1.1 you should use the squad1() classmethod to instantiate this reader, and for SQuAD v2.0 you should use the squad2() classmethod.
Also, for transformer-based models you should be using the TransformerSquadReader.
Dataset reader suitable for JSON-formatted SQuAD-like datasets. It will generate Instances with the following fields: question, a TextField; passage, another TextField; span_start and span_end, both IndexFields into the passage TextField; and metadata, a MetadataField that stores the instance's ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_texts'] and metadata['token_offsets'], respectively. This is so that we can more easily use the official SQuAD evaluation scripts to get metrics.
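To make the input format concrete, here is a minimal sketch of walking a SQuAD-style JSON document and collecting the pieces that end up in the fields and metadata described above. It uses only the standard library; iter_examples is a hypothetical helper rather than part of the reader, and the exclusive end offset is an assumption of this sketch.

```python
import json

# A tiny SQuAD-style document, inlined so the example is self-contained.
SQUAD_JSON = """
{
  "data": [{
    "title": "Example",
    "paragraphs": [{
      "context": "The reader was written in Python.",
      "qas": [{
        "id": "q1",
        "question": "What language was the reader written in?",
        "answers": [{"text": "Python", "answer_start": 26}]
      }]
    }]
  }]
}
"""


def iter_examples(dataset):
    """Yield one flat dict per question (hypothetical helper)."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = qa.get("answers", [])
                yield {
                    "id": qa["id"],
                    "question": qa["question"],
                    "passage": context,
                    "answer_texts": [a["text"] for a in answers],
                    # (start, end) character offsets; end is exclusive here.
                    "char_spans": [
                        (a["answer_start"], a["answer_start"] + len(a["text"]))
                        for a in answers
                    ],
                    # SQuAD v2.0 adds this flag; v1.1 files omit it.
                    "is_impossible": qa.get("is_impossible", False),
                }


examples = list(iter_examples(json.loads(SQUAD_JSON)))
start, end = examples[0]["char_spans"][0]
print(examples[0]["passage"][start:end])
```

Slicing the passage with a character span recovers the gold answer text, which is a quick sanity check on any SQuAD-like file.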
We also support limiting the maximum length for both passage and question. However, some gold answer spans may extend past the maximum passage length, which would cause an error when creating instances, so we simply skip those spans. If all of the gold answer spans of an example are skipped, we skip the example during training. During validation or testing, since we cannot skip examples, we use the last token as a pseudo gold answer span instead. The computed loss will not be accurate as a result, but this does not affect answer evaluation, because we keep all the original gold answer texts.
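The truncation rules above can be sketched in plain Python. This is an illustration of the described behaviour, not the reader's actual code: limit_passage and its is_training flag are hypothetical, and token spans are assumed to use inclusive indices.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]


def limit_passage(
    tokens: List[str],
    token_spans: List[Span],
    passage_length_limit: Optional[int],
    is_training: bool,
) -> Optional[Tuple[List[str], List[Span]]]:
    """Truncate the passage and reconcile gold spans (hypothetical helper).

    Returns None when the example should be skipped.
    """
    if passage_length_limit is not None:
        tokens = tokens[:passage_length_limit]
    # Keep only gold spans that still fit inside the truncated passage.
    kept = [(s, e) for s, e in token_spans if e < len(tokens)]
    if token_spans and not kept:
        if is_training:
            return None  # every gold span was cut off: skip the example
        # Validation/test: keep the example, pointing at the last token.
        last = len(tokens) - 1
        kept = [(last, last)]
    return tokens, kept


passage = "a b c d e f".split()
print(limit_passage(passage, [(4, 5)], 3, is_training=True))
print(limit_passage(passage, [(4, 5)], 3, is_training=False))
```

In the second call the example survives with a pseudo span on the last remaining token, which is why the loss is unreliable there while the evaluation against the stored answer texts is not.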
Parameters
- tokenizer : Tokenizer, optional (default = SpacyTokenizer())
  We use this Tokenizer for both the question and the passage. See Tokenizer.
- token_indexers : Dict[str, TokenIndexer], optional
  We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
- passage_length_limit : int, optional (default = None)
  If specified, we will truncate the passage if its length exceeds this limit.
- question_length_limit : int, optional (default = None)
  If specified, we will truncate the question if its length exceeds this limit.
- skip_impossible_questions : bool, optional (default = False)
  If this is true, we will skip examples whose questions have no answer span in the passage.
- no_answer_token : Optional[str], optional (default = None)
  A special token to append to each context. If using a SQuAD 2.0-style dataset, this should be set; otherwise an exception will be raised if an impossible question is encountered.
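As a rough illustration of how these parameters fit together, the sketch below builds a reader configuration as a Python dict and serializes it to JSON. It assumes the usual AllenNLP layout with a top-level dataset_reader key; the length limits are illustrative values, not recommendations.

```python
import json

reader_config = {
    # Registered name, from @DatasetReader.register("squad").
    "type": "squad",
    "passage_length_limit": 400,  # illustrative value
    "question_length_limit": 50,  # illustrative value
    "skip_impossible_questions": False,
    # Same string as SQUAD2_NO_ANSWER_TOKEN; needed for SQuAD 2.0-style data.
    "no_answer_token": "@@<NO_ANSWER>@@",
}

# Assumption: the reader sits under the top-level "dataset_reader" key.
config = {"dataset_reader": reader_config}
config_text = json.dumps(config, indent=2)
print(config_text)
```

Leaving tokenizer and token_indexers out of the dict relies on the defaults listed above.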
text_to_instance
class SquadReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| question_text: str,
| passage_text: str,
| is_impossible: bool = None,
| char_spans: List[Tuple[int, int]] = None,
| answer_texts: List[str] = None,
| passage_tokens: List[Token] = None,
| additional_metadata: Dict[str, Any] = None
| ) -> Optional[Instance]
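text_to_instance receives gold answers as character spans (char_spans), while span_start and span_end index tokens of the passage. The sketch below shows one way such a conversion can work using token offsets. It is not the reader's implementation: tokenize_with_offsets and to_token_span are hypothetical helpers, a simple regex stands in for the configured Tokenizer, and character spans are assumed to have an exclusive end.

```python
import re
from typing import List, Tuple

Offset = Tuple[str, int, int]


def tokenize_with_offsets(text: str) -> List[Offset]:
    """Return (token, start, end) triples; end is exclusive."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def to_token_span(offsets: List[Offset], char_span: Tuple[int, int]) -> Tuple[int, int]:
    """Map a character span (exclusive end) to an inclusive token span."""
    char_start, char_end = char_span
    # First token that ends after the character span starts.
    start = next(i for i, (_, _, e) in enumerate(offsets) if e > char_start)
    # Last token that starts before the character span ends.
    end = max(i for i, (_, s, _) in enumerate(offsets) if s < char_end)
    return start, end


passage = "The reader was written in Python."
offsets = tokenize_with_offsets(passage)
span = to_token_span(offsets, (26, 32))
print(span, [tok for tok, _, _ in offsets[span[0]:span[1] + 1]])
```

Storing the offsets alongside the instance is what lets a predicted token span be mapped back to a substring of the original passage for evaluation.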
squad1
class SquadReader(DatasetReader):
| ...
| @classmethod
| def squad1(
| cls,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| passage_length_limit: int = None,
| question_length_limit: int = None,
| skip_impossible_questions: bool = False,
| **kwargs
| ) -> "SquadReader"
Gives a SquadReader suitable for SQuAD v1.1.
squad2
class SquadReader(DatasetReader):
| ...
| @classmethod
| def squad2(
| cls,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| passage_length_limit: int = None,
| question_length_limit: int = None,
| skip_impossible_questions: bool = False,
| no_answer_token: str = SQUAD2_NO_ANSWER_TOKEN,
| **kwargs
| ) -> "SquadReader"
Gives a SquadReader suitable for SQuAD v2.0.
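The practical difference between the two constructors is the no_answer_token. The sketch below illustrates the documented behaviour: the token is appended to the passage, and an impossible question without a configured token raises an exception. Pointing the gold span at the appended token, the ValueError type, and the prepare_passage helper are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

SQUAD2_NO_ANSWER_TOKEN = "@@<NO_ANSWER>@@"

Span = Tuple[int, int]


def prepare_passage(
    tokens: List[str],
    gold_span: Optional[Span],
    is_impossible: bool,
    no_answer_token: Optional[str],
) -> Tuple[List[str], Optional[Span]]:
    """Append the no-answer token and pick the gold span (hypothetical helper)."""
    if no_answer_token is None:
        if is_impossible:
            # SQuAD v1.1-style setup cannot represent unanswerable questions.
            raise ValueError("impossible question but no no_answer_token is set")
        return list(tokens), gold_span
    tokens = list(tokens) + [no_answer_token]
    if is_impossible:
        # Assumption: unanswerable questions point at the appended token.
        last = len(tokens) - 1
        return tokens, (last, last)
    return tokens, gold_span


tokens, span = prepare_passage(["It", "rained", "."], None, True, SQUAD2_NO_ANSWER_TOKEN)
print(tokens, span)
```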