transformer_squad
allennlp_models.rc.dataset_readers.transformer_squad
TransformerSquadReader
@DatasetReader.register("transformer_squad")
class TransformerSquadReader(DatasetReader):
| def __init__(
| self,
| transformer_model_name: str = "bert-base-cased",
| length_limit: int = 384,
| stride: int = 128,
| skip_impossible_questions: bool = False,
| max_query_length: int = 64,
| tokenizer_kwargs: Dict[str, Any] = None,
| **kwargs
| ) -> None
Dataset reader suitable for JSON-formatted SQuAD-like datasets to be used with a transformer-based
QA model, such as `TransformerQA`.
It will generate `Instance`s with the following fields:

- `question_with_context`, a `TextField` that contains the concatenation of question and context,
- `answer_span`, a `SpanField` into the `question` `TextField` denoting the answer,
- `context_span`, a `SpanField` into the `question` `TextField` denoting the context, i.e., the part of the text that potential answers can come from,
- `cls_index` (optional), an `IndexField` that holds the index of the `[CLS]` token within the `question_with_context` field. This is needed because the `[CLS]` token is used to indicate an impossible question. Since most tokenizers/models have the `[CLS]` token as the first token, this will only be included in the instance if the `[CLS]` token is NOT the first token,
- `metadata`, a `MetadataField` that stores the instance's ID, the original question, the original passage text, both of these in tokenized form, and the gold answer strings, accessible as `metadata['id']`, `metadata['question']`, `metadata['context']`, `metadata['question_tokens']`, `metadata['context_tokens']`, and `metadata['answers']`. This is so that we can more easily use the official SQuAD evaluation script to get metrics.
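The `cls_index` condition described above can be sketched as follows. This is an illustrative helper, not the library's actual code, and it assumes the literal token string `"[CLS]"`:

```python
def maybe_cls_index(wordpieces):
    """Return the index of the [CLS] token only when it is NOT the first
    token (the case where the cls_index field is included); otherwise None.
    Sketch only: assumes the special token renders literally as "[CLS]"."""
    try:
        idx = wordpieces.index("[CLS]")
    except ValueError:
        return None  # tokenizer uses a different special token
    # When [CLS] is already the first token, models can rely on position 0
    # and the extra field is unnecessary.
    return idx if idx != 0 else None
```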
For SQuAD v2.0-style datasets that contain impossible questions, we set the gold answer span
to the span of the `[CLS]` token when there are no answers.
We also support limiting the maximum length for the question. When the context+question is too long, we run a
sliding window over the context and emit multiple instances for a single question.
If `skip_impossible_questions` is `True`, then we only emit instances that contain a gold answer.
As a result, the per-instance metrics you get during training and evaluation might not correspond
100% to the SQuAD task.
To get a final number for SQuAD v1.1, you have to run
`python -m allennlp_models.rc.tools.transformer_qa_eval`.
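The sliding-window behavior described above can be sketched in plain Python. This is a simplified illustration of the windowing arithmetic under the stated defaults, not the reader's actual implementation; the function name and the `question_len` budget are assumptions:

```python
def sliding_windows(context_tokens, question_len, length_limit=384, stride=128):
    """Split an over-long context into overlapping windows so that
    question + context never exceeds length_limit wordpieces.
    `stride` is the number of overlapping tokens between adjacent
    windows, matching the HuggingFace meaning of the term."""
    space_for_context = length_limit - question_len
    assert space_for_context > stride, "window must advance on each step"
    windows = []
    start = 0
    while True:
        end = min(start + space_for_context, len(context_tokens))
        windows.append(context_tokens[start:end])
        if end >= len(context_tokens):
            break
        # Shift forward, keeping `stride` tokens of overlap with the
        # previous window so no answer span is split without a copy.
        start = end - stride
    return windows
```

Each window would then be emitted as its own `Instance` for the same question.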
Parameters

- **transformer_model_name** : `str`, optional (default = `'bert-base-cased'`)
  This reader chooses tokenizer and token indexer according to this setting.

- **length_limit** : `int`, optional (default = `384`)
  We will make sure that the length of context+question never exceeds this many word pieces.

- **stride** : `int`, optional (default = `128`)
  When context+question are too long for the length limit, we emit multiple instances for one question, where the context is shifted. This parameter specifies the overlap between the shifted context windows. It is called "stride" instead of "overlap" because that's what it's called in the original HuggingFace implementation.

- **skip_impossible_questions** : `bool`, optional (default = `False`)
  If this is true, we will skip examples that don't have an answer. This could happen if the question is marked impossible in the dataset, or if the question+context is truncated according to `length_limit` such that the context no longer contains a gold answer. For SQuAD v1.1-style datasets, you should set this to `True` during training, and `False` any other time. For SQuAD v2.0-style datasets you should leave this as `False`.

- **max_query_length** : `int`, optional (default = `64`)
  The maximum number of wordpieces dedicated to the question. If the question is longer than this, it will be truncated.
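Putting the parameters together, a typical AllenNLP configuration fragment using this reader might look like the following. The values shown are the documented defaults except `skip_impossible_questions`, which is set to `true` as recommended for SQuAD v1.1-style training; the surrounding model and trainer config is omitted:

```json
"dataset_reader": {
    "type": "transformer_squad",
    "transformer_model_name": "bert-base-cased",
    "length_limit": 384,
    "stride": 128,
    "skip_impossible_questions": true,
    "max_query_length": 64
}
```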
make_instances
class TransformerSquadReader(DatasetReader):
| ...
| def make_instances(
| self,
| qid: str,
| question: str,
| answers: List[str],
| context: str,
| first_answer_offset: Optional[int],
| always_add_answer_span: bool = False,
| is_training: bool = False,
| cached_tokenized_context: Optional[List[Token]] = None
| ) -> Iterable[Instance]
Create training instances from a SQuAD example.
text_to_instance
class TransformerSquadReader(DatasetReader):
| ...
| def text_to_instance(
| self,
| question: str,
| tokenized_question: List[Token],
| context: str,
| tokenized_context: List[Token],
| answers: List[str] = None,
| token_answer_span: Optional[Tuple[int, int]] = None,
| additional_metadata: Dict[str, Any] = None,
| always_add_answer_span: bool = False
| ) -> Instance
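The `token_answer_span` argument is a token-level span, while SQuAD annotates answers by character offset. A simplified sketch of that character-to-token mapping, assuming each token carries `(start, end)` character offsets; this is an illustrative helper, not the library's actual utility:

```python
def char_span_to_token_span(token_offsets, char_start, char_end):
    """Map a character-level answer span [char_start, char_end) to an
    inclusive (start_token, end_token) pair, given each token's
    (start, end) character offsets. Returns None when the answer falls
    outside the tokenized window (e.g. truncated away by length_limit)."""
    start_token = end_token = None
    for i, (s, e) in enumerate(token_offsets):
        if start_token is None and s <= char_start < e:
            start_token = i
        if s < char_end <= e:
            end_token = i
    if start_token is None or end_token is None:
        return None
    return (start_token, end_token)
```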
apply_token_indexers
class TransformerSquadReader(DatasetReader):
| ...
| def apply_token_indexers(self, instance: Instance) -> None