transformer_squad
TransformerSquadReader#
class TransformerSquadReader(DatasetReader):
| def __init__(
| self,
| transformer_model_name: str = "bert-base-cased",
| length_limit: int = 384,
| stride: int = 128,
| skip_invalid_examples: bool = False,
| max_query_length: int = 64,
| **kwargs
| ) -> None
Reads a JSON-formatted SQuAD file and returns a `Dataset` where the `Instances` have four fields:

* `question_with_context`, a `TextField` that contains the concatenation of question and context,
* `answer_span`, a `SpanField` into the `question_with_context` `TextField` denoting the answer,
* `context_span`, a `SpanField` into the `question_with_context` `TextField` denoting the context, i.e., the part of the text that potential answers can come from,
* a `MetadataField` that stores the instance's ID, the original question, the original passage text, both of these in tokenized form, and the gold answer strings, accessible as `metadata['id']`, `metadata['question']`, `metadata['context']`, `metadata['question_tokens']`, `metadata['context_tokens']`, and `metadata['answers']`. This is so that we can more easily use the official SQuAD evaluation script to get metrics.
We also support limiting the maximum length for the question. When the context+question is too long, we run a sliding window over the context and emit multiple instances for a single question. At training time, we only emit instances that contain a gold answer. At test time, we emit all instances. As a result, the per-instance metrics you get during training and evaluation don't correspond 100% to the SQuAD task. To get a final number, you have to run the script in scripts/transformer_qa_eval.py.
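The sliding-window behavior described above can be sketched as follows (a minimal illustration, not the reader's actual implementation; for simplicity, `window_len` here counts only context word pieces, whereas the real reader budgets the question and special tokens against `length_limit` as well):

```python
def window_starts(context_len: int, window_len: int, stride: int):
    """Yield start offsets of overlapping context windows.

    Consecutive windows overlap by `stride` tokens, so each new
    window advances by (window_len - stride) tokens.
    """
    start = 0
    while True:
        yield start
        if start + window_len >= context_len:
            break
        start += window_len - stride

# A 1000-token context with 384-token windows overlapping by 128 tokens:
print(list(window_starts(1000, 384, 128)))  # [0, 256, 512, 768]
```

Each start offset corresponds to one emitted instance for the same question; at training time, windows that do not contain the gold answer are dropped.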
Parameters

- transformer_model_name : `str`, optional (default = `'bert-base-cased'`)
  This reader chooses tokenizer and token indexer according to this setting.
- length_limit : `int`, optional (default = `384`)
  We will make sure that the length of context+question never exceeds this many word pieces.
- stride : `int`, optional (default = `128`)
  When context+question are too long for the length limit, we emit multiple instances for one question, where the context is shifted. This parameter specifies the overlap between the shifted context windows. It is called "stride" instead of "overlap" because that's what it's called in the original huggingface implementation.
- skip_invalid_examples : `bool`, optional (default = `False`)
  If this is true, we will skip examples that don't have a gold answer. You should set this to `True` during training, and `False` any other time.
- max_query_length : `int`, optional (default = `64`)
  The maximum number of wordpieces dedicated to the question. If the question is longer than this, it will be truncated.
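In an AllenNLP configuration file, the reader is selected by its registered name; a sketch (assuming the reader is registered as `"transformer_squad"`, matching the module name above, with the parameter values shown being this reader's defaults except `skip_invalid_examples`):

```json
{
  "dataset_reader": {
    "type": "transformer_squad",
    "transformer_model_name": "bert-base-cased",
    "length_limit": 384,
    "stride": 128,
    "skip_invalid_examples": true,
    "max_query_length": 64
  }
}
```

Here `skip_invalid_examples` is set to `true` as recommended for training; a separate evaluation config would set it to `false`.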
make_instances#
class TransformerSquadReader(DatasetReader):
| ...
| def make_instances(
| self,
| qid: str,
| question: str,
| answers: List[str],
| context: str,
| first_answer_offset: Optional[int]
| ) -> Iterable[Instance]
Tokenizes the context by spaces first, and then with the wordpiece tokenizer. For RoBERTa, this produces a bug where every token is marked as beginning-of-sentence. To fix it, we detect whether a space comes before a word, and if so, add `"a "` in front of the word.
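The workaround can be illustrated without a real tokenizer (a toy sketch of the technique; `toy_wordpiece` stands in for the actual transformer tokenizer and is not part of this reader):

```python
def toy_wordpiece(text: str):
    # Stand-in for a subword tokenizer: just splits on whitespace.
    return text.split()

def tokenize_word(word: str, preceded_by_space: bool):
    """Tokenize one whitespace-separated word from the context.

    If the word was preceded by a space, prepend "a " so that
    tokenizers which treat tokens after whitespace specially
    (e.g. RoBERTa's BPE) see the word mid-sentence rather than
    at the start, then drop the tokens produced by the dummy "a".
    """
    if preceded_by_space:
        tokens = toy_wordpiece("a " + word)
        return tokens[1:]  # drop the dummy "a"
    return toy_wordpiece(word)

print(tokenize_word("answers", preceded_by_space=True))   # ['answers']
print(tokenize_word("Answers", preceded_by_space=False))  # ['Answers']
```

With a real BPE tokenizer, the mid-sentence and sentence-initial forms of a word can tokenize differently, which is exactly why the dummy prefix matters.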
text_to_instance#
class TransformerSquadReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| question: str,
| tokenized_question: List[Token],
| context: str,
| tokenized_context: List[Token],
| answers: List[str],
| token_answer_span: Optional[Tuple[int, int]],
| additional_metadata: Dict[str, Any] = None
| ) -> Instance
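To show how a context-relative `token_answer_span` relates to the packed `question_with_context` sequence, here is a toy sketch (it assumes the usual BERT-style packing `[CLS] question [SEP] context [SEP]` with single separator tokens; the real reader's offsets may differ for other models):

```python
from typing import List, Tuple

def answer_span_in_joined_sequence(
    tokenized_question: List[str],
    context_answer_span: Tuple[int, int],
) -> Tuple[int, int]:
    """Shift a (start, end) span given in context-token indices
    into indices of the packed sequence [CLS] question [SEP] context [SEP]."""
    # 1 for [CLS], then the question tokens, then 1 for [SEP].
    offset = 1 + len(tokenized_question) + 1
    start, end = context_answer_span
    return start + offset, end + offset

# A 5-token question; the answer covers context tokens 3..4:
print(answer_span_in_joined_sequence(["what", "is", "the", "answer", "?"], (3, 4)))
# (10, 11)
```

The `SpanField` stored in `answer_span` indexes into the concatenated `TextField` in exactly this shifted coordinate system.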