transformer_squad

TransformerSquadReader#

class TransformerSquadReader(DatasetReader):
 | def __init__(
 |     self,
 |     transformer_model_name: str = "bert-base-cased",
 |     length_limit: int = 384,
 |     stride: int = 128,
 |     skip_invalid_examples: bool = False,
 |     max_query_length: int = 64,
 |     **kwargs
 | ) -> None

Reads a JSON-formatted SQuAD file and returns a Dataset where the Instances have four fields:

  • question_with_context, a TextField that contains the concatenation of question and context.
  • answer_span, a SpanField into the question TextField denoting the answer.
  • context_span, a SpanField into the question TextField denoting the context, i.e., the part of the text that potential answers can come from.
  • A MetadataField that stores the instance's ID, the original question, the original passage text, both of these in tokenized form, and the gold answer strings, accessible as metadata['id'], metadata['question'], metadata['context'], metadata['question_tokens'], metadata['context_tokens'], and metadata['answers']. This is so that we can more easily use the official SQuAD evaluation script to get metrics.
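For example, reading a SQuAD-format file and inspecting these fields might look like the following (a minimal sketch; the import path and file name are assumptions and may differ across allennlp-models versions):

from allennlp_models.rc.dataset_readers import TransformerSquadReader

reader = TransformerSquadReader()
for instance in reader.read("dev-v1.1.json"):  # any SQuAD-format JSON file
    metadata = instance["metadata"]
    print(metadata["id"], metadata["question"], metadata["answers"])
    print(instance["question_with_context"])  # question + context TextField
    break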

We also support limiting the maximum length for the question. When the context+question is too long, we run a sliding window over the context and emit multiple instances for a single question. At training time, we only emit instances that contain a gold answer. At test time, we emit all instances. As a result, the per-instance metrics you get during training and evaluation don't correspond exactly to the official SQuAD metrics. To get a final number, run the script in scripts/transformer_qa_eval.py.
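To make the windowing concrete, here is a minimal sketch of the window arithmetic only (illustrative, not the reader's actual code; note that stride here is the overlap between consecutive windows):

def window_starts(num_tokens: int, window_len: int, stride: int):
    """Yield start offsets of overlapping windows over a token sequence."""
    start = 0
    while True:
        yield start
        if start + window_len >= num_tokens:
            break
        # Each window advances by (window_len - stride), so consecutive
        # windows share `stride` tokens.
        start += window_len - stride

# A 1000-wordpiece context with 384-piece windows and 128 pieces of overlap:
print(list(window_starts(1000, 384, 128)))  # [0, 256, 512, 768]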

Parameters

  • transformer_model_name : str, optional (default = 'bert-base-cased')
    This reader chooses the tokenizer and token indexer according to this setting.
  • length_limit : int, optional (default = 384)
    We will make sure that the length of context+question never exceeds this many wordpieces.
  • stride : int, optional (default = 128)
    When context+question is too long for the length limit, we emit multiple instances for one question, where the context is shifted. This parameter specifies the overlap between consecutive context windows. It is called "stride" instead of "overlap" because that's what it's called in the original Hugging Face implementation.
  • skip_invalid_examples : bool, optional (default = False)
    If this is True, we will skip examples that don't have a gold answer. You should set this to True during training and to False at any other time (see the construction sketch after this list).
  • max_query_length : int, optional (default = 64)
    The maximum number of wordpieces dedicated to the question. If the question is longer than this, it will be truncated.
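Putting the parameters together, training and evaluation readers might be constructed like this (a sketch; the import path is an assumption):

from allennlp_models.rc.dataset_readers import TransformerSquadReader

# Training: drop windows that don't contain the gold answer.
train_reader = TransformerSquadReader(
    transformer_model_name="bert-base-cased",
    length_limit=384,
    stride=128,
    skip_invalid_examples=True,
    max_query_length=64,
)

# Evaluation: keep every window so the official eval script sees them all.
eval_reader = TransformerSquadReader(skip_invalid_examples=False)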

make_instances#

class TransformerSquadReader(DatasetReader):
 | ...
 | def make_instances(
 |     self,
 |     qid: str,
 |     question: str,
 |     answers: List[str],
 |     context: str,
 |     first_answer_offset: Optional[int]
 | ) -> Iterable[Instance]

Tokenizes the context by spaces first, and then with the wordpiece tokenizer. For RoBERTa, this produces a bug where every token is marked as beginning-of-sentence. To fix it, we detect whether a space comes before a word, and if so, add "a " in front of the word before wordpiece-tokenizing it.
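The space sensitivity is easy to see with the Hugging Face tokenizer directly (an illustrative sketch, assuming the transformers library is installed; roberta-base marks a leading space with the Ġ character):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# On its own, the word is tokenized as if it started the sentence:
print(tokenizer.tokenize("word"))        # ['word']

# Prepending "a " yields the space-prefixed wordpiece; dropping the
# first token ('a') leaves the correctly marked pieces:
print(tokenizer.tokenize("a word")[1:])  # ['Ġword']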

text_to_instance#

class TransformerSquadReader(DatasetReader):
 | ...
 | @overrides
 | def text_to_instance(
 |     self,
 |     question: str,
 |     tokenized_question: List[Token],
 |     context: str,
 |     tokenized_context: List[Token],
 |     answers: List[str],
 |     token_answer_span: Optional[Tuple[int, int]],
 |     additional_metadata: Dict[str, Any] = None
 | ) -> Instance
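Called directly, text_to_instance might be used like this (a hedged sketch: reader._tokenizer is an assumption about the reader's internal tokenizer attribute, and the question/answer strings are made up for illustration):

from allennlp_models.rc.dataset_readers import TransformerSquadReader

reader = TransformerSquadReader()
question = "Who created SQuAD?"
context = "SQuAD was created by researchers at Stanford University."

instance = reader.text_to_instance(
    question=question,
    tokenized_question=reader._tokenizer.tokenize(question),
    context=context,
    tokenized_context=reader._tokenizer.tokenize(context),
    answers=["researchers at Stanford University"],
    token_answer_span=None,  # no gold span, e.g. at prediction time
)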