class TransformerSquadReader(DatasetReader):
 | def __init__(
 |     self,
 |     transformer_model_name: str = "bert-base-cased",
 |     length_limit: int = 384,
 |     stride: int = 128,
 |     skip_impossible_questions: bool = False,
 |     max_query_length: int = 64,
 |     tokenizer_kwargs: Dict[str, Any] = None,
 |     **kwargs
 | ) -> None

Dataset reader suitable for JSON-formatted SQuAD-like datasets to be used with a transformer-based QA model, such as TransformerQA.

It will generate Instances with the following fields:

  • question_with_context, a TextField that contains the concatenation of question and context.
  • answer_span, a SpanField into the question_with_context TextField denoting the answer.
  • context_span, a SpanField into the question_with_context TextField denoting the context, i.e., the part of the text that potential answers can come from.
  • cls_index (optional), an IndexField that holds the index of the [CLS] token within the question_with_context field. This is needed because the [CLS] token is used to indicate an impossible question. Since most tokenizers/models have the [CLS] token as the first token, this will only be included in the instance if the [CLS] token is NOT the first token.
  • metadata, a MetadataField that stores the instance's ID, the original question, the original passage text, both of these in tokenized form, and the gold answer strings, accessible as metadata['id'], metadata['question'], metadata['context'], metadata['question_tokens'], metadata['context_tokens'], and metadata['answers']. This is so that we can more easily use the official SQuAD evaluation script to get metrics.
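As a rough illustration of how these fields relate to each other, here is a hedged, stdlib-only sketch (not the library's code) of the token layout assumed above for a BERT-style model, `[CLS] question [SEP] context [SEP]`, and of the rule that `cls_index` is only stored when the `[CLS]` token is not the first token:

```python
def build_fields(question_tokens, context_tokens):
    """Illustrative only: lay out question + context and compute spans."""
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]
    context_start = 1 + len(question_tokens) + 1          # first context token
    context_end = context_start + len(context_tokens) - 1  # inclusive end index
    fields = {
        "question_with_context": tokens,
        "context_span": (context_start, context_end),
    }
    cls_index = tokens.index("[CLS]")
    # cls_index is only included when [CLS] is NOT the first token
    if cls_index != 0:
        fields["cls_index"] = cls_index
    return fields

fields = build_fields(["what", "is", "squad", "?"], ["squad", "is", "a", "dataset"])
```

With a BERT-style layout the `[CLS]` token sits at position 0, so `cls_index` is omitted and `context_span` points at the four context tokens.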

For SQuAD v2.0-style datasets that contain impossible questions, we set the gold answer span to the span of the [CLS] token when there are no answers.

We also support limiting the maximum length for the question. When the context+question is too long, we run a sliding window over the context and emit multiple instances for a single question. If skip_impossible_questions is True, then we only emit instances that contain a gold answer. As a result, the per-instance metrics you get during training and evaluation might not correspond 100% to the SQuAD task.
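The sliding-window behaviour described above can be sketched in plain Python (this is a simplified illustration, not the reader's implementation; the real reader operates on tokenizer wordpieces):

```python
def window_contexts(context_tokens, window_size, stride):
    """Split a long context into overlapping windows.

    `stride` is the overlap between consecutive windows, matching the
    reader's parameter of the same name.
    """
    windows = []
    step = window_size - stride  # how far the window advances each time
    start = 0
    while True:
        windows.append(context_tokens[start:start + window_size])
        if start + window_size >= len(context_tokens):
            break
        start += step
    return windows

windows = window_contexts(list(range(10)), window_size=4, stride=2)
```

Each instance emitted for a question gets one of these windows as its context; consecutive windows share `stride` tokens, so an answer near a window boundary still appears whole in at least one window.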

To get a final number for SQuAD v1.1, you have to run

python -m

Parameters


  • transformer_model_name : str, optional (default = 'bert-base-cased')
    This reader chooses the tokenizer and token indexer according to this setting.

  • length_limit : int, optional (default = 384)
    We will make sure that the length of context+question never exceeds this many wordpieces.

  • stride : int, optional (default = 128)
    When context+question are too long for the length limit, we emit multiple instances for one question, where the context is shifted. This parameter specifies the overlap between successive context windows. It is called "stride" instead of "overlap" because that's what it's called in the original Hugging Face implementation.

  • skip_impossible_questions : bool, optional (default = False)
    If this is true, we will skip examples that don't have an answer. This could happen if the question is marked impossible in the dataset, or if the question+context is truncated according to length_limit such that the context no longer contains a gold answer.

    For SQuAD v1.1-style datasets, you should set this to True during training, and False any other time.

    For SQuAD v2.0-style datasets you should leave this as False.

  • max_query_length : int, optional (default = 64)
    The maximum number of wordpieces dedicated to the question. If the question is longer than this, it will be truncated.
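To see how these parameters interact, here is a hedged sketch of the wordpiece budget left for the context once the (possibly truncated) question and the special tokens are accounted for. The count of three special tokens is an assumption for a BERT-style layout (`[CLS] question [SEP] context [SEP]`):

```python
def context_budget(length_limit, question_length, max_query_length,
                   num_special_tokens=3):
    """Illustrative only: wordpieces available for the context window."""
    # The question is truncated to max_query_length first
    question_length = min(question_length, max_query_length)
    return length_limit - question_length - num_special_tokens

budget = context_budget(length_limit=384, question_length=20, max_query_length=64)
```

With the defaults, a 20-wordpiece question leaves 361 wordpieces for the context; any context longer than that triggers the sliding-window behaviour described earlier.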


class TransformerSquadReader(DatasetReader):
 | ...
 | def make_instances(
 |     self,
 |     qid: str,
 |     question: str,
 |     answers: List[str],
 |     context: str,
 |     first_answer_offset: Optional[int],
 |     always_add_answer_span: bool = False,
 |     is_training: bool = False,
 |     cached_tokenized_context: Optional[List[Token]] = None
 | ) -> Iterable[Instance]

Create training instances from a SQuAD example.
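A simplified, hedged sketch of what this involves, per the prose above (names and details are illustrative, not the library's implementation): slide a window over the context, re-index the gold answer span relative to each window, and, when `skip_impossible_questions` is set, drop windows that do not fully contain the answer.

```python
def make_window_instances(context_tokens, answer_span, window_size, stride,
                          skip_impossible_questions=False):
    """Illustrative only. `answer_span` is an inclusive (start, end) token
    span into `context_tokens`, or None for an impossible question."""
    instances = []
    step = window_size - stride
    start = 0
    while True:
        end = start + window_size
        if answer_span is not None and start <= answer_span[0] and answer_span[1] < end:
            # Re-index the answer relative to this window
            local_span = (answer_span[0] - start, answer_span[1] - start)
        else:
            local_span = None  # answer not (fully) inside this window
        if local_span is not None or not skip_impossible_questions:
            instances.append({"window": context_tokens[start:end],
                              "answer_span": local_span})
        if end >= len(context_tokens):
            break
        start += step
    return instances

instances = make_window_instances(list(range(10)), answer_span=(5, 6),
                                  window_size=4, stride=2,
                                  skip_impossible_questions=True)
```

Here only the window covering tokens 4–7 contains the full answer, so a single instance is emitted, with the answer span re-indexed to positions 1–2 inside that window.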


class TransformerSquadReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     question: str,
 |     tokenized_question: List[Token],
 |     context: str,
 |     tokenized_context: List[Token],
 |     answers: List[str] = None,
 |     token_answer_span: Optional[Tuple[int, int]] = None,
 |     additional_metadata: Dict[str, Any] = None,
 |     always_add_answer_span: bool = False
 | ) -> Instance


class TransformerSquadReader(DatasetReader):
 | ...
 | def apply_token_indexers(self, instance: Instance) -> None