record_reader

allennlp_models.rc.dataset_readers.record_reader


Dataset reader for SuperGLUE's Reading Comprehension with Commonsense Reasoning task (Zhang et al., 2018).

Reader implemented by Gabriel Orlanski.

RecordTaskReader#

@DatasetReader.register("superglue_record")
class RecordTaskReader(DatasetReader):
 | def __init__(
 |     self,
 |     transformer_model_name: str = "bert-base-cased",
 |     length_limit: int = 384,
 |     question_length_limit: int = 64,
 |     stride: int = 128,
 |     raise_errors: bool = False,
 |     tokenizer_kwargs: Dict[str, Any] = None,
 |     one_instance_per_query: bool = False,
 |     max_instances: int = None,
 |     **kwargs
 | ) -> None

Reader for the Reading Comprehension with Commonsense Reasoning (ReCoRD) task from SuperGLUE. The task is detailed in the paper ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension (arxiv.org/pdf/1810.12885.pdf) by Zhang et al. Leaderboards and the official evaluation script for the ReCoRD task can be found at sheng-z.github.io/ReCoRD-explorer/.

The reader reads a JSON file in the format described at sheng-z.github.io/ReCoRD-explorer/dataset-readme.txt.

Parameters

  • tokenizer : Tokenizer, optional
    The tokenizer class to use. Defaults to SpacyTokenizer.

  • token_indexers : Dict[str, TokenIndexer], optional
    We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

  • length_limit : int, optional (default = 384)
    If specified, we will cut the passage if the length of the passage exceeds this limit.

  • question_length_limit : int, optional (default = 64)
    If specified, we will cut the question if the length of the question exceeds this limit.

  • raise_errors : bool, optional (default = False)
    Whether the reader should raise errors or simply continue past them.

  • kwargs : Dict
    Keyword arguments to be passed to the DatasetReader parent class constructor.
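
A minimal usage sketch; the keyword values simply restate the defaults from the signature above, and the data path is a placeholder:

    from allennlp_models.rc.dataset_readers.record_reader import RecordTaskReader

    # Construct the reader with the default transformer model and limits.
    reader = RecordTaskReader(
        transformer_model_name="bert-base-cased",
        length_limit=384,
        question_length_limit=64,
        stride=128,
    )

    # "record_train.json" is a placeholder path for a file in the format described at
    # sheng-z.github.io/ReCoRD-explorer/dataset-readme.txt.
    for instance in reader.read("record_train.json"):
        print(instance)
        break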

get_instances_from_example#

class RecordTaskReader(DatasetReader):
 | ...
 | def get_instances_from_example(
 |     self,
 |     example: Dict,
 |     always_add_answer_span: bool = False
 | ) -> Iterable[Instance]

Helper function to get instances from an example.

Much of this comes from transformer_squad.make_instances.

Parameters

  • example : Dict[str, Any]
    A single example dict from the ReCoRD data file.

Returns

  • Iterable[Instance] The instances generated from the example.
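
A hedged sketch of calling it directly; the field names follow the dataset readme linked above, the values are purely illustrative, and `reader` is the instance from the earlier sketch:

    # Field names follow the ReCoRD dataset readme; values are illustrative only.
    example = {
        "id": "example-0",
        "passage": {
            "text": "Tesla was born in Smiljan.",
            "entities": [{"start": 0, "end": 4}, {"start": 18, "end": 24}],
        },
        "qas": [
            {
                "id": "example-0-q0",
                "query": "@placeholder was born in Smiljan.",
                "answers": [{"start": 0, "end": 4, "text": "Tesla"}],
            }
        ],
    }

    # `reader` is the RecordTaskReader constructed in the earlier sketch.
    for instance in reader.get_instances_from_example(example):
        print(instance)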

tokenize_slice#

class RecordTaskReader(DatasetReader):
 | ...
 | def tokenize_slice(
 |     self,
 |     text: str,
 |     start: int = None,
 |     end: int = None
 | ) -> Iterable[Token]

Get and tokenize a span from a source text.

Originally from transformer_squad.py.

Parameters

  • text : str
    The text to draw from.
  • start : int
    The start index for the span.
  • end : int
    The end index for the span, assumed to be inclusive.

Returns

  • Iterable[Token] List of tokens for the retrieved span.
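
An illustrative call, assuming the `reader` from the first sketch; only a character span of the text is tokenized rather than the full string:

    text = "The quick brown fox jumps over the lazy dog."

    # Tokenize only a character span of the text (here, around the word "quick").
    tokens = list(reader.tokenize_slice(text, start=4, end=9))
    print([t.text for t in tokens])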

tokenize_str#

class RecordTaskReader(DatasetReader):
 | ...
 | def tokenize_str(self, text: str) -> List[Token]

Helper method to tokenize a string.

Adapted from transformer_squad.make_instances.

Parameters

  • text : str
    The string to tokenize.

Returns

  • List[Token] The resulting tokens.
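
For comparison with tokenize_slice, a whole-string call on the same `reader`:

    # Tokenize a complete string with the reader's tokenizer.
    tokens = reader.tokenize_str("Tesla was born in Smiljan.")
    print([t.text for t in tokens])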

get_spans_from_text#

class RecordTaskReader(DatasetReader):
 | ...
 | @staticmethod
 | def get_spans_from_text(
 |     text: str,
 |     spans: List[Tuple[int, int]]
 | ) -> List[str]

Helper function to extract spans from a string.

Parameters

  • text : str
    The source string.
  • spans : List[Tuple[int, int]]
    List of start and end indices for spans. The end index is assumed to be inclusive, so for start index `i` and end index `j` the retrieved span is `text[i:j+1]`.

Returns

  • List[str] The extracted strings from the text.
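
A small sketch of the inclusive-end convention; because this is a staticmethod, no reader instance is needed:

    text = "Tesla was born in Smiljan."

    # End indices are inclusive: (0, 4) maps to text[0:5] == "Tesla" and
    # (18, 24) maps to text[18:25] == "Smiljan".
    spans = RecordTaskReader.get_spans_from_text(text, [(0, 4), (18, 24)])
    print(spans)  # ["Tesla", "Smiljan"]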

text_to_instance#

class RecordTaskReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     query: str,
 |     tokenized_query: List[Token],
 |     passage: str,
 |     tokenized_passage: List[Token],
 |     answers: List[str],
 |     token_answer_span: Optional[Tuple[int, int]] = None,
 |     additional_metadata: Optional[Dict[str, Any]] = None,
 |     always_add_answer_span: Optional[bool] = False
 | ) -> Instance

Much of this comes directly from transformer_squad.text_to_instance.
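
A hedged sketch of calling it directly with the `reader` from the first sketch; in normal use get_instances_from_example builds these arguments, and the token-level answer span is left out here:

    query = "@placeholder was born in Smiljan."
    passage = "Tesla was born in Smiljan."

    # Build an instance from a single query/passage pair; the answer texts are
    # attached, but token_answer_span is left as None in this sketch.
    instance = reader.text_to_instance(
        query=query,
        tokenized_query=reader.tokenize_str(query),
        passage=passage,
        tokenized_passage=reader.tokenize_str(passage),
        answers=["Tesla"],
    )
    print(instance.fields.keys())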