Dataset reader for SuperGLUE's Reading Comprehension with Commonsense Reasoning task (Zhang et al. 2018).

Reader Implemented by Gabriel Orlanski


class RecordTaskReader(DatasetReader):
 | def __init__(
 |     self,
 |     transformer_model_name: str = "bert-base-cased",
 |     length_limit: int = 384,
 |     question_length_limit: int = 64,
 |     stride: int = 128,
 |     raise_errors: bool = False,
 |     tokenizer_kwargs: Dict[str, Any] = None,
 |     one_instance_per_query: bool = False,
 |     max_instances: int = None,
 |     **kwargs
 | ) -> None

Reader for the Reading Comprehension with Commonsense Reasoning (ReCoRD) task from SuperGLUE. The task is detailed in the paper ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension by Zhang et al. Leaderboards and the official evaluation script for the ReCoRD task can be found on the SuperGLUE website.

The reader reads a JSON file in the format used by the official ReCoRD dataset.
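As a rough illustration, a minimal ReCoRD-style example might look like the dict below. The field names follow the published ReCoRD data format, and the entity/answer offsets are character indices into the passage with inclusive end indices; treat the exact schema as an assumption rather than a guarantee of what this reader validates.

```python
# A minimal, hypothetical ReCoRD-style example. Entity and answer offsets
# are character indices into the passage text, with INCLUSIVE end indices.
example = {
    "id": "example-0",
    "passage": {
        "text": "Tokyo is the capital of Japan.",
        "entities": [
            {"start": 0, "end": 4},    # "Tokyo"
            {"start": 24, "end": 28},  # "Japan"
        ],
    },
    "qas": [
        {
            "id": "example-0-q0",
            "query": "@placeholder is an island nation in East Asia.",
            "answers": [{"start": 24, "end": 28, "text": "Japan"}],
        }
    ],
}

passage = example["passage"]["text"]
# An inclusive end index means the span is text[start : end + 1].
entities = [passage[e["start"]: e["end"] + 1] for e in example["passage"]["entities"]]
print(entities)  # ['Tokyo', 'Japan']
```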


  • tokenizer : Tokenizer, optional
    The tokenizer class to use. Defaults to SpacyTokenizer.

  • token_indexers : Dict[str, TokenIndexer], optional
    We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

  • passage_length_limit : int, optional (default = None)
    If specified, we will truncate the passage when its length exceeds this limit.

  • question_length_limit : int, optional (default = None)
    If specified, we will truncate the question when its length exceeds this limit.

  • raise_errors : bool, optional (default = False)
    Whether the reader should raise errors or skip the offending example and continue.

  • kwargs : Dict
    Keyword arguments to be passed to the DatasetReader parent class constructor.
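The length_limit and stride parameters in the constructor suggest the familiar sliding-window scheme for passages longer than the model's input size. A minimal sketch of that scheme, assuming one common convention (each window advances by `length_limit - stride`, so consecutive windows share `stride` tokens); the reader's exact windowing may differ:

```python
from typing import List

def sliding_windows(tokens: List[str], length_limit: int, stride: int) -> List[List[str]]:
    # Assumed convention: windows hold at most `length_limit` tokens and
    # advance by `length_limit - stride`, so consecutive windows overlap
    # by `stride` tokens of shared context.
    step = length_limit - stride
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + length_limit])
        if start + length_limit >= len(tokens):
            break
    return windows

tokens = [f"t{i}" for i in range(10)]
for window in sliding_windows(tokens, length_limit=4, stride=2):
    print(window)
```

With length_limit=4 and stride=2 the ten tokens above yield four windows, each sharing two tokens with its neighbor, so no answer span near a window boundary is lost.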


class RecordTaskReader(DatasetReader):
 | ...
 | def get_instances_from_example(
 |     self,
 |     example: Dict,
 |     always_add_answer_span: bool = False
 | ) -> Iterable[Instance]

Helper function to get instances from an example.

Much of this comes from transformer_squad.make_instances


  • example : Dict[str,Any]
    The example dict.


  • Iterable[Instance] The instances for each example.
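In ReCoRD, each example carries one passage and several cloze-style queries, so instance construction fans out one instance per query/passage pair. A simplified stand-in for that fan-out (plain dicts instead of real Instance objects; the field names here are illustrative, not the reader's actual fields):

```python
from typing import Any, Dict, Iterable

def instances_from_example(example: Dict[str, Any]) -> Iterable[Dict[str, Any]]:
    # Hypothetical stand-in for the real method: yield one raw "instance"
    # per query, pairing each query with the shared passage and its gold
    # answer strings.
    passage = example["passage"]["text"]
    for qa in example["qas"]:
        yield {
            "query": qa["query"],
            "passage": passage,
            "answers": [a["text"] for a in qa.get("answers", [])],
        }

example = {
    "passage": {"text": "Tokyo is the capital of Japan."},
    "qas": [
        {"query": "@placeholder is an island nation.", "answers": [{"text": "Japan"}]},
        {"query": "The capital of @placeholder is Tokyo.", "answers": [{"text": "Japan"}]},
    ],
}
instances = list(instances_from_example(example))
print(len(instances))  # 2
```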


class RecordTaskReader(DatasetReader):
 | ...
 | def tokenize_slice(
 |     self,
 |     text: str,
 |     start: int = None,
 |     end: int = None
 | ) -> Iterable[Token]

Get and tokenize a span from a source text.

Originally from the transformer_squad reader.


  • text : str
    The text to draw from.
  • start : int
    The start index for the span.
  • end : int
    The end index for the span. Assumed that this is inclusive.


  • Iterable[Token] List of tokens for the retrieved span.
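The key detail above is the inclusive end index. A minimal sketch of slice-then-tokenize, using whitespace splitting as a stand-in for the reader's transformer tokenizer:

```python
from typing import List, Optional

def tokenize_slice(text: str, start: Optional[int] = None, end: Optional[int] = None) -> List[str]:
    # Whitespace tokenization stands in for the real tokenizer. The end
    # index is treated as inclusive, matching the docstring, so the
    # retrieved slice is text[start : end + 1].
    start = 0 if start is None else start
    end = len(text) - 1 if end is None else end
    return text[start:end + 1].split()

print(tokenize_slice("the quick brown fox", 4, 14))  # ['quick', 'brown']
```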


class RecordTaskReader(DatasetReader):
 | ...
 | def tokenize_str(self, text: str) -> List[Token]

Helper method to tokenize a string.

Adapted from the transformer_squad.make_instances


text: `str`
    The string to tokenize.


  • List[Token] The resulting tokens.


class RecordTaskReader(DatasetReader):
 | ...
 | @staticmethod
 | def get_spans_from_text(
 |     text: str,
 |     spans: List[Tuple[int, int]]
 | ) -> List[str]

Helper function to get spans from a string.


  • text : str
    The source string.

  • spans : List[Tuple[int, int]]
    List of start and end indices for spans. Assumes that the end index is inclusive. Therefore, for start index `i` and end index `j`, retrieves the span at `text[i:j+1]`.


  • List[str] The extracted strings from the text.
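The documented `text[i:j+1]` behaviour transcribes almost directly into code. A self-contained sketch:

```python
from typing import List, Tuple

def get_spans_from_text(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    # End indices are inclusive, so span (i, j) maps to text[i : j + 1].
    return [text[start:end + 1] for start, end in spans]

print(get_spans_from_text("Tokyo is the capital of Japan.", [(0, 4), (24, 28)]))
# ['Tokyo', 'Japan']
```

Note the difference from Python's usual exclusive-end slicing: passing the span indices straight to `text[i:j]` would silently drop the last character of every span.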


class RecordTaskReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     query: str,
 |     tokenized_query: List[Token],
 |     passage: str,
 |     tokenized_passage: List[Token],
 |     answers: List[str],
 |     token_answer_span: Optional[Tuple[int, int]] = None,
 |     additional_metadata: Optional[Dict[str, Any]] = None,
 |     always_add_answer_span: Optional[bool] = False
 | ) -> Instance

A lot of this comes directly from transformer_squad.text_to_instance.
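When the query and passage are packed into a single transformer input, the passage-relative token_answer_span has to be shifted by however many tokens precede the passage. A sketch of that offset arithmetic, assuming a `[CLS] query [SEP] passage [SEP]` layout (the actual layout depends on the tokenizer in use):

```python
from typing import List, Tuple

def shift_answer_span(
    tokenized_query: List[str],
    passage_span: Tuple[int, int],
) -> Tuple[int, int]:
    # Assumed layout: [CLS] + query + [SEP] + passage + [SEP].
    # Tokens before the passage: one [CLS], the query, one [SEP].
    offset = 1 + len(tokenized_query) + 1
    start, end = passage_span
    return start + offset, end + offset

query = ["@placeholder", "is", "an", "island", "nation"]
# Answer occupies passage tokens 5..5 (inclusive) before packing.
print(shift_answer_span(query, (5, 5)))  # (12, 12)
```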