drop

allennlp_models.rc.dataset_readers.drop

WORD_NUMBER_MAP#

WORD_NUMBER_MAP = {
    "zero": 0,
    "one": 1,
    "two": 2,
    "three": 3,
    "four": 4,
    "five": 5,
    "six" ...
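
The listing above is truncated in this page. As a minimal sketch (not the library's code), this is how such a word-to-number map is typically used for lookups; the abridged map below is illustrative:

```python
# Illustrative, abridged word-to-number map (the real WORD_NUMBER_MAP
# in the source continues beyond what is shown here).
WORD_NUMBER_MAP = {
    "zero": 0, "one": 1, "two": 2, "three": 3,
    "four": 4, "five": 5, "six": 6, "seven": 7,
}

def word_to_number(word: str):
    """Return the integer for a number word, or None if unknown."""
    return WORD_NUMBER_MAP.get(word.lower())

print(word_to_number("Three"))  # 3
print(word_to_number("cat"))    # None
```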

DropReader#

@DatasetReader.register("drop")
class DropReader(DatasetReader):
 | def __init__(
 |     self,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     passage_length_limit: int = None,
 |     question_length_limit: int = None,
 |     skip_when_all_empty: List[str] = None,
 |     instance_format: str = "drop",
 |     relaxed_span_match_for_finding_labels: bool = True,
 |     **kwargs
 | ) -> None

Reads a JSON-formatted DROP dataset file and returns instances in a few different possible formats. The input format is complicated; see the test fixture for an example of what it looks like. The output formats all contain a question TextField, a passage TextField, and some kind of answer representation. Because DROP has instances with several different kinds of answers, this dataset reader allows you to filter out questions that do not have answers of a particular type (e.g., remove questions that have numbers as answers, if your model can only give passage spans as answers). We typically return all possible ways of arriving at a given answer string, and expect models to marginalize over these possibilities.

Parameters

  • tokenizer : Tokenizer, optional (default = SpacyTokenizer())
    We use this Tokenizer for both the question and the passage. See Tokenizer.

  • token_indexers : Dict[str, TokenIndexer], optional
    We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.

  • passage_length_limit : int, optional (default = None)
    If specified, we will truncate the passage if its length exceeds this limit.

  • question_length_limit : int, optional (default = None)
    If specified, we will truncate the question if its length exceeds this limit.

  • skip_when_all_empty : List[str], optional (default = None)
    In some cases, such as when preparing training examples, you may want to skip examples that have no gold labels. You can specify the conditions under which examples should be skipped. Currently, you can put "passage_span", "question_span", "addition_subtraction", or "counting" in this list to tell the reader to skip an example when none of the corresponding labels are found. If not specified, we will keep all examples.

  • instance_format : str, optional (default = "drop")
    We try to be generous in providing a few different formats for the instances in DROP, in terms of the Fields that we return for each Instance, to allow for several different kinds of models. "drop" format will do processing to detect numbers and various ways those numbers can be arrived at from the passage, and return Fields related to that. "bert" format only allows passage spans as answers, and provides a "question_and_passage" field with the two pieces of text joined as BERT expects. "squad" format provides the same fields that our BiDAF and other SQuAD models expect.

  • relaxed_span_match_for_finding_labels : bool, optional (default = True)
    The DROP dataset contains multi-span answers, and exact span matches are also usually hard to find for date-type answers. In order to use as many examples as possible to train the model, we may not want to require a strict match in such cases when finding the gold span labels. If this argument is true, we will treat every span in a multi-span answer as correct, and every token in a date answer as correct as well. Because models trained on DROP typically marginalize over all possible answer positions, this just makes us a little more generous about what is marginalized. Note that this will not affect evaluation.
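
As a sketch of how these parameters might appear in an AllenNLP training config (the "drop" type name comes from the registration above; the limit values and skip list here are illustrative, not recommended settings):

```jsonnet
{
  "dataset_reader": {
    "type": "drop",
    "passage_length_limit": 400,
    "question_length_limit": 50,
    "skip_when_all_empty": ["passage_span", "question_span", "addition_subtraction", "counting"],
    "instance_format": "drop",
    "relaxed_span_match_for_finding_labels": true
  }
}
```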

text_to_instance#

class DropReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     question_text: str,
 |     passage_text: str,
 |     question_id: str = None,
 |     passage_id: str = None,
 |     answer_annotations: List[Dict] = None,
 |     passage_tokens: List[Token] = None
 | ) -> Union[Instance, None]

make_marginal_drop_instance#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def make_marginal_drop_instance(
 |     question_tokens: List[Token],
 |     passage_tokens: List[Token],
 |     number_tokens: List[Token],
 |     number_indices: List[int],
 |     token_indexers: Dict[str, TokenIndexer],
 |     passage_text: str,
 |     answer_info: Dict[str, Any] = None,
 |     additional_metadata: Dict[str, Any] = None
 | ) -> Instance

make_bert_drop_instance#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def make_bert_drop_instance(
 |     question_tokens: List[Token],
 |     passage_tokens: List[Token],
 |     question_concat_passage_tokens: List[Token],
 |     token_indexers: Dict[str, TokenIndexer],
 |     passage_text: str,
 |     answer_info: Dict[str, Any] = None,
 |     additional_metadata: Dict[str, Any] = None
 | ) -> Instance

extract_answer_info_from_annotation#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def extract_answer_info_from_annotation(
 |     answer_annotation: Dict[str, Any]
 | ) -> Tuple[str, List[str]]

convert_word_to_number#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def convert_word_to_number(
 |     word: str,
 |     try_to_include_more_numbers=False
 | )

Currently, we only support a limited set of conversions.
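
A simplified sketch of the kind of conversion this method performs (not the library's implementation): check the word-to-number map, strip thousands separators, then try integer parsing, with float parsing behind the extra flag.

```python
# Simplified sketch of a word/string-to-number conversion.
# The map here is abridged and the parsing rules are an assumption,
# not the library's exact logic.
WORD_NUMBER_MAP = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def convert_word_to_number(word: str, try_to_include_more_numbers: bool = False):
    """Return a number parsed from `word`, or None if no conversion applies."""
    if word in WORD_NUMBER_MAP:
        return WORD_NUMBER_MAP[word]
    cleaned = word.replace(",", "")  # "1,000" -> "1000"
    try:
        return int(cleaned)
    except ValueError:
        pass
    if try_to_include_more_numbers:
        try:
            return float(cleaned)
        except ValueError:
            pass
    return None
```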

find_valid_spans#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def find_valid_spans(
 |     passage_tokens: List[Token],
 |     answer_texts: List[str]
 | ) -> List[Tuple[int, int]]
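
Per the signature, this returns inclusive (start, end) token spans. A self-contained sketch of exhaustive span matching, using plain strings in place of the library's Token objects (the exact normalization the real method applies is not shown here):

```python
from typing import List, Tuple

def find_valid_spans(passage_tokens: List[str],
                     answer_texts: List[str]) -> List[Tuple[int, int]]:
    """Sketch: return every inclusive (start, end) token span whose
    lowercased tokens exactly match a whitespace-tokenized answer text."""
    lowered = [token.lower() for token in passage_tokens]
    spans: List[Tuple[int, int]] = []
    for answer in answer_texts:
        answer_tokens = answer.lower().split()
        n = len(answer_tokens)
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == answer_tokens:
                spans.append((start, start + n - 1))
    return spans
```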

find_valid_add_sub_expressions#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def find_valid_add_sub_expressions(
 |     numbers: List[int],
 |     targets: List[int],
 |     max_number_of_numbers_to_consider: int = 2
 | ) -> List[List[int]]
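
A sketch of the search this performs: enumerate subsets of up to `max_number_of_numbers_to_consider` numbers and every plus/minus sign assignment, keeping combinations whose signed sum hits a target. The returned sign encoding below (0 = unused, 1 = plus, 2 = minus) is an assumption about the label convention, not taken from the source:

```python
from itertools import combinations, product
from typing import List

def find_valid_add_sub_expressions(
    numbers: List[int],
    targets: List[int],
    max_number_of_numbers_to_consider: int = 2,
) -> List[List[int]]:
    """Sketch: exhaustively search +/- combinations of the numbers.
    Each result is a sign list aligned with `numbers`
    (0 = unused, 1 = plus, 2 = minus)."""
    valid: List[List[int]] = []
    for n in range(2, max_number_of_numbers_to_consider + 1):
        for indices in combinations(range(len(numbers)), n):
            for signs in product((1, -1), repeat=n):
                total = sum(sign * numbers[i] for sign, i in zip(signs, indices))
                if total in targets:
                    labels = [0] * len(numbers)
                    for sign, i in zip(signs, indices):
                        labels[i] = 1 if sign == 1 else 2
                    valid.append(labels)
    return valid
```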

find_valid_counts#

class DropReader(DatasetReader):
 | ...
 | @staticmethod
 | def find_valid_counts(
 |     count_numbers: List[int],
 |     targets: List[int]
 | ) -> List[int]
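
A sketch of the idea: given a list of candidate count values (typically the small integers a counting head can predict), return the indices of those that appear among the target answers:

```python
from typing import List

def find_valid_counts(count_numbers: List[int],
                      targets: List[int]) -> List[int]:
    """Sketch: indices of candidate count values that match a target."""
    return [i for i, count in enumerate(count_numbers) if count in targets]

# e.g. with candidates 0..9 and a gold answer of 3, index 3 is valid.
```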