drop
allennlp_models.rc.dataset_readers.drop
WORD_NUMBER_MAP
WORD_NUMBER_MAP = {
    "zero": 0,
    "one": 1,
    "two": 2,
    "three": 3,
    "four": 4,
    "five": 5,
    "six" ...
DropReader
@DatasetReader.register("drop")
class DropReader(DatasetReader):
| def __init__(
| self,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| passage_length_limit: int = None,
| question_length_limit: int = None,
| skip_when_all_empty: List[str] = None,
| instance_format: str = "drop",
| relaxed_span_match_for_finding_labels: bool = True,
| **kwargs
| ) -> None
Reads a JSON-formatted DROP dataset file and returns instances in a few different possible formats. The input format is complicated; see the test fixture for an example of what it looks like. The output formats all contain a question TextField, a passage TextField, and some kind of answer representation. Because DROP has instances with several different kinds of answers, this dataset reader allows you to filter out questions that do not have answers of a particular type (e.g., remove questions that have numbers as answers, if your model can only give passage spans as answers). We typically return all possible ways of arriving at a given answer string, and expect models to marginalize over these possibilities.
Parameters

- tokenizer : Tokenizer, optional (default = SpacyTokenizer())
  We use this Tokenizer for both the question and the passage.

- token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
  We similarly use this for both the question and the passage.

- passage_length_limit : int, optional (default = None)
  If specified, we will truncate the passage when its length exceeds this limit.

- question_length_limit : int, optional (default = None)
  If specified, we will truncate the question when its length exceeds this limit.

- skip_when_all_empty : List[str], optional (default = None)
  In some cases, such as preparing training examples, you may want to skip examples that have no gold labels. This argument specifies the conditions under which an example is skipped. Currently, you can put "passage_span", "question_span", "addition_subtraction", or "counting" in this list, to tell the reader to skip an example when no such label is found. If not specified, we keep all examples.

- instance_format : str, optional (default = "drop")
  We try to be generous in providing a few different formats for the instances in DROP, in terms of the Fields that we return for each Instance, to allow for several different kinds of models. The "drop" format detects numbers and the various ways those numbers can be arrived at from the passage, and returns Fields related to that. The "bert" format only allows passage spans as answers, and provides a "question_and_passage" field with the two pieces of text joined as BERT expects. The "squad" format provides the same fields that our BiDAF and other SQuAD models expect.

- relaxed_span_match_for_finding_labels : bool, optional (default = True)
  The DROP dataset contains multi-span answers, and exact span matches are also usually hard to find for date-type answers. To use as many examples as possible to train the model, we may not want a strict match for such cases when finding the gold span labels. If this argument is true, we treat every span in a multi-span answer as correct, and every token in a date answer as correct as well. Because models trained on DROP typically marginalize over all possible answer positions, this just makes what is being marginalized a little more generous. Note that this does not affect evaluation.
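To make the skip_when_all_empty semantics concrete, here is a hedged sketch of how such filtering behaves: an example is skipped only when every requested label type is empty. The answer_info structure and key names below are illustrative assumptions, not the reader's exact internals.

```python
from typing import Dict, List

def should_skip(answer_info: Dict[str, list], skip_when_all_empty: List[str]) -> bool:
    """Skip an example only if every requested label type has no gold labels."""
    if not skip_when_all_empty:
        return False  # no condition given: keep every example
    return all(not answer_info.get(key) for key in skip_when_all_empty)
```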
text_to_instance
class DropReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| question_text: str,
| passage_text: str,
| question_id: str = None,
| passage_id: str = None,
| answer_annotations: List[Dict] = None,
| passage_tokens: List[Token] = None
| ) -> Union[Instance, None]
make_marginal_drop_instance
class DropReader(DatasetReader):
| ...
| @staticmethod
| def make_marginal_drop_instance(
| question_tokens: List[Token],
| passage_tokens: List[Token],
| number_tokens: List[Token],
| number_indices: List[int],
| token_indexers: Dict[str, TokenIndexer],
| passage_text: str,
| answer_info: Dict[str, Any] = None,
| additional_metadata: Dict[str, Any] = None
| ) -> Instance
make_bert_drop_instance
class DropReader(DatasetReader):
| ...
| @staticmethod
| def make_bert_drop_instance(
| question_tokens: List[Token],
| passage_tokens: List[Token],
| question_concat_passage_tokens: List[Token],
| token_indexers: Dict[str, TokenIndexer],
| passage_text: str,
| answer_info: Dict[str, Any] = None,
| additional_metadata: Dict[str, Any] = None
| ) -> Instance
extract_answer_info_from_annotation
class DropReader(DatasetReader):
| ...
| @staticmethod
| def extract_answer_info_from_annotation(
| answer_annotation: Dict[str, Any]
| ) -> Tuple[str, List[str]]
convert_word_to_number
class DropReader(DatasetReader):
| ...
| @staticmethod
| def convert_word_to_number(
| word: str,
| try_to_include_more_numbers=False
| )
Currently we only support limited types of conversion.
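A minimal sketch of what such limited conversion can look like: try a word map first, then optionally fall back to a numeric parse. This is an illustration under stated assumptions, not the library's exact normalization rules.

```python
# Small word map for illustration only; the real map is larger.
WORD_MAP = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def convert_word_to_number(word: str, try_to_include_more_numbers: bool = False):
    """Return an int for a recognized number word or numeral, else None."""
    if word in WORD_MAP:
        return WORD_MAP[word]
    if try_to_include_more_numbers:
        # Strip thousands separators, then attempt an integer parse.
        stripped = word.replace(",", "")
        try:
            return int(stripped)
        except ValueError:
            return None
    return None
```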
find_valid_spans
class DropReader(DatasetReader):
| ...
| @staticmethod
| def find_valid_spans(
| passage_tokens: List[Token],
| answer_texts: List[str]
| ) -> List[Tuple[int, int]]
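The idea behind this method, sketched in simplified form below: scan the passage tokens for every position where a tokenized answer text matches exactly, and return all matches as inclusive (start, end) token index pairs. Real tokens are Token objects and the library's matching is more careful; plain strings and whitespace splitting are assumptions here.

```python
from typing import List, Tuple

def find_valid_spans(passage_tokens: List[str], answer_texts: List[str]) -> List[Tuple[int, int]]:
    """Return all (start, end) token spans that exactly match an answer text."""
    spans = []
    normalized = [token.lower() for token in passage_tokens]
    for answer in answer_texts:
        answer_tokens = answer.lower().split()
        n = len(answer_tokens)
        for start in range(len(normalized) - n + 1):
            if normalized[start:start + n] == answer_tokens:
                spans.append((start, start + n - 1))  # inclusive end index
    return spans
```

Because all matching spans are returned, a model can marginalize over every position where the answer occurs, as described in the class docstring.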
find_valid_add_sub_expressions
class DropReader(DatasetReader):
| ...
| @staticmethod
| def find_valid_add_sub_expressions(
| numbers: List[int],
| targets: List[int],
| max_number_of_numbers_to_consider: int = 2
| ) -> List[List[int]]
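A hedged sketch of the enumeration this method performs: try sign assignments over small subsets of the passage numbers and keep the ones whose signed sum hits a target. The per-number label encoding below (0 = unused, 1 = plus, 2 = minus) and the shortened parameter name are assumptions for illustration.

```python
from itertools import combinations, product
from typing import List

def find_valid_add_sub_expressions(
    numbers: List[int], targets: List[int], max_numbers: int = 2
) -> List[List[int]]:
    """Return sign-label lists for number subsets whose signed sum is a target."""
    valid = []
    for count in range(2, max_numbers + 1):
        for indices in combinations(range(len(numbers)), count):
            for signs in product((1, -1), repeat=count):
                total = sum(sign * numbers[i] for sign, i in zip(signs, indices))
                if total in targets:
                    # One label per passage number: 1 = plus, 2 = minus, 0 = unused.
                    labels = [0] * len(numbers)
                    for sign, i in zip(signs, indices):
                        labels[i] = 1 if sign == 1 else 2
                    valid.append(labels)
    return valid
```

Keeping max_numbers small (2 by default) keeps the enumeration cheap, since the number of sign assignments grows exponentially in the subset size.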
find_valid_counts
class DropReader(DatasetReader):
| ...
| @staticmethod
| def find_valid_counts(
| count_numbers: List[int],
| targets: List[int]
| ) -> List[int]
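This last helper is the simplest of the three label finders; a plausible sketch, assuming the returned ints are the indices of candidate count values that appear among the targets:

```python
from typing import List

def find_valid_counts(count_numbers: List[int], targets: List[int]) -> List[int]:
    """Return indices of candidate counts that match a target number."""
    return [i for i, number in enumerate(count_numbers) if number in targets]
```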