triviaqa
TriviaQaReader#
class TriviaQaReader(DatasetReader):
| def __init__(
| self,
| base_tarball_path: str,
| unfiltered_tarball_path: str = None,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| lazy: bool = False
| ) -> None
Reads the TriviaQA dataset into a `Dataset` containing `Instance`s with four fields: `question` (a `TextField`), `passage` (another `TextField`), `span_start`, and `span_end` (both `IndexField`s).
TriviaQA is split up into several JSON files defining the questions, and a large number of text files containing crawled web documents. We read these from a single gzipped tarball, to avoid having millions of individual files on a filesystem.
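For concreteness, here is a minimal sketch (not this reader's actual implementation) of streaming a question file straight out of the tarball with Python's `tarfile` module. The tarball filename, the `qa/wikipedia-train.json` member path, and the top-level `"Data"` key are assumptions based on the `qa`/`evidence` layout described under Parameters below and the TriviaQA distribution's JSON format.

```python
import json
import tarfile

# A sketch, not the reader's actual implementation. The tarball name and
# the member path "qa/wikipedia-train.json" are assumptions based on the
# qa/evidence layout described under Parameters below.
with tarfile.open("triviaqa-rc.tar.gz", mode="r:gz") as tarball:
    question_file = tarball.extractfile("qa/wikipedia-train.json")
    questions = json.load(question_file)["Data"]  # "Data" assumed per TriviaQA's JSON format
print(f"Read {len(questions)} questions")
```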
Because we need to read both train and validation files from the same tarball, we take the tarball itself as a constructor parameter, and take the question file as the argument to `read`. This means that you should give the path to the tarball in the `dataset_reader` parameters in your experiment configuration file, and something like `"wikipedia-train.json"` for the `train_data_path` and `validation_data_path`.
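As a hypothetical illustration, the relevant configuration keys might look like the following, written here as the equivalent Python dict (AllenNLP experiment configurations are normally JSON/Jsonnet). The paths are placeholders, and `"triviaqa"` is assumed to be the reader's registered name, matching this module's name.

```python
# A sketch of the relevant experiment-configuration keys, shown as a
# Python dict. The paths are placeholders; "triviaqa" is assumed to be
# the reader's registered name.
config = {
    "dataset_reader": {
        "type": "triviaqa",
        "base_tarball_path": "/data/triviaqa-rc.tar.gz",
    },
    "train_data_path": "wikipedia-train.json",
    "validation_data_path": "wikipedia-dev.json",
}
```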
Parameters

- `base_tarball_path` : `str`
  This is the path to the main `tar.gz` file you can download from the TriviaQA website, with directories `evidence` and `qa`.
- `unfiltered_tarball_path` : `str`, optional
  This is the path to the "unfiltered" TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.
- `tokenizer` : `Tokenizer`, optional
  We'll use this tokenizer on questions and evidence passages, defaulting to `SpacyTokenizer` if none is provided.
- `token_indexers` : `Dict[str, TokenIndexer]`, optional
  Determines how both the question and the evidence passages are represented as arrays. See `TokenIndexer`. Default is to have a single word ID for every token.
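Putting these together, constructing the reader with the defaults spelled out explicitly might look like the sketch below. The import path for `TriviaQaReader` differs across AllenNLP versions, so treat it as an assumption.

```python
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import SpacyTokenizer
# Import path is an assumption; it varies across AllenNLP/allennlp-models versions.
from allennlp_models.rc.dataset_readers import TriviaQaReader

# Explicitly passing the defaults described above; omitting tokenizer and
# token_indexers (or passing None) gives the same behavior.
reader = TriviaQaReader(
    base_tarball_path="/data/triviaqa-rc.tar.gz",
    tokenizer=SpacyTokenizer(),
    token_indexers={"tokens": SingleIdTokenIndexer()},
)
instances = reader.read("wikipedia-train.json")
```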
pick_paragraphs#
class TriviaQaReader(DatasetReader):
| ...
| def pick_paragraphs(
| self,
| evidence_files: List[List[str]],
| question: str = None,
| answer_texts: List[str] = None
| ) -> List[str]
Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.
To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that's likely cheating, depending on how you've defined the task.
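Purely as an illustration of this method's contract (a hypothetical heuristic, not necessarily what this reader does), one might prefer, from each document, a paragraph that contains an answer string:

```python
from typing import List

# A hypothetical heuristic illustrating pick_paragraphs' contract, not the
# reader's actual logic: from each evidence document, keep the first
# paragraph containing an answer string, else the document's first paragraph.
def pick_paragraphs_sketch(
    evidence_files: List[List[str]],
    question: str = None,
    answer_texts: List[str] = None,
) -> List[str]:
    picked: List[str] = []
    for paragraphs in evidence_files:
        if not paragraphs:
            continue
        choice = paragraphs[0]
        if answer_texts:  # on dev/test sets this is likely cheating; see above
            for paragraph in paragraphs:
                if any(answer in paragraph for answer in answer_texts):
                    choice = paragraph
                    break
        picked.append(choice)
    return picked
```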
text_to_instance#
class TriviaQaReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| question_text: str,
| passage_text: str,
| token_spans: List[Tuple[int, int]] = None,
| answer_texts: List[str] = None,
| question_tokens: List[Token] = None,
| passage_tokens: List[Token] = None
| ) -> Instance
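A sketch of calling `text_to_instance` directly, reusing the `reader` constructed earlier. The question, passage, and token span are made up: `(0, 0)` is assumed to be the inclusive token span of "Paris", the first token of the passage.

```python
# Hypothetical inputs; (0, 0) is assumed to be the inclusive token span
# of "Paris", the first token of the passage after tokenization.
instance = reader.text_to_instance(
    question_text="What is the capital of France?",
    passage_text="Paris is the capital and most populous city of France.",
    token_spans=[(0, 0)],
    answer_texts=["Paris"],
)
print(sorted(instance.fields))  # expect question, passage, span_start, span_end, ...
```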