```python
@DatasetReader.register("triviaqa")
class TriviaQaReader(DatasetReader):
    def __init__(
        self,
        base_tarball_path: str,
        unfiltered_tarball_path: str = None,
        tokenizer: Tokenizer = None,
        token_indexers: Dict[str, TokenIndexer] = None,
        **kwargs
    ) -> None
```
Reads the TriviaQA dataset into `Instances` with four fields: `question`, `passage`, `span_start`, and `span_end`.
TriviaQA is split up into several JSON files defining the questions, and a large collection of text files containing crawled web documents. We read these from a gzipped tarball, to avoid having millions of individual files on the filesystem.
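As a minimal illustration of reading members straight out of a gzipped tarball (this is not the reader's actual implementation, just the standard-library mechanism it relies on), Python's `tarfile` module can stream each member without ever unpacking to disk:

```python
import io
import tarfile

# Build a tiny .tar.gz in memory with two "evidence" files;
# this stands in for the TriviaQA tarball in this sketch.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, text in [("evidence/web/doc1.txt", "First document."),
                       ("evidence/web/doc2.txt", "Second document.")]:
        data = text.encode("utf-8")
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

buf.seek(0)

# Read the members back without touching the filesystem.
contents = {}
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    for member in tar.getmembers():
        contents[member.name] = tar.extractfile(member).read().decode("utf-8")

print(contents["evidence/web/doc1.txt"])
```

Opening the archive once and iterating members this way is why a single tarball is much cheaper than millions of small files: there is one file handle and one sequential decompression stream.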
Because we need to read both train and validation files from the same tarball, we take the
tarball itself as a constructor parameter, and take the question file as the argument to
`read`. This means that you should give the path to the tarball in the reader
parameters in your experiment configuration file, and something like
`"wikipedia-train.json"` for your `train_data_path`.
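For example, the relevant pieces of an experiment configuration might look like this (the file names and paths below are illustrative, not prescribed):

```json
{
  "dataset_reader": {
    "type": "triviaqa",
    "base_tarball_path": "/data/triviaqa-rc.tar.gz"
  },
  "train_data_path": "wikipedia-train.json",
  "validation_data_path": "wikipedia-dev.json"
}
```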
- base_tarball_path : `str`
This is the path to the main `tar.gz` file you can download from the TriviaQA website, with directories `evidence` and `qa`.
- unfiltered_tarball_path : `str`, optional
This is the path to the "unfiltered" TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.
- tokenizer : `Tokenizer`, optional
We'll use this tokenizer on questions and evidence passages, defaulting to
`SpacyTokenizer` if none is provided.
- token_indexers : `Dict[str, TokenIndexer]`, optional
Determines how both the question and the evidence passages are represented as arrays. See
`TokenIndexer`. Default is to have a single word ID for every token.
```python
class TriviaQaReader(DatasetReader):
    ...
    def pick_paragraphs(
        self,
        evidence_files: List[List[str]],
        question: str = None,
        answer_texts: List[str] = None
    ) -> List[str]
```
Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.
To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that's likely cheating, depending on how you've defined the task.
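The reader's actual selection logic is not reproduced here, but a hypothetical heuristic of the kind this method might implement — prefer paragraphs that contain one of the answer strings, which is only appropriate at training time — can be sketched as:

```python
from typing import List, Optional

def pick_paragraphs_sketch(
    evidence_files: List[List[str]],
    answer_texts: Optional[List[str]] = None,
    max_paragraphs: int = 4,
) -> List[str]:
    """Hypothetical heuristic (not the library's implementation):
    flatten all paragraphs, then put paragraphs containing an
    answer string first, and truncate to max_paragraphs."""
    paragraphs = [p for doc in evidence_files for p in doc if p.strip()]
    if not answer_texts:
        return paragraphs[:max_paragraphs]
    with_answer = [p for p in paragraphs
                   if any(a.lower() in p.lower() for a in answer_texts)]
    without_answer = [p for p in paragraphs if p not in with_answer]
    return (with_answer + without_answer)[:max_paragraphs]

picked = pick_paragraphs_sketch(
    [["The capital of France is Paris.", "France is in Europe."],
     ["Paris hosted the 1900 Olympics."]],
    answer_texts=["Paris"],
    max_paragraphs=2,
)
print(picked)  # the two paragraphs mentioning "Paris"
```

Whatever heuristic is used, keeping answer-bearing paragraphs only on the training split is the safe default, for exactly the "cheating" reason noted above.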
```python
class TriviaQaReader(DatasetReader):
    ...
    def text_to_instance(
        self,
        question_text: str,
        passage_text: str,
        token_spans: List[Tuple[int, int]] = None,
        answer_texts: List[str] = None,
        question_tokens: List[Token] = None,
        passage_tokens: List[Token] = None
    ) -> Instance
```
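A caller of `text_to_instance` needs `token_spans` marking where the answer occurs in the passage tokens. As a self-contained sketch (this helper is hypothetical, and it assumes the passage has already been tokenized into a list of strings), the spans can be found by sliding the answer tokens over the passage tokens:

```python
from typing import List, Tuple

def find_answer_token_spans(
    passage_tokens: List[str], answer_tokens: List[str]
) -> List[Tuple[int, int]]:
    """Return (start, end) inclusive token indices of every
    case-insensitive occurrence of answer_tokens in passage_tokens."""
    passage = [t.lower() for t in passage_tokens]
    answer = [t.lower() for t in answer_tokens]
    n = len(answer)
    spans = []
    for i in range(len(passage) - n + 1):
        if passage[i:i + n] == answer:
            spans.append((i, i + n - 1))
    return spans

tokens = "The Eiffel Tower is in Paris , and Paris is lovely".split()
spans = find_answer_token_spans(tokens, ["Paris"])
print(spans)  # [(5, 5), (8, 8)]
```

Both occurrences are returned because distant-supervision setups like TriviaQA typically treat every match as a candidate gold span.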