triviaqa

allennlp_models.rc.dataset_readers.triviaqa

TriviaQaReader

@DatasetReader.register("triviaqa")
class TriviaQaReader(DatasetReader):
 | def __init__(
 |     self,
 |     base_tarball_path: str,
 |     unfiltered_tarball_path: str = None,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     **kwargs
 | ) -> None

Reads the TriviaQA dataset into a Dataset containing Instances with four fields: question (a TextField), passage (another TextField), span_start, and span_end (both IndexFields).

TriviaQA is split into several JSON files that define the questions and a large number of text files containing crawled web documents. We read these from a gzipped tarball, so you don't need millions of individual files on your filesystem.
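As a rough illustration of reading a question file straight out of a gzipped tarball without unpacking it, here is a minimal sketch using only the standard library. It is not the reader's actual implementation: the helper names, the member name "qa/demo.json", and the JSON keys are placeholders invented for the example and do not reflect the real TriviaQA schema.

```python
import io
import json
import tarfile


def build_demo_tarball() -> bytes:
    """Build a tiny in-memory .tar.gz so the example is self-contained."""
    # Placeholder content, not the real TriviaQA JSON layout.
    payload = json.dumps({"questions": ["Who designed the Eiffel Tower?"]}).encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="qa/demo.json")  # hypothetical member name
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_json_member(tarball_bytes: bytes, member_name: str) -> dict:
    """Read and parse one JSON member without extracting the archive to disk."""
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes), mode="r:gz") as tar:
        member = tar.extractfile(member_name)
        if member is None:
            raise ValueError(f"{member_name} is not a regular file")
        return json.load(member)


demo_tarball = build_demo_tarball()
question_data = read_json_member(demo_tarball, "qa/demo.json")
print(question_data["questions"][0])
```

With a real download you would pass a file path to tarfile.open instead of an in-memory buffer; the access pattern stays the same.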

Because the train and validation files are read from the same tarball, the reader takes the tarball path as a constructor parameter and the question file as the argument to read. In practice, put the tarball path in the dataset_reader parameters of your experiment configuration file, and use a question file name such as "wikipedia-train.json" for train_data_path and validation_data_path.
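The relevant part of an experiment configuration might look roughly like the fragment below, written here as a Python dictionary and printed as JSON. Treat it as a sketch: the tarball path is a placeholder, the validation file name "wikipedia-dev.json" is an assumption, and a full configuration would also need model and trainer sections that are omitted here.

```python
import json

# Only the dataset-related keys are shown; this is not a complete configuration.
config = {
    "dataset_reader": {
        "type": "triviaqa",  # the name the reader is registered under
        "base_tarball_path": "/path/to/triviaqa.tar.gz",  # placeholder path
    },
    # Question files inside the tarball, not paths on the filesystem.
    "train_data_path": "wikipedia-train.json",
    "validation_data_path": "wikipedia-dev.json",  # assumed file name
}

print(json.dumps(config, indent=2))
```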

Parameters

base_tarball_path : str
This is the path to the main tar.gz file you can download from the TriviaQA website, with directories evidence and qa.
unfiltered_tarball_path : str, optional
This is the path to the "unfiltered" TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.
tokenizer : Tokenizer, optional
We'll use this tokenizer on questions and evidence passages, defaulting to SpacyTokenizer if none is provided.
token_indexers : Dict[str, TokenIndexer], optional
Determines how both the question and the evidence passages are represented as arrays. See TokenIndexer. By default, each token is represented by a single word ID.

pick_paragraphs

class TriviaQaReader(DatasetReader):
 | ...
 | def pick_paragraphs(
 |     self,
 |     evidence_files: List[List[str]],
 |     question: str = None,
 |     answer_texts: List[str] = None
 | ) -> List[str]

Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.

To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that's likely cheating, depending on how you've defined the task.
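To make the idea concrete, here is one possible selection strategy, sketched in plain Python. It is not necessarily what the library does: the function name pick_paragraphs_sketch and the word-overlap scoring are assumptions made for illustration. It keeps one paragraph per evidence document, preferring the paragraph that shares the most words with the question, and deliberately ignores the answer strings.

```python
from typing import List, Optional


def pick_paragraphs_sketch(
    evidence_files: List[List[str]],
    question: Optional[str] = None,
) -> List[str]:
    """Hypothetical strategy: return one paragraph per evidence document."""
    question_words = set(question.lower().split()) if question else set()
    picked = []
    for paragraphs in evidence_files:
        candidates = [p for p in paragraphs if p.strip()]
        if not candidates:
            continue  # skip documents with no usable text
        # max() returns the first maximal item, so when no question is given
        # (all scores are 0) the first non-empty paragraph is chosen.
        best = max(
            candidates,
            key=lambda p: len(question_words & set(p.lower().split())),
        )
        picked.append(best)
    return picked


docs = [
    ["Paris is in France.", "The Eiffel Tower opened in 1889."],
    ["", "Gustave Eiffel was an engineer."],
]
with_question = pick_paragraphs_sketch(docs, question="When did the Eiffel Tower open?")
without_question = pick_paragraphs_sketch(docs)
print(with_question)
print(without_question)
```

Scoring against the question alone avoids the leakage concern described above, since no answer text is consulted when choosing paragraphs.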

text_to_instance

class TriviaQaReader(DatasetReader):
 | ...
 | @overrides
 | def text_to_instance(
 |     self,
 |     question_text: str,
 |     passage_text: str,
 |     token_spans: List[Tuple[int, int]] = None,
 |     answer_texts: List[str] = None,
 |     question_tokens: List[Token] = None,
 |     passage_tokens: List[Token] = None
 | ) -> Instance
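
The token_spans argument pairs token indices with the passage. As a hedged illustration of how such spans could be derived from answer strings, the sketch below matches answers against a whitespace-tokenized passage. The helper find_token_spans is hypothetical, the whitespace tokenization stands in for the configured tokenizer, and treating the end index as inclusive is an assumption rather than something this page states.

```python
from typing import List, Tuple


def find_token_spans(
    passage_tokens: List[str],
    answer_texts: List[str],
) -> List[Tuple[int, int]]:
    """Return (start, end) token indices for each answer occurrence.

    The end index is inclusive here, which is an assumption of this sketch.
    """
    lowered = [token.lower() for token in passage_tokens]
    spans = []
    for answer in answer_texts:
        answer_tokens = answer.lower().split()
        length = len(answer_tokens)
        if length == 0:
            continue  # nothing to match for an empty answer
        for start in range(len(lowered) - length + 1):
            if lowered[start:start + length] == answer_tokens:
                spans.append((start, start + length - 1))
    return spans


passage_tokens = "The Eiffel Tower is in Paris".split()
token_spans = find_token_spans(passage_tokens, ["Eiffel Tower", "paris"])
print(token_spans)  # [(1, 2), (5, 5)]
```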