allennlp.data.dataset_readers.snli
class allennlp.data.dataset_readers.snli.SnliReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a file from the Stanford Natural Language Inference (SNLI) dataset. This data is formatted as jsonl, one json-formatted instance per line. The keys in the data are “gold_label”, “sentence1”, and “sentence2”. We convert these keys into fields named “label”, “premise” and “hypothesis”, along with a metadata field containing the tokenized strings of the premise and hypothesis.
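The key-to-field mapping described above can be sketched in plain Python. This is only an illustration, not the reader's actual implementation: `snli_lines_to_fields` is a hypothetical helper, the metadata key names are made up for the example, and whitespace splitting stands in for the configured Tokenizer.

```python
import json
from typing import Dict, Iterable, Iterator


def snli_lines_to_fields(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Hypothetical helper: map raw SNLI JSONL keys to the documented field names."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)  # one JSON object per line
        premise = example["sentence1"]
        hypothesis = example["sentence2"]
        yield {
            "label": example["gold_label"],
            "premise": premise,
            "hypothesis": hypothesis,
            # Assumption: str.split() stands in for the real tokenizer,
            # and these metadata key names are illustrative only.
            "metadata": {
                "premise_tokens": premise.split(),
                "hypothesis_tokens": hypothesis.split(),
            },
        }


sample = [
    '{"gold_label": "entailment", '
    '"sentence1": "A man plays a guitar.", '
    '"sentence2": "A person makes music."}',
]
records = list(snli_lines_to_fields(sample))
```

Each element of `records` is a plain dict here; the real reader produces Instance objects whose fields carry the same names.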
- Parameters
  - tokenizer : Tokenizer, optional (default=``WordTokenizer()``)
    We use this Tokenizer for both the premise and the hypothesis. See Tokenizer.
  - token_indexers : Dict[str, TokenIndexer], optional (default=``{"tokens": SingleIdTokenIndexer()}``)
    We similarly use this for both the premise and the hypothesis. See TokenIndexer.
text_to_instance(self, premise: str, hypothesis: str, label: str = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.
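The code-sharing idea can be illustrated with a small, self-contained sketch. Everything here is a simplified stand-in rather than AllenNLP code: `ToySnliReader` and `predict_json` are hypothetical names, instances are plain dicts, lowercased whitespace splitting replaces the tokenizer, and the JSON keys `premise` and `hypothesis` on the prediction side are an assumption about what the predictor receives.

```python
import json
from typing import Dict, Iterable, Iterator, Optional


class ToySnliReader:
    """Simplified stand-in for a dataset reader (not the AllenNLP class)."""

    def text_to_instance(
        self, premise: str, hypothesis: str, label: Optional[str] = None
    ) -> Dict[str, object]:
        # All text processing lives in one place so that training and
        # serving cannot drift apart.
        instance: Dict[str, object] = {
            "premise": premise.lower().split(),
            "hypothesis": hypothesis.lower().split(),
        }
        if label is not None:  # no label is available at prediction time
            instance["label"] = label
        return instance

    def _read(self, lines: Iterable[str]) -> Iterator[Dict[str, object]]:
        # Training path: parse each JSONL line, then delegate to text_to_instance.
        for line in lines:
            example = json.loads(line)
            yield self.text_to_instance(
                example["sentence1"], example["sentence2"], example["gold_label"]
            )


def predict_json(reader: ToySnliReader, inputs: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical predictor-side helper.

    It has to assume which keys the JSON carries and which arguments the
    reader expects, which is the assumption the text above refers to.
    """
    return reader.text_to_instance(inputs["premise"], inputs["hypothesis"])


reader = ToySnliReader()
train_line = json.dumps(
    {
        "gold_label": "entailment",
        "sentence1": "A man plays a guitar.",
        "sentence2": "A person makes music.",
    }
)
train_instance = next(reader._read([train_line]))
serve_instance = predict_json(
    reader,
    {"premise": "A man plays a guitar.", "hypothesis": "A person makes music."},
)
```

Because both paths go through `text_to_instance`, the premise and hypothesis are processed identically; the only difference is that the serving-time instance carries no label.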