class QuoraParaphraseDatasetReader(DatasetReader):
 | def __init__(
 |     self,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     combine_input_fields: Optional[bool] = None,
 |     **kwargs
 | ) -> None

Reads a file from the Quora Paraphrase dataset. The train/validation/test split of the data comes from the paper Bilateral Multi-Perspective Matching for Natural Language Sentences by Zhiguo Wang et al., 2017. Each file of the data is a tsv file without a header. The columns are is_duplicate, question1, question2, and id. All questions are pre-tokenized and tokens are space separated. We convert these keys into fields named "label", "premise" and "hypothesis", so that the output is compatible with some existing natural language inference algorithms.
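The column layout described above can be illustrated by parsing one line by hand. The sample line below is invented for illustration, not taken from the real dataset:

```python
# A sample line in the Quora Paraphrase tsv format (illustrative data only).
# Columns: is_duplicate, question1, question2, id; questions are
# pre-tokenized, with tokens separated by spaces.
line = "1\tWhat is machine learning ?\tCan you explain machine learning ?\t42"

is_duplicate, question1, question2, row_id = line.split("\t")

# The reader maps these columns onto NLI-style field names.
fields = {
    "label": is_duplicate,
    "premise": question1,
    "hypothesis": question2,
}
print(fields["label"], fields["premise"])
```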

Registered as a DatasetReader with name "quora_paraphrase".


  • tokenizer : Tokenizer, optional
    Tokenizer to use to split the premise and hypothesis into words or other kinds of tokens. Defaults to WhitespaceTokenizer.
  • token_indexers : Dict[str, TokenIndexer], optional
    Indexers used to define input token representations. Defaults to {"tokens": SingleIdTokenIndexer()}.
  • combine_input_fields : bool, optional
    If True, the premise and hypothesis are tokenized together as a sentence pair and provided as a single "tokens" field; if False, they are kept in separate "premise" and "hypothesis" fields. Defaults to True when the tokenizer is a PretrainedTransformerTokenizer, and False otherwise.

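To make the reading behavior concrete, here is a stdlib-only sketch of roughly what the reader's file-reading loop does, without any AllenNLP dependencies. The function name `read_quora_tsv` is hypothetical, and the row-skipping detail is an assumption about how malformed lines are handled:

```python
import csv
from typing import Dict, Iterator


def read_quora_tsv(path: str) -> Iterator[Dict[str, str]]:
    # Rough, stdlib-only approximation of the reader's _read loop
    # (hypothetical helper; the real reader yields Instance objects,
    # not plain dicts).
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) == 4:  # assumed: rows without 4 columns are skipped
                is_duplicate, question1, question2, _ = row
                yield {
                    "label": is_duplicate,
                    "premise": question1,
                    "hypothesis": question2,
                }
```

With the real reader you would instead call `reader.read(path)` and get back AllenNLP `Instance` objects whose fields carry the same names.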

class QuoraParaphraseDatasetReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     premise: str,
 |     hypothesis: str,
 |     label: str = None
 | ) -> Instance
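The optional label in the signature above matters at prediction time, when no gold label is available. The sketch below is a plain-Python approximation (the function name is hypothetical, and dicts and token lists stand in for AllenNLP's Instance, TextField, and LabelField) showing the default WhitespaceTokenizer behavior and the conditional label field:

```python
from typing import Dict, List, Optional, Union


def text_to_instance_sketch(
    premise: str,
    hypothesis: str,
    label: Optional[str] = None,
) -> Dict[str, Union[List[str], str]]:
    # Approximation of text_to_instance with the default
    # WhitespaceTokenizer: each sentence becomes a list of
    # space-separated tokens (standing in for a TextField).
    fields: Dict[str, Union[List[str], str]] = {
        "premise": premise.split(),
        "hypothesis": hypothesis.split(),
    }
    if label is not None:
        # The label field (a LabelField in the real reader) is only
        # added when a label is supplied, e.g. during training.
        fields["label"] = label
    return fields
```

With `combine_input_fields=True` the real reader would instead tokenize the pair together into one "tokens" field, which is what transformer-based models typically expect.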