quora_paraphrase
QuoraParaphraseDatasetReader#
class QuoraParaphraseDatasetReader(DatasetReader):
| def __init__(
| self,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| combine_input_fields: Optional[bool] = None,
| **kwargs
| ) -> None
Reads a file from the Quora Paraphrase dataset. The train/validation/test split of the data comes from the paper Bilateral Multi-Perspective Matching for Natural Language Sentences by Zhiguo Wang et al., 2017. Each file of the data is a TSV file without a header. The columns are is_duplicate, question1, question2, and id. All questions are pre-tokenized and tokens are space separated. We convert these keys into fields named "label", "premise", and "hypothesis", so that it is compatible with some existing natural language inference algorithms.
Registered as a DatasetReader with name "quora_paraphrase".
Parameters

- tokenizer : Tokenizer, optional
  Tokenizer to use to split the premise and hypothesis into words or other kinds of tokens. Defaults to WhitespaceTokenizer.
- token_indexers : Dict[str, TokenIndexer], optional
  Indexers used to define input token representations. Defaults to {"tokens": SingleIdTokenIndexer()}.
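As described above, each header-less TSV row carries the columns is_duplicate, question1, question2, and id, which the reader maps to the field names "label", "premise", and "hypothesis". A minimal pure-Python sketch of that column mapping (independent of AllenNLP; the example row is made up for illustration):

```python
def parse_quora_row(line: str) -> dict:
    """Map one header-less TSV row (is_duplicate, question1, question2, id)
    to the field names the reader produces."""
    is_duplicate, question1, question2, _id = line.rstrip("\n").split("\t")
    return {
        "label": is_duplicate,
        # Questions are pre-tokenized; tokens are space separated.
        "premise": question1.split(" "),
        "hypothesis": question2.split(" "),
    }

# Hypothetical example row, for illustration only.
row = "1\tWhat is AI ?\tWhat does AI mean ?\t42"
fields = parse_quora_row(row)
```

The id column is read but discarded, since the converted instances only need the label and the two question texts.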
text_to_instance#
class QuoraParaphraseDatasetReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| premise: str,
| hypothesis: str,
| label: str = None
| ) -> Instance
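Since the label argument defaults to None, the same method serves both labeled training rows and unlabeled prediction inputs. A plain-dict sketch of that behavior (not the real Instance/Field objects; whitespace splitting stands in for the default WhitespaceTokenizer):

```python
def text_to_instance_sketch(premise: str, hypothesis: str, label: str = None) -> dict:
    """Plain-dict sketch of the fields text_to_instance assembles."""
    # Whitespace splitting stands in for the default WhitespaceTokenizer.
    instance = {
        "premise": premise.split(),
        "hypothesis": hypothesis.split(),
    }
    # A label field is added only when a label is supplied, so the same
    # method handles training data and unlabeled prediction inputs.
    if label is not None:
        instance["label"] = label
    return instance
```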