class TextClassificationJsonReader(token_indexers: Dict[str, TokenIndexer] = None, tokenizer: Tokenizer = None, segment_sentences: bool = False, max_sequence_length: int = None, skip_label_indexing: bool = False, lazy: bool = False)[source]


Reads tokens and their labels from a labeled text classification dataset. Expects a “text” field and a “label” field in JSON format.

The output of read is a list of Instances with the fields:

tokens: TextField and label: LabelField
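For example, the expected input is one JSON object per line, each with a "text" and a "label" key. A minimal sketch of parsing that format with the standard library (the sample sentences and labels here are made up for illustration):

```python
import json

# Two example lines from a hypothetical input file: each line is a
# JSON object with a "text" field and a "label" field.
lines = [
    '{"text": "A wonderful, moving film.", "label": "pos"}',
    '{"text": "Flat and predictable.", "label": "neg"}',
]

# Parse each line independently, as the reader does.
examples = [json.loads(line) for line in lines]
for ex in examples:
    print(ex["label"], "->", ex["text"])
```

Each parsed object supplies the text for the tokens field and the label for the label field of one Instance.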

token_indexers: ``Dict[str, TokenIndexer]``, optional (default = ``{"tokens": SingleIdTokenIndexer()}``)

We use this to define the input representation for the text. See TokenIndexer.

tokenizer: ``Tokenizer``, optional (default = ``WordTokenizer()``)

Tokenizer to use to split the input text into words or other kinds of tokens.

segment_sentences: ``bool``, optional (default = ``False``)

If True, we will first segment the text into sentences using spaCy and then tokenize words. Necessary for some models that require pre-segmentation of sentences, like the Hierarchical Attention Network.

max_sequence_length: ``int``, optional (default = ``None``)

If specified, will truncate tokens to the specified maximum length.

skip_label_indexing: ``bool``, optional (default = ``False``)

Whether or not to skip label indexing. You might want to skip label indexing if your labels are numbers, so the dataset reader doesn’t re-number them starting from 0.

lazy: ``bool``, optional (default = ``False``)

Whether or not instances can be read lazily.
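To illustrate how these options interact, here is a plain-Python sketch of the per-line processing (the function name ``read_line`` and the whitespace tokenizer are invented for this example; the real reader uses a Tokenizer and returns allennlp Instances, not dicts):

```python
import json

def read_line(line, max_sequence_length=None, skip_label_indexing=False,
              label_vocab=None):
    # Illustrative sketch only: the actual reader builds Instances
    # with a TextField and a LabelField.
    record = json.loads(line)
    tokens = record["text"].split()  # stand-in for a real Tokenizer
    if max_sequence_length is not None:
        # Truncate overly long inputs to the configured maximum.
        tokens = tokens[:max_sequence_length]
    label = record["label"]
    if skip_label_indexing:
        # Labels are already numeric ids; do not re-number them.
        label = int(label)
    elif label_vocab is not None:
        # Otherwise index string labels, numbering from 0.
        label = label_vocab.setdefault(label, len(label_vocab))
    return {"tokens": tokens, "label": label}

vocab = {}
inst = read_line('{"text": "great fun", "label": "pos"}', label_vocab=vocab)
# With skip_label_indexing=True, numeric string labels pass through as ints.
raw = read_line('{"text": "ok film", "label": "3"}', skip_label_indexing=True)
```

With ``label_vocab`` the first label seen ("pos") is indexed as 0, while the ``skip_label_indexing=True`` call keeps the label 3 as-is.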

text_to_instance(self, text: str, label: Union[str, int] = None) → Instance[source]

text: ``str``, required

The text to classify

label: ``Union[str, int]``, optional (default = ``None``)

The label for this text.

An Instance containing the following fields:

tokens: ``TextField``

The tokens in the sentence or phrase.

label: ``LabelField``

The label of the sentence or phrase.