Token(
    self,
    text: Optional[str] = None,
    idx: Optional[int] = None,
    lemma_: Optional[str] = None,
    pos_: Optional[str] = None,
    tag_: Optional[str] = None,
    dep_: Optional[str] = None,
    ent_type_: Optional[str] = None,
    text_id: Optional[int] = None,
    type_id: Optional[int] = None,
) -> None
A simple token representation, keeping track of the token's text, offset in the passage it was taken from, POS tag, dependency relation, and similar information. These fields match spacy's exactly, so we can just use a spacy token for this.
- `text` : str, optional
  The original text represented by this token.
- `idx` : int, optional
  The character offset of this token into the tokenized passage.
- `lemma_` : str, optional
  The lemma of this token.
- `pos_` : str, optional
  The coarse-grained part of speech of this token.
- `tag_` : str, optional
  The fine-grained part of speech of this token.
- `dep_` : str, optional
  The dependency relation for this token.
- `ent_type_` : str, optional
  The entity type (i.e., the NER tag) for this token.
- `text_id` : int, optional
  If your tokenizer returns integers instead of strings (e.g., because you're doing byte encoding, or some hash-based embedding), set this with the integer. If this is set, we will bypass the vocabulary when indexing this token, regardless of whether `text` is also set. You can also set `text` with the original text, if you want, so that you can still use a character-level representation in addition to a hash-based word embedding.
- `type_id` : int, optional
  Token type id used by some pretrained language models, like the original BERT.
The other fields on `Token` follow the fields on spacy's `Token` object; this is one we added, by analogy with a similar field on spacy's tokens.
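To make the field semantics concrete, here is a minimal self-contained sketch of such a token class and of the `text_id` convention described above. This is an illustrative stand-in, not the library's actual implementation; the hash-based id scheme shown is an assumption for demonstration only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    # Fields mirror the documented constructor; all are optional.
    text: Optional[str] = None        # original surface text
    idx: Optional[int] = None         # character offset into the passage
    lemma_: Optional[str] = None      # lemma
    pos_: Optional[str] = None        # coarse-grained POS
    tag_: Optional[str] = None        # fine-grained POS
    dep_: Optional[str] = None        # dependency relation
    ent_type_: Optional[str] = None   # NER tag
    text_id: Optional[int] = None     # integer id from a hashing tokenizer
    type_id: Optional[int] = None     # segment/type id (e.g., BERT)

    def __str__(self) -> str:
        return self.text or ""

# An ordinary string token from a text tokenizer:
plain = Token(text="hello", idx=0, pos_="INTJ")

# A tokenizer using a hash-based embedding can set text_id directly,
# bypassing the vocabulary; text may still be set alongside it so a
# character-level representation remains available.
hashed = Token(text="hello", idx=0, text_id=hash("hello") % 2**16)
```

Because `text_id` takes precedence during indexing, setting both fields costs nothing and keeps character-level features usable.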