[ allennlp.data.tokenizers.token ]
A simple token representation, keeping track of the token's text, offset in the passage it was taken from, POS tag, dependency relation, and similar information. These fields match spacy's exactly, so we can just use a spacy token for this.
- text :
The original text represented by this token.
- idx :
The character offset of this token into the tokenized passage.
- idx_end :
The character offset one past the last character of this token in the tokenized passage.
- lemma_ :
The lemma of this token.
- pos_ :
The coarse-grained part of speech of this token.
- tag_ :
The fine-grained part of speech of this token.
- dep_ :
The dependency relation for this token.
- ent_type_ :
The entity type (i.e., the NER tag) for this token.
- text_id :
If your tokenizer returns integers instead of strings (e.g., because you're doing byte encoding, or some hash-based embedding), set this with the integer. If this is set, we will bypass the vocabulary when indexing this token, regardless of whether text is also set. You can also set text with the original text, if you want, so that you can still use a character-level representation in addition to a hash-based word embedding.
- type_id :
Token type id used by some pretrained language models like original BERT.
The other fields on Token follow the fields on spacy's Token object; this is one we added, similar to spacy's
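The text_id behaviour described above can be illustrated with a small sketch. It does not use the library: SketchToken and index_token are hypothetical stand-ins written for this example, and the vocabulary is a plain dictionary. The only point is the precedence rule, where a preset text_id wins over a vocabulary lookup even when text is also set.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SketchToken:
    """Hypothetical stand-in carrying only the two fields this example needs."""
    text: Optional[str] = None
    text_id: Optional[int] = None


def index_token(token: SketchToken, vocab: Dict[str, int], oov_id: int = 1) -> int:
    """Hypothetical indexer: a preset text_id takes priority over the vocabulary."""
    if token.text_id is not None:
        # text_id is set, so the vocabulary is bypassed even if text is also set.
        return token.text_id
    if token.text is None:
        raise ValueError("token has neither text nor text_id")
    # No text_id: fall back to a vocabulary lookup, using oov_id for unknown words.
    return vocab.get(token.text, oov_id)


vocab = {"<unk>": 1, "cat": 2}

plain = SketchToken(text="cat")
hashed = SketchToken(text="cat", text_id=48213)

plain_index = index_token(plain, vocab)    # looked up in the vocabulary
hashed_index = index_token(hashed, vocab)  # taken directly from text_id
hashed_chars = list(hashed.text)           # text is still there for character-level features
```

Keeping text on the hashed token is what allows a character-level representation to sit alongside the hash-based word id.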
text = None
idx = None
idx_end = None
lemma_ = None
pos_ = None
tag_ = None
dep_ = None
ent_type_ = None
text_id = None
type_id = None
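Since every field defaults to None, a tokenizer only needs to fill in what it knows. The sketch below assumes a hypothetical SketchToken dataclass that mirrors the field names above (it is not the library's class) and a toy whitespace tokenizer, and shows how idx and idx_end relate to the passage: slicing the passage with them recovers the token's text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchToken:
    """Hypothetical stand-in mirroring the documented fields; all default to None."""
    text: Optional[str] = None
    idx: Optional[int] = None
    idx_end: Optional[int] = None
    lemma_: Optional[str] = None
    pos_: Optional[str] = None
    tag_: Optional[str] = None
    dep_: Optional[str] = None
    ent_type_: Optional[str] = None
    text_id: Optional[int] = None
    type_id: Optional[int] = None


def whitespace_tokens(passage: str) -> List[SketchToken]:
    """Toy tokenizer: split on whitespace and record character offsets."""
    tokens = []
    cursor = 0
    for piece in passage.split():
        start = passage.index(piece, cursor)  # search from the end of the previous token
        end = start + len(piece)              # one past the token's last character
        tokens.append(SketchToken(text=piece, idx=start, idx_end=end))
        cursor = end
    return tokens


passage = "The cat sat."
tokens = whitespace_tokens(passage)
spans = [(t.text, t.idx, t.idx_end) for t in tokens]
```

Fields the toy tokenizer does not produce, such as pos_ or ent_type_, simply stay None.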
def show_token(token: Token) -> str
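The signature suggests a helper that renders a token's fields as a single string. The sketch below shows one possible shape for such a helper. The names SketchToken and show_token_sketch are hypothetical, only a subset of fields is included, and the output format is an assumption, not necessarily what the library's show_token returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchToken:
    """Hypothetical stand-in with a subset of the documented fields."""
    text: Optional[str] = None
    idx: Optional[int] = None
    idx_end: Optional[int] = None
    pos_: Optional[str] = None


def show_token_sketch(token: SketchToken) -> str:
    """Render the text followed by each remaining field as "(name: value)"."""
    return (
        f"{token.text} "
        f"(idx: {token.idx}) "
        f"(idx_end: {token.idx_end}) "
        f"(pos: {token.pos_})"
    )


rendered = show_token_sketch(SketchToken(text="cat", idx=4, idx_end=7))
```

A string form like this is mainly useful for debugging, since unset fields show up explicitly as None.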