token_class

allennlp.data.tokenizers.token_class

Token

@dataclass(init=False, repr=False)
class Token:
 | def __init__(
 |     self,
 |     text: Optional[str] = None,
 |     idx: Optional[int] = None,
 |     idx_end: Optional[int] = None,
 |     lemma_: Optional[str] = None,
 |     pos_: Optional[str] = None,
 |     tag_: Optional[str] = None,
 |     dep_: Optional[str] = None,
 |     ent_type_: Optional[str] = None,
 |     text_id: Optional[int] = None,
 |     type_id: Optional[int] = None
 | ) -> None

A simple token representation, keeping track of the token's text, offset in the passage it was taken from, POS tag, dependency relation, and similar information. These fields match spacy's exactly, so we can just use a spacy token for this.
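
Since the fields correspond one-to-one with spacy's, a Token can also be filled in directly from a spacy token. A minimal sketch, assuming a locally installed en_core_web_sm model (the idx_end arithmetic is an illustrative assumption, not something this class computes for you):

import spacy
from allennlp.data.tokenizers.token_class import Token

nlp = spacy.load("en_core_web_sm")  # assumption: this model is installed
doc = nlp("AllenNLP builds on spacy.")

tokens = [
    Token(
        text=t.text,
        idx=t.idx,                    # offset of the first character
        idx_end=t.idx + len(t.text),  # one past the last character
        lemma_=t.lemma_,
        pos_=t.pos_,
        tag_=t.tag_,
        dep_=t.dep_,
        ent_type_=t.ent_type_,
    )
    for t in doc
]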

Parameters

  • text : str, optional
    The original text represented by this token.
  • idx : int, optional
    The character offset of this token into the tokenized passage.
  • idx_end : int, optional
    The character offset one past the last character of this token in the tokenized passage.
  • lemma_ : str, optional
    The lemma of this token.
  • pos_ : str, optional
    The coarse-grained part of speech of this token.
  • tag_ : str, optional
    The fine-grained part of speech of this token.
  • dep_ : str, optional
    The dependency relation for this token.
  • ent_type_ : str, optional
    The entity type (i.e., the NER tag) for this token.
  • text_id : int, optional
    If your tokenizer returns integers instead of strings (e.g., because you're doing byte encoding, or some hash-based embedding), set this field to that integer. If this is set, we will bypass the vocabulary when indexing this token, regardless of whether text is also set. You can also set text to the original text, if you want, so that you can still use a character-level representation in addition to a hash-based word embedding; see the sketch after this list.
  • type_id : int, optional
    The token type id used by some pretrained language models, like the original BERT.

    The other fields on Token follow the fields on spacy's Token object; this is one we added, similar to spacy's lex_id.
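
A minimal sketch of how text_id and type_id are meant to be used (the integer values below are illustrative assumptions, not ids any real tokenizer produces):

from allennlp.data.tokenizers.token_class import Token

# A plain token: indexers look `text` up in the vocabulary.
plain = Token(text="hello", idx=0, idx_end=5)

# A token carrying a precomputed integer id. Because text_id is set,
# indexing bypasses the vocabulary entirely; text is kept as well so a
# character-level representation can still be built from it.
hashed = Token(text="hello", text_id=104, type_id=0)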

text

class Token:
 | ...
 | text: Optional[str] = None

idx

class Token:
 | ...
 | idx: Optional[int] = None

idx_end

class Token:
 | ...
 | idx_end: Optional[int] = None

lemma_

class Token:
 | ...
 | lemma_: Optional[str] = None

pos_

class Token:
 | ...
 | pos_: Optional[str] = None

tag_

class Token:
 | ...
 | tag_: Optional[str] = None

dep_

class Token:
 | ...
 | dep_: Optional[str] = None

ent_type_

class Token:
 | ...
 | ent_type_: Optional[str] = None

text_id

class Token:
 | ...
 | text_id: Optional[int] = None

type_id

class Token:
 | ...
 | type_id: Optional[int] = None

ensure_text

class Token:
 | ...
 | def ensure_text(self) -> str

Return the text field, raising an exception if it's None.
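
A minimal usage sketch (the docstring only promises an exception; the ValueError below is an assumption about the current implementation):

from allennlp.data.tokenizers.token_class import Token

token = Token(text="hello")
assert token.ensure_text() == "hello"  # text is set, so it is returned

no_text = Token(text_id=42)  # text was never set
try:
    no_text.ensure_text()
except ValueError:  # assumption: the implementation raises ValueError here
    print("token has no text")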

show_token

def show_token(token: Token) -> str
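
Returns a human-readable, single-line summary of the token's fields; useful for debugging. A minimal usage sketch (the exact output format is an implementation detail):

from allennlp.data.tokenizers.token_class import Token, show_token

token = Token(text="hello", idx=0, idx_end=5, pos_="INTJ")
print(show_token(token))  # prints the text plus each field and its value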