token_indexer
allennlp.data.token_indexers.token_indexer
IndexedTokenList#
IndexedTokenList = Dict[str, List[Any]]
TokenIndexer#
class TokenIndexer(Registrable):
| def __init__(self, token_min_padding_length: int = 0) -> None
A TokenIndexer determines how string tokens get represented as arrays of indices in a model. This class both converts strings into numerical values, with the help of a Vocabulary, and produces the actual arrays.
Tokens can be represented as single IDs (e.g., the word "cat" gets represented by the number 34), or as lists of character IDs (e.g., "cat" gets represented by the numbers [23, 10, 18]), or in some other way that you can come up with (e.g., if you have some structured input you want to represent in a special way in your data arrays, you can do that here).
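As a hypothetical illustration (the keys and id values below are made up, not produced by any real vocabulary), the same tokens could be indexed in either of these ways:

```python
# Hypothetical IndexedTokenList outputs for the tokens ["cat", "dog"].

# Single-id representation: one id per token.
single_id_output = {"tokens": [34, 51]}

# Character representation: one list of character ids per token.
character_output = {"token_characters": [[23, 10, 18], [14, 25, 7]]}
```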
Parameters
- token_min_padding_length :
int
, optional (default =0
)
The minimum padding length required for theTokenIndexer
. For example, the minimum padding length ofSingleIdTokenIndexer
is the largest size of filter when usingCnnEncoder
. Note that if you set this for one TokenIndexer, you likely have to set it for allTokenIndexer
for the same field, otherwise you'll get mismatched tensor sizes.
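For instance, a sketch of setting this parameter on a SingleIdTokenIndexer (the filter width of 5 is an assumed value for a downstream CnnEncoder):

```python
from allennlp.data.token_indexers import SingleIdTokenIndexer

# Illustrative only: if a downstream CnnEncoder used a maximum filter width of 5,
# every token sequence indexed here would be padded to at least length 5.
indexer = SingleIdTokenIndexer(namespace="tokens", token_min_padding_length=5)
```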
default_implementation#
class TokenIndexer(Registrable):
| ...
| default_implementation = "single_id"
has_warned_for_as_padded_tensor#
class TokenIndexer(Registrable):
| ...
| has_warned_for_as_padded_tensor = False
count_vocab_items#
class TokenIndexer(Registrable):
| ...
| def count_vocab_items(
| self,
| token: Token,
| counter: Dict[str, Dict[str, int]]
| )
The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
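As a rough sketch of the single-id case (SingleIdTokenIndexer behaves similarly; the "tokens" namespace here is an assumption), the counts are simply incremented per token text:

```python
from collections import defaultdict
from typing import Dict

from allennlp.data import Token

# The counter passed in by the vocabulary-building code is a nested mapping of
# namespace -> vocabulary item -> count.
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

# A single-id style indexer counts the token text itself as the vocabulary item.
for token in [Token("cat"), Token("sat"), Token("cat")]:
    counter["tokens"][token.text] += 1

print(dict(counter["tokens"]))  # {'cat': 2, 'sat': 1}
```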
tokens_to_indices#
class TokenIndexer(Registrable):
| ...
| def tokens_to_indices(
| self,
| tokens: List[Token],
| vocabulary: Vocabulary
| ) -> IndexedTokenList
Takes a list of tokens and converts them to an IndexedTokenList. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices, and the IndexedTokenList could be a complex data structure.
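A minimal sketch of the single-id case, assuming a "tokens" key and namespace (real indexers may lowercase, add start/end tokens, or return several keys):

```python
from typing import List

from allennlp.data import Token, Vocabulary
from allennlp.data.token_indexers.token_indexer import IndexedTokenList


def tokens_to_indices(tokens: List[Token], vocabulary: Vocabulary) -> IndexedTokenList:
    # One id per token, looked up in the "tokens" namespace of the vocabulary.
    return {"tokens": [vocabulary.get_token_index(token.text, "tokens") for token in tokens]}
```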
indices_to_tokens#
class TokenIndexer(Registrable):
| ...
| def indices_to_tokens(
| self,
| indexed_tokens: IndexedTokenList,
| vocabulary: Vocabulary
| ) -> List[Token]
The inverse operation of tokens_to_indices. Takes an IndexedTokenList and converts it back into a list of tokens.
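For the single-id sketch above, the inverse would look roughly like this (again assuming the "tokens" key and namespace):

```python
from typing import List

from allennlp.data import Token, Vocabulary
from allennlp.data.token_indexers.token_indexer import IndexedTokenList


def indices_to_tokens(indexed_tokens: IndexedTokenList, vocabulary: Vocabulary) -> List[Token]:
    # Map each id under the "tokens" key back to its string and wrap it in a Token.
    return [
        Token(vocabulary.get_token_from_index(index, "tokens"))
        for index in indexed_tokens["tokens"]
    ]
```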
get_empty_token_list#
class TokenIndexer(Registrable):
| ...
| def get_empty_token_list(self) -> IndexedTokenList
Returns an already indexed
version of an empty token list. This is typically just an
empty list for whatever keys are used in the indexer.
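For a single-id style indexer this is about as simple as it sounds, e.g.:

```python
from allennlp.data.token_indexers.token_indexer import IndexedTokenList


def get_empty_token_list() -> IndexedTokenList:
    # An empty list under the same key that tokens_to_indices would produce.
    return {"tokens": []}
```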
get_padding_lengths#
class TokenIndexer(Registrable):
| ...
| def get_padding_lengths(
| self,
| indexed_tokens: IndexedTokenList
| ) -> Dict[str, int]
This method returns a padding dictionary for the given indexed_tokens specifying all lengths that need padding. If all you have is a list of single ID tokens, this is just the length of the list, and that's what the default implementation will give you. If you have something more complicated, like a list of character ids for each token, you'll need to override this.
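A sketch of that simple case (the base class also enforces token_min_padding_length, folded into the sketch here as an argument):

```python
from typing import Dict

from allennlp.data.token_indexers.token_indexer import IndexedTokenList


def get_padding_lengths(
    indexed_tokens: IndexedTokenList, token_min_padding_length: int = 0
) -> Dict[str, int]:
    # One length per key: the length of each list, never below the minimum padding length.
    return {
        key: max(len(token_list), token_min_padding_length)
        for key, token_list in indexed_tokens.items()
    }
```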
as_padded_tensor_dict#
class TokenIndexer(Registrable):
| ...
| def as_padded_tensor_dict(
| self,
| tokens: IndexedTokenList,
| padding_lengths: Dict[str, int]
| ) -> Dict[str, torch.Tensor]
This method pads a list of tokens given the input padding lengths (which could actually truncate things, depending on settings) and returns that padded list of input tokens as a Dict[str, torch.Tensor]. This is a dictionary because there should be one key per argument that the TokenEmbedder corresponding to this class expects in its forward() method (where the argument name in the TokenEmbedder needs to match the key in this dictionary).
The base class implements the case when all you want to do is create a padded LongTensor for every list in the tokens dictionary. If your TokenIndexer needs more complex logic than that, you need to override this method.
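A sketch of that simple behaviour, using allennlp's pad_sequence_to_length helper (which also truncates lists longer than the requested length):

```python
from typing import Dict

import torch

from allennlp.data.token_indexers.token_indexer import IndexedTokenList
from allennlp.nn.util import pad_sequence_to_length


def as_padded_tensor_dict(
    tokens: IndexedTokenList, padding_lengths: Dict[str, int]
) -> Dict[str, torch.Tensor]:
    # Pad (or truncate) every list to its requested length and wrap it in a LongTensor,
    # producing one tensor per key for the corresponding TokenEmbedder's forward().
    return {
        key: torch.LongTensor(pad_sequence_to_length(token_list, padding_lengths[key]))
        for key, token_list in tokens.items()
    }
```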