SpacyTokenIndexer(self, hidden_dim: int = 96, token_min_padding_length: int = 0) -> None
SpacyTokenIndexer represents tokens as word vectors from a spacy model. You might want to
do this for two main reasons: easier integration with a spacy pipeline and no
out-of-vocabulary tokens.
Registered as a TokenIndexer with name "spacy".
- hidden_dim : int, optional (default=96) The dimension of the vectors that spacy generates for representing words.
- token_min_padding_length : int, optional (default=0) See TokenIndexer.
SpacyTokenIndexer.as_padded_tensor_dict(self, tokens: Dict[str, List[Any]], padding_lengths: Dict[str, int]) -> Dict[str, torch.Tensor]
This method pads a list of tokens to the given padding lengths (which may truncate the
list, depending on settings) and returns the padded tokens as a Dict[str, torch.Tensor].
The result is a dictionary because there should be one key per argument that the
TokenEmbedder corresponding to this class expects in its forward method (the argument
name in the TokenEmbedder needs to match the key in this dictionary).
The base class implements the case where all you want to do is create a padded tensor
for every list in the tokens dictionary. If your TokenIndexer needs more complex logic
than that, you need to override this method.
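The pad-or-truncate behaviour described above can be sketched in plain Python. This is an illustration only, not the AllenNLP implementation: the helper name pad_token_lists and the zero-vector padding value are assumptions made for the example, and the real method returns torch.Tensor values rather than nested lists.

```python
from typing import Dict, List


def pad_token_lists(
    tokens: Dict[str, List[List[float]]],
    padding_lengths: Dict[str, int],
    hidden_dim: int = 96,
) -> Dict[str, List[List[float]]]:
    """Pad (or truncate) every list of vectors to its requested length.

    Hypothetical helper for illustration; padding uses zero vectors (an assumption).
    """
    padded = {}
    for key, vectors in tokens.items():
        desired = padding_lengths[key]
        # Truncate when the list is longer than the requested length.
        kept = vectors[:desired]
        # Append zero vectors when the list is shorter than requested.
        missing = desired - len(kept)
        kept = kept + [[0.0] * hidden_dim for _ in range(missing)]
        padded[key] = kept
    return padded


# Three 2-dimensional "word vectors" stored under a single key.
example = {"tokens": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}
shorter = pad_token_lists(example, {"tokens": 2}, hidden_dim=2)  # truncates to 2
longer = pad_token_lists(example, {"tokens": 5}, hidden_dim=2)   # pads up to 5
```

Note that a requested length smaller than the input silently drops the trailing tokens, which is the truncation case the description mentions.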
SpacyTokenIndexer.count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])
The Vocabulary needs to assign indices to whatever strings we see in the training
data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary,
token). This method takes a token and a dictionary of counts and increments the counts for
whatever vocabulary items are present in the token. For a single-token-ID
representation, the vocabulary item is likely the token itself. For a token-characters
representation, the vocabulary items are all of the characters in the token.
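The two counting schemes described above can be sketched with nested dictionaries. The function names and the namespace keys "tokens" and "token_characters" are assumptions chosen for this example, and the tokens are plain strings rather than Token objects.

```python
from collections import defaultdict
from typing import Dict


def count_single_id(token_text: str, counter: Dict[str, Dict[str, int]]) -> None:
    # Single-ID representation: the vocabulary item is the token itself.
    counter["tokens"][token_text] += 1


def count_characters(token_text: str, counter: Dict[str, Dict[str, int]]) -> None:
    # Character representation: every character is its own vocabulary item.
    for character in token_text:
        counter["token_characters"][character] += 1


# Outer key is the namespace, inner key is the vocabulary item.
counter = defaultdict(lambda: defaultdict(int))
for word in ["the", "cat", "the"]:
    count_single_id(word, counter)
    count_characters(word, counter)
```

After the loop, counter["tokens"] holds per-word counts and counter["token_characters"] holds per-character counts, which a vocabulary could then filter by frequency.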
SpacyTokenIndexer.tokens_to_indices(self, tokens: List[spacy.tokens.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary) -> Dict[str, List[numpy.ndarray]]
Takes a list of tokens and converts them to an IndexedTokenList.
This could be just an ID for each token from the vocabulary.
Or it could split each token into characters and return one ID per character.
Or (for instance, in the case of byte-pair encoding) there might not be a clean
mapping from individual tokens to indices, and the IndexedTokenList could be a complex
data structure.
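The first two shapes of output mentioned above (one ID per token, and one ID per character) can be sketched as follows. This is a simplified stand-in, not the library code: the vocabularies are plain dictionaries, the tokens are plain strings, and the helper names and the OOV marker are assumptions made for the example. Per the signature above, SpacyTokenIndexer itself returns a list of numpy arrays per key rather than integer IDs.

```python
from typing import Dict, List

OOV = "@@UNKNOWN@@"  # out-of-vocabulary marker chosen for this sketch


def to_single_ids(tokens: List[str], vocab: Dict[str, int]) -> Dict[str, List[int]]:
    # One ID per token; unseen tokens fall back to the OOV index.
    return {"tokens": [vocab.get(token, vocab[OOV]) for token in tokens]}


def to_character_ids(
    tokens: List[str], char_vocab: Dict[str, int]
) -> Dict[str, List[List[int]]]:
    # One list of character IDs per token.
    return {
        "token_characters": [
            [char_vocab.get(character, char_vocab[OOV]) for character in token]
            for token in tokens
        ]
    }


word_vocab = {OOV: 0, "the": 1, "cat": 2}
char_vocab = {OOV: 0, "t": 1, "h": 2, "e": 3, "c": 4, "a": 5}

single = to_single_ids(["the", "dog"], word_vocab)       # "dog" is out of vocabulary
chars = to_character_ids(["the", "cat"], char_vocab)
```

Both helpers return a dictionary keyed by name, mirroring the idea that each key lines up with an argument of the matching embedder.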