allennlp.data.token_indexers

A TokenIndexer determines how string tokens get represented as arrays of indices in a model.
class allennlp.data.token_indexers.token_indexer.TokenIndexer(token_min_padding_length: int = 0)

Bases: typing.Generic, allennlp.common.registrable.Registrable

A TokenIndexer determines how string tokens get represented as arrays of indices in a model. This class both converts strings into numerical values (with the help of a Vocabulary) and produces the actual arrays.

Tokens can be represented as single IDs (e.g., the word “cat” gets represented by the number 34), as lists of character IDs (e.g., “cat” gets represented by the numbers [23, 10, 18]), or in some other way that you can come up with (e.g., if you have some structured input you want to represent in a special way in your data arrays, you can do that here).

Parameters:

- token_min_padding_length : int, optional (default = 0)
  The minimum padding length required for the TokenIndexer. For example, the minimum padding length of SingleIdTokenIndexer is the largest filter size when using CnnEncoder. Note that if you set this for one TokenIndexer, you likely have to set it for all TokenIndexers for the same field, otherwise you'll get mismatched tensor sizes.
as_padded_tensor
(self, tokens: Dict[str, List[~TokenType]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor][source]¶ This method pads a list of tokens to
desired_num_tokens
and returns that padded list of input tokens as a torch Tensor. If the input token list is longer thandesired_num_tokens
then it will be truncated.padding_lengths
is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.Note that this method should be abstract, but it is implemented to allow backward compatability.
-
count_vocab_items
(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]¶ The
Vocabulary
needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
-
default_implementation
: str = 'single_id'¶
-
get_keys
(self, index_name: str) → List[str][source]¶ Return a list of the keys this indexer return from
tokens_to_indices
.
-
get_padding_lengths
(self, token: ~TokenType) → Dict[str, int][source]¶ This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.
-
get_padding_token
(self) → ~TokenType[source]¶ Deprecated. Please just implement the padding token in as_padded_tensor instead. TODO(Mark): remove in 1.0 release. This is only a concrete implementation to preserve backward compatability, otherwise it would be abstract.
When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by
tokens_to_indices()
.
-
get_token_min_padding_length
(self) → int[source]¶ This method returns the minimum padding length required for this TokenIndexer. For example, the minimum padding length of SingleIdTokenIndexer is the largest size of filter when using CnnEncoder.
-
has_warned_for_as_padded_tensor
= False¶
-
pad_token_sequence
(self, tokens: Dict[str, List[~TokenType]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, ~TokenType][source]¶ Deprecated. Please use as_padded_tensor instead. TODO(Mark): remove in 1.0 release.
-
tokens_to_indices
(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[~TokenType]][source]¶ Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
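To make the two-phase contract concrete, here is a minimal sketch of the vocabulary-counting and indexing steps using the default 'single_id' implementation. It assumes AllenNLP ~0.9-style imports and a counter built with defaultdicts; exact ids depend on your vocabulary:

    from collections import defaultdict

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    tokens = [Token("The"), Token("cat"), Token("sat")]
    indexer = SingleIdTokenIndexer(namespace="tokens", lowercase_tokens=True)

    # Phase 1: count vocabulary items so the Vocabulary can assign indices.
    counter = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        indexer.count_vocab_items(token, counter)
    vocab = Vocabulary(counter)

    # Phase 2: convert the same tokens to indices using the vocabulary we just built.
    indices = indexer.tokens_to_indices(tokens, vocab, "tokens")
    print(indices)  # e.g. {"tokens": [2, 3, 4]} -- exact ids depend on the vocabulary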
class allennlp.data.token_indexers.dep_label_indexer.DepLabelIndexer(namespace: str = 'dep_labels', token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their syntactic dependency label, as determined by the dep_ field on Token.

Parameters:

- namespace : str, optional (default = 'dep_labels')
  We will use this namespace in the Vocabulary to map strings to indices.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
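For illustration, a minimal sketch of using DepLabelIndexer directly. In practice the dep_ field is filled in by a spacy-backed tokenizer with dependency parsing enabled; here we set it by hand, and we assume Token accepts a dep_ keyword argument:

    from collections import defaultdict

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import DepLabelIndexer
    from allennlp.data.tokenizers import Token

    # dep_ is normally set by the tokenizer; set it manually to keep the sketch self-contained.
    tokens = [Token("dogs", dep_="nsubj"), Token("bark", dep_="ROOT")]

    indexer = DepLabelIndexer()  # namespace defaults to "dep_labels"
    counter = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        indexer.count_vocab_items(token, counter)
    vocab = Vocabulary(counter)

    print(indexer.tokens_to_indices(tokens, vocab, "dep_labels"))
    # -> {"dep_labels": [id_for_nsubj, id_for_ROOT]}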
class allennlp.data.token_indexers.ner_tag_indexer.NerTagIndexer(namespace: str = 'ner_tokens', token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their entity type (i.e., their NER tag), as determined by the ent_type_ field on Token.

Parameters:

- namespace : str, optional (default = 'ner_tokens')
  We will use this namespace in the Vocabulary to map strings to indices.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
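A minimal sketch, analogous to the DepLabelIndexer example above. It assumes Token accepts an ent_type_ keyword argument, which is normally populated by a spacy-based tokenizer with NER enabled:

    from allennlp.data.token_indexers import NerTagIndexer
    from allennlp.data.tokenizers import Token

    # ent_type_ is usually filled in by the tokenizer; set it by hand here.
    tokens = [Token("Paris", ent_type_="GPE"), Token("sleeps")]
    indexer = NerTagIndexer()  # namespace defaults to "ner_tokens"
    # count_vocab_items / tokens_to_indices then work exactly as for DepLabelIndexer,
    # only over the "ner_tokens" namespace.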
class allennlp.data.token_indexers.pos_tag_indexer.PosTagIndexer(namespace: str = 'pos_tokens', coarse_tags: bool = False, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their part-of-speech tag, as determined by the pos_ or tag_ fields on Token (corresponding to spacy’s coarse-grained and fine-grained POS tags, respectively).

Parameters:

- namespace : str, optional (default = 'pos_tokens')
  We will use this namespace in the Vocabulary to map strings to indices.
- coarse_tags : bool, optional (default = False)
  If True, we will use coarse POS tags instead of the default fine-grained POS tags.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
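A short sketch showing the effect of coarse_tags. It assumes Token accepts pos_ and tag_ keyword arguments, which a POS-tagging tokenizer would normally fill in:

    from allennlp.data.token_indexers import PosTagIndexer
    from allennlp.data.tokenizers import Token

    # pos_ holds the coarse-grained tag and tag_ the fine-grained one (spacy convention).
    tokens = [Token("dogs", pos_="NOUN", tag_="NNS"),
              Token("bark", pos_="VERB", tag_="VBP")]

    fine_indexer = PosTagIndexer()                    # indexes tag_ ("NNS", "VBP")
    coarse_indexer = PosTagIndexer(coarse_tags=True)  # indexes pos_ ("NOUN", "VERB")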
class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer(namespace: str = 'tokens', lowercase_tokens: bool = False, start_tokens: List[str] = None, end_tokens: List[str] = None, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as single integers.

Parameters:

- namespace : str, optional (default = 'tokens')
  We will use this namespace in the Vocabulary to map strings to indices.
- lowercase_tokens : bool, optional (default = False)
  If True, we will call token.lower() before getting an index for the token from the vocabulary.
- start_tokens : List[str], optional (default = None)
  These are prepended to the tokens provided to tokens_to_indices.
- end_tokens : List[str], optional (default = None)
  These are appended to the tokens provided to tokens_to_indices.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
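In practice a SingleIdTokenIndexer is usually attached to a TextField rather than called directly. A brief sketch, assuming the standard TextField API; the "tokens" key and the "@start@"/"@end@" symbols are chosen here for illustration:

    from allennlp.data.fields import TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    indexer = SingleIdTokenIndexer(
        lowercase_tokens=True,
        start_tokens=["@start@"],  # prepended before indexing
        end_tokens=["@end@"],      # appended after the last token
    )

    field = TextField([Token("A"), Token("sentence"), Token(".")],
                      token_indexers={"tokens": indexer})
    # When the Instance is indexed and batched, the field delegates to the indexer's
    # count_vocab_items, tokens_to_indices, and as_padded_tensor methods.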
class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer(namespace: str = 'token_characters', character_tokenizer: allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer = <allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer object>, start_tokens: List[str] = None, end_tokens: List[str] = None, min_padding_length: int = 0, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as lists of character indices.

Parameters:

- namespace : str, optional (default = 'token_characters')
  We will use this namespace in the Vocabulary to map the characters in each token to indices.
- character_tokenizer : CharacterTokenizer, optional (default = CharacterTokenizer())
  We use a CharacterTokenizer to handle splitting tokens into characters, as it has options for byte encoding and other things. The default here is to instantiate a CharacterTokenizer with its default parameters, which uses unicode characters and retains casing.
- start_tokens : List[str], optional (default = None)
  These are prepended to the tokens provided to tokens_to_indices.
- end_tokens : List[str], optional (default = None)
  These are appended to the tokens provided to tokens_to_indices.
- min_padding_length : int, optional (default = 0)
  We use this value as the minimum length of padding. It is usually used with CnnEncoder, in which case it should be set to the maximum value of ngram_filter_sizes.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[List[int]]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: List[int]) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token(self) → List[int]

Deprecated. Please just implement the padding token in as_padded_tensor instead. TODO(Mark): remove in the 1.0 release. This is only a concrete implementation to preserve backward compatibility; otherwise it would be abstract.

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[List[int]]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
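For example, if a downstream CnnEncoder uses ngram_filter_sizes = (2, 3, 4, 5), every token needs at least 5 character positions so the widest filter can be applied. A minimal sketch (parameter values are illustrative, and the example output ids are made up):

    from allennlp.data.token_indexers import TokenCharactersIndexer

    # Pad every token to at least 5 characters, matching the widest CNN filter.
    char_indexer = TokenCharactersIndexer(
        namespace="token_characters",
        min_padding_length=5,
    )
    # tokens_to_indices then returns, for each token, a list of character ids, e.g.
    # {"token_characters": [[5, 9, 2], [7, 3, 3, 8, 1], ...]}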
class allennlp.data.token_indexers.elmo_indexer.ELMoCharacterMapper(tokens_to_add: Dict[str, int] = None)

Bases: object

Maps individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here as a special case of the existing character indexers.

Optional additional special tokens, with designated character ids, can be added via tokens_to_add.

- beginning_of_sentence_character = 256
- beginning_of_sentence_characters = [258, 256, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
- beginning_of_word_character = 258
- bos_token = '<S>'
- end_of_sentence_character = 257
- end_of_sentence_characters = [258, 257, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
- end_of_word_character = 259
- eos_token = '</S>'
- max_word_length = 50
- padding_character = 260
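The per-token encoding always occupies max_word_length slots: a beginning-of-word marker, the token's character ids, an end-of-word marker, and padding. The constant beginning_of_sentence_characters above is exactly this layout applied to the beginning-of-sentence symbol, as the small sketch below reconstructs using only the constants listed here:

    from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper

    m = ELMoCharacterMapper
    # begin-of-word, the special character itself, end-of-word, then padding out to 50 slots.
    reconstructed = (
        [m.beginning_of_word_character, m.beginning_of_sentence_character, m.end_of_word_character]
        + [m.padding_character] * (m.max_word_length - 3)
    )
    assert reconstructed == m.beginning_of_sentence_characters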
class allennlp.data.token_indexers.elmo_indexer.ELMoTokenCharactersIndexer(namespace: str = 'elmo_characters', tokens_to_add: Dict[str, int] = None, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Converts a token to an array of character ids used to compute ELMo representations.

Parameters:

- namespace : str, optional (default = 'elmo_characters')
- tokens_to_add : Dict[str, int], optional (default = None)
  If not None, provides a mapping from special tokens to character ids. When using pre-trained models, the character ids must be less than 261, and we recommend using unused ids (e.g. 1-32).
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[List[int]]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: List[int]) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[List[int]]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
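A minimal sketch of indexing with ELMo character ids. It assumes the ids are computed from the token text alone, so the contents of the Vocabulary are not actually consulted and an empty Vocabulary() is sufficient:

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import ELMoTokenCharactersIndexer
    from allennlp.data.tokenizers import Token

    indexer = ELMoTokenCharactersIndexer()
    output = indexer.tokens_to_indices([Token("Deep"), Token("learning")], Vocabulary(), "elmo")
    # output["elmo"] has one entry per token, each a list of
    # ELMoCharacterMapper.max_word_length (= 50) character ids.
    print(len(output["elmo"]), len(output["elmo"][0]))  # 2 50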
class allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.OpenaiTransformerBytePairIndexer(encoder: Dict[str, int] = None, byte_pairs: List[Tuple[str, str]] = None, n_ctx: int = 512, model_path: str = None, namespace: str = 'openai_transformer', tokens_to_add: List[str] = None, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Generates the indices for the byte-pair encoding used by the OpenAI transformer language model: https://blog.openai.com/language-unsupervised/

This is unlike most of our TokenIndexers in that its indexing is not based on a Vocabulary but on a fixed set of mappings that are loaded by the constructor.

Note: we recommend using the OpenAISplitter tokenizer with this indexer, as it applies the same text normalization as the original implementation.

Note 2: when tokens_to_add is not None, be sure to set n_special=len(tokens_to_add) in OpenaiTransformer, otherwise behavior is undefined.

as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

byte_pair_encode(self, token: allennlp.data.tokenizers.token.Token, lowercase: bool = True) → List[str]

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
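A brief construction sketch. The archive path below is a placeholder, not a real location, and whether your particular archive format is accepted via model_path is an assumption to verify against your version:

    from allennlp.data.token_indexers import OpenaiTransformerBytePairIndexer

    indexer = OpenaiTransformerBytePairIndexer(
        model_path="/path/to/openai_transformer_model.tar.gz",  # placeholder path
        tokens_to_add=["<question>", "<answer>"],               # optional special tokens
    )
    # Per the note above: because tokens_to_add is given, construct the OpenaiTransformer
    # embedder with n_special=len(tokens_to_add), i.e. n_special=2 here.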
allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.text_standardize(text)

Applies text standardization following the original implementation.
class allennlp.data.token_indexers.wordpiece_indexer.PretrainedBertIndexer(pretrained_model: str, use_starting_offsets: bool = False, do_lowercase: bool = True, never_lowercase: List[str] = None, max_pieces: int = 512, truncate_long_sequences: bool = True)

Bases: allennlp.data.token_indexers.wordpiece_indexer.WordpieceIndexer

A TokenIndexer corresponding to a pretrained BERT model.

Parameters:

- pretrained_model : str
  Either the name of the pretrained model to use (e.g. 'bert-base-uncased'), or the path to the .txt file with its vocabulary. If the name is a key in the list of pretrained models at https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py#L33, the corresponding path will be used; otherwise it will be interpreted as a path or URL.
- use_starting_offsets : bool, optional (default = False)
  By default, the “offsets” created by the token indexer correspond to the last wordpiece in each word. If use_starting_offsets is specified, they will instead correspond to the first wordpiece in each word.
- do_lowercase : bool, optional (default = True)
  Whether to lowercase the tokens before converting them to wordpiece ids.
- never_lowercase : List[str], optional
  Tokens that should never be lowercased. Default is ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'].
- max_pieces : int, optional (default = 512)
  The BERT embedder uses positional embeddings and so has a corresponding maximum length for its input ids. Any inputs longer than this will either be truncated (default), or be split apart and batched using a sliding window.
- truncate_long_sequences : bool, optional (default = True)
  By default, long sequences will be truncated to the maximum sequence length. Otherwise, they will be split apart and batched using a sliding window.
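A typical construction, sketched under the assumption that the 'bert-base-uncased' vocabulary can be resolved by name as described above:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    bert_indexer = PretrainedBertIndexer(
        pretrained_model="bert-base-uncased",  # resolved via the pretrained-model list, or pass a vocab .txt path
        do_lowercase=True,                     # must match the -uncased / -cased variant you chose
        use_starting_offsets=True,             # offsets point at the first wordpiece of each word
    )
    # This indexer is normally paired with a BERT token embedder that consumes both
    # the wordpiece ids and the offsets produced here.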
class allennlp.data.token_indexers.wordpiece_indexer.WordpieceIndexer(vocab: Dict[str, int], wordpiece_tokenizer: Callable[[str], List[str]], namespace: str = 'wordpiece', use_starting_offsets: bool = False, max_pieces: int = 512, do_lowercase: bool = False, never_lowercase: List[str] = None, start_tokens: List[str] = None, end_tokens: List[str] = None, separator_token: str = '[SEP]', truncate_long_sequences: bool = True, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

A token indexer that does the wordpiece tokenization (e.g. for BERT embeddings). If you are using one of the pretrained BERT models, you'll want to use the PretrainedBertIndexer subclass rather than this base class.

Parameters:

- vocab : Dict[str, int]
  The mapping {wordpiece -> id}. Note this is not an AllenNLP Vocabulary.
- wordpiece_tokenizer : Callable[[str], List[str]]
  A function that does the actual tokenization.
- namespace : str, optional (default = 'wordpiece')
  The namespace in the AllenNLP Vocabulary into which the wordpieces will be loaded.
- use_starting_offsets : bool, optional (default = False)
  By default, the “offsets” created by the token indexer correspond to the last wordpiece in each word. If use_starting_offsets is specified, they will instead correspond to the first wordpiece in each word.
- max_pieces : int, optional (default = 512)
  The BERT embedder uses positional embeddings and so has a corresponding maximum length for its input ids. Any inputs longer than this will either be truncated (default), or be split apart and batched using a sliding window.
- do_lowercase : bool, optional (default = False)
  Should we lowercase the provided tokens before getting the indices? You would need to do this if you are using an '-uncased' BERT model but your DatasetReader is not lowercasing tokens (which might be the case if you're also using other embeddings based on cased tokens).
- never_lowercase : List[str], optional
  Tokens that should never be lowercased. Default is ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'].
- start_tokens : List[str], optional (default = None)
  These are prepended to the tokens provided to tokens_to_indices.
- end_tokens : List[str], optional (default = None)
  These are appended to the tokens provided to tokens_to_indices.
- separator_token : str, optional (default = '[SEP]')
  This token indicates the segments in the sequence.
- truncate_long_sequences : bool, optional (default = True)
  By default, long sequences will be truncated to the maximum sequence length. Otherwise, they will be split apart and batched using a sliding window.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_keys(self, index_name: str) → List[str]

We need to override this because the indexer generates multiple keys.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
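Because get_keys is overridden, a single wordpiece indexer produces several arrays per field rather than one list of ids. The exact key names in the comment below are from memory and may differ between versions; treat them as an assumption and check with get_keys directly:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    indexer = PretrainedBertIndexer(pretrained_model="bert-base-uncased")
    print(indexer.get_keys("bert"))
    # Typically something like ['bert', 'bert-offsets', 'bert-type-ids', 'mask']:
    # wordpiece ids, one offset per original token, segment ids, and a mask.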
class allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer(model_name: str, do_lowercase: bool, namespace: str = 'tags', token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer uses a tokenizer from the pytorch_transformers repository to index tokens. This Indexer is only really appropriate to use if you've also used a corresponding PretrainedTransformerTokenizer to tokenize your input. Otherwise you'll have a mismatch between your tokens and your vocabulary, and you'll get a lot of UNK tokens.

Parameters:

- model_name : str
  The name of the pytorch_transformers model to use.
- do_lowercase : bool
  Whether to lowercase the tokens (this should match the casing of the model name that you pass).
- namespace : str, optional (default = 'tags')
  We will add the tokens in the pytorch_transformers vocabulary to this vocabulary namespace. We use a somewhat confusing default value of tags so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
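A sketch of the intended pairing with PretrainedTransformerTokenizer. The tokenizer's constructor arguments are assumed to mirror the indexer's; verify them against your version:

    from allennlp.data.token_indexers import PretrainedTransformerIndexer
    from allennlp.data.tokenizers import PretrainedTransformerTokenizer

    model_name = "bert-base-uncased"
    tokenizer = PretrainedTransformerTokenizer(model_name=model_name, do_lowercase=True)
    indexer = PretrainedTransformerIndexer(model_name=model_name, do_lowercase=True)

    tokens = tokenizer.tokenize("AllenNLP is great.")
    # Because both objects wrap the same pytorch_transformers tokenizer, the token
    # strings produced here line up with the ids the indexer will look up.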
class allennlp.data.token_indexers.spacy_indexer.SpacyTokenIndexer(hidden_dim: int = 96, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This SpacyTokenIndexer represents tokens as word vectors from a spacy model. You might want to do this for two main reasons: easier integration with a spacy pipeline, and no out-of-vocabulary tokens.

Parameters:

- hidden_dim : int, optional (default = 96)
  The dimension of the vectors that spacy generates for representing words.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[numpy.ndarray]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: numpy.ndarray) → Dict[str, numpy.ndarray]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token(self) → numpy.ndarray

Deprecated. Please just implement the padding token in as_padded_tensor instead. TODO(Mark): remove in the 1.0 release. This is only a concrete implementation to preserve backward compatibility; otherwise it would be abstract.

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

tokens_to_indices(self, tokens: List[spacy.tokens.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[numpy.ndarray]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
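A minimal sketch. Note that tokens_to_indices here expects spacy Token objects (not AllenNLP Tokens), so we run spacy directly; the model name 'en_core_web_sm' is an assumption about what you have installed, and hidden_dim must match the width of the vectors that model produces:

    import spacy

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import SpacyTokenIndexer

    nlp = spacy.load("en_core_web_sm")          # any spacy pipeline that produces token vectors
    spacy_tokens = list(nlp("A short sentence."))

    indexer = SpacyTokenIndexer(hidden_dim=96)  # must match the width of the spacy vectors
    arrays = indexer.tokens_to_indices(spacy_tokens, Vocabulary(), "spacy_vectors")
    # arrays["spacy_vectors"] is a list of numpy arrays, one per token, each of size hidden_dim.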