allennlp.data.token_indexers

A TokenIndexer determines how string tokens get represented as arrays of indices in a model.
class allennlp.data.token_indexers.token_indexer.TokenIndexer(token_min_padding_length: int = 0)

Bases: typing.Generic, allennlp.common.registrable.Registrable

A TokenIndexer determines how string tokens get represented as arrays of indices in a model. This class both converts strings into numerical values (with the help of a Vocabulary) and produces the actual arrays.

Tokens can be represented as single IDs (e.g., the word “cat” gets represented by the number 34), as lists of character IDs (e.g., “cat” gets represented by the numbers [23, 10, 18]), or in some other way that you can come up with (e.g., if you have some structured input you want to represent in a special way in your data arrays, you can do that here).

Parameters:

- token_min_padding_length : int, optional (default = 0)
  The minimum padding length required for the TokenIndexer. For example, the minimum padding length of SingleIdTokenIndexer is the largest filter size when using CnnEncoder. Note that if you set this for one TokenIndexer, you likely have to set it for all TokenIndexers for the same field, otherwise you'll get mismatched tensor sizes.
as_padded_tensor
(self, tokens: Dict[str, List[~TokenType]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor][source]¶ This method pads a list of tokens to
desired_num_tokens
and returns that padded list of input tokens as a torch Tensor. If the input token list is longer thandesired_num_tokens
then it will be truncated.padding_lengths
is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.Note that this method should be abstract, but it is implemented to allow backward compatability.
-
count_vocab_items
(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]¶ The
Vocabulary
needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
-
default_implementation
: str = 'single_id'¶
-
get_keys
(self, index_name: str) → List[str][source]¶ Return a list of the keys this indexer return from
tokens_to_indices
.
-
get_padding_lengths
(self, token: ~TokenType) → Dict[str, int][source]¶ This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.
-
get_padding_token
(self) → ~TokenType[source]¶ Deprecated. Please just implement the padding token in as_padded_tensor instead. TODO(Mark): remove in 1.0 release. This is only a concrete implementation to preserve backward compatability, otherwise it would be abstract.
When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by
tokens_to_indices()
.
-
get_token_min_padding_length
(self) → int[source]¶ This method returns the minimum padding length required for this TokenIndexer. For example, the minimum padding length of SingleIdTokenIndexer is the largest size of filter when using CnnEncoder.
-
has_warned_for_as_padded_tensor
= False¶
-
pad_token_sequence
(self, tokens: Dict[str, List[~TokenType]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, ~TokenType][source]¶ Deprecated. Please use as_padded_tensor instead. TODO(Mark): remove in 1.0 release.
-
tokens_to_indices
(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[~TokenType]][source]¶ Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
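To make the two-phase contract concrete, here is a minimal sketch of the vocabulary-counting and indexing steps using the default 'single_id' implementation. It assumes AllenNLP ~0.9-style imports and a counter built with defaultdicts; exact ids depend on your vocabulary:

    from collections import defaultdict

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    tokens = [Token("The"), Token("cat"), Token("sat")]
    indexer = SingleIdTokenIndexer(namespace="tokens", lowercase_tokens=True)

    # Phase 1: count vocabulary items so the Vocabulary can assign indices.
    counter = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        indexer.count_vocab_items(token, counter)
    vocab = Vocabulary(counter)

    # Phase 2: convert the same tokens to indices using the vocabulary we just built.
    indices = indexer.tokens_to_indices(tokens, vocab, "tokens")
    print(indices)  # e.g. {"tokens": [2, 3, 4]} -- exact ids depend on the vocabulary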
class allennlp.data.token_indexers.dep_label_indexer.DepLabelIndexer(namespace: str = 'dep_labels', token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their syntactic dependency label, as determined by the dep_ field on Token.

Parameters:

- namespace : str, optional (default = 'dep_labels')
  We will use this namespace in the Vocabulary to map strings to indices.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
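For illustration, a minimal sketch of using DepLabelIndexer directly. In practice the dep_ field is filled in by a spacy-backed tokenizer with dependency parsing enabled; here we set it by hand, and we assume Token accepts a dep_ keyword argument:

    from collections import defaultdict

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import DepLabelIndexer
    from allennlp.data.tokenizers import Token

    # dep_ is normally set by the tokenizer; set it manually to keep the sketch self-contained.
    tokens = [Token("dogs", dep_="nsubj"), Token("bark", dep_="ROOT")]

    indexer = DepLabelIndexer()  # namespace defaults to "dep_labels"
    counter = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        indexer.count_vocab_items(token, counter)
    vocab = Vocabulary(counter)

    print(indexer.tokens_to_indices(tokens, vocab, "dep_labels"))
    # -> {"dep_labels": [id_for_nsubj, id_for_ROOT]}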
class allennlp.data.token_indexers.ner_tag_indexer.NerTagIndexer(namespace: str = 'ner_tokens', token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their entity type (i.e., their NER tag), as determined by the ent_type_ field on Token.

Parameters:

- namespace : str, optional (default = 'ner_tokens')
  We will use this namespace in the Vocabulary to map strings to indices.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
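A minimal sketch, analogous to the DepLabelIndexer example above. It assumes Token accepts an ent_type_ keyword argument, which is normally populated by a spacy-based tokenizer with NER enabled:

    from allennlp.data.token_indexers import NerTagIndexer
    from allennlp.data.tokenizers import Token

    # ent_type_ is usually filled in by the tokenizer; set it by hand here.
    tokens = [Token("Paris", ent_type_="GPE"), Token("sleeps")]
    indexer = NerTagIndexer()  # namespace defaults to "ner_tokens"
    # count_vocab_items / tokens_to_indices then work exactly as for DepLabelIndexer,
    # only over the "ner_tokens" namespace.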
class allennlp.data.token_indexers.pos_tag_indexer.PosTagIndexer(namespace: str = 'pos_tokens', coarse_tags: bool = False, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their part-of-speech tag, as determined by the pos_ or tag_ fields on Token (corresponding to spacy’s coarse-grained and fine-grained POS tags, respectively).

Parameters:

- namespace : str, optional (default = 'pos_tokens')
  We will use this namespace in the Vocabulary to map strings to indices.
- coarse_tags : bool, optional (default = False)
  If True, we will use coarse POS tags instead of the default fine-grained POS tags.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
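A short sketch showing the effect of coarse_tags. It assumes Token accepts pos_ and tag_ keyword arguments, which a POS-tagging tokenizer would normally fill in:

    from allennlp.data.token_indexers import PosTagIndexer
    from allennlp.data.tokenizers import Token

    # pos_ holds the coarse-grained tag and tag_ the fine-grained one (spacy convention).
    tokens = [Token("dogs", pos_="NOUN", tag_="NNS"),
              Token("bark", pos_="VERB", tag_="VBP")]

    fine_indexer = PosTagIndexer()                    # indexes tag_ ("NNS", "VBP")
    coarse_indexer = PosTagIndexer(coarse_tags=True)  # indexes pos_ ("NOUN", "VERB")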
class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer(namespace: str = 'tokens', lowercase_tokens: bool = False, start_tokens: List[str] = None, end_tokens: List[str] = None, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as single integers.

Parameters:

- namespace : str, optional (default = 'tokens')
  We will use this namespace in the Vocabulary to map strings to indices.
- lowercase_tokens : bool, optional (default = False)
  If True, we will call token.lower() before getting an index for the token from the vocabulary.
- start_tokens : List[str], optional (default = None)
  These are prepended to the tokens provided to tokens_to_indices.
- end_tokens : List[str], optional (default = None)
  These are appended to the tokens provided to tokens_to_indices.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
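In practice a SingleIdTokenIndexer is usually attached to a TextField rather than called directly. A brief sketch, assuming the standard TextField API; the "tokens" key and the "@start@"/"@end@" symbols are chosen here for illustration:

    from allennlp.data.fields import TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    indexer = SingleIdTokenIndexer(
        lowercase_tokens=True,
        start_tokens=["@start@"],  # prepended before indexing
        end_tokens=["@end@"],      # appended after the last token
    )

    field = TextField([Token("A"), Token("sentence"), Token(".")],
                      token_indexers={"tokens": indexer})
    # When the Instance is indexed and batched, the field delegates to the indexer's
    # count_vocab_items, tokens_to_indices, and as_padded_tensor methods.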
class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer(namespace: str = 'token_characters', character_tokenizer: allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer = <allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer object>, start_tokens: List[str] = None, end_tokens: List[str] = None, min_padding_length: int = 0, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as lists of character indices.

Parameters:

- namespace : str, optional (default = 'token_characters')
  We will use this namespace in the Vocabulary to map the characters in each token to indices.
- character_tokenizer : CharacterTokenizer, optional (default = CharacterTokenizer())
  We use a CharacterTokenizer to handle splitting tokens into characters, as it has options for byte encoding and other things. The default here is to instantiate a CharacterTokenizer with its default parameters, which uses unicode characters and retains casing.
- start_tokens : List[str], optional (default = None)
  These are prepended to the tokens provided to tokens_to_indices.
- end_tokens : List[str], optional (default = None)
  These are appended to the tokens provided to tokens_to_indices.
- min_padding_length : int, optional (default = 0)
  We use this value as the minimum length of padding. It is usually used with CnnEncoder, in which case it should be set to the maximum value of ngram_filter_sizes.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[List[int]]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: List[int]) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token(self) → List[int]

Deprecated. Please just implement the padding token in as_padded_tensor instead. TODO(Mark): remove in the 1.0 release. This is only a concrete implementation to preserve backward compatibility; otherwise it would be abstract.

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[List[int]]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
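For example, if a downstream CnnEncoder uses ngram_filter_sizes = (2, 3, 4, 5), every token needs at least 5 character positions so the widest filter can be applied. A minimal sketch (parameter values are illustrative, and the example output ids are made up):

    from allennlp.data.token_indexers import TokenCharactersIndexer

    # Pad every token to at least 5 characters, matching the widest CNN filter.
    char_indexer = TokenCharactersIndexer(
        namespace="token_characters",
        min_padding_length=5,
    )
    # tokens_to_indices then returns, for each token, a list of character ids, e.g.
    # {"token_characters": [[5, 9, 2], [7, 3, 3, 8, 1], ...]}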
class allennlp.data.token_indexers.elmo_indexer.ELMoCharacterMapper(tokens_to_add: Dict[str, int] = None)

Bases: object

Maps individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here as a special case of the existing character indexers.

Optional additional special tokens, with designated character ids, can be added via tokens_to_add.

- beginning_of_sentence_character = 256
- beginning_of_sentence_characters = [258, 256, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
- beginning_of_word_character = 258
- bos_token = '<S>'
- end_of_sentence_character = 257
- end_of_sentence_characters = [258, 257, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
- end_of_word_character = 259
- eos_token = '</S>'
- max_word_length = 50
- padding_character = 260
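The per-token encoding always occupies max_word_length slots: a beginning-of-word marker, the token's character ids, an end-of-word marker, and padding. The constant beginning_of_sentence_characters above is exactly this layout applied to the beginning-of-sentence symbol, as the small sketch below reconstructs using only the constants listed here:

    from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper

    m = ELMoCharacterMapper
    # begin-of-word, the special character itself, end-of-word, then padding out to 50 slots.
    reconstructed = (
        [m.beginning_of_word_character, m.beginning_of_sentence_character, m.end_of_word_character]
        + [m.padding_character] * (m.max_word_length - 3)
    )
    assert reconstructed == m.beginning_of_sentence_characters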
class allennlp.data.token_indexers.elmo_indexer.ELMoTokenCharactersIndexer(namespace: str = 'elmo_characters', tokens_to_add: Dict[str, int] = None, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Converts a token to an array of character ids used to compute ELMo representations.

Parameters:

- namespace : str, optional (default = 'elmo_characters')
- tokens_to_add : Dict[str, int], optional (default = None)
  If not None, provides a mapping from special tokens to character ids. When using pre-trained models, the character ids must be less than 261, and we recommend using unused ids (e.g. 1-32).
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[List[int]]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: List[int]) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[List[int]]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
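A minimal sketch of indexing with ELMo character ids. It assumes the ids are computed from the token text alone, so the contents of the Vocabulary are not actually consulted and an empty Vocabulary() is sufficient:

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import ELMoTokenCharactersIndexer
    from allennlp.data.tokenizers import Token

    indexer = ELMoTokenCharactersIndexer()
    output = indexer.tokens_to_indices([Token("Deep"), Token("learning")], Vocabulary(), "elmo")
    # output["elmo"] has one entry per token, each a list of
    # ELMoCharacterMapper.max_word_length (= 50) character ids.
    print(len(output["elmo"]), len(output["elmo"][0]))  # 2 50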
class allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.OpenaiTransformerBytePairIndexer(encoder: Dict[str, int] = None, byte_pairs: List[Tuple[str, str]] = None, n_ctx: int = 512, model_path: str = None, namespace: str = 'openai_transformer', tokens_to_add: List[str] = None, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Generates the indices for the byte-pair encoding used by the OpenAI transformer language model: https://blog.openai.com/language-unsupervised/

This is unlike most of our TokenIndexers in that its indexing is not based on a Vocabulary but on a fixed set of mappings that are loaded by the constructor.

Note: we recommend using the OpenAISplitter tokenizer with this indexer, as it applies the same text normalization as the original implementation.

Note 2: when tokens_to_add is not None, be sure to set n_special=len(tokens_to_add) in OpenaiTransformer, otherwise behavior is undefined.

as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

byte_pair_encode(self, token: allennlp.data.tokenizers.token.Token, lowercase: bool = True) → List[str]

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
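A brief construction sketch. The archive path below is a placeholder, not a real location, and whether your particular archive format is accepted via model_path is an assumption to verify against your version:

    from allennlp.data.token_indexers import OpenaiTransformerBytePairIndexer

    indexer = OpenaiTransformerBytePairIndexer(
        model_path="/path/to/openai_transformer_model.tar.gz",  # placeholder path
        tokens_to_add=["<question>", "<answer>"],               # optional special tokens
    )
    # Per the note above: because tokens_to_add is given, construct the OpenaiTransformer
    # embedder with n_special=len(tokens_to_add), i.e. n_special=2 here.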
allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.text_standardize(text)

Applies text standardization following the original implementation.
class allennlp.data.token_indexers.wordpiece_indexer.PretrainedBertIndexer(pretrained_model: str, use_starting_offsets: bool = False, do_lowercase: bool = True, never_lowercase: List[str] = None, max_pieces: int = 512, truncate_long_sequences: bool = True)

Bases: allennlp.data.token_indexers.wordpiece_indexer.WordpieceIndexer

A TokenIndexer corresponding to a pretrained BERT model.

Parameters:

- pretrained_model : str
  Either the name of the pretrained model to use (e.g. 'bert-base-uncased'), or the path to the .txt file with its vocabulary. If the name is a key in the list of pretrained models at https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py#L33, the corresponding path will be used; otherwise it will be interpreted as a path or URL.
- use_starting_offsets : bool, optional (default = False)
  By default, the “offsets” created by the token indexer correspond to the last wordpiece in each word. If use_starting_offsets is specified, they will instead correspond to the first wordpiece in each word.
- do_lowercase : bool, optional (default = True)
  Whether to lowercase the tokens before converting them to wordpiece ids.
- never_lowercase : List[str], optional
  Tokens that should never be lowercased. Default is ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'].
- max_pieces : int, optional (default = 512)
  The BERT embedder uses positional embeddings and so has a corresponding maximum length for its input ids. Any inputs longer than this will either be truncated (default), or be split apart and batched using a sliding window.
- truncate_long_sequences : bool, optional (default = True)
  By default, long sequences will be truncated to the maximum sequence length. Otherwise, they will be split apart and batched using a sliding window.
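A typical construction, sketched under the assumption that the 'bert-base-uncased' vocabulary can be resolved by name as described above:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    bert_indexer = PretrainedBertIndexer(
        pretrained_model="bert-base-uncased",  # resolved via the pretrained-model list, or pass a vocab .txt path
        do_lowercase=True,                     # must match the -uncased / -cased variant you chose
        use_starting_offsets=True,             # offsets point at the first wordpiece of each word
    )
    # This indexer is normally paired with a BERT token embedder that consumes both
    # the wordpiece ids and the offsets produced here.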
class allennlp.data.token_indexers.wordpiece_indexer.WordpieceIndexer(vocab: Dict[str, int], wordpiece_tokenizer: Callable[[str], List[str]], namespace: str = 'wordpiece', use_starting_offsets: bool = False, max_pieces: int = 512, do_lowercase: bool = False, never_lowercase: List[str] = None, start_tokens: List[str] = None, end_tokens: List[str] = None, separator_token: str = '[SEP]', truncate_long_sequences: bool = True, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

A token indexer that does the wordpiece tokenization (e.g. for BERT embeddings). If you are using one of the pretrained BERT models, you'll want to use the PretrainedBertIndexer subclass rather than this base class.

Parameters:

- vocab : Dict[str, int]
  The mapping {wordpiece -> id}. Note this is not an AllenNLP Vocabulary.
- wordpiece_tokenizer : Callable[[str], List[str]]
  A function that does the actual tokenization.
- namespace : str, optional (default = 'wordpiece')
  The namespace in the AllenNLP Vocabulary into which the wordpieces will be loaded.
- use_starting_offsets : bool, optional (default = False)
  By default, the “offsets” created by the token indexer correspond to the last wordpiece in each word. If use_starting_offsets is specified, they will instead correspond to the first wordpiece in each word.
- max_pieces : int, optional (default = 512)
  The BERT embedder uses positional embeddings and so has a corresponding maximum length for its input ids. Any inputs longer than this will either be truncated (default), or be split apart and batched using a sliding window.
- do_lowercase : bool, optional (default = False)
  Should we lowercase the provided tokens before getting the indices? You would need to do this if you are using an '-uncased' BERT model but your DatasetReader is not lowercasing tokens (which might be the case if you're also using other embeddings based on cased tokens).
- never_lowercase : List[str], optional
  Tokens that should never be lowercased. Default is ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'].
- start_tokens : List[str], optional (default = None)
  These are prepended to the tokens provided to tokens_to_indices.
- end_tokens : List[str], optional (default = None)
  These are appended to the tokens provided to tokens_to_indices.
- separator_token : str, optional (default = '[SEP]')
  This token indicates the segments in the sequence.
- truncate_long_sequences : bool, optional (default = True)
  By default, long sequences will be truncated to the maximum sequence length. Otherwise, they will be split apart and batched using a sliding window.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_keys(self, index_name: str) → List[str]

We need to override this because the indexer generates multiple keys.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
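Because get_keys is overridden, a single wordpiece indexer produces several arrays per field rather than one list of ids. The exact key names in the comment below are from memory and may differ between versions; treat them as an assumption and check with get_keys directly:

    from allennlp.data.token_indexers import PretrainedBertIndexer

    indexer = PretrainedBertIndexer(pretrained_model="bert-base-uncased")
    print(indexer.get_keys("bert"))
    # Typically something like ['bert', 'bert-offsets', 'bert-type-ids', 'mask']:
    # wordpiece ids, one offset per original token, segment ids, and a mask.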
class allennlp.data.token_indexers.pretrained_transformer_indexer.PretrainedTransformerIndexer(model_name: str, do_lowercase: bool, namespace: str = 'tags', token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer uses a tokenizer from the pytorch_transformers repository to index tokens. This Indexer is only really appropriate to use if you've also used a corresponding PretrainedTransformerTokenizer to tokenize your input. Otherwise you'll have a mismatch between your tokens and your vocabulary, and you'll get a lot of UNK tokens.

Parameters:

- model_name : str
  The name of the pytorch_transformers model to use.
- do_lowercase : bool
  Whether to lowercase the tokens (this should match the casing of the model name that you pass).
- namespace : str, optional (default = 'tags')
  We will add the tokens in the pytorch_transformers vocabulary to this vocabulary namespace. We use a somewhat confusing default value of tags so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
as_padded_tensor(self, tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: int) → Dict[str, int]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

tokens_to_indices(self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
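A sketch of the intended pairing with PretrainedTransformerTokenizer. The tokenizer's constructor arguments are assumed to mirror the indexer's; verify them against your version:

    from allennlp.data.token_indexers import PretrainedTransformerIndexer
    from allennlp.data.tokenizers import PretrainedTransformerTokenizer

    model_name = "bert-base-uncased"
    tokenizer = PretrainedTransformerTokenizer(model_name=model_name, do_lowercase=True)
    indexer = PretrainedTransformerIndexer(model_name=model_name, do_lowercase=True)

    tokens = tokenizer.tokenize("AllenNLP is great.")
    # Because both objects wrap the same pytorch_transformers tokenizer, the token
    # strings produced here line up with the ids the indexer will look up.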
class allennlp.data.token_indexers.spacy_indexer.SpacyTokenIndexer(hidden_dim: int = 96, token_min_padding_length: int = 0)

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This SpacyTokenIndexer represents tokens as word vectors from a spacy model. You might want to do this for two main reasons: easier integration with a spacy pipeline, and no out-of-vocabulary tokens.

Parameters:

- hidden_dim : int, optional (default = 96)
  The dimension of the vectors that spacy generates for representing words.
- token_min_padding_length : int, optional (default = 0)
  See TokenIndexer.
as_padded_tensor(self, tokens: Dict[str, List[numpy.ndarray]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]

This method pads a list of tokens to desired_num_tokens and returns that padded list of input tokens as a torch Tensor. If the input token list is longer than desired_num_tokens then it will be truncated. padding_lengths is used to provide supplemental padding parameters which are needed in some cases; for example, it contains the widths to pad characters to when doing character-level padding. Note that this method should be abstract, but it is implemented to allow backward compatibility.

count_vocab_items(self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out-of-vocabulary, token). This method takes a token and a dictionary of counts and increments the counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(self, token: numpy.ndarray) → Dict[str, numpy.ndarray]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token(self) → numpy.ndarray

Deprecated. Please just implement the padding token in as_padded_tensor instead. TODO(Mark): remove in the 1.0 release. This is only a concrete implementation to preserve backward compatibility; otherwise it would be abstract.

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

tokens_to_indices(self, tokens: List[spacy.tokens.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[numpy.ndarray]]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
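A minimal sketch. Note that tokens_to_indices here expects spacy Token objects (not AllenNLP Tokens), so we run spacy directly; the model name 'en_core_web_sm' is an assumption about what you have installed, and hidden_dim must match the width of the vectors that model produces:

    import spacy

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import SpacyTokenIndexer

    nlp = spacy.load("en_core_web_sm")          # any spacy pipeline that produces token vectors
    spacy_tokens = list(nlp("A short sentence."))

    indexer = SpacyTokenIndexer(hidden_dim=96)  # must match the width of the spacy vectors
    arrays = indexer.tokens_to_indices(spacy_tokens, Vocabulary(), "spacy_vectors")
    # arrays["spacy_vectors"] is a list of numpy arrays, one per token, each of size hidden_dim.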