PretrainedTransformerMismatchedIndexer( self, model_name: str, namespace: str = 'tags', max_length: int = None, **kwargs, ) -> None
Use this indexer when (for whatever reason) you are not using a corresponding
`PretrainedTransformerTokenizer` on your input. We assume that you used a tokenizer that splits
strings into words, while the transformer expects wordpieces as input. This indexer splits the
words into wordpieces and flattens them out. You should use the corresponding
`PretrainedTransformerMismatchedEmbedder` to embed these wordpieces and then pull out a single
vector for each original word.
Registered as a `TokenIndexer` with name "pretrained_transformer_mismatched".
- model_name : `str`
  The name of the `transformers` model to use.
- namespace : `str`, optional (default = `tags`)
  We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace. We use a somewhat confusing default value of `tags` so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
- max_length : `int`, optional (default = `None`)
  If positive, split the document into segments of this many tokens (including special tokens) before feeding into the embedder. The embedder embeds these segments independently and concatenates the results to get the original document representation. Should be set to the same value as the `max_length` option on the `PretrainedTransformerMismatchedEmbedder`.
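As an illustration of how this fits together, here is a minimal sketch, assuming the `bert-base-uncased` checkpoint is available and using an arbitrary field key of `"tokens"`: word-level tokens go into the indexer, and the matching embedder later pools wordpieces back into one vector per word.

```python
from allennlp.data.fields import TextField
from allennlp.data.tokenizers import Token
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer
from allennlp.modules.token_embedders import PretrainedTransformerMismatchedEmbedder

# Word-level tokens (NOT produced by a PretrainedTransformerTokenizer).
tokens = [Token(w) for w in "The quick brown fox jumped .".split()]

# The indexer splits each word into wordpieces and flattens them out.
indexer = PretrainedTransformerMismatchedIndexer(model_name="bert-base-uncased")
text_field = TextField(tokens, {"tokens": indexer})

# The corresponding embedder pools the wordpieces back into one vector per word.
embedder = PretrainedTransformerMismatchedEmbedder(model_name="bert-base-uncased")
```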
PretrainedTransformerMismatchedIndexer.as_padded_tensor_dict( self, tokens: Dict[str, List[Any]], padding_lengths: Dict[str, int], ) -> Dict[str, torch.Tensor]
This method pads a list of tokens given the input padding lengths (which could actually
truncate things, depending on settings) and returns that padded list of input tokens as a
`Dict[str, torch.Tensor]`. This is a dictionary because there should be one key per
argument that the `TokenEmbedder` corresponding to this class expects in its `forward()`
method (where the argument name in the `TokenEmbedder` needs to match the key in this
dictionary).
The base class implements the case when all you want to do is create a padded tensor
for every list in the `tokens` dictionary. If your `TokenIndexer` needs more complex
logic than that, you need to override this method.
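The following sketch exercises that contract end-to-end; it assumes a `bert-base-uncased` checkpoint, and the exact output keys depend on the indexer and AllenNLP version:

```python
from allennlp.data import Vocabulary
from allennlp.data.tokenizers import Token
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer

indexer = PretrainedTransformerMismatchedIndexer(model_name="bert-base-uncased")
tokens = [Token(w) for w in "a short example sentence".split()]

indexed = indexer.tokens_to_indices(tokens, Vocabulary())
padding_lengths = indexer.get_padding_lengths(indexed)

# Pretend this instance is batched with a longer one, so every list
# has to be padded out by a couple of extra positions.
padding_lengths = {key: length + 2 for key, length in padding_lengths.items()}

tensors = indexer.as_padded_tensor_dict(indexed, padding_lengths)
for key, tensor in tensors.items():
    print(key, tuple(tensor.shape))
```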
PretrainedTransformerMismatchedIndexer.count_vocab_items( self, token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]], )
The `Vocabulary` needs to assign indices to whatever strings we see in the training
data (possibly doing some frequency filtering and using an OOV, or out of vocabulary,
token). This method takes a token and a dictionary of counts and increments counts for
whatever vocabulary items are present in the token. If this is a single token ID
representation, the vocabulary item is likely the token itself. If this is a token
characters representation, the vocabulary items are all of the characters in the token.
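Note that for pretrained transformer indexers the vocabulary ultimately comes from the `transformers` tokenizer, so this method usually has nothing to add; the sketch below therefore uses `SingleIdTokenIndexer`, whose implementation does increment counts, purely to make the general contract concrete:

```python
from collections import defaultdict
from allennlp.data.tokenizers import Token
from allennlp.data.token_indexers import SingleIdTokenIndexer

# Vocabulary builds counters shaped like {namespace: {token_string: count}}.
counter = defaultdict(lambda: defaultdict(int))
indexer = SingleIdTokenIndexer(namespace="tokens")

for token in [Token("the"), Token("cat"), Token("the")]:
    indexer.count_vocab_items(token, counter)

print(dict(counter["tokens"]))  # {'the': 2, 'cat': 1}
```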
PretrainedTransformerMismatchedIndexer.get_empty_token_list( self, ) -> Dict[str, List[Any]]
Returns an already indexed version of an empty token list. This is typically just an
empty list for whatever keys are used in the indexer.
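For example (the exact keys are indexer-specific, so the ones shown in the comment are only indicative):

```python
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer

indexer = PretrainedTransformerMismatchedIndexer(model_name="bert-base-uncased")
print(indexer.get_empty_token_list())
# e.g. {'token_ids': [], 'mask': [], 'type_ids': [], 'offsets': [], 'wordpiece_mask': []}
```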
PretrainedTransformerMismatchedIndexer.tokens_to_indices( self, tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, ) -> Dict[str, List[Any]]
Takes a list of tokens and converts them to an `IndexedTokenList`.
This could be just an ID for each token from the vocabulary.
Or it could split each token into characters and return one ID per character.
Or (for instance, in the case of byte-pair encoding) there might not be a clean
mapping from individual tokens to indices, and the
`IndexedTokenList` could be a complex data structure.
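For this particular indexer, the sketch below (again assuming `bert-base-uncased`; key names may vary by AllenNLP version) shows that complex case: wordpiece IDs are flattened into one list, and an `offsets` entry records, for each original word, the span of wordpieces it was split into.

```python
from allennlp.data import Vocabulary
from allennlp.data.tokenizers import Token
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer

indexer = PretrainedTransformerMismatchedIndexer(model_name="bert-base-uncased")
tokens = [Token(w) for w in ["An", "unaffable", "sentence"]]

indexed = indexer.tokens_to_indices(tokens, Vocabulary())

print(len(indexed["token_ids"]))  # number of wordpieces, including special tokens
print(indexed["offsets"])         # one (start, end) wordpiece span per original word
print(indexed["mask"])            # one mask entry per original word
```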