pretrained_transformer_indexer

allennlp.data.token_indexers.pretrained_transformer_indexer

PretrainedTransformerIndexer#

@TokenIndexer.register("pretrained_transformer")
class PretrainedTransformerIndexer(TokenIndexer):
 | def __init__(
 |     self,
 |     model_name: str,
 |     namespace: str = "tags",
 |     max_length: int = None,
 |     tokenizer_kwargs: Optional[Dict[str, Any]] = None,
 |     **kwargs
 | ) -> None

This TokenIndexer assumes that Tokens already have their indices in them (see the text_id field). We still require model_name because we want to form the allennlp vocabulary from the pretrained one. This indexer is only really appropriate to use if you've also used a corresponding PretrainedTransformerTokenizer to tokenize your input (see the usage sketch after the parameter list). Otherwise you'll have a mismatch between your tokens and your vocabulary, and you'll get a lot of UNK tokens.

Registered as a TokenIndexer with name "pretrained_transformer".

Parameters

  • model_name : str
    The name of the transformers model to use.
  • namespace : str, optional (default = "tags")
    We will add the tokens in the transformer vocabulary to this vocabulary namespace. We use the somewhat confusing default of "tags" so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
  • max_length : int, optional (default = None)
    If not None, split the document into segments of this many tokens (including special tokens) before feeding it into the embedder. The embedder embeds these segments independently and concatenates the results to recover the original document representation. Should be set to the same value as the max_length option on the PretrainedTransformerEmbedder.
  • tokenizer_kwargs : Dict[str, Any], optional (default = None)
    Dictionary with additional arguments for AutoTokenizer.from_pretrained.
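
The following is a minimal usage sketch, not part of the API reference: it pairs the indexer with a matching PretrainedTransformerTokenizer so that tokens and vocabulary agree. The model name "bert-base-uncased", the max_length of 512, and the do_lower_case tokenizer kwarg are illustrative choices, not defaults.

from allennlp.data import Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

model_name = "bert-base-uncased"  # illustrative model choice

# The tokenizer and indexer must point at the same pretrained model;
# otherwise wordpieces and vocabulary ids will not line up.
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(
    model_name,
    max_length=512,  # should match the embedder's max_length
    tokenizer_kwargs={"do_lower_case": True},  # forwarded to AutoTokenizer.from_pretrained
)

tokens = tokenizer.tokenize("AllenNLP reuses the transformer's own wordpiece vocabulary.")
field = TextField(tokens, token_indexers={"tokens": indexer})

vocab = Vocabulary()  # the transformer vocabulary is added to the "tags" namespace on first use
field.index(vocab)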

count_vocab_items#

class PretrainedTransformerIndexer(TokenIndexer):
 | ...
 | @overrides
 | def count_vocab_items(
 |     self,
 |     token: Token,
 |     counter: Dict[str, Dict[str, int]]
 | )

If we only use pretrained models, we don't need to do anything here.

tokens_to_indices#

class PretrainedTransformerIndexer(TokenIndexer):
 | ...
 | @overrides
 | def tokens_to_indices(
 |     self,
 |     tokens: List[Token],
 |     vocabulary: Vocabulary
 | ) -> IndexedTokenList
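
A hedged sketch of calling this method directly. The exact keys of the returned IndexedTokenList are version-dependent; recent AllenNLP releases typically return "token_ids", "mask", and "type_ids" (plus "segment_concat_mask" when max_length is set). The model name is illustrative.

from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")  # illustrative model
indexer = PretrainedTransformerIndexer("bert-base-uncased")

tokens = tokenizer.tokenize("a short example sentence")
indexed = indexer.tokens_to_indices(tokens, Vocabulary())

# Lists of ints/bools, one entry per wordpiece (including special tokens).
print(sorted(indexed.keys()))
print(indexed["token_ids"])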

indices_to_tokens#

class PretrainedTransformerIndexer(TokenIndexer):
 | ...
 | @overrides
 | def indices_to_tokens(
 |     self,
 |     indexed_tokens: IndexedTokenList,
 |     vocabulary: Vocabulary
 | ) -> List[Token]
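
A small round-trip sketch, again with an assumed "bert-base-uncased" model: the recovered Token objects are wordpieces, including special tokens, not the original words.

from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
indexer = PretrainedTransformerIndexer("bert-base-uncased")
vocab = Vocabulary()

indexed = indexer.tokens_to_indices(tokenizer.tokenize("round trip"), vocab)

# Map the ids back to Token objects; expect wordpieces plus [CLS]/[SEP].
recovered = indexer.indices_to_tokens(indexed, vocab)
print([t.text for t in recovered])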

get_empty_token_list#

class PretrainedTransformerIndexer(TokenIndexer):
 | ...
 | @overrides
 | def get_empty_token_list(self) -> IndexedTokenList

as_padded_tensor_dict#

class PretrainedTransformerIndexer(TokenIndexer):
 | ...
 | @overrides
 | def as_padded_tensor_dict(
 |     self,
 |     tokens: IndexedTokenList,
 |     padding_lengths: Dict[str, int]
 | ) -> Dict[str, torch.Tensor]
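
A hedged sketch of padding a single indexed instance, as a data loader would when batching it with a longer one. Padding values are handled per key (token ids are typically padded with the tokenizer's pad token id, masks with False); the target length of 10 and the model name are illustrative.

from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")  # illustrative model
indexer = PretrainedTransformerIndexer("bert-base-uncased")
vocab = Vocabulary()

indexed = indexer.tokens_to_indices(tokenizer.tokenize("pad me"), vocab)

# Pad every key out to length 10; in practice these lengths come from the
# longest instance in the batch via get_padding_lengths.
padding_lengths = {key: 10 for key in indexer.get_padding_lengths(indexed)}
tensors = indexer.as_padded_tensor_dict(indexed, padding_lengths)
print({key: tensor.shape for key, tensor in tensors.items()})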