pretrained_transformer_indexer
allennlp.data.token_indexers.pretrained_transformer_indexer
PretrainedTransformerIndexer#
@TokenIndexer.register("pretrained_transformer")
class PretrainedTransformerIndexer(TokenIndexer):
| def __init__(
| self,
| model_name: str,
| namespace: str = "tags",
| max_length: int = None,
| tokenizer_kwargs: Optional[Dict[str, Any]] = None,
| **kwargs
| ) -> None
This TokenIndexer assumes that Tokens already have their indexes in them (see the text_id field). We still require model_name because we want to build the AllenNLP vocabulary from the pretrained one. This indexer is only really appropriate to use if you've also used a corresponding PretrainedTransformerTokenizer to tokenize your input. Otherwise you'll have a mismatch between your tokens and your vocabulary, and you'll get a lot of UNK tokens.

Registered as a TokenIndexer with name "pretrained_transformer".
Parameters

- model_name : str
  The name of the transformers model to use.
- namespace : str, optional (default = "tags")
  We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace. We use a somewhat confusing default value of "tags" so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
- max_length : int, optional (default = None)
  If not None, split the document into segments of this many tokens (including special tokens) before feeding them into the embedder. The embedder embeds these segments independently and concatenates the results to get the original document representation. Should be set to the same value as the max_length option on the PretrainedTransformerEmbedder.
- tokenizer_kwargs : Dict[str, Any], optional (default = None)
  Dictionary with additional arguments for AutoTokenizer.from_pretrained.
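For instance, a minimal pairing of this indexer with the matching tokenizer might look like the sketch below (the model name is just an illustrative choice, not something this class requires):

```python
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

# Use the matching tokenizer so the token ids line up with the indexer's vocabulary.
tokenizer = PretrainedTransformerTokenizer(model_name="bert-base-uncased")
indexer = PretrainedTransformerIndexer(model_name="bert-base-uncased")

# If you set max_length here, set the same value on the PretrainedTransformerEmbedder:
# indexer = PretrainedTransformerIndexer(model_name="bert-base-uncased", max_length=512)

tokens = tokenizer.tokenize("AllenNLP is a library built on PyTorch.")
```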
count_vocab_items#
class PretrainedTransformerIndexer(TokenIndexer):
| ...
| @overrides
| def count_vocab_items(
| self,
| token: Token,
| counter: Dict[str, Dict[str, int]]
| )
If we only use pretrained models, we don't need to do anything here.
tokens_to_indices#
class PretrainedTransformerIndexer(TokenIndexer):
| ...
| @overrides
| def tokens_to_indices(
| self,
| tokens: List[Token],
| vocabulary: Vocabulary
| ) -> IndexedTokenList
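A hedged sketch of indexing pre-tokenized input (model name and sample sentence are illustrative):

```python
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer(model_name="bert-base-uncased")
indexer = PretrainedTransformerIndexer(model_name="bert-base-uncased")
vocab = Vocabulary()  # the pretrained vocabulary is copied into the namespace on first use

tokens = tokenizer.tokenize("AllenNLP is great.")
indexed = indexer.tokens_to_indices(tokens, vocab)
# `indexed` is an IndexedTokenList, i.e. a plain dict of lists; with this indexer
# it typically contains entries such as "token_ids", "mask", and "type_ids".
```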
indices_to_tokens#
class PretrainedTransformerIndexer(TokenIndexer):
| ...
| @overrides
| def indices_to_tokens(
| self,
| indexed_tokens: IndexedTokenList,
| vocabulary: Vocabulary
| ) -> List[Token]
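A round-trip sketch, reusing the same illustrative model name, that recovers Token objects from an already-indexed token list:

```python
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

indexer = PretrainedTransformerIndexer(model_name="bert-base-uncased")
vocab = Vocabulary()
tokens = PretrainedTransformerTokenizer(model_name="bert-base-uncased").tokenize("Hello world.")

indexed = indexer.tokens_to_indices(tokens, vocab)
recovered = indexer.indices_to_tokens(indexed, vocab)
# `recovered` is a List[Token]; the texts are looked up in the namespace that
# tokens_to_indices populated, so the round trip should give back the wordpieces.
```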
get_empty_token_list#
class PretrainedTransformerIndexer(TokenIndexer):
| ...
| @overrides
| def get_empty_token_list(self) -> IndexedTokenList
as_padded_tensor_dict#
class PretrainedTransformerIndexer(TokenIndexer):
| ...
| @overrides
| def as_padded_tensor_dict(
| self,
| tokens: IndexedTokenList,
| padding_lengths: Dict[str, int]
| ) -> Dict[str, torch.Tensor]
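As a sketch, padding is usually driven by the per-key lengths of an already-indexed token list; the get_padding_lengths call below comes from the TokenIndexer base class and is an assumption of this example, as is the model name:

```python
from allennlp.data import Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer

indexer = PretrainedTransformerIndexer(model_name="bert-base-uncased")
tokenizer = PretrainedTransformerTokenizer(model_name="bert-base-uncased")
vocab = Vocabulary()

indexed = indexer.tokens_to_indices(tokenizer.tokenize("A short sentence."), vocab)

# Report the current lengths per key, then pad up to those (or larger)
# and get a dict of torch.Tensors back.
padding_lengths = indexer.get_padding_lengths(indexed)
tensor_dict = indexer.as_padded_tensor_dict(indexed, padding_lengths)
```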