pretrained_transformer_tokenizer
allennlp.data.tokenizers.pretrained_transformer_tokenizer
PretrainedTransformerTokenizer¶
@Tokenizer.register("pretrained_transformer")
class PretrainedTransformerTokenizer(Tokenizer):
| def __init__(
| self,
| model_name: str,
| add_special_tokens: bool = True,
| max_length: Optional[int] = None,
| tokenizer_kwargs: Optional[Dict[str, Any]] = None,
| verification_tokens: Optional[Tuple[str, str]] = None
| ) -> None
A PretrainedTransformerTokenizer uses a model from HuggingFace's transformers library to tokenize some input text. This often means wordpieces (where 'AllenNLP is awesome' might get split into ['Allen', '##NL', '##P', 'is', 'awesome']), but it could also use byte-pair encoding or some other tokenization, depending on the pretrained model that you're using.
We take a model name as an input parameter, which we will pass to AutoTokenizer.from_pretrained. We also add special tokens relative to the pretrained model and truncate the sequences.
This tokenizer also indexes tokens and adds the indexes to the Token fields so that they can be picked up by PretrainedTransformerIndexer.
Registered as a Tokenizer with name "pretrained_transformer".
Parameters¶
- model_name : str
  The name of the pretrained wordpiece tokenizer to use.
- add_special_tokens : bool, optional (default = True)
  If set to True, the sequences will be encoded with the special tokens relative to their model.
- max_length : int, optional (default = None)
  If set to a number, will limit the total sequence returned so that it has a maximum length.
- tokenizer_kwargs : Dict[str, Any], optional (default = None)
  Dictionary with additional arguments for AutoTokenizer.from_pretrained.
- verification_tokens : Tuple[str, str], optional (default = None)
  A pair of tokens having different token IDs, used for reverse-engineering the special tokens.
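As a rough usage sketch (assuming allennlp and transformers are installed, and using "bert-base-uncased" purely as an illustrative model name), construction looks like this:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer(
    model_name="bert-base-uncased",       # passed to AutoTokenizer.from_pretrained
    add_special_tokens=True,              # add e.g. [CLS]/[SEP] for BERT-style models
    max_length=512,                       # truncate sequences longer than this
    tokenizer_kwargs={"use_fast": True},  # forwarded to AutoTokenizer.from_pretrained
)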
tokenizer_lowercases¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| @staticmethod
| def tokenizer_lowercases(tokenizer: PreTrainedTokenizer) -> bool
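A minimal sketch of how this helper might be used, assuming the wrapped HuggingFace tokenizer is exposed on the tokenizer attribute and again using "bert-base-uncased" as an example:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

allennlp_tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
# bert-base-uncased lowercases its input, so this should print True.
print(PretrainedTransformerTokenizer.tokenizer_lowercases(allennlp_tokenizer.tokenizer))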
tokenize¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| def tokenize(self, text: str) -> List[Token]
This method only handles a single sentence (or sequence) of text.
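For example (a sketch assuming a BERT-style model; the exact wordpieces depend on the model's vocabulary):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
tokens = tokenizer.tokenize("AllenNLP is awesome")
# Because add_special_tokens defaults to True, you would expect something like
# ['[CLS]', 'allen', '##nl', '##p', 'is', 'awesome', '[SEP]'].
print([t.text for t in tokens])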
intra_word_tokenize¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| def intra_word_tokenize(
| self,
| string_tokens: List[str]
| ) -> Tuple[List[Token], List[Optional[Tuple[int, int]]]]
Tokenizes each word into wordpieces separately and returns the wordpiece IDs. Also calculates offsets such that tokens[offsets[i][0]:offsets[i][1] + 1] corresponds to the original i-th token.
This function inserts special tokens.
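A sketch of how the offsets are meant to be used (assuming every input word produces at least one wordpiece; an offset can be None otherwise):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
words = ["AllenNLP", "is", "awesome"]
wordpieces, offsets = tokenizer.intra_word_tokenize(words)
# offsets[i] is an inclusive (start, end) span into `wordpieces` for the i-th word.
for word, (start, end) in zip(words, offsets):
    print(word, [t.text for t in wordpieces[start:end + 1]])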
intra_word_tokenize_sentence_pair¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| def intra_word_tokenize_sentence_pair(
| self,
| string_tokens_a: List[str],
| string_tokens_b: List[str]
| ) -> Tuple[List[Token], List[Optional[Tuple[int, int]]], List[Optional[Tuple[int, int]]]]
Tokenizes each word into wordpieces separately and returns the wordpiece IDs. Also calculates offsets such that wordpieces[offsets[i][0]:offsets[i][1] + 1] corresponds to the original i-th token.
This function inserts special tokens.
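The sentence-pair variant works the same way, with one offset list per side (again a sketch with an illustrative model name):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
wordpieces, offsets_a, offsets_b = tokenizer.intra_word_tokenize_sentence_pair(
    ["AllenNLP", "is", "awesome"], ["So", "is", "spaCy"]
)
# offsets_a indexes the first sentence's words into `wordpieces`,
# offsets_b the second sentence's; special tokens are already inserted.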
add_special_tokens¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| def add_special_tokens(
| self,
| tokens1: List[Token],
| tokens2: Optional[List[Token]] = None
| ) -> List[Token]
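One plausible way to use this, sketched under the assumption that you tokenized with add_special_tokens=False and want to add the special tokens yourself when combining two sequences:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", add_special_tokens=False)
premise = tokenizer.tokenize("AllenNLP is awesome")
hypothesis = tokenizer.tokenize("It tokenizes text")
# For a BERT-style model this yields roughly [CLS] premise [SEP] hypothesis [SEP].
pair = tokenizer.add_special_tokens(premise, hypothesis)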
num_special_tokens_for_sequence¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| def num_special_tokens_for_sequence(self) -> int
num_special_tokens_for_pair¶
class PretrainedTransformerTokenizer(Tokenizer):
| ...
| def num_special_tokens_for_pair(self) -> int
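These two helpers report how many special tokens the model adds, which is useful when budgeting max_length. A small sketch (the counts shown are what you would expect for a BERT-style model):

from allennlp.data.tokenizers import PretrainedTransformerTokenizer

tokenizer = PretrainedTransformerTokenizer("bert-base-uncased")
print(tokenizer.num_special_tokens_for_sequence())  # 2: [CLS] and [SEP]
print(tokenizer.num_special_tokens_for_pair())      # 3: [CLS], [SEP], [SEP]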