
letters_digits_tokenizer

allennlp.data.tokenizers.letters_digits_tokenizer



LettersDigitsTokenizer

@Tokenizer.register("letters_digits")
class LettersDigitsTokenizer(Tokenizer)

A Tokenizer that keeps each run of (Unicode) letters together and each run of digits together, while every other non-whitespace character becomes a separate token.

Registered as a Tokenizer with name "letters_digits".

tokenize

class LettersDigitsTokenizer(Tokenizer):
 | ...
 | def tokenize(self, text: str) -> List[Token]

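We use the [^\W\d_] pattern as a trick to match Unicode letters: \w covers letters, digits, and the underscore, so a character class that excludes \W, \d, and _ leaves essentially only the letters.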
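The letter-matching trick can be illustrated with the standard library alone. The sketch below is not the library's code: the helper name letters_digits_split and the full pattern (letter run, digit run, or any single non-whitespace character) are assumptions chosen to match the behaviour described above, and it returns plain (text, offset) tuples rather than Token objects.

```python
import re
from typing import List, Tuple

# Assumed pattern, consistent with the description above:
#   [^\W\d_]+  a run of word characters that are neither digits nor "_"
#   \d+        a run of digits
#   \S         any other single non-whitespace character
_LETTERS_DIGITS = re.compile(r"[^\W\d_]+|\d+|\S")


def letters_digits_split(text: str) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (token_text, start_offset) pairs."""
    return [(m.group(), m.start()) for m in _LETTERS_DIGITS.finditer(text)]


if __name__ == "__main__":
    # "d\u00e9j\u00e0" is "deja" with precomposed accented letters.
    for piece, start in letters_digits_split("abc123, d\u00e9j\u00e0_vu!"):
        print(start, piece)
```

In Python 3, str patterns are Unicode-aware by default, so the accented letters stay inside one token while the underscore and punctuation are split off on their own. Whitespace matches none of the three alternatives and is therefore dropped.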