
LettersDigitsTokenizer Objects

class LettersDigitsTokenizer(Tokenizer)

A Tokenizer that keeps runs of (Unicode) letters and runs of digits together, while every other non-whitespace character becomes a separate word.

Registered as a Tokenizer with name "letters_digits".


 | @overrides
 | def tokenize(self, text: str) -> List[Token]

We use the [^\W\d_] pattern as a trick to match Unicode letters: \w covers letters, digits, and the underscore, so excluding \W, \d, and _ leaves only letters.
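
The behaviour described above can be illustrated with a small standalone sketch. It assumes the full pattern is an alternation of "run of letters", "run of digits", and "any single non-whitespace character"; only the letter-matching part is stated in this page, so the rest is an assumption. The function name `letters_digits_split` is hypothetical, and plain `(text, offset)` tuples stand in for `Token` objects so the example runs without the library.

```python
import re
from typing import List, Tuple

# Assumed pattern (only the [^\W\d_] part comes from the docs above):
#   [^\W\d_]+  a run of Unicode letters (word characters minus digits and "_")
#   \d+        a run of digits
#   \S         any other single non-whitespace character
_PATTERN = re.compile(r"[^\W\d_]+|\d+|\S")


def letters_digits_split(text: str) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (token_text, start_offset) pairs."""
    return [(m.group(), m.start()) for m in _PATTERN.finditer(text)]


print(letters_digits_split("caf\u00e9_42!"))
# [('café', 0), ('_', 4), ('42', 5), ('!', 7)]
```

Note that the underscore comes out as its own token: it is a word character for `\w`, but the letter pattern excludes it explicitly, so it falls through to the `\S` branch. Whitespace matches none of the branches and is simply skipped.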