letters_digits_tokenizer
[ allennlp.data.tokenizers.letters_digits_tokenizer ]
LettersDigitsTokenizer Objects
class LettersDigitsTokenizer(Tokenizer)
A Tokenizer which keeps runs of (Unicode) letters and runs of digits together, while every other non-whitespace character becomes a separate word.
Registered as a Tokenizer with name "letters_digits".
tokenize
@overrides
def tokenize(self, text: str) -> List[Token]
We use the [^\W\d_] pattern as a trick to match Unicode letters: it matches any word character (the negation of \W) that is neither a digit nor an underscore.
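The letter-matching trick can be tried out with the standard library alone. The sketch below is illustrative rather than the library's actual code: the helper name `letters_digits_split` is hypothetical, and the full pattern (letter runs, then digit runs, then any single non-whitespace character) is an assumption inferred from the behavior described above.

```python
import re
from typing import List, Tuple

# Assumed pattern, inferred from the class description:
#   [^\W\d_]+  a run of Unicode letters (word characters minus digits and "_")
#   \d+        a run of digits
#   \S         any other single non-whitespace character
LETTERS_DIGITS_PATTERN = re.compile(r"[^\W\d_]+|\d+|\S")


def letters_digits_split(text: str) -> List[Tuple[str, int]]:
    """Hypothetical helper: return (token_text, start_offset) pairs."""
    return [(m.group(), m.start()) for m in LETTERS_DIGITS_PATTERN.finditer(text)]


if __name__ == "__main__":
    for token_text, start in letters_digits_split("abc123, x_y!"):
        print(start, repr(token_text))
```

Because the alternatives are tried left to right, letters and digits are grouped into runs first, and only characters that match neither (punctuation, underscores, symbols) fall through to `\S` and become single-character tokens. Whitespace matches none of the alternatives, so it is skipped.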