letters_digits_tokenizer
allennlp.data.tokenizers.letters_digits_tokenizer
LettersDigitsTokenizer
@Tokenizer.register("letters_digits")
class LettersDigitsTokenizer(Tokenizer)
A Tokenizer which keeps runs of (Unicode) letters and runs of digits together, while every other non-whitespace character becomes a separate word.
Registered as a Tokenizer with name "letters_digits".
tokenize
class LettersDigitsTokenizer(Tokenizer):
| ...
| @overrides
| def tokenize(self, text: str) -> List[Token]
We use the [^\W\d_] pattern as a trick to match Unicode letters: it matches any word character (\w) that is neither a digit (\d) nor an underscore.
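
As a rough illustration of this pattern, here is a standalone sketch that uses only Python's re module. The combined regular expression and the helper name letters_digits_split are assumptions made for this example and are not taken from the AllenNLP source. The helper returns plain (text, offset) pairs rather than Token objects.

```python
import re
from typing import List, Tuple

# Hypothetical combined pattern (an assumption for this sketch):
#   [^\W\d_]+  a run of word characters that are not digits or underscores,
#              which in practice means a run of letters
#   \d+        a run of digits
#   \S         any other single non-whitespace character
LETTERS_DIGITS_PATTERN = re.compile(r"[^\W\d_]+|\d+|\S")


def letters_digits_split(text: str) -> List[Tuple[str, int]]:
    """Return (token_text, start_offset) pairs; whitespace is skipped."""
    return [(m.group(), m.start()) for m in LETTERS_DIGITS_PATTERN.finditer(text)]


if __name__ == "__main__":
    # Letters and digits stay in runs; punctuation is split off one character at a time.
    print(letters_digits_split("Hello, world2020!"))
    # [('Hello', 0), (',', 5), ('world', 7), ('2020', 12), ('!', 16)]
```

Because alternation tries the branches from left to right, letter runs and digit runs are consumed first, and the final \S branch only picks up whatever is left, such as punctuation or an underscore.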