allennlp.data.tokenizers
class allennlp.data.tokenizers.token.Token
Bases: tuple

A simple token representation, keeping track of the token's text, offset in the passage it was taken from, POS tag, dependency relation, and similar information. These fields match spacy's exactly, so we can just use a spacy token for this.

Parameters
- text : str, optional
    The original text represented by this token.
- idx : int, optional
    The character offset of this token into the tokenized passage.
- lemma_ : str, optional
    The lemma of this token.
- pos_ : str, optional
    The coarse-grained part of speech of this token.
- tag_ : str, optional
    The fine-grained part of speech of this token.
- dep_ : str, optional
    The dependency relation for this token.
- ent_type_ : str, optional
    The entity type (i.e., the NER tag) for this token.
- text_id : int, optional
    If your tokenizer returns integers instead of strings (e.g., because you're doing byte encoding, or some hash-based embedding), set this with the integer. If this is set, we will bypass the vocabulary when indexing this token, regardless of whether text is also set. You can also set text with the original text, if you want, so that you can still use a character-level representation in addition to a hash-based word embedding.
    The other fields on Token follow the fields on spacy's Token object; this is one we added, similar to spacy's lex_id.
- property dep_
    Alias for field number 5
- property ent_type_
    Alias for field number 6
- property idx
    Alias for field number 1
- property lemma_
    Alias for field number 2
- property pos_
    Alias for field number 3
- property tag_
    Alias for field number 4
- property text
    Alias for field number 0
- property text_id
    Alias for field number 7
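For reference, a minimal sketch of building a Token by hand; the example values are illustrative, and we assume the NamedTuple fields can be passed as keyword arguments:

    from allennlp.data.tokenizers.token import Token, show_token

    # Construct a token directly; every field is optional.
    token = Token(text="cats", idx=10, lemma_="cat", pos_="NOUN", tag_="NNS")

    print(token.text, token.idx, token.pos_)  # cats 10 NOUN
    print(show_token(token))                  # compact one-line rendering of the token's fields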
allennlp.data.tokenizers.token.show_token(token: allennlp.data.tokenizers.token.Token) → str
This module contains various classes for performing tokenization, stemming, and filtering.
class allennlp.data.tokenizers.tokenizer.Tokenizer
Bases: allennlp.common.registrable.Registrable

A Tokenizer splits strings of text into tokens. Typically, this either splits text into word tokens or character tokens, and those are the two tokenizer subclasses we have implemented here, though you could imagine wanting to do other kinds of tokenization for structured or other inputs.

As part of tokenization, concrete implementations of this API will also handle stemming, stopword filtering, adding start and end tokens, or other kinds of things you might want to do to your tokens. See the parameters to, e.g., WordTokenizer, or whichever tokenizer you want to use.

If the base input to your model is words, you should use a WordTokenizer, even if you also want to have a character-level encoder to get an additional vector for each word token. Splitting word tokens into character arrays is handled separately, in the token_representations.TokenRepresentation class.

- batch_tokenize(self, texts: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Batches together tokenization of several texts, in case that is faster for particular tokenizers.
    By default we just do this without batching. Override this in your tokenizer if you have a good way of doing batched computation.

- default_implementation: str = 'word'
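Since Tokenizer is Registrable, a concrete tokenizer can be registered under a name and then looked up by that name. A minimal sketch; the name "reversed-words" and its behavior are invented purely for illustration:

    from typing import List

    from allennlp.data.tokenizers.token import Token
    from allennlp.data.tokenizers.tokenizer import Tokenizer


    @Tokenizer.register("reversed-words")
    class ReversedWordsTokenizer(Tokenizer):
        """Toy example: whitespace-split, then reverse the token order."""

        def tokenize(self, text: str) -> List[Token]:
            return [Token(word) for word in reversed(text.split())]


    # by_name returns the registered class; instantiate it to get a tokenizer.
    tokenizer = Tokenizer.by_name("reversed-words")()
    print([t.text for t in tokenizer.tokenize("a b c")])  # ['c', 'b', 'a']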
class allennlp.data.tokenizers.word_tokenizer.WordTokenizer(word_splitter: WordSplitter = None, word_filter: WordFilter = PassThroughWordFilter(), word_stemmer: WordStemmer = PassThroughWordStemmer(), start_tokens: List[str] = None, end_tokens: List[str] = None)
Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A WordTokenizer handles the splitting of strings into words as well as any desired post-processing (e.g., stemming, filtering, etc.). Note that we leave one particular piece of post-processing for later: the decision of whether or not to lowercase the token. This is for two reasons: (1) if you want to make two different casing decisions for whatever reason, you won't have to run the tokenizer twice, and more importantly (2) if you want to lowercase words for your word embedding, but retain capitalization in a character-level representation, we need to retain the capitalization here.

Parameters
- word_splitter : WordSplitter, optional
    The WordSplitter to use for splitting text strings into word tokens. The default is to use the SpacyWordSplitter with default parameters.
- word_filter : WordFilter, optional
    The WordFilter to use for, e.g., removing stopwords. Default is to do no filtering.
- word_stemmer : WordStemmer, optional
    The WordStemmer to use. Default is no stemming.
- start_tokens : List[str], optional
    If given, these tokens will be added to the beginning of every string we tokenize.
- end_tokens : List[str], optional
    If given, these tokens will be added to the end of every string we tokenize.

- batch_tokenize(self, texts: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Batches together tokenization of several texts, in case that is faster for particular tokenizers.
    By default we just do this without batching. Override this in your tokenizer if you have a good way of doing batched computation.

- tokenize(self, text: str) → List[allennlp.data.tokenizers.token.Token]
    Does whatever processing is required to convert a string of text into a sequence of tokens.
    At a minimum, this uses a WordSplitter to split words into text. It may also do stemming or stopword removal, depending on the parameters given to the constructor.
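A minimal usage sketch with the default SpacyWordSplitter; the sentence and the @start@/@end@ markers are just examples, and the exact tokens depend on the spaCy model installed:

    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

    # Default configuration: SpacyWordSplitter, no filtering, no stemming.
    tokenizer = WordTokenizer(start_tokens=["@start@"], end_tokens=["@end@"])

    tokens = tokenizer.tokenize("AllenNLP is awesome.")
    print([t.text for t in tokens])
    # e.g. ['@start@', 'AllenNLP', 'is', 'awesome', '.', '@end@']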
class allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer(byte_encoding: str = None, lowercase_characters: bool = False, start_tokens: List[str] = None, end_tokens: List[str] = None)
Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A CharacterTokenizer splits strings into character tokens.

Parameters
- byte_encoding : str, optional (default=None)
    If not None, we will use this encoding to encode the string as bytes, and use the byte sequence as characters, instead of the unicode characters in the python string. E.g., the character 'á' would be a single token if this option is None, but it would be two tokens if this option is set to "utf-8". If this is not None, tokenize will return a List[int] instead of a List[str], and we will bypass the vocabulary in the TokenIndexer.
- lowercase_characters : bool, optional (default=False)
    If True, we will lowercase all of the characters in the text before doing any other operation. You probably do not want to do this, as character vocabularies are generally not very large to begin with, but it's an option if you really want it.
- start_tokens : List[str], optional
    If given, these tokens will be added to the beginning of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].
- end_tokens : List[str], optional
    If given, these tokens will be added to the end of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].
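A short sketch of the plain character-level case (no byte encoding); the input word is just an example:

    from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer

    tokenizer = CharacterTokenizer()
    print([t.text for t in tokenizer.tokenize("héllo")])
    # ['h', 'é', 'l', 'l', 'o']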
class allennlp.data.tokenizers.pretrained_transformer_tokenizer.PretrainedTransformerTokenizer(model_name: str, do_lowercase: bool, start_tokens: List[str] = None, end_tokens: List[str] = None)
Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A PretrainedTransformerTokenizer uses a model from HuggingFace's pytorch_transformers library to tokenize some input text. This often means wordpieces (where 'AllenNLP is awesome' might get split into ['Allen', '##NL', '##P', 'is', 'awesome']), but it could also use byte-pair encoding, or some other tokenization, depending on the pretrained model that you're using.

We take a model name as an input parameter, which we will pass to AutoTokenizer.from_pretrained.

Parameters
- model_name : str
    The name of the pretrained wordpiece tokenizer to use.
- start_tokens : List[str], optional
    If given, these tokens will be added to the beginning of every string we tokenize. We try to be a little bit smart about defaults here - e.g., if your model name contains bert, we by default add [CLS] at the beginning and [SEP] at the end.
- end_tokens : List[str], optional
    If given, these tokens will be added to the end of every string we tokenize.
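A usage sketch; this assumes the bert-base-uncased tokenizer can be downloaded, and the wordpiece output shown in the comment is only indicative:

    from allennlp.data.tokenizers.pretrained_transformer_tokenizer import PretrainedTransformerTokenizer

    # Loads the HuggingFace tokenizer for this model name via AutoTokenizer.from_pretrained.
    tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", do_lowercase=True)

    tokens = tokenizer.tokenize("AllenNLP is awesome")
    print([t.text for t in tokens])
    # e.g. ['[CLS]', 'allen', '##nl', '##p', 'is', 'awesome', '[SEP]']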
class allennlp.data.tokenizers.word_filter.PassThroughWordFilter
Bases: allennlp.data.tokenizers.word_filter.WordFilter

Does not filter words; it's a no-op. This is the default word filter.
class allennlp.data.tokenizers.word_filter.RegexFilter(patterns: List[str])
Bases: allennlp.data.tokenizers.word_filter.WordFilter

A RegexFilter removes words according to supplied regex patterns.

Parameters
- patterns : List[str]
    Words matching these regex patterns will be removed as stopwords.
class allennlp.data.tokenizers.word_filter.StopwordFilter(stopword_file: str = None, tokens_to_add: List[str] = None)
Bases: allennlp.data.tokenizers.word_filter.WordFilter

A StopwordFilter uses a list of stopwords to filter. If no file is specified, spaCy's default list of English stopwords is used. Words and stopwords are lowercased for comparison.

Parameters
- stopword_file : str, optional
    A filename containing stopwords to filter out (file format is one stopword per line).
- tokens_to_add : List[str], optional
    A list of tokens to additionally filter out.
class allennlp.data.tokenizers.word_filter.WordFilter
Bases: allennlp.common.registrable.Registrable

A WordFilter removes words from a token list. Typically, this is for stopword removal, though you could feasibly use it for more domain-specific removal if you want.

Word removal happens before stemming, so keep that in mind if you're designing a list of words to be removed.

- default_implementation: str = 'pass_through'
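A sketch of plugging a StopwordFilter into a WordTokenizer; which tokens get removed depends on spaCy's English stopword list, so the output shown is only indicative:

    from allennlp.data.tokenizers.word_filter import StopwordFilter
    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

    # Filtering happens inside the WordTokenizer, after splitting and before stemming.
    tokenizer = WordTokenizer(word_filter=StopwordFilter())

    print([t.text for t in tokenizer.tokenize("This is a sentence about cats.")])
    # stopwords such as 'this', 'is', 'a', 'about' are dropped; e.g. ['sentence', 'cats', '.']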
class allennlp.data.tokenizers.word_splitter.BertBasicWordSplitter(do_lower_case: bool = True, never_split: Optional[List[str]] = None)
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

The BasicWordSplitter from the BERT implementation. This is used to split a sentence into words. Then the BertTokenIndexer converts each word into wordpieces.
class allennlp.data.tokenizers.word_splitter.JustSpacesWordSplitter
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that assumes you've already done your own tokenization somehow and have separated the tokens by spaces. We just split the input string on whitespace and return the resulting list. We use a somewhat odd name here to avoid coming too close to the more commonly used SpacyWordSplitter.

Note that we use sentence.split(), which means that the amount of whitespace between the tokens does not matter. This will never result in spaces being included as tokens.
class allennlp.data.tokenizers.word_splitter.LettersDigitsWordSplitter
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter which keeps runs of (unicode) letters and runs of digits together, while every other non-whitespace character becomes a separate word.
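For instance, a sketch of the splitting rule described above; the input string is made up and the output is only indicative:

    from allennlp.data.tokenizers.word_splitter import LettersDigitsWordSplitter

    splitter = LettersDigitsWordSplitter()
    print([t.text for t in splitter.split_words("user42's score: 99.5%")])
    # e.g. ['user', '42', "'", 's', 'score', ':', '99', '.', '5', '%']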
class allennlp.data.tokenizers.word_splitter.OpenAISplitter(language: str = 'en_core_web_sm')
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter for the OpenAI transformer.

- batch_split_words(self, sentences: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Spacy needs to do batch processing, or it can be really slow. This method lets you take advantage of that if you want. The default implementation is to just iterate over the sentences and call split_words, but the SpacyWordSplitter will actually do batched processing.
class allennlp.data.tokenizers.word_splitter.SimpleWordSplitter
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead. This just does an initial split(), followed by some heuristic filtering of each whitespace-delimited token, separating contractions and punctuation. We assume lower-cased, reasonably well-formed English sentences as input.

- split_words(self, sentence: str) → List[allennlp.data.tokenizers.token.Token]
    Splits a sentence into word tokens. We handle three kinds of things: words with punctuation that should be ignored as a special case (Mr., Mrs., etc.), contractions/genitives (isn't, don't, Matt's), and beginning and ending punctuation ("antennagate", (parentheticals), and such).
    The basic outline is to split on whitespace, then check each of these cases. First, we strip off beginning punctuation, then strip off ending punctuation, then strip off contractions. When we strip something off the beginning of a word, we can add it to the list of tokens immediately. When we strip it off the end, we have to save it to be added after the word itself has been added. Before stripping off any part of a token, we first check to be sure the token isn't in our list of special cases.
class allennlp.data.tokenizers.word_splitter.SpacyWordSplitter(language: str = 'en_core_web_sm', pos_tags: bool = False, parse: bool = False, ner: bool = False, keep_spacy_tokens: bool = False, split_on_spaces: bool = False)
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that uses spaCy's tokenizer. It's fast and reasonable - this is the recommended WordSplitter. By default it will return allennlp Tokens, which are small, efficient NamedTuples (and are serializable). If you want to keep the original spaCy tokens, pass keep_spacy_tokens=True.

- batch_split_words(self, sentences: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Spacy needs to do batch processing, or it can be really slow. This method lets you take advantage of that if you want. The default implementation is to just iterate over the sentences and call split_words, but the SpacyWordSplitter will actually do batched processing.
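A usage sketch; it requires the en_core_web_sm spaCy model, and the tags you get depend on the model version:

    from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

    # pos_tags=True loads spaCy's tagger so each returned Token carries pos_/tag_ information.
    splitter = SpacyWordSplitter(pos_tags=True)

    for token in splitter.split_words("AllenNLP is awesome."):
        print(token.text, token.pos_)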
class allennlp.data.tokenizers.word_splitter.WhitespaceTokenizer(vocab)
Bases: object

Spacy doesn't assume that text is tokenised. Sometimes this is annoying, like when you have gold data which is pre-tokenised, but Spacy's tokenisation doesn't match the gold. This class can be used to replace spaCy's tokenizer with plain whitespace splitting, as shown in the sketch below.
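A runnable version of the usage described in the docstring; en_core_web_md is the model named there, and the sample sentence is illustrative:

    import spacy

    from allennlp.data.tokenizers.word_splitter import WhitespaceTokenizer

    nlp = spacy.load("en_core_web_md")
    # Hack to replace spaCy's tokenizer with a whitespace tokenizer.
    nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

    # Use nlp("here is some text") as normal; tokens are now the whitespace-separated words.
    doc = nlp("here is some text")
    print([token.text for token in doc])  # ['here', 'is', 'some', 'text']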
class allennlp.data.tokenizers.word_splitter.WordSplitter
Bases: allennlp.common.registrable.Registrable

A WordSplitter splits strings into words. This is typically called a "tokenizer" in NLP, because splitting strings into characters is trivial, but we use Tokenizer to refer to the higher-level object that splits strings into tokens (which could just be character tokens). So, we're using "word splitter" here for this.

- batch_split_words(self, sentences: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Spacy needs to do batch processing, or it can be really slow. This method lets you take advantage of that if you want. The default implementation is to just iterate over the sentences and call split_words, but the SpacyWordSplitter will actually do batched processing.

- default_implementation: str = 'spacy'
class allennlp.data.tokenizers.word_stemmer.PassThroughWordStemmer
Bases: allennlp.data.tokenizers.word_stemmer.WordStemmer

Does not stem words; it's a no-op. This is the default word stemmer.
class allennlp.data.tokenizers.word_stemmer.PorterStemmer
Bases: allennlp.data.tokenizers.word_stemmer.WordStemmer

Uses NLTK's PorterStemmer to stem words.
class allennlp.data.tokenizers.word_stemmer.WordStemmer
Bases: allennlp.common.registrable.Registrable

A WordStemmer lemmatizes words. This means that we map words to their root form, so that, e.g., "have", "has", and "had" all have the same internal representation.

You should think carefully about whether and how much stemming you want in your model. Kind of the whole point of using word embeddings is so that you don't have to do this, but in a highly inflected language, or in a low-data setting, you might need it anyway. The default WordStemmer does nothing, just returning the word token as-is.

- default_implementation: str = 'pass_through'
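A sketch of enabling stemming in a WordTokenizer; NLTK must be installed, and the stems shown in the comment are only indicative:

    from allennlp.data.tokenizers.word_stemmer import PorterStemmer
    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

    # Stemming runs inside the WordTokenizer, after splitting and filtering.
    tokenizer = WordTokenizer(word_stemmer=PorterStemmer())

    print([t.text for t in tokenizer.tokenize("The ponies were running")])
    # e.g. ['the', 'poni', 'were', 'run']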
class allennlp.data.tokenizers.sentence_splitter.SentenceSplitter
Bases: allennlp.common.registrable.Registrable

A SentenceSplitter splits strings into sentences.

- batch_split_sentences(self, texts: List[str]) → List[List[str]]
    Default implementation is to just iterate over the texts and call split_sentences.

- default_implementation: str = 'spacy'
class allennlp.data.tokenizers.sentence_splitter.SpacySentenceSplitter(language: str = 'en_core_web_sm', rule_based: bool = False)
Bases: allennlp.data.tokenizers.sentence_splitter.SentenceSplitter

A SentenceSplitter that uses spaCy's built-in sentence boundary detection.

Spacy's default sentence splitter uses a dependency parse to detect sentence boundaries, so it is slow, but accurate.

Another option is to use rule-based sentence boundary detection. It's fast and has a small memory footprint, since it uses punctuation to detect sentence boundaries. This can be activated with the rule_based flag.

By default, SpacySentenceSplitter calls the default spacy boundary detector.
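A usage sketch of the rule-based mode described above; the text is just an example and the split depends on the spaCy model installed:

    from allennlp.data.tokenizers.sentence_splitter import SpacySentenceSplitter

    # rule_based=True uses punctuation-based boundary detection instead of the dependency parse.
    splitter = SpacySentenceSplitter(rule_based=True)

    sentences = splitter.split_sentences("This is one sentence. Here is another!")
    print(sentences)
    # e.g. ['This is one sentence.', 'Here is another!']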