allennlp.data.tokenizers
class allennlp.data.tokenizers.token.Token
Bases: tuple

A simple token representation, keeping track of the token's text, offset in the passage it was taken from, POS tag, dependency relation, and similar information. These fields match spacy's exactly, so we can just use a spacy token for this.

Parameters
- text : str, optional
    The original text represented by this token.
- idx : int, optional
    The character offset of this token into the tokenized passage.
- lemma_ : str, optional
    The lemma of this token.
- pos_ : str, optional
    The coarse-grained part of speech of this token.
- tag_ : str, optional
    The fine-grained part of speech of this token.
- dep_ : str, optional
    The dependency relation for this token.
- ent_type_ : str, optional
    The entity type (i.e., the NER tag) for this token.
- text_id : int, optional
    If your tokenizer returns integers instead of strings (e.g., because you're doing byte encoding, or some hash-based embedding), set this with the integer. If this is set, we will bypass the vocabulary when indexing this token, regardless of whether text is also set. You can also set text with the original text, if you want, so that you can still use a character-level representation in addition to a hash-based word embedding.
    The other fields on Token follow the fields on spacy's Token object; this is one we added, similar to spacy's lex_id.
- property dep_
    Alias for field number 5
- property ent_type_
    Alias for field number 6
- property idx
    Alias for field number 1
- property lemma_
    Alias for field number 2
- property pos_
    Alias for field number 3
- property tag_
    Alias for field number 4
- property text
    Alias for field number 0
- property text_id
    Alias for field number 7
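For reference, a minimal sketch of building a Token by hand; the example values are illustrative, and we assume the NamedTuple fields can be passed as keyword arguments:

    from allennlp.data.tokenizers.token import Token, show_token

    # Construct a token directly; every field is optional.
    token = Token(text="cats", idx=10, lemma_="cat", pos_="NOUN", tag_="NNS")

    print(token.text, token.idx, token.pos_)  # cats 10 NOUN
    print(show_token(token))                  # compact one-line rendering of the token's fields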
allennlp.data.tokenizers.token.show_token(token: allennlp.data.tokenizers.token.Token) → str
This module contains various classes for performing tokenization, stemming, and filtering.
class allennlp.data.tokenizers.tokenizer.Tokenizer
Bases: allennlp.common.registrable.Registrable

A Tokenizer splits strings of text into tokens. Typically, this either splits text into word tokens or character tokens, and those are the two tokenizer subclasses we have implemented here, though you could imagine wanting to do other kinds of tokenization for structured or other inputs.

As part of tokenization, concrete implementations of this API will also handle stemming, stopword filtering, adding start and end tokens, or other kinds of things you might want to do to your tokens. See the parameters to, e.g., WordTokenizer, or whichever tokenizer you want to use.

If the base input to your model is words, you should use a WordTokenizer, even if you also want to have a character-level encoder to get an additional vector for each word token. Splitting word tokens into character arrays is handled separately, in the token_representations.TokenRepresentation class.

- batch_tokenize(self, texts: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Batches together tokenization of several texts, in case that is faster for particular tokenizers.
    By default we just do this without batching. Override this in your tokenizer if you have a good way of doing batched computation.

- default_implementation: str = 'word'
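Since Tokenizer is Registrable, a concrete tokenizer can be registered under a name and then looked up by that name. A minimal sketch; the name "reversed-words" and its behavior are invented purely for illustration:

    from typing import List

    from allennlp.data.tokenizers.token import Token
    from allennlp.data.tokenizers.tokenizer import Tokenizer


    @Tokenizer.register("reversed-words")
    class ReversedWordsTokenizer(Tokenizer):
        """Toy example: whitespace-split, then reverse the token order."""

        def tokenize(self, text: str) -> List[Token]:
            return [Token(word) for word in reversed(text.split())]


    # by_name returns the registered class; instantiate it to get a tokenizer.
    tokenizer = Tokenizer.by_name("reversed-words")()
    print([t.text for t in tokenizer.tokenize("a b c")])  # ['c', 'b', 'a']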
class allennlp.data.tokenizers.word_tokenizer.WordTokenizer(word_splitter: WordSplitter = None, word_filter: WordFilter = PassThroughWordFilter(), word_stemmer: WordStemmer = PassThroughWordStemmer(), start_tokens: List[str] = None, end_tokens: List[str] = None)
Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A WordTokenizer handles the splitting of strings into words as well as any desired post-processing (e.g., stemming, filtering, etc.). Note that we leave one particular piece of post-processing for later: the decision of whether or not to lowercase the token. This is for two reasons: (1) if you want to make two different casing decisions for whatever reason, you won't have to run the tokenizer twice, and more importantly (2) if you want to lowercase words for your word embedding, but retain capitalization in a character-level representation, we need to retain the capitalization here.

Parameters
- word_splitter : WordSplitter, optional
    The WordSplitter to use for splitting text strings into word tokens. The default is to use the SpacyWordSplitter with default parameters.
- word_filter : WordFilter, optional
    The WordFilter to use for, e.g., removing stopwords. Default is to do no filtering.
- word_stemmer : WordStemmer, optional
    The WordStemmer to use. Default is no stemming.
- start_tokens : List[str], optional
    If given, these tokens will be added to the beginning of every string we tokenize.
- end_tokens : List[str], optional
    If given, these tokens will be added to the end of every string we tokenize.

- batch_tokenize(self, texts: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Batches together tokenization of several texts, in case that is faster for particular tokenizers.
    By default we just do this without batching. Override this in your tokenizer if you have a good way of doing batched computation.

- tokenize(self, text: str) → List[allennlp.data.tokenizers.token.Token]
    Does whatever processing is required to convert a string of text into a sequence of tokens.
    At a minimum, this uses a WordSplitter to split words into text. It may also do stemming or stopword removal, depending on the parameters given to the constructor.
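A minimal usage sketch with the default SpacyWordSplitter; the sentence and the @start@/@end@ markers are just examples, and the exact tokens depend on the spaCy model installed:

    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

    # Default configuration: SpacyWordSplitter, no filtering, no stemming.
    tokenizer = WordTokenizer(start_tokens=["@start@"], end_tokens=["@end@"])

    tokens = tokenizer.tokenize("AllenNLP is awesome.")
    print([t.text for t in tokens])
    # e.g. ['@start@', 'AllenNLP', 'is', 'awesome', '.', '@end@']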
class allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer(byte_encoding: str = None, lowercase_characters: bool = False, start_tokens: List[str] = None, end_tokens: List[str] = None)
Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A CharacterTokenizer splits strings into character tokens.

Parameters
- byte_encoding : str, optional (default=None)
    If not None, we will use this encoding to encode the string as bytes, and use the byte sequence as characters, instead of the unicode characters in the python string. E.g., the character 'á' would be a single token if this option is None, but it would be two tokens if this option is set to "utf-8". If this is not None, tokenize will return a List[int] instead of a List[str], and we will bypass the vocabulary in the TokenIndexer.
- lowercase_characters : bool, optional (default=False)
    If True, we will lowercase all of the characters in the text before doing any other operation. You probably do not want to do this, as character vocabularies are generally not very large to begin with, but it's an option if you really want it.
- start_tokens : List[str], optional
    If given, these tokens will be added to the beginning of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].
- end_tokens : List[str], optional
    If given, these tokens will be added to the end of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].
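A short sketch of the plain character-level case (no byte encoding); the input word is just an example:

    from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer

    tokenizer = CharacterTokenizer()
    print([t.text for t in tokenizer.tokenize("héllo")])
    # ['h', 'é', 'l', 'l', 'o']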
class allennlp.data.tokenizers.pretrained_transformer_tokenizer.PretrainedTransformerTokenizer(model_name: str, do_lowercase: bool, start_tokens: List[str] = None, end_tokens: List[str] = None)
Bases: allennlp.data.tokenizers.tokenizer.Tokenizer

A PretrainedTransformerTokenizer uses a model from HuggingFace's pytorch_transformers library to tokenize some input text. This often means wordpieces (where 'AllenNLP is awesome' might get split into ['Allen', '##NL', '##P', 'is', 'awesome']), but it could also use byte-pair encoding, or some other tokenization, depending on the pretrained model that you're using.

We take a model name as an input parameter, which we will pass to AutoTokenizer.from_pretrained.

Parameters
- model_name : str
    The name of the pretrained wordpiece tokenizer to use.
- start_tokens : List[str], optional
    If given, these tokens will be added to the beginning of every string we tokenize. We try to be a little bit smart about defaults here - e.g., if your model name contains bert, we by default add [CLS] at the beginning and [SEP] at the end.
- end_tokens : List[str], optional
    If given, these tokens will be added to the end of every string we tokenize.
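A usage sketch; this assumes the bert-base-uncased tokenizer can be downloaded, and the wordpiece output shown in the comment is only indicative:

    from allennlp.data.tokenizers.pretrained_transformer_tokenizer import PretrainedTransformerTokenizer

    # Loads the HuggingFace tokenizer for this model name via AutoTokenizer.from_pretrained.
    tokenizer = PretrainedTransformerTokenizer("bert-base-uncased", do_lowercase=True)

    tokens = tokenizer.tokenize("AllenNLP is awesome")
    print([t.text for t in tokens])
    # e.g. ['[CLS]', 'allen', '##nl', '##p', 'is', 'awesome', '[SEP]']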
class allennlp.data.tokenizers.word_filter.PassThroughWordFilter
Bases: allennlp.data.tokenizers.word_filter.WordFilter

Does not filter words; it's a no-op. This is the default word filter.
class allennlp.data.tokenizers.word_filter.RegexFilter(patterns: List[str])
Bases: allennlp.data.tokenizers.word_filter.WordFilter

A RegexFilter removes words according to supplied regex patterns.

Parameters
- patterns : List[str]
    Words matching these regex patterns will be removed as stopwords.
class allennlp.data.tokenizers.word_filter.StopwordFilter(stopword_file: str = None, tokens_to_add: List[str] = None)
Bases: allennlp.data.tokenizers.word_filter.WordFilter

A StopwordFilter uses a list of stopwords to filter. If no file is specified, spaCy's default list of English stopwords is used. Words and stopwords are lowercased for comparison.

Parameters
- stopword_file : str, optional
    A filename containing stopwords to filter out (file format is one stopword per line).
- tokens_to_add : List[str], optional
    A list of tokens to additionally filter out.
class allennlp.data.tokenizers.word_filter.WordFilter
Bases: allennlp.common.registrable.Registrable

A WordFilter removes words from a token list. Typically, this is for stopword removal, though you could feasibly use it for more domain-specific removal if you want.

Word removal happens before stemming, so keep that in mind if you're designing a list of words to be removed.

- default_implementation: str = 'pass_through'
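A sketch of plugging a StopwordFilter into a WordTokenizer; which tokens get removed depends on spaCy's English stopword list, so the output shown is only indicative:

    from allennlp.data.tokenizers.word_filter import StopwordFilter
    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

    # Filtering happens inside the WordTokenizer, after splitting and before stemming.
    tokenizer = WordTokenizer(word_filter=StopwordFilter())

    print([t.text for t in tokenizer.tokenize("This is a sentence about cats.")])
    # stopwords such as 'this', 'is', 'a', 'about' are dropped; e.g. ['sentence', 'cats', '.']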
class allennlp.data.tokenizers.word_splitter.BertBasicWordSplitter(do_lower_case: bool = True, never_split: Optional[List[str]] = None)
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

The BasicWordSplitter from the BERT implementation. This is used to split a sentence into words. Then the BertTokenIndexer converts each word into wordpieces.
class allennlp.data.tokenizers.word_splitter.JustSpacesWordSplitter
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that assumes you've already done your own tokenization somehow and have separated the tokens by spaces. We just split the input string on whitespace and return the resulting list. We use a somewhat odd name here to avoid coming too close to the more commonly used SpacyWordSplitter.

Note that we use sentence.split(), which means that the amount of whitespace between the tokens does not matter. This will never result in spaces being included as tokens.
class allennlp.data.tokenizers.word_splitter.LettersDigitsWordSplitter
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter which keeps runs of (unicode) letters and runs of digits together, while every other non-whitespace character becomes a separate word.
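For instance, a sketch of the splitting rule described above; the input string is made up and the output is only indicative:

    from allennlp.data.tokenizers.word_splitter import LettersDigitsWordSplitter

    splitter = LettersDigitsWordSplitter()
    print([t.text for t in splitter.split_words("user42's score: 99.5%")])
    # e.g. ['user', '42', "'", 's', 'score', ':', '99', '.', '5', '%']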
class allennlp.data.tokenizers.word_splitter.OpenAISplitter(language: str = 'en_core_web_sm')
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter for the OpenAI transformer.

- batch_split_words(self, sentences: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Spacy needs to do batch processing, or it can be really slow. This method lets you take advantage of that if you want. The default implementation is to just iterate over the sentences and call split_words, but the SpacyWordSplitter will actually do batched processing.
class allennlp.data.tokenizers.word_splitter.SimpleWordSplitter
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead. This just does an initial split(), followed by some heuristic filtering of each whitespace-delimited token, separating contractions and punctuation. We assume lower-cased, reasonably well-formed English sentences as input.

- split_words(self, sentence: str) → List[allennlp.data.tokenizers.token.Token]
    Splits a sentence into word tokens. We handle three kinds of things: words with punctuation that should be ignored as a special case (Mr., Mrs., etc.), contractions/genitives (isn't, don't, Matt's), and beginning and ending punctuation ("antennagate", (parentheticals), and such).
    The basic outline is to split on whitespace, then check each of these cases. First, we strip off beginning punctuation, then strip off ending punctuation, then strip off contractions. When we strip something off the beginning of a word, we can add it to the list of tokens immediately. When we strip it off the end, we have to save it to be added after the word itself has been added. Before stripping off any part of a token, we first check to be sure the token isn't in our list of special cases.
class allennlp.data.tokenizers.word_splitter.SpacyWordSplitter(language: str = 'en_core_web_sm', pos_tags: bool = False, parse: bool = False, ner: bool = False, keep_spacy_tokens: bool = False, split_on_spaces: bool = False)
Bases: allennlp.data.tokenizers.word_splitter.WordSplitter

A WordSplitter that uses spaCy's tokenizer. It's fast and reasonable - this is the recommended WordSplitter. By default it will return allennlp Tokens, which are small, efficient NamedTuples (and are serializable). If you want to keep the original spaCy tokens, pass keep_spacy_tokens=True.

- batch_split_words(self, sentences: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Spacy needs to do batch processing, or it can be really slow. This method lets you take advantage of that if you want. The default implementation is to just iterate over the sentences and call split_words, but the SpacyWordSplitter will actually do batched processing.
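A usage sketch; it requires the en_core_web_sm spaCy model, and the tags you get depend on the model version:

    from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

    # pos_tags=True loads spaCy's tagger so each returned Token carries pos_/tag_ information.
    splitter = SpacyWordSplitter(pos_tags=True)

    for token in splitter.split_words("AllenNLP is awesome."):
        print(token.text, token.pos_)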
class allennlp.data.tokenizers.word_splitter.WhitespaceTokenizer(vocab)
Bases: object

Spacy doesn't assume that text is tokenised. Sometimes this is annoying, like when you have gold data which is pre-tokenised, but Spacy's tokenisation doesn't match the gold. This class can be used to replace spaCy's tokenizer with plain whitespace splitting, as shown in the sketch below.
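A runnable version of the usage described in the docstring; en_core_web_md is the model named there, and the sample sentence is illustrative:

    import spacy

    from allennlp.data.tokenizers.word_splitter import WhitespaceTokenizer

    nlp = spacy.load("en_core_web_md")
    # Hack to replace spaCy's tokenizer with a whitespace tokenizer.
    nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

    # Use nlp("here is some text") as normal; tokens are now the whitespace-separated words.
    doc = nlp("here is some text")
    print([token.text for token in doc])  # ['here', 'is', 'some', 'text']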
class allennlp.data.tokenizers.word_splitter.WordSplitter
Bases: allennlp.common.registrable.Registrable

A WordSplitter splits strings into words. This is typically called a "tokenizer" in NLP, because splitting strings into characters is trivial, but we use Tokenizer to refer to the higher-level object that splits strings into tokens (which could just be character tokens). So, we're using "word splitter" here for this.

- batch_split_words(self, sentences: List[str]) → List[List[allennlp.data.tokenizers.token.Token]]
    Spacy needs to do batch processing, or it can be really slow. This method lets you take advantage of that if you want. The default implementation is to just iterate over the sentences and call split_words, but the SpacyWordSplitter will actually do batched processing.

- default_implementation: str = 'spacy'
class allennlp.data.tokenizers.word_stemmer.PassThroughWordStemmer
Bases: allennlp.data.tokenizers.word_stemmer.WordStemmer

Does not stem words; it's a no-op. This is the default word stemmer.
class allennlp.data.tokenizers.word_stemmer.PorterStemmer
Bases: allennlp.data.tokenizers.word_stemmer.WordStemmer

Uses NLTK's PorterStemmer to stem words.
class allennlp.data.tokenizers.word_stemmer.WordStemmer
Bases: allennlp.common.registrable.Registrable

A WordStemmer lemmatizes words. This means that we map words to their root form, so that, e.g., "have", "has", and "had" all have the same internal representation.

You should think carefully about whether and how much stemming you want in your model. Kind of the whole point of using word embeddings is so that you don't have to do this, but in a highly inflected language, or in a low-data setting, you might need it anyway. The default WordStemmer does nothing, just returning the word token as-is.

- default_implementation: str = 'pass_through'
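A sketch of enabling stemming in a WordTokenizer; NLTK must be installed, and the stems shown in the comment are only indicative:

    from allennlp.data.tokenizers.word_stemmer import PorterStemmer
    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer

    # Stemming runs inside the WordTokenizer, after splitting and filtering.
    tokenizer = WordTokenizer(word_stemmer=PorterStemmer())

    print([t.text for t in tokenizer.tokenize("The ponies were running")])
    # e.g. ['the', 'poni', 'were', 'run']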
class allennlp.data.tokenizers.sentence_splitter.SentenceSplitter
Bases: allennlp.common.registrable.Registrable

A SentenceSplitter splits strings into sentences.

- batch_split_sentences(self, texts: List[str]) → List[List[str]]
    Default implementation is to just iterate over the texts and call split_sentences.

- default_implementation: str = 'spacy'
class allennlp.data.tokenizers.sentence_splitter.SpacySentenceSplitter(language: str = 'en_core_web_sm', rule_based: bool = False)
Bases: allennlp.data.tokenizers.sentence_splitter.SentenceSplitter

A SentenceSplitter that uses spaCy's built-in sentence boundary detection.

Spacy's default sentence splitter uses a dependency parse to detect sentence boundaries, so it is slow, but accurate.

Another option is to use rule-based sentence boundary detection. It's fast and has a small memory footprint, since it uses punctuation to detect sentence boundaries. This can be activated with the rule_based flag.

By default, SpacySentenceSplitter calls the default spacy boundary detector.
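A usage sketch of the rule-based mode described above; the text is just an example and the split depends on the spaCy model installed:

    from allennlp.data.tokenizers.sentence_splitter import SpacySentenceSplitter

    # rule_based=True uses punctuation-based boundary detection instead of the dependency parse.
    splitter = SpacySentenceSplitter(rule_based=True)

    sentences = splitter.split_sentences("This is one sentence. Here is another!")
    print(sentences)
    # e.g. ['This is one sentence.', 'Here is another!']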