spacy_tokenizer

allennlp.data.tokenizers.spacy_tokenizer

SpacyTokenizer

@Tokenizer.register("spacy")
class SpacyTokenizer(Tokenizer):
 | def __init__(
 |     self,
 |     language: str = "en_core_web_sm",
 |     pos_tags: bool = True,
 |     parse: bool = False,
 |     ner: bool = False,
 |     keep_spacy_tokens: bool = False,
 |     split_on_spaces: bool = False,
 |     start_tokens: Optional[List[str]] = None,
 |     end_tokens: Optional[List[str]] = None
 | ) -> None

A Tokenizer that uses spaCy's tokenizer. It's fast and reasonable - this is the recommended Tokenizer. By default it will return allennlp Tokens, which are small, efficient NamedTuples (and are serializable). If you want to keep the original spaCy tokens, pass keep_spacy_tokens=True.

Note that we leave one particular piece of post-processing for later: the decision of whether or not to lowercase the token. This is for two reasons: (1) if you want to make two different casing decisions for whatever reason, you won't have to run the tokenizer twice, and more importantly (2) if you want to lowercase words for your word embedding, but retain capitalization in a character-level representation, we need to retain the capitalization here.

Registered as a Tokenizer with name "spacy", which is currently the default.
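
A minimal usage sketch (assuming spaCy's en_core_web_sm model is installed; the sentence is only illustrative):

from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer()  # pos_tags=True by default

tokens = tokenizer.tokenize("AllenNLP is built on spaCy.")
for token in tokens:
    # Each allennlp Token carries the text plus whatever annotations
    # the spaCy pipeline produced (here, POS tags).
    print(token.text, token.tag_)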

Parameters

  • language : str, optional (default = "en_core_web_sm")
    spaCy model name.
  • pos_tags : bool, optional (default = True)
    If True, performs POS tagging with the spaCy model on the tokens. Generally used in conjunction with PosTagIndexer.
  • parse : bool, optional (default = False)
    If True, performs dependency parsing with the spaCy model on the tokens. Generally used in conjunction with DepLabelIndexer.
  • ner : bool, optional (default = False)
    If True, performs named entity recognition with the spaCy model on the tokens. Generally used in conjunction with NerTagIndexer.
  • keep_spacy_tokens : bool, optional (default = False)
    If True, will preserve spaCy token objects. By default we instead copy spaCy tokens into our own class, because spaCy's Cython Tokens can't be pickled.
  • split_on_spaces : bool, optional (default = False)
    If True, will split by spaces without performing tokenization. Used when your data is already tokenized, but you want to perform POS tagging, NER, or parsing on the tokens (see the example after this list).
  • start_tokens : Optional[List[str]], optional (default = None)
    If given, these tokens will be added to the beginning of every string we tokenize.
  • end_tokens : Optional[List[str]], optional (default = None)
    If given, these tokens will be added to the end of every string we tokenize.
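
As noted in the split_on_spaces entry, a short sketch of tagging pre-tokenized text (the input string is illustrative):

from allennlp.data.tokenizers import SpacyTokenizer

# Data that is already tokenized: tokens separated by single spaces.
pretokenized = "The quick brown fox jumps ."

tokenizer = SpacyTokenizer(pos_tags=True, split_on_spaces=True)
tokens = tokenizer.tokenize(pretokenized)

# Token boundaries are exactly the space-separated pieces;
# the spaCy pipeline only adds annotations on top.
print([t.text for t in tokens])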

batch_tokenize

class SpacyTokenizer(Tokenizer):
 | ...
 | def batch_tokenize(self, texts: List[str]) -> List[List[Token]]
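
A sketch of batch usage. batch_tokenize returns one token list per input string, and is typically faster than calling tokenize in a loop because spaCy can process the texts as a stream:

from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer()
texts = ["First sentence.", "And a second one."]

# One List[Token] per input, in the same order as texts.
token_lists = tokenizer.batch_tokenize(texts)
assert len(token_lists) == len(texts)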

tokenize

class SpacyTokenizer(Tokenizer):
 | ...
 | def tokenize(self, text: str) -> List[Token]

This works because our Token class matches spaCy's.
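
Concretely, the copied Token exposes the same attribute names as a spaCy token (text, tag_, pos_, dep_, ent_type_, and so on), so downstream code can read annotations the same way from either. A small illustration (the model and sentence are assumptions):

from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer(parse=True, ner=True)
tokens = tokenizer.tokenize("Apple is looking at buying a U.K. startup.")

for token in tokens:
    # These attribute names mirror spaCy's Token API, which is what
    # lets the copied tokens stand in for the originals downstream.
    print(token.text, token.pos_, token.dep_, token.ent_type_)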

__call__

class _WhitespaceSpacyTokenizer:
 | ...
 | def __call__(self, text)
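
_WhitespaceSpacyTokenizer is the internal helper that split_on_spaces swaps in for spaCy's own tokenizer. A minimal sketch of the standard pattern for such a drop-in (the exact implementation may differ):

import spacy
from spacy.tokens import Doc

class _WhitespaceSpacyTokenizer:
    """Splits text on single spaces and wraps the pieces in a Doc, so
    downstream pipeline components (tagger, parser, ner) still run."""

    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = spacy.load("en_core_web_sm")
# Replace spaCy's tokenizer with the whitespace version.
nlp.tokenizer = _WhitespaceSpacyTokenizer(nlp.vocab)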