spacy_tokenizer
allennlp.data.tokenizers.spacy_tokenizer
SpacyTokenizer#
@Tokenizer.register("spacy")
class SpacyTokenizer(Tokenizer):
| def __init__(
| self,
| language: str = "en_core_web_sm",
| pos_tags: bool = False,
| parse: bool = False,
| ner: bool = False,
| keep_spacy_tokens: bool = False,
| split_on_spaces: bool = False,
| start_tokens: Optional[List[str]] = None,
| end_tokens: Optional[List[str]] = None
| ) -> None
A Tokenizer that uses spaCy's tokenizer. It's fast and reasonable - this is the recommended Tokenizer. By default it will return allennlp Tokens, which are small, efficient NamedTuples (and are serializable). If you want to keep the original spaCy tokens, pass keep_spacy_tokens=True.

Note that we leave one particular piece of post-processing for later: the decision of whether or not to lowercase the token. This is for two reasons: (1) if you want to make two different casing decisions for whatever reason, you won't have to run the tokenizer twice, and more importantly (2) if you want to lowercase words for your word embedding, but retain capitalization in a character-level representation, we need to retain the capitalization here.

Registered as a Tokenizer with name "spacy", which is currently the default.
Parameters
- language : str, optional (default = "en_core_web_sm")
  Spacy model name.
- pos_tags : bool, optional (default = False)
  If True, performs POS tagging with the spacy model on the tokens. Generally used in conjunction with PosTagIndexer.
- parse : bool, optional (default = False)
  If True, performs dependency parsing with the spacy model on the tokens. Generally used in conjunction with DepLabelIndexer.
- ner : bool, optional (default = False)
  If True, performs named entity recognition with the spacy model on the tokens. Generally used in conjunction with NerTagIndexer.
- keep_spacy_tokens : bool, optional (default = False)
  If True, will preserve spacy token objects. We copy spacy tokens into our own class by default because spacy's Cython Tokens can't be pickled.
- split_on_spaces : bool, optional (default = False)
  If True, will split on spaces without performing tokenization. Used when your data is already tokenized, but you still want to perform POS tagging, NER or parsing on the tokens.
- start_tokens : Optional[List[str]], optional (default = None)
  If given, these tokens will be added to the beginning of every string we tokenize.
- end_tokens : Optional[List[str]], optional (default = None)
  If given, these tokens will be added to the end of every string we tokenize.
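A minimal usage sketch (assuming the en_core_web_sm model is installed; the tag_ field printed below is one of the attributes the allennlp Token carries over from spacy):

```python
from allennlp.data.tokenizers import SpacyTokenizer

# Build a tokenizer that also runs the spacy POS tagger.
tokenizer = SpacyTokenizer(pos_tags=True)

tokens = tokenizer.tokenize("AllenNLP is built on spaCy.")
for token in tokens:
    # Each element is an allennlp Token (not a spacy Cython token),
    # so it can be pickled and still carries the tag from the tagger.
    print(token.text, token.tag_)
```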
batch_tokenize#
class SpacyTokenizer(Tokenizer):
| ...
| @overrides
| def batch_tokenize(self, texts: List[str]) -> List[List[Token]]
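A sketch of batch processing, assuming the same setup as above; batch_tokenize takes a list of strings and returns one list of Tokens per input string:

```python
from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer()
batches = tokenizer.batch_tokenize(
    ["The first sentence.", "The second sentence."]
)
# One list of Tokens per input string.
assert len(batches) == 2
print([token.text for token in batches[0]])
```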
tokenize#
class SpacyTokenizer(Tokenizer):
| ...
| @overrides
| def tokenize(self, text: str) -> List[Token]
This works because our Token class matches spacy's.
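As an illustration of the start_tokens / end_tokens options described above (a sketch; the sentinel strings are arbitrary placeholders):

```python
from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer(start_tokens=["<s>"], end_tokens=["</s>"])
tokens = tokenizer.tokenize("Hello world")
# Expected: ['<s>', 'Hello', 'world', '</s>']
print([token.text for token in tokens])
```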
__call__#
class _WhitespaceSpacyTokenizer:
| ...
| def __call__(self, text)
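This internal whitespace tokenizer backs the split_on_spaces=True option; a sketch of using it indirectly through SpacyTokenizer, assuming the input text is already tokenized:

```python
from allennlp.data.tokenizers import SpacyTokenizer

tokenizer = SpacyTokenizer(split_on_spaces=True, pos_tags=True)
# The text is only split on spaces, so the existing tokenization
# is preserved, but the spacy POS tagger still runs on the tokens.
tokens = tokenizer.tokenize("Do n't re-tokenize me")
print([(token.text, token.tag_) for token in tokens])
```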