allennlp.data.tokenizers.character_tokenizer
CharacterTokenizer
@Tokenizer.register("character")
class CharacterTokenizer(Tokenizer):
| def __init__(
|     self,
|     byte_encoding: str = None,
|     lowercase_characters: bool = False,
|     start_tokens: List[Union[str, int]] = None,
|     end_tokens: List[Union[str, int]] = None
| ) -> None
A CharacterTokenizer splits strings into character tokens.

Registered as a Tokenizer with name "character".
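Because the class is registered under the name "character", it can be built from configuration. A minimal sketch, assuming the standard Registrable/from_params mechanism; the dictionary shown is illustrative and any of the constructor arguments below could be added to it:

```python
from allennlp.common import Params
from allennlp.data.tokenizers import Tokenizer

# Resolve the registered "character" tokenizer from a configuration blob.
tokenizer = Tokenizer.from_params(Params({"type": "character"}))
```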
Parameters

- byte_encoding : str, optional (default = None)
  If not None, we will use this encoding to encode the string as bytes, and use the byte sequence as characters, instead of the unicode characters in the python string. E.g., the character 'á' would be a single token if this option is None, but it would be two tokens if this option is set to "utf-8".
  If this is not None, tokenize will return a List[int] instead of a List[str], and we will bypass the vocabulary in the TokenIndexer.
- lowercase_characters : bool, optional (default = False)
  If True, we will lowercase all of the characters in the text before doing any other operation. You probably do not want to do this, as character vocabularies are generally not very large to begin with, but it's an option if you really want it.
- start_tokens : List[str], optional
  If given, these tokens will be added to the beginning of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].
- end_tokens : List[str], optional
  If given, these tokens will be added to the end of every string we tokenize. If using byte encoding, this should actually be a List[int], not a List[str].
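A minimal usage sketch based on the behaviour described above; the outputs in the comments are assumptions following that description, and under byte encoding the tokens carry integer ids rather than text, so only a count is printed:

```python
from allennlp.data.tokenizers import CharacterTokenizer

# Default behaviour: one token per unicode character.
tokenizer = CharacterTokenizer()
print([t.text for t in tokenizer.tokenize("café")])
# ['c', 'a', 'f', 'é']

# start_tokens / end_tokens are added around every tokenized string.
wrapped = CharacterTokenizer(start_tokens=["@start@"], end_tokens=["@end@"])
print([t.text for t in wrapped.tokenize("hi")])
# ['@start@', 'h', 'i', '@end@']

# With byte encoding, 'é' becomes two UTF-8 bytes, so the same string
# yields five tokens instead of four.
byte_tokenizer = CharacterTokenizer(byte_encoding="utf-8")
print(len(byte_tokenizer.tokenize("café")))
# 5
```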
tokenize
class CharacterTokenizer(Tokenizer):
| ...
| @overrides
| def tokenize(self, text: str) -> List[Token]