allennlp.tango.hf_tokenize
AllenNLP Tango is an experimental API and parts of it might change or disappear every time we release a new version.
HuggingfaceTokenize¶
@Step.register("hf_tokenize")
class HuggingfaceTokenize(Step)
This step converts strings in the original dataset into `TransformerTextField`s.
DETERMINISTIC¶
class HuggingfaceTokenize(Step):
| ...
| DETERMINISTIC = True
VERSION¶
class HuggingfaceTokenize(Step):
| ...
| VERSION = "001"
CACHEABLE¶
class HuggingfaceTokenize(Step):
| ...
| CACHEABLE = True
run¶
class HuggingfaceTokenize(Step):
| ...
| def run(
| self,
| tokenizer_name: str,
| input: DatasetDict,
| fields_to_tokenize: Optional[List[str]] = None,
| add_special_tokens: bool = True,
| max_length: Optional[int] = 512,
| special_tokens_mask: bool = False,
| offset_mapping: bool = False
| ) -> DatasetDict
Reads a dataset and converts all strings in it into `TransformerTextField`s.
- `tokenizer_name` is the name of the tokenizer to use. For example, `"roberta-large"`.
- `input` is the dataset to transform in this way.
- By default, this step tokenizes all strings it finds, but if you specify `fields_to_tokenize`, it will only tokenize the named fields.
- `add_special_tokens` specifies whether or not to add special tokens to the tokenized strings.
- `max_length` is the maximum length the resulting `TransformerTextField` will have. If there is too much text, it will be truncated.
- `special_tokens_mask` specifies whether to add the special tokens mask as one of the tensors in `TransformerTextField`.
- `offset_mapping` specifies whether to add a mapping from tokens to original string offsets to the tensors in `TransformerTextField`.
This function returns a new dataset with new `TransformerTextField`s.
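As an illustration, a Tango experiment configuration might invoke this step roughly as follows. This is a hedged sketch, not a verbatim recipe: the upstream step name `raw_data`, its `"hf_dataset"` type and parameters, and the overall layout are assumptions for illustration; only the `"hf_tokenize"` step and its parameters come from the signature above.

```jsonnet
{
    "steps": {
        // Hypothetical upstream step producing a DatasetDict.
        "raw_data": {
            "type": "hf_dataset",
            "dataset_name": "glue",
            "dataset_config_name": "cola"
        },
        // Tokenize all string fields with the RoBERTa tokenizer,
        // truncating anything longer than 512 tokens.
        "tokenized_data": {
            "type": "hf_tokenize",
            "tokenizer_name": "roberta-large",
            "input": {"type": "ref", "ref": "raw_data"},
            "max_length": 512
        }
    }
}
```

Because `fields_to_tokenize` is omitted here, every string field in the referenced dataset would be tokenized.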