hf_tokenize
allennlp.tango.hf_tokenize
AllenNLP Tango is an experimental API and parts of it might change or disappear every time we release a new version.
HuggingfaceTokenize¶
@Step.register("hf_tokenize")
class HuggingfaceTokenize(Step)
This step converts strings in the original dataset into TransformerTextFields.
DETERMINISTIC¶
class HuggingfaceTokenize(Step):
| ...
| DETERMINISTIC = True
VERSION¶
class HuggingfaceTokenize(Step):
| ...
| VERSION = "001"
CACHEABLE¶
class HuggingfaceTokenize(Step):
| ...
| CACHEABLE = True
run¶
class HuggingfaceTokenize(Step):
| ...
| def run(
| self,
| tokenizer_name: str,
| input: DatasetDict,
| fields_to_tokenize: Optional[List[str]] = None,
| add_special_tokens: bool = True,
| max_length: Optional[int] = 512,
| special_tokens_mask: bool = False,
| offset_mapping: bool = False
| ) -> DatasetDict
Reads a dataset and converts all strings in it into TransformerTextFields.
- `tokenizer_name` is the name of the tokenizer to use. For example, `"roberta-large"`.
- `input` is the dataset to transform in this way.
- By default, this step tokenizes all strings it finds, but if you specify `fields_to_tokenize`, it will only tokenize the named fields.
- `add_special_tokens` specifies whether or not to add special tokens to the tokenized strings.
- `max_length` is the maximum length the resulting `TransformerTextField` will have. If there is too much text, it will be truncated.
- `special_tokens_mask` specifies whether to add the special token mask as one of the tensors in `TransformerTextField`.
- `offset_mapping` specifies whether to add a mapping from tokens to original string offsets to the tensors in `TransformerTextField`.
This function returns a new dataset with new TransformerTextFields.
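As a sketch, this step would typically be wired into a Tango experiment config rather than called directly. The fragment below is illustrative only: the step names (`raw_data`, `tokenized_data`), the upstream `hf_dataset` step, and its parameters are assumptions, not taken from this page.

```jsonnet
{
    "steps": {
        // Hypothetical upstream step that produces a DatasetDict.
        "raw_data": {
            "type": "hf_dataset",
            "dataset_name": "glue",
            "dataset_config_name": "cola"
        },
        // Tokenize every string field in the dataset with the
        // RoBERTa tokenizer, truncating to 512 tokens.
        "tokenized_data": {
            "type": "hf_tokenize",
            "tokenizer_name": "roberta-large",
            "input": {"type": "ref", "ref": "raw_data"},
            "max_length": 512
        }
    }
}
```

The `{"type": "ref", "ref": "raw_data"}` value passes the output of the `raw_data` step as the `input` argument of `run`; because `DETERMINISTIC` and `CACHEABLE` are both true, the tokenized result can be cached and reused across runs as long as `VERSION` and the inputs are unchanged.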