hf_tokenize

allennlp.tango.hf_tokenize


AllenNLP Tango is an experimental API, and parts of it may change or disappear in any new release.

HuggingfaceTokenize

@Step.register("hf_tokenize")
class HuggingfaceTokenize(Step)

This step converts strings in the original dataset into TransformerTextFields.

DETERMINISTIC

class HuggingfaceTokenize(Step):
 | ...
 | DETERMINISTIC = True

VERSION

class HuggingfaceTokenize(Step):
 | ...
 | VERSION = "001"

CACHEABLE

class HuggingfaceTokenize(Step):
 | ...
 | CACHEABLE = True

run

class HuggingfaceTokenize(Step):
 | ...
 | def run(
 |     self,
 |     tokenizer_name: str,
 |     input: DatasetDict,
 |     fields_to_tokenize: Optional[List[str]] = None,
 |     add_special_tokens: bool = True,
 |     max_length: Optional[int] = 512,
 |     special_tokens_mask: bool = False,
 |     offset_mapping: bool = False
 | ) -> DatasetDict

Reads a dataset and converts all strings in it into TransformerTextFields.

  • tokenizer_name is the name of the tokenizer to use. For example, "roberta-large".
  • input is the dataset to transform in this way.
  • By default, this step tokenizes all strings it finds, but if you specify fields_to_tokenize, it will only tokenize the named fields.
  • add_special_tokens specifies whether to add the tokenizer's special tokens (such as start- and end-of-sequence tokens) to the tokenized strings.
  • max_length is the maximum length the resulting TransformerTextField will have. Text that tokenizes to more than max_length tokens is truncated.
  • special_tokens_mask specifies whether to add the special token mask as one of the tensors in TransformerTextField.
  • offset_mapping specifies whether to add a mapping from tokens to original string offsets to the tensors in TransformerTextField.

This step returns a new dataset in which the tokenized string fields have been replaced with TransformerTextFields.
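
As a rough sketch, the step might be wired into a Tango experiment configuration like this. The step names ("raw_data", "tokenized_data"), the upstream step's type, and the exact reference syntax are illustrative assumptions, not part of this module's API:

```jsonnet
{
    "steps": {
        // Hypothetical upstream step that produces a DatasetDict of raw strings.
        "raw_data": {
            "type": "datasets::load"
        },
        // The hf_tokenize step registered above; tokenizer_name and input are
        // the two required arguments of run().
        "tokenized_data": {
            "type": "hf_tokenize",
            "tokenizer_name": "roberta-large",
            "input": {"type": "ref", "ref": "raw_data"},
            "max_length": 512,
            "special_tokens_mask": false,
            "offset_mapping": false
        }
    }
}
```

Because DETERMINISTIC and CACHEABLE are both true, Tango can cache the output of "tokenized_data" keyed on VERSION and the arguments, so rerunning an experiment with an unchanged config should not re-tokenize the dataset.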