next_token_lm

allennlp_models.lm.dataset_readers.next_token_lm

NextTokenLMReader

@DatasetReader.register("next_token_lm")
class NextTokenLMReader(DatasetReader):
 | def __init__(
 |     self,
 |     tokenizer: Tokenizer = None,
 |     token_indexers: Dict[str, TokenIndexer] = None,
 |     max_tokens: int = None,
 |     **kwargs
 | ) -> None

Creates Instances suitable for use in predicting a single next token using a language model. The Fields that we create are the following: an input TextField and a target token TextField (we only ever have a single token, but we use a TextField so we can index it the same way as our input, typically with a single PretrainedTransformerIndexer).

NOTE: This is not fully functional! It was written to put together a demo for interpreting and attacking language models, not for actually training anything. It would be a really bad idea to use this setup for training language models, as it would be incredibly inefficient. The only purpose of this class is for a demo.

Parameters

  • tokenizer : Tokenizer, optional (default = WhitespaceTokenizer())
    We use this Tokenizer for the text. See Tokenizer.
  • token_indexers : Dict[str, TokenIndexer], optional (default = {"tokens": SingleIdTokenIndexer()})
    We use this to define the input representation for the text, and to get ids for the mask targets. See TokenIndexer.
  • max_tokens : int, optional (default = None)
    If you don't handle truncation at the tokenizer level, you can specify max_tokens here, and only the last max_tokens will be used.
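
For example, here is a minimal construction sketch that pairs a pretrained transformer tokenizer with a matching indexer, as the docstring above suggests. The model name "gpt2" and the max_tokens value are illustrative assumptions, not defaults of this reader:

from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.data.token_indexers import PretrainedTransformerIndexer

from allennlp_models.lm.dataset_readers.next_token_lm import NextTokenLMReader

# "gpt2" is an illustrative choice; any causal LM vocabulary works the same way.
model_name = "gpt2"
reader = NextTokenLMReader(
    tokenizer=PretrainedTransformerTokenizer(model_name),
    token_indexers={"tokens": PretrainedTransformerIndexer(model_name)},
    max_tokens=128,  # if an input is longer, only its last 128 tokens are kept
)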

text_to_instance

class NextTokenLMReader(DatasetReader):
 | ...
 | def text_to_instance(
 |     self,
 |     sentence: str = None,
 |     tokens: List[Token] = None,
 |     target: str = None
 | ) -> Instance
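
A usage sketch with the defaults (WhitespaceTokenizer and SingleIdTokenIndexer). The field names "tokens" and "target_ids" are assumed to match what this reader attaches to the Instance, and the sentence and target are made up for illustration:

from allennlp_models.lm.dataset_readers.next_token_lm import NextTokenLMReader

# Defaults: WhitespaceTokenizer for the text, SingleIdTokenIndexer for ids.
reader = NextTokenLMReader()
instance = reader.text_to_instance(sentence="The quick brown fox", target="jumps")

# The input TextField and the single-token target TextField described above.
print(instance["tokens"])      # TextField over ["The", "quick", "brown", "fox"]
print(instance["target_ids"])  # TextField over the single token ["jumps"]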