allennlp.data.dataset_readers.next_token_lm

class allennlp.data.dataset_readers.next_token_lm.NextTokenLmReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Creates Instances suitable for use in predicting a single next token using a language model. The Field s that we create are the following: an input TextField and a target token TextField (we only ver have a single token, but we use a TextField so we can index it the same way as our input, typically with a single PretrainedTransformerIndexer).

NOTE: This is not fully functional! It was written to put together a demo for interpreting and attacking language models, not for actually training anything. It would be a really bad idea to use this setup for training language models, as it would be incredibly inefficient. The only purpose of this class is for a demo.

Parameters
tokenizerTokenizer, optional (default=``WordTokenizer()``)

We use this Tokenizer for the text. See Tokenizer.

token_indexersDict[str, TokenIndexer], optional (default=``{“tokens”: SingleIdTokenIndexer()}``)

We use this to define the input representation for the text, and to get ids for the mask targets. See TokenIndexer.

text_to_instance(self, sentence: str = None, tokens: List[allennlp.data.tokenizers.token.Token] = None, target: str = None) → allennlp.data.instance.Instance[source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between _read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.