next_token_lm
NextTokenLMReader#
class NextTokenLMReader(DatasetReader):
| def __init__(
| self,
| tokenizer: Tokenizer = None,
| token_indexers: Dict[str, TokenIndexer] = None,
| **kwargs
| ) -> None
Creates `Instance`s suitable for use in predicting a single next token using a language model. The `Field`s that we create are the following: an input `TextField` and a target token `TextField` (we only ever have a single token, but we use a `TextField` so we can index it the same way as our input, typically with a single `PretrainedTransformerIndexer`).
NOTE: This is not fully functional! It was written to put together a demo for interpreting and attacking language models, not for actually training anything. It would be a really bad idea to use this setup for training language models, as it would be incredibly inefficient. The only purpose of this class is for a demo.
Parameters

- **tokenizer** : `Tokenizer`, optional (default = `WhitespaceTokenizer()`)
  We use this `Tokenizer` for the text. See `Tokenizer`.
- **token_indexers** : `Dict[str, TokenIndexer]`, optional (default = `{"tokens": SingleIdTokenIndexer()}`)
  We use this to define the input representation for the text, and to get ids for the mask targets. See `TokenIndexer`.
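To make the two-field structure concrete, here is a minimal plain-Python sketch (not AllenNLP code) of what the reader produces: a whitespace-tokenized input plus a single-token target kept as a list so both can be indexed the same way. The function name `make_instance_fields` and the dictionary keys are hypothetical, chosen only for illustration.

```python
from typing import Dict, List


def make_instance_fields(sentence: str, target: str) -> Dict[str, List[str]]:
    """Plain-Python analogue of the reader's output: an input token list
    and a single-token target list, represented the same way so both can
    be indexed with the same token indexer."""
    tokens = sentence.split()  # stands in for WhitespaceTokenizer
    return {"tokens": tokens, "target_ids": [target]}


fields = make_instance_fields("The dog chased the", "cat")
# fields["tokens"]     -> ["The", "dog", "chased", "the"]
# fields["target_ids"] -> ["cat"]
```

The real reader wraps these in `TextField`s and an `Instance`, but the shape of the data is the same: an input sequence and a one-token target.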
text_to_instance#
class NextTokenLMReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| sentence: str = None,
| tokens: List[Token] = None,
| target: str = None
| ) -> Instance
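Since `text_to_instance` accepts a context `sentence` and a `target` token separately, a caller has to split raw text before invoking it. The helper below is a hypothetical sketch of one such split (treat the final whitespace token as the next-token target); this splitting convention is an assumption for illustration, not logic from the reader itself.

```python
from typing import Tuple


def split_for_prediction(line: str) -> Tuple[str, str]:
    # Assumption: everything before the last whitespace-separated token is
    # the context sentence; the final token is the next-token target.
    words = line.split()
    return " ".join(words[:-1]), words[-1]


sentence, target = split_for_prediction("The dog chased the cat")
# sentence -> "The dog chased the"
# target   -> "cat"
# These would then be passed as
# reader.text_to_instance(sentence=sentence, target=target)
```

Because the class is demo-only (see the note above), this kind of ad-hoc splitting is acceptable here but would not be a sensible data pipeline for training.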