@DatasetReader.register("copynet_seq2seq")
class CopyNetDatasetReader(DatasetReader):
    def __init__(
        self,
        target_namespace: str,
        source_tokenizer: Tokenizer = None,
        target_tokenizer: Tokenizer = None,
        source_token_indexers: Dict[str, TokenIndexer] = None,
        **kwargs
    ) -> None
Read a tsv file containing paired sequences, and create a dataset suitable for a
CopyNet model, or any model with a matching API.
The expected format for each input line is the source sequence and the target sequence, separated by a tab.
An instance produced by CopyNetDatasetReader will contain at least the following fields:
- source_tokens: a TextField containing the tokenized source sentence. This will result in a tensor of shape (batch_size, source_length).
- source_token_ids: a field of size (batch_size, source_length) that contains an ID for each token in the source sentence. Tokens that match at the lowercase level will share the same ID. If target_tokens is passed as well, these IDs will also correspond to the target_token_ids field, i.e. any tokens that match at the lowercase level in both the source and target sentences will share the same ID. Note that these IDs have no correlation with the token indices from the corresponding vocabulary namespaces.
- source_to_target: a NamespaceSwappingField that keeps track of the index of the target token that matches each token in the source sentence. When there is no matching target token, the OOV index is used. This will result in a tensor of shape (batch_size, source_length).
- metadata: a MetadataField which contains the source tokens and potentially target tokens as lists of strings.
When target_string is passed, the instance will also contain these fields:
- target_tokens: a TextField containing the tokenized target sentence, including the START_SYMBOL and END_SYMBOL. This will result in a tensor of shape (batch_size, target_length).
- target_token_ids: a field of size (batch_size, target_length). This is calculated in the same way as source_token_ids.
See the "Notes" section below for a description of how these fields are used.
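A minimal sketch of how the shared IDs described above can be computed, assuming tokens are plain strings and that matching is done on the lowercased text as stated. The helper names tokens_to_ids and shared_token_ids are hypothetical and are not the reader's actual methods.

```python
from typing import Dict, List, Tuple


def tokens_to_ids(tokens: List[str]) -> List[int]:
    """Assign each distinct lowercased token the next unused integer ID."""
    ids: Dict[str, int] = {}
    out: List[int] = []
    for token in tokens:
        # setdefault returns the existing ID, or stores and returns a new one.
        out.append(ids.setdefault(token.lower(), len(ids)))
    return out


def shared_token_ids(
    source: List[str], target: List[str]
) -> Tuple[List[int], List[int]]:
    """Number source and target tokens together so that matches share an ID."""
    combined = tokens_to_ids(source + target)
    return combined[: len(source)], combined[len(source):]


source_ids, target_ids = shared_token_ids(
    ["The", "cat", "sat"], ["the", "Cat", "slept"]
)
print(source_ids)  # [0, 1, 2]
print(target_ids)  # [0, 1, 3]
```

The IDs are local to a single instance, which is why they bear no relation to vocabulary indices.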
- target_namespace :
str, required
The vocab namespace for the targets. This needs to be passed to the dataset reader in order to construct the NamespaceSwappingField.
- source_tokenizer :
Tokenizer, optional
Tokenizer to use to split the input sequences into words or other kinds of tokens. Defaults to SpacyTokenizer().
- target_tokenizer :
Tokenizer, optional
Tokenizer to use to split the output sequences (during training) into words or other kinds of tokens. Defaults to source_tokenizer.
- source_token_indexers :
Dict[str, TokenIndexer], optional
Indexers used to define input (source side) token representations. Defaults to {"tokens": SingleIdTokenIndexer()}.
Notes

Regarding the fields in an Instance produced by this dataset reader, source_token_ids and target_token_ids are primarily used during training to determine whether a target token is copied from a source token (or multiple matching source tokens), while source_to_target is primarily used during prediction to combine the copy scores of source tokens with the generation scores for matching tokens in the target namespace.
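The two roles described in these notes can be sketched in plain Python. This is an illustration under stated assumptions, not the model's code: copy_positions and map_source_to_target are hypothetical helpers, the target vocabulary is a plain dict, and the special tokens and OOV index of 1 are assumed values.

```python
from typing import Dict, List


def copy_positions(source_token_ids: List[int], target_token_id: int) -> List[int]:
    """Training-time view: source positions whose ID matches the target token."""
    return [i for i, sid in enumerate(source_token_ids) if sid == target_token_id]


def map_source_to_target(
    source_tokens: List[str], target_vocab: Dict[str, int], oov_index: int
) -> List[int]:
    """Prediction-time view: target-namespace index of each source token."""
    return [target_vocab.get(token, oov_index) for token in source_tokens]


source_tokens = ["the", "cat", "saw", "the", "dog"]
source_token_ids = [0, 1, 2, 0, 3]  # "the" appears twice and shares ID 0

# A target token with ID 0 could have been copied from either "the".
print(copy_positions(source_token_ids, 0))  # [0, 3]
# A target token with an ID unseen in the source cannot have been copied.
print(copy_positions(source_token_ids, 4))  # []

# Hypothetical target vocabulary; index 1 is the assumed OOV index.
target_vocab = {"<pad>": 0, "<unk>": 1, "the": 2, "dog": 3}
print(map_source_to_target(source_tokens, target_vocab, oov_index=1))
# [2, 1, 1, 2, 3]
```

With the second mapping, a copy score for a source token such as "dog" can be added to the generation score at target index 3, while tokens mapped to the OOV index can only be produced by copying.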
class CopyNetDatasetReader(DatasetReader):
    ...
    def text_to_instance(
        self,
        source_string: str,
        target_string: str = None,
        weight: float = None
    ) -> Instance
Turn a raw source string and target string into an Instance.
- source_string :
str, required
- target_string :
str, optional (default = None)
- weight :
float, optional (default = None)
An optional weight to assign to this instance when calculating the loss in CopyNetSeq2Seq.forward().
Returns an Instance. See the class description above for the fields that the instance will contain.
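Because the reader consumes tsv lines while text_to_instance takes separate strings, a sketch of the split between the two may help. parse_tsv_line is a hypothetical helper, and raising ValueError on malformed lines is an assumption rather than the reader's documented behavior.

```python
from typing import Tuple


def parse_tsv_line(line: str) -> Tuple[str, str]:
    """Split one tsv line into (source_string, target_string)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        # Assumed error handling for lines that are not a source/target pair.
        raise ValueError(f"expected 2 tab-separated fields, got {len(parts)}")
    source_string, target_string = parts
    return source_string, target_string


print(parse_tsv_line("hello world\tbonjour le monde\n"))
# ('hello world', 'bonjour le monde')
```

The two returned strings correspond to the source_string and target_string arguments of text_to_instance.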
class CopyNetDatasetReader(DatasetReader):
    ...
    def apply_token_indexers(self, instance: Instance) -> None