copynet_seq2seq
allennlp_models.generation.dataset_readers.copynet_seq2seq
CopyNetDatasetReader#
@DatasetReader.register("copynet_seq2seq")
class CopyNetDatasetReader(DatasetReader):
| def __init__(
| self,
| target_namespace: str,
| source_tokenizer: Tokenizer = None,
| target_tokenizer: Tokenizer = None,
| source_token_indexers: Dict[str, TokenIndexer] = None,
| **kwargs
| ) -> None
Read a tsv file containing paired sequences, and create a dataset suitable for a
`CopyNet` model, or any model with a matching API.
The expected format for each input line is: `<source_sequence_string>\t<target_sequence_string>`.
An instance produced by `CopyNetDatasetReader` will contain at least the following fields:
- `source_tokens` : a `TextField` containing the tokenized source sentence. This will result in a tensor of shape `(batch_size, source_length)`.
- `source_token_ids` : a `TensorField` of size `(batch_size, source_length)` that contains an ID for each token in the source sentence. Tokens that match at the lowercase level will share the same ID. If `target_tokens` is passed as well, these IDs will also correspond to the `target_token_ids` field, i.e. any tokens that match at the lowercase level in both the source and target sentences will share the same ID. Note that these IDs have no correlation with the token indices from the corresponding vocabulary namespaces.
- `source_to_target` : a `NamespaceSwappingField` that keeps track of the index of the target token that matches each token in the source sentence. When there is no matching target token, the OOV index is used. This will result in a tensor of shape `(batch_size, source_length)`.
- `metadata` : a `MetadataField` which contains the source tokens and potentially target tokens as lists of strings.
When `target_string` is passed, the instance will also contain these fields:

- `target_tokens` : a `TextField` containing the tokenized target sentence, including the `START_SYMBOL` and `END_SYMBOL`. This will result in a tensor of shape `(batch_size, target_length)`.
- `target_token_ids` : a `TensorField` of size `(batch_size, target_length)`. This is calculated in the same way as `source_token_ids`.
See the "Notes" section below for a description of how these fields are used.
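The shared-ID scheme behind `source_token_ids` and `target_token_ids` can be sketched in plain Python (a simplified stand-in for illustration, not the actual AllenNLP implementation; tokens are matched after lowercasing):

```python
def make_copy_ids(source_tokens, target_tokens=None):
    """Assign a shared ID to tokens that match at the lowercase level,
    mirroring how source_token_ids / target_token_ids relate."""
    ids = {}  # lowercased token -> shared ID

    def token_id(token):
        key = token.lower()
        if key not in ids:
            ids[key] = len(ids)
        return ids[key]

    source_ids = [token_id(t) for t in source_tokens]
    target_ids = [token_id(t) for t in (target_tokens or [])]
    return source_ids, target_ids


# "The"/"the" and "sat" match at the lowercase level,
# so they share IDs across the source and target sequences.
src_ids, tgt_ids = make_copy_ids(["The", "cat", "sat"], ["the", "dog", "sat"])
```

Note that these IDs are assigned per instance and, as stated above, have no relation to the indices in any vocabulary namespace.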
Parameters¶

- target_namespace : `str`
  The vocab namespace for the targets. This needs to be passed to the dataset reader in order to construct the `NamespaceSwappingField`.
- source_tokenizer : `Tokenizer`, optional
  Tokenizer to use to split the input sequences into words or other kinds of tokens. Defaults to `SpacyTokenizer()`.
- target_tokenizer : `Tokenizer`, optional
  Tokenizer to use to split the output sequences (during training) into words or other kinds of tokens. Defaults to `source_tokenizer`.
- source_token_indexers : `Dict[str, TokenIndexer]`, optional
  Indexers used to define input (source side) token representations. Defaults to `{"tokens": SingleIdTokenIndexer()}`.
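Since the reader is registered as `"copynet_seq2seq"`, it can be constructed from a configuration file. A minimal fragment might look like the following (a sketch: the `"target_tokens"` namespace name is a placeholder, and `"single_id"` is assumed to be the registered name for `SingleIdTokenIndexer`):

```json
"dataset_reader": {
    "type": "copynet_seq2seq",
    "target_namespace": "target_tokens",
    "source_token_indexers": {
        "tokens": {"type": "single_id"}
    }
}
```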
Notes¶

With regard to the fields in an `Instance` produced by this dataset reader,
`source_token_ids` and `target_token_ids` are primarily used during training
to determine whether a target token is copied from a source token (or multiple matching
source tokens), while `source_to_target` is primarily used during prediction
to combine the copy scores of source tokens with the generation scores for matching
tokens in the target namespace.
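The prediction-time role of `source_to_target` can be illustrated with a small, framework-free sketch (a simplification: in the real model the scores are batched tensors, and the OOV index here is arbitrarily taken to be 0):

```python
OOV_INDEX = 0  # assumed OOV index in the target namespace

def combine_scores(generation_scores, copy_scores, source_to_target):
    """Add each source token's copy score onto the generation score of the
    matching target-vocabulary token; source tokens that map to the OOV
    index (no matching target token) are skipped."""
    combined = list(generation_scores)
    for i, target_index in enumerate(source_to_target):
        if target_index != OOV_INDEX:
            combined[target_index] += copy_scores[i]
    return combined


# Two source tokens: the first matches target-vocab index 2,
# the second has no matching target token (mapped to OOV).
combine_scores([0, 1, 2, 0], copy_scores=[3, 7], source_to_target=[2, 0])
```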
text_to_instance#
class CopyNetDatasetReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| source_string: str,
| target_string: str = None,
| weight: float = None
| ) -> Instance
Turn raw source string and target string into an `Instance`.
Parameters¶

- source_string : `str`
- target_string : `str`, optional (default = `None`)
- weight : `float`, optional (default = `None`)
  An optional weight to assign to this instance when calculating the loss in `CopyNetSeq2Seq.forward()`.
Returns¶

- `Instance`
  See above for a description of the fields that the instance will contain.
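The `weight` parameter scales an instance's contribution to the training loss. A minimal sketch of that idea (an illustration only, not the actual `CopyNetSeq2Seq.forward()` code, which operates on batched tensors):

```python
def weighted_loss(per_instance_nll, weights=None):
    """Average per-instance negative log-likelihoods, scaling each by its
    optional weight (every weight defaults to 1.0 when none are given)."""
    if weights is None:
        weights = [1.0] * len(per_instance_nll)
    total = sum(w * nll for w, nll in zip(weights, per_instance_nll))
    return total / len(per_instance_nll)


# The second instance counts only half as much toward the loss.
weighted_loss([2.0, 4.0], weights=[1.0, 0.5])
```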
apply_token_indexers#
class CopyNetDatasetReader(DatasetReader):
| ...
| @overrides
| def apply_token_indexers(self, instance: Instance) -> None