copynet_seq2seq
allennlp_models.generation.dataset_readers.copynet_seq2seq
CopyNetDatasetReader
@DatasetReader.register("copynet_seq2seq")
class CopyNetDatasetReader(DatasetReader):
| def __init__(
| self,
| target_namespace: str,
| source_tokenizer: Tokenizer = None,
| target_tokenizer: Tokenizer = None,
| source_token_indexers: Dict[str, TokenIndexer] = None,
| **kwargs
| ) -> None
Read a tsv file containing paired sequences, and create a dataset suitable for a
CopyNet model, or any model with a matching API.
The expected format for each input line is: `<source_sequence_string>\t<target_sequence_string>`.
An instance produced by `CopyNetDatasetReader` will contain at least the following fields:
- `source_tokens` : a `TextField` containing the tokenized source sentence. This will
  result in a tensor of shape `(batch_size, source_length)`.
- `source_token_ids` : an `ArrayField` of size `(batch_size, source_length)` that contains
  an ID for each token in the source sentence. Tokens that match at the lowercase level
  will share the same ID. If `target_tokens` is passed as well, these IDs will also
  correspond to the `target_token_ids` field, i.e. any tokens that match at the lowercase
  level in both the source and target sentences will share the same ID. Note that these
  IDs have no correlation with the token indices from the corresponding vocabulary
  namespaces.
- `source_to_target` : a `NamespaceSwappingField` that keeps track of the index of the
  target token that matches each token in the source sentence. When there is no matching
  target token, the OOV index is used. This will result in a tensor of shape
  `(batch_size, source_length)`.
- `metadata` : a `MetadataField` which contains the source tokens and potentially target
  tokens as lists of strings.

When `target_string` is passed, the instance will also contain these fields:

- `target_tokens` : a `TextField` containing the tokenized target sentence, including the
  `START_SYMBOL` and `END_SYMBOL`. This will result in a tensor of shape
  `(batch_size, target_length)`.
- `target_token_ids` : an `ArrayField` of size `(batch_size, target_length)`. This is
  calculated in the same way as `source_token_ids`.
See the "Notes" section below for a description of how these fields are used.
Parameters

- target_namespace : `str`
  The vocab namespace for the targets. This needs to be passed to the dataset reader
  in order to construct the `NamespaceSwappingField`.
- source_tokenizer : `Tokenizer`, optional
  Tokenizer to use to split the input sequences into words or other kinds of tokens.
  Defaults to `SpacyTokenizer()`.
- target_tokenizer : `Tokenizer`, optional
  Tokenizer to use to split the output sequences (during training) into words or other
  kinds of tokens. Defaults to `source_tokenizer`.
- source_token_indexers : `Dict[str, TokenIndexer]`, optional
  Indexers used to define input (source side) token representations. Defaults to
  `{"tokens": SingleIdTokenIndexer()}`.
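A sketch of overriding these defaults in code (the tokenizer and namespace choices here
are assumptions for illustration, not recommendations):

```python
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import SpacyTokenizer, WhitespaceTokenizer
from allennlp_models.generation.dataset_readers.copynet_seq2seq import CopyNetDatasetReader

reader = CopyNetDatasetReader(
    target_namespace="target_tokens",
    source_tokenizer=SpacyTokenizer(),
    target_tokenizer=WhitespaceTokenizer(),
    source_token_indexers={"tokens": SingleIdTokenIndexer(namespace="source_tokens")},
)
```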
Notes

Regarding the fields in an `Instance` produced by this dataset reader,
`source_token_ids` and `target_token_ids` are primarily used during training to determine
whether a target token is copied from a source token (or multiple matching source
tokens), while `source_to_target` is primarily used during prediction to combine the
copy scores of source tokens with the generation scores for matching tokens in the
target namespace.
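The lowercase-match ID scheme can be illustrated with a small, hypothetical helper (the
reader computes these IDs internally; this is not its actual code):

```python
from typing import Dict, List, Tuple

def matching_ids(source: List[str], target: List[str]) -> Tuple[List[int], List[int]]:
    """Assign a shared ID to tokens that match at the lowercase level."""
    ids: Dict[str, int] = {}

    def lookup(token: str) -> int:
        # The first occurrence of a lowercased token gets the next fresh ID;
        # later occurrences (in source or target) reuse it.
        return ids.setdefault(token.lower(), len(ids))

    return [lookup(t) for t in source], [lookup(t) for t in target]

src_ids, tgt_ids = matching_ids(["Hello", "world", "!"], ["hello", "there", "!"])
# src_ids == [0, 1, 2]; tgt_ids == [0, 3, 2]: "Hello"/"hello" and "!" share IDs.
```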
text_to_instance
class CopyNetDatasetReader(DatasetReader):
| ...
| @overrides
| def text_to_instance(
| self,
| source_string: str,
| target_string: str = None
| ) -> Instance
Turn raw source string and target string into an Instance.
Parameters

- source_string : `str`
- target_string : `str`, optional (default = `None`)

Returns

- `Instance`
  See the above for a description of the fields that the instance will contain.
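A minimal sketch of calling this method directly (the strings and namespace are
arbitrary examples):

```python
from allennlp_models.generation.dataset_readers.copynet_seq2seq import CopyNetDatasetReader

reader = CopyNetDatasetReader(target_namespace="target_tokens")
instance = reader.text_to_instance(
    source_string="the quick brown fox",
    target_string="le renard brun rapide",
)
print(sorted(instance.fields))  # includes both source- and target-side fields
```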
apply_token_indexers
class CopyNetDatasetReader(DatasetReader):
| ...
| @overrides
| def apply_token_indexers(self, instance: Instance) -> None
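This method carries no docstring here; in recent AllenNLP versions,
`apply_token_indexers` attaches the reader's token indexers to the `TextField`s of an
instance, which lets indexing be deferred until after `text_to_instance` (the data
loader normally calls it for you). A hedged sketch of calling it manually, under that
assumption:

```python
from allennlp.data import Vocabulary
from allennlp_models.generation.dataset_readers.copynet_seq2seq import CopyNetDatasetReader

reader = CopyNetDatasetReader(target_namespace="target_tokens")
instance = reader.text_to_instance("the quick brown fox")

# Attach the source_token_indexers so the instance can be counted into a
# vocabulary and then indexed into tensors.
reader.apply_token_indexers(instance)
vocab = Vocabulary.from_instances([instance])
instance.index_fields(vocab)
```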