allennlp.data.dataset_readers.copynet_seq2seq
class allennlp.data.dataset_readers.copynet_seq2seq.CopyNetDatasetReader(target_namespace: str, source_tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, target_tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, source_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, lazy: bool = False)[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Read a tsv file containing paired sequences, and create a dataset suitable for a CopyNet model, or any model with a matching API.

The expected format for each input line is: <source_sequence_string><tab><target_sequence_string>.

An instance produced by CopyNetDatasetReader will contain at least the following fields:

- source_tokens: a TextField containing the tokenized source sentence, including the START_SYMBOL and END_SYMBOL. This will result in a tensor of shape (batch_size, source_length).
- source_token_ids: an ArrayField of size (batch_size, trimmed_source_length) that contains an ID for each token in the source sentence. Tokens that match at the lowercase level will share the same ID. If target_tokens is passed as well, these IDs will also correspond to the target_token_ids field, i.e. any tokens that match at the lowercase level in both the source and target sentences will share the same ID. Note that these IDs have no correlation with the token indices from the corresponding vocabulary namespaces.
- source_to_target: a NamespaceSwappingField that keeps track of the index of the target token that matches each token in the source sentence. When there is no matching target token, the OOV index is used. This will result in a tensor of shape (batch_size, trimmed_source_length).
- metadata: a MetadataField which contains the source tokens and potentially target tokens as lists of strings.

When target_string is passed, the instance will also contain these fields:

- target_tokens: a TextField containing the tokenized target sentence, including the START_SYMBOL and END_SYMBOL. This will result in a tensor of shape (batch_size, target_length).
- target_token_ids: an ArrayField of size (batch_size, target_length). This is calculated in the same way as source_token_ids.
See the “Notes” section below for a description of how these fields are used.
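
As a rough illustration, the sketch below reads such a tsv file with this reader. The file path "train.tsv" and the "target_tokens" namespace name are placeholder values chosen for the example, not part of the library.

```python
# A minimal sketch, assuming a local file "train.tsv" (hypothetical path) whose lines
# follow the <source_sequence_string><tab><target_sequence_string> format, e.g.:
#   tom wants to buy a car<TAB>he wants a car
# The "target_tokens" namespace name is illustrative.
from allennlp.data.dataset_readers.copynet_seq2seq import CopyNetDatasetReader

reader = CopyNetDatasetReader(target_namespace="target_tokens")
instances = reader.read("train.tsv")

for instance in instances:
    print(instance.fields["source_tokens"])     # TextField, with START_SYMBOL/END_SYMBOL
    print(instance.fields["source_to_target"])  # NamespaceSwappingField
    break
```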
Parameters

- target_namespace : str, required
  The vocab namespace for the targets. This needs to be passed to the dataset reader in order to construct the NamespaceSwappingField.
- source_tokenizer : Tokenizer, optional
  Tokenizer to use to split the input sequences into words or other kinds of tokens. Defaults to WordTokenizer().
- target_tokenizer : Tokenizer, optional
  Tokenizer to use to split the output sequences (during training) into words or other kinds of tokens. Defaults to source_tokenizer.
- source_token_indexers : Dict[str, TokenIndexer], optional
  Indexers used to define input (source side) token representations. Defaults to {"tokens": SingleIdTokenIndexer()}.
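
For reference, here is a hedged sketch that constructs the reader with these optional arguments spelled out explicitly, mirroring the documented defaults; the "target_tokens" namespace name is an illustrative choice, not a required value.

```python
# A sketch constructing the reader with the optional arguments given explicitly,
# mirroring the documented defaults. The "target_tokens" namespace is illustrative.
from allennlp.data.dataset_readers.copynet_seq2seq import CopyNetDatasetReader
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import WordTokenizer

reader = CopyNetDatasetReader(
    target_namespace="target_tokens",
    source_tokenizer=WordTokenizer(),
    target_tokenizer=WordTokenizer(),
    source_token_indexers={"tokens": SingleIdTokenIndexer()},
    lazy=False,
)
```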
Notes
By source_length we are referring to the number of tokens in the source sentence including the START_SYMBOL and END_SYMBOL, while trimmed_source_length refers to the number of tokens in the source sentence excluding the START_SYMBOL and END_SYMBOL, i.e. trimmed_source_length = source_length - 2.

On the other hand, target_length is the number of tokens in the target sentence including the START_SYMBOL and END_SYMBOL.

In the context where there is a batch_size dimension, the above refer to the maximum of their individual values across the batch.

With regard to the fields in an Instance produced by this dataset reader, source_token_ids and target_token_ids are primarily used during training to determine whether a target token is copied from a source token (or multiple matching source tokens), while source_to_target is primarily used during prediction to combine the copy scores of source tokens with the generation scores for matching tokens in the target namespace.
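
As a hedged illustration of the ID-sharing behaviour, the sketch below builds a single instance by hand. The example sentences and the "target_tokens" namespace are made up, and the exact integer IDs depend on the order in which the reader assigns them; only the sharing pattern (lowercase-equal tokens receive equal IDs) is the point.

```python
# An illustrative sketch of the ID-sharing behaviour described above; the sentences
# are made up and the exact integer values depend on the assignment order.
from allennlp.data.dataset_readers.copynet_seq2seq import CopyNetDatasetReader

reader = CopyNetDatasetReader(target_namespace="target_tokens")
instance = reader.text_to_instance("the dog chased the cat", "the cat ran")

# Both occurrences of "the" in the source share one ID, and "the"/"cat" in the target
# reuse the IDs of their source matches; target_token_ids also covers START_SYMBOL
# and END_SYMBOL, so it has length target_length.
print(instance.fields["source_token_ids"].array)
print(instance.fields["target_token_ids"].array)
```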
text_to_instance(self, source_string: str, target_string: str = None) → allennlp.data.instance.Instance[source]

Turn raw source string and target string into an Instance.

Parameters

- source_string : str, required
- target_string : str, optional (default = None)

Returns

- Instance
  See the above for a description of the fields that the instance will contain.
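
As a rough usage sketch (the example strings and the "target_tokens" namespace are illustrative), the method can be called with or without a target string:

```python
# A rough usage sketch; the strings and the "target_tokens" namespace are illustrative.
from allennlp.data.dataset_readers.copynet_seq2seq import CopyNetDatasetReader

reader = CopyNetDatasetReader(target_namespace="target_tokens")

# Source only (e.g. at prediction time): no target_tokens / target_token_ids fields.
prediction_instance = reader.text_to_instance("tom wants to buy a car")

# Source and target (e.g. while reading training data): all fields are present.
training_instance = reader.text_to_instance("tom wants to buy a car", "he wants a car")

print(sorted(prediction_instance.fields.keys()))
print(sorted(training_instance.fields.keys()))
```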