allennlp.data.fields

A Field is some piece of a data instance that ends up as an array in a model.
- class allennlp.data.fields.field.Field
  Bases: typing.Generic
  A Field is some piece of a data instance that ends up as a tensor in a model (either as an input or an output). Data instances are just collections of fields.
  Fields go through up to two steps of processing: (1) tokenized fields are converted into token ids, and (2) fields containing token ids (or any other numeric data) are padded (if necessary) and converted into tensors. The Field API has methods around both of these steps, though they may not be needed for some concrete Field classes - if your field doesn’t have any strings that need indexing, you don’t need to implement count_vocab_items or index. These methods pass by default.
  Once a vocabulary is computed and all fields are indexed, we will determine padding lengths, then intelligently batch together instances and pad them into actual tensors.
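  To make that lifecycle concrete, here is a minimal sketch of a custom Field. The FloatField name and its behavior are hypothetical, not part of the library; the sketch assumes allennlp 0.x and torch. Because it holds no strings, it inherits the default pass implementations of count_vocab_items and index:

    import torch

    from allennlp.data.fields.field import Field


    class FloatField(Field[torch.Tensor]):
        """Hypothetical field wrapping a single float; needs no vocabulary."""

        def __init__(self, value: float) -> None:
            self.value = value

        def get_padding_lengths(self) -> dict:
            # A scalar needs no padding, so there is nothing to report.
            return {}

        def as_tensor(self, padding_lengths: dict) -> torch.Tensor:
            return torch.tensor(self.value)

        def empty_field(self) -> 'FloatField':
            return FloatField(0.0)

    # The inherited batch_tensors() stacks these per-instance tensors
    # into one batched tensor.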
- as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[~DataArray]) → ~DataArray
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'Field'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.array_field.ArrayField(array: numpy.ndarray, padding_value: int = 0, dtype: numpy.dtype = <class 'numpy.float32'>)
  Bases: allennlp.data.fields.field.Field
  A class representing an array, which could have arbitrary dimensions. A batch of these arrays is padded to the max dimension length in the batch for each dimension.
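  A short usage sketch (assuming allennlp 0.x, numpy, and torch): two arrays of different lengths are padded to a common shape and stacked, the way a data iterator would batch them.

    import numpy
    from allennlp.data.fields import ArrayField

    short = ArrayField(numpy.array([1.0, 2.0, 3.0]))
    longer = ArrayField(numpy.array([1.0, 2.0, 3.0, 4.0, 5.0]))

    # Aggregate padding lengths across the batch by taking the max per key.
    lengths = {key: max(short.get_padding_lengths()[key],
                        longer.get_padding_lengths()[key])
               for key in short.get_padding_lengths()}
    batched = short.batch_tensors([field.as_tensor(lengths)
                                   for field in (short, longer)])
    # batched has shape (2, 5); the first row ends with two padding_value zeros.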
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- class allennlp.data.fields.index_field.IndexField(index: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField)
  Bases: allennlp.data.fields.field.Field
  An IndexField is an index into a SequenceField, as might be used for representing a correct answer option in a list, or a span begin or end position in a passage, for example. Because it’s an index into a SequenceField, we take one of those as input and use it to compute padding lengths.
  Parameters:
  - index : int
    The index of the answer in the SequenceField. This is typically the “correct answer” in some classification decision over the sequence, like where an answer span starts in SQuAD, or which answer option is correct in a multiple choice question. A value of -1 means there is no label, which can be used for padding or other purposes.
  - sequence_field : SequenceField
    A field containing the sequence that this IndexField is a pointer into.
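  A brief sketch (assuming allennlp 0.x) of pointing an IndexField at one token of a TextField:

    from allennlp.data.fields import IndexField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    passage = TextField([Token(t) for t in "the cat sat down".split()],
                        {"tokens": SingleIdTokenIndexer()})
    answer_start = IndexField(2, passage)  # points at "sat"
    tensor = answer_start.as_tensor(answer_start.get_padding_lengths())
    # tensor is a LongTensor holding the index 2.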
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- class allennlp.data.fields.span_field.SpanField(span_start: int, span_end: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField)
  Bases: allennlp.data.fields.field.Field
  A SpanField is a pair of inclusive, zero-indexed (start, end) indices into a SequenceField, used to represent a span of text. Because it’s a pair of indices into a SequenceField, we take one of those as input to make the span’s dependence explicit and to validate that the span is well defined.
  Parameters:
  - span_start : int, required
    The index of the start of the span in the SequenceField.
  - span_end : int, required
    The inclusive index of the end of the span in the SequenceField.
  - sequence_field : SequenceField, required
    A field containing the sequence that this SpanField is a span inside.
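  A brief sketch (assuming allennlp 0.x) of a SpanField over a TextField; both endpoints are inclusive:

    from allennlp.data.fields import SpanField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    passage = TextField([Token(t) for t in "the cat sat down".split()],
                        {"tokens": SingleIdTokenIndexer()})
    span = SpanField(1, 2, passage)  # covers "cat sat"
    tensor = span.as_tensor(span.get_padding_lengths())
    # tensor is a LongTensor holding the pair [1, 2].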
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
KnowledgeGraphField is a Field which stores a knowledge graph representation.
- class allennlp.data.fields.knowledge_graph_field.KnowledgeGraphField(knowledge_graph: allennlp.semparse.contexts.knowledge_graph.KnowledgeGraph, utterance_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, feature_extractors: List[str] = None, entity_tokens: List[List[allennlp.data.tokenizers.token.Token]] = None, linking_features: List[List[List[float]]] = None, include_in_vocab: bool = True, max_table_tokens: int = None)
  Bases: allennlp.data.fields.field.Field
  A KnowledgeGraphField represents a KnowledgeGraph as a Field that can be used in a Model. For each entity in the graph, we output two things: a text representation of the entity, handled identically to a TextField, and a list of linking features for each token in some input utterance.
  The output of this field is a dictionary:

    {
      "text": Dict[str, torch.Tensor],  # each tensor has shape (batch_size, num_entities, num_entity_tokens)
      "linking": torch.Tensor           # shape (batch_size, num_entities, num_utterance_tokens, num_features)
    }

  The text component of this dictionary is suitable to be passed into a TextFieldEmbedder (which handles the additional num_entities dimension without any issues). The linking component of the dictionary can be used however you want to decide which tokens in the utterance correspond to which entities in the knowledge graph.
  In order to create the text component, we use the same dictionary of TokenIndexers that’s used in a TextField (as we’re just representing the text corresponding to each entity). For the linking component, we use a set of hard-coded feature extractors that operate between the text corresponding to each entity and each token in the utterance.
  Parameters:
  - knowledge_graph : KnowledgeGraph
    The knowledge graph that this field stores.
  - utterance_tokens : List[Token]
    The tokens in some utterance that is paired with the KnowledgeGraph. We compute a set of features for linking tokens in the utterance to entities in the graph.
  - tokenizer : Tokenizer, optional (default=WordTokenizer())
    We’ll use this Tokenizer to tokenize the text representation of each entity.
  - token_indexers : Dict[str, TokenIndexer]
    Token indexers that convert entities into arrays, similar to how text tokens are treated in a TextField. These might operate on the name of the entity itself, its type, its neighbors in the graph, etc.
  - feature_extractors : List[str], optional
    Names of feature extractors to use for computing linking features. These must be attributes of this object, without the first underscore. The feature extraction functions are listed as the last methods in this class. For example, to use _exact_token_match(), you would pass the string exact_token_match. We will add an underscore and look for a function matching that name. If this list is omitted, we will use all available feature functions.
  - entity_tokens : List[List[Token]], optional
    If you have pre-computed the tokenization of the table text, you can pass it in here. This must be a list of the tokens in the entity text, for each entity in the knowledge graph, in the same order in which the knowledge graph returns entities.
  - linking_features : List[List[List[float]]], optional
    If you have pre-computed the linking features between the utterance and the table text, you can pass them in here.
  - include_in_vocab : bool, optional (default=True)
    If this is False, we will skip the count_vocab_items logic, leaving out all table entity text from the vocabulary computation. You might want to do this if you have a lot of rare entities in your tables, and you see the same table in multiple training instances, so your vocabulary counts get skewed and include too many rare entities.
  - max_table_tokens : int, optional
    If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.
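  A rough construction sketch (assuming allennlp 0.x with the semparse package installed; the entity name here is made up). The field computes linking features between the utterance tokens and each entity’s text:

    from allennlp.data.fields import KnowledgeGraphField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.semparse.contexts.knowledge_graph import KnowledgeGraph

    graph = KnowledgeGraph(entities={"fb:cell.2010"},
                           neighbors={"fb:cell.2010": []},
                           entity_text={"fb:cell.2010": "2010"})
    utterance = [Token(t) for t in "what happened in 2010".split()]
    field = KnowledgeGraphField(graph, utterance,
                                token_indexers={"tokens": SingleIdTokenIndexer()})
    # After indexing against a Vocabulary, as_tensor() produces the
    # {"text": ..., "linking": ...} dictionary described above.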
- as_tensor(self, padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[Dict[str, torch.Tensor]]) → Dict[str, torch.Tensor]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'KnowledgeGraphField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.label_field.LabelField(label: Union[str, int], label_namespace: str = 'labels', skip_indexing: bool = False)
  Bases: allennlp.data.fields.field.Field
  A LabelField is a categorical label of some kind, where the labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.
  This field will get converted into an integer index representing the class label.
  Parameters:
  - label : Union[str, int]
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, …), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).
  - skip_indexing : bool, optional (default=False)
    If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.
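  A brief end-to-end sketch (assuming allennlp 0.x) of the count → index → tensorize lifecycle for a string label:

    from collections import defaultdict

    from allennlp.data.fields import LabelField
    from allennlp.data.vocabulary import Vocabulary

    field = LabelField("entailment", label_namespace="labels")

    # The counter is namespace -> token -> count, as count_vocab_items expects.
    counter = defaultdict(lambda: defaultdict(int))
    field.count_vocab_items(counter)

    vocab = Vocabulary(counter)
    field.index(vocab)
    tensor = field.as_tensor(field.get_padding_lengths())
    # tensor is a LongTensor holding the integer id of "entailment".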
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.multilabel_field.MultiLabelField(labels: Sequence[Union[str, int]], label_namespace: str = 'labels', skip_indexing: bool = False, num_labels: Optional[int] = None)
  Bases: allennlp.data.fields.field.Field
  A MultiLabelField is an extension of the LabelField that allows for multiple labels. It is particularly useful in multi-label classification where more than one label can be correct. As with the LabelField, labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.
  This field will get converted into a vector of length equal to the vocabulary size, with ones at the positions of the labels and zeros everywhere else.
  Parameters:
  - labels : Sequence[Union[str, int]]
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, …), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).
  - skip_indexing : bool, optional (default=False)
    If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.
  - num_labels : int, optional (default=None)
    If skip_indexing=True, the total number of possible labels should be provided, which is required to decide the size of the output tensor. num_labels should equal the largest label id + 1. If skip_indexing=False, num_labels is not required.
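  A brief sketch (assuming allennlp 0.x) using already-indexed integer labels; num_labels fixes the length of the multi-hot output vector:

    from allennlp.data.fields import MultiLabelField

    field = MultiLabelField([0, 3], skip_indexing=True, num_labels=5)
    tensor = field.as_tensor(field.get_padding_lengths())
    # tensor is [1., 0., 0., 1., 0.]: ones at the label positions, zeros elsewhere.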
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.list_field.ListField(field_list: List[allennlp.data.fields.field.Field])
  Bases: allennlp.data.fields.sequence_field.SequenceField
  A ListField is a list of other fields. You would use this to represent, e.g., a list of answer options that are themselves TextFields.
  This field will get converted into a tensor that has one more dimension than the items in the list. If this is a list of TextFields that have shape (num_words, num_characters), this ListField will output a tensor of shape (num_sentences, num_words, num_characters).
  Parameters:
  - field_list : List[Field]
    A list of Field objects to be concatenated into a single input tensor. All of the contained Field objects must be of the same type.
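  A brief sketch (assuming allennlp 0.x): two answer-option lists of different lengths; the shorter one gets padded with empty fields when batched.

    from allennlp.data.fields import LabelField, ListField

    three_options = ListField([LabelField(o) for o in ("yes", "no", "maybe")])
    two_options = ListField([LabelField(o) for o in ("yes", "no")])
    # When these two instances are batched, two_options is padded to length 3
    # using LabelField.empty_field(), and after indexing the batch becomes a
    # (batch_size, num_options) tensor.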
- as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[~DataArray]) → ~DataArray
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.metadata_field.MetadataField(metadata: Any)
  Bases: allennlp.data.fields.field.Field, collections.abc.Mapping, typing.Generic
  A MetadataField is a Field that does not get converted into tensors. It just carries side information that might be needed later on, for computing some third-party metric, or outputting debugging information, or whatever else you need. We use this in the BiDAF model, for instance, to keep track of question IDs and passage token offsets, so we can more easily use the official evaluation script to compute metrics.
  We don’t try to do any kind of smart combination of this field for batched input - when you use this Field in a model, you’ll get a list of metadata objects, one for each instance in the batch.
  Parameters:
  - metadata : Any
    Some object containing the metadata that you want to store. It’s likely that you’ll want this to be a dictionary, but it could be anything you want.
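  A brief sketch (assuming allennlp 0.x; the dictionary keys are made up). Nothing is tensorized; batching just collects the per-instance objects into a list:

    from allennlp.data.fields import MetadataField

    meta = MetadataField({"question_id": "q_42", "token_offsets": [(0, 3), (4, 7)]})
    value = meta.as_tensor(meta.get_padding_lengths())  # the stored dict itself
    batch = meta.batch_tensors([value])                 # a list, one entry per instance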
- as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[~DataArray]) → List[~DataArray]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- empty_field(self) → 'MetadataField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- class allennlp.data.fields.production_rule_field.ProductionRule(rule, is_global_rule, rule_id, nonterminal)
  Bases: tuple
  - property rule
    Alias for field number 0
  - property is_global_rule
    Alias for field number 1
  - property rule_id
    Alias for field number 2
  - property nonterminal
    Alias for field number 3
- allennlp.data.fields.production_rule_field.ProductionRuleArray
  Alias of allennlp.data.fields.production_rule_field.ProductionRule.
- class allennlp.data.fields.production_rule_field.ProductionRuleField(rule: str, is_global_rule: bool, vocab_namespace: str = 'rule_labels', nonterminal: str = None)
  Bases: allennlp.data.fields.field.Field
  This Field represents a production rule from a grammar, like “S -> [NP, VP]”, “N -> John”, or “<b,c> -> [<a,<b,c>>, a]”.
  We assume a few things about how these rules are formatted:
  - There is a left-hand side (LHS) and a right-hand side (RHS), where the LHS is always a non-terminal, and the RHS is either a terminal, a non-terminal, or a sequence of terminals and/or non-terminals.
  - The LHS and the RHS are joined by “ -> ”, and this sequence of characters appears nowhere else in the rule.
  - Non-terminal sequences in the RHS are formatted as “[NT1, NT2, …]”.
  - Some rules come from a global grammar used for a whole dataset, while other rules are specific to a particular Instance.
  We don’t make use of most of these assumptions in this class, but the code that consumes this Field relies heavily on them in some places.
  If the given rule is in the global grammar, we treat the rule as a vocabulary item that will get an index and (in the model) an embedding. If the rule is not in the global grammar, we do not create a vocabulary item from the rule, and don’t produce a tensor for the rule - we assume the model will handle representing this rule in some other way.
  Because we represent global grammar rules and instance-specific rules differently, this Field does not lend itself well to batching its arrays, even in a sequence for a single training instance. A model using this field will have to manually batch together rule representations after splitting apart the global rules from the Instance rules.
  In a model, this will get represented as a ProductionRule, which is defined above. This is a namedtuple of (rule_string, is_global_rule, [rule_id], nonterminal), where the rule_id Tensor, if present, will have shape (1,). We don’t do any batching of the Tensors, so this gets passed to Model.forward() as a List[ProductionRule]. We pass along the rule string because there isn’t another way to recover it for instance-specific rules that do not make it into the vocabulary.
  Parameters:
  - rule : str
    The production rule, formatted as described above. If this field is just padding, rule will be the empty string.
  - is_global_rule : bool
    Whether this rule comes from the global grammar or is an instance-specific production rule.
  - vocab_namespace : str, optional (default="rule_labels")
    The vocabulary namespace to use for the global production rules. We use “rule_labels” by default, because we typically do not want padding and OOV tokens for these, and ending the namespace with “labels” means we don’t get padding and OOV tokens.
  - nonterminal : str, optional (default=None)
    The left-hand side of the rule. Sometimes having this as a separate part of the ProductionRule can deduplicate work.
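  A brief sketch (assuming allennlp 0.x) for a global rule; the field produces a ProductionRule namedtuple rather than a plain tensor:

    from collections import defaultdict

    from allennlp.data.fields import ProductionRuleField
    from allennlp.data.vocabulary import Vocabulary

    field = ProductionRuleField("S -> [NP, VP]", is_global_rule=True)

    counter = defaultdict(lambda: defaultdict(int))
    field.count_vocab_items(counter)          # counts the rule in "rule_labels"
    vocab = Vocabulary(counter)
    field.index(vocab)

    rule = field.as_tensor(field.get_padding_lengths())
    # rule.rule == "S -> [NP, VP]"; rule.is_global_rule is True; rule.rule_id
    # holds the rule's vocabulary id (instance-specific rules get None instead).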
- as_tensor(self, padding_lengths: Dict[str, int]) → allennlp.data.fields.production_rule_field.ProductionRule
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[allennlp.data.fields.production_rule_field.ProductionRule]) → List[allennlp.data.fields.production_rule_field.ProductionRule]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.sequence_field.SequenceField
  Bases: allennlp.data.fields.field.Field
  A SequenceField represents a sequence of things. This class just adds a method onto Field: sequence_length(). It exists so that SequenceLabelField, IndexField, and other similar Fields can have a single type to require, with a consistent API, whether they are pointing to words in a TextField, items in a ListField, or something else.
  - empty_field(self) → 'SequenceField'
    So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
    We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- class allennlp.data.fields.sequence_label_field.SequenceLabelField(labels: Union[List[str], List[int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, label_namespace: str = 'labels')
  Bases: allennlp.data.fields.field.Field
  A SequenceLabelField assigns a categorical label to each element in a SequenceField. Because it’s a labeling of some other field, we take that field as input here, and we use it to determine our padding and other things.
  This field will get converted into a list of integer class ids, representing the correct class for each element in the sequence.
  Parameters:
  - labels : Union[List[str], List[int]]
    A sequence of categorical labels, encoded as strings or integers. These could be POS tags like [NN, JJ, …], BIO tags like [B-PERS, I-PERS, O, O, …], or any other categorical tag sequence. If the labels are encoded as integers, they will not be indexed using a vocab.
  - sequence_field : SequenceField
    A field containing the sequence that this SequenceLabelField is labeling. Most often, this is a TextField, for tagging individual tokens in a sentence.
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).
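  A brief sketch (assuming allennlp 0.x) tagging each token of a TextField; the label list must match the sequence length:

    from allennlp.data.fields import SequenceLabelField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    sentence = TextField([Token(t) for t in "cats chase mice".split()],
                         {"tokens": SingleIdTokenIndexer()})
    tags = SequenceLabelField(["NNS", "VBP", "NNS"], sentence,
                              label_namespace="pos_labels")
    # After count_vocab_items/index with a Vocabulary, as_tensor yields one
    # integer tag id per token, padded to the batch's max sequence length.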
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'SequenceLabelField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
A TextField represents a string of text, the kind that you might want to represent with standard word vectors, or pass through an LSTM.
- class allennlp.data.fields.text_field.TextField(tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer])
  Bases: allennlp.data.fields.sequence_field.SequenceField
  This Field represents a list of string tokens. Before constructing this object, you need to tokenize raw strings using a Tokenizer.
  Because string tokens can be represented as indexed arrays in a number of ways, we also take a dictionary of TokenIndexer objects that will be used to convert the tokens into indices. Each TokenIndexer could represent each token as a single ID, or a list of character IDs, or something else.
  This field will get converted into a dictionary of arrays, one for each TokenIndexer. A SingleIdTokenIndexer produces an array of shape (num_tokens,), while a TokenCharactersIndexer produces an array of shape (num_tokens, num_characters).
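  A brief end-to-end sketch (assuming allennlp 0.x) of counting, indexing, and tensorizing a TextField with a single-id indexer:

    from collections import defaultdict

    from allennlp.data.fields import TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    field = TextField([Token(t) for t in "the cat sat".split()],
                      {"tokens": SingleIdTokenIndexer()})

    counter = defaultdict(lambda: defaultdict(int))
    field.count_vocab_items(counter)
    vocab = Vocabulary(counter)
    field.index(vocab)

    tensors = field.as_tensor(field.get_padding_lengths())
    # tensors["tokens"] is a LongTensor of shape (3,) with one id per token.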
- as_tensor(self, padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[Dict[str, torch.Tensor]]) → Dict[str, torch.Tensor]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  The TextField has a list of Tokens, and each Token gets converted into arrays by (potentially) several TokenIndexers. This method gets the max length (over tokens) associated with each of these arrays.
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.adjacency_field.AdjacencyField(indices: List[Tuple[int, int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, labels: List[str] = None, label_namespace: str = 'labels', padding_value: int = -1)
  Bases: allennlp.data.fields.field.Field
  An AdjacencyField defines directed adjacency relations between elements in a SequenceField. Because it’s a labeling of some other field, we take that field as input here and use it to determine our padding and other things.
  This field will get converted into an array of shape (sequence_field_length, sequence_field_length), where the (i, j)th array element is either a binary flag indicating that there is an edge from i to j, or an integer label k, indicating that there is an edge of type k from i to j.
  Parameters:
  - indices : List[Tuple[int, int]]
  - sequence_field : SequenceField
    A field containing the sequence that this AdjacencyField is labeling. Most often, this is a TextField, for tagging edge relations between tokens in a sentence.
  - labels : List[str], optional (default=None)
    Optional labels for the edges of the adjacency matrix.
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).
  - padding_value : int, optional (default=-1)
    The value to use as padding.
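  A brief sketch (assuming allennlp 0.x) of an unlabeled adjacency matrix over a three-token sentence:

    from allennlp.data.fields import AdjacencyField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    sentence = TextField([Token(t) for t in "the cat sat".split()],
                         {"tokens": SingleIdTokenIndexer()})
    edges = AdjacencyField([(0, 1), (2, 0)], sentence)
    tensor = edges.as_tensor(edges.get_padding_lengths())
    # A (3, 3) matrix: 1 at positions (0, 1) and (2, 0), and the
    # padding_value (-1) everywhere else.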
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'AdjacencyField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.namespace_swapping_field.NamespaceSwappingField(source_tokens: List[allennlp.data.tokenizers.token.Token], target_namespace: str)
  Bases: allennlp.data.fields.field.Field
  A NamespaceSwappingField is used to map tokens in one namespace to tokens in another namespace. It is used by seq2seq models with a copy mechanism that copies tokens from the source sentence into the target sentence.
  Parameters:
  - source_tokens : List[Token]
    The tokens from the source sentence.
  - target_namespace : str
    The namespace that the tokens from the source sentence will be mapped to.
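  A brief sketch (assuming allennlp 0.x; the namespace name is arbitrary) mapping each source token to its id in the decoder's target namespace:

    from allennlp.data.fields import NamespaceSwappingField
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    for word in ("the", "cat", "sat"):
        vocab.add_token_to_namespace(word, namespace="target_tokens")

    source = [Token(t) for t in ("the", "cat", "sat")]
    field = NamespaceSwappingField(source, target_namespace="target_tokens")
    field.index(vocab)
    ids = field.as_tensor(field.get_padding_lengths())
    # ids is a LongTensor with one target-namespace id per source token.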
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self) → 'NamespaceSwappingField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.