allennlp.data.fields

A Field is some piece of a data instance that ends up as an array in a model.

class allennlp.data.fields.field.Field[source]

Bases: typing.Generic

A Field is some piece of a data instance that ends up as a tensor in a model (either as an input or an output). Data instances are just collections of fields.

Fields go through up to two steps of processing: (1) tokenized fields are converted into token ids, and (2) fields containing token ids (or any other numeric data) are padded (if necessary) and converted into tensors. The Field API has methods around both of these steps, though they may not be needed for some concrete Field classes - if your field doesn't have any strings that need indexing, you don't need to implement count_vocab_items or index; the default implementations of these methods do nothing.

Once a vocabulary is computed and all fields are indexed, we will determine padding lengths, then intelligently batch together instances and pad them into actual tensors.
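To make these steps concrete, here is a minimal end-to-end sketch of the Field lifecycle (not part of the original reference; it assumes 0.9-era AllenNLP imports and uses a TextField as the example):

from collections import defaultdict

from allennlp.data import Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

field = TextField([Token(t) for t in "the cat sat".split()],
                  {"tokens": SingleIdTokenIndexer()})

# Step 1: count strings so a Vocabulary can be built, then index the field.
counter = defaultdict(lambda: defaultdict(int))
field.count_vocab_items(counter)
vocab = Vocabulary(counter)
field.index(vocab)

# Step 2: determine padding lengths, then pad and convert to tensors.
padding_lengths = field.get_padding_lengths()
tensors = field.as_tensor(padding_lengths)  # {"tokens": LongTensor of shape (3,)}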

as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

batch_tensors(self, tensor_list: List[~DataArray]) → ~DataArray[source]

Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.

This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
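For example, after counting a TextField (with word and character indexers) and a LabelField, the counter might contain something like this (all counts hypothetical):

counter = {
    "tokens": {"the": 2, "cat": 1},
    "token_characters": {"t": 3, "h": 1, "e": 2},
    "labels": {"entailment": 1},
}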

empty_field(self) → 'Field'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.array_field.ArrayField(array: numpy.ndarray, padding_value: int = 0, dtype: numpy.dtype = <class 'numpy.float32'>)[source]

Bases: allennlp.data.fields.field.Field

A class representing an array, which could have arbitrary dimensions. A batch of these arrays is padded to the max dimension length in the batch for each dimension.
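A minimal padding sketch (assuming 0.9-era AllenNLP; the "dimension_0" key name is an implementation detail):

import numpy
from allennlp.data.fields import ArrayField

short_field = ArrayField(numpy.array([1, 2], dtype=numpy.float32))
long_field = ArrayField(numpy.array([1, 2, 3, 4], dtype=numpy.float32))

# Pad the shorter array to the longer one's length along each dimension.
lengths = long_field.get_padding_lengths()  # e.g. {"dimension_0": 4}
padded = short_field.as_tensor(lengths)     # tensor([1., 2., 0., 0.])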

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

class allennlp.data.fields.index_field.IndexField(index: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField)[source]

Bases: allennlp.data.fields.field.Field

An IndexField is an index into a SequenceField, as might be used for representing a correct answer option in a list, or a span begin and span end position in a passage, for example. Because it’s an index into a SequenceField, we take one of those as input and use it to compute padding lengths.

Parameters
index : int

The index of the answer in the SequenceField. This is typically the “correct answer” in some classification decision over the sequence, like where an answer span starts in SQuAD, or which answer option is correct in a multiple choice question. A value of -1 means there is no label, which can be used for padding or other purposes.

sequence_field : SequenceField

A field containing the sequence that this IndexField is a pointer into.
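A minimal usage sketch (not from the original docs; variable names are illustrative):

from allennlp.data.fields import IndexField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

passage = TextField([Token(t) for t in "John lives in Seattle".split()],
                    {"tokens": SingleIdTokenIndexer()})
answer_start = IndexField(3, passage)  # points at "Seattle"

# No vocabulary is involved; as_tensor() just wraps the raw index.
tensor = answer_start.as_tensor(answer_start.get_padding_lengths())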

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

class allennlp.data.fields.span_field.SpanField(span_start: int, span_end: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField)[source]

Bases: allennlp.data.fields.field.Field

A SpanField is a pair of inclusive, zero-indexed (start, end) indices into a SequenceField, used to represent a span of text. Because it’s a pair of indices into a SequenceField, we take one of those as input to make the span’s dependence explicit and to validate that the span is well defined.

Parameters
span_start : int, required.

The index of the start of the span in the SequenceField.

span_end : int, required.

The inclusive index of the end of the span in the SequenceField.

sequence_field : SequenceField, required.

A field containing the sequence that this SpanField is a span inside.
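A minimal usage sketch (variable names are illustrative):

from allennlp.data.fields import SpanField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

sentence = TextField([Token(t) for t in "the quick brown fox".split()],
                     {"tokens": SingleIdTokenIndexer()})
span = SpanField(1, 2, sentence)  # "quick brown", inclusive on both ends

# as_tensor() produces a LongTensor holding [span_start, span_end].
tensor = span.as_tensor(span.get_padding_lengths())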

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

KnowledgeGraphField is a Field which stores a knowledge graph representation.

class allennlp.data.fields.knowledge_graph_field.KnowledgeGraphField(knowledge_graph: allennlp.semparse.contexts.knowledge_graph.KnowledgeGraph, utterance_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, feature_extractors: List[str] = None, entity_tokens: List[List[allennlp.data.tokenizers.token.Token]] = None, linking_features: List[List[List[float]]] = None, include_in_vocab: bool = True, max_table_tokens: int = None)[source]

Bases: allennlp.data.fields.field.Field

A KnowledgeGraphField represents a KnowledgeGraph as a Field that can be used in a Model. For each entity in the graph, we output two things: a text representation of the entity, handled identically to a TextField, and a list of linking features for each token in some input utterance.

The output of this field is a dictionary:

{
  "text": Dict[str, torch.Tensor],  # each tensor has shape (batch_size, num_entities, num_entity_tokens)
  "linking": torch.Tensor  # shape (batch_size, num_entities, num_utterance_tokens, num_features)
}

The text component of this dictionary is suitable to be passed into a TextFieldEmbedder (which handles the additional num_entities dimension without any issues). The linking component of the dictionary can be used however you want to decide which tokens in the utterance correspond to which entities in the knowledge graph.

In order to create the text component, we use the same dictionary of TokenIndexers that’s used in a TextField (as we’re just representing the text corresponding to each entity). For the linking component, we use a set of hard-coded feature extractors that operate between the text corresponding to each entity and each token in the utterance.

Parameters
knowledge_graph : KnowledgeGraph

The knowledge graph that this field stores.

utterance_tokens : List[Token]

The tokens in some utterance that is paired with the KnowledgeGraph. We compute a set of features for linking tokens in the utterance to entities in the graph.

tokenizer : Tokenizer, optional (default=WordTokenizer())

We’ll use this Tokenizer to tokenize the text representation of each entity.

token_indexers : Dict[str, TokenIndexer]

Token indexers that convert entities into arrays, similar to how text tokens are treated in a TextField. These might operate on the name of the entity itself, its type, its neighbors in the graph, etc.

feature_extractors : List[str], optional

Names of feature extractors to use for computing linking features. These must be attributes of this object, without the first underscore. The feature extraction functions are listed as the last methods in this class. For example, to use _exact_token_match(), you would pass the string exact_token_match. We will add an underscore and look for a function matching that name. If this list is omitted, we will use all available feature functions.

entity_tokens : List[List[Token]], optional

If you have pre-computed the tokenization of the table text, you can pass it in here. This must be a list of the tokens in the entity text, for each entity in the knowledge graph, in the same order in which the knowledge graph returns entities.

linking_features : List[List[List[float]]], optional

If you have pre-computed the linking features between the utterance and the table text, you can pass them in here.

include_in_vocab : bool, optional (default=True)

If this is False, we will skip the count_vocab_items logic, leaving out all table entity text from the vocabulary computation. You might want to do this if you have a lot of rare entities in your tables, and you see the same table in multiple training instances, so your vocabulary counts get skewed and include too many rare entities.

max_table_tokens : int, optional

If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.

as_tensor(self, padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor][source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

batch_tensors(self, tensor_list: List[Dict[str, torch.Tensor]]) → Dict[str, torch.Tensor][source]

Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.

This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self) → 'KnowledgeGraphField'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.label_field.LabelField(label: Union[str, int], label_namespace: str = 'labels', skip_indexing: bool = False)[source]

Bases: allennlp.data.fields.field.Field

A LabelField is a categorical label of some kind, where the labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.

This field will get converted into an integer index representing the class label.

Parameters
label : Union[str, int]
label_namespace : str, optional (default="labels")

The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, …), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).

skip_indexing : bool, optional (default=False)

If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.
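A minimal sketch of both modes (assuming 0.9-era AllenNLP; the label strings are illustrative):

from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import LabelField

# String labels are counted and indexed through a Vocabulary...
label = LabelField("entailment")
vocab = Vocabulary.from_instances([Instance({"label": label})])
label.index(vocab)
tensor = label.as_tensor(label.get_padding_lengths())  # e.g. tensor(0)

# ...while pre-indexed integer labels skip the vocabulary entirely.
already_indexed = LabelField(2, skip_indexing=True)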

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.multilabel_field.MultiLabelField(labels: Sequence[Union[str, int]], label_namespace: str = 'labels', skip_indexing: bool = False, num_labels: Optional[int] = None)[source]

Bases: allennlp.data.fields.field.Field

A MultiLabelField is an extension of the LabelField that allows for multiple labels. It is particularly useful in multi-label classification where more than one label can be correct. As with the LabelField, labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.

This field will get converted into a multi-hot vector of length equal to the vocabulary size, with ones at the positions of the given labels and zeros everywhere else.

Parameters
labels : Sequence[Union[str, int]]
label_namespace : str, optional (default="labels")

The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, …), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).

skip_indexing : bool, optional (default=False)

If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.

num_labels : int, optional (default=None)

If skip_indexing=True, the total number of possible labels should be provided, which is required to decide the size of the output tensor. num_labels should equal largest label id + 1. If skip_indexing=False, num_labels is not required.
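A minimal sketch of the multi-hot output (assuming 0.9-era AllenNLP):

from allennlp.data.fields import MultiLabelField

# Labels are already integer ids, so indexing is skipped; num_labels sets
# the length of the output vector.
field = MultiLabelField([0, 3], skip_indexing=True, num_labels=5)
tensor = field.as_tensor(field.get_padding_lengths())  # tensor([1., 0., 0., 1., 0.])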

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.list_field.ListField(field_list: List[allennlp.data.fields.field.Field])[source]

Bases: allennlp.data.fields.sequence_field.SequenceField

A ListField is a list of other fields. You would use this to represent, e.g., a list of answer options that are themselves TextFields.

This field will get converted into a tensor that has one more dimension than the items in the list. If this is a list of TextFields that have shape (num_words, num_characters), this ListField will output a tensor of shape (num_sentences, num_words, num_characters).

Parameters
field_list : List[Field]

A list of Field objects to be concatenated into a single input tensor. All of the contained Field objects must be of the same type.
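A minimal usage sketch with answer options as TextFields (names are illustrative):

from allennlp.data.fields import ListField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

indexers = {"tokens": SingleIdTokenIndexer()}
options = ListField([
    TextField([Token("red")], indexers),
    TextField([Token("bright"), Token("blue")], indexers),
])
# After indexing, this pads to shape (num_options=2, num_tokens=2): the
# one-token option is padded to the length of the longest option.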

as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

batch_tensors(self, tensor_list: List[~DataArray]) → ~DataArray[source]

Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.

This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

sequence_length(self) → int[source]

How many elements are there in this sequence?

class allennlp.data.fields.metadata_field.MetadataField(metadata: Any)[source]

Bases: allennlp.data.fields.field.Field, collections.abc.Mapping, typing.Generic

A MetadataField is a Field that does not get converted into tensors. It just carries side information that might be needed later on, for computing some third-party metric, or outputting debugging information, or whatever else you need. We use this in the BiDAF model, for instance, to keep track of question IDs and passage token offsets, so we can more easily use the official evaluation script to compute metrics.

We don’t try to do any kind of smart combination of this field for batched input - when you use this Field in a model, you’ll get a list of metadata objects, one for each instance in the batch.

Parameters
metadata : Any

Some object containing the metadata that you want to store. It’s likely that you’ll want this to be a dictionary, but it could be anything you want.
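A minimal usage sketch (the metadata contents are illustrative):

from allennlp.data.fields import MetadataField

field = MetadataField({"question_id": "q-001", "token_offsets": [(0, 4), (5, 10)]})
field.as_tensor(field.get_padding_lengths())  # returns the metadata object as-is
# In a batch, batch_tensors() simply collects these into a Python list.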

as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

batch_tensors(self, tensor_list: List[~DataArray]) → List[~DataArray][source]

Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.

This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.

empty_field(self) → 'MetadataField'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

class allennlp.data.fields.production_rule_field.ProductionRule(rule, is_global_rule, rule_id, nonterminal)[source]

Bases: tuple

property is_global_rule

Alias for field number 1

property nonterminal

Alias for field number 3

property rule

Alias for field number 0

property rule_id

Alias for field number 2

allennlp.data.fields.production_rule_field.ProductionRuleArray[source]

alias of allennlp.data.fields.production_rule_field.ProductionRule

class allennlp.data.fields.production_rule_field.ProductionRuleField(rule: str, is_global_rule: bool, vocab_namespace: str = 'rule_labels', nonterminal: str = None)[source]

Bases: allennlp.data.fields.field.Field

This Field represents a production rule from a grammar, like “S -> [NP, VP]”, “N -> John”, or “<b,c> -> [<a,<b,c>>, a]”.

We assume a few things about how these rules are formatted:

  • There is a left-hand side (LHS) and a right-hand side (RHS), where the LHS is always a non-terminal, and the RHS is either a terminal, a non-terminal, or a sequence of terminals and/or non-terminals.

  • The LHS and the RHS are joined by " -> ", and this sequence of characters appears nowhere else in the rule.

  • Non-terminal sequences in the RHS are formatted as “[NT1, NT2, …]”.

  • Some rules come from a global grammar used for a whole dataset, while other rules are specific to a particular Instance.

We don’t make use of most of these assumptions in this class, but the code that consumes this Field relies heavily on them in some places.

If the given rule is in the global grammar, we treat the rule as a vocabulary item that will get an index and (in the model) an embedding. If the rule is not in the global grammar, we do not create a vocabulary item from the rule, and don’t produce a tensor for the rule - we assume the model will handle representing this rule in some other way.

Because we represent global grammar rules and instance-specific rules differently, this Field does not lend itself well to batching its arrays, even in a sequence for a single training instance. A model using this field will have to manually batch together rule representations after splitting apart the global rules from the Instance rules.

In a model, this will get represented as a ProductionRule, which is defined above. This is a namedtuple of (rule_string, is_global_rule, [rule_id], nonterminal), where the rule_id Tensor, if present, will have shape (1,). We don’t do any batching of the Tensors, so this gets passed to Model.forward() as a List[ProductionRule]. We pass along the rule string because there isn’t another way to recover it for instance-specific rules that do not make it into the vocabulary.

Parameters
rule : str

The production rule, formatted as described above. If this field is just padding, rule will be the empty string.

is_global_rule : bool

Whether this rule comes from the global grammar or is an instance-specific production rule.

vocab_namespace : str, optional (default="rule_labels")

The vocabulary namespace to use for the global production rules. We use “rule_labels” by default, because we typically do not want padding and OOV tokens for these, and ending the namespace with “labels” means we don’t get padding and OOV tokens.

nonterminal : str, optional (default=None)

The left hand side of the rule. Sometimes having this available as a separate part of the ProductionRule can deduplicate work.
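A minimal sketch contrasting global and instance-specific rules (rule strings are illustrative):

from allennlp.data.fields import ProductionRuleField

global_rule = ProductionRuleField("S -> [NP, VP]", is_global_rule=True)
local_rule = ProductionRuleField("N -> John", is_global_rule=False)
# Only the global rule is counted into the "rule_labels" namespace and gets
# a rule_id tensor; as_tensor() returns a ProductionRule namedtuple for both.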

as_tensor(self, padding_lengths: Dict[str, int]) → allennlp.data.fields.production_rule_field.ProductionRule[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

batch_tensors(self, tensor_list: List[allennlp.data.fields.production_rule_field.ProductionRule]) → List[allennlp.data.fields.production_rule_field.ProductionRule][source]

Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.

This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.sequence_field.SequenceField[source]

Bases: allennlp.data.fields.field.Field

A SequenceField represents a sequence of things. This class just adds a method onto Field: sequence_length(). It exists so that SequenceLabelField, IndexField and other similar Fields can have a single type to require, with a consistent API, whether they are pointing to words in a TextField, items in a ListField, or something else.

empty_field(self) → 'SequenceField'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

sequence_length(self) → int[source]

How many elements are there in this sequence?

class allennlp.data.fields.sequence_label_field.SequenceLabelField(labels: Union[List[str], List[int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, label_namespace: str = 'labels')[source]

Bases: allennlp.data.fields.field.Field

A SequenceLabelField assigns a categorical label to each element in a SequenceField. Because it’s a labeling of some other field, we take that field as input here, and we use it to determine our padding and other things.

This field will get converted into a list of integer class ids, representing the correct class for each element in the sequence.

Parameters
labels : Union[List[str], List[int]]

A sequence of categorical labels, encoded as strings or integers. These could be POS tags like [NN, JJ, …], BIO tags like [B-PERS, I-PERS, O, O, …], or any other categorical tag sequence. If the labels are encoded as integers, they will not be indexed using a vocab.

sequence_field : SequenceField

A field containing the sequence that this SequenceLabelField is labeling. Most often, this is a TextField, for tagging individual tokens in a sentence.

label_namespace : str, optional (default="labels")

The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).
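A minimal NER-style usage sketch (tags and namespace are illustrative):

from allennlp.data.fields import SequenceLabelField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

tokens = TextField([Token(t) for t in "John lives in Seattle".split()],
                   {"tokens": SingleIdTokenIndexer()})
tags = SequenceLabelField(["B-PER", "O", "O", "B-LOC"], tokens,
                          label_namespace="ner_labels")
# After indexing, as_tensor() yields one integer tag id per token, padded
# to the padding length of the underlying sequence.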

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self) → 'SequenceLabelField'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A TextField represents a string of text, the kind that you might want to represent with standard word vectors, or pass through an LSTM.

class allennlp.data.fields.text_field.TextField(tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer])[source]

Bases: allennlp.data.fields.sequence_field.SequenceField

This Field represents a list of string tokens. Before constructing this object, you need to tokenize raw strings using a Tokenizer.

Because string tokens can be represented as indexed arrays in a number of ways, we also take a dictionary of TokenIndexer objects that will be used to convert the tokens into indices. Each TokenIndexer could represent each token as a single ID, or a list of character IDs, or something else.

This field will get converted into a dictionary of arrays, one for each TokenIndexer. A SingleIdTokenIndexer produces an array of shape (num_tokens,), while a TokenCharactersIndexer produces an array of shape (num_tokens, num_characters).
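A minimal sketch showing the per-indexer output (assuming 0.9-era AllenNLP):

from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
from allennlp.data.tokenizers import Token

field = TextField(
    [Token(t) for t in "the cat sat".split()],
    {"tokens": SingleIdTokenIndexer(),
     "token_characters": TokenCharactersIndexer()},
)
# After indexing, as_tensor() returns one tensor per indexer:
#   {"tokens": shape (num_tokens,),
#    "token_characters": shape (num_tokens, num_characters)}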

as_tensor(self, padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor][source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

batch_tensors(self, tensor_list: List[Dict[str, torch.Tensor]]) → Dict[str, torch.Tensor][source]

Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.

This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self)[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

The TextField has a list of Tokens, and each Token gets converted into arrays by (potentially) several TokenIndexers. This method gets the max length (over tokens) associated with each of these arrays.

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

sequence_length(self) → int[source]

How many elements are there in this sequence?

class allennlp.data.fields.adjacency_field.AdjacencyField(indices: List[Tuple[int, int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, labels: List[str] = None, label_namespace: str = 'labels', padding_value: int = -1)[source]

Bases: allennlp.data.fields.field.Field

An AdjacencyField defines directed adjacency relations between elements in a SequenceField. Because it's a labeling of some other field, we take that field as input here and use it to determine our padding and other things.

This field will get converted into an array of shape (sequence_field_length, sequence_field_length), where the (i, j)th array element is either a binary flag indicating there is an edge from i to j, or an integer label k, indicating an edge from i to j with label type k.

Parameters
indices : List[Tuple[int, int]]
sequence_field : SequenceField

A field containing the sequence that this AdjacencyField is labeling. Most often, this is a TextField, for tagging edge relations between tokens in a sentence.

labels : List[str], optional (default=None)

Optional labels for the edges of the adjacency matrix.

label_namespace : str, optional (default="labels")

The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).

padding_value : int, optional (default=-1)

The value to use as padding.
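A minimal usage sketch (the sentence and edges are illustrative):

from allennlp.data.fields import AdjacencyField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

tokens = TextField([Token(t) for t in "dogs chase cats".split()],
                   {"tokens": SingleIdTokenIndexer()})
edges = AdjacencyField([(0, 1), (2, 1)], tokens)  # directed edges 0->1, 2->1

# Shape (3, 3): 1 at each edge position, padding_value (-1) everywhere else.
tensor = edges.as_tensor(edges.get_padding_lengths())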

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

count_vocab_items(self, counter: Dict[str, Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field(self) → 'AdjacencyField'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.namespace_swapping_field.NamespaceSwappingField(source_tokens: List[allennlp.data.tokenizers.token.Token], target_namespace: str)[source]

Bases: allennlp.data.fields.field.Field

A NamespaceSwappingField is used to map tokens in one namespace to tokens in another namespace. It is used by seq2seq models with a copy mechanism that copies tokens from the source sentence into the target sentence.

Parameters
source_tokens : List[Token]

The tokens from the source sentence.

target_namespace : str

The namespace that the tokens from the source sentence will be mapped to.
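A minimal usage sketch for a copy mechanism (the namespace name is illustrative):

from allennlp.data.fields.namespace_swapping_field import NamespaceSwappingField
from allennlp.data.tokenizers import Token

source = [Token(t) for t in "the brown fox".split()]
field = NamespaceSwappingField(source, target_namespace="target_tokens")
# After index(vocab), as_tensor() holds each source token's id in the
# "target_tokens" namespace, ready to be consumed by a copy mechanism.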

as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor[source]

Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape.

Parameters
padding_lengths : Dict[str, int]

This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.

empty_field(self) → 'NamespaceSwappingField'[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths(self) → Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(self, vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object, it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.