allennlp.data.fields

A Field is some piece of a data instance that ends up as an array in a model.
- class allennlp.data.fields.field.Field
  Bases: typing.Generic
  A Field is some piece of a data instance that ends up as a tensor in a model (either as an input or an output). Data instances are just collections of fields.
  Fields go through up to two steps of processing: (1) tokenized fields are converted into token ids, and (2) fields containing token ids (or any other numeric data) are padded (if necessary) and converted into tensors. The Field API has methods around both of these steps, though they may not be needed for some concrete Field classes - if your field doesn’t have any strings that need indexing, you don’t need to implement count_vocab_items or index. These methods pass by default.
  Once a vocabulary is computed and all fields are indexed, we will determine padding lengths, then intelligently batch together instances and pad them into actual tensors.
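  To make that lifecycle concrete, here is a minimal sketch of a custom Field. The FloatField name and its behavior are hypothetical, not part of the library; the sketch assumes allennlp 0.x and torch. Because it holds no strings, it inherits the default pass implementations of count_vocab_items and index:

    import torch

    from allennlp.data.fields.field import Field


    class FloatField(Field[torch.Tensor]):
        """Hypothetical field wrapping a single float; needs no vocabulary."""

        def __init__(self, value: float) -> None:
            self.value = value

        def get_padding_lengths(self) -> dict:
            # A scalar needs no padding, so there is nothing to report.
            return {}

        def as_tensor(self, padding_lengths: dict) -> torch.Tensor:
            return torch.tensor(self.value)

        def empty_field(self) -> 'FloatField':
            return FloatField(0.0)

    # The inherited batch_tensors() stacks these per-instance tensors
    # into one batched tensor.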
- as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[~DataArray]) → ~DataArray
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'Field'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.array_field.ArrayField(array: numpy.ndarray, padding_value: int = 0, dtype: numpy.dtype = <class 'numpy.float32'>)
  Bases: allennlp.data.fields.field.Field
  A class representing an array, which could have arbitrary dimensions. A batch of these arrays is padded to the max dimension length in the batch for each dimension.
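  A short usage sketch (assuming allennlp 0.x, numpy, and torch): two arrays of different lengths are padded to a common shape and stacked, the way a data iterator would batch them.

    import numpy
    from allennlp.data.fields import ArrayField

    short = ArrayField(numpy.array([1.0, 2.0, 3.0]))
    longer = ArrayField(numpy.array([1.0, 2.0, 3.0, 4.0, 5.0]))

    # Aggregate padding lengths across the batch by taking the max per key.
    lengths = {key: max(short.get_padding_lengths()[key],
                        longer.get_padding_lengths()[key])
               for key in short.get_padding_lengths()}
    batched = short.batch_tensors([field.as_tensor(lengths)
                                   for field in (short, longer)])
    # batched has shape (2, 5); the first row ends with two padding_value zeros.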
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- class allennlp.data.fields.index_field.IndexField(index: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField)
  Bases: allennlp.data.fields.field.Field
  An IndexField is an index into a SequenceField, as might be used for representing a correct answer option in a list, or a span begin or end position in a passage, for example. Because it’s an index into a SequenceField, we take one of those as input and use it to compute padding lengths.
  Parameters:
  - index : int
    The index of the answer in the SequenceField. This is typically the “correct answer” in some classification decision over the sequence, like where an answer span starts in SQuAD, or which answer option is correct in a multiple choice question. A value of -1 means there is no label, which can be used for padding or other purposes.
  - sequence_field : SequenceField
    A field containing the sequence that this IndexField is a pointer into.
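  A brief sketch (assuming allennlp 0.x) of pointing an IndexField at one token of a TextField:

    from allennlp.data.fields import IndexField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    passage = TextField([Token(t) for t in "the cat sat down".split()],
                        {"tokens": SingleIdTokenIndexer()})
    answer_start = IndexField(2, passage)  # points at "sat"
    tensor = answer_start.as_tensor(answer_start.get_padding_lengths())
    # tensor is a LongTensor holding the index 2.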
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- class allennlp.data.fields.span_field.SpanField(span_start: int, span_end: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField)
  Bases: allennlp.data.fields.field.Field
  A SpanField is a pair of inclusive, zero-indexed (start, end) indices into a SequenceField, used to represent a span of text. Because it’s a pair of indices into a SequenceField, we take one of those as input to make the span’s dependence explicit and to validate that the span is well defined.
  Parameters:
  - span_start : int, required
    The index of the start of the span in the SequenceField.
  - span_end : int, required
    The inclusive index of the end of the span in the SequenceField.
  - sequence_field : SequenceField, required
    A field containing the sequence that this SpanField is a span inside.
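  A brief sketch (assuming allennlp 0.x) of a SpanField over a TextField; both endpoints are inclusive:

    from allennlp.data.fields import SpanField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    passage = TextField([Token(t) for t in "the cat sat down".split()],
                        {"tokens": SingleIdTokenIndexer()})
    span = SpanField(1, 2, passage)  # covers "cat sat"
    tensor = span.as_tensor(span.get_padding_lengths())
    # tensor is a LongTensor holding the pair [1, 2].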
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
KnowledgeGraphField is a Field which stores a knowledge graph representation.
- class allennlp.data.fields.knowledge_graph_field.KnowledgeGraphField(knowledge_graph: allennlp.semparse.contexts.knowledge_graph.KnowledgeGraph, utterance_tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, feature_extractors: List[str] = None, entity_tokens: List[List[allennlp.data.tokenizers.token.Token]] = None, linking_features: List[List[List[float]]] = None, include_in_vocab: bool = True, max_table_tokens: int = None)
  Bases: allennlp.data.fields.field.Field
  A KnowledgeGraphField represents a KnowledgeGraph as a Field that can be used in a Model. For each entity in the graph, we output two things: a text representation of the entity, handled identically to a TextField, and a list of linking features for each token in some input utterance.
  The output of this field is a dictionary:

    {
      "text": Dict[str, torch.Tensor],  # each tensor has shape (batch_size, num_entities, num_entity_tokens)
      "linking": torch.Tensor           # shape (batch_size, num_entities, num_utterance_tokens, num_features)
    }

  The text component of this dictionary is suitable to be passed into a TextFieldEmbedder (which handles the additional num_entities dimension without any issues). The linking component of the dictionary can be used however you want to decide which tokens in the utterance correspond to which entities in the knowledge graph.
  In order to create the text component, we use the same dictionary of TokenIndexers that’s used in a TextField (as we’re just representing the text corresponding to each entity). For the linking component, we use a set of hard-coded feature extractors that operate between the text corresponding to each entity and each token in the utterance.
  Parameters:
  - knowledge_graph : KnowledgeGraph
    The knowledge graph that this field stores.
  - utterance_tokens : List[Token]
    The tokens in some utterance that is paired with the KnowledgeGraph. We compute a set of features for linking tokens in the utterance to entities in the graph.
  - tokenizer : Tokenizer, optional (default=WordTokenizer())
    We’ll use this Tokenizer to tokenize the text representation of each entity.
  - token_indexers : Dict[str, TokenIndexer]
    Token indexers that convert entities into arrays, similar to how text tokens are treated in a TextField. These might operate on the name of the entity itself, its type, its neighbors in the graph, etc.
  - feature_extractors : List[str], optional
    Names of feature extractors to use for computing linking features. These must be attributes of this object, without the first underscore. The feature extraction functions are listed as the last methods in this class. For example, to use _exact_token_match(), you would pass the string exact_token_match. We will add an underscore and look for a function matching that name. If this list is omitted, we will use all available feature functions.
  - entity_tokens : List[List[Token]], optional
    If you have pre-computed the tokenization of the table text, you can pass it in here. This must be a list of the tokens in the entity text, for each entity in the knowledge graph, in the same order in which the knowledge graph returns entities.
  - linking_features : List[List[List[float]]], optional
    If you have pre-computed the linking features between the utterance and the table text, you can pass them in here.
  - include_in_vocab : bool, optional (default=True)
    If this is False, we will skip the count_vocab_items logic, leaving out all table entity text from the vocabulary computation. You might want to do this if you have a lot of rare entities in your tables, and you see the same table in multiple training instances, so your vocabulary counts get skewed and include too many rare entities.
  - max_table_tokens : int, optional
    If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.
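  A rough construction sketch (assuming allennlp 0.x with the semparse package installed; the entity name here is made up). The field computes linking features between the utterance tokens and each entity’s text:

    from allennlp.data.fields import KnowledgeGraphField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.semparse.contexts.knowledge_graph import KnowledgeGraph

    graph = KnowledgeGraph(entities={"fb:cell.2010"},
                           neighbors={"fb:cell.2010": []},
                           entity_text={"fb:cell.2010": "2010"})
    utterance = [Token(t) for t in "what happened in 2010".split()]
    field = KnowledgeGraphField(graph, utterance,
                                token_indexers={"tokens": SingleIdTokenIndexer()})
    # After indexing against a Vocabulary, as_tensor() produces the
    # {"text": ..., "linking": ...} dictionary described above.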
- as_tensor(self, padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[Dict[str, torch.Tensor]]) → Dict[str, torch.Tensor]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'KnowledgeGraphField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.label_field.LabelField(label: Union[str, int], label_namespace: str = 'labels', skip_indexing: bool = False)
  Bases: allennlp.data.fields.field.Field
  A LabelField is a categorical label of some kind, where the labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.
  This field will get converted into an integer index representing the class label.
  Parameters:
  - label : Union[str, int]
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, …), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).
  - skip_indexing : bool, optional (default=False)
    If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.
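  A brief end-to-end sketch (assuming allennlp 0.x) of the count → index → tensorize lifecycle for a string label:

    from collections import defaultdict

    from allennlp.data.fields import LabelField
    from allennlp.data.vocabulary import Vocabulary

    field = LabelField("entailment", label_namespace="labels")

    # The counter is namespace -> token -> count, as count_vocab_items expects.
    counter = defaultdict(lambda: defaultdict(int))
    field.count_vocab_items(counter)

    vocab = Vocabulary(counter)
    field.index(vocab)
    tensor = field.as_tensor(field.get_padding_lengths())
    # tensor is a LongTensor holding the integer id of "entailment".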
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.multilabel_field.MultiLabelField(labels: Sequence[Union[str, int]], label_namespace: str = 'labels', skip_indexing: bool = False, num_labels: Optional[int] = None)
  Bases: allennlp.data.fields.field.Field
  A MultiLabelField is an extension of the LabelField that allows for multiple labels. It is particularly useful in multi-label classification where more than one label can be correct. As with the LabelField, labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.
  This field will get converted into a vector of length equal to the vocabulary size, with ones at the positions of the labels and zeros everywhere else.
  Parameters:
  - labels : Sequence[Union[str, int]]
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, …), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).
  - skip_indexing : bool, optional (default=False)
    If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.
  - num_labels : int, optional (default=None)
    If skip_indexing=True, the total number of possible labels should be provided, which is required to decide the size of the output tensor. num_labels should equal the largest label id + 1. If skip_indexing=False, num_labels is not required.
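  A brief sketch (assuming allennlp 0.x) using already-indexed integer labels; num_labels fixes the length of the multi-hot output vector:

    from allennlp.data.fields import MultiLabelField

    field = MultiLabelField([0, 3], skip_indexing=True, num_labels=5)
    tensor = field.as_tensor(field.get_padding_lengths())
    # tensor is [1., 0., 0., 1., 0.]: ones at the label positions, zeros elsewhere.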
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.list_field.ListField(field_list: List[allennlp.data.fields.field.Field])
  Bases: allennlp.data.fields.sequence_field.SequenceField
  A ListField is a list of other fields. You would use this to represent, e.g., a list of answer options that are themselves TextFields.
  This field will get converted into a tensor that has one more dimension than the items in the list. If this is a list of TextFields that have shape (num_words, num_characters), this ListField will output a tensor of shape (num_sentences, num_words, num_characters).
  Parameters:
  - field_list : List[Field]
    A list of Field objects to be concatenated into a single input tensor. All of the contained Field objects must be of the same type.
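  A brief sketch (assuming allennlp 0.x): two answer-option lists of different lengths; the shorter one gets padded with empty fields when batched.

    from allennlp.data.fields import LabelField, ListField

    three_options = ListField([LabelField(o) for o in ("yes", "no", "maybe")])
    two_options = ListField([LabelField(o) for o in ("yes", "no")])
    # When these two instances are batched, two_options is padded to length 3
    # using LabelField.empty_field(), and after indexing the batch becomes a
    # (batch_size, num_options) tensor.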
- as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[~DataArray]) → ~DataArray
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.metadata_field.MetadataField(metadata: Any)
  Bases: allennlp.data.fields.field.Field, collections.abc.Mapping, typing.Generic
  A MetadataField is a Field that does not get converted into tensors. It just carries side information that might be needed later on, for computing some third-party metric, or outputting debugging information, or whatever else you need. We use this in the BiDAF model, for instance, to keep track of question IDs and passage token offsets, so we can more easily use the official evaluation script to compute metrics.
  We don’t try to do any kind of smart combination of this field for batched input - when you use this Field in a model, you’ll get a list of metadata objects, one for each instance in the batch.
  Parameters:
  - metadata : Any
    Some object containing the metadata that you want to store. It’s likely that you’ll want this to be a dictionary, but it could be anything you want.
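  A brief sketch (assuming allennlp 0.x; the dictionary keys are made up). Nothing is tensorized; batching just collects the per-instance objects into a list:

    from allennlp.data.fields import MetadataField

    meta = MetadataField({"question_id": "q_42", "token_offsets": [(0, 3), (4, 7)]})
    value = meta.as_tensor(meta.get_padding_lengths())  # the stored dict itself
    batch = meta.batch_tensors([value])                 # a list, one entry per instance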
- as_tensor(self, padding_lengths: Dict[str, int]) → ~DataArray
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[~DataArray]) → List[~DataArray]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- empty_field(self) → 'MetadataField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- class allennlp.data.fields.production_rule_field.ProductionRule(rule, is_global_rule, rule_id, nonterminal)
  Bases: tuple
  - property rule
    Alias for field number 0
  - property is_global_rule
    Alias for field number 1
  - property rule_id
    Alias for field number 2
  - property nonterminal
    Alias for field number 3
- allennlp.data.fields.production_rule_field.ProductionRuleArray
  Alias of allennlp.data.fields.production_rule_field.ProductionRule.
- class allennlp.data.fields.production_rule_field.ProductionRuleField(rule: str, is_global_rule: bool, vocab_namespace: str = 'rule_labels', nonterminal: str = None)
  Bases: allennlp.data.fields.field.Field
  This Field represents a production rule from a grammar, like “S -> [NP, VP]”, “N -> John”, or “<b,c> -> [<a,<b,c>>, a]”.
  We assume a few things about how these rules are formatted:
  - There is a left-hand side (LHS) and a right-hand side (RHS), where the LHS is always a non-terminal, and the RHS is either a terminal, a non-terminal, or a sequence of terminals and/or non-terminals.
  - The LHS and the RHS are joined by “ -> ”, and this sequence of characters appears nowhere else in the rule.
  - Non-terminal sequences in the RHS are formatted as “[NT1, NT2, …]”.
  - Some rules come from a global grammar used for a whole dataset, while other rules are specific to a particular Instance.
  We don’t make use of most of these assumptions in this class, but the code that consumes this Field relies heavily on them in some places.
  If the given rule is in the global grammar, we treat the rule as a vocabulary item that will get an index and (in the model) an embedding. If the rule is not in the global grammar, we do not create a vocabulary item from the rule, and don’t produce a tensor for the rule - we assume the model will handle representing this rule in some other way.
  Because we represent global grammar rules and instance-specific rules differently, this Field does not lend itself well to batching its arrays, even in a sequence for a single training instance. A model using this field will have to manually batch together rule representations after splitting apart the global rules from the Instance rules.
  In a model, this will get represented as a ProductionRule, which is defined above. This is a namedtuple of (rule_string, is_global_rule, [rule_id], nonterminal), where the rule_id Tensor, if present, will have shape (1,). We don’t do any batching of the Tensors, so this gets passed to Model.forward() as a List[ProductionRule]. We pass along the rule string because there isn’t another way to recover it for instance-specific rules that do not make it into the vocabulary.
  Parameters:
  - rule : str
    The production rule, formatted as described above. If this field is just padding, rule will be the empty string.
  - is_global_rule : bool
    Whether this rule comes from the global grammar or is an instance-specific production rule.
  - vocab_namespace : str, optional (default="rule_labels")
    The vocabulary namespace to use for the global production rules. We use “rule_labels” by default, because we typically do not want padding and OOV tokens for these, and ending the namespace with “labels” means we don’t get padding and OOV tokens.
  - nonterminal : str, optional (default=None)
    The left-hand side of the rule. Sometimes having this as a separate part of the ProductionRule can deduplicate work.
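  A brief sketch (assuming allennlp 0.x) for a global rule; the field produces a ProductionRule namedtuple rather than a plain tensor:

    from collections import defaultdict

    from allennlp.data.fields import ProductionRuleField
    from allennlp.data.vocabulary import Vocabulary

    field = ProductionRuleField("S -> [NP, VP]", is_global_rule=True)

    counter = defaultdict(lambda: defaultdict(int))
    field.count_vocab_items(counter)          # counts the rule in "rule_labels"
    vocab = Vocabulary(counter)
    field.index(vocab)

    rule = field.as_tensor(field.get_padding_lengths())
    # rule.rule == "S -> [NP, VP]"; rule.is_global_rule is True; rule.rule_id
    # holds the rule's vocabulary id (instance-specific rules get None instead).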
- as_tensor(self, padding_lengths: Dict[str, int]) → allennlp.data.fields.production_rule_field.ProductionRule
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[allennlp.data.fields.production_rule_field.ProductionRule]) → List[allennlp.data.fields.production_rule_field.ProductionRule]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.sequence_field.SequenceField
  Bases: allennlp.data.fields.field.Field
  A SequenceField represents a sequence of things. This class just adds a method onto Field: sequence_length(). It exists so that SequenceLabelField, IndexField, and other similar Fields can have a single type to require, with a consistent API, whether they are pointing to words in a TextField, items in a ListField, or something else.
  - empty_field(self) → 'SequenceField'
    So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
    We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- class allennlp.data.fields.sequence_label_field.SequenceLabelField(labels: Union[List[str], List[int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, label_namespace: str = 'labels')
  Bases: allennlp.data.fields.field.Field
  A SequenceLabelField assigns a categorical label to each element in a SequenceField. Because it’s a labeling of some other field, we take that field as input here, and we use it to determine our padding and other things.
  This field will get converted into a list of integer class ids, representing the correct class for each element in the sequence.
  Parameters:
  - labels : Union[List[str], List[int]]
    A sequence of categorical labels, encoded as strings or integers. These could be POS tags like [NN, JJ, …], BIO tags like [B-PERS, I-PERS, O, O, …], or any other categorical tag sequence. If the labels are encoded as integers, they will not be indexed using a vocab.
  - sequence_field : SequenceField
    A field containing the sequence that this SequenceLabelField is labeling. Most often, this is a TextField, for tagging individual tokens in a sentence.
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).
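  A brief sketch (assuming allennlp 0.x) tagging each token of a TextField; the label list must match the sequence length:

    from allennlp.data.fields import SequenceLabelField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    sentence = TextField([Token(t) for t in "cats chase mice".split()],
                         {"tokens": SingleIdTokenIndexer()})
    tags = SequenceLabelField(["NNS", "VBP", "NNS"], sentence,
                              label_namespace="pos_labels")
    # After count_vocab_items/index with a Vocabulary, as_tensor yields one
    # integer tag id per token, padded to the batch's max sequence length.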
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'SequenceLabelField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
A TextField represents a string of text, the kind that you might want to represent with standard word vectors, or pass through an LSTM.
- class allennlp.data.fields.text_field.TextField(tokens: List[allennlp.data.tokenizers.token.Token], token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer])
  Bases: allennlp.data.fields.sequence_field.SequenceField
  This Field represents a list of string tokens. Before constructing this object, you need to tokenize raw strings using a Tokenizer.
  Because string tokens can be represented as indexed arrays in a number of ways, we also take a dictionary of TokenIndexer objects that will be used to convert the tokens into indices. Each TokenIndexer could represent each token as a single ID, or a list of character IDs, or something else.
  This field will get converted into a dictionary of arrays, one for each TokenIndexer. A SingleIdTokenIndexer produces an array of shape (num_tokens,), while a TokenCharactersIndexer produces an array of shape (num_tokens, num_characters).
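  A brief end-to-end sketch (assuming allennlp 0.x) of counting, indexing, and tensorizing a TextField with a single-id indexer:

    from collections import defaultdict

    from allennlp.data.fields import TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    field = TextField([Token(t) for t in "the cat sat".split()],
                      {"tokens": SingleIdTokenIndexer()})

    counter = defaultdict(lambda: defaultdict(int))
    field.count_vocab_items(counter)
    vocab = Vocabulary(counter)
    field.index(vocab)

    tensors = field.as_tensor(field.get_padding_lengths())
    # tensors["tokens"] is a LongTensor of shape (3,) with one id per token.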
- as_tensor(self, padding_lengths: Dict[str, int]) → Dict[str, torch.Tensor]
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- batch_tensors(self, tensor_list: List[Dict[str, torch.Tensor]]) → Dict[str, torch.Tensor]
  Takes the output of Field.as_tensor() from a list of Instances and merges it into one batched tensor for this Field. The default implementation here in the base class handles cases where as_tensor returns a single torch tensor per instance. If your subclass returns something other than this, you need to override this method.
  This operation does not modify self, but in some cases we need the information contained in self in order to perform the batching, so this is an instance method, not a class method.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self)
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  The TextField has a list of Tokens, and each Token gets converted into arrays by (potentially) several TokenIndexers. This method gets the max length (over tokens) associated with each of these arrays.
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.adjacency_field.AdjacencyField(indices: List[Tuple[int, int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, labels: List[str] = None, label_namespace: str = 'labels', padding_value: int = -1)
  Bases: allennlp.data.fields.field.Field
  An AdjacencyField defines directed adjacency relations between elements in a SequenceField. Because it’s a labeling of some other field, we take that field as input here and use it to determine our padding and other things.
  This field will get converted into an array of shape (sequence_field_length, sequence_field_length), where the (i, j)th array element is either a binary flag indicating that there is an edge from i to j, or an integer label k, indicating that there is an edge of type k from i to j.
  Parameters:
  - indices : List[Tuple[int, int]]
  - sequence_field : SequenceField
    A field containing the sequence that this AdjacencyField is labeling. Most often, this is a TextField, for tagging edge relations between tokens in a sentence.
  - labels : List[str], optional (default=None)
    Optional labels for the edges of the adjacency matrix.
  - label_namespace : str, optional (default="labels")
    The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).
  - padding_value : int, optional (default=-1)
    The value to use as padding.
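  A brief sketch (assuming allennlp 0.x) of an unlabeled adjacency matrix over a three-token sentence:

    from allennlp.data.fields import AdjacencyField, TextField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    sentence = TextField([Token(t) for t in "the cat sat".split()],
                         {"tokens": SingleIdTokenIndexer()})
    edges = AdjacencyField([(0, 1), (2, 0)], sentence)
    tensor = edges.as_tensor(edges.get_padding_lengths())
    # A (3, 3) matrix: 1 at positions (0, 1) and (2, 0), and the
    # padding_value (-1) everywhere else.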
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- count_vocab_items(self, counter: Dict[str, Dict[str, int]])
  If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
  A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).
  Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.
  Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
- empty_field(self) → 'AdjacencyField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.
- class allennlp.data.fields.namespace_swapping_field.NamespaceSwappingField(source_tokens: List[allennlp.data.tokenizers.token.Token], target_namespace: str)
  Bases: allennlp.data.fields.field.Field
  A NamespaceSwappingField is used to map tokens in one namespace to tokens in another namespace. It is used by seq2seq models with a copy mechanism that copies tokens from the source sentence into the target sentence.
  Parameters:
  - source_tokens : List[Token]
    The tokens from the source sentence.
  - target_namespace : str
    The namespace that the tokens from the source sentence will be mapped to.
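  A brief sketch (assuming allennlp 0.x; the namespace name is arbitrary) mapping each source token to its id in the decoder's target namespace:

    from allennlp.data.fields import NamespaceSwappingField
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    for word in ("the", "cat", "sat"):
        vocab.add_token_to_namespace(word, namespace="target_tokens")

    source = [Token(t) for t in ("the", "cat", "sat")]
    field = NamespaceSwappingField(source, target_namespace="target_tokens")
    field.index(vocab)
    ids = field.as_tensor(field.get_padding_lengths())
    # ids is a LongTensor with one target-namespace id per source token.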
- as_tensor(self, padding_lengths: Dict[str, int]) → torch.Tensor
  Given a set of specified padding lengths, actually pad the data in this field and return a torch Tensor (or a more complex data structure) of the correct shape. We also take a couple of parameters that are important when constructing torch Tensors.
  Parameters:
  - padding_lengths : Dict[str, int]
    This dictionary will have the same keys that were produced in get_padding_lengths(). The values specify the lengths to use when padding each relevant dimension, aggregated across all instances in a batch.
- empty_field(self) → 'NamespaceSwappingField'
  So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we are at the point of calling as_tensor(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.
  We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).
- get_padding_lengths(self) → Dict[str, int]
  If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.
  This is always called after index().
- index(self, vocab: allennlp.data.vocabulary.Vocabulary)
  Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object; it does not return anything.
  If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.