
An attention module that computes the similarity between an input vector and the rows of a matrix.

class allennlp.modules.attention.attention.Attention(normalize: bool = True)[source]

Bases: torch.nn.modules.module.Module, allennlp.common.registrable.Registrable

An Attention takes two inputs: a (batched) vector and a matrix, plus an optional mask on the rows of the matrix. We compute the similarity between the vector and each row in the matrix, and then (optionally) perform a softmax over rows using those computed similarities.


Inputs:

  • vector: shape (batch_size, embedding_dim)

  • matrix: shape (batch_size, num_rows, embedding_dim)

  • matrix_mask: shape (batch_size, num_rows), specifying which rows are just padding.

Output:

  • attention: shape (batch_size, num_rows).

normalize : bool, optional (default: True)

If true, we normalize the computed similarities with a softmax, to return a probability distribution for your attention. If false, this is just computing a similarity score.
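The input and output shapes above can be illustrated with a small sketch in plain PyTorch. It is not the AllenNLP implementation: the function name `attend` is hypothetical, dot-product similarity stands in for whatever a subclass computes, the mask is assumed to hold 1 for real rows and 0 for padding, and padding is handled by filling scores with negative infinity before the softmax.

```python
import torch


def attend(vector, matrix, matrix_mask=None, normalize=True):
    """Hypothetical helper: similarity scores, then an optional masked softmax over rows."""
    # vector: (batch_size, embedding_dim); matrix: (batch_size, num_rows, embedding_dim)
    # Dot-product similarity is a stand-in for the subclass-specific computation.
    scores = torch.bmm(matrix, vector.unsqueeze(-1)).squeeze(-1)  # (batch_size, num_rows)
    if not normalize:
        return scores
    if matrix_mask is not None:
        # Assumed convention: 1 marks a real row, 0 marks padding.
        scores = scores.masked_fill(~matrix_mask.bool(), float("-inf"))
    return torch.softmax(scores, dim=-1)


torch.manual_seed(0)
vector = torch.randn(2, 4)
matrix = torch.randn(2, 3, 4)
matrix_mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
attention = attend(vector, matrix, matrix_mask)  # shape (2, 3)
```

With this mask, the padded third row of the first batch element receives zero weight and each row of `attention` sums to one. A row whose entries are all masked would produce NaNs in this sketch, which a production implementation needs to guard against.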

forward(self, vector: torch.Tensor, matrix: torch.Tensor, matrix_mask: torch.Tensor = None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.


Although the recipe for the forward pass must be defined within this function, you should call the Module instance itself rather than forward() directly: calling the instance runs the registered hooks, while calling forward() silently skips them.
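The difference between calling the instance and calling forward() can be seen with a forward hook. The `Doubler` module below is a made-up example rather than part of AllenNLP; `register_forward_hook` is the standard `torch.nn.Module` method.

```python
import torch


class Doubler(torch.nn.Module):
    """Toy module used only to demonstrate hook behaviour."""

    def forward(self, x):
        return 2 * x


calls = []
module = Doubler()
# The hook records every output produced through module(...).
module.register_forward_hook(lambda mod, inputs, output: calls.append(output))

x = torch.ones(3)
module(x)          # goes through __call__, so the hook runs
hook_runs_after_call = len(calls)     # 1
module.forward(x)  # bypasses __call__, so the hook is skipped
hook_runs_after_forward = len(calls)  # still 1
```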

class allennlp.modules.attention.bilinear_attention.BilinearAttention(vector_dim: int, matrix_dim: int, activation: allennlp.nn.activations.Activation = None, normalize: bool = True)[source]

Bases: allennlp.modules.attention.attention.Attention

Computes attention between a vector and a matrix using a bilinear attention function. This function has a matrix of weights W and a bias b, and the similarity between the vector x and the matrix y is computed as x^T W y + b.


vector_dim : int

The dimension of the vector, x, described above. This is x.size()[-1], the length of the vector that will go into the similarity computation. We need this so we can build the weight matrix correctly.


matrix_dim : int

The dimension of the matrix, y, described above. This is y.size()[-1], the length of each matrix row that will go into the similarity computation. We need this so we can build the weight matrix correctly.

activation : Activation, optional (default: linear, i.e. no activation)

An activation function applied after the x^T W y + b calculation. Default is no activation.

normalize : bool, optional (default: True)

If true, we normalize the computed similarities with a softmax, to return a probability distribution for your attention. If false, this is just computing a similarity score.
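As a rough illustration of the x^T W y + b computation, the following plain-PyTorch sketch scores every row of the matrix against the vector. The class name `BilinearSketch`, the Xavier initialisation, and the omission of the activation and mask are choices made for this example, not a description of the AllenNLP source.

```python
import torch


class BilinearSketch(torch.nn.Module):
    """Illustrative bilinear attention: score = x^T W y + b for each row y."""

    def __init__(self, vector_dim, matrix_dim, normalize=True):
        super().__init__()
        # W has one row per vector feature and one column per matrix feature.
        self.weight = torch.nn.Parameter(torch.empty(vector_dim, matrix_dim))
        self.bias = torch.nn.Parameter(torch.zeros(1))
        torch.nn.init.xavier_uniform_(self.weight)
        self.normalize = normalize

    def forward(self, vector, matrix):
        # (batch, 1, vector_dim) @ (vector_dim, matrix_dim) -> (batch, 1, matrix_dim)
        projected = vector.unsqueeze(1) @ self.weight
        # (batch, 1, matrix_dim) @ (batch, matrix_dim, num_rows) -> (batch, num_rows)
        scores = (projected @ matrix.transpose(1, 2)).squeeze(1) + self.bias
        return torch.softmax(scores, dim=-1) if self.normalize else scores


torch.manual_seed(0)
attention_module = BilinearSketch(vector_dim=4, matrix_dim=6)
vector = torch.randn(2, 4)
matrix = torch.randn(2, 3, 6)
attention = attention_module(vector, matrix)  # shape (2, 3)
```

Because W is rectangular, the vector and the matrix rows may have different dimensions, which is why both vector_dim and matrix_dim are needed to build it.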

class allennlp.modules.attention.additive_attention.AdditiveAttention(vector_dim: int, matrix_dim: int, normalize: bool = True)[source]

Bases: allennlp.modules.attention.attention.Attention

Computes attention between a vector and a matrix using an additive attention function. This function has two matrices W, U and a vector V. The similarity between the vector x and the matrix y is computed as V tanh(Wx + Uy).

This attention is often referred to as concat or additive attention. It was introduced by Bahdanau et al.


vector_dim : int

The dimension of the vector, x, described above. This is x.size()[-1], the length of the vector that will go into the similarity computation. We need this so we can build the weight matrix correctly.


matrix_dim : int

The dimension of the matrix, y, described above. This is y.size()[-1], the length of each matrix row that will go into the similarity computation. We need this so we can build the weight matrix correctly.

normalize : bool, optional (default: True)

If true, we normalize the computed similarities with a softmax, to return a probability distribution for your attention. If false, this is just computing a similarity score.
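The V tanh(Wx + Uy) form can be sketched in plain PyTorch as follows. The class name `AdditiveSketch` is hypothetical, and projecting both inputs to a hidden size equal to vector_dim is an assumption of this example; masking is left out for brevity.

```python
import torch


class AdditiveSketch(torch.nn.Module):
    """Illustrative additive attention: score = V tanh(W x + U y) for each row y."""

    def __init__(self, vector_dim, matrix_dim, normalize=True):
        super().__init__()
        # Assumption: both inputs are projected to a hidden size of vector_dim.
        self.W = torch.nn.Linear(vector_dim, vector_dim, bias=False)
        self.U = torch.nn.Linear(matrix_dim, vector_dim, bias=False)
        self.V = torch.nn.Linear(vector_dim, 1, bias=False)
        self.normalize = normalize

    def forward(self, vector, matrix):
        # W x is broadcast across rows: (batch, 1, hidden) + (batch, num_rows, hidden)
        hidden = torch.tanh(self.W(vector).unsqueeze(1) + self.U(matrix))
        scores = self.V(hidden).squeeze(-1)  # (batch, num_rows)
        return torch.softmax(scores, dim=-1) if self.normalize else scores


torch.manual_seed(0)
attention_module = AdditiveSketch(vector_dim=4, matrix_dim=6)
vector = torch.randn(2, 4)
matrix = torch.randn(2, 3, 6)
attention = attention_module(vector, matrix)  # shape (2, 3)
```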

class allennlp.modules.attention.cosine_attention.CosineAttention(normalize: bool = True)[source]

Bases: allennlp.modules.attention.attention.Attention

Computes attention between a vector and a matrix using cosine similarity.
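A minimal sketch of the cosine scoring step, assuming both inputs are L2-normalized along the last dimension before a batched dot product. The function name `cosine_scores` and the small `eps` guard against zero-length vectors are choices made for this example.

```python
import torch


def cosine_scores(vector, matrix, eps=1e-13):
    """Hypothetical helper: cosine similarity between a vector and every matrix row."""
    # Scale both inputs to unit length; eps avoids division by zero.
    unit_vector = vector / (vector.norm(p=2, dim=-1, keepdim=True) + eps)
    unit_matrix = matrix / (matrix.norm(p=2, dim=-1, keepdim=True) + eps)
    # (batch, num_rows, dim) @ (batch, dim, 1) -> (batch, num_rows)
    return torch.bmm(unit_matrix, unit_vector.unsqueeze(-1)).squeeze(-1)


vector = torch.tensor([[3.0, 4.0]])
matrix = torch.tensor([[[3.0, 4.0], [-3.0, -4.0], [4.0, -3.0]]])
scores = cosine_scores(vector, matrix)  # approximately [[1., -1., 0.]]
attention = torch.softmax(scores, dim=-1)
```

Since cosine similarity is bounded in [-1, 1], the softmax over these scores is never very peaked; the three rows here are an identical, an opposite, and an orthogonal direction.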

class allennlp.modules.attention.dot_product_attention.DotProductAttention(normalize: bool = True)[source]

Bases: allennlp.modules.attention.attention.Attention

Computes attention between a vector and a matrix using dot product.
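The dot-product variant reduces to a single batched matrix-vector product. The sketch below uses a hypothetical function name and omits masking; it is meant only to show the shapes involved.

```python
import torch


def dot_product_attention(vector, matrix, normalize=True):
    """Hypothetical helper: dot product of the vector with every matrix row."""
    # (batch, num_rows, dim) @ (batch, dim, 1) -> (batch, num_rows)
    scores = torch.bmm(matrix, vector.unsqueeze(-1)).squeeze(-1)
    return torch.softmax(scores, dim=-1) if normalize else scores


vector = torch.tensor([[1.0, 2.0]])
matrix = torch.tensor([[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]])
scores = dot_product_attention(vector, matrix, normalize=False)  # [[1., 2., 3.]]
attention = dot_product_attention(vector, matrix)
```

Unlike cosine similarity, the dot product grows with the magnitude of the inputs, so the resulting distribution sharpens as the vectors get longer.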

class allennlp.modules.attention.legacy_attention.LegacyAttention(similarity_function: allennlp.modules.similarity_functions.similarity_function.SimilarityFunction = None, normalize: bool = True)[source]

Bases: allennlp.modules.attention.attention.Attention

Computes attention between a vector and a matrix using a similarity function. This should be considered deprecated, as it consumes more memory than the specialized attention modules.

class allennlp.modules.attention.linear_attention.LinearAttention(tensor_1_dim: int, tensor_2_dim: int, combination: str = 'x, y', activation: allennlp.nn.activations.Activation = None, normalize: bool = True)[source]

Bases: allennlp.modules.attention.attention.Attention

This Attention module performs a dot product between a vector of weights and some combination of the two input vectors, followed by an (optional) activation function. The combination used is configurable.

If the two vectors are x and y, we allow the following kinds of combinations: x, y, x*y, x+y, x-y, x/y, where each of those binary operations is performed elementwise. You can list as many combinations as you want, comma separated. For example, you might give x,y,x*y as the combination parameter to this class. The computed similarity function would then be w^T [x; y; x*y] + b, where w is a vector of weights, b is a bias parameter, and [;] is vector concatenation.

Note that if you want a bilinear similarity function with a diagonal weight matrix W, where the similarity function is computed as x * w * y + b (with w the diagonal of W), you can accomplish that with this class by using “x*y” for combination.
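The equivalence in the note above can be checked numerically: with the x*y combination the score is w^T (x*y) + b, which equals x^T diag(w) y + b. The tensors below are random values used only for this check.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5)
y = torch.randn(5)
w = torch.randn(5)   # plays the role of the diagonal of W
b = torch.randn(())

# Linear form over the elementwise product, as with combination "x*y".
linear_form = torch.dot(w, x * y) + b
# Bilinear form with an explicit diagonal weight matrix.
bilinear_form = x @ torch.diag(w) @ y + b

same = torch.allclose(linear_form, bilinear_form, atol=1e-6)
```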


tensor_1_dim : int

The dimension of the first tensor, x, described above. This is x.size()[-1], the length of the vector that will go into the similarity computation. We need this so we can build weight vectors correctly.


tensor_2_dim : int

The dimension of the second tensor, y, described above. This is y.size()[-1], the length of the vector that will go into the similarity computation. We need this so we can build weight vectors correctly.

combination : str, optional (default: "x,y")

Described above.

activation : Activation, optional (default: linear, i.e. no activation)

An activation function applied after the w^T * [x;y] + b calculation. Default is no activation.
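To make the combination parameter concrete, here is a plain-PyTorch sketch that parses the comma-separated string, concatenates the requested pieces, and applies w^T [...] + b. The names `combine` and `LinearSketch` are hypothetical, the weight initialisation is arbitrary, the activation and mask are omitted, and the sketch materializes the concatenated tensor, which a memory-conscious implementation may avoid.

```python
import torch

# Elementwise combinations named in the description above.
_COMBINATIONS = {
    "x": lambda x, y: x,
    "y": lambda x, y: y,
    "x*y": lambda x, y: x * y,
    "x+y": lambda x, y: x + y,
    "x-y": lambda x, y: x - y,
    "x/y": lambda x, y: x / y,
}


def combine(combination, x, y):
    """Concatenate the requested pieces along the last dimension."""
    pieces = [_COMBINATIONS[name.strip()](x, y) for name in combination.split(",")]
    return torch.cat(pieces, dim=-1)


class LinearSketch(torch.nn.Module):
    """Illustrative linear attention: score = w^T [combined] + b for each row."""

    def __init__(self, tensor_1_dim, tensor_2_dim, combination="x,y", normalize=True):
        super().__init__()
        self.combination = combination
        # Probe with dummy tensors to find the length of the combined vector.
        probe = combine(combination, torch.zeros(tensor_1_dim), torch.ones(tensor_2_dim))
        self.weight = torch.nn.Parameter(torch.randn(probe.size(-1)) * 0.1)
        self.bias = torch.nn.Parameter(torch.zeros(1))
        self.normalize = normalize

    def forward(self, vector, matrix):
        # Repeat the vector once per matrix row: (batch, num_rows, tensor_1_dim).
        x = vector.unsqueeze(1).expand(-1, matrix.size(1), -1)
        combined = combine(self.combination, x, matrix)
        scores = combined @ self.weight + self.bias  # (batch, num_rows)
        return torch.softmax(scores, dim=-1) if self.normalize else scores


torch.manual_seed(0)
attention_module = LinearSketch(4, 4, combination="x,y,x*y")
vector = torch.randn(2, 4)
matrix = torch.randn(2, 3, 4)
attention = attention_module(vector, matrix)  # shape (2, 3)
```

With "x,y,x*y" and both dimensions equal to 4, the weight vector has length 12. The binary combinations require the two inputs to share a last dimension, while "x,y" alone does not.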
