input_dim: int,
    num_layers: int,
    feedforward_hidden_dim: int = 2048,
    num_attention_heads: int = 8,
    positional_encoding: Optional[str] = None,
    positional_embedding_size: int = 512,
    dropout_prob: float = 0.1,
    activation: str = 'relu',
) -> None

Implements a stacked self-attention encoder similar to the Transformer architecture in "Attention is All You Need".

This class adapts the Transformer from torch.nn for use in AllenNLP. Optionally, it adds positional encodings.

Registered as a Seq2SeqEncoder with name "pytorch_transformer".
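Because the encoder is registered under a name, an AllenNLP configuration can select it through a "type" key. The sketch below builds such a fragment as a plain Python dictionary and serialises it to JSON. The numeric values, the surrounding "encoder" key, and the "sinusoidal" string are illustrative assumptions, not values taken from this page; only the parameter names come from the constructor signature.

```python
import json

# Illustrative values only; the keys mirror the constructor signature.
encoder_config = {
    "type": "pytorch_transformer",  # the registered Seq2SeqEncoder name
    "input_dim": 512,
    "num_layers": 6,
    "feedforward_hidden_dim": 2048,
    "num_attention_heads": 8,
    # Assumed value: check which strings your AllenNLP version accepts.
    "positional_encoding": "sinusoidal",
    "dropout_prob": 0.1,
    "activation": "relu",
}

# Multi-head attention in torch.nn splits the input dimension evenly
# across heads, so input_dim must be divisible by the head count.
assert encoder_config["input_dim"] % encoder_config["num_attention_heads"] == 0

# "encoder" is a hypothetical key; the real key depends on the model
# that consumes this Seq2SeqEncoder.
config_json = json.dumps({"encoder": encoder_config}, indent=2)
print(config_json)
```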


  • input_dim : int, required. The input dimension of the encoder.
  • feedforward_hidden_dim : int, optional, (default = 2048) The middle dimension of the FeedForward network. The input and output dimensions are fixed to ensure sizes match up for the self attention layers.
  • num_layers : int, required. The number of stacked self attention -> feedforward -> layer normalisation blocks.
  • num_attention_heads : int, optional, (default = 8) The number of attention heads to use per layer.
  • positional_encoding : Optional[str], optional, (default = None) Which positional encoding, such as sinusoidal frequencies, to add to the input tensor; None adds none. Using one is strongly recommended, because without it the self attention layers have no notion of absolute or relative position (they only compute pairwise similarity between element vectors), and position is an important feature for many tasks.
  • dropout_prob : float, optional, (default = 0.1) The dropout probability for the feedforward network.
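To make the role of sinusoidal positional features concrete, here is a minimal pure-Python sketch of the interleaved sine/cosine formula from the Transformer paper. It is not AllenNLP's implementation, which may lay the features out differently; the function names `sinusoidal_positions` and `add_positions` are hypothetical.

```python
import math
from typing import List


def sinusoidal_positions(seq_len: int, dim: int) -> List[List[float]]:
    """Return a seq_len x dim table of sinusoidal position features."""
    if dim % 2 != 0:
        raise ValueError("dim must be even for this sketch")
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(0, dim, 2):
            # Each sine/cosine pair uses a wavelength that grows with i.
            angle = pos / (10000 ** (i / dim))
            row.append(math.sin(angle))  # even feature index
            row.append(math.cos(angle))  # odd feature index
        table.append(row)
    return table


def add_positions(inputs: List[List[float]], table: List[List[float]]) -> List[List[float]]:
    """Add position features element-wise to each input vector."""
    return [[x + p for x, p in zip(vec, pos_vec)] for vec, pos_vec in zip(inputs, table)]


# Two identical token vectors at positions 0 and 1.
tokens = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
encoded = add_positions(tokens, sinusoidal_positions(2, 4))
print(encoded[0] != encoded[1])
```

Without the added features the two vectors would be indistinguishable to self-attention; after the addition they differ, which is the positional signal the parameter description refers to.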


PytorchTransformer.forward(self, inputs:torch.Tensor, mask:torch.BoolTensor)

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
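The difference between calling the instance and calling forward directly can be illustrated without PyTorch. `TinyModule` below is a hypothetical stand-in that imitates only the forward-hook part of torch.nn.Module; it is a sketch of the mechanism, not the library's code.

```python
from typing import Any, Callable, List, Tuple


class TinyModule:
    """Hypothetical stand-in for torch.nn.Module, reduced to forward hooks."""

    def __init__(self) -> None:
        self._forward_hooks: List[Callable[[Any, int, int], None]] = []

    def register_forward_hook(self, hook: Callable[[Any, int, int], None]) -> None:
        self._forward_hooks.append(hook)

    def forward(self, x: int) -> int:
        # The "recipe" for the computation lives here.
        return x * 2

    def __call__(self, x: int) -> int:
        # Calling the instance runs forward and then every registered hook.
        output = self.forward(x)
        for hook in self._forward_hooks:
            hook(self, x, output)
        return output


seen: List[Tuple[int, int]] = []
module = TinyModule()
module.register_forward_hook(lambda mod, inp, out: seen.append((inp, out)))

module.forward(3)  # direct call: the hook does not run
calls_after_forward = len(seen)

module(3)  # instance call: the hook runs once
calls_after_call = len(seen)
print(calls_after_forward, calls_after_call)
```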


PytorchTransformer.get_input_dim(self) -> int

Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.


PytorchTransformer.get_output_dim(self) -> int

Returns the dimension of each vector in the sequence output by this Seq2SeqEncoder. This is not the shape of the returned tensor, but the last element of that shape.
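The distinction between a tensor's shape and the dimension these methods report can be shown with a plain tuple. The shape values and the helper name `last_dim` are illustrative assumptions.

```python
from typing import Tuple


def last_dim(shape: Tuple[int, ...]) -> int:
    """Return the per-element vector size, i.e. the last entry of a shape."""
    return shape[-1]


# Illustrative shape: (batch_size, sequence_length, vector_dim).
input_shape = (32, 50, 512)

# get_input_dim() and get_output_dim() report a single integer like this
# one, not the full three-element shape.
vector_dim = last_dim(input_shape)
print(vector_dim)
```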



PytorchTransformer.is_bidirectional(self)

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.
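The half-and-half convention for bidirectional encoders can be sketched on a single output vector. `split_directions` is a hypothetical helper written for illustration, not part of AllenNLP.

```python
from typing import List, Tuple


def split_directions(vector: List[float]) -> Tuple[List[float], List[float]]:
    """Split one output vector into its forward and backward halves."""
    if len(vector) % 2 != 0:
        raise ValueError("a bidirectional output must have an even final dimension")
    half = len(vector) // 2
    # First half: forward direction. Second half: backward direction.
    return vector[:half], vector[half:]


forward_part, backward_part = split_directions([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
print(forward_part, backward_part)
```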