class StackedSelfAttentionEncoder(Seq2SeqEncoder):
 | def __init__(
 |     self,
 |     input_dim: int,
 |     hidden_dim: int,
 |     projection_dim: int,
 |     feedforward_hidden_dim: int,
 |     num_layers: int,
 |     num_attention_heads: int,
 |     use_positional_encoding: bool = True,
 |     dropout_prob: float = 0.1,
 |     residual_dropout_prob: float = 0.2,
 |     attention_dropout_prob: float = 0.1
 | ) -> None

Implements a stacked self-attention encoder similar to, but not identical to, the Transformer architecture described in "Attention Is All You Need".

This encoder combines 3 layers in a 'block':

  1. A 2 layer FeedForward network.
  2. Multi-headed self attention, which uses 2 learnt linear projections to compute a dot-product similarity between every pair of elements, scaled by the square root of the per-head dimension (as in scaled dot-product attention).
  3. Layer Normalisation.

These blocks are then stacked num_layers times.
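
The attention step in item 2 can be sketched for a single head as follows. This is a minimal illustration in plain Python, not the library's implementation: the function names are hypothetical, the queries and keys are assumed to be already projected, and the scores are assumed to be divided by the square root of the query/key dimension, which is the usual convention for scaled dot-product attention.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def scaled_dot_product_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    # Assumption: scores are divided by sqrt(d), with d the query/key dimension.
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # One attention distribution per query, over all sequence elements.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output is the attention-weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs
```

Every element attends to every other element, so the cost grows quadratically with sequence length, and nothing in the computation depends on the order of the elements.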


  • input_dim : int
    The input dimension of the encoder.
  • hidden_dim : int
    The hidden dimension used for the input to self attention layers and the output from the feedforward layers.
  • projection_dim : int
    The dimension of the linear projections for the self-attention layers.
  • feedforward_hidden_dim : int
    The middle dimension of the FeedForward network. The input and output dimensions are fixed to ensure sizes match up for the self attention layers.
  • num_layers : int
    The number of stacked feedforward -> self attention -> layer normalisation blocks.
  • num_attention_heads : int
    The number of attention heads to use per layer.
  • use_positional_encoding : bool, optional (default = True)
    Whether to add sinusoidal positional features to the input tensor. This is strongly recommended: without it, the self attention layers have no notion of absolute or relative position (they only compute pairwise similarity between element vectors), and position is an important feature for many tasks.
  • dropout_prob : float, optional (default = 0.1)
    The dropout probability for the feedforward network.
  • residual_dropout_prob : float, optional (default = 0.2)
    The dropout probability for the residual connections.
  • attention_dropout_prob : float, optional (default = 0.1)
    The dropout probability for the attention distributions in each attention layer.
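
To illustrate what use_positional_encoding adds, here is a minimal sketch of sinusoidal position features in plain Python. It assumes the interleaved sine/cosine layout and the base of 10000 from "Attention Is All You Need"; the layout the library actually uses may differ, and the names sinusoidal_features and add_sinusoids are hypothetical.

```python
import math
from typing import List


def sinusoidal_features(
    seq_len: int, dim: int, base: float = 10000.0
) -> List[List[float]]:
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            # Channels 2k and 2k+1 share one frequency (assumed layout).
            angle = pos / (base ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_sinusoids(inputs: List[List[float]]) -> List[List[float]]:
    # Element-wise sum of each token vector and its position features.
    features = sinusoidal_features(len(inputs), len(inputs[0]))
    return [
        [x + p for x, p in zip(vec, pos)]
        for vec, pos in zip(inputs, features)
    ]
```

With these features added, two identical input vectors at different positions no longer look the same to the attention layers.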


class StackedSelfAttentionEncoder(Seq2SeqEncoder):
 | ...
 | def get_input_dim(self) -> int


class StackedSelfAttentionEncoder(Seq2SeqEncoder):
 | ...
 | def get_output_dim(self) -> int


class StackedSelfAttentionEncoder(Seq2SeqEncoder):
 | ...
 | def is_bidirectional(self)


class StackedSelfAttentionEncoder(Seq2SeqEncoder):
 | ...
 | def forward(self, inputs: torch.Tensor, mask: torch.BoolTensor)
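
forward takes a boolean mask alongside the inputs. One common way such a mask is used in attention is to exclude padded positions from the softmax, sketched below in plain Python. This is an assumption about typical usage rather than a description of this class's internals, and masked_softmax here is a hypothetical helper.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[bool]) -> List[float]:
    # Only positions with mask == True compete for probability mass.
    kept = [s for s, m in zip(scores, mask) if m]
    if not kept:
        # A fully masked row gets all-zero weights in this sketch.
        return [0.0] * len(scores)
    peak = max(kept)
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Masked positions receive zero weight regardless of their score, so padding cannot influence the encoded representation of real tokens.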