allennlp.modules.transformer.self_attention

SelfAttention

class SelfAttention(TransformerModule, FromParams):
 | def __init__(
 |     self,
 |     hidden_size: int,
 |     num_attention_heads: int,
 |     dropout: float = 0.0,
 |     scoring_func: str = "scaled_dot_product",
 |     output_linear: bool = False
 | )

This module computes self-attention, similar to the architecture used in BERT. Additionally, the attention scoring function can be specified. Details are in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., 2019.
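
For reference, the default scaled_dot_product scoring function corresponds to standard scaled dot-product attention (Vaswani et al., 2017), computed independently for each head:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

where $d_k$ is the per-head dimension, i.e. hidden_size / num_attention_heads.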

Parameters

  • hidden_size : int
  • num_attention_heads : int
  • dropout : float, optional (default = 0.0)
  • scoring_func : str, optional (default = scaled_dot_product)
    The name of the attention-scoring function to use, e.g. additive, linear, etc. For a complete list, see the attention module.
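
A minimal construction sketch (the import path follows the module name at the top of this page; the BERT-base-style sizes below are illustrative, and hidden_size is assumed to be divisible by num_attention_heads):

```python
from allennlp.modules.transformer.self_attention import SelfAttention

# Build a BERT-base-sized self-attention block: 12 heads over a 768-dim hidden state.
attention = SelfAttention(
    hidden_size=768,
    num_attention_heads=12,
    dropout=0.1,
    scoring_func="scaled_dot_product",  # the default scoring function
)
```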

forward

class SelfAttention(TransformerModule, FromParams):
 | ...
 | def forward(
 |     self,
 |     query_states: torch.Tensor,
 |     key_states: Optional[torch.Tensor] = None,
 |     value_states: Optional[torch.Tensor] = None,
 |     attention_mask: Optional[torch.BoolTensor] = None,
 |     head_mask: Optional[torch.Tensor] = None,
 |     output_attentions: bool = False
 | )

Parameters

  • query_states : torch.Tensor
    Shape batch_size x seq_len x hidden_dim
  • key_states : torch.Tensor, optional
    Shape batch_size x seq_len x hidden_dim
  • value_states : torch.Tensor, optional
    Shape batch_size x seq_len x hidden_dim
  • attention_mask : torch.BoolTensor, optional
    Shape batch_size x seq_len
  • head_mask : torch.BoolTensor, optional
  • output_attentions : bool
    Whether to also return the attention probabilities, default = False
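
A minimal forward-pass sketch, continuing the construction example above. The boolean mask is assumed to mark real (non-padding) positions with True, key_states and value_states are assumed to fall back to query_states when omitted (plain self-attention), and the exact return structure (a plain tensor versus a container that also carries the attention probabilities) may differ across AllenNLP versions:

```python
import torch

batch_size, seq_len, hidden_dim = 2, 10, 768  # hidden_dim matches hidden_size above

query = torch.randn(batch_size, seq_len, hidden_dim)
# Assumed convention: True = attend to this position, False = padding.
mask = torch.ones(batch_size, seq_len, dtype=torch.bool)

output = attention(
    query_states=query,      # key_states / value_states omitted: self-attention over query
    attention_mask=mask,
    output_attentions=True,  # also request the attention probabilities
)
# The attended representation has shape (batch_size, seq_len, hidden_dim).
```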