allennlp.modules.transformer.self_attention
SelfAttention¶
class SelfAttention(TransformerModule, FromParams):
| def __init__(
| self,
| hidden_size: int,
| num_attention_heads: int,
| dropout: float = 0.0,
| scoring_func: str = "scaled_dot_product",
| output_linear: bool = False
| )
This module computes self-attention, following the architecture used in BERT. Additionally, the attention scoring function can be specified. Details are in the paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2019).
Parameters¶

- hidden_size : int
- num_attention_heads : int
- dropout : float, optional (default = 0.0)
- scoring_func : str, optional (default = "scaled_dot_product")
  The name of the attention-scoring function to use, e.g. additive, linear, etc. For a complete list, please check attention.
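A minimal construction sketch based on the signature above; the sizes and dropout value are illustrative, and it is assumed that hidden_size must be divisible by num_attention_heads.

```python
from allennlp.modules.transformer.self_attention import SelfAttention

# Illustrative sizes; hidden_size is assumed to be divisible by
# num_attention_heads (64 / 8 = 8 dimensions per head).
self_attention = SelfAttention(
    hidden_size=64,
    num_attention_heads=8,
    dropout=0.1,
    scoring_func="scaled_dot_product",
)
```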
forward¶
class SelfAttention(TransformerModule, FromParams):
| ...
| def forward(
| self,
| query_states: torch.Tensor,
| key_states: Optional[torch.Tensor] = None,
| value_states: Optional[torch.Tensor] = None,
| attention_mask: Optional[torch.BoolTensor] = None,
| head_mask: Optional[torch.Tensor] = None,
| output_attentions: bool = False
| )
Parameters¶

- query_states : torch.Tensor
  Shape batch_size x seq_len x hidden_dim
- key_states : torch.Tensor, optional
  Shape batch_size x seq_len x hidden_dim
- value_states : torch.Tensor, optional
  Shape batch_size x seq_len x hidden_dim
- attention_mask : torch.BoolTensor, optional
  Shape batch_size x seq_len
- head_mask : torch.BoolTensor, optional
- output_attentions : bool, optional (default = False)
  Whether to also return the attention probabilities.