allennlp.modules.transformer.bimodal_attention
BiModalAttention¶
class BiModalAttention(TransformerModule, FromParams):
| def __init__(
| self,
| hidden_size1: int,
| hidden_size2: int,
| combined_hidden_size: int,
| num_attention_heads: int,
| dropout1: float = 0.0,
| dropout2: float = 0.0,
| scoring_func1: str = "scaled_dot_product",
| scoring_func2: str = "scaled_dot_product"
| )
Computes attention for two modalities, based on ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al., 2019).
From the paper:
"The keys and values from each modality are passed as input to the other modality’s multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other."
For example, considering the case when the first modality is image, and the second modality is language, the module performs "image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream."
Parameters¶
- hidden_size1 : int
  The input hidden dim for the first modality.
- hidden_size2 : int
  The input hidden dim for the second modality.
- combined_hidden_size : int
  The output hidden dim for both modalities; it should be a multiple of num_attention_heads.
- num_attention_heads : int
  The number of attention heads.
- dropout1 : float, optional (default = 0.0)
  The dropout probability for the first modality stream.
- dropout2 : float, optional (default = 0.0)
  The dropout probability for the second modality stream.
- scoring_func1 : str, optional (default = "scaled_dot_product")
  The name of the attention-calculating function to be used for the first modality.
- scoring_func2 : str, optional (default = "scaled_dot_product")
  The name of the attention-calculating function to be used for the second modality. E.g. dot_product, linear, etc. For a complete list, please check matrix_attention.
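A minimal construction sketch is shown below. The hidden sizes, head count, and dropout values are illustrative assumptions for an image/text setup, not defaults prescribed by this module.

```python
from allennlp.modules.transformer.bimodal_attention import BiModalAttention

# Illustrative sizes (assumptions): a visual stream with 1024-dim region
# features and a linguistic stream with 768-dim token embeddings.
bimodal_attention = BiModalAttention(
    hidden_size1=1024,          # image-region feature size
    hidden_size2=768,           # text-token embedding size
    combined_hidden_size=1024,  # should be a multiple of num_attention_heads
    num_attention_heads=8,
    dropout1=0.1,
    dropout2=0.1,
)
```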
forward¶
class BiModalAttention(TransformerModule, FromParams):
| ...
| def forward(
| self,
| input_tensor1,
| input_tensor2,
| attention_mask1=None,
| attention_mask2=None,
| co_attention_mask=None
| )
Parameters¶
- input_tensor1 : torch.Tensor
  Shape batch_size x seq_len1 x hidden_dim1 where seq_len1 can be the sequence length when the modality is text, or the number of regions when the modality is image.
- input_tensor2 : torch.Tensor
  Shape batch_size x seq_len2 x hidden_dim2 where seq_len2 can be the sequence length when the modality is text, or the number of regions when the modality is image.
- attention_mask1 : torch.BoolTensor, optional
  Shape batch_size x seq_len1
- attention_mask2 : torch.BoolTensor, optional
  Shape batch_size x seq_len2
- co_attention_mask : torch.Tensor, optional
  Shape batch_size x seq_len1 x seq_len2 x all_head_size
  This mask is for cases when you already have some prior information about the interaction between the two modalities. For example, if you know which words correspond to which regions in the image, this mask can be applied to limit the attention given that bias.
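A minimal forward-pass sketch follows, continuing the illustrative image/text configuration from above. The batch size, number of regions, and sequence length are arbitrary assumptions, and the exact structure of the returned value is not documented here.

```python
import torch

from allennlp.modules.transformer.bimodal_attention import BiModalAttention

# Same illustrative configuration as in the construction sketch above.
bimodal_attention = BiModalAttention(
    hidden_size1=1024,
    hidden_size2=768,
    combined_hidden_size=1024,
    num_attention_heads=8,
)

batch_size, num_regions, seq_len = 2, 36, 16
image_features = torch.randn(batch_size, num_regions, 1024)  # input_tensor1
text_embeddings = torch.randn(batch_size, seq_len, 768)      # input_tensor2

# Boolean masks marking which regions / tokens are real (True) vs. padding (False).
image_mask = torch.ones(batch_size, num_regions, dtype=torch.bool)
text_mask = torch.ones(batch_size, seq_len, dtype=torch.bool)

# Produces attention-pooled features for each modality conditioned on the other.
outputs = bimodal_attention(
    image_features,
    text_embeddings,
    attention_mask1=image_mask,
    attention_mask2=text_mask,
)
```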