bimodal_attention
allennlp.modules.transformer.bimodal_attention
BiModalAttention¶
class BiModalAttention(TransformerModule, FromParams):
| def __init__(
| self,
| hidden_size1: int,
| hidden_size2: int,
| combined_hidden_size: int,
| num_attention_heads: int,
| dropout1: float = 0.0,
| dropout2: float = 0.0,
| scoring_func1: str = "scaled_dot_product",
| scoring_func2: str = "scaled_dot_product"
| )
Computes attention for two modalities, based on ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al., 2019).
From the paper:
"The keys and values from each modality are passed as input to the other modality’s multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other."
For example, when the first modality is image and the second modality is language, the module performs "image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream."
Parameters¶
- hidden_size1 : int
  The input hidden dim for the first modality.
- hidden_size2 : int
  The input hidden dim for the second modality.
- combined_hidden_size : int
  The output hidden dim for both modalities; it should be a multiple of num_attention_heads.
- num_attention_heads : int
  The number of attention heads.
- dropout1 : float, optional (default = 0.0)
  The dropout probability for the first modality stream.
- dropout2 : float, optional (default = 0.0)
  The dropout probability for the second modality stream.
- scoring_func1 : str, optional (default = "scaled_dot_product")
  The name of the attention-calculating function to be used for the first modality.
- scoring_func2 : str, optional (default = "scaled_dot_product")
  The name of the attention-calculating function to be used for the second modality, e.g. additive, linear, etc. For a complete list, please check attention.
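As a rough, hypothetical sketch of how the constructor arguments above fit together (the sizes below are arbitrary choices, not values prescribed by this module; they only need to satisfy the constraint that combined_hidden_size is a multiple of num_attention_heads), a vision-and-language setup might look like:

from allennlp.modules.transformer.bimodal_attention import BiModalAttention

# Illustrative sizes only: a 1024-dim visual stream and a 768-dim
# linguistic stream, attended over in a shared 1024-dim space.
bimodal_attention = BiModalAttention(
    hidden_size1=1024,          # e.g. image-region features
    hidden_size2=768,           # e.g. token embeddings
    combined_hidden_size=1024,  # must be a multiple of num_attention_heads
    num_attention_heads=8,
    dropout1=0.1,
    dropout2=0.1,
    scoring_func1="scaled_dot_product",
    scoring_func2="scaled_dot_product",
)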
forward¶
class BiModalAttention(TransformerModule, FromParams):
| ...
| def forward(
| self,
| input_tensor1,
| input_tensor2,
| attention_mask1=None,
| attention_mask2=None,
| co_attention_mask=None,
| use_co_attention_mask=False
| )
- input_tensor1 : torch.Tensor
  Shape batch_size x seq_len1 x hidden_dim1, where seq_len1 can be the sequence length when the modality is text, or the number of regions when the modality is image.
- input_tensor2 : torch.Tensor
  Shape batch_size x seq_len2 x hidden_dim2, where seq_len2 can be the sequence length when the modality is text, or the number of regions when the modality is image.
- attention_mask1 : torch.BoolTensor, optional
  Shape batch_size x seq_len1
- attention_mask2 : torch.BoolTensor, optional
  Shape batch_size x seq_len2
- co_attention_mask : torch.Tensor, optional
  Shape batch_size x seq_len1 x seq_len2 x all_head_size. This mask is for cases when you already have some prior information about the interaction between the two modalities. For example, if you know which words correspond to which regions in the image, this mask can be applied to limit the attention accordingly.
- use_co_attention_mask : bool, optional (default = False)
  Whether to use the co_attention_mask.
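Continuing the hypothetical sketch above, a forward pass might look as follows. The batch, region, and token counts are arbitrary, and the exact return structure of forward (assumed here to be attention-pooled features for each modality) is not specified by this reference:

import torch

batch_size, num_regions, seq_len = 2, 36, 16

# Hypothetical inputs: 36 image regions (hidden_size1 = 1024) and
# 16 text tokens (hidden_size2 = 768), matching the constructor above.
image_features = torch.randn(batch_size, num_regions, 1024)
token_features = torch.randn(batch_size, seq_len, 768)

# Boolean padding masks, shaped batch_size x seq_len as documented above.
image_mask = torch.ones(batch_size, num_regions, dtype=torch.bool)
token_mask = torch.ones(batch_size, seq_len, dtype=torch.bool)

# Each modality attends over the keys/values of the other, producing
# attention-pooled features conditioned on the opposite modality.
outputs = bimodal_attention(
    image_features,
    token_features,
    attention_mask1=image_mask,
    attention_mask2=token_mask,
)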