bimodal_attention
allennlp.modules.transformer.bimodal_attention
BiModalAttention¶
class BiModalAttention(TransformerModule, FromParams):
| def __init__(
| self,
| hidden_size1: int,
| hidden_size2: int,
| combined_hidden_size: int,
| num_attention_heads: int,
| dropout1: float = 0.0,
| dropout2: float = 0.0,
| scoring_func1: str = "scaled_dot_product",
| scoring_func2: str = "scaled_dot_product"
| )
Computes attention for two modalities, based on ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks (Lu et al., 2019).
From the paper:
"The keys and values from each modality are passed as input to the other modality’s multi-headed attention block. Consequentially, the attention block produces attention-pooled features for each modality conditioned on the other."
For example, when the first modality is image and the second modality is language, the module performs "image-conditioned language attention in the visual stream and language-conditioned image attention in the linguistic stream."
Parameters¶
- hidden_size1 : int
  The input hidden dim for the first modality.
- hidden_size2 : int
  The input hidden dim for the second modality.
- combined_hidden_size : int
  The output hidden dim for both modalities; it should be a multiple of num_attention_heads.
- num_attention_heads : int
  The number of attention heads.
- dropout1 : float, optional (default = 0.0)
  The dropout probability for the first modality stream.
- dropout2 : float, optional (default = 0.0)
  The dropout probability for the second modality stream.
- scoring_func1 : str, optional (default = "scaled_dot_product")
  The name of the attention-calculating function to be used for the first modality.
- scoring_func2 : str, optional (default = "scaled_dot_product")
  The name of the attention-calculating function to be used for the second modality, e.g. additive, linear, etc. For a complete list, please check attention.
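As a rough, hypothetical sketch of how the constructor arguments above fit together (the sizes below are arbitrary choices, not values prescribed by this module; they only need to satisfy the constraint that combined_hidden_size is a multiple of num_attention_heads), a vision-and-language setup might look like:

from allennlp.modules.transformer.bimodal_attention import BiModalAttention

# Illustrative sizes only: a 1024-dim visual stream and a 768-dim
# linguistic stream, attended over in a shared 1024-dim space.
bimodal_attention = BiModalAttention(
    hidden_size1=1024,          # e.g. image-region features
    hidden_size2=768,           # e.g. token embeddings
    combined_hidden_size=1024,  # must be a multiple of num_attention_heads
    num_attention_heads=8,
    dropout1=0.1,
    dropout2=0.1,
    scoring_func1="scaled_dot_product",
    scoring_func2="scaled_dot_product",
)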
forward¶
class BiModalAttention(TransformerModule, FromParams):
| ...
| def forward(
| self,
| input_tensor1,
| input_tensor2,
| attention_mask1=None,
| attention_mask2=None,
| co_attention_mask=None,
| use_co_attention_mask=False
| )
- input_tensor1 : torch.Tensor
  Shape batch_size x seq_len1 x hidden_dim1, where seq_len1 can be the sequence length when the modality is text, or the number of regions when the modality is image.
- input_tensor2 : torch.Tensor
  Shape batch_size x seq_len2 x hidden_dim2, where seq_len2 can be the sequence length when the modality is text, or the number of regions when the modality is image.
- attention_mask1 : torch.BoolTensor, optional
  Shape batch_size x seq_len1
- attention_mask2 : torch.BoolTensor, optional
  Shape batch_size x seq_len2
- co_attention_mask : torch.Tensor, optional
  Shape batch_size x seq_len1 x seq_len2 x all_head_size. This mask is for cases when you already have some prior information about the interaction between the two modalities. For example, if you know which words correspond to which regions in the image, this mask can be applied to limit the attention accordingly.
- use_co_attention_mask : bool, optional (default = False)
  Whether to use the co_attention_mask.
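Continuing the hypothetical sketch above, a forward pass might look as follows. The batch, region, and token counts are arbitrary, and the exact return structure of forward (assumed here to be attention-pooled features for each modality) is not specified by this reference:

import torch

batch_size, num_regions, seq_len = 2, 36, 16

# Hypothetical inputs: 36 image regions (hidden_size1 = 1024) and
# 16 text tokens (hidden_size2 = 768), matching the constructor above.
image_features = torch.randn(batch_size, num_regions, 1024)
token_features = torch.randn(batch_size, seq_len, 768)

# Boolean padding masks, shaped batch_size x seq_len as documented above.
image_mask = torch.ones(batch_size, num_regions, dtype=torch.bool)
token_mask = torch.ones(batch_size, seq_len, dtype=torch.bool)

# Each modality attends over the keys/values of the other, producing
# attention-pooled features conditioned on the opposite modality.
outputs = bimodal_attention(
    image_features,
    token_features,
    attention_mask1=image_mask,
    attention_mask2=token_mask,
)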