A Multimodal Relational Reasoning Model for Visual Question Answering: "Multimodal Relational Reasoning for Visual Question Answering"

Date: 2021-07-20 21:27:43

1. Abstract and Summary

Multimodal attentional networks are currently state-of-the-art models for Visual Question Answering (VQA) tasks involving real images. Although attention allows to focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model complex reasoning features required for VQA or other high-level tasks.
In this paper, we propose MuRel, a multimodal relational network which is learned end-to-end to reason over real images. Our first contribution is the introduction of the MuRel cell, an atomic reasoning primitive representing interactions between question and image regions by a rich vectorial representation, and modeling region relations with pairwise combinations. Secondly, we incorporate the cell into a full MuRel network, which progressively refines visual and question interactions, and can be leveraged to define visualization schemes finer than mere attention maps.
We validate the relevance of our approach with various ablation studies, and show its superiority to attention-based methods on three datasets: VQA 2.0, VQA-CP v2 and TDIUC. Our final MuRel network is competitive to or outperforms state-of-the-art results in this challenging context.

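The authors argue that multimodal attentional networks are currently the state-of-the-art models for Visual Question Answering (VQA) tasks on real images. Although attention can focus on the visual content relevant to the question, this simple mechanism is arguably insufficient to model the complex reasoning that VQA and other high-level tasks require. To address this, the authors introduce the MuRel cell, an atomic reasoning primitive that represents the interactions between the question and image regions with a rich vector representation and models region relations with pairwise combinations. The cell is then integrated into a full MuRel network, which progressively refines the visual and question interactions and can be used to define visualization schemes finer than plain attention maps. Experiments show that the approach is competitive with or outperforms state-of-the-art results.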

2. Network Architecture

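The paper replaces the traditional attention framework with a vectorized representation: the visual content of each region is fused with the question by bilinear fusion, followed by pairwise relational modeling. In addition, notions of spatial and semantic context are built into the representation, that is, pairs of image regions are represented through interactions between their visual embeddings and their spatial coordinates. The overall architecture is shown in the figure below.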

The framework is analyzed below.

2.1 MuRel approach

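Here, Pθ is the trainable model. The image is represented by a set of vectors {vi}, i ∈ [1, N], where each vi corresponds to an object detected in the image. The spatial coordinates of each region are also used, bi = [x, y, w, h], where (x, y) is the top-left corner of the bounding box and h and w are its height and width. Note that x and w (respectively y and h) are normalized. For the question, a gated recurrent unit (GRU) provides a sentence embedding q.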
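The coordinate vector bi described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `normalize_box` is hypothetical, the detector is assumed to return pixel-space corners (x1, y1, x2, y2), and x and w are assumed to be divided by the image width while y and h are divided by the image height.

```python
def normalize_box(x1, y1, x2, y2, img_w, img_h):
    """Turn a pixel-space corner box into a normalized [x, y, w, h] vector.

    Assumption: (x1, y1) is the top-left corner and (x2, y2) the bottom-right
    corner, both in pixels; normalization divides by the image size.
    """
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    x = x1 / img_w           # left edge, relative to image width
    y = y1 / img_h           # top edge, relative to image height
    w = (x2 - x1) / img_w    # box width, relative to image width
    h = (y2 - y1) / img_h    # box height, relative to image height
    return [x, y, w, h]


# A 200x200 pixel box whose top-left corner is at (50, 100) in a 500x400 image.
b = normalize_box(50, 100, 250, 300, img_w=500, img_h=400)
print(b)  # [0.1, 0.25, 0.4, 0.5]
```

Normalizing this way keeps the coordinates comparable across images of different sizes, which matters once pairs of regions are compared by position.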

2.2 MuRel cell

In this design, the MuRel network is built by iterating a reasoning unit, the MuRel cell, which is shown in the figure below.

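The MuRel cell takes N visual features as input, each carrying its coordinates bi. It consists of two modules. The first is bilinear fusion, which fuses each image region feature (obtained from an object detection network) with the question text feature to produce a multimodal embedding. The second is pairwise relational modeling, which models the relations between pairs of these embeddings. There is also a residual connection, which the authors explain is meant to avoid vanishing gradients. The two modules are discussed in turn below; the code is as follows.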

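import torch.nn as nn
import block  # assumed: the block.bootstrap.pytorch package, which provides factory_fusion
from .pairwise import Pairwise  # assumed: Pairwise is defined in a sibling module


class MuRelCell(nn.Module):

    def __init__(self,
                 residual=False,  # whether to use a residual connection
                 fusion={},       # options for the fusion module
                 pairwise={}):    # options for pairwise modeling
        super(MuRelCell, self).__init__()
        self.residual = residual
        self.fusion = fusion
        self.pairwise = pairwise
        # Build the fusion module through the factory
        self.fusion_module = block.factory_fusion(self.fusion)
        if self.pairwise:
            # Pairwise relational modeling
            self.pairwise_module = Pairwise(**pairwise)

    def forward(self, q_expand, mm, coords=None):
        # Fuse the question with every region
        mm_new = self.process_fusion(q_expand, mm)
        if self.pairwise:
            # Model relations between regions, using their coordinates
            mm_new = self.pairwise_module(mm_new, coords)
        if self.residual:
            # Add the cell input back to its output
            mm_new = mm_new + mm
        return mm_new

    def process_fusion(self, q, mm):
        # Fusion: flatten batch and regions, fuse, then restore the shape
        bsize = mm.shape[0]
        n_regions = mm.shape[1]
        mm = mm.contiguous().view(bsize * n_regions, -1)
        mm = self.fusion_module([q, mm])
        mm = mm.view(bsize, n_regions, -1)
        return mm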
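To make the data flow of the cell easier to follow without PyTorch, here is a minimal framework-free sketch using plain lists. It mirrors only the three steps of `forward` (fuse, relate, add the residual) and the idea of iterating the cell. Everything else is an assumption for illustration: `murel_cell`, `murel_network`, `toy_fuse` and `toy_pairwise` are hypothetical names, the element-wise product is a stand-in for the bilinear fusion, the element-wise max over regions is a stand-in for the pairwise module (it ignores the coordinates), and `n_steps=3` is an arbitrary choice.

```python
def murel_cell(q, regions, coords, fuse, pairwise=None, residual=False):
    """Follow the same three steps as MuRelCell.forward, on plain lists."""
    # Step 1: fuse the question vector with each region vector independently.
    fused = [fuse(q, v) for v in regions]
    # Step 2: optionally let every region exchange information with the others.
    if pairwise is not None:
        fused = pairwise(fused, coords)
    # Step 3: optionally add the cell input back (residual connection).
    if residual:
        fused = [[n + o for n, o in zip(new, old)]
                 for new, old in zip(fused, regions)]
    return fused


def toy_fuse(q, v):
    # Hypothetical stand-in for bilinear fusion: element-wise product.
    return [qi * vi for qi, vi in zip(q, v)]


def toy_pairwise(fused, coords):
    # Hypothetical stand-in for pairwise modeling: add to each region the
    # element-wise max over all regions. The coordinates are ignored here.
    context = [max(column) for column in zip(*fused)]
    return [[s + c for s, c in zip(vec, context)] for vec in fused]


def murel_network(q, regions, coords, n_steps=3):
    # Apply the cell repeatedly, feeding each output into the next step.
    for _ in range(n_steps):
        regions = murel_cell(q, regions, coords, toy_fuse, toy_pairwise,
                             residual=True)
    return regions


q = [1.0, 2.0]                                   # question embedding
regions = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # N = 3 region vectors
coords = [[0.0, 0.0, 0.5, 0.5], [0.5, 0.0, 0.5, 0.5], [0.0, 0.5, 1.0, 0.5]]

out = murel_cell(q, regions, coords, toy_fuse, toy_pairwise, residual=True)
print(out)  # [[4.0, 4.0], [2.0, 7.0], [6.0, 10.0]]
```

The output keeps one vector per region with the same dimension as the input, which is what allows the residual addition and lets the cell be applied several times in a row.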