1 Introduction

Visual Question Answering (VQA) has attracted considerable attention recently, as solving it relies on successful models from both the computer vision and natural language processing communities. VQA provides a simple and effective testbed for verifying whether AI can truly understand the semantic meaning of vision and language. To this end, numerous efforts have been made towards improving VQA models through better representations [1], attention [2,3,4,5] and fusion strategies.

Although various fusion mechanisms have been proposed, most of them focus on fusing cross-modal features. Early models fuse cross-modal features with first order interactions such as concatenation [6, 7]. More recently, bi-linear methods [8,9,10] have been proposed to capture fine-grained cross-modal features with second order interactions. Multimodal Tucker Fusion (MUTAN) [10] proposes an effective bi-linear fusion of visual and textual features based on low-rank matrix decomposition. It is therefore natural to extend first order or bi-linear fusion models to higher order ones, in order to better capture the rich and complex information present in both visual and textual features. Explicit high order fusion has been widely adopted in other applications, e.g., recommendation tasks [11, 12], yet much less so in VQA. For example, DeepFM [11] adopts a factorization machine (FM) to construct explicit second order features on top of deep features.

However, it is nontrivial to apply explicit high order methods to VQA, since visual object features are, in principle, orderless with respect to semantic attributes. This is quite different from the attribute embeddings (e.g., age, gender) in recommendation, which are arranged in a fixed order. To overcome this problem, the multi-glimpse attention strategy is re-introduced and re-purposed to produce ordered visual representations, with each glimpse corresponding to one type of attribute. A novel Second Order enhanced Multi-glimpse Attention (SOMA) model is thus proposed to construct explicit high order features from the multi-glimpse outputs and the question feature. SOMA computes, in an embedding space, the feature interactions of both intra-modality (i.e., interactions between different glimpse outputs) and cross-modality (i.e., interactions between glimpse outputs and the question feature). To fully utilize the outputs of multi-glimpse attention, we feed each attended feature to an independent prediction branch, encouraging each glimpse to focus on the question-related objects.

Furthermore, although several large-scale VQA datasets have been contributed to the community, effectively training a deep VQA network still suffers from data scarcity and long-tailed question-answer distributions. In particular, as reported in [13], only a limited number of question-answer pairs appear frequently, while most others have only sparse examples. To alleviate this problem, a novel data augmentation strategy is proposed in this paper. Typical data augmentation, e.g., cropping and resizing images, synthesizes new instances from training examples, and most previous strategies operate in the visual feature space rather than the semantic space. As a task requiring high-level reasoning, VQA calls for a data augmentation method that integrates the semantic information of each modality. To this end, we propose a data augmentation method, semantic deformation, which randomly removes some visual objects and adds some noise visual instances. The images are dynamically augmented in this way to create more diverse visual inputs, and the technique is further exploited as a self-supervised mechanism to improve the learning of attention.

Formally, in this paper, we propose Second Order enhanced Multi-glimpse Attention (SOMA) to tackle visual question answering. As shown in Fig. 1, the model has several key components: a multi-glimpse attention module, a second order module and a classifier. In the multi-glimpse attention module, each glimpse has a different attention preference over the semantic aspects of the question, which makes the attended features more robust. The second order module explicitly models both intra-modality and cross-modality interactions by embedding the visual and textual features into a shared space. The classifier is strengthened with a branch loss, which provides a more direct supervised signal for each glimpse and for the second order module.

To sum up, our contributions are as follows. (1) A second order module that constructs explicit second order features from the multi-glimpse outputs and the question feature in a shared embedding space. (2) A branch loss that serves as a per-glimpse prediction signal, improving the learning and attention quality of each glimpse. (3) A semantic deformation method combining semantic object cropping, noise object adding and a negative sample loss regularization. (4) Extensive experiments and ablation studies demonstrating the effectiveness of SOMA and semantic deformation.

2 Related Work

Visual Question Answering (VQA). The goal of visual question answering is to predict an answer for a given question and image pair [13, 14]. Dominant methods treat this problem as a classification task. A canonical model has three main stages: visual and textual feature extraction [1, 15], attention [2, 5, 16] and fusion. Textual features are mainly extracted from questions with RNN-based methods or Transformers. Recently, object-level visual features from Faster R-CNN have been preferred over grid visual features from ResNet. Extensive attention models have been proposed to identify question-related information in the image, including question-guided attention [1, 15], co-attention [5, 16, 17], self-attention [4, 5] and stacked attention [2, 18]. The fusion of visual and textual features includes first order [6, 7] and high order solutions [8, 9].

Attention. The attention mechanism is a key component of canonical VQA models. Early works exploit visual attention to identify the image regions salient to the question [1, 2]. Some co-attention models [4, 16] find that textual attention, which detects the related words in the question, is also beneficial alongside visual attention. Recently, models with stacked self-attention layers [4, 5, 19, 20] have achieved state-of-the-art results on VQA, but the multi-layer architecture incurs a large computational cost. Studies [8, 18] have shown that multi-glimpse attention, which generates more than one attention map, is more robust. However, the relation between different attention results has not been well studied yet.

Fusion. Fusion in VQA aims to combine visual and textual features. Its two main factors are interaction granularity and order. Coarse-grained first order fusion methods [6, 7] combine the aggregated visual feature and the question feature by concatenation; such simple first order fusion is limited in modeling the complex interactions between the two modalities. Coarse-grained second order fusion approaches [8,9,10] advocate effective bi-linear pooling between the aggregated visual and textual features. The fine-grained second order approach BAN [3] applies bi-linear attention between visual objects and question words and uses sum pooling to obtain the fused feature. MFH [18] is the work most related to ours. It first adopts bi-linear attention between grid visual features and question features to generate multi-glimpse outputs, then concatenates them into one visual feature for cross-modal bi-linear fusion. In contrast, our approach projects the multi-glimpse outputs and the question feature into a shared embedding space to capture cross-modality and intra-modality interactions simultaneously. Inspired by the success of explicit high order features in recommendation tasks [11, 12], we construct an explicit second order feature in the shared embedding space as the fusion. Since our fusion is based on the result of multi-glimpse attention, its granularity is flexible: it becomes fine-grained when each attention map is close to a one-hot vector.

Data Augmentation. Due to the dynamic nature of vision and language combinations, the current scale of VQA datasets is insufficient for deep neural network based models. In image classification, traditional data augmentation methods include cropping, resizing, flipping, rotation and mixup [21,22,23] on the input space. Manifold mixup [23] interpolates training instances in the hidden layer and label space. Counterfactual Sample Synthesizing (CSS) [24] masks critical objects to generate numerous samples for robust model training. Inspired by manifold mixup, we propose a semantic deformation method in the visual semantic space via instance-level cropping and noise adding.

Self-supervised Learning. The intrinsic structural information of domain data can be exploited as an extra supervised signal for machine learning. In computer vision, the relative position of image patches [25], colorization [26], inpainting [27] and jigsaw puzzles [28] have been formulated as surrogate tasks. In NLP, the skip-gram language model [29, 30] learns word embeddings via context prediction; in particular, it adopts negative sampling to distinguish the learned vectors from a noise distribution. For semantic deformation examples, we propose a hinge loss on the attention scores of noise instances as an extra supervised signal, under the assumption that a noise instance in VQA should, with high probability, be ignored.

Fig. 1. The framework of SOMA. The main components of SOMA are the multi-glimpse attention module, the second order module and the classifier. Extracted visual features and the question feature are fed to the multi-glimpse attention module to generate the attended visual features. The attended visual features and the question feature are taken as inputs of the second order module. Finally, the attended visual features, the second order feature and the question feature are passed to the classifier.

3 Approach

Overview. We formulate visual question answering as a classification problem that computes the answer probability \(\mathrm {p}\left( a\mid \mathbf {Q},\mathbf {I}\right) \) conditioned on the question \(\mathbf {Q}\) and the image \(\mathbf {I}\). In this paper, we propose a novel framework, Second Order enhanced Multi-glimpse Attention (SOMA), composed of three components: a multi-glimpse attention module, a second order module and a classifier. The whole pipeline is illustrated in Fig. 1. Multiple attended visual features are generated by the multi-glimpse attention module, each glimpse with a different semantic similarity preference. The question embedding and the attended visual features are fed into the second order module to produce the second order feature. The second order feature, the attended visual features and the question embedding are then passed to the classifier. During training, in addition to the full prediction, a branch prediction is used as an extra supervised signal for each glimpse in the classifier.

3.1 Feature Extraction

We first encode the image \(\mathbf {I}\) and the question \(\mathbf {Q}\) into the visual feature set \(\mathbf {V}\) and the question embedding \(\mathbf {q}\). The original task of calculating \(\mathrm {p}(a|\mathbf {Q},\mathbf {I})\) is thereby translated into obtaining \(\mathrm {p}(a|\mathbf {q},\mathbf {V})\).

Visual Features. The visual feature set \(\mathbf {V}=\left\{ v_{1},\ldots ,v_{k}\right\} \), \(v_{i}\in \mathbb {R}^{d_{v}}\), is the output of a Faster R-CNN as described in Bottom-up [1]. The Faster R-CNN model is pre-trained on Visual Genome [31] and the object number k is fixed at 36 in our experiments. We denote the extracted visual object set as:

$$\begin{aligned} \mathbf {V}=\mathrm {RCNN}(\mathbf {I},\theta _{\mathrm {RCNN}}). \end{aligned}$$
(1)

Question Feature. The question embedding \(\mathbf {q}\in \mathbb {R}^{1\times d_{t}}\) is obtained from a single-layer GRU. The words in the question are first mapped to vectors with GloVe embeddings, and the word vectors are then fed into the GRU in sequence. The last hidden state is taken as the question embedding:

$$\begin{aligned} \mathbf {q}=\mathrm {GRU}(\mathbf {Q},\theta _{\mathrm {GRU}}). \end{aligned}$$
(2)
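
For concreteness, a minimal PyTorch sketch of this question encoder is shown below. The class name, the 300-d GloVe dimension and the omission of padding/masking are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Sketch of Eq. (2): GloVe embeddings followed by a single-layer GRU.

    glove_weights is assumed to be a pre-built [vocab_size, 300] embedding matrix;
    padding/masking of variable-length questions is omitted for brevity.
    """
    def __init__(self, glove_weights, d_t=1024):
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(input_size=glove_weights.size(1),
                          hidden_size=d_t, batch_first=True)

    def forward(self, question_tokens):    # [B, T] word indices
        w = self.embed(question_tokens)    # [B, T, 300] word vectors
        _, h_last = self.gru(w)            # h_last: [1, B, d_t], last hidden state
        return h_last.squeeze(0)           # q: [B, d_t]
```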

3.2 Multi-glimpse Attention

To answer a question about an image, the attention map in each glimpse identifies the grounded visual objects. In the multi-glimpse attention mechanism, each glimpse may have a different semantic similarity preference: some prefer to attend to question-related colors, some to question-related shapes, and so on. We adopt multi-glimpse attention to make the attention results more robust and diverse. First, we project the visual feature set \(\mathbf {V}\in \mathbb {R}^{k\times d_{v}}\) and the question embedding \(\mathbf {q}\in \mathbb {R}^{1\times d_{t}}\) into a shared embedding space by \(\mathbf {W}_{v}\in \mathbb {R}^{d_{v}\times d_{h}}\) and \(\mathbf {W}_{t}\in \mathbb {R}^{d_{t}\times d_{h}}\) respectively. The two latent features are then combined through an element-wise product to generate the attention weights \(\mathbf {A}\in \mathbb {R}^{k\times m}\) as:

$$\begin{aligned} \mathbf {A}=\mathrm {softmax}\left( \left( \mathrm {ReLU}\left( \mathbf {1}\left( \mathbf {q}\mathbf {W}_{t}\right) \right) \odot \mathrm {ReLU}\left( \mathbf {V}\mathbf {W}_{v}\right) \right) \mathbf {W}_{G}\right) \end{aligned}$$
(3)

where \(\mathbf {1}\in \mathbb {R}^{k\times 1}\) is an all-ones vector that tiles \(\mathbf {q}\mathbf {W}_{t}\) over the k objects, \(\mathbf {W}_{G}\in \mathbb {R}^{d_{h}\times m}\) and m is the number of glimpses. The \(\mathrm {softmax}\) is applied along the first dimension to produce weights over the k objects for each glimpse.

The question-attended visual features \(\mathbf {G}\in \mathbb {R}^{m\times d_{v}}\) are then computed as the product of the attention weights and the original visual feature set,

$$\begin{aligned} \mathbf {G}=\mathbf {A}^{T}\mathbf {V}. \end{aligned}$$
(4)
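
A minimal PyTorch sketch of Eqs. (3)-(4) follows, assuming batched inputs; the class name and default dimensions are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGlimpseAttention(nn.Module):
    """Sketch of Eqs. (3)-(4): m attention maps over k objects in a shared embedding space."""
    def __init__(self, d_v=2048, d_t=1024, d_h=2048, m=4):
        super().__init__()
        self.W_v = nn.Linear(d_v, d_h, bias=False)
        self.W_t = nn.Linear(d_t, d_h, bias=False)
        self.W_G = nn.Linear(d_h, m, bias=False)

    def forward(self, V, q):
        # V: [B, k, d_v] object features, q: [B, d_t] question embedding
        joint = F.relu(self.W_t(q)).unsqueeze(1) * F.relu(self.W_v(V))  # [B, k, d_h]
        A = F.softmax(self.W_G(joint), dim=1)      # [B, k, m], softmax over the k objects
        G = A.transpose(1, 2) @ V                  # [B, m, d_v] attended visual features
        return G, A
```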

3.3 Second Order Module

We first introduce the notation for the second order module. Following the factorization machine formulation, consider a score prediction task over a set of scalar variables \(\left\{ x_{1},x_{2},\ldots ,x_{n}\right\} \):

$$\begin{aligned} \hat{y}=w_{0}+\sum _{i=1}^{n}w_{i}x_{i}+\sum _{i=1}^{n-1}\sum _{j=i+1}^{n}\left\langle v_{i},v_{j}\right\rangle x_{i}x_{j} \end{aligned}$$
(5)

where \(\hat{y}\) is the predicted score, \(w_{0}\) is the bias, \(\sum _{i=1}^{n}w_{i}x_{i}\) represents the contribution of first order interactions and the last term captures the second order interactions. The inner product \(\left\langle v_{i},v_{j}\right\rangle \) is the coefficient for the interaction between variables \(x_{i}\) and \(x_{j}\).

We propose a second order interaction module for the question and visual features as shown in Fig. 1. The question feature is first projected into the visual feature space. The concatenation of the projected question feature and the attended visual features is then transformed into a shared embedding space as:

$$\begin{aligned} \mathbf {E}=\mathrm {ReLU}([\mathbf {G};\mathrm {ReLU}(\mathbf {q}\mathbf {W}_{qv})]\mathbf {W}_{ve}) \end{aligned}$$
(6)

where \(\mathbf {W}_{qv}\in \mathbb {R}^{d_{t}\times d_{v}}\), \(\mathbf {W}_{ve}\in \mathbb {R}^{d_{v}\times d_{e}}\) and \(d_{e}\) is the dimension of the latent space.

We construct the explicit second order feature \(\mathbf {s}\) over the vector variable set \(\mathbf {E}=[\mathbf {e}_{1};\mathbf {e}_{2};\ldots ;\mathbf {e}_{m+1}]\) as below:

$$\begin{aligned} \mathbf {s}=\sum _{i=1}^{m+1}\mathbf {e}_{i}+\sum _{i=1}^{m}\sum _{j=i+1}^{m+1}\mathbf {e}_{i}\circ \mathbf {e}_{j} \end{aligned}$$
(7)

where \(\circ \) denotes the Hadamard product. The first term represents the contribution of first order features and the second term reflects the second order interactions. For simplicity and efficiency, the coefficients of this vector-version FM are all fixed at 1; we argue that a proper embedding space learned by \(\mathbf {W}_{ve}\) can compensate for this simplification.
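
The PyTorch sketch below illustrates Eqs. (6)-(7), using the standard factorization machine identity \(\sum _{i<j}\mathbf {e}_{i}\circ \mathbf {e}_{j}=\tfrac{1}{2}\big ((\sum _{i}\mathbf {e}_{i})^{2}-\sum _{i}\mathbf {e}_{i}^{2}\big )\) for efficiency; the class name and default dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondOrderModule(nn.Module):
    """Sketch of Eqs. (6)-(7): embed [G; q] into a shared space and build the
    vector-FM second order feature with all pairwise coefficients fixed at 1."""
    def __init__(self, d_v=2048, d_t=1024, d_e=2048):
        super().__init__()
        self.W_qv = nn.Linear(d_t, d_v, bias=False)
        self.W_ve = nn.Linear(d_v, d_e, bias=False)

    def forward(self, G, q):
        # G: [B, m, d_v] attended visual features, q: [B, d_t] question embedding
        q_v = F.relu(self.W_qv(q)).unsqueeze(1)             # [B, 1, d_v]
        E = F.relu(self.W_ve(torch.cat([G, q_v], dim=1)))   # [B, m+1, d_e]
        first = E.sum(dim=1)                                # sum_i e_i
        # sum_{i<j} e_i o e_j = 0.5 * ((sum_i e_i)^2 - sum_i e_i^2)
        second = 0.5 * (first.pow(2) - E.pow(2).sum(dim=1))
        return first + second                               # s: [B, d_e]
```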

3.4 Classifier

The classifier takes the question embedding, the multi-glimpse outputs and the second order feature as inputs. It contains two types of prediction modules: branch prediction and full prediction. A branch prediction is applied to each glimpse output and to the second order feature, while the full prediction takes all the glimpse outputs and the second order feature as inputs.

Branch Prediction. To encourage each glimpse and the second order module to gather the information needed for answering, we feed each of them into an independent branch prediction module. In a branch prediction module, the visual feature and the question feature are first transformed into a hidden space and then projected by a fully connected layer to the answer space as follows:

$$\begin{aligned} \mathbf {h}_{x}&= \mathrm {ReLU}\left( \mathrm {ReLU}\left( \mathbf {q}\mathbf {W}_{qh}\right) \circ \mathrm {ReLU}\left( \mathbf {v}_{x}\mathbf {W}_{xh}\right) \right) \\ \hat{\mathbf {a}}_{x}&= \mathrm {sigmoid}\left( \mathbf {h}_{x}\mathbf {W}_{xa}\right) \end{aligned}$$

where \(\mathbf {v}_{x}\in \left\{ \mathbf {g}_{1},\mathbf {g}_{2},\ldots ,\mathbf {g}_{m},\mathbf {s}\right\} \) , \(\mathbf {W}_{qh}\in \mathbb {R}^{d_{t}\times d_{h}}\), \(\mathbf {W}_{xh}\in \mathbb {R}^{d_{v}\times d_{h}}\) , \(\mathbf {W}_{xa}\in \mathbb {R}^{d_{h}\times d_{a}}\).

Full Prediction. To fully utilize the information in all branches, we concatenate the hidden features of all branches into \(\mathbf {h}\) and then map it into the answer space with a linear transformation:

$$\begin{aligned} \mathbf {h}&= \left[ \mathbf {h}_{1},\mathbf {h}_{2},\ldots ,\mathbf {h}_{m+1}\right] \\ \hat{\mathbf {a}}&= \mathrm {sigmoid}\left( \mathbf {h}\mathbf {W}_{ha}\right) \end{aligned}$$

where \(\mathbf {h}\in \mathbb {R}^{1\times (m+1)d_{h}}\) is the flattened concatenation and \(\mathbf {W}_{ha}\in \mathbb {R}^{(m+1)d_{h}\times d_{a}}\).
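
A sketch of the two prediction heads is given below, assuming \(d_{e}=d_{v}\) as in the experimental setting. As an implementation choice of this sketch, the heads return pre-sigmoid logits so the sigmoid can be folded into the BCE loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Classifier(nn.Module):
    """Sketch of the branch and full prediction heads (Sect. 3.4).
    Each of the m glimpse outputs and the second order feature gets its own branch."""
    def __init__(self, d_v=2048, d_t=1024, d_h=2048, d_a=3129, m=4):
        super().__init__()
        self.W_qh = nn.Linear(d_t, d_h, bias=False)
        self.W_xh = nn.ModuleList([nn.Linear(d_v, d_h, bias=False) for _ in range(m + 1)])
        self.W_xa = nn.ModuleList([nn.Linear(d_h, d_a, bias=False) for _ in range(m + 1)])
        self.W_ha = nn.Linear((m + 1) * d_h, d_a, bias=False)

    def forward(self, q, branch_feats):
        # branch_feats: list of m+1 tensors [B, d_v] (g_1..g_m and s, assuming d_e == d_v)
        q_h = F.relu(self.W_qh(q))                           # [B, d_h]
        hidden, branch_logits = [], []
        for x, v_x in enumerate(branch_feats):
            h_x = F.relu(q_h * F.relu(self.W_xh[x](v_x)))    # [B, d_h] branch hidden feature
            hidden.append(h_x)
            branch_logits.append(self.W_xa[x](h_x))          # pre-sigmoid branch scores
        full_logits = self.W_ha(torch.cat(hidden, dim=1))    # [B, d_a] full prediction
        return full_logits, branch_logits
```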

Loss Function. The total loss for prediction is composed of two parts: loss for branch prediction and loss for full prediction. The branch loss is scaled by a factor \(\alpha _{b}\).

$$\begin{aligned} L=L_{f}+\alpha _{b}L_{b} \end{aligned}$$
(8)

Both the full prediction loss and branch prediction loss adopt binary cross-entropy (BCE) as the loss function.

$$\begin{aligned} L_{f}&= \mathrm {BCE}\left( \hat{\mathbf {a}},\mathbf {a}\right) \\ L_{b}&= \sum _{x=1}^{m+1}\mathrm {BCE}\left( \hat{\mathbf {a}}_{x},\mathbf {a}\right) \end{aligned}$$
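
A corresponding loss sketch of Eq. (8), assuming the pre-sigmoid logits returned by the classifier sketch above:

```python
import torch.nn.functional as F

def soma_loss(full_logits, branch_logits, target, alpha_b=0.2):
    """Sketch of Eq. (8): full-prediction BCE plus the scaled sum of branch BCE terms.
    target is the [B, d_a] soft answer vector; sigmoid is folded into the BCE-with-logits call."""
    L_f = F.binary_cross_entropy_with_logits(full_logits, target)
    L_b = sum(F.binary_cross_entropy_with_logits(a_x, target) for a_x in branch_logits)
    return L_f + alpha_b * L_b
```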

3.5 Data Augmentation by Semantic Deformation

We observe that humans can, with high probability, still answer a question when some objects in the image are occluded or some unrelated 'noise' objects are present. Inspired by this, we propose an object-level data augmentation method, semantic deformation, which consists of two key steps: semantic object cropping and semantic object adding.

Semantic Objects Cropping. The size k of the visual object set \(\mathbf {V}\) produced by Faster R-CNN is usually large enough to ensure that it contains the objects necessary for answering the question. If we randomly remove a small number \(k_{r}\) of objects from the original set, the remaining set will still contain the answering clues with high probability. We sample \(k_{r}\) from a uniform distribution over 1 to \(R_{max}\), where \(R_{max}\) is the maximum number of objects that can be removed.

$$\begin{aligned} k_{r}&=\mathrm {uniform}(1,R_{max})\\ \mathbf {V}_{selected}&=\mathrm {select}(\mathbf {V},k-k_{r}) \end{aligned}$$

Semantic Objects Adding. We add \(k_{a}\) semantic noise objects taken from a randomly picked image \(\mathbf {V}^{\prime }\) to the visual object set. The number \(k_{a}\) of noise objects is sampled uniformly from 1 to \(A_{max}\). The selected visual object set and the added noise object set are merged into a new semantic image by concatenation, as sketched below.

$$\begin{aligned} k_{a}&=\mathrm {uniform}(1,A_{max})\\ \mathbf {V}_{add}&=\mathrm {select}(\mathbf {V}^{\prime },k_{a})\\ \mathbf {V}_{new}&=\mathrm {concat}(\mathbf {V}_{selected},\mathbf {V}_{add}) \end{aligned}$$
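
A per-image sketch of the two deformation steps, assuming object features stored as tensors. Note that the deformed set has \(k-k_{r}+k_{a}\) objects, so batching and padding details are omitted here.

```python
import torch

def semantic_deformation(V, V_noise, R_max=4, A_max=4):
    """Sketch of semantic object cropping and noise adding (Sect. 3.5) for a single image.
    V: [k, d_v] object features of the training image; V_noise: [k, d_v] objects of a
    randomly picked other image. Returns the deformed set and the number of added objects."""
    k = V.size(0)
    k_r = torch.randint(1, R_max + 1, (1,)).item()    # number of objects to remove
    k_a = torch.randint(1, A_max + 1, (1,)).item()    # number of noise objects to add
    keep = torch.randperm(k)[: k - k_r]               # randomly keep k - k_r objects
    add = torch.randperm(V_noise.size(0))[:k_a]       # randomly pick k_a noise objects
    V_new = torch.cat([V[keep], V_noise[add]], dim=0) # [k - k_r + k_a, d_v]
    return V_new, k_a
```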

Negative Example Loss. Intuitively, the added noise objects are, with high probability, unrelated to the question and the visual context. This irrelevance can be exploited as a self-supervised signal that tells the model where not to look. We apply a negative example loss that penalizes the model when the attention score on an added noise object exceeds a threshold.

$$\begin{aligned} L_{neg}=\sum _{g=1}^{m}\sum _{i=k-k_{r}+1}^{k-k_{r}+k_{a}}\mathrm {max}(0,\mathbf {A}_{i,g}-\tau ) \end{aligned}$$
(9)

where \(\mathbf {A}_{i,g}\) is the attention score on the i-th object in the g-th glimpse and \(\tau \) is the attention threshold for noise objects. The negative loss is added to the total loss with a factor \(\alpha _{neg}\).

$$\begin{aligned} L=L_{f}+\alpha _{b}L_{b}+\alpha _{neg}L_{neg} \end{aligned}$$
(10)

where \(\alpha _{neg}\) is the coefficient of the negative example loss.
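
A sketch of Eq. (9), assuming the noise objects occupy the last \(k_{a}\) rows of the deformed object set, as in the concatenation above:

```python
import torch

def negative_example_loss(A, k_a, tau=0.18):
    """Sketch of Eq. (9): hinge penalty on attention given to the appended noise objects.
    A: [k_new, m] attention weights over the deformed object set for one image
    (noise objects are the last k_a rows by construction); tau is the attention threshold."""
    noise_attn = A[-k_a:, :]                         # [k_a, m] attention on noise objects
    return torch.clamp(noise_attn - tau, min=0).sum()
```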

4 Experiments

4.1 Datasets

We evaluate our model on both VQA v2.0 [32] and VQA-CP v2.0 [33]. VQA v2.0 contains 204K images from the MS-COCO dataset [34] and 1.1M human-annotated questions. The dataset is built to alleviate the language bias present in VQA v1.0: it makes the image matter by building complementary pairs \((Question_{A}, Img_{A}, Ans_{A})\) and \((Question_{A}, Img_{B}, Ans_{B})\), which share the question but differ in image and answer. The dataset is divided into three splits: 443K questions for training, 214K for validation and 453K for testing. VQA-CP v2.0 re-splits the VQA dataset into new training and testing sets with changing priors; this setting requires the model to learn the grounded concepts in images rather than memorizing dataset bias. Each image-question pair has 10 human-annotated answers. The evaluation metric for a predicted answer is defined as:

$$\begin{aligned} \mathrm {Acc}(ans)=\min \left\{ \frac{\#\text {humans that said } ans}{3},1\right\} \end{aligned}$$
(11)
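
A small sketch of this metric as written in Eq. (11); the official evaluation additionally averages over annotator subsets and applies answer normalization, which is omitted here.

```python
def vqa_accuracy(pred_answer, human_answers):
    """Sketch of Eq. (11): soft VQA accuracy against the 10 human answers."""
    matches = sum(a == pred_answer for a in human_answers)
    return min(matches / 3.0, 1.0)

# e.g. 4 of 10 annotators said "red" -> accuracy 1.0; 2 of 10 -> ~0.67
print(vqa_accuracy("red", ["red"] * 4 + ["maroon"] * 6))   # 1.0
```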

4.2 Implementation Details

Model Setting. The hyper-parameters of the proposed model are as follows. The dimensions of the visual features \(d_{v}\), question feature \(d_{t}\), second order feature \(d_{e}\) and hidden feature \(d_{h}\) are set to 2048, 1024, 2048 and 2048 respectively. The number of candidate answers is set to 3129 according to answer occurrence frequency. The number of glimpses is \(m\in \left\{ 1,2,4,6,8\right\} \). We empirically set the branch loss factor \(\alpha _{b}\) to 0.2.

Training Setting. We use the Adamax optimizer with learning rate \(\min (t\times 10^{-3},4\times 10^{-3})\) at epoch t for the first 10 epochs, after which the learning rate is decayed by 1/5 every 2 epochs. The model is trained for 13 epochs with a gradient clip value of 0.25 and a batch size of 256. For the VQA v2.0 test-dev and test-std splits, we train the model on the training and validation splits plus the extra Visual Genome data. The performance on the VQA v2.0 validation split is evaluated with the model trained on the training split, and the result on the VQA-CP v2.0 test split is evaluated with the model trained on its training split.

Semantic Deformation Setting. We denote the maximum numbers of objects removed and added as \(R_{max}\) and \(A_{max}\) respectively; both are set to 4 by default. The negative sample loss factor \(\alpha _{neg}\) and the threshold \(\tau \) are set to 1.0 and 0.18 respectively.

Table 1. Results of SOMA and previous state-of-the-art methods on the VQA v2.0 test-dev and test-std splits. The accuracy of each answer type on the test-dev split is listed separately.

4.3 Results and Analysis

Results on VQA v2.0. First, we evaluate the SOMA model on the VQA v2.0 dataset. The results of our model and other attention-based methods are summarized in Table 1. The Bottom-up model, winner of the 2017 VQA v2.0 Challenge, utilizes visual features from Faster R-CNN. Multimodal Compact Bilinear Pooling (MCB) [8] adopts count-sketch projection to compute the outer product of visual and textual features in a lower-dimensional space. Multi-modal Factorized High-order Pooling (MFH) [18] cascades multiple bilinear fusion modules based on low-rank matrix factorization. MuRel [37] adopts bilinear fusion to represent the interactions between question and visual vectors. Dense Co-Attention Network (DCN) is composed of co-attention layers over the visual and textual modalities. Counter [38] specializes in counting objects in VQA by exploiting a graph of objects. In contrast to MFH, SOMA projects the visual and textual features into a shared embedding space and models intra-modality and cross-modality interactions simultaneously in the second order module. The results on VQA v2.0 show that SOMA improves over the Bottom-up baseline by a margin of 3% overall, which demonstrates the effectiveness of the second order module. Applying the semantic deformation strategy in training further boosts performance on all answer types. Since the model with semantic deformation is trained for the same number of epochs, this improvement comes essentially for free.

Table 2. Results of SOMA and previous state-of-the-art methods on the VQA-CP v2.0 test split. Models with * were trained by [39].
Fig. 2. Qualitative examples of the prediction results of SOMA on the VQA v2.0 dataset. In each example, the left part is the original image and the right part illustrates the attention. Below each image are the question, the ground-truth answer and the predicted answer.

Results on VQA-CP v2.0. In this experiment, we compare SOMA with other competitors on the VQA-CP v2.0 dataset. The Grounded Visual Question Answering model (GVQA) disentangles the recognition of visual concepts from answer identification. Bilinear Attention Network (BAN) [3] develops an effective way to utilize multiple bilinear attention maps in a residual manner. Bottom-up + AttAlign aligns the model attention with human attention to increase robustness. Bottom-up + AdvReg trains a VQA model together with a question-only model, using the latter as an adversary to discourage the VQA model from keeping language bias in its learned question feature. Table 2 shows that SOMA outperforms the Bottom-up model by a margin of 2.3% overall and is only slightly below Bottom-up + AdvReg. Notably, Bottom-up + AdvReg is specially designed to prevent the model from overfitting to the bias, whereas SOMA achieves this score without any such special design and performs well on both the VQA v2.0 and VQA-CP v2.0 datasets.

Qualitative Results. To better reveal the behavior of our model, we present some qualitative results. Specifically, we visualize the input image, the question and the predicted answer in Fig. 2. The examples show that SOMA attends to the question-related regions of the image when answering, which further validates the efficacy of our model.

Table 3. Ablation results on the VQA v2.0 validation split. SOMA w/o SO denotes the model without the second order feature. SOMA w/o BL denotes the model without the branch loss. All models use 4 glimpses.
Fig. 3. Accuracies of SOMA and its variants over different glimpse numbers \(G\in \left\{ 1,2,4,6\right\} \) on the VQA v2.0 validation split.

4.4 Ablation Study

Component Study. To investigate the contribution of each component, we train a full SOMA model with 4 glimpses as the baseline and compare it with two variants: (1) SOMA w/o SO, which removes the second order module over the multi-glimpse attention outputs, and (2) SOMA w/o BL, which removes the branch loss of each glimpse. As shown in Table 3, the overall performance of SOMA w/o SO and SOMA w/o BL drops by 0.17% and 0.36% respectively. Figure 3 further shows that the full model outperforms both variants for all glimpse numbers \(G\in \left\{ 1,2,4,6\right\} \) and all answer types. Notably, the overall performance of the full model with 2 glimpses is even better than that of the variants with 4 or 6 glimpses.

Table 4. Performance and model size of SOMA over the number of glimpses. Accuracy denotes the prediction accuracy on the VQA v2.0 validation split, Params denotes the total parameter size of the model, and FLOPs denotes the floating point operation cost. The computation cost is evaluated with 36 visual objects and a 7-word question over the glimpse numbers \(G\in \left\{ 1,2,4,6\right\} \).

Performance and Cost. It is important to investigate the relationship between performance and cost, especially for real-world applications. Table 4 quantitatively shows how accuracy, model size and computation cost (FLOPs) change with the number of glimpses. The results show that SOMA achieves the best performance with 4 glimpses.

4.5 Experiments of Data Augmentation

Table 5. The performance of SOMA with semantic deformation on VQA v2.0 val split. SOMA indicates the baseline with 4 glimpses. SOMA + C indicates cropping on the input visual features of SOMA. SOMA + CA denotes cropping and noise adding. SOMA + CAN represents cropping and noise adding with negative example loss.
Fig. 4. (a) Performance of semantic deformation for different deformation numbers N, where N denotes the maximum number of objects removed and added. The red dashed line denotes the baseline (\(N=0\)). (b) Accumulated attention over the M most attended objects. (Color figure online)

Fig. 5. Qualitative example of semantic deformation. The left image is the training image produced by semantic deformation: red boxes denote the bounding boxes of removed semantic objects, and patches with green frames represent the added noise objects. The right image visualizes the resulting attention maps. (Color figure online)

Data Augmentation Evaluation. To analyze semantic deformation, we propose several variants and perform an ablation study on the VQA v2.0 validation split. We train a SOMA model with 4 glimpses as the baseline. Three variants are built by gradually enabling semantic object cropping, noise object adding and the negative example loss. The results in Table 5 show that semantic cropping and noise object adding both improve over the baseline, and the negative example loss is effective once noise object adding is used. With all three techniques, the trained model achieves the best performance of 65.61% on the validation split. Furthermore, we conduct a series of experiments with different maximum numbers of removed and added objects. Figure 4(a) shows that the model with semantic deformation beats the baseline by a modest margin for maximum numbers from 1 to 6. Figure 4(b) shows that the accumulated attention of SOMA grows more slowly when trained with semantic deformation, which means the model gains robustness by attending to more related objects.

Data Augmentation Example. To qualitatively analyze why semantic deformation works, we visualize a randomly generated image from semantic deformation in Fig. 5. For simplicity, we do not plot all 36 bounding boxes, only those of the removed semantic objects and the added noise objects. In fact, the 36 bounding boxes overlap heavily, which makes the visual feature set highly redundant. Intuitively, we can see that the model is still able to answer the question with high probability from the remaining grounded visual information.

5 Conclusion

In this paper, we propose a Second Order enhanced Multi-glimpse Attention (SOMA) model for visual question answering. SOMA adopts a second order module to explicitly model both intra-modality and cross-modality interactions in a shared embedding space over the multi-glimpse outputs and the question feature. A branch loss is added to give each glimpse better feature learning and attention ability. Furthermore, we advocate a novel semantic deformation method as data augmentation for VQA, which generates new images in the semantic space by semantic object cropping and noise object adding. A negative example loss is introduced to provide a self-supervised signal for where not to look. Experiments on VQA v2.0 and VQA-CP v2.0 demonstrate the effectiveness of SOMA and semantic deformation. In future work, we would like to design a better strategy for picking noise objects and to apply semantic deformation to more multi-modal tasks.