
1 Introduction

Visual Question Answering (VQA) [1] requires providing correct answers to questions posed about given images. Compared with traditional Question Answering systems, the retrieval and reasoning must be grounded in the image content, which involves object localization, scene classification and knowledge reasoning. VQA can be easily extended to other tasks and plays a significant role in practical scenarios such as automobile navigation, medical systems and education systems.

In this paper, we propose a multi-modal feature fusion method for combining image and question features. The central idea is to use a Variational Autoencoder (VAE) [2] to compute hidden codings of the image and question features and then fuse them in the hidden layer, obtaining an associated image-question representation for improved answer reasoning. We apply the fusion method to a basic model and verify its validity on the VQA 2.0 dataset. Subsequently, when decoding the attention weights, a sampling step is added to the attention mechanism of the VQA task to increase randomness, and a hierarchical attention mechanism model is designed on top of the hidden variables, generating a more general attention weighting matrix that can weight both image and question features. Experimental results show that our model further improves the accuracy of VQA tasks.

The remainder of this paper is organized as follows: Sect. 2 reviews related work of recent years. Sect. 3 presents the implementation details of the model. Sect. 4 compares the basic models with our model on the VQA 2.0 dataset. Sect. 5 concludes the paper.

2 Related Work

2.1 Visual Question Answering

The VQA task was proposed in 2015. In recent studies, the majority of VQA methods are based on neural networks: Convolutional Neural Networks (CNNs) [3] are generally used to extract image features, whereas Recurrent Neural Networks (RNNs) [4] are used to extract question features. The two features are then fused into a new feature, which is used for answer reasoning (see Fig. 1).

Fig. 1. Simple model of the VQA task.

Recently, most VQA models use VGGNet [5] or ResNet [6] to extract image features. Girshick et al. [7] proposed using the Fast Region-based Convolutional Network (Fast-R-CNN) to extract features of the multiple objects in an image and obtained new state-of-the-art results. Meanwhile, almost all VQA models use GRU [8] or LSTM [9] to extract question features, since these two models can efficiently capture the contextual information of the question. Multi-modal feature fusion is the method for associating image features with question textual information. Fukui et al. [10] first introduced the bilinear model into multi-modal feature fusion in VQA; they proposed Multi-modal Compact Bilinear (MCB) pooling and achieved good results. Yu et al. [11] then designed Multi-modal Factorized High-order (MFH) pooling to improve the results further.

The answer reasoning step in VQA is simple: a limited answer set is constructed according to answer frequency, and answer prediction is treated as a classification task over this set.
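A minimal sketch of this answer-set construction, assuming a list of training answers is available; the vocabulary size of 3000 is an illustrative choice, not a value reported in this paper.

```python
from collections import Counter

def build_answer_vocab(train_answers, num_answers=3000):
    """Keep only the most frequent answers; answer prediction then becomes a
    num_answers-way classification problem over this fixed set."""
    counts = Counter(train_answers)
    top = [ans for ans, _ in counts.most_common(num_answers)]
    return {ans: idx for idx, ans in enumerate(top)}

# Samples whose ground-truth answer falls outside the vocabulary are usually
# discarded during training; at test time the model predicts from the fixed set.
```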

2.2 Attention Mechanism

Recently, the attention mechanism, which identifies the most relevant words or phrases in a text, has been successfully applied in the field of natural language processing. In VQA, researchers use the attention mechanism to discover the image regions that are most related to the semantic information of the question (see Fig. 2).

Fig. 2. Simple attention mechanism model for VQA.

A question-guided image attention mechanism was proposed in [12]; this method assigns attention weights to image features based on the question features. A co-attention mechanism that associates images and questions was introduced in [13] and became the baseline method at that time. The authors of [14] first brought the multi-object feature extraction method from the field of object detection into the VQA model and named it bottom-up attention. Compared with weighting the attention over the whole image, this model can focus directly on the image objects themselves by weighting attention over the detected objects; it achieved a significant improvement and became one of the best models in the field of VQA.

2.3 Variational Autoencoder

VAE is a probabilistic approximation model that combines variational inference with the autoencoder structure. Given two variables x and z, variational inference uses a simple distribution q(z) to approximate the complex posterior distribution p(z|x), with the Kullback-Leibler (KL) divergence measuring the distance between the two probability distributions:

$$\begin{aligned} KL(q(z) \| p(z | x))=\int q(z) \ln \frac{q(z)}{p(z | x)} \, dz \end{aligned}$$
(1)

The smaller the KL distance, the closer the two probability distributions are, so the goal is to minimize the KL distance. Further derivation yields:

$$\begin{aligned} \ln p(x)-KL(q(z) \| p(z | x))=\int q(z) \ln p(x | z) \, dz-KL(q(z) \| p(z)) \end{aligned}$$
(2)
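For completeness, the step from (1) to (2) follows from Bayes' rule, \(p(z|x)=p(x|z)p(z)/p(x)\), substituted into the integrand of (1):

$$\begin{aligned} KL(q(z) \| p(z | x))=\int q(z) \ln \frac{q(z)\, p(x)}{p(x | z)\, p(z)} \, dz = \ln p(x) - \int q(z) \ln p(x | z) \, dz + KL(q(z) \| p(z)), \end{aligned}$$

which rearranges to exactly the form of (2).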

VAE assumes that q(z) obeys a normal distribution \(N\left( \mu , \sigma ^{2}\right) \) and that p(z) obeys the standard normal distribution N(0, I). The optimization objective of the model can be expressed as:

$$\begin{aligned} L=E_{x \sim p(x)}\left[ -\ln q(x | z)+KL(p(z | x) \| q(z)) \right] , \quad z \sim p(z | x) \end{aligned}$$
(3)

Here, \(-\ln q(x | z)\) measures the distance between the generated and the real values. The KL distance can be calculated in closed form as:

$$\begin{aligned} KL\left( N\left( \mu , \sigma ^{2}\right) \| N(0, I)\right) =\frac{-\sum \log \left( \sigma ^{2}\right) -d+\sum \sigma ^{2} +\mu ^{T} \mu }{2} \end{aligned}$$
(4)
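A minimal PyTorch-style sketch of the objective in Eqs. (3) and (4) is given below; the `encoder` and `decoder` modules and the Gaussian (squared-error) reconstruction term are illustrative assumptions, not the authors' implementation.

```python
import torch

def vae_loss(x, encoder, decoder):
    # Encoder outputs the parameters of the approximate posterior N(mu, sigma^2 I).
    mu, log_var = encoder(x)

    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
    # so the sampling step stays differentiable.
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps

    # Reconstruction term, the -ln q(x|z) part of Eq. (3)
    # (a Gaussian likelihood here, i.e. a squared error).
    recon = ((decoder(z) - x) ** 2).sum(dim=-1)

    # Closed-form KL(N(mu, sigma^2) || N(0, I)) from Eq. (4):
    # 0.5 * (sum(sigma^2) + mu^T mu - d - sum(log sigma^2)).
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=-1)

    return (recon + kl).mean()
```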

The structure of VAE (see Fig. 3) makes it a generative model: by sampling from the learned probability distribution and decoding the samples, new data are generated with the same distribution as the training data. Therefore, this model, together with Generative Adversarial Networks (GAN) [15], is widely used in the field of image generation.

Fig. 3. Model of the Variational Autoencoder.

3 Proposed Method

3.1 Multi-modal Feature Fusion

Traditional VQA fusion methods only consider the external representation of the features and ignore the important hidden links between images and questions, thereby losing information during fusion.

Current multi-modal feature fusion methods are based on computing the outer product, or an approximation of the outer product, of the features. This limits their applicability: the computation requires numerous parameters and a heavy computational load, and the dimension-reduction methods used to ease the computation are sensitive to hyperparameters and slow down the convergence of the model.
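For illustration only (the dimensions are assumed here, not taken from this paper), a full bilinear fusion of a 2048-dimensional image feature and a 1024-dimensional question feature into a 3000-way answer space would already require

$$\begin{aligned} 2048 \times 1024 \times 3000 \approx 6.3 \times 10^{9} \end{aligned}$$

parameters, which is why existing methods resort to compact or factorized approximations of the outer product.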

Our first contribution is to use VAE to address the above problems (see Fig. 4). In this model, we use ResNet to extract image features and LSTM to extract question features. We then apply the VAE model described in Sect. 2.3 to compute the probability distributions of the hidden vectors of the features. Finally, the hidden variables of the features are sampled and fused.

Fig. 4. Structure of the multi-modal feature fusion model.

The algorithm for calculating the probability distribution of the hidden variables is shown in Algorithm 1. The extracted image hidden variables are multiplied by the question hidden variables, and the result is fed into a fully connected layer. By locally adjusting the model structure, several variants are obtained: calculating the hidden-variable distribution for the image features only, for the question features only, or for both image and question features simultaneously. To preserve the association between image and question, we fuse the image hidden variable with the question hidden variable and then decode the merged feature. We also attempt to feed the image and question hidden variables into the multi-modal factorized bilinear pooling method.

Algorithm 1. Calculation of the probability distribution of the hidden variables.
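Since Algorithm 1 is only available as an image, the following minimal sketch illustrates the fusion described above, assuming ResNet image features and LSTM question features have already been extracted; the module names, dimensions and element-wise product fusion are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class HiddenVariableFusion(nn.Module):
    """Sketch of the I_h / Q_h fusion: encode each modality into (mu, log_var),
    sample with the reparameterization trick, and fuse the two samples."""

    def __init__(self, img_dim=2048, q_dim=1024, latent_dim=512, num_answers=3000):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, 2 * latent_dim)   # outputs mu and log_var
        self.q_enc = nn.Linear(q_dim, 2 * latent_dim)
        self.classifier = nn.Linear(latent_dim, num_answers)

    @staticmethod
    def sample(stats):
        mu, log_var = stats.chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * log_var) * eps

    def forward(self, img_feat, q_feat):
        z_img = self.sample(self.img_enc(img_feat))   # image hidden variable I_h
        z_q = self.sample(self.q_enc(q_feat))         # question hidden variable Q_h
        fused = z_img * z_q                           # element-wise product fusion
        return self.classifier(fused)                 # scores over the answer set
```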

3.2 Variational Attention Mechanism

Our second contribution is to introduce a variational attention mechanism into the multi-modal feature fusion process to reduce the parameter complexity.

We use Faster-R-CNN to extract multi-object features from the images. An implicit variable model of attention is then established on the basis of the bottom-up attention mechanism and variational inference. Finally, a method for fusing attention-weighted features from multiple samples is designed (see Fig. 5).

Fig. 5. Calculation of attention weight with VAE.

Fig. 6. Variational attention mechanism model.

Algorithm 2. Question feature hierarchy.

To obtain a more general expression of the attention weights for local question features, the question text features need to be generated hierarchically. However, the model above still calculates attention weights over the raw image many times, and the number of parameters required for the untreated image features leads to inefficiency. We therefore sample the hidden variables of the image several times and calculate the attention weights of the image features by combining each sample with the features of the previous layer. Because the question text features have a low dimension after hidden-variable coding and layering, the number of parameters is greatly reduced and the training speed of the model is improved (see Fig. 6). Algorithm 2 shows the algorithm for the question feature hierarchy.
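Algorithm 2 is likewise only available as an image; the sketch below illustrates the idea under stated assumptions: the per-object image hidden variables are sampled once per layer, each sample is combined with a low-dimensional, layered question feature to produce attention weights over the object features, and the attention-weighted features of all layers are averaged. All module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalVariationalAttention(nn.Module):
    """Sketch: repeated sampling of per-object image hidden variables drives
    layered attention guided by hierarchical question features."""

    def __init__(self, obj_dim=2048, q_dim=256, latent_dim=512, num_layers=2):
        super().__init__()
        self.img_enc = nn.Linear(obj_dim, 2 * latent_dim)   # per-object (mu, log_var)
        self.att = nn.Linear(latent_dim + q_dim, 1)         # per-object attention score
        self.num_layers = num_layers

    def forward(self, obj_feats, q_layers):
        # obj_feats: (batch, num_objects, obj_dim) from the object detector
        # q_layers:  list of num_layers question features, each (batch, q_dim)
        mu, log_var = self.img_enc(obj_feats).chunk(2, dim=-1)
        attended = []
        for layer in range(self.num_layers):
            # one sample of the image hidden variables per layer
            z_img = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
            q = q_layers[layer].unsqueeze(1).expand(-1, obj_feats.size(1), -1)
            scores = self.att(torch.cat([z_img, q], dim=-1))   # (batch, num_obj, 1)
            weights = F.softmax(scores, dim=1)
            attended.append((weights * obj_feats).sum(dim=1))  # weighted image feature
        # fuse the attention-weighted features from all layers (here: average)
        return torch.stack(attended, dim=0).mean(dim=0)
```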

4 Experiment

4.1 Datasets

The dataset used in the experiments is the test-dev split of VQA 2.0. The images come from the MS-COCO dataset and number 123,287 in total, of which 72,738 are used for training and 38,948 for testing. Each image has corresponding questions and answers. The answers fall into three types: yes/no, number and other, corresponding to judgment, counting and open-ended questions respectively. On the VQA 2.0 dataset, accuracy is not simply the proportion of correctly answered samples; it is calculated as follows:

$$\begin{aligned} Acc=\frac{1}{M} \sum _{i=1}^{M} \min \left\{ \frac{\text {humans that provided that answer}}{3},1\right\} \end{aligned}$$
(5)

where M represents the total number of test samples, and "humans that provided that answer" indicates the number of human annotators in VQA 2.0 whose collected answer matches the answer predicted by the model.
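A direct implementation of Eq. (5) as stated is sketched below (the official VQA evaluation additionally averages over subsets of annotators, which is omitted here).

```python
def vqa_accuracy(predictions, human_answers):
    """Eq. (5): each prediction scores min(#matching human answers / 3, 1).
    human_answers holds the ten annotator answers collected per question in VQA 2.0."""
    total = 0.0
    for pred, answers in zip(predictions, human_answers):
        matches = sum(1 for a in answers if a == pred)
        total += min(matches / 3.0, 1.0)
    return total / len(predictions)

# Example: if 2 of the 10 annotators gave the predicted answer, the sample scores 2/3.
```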

4.2 Configurations

During training, the neural networks built for the different models use the same hyperparameters. Table 1 lists the key parameters, where Weight_VQAVae denotes the weights used in the multi-modal feature fusion method and Weight_VQAVaeAtt denotes the weights used in the variational attention mechanism.

4.3 Results

Multi-modal Feature Fusion Results. The different variants of the model are denoted as follows, and Table 2 shows the experimental results.

  • \(I_{-}h\): Image feature hidden variable

  • \(Q_{-}h\): Question feature hidden variable

  • Concat: Fusion of the features by concatenation

  • OuterProduct (OP): Fusion of the features by outer product

  • Merge: Combination of the hidden variables based on bilinear pooling

Table 1. Key parameters of our methods.

The experimental results show that encoding the image features as hidden variables while leaving the question features unencoded, together with fusing the features via the outer product, significantly improves the accuracy of the VQA model. We name the variant with the best experimental results, \(I_{-}h+Q+OP+Merge\), VQAVae, train it jointly on the training and validation sets, and evaluate its accuracy on the test set. Table 3 shows the results compared with existing basic VQA models.

Table 2. Experimental results of multi-modality feature fusion method.
Table 3. VQAVae compared with existing base models.

The results show that the proposed multi-modal feature fusion method based on variational inference outperforms most of the existing basic VQA models. A possible explanation is that the question text is a discrete word sequence, whereas the image features are continuous. When decoding the new features and calculating the error with respect to the original features, the image features can be restored to the original features more faithfully. During training, this difference is easily optimized as part of the loss, so the probability distribution of the hidden-variable coding can be calculated more accurately.

Variational Attention Mechanism Results. The methods used are denoted as follows:

  • ResNet (Res): Extracting image features with a residual network

  • Fast-R-CNN (FRC): Extracting multi-object image features with Fast-R-CNN

  • Qatt: Question self-attention mechanism

  • Iatt: Question-guided attention mechanism for images

  • Concat: Concatenation of multiple sampled features (see the sketch after this list)

  • Average (Ave): Weighted average of multiple sampled features
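A minimal sketch of the two fusion strategies for the repeatedly sampled attention-weighted features; the uniform weights in the average are an assumption for illustration (in general they may be learned).

```python
import torch

def fuse_samples(samples, mode="average"):
    # samples: list of K attention-weighted features, each of shape (batch, feat_dim)
    if mode == "concat":
        # Concat: splice the K samples along the feature dimension
        return torch.cat(samples, dim=-1)               # (batch, K * feat_dim)
    # Average (Ave): blend the K samples; uniform weights here for illustration
    return torch.stack(samples, dim=0).mean(dim=0)      # (batch, feat_dim)
```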

Table 4 presents the comparative experimental results. We name the model with the best experimental results on the validation set VQAVaeAtt, train it jointly on the training and validation sets, and calculate its accuracy on the test set. Table 5 shows the results compared with existing basic VQA models.

The proposed attention mechanism based on VAE outperforms most existing VQA models because (1) the attention mechanism is modeled as an implicit variable model and the probability distribution of the attention weights is calculated through the VAE; (2) multiple samplings of the attention weights are added to the model; and (3) the attention weights of the image are calculated in combination with the feature information of the question segments, which helps to obtain additional information. In summary, modeling the attention mechanism as an implicit variable model on top of the complete attention-weighted image features is a novel and effective method.

Table 4. Experimental results of the variational attention method
Table 5. VQAVaeAtt compared with existing base models.

5 Conclusion

This study investigates feature fusion in VQA, covering both multi-modal feature fusion and the attention mechanism. VAE is introduced to overcome the limitations of existing methods and greatly improves the accuracy of the VQA model. The main contributions of this work consist of two parts:

First, VAE is introduced to calculate the probability distributions of the hidden variables of the image and question text features, and a multi-modal feature fusion method based on these hidden variables is designed, which effectively reduces the computational complexity of the model. Furthermore, the random sampling increases the model's resistance to over-fitting. Comparative experiments show that the model effectively improves the accuracy of VQA tasks.

Second, we introduce a variational attention mechanism into the multi-modal feature fusion process. Based on the VAE model, the question text information is used to guide the autoencoding of the image features in accordance with their attention weights. As such, a hierarchical attention mechanism is established that effectively reduces the number of parameters required by the model. Comparative experiments show that the variational attention mechanism can further improve the model accuracy on VQA tasks.