1 Introduction

With breakthroughs in deep learning techniques [1,2,3,4,5], visual question answering (VQA) [6] has been gaining widespread attention. As a representative task in the multimodal domain, VQA requires cross-modal understanding of vision and language and the ability to reason over fused vision-language features. In the past few years, VQA models have been greatly enhanced [3, 4, 7] by transformers [1]. These gains are widely attributed to the self-attention (SA) mechanism, which performs multimodal alignment and offers a degree of interpretability for VQA tasks.

When humans answer a question, they usually make it easier by eliminating wrong answers; however, VQA models have no such ability. Previous approaches directly feed the questions and images into a transformer and align them with SA. In this process, irrelevant regions always participate in the computation, putting the transformer at risk of being misled. To address this problem, we follow recent advances in computer vision. In PIT [8], the authors introduced dimensional change into the transformer: as the transformer layers become deeper, the number of patches gradually decreases while the feature dimension of each patch increases. This dimensional change improves both the performance and the generalization ability of the model. Inspired by this, we observe that the change in patch dimensions can be understood as fusion and shrinkage between patches, which is particularly important in VQA. As shown in Fig. 1, if we fuse the region features of water and bird in the image into bird near the water, we can directly compute the relevance of bird near the water to the question when answering "What color is the bird near the water?", instead of treating them separately. This region fusion reduces the number of regions, lowers the difficulty of the question, and affects the final performance of the model.

Fig. 1 Illustration of alignment of the image regions to the question in VQA

In detail, a high-performance multimodal transformer should be able to (1) associate different objects in an image to generate composite features, thereby reducing the difficulty of the question, and (2) progressively eliminate wrong answers during multi-step reasoning so that the transformer is not misled. We therefore propose the shrinkage transformer visual question answering (ST-VQA) framework, whose technical contribution is to give the transformer both abilities at the same time. Specifically, ST-VQA consists of an MCAN [3] backbone and our proposed shrinkage fusion module (SFM). MCAN is a standard two-stream transformer that aligns image and text features in a high-level semantic space. The SFM is embedded into MCAN as a simple plug-and-play module. Between two adjacent transformer layers, the SFM generates composite region features by feature fusion and simultaneously reduces the number of regions. The composite region features are used for subsequent inference and feature updates, for multimodal alignment, and for the answer decision in the final layer of MCAN. For an effective comparison, we construct the ST baseline, which is a simple combination of MCAN and a tokenwise multilayer perceptron (MLP) [9]. The tokenwise MLP acts directly on the feature map and reduces the number of regions by mapping features along the token dimension, yielding a simple comparison scheme for ST-VQA.

Finally, to validate the effectiveness and interpretability of ST-VQA, we conducted a series of ablation and comparison experiments on three benchmark datasets: VQA-v2 [6], GQA [10], and COCO-QA. ST-VQA achieves a clear performance improvement over the traditional MCAN model. Our proposed method is simple and effective, and interested readers can try it in other model architectures or other tasks. In summary, the main contributions of this paper are threefold:

  1. We focus on the impact of the token number in the transformer and verify that the performance of the transformer will be improved by our proposed SFM.

  2. We explore the reasons for the effectiveness of the ST-VQA framework through extensive ablation experiments, which provide sufficient evidence for model performance.

  3. We reveal the factors that affect the transformer's alignment capability, which may be crucial for the study of transformer interpretability.

The rest of this paper is organized as follows: Section 2 introduces the visual question answering model and some inspiring work related to this paper. Section 3 presents the overall structure and technical details of the ST-VQA framework and demonstrates its advantages by comparison with recent studies. In Section 4, we evaluate ST-VQA and compare it with other advanced methods. The conclusions are given in Section 5.

2 Related work

2.1 Visual question answering

VQA is a multimodal task that requires the model to answer a question based on an input image. Answering is formulated as a classification task: the model obtains multimodal features from the image and the question and then classifies them to select the correct answer from an answer dictionary. VQA has grown by leaps and bounds with the emergence of many benchmark datasets and methods. The widely popular tasks include the following three kinds: the VQA Challenge [11, 12] on the VQA-v2 [6] dataset, reasoning work [13, 14] on the GQA [10] dataset, and debiasing work [15, 16] on the VQA-CP [17] dataset. The pretraining and fine-tuning paradigm has achieved great success in the NLP domain in recent years, and extending these ideas to the VQA domain with large-scale image-text pairs from the internet has led to the new trend of vision-language pretraining (VLP), producing many excellent works such as Oscar [5] and VILT [18].

2.2 Transformers in VQA

The popularity of transformers in the field of VQA benefits from BUTD and MCAN. Anderson et al. proposed the BUTD [19] framework, which extracts vision and question features separately and uses an element-wise product to generate multimodal fusion features; a linear layer then classifies this fused feature to obtain the correct answer. Building on this, Kim et al. proposed the bilinear attention network (BAN [20]) to improve multimodal fusion. Guo et al. proposed the bilinear graph network (BGN [21]) and performed cross-relational reasoning with question graphs and image graphs. This was an early attempt to introduce attention mechanisms into the VQA task, and although significant results were achieved, it was limited by the complexity of the method. After that, Zhou et al. proposed MCAN [3] to align word and object features using the transformer architecture and obtain better-performing fusion features. Since MCAN's breakthrough performance, most researchers have considered MCAN the standard transformer in the field of VQA. Specifically, MCAN consists of three types of attention mechanisms with different purposes, which model the relationships between word-word, region-region, and word-region pairs. Thus, the input question words and image objects are aligned, and a well-performing multimodal fusion feature is obtained.

2.3 Alignment capability in the transformer

The alignment capability of the attention mechanism directly affects the performance of the transformer. Recent studies [4] have found that reasonable control of attention spans is beneficial to the model. In the field of VQA, SA provides powerful cross-modal alignment capability; however, when the critical question words and critical image regions are spatially dense, noise can mislead a standard transformer (e.g., MCAN) and negatively affect model performance. One solution is to use external guidance in the SA layer. Zhou et al. [4] used image objects as nodes and generated an adjacency matrix as an external guide for SA; this guidance converts global attention into local attention, improving the performance of the transformer. Another potential solution is to reduce the number of regions and make them sparse. As the transformer deepens, Heo et al. [8] build pooling layers with a CNN to reduce the number of regions and increase the dimensional representation of each region. Wang et al. [22] use different cuts to reduce the number of regions at each region embedding layer without focusing on the dimension size of each region. Both construct a pyramid-shaped transformer. Our approach belongs to the latter direction but differs from previous work: they define regions in a pixel-based way, where each region block contains part of the local information of the image, and these regions are spatially contiguous and meaningful. In the VQA transformer, regions are defined as mutually independent region features, so we overcome this difference by fusing the regions.

3 Method

In this section, we present our model design. The ST-VQA framework consists of three novel components: (1) a shrinkage transformer backbone, (2) an MLP-based SFM, which reduces the number of regions and fuses their features, and (3) a contrast loss, which optimizes the training process. In the following subsections, we describe them in detail.

3.1 Backbone

We build the shrinkage transformer as our backbone, which is based on the representative standard transformer (MCAN [3]). Figure 2 describes the differences in the image encoder between our backbone and MCAN. Since the attention layer of the transformer does not change the input feature dimensions, in the shrinkage transformer baseline, we reduce the number of regions with a tokenwise MLP.

Fig. 2 Illustration of the structure of the shrinkage transformer backbone in the second stage. Compared with the standard transformer, the shrinkage transformer adds the shrinkage module between transformer layers. The shrinkage module generates composite region features while reducing the number of region features (from m to m/2)

For a fair comparison with previous work, the process of obtaining answers is divided into three stages.

The first stage is feature extraction. For the question, we keep the first n words of each sentence; the retained words are embedded with GloVe initialization, and an LSTM encoder updates the word features. For the image, we extract m regions per image with a pretrained Faster R-CNN [23], and each region is represented by a 1D feature vector of length 2048. Then, we project each region onto a dim = 512 embedding space using a shared projection matrix. After feature extraction, the question features have dimension \(\mathbb {R}^{n \times 512}\) and the image features have dimension \(\mathbb {R}^{m \times 512}\).
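As a concrete illustration of this stage, the following PyTorch-style sketch builds the question and image embeddings with the stated dimensions; the class name, GloVe loading, and vocabulary size are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Minimal sketch of stage 1: embed question words (GloVe + LSTM) and project
    2048-dim region features into a shared 512-dim space."""
    def __init__(self, vocab_size, glove_weights=None, d_model=512, region_dim=2048):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, 300)      # GloVe-initialized in practice
        if glove_weights is not None:
            self.word_embed.weight.data.copy_(glove_weights)
        self.lstm = nn.LSTM(300, d_model, batch_first=True)  # updates the word features
        self.region_proj = nn.Linear(region_dim, d_model)    # shared projection matrix

    def forward(self, question_tokens, region_feats):
        # question_tokens: (B, n) token ids, region_feats: (B, m, 2048)
        q, _ = self.lstm(self.word_embed(question_tokens))   # (B, n, 512)
        v = self.region_proj(region_feats)                   # (B, m, 512)
        return q, v

# Example shapes: n = 14 words, m = 64 regions
q, v = FeatureEmbedding(vocab_size=20000)(torch.randint(0, 20000, (2, 14)),
                                          torch.randn(2, 64, 2048))
```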

In the second stage, we use an L-layer transformer with two parallel encoders to align the question features and image features. Each encoder consists of multi-head self-attention, a feed-forward network (FFN), and residual blocks. The self-attention calculation is described by (1).

$$ \text{attention}(Q, K, V)=\text{softmax}\left( \frac{Q \cdot K^{T}}{\sqrt{d}}\right) V $$
(1)

During the calculation, the question encoder calculates the word-word correlation in the question according to (2). The image encoder contains both SA and guide attention (GA) parts, and the SA calculates the region-region correlation in the image according to (3), and the GA calculates the word-region correlation according to (4).

$$ \text{SA}(Q) = \text{attention}\left(\mathrm{Q}=QW^{Q},\ \mathrm{K}=QW^{K},\ \mathrm{V}=QW^{V}\right) $$
(2)
$$ \text{SA}(I) = \text{attention}\left(\mathrm{Q}=IW^{Q},\ \mathrm{K}=IW^{K},\ \mathrm{V}=IW^{V}\right) $$
(3)
$$ \text{GA}(Q, I) = \text{attention}\left(\mathrm{Q}=IW^{Q},\ \mathrm{K}=QW^{K},\ \mathrm{V}=QW^{V}\right) $$
(4)
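To make the encoder computation concrete, here is a compact single-head sketch of (1)-(4); the real encoders are multi-head and include the FFN and residual blocks, and all module names here are illustrative.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Scaled dot-product attention, (1)
    d = Q.size(-1)
    return F.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1) @ V

class SingleHeadAttention(torch.nn.Module):
    """Single-head sketch of SA (2)/(3); GA (4) only changes where K and V come from."""
    def __init__(self, d_model=512):
        super().__init__()
        self.Wq = torch.nn.Linear(d_model, d_model)
        self.Wk = torch.nn.Linear(d_model, d_model)
        self.Wv = torch.nn.Linear(d_model, d_model)

    def forward(self, x, context=None):
        # SA: context is None, so queries/keys/values all come from x.
        # GA: x are image regions and context are question words (K, V from context).
        c = x if context is None else context
        return attention(self.Wq(x), self.Wk(c), self.Wv(c))

q = torch.randn(2, 14, 512)   # question words
v = torch.randn(2, 64, 512)   # image regions
attn = SingleHeadAttention()
q_sa = attn(q)       # word-word, (2)
v_sa = attn(v)       # region-region, (3)
v_ga = attn(v, q)    # word-region guided attention, (4)
```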

In the third stage, the question and image features pass through attention pooling separately, and we use element-wise summation to obtain the vision-language feature. Finally, we feed the vision-language feature into a linear classification layer to obtain the answer to the question.

3.2 Shrinkage fusion module

In the baseline, we reduce the number of regions with a tokenwise MLP. However, direct projection has two obvious drawbacks: (1) linear projection lacks nonlinear representational ability, and (2) such a projection applies the same parameter matrix to different input features and thus lacks adaptability to the input content. To address these issues, we propose an MLP-based SFM. A schematic of the module is shown in Fig. 3. We define the core transformation as Y = PX. To treat different input features differently and increase adaptability to the input content, the transform matrix P is dynamically generated from the content of the input X. Assume the input feature \(X \in \mathbb {R}^{N \times d}\), where N is the number of regions and d is the feature dimension of each region. With a set of N1 fusion blocks, we generate the transform matrix P from X. The process is described as follows:

$$ \widehat{X} = \text{MLP}(X) $$
(5)
$$ P_{i} = \text{softmax}\left(\text{MLP}^{(i)}\left(\text{flat}(\widehat{X})\right)\right) $$
(6)

The MLP consists of a hidden layer, a nonlinear activation function, and a dropout layer. First, the feature X is mapped from \(\mathbb {R}^{d}\) to \(\mathbb {R}^{d_{1}}\) by (5). Then, (6) generates each row of the transform matrix P in the same way. Here, flat(⋅) is a flattening operation, and the softmax operates on the row vector. The transform matrix P contains the weights with which all input regions are fused into the new regions.
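The following sketch illustrates (5)-(6) and Y = PX for the unsplit case; the hidden size d1, the output region count N1, and the dropout rate are illustrative hyperparameters, not values from the paper.

```python
import torch
import torch.nn as nn

class ShrinkageFusion(nn.Module):
    """Sketch of the basic SFM: generate P from the content of X ((5)-(6)), then Y = P X."""
    def __init__(self, n_regions=64, n_out=32, d=512, d1=64, dropout=0.1):
        super().__init__()
        # (5): map each region feature from R^d to R^{d1}
        self.mlp = nn.Sequential(nn.Linear(d, d1), nn.GELU(), nn.Dropout(dropout))
        # (6): one fusion block per output region, acting on the flattened features
        self.fusion_blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(n_regions * d1, n_regions), nn.GELU(), nn.Dropout(dropout))
            for _ in range(n_out)
        )

    def forward(self, x):                       # x: (B, N, d)
        flat = self.mlp(x).flatten(1)           # (B, N * d1)
        rows = [torch.softmax(fb(flat), dim=-1) for fb in self.fusion_blocks]
        p = torch.stack(rows, dim=1)            # (B, N1, N), each row sums to 1
        return p @ x                            # Y = P X: (B, N1, d)

y = ShrinkageFusion()(torch.randn(2, 64, 512))  # -> (2, 32, 512)
```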

Fig. 3 Description of the SFM. The function of the SFM (left) consists of two parts: (1) generating the fusion matrix P by the FB (right) and (2) fusing the features of the input X using the fusion matrix P

Inspired by recent work [24], we split the features of X and perform the fusion operation on each split part separately to improve the robustness of the SFM. This process is described as follows:

$$ \widehat{X}^{(s)} = \text{MLP}^{(s)}\left(X^{(s)}\right) $$
(7)
$$ P_{i}^{(s)} = \text{softmax}\left(\text{MLP}^{(i,s)}\left(\text{flat}\left(\widehat{X}^{(s)}\right)\right)\right) $$
(8)

where X(s) is the s-th segment of X. First, each segment of the input X is mapped through a different MLP, and the mapping size is a hyperparameter. Similar to (6), we generate \(P_{i}^{(s)}\) from X(s) using (8). Repeating this step yields each row of P(s), the s-th segment of the transform matrix P. The transformation is thus defined as follows:

$$ Y=\left[P^{(1)} X^{(1)}, P^{(2)} X^{(2)}, \cdots, P^{(s)} X^{(s)}\right] $$
(9)

where [∗,∗] is the concatenation operation applied to each row. Through the above operations, the fusion and shrinkage of the input regions are achieved. Note that the SFM is embedded after the GA block of the shrinkage transformer: the input of the SFM is the attention map output by GA, and the fused output feature is passed to the SA of the next layer to update the region-region correlations. After the first layer of SA and GA operations, the originally independent image region features have already been updated through self-attention. The SFM generates new regions from the input regions in the same way for every row, which leads to structural similarity among the new regions. As described in Section 4.5, computing dot-product attention between these structurally similar new regions achieves better results.
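Building on the sketch above, the split variant (7)-(9) can be illustrated as follows; the number of segments and the choice to concatenate the per-segment outputs along the feature dimension (our reading of the row-wise connection in (9)) are assumptions.

```python
import torch
import torch.nn as nn
# Reuses the ShrinkageFusion sketch defined above.

class SplitShrinkageFusion(nn.Module):
    """Sketch of (7)-(9): split X along the feature dimension, fuse each segment X^(s)
    with its own block, and join the corresponding rows of the results."""
    def __init__(self, n_regions=64, n_out=32, d=512, n_splits=4, d1=64):
        super().__init__()
        assert d % n_splits == 0
        self.n_splits = n_splits
        self.blocks = nn.ModuleList(
            ShrinkageFusion(n_regions, n_out, d // n_splits, d1) for _ in range(n_splits)
        )

    def forward(self, x):                                  # x: (B, N, d)
        segments = x.chunk(self.n_splits, dim=-1)          # S segments X^(s): (B, N, d/S)
        fused = [blk(seg) for blk, seg in zip(self.blocks, segments)]  # each (B, N1, d/S)
        return torch.cat(fused, dim=-1)                    # Y: (B, N1, d), (9)

y = SplitShrinkageFusion()(torch.randn(2, 64, 512))        # -> (2, 32, 512)
```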

3.3 Loss function

In the ST-VQA framework, the overall loss function consists of two parts: the cross-entropy loss between the predicted answer \(A^{\prime }\) and the ground-truth answer Agt, and the loss imposed on the SFM, denoted by LCE and LSFM, respectively. To make the regions fused by the SFM structurally similar, each row of the transform matrix P is constructed in the same way to generate the new regions; however, the new regions still need to be distinguishable from each other in terms of content. To achieve this, we introduce a contrast loss [25] to train the SFM. Specifically, among all regions generated in each mini-batch, we select the first region as the positive sample and the remaining regions as negative samples. We measure the similarity between positive and negative samples with their dot product, so the contrast loss of the SFM is defined as follows:

$$ L_{SFM}=-\frac{1}{N_{1}} \sum\limits_{i=1}^{N_{1}} \log \frac{\exp \left( q_{i} \cdot k^{+} / \tau\right)}{{\sum}_{j=1}^{N_{1}} \exp \left( q_{i} \cdot k_{j}^{-} / \tau\right)} $$
(10)

where N1 is the number of newly generated regions, q is the selected region, and k+ and k− are the positive and negative samples, respectively. τ is the temperature hyperparameter, set to 0.07 in our experiments. LSFM takes a lower value when q is similar to the only positive sample k+ and dissimilar to all negative samples k−. Furthermore, we found that defining the positive sample as a zero vector leads to better results. Since the similarity of q to the only positive sample is then a constant, the model focuses on the similarity of q to all negative samples, so we further define the contrast loss as follows:

$$ L_{SFM}=-\frac{1}{N_{1}} \sum\limits_{i=1}^{N_{1}} \log \frac{1}{{\sum}_{j=1}^{N_{1}} \exp \left( q_{i} \cdot k_{j}^{-}/\tau\right)} $$
(11)

Finally, combining the two loss functions and their weights, the total loss is:

$$ L_{TOT}=L_{CE}\left( A^{\prime}, A^{gt}\right)+\lambda \times L_{SFM}(Y) $$
(12)
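For illustration, a sketch of (10)-(12) over the fused regions Y; the BCE-with-soft-targets form of LCE, the masking of self-similarities among negatives, and the λ value are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sfm_contrast_loss(y, tau=0.07, zero_positive=True):
    """Sketch of L_SFM over fused regions y: (B, N1, d).
    Each fused region should be dissimilar (in dot product) to the others."""
    b, n1, d = y.shape
    sim = y @ y.transpose(1, 2) / tau                      # (B, N1, N1) pairwise q_i . k_j / tau
    neg_mask = ~torch.eye(n1, dtype=torch.bool, device=y.device)   # exclude self-similarity
    denom = (sim.exp() * neg_mask).sum(dim=-1)             # sum_j exp(q_i . k_j^- / tau)
    if zero_positive:
        # (11): the positive is a zero vector, so the numerator exp(q_i . 0 / tau) = 1.
        loss = torch.log(denom)
    else:
        # (10): the positive is the first fused region of each batch item.
        pos = (y @ y[:, :1].transpose(1, 2)).squeeze(-1) / tau     # q_i . k^+ / tau
        loss = -(pos - torch.log(denom))
    return loss.mean()

def total_loss(logits, target, y, lam=0.1):
    # (12): L_TOT = L_CE + lambda * L_SFM; soft-target BCE is a common VQA-v2 choice.
    return F.binary_cross_entropy_with_logits(logits, target) + lam * sfm_contrast_loss(y)
```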

3.4 Discussion about why the ST-VQA framework works

The standard transformer usually relies on multistep inference to predict the correct answer. During inference, attention gradually updates the region features and identifies the key regions associated with the correct answer. Guo et al. proposed SCAVQA [14], which selects the most critical words and regions by a threshold at each transformer layer, eliminating wrong answers that may mislead the inference. Their approach achieved some success, but the threshold-based approach is more complex and lacks adaptability to different samples. Shuang et al. proposed CPDR [13], which uses an improved relation network to model the interactions between regions and form composite region features, thus reducing the difficulty of inference. However, they do not exclude wrong-answer regions, so the model still runs the risk of being misled. Our approach combines the advantages of both. The ST-VQA framework generates composite features by feature fusion and simultaneously reduces the number of region features. Through the contrast loss, the model distinguishes the semantics represented by the composite features and mitigates the misleading effect of wrong answers. Thus, the ST-VQA framework can both simplify inference and avoid being misled, resulting in more advantageous performance.

4 Experiments

4.1 Dataset

VQA-v2

is a widely cited benchmark dataset for VQA tasks. In the balanced real-image branch of VQA-v2, the input images comprise 204,721 real-scene images from MSCOCO, with a total of 1,105,904 questions, all open-ended. Each question in VQA-v2 has 10 human answers, and a confidence (e.g., yes or maybe) is given for each answer. If 3 of the 10 answers agree, the answer is assigned a soft score of 0.9 (4 or more matching answers give a score of 1.0). The VQA-v2 dataset contains 3129 candidate answers, so the model is usually designed as a classifier over 3129 categories. The questions in VQA-v2 cover a variety of types, the most frequent being what is (13.84%), what color (8.98%), is the (10.16%), and how many (10.28%).
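The soft score described above follows the standard VQA-v2 convention of averaging min(count/3, 1) over the leave-one-annotator-out subsets; the small function below reproduces the 0.3/0.6/0.9/1.0 values and is offered purely as an illustration, not as the authors' code.

```python
def vqa_soft_score(num_matching: int) -> float:
    """Soft score for an answer matched by `num_matching` of the 10 human answers.
    Averages min(count / 3, 1) over the 10 leave-one-annotator-out subsets,
    yielding 0 / 0.3 / 0.6 / 0.9 / 1.0 for 0 / 1 / 2 / 3 / 4+ matches."""
    total = 0.0
    for leave_out_matching in (True, False):
        ways = num_matching if leave_out_matching else 10 - num_matching
        count = num_matching - 1 if leave_out_matching else num_matching
        total += ways * min(count / 3.0, 1.0)
    return total / 10.0

assert abs(vqa_soft_score(3) - 0.9) < 1e-9 and vqa_soft_score(4) == 1.0
```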

COCO-QA

is a widely used small VQA dataset with images sourced from MS-COCO. In contrast to VQA-v2, all questions and answers in COCO-QA are generated automatically from image captions, so the dataset focuses on the salient information in the images and is easier to scale and apply. The COCO-QA dataset consists of 117,711 samples, with a ratio of training set to test set of approximately 2:1. The COCO-QA dataset has four question categories, namely object, color, number, and location, and all answers are single words. In addition to standard accuracy, we use Wu-Palmer similarity (WUPS) on COCO-QA to evaluate the model, with WUPS thresholds set to 0.1 and 0.9.

GQA

is a newer, cleaner dataset built on Visual Genome, containing 22M questions and 113k images. Each GQA image comes with a scene graph representation, and each answer explicitly refers to the relevant region in the image. The GQA dataset focuses on complex, compositional questions (e.g., "Are both the yellow flag and the flag that looks blue small?"); answering such questions requires more reasoning about visual concepts. In addition to standard accuracy, GQA supports multidimensional evaluation metrics, including consistency, plausibility, validity, and distribution (lower is better).

4.2 Implementation details

Most transformer-based VQA models share a similar setup. Following the feature extraction described in Section 3.1, we use zero padding to fill question and image features to their maximum sizes (m = 64 and n = 14 or 29), i.e., the number of input image regions is 64 and the number of input question words is 14 or 29. In the ST, the number of transformer layers is set to 6, the dimensions of self-attention and the FFN are both 512, and the number of attention heads is 8. For the shrinkage transformer backbone, we set the number of regions per layer to [m, m, m, m/2, m/2, m/2]. To avoid overfitting, we use the GELU nonlinear activation function and dropout after the FC layer. The model uses a warm-up schedule, the maximum number of training epochs is 13, and the batch size is set to 64. The initial learning rate is 1e-4, scaled down by a factor of 0.2 at epochs 10 and 12. In addition, the ST-VQA framework also supports using Visual Genome as additional training data, which can further improve performance.
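For reference, the settings above can be gathered into a single configuration sketch; the field names and the dropout rate are illustrative.

```python
# Hyperparameters from Section 4.2 collected in one place (field names are illustrative).
config = dict(
    num_regions=64,                                # m: padded number of image regions
    num_words=14,                                  # n: padded question length (29 for long questions)
    num_layers=6,                                  # transformer depth
    d_model=512,                                   # self-attention / FFN width
    num_heads=8,
    regions_per_layer=[64, 64, 64, 32, 32, 32],    # [m, m, m, m/2, m/2, m/2]
    dropout=0.1,                                   # assumed; the paper uses dropout but omits the rate
    epochs=13,
    batch_size=64,
    base_lr=1e-4,
    lr_decay={"epochs": [10, 12], "factor": 0.2},
    warmup=True,
)
```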

4.3 Overall performance

On the VQA-v2 dataset, we compare the proposed ST-VQA framework with a set of baselines in the field of VQA, and the results are given in Table 1. ST denotes the shrinkage transformer backbone proposed in this paper. From Table 1, we can observe that the ST-VQA framework achieves improvements of +0.51 (Yes/No), +0.87 (Number), +0.75 (Other) and +0.66 (All) over MCAN. In addition, our model achieves a large performance improvement compared to recent advanced work. This result indicates that the core components of the ST-VQA framework achieve more accurate multimodal alignment and play an important role in the inference and prediction of answers.

Table 1 Comparison with different baselines on the VQA-v2 dataset

Questions of the Number type are considered the most difficult in the field of VQA, and their accuracy reflects the inference ability of a model, which is usually trained on more inference-oriented datasets (such as GQA [10]) to achieve higher accuracy on number-type questions. Surprisingly, compared to the standard transformer, our baseline shows lower accuracy (−0.32) on number-type questions, which indicates that direct feature mapping (the tokenwise MLP) harms the inference ability of the transformer. However, adding the SFM gives the ST the beneficial properties of structural similarity and content difference between new regions, which not only improves the accuracy on all question types but also yields the largest gain on number-type questions (+1.19). This result confirms that the structure of the ST-VQA framework helps to improve the inference ability of the transformer. Our experiments on COCO-QA also support this conclusion. Table 2 shows the results on the COCO-QA dataset. ST-VQA achieves the best performance, and it is particularly striking that on Number-type questions, ST-VQA outperforms the second-best method by +3.45. Table 3 shows the results on the GQA dataset. For a fair comparison, all compared models are single models and use Faster R-CNN region features as image features. The experimental results show that ST-VQA retains a clear advantage in reasoning about complex questions, with an accuracy of 57.62, considerably higher than that of MCAN (+0.78). In addition, we note the performance gap of ST-VQA between GQA and VQA-v2; because GQA questions are more demanding in terms of inference, we consider this gap acceptable.

Table 2 Comparison with VQA models on the COCO-QA dataset
Table 3 Results for ST-VQA and the state-of-the-art models on the GQA dataset

4.4 Ablation study

We conducted ablation experiments to analyze why our model is effective, and the results are summarized in Table 4. It can be clearly seen that the accuracy of the ST is higher than that of the standard transformer (+0.07). After we replaced the tokenwise MLP with the SFM, the shrinkage transformer achieved better performance (+0.05) at a higher training loss (+7.88%), which confirms that the generalization of the transformer was improved. This result shows that the SFM generalizes better than the tokenwise MLP within the shrinkage transformer structure. In addition, it can be directly observed that the SFM loss function brings a breakthrough to the model, which indicates that the new regions generated after fusion must be independent of each other; otherwise, they harm the model. Note that while most of the accuracy improvement comes from the improved contrast loss, the potential improvement from the SFM cannot be ignored. The SFM and the contrast loss bring structural similarity and content difference, respectively, to the fused new regions, and the combination of these two characteristics is the key to the performance improvement of the transformer. This conclusion is examined in Section 4.4.1.

Table 4 Ablation study on the VQA-v2 dataset

4.4.1 Ablation of the transformer structure

Previous experiments found that even when feature mapping is performed directly with the tokenwise MLP, the shrinkage transformer still yields a slight performance improvement over the standard transformer. In this section, we examine the performance of the backbone at different levels of shrinkage, using the proposed SFM to change the number of regions. The results are given in Table 5: a higher degree of shrinkage is not necessarily better, and maintaining a relatively uniform degree of region shrinkage benefits the performance of the model. In Table 5, the Degree column shows the change in the number of regions, and the first row is the standard transformer without shrinkage. In particular, Row 2 of Table 5 is a comparison experiment with the SFM but without shrinking (i.e., the number of regions remains the same). The experimental results confirm that the SFM needs to be used in combination with the shrinkage transformer and that the SFM alone has a slight negative effect on transformer performance (−0.09). We also note that using the SFM too frequently hurts the results, probably because the transformer requires multiple layers of feature updates to reach saturation.

Table 5 Different levels of ST-VQA performance on the VQA-v2 dataset

4.4.2 Ablation of the SFM structure

In this section, we analyze different approaches to shrinking regions in the SFM and their respective performance; the experimental results are given in Table 6. In the Method column of Table 6, "Tokenwise MLP" refers to using the tokenwise MLP to update the number and dimension of regions, "MLP fusion" is the SFM as described in Section 3.2, and "FC fusion" replaces the MLP structure in the SFM with a linear layer. We verify two aspects in this section: the importance of content-adaptive fusion and the importance of nonlinear activation.

Table 6 Different ideas related to SFM performance on the VQA-v2 dataset

We can observe that the SFM has a performance advantage over the tokenwise MLP. Meanwhile, within the SFM, the model achieves comparable performance whether FC or MLP is used for region-number reduction, which suggests that the nonlinear activation function in the MLP may not be the key to good performance. In addition, FC fusion has slightly lower performance than MLP fusion while having more parameters (approximately 5%), and because it is not well suited to adding dropout, FC fusion carries a greater risk of overfitting.

4.4.3 Ablation of the loss function

In this section, we explore the effectiveness of the SFM through ablation experiments on the loss function. The contrast loss makes the composite region features generated after fusion dissimilar to each other, which ensures the diversity of information, i.e., the composite features used during inference better represent the conditions required to infer the correct answer. We report the results in Table 7, where "one" and "zero" denote using a one-vector and a zero-vector as the positive sample, respectively. It can be clearly seen that the model improves substantially with the addition of the contrast loss, and the best performance is obtained with the zero-vector as the positive sample. This is consistent with our analysis in Section 3.3: the model only needs to focus on generating more distinguishable composite features during feature fusion. In addition, the model is not sensitive to the hyperparameter λ in the experiments, so we believe that ST-VQA has an advantage in terms of stability.

Table 7 Different loss functions for the ST-VQA performance on the VQA-v2 dataset

4.5 Attention analysis

The ST-VQA framework fuses and updates the region features through the SFM, and this operation loses the bounding box of each region. Therefore, we visualize and analyze the ST-VQA framework from the perspective of attention entropy [22]. Attention entropy indicates the degree of concentration or diffusion of region interactions in attention; smaller attention entropy indicates more concentrated interactions. Figure 4 compares the changes in attention entropy of MCAN and our model in different settings. Figure 4(a) shows the change in attention entropy over training epochs: the attention entropies of both models decrease as training proceeds, but the decrease for ST-VQA is larger than that for MCAN. Figure 4(b) shows the variation in attention entropy across transformer layers, and our model exhibits lower attention entropy at the same layers. These facts demonstrate that our model has tighter inter-region interactions than MCAN, further validating our idea that a smaller number of regions facilitates attention. More analysis is given in Online Resource 1.
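As an illustration, attention entropy can be computed from a layer's attention map as follows, assuming the Shannon entropy of each query's attention distribution is averaged (the exact averaging used in [22] may differ).

```python
import torch

def attention_entropy(attn, eps=1e-12):
    """attn: (B, heads, N_query, N_key) softmax attention weights.
    Returns the mean entropy of each query's attention distribution;
    lower values indicate more concentrated region interactions."""
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)   # (B, heads, N_query)
    return entropy.mean()

attn = torch.softmax(torch.randn(2, 8, 64, 64), dim=-1)
print(attention_entropy(attn))   # compare across layers / training epochs as in Fig. 4
```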

Fig. 4 Comparison of attention entropy for our transformer and MCAN

5 Conclusion

In this paper, we propose the ST-VQA framework to explore a more efficient VQA transformer. Through feature fusion, the ST-VQA framework improves multistep inference and excludes the effect of wrong-answer regions. Furthermore, we reveal the core elements of this improvement: the shrinkage transformer can outperform the standard transformer if the image regions possess both properties, similar structure and different content. To achieve an effective shrinkage transformer, we introduce a contrast loss and design the SFM for shrinking regions. In addition, we verify the advantages and optimal configuration of each module through complete ablation experiments. The experimental results suggest that changing the dimension of the attention map in the transformer may be favorable to the model and, more importantly, that a better shrinkage mechanism will further unlock the potential of the transformer. We consider shrinkage and fusion an inductive bias that can be attached to self-attention to improve the transformer's performance. The transformer works in the same way for different inputs, whether tokens, patches, or regions, so the ST-VQA framework can likely be extended to other tasks. We look forward to more research developments in the future.