1 Introduction

Deep learning has made significant progress in computer vision [14] and natural language processing [26]. Tasks at the intersection of vision and language, such as image captioning [31], VQA [2, 23], and visual dialogue [18], have attracted strong attention from both communities.

VQA is a challenging research direction, but most VQA models [34] focus on visual processing alone. For images containing scene text, most models [8, 39] fail to answer questions about that text. Such images carry real textual semantics, and answering questions about them requires understanding the scene text.

The LoRRA model [27] can reason over and answer questions about scene text in images. It uses the BUTD [1] attention model to reason over visual objects, but its attention interaction remains simple and limits performance. The M4C model [16] jointly encodes questions, images, and text with a multimodal transformer architecture [30] and iteratively selects answers from OCR tokens or a fixed vocabulary. Although this homogeneous treatment is easy to implement and fast to train, it does not distinguish between text and visual objects once they are merged.

To address these problems, we first obtain question features, visual features based on BUA (bottom-up attention), and OCR text features extracted from the image with an OCR system and FastText [5]. We then use a question self-attention unit to filter redundant features, and question-guided attention units to process the visual features and the OCR token features. Finally, we fuse the three features to obtain the input to the classifier. Building on the question-guided OCR token features, we extend the candidate answer set with copies of the OCR tokens so that the model can predict an OCR token outside the fixed candidate answers.

Our innovations and contributions can be summarized as follows:

  • To give the model the ability to understand text in images, we design a co-attention model that incorporates scene text features. The deep stacked co-attention mechanism preserves the ability to answer common questions that do not involve scene text, while enabling the model to recognize text in images and use it to answer questions.

  • We design a classifier that can either select an answer from a fixed answer set or directly copy a text token detected in the image by the OCR model as the final answer.

  • Ablation experiments show that the proposed methods are effective, and our model significantly improves performance on the VQA 2.0 dataset [13].

2 Related work

2.1 Attention

The attention mechanism has been successfully applied to uni-modal tasks and simple multi-modal tasks. Paper [1] learns visual attention over image regions from the input question in VQA, embeds the question into the visual space through an attention structure, and constructs a convolution kernel to search the attended region in the image, which effectively improves the representation ability of the model. Subsequently, many studies [10, 11, 29, 35, 36] used visual attention to extract features and reduce the interference of redundant image and text information. In addition, papers [4, 20] use different multi-modal bilinear pooling methods to combine grid visual features with text features to predict the answer. The results show that attending to both the visual and textual modalities enhances the fine-grained representation of images and questions and effectively improves model accuracy. However, these coarse attention models cannot infer the correlation between image regions and question words.

Therefore, learning the co-attention between the two modalities can effectively improve VQA results. Paper [38] simplifies co-attention into two steps: the question is first passed through a self-attention mechanism to learn dependencies between question words, and then the most relevant visual regions are searched for in a question-guided attention module. Paper [19] proposes a bilinear attention network that refines attention based on previously attended features. In addition, stronger feature encoders [17, 40] can further enhance these models.

2.2 Pre-training model

A visual language pre-training model (VLP) is a type of deep learning model that combines images and text for joint learning to obtain rich visual and language representations.

VLP models typically consist of an image encoder and a text encoder. The image encoder converts the input image into a low-dimensional vector, while the text encoder converts the input text into a low-dimensional vector. These vectors can then be aligned in a shared representation space to capture semantic similarities between the images and text.

Currently, some popular VLP models include UNITER [9], ViLBERT [22], LXMERT [28], BLIP [21], OFA [32], among others. These models have achieved state-of-the-art results in various visual language tasks, such as image captioning, question answering, visual reasoning, image classification, and more.

3 Proposed model

The multi-modal co-attention model is shown in Fig. 1. GloVe [25] and an LSTM are used to extract question features, Faster R-CNN is used to extract image features, and FastText is used to encode the OCR tokens in the image. SA denotes a self-attention unit, and GA denotes a guided-attention unit.

Fig. 1 Overview of the proposed model

3.1 Text question representation

Each question word is encoded as a GloVe vector. If the question is shorter than 14 words, it is padded with zero vectors. The word-embedded sequence is then encoded by an LSTM [15], whose memory mechanism can effectively process long text and alleviate gradient vanishing and explosion. The calculation is as follows:

$$ X = LSTM(GloVe(ques)) $$
(1)

where \( X \in \mathbb {R}^{M \times d_{x}} \) is the question representation and M is the question length.
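
For concreteness, the following is a minimal PyTorch sketch of the question encoder in (1), assuming a pre-built GloVe embedding matrix (`glove_weights`) and an illustrative hidden size `d_x`; the paper specifies only GloVe embedding followed by an LSTM over questions padded to 14 words.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    def __init__(self, glove_weights, d_x=512, max_len=14):
        super().__init__()
        # glove_weights: FloatTensor of shape (vocab_size, 300) with GloVe vectors
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(input_size=300, hidden_size=d_x, batch_first=True)
        self.max_len = max_len

    def forward(self, token_ids):
        # token_ids: (batch, M) word indices, zero-padded to max_len = 14
        x = self.embed(token_ids)      # (batch, M, 300) GloVe word embeddings
        X, _ = self.lstm(x)            # (batch, M, d_x) per-word hidden states
        return X
```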

3.2 Image OCR tokens extraction and representation

We use the Rosetta [6] OCR system to extract text from each image, identifying up to 50 OCR tokens per image, and then use FastText [5] to represent the OCR tokens. The calculation is as follows:

$$ tokens = Rosetta(image) $$
(2)

After obtaining the image OCR tokens, we encode them with FastText:

$$ O = FastText(tokens) $$
(3)

where tokens are the OCR text tokens of the image, \( O \in \mathbb {R}^{P \times d_{ocr}} \) is the OCR token representation, and P is the number of detected OCR tokens.
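
A hedged sketch of (2)-(3): OCR detection is performed by an external system (Rosetta in the paper), and the detected tokens are embedded with the open-source `fasttext` package; the model file name and zero-padding scheme below are illustrative assumptions.

```python
import numpy as np
import fasttext

# Pre-trained 300-d English FastText vectors (file path is illustrative)
ft = fasttext.load_model("cc.en.300.bin")

def embed_ocr_tokens(tokens, max_tokens=50, dim=300):
    """Return O with shape (max_tokens, dim), zero-padded up to 50 OCR tokens."""
    O = np.zeros((max_tokens, dim), dtype=np.float32)
    for i, tok in enumerate(tokens[:max_tokens]):
        O[i] = ft.get_word_vector(tok)   # subword-based, handles out-of-vocabulary strings
    return O

# Example: tokens returned by the OCR system for one image
O = embed_ocr_tokens(["stop", "main", "st"])
```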

3.3 Image feature representation

Mainstream image feature extraction currently adopts the BUA [1] method to identify specific objects in the image. Figure 2 gives an overview of the BUA model. The calculation is as follows:

$$ Y = FRCNN(image) $$
(4)

where \( Y \in \mathbb {R}^{N \times d_{y}} \) is the visual feature and N is the number of detected objects.
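
Region features are typically pre-extracted and cached. The short sketch below only illustrates the shape convention of (4) under an assumed storage layout (one `.npz` file per image with key `x`); it is not the authors' pipeline.

```python
import numpy as np

def load_region_features(feature_path, max_regions=100, d_y=2048):
    """Return Y with shape (max_regions, d_y), zero-padded over detected objects."""
    feats = np.load(feature_path)["x"]          # (N, d_y) features for N detected objects
    Y = np.zeros((max_regions, d_y), dtype=np.float32)
    Y[: feats.shape[0]] = feats[:max_regions]
    return Y
```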

Fig. 2 Overview of the region feature extractor based on Bottom-Up Attention

3.4 Co-attention model

Our model includes a question self-attention unit, a question-guided image OCR token attention unit, and a question-guided image vision attention unit. These attention units learn interactions within a single modality and across modalities. The two question-guided units share the same implementation; the only difference is whether the image features or the image OCR token features are used as input.

3.4.1 Multi-head self-attention mechanism

The multi-head self-attention module consists of a multi-head attention layer, layer normalization [3], a residual connection, and a feed-forward layer. As shown in Fig. 3a, the input feature X is mapped by three matrices to the corresponding matrices Q, K, and V, which are combined by scaled dot-product attention. The calculation is as follows:

$$ \begin{cases} Q = XW^{Q}, K = XW^{K}, V = XW^{V}\\ Attn(Q,K,V) = softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V\\ \end{cases} $$
(5)

where \( d_{k} \) is the dimension of the query, key, and value vectors.
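
A minimal single-head PyTorch sketch of (5), with the projection matrices \( W^{Q}, W^{K}, W^{V} \) realized as bias-free linear layers; the dimensions d and d_k are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    def __init__(self, d=512, d_k=64):
        super().__init__()
        # W^Q, W^K, W^V from Eq. (5) as linear maps
        self.W_q = nn.Linear(d, d_k, bias=False)
        self.W_k = nn.Linear(d, d_k, bias=False)
        self.W_v = nn.Linear(d, d_k, bias=False)

    def forward(self, X):
        # X: (batch, seq_len, d)
        Q, K, V = self.W_q(X), self.W_k(X), self.W_v(X)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (batch, len, len)
        return F.softmax(scores, dim=-1) @ V                      # (batch, len, d_k)
```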

Fig. 3 Architecture of the two basic attention units

A multi-head attention mechanism can be adopted to further improve the representation ability. Multi-head attention performs h attention operations, each corresponding to a scaled dot-product operation, and the results are concatenated and projected to form the output of the multi-head attention layer:

$$ \begin{cases} head_{i} = Attn(Q_{i},K_{i},V_{i})\\ MHA(Q,K,V) = Concat(head_{1},head_{2},...,head_{h})W_{o} \\ \end{cases} $$
(6)

where \( Q_{i}, K_{i}, V_{i} \) are the matrices of the i-th head, \( W_{o} \in \mathbb {R}^{h d_{h} \times d} \) is the output projection matrix, and \( d_{h} \) is the dimension of each head's output features.

The output of the multi-head attention layer then passes through the residual connection and layer normalization, which mitigate vanishing gradients and accelerate the convergence of the model:

$$ f = LayerNorm(X+MHA(Q,K,V)) $$
(7)

After passing through the feed-forward layer, the final output of the self-attention module is:

$$ FFN(f) = FC(Dropout(ReLU(FC(f)))) $$
(8)
$$ Z = LayerNorm(f+FFN(f)) $$
(9)

where FC denotes a fully connected layer.
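
The following is a hedged PyTorch sketch of the full module in (5)-(9). `nn.MultiheadAttention` stands in for the h-head attention of (6) (it applies the per-head projections and the output projection \( W_{o} \) internally); the hyper-parameters d, h, d_ff, and dropout are illustrative, not the authors' settings.

```python
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, d=512, h=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.mha = nn.MultiheadAttention(d, h, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.norm2 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(                 # Eq. (8): FC -> ReLU -> Dropout -> FC
            nn.Linear(d, d_ff), nn.ReLU(), nn.Dropout(dropout), nn.Linear(d_ff, d)
        )

    def forward(self, x, kv=None):
        kv = x if kv is None else kv              # kv == x reproduces self-attention
        f = self.norm1(x + self.mha(x, kv, kv)[0])    # Eq. (7): residual + LayerNorm
        return self.norm2(f + self.ffn(f))            # Eq. (9)
```

The same block is reused below as both SA (kv equal to the input) and GA (kv set to the guiding features).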

3.4.2 Self-attention unit and guided-attention unit

As shown in Fig. 3a, SA is built on MHA, and \( X=\{x_{1};x_{2};...;x_{M}\} \in \mathbb {R}^{M \times d_{x}} \) is the input feature of SA:

$$ \begin{cases} Q = XW^{Q}, K = XW^{K}, V = XW^{V} & \\ f = LayerNorm(X+MHA(Q,K,V)) & \\ SA(X) = LayerNorm(f+FFN(f)) & \\ \end{cases} $$
(10)

where MHA is the multi-head attention layer defined in (6).

The guided-attention unit differs from self-attention in that it takes two different feature sets as input. As shown in Fig. 3b, the question feature X serves as the guidance and is mapped to K and V, while the visual feature Y (or the image OCR token feature) is mapped to Q:

$$ \begin{cases} Q = YW^{Q}, K = XW^{K}, V = XW^{V} \\ Attn(Q,K,V) = softmax(\frac{QK^{T}}{\sqrt{d_{k}}})V\\ f = LayerNorm(Y+MHA(Q,K,V)) \\ GA(X,Y) = LayerNorm(f+FFN(f)) \\ \end{cases} $$
(11)

The remaining computation of GA is the same as that of SA.
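
Expressed with the AttentionBlock sketch from Section 3.4.1 (an assumption of this sketch, not the authors' exact implementation), SA and GA of (10)-(11) reduce to:

```python
def SA(block, X):
    return block(X)            # Q, K, V all derived from X

def GA(block, X, Y):
    return block(Y, kv=X)      # Q from Y; K, V from the guiding question features X
```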

3.4.3 Cascade of attention modules

To further improve the representation ability of the features, we combine the attention modules in a cascade, where the output of each layer serves as the input of the next. For the question self-attention mechanism, a total of L layers are stacked to produce the final question feature:

$$ X^{k} = SA^{k}\left (X^{k-1} \right ) $$
(12)

where \( SA^{1}, SA^{2}, ..., SA^{L} \) denote the question self-attention units of the different layers. The output of the last layer, \( X^{L} \), is taken as the final question feature.

Then, the final image features are obtained by using the SA and GA. The calculation formula is as follows:

$$ Y^{k} = GA^{k}(X^{L}, SA(Y^{k-1})) $$
(13)

where \( GA^{1}, GA^{2}, ..., GA^{L} \) denote the question-guided attention units of the different layers. The output of the last layer, \( Y^{L} \), is used as the final image region feature.

Similarly, after obtaining the image OCR token feature O (encoded with FastText in Section 3.2), we process it with a self-attention unit and then feed it to the question-guided attention module GA. The calculation is as follows:

$$ O^{k} = GA^{k}(X^{L}, SA(O^{k-1})) $$
(14)

The output of the last layer, \( O^{L} \), is used as the final image OCR token feature.
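
The cascade of (12)-(14) can then be sketched as follows, again reusing the AttentionBlock from Section 3.4.1; the depth L and width d are illustrative. The question passes through L stacked SA layers, and the image regions and OCR tokens each pass through L SA+GA layers guided by the final question feature \( X^{L} \).

```python
import torch.nn as nn

class CoAttentionCascade(nn.Module):
    def __init__(self, d=512, h=8, L=6):
        super().__init__()
        make = lambda: nn.ModuleList([AttentionBlock(d, h) for _ in range(L)])
        self.q_sa = make()
        self.v_sa, self.v_ga = make(), make()
        self.o_sa, self.o_ga = make(), make()

    def forward(self, X, Y, O):
        for sa in self.q_sa:                          # Eq. (12): X^k = SA^k(X^{k-1})
            X = sa(X)
        for sa, ga in zip(self.v_sa, self.v_ga):      # Eq. (13): Y^k = GA^k(X^L, SA(Y^{k-1}))
            Y = ga(sa(Y), kv=X)
        for sa, ga in zip(self.o_sa, self.o_ga):      # Eq. (14): O^k = GA^k(X^L, SA(O^{k-1}))
            O = ga(sa(O), kv=X)
        return X, Y, O
```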

4 Feature fusion and answer prediction

After the attention modules produce the question features, image features, and image OCR token features, we fuse the three. Before fusion, we apply an MLP-based attentional reduction to collapse each feature sequence into a single vector.

$$ \begin{cases} MLP(X) = FC_{2d}^{d} \circ ReLU \circ FC_{d}^{d}(X)\\ \alpha = Softmax\left (MLP(X^{L}) \right )\\ V= \sum_{i=1}^{M}\alpha_{i} x_{i}\\ \beta = Softmax\left (MLP(Y^{L}) \right )\\ Q= \sum_{i=1}^{N}\beta_{i} y_{i} \\ \gamma = Softmax\left (MLP(O^{L}) \right )\\ OCR= \sum_{i=1}^{P}\gamma_{i} o_{i}\\ \end{cases} $$
(15)

where \( FC_{2d}^{d} \) and \( FC_{d}^{d} \) are fully connected layers, \(\alpha = [\alpha _{1},\alpha _{2},...,\alpha _{M}] \in \mathbb {R}^{M}\), \(\beta = [\beta _{1},\beta _{2},...,\beta _{N}] \in \mathbb {R}^{N}\), and \(\gamma = [\gamma _{1},\gamma _{2},...,\gamma _{P}] \in \mathbb {R}^{P}\).
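
A sketch of the attentional reduction in (15): a small MLP scores each element of a sequence, a softmax turns the scores into weights, and a weighted sum collapses the sequence into one vector. The exact FC widths of (15) are only followed loosely; the module name and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionalReduce(nn.Module):
    def __init__(self, d=512):
        super().__init__()
        # Two-layer MLP producing one scalar score per sequence element
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, X):
        # X: (batch, M, d) -> weights: (batch, M, 1) -> pooled: (batch, d)
        w = torch.softmax(self.mlp(X), dim=1)
        return (w * X).sum(dim=1)

# Applied separately to X^L, Y^L, and O^L to obtain V, Q, and OCR.
```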

Finally, we obtain the fusion feature:

$$ r = LayerNorm({W_{v}^{T}} V + {W_{q}^{T}} Q + {W_{d}^{T}} OCR) $$
(16)

where \( W_{v}, W_{q}, W_{d} \in \mathbb {R}^{d \times d_{z}} \) are projection matrices and \( r \in \mathbb {R}^{d_{z}} \) is the fusion feature.

After the final fusion feature r is obtained, we apply a linear classifier followed by a sigmoid function to obtain the predicted probabilities:

$$ \hat{y} = Sigmoid({W_{z}^{T}}r) $$
(17)

where \( W_{z} \in \mathbb {R}^{d_{z} \times (A+P)} \), with the A fixed candidate answers extended by P OCR-copy slots.

If the predicted index is greater than A (the number of fixed candidate answers), we use the corresponding OCR token as the final answer.
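
A sketch of (16)-(17) together with the copy mechanism described above: the three pooled vectors are projected, summed, and layer-normalized into r, then scored against the A fixed candidate answers extended by P OCR-copy slots. The vocabulary size and helper names below are illustrative assumptions, not the authors' exact values.

```python
import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, d=512, d_z=1024, num_answers=3129, num_ocr=50):
        super().__init__()
        self.w_v = nn.Linear(d, d_z, bias=False)   # W_v
        self.w_q = nn.Linear(d, d_z, bias=False)   # W_q
        self.w_d = nn.Linear(d, d_z, bias=False)   # W_d
        self.norm = nn.LayerNorm(d_z)
        self.cls = nn.Linear(d_z, num_answers + num_ocr)  # fixed answers + OCR-copy slots

    def forward(self, V, Q, OCR):
        r = self.norm(self.w_v(V) + self.w_q(Q) + self.w_d(OCR))   # Eq. (16)
        return torch.sigmoid(self.cls(r))                          # Eq. (17)

def decode_answer(scores, answer_vocab, ocr_tokens, num_answers=3129):
    # scores: (A + P,) sigmoid outputs for a single example
    idx = int(scores.argmax())
    if idx >= num_answers:                   # index beyond A: copy a detected OCR token
        return ocr_tokens[idx - num_answers]
    return answer_vocab[idx]
```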

5 Experiment

5.1 Datasets

The VQA v2.0 [13] dataset is composed of natural images from MSCOCO [47] and follows the MSCOCO split into training, validation, and test sets. It is a large-scale public dataset for the VQA task. The test set is divided into four parts (test-dev, test-standard, test-challenge, and test-reserve), which allow developers to evaluate their systems flexibly while preventing overfitting to the test set. Each image-question pair collects ten human answers, and the most frequent answer is taken as the ground truth. The dataset contains two kinds of tasks, open-ended and multiple-choice; this paper focuses on the open-ended task.

5.2 Experimental setup

The basic experimental setup follows MCAN. The maximum number of detected OCR tokens P is 50.

5.3 Ablation experiment

Based on the MCAN [37] model, we conducted the following ablation experiments:

  • MCAN: the benchmark model.

  • MCAN+Q-GA(Rosetta OCR): MCAN with the question-guided OCR token attention (Q-GA) unit, where the OCR tokens in the image are extracted by the Rosetta model.

  • MCAN+Q-GA(Paddle OCR): the same as above, but the image OCR tokens are extracted by the Paddle model.

  • MCAN+Q-GA(Rosetta Meaningful OCR): the same as the Rosetta OCR variant, but characters without practical meaning are filtered out when the OCR tokens are obtained.

  • MCAN+Q-GA(Rosetta OCR)+V-GA(Rosetta OCR): MCAN with both question-guided OCR token attention and image-guided OCR token attention.

The ablation results are shown in Table 1. The MCAN model in the first row serves as the benchmark. In the second row, introducing the question-guided OCR token attention (Q-GA) unit improves accuracy by 0.38% over the benchmark, indicating that the Q-GA module is effective.

Table 1 The results of ablation experiment on VQA 2.0 val set

In the third row, the Paddle model is used to extract the image OCR tokens, and accuracy improves by 0.26% over the benchmark. This shows that the choice of OCR model affects the performance of our model.

In the fourth row, the Rosetta model is used to extract the image OCR tokens and characters without practical meaning are filtered out, but this does not bring a further improvement in accuracy.

In the fifth row, both the question-guided OCR token attention (Q-GA) unit and the image-guided OCR token attention (V-GA) unit are added to MCAN, improving accuracy by 0.33% over MCAN. This indicates that the image-guided OCR token attention (V-GA) unit brings no additional gain.

The accuracy and loss curves during training are depicted in Fig. 4.

Fig. 4 Accuracy and loss curves of the ablation models on the VQA 2.0 val set

We randomly selected four examples, shown in Fig. 5, to illustrate the behaviour of our model. They show that the MCAN model cannot answer questions about tokens in the image at all. In contrast, our model can identify whether a question is related to the tokens in the image and can copy one of them as the answer. The two failure cases show that when an image contains many tokens, our model struggles to judge which token to choose as the answer, and it cannot handle questions whose answer requires two tokens.

Fig. 5 Some typical examples of our model's predictions

5.4 Comparison with current main models

Table 2 shows the comparison results. Our model is compared with the existing Bottom-Up, MuRel, MFH, MCAN, DFAF, DMBA-NET, and MDFNet models. Table 2 shows that the model proposed in this paper outperforms the other models.

Table 2 Accuracy of single model on VQA v2.0 test-dev and test-standard dataset

BUTD [1] is the champion model of the 2017 VQA Challenge and was the first to extract region features based on bottom-up attention; our model improves on it by 5.68%. MRA-NET [24] combines textual and visual relationships to improve reasoning ability; our model improves on it by 1.89%. MCAN [37] and DFAF [12] explore attention mechanisms within and between modalities; our model improves on them by 0.45% and 0.54%, respectively. MDFNet [41] proposes a graph reasoning and fusion layer (GRFL) to infer complex spatial and semantic relationships between visual objects and adaptively fuse the two relationships; our model improves on it by 0.03%.

6 Conclusions

In this paper, we designed a deep co-attention model that fuses scene text information in images, equipping the model with the ability to read and understand text in images. Unlike other models in this direction, ours handles general VQA datasets rather than text-oriented VQA datasets. Validation on the VQA 2.0 dataset shows that our model can answer general questions as well as questions involving scene text in images, making it more generalizable.