
1 Introduction

Visual Question Answering (VQA) has received extensive attention in the past few years from numerous scholars dedicated to research in Computer Vision (CV) [1, 2] and Natural Language Processing (NLP) [3, 4]. As a specific domain of VQA, the purpose of medical VQA is to answer a diagnostic question asked about a medical image. An outstanding medical VQA model can benefit both clinicians and patients. It can provide auxiliary analysis to support doctors' clinical diagnoses and treatment decisions. In addition, a Medical-VQA system allows patients to ask for medical consultation whenever they need it. Therefore, developing a medical VQA model helps relieve the burden on healthcare and makes medical diagnosis and treatment more efficient. Although medical VQA has tremendous potential, research on medical VQA still faces many challenges, and it is more challenging than general VQA. First, well-annotated medical VQA datasets for training models are extraordinarily rare, since obtaining precise annotations from clinicians is time-consuming and laborious. For example, the manually annotated dataset VQA-RAD [5] includes varied types of questions but contains only 315 radiology images. Furthermore, some general VQA models cannot be adopted directly to develop Medical-VQA systems, because they usually employ highly complex visual feature extraction modules such as Faster R-CNN [6] and ResNet-101 [7], which contain a great number of parameters and need to be trained on large datasets. Directly employing these models may result in overfitting. Finally, clinical questions are not only harder for a VQA system to understand, as they involve professional medical knowledge, but also need to be answered precisely, as they are relevant to safety and health.

Some previous works [8, 9] attempted to utilize general VQA models and fine-tune them on Medical-VQA datasets. Nevertheless, medical images and clinical questions are quite different from those of general VQA. Raghu et al. [10] proposed to transfer knowledge from general VQA, but they gained only a marginal improvement. Nguyen et al. [11] employed Model-Agnostic Meta-Learning (MAML) [12] to obtain the weights of the visual feature extractor. In addition, they utilized a Convolutional Denoising Auto-Encoder (CDAE) [2] to make the model more robust. Though these groundbreaking medical VQA works pushed the research field forward, they focused only on improving the feature extractor while ignoring the inference module. Zhan et al. [13] concentrated on enhancing the inference ability of models. Specifically, they devised a Question-Conditioned Reasoning (QCR) module to identify the importance of each word. Besides, they proposed a Task-Conditioned Reasoning (TCR) strategy to enlarge the difference between the reasoning abilities required for close-ended and open-ended tasks. Nevertheless, owing to the limitation of medical data, their method can only obtain coarse fusion features. Li et al. [14] designed two reasoning modules to obtain fine-grained relations between words and image regions, but they ignored the relationship between Word-to-Image attention and Sentence-to-Image attention, which makes them unable to gain more fine-grained semantic information.

In order to obtain a multi-level multimodal fusion feature, we design a Multi-level Attention-based Multimodal Fusion (MAMF) model for medical VQA by developing a Word-to-Image (W2I) attention and a Sentence-to-Image (S2I) attention that model the relations of both the word embeddings and the question feature to the image feature. The W2I attention is adopted for word-level fine-grained reasoning, while the S2I attention is applied for sentence-level coarse-grained reasoning. Besides, we propose an Attention Alignment Loss (AAL) that adjusts the weights of the image regions learned from the word embeddings and the question feature, so as to lay stress on crucial image regions and obtain a multi-level multimodal semantic representation for predicting high-quality answers.

To sum up, our contributions are as follows:

  1. A novel Multi-level Attention-based Multimodal Fusion (MAMF) model is proposed by developing a Word-to-Image (W2I) attention and a Sentence-to-Image (S2I) attention to capture word-level and sentence-level inter-modality relations and to learn a multi-level multimodal semantic representation for medical VQA.

  2. An Attention Alignment Loss (AAL) is designed to adjust the importance of the image regions obtained from the word embeddings and the question feature, so as to identify the relevant and crucial image regions.

  3. The evaluations on the VQA-RAD and PathVQA datasets show that our proposed MAMF is significantly superior to the related state-of-the-art baselines.

2 Related Work

VQA has aroused great research interest among scholars since Antol et al. [15] proposed the first VQA task. VQA models in the general domain adopted various methods for extracting image features and question features. For image feature extraction, researchers commonly utilized simple CNNs [16] or object detectors such as SSD [17] and Faster R-CNN [6]. For question feature extraction, they usually adopted models such as GPT-3 [12], BERT [3] and RoBERTa [18]. The extracted features were then aggregated by a bilinear pooling model such as Multimodal Compact Bilinear Pooling [19], Multimodal Low-rank Bilinear Pooling [20] or the Bilinear Attention Network (BAN) [21] to obtain a fusion feature, which was passed to the classifier to predict the answer.

However, these models cannot simply be adopted to develop a Medical-VQA system, owing to the limitation of medical data. Therefore, Nguyen et al. [11] utilized the meta-learning algorithm MAML [12] and a CDAE [2] to obtain the weight initialization of the visual feature extractor for learning visual features. Do et al. [22] proposed a multiple meta-model quantifying (MMQ) algorithm to learn meta-annotations. Nevertheless, these works ignored the reasoning module of the models, which limited their performance. Consequently, Zhan et al. [13] proposed a question-conditioned reasoning (QCR) module to adjust the weights of words and a task-conditioned reasoning (TCR) method to learn inference abilities for close-ended and open-ended tasks respectively. Gong et al. [23] designed a novel multi-task learning paradigm, but it requires large-scale medical data. Bo et al. [24] adopted contrastive learning to obtain several cumbersome models and trained an unsophisticated student model by distilling these models and fine-tuning on the VQA-RAD dataset.

Various attention mechanisms have also been adopted in the medical VQA field. Vu et al. [25] proposed a multi-glance attention method to obtain the most related image regions. Sharma et al. [26] proposed MedFuseNet, which utilizes a co-attention mechanism to improve the quality of the fusion feature. However, these previous works neglected to learn multi-level multimodal feature representations, which limited their performance. In this paper, we develop a Word-to-Image (W2I) attention and a Sentence-to-Image (S2I) attention that concentrate on learning a multi-level multimodal semantic representation.

3 Methods

3.1 Problem Formulation

Medical VQA is defined as a multi-class classification problem. Given an image \(I\) and a question \(q\), the output is the predicted answer \(\hat{a}\). Both \(I\) and \(q\) are input into the model \(f\) to obtain the predicted answer:

$$ \hat{a} = {\arg \max\limits_{a \in A}} \;f(a| I,q,\theta ), $$
(1)

where \(A\) denotes the set of candidate answers, \(a\) denotes one candidate answer, and \(\theta\) denotes all parameters of the model.
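As an illustration only, the following minimal PyTorch sketch shows this argmax over the candidate answer set \(A\); the function and variable names are our own and not part of the original formulation.

```python
import torch

def predict_answer(logits: torch.Tensor, candidate_answers: list) -> list:
    """Eq. (1): pick the candidate answer with the highest model score."""
    # logits: (batch_size, |A|) scores produced by the model f for image-question pairs.
    idx = logits.argmax(dim=-1)
    return [candidate_answers[i] for i in idx.tolist()]
```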

Fig. 1. Overview framework of our proposed MAMF. Each medical image yields three 64-D vectors through a CDAE encoder and two meta-models. The vectors are concatenated to generate the visual feature V. GloVe and GRU are adopted to produce the word embedding sequence \(W_{emb}\) and the semantic feature Q. \(A_{VW}\) and \(A_{VQ}\) are the W2I and S2I attention weights, respectively, and M is the fusion feature.

3.2 Overview of Our Proposed Model

The structure of MAMF is shown in Fig. 1. Overall, the model includes a visual feature extractor, a word embedding module GloVe [27], a question embedding module GRU [28], an attention-based multimodal fusion module and a classifier. GloVe is adopted to convert each word into a 300-dimensional word vector, and GRU is then utilized to generate the question feature. The visual feature extractor consists of the Convolutional Denoising Auto-Encoder [2] and two meta-models obtained from Multiple Meta-model Quantifying (MMQ) [22]. The attention-based multimodal fusion module models the relations between the visual feature and the word embeddings, and between the visual feature and the question feature, respectively. Finally, the classifier takes the multimodal semantic representation and predicts the answers to Medical-VQA tasks.

3.3 Word Embedding and Question Representation

First, given a question \(q\) consisting of \(l\) words, GloVe [27] is adopted to generate a word embedding sequence, where \(w_i \in {\rm{\mathbb{R}}}^{d_w }\) denotes the \(i\)-th word vector:

$$ W_{{\text{emb}}} = WordEmbedding(q) = [w_1 ,...,w_l ]. $$
(2)

The word embedding sequence \(W_{emb} \in {\rm{\mathbb{R}}}^{d_w \times l}\) is then fed into a Gated Recurrent Unit (GRU) [28] with hidden dimension \(d_G\) to obtain the semantic feature:

$$ Q = GRU(W_{emb} ) = [\gamma_1 ,...,\gamma_l ], $$
(3)

where \(Q \in {\rm{\mathbb{R}}}^{d_G \times l}\), and \(\gamma_i\) is the GRU hidden state corresponding to the \(i\)-th word.
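A minimal sketch of this question encoder is given below, assuming pretrained 300-D GloVe vectors and a single-layer GRU; the hidden size \(d_G\) shown is an assumed value, not the authors' exact setting.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Eqs. (2)-(3): GloVe word embeddings followed by a GRU."""
    def __init__(self, glove_weights: torch.Tensor, d_G: int = 1024):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pretrained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.gru = nn.GRU(input_size=glove_weights.size(1), hidden_size=d_G, batch_first=True)

    def forward(self, token_ids: torch.Tensor):
        # token_ids: (B, l) word indices of the question.
        W_emb = self.embed(token_ids)   # (B, l, 300) word embedding sequence, Eq. (2)
        Q, _ = self.gru(W_emb)          # (B, l, d_G) per-word hidden states, Eq. (3)
        return W_emb, Q
```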

3.4 Visual Feature Extractor

For the visual feature, we adopt the two best meta-models obtained from MMQ [22] and a CDAE [2] as the visual feature extractor, as shown in Fig. 1. Specifically, each meta-model contains four 3×3 convolutional layers, and each convolutional layer includes 64 filters. The extractor yields three feature vectors, which we concatenate to obtain the visual feature. It is denoted as \(V \in {\rm{\mathbb{R}}}^{d_V }\), where \(d_V = 192\) is the dimension of the feature.
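The concatenation step can be sketched as follows, where the two meta-models and the CDAE encoder are placeholder modules assumed to each output a 64-D vector; their internal architectures follow MMQ [22] and CDAE [2] and are not reproduced here.

```python
import torch
import torch.nn as nn

class VisualFeatureExtractor(nn.Module):
    """Concatenate two MMQ meta-models and a CDAE encoder into a d_V = 192 feature (Sect. 3.4)."""
    def __init__(self, meta_model_1: nn.Module, meta_model_2: nn.Module, cdae_encoder: nn.Module):
        super().__init__()
        # Each sub-network is assumed to map an input image to a 64-D vector.
        self.backbones = nn.ModuleList([meta_model_1, meta_model_2, cdae_encoder])

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = [net(image) for net in self.backbones]   # three (B, 64) vectors
        return torch.cat(feats, dim=-1)                  # visual feature V of shape (B, 192)
```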

3.5 Attention-Based Multimodal Fusion Module

This module calculates the word-based attention \(A_{VW}\) and the sentence-based attention \(A_{VQ}\) using the following equations respectively.

$$ A_{VW} = {\text{softmax}} (l \times w_1 \times ((w_2 \times V) \circ (w_3 \times W_{{\text{emb}}} )) + b), $$
(4)
$$ A_{VQ} = {\text{softmax}} (l^{\prime} \times w^{\prime}_1 \times ((w^{\prime}_2 \times V) \circ (w^{\prime}_3 \times Q)) + b^{\prime} ), $$
(5)

where \(l\) and \(l^{\prime}\) are learnable weight matrices, \(w_x\) and \(w^{\prime}_x\) denote fully connected layers, and \(b\) and \(b^{\prime}\) are scalar biases. Besides, \(\circ\) indicates element-wise multiplication. The softmax functions in Eq. (4) and Eq. (5) normalize the attention weights.

The attention weight of image feature is computed as:

$$ A_V = A_{VQ} + A_{VW} . $$
(6)

The attention weight \(A_V\) and the visual feature \(V\) are then element-wise multiplied to obtain the attended visual feature,

$$ V^{\prime} = A_V \circ V. $$
(7)

The attended visual feature and the question feature are each sent through a fully connected layer, and the resulting vectors are element-wise multiplied to obtain the joint embedding \(M\). \(M\) is then sent to a classifier, and the predicted answer \(\hat{a}\) is the candidate answer with the highest probability. The accuracy is computed as follows:

$$ Accuracy = \frac{1}{{n_{Test} }}\sum_{i = 1}^{n_{Test} } {(Onehot(\arg \max (\hat{a})) \cdot a)} , $$
(8)

where \(a\) denotes the one-hot vector of the correct answer and \(n_{Test}\) is the number of test samples.
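The following is a minimal sketch of this fusion/classification step and the accuracy of Eq. (8), assuming the question feature is taken as the last GRU hidden state and that ground-truth answers are given as class indices; the layer sizes and names are our own assumptions.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Joint embedding M and answer classifier (Sect. 3.5)."""
    def __init__(self, num_answers: int, d_V=192, d_G=1024, d_joint=1024):
        super().__init__()
        self.fc_v = nn.Linear(d_V, d_joint)
        self.fc_q = nn.Linear(d_G, d_joint)
        self.classifier = nn.Linear(d_joint, num_answers)  # num_answers = |A|, dataset-dependent

    def forward(self, V_att, Q):
        q_feat = Q[:, -1, :]                      # assumed: last GRU state as question feature
        M = self.fc_v(V_att) * self.fc_q(q_feat)  # element-wise product -> joint embedding M
        return self.classifier(M)                 # scores over the candidate answers

def accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Eq. (8), with ground-truth answers given as class indices instead of one-hot vectors."""
    return (logits.argmax(dim=-1) == labels).float().mean().item()
```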

3.6 Loss Function

The predicted answers are utilized to compute the binary cross-entropy loss during training,

$$ L_{CE} = - \frac{1}{{n_{Train} }}\sum_{i = 1}^{n_{Train} } {(a\log (\hat{a}) + (1 - a)\log (1 - \hat{a}))} . $$
(9)

In addition, an Attention Alignment Loss (AAL) is proposed to align the word-based attention and the sentence-based attention to emphasize relevant and crucial image regions. The loss function is computed as follows:

$$ L_{AAL} = \frac{1}{{n_{Train} }}\sum_{i = 1}^{n_{Train} } {\left\| {A_{VQ} - A_{VW} } \right\|}^2 . $$
(10)

Finally, the total loss function is calculated as follows:

$$ L{\text{oss}} = \alpha L_{AAL} + L_{CE} , $$
(11)

where \(\alpha\) is a weighting parameter.
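The training objective of Eqs. (9)-(11) can be sketched as follows, assuming one-hot (or multi-label) answer targets and the two attention maps returned by the fusion module; this is an illustrative reading, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def mamf_loss(logits: torch.Tensor, targets: torch.Tensor,
              A_VQ: torch.Tensor, A_VW: torch.Tensor, alpha: float = 1.2) -> torch.Tensor:
    """Eq. (11): Loss = alpha * L_AAL + L_CE (an illustrative reading)."""
    # Eq. (9): binary cross-entropy between predicted scores and one-hot answer targets.
    l_ce = F.binary_cross_entropy_with_logits(logits, targets.float())
    # Eq. (10): mean squared difference between sentence- and word-based attention weights.
    l_aal = ((A_VQ - A_VW) ** 2).sum(dim=-1).mean()
    return alpha * l_aal + l_ce
```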

4 Experiments

4.1 Datasets

Two prevalent medical VQA datasets are adopted to evaluate our proposed MAMF: (1) VQA-RAD [5]: It contains 3,515 question-answer pairs and 315 radiology images, and several questions may refer to the same image. Clinicians or patients ask various questions about position, presence, organ and other aspects. (2) PathVQA [29]: It contains 32,799 question-answer pairs, covering "how", "what", "where" and other question types. There are 3,328 medical images obtained from the PEIR digital library and 1,670 pathological images selected from the medical literature. The answer types of both datasets are classified as close-ended and open-ended: close-ended answers are "yes/no" or one of several options, while open-ended answers are free-form text. The question-answer pairs of the PathVQA dataset are generated by a semi-automated approach using image captions and then manually reviewed and modified by clinicians.

4.2 Experiment Settings

All experiments are performed on an Ubuntu 20.04.4 server with an NVIDIA GTX 1080 GPU using PyTorch 1.8. We adopt the Adam optimizer to optimize our model. The learning rate is set to 1e-4 and the batch size is set to 128. For textual features, each question is fixed to 12 words. GloVe [27] is utilized to generate the word embeddings, which are input into the GRU [28] to obtain the question feature. For visual representations, each 128 × 128 image is input into the two quantified meta-models obtained from MMQ [22] and a Convolutional Denoising Auto-Encoder, which generate three vectors. The enhanced visual feature is produced by concatenating these vectors. We adopt accuracy, precision, recall and F1-score (denoted as Acc, P, R, F1) as evaluation metrics.
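Under these settings, a hypothetical training loop might look as follows; the dataset object, the model's forward signature, and the reuse of the mamf_loss sketch from Sect. 3.6 are our own assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs: int = 150, alpha: float = 1.2):
    """Training loop matching the reported settings: Adam, lr 1e-4, batch size 128."""
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for image, token_ids, target in loader:
            # Hypothetical forward signature: returns answer scores and both attention maps.
            logits, A_VQ, A_VW = model(image, token_ids)
            loss = mamf_loss(logits, target, A_VQ, A_VW, alpha)  # sketch from Sect. 3.6
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```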

4.3 Baseline Models

The medical VQA baselines, including MAML, BAN, MEVF [11], MMQ [22], CR [13] and CMSA [26], are reimplemented using their open-source code. Brief descriptions of the baselines are given in Table 1.

Table 1. The brief descriptions of baseline models.

4.4 Results

The results of our proposed MAMF and the baseline models on the VQA-RAD test set are shown in Table 2. The results of the baseline models are re-implemented using the available code. The table shows that MAMF is significantly superior to the other state-of-the-art baselines. MAMF gains the best overall accuracy of 74.94%, precision of 82.39%, recall of 74.94% and F1-score of 78.02%. For close-ended and open-ended tasks, we also achieve the best performance except for the precision of the open-ended tasks. Although we utilize the MMQ method to enhance our image feature extractor, the reason may be that our model reduces the prediction probability of some true positive samples during the fusion stage. The open-ended questions are harder for medical VQA models to answer correctly, since their answers can be free-form text. However, our proposed MAMF still outperforms the other baselines, benefitting from the W2I attention, S2I attention, and AAL.

We also perform experiments on the PathVQA dataset. Compared with the VQA-RAD dataset, PathVQA has more diversity, so it can verify the robustness of our proposed MAMF. The results are shown in Table 3. Our proposed MAMF gains the best performance, reaching the best accuracy of 54.28%, precision of 65.82%, recall of 54.28% and F1-score of 52.38% on the entire test set. MAMF obtains a dramatic improvement on the open-ended questions compared with the other baseline models. The reasons for this improvement are as follows. First, MAMF builds a word-level correlation representation between the word embeddings and the image feature, which filters out unrelated regions in the image and retains the essential ones for predicting the answer. Second, our proposed AAL aligns the attention weights of image regions learned from the W2I attention and the S2I attention to recognize the essential words and image regions for reasoning.

Table 2. Results on the VQA-RAD.
Table 3. Results on the PathVQA.

4.5 Ablation Study

Several ablation experiments are conducted to verify the effectiveness of each part of MAMF. The results are shown in Table 4 and Table 5. We remove the W2I attention, the S2I attention and the AAL in turn. The performance of MAMF without the W2I attention and of MAMF without the S2I attention decreases dramatically on both datasets compared with the complete form of MAMF. Without the W2I attention, the model cannot establish word-level correlations between the word embeddings and the image feature; thus, it can only use the coarse sentence-level multimodal semantic representation for rough reasoning. Without the S2I attention, the model can neither properly understand the meaning of questions nor predict high-quality answers. These two ablation instances show the effectiveness of the W2I and S2I attention. The model MAMF without AAL also obtains poor performance on the two datasets. As discussed in Sect. 3.6, the AAL is used to align the W2I attention and the S2I attention, which helps locate crucial image regions to optimize the model. Furthermore, the complete form of MAMF obtains the best performance. Consequently, our proposed MAMF gains satisfactory performance because it utilizes the W2I and S2I attention to obtain multi-level semantic information about the image from the word-level and sentence-level features of the question, respectively, and employs the AAL to maximize the similarity of the relevant regions obtained from the two attentions.

Table 4. Ablation experiments on the VQA-RAD.
Table 5. Ablation experiments on the PathVQA.
Table 6. Results when \(\alpha\) in Eq. (11) varies from 0 to 2.0.
Fig. 2. The loss curve of MAMF.

4.6 Hyperparameter Analysis

We assign different values to the hyperparameter α in the AAL in Eq. (11) and conduct experiments on the VQA-RAD dataset, as shown in Table 6. The overall task and the open-ended task gain the best performance when α is 1.2. Therefore, α is set to 1.2 when training our proposed model.

We train MAMF for 150 epochs. The loss curve and accuracy curve of MAMF are shown in Fig. 2 and Fig. 3, respectively. As shown in Fig. 2, the loss of MAMF reaches a relatively stable state after approximately 150 epochs. Fig. 3 shows that the accuracy curve also gradually becomes stable. Consequently, the number of training epochs is set to 150.

Fig. 3. The accuracy curve of MAMF.

4.7 Qualitative Evaluation

The qualitative evaluation of our proposed MAMF and the best baseline CMSA on the VQA-RAD dataset is shown in Fig. 4. For the first VQA task, the baseline model CMSA cannot select all the relevant regions needed to answer the clinical question, whereas our proposed model locates all the related regions and correctly predicts the answer. Note that the actual left/right orientation of a radiological image is opposite to what the viewer sees.

Fig. 4. Visualization of the performance of our proposed MAMF and the baseline CMSA.

Therefore, "left" in the answer refers to the right-hand region of the displayed image. For the second task, CMSA mistakes the liver for the kidney, while our method finds that there is no kidney in the image. For the third task, the baseline can identify the related image region, but it cannot recognize the concrete region needed to answer the question. In contrast, our model identifies the crucial image region and provides an accurate answer.

These instances show that our method is better at locating relevant and crucial regions in the medical image and at understanding the clinical question. Therefore, it can provide concrete and accurate answers to complex Medical-VQA tasks.

5 Conclusion

This paper presents a Multi-level Attention-based Multimodal Fusion (MAMF) model. MAMF utilizes word embeddings and the question feature to identify the relevant and key regions of the medical image by adopting a W2I attention and an S2I attention, thereby obtaining a multi-level multimodal semantic representation. Moreover, we propose an Attention Alignment Loss to align the word-based attention and the sentence-based attention and recognize the relevant and crucial regions in medical images. This model can assist clinicians in diagnosing different diseases and can help patients obtain answers to health-related questions. Experiments show that our model significantly outperforms the related state-of-the-art baselines.