1 Introduction

Radiology images are crucial in clinical decision-making; chest X-rays used to diagnose COVID-19 are one example. By answering questions about image contents, automated systems could assist clinicians in dealing with the massive amounts of imagery produced during a pandemic. Patients also benefit from such a sophisticated question-answering system. Rather than sifting through non-specific and overwhelming search engine results, they can get answers to simple queries about their medical imaging. Visual Question Answering (VQA), an emerging branch of artificial intelligence, offers a path toward this type of clinical decision support.

In general, VQA seeks to provide an accurate answer to a question about an image. It is an AI-complete [1] task that combines Natural Language Processing (NLP) with Computer Vision (CV), two major computer science research domains. Medical VQA (MVQA) deals with medical images and the questions that accompany them. Because structured and unstructured medical data are increasingly accessible to patients through patient portals, MVQA helps to promote patient engagement in clinical decision making. MVQA can also serve as a personal assistant for doctors, providing a second opinion when interpreting difficult medical images.

When compared to other vision-language tasks like image captioning, general VQA is a challenging multi-modal, knowledge-based task. Researchers have adopted a three-phase workflow to address the VQA problem, as shown in Fig. 1. The two sub-tasks of phase 1, image featurization and question featurization, owe their success to the deep learning models Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), respectively. The VQA research community has focused extensively on phase 2, which involves the joint comprehension of the extracted multi-modal features. The majority of VQA systems were designed to predict the correct answer from a pool of candidates in the final phase; VQA can be viewed as a classification task in this context. Open-ended VQA is a variant in which the final answers are generated as free-form phrases. Table 1 shows the most common approaches for implementing phases 1 and 2 in the VQA literature.

Fig. 1 Three phases of the general VQA pipeline

Table 1 Challenges of VQA sub-tasks when applied to medical domain (MVQA)

Because of the huge difference between the general and medical domains, knowledge transfer from the general domain based on inadequate medical data can be of limited use [39]. Table 1 also highlights the specific challenges that MVQA poses at various stages of the general VQA pipeline. In summary, the key challenges of MVQA are: 1) semantic parsing of clinical language while interpreting medical questions, 2) poor contrast and differing modalities (MRI, CT, X-ray) of medical images, and 3) the need for a sophisticated fusion method.

In response to the above concerns, a unique MVQA system is developed that can produce valid results with limited data. The model seeks to give equal emphasis to multi-modal feature extraction and fusion. The power of unsupervised learning with denoising auto-encoders is used to handle image featurization. With domain-specific BioBERT embeddings [26] and a newly proposed term-weighting scheme, the system is able to interpret complex clinical terminology in questions. Finally, two parallel branches of multi-head attention modules accomplish multi-modal fusion to correctly comprehend visual and language data and predict the correct answer. The proposed model is named Multi-head attention for Medical VQA (MaMVQA).

Inspired by the success of VQA in the general domain and its potential applications in the medical domain, the authors of [24] released VQA-RAD, the first manually constructed dataset for MVQA, in which questions and answers about images are naturally created and validated by clinicians. The VQA-RAD dataset contains two major types of questions: closed-ended questions, whose answers are limited to a set of predetermined options, and the harder open-ended questions, whose answers are free-form text. The proposed MaMVQA model achieves superior accuracy on VQA-RAD, especially for open-ended questions.

In summary, the main contributions of this paper include:

  • The paper proposes a parallel Multi-head Attention network for Medical domain VQA (MaMVQA).

  • The paper introduces a novel semantic supervised term-weighting scheme named ‘qf.MI’ that utilizes BioBERT word embeddings and mutual information scores to assign word weights.

  • Presentation of results obtained from the extensive experiments and comparative studies carried out on the VQA-RAD dataset.

  • Discussion and analysis of the quantitative and qualitative results of ablation studies conducted to validate the significance of individual components of MaMVQA.

The remainder of this paper is arranged as follows: We briefly discuss several recognized related works in Section 2. Section 3 presents the details of the MaMVQA framework and the proposed term-weighting scheme. Section 4 describes the experiments conducted to evaluate the novel methods and discusses the results. Finally, we conclude in Section 5.

2 Related works

The first part of this section summarizes the progress made in general VQA. The second part focuses on studies that employed the VQA-RAD dataset in their experiments. A dedicated subsection also gives a quick rundown of the most common term-weighting strategies.

2.1 Visual question answering

After Antol et al. [1] introduced the VQA system, numerous researchers [12, 17, 32, 55] built on the baseline VQA pipeline, which included CNN/R-CNN [40] for image featurization, RNN for question featurization, and an elementary feature fusion step. The focus of VQA research in the following phase was on feature fusion techniques, which led to certain milestones in the field [8, 19, 47]. Later, with the enormous influence of attention mechanisms on VQA, various types of attention networks were introduced to the stack of VQA systems [2, 13, 18, 23, 30, 36, 43, 50, 52, 53].

2.2 Medical visual question answering with VQA-RAD

While proposing VQA-RAD, the first manually constructed high-quality dataset for VQA in the medical domain, Lau et al. [24] discussed and tested two baseline solutions. The two models stem from two well-known general VQA models, Multimodal Compact Bilinear pooling (MCB) [8] and the Stacked Attention Network (SAN) [51], respectively. MCB incorporates three components in line with the VQA phases: a CNN image model with ResNet-152, an LSTM question model, and MCB pooling to forecast attention and to jointly embed the attended visual representation with the question representation. The SAN model, as the name implies, attends to the image several times to refine visual attention over time. In the experiments, these baseline models were named MCB-RAD and SAN-RAD, respectively.

Nguyen et al. [35] reported a powerful baseline for VQA-RAD called BAN-RAD, which uses a Bilinear Attention Network (BAN) [20] instead of SAN in the joint comprehension phase of VQA. However, due to the significant differences between medical and general VQA data, such direct adaptations suffer from a serious lack of data and of multi-modal reasoning ability. To tackle the challenges caused by transfer learning in image feature extraction, the authors adopted model-agnostic meta-learning (MAML) [7] to train meta-weights that swiftly adapt to visual concepts. Later, the MEVF [35] network extended MAML-based image feature extraction with a Convolutional Denoising Auto-encoder (CDAE) trained on an external dataset. This combination alleviated MVQA’s data constraint.

Do et al. (2021) [5] presented a multiple meta-model quantifying (MMQ) method to increase meta-data through auto-annotations and noisy label processing. This method allows more meaningful feature extraction from medical images without external data.

Table 2 critically compares various state-of-the-art medical VQA systems from the literature. The table dissects each work to discuss the strategies used for image understanding, question understanding and multi-modal feature fusion and the datasets used in experiments along with special remarks on the methodology.

Table 2 Overview of the current state-of-the-art medical VQA works. The table focuses on the way the works tackled the three main tasks in the general VQA pipeline. The last phase, ‘answering’, is omitted because all the works treat VQA as a classification task. [* indicates the works that resemble the proposed system and are critically examined later in section 4.3]

2.3 Term-weighting schemes

Initially, supervised term-weighting (STW) schemes were created by combining the classic unsupervised component ‘term frequency (tf)’ with feature selection metrics such as chi-square (χ2), information gain (ig) and odds ratio (OR), yielding the tf.χ2, tf.ig, and tf.OR STW schemes (Debole and Sebastiani, 2004) [4]. Lan et al. (2008) [22] presented tf.rf, a new approach based on the concept of ‘relevance frequency’, which marked a significant milestone in STW research. It was built on the principle that “the more concentrated a high-frequency term is in one class c than in others, the greater contributions it makes in identifying class c samples.”

Quan et al. (2010) [38] improved this by emphasizing the importance of terms with a high frequency in the positive category, particularly in the context of question classification. For this, they proposed four new concepts.

  • question frequency (qf): qf(t) is the number of samples in positive category that contain t.

  • category frequency (cf): cf(t) is the number of classes in which term t occurs.

  • inverse category frequency (icf): used to measure discriminative power of a term t.

  • inverse question frequency (iqf): similar to inverse document frequency (idf).

They proposed three new STW schemes named qf.icf, iqf.qf.icf and vrf and obtained the best results for the second one.

The frequency of the weighting term, in the form of tf or qf, is a fixed component in all of these schemes, and the various factors multiplied with it carry statistical information about the terms. To get closer to how humans determine term weights, STW should exploit text semantics. In Wei et al. (2011) [49], the global statistical component of STW is replaced by the semantic distance between terms and category core terms (CCTs), the representative terms of a class. The technique presumes that the problem is a binary classification task with well-defined positive and negative CCTs, and the procedure becomes infeasible as the number of classes grows.

Luo et al. (2011) [31] considered the semantics of categories as well as WordNet’s interpretation of the terms occurring in the category label. The similarity between a word and the category semantics is used to compute the multiplicative factor of STW. When the words in category labels are not generic and/or are domain-specific, as in clinical text, the usage of WordNet senses becomes ineffective because the majority of these words will not be found in WordNet. Determining the meaning of medical terminology for term-weighting in clinical content is therefore a challenging task. Matsuo and Ho (2018) [33] suggested a two-phase approach that uses a medical ontology to generate a two-part hierarchy for determining semantic weights of terms in clinical texts.

3 Method

The proposed parallel Multi-head Attention network for Medical domain VQA (MaMVQA) is portrayed in Fig. 2, which includes a denoising auto-encoder for medical image representation learning, two question encoding branches and two parallel multi-head attention blocks for feature fusion as its core components.

Fig. 2 MaMVQA architecture

3.1 Problem modeling

Given a medical image I and an associated natural language question Q, the goal is to predict the correct answer. This can be mathematically formulated as:

$$ \hat{y} = \arg\max_{a \in A} P\left(a \mid Q, I, \theta\right) $$
(1)

where \( \hat{y} \) represents the final classification result, ‘A’ is the set of candidate answers and ‘a’ one of those answers. P denotes the MaMVQA framework and θ represents all parameters of MaMVQA. The framework is divided into three modules: feature extraction, feature fusion and answer prediction. The three modules are elaborated in the following subsections.
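As a concrete illustration of the decision rule in Eq. (1) (not part of the original formulation), the minimal Python sketch below picks the candidate answer with the highest predicted probability; the names `probs` and `candidate_answers` are hypothetical.

```python
# Minimal sketch of the decision rule in Eq. (1); names are hypothetical.
import numpy as np

def predict_answer(probs: np.ndarray, candidate_answers: list) -> str:
    """probs: softmax output over the candidate answer set A, i.e. P(a | Q, I, theta)."""
    return candidate_answers[int(np.argmax(probs))]
```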

3.2 Feature extraction

VQA is an AI-Complete task that requires multi-modal knowledge that extends beyond a single sub-domain. As a result, the quality of features extracted is crucial for subsequent operations. The image and the corresponding natural language question are two modalities that must be investigated as part of the VQA task.

3.2.1 Image Featurization

The majority of image features in conventional VQA systems are extracted with CNN models pre-trained on the ImageNet database. Fine-tuning those models so that they adequately predict medical image features necessitates a massive amount of labelled data, which is unavailable in medical VQA. Thus, for VQA-RAD images, we designed and trained an unsupervised denoising autoencoder based on a convolutional neural network.

An autoencoder is a neural network structure that generates its targets from the input data via self-supervision. At each step, the encoder part of the autoencoder extracts input image features and compresses the learning into a hidden or bottleneck representation h.

$$ e: h = \mathrm{ReLU}\left(Wx + b\right) $$
(2)

The decoder then maps the latent compressed representation h back to a reconstruction z with the same shape as x.

$$ d: z = \mathrm{ReLU}\left(W'h + b'\right) $$
(3)

The parameters W, W′, b, and b′ are optimized by minimizing a loss function that measures how similar the reconstructed image is to the input image.

The overfitting tendency of traditional autoencoders can be controlled by adding a stochastic noise component to the input image; the model is then trained to rebuild the uncorrupted original image as its output. Denoising autoencoders (DAE) are a type of autoencoder based on this strategy. Convolutional DAEs (CDAE) are a variant of DAE that adds convolutional encoding and decoding layers to the normal DAE architecture. Previous research [10, 25] has demonstrated that CDAEs perform better in image processing than ordinary DAEs because CDAEs exploit the full capabilities of CNNs to comprehend visual structure.

To avoid degradation [15] caused by several convolutional and deconvolutional layers, residual connections from the CDAE encoder to the decoder are added, bypassing the bottleneck layer [6, 25]. These extra connections allow feature maps to be sent straight from an earlier encoder layer to a later decoder layer. This aids the decoder in producing better specified decompressions of the input image.

The SkipCDAE architecture used in this study to learn the features of VQA-RAD medical images is shown in Fig. 3. The input images are 512*512*1 (gray-scale) matrices, to which Gaussian noise with a standard deviation of 0.2 is added. Then, using a convolution kernel of size 3*3, 32 filters of size 3*3*1 are employed to construct 32 feature maps from the input layer; a LeakyReLU activation function follows this layer. Layers 2 to 14 have 64, 128, 256, 512, 1024, 1024, 512, 256, 256, 128, 64, and 32 filters respectively, with LeakyReLU as the activation function for all of them. Sum aggregation is used to add two skip connections from the fourth and sixth encoder layers to the corresponding decoder layers. Simple augmentations such as rotation, scaling, and normalization are applied to the VQA-RAD images before they are used to train SkipCDAE, enlarging the available image set.

Fig. 3 SkipCDAE architecture used for image feature extraction
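For concreteness, a simplified Keras sketch of a skip-connected convolutional denoising autoencoder in the spirit of SkipCDAE is given below. It keeps the key ingredients (Gaussian input noise with standard deviation 0.2, a convolutional encoder/decoder, and additive skip connections) but uses far fewer layers and filters than the 14-layer configuration described above; every layer count, filter size, and pooling choice here is an illustrative assumption.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.LeakyReLU()(x)

def build_skip_cdae(input_shape=(512, 512, 1), noise_std=0.2):
    inp = layers.Input(shape=input_shape)
    noisy = layers.GaussianNoise(noise_std)(inp)       # stochastic corruption of the input

    # Encoder with down-sampling
    e1 = conv_block(noisy, 32)
    p1 = layers.MaxPooling2D()(e1)
    e2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(e2)
    bottleneck = conv_block(p2, 128)                   # compressed representation h

    # Decoder with up-sampling and additive (sum-aggregated) skip connections
    u2 = layers.UpSampling2D()(bottleneck)
    d2 = layers.Add()([conv_block(u2, 64), e2])        # skip connection from an encoder layer
    u1 = layers.UpSampling2D()(d2)
    d1 = layers.Add()([conv_block(u1, 32), e1])        # skip connection from an earlier layer
    out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(d1)

    model = models.Model(inp, out, name="skip_cdae_sketch")
    # Trained to reconstruct the clean image, e.g. model.fit(clean_imgs, clean_imgs, ...)
    model.compile(optimizer="adam", loss="mse")
    return model
```

After training, the encoder activations at the bottleneck can serve as the learned image representation for the downstream VQA modules.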

3.2.2 Question Featurization

Pre-processing begins by tokenizing the natural language questions into words and converting the words to lowercase. Next, punctuation marks attached to the words, including the question mark (?), are removed. Each question is then reassembled into a sequence of ‘n’ tokens, where ‘n’ is the number of words in the training dataset’s longest question; questions with fewer than ‘n’ words are zero-padded. The value of ‘n’ in VQA-RAD is 22. The pre-processed question data is then fed into two distinct question featurization modules of the MaMVQA framework, described below.
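These pre-processing steps can be sketched as follows; the whitespace tokenizer, the example question, and the ‘<pad>’ placeholder token are assumptions, while the fixed length of 22 follows the value of n reported for VQA-RAD.

```python
import string

MAX_LEN = 22  # n: length of the longest training question in VQA-RAD

def preprocess_question(question: str) -> list:
    # Lowercase, strip punctuation (including '?'), tokenize on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokens = question.lower().translate(table).split()
    # Pad (here with a '<pad>' placeholder) or truncate to the fixed length n.
    tokens = tokens[:MAX_LEN]
    return tokens + ["<pad>"] * (MAX_LEN - len(tokens))

print(preprocess_question("Is there evidence of an aortic aneurysm?"))  # illustrative question
```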

Domain-specific question embedding

In this branch, question word sequences are represented by the BioBERT [26] word embedding matrix, \( D=\left\{{e}_1,{e}_2,\dots, {e}_n\right\}\in {R}^{n\ast {d}_e} \), where de is the embedding dimension of each word in the sequence, which is 768 for BioBERT. BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is the first domain-specific language representation pre-trained on large-scale biomedical corpora. Its creators demonstrated the effectiveness of BioBERT embeddings for question answering compared with state-of-the-art general-domain language models.

The embedded word vectors are then fed into a bidirectional LSTM (biLSTM) network to generate the final question encoding, \( Q1=\left\{{q}_1,{q}_2,\dots, {q}_N\right\}\in {R}^{N\ast {d}_q} \), where N is the total number of questions in the training set and dq is the dimension of the last hidden state output of the biLSTM network, set to 2048 in our experiments. The biLSTM can capture both contextual and historical information because it aggregates information from the forward and backward directions of a sentence.
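A possible Keras realisation of this branch is sketched below, assuming a pre-computed BioBERT embedding matrix (`biobert_matrix`, of shape vocab_size × 768); keeping the embedding layer frozen is an assumption not stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_question_encoder(vocab_size, biobert_matrix, max_len=22):
    tokens = layers.Input(shape=(max_len,), dtype="int32")
    emb = layers.Embedding(
        input_dim=vocab_size, output_dim=768,
        embeddings_initializer=tf.keras.initializers.Constant(biobert_matrix),
        trainable=False)(tokens)                 # BioBERT word vectors (frozen: assumption)
    # 1024 units per direction -> 2048-dimensional question encoding after concatenation
    q_enc = layers.Bidirectional(layers.LSTM(1024))(emb)
    return models.Model(tokens, q_enc, name="biobert_bilstm_encoder")
```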

Supervised term-weighted (STW) question embedding

The primary goal of the STW scheme is to limit the impact of irrelevant words on subsequent tasks. For each word appearing in the question of interest, its weight is looked up in the term-weight vector created with the newly proposed STW method qf.MI. The scalar weight of each word is multiplied by its distributed embedding vector, yielding a weighted embedding matrix E ∈ R|V| × d, where |V| is the corpus vocabulary size and d is the dimension of the distributed embedding vector. The embedding layer of the proposed question featurization network is initialized with this E. The final weighted question encoding Q2 is obtained by feeding the embedding layer output to a biLSTM network.
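The row-wise scaling that produces the weighted embedding matrix E can be sketched in a couple of lines; the array names are hypothetical.

```python
# Sketch of the supervised term-weighted embedding matrix: each row of the
# distributed embedding matrix is scaled by that word's scalar qf.MI weight.
import numpy as np

def weighted_embedding_matrix(embeddings: np.ndarray, qf_mi_weights: np.ndarray) -> np.ndarray:
    """embeddings: (|V|, d) matrix; qf_mi_weights: (|V|,) vector of term weights."""
    return qf_mi_weights[:, None] * embeddings   # broadcast each weight over its embedding row
```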

Proposed STW scheme: qf.MI

In STW, two policies are available for assigning term-weighted feature vectors: local and global. In the local policy, each class has its own weight vector, resulting in a set of feature vectors for each training sample corresponding to the distinct classes. In the global policy, a common weight vector is inferred from the per-class local vectors. This can be accomplished by performing ‘max’, ‘avg’, or ‘sum’ operations over the local vectors, with the ‘max’ strategy performing best in earlier studies [4, 38]. The global weight vector is determined with the ‘max’ policy as shown in Eq. 4.

$$ w(t)=\underset{1\le i\le \left|C\right|}{\max }w\left(t,{c}_i\right) $$
(4)

where w(t) is the final weight of the term t, w(t, ci) is the weight of t in class i and C is the set of classes.

Two concepts are used in the proposed scheme: question frequency (qf) and mutual information (MI). The first, qf, is borrowed from [38] and replaces the term frequency (tf) as the first multiplicative factor of the STW scheme. Their research found this approach particularly useful for short texts in which many tf values equal one, and the questions of a VQA task are exactly such short texts. The qf of a term t represents its frequency of occurrence in the class of interest (positive category):

$$ qf(t) = \log\left(t_p + 1\right) $$
(5)

where t_p is the number of questions in the class of interest that contain the term t.

The second factor, MI, stems from information theory and denotes the statistical relatedness between two random variables [41]. It is a non-negative value, and a higher value indicates a stronger dependency. MI can be mathematically computed using Eq. 6.

$$ I(X,Y) = \iint p(x,y)\,\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)\,dx\,dy $$
(6)

Each (image, question) pair in the VQA dataset has an associated answer. The answer offers a wealth of information that helps to weight the question words. This idea is put into action by computing the MI between each word vector and the corresponding answer vector to allocate a weight to each term. The motivation behind the scheme can be summarized as follows: ‘the more closely a word is related to the label text of the class of interest, the better it can assist the question representation in forecasting the proper answer.’

The complete procedure of term-weight calculation using qf.MI is depicted in Fig. 4. The process creates an STW weight matrix W of size ∣C∣ × |V| from a corpus consisting of |C| classes and a vocabulary of size |V|. Each row of W is the local weight vector of one class. Each class vector is built so that terms appearing at least once in the class of interest are given weights computed according to Eq. 7; terms that do not occur in the class of interest ci but do appear in V are assigned the lowest of all computed weights for that class.

$$ W(i,t) = \begin{cases} qf.MI\left(t, c_i\right) & \text{if } t \in class\_vocab\left(c_i\right) \\ \min\limits_{1 \le j \le \left|class\_vocab\left(c_i\right)\right|} qf.MI\left(t_j, c_i\right) & \text{if } t \in V - class\_vocab\left(c_i\right) \end{cases} $$
(7)
Fig. 4 Pictorial representation of qf.MI term weight calculation for a single class of interest
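The weight computation can be sketched as below. The paper does not specify how MI between a word vector and an answer vector is estimated; the discretization-based estimate used here (binning both BioBERT vectors and applying scikit-learn’s mutual_info_score) is one plausible choice and should be treated as an assumption, as should the helper names.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def qf(term, class_questions):
    """qf(t) = log(t_p + 1); class_questions: tokenized questions of the class of interest."""
    t_p = sum(1 for q in class_questions if term in q)
    return np.log(t_p + 1)

def mi(term_vec, answer_vec, n_bins=20):
    """Rough MI estimate between a term's BioBERT vector and the answer-text vector,
    obtained by discretizing both vectors into bins (an assumed estimator)."""
    t_bins = np.digitize(term_vec, np.histogram_bin_edges(term_vec, bins=n_bins))
    a_bins = np.digitize(answer_vec, np.histogram_bin_edges(answer_vec, bins=n_bins))
    return mutual_info_score(t_bins, a_bins)

def qf_mi_local(vocab, embeddings, answer_vec, class_questions):
    """Local weight vector for one class of interest (first case of Eq. 7)."""
    return np.array([qf(t, class_questions) * mi(embeddings[t], answer_vec) for t in vocab])

def global_weights(local_weight_matrix):
    """'max' policy of Eq. 4 over the (|C|, |V|) matrix of local weight vectors."""
    return local_weight_matrix.max(axis=0)
```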

3.3 Feature fusion via multi-head attention (MHA)

Two parallel branches of the multi-head attention (MHA) network are used to interpret the extracted visual and linguistic cues. MHA is an attention module that runs an attention process several times in parallel using n heads. The individual attention outputs are then concatenated and linearly projected to the desired dimension. Multiple attention heads allow different kinds of attention to be paid to different areas of the inputs. The general process is portrayed in Fig. 5 and formulated as follows:

$$ MHA\left(Q, K, V\right) = \left[head_1, \dots, head_n\right] W_0, \quad \text{where } head_i = Attention\left(Q W_i^Q, K W_i^K, V W_i^V\right) $$
(8)
Fig. 5 Graphical representation of Multi-head Attention (MHA) [48]

All W matrices are learnable parameters, and the three inputs (Q, K, V) to the MHA module are named Query, Key, and Value. Mathematically, the layer first projects Q, K, and V. The query and key tensors are then dot-multiplied and scaled, and a softmax produces the attention scores. These scores are used to interpolate the value tensors, which are subsequently concatenated back into a single tensor. Finally, the result is linearly projected and returned.

In this paper, query and key refer to the extracted visual features and question features, respectively. Because no separate value (V) is specified, it is assumed to be the same as the key. These input tensors are split into several segments and sent to a number of distinct heads. Within each head, medical image features are compared with the corresponding question features and attention weights are assigned. Each cell of the resulting attention weight matrix represents the interaction between Q (image) and K and V (question).

Two parallel branches of multi-head attention are implemented with the two types of question embeddings in the proposed MaMVQA framework, as detailed in section 3.2.2. By executing question-guided visual attention on distinct areas of the input, the MHA module generates refined visual features with the help of its many heads.
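One fusion branch can be sketched with Keras’ built-in MultiHeadAttention layer as below, with 16 heads as reported later in the experiments; the tensor shapes and the key_dim value are illustrative assumptions.

```python
from tensorflow.keras import layers

def mha_branch(image_feats, question_feats, num_heads=16, key_dim=64):
    """image_feats: (batch, n_regions, d_img); question_feats: (batch, n_tokens, d_q)."""
    mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
    # Query = image features; key = value = question features (key defaults to value).
    attended = mha(query=image_feats, value=question_feats)
    return layers.Flatten()(attended)      # refined, question-guided visual features
```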

3.4 Answer prediction

The MaMVQA model as a whole is treated as a classification model. The outputs of the two multi-head attention branches are combined to generate a refined visual representation attended by the question inputs. The final fused vector z for answer prediction is formed by concatenating this representation with the domain-specific question embedding. The fused feature vector z is then projected to a vector a ∈ RN, followed by a softmax activation, where N is the number of candidate answers in the training dataset. An N-way classifier is trained on top of the feature vector z using the ‘categorical crossentropy’ loss function.
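A minimal sketch of this prediction head is given below; the concatenation order and the absence of intermediate dense layers are assumptions.

```python
from tensorflow.keras import layers

def answer_head(mha_out_1, mha_out_2, domain_question_enc, num_answers):
    # Fused vector z: both MHA branch outputs plus the domain-specific question encoding.
    z = layers.Concatenate()([mha_out_1, mha_out_2, domain_question_enc])
    logits = layers.Dense(num_answers)(z)   # project z to R^N (N candidate answers)
    return layers.Softmax()(logits)
```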

4 Experiment

This section describes the extensive experiments carried out to assess MaMVQA’s effectiveness. It begins with a description of the dataset before moving on to the implementation and evaluation details. MaMVQA is then compared with other state-of-the-art models on the VQA-RAD dataset. The ablation study is then presented to demonstrate the necessity of the suggested framework’s components. Finally, a qualitative analysis of MaMVQA on VQA-RAD data is presented.

4.1 Dataset description

In medical VQA datasets, the aim is to generate or predict the proper answer to natural language questions regarding the content of medical images. VQA-RAD is the first manually generated dataset in the field of medical VQA. It contains 315 medical radiological images divided evenly across the head, chest, and abdomen. The statistics of the dataset are as follows:

  • 3064 question-answer pairs form the training set.

  • 451 question-answer pairs form the test set.

The same train-test split was applied throughout the experiments. Abnormality (ABN), attribute (ATTRIB), colour, count, modality, organ, other, plane, positional reasoning (POS), object/condition presence (PRES), and size are the 11 question categories. The question-answer pairs are classified as open-ended or closed-ended based on the answer types. Closed-ended answers are yes/no or limited to a set of predetermined options, while open-ended answers represent a significant challenge for the VQA system.

4.2 Implementation setup and evaluation metric

All of the experiments were run on an NVIDIA P100 GPU with the Keras Python library. The embedded question tokens were fed through the biLSTM, which produced a 2048-dimensional output combining the forward and backward passes with a hidden layer dimension of 1024. The models were trained with the ‘Adam’ optimizer, a learning rate of 1e-5 and a batch size of 32. In the VQA-RAD dataset, there are a total of 458 answer classes. A total of 16 heads were used in each multi-head attention module. Keras Checkpoint and EarlyStopping callbacks were leveraged to save the best model during training by monitoring validation accuracy and validation loss. The maximum number of epochs was set to 200.
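The training configuration described above could look like the following sketch; `model` and the data arrays are hypothetical placeholders, and the EarlyStopping patience is an assumption since it is not reported.

```python
from tensorflow.keras import callbacks, optimizers

def train_mamvqa(model, x_train, y_train, x_val, y_val):
    # Reported hyper-parameters: Adam, lr 1e-5, batch size 32, up to 200 epochs.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    cbs = [
        callbacks.ModelCheckpoint("mamvqa_best.h5", monitor="val_accuracy",
                                  save_best_only=True),
        callbacks.EarlyStopping(monitor="val_loss", patience=10,   # patience value is assumed
                                restore_best_weights=True),
    ]
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     batch_size=32, epochs=200, callbacks=cbs)
```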

Accuracy, the proportion of correctly answered questions over the total number of questions, is used as the evaluation metric in all experiments. It can be represented as:

$$ acc=\frac{t}{N}\times 100\% $$
(9)

where t represents the number of correctly answered questions and N refers to the total number of questions. Following current VQA-RAD literature, open-ended and closed-ended accuracies are reported separately.

4.3 VQA-RAD benchmarking

The proposed MaMVQA model is compared with 9 existing models listed below:

  • SAN-RAD, MCB-RAD [24]: Two baselines directly adopted from general VQA literature using stacked attention and multi-modal compact bilinear pooling respectively.

  • BAN-RAD [35]: An advanced baseline, again stemming from the general VQA methodology of bilinear attention for multi-modal feature fusion.

  • MAML-SAN, MAML-BAN [7]: Introduced model-agnostic meta-learning to learn meta-weights to overcome problems caused by transfer learning for image featurization.

  • MEVF-SAN, MEVF-BAN [35]: Leveraged meta-learning and denoising autoencoder to extract image features.

  • MMQ-SAN, MMQ-BAN [5]: Aimed to increase metadata through automatic annotations and process noisy labels.

MEVF and MMQ resemble the proposed system and are critically analyzed in Table 3.

Table 3 MaMVQA vs. MEVF and MMQ: a critical comparison

The results of the proposed model and other existing methods on the VQA-RAD dataset are compared in Table 4. The quantitative analysis shows that the proposed system outperforms the other models in open-ended, closed-ended and overall accuracy. MaMVQA achieves a maximum improvement of 21% over SAN-RAD and a minimum improvement of 3% over MMQ-BAN in the closed-ended category, and similarly a 31% (maximum) and 2% (minimum) improvement in open-ended accuracy. The overall accuracy is also improved by more than 8% on average.

Table 4 Comparison of accuracies of different existing MVQA methods for VQA-RAD

As mentioned in section 4.1, VQA-RAD queries are grouped into 11 types, which are not evenly distributed in the training and test sets (see Fig. 6). MaMVQA’s accuracy for these question types is estimated by experimenting with the training and test sets separately. The results are presented in Table 5, where two separate columns provide the MaMVQA model’s percentage accuracies on training and test data over the various question categories. Figure 7 then shows the overall accuracy as an average of the training and testing accuracies for each question type. The performance disparity across a few question categories is due to data imbalance. In any case, the results demonstrate the proposed system’s general performance potential, with an average accuracy of 82%.

Fig. 6 VQA-RAD question distribution over question types

Table 5 Accuracy of the proposed systems’ answers sub-grouped into question types
Fig. 7 VQA-RAD overall accuracy (%) over the 11 question types

As mentioned in the dataset description, VQA-RAD evenly distributes the available radiology images and associated questions over three areas of the human body: CHEST, HEAD and ABDOMEN. Figure 8a and b show the statistics of the VQA-RAD train-test split across the three organ categories, and Table 6 shows the results of the MaMVQA model when evaluated separately on visual questions from these three categories. The proposed model shows generalizable performance across the different categories.

Fig. 8 (a) VQA-RAD train set distribution and (b) test set distribution over the organ types CHEST, HEAD and ABDOMEN

Table 6 Accuracy of the proposed systems’ answers sub-grouped into three organ types

4.4 Ablation study

This section discusses complete ablation studies on the VQA-RAD dataset to verify the significance of the various components integrated into the MaMVQA framework. Supervised term-weighting, parallel multi-head attention, and the usage of two different question representations are the primary factors to address. These ablations yield the following models:

  • MaMVQA: The full proposed model with parallel multi-head attention and two branches of question embedding, domain-specific and supervised term-weighted (refer to Fig. 2).

  • MaMVQA_Ab1 [MaMVQA - Term Weighting]: Omitting the term-weighting (qf.MI) from the second branch of question representation yields the least ablated form of MaMVQA. Questions are fed to the embedding layer, which subsequently feeds them to the biLSTM.

  • MaMVQA_Ab2 [MaMVQA – Parallel MHA]: The remaining components include two types of question embeddings and a term-weighting mechanism. The architecture used for ablation 2 is depicted in Fig. 9.

Fig. 9 Architecture of MaMVQA_Ab2

  • MaMVQA_Ab3(a) [MaMVQA - Parallel MHA - STW question embedding]: In addition to parallel MHA, this model omits the supervised term-weighted question embedding branch; both the attention process and the final fusion rely on the domain-specific embedding. The architectural view of this ablation can be seen in Fig. 10.

Fig. 10 Architecture of MaMVQA_Ab3(a)

  • MaMVQA_Ab3(b) [MaMVQA - Parallel MHA - Domain-specific question embedding]: This is the same as Ab3(a) except for the omitted branch of question representation. For clear distinction, the conceptual view of ablation 3(b) is shown in Fig. 11.

Fig. 11 Architecture of MaMVQA_Ab3(b)

Table 7 shows the quantitative findings of the ablation studies. The results show that the cooperation of all components is preferable to any single one of them, yet even the ablated models remain better than the three baseline systems discussed in section 4.3 (SAN-RAD, MCB-RAD and BAN-RAD). Although several ablations achieve equal accuracy in the closed-ended category, the contribution of each MaMVQA component is clear in the more challenging open-ended category of VQA-RAD.

Table 7 VQA-RAD ablation results, in percentage

In Fig. 12, the time efficiency of the proposed parallel multi-head attention model with two kinds of question embeddings is evaluated and compared with its ablations. The results are obtained by averaging the training time over 5 epochs and the test time of the fully trained models. MaMVQA shows a considerable increase in training time compared to the ablation models, whereas its testing overhead is insignificant. These results show that the model can be applied efficiently at inference time without incurring much additional time complexity.

Fig. 12 Time efficiency of the proposed model

4.5 Qualitative evaluation

The qualitative findings of MaMVQA and its ablations on the VQA-RAD dataset are shown in Table 8. The VQA-RAD dataset has a well-balanced distribution of images from the chest, head, and abdomen. The three examples in the top row of Table 8 clearly demonstrate MaMVQA’s superior performance in answering questions about images in all three categories. When the parallel multi-head attention and the second branch of question embeddings were eliminated, the model became unable to answer some questions correctly; the fourth and fifth instances, (d) and (e), emphasize the importance of these basic components. In all of these cases, the last two ablations, 3a and 3b, are inferior. The suggested system also faces some challenges. The last instance (f) of Table 8 shows a case where the MaMVQA model and all the other models fail. Upon close examination of that instance, the following facts can be inferred.

  • The background area (black) is dominant, making it difficult to focus on the region of interest.

  • The ground-truth answer and any predicted answer other than ‘nothing’ are difficult to distinguish without a focused examination of the image content.

Table 8 Comparison of Ground Truth answer with the predicted answer of VQA-RAD samples

5 Conclusions

Today, patients have online access to their electronic health records (EHRs), including radiology reports and images. This online access makes patients more curious about their medical images. Medical VQA opens an intelligent platform for interacting with and understanding medical images by asking questions.

In this research, MaMVQA is proposed as a solution to medical VQA tasks. It comprises three modules. The feature extraction module employs three feature extraction methodologies to acquire an image representation and two different kinds of question representations. To avoid the need for a huge amount of external data, an unsupervised denoising autoencoder with skip connections handles image featurization; the two branches of question featurization are domain-specific and supervised term-weighted embeddings. Second, the research proposes question-driven parallel multi-head attention to maximize the interpretation of visual semantic information. Third, an answer prediction module implements MaMVQA as an N-way classifier. In addition, a new supervised term-weighting scheme (qf.MI) based on the concept of mutual information is proposed. The VQA-RAD dataset was used in the experiments. Several ablation experiments demonstrate that the proposed MaMVQA components are effective for medical VQA tasks. Furthermore, the new method outperformed existing methods in terms of accuracy.

The proposed architecture can be applied to VQA in other domains with trivial changes to the pre-trained language embedding used. Quantitative and qualitative analysis of the obtained results reveals that the model fails on some questions. We therefore plan to conduct human-adversarial benchmarking of the medical VQA model in future work to correctly identify the types of instances where the model fails and then refine the model.