1 Introduction

Radiology images are crucial in clinical decision-making; chest X-rays used to diagnose COVID-19 are one example. By answering questions about image contents, automated systems could assist clinicians in dealing with the massive amounts of imagery produced during a pandemic. Patients also benefit from such a sophisticated question-answering system. Rather than sifting through non-specific and overwhelming search engine results, they can get answers to simple queries about their medical imaging. Visual Question Answering (VQA), an emerging branch of artificial intelligence, offers a path toward this type of clinical decision support.

In general, VQA seeks to provide an accurate answer to a question about an image. It is an AI-complete [1] task that combines Natural Language Processing (NLP) with Computer Vision (CV), two major computer science research domains. Medical VQA (MVQA) deals with medical images and the questions that accompany them. Because structured and unstructured medical data are increasingly accessible to patients through patient portals, MVQA helps to promote patient engagement in clinical decision making. MVQA can also serve as a personal assistant for doctors, providing a second opinion when interpreting difficult medical images.

When compared to other vision-language tasks like image captioning, general VQA is a challenging multi-modal, knowledge-based task. Researchers have adopted a three-phase workflow to address the VQA problem, as shown in Fig. 1. The two sub-tasks of phase 1, image featurization and question featurization, owe their success to the deep learning models Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), respectively. The VQA research community has focused extensively on phase 2, which involves the joint comprehension of the extracted multi-modal features. The majority of VQA systems were designed to predict the correct answer from a pool of candidates in the final phase; VQA can be viewed as a classification task in this context. Open-ended VQA is a variant in which the final answers are generated as free-form phrases. Table 1 shows the most common approaches for implementing phases 1 and 2 in the VQA literature.

Fig. 1 Three phases of the general VQA pipeline

Table 1 Challenges of VQA sub-tasks when applied to medical domain (MVQA)

Because of the huge difference between the general and medical domains, knowledge transfer from the general domain based on inadequate medical data can be of limited use [39]. Table 1 also highlights the specific challenges that MVQA poses at various stages of the general VQA pipeline. In summary, the key challenges of MVQA are: 1) semantic parsing of clinical language while interpreting medical questions, 2) poor contrast and differing modalities (MRI, CT, X-ray) of medical images, and 3) the need for a sophisticated fusion method.

In response to the above concerns, a unique MVQA system is developed that can produce valid results with limited data. The model seeks to give equal emphasis to multi-modal feature extraction and fusion. The power of unsupervised learning with denoising auto-encoders is used to handle image featurization. With domain-specific BioBERT embeddings [26] and a newly proposed term-weighting scheme, the system is able to interpret complex clinical terminology in questions. Finally, two parallel branches of multi-head attention modules accomplish multi-modal fusion to correctly comprehend visual and language data and predict the correct answer. The proposed model is named Multi-head attention for Medical VQA (MaMVQA).

Inspired by the success of VQA in the general domain and its potential applications in the medical domain, the authors of [24] released VQA-RAD, the first manually constructed dataset for MVQA, in which questions and answers about images are naturally created and validated by clinicians. The VQA-RAD dataset contains two major types of questions: closed-ended questions, whose answers are limited to a set of predetermined options, and the harder open-ended questions, whose answers are free-form text. The proposed MaMVQA model achieves superior accuracy on VQA-RAD, especially for open-ended questions.

In summary, the main contributions of this paper include:

  • The paper proposes a parallel Multi-head Attention network for Medical domain VQA (MaMVQA).

  • The paper introduces a novel semantic supervised term-weighting scheme named ‘qf.MI’ that utilizes BioBERT word embeddings and mutual information scores to assign word weights.

  • Presentation of results obtained from the extensive experiments and comparative studies carried out on the VQA-RAD dataset.

  • Discussion and analysis of the quantitative and qualitative results of ablation studies conducted to validate the significance of individual components of MaMVQA.

The remainder of this paper is arranged as follows: We briefly discuss several recognized related works in Section 2. Section 3 presents the details of the MaMVQA framework and the proposed term-weighting scheme. Section 4 describes the experiments conducted to evaluate the novel methods and discusses the results. Finally, we conclude in Section 5.

2 Related works

The first part of this section summarizes the progress made in general VQA. The second part focuses on studies that employed the VQA-RAD dataset in their experiments. A dedicated subsection also gives a quick rundown of the most common term-weighting strategies.

2.1 Visual question answering

After Antol et al. [1] introduced the VQA system, numerous researchers [12, 17, 32, 55] built on the baseline VQA pipeline, which included CNN/R-CNN [40] for image featurization, RNN for question featurization, and an elementary feature fusion step. The focus of VQA research in the following phase was on feature fusion techniques, which led to certain milestones in the field [8, 19, 47]. Later, with the enormous influence of attention mechanisms on VQA, various types of attention networks were introduced to the stack of VQA systems [2, 13, 18, 23, 30, 36, 43, 50, 52, 53].

2.2 Medical visual question answering with VQA-RAD

While proposing VQA-RAD, the first manually constructed high-quality dataset for VQA in the medical domain, Lau et al. [24] discussed and tested two baseline solutions. The two models stem from two well-known general VQA models, Multimodal Compact Bilinear pooling (MCB) [8] and the Stacked Attention Network (SAN) [51], respectively. MCB incorporates three components in line with the VQA phases: a CNN image model with ResNet-152, an LSTM question model, and MCB pooling to forecast attention and to jointly embed the attended visual representation with the question representation. The SAN model, as the name implies, attends to the image several times to refine visual attention over time. In the experiments, these baseline models were named MCB-RAD and SAN-RAD, respectively.

Nguyen et al. [35] reported a powerful baseline for VQA-RAD called BAN-RAD, which uses a Bilinear Attention Network (BAN) [20] instead of SAN in the joint comprehension phase of VQA. However, due to the significant differences between medical and general VQA data, such direct adaptations suffer from a serious lack of data and of multi-modal reasoning ability. To tackle the challenges caused by transfer learning in image feature extraction, the authors adopted model-agnostic meta-learning (MAML) [7] to train meta-weights that swiftly adapt to visual concepts. Later, the MEVF [35] network extended MAML-based image feature extraction with a Convolutional Denoising Auto-encoder (CDAE) trained on an external dataset. This combination alleviated MVQA’s data constraint.

Do et al. (2021) [5] presented a multiple meta-model quantifying (MMQ) method to increase meta-data through auto-annotations and noisy label processing. This method allows more meaningful feature extraction from medical images without external data.

Table 2 critically compares various state-of-the-art medical VQA systems from the literature. The table dissects each work to discuss the strategies used for image understanding, question understanding and multi-modal feature fusion and the datasets used in experiments along with special remarks on the methodology.

Table 2 Overview of the current state-of-the-art medical VQA works. The table focuses on the way the works tackled the three main tasks in the general VQA pipeline. The last phase, ‘answering’, is omitted because all the works treat VQA as a classification task. [* indicates the works that resemble the proposed system and are critically examined later in section 4.3]

2.3 Term-weighting schemes

Initially, supervised term-weighting (STW) schemes were created by combining the classic unsupervised component ‘term frequency (tf)’ with feature selection metrics such as chi-square (χ2), information gain (ig) and odds ratio (OR), yielding the tf.χ2, tf.ig, and tf.OR STW schemes (Debole and Sebastiani, 2004) [4]. Lan et al. (2008) [22] presented tf.rf, a new approach based on the concept of ‘relevance frequency’, which marked a significant milestone in STW research. It was built on the principle that “the more concentrated a high-frequency term is in one class c than in others, the greater contributions it makes in identifying class c samples.”

Quan et al. (2010) [38] improved this by emphasizing the importance of terms with a high frequency in the positive category, particularly in the context of question classification. For this, they proposed four new concepts.

  • question frequency (qf): qf(t) is the number of samples in positive category that contain t.

  • category frequency (cf): cf(t) is the number of classes in which term t occurs.

  • inverse category frequency (icf): used to measure discriminative power of a term t.

  • inverse question frequency (iqf): similar to inverse document frequency (idf).

They proposed three new STW schemes named qf.icf, iqf.qf.icf and vrf and obtained the best results for the second one.

The frequency of the weighting term, in the form of tf or qf, is a fixed component in all of these schemes, and the various factors multiplied with it carry statistical information about the terms. To get closer to how humans determine term weights, STW should exploit text semantics. In Wei et al. (2011) [49], the global statistical component of STW is replaced by the semantic distance between terms and category core terms (CCTs), the representative terms of a class. The technique presumes that the problem is a binary classification task with well-defined positive and negative CCTs, and the procedure becomes infeasible as the number of classes grows.

Luo et al. (2011) [31] considered the semantics of categories as well as WordNet’s interpretation of the terms occurring in the category label. The similarity between a word and the category semantics is used to compute the multiplicative factor of STW. When the words in category labels are not generic and/or are domain-specific, as in clinical text, the usage of WordNet senses becomes ineffective because the majority of these words will not be found in WordNet. Determining the meaning of medical terminology for term-weighting in clinical content is therefore a challenging task. Matsuo and Ho (2018) [33] suggested a two-phase approach that uses a medical ontology to generate a two-part hierarchy for determining semantic weights of terms in clinical texts.

3 Method

The proposed parallel Multi-head Attention network for Medical domain VQA (MaMVQA) is portrayed in Fig. 2, which includes a denoising auto-encoder for medical image representation learning, two question encoding branches and two parallel multi-head attention blocks for feature fusion as its core components.

Fig. 2 MaMVQA architecture

3.1 Problem modeling

Given a medical image I and an associated natural language question Q, the goal is to predict the correct answer. This can be mathematically formulated as:

$$ \hat{y} = \arg\max_{a \in A} P\left(a \mid Q, I, \theta\right) $$
(1)

where \( \hat{y} \) represents the final classification result, ‘A’ is the set of candidate answers and ‘a’ one of those answers. P denotes the MaMVQA framework and θ represents all parameters of MaMVQA. The framework is divided into three modules: feature extraction, feature fusion and answer prediction. The three modules are elaborated in the following subsections.
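As a concrete illustration of the decision rule in Eq. (1) (not part of the original formulation), the minimal Python sketch below picks the candidate answer with the highest predicted probability; the names `probs` and `candidate_answers` are hypothetical.

```python
# Minimal sketch of the decision rule in Eq. (1); names are hypothetical.
import numpy as np

def predict_answer(probs: np.ndarray, candidate_answers: list) -> str:
    """probs: softmax output over the candidate answer set A, i.e. P(a | Q, I, theta)."""
    return candidate_answers[int(np.argmax(probs))]
```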

3.2 Feature extraction

VQA is an AI-Complete task that requires multi-modal knowledge that extends beyond a single sub-domain. As a result, the quality of features extracted is crucial for subsequent operations. The image and the corresponding natural language question are two modalities that must be investigated as part of the VQA task.

3.2.1 Image Featurization

The majority of image features in conventional VQA systems are extracted with CNN models pre-trained on the ImageNet database. Fine-tuning those models so that they adequately predict medical image features necessitates a massive amount of labelled data, which is unavailable in medical VQA. Thus, for VQA-RAD images, we designed and trained an unsupervised denoising autoencoder based on a convolutional neural network.

An autoencoder is a neural network structure that generates its targets from the input data via self-supervision. At each step, the encoder part of the autoencoder extracts input image features and compresses the learning into a hidden or bottleneck representation h.

$$ e: h = \mathrm{ReLU}\left(Wx + b\right) $$
(2)

The decoder then maps the latent compressed representation h back to a reconstruction z with the same shape as x.

$$ d: z = \mathrm{ReLU}\left(W'h + b'\right) $$
(3)

The parameters W, W′, b, and b′ are optimized by minimizing a loss function that measures how similar the reconstructed image is to the input image.

The overfitting tendency of traditional autoencoders can be controlled by adding a stochastic noise component to the input image; the model is then trained to rebuild the uncorrupted original image as its output. Denoising autoencoders (DAE) are a type of autoencoder based on this strategy. Convolutional DAEs (CDAE) are a variant of DAE that adds convolutional encoding and decoding layers to the normal DAE architecture. Previous research [10, 25] has demonstrated that CDAEs perform better in image processing than ordinary DAEs because CDAEs exploit the full capabilities of CNNs to comprehend visual structure.

To avoid degradation [15] caused by several convolutional and deconvolutional layers, residual connections from the CDAE encoder to the decoder are added, bypassing the bottleneck layer [6, 25]. These extra connections allow feature maps to be sent straight from an earlier encoder layer to a later decoder layer. This aids the decoder in producing better specified decompressions of the input image.

The SkipCDAE architecture used in this study to learn the features of VQA-RAD medical images is shown in Fig. 3. The input images are 512*512*1 (gray-scale) matrices, to which Gaussian noise with a standard deviation of 0.2 is added. Then, using a convolution kernel of size 3*3, 32 filters of size 3*3*1 are employed to construct 32 feature maps from the input layer; a LeakyReLU activation function follows this layer. Layers 2 to 14 have 64, 128, 256, 512, 1024, 1024, 512, 256, 256, 128, 64, and 32 filters respectively, with LeakyReLU as the activation function for all of them. Sum aggregation is used to add two skip connections from the fourth and sixth encoder layers to the corresponding decoder layers. Simple augmentations such as rotation, scaling, and normalization are applied to the VQA-RAD images before they are used to train SkipCDAE, enlarging the available image set.

Fig. 3 SkipCDAE architecture used for image feature extraction
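For concreteness, a simplified Keras sketch of a skip-connected convolutional denoising autoencoder in the spirit of SkipCDAE is given below. It keeps the key ingredients (Gaussian input noise with standard deviation 0.2, a convolutional encoder/decoder, and additive skip connections) but uses far fewer layers and filters than the 14-layer configuration described above; every layer count, filter size, and pooling choice here is an illustrative assumption.

```python
from tensorflow.keras import layers, models

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same")(x)
    return layers.LeakyReLU()(x)

def build_skip_cdae(input_shape=(512, 512, 1), noise_std=0.2):
    inp = layers.Input(shape=input_shape)
    noisy = layers.GaussianNoise(noise_std)(inp)       # stochastic corruption of the input

    # Encoder with down-sampling
    e1 = conv_block(noisy, 32)
    p1 = layers.MaxPooling2D()(e1)
    e2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(e2)
    bottleneck = conv_block(p2, 128)                   # compressed representation h

    # Decoder with up-sampling and additive (sum-aggregated) skip connections
    u2 = layers.UpSampling2D()(bottleneck)
    d2 = layers.Add()([conv_block(u2, 64), e2])        # skip connection from an encoder layer
    u1 = layers.UpSampling2D()(d2)
    d1 = layers.Add()([conv_block(u1, 32), e1])        # skip connection from an earlier layer
    out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(d1)

    model = models.Model(inp, out, name="skip_cdae_sketch")
    # Trained to reconstruct the clean image, e.g. model.fit(clean_imgs, clean_imgs, ...)
    model.compile(optimizer="adam", loss="mse")
    return model
```

After training, the encoder activations at the bottleneck can serve as the learned image representation for the downstream VQA modules.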

3.2.2 Question Featurization

Pre-processing begins by tokenizing the natural language questions into words and converting the words to lowercase. Next, punctuation marks attached to the words, including the question mark (?), are removed. Each question is then reassembled into a sequence of ‘n’ tokens, where ‘n’ is the number of words in the training dataset’s longest question; questions with fewer than ‘n’ words are zero-padded. The value of ‘n’ in VQA-RAD is 22. The pre-processed question data is then fed into two distinct question featurization modules of the MaMVQA framework, described below.
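These pre-processing steps can be sketched as follows; the whitespace tokenizer, the example question, and the ‘<pad>’ placeholder token are assumptions, while the fixed length of 22 follows the value of n reported for VQA-RAD.

```python
import string

MAX_LEN = 22  # n: length of the longest training question in VQA-RAD

def preprocess_question(question: str) -> list:
    # Lowercase, strip punctuation (including '?'), tokenize on whitespace.
    table = str.maketrans("", "", string.punctuation)
    tokens = question.lower().translate(table).split()
    # Pad (here with a '<pad>' placeholder) or truncate to the fixed length n.
    tokens = tokens[:MAX_LEN]
    return tokens + ["<pad>"] * (MAX_LEN - len(tokens))

print(preprocess_question("Is there evidence of an aortic aneurysm?"))  # illustrative question
```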

Domain-specific question embedding

In this branch, question word sequences are represented by the BioBERT [26] word embedding matrix, \( D=\left\{{e}_1,{e}_2,\dots, {e}_n\right\}\in {R}^{n\ast {d}_e} \), where de is the embedding dimension of each word in the sequence, which is 768 for BioBERT. BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is the first domain-specific language representation pre-trained on large-scale biomedical corpora. Its creators demonstrated the effectiveness of BioBERT embeddings for question answering compared with state-of-the-art general-domain language models.

The embedded word vectors are then fed into a bidirectional LSTM (biLSTM) network to generate the final question encoding, \( Q1=\left\{{q}_1,{q}_2,\dots, {q}_N\right\}\in {R}^{N\ast {d}_q} \), where N is the total number of questions in the training set and dq is the dimension of the last hidden state output of the biLSTM network, set to 2048 in our experiments. The biLSTM can capture both contextual and historical information because it aggregates information from the forward and backward directions of a sentence.
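A possible Keras realisation of this branch is sketched below, assuming a pre-computed BioBERT embedding matrix (`biobert_matrix`, of shape vocab_size × 768); keeping the embedding layer frozen is an assumption not stated above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_question_encoder(vocab_size, biobert_matrix, max_len=22):
    tokens = layers.Input(shape=(max_len,), dtype="int32")
    emb = layers.Embedding(
        input_dim=vocab_size, output_dim=768,
        embeddings_initializer=tf.keras.initializers.Constant(biobert_matrix),
        trainable=False)(tokens)                 # BioBERT word vectors (frozen: assumption)
    # 1024 units per direction -> 2048-dimensional question encoding after concatenation
    q_enc = layers.Bidirectional(layers.LSTM(1024))(emb)
    return models.Model(tokens, q_enc, name="biobert_bilstm_encoder")
```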

Supervised term-weighted (STW) question embedding

The primary goal of the STW scheme is to limit the impact of irrelevant words on subsequent tasks. For each word appearing in the question of interest, its weight is looked up in the term-weight vector created with the newly proposed STW method qf.MI. The scalar weight of each word is multiplied by its distributed embedding vector, yielding a weighted embedding matrix E ∈ R|V| × d, where |V| is the corpus vocabulary size and d is the dimension of the distributed embedding vector. The embedding layer of the proposed question featurization network is initialized with this E. The final weighted question encoding Q2 is obtained by feeding the embedding layer output to a biLSTM network.
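The row-wise scaling that produces the weighted embedding matrix E can be sketched in a couple of lines; the array names are hypothetical.

```python
# Sketch of the supervised term-weighted embedding matrix: each row of the
# distributed embedding matrix is scaled by that word's scalar qf.MI weight.
import numpy as np

def weighted_embedding_matrix(embeddings: np.ndarray, qf_mi_weights: np.ndarray) -> np.ndarray:
    """embeddings: (|V|, d) matrix; qf_mi_weights: (|V|,) vector of term weights."""
    return qf_mi_weights[:, None] * embeddings   # broadcast each weight over its embedding row
```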

Proposed STW scheme: qf.MI

In STW, two policies are available for assigning term-weighted feature vectors: local and global. In the local policy, each class has its own weight vector, resulting in a set of feature vectors for each training sample corresponding to the distinct classes. In the global policy, a common weight vector is inferred from the per-class local vectors. This can be accomplished by performing ‘max’, ‘avg’, or ‘sum’ operations over the local vectors, with the ‘max’ strategy performing best in earlier studies [4, 38]. The global weight vector is determined with the ‘max’ policy as shown in Eq. 4.

$$ w(t)=\underset{1\le i\le \left|C\right|}{\max }w\left(t,{c}_i\right) $$
(4)

where w(t) is the final weight of the term t, w(t, ci) is the weight of t in class i and C is the set of classes.

Two concepts are used in the proposed scheme: question frequency (qf) and mutual information (MI). The first, qf, is borrowed from [38] and replaces the term frequency (tf) as the first multiplicative factor of the STW scheme. Their research found this approach particularly useful for short texts in which many tf values equal one, and the questions of a VQA task are exactly such short texts. The qf of a term t represents its frequency of occurrence in the class of interest (positive category):

$$ qf(t) = \log\left(t_p + 1\right) $$
(5)

where t_p is the number of questions in the class of interest that contain the term t.

The second factor, MI, stems from information theory and denotes the statistical relatedness between two random variables [41]. It is a non-negative value, and a higher value indicates a stronger dependency. MI can be mathematically computed using Eq. 6.

$$ I(X,Y) = \iint p(x,y)\,\log\left(\frac{p(x,y)}{p(x)\,p(y)}\right)\,dx\,dy $$
(6)

Each (image, question) pair in the VQA dataset has an associated answer. The answer offers a wealth of information that helps to weight the question words. This idea is put into action by computing the MI between each word vector and the corresponding answer vector to allocate a weight to each term. The motivation behind the scheme can be summarized as follows: ‘the more closely a word is related to the label text of the class of interest, the better it can assist the question representation in forecasting the proper answer.’

The complete procedure of term-weight calculation using qf.MI is depicted in Fig. 4. The process creates an STW weight matrix W of size ∣C∣ × |V| from a corpus consisting of |C| classes and a vocabulary of size |V|. Each row of W is the local weight vector of one class. Each class vector is built so that terms appearing at least once in the class of interest are given weights computed according to Eq. 7; terms that do not occur in the class of interest ci but do appear in V are assigned the lowest of all computed weights for that class.

$$ W(i,t) = \begin{cases} qf.MI\left(t, c_i\right) & \text{if } t \in class\_vocab\left(c_i\right) \\ \min\limits_{1 \le j \le \left|class\_vocab\left(c_i\right)\right|} qf.MI\left(t_j, c_i\right) & \text{if } t \in V - class\_vocab\left(c_i\right) \end{cases} $$
(7)
Fig. 4 Pictorial representation of qf.MI term weight calculation for a single class of interest
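The weight computation can be sketched as below. The paper does not specify how MI between a word vector and an answer vector is estimated; the discretization-based estimate used here (binning both BioBERT vectors and applying scikit-learn’s mutual_info_score) is one plausible choice and should be treated as an assumption, as should the helper names.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def qf(term, class_questions):
    """qf(t) = log(t_p + 1); class_questions: tokenized questions of the class of interest."""
    t_p = sum(1 for q in class_questions if term in q)
    return np.log(t_p + 1)

def mi(term_vec, answer_vec, n_bins=20):
    """Rough MI estimate between a term's BioBERT vector and the answer-text vector,
    obtained by discretizing both vectors into bins (an assumed estimator)."""
    t_bins = np.digitize(term_vec, np.histogram_bin_edges(term_vec, bins=n_bins))
    a_bins = np.digitize(answer_vec, np.histogram_bin_edges(answer_vec, bins=n_bins))
    return mutual_info_score(t_bins, a_bins)

def qf_mi_local(vocab, embeddings, answer_vec, class_questions):
    """Local weight vector for one class of interest (first case of Eq. 7)."""
    return np.array([qf(t, class_questions) * mi(embeddings[t], answer_vec) for t in vocab])

def global_weights(local_weight_matrix):
    """'max' policy of Eq. 4 over the (|C|, |V|) matrix of local weight vectors."""
    return local_weight_matrix.max(axis=0)
```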

3.3 Feature fusion via multi-head attention (MHA)

Two parallel branches of the multi-head attention (MHA) network are used to interpret the extracted visual and linguistic cues. MHA is an attention module that runs an attention process several times in parallel using n heads. The individual attention outputs are then concatenated and linearly projected to the desired dimension. Multiple attention heads allow different kinds of attention to be paid to different areas of the inputs. The general process is portrayed in Fig. 5 and formulated as follows:

$$ MHA\left(Q, K, V\right) = \left[head_1, \dots, head_n\right] W_0, \quad \text{where } head_i = Attention\left(Q W_i^Q, K W_i^K, V W_i^V\right) $$
(8)
Fig. 5 Graphical representation of Multi-head Attention (MHA) [48]

All W matrices are learnable parameters, and the three inputs (Q, K, V) to the MHA module are named Query, Key, and Value. Mathematically, the layer first projects Q, K, and V. The query and key tensors are then dot-multiplied and scaled, and a softmax produces the attention scores. These scores are used to interpolate the value tensors, which are subsequently concatenated back into a single tensor. Finally, the result is linearly projected and returned.

In this paper, query and key refer to the extracted visual features and question features, respectively. Because no separate value (V) is specified, it is assumed to be the same as the key. These input tensors are split into several segments and sent to a number of distinct heads. Within each head, medical image features are compared with the corresponding question features and attention weights are assigned. Each cell of the resulting attention weight matrix represents the interaction between Q (image) and K and V (question).

Two parallel branches of multi-head attention are implemented with the two types of question embeddings in the proposed MaMVQA framework, as detailed in section 3.2.2. By executing question-guided visual attention on distinct areas of the input, the MHA module generates refined visual features with the help of its many heads.
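One fusion branch can be sketched with Keras’ built-in MultiHeadAttention layer as below, with 16 heads as reported later in the experiments; the tensor shapes and the key_dim value are illustrative assumptions.

```python
from tensorflow.keras import layers

def mha_branch(image_feats, question_feats, num_heads=16, key_dim=64):
    """image_feats: (batch, n_regions, d_img); question_feats: (batch, n_tokens, d_q)."""
    mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
    # Query = image features; key = value = question features (key defaults to value).
    attended = mha(query=image_feats, value=question_feats)
    return layers.Flatten()(attended)      # refined, question-guided visual features
```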

3.4 Answer prediction

The MaMVQA model as a whole is treated as a classification model. The outputs of the two multi-head attention branches are combined to generate a refined visual representation attended by the question inputs. The final fused vector z for answer prediction is formed by concatenating this representation with the domain-specific question embedding. The fused feature vector z is then projected to a vector a ∈ RN, followed by a softmax activation, where N is the number of candidate answers in the training dataset. An N-way classifier is trained on top of the feature vector z using the ‘categorical crossentropy’ loss function.
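A minimal sketch of this prediction head is given below; the concatenation order and the absence of intermediate dense layers are assumptions.

```python
from tensorflow.keras import layers

def answer_head(mha_out_1, mha_out_2, domain_question_enc, num_answers):
    # Fused vector z: both MHA branch outputs plus the domain-specific question encoding.
    z = layers.Concatenate()([mha_out_1, mha_out_2, domain_question_enc])
    logits = layers.Dense(num_answers)(z)   # project z to R^N (N candidate answers)
    return layers.Softmax()(logits)
```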

4 Experiment

This section describes the extensive experiments carried out to assess MaMVQA’s effectiveness. It begins with a description of the dataset before moving on to the implementation and evaluation details. MaMVQA is then compared with other state-of-the-art models on the VQA-RAD dataset. The ablation study is then presented to demonstrate the necessity of the suggested framework’s components. Finally, a qualitative analysis of MaMVQA on VQA-RAD data is presented.

4.1 Dataset description

In medical VQA datasets, the aim is to generate or predict the proper answer to natural language questions regarding the content of medical images. VQA-RAD is the first manually generated dataset in the field of medical VQA. It contains 315 medical radiological images divided evenly across the head, chest, and abdomen. The statistics of the dataset are as follows:

  • 3064 question-answer pairs form the training set.

  • 451 question-answer pairs form the test set.

The same train-test split was applied throughout the experiments. Abnormality (ABN), attribute (ATTRIB), colour, count, modality, organ, other, plane, positional reasoning (POS), object/condition presence (PRES), and size are the 11 question categories. The question-answer pairs are classified as open-ended or closed-ended based on the answer types. Closed-ended answers are yes/no or limited to a set of predetermined options, while open-ended answers represent a significant challenge for the VQA system.

4.2 Implementation setup and evaluation metric

All of the experiments were run on an NVIDIA P100 GPU with the Keras Python library. The embedded question tokens were fed through the biLSTM, which produced a 2048-dimensional output combining the forward and backward passes with a hidden layer dimension of 1024. The models were trained with the ‘Adam’ optimizer, a learning rate of 1e-5 and a batch size of 32. In the VQA-RAD dataset, there are a total of 458 answer classes. A total of 16 heads were used in each multi-head attention module. Keras Checkpoint and EarlyStopping callbacks were leveraged to save the best model during training by monitoring validation accuracy and validation loss. The maximum number of epochs was set to 200.
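The training configuration described above could look like the following sketch; `model` and the data arrays are hypothetical placeholders, and the EarlyStopping patience is an assumption since it is not reported.

```python
from tensorflow.keras import callbacks, optimizers

def train_mamvqa(model, x_train, y_train, x_val, y_val):
    # Reported hyper-parameters: Adam, lr 1e-5, batch size 32, up to 200 epochs.
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    cbs = [
        callbacks.ModelCheckpoint("mamvqa_best.h5", monitor="val_accuracy",
                                  save_best_only=True),
        callbacks.EarlyStopping(monitor="val_loss", patience=10,   # patience value is assumed
                                restore_best_weights=True),
    ]
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     batch_size=32, epochs=200, callbacks=cbs)
```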

Accuracy, the proportion of correctly answered questions over the total number of questions, is used as the evaluation metric in all experiments. It can be represented as:

$$ acc=\frac{t}{N}\times 100\% $$
(9)

where t represents the number of correctly answered questions and N refers to the total number of questions. Following current VQA-RAD literature, open-ended and closed-ended accuracies are reported separately.

4.3 VQA-RAD benchmarking

The proposed MaMVQA model is compared with 9 existing models listed below:

  • SAN-RAD, MCB-RAD [24]: Two baselines directly adopted from general VQA literature using stacked attention and multi-modal compact bilinear pooling respectively.

  • BAN-RAD [35]: An advanced baseline, again stemming from the general VQA methodology of bilinear attention for multi-modal feature fusion.

  • MAML-SAN, MAML-BAN [7]: Introduced model-agnostic meta-learning to learn meta-weights to overcome problems caused by transfer learning for image featurization.

  • MEVF-SAN, MEVF-BAN [35]: Leveraged meta-learning and denoising autoencoder to extract image features.

  • MMQ-SAN, MMQ-BAN [5]: Aimed to increase metadata through automatic annotations and process noisy labels.

MEVF and MMQ resemble the proposed system and are critically analyzed in Table 3.

Table 3 MaMVQA vs. MEVF and MMQ: a critical comparison

The results of the proposed model and other existing methods on the VQA-RAD dataset are compared in Table 4. The quantitative analysis shows that the proposed system outperforms the other models in open-ended, closed-ended and overall accuracy. MaMVQA achieves a maximum improvement of 21% over SAN-RAD and a minimum improvement of 3% over MMQ-BAN in the closed-ended category, and similarly a 31% (maximum) and 2% (minimum) improvement in open-ended accuracy. The overall accuracy is also improved by more than 8% on average.

Table 4 Comparison of accuracies of different existing MVQA methods for VQA-RAD

As mentioned in section 4.1, VQA-RAD queries are grouped into 11 types, which are not evenly distributed in the training and test sets (see Fig. 6). MaMVQA’s accuracy for these question types is estimated by experimenting with the training and test sets separately. The results are presented in Table 5, where two separate columns provide the MaMVQA model’s percentage accuracies on training and test data over the various question categories. Figure 7 then shows the overall accuracy as an average of the training and testing accuracies for each question type. The performance disparity across a few question categories is due to data imbalance. In any case, the results demonstrate the proposed system’s general performance potential, with an average accuracy of 82%.

Fig. 6 VQA-RAD question distribution over question types

Table 5 Accuracy of the proposed systems’ answers sub-grouped into question types
Fig. 7 VQA-RAD overall accuracy (%) over the 11 question types

As mentioned in the dataset description, VQA-RAD evenly distributes the available radiology images and associated questions over three areas of the human body: CHEST, HEAD and ABDOMEN. Figure 8a and b show the statistics of the VQA-RAD train-test split across the three organ categories, and Table 6 shows the results of the MaMVQA model when evaluated separately on visual questions from these three categories. The proposed model shows generalizable performance across the different categories.

Fig. 8 (a) VQA-RAD train set distribution and (b) test set distribution over the organ types CHEST, HEAD and ABDOMEN

Table 6 Accuracy of the proposed systems’ answers sub-grouped into three organ types

4.4 Ablation study

This section discusses complete ablation studies on the VQA-RAD dataset to verify the significance of the various components integrated into the MaMVQA framework. Supervised term-weighting, parallel multi-head attention, and the usage of two different question representations are the primary factors to address. These ablations yield the following models:

  • MaMVQA: The full proposed model with parallel multi-head attention and two branches of question embedding, domain-specific and supervised term-weighted (refer to Fig. 2).

  • MaMVQA_Ab1 [MaMVQA - Term Weighting]: Omitting the term-weighting (qf.MI) from the second branch of question representation yields the least ablated form of MaMVQA. Questions are fed to the embedding layer, which subsequently feeds them to the biLSTM.

  • MaMVQA_Ab2 [MaMVQA – Parallel MHA]: The remaining components include two types of question embeddings and a term-weighting mechanism. The architecture used for ablation 2 is depicted in Fig. 9.

Fig. 9 Architecture of MaMVQA_Ab2

  • MaMVQA_Ab3(a) [MaMVQA - Parallel MHA - STW question embedding]: In addition to parallel MHA, this model omits the supervised term-weighted question embedding branch; both the attention process and the final fusion rely on the domain-specific embedding. The architectural view of this ablation can be seen in Fig. 10.

Fig. 10 Architecture of MaMVQA_Ab3(a)

  • MaMVQA_Ab3(b) [MaMVQA - Parallel MHA - Domain-specific question embedding]: This is the same as Ab3(a) except for the omitted branch of question representation. For clear distinction, the conceptual view of ablation 3(b) is shown in Fig. 11.

Fig. 11 Architecture of MaMVQA_Ab3(b)

Table 7 shows the quantitative findings of the ablation studies. The results show that the cooperation of all components is preferable to any single one of them, yet even the ablated models remain better than the three baseline systems discussed in section 4.3 (SAN-RAD, MCB-RAD and BAN-RAD). Although several ablations achieve equal accuracy in the closed-ended category, the contribution of each MaMVQA component is clear in the more challenging open-ended category of VQA-RAD.

Table 7 VQA-RAD ablation results, in percentage

In Fig. 12, the time efficiency of the proposed parallel multi-head attention model with two kinds of question embeddings is evaluated and compared with its ablations. The results are obtained by averaging the training time over 5 epochs and the test time of the fully trained models. MaMVQA shows a considerable increase in training time compared to the ablation models, whereas its testing overhead is insignificant. These results show that the model can be applied efficiently at inference time without incurring much additional time complexity.

Fig. 12 Time efficiency of the proposed model

4.5 Qualitative evaluation

The qualitative findings of MaMVQA and its ablations on the VQA-RAD dataset are shown in Table 8. The VQA-RAD dataset has a well-balanced distribution of images from the chest, head, and abdomen. The three examples in the top row of Table 8 clearly demonstrate MaMVQA’s superior performance in answering questions about images in all three categories. When the parallel multi-head attention and the second branch of question embeddings were eliminated, the model became unable to answer some questions correctly; the fourth and fifth instances, (d) and (e), emphasize the importance of these basic components. In all of these cases, the last two ablations, 3a and 3b, are inferior. The suggested system also faces some challenges. The last instance (f) of Table 8 shows a case where the MaMVQA model and all the other models fail. Upon close examination of that instance, the following facts can be inferred.

  • The background area (black) is dominant, making it difficult to focus on the region of interest.

  • The ground-truth answer and any predicted answer other than ‘nothing’ are difficult to distinguish without a focused examination of the image content.

Table 8 Comparison of Ground Truth answer with the predicted answer of VQA-RAD samples

5 Conclusions

Today, patients have online access to their electronic health records (EHRs), including radiology reports and images. This online access makes patients more curious about their medical images. Medical VQA opens an intelligent platform for interacting with and understanding medical images by asking questions.

In this research, MaMVQA is proposed as a solution to medical VQA tasks. It comprises three modules. The feature extraction module employs three feature extraction methodologies to acquire an image representation and two different kinds of question representations. To avoid the need for a huge amount of external data, an unsupervised denoising autoencoder with skip connections handles image featurization; the two branches of question featurization are domain-specific and supervised term-weighted embeddings. Second, the research proposes question-driven parallel multi-head attention to maximize the interpretation of visual semantic information. Third, an answer prediction module implements MaMVQA as an N-way classifier. In addition, a new supervised term-weighting scheme (qf.MI) based on the concept of mutual information is proposed. The VQA-RAD dataset was used in the experiments. Several ablation experiments demonstrate that the proposed MaMVQA components are effective for medical VQA tasks. Furthermore, the new method outperformed existing methods in terms of accuracy.

The proposed architecture can be applied to VQA in other domains with trivial changes to the pre-trained language embedding used. Quantitative and qualitative analysis of the obtained results reveals that the model fails on some questions. We therefore plan to conduct human-adversarial benchmarking of the medical VQA model in future work to correctly identify the types of instances where the model fails and then refine the model.