
1 Introduction

Theories about logic in human understanding have a long history. In modern times, Piaget and Fodor [34] studied the representation of logical hypotheses in the human mind. George Boole [7] formalized conjunction, disjunction, and negation into an “algebra of thought” as a way to improve, systemize, and mathematize Aristotle’s Logic [12]. Horn regarded negation to be a fundamental and defining characteristic of human communication [19], following the traditions of Sankara [35], Spinoza [42], and Hegel [18]. Recent studies [11] have suggested that infants can formulate intuitive and stable logical structures to interpret dynamic scenes and to entertain and rationally modify hypotheses about the scenes. As such, we argue that understanding logical structures in questions is a fundamental requirement for any question-answering system.

Fig. 1. State-of-the-art models answer questions from the VQA dataset (\(Q_1, Q_2\)) correctly, but struggle when asked a logical composition including negation, conjunction, disjunction, and antonyms. We develop a model that improves on this metric substantially, while retaining VQA performance.

If a question can be put at all, then it can be answered. [44]

In the above proposition, Wittgenstein linked the process of asking a question with the existence of an answer. While we do not comment on the existence of an answer, we suggest the following softer proposition:

If questions \(Q_1\dots Q_n\) can be answered, then so should be all composite questions created from \(Q_1\dots Q_n\).

Visual question answering (VQA) [3] is an intuitive, yet challenging task that lies at a crucial intersection of vision and language. Given an image and a question about it, the goal of a VQA system is to provide a free-form or open-ended answer. Consider the image in Fig. 1, which shows a person in front of an open fridge. When asked the questions \(Q_1\) (Is there beer?) and \(Q_2\) (Is the man wearing shoes?) independently, the state-of-the-art model LXMERT [43] answers both correctly. However, when we insert a negation in \(Q_2\) (Is the man not wearing shoes?) or pose a conjunction of the two questions \(\lnot Q_2 \wedge Q_1\) (Is the man not wearing shoes and is there beer?), the system makes wrong predictions. Our motivation is to reliably answer such logically composed questions. In this paper, we analyze VQA systems under this Lens of Logic (LOL) and develop a model that can answer such questions reflecting human logical inference. We offer our work as the first investigation into the logical structure of questions in visual question answering and provide a solution that learns to interpret logical connectives in questions.

The first question is: can models pre-trained on the VQA dataset answer logically composed questions? It turns out that these models are unable to do so, as illustrated in Fig. 1 and Table 2. An obvious next experiment is to split the question into its component questions, predict the answer to each, and combine the answers logically. However, language parsers (either oracle or trained parsers) are not accurate at understanding negation, and as such this approach does not yield correct answers for logically composed questions. The question then arises: can the model answer such questions if we explicitly train it with data that also contains logically composed questions? For this investigation, we construct two datasets, VQA-Compose and VQA-Supplement, by utilizing annotations from the VQA dataset, as well as object and caption annotations from COCO [25]. We use these datasets to train the state-of-the-art model LXMERT [43] and perform multiple experiments to test for robustness towards logically composed questions.

After this investigation, we develop our LOL model architecture that jointly learns to answer questions while understanding the type of question and which logical connectives exist in the question, through our attention modules, as shown in Fig. 3. We further train our model with a novel Fréchet-Compatibility loss that ensures compatibility between the answers to the component questions and the answer of the logically composed question. One key finding is that our models are better than existing models trained on logical questions, with a small deviation from state-of-the-art on the VQA test set. Our models also exhibit better Compositional Generalization, i.e., models trained to answer questions with a single logical connective are able to answer those with multiple connectives.

Our contributions are summarized below:

1. We conduct a detailed analysis of the performance of the state-of-the-art VQA model with respect to logically composed questions.

2. We curate two large-scale datasets, VQA-Compose and VQA-Supplement, that contain logically composed binary questions.

3. We propose LOL, our end-to-end model with dedicated attention modules that answers questions by understanding the logical connectives in questions.

4. We show the capability of answering logically composed questions, while retaining VQA performance.

2 Related Work

Logic in Human Expression: Is logical thinking a natural feature of human thought and expression? Evidence in psychological studies [10, 11, 16] suggests that infants are capable of logical reasoning, toddlers understand logical operations in natural language and are able to compositionally compute meanings even in complex sentences containing multiple logical operators. Children are also able to use these meanings to assign truth values to complex experimental tasks. Given this, question-answering systems also need to answer compositional questions, and be robust to the manifestation of logical operators in natural language.

Logic in Natural Language Understanding: The task of understanding compositionality in question-answering (QA) can also be interpreted as understanding logical connectives in text. While question compositionality is largely unstudied, approaches in natural language understanding seek to transform sentences into symbolic formats such as first-order logic (FOL) or relational tables [24, 30, 47]. While such methods benefit from interpretability, they suffer from practical limitations like intractability, reliance on background knowledge, and failure to process noise and uncertainty. [8, 39, 41] suggest that better generalization can be achieved by learning embeddings to reason about semantic relations, and to simulate FOL behavior [40]. Recursive neural networks have been shown to learn logical semantics on synthetic English-like sentences by using embeddings [9, 32].

Detection of negation in text has been studied for information extraction and sentiment analysis [31]. [22] have shown that BERT-based models [13, 26] are incapable of differentiating between sentences and their negations. Concurrent to our work, [4] show the efficacy of FOL-guided data augmentation for performance improvements on natural language QA tasks that require reasoning. Since our work deals with both vision and language modalities, it encounters a greater degree of ambiguity, thus calling for robust VQA systems that can deal with logical transformations.

Visual Question Answering (VQA) [3] is a large-scale, human-annotated dataset for open-ended question-answering on images. VQA-v2 [17] reduces the language bias in the dataset by collecting complementary images for each question-image pair. This ensures that the number of questions in the VQA dataset with the answer “YES” is equal to those with the answer “NO”. This dataset contains 204k images from MS-COCO [25] and 1.1M questions.

Cross-modal pre-trained models [27, 43, 48] have proved to be highly effective in vision-and-language tasks such as VQA, referring expression comprehension, and image retrieval. While neuro-symbolic approaches [29] have been proposed for VQA tasks which require reasoning on synthetic images, their performance on natural images is lacking. Recent work seeks to incorporate reasoning in VQA, such as visual commonsense reasoning [14, 46], spatial reasoning [20, 21], and by integrating knowledge for end-to-end reasoning [1].

We take a step back and extensively analyze the pivotal task of VQA with respect to various aspects of generalization. We consider a rigorous investigation of a task, dataset, and models to be equally important as proposing new challenges that are arguably harder. In this paper we analyse existing state-of-the-art VQA models with respect to their robustness to logical transformations of questions.

Table 1. Illustration of question composition in VQA-Compose, for the same example as in Fig. 1. QF: Question Formula, AF: Answer Formula

Fig. 2. Some questions in VQA-Supplement created with adversarial antonyms.

3 The Lens of Logic

A lens magnifies objects under investigation by allowing us to zoom in and focus on desired contents or processes. Our lens of logical composition of questions allows us to magnify, identify, and analyze the problems in VQA models.

Consider Fig. 2(a), where we transform the first question “Is the lady holding the baby?” by first replacing “lady” with an adversarial antonym “man”, and observe that the system provides a wrong answer with very high probability. Swapping “man” with “baby” results in a wrong answer as well. In Fig. 2(b), a conjunction of two questions containing antonyms (girls vs. boys) yields a wrong answer. We identify that the ability to answer composite questions created by negation, conjunction, and disjunction of questions is crucial for VQA.

We use “closed questions” as defined in [6] to construct logically composed questions. Under this definition, if a closed question has a negative (“NO”) answer then its negation must have an affirmative (“YES”) answer. Of the three types of questions in the VQA dataset (yes/no, numeric, other), “yes-no” questions satisfy this requirement. Although visual questions in the VQA dataset can have multiple correct answers [5], \(20.91\%\) of the questions (around 160k) in the VQA dataset are closed questions, i.e. questions with a single unambiguous yes-or-no answer, unanimously annotated by multiple human workers. This allows us to treat these questions as propositions and create a truth table for answers to compose logical questions as shown in Table 1.

3.1 Composite Questions

Let \(\mathcal {D}\) be the VQA dataset. For closed questions \(Q_1\) and \(Q_2\) about image \(I\in \mathcal {D}\), we define the composite question \(Q^*\) composed using connective \(\circ \in \{ \vee , \wedge \}\), as:

$$\begin{aligned} Q^* = \widehat{Q_1} ~\circ ~\widehat{Q_2}, \qquad where~~ \widehat{Q_1} \in \{ Q_1, \lnot Q_1\}, ~ \widehat{Q_2} \in \{ Q_2, \lnot Q_2\}. \end{aligned}$$
(1)

3.2 Dataset Creation Process

Using the above definition we create two new datasets by utilizing multiple questions about the same image (VQA-Compose) and external object and caption annotations about the image from COCO to create more questions (VQA-Supplement). The seed questions for creating these datasets are all closed binary questions from VQA-v2 [17]. These datasets serve as test-beds, and enable experiments that analyze performance of models when answering such questions.

VQA-Compose: Consider the first two rows in Table 1. \(Q_1\) and \(Q_2\) are two questions about the image in Fig. 1 taken from the VQA dataset. Additional questions are composed from \(Q_1\) and \(Q_2\) by using the formulas in Table 1. Thus for each pair of closed questions in the VQA dataset, we get 10 logically composed questions. Using the same train-val-test split as the VQA-v2 dataset [17], we get 1.25 million samples for our VQA-Compose dataset. The dataset is balanced in terms of the number of questions with affirmative and negative answers.
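To make the composition procedure concrete, the following is a minimal sketch (not the released generation code) of how one pair of closed questions and their yes/no answers yields ten composed question-answer pairs following the truth table in Table 1. The surface templates and the naive negation heuristic are assumptions for illustration.

```python
# Sketch of VQA-Compose-style question composition. The negation heuristic
# and the "and"/"or" joining templates are illustrative placeholders, not the
# exact rules used to build the released dataset.

def negate(question: str) -> str:
    """Naive negation: insert 'not' after the first (auxiliary) word."""
    first, rest = question.split(" ", 1)
    return f"{first} not {rest}"

def compose(q1: str, a1: bool, q2: str, a2: bool):
    """Return 10 composed (question, answer) pairs for two closed questions."""
    pairs = [(negate(q1), not a1), (negate(q2), not a2)]    # single negations
    q1_variants = [(q1, a1), (negate(q1), not a1)]
    q2_variants = [(q2, a2), (negate(q2), not a2)]
    for qa, va in q1_variants:                              # conjunctions / disjunctions
        for qb, vb in q2_variants:
            tail = qb[0].lower() + qb[1:]
            pairs.append((qa.rstrip("?") + " and " + tail, va and vb))
            pairs.append((qa.rstrip("?") + " or " + tail, va or vb))
    return pairs

if __name__ == "__main__":
    for q, a in compose("Is there beer?", True, "Is the man wearing shoes?", True):
        print(f"{q}  ->  {'YES' if a else 'NO'}")
```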

VQA-Supplement: Images in VQA-v2 follow identical train-val-test splits as their source MS-COCO [25]. Therefore, we use the object annotations from COCO to create additional closed binary questions, such as “Is there a bottle?” for the example in Fig. 1. We also create “adversarial” questions about objects, like “Is there a wine-glass?”, by using an object that is not present in the image (wine-glass) but is semantically close to an object in the image (bottle). We use GloVe vectors [33] to find the adversarial object with the closest embedding. Following a similar strategy, we also convert captions provided in COCO to closed binary questions, for example “Does this seem like a man bending over to look inside the fridge?”. Since we know which objects are present in the image, and the captions describe a “true” scene, we are able to obtain the ground-truth answers for questions created from objects and captions. Similar methods for creation of question-answer pairs have previously been used in [28, 37].
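A sketch of the adversarial-object selection described above is given below, assuming a dictionary glove mapping object names to GloVe vectors and a candidate vocabulary such as the COCO category list; the cosine-similarity criterion and the helper names are illustrative assumptions.

```python
import numpy as np

def closest_absent_object(present, vocabulary, glove):
    """Pick the object NOT present in the image whose GloVe embedding is
    closest (by cosine similarity) to some object that IS present."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    best, best_sim = None, -1.0
    for candidate in vocabulary:
        if candidate in present or candidate not in glove:
            continue
        sim = max((cos(glove[candidate], glove[obj])
                   for obj in present if obj in glove), default=-1.0)
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best

# Hypothetical usage: 'bottle' is annotated in the image, 'wine glass' is not,
# so the adversarial question becomes "Is there a wine glass?" (answer: NO).
# adversarial = closest_absent_object({"person", "bottle"}, coco_categories, glove_vectors)
```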

Thus for every question, we obtain several questions from objects and captions, and use these to compose additional questions by following a process similar to the one for VQA-Compose. For each closed question in the VQA dataset, we get 20 additional logically composed questions by utilizing questions created from objects and captions, yielding a total of 2.55 million samples as VQA-Supplement.

3.3 Analytical Setup

In order to test the robustness of our models to logically composed questions, we devise five key experiments to analyse baseline models and our methods. These experiments help us gain insights into the nuances of the VQA dataset, and allow us to develop strategies for promoting robustness.

Effect of Data Augmentation: In this experiment, we compare the performance of models on VQA-Compose and VQA-Supplement with or without logically composed training data. This experiment allows us to test our hypotheses about the robustness of any VQA model to logically composed questions. We first use models trained on VQA data to answer questions in our new datasets and record performance. We then explicitly train the same models with our new datasets, and make a comparison of performance with the pre-trained baseline.

Learning Curve: We train our models with an increasing number of logically composed questions and compare performance. This serves as an analysis of the number of logical samples needed by the model to understand logic in questions.

Fig. 3. LOL model architecture showing a cross-modal feature encoder followed by our Question-Attention (\(q_{\textit{ATT}}\)) and Logic-Attention (\(\ell_{\textit{ATT}}\)) modules. The concatenated output of these modules is used by the Answering Module to predict the answer.

Training only with Closed Questions: In this ablation study, we restrict the training data to only closed questions, i.e. “Yes-No” questions from VQA, VQA-Compose, and VQA-Supplement, allowing our model to focus solely on closed questions.

Compositional Generalization: We address whether training on closed questions containing a single logical operation (\(\lnot Q_1\), \(Q_1\vee Q_2\)) can generalize to multiple operations (\(Q_1 \wedge \lnot Q_2\), \(\lnot Q_1 \vee Q_2\)). For instance, rows 1 through 6 in Table 1 are single-operation questions, while rows 7 through 12 are multi-operation questions. Our aim is to have models that exhibit such compositional generalization.

Inductive Generalization: We investigate if training on compositions of two questions (\(\lnot Q_1 \vee Q_2\)) can generalize to compositions of more than two questions (\(Q_1 \wedge \lnot Q_2 \wedge Q_3 \dots \)). This studies whether our models develop an understanding of logical connectives, as opposed to simply learning patterns from large data.

4 Method

In this section, we describe LXMERT [43] (a state-of-the-art VQA model), our Lens of Logic (LOL) model, the attention modules which learn the question-type and the logical connectives in the question, and the Fréchet-Compatibility (FC) loss. This section refers to a composition of two questions, but the method applies to \(n\ge 2\) questions.

4.1 Cross-Modal Feature Encoder

LXMERT (Learning Cross-Modality Encoder Representations from Transformers) [43] is one of the first cross-modal pre-trained frameworks for vision-and-language tasks, combining a strong visual feature extractor [38] with a strong language model (BERT) [13]. LXMERT is pre-trained for key vision-and-language tasks on a large corpus of \(\sim \)9M image-sentence pairs, making it a powerful cross-modal encoder for vision+language tasks such as visual question answering, as compared to other models such as MCAN [45] and UpDn [2], and a strong representative baseline for our experiments.

4.2 Our Model: Lens of Logic (LOL)

The design for our LOL model is driven by three key insights:

1. As logically composed questions are closed questions, understanding the type of question will guide the model to answer them correctly.

2. Predicted answers must be compatible with the predicted question type. For instance, a closed question can have an answer that is either “Yes” or “No”.

3. The model must learn to identify the logical connectives in a question.

Given these insights, we develop the Question Attention module that encodes the type of question (Yes-No, Number, or Other), and the Logic Attention module that predicts the connectives (AND, OR, NOT, no connective) present in the question, and use these to learn representations. The overall model architecture is shown in Fig. 3. For every question Q and corresponding image I, we obtain embeddings \(z_Q\) and \(z_I\) respectively, as well as a cross-modal embedding \(z_X\).

Question Attention Module (\(q_{{\varvec{ATT}}}\)) takes the cross-modal embedding \(z_X\) from LXMERT as input, and outputs a vector representing the probabilities of each question-type. These probabilities are used to get a final representation \(\mathbf {z^{type}}\) which combines the features for each question-type (see footnote 1).

Logic Attention Module (\(\ell _{{\varvec{ATT}}}\)) takes the cross-modal embedding \(z_X\) from LXMERT as input, and outputs a vector representing the probabilities of each type of connective. We use a sigmoid (\(\sigma \)) instead of a softmax, since a question can have multiple connectives. These probabilities are used to combine the features for each type of connective into a final representation \(\mathbf {z^{conn}}\) which encodes information about the connectives in the question.
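A minimal PyTorch sketch of such an attention module is given below. The hidden sizes and the probability-weighted mixing of per-category features are assumptions consistent with the description above and with the sub-networks \(\mathbf {f_i}, \mathbf {g_i}\) mentioned in Sect. 4.4; the released implementation may differ.

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Sketch of q_ATT / l_ATT: predicts a distribution over K categories
    (question types or connectives) from the cross-modal embedding z_X and
    returns a probability-weighted mixture of per-category features."""

    def __init__(self, d_model=768, num_categories=3, multilabel=False):
        super().__init__()
        self.multilabel = multilabel
        # g: 2-layer feed-forward classifier over the K categories.
        self.g = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                               nn.Linear(d_model, num_categories))
        # f_i: one 2-layer feature sub-network per category.
        self.f = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, d_model))
            for _ in range(num_categories)])

    def forward(self, z_x):
        logits = self.g(z_x)                                    # (B, K)
        # sigmoid for connectives (a question may contain several),
        # softmax for question types (mutually exclusive).
        probs = (torch.sigmoid(logits) if self.multilabel
                 else torch.softmax(logits, dim=-1))
        feats = torch.stack([f(z_x) for f in self.f], dim=1)    # (B, K, D)
        z_mixed = (probs.unsqueeze(-1) * feats).sum(dim=1)      # (B, D), e.g. z^type
        return probs, z_mixed

# q_att = AttentionModule(num_categories=3)                   # yes-no / number / other
# l_att = AttentionModule(num_categories=4, multilabel=True)  # AND / OR / NOT / none
```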

4.3 Loss Functions

We train our models jointly with the loss function given by:

$$\begin{aligned} \mathcal {L} = (1{-}\alpha _1{-}\alpha _2)\cdot \mathcal {L}_{ans} + \alpha _1 \cdot \mathcal {L}_{type} + \alpha _2 \cdot \mathcal {L}_{conn} + \beta \cdot \mathcal {L}_{ FC }. \end{aligned}$$
(2)

Answering Loss: \(\mathcal {L}_{ans}\) is conditioned on the type of question. We multiply the final prediction vector with the predicted probability of question-type \(i\) and with the mask \(M_i\), where \(M_i\) is a binary vector with 1 at every answer-index of type \(i\) and 0 elsewhere, and compare the masked prediction against the answer label.

Attention Losses: \(q_{\textit{ATT}}\) is trained to minimize a Negative Log-Likelihood (NLL) classification loss \(\mathcal {L}_{type}\), ensuring a shrinkage of the probabilities of the answer choices of the wrong type. \(\ell_{\textit{ATT}}\) is trained to minimize a multi-label classification loss \(\mathcal {L}_{conn}\), using Binary Cross-Entropy (BCE). Here \(y_{ans}, y_{type}, y_{conn}\) denote the labels for the answer, question-type, and connectives, respectively.
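The joint objective of Eq. 2 (without the FC term) can be sketched as follows; the masked answering loss is a paraphrase of the description above, and the exact functional form used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def joint_loss(ans_logits, type_probs, conn_probs, type_masks,
               y_ans, y_type, y_conn, alpha1=0.1, alpha2=0.1):
    """Sketch of Eq. 2 without the FC term.
    ans_logits: (B, |A|) answer scores; type_probs: (B, K) from q_ATT;
    conn_probs: (B, C) from l_ATT; type_masks: (K, |A|) binary tensor;
    the loss weights alpha1/alpha2 are placeholders."""
    ans_probs = torch.sigmoid(ans_logits)
    # Keep only answers compatible with each question type, weighted by the
    # predicted type probability (paraphrase of the answering loss).
    masked = sum(type_probs[:, i:i + 1] * type_masks[i] * ans_probs
                 for i in range(type_masks.shape[0]))
    l_ans = F.binary_cross_entropy(masked.clamp(1e-6, 1 - 1e-6), y_ans)
    l_type = F.nll_loss(torch.log(type_probs + 1e-6), y_type)   # NLL over question types
    l_conn = F.binary_cross_entropy(conn_probs, y_conn)         # multi-label BCE over connectives
    return (1 - alpha1 - alpha2) * l_ans + alpha1 * l_type + alpha2 * l_conn
```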

Fréchet-Compatibility Loss: We introduce a new loss function that ensures compatibility between the answers predicted by the model for the component questions \(Q_1\) and \(Q_2\) and the composed question Q. Let \(A, A_1, A_2\) be the respective answers predicted by the model for Q, \(Q_1\), and \(Q_2\). \(Q_i\) can have negation. Then Fréchet inequalities [7, 15] provide us with bounds for the probabilities of the answers of the conjunction and disjunction of the two questions:

$$\begin{aligned} max (0, p(A_1)+p(A_2)-1)&\le p(A_1 \wedge A_2) \le min (p(A_1), p(A_2)). \end{aligned}$$
(6)
$$\begin{aligned} max (p(A_1), p(A_2))&\le p(A_1 \vee A_2) \le min (1, p(A_1) + p(A_2)). \end{aligned}$$
(7)

We define the “Fréchet bounds” \(b_L\) and \(b_R\) to be the left and right bounds for the triplet \(A, A_1, A_2\), and the “Fréchet mean” \(m_A\) to be the average of the Fréchet bounds, \(m_A = (b_L + b_R)/2\). The Fréchet-Compatibility loss then ensures that the predicted answer matches the answer determined by \(m_A\).
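One possible realization of this constraint is sketched below: compute the Fréchet bounds of Eqs. 6 and 7 from the component-answer probabilities, take their mean, and penalize predictions of the composed answer that stray from it. The smooth squared-error surrogate is an assumption; the exact form of the paper's loss is not reproduced here.

```python
import torch

def frechet_compatibility_loss(p_comp, p1, p2, conjunction: bool):
    """p_comp: predicted probability of the composed answer; p1, p2: predicted
    probabilities of the component answers A1, A2 (all tensors of shape (B,))."""
    if conjunction:                           # Eq. 6: bounds on p(A1 AND A2)
        b_l = torch.clamp(p1 + p2 - 1.0, min=0.0)
        b_r = torch.minimum(p1, p2)
    else:                                     # Eq. 7: bounds on p(A1 OR A2)
        b_l = torch.maximum(p1, p2)
        b_r = torch.clamp(p1 + p2, max=1.0)
    m_a = (b_l + b_r) / 2.0                   # Fréchet mean
    # Penalize disagreement between the model's prediction and the answer
    # implied by the Fréchet mean (smooth surrogate for the matching condition).
    return ((p_comp - m_a) ** 2).mean()
```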

4.4 Implementation Details

The LXMERT feature encoder produces a vector \(z\) of length 768 which is used by our attention modules, each having sub-networks \(\mathbf {f_i}, \mathbf {g_i}\) with 2 feed-forward layers. We first train our models without the FC loss, then select the best checkpoints after 10 epochs and finetune them for 3 further epochs with the FC loss, since the FC loss is designed to work for a model whose predictions are not random. Thus our improvements in accuracy are attributable to the FC loss and not to additional training epochs. We use the Adam optimizer [23] with a learning rate of \(5\textit{e}\)-5 and a batch size of 32, and train for 20 epochs. Our models are trained on 4 NVIDIA V100 GPUs and take approximately 24 h to train for 20 epochs (see footnote 1).
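A minimal sketch of this two-stage schedule, with model construction and data loading omitted; the optimizer call simply mirrors the reported hyperparameters, and the scheduler-free setup is an assumption.

```python
import torch

def make_optimizer(model):
    """Adam with the reported learning rate (5e-5); no scheduler assumed."""
    return torch.optim.Adam(model.parameters(), lr=5e-5)

# Stage 1: train with L = (1 - a1 - a2) * L_ans + a1 * L_type + a2 * L_conn
#          (beta = 0, i.e., FC term switched off).
# Stage 2: pick the best stage-1 checkpoint (after 10 epochs) and finetune for
#          3 epochs with beta > 0, so FC finetuning starts from a non-random model.
```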

Table 2. Comparison of LXMERT and LOL trained on VQA data, combinations with Compose, Supplement, and our Fréchet-Compatibility (FC) Loss (In all tables, best overall scores are bold, our best scores underlined)

5 Experiments

We first conduct analytical experiments to test for logical robustness and transfer learning capability. We use three datasets for our experiments: the VQA v2.0 [3] dataset, a combination of VQA and our VQA-Compose dataset, and a combination of VQA, VQA-Compose, and VQA-Supplement. The size of the training dataset and the distribution of yes-no, number, and other questions are kept the same as the original VQA dataset (\(\sim \)443k) for fair comparison. Since VQA-Supplement uses captions and objects from MS-COCO, we use it to analyze the ability of our models to generalize to a new source of data (MS-COCO) as well as to questions containing adversarial objects. After training, our attention modules (\(q_{\textit{ATT}}\) and \(\ell _{\textit{ATT}}\)) achieve an accuracy of 99.9% on average, showing almost perfect performance when it comes to learning the type of question and the logical connectives present in the question.

Table 3. Validation accuracies (\(\%\)) for Compositional Generalization and Commutative Property. Note that 50% is random performance. (In all tables, best overall scores are bold, our best scores underlined)
Fig. 4. Learning Curve comparison for models (red: LXMERT, blue: LOL) trained on our datasets (solid lines: VQA + Comp, dotted lines: VQA + Comp + Supp) (Color figure online)

5.1 Can’t We Just Parse the Question into Components?

Since our questions are a composition of multiple questions, an obvious approach is to split the question into its components, and to discern the logical formula for composition. The answers to these component questions (predicted by VQA models) can be re-combined with the predicted logical formula to obtain the final answer. We use parsers to map components and logical operations to predefined slots in a logical function. The oracle parser uses the ground truth component questions and combines predicted answers using the true formula. However, at test time we do not have access to the true mapping and components. So we train a RoBERTa-Base [26] parser using B-I-O tagging [36] for a Named-Entity Recognition task with constituent questions as entities (see footnote 1).

The performance of the oracle parser serves as the upper bound, since we have a perfect mapping and the QA system is the only source of error. The trained parser has an exact-match accuracy of \(85\%\), but only a \(72\%\) accuracy in determining the number of operands. The parser has an accuracy of \(89\%\) for questions with 3 or fewer operands, but only \(78\%\) for longer compositions. End-to-end (E2E) models do not need to parse questions and hence overcome these hurdles, but do require an understanding of logical operations. Table 4 shows that both the oracle and trained parsers, when used with LOL, outperform parsers with LXMERT, by \(6.82\%\) and \(5.60\%\) respectively. The LOL model without any parser is better than both LXMERT and LOL with the trained parser, by \(7.55\%\) and \(1.95\%\) respectively.
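For reference, the recombination step of the parser-based baseline can be sketched as follows, assuming the parser emits the component questions together with a logical formula; the prefix-tuple encoding of the formula is an illustrative assumption.

```python
def evaluate_formula(formula, answers):
    """Combine per-component yes/no answers with the parsed logical formula.
    formula: nested tuples such as ('and', ('not', 0), 1), where integers
    index the component questions; answers: list of booleans predicted by
    the underlying VQA model for each component question."""
    if isinstance(formula, int):
        return answers[formula]
    op, *args = formula
    vals = [evaluate_formula(a, answers) for a in args]
    if op == "not":
        return not vals[0]
    if op == "and":
        return all(vals)
    if op == "or":
        return any(vals)
    raise ValueError(f"unknown operator: {op}")

# "Is the man not wearing shoes and is there beer?"
# components: ["Is the man wearing shoes?", "Is there beer?"] -> [True, True]
# evaluate_formula(("and", ("not", 0), 1), [True, True])  # -> False, i.e. "NO"
```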

5.2 Explicit Training with Logically Composed Questions

Can models trained on the VQA-v2 dataset answer logically composed questions? The first section of Table 2 shows that LXMERT, when trained only on questions from VQA-v2, has near-random accuracy (\(\sim \)50%) on our logically composed datasets, thus exhibiting little robustness to such questions.

Can the baseline model improve if trained explicitly with logically composed questions? We train the models with data containing a combination of samples from VQA-v2, VQA-Compose, and VQA-Supplement. The accuracy on VQA-Compose and VQA-Supplement improves, but there is a drop in performance on yes-no questions from VQA. Our models with our attention modules (\(q_{\textit{ATT}}\) and \(\ell _{\textit{ATT}}\)) are able to retain performance on VQA-v2 while achieving improvements on all validation datasets.

5.3 Analysis

Training with Closed Questions only: We analyse the performance of models when trained only with closed questions from VQA, VQA + Comp, and VQA + Comp + Supp, and see that our model achieves the best accuracy on logically composed questions, as shown in the third and fourth sections of Table 2. Since we train only on closed questions, we do not use our question attention module for this experiment.

Effect of Logically Composed Questions: We increase the number of logical samples in the training data on a log scale from 10 to 100k. As can be seen from the learning curves in Fig. 4(a), models trained on VQA + Comp + Supp are able to retain performance on VQA validation data, while those trained only on VQA + Comp data deteriorate. Figure 4(b) shows that our models improve on VQA Yes-No performance after being trained on more logically composed samples, exhibiting transfer learning capabilities. In Fig. 4(c), both our models are comparable to the baseline, but our model shows improvements over the baseline when trained on VQA + Comp + Supp. In Fig. 4(d), for all levels of additional logical questions, our model trained on VQA + Comp + Supp is the best performing. From (c) and (d), we observe that a large number of logical questions are needed during training for the models to learn to answer them during inference. We also see that our model yields the best performance on VQA-Supplement.

Compositional Generalization: To test for compositional generalization, we train models on questions with a maximum of one connective (single) and test on those with multiple connectives. It can be seen from Table 3 that our models are better equipped than the baseline to generalize to multiple connectives, and also to generalize from VQA-Compose to VQA-Supplement.

Fig. 5. Accuracy for each type of question in (a) VQA-Compose, (b) VQA-Supplement, and (c) for questions with more than two operands.

Inductive Generalization: We test our models on questions composed with more than two components. Parser-based models have this property by default. As shown by Fig. 5c our E2E models outperform the baseline LXMERT.

Commutative Property: Our models give identical answers whether the question is composed as \(Q_1\circ Q_2\) or as \(Q_2\circ Q_1\), for a logical operation \(\circ \), as shown in Table 3. The parser-based models are agnostic to the order of components if the parsing is accurate, while our E2E models are robust to the order.

Accuracy per Category of Question Composition: In Fig. 5 we show a plot of accuracy versus question type for each model. \(Q, Q_1, Q_2\) are questions from VQA; \(B\) and \(C\) are object-based and caption-based questions from COCO, respectively. From the results, we interpret that questions such as \(Q\wedge antonym(B), Q\wedge \lnot B, Q\wedge \lnot C\) are easy because the model is able to understand the absence of objects, and can therefore always answer these questions with a “NO”. Similarly, \( Q\vee B, Q\vee C\) are easily answered since the presence of the object makes the answer always “YES”. Many such questions can be answered simply by understanding object presence. Figure 5 also shows that the model has the same accuracy for logically equivalent operations.

Table 4. Performance on ‘test-standard’ set of VQA-v2 and validation set of our datasets. LOL performance is close to SOTA on VQA-v2, but significantly better at logical robustness. \(^*\)MCAN uses a fixed vocabulary that prohibits evaluation on VQA-Supplement which has questions created from COCO captions. \(^{\#}\)Test-dev scores, since MCAN does not report test-std single-model scores (In all tables, best overall scores are bold, our best scores underlined)

5.4 Evaluation on VQA V2.0 Test Data

Table 4 shows the performance on the VQA Test-Standard dataset. Our models maintain overall performance on the VQA test dataset, and at the same time substantially improve from random performance (\(\sim \)50%) on logically composed questions to 82.39% on VQA-Compose and 87.80% on VQA-Supplement. This shows that logical connectives in questions can be learned without degrading the overall performance on the original VQA test set (our models are within \(\sim \)1.5% of the state-of-the-art on all three types of questions on the VQA test-set).

6 Discussion

Consider the example “Is every boy who is holding an apple or a banana, not wearing a hat?”: humans are able to answer it as true if and only if each boy who is holding at least one of an apple or a banana is not wearing a hat [11]. Natural language contains such complex logical compositions, not to mention ambiguities and the influence of context. In this paper, we focus on the simplest: negation, conjunction, and disjunction. We have shown that existing VQA models are not robust to questions composed with these logical connectives, even when we train parsers to split the question into its components. When humans are faced with such questions, they may refrain from giving binary (Yes/No) answers. For instance, logically, the question “Did you eat the pizza and did you like it?” has a negative answer if either of the two component questions has a negative answer. However, humans might answer the same question with “Yes, but I did not like it”. While human question-answering is indeed elaborate, explanatory, and clarifying, that is the scope of our future work; here we focus only on predicting a single binary answer.

We have shown how connectives in a question can be identified by enhancing LXMERT encoders with dedicated attention modules and loss functions. We would like to stress that we do not use knowledge of the connectives during inference, but instead train the network to be aware of them based on cross-modal features, rather than predicting purely from language-model embeddings, which fail to capture these nuances. Our work is an attempt to modularize the understanding of logical components and to train the model to utilize the outputs of the attention modules. We believe this work has potential implications for logic-guided data augmentation, logically robust question answering, and conversational agents (with or without images). Similar strategies and learning mechanisms may be used in the future to operate “logically” in the image-space at the level of object classes, attributes, or semantic segments.

7 Conclusion

In this work, we investigate VQA in terms of logical robustness. The key hypothesis is that the ability to answer questions about an image must extend to logical compositions of those questions. We show that state-of-the-art models trained on the VQA dataset lack this ability. Our solution is the “Lens of Logic” model architecture that learns to answer questions with negation, conjunction, and disjunction. We provide VQA-Compose and VQA-Supplement, two datasets containing logically composed questions, to serve as benchmarks. Our models show improvements in answering these questions, while at the same time retaining performance on the original VQA test-set.