1 Introduction

When answering questions about a given context, such as an image, we seamlessly combine the observed content with general knowledge. For autonomous agents and virtual assistants that naturally participate in our day-to-day endeavors, answering questions based on both context and general knowledge is just as natural; algorithms which leverage both observed content and general knowledge are therefore extremely useful.

To address this challenge, in recent years, a significant amount of research has been devoted to question answering in general and Visual Question Answering (VQA) in particular. Specifically, the classical VQA tasks require an algorithm to answer a given question based on the additionally provided context, given in the form of an image. For instance, significant progress in VQA was achieved by introducing a variety of VQA datasets with strong baselines [1,2,3,4,5,6,7,8]. The images in these datasets cover a broad range of categories and the questions are designed to test perceptual abilities such as counting, inferring spatial relationships, and identifying visual cues. Some challenging questions require logical reasoning and memorization capabilities. However, the majority of the questions can be answered by solely examining the visual content of the image. Hence, numerous approaches to solve these problems [7,8,9,10,11,12,13] focus on extracting visual cues using deep networks.

We note that many of the aforementioned methods focus on the visual aspect of the question answering task, i.e., the answer is predicted by combining representations of the question and the image. This clearly contrasts the described human-like approach, which combines observations with general knowledge. To address this discrepancy, in very recent meticulous work, Wang et al. [14] introduced a ‘fact-based’ VQA task (FVQA), an accompanying dataset, and a knowledge base of facts extracted from three different sources, namely WebChild [15], DBPedia [16], and ConceptNet [17]. Different from the classical VQA datasets, Wang et al. [14] argued that such a dataset can be used to develop algorithms which answer more complex questions that require a combination of observation and general knowledge. In addition to the dataset, Wang et al. [14] also developed a model which leverages the information present in the supporting facts to answer questions about an image.

To this end, Wang et al. [14] design an approach which extracts keywords from the question and retrieves facts that contain those keywords from the knowledge base. Clearly, synonyms and homographs pose challenges for such keyword matching which are hard to recover from.

Fig. 1. The FVQA dataset expects methods to answer questions about images utilizing information from the image as well as fact-based knowledge bases. Our method makes use of the image and question-text features, as well as high-level visual concepts extracted from the image, in combination with a learned fact-ranking neural network. Our method is able to answer both visually grounded as well as fact-based questions.

To address this issue, we develop a learning based retrieval method. More specifically, our approach learns a parametric mapping of facts and question-image pairs to an embedding space. To answer a question, we use the fact that is most aligned with the provided question-image pair. As illustrated in Fig. 1, our approach is able to accurately answer both more visual questions as well as more fact based questions. For instance, given the image illustrated on the left hand side along with the question, “Which object in the image can be used to eat with?”, we are able to predict the correct answer, “fork.” Similarly, the proposed approach is able to predict the correct answer for the other two examples. Quantitatively we demonstrate the efficacy of the proposed approach on the recently introduced FVQA dataset, outperforming state-of-the-art by more than \(5\%\) on the top-1 accuracy metric.

2 Related Work

We develop a framework for visual question answering that benefits from a rich knowledge base. In the following, we first review classical visual question answering tasks before discussing visual question answering methods that take advantage of knowledge bases.

Visual Question Answering. In recent years, a significant amount of research has been devoted to developing techniques which can answer a question about a provided context such as an image. Of late, visual question answering has also been used to assess reasoning capabilities of state-of-the-art predictors. Using a variety of datasets [2, 3, 5, 8, 10, 11], models based on multi-modal representation and attention [18,19,20,21,22,23,24,25], deep network architectures [12, 26,27,28], and dynamic memory nets [29] have been developed. Despite these efforts, assessing the reasoning capabilities of present day deep network-based approaches and differentiating them from mere memorization of training set statistics remains a hard task. Most of the methods developed for visual question answering [2, 6,7,8, 10, 12, 18,19,20,21,22,23,24, 27, 29,30,34] focus exclusively on answering questions related to observed content. To this end, these methods use image features extracted from networks such as the VGG-16 [35] trained on large image datasets such as ImageNet [36]. However, it is unlikely that all the information which is required to answer a question is encoded in the features extracted from the image, or even the image itself. For example, consider an image containing a dog, and a question about this image, such as “Is the animal in the image capable of jumping in the air?”. In such a case, we would want our method to combine common sense and general knowledge about the world, such as the ability of a healthy dog to jump, along with features and observations from the image, such as the presence of the dog. This motivates us to develop methods that can use knowledge bases encoding general knowledge.

Knowledge-Based Visual Question Answering. There has been interest in the natural language processing community in answering questions based on knowledge bases (KBs) using either semantic parsing [37,38,39,40,41,42,43,44,45,46,47] or information retrieval [48,49,50,51,52,53,54] methods. However, knowledge based visual question answering is still relatively unexplored, even though this is appealing from a practical standpoint as this decouples the reasoning by the neural network from the storage of knowledge in the KB. Notable examples in this direction are work by Zhu et al. [55], Wu et al. [56], Wang et al. [57], Krishnamurthy and Kollar [58], and Narasimhan et al. [59].

The works most related to our approach include Ask Me Anything (AMA) by Wu et al. [60], Ahab by Wang et al. [61], and FVQA by Wang et al. [14]. AMA describes the content of an image in terms of a set of attributes predicted about the image, and multiple captions generated about the image. The predicted attributes are used to query an external knowledge base, DBpedia [16], and the retrieved paragraphs are summarized to form a knowledge vector. The predicted attribute vector, the captions, and the database-based knowledge vector are passed as inputs to an LSTM that learns to predict the answer to the input question as a sequence of words. A drawback of this work is that it does not perform any explicit reasoning and ignores the possible structure in the KB. Ahab and FVQA, on the other hand, attempt to perform explicit reasoning. Ahab converts an input question into a database query, and processes the returned knowledge to form the final answer. Similarly, FVQA learns a mapping from questions to database queries through classifying questions into categories and extracting parts from the question deemed to be important. While both of these methods rely on fixed query templates, this very structure offers some insight into what information the method deems necessary to answer a question about a given image. Both these methods use databases with a particular structure: those that contain facts about visual concepts represented as tuples, for example, (Cat, CapableOf, Climbing), and (Dog, IsA, Pet). We develop our method on the dataset released as part of the FVQA work, referred to as the FVQA dataset [14], which is a subset of three structured databases – DBpedia [16], ConceptNet [17], and WebChild [15]. The method presented in FVQA [14] produces a query as an output of an LSTM which is fed the question as an input. Facts in the knowledge base are filtered on the basis of visual concepts such as objects, scenes, and actions extracted from the input image. The predicted query is then applied on the filtered database, resulting in a set of retrieved facts. A matching score is then computed between the retrieved facts and the question to determine the most relevant fact. The most correct fact forms the basis of the answer for the question.

In contrast to Ahab and FVQA, we propose to directly learn an embedding of facts and question-image pairs into a space that permits to assess their compatibility. This has two important advantages over prior work: (1) by avoiding the generation of an explicit query, we eliminate errors due to synonyms, homographs, and incorrect prediction of visual concept type and answer type; and (2) our technique is easy to extend to any knowledge base, even one with a different structure or size. We also do not require any ad-hoc filtering of knowledge, and can instead learn to transform extracted visual concepts into a vector close to a relevant fact in the learned embedding space. Our method also naturally produces a ranking of facts deemed to be useful for the given question and image.

3 Learning Knowledge Base Retrieval

In the following, we first provide an overview of the proposed approach for knowledge based visual question answering before discussing our embedding space and learning formulation.

Fig. 2. Overview of the proposed approach. Given an image and a question about the image, we obtain an Image + Question Embedding through the use of a CNN on the image, an LSTM on the question, and a Multi-Layer Perceptron (MLP) for combining the two modalities. In order to filter relevant facts from the Knowledge Base (KB), we use another LSTM to predict the fact relation type from the question. The retrieved structured facts are encoded using GloVe embeddings. The retrieved facts are ranked through a dot product between the embedding vectors and the top-ranked fact is returned to answer the question.

Overview. Our developed approach is outlined in Fig. 2. The task at hand is to predict an answer y for a question Q given an image x by using an external knowledge base KB, which consists of a set of facts \(f_i\), i.e., \(\text {KB} = \left\{ f_1, \ldots , f_{|\text {KB}|} \right\} \). Each fact \(f_i\) in the knowledge base is represented as a Resource Description Framework (RDF) triplet of the form \(f_i = (a_i, r_i, b_i)\), where \(a_i\) is a visual concept in the image, \(b_i\) is an attribute or phrase associated with the visual entity \(a_i\), and \(r_i\in \mathcal{R}\) is a relation between the two entities. The dataset contains \(|\mathcal{R}| = 13\) relations \(r \in \mathcal{R}= \{\)Category, Comparative, HasA, IsA, HasProperty, CapableOf, Desires, RelatedTo, AtLocation, PartOf, ReceivesAction, UsedFor, CreatedBy\(\}\). Example triples of the knowledge base in our dataset are (Umbrella, UsedFor, Shade), (Beach, HasProperty, Sandy), (Elephant, Comparative-LargerThan, Ant).
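
For concreteness, the structure of such a knowledge base can be sketched as follows; the Python representation and variable names below are illustrative and not part of the original implementation.

```python
# The 13 relation types of the FVQA knowledge base; individual facts may carry
# refined forms such as "Comparative-LargerThan".
RELATIONS = {
    "Category", "Comparative", "HasA", "IsA", "HasProperty", "CapableOf",
    "Desires", "RelatedTo", "AtLocation", "PartOf", "ReceivesAction",
    "UsedFor", "CreatedBy",
}

# Facts are (a, r, b) triplets: visual concept, relation, attribute or phrase.
knowledge_base = [
    ("Umbrella", "UsedFor", "Shade"),
    ("Beach", "HasProperty", "Sandy"),
    ("Elephant", "Comparative-LargerThan", "Ant"),
]
```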

To answer a question Q correctly given an image x, we need to retrieve the right supporting fact and choose the correct entity, i.e., either a or b. Importantly, entity a is always derived from the image and entity b is derived from the fact base. Consequently we refer to this choice as the answer source \(s\in \left\{ \text {Image}, \text {KnowledgeBase} \right\} \). Using this formulation, we can extract the answer y from a predicted fact \(\hat{f} = (\hat{a}, \hat{r}, \hat{b})\) and a predicted answer source \(\hat{s}\) using

$$\begin{aligned} y = {\left\{ \begin{array}{ll} \hat{a}, &{} \text {from } \hat{f} \text { if } \hat{s} = \text {Image}\\ \hat{b}, &{} \text {from } \hat{f} \text { if } \hat{s} = \text {KnowledgeBase} \end{array}\right. }. \end{aligned}$$
(1)
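
A minimal sketch of this extraction rule, representing a fact as a plain (a, r, b) triplet of strings:

```python
def extract_answer(fact: tuple, source: str) -> str:
    """Eq. (1): given a predicted fact (a, r, b) and the predicted answer
    source, return entity a (grounded in the image) or entity b (from the KB)."""
    a, _, b = fact
    if source == "Image":
        return a
    if source == "KnowledgeBase":
        return b
    raise ValueError(f"unknown answer source: {source}")
```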

It remains to answer how to predict a fact \(\hat{f}\) and how to infer the answer source \(\hat{s}\). The latter is a binary prediction task and we describe our approach below. For the former, we note that the knowledge base contains a large number of facts. We therefore consider it infeasible to search through all the facts \(f_i\), \(\forall i\in \{1, \ldots , |\text {KB}|\}\), using an expensive evaluation based on a deep net. Hence, we split this task into two parts: (1) given a question, we train a network to predict the relation \(\hat{r}\) that the question focuses on; (2) using the predicted relation \(\hat{r}\), we reduce the fact space to those facts which contain the predicted relation.

Subsequently, to answer the question Q given image x, we only assess the suitability of the facts which contain the predicted relation \(\hat{r}\). To assess the suitability, we design a score function \(S(g^\text {F}(f_i), g^\text {NN}(x,Q))\) which measures the compatibility of a fact representation \(g^\text {F}(f_i)\) and an image-question representation \(g^\text {NN}(x,Q)\). Intuitively, the higher the score, the more suitable the fact \(f_i\) for answering question Q given image x.

Formally, we hence obtain the predicted fact \(\hat{f}\) via

$$\begin{aligned} \hat{f} = \arg \max _{i\in \{j : {\text {rel}}(f_j) = \hat{r}\}} S(g^\text {F}(f_i), g^\text {NN}(x,Q)), \end{aligned}$$
(2)

where we search for the fact \(\hat{f}\) maximizing the score S among all facts \(f_i\) which contain relation \(\hat{r}\), i.e., among all \(f_i\) with \(i\in \{j : {\text {rel}}(f_j) = \hat{r}\}\). Hereby we use the operator \({\text {rel}}(f_i)\) to indicate the relation of the fact triplet \(f_i\). Given the predicted fact using Eq. (2) we obtain the answer y from Eq. (1) after predicting the answer source \(\hat{s}\).
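
The retrieval rule of Eq. (2) can be sketched as below; here score stands in for the learned score function \(S(g^\text {F}(f_i), g^\text {NN}(x,Q))\) and is assumed to be given.

```python
from typing import Callable, Sequence, Tuple

Fact = Tuple[str, str, str]  # (a, r, b)

def retrieve_fact(kb: Sequence[Fact], predicted_relation: str,
                  score: Callable[[Fact], float]) -> Fact:
    """Eq. (2): restrict the search to facts whose relation rel(f) equals the
    predicted relation, then return the highest-scoring candidate."""
    candidates = [f for f in kb if f[1] == predicted_relation]  # rel(f) = r_hat
    return max(candidates, key=score)
```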

This approach is outlined in Fig. 2. Pictorially, we illustrate the construction of an image-question embedding \(g^\text {NN}(x,Q)\), via LSTM and CNN net representations that are combined via an MLP. We also illustrate the fact embedding \(g^\text {F}(f_i)\). Both of them are combined using the score function \(S(\cdot , \cdot )\), to predict a fact \(\hat{f}\) from which we extract the answer as described in Eq. (1).

In the following, we first provide details about the score function S, before discussing prediction of the relation \(\hat{r}\) and prediction of the answer source \(\hat{s}\).

Scoring the Facts. Figure 2 illustrates our approach to score the facts in the knowledge base, i.e., to compute \(S(g^\text {F}(f_i), g^\text {NN}(x,Q))\). We obtain the score in three steps: (1) computing a fact representation \(g^\text {F}(f_i)\); (2) computing an image-question representation \(g^\text {NN}(x,Q)\); (3) combining the fact and image-question representations to obtain the final score S. We discuss each of these steps in the following.

(1) Computing a Fact Representation. To obtain the fact representation \(g^\text {F}(f_i)\), we concatenate two vectors, the averaged GloVe-100 [62] representation of the words of entity \(a_i\) and the averaged GloVe-100 representation of the words of entity \(b_i\). Note that this fact representation is non-parametric, i.e., there are no trainable parameters.
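
A sketch of this non-parametric embedding, assuming a hypothetical dictionary glove mapping words to 100-dimensional vectors:

```python
import numpy as np

def average_glove(phrase: str, glove: dict) -> np.ndarray:
    """Average the GloVe-100 vectors of the words in a phrase; `glove` is a
    hypothetical dict mapping lower-cased words to 100-d numpy arrays."""
    vectors = [glove[w] for w in phrase.lower().split() if w in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(100)

def fact_embedding(a: str, b: str, glove: dict) -> np.ndarray:
    """g_F(f): concatenation of the averaged embeddings of the two entities,
    giving a fixed 200-d representation with no trainable parameters."""
    return np.concatenate([average_glove(a, glove), average_glove(b, glove)])
```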

(2) Computing an Image-Question Representation. We compute the image-question representation \(g^\text {NN}(x,Q)\), by combining a visual representation \(g_w^V(x)\), obtained from a standard deep net, e.g., ResNet or VGG, with a visual concept representation \(g_w^C(x)\), and a sentence representation \(g_w^Q(Q)\), of the question Q, obtained using a trainable recurrent net. For notational convenience we concatenate all trainable parameters into one vector w. Making the dependence on the parameters explicit, we obtain the image-question representation via \(g^\text {NN}_w(x, Q) = g^\text {NN}_w(g_w^V(x), g_w^Q(Q), g_w^C(x)).\)

More specifically, for the question embedding \(g^Q_w(Q)\), we use an LSTM model [63]. For the image embedding \(g^V_w(x)\), we extract image features using ResNet-152 [64] pre-trained on the ImageNet dataset [65]. In addition, we also extract a visual concept representation \(g_w^C(x)\), which is a multi-hot vector of size 1176 indicating the visual concepts which are grounded in the image. The visual concepts detected in the images are objects, scenes, and actions. For objects, we use the detections from two Faster-RCNN [66] models that are trained on the Microsoft COCO 80-object [67] and the ImageNet 200-object [36] datasets. In total, there are 234 distinct object classes, from which we use that subset of labels that coincides with the FVQA dataset. The scene information (such as pasture, beach, bedroom) is extracted by the VGG-16 model [35] trained on the MIT Places 365-class dataset [68]. Again, we use a subset of Places to construct the 1176-dimensional multi-hot vector \(g_w^C(x)\). For detecting actions, we use the CNN model proposed in [69] which is trained on the HICO [70] and MPII [71] datasets. The HICO dataset contains labels for 600 human-object interaction activities while the MPII dataset contains labels for 393 actions. We use a subset of actions, namely those which coincide with the ones in the FVQA dataset.

All the three vectors \(g_w^V(x), g_w^Q(Q), g_w^C(x)\) are concatenated and passed to the multi-layer perceptron \(g^\text {NN}_w(\cdot , \cdot , \cdot )\).
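
A PyTorch sketch of this image-question embedding is given below; the layer sizes follow the numbers reported in Sect. 4, but the module itself is illustrative rather than the exact implementation. Its 200-dimensional output matches the dimensionality of the fact embedding \(g^\text {F}(f_i)\), so the two can be compared directly by the score function.

```python
import torch
import torch.nn as nn

class ImageQuestionEmbedding(nn.Module):
    """Sketch of g_NN(x, Q): fuses ResNet image features, an LSTM question
    encoding, and the multi-hot visual concept vector. Layer sizes follow the
    description in Sect. 4; the module is illustrative, not the original code."""

    def __init__(self, vocab_size: int, num_concepts: int = 1176,
                 word_dim: int = 128, lstm_dim: int = 128, fact_dim: int = 200):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.question_lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)
        self.image_fc = nn.Linear(2048, 64)              # reduce ResNet-152 features
        self.early_mlp = nn.Sequential(                  # fuse image + question (64 + 128)
            nn.Linear(64 + lstm_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
        )
        self.concept_fc = nn.Linear(num_concepts, 128)   # embed multi-hot visual concepts
        self.late_fc = nn.Linear(128 + 128, fact_dim)    # late fusion of the concepts

    def forward(self, image_feat, question_tokens, concept_multihot):
        # image_feat: (B, 2048); question_tokens: (B, T) int64; concept_multihot: (B, 1176)
        img = torch.relu(self.image_fc(image_feat))
        _, (h, _) = self.question_lstm(self.word_embed(question_tokens))
        q = h[-1]                                        # final hidden state of the LSTM
        fused = self.early_mlp(torch.cat([img, q], dim=1))
        c = torch.relu(self.concept_fc(concept_multihot))
        return self.late_fc(torch.cat([fused, c], dim=1))
```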

(3) Combination of Fact and Image-Question Representation. For each fact representation \(g^\text {F}(f_i)\), we compute a score

$$ S_w(g^\text {F}(f_i), g_w^\text {NN}(x,Q)) = \cos (g^\text {F}(f_i), g_w^\text {NN}(x,Q)) = \frac{g^\text {F}(f_i) \cdot g_w^\text {NN}(x,Q)}{||g^\text {F}(f_i)|| \cdot ||g_w^\text {NN}(x,Q)||}, $$

where \(g_w^\text {NN}(x,Q)\) is the image-question representation. Hence, the score S is the cosine similarity between the two normalized representations and represents the fit of fact \(f_i\) to the image-question pair \((x, Q)\).
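
In code, this score reduces to a single cosine-similarity call, for example:

```python
import torch
import torch.nn.functional as F

def score(fact_emb: torch.Tensor, iq_emb: torch.Tensor) -> torch.Tensor:
    """S_w: cosine similarity between the fact embedding g_F(f_i) and the
    image-question embedding g_NN(x, Q); both inputs are (B, d) tensors."""
    return F.cosine_similarity(fact_emb, iq_emb, dim=1)
```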

Predicting the Relation. To predict the relation \(\hat{r} = h_{w_1}^r(Q) \in \mathcal{R}\) from the question Q, we use an LSTM net. More specifically, we first embed and then encode the words of the question Q, one at a time, and linearly transform the final hidden representation of the LSTM to predict \(\hat{r}\) from the \(|\mathcal{R}|\) possibilities using standard multinomial classification. For the results presented in this work, we trained the relation prediction parameters \(w_1\) independently of the score function. We leave a joint formulation to future work.

Predicting the Answer Source. Prediction of the answer source \(\hat{s} = h_{w_2}^s(Q)\) from a given question Q is similar to relation prediction. Again, we use an LSTM net to embed and encode the words of the question Q before linearly transforming the final hidden representation to predict \(\hat{s}\in \{\text {Image}, \text {KnowledgeBase}\}\). Analogous to relation prediction, we train this LSTM net’s parameters \(w_2\) separately and leave a joint formulation to future work.
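
Both classifiers share the same shape, an LSTM over the question followed by a linear output layer; the PyTorch sketch below is illustrative (class name and default sizes are assumptions matching the numbers reported in Sect. 4 for relation prediction).

```python
import torch.nn as nn

class QuestionClassifier(nn.Module):
    """LSTM-based question classifier used both for relation prediction
    (|R| = 13 classes) and answer-source prediction (2 classes); a sketch."""

    def __init__(self, vocab_size: int, num_classes: int,
                 word_dim: int = 128, lstm_dim: int = 128, dropout: float = 0.7):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(lstm_dim, num_classes)

    def forward(self, question_tokens):
        # question_tokens: (B, T) int64 word indices of the question
        _, (h, _) = self.lstm(self.dropout(self.word_embed(question_tokens)))
        return self.out(self.dropout(h[-1]))  # unnormalized class scores

# Hypothetical usage; the vocabulary size is a placeholder.
# relation_net = QuestionClassifier(vocab_size=10000, num_classes=13)
# source_net   = QuestionClassifier(vocab_size=10000, num_classes=2,
#                                   word_dim=64, lstm_dim=64)
```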

Algorithm 1

Learning. As mentioned before, we train the parameters w (score function), \(w_1\) (relation prediction), and \(w_2\) (answer source prediction) separately. To train \(w_1\), we use a dataset \(\mathcal{D}_1 = \{(Q, r)\}\) containing pairs of question and the corresponding relation which was used to obtain the answer. To learn \(w_2\), we use a dataset \(\mathcal{D}_2 = \{(Q, s)\}\), containing pairs of question and the corresponding answer source. For both classifiers we use stochastic gradient descent on the classical cross-entropy and binary cross-entropy loss respectively. Note that both the datasets are readily available from [14].

To train the parameters of the score function we adopt a successive approach operating in time steps \(t = \{1, \ldots , T\}\). In each time step, we gradually increase the difficulty of the dataset \(\mathcal{D}^{(t)}\) by mining hard negatives. More specifically, for every question Q, and image x, \(\mathcal{D}^{(0)}\) contains the ‘groundtruth’ fact \(f^*\) as well as 99 randomly sampled ‘non-groundtruth’ facts. After having trained the score function on this dataset we use it to predict facts for image-question pairs and create a new dataset \(\mathcal{D}^{(1)}\) which now contains, along with the groundtruth fact, another 99 non-groundtruth facts that the score function assigned a high score to.
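
A sketch of how the negatives of \(\mathcal{D}^{(t)}\) for one image-question pair might be assembled; the function and its arguments are illustrative placeholders rather than the authors' implementation.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

Fact = Tuple[str, str, str]  # (a, r, b)

def sample_negatives(kb: Sequence[Fact], gt_fact: Fact,
                     score: Optional[Callable[[Fact], float]] = None,
                     num_negatives: int = 99) -> list:
    """Assemble the negative facts for one (image, question) pair of D^(t).

    For D^(0) (no trained score function yet) the negatives are drawn uniformly
    at random; in later iterations they are the non-groundtruth facts to which
    the current score function assigns the highest scores (hard negatives)."""
    candidates = [f for f in kb if f != gt_fact]
    if score is None:
        return random.sample(candidates, num_negatives)
    return sorted(candidates, key=score, reverse=True)[:num_negatives]
```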

Given a dataset \(\mathcal{D}^{(t)}\), we train the parameters w of the representations involved in the score function \(S_w(g^\text {F}(f_i), g^\text {NN}_w(x,Q))\), and its image, question, and concept embeddings by encouraging that the score of the groundtruth fact \(f^*\) is larger than the score of any other fact. More formally, we aim for parameters w which ensure the classical margin, i.e., an SVM-like loss for deep nets:

$$\begin{aligned} S_w(f^*, x, Q) \ge L(f^*, f) + S_w(f, x, Q) \quad \quad \forall (f, x, Q) \in \mathcal{D}^{(t)}, \end{aligned}$$
(3)

where \(L(f^*, f)\) is the task loss (aka margin) comparing the groundtruth fact \(f^*\) to other facts f. In our case \(L\equiv 1\). Since we may not find parameters w which ensure feasibility \(\forall (f, x, Q) \in \mathcal{D}^{(t)}\), we introduce slack variables \(\xi _{(f,x,Q)} \ge 0\) to obtain after reformulation:

$$\begin{aligned} \xi _{(f,x,Q)} \ge L(f^*, f) + S_w(f, x, Q) - S_w(f^*, x, Q) \quad \quad \forall (f, x, Q) \in \mathcal{D}^{(t)}. \end{aligned}$$
(4)

Instead of enforcing the constraint \(\forall (f, x, Q)\) in the dataset \(\mathcal{D}^{(t)}\), it is equivalent to require [72]

$$\begin{aligned} \xi _{(x,Q)} \ge \max _f \{L(f^*, f) + S_w(f, x, Q)\} - S_w(f^*, x, Q) \quad \quad \forall (x, Q) \in \mathcal{D}^{(t)}. \end{aligned}$$
(5)

Using this constraint, we find the parameters w by solving

$$\begin{aligned} \min _{w, \xi _{(x,Q)}\ge 0} \frac{C}{2}\Vert w\Vert _2^2 + \sum _{(x, Q)\in \mathcal{D}^{(t)}} \xi _{(x,Q)} \quad \text {s.t. Constraints in Eq. (5).} \end{aligned}$$
(6)

For applicability of the standard sub-gradient descent techniques, we reformulate the program given in Eq. (6) to read as

$$\begin{aligned} \min _{w} \frac{C}{2}\Vert w\Vert _2^2 + \sum _{(x, Q)\in \mathcal{D}^{(t)}} \left( \max _f \{L(f^*, f) + S_w(f, x, Q)\} - S_w(f^*, x, Q)\right) , \end{aligned}$$
(7)

which can be optimized using standard deep net packages. The proposed approach for learning the parameters w is summarized in Algorithm 1. In the following we now assess the suitability of the proposed approach.
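
The objective of Eq. (7) maps directly onto a standard training step; the sketch below assumes batched scores for the groundtruth and negative facts and realizes the regularizer via weight decay.

```python
import torch

def hinge_loss(score_gt: torch.Tensor, score_neg: torch.Tensor,
               margin: float = 1.0) -> torch.Tensor:
    """Loss of Eq. (7) with task loss L = 1 (the margin).

    score_gt:  (B,)   scores S_w(f*, x, Q) of the groundtruth facts
    score_neg: (B, N) scores S_w(f, x, Q) of the N sampled negative facts
    The (C/2)||w||^2 term can be realized via the optimizer's weight decay.
    """
    hardest = (margin + score_neg).max(dim=1).values  # max_f {L + S_w(f, x, Q)}
    return (hardest - score_gt).mean()

# Hypothetical usage, assuming a model that returns the scores of Sect. 3:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=C)
# loss = hinge_loss(model.score(gt_facts), model.score(neg_facts))
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```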

Table 1. Accuracy of predicting relations given the question.
Table 2. Accuracy of predicting answer source from a given question.

4 Evaluation

In the following, we assess the proposed approach. We first provide details about the proposed dataset before presenting quantitative results for prediction of relations from questions, prediction of answer-source from questions, and prediction of the answer and the supporting fact. We also discuss mining of hard negatives. Finally, we show qualitative results.

Dataset and Knowledge Base. We use the publicly available FVQA dataset [14] and its knowledge base to evaluate our model. This dataset consists of 2,190 images, 5,286 questions, and 4,126 unique facts corresponding to the questions. The knowledge base, consisting of 193,449 facts, was constructed by extracting the top visual concepts for all the images in the dataset and querying for those concepts in the three knowledge bases, WebChild [15], ConceptNet [17], and DBPedia [16]. The dataset consists of 5 train-test folds, and all the scores we report are averaged across all splits.

Predicting Relations from Questions. We use an LSTM architecture as discussed in Sect. 3 to predict the relation \(r\in \mathcal{R}\) given a question Q. The standard train-test split of the FVQA dataset is used to evaluate our model. Mini-batch gradient descent with the Adam optimizer was used with batches of size 100, and the model was trained over 50 epochs. The LSTM embedding and word embeddings are of size 128 each. The learning rate is set to \(1\mathrm {e}{-3}\) and a dropout of 0.7 is applied after the word embeddings as well as the LSTM embedding. Table 1 provides a comparison of our model to the FVQA baseline [14] using top-1 and top-3 prediction accuracy. We observe our results to improve upon the baseline by more than 10% on top-1 accuracy and by more than 9% when using the top-3 accuracy metric.

Predicting Answer Source from Questions. We assess the accuracy of predicting the answer source s given a question Q. To predict the source of the answer, we use an LSTM architecture as discussed in detail in Sect. 3. Note that for predicting the answer source, the size of the LSTM embedding and word embeddings was set to 64 each. Table 2 summarizes the accuracy of the prediction results of our model. We observe the prediction accuracy of the proposed approach to be close to perfect.

Predicting the Correct Answer. Our score function based model to retrieve the supporting fact is described in detail in Sect. 3. For the image embedding, we pass the 2048 dimensional feature vector returned by ResNet through a fully-connected layer and reduce it to a 64 dimensional vector. For the question embedding, we use an LSTM with a hidden layer of size 128. The two are then concatenated into a vector of size 192 and passed through a two layer perceptron with 256 and 128 nodes respectively. Note that the baseline doesn’t use image features apart from the detected visual concepts.

The multi-hot visual concept embedding is passed through a fully-connected layer to form a 128 dimensional vector. This is then concatenated with the output of the perceptron and passed through another layer with 200 output nodes. We found a late fusion of the visual concepts to result in a better model, as the facts explicitly contain these terms.

Fact embeddings are constructed from GloVe-100 vectors for entities a and b; if a or b contains multiple words, the embeddings of all its words are averaged. We use the cosine similarity between the MLP output and the fact embeddings to score the facts. The highest scoring fact is chosen as the answer. Ties are broken randomly.

Based on the answer source prediction which is computed using the aforementioned LSTM model, we choose either entity a or b of the fact to be the answer. See Eq. (1) for the formal description. Accuracy is computed based on exact match between the chosen entity and the groundtruth answer.

To assess the importance of particular features, we investigate five variants of our model: two oracle approaches, ‘gt Question + Image + Visual Concepts’ and ‘gt Question + Visual Concepts,’ which make use of the groundtruth relation and answer-source labels; and three approaches using predicted relations and answer sources, namely ‘Question + Image + Visual Concepts,’ ‘Question + Visual Concepts,’ and ‘Question + Image,’ where the latter two are obtained by dropping either the Image embeddings from ResNet or the Visual Concept embeddings.

Table 3. Answer accuracy over the FVQA dataset.

Table 3 shows the accuracy of our model in predicting an answer and compares our results to other FVQA baselines. We observe the proposed approach to outperform the state-of-the-art ensemble technique by more than \(3\%\) and the strongest baseline without ensemble by over \(5\%\) on the top-1 accuracy metric. Moreover we note the importance of visual concepts to accurately predict the answer. By including groundtruth information we assess the maximally possible top-1 and top-3 accuracy. We observe the difference to be around \(8\%\), suggesting that there is some room for improvement.

Question to Supporting Fact. To provide a complete assessment of the proposed approach we illustrate in Table 4 the top-1 and top-3 accuracy scores in retrieving the supporting facts of our model compared to other FVQA baselines. We observe the proposed approach to improve significantly both the top-1 and top-3 accuracy by more than \(20\%\). We think this is a significant improvement towards efficiently including knowledge bases into visual question answering.

Mining Hard Negatives. We trained our model over three iterations of hard negative mining, i.e., \(T = 2\). In iteration 1 (\(t = 0\)), all the 193,449 facts were used to sample the 99 negative facts during training. At every 10th epoch of training, negative facts which received high scores were saved. In the next iteration, the trained model along with these negative facts is loaded, and we ensure that the 99 negative facts are now sampled from the hard negatives. Table 5 shows the top-1 and top-3 accuracy for predicting the supporting facts over each of the three iterations. We observe significant improvements due to the proposed hard negative mining strategy. While naïve training of the proposed approach yields only \(20.17\%\) top-1 accuracy, two iterations improve the performance to \(64.5\%\).

Table 4. Correct fact prediction precision over the FVQA dataset.
Table 5. Correct fact prediction precision with hard negative mining.
Fig. 3. Examples of Visual Concepts (VCs) detected by our framework. Here, we show examples of detected objects, scenes, and actions predicted by the various networks used in our pipeline. There is a clear alignment between useful facts and the predicted VCs. As a result, including VCs in our scoring method helps improve performance.

Fig. 4. Success and failure cases of our method. In the top two rows, our method correctly predicts the relation, the supporting fact, and the answer source to produce the correct answer for the given question. The bottom row of examples shows the failure modes of our method.

Synonyms and Homographs. Here we show the improvements of our model compared to the baseline with respect to synonyms and homographs. To this end, we run additional tests using Wordnet to determine the number of question-fact pairs which contain synonyms. The test data contains 1105 such pairs out of which our model predicts 91.6% (1012) correctly, whereas the FVQA model predicts 78.0% (862) correctly. In addition, we manually generated 100 synonymous questions by replacing words in the questions with synonyms (e.g., “What in the bowl can you eat?” is rephrased to “What in the bowl is edible?”). Tests on these 100 new samples find that our model predicts 89 of these correctly, whereas the key-word matching FVQA technique [14] gets 61 of these right. With regards to homographs, the test set has 998 questions which contain words that have multiple meanings across facts. Our model predicts correct answers for 79.4% (792), whereas the FVQA model gets 66.3% (662) correct.

Qualitative Results. Figure 3 shows the Visual Concepts (VCs) detected for a few samples along with the top 3 facts retrieved by our model. Providing these predicted VCs as input to our fact-scoring MLP helps improve supporting fact retrieval as well as answer accuracy by a large margin of over \(30\%\) as seen in Tables 3 and 4. As can be seen in Fig. 3, there is a close alignment between relevant facts and predicted VCs, as VCs provide a high-level overview of the salient content in the images.

In Fig. 4, we show success and failure cases of our method. There are three steps to producing the correct answer using our method: (1) correctly predicting the relation, (2) retrieving supporting facts which contain the predicted relation and are relevant to the image, and (3) choosing the answer from the predicted answer source (Image/Knowledge Base). The top two rows of images show cases where all three steps were correctly executed by our proposed method. Note that our method works for a variety of relations, objects, answer sources, and question difficulties. It is correctly able to identify the object of interest, even when it is not the most prominent object in the image. For example, in the middle image of the first row, the frisbee is smaller than the dog in the image. However, we are correctly able to retrieve the supporting fact about the frisbee using information from the question, such as ‘capable of’ and ‘flying.’

A mistake in any of the three steps can cause our method to produce an incorrect answer. The bottom row of images in Fig. 4 displays prototypical failure modes. In the leftmost image, we miss cues from the question such as ‘round,’ and instead retrieve a fact about the person. In the middle image, our method makes a mistake at the final step and uses information from the wrong answer source. This is a very rare source of errors overall, as we are over \(97\%\) accurate in predicting the answer source, as shown in Table 2. In the rightmost image, our method makes a mistake at the first step of predicting the relation, rendering the remaining steps incorrect. Our relation prediction is around \(75\%\) and \(92\%\) accurate by the top-1 and top-3 metrics, respectively, as shown in Table 1, and has some scope for improvement. For qualitative results regarding synonyms and homographs we refer the interested reader to the supplementary material.

5 Conclusion

In this work, we addressed knowledge-based visual question answering and developed a method that learns to embed facts as well as question-image pairs into a space that admits efficient search for answers to a given question. In contrast to existing retrieval based techniques, our approach learns to embed questions and facts for retrieval. We have demonstrated the efficacy of the proposed method on the recently introduced and challenging FVQA dataset, producing state-of-the-art results. In the future, we hope to address extensions of our work to larger structured knowledge bases, as well as unstructured knowledge sources, such as online text corpora.