1 Introduction

Understanding vision and language and reasoning about both modalities is a challenging research problem. With the development of advanced machine learning techniques and large-scale datasets, recent progress in computer vision (CV) and natural language processing (NLP) has led to promising achievements in developing intelligent agents for various vision-language tasks [5, 9, 12, 16, 38]. A typical task is visual question answering (VQA) [5], which requires answering an open-ended question about an image. Going a step further, researchers have generalized VQA to the more challenging visual dialog (VD) [12] task, which aims at holding a continuous question-answering dialog about visual content. A unique challenge of VD is understanding the context of a question from the dialog history. Take the question “what is the fruit to the right of it with the same color?” for example (see Fig. 1) – to answer it, one must extract contextual information from previous questions about what “it” and “same color” refer to.

Fig. 1.

An example from our GQA-VD dataset. It consists of a variety of questions that require contextual information to answer. Different from existing datasets, each GQA-VD question can refer to multiple entities (blue) or abstract concepts (red) in the dialog history, which offers a more challenging testbed for VD modeling. (Color figure online)

To tackle this challenge, recent VD studies have developed models that keep track of all phrases in the dialog that refer to the same entity in the image (i.e., coreferences) [23]. Despite their promising results on existing VD benchmarks, it has been observed that state-of-the-art VQA models can achieve comparable or better performance on some metrics (e.g., mean rank) without even considering the dialog history [29]. This suggests that existing VD benchmarks place a disproportionate emphasis on questions that do not depend on information from the dialog context. Therefore, further advances in VD research require bridging three research gaps in the design of VD datasets and models: 1) the unclear definition and quantification of contextual dependencies, 2) the shortage of context-dependent questions in current datasets, and 3) the lack of model designs for encoding complex dialog contexts.

In this work, we bridge these research gaps with new datasets and models that focus on diverse dialog contexts. Specifically, based on linguistic theories [8], we first define a hierarchy of contextual patterns that explicitly characterize contextual dependencies, i.e., the general and diverse relationships across different questions in a dialog. Unlike visual coreferences [23], which only concern visual entities, contextual dependencies account for a broader range of contextual relationships.

Based on these definitions, we then develop two context-rich VD datasets (i.e., CLEVR-VD and GQA-VD) by generating dialogs based on the popular CLEVR [21] and GQA [19] datasets. Compared with existing VD datasets [12, 24], our proposed datasets contain more diverse and balanced contexts. As shown in Fig. 1, many questions in our GQA-VD dataset depend on one or multiple previous questions. They not only refer to previously mentioned visual entities (e.g., watermelon, banana, cabbage, green pepper), but also depend on the understanding of abstract concepts (e.g., number, color, etc.). Such general and diverse contextual dependencies lead to more challenging dialogs that demand the ability of VD models to reason about the dialog context.

Further, we propose a neural module network approach that explicitly models the reasoning process with a novel memory design and corresponding contextual modules, enabling attention to shift across abstract contextual knowledge. Experimental results demonstrate significant improvements of our method on both the proposed datasets and existing ones (i.e., CLEVR-Dialog [24] and VisDial [12]). This work pushes state-of-the-art VD research in a more fine-grained and explainable direction. Our main contributions are as follows:

  1. Inspired by linguistic studies, we propose a novel definition of VD based on a hierarchy of contextual patterns, explicitly characterizing how dialog contexts are formed and used within a dialog.

  2. We propose CLEVR-VD and GQA-VD, two new VD datasets offering diverse and complex dialog contexts, enabling the development of more sophisticated VD models. We also provide structured representations of a dialog (i.e., primitives, compounds, and topics) as extra annotations.

  3. Based on the new definition and datasets, we propose an explainable VD method that explicitly reasons about the dialog context with a novel memory mechanism and contextual modules, resulting in significantly improved performance while demonstrating model interpretability.

2 Related Work

Language Contexts in Dialog. Linguistic research has studied language contexts in dialog [6,7,8, 14, 36] for many decades. Linguistic theories (e.g., Speech Act Theory [6, 36]) have been widely applied in dialogue act classification [33, 34] and dialogue state tracking [27, 44]. Derived from VQA [5], the task of VD [13] performs multi-round question answering. To better encode language contexts, recent VD works [20, 23] consider visual coreference resolution by linking phrases and pronouns that refer to the same entity across different QA rounds. However, coreference is far from sufficient to address complex dialog contexts related to abstract concepts (e.g., number or color) or multiple entities (e.g., watermelon and banana in Fig. 1). Aiming to represent language contexts in a formal, mathematical, and detailed manner, we revisit the VD task and introduce a hierarchy of dialog contextual patterns that clearly describe the semantics and functionalities of different language entities, following the Speech Act Theory [6, 36]. These patterns characterize a broad range of contextual dependencies.

Visual Dialog Datasets. VisDial [12] and CLEVR-Dialog [24] are two large-scale VD datasets for real-world and diagnostic images, respectively. To create multi-round questions and answers, VisDial hires crowd workers to discuss real-world images (e.g., MSCOCO [26]), while CLEVR-Dialog leverages virtual agents to ground complete scene graphs of synthetic images (e.g., CLEVR [21]). CLEVR-Dialog has more frequent and more difficult coreference cases than VisDial. We draw inspiration from CLEVR-Dialog to create our own datasets for both real-world (e.g., GQA [19]) and diagnostic images (e.g., CLEVR [21]). Compared to VisDial and CLEVR-Dialog, our datasets contain richer contexts in terms of both diversity and complexity. Our new datasets include a broader range of contextual dependencies beyond coreferences. The novel contextual patterns are annotated to offer detailed and structured representations that previous datasets did not provide. Another difference lies in the question generation process. Unlike CLEVR-Dialog, which relies solely on two agents to implicitly include contexts, we provide a set of randomly sampled contexts to the question engine to ensure context diversity and complexity.

Visual Dialog Models. Most VD models [12, 13, 20, 27, 32, 37] follow an encoder-decoder framework to fuse dialog contexts and decode either an answer ranking or a free-form response. As several studies [10, 29] have pointed out the importance of dialog context modeling, recent works use attention networks to resolve coreferences [37] and, more recently, a probabilistic treatment of dialogs based on conditional variational autoencoders [30] to better encode the dialog context. All these models handle coreferences implicitly by encoding features and therefore lack interpretability. Other recent studies focus on pretraining and attention modeling (e.g., VisDial-BERT [31], VD-BERT [40]) to improve model performance. Different from these methods, our proposed NDM model explicitly learns the reasoning process using neural modules, which results in better explainability. It is most closely related to the CorefNMN [23] model, which learns to infer coreferences using neural module networks. Inspired by a class of explicit VQA models [4, 18, 39, 41, 43] where an instance-specific architecture is dynamically constructed from basic building blocks representing different reasoning operations, CorefNMN stores all mentioned entities in a memory and represents coreference resolution as a feature extraction process with novel neural module implementations. Different from CorefNMN, we develop new modules along with a memory mechanism to reason over richer contextual dependencies and achieve significant improvements.

3 Visual Dialog Context

Visual Dialog (VD) refers to the task of answering a sequence of questions about a given image over multiple rounds [13]. Understanding the context of a dialog is essential for VD models, as it allows them to answer each question based on its relationship with previous ones. Although it is well known that extracting coreferences from the dialog history can benefit answering new questions [23, 37], existing VD models fail to demonstrate superior performance over VQA methods because of insufficient context representation. To promote the development of context-rich VD datasets and models, in this section we present a more structured definition of dialog contexts. Inspired by linguistic theories [6, 36] and visual reasoning studies [23, 25], we define dialog contexts based on three levels of basic patterns: primitives, compounds, and topics.

Table 1. Summary of all primitives. We introduce two novel primitives (Include, Exclude) that represent the knowledge inclusion and exclusion through contextual dependencies. [rel] – predicate in a subject-predicate-object triplet, [fea] – the feature type, (param) – the parameter of primitives (i.e., a specific object or attribute), (att) – the intermediate attention map, [qids] – the IDs of related questions.

Primitives are atomic patterns derived from the Speech Act Theory [35], which also correspond to the atomic reasoning operations defined in visual reasoning studies. For contextual reasoning in VD, we define two new primitives (i.e., Include and Exclude) that represent knowledge inclusion and exclusion through contextual dependencies. Each of them can refer to one or multiple concepts mentioned in previous questions, and these concepts can be either visually grounded entities or abstract ones, as specified in the parameters. These parameters consist of 1) a list of related questions with shared knowledge, 2) the knowledge type (e.g., name or number), and 3) the knowledge entity (e.g., an object). In contrast, coreferences defined by previous studies (i.e., visual entities that are referred to by multiple questions) can only represent a single visual entity, which is insufficient for complex contextual representation. The other primitives are defined following conventional visual reasoning operations [19], such as attention operations (i.e., Find, Relate, Filter), logical operations (i.e., And, Or, Not), output operations (i.e., Compare, Exist, Count, Describe), etc. Examples of all primitives are shown in Table 1.

Compounds are contextual patterns composed of a sequence of primitives. Each compound corresponds to a question in the dialog. If a compound contains Include or Exclude primitives, the corresponding question depends on previous questions in the dialog history. For instance, the question “What is the fruit that shares the same color as the watermelon and banana?” can be represented as the parameterized sequence of primitives Find(fruit)-Include[qids][color](watermelon, banana)-Describe[name]. Therefore, all previous questions about the watermelon, banana, or their colors are its contextual dependencies, because they share the same contextual knowledge with it.
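To make the notation concrete, a compound can be stored as a plain list of parameterized primitives, from which the contextual dependencies are read directly from the Include/Exclude parameters. The Python sketch below only illustrates this structure; the Primitive class and its field names are our own placeholders, not the released annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Primitive:
    """One atomic reasoning operation (hypothetical field names)."""
    name: str                                   # e.g., Find, Include, Describe
    params: Dict[str, object] = field(default_factory=dict)


# "What is the fruit that shares the same color as the watermelon and banana?"
compound: List[Primitive] = [
    Primitive("Find", {"param": "fruit"}),
    Primitive("Include", {"qids": [1, 2],                       # related questions
                          "fea": "color",                       # shared knowledge type
                          "param": ["watermelon", "banana"]}),  # knowledge entities
    Primitive("Describe", {"fea": "name"}),
]


def contextual_dependencies(compound: List[Primitive]) -> Set[int]:
    """Collect the IDs of questions this compound depends on."""
    qids: Set[int] = set()
    for p in compound:
        if p.name in ("Include", "Exclude"):
            qids.update(p.params.get("qids", []))
    return qids


print(contextual_dependencies(compound))   # {1, 2}
```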

Topics are contextual patterns defined as connected graphs of multiple questions and their dependencies. We represent questions as graph nodes and their dependencies as edges, so different topics correspond to disjoint subgraphs. Each dialog consists of at least one topic, while the maximum number of topics equals the number of questions (i.e., when all questions are independent of each other and can be answered without knowledge from the dialog history).
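Equivalently, topics are the connected components of the question-dependency graph and can be recovered directly from each question's dependency list. A minimal sketch follows; the deps mapping and question IDs are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set


def topics(deps: Dict[int, List[int]]) -> List[Set[int]]:
    """Group question IDs into topics, i.e., connected components of the
    question-dependency graph.

    deps maps each question ID to the IDs it depends on, e.g.
    {1: [], 2: [1], 3: [1, 2], 4: []} yields the topics {1, 2, 3} and {4}.
    """
    graph: Dict[int, Set[int]] = defaultdict(set)
    for q, ds in deps.items():
        graph[q]                        # ensure independent questions appear as nodes
        for d in ds:
            graph[q].add(d)
            graph[d].add(q)
    seen: Set[int] = set()
    components: List[Set[int]] = []
    for q in graph:
        if q in seen:
            continue
        stack, comp = [q], set()
        while stack:                    # simple depth-first traversal
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(graph[u] - comp)
        seen |= comp
        components.append(comp)
    return components


print(topics({1: [], 2: [1], 3: [1, 2], 4: []}))   # [{1, 2, 3}, {4}]
```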

The primitives, compounds, and topics defined above provide a concise and informative representation of dialog contexts, which we use in Sect. 4 to ensure the contextual richness of our proposed datasets.

Fig. 2.

An overview of the dataset generation process. First, we generate the dialog context by sampling compounds and instantiating them with parameters into a collection of topics. Next, the instantiated contexts are fed into the question engine, which performs template matching, decoy insertion and sanity checking, and question reordering to generate diverse dialogs.

4 The CLEVR-VD and GQA-VD Datasets

Based on the definition in Sect. 3 and the popular visual reasoning datasets CLEVR [21] and GQA [19], we develop two novel datasets, namely CLEVR-VD and GQA-VD, featuring rich dialog contexts and questions. Both datasets offer ten-round dialogs with complex contexts and diverse questions. Compared with existing VD datasets, the diversity and complexity of CLEVR-VD and GQA-VD demonstrate great potential for developing and benchmarking VD models capable of better contextual reasoning. In this section, we first describe the process of generating dialogs, and then report the data statistics.

4.1 Dataset Generation

Previous studies [12, 24] develop VD datasets by either recruiting crowd workers or developing AI agents to perform question answering. Although these datasets consist of naturally generated questions, there is no explicit control over the richness of contextual dependencies. In contrast, we generate dialogs explicitly from a structured representation of the dialog context following the definition in Sect. 3. As shown in Fig. 2, the data generation process consists of five steps: context sampling, context instantiation, template matching, decoy & sanity check, and question reordering. Following these steps, we 1) generate complex dialog contexts with a variety of primitives, compounds, and topics, and 2) develop a question engine to generate a diverse set of dialogs based on each dialog context. We summarize these data generation steps below. For more details, please refer to the supplementary.

  • 1. Context sampling. Different from existing datasets that generate questions directly from the scene graph of images, in this work, we aim to ensure the contextual richness of the generated dialogs. Therefore, we first randomly sample a number of predefined compounds and make sure they contain a sufficient number of contextual dependencies. These compounds specify the general layout of the dialog context without concrete parameters. In particular, for each sampled compound consisting of Include or Exclude primitives, we recursively sample their dependencies, which generates complex topics. With this approach, we arrive at a preliminary layout of the dialog context. It contains a number of topics, each forming a graph with compounds as nodes and their dependencies as edges, indicating the overall contextual relationships of questions.

  • 2. Context instantiation. The previous step specifies the structure of the dialog context without taking the visual information into account. Next, given the scene graph of an input image, we instantiate the dialog context by filling in the parameters. Specifically, we first randomly sample objects and attributes from the scene graph and assign them to each primitive. We then validate the compounds to make sure that a question depending on another one shares parameters with it, while independent questions all have different parameters (a minimal sketch of this check is given after this list). For example, when the referred object is unique in the image, a question about the object could be independent of the context. This process leads to an image-specific dialog context with rich contextual dependencies.

  • 3. Question templates. From a dialog context, we can generate a variety of dialogs by choosing different question templates for each compound. For example, the questions “Is there any watermelon?” and “Does there exist any watermelon?” can both be generated from the compound Find(watermelon)-Exist using different templates. We not only design 240 templates for CLEVR-VD and 360 templates for GQA-VD, but also prepare a set of synonyms to further increase language diversity. The lists of templates are presented in the supplementary.

  • 4. Decoys and sanity check. To further increase the diversity of the dialogs, we randomly replace objects or attributes in the questions with plausible decoys. The decoys do not necessarily exist in the image, and they may affect the answer. After the replacement, to maintain the validity of the questions, we perform a sanity check based on a set of predefined rules. For example, consider the two questions “Does there exist any watermelon?” and “What is the color of it?” (see Fig. 2): with a decoy “lemon”, the first one may be changed to “Does there exist any lemon?”. Due to this change, the second question must be revised to “What is the color of the watermelon?” to maintain the validity of the dialog context. By adjusting the affected questions accordingly, these sanity-check rules (see the supplementary for details) ensure the dialog-image integrity.

  • 5. Question reordering. Although the order of questions has been determined by the dialog context, some questions in the dialog can be reordered without breaking the integrity of the context. For example, as shown in Fig. 2, independent questions or topics can be randomly shuffled without affecting each other, since they do not require shared knowledge. Therefore, by shuffling the question orders we further increase the diversity of dialogs.
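As referenced in step 2 above, the following is a minimal sketch of the parameter-sharing check used during context instantiation. It encodes one simplified reading of the rule, namely that a pair of questions shares parameters if and only if one depends on the other; the actual rule set is more elaborate and is described in the supplementary.

```python
from typing import Dict, List


def valid_instantiation(layout: List[Dict]) -> bool:
    """Pairwise parameter-sharing check for an instantiated dialog context.

    Each entry has a question ID ('qid'), the IDs it depends on ('deps'),
    and the set of objects/attributes bound to its primitives ('params').
    Under this simplified reading, a pair of questions should share
    parameters if and only if one depends on the other.
    """
    for a in layout:
        for b in layout:
            if a["qid"] >= b["qid"]:
                continue
            linked = a["qid"] in b["deps"] or b["qid"] in a["deps"]
            shared = bool(a["params"] & b["params"])
            if linked and not shared:     # dependent questions must share knowledge
                return False
            if not linked and shared:     # unrelated questions must not overlap
                return False
    return True


# Q1 mentions the watermelon, Q2 asks for its color, Q3 is independent.
context = [
    {"qid": 1, "deps": [],  "params": {"watermelon"}},
    {"qid": 2, "deps": [1], "params": {"watermelon", "color"}},
    {"qid": 3, "deps": [],  "params": {"apple"}},
]
print(valid_instantiation(context))   # True
```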

4.2 Dataset Analysis

Table 2 compares the overall statistics of our datasets with the related VisDial [13] and CLEVR-Dialog [24]. These datasets are grouped based on their image sources: VisDial and GQA-VD use COCO images, while CLEVR-Dialog and CLEVR-VD use CLEVR images. Both CLEVR-VD and GQA-VD have several unique characteristics that distinguish them from the previous datasets. For example, they have larger vocabularies and more unique questions. GQA-VD has 5 times more questions than VisDial and 3 times more unique questions, making it more diverse and better suited for mitigating biases. Although the total number of questions in CLEVR-VD is smaller than in CLEVR-Dialog, it has more unique questions and answers. In particular, compared with CLEVR-Dialog and VisDial, our datasets have fewer topics and more contextual dependencies per question. They also have more long-term contextual dependencies between non-adjacent questions and fewer independent questions. These statistics suggest that our datasets have more complex dialog contexts, with more questions being dependent on each other. In the following, we analyze the distribution of questions and answers, as well as different contextual patterns. Detailed statistics of our datasets are reported in the supplementary.

Table 2. Dataset statistics of CLEVR-Dialog, CLEVR-VD, VisDial and GQA-VD. Q. – questions, A. – answers, T. – topics, C. – contextual dependencies. Note that the 1.4k unique VisDial answers are short answers extracted from the 340k long answers by removing synonyms, while the 1.8k short answers of GQA-VD can also be augmented into 840k unique long answers with the current templates.

Balanced Questions and Answers. One of the main challenges of VQA and VD is the prevalent language bias [1, 2, 11, 15, 42], which allows models to answer questions based on shallow question-answer correlations rather than reasoning over both modalities. To mitigate such bias and encourage models to focus on learning dialog contexts, we diversify and balance the question and answer categories in the generated dialogs. Figure 3a–b show the answer distributions for the six major question categories of CLEVR-VD and the top-10 question categories of GQA-VD. As shown, the answers are well balanced within each question category, which reduces the tendency of models to fit the language bias.

Fig. 3.

Our CLEVR-VD and GQA-VD datasets maintain a balanced distribution of answers and contextual patterns. (a) Answer distribution of the six major question types of CLEVR-VD. (b) Answer distribution of the top-10 question types of GQA-VD. (c) Distribution of primitives and number of contextual dependencies.

Diverse Contextual Patterns. The core characteristic of both CLEVR-VD and GQA-VD is their diverse contextual patterns. Figure 3c shows the statistics of these patterns (i.e., primitives and the number of contextual dependencies) for both datasets. Although their total numbers of compounds differ, CLEVR-VD and GQA-VD maintain similar distributions of primitives and compounds. In particular, more than half of all questions have at least two contextual dependencies, which is significantly higher than in existing VD datasets. The increased number of contextual dependencies leads to more challenging benchmarks for future VD models.

5 Explainable Contextual Reasoning

To model the rich and diverse dialog contexts, we develop a Neural Dialog Modular network (NDM) for explainable contextual reasoning. In particular, we propose a memory mechanism and two contextual modules that explicitly store and transfer knowledge across different questions to tackle specific challenges in understanding dialog contexts. These novel components enable attention to shift to multiple abstract concepts through diverse contextual dependencies, rather than to a single coreference [23].

Neural module networks are a class of explainable reasoning methods [4, 39, 43]. They perform visual reasoning by first parsing each question into a layout of pre-defined reasoning modules, dynamically constructing a network from this layout, and then feeding the visual input to the network to predict an answer. Our NDM method adopts a conventional question parser and VQA modules following the NMN approach [4]. Table 3 shows the implementation of our neural modules. In the following, we briefly present the design of our novel components: the memory and contextual modules. More details are presented in the supplementary.
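As a rough illustration of this assembly step, the sketch below chains a parsed layout of primitives into a per-question network. The module registry, layout format, and calling convention are assumptions for exposition, not the actual NDM interface.

```python
import torch
import torch.nn as nn


class ModularExecutor(nn.Module):
    """Assemble a per-question network from a parsed layout of primitives.

    `modules_by_name` maps primitive names (Find, Filter, Include, Describe,
    ...) to nn.Module instances; `layout` is the parsed primitive sequence.
    Both the registry and the (state, image_feat, memory, params) calling
    convention are placeholders rather than the actual NDM implementation.
    """

    def __init__(self, modules_by_name):
        super().__init__()
        self.ops = nn.ModuleDict(modules_by_name)

    def forward(self, layout, image_feat, memory):
        state = None                       # attention/feature passed along the chain
        for name, params in layout:        # e.g., [("Find", {...}), ("Describe", {...})]
            state = self.ops[name](state, image_feat, memory, params)
        return state                       # the final output feeds the answer head
```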

Table 3. Implementation of neural modules. Apart from common neural modules, we design two novel contextual modules (Include, Exclude) to include or exclude the memorized features from the dialog history. \(\textsc {MLP}(\cdot )\) indicates a multi-layer perceptron consisting of several fully-connected and ReLU layers, \({\boldsymbol{W}}_h\) is the transfer matrix computed following [43], and \({\boldsymbol{W}}\) is a set of K matrices of learnable weights [39] that map features onto K specific fields. \({\boldsymbol{a}}\), \({\boldsymbol{h}}\), and \({\boldsymbol{q}}\) indicate the input attention, features, and parameters. \({\boldsymbol{a}}'\) and \({\boldsymbol{h}}'\) are the output attention and features, respectively. \({\boldsymbol{a}}_1\), \({\boldsymbol{a}}_2\) are two input attention maps for Or/And, while \({\boldsymbol{h_1}}\), \({\boldsymbol{h_2}}\) are two input features for Compare.
Fig. 4.

The proposed memory mechanism and contextual modules retrieve relevant knowledge \({\boldsymbol{h}}_{ex}\) from the dialog history. Contextual modules first find the attended features of the relevant entity from the image features \({\boldsymbol{h}}\) (using the memorized attended features \({\boldsymbol{M}}_t^v\) and the parameter \({\boldsymbol{q}}\)), and then retrieve the relevant knowledge through a weighted combination of the features projected onto different spaces (e.g., name, color, number). The weights are computed by measuring the overlap between the memorized semantic embeddings \({\boldsymbol{M}}_t^p\) and the target feature name \({\boldsymbol{p}}\).

Memorizing Visual and Semantic Features. Due to the complexity of dialog contexts, knowledge from the dialog history can be critical for answering questions, and simply storing features of coreferences can be insufficient. For example (see Fig. 1), to answer “What is the total number of the two latest mentioned fruits?”, abstract knowledge (e.g., the number of watermelons) needs to be retrieved from the history. To effectively retrieve such relevant knowledge, we propose a novel memory mechanism \({\boldsymbol{M}}_t\) that stores both the attended visual features \({\boldsymbol{M}}_t^v\) and their corresponding semantic embeddings \({\boldsymbol{M}}_t^p\). The memory (as shown in Fig. 4) is updated by projecting the concatenation of the previous memory \(\{{\boldsymbol{M}}_{t-1}^v, {\boldsymbol{M}}_{t-1}^p\}\) and the current features \(\{{\boldsymbol{m}}_t^{v}, {\boldsymbol{m}}_t^{p}\}\):

$$\begin{aligned} {\boldsymbol{M}}_t^v= \tanh ({\boldsymbol{W}}^{v} [{\boldsymbol{M}}^{v}_{t-1}, {\boldsymbol{m}}_t^{v}]) \end{aligned}$$
(1)
$$\begin{aligned} {\boldsymbol{M}}_t^p= \tanh ({\boldsymbol{W}}^{p} [{\boldsymbol{M}}^{p}_{t-1}, {\boldsymbol{m}}_t^{p}]), \end{aligned}$$
(2)

where \({\boldsymbol{W}}^{v}\), \({\boldsymbol{W}}^{p}\) are learnable parameters. \({\boldsymbol{m}}_t^{v}\) is a copy of the current attended visual features, while \({\boldsymbol{m}}_t^{p}\) describes the attended language features obtained by encoding the dialog history into semantic embeddings with an LSTM [17].
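A minimal PyTorch sketch of this update (Eqs. (1)–(2)) follows. The feature dimensions, the bias terms in the linear layers, and the zero initialization of the memory are illustrative assumptions rather than the exact NDM configuration.

```python
import torch
import torch.nn as nn


class ContextMemory(nn.Module):
    """Memory of attended visual features M^v and semantic embeddings M^p."""

    def __init__(self, vis_dim: int, sem_dim: int, mem_dim: int):
        super().__init__()
        self.W_v = nn.Linear(mem_dim + vis_dim, mem_dim)   # W^v in Eq. (1)
        self.W_p = nn.Linear(mem_dim + sem_dim, mem_dim)   # W^p in Eq. (2)

    def forward(self, M_v, M_p, m_v, m_p):
        """One update: concatenate the previous memory with the current features."""
        M_v = torch.tanh(self.W_v(torch.cat([M_v, m_v], dim=-1)))   # Eq. (1)
        M_p = torch.tanh(self.W_p(torch.cat([M_p, m_p], dim=-1)))   # Eq. (2)
        return M_v, M_p


# Usage with illustrative sizes: batch of 2, 512-d visual, 300-d semantic features.
mem = ContextMemory(vis_dim=512, sem_dim=300, mem_dim=512)
M_v, M_p = torch.zeros(2, 512), torch.zeros(2, 512)        # empty memory at t = 0
m_v, m_p = torch.randn(2, 512), torch.randn(2, 300)        # current round's features
M_v, M_p = mem(M_v, M_p, m_v, m_p)
```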

Contextual Modules. To precisely extract relevant information from the attended visual features \({\boldsymbol{M}}^v\) and their semantic embeddings \({\boldsymbol{M}}^p\), we also implement Include and Exclude as novel contextual modules. Different from CorefNMN [23], our contextual modules extract visual features from the memory \({\boldsymbol{M}}^v\), project them into several feature spaces (e.g., name, color, count) and finally produce the abstract features with a linear combination.

As shown in Fig. 4, given the memorized features \({\boldsymbol{M}}^v\), the input parameter \({\boldsymbol{q}}\) and the image features \({\boldsymbol{h}}\), we can obtain relevant features \({\boldsymbol{h}}_{m}\) from the memory

$$\begin{aligned} {\boldsymbol{h}}_\textit{m} = \text {softmax}(\text {MLP}({\boldsymbol{M}}^v, {\boldsymbol{q}}))\circ {\boldsymbol{h}}, \end{aligned}$$
(3)

where \(\circ \) denotes the Hadamard product. The relevant features \({\boldsymbol{h}}_{m}\) are then projected into K spaces with the same learnable projection matrices (\({\boldsymbol{W}}=\{{\boldsymbol{W}}_k\}_{k=1}^K\)) as Describe. Finally, given the memorized semantic embeddings \({\boldsymbol{M}}^p\) and the target feature name \({\boldsymbol{p}}\), we measure the overlap of their probability distributions (i.e., \({\boldsymbol{r}} = \text {softmax}(\text {MLP}({\boldsymbol{M}}^p))\), \({\boldsymbol{r}}' = \text {softmax}(\text {MLP}({\boldsymbol{p}}))\)) and use it as weights to combine the K projections into the extracted features

$$\begin{aligned} {\boldsymbol{h}}_{\textit{ex}} = \sum _{k=1}^K \text {min}(r_k, r_k'){\boldsymbol{W}}_k{\boldsymbol{h}}_{m}, \end{aligned}$$
(4)

where \(r_k, r'_k\) are the k-th entries of \({\boldsymbol{r}}\) and \({\boldsymbol{r}}'\). As shown in Table 3, the Include and Exclude modules then process the result (\({\boldsymbol{h}}_{ex}\)) of Eq. (4) differently to determine the inclusion or exclusion of the retrieved knowledge.
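The retrieval in Eqs. (3)–(4) can be sketched as follows. The MLP architectures, the pooling of the attended features, the classifier heads producing \({\boldsymbol{r}}\) and \({\boldsymbol{r}}'\), and the number of projection spaces K are all illustrative assumptions, not the exact NDM implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualRetrieval(nn.Module):
    """Retrieve abstract knowledge h_ex from memorized features (Eqs. 3-4)."""

    def __init__(self, dim: int, K: int):
        super().__init__()
        self.att_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 1))
        self.W = nn.Parameter(torch.randn(K, dim, dim) * 0.01)  # K projections W_k
        self.field_m = nn.Linear(dim, K)   # field distribution r from memory M^p
        self.field_p = nn.Linear(dim, K)   # field distribution r' from target name p

    def forward(self, M_v, M_p, q, h, p):
        """M_v, h: (N, dim) memorized/image features over N regions or slots;
        q, p, M_p: (dim,) parameter, target-name, and memory embeddings."""
        # Eq. (3): attention over memory conditioned on q, applied to image features.
        scores = self.att_mlp(torch.cat([M_v, q.expand_as(M_v)], dim=-1))
        h_m = (F.softmax(scores, dim=0) * h).sum(dim=0)           # pooled relevant features
        # Eq. (4): the overlap min(r_k, r'_k) weights the K projected features.
        r = F.softmax(self.field_m(M_p), dim=-1)
        r_p = F.softmax(self.field_p(p), dim=-1)
        proj = torch.einsum('kij,j->ki', self.W, h_m)             # W_k h_m for each field k
        return (torch.minimum(r, r_p).unsqueeze(-1) * proj).sum(dim=0)   # h_ex


# Illustrative shapes: 36 regions/slots, 512-d features, K = 4 fields.
mod = ContextualRetrieval(dim=512, K=4)
h_ex = mod(torch.randn(36, 512), torch.randn(512), torch.randn(512),
           torch.randn(36, 512), torch.randn(512))
```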

6 Experiments

Our proposed datasets provide new opportunities for developing and benchmarking context-aware VD models. In this section, we conduct extensive experiments to demonstrate the effectiveness of our datasets and the proposed NDM method. Section 6.2 reports quantitative results in comparison with the state-of-the-art. Section 6.3 visualizes the parameters of neural modules to illustrate the contextual knowledge reasoning. Section 6.4 analyzes the effectiveness of our novel memory mechanism and contextual modules.

6.1 Models and Evaluation

We systematically evaluate NDM against a series of baselines and state-of-the-art models. First, we develop a baseline model that predicts the answers based on the prior distribution of the training data. We then compare our method with three VD models (i.e., HRE-QIH [12], MN-QIH [12], CorefNMN [23]) and two VQA models (i.e., NMN [4], BUTD [3]). In addition, we incorporate pretrained ViLBERT [28] features into our NDM model and compare it (i.e., NDM-BERT) with the language-pretrained VD-BERT [40] and VisDial-BERT [31] methods. We train and evaluate these models on our proposed CLEVR-VD and GQA-VD datasets, as well as two public datasets: CLEVR-Dialog [24] and VisDial [12]. All the compared models are trained with their default parameters and evaluated on the validation sets. Our NDM and NDM-BERT models are optimized using the Adam [22] optimizer with a learning rate of \(10^{-4}\) and a decay rate of \(10^{-5}\).
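For reference, a minimal PyTorch rendering of this training setup is given below. The placeholder model and the interpretation of the decay rate as weight decay are assumptions; the stated decay could instead refer to a learning-rate schedule.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)   # placeholder for the instantiated NDM network

# Adam with a learning rate of 1e-4; the stated decay rate of 1e-5 is applied
# here as weight decay, though it may instead denote a learning-rate schedule.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
```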

Table 4. Quantitative comparison with state-of-the-art methods on CLEVR-Dialog, CLEVR-VD, VisDial, and GQA-VD datasets.

6.2 Quantitative Results

Table 4 shows quantitative results demonstrating the importance of context-rich datasets for visual dialog modeling. In general, we find that the VD models perform much better on CLEVR-VD and GQA-VD than the VQA models (i.e., NMN and BUTD in the top panel), suggesting that the more challenging dialogs of our datasets, with their complex contextual patterns, cannot be handled without reasoning about contextual dependencies. Further, we find that our NDM achieves the highest accuracy among all non-pretraining methods (i.e., HRE-QIH, MN-QIH, CorefNMN). The significant gains on the CLEVR-VD and GQA-VD datasets demonstrate its ability to reason about rich dialog contexts, and its high performance on CLEVR-Dialog and VisDial demonstrates our model’s generalizability.

Although NDM is a neural module network that focuses on structured reasoning rather than pretraining, Table 4 also compares it with state-of-the-art methods based on language pretraining (bottom panel of Table 4). The proposed NDM, without pretraining, is competitive with the state-of-the-art pretrained models. It also consistently outperforms VD-BERT on all four datasets. Further, our pretrained NDM-BERT maintains interpretability while achieving the best performance (i.e., also outperforming VisDial-BERT) on CLEVR-Dialog, CLEVR-VD, and GQA-VD. Between NDM and NDM-BERT, we only observe minor performance improvements, which suggests that the learning of contextual dependencies does not benefit significantly from pretraining.

Table 5. Average accuracy for questions with different numbers of contextual dependencies on CLEVR-VD and GQA-VD.

Table 5 groups the questions into categories by the number of contextual dependencies and shows the average accuracy for each category. It is noteworthy that the performance of VQA models decreases significantly as the number of contextual dependencies increases, while for VD models the drop is less pronounced. Our proposed NDM performs almost equally well on questions with different numbers of dependencies, suggesting its ability to perform contextual reasoning across multiple questions.

Fig. 5.

A typical example on the GQA-VD dataset. Heat maps demonstrate the attention of each parameterized reasoning module when answering Q3.

6.3 Qualitative Analysis

Figure 5 shows a typical example of answering questions in a context-rich dialog, with attention maps illustrating the reasoning processes of NDM and CorefNMN. In this dialog, NDM shifts attention to multiple abstract concepts in the contextual knowledge, while CorefNMN only focuses on visual entities. The dialog starts with questions about the existence and color of the watermelon, and both models answer correctly. However, CorefNMN fails to answer Q3 and the subsequent questions Q4 and Q5 that depend on Q3. It incorrectly answers “apple”, which is also to the right of the watermelon but has a different color. In contrast, NDM correctly locates the banana that is both “to the right of the watermelon” and “with the same color”, because it acquires both the name and the color of the watermelon. By memorizing this knowledge and leveraging multiple contextual dependencies, NDM reasons across questions more effectively. The ability to use multiple Include modules to infer complex contextual dependencies allows NDM to focus on the watermelon and its color in different reasoning steps, whereas CorefNMN fails to handle such complexity. Further qualitative results are reported in the supplementary.

Table 6. Ablation study of CorefNMN [23] and NDM baselines with different combinations of conventional VQA modules, memory (M), and contextual modules (C).

6.4 Ablation Study

To analyze the contributions of different technical components, we compare NDM variants with different combinations of conventional VQA modules, contextual modules (C), and the memory mechanism (M). Similarly, we adapt CorefNMN by keeping its original VQA neural modules but replacing its coreference modules and/or its memory mechanism with ours. Table 6 shows the results on the CLEVR-VD and GQA-VD datasets. We find that our memory and contextual modules each contribute significantly to model accuracy and lead to further improvements when combined. They are also shown to be general, yielding consistent performance gains on both the CorefNMN and NDM baselines.

7 Conclusion

Research on VD could fundamentally change the experience of human-machine interaction. However, VD studies are limited by insufficient contextual dependencies in existing datasets. To overcome this limitation, we introduce a novel definition of the dialog context with a hierarchy of contextual patterns, and construct two new VD datasets, CLEVR-VD and GQA-VD. We further propose NDM, a neural module network that performs explainable visual reasoning over the dialog context across different questions. Experimental results demonstrate that our proposed datasets offer a more general and challenging benchmark for VD models. Our NDM method also achieves promising performance by explicitly memorizing and retrieving contextual knowledge. We hope that our work will inspire future developments of interpretable and contextual reasoning methods.