1 Introduction

Recently, an increasing amount of attention has been paid to problems lying at the intersection of the vision and language domains. Many pilot tasks in this intersecting region have been designed and introduced to the research community, together with datasets. Visual dialog has been developed aiming at a higher level of vision-language interaction [7], as compared with VQA (visual question answering) [2] and VCR (visual commonsense reasoning). It extends VQA to multiple rounds; given an image and a history of question-answer pairs about the image, an agent is required to answer a new question. For example, to answer the question ‘What color are they?’, the agent needs to understand the context from the dialog history to know what ‘they’ refers to and look at the relevant image region to find out the color.

In recent studies of vision-language tasks, a primary concern has been to design an attention mechanism that can effectively deal with interactions between the two modalities. In the case of visual dialog, it becomes further necessary to consider interactions between an image, a question, and a dialog history, or even between the multiple question-answer pairs within that history. Thus, the key to success will be how to deal with such interactions among three or more entities. Following a recent study [36], we will use the term utility to refer to each of these input entities for clarity, since the term modality cannot distinguish between the question and the dialog history.

Existing studies have considered attention from one utility to another based on different hypotheses, such as the “question \(\rightarrow \) history \(\rightarrow \) image” path in [18, 28] and the “question \(\rightarrow \) image \(\rightarrow \) history \(\rightarrow \) question” path in [12, 43]. These methods cannot take all the interactions between utilities into account, although the missing interactions could be crucial. Motivated by this, a recent study attempts to capture all the possible interactions by using a factor graph [36]. However, building the factor graph is computationally inefficient, which seemingly prevents the method from realizing the full potential of modeling all the interactions, especially when the dialog history grows long.

The Transformer [41] has become a standard neural architecture for various tasks in the field of natural language processing, especially since the huge success of its pretrained model, BERT [11]. Its basic mechanism has recently been extended to the bi-modal problems of vision and language, yielding promising results [6, 13, 26, 27, 47]. It then appears natural to extend it further to deal with many-to-many utility interactions. However, this is not easy for several reasons. As its basic structure is designed to deal with self-attention, even in the simplest case of bi-modality, letting X and Y be the two utilities, there are four patterns of attention, \(X\rightarrow Y\), \(Y\rightarrow X\), \(X\rightarrow X\), and \(Y\rightarrow Y\); we need an independent Transformer block for each of the four. When extending this to deal with many-to-many utility interactions, the number of blocks, and thus the total number of their parameters, increases in proportion to the square of the number of utilities, making the approach computationally expensive. Moreover, it is not apparent how to aggregate the results from all the interactions.

To cope with this, we propose a neural architecture named Light-weight Transformer for Many Inputs (LTMI) that can deal with all the interactions between many utilities. While it has a block structure similar to the Transformer and shares the core design of attention computation, it differs in the following two aspects. One is the implementation of multi-head attention. Multi-head attention in the Transformer linearly projects the input feature space to multiple lower-dimensional spaces, enabling the model to handle multiple attention maps, where the linear mappings are represented with learnable parameters. In the proposed model, we instead split the input feature space into subspaces mechanically according to its indices, removing all the learnable parameters from the attention computation.

The other difference from the Transformer is that LTMI is designed to receive multiple utilities and compute all the interactions to one utility from all the others, including itself. This yields the same number of attended features as there are input utilities; they are concatenated along the feature dimension and then linearly projected back to the original feature space. We treat the parameters of this last linear projection as the only learnable parameters in LTMI. This design makes it possible to retain sufficient representational power with far fewer parameters, as compared with a natural extension of the Transformer block to many utilities. By using the same number of blocks in parallel as the number of utilities, we can deal with all the interactions between the utilities; see Fig. 2 for example. Assuming three utilities and a feature space dimensionality of 512, a layer consisting of LTMI has 2.38M parameters, whereas its counterpart based on a naive Transformer extension has 28.4M parameters.
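For reference, the LTMI figure can be roughly reproduced by a back-of-the-envelope count (ours, not the authors'): assuming the 2.38M covers, for each of the \(U=3\) blocks, the \(Ud\times d\) projection, its bias, and the layer normalization of the aggregation step described in Sect. 3.3, we get

$$\begin{aligned} U\left( Ud\cdot d + d + 2d\right) = 3\left( 3\cdot 512\cdot 512+512+1024\right) \approx 2.36\mathrm {M}, \end{aligned}$$

which roughly matches the quoted figure; small implementation details may account for the remainder.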

2 Related Work

2.1 Attention Mechanisms for Vision-Language Tasks

Attention mechanisms are currently indispensable for building neural architectures for vision-language tasks, such as VQA [4, 16, 20, 29, 31, 45, 48, 49] and visual grounding [10, 46, 52]. Inspired by the recent success of the Transformer on language tasks [11, 41], several studies have proposed its extensions to bi-modal vision-language tasks [6, 13, 26, 27, 40, 47]. Specifically, for VQA, one study uses intra-modal and inter-modal attention blocks and stacks them alternately to fuse question and image features [13]; another uses a cascade of modular co-attention layers that compute the self-attention and guided-attention of question and image features [47]. The pretraining strategy of BERT [11] has also been employed together with such bi-modal Transformer extensions for several vision-language tasks [6, 26, 27]. These studies first pretrain the models on external datasets, such as COCO Captions [5] or the Conceptual Captions dataset [38], and then fine-tune them on several target tasks.

2.2 Visual Dialog

The task of visual dialog has recently been proposed by two groups of researchers concurrently [7, 9]. De Vries et al. introduced the GuessWhat?! dataset, which is built upon goal-oriented dialogs held by two agents to identify unknown objects in an image through a set of yes/no questions [9]. Das et al. released the VisDial dataset, which is built upon dialogs consisting of pairs of a question and an answer about an image that are provided in the form of natural language texts [7]. Kottur et al. recently introduced CLEVR-Dialog as the diagnostic dataset for visual dialog [23].

Most of the existing approaches employ an encoder-decoder architecture [39]. They can be categorized into the following three groups by the design of the encoder: i) fusion-based methods, e.g., LF [7] and HRE [7], which fuse the inputs by concatenation followed by a feed-forward or recurrent network, and Synergistic [14], which fuses the inputs at multiple stages; ii) attention-based methods that compute attended features of the input image, question, and history utilities, e.g., MN [7], CoAtt [43], HCIAE [28], Synergistic [14], ReDAN [12], FGA [36], and CDF [19]; ReDAN computes attention over several reasoning steps, and FGA models all the interactions over many utilities via a factor graph; iii) methods that attempt to resolve visual co-reference, e.g., RvA [32] and CorefNMN [22], which use neural modules to form an attention mechanism, DAN [18], which employs a network having two attention modules, and AMEM [37], which utilizes a memory mechanism for attention. As for the decoder, there are two designs: i) discriminative decoders that rank the candidate answers using the cross-entropy loss [7] or the n-pair loss [28]; and ii) generative decoders that yield an answer by using an MLE loss [7], weighted likelihood estimation [50], or a combination with adversarial learning [28, 43], which trains a discriminator on both positive and negative answers and then transfers it to the generator with auxiliary adversarial learning.

Other approaches include GNN [51], which models relations in a dialog by an unknown graph structure; the employment of reinforcement learning [3, 8]; and HACAN [44], which adopts policy gradient to learn the impact of history by intentionally imposing wrong answers on the dialog history. In [30, 42], pretrained vision-language models are adopted, which consist of many Transformer blocks with hundreds of millions of parameters, leading to some performance gain. Qi et al. [34] present model-agnostic principles for visual dialog to maximize performance.

3 Efficient Attention Mechanism for Many Utilities

3.1 Attention Mechanism of Transformer

As mentioned earlier, the Transformer has been applied to several bi-modal vision-language tasks, yielding promising results. The Transformer computes and uses attention from three types of inputs, Q (query), K (key), and V (value). Its computation is given by

$$\begin{aligned} \mathcal{A}(Q,K,V)=\text{ softmax }\left( \frac{Q K^\top }{\sqrt{d}} \right) V, \end{aligned}$$
(1)

where Q, K, and V are all collections of features, each of which is represented by a d-dimensional vector. To be specific, \(Q=[q_1,\ldots ,q_M]^\top \in \mathbb {R}^{M\times d}\) is a collection of M features; similarly, K and V are each a collection of N features, i.e., \(K, V\in \mathbb {R}^{N\times d}\). In Eq. (1), V is attended with the weights computed from the similarity between Q and K.

The above computation is usually multiplexed in what is called multi-head attention, which enables the use of multiple attention distributions in parallel, aiming at an increase in representational power. The outputs of H ‘heads’ are concatenated, followed by a linear transformation with learnable weights \(W^O\in \mathbb {R}^{d\times d}\) as

$$\begin{aligned} \mathcal{A}^{\mathrm {M}}(Q,K,V)=\begin{bmatrix} \mathrm {head}_1,\cdots ,\mathrm {head}_H \end{bmatrix}W^O. \end{aligned}$$
(2)

Each head is computed as follows:

$$\begin{aligned} \mathrm {head}_h = \mathcal{A}(QW_h^Q, KW_h^K, VW_h^V), \;\;h=1,\ldots ,H, \end{aligned}$$
(3)

where \(W_h^Q\), \(W_h^K\), and \(W_h^V \in \mathbb {R}^{d\times d_H}\) are learnable weights, each inducing a linear projection from the d-dimensional feature space to a lower \(d_H(=d/H)\)-dimensional space. Thus, one attentional block \(\mathcal{A}^{\mathrm {M}}(Q,K,V)\) has the following learnable weights:

$$\begin{aligned} (W_1^Q, W_1^K, W_1^V),\cdots ,(W_H^Q, W_H^K, W_H^V)\;\; \text{ and } \;\; W^O. \end{aligned}$$
(4)
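For concreteness, Eqs. (1)–(4) can be sketched in PyTorch as below. This is a minimal sketch of the standard mechanism only; the class and variable names are ours, and masking and dropout are omitted.

```python
import math
import torch
import torch.nn as nn


class MultiHeadAttention(nn.Module):
    """Standard multi-head attention of Eqs. (1)-(4)."""

    def __init__(self, d: int, num_heads: int):
        super().__init__()
        assert d % num_heads == 0
        self.H, self.d_H = num_heads, d // num_heads
        # (W_h^Q, W_h^K, W_h^V) for all H heads, packed into three d x d linear maps
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.w_o = nn.Linear(d, d, bias=False)  # W^O of Eq. (2)

    def forward(self, Q, K, V):
        # Q: (B, M, d); K, V: (B, N, d)
        B, M, d = Q.shape
        N = K.shape[1]

        def split(x, length):  # (B, length, d) -> (B, H, length, d_H)
            return x.view(B, length, self.H, self.d_H).transpose(1, 2)

        q, k, v = split(self.w_q(Q), M), split(self.w_k(K), N), split(self.w_v(V), N)
        att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_H), dim=-1)  # Eq. (1)
        heads = (att @ v).transpose(1, 2).reshape(B, M, d)  # concatenation of heads, Eq. (3)
        return self.w_o(heads)                              # Eq. (2)
```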
Fig. 1. (a) Source-to-target attention for bi-modal problems implemented by the standard Transformer block; the source Y is attended by weights computed from the similarity between the target X and Y. (b) The proposed block that can deal with many utilities; the source features \(\{Y_1,\ldots ,Y_{U-1}\}\) are attended by weights computed between them and the target X. Shaded boxes have learnable weights

3.2 Application to Bi-modal Tasks

While Q, K, and V in NLP tasks are of the same modality (i.e., language), the above mechanism has been extended to bi-modality and applied to vision-language tasks in recent studies [6, 13, 26, 27, 40, 47]. They follow the original idea of the Transformer, considering attention from source features Y to target features X as

$$\begin{aligned} \mathcal{A}_Y(X) = \mathcal{A}^{\mathrm {M}}(X, Y, Y). \end{aligned}$$
(5)

In MCAN [47], the language features are treated as the source and the visual features as the target. In [26] and others [6, 13, 27, 40], co-attention, i.e., attention in both directions, is considered. Self-attention, i.e., the attention from features to themselves, is given as a special case by

$$\begin{aligned} \mathcal{A}_X(X) = \mathcal{A}^{\mathrm {M}}(X, X, X). \end{aligned}$$
(6)

In the above studies, the Transformer block with the source-to-target attention and that with the self-attention are treated independently and are stacked, e.g., alternately or sequentially.
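Building on the sketch above, the source-to-target attention of Eq. (5) and the self-attention of Eq. (6) differ only in how the arguments are passed (a sketch with our own names):

```python
mha = MultiHeadAttention(d=512, num_heads=4)


def source_to_target(X, Y):
    """A_Y(X) = A^M(X, Y, Y): the source Y is attended w.r.t. the target X (Eq. 5)."""
    return mha(X, Y, Y)


def self_attention(X):
    """A_X(X) = A^M(X, X, X): self-attention as a special case (Eq. 6)."""
    return mha(X, X, X)
```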

3.3 Light-Weight Transformer for Many Inputs

Now suppose we wish to extend the above attention mechanism to a greater number of utilities; we denote the number by U. If we consider every possible source-target pair, there are \(U(U-1)\) cases in total, as there are U targets, for each of which \(U-1\) sources exist. Then we need to consider the attention computation \(\mathcal{A}_{Y}(X)\) over \(U-1\) sources Y for each target X. Thus, a straightforward extension of the above attention mechanism to U utilities needs \(U(U-1)\) times the number of parameters listed in Eq. (4). If we stack the blocks, the total number of parameters further increases proportionally.

To cope with this, we remove all the weights from Eq. (5). To be specific, for each head \(h(=1,\ldots ,H)\), we choose and freeze \((W_h^Q, W_h^K, W_h^V)\) as

$$\begin{aligned} W^Q_h=W^K_h=W^V_h = [\underbrace{O_{d_H},\cdots ,O_{d_H}}_{(h-1)d_H},I_{d_H},\underbrace{O_{d_H},\cdots ,O_{d_H}}_{(H-h)d_H}]^\top , \end{aligned}$$
(7)

where \(O_{d_H}\) is the \(d_H\times d_H\) zero matrix and \(I_{d_H}\) is the \(d_H\times d_H\) identity matrix. In short, the subspace for each head is chosen to be one of the H subspaces obtained by splitting the d-dimensional feature space along its axis indices. Besides, we set \(W^O=I\) for the linear mapping applied to the concatenation of the heads’ outputs. Let \(\bar{\mathcal{A}}_Y(X)\) denote this simplified attention mechanism.
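The simplified attention \(\bar{\mathcal{A}}_Y(X)\) admits an implementation with no learnable parameters: instead of multiplying by the frozen block matrices of Eq. (7), one can equivalently split the feature axis into H contiguous chunks. The sketch below (same imports as above) is our reading of the description, not the authors' released code; in particular, the per-head scaling by \(\sqrt{d_H}\) is assumed from applying Eq. (1) head-wise.

```python
def simplified_attention(X, Y, num_heads: int = 4):
    """Parameter-free attention of Sect. 3.3: Eqs. (1)-(3) with the frozen weights of
    Eq. (7) and W^O = I; K = V = Y. Splitting the feature axis into H contiguous
    chunks is equivalent to multiplying by the frozen selection matrices of Eq. (7).
    """
    B, M, d = X.shape
    N, d_H = Y.shape[1], d // num_heads
    q = X.view(B, M, num_heads, d_H).transpose(1, 2)  # (B, H, M, d_H)
    k = Y.view(B, N, num_heads, d_H).transpose(1, 2)  # (B, H, N, d_H)
    # per-head scaling by sqrt(d_H), assumed from applying Eq. (1) head-wise
    att = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d_H), dim=-1)
    return (att @ k).transpose(1, 2).reshape(B, M, d)  # concatenate heads; W^O = I
```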

Now let the utilities be denoted by \(\{X,Y_1,\ldots ,Y_{U-1}\}\), where \(X\in \mathbb {R}^{M\times d}\) is the chosen target and the others \(Y_i\in \mathbb {R}^{N_i\times d}\) are the sources. Then, we compute all the source-to-target attention as \(\bar{\mathcal{A}}_{Y_1}(X),\cdots , \bar{\mathcal{A}}_{Y_{U-1}}(X)\). In the standard Transformer block (or, more precisely, its natural extensions to bi-modal problems), the attended features are simply added to the target as \(X + \mathcal{A}_Y(X)\), followed by normalization and subsequent computations. To recover some of the representational power lost by the simplification yielding \(\bar{\mathcal{A}}_Y(X)\), we propose a different approach to aggregating \(\bar{\mathcal{A}}_{Y_1}(X),\cdots , \bar{\mathcal{A}}_{Y_{U-1}}(X)\) and X. Specifically, we concatenate all the source-to-target attention plus the self-attention \(\bar{\mathcal{A}}_{X}(X)\) from X to X as

$$\begin{aligned} X_{\mathrm {concat}} = [\bar{\mathcal{A}}_{X}(X), \bar{\mathcal{A}}_{Y_1}(X), \cdots ,\bar{\mathcal{A}}_{Y_{U-1}}(X)], \end{aligned}$$
(8)

where \(X_{\mathrm {concat}}\in \mathbb {R}^{M\times Ud}\). We then apply a linear transformation given by \(W\in \mathbb {R}^{Ud\times d}\) and \(b\in \mathbb {R}^d\) (a single fully-connected layer), followed by the addition of the original X and layer normalization as

$$\begin{aligned} \tilde{X} = \mathrm {LayerNorm}( \mathrm {ReLU}(X_{\mathrm {concat}} W +\mathbf {1}_{M} \cdot b^\top ) + X), \end{aligned}$$
(9)

where \(\mathbf {1}_M\) is the M-vector of all ones. With this method, we aim to recover representational power as well as to effectively aggregate information from all the utilities.
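Putting Eqs. (7)–(9) together, one block of LTMI for a target X and sources \(Y_1,\ldots ,Y_{U-1}\) can be sketched as follows (continuing the sketches above; the placement of dropout, mentioned later in Sect. 4.3, is our assumption).

```python
class LTMIBlock(nn.Module):
    """Proposed block (Fig. 1(b)): the target X attends to itself and to all sources,
    and the results are aggregated by concatenation, a single linear layer, a residual
    connection, and layer normalization (Eqs. (8)-(9))."""

    def __init__(self, d: int, num_utilities: int, num_heads: int = 4, p_drop: float = 0.1):
        super().__init__()
        # The only learnable mapping: W in R^{Ud x d} and b in R^d (plus LayerNorm).
        self.linear = nn.Linear(num_utilities * d, d)
        self.norm = nn.LayerNorm(d)
        self.dropout = nn.Dropout(p_drop)  # dropout placement is our assumption (cf. Sect. 4.3)
        self.num_heads = num_heads

    def forward(self, X, sources):
        # X: (B, M, d); sources: list of tensors Y_1, ..., Y_{U-1} of shape (B, N_i, d)
        attended = [simplified_attention(X, X, self.num_heads)]                    # self-attention
        attended += [simplified_attention(X, Y, self.num_heads) for Y in sources]  # sources
        x_concat = torch.cat(attended, dim=-1)                                     # Eq. (8)
        return self.norm(self.dropout(torch.relu(self.linear(x_concat))) + X)      # Eq. (9)
```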

Fig. 2. (a) Simplified symbol of the proposed block shown in Fig. 1(b). (b) Its application to Visual Dialog

3.4 Interactions Between All Utilities

We have designed a basic block (Fig. 1(b)) that deals with attention from many sources to a single target. We wish to consider all possible interactions between all the utilities, not just those toward a single target utility. To do this, we use U basic blocks in parallel, one per target, to cover all the source-to-target attention. Using this basic block as a building block, Fig. 2(b) shows how an architecture is designed for visual dialog, which has three utilities: visual features V, question features Q, and dialog history features R.
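One stack of the architecture in Fig. 2(b) then simply runs U such blocks in parallel, one per target utility (a sketch continuing the ones above; for visual dialog, U = 3).

```python
class LTMIStack(nn.Module):
    """One stack of U parallel blocks: every utility is updated by attending to all
    the utilities, covering all source-to-target interactions (Fig. 2(b))."""

    def __init__(self, d: int, num_utilities: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            [LTMIBlock(d, num_utilities) for _ in range(num_utilities)])

    def forward(self, utilities):
        # utilities: list such as [V, Q, R]; the i-th block treats utilities[i] as its target
        return [block(X, [Y for j, Y in enumerate(utilities) if j != i])
                for i, (block, X) in enumerate(zip(self.blocks, utilities))]
```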

4 Implementation Details for Visual Dialog

4.1 Problem Definition

The problem of Visual Dialog is stated as follows. An agent is given the image of a scene and a dialog history containing T entities: a caption and the question-answer pairs of the preceding \(T-1\) rounds. The agent is then given a new question at round T along with 100 candidate answers for it, and is requested to answer the question by choosing one of the candidate answers or scoring each of them.

4.2 Representation of Utilities

We first extract features from the input image, the dialog history, and the new question at round T to obtain their representations. For this, we follow the standard method employed in many recent studies. For the image utility, we use the bottom-up mechanism [1], which extracts region-level image features using the Faster-RCNN [35] pre-trained on the Visual Genome dataset [24]. For each region (i.e., a bounding box = an object), we combine its CNN feature and geometry to get a d-dimensional vector \(v_i\) (\(i=1,\ldots ,K\)), where K is the predefined number of regions. We then define \(V = [v_1, v_2, \cdots , v_K]^\top \in \mathbb {R}^{K \times d}\). For the question utility, after embedding each word with an embedding layer initialized by pretrained GloVe vectors, we use a two-layer Bi-LSTM to transform them to \(q_i\) \((i=1,\ldots ,N)\), where N is the number of words in the question. We optionally use the positional embedding widely used in NLP studies; we examine its effects in an ablation test. We then define \(Q = [q_1,\ldots ,q_N]^\top \in \mathbb {R}^{N \times d}\). For the dialog history, we choose to represent it as a single utility here; each of its entities is the initial caption or the question-answer pair of one round. As with the question utility, we use the same embedding layer and a two-layer Bi-LSTM, together with positional embeddings for the order of dialog rounds, to encode them, with a slight difference in the formation of an entity vector \(r_i\) (\(i=1,\ldots ,T)\), where T is the number of question-answer pairs plus one for the caption. We then define \(R = [r_1,\ldots ,r_T]^\top \in \mathbb {R}^{T \times d}\). More details are provided in the supplementary material.
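As an illustration, the question utility can be computed roughly as below. This is a sketch under our own assumptions (a hidden size of d/2 per direction so that the Bi-LSTM output is d-dimensional, and omission of the optional positional embedding); the exact formulation is in the authors' supplementary material.

```python
class QuestionEncoder(nn.Module):
    """GloVe embedding + two-layer Bi-LSTM producing Q in R^{N x d} (Sect. 4.2)."""

    def __init__(self, vocab_size: int, d: int = 512, glove_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, glove_dim)  # initialized from GloVe in practice
        # hidden size d/2 per direction so that the Bi-LSTM output is d-dimensional (assumed)
        self.lstm = nn.LSTM(glove_dim, d // 2, num_layers=2,
                            batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (B, N) word indices of the question, padded/truncated to N = 20
        q, _ = self.lstm(self.embed(token_ids))  # (B, N, d)
        return q                                 # rows are q_1, ..., q_N
```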

Fig. 3. The entire network built upon the proposed LTMI for Visual Dialog

4.3 Overall Network Design

Figure 3 shows the entire network. It consists of an encoder and a decoder. The encoder consists of L stacks of the proposed attention blocks; a single stack has U blocks in parallel, as shown in Fig. 2(b). We set \(V_0 = V\), \(Q_0 = Q\), and \(R_0 = R\) as the inputs of the first stack. After the l-th stack, the representations of the image, question, and dialog history utilities are updated as \(V_l\), \(Q_l\), and \(R_l\), respectively. In the experiments, we apply dropout with a rate of 0.1 to the linear layer inside every block. One or more decoders are placed on top of the encoder; we consider a discriminative decoder and a generative decoder, as in previous studies. Their designs are explained below.
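The encoder is then L such stacks applied sequentially (a sketch continuing the ones above).

```python
class LTMIEncoder(nn.Module):
    """L stacks of parallel LTMI blocks (Fig. 3, Sect. 4.3)."""

    def __init__(self, d: int = 512, num_stacks: int = 2, num_utilities: int = 3):
        super().__init__()
        self.stacks = nn.ModuleList(
            [LTMIStack(d, num_utilities) for _ in range(num_stacks)])

    def forward(self, V, Q, R):
        utilities = [V, Q, R]             # V_0, Q_0, R_0
        for stack in self.stacks:
            utilities = stack(utilities)  # V_l, Q_l, R_l
        return utilities                  # V_L, Q_L, R_L
```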

4.4 Design of Decoders

The decoders receive the updated utility representations \(V_L\), \(Q_L\), and \(R_L\) as their inputs. We convert them independently into d-dimensional vectors \(c_V\), \(c_Q\), and \(c_R\), respectively. This conversion is performed by a simple self-attention computation; we take \(c_V\) as an example here. First, attention weights over the entities of \(V_L\) are computed by a two-layer network as

$$\begin{aligned} a_V = \mathrm {softmax}(\mathrm {ReLU}(V_LW_1 + \mathbf {1}_Kb_1^\top )W_2 + \mathbf {1}_Kb_2), \end{aligned}$$
(10)

where \(W_1 \in \mathbb {R}^{d\times d}\), \(W_2 \in \mathbb {R}^{d\times 1}\), \(b_1\in \mathbb {R}^d\), \(b_2 \in \mathbb {R}^1\), and \(\mathbf {1}_K\) is the K-vector of all ones. Then, \(c_V\) is given by

$$\begin{aligned} c_V = \sum _{i = 1}^{K}v_{L,i}^\top a_{V,i}, \end{aligned}$$
(11)

where \(v_{L,i}\) is the i-th row vector of \(V_L\) and \(a_{V,i}\) is the i-th attention weight (a scalar). The others, i.e., \(c_Q\) and \(c_R\), can be obtained similarly.
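Equations (10)–(11) amount to a small learned attention pooling; a sketch (ours) follows.

```python
class UtilityPooling(nn.Module):
    """Two-layer attention pooling that turns V_L in R^{K x d} into c_V in R^d (Eqs. 10-11)."""

    def __init__(self, d: int):
        super().__init__()
        self.w1 = nn.Linear(d, d)  # W_1, b_1
        self.w2 = nn.Linear(d, 1)  # W_2, b_2

    def forward(self, X):
        # X: (B, K, d), the entities of one utility
        a = torch.softmax(self.w2(torch.relu(self.w1(X))), dim=1)  # (B, K, 1), Eq. (10)
        return (a * X).sum(dim=1)                                  # (B, d),    Eq. (11)
```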

These vectors are integrated and used by the decoders. In our implementation for visual dialog, we found that \(c_R\) does not contribute to better results; thus we use only \(c_V\) and \(c_Q\). Note that this does not mean the dialog history utility R is unnecessary; it interacts with the other utilities inside the attention computation, contributing to the final prediction. The two d-vectors \(c_V\) and \(c_Q\) are concatenated as \([c_V^\top , c_Q^\top ]^\top \) and projected to the d-dimensional space, yielding a context vector \(c\in \mathbb {R}^d\).

We design the discriminative and generative decoders following the previous studies. Receiving c and the candidate answers, the two decoders compute the score of each candidate answer in different ways. See details in the supplementary material.

4.5 Multi-task Learning

We observe in our experiments that accuracy is improved by training the entire network using the two decoders simultaneously. This is simply done by minimizing the sum of the losses, \(\mathcal {L}_D\) for the discriminative one and \(\mathcal {L}_G\) for the generative one (we do not use weights on the losses):

$$\begin{aligned} \mathcal {L} = \mathcal {L}_D + \mathcal {L}_G. \end{aligned}$$
(12)

The increase in performance may be attributable to the synergy of learning two tasks while sharing the same encoder. Details will be given in Sect. 5.3.
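A sketch of one training step under this multi-task scheme follows (the decoder interfaces and loss names are placeholders of ours; the actual decoders are described in the supplementary material).

```python
def training_step(encoder, disc_decoder, gen_decoder, batch, optimizer):
    """One step of the multi-task training of Sect. 4.5 (unweighted sum of the two losses)."""
    V_L, Q_L, R_L = encoder(batch["V"], batch["Q"], batch["R"])
    loss_d = disc_decoder.loss(V_L, Q_L, R_L, batch)  # cross-entropy over the 100 candidates
    loss_g = gen_decoder.loss(V_L, Q_L, R_L, batch)   # MLE loss of the ground-truth answer
    loss = loss_d + loss_g                            # Eq. (12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```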

5 Experimental Results

5.1 Experimental Setup

Dataset. We use the VisDial v1.0 dataset in our experiments, which consists of the train v1.0 split (123,287 images), the val v1.0 split (2,064 images), and the test v1.0 split (8,000 images). Each image has a dialog composed of 10 question-answer pairs along with a caption. For each question-answer pair, 100 candidate answers are given. The val v1.0 split and 2,000 images of the train v1.0 split are provided with dense annotations (i.e., relevance scores) for all candidate answers. Although the test v1.0 split was also densely annotated, the ground-truth answers and the dense annotations are not publicly available. Additionally, we evaluate the method on the Audio Visual Scene-aware Dialog dataset [15]; the results are shown in the supplementary material.

Evaluation Metrics. Since the Visual Dialog Challenge 2018, normalized discounted cumulative gain (NDCG) has been used as the principal metric to evaluate methods on the VisDial v1.0 dataset. Unlike classical retrieval metrics such as R@1, R@5, R@10, mean reciprocal rank (MRR), and mean rank, which are based on a single ground-truth answer, NDCG is computed from the relevance scores of all candidate answers for each question. It thus properly handles the case where a question has more than one correct answer, such as ‘yes it is’ and ‘yes’; such cases occur frequently.
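For reference, the NDCG used here can be computed as sketched below. This is our own reading of the metric (DCG over the top-K ranked candidates, with K the number of candidates having non-zero relevance, normalized by the ideal DCG), not code from the challenge organizers.

```python
import numpy as np


def visdial_ndcg(ranked_relevances: np.ndarray) -> float:
    """NDCG of one question, given the relevance scores of the 100 candidates ordered
    by the predicted ranking (best first)."""
    k = int((ranked_relevances > 0).sum())          # number of relevant candidates
    if k == 0:
        return 0.0
    discounts = 1.0 / np.log2(np.arange(2, k + 2))  # 1 / log2(rank + 1) for ranks 1..k
    dcg = float((ranked_relevances[:k] * discounts).sum())
    ideal_dcg = float((np.sort(ranked_relevances)[::-1][:k] * discounts).sum())
    return dcg / ideal_dcg
```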

Other Configurations. We employ the standard method used by many recent studies to determine hyperparameters and other configurations. For the visual features, we detect \(K=100\) objects from each image. For the question and history features, we first build a vocabulary of 11,322 words that appear at least five times in the training split. The captions, questions, and answers are truncated or padded to 40, 20, and 20 words, respectively. Thus, \(N=20\) for the question utility Q; T for the history utility varies with the number of dialog rounds. We use pre-trained 300-dimensional GloVe vectors [33] to initialize the embedding layer, which is shared for all the captions, questions, and answers.

For the attention blocks, we set the dimension of the feature space to \(d=512\) and the number of heads H in each attention block to 4. We mainly use models having two stacks of the proposed attention block. We train our models on the VisDial v0.9 and VisDial v1.0 datasets using the Adam optimizer [21] for 5 and 15 epochs, respectively. The learning rate is warmed up from \(1\times 10^{-5}\) to \(1\times 10^{-3}\) in the first epoch and then halved every 2 epochs. The batch size is set to 32 for both datasets.
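The schedule described above can be reproduced roughly as follows (a sketch; whether the warmup is linear and at which epochs the halving is applied are our assumptions).

```python
def learning_rate(epoch: int, step: int, steps_per_epoch: int) -> float:
    """Warm up from 1e-5 to 1e-3 during the first epoch, then halve every 2 epochs."""
    base, peak = 1e-5, 1e-3
    if epoch == 0:
        # linear warmup over the first epoch (the warmup shape is assumed to be linear)
        return base + (peak - base) * step / max(1, steps_per_epoch - 1)
    return peak * 0.5 ** ((epoch - 1) // 2)
```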

5.2 Comparison with State-of-the-Art Methods

Compared Methods. We compare our method with previously published methods on the VisDial v0.9 and VisDial v1.0 datasets, including LF, HRE, MN [7], LF-Att, MN-Att (with attention) [7], SAN [45], AMEM [37], SF [17], HCIAE [28] and Sequential CoAttention model (CoAtt) [43], Synergistic [14], FGA [36], GNN [51], RvA [32], CorefNMN [22], DAN [18], and ReDAN [12], all of which were trained without using external datasets or data imposition. Unless noted otherwise, the results of our models are obtained from the output of discriminative decoders.

Table 1. Comparison of the performances of different methods on the validation set of VisDial v1.0 with discriminative and generative decoders.

Results on the val v1.0 Split. We first compare single-model performance on the val v1.0 split. We select MN, CoAtt, HCIAE, and ReDAN for comparison, as their performances with both decoders on all metrics are available in the literature. To be specific, we use the accuracy values reported in [12] for a fair comparison, in which these methods are reimplemented using the bottom-up-attention features. Similarly to ours, all these methods employ the standard design of discriminative and generative decoders as in [7]. Table 1 shows the results. Our method outperforms all the compared methods on the NDCG metric by large margins regardless of the decoder type. Specifically, compared with ReDAN, the current state of the art on the VisDial v1.0 dataset, our model improves NDCG from 59.32 to 62.72 with the discriminative decoder and from 60.47 to 63.58 with the generative decoder.

Results on the Test-Standard v1.0 Split. We next consider performance on the test-standard v1.0 split. In our experiments, we encountered a phenomenon in which the accuracy measured by NDCG and that measured by the other metrics show a trade-off relation (see the supplementary material for details), depending strongly on which metric (NDCG or the others) is used to judge convergence at training time. This is also observed in the results reported in [12] and is attributable to the inconsistency between the two types of metrics. Thus, we show two results here: one obtained using NDCG to judge convergence and one using MRR; the latter is equivalent to performing early stopping.

Table 2(a) shows single-model performance on the blind test-standard v1.0 split. With the outputs from the discriminative decoder, our model improves NDCG by 3.33pp over the best existing model. When employing the aforementioned early stopping, our model achieves comparable or better performance on the other metrics as well.

Table 2. Comparison in terms of (a) single- and (b) ensemble-model performance on the blind test-standard v1.0 split of the VisDial v1.0 dataset and in terms of (c) the number of parameters of the attention mechanism. The result obtained by early stopping on MRR metric is denoted by \(\star \) and those with fine-tuning on dense annotations are denoted by \(\dagger \).

Many previous studies report the performance of an ensemble of multiple models. To make a comparison, we create an ensemble of 16 models that differ in several respects: initialization with different random seeds, whether or not weights are shared across attention blocks, the number of stacks of attention blocks (i.e., L = 2, 3), and the number of objects in the image (i.e., K = 50, 100). Aiming at the best performance, we also enrich the image features by incorporating the class label and attributes of each object in an image, which are also obtained from the pretrained Faster-RCNN model. Details are given in the supplementary material. We take the average of the outputs (probability distributions) from the discriminative decoders of these models to rank the candidate answers. Furthermore, we also test fine-tuning each model with its discriminative decoder on the available dense annotations from the train v1.0 and val v1.0 splits, where the cross-entropy loss with soft labels (i.e., relevance scores) is minimized for two epochs. Table 2(b) shows the results. Our ensemble model (without the fine-tuning) achieves the best NDCG of 66.53 among all the ensemble models.
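The ensembling and the optional fine-tuning amount to the following two operations (a sketch; how the relevance scores are normalized into a target distribution is our assumption).

```python
def ensemble_scores(per_model_probs):
    """Average the candidate-answer distributions of the ensemble members (Sect. 5.2)."""
    return torch.stack(per_model_probs, dim=0).mean(dim=0)  # (B, 100)


def soft_label_cross_entropy(logits, relevance):
    """Cross-entropy with soft labels (relevance scores) for the optional fine-tuning."""
    target = relevance / relevance.sum(dim=-1, keepdim=True).clamp(min=1e-8)  # normalization assumed
    return -(target * torch.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```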

With the optional fine-tuning, our ensemble model further gains a large improvement in NDCG, placing third on the leaderboard. The gap in NDCG to the first place (VD-BERT) is only 0.25pp, while our model performs better on all the other metrics, i.e., by 2.14pp, 5.67pp, and 3.37pp in MRR, R@5, and R@10, respectively, with a 5.36% reduction in Mean.

Table 2(c) shows the number of parameters of the multi-modal attention mechanism employed in the recent methods, along with their NDCG scores on the VisDial v1.0 test-standard split. We exclude the parameters of the networks computing the input utilities and of the decoders, as they are basically shared among these methods. ‘Naive Transformer’ consists of two stacks of Transformer blocks with the simple extension to three utilities mentioned in Sect. 1. The efficiency of our models can be observed. Note also that the gap between (Q, V) and (Q, V, R) is small, contrary to the argument in [34].

Table 3. Ablation study on the components of our method on the val v1.0 split of VisDial dataset. \(\uparrow \) indicates the higher the better.

5.3 Ablation Study

To evaluate the effect of each component of our method, we perform an ablation study on the val v1.0 split of the VisDial dataset. We evaluate the accuracy of the discriminative decoder and the generative decoder separately. We denote the former by D-NDCG, the latter by G-NDCG, and the accuracy of their averaged model by A-NDCG (i.e., averaging the probability distributions over the candidate answers obtained by the discriminative and generative decoders). The results are shown in Table 3(a–b).

The first block of Table 3(a) shows the effect of the number of stacks of the proposed attention blocks. The use of two or three stacks achieves good performance on all three measures. More stacks did not bring further improvement, and their results are thus omitted from the table.

The second block of Table 3(a) shows the effect of self-attention, which computes the interaction within a utility, i.e., \({\bar{\mathcal{A}}}_X(X)\). We examine this because it can be removed from the attention computation. It is seen that self-attention does contribute to good performance. The third block shows the effect of how the attended features are aggregated; their concatenation yields better performance than their simple addition. The fourth block shows the impact of sharing the weights across the stacks of attention blocks. If the weights are shared as in [25], the number of parameters decreases further. We observe that the performance does drop with weight sharing, but not by a large margin.

The first block of Table 3(b) shows the effect of how the context features \(c_V\), \(c_Q\), and \(c_R\), obtained from the outputs of our encoder, are aggregated in the decoder(s). As mentioned above, the context vector \(c_R\) of the dialog history does not contribute to the performance, whereas the context vector \(c_V\) of the image is important for achieving the best performance. The second block of Table 3(b) shows the effect of training both decoders simultaneously (with the entire model). This contributes greatly to the performance, indicating the synergy of learning two tasks while sharing the encoder, which results in better generalization compared with models trained with a single decoder.

We have also confirmed that using fewer objects leads to worse results. Besides, the positional embedding used to represent the question and history utilities, as well as the spatial embedding (i.e., the bounding-box geometry of objects) used for the image utility, each make a certain contribution.

Fig. 4. Examples of visualization of the attention weights generated by our model at two Q&A rounds on two images. See Sect. 5.4 for details.

5.4 Visualization of Generated Attention

Figure 4 shows the attention weights generated by our model for two rounds of Q&A on two images. We show two types of attention here. One is the self-attention weights used to compute the context vectors \(c_V\) and \(c_Q\). For \(c_V\), the attention weights \(a_{V}\) are generated over image regions (i.e., bounding boxes), as in Eq. (10). Similarly, for \(c_Q\), the attention weights are generated over question words. These two sets of attention weights are displayed by the brightness of the image bounding boxes and the darkness of the question words in the center and rightmost columns, respectively. It can be observed that the relevant regions and words are properly highlighted at each Q&A round.

The other attention we visualize is the source-to-target attention computed inside the proposed block. We choose the image-to-question attention \(\bar{\mathcal{A}}_V(Q)\) and the history-to-question attention \(\bar{\mathcal{A}}_R(Q)\). For each, we compute the average of the attention weights over all the heads inside the block belonging to the upper stack. In Fig. 4, the former is displayed by the red boxes connecting an image region and a question word; only the region with the largest weight is shown, and the word with the largest self-attention weight is chosen as the target. The history-to-question attention is displayed by the Q&As highlighted in blue that are connected to a selected question word that is semantically ambiguous, e.g., ‘its’, ‘he’, and ‘his’. It is seen that the model performs proper visual grounding for the important words, ‘hair’, ‘shorts’, and ‘tusks’. It is also observed that the model properly resolves the co-reference for the words ‘he’ and ‘its’.

6 Summary and Conclusion

In this paper, we have proposed LTMI (Light-weight Transformer for Many Inputs) that can deal with all the interactions between multiple input utilities in an efficient way. As compared with other methods, the proposed architecture is much simpler in terms of the number of parameters as well as the way of handling inputs (i.e., their equal treatment), and nevertheless surpasses the previous methods in accuracy; it achieves the new state-of-the-art results on the VisDial datasets, e.g., high NDCG scores on the VisDial v1.0 dataset. Thus, we believe our method can be used as a simple yet strong baseline.