1 Introduction

Visual Question Answering (VQA) is a typical multimodal task that answers a given question based on image understanding (Antol et al., 2015). Recently, large-scale pretrained Vision-Language Models (VLMs) (Wang et al., 2023; Zeng et al., 2022; Wang et al., 2022; Yu et al., 2022; Wang et al., 2022; Li et al., 2022; Yuan et al., 2021; Wang et al., 2021) have advanced VQA performance to the level of the human oracle. However, finetuning such pretrained VLMs on limited data for the downstream VQA task often leads to overfitting and poor generalization, so the gains in robustness that pretrained VLMs offer lag behind their gains in accuracy.

In this paper, we investigate how to effectively improve input robustness when adapting pretrained VLMs to a downstream VQA task. Input robustness in VQA refers to the ability of models to defend against visual variations (such as question-related object removal in images (Agarwal et al., 2020)), linguistic variations (such as word substitution and sentence rephrasing in questions (Shah et al., 2019)), and multimodal shortcut learning involved in input images and questions (Dancette et al., 2021). Practically, during finetuning, VQA is usually formulated as a multi-answer classification problem or a text generation problem, where pretrained multimodal transformers serve as knowledge-rich extractors of vision-language representations for answer prediction. As such, improving the input robustness of models essentially amounts to obtaining more compact and task-related representations.

To this end, we propose to improve input robustness from an information-theoretic perspective. The representations yielded by pretrained VLMs inevitably contain information that is irrelevant or redundant for the specific downstream task, which is one possible reason for poor robustness: irrelevant information encourages models to learn statistically spurious correlations between representations and labels, while task-agnostic redundant information increases the sensitivity of models to input variations. Both factors therefore compromise the input robustness of the model. To obtain more robust and compact representations, we expect pretrained VLMs, when adapted to VQA, to discard irrelevant and redundant information in their representations while preserving task-relevant information. The information bottleneck principle (Tishby et al., 2000) is adept at seeking a tradeoff between representation compression and redundancy. Motivated by this insight, we explore how to generalize the information bottleneck to find a minimal sufficient statistic for the learned representations, thereby improving the input robustness of VQA models.

We propose Correlation Information Bottleneck (CIB) to enhance input robustness when adapting pretrained VLMs to the downstream VQA task. Overall, by minimizing mutual information (MI) between representations and inputs while maximizing MI between representations and outputs, CIB seeks an optimal tradeoff between compression and redundancy in the representations learned by pretrained VLMs, enabling representations to converge to a minimal sufficient statistic. In detail, to accurately estimate the MI between multimodal inputs and representations, we derive a tight upper bound for the symmetrized joint MI, which measures different internal correlations rather than the overall dependency between different modalities. More specifically, the upper bound incorporates correlations between single-modal input and representation, as well as the correlation between visual and linguistic representations, guiding VQA models to learn more robust representations and better capture actual relationships. In particular, the multimodal representation correlation can facilitate modality alignment. Moreover, to ensure applicability to different transformer architectures, i.e., single-stream encoder, two-stream encoder, and encoder-decoder, we unify the internal representations of different pretrained VLMs using the representations after visual and linguistic embedding layers for CIB estimation.

To demonstrate the proposed CIB, we first provide rigorous theoretical proofs. Subsequently, using CIB as the training objective, we finetune pretrained VLMs including VisualBERT (Li et al., 2019), ViLBERT (Lu et al., 2019), VL-BERT\(_{\text {B}}\) (Su et al., 2020), VL-T5 (Cho et al., 2021), LXMERT (Tan & Bansal, 2019), UNITER\(_{\text {B}}\) (Chen et al., 2020), ALBEF (Li et al., 2021), mPLUG\(_{\text {B}}\) (Li et al., 2022), and BEiT-3\(_{\text {B}}\) (Wang et al., 2023) under a standard and clean data setting, and evaluate input robustness on five robustness benchmark datasets: VQA-Rephrasings (Shah et al., 2019), VQA P2 (Whitehead et al., 2020), IV-VQA (Agarwal et al., 2020), CV-VQA (Agarwal et al., 2020), and VQA-CE (Dancette et al., 2021). Extensive experiments and analyses consistently demonstrate that CIB significantly improves input robustness and exhibits advantages over existing methods when adapting pretrained VLMs to the downstream VQA task.

In summary, our main contributions are as follows: (i) We propose Correlation Information Bottleneck (CIB), a generic objective that encourages representations to converge to a minimal sufficient statistic and enhances input robustness when adapting pretrained VLMs to VQA. (ii) We derive a tight upper bound for the MI between multimodal inputs and representations, incorporating different internal correlations that guide models to learn more robust representations and facilitate modality alignment. (iii) Theoretical proofs and extensive experiments demonstrate the robustness, superiority, and generalizability of our CIB.

The remainder of the paper is organized as follows: Sect. 2 introduces related literature on robustness in VQA, information bottleneck, and vision-language models. Section 3 elaborates on CIB, the application of CIB in adapting pretrained VLMs to VQA, and the theoretical analysis of input robustness for CIB. In Sect. 4, we conduct comprehensive experiments and discussions to demonstrate the effectiveness and superiority of CIB in terms of robustness and accuracy. In Appendix A, we provide a theoretical derivation for CIB and proofs for some proposed theorems.

2 Related Work

2.1 Robustness in VQA

Recently, to promote practical applications, numerous studies have investigated various aspects of VQA robustness, such as input robustness (Shah et al., 2019; Whitehead et al., 2020; Agarwal et al., 2020; Kant et al., 2021), human-adversarial robustness (Li et al., 2021; Sheng et al., 2021), and robustness against answer distribution shift (Agrawal et al., 2022; Pan et al., 2022; Kervadec et al., 2021; Jiang et al., 2021; Teney et al., 2020; Clark et al., 2019; Goyal et al., 2017). In this paper, we explore input robustness, which refers to the capability of VQA models to defend against visual and linguistic variations, such as rephrasing questions (Shah et al., 2019; Whitehead et al., 2020), manipulating images (Agarwal et al., 2020), and shortcut learning involved in multimodal inputs (Dancette et al., 2021). The prevailing method to improve input robustness is data augmentation, i.e., generating additional data to train more robust VQA models. While data augmentation is a feasible and effective solution, the quality of the generated data is uncontrollable (e.g., limited expressiveness and excessive verbosity), and manually generating such data is time-consuming. Moreover, cycle-consistency between the original question and its rephrasings (Shah et al., 2019), contrastive learning (Kant et al., 2021), and adversarial training (Li et al., 2020) have also been introduced to improve input robustness. These recent studies demonstrate that state-of-the-art VQA models remain vulnerable to input variation attacks. Therefore, in this paper, we focus on further improving the input robustness of existing VQA models.

2.2 Information Bottleneck

The Information Bottleneck (IB) principle was originally proposed by Tishby et al. (2000) for information compression, and was later applied to analyze deep learning model architectures (Tishby & Zaslavsky, 2015; Shwartz-Ziv & Tishby, 2017). Essentially, the IB objective is to seek a tradeoff between maximizing predictive accuracy and minimizing representation complexity. Some recent research targets exploiting the IB principle to improve model robustness and generalization, especially in domain generalization (Du et al., 2020; Li et al., 2022), out-of-distribution generalization (Ahuja et al., 2021), multiview representation learning (Federici et al., 2020; Bao, 2021), and finetuning of pretrained language models (Mahabadi et al., 2021; Wang et al., 2021; Dong et al., 2021). In addition, some works (Wang et al., 2022; Zhou et al., 2022; Pan et al., 2021; Jeon et al., 2021; Dubois et al., 2020) aim to learn disentangled optimal representations from an IB perspective. Since IB can facilitate compact and meaningful representation learning, we extend it to multimodal learning and apply IB to obtain robust VQA models.

2.3 Vision-Language Models

Vision-Language pretraining aims to learn task-agnostic visiolinguistic representations for improving the performance of downstream tasks in a finetuning fashion (Huang et al., 2020; Zhou et al., 2020; Shi et al., 2020; Li et al., 2021; Kim et al., 2021; Sun et al., 2021; Huang et al., 2021; Dou et al., 2022; Zhong et al., 2022; Alayrac et al., 2022; Xu et al., 2023). From the perspective of model architecture, prevailing pretrained vision-language models (VLMs) can be roughly grouped into three types: single-stream encoder (Su et al., 2020; Chen et al., 2020; Gan et al., 2020; Li et al., 2020; Zhang et al., 2021; Kim et al., 2021), two-stream encoder (Lu et al., 2019; Tan & Bansal, 2019; Lu et al., 2020; Yu et al., 2021; Li et al., 2021), and encoder-decoder (Cho et al., 2021; Li et al., 2021; Zeng et al., 2022; Li et al., 2022; Wang et al., 2022; Li et al., 2022). Specifically, single-stream models first align image regions and text tokens and then apply a single transformer (Vaswani et al., 2017) to learn contextualized representations. Two-stream models first utilize two separate transformers to learn high-level representations for images and texts, and then integrate the two modalities with a cross-modal transformer. Encoder-decoder models use the encoder to learn multimodal representations and the decoder to generate task-related text for specific downstream tasks. In this paper, we unify the three typical types of VLMs and propose CIB to improve input robustness when adapting these pretrained VLMs for the downstream VQA task.

3 Methodology

In this section, we first present the preliminaries of the problem setting and the general IB principle. Then, we elaborate on the proposed CIB in Sect. 3.2 and explain how to apply CIB to improve input robustness when adapting pretrained VLMs to VQA in Sect. 3.3.

3.1 Preliminary

Problem Setting. In the finetuning process, single-stream and two-stream VLMs usually formulate the VQA task as a multi-answer classification problem (Chen et al., 2020; Tan & Bansal, 2019), while encoder-decoder VLMs often regard VQA as text generation (Cho et al., 2021; Wang et al., 2022), i.e., generating free-form textual answers for a given question instead of selecting a specific one from the predefined set of answers. Given a VQA dataset \({\mathcal {D}}=\{(I, Q, y)\in {\mathcal {I}}\times {\mathcal {Q}}\times {\mathcal {Y}} \}\), where I is an image, Q is a question, and y is an answer, VLMs take image-question pairs as input, where the image is further represented as a set of image regions or patches \(\{v_1, \dots , v_K\}\) (K is the number of regions or patches in one image) and the question is tokenized as a token sequence \(\{w_1, \dots , w_L\}\) (L is the number of word tokens in a question). Single-stream and two-stream VLMs output the answer probability distribution Y using an additional VQA Head module, implemented as two fully-connected layers with a GeLU activation and Layer Normalization in between. In contrast, encoder-decoder VLMs directly generate textual answers without any additional module.
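For concreteness, the following is a minimal PyTorch sketch of such a VQA Head; the hidden width, the answer-vocabulary size, and the module name are illustrative assumptions rather than the exact configuration of any particular VLM.

```python
import torch
import torch.nn as nn

class VQAHead(nn.Module):
    """Two fully-connected layers with a GeLU activation and LayerNorm in between,
    mapping the pooled multimodal representation to answer logits."""

    def __init__(self, hidden_dim: int = 768, num_answers: int = 3129):
        super().__init__()
        self.fc1 = nn.Linear(hidden_dim, hidden_dim * 2)   # hidden width is an assumption
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(hidden_dim * 2)
        self.fc2 = nn.Linear(hidden_dim * 2, num_answers)  # 3129 is a common VQA v2 answer-set size

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_dim) fused vision-language representation
        return self.fc2(self.norm(self.act(self.fc1(pooled))))
```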

IB View of Representation Learning. From an information-theoretic perspective, seeking a robust representation T in representation learning is equivalent to preserving information about the output Y while removing irrelevant and redundant information from the input X. This is because for a given task, irrelevant and redundant information may encourage models to learn superfluous correlations between answer labels and inputs. Formally, the IB principle (Tishby et al., 2000; Tishby & Zaslavsky, 2015) formulates representation learning as an information tradeoff and finds an optimal representation by maximizing the Lagrangian

$$\begin{aligned} {\mathscr {L}}_{\text {IB}} := I(Y; T) - \beta I(X; T), \end{aligned}$$
(1)

where \(\beta \ge 0\) controls the tradeoff between compression and prediction, and \(I(\cdot ; \cdot )\) denotes mutual information (MI).

3.2 Correlation Information Bottleneck

In vision-language representation learning, given two modality inputs \(X^v\) and \(X^l\), VLMs learn the corresponding visual and linguistic representations \(T^v\) and \(T^l\) at some intermediate transformer layers while simultaneously maximizing the MI between the obtained representations and a given label Y to guarantee that the representations contain sufficient information for predicting Y. To extend the general IB principle to the multimodal setting, we first consider the inputs and internal representations as a whole, i.e., \(X=[X^v, X^l]\) and \(T=[T^v, T^l]\), respectively, and then derive a differentiable estimate of IB by expanding the MI terms in Eq. (1).

Specifically, we first focus on \(I(Y; T)\), which can be rewritten using the definition of conditional probability:

$$\begin{aligned} I(Y; T) = \int p(y, t) \log \frac{p(y|t)}{p(y)} \, dy\, dt . \end{aligned}$$
(2)

Since the conditional probability p(y|t) is intractable, we instead estimate \(I(Y; T)\) with the BA (Barber & Agakov, 2003) lower bound:

$$\begin{aligned} I(Y; T) \ge \int p(y, t) \log q(y|t) \, dy\, dt - \int p(y) \log p(y) \, dy, \end{aligned}$$
(3)

where q(y|t) is an accessible auxiliary distribution for p(y|t) and \(- \int p(y) \log p(y) dy = H(Y)\) is the entropy of the labels, which is independent of the optimization procedure in finetuning. Ignoring H(Y), the remaining term of the lower bound in Eq. (3) corresponds to \(-H(Y|T)\), meaning that maximizing the lower bound of \(I(Y; T)\) is equivalent to minimizing the cross-entropy loss of a specific task. In other words, when using IB as the training objective, maximizing \(I(Y; T)\) is equivalent to minimizing the VQA loss \({\mathscr {L}}_{\text {vqa}}\).
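A minimal sketch of this task-related term for the classification-style formulation, where the targets are soft VQA answer scores; generation-style VLMs instead use a token-level cross-entropy over the answer text. The scaling by the answer-vocabulary size is a convention used in some VQA codebases, not a requirement of the bound.

```python
import torch
import torch.nn.functional as F

def vqa_loss(answer_logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Classification-style VQA loss; minimizing it maximizes the BA lower
    bound of I(Y; T) up to the constant H(Y).

    answer_logits: (batch, num_answers) scores from the VQA Head.
    soft_targets:  (batch, num_answers) per-answer VQA scores in [0, 1].
    """
    bce = F.binary_cross_entropy_with_logits(answer_logits, soft_targets, reduction="mean")
    return bce * soft_targets.size(1)  # optional scaling by the answer-vocabulary size
```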

Next, we consider the mutual information between the input sources and their corresponding representations, that is, the term \(I(X; T)\) in Eq. (1). To accurately estimate \(I(X; T)\), instead of directly measuring the overall dependency between X and T (i.e., regarding \(X^v\) and \(X^l\) as a single variable X, and \(T^v\) and \(T^l\) as a single variable T), we expand \(I(X; T)\) to \(I(X^v, X^l; T^v, T^l)\) and derive a tight upper bound for it. \(I(X^v, X^l; T^v, T^l)\) incorporates different internal correlations, namely the correlation between the visual input \(X^v\) and its representation \(T^v\), the correlation between the linguistic input \(X^l\) and its representation \(T^l\), and the correlation between the visual and linguistic representations \(T^v\) and \(T^l\). These correlations can guide models to learn more compact visual and linguistic representations and facilitate modality alignment between visual and linguistic representations. Therefore, we propose to maximize the Correlation Information Bottleneck (CIB) objective:

$$\begin{aligned} {\mathscr {L}}_{\text {CIB}} := I(Y; T) - \beta I(X^v, X^l; T^v, T^l), \end{aligned}$$
(4)

where \(I(X^v, X^l; T^v, T^l)\) is a symmetrized variant of the joint mutual information (Bennasar et al., 2015) that considers the internal correlations between \(X=[X^v, X^l]\) and \(T=[T^v, T^l]\). To efficiently estimate \(I(X^v, X^l; T^v, T^l)\), we further expand it using the properties of mutual information and the data processing inequality in representation learning (Federici et al., 2020). The derivation is formally stated in Theorem 1 (cf. Appendix A for the proof):

Fig. 1 The information flow of the three typical transformer architectures of VLMs

Theorem 1

(Upper Bound of \(I(X^v, X^l; T^v, T^l)\)) Given two groups of random variables \(X=[X^v, X^l]\) and \(T=[T^v, T^l]\), the MI \(I(X^v, X^l; T^v, T^l)\) can be upper-bounded with

$$\begin{aligned} I(X; T)&= I(X^v, X^l; T^v, T^l) \\&\le I(X^v; T^v) + I(X^l; T^l) - I(T^v; T^l) + D_{\text {skl}}, \end{aligned}$$
(5)

where \(D_{\text {skl}}\) denotes the symmetric Kullback–Leibler (KL) divergence that can be calculated by averaging the divergences \(\text {KL}(p(t^v|x^v)||p(t^l|x^l))\) and \(\text {KL}(p(t^l|x^l)||p(t^v|x^v))\).

After approximating the MI \(I(X^v, X^l; T^v, T^l)\), the lower bound of \({\mathscr {L}}_{\text {CIB}}\) can be stated as the following Theorem 2.

Theorem 2

(Lower Bound of CIB) Given the random variable \(X = [X^v, X^l]\) and two deterministic functions \(f_{\theta ^v}\) and \(f_{\theta ^l}\), let \(T^v = f_{\theta ^v}(X^v)\) and \(T^l = f_{\theta ^l}(X^l)\). The Correlation Information Bottleneck (CIB) can then be bounded as

$$\begin{aligned} {\mathscr {L}}_{\text {CIB}}&= I(Y; T) - \beta I(X^v, X^l; T^v, T^l) \\&\ge I(Y; T) - \beta \Big [I(X^v; T^v) + I(X^l; T^l) - I(T^v; T^l) + D_{\text {skl}}\Big ]. \end{aligned}$$
(6)

In summary, Theorem 2 suggests that in vision-language representation learning, if \(I(Y; T)\) is considered a task-related objective, \(I(X^v, X^l; T^v, T^l)\) can be viewed as a regularizer used to constrain the compactness and redundancy of the learned representations. Overall, CIB encourages pretrained VLMs to learn more robust representations by seeking an optimal tradeoff between redundancy and compression in representations. Moreover, CIB facilitates modality alignment and correlation by maximizing the MI \(I(T^v; T^l)\) between visual and linguistic representations.

Fig. 2 Illustration of using CIB to adapt pretrained VLMs to a downstream task. CIB seeks a minimal sufficient statistic by minimizing the MI between input and internal representation while maximizing the MI between output and representation

3.3 Adapting Pretrained VLMs to VQA with CIB

As illustrated in Fig. 1a, b, and c, there are three typical transformer architectures for VLMs: single-stream encoder (Li et al., 2019; Su et al., 2020; Chen et al., 2020), two-stream encoder (Tan & Bansal, 2019; Lu et al., 2019), and encoder-decoder (Cho et al., 2021; Li et al., 2021). When finetuning pretrained VLMs with CIB, to unify the three architectures into a single formulation, as shown in Fig. 2, we utilize the region-level or patch-level visual features after the visual Embedding layer (i.e., \(f_{\theta ^v}\) is the parametric embedding layer) as the internal visual representation \(T^v\). Analogously, the token-level linguistic features after the linguistic embedding layer (\(f_{\theta ^l}\)) are considered as the internal linguistic representation \(T^l\). All subsequent Transformer layers (\(f_{\theta ^{{\text {Tran}}}}\)) and the VQA Head module (\(f_{\theta ^\text {H}}\)) for the single-stream and two-stream VLMs as well as the Decoder (\(f_{\theta ^\text {Dec}}\)) for the encoder-decoder VLMs serve as the parametric approximator (\(f_{\theta ^{\text {ans}}}\)) to generate Y given \(T=[T^v, T^l]\). As summarized in Algorithm 1, we first convert \(I(Y; T)\) to the cross-entropy loss (\({\mathscr {L}}_{\text {vqa}}\)) for answer prediction in VQA and estimate the remaining terms in Theorem 2. After obtaining \({\mathscr {L}}_{\text {CIB}}\), we update all parameters by minimizing \(- {\mathscr {L}}_{\text {CIB}}\). Next, we elaborate on the estimation of CIB terms.

3.3.1 Estimating CIB Terms

As stated in Theorem 2, in addition to the task-related MI term \(I(Y; T)\), \(I(X^v, X^l; T^v, T^l)\) can be further decomposed into four computable MI terms. Firstly, we focus on the MI between inputs and representations within a single visual or linguistic modality. The inputs \(X^v\) and \(X^l\) are intrinsically two sets of random variables, i.e., \(X^v = [X^v_{1},..., X^v_{K}]\) and \(X^l = [X^l_{1},..., X^l_{L}]\). The functions \(f_{\theta ^v}\) and \(f_{\theta ^l}\) transform \(X^v\) and \(X^l\) into visual and linguistic representations, respectively, such that \(T^v = [T^v_{1},..., T^v_{K}] = [f_{\theta ^v}(X^v_{1}),..., f_{\theta ^v}(X^v_{K})]\) and \(T^l = [T^l_{1},..., T^l_{L}] = [f_{\theta ^l}(X^l_{1}),..., f_{\theta ^l}(X^l_{L})]\). For sample pairs \(\{(X_i^v, T_i^v)\}_{i=1}^{K}\) and \(\{(X_i^l, T_i^l)\}_{i=1}^{L}\), the conditional probability distributions \(p(t^v|x^v)\) and \(p(t^l|x^l)\) are known during the finetuning process. Consequently, we adopt a sample-based differentiable MI estimator, CLUB (Cheng et al., 2020), to approximate the upper bound of the MI between visual or linguistic inputs and their corresponding representations, i.e.,

$$\begin{aligned} {\hat{I}}(X^v; T^v) = \frac{1}{K^2} \sum _{i=1}^{K}\sum _{j=1}^{K} \Big [\log p(t_i^v|x_i^v) - \log p(t_j^v|x_i^v)\Big ], \end{aligned}$$
(7)
$$\begin{aligned} {\hat{I}}(X^l; T^l) = \frac{1}{L^2} \sum _{i=1}^{L}\sum _{j=1}^{L} \Big [\log p(t_i^l|x_i^l) - \log p(t_j^l|x_i^l)\Big ]. \end{aligned}$$
(8)
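As a concrete reference, a minimal PyTorch sketch of the sampled estimator in Eqs. (7)-(8); modeling \(p(t|x)\) as a diagonal Gaussian whose mean and log-variance come from the embedding layer is an assumption of this sketch, since CIB only requires that the conditional density be known.

```python
import torch

def club_upper_bound(mu: torch.Tensor, logvar: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sampled CLUB-style upper bound on I(X; T), following Eqs. (7)-(8).

    Assumes p(t|x) = N(t; mu(x), diag(exp(logvar(x)))), a modeling choice of this
    sketch. mu, logvar, t: (N, d) tensors over the N token positions of one modality.
    """
    var = logvar.exp()
    # log p(t_i | x_i): positive pairs, shape (N,)
    pos = -0.5 * (((t - mu) ** 2) / var + logvar).sum(dim=-1)
    # log p(t_j | x_i): all pairs, shape (N, N); entry [i, j] uses the conditional of x_i
    diff = t.unsqueeze(0) - mu.unsqueeze(1)                       # [i, j] = t_j - mu_i
    neg = -0.5 * ((diff ** 2) / var.unsqueeze(1) + logvar.unsqueeze(1)).sum(dim=-1)
    # (1 / N^2) * sum_i sum_j [ log p(t_i|x_i) - log p(t_j|x_i) ]
    return (pos.unsqueeze(1) - neg).mean()
```

The same function would be applied separately to the visual tokens (N = K) and the linguistic tokens (N = L).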
Algorithm 1 Finetuning pretrained VLMs with CIB for VQA

Since \(T^v \in {\mathbb {R}}^{K\times d}\) and \(T^l\in {\mathbb {R}}^{L\times d}\) have different sequence lengths, \(I(T^v; T^l)\) is challenging to estimate directly. Therefore, we transform the two sequence representations into global visual and linguistic representations \({\bar{T}}^v \in {\mathbb {R}}^{d}\) and \({\bar{T}}^l\in {\mathbb {R}}^{d}\) using a one-layer fully-connected (FC) network. To guarantee that the inequality in Eq. (6) holds, we should approximate a lower bound of \(I(T^v; T^l)\). Accordingly, we estimate \(I(T^v; T^l)\) with the NWJ estimator (Poole et al., 2019), i.e.,

$$\begin{aligned} {\hat{I}}({\bar{T}}^v; {\bar{T}}^l) = \mathbb {E}_{p({\bar{t}}^v, {\bar{t}}^l)}\left[ \log f_{\theta _{\text {FC}}}({\bar{t}}^v, {\bar{t}}^l)\right] - \frac{1}{e}\,\mathbb {E}_{p({\bar{t}}^v) p({\bar{t}}^l)} \left[ f_{\theta _{\text {FC}}}({\bar{t}}^v, {\bar{t}}^l)\right] , \end{aligned}$$
(9)

where \(f_{\theta _{\text {FC}}}\) denotes the discriminant function implemented using a two-layer FC network.
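A minimal sketch of this estimator, assuming the batch provides paired global representations and that all cross pairs in the batch approximate samples from the product of marginals; the hidden width of the critic is also an assumption.

```python
import math

import torch
import torch.nn as nn

class NWJCritic(nn.Module):
    """Two-layer FC discriminant; its output is interpreted as log f_FC(t_v, t_l)."""

    def __init__(self, dim: int = 768, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, tv: torch.Tensor, tl: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([tv, tl], dim=-1)).squeeze(-1)

def nwj_lower_bound(critic: NWJCritic, tv: torch.Tensor, tl: torch.Tensor) -> torch.Tensor:
    """NWJ estimate of I(T^v; T^l) from a batch of B paired global representations,
    following Eq. (9): E_joint[log f] - (1/e) * E_marginal[f]."""
    B = tv.size(0)
    joint = critic(tv, tl)                                   # log f on aligned pairs, (B,)
    tv_all = tv.unsqueeze(1).expand(B, B, -1).reshape(B * B, -1)
    tl_all = tl.unsqueeze(0).expand(B, B, -1).reshape(B * B, -1)
    marg = critic(tv_all, tl_all).exp()                      # f on all cross pairs
    return joint.mean() - marg.mean() / math.e
```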

Finally, for \(D_{\text {skl}}\) in \({\mathscr {L}}_{\text {CIB}}\), since \(p(t^l|x^l)\) and \(p(t^v|x^v)\) have a known probability density, we can directly compute the two KL divergences using internal visual and linguistic representations. That is, \(D_{\text {skl}}\) can be obtained by

$$\begin{aligned} {\hat{D}}_{\text {skl}} = \frac{1}{2} \left[ \text {KL}\left( p(t^v|x^v)\,||\,p(t^l|x^l)\right) + \text {KL}\left( p(t^l|x^l)\,||\,p(t^v|x^v)\right) \right] . \end{aligned}$$
(10)
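Putting the estimated terms together, the following sketch assembles the negative CIB objective that is minimized in Algorithm 1. It assumes diagonal-Gaussian conditionals \(p(t^v|x^v)\) and \(p(t^l|x^l)\) over pooled d-dimensional representations so that both KL directions have a closed form; this pooling and the Gaussian form are simplifications of the sketch, not a specification of the exact implementation.

```python
import torch

def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """KL( N(mu1, exp(logvar1)) || N(mu2, exp(logvar2)) ) for diagonal Gaussians,
    averaged over the batch."""
    kl = 0.5 * (logvar2 - logvar1 + (logvar1.exp() + (mu1 - mu2) ** 2) / logvar2.exp() - 1.0)
    return kl.sum(dim=-1).mean()

def negative_cib(l_vqa, i_xv_tv, i_xl_tl, i_tv_tl, mu_v, logvar_v, mu_l, logvar_l, beta=1e-4):
    """-L_CIB from Theorem 2, to be minimized:
       L_vqa + beta * [ I(X^v;T^v) + I(X^l;T^l) - I(T^v;T^l) + D_skl ],
    where L_vqa replaces -I(Y; T) up to the constant H(Y)."""
    d_skl = 0.5 * (gaussian_kl(mu_v, logvar_v, mu_l, logvar_l)
                   + gaussian_kl(mu_l, logvar_l, mu_v, logvar_v))
    return l_vqa + beta * (i_xv_tv + i_xl_tl - i_tv_tl + d_skl)
```

Here i_xv_tv and i_xl_tl would come from a CLUB-style upper-bound estimator, i_tv_tl from the NWJ lower-bound estimator, and the default beta matches the value used in our experiments.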
Table 1 Details on input robustness datasets

3.3.2 Theoretical Justification for Input Robustness

In the following, we conduct a theoretical analysis of the input robustness of CIB. Formally, for a perturbation \(\delta \) added to the visual and linguistic inputs, let \(X'=[{X^v}', {X^l}']\) denote the perturbed version of the standard inputs \(X=[X^v, X^l]\), i.e., \(X' = X + \delta \). The functions \(f_{\theta ^v}\) and \(f_{\theta ^l}\) transform \(X=[X^v, X^l]\) and \(X'=[{X^v}', {X^l}']\) into \(T = [T^v, T^l] = [f_{\theta ^v}(X^v), f_{\theta ^l}(X^l)]\) and \(T' = [{T^v}', {T^l}'] = [f_{\theta ^v}({X^v}'), f_{\theta ^l}({X^l}')]\), with \(T\ne T'\). The distributions of X and \(X'\) are denoted by p(x) and q(x), where q(x) approximates p(x). \(\delta _m\) is the maximum perturbation bound that does not alter the output label, i.e., \(Y= f_{\theta ^{\text {ans}}}(T)=f_{\theta ^{\text {ans}}}(T')\) when \(||\delta || \le \delta _m\). According to the definition of CIB, the performance gap between standard and perturbed inputs is \(|I(T; Y) - I(T'; Y)|=|I(T^v, T^l; Y) - I({T^v}', {T^l}'; Y)|\). To provide a theoretical justification for this performance gap, following Wang et al. (2021), we derive the upper bound

$$\begin{aligned} |I(T; Y) - I(T'; Y)|&= |I(T^v, T^l; Y) - I({T^v}', {T^l}'; Y)| \\&\le B^v_1 \sqrt{|{\mathcal {T}}^v|} \left( I(X^v;T^v)\right) ^{1/2} + B^v_2 |{\mathcal {T}}^v|^{3/4} \left( I(X^v;T^v)\right) ^{1/4} \\&\quad + B^v_3 \sqrt{|{\mathcal {T}}^v|} \left( I({X^v}';{T^v}')\right) ^{1/2} + B^v_4 |{\mathcal {T}}^v|^{3/4} \left( I({X^v}';{T^v}')\right) ^{1/4} \\&\quad + B^l_1 \sqrt{|{\mathcal {T}}^l|} \left( I(X^l;T^l)\right) ^{1/2} + B^l_2 |{\mathcal {T}}^l|^{3/4} \left( I(X^l;T^l)\right) ^{1/4} \\&\quad + B^l_3 \sqrt{|{\mathcal {T}}^l|} \left( I({X^l}';{T^l}')\right) ^{1/2} + B^l_4 |{\mathcal {T}}^l|^{3/4} \left( I({X^l}';{T^l}')\right) ^{1/4} \\&\quad + B^v_0 + B^l_0, \end{aligned}$$
(11)

where \({\mathcal {T}}^v\) is the finite support of \(T^v\) and \({T^v}'\), and \(B^v_0\), \(B^v_1\), \(B^v_2\), \(B^v_3\), and \(B^v_4\) are constants that depend on the sequence length K, \(\delta \), and \(p(x^v)\). Analogously, \({\mathcal {T}}^l\) is the finite support of \(T^l\) and \({T^l}'\), and \(B^l_0\), \(B^l_1\), \(B^l_2\), \(B^l_3\), and \(B^l_4\) are constants that depend on the sequence length L, \(\delta \), and \(p(x^l)\) (cf. Appendix A for the proof).

4 Experiment

In this section, we evaluate the input robustness of the proposed CIB and carry out detailed ablation studies to analyze the performance contribution of CIB components. Meanwhile, we explore the effectiveness of CIB in some other cases, such as standard VQA performance, adversarial attacks, and other multimodal tasks beyond VQA.

4.1 Experimental Settings

4.1.1 Evaluation Datasets

Unless otherwise specified, we finetune pretrained VLMs on the standard and clean VQA v2 training set (Goyal et al., 2017) and evaluate input robustness on five robustness benchmark datasets: VQA-Rephrasings (Shah et al., 2019), VQA P2 (Whitehead et al., 2020), IV-VQA (Agarwal et al., 2020), CV-VQA (Agarwal et al., 2020), and VQA-CE (Dancette et al., 2021). VQA-Rephrasings and VQA P2 evaluate robustness against linguistic variations, while IV-VQA and CV-VQA evaluate robustness against visual variations. VQA-CE assesses robustness against shortcut learning involving multimodal inputs. As all these datasets are built on the VQA v2 (Goyal et al., 2017) validation split, we train our models only on the VQA v2 training set.

Table 1 summarizes dataset details, including the type of perturbation, the specific robustness evaluation metrics, the question type (QType), the shared dataset for finetuning, and the test dataset statistics. These statistics encompass the total number of image-question pairs (#IQ), perturbation samples (#PER/CE), original samples (#ORI/Easy), and the average question length (len(Q)). Specifically, VQA-Rephrasings collects an average of 3 rephrasings for each of the 40,504 questions sampled from the VQA v2 validation set, resulting in approximately 162k image-question pairs. VQA P2 creates three types of linguistic perturbations, i.e., sentence rephrasing (Par) and word substitution with synonyms (Syn) or antonyms (Ant), for 25,814 sampled questions, ultimately obtaining roughly 52k image-question pairs. IV-VQA employs a GAN-based resynthesis technique to remove objects irrelevant to the given question from the image, such that the object removal does not affect the answer. Conversely, CV-VQA focuses on counting questions (Num) and removes one question-relevant object, so that the correct count is reduced by one. In total, IV-VQA and CV-VQA contain approximately 120k and 4k image-question pairs, respectively. VQA-CE is an evaluation benchmark for multimodal shortcuts involved in images and questions. It uses shortcuts detected on the training set to select 63,298 counterexamples from the VQA v2 validation set, on which all shortcuts lead to incorrect answers. Additionally, VQA-CE constructs 147,681 easy examples in which at least one shortcut provides the correct answer.

Table 2 Summary of baseline pretrained VLMs (AC: Answer Classification, TG: Text Generation)
Table 3 Configuration setups

Moreover, to analyze the effectiveness of CIB on standard VQA performance, we conduct experiments on VQA v2 (Goyal et al., 2017). Specifically, we first utilize CIB as the training objective to finetune pretrained VLMs on the VQA v2 training and validation sets, and subsequently test standard VQA performance on VQA v2 test-dev. To evaluate the generalizability of CIB to other multimodal tasks, we perform experiments on RefCOCO+ (Yu et al., 2016) in a weakly-supervised setup. This dataset contains a total of 141,564 expressions based on images from the COCO training set. To assess the effectiveness of CIB in addressing human-adversarial attacks, we evaluate our method on AdVQA (Sheng et al., 2021), a human-adversarial benchmark built upon VQA v2 images, featuring approximately 10k/36.8k image-question pairs for the validation/test split.

4.1.2 Evaluation Metrics

We follow previous work (Antol et al., 2015) to evaluate the VQA performance of our methods with VQA-Score. In addition, we evaluate robustness against linguistic variations using the Consensus Score (CS) (Shah et al., 2019), which is the ratio of the number of size-m subsets in which all questions are answered correctly to the total number of subsets of size m. Specifically, for each question group Q containing one original question and its n corresponding rephrasings, the number of subsets of size m amounts to \(^{n}C_{m}\); CS is then defined as

$$\begin{aligned} \textrm{CS}(m) = \sum _{q\in Q^{'},\, Q^{'} \subset Q,\, |Q^{'}|=m} \frac{\mathbb {1}_{Q^{'}}(q)}{^{n}C_{m}}, \end{aligned}$$
(12)

where \(\mathbb {1}_{Q^{'}}(q)\) is an indicator function that equals 1 if the answer to question q in the subset \(Q^{'}\) is correct and 0 otherwise. Naturally, the higher the average CS at larger values of m, the more robust the model. To evaluate robustness against visual variations, we utilize #flips (Agarwal et al., 2020) as a robustness evaluation metric. #flips is the ratio of the number of prediction mismatches before and after visual content manipulation to the total number of samples. In IV-VQA, if the predicted answers for the original image and the corresponding edited image differ, the prediction is deemed “flipped”. In CV-VQA, an answer to a question on an edited image is considered “flipped” if it is not exactly one less than the prediction on the original image.
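The following is a minimal sketch of CS(m) for a single question group, using the equivalent closed form C(c, m)/C(n, m), where c of the n questions are answered correctly; the dataset-level score averages this quantity over all groups.

```python
from math import comb

def consensus_score(correct_flags, m: int) -> float:
    """Consensus Score CS(m) for one question group (Eq. 12).

    correct_flags: booleans, one per question in the group (the original question
    and its rephrasings); True if the model answered that question correctly.
    A size-m subset counts only if every question in it is answered correctly,
    so CS(m) = C(c, m) / C(n, m) with c correct answers out of n questions.
    """
    n, c = len(correct_flags), sum(correct_flags)
    if m > n:
        raise ValueError("subset size m cannot exceed the group size")
    return comb(c, m) / comb(n, m)

# Example: 3 of 4 questions in a group answered correctly.
# consensus_score([True, True, False, True], m=2) -> comb(3, 2) / comb(4, 2) = 0.5
```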

Table 4 Results of robustness against linguistic variations (i.e., sentence rephrasing) on the VQA-Rephrasings dataset (Shah et al., 2019)

4.1.3 Baseline Pretrained VLMs

As summarized in Table 2, we utilize nine pretrained VLMs with three typical transformer architectures as baselines to evaluate the input robustness of our method. Specifically, VisualBERT (Li et al., 2019), VL-BERT\(_{\text {B}}\) (Su et al., 2020), and UNITER\(_{\text {B}}\) (Chen et al., 2020) employ single-stream encoders. LXMERT (Tan & Bansal, 2019), ViLBERT (Lu et al., 2019), and BEiT-3\(_{\text {B}}\) (Wang et al., 2023) utilize two-stream encoders. VL-T5 (Cho et al., 2021), ALBEF (Li et al., 2021), and mPLUG\(_{\text {B}}\) (Li et al., 2022) incorporate encoder-decoder architectures. When applied to the downstream VQA task, mPLUG\(_{\text {B}}\), VL-T5, and ALBEF formulate VQA as a text generation task (TG), while the remaining baselines formulate VQA as a multi-answer classification problem (AC). These baselines adopt one of two typical types of image tokens, namely region features extracted by a pretrained object detector or patch embeddings obtained with a linear projection, and are pretrained on large-scale image-text (IT) data to learn task-agnostic versatile representations. The pretraining IT datasets include MS COCO caption (COCO) (Chen et al., 2015), Visual Genome (VG) (Krishna et al., 2017), VQA v2 (VQA) (Goyal et al., 2017), GQA balanced version (GQA) (Hudson & Manning, 2019), VG-QA (VGQA) (Zhu et al., 2016), Conceptual Captions (CC) (Sharma et al., 2018), SBU captions (SBU) (Ordonez et al., 2011), and Conceptual 12M (CC12M) (Changpinyo et al., 2021). Since VQA v2 images originate from the COCO dataset, we follow Chen et al. (2020) and categorize these pretrained VLMs into in-domain (ID), in-domain and out-of-domain (ID+OOD), and out-of-domain (OOD) groups based on whether they utilize the COCO dataset during the pretraining process.

4.1.4 Implementation Details

In the subsequent experiments, we maintain the initial configurations of all pretrained VLMs. The region features (visual inputs) of VisualBERT, VL-T5, LXMERT, UNITER\(_{\text {B}}\), ViLBERT, and VL-BERT\(_{\text {B}}\) are extracted using BUA Faster R-CNN (Anderson et al., 2018) pretrained on VG (Krishna et al., 2017). The representation dimension d is set to 768. The configurations of the number of word tokens L (i.e., the maximum token length allowed for a question) and image tokens K are detailed in Table 3. The only crucial hyperparameter, \(\beta \) in Eq. (6), is set to \(1\times 10^{-4}\) in all cases except for finetuning VisualBERT and LXMERT, where \(\beta \) is set to \(5\times 10^{-5}\). All experiments are conducted using PyTorch on one NVIDIA GTX2080Ti 12GB GPU, except those on ALBEF, BEiT-3\(_{\text {B}}\), and mPLUG\(_{\text {B}}\), which run on one NVIDIA A100 40GB GPU. We uniformly utilize an AdamW optimizer with a linear warmup of 1000 steps followed by linear decay. The number of finetuning epochs is 10. The batch size and peak learning rate for each pretrained VLM are shown in Table 3. The best model is selected based on the VQA-Score on a mini-split of the VQA v2 training set that excludes the image-question pairs used for evaluating input robustness.
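For reference, a minimal sketch of this optimization setup; the peak learning rate shown is a placeholder for the per-model values in Table 3.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model: torch.nn.Module, total_steps: int,
                    peak_lr: float = 5e-5, warmup_steps: int = 1000):
    """AdamW with linear warmup followed by linear decay, as used for finetuning.
    peak_lr is a placeholder; the actual per-model values are listed in Table 3."""
    optimizer = AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                    # linear warmup
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))  # linear decay

    return optimizer, LambdaLR(optimizer, lr_lambda)
```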

4.2 Input Robustness Evaluation

4.2.1 Robustness Against Linguistic Variations

To evaluate the effectiveness of CIB against linguistic variations, we finetune pretrained VLMs on the VQA v2 training split with CIB as the training objective and report results on VQA-Rephrasings and VQA P2. Tables 4 and 5 compare our method with existing methods in terms of VQA-Score and the robustness metric CS(m).

Table 5 Results of robustness against linguistic variations (i.e., sentence rephrasing, and word substitution with synonyms and antonyms) on the VQA P2 dataset (Whitehead et al., 2020)
Table 6 Results of robustness against visual variations on IV-VQA and CV-VQA (Agarwal et al., 2020)

Result on VQA-Rephrasings We first compare the proposed CIB with existing methods: CC (Shah et al., 2019), ConClaT (Kant et al., 2021), and MANGO (Li et al., 2020). Specifically, both CC and ConClaT augment training datasets online by training a question generation model to generate paraphrases of questions. To effectively leverage augmented data and enhance model robustness to linguistic variations, CC considers cycle consistency between the question and its rephrasings, while ConClaT jointly optimizes contrastive and cross-entropy losses. CC considers three baseline VQA models, i.e., BUTD (Anderson et al., 2018), Pythia (Jiang et al., 2018), and BAN (Kim et al., 2018). ConClaT uses MMT (Kant et al., 2021), a modified version of UNITER, as its baseline. MANGO employs UNITER (Chen et al., 2020) and VILLA (Gan et al., 2020) as baseline models and adopts adversarial training to enhance model robustness. As shown in Table 4, the results on nine pretrained VLMs consistently show that, compared to the baselines (i.e., finetuning pretrained VLMs with only the task-related loss for answer prediction\(^\dagger \)), using CIB as the training objective for VQA models significantly improves their robustness to linguistic variations. This finding suggests that it is feasible to encourage models to learn more compact and robust representations from an information-theoretic perspective. In comparison with state-of-the-art methods, adapting LXMERT with CIB achieves the best performance across all metrics. This performance advantage can be attributed to the fact that LXMERT considers the VQA training objective during pretraining, which reduces the gap between upstream and downstream objectives. In addition, we observe that the data augmentation-based method (CC) yields greater improvements on the CS(4) metric. However, without data augmentation, the average improvement of our method is more substantial.

Table 7 Results of robustness against multimodal shortcut learning on the VQA-CE dataset (Dancette et al., 2021)

Result on VQA P2 We next compare our method with the existing method Q3R (Whitehead et al., 2020) on VQA P2. Q3R augments training data by creating linguistic variations of input questions, such as synonymous, paraphrastic, and antonymous versions, and regularizes the visual reasoning process between the question and its generated questions. Q3R utilizes three baseline models: StackNMN (Hu et al., 2018), HybridNet (Whitehead et al., 2020), and XNM (Shi et al., 2019). The results in Table 5 indicate that finetuning pretrained VLMs with the proposed CIB markedly improves their robustness against question variations on VQA P2. Moreover, finetuning LXMERT with CIB also achieves the best performance on VQA P2. In addition, the data augmentation-based method (Q3R) continues to exhibit superiority in improving the input robustness of baseline VQA models.

4.2.2 Robustness Against Visual Variations

We evaluate the robustness of our method against visual variations on IV-VQA and CV-VQA. Table 6 shows the comparisons with existing methods in terms of VQA-Score and #flips. CL (a simple CNN+LSTM model) (Lu et al., 2015), SNMN (a compositional module-network model) (Hu et al., 2018), and SAAA (an attention-based model) (Kazemi & Elqursh, 2017) are benchmarked by Agarwal et al. (2020). MANGO exploits adversarial training to improve the robustness of pretrained VLMs (UNITER (Chen et al., 2020) and VILLA (Gan et al., 2020)) against visual variations. The results in Table 6 show that significant improvements are achieved across all metrics and baselines on both IV-VQA and CV-VQA, suggesting the effectiveness of CIB in improving robustness against visual variations. Moreover, we observe that pretrained VLMs using raw images as visual inputs (e.g., BEiT-3\(_{\text {B}}\), mPLUG\(_{\text {B}}\), and ALBEF) exhibit superior performance in defending against visual variations compared to those pretrained VLMs (e.g., VisualBERT, LXMERT, and UNITER\(_{\text {B}}\)) that employ object-level region features as visual inputs. This can be attributed to the fact that pre-extracted region features lose some image information, which hinders VQA models from comprehending and retrieving visual content according to a given question.

Table 8 Comparison between different CIB bounds

4.2.3 Robustness Against Multimodal Shortcut Learning

To demonstrate the ability of CIB to defend against multimodal shortcuts present in input images and questions, we conduct experiments on VQA-CE and compare our method with existing approaches. Results are summarized in Table 7. The compared methods in the table can be broadly classified into two groups: (i) plain models (SAN (Yang et al., 2016), BLOCK (Ben-Younes et al., 2019), ViLBERT (Lu et al., 2019), and BUTD (Anderson et al., 2018)), and (ii) bias-reduction methods (RUBi (Cadene et al., 2019), LMH + RMFE (Gat et al., 2020), ESR (Shrestha et al., 2020), LMH (Clark et al., 2019), LfF (Nam et al., 2020), LMH + CSS (Chen et al., 2020), and RandImg (Teney et al., 2020)). These experimental results are taken from Dancette et al. (2021). As shown in Table 7, finetuning baseline pretrained VLMs with CIB achieves significant improvements and outperforms bias-reduction methods by a considerable margin, particularly on counterexamples. These results suggest that the proposed CIB is more effective at alleviating spurious correlations between representations and answers and at reducing shortcut learning involved in multimodal inputs.

4.3 Ablation Studies

4.3.1 Comparison with Alternative CIB Bounds

When finetuning pretrained VLMs with CIB, \(I(Y; T)\) is regarded as the task-related objective, while \(I(X^v, X^l; T^v, T^l)\) serves as an MI regularizer to constrain representation compactness and pursue more robust representations. As stated in Theorem 1, the upper bound of \(I(X^v, X^l; T^v, T^l)\) consists of four terms: \(I(X^v; T^v)\), \(I(X^l; T^l)\), \(- I(T^v; T^l)\), and \(D_{\text {skl}}\). To analyze the contribution of different terms to CIB, we perform an ablation study on different meaningful combinations of these terms, that is, provable upper bounds of \(I(X^v, X^l; T^v, T^l)\), on VQA-Rephrasings using LXMERT, UNITER\(_{\text {B}}\), and ALBEF as baseline pretrained VLMs. Specifically, the regularizer upper bound has three other meaningful alternatives: (i) \(\frac{3}{2}[I(X^v; T^v) + I(X^l; T^l)]\), (ii) \(-I(T^v; T^l) + D_{\text {skl}}\), and (iii) \(I(X^v; T^v) + I(X^l; T^l) + D_{\text {skl}}\) (cf. Appendix A for the proofs). Table 8 presents results on VQA-Rephrasings. Overall, the ablation results on different bounds are consistent, indicating that CIB with any of the meaningful upper bounds can markedly improve the performance of baseline pretrained VLMs. However, CIB with our derived upper bound performs best, empirically demonstrating that the bound in Theorem 1 is a tighter and more precise bound. Furthermore, the comparison between upper bound (iii) \(I(X^v; T^v) + I(X^l; T^l) + D_{\text {skl}}\) and our upper bound \(I(X^v; T^v) + I(X^l; T^l) - I(T^v; T^l) + D_{\text {skl}}\) suggests that CIB can effectively facilitate the correlation between visual and linguistic representations and modality alignment by maximizing \(I(T^v; T^l)\).

4.3.2 Impact of MI Estimator on CIB

In practice, any sample-based upper bound estimator of MI can be utilized to approximate \(I(X^v; T^v)\) and \(I(X^l; T^l)\), and any differentiable MI lower bound estimator can be applied to approach \(I(T^v; T^l)\). To analyze the impact of different MI estimators on CIB, we consider the following experimental settings: (i) We utilize L1Out (Poole et al., 2019) instead of CLUB (Cheng et al., 2020) as the MI upper bound estimator to approximate \(I(X^v; T^v)\) and \(I(X^l; T^l)\). (ii) We approximate \(I(T^v; T^l)\) with three alternative MI lower bound estimators, i.e., InfoNCE (Oord et al., 2018), NWJ (Nguyen et al., 2010), and MINE (Belghazi et al., 2018). Table 9 presents comparisons between different MI estimators on VQA-Rephrasings using LXMERT, UNITER\(_{\text {B}}\), and ALBEF as baselines. These results consistently demonstrate that CIB can effectively improve the performance of baselines with different transformer architectures and that the effectiveness of CIB does not depend on a specific MI estimator.
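For illustration, a minimal sketch of the InfoNCE alternative for the \(I(T^v; T^l)\) lower bound; using the dot product of the pooled representations as the critic is an assumption of this sketch.

```python
import math

import torch
import torch.nn.functional as F

def infonce_lower_bound(tv: torch.Tensor, tl: torch.Tensor) -> torch.Tensor:
    """InfoNCE estimate of I(T^v; T^l) from B paired global representations.
    The estimate is capped at log(B), so larger batches give a tighter bound."""
    B = tv.size(0)
    scores = tv @ tl.t()                          # (B, B) critic values; positives on the diagonal
    labels = torch.arange(B, device=tv.device)
    return math.log(B) - F.cross_entropy(scores, labels)
```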

Table 9 Impact of different MI estimators on CIB

4.3.3 Impact of Hyperparameter on CIB

When using CIB as the training objective to adapt pretrained VLMs to the downstream VQA task, \(\beta \) controls the tradeoff between redundancy and compression in representations, making it the crucial hyperparameter. Consequently, we perform a grid search for \(\beta \). Specifically, we consider the following values: \(\beta \in [1\times 10^{-6}, 1\times 10^{-5}, 5\times 10^{-5}, 1\times 10^{-4}, 5\times 10^{-4}, 1\times 10^{-3}, 5\times 10^{-3}, 1\times 10^{-2}, 5\times 10^{-2}]\). Figure 3 illustrates the variation curve of VQA-Score (PER) on VQA-Rephrasings with increasing \(\log \beta \). We observe that performance already improves when \(\beta \) is quite small, indicating the effectiveness of CIB in improving the performance of baseline pretrained VLMs. When \(\beta \) increases to \(5\times 10^{-5}\), \(1\times 10^{-4}\), and \(1\times 10^{-4}\), UNITER\(_\text {B}\), LXMERT, and ALBEF respectively achieve the best performance. Beyond that point, the performance typically begins to degrade, suggesting that extremely compressed representations of pretrained VLMs may start to compromise model performance.

4.3.4 Impact of Internal Representation on CIB

As illustrated in Fig. 1b, for pretrained VLMs with two-stream encoders (e.g., LXMERT and ViLBERT), there is an alternative option for internal representations, i.e., \(T=[T^{'v}, T^{'l}]\), which are the visual and linguistic representations after the vision transformer layers (\(f_{\theta ^{\text {VTran}}}\)) and language transformer layers (\(f_{\theta ^{\text {LTran}}}\)). When finetuning the two-stream pretrained VLMs with CIB, we analyze the impact of different internal representations by replacing the original \(T=[T^{v}, T^{l}]\) in \({\mathscr {L}}_{\text {CIB}}\) with \(T=[T^{'v}, T^{'l}]\). Table 10 shows the VQA-Score for PER on VQA-Rephrasings, revealing that the choice of internal representations has only a slight impact on the PER performance of CIB. This indicates that for two-stream pretrained VLMs, using the visual and linguistic representations after the vision and language transformer layers as internal representations to estimate the mutual information terms in \({\mathscr {L}}_{\text {CIB}}\) is also a feasible approach.

4.4 Discussion and Analysis

4.4.1 Effectiveness of CIB for Standard VQA Performance

To analyze the impact of CIB on standard VQA performance (i.e., whether the representation compression impairs the standard VQA performance), we utilize CIB as the objective to train the aforementioned baseline pretrained VLMs on the VQA v2 training and validation sets. The results on VQA v2 test-dev are shown in Table 11. Overall, training baseline pretrained VLMs with the proposed CIB can slightly improve the standard VQA performance. In particular, the performance improvement of VisualBERT and ALBEF is relatively significant. This is because their visual inputs contain more redundant information, such as image background and visual content irrelevant to the given question (VisualBERT and ALBEF respectively adopt 100 region-level features and 900 patch-level features as visual inputs). Therefore, our hypothesis is that a certain degree of representation compression can reduce the redundant information learned from inputs and make the obtained representations more compact and robust. Note that, in contrast to the significant improvement in input robustness, CIB leads to relatively limited improvement in standard performance. This observation also indirectly indicates that the proposed CIB is carefully designed and tailored to improve input robustness when adapting pretrained VLMs to the downstream VQA task.

Fig. 3 Variation curve of VQA-Score (PER) on VQA-Rephrasings as \(\log \beta \) increases

4.4.2 Effectiveness of CIB Against Adversarial Attack

To analyze the impact of CIB on defending against adversarial attacks, we conduct experiments considering the following attacks and dataset: (i) L4A (Ban & Dong, 2022), which adds pretrained adversarial perturbations (PAPs) to the low-level layers of pretrained models and can effectively fool the finetuned models on downstream tasks without any knowledge of the tasks. (ii) AdVQA (Sheng et al., 2021), an adversarial benchmark collected using a human-and-model-in-the-loop paradigm to attack state-of-the-art VQA models and obtain human-adversarial examples, which can effectively evaluate the human-adversarial robustness of VQA models. The effectiveness of a model in defending against adversarial attacks is measured by its VQA-Score under these attacks.

Table 10 Impact of different internal representations obtained by the two-stream pretrained VLMs on CIB
Table 11 Results on VQA v2 test-dev (Goyal et al., 2017) under standard and clean dataset setups
Fig. 4 Results on VQA v2 test-dev under different adversarial attacks

For (i), as proposed in Ban and Dong (2022), we consider three different ways of generating perturbations, i.e., L4A\(_{\text {base}}\), L4A\(_{\text {fuse}}\), and L4A\(_{\text {ugs}}\). Before finetuning the pretrained ALBEF with a text generation loss and the proposed CIB as the training objective, we first utilize the three methods above to generate PAPs by lifting the neuron activations of the low-level layers of ALBEF. Next, we separately add the generated PAPs to input images, finetune the pretrained ALBEF on the VQA v2 training and validation sets, and test its performance on VQA v2 test-dev. Figure 4 shows the performance comparison, where the blue bar marked with a red number indicates the VQA-Score drop with respect to the standard performance under an attack. From the figure, we can observe that CIB markedly reduces the performance drop, demonstrating its ability to better alleviate the vulnerability of VQA models to such attacks. For (ii), following the experimental setups for evaluating input robustness, we first finetune pretrained UNITER\(_\text {B}\) and LXMERT on the standard and clean VQA v2 training set, and then evaluate human-adversarial robustness on the adversarial benchmark AdVQA. As shown in Table 12, the significant performance improvement of our method over the baselines demonstrates the robustness of CIB against human-adversarial attacks. In summary, the aforementioned experiments consistently suggest that the proposed CIB, as a generic objective, can potentially alleviate the vulnerability of models to adversarial attacks when adapting pretrained VLMs to downstream tasks.

Fig. 5 Visualization of the top two objects with the highest attention scores. The image-question pairs originate from VQA-Rephrasings. The objects with the highest and second-highest attention scores are marked in magenta and green, respectively. Wrongly predicted answers are marked in red (Color figure online)

Table 12 Results on AdVQA (Sheng et al., 2021)
Table 13 Results on the RefCOCO+ (Yu et al., 2016) dataset for weakly-supervised visual grounding
Fig. 6 Qualitative examples of the baseline and our method. The correct and wrong answers are highlighted (Color figure online)

4.4.3 Generalizability of CIB to Other Multimodal Task

The proposed CIB is essentially a generic training objective that can be applied to various multimodal tasks beyond VQA. To evaluate the generalizability of CIB to other multimodal tasks, we consider the task of weakly-supervised visual grounding. Following the original experimental setups of ALBEF (Li et al., 2021), we finetune pretrained ALBEF with CIB on the RefCOCO+ (Yu et al., 2016) training set in a weakly-supervised setup, i.e., finetuning models using only image-text supervision without bounding box annotations. The results in Table 13 show that using CIB as the training objective further improves the performance of the baseline ALBEF\(_{\text {itm}}\), demonstrating that the proposed CIB can be effectively applied to other multimodal tasks.

4.5 Qualitative Results

4.5.1 Visualization of Visual Attentional Objects

To empirically explore why CIB can improve input robustness, we use the pretrained LXMERT (Tan & Bansal, 2019) as a representative model and conduct the following experiments. First, we collect from the VQA-Rephrasings dataset the image-question pairs whose answers are correctly predicted by the LXMERT finetuned with CIB but incorrectly predicted by the baseline LXMERT finetuned without CIB. Next, we compute the attention score between the final representation \(Z \in {\mathbb {R}}^d\) used for answer prediction and the input visual representation \(X^v\in {\mathbb {R}}^{K\times d}\) of object regions as \(\text {score}_\text {attn} = \textrm{softmax}(Z\cdot (X^{v})^{\top }/\sqrt{d})\). Finally, we use magenta and green to highlight the top two objects with the highest attention scores in the image. The results in Fig. 5 show that, compared with the baseline LXMERT, the two attended objects obtained by the LXMERT finetuned with CIB are more consistent and more question-related. This observation qualitatively illustrates that using CIB as a training objective to finetune pretrained VLMs encourages models to learn more discriminative representations for different answers and to reduce question-irrelevant information.
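A minimal sketch of this attention-score computation; the function name and the top-k interface are illustrative.

```python
import torch

def top2_attended_objects(Z: torch.Tensor, Xv: torch.Tensor):
    """Attention scores between the final answer representation Z (d,) and the
    K object representations Xv (K, d), i.e. softmax(Z * Xv^T / sqrt(d)),
    together with the indices of the two most-attended objects."""
    d = Z.size(-1)
    scores = torch.softmax(Xv @ Z / d ** 0.5, dim=-1)   # (K,)
    top2 = torch.topk(scores, k=2).indices
    return scores, top2
```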

4.5.2 Visualization of Input Robustness Cases

Figure 6a, b, and c present several qualitative examples demonstrating robustness to linguistic variations, visual variations, and multimodal shortcut learning, respectively. The qualitative comparison in Fig. 6 further shows that, compared to the baseline that finetunes the pretrained LXMERT with a cross-entropy loss, using the proposed CIB as the training objective improves the ability of the VQA model to correctly answer these difficult questions. This empirical evidence highlights the effectiveness of CIB in defending against such attacks involving both visual and linguistic inputs.

5 Conclusion

In this paper, we propose to improve input robustness from the information bottleneck perspective when adapting pretrained VLMs to the downstream VQA task. Specifically, we derive a new IB lower bound (CIB) for vision-language learning and apply CIB to finetune pretrained VLMs with various architectures for VQA. Extensive experiments on five robustness datasets consistently demonstrate the effectiveness and superiority of CIB. In the future, we plan to assess the effectiveness of CIB when tuning pretrained VLMs using parameter-efficient strategies, such as adapter-based tuning and prompt-based tuning.

Limitation. Redundancy is a double-edged sword. One reason why pretrained VLMs can significantly improve the performance of downstream tasks is that they have learned rich, and partly redundant, knowledge during the pretraining stage. Practically, for downstream tasks, especially in-domain tasks, task-related redundancy can help models quickly adapt to new tasks, while task-agnostic redundancy may impair model robustness. Our work investigates improving the input robustness of models while preserving their accuracy by seeking a tradeoff between representation compression and redundancy. Another potential research direction is to explore how to explicitly reduce task-agnostic redundancy and adequately exploit task-related redundancy when adapting pretrained VLMs to downstream tasks, particularly out-of-domain tasks.