1 Introduction

Emerging studies on the generalizability of deep neural networks have revealed that existing models, which assume independent and identically distributed (i.i.d.) training and test distributions, are not robust to significant shifts between the training and test distributions, e.g., backgrounds [60], geographic distribution [57], demographic statistics [50, 65], textures [3, 23], or day-to-night shifts [17, 40]. Domain generalization (DG) aims to learn representations that are robust against distribution shifts from multiple source domains during training. The trained model is evaluated on an unseen domain to measure its robustness. Existing DG approaches have tried to learn invariant features across multiple domains [2, 11, 13, 22, 32, 53, 70]. However, recent studies [24, 29] have shown that simple baselines that do not learn invariant features are comparable to, or even outperform, the existing DG methods on diverse DG benchmarks under a fair evaluation protocol in realistic settings (e.g., using ResNet-50 instead of ResNet-18 [25]). We presume that this is because the training and test distributions differ too significantly for domain-invariant features to be learned from the training distribution only.

Instead of learning domain-invariant features, we let a model learn features similar to “oracle” representations, i.e., the representations of an optimal model that generalizes to any domain. In particular, we re-formulate the DG problem as maximizing the mutual information (MI) between the oracle model representations and the target model representations while preserving the training loss on source domains. However, the oracle model is not achievable in practice. Hence, we use a large pre-trained model (e.g., ImageNet pre-trained ResNet-50 [25]) as an approximation. With this approximation, we derive a tractable variational lower bound of the proposed maximization problem, named Mutual Information Regularization with Oracle (MIRO). At a high level, our MIRO objective consists of two terms: the original target task (i.e., an ERM objective) and a regularization term between the pre-trained model and the current target model. Note that the standard DomainBed benchmark [24] uses the ImageNet pre-trained ResNet-50 as the initialization of a DG method. Thus, we use the pre-trained ResNet both as the initialization and as the approximation of the oracle model.

While a naive fine-tuning approach of a large pre-trained model can harm the robustness against distribution shifts [31, 59], our proposed algorithm remarkably improves the robustness against unseen domains during fine-tuning in a plug-and-play manner to any scale of the backbone model and datasets.

In our experiments, we observe that naive fine-tuning of a larger pre-trained model can fail to provide better performance, even though the larger pre-trained model is trained with more data and domains. For example, ERM with the ImageNet pre-trained ResNet (trained with 1.3M images) shows 64.2% averaged accuracy, while ERM with the CLIP pre-trained ViT (trained with 400M image-caption pairs) shows 61.1%. On the other hand, we show that our method can significantly improve the average DG performance with backbone models at different scales, e.g., ImageNet pre-trained ResNet (64.2% \(\rightarrow \) 65.9%), 400M image-text pre-trained ViT (CLIP) [45] (61.1% \(\rightarrow \) 73.7%) and Instagram 3.6B pre-trained RegNet (SWAG) [52] (68.0% \(\rightarrow \) 74.1%). In particular, we observe that the knowledge of larger pre-trained models, such as SWAG and CLIP, is more effective for learning domain-generalized features than that of the ImageNet pre-trained model: in contrast to naive fine-tuning, MIRO with the CLIP pre-trained ViT outperforms MIRO with the ImageNet pre-trained ResNet. Furthermore, our feature-level regularization method is easily combined with the existing parameter space ensemble methods [13, 59] (74.1% \(\rightarrow \) 77.3% average DG accuracy by combining with SWAD [13] and pre-trained RegNet).

Our contributions are as follows: (1) We re-formulate the DG objective in terms of mutual information with the oracle model. Then, we approximate the oracle by a large pre-trained model to derive a tractable approximation of the target objective. We propose Mutual Information Regularization with Oracle (MIRO) to solve our objective. (2) We analyze the pre-trained models in terms of the MI with the oracle model. Our analysis shows that naive fine-tuning of pre-trained models can harm the MI with the oracle, whereas MIRO shows high MI with the oracle. (3) We compare MIRO with state-of-the-art DG methods on DomainBed. MIRO outperforms all methods in all settings, including varying optimizers and pre-trained models. We also provide extensive analysis to understand MIRO. For example, we observe that MIRO shows stronger DG performance with larger pre-trained models, such as SWAG [52] or CLIP [45].

2 Related Works

Domain Generalization. Learning domain-invariant features from source domains has been a major branch in the DG field. The main idea is discarding biased knowledge to a specific domain while preserving invariant features over source domains, by minimizing feature divergences between the source domains [22, 35, 37, 39, 41, 53, 68], simulating domain shifts based on meta-learning [5, 11, 19, 32, 34, 67], robust optimization [2, 13, 30, 49, 51], or augmenting source domain examples [4, 12, 42, 43, 47, 64, 69, 70]. However, even if the model learns invariant representation to source domains, it can still be biased toward the source domains which causes limited performance on unseen target domains. That is, learning invariant representation across source domains is not enough to achieve the underlying objective of domain generalization [11, 14, 15]. To compensate for the issue, this paper employs pre-trained models, which provide general representations across various domains including unseen target domains.

Exploiting Pre-trained Models. There have been numerous attempts to exploit pre-trained models in various fields. Transfer learning [36, 62] and knowledge distillation [1, 54] employ pre-trained models to improve in-domain performance when a dataset or architecture shift occurs between pre-training and fine-tuning. Continual learning utilizes the pre-trained model to maintain old task performance when learning new tasks [38]. Recently, several studies targeting out-of-distribution generalization have emerged [31, 59]. Kumar et al. [31] show that naive fine-tuning distorts the pre-trained features and propose a simple baseline, named LP-FT, to alleviate the distortion. WiSE-FT [59] focuses on zero-shot models. It combines pre-trained and fine-tuned weights to preserve the generalizability of the pre-trained zero-shot models. In this paper, we propose an MI-based regularization method, MIRO, to exploit the generalizability of the pre-trained representation in the training process.

3 Methods

In this section, we first re-formulate the objective for the out-of-domain generalization by introducing an oracle model. Then, we derive a tractable variational bound of the objective by approximating the oracle model to the pre-trained model. The final form consists of the empirical risk and the mutual information (MI) regularization by querying the approximated oracle, named Mutual Information Regularization with Oracle (MIRO). We empirically validate our approximation by MI between the oracle model and large pre-trained models.

3.1 Mutual Information Regularization with Oracle

The main idea of the proposed method is to guide the learning process using oracle representations of training datasets. In general, the problem of domain generalization (DG) is to find a model that minimizes an expected loss of any domain by using training datasets from only partial domains, which are called source domains. Many existing methods minimize an empirical loss averaged over source domains. More specifically, suppose that training samples \(\{\mathcal {S}_{d}\}_{d=1}^{m}\) are given in m domains and we consider a hypothesis set \(\mathcal {H}\) for optimization. Then, many existing DG frameworks can be formulated as follows:

$$\begin{aligned} \bar{h}={{\,\mathrm{arg\,min}\,}}_{h\in \mathcal {H}} \sum _{d=1}^{m} \mathcal E_{\mathcal {S}_{d}}(h), \end{aligned}$$
(1)

where d indicates an individual source domain and \(\mathcal E_{\mathcal {S}_{d}}\) is an empirical loss over the source domain d. Note that the majority of existing DG methods can be interpreted as variants of Eq. (1). For example, if we choose a simple cross-entropy loss for \(\mathcal E_{\mathcal {S}_{d}}\), then Eq. (1) becomes the “ERM” baseline used in [24]. Otherwise, \(\mathcal E_{\mathcal {S}_{d}}\) can be formulated as a regularized ERM, such as IRM [2] or CORAL [53]. However, the formulation (1) still suffers from learning domain-invariant representations using only partial domains when the target distribution differs significantly from the training distribution. For example, CORAL, the state-of-the-art method, shows inconsistent out-of-domain accuracies across domains in DomainNet [44]. While CORAL achieves \(\approx \)50% top-1 accuracy on four easy domains (59.2% for Clipart, 46.6% for Painting, 59.8% for Real, 50.1% for Sketches), it only shows 13.4% for QuickDraw and 19.7% for Infographics, where the domains show a significant distribution shift compared to the others.
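As a minimal illustration of Eq. (1) with a cross-entropy loss, the sketch below sums per-domain empirical losses in plain Python. The function names and the toy hypothesis are hypothetical and are used only to make the objective concrete; they are not taken from the DomainBed code.

```python
import math

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example (logits: list of floats)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_norm - logits[label]

def domain_loss(h, samples):
    """Empirical loss of hypothesis h on one source domain S_d."""
    return sum(cross_entropy(h(x), y) for x, y in samples) / len(samples)

def erm_objective(h, source_domains):
    """Sum of per-domain empirical losses, as in Eq. (1)."""
    return sum(domain_loss(h, samples) for samples in source_domains)

# Toy usage: a hypothetical hypothesis that always outputs uniform logits.
def uniform(x):
    return [0.0, 0.0]

domains = [[(0.1, 0), (0.9, 1)], [(0.5, 1)]]
loss = erm_objective(uniform, domains)  # each domain contributes log(2)
```

Minimizing `erm_objective` over a hypothesis set gives \(\bar{h}\); regularized variants such as IRM or CORAL would add a penalty inside the per-domain term.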

To alleviate this issue, we re-formulate the DG problem by employing oracle representations of source domains. Here, we define an oracle model as a model that generalizes to any possible domain, not only to the source domains. We define a model as a composition of a feature extractor f and a classifier g on the feature space, where the whole classifier h can be written as \(h=f\circ g\). Then, let \(f^{*}\) be the feature extractor of the oracle model. We start from a strong assumption: \(f^*\) is accessible during the training phase. Then, we can obtain additional information from \(f^{*}\) by querying the oracle representations of training samples in the source domains. By using the oracle representations, we can guide the learning process of a target model by maximizing the MI between the oracle representations and the target ones. We formulate the proposed oracle-guided DG framework as follows:

$$\begin{aligned} \begin{aligned} \max _{h} \quad&I(Z_{f^{*}};Z_{f}) \\ \text {s.t.} \quad&\mathcal {E}_{\mathcal S}(h) - \mathcal {E}_{\mathcal S}(\bar{h}) \le \epsilon , \end{aligned} \end{aligned}$$
(2)

where \(Z_{f^{*}}\) is a random feature extracted by \(f^{*}\) and \(Z_{f}\) is a random feature extracted by a target model f. \(I(Z_{f^{*}};Z_{f})\) is the MI between \(Z_{f^{*}}\) and \(Z_{f}\), and \(\mathcal {E}_{\mathcal S}(\cdot ) = \sum _{d=1}^{m}\mathcal {E}_{\mathcal {S}_{d}}(\cdot )\). The inequality constraint ensures the performance of the target model on the source domains. Maximizing the MI inhibits the target model from overfitting to domain-specific features in the limited source domains. Because we assume that the “oracle” generalizes well to any possible domain, the MI objective in Eq. (2) is beneficial for learning robust representations.

Unfortunately, the oracle feature extractor \(f^{*}\) is not accessible in practice. Instead, we approximate the oracle feature extractor by using a pre-trained model \(f^{0}\). Our assumption is that a model pre-trained on large-scale diverse datasets, such as ImageNet [48], contains information on diverse domains. In practice, we choose \(f^{0}\) as the ImageNet pre-trained ResNet-50 [25], the standard initialization choice for evaluating DG algorithms [24]. We also consider models trained on larger diverse datasets, such as CLIP [45] (trained with 400M web-crawled image-text pairs) and SWAG [52] (trained with 3.6B noisy image-hashtag pairs crawled from Instagram). Although using CLIP and SWAG is not a fair comparison to the existing DG benchmark, we emphasize that naive fine-tuning of large pre-trained models leads to inferior generalizability to extreme distribution shifts at test time [31, 59]. In our experiments, we make a similar observation: naive fine-tuning of CLIP shows a worse DG performance (61.1%) than ERM (64.2%).

Through the approximation of the oracle model, we derive a tractable variational bound of our objective (2). We assume a pre-trained model \(f^{0}\) is located near \(f^{*}\) in terms of distance equipped on the hypothesis set of the feature extractors and it can provide approximated representation of \(f^{*}\). Under this assumption, we can obtain a tractable objective by deriving an approximated lower bound of the MI. We first derive the variational lower bound of the MI as follows:

$$\begin{aligned} I(Z_{f^{*}} ; Z_f) =&\,\mathbb {E}_{Z_{f^{*}}, Z_f}\left[ \log \frac{q(Z_{f^{*}} \mid Z_f)}{p(Z_{f^{*}})}\right] +K L(p(Z_{f^{*}}\mid Z_f) \Vert q(Z_{f^{*}}\mid Z_f)) \nonumber \\ \ge&\, \mathbb {E}_{Z_{f^{*}}, Z_f}[\log q(Z_{f^{*}}\mid Z_f)]+H(Z_{f^{*}}), \end{aligned}$$
(3)

where q is the variational distribution with a mild regularity condition. More detailed derivation can be found in Barber and Agakov [7]. Then, we approximate the expectation in Eq. (3) by using \(f^{0}\).

$$\begin{aligned} I(Z_{f^{*}} ; Z_f)&\ge \mathbb {E}_{Z_{f^{*}}, Z_f}\left[ \log q(Z_{f^{*}} \mid Z_f)\right] +H(Z_{f^{*}})\nonumber \\&\ge \mathbb {E}_{Z_{f^{0}},Z_{f}}\left[ \log q(Z_{f^{0}} \mid Z_{f})\right] - Cd_{2,\infty }(f^{*}, f^{0})+H(Z_{f^{*}}), \end{aligned}$$
(4)

where C is a constant and \(d_{2,\infty }(f^{*},f^{0}):=\sup _{x}\Vert f^{*}(x)-f^{0}(x)\Vert _{2}\). Note that \(d_{2,\infty }\) is a proper metric on the hypothesis set of feature extractors. The last inequality of Eq. (4) is derived by using the first-order Taylor expansion and assuming the regularity condition of q (see the Appendix). We would like to note that the inequality is tight enough due to Taylor’s theorem. In other words, the equality condition of the last inequality of Eq. (4) is \(d_{2,\infty }(f^{*}, f^{0})=0\). Hence, \(d_{2,\infty }(f^{*}, f^{0})\) represents the effect of the pre-trained model \(f^{0}\) on the approximation of the lower bound. Intuitively speaking, the smaller \(d_{2,\infty }(f^{*}, f^{0})\) is, the smaller the gap between the true lower bound and the approximated one. In summary, the MI between \(Z_{f^{*}}\) and \(Z_{f}\) can be maximized by maximizing the term \(\mathbb {E}_{Z_{f^{0}}, Z_f}[\log q(Z_{f^{0}}\mid Z_f)]\).
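The variational bound in Eq. (3) can be checked in closed form on a toy case. The sketch below is an illustration only and is not part of the method: it assumes two scalar standard normal variables with correlation `rho` (standing in for \(Z_{f^{*}}\) and \(Z_{f}\)) and a Gaussian variational distribution \(q(x\mid y)=\mathcal N(ay, s^{2})\), with `a` and `s2` as hypothetical parameters. The bound equals the true MI when q matches the true posterior and is strictly smaller otherwise.

```python
import math

def true_mi(rho):
    """MI (in nats) of two standard normals with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)

def variational_bound(rho, a, s2):
    """E[log q(X|Y)] + H(X) for q(x|y) = N(a*y, s2), in closed form."""
    expected_sq_err = 1.0 - 2.0 * a * rho + a ** 2  # E[(X - a*Y)^2]
    expected_log_q = (-0.5 * math.log(2.0 * math.pi * s2)
                      - expected_sq_err / (2.0 * s2))
    entropy_x = 0.5 * math.log(2.0 * math.pi * math.e)  # H(X) for N(0, 1)
    return expected_log_q + entropy_x

rho = 0.8
mi = true_mi(rho)
# q equal to the true posterior N(rho*y, 1 - rho^2): the bound is tight.
tight = variational_bound(rho, a=rho, s2=1.0 - rho ** 2)
# A mismatched q: the bound is strictly below the true MI.
loose = variational_bound(rho, a=0.5, s2=1.0)
```

The gap between `mi` and `loose` is the expected KL term dropped in Eq. (3), which is why a better-fitting q yields a tighter bound.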

Algorithm 1 (figure a)

Finally, to consider the constraint term, we introduce the Lagrangian method to Eq. (2), then we can derive an objective function from Equation (4):

$$\begin{aligned} R(h)=\mathbb {E}_{Z_{f^0}, Z_f}[\log q(Z_{f^0} \mid Z_f)] - \beta \mathcal {E}_{\mathcal S}(h), \end{aligned}$$
(5)

where \(\beta \) indicates the Lagrangian multiplier. Note that the entropy of \(Z_{f^{*}}\) and \(d_{2,\infty }(f^{*},f^{0})\) are omitted, since they are independent of our optimization target \(h=f\circ g\). In the implementation, we model the variational distribution as a Gaussian distribution with mean vector \(\mu (Z_{f})\) and covariance matrix \(\Sigma (Z_{f})\) and replace the multiplier \(\beta \) with the regularization coefficient \(\lambda \). Then, our final loss function becomes:

$$\begin{aligned} {\textbf {(MIRO)}} \quad \mathcal L(h) = \mathcal {E}_{\mathcal S}(h) + \lambda \mathbb {E}_{Z_{f^{0}},Z_{f}}\left[ \log \left| \Sigma (Z_{f})\right| + \Vert Z_{f^{0}}-\mu (Z_{f})\Vert ^{2}_{\Sigma (Z_{f})^{-1}}\right] , \end{aligned}$$
(6)

where \(\Vert x\Vert _{A}=\sqrt{x^{\intercal }A x}\) and constants independent of h are omitted. Then, we optimize the loss function using a stochastic gradient method. The entire learning process is summarized in Algorithm 1. In the following sections, we empirically justify our approximation of \(f^{*}\) and explain implementation details for the mean and variance encoders of the Gaussian distribution q.
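For reference, the step from Eq. (5) to Eq. (6) follows from the standard Gaussian log-density. Writing k for the feature dimension (a symbol introduced here only for this illustration), the log-likelihood of the variational distribution is

$$\begin{aligned} \log q(Z_{f^{0}} \mid Z_{f}) = -\frac{1}{2}\left[ \log \left| \Sigma (Z_{f})\right| + \Vert Z_{f^{0}}-\mu (Z_{f})\Vert ^{2}_{\Sigma (Z_{f})^{-1}}\right] - \frac{k}{2}\log (2\pi ). \end{aligned}$$

Maximizing the expectation of this quantity is therefore equivalent to minimizing the bracketed term in Eq. (6). Under this reading, the factor 1/2 and the multiplier \(\beta \) are absorbed into \(\lambda \), and the last term is one of the omitted constants.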

Fig. 1. Mutual information \(I\left( Z_{f^{*}}; Z_f \right) \) with oracle model. The mutual information is estimated by MINE [9] in PACS. The oracle model is trained using all four domains. Random and Pre-trained indicate random and pre-trained model initialization, respectively. ERM- and ERM+ are trained from random and pre-trained model initialization, respectively. \(\dagger \) indicates models without fine-tuning. The experiments are repeated with two pre-trained models: ImageNet 1.3M pre-trained ResNet-50 and Instagram 3.6B pre-trained RegNetY-16GF.

3.2 Mutual Information Analysis with the Oracle Model

Here, we empirically show how close our approximation by pre-trained models is to the oracle model and how effective our algorithm is at learning representations with high mutual information (MI) with the underlying oracle model. More specifically, we compare the MI between the candidate models and the oracle model on the PACS dataset. Since the true oracle model is not achievable in practice, we train an oracle model by directly optimizing a model on all domains. We train two oracle models with ResNet-50 and RegNetY-16GF backbones, where the average validation accuracies across all domains are \(97.2\%\) and \(98.4\%\), respectively. We estimate the MI between models by mutual information neural estimation (MINE) [9]. We describe the full details in the Appendix.

Figure 1 illustrates the empirical MI between the candidate models and the oracle model. In the figures, we first observe that the larger and more powerful pre-trained backbone (“Pre-trained” in Fig. 1b) shows higher MI than the smaller backbone (“Pre-trained” in Fig. 1a). Both pre-trained models consistently outperform “Random” in MI regardless of the backbone models. Our observations imply that a larger and stronger model is closer to the oracle model in terms of MI. Similarly, we observe that ERM\(+\) always shows higher MI than ERM−. However, interestingly, in Fig. 1b, we observe that fine-tuning significantly harms the MI of the pre-trained model (“Pre-trained” vs. “ERM\(+\)”) when the pre-trained model becomes larger and more powerful. Our observation is in line with previous studies on fine-tuning of large models [31, 59]. Lastly, in both scenarios of ImageNet pre-trained ResNet (Fig. 1a) and SWAG pre-trained RegNet (Fig. 1b), our MIRO shows the highest MI with the oracle model. Note that the MI with the oracle model may not be completely aligned with the DG performance, but in practice, we observed that the evaluation ranking of the candidates is the same as the MI ranking: MIRO scores the best, followed by ERM\(+\) and ERM−. Detailed results are provided in the Appendix.

3.3 Features and Encoders Design

Multi-scale Features. One can only use the last-level features for our regularization. However, high-level features can include pre-training task-related information, often irrelevant to the target task. Instead, we use the intermediate outputs by each model block, i.e., stem output, blocks 1, 2, 3, and 4 for ResNet [25] and RegNet [46], and stem output, blocks 3, 6, 9, and 12 for ViT-B.

Design of the Mean and Variance Encoders. The multi-level structure increases the feature size, resulting in a computational cost increase. We alleviate the issue by employing simple yet effective architectures, identity function for the mean encoder and a bias-only model with diagonal covariance for the variance encoder. We also tested more complicated architectures, but only computational cost was increased without performance improvement.
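A minimal sketch of the regularizer in Eq. (6) with these encoder choices is given below in plain Python. It assumes the bias-only variance encoder is parameterized as one learnable log-variance per feature dimension and that the regularizers of the different feature levels are summed; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math

def miro_regularizer(pre_feats, cur_feats, log_var):
    """Batch mean of log|Sigma| + ||z0 - mu(z)||^2_{Sigma^-1} for one level.

    pre_feats, cur_feats: lists of feature vectors from f^0 and f.
    log_var: one log-variance per feature dimension (assumed
    parameterization of the bias-only, diagonal variance encoder).
    """
    total = 0.0
    for z0, z in zip(pre_feats, cur_feats):
        for z0_j, z_j, lv in zip(z0, z, log_var):
            mean_j = z_j  # identity mean encoder: mu(z) = z
            total += lv + (z0_j - mean_j) ** 2 / math.exp(lv)
    return total / len(pre_feats)

def miro_loss(task_loss, levels, lam):
    """Task loss plus lam times the regularizer summed over feature levels."""
    return task_loss + lam * sum(miro_regularizer(*lvl) for lvl in levels)

# Toy usage with one feature level, two samples, two feature dimensions.
pre = [[1.0, 2.0], [0.0, 1.0]]
cur = [[1.0, 1.0], [0.0, 3.0]]
log_var = [0.0, 0.0]  # unit variances
reg = miro_regularizer(pre, cur, log_var)
loss = miro_loss(0.7, [(pre, cur, log_var)], lam=0.1)
```

With a diagonal covariance, the log-determinant reduces to the sum of the log-variances, so the cost grows only linearly with the feature size.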

4 Experiments

4.1 Experiment Setups and Implementation Details

Evaluation Protocols and Datasets. We employ the DomainBed evaluation protocols [13, 24] for a fair comparison. Five benchmark datasets are used: PACS [33] (4 domains, 7 classes, and 9,991 images), VLCS [21] (4 domains, 5 classes, and 10,729 images), OfficeHome [56] (4 domains, 65 classes, and 15,588 images), TerraIncognita [8] (4 domains, 10 classes, and 24,788 images), and DomainNet [44] (6 domains, 345 classes, and 586,575 images). All performance scores are evaluated by leave-one-out cross-validation, averaging over all cases that use a single domain as the target (test) domain and the others as the source (training) domains. Every experiment is repeated three times. We leave 20% of source domain data for validation. We use training-domain validation for model selection and hyperparameter search, following DomainBed [24].
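The leave-one-out protocol can be sketched as follows. The helper names are hypothetical, and the accuracies in the usage example are the CORAL DomainNet numbers quoted in Sect. 3.1, reused here only to show the averaging step.

```python
def leave_one_out_splits(domains):
    """Return (target, sources) pairs, holding out each domain once."""
    return [(t, [d for d in domains if d != t]) for t in domains]

def average_out_of_domain_accuracy(per_target_accuracy):
    """Average the accuracies obtained on each held-out target domain."""
    return sum(per_target_accuracy.values()) / len(per_target_accuracy)

pacs = ["art_painting", "cartoon", "photo", "sketch"]
splits = leave_one_out_splits(pacs)  # four (target, sources) cases

# CORAL on DomainNet, per target domain (numbers quoted in Sect. 3.1).
coral = {"clipart": 59.2, "painting": 46.6, "real": 59.8,
         "sketch": 50.1, "quickdraw": 13.4, "infograph": 19.7}
avg = average_out_of_domain_accuracy(coral)  # about 41.5
```

Each dataset-level score in the tables is an average of this kind, so a single hard target domain such as QuickDraw can pull the reported number down considerably.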

Implementation Details. We use ResNet-50 [25] pre-trained in the ImageNet [48] as default. The model is optimized using Adam [28] optimizer. A mini-batch contains all domains and 32 examples per domain. The regularization coefficient \(\lambda \) is tuned in [1.0, 0.1, 0.01, 0.001]. The other hyperparameters, such as batch size, learning rate, dropout rate, and weight decay, are tuned in the similar search space proposed in Cha et al. [13]. We provide full details in Appendix.

4.2 Main Results

Comparison with Domain Generalization Methods. We provide exhaustive out-of-domain performance comparisons on five DG benchmarks in Table 1. Compared to ERM, the proposed MI regularization significantly improves performance on every benchmark dataset, resulting in +1.7pp average improvement. Compared with the state-of-the-art methods, MIRO achieves the best performances in all benchmarks, except PACS. Especially, MIRO remarkably outperforms previous methods: +1.3pp in OfficeHome (mDSDI [11]; \(69.2\% \rightarrow 70.5\%\)) and +1.8pp in TerraIncognita (SagNet [42]; \(48.6\% \rightarrow 50.4\%\)). Considering the extensive experiment setup with 5 datasets and 22 target domains, the results demonstrate the effectiveness of MIRO to the diverse visual data types.

Table 1. Comparison with domain generalization methods. Out-of-domain accuracies on five domain generalization benchmarks are shown. We highlight the best results in bold. The results marked by \(\dagger , \ddagger \) are the reported numbers from Gulrajani and Lopez-Paz [24] and Cha et al. [13], respectively. The results of Fish, SelfReg, and mDSDI are the reported ones from each paper. Average accuracies and standard errors are reported from three trials.

The second part of Table 1 shows the performance with stochastic weight averaging densely (SWAD) [13], a state-of-the-art optimizer for DG by seeking flat minima. Since SWAD is an orthogonal direction to MIRO, we also evaluate the combination of MIRO and SWAD. As shown in the table, the combination of MIRO and SWAD achieves the best performance in all datasets, resulting in +0.8pp average improvement compared to the previous best results.

In the last part of Table 1, we push the limits of the out-of-domain performance by employing a large-scale backbone, RegNetY-16GF pre-trained by SWAG [52], a weakly-supervised pre-trained model using 3.6 billion noisy Instagram images and hashtags. As shown in our previous study on MI with the oracle model, the pre-trained RegNet has higher MI than the ImageNet pre-trained ResNet (Fig. 1). In the experiments, we first observe that the improvement by MIRO becomes remarkably larger than with the ResNet pre-trained model (from +1.7pp to +6.1pp). We presume that this large gap originates from the negative effect of naive fine-tuning, as observed by previous works [31, 59] and our study (Fig. 1b). As shown in Fig. 1b, MIRO keeps the MI with the oracle model high, resulting in remarkable performance gains on large-scale models. We further explore the effect of the scalability of pre-trained models in a later section. Finally, by combining MIRO with the RegNet backbone and SWAD, we achieve the best domain generalization results (77.3%) on our evaluation benchmark.

Table 2. Comparison with various pre-training datasets, methods, and backbones. We compare the performance changes according to the scale of the dataset, the method, and the backbone architecture of pre-training. ResNet-50 architecture is used as default. OH, TI, and DN indicate OfficeHome, TerraIncognita, and DomainNet, respectively. Every accuracy is averaged over three trials.

MIRO with Various Pre-trained Models. In this subsection, we investigate the robustness of the proposed method to the choice of pre-trained models. In Table 2, we explore the performance changes of MIRO by varying pre-training datasets, methods, and backbones. From the pre-training method perspective, we examine two image self-supervised pre-training methods (Barlow Twins [66] and MoCo v3 [16]), one image-language self-supervised pre-training method (CLIP [45]), and one weakly-supervised pre-training method (SWAG [52]), as well as ImageNet supervised pre-training baseline (ImageNet ERM). From the pre-training scale perspective, we employ the ImageNet [48] dataset of 1.3 million examples, the CLIP dataset of 400 million examples, and the Instagram dataset of 3.6 billion examples. We use ResNet-50 [25] backbone architecture as default, but a bigger model is also used for the large-scale pre-training, such as ViT-B [18] for CLIP or RegNetY-16GF [46] for SWAG.

As shown in the table, MIRO improves performances compared with the baseline ERM in all experiments. For the ImageNet pre-training, applying MIRO results in performance improvements of +1.7pp, +3.5pp, and +1.3pp for ERM (supervised learning), Barlow Twins, and MoCo v3, respectively. For the large-scale pre-training, such as CLIP and SWAG, MIRO brings larger performance improvements of +16.3pp, +12.6pp, and +6.1pp for CLIP, CLIP-ViT, and SWAG, respectively. These experiments demonstrate the robustness of the proposed method to the pre-training methods, datasets, and backbone architectures.

Notably, performance improvements of MIRO are remarkable with large-scale pre-trained models, such as CLIP, CLIP-ViT, and SWAG. This is consistent with our observation in Sect. 3.2. Our method helps large-scale pre-trained models (in terms of the pre-training dataset size) not to be biased to the training source domains compared to naive fine-tuning. Especially, naive fine-tuning of CLIP-ViT (61.1%) shows worse out-of-domain performance than fine-tuning ImageNet pre-trained model (64.2%). In contrast, MIRO can leverage the pre-trained knowledge from CLIP-ViT, resulting in superior performance (73.7%) compared with the ImageNet pre-trained model (65.9%). In our later analysis, we show that the knowledge of large-scale pre-trained models is more beneficial to domain generalization than the knowledge of ImageNet pre-trained models.

Table 3. Comparison with methods exploiting pre-trained models. Out-of-domain accuracies on five domain generalization benchmarks are shown. Average accuracies and standard errors are reported from three trials.

Comparison with Methods Exploiting Pre-trained Models. Other DG methods simply employ pre-trained models as weight initialization, while MIRO additionally exploits it in the training process. This is the first approach to exploit pre-trained models in domain generalization, but there are several studies in other fields for different purposes. Table 3 provides a comparison of the methods applicable to our DG settings. We exclude the methods that require additional information other than pre-trained models (e.g., pre-training datasets) or are restricted to a specific model. As shown in the table, MIRO outperforms the comparison methods with large margins. These results demonstrate the effectiveness of our method design for the out-of-domain generalization.

Fig. 2. Distribution of \(\Sigma (z_f)\). We plot the estimated variances, \(\Sigma (z_f)\), for each layer. X-axis indicates the feature layer where the features \(z_f\) are collected. In all datasets, the variances increase as the layer is closer to the output.

4.3 Analysis of MIRO

Loss Function Interpretation: \(\boldsymbol{\Sigma }\) Distribution Analysis. We can interpret the variance term of MIRO, \(\Sigma (z_f)\) in Eq. (6), as control variables of the distance loss between pre-trained features \(z_{f^0}\) and current learning features \(z_f\). During the training phase, if the variance values become smaller then the model will preserve MI with the pre-trained model. On the contrary, when the model needs to learn new information, the variance will increase. We illustrate the learned variances in Fig. 2. The figure shows that pre-trained information is preserved well in lower layers, while task-specific new information is learned in higher layers. This result is consistent with the interpretation that high layer features represent more task-specific semantic information than low layer features [20]; task shifts during fine-tuning make higher layer features learn more semantics than lower layers.
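This interpretation can be made concrete with a short calculation under two simplifying assumptions made here for illustration: a diagonal covariance with one shared variance \(\sigma _{j}^{2}\) per feature dimension j, and features held fixed. Writing \(\delta _{j}=z_{f^{0},j}-z_{f,j}\), the per-dimension regularizer and its stationary point are

$$\begin{aligned} \frac{\partial }{\partial \sigma _{j}^{2}}\, \mathbb {E}\left[ \log \sigma _{j}^{2} + \frac{\delta _{j}^{2}}{\sigma _{j}^{2}}\right] = \frac{1}{\sigma _{j}^{2}} - \frac{\mathbb {E}[\delta _{j}^{2}]}{\sigma _{j}^{4}} = 0 \quad \Longrightarrow \quad \sigma _{j}^{2} = \mathbb {E}[\delta _{j}^{2}]. \end{aligned}$$

Under these assumptions, the learned variance tracks the mean squared distance between the pre-trained and current features, so a layer with small variance is one whose features stayed close to the pre-trained ones.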

Table 4. Performance improvements in Camelyon17 medical dataset. Even in the large distribution shift setup between pre-training and target datasets, MIRO consistently outperforms ERM. Every accuracy is averaged over three trials.

Fig. 3. Comparison of three pre-trained models according to \(\lambda \). Y-axis indicates the performance difference of MIRO to ERM. \(\lambda \) is the intensity of the mutual information regularization. We compare three models: ResNet-50 pre-trained in ImageNet [25], RegNetY-16GF pre-trained by SWAG [52], and ViT-B pre-trained by CLIP [45].

Case Study on Camelyon17: Large Distribution Shift Between Pre-training and Fine-Tuning. As shown in Eq. (4), the tightness of the lower bound is directly connected to the divergence between the representations of the oracle and pre-trained models. Therefore, we investigate the case where there is a large shift between the pre-training and target datasets using the medical dataset [6, 29], Camelyon17. This dataset consists of whole-slide images of histological lymph node sections from five hospitals, where each hospital corresponds to a domain. The task is to predict whether the image contains tumor tissue of breast cancer. There is a large gap between the pre-training distribution (ImageNet or Instagram-3.6B) and the fine-tuning distribution (Camelyon17). Detailed visual examples are provided in the Appendix. The results in Table 4 demonstrate that MIRO leads the model to learn robust representations even in the large distribution shift setup between pre-training and fine-tuning.

Relationship Between the Pre-training Scale and the Intensity of the MI Regularization. Our method has a control parameter \(\lambda \), which controls the balance between the cross-entropy loss and the MI regularization loss. If \(\lambda \) becomes larger, it implies that the strength of MI regularization becomes stronger, while it weakens the strength of the ERM objective. Intuitively, if the pre-trained knowledge is informative enough to the target task, larger \(\lambda \) will improve the performances, while if the pre-trained knowledge is uninformative to the target task, then larger \(\lambda \) can harm the performances, because of the penalty on the ERM objective. We compare three pre-trained models (ImageNet pre-trained model, SWAG, and CLIP-ViT) by varying \(\lambda \). Figure 3 shows how the out-of-domain performance of MIRO with different pre-trained backbones changes by \(\lambda \). The additional results on different datasets are given in Appendix.

First, we observe that the ImageNet pre-trained backbone has a negative correlation between the performance difference and \(\lambda \) in target domains. When the distribution shift is large, such as for the cartoon and sketch domains, we can observe an apparent negative correlation. We presume that this is because the ImageNet samples barely contain non-photo images, such as art painting or sketch images. On the other hand, we observe that MIRO with SWAG and CLIP-ViT backbones makes significant performance improvements when choosing larger \(\lambda \). In other words, SWAG and CLIP-ViT pre-trained knowledge is more helpful for learning robust features for various target domains than the ImageNet pre-trained model. Furthermore, it implies that larger pre-trained models trained with massive diverse domain images show less sensitivity to the choice of \(\lambda \), in addition to bringing remarkable performance improvements as shown in Table 2.

5 Conclusion

Traditional domain generalization (DG) approaches focus on learning a robust representation using multiple source domains. However, with the recent trend of scaling up pre-training, the use of a large-scale pre-trained model becomes more important than the use of DG algorithms for real-world DG. In line with this trend, we propose Mutual Information Regularization with Oracle (MIRO) to robustly exploit the pre-trained model by approximating an oracle model. To do this, we first re-formulate the domain generalization objective by introducing the concept of an oracle model. Then, we derive a tractable variational bound of the objective by approximating the oracle model with the pre-trained model. Our experimental results demonstrate both the effectiveness and the potential of the proposed method. MIRO achieves state-of-the-art performance in the DomainBed benchmarks. Furthermore, when combining MIRO with large-scale pre-trained backbones, such as CLIP [45] or SWAG [52], the performance improvements increase remarkably. We hope that this study promotes a new research direction of exploiting pre-trained backbones to learn robust representations for domain generalization.