1 Introduction

The gold standard for collecting a high-quality supervised training dataset is to ensure that the samples per class are as diverse as possible and that the diversity is distributed as evenly across classes as possible [10, 34]. For example, the “cat” class should contain cats in varying contexts, such as types, poses, and backgrounds, and the same rule applies to the “dog” class. As illustrated in Fig. 1 (a), on such a dataset, any Empirical Risk Minimization (ERM) objective [59], e.g., the widely used softmax cross-entropy loss [16], can easily keep the class feature by penalizing inter-class similarities, while removing the context feature by favoring intra-class similarities. Thanks to the balanced context, the removal is clean. This can be summarized as the common principle:

Principle 1

Class is invariant to context.

For example, a “cat” sample is always a cat regardless of types, shapes, and backgrounds.

Fig. 1. GradCAM [51] visualizations of learned class and context. In (a) and (b): By using ERM, if the context is diverse and balanced within a class, the class feature is accurate—focused on the human’s action; if the context dominates in the data, the class feature contains the context feature, e.g., the background “grass”. In (c): The conventional context estimation [41] based on Principle 1 is biased to class (focusing on the class of human action “throwing”), while our IRMCon based on Principle 2 estimates better context (focusing on the background).

Given testing samples whose contexts are Out-Of-(training)Distribution (OOD), the above ERM model can still classify correctly thanks to its focus on only the context-invariant class feature—model generalization emerges [17, 19, 33]. However, in practice, due to the limited annotation budget, real-world datasets are far from the “golden” balance, and learning the class invariance on imbalanced datasets is challenging. As shown in Fig. 1 (b), if the context “grass” dominates the training data of the class “throwing”, the model will use the spurious correlation “most throwing actions happen on grass” to predict “throwing”. Therefore, the obstacle to OOD generalization is context imbalance.

Existing methods for context or context bias estimation fall into two categories (details in Sect. 2). First, they annotate the context directly [2, 31], as shown in Fig. 2 (c). This annotation incurs additional cost. Moreover, complex contexts are difficult to annotate. For example, it is easy to label the coarse scenes “water” and “grass” but hard to further tell their fine-grained differences. Thus, context supervision is usually incomplete.

Fig. 2. Illustrations of the related approaches [2, 6, 27, 31, 41, 60, 70]. ERM is the baseline; the others and ours aim to mitigate context bias. The components are elaborated below. 1) The length of a context bar indicates the number of samples in that context—a longer bar means the context is more prevalent. 2) A sole bar mixing a color and a class number denotes a feature biased to the prevailing context. Our implementation, IRMCon-IPW, is based on IRM and IPW, and our technical contribution (over the conventional IRM or IPW methods) is the approach of disentangling context features not by using class features but by eliminating them. We provide a theoretical justification in Sect. 4 and an empirical evaluation in Sect. 5.2.

Second, they estimate the context bias by the biased class prediction [4, 27, 41], as shown in Fig. 2 (d). This relies on the contraposition of Principle 1, which is essentially an indirect context estimation.

Principle 1

(Complement). If a feature is not invariant to context, it is not class but context.

Here, the judgment of “not invariant to context” is implemented by using the biased prediction of a classifier, i.e., if the classifier predicts wrongly, it is because class invariance has not yet been achieved in the classifier. Unfortunately, as the classifier is a combined effect of both class and context, it is ill-posed to disentangle whether the bias comes from biased context or from immature class modeling. This is reflected in the incorrect context estimation that is mixed with class (see the upper part of Fig. 1 (c)). In fact, coinciding with recent findings [14, 65], we show in Sect. 5 that existing methods with improper context estimation may even under-perform the ERM baseline. In particular, if the data is less biased, such methods may catastrophically mistake context for class—this limits their applicability to severely biased training data.

In this paper, we propose a more direct and accurate context estimation method without needing any context labels. Our inspiration comes from the other side of Principle 1:

Principle 2

Context is also invariant to class.

For example, the context “grass” is always grassy regardless of its foreground object class.

Principle 1 implies that the success of learning class invariance is due to the varying context. Similarly, Principle 2 tells us that we can learn context invariance with varying classes, and this is even easier to implement because the classes (taken as varying environments [2]) have already been labeled and balanced—a common practice for any supervised training data with an equal sample size per class. In Sect. 4, as illustrated in Fig. 2 (e), we propose a context estimator trained by minimizing a contrastive loss of intra-class sample similarity that is invariant to classes (based on Principle 2). In particular, the invariance is achieved by Invariant Risk Minimization (IRM) [2] with our new loss term. We call our method IRMCon, where Con stands for context. Figure 1 (c) illustrates that our IRMCon can capture better context features. Based on IRMCon, we can simply deploy a re-weighting method, e.g., [35], to generate the balancing weights for different contexts—context balance is achieved.

We follow DomainBed [14] for rigorous and reproducible evaluations, including 1) a strong Empirical Risk Minimization (ERM) baseline, which used to be mistakenly reported as poor in OOD, and 2) a fair validation set for hyper-parameter tuning. Experimental results in Sect. 5 demonstrate that our IRMCon can effectively learn context invariance and eventually improve the context bias estimation, leading to state-of-the-art OOD performance. Another contribution of our experiments is a non-pretraining setting for OOD. Many conventional experiment settings with pretraining, especially those using ImageNet [10], have data leakage issues, as mentioned in related work [62, 66]. We have an in-depth discussion of these issues in Sect. 5.2.

2 Related Work

OOD Tasks. Traditional machine learning heavily relies on the Independent and Identically Distributed (IID) assumption for training and testing data. Under this assumption, model generalization emerges easily [59]. However, this assumption is often violated by data distribution shift in practice—the Out-of-Distribution (OOD) problem causes catastrophic performance drops [18, 47]. In general, any test distribution unseen in training can be understood as an OOD task, such as debiasing [8, 11, 24, 32, 63], long-tailed recognition [23, 37, 56], domain adaptation [5, 12, 58, 69], and domain generalization [28, 40, 52]. In this work, we focus on the most challenging one, where the distribution shift is unlabelled (different from long-tailed recognition, where the shift of the class distribution is known) and even unavailable (different from domain adaptation, where the OOD data is available). We leave other related tasks as future work.

Invariant Feature Learning. The invariant class feature can help the model achieve robust classification when the context distribution changes. The prevalent methods are: 1) Data augmentation [6, 31, 60, 68]. These methods pre-define some image augmentations to artificially enlarge the available context distribution. As the features are only invariant to the augmentation-related contexts, they cannot deal with contexts outside the augmentation inventory. 2) Context Annotation [2, 30, 54]. These methods split data into environments according to context annotations, and penalize the model for feature shifts among different environments. As the features are only invariant to the annotated context, inaccurate and incomplete annotations will impact their feature invariance. 3) Causal Learning [39, 44, 46, 65]. These methods learn causal representations to capture the latent data generation process. Then, they can eliminate the context feature and pursue the causal effect by intervention. They are essentially the re-weighting methods below from a causal perspective. 4) Reweighting [27, 41, 70]. These methods re-balance the context by re-weighting to help invariant feature learning. However, they improperly estimate the context weights by involving class learning in the context bias estimation. This inaccurate estimation severely influences the re-weighting and the invariant feature learning. In contrast, IRMCon directly estimates the context without class prediction. The key difference is demonstrated in Fig. 2 (d) and (e): the output of our IRMCon does not contain class features.

3 Common Pipeline: Invariance as Class

Model generalization in supervised learning is based on the fundamental assumption [20, 64]: any sample x is generated from two disentangled features (or independent causal mechanisms [55]), \(x=g(\textbf{x}_c, \textbf{x}_t)\), where \(\textbf{x}_c\) is the class feature, \(\textbf{x}_t\) is the context feature, and \(g(\cdot )\) is a generative function that transforms the two features in vector space to sample space (e.g., pixels). In particular, the disentanglement naturally encodes the two principles. To see this for Principle 1, if we only change the context of x and obtain a new image \(x'\), we have \(\textbf{x}_c = \textbf{x}'_c\) but \(\textbf{x}_t\ne \textbf{x}'_t\)—class is invariant to context; Principle 2 can be interpreted in a similar way. Therefore, we would like to learn a feature extractor \(\phi _c(x) = \textbf{x}_c\) that helps the subsequent classifier to predict robustly across varying contexts.

3.1 Empirical Risk Minimization (ERM)

If the training data per class is balanced and diverse, i.e., containing sufficient samples in different contexts, it has been theoretically justified that ERM can learn the class feature extractor \(\phi _c(x)\) by minimizing a contrastive-based loss such as the softmax cross-entropy (CE) loss [64]:

$$\begin{aligned} \mathcal {L}_{{\text {ERM}}}(\phi _c, f) = \frac{1}{N}\sum \limits _{i=1}^N {\text {CE}}(y_i, \hat{y}_i = f(\phi _c(x_i))), \end{aligned}$$
(1)

where \(y_i\) is the ground-truth label of \(x_i\) and \(\hat{y}_i\) is the predicted label by the softmax classifier \(f(\cdot )\).
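
For concreteness, below is a minimal PyTorch sketch of this ERM objective; the backbone, feature dimension, and data handling are placeholders rather than the exact architecture used in our experiments.

```python
import torch.nn as nn
import torch.nn.functional as F

class ERMModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.phi_c = backbone                      # class feature extractor phi_c
        self.f = nn.Linear(feat_dim, num_classes)  # softmax classifier f

    def forward(self, x):
        return self.f(self.phi_c(x))

def erm_loss(model, x, y):
    # Eq. (1): CE(y_i, y_hat_i) averaged over the mini-batch.
    return F.cross_entropy(model(x), y)
```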

However, when the data is imbalanced and less diverse, ERM cannot learn \(\phi _c(x) = \textbf{x}_c\). We illustrate this in Fig. 2 (a): if more class 1 samples contain context \(\beta \) than \(\alpha \), the resultant \(\phi _c(x)\) will be biased to the prevailing context, e.g., features for classifying class 1 will be entangled with context \(\beta \). To this end, augmentation-based methods [6, 61] aim to compensate for the imbalance (Fig. 2 (b)). However, as contexts are complex, augmentation will be far from enough to compensate for all of them.

3.2 Invariant Risk Minimization (IRM)

If context annotation is available, we can use IRM [2] to learn \(\phi _c\) by applying Principle 1: \(\phi _c\) should be invariant to different contexts. Compared to ERM on balanced data, which achieves invariance in a passive way via random trials [3], IRM on imbalanced data adopts active intervention, taking contexts as the environments:

$$\begin{aligned} \mathcal {L}_{{\text {IRM}}}(\phi _c, f) = \sum \limits _{e}\Big [\frac{1}{N_e}\sum \limits _{x_i\in e} {\text {CE}}(y_i, \hat{y}_i) + \lambda \big \Vert \nabla _{\theta }\, \frac{1}{N_e}\sum \limits _{x_i\in e} {\text {CE}}(y_i, \hat{y}_i^{\theta })\big \Vert ^2\Big ], \end{aligned}$$
(2)

where \(\hat{y}_i^{\theta }=f(\phi _c(x_i)\cdot \theta )\), e is one of the environments of the training data split according to the context labels, \(N_e\) is the number of samples in environment e, and \(\lambda >0\) is a trade-off hyper-parameter for the invariance regularization term. \(\theta \) is a dummy classifier, whose gradient is not applied to update itself but to calculate the regularization term in Eq. (2). The regularization term encourages \(\phi _c\) to be equally optimal in different environments, i.e., to become invariant to environments (contexts). We follow IRM [2] to set \(\theta \) to 1.
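
Below is a minimal sketch of how Eq. (2) can be computed in PyTorch, following the public IRMv1 recipe [2]: the dummy classifier θ is a scalar fixed at 1 that multiplies the logits, and only its gradient enters the penalty. The environment batching is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def irm_penalty(logits, y):
    # Gradient of the risk w.r.t. the dummy classifier theta (fixed at 1).
    theta = torch.tensor(1.0, requires_grad=True, device=logits.device)
    loss = F.cross_entropy(logits * theta, y)
    grad = torch.autograd.grad(loss, [theta], create_graph=True)[0]
    return (grad ** 2).sum()

def irm_loss(model, envs, lam):
    # envs: list of (x_e, y_e) mini-batches, one per context environment e.
    total = 0.0
    for x_e, y_e in envs:
        logits = model(x_e)  # f(phi_c(x_e))
        total = total + F.cross_entropy(logits, y_e) + lam * irm_penalty(logits, y_e)
    return total
```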

As illustrated in Fig. 2 (c), if we want to learn a common classifier that discriminates 1 and 2 in both environments, the only way is to remove the contexts \(\alpha \) and \(\beta \). However, it has been demonstrated by [36, 65] that context annotation is usually incomplete and using it may even under-perform ERM.

3.3 Inverse Probability Weighting (IPW)

When context annotation is unavailable, we can estimate the context and then re-balance the data according to the estimated context. We begin with the following ERM-IPW loss [22, 50]:

$$\begin{aligned} \mathcal {L}_{{\text {ERM-IPW}}}(\phi _c, f) = \frac{1}{N}\sum \limits _{i=1}^N \frac{1}{P(x_i|\phi _t(x_i))}\, {\text {CE}}(y_i, \hat{y}_i), \end{aligned}$$
(3)

We can see that the key difference between ERM-IPW and ERM is the sample-level IPW term \(1/P(x_i|\phi _t(x_i))\), where \(\phi _t(x) = \textbf{x}_t\) is the context feature extractor. This IPW implies that if x is more likely associated with its context \(\textbf{x}_t\), i.e., the class feature counterpart \(\textbf{x}_c\) is also more likely associated with \(\textbf{x}_t\), we should under-weight the loss because we need to discourage such a context bias.
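
A minimal sketch of the ERM-IPW loss in Eq. (3) is shown below; how the per-sample context probability \(P(x_i|\phi _t(x_i))\) is obtained is deliberately left abstract here, since that estimation is exactly the problem discussed next.

```python
import torch
import torch.nn.functional as F

def erm_ipw_loss(model, x, y, context_prob):
    # context_prob[i] approximates P(x_i | phi_t(x_i)); samples in rare
    # contexts receive larger inverse-probability weights.
    per_sample_ce = F.cross_entropy(model(x), y, reduction="none")
    weights = 1.0 / context_prob.clamp(min=1e-6)
    return (weights * per_sample_ce).mean()
```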

However, the context estimation of \(\phi _t\) is almost as challenging as learning \(\phi _c\). Instead, a prevailing strategy is to estimate it by a biased classifier [27, 41], e.g.,

$$\begin{aligned} P(x_i|\phi _t(x_i)) \approx \big [f_b(\phi _b(x_i))\big ]_{y_i}, \end{aligned}$$
(4)

where \(\phi _b\) is the bias feature extractor and \(f_b\) is the bias classifier. \(\phi _b\) and \(f_b\) are minimized by ERM equipped with generalized cross entropy (GCE) loss [71]:

$$\begin{aligned} \mathcal {L}_{{\text {GCE}}}(\phi _b, f_b) = \frac{1}{N}\sum \limits _{i=1}^N {\text {GCE}}(y_i, \hat{y}_i^b = f_b(\phi _b(x_i))), \end{aligned}$$
(5)

where \({\text {GCE}}(y,\hat{y})\!=\!\sum _{k=1}^n y_k\cdot \frac{1-{\hat{y}_k}^q}{q}\) is used to amplify the bias, q is a constant, k indexes the classes, and n is the number of classes. However, the loss in Eq. (5) inevitably includes the effect of the class feature \(\textbf{x}_c\), due to the aforementioned assumption \(x = g(\textbf{x}_c, \textbf{x}_t)\). In other words, such a combined effect cannot distinguish whether the bias is from class or context, resulting in inaccurate context estimation. We show the illustration in Fig. 2 (d). Specifically, the weights are estimated from both class and context, and are thus inaccurate for balancing the context. In addition, the experimental results in Fig. 6 (Bottom) testify that inaccurate context estimation severely hurts the performance, i.e., fails to derive unbiased classifiers.
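
For reference, a minimal sketch of the GCE term in Eq. (5) is given below; q = 0.7 is only the commonly used default of [71], not necessarily the value used in our experiments.

```python
import torch.nn.functional as F

def gce_loss(logits, y, q: float = 0.7):
    # GCE(y, y_hat) = (1 - y_hat_y^q) / q for one-hot labels y.
    probs = F.softmax(logits, dim=1)
    p_y = probs.gather(1, y.unsqueeze(1)).squeeze(1)  # predicted prob. of the true class
    return ((1.0 - p_y.clamp(min=1e-6) ** q) / q).mean()
```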

4 Our Approach: Invariance as Context

To tackle the inaccurate context estimation of \(\phi _t(x)\), we propose to apply Principle 2 as a way out. As illustrated in Fig. 2 (e), if we consider each class as an environment, we can clearly see that the only environmental change is the class, which has already been labeled. This motivates us to apply IRM to learn invariance as context by removing the environment-equivariant class. The crux is how to design the contrastive-based loss—more specifically, how to modify \(\theta \) and \({\text {CE}}(\cdot )\) in Eq. (2). The following is our novel solution.

We design a new contrastive loss based on the intra-class (environment) sample similarity, as follows,

$$\begin{aligned} \mathcal {L}_{{\text {Con}}}^e(\phi _t, \theta ) = -\frac{1}{N_e}\sum \limits _{x_i\in e} \log \frac{\exp \big (\phi _t(x_i)^{\top }\phi _t(\texttt {Aug}(x_i))\cdot \theta \big )}{\sum \limits _{x_j\in e}\exp \big (\phi _t(x_i)^{\top }\phi _t(\texttt {Aug}(x_j))\cdot \theta \big )}, \end{aligned}$$
(6)

where \(\texttt {Aug}(\cdot )\) denotes common augmentations, such as flips and Gaussian noise (used in standard contrastive losses [7, 13, 15]); e is an environment split by class, e.g., under environment \(e_1\), any \(x_i \in e_1\) has class label 1; and \(\theta \) is the dummy classifier, which we add here for convenience in introducing Eq. (7). The reason for using a contrastive loss is that it preserves all the intrinsic features of each sample [43, 64]. Yet, without the invariance to class, \(\phi _t(x)\ne \textbf{x}_t\). Then, based on Eq. (2), our proposed IRMCon for learning “invariance as context” is:

$$\begin{aligned} \mathcal {L}_{{\text {IRMCon}}}(\phi _t) = \sum \limits _{e}\Big [\mathcal {L}_{{\text {Con}}}^e(\phi _t, \theta ) + \lambda \big \Vert \nabla _{\theta }\, \mathcal {L}_{{\text {Con}}}^e(\phi _t, \theta )\big \Vert ^2\Big ], \end{aligned}$$
(7)
Fig. 3. The training pipeline of our IRMCon-IPW. 1) “split env.” denotes that we split the training samples in a mini-batch into subsets based on class labels, i.e., the samples of each class form one subset, yielding N environments \(\{e_i\}_1^N\); 2) \(\theta \) is a dummy classifier, whose gradient is used to regularize \(\phi _t\) to become invariant to classes. See the detailed algorithm in the Appendix.

where \(\theta \) plays the same role as in Eq. (2), regularizing \(\phi _t\) to be invariant to environments (classes). We can prove that solving Eq. (7) achieves \(\phi _t(x)= \textbf{x}_t\), i.e., the context feature is disentangled (see the Appendix). As demonstrated in Fig. 4, \(\phi _t\) can extract accurate context features. Thanks to \(\phi _t\), we can further improve IPW:

$$\begin{aligned} \mathcal {L}_{{\text {IRMCon-IPW}}}(\phi _c, f) = \frac{1}{N}\sum \limits _{i=1}^N \frac{1}{P(x_i|\textbf{x}_t)}\, {\text {CE}}(y_i, \hat{y}_i), \end{aligned}$$
(8)

where \(\textbf{x}_t=\phi _t(x)\). We train \(f_b\) by using GCE loss, just replacing \(\phi _b(x)\) with \(\textbf{x}_t\) in Eq. (5). \(\phi _t\) is trained by IRMCon and then fixed when estimating the context.
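
The sketch below illustrates one way to implement Eqs. (6) and (7): an InfoNCE-style intra-class contrastive loss over augmented views, with the dummy classifier θ scaling the similarities as in the IRMv1 recipe. The exact similarity function, temperature, and placement of θ are assumptions for illustration; see the Appendix for the precise algorithm.

```python
import torch
import torch.nn.functional as F

def intra_env_contrastive(z1, z2, theta, tau: float = 0.5):
    # z1, z2: phi_t(x) and phi_t(Aug(x)) for one class environment e.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = (z1 @ z2.t()) * theta / tau           # dummy theta scales the similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)        # positive pairs lie on the diagonal

def irmcon_loss(phi_t, envs, aug, lam):
    # envs: list of per-class mini-batches x_e (environments split by class label).
    total = 0.0
    for x_e in envs:
        theta = torch.tensor(1.0, requires_grad=True, device=x_e.device)
        loss_e = intra_env_contrastive(phi_t(x_e), phi_t(aug(x_e)), theta)
        grad = torch.autograd.grad(loss_e, [theta], create_graph=True)[0]
        total = total + loss_e + lam * (grad ** 2).sum()
    return total
```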

As shown in Fig. 5, our biased classifier can estimate more accurate weights and thus perform better re-weighting than the traditional one. We streamline the proposed IRMCon-IPW in Fig. 3 and summarize our algorithm in the Appendix.

Fig. 4. t-SNE [38] visualizations of our context features of the Colored MNIST test samples. The color of points denotes their class labels. IRMCon is trained on the 99% biased training set. Features are naturally clustered by context. As there is no context ground-truth, the context labels are interpreted by us.

5 Experiments

We introduce the benchmarks of two OOD generalization tasks, removing context bias (also called debiasing) and mitigating domain gaps (also called domain generalization, DG), as well as our implementation details in Sect. 5.1. Then, we evaluate the effectiveness of our approach based on the experimental results in Sect. 5.2.

5.1 Datasets and Settings

Context Biased Datasets. We follow LfF [41] to use two synthetic datasets,

Fig. 5. Illustrations of the reweighted sample frequencies for 10 color contexts. All models are trained on the 99.5% biased Colored MNIST. The reweighted frequency of a context indicates the normalized sum over the inverse probabilities of the samples in this context. Top: Biased context distribution in the training set. Middle: Biased context distribution derived by using LfF [41]. Bottom: Relatively balanced context distribution by using our method.

Colored MNIST and Corrupted CIFAR-10, and one real-world dataset, Biased Action Recognition (BAR) [41], for evaluation.

On each dataset, we manually control the context bias ratio by generating (in synthetic datasets) or sampling (in the real-world dataset) training images.

Specifically, on Colored MNIST, we follow LfF to generate 10 colors as 10 contexts. We pair each digit (class) with a specific color and dye the samples at a ratio chosen from {99.9%, 99.8%, 99.5%, 99.0%, 98.0%, 95.0%} to construct each biased training set. In the test set, the 10 colors are uniformly distributed over the samples of each class. For Corrupted CIFAR-10, we follow LfF to use {Saturate, Elastic, Impulse, Brightness, Contrast, Gaussian, Defocus Blur, Pixelate, Gaussian Blur, Frost} as 10 contexts. Similar to Colored MNIST, we generate context-biased training sets by pairing a context and a class at a ratio chosen from {99.5%, 99.0%, 98.0%, 95.0%}. In the test set, the 10 corruptions are uniformly distributed.
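
As a hypothetical illustration of the biasing protocol (the exact color palette and pairing follow LfF [41] and are given in the Appendix), a digit receives its paired color with probability equal to the bias ratio and a random color otherwise:

```python
import numpy as np

def colorize_mnist(images, labels, bias_ratio, palette, seed=0):
    # images: (N, 28, 28) grayscale in [0, 1]; palette: (10, 3) RGB colors,
    # where class c is paired with palette[c]. Values here are illustrative.
    rng = np.random.default_rng(seed)
    colored = np.zeros((len(images), 28, 28, 3), dtype=np.float32)
    for i, (img, y) in enumerate(zip(images, labels)):
        c = y if rng.random() < bias_ratio else rng.integers(10)
        colored[i] = img[..., None] * palette[c]  # dye the digit with color c
    return colored
```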

The real-world dataset BAR contains six kinds of action-place bias, each between a human action and a background, e.g., “throwing” always happens with the “grass” background. We choose bias ratios from {99.0%, 95.0%}.

Domain Gap Dataset. We use PACS [28] to evaluate our method. It consists of seven object categories spanning four image domains: Photo, Art-painting, Cartoon, and Sketch. We follow DomainBed [14] to select, each time, three domains for training and leave the remaining one for testing. More details about the datasets, e.g., the number and size of the training images, are given in the Appendix.

Table 1. Accuracy (%) on context biased datasets compared with SOTA methods. We reproduced the methods and averaged the results over three independent trials (mean±std). “*”: due to reproduction mismatch issues, the performance is quoted from the original paper; our reproduced results are reported in the Appendix. “-”: no result reported in that setting.

Comparing Methods. As the two types of datasets have their own state-of-the-art (SOTA) methods, we compare with different SOTA methods in context biased benchmark and domain gap benchmark, respectively.

For the context biased datasets, we compare with Rebias [4], EnD [57], LfF [41], and Feat-Aug [27]. For the domain gap dataset (DG task), we compare with domain-label based methods, such as DANN [1], Fish [53], and TRM [67], as well as domain-label free methods, such as RSC [21] and StableNet [70]. As we claimed at the end of Sect. 3.1, we train all models from scratch. This makes some DG methods (e.g., MMD [30] and CDANN [42]) hard to converge.

Implementation Details. We first introduce two implementation details to deal with the implementation issues we met, and then provide training details.

1) Weighted sample strategy. This strategy is for the biased datasets. For example, under the 99.9% biased training set, all the images in a mini-batch may have the same context within a class, unless we sample over 1,000 images per class to get one sample with a non-biased context. To solve this issue, we use the bias model from LfF [41] to learn an (inaccurate) context estimator, and based on its inverse probabilities we sample a relatively context-balanced mini-batch. This strategy frees us from sampling a very large batch to learn Eq. (6).
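
A minimal sketch of this weighted sampling strategy is given below, assuming the (inaccurate) context estimates from the LfF bias model have already been converted into per-sample probabilities; the sampler then over-samples rare-context images. Names and the batch size are illustrative.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def context_balanced_loader(dataset, context_prob, batch_size=256):
    # context_prob[i]: the bias model's estimate that sample i carries the
    # prevailing context; rare-context samples receive larger weights.
    weights = 1.0 / torch.as_tensor(context_prob, dtype=torch.float).clamp(min=1e-6)
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```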

Table 2. Accuracy (%) on the domain generalization dataset PACS [28]. We reproduced all the methods by the DomainBed [14] code base without pretraining. Results are averaged over 3 independent trials (mean±std). “-” denotes that methods fail to converge when training from scratch.

2) Strategy for learning augmentation-related context. It is hard to learn augmentation-related context when using a contrastive loss: to minimize the contrastive loss, the model needs to learn invariance to the augmentations, i.e., augmentation-related features will be removed. On Corrupted CIFAR-10, we therefore add the classification loss in Eq. (5) to our IRMCon loss to train the context extractor. Please note that we use this strategy only for Corrupted CIFAR-10, as the context on this dataset is dominated by augmentation-related context, e.g., 95% of “car” samples have the augmentation-related context “Gaussian noise”. Due to space limits, we put other details in the Appendix.

3) Training details. On Colored MNIST, we use 3-layer MLPs to model \(\phi _c\), \(\phi _b\), and \(\phi _t\). On Corrupted CIFAR-10, we use ResNet-18 for \(\phi _c\) and 3-layer CNNs for \(\phi _b\) and \(\phi _t\). On BAR and PACS, we use ResNet-18 for \(\phi _c\), \(\phi _b\), and \(\phi _t\). For optimization on the context biased datasets, we follow LfF [41] to use the Adam [25] optimizer with a learning rate of 0.001. Other detailed settings, e.g., batch size, epochs, and \(\lambda \) in each setting, can be found in the Appendix.

On all datasets, we follow DomainBed [14] to randomly split the original unbiased test set into 20% and 80% as the validation set and test set, respectively, and select the best model based on validation results. We average the results of three independent runs, and report them in the format of “mean accuracy ± standard deviation”.
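
The split itself is straightforward; a sketch (with an illustrative seed) is:

```python
import torch
from torch.utils.data import random_split

def split_val_test(unbiased_test_set, seed=0):
    # 20% for validation (model selection), 80% held out for testing.
    n_val = int(0.2 * len(unbiased_test_set))
    g = torch.Generator().manual_seed(seed)
    return random_split(unbiased_test_set, [n_val, len(unbiased_test_set) - n_val], generator=g)
```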

5.2 Results and Analyses

IRMCon-IPW Achieves SOTA. We show our results on the context biased datasets in Table 1 and on the domain gap dataset in Table 2.

1) Table 1 shows that our IRMCon-IPW outperforms the related methods by clear margins.

In particular, the improvements are more obvious in the settings with higher bias ratios. The possible reason is that when the bias ratio is higher, “rare” context samples become fewer, and re-weighting methods become more sensitive to the accuracy of the context weight estimation. Therefore, accurate context estimation plays a more essential role. Compared to the related methods, our IRMCon can estimate the context more accurately, i.e., extract high-quality context features as illustrated in Fig. 4, and its gain over the others becomes more obvious as the context bias ratio increases.

Fig. 6. Accuracy (%) of models trained on the context-balanced set of Colored MNIST. Top: ERM is stable on test sets with varying context biases. Bottom: due to incorrect context estimation, traditional re-weighting methods degenerate significantly compared to ERM when trained on the context-balanced set; thanks to the correct context estimation, our IRMCon-IPW achieves comparable performance to ERM.

2) Table 2 presents that on the domain gap dataset, our method outperforms ERM and also achieves the best average performance over all the domain label-free methods. In addition, it achieves comparable results to the other DG methods (in the upper block) which need domain labels.

Why does ERM perform so well in most cases? On PACS, we follow the DomainBed [14] to implement a strong ERM baseline. On BAR, we use the strong augmentation strategy, Random Augmentation [9], which can be considered as an OOD method as shown in Fig. 2 (b). If we do not apply such strong augmentations, ERM performance drops significantly. We show the corresponding results in Appendix.

Why do we train models from scratch for OOD problems? We challenge the traditional pretraining settings in some OOD tasks, such as domain generalization, because we are concerned that the data or knowledge of the test set may have been leaked to the model when it is pretrained on large-scale image datasets. Data leakage is a common problem in pretraining settings, e.g., ImageNet [10] leaks to CUB [62]. Such a problem severely undermines the validity of the OOD task [66]. Empirically, we provide an observation in domain generalization to justify our challenge. In pretraining settings, ERM achieves an “impressive” 98% test accuracy [14] when the Photo domain is used for testing. This number is significantly higher (by around 20%) than when Cartoon or Sketch is used for testing. However, this is not the case without ImageNet pretraining; see Table 2, bottom block, first line, the ERM method. The reason is that ImageNet, collected from the real world, leaks more real images into Photo, compared to the artificial images in Cartoon and Sketch. Therefore, we propose the non-pretraining setting for all OOD benchmarks to prevent the leakage problem.

Fig. 7. Comparing the bias classification heads in LfF [41] (LfF-BH) and in ours (IRMCon-BH) on Colored MNIST with different bias ratios. The bias classification heads (BH) intentionally use context to predict class. Our bias head is almost the same as the upper bound case in the test set—random class prediction (10%).

How to evaluate the context feature learned in IRMCon-IPW? We visualize the comparisons between the context features learned by IRMCon-IPW and LfF in Fig. 7. We show the training and test accuracies of the linear classifiers (which we call bias classification heads) that are trained with context features and class labels, i.e., to learn the bias intentionally. We can see from the figure that ours shows almost the same learning behavior as the upper bound case: context is invariant to class and should predict the class by random chance. This means that IRMCon-IPW is able to recover the oracle distribution of contexts in the image. It can be taken as support for the bottom illustration in Fig. 5, where using our weights achieves a balanced context distribution—the ground-truth distribution.
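
Concretely, the probe can be implemented as a linear head trained on frozen context features; the optimizer settings below are illustrative, not the exact ones we used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_bias_head(phi_t, loader, feat_dim, num_classes=10, epochs=10, lr=1e-3):
    # Freeze phi_t and intentionally try to predict the class from context
    # features; near-chance test accuracy (10%) indicates class-free context.
    phi_t.eval()
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                feats = phi_t(x)
            loss = F.cross_entropy(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```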

How does IRMCon-IPW tackle domain gap issues? Compared to the datasets with a pre-defined context distribution in training (e.g., the color distribution set for each class in the Colored MNIST dataset [41]), a domain gap dataset such as PACS does not have such explicit context settings. However, it has an implicit context distribution related to the domain. This distribution is often imbalanced, which leads to context bias problems (similar to context biased datasets such as BAR). Therefore, our method can help PACS to “debias”. We notice that, compared to ERM, our improvement on PACS is not as significant as that on the context biased datasets. This might be because the context bias in PACS is not as severe as that in the context biased datasets.

Failure Cases. We show some failure cases of our IRMCon in Fig. 8. Failure cases are selected as samples whose IRMCon-IPW classification results are wrong. As expected, we see that the key reason for failure is incorrect context estimation, e.g., the estimated contexts are mixed with the foreground or wrongly attend to the foreground. By inspecting the BAR dataset, we find that some contexts, e.g., “pool” for the class “diving”, are relatively unique to certain classes. This implies that the context is NOT invariant to class. We conjecture that this is a dataset failure, and the only way to resolve it is to bring in external knowledge.

Fig. 8. GradCAM [51] visualizations of IRMCon-IPW failure cases. Top: input test images; Middle: context visualization by the bias classifier of IRMCon; Bottom: class visualization. The left four columns are selected from the BAR test set, where the model is trained on the 99% biased training set; the right four are selected from the Photo domain of PACS, where the model is trained on the other three domains. GT: ground-truth label; P: predicted label.

6 Conclusions

Context imbalance is the main challenge in learning class invariance for OOD generalization. Prior work tackles this challenge in two ways: 1) relying on context supervision and 2) estimating the context bias by classifier failures. We showed how they fail and hence proposed a novel approach called IRM for Context (IRMCon) that directly learns the context feature without context supervision. The success of IRMCon is based on the fact that context is invariant to class, which is the overlooked other side of the common principle that class is invariant to context. Thanks to the class supervision that is already provided as environments in the training data, IRMCon can achieve context invariance by using IRM on an intra-class sample-similarity contrastive loss. We used the context feature for Inverse Probability Weighting (IPW), a method for context balancing, to learn the final classifier that generalizes to OOD. IRMCon-IPW achieves state-of-the-art results on several OOD benchmarks.