
1 Introduction

After deep neural networks (DNNs) trained on a given dataset (i.e., source domain) are deployed to a new environment (i.e., target domain), the DNNs make predictions on data from the target domain. However, in most cases, the distributions of the source and target domains differ significantly, which degrades the model's performance in the target domain. If the deployed model does not remain stationary during test time but instead adapts to the new environment using clues from the unlabeled target data, its performance can be improved [25, 38, 39, 42, 52, 59, 63, 66].

Fig. 1. Comparison of average error (\(\%\)) between our approach and other methods with varying learning rates on CIFAR-100-C [20]. The x- and y-axes are the learning rate and average error rate, respectively. (a) Our method significantly outperforms the other three methods: (b) updating the entire parameters with only entropy minimization, (c) the state-of-the-art method TENT [59], and (d) a supervised method. (e) Our proposed SWR keeps the performance stable when combined with entropy minimization, even at higher learning rates ([1e-3, 1e-4]).

Recently, several studies [25, 42, 59, 63] have proposed test-time adaptation, which updates the model during test time after model deployment. However, it is extremely challenging to adapt the model to the target domain with only unlabeled online data. As shown in Fig. 1(b), adapting the entire model parameters may be detrimental due to erroneous signals from an unsupervised objective such as entropy minimization [15, 26, 40, 57, 59], and the performance may be highly dependent on the learning rate. In addition, since test-time adaptation can access each unlabeled target sample only once, and the adaptation proceeds simultaneously with the evaluation, updating all network parameters may result in overfitting [17, 62]. Thus, several approaches propose updating only part of the network [25, 42, 59, 63], such as the batch normalization [24] or classifier layers. In particular, T3A [25] proposes an optimization-free method that adapts only the classifier layers using unlabeled target data, and TENT [59] updates batch statistics and affine parameters in the batch normalization layers by entropy minimization on unlabeled target data. However, updating only a subset of the model's parameters or layers may yield only marginal performance improvement, as shown in Fig. 1(c). Furthermore, such methods cannot be applied to architectures that lack the required layer, such as batch normalization or a linear classifier.

Other approaches [39, 52] propose to jointly optimize a main task and a self-supervised task, such as rotation prediction [13] or instance discrimination [6, 18], during pre-training in the source domain, and then update the model using only the self-supervised task during test time. In contrast to the unsupervised objective for the main task, which depends heavily on the model's prediction accuracy, the self-supervised task always obtains a proper supervisory signal. However, the self-supervised task may interfere with the main task if the two tasks are not properly aligned [39, 50, 64]. In addition, these approaches cannot be applied to adapt arbitrary pre-trained models to the target domain since they require specific pre-training methods in the source domain.

To resolve these issues, we present two novel approaches for test-time adaptation. First, we propose shift-agnostic weight regularization (SWR), which enables the model to quickly adapt to the target domain and is beneficial when updating the entire model parameters with a high learning rate. In contrast to Fig. 1(b), entropy minimization combined with the proposed SWR shows superior performance and less dependency on the choice of learning rate, as shown in Fig. 1(e). Based on their sensitivity to the distribution shift, SWR partitions the model parameters into shift-agnostic and shift-biased parameters, updating the former less and the latter more. Second, we present an auxiliary task based on a non-parametric nearest source prototype (NSP) classifier, which pulls the target representation closer to its nearest source prototype. With the NSP classifier, the source and target representations can be well aligned, which significantly improves the performance of the main task. Our proposed method (Fig. 1(a)) outperforms the state-of-the-art method [59] (Fig. 1(c)) and even the supervised method using ground-truth labels (Fig. 1(d)).

Our method requires access to the source data to identify shift-agnostic and shift-biased parameters and to generate source prototypes before model deployment, but it is applicable to any model regardless of its architecture or pre-training procedure. If a given model is pre-trained on open datasets, or if the source data owner deploys the model, the source data is accessible before model deployment. In this case, our method significantly enhances the test-time adaptation capability by leveraging the source data without modifying the pre-trained model. Unlike TTT [52] and TTT++ [39], we do not change the pre-training method of a given model, so our method can benefit from any strong pre-trained model, such as AugMix [21] (Table 1) or CORAL [51] (Table 6), as a good starting point for test-time adaptation. In these respects, we believe our method is practical.

The major contributions of this paper can be summarized as follows:

  • Two novel approaches for test-time adaptation are presented in this paper. The proposed SWR enables the model to quickly and reliably adapt to the target domain, and the NSP classifier aligns the source and target features to reduce the distribution shift, leading to further performance improvement.

  • Our test-time adaptation method is model-agnostic and not dependent on the pre-training method in the source domain, and thus it can be applied to any pre-trained model. Therefore, our method can also complement other domain generalization approaches that mainly focus on the pre-training method in the source domain before model deployment.

  • We show that our method achieves state-of-the-art performance through extensive experiments on CIFAR-10-C, CIFAR-100-C, ImageNet-C [20], and domain generalization benchmarks including PACS [31], OfficeHome [55], VLCS [10], and TerraIncognita [5]. Notably, our method even outperforms its supervised counterpart on the CIFAR-100-C dataset.

2 Related Work

2.1 Source-Free Domain Adaptation

Unsupervised domain adaptation (UDA) [11, 12, 22, 46, 51, 53, 57] assumes simultaneous access to both the source and target domains. In practice, however, data is often distributed across multiple devices, and UDA then requires data sharing for simultaneous access to all of it, which is often impossible due to data privacy concerns, limited bandwidth, and computational cost. Source-free domain adaptation [1, 30, 36, 38, 58, 60, 61] overcomes this challenge by adapting a source pre-trained model to the target domain using only unlabeled target data. These approaches focus on offline adaptation, in which the same target sample is fed to the model multiple times during target adaptation, whereas our method concentrates on online adaptation.

2.2 Test-Time Adaptation and Training

Test-time adaptation focuses on online adaptation, in which all target data can be accessed only once at test time and adaptation is performed simultaneously with evaluation. More specifically, target samples are forward-propagated through the model for evaluation, and the error signal from the model's output is then backpropagated in an unsupervised manner for training [59]. Several studies adopt self-supervised learning, such as rotation prediction [52] or instance discrimination [39], to jointly optimize the main and self-supervised tasks on the source domain and then optimize only the self-supervised task on the target domain. However, these methods are not universally applicable to arbitrary pre-trained models as they require specific pre-training methods in the source domain. Recently, model-agnostic test-time adaptation methods independent of the pre-training method in the source domain have been proposed [25, 42, 59, 63]. TENT [59] uses the batch statistics of the target domain and optimizes channel-wise affine parameters with an entropy minimization loss. T3A [25] proposes an optimization-free method that adjusts a pre-trained linear classifier by updating the prototype for each class during test time. However, since these methods update only a subset of the model's parameters or layers, such as the batch normalization [42, 59, 63] or classifier layer [25], they may be suboptimal for target adaptation.

2.3 Domain Generalization

Since UDA aims to adapt the model to a predefined target domain before model deployment, it cannot guarantee generalization performance on other arbitrary target domains. On the other hand, domain generalization (DG) differs from UDA in that it assumes the model accesses only the source domain during training time before model deployment, and it aims to improve generalization capability in arbitrary unseen target domains. Numerous DG approaches using meta-learning [3, 32, 33], normalization [7, 8, 44, 47], adversarial training [34, 37, 45], and data augmentation [14, 35, 56, 67] have been proposed to learn domain-agnostic feature representations for the target domain. However, these studies focus only on methods at training time before model deployment, whereas our method focuses on test-time adaptation after model deployment.

3 Proposed Method

Assume that the model parameters \(\theta \) trained on the source domain consist of an encoder part \(\theta _e\) and a classifier part \(\theta _c\), as shown in Fig. 2(c). After being deployed to the target domain, the model infers the class probability distribution of the target sample and then optimizes our proposed test-time adaptation loss \(\mathcal {L}^{\text {target}}_{\theta _e,\theta _c}\). The overall loss of our proposed method is defined as

$$\begin{aligned} \mathcal {L}^{\text {target}}_{\theta _e,\theta _c} = \mathcal {L}^{\text {main}}_{\theta _e,\theta _c} + \mathcal {L}^{\text {aux}}_{\theta _e} + \lambda _r\sum _l w_l\Vert \boldsymbol{\theta }_l-\boldsymbol{\theta }_l^*\Vert ^2_2, \end{aligned}$$
(1)

where \(w_l\) denotes the l-th element of the penalty vector \(\boldsymbol{w}\) used to control the update of the model parameters, \(\boldsymbol{\theta }_l\) is the parameter vector of the l-th layer of the model, \(\boldsymbol{\theta }_l^*\) is the corresponding parameter vector from the previous update step, \(\lambda _r\) is the importance of the regularization term, and \(\mathcal {L}^{\text {main}}_{\theta _e,\theta _c}\) and \(\mathcal {L}^{\text {aux}}_{\theta _e}\) denote the main and auxiliary task losses, respectively. Optimizing the main task loss updates the entire model parameters \(\theta _e\) and \(\theta _c\), whereas optimizing the auxiliary task loss updates only the encoder part \(\theta _e\). We first present the shift-agnostic weight regularization (SWR), then describe the entropy objective of the main task, and finally propose an auxiliary task based on a nearest source prototype (NSP) classifier, which directly benefits the main task.
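To make the structure of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the regularization term. It treats each parameter tensor as one "layer" and assumes the penalty vector \(\boldsymbol{w}\) (Eq. (2)) and a snapshot of \(\boldsymbol{\theta }^*\) are available; the names `snapshot_params` and `swr_regularizer` are our illustrative choices, not part of any released implementation.

```python
import torch

def snapshot_params(model):
    # theta^*: detached copies of the current parameters, taken
    # before each test-time update step.
    return [p.detach().clone() for p in model.parameters()]

def swr_regularizer(model, prev_params, w):
    # Third term of Eq. (1): layer-wise penalized squared L2 distance
    # between the current parameters and theta^*.
    reg = 0.0
    for w_l, p, p_prev in zip(w, model.parameters(), prev_params):
        reg = reg + w_l * (p - p_prev).pow(2).sum()
    return reg

# Usage at each adaptation step (main/aux losses defined in Sects. 3.2-3.3):
# loss = main_loss + aux_loss + lambda_r * swr_regularizer(model, prev, w)
```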

Fig. 2. Our method consists of two stages: (b) and (c). (a) Our method takes the pre-trained model in an off-the-shelf manner and (b) generates the penalty vector \(\boldsymbol{w}\) and the source prototypes \(\boldsymbol{q}\) while keeping the model frozen before model deployment. (c) After model deployment, our method accesses only unlabeled online target data \(D_t\), not labeled source data \(D_s\), during test-time adaptation.

3.1 Shift-Agnostic Weight Regularization

The main idea of the SWR is to impose a different penalty on each parameter update during test-time adaptation, depending on the sensitivity of each model parameter to the distribution shift. Assuming that the distribution shift is mainly caused by color and blur shifts, we mimic it using transformation techniques such as color distortion and Gaussian blur. Experiments on variations of the SWR, including the use of other transform functions, can be found in the supplementary Section B.

To obtain the penalty vector \(\boldsymbol{w}\) specified in Eq. (1), we first forward-propagate two input images (i.e., an original and its transformed image) through the pre-trained source model and then back-propagate the task loss (i.e., cross entropy) using the source labels to produce two sets \(\boldsymbol{g}\) and \(\boldsymbol{g}'\) of L gradient vectors, respectively. Note that L is the total number of layers in the model. Then the l-th element \(w_l\) of the penalty vector \(\boldsymbol{w}\) is calculated by employing the average cosine similarity \(s_l\) between two gradient vectors, \(\boldsymbol{g}_l\) and \(\boldsymbol{g}'_l\) from N source samples as

$$\begin{aligned} \begin{aligned} s_l&=\frac{1}{N}\sum _{i=1}^N\frac{\boldsymbol{g}^i_l\cdot {\boldsymbol{g}_{l}'^i}}{\Vert {\boldsymbol{g}^i_l}\Vert \Vert {{\boldsymbol{g}_{l}'^i}}\Vert }\in \mathbb {R}, \\ \boldsymbol{w}&=\left( \nu \left[s_1,\dots ,s_l,\dots ,s_{L} \right]\right) ^2\in \mathbb {R}^L, \end{aligned} \end{aligned}$$
(2)

where \(\nu \left[\cdot \right]\) denotes min-max normalization to the range [0,1], and \(\boldsymbol{g}^i_l\) and \(\boldsymbol{g}_{l}'^i\) denote the l-th gradient vectors for the i-th source sample and its transformed sample, respectively. N denotes the total number of samples. Note that the penalty vector \(\boldsymbol{w}\) is obtained from a frozen pre-trained source model before model deployment. Therefore, this process is independent of the source model's pre-training method and does not require source data after model deployment, as shown in Fig. 2.
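A minimal sketch of this computation follows, assuming per-sample (or small-batch) inputs from a labeled source loader and a color/blur transform. For simplicity it again treats each parameter tensor as one layer, which may differ from the paper's exact layer granularity.

```python
import torch
import torch.nn.functional as F

def layer_grads(model, criterion, x, y):
    # Per-layer gradient vectors of the task loss (cross entropy).
    model.zero_grad()
    criterion(model(x), y).backward()
    return [p.grad.detach().flatten().clone() for p in model.parameters()]

def penalty_vector(model, criterion, source_loader, transform, eps=1e-12):
    # Eq. (2): average per-layer cosine similarity between gradients of
    # original and transformed source samples, min-max normalized, squared.
    sims, n = None, 0
    for x, y in source_loader:  # N labeled source samples
        g = layer_grads(model, criterion, x, y)
        g_t = layer_grads(model, criterion, transform(x), y)
        s = torch.stack([F.cosine_similarity(a, b, dim=0)
                         for a, b in zip(g, g_t)])
        sims = s if sims is None else sims + s
        n += 1
    s = sims / n
    s = (s - s.min()) / (s.max() - s.min() + eps)  # nu[.]: min-max to [0, 1]
    return s ** 2  # high similarity -> shift-agnostic -> high penalty
```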

Fig. 3. Overall process of our proposed SWR. We first obtain the penalty vector \(\boldsymbol{w}\) before model deployment and then use it as layer-wise penalties to control the update of the model parameters at test-time adaptation after model deployment.

As shown in Eq. (1) and Fig. 3, during test-time adaptation, we apply the layer-wise penalty value \(w_l\) to the difference between the previous and current model parameters of each layer, which controls the update of the model parameters differently for each layer. Therefore, the model parameters belonging to layers with high cosine similarity between the two gradient vectors are considered shift-agnostic, and we update them less by imposing high penalties. Section 4.6 experimentally shows that SWR takes advantage of high learning rates to adapt the model to the target domain quickly.

3.2 Entropy Objective for the Main Task

The main task of the model \(f_\theta \) is defined as the task performed by the parameters \(\theta _e\) and \(\theta _c\). The loss function for the main task during test time is built using the entropy of model predictions \(\tilde{y}\) on test samples from the target distribution. We adopt information maximization loss [23, 27, 48], validated in several test-time adaptation and source-free domain adaptation methods [38, 42, 58], as an unsupervised learning objective for the main task. This loss consists of entropy minimization [38, 49, 57, 59] and mean entropy maximization [2, 28, 38, 58] as

$$\begin{aligned} \mathcal {L}^{\text {main}}_{\theta _e,\theta _c} = \lambda _{m_1}\frac{1}{N}\sum _{i=1}^N H(\tilde{y_i})-\lambda _{m_2} H(\bar{y}), \end{aligned}$$
(3)

where \(H(p)=-\sum _{k=1}^C p^k\log p^k\), \(\bar{y}=\frac{1}{N}\sum _{i}\tilde{y_i}\), \(\lambda _{m_1}\) and \(\lambda _{m_2}\) indicate the importance of each term. The number of classes and the batch size are denoted by C and N. Intuitively, entropy minimization makes individual predictions confident, and mean entropy maximization encourages average prediction within a batch to be close to the uniform distribution.
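As a reference, a small PyTorch-style sketch of Eq. (3) is given below; the defaults reuse the reported settings \(\lambda _{m_1}\)=0.2 and \(\lambda _{m_2}\)=0.25 (Sect. 4.2), and the epsilon for numerical stability is our addition.

```python
import torch

def main_task_loss(logits, lambda_m1=0.2, lambda_m2=0.25, eps=1e-8):
    # Eq. (3): entropy minimization on individual predictions plus
    # mean-entropy maximization over the batch.
    probs = torch.softmax(logits, dim=1)                       # y_tilde_i
    ent = -(probs * (probs + eps).log()).sum(dim=1).mean()     # mean H(y_tilde_i)
    mean_probs = probs.mean(dim=0)                             # y_bar
    mean_ent = -(mean_probs * (mean_probs + eps).log()).sum()  # H(y_bar)
    return lambda_m1 * ent - lambda_m2 * mean_ent
```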

3.3 Auxiliary Task Based on the Nearest Source Prototype

Due to the distribution shift between the source and target domains, the target features deviate from the source features at test time. To resolve this issue, we propose an auxiliary task based on the nearest source prototype (NSP) classifier, which pulls the target embeddings closer to their nearest source prototypes in the embedding space. Eventually, optimizing the auxiliary task improves performance significantly since it directly supports the main task by aligning the source and target representations. We first explain how to generate source prototypes and define the NSP classifier based on them.

Fig. 4. Source prototype generation phase before model deployment. First, we repeat steps (1) and (2) until prototypes of all classes are generated, and then train the projector and update the source prototypes at the same time through an iterative process from (1) to (6) on the source data. (a) and (b) pull the original source projection and its transformed source projection, respectively, closer to the source prototype nearest to the original one.

Source Prototype Generation. The source prototypes are defined as the averages over the source embeddings of each class. As shown in Fig. 4, we freeze the model \(f_\theta \) trained on the source data and attach an additional projection layer \(h_\psi \) behind the encoder \(f_{\theta _e}\). The encoder \(f_{\theta _e}\) infers the representation \(\boldsymbol{h}\) from the source sample x, and the projector \(h_\psi \) maps \(\boldsymbol{h}\) to the projection \(\boldsymbol{z}=h_\psi (f_{\theta _e}(x))\) in another embedding space, where the loss \(\mathcal {L}^{\text {emb}}_{\psi }\) is applied. The source prototype \(\boldsymbol{q}^k_t\) for class k is updated through an exponential moving average (EMA) with the projection \(\boldsymbol{z}^k_t\) of the source sample \((x, y^k)_{k\in [1,\text {C}]}\) at time t during the optimization trajectory as

$$\begin{aligned} \boldsymbol{q}^k_t=\alpha \cdot \boldsymbol{q}^k_{t-1}+(1-\alpha )\cdot \boldsymbol{z}^k_t, \end{aligned}$$
(4)

where \(\alpha \)=0.99 and \(\boldsymbol{q}^k_0=\boldsymbol{z}^k_0\).
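A sketch of the EMA update in Eq. (4) is shown below, assuming the prototype tensor has already been initialized with the first projection of each class (\(\boldsymbol{q}^k_0=\boldsymbol{z}^k_0\)); the per-sample loop over the batch is our simplification.

```python
import torch

def update_prototypes(prototypes, z, y, alpha=0.99):
    # Eq. (4): EMA update of the class prototypes q^k with the
    # current source projections.
    # prototypes: (C, D) tensor; z: (B, D) projections; y: (B,) labels.
    for z_i, k in zip(z.detach(), y):
        prototypes[k] = alpha * prototypes[k] + (1 - alpha) * z_i
    return prototypes
```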

We define the NSP classifier as a non-parametric classifier that measures the cosine similarity of a given target embedding to the source prototypes of all classes and then generates a class probability distribution \(\hat{y}\) as

$$\begin{aligned} \hat{y}=\sum _{k=1}^C\left( \frac{\text {exp}\left( S(\boldsymbol{z},\boldsymbol{q}^k)/\tau \right) }{\sum _{j=1}^C\text {exp}\left( S(\boldsymbol{z},\boldsymbol{q}^j)/\tau \right) }\right) y^k, \end{aligned}$$
(5)

where \(S(\cdot ,\cdot )\) is a cosine similarity function, \(S(\boldsymbol{a},\boldsymbol{b})=(\boldsymbol{a}\cdot \boldsymbol{b})/\Vert \boldsymbol{a}\Vert \Vert \boldsymbol{b}\Vert \), \(\tau \) denotes a temperature that controls the sharpness of the distribution, and \(y^k\) is the one-hot ground-truth label vector of k-th class.
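In code, the NSP classifier reduces to a softmax over scaled cosine similarities, as the following sketch shows (\(\tau \)=0.1 as in Sect. 4.2):

```python
import torch.nn.functional as F

def nsp_predict(z, prototypes, tau=0.1):
    # Eq. (5): class probabilities from cosine similarity to the
    # source prototypes.
    # z: (B, D) projections; prototypes: (C, D); returns (B, C) y_hat.
    z_n = F.normalize(z, dim=1)
    q_n = F.normalize(prototypes, dim=1)
    sims = z_n @ q_n.t()  # S(z, q^k) for all classes
    return (sims / tau).softmax(dim=1)
```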

Fig. 5. Test-time adaptation phase after model deployment. (a) main task loss. (b),(c) auxiliary task loss. (b) and (c) pull the original target projection and its transformed target projection, respectively, such that they become closer to the nearest source prototype from the original one.

In addition, inspired by recent self-supervised contrastive learning methods [4, 6, 18], we enable the projector \(h_\psi \) to learn transformation-invariant mapping. We obtain projection \(\boldsymbol{z}'\) of the transformed source sample by \(\boldsymbol{z}'=h_\psi (f_{\theta _e}(\mathcal {T}(x)))\), where \(\mathcal {T}\)(\(\cdot \)) denotes an image transform function. The embedding loss \(\mathcal {L}^{\text {emb}}_{\psi }\) consisting of two cross entropy loss terms is applied to the embedding space to train the projector \(h_\psi \) as

$$\begin{aligned} \mathcal {L}^{\text {emb}}_{\psi }=\frac{1}{N}\sum _{i=1}^{N}\left( \text {CE}\left( y_i,{\hat{y}}_i\right) +\text {CE}\left( y_i,{\hat{y}'}_i\right) \right) , \end{aligned}$$
(6)

where \(\text {CE}\left( p,q\right) =-\sum _{k=1}^C p^k\log q^k\), and \(y_i\) is the ground-truth label of i-th source sample. Here, \(\hat{y}\) and \(\hat{y}'\) denote the outputs of the NSP classifier for the projections \(\boldsymbol{z}\) and \(\boldsymbol{z}'\) of the source sample and its transformed one, respectively. As shown in Fig. 4, optimizing the embedding loss encourages the projector \(h_\psi \) to learn a mapping that pulls the projections belonging to the same class closer together and pushes source prototypes farther away from each other.
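A sketch of Eq. (6), under the assumption that \(\hat{y}\) and \(\hat{y}'\) are the NSP probabilities from Eq. (5); the epsilon is again our addition for numerical stability.

```python
import torch

def embedding_loss(y_hat, y_hat_t, y, eps=1e-8):
    # Eq. (6): cross entropy between the ground-truth labels and the NSP
    # predictions for the original and transformed source projections.
    # y_hat, y_hat_t: (B, C) probabilities; y: (B,) integer labels.
    idx = torch.arange(len(y))
    ce = -(y_hat[idx, y] + eps).log().mean()      # CE(y, y_hat)
    ce_t = -(y_hat_t[idx, y] + eps).log().mean()  # CE(y, y_hat')
    return ce + ce_t
```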

Note that this process is applied to a frozen pre-trained source model and completed before model deployment. Therefore, it is model-agnostic and does not require source data during test time.

Auxiliary Task Loss at Test Time. Once the source prototypes are generated and the projection layer is trained, we can deploy the model and then jointly optimize both main and auxiliary tasks on unlabeled online data. The auxiliary task loss \(\mathcal {L}^{\text {aux}}_{\theta _e}\) consists of two objective functions: the entropy objective \(\mathcal {L}^{{\text {aux}}\_{\text {ent}}}_{\theta _e}\) using the entropy of the NSP classifier’s prediction \(\hat{y}\), and the self-supervised loss \(\mathcal {L}^{{\text {aux}}\_{\text {sel}}}_{\theta _e}\) that encourages the model’s encoder \(f_{\theta _e}\) to learn transformation-invariant mappings as

$$\begin{aligned} \mathcal {L}^{\text {aux}}_{\theta _e} = \mathcal {L}^{{\text {aux}}\_{\text {ent}}}_{\theta _e} + \lambda _s\mathcal {L}^{{\text {aux}}\_{\text {sel}}}_{\theta _e}, \end{aligned}$$
(7)

where \(\lambda _s\) denotes the importance of the self-supervised loss term. Similarly to Eq. (3), the entropy objective is built by using the entropy of the prediction \(\hat{y}\) of the NSP classifier on the target sample as

$$\begin{aligned} \mathcal {L}^{{\text {aux}}\_{\text {ent}}}_{\theta _e} = \lambda _{a_1}\frac{1}{N}\sum _{i=1}^{N}H(\hat{y_i})-\lambda _{a_2}H(\bar{y}), \end{aligned}$$
(8)

where N is batch size, \(\lambda _{a_1}\) and \(\lambda _{a_2}\) indicate the importance of each term, \(H(p)=-\sum _{k=1}^C p^k\log p^k\), and \(\bar{y}=\frac{1}{N}\sum _{i=1}^{N}\hat{y_i}\). The self-supervised loss is applied to the prediction \(\hat{y}'\) of the NSP classifier on the transformed target sample as

$$\begin{aligned} \mathcal {L}^{{\text {aux}}\_{\text {sel}}}_{\theta _e}=-\frac{1}{N}\sum _{i=1}^{N}\sum _{k=1}^C\hat{y}_i^k\log {\hat{y}'^k_i}. \end{aligned}$$
(9)

As shown in Fig. 5, the entropy objective (Fig. 5(b)) pulls the projection \(\boldsymbol{z}\) of the target sample closer to its nearest source prototype, and the self-supervised objective (Fig. 5(c)) encourages the projection \(\boldsymbol{z}'\) of the transformed target sample to move toward the same prototype as \(\boldsymbol{z}\).
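Putting Eqs. (7)-(9) together, a compact sketch of the auxiliary loss is given below; the defaults follow the reported hyper-parameters \(\lambda _{a_1}\)=0.8, \(\lambda _{a_2}\)=0.25, and \(\lambda _s\)=0.1 (Sect. 4.2).

```python
import torch

def aux_task_loss(y_hat, y_hat_t, lambda_a1=0.8, lambda_a2=0.25,
                  lambda_s=0.1, eps=1e-8):
    # y_hat, y_hat_t: (B, C) NSP probabilities (Eq. (5)) for the target
    # batch and its transformed version.
    ent = -(y_hat * (y_hat + eps).log()).sum(dim=1).mean()  # mean H(y_hat_i)
    mean_p = y_hat.mean(dim=0)                              # y_bar
    mean_ent = -(mean_p * (mean_p + eps).log()).sum()       # H(y_bar)
    aux_ent = lambda_a1 * ent - lambda_a2 * mean_ent        # Eq. (8)
    # Eq. (9): cross entropy between the two views' predictions.
    sel = -(y_hat * (y_hat_t + eps).log()).sum(dim=1).mean()
    return aux_ent + lambda_s * sel                         # Eq. (7)
```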

4 Experiments

This section describes the experimental setup, implementation details, and the experimental results of the comparisons with other state-of-the-art methods in test-time adaptation. We also show that generalization performance can be further improved by combining our proposed method with an existing domain generalization strategy that mainly focuses on training time in the source domain.

4.1 Experimental Setup

Following TENT [59] and T3A [25], all experiments in this paper are conducted in the online adaptation setting, where adaptation is performed concurrently with evaluation at test time without seeing the same data twice or more. After a prediction is obtained, the model is updated via back-propagation. We evaluate our proposed method on CIFAR-10-C, CIFAR-100-C, and ImageNet-C [20], and on four domain generalization benchmarks: PACS [31], OfficeHome [55], VLCS [10], and TerraIncognita [5]. Since our method can be used independently of the backbone network and its pre-training method, we apply it to publicly available pre-trained models for evaluation. We perform experiments on the CIFAR datasets using WideResNet-28-10 [65] and WideResNet-40-2 [65] as backbone networks, based on RobustBench [9]. In the domain generalization setup, we use ResNet-50 [19] without batch normalization layers, which is the default setting of DomainBed [16], the DG benchmark framework. The CIFAR-10/100 dataset [29] contains 50k images for training and 10k images for testing. Corruptions in four categories (noise, blur, weather, and digital) are applied to the 10k CIFAR-10/100 test images to create the CIFAR-10/100-C test images. For test-time adaptation, the 50k CIFAR training images are defined as the source domain, and the 10k CIFAR-C test images as the target domain.

4.2 Implementation Details

We integrate our proposed method within the frameworks officially provided by other state-of-the-art methods [25, 39, 59] for fair comparisons. Specifically, different frameworks are used for each experiment as follows: TENT framework [59] for all experiments with WRN-40-2 and WRN-28-10 backbone networks on CIFAR-10/100-C, TTT++ framework [39] for all experiments with ResNet-50 on CIFAR-10/100-C, and T3A framework [25] for all domain generalization benchmarks. For experiments on CIFAR, we follow the default values provided by each framework for experimental settings such as batch size and optimizer.

Color distortion, random grayscale, and Gaussian blurring are used as the image transformations specified in Fig. 3 and Fig. 5, and random cropping and random horizontal flipping are additionally applied for the image transformations in Fig. 4. We use batch statistics on test data instead of running estimates. The hyper-parameters are empirically set as \(\lambda _{m_1}\)=0.2, \(\lambda _{a_1}\)=0.8, \(\lambda _{m_2}\)=0.25, \(\lambda _{a_2}\)=0.25, \(\lambda _s\)=0.1, \(\lambda _r\)=250, and softmax temperature \(\tau \)=0.1. The projector is trained for 20 epochs, and N=1024 in Eq. (2). Since these hyper-parameters are not sensitive to the backbone or dataset, they are fixed without individual tuning in most experiments in this paper unless noted otherwise. The projector described in Sect. 3.3 can be configured as a single-layer or multi-layer perceptron (MLP). The MLP consists of a linear layer followed by batch normalization [24], ReLU [43], and a final linear layer with output dimension 512. The performance change according to the projector configuration is shown in Table 3, and the detailed architecture is described in the supplementary Section C.

Table 1. Comparison with other methods. * denotes the reported results from the original paper, and the others are reproduced values in our environment based on the official framework provided by TENT [59] and TTT++ [39]. Source denotes the source pre-trained model without test-time adaptation.

4.3 Robustness Against Image Corruptions

Table 1(a) shows a comparison of the robustness between our method and recent test-time adaptation methods on the most severe corruptions of CIFAR-100-C. TFA [39] and TTT++ [39] were originally implemented as offline adaptation methods that train a model by observing the same data multiple times across numerous epochs, so we adapt these methods to the online setting to reproduce the results. Our proposed method significantly outperforms other state-of-the-art methods with large margins of 3.89\(\%\) for ResNet-50 and 2.59\(\%\) for WRN-40-2. Table 1(b) shows the results on the most severe corruptions of CIFAR-10-C. Our method consistently outperforms other methods on the CIFAR-10/100-C datasets across various backbone networks. In particular, WRN-40-2, which is trained with AugMix [21] data processing to increase the robustness of the model, outperforms the other backbone networks, and our method further enhances its performance by complementing it. Table 1(c) shows the results on CIFAR-100-C over all severity levels. Since severity denotes the strength of the corruption, it indicates the magnitude of the distribution shift, and our method outperforms TENT [59] at all levels by a large margin.

Table 2. Ablation study on CIFAR-100-C. ResNet-50 is used.

4.4 Ablation Studies

Table 2 shows the effectiveness of our proposed shift-agnostic weight regularization (SWR) and nearest source prototype (NSP) classifier through ablation studies. At a high learning rate, optimizing only the main task loss based on the entropy of the model prediction results in poor performance, but adjusting the learning rate reduces the error rate to 39.44\(\%\). Adding the NSP to the main task loss leads to a performance improvement of 1.89%, and including the SWR improves the performance by 1.68% even at a high learning rate. Our method with both SWR and NSP achieves a 35.65% error rate, a 3.79% improvement over using only the main task loss.

Table 3. Comparison of error rate (\(\%\)) according to changes in projector depth.

4.5 Projector Design and Hyper-parameter Impacts

Table 3 shows the performance impact of changing the projector depth (i.e., the number of projection layers). In addition, we conduct experiments that apply the auxiliary task loss \(\mathcal {L}^{\text {aux}}_{\theta _e}\) directly to the feature representation \(\boldsymbol{h}\), the encoder's output, without using the projector. The model with the projector outperforms the one without it on CIFAR-100-C, and the opposite result is obtained on CIFAR-10-C. Since the auxiliary task loss is applied to the embedding space based on the cosine similarity between the source prototypes and the target embeddings, its effect may be minimal if they are severely misaligned. To compensate for this issue, we attach and train the projector, which minimizes the misalignment between the source and target embeddings by enabling transformation-invariant mapping and bringing the projections belonging to the same class closer together in the new embedding space. However, if the number of classes is small (e.g., CIFAR-10-C), the source and target may already be relatively well aligned compared to the case with a large number of classes (e.g., CIFAR-100-C). In this case, we conjecture that applying the auxiliary task loss directly to the encoder's output \(\boldsymbol{h}\) rather than to the projector's output \(\boldsymbol{z}\) in the new embedding space generates a better-aligned representation \(\boldsymbol{h}\) between the source and target, which can be more helpful to the classifier.

Table 4 shows the experimental results according to (a) the projector width (i.e., output dimension of the last layer), (b) the transformation used for training the projector, and (c) whether to fine-tune or freeze the projector during test-time adaptation. Our default settings are marked with gray-colored cells, and these settings are also applied to the domain generalization benchmarks in the following section without additional tuning.

Table 4. Hyper-parameter impacts on CIFAR-100-C. ResNet-50 is used.
Table 5. Comparison of error rate (\(\%\)) on CIFAR-100-C. Our method outperforms the supervised method in an online setting. LR denotes a learning rate.

4.6 Quick Adaptation

As shown in Table 5, it is natural that the supervised method performs perfectly when learning and evaluating the same test samples iteratively. Interestingly, however, our method outperforms the supervised one in an online setting where each test sample is seen only once. Unlike the other methods that require a low learning rate to train (Fig. 1(b),(d)), our method updates the entire parameters at a high learning rate. We conjecture that SWR enables quick convergence without performance degradation because only parameters sensitive to the distribution shift (i.e., parameters that need to quickly adapt to a new domain) are largely updated at a high learning rate.

4.7 Domain Generalization Benchmarks

To evaluate our method on the DG benchmarks, we follow the protocol proposed by DomainBed [16] and T3A [25]. Our method is model-agnostic, so we apply it to models pre-trained with empirical risk minimization (ERM) [54] or CORAL [51] on the source domain in order to adapt them to the target domain at test time. We use leave-one-domain-out validation [16] for model selection in all experiments in Table 6. Our method shows state-of-the-art performance on average over the four datasets and, in particular, outperforms T3A [25] and the source pre-trained models by a large margin on the PACS, OfficeHome, and TerraIncognita datasets. The detailed experimental setup can be found in the supplementary Section C.

Table 6. Comparison of accuracy (\(\%\)) on four DG benchmarks. \(^\dagger \) denotes the reported results from DomainBed [16], and the others are reproduced values.

4.8 Qualitative Results

Figure 6 visualizes the features on CIFAR-10-C using t-SNE [41]. The results in the first row are from WRN-40-2 as the source pre-trained model, and the results in the second row are from ResNet-50. Even without test-time adaptation, WRN-40-2 (AugMix) [21] is more robust against corruptions than ResNet-50 and thus yields better results. Our method significantly improves the performance in terms of intra-class cohesion and inter-class separation for both backbones.

Fig. 6. t-SNE visualization of features from the target domain (CIFAR-10-C).

5 Conclusions

This paper proposed two novel approaches for model-agnostic test-time adaptation. Our proposed shift-agnostic weight regularization enables the model to reliably and quickly adapt to unlabeled online data from the target domain by controlling the update of the model parameters according to their sensitivity to the distribution shift. In addition, our proposed auxiliary task based on the nearest source prototype classifier boosts the performance by aligning the source and target representations. Test-time adaptation is a challenging but promising area in terms of allowing the model to evolve itself while adapting to a new environment without human intervention. In this regard, our efforts aim to promote the importance of this field and stimulate new research directions.