
1 Introduction

After deep neural networks (DNNs) trained on a given dataset (i.e., source domain) are deployed to a new environment (i.e., target domain), the DNNs make predictions on data from the target domain. However, in most cases, the distributions of the source and target domains differ significantly, which degrades the model's performance in the target domain. If the deployed model does not remain stationary during test time but instead adapts to the new environment using clues from the unlabeled target data, its performance can be improved [25, 38, 39, 42, 52, 59, 63, 66].

Fig. 1. Comparison of average error (\(\%\)) between our approach and other methods with varying learning rates on CIFAR-100-C [20]. The x- and y-axes are the learning rate and average error rate, respectively. (a) Our method significantly outperforms the other three methods: (b) updating the entire parameters with only entropy minimization, (c) the state-of-the-art method TENT [59], and (d) a supervised method. (e) Our proposed SWR keeps the performance stable when combined with entropy minimization, even at higher learning rates ([1e-3, 1e-4]).

Recently, several studies [25, 42, 59, 63] have proposed test-time adaptation, which updates the model during test time after model deployment. However, it is extremely challenging to adapt the model to the target domain with only unlabeled online data. As shown in Fig. 1(b), adapting the entire model parameters may be detrimental due to erroneous signals from an unsupervised objective such as entropy minimization [15, 26, 40, 57, 59], and the performance may be highly dependent on the learning rate. In addition, since test-time adaptation can access each unlabeled target sample only once, and the adaptation proceeds simultaneously with the evaluation, updating all network parameters may result in overfitting [17, 62]. Thus, several approaches propose updating only part of the network [25, 42, 59, 63], such as the batch normalization [24] or classifier layers. In particular, T3A [25] proposes an optimization-free method that adapts only the classifier layers using unlabeled target data, and TENT [59] updates batch statistics and affine parameters in the batch normalization layers by entropy minimization on unlabeled target data. However, updating only a subset of the model's parameters or layers may yield only marginal performance improvement, as shown in Fig. 1(c). Furthermore, such methods cannot be applied to architectures that lack the required layer, such as batch normalization or a linear classifier.

Other approaches [39, 52] propose to jointly optimize a main task and a self-supervised task, such as rotation prediction [13] or instance discrimination [6, 18], during pre-training in the source domain, and then update the model using only the self-supervised task during test time. In contrast to the unsupervised objective for the main task, which depends heavily on the model's prediction accuracy, the self-supervised task always obtains a proper supervisory signal. However, the self-supervised task may interfere with the main task if the two tasks are not properly aligned [39, 50, 64]. In addition, these approaches cannot be applied to adapt arbitrary pre-trained models to the target domain since they require specific pre-training methods in the source domain.

To resolve these issues, we present two novel approaches for test-time adaptation. First, we propose shift-agnostic weight regularization (SWR), which enables the model to quickly adapt to the target domain and is beneficial when updating the entire model parameters with a high learning rate. In contrast to Fig. 1(b), entropy minimization combined with the proposed SWR shows superior performance and less dependency on the choice of learning rate, as shown in Fig. 1(e). Based on their sensitivity to the distribution shift, SWR partitions the model parameters into shift-agnostic and shift-biased parameters, updating the former less and the latter more. Second, we present an auxiliary task based on a non-parametric nearest source prototype (NSP) classifier, which pulls the target representation closer to its nearest source prototype. With the NSP classifier, the source and target representations can be well aligned, which significantly improves the performance of the main task. Our proposed method (Fig. 1(a)) outperforms the state-of-the-art method [59] (Fig. 1(c)) and even the supervised method using ground-truth labels (Fig. 1(d)).

Our method requires access to the source data to identify shift-agnostic and shift-biased parameters and to generate source prototypes before model deployment, but it is applicable to any model regardless of its architecture or pre-training procedure. If a given model is pre-trained on open datasets, or if the source data owner deploys the model, the source data is accessible before model deployment. In this case, our method significantly enhances the test-time adaptation capability by leveraging the source data without modifying the pre-trained model. Unlike TTT [52] and TTT++ [39], we do not change the pre-training method of a given model, so our method can benefit from any strong pre-trained model, such as AugMix [21] (Table 1) or CORAL [51] (Table 6), as a good starting point for test-time adaptation. In these respects, we believe our method is practical.

The major contributions of this paper can be summarized as follows:

  • Two novel approaches for test-time adaptation are presented in this paper. The proposed SWR enables the model to quickly and reliably adapt to the target domain, and the NSP classifier aligns the source and target features to reduce the distribution shift, leading to further performance improvement.

  • Our test-time adaptation method is model-agnostic and not dependent on the pre-training method in the source domain, and thus it can be applied to any pre-trained model. Therefore, our method can also complement other domain generalization approaches that mainly focus on the pre-training method in the source domain before model deployment.

  • We show that our method achieves state-of-the-art performance through extensive experiments on CIFAR-10-C, CIFAR-100-C, ImageNet-C [20], and domain generalization benchmarks including PACS [31], OfficeHome [55], VLCS [10], and TerraIncognita [5]. Notably, our method even outperforms its supervised counterpart on the CIFAR-100-C dataset.

2 Related Work

2.1 Source-Free Domain Adaptation

Unsupervised domain adaptation (UDA) [11, 12, 22, 46, 51, 53, 57] assumes simultaneous access to both the source and target domains. In practice, however, data is often distributed across multiple devices, and UDA then requires data sharing for simultaneous access to all of it, which is often impossible due to data privacy concerns, limited bandwidth, and computational cost. Source-free domain adaptation [1, 30, 36, 38, 58, 60, 61] overcomes this challenge by adapting a source pre-trained model to the target domain using only unlabeled target data. These approaches focus on offline adaptation, in which the same target sample is fed to the model multiple times during target adaptation, whereas our method concentrates on online adaptation.

2.2 Test-Time Adaptation and Training

Test-time adaptation focuses on online adaptation, in which all target data can be accessed only once at test time and adaptation is performed simultaneously with evaluation. More specifically, target samples are forward-propagated through the model for evaluation, and the error signal from the model's output is then backpropagated in an unsupervised manner for training [59]. Several studies adopt self-supervised learning, such as rotation prediction [52] or instance discrimination [39], to jointly optimize the main and self-supervised tasks on the source domain and then optimize only the self-supervised task on the target domain. However, these methods are not universally applicable to arbitrary pre-trained models as they require specific pre-training methods in the source domain. Recently, model-agnostic test-time adaptation methods independent of the pre-training method in the source domain have been proposed [25, 42, 59, 63]. TENT [59] uses the batch statistics of the target domain and optimizes channel-wise affine parameters with an entropy minimization loss. T3A [25] proposes an optimization-free method that adjusts a pre-trained linear classifier by updating the prototype for each class during test time. However, since these methods update only a subset of the model's parameters or layers, such as the batch normalization [42, 59, 63] or classifier layer [25], they may be suboptimal for target adaptation.

2.3 Domain Generalization

Since UDA aims to adapt the model to a predefined target domain before model deployment, it cannot guarantee generalization performance on other arbitrary target domains. On the other hand, domain generalization (DG) differs from UDA in that it assumes the model accesses only the source domain during training time before model deployment, and it aims to improve generalization capability in arbitrary unseen target domains. Numerous DG approaches using meta-learning [3, 32, 33], normalization [7, 8, 44, 47], adversarial training [34, 37, 45], and data augmentation [14, 35, 56, 67] have been proposed to learn domain-agnostic feature representations for the target domain. However, these studies focus only on methods at training time before model deployment, whereas our method focuses on test-time adaptation after model deployment.

3 Proposed Method

Assume that the model parameters \(\theta \) trained on the source domain consist of an encoder part \(\theta _e\) and a classifier part \(\theta _c\), as shown in Fig. 2(c). After being deployed to the target domain, the model infers the class probability distribution of the target sample and then optimizes our proposed test-time adaptation loss \(\mathcal {L}^{\text {target}}_{\theta _e,\theta _c}\). The overall loss of our proposed method is defined as

$$\begin{aligned} \mathcal {L}^{\text {target}}_{\theta _e,\theta _c} = \mathcal {L}^{\text {main}}_{\theta _e,\theta _c} + \mathcal {L}^{\text {aux}}_{\theta _e} + \lambda _r\sum _l w_l\Vert \boldsymbol{\theta }_l-\boldsymbol{\theta }_l^*\Vert ^2_2, \end{aligned}$$
(1)

where \(w_l\) denotes the l-th element of the penalty vector \(\boldsymbol{w}\) used to control the update of the model parameters, \(\boldsymbol{\theta }_l\) is the parameter vector of the l-th layer of the model, \(\boldsymbol{\theta }_l^*\) is the corresponding parameter vector from the previous update step, \(\lambda _r\) is the importance of the regularization term, and \(\mathcal {L}^{\text {main}}_{\theta _e,\theta _c}\) and \(\mathcal {L}^{\text {aux}}_{\theta _e}\) denote the main and auxiliary task losses, respectively. Optimizing the main task loss updates the entire model parameters \(\theta _e\) and \(\theta _c\), whereas optimizing the auxiliary task loss updates only the encoder part \(\theta _e\). We first present the shift-agnostic weight regularization (SWR), then describe the entropy objective of the main task, and finally propose an auxiliary task based on a nearest source prototype (NSP) classifier, which directly benefits the main task.
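To make the structure of Eq. (1) concrete, the following is a minimal PyTorch-style sketch of the regularization term. It treats each parameter tensor as one "layer" and assumes the penalty vector \(\boldsymbol{w}\) (Eq. (2)) and a snapshot of \(\boldsymbol{\theta }^*\) are available; the names `snapshot_params` and `swr_regularizer` are our illustrative choices, not part of any released implementation.

```python
import torch

def snapshot_params(model):
    # theta^*: detached copies of the current parameters, taken
    # before each test-time update step.
    return [p.detach().clone() for p in model.parameters()]

def swr_regularizer(model, prev_params, w):
    # Third term of Eq. (1): layer-wise penalized squared L2 distance
    # between the current parameters and theta^*.
    reg = 0.0
    for w_l, p, p_prev in zip(w, model.parameters(), prev_params):
        reg = reg + w_l * (p - p_prev).pow(2).sum()
    return reg

# Usage at each adaptation step (main/aux losses defined in Sects. 3.2-3.3):
# loss = main_loss + aux_loss + lambda_r * swr_regularizer(model, prev, w)
```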

Fig. 2. Our method consists of two stages: (b) and (c). (a) Our method takes the pre-trained model in an off-the-shelf manner and (b) generates the penalty vector \(\boldsymbol{w}\) and the source prototypes \(\boldsymbol{q}\) while keeping the model frozen before model deployment. (c) After model deployment, our method accesses only unlabeled online target data \(D_t\), not labeled source data \(D_s\), during test-time adaptation.

3.1 Shift-Agnostic Weight Regularization

The main idea of the SWR is to impose a different penalty on each parameter update during test-time adaptation, depending on the sensitivity of each model parameter to the distribution shift. Assuming that the distribution shift is mainly caused by color and blur shifts, we mimic it using transformation techniques such as color distortion and Gaussian blur. Experiments on variations of the SWR, including the use of other transform functions, can be found in the supplementary Section B.

To obtain the penalty vector \(\boldsymbol{w}\) specified in Eq. (1), we first forward-propagate two input images (i.e., an original and its transformed image) through the pre-trained source model and then back-propagate the task loss (i.e., cross entropy) using the source labels to produce two sets \(\boldsymbol{g}\) and \(\boldsymbol{g}'\) of L gradient vectors, respectively. Note that L is the total number of layers in the model. Then the l-th element \(w_l\) of the penalty vector \(\boldsymbol{w}\) is calculated by employing the average cosine similarity \(s_l\) between two gradient vectors, \(\boldsymbol{g}_l\) and \(\boldsymbol{g}'_l\) from N source samples as

$$\begin{aligned} \begin{aligned} s_l&=\frac{1}{N}\sum _{i=1}^N\frac{\boldsymbol{g}^i_l\cdot {\boldsymbol{g}_{l}'^i}}{\Vert {\boldsymbol{g}^i_l}\Vert \Vert {{\boldsymbol{g}_{l}'^i}}\Vert }\in \mathbb {R}, \\ \boldsymbol{w}&=\left( \nu \left[s_1,\dots ,s_l,\dots ,s_{L} \right]\right) ^2\in \mathbb {R}^L, \end{aligned} \end{aligned}$$
(2)

where \(\nu \left[\cdot \right]\) denotes min-max normalization to the range [0,1], and \(\boldsymbol{g}^i_l\) and \(\boldsymbol{g}_{l}'^i\) denote the l-th gradient vectors for the i-th source sample and its transformed sample, respectively. N denotes the total number of samples. Note that the penalty vector \(\boldsymbol{w}\) is obtained from a frozen pre-trained source model before model deployment. Therefore, this process is independent of the source model's pre-training method and does not require source data after model deployment, as shown in Fig. 2.
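A minimal sketch of this computation follows, assuming per-sample (or small-batch) inputs from a labeled source loader and a color/blur transform. For simplicity it again treats each parameter tensor as one layer, which may differ from the paper's exact layer granularity.

```python
import torch
import torch.nn.functional as F

def layer_grads(model, criterion, x, y):
    # Per-layer gradient vectors of the task loss (cross entropy).
    model.zero_grad()
    criterion(model(x), y).backward()
    return [p.grad.detach().flatten().clone() for p in model.parameters()]

def penalty_vector(model, criterion, source_loader, transform, eps=1e-12):
    # Eq. (2): average per-layer cosine similarity between gradients of
    # original and transformed source samples, min-max normalized, squared.
    sims, n = None, 0
    for x, y in source_loader:  # N labeled source samples
        g = layer_grads(model, criterion, x, y)
        g_t = layer_grads(model, criterion, transform(x), y)
        s = torch.stack([F.cosine_similarity(a, b, dim=0)
                         for a, b in zip(g, g_t)])
        sims = s if sims is None else sims + s
        n += 1
    s = sims / n
    s = (s - s.min()) / (s.max() - s.min() + eps)  # nu[.]: min-max to [0, 1]
    return s ** 2  # high similarity -> shift-agnostic -> high penalty
```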

Fig. 3. Overall process of our proposed SWR. We first obtain the penalty vector \(\boldsymbol{w}\) before model deployment and then use it as layer-wise penalties to control the update of the model parameters at test-time adaptation after model deployment.

As shown in Eq. (1) and Fig. 3, during test-time adaptation, we apply the layer-wise penalty value \(w_l\) to the difference between the previous and current model parameters of each layer, which controls the update of the model parameters differently for each layer. Therefore, the model parameters belonging to layers with high cosine similarity between the two gradient vectors are considered shift-agnostic, and we update them less by imposing high penalties. Section 4.6 experimentally shows that SWR takes advantage of high learning rates to adapt the model to the target domain quickly.

3.2 Entropy Objective for the Main Task

The main task of the model \(f_\theta \) is defined as the task performed by the parameters \(\theta _e\) and \(\theta _c\). The loss function for the main task during test time is built using the entropy of model predictions \(\tilde{y}\) on test samples from the target distribution. We adopt information maximization loss [23, 27, 48], validated in several test-time adaptation and source-free domain adaptation methods [38, 42, 58], as an unsupervised learning objective for the main task. This loss consists of entropy minimization [38, 49, 57, 59] and mean entropy maximization [2, 28, 38, 58] as

$$\begin{aligned} \mathcal {L}^{\text {main}}_{\theta _e,\theta _c} = \lambda _{m_1}\frac{1}{N}\sum _{i=1}^N H(\tilde{y_i})-\lambda _{m_2} H(\bar{y}), \end{aligned}$$
(3)

where \(H(p)=-\sum _{k=1}^C p^k\log p^k\), \(\bar{y}=\frac{1}{N}\sum _{i}\tilde{y_i}\), \(\lambda _{m_1}\) and \(\lambda _{m_2}\) indicate the importance of each term. The number of classes and the batch size are denoted by C and N. Intuitively, entropy minimization makes individual predictions confident, and mean entropy maximization encourages average prediction within a batch to be close to the uniform distribution.
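As a reference, a small PyTorch-style sketch of Eq. (3) is given below; the defaults reuse the reported settings \(\lambda _{m_1}\)=0.2 and \(\lambda _{m_2}\)=0.25 (Sect. 4.2), and the epsilon for numerical stability is our addition.

```python
import torch

def main_task_loss(logits, lambda_m1=0.2, lambda_m2=0.25, eps=1e-8):
    # Eq. (3): entropy minimization on individual predictions plus
    # mean-entropy maximization over the batch.
    probs = torch.softmax(logits, dim=1)                       # y_tilde_i
    ent = -(probs * (probs + eps).log()).sum(dim=1).mean()     # mean H(y_tilde_i)
    mean_probs = probs.mean(dim=0)                             # y_bar
    mean_ent = -(mean_probs * (mean_probs + eps).log()).sum()  # H(y_bar)
    return lambda_m1 * ent - lambda_m2 * mean_ent
```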

3.3 Auxiliary Task Based on the Nearest Source Prototype

Due to the distribution shift between the source and target domains, the target features deviate from the source features at test time. To resolve this issue, we propose an auxiliary task based on the nearest source prototype (NSP) classifier, which pulls the target embeddings closer to their nearest source prototypes in the embedding space. Eventually, optimizing the auxiliary task improves performance significantly since it directly supports the main task by aligning the source and target representations. We first explain how to generate source prototypes and define the NSP classifier based on them.

Fig. 4. Source prototype generation phase before model deployment. First, we repeat steps (1) and (2) until prototypes of all classes are generated, and then train the projector and update the source prototypes at the same time through an iterative process from (1) to (6) on the source data. (a) and (b) pull the original source projection and its transformed source projection, respectively, closer to the source prototype nearest to the original one.

Source Prototype Generation. The source prototypes are defined as the averages over the source embeddings of each class. As shown in Fig. 4, we freeze the model \(f_\theta \) trained on the source data and attach an additional projection layer \(h_\psi \) behind the encoder \(f_{\theta _e}\). The encoder \(f_{\theta _e}\) infers the representation \(\boldsymbol{h}\) from the source sample x, and the projector \(h_\psi \) maps \(\boldsymbol{h}\) to the projection \(\boldsymbol{z}=h_\psi (f_{\theta _e}(x))\) in another embedding space, where the loss \(\mathcal {L}^{\text {emb}}_{\psi }\) is applied. The source prototype \(\boldsymbol{q}^k_t\) for class k is updated through an exponential moving average (EMA) with the projection \(\boldsymbol{z}^k_t\) of the source sample \((x, y^k)_{k\in [1,\text {C}]}\) at time t during the optimization trajectory as

$$\begin{aligned} \boldsymbol{q}^k_t=\alpha \cdot \boldsymbol{q}^k_{t-1}+(1-\alpha )\cdot \boldsymbol{z}^k_t, \end{aligned}$$
(4)

where \(\alpha \)=0.99 and \(\boldsymbol{q}^k_0=\boldsymbol{z}^k_0\).
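A sketch of the EMA update in Eq. (4) is shown below, assuming the prototype tensor has already been initialized with the first projection of each class (\(\boldsymbol{q}^k_0=\boldsymbol{z}^k_0\)); the per-sample loop over the batch is our simplification.

```python
import torch

def update_prototypes(prototypes, z, y, alpha=0.99):
    # Eq. (4): EMA update of the class prototypes q^k with the
    # current source projections.
    # prototypes: (C, D) tensor; z: (B, D) projections; y: (B,) labels.
    for z_i, k in zip(z.detach(), y):
        prototypes[k] = alpha * prototypes[k] + (1 - alpha) * z_i
    return prototypes
```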

We define the NSP classifier as a non-parametric classifier that measures the cosine similarity of a given target embedding to the source prototypes of all classes and then generates a class probability distribution \(\hat{y}\) as

$$\begin{aligned} \hat{y}=\sum _{k=1}^C\left( \frac{\text {exp}\left( S(\boldsymbol{z},\boldsymbol{q}^k)/\tau \right) }{\sum _{j=1}^C\text {exp}\left( S(\boldsymbol{z},\boldsymbol{q}^j)/\tau \right) }\right) y^k, \end{aligned}$$
(5)

where \(S(\cdot ,\cdot )\) is a cosine similarity function, \(S(\boldsymbol{a},\boldsymbol{b})=(\boldsymbol{a}\cdot \boldsymbol{b})/\Vert \boldsymbol{a}\Vert \Vert \boldsymbol{b}\Vert \), \(\tau \) denotes a temperature that controls the sharpness of the distribution, and \(y^k\) is the one-hot ground-truth label vector of k-th class.
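In code, the NSP classifier reduces to a softmax over scaled cosine similarities, as the following sketch shows (\(\tau \)=0.1 as in Sect. 4.2):

```python
import torch.nn.functional as F

def nsp_predict(z, prototypes, tau=0.1):
    # Eq. (5): class probabilities from cosine similarity to the
    # source prototypes.
    # z: (B, D) projections; prototypes: (C, D); returns (B, C) y_hat.
    z_n = F.normalize(z, dim=1)
    q_n = F.normalize(prototypes, dim=1)
    sims = z_n @ q_n.t()  # S(z, q^k) for all classes
    return (sims / tau).softmax(dim=1)
```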

Fig. 5. Test-time adaptation phase after model deployment. (a) main task loss. (b),(c) auxiliary task loss. (b) and (c) pull the original target projection and its transformed target projection, respectively, such that they become closer to the nearest source prototype from the original one.

In addition, inspired by recent self-supervised contrastive learning methods [4, 6, 18], we enable the projector \(h_\psi \) to learn transformation-invariant mapping. We obtain projection \(\boldsymbol{z}'\) of the transformed source sample by \(\boldsymbol{z}'=h_\psi (f_{\theta _e}(\mathcal {T}(x)))\), where \(\mathcal {T}\)(\(\cdot \)) denotes an image transform function. The embedding loss \(\mathcal {L}^{\text {emb}}_{\psi }\) consisting of two cross entropy loss terms is applied to the embedding space to train the projector \(h_\psi \) as

$$\begin{aligned} \mathcal {L}^{\text {emb}}_{\psi }=\frac{1}{N}\sum _{i=1}^{N}\left( \text {CE}\left( y_i,{\hat{y}}_i\right) +\text {CE}\left( y_i,{\hat{y}'}_i\right) \right) , \end{aligned}$$
(6)

where \(\text {CE}\left( p,q\right) =-\sum _{k=1}^C p^k\log q^k\), and \(y_i\) is the ground-truth label of i-th source sample. Here, \(\hat{y}\) and \(\hat{y}'\) denote the outputs of the NSP classifier for the projections \(\boldsymbol{z}\) and \(\boldsymbol{z}'\) of the source sample and its transformed one, respectively. As shown in Fig. 4, optimizing the embedding loss encourages the projector \(h_\psi \) to learn a mapping that pulls the projections belonging to the same class closer together and pushes source prototypes farther away from each other.
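A sketch of Eq. (6), under the assumption that \(\hat{y}\) and \(\hat{y}'\) are the NSP probabilities from Eq. (5); the epsilon is again our addition for numerical stability.

```python
import torch

def embedding_loss(y_hat, y_hat_t, y, eps=1e-8):
    # Eq. (6): cross entropy between the ground-truth labels and the NSP
    # predictions for the original and transformed source projections.
    # y_hat, y_hat_t: (B, C) probabilities; y: (B,) integer labels.
    idx = torch.arange(len(y))
    ce = -(y_hat[idx, y] + eps).log().mean()      # CE(y, y_hat)
    ce_t = -(y_hat_t[idx, y] + eps).log().mean()  # CE(y, y_hat')
    return ce + ce_t
```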

Note that this process is applied to a frozen pre-trained source model and completed before model deployment. Therefore, it is model-agnostic and does not require source data during test time.

Auxiliary Task Loss at Test Time. Once the source prototypes are generated and the projection layer is trained, we can deploy the model and then jointly optimize both main and auxiliary tasks on unlabeled online data. The auxiliary task loss \(\mathcal {L}^{\text {aux}}_{\theta _e}\) consists of two objective functions: the entropy objective \(\mathcal {L}^{{\text {aux}}\_{\text {ent}}}_{\theta _e}\) using the entropy of the NSP classifier’s prediction \(\hat{y}\), and the self-supervised loss \(\mathcal {L}^{{\text {aux}}\_{\text {sel}}}_{\theta _e}\) that encourages the model’s encoder \(f_{\theta _e}\) to learn transformation-invariant mappings as

$$\begin{aligned} \mathcal {L}^{\text {aux}}_{\theta _e} = \mathcal {L}^{{\text {aux}}\_{\text {ent}}}_{\theta _e} + \lambda _s\mathcal {L}^{{\text {aux}}\_{\text {sel}}}_{\theta _e}, \end{aligned}$$
(7)

where \(\lambda _s\) denotes the importance of the self-supervised loss term. Similarly to Eq. (3), the entropy objective is built by using the entropy of the prediction \(\hat{y}\) of the NSP classifier on the target sample as

$$\begin{aligned} \mathcal {L}^{{\text {aux}}\_{\text {ent}}}_{\theta _e} = \lambda _{a_1}\frac{1}{N}\sum _{i=1}^{N}H(\hat{y_i})-\lambda _{a_2}H(\bar{y}), \end{aligned}$$
(8)

where N is batch size, \(\lambda _{a_1}\) and \(\lambda _{a_2}\) indicate the importance of each term, \(H(p)=-\sum _{k=1}^C p^k\log p^k\), and \(\bar{y}=\frac{1}{N}\sum _{i=1}^{N}\hat{y_i}\). The self-supervised loss is applied to the prediction \(\hat{y}'\) of the NSP classifier on the transformed target sample as

$$\begin{aligned} \mathcal {L}^{{\text {aux}}\_{\text {sel}}}_{\theta _e}=-\frac{1}{N}\sum _{i=1}^{N}\sum _{k=1}^C\hat{y}_i^k\log {\hat{y}'^k_i}. \end{aligned}$$
(9)

As shown in Fig. 5, the entropy objective (Fig. 5(b)) pulls the projection \(\boldsymbol{z}\) of the target sample closer to its nearest source prototype, and the self-supervised objective (Fig. 5(c)) encourages the projection \(\boldsymbol{z}'\) of the transformed target sample to move toward the same prototype as \(\boldsymbol{z}\).
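Putting Eqs. (7)-(9) together, a compact sketch of the auxiliary loss is given below; the defaults follow the reported hyper-parameters \(\lambda _{a_1}\)=0.8, \(\lambda _{a_2}\)=0.25, and \(\lambda _s\)=0.1 (Sect. 4.2).

```python
import torch

def aux_task_loss(y_hat, y_hat_t, lambda_a1=0.8, lambda_a2=0.25,
                  lambda_s=0.1, eps=1e-8):
    # y_hat, y_hat_t: (B, C) NSP probabilities (Eq. (5)) for the target
    # batch and its transformed version.
    ent = -(y_hat * (y_hat + eps).log()).sum(dim=1).mean()  # mean H(y_hat_i)
    mean_p = y_hat.mean(dim=0)                              # y_bar
    mean_ent = -(mean_p * (mean_p + eps).log()).sum()       # H(y_bar)
    aux_ent = lambda_a1 * ent - lambda_a2 * mean_ent        # Eq. (8)
    # Eq. (9): cross entropy between the two views' predictions.
    sel = -(y_hat * (y_hat_t + eps).log()).sum(dim=1).mean()
    return aux_ent + lambda_s * sel                         # Eq. (7)
```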

4 Experiments

This section describes the experimental setup, implementation details, and the experimental results of the comparisons with other state-of-the-art methods in test-time adaptation. We also show that generalization performance can be further improved by combining our proposed method with an existing domain generalization strategy that mainly focuses on training time in the source domain.

4.1 Experimental Setup

Following TENT [59] and T3A [25], all experiments in this paper are conducted in the online adaptation setting, where adaptation is performed concurrently with evaluation at test time without seeing the same data twice or more. After a prediction is obtained, the model is updated via back-propagation. We evaluate our proposed method on CIFAR-10-C, CIFAR-100-C, and ImageNet-C [20], and on four domain generalization benchmarks: PACS [31], OfficeHome [55], VLCS [10], and TerraIncognita [5]. Since our method can be used independently of the backbone network and its pre-training method, we apply it to publicly available pre-trained models for evaluation. We perform experiments on the CIFAR datasets using WideResNet-28-10 [65] and WideResNet-40-2 [65] as backbone networks, based on RobustBench [9]. In the domain generalization setup, we use ResNet-50 [19] without batch normalization layers, which is the default setting of DomainBed [16], the DG benchmark framework. The CIFAR-10/100 dataset [29] contains 50k images for training and 10k images for testing. Corruptions in four categories (noise, blur, weather, and digital) are applied to the 10k CIFAR-10/100 test images to create the CIFAR-10/100-C test images. For test-time adaptation, the 50k CIFAR training images are defined as the source domain, and the 10k CIFAR-C test images as the target domain.

4.2 Implementation Details

We integrate our proposed method within the frameworks officially provided by other state-of-the-art methods [25, 39, 59] for fair comparisons. Specifically, different frameworks are used for each experiment as follows: TENT framework [59] for all experiments with WRN-40-2 and WRN-28-10 backbone networks on CIFAR-10/100-C, TTT++ framework [39] for all experiments with ResNet-50 on CIFAR-10/100-C, and T3A framework [25] for all domain generalization benchmarks. For experiments on CIFAR, we follow the default values provided by each framework for experimental settings such as batch size and optimizer.

Color distortion, random grayscale, and Gaussian blurring are used as the image transformations specified in Fig. 3 and Fig. 5, and random cropping and random horizontal flipping are additionally applied for the image transformations in Fig. 4. We use batch statistics on test data instead of running estimates. The hyper-parameters are empirically set as \(\lambda _{m_1}\)=0.2, \(\lambda _{a_1}\)=0.8, \(\lambda _{m_2}\)=0.25, \(\lambda _{a_2}\)=0.25, \(\lambda _s\)=0.1, \(\lambda _r\)=250, and softmax temperature \(\tau \)=0.1. The projector is trained for 20 epochs, and N=1024 in Eq. (2). Since these hyper-parameters are not sensitive to the backbone or dataset, they are fixed without individual tuning in most experiments in this paper unless noted otherwise. The projector described in Sect. 3.3 can be configured as a single-layer or multi-layer perceptron (MLP). The MLP consists of a linear layer followed by batch normalization [24], ReLU [43], and a final linear layer with output dimension 512. The performance change according to the projector configuration is shown in Table 3, and the detailed architecture is described in the supplementary Section C.

Table 1. Comparison with other methods. * denotes the reported results from the original paper, and the others are reproduced values in our environment based on the official framework provided by TENT [59] and TTT++ [39]. Source denotes the source pre-trained model without test-time adaptation.

4.3 Robustness Against Image Corruptions

Table 1(a) shows a comparison of the robustness between our method and recent test-time adaptation methods on the most severe corruptions of CIFAR-100-C. TFA [39] and TTT++ [39] were originally implemented as offline adaptation methods that train a model by observing the same data multiple times across numerous epochs, so we adapt these methods to the online setting to reproduce the results. Our proposed method significantly outperforms other state-of-the-art methods with large margins of 3.89\(\%\) for ResNet-50 and 2.59\(\%\) for WRN-40-2. Table 1(b) shows the results on the most severe corruptions of CIFAR-10-C. Our method consistently outperforms other methods on the CIFAR-10/100-C datasets across various backbone networks. In particular, WRN-40-2, which is trained with AugMix [21] data processing to increase the robustness of the model, outperforms the other backbone networks, and our method further enhances its performance by complementing it. Table 1(c) shows the results on CIFAR-100-C over all severity levels. Since severity denotes the strength of the corruption, it indicates the magnitude of the distribution shift, and our method outperforms TENT [59] at all levels by a large margin.

Table 2. Ablation study on CIFAR-100-C. ResNet-50 is used.

4.4 Ablation Studies

Table 2 shows the effectiveness of our proposed shift-agnostic weight regularization (SWR) and nearest source prototype (NSP) classifier through ablation studies. At a high learning rate, optimizing only the main task loss based on the entropy of the model prediction results in poor performance, but adjusting the learning rate reduces the error rate to 39.44\(\%\). Adding the NSP to the main task loss leads to a performance improvement of 1.89%, and including the SWR improves the performance by 1.68% even at a high learning rate. Our method with both SWR and NSP achieves a 35.65% error rate, a 3.79% improvement over using only the main task loss.

Table 3. Comparison of error rate (\(\%\)) according to changes in projector depth.

4.5 Projector Design and Hyper-parameter Impacts

Table 3 shows the performance impact of changing the projector depth (i.e., the number of projection layers). In addition, we conduct experiments that apply the auxiliary task loss \(\mathcal {L}^{\text {aux}}_{\theta _e}\) directly to the feature representation \(\boldsymbol{h}\), the encoder's output, without using the projector. The model with the projector outperforms the one without it on CIFAR-100-C, and the opposite result is obtained on CIFAR-10-C. Since the auxiliary task loss is applied to the embedding space based on the cosine similarity between the source prototypes and the target embeddings, its effect may be minimal if they are severely misaligned. To compensate for this issue, we attach and train the projector, which minimizes the misalignment between the source and target embeddings by enabling transformation-invariant mapping and bringing the projections belonging to the same class closer together in the new embedding space. However, if the number of classes is small (e.g., CIFAR-10-C), the source and target may already be relatively well aligned compared to the case with a large number of classes (e.g., CIFAR-100-C). In this case, we conjecture that applying the auxiliary task loss directly to the encoder's output \(\boldsymbol{h}\) rather than to the projector's output \(\boldsymbol{z}\) in the new embedding space generates a better-aligned representation \(\boldsymbol{h}\) between the source and target, which can be more helpful to the classifier.

Table 4 shows the experimental results according to (a) the projector width (i.e., output dimension of the last layer), (b) the transformation used for training the projector, and (c) whether to fine-tune or freeze the projector during test-time adaptation. Our default settings are marked with gray-colored cells, and these settings are also applied to the domain generalization benchmarks in the following section without additional tuning.

Table 4. Hyper-parameter impacts on CIFAR-100-C. ResNet-50 is used.
Table 5. Comparison of error rate (\(\%\)) on CIFAR-100-C. Our method outperforms the supervised method in an online setting. LR denotes a learning rate.

4.6 Quick Adaptation

As shown in Table 5, it is natural that the supervised method performs perfectly when learning and evaluating the same test samples iteratively. Interestingly, however, our method outperforms the supervised one in an online setting where each test sample is seen only once. Unlike the other methods that require a low learning rate to train (Fig. 1(b),(d)), our method updates the entire parameters at a high learning rate. We conjecture that SWR enables quick convergence without performance degradation because only parameters sensitive to the distribution shift (i.e., parameters that need to quickly adapt to a new domain) are largely updated at a high learning rate.

4.7 Domain Generalization Benchmarks

To evaluate our method on the DG benchmarks, we follow the protocol proposed by DomainBed [16] and T3A [25]. Our method is model-agnostic, so we apply it to models pre-trained with empirical risk minimization (ERM) [54] or CORAL [51] on the source domain in order to adapt them to the target domain at test time. We use leave-one-domain-out validation [16] for model selection in all experiments in Table 6. Our method shows state-of-the-art performance on average over the four datasets and, in particular, outperforms T3A [25] and the source pre-trained models by a large margin on the PACS, OfficeHome, and TerraIncognita datasets. The detailed experimental setup can be found in the supplementary Section C.

Table 6. Comparison of accuracy (\(\%\)) on four DG benchmarks. \(^\dagger \) denotes the reported results from DomainBed [16], and the others are reproduced values.

4.8 Qualitative Results

Figure 6 visualizes the features on CIFAR-10-C using t-SNE [41]. The results in the first row are from WRN-40-2 as the source pre-trained model, and the results in the second row are from ResNet-50. Even without test-time adaptation, WRN-40-2 (AugMix) [21] is more robust against corruptions than ResNet-50 and thus yields better results. Our method significantly improves the performance in terms of intra-class cohesion and inter-class separation for both backbones.

Fig. 6. t-SNE visualization of features from the target domain (CIFAR-10-C).

5 Conclusions

This paper proposed two novel approaches for model-agnostic test-time adaptation. Our proposed shift-agnostic weight regularization enables the model to reliably and quickly adapt to unlabeled online data from the target domain by controlling the update of the model parameters according to their sensitivity to the distribution shift. In addition, our proposed auxiliary task based on the nearest source prototype classifier boosts the performance by aligning the source and target representations. Test-time adaptation is a challenging but promising area in terms of allowing the model to evolve itself while adapting to a new environment without human intervention. In this regard, our efforts aim to promote the importance of this field and stimulate new research directions.