1 Introduction

Dataset shift (Quinonero-Candela et al., 2008) poses a notable challenge for machine learning. Models often experience significant performance drops when confronting test data whose distribution differs substantially from that of the training data. Such differences may arise from changes in style, lighting conditions, or various forms of corruption, making test data deviate from the data on which these models were initially trained. To mitigate the performance degradation brought by these shifts, test-time adaptation (TTA) has emerged as a promising solution. TTA aims to rectify the dataset shift issue by adapting the model to unseen distributions using unlabeled test data (Liang et al., 2023). Different from unsupervised domain adaptation (Ganin & Lempitsky, 2015; Wang et al., 2022; Wang & Deng, 2018; Chen et al., 2023a, b, 2021; Wang et al., 2020; Luo et al., 2020), TTA does not require access to source data for distribution alignment. Commonly used TTA strategies rely on unsupervised proxy objectives, spanning techniques such as pseudo-labeling (Liang et al., 2020), graph-based learning (Luo et al., 2023), and contrastive learning (Chen et al., 2022), applied to the test data over multiple training epochs to enhance model accuracy. TTA has been adopted in applications such as autonomous vehicle detection (Hegde et al., 2021; Chen et al., 2024), pose estimation (Lee et al., 2023), video depth prediction (Liu et al., 2023a), frame interpolation (Choi et al., 2021), and medical diagnosis (Ma et al., 2022; Wang et al., 2022c; Saltori et al., 2022). Nevertheless, requiring access to the complete test set at every time step may not always align with practical use. In many applications, such as autonomous driving, adaptation is restricted to the current test batch, processed in a streaming manner. Such operational restrictions make it untenable for TTA to require the full test set.

In this study, our focus is on a specific line of TTA methods, i.e., online test-time adaptation (OTTA), which aims to accommodate real-time changes in the test data distribution. We provide a comprehensive overview of existing OTTA studies and evaluate the efficiency and effectiveness of these methods. To facilitate a structured OTTA landscape, we categorize existing approaches into three groups: optimization, data, and model-based OTTA.

  • Optimization-based OTTA focuses on various optimization techniques, such as designing new loss functions, updating normalization layers (e.g., BatchNorm) during testing, pseudo-labeling strategies, teacher–student frameworks, and contrastive learning-based approaches.

  • Data-based OTTA maximizes prediction consistency across diversified test data. Diversification strategies include using auxiliary data, improved data augmentation methods, diffusion techniques, and memory queues that store past test data.

  • Model-based OTTA adjusts the model backbone, such as modifying specific layers or their mechanisms, adding supplementary branches, and incorporating prompts.

It is potentially useful to combine methods from different categories for further improvement. An in-depth analysis of this strategy is presented in Sect. 3. Note that this survey does not include papers that require source-stage customization, such as (Thopalli et al., 2022; Döbler et al., 2023; Brahma & Rai, 2023; Jung et al., 2022; Adachi et al., 2023; Lim et al., 2023; Choi et al., 2022; Chakrabarty et al., 2023; Marsden et al., 2022; Gao et al., 2023), as they necessitate extra operations or information at the source pre-training stage, which may not be feasible in some scenarios and precludes a fair comparison.

Differences from existing surveys. Liang et al. (2023) provided a comprehensive overview of the vast topic of test-time adaptation (TTA), discussing TTA in diverse configurations and its applicability in vision, natural language processing (NLP), and graph data analysis. One limitation is that the survey does not provide experimental comparisons of existing methods. Mounsaveng et al. (2023) studied fully test-time adaptation with a focus on specific components, e.g., batch normalization, calibration, and class re-balancing, but placed less emphasis on analyzing existing methods or exploring ViTs. Marsden et al. (2024) conducted a comprehensive study of the universal test-time adaptation setting. In contrast, our survey concentrates on purely online TTA approaches and provides valuable insights from experimental comparisons, considering various domain shifts, hyperparameter selection, and backbone influence (Zhao et al., 2023b).

Contributions. As Vision Transformer (ViT) architectures gain increasing prominence, a critical question arises: do OTTA strategies, originally devised for CNNs, retain their effectiveness when applied to ViT models? This question stems from the significant architectural differences between ViTs and conventional CNNs, such as ResNets, particularly in their normalization layers and information processing mechanisms. Given the growing adoption of ViTs, investigating their compatibility with existing OTTA strategies is essential. To explore this question thoroughly, we evaluate eight representative OTTA algorithms across diverse distribution shifts, employing a set of metrics that measure both effectiveness and efficiency. Below, we summarize the key contributions of this survey:

  • [A focused OTTA survey] To the best of our knowledge, this is the first focused survey on online test-time adaptation, which provides a thorough understanding of three main working mechanisms. Experimental investigations are conducted in a fair comparison setting.

  • [Comprehensive Benchmarking and Adaptation of OTTA Strategies with ViT] We re-implemented representative OTTA baselines under the ViT architecture and evaluated their performance on six benchmark datasets. We derive a set of replacement rules that adapt existing OTTA methods to the new backbone.

  • [Both accuracy and efficiency as evaluation metrics] Apart from the traditional recognition accuracy metric, we provide insights into various facets of computational efficiency via giga floating-point operations (GFLOPs), wall-clock time, and GPU memory usage. These metrics are important in real-time streaming applications and complement the analysis of Marsden et al. (2024).

  • [Real-world testbeds] While existing literature extensively explores OTTA methods on both corruption datasets and real-world datasets (Marsden et al., 2024), the diverse difficulty levels of these datasets hinder a fair comparison. Therefore, we further assess OTTA performance on CIFAR-10-Warehouse, a newly introduced, expansive test set of CIFAR-10, to ensure a comprehensive comparison across the same label set. Using the same pre-trained model for adaptation, we provide insights into various domain shifts that were previously unexplored in the existing survey (Liang et al., 2023).

This work aims to summarize existing OTTA methods with the aforementioned three categorization criteria and analyze some representative approaches by empirical results. Moreover, to assess real-world potential, we conduct comparative experiments to explore the portability, robustness, and environmental sensitivity of the OTTA components. We expect this survey to offer a systematic perspective in navigating OTTA’s intricate and diverse solutions. We also present new challenges as potential future research directions.

Table 1: Datasets used in this survey
Fig. 1: Taxonomy of existing OTTA methods. The categories, i.e., optimization-, data-, and model-based, correspond to three mainstream working mechanisms. For clarity of illustration, prompt-related methods are categorized as model-based.

Organization of the survey. The rest of this survey is organized as follows. Section 2 presents the problem definition and introduces widely used datasets, metrics, and applications. Using the taxonomy shown in Fig. 1, Sect. 3 comprehensively reviews existing OTTA methods. With Vision Transformers as new backbones, Sect. 4 empirically analyzes eight state-of-the-art methods using multiple evaluation metrics on corruption-based and real-world distribution shifts. We introduce potential future directions in Sect. 5 and conclude this survey in Sect. 6.

2 Problem Overview

Online Test-time Adaptation (OTTA), with its real-time characteristics, represents a critical approach in test-time adaptation. This section provides a formal definition of OTTA and delves into its fundamental attributes. Furthermore, we explore widely used datasets and evaluation methods, and examine the potential application scenarios of OTTA. To ensure a clear understanding, a comparative analysis is also undertaken to differentiate OTTA from similar settings.

2.1 Problem Definition

In OTTA, we assume access to a trained source model and adapt it at test time over the test input to make the final prediction. The given source model \( f_{\theta ^S} \), parameterized by \( \theta ^S \), is pre-trained on a labeled source domain \( \mathcal {D}_S = \{ (\varvec{x}^S, \varvec{y}^S) \} \), which is formed by i.i.d. sampling from the source distribution \( p_S \). Unlabeled test data \( \mathcal {D}_T = \{ \varvec{x}_1^T, \varvec{x}_2^T, \ldots , \varvec{x}_t^T, \ldots , \varvec{x}_n^T \} \) arrive in batches, where \( t \) indicates the current time step and \( n \) is the total number of time steps (i.e., the number of batches). Test data often come from one or multiple different distributions \( (\varvec{x}_t^T, \varvec{y}_t^T) \sim p_T \), where \( p_S(\varvec{x}, y) \ne p_T(\varvec{x}, y) \) under the covariate shift assumption (Huang et al., 2006). During TTA, we update the model parameters for batch \( t \), resulting in an adapted model \( f_{\theta ^t} \).

Before adaptation, the pre-trained model is expected to retain its original architecture, especially the backbone, without modified layers or new model branches introduced during source training. Additionally, the model is restricted to observing each test batch only once and must produce predictions promptly online. By refining the definition of OTTA in this manner, we aim to minimize limitations associated with its application in real-world settings. Note that the model is reset to its original pre-trained state after being adapted to a specific domain, i.e., \( f_{\theta ^S} \rightarrow f_{\theta ^0} \rightarrow f_{\theta ^S} \rightarrow f_{\theta ^1} \rightarrow f_{\theta ^S} \rightarrow \cdots \rightarrow f_{\theta ^t} \).
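To make this protocol concrete, the sketch below illustrates the streaming evaluation loop implied by the definition above, written in PyTorch-style Python. The `adapt_and_predict` routine is a hypothetical placeholder for any concrete OTTA method; all names here are illustrative.

```python
import copy
import torch

def run_otta(source_model, domains, adapt_and_predict):
    """Streaming OTTA protocol: one pass over each domain, batch by batch.

    `adapt_and_predict(model, x)` stands in for any OTTA method; it may
    update `model` in place and returns logits for the current batch.
    """
    errors = {}
    for name, test_loader in domains.items():
        model = copy.deepcopy(source_model)   # reset to the pre-trained state per domain
        correct, total = 0, 0
        for x, y in test_loader:              # each batch is observed exactly once
            logits = adapt_and_predict(model, x)
            correct += (logits.argmax(dim=1) == y).sum().item()
            total += y.numel()
        errors[name] = 1.0 - correct / total  # error rate for this domain
    return errors
```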

In OTTA, there is no way to align the source and test sets as in unsupervised domain adaptation, so what optimization objective works in this restricted environment? As test data arrive at a fixed pace, how much data is needed for effective test-time adaptation? Will adaptation still work with new backbones (e.g., ViTs), or does “test-time adaptation” lose validity when the backbone changes? To address these concerns, we examine OTTA methods through their datasets, evaluation protocols, and applications, decoupling their strategies to identify which components work with updated backbones, and why.

2.2 Datasets

This survey summarizes datasets in image classification, while recognizing that OTTA has been applied to many downstream tasks (Ma et al., 2022; Ding et al., 2023; Saltori et al., 2022). Testbeds in OTTA usually facilitate adaptation from natural images to corrupted ones, where the latter are created by perturbations such as Gaussian noise and defocus blur. Despite including corruptions at varying severities, these synthetically induced corruptions may not sufficiently mirror the authentic domain shifts encountered in real-world scenarios. Our work uses corruption datasets (Croce et al., 2021), generated images, and real-world shift datasets, summarized in Table 1. Details of each testbed are described below.

CIFAR-10-C is a standard benchmark for evaluating robustness in image classification. It contains 950,000 color images, each of \(32\times 32\) pixels, spanning ten distinct classes. CIFAR-10-C retains the class structure of CIFAR-10 but incorporates 19 diverse corruption types, with severities ranging from level 1 to 5. This corrupted variant aims to simulate realistic image distortions that might arise during processes such as image acquisition, storage, or transmission.

CIFAR-100-C has 950,000 color images of \(32\times 32\) pixels, uniformly distributed across 100 unique classes. Analogous to CIFAR-10-C, it integrates artificial corruptions into the canonical CIFAR-100 images.

ImageNet-C is a corrupted version of the ImageNet (Krizhevsky et al., 2012) validation set. Produced from ImageNet-1k, it follows a corruption setup similar to CIFAR-10-C and CIFAR-100-C. For each corruption type, five severity levels are produced, each containing 50,000 images from 1,000 classes.

CIFAR-10.1 (Recht et al., 2018) is a real-world test set of CIFAR-10. It contains roughly 2,000 images sampled from the Tiny Image dataset (Yang et al., 2016).

CIFAR-10-Warehouse (Sun et al., 2023) integrates images from both a diffusion model (i.e., Stable Diffusion-2-1 (Rombach et al., 2022)) and targeted keyword searches across eight popular search engines. The diffusion model uses the prompt “high quality photo of {color} {class name}”, with the color chosen from 12 options. The dataset comprises 37 generated and 143 real-world subsets, each containing between 300 and 8,000 images.

OfficeHome is a widely used benchmark in domain adaptation and domain generalization tasks. It has 65 classes within 4 distinct domains: Artistic images (Art), Clip Art, Product images (Product), and Real-World images (RealWorld).

2.3 Evaluation

For a faithful comparison, effectiveness and efficiency are both considered in online test-time adaptation. This survey employs the following evaluation metrics:

mean Error (mErr) is one of the most commonly used metrics to assess model accuracy. It computes the average error rate across all corruption types or domains.

GFLOPs refers to giga floating-point operations, quantifying the number of floating-point calculations a model performs for a given workload (e.g., one forward or adaptation pass). A model with lower GFLOPs is more computationally efficient.

Wall-clock time measures the actual time taken by the model to complete the adaptation process.

GPU memory usage refers to the amount of memory the model uses while running on a GPU. A model with lower GPU memory usage is more applicable to a wider range of devices.
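As a rough illustration of how these efficiency metrics can be collected in practice, the snippet below measures wall-clock time and peak GPU memory around a single adaptation step using standard PyTorch utilities (a CUDA device is assumed). GFLOPs are typically obtained separately, by applying a FLOP-counting profiler to one forward pass, which we leave abstract here.

```python
import time
import torch

def measure_step(adapt_and_predict, model, x):
    """Time one adaptation step and record peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()                   # flush pending kernels before timing
    start = time.perf_counter()
    preds = adapt_and_predict(model, x)
    torch.cuda.synchronize()                   # wait for the step to finish
    wall_clock = time.perf_counter() - start   # seconds
    peak_mem = torch.cuda.max_memory_allocated() / 2**20  # MiB
    return preds, wall_clock, peak_mem
```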

2.4 Relationship with Other Tasks

Offline Test-time Adaptation (TTA) (Liang et al., 2020, 2022; Ding et al., 2022; Yang et al., 2021) aims to adapt a source pre-trained model to the target (i.e., test) set with access to the entire dataset at once. This differs from online test-time adaptation, where test data is given in batches.

Continual TTA Contrary to the classic OTTA setup, where adaptation occurs in discrete steps corresponding to distinct domain shifts, continual TTA (Wang et al., 2022a; Hong et al., 2023; Song et al., 2023; Chakrabarty et al., 2023; Gan et al., 2023) operates under the premise of seamless, continuous adaptation to new data distributions. This process does not require resetting the model with each perceived domain shift. Instead, it emphasizes the importance of a model’s ability to autonomously update and refine its parameters in response to ongoing changes in the data landscape, without explicit indicators of domain boundaries.

Gradual TTA tackles real-world scenarios where domain shifts are gradually introduced through incoming test samples (Marsden et al., 2022; Döbler et al., 2023). An example is the gradual and continuous change in weather conditions. For corruption datasets, existing gradual TTA approaches assume that test data transition from severity level 1 to level 2 and then progress slowly toward the highest level. Both continual and gradual TTA methods ensure online adaptation.

Test-time Training (TTT) introduces an auxiliary task for both training and adaptation (Sun et al., 2020; Gandelsman et al., 2022). During training, the original backbone is modified into a “Y”-shaped structure, with one branch for image classification and another for an auxiliary task, such as rotation prediction. During adaptation, the auxiliary task continues to be trained in a supervised manner, updating the model parameters. The classification head output serves as the final prediction for each test sample.

Test-time Augmentation (TTAug) applies data augmentations to input data during inference, creating multiple variations of the same test sample, from which predictions are obtained (Shanmugam et al., 2021; Kimura, 2021). The final prediction typically aggregates these predictions through averaging or majority voting. TTAug enhances model performance by providing a range of data views and can be applied to various tasks, including domain adaptation, offline TTA, and OTTA, as it does not require any modification to the model training process.

Domain Generalization (Qiao et al., 2020; Wang et al., 2021b; Xu et al., 2021; Zhou et al., 2023) aims to train models that perform effectively across multiple distinct domains without specific adaptation. It assumes the model learns domain-invariant features applicable across diverse datasets. While OTTA emphasizes dynamic adaptation to specific domains over time, domain generalization seeks to establish domain-agnostic representations.

3 Online Test-time Adaptation

Given the distribution divergence between online data and source training data, OTTA techniques are broadly classified into three categories that hinge on their responses to two primary concerns: managing online data and mitigating performance drops due to distribution shifts. Optimization-based methods, anchored in designing unsupervised objectives, typically adjust or fine-tune the pre-trained model. Model-based approaches modify or introduce particular layers. Data-based methods, on the other hand, aim to expand data diversity, either to improve model generalization or to enforce consistency across data views. According to this taxonomy, we sort existing approaches in Table 9 and review them in detail below.

3.1 Optimization-Based OTTA

Optimization-based OTTA methods consist of three sub-categories: (1) recalibrating statistics in normalization layers, (2) enhancing optimization stability with the mean-teacher model, and (3) designing unsupervised loss functions. A timeline of these methods is illustrated in Fig. 2.

3.1.1 Normalization Calibration

In deep learning, a normalization layer aims to improve the training process and enhance the generalization capacity of neural networks by regulating the statistical properties of activations within a given layer. Batch normalization (BatchNorm) (Ioffe & Szegedy, 2015), the most commonly used normalization layer, stabilizes the training process by utilizing global statistics or a large batch size. By standardizing the mean and variance of activations, BatchNorm reduces the risk of vanishing or exploding gradients during training. Alternatives to BatchNorm include layer normalization (LayerNorm) (Ba et al., 2016), group normalization (GroupNorm) (Wu & He, 2020), and instance normalization (InstanceNorm) (Ulyanov et al., 2016) (Fig. 3). A similar concept to normalization layers is feature whitening, which adjusts features immediately after the activation layer. Both strategies are used in domain adaptation literature (Roy et al., 2019; Carlucci et al., 2017).

Fig. 2: Timeline of optimization-based OTTA methods

Example. Take the most commonly used BatchNorm as an example. Let \(\varvec{x}_i\) be the activation of the i-th sample in a mini-batch for a given feature channel. The BatchNorm layer will first calculate the batch-level mean \(\varvec{\mu }\) and variance \(\varvec{\sigma }^2\) by:

$$\begin{aligned} \quad \varvec{\mu }= \frac{1}{m} \sum _{i=1}^{m} \varvec{x}_i, \quad \varvec{\sigma }^2 = \frac{1}{m} \sum _{i=1}^{m} (\varvec{x}_i - \varvec{\mu })^2, \end{aligned}$$
(1)

where m is the mini-batch size. Then, the calculated statistics will be applied to standardize the inputs:

$$\begin{aligned} \quad \hat{\varvec{x}}_i = \frac{\varvec{x}_i - \varvec{\mu }}{\sqrt{\varvec{\sigma }^2 + \epsilon }},\quad \varvec{y}_i = \gamma \hat{\varvec{x}}_i + \beta , \end{aligned}$$
(2)

where \(\varvec{y}_i\) is the final output of the i-th sample from this batch normalization layer, scaled and shifted by two learnable affine parameters \(\gamma \) and \(\beta \); \(\epsilon \) is a small constant that avoids division by zero. For the update, the running mean \(\varvec{\mu }^\text {run}\) and variance \(\varvec{\sigma }^\text {run}\) are computed as a moving average of the mean and variance over all batches seen during training, with a momentum factor \(\alpha \):

$$\begin{aligned} \varvec{\mu }^\text {run} = \alpha \varvec{\mu }+ (1-\alpha ) \varvec{\mu }^\text {run},\quad \varvec{\sigma }^\text {run} = \alpha \varvec{\sigma }+ (1-\alpha ) \varvec{\sigma }^\text {run}. \end{aligned}$$
(3)
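For reference, Eqs. (1)–(3) translate directly into a few lines of code. The minimal sketch below mirrors what PyTorch's `nn.BatchNorm2d` computes internally for a 4-D activation tensor; it is written out only for exposition.

```python
import torch

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      momentum=0.1, eps=1e-5):
    """x: (N, C, H, W). Returns the normalized output and updated running stats."""
    mu = x.mean(dim=(0, 2, 3))                      # Eq. (1): per-channel batch mean
    var = x.var(dim=(0, 2, 3), unbiased=False)      # Eq. (1): per-channel batch variance
    x_hat = (x - mu[None, :, None, None]) / torch.sqrt(
        var[None, :, None, None] + eps)             # Eq. (2): standardization
    y = gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
    # Eq. (3): exponential moving average of the batch statistics
    running_mean = momentum * mu + (1 - momentum) * running_mean
    running_var = momentum * var + (1 - momentum) * running_var
    return y, running_mean, running_var
```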

Motivation. In domain adaptation, aligning batch normalization statistics helps mitigate performance degradation caused by covariate shifts (Wang et al., 2023b). The hypothesis is that label-related knowledge is encoded in the weight matrices of each layer, while domain-specific knowledge is conveyed through the statistics of the BatchNorm layer. Therefore, updating the BatchNorm statistics can enhance performance on unseen domains (Li et al., 2017). This idea is broadly applied to online test-time adaptation, where we assume that, for a neural network \( f \) trained on a source dataset \( \mathcal {D}_S \) with normalization parameters \( \gamma \) and \( \beta \), updating \( \{\gamma , \beta \} \) based on test data \( \varvec{x}_i^t \) at each time step \( t \) will improve \( f \)'s robustness on the test domain.

Building on this assumption, initial investigations in OTTA predominantly focused on fine-tuning by updating only the normalization layers. This strategy has several popular variations. A common practice adjusts the statistics (\(\mu \) and \(\sigma \)) and affine parameters (\(\beta \) and \(\gamma \)) in the BatchNorm layer. The choice of normalization techniques, such as LayerNorm (Ba et al., 2016) or GroupNorm (Wu & He, 2020), may depend on the backbone architecture and specific optimization objectives, such as stabilizing entropy minimization (Niu et al., 2023).

Fig. 3: Visualizing normalization layers (Wu & He, 2020)

Tent (Wang et al., 2021a) and its subsequent works, such as (Jang et al., 2023), are representative approaches within this paradigm. They update the statistics and affine parameters of BatchNorm for each test batch, via soft entropy minimization, while freezing the remaining parameters. However, the effectiveness of such batch-level updates depends on the data quality within each batch, introducing potential performance fluctuations: noisy or biased data can significantly distort BatchNorm updates. Methods that stabilize adaptation via dataset-level estimates have been proposed to mitigate such fluctuations. Gradient-preserving batch normalization (GpreBN) (Yang et al., 2022) allows cross-instance gradient backpropagation by modifying the BatchNorm normalization factor:

$$\begin{aligned} \quad \hat{\varvec{y}}_i=\frac{\frac{\varvec{x}_i-\varvec{\mu }_c}{\varvec{\sigma }_c} \bar{\varvec{\sigma }}_c+\bar{\varvec{\mu }}_c-\varvec{\mu }}{\varvec{\sigma }} \gamma +\beta , \end{aligned}$$
(4)

where \(\frac{\varvec{x}_i-\varvec{\mu }_c}{\varvec{\sigma }_c}\) is the standardized input feature \(\hat{\varvec{x}}_i\) as in Eq. (2), with the statistics \(\varvec{\mu }_c\) and \(\varvec{\sigma }_c\) computed under stopped gradients. GpreBN normalizes \(\hat{\varvec{x}}_i\) by arbitrary non-learnable parameters \(\varvec{\mu }\) and \(\varvec{\sigma }\). MixNorm (Hu et al., 2021) mixes the statistics of the current batch, produced by augmented sample inputs, with global statistics computed through a moving average. Combining global-level and augmented batch-level statistics bridges the gap between historical context and real-time fluctuations, enhancing performance regardless of batch size. As an alternative, RBN (Yuan et al., 2023a) uses global robust statistics from a memory bank with a fixed momentum for the moving average to ensure high statistic quality. Similarly, Core (You et al., 2021) incorporates a momentum factor in the moving average to fuse source and test set statistics.

Instead of using a fixed momentum factor for the moving average, Mirza et al. (2022) proposed a dynamic approach that determines the momentum based on a decay factor. As model performance deteriorates over time, the decay factor increasingly considers the current batch to avoid biased learning from misled source statistics. ERSK (Niloy et al., 2023) follows a similar idea but determines its momentum by the KL divergence of BatchNorm statistics between the source-pretrained model and the current test batch.

Stabilization via renormalization. Focusing solely on moving averages can undermine the inherent characteristics of gradient optimization and normalization when updating BatchNorm layers. As noted by Huang et al. (2018), BatchNorm centers and scales activations but does not address their correlation, where decorrelated activations can lead to better feature representation (Schmidhuber, 1992) and generalization (Cogswell et al., 2016). Additionally, batch size significantly influences correlated activations, posing limitations when the batch size is small. The test-time batch renormalization module (TBR) in DELTA (Zhao et al., 2023a) addresses these limitations through a renormalization process. It adjusts standardized outputs using two new parameters, \(r=\frac{sg(\hat{\varvec{\sigma }}^{\text{batch}})}{\hat{\varvec{\sigma }}^{\text{ema}}}\) and \(d=\frac{sg(\hat{\varvec{\mu }}^{\text{batch}})-\hat{\varvec{\mu }}^{\text{ema}}}{\hat{\varvec{\sigma }}^{\text{ema}}}\), where \(sg(\cdot )\) denotes stop gradient. Both parameters are computed from batch statistics and global moving statistics (inspired by Ioffe, 2017) to maintain stable normalization. The standardized feature is then renormalized as \(\hat{\varvec{x}}_i \leftarrow \hat{\varvec{x}}_i \cdot r + d\).

The above OTTA methods reset the model for each domain, limiting their applicability to scenarios without clear domain boundaries. NOTE (Gong et al., 2022) focuses on continual OTTA under temporal correlation, i.e., the distribution changes over time t: \(\left( {\textbf {x}}_t, \varvec{y}_t\right) \sim P_{\mathcal {T}}({\textbf {x}}, \varvec{y} \mid t)\). The authors proposed instance-level BatchNorm to handle instance-wise variations when domains are not identifiable.

Stabilization via enlarging batches. To improve the stability of adaptation, another idea is to use large batch sizes. In fact, most BatchNorm-based methods employ substantial batch sizes, such as 200 in (Wang et al., 2021a; Hu et al., 2021). Despite its effectiveness, this practice cannot handle scenarios where data arrive in smaller quantities due to hardware (e.g., GPU memory) constraints, especially on edge devices.

Alternatives to BatchNorm. To avoid relying on large batches, viable options include updating GroupNorm (Mummadi et al., 2021) or LayerNorm, especially in transformer-based tasks (Kojima et al., 2022). In scenarios with limited computational resources, MECTA (Hong et al., 2023) replaces BatchNorm with a customized MECTA norm, reducing memory usage during adaptation. This change mitigates the overhead associated with large batches, extensive channel dimensions, and numerous layers requiring updates. Taking a different tack, EcoTTA (Song et al., 2023) incorporates and exclusively updates meta networks, including BatchNorm layers, effectively reducing computational expenses while maintaining source data discriminability and robust test-time performance. To address performance challenges with smaller batch sizes, TIPI (Nguyen et al., 2023) introduces additional BatchNorm layers alongside existing ones, maintaining two sets of data statistics and leveraging shared affine parameters to enhance consistency across different views of test data.

3.1.2 Mean Teacher Optimization

The mean teacher model, discussed in (Tarvainen & Valpola, 2017), enhances optimization stability in OTTA. This approach initializes both the teacher and student models with a pre-trained source model. For each test sample, weakly and strongly augmented versions are created and processed by the student and teacher models, respectively. The key lies in using prediction consistency, or consistency regularization, to update the student model. This strategy encourages identical predictions across different data views, reducing model sensitivity to test data changes and improving stability. The teacher model is refined as a moving average of the student across iterations. In OTTA, the mean teacher model and BatchNorm-based methods can be effectively integrated: incorporating BatchNorm updates into the teacher–student framework can yield more robust results (Sect. 4). Similarly, integrating the mean teacher model with data-driven (Sect. 3.2) or model-driven (Sect. 3.3) methods can further enhance prediction accuracy and stability.
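The teacher update itself is a simple exponential moving average of the student's weights. Below is a hedged sketch of one consistency step; the exact consistency loss and, notably, which augmented view is routed to which network vary across methods, so the routing here is illustrative.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Teacher weights track the student as an exponential moving average."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1 - alpha)

def mean_teacher_step(teacher, student, optimizer, view_a, view_b):
    """One consistency step: the teacher's soft prediction on one augmented
    view supervises the student's prediction on the other view."""
    with torch.no_grad():
        target = teacher(view_a).softmax(dim=1)        # soft pseudo-label
    log_probs = student(view_b).log_softmax(dim=1)
    loss = -(target * log_probs).sum(dim=1).mean()     # cross-entropy, soft targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)                       # refine the teacher
    return loss.item()
```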

Model updating strategies. Following the idea of mean-teacher learning, ViDA (Liu et al., 2023b) supervises the student output with teacher predictions from augmented input. It introduces high/low-rank adapters to facilitate continual OTTA learning (see Sect. 3.3). Wang et al. (2022a) generally followed the standard consistency learning strategy but introduced a reset method: a fixed number of weights are reset to their source pre-trained states after each iteration to preserve source knowledge and enhance robustness against misinformed updates.

RoTTA (Yuan et al., 2023a) adopts a different approach, focusing on updating only the customized batch normalization layer RBN in the student model, rather than altering all parameters. This strategy leverages consistency regularization and integrates statistics from the test data.

Divergence in augmentations. Drawing inspiration from the prediction consistency strategy in the mean teacher model, Tomar et al. (2023) proposed learning adversarial augmentation to identify the most challenging augmentation policies. These policies drive image feature representations toward uncertain regions near decision boundaries. This method not only achieves clearer decision boundaries but also enhances the separation of class-specific features, significantly improving model robustness to styles of unseen test data.

3.1.3 Optimization Objectives

Designing a proper optimization objective is important given the challenges of limited and distribution-shifted test data. Commonly used optimization objectives in OTTA are summarized in Fig. 4. Existing literature addresses the optimization problem using three primary strategies:

Optimizing (increasing) confidence. Covariate shifts typically lead to lower model accuracy, which often manifests as high prediction uncertainty. As such, an intuitive way to improve model performance is to enhance model confidence on the test data.

Fig. 4: Common optimization objectives in OTTA

Entropy-based confidence optimization. This strategy typically aims to minimize the entropy of the softmax output vector:

$$\begin{aligned} \quad H(\hat{y})=-\sum _c p\left( \hat{y}_c\right) \log p\left( \hat{y}_c\right) , \end{aligned}$$
(5)

where \(\hat{y}_c\) is the c-th predicted class and \(p\left( \hat{y}_c\right) \) is its corresponding prediction probability. Intuitively, as the entropy of the prediction decreases, the output vector becomes sharper and the (maximum) confidence increases. In OTTA, minimizing entropy increases the model's confidence on the current batch without relying on labels, thereby improving accuracy.

There are two main approaches within this strategy: one considers the entire softmax vector, and the other focuses on the maximum entry of the softmax output. Tent is a typical method for the former, using entropy minimization to update the affine parameters in BatchNorm. Subsequent studies have expanded upon this strategy. For example, Seto et al. (2023) introduced entropy minimization with self-paced learning, ensuring the learning process progresses adaptively. By integrating the general adaptive robust loss (Barron, 2019), the method achieves robustness against large and unstable loss values. TTPR (Sivaprasad & Fleuret, 2021) combines entropy minimization with prediction reliability, using a consistency loss across various views of a test image by merging the mean prediction across three augmented versions. Lin et al. (2023) minimized entropy loss on augmentation-averaged predictions while assigning high weights to low-entropy samples. In SAR (Niu et al., 2023), sharpness-aware minimization is applied while minimizing entropy, allowing the parameters to settle in “flatter” minimum regions for better model update stability.
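A minimal Tent-style sketch follows: everything is frozen except the BatchNorm affine parameters, the layers are switched to current-batch statistics, and the entropy of Eq. (5) is minimized on each incoming batch. The optimizer and hyperparameters are illustrative, not prescriptive.

```python
import torch
import torch.nn as nn

def configure_tent(model):
    """Freeze all parameters except BatchNorm affine terms; use batch statistics."""
    model.train()                                 # BN uses batch statistics in train mode
    affine_params = []
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.track_running_stats = False         # ignore source running statistics
            m.running_mean, m.running_var = None, None
            affine_params += [m.weight, m.bias]   # gamma and beta
    for p in model.parameters():
        p.requires_grad_(False)
    for p in affine_params:
        p.requires_grad_(True)
    return model, affine_params

def tent_step(model, x, optimizer):
    """Adapt on one batch by entropy minimization, then return predictions."""
    logits = model(x)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()  # Eq. (5)
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()
```

In use, the optimizer is constructed over `affine_params` only, e.g., `torch.optim.SGD(affine_params, lr=1e-3)`.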

While entropy minimization is widely used, a natural question arises: what makes soft entropy a preferred choice? To reveal the working mechanism of the loss function, Conj-PL (Goyal et al., 2022) addresses this by designing a meta-network to parameterize the loss function, observing that the meta-output mirrors the temperature-scaled softmax output. They prove that if cross-entropy loss is used during source pre-training, soft entropy loss is the most appropriate during adaptation.

However, a known risk of entropy minimization is a degenerate solution where every data point is assigned to the same class. To avoid this, MuSLA (Kingetsu et al., 2022) employs the mutual information between the sample X and the corresponding prediction \(\hat{Y}\),

$$\begin{aligned} I_t(X ; \hat{Y})=H\left( \hat{\varvec{p}}_0\right) -\frac{1}{M} \sum _{i=1}^{M} H\left( \hat{\varvec{p}}_t^i\right) , \end{aligned}$$
(6)

where \(\hat{\varvec{p}}_0\) is the prior (batch-marginal) distribution, M is the mini-batch size, and i is the index of the sample within the batch. The first term, \(H(\hat{\varvec{p}}_0)\), acts as a regularizer that discourages assigning every sample to the same class.

In entropy minimization, gradients can often be dominated by low-confidence predictions. Conversely, cross-entropy loss can be too strict with predicted labels, leading to incorrect updates even with a single wrong prediction. To address this trade-off, Mummadi et al. (2021) proposed the Soft Likelihood Ratio (SLR) loss, which was further employed by Marsden et al. (2024). This approach emphasizes predicted classes while addressing the issue raised by MuSLA:

$$\begin{aligned} \quad \mathcal {L}_{{\text {SLR}}}(\hat{y}_{ti}) = - \sum _{c} w_{ti} \hat{y}_{tic} \log \left( \frac{\hat{y}_{tic}}{\sum _{j \ne c} \hat{y}_{tij}} \right) , \end{aligned}$$
(7)

where \(\hat{y}_{ti}\) is the softmax probability for the i-th test sample at time step t, and \(w_{ti}\) is a ROID-specific weight combining diversity (the similarity between the recent trend of the model's predictions and the current model output) and certainty (the negative entropy of the output). If the output confidence for class c is low, the loss calculation is reweighted by the summation over all other predictions \(\sum _{j \ne c} \hat{y}_{tij}\) in the denominator, reducing the focus on low-confidence classes.

The cooperation between a teacher and a student is another possible solution to optimize prediction confidence with reliability. Here, the teacher is usually the moving averaged model of interest across iterations. CoTTA (Wang et al., 2022a) uses the Softmax prediction from the teacher to supervise the Softmax predictions from the student under the cross-entropy loss.

To give more precise supervision to the student, RoTTA (Yuan et al., 2023b) adds a twist: the model is updated using samples stored in a memory bank. A reweighting mechanism is introduced to prevent the model from overfitting to “old” samples in the memory bank, prioritizing updates using “new” samples, ensuring a more dynamic and current learning process. See Sect. 3.2.2 for more details about its memory bank strategy.

Supervised by the student output, TeSLA (Tomar et al., 2023) designs its objective as a cross-entropy loss with a regularizer:

$$\begin{aligned} \mathcal {L}_{\text {pl}}(\varvec{X}, \hat{\varvec{Y}})= & -\frac{1}{B} \sum _{i=1}^B \sum _{k=1}^K f_s\left( \varvec{x}_i\right) _k \log \left( \left( \hat{\varvec{y}}_i\right) _k\right) \nonumber \\ & +\sum _{k=1}^K \hat{f}_s(\varvec{X})_k \log \left( \hat{f}_s(\varvec{X})\right) _k, \end{aligned}$$
(8)

where \(\hat{f}_s(X)=\frac{1}{B} \sum _{i=1}^B f_s\left( \varvec{x}_i\right) \) is the marginal class distribution of the student over the batch, \(\hat{\varvec{y}}_i\) is the soft pseudo-label from the teacher model, and K is the number of classes. In addition to a cross-entropy loss similar to the previous methods, it maximizes the entropy of the student's batch-averaged prediction to avoid overfitting.

A cross-entropy loss helps minimize model prediction uncertainty but may fail to provide consistent uncertainty scores under different augmentations. This issue is more critical when the teacher–student mechanism is not used. To address this, MEMO (Zhang et al., 2022) computes the average prediction across multiple augmentations for each test sample and then minimizes the entropy of the marginal output distribution over augmentations:

$$\begin{aligned} \ell (\theta ; {\textbf {x}}) \triangleq H\left( \bar{p}_\theta (\cdot \mid {\textbf {x}})\right) =-\sum _{y \in \mathcal {Y}} \bar{p}_\theta (y \mid {\textbf {x}}) \log \bar{p}_\theta (y \mid {\textbf {x}}), \end{aligned}$$
(9)

where \(\bar{p}\) denotes the averaged Softmax vector. Here, augmentations are randomly generated by AugMix (Hendrycks et al., 2020).
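A compact sketch of the marginal-entropy objective in Eq. (9), assuming a hypothetical `augment` function that returns one randomly augmented copy of the input (AugMix in the original work):

```python
import torch

def memo_loss(model, x, augment, n_aug=32):
    """Entropy of the prediction averaged over augmented views of one sample.

    x: a single image tensor (C, H, W); `augment` is assumed to return a
    tensor of the same shape (an assumption for this sketch).
    """
    views = torch.stack([augment(x) for _ in range(n_aug)])   # (n_aug, C, H, W)
    probs = model(views).softmax(dim=1)
    p_bar = probs.mean(dim=0)                                 # marginal over views
    return -(p_bar * p_bar.clamp_min(1e-8).log()).sum()       # Eq. (9)
```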

To encourage consistency against smaller perturbations, Marsden et al. (2024) proposed a consistency loss based on symmetric cross-entropy loss (SCE) (Wang et al., 2019):

$$\begin{aligned} \begin{aligned}&\mathcal {L}_{SCE}(\hat{y}_{ti}; y_{ti}) \\&\quad = - \frac{w_{ti}'}{2} \left( \sum _{c=1}^{C} \hat{y}_{tic} \log \tilde{y}_{tic} + \sum _{c=1}^{C} \tilde{y}_{tic} \log \hat{y}_{tic} \right) . \end{aligned} \end{aligned}$$
(10)

This loss promotes similar outputs between test images identified as certain and diverse and their augmented views. Here, \(\hat{y}_{ti}\) is the softmax probability for the \(i\)-th test sample at time step \(t\), \(\tilde{y}_{t}\) is the softmax probability of the augmented view, and \(w_{ti}'\) is the weight of the augmented view as in Eq. (7).

Prototype-based optimization. Prototype-based learning (Yang et al., 2018) is a strategy used for unlabeled data by selecting a representative or average for each class and classifying unlabeled data using distance-based metrics. However, its effectiveness can be limited under distribution shifts. To find reliable prototypes, TSD (Wang et al., 2023a) uses a Shannon entropy-based filter to identify class prototypes from target samples with high confidence. A target sample is then used to update the classifier-of-interest if its nearest prototypes are consistent with its class predictions from the same classifier.

Improving generalization ability to unseen target samples. Typically, OTTA uses the same batch of data for model update and evaluation; nevertheless, we would like the model to perform well on upcoming test samples, i.e., to have good generalization ability. A useful technique is sharpness-aware minimization (SAM) (Foret et al., 2021): instead of seeking a minimum whose loss landscape is “sharp” nearby, a “flat” minimum region is preferred. Niu et al. (2023) used the following formulation to demonstrate the effectiveness of this strategy:

$$\begin{aligned} \quad \min _{\tilde{\Theta }} S({\textbf {x}}) E^{S A}({\textbf {x}} ; \Theta ). \end{aligned}$$
(11)

Here, \(S({\textbf {x}})\) is an entropy-based indicator function that can filter out unreliable predictions based on a predefined threshold. \(E^{S A}({\textbf {x}}; \Theta )\) is defined as:

$$\begin{aligned} \quad E^{S A}({\textbf {x}} ; \Theta ) \triangleq \max _{\Vert \epsilon \Vert _2 \le \rho } E({\textbf {x}} ; \Theta +\varvec{\epsilon }). \end{aligned}$$
(12)

This term identifies a weight perturbation \(\varvec{\epsilon }\) within a Euclidean ball of radius \(\rho \) that maximizes entropy. It quantifies sharpness as the maximal change in entropy between \(\Theta \) and \(\Theta + \varvec{\epsilon }\). As such, Eq. (11) jointly minimizes entropy and its sharpness. Gong et al. (2023) used the same idea for their optimization.
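The two-step structure of sharpness-aware entropy minimization can be sketched as follows. The reliability filter \(S({\textbf {x}})\) from Eq. (11) is omitted for brevity, and the update follows the generic SAM recipe rather than any specific codebase.

```python
import torch

def entropy(logits):
    probs = logits.softmax(dim=1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()

def sam_entropy_step(model, params, x, optimizer, rho=0.05):
    """One sharpness-aware entropy-minimization step (sketch)."""
    entropy(model(x)).backward()                  # gradients at the current weights
    grads = [(p, p.grad.clone()) for p in params if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for _, g in grads]), p=2)
    perturbed = []
    with torch.no_grad():
        for p, g in grads:
            e = rho * g / (grad_norm + 1e-12)     # epsilon maximizing entropy locally
            p.add_(e)                             # move to the worst-case point (Eq. 12)
            perturbed.append((p, e))
    optimizer.zero_grad()
    entropy(model(x)).backward()                  # gradients at the perturbed weights
    with torch.no_grad():
        for p, e in perturbed:
            p.sub_(e)                             # restore the original weights
    optimizer.step()                              # descend with sharpness-aware gradients
    optimizer.zero_grad()
```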

Another difficulty affecting OTTA generalization is class imbalance within a batch: the limited data available for a model update often cannot reflect the true class frequencies. To address this, Dynamic Online Re-weighting (DOT) in DELTA (Zhao et al., 2023a) uses a momentum-updated class-frequency vector. This vector is initialized with equal weights for each class and is updated at every inference step based on the pseudo-label of the current sample and the model weights. For a target sample, a significant weight (or frequency) for a particular class prompts DOT to diminish its contribution during subsequent adaptation. This prevents biased optimization towards frequent classes, thereby improving model generalization.

Feature representation learning. Since no annotations are assumed for the test data, contrastive learning (van den Oord et al., 2018) can be naturally applied to test-time adaptation tasks. In a self-supervised manner, contrastive learning aims to learn feature representations where positive pairs (a data sample and its augmentations) are close, and negative pairs (different data samples) are pushed away from each other. However, this typically requires multiple epochs of updates, which is incompatible with the online adaptation setting. To suit online learning, AdaContrast (Chen et al., 2022) uses target pseudo-labels to disregard potential same-class negative samples rather than treating all other data samples as negative.

3.1.4 Pseudo-Labeling

Pseudo-labeling is a useful technique in domain adaptation and semi-supervised learning. It assigns labels to samples with high confidence, and these pseudo-labeled samples are then used for training.

In OTTA, where adaptation is confined to the current batch of test data, batch-level pseudo-labeling is often employed. For example, MuSLA (Kingetsu et al., 2022) implements pseudo-labeling as a post-optimization step following BatchNorm updates, refining the classifier using pseudo-labels of the current batch to enhance model accuracy.

Furthermore, the teacher–student framework, as seen in models like CoTTA (Wang et al., 2022a), RoTTA (Yuan et al., 2023b), and ViDA (Liu et al., 2023b), also adopts the pseudo-labeling strategy. Here, the teacher outputs serve as soft pseudo-labels, which prevents the model from overfitting to incorrect predictions.

Reliable pseudo-labels are essential but challenging to obtain in OTTA. Continuous data streams limit opportunities to revisit predictions, and covariate shifts between the source and test sets degrade pseudo-label reliability.

To address these challenges, TAST (Jang et al., 2023) uses a prototype-based pseudo-labeling strategy. Class centroids are obtained from a support set, initially derived from the source pre-trained classifier’s weights and refined using normalized test data features. To avoid performance degradation brought by unreliable pseudo labels, it calculates centroids only using the nearby support examples and then uses the temperature-scaled output to obtain the pseudo labels. Alternatively, AdaContrast (Chen et al., 2022) uses soft K nearest neighbors voting (Mitchell & Schaefer, 2001) in the feature space to produce reliable pseudo-labels. Wu et al. (2021) suggest using multiple augmentations and majority voting to achieve consistent and trustworthy pseudo-labels.

Complementary pseudo-labeling (PL). One-hot pseudo-labels often result in substantial information loss, especially under domain shifts. To address this, ECL (Han et al., 2023) considers both maximum-probability predictions and predictions that fall below a certain confidence threshold (i.e., complementary labels). The intuition is that if the model is less confident about a prediction, this prediction should be penalized more heavily. This helps prevent the model from making aggressive updates based on incorrect but high-confidence predictions, offering a more stable approach similar to soft pseudo-label updates.

3.1.5 Other Approaches

Deviating from the conventional path of adapting source pre-trained models, Laplacian Adjusted Maximum likelihood Estimation (LAME) (Boudiaf et al., 2022) focuses on refining the model output. This is achieved by discouraging the refined output from deviating from the pre-trained model's output while encouraging label smoothness according to the manifold smoothness assumption. The final refined prediction is obtained once the energy gap between successive refinement steps for a batch becomes small.

To prevent loss of generalization and catastrophic forgetting, weight ensembling offers a solution. ROID (Marsden et al., 2024) continuously ensembles the weights of the initial source model and the weights of the current model at time step t using a moving average, allowing for partial retention of source knowledge. This approach is similar to the parameter reset strategy in CoTTA, commonly used in continual adaptation tasks. Additionally, to address temporal correlation and class imbalance during adaptation, ROID introduces prior correction. The intuition is that if the class distribution within a batch tends to be uniform, strong smoothing is applied to ensure no class is favored. This is indicated by the sample mean over the current softmax prediction \(\hat{p}_t\). Thus, the smoothing scheme is defined as:

$$\begin{aligned} \bar{p}_{t} = \frac{\hat{p}_{t} + \gamma }{1 + \gamma N_{c}}, \end{aligned}$$
(13)

where \(N_c\) denotes the number of classes and \(\gamma \) is an adaptive smoothing factor.
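Both mechanisms reduce to a few lines. The sketch below assumes the smoothing factor \(\gamma \) is given (ROID derives it adaptively from the batch) and that `source_state` holds a copy of the initial source parameters keyed by name; both names are ours, not the method's.

```python
import torch

def prior_correction(probs, gamma):
    """Eq. (13): smooth softmax outputs toward a uniform prior."""
    num_classes = probs.shape[1]
    return (probs + gamma) / (1 + gamma * num_classes)

@torch.no_grad()
def weight_ensemble(model, source_state, alpha=0.99):
    """Continuously pull the current weights back toward the source model."""
    for name, param in model.named_parameters():
        param.mul_(alpha).add_(source_state[name], alpha=1 - alpha)
```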

3.1.6 Summary

Optimization-based methods are the most common in online test-time adaptation, focusing on consistency, stability, and robustness. However, they rely on the availability of sufficient target data to reflect the global test data distribution. The next section will explore data-based methods, which partially address the challenge of limited target data in OTTA.

3.2 Data-Based OTTA

With a limited number of samples in the test batch, encountering unexpected distribution changes is common. We recognize that data might be key to bridging the gap between the source and test data.

This section delves into data-centric strategies in OTTA. We explore methods for diversifying data in each batch (Sect. 3.2.1) and preserving high-quality information on a global scale (Sect. 3.2.2). These strategies aim to enhance model generalizability and adapt the model’s discriminative capacity to the current data batch (Fig. 5).

Fig. 5: Timeline of data-based OTTA methods

3.2.1 Data Augmentation

Data augmentation is important in domain adaptation (Wang & Deng, 2018) and domain generalization (Zhou et al., 2023), as it mimics real-world variations to improve model transferability and generalizability. It is particularly useful for test-time adaptation.

Predefined augmentations. Common data augmentation methods like cropping, blurring, and flipping are effectively incorporated into various OTTA methodologies. An example of this integration is TTC (Lin et al., 2023), which updates the model using averaged predictions from multiple augmentations. Mean teacher models such as RoTTA (Yuan et al., 2023b), CoTTA (Wang et al., 2022a), and ViDA (Liu et al., 2023b) apply predefined augmentations to the teacher/student inputs and maintain prediction consistency across the different augmented views.

To ensure consistent and reliable predictions, ROID (Marsden et al., 2024) uses augmentation for prediction consistency by employing symmetric cross-entropy (SCE). PAD (Wu et al., 2021) employs multiple augmentations of a single test sample for majority voting, on the premise that if most augmented views yield the same prediction, it is likely correct. TTPR (Sivaprasad & Fleuret, 2021) adopts KL divergence to achieve consistent predictions by aligning the average prediction across three augmented views with the prediction for each view. Another approach, MEMO, uses AugMix (Hendrycks et al., 2020) for test images: for each test data point, a set of augmentations (usually 32 or 64) is drawn from the AugMix pool \(\mathcal {A}\) to make consistent predictions.

Contextual augmentations. Earlier OTTA methods often predetermined their augmentation policies. Given that test distributions can undergo substantial variations in continuously evolving environments, fixed augmentation policies may not suit every test sample. In CoTTA (Wang et al., 2022a), rather than augmenting every test sample with a uniform strategy, augmentations are judiciously applied only when domain differences (i.e., low prediction confidence) are detected, mitigating the risk of misleading the model.

Adversarial augmentation. Traditional augmentation methods often provide limited data views that do not fully represent the domain differences. TeSLA (Tomar et al., 2023) addresses this by leveraging adversarial data augmentation to identify the most effective augmentation strategy. Instead of using a fixed augmentation set, it creates a policy search space \(\mathcal {O}\) as the augmentation pool and assigns a magnitude parameter \(m \in [0,1]\) to each augmentation. A sub-policy \(\rho \) consists of augmentations and their corresponding magnitudes. To optimize the policy, the teacher model is adapted using an entropy maximization loss with severity regularization, encouraging prediction variations while avoiding augmentations that are too strong and deviate too far from the original image.

3.2.2 Memory Bank

Going beyond augmentation strategies that could diversify the data batch, a memory bank is a powerful tool to preserve valuable data information for future memory replay. Setting up a memory bank involves two key considerations: (1) Determining which data to store, identifying samples valuable for replay during adaptation. (2) Managing the memory bank, including adding new instances and removing old ones.

Memory bank strategies are generally time-uniform and class-balanced. Many methods integrate both for maximum effectiveness. To address temporally correlated distributions and class imbalance, NOTE (Gong et al., 2022) introduces Prediction-Balanced Reservoir Sampling (PBRS), saving sample-prediction pairs. PBRS combines time-uniform and prediction-uniform sampling. The time-uniform approach, reservoir sampling (RS), aims for uniform data over a temporal stream. For a sample x predicted as class k, a value p is randomly sampled from a uniform distribution [0, 1]. If p is smaller than the proportion of class k in the memory bank, a random sample from the same class is replaced with x. The prediction-uniform strategy (PB) prioritizes predicted labels to maintain majority class balance, replacing a random instance from the majority class with a new sample. PBRS ensures a balanced distribution across time and class, enhancing the model’s adaptability.
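A condensed sketch of such a prediction-balanced reservoir is given below. The bookkeeping follows the description above in simplified form (per-class buffers, random eviction) and is not a faithful reproduction of the original NOTE implementation.

```python
import random

class PBRSMemory:
    """Memory of samples balanced over predicted classes and roughly
    uniform over time (simplified sketch of PBRS)."""
    def __init__(self, capacity, num_classes):
        self.capacity = capacity
        self.buffers = {k: [] for k in range(num_classes)}   # per predicted class
        self.seen = {k: 0 for k in range(num_classes)}       # stream counts

    def add(self, x, k):
        """Insert sample x predicted as class k."""
        self.seen[k] += 1
        if sum(len(b) for b in self.buffers.values()) < self.capacity:
            self.buffers[k].append(x)                        # free space: just insert
            return
        majority = max(self.buffers, key=lambda c: len(self.buffers[c]))
        if k != majority and len(self.buffers[k]) < len(self.buffers[majority]):
            # prediction-balanced branch: evict a random majority-class sample
            victims = self.buffers[majority]
            victims.pop(random.randrange(len(victims)))
            self.buffers[k].append(x)
        elif random.random() < len(self.buffers[k]) / self.seen[k]:
            # reservoir branch: keep time-uniformity within class k
            self.buffers[k][random.randrange(len(self.buffers[k]))] = x
```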

A similar strategy is employed in SoTTA (Gong et al., 2023) to facilitate class-balanced learning. Each high-confidence sample-prediction pair is stored when the memory bank has available space. If the bank is full, the method opts to replace a sample either from one of the majority classes or from its class if it belongs to the majority. This ensures a more equitable class distribution and strengthens the learning process against class imbalances. Another work, RoTTA (Yuan et al., 2023b), offers a category-balanced sampling with timeliness and uncertainty (CSTU) module, dealing with the batch-level shifted label distribution. CSTU proposes a category-balanced memory bank M with capacity N, storing data samples x with predicted labels \(\hat{y}\), heuristic scores \(\mathcal {H}\), and uncertainty metrics \(\mathcal {U}\). The heuristic score is calculated by:

$$\begin{aligned} \quad \mathcal {H}=\lambda _t \frac{1}{1+\exp (-\mathcal {A} / \mathcal {N})}+\lambda _u \frac{\mathcal {U}}{\log \mathcal {C}}, \end{aligned}$$
(14)

where \(\lambda _t\) and \(\lambda _u\) are trade-off weights between time and uncertainty, \(\mathcal {A}\) is the age of a sample (i.e., the number of iterations it has been stored in the memory bank), \(\mathcal {C}\) is the number of classes, \(\mathcal {N}\) is the capacity of the memory bank, and \(\mathcal {U}\) is the uncertainty measure, implemented as the entropy of the sample prediction. The score \(\mathcal {H}\) is used to decide whether a test sample should be saved into the memory bank for each class. As a lower heuristic score is always preferred, the intuition is to maintain fresh (i.e., lower age \(\mathcal {A}\)), balanced, and certain (i.e., lower \(\mathcal {U}\)) test samples, thereby enhancing adaptability during online operation.
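Eq. (14) in code: a short sketch with the uncertainty term computed as prediction entropy normalized by \(\log \mathcal {C}\) and the trade-off weights left as arguments.

```python
import math
import torch

def cstu_score(age, probs, capacity, num_classes, lam_t=1.0, lam_u=1.0):
    """Heuristic score of Eq. (14); lower is preferred (fresh and certain)."""
    timeliness = 1.0 / (1.0 + math.exp(-age / capacity))
    uncertainty = -(probs * probs.clamp_min(1e-8).log()).sum().item()
    return lam_t * timeliness + lam_u * uncertainty / math.log(num_classes)
```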

To avoid the negative impact of the batch-level class distribution, TeSLA (Tomar et al., 2023) incorporates an online queue to hold class-balanced weakly augmented sample features and their corresponding pseudo-labels. To enhance the correctness of pseudo-label predictions, each test sample is compared with its closest matches within the queue. Similarly, TSD (Wang et al., 2023a) is dedicated to preserving sample embeddings and their associated logits in a memory bank for trustworthy predictions. Initialized by the weights from a source pre-trained linear classifier (Iwasawa and Matsuo, 2021), this memory bank is subsequently employed for prototype-based classification.

Contrastive learning is well-suited for OTTA, as discussed in Sect. 3.1.3. However, it can be challenging for online learning, especially when it pushes away feature representations of data from the same class. Unlike conventional methods that revisit the feature space multiple times, AdaContrast (Chen et al., 2022) offers an innovative solution. It keeps all previously encountered key features and pseudo-labels in a memory queue to avoid forming “push-away” pairs from the same class. This speeds up the learning process and reduces error accumulation in data from the same class, improving efficiency and precision.

ECL (Han et al., 2023) departs from traditional methods by maintaining a memory bank of output distributions for setting thresholds on complementary labels. The memory bank is also periodically refreshed using the newly updated model parameters, ensuring its relevance and effectiveness.

3.2.3 Summary

Data-based techniques handle biased online test sets or unique stylistic constraints but often increase computational demands, posing challenges in online scenarios. The following section focuses on an alternative strategy: how architectural modifications can offer distinct advantages in Online Test-Time Adaptation.

3.3 Model-Based OTTA

Model-based OTTA adjusts the model architecture to address distribution shifts. Changes generally involve either adding new components or replacing existing blocks. This category includes developments in prompt-based techniques, involving adapting prompt parameters or using prompts to guide the adaptation process (Fig. 6).

3.3.1 Module Addition

Input transformation. In an effort to counteract domain shift, Mummadi et al. (2021) proposed optimizing an input transformation module d along with the BatchNorm layers, as discussed in Sect. 3.1.3. This module is prepended to the source model f, i.e., \(g = f \circ d\). Specifically, d(x) is defined as:

$$\begin{aligned} d(x) = \gamma \cdot [\tau x + (1 - \tau ) r_\psi (x)] + \beta , \end{aligned}$$
(15)

where \(\gamma \) and \(\beta \) are channel-wise affine parameters, and \(r_\psi \) denotes a network with identical input and output shapes, featuring \(3 \times 3\) convolutions, group normalization, and ReLU activations. The parameter \(\tau \) forms a convex combination of the unchanged input x and the transformed input \(r_\psi (x)\).
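As a sketch, Eq. (15) can be written as a small PyTorch module. The depth and width of \(r_\psi \) below are illustrative choices consistent with the description (3 × 3 convolutions, group normalization, ReLU); \(\tau \) is initialized so that the module starts as an identity mapping.

```python
import torch
import torch.nn as nn

class InputTransform(nn.Module):
    """d(x) = gamma * [tau * x + (1 - tau) * r_psi(x)] + beta   (Eq. 15)."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.r_psi = nn.Sequential(                  # shape-preserving network
            nn.Conv2d(channels, hidden, 3, padding=1),
            nn.GroupNorm(8, hidden),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )
        self.tau = nn.Parameter(torch.tensor(1.0))              # identity at start
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        mix = self.tau * x + (1 - self.tau) * self.r_psi(x)     # convex combination
        return self.gamma * mix + self.beta
```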

Fig. 6: Timeline of model-based OTTA methods

Adaptation modules. To stabilize predictions during model updates, TAST (Jang et al., 2023) attaches 20 adaptation modules to the source pre-trained model. Based on BatchEnsemble (Wen et al., 2020), these modules are appended on top of the pre-trained feature extractor. The adaptation modules are updated multiple times independently, aligning their averaged outputs with the corresponding pseudo-labels for a batch of data.

For continual adaptation, promptly detecting and adapting to changes in data distribution is essential to address catastrophic forgetting and error accumulation. To achieve this, ViDA (Liu et al., 2023b) employs low/high-rank feature cooperation. Low-rank features retain general knowledge, while high-rank features capture distribution changes. The authors introduced two adapter modules parallel to the linear layers (if the backbone model is ViT) to obtain these features.

Since distribution changes in continual OTTA are unpredictable, strategically combining low/high-rank information is crucial. The authors used MC dropout (Gal & Ghahramani, 2016) to assess model prediction uncertainty about input x. This uncertainty adjusts the weight given to each feature. If the model is uncertain about a sample, the weight of domain-specific knowledge (high-rank feature) is increased; conversely, the weight of domain-shared knowledge (low-rank feature) is increased. This helps the model dynamically recognize distribution changes while preserving its decision-making capabilities.
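The following sketch illustrates this design under our own simplifying assumptions: the adapter ranks are arbitrary, and MC dropout is approximated by applying dropout to the adapter input over K stochastic forward passes.

```python
import torch
import torch.nn as nn

class DualAdapterLinear(nn.Module):
    """A linear layer with parallel low-rank (domain-shared) and high-rank
    (domain-specific) adapters, fused by an uncertainty-driven weight lam."""
    def __init__(self, dim: int, low_rank: int = 4, high_rank: int = 64):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.low = nn.Sequential(nn.Linear(dim, low_rank), nn.Linear(low_rank, dim))
        self.high = nn.Sequential(nn.Linear(dim, high_rank), nn.Linear(high_rank, dim))

    def forward(self, x: torch.Tensor, lam: float) -> torch.Tensor:
        # lam in [0, 1]: high uncertainty -> larger weight on the high-rank branch.
        return self.base(x) + (1.0 - lam) * self.low(x) + lam * self.high(x)

@torch.no_grad()
def mc_uncertainty(model, x: torch.Tensor, k: int = 8, p: float = 0.1) -> torch.Tensor:
    """Per-sample prediction uncertainty from k dropout-perturbed passes
    (a crude stand-in for MC dropout inside the network)."""
    drop = nn.Dropout(p)
    probs = torch.stack([model(drop(x)).softmax(dim=-1) for _ in range(k)])
    u = probs.var(dim=0).mean(dim=-1)          # variance across passes
    return (u / u.max().clamp_min(1e-12)).clamp(0.0, 1.0)   # squash to [0, 1] as lam
```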

3.3.2 Module Substitution

Module substitution swaps an existing module in a model for a new one. Commonly substituted components include:

Classifiers. Cosine-distance-based classifiers (Chen et al., 2009) offer flexibility and interpretability by making decisions based on similarity to representative examples. TAST (Jang et al., 2023) formulates predictions by measuring the cosine distance between a sample feature and the support set. TSD (Wang et al., 2023a) employs a similar classifier, comparing the current sample's features against its K-nearest neighbors from a memory bank. PAD (Wu et al., 2021) uses a cosine classifier to predict augmented test samples in its majority-voting process. T3A (Iwasawa and Matsuo, 2021) classifies via the dot product between templates in the support set and input data representations.
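A minimal sketch of such prototype-based cosine classification is shown below; the temperature and the source of the prototypes (e.g., classifier weights or averaged support-set features) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_classify(features: torch.Tensor, prototypes: torch.Tensor,
                    temperature: float = 0.1) -> torch.Tensor:
    """Classify by cosine similarity to one prototype (template) per class.

    features:   B x D sample embeddings
    prototypes: C x D class templates, e.g., source classifier weights or
                averaged support-set features (assumption for this sketch)
    """
    f = F.normalize(features, dim=1)
    p = F.normalize(prototypes, dim=1)
    logits = f @ p.t() / temperature   # B x C scaled cosine similarities
    return logits.argmax(dim=1)        # predicted class per sample
```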

Normalization layers. Any alteration to BatchNorm that extends beyond the standard statistics-updating approach also falls under this category, including MECTA norm (Hong et al., 2023), MixNorm (Hu et al., 2021), RBN (Yuan et al., 2023a), and GpreBN (Yang et al., 2022). To maintain focus and avoid redundancy, we do not cover these methods in detail again in this section.

3.3.3 Adaptation Techniques Using Prompts

Prompting, widely discussed in vision-language models like CLIP (Radford et al., 2021), involves various design and learning strategies, especially when the prompt is set as a learnable parameter. We consolidate all prompting-related adaptation methods here for clarity.

“Decorate the Newcomers” (DN) (Gan et al., 2023) uses prompts as supplementary information atop image input. It employs a student-teacher framework with a frozen source pre-trained model to capture both domain-specific and domain-agnostic prompts. For domain-specific knowledge, it optimizes cross-entropy loss between the teacher and student models’ outputs. Additionally, DN introduces a parameter insensitivity loss to reduce the impact of parameters prone to domain shifts. This ensures updated parameters retain domain-agnostic knowledge while learning new, domain-specific information.

DePT (Gao et al., 2022) segments the transformer into stages, adding learnable prompts at the initial layer of each stage alongside the image and CLS tokens. During adaptation, a mean-teacher model updates the learnable prompts and the classifier of the student model. The student is updated with the cross-entropy loss between pseudo-labels and outputs from strongly augmented data, where pseudo-labels are generated by averaging the predictions of the top-k nearest neighbors of the student's weakly augmented output within a memory bank. To mitigate errors from incorrect pseudo-labels, DePT employs an entropy loss between predictions from strongly augmented views of the student and teacher models. Additionally, it minimizes the mean squared error between the combined prompts of the student and teacher at the transformer's output layer. To ensure diversity and prevent trivial solutions, DePT maximizes the cosine distance among the student's combined prompts.

To adapt to test data in VLMs, test-time prompt tuning offers a natural solution. Unlike conventional methods that fine-tune with a fixed number of labeled test samples per class, it learns the prompt by adjusting only the context of the model's input, thus preserving the model's generalization power. TPT (Shu et al., 2022) generates \(N\) randomly augmented views of each test image and updates the learnable prompt parameters by minimizing the entropy of the averaged prediction probability distribution. Additionally, a confidence-selection strategy filters out outputs with high entropy to avoid noisy updates from unconfident samples. By updating only the learnable prompt parameters, the model can adapt more easily to new, unseen domains (Fig. 7).
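The sketch below illustrates the marginal-entropy objective with confidence selection on a single test image; the keep-ratio, the step size, and the assumption that only the prompt parameters require gradients are ours, not TPT's exact settings.

```python
import torch

def tpt_step(model, prompt_params, views, rho: float = 0.1, lr: float = 5e-3):
    """One TPT-style update: average predictions over the most confident
    augmented views of one test image and minimize the marginal entropy.

    model:         callable over a batch of views, conditioned on prompt_params;
                   all non-prompt parameters are assumed frozen
    views:         N x C x H x W tensor of randomly augmented views
    rho:           fraction of lowest-entropy (most confident) views kept
    """
    probs = model(views).softmax(dim=-1)                       # N x num_classes
    ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # per-view entropy
    keep = ent.argsort()[: max(1, int(rho * len(views)))]      # confidence selection
    marginal = probs[keep].mean(dim=0)                         # averaged distribution
    loss = -(marginal * marginal.clamp_min(1e-12).log()).sum() # marginal entropy
    loss.backward()                                            # grads reach prompt only
    with torch.no_grad():
        for p in prompt_params:                                # simple SGD step
            p -= lr * p.grad
            p.grad = None
    return loss.item()
```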

Fig. 7: Exemplars of the adopted datasets, including color variations, synthetic data, and types of corruption

3.3.4 Summary

Model-based OTTA methods have shown effectiveness but are less prevalent than other groups, mainly due to their reliance on specific backbone architectures. For example, layer substitution primarily based on BatchNorm is inapplicable to ViT-based architectures.

A critical feature of this category is its effective integration with prompting strategies. This combination allows for fewer but more impactful model updates, leading to greater performance improvements. Such efficiency makes model-based OTTA methods especially suitable for complex scenarios.

4 Empirical Studies

Existing OTTA methods predominantly use WideResNet (Zagoruyko & Komodakis, 2016) or ResNet (He et al., 2016) for experiments, overlooking the evolution of backbones in recent years. In this study, we explore the possibility of decoupling OTTA methods from their conventional CNN backbones. We specifically focus on adapting these methods to the Vision Transformer model (Dosovitskiy et al., 2021). Our work presents strategies for adapting methods, initially developed for CNNs, to work effectively with ViT architectures, thereby examining their flexibility under backbone changes.

Datasets. We evaluate eight OTTA methods under a standardized testing protocol for fair comparison, using six diverse datasets: three corrupted datasets (CIFAR-10-C, CIFAR-100-C, and ImageNet-C), two real-world shifted datasets (CIFAR-10.1 and OfficeHome), and one comprehensive dataset (CIFAR-10-Warehouse). CIFAR-10-Warehouse plays a pivotal role in our evaluation, featuring a broad array of subsets, including real-world variations collected from different search engines across different colors and images created through a diffusion model. Specifically, we use the Google split of CIFAR-10-Warehouse to assess the OTTA models' ability to handle color shifts and mixed object styles, and the Diffusion split to test the effectiveness of OTTA on artificially generated samples, which have gained popularity in recent years.

4.1 Implementation Details

Optimization details. We use PyTorch for implementation on an NVIDIA RTX A6000. The backbone for all approaches is ViT-base-patch16-224 (Dosovitskiy et al., 2021). For CIFAR-10-C, CIFAR-10.1, and CIFAR-10-Warehouse as target domains, we trained the source model on CIFAR-10 for 8,000 iterations, including a warm-up phase of 1,600 iterations, with a batch size of 64 and stochastic gradient descent (SGD) at a learning rate of 3e-2. We applied the same configuration to the source model on CIFAR-100, extending training to 16,000 iterations with a 4,000-iteration warm-up. The source model for ImageNet-1k was obtained from the timm repository. For the OfficeHome dataset, we trained the source model on the clipart domain for 3,500 iterations with 200 warm-up iterations. Basic data augmentation, including random resizing and cropping, was applied across all methods. The common optimization setup employs the Adam optimizer with a momentum term \(\beta _1\) of 0.9 and a learning rate of 1e-3. Resizing and cropping are the default preprocessing steps for all datasets, followed by input normalization with per-channel mean and standard deviation of (0.5, 0.5, 0.5) to mitigate performance fluctuations from factors beyond the algorithms' core operations.
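For reference, this preprocessing corresponds roughly to the following torchvision pipeline; the resize edge is our assumption, while the 0.5 mean/std normalization follows the setup above.

```python
from torchvision import transforms

# Default test-time preprocessing: resize, crop to the ViT input resolution,
# then normalize each channel with mean 0.5 and standard deviation 0.5.
preprocess = transforms.Compose([
    transforms.Resize(256),           # assumed resize of the shorter edge
    transforms.CenterCrop(224),       # ViT-base-patch16-224 input size
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```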

Fig. 8: Comparison of the OTTA performance on the CIFAR-10-C (severity level 5) and CIFAR-10.1 datasets with ViT-base-patch16-224. The top and bottom plots show the experiments conducted with batch sizes 16 and 1, respectively, based on the LayerNorm updating strategy. Source-only (i.e., direct inference) performance is shown with the dotted line

Component substitution. We develop a series of strategies to adapt the OTTA methods to vision transformers:

  • Switch to LayerNorm: Since ViT has no BatchNorm layers, all BatchNorm-based strategies in the original implementations are transferred to LayerNorm (see the sketch after this list).

  • Disregard BatchNorm mix-up: We remove statistics mix-up strategies originally designed for BatchNorm-based methods, because LayerNorm normalizes each data point independently.

  • Sample embedding changes: For methods that rely on feature representations, we use the class (CLS) token embedding as the image feature.

  • Pruning incompatible components: Any elements incompatible with vision transformers are removed.
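The sketch below illustrates the first strategy: freezing the model and exposing only the LayerNorm affine parameters to the optimizer (the function name is ours).

```python
import torch.nn as nn

def collect_layernorm_params(model: nn.Module):
    """Freeze everything, then enable gradients only for the LayerNorm
    affine parameters -- the ViT counterpart of BatchNorm updating."""
    params = []
    for p in model.parameters():
        p.requires_grad_(False)
    for module in model.modules():
        if isinstance(module, nn.LayerNorm) and module.weight is not None:
            module.weight.requires_grad_(True)
            module.bias.requires_grad_(True)
            params += [module.weight, module.bias]
    return params
```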

Baselines: We select eight methods, each representing one of the OTTA categories discussed in this paper. They include:

  1. Tent: A fundamental OTTA method rooted in BatchNorm updates, for both statistics and affine parameters, with entropy minimization. To reproduce it on ViTs, we replace its BatchNorm updates with LayerNorm updates (a sketch follows this list).

  2. CoTTA: Uses the mean-teacher model, where the teacher is updated as the moving average of the student. During adaptation, a soft entropy loss is optimized and combined with selective augmentation. At each iteration, it also applies a parameter reset (i.e., partially resetting model parameters to the source pre-trained version). While the original method updates the entire student network (CoTTA-ALL), we also assess CoTTA-LN, which only updates LayerNorm in the student model. We further examine its parameter-reset strategy, yielding two more variants: LayerNorm updating without parameter reset (CoTTA\(^*\)-LN) and full-network updating without parameter reset (CoTTA\(^*\)-ALL).

  3. SAR: Follows Tent while using sharpness-aware minimization (SAM) (Foret et al., 2021) to seek flat minima.

  4. Conjugate-PL: As our source models are optimized with the cross-entropy loss, the instantiation used here is Conj-CE. It resembles Tent but interacts with the data twice per iteration: once to update LayerNorm and once more for prediction.

  5. MEMO: For each test sample, MEMO applies various data augmentations and adjusts the model parameters to minimize the entropy of the marginal output distribution over these augmentations. To ensure consistency and avoid unexpected performance fluctuations, we omit all data normalization operations from its augmentation set. We evaluate two versions, LayerNorm update and full update, denoted MEMO-LN and MEMO-ALL, respectively.

  6. RoTTA: Uses a memory bank to store class-balanced data, considering the uncertainty and the “age” of each saved sample, and introduces a time-aware reweighting strategy to enhance adaptation stability. Given that BatchNorm is incompatible with ViTs, we exclude the RBN module in our evaluation.

  7. TAST: Integrates multiple BatchEnsemble-based (Wen et al., 2020) adaptation modules on top of the source pre-trained feature extractor, updated as described in Sect. 3.3.1. To accommodate the ViT architecture (specifically ViT-base), we use the class embedding as the feature representation.

  8. ROID: A method for not only typical OTTA but also universal TTA, capable of handling temporal correlation and domain non-stationarity. It combines a weighted soft likelihood ratio (SLR) loss with the symmetric cross-entropy loss, alongside weight ensembling and prior correction mechanisms, to ensure efficacy in complex scenarios.
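As an illustration of the entropy-minimization baseline above (item 1), the following sketch performs one Tent-style online step on a ViT whose LayerNorm parameters were selected as in the earlier sketch; hyperparameters are illustrative.

```python
import torch

def entropy_minimization_step(model, optimizer, batch):
    """One Tent-style online step: predict, minimize prediction entropy,
    and update only the parameters handed to the optimizer (here, the
    LayerNorm affine parameters)."""
    logits = model(batch)
    probs = logits.softmax(dim=-1)
    loss = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return logits.detach()   # predictions for the current test batch
```

A typical setup would pair this with `torch.optim.Adam(collect_layernorm_params(model), lr=1e-3)`, matching the defaults in Sect. 4.1.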

Despite the broad range of available OTTA methods, we believe that a detailed examination of this carefully selected subset will yield valuable insights. Our empirical study is designed to explore the following key questions.

4.2 Does OTTA Still Work with ViT?

To assess the transferability and adaptability of the selected OTTA methods, we compare them against the source-only baseline (i.e., direct inference) on vision transformers. For consistency and controlled variable comparison, each bar chart included in our analysis is plotted based on the LayerNorm updating strategy.

4.2.1 On CIFAR-10-C and CIFAR-10.1 Benchmarks

We evaluate the CIFAR-10-C and CIFAR-10.1 datasets with batch sizes 1 and 16 and show results in Fig. 8. We discuss our observations in three aspects.

Corruption types. Across both batch sizes, most methods exhibit high error under noise-induced corruptions, as illustrated in Figs. 8 and 13. In contrast, these methods tend to perform better on structured corruptions, such as snow, zoom, or brightness. This pattern suggests that different corruption types induce different degrees of divergence from the source distribution. In particular, adapting to noise corruptions poses a significant challenge for confidence optimization-based methods (such as Tent, MEMO-LN, and Conj-CE), regardless of batch size. This difficulty can be linked to the substantial domain gap and the unpredictable nature of noise patterns discussed earlier: although these strategies increase the model's confidence, they cannot directly correct erroneous predictions. Notably, ROID surpasses all other baselines on noise-based corruptions, which may be attributed to its certainty- and diversity-weighted SCE loss, designed to address noise while avoiding trivial solutions. CIFAR-10.1 stands out as an exception, as it is not associated with any specific corruption; it instead contains real-world data sharing the CIFAR-10 label set. Compared to the corruption domains (except brightness), CIFAR-10.1 exhibits a smaller domain gap, as evidenced by the lower error rate of direct inference. In this context, most methods yield results comparable to direct inference. Notably, Conj-CE excels at a batch size of 16, while CoTTA\(^*\)-LN, CoTTA-LN, and SAR stand out when the batch size is reduced to 1.

Batch sizes. A larger batch size is found to aid pure optimization-based methods such as Tent, Conj-CE, and MEMO-LN in outperforming direct inference (i.e., source only). Batch size likewise strongly influences ROID: it achieves the best mean error on CIFAR-10-C by a significant margin at batch size 16, yet makes almost entirely erroneous predictions at batch size 1. As detailed in Sect. 4.4, our analysis leads to the conclusion that larger batch sizes tend to stabilize loss optimization, which benefits the adaptation process. This phenomenon is also evident on CIFAR-10.1, where entropy-based methods such as Tent and Conj-CE struggle at smaller batch sizes.

Methods that incorporate prediction reliability can effectively navigate the limitations associated with smaller batch sizes.

Table 2 Classification error rate (%) for the standard CIFAR-100 \(\rightarrow \) CIFAR-100-C online test-time adaptation task

4.2.2 On CIFAR-100-C Benchmark

The performance on the CIFAR-100-C dataset exhibits a trend similar to that observed in the CIFAR-10-C dataset, as shown in Table 2.

Number of classes. As CIFAR-100-C shares the corruption setup of CIFAR-10-C, it is important to understand what distinguishes the two datasets. CIFAR-10-C has only 10 classes, so all classes can be represented within a single batch of size 16. In contrast, CIFAR-100-C, with its 100 classes, poses a distinct challenge for online streaming adaptation. This scenario is particularly problematic for methods like Tent, which update the LayerNorm parameters on the current batch for subsequent use. While no method surpasses source-only performance, Tent suffers a 1.33% accuracy drop on CIFAR-10-C and a more pronounced 3.94% drop on CIFAR-100-C.

Batch sizes. A noteworthy finding from the CIFAR-100-C experiments is that RoTTA, with a mean error rate of 41.30% across all tested batch sizes, consistently outperforms direct inference (mean error 41.88%). This is likely due to RoTTA's ability to maintain label diversity within its memory bank, emphasizing the importance of preserving a wide and varied information spectrum for batch-sensitive and complex adaptation tasks. The same trend appears in Figs. 13 and 9, further reinforcing RoTTA's robustness. Moreover, although SAR and TAST underperform direct inference, they remain stable across batch sizes. Together with RoTTA, these methods exemplify two primary approaches to handling different batch sizes: (1) stable optimization, since flat minima are more resilient to gradient fluctuations, and (2) information preservation, which supplies additional context for each batch and reduces sensitivity to batch size. In contrast, ROID and CoTTA-ALL exhibit significant performance variations across batch sizes.

4.2.3 On ImageNet-C Benchmark

When comparing performance across CIFAR-10-C, CIFAR-100-C, and ImageNet-C, it becomes apparent that ImageNet-C (Fig. 9) exhibits notably poorer performance overall. This could be attributed to the larger number of classes in ImageNet-C, as a similar trend is observed when comparing CIFAR-10-C (Fig. 8) with CIFAR-100-C (Table 2).

Fig. 9: Comparison of the OTTA performance on ImageNet-C at severity level 5. The upper and bottom plots show the experiments conducted with batch sizes 16 and 1, respectively, based on the LayerNorm updating strategy. Source-only performance is shown with the dotted line

Corruption types. Varying domain gaps emerge across datasets when employing the same mechanism to generate corrupted data. Nevertheless, overall consistency remains, likely influenced by the training process of the source model. Additionally, corruptions representing natural domain gaps (e.g., brightness, zoom, snow) consistently appear simpler than noise-based corruptions (e.g., Gaussian noise, shot noise). Surprisingly, contrast exhibits a notably high error rate for both direct inference and most methods, indicating a substantial domain gap compared to other corruption types. In this case, ROID appears to excel in addressing it. In contrast, brightness appears to have the smallest domain gap with the source data.

Batch sizes. At batch size 1, optimization-based methods exhibit error rates exceeding 95% across all domains. This also applies to ROID, which relies mainly on model optimization. An exception is SAR, which shows a similar trend as on CIFAR-100-C, albeit with a larger gap to direct inference on ImageNet-C. Given the different numbers of classes, this further indicates that more complex and challenging datasets are more sensitive to batch-size alterations.

Adaptation strategy. As depicted in Fig. 9, when batch size is 16, SAR, Conj-CE, RoTTA, and ROID outperform the source-only model in terms of mean error. In contrast, Tent, MEMO-LN, and CoTTA-LN demonstrate significantly poor results. The error rate of each domain exhibits a similar trend. Notably, Conj-CE, which conducts an additional inference of each batch for final prediction compared with Tent, markedly surpasses Tent across most domains and in mean error. This suggests a significant inter-batch shift in ImageNet-C, such as class differences.

Fig. 10: Comparison of the OTTA performance on the Google split of CIFAR-10-Warehouse. The upper and bottom plots show the experiments conducted with batch sizes 16 and 1, respectively, based on the LayerNorm updating strategy. Source-only performance is shown with the dotted line

4.2.4 On CIFAR-10-Warehouse Benchmark

In our study, the CIFAR-10-Warehouse dataset (Sun et al., 2023) emerges as an indispensable resource for assessing OTTA methods, aligning with the CIFAR-10 label set to facilitate comprehensive comparisons across a spectrum of distribution shifts. Specifically, we examine two domain groups, real-world shifts and diffusion-synthesis shifts, to assess the resilience and adaptability of OTTA methods.

Google split. Diverging from previous CIFAR-10 variants, which primarily involve artificially induced corruptions (i.e., CIFAR-10-C) or sample differences (i.e., CIFAR-10.1), the Google split of CIFAR-10-Warehouse offers a new perspective. Sourced from Google search queries, it comprises 12 subdomains exhibiting a range of color variations within the CIFAR-10 labels, constructing an essential benchmark for measuring contemporary OTTA methods against real-world distribution shifts.

Cross-dataset comparison: real-world versus corruptions. Comparing the empirical results for both the Google split (Fig. 10) and CIFAR-10-C (Fig. 8), a notable difference is that the real-world shift presented in CIFAR-10-Warehouse exhibits a smaller domain gap globally, as indicated by the performances of direct inference. Concerning subdomains, the real-world shifts also demonstrate lower variance compared to corruption domains. This suggests that dealing with attacks or noise in a streaming manner is the most challenging adaptation task, as these two datasets are based on the same source pre-trained model.

Batch sizes. Regarding the batch-size differences depicted in Fig. 10, at batch size 16, five LayerNorm-based OTTA methods match or surpass direct inference, suggesting that the LayerNorm updating strategy is generally effective for real-world shifts. However, at batch size 1, methods relying solely on entropy minimization, such as Tent and Conj-CE, degrade across most domains, likely owing to unstable optimization on single-sample batches. Additionally, ROID behaves much as it does on the corruption-based datasets, again indicating its sensitivity to batch size.

Adaptation strategy. RoTTA, TAST, and SAR demonstrate exceptional stability regardless of batch size, echoing their behavior on CIFAR-100-C. Furthermore, ROID surpasses direct inference by a smaller margin than on the corruption-based datasets, consistent with its particular strength in handling noisy data.

Table 3 Classification error rate (%) for the CIFAR-10 \(\rightarrow \) CIFAR-10-Warehouse online test-time adaptation task

Diffusion split. Furthermore, we incorporate the Diffusion split into our analysis as shown in Table 3. Created using stable diffusion (Rombach et al., 2022), this domain introduces a novel examination of synthetic samples. Considering the increasing prevalence of diffusion-generated imagery, this evaluation offers unique insights into the capability of OTTA methods to adapt to the rising tide of generated images.

Cross-dataset comparison: corruption, real-world, or diffusion? In terms of mean error, the Diffusion split has the smallest domain gap among the CIFAR-10-based datasets when comparing Fig. 8 with Table 3. This holds even against CIFAR-10.1, suggesting that ViT-based OTTA methods can generally handle diffusion-generated images.

Adaptation strategy. The outcomes in Table 3 indicate that, on average, each OTTA method performs as well as or better than direct inference in terms of mean error. However, some methods still perform poorly in certain domains: while TAST delivers satisfactory results on DM-01, it fails on DM-02, DM-08, DM-09, and DM-12. Similarly, Tent, Conj-CE, and MEMO-LN fall short on DM-05.

In contrast, RoTTA and ROID demonstrate their capability to surpass direct inference outcomes for every domain. Similarly, SAR enables the model to reach a region in the optimization landscape that is less sensitive to data variations, resulting in stable predictions.

Table 4 Classification error rate (%) for the OfficeHome online test-time adaptation task using a source model trained on Clipart with ViT-base-patch16-224

4.2.5 On OfficeHome Benchmark

OfficeHome (Venkateswara et al., 2017) has four domains. We use the clipart domain as the source dataset and apply OTTA methods to the remaining domains: Art, Product, and RealWorld.

Subdomains. Based on the given pre-trained source model, the Art domain shows the largest domain gap from clipart. However, even with minimal batch size, SAR, RoTTA, TAST, and ROID show competitive results against the baseline on Art. Conversely, in the Product domain, reducing the batch size significantly impacts the performance of several methods. The RealWorld domain presents the greatest challenge for adaptation, with few methods surpassing direct inference across different batch sizes.

Batch sizes. SAR, RoTTA, and TAST consistently outperform the source-only baseline irrespective of batch size, indicating stable performance. On the other hand, Tent, Conj-CE, and ROID show a decline in performance at a batch size of 1, suggesting a dependency on larger batch sizes for optimal results.

Adaptation strategy. CoTTA-ALL and MEMO-LN perform poorly across all domains, suggesting a heavy reliance on augmentations that are not strong enough to bridge the domain gap in OfficeHome. It is therefore worth re-examining current augmentation strategies to address domain discrepancies more effectively.

4.2.6 Conclusion

Based on our extensive experiments, most OTTA methods exhibit similar behavioral patterns across various datasets. This consistency indicates the potential of contemporary OTTA techniques to effectively manage diverse domain shifts. Of particular note are RoTTA and SAR, which highlight the significance of information preservation and optimization insensitivity, respectively. Additionally, ROID proves effective when the batch size is reasonably large.

Fig. 11: Mean error versus (a) average wall-clock time per domain, (b) GPU memory usage, and (c) GFLOPs. All experiments are conducted at severity level 5 on the ImageNet-C dataset

4.3 Is OTTA Efficient?

To assess the efficiency of OTTA algorithms, especially under hardware constraints, we employ three primary evaluation metrics: wall-clock time, GPU memory usage, and GFLOPs.
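Wall-clock time and peak GPU memory can be measured as sketched below (GFLOPs require a profiler such as fvcore or torch.profiler and are omitted here); the loader format and function names are our assumptions.

```python
import time
import torch

def measure_efficiency(adapt_fn, loader, device: str = "cuda"):
    """Report wall-clock time and peak GPU memory for one adaptation pass.
    Assumes `loader` yields (batch, label) pairs and a CUDA device."""
    torch.cuda.reset_peak_memory_stats(device)
    start = time.perf_counter()
    for batch, _ in loader:
        adapt_fn(batch.to(device))          # one online adaptation step
    torch.cuda.synchronize(device)          # flush pending kernels before timing
    wall = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated(device) / 2**20
    return wall, peak_mb
```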

As depicted in Fig. 11, lower values across these metrics indicate superior performance. Within the ImageNet-C setting, MEMO-LN exhibits suboptimal accuracy alongside elevated computational demands.

Meanwhile, RoTTA demonstrates commendable results with reduced GFLOPs and processing time; however, its memory bank demands more GPU memory, which may necessitate additional hardware provisions or memory optimizations for practical deployment. Note that CoTTA-ALL is excluded from these analyses: its unexpectedly high GFLOPs and wall-clock time, together with its high error rates, would skew the comparative landscape.

Conversely, SAR maintains a low mean error while remaining computationally efficient, especially compared with Tent. Similarly, Conj-CE attains a significant error reduction with slightly increased resource consumption, as it performs inference on each batch directly after every iteration. ROID, additionally, strikes a balance between effectiveness and efficiency, as evidenced by its notably low time consumption and GFLOPs and moderate GPU memory usage, while still delivering a remarkably low error rate.

4.4 Is OTTA Sensitive to Hyperparameter Selection?

In this section, we investigate whether OTTA methods are sensitive to hyperparameter choices. We conduct our experimental study on ImageNet-C with batch size 16 using ViT-base-patch16-224 to assess optimizers, learning rates, and schedulers.

4.4.1 Batch Size Matters, But Only to an Extent

Figure 13 examines the impact of different batch sizes on Tent in the CIFAR-10-C dataset. It reveals that performance significantly varies with batch sizes from 1 to 16 across most corruptions. However, this variability diminishes with larger batch sizes (16 to 128), indicating a reduced influence of LayerNorm updating on batch size, in contrast to traditional BatchNorm settings. This pattern is consistent across other datasets, as depicted in Fig. 12. Nevertheless, batch size remains crucial for stabilizing the optimization process. For example, a batch size of 16 outperforms a batch size of 1 in confidence optimization methods in the Google split of the CIFAR-10-Warehouse dataset.

Fig. 12: Impact of varying batch sizes for Tent on CIFAR-10.1, CIFAR-10-Warehouse Google split, CIFAR-100-C, and ImageNet-C

Fig. 13: Impact of batch size on CIFAR-10-C at severity level 5. The base model is Tent optimized on LayerNorm

However, larger batch sizes are essential for complex datasets like CIFAR-100-C and ImageNet-C, where direct inference struggles. Besides, Fig. 13 suggests that increasing the batch size does not fully address challenging corruptions, such as Gaussian and shot noise, indicating the need for adaptation strategies beyond mere batch-size adjustments in complex learning scenarios. Additionally, gradient accumulation (Marsden et al., 2024) can serve as a workaround to enable single-sample adaptation.
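A minimal sketch of this workaround, under our assumptions about the loss and the accumulation window, is shown below.

```python
def accumulate_and_step(model, optimizer, sample, step_count: int, accum: int = 16):
    """Single-sample adaptation via gradient accumulation: gradients from
    `accum` consecutive test samples are summed before one optimizer step,
    emulating a larger effective batch size."""
    logits = model(sample.unsqueeze(0))                      # batch of one
    probs = logits.softmax(dim=-1)
    loss = -(probs * probs.clamp_min(1e-12).log()).sum() / accum
    loss.backward()                                          # grads accumulate in .grad
    if (step_count + 1) % accum == 0:                        # one update per window
        optimizer.step()
        optimizer.zero_grad()
    return logits.detach()
```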

Table 5 Comparisons of model update strategies across five benchmarks: LayerNorm (LN) versus full model (ALL)

4.4.2 Optimization Layer Matters!

To assess the critical role of LayerNorm, we compare the LayerNorm update with the full-model update, as summarized in Table 5. This ablation primarily focuses on CoTTA and MEMO, evaluating the impact of optimizing LayerNorm alone. A notable observation is that, for all methods, updating only LayerNorm plays an important role in achieving high performance, underscoring its effectiveness in avoiding significant forgetting of source knowledge. This is also corroborated by Tent, SAR, and Conj-CE, as shown in Fig. 15 in the appendix.

4.4.3 Does the Optimizer Matter?

We evaluate Adam and SGD across eight methods with a batch size of 16 on the ImageNet-C dataset. Our findings, detailed in Table 6, reveal several key observations. Firstly, methods that adapt to test data under soft supervision, such as soft entropy, appear more susceptible to changes in the optimizer. For instance, switching to SGD resulted in a significant mean error reduction of 9.84% for Tent. Similarly, when updated by soft entropy, SGD enabled CoTTA-ALL to achieve a mean error difference of over 10%. This pattern was also observed in Conj-CE and MEMO-LN. This disparity likely stems from the superior generalization capabilities of SGD, as discussed by Wilson et al. (2017). The notable exception within this group is SAR, which utilizes Sharpness-Aware Minimization (SAM) to find stable minima. This approach may diminish the impact of the choice between optimizers.

Another distinct category includes RoTTA and TAST. These methods, which rely heavily on memory banks or on averaging predictions across multiple adaptation modules, demonstrate reduced sensitivity to the optimization strategy. ROID, in contrast, is highly sensitive to the optimizer: Adam is the stabler choice for it, owing to its adaptive learning rate and moment estimation, which suit noisy and non-stationary objectives.

Table 6 Comparison between Adam and SGD (highlighted in italic) on ImageNet-C using ViT-base-patch16-224 at severity level 5 and batch size 16; SO denotes source-only

4.4.4 Impact of Learning Rate

To further explore the impact of hyperparameters, we examine learning rates in the range {0.0001, 0.0005, 0.001, 0.005, 0.01} on the ImageNet-C dataset, as detailed in Fig. 14. The results indicate that lower learning rates benefit online adaptation.

The empirical results shown in Fig. 14 confirm a general intuition: a high learning rate can cause quick overfitting to the current batch in online adaptation. Lower rates enhance model stability and enable smoother adaptation, whereas a large learning rate can disrupt the finely learned knowledge of the source model and thus degrade performance.

However, this is not universally applicable. Methods such as RoTTA and TAST consistently demonstrate stable performance across varying learning rates. By incorporating more information for prediction, they alleviate the impact of batch-specific label variations, bolstering model stability. Moreover, CoTTA-ALL, SAR, and ROID attain their optimal performance at specific learning rate values, excluding the lowest one, underscoring the continued relevance of studying learning rates to optimize adaptation performance.

Fig. 14: Impact of varying learning rates on ImageNet-C at severity level 5

Table 7 Variations of learning rate schedulers on ImageNet-C severity level 5 with batch size 16
Table 8 Impact of backbone on ImageNet-C

4.4.5 Impact of Scheduler

We examine the effectiveness of three learning rate schedulers to ascertain their influence on performance. They are compared against the default learning rate of 0.001 (Def) and a fixed minimum learning rate of 0.0001 (Min) for the Adam optimizer (a PyTorch sketch follows the list):

  • ExponentialLR (Exp): decays the learning rate by a factor of 0.9 at every iteration, from the first domain to the last.

  • CosineAnnealingWarmRestarts (Cos): periodically resets the learning rate to a higher value and then decreases it along part of the cosine curve. Here, we gradually decrease the learning rate to 0.0001 within each domain, with the cycle length matching the number of iterations per domain.

  • CyclicLR (Cyc): cyclically varies the learning rate between two boundary values over a set number of training iterations. Here, we set it to begin at a maximum learning rate of 0.001 in each domain and decay to 0.0001 by the domain's final iteration.
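Under illustrative iteration counts, the three schedulers can be instantiated in PyTorch roughly as follows; in practice, only one scheduler would be attached to the optimizer at a time. Note that CyclicLR starts from base_lr, so beginning at the maximum, as in our setup, is approximated by a single warm-up step.

```python
import torch
from torch.optim import Adam, lr_scheduler

params = [torch.nn.Parameter(torch.zeros(1))]   # stand-in for the LayerNorm params
iters_per_domain = 750                          # illustrative iteration count
optimizer = Adam(params, lr=1e-3)

# Exp: multiply the learning rate by 0.9 at every iteration.
exp = lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

# Cos: cosine decay towards 1e-4, restarting at the start of each domain.
cos = lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=iters_per_domain, eta_min=1e-4)

# Cyc: one triangular cycle per domain between 1e-4 and 1e-3;
# cycle_momentum=False is required when the optimizer is Adam.
cyc = lr_scheduler.CyclicLR(optimizer, base_lr=1e-4, max_lr=1e-3,
                            step_size_up=1,
                            step_size_down=iters_per_domain - 1,
                            cycle_momentum=False)
```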

As shown in Table 7, for Tent and Conj-CE, cyclical schedules such as Cos and Cyc surpass the default setup (Def) but do not perform as well as the fixed minimum learning rate (Min), indicating the need for a low learning rate from the outset of the adaptation process. Meanwhile, RoTTA and TAST remain very stable across the different learning rate strategies, showcasing their robustness; notably, RoTTA outperforms the source-only baseline (63.24% error) under every condition. Additionally, ROID performs best with the default learning rate and no scheduler.

One exception is MEMO-LN, where only ExponentialLR (Exp) can compete with the minimum learning rate. Furthermore, as illustrated in Fig. 14, MEMO-LN produces significant differences in performance among the schedulers and learning rates, indicating its high sensitivity to the learning rate.

4.5 Is OTTA Effective with Other Vision Transformer Variants?

We further evaluate the selected OTTA methods on the SwinTransformer as a point of comparison to the foundational ViT, as shown in Table 8. Switching the backbone to SwinTransformer can greatly improve performance: its direct inference outperforms all ViT results except SAR and ROID. This emphasizes the crucial role of backbone architecture in achieving high performance and suggests that improvements in backbone design can sometimes overshadow the gains brought by advanced OTTA methodologies. Transitioning to a new backbone also requires reassessing the effectiveness of individual OTTA methods. Specifically, architecture-agnostic methods such as RoTTA (without the RBN module) and ROID outperform source-only performance at batch size 16 with SwinTransformer. Additionally, RoTTA remains stable at batch size 1, demonstrating its robustness and adaptability across architectures.

These findings highlight a dynamic interplay between backbone architectures and OTTA methods. Consequently, evaluating OTTA strategies in the context of evolving transformer models is essential.

5 Future Directions

Our initial evaluations of the Vision Transformer revealed that many Online Test-Time Adaptation methods are not fully optimized for this architecture, resulting in suboptimal outcomes. Based on these findings, we propose several key attributes for an ideal OTTA approach, suitable for future research and tailored to advanced architectures like ViT:

  • Refining OTTA in realistic settings: Future OTTA methods should be tested in realistic environments, such as practical TTA (Marsden et al., 2024), with weakened domain boundaries, advanced architectures, practical testbeds, and reasonable batch sizes, to gain deeper and more relevant insights.

  • Addressing multimodal challenges and exploring prompting techniques: With the evolution towards foundation models like CLIP (Radford et al., 2021), OTTA is confronted with new challenges. These models may face shifts across various modalities, necessitating innovative OTTA strategies that extend beyond reliance on images alone. Exploring prompt-based methods could offer significant breakthroughs in OTTA.

  • Hot-swappable OTTA: Keeping pace with the rapid evolution of backbone architectures is crucial. Future OTTA methods should focus on adaptability and generalizability to seamlessly integrate with evolving architectures.

  • Stable and robust optimization for OTTA: Stability and robustness in optimization remain paramount. Given that larger batch sizes demonstrate limited effectiveness in ViT, future research should investigate more universal optimization improvements. Such advancements aim to consistently enhance model performance, independent of external factors like batch size.

6 Conclusion

In this survey, we comprehensively examine Online Test-Time Adaptation (OTTA), covering existing methods, relevant datasets, evaluation benchmarks, and their implementations. We conduct extensive experiments to assess the effectiveness and efficiency of current OTTA methods applied to vision transformers. Our findings suggest that noise-synthesized domain shifts often pose greater challenges than other shifts, such as those encountered in real-world or diffusion-generated data. Additionally, a large number of classes in a dataset can lead to noticeable batch discrepancies, potentially affecting OTTA models' ability to maintain consistent knowledge and increasing the risk of severe forgetting. To address these challenges, we find that combining normalization-layer updates with a memory bank or flat-minima optimization, along with an appropriate batch size, can effectively stabilize the adaptation process and reduce forgetting. We hope this survey serves as a foundational reference, offering valuable insights for researchers and practitioners in the evolving field of OTTA.

Table 9 A summary of existing OTTA methods published in top-tier conferences before Dec 2023