1 Introduction

Deep neural networks have proven to be very successful in a myriad of computer vision tasks such as categorization, detection, and retrieval. However, much of this success has come at the price of excessive human effort put into the manual data-labelling process. Since collecting annotated data can be prohibitively expensive and at times impossible, domain adaptation (DA, see [8] for an overview) methods have gained increasing attention. They enable training on unlabelled target data by jointly leveraging a previously labelled yet related source data set while mitigating the domain-shift [60] between the two. Such methods predominantly consist of minimizing the discrepancy between statistical moments of the distributions [37, 53, 57, 62], using adversarial objectives to maximize domain confusion [14, 61], or reconstructing data with generative methods [23].

Fig. 1. Illustrative sketch of source-free domain adaptation (SFDA) on a labelled source domain and an unlabelled target domain potentially containing additional classes. The top row shows conventional methods, which ignore model uncertainties; the bottom row shows our method, which incorporates uncertainties about the predictive model, enabling uncertainty-guided SFDA that is more robust to distribution shifts

Albeit successful, the preceding methods mandate access to the source data set during the target adaptation phase, as they require an estimate of the source distribution for the alignment. With the emergence of regulations on data privacy and bottlenecks in data transmission for large data sets, access to the source data cannot always be guaranteed. This paves the way for a relatively new and more realistic DA setting, called source-free DA (SFDA, [8]), where the task is to adapt to the target data set when the only source of supervision is a source-trained model. SFDA facilitates maintaining data anonymity in privacy-sensitive applications (e.g., surveillance or medical applications) and at the same time reduces data transmission and storage overhead. Towards this goal, several SFDA methods have recently been proposed that utilize the hypotheses learned from the source data [27, 33, 58]. Notably, SHOT [33], an information maximization (IM) [17] based SFDA method, has been shown to work reasonably well on DA benchmarks, sometimes outperforming traditional DA methods. While promising, these conventional SFDA techniques do not account for the uncertainty in the predictions of the source model on the target data. As a consequence, solely maximizing mutual information [17] on the target data can lead to erroneous decision surfaces (see Fig. 1 top).

This work argues that quantification of the uncertainty in predictions is essential in SFDA. Depending on the inductive biases of the model, the source model may predict incorrect target pseudo-labels with high confidence, e.g., due to the extrapolation property in ReLU networks [22] (see Fig. 2b left). In the literature, uncertainty-guided methods have been proposed in the context of traditional UDA and SFDA settings, employing Monte Carlo (MC) dropout to estimate the uncertainties in the model predictions [51, 72]. However, MC dropout requires specialized training and specialized model architecture, suffers from manual hyperparameter tuning [13], and is known to provide a poor approximation even for simple (e.g., linear) models [10, 46, 47].

In this work, we propose to construct a probabilistic source model by incorporating priors on the parameters of the last layer of the source model, which induces a distribution over the model predictions. This enables us to perform an efficient local approximation to the posterior using a Laplace approximation (LA, [41, 59]), see Fig. 2a. This principled Bayesian treatment leads to more robust predictions, especially when the target data set contains out-of-distribution (OOD) classes (see Fig. 1 bottom) or in case of strong domain shifts. Once the uncertainty in predictions is estimated, we selectively guide the target model to maximize the mutual information [17] in the target predictions. This mitigates the alignment of the target features with the wrong source hypothesis, resulting in a domain adaptation scheme that is robust to mild and strong domain shifts without tuning. We call our proposed method Uncertainty-guided Source-Free AdaptatioN (U-SFAN). Our approach requires no specialized source training or specialized architecture, as opposed to existing works (e.g., [30, 72]), introduces little computational overhead, and decouples source training and target adaptation.

We summarize our contributions as follows. (i) We emphasize the need to quantify uncertainty in the predictions for SFDA and propose to account for uncertainties by placing priors on the parameters of the source model. Our approach is computationally efficient by employing a last-layer Laplace approximation and greatly decouples the training of the source and target. (ii) We demonstrate that our proposed U-SFAN successfully guides the target adaptation without specialized loss functions or a specialized architecture. (iii) We empirically show the advantage of our method over SHOT [33] in the closed-set and the open-set setting for several benchmark tasks and provide evidence for the improved robustness against mild and strong domain shifts.

Fig. 2. (a) The Laplace approximation is mode-seeking and adapts to the local curvature around the mode \(\theta _\textrm{MAP}\). It does not necessarily capture the (intractable) full posterior, but gives a proxy for it, is principled, and efficient to evaluate. (b) Example of predictive uncertainty captured by a ReLU network vs. a Laplace approximation that assigns higher uncertainty to inputs of an unseen class

2 Related Work

Closed-set Domain Adaptation, often abbreviated as UDA, refers to the family of DA methods that aim to learn a classifier for an unlabelled target data set while simultaneously using a labelled source data set, where the two data sets differ in their underlying data distributions. In the literature [64], mainly three categories of UDA methods can be found. First, discrepancy-based UDA methods aim to diminish the domain-shift between the two domains with maximum mean discrepancy (MMD, [37, 39, 62]) or with correlation alignment [43, 53, 57]. The second category of UDA methods exploits the adversarial objective [18] to promote domain confusion between the two data distributions by using a domain discriminator [14, 38, 61]. Finally, the third category comprises reconstruction-based UDA methods [4, 16, 23] that cast data reconstruction as an auxiliary objective in order to ensure invariance in the feature space. However, these methods can only work in the presence of the source data set during the adaptation stage, whose availability might be limited in practice due to data privacy or storage concerns.

Open-set Domain Adaptation (OSDA), originally proposed in [48], refers to the DA setting where both the domains have some shared and private classes, with explicit knowledge about the shared classes. However, such a setting was deemed impractical, and later Saito et al. [56] proposed the open-set setting where the source labels are a subset of the target labels. Since then, several OSDA methods have been proposed which use image-to-image translations [71], progressive filtering [36], ensembles of multiple classifiers [11], and one-vs-all classifiers [55] to detect OOD samples. Similar to UDA methods, the OSDA methods also require support from the source data to detect target-private classes, which makes them unsuitable for source-free DA.

Source-free Domain Adaptation (SFDA) aims to adapt a model to the unlabelled target domain when only the source model is available and the source data set is absent during target adaptation. Existing SFDA methods use pseudo-label refinement [1, 5, 33], latent source feature generation using variational inference [70], or disparity among an ensemble of classifiers [30]. Certain SFDA methods resort to ad hoc source training protocols to enable the source model to be adapted on the target data. For instance, [30] requires an ensemble of classifiers to be trained during source training so that the disparity among them could be utilized for target adaptation. Similarly, USFDA [27] requires artificially generated negative samples in the source training stage for the model to detect OOD samples. Such coupled source and target training procedures make these SFDA methods less viable for practical applications. On the other hand, our proposed U-SFAN does not require specialized source training except a computationally lightweight approximate inference, which can be done with a single pass of the source data during the source training. Moreover, unlike [1, 30], our U-SFAN works well on both closed-set and open-set SFDA without ad hoc modifications.

Uncertainty Quantification in the form of Bayesian deep learning (e.g., [24, 45]) is concerned with formalizing prior knowledge and representing uncertainty in model predictions, especially under domain-shift or out-of-distribution samples. Even though the Bayesian methodology gives an explicit way of formalizing uncertainty, computation is often intractable. Thus, approximate inference methods such as Monte Carlo (MC) dropout [12], deep ensembles [29, 66], other stochastic methods (e.g., [42]), variational methods [3], or the Laplace approximation [52] are typically employed in practice. Prior works in semantic segmentation [72] and UDA [20, 28, 30, 51, 65] applied MC dropout or deep ensembles, respectively, for uncertainty quantification in DA. However, none of the above approaches can be considered practical for the more challenging source-free DA scenario, as MC dropout, ensembles, and other stochastic methods do not lend themselves well to the source-free case. In particular, they require retraining several models on the source, changing the model architecture, or a tailored learning procedure on the source data. Thus, we take a Laplace approach, which allows re-using the source model by linearizing around a point estimate (see Fig. 2) and which is post hoc, yet grounded in classical statistics [15].

3 Methods

Problem Definition and Notation. We are given a labelled source data set, having \(n^{[\texttt{S}]}\) instances, \(\mathcal {D}^{[\texttt{S}]}= \{(\textbf{x}^{[\texttt{S}]}_i, \textbf{y}^{[\texttt{S}]}_i)\}^{n^{[\texttt{S}]}}_{i=1}\), where \(\textbf{x}^{[\texttt{S}]}\in \mathcal {X}^{[\texttt{S}]}\) are D-dimensional inputs and \(\textbf{y}^{[\texttt{S}]}\in \mathcal {Y}^{[\texttt{S}]}\) where we assume K-dimensional one-hot encoded class labels, i.e., \(\mathcal {Y}^{[\texttt{S}]}= \mathbb {B}^K\). Moreover, we have \(n^{[\texttt{T}]}\) unlabelled target observations \(\mathcal {D}^{[\texttt{T}]}= \{\textbf{x}^{[\texttt{T}]}_j\}^{n^{[\texttt{T}]}}_{j=1}\), where \(\textbf{x}^{[\texttt{T}]}\in \mathcal {X}^{[\texttt{T}]}\) are D-dimensional unlabelled inputs. As in any DA scenario, the assumption made is that the marginal distributions of the source and the target are different, but the semantic concept represented through class labels does not change. Formally, we assume that \(p(\textbf{y}^{[\texttt{S}]}\,|\,\textbf{x}^{[\texttt{S}]}) \approx p(\textbf{y}^{[\texttt{S}]}\,|\,\textbf{x}^{[\texttt{T}]})\) and \(p(\textbf{x}^{[\texttt{S}]}) \ne p(\textbf{x}^{[\texttt{T}]})\). In the SFDA scenario we further assume that the source data set is only available while learning the source function \(f :\mathcal {X}^{[\texttt{S}]}\rightarrow \mathcal {Y}^{[\texttt{S}]}\) and becomes unavailable while adapting on the unlabelled data. The goal of SFDA is to adapt the source function f to the target domain solely by using the data in \(\mathcal {D}^{[\texttt{T}]}\). The resulting target function, denoted as \(f' :\mathcal {X}^{[\texttt{T}]}\rightarrow \mathcal {Y}^{[\texttt{T}]}\), can then be used to infer the class assignment for \(\textbf{x}^{[\texttt{T}]}\in \mathcal {X}^{[\texttt{T}]}\). 
In this work we have considered two settings of the SFDA: i) vanilla closed-set SFDA where the label space of the source \(\texttt{S}\) and the target \(\texttt{T}\) is the same, \(\textrm{L}^{[\texttt{S}]}= \textrm{L}^{[\texttt{T}]}\); and ii) open-set SFDA where the label space of the \(\texttt{S}\) is a subset of the \(\texttt{T}\), i.e., \(\textrm{L}^{[\texttt{S}]}\subset \textrm{L}^{[\texttt{T}]}\), and \(\textrm{L}^{[\texttt{T}]}\setminus \textrm{L}^{[\texttt{S}]}\) are denoted as target-private or OOD classes.

We model the source and target functions f with a neural network that is composed of two sub-networks: feature extractor g and hypothesis function h, such that \(f = h \circ g\). The feature extractor g and the hypothesis function h are parameterized by parameters \(\beta \) and \(\theta \), respectively. During target adaptation, the model is initialized with parameters learned on \(\mathcal {D}^{[\texttt{S}]}\) and subsequently the feature extractor parameters are updated using backpropagation, i.e., the hypothesis function is kept frozen.

Fig. 3. The pipeline for U-SFAN: (a) Initial source model training (1) and the additional step (2) of composing a Laplace approximation for assessing the posterior over model parameters, \(p(\theta \,|\,\mathcal {D}^{[\texttt{S}]})\). (b) At target adaptation, we keep the posterior over the parameters fixed and train g under an uncertainty-aware composite loss that weights samples according to predictive uncertainty

Overall Idea. Our proposed method for SFDA operates in two stages. We begin the first stage (see Fig. 3a) by training a source model on the data set \(\mathcal {D}^{[\texttt{S}]}\), which gives us the maximum a posteriori (MAP) estimate of the source network parameters (\(\{\beta ^{[\texttt{S}]}_\textrm{MAP}, \theta ^{[\texttt{S}]}_\textrm{MAP}\}\)). The second stage (see Fig. 3b) comprises the maximization of mutual information [17] in the predictions for the target inputs \(\mathcal {D}^{[\texttt{T}]}\). However, due to the overconfidence of ReLU networks [22], maximizing mutual information for all inputs equally, including those that are far away from the source data, could be detrimental. To overcome this pathology, we derive a per-sample weight using the model’s uncertainty and use it to modulate the mutual information objective in SHOT. To estimate the uncertainty in the predictions on the target data, we perform approximate posterior inference over the parameters of the hypothesis function, i.e., \(p(\theta ^{[\texttt{S}]}\,|\,\mathcal {D}^{[\texttt{S}]})\). Inspired by recent works on approximate inference in Bayesian neural networks [25, 40, 59], we propose to estimate the posterior predictive distribution \(p(\textbf{y}\,|\,\textbf{x}, \mathcal {D})\) using a Laplace approximation, which introduces little computational overhead and needs no specialized source training. We briefly describe the preliminaries to our approach in the following section.

3.1 Preliminaries

Liang et al. [33] proposed SHOT (Source HypOthesis Transfer) for the task of SFDA, where the goal is to find a parameterization \(\beta ^{[\texttt{T}]}\) of the feature extractor g such that the distribution of latent features \(\textbf{z}^{[\texttt{T}]}= g_{\beta ^{[\texttt{T}]}}(\textbf{x}^{[\texttt{T}]})\) matches the distribution of the latent source features. This allows the target data to be accurately classified by the hypothesis function parameterized by \(\theta ^{[\texttt{S}]}\). To this end, the authors address the SFDA task in two stages: the first stage comprises source model training, and the second maximizes the mutual information [17] between the latent representations and the classifier output.

The source model \(f :\mathcal {X}^{[\texttt{S}]}\rightarrow \mathcal {Y}^{[\texttt{S}]}\) for a K-way classification task is learned using a label-smoothed cross-entropy objective [44], i.e.,

$$\begin{aligned} \textstyle \mathcal {L}_{\textrm{src}} = - {{\,\mathrm{\mathbb {E}}\,}}_{p(\textbf{x}^{[\texttt{S}]}, \textbf{y}^{[\texttt{S}]})} \sum ^{K}_{k=1} \tilde{y}^{[\texttt{S}]}_{k} \log \phi _k(f(\textbf{x}^{[\texttt{S}]})), \end{aligned}$$
(1)

where \(\phi _k(\textbf{a}) = \nicefrac {\exp (a_k)}{\sum _j \exp (a_j)}\) denotes the likelihood for the \(k^\text {th}\) component of the model output and \(\tilde{y}^{[\texttt{S}]}_{i,k} = y^{[\texttt{S}]}_{i,k} (1 - \alpha ) + \nicefrac {\alpha }{K}\) the label-smoothed class label for the \(i^\text {th}\) datum, with smoothing coefficient \(\alpha \).
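As a minimal sketch of Eq. (1), the following plain-Python snippet evaluates the label-smoothed cross-entropy on a mini-batch of logits. The function names and the default value \(\alpha = 0.1\) are illustrative assumptions and are not taken from the paper or its code.

```python
import math

def softmax(logits):
    """Numerically stable softmax, i.e. phi(a) in Eq. (1)."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_labels(one_hot, alpha):
    """Label smoothing: y_k * (1 - alpha) + alpha / K."""
    K = len(one_hot)
    return [y * (1.0 - alpha) + alpha / K for y in one_hot]

def smoothed_cross_entropy(logits, one_hot, alpha=0.1):
    """Per-sample term of Eq. (1); alpha=0.1 is an assumed default."""
    probs = softmax(logits)
    targets = smooth_labels(one_hot, alpha)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

def source_loss(batch_logits, batch_one_hot, alpha=0.1):
    """Mini-batch estimate of the expectation in Eq. (1)."""
    losses = [smoothed_cross_entropy(l, y, alpha)
              for l, y in zip(batch_logits, batch_one_hot)]
    return sum(losses) / len(losses)
```

With \(\alpha = 0\) the objective reduces to the standard cross-entropy, which is a convenient sanity check for an implementation.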

After the source training, the \(\mathcal {D}^{[\texttt{S}]}\) is discarded and the target adaptation is conducted on \(\mathcal {D}^{[\texttt{T}]}\) only. To adapt on the target domain, the target function \(f'\) is initialized based on the learned source function f and learned with the information maximization (IM) loss [17]. The IM loss ensures that the function mapping will produce one-hot predictions while at the same time enforcing diverse assignments, i.e.,

$$\begin{aligned} \mathcal {L}_{\textrm{ent}}&{=} {-}{{{\,\mathrm{\mathbb {E}}\,}}}_{p(\textbf{x}^{[\texttt{T}]})} \textstyle \sum ^{K}_{k=1} \phi _k(f'(\textbf{x}^{[\texttt{T}]})) \log \phi _k(f'(\textbf{x}^{[\texttt{T}]})), \end{aligned}$$
(2)
$$\begin{aligned} \mathcal {L}_{\textrm{div}}&{=} \textrm{D}_\textrm{KL}(\hat{\textbf{p}} \,\Vert \, K^{-1} \boldsymbol{1}_K) - \log K, \end{aligned}$$
(3)

where \(\boldsymbol{1}_K\) is a vector of all ones, and \(\hat{p}_k = {{\,\mathrm{\mathbb {E}}\,}}_{p(\textbf{x}^{[\texttt{T}]})}[\phi _k(f'(\textbf{x}^{[\texttt{T}]}))]\) is the expected network output for the \(k^\text {th}\) class. Intuitively, \(\mathcal {L}_{\textrm{ent}}\) is in charge of making the network output one-hot, while \(\mathcal {L}_{\textrm{div}}\) is responsible for equally partitioning the network prediction into K classes. In practice \(\mathcal {L}_{\textrm{div}}\) operates on a mini-batch level. In this work we start from SHOT-IM to adapt to the target domain.
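The two IM terms in Eqs. (2) and (3) can be sketched in plain Python as below, with expectations replaced by mini-batch averages. The function names are hypothetical; the snippet only illustrates the formulas and is not the authors' implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_loss(batch_logits):
    """Eq. (2): mean Shannon entropy of the per-sample predictions."""
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_logits)

def diversity_loss(batch_logits):
    """Eq. (3): KL(p_hat || uniform) - log K on the batch-mean prediction."""
    K = len(batch_logits[0])
    n = len(batch_logits)
    probs = [softmax(logits) for logits in batch_logits]
    p_hat = [sum(p[k] for p in probs) / n for k in range(K)]
    kl = sum(p * math.log(p * K) for p in p_hat if p > 0.0)
    return kl - math.log(K)
```

Confident predictions spread evenly over the classes drive both terms to their minima (0 and \(-\log K\)), whereas a batch collapsed onto a single class keeps the entropy term low but raises the diversity term to roughly 0.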

3.2 Uncertainty-Guided Source-Free DA

Distributional shift between source and target data sets causes the network outputs to differ, even for the same underlying semantic concept [8]. In a standard UDA scenario, where the source data is available during target adaptation, it is still possible to align the marginal distributions by using a quantifiable discrepancy metric. The task becomes more challenging in the SFDA scenario because it is not possible to align the target feature distribution to a reference (or source) distribution. Moreover, standard ReLU networks are known to yield overconfident predictions for data points which lie far away from the training (source) data [22]. In other words, the MAP estimate of a neural network has no notion of uncertainty over the learned weights. Thus, blindly trusting the source model predictions for \(\textbf{x}^{[\texttt{T}]}\in \mathcal {D}^{[\texttt{T}]}\) while performing information maximization [33] or entropy minimization [19] can potentially lead to misalignment of clusters between the source and target.

In this work we propose to incorporate the uncertainty of the neural network’s weights into the predictions. This mandates a Bayesian treatment of the network’s parameters (\(\theta \)), which gives a posterior distribution over the model parameters by conditioning on observed data (\(\mathcal {D}\)), i.e., \(p(\theta \,|\,\mathcal {D}) = \frac{p(\theta )\, p(\mathcal {D}\,|\,\theta )}{p(\mathcal {D})} \propto p(\theta )\, p(\mathcal {D}\,|\,\theta )\). The prediction of the network \(h_{\theta }\) for an observation \(\textbf{x}\) is given by the predictive posterior distribution, i.e.,

$$\begin{aligned} p(y_k \,|\,\textbf{x}, \mathcal {D}) = \int _\theta \phi _k(h_{\theta }(\textbf{x})) \, p(\theta \,|\,\mathcal {D}) \, \textrm{d}\theta . \end{aligned}$$
(4)

Note that the posterior \(p(\theta \,|\,\mathcal {D})\) in Eq. (4) does not have an analytical solution in general and needs to be approximated. For this, we employ a local approximation to the posterior using a Laplace approximation (LA, [59]). The LA locally approximates the true posterior using a multivariate Gaussian distribution centred at a local maximum and with covariance matrix given by the inverse of the Hessian \(\textbf{H}\) of the negative log-posterior, i.e., with \(\textbf{H}{:}{=}-\nabla ^2_{\theta } \log p(\theta \,|\,\mathcal {D}) \,|\,_{\theta _\textrm{MAP}}\). Details can be found in the appendix. Note that the LA is a principled and simple, yet effective, approach to approximate posterior inference, stemming from a second-order Taylor expansion of the log-posterior around \(\theta _\textrm{MAP}\). Next we will discuss LA in the context of SFDA.
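As a brief reminder, the standard textbook form of this expansion (which may differ in notation from the appendix) reads as follows. The first-order term vanishes because \(\theta _\textrm{MAP}\) is a stationary point of the log-posterior:

$$\begin{aligned} \log p(\theta \,|\,\mathcal {D})&\approx \log p(\theta _\textrm{MAP} \,|\,\mathcal {D}) - \tfrac{1}{2} (\theta - \theta _\textrm{MAP})^\top \textbf{H}\, (\theta - \theta _\textrm{MAP}), \\ p(\theta \,|\,\mathcal {D})&\approx \mathcal {N}(\theta \,|\,\theta _\textrm{MAP}, \textbf{H}^{-1}). \end{aligned}$$

Exponentiating the quadratic form and normalizing yields the Gaussian in the second line, whose covariance is the inverse Hessian.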

Bayesian Source Model Generation. In the source training stage (see Fig. 3a), by optimizing Eq. (1), we obtain a MAP estimate of the weights for our source model, comprising \(\beta _\textrm{MAP}\) and \(\theta _\textrm{MAP}\) for g and h, respectively. Since f is often modelled by a very deep neural network (e.g., ResNet-50), computing the Hessian can be computationally infeasible owing to the large number of parameters. So we make another simplification by applying a Bayesian treatment only to hypothesis function h, known as the last-layer Laplace approximation [25]. This gives us a probabilistic source hypothesis with posterior distribution \(p(\theta \,|\,\mathcal {D}^{[\texttt{S}]})\) for the parameters. The feature extractor g remains deterministic. Formally, let \(\textbf{z}= g_{\beta ^{[\texttt{S}]}}(\textbf{x})\) be the latent feature representation from the feature extractor. Following Eq. (4), the predictive posterior distribution is given as:

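$$\begin{aligned} p(y_k \,|\,\textbf{z}, \mathcal {D}^{[\texttt{S}]}) = \int _\theta \phi _k(h_{\theta }(\textbf{z})) \, p(\theta \,|\,\mathcal {D}^{[\texttt{S}]}) \, \textrm{d}\theta . \end{aligned}$$
(5)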

While the last-layer LA greatly simplifies the computational overhead for large networks, the Hessian can still be difficult to compute in the case the number of classes is large. To simplify computations, we assume that \(\textbf{H}\) can be Kronecker-factored \(\textbf{H}{:}{=}\textbf{V}\otimes \textbf{U}\) and the resulting approximation is referred to as Kronecker-factored Laplace approximation (KFLA, [52]). Such probabilistic treatment allows us to quantify uncertainty in the predictions for data points from the target with little computational overhead. Also, the LA can be readily computed using a single forward pass of the source data through the network. Next, we describe how to use the uncertainty estimates during target adaptation.
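One reason the Kronecker factorization keeps the computation cheap is a standard linear-algebra identity, stated here for illustration rather than taken from the paper; it assumes that both factors are invertible:

$$\begin{aligned} (\textbf{V}\otimes \textbf{U})^{-1} = \textbf{V}^{-1} \otimes \textbf{U}^{-1}. \end{aligned}$$

The covariance \(\textbf{H}^{-1}\) of the Gaussian approximation can therefore be obtained from the inverses of the two small factors, without forming or inverting the full Hessian.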

Uncertainty-Guided Information Maximization. Upon completion of the source model generation stage, we exploit the probabilistic source hypothesis to guide the information maximization in the target adaptation stage. SHOT puts equal confidence on all the target predictions and does not make any distinction for target features that lie outside of the source manifold. We emphasize that in case of strong domain-shift, naïvely maximizing the IM loss could lead to cluster misalignment. For that reason, we propose to weigh the entropy minimization objective (Eq. (2)) with a weight which is proportional to the certainty in the target predictions (see Fig. 3b). To get the per-sample weight for an \(\textbf{x}^{[\texttt{T}]}\), we need to compute the predictive posterior distribution, as outlined in Eq. (5). However, exactly solving the integral is intractable in many cases and we, therefore, resort to Monte Carlo (MC) integration. Let \(\textbf{z}^{[\texttt{T}]}= g_{\beta ^{[\texttt{T}]}}(\textbf{x}^{[\texttt{T}]})\); the approximate predictive posterior distribution is:

$$\begin{aligned} p(y_k \,|\,\textbf{z}^{[\texttt{T}]}, \mathcal {D}^{[\texttt{S}]}) \approx \frac{1}{M} \sum ^{M}_{j=1} \phi _k\left( h_{\theta _j}(\textbf{z}^{[\texttt{T}]})\right) , \end{aligned}$$
(6)

where \(\theta _j \sim p(\theta \,|\,\mathcal {D}^{[\texttt{S}]})\) are samples from the (Laplace-approximated) posterior and M denotes the number of MC steps. To encourage low-entropy predictions, we additionally scale the outputs of the hypothesis by \(1/\tau \), where \(0<\tau \le 1\). The final weight of each observation \(\textbf{x}_i^{[\texttt{T}]}\) is then computed as \(w_i = \exp (-H)\), where H denotes the entropy of the predictive mean. The uncertainty-guided entropy loss is then given as:

$$\begin{aligned} \mathcal {L}^\textrm{ug}_\textrm{ent} = {-} {{{\,\mathrm{\mathbb {E}}\,}}}_{p(\textbf{x}^{[\texttt{T}]})} \sum ^{K}_{k=1} w \, \phi _k(f'(\textbf{x}^{[\texttt{T}]}))\log \phi _k(f'(\textbf{x}^{[\texttt{T}]})). \end{aligned}$$
(7)

The final training objective is then given as: \(\mathcal {L}_\text {U-SFAN} = (1-\gamma )\mathcal {L}^\textrm{ug}_\textrm{ent} + \gamma \, \mathcal {L}_\textrm{div}\). Pseudocode for our U-SFAN can be found in the appendix.
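The weighting scheme of Eqs. (6) and (7) and the final objective can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pseudocode: the logits \(h_{\theta _j}(\textbf{z}^{[\texttt{T}]})\) for the M posterior samples are assumed to be given as plain lists, the weights are treated as fixed numbers, and all function names are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Softmax of the logits scaled by 1/tau (numerically stable)."""
    scaled = [a / tau for a in logits]
    m = max(scaled)
    exps = [math.exp(a - m) for a in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; 0 * log 0 is treated as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predictive_mean(sampled_logits, tau=1.0):
    """Eq. (6): average softmax over the M sampled hypotheses."""
    M = len(sampled_logits)
    K = len(sampled_logits[0])
    probs = [softmax(logits, tau) for logits in sampled_logits]
    return [sum(p[k] for p in probs) / M for k in range(K)]

def certainty_weight(sampled_logits, tau=1.0):
    """w = exp(-H), with H the entropy of the predictive mean."""
    return math.exp(-entropy(predictive_mean(sampled_logits, tau)))

def ug_entropy_loss(batch_logits, weights):
    """Eq. (7): per-sample entropy of f'(x), scaled by its weight."""
    terms = [w * entropy(softmax(l)) for l, w in zip(batch_logits, weights)]
    return sum(terms) / len(terms)

def diversity_loss(batch_logits):
    """Eq. (3) on the mini-batch: sum_k p_hat_k log p_hat_k."""
    K = len(batch_logits[0])
    n = len(batch_logits)
    probs = [softmax(l) for l in batch_logits]
    p_hat = [sum(p[k] for p in probs) / n for k in range(K)]
    return sum(p * math.log(p) for p in p_hat if p > 0.0)

def u_sfan_loss(batch_logits, weights, gamma):
    """Final objective: (1 - gamma) * L_ent^ug + gamma * L_div."""
    return ((1.0 - gamma) * ug_entropy_loss(batch_logits, weights)
            + gamma * diversity_loss(batch_logits))
```

Since the entropy of a K-class distribution lies between 0 and \(\log K\), the weight always falls in \([1/K, 1]\): samples on which the posterior draws agree confidently get a weight close to 1, and samples on which they disagree are pushed towards \(1/K\).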

How Does this Differ from Conventional Uncertainty Estimation? The importance and advantages of adopting a Laplace approximation (LA) over Monte Carlo (MC) dropout to estimate uncertainty in SFDA can be summarized as follows: (i) LA does not require specialized network architecture (e.g., dropout layers), loss function, or re-training (as in MC dropout) to estimate predictive uncertainties. This greatly decouples the source training from target adaptation, which is essential to be applicable in SFDA; (ii) To have well-calibrated uncertainties, MC dropout requires a grid search over the dropout probabilities [13], a prohibitive operation in deep neural networks, especially as the future target data is not available at source training. LA is a more principled approach that does not require a grid search, making it better suited for SFDA. (iii) LA is computationally lightweight since it requires just a single forward pass of the source data through the network after the source training to estimate the posterior over the parameters of the sub-network. (iv) LA does not impact the training time during target adaptation because, unlike MC dropout, only a single forward pass is needed to quantify the predictive uncertainties. Because LA employs a Gaussian approximation to the posterior, MC integration is cheap and efficient to compute. (v) As used in our work, LA estimates the full posterior over the weights and biases, while MC dropout can only account for the uncertainties over the weights [12] and is known to be a poor approximation to the posterior [10, 46, 47]. (vi) LA preserves the decision boundary induced by the MAP estimate, which is not the case for MC dropout [25]. In summary, our contribution goes beyond the uncertainty re-weighting scheme [32, 35] commonly used in UDA, while carrying many advantages over existing works.

4 Experiments

We conduct experiments on four standard DA benchmarks: Office31 [54], Office-Home [63], Visda-C [50], and the large-scale DomainNet [49] (0.6 million images). The details of the benchmarks are summarized in the appendix. For the experiments in the open-set DA setting, we follow the split of [33] for shared and target-private classes.

Evaluation Protocol. We report the classification accuracy for every possible pair of \(source \mapsto target\) directions, except for Visda-C, where we are only concerned with the transfer from the synthetic \(\mapsto \) real domain. For the open-set experiments, following the evaluation protocol in [33], we report the OS accuracy, which includes the per-class accuracy of the known and the unknown classes and is computed as \(\textrm{OS} = \frac{1}{K+1} \sum ^{K+1}_{k=1}\textrm{acc}_k\), where \(k=\{1, 2, \dots , K\}\) denote the shared classes and the \((K+1)^\text {th}\) class comprises the target-private or OOD classes. This metric is preferred over the known-class accuracy, \(\textrm{OS}^*=\frac{1}{K}\sum ^{K}_{k=1}\textrm{acc}_k\), as the latter does not take into account the OOD classes.
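The two metrics can be computed as in the following sketch. The label convention is an assumption made for illustration: shared classes are indexed 0 to K-1 and all target-private samples are mapped to the single index K. The function names are hypothetical.

```python
def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy for each class index in 0..num_classes-1."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    if any(n == 0 for n in total):
        raise ValueError("every class needs at least one test sample")
    return [c / n for c, n in zip(correct, total)]

def os_accuracy(y_true, y_pred, num_shared):
    """OS: mean per-class accuracy over K shared classes plus 'unknown'."""
    acc = per_class_accuracy(y_true, y_pred, num_shared + 1)
    return sum(acc) / (num_shared + 1)

def os_star_accuracy(y_true, y_pred, num_shared):
    """OS*: mean per-class accuracy over the K shared classes only."""
    acc = per_class_accuracy(y_true, y_pred, num_shared + 1)
    return sum(acc[:num_shared]) / num_shared
```

For example, with two shared classes and per-class accuracies of 1.0, 0.5, and 0.5 (the last for the unknown class), OS is 2/3 while OS* is 0.75, which shows how OS* hides errors on the OOD samples.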

Implementation Details. We adopted the network architectures used in the SFDA literature, which are ResNet-50 or ResNet-101 [21]. Following [33], we added a bottleneck layer containing 256 neurons which is then followed by a batch normalization layer. The network finally ends with a weight normalized linear classifier that is kept frozen during the target adaptation. Details about hyperparameters can be found in the appendix. For computing the KFLA we use the PyTorch package of Dangel et al. [9]. Our code is available at https://github.com/roysubhankar/uncertainty-sfda

4.1 Ablation Studies

As discussed in Sect. 3.2, conventional SFDA methods that rely on optimizing the IM loss on the unlabelled target data (e.g., SHOT) are prone to misalignment of the target data with the source hypothesis under strong domain shift. To visually demonstrate this phenomenon, we design an experiment of a 3-way classification task on toy data (see Fig. 4). Given a set of source data points belonging to three classes, we simulate two kinds of domain-shift: mild shift (Fig. 4a) and strong shift (Fig. 4b). In the case of mild shift, the target data points stay very close to the source manifold, and the conventional approach (only using the MAP estimate) can classify a majority of target data points without the need for adaptation. In the case of strong shift, by contrast, the target data points for the blue class, in particular, shift drastically away from the source points, and the source model based on the MAP estimate misclassifies most of the target data points with high confidence. On the other hand, our uncertainty-guided source model remains certain only for those target points which lie within the source support and assigns low certainty otherwise (proportional to the strength of colours depicting the decision surface in Fig. 4), robustifying the adaptation on the target data in case of domain shift.

Fig. 4. Comparison of conventional IM (MAP) with our uncertainty-guided IM on target data under mild and strong domain-shift. The solid vs. hollow circles represent the source and the target data, respectively. Each class is colour coded and the decision boundaries are shaded with the corresponding colours. Under strong domain-shift, IM, when used with a MAP estimate, finds a completely flipped decision boundary. U-SFAN finds the decision boundary by down-weighting the far away target data

Given such a set-up, we optimize the IM loss (i.e., SHOT-IM) for both the conventional and the uncertainty-guided source models. In the case of mild shift, both can reliably partition the target data points under the right decision surfaces (see Fig. 4a (right)). This is intuitive because the decision boundary of the target model already passes through the low-density regions. Hence, the optimization of the IM loss leads to correct target classification with both methods. However, when the domain shift is more substantial, the conventional approach results in completely flipped decision boundaries. This happens because most blue target points fall under the red decision surface, and thus, the IM loss assigns them to class ‘red’. On the contrary, our uncertainty-guided approach down-weights the blue points, and safely optimizes the IM loss as the model is uncertain about the class assignment for those points (Fig. 4b (right)). This protects from major changes in the decision boundaries and allows the optimization to find the correct decision boundaries for the target data. This highlights the importance of having a notion of uncertainty in the model predictions during adaptation. We show later that this intuition also holds well for real-world data sets, where our U-SFAN offers more robustness when the domain-shift in a data set becomes challenging.

To gain further insights, we visualize the entropy density plots of the source model predictions before and after adaptation with conventional (MAP estimate) and uncertainty-guided models on an image data set (CIFAR [26] as source data set and STL [7] as target). As shown in Fig. 5a, the MAP estimate has lower entropy predictions for both the correct and incorrect predictions, when compared to our uncertainty-guided model. Reduced over-confidence for our approach is expected before the adaptation phase; however, it is non-trivial that this behaviour also persists in the post-adaptation phase. The reduced over-confidence allows our U-SFAN to down-weight the incorrect predictions during target adaptation, resulting in improved target accuracy over SHOT-IM (\(77.04\%\) for U-SFAN vs \(75.69\%\) for SHOT-IM). This effect can be noticed in Fig. 5b, where the incorrect predictions of U-SFAN have overall higher entropy, which is desirable in SFDA.

Fig. 5. Entropy density plots for CIFAR9 \(\rightarrow \) STL9 in the closed-set SFDA setting using the MAP estimate or our approach, each shown separately for correct and incorrect predictions. Our uncertainty-guided SFDA approach places less mass on low-entropy incorrect samples before and after adaptation

Table 1. Comparison of model performance using entropy weighting during target adaptation on the Office-Home data set. The weights computed using the LA are more beneficial than the weights computed with a MAP network

To further understand the contribution of our uncertainty-guided re-weighting, we run an ablation where the approximate posterior distribution of our method (Eq. (7)) is replaced by a weight computed from a point estimate of a MAP source model. This model is denoted as SHOT-IM + ENT. WEIGHTING in Table 1. We observe that such a weighting scheme indeed improves the performance over SHOT-IM. However, it still lags behind our proposed U-SFAN, which uses weights computed from the uncertainty-guided model. This shows that the improvement in performance with U-SFAN is not simply caused by the re-weighting but also by better identification of target samples that are not well explained under the source model.
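The difference between the two weighting schemes in this ablation can be sketched on a single target sample. The snippet compares a weight computed from the softmax of a point-estimate logit vector with a weight computed from a predictive distribution averaged over several logit vectors. The hand-picked logit samples stand in for posterior draws, and the weight form \(\exp (-H(p))\) is an assumption; the LA-based predictive in the paper may be computed differently (e.g., in closed form).

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q + eps) for q in p)

def weight(p):
    """Assumed entropy-based weight, exp(-H(p))."""
    return math.exp(-entropy(p))

def averaged_predictive(logit_samples):
    """Average the softmax outputs over sampled logits (Monte Carlo estimate)."""
    k = len(logit_samples[0])
    probs = [softmax(z) for z in logit_samples]
    return [sum(p[j] for p in probs) / len(probs) for j in range(k)]

# Point-estimate logits and hypothetical samples whose mean equals them.
map_logits = [2.0, 0.0]
logit_samples = [[-1.0, 0.0], [2.0, 0.0], [5.0, 0.0]]

p_map = softmax(map_logits)
p_bayes = averaged_predictive(logit_samples)
w_map, w_bayes = weight(p_map), weight(p_bayes)
```

Although the sampled logits share the same mean as the point estimate, averaging after the softmax yields a less confident prediction and therefore a smaller weight, which is the behaviour that lets an uncertainty-aware model flag samples that a point estimate would treat as confident.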

Table 2. Comparison of the classification accuracy on Office-Home for the closed-set setting using ResNet-50. High overall performance signifies milder distributional shift between domains. The improvement of U-SFAN upon SHOT is moderate, but competitive w.r.t. \(\text {A}^2\text {Net}\) [67] or SHOT++ [34], which require complex training objectives
Table 3. Comparison of the OS classification accuracy on Office-Home for the open-set setting using ResNet-50. U-SFAN improves over SHOT without the need for nearest-centroid pseudo-labelling in the case of open-set SFDA
Table 4. (a) Comparison of the classification accuracy on Visda-C for closed-set DA in the Synthetic \(\rightarrow \) Real direction, using ResNet-101. \(\dagger \) indicates the numbers of [33] that are obtained using the official code from the authors. Note that several SFDA methods perform equally well on Visda-C, hinting at saturating performance. (b) Comparison of the average accuracy on Domainnet for closed-set SFDA using ResNet-50. The source column indicates the domain on which the source model has been trained. Because this data set is challenging (exhibiting strong domain-shift), the improvement of our U-SFAN over [33] is substantial

4.2 State-of-the-Art Comparison

We compare our U-SFAN with UDA and SFDA methods on multiple data sets for closed-set and open-set settings. First, we compare U-SFAN with the baselines on the most common benchmark, office-home, for both closed-set and open-set settings. As can be seen from Table 2 and Table 3, we improve the performance over the majority of the baselines. In particular, we consistently improve over SHOT-IM with our method. We also combine the nearest-centroid pseudo-labelling used in SHOT [33] with U-SFAN (indicated as U-SFAN+ in Table 2 and Table 4a), and we find that it further helps improve the performance. Notably, the recently proposed \(\text {A}^2\text {Net}\) [67] (which only addresses closed-set SFDA) outperforms our U-SFAN on a couple of data sets but uses a combination of several loss functions, whose interplay can be hard to tune in practice. In contrast, our method is simpler, more versatile, and works for both SFDA settings. Due to lack of space, we report the numbers for office31 in the appendix. Given that the performance of the SFDA baseline methods on office-home and visda-c is relatively high and close to each other, the domain shift can be considered milder than in a more challenging data set such as domain-net.

When we compare U-SFAN with SHOT-IM on the challenging SFDA benchmark domain-net, the advantage of our U-SFAN over SHOT-IM becomes apparent (cf. Table 4b), which is in line with the ablation study in Sect. 4.1. Different from the previous data sets, the difficulty in mitigating domain-shift for domain-net is evident from the low overall performance of both SHOT-IM and U-SFAN. This data set can be seen as a real-world example of strong domain-shift. The improvement in the performance of U-SFAN over SHOT-IM for domain-net demonstrates that incorporating the uncertainty in the model’s predictions plays a crucial role in SFDA. The conventional approach may overfit to noisy model predictions, leading to poor performance. In contrast, U-SFAN can capture the uncertainty in predictions and down-weight the impact of noisy predictions.

5 Discussion and Conclusion

In this work, we demonstrated the need for uncertainty quantification in SFDA and proposed U-SFAN for leveraging it during target adaptation. Our uncertainty-guided approach employs a Laplace approximation to the posterior, does not require specialized source training, and allows for efficient computation of predictive uncertainties. Our experiments showed that down-weighting distant target data points in our novel uncertainty-weighted IM loss alleviates the misalignment of target data with the source hypothesis. We ran experiments in closed-set and open-set SFDA settings and showed that U-SFAN consistently improves upon the existing methods. Moreover, U-SFAN has been shown to be robust under mild distribution shifts and shows promising results under severe distribution shifts.

While we mainly focused on the popular IM-based SFDA methods, our proposed uncertainty-guided adaptation is also applicable to other SFDA frameworks, e.g., neighbourhood clustering [69] or extensions to the multi-source SFDA problem. Moreover, the principles we build upon are general, interpretable, and have strong backing in classical statistics. We believe that uncertainty-guided SFDA will become a backbone tool for future methods in DA that generalize over different problem domains, are less sensitive to the training setup, and provide good results without extensive ad hoc tuning to each problem.