1 Introduction

Deep learning has achieved remarkable success in a wide range of computer vision tasks [1,2,3], given a large amount of annotated training data. However, deep models cannot generalize well to novel domains due to domain shift [4]. To adapt these models, practitioners typically have to collect and annotate a large volume of training samples in the target domain as well, which is costly.

Unsupervised Domain Adaptation (UDA) [5] tackles this issue by transferring knowledge from a source domain to a related but different target domain using only unlabeled target data. Most UDA algorithms assume that the source and target datasets cover identical categories, a setting known as Closed-Set Domain Adaptation (CSDA), as shown in Fig. 1a. However, this assumption rarely holds in real applications: when no labels are available in the target domain, one cannot guarantee that the two domains share the same label space. Therefore, researchers have proposed a more reasonable and realistic setting called Open-Set Domain Adaptation (OSDA) [6,7,8]. The mainstream setting was introduced by Saito et al. [7], where the classes in the source domain are fully known and some of the classes in the target domain are unknown to the model trained on the source domain, as shown in Fig. 1b. OSDA methods specifically aim to classify each target sample correctly either into the label space of the source domain or into a special class called “unknown”.

Fig. 1: A comparison between the CSDA setting and the OSDA setting. (a) The CSDA setting assumes that the label spaces of the two domains are identical. (b) The OSDA setting assumes that the target domain includes both known (shared) classes and unknown classes

Compared with closed-set DA, the key to OSDA lies in how to effectively recognize and isolate the unknown samples. Existing methods usually define a normality score whose value is expected to be lower for unknown samples than for known ones. There are two typical issues. First, the model producing such scores is trained solely on the source dataset, overlooking the confounder, i.e., the domain gap between the source and target datasets. The essential reason is that when the model learns the semantic information of shared (known) classes, it is misled by spurious image styles or contexts. For example, if the source domain contains many samples of “dog” on grass and the target domain contains many samples of “sheep” on grass (but far fewer “dog” samples on grass), the model is misled by the context “on grass” and thus wrongly predicts “sheep” samples as “dog” [9]. The second issue is that the model usually produces a single uncalibrated prediction for each input, making the recognition of unknown samples unstable and unreliable.

In this paper, we address the above issues with the two-stage framework presented in Fig. 2. In the first stage, we improve the ability and stability of the model in separating known and unknown samples. 1) We propose an explicit module for deconfounding domain gaps (DDP), which transfers image styles and contexts from the target domain to the source domain. We then fine-tune the model on the source samples with transferred styles and contexts so that it recognizes target samples as known (or unknown) because of their class semantics rather than the confusion caused by spurious styles or contexts. 2) We propose a module for ensembling multiple transformations (EMT), which calibrates the model predictions by ensembling predictions from multiple transformations of each target sample. In the second stage, we leverage both the self-ensembling method [10] and the proposed DDP to deconfound domain gaps. Finally, we obtain a model that recognizes each target sample either as one of the known classes or as the special class “unknown”. We conduct extensive experiments on two OSDA benchmarks and show that both modules contribute to the performance improvement of the trained models. In summary, our main contributions are threefold.

  1. We point out that separating known and unknown classes remains a challenging problem in OSDA due to the confusion caused by domain gaps, and we propose a novel OSDA method that performs this separation effectively.

  2. We introduce an explicit module for deconfounding domain gaps (DDP), which transfers image styles and contexts from the target domain to the source domain and enables the model to correctly recognize unknown samples without being confounded by domain gaps. In addition, we propose a module for ensembling multiple transformations (EMT) to calibrate the model's recognition and obtain more reliable normality scores.

  3. We conduct experiments on two standard OSDA benchmarks. Our results demonstrate that our method outperforms the state of the art, and our in-depth analyses verify that the gains come from its superior ability to separate known and unknown samples in the target domain.

Fig. 2: An illustration of the proposed method. Stage I: we propose an explicit module for deconfounding domain gaps (DDP), which transfers domain information (i.e., image styles and contexts) from the target domain to the source domain. We then train the encoder F and the semantic network C1 on the source samples with transferred styles and contexts, enabling the model to recognize a class as known (or unknown) because of the class semantics rather than the confusion caused by spurious styles or contexts. After convergence, we apply a module for ensembling multiple transformations (EMT), which calibrates the model predictions by ensembling predictions from multiple transformations of each target sample. Based on the calibrated predictions, we compute reliable normality scores used to divide the target dataset into a known target dataset \(\mathcal {D}_{t}^{knw}\) and an unknown target dataset \(\mathcal {D}_{t}^{unk}\). Stage II: two networks with the same architecture are used: a student network f(x) = C2(F(x)), and a teacher network g whose weights are set as an exponential moving average (EMA) of the weights of the student network f. We minimize the consistency loss to mitigate the domain discrepancy between the source dataset \(\mathcal {D}_{s}\) and the known target dataset \(\mathcal {D}_{t}^{knw}\). In addition, we classify the target known samples and reject the target unknown samples by minimizing the classification loss on the source dataset \(\mathcal {D}_{s}\) and the unknown target dataset \(\mathcal {D}_{t}^{unk}\). We also utilize the proposed DDP to further deconfound domain gaps, which is omitted from the right side of the figure for simplicity

2 Related works

In this section, we briefly review methods for domain adaptation and anomaly detection.

2.1 Domain adaptation

Closed-Set Domain Adaptation (CSDA) works under the assumption that the two domains have identical categories. CSDA approaches focus on mitigating the domain discrepancy between domains and can be grouped into several categories based on the adopted strategy. Discrepancy-based methods measure the divergence between domains in the feature space with a discrepancy metric, such as Maximum Mean Discrepancy (MMD) [11], Higher-order Moment Matching (HoMM) [12], and the Wasserstein distance [13]; the domain shift is reduced by minimizing the metric during training. Adversarial methods [14] leverage a domain classifier to distinguish source and target features, while training the feature encoder to deceive the domain classifier so as to extract domain-agnostic feature representations. Adversarial methods are the most popular and have obtained promising performance, further enhanced by recent works [15,16,17,18] with novel network designs. Generative methods [19, 20] leverage generative models to translate source samples toward the target domain and then reduce the domain discrepancy at both the feature and pixel levels. Self-supervised methods [21, 22] design auxiliary self-supervised tasks for unlabeled target data to learn robust cross-domain representations. Consistency-enforcing methods [10, 23] force the model to make similar predictions for unannotated target samples even after they have been augmented. Feature disentanglement methods [24,25,26] decouple the feature representations into domain-invariant and domain-specific parts, and only the former is used to predict the target labels. Our proposed DDP shares a similar spirit with this line of methods, as we also try to separate the domain-related representation (style/context) from the semantic representation. However, we transfer the styles from the target domain to the source domain instead of only using the semantic representation. In addition, feature disentanglement methods for CSDA cannot be used in the OSDA scenario, as they need adversarial training to enhance the domain-invariant part, which may incur negative transfer. In comparison, our proposed DDP explicitly exploits feature statistics as the style representation, so we can also transfer the styles from unknown target samples without interference from their semantic information.

Open-Set Domain Adaptation (OSDA) assumes that the target label set contains the source label set. There are two different settings in the OSDA literature. Busto et al. [6] assumed that each domain includes unknown categories besides the shared classes, and proposed an algorithm called Assign-and-Transform-Iteratively (ATI), which maps target data to the source domain and then uses SVMs for the final prediction. Saito et al. [7] eased the setting by requiring no unknown data in the source dataset, so the target dataset contains all the source classes (known) and additional private classes that do not belong to the source (unknown). They also proposed a method called Open Set Back-Propagation (OSBP), which adversarially trains a classifier with an extra ‘unknown’ class to achieve common-private separation. Later OSDA methods all follow this more challenging and realistic setting. Separate To Adapt (STA) [8] conducts known/unknown separation through a two-stage coarse-to-fine filtering process: first, multiple binary classifiers are trained to compute similarity scores between source and target data; second, target samples with very low and very high scores are selected to train a final binary recognizer that distinguishes known and unknown target samples. Attract or Distract (AoD) [27] leverages metric learning to match target samples with the corresponding neighborhood or push them away from the known classes. Rotation-based Open Set (ROS) [28] adopts rotation classification, a self-supervised task, to distinguish the known and unknown target samples and then adapts source knowledge to the target known data.

Universal Domain Adaptation (UniDA), a more general scenario, makes no assumption about the relationship between the label sets of the two domains. Universal Adaptation Network (UAN) [29] designs a measurement to evaluate sample-level transferability based on domain similarity and prediction uncertainty; samples with high transferability are then given higher weight to promote common-class adaptation. However, as pointed out by [30], this criterion is not sufficiently discriminative or robust. Fu et al. [30] designed a better measurement that combines confidence, entropy, and consistency from multiple auxiliary classifiers to measure sample-wise uncertainty; similarly, a class-level weighting strategy is applied in the subsequent adversarial adaptation.

2.2 Anomaly detection

Anomaly detection aims to detect out-of-distribution (anomalous) samples by learning from normal samples. The approaches in this direction can be grouped into three categories. Distribution-based approaches [31, 32] use the normal samples to model the distribution function so that anomalous samples with lower likelihood can be filtered out. Reconstruction-based approaches [33, 34] use an encoder-decoder network to reconstruct the normal training samples; anomalous samples can then be recognized because they have larger reconstruction errors than normal samples. Discriminative approaches [35, 36] train a classifier on the normal samples and directly recognize anomalous samples based on the model prediction.

3 Method

In this section, we first formally introduce the preliminaries, then we present an overview of the proposed method and describe it in detail.

3.1 Preliminaries

We denote the annotated source domain drawn from distribution \(p_{s}\) as \(\mathcal {D}_{s}=\left \{\left ({x}_{j}^{s}, {y_{j}^{s}}\right )\right \}_{j=1}^{N_{s}} \sim p_{s}\) and the unannotated target domain drawn from distribution \(p_{t}\) as \(\mathcal {D}_{t}=\left \{{x}_{j}^{t}\right \}_{j=1}^{N_{t}} \sim p_{t}\). In OSDA, the target label set \(\mathcal {C}_{t}\) contains the source label set \(\mathcal {C}_{s}\), i.e., \(\mathcal {C}_{s} \subset \mathcal {C}_{t}\). We refer to classes from \(\mathcal {C}_{s}\) as the known classes and classes from \(\mathcal {C}_{t} \backslash \mathcal {C}_{s}\) as the unknown classes. In OSDA, we have both \(p_{s} \neq p_{t}\) and \(p_{s} \neq p_{t}^{\mathcal {C}_{s}}\), where \(p_{t}^{\mathcal {C}_{s}}\) represents the distribution of the target known data. Thus, we encounter both the domain shift \(\left (p_{s} \neq p_{t}^{\mathcal {C}_{s}}\right )\) and the class shift \(\left (\mathcal {C}_{s} \neq \mathcal {C}_{t}\right )\) problems in OSDA. The goal of OSDA methods is to classify target known data correctly and reject target unknown data.

OSDA introduces two challenges: negative transfer and known/unknown separation. (1) Enforcing alignment of the whole distributions of the two domains, as done in the closed-set scenario, will incur negative transfer, since unknown target samples will be mistakenly aligned with source data. To avoid this, we must apply adaptation only to the shared categories \(\mathcal {C}_{s}\), mitigating the domain shift between \(p_{s}\) and \(p_{t}^{\mathcal {C}_{s}}\). (2) This leads to the second challenge: known/unknown separation. Each target sample must be recognized as belonging either to the target-private categories \(\mathcal {C}_{t} \backslash \mathcal {C}_{s}\) (unknown) or to the shared categories \(\mathcal {C}_{s}\) (known).

3.2 Overview

To handle the two challenges above, we propose a novel OSDA method with a two-stage structure (Fig. 2): (i) we divide the target dataset into known and unknown parts; (ii) we apply adaptation between the source samples and the target samples predicted as known. If we view the unknown samples as anomalies, the first stage can be seen as an anomaly detection problem, and the second stage can be treated as a CSDA problem between the target known and source distributions. Specifically, in the first stage, we propose an explicit module for deconfounding domain gaps (DDP), which transfers image styles and contexts from the target domain to the source domain, eliminating the confounding effect caused by domain gaps. In addition, we propose a module for ensembling multiple transformations (EMT) to calibrate the model predictions, yielding more reliable normality scores. In the second stage, on the one hand, we leverage both the self-ensembling method [10] and the proposed DDP to reduce the domain discrepancy between the source data and the target known data; on the other hand, we train the network to classify target known samples and reject target unknown samples by minimizing the classification loss on the source data and the unknown target data.

3.3 Deconfounding Domain Gaps (DDP)

Recent domain generalization works observe that image styles and contexts are closely related to visual domains [37, 38]. Inspired by this observation, we propose an explicit module for deconfounding domain gaps (DDP) that transfers image styles and contexts from the target domain to the source domain. It enables the model to recognize a class as known (or unknown) because of the class semantics rather than the confusion caused by spurious styles or contexts. Following common practice [39, 40], we use the feature statistics preserved at the lower layers of the CNN as the domain-related representation (i.e., styles and contexts) and their spatial configuration as the semantic representation.

For an input sample x, we first obtain its feature maps \(z \in \mathbb {R}^{C \times H \times W}\) from the feature encoder, where C indicates the number of channels, and H and W represent the spatial dimensions. Then we compute the channel-wise mean and standard deviation \(\mu (z), \sigma (z) \in \mathbb {R}^{C}\) as the style/context representation:

$$ \mu(z)=\frac{1}{H W} \sum\limits_{h=1}^{H} \sum\limits_{w=1}^{W} z_{h w}, $$
(1)
$$ \sigma(z)=\sqrt{\frac{1}{H W} \sum\limits_{h=1}^{H} \sum\limits_{w=1}^{W}\left( z_{h w}-\mu(z)\right)^{2}+\epsilon}. $$
(2)

Given intermediate feature maps \(z^{s}, z^{t} \in \mathbb {R}^{C \times H \times W}\) corresponding to a source sample xs and a target sample xt, we replace the style/context of xs with the style/context of xt through adaptive instance normalization (AdaIN) [39]:

$$ \text{DDP}\left( {z^{s}}, z^{t}\right)=\sigma(z^{t}) \cdot\left( \frac{z^{s}-\mu\left( z^{s}\right)}{\sigma\left( z^{s}\right)}\right)+\mu(z^{t}). $$
(3)

After training, the learned model can reject unknown samples based only on the semantic content, without the influence of the confounder (i.e., the domain gaps).
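To make the operation concrete, the following is a minimal PyTorch sketch of (1)-(3); the function name and batched tensor shapes are our illustrative assumptions, not the authors' released implementation.

```python
import torch

def ddp(z_s: torch.Tensor, z_t: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Replace the style/context statistics of source features z_s with those
    of target features z_t, as in (3). Both tensors have shape (B, C, H, W)."""
    # Channel-wise mean and standard deviation over spatial positions, (1)-(2)
    mu_s = z_s.mean(dim=(2, 3), keepdim=True)
    mu_t = z_t.mean(dim=(2, 3), keepdim=True)
    sig_s = torch.sqrt(z_s.var(dim=(2, 3), keepdim=True, unbiased=False) + eps)
    sig_t = torch.sqrt(z_t.var(dim=(2, 3), keepdim=True, unbiased=False) + eps)
    # Normalize the source features, then re-style them with target statistics
    return sig_t * (z_s - mu_s) / sig_s + mu_t
```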

3.4 Ensembling Multiple Transformations (EMT)

Previous methods usually utilize confidence (i.e., the largest probability over all classes) [7, 8] or uncertainty (i.e., entropy) [28, 29] as the normality score to separate known (normal) and unknown (anomalous) samples, under the assumption that known samples have high confidence and low uncertainty, and vice versa. However, the confidence and uncertainty used in these works are based on a single uncalibrated prediction, so they do not reflect the true confidence and uncertainty of the sample. As a result, the previous normality scores are unreliable and unable to separate known and unknown samples accurately.

To obtain a more reliable normality score for separation, we propose a module for ensembling multiple transformations (EMT), which ensembles predictions from multiple transformations of each target sample to calibrate the confidence and the entropy. Specifically, given a target image xt, we apply random transformations (i.e., random crop and horizontal flip) to obtain m augmented samples \({\left \{\tilde {x}^{t}_{i}\right \}}_{i=1}^{m}\). \({\hat {y}}^{t}_{i} = C_{1}(F(\tilde {x}^{t}_{i})), (i=1,...,m)\) is the prediction for each augmented sample \({\tilde {x}}^{t}_{i}\), where F and C1 are the feature encoder and the semantic network. We compute the confidence \(w_{\text{conf}}\) and the entropy \(w_{\text{ent}}\) as follows:

$$ w_{\text{conf}}\left( \left.\hat{y}_{i}^{t}\right\vert_{i=1}^{m}\right)=\frac{1}{m} \sum\limits_{i=1}^{m} \max \left( \hat{y}_{i}^{t}\right), $$
(4)
$$ w_{\text{ent}}\left( \left.\hat{y}_{i}^{t}\right\vert_{i=1}^{m}\right)=\frac{1}{m} \sum\limits_{i=1}^{m}\left( \sum\limits_{k=1}^{\left\vert\mathcal{C}^{s}\right\vert}-\hat{y}_{i k}^{t} \log \left( \hat{y}_{i k}^{t}\right)\right), $$
(5)

where \({\hat {y}^{t}_{ik}}\) indicates the probability of the k-th class and \(\max \limits \) returns the maximum entry of \({\hat {y}}^{t}_{i}\). We normalize \(w_{\text{conf}}\) and \(w_{\text{ent}}\) to [0,1] by min-max normalization over all target samples. The normality score is then:

$$ \mathcal{N}\left( x^{t}\right)=\max \left\{w_{\text{conf}}, 1 - w_{\text{ent}}\right\}. $$
(6)

We take the maximum over these two terms to obtain the more reliable measurement.
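A minimal PyTorch sketch of EMT, (4)-(6), is given below; the exact augmentation pipeline and the handles F_enc and C1 are illustrative assumptions, and x_t is assumed to be a PIL image.

```python
import torch
import torch.nn.functional as Fn
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),   # random crop (assumed variant)
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

@torch.no_grad()
def conf_and_ent(x_t, F_enc, C1, m: int = 5):
    """Per-sample w_conf (4) and w_ent (5) from m random transformations."""
    probs = torch.stack([
        Fn.softmax(C1(F_enc(augment(x_t).unsqueeze(0))), dim=1).squeeze(0)
        for _ in range(m)
    ])                                               # shape (m, |C_s|)
    w_conf = probs.max(dim=1).values.mean()          # mean max-probability
    w_ent = (-probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    return w_conf, w_ent

def normality_scores(w_conf_all, w_ent_all):
    """Min-max normalize over the target set, then take the max, as in (6)."""
    minmax = lambda v: (v - v.min()) / (v.max() - v.min() + 1e-12)
    return torch.max(minmax(w_conf_all), 1.0 - minmax(w_ent_all))
```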

3.5 Training procedure

Stage I: known/unknown separation. To separate the known and unknown samples of \(\mathcal {D}_{t}\), a CNN is trained on the source samples with transferred styles and contexts. To boost the discriminability of the model and facilitate the subsequent known/unknown separation, we also exploit label smoothing (LS), as it pushes samples to distribute in tight, evenly separated clusters [41]. The network consists of a feature encoder F and a semantic network C1. We train the network by minimizing the following cross-entropy objective:

$$ L_{\text{cls}}=-\mathbb{E}_{(x^{s}, y^{s}) \in \mathcal{D}_{s}, x^{t} \in \mathcal{D}_{t}} \left[y^{ls} \log C_{1}\left( \text{DDP}\left( F(x^{s}), F(x^{t}) \right)\right)\right], $$
(7)

where \(y^{ls}=(1-\alpha )y^{s}+\alpha / \vert \mathcal {C}_{s}\vert \) is the smoothed label and α is the smoothing parameter. After training, we compute the normality score for each target sample using F and C1 as in (6). Known samples have large values of \(\mathcal {N}\), and vice versa. The target dataset can then be divided into an unknown target dataset \(\mathcal {D}_{t}^{unk}\) and a known target dataset \(\mathcal {D}_{t}^{knw}\) by thresholding the normality score. We use the average normality score over all target samples, \(\overline {\mathcal {N}}=\frac {1}{N_{t}} {\sum }_{j=1}^{N_{t}} \mathcal {N}_{j}\), as the threshold, without introducing any further parameter:

$$ \left\{\begin{array}{ll} x^{t} \in \mathcal{D}_{t}^{knw} & \text{if } \mathcal{N}\left( x^{t}\right)>\overline{\mathcal{N}} \\ x^{t} \in \mathcal{D}_{t}^{unk} & \text{otherwise.} \end{array}\right. $$
(8)

The detailed process about the computation of \(\mathcal {N}\) and the generation of \(\mathcal {D}_{t}^{knw}\) and \(\mathcal {D}_{t}^{unk}\) is described in Algorithm 1.

Algorithm 1: Computation of \(\mathcal {N}\) and generation of \(\mathcal {D}_{t}^{knw}\) and \(\mathcal {D}_{t}^{unk}\)
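Continuing the sketch above, the split in (8) reduces to a few lines; the helper name is ours, and `scores` is assumed to stack the normality scores of all target samples.

```python
import torch

def split_target(scores: torch.Tensor):
    """Divide target-sample indices into known/unknown using the mean
    normality score as the threshold, as in (8)."""
    threshold = scores.mean()          # no extra hyper-parameter to tune
    knw_idx = (scores > threshold).nonzero(as_tuple=True)[0]
    unk_idx = (scores <= threshold).nonzero(as_tuple=True)[0]
    return knw_idx, unk_idx
```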

Stage II: domain adaptation. After the target unknown data have been filtered out, the problem is simplified to a CSDA problem. Without the distraction of \(\mathcal {D}_{t}^{unk}\), we can exploit \(\mathcal {D}_{t}^{knw}\) to decrease the domain discrepancy directly. In addition, \(\mathcal {D}_{t}^{unk}\) can be used to train the classifier to recognize the unknown samples. The network has an architecture similar to that of Stage I, consisting of a feature extractor F and a semantic network C2. The semantic network C2 is the same as C1 except for the last layer: the output dimension of C1 is \(\left \vert \mathcal {C}_{s}\right \vert \), while the output dimension of C2 is \(\left (\left \vert \mathcal {C}_{s}\right \vert + 1\right )\) because of the additional unknown class. We utilize the self-ensembling method [10] to close the domain gap. Two networks with the same architecture are used: a student network f(x) = C2(F(x)), and a teacher network g(x) whose weights are set as an exponential moving average (EMA) of the weights of the student network. The student network is trained to minimize the classification loss on source and target unknown samples, while keeping its predictions consistent with the teacher network on target known samples. The consistency loss can be formulated as:

$$ L_{\text{con}}=\mathbb{E}_{x^{t} \in \mathcal{D}_{t}^{knw}} \left[\left( f\left( x^{t}\right) - g\left( x^{t}\right)\right)^{2}\right]. $$
(9)

The classification losses for samples from source and target unknown datasets are:

$$ L_{\text{cls}}^{s}=-\mathbb{E}_{(x^{s}, y^{s}) \in \mathcal{D}_{s}, x^{t} \in \mathcal{D}_{t}^{knw}} \left[y^{s} \log C_{2}\left( \text{DDP}\left( F\left( x^{s}\right), F\left( x^{t}\right)\right)\right)\right], $$
(10)
$$ L_{\text{cls}}^{unk}=-\mathbb{E}_{(x^{t}, y^{t}) \in {\mathcal{D}_{t}^{unk}}} \left[y^{t}\log f\left( x^{t}\right)\right]. $$
(11)

It is worth noting that we also exploit the proposed DDP here to transfer styles and contexts from the known target dataset to the source dataset, further deconfounding domain gaps. We train the network to minimize the following overall objective:

$$ L= (L_{\text{cls}}^{s} + L_{\text{cls}}^{unk} ) +{\lambda}L_{\text{con}}, $$
(12)

where λ is the weight that trades off the classification loss against the consistency loss. Once training is complete, we predict the labels of all target samples using F and C2.
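For concreteness, a minimal PyTorch sketch of one Stage-II update combining (9)-(12) is given below; `ddp` is the Section 3.3 sketch, and the optimizer, batch handles, and EMA decay value are illustrative assumptions rather than the authors' exact implementation.

```python
from itertools import chain

import torch
import torch.nn.functional as Fn

# Build the teacher once so it mirrors the student f = C2(F(.)), e.g.:
# teacher = torch.nn.Sequential(copy.deepcopy(F_enc), copy.deepcopy(C2))

def ema_update(teacher_params, student_params, decay: float = 0.99):
    """Teacher weights track an exponential moving average of the student's."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher_params, student_params):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

def stage2_step(F_enc, C2, teacher, opt, x_s, y_s, x_knw, x_unk, y_unk, lam=3.0):
    # (10): classify source samples re-styled with known-target statistics (DDP)
    loss_cls_s = Fn.cross_entropy(C2(ddp(F_enc(x_s), F_enc(x_knw))), y_s)
    # (11): push rejected target samples toward the extra "unknown" class;
    # y_unk is filled with the index of that class, |C_s|
    loss_cls_unk = Fn.cross_entropy(C2(F_enc(x_unk)), y_unk)
    # (9): consistency between student and EMA teacher on target known samples
    p_student = Fn.softmax(C2(F_enc(x_knw)), dim=1)
    with torch.no_grad():
        p_teacher = Fn.softmax(teacher(x_knw), dim=1)
    loss_con = ((p_student - p_teacher) ** 2).sum(dim=1).mean()
    loss = loss_cls_s + loss_cls_unk + lam * loss_con      # (12)
    opt.zero_grad(); loss.backward(); opt.step()
    ema_update(teacher.parameters(),
               chain(F_enc.parameters(), C2.parameters()))
    return loss.item()
```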

4 Experiments

In this section, we first introduce the experimental settings including datasets, compared approaches, evaluation metrics, and implementation details. Then, we present classification results on two standard datasets. Finally, we conduct further analysis to verify the effectiveness of the proposed method.

4.1 Experimental settings

Datasets. Office-31 [42] contains images of 31 classes collected from three visually distinct domains: Webcam (W) with 795 low-quality images obtained by a web camera, DSLR (D) with 534 high-quality images taken by a digital SLR camera, and Amazon (A) with 2820 images obtained from amazon.com. Following the protocol in [7], we set the first 10 categories (1-10) as known and the last 11 categories (21-31) as unknown (in alphabetical order). We show some example images from the Office-31 dataset in Fig. 3a. Office-Home [43] contains 15,500 images of 65 classes collected from four different domains: Artistic images (Ar), Product images (Pr), Clip-Art images (Cl), and Real-World images (Rw). Following the protocol in [8], we set the first 25 categories (1-25) in alphabetical order as known classes and the remaining 40 categories (26-65) as unknown. Office-Home is much more challenging than Office-31 due to its numerous categories and large domain discrepancy. Some example images from the Office-Home dataset are shown in Fig. 3b.

Fig. 3: Example images in Office-31 and Office-Home

Compared Approaches. We compare our method with: (1) Source Only model: ResNet-50 [1]; (2) CSDA method: DANN [14]; (3) OSDA methods: STA [8], OSBP [7], and ROS [28]; (4) UniDA method: UAN [29]. For ResNet-50 and DANN, we leverage a confidence threshold to separate known and unknown samples. All the results reported are the average over three random runs.

Evaluation Metrics. OS* and UNK are the two usual metrics used to evaluate OSDA: OS* denotes the average accuracy over the known classes, and UNK denotes the accuracy on the unknown class. They can be combined into \(OS=\frac {\left \vert \mathcal {C}_{s}\right \vert }{\left \vert \mathcal {C}_{s}\right \vert +1} \times OS^{*}+\frac {1}{\left \vert \mathcal {C}_{s}\right \vert +1} \times UNK\) to evaluate the overall performance. However, OS is not an appropriate metric, as it gives each known class the same importance as the whole “unknown” class. Since the trade-off between the accuracy on known and unknown classes is important in evaluating OSDA methods, we adopt the metric \(HOS=2 \frac {OS^{*} \times UNK}{OS^{*}+UNK}\) [28], the harmonic mean of OS* and UNK. Unlike OS, HOS is high only if the method achieves high performance on both known and unknown data.
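For concreteness, a small helper computing both metrics (the function name and the worked example are ours):

```python
def os_and_hos(os_star: float, unk: float, n_known: int):
    """OS combines the per-known-class accuracy OS* with UNK;
    HOS is the harmonic mean of OS* and UNK."""
    os_metric = (n_known * os_star + unk) / (n_known + 1)
    hos = 2 * os_star * unk / (os_star + unk) if (os_star + unk) > 0 else 0.0
    return os_metric, hos

# e.g. OS* = 90.0, UNK = 70.0 over 10 known classes
# -> OS = (10*90 + 70) / 11 ≈ 88.2, HOS = 2*90*70 / 160 = 78.75
```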

Implementation Details. We utilize ResNet-50 [1] pretrained on ImageNet [44] as the backbone network. The feature encoder F consists of the first few blocks of the ResNet architecture (the second residual block and all layers before it), while the remaining layers constitute the semantic network C1. Our proposed DDP module is inserted between these two networks. We use the same hyperparameters for each dataset. Following DANN [14], we adjust the learning rate with \(lr_{p}=\frac {lr_{0}}{(1+\omega p)^{\phi }}\), where p grows from 0 to 1 during training, lr0 equals 0.01 and 0.003 for Stage I and Stage II respectively, ω = 10, and ϕ = 0.75. The batch size is 32 in both stages. For all pretrained layers, the learning rate is 10 times lower than for the layers learned from scratch. We adopt SGD to optimize the network, with momentum 0.9 and weight decay 0.0005. In Stage I, we set the smoothing parameter α to 0.1 and the number of transformations m to 5. In Stage II, the trade-off parameter for the consistency loss is λ = 3. We use the network learned in Stage I as the initialization for Stage II. The learning rate for the new unknown-class output is set to twice that of the known classes.
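The learning-rate schedule above can be written as a one-line helper (a sketch; the function name is ours):

```python
def lr_schedule(p: float, lr0: float, omega: float = 10.0, phi: float = 0.75) -> float:
    """DANN-style decay lr_p = lr0 / (1 + omega * p)^phi, with p in [0, 1]."""
    return lr0 / (1.0 + omega * p) ** phi

# Stage I uses lr0 = 0.01, Stage II uses lr0 = 0.003
```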

4.2 Classification results

To evaluate the performance of the OSDA methods, we focus on HOS, as it balances the accuracy on known classes (OS*) and the unknown class (UNK), as discussed in Section 4.1. For a fair comparison, all results of the compared methods are either taken from [28] or obtained by running the code of [45].

Table 1 reports the classification results on Office-31. Our method outperforms all compared approaches on all tasks except \(W \rightarrow D\). Specifically, our method significantly outperforms OSBP by 3.7% and boosts the HOS of the state-of-the-art method ROS by 1.5%. In addition, we observe that DANN, STA, and UAN perform even worse than the ResNet-50 backbone, since they suffer from negative transfer caused by mismatching between source samples and target unknown samples. The failure of these methods is mainly due to their poor ability to separate target known and unknown samples.

Table 1 Accuracy (%) of all methods on Office-31 dataset. The best method is emphasized in bold

We also compare our method with previous works on the challenging Office-Home dataset. From Table 2, we find that our method outperforms all compared methods on 9 out of 12 transfer scenarios, demonstrating that our method works well under large domain gaps. On average, our method achieves the highest performance, 1.8% higher than the second-best method ROS. In addition, our method outperforms STA and OSBP by large margins of 6.9% and 3.3%, respectively. These encouraging results indicate that our method is very effective in the OSDA setting.

Table 2 Accuracy (%) of all methods on Office-Home dataset. The best method is emphasized in bold

From Tables 1 and 2, one key observation is that the advantage of our method mainly stems from its capability to distinguish known and unknown samples. While the average OS* of the compared methods is close to ours, the UNK of our method is much higher, e.g., 2.8% and 3.4% higher than ROS on Office-31 and Office-Home, respectively. This observation confirms that our method is particularly effective at separating target known and unknown samples.

4.3 Analysis

Ablation Study. To investigate how our method benefits known/unknown separation, we compare the performance of our Stage I with Stage I of ROS and STA. Both ROS and STA include two stages: they use a multi-rotation classifier and a multi-binary classifier, respectively, to distinguish known and unknown target samples. We compute the area under the receiver operating characteristic curve (AUC-ROC) over the normality scores \(\mathcal {N}\) on Office-31 to evaluate the performance. As shown in Table 3, the AUC-ROC of our method (93.0) is higher than that of the multi-rotation classifier used by ROS (91.5) and the multi-binary classifier used by STA (79.9). Table 3 also reports the performance of Stage I when alternately removing the module for deconfounding domain gaps (No DDP), the module for ensembling multiple transformations (No EMT), and the label smoothing (No LS). The performance drops significantly in all of these cases compared to our complete method, verifying each component's importance: (1) the DDP module shields the separation from the distractions caused by confounding styles and contexts; (2) the EMT module produces reliable normality scores through ensemble calibration; (3) label smoothing helps suppress overconfident predictions. To verify the effectiveness of the self-ensembling method in OSDA, we also compare it with the widely adopted GRL [14] on top of our Stage I. Table 3 shows that self-ensembling outperforms GRL by 0.4% on average. Furthermore, we evaluate the role of DDP in Stage II: as shown in Table 3, our full method outperforms the variant that removes the DDP module (No DDP in Stage II), which verifies that the proposed DDP is also helpful for domain adaptation.
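For reference, the separation quality can be measured with scikit-learn (a sketch; the argument names are ours):

```python
from sklearn.metrics import roc_auc_score

def separation_auc(is_known, scores):
    """AUC-ROC of the normality scores against binary known/unknown labels
    (1 = known); higher means better known/unknown separation."""
    return roc_auc_score(is_known, scores)
```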

Table 3 Ablation analysis. The best method is emphasized in bold

Where to apply DDP? We conduct experiments on Office-31 using ResNet-50 to examine the effect of the position where DDP is applied. Given that the ResNet architecture consists of four residual blocks, we apply DDP after different blocks to train different models. For notation, block1 means DDP is applied after the first residual block, block2 after the second residual block, and so on. The results are shown in Fig. 4. We make the following observations: (1) we obtain the best performance when DDP is applied after block2; (2) DDP is less helpful when applied after very low-level layers (i.e., block1); (3) the performance drops significantly when DDP is applied after very high-level layers (i.e., block4), even below the variant that uses no DDP in Stage I and Stage II. This makes sense because block4 is the closest to the classification layer and is inclined to capture semantic (i.e., label-related) information rather than style/context. As a result, transferring the statistics at block4 introduces unwanted semantic information.

Fig. 4: HOS on Office-31 when applying DDP after different layers

Feature Visualization. To intuitively showcase the effectiveness of our method, we visualize the target features produced by ResNet-50, ROS [28], and our method on the \(A \rightarrow D\) task using t-SNE [46]. The features obtained by ResNet-50 serve as the initial state without adaptation. As shown in Fig. 5a, the features of unknown classes and several known classes mix together, demonstrating that ResNet-50 cannot separate known and unknown classes. As shown in Fig. 5c, our method separates known from unknown features and discriminates between different known classes. Compared with our method, the known and unknown features obtained by ROS (Fig. 5b) appear more entangled.

Fig. 5: Visualization of target features for \(A \rightarrow D\) using t-SNE. Known and unknown samples are denoted as red and blue points, respectively

Distribution Discrepancy. As discussed in [47], distribution discrepancy can be measured by the \(\mathcal {A}\)-distance, defined as \(d_{\mathcal {A}}=2(1-2 \epsilon )\), where 𝜖 is the generalization error of a domain classifier. A larger distribution discrepancy corresponds to a larger \(d_{\mathcal {A}}\), and vice versa. We compute \(d_{\mathcal {A}}\) for both source-known and known-unknown pairs: source-known denotes the distribution discrepancy between source samples and target known samples, and known-unknown denotes the discrepancy between target known samples and target unknown samples. We compare our method with ResNet-50 and ROS using a kernel SVM as the domain classifier on the two tasks \(A \rightarrow W\) and \(D \rightarrow W\). From Fig. 6, we observe that \(d_{\mathcal {A}}\) for source-known is much smaller with our method than with ResNet-50 (source only), while \(d_{\mathcal {A}}\) for known-unknown is larger than with ResNet-50. These observations demonstrate that our method aligns the source and target known data while filtering out target unknown samples.
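A minimal sketch of this estimate, assuming feature arrays for the two compared sets and an scikit-learn kernel SVM as the domain classifier; the 50/50 train/test split and function name are our assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def a_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Train a domain classifier and convert its test error eps into
    d_A = 2 * (1 - 2 * eps)."""
    X = np.concatenate([feats_a, feats_b])
    y = np.concatenate([np.zeros(len(feats_a)), np.ones(len(feats_b))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    eps = 1.0 - clf.score(X_te, y_te)   # generalization error of the classifier
    return 2.0 * (1.0 - 2.0 * eps)
```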

Fig. 6: The \(\mathcal {A}\)-distance for source-known (smaller is better) and known-unknown (larger is better)

Sensitivity to Varying Openness. Openness is defined as \(\mathbb {O}=1-\frac {\left \vert \mathcal {C}_{s}\right \vert }{\left \vert \mathcal {C}_{t}\right \vert }\). Its value in the standard OSDA setting is around 0.5, meaning the numbers of known and unknown target classes are close; for example, the openness of Office-31 is \(\mathbb {O}=1-\frac {10}{21}=0.52\) and that of Office-Home is \(\mathbb {O}=1-\frac {25}{65}=0.62\). In practical applications, the number of unknown target classes may exceed the number of known classes by a large margin, with openness approaching 1. To test the robustness of our method, we conduct experiments on Office-Home with the following openness levels: \(\mathbb {O}=0.38\) (40 known classes), \(\mathbb {O}=0.62\) (25 known classes), \(\mathbb {O}=0.85\) (10 known classes), and \(\mathbb {O}=0.92\) (5 known classes). As shown in Fig. 7, the performance of OSBP and STA drops substantially at larger \(\mathbb {O}\), as they are unable to reject the unknown instances well. In contrast, our method and ROS are robust to changes in openness. In addition, our method outperforms ROS consistently, owing to its superior ability to separate known and unknown samples.

Sensitivity to Hyper-parameters. We investigate the sensitivity to two hyper-parameters: the number of transformations m (in (4) and (5)) and the trade-off weight λ (in (12)). The experiments are performed on the two tasks \(A \rightarrow D\) and \(D \rightarrow A\) with ResNet-50 as the backbone. We plot AUC-ROC against m in Fig. 8a and HOS against λ in Fig. 8b. Here m = 0 denotes the ablation where EMT is not used. We observe that our method is not sensitive to either hyper-parameter. We underline that the same hyper-parameters are used for all 18 domain pairs, demonstrating that the choice of hyper-parameter values is robust across datasets.

Fig. 7: Accuracy (%) over the four different openness levels

Fig. 8: Hyper-parameter sensitivity analysis

5 Conclusions

In this paper, we propose a novel OSDA method that conducts effective known/unknown separation. Specifically, we propose an explicit module for deconfounding domain gaps (DDP) that enables the model to recognize a class as known (or unknown) because of the class semantics rather than the confusion caused by spurious styles or contexts. In addition, to obtain reliable normality scores, we propose a module for ensembling multiple transformations (EMT) to calibrate the model output. The accurate known/unknown separation results boost the overall performance of the OSDA model. Experimental results on two standard datasets show that the proposed method outperforms state-of-the-art OSDA methods, especially by a large margin in recognizing unknown samples.