1 Introduction

In recent years, traditional machine learning and its related applications have achieved great success [37, 41, 49, 84], but these successes depend on large amounts of labeled data. Labeling is complicated and tedious, especially for samples from new domains or tasks, and manual labeling costs a great deal of time and money. Semi-supervised learning [88, 103] alleviates the problem to a certain extent. However, semi-supervised learning still needs a certain amount of labeled instances together with a large number of unlabeled instances, and a large number of unlabeled instances can be difficult to obtain in real-life application scenarios. This usually makes it challenging for the trained model to converge.

Unlike traditional machine learning, transfer learning [62, 109] allows different domains, tasks, and distributions to be used in training and testing. The original intention of transfer learning is to use a previously labeled domain to label a new domain, much as a person who can play the violin may learn the cello quickly. When the data distributions of the source domain and target domain differ but their tasks are the same, this special case of transfer learning is called domain adaptation. For example, a police officer investigating a crime can use a citizenship photo recorded in the system to quickly and accurately locate a target in a surveillance video [104]. Banks use standard fonts in their databases to help identify a target’s handwriting [59].

Domain adaptation (DA) is a particular case of transfer learning (TL). In the last few decades, various shallow domain adaptation methods have been proposed to solve domain transfer between the source and target domains. Shallow domain adaptation can usually be divided into three categories. 1) Instance-based domain adaptation, which adjusts the weights of the instances so that the distributions of the two domains become similar [7, 20]. 2) Feature-based domain adaptation, which adapts by adjusting the features of the two domains [32, 61]. 3) Parameter-based domain adaptation, which achieves better results by adjusting the model parameters [7, 99]. With the advancement of technology [38], more and more new fields and new tasks require suitable labels, and the performance of shallow domain adaptation can no longer meet today’s requirements for accuracy. Deep neural networks are widely used in computer vision [1, 2, 47, 69] and natural language processing [39, 44, 76, 83] applications. Deep neural networks have more computing units and more robust non-linear representations [3], which can establish better decision boundaries. Therefore, the idea of combining domain adaptation and deep neural networks was born. Commonly used deep neural network models include convolutional neural networks (CNNs) [3, 18, 22, 29, 40, 43], deep belief networks (DBNs) [23, 36, 56, 57, 58], and stacked autoencoders (SAEs) [30, 90, 107].

In this paper, we analyze and discuss deep DA methods. To summarize, the main contributions are:

  • We divided the deep domain adaptation into several categories based on the label set of the source domain and the target domain.

  • We summarized various methods of Closed-set domain adaptation.

  • We discussed current methods of multi-source domain adaptation.

  • We discussed future research directions, challenges, and possible solutions.

The remainder of this survey is structured as follows. In Section 2, we review the related work. In Section 3, we first define some notations and then categorize deep DA into different settings (given in Fig. 1). In the next two sections, the approaches for each setting are discussed, as detailed in Tables 1 and 5. Finally, the conclusion of this paper and a discussion of future work are presented in Section 6.

Fig. 1 Classification of domain adaptation

Table 1 Classification of Closed-set DA

2 Related work

Over the past few years, there have been many reviews and surveys on transfer learning and domain adaptation. Pan et al. [62] divided transfer learning into three cases: inductive TL, transductive TL, and unsupervised TL, but they only studied homogeneous feature spaces. Patel et al. [64] focused only on domain adaptation. Csurka et al. [21] briefly described shallow domain adaptation methods for each case and categorized deep domain adaptation methods by training loss: classification loss, discrepancy loss, and adversarial loss. However, Csurka et al. only studied deep domain adaptation in visual application scenarios. Wang et al. [93] divided deep domain adaptation methods into single-step and multi-step methods, and further divided single-step methods by training loss into difference-based, adversarial loss-based, and reconstruction-based methods. Four criteria were proposed in difference-based domain adaptation: class criterion, statistic criterion, architecture criterion, and geometric criterion. Adversarial loss-based domain adaptation consists of generative models and non-generative models. Reconstruction-based domain adaptation consists of encoder-decoder reconstruction and adversarial reconstruction. Multi-step domain adaptation methods are divided into three categories based on how intermediate domains are selected and utilized: hand-crafted, instance-based, and representation-based. Sun et al. [82] mainly reviewed theoretical results and well-established algorithms for multi-source domain adaptation problems. Kouw et al. [42] introduced dataset shift in transfer learning and domain adaptation and the treatment of domain transfers. Wang et al. [94] analyzed the work on zero-shot learning existing at that time from three perspectives: semantic spaces, methods, and applications. Semantic spaces consist of engineered semantic spaces and learned semantic spaces. Zero-shot learning methods are divided into classifier-based methods and instance-based methods. Wilson et al. [96] compared unsupervised deep domain adaptation methods by examining alternative methods, their unique and common elements, results, and theoretical insights. Cai et al. [8] gave a comprehensive description of the available RGB-D datasets to guide researchers in choosing the right dataset to evaluate their algorithms. Chu et al. [19] compared domain adaptation techniques for neural machine translation (NMT) with the techniques being studied in statistical machine translation (SMT), which has been the main research area in the last two decades (Tables 2, 3, and 4).

Table 2 Accuracy (%) of different domain adaptation methods on the Office-31 datasets
Table 3 Accuracy (%) of different unsupervised domain adaptation methods on the digits datasets
Table 4 Accuracy (%) of different generator-free adversarial domain adaptation methods on the Office-31 datasets

After reviewing the above literature, we study deep domain adaptation methods for various scenarios. Firstly, there is the consistently studied closed-set DA, which is the base scenario of most algorithms. In recent years, Partial DA, Open set DA, Universal DA, and Zero-shot DA have been proposed to address domain adaptation in various scenarios.

3 Overview

3.1 Notations and definitions

In this section, we introduce some of the symbols and definitions that will be used in this survey. They match those in the survey papers [21, 93, 94] to maintain consistency across surveys. A domain consists of a feature space \(\mathcal {X}\) and a marginal probability distribution P(X), where \(X=\left \{x_{1},\dots ,x_{n} \right \}\in \mathcal {X}\). Given a specific domain \(\mathcal {D}=\left \{ \mathcal {X},P(X)\right \}\), a task \(\mathcal {T}\) consists of a label space \(\mathcal {Y}\) and an objective predictive function f(⋅), which can also be viewed as a conditional probability distribution P(Y |X) from a probabilistic perspective. In general, we can learn P(Y |X) in a supervised manner from the labeled data \(\left \{x_{i},y_{i}\right \}\), where \(x_{i} \in \mathcal {X}\) and \(y_{i} \in \mathcal {Y}\). Let \({{\mathscr{L}}}_{s}\) and \({{\mathscr{L}}}_{t}\) denote the label sets of the source and target domains.

Assume that we have two domains: the training dataset with sufficient labeled data is the source domain \(\mathcal {D}^{s}=\left \{\mathcal {X}^{s},P(X)^{s} \right \}\), and the test dataset, which has a small amount of labeled data, no labeled data, or even no data in the traditional sense, is the target domain \(\mathcal {D}^{t}=\left \{\mathcal {X}^{t},P(X)^{t} \right \}\). Firstly, we consider a target domain in which some labels exist. We mark the labeled part as \({\mathcal {D}}^{tl}\) and the unlabeled part as \(\mathcal {D}^{tu}\), which together form the entire target domain, \(\mathcal {D}^{t} ={\mathcal {D}}^{tl}\cup \mathcal {D}^{tu}\). The task of the source domain is \({\mathcal {T}^{s}}= \left \{\mathcal {Y}^{s}, P(Y^{s}|X^{s}) \right \}\), and that of the target domain is \({\mathcal {T}^{t}}= \left \{\mathcal {Y}^{t}, P(Y^{t}|X^{t}) \right \}\). Similarly, \(P(Y^{s}|X^{s})\) can be learned from the source labeled data \(\left \{{x_{i}^{s}},{y_{i}^{s}}\right \}\), and \(P(Y^{t}|X^{t})\) can be learned from the target labeled data \(\left \{x_{i}^{tl},y_{i}^{tl}\right \}\) and unlabeled data \(\left \{x_{i}^{tu}\right \}\). Secondly, when no traditional samples of the target domain are available, we need to introduce a semantic representation \({\mathrm {a}}_{c}\in \mathbbm {R}^{Q}\) to aid in network training. The commonness between two domains is defined as the Jaccard index of the two label sets (the size of their intersection divided by the size of their union), \( \xi =\frac {|{{\mathscr{L}}}_{s}\cap {{\mathscr{L}}}_{t}|}{|{{\mathscr{L}}}_{s}\cup {{\mathscr{L}}}_{t}|} \).
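As a small illustration of the commonness ξ defined above, the following sketch computes the ratio of shared labels to all labels for two label sets. The function name `commonness` and the example class names are hypothetical and chosen only for this illustration.

```python
def commonness(source_labels, target_labels):
    """Return |Ls ∩ Lt| / |Ls ∪ Lt| for two collections of class labels."""
    ls, lt = set(source_labels), set(target_labels)
    union = ls | lt
    if not union:
        raise ValueError("at least one label set must be non-empty")
    return len(ls & lt) / len(union)


# Hypothetical label sets: 3 shared classes out of 5 distinct classes overall.
xi = commonness({"bike", "mug", "pen", "desk"}, {"bike", "mug", "pen", "lamp"})
print(xi)  # 0.6
```

A value of 1 corresponds to identical label sets, and a value of 0 to label sets with no class in common.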

3.2 Dataset

In this subsection, we introduce some commonly used datasets for DA. Office-31 [70] is relatively small, with 4,652 images in 31 classes. Its three domains, A, D, and W, are collected by downloading from amazon.com (A), by shooting with a DSLR camera (D), and by shooting with a web camera (W). They yield six domain adaptation tasks: A→W, D→W, W→D, A→D, D→A, and W→A. Office-Home [89] is a larger dataset with 4 domains of distinct styles: Artistic, Clip Art, Product, and Real-World. Each domain contains images of 65 object categories. Denoting them as Ar, Cl, Pr, and Rw, we obtain twelve domain adaptation tasks: Ar→Cl, Ar→Pr, Ar→Rw, Cl→Ar, Cl→Pr, Cl→Rw, Pr→Ar, Pr→Cl, Pr→Rw, Rw→Ar, Rw→Cl, and Rw→Pr. VisDA2017 [68] (VD) comprises 12 categories with synthetic (S) and real-world (R) domains. Office-Caltech [33] uses the classes shared by Office-31 and Caltech as the whole dataset.
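The task counts quoted above follow from taking every ordered pair of distinct domains as a (source, target) pair. A minimal sketch, in which the helper name `adaptation_tasks` is hypothetical and `->` stands in for the arrow used in the text:

```python
from itertools import permutations


def adaptation_tasks(domains):
    """List every ordered (source, target) pair of distinct domains."""
    return [f"{src}->{tgt}" for src, tgt in permutations(domains, 2)]


office31 = adaptation_tasks(["A", "D", "W"])
office_home = adaptation_tasks(["Ar", "Cl", "Pr", "Rw"])
print(len(office31), len(office_home))  # 6 12
```

With n domains this gives n(n-1) tasks, i.e., 6 for Office-31 and 12 for Office-Home.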

3.3 Different scenarios for domain adaptation

The case for traditional machine learning is \(\mathcal {D}^{s}=\mathcal {D}^{t}\) and \(\mathcal {T}^{s}=\mathcal {T}^{t}\). As for transfer learning, Pan et al. [61] divide data set divergence into divergence in the domain itself and divergence brought about by the task. The former is generally caused by distribution shifts or feature space divergence, while the latter is caused by a divergence in the conditional distribution or label space. Based on these two types of divergence, Pan et al. classify transfer learning into three categories: inductive, transductive, and unsupervised transfer learning. In Pan et al.’s classification, domain adaptation falls into transductive transfer learning. It is characterized by the same task \(\mathcal {T}^{s}=\mathcal {T}^{t}\) but there is a domain divergence \(\mathcal {D}^{s} \ne \mathcal {D}^{t}\).

First, according to the number of source domains, domain adaptation can be classified into single-source domain adaptation and multi-source domain adaptation. Secondly, it is classified into homogeneous domain adaptation and heterogeneous domain adaptation according to the divergence of domains. Under the setting of homogeneous domain adaptation, the feature spaces of the target domain and the source domain are almost the same, \((\mathcal {X}^{s} =\mathcal {X}^{t})\) and \((d^{s} = d^{t})\), and the main difference lies in the marginal distributions of the two domains, \((P(X)^{s} \ne P(X)^{t})\). Under the heterogeneous domain adaptation setting, however, there is a big difference between the feature spaces of the target domain and the source domain, \((\mathcal {X}^{s} \ne \mathcal {X}^{t})\) or \((d^{s} \ne d^{t})\). In this paper, we do not use the presence or absence of supervision as a classification criterion. We classify source and target domains according to their label sets. The classification of single-source domain adaptation based on label sets is shown in Fig. 2.

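Fig. 2 The classification of single-source domain adaptation: closed-set DA \(({{\mathscr{L}}}_{s} = {{\mathscr{L}}}_{t})\), Partial DA \(({{\mathscr{L}}}_{t} \subset {{\mathscr{L}}}_{s})\), Open set DA \(({{\mathscr{L}}}_{s} \subset {{\mathscr{L}}}_{t})\), and Open-Partial DA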

4 Single-source domain adaptation

4.1 Homogeneous domain adaptation

The first consideration is single-source domain adaptation, i.e., learning a model from a labeled source domain and then generalizing it to other different but related target domains. The feature spaces of the target and source domains are essentially the same. The label sets of the target and source domains are also consistent. We refer to the domain adaptation in this setting as closed-set domain adaptation. Most of the current methods fall into three main categories: Discrepancy-Based methods, Adversarial-Based methods, and Reconstruction-Based methods. The mentioned network properties are listed in Tables 2, 3, and 4.

Closed-set domain adaptation

In the Closed-set DA setting, it is supposed that the source and target domains only contain images of the same set of object classes. Neither domain includes images of unknown classes or of classes that do not exist in the other domain, and the images should be of the same type. The first thing that comes to mind is to align the source and target domains, then reduce the classification loss, and finally fine-tune [14, 100] the classifier for the target domain. However, directly fine-tuning the parameters of a deep network is very problematic.

1) Discrepancy-Based methods: In the past, many network structures have been proposed to solve the classification task, such as LeNet-5 [46], AlexNet [43], and VGG [79]. Due to the domain shift between the two domains, these network models’ accuracy will be significantly reduced. It is particularly true when the model has been trained in a domain and then used directly in the new domain.

Gretton et al. proposed Maximum Mean Discrepancy (MMD) to measure the discrepancy between two different domains. MMD is essentially the supremum of the difference between the expectations of the two data distributions after they are mapped by a function ϕ. MMD is a very effective way to measure the distance between two distributions. Given two distributions s and t, the MMD is defined as follows,

$$ \begin{array}{@{}rcl@{}} MMD^{2}(s,t)=\sup_{{\Vert \phi\Vert}_{\mathcal{H}} \le 1}\Vert E_{\text{x}^{\text{s}}\sim s}[\phi(\mathrm{x}^{s})]-E_{\text{x}^{\text{t}} \sim t}[\phi(\mathrm{x}^{t})]\Vert_{\mathcal{H}}^{2}, \end{array} $$
(1)

where ϕ represents the feature map, induced by the kernel function, that maps the original data to a reproducing kernel Hilbert space (RKHS) \(\mathcal {H}\), and \(\Vert \phi \Vert _{\mathcal {H}} \le 1\) defines a set of functions in the unit ball of the RKHS.
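In practice, the squared MMD of Eq. (1) is usually estimated from finite samples through a kernel, without evaluating ϕ explicitly. The following is a minimal sketch of the biased empirical estimator, assuming a Gaussian (RBF) kernel with a hand-picked bandwidth parameter `gamma`; the function names, the kernel choice, and the toy data are assumptions made for illustration and are not prescribed by the text above.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        - 2.0 * a @ b.T
        + np.sum(b**2, axis=1)[None, :]
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2(xs, xt, gamma=1.0):
    """Biased empirical estimate of the squared MMD between two samples."""
    return (
        rbf_kernel(xs, xs, gamma).mean()
        - 2.0 * rbf_kernel(xs, xt, gamma).mean()
        + rbf_kernel(xt, xt, gamma).mean()
    )


# Toy data (assumed): two samples from the same Gaussian and one shifted sample.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 2))
same = rng.normal(0.0, 1.0, size=(200, 2))
shifted = rng.normal(2.0, 1.0, size=(200, 2))
print(mmd2(source, same), mmd2(source, shifted))
```

The estimate stays close to zero for two samples drawn from the same distribution and grows when the target sample is shifted, which is the property that MMD-based adaptation losses rely on.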

Based on MMD, Tzeng et al. [87] proposed a new network structure called deep domain confusion (DDC) to solve the DA problem. An adaptation layer is added between the feature layers of the shared weight network. This layer takes MMD between the source and target domain features as a loss and reduces the discrepancy between the source and target domains by minimizing MMD. The MMD also determines the location of the adaptation layer. According to [100], the adaptation layer is more useful at higher layers of features because lower layer features are usually general features that do not carry a higher level of discrimination. Therefore, the adaptation layer of the DDC is placed after fc7. The network architecture of DDC is shown in Fig. 3.

Fig. 3 The architecture of DDC, using unsupervised domain adaptation as an example. Labeled source samples are fed into the network on the left, and the network on the right shares its weights with the left one. The source and target domains are aligned by minimizing the classification loss and the MMD distance. Thus, the classifier on the source domain can also be applied to the target domain

Based on DDC, Long et al. [52] proposed the Deep Adaptive Network (DAN), which differs from DDC in two main ways. 1) Only one layer is adapted in DDC, whereas multiple layers are adapted in DAN. 2) Only a single kernel function is used in DDC, whereas a weighted combination of multiple kernels is used in DAN. DAN improves the performance through multilayer adaptation and Multi-Kernel Maximum Mean Discrepancy (MK-MMD) [34]. In 2016, Long et al. proposed the Residual Transfer Network (RTN) by imitating residual networks. The classifier layer of RTN connects the source classifier and the target classifier end-to-end. However, the above models assume that the conditional distributions in the two domains are consistent. In real-world scenarios, this assumption is too strong. For this reason, further research by Long et al. [53] proposed the Joint Adaptation Network (JAN), which aligns the joint distributions of the source and target domains using the classification loss and the Joint Maximum Mean Discrepancy (JMMD) as the loss function.

Because of the enormous computational effort required to compute the MK-MMD, Sun et al. [81] utilized CORAL loss to measure the distance between the two domains. Moreover, it can be seamlessly integrated into different layers or architectures. CORAL loss is defined as the distance between the second-order statistics (covariance) of the source and target domain features.

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{CORAL}=\frac{1}{4d^{2}}\Vert C_{S}-C_{T}{\Vert_{F}^{2}}, \end{array} $$
(2)

where \(\Vert \cdot {\Vert _{F}^{2}}\) denotes the squared matrix Frobenius norm, d is the feature dimension, and \(C_{S}\) and \(C_{T}\) denote the covariance matrices of the source and target data, respectively. The goal of domain adaptation is achieved by optimizing the classification loss and the Correlation Alignment (CORAL) loss simultaneously. Zellinger et al. [102] proposed the Central Moment Discrepancy (CMD) based on MMD and KL divergence. CMD consists of a vector of empirical expectations and a vector of k-th order sample central moments. In simple words, if the probability distributions of the source and target samples are similar, then their central moments of each order are also similar. The more similar the sample probability distributions are, the smaller the value of CMD. CMD contains higher-order moment information than KL divergence and reduces the computational effort compared to MMD because there is no need to compute the kernel matrix. Unlike CMD, which matches higher-order central moments, Higher-order Moment Matching (HoMM) [16] matches higher-order cumulant tensors, because a higher-order moment tensor contains more information and represents feature distributions better. HoMM can match moment tensors of arbitrary order, with first-order and second-order HoMM being equivalent to MMD and CORAL, respectively. Third- and fourth-order moment tensor matching helps achieve global alignment, as higher-order statistics can be adapted to more complex non-Gaussian distributions. The final objective function of HoMM is as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}=\mathcal{L}_{s}+{\lambda_{d}\mathcal{L}_{d}}+{\lambda_{dc}\mathcal{L}_{dc}}, \end{array} $$
(3)

where \({\mathscr{L}}_{s}\) is the classification loss in the source domain, \({\mathscr{L}}_{d}\) is the domain discrepancy loss measured by the higher-order moment matching, and \({\mathscr{L}}_{dc}\) denotes the discriminative clustering loss. Note that, to obtain reliable pseudo-labels for clustering, \(\lambda _{dc}\) is set to 0 in the initial iterations, and the clustering loss is enabled (by giving \(\lambda _{dc}\) a nonzero value) only after the total loss has stabilized. The domain discrepancy loss can be given as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{d}=\frac{1}{b^{2}}\sum\limits_{i=1}^{b}\sum\limits_{j=1}^{b}k({\pmb h}_{sp}^{i},{\pmb h}_{sp}^{j})-\frac{2}{b^{2}}\sum\limits_{i=1}^{b}\sum\limits_{j=1}^{b}k({\pmb h}_{sp}^{i},{\pmb h}_{tp}^{j})+\frac{1}{b^{2}}\sum\limits_{i=1}^{b}\sum\limits_{j=1}^{b}k({\pmb h}_{tp}^{i},{\pmb h}_{tp}^{j}), \end{array} $$
(4)

where b is the batch size, \(k(\pmb {x,y})=\exp (-\gamma \Vert {\pmb {x-y}}\Vert _{2})\) is the RBF kernel function, and \({\pmb h}_{sp}^{i}\) denotes a randomly sampled value in the p-level tensor. The discriminative clustering loss can be given as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{dc}=\frac{1}{n_{t}}\sum\limits_{i=1}^{n_{t}}\Vert {\pmb h}_{t}^{i}-{\pmb c}_{\hat {y_{t}^{i}}}{\Vert_{2}^{2}}, \end{array} $$
(5)

where \({\hat {y_{t}^{i}}}\) is the pseudo-label assigned to \({x_{t}^{i}}\), and \({\pmb c}_{\hat {y_{t}^{i}}}\in {{\mathbbm R}^{L}}\) denotes its estimated class center. HoMM uses group moment matching and random sampling matching to perform compact tensor matching. Li et al. [48] introduced the attention mechanism in domain adaptation. This mechanism can model the independence between source and target convolution channels. Furthermore, it facilitates the alignment of cross-domain features.
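To make two of the discrepancy-based losses above concrete, the following sketch evaluates the CORAL loss of Eq. (2) and the discriminative clustering loss of Eq. (5) on plain feature arrays. It is a minimal NumPy illustration under stated assumptions: the function names are hypothetical, the covariances are computed with the standard unbiased sample estimator of `np.cov`, and the pseudo-labels and class centers are taken as given rather than estimated during training.

```python
import numpy as np


def coral_loss(hs, ht):
    """Eq. (2): squared Frobenius distance between feature covariances."""
    d = hs.shape[1]                      # feature dimension
    cs = np.cov(hs, rowvar=False)        # source covariance, shape (d, d)
    ct = np.cov(ht, rowvar=False)        # target covariance, shape (d, d)
    return float(np.sum((cs - ct) ** 2) / (4.0 * d**2))


def clustering_loss(ht, pseudo_labels, centers):
    """Eq. (5): mean squared distance of target features to their class centers."""
    diffs = ht - centers[pseudo_labels]  # pick the center of each pseudo-label
    return float(np.mean(np.sum(diffs**2, axis=1)))


# Toy features (assumed): the target features have a larger spread.
rng = np.random.default_rng(0)
hs = rng.normal(0.0, 1.0, size=(128, 4))
ht = rng.normal(0.0, 2.0, size=(128, 4))
print(coral_loss(hs, hs), coral_loss(hs, ht))

centers = np.array([[0.0, 0.0], [3.0, 3.0]])
labels = np.array([0, 1, 1])
feats = np.array([[0.0, 1.0], [3.0, 3.0], [2.0, 3.0]])
print(clustering_loss(feats, labels, centers))
```

In a training loop these terms would be weighted and added to the source classification loss, as in Eq. (3); that outer loop is omitted here.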

2) Adversarial-Based methods: Unlike the previous Discrepancy-Based methods, the basic idea of the adversarial approach is a minimax game. The game ordinarily takes place between the domain discriminator and the feature extractor. The domain discriminator identifies whether an instance comes from the target domain. The purpose of the feature extractor is to extract features that can fool the domain discriminator. Training alternates between the domain discriminator and the feature extractor until the whole model converges.

Adversarial domain adaptation networks with generators generally synthesize target-like data from the source data with labels (or pseudo-labels) while keeping the labels (or pseudo-labels). The synthesized target data is then used to train the network. Unsupervised pixel-level domain adaptation (PixelDA) [5] employed a pixel-space cross-domain transformation to achieve domain adaptation. Unlike classical GANs, the input to PixelDA contains not only noise vectors but also source images. An almost infinite amount of training data can be synthesized using the noise vector and the source image. The PixelDA model maps the source domain image to the target domain image at the pixel level. It allows the architecture of a particular task to be changed without having to retrain the domain adaptation component. However, the downside of PixelDA is that it can only deal with low-level differences between the source domain and the target domain, mainly noise, resolution, lighting, and color. Changes of object type and geometric changes are difficult to deal with. Rather than using GANs as a data augmentation step as before, Sankaranarayanan et al. [75] utilized GANs to obtain rich gradient information that bridges the gap between the source and target domains. The joint adversarial-discriminative approach transfers the information of the target distribution to the learned embedding using a generator-discriminator pair (Fig. 4).

Fig. 4 The main components of the GTA network. During the training phase, the pipeline consists of two parallel streams: 1) Stream 1 is updated using the supervised classification loss; 2) Stream 2 makes the images from the target and source domains more similar via a GAN. In the testing phase, Stream 2 is removed and classification is performed using the F-C pair

Saito et al. [73] proposed a novel adversarial alignment technique to avoid misclassification of samples near the decision boundary. The model is composed of a feature extractor G and a classifier C. Different from previous work, this classifier also acts as a discriminator. C classifies an input x into one of K classes, assigning class j the probability \(p(y=j|x)=\frac {\exp (l_{j})}{{\sum }_{k=1}^{K} \exp (l_{k})}\), where \(l_{j}\) is the j-th output logit. In adversarial training on mixed-domain samples, the discriminator can detect the instances close to the boundary. The feature extractor pushes these samples away from the boundary to generate discriminative domain-invariant features. The goal of Adversarial Dropout Regularization (ADR) is to learn G and C by solving the optimization problem:

$$ \begin{array}{@{}rcl@{}} \min \limits_{G,C} L(X_{s},Y_{s})&=&-\mathbbm E_{(x_{s},y_{s})\sim(X_{s},Y_{s})} \sum\limits_{k=1}^{K} {\mathbbm 1}_{[k=y_{s}]}\log C(G(x_{s}))_{k}, \end{array} $$
(6)
$$ \begin{array}{@{}rcl@{}} &&\max \limits_{G} \min \limits_{C} L(X_{s},Y_{s})-L_{adv}(X_{t}), \end{array} $$
(7)
$$ \begin{array}{@{}rcl@{}} L_{adv}(X_{t})&=&\mathbbm E_{x_{t} \sim X_{t}}[d(C_{1}(G(x_{t})),C_{2}(G(x_{t})))], \end{array} $$
(8)
$$ \begin{array}{@{}rcl@{}} d(p_{1},p_{2})&=& \frac{1}{2}(D_{kl}(p_{1}|p_{2})+D_{kl}(p_{2}|p_{1})), \end{array} $$
(9)

where \(L(X_{s},Y_{s})\) is the standard classification loss, \(L_{adv}(X_{t})\) is the discrepancy loss between the classifiers \(C_{1}\) and \(C_{2}\), \(d(p_{1},p_{2})\) measures the difference between the predictions \(p_{1}\) and \(p_{2}\), and \(D_{kl}\) is the KL divergence. Inspired by VAE-GAN [45], Xu et al. [97] proposed Adversarial Domain Adaptation with Domain Mixup (DM-ADA), which makes the source and target domains consistently distributed through a VAE and discriminators. The pixel-level and feature-level domain mixup and well-designed soft domain labels improve the generalization capability. The classifier is optimized with the cross-entropy loss. Namely, the source domain classifier loss is \({\mathscr{L}}_{C}=-\mathbf {E}_{x^{s} \sim P_{s}} {\sum }_{i=1}^{K} {y_{i}^{s}} \log \left (C\left (\left [ \cdot \right ]\right )\right )\), where K is the number of classes. Chen et al. [17] combined domain adversarial learning with self-learning to propose the Adversarial-Learned Loss for Domain Adaptation (ALDA). A confusion matrix is used to eliminate (or reduce) the effect of noise in the pseudo-labels. In contrast to ordinary domain adversarial learning, this adversarial loss incorporates classifier predictions and label information into the optimization. In this way, the model enables level-by-level feature alignment, and the noise-corrected loss can align the features between the source and target domains. According to the theory of [4], the expected error on the target samples can be bounded by the expected error in the source domain and the difference in features between the domains. Therefore, the expected target error of ALDA is theoretically bounded.
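The discrepancy terms of Eqs. (8) and (9) can be sketched numerically as follows. The code takes the logits of two classifiers on the same batch of target features, converts them to class probabilities with the softmax given earlier, and averages the symmetric KL divergence. The function names, the clipping constant `eps`, and the toy logits are assumptions for illustration; how \(C_{1}\) and \(C_{2}\) are obtained is outside the scope of the sketch.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def kl(p, q, eps=1e-12):
    """KL divergence D_kl(p | q) for each row of class probabilities."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)


def discrepancy(p1, p2):
    """Eq. (9): symmetric KL divergence between two predictions."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))


def adversarial_loss(logits1, logits2):
    """Eq. (8): mean discrepancy over a batch of target samples."""
    return float(np.mean(discrepancy(softmax(logits1), softmax(logits2))))


# Toy logits (assumed) of two classifiers on three target samples, K = 3.
l1 = np.array([[2.0, 0.5, 0.1], [0.2, 0.2, 0.2], [1.0, 3.0, 0.0]])
l2 = np.array([[1.5, 0.7, 0.1], [0.2, 0.2, 0.2], [3.0, 1.0, 0.0]])
print(adversarial_loss(l1, l1), adversarial_loss(l1, l2))
```

The loss is zero when both classifiers agree on every sample and grows as their predictions diverge, which is what lets it flag target samples near the decision boundary.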

The key to adversarial domain adaptation networks without generators is to learn domain-invariant representations from the source and target samples. These representations are used to deceive the classifier (discriminator), and domain confusion losses are introduced to improve the performance of the network. Domain Adaptive Neural Network (DANN) was first proposed at the 2014 Pacific Rim AI Conference. Ganin et al. [28] formally proposed DANN to address the unsupervised domain adaptation (UDA) problem. DANN can be easily implemented using standard deep learning frameworks. DANN comprises the feature extractor Gf(⋅;𝜃f), the label predictor Gy(⋅;𝜃y), the domain classifier Gd(⋅;𝜃d), and the gradient reversal layer (GRL). It uses the GRL during backpropagation so that the distributions of the source and target domains become consistent. The optimization goal of the entire network has two components, minimizing the source domain classification error and maximizing the domain classification error, with λ introduced as a trade-off parameter between them. Generally speaking, aligning the source domain and the target domain means mapping the target domain to the source domain and then classifying the target domain with a classifier trained on the source domain. However, Adversarial Discriminative Domain Adaptation (ADDA) [86] maps both the source domain and the target domain to a shared space, reduces the distance between them, and then uses a trained classifier to classify the mapped target domain. ADDA minimizes the distance between the source and target representations by iteratively minimizing the following functions, which are most similar to the original GAN:

$$ \begin{array}{@{}rcl@{}} \min \limits_{M^{s},C}\mathcal{L}_{cls}(X^{s},Y^{s})=-\mathbb{E}_{(x^{s},y^{s})\sim(X^{s},Y^{s})} \sum \limits_{k=1}^{K}{\mathbbm 1}_{[k=y^{s}]}\log C(M^{s}(x^{s})), \end{array} $$
(10)
$$ \begin{array}{@{}rcl@{}} \min \limits_{D}\mathcal{L}_{advD}(X^{s},X^{t},M^{s},M^{t})= -\mathbb{E}_{(x^{s})\sim(X^{s})}[\log D(M^{s}(x^{s}))]\\ -\mathbb{E}_{(x^{t})\sim(X^{t})}[\log (1-D(M^{t}(x^{t})))], \end{array} $$
(11)
$$ \begin{array}{@{}rcl@{}} \min \limits_{M^{s},M^{t}}\mathcal{L}_{advM}(M^{s},M^{t})=-\mathbb{E}_{(x^{t})\sim(X^{t})}[\log D(M^{t}(x^{t}))], \end{array} $$
(12)

where the mappings \(M^{s}\) and \(M^{t}\) are learned from the source data \(X^{s}\) and target data \(X^{t}\), and C represents a classifier working on the source domain. \({\mathscr{L}}_{cls}\) is optimized by training the source model using the labeled source data. \({\mathscr{L}}_{advD}\) is minimized to train the discriminator, while \({\mathscr{L}}_{advM}\) is minimized to learn a representation that is domain invariant. After ADDA, Multi-Adversarial Domain Adaptation (MADA) [65] utilizes multiple class-wise domain discriminators, and this change may bring three benefits. First, it avoids rigidly assigning each point to only one domain discriminator, similar to using soft labels to increase information. Second, it avoids negative transfer because each point is only aligned with the most relevant classes, and irrelevant classes are filtered out. Third, by weighting, these domain discriminators with different parameters facilitate the positive transfer of each instance. This structure also has an obvious shortcoming: a discriminator must be specified for each class, so the computational cost is quite high. The structure of MADA is shown in Fig. 5. To make the classification on the target domain more precise, the DADA [85] proposed by Tang et al. makes the joint distribution alignment between the two domains more explicit. They proposed a target loss based on the design of an integrated classifier by using conditional category probability weighted domain prediction. The entropy minimization principle was used for the regularization term. Inspired by Wasserstein GAN, Shen et al. [77] proposed a novel approach to learn domain-invariant feature representations, namely Wasserstein Distance Guided Representation Learning (WDGRL). WDGRL utilizes neural networks to estimate the empirical Wasserstein distance between the source and target samples and optimizes the feature extractor network to minimize the estimated Wasserstein distance.
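Returning to the ADDA objectives, the two adversarial losses of Eqs. (11) and (12) can be evaluated directly from the discriminator outputs. The sketch below assumes that the discriminator D returns, for each mapped sample, a probability in (0, 1) of coming from the source domain; the function names, the small constant `eps`, and the toy outputs are illustrative assumptions, and the networks themselves are omitted.

```python
import numpy as np


def discriminator_loss(d_src, d_tgt, eps=1e-12):
    """Eq. (11): D should score source features as 1 and target features as 0."""
    return float(-np.mean(np.log(d_src + eps)) - np.mean(np.log(1.0 - d_tgt + eps)))


def mapping_loss(d_tgt, eps=1e-12):
    """Eq. (12): the target mapping is trained so that D scores its output as source."""
    return float(-np.mean(np.log(d_tgt + eps)))


# Toy discriminator outputs (assumed) on mapped source and target samples.
d_src = np.array([0.9, 0.8, 0.95])
d_tgt = np.array([0.2, 0.1, 0.3])
print(discriminator_loss(d_src, d_tgt), mapping_loss(d_tgt))
```

When the discriminator separates the domains well, its own loss is small while the mapping loss is large. Training alternates between the two until the discriminator can no longer tell the mapped domains apart.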

Fig. 5 The architecture of the Multi-Adversarial Domain Adaptation (MADA) approach, where f is the extracted deep features, \(\hat y\) is the predicted data label, and \(\hat d\) is the predicted domain label; Gf is the feature extractor, Gy and Ly are the label predictor and its loss, \({G^{K}_{d}}\) and \({L^{K}_{d}}\) are the domain discriminator and its loss; GRL stands for Gradient Reversal Layer. The blue part shows the multiple adversarial networks (one for each class, K in total). The network uses each discriminator to determine which domain a sample belongs to. After the discriminators are trained, the sample is classified by the classifier. Best viewed in color

Fang et al. [25] introduced a perturbation function in the label classifier to simulate changes in the distribution of labels in different domains, and inserted a ResNet to learn the perturbation function. By learning the perturbation function, the label classifier becomes more robust and accurate. In addition, the joint distribution of image features and class labels is used to align the source and target domains to obtain a more robust and discriminative feature representation. An intuitive illustration is shown in Fig. 6, where the target classifier with the perturbation function correctly classifies image samples from the target domain.

Fig. 6

A well-trained source classifier may fail to classify images in the target domain correctly. By adding a perturbation function, the target classifier corrects the mistakes made by the source classifier on the target domain. Here the red dots and green triangles denote image samples from the target domain. Best viewed in color

3) Reconstruction-Based Approaches: The goal of the reconstruction-based approach is to extract domain invariant representations. The Deep Reconstruction Classification Network (DRCN) proposed by Ghifary et al. [31] learns a shared encoding representation. The DRCN is a CNN architecture that combines two pipelines with a shared encoder, which can be considered a feature extractor. The first pipeline performs supervised classification on the source domain, while the second pipeline performs unsupervised reconstruction on the target domain. Domain separation networks (DSNs) [6] model the private and shared components of the domain representations and use a scale-invariant mean squared error reconstruction loss.

4.2 Heterogeneous domain adaptation

In a heterogeneous domain adaptation scenario, the label sets of the source domain and the target domain can be related in several ways. When domain adaptation is applied to a real-world scenario, it is more likely to encounter a situation where the source domain and the target domain have large differences. We divide heterogeneous domain adaptation into three categories based on the categories shared by the source domain and the target domain. When the categories of the target domain are included in the source domain (\(L_{t} \subset L_{s}\)), the setting is Partial DA. When the categories of the source domain are included in the target domain (\(L_{s} \subset L_{t}\)), the setting is Open set DA. When only part of the source and target categories are shared, the setting is Open-Partial DA. Open-Partial DA is not studied separately here; it is integrated into Universal DA, and Zero-shot DA is added as a special case (Table 5).

Table 5 Classification of heterogeneous DA

Partial domain adaptation

With the advent of the Big Data era, the label set of the target domain is usually included in that of the source domain. Previous approaches designed for homogeneous domain adaptation perform poorly in this setting, while the Partial DA proposed by Long et al. applies to this more general case, in which the label set of the target domain is contained in that of the source domain (i.e., the target domain label space is just a subspace of the source domain label space). In this case, there is an obvious problem: labels (or samples) that exist only in the source domain result in negative transfer.

Fig. 7

f is the extracted deep features, \(\hat y\) is the predicted data label, and \(\hat d\) is the predicted domain label; Gf is the feature extractor, Gy and Ly are the label predictor and its loss, \({G_{d}^{k}}\) and \({L_{d}^{k}}\) are the domain discriminator and its loss; GRL stands for Gradient Reversal Layer. The blue part shows the class-wise adversarial networks (\(\vert {\mathcal {C}_{s}}\vert \) in total). Best viewed in color

Cao et al. [11] proposed Selective Adversarial Networks (SAN) to address partial transfer learning from large-scale domains to small-scale domains. The architecture of SAN is shown in Fig. 7. The final objective of SAN is,

$$ \begin{array}{@{}rcl@{}} C(\theta_{f},\theta_{y},{\theta_{d}^{k}}|_{k=1}^{\vert \mathcal{C}_{s}\vert}) = \frac{1}{n_{s}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{s}}L_{y}(G_{y}(G_{f}(\mathrm{x}_{i})),y_{i}) +\frac{1}{n_{t}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{t}}H(G_{y}(G_{f}(\mathrm{x}_{i}))) \\-\frac{\lambda}{n_{s}+n_{t}}\sum\limits_{k=1}^{\vert \mathcal{C}_{s}\vert}[(\frac{1}{n_{t}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{t}}\hat {y_{i}^{k}}) \times (\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{s} \cup \mathcal{D}_{t}}\hat {y_{i}^{k}}{L_{d}^{k}}({G_{d}^{k}}(G_{f}(\mathrm{x}_{i})),d_{i}))], \end{array} $$
(13)

where λ is a hyper-parameter that trades off the two objectives in the unified optimization problem. H(⋅) is the conditional-entropy loss functional, \(H(G_{y}(G_{f}(\mathrm {x}_{i}))) = -{\sum }_{k=1}^{\vert \mathcal {C}_{s}\vert } \hat {y_{i}^{k}}\log \hat {y_{i}^{k}}\). SAN down-weights the domain discriminators responsible for the outlier source classes as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{d} = \frac{1}{n_{s}+n_{t}}\sum\limits_{k=1}^{\vert \mathcal{C}_{s}\vert}[(\frac{1}{n_{t}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{t}}\hat {y_{i}^{k}}) \times (\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{s} \cup \mathcal{D}_{t}}\hat {y_{i}^{k}}{L_{d}^{k}}({G_{d}^{k}}(G_{f}(\mathrm{x}_{i})),d_{i}))]. \end{array} $$
(14)
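The weighting in (14) can be sketched as follows, assuming the class probabilities \(\hat {y_{i}^{k}}\) and the per-class discriminator losses \({L_{d}^{k}}\) have already been computed for every sample. The function name `san_discriminator_loss` and its list-based inputs are hypothetical and serve only to illustrate how the class-level weight (the average target probability of class k) multiplies the instance-level weighted loss.

```python
def san_discriminator_loss(src_probs, tgt_probs, src_dloss, tgt_dloss):
    """Weighted discriminator loss of Eq. (14).

    *_probs[i][k] is the predicted probability of class k for sample i.
    *_dloss[i][k] is the loss of the k-th domain discriminator on sample i.
    """
    n_s, n_t = len(src_probs), len(tgt_probs)
    num_classes = len(tgt_probs[0])
    all_probs = src_probs + tgt_probs
    all_dloss = src_dloss + tgt_dloss
    total = 0.0
    for k in range(num_classes):
        # Class-level weight: average target probability of class k.
        class_weight = sum(p[k] for p in tgt_probs) / n_t
        # Instance-level weighting of the k-th discriminator loss.
        weighted_loss = sum(p[k] * l[k] for p, l in zip(all_probs, all_dloss))
        total += class_weight * weighted_loss
    return total / (n_s + n_t)


# A source sample from a class that never appears in the target contributes nothing.
shared = san_discriminator_loss([[1.0, 0.0]], [[1.0, 0.0]], [[1.0, 1.0]], [[1.0, 1.0]])
outlier = san_discriminator_loss([[0.0, 1.0]], [[1.0, 0.0]], [[1.0, 1.0]], [[1.0, 1.0]])
```

In the second call the source sample is predicted as class 1, which receives zero weight from the target predictions, so only the target sample contributes to the loss.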

The optimization problem is to find the network parameters \(\hat {\theta }_{f}\), \(\hat {\theta }_{y}\) and \(\hat {\theta _{d}^{k}}(k=1,2,\dots ,\vert \mathcal {C}_{s}\vert )\) that satisfy the following functions,

$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{f},\hat{\theta}_{y}) = \arg \min_{\theta_{f},\theta_{y}} C(\theta_{f},\theta_{y},{\theta_{d}^{k}}|_{k=1}^{\vert \mathcal{C}_{s}\vert}), \end{array} $$
(15)
$$ \begin{array}{@{}rcl@{}} (\hat {\theta_{d}^{1}},\dots,\hat{\theta}_{d}^{\vert \mathcal{C}_{s}\vert})= \arg \max_{{\theta_{d}^{1}},\dots,\theta_{d}^{\vert \mathcal{ C}_{s}\vert}}C(\theta_{f},\theta_{y},{\theta_{d}^{k}}|_{k=1}^{\vert \mathcal{C}_{s}\vert}), \end{array} $$
(16)

SAN reduces the negative transfer caused by categories that do not belong to the target domain by weighting both instances and categories. SAN is a preliminary solution to Partial DA: it circumvents negative transfer by filtering out the outlier source classes \({\mathcal {C}_{s}\backslash \mathcal {C}_{t}}\) and, at the same time, promotes positive transfer by maximally matching the data distributions \({p_{\mathcal {C}_{t}}}\) and q in the shared label space \(\mathcal {C}_{t}\). Modifying SAN, Cao et al. [12] proposed Partial Adversarial Domain Adaptation (PADA). The architecture of PADA is shown in Fig. 8, and its objective is,

$$ \begin{array}{@{}rcl@{}} C(\theta_{f},\theta_{y},\theta_{d})=\frac{1}{n_{s}}\sum\limits_{\mathrm{x}_{i}\in \mathcal{D}_{s}}\gamma_{y_{i}}L_{y}(G_{y}(G_{f}(\mathrm{x}_{i})),y_{i}) \\-\frac {\lambda}{n_{s}}\sum\limits_{\mathrm{x}_{i}\in \mathcal{D}_{s}}\gamma_{y_{i}}L_{d}(G_{d}(G_{f}(\mathrm{x}_{i})),d_{i}) \\-\frac {\lambda}{n_{t}}\sum\limits_{\mathrm{x}_{i}\in \mathcal{D}_{t}}L_{d}(G_{d}(G_{f}(\mathrm{x}_{i})),d_{i}), \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} \gamma = \frac{1}{n_{t}}\sum\limits_{i=1}^{n_{t}} \hat{\textup{y}}_{i}, \end{array} $$
(18)

where γ is a \(\vert \mathcal {C}_{s}\vert \)-dimensional weight vector quantifying the contribution of each source class, yi is the ground truth label of source point xi while γyi is the corresponding class weight, and λ is a hyper-parameter that trades off the source label classifier and the partial adversarial domain discriminator in the optimization problem. PADA averages the label predictions over all target data to reduce the effect of errors in individual predictions.
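A minimal sketch of the class weights in (18) and their use in the first term of (17) is given below. The helper names `pada_class_weights` and `weighted_source_loss` are hypothetical, and the per-sample source losses are assumed to be precomputed.

```python
def pada_class_weights(tgt_probs):
    """Eq. (18): average the label predictions over all target samples."""
    n_t = len(tgt_probs)
    num_classes = len(tgt_probs[0])
    return [sum(p[k] for p in tgt_probs) / n_t for k in range(num_classes)]


def weighted_source_loss(src_losses, src_labels, gamma):
    """First term of Eq. (17): each source loss is scaled by the weight of its class."""
    weighted = sum(gamma[y] * loss for loss, y in zip(src_losses, src_labels))
    return weighted / len(src_losses)


# Class 2 never appears in the target predictions, so it receives zero weight.
gamma = pada_class_weights([[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]])
loss = weighted_source_loss([1.0, 2.0, 4.0], [0, 1, 2], gamma)
```

Source classes that are rarely predicted on the target data receive small weights, so their samples contribute little to the classification loss.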

Fig. 8

Overview of the architecture of PADA, where f is the extracted deep features, \(\hat y\) is the predicted data label, and \(\hat d\) is the predicted domain label by softmax probability; Gf is the feature extractor, Gy and Ly are the label predictor and its loss, \({G_{d}^{k}}\) and \({L_{d}^{k}}\) are the domain discriminator and its loss, respectively, and γ is the class weights averaged over the label predictions of target data. Best viewed in color

Fig. 9

Fs and Ft are feature extractors for the source and target domains, respectively. The parameters of Fs are pre-learned and are not updated during training. D is the domain classifier that produces the importance weights w of the source samples and does not participate in the minimax game. D0 is another domain classifier that plays the minimax game using the weighted source domain samples and the target samples. GRL stands for Gradient Reversal Layer. Best viewed in color

Building on DANN, Zhang et al. [105] proposed a two-domain-classifier strategy named Importance Weighted Adversarial Nets (IWAN) to solve partial DA. The network consists of two feature extractors, Fs and Ft, and two domain classifiers, D and D0. The network architecture is shown in Fig. 9, where the green parts are the feature extractors for the source and target domains. The overall objective of the weighted adversarial nets-based method is,

$$ \begin{array}{@{}rcl@{}} \min_{F_{s},C}\mathcal{L}_{s}(F_{s},C)= -\mathbb E_{\mathrm{x},y\sim p_{s}(\mathrm{x},y)}\sum\limits_{k=1}^{K} \mathbbm 1_{[k=y]}\log C(F_{s}(\mathrm{x})), \end{array} $$
(19)
$$ \begin{array}{@{}rcl@{}} \min_{D}\mathcal{L}_{D}(D,F_{s},F_{t})=-(\mathbb E_{\mathrm{x}\sim p_{s}(\text{x})}[\log D(F_{s}(\mathrm{x}))]\\ +\mathbb E_{\mathrm{x}\sim p_{t}(\text{x})}[\log (1-D(F_{t}(\mathrm{x})))]),\\ \min_{F_{t}}\max_{D_{0}}\mathcal{L}_{w}(C,D_{0},F_{s},F_{t})=\gamma \mathbb E_{\mathrm{x}\sim p_{t}(\mathrm{x})}H(C(F_{t}(\mathrm{x})))\\ +\lambda (\mathbb E_{\mathrm{x}\sim p_{s}(\mathrm{x})}[w(\mathrm{z})\log D_{0}(F_{s}(\mathrm{x}))]\\ +\mathbb E_{\mathrm{x}\sim p_{t}(\text{x})}[\log (1-D_{0}(F_{t}(\mathrm{x})))]), \end{array} $$
(20)

where λ and γ are trade-off parameters, \({\mathscr{L}}_{s}\) is the loss of the source domain classifier, \({\mathscr{L}}_{D}\) is the loss of domain classifier D, and \({\mathscr{L}}_{w}\) is the sum of the loss of domain classifier D0 and the entropy of the target class predictions. The objectives are optimized in stages. Fs and C are pre-trained on the source domain data and fixed afterwards. Then D, D0 and Ft are optimized simultaneously without the need to revisit Fs and C. The relative importance of a source sample with feature \(\mathbf {z}\) is given by \(w(\mathbf {z})=\frac {\tilde {w}(\mathbf {z})}{\mathbb {E}_{\mathbf {z} \sim p_{s}(\mathbf {z})} \tilde {w}(\mathbf {z})}\), where \(\tilde {w}(\mathbf {z})=\frac {1}{\frac {p_{s}(\mathbf {z})}{p_{t}(\mathbf {z})}+1}\). The essence of IWAN is to reduce the Jensen-Shannon divergence between the weighted source data distribution and the target data distribution in the feature space. Cao et al. [13] proposed the Example Transfer Network (ETN), which gradually reduces the weight of irrelevant source samples from non-shared categories on the source classifier and employs a domain classifier to quantify the transferability of instances. The architecture of ETN is shown in Fig. 10. The goal of the ETN model is to find saddle-point solutions \(\hat {\theta }_{f}\), \(\hat {\theta }_{y}\), \(\hat {\theta }_{d}\) and \(\hat {\theta }_{\tilde y}\) of the model parameters as follows,

$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{f},\hat{\theta}_{y})=\arg \min_{\theta_{f},\theta_{y}}\mathbb E_{G_{y}}-\mathbb E_{G_{d}}, \end{array} $$
(21)
$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{d})=\arg \min_{\theta_{d}}\mathbb E_{G_{d}}, \end{array} $$
(22)
$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{\tilde y})= \arg \min_{\theta_{\tilde y}}\mathbb E_{\tilde G_{y}}+\mathbb E_{\tilde G_{d}}, \end{array} $$
(23)
$$ \begin{array}{@{}rcl@{}} \mathbb E_{G_{y}} = \frac{1}{n_{s}}\sum\limits_{i=1}^{n_{s}}w(\mathrm{x}_{i}^{s})L(G_{y}(G_{f}(\mathrm{x}_{i}^{s})),\mathrm{y}_{i}^{s}) \\+\frac{\gamma}{n_{t}}\sum\limits_{j=1}^{n_{t}}H(G_{y}(G_{f}(\mathrm{x}_{j}^{t}))), \end{array} $$
(24)
$$ \begin{array}{@{}rcl@{}} \mathbb E_{G_{d}} = -\frac{1}{n_{s}}\sum\limits_{i=1}^{n_{s}}w(\mathrm{x}_{i}^{s})\log(G_{d}(G_{f}(\mathrm{x}_{i}^{s}))) \\-\frac{1}{n_{t}}\sum\limits_{j=1}^{n_{t}}\log(1-G_{d}(G_{f}(\mathrm{x}_{j}^{t}))), \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} \mathbb E_{\tilde G_{y}}=-\frac{\lambda}{n_{s}}\sum\limits_{i=1}^{n_{s}}\sum\limits_{c=1}^{\vert \mathcal{C}_{s}\vert}[y_{i,c}^{s}\log \tilde {G_{y}^{c}}(G_{f}(\mathrm{x}_{i}^{s})) \\+(1-y_{i,c}^{s})\log (1-{\tilde G}_{y}^{c} (G_{f}(\mathrm{x}_{i}^{s})))], \end{array} $$
(26)
$$ \begin{array}{@{}rcl@{}} \mathbb E_{\tilde G_{d}}=-\frac{1}{n_{s}}\sum\limits_{i=1}^{n_{s}}\log({\tilde G_{d}}(G_{f}(\mathrm{x}_{i}^{s}))) \\-\frac{1}{n_{t}}\sum\limits_{j=1}^{n_{t}}\log(1-{\tilde G_{d}}(G_{f}(\mathrm{x}_{j}^{t}))), \end{array} $$
(27)

where \(w(\mathrm {x}_{i}^{s})=1-{\tilde G}_{d}(G_{f}(\mathrm {x}_{i}^{s}))\) is the weight of each source example \(\mathrm {x}_{i}^{s}\), which quantifies the example’s transferability, and γ and λ are trade-off parameters. Equations (24) and (25) form the transferability weighting framework. In (26) and (27), the auxiliary label predictor \({\tilde G}_{y}\) (with a leaky-softmax activation function) and the auxiliary domain discriminator \({\tilde G}_{d}\) are trained with both label information and domain information, which resolves the ambiguity between shared and unshared classes. ETN can therefore derive more accurate and discriminative weights to quantify the transferability of each source example. The accuracy of the above networks is listed in Table 6.
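The transferability weighting of the source term in (24) can be sketched as below, assuming the outputs of the auxiliary domain discriminator \({\tilde G}_{d}\) on the source examples and the per-example classification losses are already available. The function names are hypothetical.

```python
def transferability_weights(aux_domain_probs):
    """w(x) = 1 - G~_d(G_f(x)) for each source example."""
    return [1.0 - p for p in aux_domain_probs]


def weighted_source_risk(losses, weights):
    """Source part of Eq. (24): mean of the transferability-weighted losses."""
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)


# A source example that the auxiliary discriminator confidently marks as
# source-only (output close to 1) is almost ignored by the source classifier.
weights = transferability_weights([0.9, 0.5, 0.1])
risk = weighted_source_risk([2.0, 2.0, 2.0], weights)
```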

Fig. 10

The architecture of ETN, where Gf is the feature extractor, Gy is the source classifier, and Gd is the domain discriminator for domain alignment; \( \tilde G_{d} \) is the auxiliary domain discriminator that quantifies the transferability w of each source example, and \( \tilde G_{y} \) is the auxiliary label predictor that encodes discriminative information into the auxiliary domain discriminator \( \tilde G_{d} \). Best viewed in color

Table 6 Accuracy (%) of different Partial domain adaptation methods on the Office-31 datasets

Open set domain adaptation

The Open set DA case is the opposite of the Partial DA case mentioned above: the source domain label set is only a fraction of the target domain label set. The specific setup for the Open set DA problem is that the target domain contains all the classes in the source domain. We need to classify the data correctly for the known classes in the target domain (common to both target and source domains), while the data of all unknown classes (only in the target domain) are classified as “unknown” because we have no information about these classes.

Busto et al. [63] first proposed this problem scenario. In practice, the label sets of the source domain and the target domain usually only intersect, rather than coinciding as in the closed set setting assumed previously (Fig. 11).

Fig. 11

Overview of Unsupervised Open Set Domain Adaptation Methods. a The source domain contains labeled and unlabeled images, where the same color means the labels are the same and gray means they belong to an unknown category. For the samples in the target domain, there are no labels. b As a first step, category labels are assigned to some target samples, while the outliers receive no labels. c The distance between source and target samples of the same category is minimized. Steps (b) and (c) are iterated until convergence to a local minimum. d shows the classification results of the algorithm. Best viewed in color

The method used in [63] projects the target domain and source domain into the same space based on distance and then classifies the target samples with an SVM. In the unsupervised scenario, the objective function to be optimized is as follows,

$$ \begin{array}{@{}rcl@{}} \min_{x_{ct},w_{ct},o_{t}}&&\sum\limits_{t}(\sum\limits_{c}d_{ct}x_{ct}+\sum\limits_{c}w_{ct}+\lambda o_{t}), \\s.t. &&\sum\limits_{c} x_{ct}+o_{t} = 1 \\&&\sum\limits_{t} x_{ct} \ge 1 \\&&a_{ct}x_{ct}+\sum\limits_{t^{\prime}\in N_{t}}\sum\limits_{c^{\prime}}d_{cc^{\prime}}x_{c't^{\prime}}-w_{ct} \le a_{ct} \\&&x_{ct},o_{t} \in \left\{0,1\right\} \\&&w_{ct} \ge 0 \end{array} $$
(28)

where \(w_{ct}=x_{ct}({\sum }_{t^{\prime }\in N_{t}}{\sum }_{c^{\prime }}x_{c't^{\prime }}d_{cc^{\prime }})\) and \(a_{ct}={\sum }_{t^{\prime }\in N_{t}}{\sum }_{c^{\prime }}d_{cc^{\prime }}\). Compared with the unsupervised scenario, the semi-supervised scenario adds some constraints to ensure that labeled target samples are not misclassified. After the assignment problem is solved iteratively, the source domain is mapped to the target domain by a linear transformation, which is represented by a matrix \(W \in \mathbb R^{D\times D}\). It can be estimated by minimizing the following loss function: \(f(W) =\frac {1}{2}\Vert WP_{S}-P_{T} {\Vert _{F}^{2}}\). After the approach has converged, the data in the target domain are classified using a linear SVM trained on the source domain. Later, based on the idea of Generative Adversarial Networks (GAN), Saito et al. [74] proposed a new method for open set domain adaptation. The network architecture is shown in Fig. 12. The goal is to correctly categorize known target samples into the corresponding known classes and to recognize unknown target samples as unknown. The objective function is as follows,

Fig. 12

Overview of the network architecture. The network is trained to classify source samples. For a target sample, the minimax game between the feature generator and the classifier yields the probability that the sample belongs to the unknown class or to one of the known classes. Best viewed in color

$$ \begin{array}{@{}rcl@{}} \min_{C} L_{s}(x_{s},y_{s})+L_{adv}(x_{t}), \end{array} $$
(29)
$$ \begin{array}{@{}rcl@{}}\min_{G} L_{s}(x_{s},y_{s})-L_{adv}(x_{t}), \end{array} $$
(30)
$$ \begin{array}{@{}rcl@{}}L_{s}(x_{s},y_{s})=-\log (p(y=y_{s}|x_{s})), \end{array} $$
(31)
$$ \begin{array}{@{}rcl@{}}p(y=y_{s}|x_{s}) = (C\circ G(x_{s}))_{y_{s}}, \end{array} $$
(32)
$$ \begin{array}{@{}rcl@{}}L_{adv}(x_{t})=-t\log(p(y=K+1|x_{t}))\\-(1-t)\log(1-p(y=K+1|x_{t})), \end{array} $$
(33)

where Ls(xs,ys) is the loss of classifier C and t is set to 0.5. The generator attempts to maximize the value of Ladv(xt). Saito et al. use the symmetric KL divergence as a new form of the binary cross-entropy loss. Liu et al. [50] developed Separate to Adapt (STA), a progressive separation mechanism consisting of a coarse-to-fine separation pipeline. First, a multi-binary classifier is trained with the source data to estimate the similarity between the data in the target domain and each of the source classes. Second, data with extremely high and extremely low similarity are selected as the boundary data for the known and unknown classes, and they are further used to train a binary classifier that performs fine-grained separation of all target domain samples. Finally, the method iterates between the above two steps and uses the resulting weights to reject samples of unknown classes during domain adaptation. The network structure is shown in Fig. 13. Liu et al. also introduce the concept of openness, which measures how large the target domain class set is compared to the source domain class set. It is defined as \(\mathbb O=1-\frac {|\mathcal {C}_{s}|}{|\mathcal {C}_{t}|}\). No threshold hyperparameters have to be selected manually anywhere in the process, so they do not need to be adjusted when the openness changes. The accuracy of the above networks is listed in Table 7.
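The adversarial loss in (33) and the openness measure can be illustrated with a short sketch. The function names are hypothetical, and `p_unknown` stands for \(p(y=K+1|x_{t})\), the predicted probability that a target sample belongs to the unknown class.

```python
import math


def adversarial_loss(p_unknown, t=0.5):
    """Eq. (33): binary cross-entropy between p(y = K+1 | x_t) and the boundary value t."""
    return -t * math.log(p_unknown) - (1.0 - t) * math.log(1.0 - p_unknown)


def openness(num_source_classes, num_target_classes):
    """Openness O = 1 - |C_s| / |C_t|."""
    return 1.0 - num_source_classes / num_target_classes


at_boundary = adversarial_loss(0.5)   # equals log(2) when t = 0.5
confident = adversarial_loss(0.9)     # larger than the boundary value
half_open = openness(10, 20)
```

With t = 0.5 the loss is smallest when the classifier outputs exactly 0.5 for the unknown class, so a generator that maximizes it pushes target samples away from this boundary, towards either a known class or the unknown class.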

Fig. 13

The Separate to Adapt (STA) approach for open set domain adaptation is split into two parts by the dotted line. The part above the dotted line consists of a multi-binary classifier \(G_{c}|_{c=1}^{|\mathcal {C}_{s}|}\) and a binary classifier Gb, which generate the weights w for rejecting target samples in the unknown classes \(\mathcal {C}_{t} \backslash \mathcal {C}_{s}\). The part below the dotted line consists of a feature extractor Gf, a classifier Gy, and a domain discriminator Gd, which perform adversarial domain adaptation between source and target data in the shared label space. zs and zt are the extracted deep features of the source and target domains. \(\hat y_{s}\) and \(\hat y_{t}\) are the predicted labels. \(\mathrm {z}^{\prime }\) is the feature selected by Gc. Best viewed in color

Table 7 Classification accuracy (%) of open set domain adaptation tasks on VisDA-2017 (VGGNet)

Universal domain adaptation

Universal Domain Adaptation (UDA) does not require a priori knowledge of the label sets. In the UDA setting, given a labeled source domain, a sample from any related target domain, regardless of how its label set differs from the source domain’s label set, must be classified correctly if it belongs to any of the categories in the source label set. Otherwise, it is labeled as “unknown”. You et al. [101] proposed the Universal Adaptation Network (UAN). It quantifies transferability at the sample level to discover the shared label set and the label sets private to each domain, thus facilitating adaptation in the automatically discovered common label set and the successful identification of “unknown” samples. In the training phase, EG, \(E_{D^{\prime }}\) and ED represent the errors of the label classifier G, the non-adversarial domain discriminator \(D^{\prime }\) and the adversarial domain discriminator D, which are formally defined as (Fig. 14),

$$ \begin{array}{@{}rcl@{}} E_{G} =\mathbb E_{{(\text{x,y})}\sim p}L(\mathrm{y},G(F(\mathrm{x}))), \end{array} $$
(34)
$$ \begin{array}{@{}rcl@{}} E_{D^{\prime}}=-\mathbb E_{\mathrm{x}\sim p}\log D^{\prime}(F(\mathrm{x}))\\ -\mathbb E_{\mathrm{x}\sim q}\log (1-D^{\prime}(F(\mathrm{x}))), \end{array} $$
(35)
$$ \begin{array}{@{}rcl@{}} E_{D}=-\mathbb E_{\mathrm{x}\sim p}w^{s}(\mathrm{x})\log D(F(\mathrm{x}))\\ -\mathbb E_{\mathrm{x}\sim q}w^{t}(\mathrm{x})\log (1-D(F(\mathrm{x}))), \end{array} $$
(36)

where L is the standard cross-entropy loss, \(w^{s}(\text {x})=\frac {H(\hat {\mathrm {y}})}{\log \vert \mathcal {C}_{s} \vert }-\hat {d^{\prime }}(\mathrm {x})\) indicates the probability of a source sample x belonging to the common label set \(\mathcal {C}\), and \(w^{t}(\text {x})=\hat {d^{\prime }}(\mathrm {x})-\frac {H(\hat {\mathrm {y}})}{\log \vert \mathcal {C}_{s} \vert }\) indicates the probability of a target sample x belonging to the common label set \(\mathcal {C}\). With well-established weights ws(x) and wt(x), the adversarial domain discriminator D is confined to distinguishing the source and target data in the common label set \(\mathcal {C}\). The non-adversarial domain discriminator \(D^{\prime }\) is trained to obtain good weights ws(x) and wt(x); then, adversarial training is conducted on the adversarial domain discriminator D and the label classifier G. After training, whether a target sample belongs to a common class or is “unknown” is judged by the value of the weight wt(x).
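The two sample-level weights can be sketched as follows, assuming `probs` is the softmax output \(\hat {\mathrm {y}}\) over the \(\vert \mathcal {C}_{s} \vert \) source classes and `d_prime` is the output \(\hat {d^{\prime }}(\mathrm {x})\) of the non-adversarial domain discriminator. The function names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """H(y_hat) / log|C_s|, which lies in [0, 1]."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(probs))


def source_weight(probs, d_prime):
    """w^s(x) = H(y_hat) / log|C_s| - d'(x)."""
    return normalized_entropy(probs) - d_prime


def target_weight(probs, d_prime):
    """w^t(x) = d'(x) - H(y_hat) / log|C_s|."""
    return d_prime - normalized_entropy(probs)


# A confidently classified, source-like target sample gets a high weight;
# an uncertain, target-like one gets a low (here negative) weight.
likely_common = target_weight([1.0, 0.0, 0.0, 0.0], d_prime=0.9)
likely_unknown = target_weight([0.25, 0.25, 0.25, 0.25], d_prime=0.1)
```

A target sample whose weight falls below some chosen threshold (the threshold itself is an assumption here, not given in the text above) would then be labeled “unknown”.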

Fig. 14

The architecture of the Universal Adaptation Network (UAN), which is composed of a feature extractor F, an adversarial domain discriminator D, a non-adversarial domain discriminator \(D^{\prime }\) and a label classifier G. Best viewed in color

Motivated by the domain similarity and uncertainty criteria proposed in [101], Saito et al. proposed Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE) in [72]. DANCE utilizes a prototype-based classifier, which maps samples close to their true class prototypes and away from those of other classes. The target samples are first clustered in the target domain using self-supervision. Because of this neighborhood clustering, DANCE can extract distinct feature representations for “unknown” samples in an unsupervised manner. Next, each target point is either aligned with a source class prototype or rejected as “unknown” by an entropy separation loss. DANCE also utilizes domain-specific batch normalization [15, 17, 71] to eliminate domain style information as a form of weak domain alignment. It is worth noting that DANCE extracts discriminative feature representations for “unknown” class examples without any supervision on the target domain. Fu et al. [27] argued that the prediction entropy and the output of the auxiliary domain classifier are not robust and discriminative enough, and proposed Calibrated Multiple Uncertainties (CMU), a mixture of entropy, consistency, and confidence. They designed a deep ensemble model that characterizes different degrees of uncertainty and distinguishes target data in the common label set from those in the private label set. Fu et al. further proposed a novel H-score to compensate for the previously used per-class accuracy, which neglects the open classes. The H-score is the harmonic mean of the instance accuracy on the common classes \( a_{\mathcal {C}} \) and the accuracy on the “unknown” class \(a_{\mathcal {\bar {C}}^{t}}\), i.e., \(h=2 \cdot \frac {a_{\mathcal {C}} \cdot a_{\mathcal {\bar {C}}^{t}} }{a_{\mathcal {C}} + a_{\mathcal {\bar {C}}^{t}}}\). A summary of the Universal DA comparisons is listed in Table 8.
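The H-score defined above can be computed directly from the two accuracies. The sketch below uses a hypothetical function name and returns 0 when both accuracies are 0 to avoid a division by zero, which is a choice made here rather than something stated in [27].

```python
def h_score(acc_common, acc_unknown):
    """Harmonic mean of the accuracy on common classes and on the "unknown" class."""
    if acc_common + acc_unknown == 0:
        return 0.0
    return 2.0 * acc_common * acc_unknown / (acc_common + acc_unknown)


balanced = h_score(0.8, 0.8)
ignores_unknown = h_score(1.0, 0.0)
mixed = h_score(0.9, 0.3)
```

A model that classifies every common-class sample correctly but never predicts “unknown” obtains an H-score of 0, which is how this metric penalizes ignoring the open classes.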

Table 8 Summary of the Universal comparisons

Zero-shot domain adaptation

In some extreme cases, no samples of the target domain are available. Zero-shot DA, which gradually developed from Few-shot DA, addresses this case. In the zero-shot domain adaptation setting, only the source domain data is available for the task of interest. Sometimes semantic information about the target domain classes is used, known as generalized Zero-shot learning (GZSL). DeViSE [26] is initialized from two pre-trained neural network models: a skip-gram text model and a visual model (AlexNet without its softmax prediction layer in that paper). It uses a combination of dot-product similarity and hinge rank loss. The network is trained by reducing the distance between an image and its corresponding label and enlarging the distance between the image and the non-corresponding labels. Based on the main idea of DeViSE, [60] adopts the framework of a CNN and word2vec. The weights (probabilities) of the nearest categories of the image are obtained by a standard CNN, and the category embeddings are obtained by word2vec. Then the similarity with the test categories is calculated to predict the label of the image. Zhang et al. [106] proposed a new embedding model for ZSL based on a deep neural network. There are two main differences between this model and previous models. 1) It uses the visual feature space output by the CNN subnet as the embedding space. The projection direction is from the semantic space to the visual feature space, which reduces the hubness problem. 2) It realizes end-to-end learning of the semantic space representation. The architecture of the model is shown in Fig. 15.

Fig. 15

The entire network consists of a visual encoding branch and a semantic encoding branch, where the visual encoding branch takes the image as input and outputs a feature vector. The space in which the feature vectors are located is used as the embedding space. The semantic encoding branch takes the L-dimensional semantic representation vector as input and outputs a D-dimensional semantic embedding vector after two fully connected layers with rectified linear units. The two branches are connected by a least-squares embedding loss. Best viewed in color

The purpose of the least-squares embedding loss is to minimize the difference between the visual features and their class representation embedding vectors in the visual feature space. Together with a regularization term on the weights, the objective function is as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}(W_{1}, W_{2})=\frac {1}{N}\sum\limits_{i=1}^{N}\Vert \phi(I_{i})-f_{1}(W_{2}f_{1}(W_{1}\mathrm{y}_{i}^{u}))\Vert^{2}+\lambda (\Vert W_{1}\Vert^{2}+\Vert W_{2}\Vert^{2}), \end{array} $$
(37)

where \(W_{1} \in {\mathbb R}^{L \times M}\) are the weights to be learned in the first FC layer and \(W_{2} \in {\mathbb R}^{M \times D}\) are those of the second FC layer. λ is a hyperparameter that weights the regularization of the two weight matrices relative to the embedding loss. f1(⋅) is the Rectified Linear Unit, which introduces nonlinearity in the encoding subnet. The classification of a test image Ij in the visual feature space can be achieved by merely calculating its distance to the embedded prototypes. The authors show that using the visual feature space as the embedding space works much better than using the semantic space. Liu et al. [51] proposed a novel Deep Calibration Network (DCN) approach for the generalized Zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes. Two scenarios are given in this paper, Zero-shot Learning (ZSL) and Generalized Zero-shot Learning (GZSL). The biggest difference between ZSL and GZSL is that GZSL can utilize the available semantic representation of the target domain, but it needs to classify over both source and target classes. The network architecture is shown in Fig. 16.
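The embedding \(f_{1}(W_{2}f_{1}(W_{1}\mathrm {y}))\) in (37) and the nearest-prototype classification in the visual feature space can be sketched as below. This is a toy illustration under the assumption that the semantic vector is treated as a row vector multiplied by \(W_{1}\) (L × M) and then \(W_{2}\) (M × D); the helper names and the tiny hand-picked weights are hypothetical.

```python
def relu(vector):
    """Rectified Linear Unit f_1 applied element-wise."""
    return [max(0.0, v) for v in vector]


def vec_mat(vector, matrix):
    """Row vector (length L) times matrix (L x M)."""
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector)))
            for j in range(len(matrix[0]))]


def embed(semantic_vec, w1, w2):
    """Two FC + ReLU layers mapping a semantic vector into the visual feature space."""
    return relu(vec_mat(relu(vec_mat(semantic_vec, w1)), w2))


def classify(visual_feature, class_semantics, w1, w2):
    """Return the index of the class whose embedded prototype is nearest."""
    distances = [sum((a - b) ** 2 for a, b in zip(visual_feature, embed(s, w1, w2)))
                 for s in class_semantics]
    return distances.index(min(distances))


w1 = [[1.0, 0.0], [0.0, 1.0]]            # L = 2, M = 2
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # M = 2, D = 3
class_semantics = [[1.0, 0.0], [0.0, 1.0]]
predicted = classify([0.1, 0.9, 0.0], class_semantics, w1, w2)
```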

Fig. 16

The Deep Calibration Network (DCN) consists of four modules: a CNN for learning a deep embedding ϕ(x) for each image x and an MLP for learning a deep embedding ψ(a) for each class a; a prediction function f made by the nearest prototype classifier (NPC); two probabilities p and q that transform the prediction function f into distributions over the source and target classes; and two losses, a cross-entropy loss that minimizes the overfitting to the source domain and an entropy loss that minimizes the uncertainty of target classes. Best viewed in color

The optimization problem of the deep calibration network (DCN) for generalized Zero-shot learning can be formulated by integrating the empirical risk minimization and uncertainty calibration,

$$ \begin{array}{@{}rcl@{}} \min_{\phi,\psi} L+\lambda H+\gamma {\Omega}(\phi,\psi), \end{array} $$
(38)
$$ \begin{array}{@{}rcl@{}} L=-\sum\limits_{n=1}^{N}\sum\limits_{c=1}^{S}y_{n,c}\log p_{c}(\mathrm{x}_{n}), \end{array} $$
(39)
$$ \begin{array}{@{}rcl@{}} p_{c}(\mathrm{x}_{n})=\frac{\exp(f_{c}(\mathrm{x}_{n})/\tau)}{{\sum}_{c^{\prime}=1}^{S}\exp(f_{c^{\prime}}(\mathrm{x}_{n})/\tau)}, \end{array} $$
(40)
$$ \begin{array}{@{}rcl@{}} H=-\sum\limits_{n=1}^{N}\sum\limits_{c=S+1}^{S+T}q_{c}(\mathrm{x}_{n})\log q_{c}(\mathrm{x}_{n}), \end{array} $$
(41)
$$ \begin{array}{@{}rcl@{}} q_{c}(\mathrm{x}_{n})=\frac{\exp(f_{c}(\mathrm{x}_{n})/\tau)}{{\sum}_{c^{\prime}=S+1}^{S+T}\exp(f_{c^{\prime}}(\mathrm{x}_{n})/\tau)}, \end{array} $$
(42)
$$ \begin{array}{@{}rcl@{}} f_{c}(\mathrm{x}_{n}) =\text{sim}(\phi(\mathrm{x}_{n}),\psi(\mathrm{a}_{c})), \end{array} $$
(43)
$$ \begin{array}{@{}rcl@{}} y(\mathrm{x}_{n}) = \arg \max_{c} f_{c}(\mathrm{x}_{n}), \end{array} $$
(44)

where Ω(ϕ,ψ) is the penalty to control model complexity; λ and γ are hyper-parameters (in deep learning, weight decay can be used to replace the penalty term γΩ(ϕ,ψ)); L is the empirical risk; H is the uncertainty calibration term; and sim(⋅) is a similarity function, e.g., inner product or cosine similarity. The ultimate goal is to minimize the entropy so that samples are classified correctly, but this requires that the categories of the target domain are known. [66] is the first domain adaptation and sensor fusion method that does not require relevant target domain data. It relies on the correspondences between source and target domain data samples in an irrelevant task to train the model. In contrast, Conditional Coupled Generative Adversarial Networks for Zero-shot Domain Adaptation (CoCoGAN) [91] does not rely on such information because it captures the joint distribution of source and target domain data samples. Wang et al. [92] presented Adversarial Learning for Zero-shot Domain Adaptation (ALZSDA) to extend the scope of applications further. ALZSDA can learn the domain shift from an irrelevant task and transfer it to multiple different tasks of interest.
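For a single image, the empirical risk in (39) and the uncertainty term in (41) of DCN can be sketched as follows, using the temperature-scaled softmax of (40) and (42). The function names and the convention that the first `num_source` similarity scores belong to the source classes are assumptions made for this illustration.

```python
import math


def softmax(scores, tau):
    """Temperature-scaled softmax; the maximum is subtracted for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def dcn_losses(sims, label, num_source, tau=1.0):
    """Per-image cross-entropy over source classes and entropy over target classes.

    sims: similarities f_c(x) for all S + T classes, source classes first.
    label: index of the ground-truth source class of the image.
    """
    p = softmax(sims[:num_source], tau)   # Eq. (40)
    q = softmax(sims[num_source:], tau)   # Eq. (42)
    cross_entropy = -math.log(p[label])                       # summand of Eq. (39)
    entropy = -sum(qc * math.log(qc) for qc in q if qc > 0)   # summand of Eq. (41)
    return cross_entropy, entropy


uniform_ce, uniform_ent = dcn_losses([0.0, 0.0, 0.0, 0.0], label=0, num_source=2)
confident_ce, tied_ent = dcn_losses([10.0, 0.0, 5.0, 5.0], label=0, num_source=2)
```

Summing these two quantities over all training images, with the entropy weighted by λ, gives the data-dependent part of (38).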

5 Multi-source domain adaptation

In practical scenarios, labeled data can be collected from multiple sources with different distributions. In this case, the above single-source domain adaptation (SSDA) approaches can be applied simply by combining the multiple source domains into a single source domain. However, merging multiple source domains and then using an SSDA approach often yields worse performance than merely utilizing one of the source domains and discarding the others. Since domain shift exists both between each source domain and the target domain and among the different source domains, data from the various sources may interfere with each other during the learning process. Therefore, to utilize all available data, multi-source domain adaptation (MSDA) is required (Fig. 17).

Fig. 17

Overview of the Deep Cocktail Network (DCTN). The framework receives multi-source instances with ground-truth annotations and adaptively classifies the target samples. For simplicity, it is illustrated with the source domains j and k. Firstly, the feature extractor maps the target domain, source domain j, and source domain k into a common feature space. Secondly, the category classifier receives the target feature and produces the j-th and k-th classifications based upon the categories in source domains j and k, respectively. Thirdly, the domain discriminator receives features from sources j and k and from the target, and then provides an adversary between each source and target domain pair. Finally, the target classifier integrates all the weighted classification results and then predicts the target class. Best viewed in color

Xu et al. [98] proposed the deep cocktail network (DCTN) to cope with domain and category shifts among multiple sources. According to the theoretical results in [55], the target distribution can be represented as a weighted combination of the source distributions. MSDA is executed in two alternating steps. In the first step, the differences between the target domain and each of the source domains are minimized through adversarial learning, and a confusion score is obtained for each source domain, which represents the likelihood that a target sample belongs to that source domain. In the second step, a multi-source category classifier is combined with the confusion scores to classify the target samples, and the multi-source category classifier and feature extractor are updated with pseudo-labeled target samples and source samples (Fig. 18).
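The second step, in which per-source classifications are weighted by confusion scores, can be illustrated with a minimal sketch. It is not the implementation of [98]: the function names `combine_source_predictions` and `predict_target_class` are hypothetical, the confusion scores are simply normalized to sum to one, and all sources are assumed to share the same label set, whereas DCTN also handles category shift between sources.

```python
def combine_source_predictions(source_probs, confusion_scores):
    """Weight per-source class probabilities by normalized confusion scores.

    source_probs: one probability vector per source classifier, all over the
        same classes (an assumption of this sketch).
    confusion_scores: one non-negative score per source; a larger score means
        the target sample looks more like that source.
    """
    total = sum(confusion_scores)
    if total <= 0:
        raise ValueError("confusion scores must have a positive sum")
    weights = [score / total for score in confusion_scores]

    num_classes = len(source_probs[0])
    combined = [0.0] * num_classes
    for weight, probs in zip(weights, source_probs):
        for c in range(num_classes):
            combined[c] += weight * probs[c]
    return combined


def predict_target_class(source_probs, confusion_scores):
    """Return the index of the class with the highest combined probability."""
    combined = combine_source_predictions(source_probs, confusion_scores)
    return max(range(len(combined)), key=combined.__getitem__)


if __name__ == "__main__":
    # Two source classifiers that disagree; the first source is scored as
    # three times more similar to the target sample than the second.
    probs = [[0.9, 0.1], [0.2, 0.8]]
    print(combine_source_predictions(probs, [3.0, 1.0]))
    print(predict_target_class(probs, [3.0, 1.0]))
```

When the sources disagree, the prediction follows the source that the discriminators consider closest to the target sample, which is the intuition behind representing the target as a weighted combination of sources.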

Fig. 18
figure 18

Overview of MDDA. First, a classifier is pre-trained for each source domain. Next, the extracted target-domain features are mapped to each source domain for adversarial training. Then, source samples close to the target domain are selected based on the Wasserstein distance and used to fine-tune the classifier. Finally, the predictions for each target sample are combined by weighting. Best viewed in color

In contrast to [98], which symmetrically maps multiple sources and the target to the same space, Multi-source Distilling Domain Adaptation (MDDA) asymmetrically maps the target to each individual source domain. Using a separate feature extractor for each source yields more discriminative target representations, and adversarial training with the Wasserstein distance produces more stable gradients.

Stage 1 is the pre-training of the source domain classifiers. The objective function in Stage 2 is

$$ \begin{array}{@{}rcl@{}} \max_{D_{i}}\mathcal{L}_{wd_{D}}(D_{i})-\alpha\mathcal{L}_{grad}(D_{i}), \end{array} $$
(45)

where α is a balancing coefficient whose value can be set empirically, and \(\mathcal{L}_{wd_{D}}(D_{i})\) is the Wasserstein distance loss. To enforce the Lipschitz constraint, a gradient penalty \(\mathcal{L}_{grad}(D_{i})\) is introduced for the parameters of each discriminator \(D_{i}\), as in [35].

Unlike the above methods, the model of Peng et al. [67] directly matches all the distributions by matching their moments. They also provide a proof of why matching the moments of multiple distributions works for MSDA. The Domain AggRegation Network (DARN) proposed by Wen et al. [95] dynamically adjusts the weight of each source domain during training. The weights are determined by the discrepancy between each source domain and the target domain. Unlike previous works, its aggregation scheme directly optimizes the generalization upper bound without resorting to surrogates.
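To make the structure of Eq. (45) concrete, the following minimal sketch evaluates the discriminator objective for a linear critic D(x) = w·x + b. Everything here is an illustrative assumption rather than the MDDA implementation: the critic is linear so that its input gradient equals w everywhere and no automatic differentiation is needed, the Wasserstein term is taken as the mean critic score on source features minus the mean on target features, and the gradient penalty follows the common form (‖∇D‖ − 1)². The function names are hypothetical.

```python
import math


def critic_score(w, b, x):
    """Linear critic D(x) = w . x + b (a simplifying assumption)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def wasserstein_term(w, b, source_feats, target_feats):
    """Mean critic score on source features minus mean score on target features."""
    mean_src = sum(critic_score(w, b, x) for x in source_feats) / len(source_feats)
    mean_tgt = sum(critic_score(w, b, x) for x in target_feats) / len(target_feats)
    return mean_src - mean_tgt


def gradient_penalty(w):
    """Penalty (||grad D|| - 1)^2; for a linear critic the input gradient is w."""
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    return (grad_norm - 1.0) ** 2


def critic_objective(w, b, source_feats, target_feats, alpha):
    """Quantity the discriminator maximizes: L_wd - alpha * L_grad."""
    return wasserstein_term(w, b, source_feats, target_feats) - alpha * gradient_penalty(w)


if __name__ == "__main__":
    source = [[2.0, 0.0], [4.0, 0.0]]
    target = [[1.0, 0.0]]
    # Unit-norm critic: the penalty vanishes and only the distance term remains.
    print(critic_objective([1.0, 0.0], 0.0, source, target, alpha=10.0))
    # Scaling the critic inflates the distance term, but the penalty dominates.
    print(critic_objective([5.0, 0.0], 0.0, source, target, alpha=10.0))
```

Without the penalty, the critic could increase the first term without bound simply by scaling w, which is why the Lipschitz constraint has to be enforced.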

6 Conclusion

Deep DA methods are domain adaptation algorithms based on end-to-end training and optimization of deep networks. This survey adopts that definition and mainly reviews deep DA techniques for visual categorization tasks.

We classify DA problems according to the status of the label sets of the source and target domains. We do not use the degree of supervision as a basis for classification, because we consider unsupervised and weakly supervised DA to be the way forward. Supervised domain adaptation serves as a bridge to a better understanding of adaptability.

Firstly, we classify DA into single-source DA and multi-source DA. Further, according to whether the feature space is the same or not, the domain adaptation of single-source DA is divided into homogeneous domain adaptation and heterogeneous domain adaptation.

Furthermore, we introduce the label set as a classification indicator and classify domain adaptation into Closed-set DA, Partial DA, Open-set DA, Universal DA, and Zero-shot DA. There are three main approaches to the Closed-set DA problem: discrepancy-based, adversarial-based, and reconstruction-based methods. A comprehensive approach that combines them is usually the better solution for deep DA. Partial DA and Open-set DA essentially come down to blocking the negative transfer caused by "irrelevant samples" and extracting a domain-invariant representation to promote positive transfer. For Universal DA, self-supervised auxiliary domain adaptation is usually introduced: the intra-class distance is reduced by clustering, and the inter-class distance is increased by entropy maximization. For Zero-shot DA, research still focuses primarily on the semantic representation of classes and the visual embedding of images.

We also review multi-source DA. Current deep multi-source DA methods can be divided into two main categories. 1) Methods that use a shared feature extractor to symmetrically map the multiple source domains and the target domain into the same space. A discriminator is then trained for each source-target pair to distinguish source domain features from target domain features, and the final prediction for a target image is obtained from the classifiers of the different source domains by either averaging or weighting. 2) Methods that use non-shared feature extractors to obtain the feature representation of each source domain and asymmetrically match the target domain features to the feature space of each source domain. The pre-trained classifiers are fine-tuned with selected representative samples, and classification is performed using a weighted combination.

Despite the recent success of deep DA, many problems remain to be solved. First, most datasets treat every class as equally important. In real application scenarios, however, the importance of classes may differ. How to reduce or eliminate the bias caused by this inconsistency may become a future research topic. In addition, Universal DA and Zero-shot DA have received little study so far, and we expect more work on them in the future.

Second, deep DA has been successfully applied to many real-world applications, including image classification and object detection. The datasets for these tasks are 2D. For some task-specific 3D/4D data [78, 80], it is challenging to design DA networks that capture their 3D/4D features.

Finally, most existing deep DA methods handle a single modality. However, to take advantage of complementary but heterogeneous data, such as 2D images and 3D point clouds, or Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) images, it is meaningful to consider both the heterogeneity between modalities and the difference between domains when designing the DA model. Recently, some papers [9, 10, 24, 54, 108] have begun to focus on this issue, and it deserves more research.