A survey of deep domain adaptation based on label set classification

Fan, Min; Cai, Ziyun; Zhang, Tengfei; Wang, Baoyun

doi:10.1007/s11042-022-12630-8

Published: 29 April 2022

Volume 81, pages 39545–39576, (2022)

Multimedia Tools and Applications

Min Fan¹,
Ziyun Cai ORCID: orcid.org/0000-0001-6822-915X¹,
Tengfei Zhang¹ &
…
Baoyun Wang¹

Abstract

Traditional machine learning requires well-labeled data to achieve excellent performance, but manual labeling usually consumes a lot of time and money. Because of domain shift, a model trained on the source domain often performs poorly when applied directly to the target domain. Domain adaptation addresses these problems, and deep domain adaptation methods use deep neural networks to carry it out. This article presents a comprehensive review of deep domain adaptation methods for image classification. The main contributions are fourfold. First, we divide deep domain adaptation into several categories based on the label sets of the source domain and the target domain. Second, we summarize various methods of closed-set domain adaptation. Third, we discuss current methods of multi-source domain adaptation. Finally, we discuss future research directions, challenges, and possible solutions.

1 Introduction

In recent years, traditional machine learning and its related applications have achieved great success [37, 41, 49, 84], but these successes depend on well-labeled data. Labeling is complicated and tedious, especially for samples from new domains or tasks, and manual labeling costs a lot of time and money. Semi-supervised learning [88, 103] alleviates the problem to a certain extent. However, semi-supervised learning still needs a certain amount of labeled instances and a large number of unlabeled instances, and a large number of unlabeled instances is difficult to obtain in real-life application scenarios. This usually makes it hard for the trained model to converge.

Unlike traditional machine learning, transfer learning [62, 109] allows the domains, tasks, and distributions used in training and testing to differ. The original intention of transfer learning is to use a previously labeled domain to label a new domain, just as someone who can play the violin may learn the cello quickly. When the data distributions of the source domain and target domain differ but their tasks are the same, this special case of transfer learning is called domain adaptation. For example, a police officer investigating a crime can use a citizen's ID photo recorded in the system to quickly and accurately locate a target in a surveillance video [104]. Banks can use standard fonts in their databases to help identify a target’s handwriting [59].

Domain adaptation (DA) is a particular case of transfer learning (TL). In the last few decades, various shallow domain adaptation methods have been proposed to handle the shift between the source and target domains. Shallow domain adaptation can usually be divided into three categories. 1) Instance-based domain adaptation adjusts the weights of the instances so that the distributions of the two domains become similar [7, 20]. 2) Feature-based domain adaptation achieves adaptation by adjusting the features of the two domains [32, 61]. 3) Parameter-based domain adaptation improves results by adjusting the model parameters [7, 99]. With the advancement of technology [38], more and more new fields and new tasks require suitable labels, and the performance of shallow domain adaptation can no longer meet today’s requirements for accuracy. Deep neural networks are widely used in computer vision [1, 2, 47, 69] and natural language processing [39, 44, 76, 83] applications. They have more computing units and stronger non-linear representation ability [3], which allows them to establish better decision boundaries. Therefore, the idea of combining domain adaptation and deep neural networks was born. Commonly used deep neural network models include convolutional neural networks (CNNs) [3, 18, 22, 29, 40, 43], deep belief networks (DBNs) [23, 36, 56, 57, 58], and stacked autoencoders (SAEs) [30, 90, 107].

In this paper, we analyze and discuss deep DA methods. To summarize, the main contributions are:

We divided the deep domain adaptation into several categories based on the label set of the source domain and the target domain.
We summarized various methods of Closed-set domain adaptation.
We discussed current methods of multi-source domain adaptation.
We discussed future research directions, challenges, and possible solutions.

The remainder of this survey is structured as follows. In Section 2, we review related work. In Section 3, we first define some notation and then categorize deep DA into different settings (given in Fig. 1). In the next two sections, the approaches for each setting are discussed; they are listed in detail in Tables 1 and 5. Finally, the conclusion of this paper and a discussion of future work are presented in Section 6.

Table 1 Classification of Closed-set DA

2 Related work

Over the past few years, there have been many reviews or surveys on transfer learning and domain adaptation. Pan et al. [62] divided transfer learning into three cases: inductive TL, transductive TL, and unsupervised TL, but they only studied homogeneous feature spaces. Patel et al. [64] focused only on domain adaptation. Csurka et al. [21] briefly described shallow domain adaptation methods and categorized deep domain adaptation methods by training loss: classification loss, discrepancy loss, and adversarial loss. However, Csurka et al. only studied deep domain adaptation in visual application scenarios. Wang et al. [93] divided deep domain adaptation methods into single-step and multi-step methods, and further divided single-step methods by training loss into difference-based, adversarial loss-based, and reconstruction-based methods. Four criteria were proposed for difference-based domain adaptation: class criterion, statistic criterion, architecture criterion, and geometric criterion. Adversarial loss-based domain adaptation consists of generative models and non-generative models. Reconstruction-based domain adaptation consists of encoder-decoder reconstruction and adversarial reconstruction. Multi-step domain adaptation methods are divided into three categories based on how intermediate domains are selected and utilized: hand-crafted, instance-based, and representation-based. Sun et al. [82] mainly reviewed theoretical results and well-established algorithms for multi-source domain adaptation problems. Kouw et al. [42] introduced dataset shift in transfer learning and domain adaptation and the treatment of domain transfers. Wang et al. [94] analyzed the work on zero-shot learning existing at that time from three perspectives: semantic spaces, methods, and applications. Semantic spaces consist of engineered semantic spaces and learned semantic spaces. Zero-shot learning methods are divided into classifier-based methods and instance-based methods. Wilson et al. [96] compared unsupervised deep domain adaptation approaches by examining alternative methods, their unique and common elements, results, and theoretical insights. Cai et al. [8] gave a comprehensive description of the available RGB-D datasets to guide researchers in choosing the right dataset to evaluate their algorithms. Chu et al. [19] compared domain adaptation techniques for neural machine translation (NMT) with the techniques being studied in statistical machine translation (SMT), which has been the main research area in the last two decades (Tables 2, 3, and 4).

Table 2 Accuracy (%) of different domain adaptation methods on the Office-31 datasets


Table 3 Accuracy (%) of different unsupervised domain adaptation methods on the digits datasets


Table 4 Accuracy (%) of different adversarial domain adaptation methods without generators on the Office-31 datasets


After reviewing the above literature, we study deep domain adaptation methods for various scenarios. Firstly, there is the consistently studied closed-set DA, which is the base scenario of most algorithms. In recent years, Partial DA, Open set DA, Universal DA, and Zero-shot DA have been proposed to address domain adaptation in various scenarios.

3 Overview

3.1 Notations and definitions

In this section, we introduce some of the symbols and definitions that will be used in this survey. They match those in the survey papers [21, 93, 94] to maintain consistency across surveys. A domain consists of a feature space $\mathcal {X}$ and a marginal probability distribution P(X), where $X=\left \{x_{1},\dots ,x_{n} \right \}\in \mathcal {X}$. Given a specific domain $\mathcal {D}=\left \{ \mathcal {X},P(X)\right \}$, a task $\mathcal {T}$ consists of a label space $\mathcal {Y}$ and an objective predictive function f(⋅), which can also be viewed as a conditional probability distribution P(Y |X) from a probabilistic perspective. In general, we can learn P(Y |X) in a supervised manner from the labeled data $\left \{x_{i},y_{i}\right \}$, where $x_{i} \in \mathcal {X}$ and $y_{i} \in \mathcal {Y}$. Let ${{\mathscr{L}}}_{s}$ and ${{\mathscr{L}}}_{t}$ denote the label sets of the source and target domains.

Assume that we have two domains: the training dataset with sufficient labeled data is the source domain $\mathcal {D}^{s}=\left \{\mathcal {X}^{s},P(X)^{s} \right \}$, and the test dataset, which has a small amount of labeled data, no labeled data, or even no data in the traditional sense, is the target domain $\mathcal {D}^{t}=\left \{\mathcal {X}^{t},P(X)^{t} \right \}$. First, we consider the case where target labels exist: we denote the labeled part as ${\mathcal {D}}^{tl}$ and the unlabeled part as $\mathcal {D}^{tu}$, which together form the entire target domain, $\mathcal {D}^{t} ={\mathcal {D}}^{tl}\cup \mathcal {D}^{tu}$. The task of the source domain is ${\mathcal {T}^{s}}= \left \{\mathcal {Y}^{s}, P(Y^{s}|X^{s}) \right \}$, and that of the target domain is ${\mathcal {T}^{t}}= \left \{\mathcal {Y}^{t}, P(Y^{t}|X^{t}) \right \}$. Similarly, P(Y^s|X^s) can be learned from the source labeled data $\left \{{x_{i}^{s}},{y_{i}^{s}}\right \}$, and P(Y^t|X^t) can be learned from the target labeled data $\left \{x_{i}^{tl},y_{i}^{tl}\right \}$ and unlabeled data $\left \{x_{i}^{tu}\right \}$. Second, when no traditional samples of the target domain are available, we need to introduce a semantic representation ${\mathrm {a}}_{c}\in \mathbbm {R}^{Q}$ to aid in network training. The commonness between two domains is defined as the Jaccard index (intersection over union) of the two label sets, $ \xi =\frac {|{{\mathscr{L}}}_{s}\cap {{\mathscr{L}}}_{t}|}{|{{\mathscr{L}}}_{s}\cup {{\mathscr{L}}}_{t}|} $.
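As a small illustration, the commonness ξ is simply the number of labels shared by the two domains divided by the number of distinct labels overall. The sketch below computes it for two label sets; the function name `commonness` and the example labels are hypothetical and are not taken from any dataset discussed in this survey.

```python
def commonness(source_labels, target_labels):
    """Return |Ls ∩ Lt| / |Ls ∪ Lt| for two collections of labels."""
    ls, lt = set(source_labels), set(target_labels)
    union = ls | lt
    if not union:
        raise ValueError("both label sets are empty")
    return len(ls & lt) / len(union)


# Hypothetical label sets, for illustration only.
source = {"bike", "mug", "laptop", "phone"}
target = {"mug", "laptop", "phone", "chair", "lamp"}

# Three labels are shared and six distinct labels exist overall.
xi = commonness(source, target)
```

With these sets ξ = 3/6 = 0.5. Identical label sets give ξ = 1, and label sets with nothing in common give ξ = 0.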

3.2 Dataset

In this subsection, we introduce some commonly used datasets for DA. Office-31 [70] is relatively small, with 4,652 images in 31 classes. Its three domains, A, D, and W, were collected by downloading images from amazon.com (A) and by taking photos with a DSLR camera (D) and a web camera (W). They yield six domain adaptation tasks: A→W, D→W, W→D, A→D, D→A, and W→A. Office-Home [89] is a larger dataset, with 4 domains of distinct styles: Artistic, Clip Art, Product, and Real-World. Each domain contains images of 65 object categories. Denoting them as Ar, Cl, Pr, Rw, we obtain twelve domain adaptation tasks: Ar→Cl, Ar→Pr, Ar→Rw, Cl→Ar, Cl→Pr, Cl→Rw, Pr→Ar, Pr→Cl, Pr→Rw, Rw→Ar, Rw→Cl, and Rw→Pr. VisDA2017 [68] (VD) comprises 12 categories with synthetic (S) and real-world (R) domains. Office-Caltech [33] uses the classes shared by Office-31 and Caltech as the whole dataset.

3.3 Different scenarios for domain adaptation

The case for traditional machine learning is $\mathcal {D}^{s}=\mathcal {D}^{t}$ and $\mathcal {T}^{s}=\mathcal {T}^{t}$. As for transfer learning, Pan et al. [61] divide data set divergence into divergence in the domain itself and divergence brought about by the task. The former is generally caused by distribution shifts or feature space divergence, while the latter is caused by a divergence in the conditional distribution or label space. Based on these two types of divergence, Pan et al. classify transfer learning into three categories: inductive, transductive, and unsupervised transfer learning. In Pan et al.’s classification, domain adaptation falls into transductive transfer learning. It is characterized by the same task $\mathcal {T}^{s}=\mathcal {T}^{t}$ but there is a domain divergence $\mathcal {D}^{s} \ne \mathcal {D}^{t}$.

First, according to the number of source domains, domain adaptation can be classified into single-source domain adaptation and multi-source domain adaptation. Second, it is classified into homogeneous domain adaptation and heterogeneous domain adaptation according to the divergence of domains. Under the setting of homogeneous domain adaptation, the feature spaces of the target domain and the source domain are almost the same, $(\mathcal {X}^{s} =\mathcal {X}^{t})$ and (d^s = d^t). The main difference lies in the marginal distributions of the target domain and the source domain (P(X)^s≠P(X)^t). Under the heterogeneous domain adaptation setting, however, there is a big difference, $(\mathcal {X}^{s} \ne \mathcal {X}^{t})$ or (d^s≠d^t), between the feature spaces of the target domain and the source domain. In this paper, we do not use the presence or absence of supervision as a classification criterion. Instead, we classify domain adaptation settings according to the label sets of the source and target domains. The classification of single-source domain adaptation based on label set is shown in Fig. 2.

4 Single-source domain adaptation

4.1 Homogeneous domain adaptation

The first consideration is single-source domain adaptation, i.e., learning a model from a labeled source domain and then generalizing it to different but related target domains. The feature spaces of the target and source domains are essentially the same. The label sets of the target and source domains are also consistent. We refer to domain adaptation in this setting as closed-set domain adaptation. Most of the current methods fall into three main categories: Discrepancy-Based methods, Adversarial-Based methods, and Reconstruction-Based methods. The properties of the networks mentioned are listed in Tables 2, 3 and 4.

Closed-set domain adaptation

In the Closed-set DA setting, it is supposed that the source and target domains only contain images of the same set of object classes. Neither domain includes images of unknown classes or of classes that do not exist in the other domain, and the images should be of the same type. The first idea that comes to mind is to align the source and target domains, then reduce the classification loss, and finally fine-tune [14, 100] the classifier for the target domain. However, direct fine-tuning of the parameters of the deep network is very problematic.

1) Discrepancy-Based methods: In the past, many network structures have been proposed to solve the classification task, such as LeNet-5 [46], AlexNet [43], and VGG [79]. Due to the domain shift between the two domains, the accuracy of these network models drops significantly, particularly when a model trained in one domain is used directly in a new domain.

Gretton et al. proposed Maximum Mean Discrepancy (MMD) to measure the discrepancy between two different domains. MMD is essentially the supremum of the difference between the expectations of the two data distributions after they have been mapped by a function, and it is a very effective way to measure the distance between two distributions. Given two distributions s and t, the MMD is defined as follows,

$$ \begin{array}{@{}rcl@{}} MMD^{2}(s,t)=\sup_{{\Vert \phi\Vert}_{\mathcal{H}} \le 1}\Vert E_{\mathrm{x}^{s}\sim s}[\phi(\mathrm{x}^{s})]-E_{\mathrm{x}^{t} \sim t}[\phi(\mathrm{x}^{t})]\Vert_{\mathcal{H}}^{2}, \end{array} $$

(1)

where ϕ represents the kernel-induced mapping of the original data to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, and $\Vert \phi \Vert _{\mathcal{H}} \le 1$ defines a set of functions in the unit ball of the RKHS.
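In practice, MMD is estimated from finite samples through the kernel trick, which replaces the explicit mapping by kernel evaluations between samples. The following is a minimal NumPy sketch of the standard biased empirical estimate of the squared MMD. It assumes a Gaussian RBF kernel with a fixed bandwidth parameter `gamma`; the function names and the synthetic data are illustrative choices and are not taken from any of the cited methods.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2) for all row pairs of a and b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def mmd2_biased(xs, xt, gamma=1.0):
    """Biased empirical estimate of MMD^2 between source samples xs and target samples xt."""
    k_ss = rbf_kernel(xs, xs, gamma)
    k_tt = rbf_kernel(xt, xt, gamma)
    k_st = rbf_kernel(xs, xt, gamma)
    return k_ss.mean() + k_tt.mean() - 2.0 * k_st.mean()


# Synthetic 2-D features: one target set shares the source distribution,
# the other is shifted (a toy stand-in for domain shift).
rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=(200, 2))
xt_same = rng.normal(0.0, 1.0, size=(200, 2))
xt_shift = rng.normal(2.0, 1.0, size=(200, 2))

mmd_same = mmd2_biased(xs, xt_same)
mmd_shift = mmd2_biased(xs, xt_shift)
```

The estimate for the shifted target set is clearly larger than for the target set drawn from the source distribution, which is the property that discrepancy-based methods exploit when they use MMD as a training loss.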

Based on MMD, Tzeng et al. [87] proposed a new network structure called deep domain confusion (DDC) to solve the DA problem. An adaptation layer is added between the feature layers of the shared weight network. This layer takes MMD between the source and target domain features as a loss and reduces the discrepancy between the source and target domains by minimizing MMD. The MMD also determines the location of the adaptation layer. According to [100], the adaptation layer is more useful at higher layers of features because lower layer features are usually general features that do not carry a higher level of discrimination. Therefore, the adaptation layer of the DDC is placed after fc7. The network architecture of DDC is shown in Fig. 3.

Based on DDC, Long et al. [52] proposed the Deep Adaptation Network (DAN), which differs from DDC in two main ways. 1) Only one layer is adapted in DDC, whereas multiple layers are adapted in DAN. 2) Only a single kernel function is used in DDC, whereas a weighted combination of multiple kernels is used in DAN. DAN improves the performance through multilayer adaptation and Multi-Kernel Maximum Mean Discrepancy (MK-MMD) [34]. In 2016, Long et al. proposed the Residual Transfer Network (RTN), inspired by residual networks. The classifier layer of RTN connects the source classifier and the target classifier end-to-end. However, the above models assume that the conditional distributions in the two domains are consistent. In real-world scenarios, this assumption is too strong. For this reason, further research by Long et al. [53] proposed the Joint Adaptation Network (JAN), which aligns the joint distributions of the source and target domains using the classification loss and the Joint Maximum Mean Discrepancy (JMMD) as the loss function.

Because of the enormous computational effort required to compute the MK-MMD, Sun et al. [81] utilized CORAL loss to measure the distance between the two domains. Moreover, it can be seamlessly integrated into different layers or architectures. CORAL loss is defined as the distance between the second-order statistics (covariance) of the source and target domain features.

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{CORAL}=\frac{1}{4d^{2}}\Vert C_{S}-C_{T}{\Vert_{F}^{2}}, \end{array} $$

(2)

where $\Vert \cdot {\Vert _{F}^{2}}$ denotes the squared matrix Frobenius norm, d is the dimension of the features, and C_S and C_T denote the covariance matrices of the source and target data, respectively. The goal of domain adaptation is achieved by optimizing the classification loss and the Correlation Alignment (CORAL) loss simultaneously. Zellinger et al. [102] proposed the Central Moment Discrepancy (CMD) based on MMD and KL divergence. CMD consists of a vector of empirical expectations and vectors of k-th order sample central moments. In simple words, if the probability distributions of the source and target samples are similar, then their central moments of each order are also similar, and the more similar the sample probability distributions are, the smaller the value of CMD. CMD contains higher-order moment information than KL divergence and requires less computation than MMD because there is no need to compute the kernel matrix. Unlike CMD, which matches higher-order central moments, Higher-order Moment Matching (HoMM) [16] matches higher-order moment tensors, because a higher-order moment tensor contains more information and represents feature distributions better. HoMM can match moment tensors of arbitrary order; first-order and second-order HoMM are equivalent to MMD and CORAL, respectively. Third- and fourth-order moment tensor matching helps achieve global alignment, as higher-order statistics can be adapted to more complex non-Gaussian distributions. The final objective function of HoMM is as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}=\mathcal{L}_{s}+{\lambda_{d}\mathcal{L}_{d}}+{\lambda_{dc}\mathcal{L}_{dc}}, \end{array} $$

(3)

where ${\mathscr{L}}_{s}$ is the classification loss in the source domain, ${\mathscr{L}}_{d}$ is the domain discrepancy loss measured by the higher-order moment matching, and ${\mathscr{L}}_{dc}$ denotes the discriminative clustering loss. Note that, to obtain reliable pseudo-labels for clustering, λ_dc is set to 0 in the initial iterations, and the clustering loss is enabled only after the total loss has stabilized. The domain discrepancy loss can be given as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{d}=\frac{1}{b^{2}}\sum\limits_{i=1}^{b}\sum\limits_{j=1}^{b}k({\pmb h}_{sp}^{i},{\pmb h}_{sp}^{j})-\frac{2}{b^{2}}\sum\limits_{i=1}^{b}\sum\limits_{j=1}^{b}k({\pmb h}_{sp}^{i},{\pmb h}_{tp}^{j})+\frac{1}{b^{2}}\sum\limits_{i=1}^{b}\sum\limits_{j=1}^{b}k({\pmb h}_{tp}^{i},{\pmb h}_{tp}^{j}), \end{array} $$

(4)

where b is the batch size, $k(\pmb {x,y})=\exp (-\gamma \Vert {\pmb {x-y}}\Vert _{2})$ is the RBF kernel function, and ${\pmb h}_{sp}^{i}$ denotes a randomly sampled value in the p-level tensor. The discriminative clustering loss can be given as,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{dc}=\frac{1}{n_{t}}\sum\limits_{i=1}^{n_{t}}\Vert {\pmb h}_{t}^{i}-{\pmb c}_{\hat {y_{t}^{i}}}{\Vert_{2}^{2}}, \end{array} $$

(5)

where ${\hat {y_{t}^{i}}}$ is the pseudo-label assigned to ${x_{t}^{i}}$, and ${\pmb c}_{\hat {y_{t}^{i}}}\in {{\mathbbm R}^{L}}$ denotes its estimated class center. HoMM uses group moment matching and random sampling matching to perform compact tensor matching. Li et al. [48] introduced the attention mechanism in domain adaptation. This mechanism can simulate the independence between source and target convolution channels, and it facilitates the alignment of cross-domain features.
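Two of the losses in this part are simple enough to write out directly. The NumPy sketch below evaluates the CORAL loss of Eq. (2) and the discriminative clustering loss of Eq. (5) on feature matrices whose rows are samples. It assumes that C_S and C_T are the usual unbiased sample covariance matrices; the function names, array shapes, and toy data are illustrative assumptions, not the reference implementation of either method.

```python
import numpy as np


def coral_loss(hs, ht):
    """Eq. (2): squared Frobenius distance between covariances, scaled by 1 / (4 d^2)."""
    d = hs.shape[1]                    # feature dimension
    c_s = np.cov(hs, rowvar=False)     # d x d source covariance
    c_t = np.cov(ht, rowvar=False)     # d x d target covariance
    return np.sum((c_s - c_t) ** 2) / (4.0 * d ** 2)


def discriminative_clustering_loss(ht, pseudo_labels, centers):
    """Eq. (5): mean squared distance of target features to their pseudo-label centers."""
    diffs = ht - centers[pseudo_labels]  # row i uses the center of its pseudo-label
    return np.mean(np.sum(diffs ** 2, axis=1))


# Toy features: the target features have a larger spread than the source ones.
rng = np.random.default_rng(1)
hs = rng.normal(size=(64, 3))
ht = 2.0 * rng.normal(size=(64, 3))
loss_coral = coral_loss(hs, ht)

# Two hypothetical class centers and two target features with pseudo-labels 0 and 1.
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
feats = np.array([[0.0, 0.0], [2.0, 0.0]])
loss_dc = discriminative_clustering_loss(feats, np.array([0, 1]), centers)
```

In the toy clustering example the first feature sits on its center and the second lies at distance 1 from its center, so the loss is (0 + 1)/2 = 0.5. In training, either term would be added to the source classification loss with a trade-off weight, as in Eq. (3).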

2) Adversarial-Based methods: Unlike the previous Discrepancy-Based methods, the basic idea of the adversarial approach is a minimax game. The game usually takes place between the domain discriminator and the feature extractor. The domain discriminator identifies whether an instance comes from the target domain. The purpose of the feature extractor is to extract features that can fool the domain discriminator. Training alternates between the domain discriminator and the feature extractor until the whole model converges.

Adversarial domain adaptation networks with generators generally synthesize the source data with labels (or pseudo-labels) into the target data and keep the labels (or pseudo-labels). The synthesized target data is then used to train the network. Unsupervised pixel-level domain adaptation (PixelDA) [5] employed a pixel-space cross-domain transformation to achieve domain adaptation. Unlike classical GANs, the input to PixelDA contains not only noise vectors but also source images. An almost infinite amount of training data can be synthesized using the noise vector and the source image. The PixelDA model maps the source domain image to the target domain image at the pixel level. It allows the architecture of a particular task to be changed without having to retrain the domain adaptation component. However, the downside of PixelDA is that it can only deal with the low-level differences between the source domain and the target domain, mainly noise, resolution, lighting, and color. If the object type changes, the resulting geometric changes are difficult to deal with. Rather than using GANs as a data augmentation step as before, Sankaranarayanan et al. [75] utilized GANs to obtain rich gradient information that bridges the gap between the source and target domains. The joint adversarial-discriminative approach transfers the information of the target distribution to the learned embedding using a generator-discriminator pair (Fig. 4).

Saito et al. [73] proposed a novel adversarial alignment technique to avoid misclassification of samples near the decision boundary. The model is composed of a feature extractor G and a classifier C. Different from previous work, this classifier also acts as a discriminator. C classifies an input x into one of K classes, assigning class j the probability $p(y=j|x)=\frac {\exp (l_{j})}{{\sum }_{k=1}^{K} \exp (l_{k})}$. In the adversarial training on mixed-domain samples, the discriminator can detect the instances close to the boundary. The feature extractor pushes these samples away from the boundary to generate discriminative domain-invariant features. The goal of Adversarial Dropout Regularization (ADR) is to learn G and C by solving the optimization problem:

$$ \begin{array}{@{}rcl@{}} \min \limits_{G,C} L(X_{s},Y_{s})&=&-\mathbbm E_{(x_{s},y_{s})\sim(X_{s},Y_{s})} \sum\limits_{k=1}^{K} {\mathbbm 1}_{[k=y_{s}]}\log C(G(x_{s}))_{k}, \end{array} $$

(6)

$$ \begin{array}{@{}rcl@{}} &&\max \limits_{G} \min \limits_{C} L(X_{s},Y_{s})-L_{adv}(X_{t}), \end{array} $$

(7)

$$ \begin{array}{@{}rcl@{}} L_{adv}(X_{t})&=&\mathbbm E_{x_{t} \sim X_{t}}[d(C_{1}(G(x_{t})),C_{2}(G(x_{t})))], \end{array} $$

(8)

$$ \begin{array}{@{}rcl@{}} d(p_{1},p_{2})&=& \frac{1}{2}(D_{kl}(p_{1}|p_{2})+D_{kl}(p_{2}|p_{1})), \end{array} $$

(9)

where L(X_s,Y_s) is the standard classification loss, L_adv(X_t) is the discrepancy loss between C1 and C2, d(p1,p2) represents the difference between p1 and p2, and D_kl() is the KL divergence. Inspired by VAE-GAN [45], Xu et al. [97] proposed Adversarial Domain Adaptation with Domain Mixup (DM-ADA), which makes the source and target domains consistently distributed through a VAE and discriminators. The pixel-level and feature-level domain mixup and well-designed soft domain labels improve the generalization capability. The classifier is optimized with the cross-entropy loss. Namely, the source domain classifier loss is ${\mathscr{L}}_{C}=-\mathbf {E}_{x^{s} \sim P_{s}} {\sum }_{i=1}^{K} {y_{i}^{s}} \log \left (C\left (\left [ \cdot \right ]\right )\right )$, where K is the number of classes. Chen et al. [17] combined domain adversarial learning with self-learning to propose the Adversarial-Learned Loss for Domain Adaptation (ALDA). A confusion matrix is used to eliminate (or reduce) the effect of noise in the pseudo-labels. In contrast to ordinary domain adversarial learning, this adversarial loss incorporates classifier predictions and label information into the optimization. In this way, the model enables level-by-level feature alignment, and the noise-corrected loss can align the features between the source and target domains. According to the theory of [4], the expected error on the target samples can be bounded by the expected error in the source domain and the difference in features between the domains. Therefore, the expected target error of ALDA is theoretically bounded.
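The discrepancy terms of Eqs. (8) and (9) can be illustrated with a short NumPy sketch. It treats the outputs of C1 and C2 as two arrays of logits for the same target batch and averages the symmetric KL divergence of their softmax outputs. How the two classifiers are obtained is deliberately left out; the function names and the clipping constant `eps` are assumptions made for numerical safety in this sketch.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def kl(p, q, eps=1e-12):
    """Row-wise KL divergence D_kl(p | q) between probability vectors."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=1)


def symmetric_kl(p1, p2):
    """Eq. (9): average of the two directed KL divergences."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))


def adversarial_loss(logits_c1, logits_c2):
    """Eq. (8): mean discrepancy between two classifier outputs on a target batch."""
    return symmetric_kl(softmax(logits_c1), softmax(logits_c2)).mean()


# Toy logits for a batch of 8 target samples and 5 classes.
rng = np.random.default_rng(2)
logits_a = rng.normal(size=(8, 5))
logits_b = logits_a + rng.normal(scale=0.5, size=(8, 5))
l_adv = adversarial_loss(logits_a, logits_b)
```

The loss is zero when the two classifiers agree exactly and grows as their predictions diverge, so a large value flags target samples near the decision boundary.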

The key to adversarial domain adaptation networks without generators is to learn domain-invariant representations from the source and target samples. These representations are used to deceive the classifier (discriminator), and domain confusion losses are introduced to improve the performance of the network. Domain Adaptive Neural Network (DANN) was first proposed at the 2014 Pacific Rim AI Conference. Ganin et al. [28] formally proposed DANN to address the unsupervised domain adaptation (UDA) problem. DANN can be easily implemented with standard deep learning frameworks. DANN comprises the feature extractor G_f(⋅;𝜃_f), the label predictor G_y(⋅;𝜃_y), the domain classifier G_d(⋅;𝜃_d), and the gradient reversal layer (GRL). It uses the GRL during backpropagation training so that the distributions of the source and target domains become consistent. The optimization goal of the entire network has two components, minimizing the source domain classification error and maximizing the domain classification error, with λ introduced as a trade-off parameter. Generally speaking, aligning the source domain and the target domain means mapping the target domain to the source domain and then classifying the target domain through a classifier trained on the source domain. However, Adversarial Discriminative Domain Adaptation (ADDA) [86] maps both the source domain and the target domain to a shared space, reduces the distance between them, and then uses a trained classifier to classify the mapped target domain. ADDA minimizes the source and target representation distance by iteratively minimizing the following functions, which are most similar to the original GAN objective:

$$ \begin{array}{@{}rcl@{}} \min \limits_{M^{s},C}\mathcal{L}_{cls}(X^{s},Y^{s})=-\mathbb{E}_{(x^{s},y^{s})\sim(X^{s},Y^{s})} \sum \limits_{k=1}^{K}{\mathbbm 1}_{[k=y^{s}]}\log C(M^{s}(x^{s})), \end{array} $$

(10)

$$ \begin{array}{@{}rcl@{}} \min \limits_{D}\mathcal{L}_{advD}(X^{s},X^{t},M^{s},M^{t})= -\mathbb{E}_{(x^{s})\sim(X^{s})}[\log D(M^{s}(x^{s}))]\\ -\mathbb{E}_{(x^{t})\sim(X^{t})}[\log (1-D(M^{t}(x^{t})))], \end{array} $$

(11)

$$ \begin{array}{@{}rcl@{}} \min \limits_{M^{s},M^{t}}\mathcal{L}_{advM}(M^{s},M^{t})=-\mathbb{E}_{(x^{t})\sim(X^{t})}[\log D(M^{t}(x^{t}))], \end{array} $$

(12)

where the mappings M^s and M^t are learned from the source data X^s and target data X^t, and C represents a classifier working on the source domain. ${\mathscr{L}}_{cls}$ is optimized by training the source model using the labeled source data. ${\mathscr{L}}_{advD}$ is minimized to train the discriminator, while ${\mathscr{L}}_{advM}$ is minimized to learn a representation that is domain invariant. After ADDA, Multi-Adversarial Domain Adaptation (MADA) [65] utilizes multiple class-wise domain discriminators, and this change may bring three benefits. First, it avoids rigidly assigning each point to only one domain discriminator, similar to using soft labels to increase information. Second, it avoids negative transfer because each point is only aligned with the most relevant class, and irrelevant classes are filtered out. Third, through weighting, these domain discriminators with different parameters facilitate the positive transfer of each instance. This structure also has an obvious shortcoming: a discriminator must be specified for each class, so the computational cost is quite high. The structure of MADA is shown in Fig. 5. To make the classification on the target domain more precise, the DADA [85] proposed by Tang et al. makes the joint distribution alignment between the two domains more explicit. They proposed a target loss based on the design of an integrated classifier by using domain prediction weighted by conditional category probabilities. The entropy minimization principle was used for the regularization term. Inspired by Wasserstein GAN, Shen et al. [77] proposed a novel approach to learn domain-invariant feature representations, namely Wasserstein Distance Guided Representation Learning (WDGRL). WDGRL utilizes neural networks to estimate the empirical Wasserstein distance between the source and target samples and optimizes the feature extractor network to minimize the estimated Wasserstein distance.
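To make the two adversarial objectives of ADDA concrete, the sketch below evaluates Eqs. (11) and (12) on discriminator outputs, i.e., the probabilities D(M^s(x^s)) and D(M^t(x^t)) that a feature comes from the source domain. It only computes the loss values in NumPy; the encoders, the discriminator network, and the alternating optimization are omitted, and the function names and the small constant `eps` are assumptions of this sketch.

```python
import numpy as np


def discriminator_loss(d_src, d_tgt, eps=1e-12):
    """Eq. (11): the discriminator should output 1 on source and 0 on target features."""
    return -np.mean(np.log(d_src + eps)) - np.mean(np.log(1.0 - d_tgt + eps))


def mapping_loss(d_tgt, eps=1e-12):
    """Eq. (12): the target mapping tries to make the discriminator output 1 on target features."""
    return -np.mean(np.log(d_tgt + eps))


# A discriminator that separates the domains perfectly ...
perfect_src = np.ones(4)
perfect_tgt = np.zeros(4)
loss_d_perfect = discriminator_loss(perfect_src, perfect_tgt)

# ... versus one that is fully confused and outputs 0.5 everywhere.
confused = np.full(4, 0.5)
loss_d_confused = discriminator_loss(confused, confused)
loss_m_confused = mapping_loss(confused)
```

A perfect discriminator drives Eq. (11) towards 0. A fully confused one yields 2 log 2 for Eq. (11) and log 2 for Eq. (12), which is the equilibrium that the alternating updates aim for.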

Fang et al. [25] introduced a perturbation function in the label classifier to simulate changes in the distribution of labels in different domains, and inserted a ResNet to learn the perturbation function. By learning the perturbation function, the label classifier becomes more robust and accurate. In addition, the joint distribution of image features and class labels is used to align the source and target domains to obtain a more robust and discriminative feature representation. An intuitive illustration is shown in Fig. 6, where the target classifier with the perturbation function correctly classifies image samples from the target domain.

3) Reconstruction-Based Approaches: The goal of the reconstruction-based approach is to extract domain-invariant representations. The Deep Reconstruction Classification Network (DRCN) proposed by Ghifary et al. [31] learns a shared encoding representation. The DRCN is a CNN architecture that combines two pipelines with a shared encoder, which can be considered a feature extractor. The first pipeline performs supervised classification on the source domain, while the second pipeline performs unsupervised reconstruction on the target domain. Domain separation networks (DSNs) [6] model the private and shared components of the domain representations. They use a scale-invariant mean squared error reconstruction loss.

4.2 Heterogeneous domain adaptation

In a heterogeneous domain adaptation scenario, the relationship between the source domain and the target domain can take several forms. When domain adaptation is applied to a real-world scenario, the source domain and the target domain are likely to differ substantially. We divide heterogeneous domain adaptation into three categories based on the categories shared by the source domain and the target domain. If the categories of the target domain are included in those of the source domain (L_t ⊂ L_s), the setting is Partial DA. If the categories of the source domain are included in those of the target domain (L_s ⊂ L_t), the setting is Open set DA. If only part of the categories is shared between the source domain and the target domain, the setting is Open-Partial DA. Open-Partial DA is not studied separately here; it is integrated into Universal DA, and Zero-shot DA is added as an additional special case (Table 5).
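The label-set relations above can be written as a small decision rule. The following is a minimal sketch; the function name `classify_setting` and the example label sets are hypothetical, and the closed-set branch is included only for completeness.

```python
def classify_setting(source_labels, target_labels):
    """Name the DA setting implied by the relation between two label sets."""
    s, t = set(source_labels), set(target_labels)
    if s == t:
        return "Closed-set DA"
    if t < s:  # target label set strictly contained in source label set
        return "Partial DA"
    if s < t:  # source label set strictly contained in target label set
        return "Open set DA"
    if s & t:  # label sets overlap, but neither contains the other
        return "Open-Partial DA"
    return "no shared classes"


# Hypothetical label sets for illustration.
setting = classify_setting({"bike", "car", "dog"}, {"bike", "car"})
```

Universal DA corresponds to the case where this relation is not known in advance, so a method cannot branch on it.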

Table 5 Classification of heterogeneous DA


Partial domain adaptation

With the advent of the Big Data era, the label set of the target domain is usually included in that of the source domain. Approaches designed for homogeneous domain adaptation perform poorly in this setting, whereas the Partial DA proposed by Long et al. applies to this more general case. The label set of the target domain is contained in that of the source domain (i.e., the target domain label space is just a subspace of the source domain label space). In this case, there is an obvious problem: labels (or samples) that exist only in the source domain will cause negative transfer.

Cao et al. [11] proposed Selective Adversarial Networks (SAN) to address partial transfer learning from large-scale domains to small-scale domains. The architecture of SAN is shown in Fig. 7. The final objective of SAN is,

$$ \begin{array}{@{}rcl@{}} C(\theta_{f},\theta_{y},{\theta_{d}^{k}}|_{k=1}^{\vert \mathcal{C}_{s}\vert}) = \frac{1}{n_{s}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{s}}L_{y}(G_{y}(G_{f}(\mathrm{x}_{i})),y_{i}) +\frac{1}{n_{t}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{t}}H(G_{y}(G_{f}(\mathrm{x}_{i}))) \\-\frac{\lambda}{n_{s}+n_{t}}\sum\limits_{k=1}^{\vert \mathcal{C}_{s}\vert}[(\frac{1}{n_{t}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{t}}\hat {y_{i}^{k}}) \times (\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{s} \cup \mathcal{D}_{t}}\hat {y_{i}^{k}}{L_{d}^{k}}({G_{d}^{k}}(G_{f}(\mathrm{x}_{i})),d_{i}))], \end{array} $$

(13)

where λ is a hyper-parameter that trades off the two objectives in the unified optimization problem. H(⋅) is the conditional-entropy loss functional, $H(G_{y}(G_{f}(\mathrm {x}_{i}))) = -{\sum }_{k=1}^{\vert \mathcal {C}_{s}\vert } \hat {y_{i}^{k}}\log \hat {y_{i}^{k}}$. SAN down-weights the domain discriminators responsible for the outlier source classes as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{d} = \frac{1}{n_{s}+n_{t}}\sum\limits_{k=1}^{\vert \mathcal{C}_{s}\vert}[(\frac{1}{n_{t}}\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{t}}\hat {y_{i}^{k}}) \times (\sum\limits_{\mathrm{x}_{i} \in \mathcal{D}_{s} \cup \mathcal{D}_{t}}\hat {y_{i}^{k}}{L_{d}^{k}}({G_{d}^{k}}(G_{f}(\mathrm{x}_{i})),d_{i}))]. \end{array} $$

(14)
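To make the two levels of weighting in (14) concrete, here is a minimal sketch that evaluates the loss on toy values. It assumes a binary cross-entropy for each $L_{d}^{k}$ and the convention d = 1 for source and d = 0 for target samples; the function names and inputs are hypothetical.

```python
import math


def bce(p, d):
    """Binary cross-entropy of a discriminator output p in (0, 1) against domain label d."""
    return -(d * math.log(p) + (1 - d) * math.log(1 - p))


def san_discriminator_loss(class_probs, disc_outputs, domains):
    """Evaluate (14): class-level weight times instance-weighted discriminator losses.

    class_probs[i][k]  : predicted probability of class k for sample i
    disc_outputs[i][k] : output of the k-th domain discriminator for sample i
    domains[i]         : assumed convention, 1 for source and 0 for target
    """
    n = len(domains)
    target_idx = [i for i, d in enumerate(domains) if d == 0]
    num_classes = len(class_probs[0])
    total = 0.0
    for k in range(num_classes):
        # Class-level weight: average target prediction for class k.
        class_weight = sum(class_probs[i][k] for i in target_idx) / len(target_idx)
        # Instance-level weighting: each sample contributes in proportion to its class-k probability.
        weighted = sum(class_probs[i][k] * bce(disc_outputs[i][k], domains[i]) for i in range(n))
        total += class_weight * weighted
    return total / n


# One source sample and one target sample, two source classes (toy values).
class_probs = [[0.5, 0.5], [1.0, 0.0]]
domains = [1, 0]
loss_a = san_discriminator_loss(class_probs, [[0.5, 0.5], [0.5, 0.5]], domains)
loss_b = san_discriminator_loss(class_probs, [[0.5, 0.9], [0.5, 0.1]], domains)
```

Because no target sample is predicted as the second class, its discriminator receives zero class-level weight, so `loss_a` and `loss_b` coincide.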

The optimization problem is to find the network parameters $\hat {\theta }_{f}$, $\hat {\theta }_{y}$ and $\hat {\theta _{d}^{k}}(k=1,2,\dots ,\vert \mathcal {C}_{s}\vert )$ that satisfy the following functions,

$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{f},\hat{\theta}_{y}) = \arg \min_{\theta_{f},\theta_{y}} C(\theta_{f},\theta_{y},{\theta_{d}^{k}}|_{k=1}^{\vert \mathcal{C}_{s}\vert}), \end{array} $$

(15)

$$ \begin{array}{@{}rcl@{}} (\hat {\theta_{d}^{1}},\dots,\hat{\theta}_{d}^{\vert \mathcal{C}_{s}\vert})= \arg \max_{{\theta_{d}^{1}},\dots,\theta_{d}^{\vert \mathcal{ C}_{s}\vert}}C(\theta_{f},\theta_{y},{\theta_{d}^{k}}|_{k=1}^{\vert \mathcal{C}_{s}\vert}). \end{array} $$

(16)

SAN reduces the negative transfer caused by categories that do not belong to the target domain by weighting both the instances and the categories. SAN is a preliminary solution to Partial DA: it circumvents negative transfer by filtering out the outlier source classes ${\mathcal {C}_{s}\backslash \mathcal {C}_{t}}$ and, at the same time, promotes positive transfer by maximally matching the data distributions ${p_{\mathcal {C}_{t}}}$ and q in the shared label space $\mathcal {C}_{t}$. Building on SAN, Cao et al. [12] proposed Partial Adversarial Domain Adaptation (PADA). The architecture of PADA is shown in Fig. 8, and its objective is,

$$ \begin{array}{@{}rcl@{}} C(\theta_{f},\theta_{y},\theta_{d})=\frac{1}{n_{s}}\sum\limits_{\mathrm{x}_{i}\in \mathcal{D}_{s}}\gamma_{y_{i}}L_{y}(G_{y}(G_{f}(\mathrm{x}_{i})),y_{i}) \\-\frac {\lambda}{n_{s}}\sum\limits_{\mathrm{x}_{i}\in \mathcal{D}_{s}}\gamma_{y_{i}}L_{d}(G_{d}(G_{f}(\mathrm{x}_{i})),d_{i}) \\-\frac {\lambda}{n_{t}}\sum\limits_{\mathrm{x}_{i}\in \mathcal{D}_{t}}L_{d}(G_{d}(G_{f}(\mathrm{x}_{i})),d_{i}), \end{array} $$

(17)

$$ \begin{array}{@{}rcl@{}} \gamma = \frac{1}{n_{t}}\sum\limits_{i=1}^{n_{t}} \hat{\textup{y}}_{i}, \end{array} $$

(18)

where γ is a $\vert \mathcal {C}_{s}\vert $-dimensional weight vector quantifying the contribution of each source class, y_i is the ground truth label of source point x_i while $\gamma_{y_{i}}$ is the corresponding class weight, and λ is a hyper-parameter that trades off the source label classifier and the partial adversarial domain discriminator in the optimization problem. PADA averages the label predictions over all target data to reduce the effect of errors in individual predictions.
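The class weight vector in (18) is simply the average of the predicted label distributions over the target data. The sketch below computes it on toy predictions; the function name and the numbers are hypothetical.

```python
def pada_class_weights(target_predictions):
    """Average the softmax predictions over all target samples, as in (18)."""
    n_t = len(target_predictions)
    num_classes = len(target_predictions[0])
    return [sum(p[k] for p in target_predictions) / n_t for k in range(num_classes)]


# Two target samples, three source classes (assumed softmax outputs).
target_predictions = [[0.8, 0.2, 0.0], [0.6, 0.4, 0.0]]
gamma = pada_class_weights(target_predictions)
```

Here the third source class is never predicted on the target data, so its weight is zero and it drops out of both the weighted classification term and the weighted source discriminator term in (17).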

Building on DANN, Zhang et al. [105] proposed a two-domain-classifier strategy named Importance Weighted Adversarial Nets (IWAN) to solve partial DA. The network consists of two feature extractors, F_s and F_t, and two domain classifiers, D and D₀. The network architecture is shown in Fig. 9, where the green parts are the feature extractors for the source and target domains. The overall objective of the weighted adversarial nets-based method is,

$$ \begin{array}{@{}rcl@{}} \min_{F_{s},C}\mathcal{L}_{s}(F_{s},C)= -\mathbb E_{\mathrm{x},y\sim p_{s}(\mathrm{x},y)}\sum\limits_{k=1}^{K} \mathbbm 1_{[k=y]}\log C(F_{s}(\mathrm{x})), \end{array} $$

(19)

$$ \begin{array}{@{}rcl@{}} \min_{D}\mathcal{L}_{D}(D,F_{s},F_{t})=-(\mathbb E_{\mathrm{x}\sim p_{s}(\text{x})}[\log D(F_{s}(\mathrm{x}))]\\ +\mathbb E_{\mathrm{x}\sim p_{t}(\text{x})}[\log (1-D(F_{t}(\mathrm{x})))]), \\ \min_{F_{t}}\max_{D_{0}}\mathcal{L}_{w}(C,D_{0},F_{s},F_{t})=\gamma \mathbb E_{\mathrm{x}\sim p_{t}(\mathrm{x})}H(C(F_{t}(\mathrm{x})))\\ +\lambda (\mathbb E_{\mathrm{x}\sim p_{s}(\mathrm{x})}[w(\mathrm{z})\log D_{0}(F_{s}(\mathrm{x}))]\\ +\mathbb E_{\mathrm{x}\sim p_{t}(\text{x})}[\log (1-D_{0}(F_{t}(\mathrm{x})))]), \end{array} $$

(20)

where γ and λ are trade-off parameters, ${\mathscr{L}}_{s}$ is the loss of the source domain classifier, ${\mathscr{L}}_{D}$ is the loss of domain classifier D, and ${\mathscr{L}}_{w}$ is the sum of the loss of domain classifier D₀ and the entropy of the target class predictions. The objectives are optimized in stages. F_s and C are pre-trained on the source domain data and fixed afterwards. Then D, D₀ and F_t are optimized simultaneously without revisiting F_s and C. The relative importance of a source sample is given by $w(\mathbf {z})=\frac {\tilde {w}(\mathbf {z})}{\mathbb {E}_{\mathbf {z} \sim p_{s}(\mathbf {z})} \tilde {w}(\mathbf {z})}$, where $\tilde {w}(\mathbf {z})=\frac {1}{\frac {p_{s}(\mathbf {z})}{p_{t}(\mathbf {z})}+1}$. The essence of IWAN is to reduce the Jensen-Shannon divergence between the weighted source data distribution and the target data distribution in the feature space. Cao et al. [13] proposed the Example Transfer Network (ETN), which gradually reduces the weight of irrelevant source examples from non-shared categories on the source classifier and employs a domain classifier to quantify the transferability of instances. The architecture of ETN is shown in Fig. 10. The goal of the ETN model is to find saddle-point solutions $\hat {\theta }_{f}$, $\hat {\theta }_{y}$, $\hat {\theta }_{d}$ and $\hat {\theta }_{\tilde y}$ of the model parameters as follows,

$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{f},\hat{\theta}_{y})=\arg \min_{\theta_{f},\theta_{y}}\mathbb E_{G_{y}}-\mathbb E_{G_{d}}, \end{array} $$

(21)

$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{d})=\arg \min_{\theta_{d}}\mathbb E_{G_{d}}, \end{array} $$

(22)

$$ \begin{array}{@{}rcl@{}} (\hat{\theta}_{\tilde y})= \arg \min_{\theta_{\tilde y}}\mathbb E_{\tilde G_{y}}+\mathbb E_{\tilde G_{d}}, \end{array} $$

(23)

$$ \begin{array}{@{}rcl@{}} \mathbb E_{G_{y}} = \frac{1}{n_{s}}\sum\limits_{i=1}^{n_{s}}w(\mathrm{x}_{i}^{s})L(G_{y}(G_{f}(\mathrm{x}_{i}^{s})),\mathrm{y}_{i}^{s}) \\+\frac{\gamma}{n_{t}}\sum\limits_{j=1}^{n_{t}}H(G_{y}(G_{f}(\mathrm{x}_{j}^{t}))), \end{array} $$

(24)

$$ \begin{array}{@{}rcl@{}} \mathbb E_{G_{d}} = -\frac{1}{n_{s}}\sum\limits_{i=1}^{n_{s}}w(\mathrm{x}_{i}^{s})\log(G_{d}(G_{f}(\mathrm{x}_{i}^{s}))) \\-\frac{1}{n_{t}}\sum\limits_{j=1}^{n_{t}}\log(1-G_{d}(G_{f}(\mathrm{x}_{j}^{t}))), \end{array} $$

(25)

$$ \begin{array}{@{}rcl@{}} \mathbb E_{\tilde G_{y}}=-\frac{\lambda}{n_{s}}\sum\limits_{i=1}^{n_{s}}\sum\limits_{c=1}^{\vert \mathcal{C}_{s}\vert}[y_{i,c}^{s}\log {\tilde G}_{y}^{c}(G_{f}(\mathrm{x}_{i}^{s})) \\+(1-y_{i,c}^{s})\log (1-{\tilde G}_{y}^{c} (G_{f}(\mathrm{x}_{i}^{s})))], \end{array} $$

(26)

$$ \begin{array}{@{}rcl@{}} \mathbb E_{\tilde G_{d}}=-\frac{1}{n_{s}}\sum\limits_{i=1}^{n_{s}}\log({\tilde G_{d}}(G_{f}(\mathrm{x}_{i}^{s}))) \\-\frac{1}{n_{t}}\sum\limits_{j=1}^{n_{t}}\log(1-{\tilde G_{d}}(G_{f}(\mathrm{x}_{j}^{t}))), \end{array} $$

(27)

where $w(\mathrm {x}_{i}^{s})=1-{\tilde G}_{d}(G_{f}(\mathrm {x}_{i}^{s}))$ is the weight of each source example $\mathrm {x}_{i}^{s}$, which quantifies the example’s transferability, and γ is a trade-off parameter. Equations (24) and (25) constitute the transferability weighting framework. In (26) and (27), the auxiliary label predictor ${\tilde G}_{y}$ (with a leaky-softmax activation function) and the auxiliary domain discriminator ${\tilde G}_{d}$ are trained with both label information and domain information, which resolves the ambiguity between shared and unshared classes. ETN can therefore derive more accurate and discriminative weights to quantify the transferability of each source example. The accuracy of the above networks is listed in Table 6.
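The transferability weighting can be illustrated on toy values. The sketch below computes $w = 1-{\tilde G}_{d}$ for each source example and uses it in the first term of (24), assuming L is the cross-entropy; the function names and numbers are hypothetical.

```python
import math


def transferability_weights(aux_domain_probs):
    """w(x) = 1 - G~_d(G_f(x)) for each source example."""
    return [1.0 - p for p in aux_domain_probs]


def weighted_source_loss(weights, true_class_probs):
    """First term of (24), assuming L is the cross-entropy.

    true_class_probs[i] is the probability the label predictor assigns
    to the ground-truth class of source example i.
    """
    n_s = len(weights)
    return sum(w * -math.log(p) for w, p in zip(weights, true_class_probs)) / n_s


# Assumed auxiliary discriminator outputs for two source examples:
# the first looks source-only (likely an outlier class), the second looks target-like.
weights = transferability_weights([0.9, 0.2])
loss = weighted_source_loss(weights, [0.5, 0.5])
```

The first example contributes far less to the classification loss than the second, which is how outlier source classes are gradually suppressed.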

Table 6 Accuracy (%) of different Partial domain adaptation methods on the Office-31 datasets


Open set domain adaptation

The Open set DA case is the opposite of the Partial DA case mentioned above: the source domain label set is only a fraction of the target domain label set. Specifically, in the Open set DA problem the target domain contains all the classes in the source domain. Target data from the known classes (common to both target and source domains) must be classified correctly, and target data from all unknown classes (present only in the target domain) are classified as “unknown” because no information about these classes is available.

Busto et al. [63] first proposed this problem scenario. In practice, the label sets of the source domain and the target domain usually only intersect, rather than coinciding as assumed in the closed-set setting (Fig. 11).

The method used in [63] projects the target domain and source domain into the same space based on distance and then classifies the target samples with an SVM. In the unsupervised scenario, the objective function to be optimized is as follows,

$$ \begin{array}{@{}rcl@{}} \min_{x_{ct},w_{ct},o_{t}}&&\sum\limits_{t}(\sum\limits_{c}d_{ct}x_{ct}+\sum\limits_{c}w_{ct}+\lambda o_{t}), \\ \text{s.t.} &&\sum\limits_{c} x_{ct}+o_{t} = 1 \\&&\sum\limits_{t} x_{ct} \ge 1 \\&&a_{ct}x_{ct}+\sum\limits_{t^{\prime}\in N_{t}}\sum\limits_{c^{\prime}}d_{cc^{\prime}}x_{c't^{\prime}}-w_{ct} \le a_{ct} \\&&x_{ct},o_{t} \in \left\{0,1\right\} \\&&w_{ct} \ge 0 \end{array} $$

(28)

where $w_{ct}=x_{ct}({\sum }_{t^{\prime }\in N_{t}}{\sum }_{c^{\prime }}x_{c't^{\prime }}d_{cc^{\prime }})$ and $a_{ct}={\sum }_{t^{\prime }\in N_{t}}{\sum }_{c^{\prime }}d_{cc^{\prime }}$. Compared with the unsupervised scenario, the semi-supervised scenario adds constraints to ensure that labeled target samples are not misclassified. After the assignment problem is solved iteratively, the source domain is transferred to the target domain by a linear transformation, which is represented by a matrix $W \in \mathbb R^{D\times D}$. It can be estimated by minimizing the following loss function: $f(W) =\frac {1}{2}\Vert WP_{S}-P_{T} {\Vert _{F}^{2}}$. After the approach has converged, the data in the target domain are classified using a linear SVM trained on the source domain. Later, based on the idea of Generative Adversarial Networks (GAN), Saito et al. [74] proposed a new method for open set domain adaptation. The network architecture is shown in Fig. 12. The goal is to correctly categorize known target samples into the corresponding known classes and to recognize unknown target samples as unknown. The objective function is as follows,

$$ \begin{array}{@{}rcl@{}} \min_{C} L_{s}(x_{s},y_{s})+L_{adv}(x_{t}), \end{array} $$

(29)

$$ \begin{array}{@{}rcl@{}}\min_{G} L_{s}(x_{s},y_{s})-L_{adv}(x_{t}), \end{array} $$

(30)

$$ \begin{array}{@{}rcl@{}}L_{s}(x_{s},y_{s})=-\log (p(y=y_{s}|x_{s})), \end{array} $$

(31)

$$ \begin{array}{@{}rcl@{}}p(y=y_{s}|x_{s}) = (C\circ G(x_{s}))_{y_{s}}, \end{array} $$

(32)

$$ \begin{array}{@{}rcl@{}}L_{adv}(x_{t})=-t\log(p(y=K+1|x_{t}))\\-(1-t)\log(1-p(y=K+1|x_{t})), \end{array} $$

(33)

where L_s(x_s,y_s) is the loss of classifier C and t is set to 0.5. The generator attempts to maximize the value of L_adv(x_t). Saito et al. use the symmetric KL divergence as a new binary cross-entropy loss formula. Liu et al. [50] developed Separate to Adapt (STA), a progressive separation mechanism consisting of a coarse-to-fine separation pipeline. First, a multi-binary classifier is trained on the source data to estimate the similarity between the target domain data and each of the source classes. Second, data with extremely high and extremely low similarity are selected as the boundary data for the known and unknown classes, and they are used to train a binary classifier that performs fine-grained separation of all target domain samples. Finally, the method iterates between the above two steps and uses instance weights to reject samples of unknown classes during domain adaptation. The network structure is shown in Fig. 13. Liu et al. also introduce a new concept, openness, which measures how much larger the target label set is than the source label set. It is defined as $\mathbb O=1-\frac {|\mathcal {C}_{s}|}{|\mathcal {C}_{t}|}$. No threshold hyperparameters need to be selected manually during the process, so they do not need to be re-tuned when the openness changes. The accuracy of the above networks is listed in Table 7.
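Two of the quantities above are easy to check numerically: the adversarial loss (33) with t = 0.5 and the openness measure. The sketch below evaluates both; the function names and example values are hypothetical.

```python
import math


def adversarial_loss(p_unknown, t=0.5):
    """Binary cross-entropy of (33) on the probability of the 'unknown' class K+1."""
    return -t * math.log(p_unknown) - (1 - t) * math.log(1 - p_unknown)


def openness(num_source_classes, num_target_classes):
    """Openness O = 1 - |C_s| / |C_t|."""
    return 1 - num_source_classes / num_target_classes


loss_at_boundary = adversarial_loss(0.5)  # classifier C drives p(y=K+1|x_t) toward t
loss_confident = adversarial_loss(0.9)    # the generator pushes p away from t, raising the loss
example_openness = openness(10, 20)
```

With t = 0.5 the loss is smallest when the classifier outputs p = 0.5, so the generator, which maximizes the same loss, must move each target feature toward either a known class or the unknown class.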

Table 7 Classification accuracy (%) of open set domain adaptation tasks on VisDA-2017 (VGGNet)


Universal domain adaptation

Universal Domain Adaptation (UDA) does not require a priori knowledge of the label sets. In the UDA setting, given a labeled source domain and any related target domain, regardless of how its label set differs from that of the source domain, each target sample must be classified correctly if it belongs to a category in the source label set; otherwise, it is labeled as “unknown”. You et al. [101] proposed the Universal Adaptation Network (UAN). It quantifies transferability at the sample level to discover the common label set and the label sets private to each domain, thereby promoting adaptation within the automatically discovered common label set and successfully recognizing “unknown” samples. In the training phase, E_G, $E_{D^{\prime }}$ and E_D represent the errors of the label classifier G, the non-adversarial domain discriminator $D^{\prime }$ and the adversarial domain discriminator D (Fig. 14), which are formally defined as,

$$ \begin{array}{@{}rcl@{}} E_{G} =\mathbb E_{{(\text{x,y})}\sim p}L(\mathrm{y},G(F(\mathrm{x}))), \end{array} $$

(34)

$$ \begin{array}{@{}rcl@{}} E_{D^{\prime}}=-\mathbb E_{\mathrm{x}\sim p}\log D^{\prime}(F(\mathrm{x}))\\ -\mathbb E_{\mathrm{x}\sim q}\log (1-D^{\prime}(F(\mathrm{x}))), \end{array} $$

(35)

$$ \begin{array}{@{}rcl@{}} E_{D}=-\mathbb E_{\mathrm{x}\sim p}w^{s}(\mathrm{x})\log D(F(\mathrm{x}))\\ -\mathbb E_{\mathrm{x}\sim q}w^{t}(\mathrm{x})\log (1-D(F(\mathrm{x}))), \end{array} $$

(36)

where L is the standard cross-entropy loss, $w^{s}(\text {x})=\frac {H(\hat {\mathrm {y}})}{\log \vert \mathcal {C}_{s} \vert }-\hat {d^{\prime }}(\mathrm {x})$ indicates the probability of a source sample x belonging to the common label set $\mathcal {C}$, and $w^{t}(\text {x})=\hat {d^{\prime }}(\mathrm {x})-\frac {H(\hat {\mathrm {y}})}{\log \vert \mathcal {C}_{s} \vert }$ indicates the probability of a target sample x belonging to the common label set $\mathcal {C}$. With well-established weights w^s(x) and w^t(x), the adversarial domain discriminator D is confined to distinguishing the source and target data in the common label set $\mathcal {C}$. The non-adversarial domain discriminator $D^{\prime }$ is trained to obtain good weights w^s(x) and w^t(x), and adversarial training is then conducted on the adversarial domain discriminator D and the label classifier G. After training, a target sample is assigned to a source class or marked as “unknown” according to the value of the weight w^t(x).
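The two sample-level weights can be computed directly from a prediction vector and the output of $D^{\prime }$. The following is a minimal sketch; the function names are hypothetical, and the threshold `w0` used in `is_unknown` is an assumed hyper-parameter for illustration.

```python
import math


def normalized_entropy(probs):
    """Entropy of the predicted label distribution divided by log|C_s|."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def source_weight(probs, domain_similarity):
    """w^s(x): normalized entropy minus the output of the non-adversarial discriminator D'."""
    return normalized_entropy(probs) - domain_similarity


def target_weight(probs, domain_similarity):
    """w^t(x): output of D' minus the normalized entropy."""
    return domain_similarity - normalized_entropy(probs)


def is_unknown(probs, domain_similarity, w0):
    """Assumed decision rule: mark a target sample 'unknown' when w^t(x) falls below w0."""
    return target_weight(probs, domain_similarity) < w0


confident = [1.0, 0.0, 0.0, 0.0]       # low entropy: looks like a known class
uncertain = [0.25, 0.25, 0.25, 0.25]   # maximal entropy: looks like an unknown class
w_confident = target_weight(confident, 0.8)
w_uncertain = target_weight(uncertain, 0.3)
```

A confident, source-like target sample receives a high weight, while an uncertain sample that looks unlike the source receives a low one.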

Motivated by the domain similarity and uncertainty criteria proposed in [101], Saito et al. proposed Domain Adaptive Neighborhood Clustering via Entropy optimization (DANCE) in [72]. DANCE uses a prototype-based classifier, which maps samples close to their true class prototypes and away from those of other classes. The target samples are first clustered in the target domain using self-supervision. Because of this neighborhood clustering, DANCE can extract distinct feature representations for “unknown” samples in an unsupervised manner. Next, an entropy separation loss either aligns each target point with a source class prototype or rejects it as “unknown”. DANCE also uses domain-specific batch normalization [15, 17, 71] to eliminate domain style information as a form of weak domain alignment. It is worth noting that DANCE extracts discriminative feature representations for “unknown” class examples without any supervision on the target domain. However, the prediction entropy and the output of the auxiliary domain classifier are not sufficiently robust and discriminative. Fu et al. [27] proposed Calibrated Multiple Uncertainties (CMU), which combines entropy, consistency, and confidence, and designed a deep ensemble model that characterizes different degrees of uncertainty and distinguishes target data in the common label set from those in the private label set. Fu et al. further proposed a novel H-score to compensate for the fact that the previously used per-class accuracy largely ignores the open classes. The H-score is the harmonic mean of the instance accuracy on the common classes $ a_{\mathcal {C}} $ and the accuracy on the “unknown” class $a_{\mathcal {\bar {C}}^{t}}$, namely $h=2 \cdot \frac {a_{\mathcal {C}} \cdot a_{\mathcal {\bar {C}}^{t}} }{a_{\mathcal {C}} + a_{\mathcal {\bar {C}}^{t}}}$. A summary of the Universal DA comparisons is listed in Table 8.
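The H-score is a plain harmonic mean, so a method scores well only if it is accurate on both the common classes and the “unknown” class. A minimal sketch follows; the function name and the example accuracies are hypothetical.

```python
def h_score(acc_common, acc_unknown):
    """Harmonic mean of common-class accuracy and 'unknown'-class accuracy."""
    if acc_common + acc_unknown == 0:
        return 0.0  # avoid division by zero when both accuracies are zero
    return 2 * acc_common * acc_unknown / (acc_common + acc_unknown)


balanced = h_score(0.8, 0.8)
lopsided = h_score(0.9, 0.1)
```

A method that classifies common classes well but almost never detects unknown samples is penalized heavily, which per-class accuracy alone would not reveal.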

Table 8 Summary of the Universal comparisons


Zero-shot domain adaptation

In some extreme cases, no samples from the target domain are available. Zero-shot DA, which gradually developed from Few-shot DA, addresses this case. In the zero-shot domain adaptation setting, only source domain data are available for the task of interest. Sometimes semantic information about the target domain classes is used, as in generalized Zero-shot learning (GZSL). DeViSE [26] is initialized from two pre-trained neural network models: a skip-gram text model and a visual model (AlexNet without its softmax prediction layer). It is trained with a combination of dot-product similarity and hinge rank loss, which reduces the distance between an image and its corresponding label and increases the distance between the image and non-corresponding labels. Based on the main idea of DeViSE, [60] adopts the framework of CNN and word2vec. The class probabilities of an image are obtained by a standard CNN, and the category embeddings are obtained by word2vec. The similarity to the test categories is then calculated to predict the label of the image. Zhang et al. [106] proposed a new embedding model for ZSL based on a deep neural network. There are two main differences between this model and previous models: 1) It uses the visual feature space output by the CNN subnet as the embedding space. The projection direction is from the semantic space to the visual feature space, which mitigates the hubness problem. 2) It realizes end-to-end learning of the semantic space representation. The architecture of the model is shown in Fig. 15.

The purpose of the least squares embedding loss is to minimize the difference between the visual features and their class representation embedding vectors in the visual feature space. The resulting objective function is as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}(W_{1}, W_{2})=\frac {1}{N}\sum\limits_{i=1}^{N}\Vert \phi(I_{i})-f_{1}(W_{2}f_{1}(W_{1}\mathrm{y}_{i}^{u}))\Vert^{2}+\lambda (\Vert W_{1}\Vert^{2}+\Vert W_{2}\Vert^{2}), \end{array} $$

(37)

where $W_{1} \in {\mathbb R}^{L \times M}$ are the weights to be learned in the first FC layer and $W_{2} \in {\mathbb R}^{M \times D}$ are those of the second FC layer. λ is a hyperparameter that weights the regularization of the two weight matrices relative to the embedding loss. f₁(⋅) is the Rectified Linear Unit, which introduces nonlinearity in the encoding subnet. A test image I_j can be classified in the visual feature space by merely calculating its distance to the embedded prototypes. The results illustrate that the visual feature space is a much better embedding space than the semantic space. Liu et al. [51] proposed a novel Deep Calibration Network (DCN) for the generalized Zero-shot learning paradigm, which enables simultaneous calibration of deep networks on the confidence of source classes and the uncertainty of target classes. Two scenarios are considered in that paper, Zero-shot Learning (ZSL) and Generalized Zero-shot Learning (GZSL). The biggest difference between ZSL and GZSL is that GZSL can utilize the available semantic representation of the target domain, but it needs to classify over both source and target classes. The network architecture is shown in Fig. 16.
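For the embedding model of (37), classification amounts to embedding each class's semantic vector into the visual feature space and picking the nearest embedded prototype. The sketch below illustrates this with toy weights. The row-vector convention (a length-L semantic vector multiplied by an L × M matrix), the identity weights, and all names are assumptions made for the example.

```python
def relu(vec):
    """Rectified Linear Unit applied element-wise."""
    return [max(0.0, x) for x in vec]


def vec_times_matrix(vec, matrix):
    """Row vector (length L) times an L x M matrix, giving a length-M vector."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]


def embed(semantic_vec, w1, w2):
    """Two FC layers with ReLU: semantic space -> visual feature space."""
    hidden = relu(vec_times_matrix(semantic_vec, w1))
    return relu(vec_times_matrix(hidden, w2))


def classify(visual_feature, prototypes):
    """Return the class whose embedded prototype is nearest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(prototypes, key=lambda name: sq_dist(visual_feature, prototypes[name]))


# Toy setup (assumed): 2-D semantic vectors, identity weights, two unseen classes.
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
semantic = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
prototypes = {name: embed(vec, w1, w2) for name, vec in semantic.items()}
predicted = classify([0.9, 0.1], prototypes)
```

In the actual model the visual feature would come from the CNN subnet and the weights would be learned by minimizing (37).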

The optimization problem of the deep calibration network (DCN) for generalized Zero-shot learning can be formulated by integrating the empirical risk minimization and uncertainty calibration,

$$ \begin{array}{@{}rcl@{}} \min_{\phi,\psi} L+\lambda H+\gamma {\Omega}(\phi,\psi), \end{array} $$

(38)

$$ \begin{array}{@{}rcl@{}} L=-\sum\limits_{n=1}^{N}\sum\limits_{c=1}^{S}y_{n,c}\log p_{c}(\mathrm{x}_{n}), \end{array} $$

(39)

$$ \begin{array}{@{}rcl@{}} p_{c}(\mathrm{x}_{n})=\frac{\exp(f_{c}(\mathrm{x}_{n})/\tau)}{{\sum}_{c^{\prime}=1}^{S}\exp(f_{c^{\prime}}(\mathrm{x}_{n})/\tau)}, \end{array} $$

(40)

$$ \begin{array}{@{}rcl@{}} H=-\sum\limits_{n=1}^{N}\sum\limits_{c=S+1}^{S+T}q_{c}(\mathrm{x}_{n})\log q_{c}(\mathrm{x}_{n}), \end{array} $$

(41)

$$ \begin{array}{@{}rcl@{}} q_{c}(\mathrm{x}_{n})=\frac{\exp(f_{c}(\mathrm{x}_{n})/\tau)}{{\sum}_{c^{\prime}=S+1}^{S+T}\exp(f_{c^{\prime}}(\mathrm{x}_{n})/\tau)}, \end{array} $$

(42)

$$ \begin{array}{@{}rcl@{}} f_{c}(\mathrm{x}_{n}) =\text{sim}(\phi(\mathrm{x}_{n}),\psi(\mathrm{a}_{c})), \end{array} $$

(43)

$$ \begin{array}{@{}rcl@{}} y(\mathrm{x}_{n}) = \arg \max_{c} f_{c}(\mathrm{x}_{n}), \end{array} $$

(44)

where Ω(ϕ,ψ) is the penalty that controls model complexity; λ and γ are hyper-parameters; in deep learning, weight decay can be used in place of the penalty term γΩ(ϕ,ψ); L is the empirical risk; H is the uncertainty calibration term; and sim(⋅) is a similarity function, e.g. inner product or cosine similarity. The ultimate goal is to minimize the entropy so that samples are classified correctly, but this requires that the categories of the target domain are known. [66] is the first domain adaptation and sensor fusion method that does not require task-relevant target domain data. [66] relies on the correspondences between source and target domain data samples in an irrelevant task to train the model. In contrast, Conditional Coupled Generative Adversarial Networks for Zero-shot Domain Adaptation (CoCoGAN) [91] does not rely on such information because it captures the joint distribution of source and target domain data samples. Wang et al. [92] presented Adversarial Learning for Zero-shot Domain Adaptation (ALZSDA) to further extend the scope of applications. ALZSDA can learn the domain shift from an irrelevant task and transfer it to multiple different tasks of interest.
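The building blocks of the DCN objective in (39) to (42) are a temperature-scaled softmax over the source classes and an entropy over the target classes. The sketch below evaluates them on toy similarity scores; the function names and score values are hypothetical, and the similarity function of (43) is taken as given.

```python
import math


def softmax_with_temperature(scores, tau):
    """Temperature-scaled softmax, as in (40) and (42)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def empirical_risk(source_scores, true_class, tau):
    """Per-sample term of (39): cross-entropy over the S source classes."""
    return -math.log(softmax_with_temperature(source_scores, tau)[true_class])


def target_entropy(target_scores, tau):
    """Per-sample term of (41): entropy of q over the T target classes."""
    q = softmax_with_temperature(target_scores, tau)
    return -sum(p * math.log(p) for p in q if p > 0)


# Assumed similarity scores f_c(x) of one source image to 2 source and 3 target classes.
source_scores = [1.0, 0.0]
target_scores = [0.0, 0.0, 0.0]

p_sharp = softmax_with_temperature(source_scores, 0.5)
p_soft = softmax_with_temperature(source_scores, 1.0)
h_uniform = target_entropy(target_scores, 1.0)
```

A lower temperature sharpens the source-class probabilities, and the entropy term is largest when the image is equally similar to every target class, which is the uncertainty that the H term penalizes.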

5 Multi-source domain adaptation

In practical scenarios, labeled data can be collected from multiple sources with different distributions. In this case, the above single-source domain adaptation (SSDA) approaches can be applied simply by combining the multiple source domains into a single source domain. However, merging multiple source domains and then using an SSDA approach often yields worse performance than using just one of the source domains and discarding the others. Since domain shift exists both between each source domain and the target domain and between different source domains, data merged from various sources may interfere with each other during the learning process. Therefore, to utilize all available data, multi-source domain adaptation (MSDA) is required (Fig. 17).

Xu et al. [98] proposed a deep cocktail network (DCTN) to cope with domain and category shifts among multiple sources. According to the theoretical results in [55], the target distribution can be represented as a weighted combination of the source distributions. MSDA is executed in two iterative steps. In the first step, the differences between the target domain and the multiple source domains are minimized through adversarial learning, and a confusion score is obtained for each source domain, which represents the likelihood that a target sample belongs to that source domain. In the second step, a multi-source category classifier is combined with the confusion scores to classify the target samples, and the multi-source category classifier and feature extractor are updated with pseudo-labeled target samples and source samples (Fig. 18).

In contrast to [98], which symmetrically maps multiple sources and the target to the same space, Multi-source Distilling Domain Adaptation (MDDA) asymmetrically maps the target to each individual source domain. It obtains more discriminative representations of the target by using separate feature extractors. Adversarial training with the Wasserstein distance also produces more stable gradients.

Stage 1 is the pre-training of the source domain classifiers. The objective function in Stage 2 is,

$$ \begin{array}{@{}rcl@{}} \max_{D_{i}}\mathcal{L}_{wd_{D}}(D_{i})-\alpha\mathcal{L}_{grad}(D_{i}), \end{array} $$

(45)

where α is a balancing coefficient whose value can be set empirically, and ${\mathscr{L}}_{wd_{D}}(D_{i})$ is the Wasserstein distance loss. To enforce the Lipschitz constraint, a gradient penalty is introduced for the parameters of each discriminator D_i as in [35]. Unlike the above methods, the model of Peng et al. [67] directly matches all the distributions by matching their moments. Moreover, they provide a theoretical justification of why matching the moments of multiple distributions works for MSDA. The Domain AggRegation Network (DARN) proposed by Wen et al. [95] dynamically adjusts the weight of each source domain during the training process. The weights are determined by the discrepancy between the source domain and the target domain. Unlike previous works, the aggregation scheme directly optimizes the generalization upper bound without resorting to surrogates.
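Moment matching across several domains can be sketched as follows. This is a simplified illustration in the spirit of the approach of Peng et al. [67], not their exact formulation: the choice of element-wise raw moments, the orders used, and the equal weighting of the source-target and source-source terms are all assumptions.

```python
import math
from itertools import combinations


def moment(features, k):
    """Element-wise k-th raw moment of a list of feature vectors."""
    n, dim = len(features), len(features[0])
    return [sum(f[j] ** k for f in features) / n for j in range(dim)]


def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def moment_distance(sources, target, orders=(1, 2)):
    """Sum over moment orders of source-target and source-source moment gaps."""
    pairs = list(combinations(range(len(sources)), 2))
    total = 0.0
    for k in orders:
        target_m = moment(target, k)
        source_ms = [moment(s, k) for s in sources]
        # Average gap between each source domain and the target domain.
        total += sum(l2(m, target_m) for m in source_ms) / len(sources)
        # Average gap between every pair of source domains.
        if pairs:
            total += sum(l2(source_ms[i], source_ms[j]) for i, j in pairs) / len(pairs)
    return total


# Toy 1-D features (assumed): same mean as the target, but larger second moment.
sources = [[[0.0], [2.0]], [[0.0], [2.0]]]
target = [[1.0], [1.0]]
distance = moment_distance(sources, target)
```

In this example the first moments agree everywhere, so the whole distance comes from the second-moment gap between the sources and the target; a feature extractor trained to minimize such a term would pull all domains toward matching statistics.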

6 Conclusion

Deep DA methods mainly refer to domain adaptation algorithms based on end-to-end training and optimization of deep networks. This survey adopts this definition and mainly reviews deep DA techniques for visual categorization tasks.

We classify DA settings based on the label sets of the source and target domains. We do not use the supervision setting as a basis for classification, since we consider unsupervised and weakly supervised learning to be the way forward. Supervised domain adaptation serves as a bridge to a better understanding of adaptability.

Firstly, we classify DA into single-source DA and multi-source DA. Further, according to whether the feature space is the same or not, the domain adaptation of single-source DA is divided into homogeneous domain adaptation and heterogeneous domain adaptation.

Furthermore, we introduce the label set as a classification indicator and classify domain adaptation into Closed-set DA, Partial DA, Open set DA, Universal DA, and Zero-shot DA. There are three main approaches to solving the Closed-set DA problem: Discrepancy-Based methods, Adversarial-Based methods, and Reconstruction-Based methods. A comprehensive approach that combines them is the better solution for deep DA. For Partial DA and Open set DA, the problem is essentially one of blocking the negative transfer caused by “irrelevant samples” and extracting a domain-invariant representation to promote positive transfer. For Universal DA, self-supervised auxiliary domain adaptation is usually introduced: the intra-class distance is reduced by clustering, and the inter-class distance is increased by entropy maximization. For Zero-shot DA, the primary research is still on the semantic representation of classes and the visual embedding of images.

Besides, we also study multi-source DA. Current deep multi-source DA methods can be divided into two main categories. 1) Using a shared feature extractor network to symmetrically map multiple source domains and the target domain into the same space. A discriminator is then trained for each source-target pair to distinguish between source domain features and target domain features. Based on the classifiers from the different source domains, the final prediction for a target image is made either by averaging or by weighting. 2) Using non-shared feature extractors to obtain the feature representation of each source domain, with target domain features asymmetrically matched to each source domain feature space. Pre-trained classifiers are extracted with selected representative samples, and the classification is performed using a weighted approach.

Despite the recent success of deep DA, there are still many problems to be solved. First, the importance of each class is consistent in most datasets. However, in the actual application scenario, the importance may be inconsistent. How to reduce or eliminate the deviation caused by this inconsistency may become a future research topic. In addition, there are few studies on Universal DA and Zero-shot DA, and there will be more studies in the future.

In addition, deep DA has been successfully applied to many real-world applications, including image classification and object detection. The datasets for these tasks are 2D. For some task-specific 3D/4D data [78, 80], it is challenging to design DA networks that capture their 3D/4D features.

Finally, most existing deep DA methods are single-modality. However, to take advantage of complementary but heterogeneous data, such as 2D images and 3D point clouds, or Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) images, it is meaningful to consider both the heterogeneity between modalities and the difference between domains when designing the DA model. Recently, some papers [9, 10, 24, 54, 108] have begun to focus on this issue, and it deserves more research.

References

Ahmed A, Yousif H, He Z (2021) Ensemble diversified learning for image classification with noisy labels. Multimed Tools and Appl. https://doi.org/10.1007/s11042-021-10760-z
Alyafeai Z, Ghouti L (2020) A fully-automated deep learning pipeline for cervical cancer classification. Expert Syst Appl 141:112951. https://doi.org/10.1016/j.eswa.2019.112951
Aquino G, Rubio JDJ, Pacheco J, Gutierrez GJ, Ochoa G, Balcazar R, Cruz DR, Garcia E, Novoa JF, Zacarias A (2020) Novel nonlinear hypothesis for the delta parallel robot modeling. IEEE Access 8:46324–46334. https://doi.org/10.1109/ACCESS.2020.2979141
Ben-David S, Blitzer J, Crammer K, Kulesza A, Pereira F, Vaughan JW (2010) A theory of learning from different domains. Mach Learn 79(1):151–175
Bousmalis K, Silberman N, Dohan D, Erhan D, Krishnan D (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3722–3731
Bousmalis K, Trigeorgis G, Silberman N, Krishnan D, Erhan D (2016) Domain separation networks. In: advances in neural information processing systems, pp 343–351
Bruzzone L, Marconcini M (2009) Domain adaptation problems: A DASVM classification technique and a circular validation strategy. IEEE Trans Pattern Anal Mach Intell 32(5):770–787
Cai Z, Han J, Liu L, Shao L (2017) RGB-D datasets using microsoft kinect or similar sensors: A survey. Multimed Tools Appl 76(3):4313–4355. https://doi.org/10.1007/s11042-016-3374-6
Cai Z, Jing X-Y, Shao L (2020) Visual-depth matching network: deep rgb-D domain adaptation with unequal categories. IEEE Trans Cybern:1–13. https://doi.org/10.1109/TCYB.2020.3032194
Cai Z, Long Y, Shao L (2018) Adaptive RGB image recognition by visual-depth embedding. IEEE Trans Image Process 27(5):2471–2483. https://doi.org/10.1109/TIP.2018.2806839
Cao Z, Long M, Wang J, Jordan MI (2018) Partial transfer learning with selective adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2724–2732
Cao Z, Ma L, Long M, Wang J (2018) Partial adversarial domain adaptation. In: Proceedings of the European conference on computer vision, pp 135–150
Cao Z, You K, Long M, Wang J, Yang Q (2019) Learning to transfer examples for partial domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2985–2994
Chakraborty S, Mondal R, Singh PK, Sarkar R, Bhattacharjee D (2021) Transfer learning with fine tuning for human action recognition from still images. Multimed Tools Appl. https://doi.org/10.1007/s11042-021-10753-y
Chang W-G, You T, Seo S, Kwak S, Han B (2019) Domain-specific batch normalization for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7354–7362
Chen C, Fu Z, Chen Z, Jin S, Cheng Z, Jin X, Hua X-S (2020) HoMM: Higher-order moment matching for unsupervised domain adaptation. Order 1(10):20
Chen M, Zhao S, Liu H, Cai D (2020) Adversarial-learned loss for domain adaptation. In: AAAI, pp 3521–3528
Chiang H-S, Chen M-Y, Huang Y-J (2019) Wavelet-Based EEG Processing for Epilepsy Detection Using Fuzzy Entropy and Associative Petri Net. IEEE Access 7:103255–103262. https://doi.org/10.1109/ACCESS.2019.2929266
Chu C, Wang R (2018) A survey of domain adaptation for neural machine translation. arXiv:1806.00258
Chu W-S, De la Torre F, Cohn JF (2013) Selective transfer machine for personalized facial action unit detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3515–3522
Csurka G (2017) Domain adaptation for visual applications: A comprehensive survey. arXiv:1702.05374
de Jesus Rubio J (2009) SOFMLS: online self-organizing fuzzy modified least-squares network. IEEE Trans Fuzzy Syst 17(6):1296–1309. https://doi.org/10.1109/TFUZZ.2009.2029569
de Rubio JJ (2020) Stability analysis of the modified levenberg-marquardt algorithm for the artificial neural network training. IEEE Trans Neural Netw Learn Syst:1–15. https://doi.org/10.1109/TNNLS.2020.3015200
Dou Q, Ouyang C, Chen C, Chen H, Heng P-A (2018) Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss. arXiv:1804.1091
Fang X, Bai H, Guo Z, Shen B, Hoi S, Xu Z (2020) Dart: Domain-adversarial residual-transfer networks for unsupervised cross-domain image classification. Neural Netw
Frome A, Corrado GS, Shlens J, Bengio S, Dean J, Ranzato M, Mikolov T (2013) Devise: A deep visual-semantic embedding model. In: advances in neural information processing systems, pp 2121–2129
Fu B, Cao Z, Long M, Wang J (2020) Learning to detect open classes for universal domain adaptation. In: european conference on computer vision. Springer, pp 567–583
Ganin Y, Lempitsky V (2015) Unsupervised domain adaptation by backpropagation. In: international conference on machine learning. PMLR, pp 1180–1189
Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2414–2423
Gehring J, Miao Y, Metze F, Waibel A (2013) Extracting deep bottleneck features using stacked auto-encoders. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp 3377–3381
Ghifary M, Kleijn WB, Zhang M, Balduzzi D, Li W (2016) Deep reconstruction-classification networks for unsupervised domain adaptation. In: european conference on computer vision. Springer, pp 597–613
Gong B, Grauman K, Sha F (2013) Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In: international conference on machine learning, pp 222–230
Gong B, Shi Y, Sha F, Grauman K (2012) Geodesic flow kernel for unsupervised domain adaptation. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 2066–2073
Gretton A, Borgwardt KM, Rasch MJ, Schölkopf B, Smola A (2012) A kernel two-sample test. J Mach Learn Res 13(1):723–773
Gulrajani I, Ahmed F, Arjovsky M, Dumoulin V, Courville AC (2017) Improved training of wasserstein gans. In: advances in neural information processing systems, pp 5767–5777
Hernández G, Zamora E, Sossa H, Téllez G, Furlán F (2020) Hybrid neural networks for big data classification. Neurocomputing 390:327–340. https://doi.org/10.1016/j.neucom.2019.08.095
Huang Z, Wang X, Huang L, Huang C, Wei Y, Liu W (2019) Ccnet: Criss-cross attention for semantic segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 603–612
Huh J-H, Seo Y-S (2019) Understanding edge computing: engineering evolution with artificial intelligence. IEEE Access 7:164229–164245. https://doi.org/10.1109/ACCESS.2019.2945338
Iwendi C, Srivastava G, Khan S, Maddikunta PKR (2020) Cyberbullying detection solutions based on deep learning architectures. Multimed Syst. https://doi.org/10.1007/s00530-020-00701-5
Kim Y (2014) Convolutional neural networks for sentence classification. arXiv:1408.5882
Kong T, Sun F, Liu H, Jiang Y, Li L, Shi J (2020) Foveabox: Beyound anchor-based object detection. IEEE Trans Image Process 29:7389–7398
Kouw WM, Loog M (2018) An introduction to domain adaptation and transfer learning. arXiv:1812.11806
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Kutia S, Chauhdary SH, Iwendi C, Liu L, Yong W, Bashir AK (2019) Socio-technological factors affecting user’s adoption of eHealth functionalities: a case study of China and Ukraine eHealth systems. IEEE Access 7:90777–90788. https://doi.org/10.1109/ACCESS.2019.2924584
Larsen ABL, Sønderby SK, Larochelle H, Winther O (2016) Autoencoding beyond pixels using a learned similarity metric. In: international conference on machine learning. PMLR, pp 1558–1566
LeCun Y et al (2015) LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet 20(5):14
Lee H, Park S-H, Yoo J-H, Jung S-H, Huh J-H (2020) Face recognition at a distance for a stand-alone access control system. Sensors 20(3):785. https://doi.org/10.3390/s20030785
Li S, Liu CH, Lin Q, Xie B, Ding Z, Huang G, Tang J (2020) Domain conditioned adaptation network. In: AAAI, pp 11386–11393
Li W, Zhu X, Gong S (2018) Harmonious attention network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2285–2294
Liu H, Cao Z, Long M, Wang J, Yang Q (2019) Separate to adapt: Open set domain adaptation via progressive separation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2927–2936
Liu S, Long M, Wang J, Jordan MI (2018) Generalized zero-shot learning with deep calibration network. In: advances in neural information processing systems, pp 2005–2015
Long M, Cao Y, Wang J, Jordan M (2015) Learning transferable features with deep adaptation networks. In: international conference on machine learning. PMLR, pp 97–105
Long M, Zhu H, Wang J, Jordan MI (2017) Deep transfer learning with joint adaptation networks. In: international conference on machine learning. PMLR, pp 2208–2217
Ma X, Zhang T, Xu C (2019) Deep multi-modality adversarial networks for unsupervised domain adaptation. IEEE Trans Multimed 21(9):2419–2431
Mansour Y, Mohri M, Rostamizadeh A (2009) Domain adaptation with multiple sources. In: advances in neural information processing systems, pp 1041–1048
Meda-Campana JA (2018) On the estimation and control of nonlinear systems with parametric uncertainties and noisy outputs. IEEE Access 6:31968–31973. https://doi.org/10.1109/ACCESS.2018.2846483
Mohamed A-r, Dahl GE, Hinton G (2011) Acoustic modeling using deep belief networks. IEEE Trans Audio Speech Lang Process 20(1):14–22
Mohamed A-r, Hinton G, Penn G (2012) Understanding how deep belief networks perform acoustic modelling. In: 2012 IEEE international conference on acoustics, speech and signal processing. IEEE, pp 4273–4276
Narang SR, Kumar M, Jindal MK (2021) DeepNetDevanagari: A deep learning model for Devanagari ancient character recognition. Multimed Tools Appl. https://doi.org/10.1007/s11042-021-10775-6
Norouzi M, Mikolov T, Bengio S, Singer Y, Shlens J, Frome A, Corrado GS, Dean J (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv:1312.5650
Pan SJ, Tsang IW, Kwok JT, Yang Q (2010) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199–210
Pan SJ, Yang Q (2009) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359
Panareda Busto P, Gall J (2017) Open set domain adaptation. In: Proceedings of the IEEE international conference on computer vision, pp 754–763
Patel VM, Gopalan R, Li R, Chellappa R (2015) Visual domain adaptation: A survey of recent advances. IEEE Signal Process Mag 32(3):53–69
Pei Z, Cao Z, Long M, Wang J (2018) Multi-adversarial domain adaptation. arXiv:1809.02176
Peng K-C, Wu Z, Ernst J (2018) Zero-shot deep domain adaptation. In: Proceedings of the European conference on computer vision, pp 764–781
Peng X, Bai Q, Xia X, Huang Z, Saenko K, Wang B (2019) Moment matching for multi-source domain adaptation. In: Proceedings of the IEEE international conference on computer vision, pp 1406–1415
Peng X, Usman B, Kaushik N, Hoffman J, Wang D, Saenko K (2017) Visda: The visual domain adaptation challenge. arXiv:1710.06924
Rakshit RD, Kisku DR, Gupta P, Sing JK (2021) Cross-resolution face identification using deep-convolutional neural network. Multimed Tools Appl. https://doi.org/10.1007/s11042-021-10745-y
Saenko K, Kulis B, Fritz M, Darrell T (2010) Adapting visual category models to new domains. In: European conference on computer vision. Springer, pp 213–226
Saito K, Kim D, Sclaroff S, Darrell T, Saenko K (2019) Semi-supervised domain adaptation via minimax entropy. In: Proceedings of the IEEE international conference on computer vision, pp 8050–8058
Saito K, Kim D, Sclaroff S, Saenko K (2020) Universal domain adaptation through self supervision. arXiv:2002.07953
Saito K, Ushiku Y, Harada T, Saenko K (2017) Adversarial dropout regularization. arXiv:1711.01575
Saito K, Yamamoto S, Ushiku Y, Harada T (2018) Open set domain adaptation by backpropagation. In: Proceedings of the European conference on computer vision, pp 153–168
Sankaranarayanan S, Balaji Y, Castillo CD, Chellappa R (2018) Generate to adapt: Aligning domains using generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8503–8512
Seo Y-S, Huh J-H (2019) Automatic emotion-based music classification for supporting intelligent IoT applications. Electronics 8(2):164. https://doi.org/10.3390/electronics8020164
Shen J, Qu Y, Zhang W, Yu Y (2017) Wasserstein distance guided representation learning for domain adaptation. arXiv:1707.01217
Shi H, Lin G, Wang H, Hung T-Y, Wang Z (2020) Spsequencenet: Semantic segmentation network on 4d point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 4574–4583
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Sothmann T, Gauer T, Werner R (2018) Influence of 4d ct motion artifacts on correspondence model-based 4d dose accumulation. In: medical imaging 2018: image-guided procedures, robotic interventions, and modeling, vol 10576. International Society for Optics and Photonics, p 105760F
Sun B, Saenko K (2016) Deep coral: Correlation alignment for deep domain adaptation. In: European conference on computer vision. Springer, pp 443–450
Sun S, Shi H, Wu Y (2015) A survey of multi-source domain adaptation. Inf Fusion 24:84–92
Syed AM, Anjum A, Khan S, Mohan S, Srivastava G (2020) N-Sanitization: A semantic privacy-preserving framework for unstructured medical datasets. Comput Commun 161:160–171. https://doi.org/10.1016/j.comcom.2020.07.032
Tan M, Pang R, Le QV (2020) Efficientdet: Scalable and efficient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10781–10790
Tang H, Jia K (2020) Discriminative adversarial domain adaptation. Proc AAAI Conf Artif Intell 34(04):5940–5947. https://doi.org/10.1609/aaai.v34i04.6054
Tzeng E, Hoffman J, Saenko K, Darrell T (2017) Adversarial discriminative domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7167–7176
Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474
Van Engelen JE, Hoos HH (2020) A survey on semi-supervised learning. Mach Learn 109(2):373–440
Venkateswara H, Eusebio J, Chakraborty S, Panchanathan S (2017) Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5018–5027
Vincent P, Larochelle H, Lajoie I, Bengio Y, Manzagol P-A, Bottou L (2010) Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J Mach Learn Res 11(12)
Wang J, Jiang J (2019) Conditional coupled generative adversarial networks for zero-shot domain adaptation. In: Proceedings of the IEEE international conference on computer vision, pp 3375–3384
Wang J, Jiang J (2020) Adversarial learning for zero-shot domain adaptation. In: European conference on computer vision. Springer, pp 329–344
Wang M, Deng W (2018) Deep visual domain adaptation: A survey. Neurocomputing 312:135–153
Wang W, Zheng VW, Yu H, Miao C (2019) A survey of zero-shot learning: Settings, methods, and applications. ACM Trans Intell Syst Technol 10(2):1–37
Wen J, Greiner R, Schuurmans D (2020) Domain aggregation networks for multi-source domain adaptation. In: international conference on machine learning. PMLR, pp 10214–10224
Wilson G, Cook DJ (2020) A survey of unsupervised deep domain adaptation. ACM Trans Intell Syst Technol 11(5):1–46. https://doi.org/10.1145/3400066
Xu M, Zhang J, Ni B, Li T, Wang C, Tian Q, Zhang W (2020) Adversarial domain adaptation with domain mixup. In: Proceedings of the AAAI conference on artificial intelligence, vol 34, pp 6502–6509
Xu R, Chen Z, Zuo W, Yan J, Lin L (2018) Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3964–3973
Yang J, Yan R, Hauptmann AG (2007) Cross-domain video concept detection using adaptive svms. In: Proceedings of the 15th ACM international conference on multimedia, pp 188–197
Yosinski J, Clune J, Bengio Y, Lipson H (2014) How transferable are features in deep neural networks? In: advances in neural information processing systems, pp 3320–3328
You K, Long M, Cao Z, Wang J, Jordan MI (2019) Universal domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2720–2729
Zellinger W, Grubinger T, Lughofer E, Natschläger T, Saminger-Platz S (2017) Central moment discrepancy (cmd) for domain-invariant representation learning. arXiv:1702.08811
Zhai X, Oliver A, Kolesnikov A, Beyer L (2019) S4l: Self-supervised semi-supervised learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 1476–1485
Zhang G, Jiang T, Yang J, Xu J, Zheng Y (2021) Cross-view kernel collaborative representation classification for person re-identification. Multimed Tools Appl. https://doi.org/10.1007/s11042-021-10671-z
Zhang J, Ding Z, Li W, Ogunbona P (2018) Importance weighted adversarial nets for partial domain adaptation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8156–8164
Zhang L, Xiang T, Gong S (2017) Learning a deep embedding model for zero-shot learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2021–2030
Zhang M, Hu H, Li Z, Chen J (2021) Attention-based encoder-decoder networks for workflow recognition. Multimed Tools Appl. https://doi.org/10.1007/s11042-021-10633-5
Zhang W, Xu D, Zhang J, Ouyang W (2021) Progressive modality cooperation for multi-modality domain adaptation. IEEE Trans Image Process 30:3293–3306
Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q (2020) A comprehensive survey on transfer learning. Proc IEEE 109(1):43–76

Acknowledgments

This work is partly supported by the Natural Science Foundation of China (Grant No. 62006127, 61833011 and 62073173), partly supported by NUPTSF under Grant NY218120 and Grant NY220021, and partly supported by Jiangsu Shuang-Chuang Project under Grant CZ005SC19019 and Nanjing Overseas Innovation Project Grant RK005NLX20001. It is also supported by National Science Foundation of Jiangsu Province, China (Grant No. BK20191376 and BK20190728).

Author information

Authors and Affiliations

College of Automation, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
Min Fan, Ziyun Cai, Tengfei Zhang & Baoyun Wang

Corresponding author

Correspondence to Ziyun Cai.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Fan, M., Cai, Z., Zhang, T. et al. A survey of deep domain adaptation based on label set classification. Multimed Tools Appl 81, 39545–39576 (2022). https://doi.org/10.1007/s11042-022-12630-8

Received: 18 December 2020
Revised: 12 March 2021
Accepted: 09 February 2022
Published: 29 April 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11042-022-12630-8
