1 Introduction

Nowadays, communication across social media and content-sharing applications increases the volume of information (i.e., images, text and video) exponentially, and classification is essential to exploit this information explosion efficiently [1]. However, manual classification of such data may be prohibitive, so machine learning models are used instead, under the basic assumption that the training and test data are drawn from the same or similar distributions. In many real-world applications this assumption does not hold, and a model trained on a source domain may therefore perform poorly on a target domain under various conditions. Domain adaptation (DA) [2], as one of the transfer learning (TL) [3] solutions, addresses such cross-domain learning problems with different distributions.

DA transfers knowledge from a labeled source domain to an unlabeled target domain by exploiting domain-invariant structures that facilitate transfer between domains with different distributions [4]. In this paper, we focus on unsupervised domain adaptation, where target labels are not available during the transfer learning phase.

Based on the type of transferred information, TL algorithms can be classified into three learning paradigms [5]: (i) instance-based, (ii) feature-representation and (iii) classifier-based transfer learning. Instance-based approaches, instead of using the entire source domain, reuse in the learning phase only those source instances whose distribution is similar to the target data. Feature-representation approaches aim to obtain new representations of the source and target domains that minimize the distribution discrepancy between them. Classifier-based approaches assume that the performance of the target classifier can be improved by exploiting the source classifier; ensemble learners, which combine multiple source classifiers to build an improved target classifier, are one example. This paper follows a combination of the feature-based and classifier-based approaches.

In this paper, we propose joint distinct subspace learning and unsupervised transfer classification for visual domain adaptation (JDSC), a novel method that significantly improves model accuracy in an unsupervised manner.

In this method, subspace alignment and the label prediction function are learned iteratively to find a better representation of the data and, consequently, a better prediction function for labeling. For subspace alignment, two coupled projections are obtained that simultaneously map the source and target data into their respective subspaces, reducing the distributional and geometrical divergences between domains. In addition, a domain-invariant classifier is learned on the new data representations via structural risk minimization, while the consistency between the classifier and the intrinsic manifold structure of the data is maximized using the marginal distributions. The contributions of this work are summarized as follows.

  1. A novel unsupervised domain adaptation approach is introduced that is a hybrid of feature-based and classifier-based approaches: feature-based techniques address the challenge of domain divergence, and classifier-based techniques learn a reliable classifier.

  2. For subspace alignment, weighted joint geometrical and statistical alignment (WJGSA) is proposed, a modified version of joint geometrical and statistical alignment (JGSA) [4] that learns two coupled projections mapping the source and target data into their respective subspaces while accounting for the importance of the marginal and conditional distributions separately and quantitatively.

  3. JDSC reduces the divergence between the source and target subspaces and increases the variance of the target data while maintaining the data structure.

  4. The proposed method is evaluated on the following four real-world image benchmarks: object recognition (Office-10 and Caltech-10) [6], handwritten digit recognition (USPS and MNIST) [7, 8], large-scale image recognition (ImageNet, VOC 2007) [9] and face recognition [10], and compared against several state-of-the-art methods. The experiments demonstrate that our proposed method achieves a significant improvement in average classification accuracy.

The rest of the paper is organized as follows. Section 2 provides an overview of related work. Section 3 describes the proposed method in detail. Section 4 presents the evaluated datasets and the implementation details. Section 5 reports the results of the proposed algorithm against other machine learning and domain adaptation methods. Finally, Section 6 concludes the paper with some suggestions for future work.

2 Related work

In general, TL aims to adapt models trained in an existing (source) domain to solve the classification problem in a new, yet related, (target) domain. Based on what is transferred, TL algorithms are categorized into the following three paradigms.

The strategy behind instance-based approaches is to use reweighted instances from the source domain to label the target domain. Asgarian et al. [11] proposed a hybrid instance-based transfer learning method that uses a probabilistic weighting strategy to transfer knowledge from the source domain and learn a model for the target domain.

Feature-representation transfer learning approaches can be categorized into data-oriented and subspace-oriented methods, where data-oriented approaches are further divided into symmetric and asymmetric feature-based TL [4, 12]. Data-oriented methods either perform subspace learning, exploiting the underlying representative structures of both domains to find a common latent feature space that reduces the marginal distribution difference between source and target (symmetric) [13], or perform distribution alignment, transforming the source features to be closer to the target in order to reduce the marginal or conditional distribution divergence between domains (asymmetric) [14]. Subspace-oriented methods reduce the domain shift by manipulating the subspaces of the two domains for the final mapping, without explicitly considering the distribution shift between the projected data [15]; they do not assume that a single unified transformation exists to reduce the domain shift.

Classifier-based transfer approaches transfer prior knowledge of model parameters from the source to the target domain. Rubin et al. [16] built an ensemble of two boosting-based classifiers, gradient tree boosting and adaptive boosting, combined by prediction averaging, to predict the transfer of pediatric patients from the hospital general ward to the pediatric intensive care unit. Our work belongs to both the feature-representation and the classifier-based transfer categories. In feature-representation transfer, the distribution shift across domains is reduced by two coupled projections that map the source and target data into their respective subspaces while preserving the data properties. Moreover, the shift between subspace geometries is reduced alongside the distribution shifts of both domains by quantitatively evaluating the importance of the marginal and conditional distributions and considering their different effects. Hence, the feature-representation part of the proposed algorithm is a hybrid of the data-based symmetric and subspace-based categories.

Also, in classifier-based transfer, a domain-invariant classifier is learned on the newly obtained data representation to overcome feature distortions. In addition, manifold regularization is used to maximize the consistency between the classifier and the intrinsic manifold structure of the data.

3 Joint subspace and model learning

In this section, we first define the problem setting and the purpose of domain transfer learning. Then, we present our proposed approach, joint distinct subspace learning and unsupervised transfer classification for visual domain adaptation, in detail.

3.1 Notation

We first define the notations that are frequently used in this paper. A domain D consists of a feature space X and a marginal probability distribution \(P({\mathbf {x}})\), i.e., \(D=\{X, P({\mathbf {x}})\}\), where X is drawn from distribution \(P({\mathbf {x}})\) and \({\mathbf {x}}\in X\). Given a domain D, a task T is defined by a label space Y and a prediction function \(f({\mathbf {x}})\), i.e., \(T=\{Y, f({\mathbf {x}})\}\), where \(y\in Y\) and \(f({\mathbf {x}})=Q(y\,|\,{\mathbf {x}})\) can be interpreted as the conditional probability distribution. In unsupervised domain adaptation, the source domain \(D_s=\{({\mathbf {x}}_1, y_1),\ldots ,({\mathbf {x}}_n, y_n)\}\) has sufficient labeled data, while the target domain \(D_t=\{{\mathbf {x}}_{n+1},\ldots ,{\mathbf {x}}_{n+m}\}\) has no labeled data. The goal of domain transfer learning is to learn a target model \(f: x_t\rightarrow y_t\) from the labeled source domain that minimizes the prediction error on the target domain, under the assumptions \(X_s=X_t\), \(Y_s=Y_t\), \(P_s(x_s)\ne P_t(x_t)\) and \(Q_s(y_s|x_s)\ne Q_t(y_t|x_t)\). Moreover, \(tr(\cdot )\) and I denote the trace of a matrix and the identity matrix, respectively. Also, \(\Vert \cdot \Vert ^2_F\) and \(\Vert \cdot \Vert ^2_K\) denote the squared Frobenius norm and the squared norm in the reproducing kernel Hilbert space, respectively.
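For concreteness, the code sketches in the following subsections assume the hypothetical data layout below (our convention for illustration, not prescribed by the paper): samples are stored as columns of \(d\times n\) matrices and source labels are integers in \(\{1,\ldots ,C\}\).

```python
import numpy as np

# Hypothetical data layout assumed by the later sketches (not from the paper):
# columns are samples and rows are features, matching the X L X^T products in Eqs. (2)-(5).
d, n_s, n_t, C = 256, 1800, 2000, 10      # e.g., USPS -> MNIST with 256 raw features

Xs = np.random.randn(d, n_s)              # labeled source samples, one column per sample
ys = np.random.randint(1, C + 1, n_s)     # source labels in {1, ..., C}
Xt = np.random.randn(d, n_t)              # unlabeled target samples

# Unsupervised DA setting: the feature and label spaces coincide, but
# P_s(x) != P_t(x) and Q_s(y|x) != Q_t(y|x); only ys is observed.
```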

3.2 Proposed method

In this paper, we focus on the following four main goals: (1) obtaining two coupled projections for the source and target domains to reduce the domain divergence, specifically by accounting for the different importance of the marginal and conditional distributions; (2) minimizing the classification error on the new representation of the labeled source data; (3) maximizing the manifold consistency underlying the marginal distributions of the source and target domains; and (4) finding the optimal representation and classifier iteratively.

3.2.1 Weighted joint geometrical and statistical alignment

In this section, we introduce the weighted joint geometrical and statistical alignment (WJGSA) method, a modified version of joint geometrical and statistical alignment. Our method adapts the marginal and conditional distributions across domains with different importance. JGSA finds two coupled subspaces to obtain new representations of the source and target domains while giving the two distributions equal importance, whereas we weight the relative importance of each distribution quantitatively and separately. Depending on the scale of the domain shift, one of the distributions (marginal or conditional) becomes more important for adaptation. Therefore, we define Eq. (1), which finds two coupled subspaces A and B for the source and target domains, respectively, by quantitatively evaluating the significance of the marginal and conditional distributions:

$$\begin{aligned} \min _{A,B}\ tr\left( \begin{bmatrix} A^T & B^T \end{bmatrix} \begin{bmatrix} M_s & M_{st} \\ M_{ts} & M_t \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right) \end{aligned}$$
(1)

where

$$\begin{aligned} M_s = X_s\Big( (1-\gamma )L_s + \gamma \sum _{c=1}^{C} L_s^{(c)} \Big) X_s^T, \quad L_s = \frac{1}{n_s^2}\,\mathbf{1}_s\mathbf{1}_s^T, \quad \big(L_s^{(c)}\big)_{ij} = \begin{cases} \frac{1}{(n_s^{(c)})^2} & x_i,\ x_j\in X_s^{(c)} \\ 0 & \text {otherwise}, \end{cases} \end{aligned}$$
(2)
$$\begin{aligned} M_t = X_t\Big( (1-\gamma )L_t + \gamma \sum _{c=1}^{C} L_t^{(c)} \Big) X_t^T, \quad L_t = \frac{1}{n_t^2}\,\mathbf{1}_t\mathbf{1}_t^T, \quad \big(L_t^{(c)}\big)_{ij} = \begin{cases} \frac{1}{(n_t^{(c)})^2} & x_i,\ x_j\in X_t^{(c)} \\ 0 & \text {otherwise}, \end{cases} \end{aligned}$$
(3)
$$\begin{aligned} M_{st}&= X_s\Big( (1-\gamma )L_{st} + \gamma \sum _{c=1}^{C} L_{st}^{(c)} \Big) X_t^T, \quad L_{st} = -\frac{1}{n_s n_t}\,\mathbf{1}_s\mathbf{1}_t^T, \quad \big(L_{st}^{(c)}\big)_{ij} = \begin{cases} -\frac{1}{n_s^{(c)} n_t^{(c)}} & x_i\in X_s^{(c)},\ x_j\in X_t^{(c)} \\ 0 & \text {otherwise}, \end{cases} \\ M_{ts}&= X_t\Big( (1-\gamma )L_{ts} + \gamma \sum _{c=1}^{C} L_{ts}^{(c)} \Big) X_s^T, \quad L_{ts} = -\frac{1}{n_s n_t}\,\mathbf{1}_t\mathbf{1}_s^T, \end{aligned}$$
(4)

and

$$\begin{aligned} \big(L_{ts}^{(c)}\big)_{ij} = \begin{cases} -\frac{1}{n_s^{(c)} n_t^{(c)}} & x_j\in X_s^{(c)},\ x_i\in X_t^{(c)} \\ 0 & \text {otherwise} \end{cases} \end{aligned}$$
(5)

where \(\mathbf{1}_s\in {\mathbb {R}}^{n_s}\) and \(\mathbf{1}_t\in {\mathbb {R}}^{n_t}\) denote column vectors of all ones for the source and target domains, respectively. In addition, \(\gamma \) is an adaptive parameter that quantitatively weights the importance of the marginal and conditional distributions; it is computed through Eq. (6),

$$\begin{aligned} \gamma \approx 1-\frac{d_M}{d_M+\sum ^C_{c=1}{d_c}}\ . \end{aligned}$$
(6)

where \(d_M\) and \(d_c\) are the marginal and conditional \({\mathscr {A}}\)-distances [17], respectively. The \({\mathscr {A}}\)-distance is defined in Eq. (7), where \(\epsilon (h)\) is the error of a linear classifier h trained to discriminate between the source domain \(D_s\) and the target domain \(D_t\).

$$\begin{aligned} {d_M}(D_s, D_t)=2(1-2 \epsilon (h)) \end{aligned}$$
(7)
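As an illustration of Eqs. (6) and (7), the sketch below estimates \(\gamma \) using scikit-learn's LogisticRegression as the proxy linear classifier h and pseudo-labels yt_hat for the target domain; the helper names and the use of the training error as \(\epsilon (h)\) are our simplifying assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def proxy_a_distance(Xs, Xt):
    """d_A(D_s, D_t) = 2(1 - 2*eps(h)) with a linear source-vs-target classifier (Eq. 7)."""
    X = np.hstack([Xs, Xt]).T                               # rows = samples, columns = features
    y = np.r_[np.zeros(Xs.shape[1]), np.ones(Xt.shape[1])]  # 0 = source, 1 = target
    h = LogisticRegression(max_iter=1000).fit(X, y)
    eps = 1.0 - h.score(X, y)                               # training error as a cheap proxy for eps(h)
    return 2.0 * (1.0 - 2.0 * eps)

def adaptive_gamma(Xs, ys, Xt, yt_hat, C):
    """gamma ~ 1 - d_M / (d_M + sum_c d_c), Eq. (6); yt_hat are soft target labels."""
    d_M = proxy_a_distance(Xs, Xt)
    d_c = []
    for c in range(1, C + 1):
        if np.any(ys == c) and np.any(yt_hat == c):         # skip classes without pseudo-labels
            d_c.append(proxy_a_distance(Xs[:, ys == c], Xt[:, yt_hat == c]))
    return 1.0 - d_M / (d_M + sum(d_c))
```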

In addition, Eq. (8) is utilized to reduce the shift between the source and target subspaces (i.e., A and B):

$$\begin{aligned} {\mathop {\min }_{A,\ B} {\left\| A-B\right\| }^2_F}. \end{aligned}$$
(8)

Moreover, Eq. (9) maximizes the variance of target domain with the goal of preserving target data properties by projecting the features into the relevant dimensions,

$$\begin{aligned} \mathop {\max }\limits _{B} tr\left( {{B^T}{S_t}B} \right) , \hbox {s.t.}, \ {S_t} = {X_t}{H_t}{X_t}^T \end{aligned}$$
(9)

where \(S_t\) is the scatter matrix of the target domain and \(H_t\) is the centering matrix. For good domain adaptation, it is desirable to preserve the discriminative information of the source data while finding a new subspace for the source domain. Therefore, Eqs. (10) and (11) exploit the labeled source samples to preserve this information: the goal is to find a subspace A for the source domain that draws samples of the same class together and pushes samples of different classes apart, as follows:

$$\begin{aligned} \max _{A}\ tr\left( A^T S_b A\right) , \quad \hbox {s.t.}\quad S_b=\sum _{c=1}^{C} n_s^{(c)}\big(m_s^{(c)}-\overline{m_s}\big)\big(m_s^{(c)}-\overline{m_s}\big)^T \end{aligned}$$
(10)
$$\begin{aligned} \min _{A}\ tr\left( A^T S_w A\right) , \quad \hbox {s.t.}\quad S_w=\sum _{c=1}^{C} X_s^{(c)} H_s^{(c)} \big(X_s^{(c)}\big)^T \end{aligned}$$
(11)

where \(S_b\) is the between-class scatter matrix and \(S_w\) is the within-class scatter matrix of the source domain. Also, \(m^{\left( c\right) }_s\) and \(\overline{m_s}\) are the mean of the source samples belonging to class c and the mean of all source samples, respectively. Combining Eqs. (1), (8), (9), (10) and (11) yields the following objective function:

$$\begin{aligned} \max _{A,B}\ \frac{tr\left( \begin{bmatrix} A^T & B^T \end{bmatrix} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S_t \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right) }{tr\left( \begin{bmatrix} A^T & B^T \end{bmatrix} \begin{bmatrix} M_s+\lambda I+\beta S_w & M_{st}-\lambda I \\ M_{ts}-\lambda I & M_t+(\lambda +\mu ) I \end{bmatrix} \begin{bmatrix} A \\ B \end{bmatrix} \right) }. \end{aligned}$$
(12)

By optimizing Eq. (12), the following equation is achieved:

$$\begin{aligned} \begin{bmatrix} \beta S_b & 0 \\ 0 & \mu S_t \end{bmatrix} W = \begin{bmatrix} M_s+\lambda I+\beta S_w & M_{st}-\lambda I \\ M_{ts}-\lambda I & M_t+(\lambda +\mu ) I \end{bmatrix} W\phi \end{aligned}$$
(13)

where W consists of the eigenvectors corresponding to the k leading eigenvalues of \(\phi \). Since the target domain has no labels, the conditional distribution \(Q_t({y_t|x}_t)\) cannot be computed directly. Therefore, following [18], we use the class-conditional distribution \(Q_t({x_t|y}_t)\) instead of \(Q_t({y_t|x}_t)\). To evaluate \(Q_t({x_t|y}_t)\), soft target labels \({\hat{y}}_t\) are used in place of the true target labels \(y_t\); the soft labels are predicted in the first iteration by a base classifier trained on the source domain and are refined iteratively, as sketched below.
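To make the subspace step concrete, the following sketch (our own illustration, not the authors' released code) solves the generalized eigenvalue problem of Eq. (13) with scipy.linalg.eigh, assuming the MMD-style matrices M_s, M_t, M_st and M_ts of Eqs. (2)-(5) have already been assembled and using the column-sample layout from Sect. 3.1; the function names, symmetrization step and default trade-off values are our choices.

```python
import numpy as np
import scipy.linalg

def scatter_matrices(Xs, ys, Xt, C):
    """Target variance S_t (Eq. 9) and source between/within-class scatter S_b, S_w (Eqs. 10-11)."""
    n_t = Xt.shape[1]
    Ht = np.eye(n_t) - np.ones((n_t, n_t)) / n_t          # centering matrix H_t
    St = Xt @ Ht @ Xt.T
    m_bar = Xs.mean(axis=1, keepdims=True)
    Sb = np.zeros((Xs.shape[0], Xs.shape[0]))
    Sw = np.zeros_like(Sb)
    for c in range(1, C + 1):
        Xc = Xs[:, ys == c]
        nc = Xc.shape[1]
        mc = Xc.mean(axis=1, keepdims=True)
        Sb += nc * (mc - m_bar) @ (mc - m_bar).T          # between-class scatter
        Hc = np.eye(nc) - np.ones((nc, nc)) / nc
        Sw += Xc @ Hc @ Xc.T                              # within-class scatter
    return St, Sb, Sw

def solve_wjgsa(Ms, Mt, Mst, Mts, St, Sb, Sw, d, k, lam=1.0, mu=1.0, beta=0.1):
    """Solve Eq. (13): the k leading generalized eigenvectors give W = [A; B]."""
    top = np.block([[beta * Sb, np.zeros((d, d))],
                    [np.zeros((d, d)), mu * St]])
    bottom = np.block([[Ms + lam * np.eye(d) + beta * Sw, Mst - lam * np.eye(d)],
                       [Mts - lam * np.eye(d), Mt + (lam + mu) * np.eye(d)]])
    # symmetrize for numerical stability before calling the symmetric solver
    top, bottom = (top + top.T) / 2, (bottom + bottom.T) / 2
    vals, vecs = scipy.linalg.eigh(top, bottom)           # generalized symmetric eigenproblem
    W = vecs[:, np.argsort(-vals)[:k]]                    # k leading eigenvalues
    A, B = W[:d, :], W[d:, :]
    return A, B
```

The new representations are then \(Z_s=A^TX_s\) and \(Z_t=B^TX_t\), which feed the prediction step described next.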

3.2.2 Prediction function

The original data are mapped via A and B to find the new representations of source and target domains (i.e., \(Z_s=A^TX_s\) and \(Z_t=B^TX_t)\). Our main goal is to learn an adaptive classifier f on labeled source domain \(D_s\) for target domain classification. To learn f, the structural risk functional is minimized as follows:

$$\begin{aligned} f = \mathop {\arg \min }_{f\in {\mathscr {H}}_K}\ \sum _{i=1}^{n} \ell \big( f(Z_{s_i}), y_i\big) + \eta \left\| f\right\| ^2_K \end{aligned}$$
(14)

where \({\mathscr {H}}_K\) is the set of classifiers in the reproducing kernel Hilbert space, \(\eta \) is the regularization parameter and \(\ell \) is the squared loss \(\ell ={(y_i-f\left( Z_{s_i}\right) )}^2\), which measures the performance of classifier f in predicting the training labels. By the Representer theorem [19], the classifier f can be written as follows:

$$\begin{aligned} f\left( z\right) =\sum ^{n+m}_{i=1}{a_iK(z_i,z)} \end{aligned}$$
(15)

where K(., .) is the kernel function and \(a_i\) are the coefficients. Substituting the squared loss and Eq. (15) into Eq. (14), we obtain:

$$\begin{aligned} f = \mathop {\arg \min }_{f\in {\mathscr {H}}_K}\ \left\| \left( Y-\varLambda ^T K\right) E\right\| ^2_F + \eta \, tr\left( \varLambda ^T K\varLambda \right) \end{aligned}$$
(16)

where E is the diagonal domain indicator matrix with \(E_{ii}= 1\) if \(z_i\in D_s\) and \(E_{ii}= 0\) otherwise. Also, \(\varLambda ={(a_1,\ldots ,a_{n+m})}^T\) is the vector of coefficients and \({\mathbf {Y}}= [y_1,\ldots ,y_{n+m}]\) is the label matrix of the source and target data.

3.2.3 Manifold regularization

In addition, the manifold regularization term (i.e., Eq. (17)) is added into Eq. (16) to maximize the consistency between the intrinsic manifold structure of data and predictive structure of f using the marginal distributions of source and target domains (i.e., \(P_s\left( Z_s\right) \ \hbox {and}\ P_t\left( Z_t\right) \)) as follows:

$$\begin{aligned} M_f\left( P_s,P_t\right) = \sum _{i,j=1}^{n+m} V_{ij}\,\big( f(z_i)-f(z_j)\big) ^2. \end{aligned}$$
(17)

By incorporating Eq. (15) into Eq. (17) and adding the obtained equation into Eq. (16), we achieve

$$\begin{aligned} f = \mathop {\arg \min }_{f\in {\mathscr {H}}_K}\ \left\| \left( Y-\varLambda ^T K\right) E\right\| ^2_F + \eta \, tr\left( \varLambda ^T K\varLambda \right) + \rho \, tr\left( \varLambda ^T K L K\varLambda \right) \end{aligned}$$
(18)

where \(L=D-V\) is the graph Laplacian matrix, in which D is the diagonal degree matrix with \(D_{ii}=\sum ^{n+m}_{j=1}{V_{ij}}\) and V is the affinity matrix computed by Eq. (19) as follows:

$$\begin{aligned} V_{ij}=\begin{cases} \cos \left( z_i,z_j\right) , & \text {if}\ z_i\in N_P\left( z_j\right) \ \vee \ z_j\in N_P(z_i) \\ 0, & \text {otherwise} \end{cases} \end{aligned}$$
(19)

where \(N_P\left( z_j\right) \) is the set of P-nearest neighbors of point \(z_j\). Setting the derivative of the objective function in Eq. (18) to zero leads to

$$\begin{aligned} {\varLambda }^*={\left( \left( E+\rho L\right) K+\eta I\right) }^{-1}EY^T. \end{aligned}$$
(20)

The cross-domain function f is then obtained directly from Eq. (15) using the coefficients in Eq. (20), without explicit classifier training, as illustrated in the sketch below.
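A minimal sketch of the closed-form classifier of Eqs. (15), (19) and (20), assuming an RBF kernel, one-hot source labels and a cosine P-nearest-neighbor graph; the helper names (rbf_kernel, graph_laplacian, solve_classifier) and the default parameter values are ours, not the tuned settings of Table 1.

```python
import numpy as np
from scipy.spatial.distance import cdist

def rbf_kernel(Z1, Z2, sigma=1.0):
    """K(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 sigma^2)); columns of Z are samples."""
    D2 = cdist(Z1.T, Z2.T, 'sqeuclidean')
    return np.exp(-D2 / (2 * sigma ** 2))

def graph_laplacian(Z, p=5):
    """Cosine-similarity p-NN affinity V (Eq. 19) and Laplacian L = D - V."""
    Zn = Z / (np.linalg.norm(Z, axis=0, keepdims=True) + 1e-12)
    S = Zn.T @ Zn                                   # pairwise cosine similarities
    n = S.shape[0]
    V = np.zeros_like(S)
    for i in range(n):
        idx = np.argsort(-S[i])[1:p + 1]            # p nearest neighbors (skip self)
        V[i, idx] = S[i, idx]
    V = np.maximum(V, V.T)                          # symmetrize: z_i in N_p(z_j) or z_j in N_p(z_i)
    return np.diag(V.sum(axis=1)) - V

def solve_classifier(Zs, ys, Zt, C, eta=0.1, rho=1.0, sigma=1.0, p=5):
    """Closed-form coefficients of Eq. (20), then target prediction via Eq. (15)."""
    n, m = Zs.shape[1], Zt.shape[1]
    Z = np.hstack([Zs, Zt])
    K = rbf_kernel(Z, Z, sigma)
    L = graph_laplacian(Z, p)
    E = np.diag(np.r_[np.ones(n), np.zeros(m)])     # diagonal source-domain indicator
    Y = np.zeros((C, n + m))
    Y[ys - 1, np.arange(n)] = 1                     # one-hot source labels, zeros for target
    Lam = np.linalg.solve((E + rho * L) @ K + eta * np.eye(n + m), E @ Y.T)
    F = K @ Lam                                     # f(z_j) for all n + m samples
    return np.argmax(F[n:], axis=1) + 1             # predicted (soft) target labels

# One possible overall loop (T iterations, Sect. 3.2): re-estimate gamma, re-solve
# Eq. (13) for A and B, re-project Z_s = A^T X_s and Z_t = B^T X_t, then refresh
# the soft target labels with solve_classifier(...) before the next iteration.
```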

4 Experimental setup

In this section, we describe the datasets used to evaluate the performance of JDSC and the state-of-the-art domain adaptation methods against which it is compared. The implementation details are given in the last subsection.

4.1 Data description

In this paper, the following four datasets: Office-Caltech-10 [6], Digits (USPS, MNIST) [7, 8], ImageNet and VOC 2007 [9] and Pie (Face) [10], are used to evaluate the performance of JDSC.

The Office-31 dataset consists of the following three domains: Amazon (images collected from online merchants), Webcam (images taken by a web camera) and DSLR (images taken by a digital SLR camera), each of which contains images of different objects with different qualities. The Office-31 dataset has 4652 images with 4096 features per image and 31 classes. Caltech-256 (collected from Google Images) is another object recognition dataset with 30,607 images, 4096 features per image and 256 classes. Ten classes common to the four domains are used in the experiments (i.e., keyboard, bike, calculator, headphones, mouse, mug, laptop, monitor, backpack and projector). The Office-Caltech-10 dataset yields 12 tasks; in each task, one domain (e.g., Amazon) is used as the source domain and another (e.g., Caltech) as the target domain. The distribution differences between the Office and Caltech datasets make them well suited for evaluating domain adaptation methods.

The Digit dataset consists of two domains, USPS and MNIST, which contain handwritten digits from 0 to 9. The USPS dataset has 7291 training and 2007 test images of size \(16 \times 16\) pixels, while the MNIST dataset has 60,000 training and 10,000 test images of size \(28 \times 28\) pixels; in both cases each image is represented by 256 features. The ten classes (digits 0-9) common to both domains are used in the experiments. Two tasks are built from these domains, with each domain serving once as source and once as target. It is worth noting that the distribution of each digit differs between USPS and MNIST.

The Pie dataset is used for face recognition. It consists of 41,368 images of 68 different persons under different imaging conditions, organized into the following five domains: Pie1 (face images from the left), Pie2 (from above), Pie3 (from below), Pie4 (frontal) and Pie5 (from the right). Therefore, 20 tasks are built from these five domains to evaluate the performance of the proposed method, by selecting each ordered pair of distinct domains as source and target.

ImageNet and VOC 2007 are two large datasets of natural images with different distributions. ImageNet has over 14 million images with more than 20,000 categories, while VOC 2007 dataset consists of 9963 images containing 24,640 annotated objects. Five common classes of both datasets are exploited in our experiments (i.e., dog, chair, cat, bird and person). Therefore, two tasks I–V and V–I are considered in experiments.

4.2 Implementation details

The number of images and the type of features in the different datasets are as follows. From the Office dataset, 1410 images are selected randomly; each image is described by DeCaf6 features (the activations of the sixth fully connected layer of a convolutional network trained on ImageNet). From the Caltech dataset, 1123 images with DeCaf6 features are selected randomly. In the Digit dataset, 1800 USPS images and 2000 MNIST images with 256 features are selected randomly. In the Pie dataset, 3332, 1629, 1632, 3329 and 1632 images with 1024 features are selected for the Pie1, Pie2, Pie3, Pie4 and Pie5 domains, respectively. From the ImageNet and VOC 2007 datasets, 7341 and 3376 images with 4096 DeCaf6 features are sampled, respectively. The optimal parameters for these datasets are summarized in Table 1. The number of iterations T is 10 and the kernel is RBF (radial basis function). The classifier accuracy is computed through Eq. (21), where \({\hat{y}}\left( x\right) \) and y(x) are the predicted and true labels for the target domain, respectively,

$$\begin{aligned} \hbox {Accuracy}=\frac{\left| x:x\in \ D_t\bigwedge {{\hat{y}}\left( x\right) =y(x)}\right| }{|x:x\in \ D_t|}. \end{aligned}$$
(21)
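For completeness, a minimal illustration of Eq. (21), assuming the predicted and true target labels are given as integer arrays.

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Eq. (21): fraction of target samples whose predicted label equals the true label."""
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return float(np.mean(y_pred == y_true))
```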
Table 1 Optimal parameters for different datasets

5 Experimental results and discussions

In this section, the classification accuracy results on the Office-Caltech-10, Digit, ImageNet-VOC 2007 and Pie datasets are shown in Tables 2 and 3. We then describe our observations and analyze the parameter sensitivity of JDSC on the different datasets.

5.1 Result evaluation

JDSC outperforms the other state-of-the-art domain adaptation and transfer learning methods (LRSR [20], ARTL [19], DICD [21], JGSA [4], VDA [22], D-CORAL [23], UTML [24]) on most experiments (24 out of 36 tasks). The average classification accuracy of JDSC across the 36 tasks is 86.2%, a significant improvement over the best compared method. In the following, we compare our proposed method with the other methods in detail.

Low-rank and sparse representation (LRSR) is a subspace learning method that obtains a common subspace representing the target domain through a sparse and low-rank minimization problem, thereby reducing the shift between the source and target domains. However, LRSR does not fully address the cross-domain distribution discrepancy, whereas JDSC adapts the domains both geometrically and statistically. JDSC outperforms LRSR by 8.6%, 18.9%, 1.0% and 21.2% in prediction accuracy on the Office-Caltech-10, Digit, ImageNet-VOC 2007 and Pie datasets, respectively.

Table 2 Accuracy (%) of JDSC against compared methods in Office-Caltech-10 dataset using DeCaf6 features
Table 3 Accuracy (%) of JDSC against compared methods in Digit, ImageNet-VOC 2007 and Pie datasets
Table 4 Run time (s) of LRSR, ARTL, JGSA, VDA and JDSC

Adaptation regularization-based transfer learning (ARTL) learns a domain-invariant classifier in the original feature space, whereas JDSC learns an adaptive classifier in a new space with better features, which prevents feature distortion during model building. Our results show that JDSC achieves 1.7%, 4.8%, 6.0% and 12.7% classification accuracy improvements over ARTL on the Office-Caltech-10, Digit, ImageNet-VOC 2007 and Pie datasets, respectively.

Domain-invariant and class-discriminative feature learning (DICD) creates a common subspace by reducing the difference in conditional and marginal distributions while preserving important data properties. In addition, DICD reduces the distance between samples of the same class while increasing the distance between samples of different classes. JDSC maximizes the target variance to prevent feature distortions and preserves the source label information to obtain a discriminative representation. JDSC obtains an improvement of 2.5% on the Office-Caltech-10 dataset and improvements of 11.6% and 10.3% on the Digit and Pie datasets against DICD, respectively.

Joint geometrical and statistical alignment (JGSA) is an unsupervised domain adaptation framework that obtains two subspaces for the source and target domains to jointly mitigate geometrical and distribution shifts. In contrast, JDSC reduces the distribution discrepancies across domains by accounting for the different importance of the marginal and conditional distributions. Compared with JGSA, the average performance improvement of JDSC is 2.4%, 8.8%, 11.8% and 6.4% on the Office-Caltech-10, Digit, ImageNet-VOC 2007 and Pie datasets, respectively.

Visual domain adaptation (VDA) is a transfer learning and domain adaptation approach that iteratively reduces the joint marginal and conditional distribution shifts through domain-invariant clustering in an embedded representation, discriminating between classes alongside the domain transfer. Unlike VDA, JDSC preserves manifold consistency and performs dynamic distribution alignment. JDSC obtains improvements of 6%, 14.1%, 9.9% and 12.4% in average accuracy over VDA on the Office-Caltech-10, Digit, ImageNet-VOC 2007 and Pie datasets, respectively.

Unsupervised transfer metric learning (UTML) tackles the domain shift problem by minimizing the intraclass and maximizing the interclass distribution discrepancies between the source and target domains via maximum mean discrepancy. Moreover, UTML preserves the domain properties by maintaining the variance of the samples. Unlike UTML, which adapts only the conditional distribution, JDSC adapts both the marginal and conditional distributions with different significance. JDSC improves on the best baseline method UTML by 15.1% and 1.9% in classification accuracy on the Digit and Pie datasets, respectively.

Deep learning methods have also been widely studied in recent years [25]. JDSC can be compared with deep learning methods in two ways: (1) using data with features pretrained by deep networks as input, or (2) using a deep network instead of the label prediction function in the classifier-based step. We follow the first option here; the results on Office-Caltech-10 with pretrained DeCaf6 features learned by convolutional networks are given in Table 2. As can be seen from Table 2, JDSC outperforms the D-CORAL method [23] (which aligns second-order statistics using deep neural networks) by 2.7%.

5.2 Time complexity

Table 4 presents the run times of JDSC and the other baseline and state-of-the-art methods on different tasks. Owing to the high time complexity of backpropagation in deep methods, they are not included in this comparison. As is clear from Table 4, the run time of JDSC on task C-A (207.2 s) is modest compared with the combined run time (i.e., \(224.6 + 20.1 = 244.7\) s) of the two baseline methods JGSA and ARTL. Therefore, given its classification accuracy, JDSC has an acceptable and comparable time complexity against the other compared methods. The test environment is an Intel\(^{\circledR }\) Core\(^{\mathrm{TM}}\) i7-8550 CPU with 8 GB of memory, and MATLAB is used as the coding language.

Fig. 1 Parameter evaluation with respect to classification accuracy (%) for \(\gamma ,\lambda ,\mu ,\beta ,\eta ,\rho , K\) and P parameters on C-A, P1-P2, V-I and U-M tasks

5.3 Parameters impact

We evaluate the parameter sensitivity of JDSC on selected tasks from the four benchmark datasets (i.e., C-A from Office-Caltech-10, P1-P2 from Pie, V-I from ImageNet-VOC 2007 and U-M from Digit) to validate its performance over a wide range of parameter values. Figure 1 illustrates the relationship between the various parameters and accuracy. Each of the \(\gamma ,\lambda ,\mu ,\beta ,\eta ,\rho , K\) and P parameters is validated on the different datasets while the other parameters are fixed. Figure 1a shows \(\gamma \), the trade-off parameter between the marginal and conditional distribution alignments. Figure 1b-d show \(\lambda \), \(\mu \) and \(\beta \), the trade-off parameters that balance the importance of each component in Eq. (13). Figure 1e and f show the sensitivity to the \(\eta \) and \(\rho \) parameters in Eq. (20). Figure 1g and h illustrate the impact of K (the dimension of the embedded subspaces) and P (the number of neighbors in the Laplacian graph) on prediction accuracy. The \(\gamma ,\lambda ,\mu ,\beta ,\eta \) and \(\rho \) parameters are evaluated in the range 0.0 to 1.0, while K and P are evaluated in the ranges [40, 200] and [2, 14], respectively. As is clear from Fig. 1a-c and e-f, on the Pie and Digit datasets the accuracy remains largely stable across the evaluated values of \(\gamma \), \(\lambda \), \(\mu \), \(\eta \) and \(\rho \). The classification accuracy on the C-A, P1-P2 and V-I tasks is almost steady for the \(\beta \) and K parameters, while for high values of \(\beta \) and K the accuracy on the U-M task drops. Also, for the C-A, U-M and V-I tasks, the accuracy shows no obvious change with P, whereas on the P1-P2 task the predicted accuracy is sensitive to the value of P.

6 Conclusion

In this paper, we proposed a new transfer learning method, referred to as joint distinct subspace learning and unsupervised transfer classification for visual domain adaptation (JDSC), to address the discrepancy problem between source and target domains. JDSC finds two coupled projections for the source and target domains to minimize the domain shift, specifically by accounting for the different importance of the marginal and conditional distributions. In addition, JDSC increases the manifold consistency underlying the marginal distributions of the source and target domains. As a result, the optimal new representations and classifier are obtained to adapt the domains. We assessed the knowledge transfer capability of JDSC through experiments on standard visual datasets, and the results show the superiority of JDSC in comparison with other state-of-the-art visual domain adaptation methods. JDSC can find applications in a wide range of classification problems, e.g., land cover classification through remote sensing [26] and recognition of anomalies in thermal images [27]. As future work, we aim to extend JDSC with features extracted by deep neural networks. The proposed method could also be applied to reinforcement learning approaches to address challenges in robotics.