1 Introduction

The high dimensionality of data poses challenges to learning tasks, such as the curse of dimensionality. A common way to address this problem is dimensionality reduction, which has attracted much attention in the machine learning and data mining communities over the past decades. In the literature, there are mainly two distinct approaches to dimensionality reduction, namely feature selection and feature extraction. In the former, subsets of the original features are selected directly. In the latter, new features are obtained from the original features through algebraic transformations. Despite the different motivations of these methods, they can all be interpreted in a unified Graph Embedding framework [1]. Subspace learning algorithms belong to the feature extraction category. The most popular subspace learning methods include unsupervised principal component analysis (PCA) [2], supervised linear discriminant analysis (LDA) [2], maximum margin criterion (MMC) [3], and locality preserving projection (LPP) [4]. These algorithms project the data by linear transformations according to some optimization criteria. One key shortcoming of subspace learning is that each new feature is a linear combination of all original features. Thus, although subspace learning facilitates model interpretation and visualization by concentrating the information in a few features, these features are still constructed from all of the original features and hence are often hard to interpret [5]. Sparse subspace learning methods [6–9] attempt to solve this problem effectively.

Conventional subspace learning algorithms for data mining and machine learning perform well under the assumption that training and testing samples are independent and identically distributed (i.i.d.). Unfortunately, in many practical applications this assumption is violated, which significantly degrades the performance of conventional algorithms. Transfer learning aims to solve the problem in which the training data from a source domain and the testing data from a target domain follow different distributions or are represented in different feature spaces [10]. The key idea of transfer learning is that, although the distributions of the source and target domains are different, there must exist some common knowledge structures across the domains. These common structures can be utilized as a bridge for knowledge transfer.

One fundamental motivation of transfer learning in real applications is the so-called data sparsity problem in the target domain, where data sparsity refers to a lack of useful labels or of sufficient high-quality data in the training set. This sparsity problem in the target domain leads to over-fitting when training with conventional methods. The regularization framework is popular in machine learning for addressing various problems, for example, Tikhonov regularization [11], manifold regularization [12], and graph Laplacian-based regularization [13]. To overcome the data sparsity problem, we enforce sparse regularization on the transfer learning framework.

The motivation of our work is as follows:

  1.

    Though there are many subspace learning methods that use sparse regularization, none of them is suitable for the transfer learning paradigm, because most of them rely on the i.i.d. assumption, which may be impractical in real applications. How to extend traditional subspace learning methods to transfer learning applications is therefore a significant question.

  2.

    Since the data sparsity problem arises in both the transfer learning and the subspace learning paradigms, how to overcome the over-fitting problem, and which kind of sparse regularization to select, is worthy of study.

  3.

    There are many widely used transfer learning methods, suited to different assumptions on the divergence measures of the data distributions, such as maximum mean discrepancy (MMD), Bregman divergence, and Kullback–Leibler (K–L) divergence, and most recent algorithms depend on specific circumstances and applications. How to unify them into a general framework is therefore a challenging problem.

In this paper, we propose a general framework for transfer learning, referred to as transfer sparse subspace learning (TSSL). The main contributions of this paper include the following:

  1.

    We successfully extend the traditional subspace learning algorithms such as PCA, LDA, MMC, and LPP to solve transfer learning problems.

  2.

    To deal with the considerable change between the distributions of the source and target domains, TSSL minimizes the distribution distance via two important criteria, namely MMD and Bregman divergence. Indeed, TSSL provides a unified framework for handling any distribution distance estimation criterion.

  3.

    We employ a sparse regularization term in the transfer subspace learning framework to avoid, or at least reduce, over-fitting, and also to noticeably reduce time and space costs. We verify that L 2,1-norm regularization is an effective constraint in the transfer subspace learning procedure.

The rest of the paper is organized as follows. In Sect. 2, previous related work is discussed, and the preliminaries, including sparse subspace learning, the L 2,1-norm, transfer learning, MMD, Bregman divergence, and MMC, are introduced. We present our framework for transfer sparse subspace learning and the corresponding solutions in Sect. 3. The experimental results on both text data sets and face data sets are discussed in Sect. 4. Finally, we draw conclusions and discuss future work in Sect. 5.

2 Previous works and preliminaries

2.1 Sparse subspace learning

When the number of samples is smaller than the number of features, subspace learning methods may fail, and it is necessary to control the model complexity according to regularization theory. The most important regularization techniques include the L 1-norm, the L 2-norm, and the elastic net penalty. Recently, sparse subspace learning has drawn increasing interest, and many dimensionality reduction methods have been extended to sparse versions. Zou et al. [6] proposed an elegant sparse PCA algorithm (SPCA) using the elastic net framework for L 1-penalized regression on regular principal components, which can be solved very efficiently using least angle regression (LARS). Moghaddam et al. [7] proposed a spectral bound framework for sparse subspace learning. In particular, they proposed both exact and greedy algorithms for sparse PCA and sparse LDA [8]. Cai et al. [9] proposed a unified sparse subspace learning framework, which builds the connection between regression and many popular graph-based subspace learning algorithms, for example, LDA, LPP, and NPE. Their sparse solutions can be computed effectively with an L 1-norm regularization in the proposed framework.

Recently, sparse regularization has been widely investigated and applied to subspace learning. L 1-SVM was proposed to perform feature selection using L 1-norm regularization, which tends to give sparse solutions [14]. A hybrid huberized SVM (HHSVM) was proposed that combines the L 1-norm and the L 2-norm to form a more structured regularization [15]. Obozinski et al. [16] and Argyriou et al. [17] developed similar models with L 2,1-norm regularization to couple feature selection across tasks; such regularization has close connections to the group lasso. Motivated by the observation that the features selected by sparse subspace learning methods are generally different for each dimension of the subspace, Gu et al. [18] proposed a joint framework based on the L 2,1-norm on the projection matrix, which performs feature selection and subspace learning simultaneously.

None of these sparse subspace learning methods is suitable for transfer learning problems, since they ignore the fact that the distributions of source domain data and target domain data are different.

2.2 L 2,1-norm

For a matrix W ∈ R m×d, the L r,p -norm is defined as follows:

$$ \left\| W \right\|_{r,p} = \left( {\sum\limits_{i = 1}^{m} {\left( {\sum\limits_{j = 1}^{d} {\left| {w_{ij} } \right|^{r} } } \right)^{p/r} } } \right)^{1/p} = \left( {\sum\limits_{i = 1}^{m} {\left\| {w^{i} } \right\|}_{r}^{p} } \right)^{1/p} $$
(1)

where w i is the ith row of W.

Then, L 2,1-norm is defined in the following equation.

$$ \left\| W \right\|_{2,1} = \sum\limits_{i = 1}^{m} {\sqrt {\sum\limits_{j = 1}^{d} {w_{ij}^{2} } } } = \sum\limits_{i = 1}^{m} {\left\| {w^{i} } \right\|}_{2} $$
(2)

We can easily verify that the L 2,1-norm is rotationally invariant with respect to rows; that is, for any rotation matrix R, \( \left\| {WR} \right\|_{2,1} = \left\| W \right\|_{2,1} . \)

We also give an intuitive explanation of the L 2,1-norm. First, we compute the L 2-norm of each row w i (corresponding to dimension i) and then compute the L 1-norm of the vector \( b(W) = \left( {\left\| {w^{1} } \right\|_{2} ,\left\| {w^{2} } \right\|_{2} , \ldots ,\left\| {w^{m} } \right\|_{2} } \right). \) The magnitudes of the components of b(W) indicate how important each dimension is. The L 2,1-norm favors a small number of nonzero rows in the matrix W, thereby ensuring that dimensionality reduction is achieved.
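
For concreteness, the following minimal sketch (NumPy; the matrix W is an arbitrary example, not taken from the paper) computes the L 2,1-norm by taking the L 2-norm of each row and summing, and checks the rotational invariance property mentioned above.

```python
import numpy as np

def l21_norm(W):
    """L2,1-norm: sum of the Euclidean norms of the rows of W."""
    return np.sum(np.linalg.norm(W, axis=1))

# Example matrix with two nonzero rows and one zero row.
W = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

print(l21_norm(W))                      # 5 + 0 + 1 = 6.0

# Rotational invariance w.r.t. rows: ||W R||_{2,1} = ||W||_{2,1}
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.isclose(l21_norm(W @ R), l21_norm(W)))   # True
```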

The L 2,1-norm of a matrix was introduced in [19] as a rotationally invariant L 1-norm and has been used for multitask learning [16, 17]. Argyriou et al. [17] developed a non-convex multitask generalization of L 2,1-norm regularization that can be used to learn a few features common across multiple tasks. Obozinski et al. [16] proposed a type of joint regularization of the model parameters in order to couple feature selection across tasks. Liu et al. [20] considered the L 2,1-norm regularized regression model for joint feature selection from multiple tasks, which can be derived in a probabilistic framework by assuming a suitable prior from the exponential family. One appealing feature of L 2,1-norm regularization is that it encourages multiple predictors to share similar sparsity patterns.

Motivated by previous research [16, 18, 20, 21], we perform L 2,1-norm regularization to select features across all data points with joint sparsity, that is, each feature has either small scores for all data points or large scores for all data points. In this paper, we therefore employ L 2,1-norm regularization on the projection matrix W to enable effective subspace learning.

2.3 Transfer learning

In the past decades, many transfer learning algorithms have been proposed, which can be grouped into four categories: instance-based transfer learning, parameter-based transfer learning, relational-knowledge transfer learning, and feature-based transfer learning. We refer to [22] for more information.

Instance-based transfer learning approaches reweight some labeled data in the source domain for use in the target domain; representative algorithms include KLIEP [23], TrAdaBoost [24], TransferBoost [25], TrAdaBoost.R2 [26], and MultiSourceTrAdaBoost [27].

Parameter-based transfer learning approaches assume that the source tasks and the target tasks share some parameters or prior distributions of the hyperparameters of the models. By discovering these shared parameters or priors, knowledge can be transferred across tasks. Representative algorithms include MI-IVM [28], GPDRTL [29], TLVM [30], etc.

Relational-knowledge transfer learning approaches assume that some relationships among the data of the source and target domains are similar. Statistical relational learning techniques such as Markov logic networks (MLNs) dominate this context [31, 32].

Feature-based transfer learning approaches include feature reweighting [33], feature replication [34], feature projection [35, 36], feature correlation [37], feature subsetting [38], feature extraction [39, 40], etc. Feature-based transfer learning aims to discover a shared feature space in which the data distributions across domains are close to each other. The shared feature space can be constructed in the original feature space [35, 36] or in a projected subspace [41, 42].

Our framework belongs to feature-based transfer learning, so we focus on some representative feature-based algorithms. Structural correspondence learning (SCL) [35] introduces the concept of pivot features, which have high frequency and similar meaning in both the auxiliary and target domains. Non-pivot features can be mapped to each other via the pivot features using the unlabeled data of both the source and target domains. Blitzer et al. [36] proposed to use mutual information (MI) to choose the pivot features instead of more heuristic criteria; MI-SCL tries to find pivot features that have high dependence on the labels in the source domain. Pan et al. [41] exploited the maximum mean discrepancy embedding (MMDE) method, originally designed for dimensionality reduction, to learn a low-dimensional space that reduces the difference between the distributions of different domains for transductive transfer learning. However, MMDE suffers from a heavy computational burden. Thus, Pan et al. [42] further proposed an effective feature extraction algorithm, known as transfer component analysis (TCA), to overcome this drawback of MMDE.

Since the distribution divergence between the original source and target domains is large in transfer learning settings, a classification or regression function f may not generalize well in the target domain. In this paper, we alleviate this difficulty by reducing the distribution distance across domains in a projected latent space. That is to say, our framework employs subspace learning algorithms to discover a common shared projected subspace. The advantage of our method is that we do so in a regularization framework, which takes both the distribution distance regularization and the sparse regularization into consideration. In addition, our framework can avoid over-fitting problems.

2.4 Maximum mean discrepancy (MMD)

There are several criteria for estimating the distance between different distributions [41]. Many criteria are parametric because they require intermediate density estimation. MMD is a nonparametric criterion for comparing distributions based on a reproducing kernel Hilbert space (RKHS) [43]. Given two domains X and Y, let \( X = \{ x_{1} ,x_{2} , \ldots ,x_{{n_{1} }} \} \) and \( Y = \{ y_{1} ,y_{2} , \ldots ,y_{{n_{2} }} \} \) be sets of samples drawn from two different distributions P and Q. The empirical estimate of MMD is as follows:

$$ {\text{Dist}}(X,Y) = \mathop {\sup }\limits_{{\left\| f \right\|_{{\mathcal{H}}} \le 1}} \left( {\frac{1}{{n_{1} }}\sum\limits_{i = 1}^{{n_{1} }} {f\left( {x_{i} } \right)} - \frac{1}{{n_{2} }}\sum\limits_{i = 1}^{{n_{2} }} {f\left( {y_{i} } \right)} } \right) $$
(3)

where \( f:{\mathcal{X}} \to {\mathcal{H}} \) and \( {\mathcal{H}} \) is a universal RKHS [44]. In an RKHS, function evaluation can be written as \( f(x) = \left\langle {\phi (x),f} \right\rangle \), where \( \phi :{\mathcal{X}} \to {\mathcal{H}} \) is a kernel-induced feature map, and the empirical estimate of MMD can be rewritten as:

$$ {\text{Dist}}(X,Y) = \left\| {\frac{1}{{n_{1} }}\sum\limits_{i = 1}^{{n_{1} }} {\phi \left( {x_{i} } \right)} - \frac{1}{{n_{2} }}\sum\limits_{i = 1}^{{n_{2} }} {\phi \left( {y_{i} } \right)} } \right\|_{{\mathcal{H}}} $$
(4)

According to MMD theory [43], the distance between the distributions of two sets of samples is simply the distance between the means of the two sets of samples in an RKHS.
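
As an illustration of Eq. (4), the sketch below estimates the squared MMD between two sample sets using a Gaussian RBF kernel; expanding the squared RKHS norm of the mean difference gives mean(K_XX) + mean(K_YY) − 2·mean(K_XY). The kernel, bandwidth, and toy data are arbitrary choices for this example, not settings from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq_dists)

def mmd_squared(X, Y, gamma=1.0):
    """Empirical squared MMD between samples X (n1 x m) and Y (n2 x m)."""
    K_xx = rbf_kernel(X, X, gamma)
    K_yy = rbf_kernel(Y, Y, gamma)
    K_xy = rbf_kernel(X, Y, gamma)
    return K_xx.mean() + K_yy.mean() - 2.0 * K_xy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 5))   # "source" samples
Y = rng.normal(0.5, 1.0, size=(120, 5))   # "target" samples with a shifted mean
print(mmd_squared(X, Y))                  # clearly positive when the distributions differ
```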

In order to enable knowledge transfer between data sets with mismatched distributions, Brian et al. [45, 46] explored a feature extraction perspective, starting with the popular sparse coding approach, which learns a set of higher-order features for the data. They improved the original sparse coding technique by incorporating a distribution distance regularization and the target data label information into the general objective function. Ren et al. [47] proposed a multiple kernel learning framework improved by MMD to solve transfer learning problems; the model not only utilizes the capacity of kernel learning to construct a nonlinear hyperplane that maximizes the separation margin, but also reduces the distribution discrepancy between training and testing data. Zhang et al. [48] proposed an approach that performs multiple related clustering tasks simultaneously through domain adaptation: a shared subspace is learned in which the gap between the distributions of the tasks (measured by MMD) is reduced, and the shared knowledge is transferred across all tasks by exploiting the strengthened relations in the learned subspace. Uguroglu et al. [49] presented a method to identify variant and invariant features between two data sets. Rather than finding a projection of the feature space that maximizes the similarity between source and target domain data via MMD, as traditional feature-based transfer learning methods do, Duan et al. [50] proposed a cross-domain kernel learning framework called domain transfer multiple kernel learning (DTMKL), which simultaneously learns a kernel function and a robust classifier by minimizing both the structural risk functional and the distribution mismatch (via MMD) between the source and target domain data.

Motivated by the success of MMD in transfer learning problems, in this paper, we also employ MMD as one of the basic distribution distance estimation criteria, and our goal is to reduce the distribution gap between the projected source and target domain subspaces. The difference from the above algorithms is that we consider not only the distribution estimation but also the sparse regularization, which is important for handling the over-fitting problem.

2.5 Bregman divergence

Definition [51]

Let φ be a continuously differentiable, real-valued, and strictly convex function defined on a closed convex set Ω. The Bregman distance function d φ(x, y) associated with φ is defined as follows:

$$ d_{\varphi } (x,y) = \varphi (x) - \varphi (y) - \left\langle {\nabla \varphi (y),(x - y)} \right\rangle $$
(5)

for any points x, y ∈ Ω, where ∇φ(y) is the gradient of φ evaluated at y.

According to this definition, different convex functions φ give rise to different specific forms of Bregman divergence. Some frequently used divergences, all of which are specific forms of Bregman divergence, are shown in Table 1 [52].

Table 1 Some frequently used divergences

There are some important properties of the Bregman divergence. A Bregman divergence is not generally symmetric, that is, it does not always hold that d φ(x, y) = d φ(y, x). The squared Euclidean distance and the squared Mahalanobis distance are two examples of symmetric Bregman divergences, whereas the Kullback–Leibler (K–L) divergence is asymmetric. A Bregman divergence d φ(x, y) is convex with respect to its first argument x, but not necessarily convex with respect to its second argument y. More details about Bregman divergence can be found in [51].
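
To make the definition concrete, the sketch below evaluates d φ(x, y) = φ(x) − φ(y) − ⟨∇φ(y), x − y⟩ for two standard choices of φ: the squared Euclidean case, which is symmetric, and the (generalized) K–L case, which is not, illustrating the asymmetry noted above. This is a purely illustrative example, not code from the paper.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - np.dot(grad_phi(y), x - y)

# phi(x) = ||x||^2  ->  squared Euclidean distance (symmetric)
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2.0 * x

# phi(x) = sum_i x_i log x_i  ->  generalized K-L divergence (asymmetric)
negent = lambda x: np.sum(x * np.log(x))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

print(bregman(sq, sq_grad, x, y), bregman(sq, sq_grad, y, x))                    # equal
print(bregman(negent, negent_grad, x, y), bregman(negent, negent_grad, y, x))    # differ
```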

Si et al. [53] presented a family of subspace learning algorithms based on Bregman divergence regularization, which transfers the knowledge gained from the source domain data to the target domain data. Their transfer subspace learning (TSL) framework extends many classical subspace learning algorithms to transfer learning settings, such as TPCA, TLDA, TLPP, and TMFA. In follow-up work, Si et al. [54, 55] proposed cross-domain discriminative Hessian Eigenmaps (CDHE) and cross-domain discriminative locally linear embedding (CDLLE), both of which incorporate Bregman divergence regularization. Wu et al. [56] proposed a scheme for learning Bregman distance functions with side information. Gao et al. [57] proposed a transfer learning framework for latent variable models, which uses the Bregman divergence between the source and target domain data to modify the parameters of the obtained latent variable model. Zhang et al. [58] dealt with multitask clustering based on Bregman divergence, which aims to improve the performance of each single task and also to discover the relationships between the clusters of different tasks.

Motivated by these works, we also unify the Bregman divergence into our general transfer learning framework. The advantage of our method is that we take both the distribution estimation and the sparse regularization into consideration, so that our methods can be extended to more real applications.

2.6 Maximum margin criterion (MMC)

MMC [3] aims at maximizing the average margin between classes in the projected space. Therefore, the feature extraction criterion is defined as:

$$ J = \frac{1}{2}\sum\limits_{i = 1}^{C} {\sum\limits_{j = 1}^{C} {p_{i} p_{j} d\left( {C_{i} ,C_{j} } \right)} } $$
(6)

where C is the number of distinct classes and p i and p j are the prior probabilities of class i and class j, respectively. The interclass margin is defined as:

$$ d\left( {C_{i} ,C_{j} } \right) = d\left( {m_{i} ,m_{j} } \right) - s\left( {C_{i} } \right) - s\left( {C_{j} } \right),\quad s\left( {C_{i} } \right) = tr\left( {S_{i} } \right),\quad s\left( {C_{j} } \right) = tr\left( {S_{j} } \right) $$
(7)

where m i and m j are the mean vectors of classes C i and C j , and S i and S j are the covariance matrices of classes C i and C j . After some simple algebra, we obtain the following formula:

$$ J = tr\left( {S_{b} - S_{w} } \right) $$
(8)

The between-class scatter matrix S b and the within-class scatter matrix S w are defined as:

$$ S_{b} = \sum\limits_{i = 1}^{C} {n_{i} \left( {m_{i} - m} \right)\left( {m_{i} - m} \right)^{T} } \quad S_{w} = \sum\limits_{i = 1}^{C} {\left( {X_{i} - m_{i} } \right)\left( {X_{i} - m_{i} } \right)^{T} } $$
(9)

where n i is the number of samples in class C i and m is the mean vector of all data. Then, MMC can be formulated as:

$$ \arg \mathop {\max }\limits_{W \in \Upomega } F(W) = \arg \mathop {\max }\limits_{W \in \Upomega } tr(W^{T} \left( {S_{b} - S_{w} } \right)W) $$
(10)

We can obtain the optimal W by solving the eigenvalue problem (S b  − S w )w = λw. Therefore, W is composed of the eigenvectors corresponding to the d largest eigenvalues of S b  − S w .

The number of clusters is predefined as c, and F ∈ R n×c is the scaled cluster indicator matrix, with \( F_{ij} = 1/\sqrt {l_{j} } \) if x i belongs to the jth cluster and F ij  = 0 otherwise, where l j is the number of samples in the jth cluster. We can easily verify the following equations:

$$ S_{w} = X\left( {I - FF^{T} } \right)X^{T} ;\quad S_{b} = XFF^{T} X^{T} $$
(11)

Unlike LDA, we do not need to calculate the inverse of S w , which allows us to avoid the small sample size problem. Because of this advantage, we employ MMC as an example to test the efficiency of our framework; of course, our framework is applicable to any subspace learning method.
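
A minimal sketch of the MMC projection in Eq. (10), assuming the scatter matrices of Eq. (9) are computed from labeled data: W is formed from the eigenvectors of S b  − S w associated with the d largest eigenvalues (a plain symmetric eigenvalue problem, with no inversion of S w ). Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def mmc_projection(X, labels, d):
    """MMC: columns of X are samples; returns the m x d projection matrix W."""
    m_all = X.mean(axis=1, keepdims=True)
    S_b = np.zeros((X.shape[0], X.shape[0]))
    S_w = np.zeros_like(S_b)
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        m_c = Xc.mean(axis=1, keepdims=True)
        S_b += Xc.shape[1] * (m_c - m_all) @ (m_c - m_all).T   # between-class scatter
        S_w += (Xc - m_c) @ (Xc - m_c).T                       # within-class scatter
    eigvals, eigvecs = np.linalg.eigh(S_b - S_w)               # symmetric eigenproblem
    return eigvecs[:, np.argsort(eigvals)[::-1][:d]]           # top-d eigenvectors

rng = np.random.default_rng(0)
X = np.hstack([rng.normal(0, 1, (10, 30)), rng.normal(3, 1, (10, 30))])
labels = np.array([0] * 30 + [1] * 30)
W = mmc_projection(X, labels, d=2)
Y = W.T @ X          # low-dimensional representation y_i = W^T x_i
print(W.shape, Y.shape)   # (10, 2) (2, 60)
```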

3 Transfer sparse subspace learning (TSSL)

3.1 Problem statement and notations

In a transfer learning setting, we denote the source domain data as \( {\mathcal{D}}_{S} = \left\{ {\left( {x_{1} ,z_{1} } \right),\left( {x_{2} ,z_{2} } \right), \ldots ,\left( {x_{{n_{1} }} ,z_{{n_{1} }} } \right)} \right\}, \) where x i  ∈ \(\mathbb{R}\) m, i = 1, 2,…,n 1, and z i is the corresponding label. Similarly, we denote the target domain data as \( {\mathcal{D}}_{T} = \left\{ {x_{{n_{1} + 1}} ,x_{{n_{1} + 2}} , \ldots ,x_{{n_{1} + n_{2} }} } \right\}, \) where x i  ∈ \(\mathbb{R}\) m, i = n 1 + 1,…,n 1 + n 2. Denote X = [x 1, x 2,…,x n ] ∈ R m×n, where n = n 1 + n 2. Let P(X S ) and Q(X T ) be the marginal distributions of X S and X T , respectively; usually P(X S ) ≠ Q(X T ).

3.2 General framework

3.2.1 Subspace learning framework

A subspace learning algorithm finds a low-dimensional subspace \(\mathbb{R}\) d in which samples from different classes can be well separated or in which some specified redundancy is minimized. The objective function of the subspace learning framework is as follows:

$$ W = \mathop {\arg \min }\limits_{W \in \Upomega } F(W) $$
(12)

Whatever the objective is, the transformation from \(\mathbb{R}\) m to \(\mathbb{R}\) d can be approximated by the linear function y i  = W T x i , i = 1, 2,…, n 1 + n 2, where y i is the low-dimensional representation of the sample x i , W ∈ \(\mathbb{R}\) m×d, x i  ∈ \(\mathbb{R}\) m, and y i  ∈ \(\mathbb{R}\) d. Denote the low-dimensional samples as:

$$ Y = Y_{S} \cup Y_{T} ,Y_{S} = \left\{ {y_{1} ,y_{2} , \ldots ,y_{{n_{1} }} } \right\},Y_{T} = \left\{ {y_{{n_{1} + 1}} ,y_{{n_{1} + 2}} , \ldots ,y_{{n_{1} + n_{2} }} } \right\} $$

Let the probability densities of the source domain data and the target domain data in the projected subspace W be p(y s ) and q(y t ), respectively; in general, p(y s ) ≠ q(y t ).

In this paper, we use the MMC as the subspace learning method. Of course, our framework can be easily extended to all of other subspace learning methods such as PCA, LDA, and LPP.

3.2.2 Incorporate sparse regularization

Regularization principles can deal with various machine learning problems [53], and there are several advantages of incorporating sparse regularization into the subspace learning framework. First, it makes the subspace more succinct and simpler and the calculation more efficient; parsimony is an especially important factor when the dimension of the original samples is very high. Second, it controls the importance of the original dimensions and decreases the influence of possible over-fitting. Third, it provides a good interpretation of the subspace and thus reveals an explicit relationship between the objective of the model and the given variables; this is important because we can understand the problem better by learning which dimensions play a more important role. The objective function of the sparse subspace learning framework with a sparse regularization is as follows:

$$ W = \mathop {\arg \min }\limits_{W \in \Upomega } F(W) + \alpha \Upphi (W) $$
(13)

We enforce the sparsity penalty on the projection matrix W and encourage the rows of W to be zeroed as much as possible. The intuition behind this is that we expect the source domain and target domain data to only depend on a subset of the latent dimensions. The zero-valued rows of W remove the influence of the corresponding latent dimensions.

In this paper, we enforce L 2,1-norm regularization on W, which encourages as many rows of W as possible to be zero. This lets us automatically discover the dimensionality of the latent space. Furthermore, in our transfer learning setting, we assume that the source domain and the target domain share some information, and this regularization favors representing this shared information in a common latent dimension space.

3.2.3 Incorporate distribution divergence regularization

The subspace learning framework with sparse regularization works well when the source domain data and target domain data are independent and identically distributed (i.i.d.). In the transfer learning setting, however, we relax this assumption, that is, P(X S ) ≠ Q(X T ). Then, the distributions of the low-dimensional projected data are also different, that is, P(Y S ) ≠ Q(Y T ). We should therefore take the difference between P(Y S ) and Q(Y T ) into consideration and ensure that they are close to each other in the projected subspace. Let Dist(Y S , Y T ) be an estimate of the distance between the distributions of the source and target domains in the projected subspace. Then, we obtain the general framework for transfer sparse subspace learning as follows:

$$ W = \mathop {\arg \min }\limits_{W \in \Upomega } F(W) + \alpha \Upphi (W) + \beta {\text{Dist}}\left( {Y_{S} ,Y_{T} } \right) $$
(14)

In this paper, we use the MMD criterion and the Bregman divergence criterion as the two main regularizations. Of course, our framework can easily be extended to other distribution divergence regularizations such as the Kullback–Leibler (K–L) divergence, β divergence, Jensen–Shannon divergence, and χ2 divergence.

3.3 MMD-based regularization

3.3.1 Reformulation of MMD regularization

Pan et al. [41] developed a transfer learning technique for learning in a latent space, called MMDE, which embeds the data from both domains into a common low-dimensional latent subspace. The key idea is to formulate it as a kernel learning problem using the kernel trick, K ij  = K(x i , x j ) = ϕ(x i )Tϕ(x j ), and to learn the kernel matrix defined on all the data:

$$ K = \left[ {\begin{array}{*{20}c} {K_{{Y_{S} ,Y_{S} }} } & {K_{{Y_{S} ,Y_{T} }} } \\ {K_{{Y_{S} ,Y_{T} }}^{T} } & {K_{{Y_{T} ,Y_{T} }} } \\ \end{array} } \right] \in R^{{(n_{1} + n_{2} ) \times (n_{1} + n_{2} )}} $$
(15)

where \( K_{{Y_{S} ,Y_{S} }} \), \( K_{{Y_{T} ,Y_{T} }} \), and \( K_{{Y_{S} ,Y_{T} }} \) are the Gram matrices defined on the source domain \( {\mathcal{D}}_{S} \), the target domain \( {\mathcal{D}}_{T} \), and across domains, respectively. Y S and Y T are the low-dimensional representations of the source domain data X S and the target domain data X T , respectively. Then, the distance (measured by MMD) between the two domains can be written as:

$$ {\text{Dist}}\left( {Y_{S} ,Y_{T} } \right) = tr(KL) $$
(16)

where the matrix L = [L ij ] is positive semidefinite, with

$$ L_{ij} = \left\{ {\begin{array}{*{20}c} {\frac{1}{{n_{1}^{2} }}} \hfill & {{\text{when }}x_{i} ,x_{j} \in X_{S} } \hfill \\ {\frac{1}{{n_{2}^{2} }}} \hfill & {{\text{when }}x_{i} ,x_{j} \in X_{T} } \hfill \\ { - \frac{1}{{n_{1} n_{2} }}} \hfill & {\text{otherwise}} \hfill \\ \end{array} } \right. $$
(17)

There are several limitations of MMDE. First, it is transductive and cannot handle out-of-sample data. Second, the resultant kernel learning problem has to be solved by expensive semidefinite programming (SDP) solvers. Third, the obtained K has to be post-processed by PCA, which may discard potentially useful information in K.

To overcome these limitations of MMDE, Pan et al. [42] proposed a new feature extraction method, TCA, for transfer learning. It learns a set of transfer components in an RKHS such that, when the domain data are projected onto the latent space spanned by the transfer components, the distance between domains is reduced. According to the empirical kernel map [59], K = (KK −1/2)(K −1/2 K), and the projection matrix \( \tilde{W} \in {\mathbb{R}}^{{(n_{1} + n_{2} ) \times m}} \) transforms the empirical kernel map features to an m-dimensional space (where m ≪ n 1 + n 2). The resulting kernel matrix is as follows:

$$ \tilde{K} = \left( {KK^{ - 1/2} \tilde{W}} \right)\left( {\tilde{W}^{T} K^{ - 1/2} K} \right) = KWW^{T} K $$

where \( W = K^{ - 1/2} \tilde{W} \). The distance between the two domains can then be formulated as:

$$ {\text{Dist}}\left( {Y_{S} ,Y_{T} } \right) = tr(\tilde{K}L) = tr\left( {\left( {KWW^{T} K} \right)L} \right) = tr\left( {W^{T} KLKW} \right) $$
(18)
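
For illustration, the sketch below builds the MMD coefficient matrix L of Eq. (17) and evaluates the distance term tr(W T KLKW) of Eq. (18) for an arbitrary kernel matrix K and projection W; it assumes the first n 1 samples belong to the source domain, and the toy K and W are random examples, not data from the paper.

```python
import numpy as np

def mmd_matrix(n1, n2):
    """Coefficient matrix L of Eq. (17); the first n1 indices are source samples."""
    L = np.full((n1 + n2, n1 + n2), -1.0 / (n1 * n2))
    L[:n1, :n1] = 1.0 / n1**2
    L[n1:, n1:] = 1.0 / n2**2
    return L

def mmd_distance(K, W, n1, n2):
    """Dist(Y_S, Y_T) = tr(W^T K L K W) as in Eq. (18)."""
    L = mmd_matrix(n1, n2)
    return np.trace(W.T @ K @ L @ K @ W)

rng = np.random.default_rng(0)
n1, n2, latent_dim = 40, 60, 5
A = rng.normal(size=(n1 + n2, n1 + n2))
K = A @ A.T                                   # a toy positive semidefinite kernel matrix
W = rng.normal(size=(n1 + n2, latent_dim))    # a toy projection in the kernel space
print(mmd_distance(K, W, n1, n2) >= -1e-9)    # the term is nonnegative (L is PSD)
```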

3.3.2 Reformulation of sparse regularization

In this paper, we incorporate the L 2,1-norm into the objective as the sparse regularization.

Let \( \Upphi (W) = \left\| W \right\|_{2,1} = \sum\nolimits_{i = 1}^{m} {\sqrt {\sum\nolimits_{j = 1}^{d} {w_{ij}^{2} } } } = \sum\nolimits_{i = 1}^{m} {\left\| {w^{i} } \right\|}_{2} \), and let D = (d ij ) ∈ R m×m be the diagonal matrix whose ith diagonal element is \( d_{ii} = 1/\left( {2\left\| {w^{i} } \right\|_{2} } \right) \), i = 1, 2,…, m, where w i is the ith row of W. It can be easily verified that [21]

$$ \min \left\| W \right\|_{2,1} \Leftrightarrow \min tr\left( {W^{T} DW} \right) $$
(19)

For more details, please refer to [21].
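
The equivalence in Eq. (19) underlies the reweighting trick used in the solution below: with the current W, tr(W T DW) = ‖W‖ 2,1/2, and recomputing D from W at each iteration drives the objective toward the L 2,1-norm. A small numerical check, purely illustrative (the eps guard for zero rows is an assumption of this sketch):

```python
import numpy as np

def reweight_matrix(W, eps=1e-12):
    """Diagonal D with d_ii = 1 / (2 ||w^i||_2); eps guards against zero rows."""
    row_norms = np.linalg.norm(W, axis=1)
    return np.diag(1.0 / (2.0 * np.maximum(row_norms, eps)))

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 3))
D = reweight_matrix(W)

l21 = np.sum(np.linalg.norm(W, axis=1))
# At the current W: tr(W^T D W) = ||W||_{2,1} / 2
print(np.isclose(np.trace(W.T @ D @ W), 0.5 * l21))   # True
```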

3.3.3 Transfer sparse subspace learning–based MMD (TSSL_MMD)

Besides reducing the distance between the two marginal distributions, we should also preserve the data properties that are useful for the target supervised learning task. In this part, we select MMC as the base subspace learning method. The MMC formulation in Eq. (10) can be equivalently written as:

$$ \arg \mathop {\min }\limits_{{W^{T} W = I}} F(W) = \arg \mathop {\min }\limits_{{W^{T} W = I}} tr\left( {W^{T} \left( {S_{w} - S_{b} } \right)W} \right) $$
(20)

To avoid degenerate solutions and the rank deficiency problem that arises in generalized eigenvalue decompositions, the constraint W T W = I is imposed.

Then, we can get the final optimization formulation of the transfer sparse subspace learning–based MMD (TSSL_MMD):

$$ \arg \mathop {\min }\limits_{{W^{T} W = I}} tr\left( {W^{T} \left( {S_{w} - S_{b} } \right)W} \right) + \alpha tr\left( {W^{T} DW} \right) + \beta tr\left( {W^{T} KLKW} \right) $$
(21)

The first term of Eq. (21) learns a shared subspace, the second term handles data sparsity and over-fitting, and the last term reduces the gap between the domain distributions (measured by MMD) in the reduced subspace. By minimizing Eq. (21), the shared knowledge is transferred across domains by exploiting the strengthened relations in the learned subspace.

3.3.4 Solution for TSSL_MMD

According to Eq. (11), we can reformulate the TSSL_MMD as follows:

$$ \arg \mathop {\min }\limits_{\begin{subarray}{l} W^{T} W = I \\ F^{T} F = I \end{subarray} } tr\left( {W^{T} X\left( {I - 2FF^{T} } \right)X^{T} W} \right) + \alpha tr\left( {W^{T} DW} \right) + \beta tr\left( {W^{T} KLKW} \right) $$
(22)

We can see that there are three variables to be optimized, namely W, D, and F. It is difficult to optimize them simultaneously, so we optimize them alternately, obtaining W, D, and F in turn.

Step 1 Fixing W, compute F. The second and the third terms are constant with respect to F, so the optimization problem becomes

$$ \arg \mathop {\max }\limits_{{F^{T} F = I}} tr\left( {F^{T} X^{T} WW^{T} XF} \right) $$

Clearly, we can use the spectral decomposition technique to solve this problem: the optimal F is formed by the eigenvectors corresponding to the c largest eigenvalues of the matrix X T WW T X.

Step 2 Fixing F, compute W and D. There are still two variables to be optimized, so we use a nested optimization technique.

Step 2.1 Fixing W, compute D. We can easily update D as follows:

$$ D = (d_{ii} ),\quad {\text{where}}\quad d_{ii} = 1/\left( {2\left\| {w^{i} } \right\|_{2} } \right),\quad i = 1,2, \ldots ,m $$

Step 2.2 Fixing D, compute W. The optimization problem becomes

$$ \arg \mathop {\min }\limits_{{W^{T} W = I}} tr\left( {W^{T} (X\left( {I - 2FF^{T} } \right)X^{T} + \alpha D + \beta KLK)W} \right) $$

We can again use the spectral decomposition technique: the optimal W is formed by the eigenvectors corresponding to the m smallest eigenvalues of the matrix X(I − 2FF T)X T + αD + βKLK, where m ≪ n 1 + n 2 − 1.

The iteration procedure is repeated until the algorithm converges. We give a convergence analysis of the algorithm in the next section.

The main algorithm is presented in Table 5.
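
As a concrete illustration of the alternating procedure summarized above (and in Table 5), here is a minimal sketch. It makes a simplifying assumption to keep the matrix dimensions consistent: the samples are represented by the columns of the kernel matrix K (the empirical kernel map), so that the scatter term, the sparsity term, and the MMD term in Eq. (22) are all (n 1 + n 2) × (n 1 + n 2). The kernel, parameter values, and the choices of c and d are arbitrary examples, not settings from the paper.

```python
import numpy as np

def tssl_mmd(K, n1, n2, d, c, alpha=1.0, beta=1.0, n_iter=20, eps=1e-12):
    """Alternating solver sketch for Eq. (22), with samples represented by columns of K."""
    n = n1 + n2
    X = K                                            # kernel-space representation (assumption)
    L = np.full((n, n), -1.0 / (n1 * n2))            # MMD coefficient matrix, Eq. (17)
    L[:n1, :n1] = 1.0 / n1**2
    L[n1:, n1:] = 1.0 / n2**2
    KLK = K @ L @ K

    W = np.linalg.qr(np.random.default_rng(0).normal(size=(n, d)))[0]  # orthonormal init
    for _ in range(n_iter):
        # Step 1: fix W, update F from the top-c eigenvectors of X^T W W^T X.
        M = X.T @ W @ W.T @ X
        vals, vecs = np.linalg.eigh(M)
        F = vecs[:, np.argsort(vals)[::-1][:c]]
        # Step 2.1: fix W, update the reweighting matrix D.
        D = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), eps)))
        # Step 2.2: fix F and D, update W from the d smallest eigenvectors.
        A = X @ (np.eye(n) - 2.0 * F @ F.T) @ X.T + alpha * D + beta * KLK
        vals, vecs = np.linalg.eigh(A)
        W = vecs[:, np.argsort(vals)[:d]]
    return W

rng = np.random.default_rng(1)
Z = rng.normal(size=(80, 80))
K = Z @ Z.T                     # toy PSD kernel matrix for 50 source + 30 target samples
W = tssl_mmd(K, n1=50, n2=30, d=5, c=2)
print(W.shape)                  # (80, 5)
```

In practice, one would monitor the objective of Eq. (22) across iterations, which should decrease monotonically as argued in Sect. 3.3.5.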

3.3.5 Convergence analysis of TSSL_MMD

In this section, we prove that TSSL_MMD monotonically decreases the objective of the problem in Eq. (21). First, we recall the following lemma from [21].

Lemma 1

For any nonzero vectors w, v ∈ R d , the following inequality holds.

$$ \left\| w \right\|_{2} - \frac{{\left\| w \right\|_{2}^{2} }}{{2\left\| v \right\|_{2} }} \le \left\| v \right\|_{2} - \frac{{\left\| v \right\|_{2}^{2} }}{{2\left\| v \right\|_{2} }} $$

Theorem 1

The algorithm will monotonically decrease the objective of the problem in Eq. ( 21 ) in each iteration and will converge to the optimum of the problem.

Proof

It can be easily verified that optimizing Eq. (21) is equivalent to solving Eq. (22). As seen in the algorithm, when D is fixed as D t , we can compute W and F. In the tth iteration, we solve the following problem:

$$ W_{t + 1} ,F_{t + 1} = \arg \mathop {\min }\limits_{{W^{T} W = I,F^{T} F = I}} tr\left( {W^{T} X\left( {I - 2{\text{FF}}^{T} } \right)X^{T} W} \right) + \alpha tr\left( {W^{T} D_{t} W} \right) + \beta tr\left( {W^{T} KLKW} \right) $$

Then we obtain the following inequality:

$$ \begin{gathered} tr\left( {W_{t + 1}^{T} X\left( {I - 2F_{t + 1} F_{t + 1}^{T} } \right)X^{T} W_{t + 1} } \right) + \alpha tr\left( {W_{t + 1}^{T} D_{t} W_{t + 1} } \right) + \beta tr\left( {W_{t + 1}^{T} KLKW_{t + 1} } \right) \hfill \\ \le tr\left( {W_{t}^{T} X\left( {I - 2F_{t} F_{t}^{T} } \right)X^{T} W_{t} } \right) + \alpha tr\left( {W_{t}^{T} D_{t} W_{t} } \right) + \beta tr\left( {W_{t}^{T} KLKW_{t} } \right) \hfill \\ \end{gathered} $$

Since

$$ \left\| W \right\|_{2,1} = \sum\limits_{i = 1}^{m} {\left\| {w^{i} } \right\|_{2} } ,\quad d_{ii} = 1/\left( {2\left\| {w^{i} } \right\|_{2} } \right),\quad i = 1,2, \ldots ,m $$

where w i is the ith row of W, the above inequality indicates that

$$ \begin{gathered} tr\left( {W_{t + 1}^{T} X\left( {I - 2F_{t + 1} F_{t + 1}^{T} } \right)X^{T} W_{t + 1} } \right) + \alpha \sum\limits_{i = 1}^{m} {\frac{{\left\| {w_{t + 1}^{i} } \right\|_{2}^{2} }}{{2\left\| {w_{t}^{i} } \right\|_{2} }}} + \beta tr\left( {W_{t + 1}^{T} KLKW_{t + 1} } \right) \hfill \\ \le tr\left( {W_{t}^{T} X\left( {I - 2F_{t} F_{t}^{T} } \right)X^{T} W_{t} } \right) + \alpha \sum\limits_{i = 1}^{m} {\frac{{\left\| {w_{t}^{i} } \right\|_{2}^{2} }}{{2\left\| {w_{t}^{i} } \right\|_{2} }}} + \beta tr\left( {W_{t}^{T} KLKW_{t} } \right) \hfill \\ \end{gathered} $$
(23)

According to Lemma 1, for each i, we have

$$ \left\| {w_{t + 1}^{i} } \right\|_{2} - \frac{{\left\| {w_{t + 1}^{i} } \right\|_{2}^{2} }}{{2\left\| {w_{t}^{i} } \right\|_{2} }} \le \left\| {w_{t}^{i} } \right\|_{2} - \frac{{\left\| {w_{t}^{i} } \right\|_{2}^{2} }}{{2\left\| {w_{t}^{i} } \right\|_{2} }} $$

Then the following inequality holds

$$ \sum\limits_{i = 1}^{m} {\left\| {w_{t + 1}^{i} } \right\|_{2} - \frac{{\left\| {w_{t + 1}^{i} } \right\|_{2}^{2} }}{{2\left\| {w_{t}^{i} } \right\|_{2} }}} \le \sum\limits_{i = 1}^{m} {\left\| {w_{t}^{i} } \right\|_{2} - \frac{{\left\| {w_{t}^{i} } \right\|_{2}^{2} }}{{2\left\| {w_{t}^{i} } \right\|_{2} }}} $$
(24)

Combining Eqs. (23) and (24), we can get the following result:

$$ \begin{gathered} tr\left( {W_{t + 1}^{T} X\left( {I - 2F_{t + 1} F_{t + 1}^{T} } \right)X^{T} W_{t + 1} } \right) + \alpha \sum\limits_{i = 1}^{m} {\left\| {w_{t + 1}^{i} } \right\|_{2} } + \beta tr\left( {W_{t + 1}^{T} KLKW_{t + 1} } \right) \hfill \\ \le tr\left( {W_{t}^{T} X\left( {I - 2F_{t} F_{t}^{T} } \right)X^{T} W_{t} } \right) + \alpha \sum\limits_{i = 1}^{m} {\left\| {w_{t}^{i} } \right\|_{2} } + \beta tr\left( {W_{t}^{T} KLKW_{t} } \right) \hfill \\ \end{gathered} $$

That is to say

$$ \begin{gathered} tr\left( {W_{t + 1}^{T} X\left( {I - 2F_{t + 1} F_{t + 1}^{T} } \right)X^{T} W_{t + 1} } \right) + \alpha \left\| {W_{t + 1} } \right\|_{2,1} + \beta tr\left( {W_{t + 1}^{T} KLKW_{t + 1} } \right) \hfill \\ \le tr\left( {W_{t}^{T} X\left( {I - 2F_{t} F_{t}^{T} } \right)X^{T} W_{t} } \right) + \alpha \left\| {W_{t} } \right\|_{2,1} + \beta tr\left( {W_{t}^{T} KLKW_{t} } \right) \hfill \\ \end{gathered} $$

This inequality indicates that the algorithm monotonically decreases the objective of the problem in Eq. (21) in each iteration. Moreover, since the three terms in Eq. (21) are convex functions and the objective function is bounded below (by zero), the above iteration converges to the optimum of the problem. In the experimental section, we show that our algorithm converges quickly. □

3.4 Bregman divergence–based regularization

3.4.1 Reformulation of Bregman divergence regularization

On the basis of the above definition, we can give the Bregman divergence–based regularization, which measures the distance between P(Y S ) and Q(Y T ):

$$ {\text{Dist}}(Y_{S} ,Y_{T} ) = \int {{\text{d}}\left( {p\left( {y_{s} } \right),q\left( {y_{t} } \right)} \right){\text{d}}\mu } $$
(25)

where dμ is the Lebesgue measure. The right-hand side is also called the U-divergence on the subspace \(\mathbb{R}\) d. When we set φ(y) = y 2, the regularization term reduces to the squared Euclidean distance form:

$$ {\text{Dist}}(Y_{S} ,Y_{T} ) = \int {{\text{d}}\left( {p\left( {y_{s} } \right),q\left( {y_{t} } \right)} \right){\text{d}}\mu } = \int {\left( {p\left( {y_{s} } \right) - q\left( {y_{t} } \right)} \right)^{2} } {\text{d}}y $$
(26)

Then, we use the kernel density estimation (KDE) technique [60] to estimate the distributions P(Y S ) and Q(Y T ) in the projected subspace W, that is,

$$ p\left( {y_{s} } \right) = \frac{1}{{n_{1} }}\sum\limits_{i = 1}^{{n_{1} }} {G_{{\Upsigma_{1} }} \left( {y - y_{i} } \right)} ,\quad q\left( {y_{t} } \right) = \frac{1}{{n_{2} }}\sum\limits_{{j = n_{1} + 1}}^{{n_{1} + n_{2} }} {G_{{\Upsigma_{2} }} \left( {y - y_{j} } \right)} $$
(27)

For two arbitrary Gaussian kernels, we have

$$ \int {G_{{\Upsigma_{1} }} \left( {y - y_{i} } \right)G_{{\Upsigma_{2} }} \left( {y - y_{j} } \right)} \,{\text{d}}y = G_{{\Upsigma_{1} + \Upsigma_{2} }} \left( {y_{i} - y_{j} } \right) $$
(28)

Then, we can get the discrete form of Bregman divergence:

$$ \begin{aligned} {\text{Dist}}\left( {Y_{S} ,Y_{T} } \right) & = \frac{1}{{n_{1}^{2} }}\sum\limits_{i = 1}^{{n_{1} }} {\sum\limits_{j = 1}^{{n_{1} }} {G_{{\Upsigma_{11} }} \left( {y_{j} - y_{i} } \right)} } + \frac{1}{{n_{2}^{2} }}\sum\limits_{{i = n_{1} + 1}}^{{n_{1} + n_{2} }} {\sum\limits_{{j = n_{1} + 1}}^{{n_{1} + n_{2} }} {G_{{\Upsigma_{22} }} \left( {y_{j} - y_{i} } \right)} } \\ & \quad - \frac{2}{{n_{1} n_{2} }}\sum\limits_{i = 1}^{{n_{1} }} {\sum\limits_{{j = n_{1} + 1}}^{{n_{1} + n_{2} }} {G_{{\Upsigma_{12} }} \left( {y_{j} - y_{i} } \right)} } \\ \end{aligned} $$
(29)

where \( \Upsigma_{11} = \Upsigma_{1} + \Upsigma_{1} ,\quad \Upsigma_{12} = \Upsigma_{1} + \Upsigma_{2} ,\quad \Upsigma_{22} = \Upsigma_{2} + \Upsigma_{2} \).
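
For illustration, the sketch below evaluates the discrete form in Eq. (29) with isotropic Gaussian kernels G Σ (Σ = σ2 I), given projected source and target samples; the bandwidths and toy data are example values, not settings from the paper.

```python
import numpy as np

def gauss_kernel_sum(Ya, Yb, sigma2):
    """Sum over all pairs (a, b) of the isotropic Gaussian G_{sigma2*I}(y_a - y_b)."""
    d = Ya.shape[1]
    sq = np.sum(Ya**2, 1)[:, None] + np.sum(Yb**2, 1)[None, :] - 2.0 * Ya @ Yb.T
    G = np.exp(-sq / (2.0 * sigma2)) / (2.0 * np.pi * sigma2) ** (d / 2.0)
    return G.sum()

def bregman_distance(Ys, Yt, s1=1.0, s2=1.0):
    """Discrete divergence of Eq. (29): Sigma_11 = 2*s1*I, Sigma_22 = 2*s2*I, Sigma_12 = (s1+s2)*I."""
    n1, n2 = len(Ys), len(Yt)
    return (gauss_kernel_sum(Ys, Ys, 2.0 * s1) / n1**2
            + gauss_kernel_sum(Yt, Yt, 2.0 * s2) / n2**2
            - 2.0 * gauss_kernel_sum(Ys, Yt, s1 + s2) / (n1 * n2))

rng = np.random.default_rng(0)
Ys = rng.normal(0.0, 1.0, size=(50, 2))   # projected source samples y_i = W^T x_i
Yt = rng.normal(1.0, 1.0, size=(60, 2))   # projected target samples
print(bregman_distance(Ys, Yt))           # grows as the projected distributions drift apart
print(bregman_distance(Ys, Ys))           # zero when the two sample sets coincide
```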

3.4.2 Transfer sparse subspace learning–based Bregman divergence (TSSL_BD)

In this section, we present the Bregman divergence–based regularization Dist(Y S , Y T ), which measures the distribution difference between samples drawn from different domains in a projected subspace. We again employ the L 2,1-norm as the sparse regularization and MMC as the base subspace learning method. Then, we obtain the final optimization formulation of transfer sparse subspace learning based on Bregman divergence (TSSL_BD):

$$ \arg \mathop {\min }\limits_{{W^{T} W = I}} tr\left( {W^{T} (S_{w} - S_{b} )W} \right) + \alpha tr\left( {W^{T} DW} \right) + \beta {\text{Dist}}\left( {Y_{S} ,Y_{T} } \right). $$
(30)

3.4.3 Solution for TSSL_BD

The convexity of the TSSL_BD method depends on all three terms of the objective function. The convexity of the first term F(W) depends on the particular subspace learning method, such as PCA, MMC, or LPP. The convexity of the second term Φ(W) depends on the choice of sparse regularization, such as the L 1-norm, L 2-norm, or L 2,1-norm. The convexity of the third term Dist(Y S , Y T ) depends on the choice of the criterion used to estimate the divergence between the data distributions. So it is not easy to establish the convexity of the TSSL problem in general; it is problem dependent.

Because Eq. (30) is not convex, a solution can be obtained by the gradient descent algorithm, that is,

$$ W \leftarrow W - \eta \left( {\partial_{W} tr\left( {W^{T} \left( {S_{w} - S_{b} } \right)W} \right) + \alpha \partial_{W} tr\left( {W^{T} DW} \right) + \beta \partial_{W} {\text{Dist}}\left( {Y_{S} ,Y_{T} } \right)} \right) $$
(31)

where η is the learning rate and ∂ W denotes the gradient with respect to W.

Next, we calculate the derivatives of the three terms one by one.

First, the derivative of tr(W T(S w  − S b )W) with respect to W is simply

$$ \partial_{W} tr\left( {W^{T} \left( {S_{w} - S_{b} } \right)W} \right) = 2\left( {S_{w} - S_{b} } \right)W $$
(32)

Second, note that D depends on W through

$$ d_{ii} = 1/\left( {2\left\| {w^{i} } \right\|_{2} } \right),\quad \left\| {w^{i} } \right\|_{2} = \sqrt {\sum\limits_{j = 1}^{d} {w_{ij}^{2} } } $$

So, noting that \( tr\left( {W^{T} DW} \right) = \sum\nolimits_{i = 1}^{m} {d_{ii} \left\| {w^{i} } \right\|_{2}^{2} } = \frac{1}{2}\sum\nolimits_{i = 1}^{m} {\left\| {w^{i} } \right\|_{2} } \) and \( \partial \left\| {w^{i} } \right\|_{2} /\partial w_{ij} = w_{ij} /\left\| {w^{i} } \right\|_{2} = 2d_{ii} w_{ij} \), the derivative of tr(W T DW) with respect to W (taking the dependence of D on W into account) is

$$ \frac{{\partial \,tr\left( {W^{T} DW} \right)}}{\partial W} = \frac{\partial }{\partial W}\left( {\frac{1}{2}\sum\limits_{i = 1}^{m} {\left\| {w^{i} } \right\|_{2} } } \right) = DW $$
(33)

Third, similar to [53], we can get the derivative of Dist(Y S , Y T ) with respect to W:

$$ \begin{aligned} \sum\limits_{i = 1}^{{n_{1} + n_{2} }} {\frac{{\partial {\text{Dist}}(Y_{S} ,Y_{T} )}}{{\partial y_{i} }}\frac{{\partial y_{i} }}{\partial W}} & = \sum\limits_{i = 1}^{{n_{1} }} {\frac{{\partial {\text{Dist}}(Y_{S} ,Y_{T} )}}{{\partial y_{i} }}\frac{{\partial y_{i} }}{\partial W}} + \sum\limits_{{i = n_{1} + 1}}^{{n_{1} + n_{2} }} {\frac{{\partial {\text{Dist}}(Y_{S} ,Y_{T} )}}{{\partial y_{i} }}\frac{{\partial y_{i} }}{\partial W}} \\ & = \frac{2}{{n_{1}^{2} }}\sum\limits_{i = 1}^{{n_{1} }} {\sum\limits_{j = 1}^{{n_{1} }} {G_{{\Upsigma_{11} }} (y_{i} - y_{j} )(\Upsigma_{11} )^{ - 1} (y_{j} - y_{i} )x_{i}^{T} } } \\ & \quad + \frac{2}{{n_{2}^{2} }}\sum\limits_{{i = n_{1} + 1}}^{{n_{1} + n_{2} }} {\sum\limits_{{j = n_{1} + 1}}^{{n_{1} + n_{2} }} {G_{{\Upsigma_{22} }} (y_{i} - y_{j} )(\Upsigma_{22} )^{ - 1} (y_{j} - y_{i} )x_{i}^{T} } } \\ & \quad - \frac{2}{{n_{1} n_{2} }}\sum\limits_{i = 1}^{{n_{1} }} {\sum\limits_{{j = n_{1} + 1}}^{{n_{1} + n_{2} }} {G_{{\Upsigma_{12} }} (y_{j} - y_{i} )(\Upsigma_{12} )^{ - 1} (y_{j} - y_{i} )x_{i}^{T} } } \\ & \quad - \frac{2}{{n_{1} n_{2} }}\sum\limits_{{i = n_{1} + 1}}^{{n_{1} + n_{2} }} {\sum\limits_{j = 1}^{{n_{1} }} {G_{{\Upsigma_{12} }} (y_{j} - y_{i} )(\Upsigma_{12} )^{ - 1} (y_{j} - y_{i} )x_{i}^{T} } } \\ \end{aligned} $$
(34)

So we can obtain a solution W of TSSL_BD in an iterative way.
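
To make the update in Eq. (31) concrete, the sketch below performs gradient descent using the derivatives in Eqs. (32)–(34). It is an illustration under several assumptions: the kernels are isotropic (Σ = σ2 I), S w  − S b is precomputed from the labeled source data, the orthogonality constraint W T W = I is handled here by a simple QR re-orthonormalization after each step (a projected-gradient variant; the paper's Eq. (31) is a plain gradient update), and all parameter values and data are arbitrary examples.

```python
import numpy as np

def gauss(diff, s2):
    """Isotropic Gaussian G_{s2*I} evaluated at each row of diff (k x d)."""
    d = diff.shape[1]
    return np.exp(-np.sum(diff**2, axis=1) / (2.0 * s2)) / (2.0 * np.pi * s2) ** (d / 2.0)

def dist_grad_W(X, W, n1, s1=1.0, s2=1.0):
    """Gradient of the Bregman regularizer Dist(Y_S, Y_T) w.r.t. W (cf. Eq. (34))."""
    Y = (W.T @ X).T                      # y_i = W^T x_i, one row per sample
    n = Y.shape[0]
    n2 = n - n1
    s11, s22, s12 = 2.0 * s1, 2.0 * s2, s1 + s2
    G = np.zeros_like(Y)                 # G[i] = dDist / dy_i
    for i in range(n):
        diff = Y - Y[i]                  # rows y_j - y_i
        if i < n1:                       # source sample
            G[i] += (2.0 / n1**2) * (gauss(diff[:n1], s11)[:, None] * diff[:n1] / s11).sum(0)
            G[i] -= (2.0 / (n1 * n2)) * (gauss(diff[n1:], s12)[:, None] * diff[n1:] / s12).sum(0)
        else:                            # target sample
            G[i] += (2.0 / n2**2) * (gauss(diff[n1:], s22)[:, None] * diff[n1:] / s22).sum(0)
            G[i] -= (2.0 / (n1 * n2)) * (gauss(diff[:n1], s12)[:, None] * diff[:n1] / s12).sum(0)
    return X @ G                         # sum_i x_i (dDist/dy_i)^T, an m x d matrix

def tssl_bd(X, S_diff, n1, d, alpha=0.1, beta=0.1, eta=0.01, n_iter=50):
    """Gradient descent sketch for Eq. (30); S_diff = S_w - S_b from the labeled data."""
    W = np.linalg.qr(np.random.default_rng(0).normal(size=(X.shape[0], d)))[0]
    for _ in range(n_iter):
        D = np.diag(1.0 / (2.0 * np.maximum(np.linalg.norm(W, axis=1), 1e-12)))
        grad = 2.0 * S_diff @ W + alpha * (D @ W) + beta * dist_grad_W(X, W, n1)
        W = W - eta * grad               # gradient step of Eq. (31)
        W, _ = np.linalg.qr(W)           # re-orthonormalize (simple retraction onto W^T W = I)
    return W

rng = np.random.default_rng(1)
Xs = np.hstack([rng.normal(0, 1, (10, 20)), rng.normal(2, 1, (10, 20))])   # two labeled source classes
Xt = rng.normal(1, 1, (10, 30))                                            # unlabeled target data
X = np.hstack([Xs, Xt])
m_all = Xs.mean(1, keepdims=True)
S_w, S_b = np.zeros((10, 10)), np.zeros((10, 10))
for Xc in (Xs[:, :20], Xs[:, 20:]):
    m_c = Xc.mean(1, keepdims=True)
    S_w += (Xc - m_c) @ (Xc - m_c).T
    S_b += Xc.shape[1] * (m_c - m_all) @ (m_c - m_all).T
W = tssl_bd(X, S_w - S_b, n1=40, d=3)
print(W.shape)   # (10, 3)
```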

Similar to Sect. 3.3.4, after obtaining the projection matrix W, we can predict the labels of the target domain data. In the training stage, the labels of the target domain are not visible to the subspace learning methods. One reference image for each test class is preserved so that classification can be performed in the testing stage. Then, we adopt the nearest neighbor classifier to predict the labels of the remaining test images in the selected subspace. The main algorithm is presented in Table 6.

In this section, we do not give a theoretical convergence analysis; instead, we give experimental results in the next section, which demonstrate that the algorithm converges quickly.

4 Experimental results

4.1 Data sets descriptions

4.1.1 20-Newsgroups data sets

The 20-newsgroups data set contains approximately 20,000 documents from 20 different newsgroups. It is widely used to test the performance of text mining algorithms. We employed the conventional preprocessing strategy to construct the data sets, that is, we ignored the headers, removed stop words, and selected the top words by mutual information. To make the 20-newsgroups data set suitable for our transfer learning setting, we reorganized the 20 subcategories into related but different domains, following a preprocessing strategy similar to [61]. Specifically, we reorganized the 20 subcategories into 6 source and target domain pairs. Within each domain pair, the texts are drawn from only two top categories, and within each domain of the pair, positive instances consist of some subcategories of one top category, while negative instances consist of some other subcategories of the other top category. In each data set, we randomly selected 100 target documents as training samples, and the remaining documents were used as testing samples. Detailed settings of the 20-newsgroups data set are shown in Table 2.

Table 2 Data description: the 20-newsgroups data sets (the number of the document is inside the parentheses)

4.1.2 Face image data sets

To the best of our knowledge, no public face data sets have been constructed for transfer learning. In this paper, similar to [53], we build a set of data sets based on existing face data sets, namely ORL [62], YALE [2], FERET [63], and UMIST [64]. The ORL face database [62] contains 10 images for each of 40 human subjects, taken at different times and with varying lighting, facial expressions, and facial details. The original images (with 256 gray levels) have size 92 × 112 and are resized to 32 × 32 for efficiency. The YALE face database [2] consists of 165 images of 15 individuals, each with 11 images showing different facial expressions or configurations. The FERET face database [63] contains 13,539 face images collected from 1,565 individuals, photographed with different sizes, poses, illuminations, and facial expressions. The UMIST face database [64] consists of 564 images of 20 people of different races, genders, and appearances, covering a range of poses from profile to frontal views. For simplicity, we randomly select 100 individuals from FERET, each with 6 images; all images are used from the other three face databases. We generated six new data sets for transfer learning settings by mixing some of them together. Detailed settings of the new face image data sets are shown in Table 3.

Table 3 Data description: the face image data sets (the number of the images is inside the parentheses)

4.2 Experimental setup

For the text data sets, we compare our transfer sparse subspace learning methods, TSSL_MMD and TSSL_BD, with the following typical methods:

  1.

    SVM: train a linear support vector machine classifier in the original space of the source domain, and then directly apply this classifier to the testing target data;

  2.

    PCA [2]: first apply PCA to obtain a latent space of the source and target domains, and then use SVM to train a classifier for the testing target data;

  3.

    MMDE [41], TCA [42], SCL [35], and TSL [53]: these classical transfer learning methods first obtain a common latent space of the source and target domains, and then SVM is used to train a classifier for the testing target data.

For the face image data sets, we compare our transfer sparse subspace learning methods, TSSL_MMD and TSSL_BD, with the following typical methods:

  1.

    PCA [2] and MMC [3]: first apply PCA or MMC to obtain a latent space W of the source domain, then project the target domain data into this latent space by W, and finally use KNN to find the labels of the unlabeled target domain data;

  2.

    MMDE [41], TCA [42], SCL [35], and TSL [53]: these classical transfer learning methods first obtain a common latent space of the source and target domains, and then KNN is used to find the labels of the unlabeled target domain data.

In this section, we use classification accuracy to measure the classification performance. We run 10 repetitions and report the means and the standard deviations of all methods.

In the TSSL_MMD method, the most important factor is the selection of the kernel function. Based on the well-known observation that the linear kernel is often adequate for high-dimensional text data [41], we employ the linear kernel for the 20-newsgroups data. Because the manifold assumption is much stronger for the face image data than for the text data, we employ the Laplacian kernel for the face image data.

In the TSSL_BD method, we initialize W as the MMC projection matrix and empirically set the learning rate as η(k) = η(0)/k. The intuition is that in the early iterations we take large steps because the initial W is far from the optimal solution, whereas in later iterations we take small steps and let W gradually approach the optimal solution.

4.3 Overall comparison results

In this section, we perform three groups of experiments. The first group examines the classification accuracies of the various methods on both the text data and the face image data. The second group examines the convergence property of the TSSL methods. The last group examines the sensitivity of the regularization parameters.

4.3.1 Classification accuracies of various methods

First, on the text data, the dimensionality of the subspace varies from 5 to 40, and on the face image data, it varies from 10 to 80. Figures 1 and 2 show the performance of the various methods on the text data sets and the face image data sets, respectively.

Fig. 1 The performances of various methods with different subspace dimensions on the text data sets

Fig. 2 The performances of various methods with different subspace dimensions on the face image data sets

Second, we fix the dimensionality of the subspace at 20 for the text data and at 40 for the face image data; the resulting classification accuracies of the various methods are shown in Table 4.

Table 4 The classification accuracies of various methods (the standard deviations are inside the parentheses)

From Figs. 1 and 2 and Tables 4, 5 and 6, we can see that our TSSL methods outperform the other methods on all of the text data sets and face image data sets. The numbers shown in bold are the best results among the eight methods. There are several observations:

Table 5 Transfer sparse subspace learning based on MMD (TSSL_MMD)
Table 6 Transfer sparse subspace learning based on BD (TSSL_BD)
  1.

    The SVM method on the text data sets and the PCA and MMC methods on the face image data sets perform worse than the other transfer learning methods, because none of them takes the distribution difference between the source and target domains into consideration. The worst method on the face image data sets is SCL; perhaps because SCL was developed for natural language processing (NLP), it is not well suited to image data. MMC performs better than PCA because it exploits more of the discriminant structure of the face image data, which is why we select MMC as the base subspace learning method in our methods.

  2.

    On some text data sets, such as C2R and C2T, PCA sometimes performs as well as traditional transfer learning methods. This is because these two data sets may have more similarities than the others, so there is less need for transfer.

  3.

    The representative transfer learning methods, MMDE, TCA, and TSL, perform much better than PCA, but worse than our TSSL methods. The reason is that the previous methods only consider the distribution difference between the source and target domain data, whereas our TSSL methods also enforce sparse regularization on the objective function, which yields better results and transfers more useful information from the source domain to the target domain.

  4.

    Another observation is that TSSL_MMD is better than TSSL_BD on the text data, but worse on the face image data. Perhaps the MMD distance criterion is more suitable for text data and the Bregman divergence criterion is more suitable for face image data. This is an open problem, and we will focus on it in the future.

  5.

    The classification accuracies on the face image data are lower than on the text data sets because there is less commonality between the source and target domains in the face image data sets.

4.3.2 Convergence property

In this section, we test the convergence property of our TSSL methods. We select one text data set, C2R, and one face image data set, O2Y, as the basic databases for this experiment.

The results in Fig. 3 show that both TSSL_MMD and TSSL_BD converge quickly, within 20 iterations.

Fig. 3 The convergence property of the TSSL methods

4.3.3 Sensitivity analysis of the TSSL parameters

There are two parameters, α and β, in the TSSL methods to discuss. Intuitively, when α is set larger, the sparse regularization favors a larger number of zero rows in the projection matrix W, which makes the source and target domains share more information in the common subspace. But if α is set too large, most rows of W will be zero, which is not suitable for transfer either. Similarly, when β is set larger, the distribution difference between the source and target domains becomes smaller. But if β is set too large, less information can be transferred from the source domain to the target domain.

In a word, we should select appropriate parameters to make transfer learning more effective. We first determine the two parameters α and β of the TSSL method by grid search and then vary them within certain ranges.

The main procedure is as follows. First, we fix α = 1 and search for the best β value based on the validation set in the range [10−5, 105]. Then, we fix β and search for the best α value in the range [10−1, 101]. Finally, we let α vary from 1.4 to 2.3 and let β vary in the range [10−2, 10−1]. The classification accuracies for different α and β on the S2T data set and the F2U data set are shown in Fig. 4. As seen from Fig. 4, when the two parameters are varied within a certain range, the performance of TSSL changes only within a certain range.

Fig. 4 Sensitivity analysis of the TSSL parameters

5 Conclusion and discussion

In this paper, we proposed a general framework for transfer sparse subspace learning, which is suitable for different assumptions on the divergence measures of the data distributions, such as MMD, Bregman divergence, and K–L divergence. To overcome over-fitting, we employ sparse regularization on the objective function and give different solutions to the problem based on different distribution distance estimation criteria. Experiments on both text data sets and face image data sets verify the efficiency and effectiveness of the proposed TSSL methods.

In the future, we plan to extend our TSSL framework to a multi-source-domain setting. We also plan to consider heterogeneous transfer learning (HTL), which is a more challenging transfer learning setting: HTL considers not only the different distributions of the source and target domains but also their different feature spaces.

There is another promising direction. In this paper, we used unsupervised divergence estimation terms and ignored the labeled target domain data. In future work, we will seek a divergence estimation regularization that makes full use of the supervised information of the target domain data and thus improves the accuracy of the distribution divergence estimation.