1 Introduction

As an important research field of computer vision, image classification has been widely studied in the past few years, and many methods have been proposed (Lan et al. 2019; Wright et al. 2009; Yang et al. 2013). These methods commonly assume that samples from the training set and the testing set follow a similar distribution. However, in many practical applications it is difficult to ensure that the two sets follow the same distribution, owing to factors such as resolution, viewpoint and illumination. A model learned on a training set usually fails on a testing set whose distribution is different. To address this scenario, transfer learning has attracted considerable attention; it aims to effectively apply the knowledge learned from a training (source) domain to a testing (target) domain. Recently, numerous transfer learning works have been proposed and have achieved impressive performance in many applications, such as image classification (Kobylarz et al. 2020; Singh et al. 2019; Wang et al. 2017), motion segmentation (Wang et al. 2018b) and image retrieval (Xu et al. 2019).

As a special case of transfer learning, domain adaptation (DA) has attracted much attention in recent years. In DA, a popular strategy is to take information from both domains into consideration and extract new domain-invariant features shared by the two domains. According to whether target labels are available during training, DA can be divided into two categories: semi-supervised DA and unsupervised DA. This paper focuses on unsupervised DA, which is the more difficult and challenging setting.

Finding a common latent subspace to reduce the distribution shift problem between the two domains is critical in unsupervised DA (Si et al. 2010; Singh and Nigam 2019). For example, Si et al. (2010) constructed a regularization to minimize the Bregman divergence between the two domains in a selected subspace. Gong et al. (2012) utilized geodesic flow kernel (GFK) to find a geodesic from the source domain to the target domain. Long et al. (2013b) proposed to project data of both domains into a new subspace by jointly embedding both marginal distribution and conditional distribution into a principled dimensionality reduction procedure. Ghifary et al. (2017) proposed scatter component analysis (SCA) for DA and domain generalization, in which data scatter is used as a geometrical measure to evaluate the separability of classes, the mismatch between domains and the separability of data. To reduce the distribution divergence and evaluate the importance of the marginal and conditional distributions, Wang et al. (2018a) proposed to learn a domain-invariant classifier in a Grassmann manifold with structural risk minimization. Although good performance has been reported, these methods seldom exploit the structural information of the data.

Many studies (Xu et al. 2016; Shao et al. 2014) have verified that incorporating structural constraints (such as low-rank and sparse constraints) into the transfer learning process can effectively improve DA performance. For instance, Shao et al. (2014) proposed low-rank transfer subspace learning (LTSL) for face recognition and object recognition, where each sample from the target domain is reconstructed by samples from the source domain in a generalized subspace under a low-rank constraint. Xiao et al. (2019) proposed structure preservation and distribution alignment (SPDA) for unsupervised DA. However, these works ignore the statistical characteristics of the data. Moreover, all of these methods use the convex nuclear norm to approximate the non-convex rank function, which may make the solution deviate considerably from the true low-rank solution (Nie et al. 2012).

To better approximate the rank function, mitigate the distribution shift and further exploit the structural properties of the data, we propose a unified framework that incorporates structural information and statistical distributions for unsupervised DA in image classification, as shown in Fig. 1. Specifically, the proposed framework retains both global and local structural information by imposing a graph-structure constraint and a Schatten p-norm constraint, and reduces the distribution shift by aligning both the marginal and conditional distributions between the source and target domains. Moreover, the label information of the source domain is exploited effectively by the \(\epsilon \)-dragging technique. An effective optimization procedure is developed. Experiments on six datasets demonstrate the superiority of the proposed approach.

The contributions of this work are summarized as follows. (1) A unified framework consisting of global and local structure constraints, marginal and conditional distribution alignment, and a classification error term is proposed for unsupervised DA, and the Schatten p-norm (\(0<p<1\)) is introduced to capture the data structure more accurately. (2) Better performance is achieved on six benchmark datasets than with several representative unsupervised DA methods.

Fig. 1
figure 1

Illustration of our method. Blue: source samples. Pink: target samples. Circles, squares, and triangles indicate three different categories. By considering structural information, marginal and conditional distribution shifts, as well as label information of the source domain, the hyper-plane learned from the two domains can perfectly classify the target samples

The rest of this paper is structured as follows: Sect. 2 introduces the notation used in this paper. Section 3 briefly reviews some related works. Section 4 shows the description and optimization procedure about the proposed method in detail. Section 5 discusses the experiments and results on the six datasets, and the last section concludes the paper.

2 Preliminaries

In this work, we denote \({\varvec{X}}_s\in R^{d\times n_s}\), \({\varvec{X}}_t\in R^{d\times n_t}\), \(\varvec{ {Z}}\in R^{n_s\times n_t}\), \(\varvec{ {P}}\in R^{d\times m}\) and \(\varvec{ {E}}\in R^{m\times n_t}\) as the source data, target data, reconstruction matrix of \({\varvec{X}}_t\), projection matrix and noise matrix, respectively, where d is the dimensionality of each sample, and \(n_s\) and \(n_t\) are the numbers of samples in \({\varvec{X}}_s\) and \({\varvec{X}}_t\), respectively. Matrices (vectors) are denoted by boldface uppercase (lowercase) letters. \({\varvec{A}}_{i,j}\) is the (i, j)th entry of \({\varvec{A}}\), and \({\varvec{A}}_i\) represents the ith column of \(\varvec{ {A}}\).

The Schatten p-norm (\(0<p<\infty \)) of a matrix \(\varvec{ {A}}\in R^{m\times n}\) is defined as \({||\varvec{ {A}}||}_{s_p}={(\sum ^{\mathrm {min}\{m,n\}}_{i=1}{{\sigma }^p_i})}^{1/p}={(tr((\varvec{ {A}}^T\varvec{ {A}})^{p/2}))}^{1/p}\), where \({\sigma }_i\) is the ith singular value of \(\varvec{ {A}}\) and \(tr(\cdot )\) represents the trace operator. When \(p=1\), the Schatten p-norm becomes the nuclear norm (\({||\cdot ||}_*\)). Although the Schatten p-norm is only a quasi-norm when \(p<1\), for convenience we still call it the Schatten p-norm.
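As a small, self-contained illustration (not part of the original formulation), the Schatten p-norm can be computed directly from the singular values; the NumPy sketch below assumes a dense real matrix.

```python
import numpy as np

def schatten_p_norm(A, p):
    """Schatten p-norm of A: the l_p norm of its singular values.
    For p = 1 this equals the nuclear norm; for p < 1 it is only a quasi-norm."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sum(sigma ** p) ** (1.0 / p)

A = np.random.randn(5, 4)
print(schatten_p_norm(A, 1.0))                      # nuclear norm ||A||_*
print(np.sum(np.linalg.svd(A, compute_uv=False)))   # same value, computed directly
print(schatten_p_norm(A, 0.5))                      # Schatten 1/2 quasi-norm
```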

3 Related work

3.1 Domain adaptation

DA addresses the problem where the tasks of the training and testing domains are the same while their data distributions differ. In DA, a popular strategy is to learn a common feature subspace where the distributions of both domains are well aligned. Following this idea, several subspace learning methods have been proposed. In Si et al. (2010), the authors reduced the distribution shift by minimizing the Bregman divergence. Long et al. (2013b) made use of pseudo-labels of the target domain to measure the conditional distribution shift and proposed joint distribution adaptation (JDA). By adaptively weighting the marginal and conditional distributions, Wang et al. (2017) proposed balanced distribution adaptation (BDA) to improve the performance. The methods mentioned above mainly concentrate on minimizing the domain distribution shifts and overlook the geometric structure of the data.

To preserve specific properties (such as low rank and sparsity) in DA, various kinds of regularizers have been exploited. For example, Xu et al. (2016) introduced both low-rank and sparse constraints on the coefficient matrix to preserve the structural information of the data. Differently from Xu et al. (2016), Shao et al. (2014) constructed a generalized subspace term to preserve the local structure. These methods aim to exploit the geometric characteristics of the samples but seldom leverage their statistical characteristics.

To take full advantage of the structure information and the statistical distribution, we incorporate global and local structure preservation, marginal and conditional distributions and classification error into a unified framework. Different from (Xiao et al. 2019), the proposed method (1) utilizes the non-convex Schatten p-norm to capture the global structure in the data more precisely, and (2) considers that the marginal distribution and conditional distribution have different importance to further minimize the domain shifts.

3.2 Low-rank representation

Given a data matrix \({\varvec{X}}\in R^{d\times m}\) with m samples and a dictionary \(\varvec{ {D}}\in R^{d\times k}\) with k atoms, low-rank representation (LRR) concentrates on seeking a representation matrix \(\varvec{Z}\in R^{k\times m}\), which not only has the lowest rank but also can reconstruct the samples with the dictionary atoms through linear combination. LRR can be formulated as follows:

$$\begin{aligned} {min}_{\varvec{Z},\varvec{ {E}}}\ rank\left( \varvec{Z}\right)+\lambda {\left| \left| \varvec{ {E}}\right| \right| }_1,\ s.t.\ {\varvec{X}}=\varvec{ {DZ}}+\varvec{ {E}} \end{aligned}$$
(1)

where \(\varvec{ {E}}\) represents the noise matrix and \(\Vert \cdot \Vert _1\) is the \(\ell _1\) norm that imposes sparsity. Since the rank minimization problem is NP-hard, it is common to use the nuclear norm to approximate the rank function (Shao et al. 2014; Liu et al. 2013; Zhu et al. 2018; Yang et al. 2018).

However, the nuclear norm relaxation may make the solution deviate from the true one. The nuclear norm of a matrix equals the \(\ell _1\) norm of the vector of its singular values. According to the definition of the Schatten p-norm, when \(p = 1\) it equals the nuclear norm, and when \(p\rightarrow 0\) it approaches the rank function under the convention \(0^0 = 0\). Hence, for \(0< p < 1\), the Schatten p-norm approximates the rank function more closely than the nuclear norm does. Therefore, we adopt the Schatten p-norm (\(0< p < 1\)) to obtain a closer solution to problem (1).
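To make this point concrete, the following numerical sketch (the matrix is synthetic and only for illustration) evaluates \(\sum _i\sigma _i^p\) for a rank-3 matrix at several values of p; as p shrinks the sum approaches the rank, whereas the nuclear norm (p = 1) does not.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # exactly rank 3

sigma = np.linalg.svd(Z, compute_uv=False)
sigma = sigma[sigma > 1e-8]          # keep only the nonzero singular values

for p in (1.0, 0.5, 0.1, 0.01):
    print(p, np.sum(sigma ** p))
# p = 1 gives the nuclear norm; as p -> 0 the sum tends to the number of
# nonzero singular values, i.e. rank(Z) = 3.
```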

4 The proposed method

4.1 Problem formulation

We assume that the source and target samples share a common subspace in which each target sample can be linearly represented by the source samples. Based on this assumption, a general formulation for unsupervised DA can be constructed:

$$\begin{aligned} \begin{aligned} {min}_{{\varvec{P}},\varvec{Z},\varvec{ {E}}}&rank(\varvec{Z})+\alpha ||\varvec{Z}||_1+\beta ||\varvec{ {E}}||_1+f({\varvec{P}},{\varvec{X}}_s,{\varvec{Y}}_s)\\&s.t.{\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}} \end{aligned} \end{aligned}$$
(2)

where \(f({\varvec{P}},{\varvec{X}}_s,{\varvec{Y}}_s)\) is a function for learning a discriminative projection matrix \({\varvec{P}}\), and \({\varvec{Y}}_s\) is the binary label matrix of \({\varvec{X}}_s\).

Although the convex nuclear norm has been widely used to approximate \(rank(\varvec{Z})\), the relaxation may deviate the outcome from the real solution (Nie et al. 2012). Recently, some approaches (Chang et al. 2016, 2019; Wang et al. 2019) have verified that the non-convex Schatten p-norm minimization performs better than the nuclear norm minimization in image denoising when p is close to 0. To obtain a closer solution of problem (2), we adopt \(\Vert \cdot \Vert _{s_p}^p\) (\(0<p<1\)) to approximate \(rank(\varvec{Z})\) by:

$$\begin{aligned} \begin{aligned} {min}_{{\varvec{P}},\varvec{Z},\varvec{ {E}}}&\Vert \varvec{Z}\Vert _{S_p}^p+\alpha ||\varvec{Z}||_1+\beta ||\varvec{ {E}}||_1+f({\varvec{P}},{\varvec{X}}_s,{\varvec{Y}}_s)\\&s.t.{\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}} \end{aligned} \end{aligned}$$
(3)

To leverage the local geometric structure information in the original feature space, we construct a graph-constraint term as

$$\begin{aligned} \frac{1}{2}\sum ^n_{i,j=1}\left\| {\varvec{P}}^T{\varvec{X}}_i-{\varvec{P}}^T{\varvec{X}}_j\right\| _2^2\,W_{i,j}=tr({\varvec{P}}^T\varvec{ {X}}\varvec{ {L}}{\varvec{X}}^T{\varvec{P}}) \end{aligned}$$
(4)

where \({\varvec{X}}=[{\varvec{X}}_s, {\varvec{X}}_t]\) and \(n=n_s+n_t\). \(W_{i,j}=1\) if \({\varvec{X}}_i\in kNN({\varvec{X}}_j)\vee {\varvec{X}}_j\in kNN({\varvec{X}}_i)\), and \(W_{i,j}=0\) otherwise. \(\varvec{ {L}}=\varvec{ {D}}-\varvec{ {W}}\) is the graph Laplacian matrix, and \(\varvec{ {D}}\) is a diagonal matrix with diagonal entries \(\varvec{ {D}}_{i,i}=\sum ^n_{j=1}\varvec{ {W}}_{i,j}\).
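For concreteness, a minimal sketch of how \(\varvec{ {W}}\) and \(\varvec{ {L}}=\varvec{ {D}}-\varvec{ {W}}\) might be built; the choice of k and the plain Euclidean distance are assumptions of this illustration, not prescribed by the paper.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Symmetric binary kNN graph on the columns of X (d x n) and its Laplacian L = D - W."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]       # k nearest, excluding the sample itself
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)                             # W_ij = 1 if i in kNN(j) or j in kNN(i)
    D = np.diag(W.sum(axis=1))
    return W, D - W
```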

In addition to exploiting the structural information, reducing the distribution distance between the two domains is also significant for unsupervised DA. Therefore, the maximum mean discrepancy (MMD) is adopted to measure the marginal distribution difference between the two domains by

$$\begin{aligned} ||1/n_s\sum ^{n_s}_{i=1}{\varvec{P}}^T{\varvec{X}}_i-1/n_t\sum ^{n_t}_{j=1}{\varvec{P}}^T{\varvec{X}}_j ||^2_F=tr({\varvec{P}}^T{\varvec{X}}\varvec{{\mathcal {B}}}{\varvec{X}}^T{\varvec{P}}) \end{aligned}$$
(5)

where \(\varvec{{\mathcal {B}}}\) is the marginal distribution MMD matrix, and \({{\mathcal {B}}}_{i,j}\) is computed as

$$\begin{aligned} {{\mathcal {B}}}_{i,j}={\left\{ \begin{array}{ll} {1}/{n^2_s},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}_s\ \\ {1}/{n^2_t},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}_t \\ {-1}/{(n_sn_t)},&otherwise \end{array}\right. } \end{aligned}$$
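A short sketch of how the marginal MMD matrix \(\varvec{{\mathcal {B}}}\) can be assembled for \({\varvec{X}}=[{\varvec{X}}_s,{\varvec{X}}_t]\) (ordering source samples before target samples is an assumption of this sketch):

```python
import numpy as np

def marginal_mmd_matrix(n_s, n_t):
    """B_ij = 1/n_s^2 (both source), 1/n_t^2 (both target), -1/(n_s*n_t) otherwise."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)   # rank-one matrix with exactly the block values above
```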

Since the labels of the target samples are not available, the conditional distribution difference between the two domains cannot be calculated directly. As in Wang et al. (2017) and Long et al. (2013a), we obtain pseudo-labels for the target samples by applying base classifiers (e.g., NN, SVM) trained on the labeled source data. Then, the conditional distribution distance can be formulated as

$$\begin{aligned}&\sum _{c=1}^C||1/n^c_s\sum ^{n^c_s}_{i=1}{\varvec{P}}^T{\varvec{X}}_i-1/n^c_t\sum ^{n^c_t}_{j=1}{\varvec{P}}^T{\varvec{X}}_j||^2_F\nonumber \\&\quad =tr({\varvec{P}}^T{\varvec{X}}\sum _{c=1}^C{\varvec{{\mathcal {A}}}}_c{\varvec{X}}^T{\varvec{P}}) \end{aligned}$$
(6)

where C is the number of classes and \(\varvec{{\mathcal {A}}}_c\) is the conditional distribution MMD matrix of the cth class, calculated as

$$\begin{aligned} (\varvec{{{\mathcal {A}}}}_c)_{i,j}={\left\{ \begin{array}{ll} {1}/{{n^c_s}^2},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}^c_s\ \\ {1}/{{n^c_t}^2},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}^c_t \\ {-1}/{\left( n^c_sn^c_t\right) },&({\varvec{X}}_i\in {\varvec{X}}^c_s,{\varvec{X}}_j\in {\varvec{X}}^c_t)\vee ({\varvec{X}}_j\in {\varvec{X}}^c_s,{\varvec{X}}_i\in {\varvec{X}}^c_t) \\ 0,&otherwise \end{array}\right. } \end{aligned}$$

where \({\varvec{X}}^c_s\) (\({\varvec{X}}^c_t\)) represents the set of samples belonging to c-th class in the source (target) domain, which includes \(n^c_s\) (\(n^c_t\)) samples.
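The per-class matrices \(\varvec{{\mathcal {A}}}_c\) can be built analogously from the source labels and the target pseudo-labels; in the sketch below the integer label vectors y_s and y_t_pseudo are assumed, and a class with no pseudo-labeled target samples is simply skipped (one common convention, not specified in the paper):

```python
import numpy as np

def conditional_mmd_matrices(y_s, y_t_pseudo, classes):
    """One matrix A_c per class c, for the stacked data X = [X_s, X_t]."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    mats = []
    for c in classes:
        e = np.zeros(n_s + n_t)
        src = np.where(y_s == c)[0]
        tgt = np.where(y_t_pseudo == c)[0]
        if len(src) > 0:
            e[src] = 1.0 / len(src)
        if len(tgt) > 0:
            e[n_s + tgt] = -1.0 / len(tgt)
        mats.append(np.outer(e, e))
    return mats
```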

To maximize the class separation distance and improve classification accuracy, the label information \({\varvec{Y}}_s\) of the source samples should be considered. Following Xu et al. (2016), we design a non-negative label relaxation matrix \({\varvec{M}}\) to increase inter-class separation in the source domain as much as possible and perform the label regularization as follows:

$$\begin{aligned} {\left| \left| {{\varvec{P}}}^T{\varvec{X}}_s-\left( {\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}}\right) \right| \right| }^2_F \end{aligned}$$
(7)

where \({\varvec{B}}_{i,j}=1\) if \({\varvec{Y}}_s\left( i,j\right) =1\) and \({\varvec{B}}_{i,j}=-1\) otherwise, and \(\odot \) is the Hadamard product operator.
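A sketch of how \({\varvec{Y}}_s\) and \({\varvec{B}}\) might be formed; the C x n_s layout (so that it matches \({\varvec{P}}^T{\varvec{X}}_s\) class-wise) and integer class ids 0..C-1 are assumptions of this illustration:

```python
import numpy as np

def label_and_dragging_matrices(y_s, num_classes):
    """Y_s: one-hot label matrix (C x n_s); B: +1 where Y_s is 1 and -1 elsewhere."""
    n_s = len(y_s)
    Y_s = np.zeros((num_classes, n_s))
    Y_s[y_s, np.arange(n_s)] = 1.0
    B = np.where(Y_s == 1.0, 1.0, -1.0)
    return Y_s, B
```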

By jointly taking geometric regularization, statistical regularization, and label regularization into consideration, the final objective function is formulated as follows

$$\begin{aligned} \begin{aligned}&{min}_{{\varvec{P}},\varvec{Z},\varvec{ {E}},{\varvec{M}}}||\varvec{Z}||^p_{s_p}+\alpha ||\varvec{Z}||_1+\beta ||\varvec{ {E}}||_1\\&\quad +||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad +tr({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}\varvec{{\mathcal {A}}}_c){\varvec{X}}^T{\varvec{P}})\\&\quad s.t.\ {\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}} \end{aligned} \end{aligned}$$
(8)

where \(\alpha , \beta , \gamma , \eta \) and \(\nu \) are trade-off parameters, and \(p\in (0,1)\). In the experiments, p is set to 1/2 for convenience. Compared with \(||\varvec{Z}||_*\), \(||\varvec{Z}||^{1/2}_{s_{1/2}}\) is much closer to \(rank(\varvec{Z})\).

4.2 Optimization

To solve problem (8) effectively, two auxiliary variables \({\varvec{J}}\) and \(\varvec{ {R}}\) are first introduced to make the problem separable. Then, (8) can be rewritten as:

$$\begin{aligned} \begin{aligned}&{min}_{\begin{array}{c} {\varvec{P}},\varvec{Z},\varvec{ {E}},\\ {\varvec{M}},{\varvec{J}},\varvec{ {R}} \end{array}}||{\varvec{J}}||^{1/2}_{s_{1/2}}+\alpha ||\varvec{ {R}}||_1+\beta ||\varvec{ {E}}||_1\\&\quad +||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad +tr({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}){\varvec{X}}^T{\varvec{P}})\\&\quad s.t.\ {\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}}, \varvec{Z}={\varvec{J}}, \varvec{Z}=\varvec{ {R}} \end{aligned} \end{aligned}$$
(9)

Then, the augmented Lagrange multiplier (ALM) method is applied to solve (9). The augmented Lagrangian function H of (9) is:

$$\begin{aligned} \begin{aligned}&H({\varvec{P}},\varvec{Z},\varvec{ {E}},{\varvec{M}},{\varvec{J}},\varvec{ {R}},{\varvec{Y}}_1,{\varvec{Y}}_2,{\varvec{Y}}_3,\mu )=||{\varvec{J}}||^{1/2}_{s_{1/2}}\\&\quad +\alpha ||\varvec{ {R}}||_1+\beta ||\varvec{ {E}}||_1+||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad + tr({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}} +\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}){\varvec{X}}^T{\varvec{P}})\\&\quad +<{\varvec{Y}}_1,{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}>+<{\varvec{Y}}_2,\varvec{Z}-{\varvec{J}}>+<{\varvec{Y}}_3,\varvec{Z}-\varvec{ {R}}>\\&\quad +\mu /2(||{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}||^2_F+||\varvec{Z}-{\varvec{J}}||^2_F+||\varvec{Z}-\varvec{ {R}}||^2_F) \end{aligned} \end{aligned}$$
(10)

where \(<\varvec{ {A}},{\varvec{B}}>=trace(\varvec{ {A}}^T{\varvec{B}})\), \({\varvec{Y}}_1, {\varvec{Y}}_2\) and \({\varvec{Y}}_3\) are Lagrange multipliers and \(\mu >0\) is a penalty parameter. At each iteration step, only one variable is updated while fixing the other variables. The updating process is as follows.

Step 1. Optimizing \({\varvec{J}}\)

Keeping other variables constant, the problem in Eq. (10) can be simplified as follows:

$$\begin{aligned} {argmin}_{{\varvec{J}}}1/\mu ||{\varvec{J}}||^{1/2}_{s_{1/2}}+1/2||{\varvec{J}}-(\varvec{Z}+{\varvec{Y}}_2/\mu )||^2_F \end{aligned}$$
(11)

Equation (11) can be solved according to Chang et al. (2016), which provides the details of the solution.
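For illustration only, a sketch of this proximal step: since the penalty is unitarily invariant, one may take the SVD of \(\varvec{Z}+{\varvec{Y}}_2/\mu \) and apply a scalar half-thresholding rule to its singular values. The rule below is the known closed-form solution of \(\min _x (x-q)^2+\lambda |x|^{1/2}\); for the exact solver used in the paper, see Chang et al. (2016).

```python
import numpy as np

def half_threshold(q, lam):
    """Scalar rule x* = argmin_x (x - q)^2 + lam * |x|^(1/2) (L_1/2 half-thresholding)."""
    out = np.zeros_like(q)
    thr = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)   # values below this shrink to 0
    mask = np.abs(q) > thr
    qm = q[mask]
    phi = np.arccos((lam / 8.0) * (np.abs(qm) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * qm * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

def update_J(Z, Y2, mu):
    """Eq. (11): (1/mu)*||J||_{S_1/2}^{1/2} + 0.5*||J - (Z + Y2/mu)||_F^2.
    Rescaling the objective by 2 matches the scalar form above with lam = 2/mu."""
    Q = Z + Y2 / mu
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(half_threshold(s, 2.0 / mu)) @ Vt
```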

Step 2. Optimizing \({\varvec{P}}\)

Keeping other variables constant, Eq. (10) can be rewritten as

$$\begin{aligned} \begin{aligned}&{argmin}_{{\varvec{P}}}tr\Big({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}){\varvec{X}}^T{\varvec{P}}\Big)\\&\quad +||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad +\mu /2||{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}+{\varvec{Y}}_1/\mu ||^2_F \end{aligned} \end{aligned}$$
(12)

\({\varvec{P}}\) can be updated by setting the derivative of (12) with respect to \({\varvec{P}}\) to zero, which gives

$$\begin{aligned} {\varvec{P}}^*= & {} ({\varvec{X}}_s{\varvec{X}}^T_s+{\varvec{X}}\varvec{{\mathcal {K}}}_1{\varvec{X}}^T+\mu \varvec{{\mathcal {K}}}_2\varvec{{\mathcal {K}}}^T_2+\lambda\varvec{ {I}})^{-1}({\varvec{X}}_s({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})^T\nonumber \\&+\mu \varvec{{\mathcal {K}}}_2(\varvec{ {E}}-{\varvec{Y}}_1/\mu )^T) \end{aligned}$$
(13)

where \(\varvec{{\mathcal {K}}}_1=\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}\), \(\varvec{{\mathcal {K}}}_2={\varvec{X}}_t-{\varvec{X}}_s\varvec{Z}\), \(\varvec{ {I}}\) is an identity matrix, and \(\lambda \) is a small positive constant.
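A direct NumPy transcription of the closed-form update (13) might look as follows (shapes follow Sect. 2; the argument names and the default value of the small constant \(\lambda \) are ours, and this is a sketch rather than the authors' code):

```python
import numpy as np

def update_P(Xs, Xt, X, Z, E, M, Ys, B, Y1, K1, mu, lam=1e-4):
    """Eq. (13): K1 = gamma*L + eta*B_mmd + nu*sum_c A_c; B is the dragging matrix."""
    d = X.shape[0]
    K2 = Xt - Xs @ Z                                        # d x n_t
    lhs = Xs @ Xs.T + X @ K1 @ X.T + mu * (K2 @ K2.T) + lam * np.eye(d)
    rhs = Xs @ (Ys + B * M).T + mu * K2 @ (E - Y1 / mu).T
    return np.linalg.solve(lhs, rhs)                        # d x m projection matrix
```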

Step 3. Optimizing \(\varvec{Z}\)

Keeping other variables constant, \(\varvec{Z}\) can be updated by solving

$$\begin{aligned}&{argmin}_{\varvec{Z}}||{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}+{\varvec{Y}}_1/\mu ||^2_F\nonumber \\&\quad +||\varvec{Z}-{\varvec{J}}+{\varvec{Y}}_2/\mu ||^2_F+||\varvec{Z}-\varvec{ {R}}+{\varvec{Y}}_3/\mu ||^2_F \end{aligned}$$
(14)

\(\varvec{Z}\) can be calculated by

$$\begin{aligned} \varvec{Z}^*= & {} ({\varvec{X}}^T_s{\varvec{P}}{\varvec{P}}^T{\varvec{X}}_s+2\varvec{ {I}})^{-1}({\varvec{X}}^T_s{\varvec{P}}({\varvec{P}}^T{\varvec{X}}_t-\varvec{ {E}}\nonumber \\&+{\varvec{Y}}_1/\mu )+{\varvec{J}}-{\varvec{Y}}_2/\mu +\varvec{ {R}}-{\varvec{Y}}_3/\mu ) \end{aligned}$$
(15)
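A transcription of (15) as a sketch (variable names are ours):

```python
import numpy as np

def update_Z(P, Xs, Xt, E, J, R, Y1, Y2, Y3, mu):
    """Eq. (15): closed-form update of the reconstruction matrix Z (n_s x n_t)."""
    PXs = P.T @ Xs                                          # m x n_s
    lhs = PXs.T @ PXs + 2.0 * np.eye(Xs.shape[1])           # n_s x n_s
    rhs = PXs.T @ (P.T @ Xt - E + Y1 / mu) + J - Y2 / mu + R - Y3 / mu
    return np.linalg.solve(lhs, rhs)
```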

Step 4. Optimizing \(\varvec{ {R}}\)

Keeping other variables constant, \(\varvec{ {R}}\) can be solved as

$$\begin{aligned} {\varvec{ {R}}}^{\varvec{*}}&={argmin}_{\varvec{ {R}}}{\alpha }/{\mu }{\left| \left| \varvec{ {R}}\right| \right| }_1+{1}/{2}{\left| \left| {\varvec{Z-R}}+{{\varvec{Y}}_3}/{\mu }\right| \right| }^2_F\\ &={{\Phi }}_{{\alpha }/{\mu }}[{\varvec{Z}}{\varvec{+}}{\varvec{Y}}_3/\mu ] \end{aligned}$$
(16)

where \({{\Phi }}_w[\cdot ]\) is the soft thresholding (shrinkage) operator, with \({{\Phi }}_w\left[ x\right] =\mathrm {sign}(x)\,\mathrm {max}(\left| x\right| -w,0)\).
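A sketch of the shrinkage operator and the resulting \(\varvec{ {R}}\) update; the same operator, with a different threshold, also gives the \(\varvec{ {E}}\) update in Step 5.

```python
import numpy as np

def soft_threshold(X, w):
    """Phi_w[x] = sign(x) * max(|x| - w, 0), applied element-wise."""
    return np.sign(X) * np.maximum(np.abs(X) - w, 0.0)

def update_R(Z, Y3, mu, alpha):
    """Eq. (16): R* = Phi_{alpha/mu}[Z + Y3/mu]."""
    return soft_threshold(Z + Y3 / mu, alpha / mu)
```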

Step 5. Optimizing \(\varvec{ {E}}\)

Keeping other variables constant, \(\varvec{ {E}}\) is updated by

$$\begin{aligned} \begin{aligned} {\varvec{ {E}}}^{\varvec{*}}&={argmin}_{\varvec{ {E}}}{\beta }/{\mu }{\left| \left| \varvec{ {E}}\right| \right| }_1\\&\quad +{1}/{2}{\left| \left| {{\varvec{P}}}^T{\varvec{X}}_t-{{\varvec{P}}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}\varvec{+}{\varvec{Y}}_1/\mu \right| \right| }^2_F\\&={{\Phi }}_{\beta /\mu }[{{\varvec{P}}}^T{\varvec{X}}_t-{{\varvec{P}}}^T{\varvec{X}}_s\varvec{Z}\varvec{+}{\varvec{Y}}_1/\mu ] \end{aligned} \end{aligned}$$
(17)

Step 6. Optimizing \({\varvec{M}}\)

Keeping other variables constant, \({\varvec{M}}\) can be solved by

$$\begin{aligned} {{\varvec{M}}}^{\varvec{*}}={argmin}_{{\varvec{M}}}{\left| \left| {{\varvec{P}}}^T{\varvec{X}}_s-\left( {\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}}\right) \right| \right| }^2_F \end{aligned}$$
(18)

\({\varvec{M}}\) can be calculated by

$$\begin{aligned} {\varvec{M}}=\mathrm {max}(({\varvec{P}}^T{\varvec{X}}_s-{\varvec{Y}}_s)\odot {\varvec{B}},0) \end{aligned}$$
(19)
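A one-line sketch of this element-wise update:

```python
import numpy as np

def update_M(P, Xs, Ys, B):
    """Eq. (19): non-negative label relaxation matrix M."""
    return np.maximum((P.T @ Xs - Ys) * B, 0.0)
```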

The whole optimization of Eq. (8) is summarized in Algorithm 1, which outputs a projection matrix \({\varvec{P}}\). For classification, all samples from both domains are first projected into the new subspace by multiplying with \({{\varvec{P}}}^T\). Then each target sample is assigned the label of its nearest source sample in the subspace, using a one-nearest-neighbor classifier.
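A sketch of this final classification step (Euclidean distance in the projected space; the function name is ours):

```python
import numpy as np

def classify_target(P, Xs, Xt, ys):
    """Project both domains with P^T and label each target sample by its nearest source sample."""
    Zs, Zt = P.T @ Xs, P.T @ Xt                      # m x n_s and m x n_t
    d2 = (np.sum(Zt ** 2, axis=0)[:, None]
          + np.sum(Zs ** 2, axis=0)[None, :]
          - 2.0 * Zt.T @ Zs)                         # n_t x n_s squared distances
    return ys[np.argmin(d2, axis=1)]                 # 1-NN label assignment
```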

figure a

4.3 Complexity and convergence analysis

In Algorithm 1, most of the running time is spent on optimizing \({\varvec{J}}\), \({\varvec{P}}\) and \(\varvec{Z}\), which require singular value decomposition (SVD) and matrix inversion. The complexities of optimizing \({\varvec{J}}\), \({\varvec{P}}\) and \(\varvec{Z}\) are \(O(n_sn^2_t)\) (assuming \(n_t\le n_s\)), \(O(d^3+2d^2(n_s+n_t))\) and \(O(dn_sm+n^3_s)\), respectively. Thus, the total time complexity of the proposed algorithm is \(O(N(n_sn^2_t+d^3+2d^2(n_s+n_t)+dn_sm+n^3_s))\), where N is the maximum number of iterations. In our experiments, N is set to 15.

Since problem (8) is not smooth (the Schatten 1/2-norm is non-convex) and the proposed algorithm involves more than two blocks, it is difficult to give a theoretical convergence proof for Algorithm 1. Figure 2 shows the value of (8) with respect to the number of iterations on four datasets. From Fig. 2, we can see that the objective function value decreases as the number of iterations increases.

Fig. 2
figure 2

The objective function value versus the iteration number on different datasets. a COIL1\(\rightarrow \)COIL2, b MNIST\(\rightarrow \)USPS, c Caltech256\(\rightarrow \)Amazon, d PIE1\(\rightarrow \)PIE2

5 Experiments

The proposed method is evaluated on six datasets that are widely used for unsupervised transfer learning: COIL20, Office, Caltech-256, USPS, MNIST, and CMU PIE. The first three datasets are used for object classification, the next two for digit classification and the last for face classification. The proposed method is compared with ten recent unsupervised transfer learning methods, i.e., SDA (Sun and Saenko 2015), TSL (Si et al. 2010), GFK (Gong et al. 2012), JDA (Long et al. 2013b), BDA (Wang et al. 2017), SCA (Ghifary et al. 2017), LTSL (Shao et al. 2014), LRSR (Xu et al. 2016), LRDRM (Razzaghi et al. 2019) and SPDA (Xiao et al. 2019). Two standard machine learning methods (i.e., NN and PCA) are also included. The results of all competing methods are either taken from the original papers or from widely published results to ensure a fair comparison.

5.1 COIL20 dataset

This dataset consists of 20 objects with 1440 gray images of size \(32\times 32\). The images were taken every \(5^\circ \) as the objects were rotated on a motorized turntable against a black background. In the experiment, the dataset is divided into two subsets according to the rotation angle: COIL1 contains the images of all objects taken at angles in [\(0^\circ \), \(85^\circ \)] and [\(180^\circ \), \(265^\circ \)], and COIL2 contains the images taken at angles in [\(90^\circ \), \(175^\circ \)] and [\(270^\circ \), \(355^\circ \)]. Each subset contains 720 images. Some samples are shown in Fig. 3a. The images in COIL1 and COIL2 follow different distributions. In the experiments, one subset is selected as the source set and the other as the target set, so two domain adaptation tasks are constructed.

Fig. 3
figure 3

a Samples of the COIL20 dataset. b Some examples of the MNIST dataset (left) and USPS dataset (right)

Fig. 4
figure 4

Classification accuracies (%) on the COIL20 dataset

Figure 4 shows that the proposed method is superior to the other competing methods. In particular, it achieves the highest accuracies of 99.31% and 99.03% on COIL1\(\rightarrow \)COIL2 and COIL2\(\rightarrow \)COIL1, respectively. Compared with the standard learning methods, the proposed method achieves improvements of approximately 14.0% on both experiments. Compared with the other DA methods, improvements of at least 1.39% and 0.14% are obtained in the two experiments, respectively, which demonstrates the effectiveness of the proposed unified framework incorporating the non-convex Schatten p-norm, the graph-structure constraint, statistical distributions, and the classification error.

5.2 Office and Caltech-256 datasets

The Office dataset consists of 4652 images in 31 categories. These images are divided into three groups: Amazon, Webcam and DSLR (denoted by A, W and D, respectively), which are collected from an online merchant, a low-resolution web camera, and a high-resolution digital SLR camera, respectively. Different from the Office dataset, which is a standard DA benchmark, the Caltech-256 dataset (denoted by C) is an image classification benchmark consisting of more than 30,000 images from 256 categories. The two datasets share 10 classes, namely, backpack, bike, calculator, headphones, keyboard, laptop computer, monitor, mouse, mug and projector, as shown in Fig. 5. It can be seen that the variance among images within the datasets is quite large. In the experiments, two different parts are selected from the four parts (i.e., A, W, D, and C) as the source and target domains, yielding 12 DA experiments in total, i.e., C\(\rightarrow \)A, C\(\rightarrow \)W, \(\ldots \), D\(\rightarrow \)W.

Fig. 5
figure 5

Some images from the Office and Caltech-256 datasets

Table 1 lists the performance of all competing methods. The proposed method achieves the highest average accuracy over the 12 cross-domain experiments and has at least 10.38% (0.85%) improvement over the standard learning methods (DA methods). In the 12 cross-domain experiments, the proposed method achieves four best performances and three second-best performances.

Table 1 Classification accuracies (%) on the Office and Caltech256 datasets (The number in bold indicates the highest accuracy and the number in underline represents the second-highest accuracy)

5.3 USPS and MNIST datasets

The USPS dataset consists of 9298 images of size \(16\times 16\), of which 7291 are for training and 2007 for testing. The MNIST dataset includes 70,000 images of size \(28\times 28\), of which 60,000 are for training and 10,000 for testing. The two datasets share 10 digit classes but follow very different distributions. Some examples are shown in Fig. 3b. Following the settings of Long et al. (2013b), two subsets are formed by randomly selecting 1800 and 2000 images from USPS and MNIST, respectively. By selecting one subset as the source set and the other as the target set, two DA experiments are constructed: USPS\(\rightarrow \)MNIST and MNIST\(\rightarrow \)USPS.

Figure 6 lists the classification accuracies of all methods. The proposed method achieves the best performance, with improvements of at least 0.27% and 0.89% in the two experiments. It should be noted that LTSL (Shao et al. 2014) and LRSR (Xu et al. 2016) perform better than most of the other DA methods, except BDA (Wang et al. 2017) and JDA (Long et al. 2013b). The main reason is that the background and structure of the digit images are simple; in this case, the geometric structure matters more than the distribution difference. By taking both structural information and distribution shifts into consideration, SPDA and the proposed method achieve better performance than the other competing methods. Compared with SPDA, the proposed method utilizes the Schatten p-norm to exploit the geometric structure more precisely and weights the marginal and conditional distributions differently to further minimize the domain shifts, which results in better performance.

Fig. 6
figure 6

Classification accuracies (%) on the USPS and MNIST datasets

5.4 CMU PIE dataset

The CMU PIE dataset contains 41,368 face images of size \(32\times 32\), taken from 68 people under different illumination conditions, poses and expressions. In this experiment, five subsets of PIE are used to test the different methods. Each subset corresponds to a distinct pose with illumination and expression variations: PIE1 (C05, left pose, 3332 images, 49 per person), PIE2 (C07, upward pose, 1632 images, 24 per person), PIE3 (C09, downward pose, 1632 images, 24 per person), PIE4 (C27, frontal pose, 3332 images, 49 per person), and PIE5 (C29, right pose, 1632 images, 24 per person), as shown in Fig. 7. Following the experimental settings in Xu et al. (2016), two different subsets are chosen as the source set and target set, which results in 20 different pairs of DA experiments, i.e., \(1\rightarrow 2, 1\rightarrow 3,\ldots , 5\rightarrow 4\).

Fig. 7
figure 7

Some examples of the CMU PIE dataset. a Left pose, b upward pose, c downward pose, d front pose, e right pose

Table 2 Classification accuracies (%) on the CMU PIE dataset

Table 2 shows the classification accuracies of the different methods. The proposed method outperforms the other competing methods on all 20 cross-domain datasets and achieves a \(\mathbf {10.84\%}\) improvement in average accuracy. The main reason is that the proposed method uses both statistical distribution and geometric structure information, which reduces the distribution shift between the source and target domains more effectively than the other methods.

5.5 Discussion of parameters

There are five parameters in the objective function (8): \(\alpha \) and \(\beta \) control the \(l_{1}\) regularization on the representation matrix and the noise matrix, respectively, while \(\gamma , \eta \) and \(\nu \) balance the graph regularization, marginal distribution and conditional distribution, respectively. To examine the impact of the parameters, we evaluate the proposed method under different combinations of values, each selected from a small set. Specifically, \(\alpha \) and \(\beta \) are searched in \(\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1\}\), and the search range for \(\gamma , \eta \) and \(\nu \) is \(\{0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000\}\).

The results for COIL1\(\rightarrow \)COIL2, Caltech256\(\rightarrow \)Amazon, USPS\(\rightarrow \)MNIST and PIE1\(\rightarrow \)PIE2 with different parameters are shown in Fig. 8. From Fig. 8, we can see that the classification accuracies are roughly consistent across parameter settings. Specifically, the method is insensitive to \(\alpha \) and \(\beta \) on all of the datasets: the accuracy varies within \(1.0\%\). The optimal values of \(\gamma , \eta \) and \(\nu \) differ across datasets, since the roles of their corresponding regularization terms vary with the dataset.

Fig. 8
figure 8

Classification accuracies (%) of the proposed method with different parameters on different datasets. a \(\alpha \) and \(\beta \) for COIL1\(\rightarrow \)COIL2, b \(\gamma \), \(\eta \) and \(\nu \) for COIL1\(\rightarrow \)COIL2, c \(\alpha \) and \(\beta \) for Caltech256\(\rightarrow \)Amazon, d \(\gamma \), \(\eta \) and \(\nu \) for Caltech256\(\rightarrow \)Amazon, e \(\alpha \) and \(\beta \) for MNIST\(\rightarrow \)USPS, f \(\gamma \), \(\eta \) and \(\nu \) for MNIST\(\rightarrow \)USPS, g \(\alpha \) and \(\beta \) for CMU PIE1\(\rightarrow \)CMU PIE2, h \(\gamma \), \(\eta \) and \(\nu \) for CMU PIE1\(\rightarrow \)CMU PIE2

5.6 Discussion with deep learning methods

Deep learning is a popular technology and has been applied in various applications (Yu et al. 2019; Lu et al. 2018, 2019). Ding and Fu (2018) developed deep transfer low-rank coding (DTLC) for cross-domain learning; benefitting from convolutional neural networks, DTLC can capture more representative and discriminative image features. Long et al. (2017) proposed joint adaptation networks, which learn a transfer network by introducing the joint maximum mean discrepancy criterion and an adversarial training strategy.

Image features play an essential role in the classification task. Intuitively, the proposed method would perform better if deep features were used, because deep features are more representative and discriminative than handcrafted features. In the experiments, simple handcrafted features are used (SURF features for the Office and Caltech256 datasets, grayscale pixel values for the other datasets). For practical applications, deep features can first be extracted through trained deep neural networks and then used for unsupervised DA via (8).

6 Conclusions

This paper presented a novel unsupervised DA method for image classification. The method exploits both the geometric and the statistical characteristics of the samples by (1) imposing a graph-structure constraint and the Schatten p-norm on the reconstruction matrix, and (2) minimizing both the marginal and conditional distribution discrepancies, so as to reduce the distribution shift between the source and target domains. By projecting the samples of both domains into a common separable subspace, the samples from the target domain can be correctly classified. An iterative algorithm is proposed to solve the resulting problem effectively. Extensive experiments on six datasets verify the advantages of the proposed method.