1 Introduction

As an important research field of computer vision, image classification has been widely studied in the past few years, and many methods have been proposed (Lan et al. 2019; Wright et al. 2009; Yang et al. 2013). These methods commonly assume that samples from the training set and the testing set follow a similar distribution. However, in many practical applications it is difficult to ensure that the two sets follow the same distribution, owing to factors such as resolution, viewpoint and illumination. A model learned on a training set usually fails on a testing set whose distribution is different. To address this scenario, transfer learning has attracted considerable attention; it aims to effectively apply the knowledge learned from a training (source) domain to a testing (target) domain. Recently, numerous transfer learning works have been proposed and have achieved impressive performance in many applications, such as image classification (Kobylarz et al. 2020; Singh et al. 2019; Wang et al. 2017), motion segmentation (Wang et al. 2018b) and image retrieval (Xu et al. 2019).

As a special case of transfer learning, domain adaptation (DA) has attracted much attention in recent years. In DA, a popular strategy is to take information from both domains into consideration and extract new domain-invariant features shared by the two domains. According to whether target labels are available during training, DA can be divided into two categories: semi-supervised DA and unsupervised DA. This paper focuses on unsupervised DA, which is the more difficult and challenging setting.

Finding a common latent subspace to reduce the distribution shift problem between the two domains is critical in unsupervised DA (Si et al. 2010; Singh and Nigam 2019). For example, Si et al. (2010) constructed a regularization to minimize the Bregman divergence between the two domains in a selected subspace. Gong et al. (2012) utilized geodesic flow kernel (GFK) to find a geodesic from the source domain to the target domain. Long et al. (2013b) proposed to project data of both domains into a new subspace by jointly embedding both marginal distribution and conditional distribution into a principled dimensionality reduction procedure. Ghifary et al. (2017) proposed scatter component analysis (SCA) for DA and domain generalization, in which data scatter is used as a geometrical measure to evaluate the separability of classes, the mismatch between domains and the separability of data. To reduce the distribution divergence and evaluate the importance of the marginal and conditional distributions, Wang et al. (2018a) proposed to learn a domain-invariant classifier in a Grassmann manifold with structural risk minimization. Although good performance has been reported, these methods seldom exploit the structural information of the data.

Many studies (Xu et al. 2016; Shao et al. 2014) have verified that incorporating structural constraints (such as low-rank and sparse constraints) into the transfer learning process can effectively improve DA performance. For instance, Shao et al. (2014) proposed low-rank transfer subspace learning (LTSL) for face recognition and object recognition, where each sample from the target domain is reconstructed by samples from the source domain in a generalized subspace under a low-rank constraint. Xiao et al. (2019) proposed structure preservation and distribution alignment (SPDA) for unsupervised DA. However, these works ignore the statistical characteristics of the data. Moreover, all of these methods use the convex nuclear norm to approximate the non-convex rank function, which may make the solution deviate considerably from the true low-rank solution (Nie et al. 2012).

To better approximate the rank function, mitigate the distribution shift and further exploit the structural properties of the data, we propose a unified framework that incorporates structural information and statistical distributions for unsupervised DA in image classification, as shown in Fig. 1. Specifically, the proposed framework retains both global and local structural information by imposing a graph-structure constraint and a Schatten p-norm constraint, and reduces the distribution shift by aligning both the marginal and conditional distributions between the source and target domains. Moreover, the label information of the source domain is exploited effectively by the \(\epsilon \)-dragging technique. An effective optimization procedure is developed. Experiments on six datasets demonstrate the superiority of the proposed approach.

The contributions of this work are summarized as follows. (1) A unified framework consisting of global and local structure constraints, marginal and conditional distribution alignment, and a classification error term is proposed for unsupervised DA, and the Schatten p-norm (\(0<p<1\)) is introduced to capture the data structure more accurately. (2) Better performance is achieved on six benchmark datasets than with several representative unsupervised DA methods.

Fig. 1
figure 1

Illustration of our method. Blue: source samples. Pink: target samples. Circles, squares, and triangles indicate three different categories. By considering structural information, marginal and conditional distribution shifts, as well as label information of the source domain, the hyper-plane learned from the two domains can perfectly classify the target samples

The rest of this paper is structured as follows: Sect. 2 introduces the notation used in this paper. Section 3 briefly reviews some related works. Section 4 shows the description and optimization procedure about the proposed method in detail. Section 5 discusses the experiments and results on the six datasets, and the last section concludes the paper.

2 Preliminaries

In this work, we denote \({\varvec{X}}_s\in R^{d\times n_s}\), \({\varvec{X}}_t\in R^{d\times n_t}\), \(\varvec{ {Z}}\in R^{n_s\times n_t}\), \(\varvec{ {P}}\in R^{d\times m}\) and \(\varvec{ {E}}\in R^{m\times n_t}\) as the source data, target data, reconstruction matrix of \({\varvec{X}}_t\), projection matrix and noise matrix, respectively, where d is the dimensionality of each sample, and \(n_s\) and \(n_t\) are the numbers of samples in \({\varvec{X}}_s\) and \({\varvec{X}}_t\), respectively. Matrices (vectors) are denoted by boldface uppercase (lowercase) letters. \({\varvec{A}}_{i,j}\) is the (i, j)th entry of \({\varvec{A}}\), and \({\varvec{A}}_i\) represents the ith column of \(\varvec{ {A}}\).

The Schatten p-norm (\(0<p<\infty \)) of a matrix \(\varvec{ {A}}\in R^{m\times n}\) is defined as \({||\varvec{ {A}}||}_{s_p}={(\sum ^{\mathrm {min}\{m,n\}}_{i=1}{{\sigma }^p_i})}^{1/p}={(tr((\varvec{ {A}}^T\varvec{ {A}})^{p/2}))}^{1/p}\), where \({\sigma }_i\) is the ith singular value of \(\varvec{ {A}}\) and \(tr(\cdot )\) represents the trace operator. When \(p=1\), the Schatten p-norm becomes the nuclear norm (\({||\cdot ||}_*\)). Although the Schatten p-norm is only a quasi-norm when \(p<1\), for convenience we still call it the Schatten p-norm.
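As a small, self-contained illustration (not part of the original formulation), the Schatten p-norm can be computed directly from the singular values; the NumPy sketch below assumes a dense real matrix.

```python
import numpy as np

def schatten_p_norm(A, p):
    """Schatten p-norm of A: the l_p norm of its singular values.
    For p = 1 this equals the nuclear norm; for p < 1 it is only a quasi-norm."""
    sigma = np.linalg.svd(A, compute_uv=False)
    return np.sum(sigma ** p) ** (1.0 / p)

A = np.random.randn(5, 4)
print(schatten_p_norm(A, 1.0))                      # nuclear norm ||A||_*
print(np.sum(np.linalg.svd(A, compute_uv=False)))   # same value, computed directly
print(schatten_p_norm(A, 0.5))                      # Schatten 1/2 quasi-norm
```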

3 Related work

3.1 Domain adaptation

DA addresses the problem where the tasks of the training and testing domains are the same while their data distributions differ. In DA, a popular strategy is to learn a common feature subspace where the distributions of both domains are well aligned. Following this idea, several subspace learning methods have been proposed. In Si et al. (2010), the authors reduced the distribution shift by minimizing the Bregman divergence. Long et al. (2013b) made use of pseudo-labels of the target domain to measure the conditional distribution shift and proposed joint distribution adaptation (JDA). By adaptively weighting the marginal and conditional distributions, Wang et al. (2017) proposed balanced distribution adaptation (BDA) to improve the performance. The methods mentioned above mainly concentrate on minimizing the domain distribution shifts and overlook the geometric structure of the data.

To preserve specific properties (such as low rank and sparsity) in DA, various kinds of regularizers have been exploited. For example, Xu et al. (2016) introduced both low-rank and sparse constraints on the coefficient matrix to preserve the structural information of the data. Differently from Xu et al. (2016), Shao et al. (2014) constructed a generalized subspace term to preserve the local structure. These methods aim to exploit the geometric characteristics of the samples but seldom leverage their statistical characteristics.

To take full advantage of the structure information and the statistical distribution, we incorporate global and local structure preservation, marginal and conditional distributions and classification error into a unified framework. Different from (Xiao et al. 2019), the proposed method (1) utilizes the non-convex Schatten p-norm to capture the global structure in the data more precisely, and (2) considers that the marginal distribution and conditional distribution have different importance to further minimize the domain shifts.

3.2 Low-rank representation

Given a data matrix \({\varvec{X}}\in R^{d\times m}\) with m samples and a dictionary \(\varvec{ {D}}\in R^{d\times k}\) with k atoms, low-rank representation (LRR) concentrates on seeking a representation matrix \(\varvec{Z}\in R^{k\times m}\), which not only has the lowest rank but also can reconstruct the samples with the dictionary atoms through linear combination. LRR can be formulated as follows:

$$\begin{aligned} {min}_{\varvec{Z},\varvec{ {E}}}\ rank\left( \varvec{Z}\right)+\lambda {\left| \left| \varvec{ {E}}\right| \right| }_1,\ s.t.\ {\varvec{X}}=\varvec{ {DZ}}+\varvec{ {E}} \end{aligned}$$
(1)

where \(\varvec{ {E}}\) represents the noise matrix and \(\Vert \cdot \Vert _1\) is the \(\ell _1\) norm that imposes sparsity. Since the rank minimization problem is NP-hard, it is common to use the nuclear norm to approximate the rank function (Shao et al. 2014; Liu et al. 2013; Zhu et al. 2018; Yang et al. 2018).

However, the nuclear norm relaxation may make the solution deviate from the true one. The nuclear norm of a matrix equals the \(\ell _1\) norm of the vector of its singular values. According to the definition of the Schatten p-norm, when \(p = 1\) it equals the nuclear norm, and when \(p\rightarrow 0\) it approaches the rank function under the convention \(0^0 = 0\). Hence, for \(0< p < 1\), the Schatten p-norm approximates the rank function more closely than the nuclear norm does. Therefore, we adopt the Schatten p-norm (\(0< p < 1\)) to obtain a closer solution to problem (1).
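To make this point concrete, the following numerical sketch (the matrix is synthetic and only for illustration) evaluates \(\sum _i\sigma _i^p\) for a rank-3 matrix at several values of p; as p shrinks the sum approaches the rank, whereas the nuclear norm (p = 1) does not.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # exactly rank 3

sigma = np.linalg.svd(Z, compute_uv=False)
sigma = sigma[sigma > 1e-8]          # keep only the nonzero singular values

for p in (1.0, 0.5, 0.1, 0.01):
    print(p, np.sum(sigma ** p))
# p = 1 gives the nuclear norm; as p -> 0 the sum tends to the number of
# nonzero singular values, i.e. rank(Z) = 3.
```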

4 The proposed method

4.1 Problem formulation

We assume that the source and target samples share a common subspace in which each target sample can be linearly represented by the source samples. Based on this assumption, a general formulation for unsupervised DA can be constructed:

$$\begin{aligned} \begin{aligned} {min}_{{\varvec{P}},\varvec{Z},\varvec{ {E}}}&rank(\varvec{Z})+\alpha ||\varvec{Z}||_1+\beta ||\varvec{ {E}}||_1+f({\varvec{P}},{\varvec{X}}_s,{\varvec{Y}}_s)\\&s.t.{\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}} \end{aligned} \end{aligned}$$
(2)

where \(f({\varvec{P}},{\varvec{X}}_s,{\varvec{Y}}_s)\) is a function for learning a discriminative projection matrix \({\varvec{P}}\), and \({\varvec{Y}}_s\) is the binary label matrix of \({\varvec{X}}_s\).

Although the convex nuclear norm has been widely used to approximate \(rank(\varvec{Z})\), the relaxation may deviate the outcome from the real solution (Nie et al. 2012). Recently, some approaches (Chang et al. 2016, 2019; Wang et al. 2019) have verified that the non-convex Schatten p-norm minimization performs better than the nuclear norm minimization in image denoising when p is close to 0. To obtain a closer solution of problem (2), we adopt \(\Vert \cdot \Vert _{s_p}^p\) (\(0<p<1\)) to approximate \(rank(\varvec{Z})\) by:

$$\begin{aligned} \begin{aligned} {min}_{{\varvec{P}},\varvec{Z},\varvec{ {E}}}&\Vert \varvec{Z}\Vert _{S_p}^p+\alpha ||\varvec{Z}||_1+\beta ||\varvec{ {E}}||_1+f({\varvec{P}},{\varvec{X}}_s,{\varvec{Y}}_s)\\&s.t.{\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}} \end{aligned} \end{aligned}$$
(3)

To leverage the local geometric structure information in the original feature space, we construct a graph-constraint term as

$$\begin{aligned} \frac{1}{2}\sum ^n_{i,j=1}\left\| {\varvec{P}}^T{\varvec{X}}_i-{\varvec{P}}^T{\varvec{X}}_j\right\| _2^2\,W_{i,j}=tr({\varvec{P}}^T\varvec{ {X}}\varvec{ {L}}{\varvec{X}}^T{\varvec{P}}) \end{aligned}$$
(4)

where \({\varvec{X}}=[{\varvec{X}}_s, {\varvec{X}}_t]\) and \(n=n_s+n_t\). \(W_{i,j}=1\) if \({\varvec{X}}_i\in kNN({\varvec{X}}_j)\vee {\varvec{X}}_j\in kNN({\varvec{X}}_i)\), and \(W_{i,j}=0\) otherwise. \(\varvec{ {L}}=\varvec{ {D}}-\varvec{ {W}}\) is the graph Laplacian matrix, and \(\varvec{ {D}}\) is a diagonal matrix with diagonal entries \(\varvec{ {D}}_{i,i}=\sum ^n_{j=1}\varvec{ {W}}_{i,j}\).
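For concreteness, a minimal sketch of how \(\varvec{ {W}}\) and \(\varvec{ {L}}=\varvec{ {D}}-\varvec{ {W}}\) might be built; the choice of k and the plain Euclidean distance are assumptions of this illustration, not prescribed by the paper.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Symmetric binary kNN graph on the columns of X (d x n) and its Laplacian L = D - W."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist[i])[1:k + 1]       # k nearest, excluding the sample itself
        W[i, neighbors] = 1.0
    W = np.maximum(W, W.T)                             # W_ij = 1 if i in kNN(j) or j in kNN(i)
    D = np.diag(W.sum(axis=1))
    return W, D - W
```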

In addition to exploiting the structural information, reducing the distribution distance between the two domains is also significant for unsupervised DA. Therefore, the maximum mean discrepancy (MMD) is adopted to measure the marginal distribution difference between the two domains by

$$\begin{aligned} ||1/n_s\sum ^{n_s}_{i=1}{\varvec{P}}^T{\varvec{X}}_i-1/n_t\sum ^{n_t}_{j=1}{\varvec{P}}^T{\varvec{X}}_j ||^2_F=tr({\varvec{P}}^T{\varvec{X}}\varvec{{\mathcal {B}}}{\varvec{X}}^T{\varvec{P}}) \end{aligned}$$
(5)

where \(\varvec{{\mathcal {B}}}\) is the marginal distribution MMD matrix, and \({{\mathcal {B}}}_{i,j}\) is computed as

$$\begin{aligned} {{\mathcal {B}}}_{i,j}={\left\{ \begin{array}{ll} {1}/{n^2_s},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}_s\ \\ {1}/{n^2_t},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}_t \\ {-1}/{(n_sn_t)},&otherwise \end{array}\right. } \end{aligned}$$
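A short sketch of how the marginal MMD matrix \(\varvec{{\mathcal {B}}}\) can be assembled for \({\varvec{X}}=[{\varvec{X}}_s,{\varvec{X}}_t]\) (ordering source samples before target samples is an assumption of this sketch):

```python
import numpy as np

def marginal_mmd_matrix(n_s, n_t):
    """B_ij = 1/n_s^2 (both source), 1/n_t^2 (both target), -1/(n_s*n_t) otherwise."""
    e = np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])
    return np.outer(e, e)   # rank-one matrix with exactly the block values above
```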

Since the labels of the target samples are not available, the conditional distribution difference between the two domains cannot be calculated directly. As in Wang et al. (2017) and Long et al. (2013a), we obtain pseudo-labels for the target samples by applying base classifiers (e.g., NN, SVM) trained on the labeled source data. Then, the conditional distribution distance can be formulated as

$$\begin{aligned}&\sum _{c=1}^C||1/n^c_s\sum ^{n^c_s}_{i=1}{\varvec{P}}^T{\varvec{X}}_i-1/n^c_t\sum ^{n^c_t}_{j=1}{\varvec{P}}^T{\varvec{X}}_j||^2_F\nonumber \\&\quad =tr({\varvec{P}}^T{\varvec{X}}\sum _{c=1}^C{\varvec{{\mathcal {A}}}}_c{\varvec{X}}^T{\varvec{P}}) \end{aligned}$$
(6)

where C is the number of classes and \(\varvec{{\mathcal {A}}}_c\) is the conditional distribution MMD matrix of the cth class, calculated as

$$\begin{aligned} (\varvec{{{\mathcal {A}}}}_c)_{i,j}={\left\{ \begin{array}{ll} {1}/{{n^c_s}^2},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}^c_s\ \\ {1}/{{n^c_t}^2},&{\varvec{X}}_i,{\varvec{X}}_j\in {\varvec{X}}^c_t \\ {-1}/{\left( n^c_sn^c_t\right) },&({\varvec{X}}_i\in {\varvec{X}}^c_s,{\varvec{X}}_j\in {\varvec{X}}^c_t)\vee ({\varvec{X}}_j\in {\varvec{X}}^c_s,{\varvec{X}}_i\in {\varvec{X}}^c_t) \\ 0,&otherwise \end{array}\right. } \end{aligned}$$

where \({\varvec{X}}^c_s\) (\({\varvec{X}}^c_t\)) represents the set of samples belonging to c-th class in the source (target) domain, which includes \(n^c_s\) (\(n^c_t\)) samples.
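The per-class matrices \(\varvec{{\mathcal {A}}}_c\) can be built analogously from the source labels and the target pseudo-labels; in the sketch below the integer label vectors y_s and y_t_pseudo are assumed, and a class with no pseudo-labeled target samples is simply skipped (one common convention, not specified in the paper):

```python
import numpy as np

def conditional_mmd_matrices(y_s, y_t_pseudo, classes):
    """One matrix A_c per class c, for the stacked data X = [X_s, X_t]."""
    n_s, n_t = len(y_s), len(y_t_pseudo)
    mats = []
    for c in classes:
        e = np.zeros(n_s + n_t)
        src = np.where(y_s == c)[0]
        tgt = np.where(y_t_pseudo == c)[0]
        if len(src) > 0:
            e[src] = 1.0 / len(src)
        if len(tgt) > 0:
            e[n_s + tgt] = -1.0 / len(tgt)
        mats.append(np.outer(e, e))
    return mats
```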

To maximize the class separation distance and improve classification accuracy, the label information \({\varvec{Y}}_s\) of the source samples should be considered. Following Xu et al. (2016), we design a non-negative label relaxation matrix \({\varvec{M}}\) to increase inter-class separation in the source domain as much as possible and perform the label regularization as follows:

$$\begin{aligned} {\left| \left| {{\varvec{P}}}^T{\varvec{X}}_s-\left( {\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}}\right) \right| \right| }^2_F \end{aligned}$$
(7)

where \({\varvec{B}}_{i,j}=1\) if \({\varvec{Y}}_s\left( i,j\right) =1\) and \({\varvec{B}}_{i,j}=-1\) otherwise, and \(\odot \) is the Hadamard product operator.
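A sketch of how \({\varvec{Y}}_s\) and \({\varvec{B}}\) might be formed; the C x n_s layout (so that it matches \({\varvec{P}}^T{\varvec{X}}_s\) class-wise) and integer class ids 0..C-1 are assumptions of this illustration:

```python
import numpy as np

def label_and_dragging_matrices(y_s, num_classes):
    """Y_s: one-hot label matrix (C x n_s); B: +1 where Y_s is 1 and -1 elsewhere."""
    n_s = len(y_s)
    Y_s = np.zeros((num_classes, n_s))
    Y_s[y_s, np.arange(n_s)] = 1.0
    B = np.where(Y_s == 1.0, 1.0, -1.0)
    return Y_s, B
```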

By jointly taking geometric regularization, statistical regularization, and label regularization into consideration, the final objective function is formulated as follows

$$\begin{aligned} \begin{aligned}&{min}_{{\varvec{P}},\varvec{Z},\varvec{ {E}},{\varvec{M}}}||\varvec{Z}||^p_{s_p}+\alpha ||\varvec{Z}||_1+\beta ||\varvec{ {E}}||_1\\&\quad +||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad +tr({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}\varvec{{\mathcal {A}}}_c){\varvec{X}}^T{\varvec{P}})\\&\quad s.t.\ {\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}} \end{aligned} \end{aligned}$$
(8)

where \(\alpha , \beta , \gamma , \eta \) and \(\nu \) are trade-off parameters, and \(p\in (0,1)\). In the experiments, p is set to 1/2 for convenience. Compared with \(||\varvec{Z}||_*\), \(||\varvec{Z}||^{1/2}_{s_{1/2}}\) is much closer to \(rank(\varvec{Z})\).

4.2 Optimization

To solve problem (8) effectively, two auxiliary variables \({\varvec{J}}\) and \(\varvec{ {R}}\) are first introduced to make the problem separable. Then, (8) can be rewritten as:

$$\begin{aligned} \begin{aligned}&{min}_{\begin{array}{c} {\varvec{P}},\varvec{Z},\varvec{ {E}},\\ {\varvec{M}},{\varvec{J}},\varvec{ {R}} \end{array}}||{\varvec{J}}||^{1/2}_{s_{1/2}}+\alpha ||\varvec{ {R}}||_1+\beta ||\varvec{ {E}}||_1\\&\quad +||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad +tr({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}){\varvec{X}}^T{\varvec{P}})\\&\quad s.t.\ {\varvec{P}}^T{\varvec{X}}_t={\varvec{P}}^T{\varvec{X}}_s\varvec{Z}+\varvec{ {E}}, \varvec{Z}={\varvec{J}}, \varvec{Z}=\varvec{ {R}} \end{aligned} \end{aligned}$$
(9)

Then, the augmented Lagrange multiplier (ALM) method is applied to solve (9). The augmented Lagrangian function H of (9) is:

$$\begin{aligned} \begin{aligned}&H({\varvec{P}},\varvec{Z},\varvec{ {E}},{\varvec{M}},{\varvec{J}},\varvec{ {R}},{\varvec{Y}}_1,{\varvec{Y}}_2,{\varvec{Y}}_3,\mu )=||{\varvec{J}}||^{1/2}_{s_{1/2}}\\&\quad +\alpha ||\varvec{ {R}}||_1+\beta ||\varvec{ {E}}||_1+||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad + tr({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}} +\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}){\varvec{X}}^T{\varvec{P}})\\&\quad +<{\varvec{Y}}_1,{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}>+<{\varvec{Y}}_2,\varvec{Z}-{\varvec{J}}>+<{\varvec{Y}}_3,\varvec{Z}-\varvec{ {R}}>\\&\quad +\mu /2(||{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}||^2_F+||\varvec{Z}-{\varvec{J}}||^2_F+||\varvec{Z}-\varvec{ {R}}||^2_F) \end{aligned} \end{aligned}$$
(10)

where \(<\varvec{ {A}},{\varvec{B}}>=trace(\varvec{ {A}}^T{\varvec{B}})\), \({\varvec{Y}}_1, {\varvec{Y}}_2\) and \({\varvec{Y}}_3\) are Lagrange multipliers and \(\mu >0\) is a penalty parameter. At each iteration step, only one variable is updated while fixing the other variables. The updating process is as follows.

Step 1. Optimizing \({\varvec{J}}\)

Keeping other variables constant, the problem in Eq. (10) can be simplified as follows:

$$\begin{aligned} {argmin}_{{\varvec{J}}}1/\mu ||{\varvec{J}}||^{1/2}_{s_{1/2}}+1/2||{\varvec{J}}-(\varvec{Z}+{\varvec{Y}}_2/\mu )||^2_F \end{aligned}$$
(11)

Equation (11) can be solved according to Chang et al. (2016), which provides the details of the solution.
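For illustration only, a sketch of this proximal step: since the penalty is unitarily invariant, one may take the SVD of \(\varvec{Z}+{\varvec{Y}}_2/\mu \) and apply a scalar half-thresholding rule to its singular values. The rule below is the known closed-form solution of \(\min _x (x-q)^2+\lambda |x|^{1/2}\); for the exact solver used in the paper, see Chang et al. (2016).

```python
import numpy as np

def half_threshold(q, lam):
    """Scalar rule x* = argmin_x (x - q)^2 + lam * |x|^(1/2) (L_1/2 half-thresholding)."""
    out = np.zeros_like(q)
    thr = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)   # values below this shrink to 0
    mask = np.abs(q) > thr
    qm = q[mask]
    phi = np.arccos((lam / 8.0) * (np.abs(qm) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * qm * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out

def update_J(Z, Y2, mu):
    """Eq. (11): (1/mu)*||J||_{S_1/2}^{1/2} + 0.5*||J - (Z + Y2/mu)||_F^2.
    Rescaling the objective by 2 matches the scalar form above with lam = 2/mu."""
    Q = Z + Y2 / mu
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    return U @ np.diag(half_threshold(s, 2.0 / mu)) @ Vt
```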

Step 2. Optimizing \({\varvec{P}}\)

Keeping other variables constant, Eq. (10) can be rewritten as

$$\begin{aligned} \begin{aligned}&{argmin}_{{\varvec{P}}}tr\Big({\varvec{P}}^T{\varvec{X}}(\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}){\varvec{X}}^T{\varvec{P}}\Big)\\&\quad +||{\varvec{P}}^T{\varvec{X}}_s-({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})||^2_F\\&\quad +\mu /2||{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}+{\varvec{Y}}_1/\mu ||^2_F \end{aligned} \end{aligned}$$
(12)

\({\varvec{P}}\) can be updated by setting the derivative of (12) with respect to \({\varvec{P}}\) to zero, which gives

$$\begin{aligned} {\varvec{P}}^*= & {} ({\varvec{X}}_s{\varvec{X}}^T_s+{\varvec{X}}\varvec{{\mathcal {K}}}_1{\varvec{X}}^T+\mu \varvec{{\mathcal {K}}}_2\varvec{{\mathcal {K}}}^T_2+\lambda\varvec{ {I}})^{-1}({\varvec{X}}_s({\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}})^T\nonumber \\&+\mu \varvec{{\mathcal {K}}}_2(\varvec{ {E}}-{\varvec{Y}}_1/\mu )^T) \end{aligned}$$
(13)

where \(\varvec{{\mathcal {K}}}_1=\gamma \varvec{ {L}}+\eta \varvec{{\mathcal {B}}}+\nu \sum ^C_{c=1}{\varvec{{\mathcal {A}}}_c}\), \(\varvec{{\mathcal {K}}}_2={\varvec{X}}_t-{\varvec{X}}_s\varvec{Z}\), \(\varvec{ {I}}\) is an identity matrix, and \(\lambda \) is a small positive constant.
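A direct NumPy transcription of the closed-form update (13) might look as follows (shapes follow Sect. 2; the argument names and the default value of the small constant \(\lambda \) are ours, and this is a sketch rather than the authors' code):

```python
import numpy as np

def update_P(Xs, Xt, X, Z, E, M, Ys, B, Y1, K1, mu, lam=1e-4):
    """Eq. (13): K1 = gamma*L + eta*B_mmd + nu*sum_c A_c; B is the dragging matrix."""
    d = X.shape[0]
    K2 = Xt - Xs @ Z                                        # d x n_t
    lhs = Xs @ Xs.T + X @ K1 @ X.T + mu * (K2 @ K2.T) + lam * np.eye(d)
    rhs = Xs @ (Ys + B * M).T + mu * K2 @ (E - Y1 / mu).T
    return np.linalg.solve(lhs, rhs)                        # d x m projection matrix
```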

Step 3. Optimizing \(\varvec{Z}\)

Keeping other variables constant, \(\varvec{Z}\) can be updated by solving

$$\begin{aligned}&{argmin}_{\varvec{Z}}||{\varvec{P}}^T{\varvec{X}}_t-{\varvec{P}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}+{\varvec{Y}}_1/\mu ||^2_F\nonumber \\&\quad +||\varvec{Z}-{\varvec{J}}+{\varvec{Y}}_2/\mu ||^2_F+||\varvec{Z}-\varvec{ {R}}+{\varvec{Y}}_3/\mu ||^2_F \end{aligned}$$
(14)

\(\varvec{Z}\) can be calculated by

$$\begin{aligned} \varvec{Z}^*= & {} ({\varvec{X}}^T_s{\varvec{P}}{\varvec{P}}^T{\varvec{X}}_s+2\varvec{ {I}})^{-1}({\varvec{X}}^T_s{\varvec{P}}({\varvec{P}}^T{\varvec{X}}_t-\varvec{ {E}}\nonumber \\&+{\varvec{Y}}_1/\mu )+{\varvec{J}}-{\varvec{Y}}_2/\mu +\varvec{ {R}}-{\varvec{Y}}_3/\mu ) \end{aligned}$$
(15)
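A transcription of (15) as a sketch (variable names are ours):

```python
import numpy as np

def update_Z(P, Xs, Xt, E, J, R, Y1, Y2, Y3, mu):
    """Eq. (15): closed-form update of the reconstruction matrix Z (n_s x n_t)."""
    PXs = P.T @ Xs                                          # m x n_s
    lhs = PXs.T @ PXs + 2.0 * np.eye(Xs.shape[1])           # n_s x n_s
    rhs = PXs.T @ (P.T @ Xt - E + Y1 / mu) + J - Y2 / mu + R - Y3 / mu
    return np.linalg.solve(lhs, rhs)
```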

Step 4. Optimizing \(\varvec{ {R}}\)

Keeping other variables constant, \(\varvec{ {R}}\) can be solved as

$$\begin{aligned} {\varvec{ {R}}}^{\varvec{*}}&={argmin}_{\varvec{ {R}}}{\alpha }/{\mu }{\left| \left| \varvec{ {R}}\right| \right| }_1+{1}/{2}{\left| \left| {\varvec{Z-R}}+{{\varvec{Y}}_3}/{\mu }\right| \right| }^2_F\\ &={{\Phi }}_{{\alpha }/{\mu }}[{\varvec{Z}}{\varvec{+}}{\varvec{Y}}_3/\mu ] \end{aligned}$$
(16)

where \({{\Phi }}_w[\cdot ]\) is the soft thresholding (shrinkage) operator, with \({{\Phi }}_w\left[ x\right] =\mathrm {sign}(x)\,\mathrm {max}(\left| x\right| -w,0)\).
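A sketch of the shrinkage operator and the resulting \(\varvec{ {R}}\) update; the same operator, with a different threshold, also gives the \(\varvec{ {E}}\) update in Step 5.

```python
import numpy as np

def soft_threshold(X, w):
    """Phi_w[x] = sign(x) * max(|x| - w, 0), applied element-wise."""
    return np.sign(X) * np.maximum(np.abs(X) - w, 0.0)

def update_R(Z, Y3, mu, alpha):
    """Eq. (16): R* = Phi_{alpha/mu}[Z + Y3/mu]."""
    return soft_threshold(Z + Y3 / mu, alpha / mu)
```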

Step 5. Optimizing \(\varvec{ {E}}\)

Keeping other variables constant, \(\varvec{ {E}}\) is updated by

$$\begin{aligned} \begin{aligned} {\varvec{ {E}}}^{\varvec{*}}&={argmin}_{\varvec{ {E}}}{\beta }/{\mu }{\left| \left| \varvec{ {E}}\right| \right| }_1\\&\quad +{1}/{2}{\left| \left| {{\varvec{P}}}^T{\varvec{X}}_t-{{\varvec{P}}}^T{\varvec{X}}_s\varvec{Z}-\varvec{ {E}}\varvec{+}{\varvec{Y}}_1/\mu \right| \right| }^2_F\\&={{\Phi }}_{\beta /\mu }[{{\varvec{P}}}^T{\varvec{X}}_t-{{\varvec{P}}}^T{\varvec{X}}_s\varvec{Z}\varvec{+}{\varvec{Y}}_1/\mu ] \end{aligned} \end{aligned}$$
(17)

Step 6. Optimizing \({\varvec{M}}\)

Keeping other variables constant, \({\varvec{M}}\) can be solved by

$$\begin{aligned} {{\varvec{M}}}^{\varvec{*}}={argmin}_{{\varvec{M}}}{\left| \left| {{\varvec{P}}}^T{\varvec{X}}_s-\left( {\varvec{Y}}_s+{\varvec{B}}\odot {\varvec{M}}\right) \right| \right| }^2_F \end{aligned}$$
(18)

\({\varvec{M}}\) can be calculated by

$$\begin{aligned} {\varvec{M}}=\mathrm {max}(({\varvec{P}}^T{\varvec{X}}_s-{\varvec{Y}}_s)\odot {\varvec{B}},0) \end{aligned}$$
(19)
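A one-line sketch of this element-wise update:

```python
import numpy as np

def update_M(P, Xs, Ys, B):
    """Eq. (19): non-negative label relaxation matrix M."""
    return np.maximum((P.T @ Xs - Ys) * B, 0.0)
```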

The whole optimization of Eq. (8) is summarized in Algorithm 1, which outputs a projection matrix \({\varvec{P}}\). For classification, all samples from both domains are first projected into the new subspace by multiplying with \({{\varvec{P}}}^T\). Then each target sample is assigned the label of its nearest source sample in the subspace, using a one-nearest-neighbor classifier.
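A sketch of this final classification step (Euclidean distance in the projected space; the function name is ours):

```python
import numpy as np

def classify_target(P, Xs, Xt, ys):
    """Project both domains with P^T and label each target sample by its nearest source sample."""
    Zs, Zt = P.T @ Xs, P.T @ Xt                      # m x n_s and m x n_t
    d2 = (np.sum(Zt ** 2, axis=0)[:, None]
          + np.sum(Zs ** 2, axis=0)[None, :]
          - 2.0 * Zt.T @ Zs)                         # n_t x n_s squared distances
    return ys[np.argmin(d2, axis=1)]                 # 1-NN label assignment
```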

figure a

4.3 Complexity and convergence analysis

In Algorithm 1, most of the running time is spent on optimizing \({\varvec{J}}\), \({\varvec{P}}\) and \(\varvec{Z}\), which require singular value decomposition (SVD) and matrix inversion. The complexities of optimizing \({\varvec{J}}\), \({\varvec{P}}\) and \(\varvec{Z}\) are \(O(n_sn^2_t)\) (assuming \(n_t\le n_s\)), \(O(d^3+2d^2(n_s+n_t))\) and \(O(dn_sm+n^3_s)\), respectively. Thus, the total time complexity of the proposed algorithm is \(O(N(n_sn^2_t+d^3+2d^2(n_s+n_t)+dn_sm+n^3_s))\), where N is the maximum number of iterations. In our experiments, N is set to 15.

Since problem (8) is not smooth (the Schatten 1/2-norm is non-convex) and the proposed algorithm involves more than two blocks, it is difficult to give a theoretical convergence proof for Algorithm 1. Figure 2 shows the value of (8) with respect to the number of iterations on four datasets. From Fig. 2, we can see that the objective function value decreases as the number of iterations increases.

Fig. 2
figure 2

The objective function value versus the iteration number on different datasets. a COIL1\(\rightarrow \)COIL2, b MNIST\(\rightarrow \)USPS, c Caltech256\(\rightarrow \)Amazon, d PIE1\(\rightarrow \)PIE2

5 Experiments

The proposed method is evaluated on six datasets that are widely used for unsupervised transfer learning: COIL20, Office, Caltech-256, USPS, MNIST, and CMU PIE. The first three datasets are used for object classification, the next two for digit classification and the last for face classification. The proposed method is compared with ten recent unsupervised transfer learning methods, i.e., SDA (Sun and Saenko 2015), TSL (Si et al. 2010), GFK (Gong et al. 2012), JDA (Long et al. 2013b), BDA (Wang et al. 2017), SCA (Ghifary et al. 2017), LTSL (Shao et al. 2014), LRSR (Xu et al. 2016), LRDRM (Razzaghi et al. 2019) and SPDA (Xiao et al. 2019). Two standard machine learning methods (i.e., NN and PCA) are also included. The results of all competing methods are either taken from the original papers or from widely published results to ensure a fair comparison.

5.1 COIL20 dataset

This dataset consists of 20 objects with 1440 gray images of size \(32\times 32\). The images were taken every \(5^\circ \) as the objects were rotated on a motorized turntable against a black background. In the experiment, the dataset is divided into two subsets according to the rotation angle: COIL1 contains the images of all objects taken at angles in [\(0^\circ \), \(85^\circ \)] and [\(180^\circ \), \(265^\circ \)], and COIL2 contains the images taken at angles in [\(90^\circ \), \(175^\circ \)] and [\(270^\circ \), \(355^\circ \)]. Each subset contains 720 images. Some samples are shown in Fig. 3a. The images in COIL1 and COIL2 follow different distributions. In the experiments, one subset is selected as the source set and the other as the target set, so two domain adaptation tasks are constructed.

Fig. 3
figure 3

a Samples of the COIL20 dataset. b Some examples of the MNIST dataset (left) and USPS dataset (right)

Fig. 4
figure 4

Classification accuracies (%) on the COIL20 dataset

Figure 4 shows that the proposed method is superior to the other competing methods. In particular, it achieves the highest accuracies of 99.31% and 99.03% on COIL1\(\rightarrow \)COIL2 and COIL2\(\rightarrow \)COIL1, respectively. Compared with the standard learning methods, the proposed method achieves improvements of approximately 14.0% on both experiments. Compared with the other DA methods, improvements of at least 1.39% and 0.14% are obtained in the two experiments, respectively, which demonstrates the effectiveness of the proposed unified framework incorporating the non-convex Schatten p-norm, the graph-structure constraint, statistical distributions, and the classification error.

5.2 Office and Caltech-256 datasets

The Office dataset consists of 4652 images in 31 categories. These images are divided into three groups: Amazon, Webcam and DSLR (denoted by A, W and D, respectively), which are collected from an online merchant, a low-resolution web camera, and a high-resolution digital SLR camera, respectively. Different from the Office dataset, which is a standard DA benchmark, the Caltech-256 dataset (denoted by C) is an image classification benchmark consisting of more than 30,000 images from 256 categories. The two datasets share 10 classes, namely, backpack, bike, calculator, headphones, keyboard, laptop computer, monitor, mouse, mug and projector, as shown in Fig. 5. It can be seen that the variance among images within the datasets is quite large. In the experiments, two different parts are selected from the four parts (i.e., A, W, D, and C) as the source and target domains, yielding 12 DA experiments in total, i.e., C\(\rightarrow \)A, C\(\rightarrow \)W, \(\ldots \), D\(\rightarrow \)W.

Fig. 5
figure 5

Some images from the Office and Caltech-256 datasets

Table 1 lists the performance of all competing methods. The proposed method achieves the highest average accuracy over the 12 cross-domain experiments and has at least 10.38% (0.85%) improvement over the standard learning methods (DA methods). In the 12 cross-domain experiments, the proposed method achieves four best performances and three second-best performances.

Table 1 Classification accuracies (%) on the Office and Caltech256 datasets (The number in bold indicates the highest accuracy and the number in underline represents the second-highest accuracy)

5.3 USPS and MNIST datasets

The USPS dataset consists of 9298 images of size \(16\times 16\), of which 7291 are for training and 2007 for testing. The MNIST dataset includes 70,000 images of size \(28\times 28\), of which 60,000 are for training and 10,000 for testing. The two datasets share 10 digit classes but follow very different distributions. Some examples are shown in Fig. 3b. Following the settings of Long et al. (2013b), two subsets are formed by randomly selecting 1800 and 2000 images from USPS and MNIST, respectively. By selecting one subset as the source set and the other as the target set, two DA experiments are constructed: USPS\(\rightarrow \)MNIST and MNIST\(\rightarrow \)USPS.

Figure 6 lists the classification accuracies of all methods. The proposed method achieves the best performance, with improvements of at least 0.27% and 0.89% in the two experiments. It should be noted that LTSL (Shao et al. 2014) and LRSR (Xu et al. 2016) perform better than most of the other DA methods, except BDA (Wang et al. 2017) and JDA (Long et al. 2013b). The main reason is that the background and structure of the digit images are simple; in this case, the geometric structure matters more than the distribution difference. By taking both structural information and distribution shifts into consideration, SPDA and the proposed method achieve better performance than the other competing methods. Compared with SPDA, the proposed method utilizes the Schatten p-norm to exploit the geometric structure more precisely and weights the marginal and conditional distributions differently to further minimize the domain shifts, which results in better performance.

Fig. 6
figure 6

Classification accuracies (%) on the USPS and MNIST datasets

5.4 CMU PIE dataset

The CMU PIE dataset contains 41,368 face images of size \(32\times 32\), taken from 68 people under different illumination conditions, poses and expressions. In this experiment, five subsets of PIE are used to test the different methods. Each subset corresponds to a distinct pose with illumination and expression variations: PIE1 (C05, left pose, 3332 images, 49 per person), PIE2 (C07, upward pose, 1632 images, 24 per person), PIE3 (C09, downward pose, 1632 images, 24 per person), PIE4 (C27, frontal pose, 3332 images, 49 per person), and PIE5 (C29, right pose, 1632 images, 24 per person), as shown in Fig. 7. Following the experimental settings in Xu et al. (2016), two different subsets are chosen as the source set and target set, which results in 20 different pairs of DA experiments, i.e., \(1\rightarrow 2, 1\rightarrow 3,\ldots , 5\rightarrow 4\).

Fig. 7
figure 7

Some examples of the CMU PIE dataset. a Left pose, b upward pose, c downward pose, d front pose, e right pose

Table 2 Classification accuracies (%) on the CMU PIE dataset

Table 2 shows the classification accuracies of the different methods. The proposed method outperforms the other competing methods on all 20 cross-domain datasets and achieves a \(\mathbf {10.84\%}\) improvement in average accuracy. The main reason is that the proposed method uses both statistical distribution and geometric structure information, which reduces the distribution shift between the source and target domains more effectively than the other methods.

5.5 Discussion of parameters

There are five parameters in the objective function (8): \(\alpha \) and \(\beta \) control the \(l_{1}\) regularization on the representation matrix and the noise matrix, respectively, while \(\gamma , \eta \) and \(\nu \) balance the graph regularization, marginal distribution and conditional distribution, respectively. To examine the impact of the parameters, we evaluate the proposed method under different combinations of values, each selected from a small set. Specifically, \(\alpha \) and \(\beta \) are searched in \(\{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1\}\), and the search range for \(\gamma , \eta \) and \(\nu \) is \(\{0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000\}\).

The results for COIL1\(\rightarrow \)COIL2, Caltech256\(\rightarrow \)Amazon, USPS\(\rightarrow \)MNIST and PIE1\(\rightarrow \)PIE2 with different parameters are shown in Fig. 8. From Fig. 8, we can see that the classification accuracies are roughly consistent across parameter settings. Specifically, the method is insensitive to \(\alpha \) and \(\beta \) on all of the datasets: the accuracy varies within \(1.0\%\). The optimal values of \(\gamma , \eta \) and \(\nu \) differ across datasets, since the roles of their corresponding regularization terms vary with the dataset.

Fig. 8
figure 8

Classification accuracies (%) of the proposed method with different parameters on different datasets. a \(\alpha \) and \(\beta \) for COIL1\(\rightarrow \)COIL2, b \(\gamma \), \(\eta \) and \(\nu \) for COIL1\(\rightarrow \)COIL2, c \(\alpha \) and \(\beta \) for Caltech256\(\rightarrow \)Amazon, d \(\gamma \), \(\eta \) and \(\nu \) for Caltech256\(\rightarrow \)Amazon, e \(\alpha \) and \(\beta \) for MNIST\(\rightarrow \)USPS, f \(\gamma \), \(\eta \) and \(\nu \) for MNIST\(\rightarrow \)USPS, g \(\alpha \) and \(\beta \) for CMU PIE1\(\rightarrow \)CMU PIE2, h \(\gamma \), \(\eta \) and \(\nu \) for CMU PIE1\(\rightarrow \)CMU PIE2

5.6 Discussion with deep learning methods

Deep learning is a popular technology and has been applied in various applications (Yu et al. 2019; Lu et al. 2018, 2019). Ding and Fu (2018) developed deep transfer low-rank coding (DTLC) for cross-domain learning; benefitting from convolutional neural networks, DTLC can capture more representative and discriminative image features. Long et al. (2017) proposed joint adaptation networks, which learn a transfer network by introducing the joint maximum mean discrepancy criterion and an adversarial training strategy.

Image features play an essential role in the classification task. Intuitively, the proposed method would perform better if deep features were used, because deep features are more representative and discriminative than handcrafted features. In the experiments, simple handcrafted features are used (SURF features for the Office and Caltech256 datasets, grayscale pixel values for the other datasets). For practical applications, deep features can first be extracted through trained deep neural networks and then used for unsupervised DA via (8).

6 Conclusions

This paper presented a novel unsupervised DA method for image classification. The method exploits both the geometric and the statistical characteristics of the samples by (1) imposing a graph-structure constraint and the Schatten p-norm on the reconstruction matrix, and (2) minimizing both the marginal and conditional distribution discrepancies, so as to reduce the distribution shift between the source and target domains. By projecting the samples of both domains into a common separable subspace, the samples from the target domain can be correctly classified. An iterative algorithm is proposed to solve the resulting problem effectively. Extensive experiments on six datasets verify the advantages of the proposed method.