1 Introduction

Usually, since the distributions of samples from the source and target domains differ, directly applying a classifier trained on source domain samples to target domain samples leads to poor classification performance. Retraining a new classifier on target domain samples is impractical because labeled target samples are scarce. Domain adaptation addresses this problem by transferring the knowledge learned from the source domain to the target domain [1,2,3,4,5,6,7,8,9,10,11]. For example, with the help of domain adaptation, a classifier trained on a labeled source domain consisting of ID photos taken under controlled conditions and stored in police stations can work well on an unlabeled target domain consisting of photos captured by surveillance cameras [12, 13]. At present, a common approach to domain adaptation based on distribution difference is to transform the source and target domains into a shared Reproducing Kernel Hilbert Space (RKHS) subspace, which is optimized so that the two distributions are as close as possible [2, 5,6,7,8]. The distribution difference metric, referred to as the domain adaptation measure, is therefore vital for RKHS subspace learning. The Maximum Mean Discrepancy (MMD) is the most representative domain adaptation measure, and many related studies [4,5,6,7, 10, 11, 14] have used it to quantify the distribution gap between different domains. Generally, the MMD measure between the source domain data \(\left\{ x_{i}^{s}\mid i=1,\ldots ,n_{s}\right\} \) and the target domain data \(\left\{ x_{j}^{t}\mid j=1,\ldots ,n_{t} \right\} \) can be written as

$$\begin{aligned} {\mathop {\hbox{argmin}}\limits _{H_s}}\left\| \frac{1}{n_{s}}\sum _{i=1}^{n_{s}}y_{i}^{s}-\frac{1}{n_{t}}\sum _{j=1}^{n_{t}} y_{j}^{t}\right\| _{H}^{2}, \end{aligned}$$

where \(H_s\) is an RKHS subspace and \(\left\| \cdot \right\| _{H}\) is the RKHS norm; \(\left\{ y_{i}^{s}\mid i=1,\ldots ,n_{s} \right\} \) and \(\left\{ y_{j}^{t} \mid j=1,\ldots ,n_{t} \right\} \) denote the RKHS subspace projections of the original data \(\left\{ x_{i}^{s}\mid i=1,\ldots ,n_{s}\right\} \) and \(\left\{ x_{j}^{t}\mid j=1,\ldots ,n_{t} \right\}\), respectively. Although the MMD measure is simple and easy to implement, it has theoretical defects as a measure of the distribution difference between two domains: (1) it only considers the mean values of the domains and ignores higher-order moments such as variances; (2) domains with diverse distributions may share the same mean value. Hence, the MMD measure cannot accurately measure the distribution discrepancy.
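
To make defect (2) concrete, the following minimal NumPy sketch (hypothetical one-dimensional data, not taken from the paper's experiments) draws two samples with identical means but very different spreads; the empirical mean gap that the objective above drives to zero is already close to zero even though the two distributions clearly differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples with the same mean (0) but different variances.
x_s = rng.normal(loc=0.0, scale=1.0, size=1000)   # "source" sample
x_t = rng.normal(loc=0.0, scale=3.0, size=1000)   # "target" sample

# First-order statistic: the mean gap underlying the MMD-style objective above.
print(abs(x_s.mean() - x_t.mean()))   # close to 0

# Second-order statistic: the variance gap reveals the real distribution difference.
print(abs(x_s.var() - x_t.var()))     # close to 8
```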

In addition, the covariance matrix measure based on the second-order moment, proposed by Li [8] and Sun [15], is also commonly used to measure the distribution distance between the source and target domains. The covariance matrix measure of the source domain data and the target domain data is

$$\begin{aligned} {\mathop {\hbox{arg min}}\limits _{H_s}}\left\| \Sigma _{s}-\Sigma _{t} \right\| _{F}^{2}, \end{aligned}$$

where \(\left\| \cdot \right\| _{F}\) is the Frobenius norm and the covariance matrices are defined as

$$\begin{aligned}{} & {} \Sigma _{s}=\frac{1}{n_{s}}\sum _{i=1}^{n_{s}}\left( y_{i}^{s} -\bar{\mu ^{s}}\right) \left( y_{i}^{s} -\bar{\mu ^{s}}\right) ^{T}, \\{} & {} \Sigma _{t}=\frac{1}{n_{t}}\sum _{j=1}^{n_{t}}\left( y_{j}^{t} -\bar{\mu ^{t}}\right) \left( y_{j}^{t} -\bar{\mu ^{t}}\right) ^{T}, \end{aligned}$$

with

$$\begin{aligned} \bar{\mu ^{s}}=\frac{1}{n_{s}}\sum _{i=1}^{n_{s}}y_{i}^{s},\quad \bar{\mu ^{t}} =\frac{1}{n_{t}}\sum _{i=1}^{n_{t}}y_{i}^{t}. \end{aligned}$$
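
For concreteness, here is a minimal NumPy sketch of this covariance measure (an illustration with hypothetical data standing in for the subspace projections \(y_i^s\) and \(y_j^t\)):

```python
import numpy as np

def covariance_gap(Ys, Yt):
    """Squared Frobenius norm between the covariance matrices defined above.
    Ys: d x n_s source projections, Yt: d x n_t target projections."""
    mu_s = Ys.mean(axis=1, keepdims=True)
    mu_t = Yt.mean(axis=1, keepdims=True)
    cov_s = (Ys - mu_s) @ (Ys - mu_s).T / Ys.shape[1]
    cov_t = (Yt - mu_t) @ (Yt - mu_t).T / Yt.shape[1]
    return np.linalg.norm(cov_s - cov_t, ord="fro") ** 2

rng = np.random.default_rng(0)
Ys = rng.normal(size=(5, 200))        # hypothetical projected source data
Yt = 2.0 * rng.normal(size=(5, 300))  # hypothetical projected target data, wider spread
print(covariance_gap(Ys, Yt))         # nonzero, reflecting the second-order gap
```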

From the perspective of distribution matching, MMD-based and covariance-based domain adaptation methods aim to align the means (MMD measure) or covariances (covariance matrix measure) of different domains, which is suitable when the domains obey Gaussian distributions. However, real-world domains usually obey complex non-Gaussian distributions, so the MMD and covariance matrix measures cannot fully exploit the potential of domain adaptation based on RKHS subspace learning. In addition, the complicated covariance matrix measure incurs a large computational cost because it requires iterative optimization.

To address the above limitations, we propose a new domain adaptation measure, the Minimum Distribution Gap (MDG). With the MDG measure, the distributions of the source data \(X_s\) and target data \(X_t\) in the subspace can be matched better, which enhances the transferability of models trained on the source domain. The MDG measure of the source data \(X_s\) and target data \(X_t\) is as follows:

$$\begin{aligned} \hbox{MDG} \left( X_s,X_t \right) = E \left[ \left\| X_s-X_t \right\| ^{2} \right] . \end{aligned}$$
(1)

Our main contributions are as follows: (1) we prove that the MDG measure is effective for RKHS subspace classification; (2) the optimized RKHS subspace is derived analytically through the Lagrange Multiplier Method (LMM); and (3) the results of extensive experiments on different datasets verify the advantages of the proposed MDG measure over approaches based on the MMD measure and the covariance matrix measure.

The rest of this paper is organized as follows. In Sect. 2, we briefly review related work on traditional domain adaptation based on RKHS subspace learning and on deep neural network-based domain adaptation. In Sect. 3, we introduce the necessary background on second-order random variables, the related definitions of the RKHS, and RKHS subspace learning. In Sect. 4, we prove the transformation validity of the RKHS subspace, propose the MDG measure and apply it to RKHS subspace classification; the optimization problem, the algorithm and the computational complexity analysis of the MDG measure are also given in Sect. 4. In Sect. 5, experiments demonstrate the validity of the MDG measure in terms of classification accuracy, running time and RKHS subspace dimension stability. Finally, Sect. 6 concludes the paper.

2 Related work

Domain adaptation [16] aims to transfer the knowledge learned from a well-labeled source domain to help a poorly labeled target domain. Domain adaptation based on RKHS subspace learning [2] is very popular among domain adaptation methods; it learns a latent RKHS subspace for the source and target domains that reduces their distribution difference. The key problem of RKHS subspace learning is therefore how to measure the distribution gap between two domains. Gretton et al. [14] proposed the MMD to measure the distribution distance between two domains, which simply represents each domain's distribution by its mean in the RKHS. Currently, MMD-based methods are the most common in RKHS subspace learning. For instance, TCA proposed by Pan et al. [17] learns a shared latent RKHS subspace by using the MMD to reduce the distribution divergence while preserving the data properties as much as possible, so that the distribution of the target domain aligns better with that of the source domain; models trained on the source domain can therefore be applied to and perform well on the target domain. Moreover, Pan et al. put forward semi-supervised TCA (SSTCA) [17], which considers label information in the subspace learning. IGLDA [6] not only uses the MMD to measure the distribution distance between the two domains but also retains the local geometry of the labeled source domain data to find a suitable subspace in which the distributions are as similar as possible. MIDA [18], proposed in 2017, reduces the distribution gap between the source and target domains by minimizing their MMD distance while keeping maximum independence of the domain features. The MMD-based TIT [5] and LPJT [19] extend domain adaptation to heterogeneous domain adaptation [20, 21], which handles domains with arbitrary features and dimensionalities by learning different transformations for different domains. In addition to the MMD, the covariance matrix measure based on the second-order moment is used in the DACoM model [8] to match the distributions of the domains, where the local geometric structure and discriminative information are preserved simultaneously.

Deep domain adaptation integrates domain adaptation into neural networks to learn more transferable features, which helps adapt models trained on a source domain to a different but related target domain. For instance, DDC [22] proposed a new CNN architecture that introduces a domain adaptation layer and an additional MMD-based domain adaptation loss term to learn domain-invariant representations, thereby mitigating the domain shift between the source and target domains. To further reduce the distribution discrepancy between the source and target domains, DAN [23] proposed the multi-kernel MMD measure (MK-MMD) and applied it to a pre-trained AlexNet model. Benefiting from the CNN and the MK-MMD measure, DAN is able to learn features that work well on the target domain. In 2017, Deep CORAL [15] extended the covariance matrix measure to deep neural networks, that is, the covariance measure between the source and target feature activations was added as a domain adaptation loss term. By jointly training with the covariance loss and the classification loss, Deep CORAL enhances the transferability of the feature representation. In addition to combining domain adaptation and neural networks for classification, Liang et al. [24] applied the MK-MMD measure to CNNs and proposed a transferable reconstruction neural network for compressed signals (CTCS), in which the MK-MMD measure is used to fine-tune the pre-trained network; the reconstruction capability on target domain signals is thereby achieved by only fine-tuning the network trained on source domain signals.

3 Preliminary

In this section, some related background knowledge is introduced. First, we give the definition of a second-order moment random variable and the necessary and sufficient condition for two second-order moment random variables to be equal. Next, we review some basic concepts of the RKHS. Finally, we introduce the framework of RKHS subspace learning. The notation used in this paper is collected in Table 1.

Table 1 The table of symbols used in the paper

3.1 Second-order moment random variable

Given a random variable X obeying a distribution \(p\left( x \right) \), it is called a second-order moment random variable if the condition \(E \left[ \left| X \right| ^{2} \right] =\int _{\Omega }\left| x\right| ^{2}p\left( x \right) \textrm{d} x<+\infty \) is satisfied. From a physical point of view, a second-order random variable corresponds to a random signal with finite energy, and in practice all signals have finite energy. Therefore, the source and target domain data in the original space can be treated as samples drawn from two second-order random variables with different distributions.

The set of all such random variables, \(\left\{ X\mid E\left[ \left| X \right| ^{2} \right] <+\infty \right\} \), is called the \(L^{2}\) space; it is a Hilbert space whose inner product is defined as [25, 26]

$$\begin{aligned} \left( X,Y \right) _{L^{2}}=E\left[ X{Y}^{*} \right] , \end{aligned}$$

where \(\forall X,Y\in L^{2}\), the star denotes the complex conjugate, and round brackets denote the inner product on the \(L^{2}\) space. Besides, the norm in \(L^{2}\) is defined as [27]

$$\begin{aligned} \left\| X \right\| _{L^{2}}=\sqrt{\left( X,X \right) _{L^{2}}}. \end{aligned}$$

In light of the positive definiteness of the inner product on the Hilbert space \(L^{2}\), the necessary and sufficient condition for two second-order random variables to be equal is that the mean squared error between them is zero, which can be formulated as follows (see details in "Appendix A"),

$$\begin{aligned} X_1= & {} X_2\Leftrightarrow \left\| X_1-X_2 \right\| _{L^{2}}^{2}= \left( X_1-X_2,X_1-X_2 \right) _{L^{2}}\\= & {} E\left[ \left| X_1-X_2 \right| ^{2} \right] =0. \end{aligned}$$

where \(X_1\) and \(X_2\) are both second-order random variables from the \(L^2\) space.

3.2 Reproducing kernel Hilbert space

Similarly, the space H of continuous square integrable functions is given by [28]

$$\begin{aligned} H: \left\{ f \left| f:\Omega \rightarrow {\mathbb {R}},\int _{\Omega }{\left| f\left( x \right) \right| }^{2}\textrm{d}x< +\infty \right. \right\} , \end{aligned}$$

H is a Hilbert space, and its inner product is given by [9]

$$\begin{aligned} \left\langle f,g \right\rangle _{H}\ =\int _{\Omega } f\left( x \right) {g^{*}\left( x \right) }\textrm{d} x, \end{aligned}$$

where the star denotes the complex conjugate.

In particular, a Hilbert space is called an RKHS if its kernel \(k(x', x):\Omega \times \Omega \rightarrow {\mathbb {R}}\) satisfies the following [10, 25, 26]:

For any \(f\in H \), its value can be reproduced through the RKHS inner product of the function itself and the feature vector \(k\left( \cdot ,x \right) :\)

$$\begin{aligned} f\left( x \right) =\left\langle f, k\left( \cdot ,x \right) \right\rangle _{H}. \end{aligned}$$

from which the following also holds,

$$\begin{aligned} \left\langle k\left( \cdot , x' \right) , k\left( \cdot , x \right) \right\rangle _{H} = k\left( x', x \right) . \end{aligned}$$

3.3 The RKHS subspace learning framework

In domain adaptation applications, \(X_s=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s}\right\} \) and \(X_t=\left\{ x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} \) come from the source and target domains, respectively, and obey different distributions. Domain adaptation based on RKHS subspace learning tries to find a suitable RKHS subspace that minimizes their distribution difference.

First, the kernel transformation \(\varphi \left( x \right) =k\left( \cdot ,x \right) \) maps the data samples \(X=X_s\cup X_t=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s},x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} =\left\{ x_1,\ldots ,x_N \right\} \subseteq \Omega \) into the RKHS H. A new orthogonal basis \( \vartheta _i \) of the RKHS subspace \(H_s\) can then be constructed as a linear combination of these (generally non-orthogonal) feature vectors:

$$\begin{aligned} \vartheta _i=\sum _{j=1}^{N}w_{ji}\varphi \left( x_j \right) ,\quad i=1,\ldots ,d, \end{aligned}$$
(2)

which can be cast into the matrix form as follows

$$\begin{aligned} \Theta = \Phi W, \end{aligned}$$
(3)

with

$$\begin{aligned} W= & {} \begin{bmatrix} w_{11} &{} \cdots &{} w_{1d}\\ \vdots &{} \ddots &{}\vdots \\ w_{N1}&{}\cdots &{} w_{Nd} \\ \end{bmatrix}, \quad \Theta = \begin{bmatrix} \vartheta _1&\dots&\vartheta _d \end{bmatrix}, \\ \Phi= & {} \begin{bmatrix} \varphi \left( x_1 \right)&\dots&\varphi \left( x_N \right) \end{bmatrix}. \end{aligned}$$

The new basis \(\Theta \) satisfies the following orthonormality condition

$$\begin{aligned} \begin{bmatrix} \left\langle \vartheta _1, \vartheta _1\right\rangle _{H} &{} \cdots &{} \left\langle \vartheta _1,\vartheta _d \right\rangle _{H}\\ \vdots &{} \ddots &{} \vdots \\ \left\langle \vartheta _d,\vartheta _1 \right\rangle _{H} &{}\cdots &{}\left\langle \vartheta _d,\vartheta _d \right\rangle _{H}\\ \end{bmatrix} = \Theta ^T \Theta =I_{d}. \end{aligned}$$
(4)

Substituting Eq. (3) into Eq. (4), the following is obtained:

$$\begin{aligned} \begin{bmatrix} \left\langle \vartheta _1, \vartheta _1\right\rangle _{H} &{} \cdots &{} \left\langle \vartheta _1,\vartheta _d \right\rangle _{H}\\ \vdots &{} \ddots &{} \vdots \\ \left\langle \vartheta _d,\vartheta _1 \right\rangle _{H} &{}\cdots &{}\left\langle \vartheta _d,\vartheta _d \right\rangle _{H}\\ \end{bmatrix}=W^TKW=I_{d}, \end{aligned}$$
(5)

where K is the kernel matrix given by

$$\begin{aligned} K=\begin{bmatrix} k\left( x_1,x_1 \right) &{} \cdots &{} k\left( x_1,x_N \right) \\ \vdots &{} \ddots &{} \vdots \\ k\left( x_N,x_1\right) &{}\cdots &{} k\left( x_N,x_N \right) \end{bmatrix} \end{aligned}$$
(6)

Then, a certain domain adaptation measure is used to obtain the optimal RKHS subspace \(H_s\) with basis \(\Theta \), characterized by its weighting matrix W.

Finally, the feature vector \(\varphi (x_i)\) is projected onto the RKHS subspace \(H_s\) that satisfies the constraint of Eq. (5). According to the subspace projection theorem in Hilbert space [27], the coordinate vector \(y_i\) of the feature vector \(\varphi (x_i)\) in the RKHS subspace \(H_s\) with basis \(\Theta \) is given by

$$\begin{aligned} y_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} =\begin{bmatrix} \left\langle \varphi \left( x_i\right) ,\vartheta _1 \right\rangle _{H} \\ \vdots \\ \left\langle \varphi \left( x_i\right) ,\vartheta _d \right\rangle _{H}\\ \end{bmatrix}\in {\mathbb {R}}^d. \end{aligned}$$
(7)

where d is the dimension of the RKHS subspace \(H_s\).
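
To summarize the framework, the NumPy sketch below (an illustrative sketch with a hypothetical `whitening_basis` helper; a Gaussian RBF kernel is assumed, as in Sect. 5.2) builds the kernel matrix K of Eq. (6), constructs one admissible weighting matrix W satisfying the constraint of Eq. (5) by whitening K, and projects all samples via Eq. (7) as \(Y=W^TK\). Note that this particular W is only a feasible point; it has not yet been optimized with respect to any domain adaptation measure.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=10.0):
    """Kernel matrix K of Eq. (6) for row-wise samples X (N x m), RBF kernel."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def whitening_basis(K, d):
    """One admissible W with W.T @ K @ W = I_d, built from the top-d
    eigenpairs of K (a feasible choice for illustration, not the optimum)."""
    vals, vecs = np.linalg.eigh(K)              # ascending eigenvalues
    vals, vecs = vals[::-1][:d], vecs[:, ::-1][:, :d]
    return vecs / np.sqrt(vals)                 # column i is u_i / sqrt(lambda_i)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                    # hypothetical pooled source+target samples
K = rbf_kernel_matrix(X, sigma=1.0)
W = whitening_basis(K, d=5)
print(np.allclose(W.T @ K @ W, np.eye(5)))      # constraint of Eq. (5): True
Y = W.T @ K                                     # Eq. (7): one coordinate column per sample
```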

4 RKHS subspace classification with MDG

In this section, we first introduce the proposed MDG measure; second, we confirm the mapping validity of the RKHS subspace, that is, a second-order moment random variable in the original data space remains a second-order moment random variable when it is transformed into the RKHS subspace; then, we apply the MDG measure to RKHS classification and derive its optimized formula via the LMM; finally, we analyze the algorithm of MDG-based RKHS subspace learning and its computational cost.

4.1 Minimum distribution gap

Suppose there are two second-order moment random variables, namely the source domain \(X_s\sim p(x)\) and the target domain \(X_t\sim q(x)\), where \(p(x)\ne q(x)\). To align the two distributions, we propose the MDG measure to reduce the discrepancy between \(X_s\) and \(X_t\), as shown in Eq. (1).

In real applications, the exact joint probability density function of \(X_s\) and \(X_t\) is unknown, and only the sample sets drawn from \(X_s\) and \(X_t\) are available, namely \(X_s=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s}\right\} \) and \(X_t=\left\{ x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} \). So, Eq. (1) can be rewritten as:

$$\begin{aligned} \hbox{MDG}\left( X_s,X_t \right)= & {} E\left[ \left| X_{s}- X_{t} \right| ^{2} \right] \nonumber \\= & {} \int _{\Omega \times \Omega }\left\| x^s-x^t \right\| ^2p\left( x^s,x^t \right) \textrm{d}x^s\textrm{d}x^t \nonumber \\\approx & {} \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| x^s_i-x^t_j \right\| ^2, \end{aligned}$$
(8)

where \( p\left( x^s,x^t \right) \) is the joint probability density function of \(X_s\) and \(X_t\), which is approximated by assigning the uniform weight \(\frac{1}{n_sn_t}\) to every sample pair.
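
The empirical estimate in Eq. (8) can be computed directly; the short NumPy sketch below (hypothetical data) evaluates the average squared distance over all source-target pairs in vectorized form.

```python
import numpy as np

def mdg(Xs, Xt):
    """Empirical MDG of Eq. (8): the mean squared Euclidean distance over
    all source-target sample pairs.  Xs: n_s x m, Xt: n_t x m."""
    d2 = (np.sum(Xs**2, axis=1)[:, None]
          + np.sum(Xt**2, axis=1)[None, :]
          - 2.0 * Xs @ Xt.T)                  # n_s x n_t pairwise squared distances
    return d2.mean()

rng = np.random.default_rng(0)
Xs = rng.normal(loc=0.0, size=(200, 10))      # hypothetical source sample
Xt = rng.normal(loc=1.0, size=(300, 10))      # hypothetical shifted target sample
print(mdg(Xs, Xt))                            # grows with the distribution gap
```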

4.2 The mapping validity of RKHS subspace

Here, we give a proof of the mapping validity of the RKHS subspace; in other words, a second-order moment variable remains a second-order moment variable after the RKHS subspace transformation. This proof is essential for extending the MDG measure to the RKHS subspace.

For a second-order moment random variable \(X \in \Omega \), we get a random variable Y in light of the projection theorem Eq. (7):

$$\begin{aligned} Y= \begin{bmatrix} Y_1 \\ \vdots \\ Y_d\\ \end{bmatrix}= \begin{bmatrix} \left\langle \varphi \left( X \right) ,\vartheta _1 \right\rangle _{H} \\ \vdots \\ \left\langle \varphi \left( X \right) ,\vartheta _d \right\rangle _{H} \end{bmatrix}, \end{aligned}$$

which represents the projection of \(\varphi \left( X \right) \) onto the subspace \(H_s\) with the orthogonal basis \( \vartheta _i\ (i =1,\ldots ,d)\). Now, we prove that Y is a second-order moment random variable by showing that each component \(Y_i\) of Y satisfies the condition \(E \left[ \left| Y_i \right| ^{2} \right] =\int _{\Omega } y_i^{^{2}}p\left( y_i \right) \textrm{d} y_i<+\infty \). In fact, we have

$$\begin{aligned} E\left[ {\left| Y_i \right| }^2 \right]&=E\left[ {\left| \left\langle \varphi \left( X \right) ,\vartheta _i \right\rangle _{H} \right| }^2 \right] \\&=\int _\Omega \left| \left\langle \varphi \left( x \right) ,\vartheta _i \right\rangle _{H} \right| ^2p\left( x \right) \textrm{d}x \\&\le \int _\Omega \left| \left\langle \varphi \left( x \right) ,\vartheta _i \right\rangle _{H} \right| ^2\textrm{d}x\\&=\int _\Omega \left| \left\langle \varphi \left( x \right) ,\sum _{j=1}^{N}\omega _{ji}\varphi \left( x _j\right) \right\rangle _{H} \right| ^2\textrm{d}x \\&=\int _\Omega \left| \sum _{j=1}^{N} \omega _{ji}\left\langle \varphi \left( x \right) ,\varphi \left( x_j \right) \right\rangle _{H} \right| ^2\textrm{d}x \\&=\int _\Omega \left| \sum _{j=1}^{N} \omega _{ji}k\left( x,x_j \right) \right| ^2\textrm{d}x \\&\le \sum _{p=1}^{N}\sum _{q=1}^{N}\left| \omega _{pi}\omega _{qi} \right| \left| \int _\Omega k\left( x,x_q \right) k\left( x,x_p \right) \textrm{d}x \right| \\&\le \sum _{p=1}^{N}\sum _{q=1}^{N} \left| \omega _{pi}\omega _{qi}\right| \sqrt{\int _\Omega k^2\left( x,x_q \right) \textrm{d}x}\\&\quad \sqrt{\int _\Omega k^2\left( x,x_p \right) \textrm{d}x} < +\infty , \end{aligned}$$

where X follows a probability density function satisfying \(0\le p\left( x \right) \le 1\), \(\varphi \left( x \right) = k \left( \cdot ,x \right) \) is assumed to be square integrable, and the last inequality follows from the Cauchy-Schwarz inequality.

From the above derivation, we conclude that second-order moment variables remain second-order moment variables after being mapped into the RKHS subspace. In light of this conclusion, we can apply the MDG measure to domain adaptation based on RKHS subspace learning.

4.3 MDG for RKHS subspace classification

In this paper, the MDG is used as the domain adaptation measure for the domain adaptation procedure shown in Fig. 1. Specifically, we first transform the source domain \(X_s=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s}\right\} \) and the target domain \(X_t=\left\{ x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} \) into the RKHS subspace \(H_s\) to get \({Y}_s=\left[ y_{1}^{s},\ldots ,y_{n_s}^{s} \right] \) and \({Y}_t=\left[ y_{1}^{t},\ldots ,y_{n_t}^{t} \right] \), which are the coordinates of the corresponding projections on the orthogonal basis \(\Theta \) of the subspace \(H_s\). According to the proof in Sect. 4.2, \({Y_s}\) and \({Y_t}\) are second-order moment variables.

Fig. 1 The illustration of RKHS subspace domain adaptation via MDG. First, we map the instances from the two domains into the RKHS (the red and blue dots represent instances from the source and target domains, respectively). Then, we project these mapped instances onto the RKHS subspaces \(H_{s}\) and \(H_{s}^{'}\), where \(H_{s}\) is the optimal subspace learned by minimizing the MDG measure proposed in this paper and \(H_{s}^{'}\) is non-optimal. The distribution gap between the two domains is reduced much more in the optimal subspace \(H_{s}\) than in the non-optimal subspace \(H_{s}^{'}\)

Then, we minimize the MDG between \( Y_s\) and \( Y_t\) to learn an optimal RKHS subspace \(H_{s}\) in which their distributions are as close as possible. Thus, our goal is to solve the following minimization problem:

$$\begin{aligned}&{\mathop {\hbox{arg min}}\limits _{W}} \left\| {Y}_s-{Y}_t \right\| ^2 =E\left[ \left| Y_{s}- Y_{t} \right| ^{2} \right] ,\nonumber \\&\quad \hbox{s.t.} \quad W^{T}KW =I_{d}. \end{aligned}$$
(9)

With the help of Sects. 4.1 and 4.2, the objective in Eq. (9) can be expanded as follows

$$\begin{aligned}{} & {} \left\| Y_s-Y_t \right\| ^2 = \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| y_i^s-y_j^t \right\| ^2 \nonumber \\{} & {} \quad = \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| W^T\left( K(:,i)-K(:,n_s+j)\right) \right\| ^2 \nonumber \\{} & {} \quad = \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| W^T\varphi _{ij} \right\| ^2 \nonumber \\{} & {} \quad =\frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}tr\left( W^T\varphi _{ij}\varphi _{ij}^TW\right) \nonumber \\{} & {} \quad = \hbox{tr}\left( W^T\left( \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\varphi _{ij}\varphi _{ij}^T \right) W \right) \nonumber \\{} & {} \quad =\hbox{tr}\left( W^T\Psi W \right) , \end{aligned}$$
(10)

where \(\varphi _{ij}=K(:,i)-K(:,n_s+j) \), and

$$\begin{aligned} \Psi =\frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\varphi _{ij}\varphi _{ij}^T. \end{aligned}$$
(11)

Finally, the optimization problem of Eq. (9) reduces to

$$\begin{aligned} {\mathop {\hbox{arg min}}\limits _{W}} \hbox{tr}\left( W^{T} \Psi W\right) , \quad \hbox{s.t.}\quad W^{T}KW=I_{d}. \end{aligned}$$
(12)

4.4 Optimization problem

Next, we explain in detail how Eq. (12) is solved by the LMM. Because K is an SPD matrix, it can be factorized through its eigenvalue and eigenvector matrices as follows

$$\begin{aligned} K=U\Lambda U^{T}=U\Lambda ^{\frac{1}{2}}\Lambda ^{\frac{1}{2}} U^{T}, \end{aligned}$$
(13)

where \(UU^{T}=I\) and \(\Lambda \) is a diagonal matrix.

Denoting \(\Lambda ^{\frac{1}{2}} U^{T}\) by L, the following are obtained

$$\begin{aligned} K=L^{T}L,\quad W^{T}KW= W^{T}L^{T}LW=I_{d}, \end{aligned}$$
(14)

from which the matrix trace of Eq. (12) is given by

$$\begin{aligned} tr\left( W^{T} \Psi W\right) =tr\left( W^{T}L^{T}\left( L^{T} \right) ^{-1} \Psi L^{-1}LW\right) . \end{aligned}$$

Now, denoting \(G=LW\), the optimization problem of Eq. (12) is transformed into

$$\begin{aligned} {\mathop {\hbox{arg\;min}}\limits _{G}} tr\left( G^{T} A G\right) ,\quad \hbox{s.t.}\quad G^{T}G=I_{d}, \end{aligned}$$
(15)

with

$$\begin{aligned} A=\left( L^{T} \right) ^{-1} \Psi L^{-1}, \end{aligned}$$
(16)

which can be solved through the LMM with the Lagrangian function given by [29]

$$\begin{aligned} L\left( G,Z \right) =\hbox{tr}\left( G^{T}AG \right) -tr\left( \left( G^{T}G-I_{d} \right) Z \right) , \end{aligned}$$
(17)

where Z is a symmetric matrix and \(z_{ij}\) are Lagrange multipliers.

Equation (17) can be solved as follows

$$\begin{aligned} \left\{ \begin{array}{ll} \frac{\partial L\left( G,Z \right) }{\partial G}=2AG-2GZ=0\\ \\ \frac{\partial L\left( G,Z \right) }{\partial Z}=\left( G^{T}G-I_{d}\right) ^{T}=0 \end{array}\right. \Rightarrow \left\{ \begin{array}{ll} AG= GZ;\\ G^{T}G=I_{d}. \end{array}\right. \end{aligned}$$
(18)

When Z is a diagonal matrix, if G is the eigenvectors matrix of A, then Eq. (18) is satisfied and the minimization problem of Eq. (15) can be achieved by selecting the smallest d eigenvalues and the corresponding eigenvectors.

When Z is not a diagonal matrix but symmetric, it can be factorized in terms of its eigenvalues matrix \(\Sigma \) and eigenvectors matrix V as follows,

$$\begin{aligned} Z = V \Sigma V^T. \end{aligned}$$
(19)

Substituting Eq. (19) into Eq. (18), the following is obtained,

$$\begin{aligned} \left\{ \begin{matrix} AG= GV \Sigma V^T \Rightarrow AGV= GV \Sigma \Rightarrow A\tilde{G}= \tilde{G} \Sigma ,\\ G^{T}G=I_{d} \Rightarrow V^T G^{T} G V =I_{d} \Rightarrow \tilde{G}^{T} \tilde{G} =I_{d}, \end{matrix}\right. \end{aligned}$$
(20)

where \(\tilde{G} = GV\) and we have used the orthogonality relation of the eigenvectors matrix \(V^TV = I_d\).

It is clear that the eigenvalue matrix \(\Sigma \) is diagonal, and the minimization problem of Eq. (15) is achieved when \(\tilde{G}\) is formed with the eigenvectors corresponding to the d smallest eigenvalues of A.

According to the above analysis, the minimization problem of Eq. (15) is solved by selecting the eigenvectors corresponding to the d smallest eigenvalues of A.

Finally, we can get the optimized weighting matrix W of the original optimization problem of Eq. (12) as follows

$$\begin{aligned} W = L^{-1}G=\left( \Lambda ^{\frac{1}{2}} U^{T}\right) ^{-1}G=U\Lambda ^{-\frac{1}{2}}G. \end{aligned}$$
(21)
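
Putting Eqs. (13)-(21) together, the following compact NumPy sketch computes the optimized weighting matrix (a sketch that assumes K is strictly positive definite so that \(\Lambda ^{-\frac{1}{2}}\) exists; a small ridge `eps` can be added to the eigenvalues otherwise).

```python
import numpy as np

def solve_mdg_subspace(K, Psi, d, eps=0.0):
    """Optimized weighting matrix W of Eq. (21) via Eqs. (13)-(20).
    K: N x N SPD kernel matrix, Psi: matrix of Eq. (11), d: subspace dimension.
    eps: optional ridge added to K's eigenvalues for numerical stability."""
    lam, U = np.linalg.eigh(K)            # Eq. (13): K = U diag(lam) U^T
    L_inv = U / np.sqrt(lam + eps)        # L^{-1} = U diag(lam)^{-1/2}
    A = L_inv.T @ Psi @ L_inv             # Eq. (16), using (L^T)^{-1} = (L^{-1})^T
    _, V = np.linalg.eigh(A)              # eigenvalues in ascending order
    G = V[:, :d]                          # eigenvectors of the d smallest eigenvalues
    return L_inv @ G                      # Eq. (21): W = U diag(lam)^{-1/2} G
```

The returned W satisfies \(W^{T}KW=I_{d}\) up to numerical error and minimizes \(\hbox{tr}\left( W^{T} \Psi W\right) \) among all such W.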

4.5 Algorithm of RKHS classification with MDG

The procedure for the solution of MDG is summarized in Algorithm 1, which is explained as follows:

Algorithm 1 RKHS subspace learning with MDG

The inputs of the algorithm are the samples from \(X_s\) and \(X_t\), the kernel function \(k(\cdot , \cdot )\), and the RKHS subspace dimension d; the output is the weighting matrix W that characterizes the orthogonal basis \(\Theta \) of the RKHS subspace.

The algorithm takes the samples from both \(X_s\) and \(X_t\) to form the joint kernel matrix K; then its eigenvector matrix U is calculated; after that, the intermediate matrices L and A are calculated from Eqs. (14) and (16), respectively; and finally, the weighting matrix W that characterizes the orthogonal basis of the RKHS subspace is obtained by selecting the eigenvectors corresponding to the d smallest eigenvalues of the intermediate matrix A, according to Eq. (21).

Once the weighting matrix W that characterizes the orthogonal basis of the RKHS subspace is obtained, the unknown labels of the target instances \(X_t\) can be predicted as follows (a code sketch is given after the list):

  1. The data sample set \(X = X_s \cup X_t\) is projected onto the RKHS subspace \(H_s\) as \(Y=W^TK\): the source domain samples \(X_s\) are projected to \(Y_s = Y\left( :,1:n_s \right) \), and the target domain samples \(X_t\) are projected to \(Y_t = Y\left( :,n_s+1:n_s+n_t \right) \);

  2. Train the classifier with the projected source samples \(Y_s\);

  3. Use the trained classifier to label the projected target samples \(Y_t\).
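
A minimal end-to-end sketch of these three steps is given below. It re-uses the hypothetical helpers `rbf_kernel_matrix`, `build_psi` and `solve_mdg_subspace` from the earlier sketches, generates toy data in place of real domains, and uses scikit-learn's k-nearest-neighbor classifier; none of the names or numbers come from the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for the pooled source/target samples and the source labels.
rng = np.random.default_rng(0)
ns, nt, m = 60, 40, 8
Xs = rng.normal(loc=0.0, size=(ns, m)); ys = rng.integers(0, 3, size=ns)
Xt = rng.normal(loc=0.5, size=(nt, m)); yt = rng.integers(0, 3, size=nt)
X = np.vstack([Xs, Xt])

K = rbf_kernel_matrix(X, sigma=1.0)               # sigma=1 suits this toy data scale
W = solve_mdg_subspace(K, build_psi(K, ns, nt), d=10)

Y = W.T @ K                                       # step 1: project all samples
Ys, Yt = Y[:, :ns], Y[:, ns:]                     # source / target coordinates in H_s

clf = KNeighborsClassifier(n_neighbors=1).fit(Ys.T, ys)   # step 2: train on source
yt_pred = clf.predict(Yt.T)                                # step 3: label target
print(np.mean(yt_pred == yt))                     # accuracy as defined in Sect. 5.2
```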

4.6 Computational complexity

According to Algorithm 1, the computational cost of our MDG-based RKHS subspace learning consists of three major parts:

  1. The construction of the kernel matrix K in step 3, which costs \( {\mathcal {O}} \left( mn^2 \right) \) (m is the dimension of the samples);

  2. The construction of the matrix \(\Psi \) in step 3, which costs \( {\mathcal {O}} \left( n_{s}n_{t}n^2 \right) \) (\(n=n_{s}+n_{t}\));

  3. The optimization of the coefficient matrix W in steps 4 and 7, which costs \( {\mathcal {O}} \left( d n^2 \right) \).

So, the overall computational complexity of Algorithm 1 would be \( {\mathcal {O}} \left( mn^2+n_{s}n_{t}n^2+dn^2 \right) \).

5 Experiments

In this section, we conduct two kinds of experiments to verify the classification effectiveness of our MDG measure: one is the comparison with the MMD and covariance measures; the other applies the MDG measure to four domain adaptation algorithms, replacing their original distribution discrepancy measures, to evaluate its performance. In addition, we conduct an experiment to verify the insensitivity of the MDG measure to the RKHS subspace dimension.

5.1 The real-world datasets

We assess the performance of the proposed MDG measure on four popular datasets: the Office-Caltech10 dataset, the handwritten digits dataset, the text dataset and the VLSIC dataset. The data that support the findings of this study are available from this website (Footnote 1). Next, the four datasets are introduced.

  1. Office-Caltech10 dataset. The Office-Caltech10 dataset consists of four domains: Amazon (A, collected from Amazon), DSLR (D, shot by an SLR camera), Webcam (W, collected by a webcam), and Caltech (C, collected from Caltech) [30]. Each domain contains 10 classes, such as backpack, monitor and headphone. Examples of headphones from the A, D, W, and C domains are shown in Fig. 2. Each domain is used as the source domain and the target domain in turn.

  2. Text dataset. This dataset comes from the Reuters-21578 dataset (Footnote 2), which includes 21,578 documents and 672 categories. In fact, we use a pre-processed dataset divided into three categories: orgs, places, and people, with each category containing two sub-classes [6]. We regard these three categories as three domains and, for each classification task, randomly select two of them as the source and target domains, respectively.

  3. Handwritten digits dataset. The handwritten digits dataset consists of the MNIST (Footnote 3) and USPS (Footnote 4) datasets with different distributions, which contain the handwritten digits 0 to 9. The MNIST dataset contains 70,000 gray images of size 28 \(\times \) 28, and the USPS dataset contains 11,000 gray images of size 16 \(\times \) 16. Owing to the large number of samples and the limited processing power of our device, a subset of the handwritten digits dataset is used in the following experiments, consisting of 2000 images from MNIST and 1800 images from USPS, all randomly selected. Some data preparation is then performed on this subset: the gray images are uniformly scaled to \(16 \times 16\) and each image is flattened into a 256-dimensional vector. Some examples of the handwritten digits dataset are shown in Fig. 3. MNIST and USPS are taken as the source and target domain in turn.

  4. VLSIC dataset. The VLSIC dataset consists of 5 domains with different distributions: VOC2007 (V), LabelMe (L), SUN09 (S), ImageNet (I) and Caltech101 (C). Since the original data have a very high dimension, we first applied PCA [31] to reduce the dimension of the original data from 4096 to 300, and then selected the 5 classes shared by the five domains to conduct the experiments.

Fig. 2 Examples of the four domains in the Office-Caltech10 dataset

Fig. 3 Examples of the digits 0–9 in the handwritten digits dataset

5.2 The comparison with MMD and co-variance measures

In this subsection, we compare the MDG measure with the MMD and covariance measures on the above four datasets. In particular, the MMD measure is the most popular among domain adaptation algorithms, and the covariance measure has recently been used in domain adaptation algorithms [8, 15]. For simplicity, the covariance measure is denoted as cov in Tables 2, 3, 4 and 5. In these experiments, the parameters are set as follows:

  1. The Gaussian Radial Basis Function (RBF) kernel is chosen as the reproducing kernel of the RKHS [9]: \(k\left( x_1,x_2 \right) =e^{-\frac{{\left\| x_1-x_2 \right\| }^2}{2\delta ^2}}, \delta =10\).

  2. The dimension of the RKHS subspace \(H_{s}\) has been set to \(d=30\) for the handwritten digits dataset and \(d=100\) for the other three datasets.

  3. The k-Nearest Neighbor (knn) method [32] is used for classification, and experiments are carried out with \(k = 1, 3, 5, 7\). The classification accuracy is calculated as follows:

    $$\begin{aligned} \hbox{accuracy} = \frac{\left| \left\{ x_{t}\in X_t \mid \hbox{knn}\left\{ x_t \right\} =\hbox{label}\left\{ x_t \right\} \right\} \right| }{\hbox{num}\left\{ X_t \right\} }, \end{aligned}$$

    where \(X_t\) is the target domain sample set, \(\hbox{num}\left\{ X_t \right\} \) is the number of samples in \(X_t\), \(\hbox{knn}\left\{ x_t \right\} \) is the label predicted by the knn method for a target sample \(x_t\), and \(\hbox{label}\left\{ x_t \right\} \) is the ground-truth label of \(x_t\).

  4. The number of iterations of the covariance domain adaptation measure is set to 10.

Table 2 Classification accuracy comparison of the Office-Caltech10 dataset
Table 3 Classification accuracy comparison of the text dataset
Table 4 Accuracy comparison of the handwritten digits dataset
Table 5 Accuracy comparison of the VLSIC dataset

The specific classification tasks are arranged as follows:

  1. Office-Caltech10 dataset classification. Following IGLDA [6], the SURF (Speeded Up Robust Features) [33] features of the dataset are first extracted; the features are then normalized and z-scored so that their means are zero and their standard deviations are one. In total, we carried out eight tasks: \(A \rightarrow C\), \(A \rightarrow D\), \(C \rightarrow A\), \(D \rightarrow A\), \(D \rightarrow C\), \(D \rightarrow W\), \(W \rightarrow A\), \(W\rightarrow C\). Here, \(A \rightarrow C\) means that Amazon is the source domain and Caltech is the target domain.

  2. Text dataset classification. We set up only one classification task: orgs \(\rightarrow \) places.

  3. Handwritten digits dataset classification. For the handwritten digits dataset, we set up two tasks, MNIST \(\rightarrow \) USPS and USPS \(\rightarrow \) MNIST, where MNIST \(\rightarrow \) USPS means that the MNIST dataset is the source domain and the USPS dataset is the target domain.

  4. VLSIC dataset classification. Six tasks are set up on this dataset: \(C \rightarrow L\), \(C\rightarrow S\), \(C \rightarrow V\), \(I \rightarrow C\), \(I \rightarrow V\), \(V \rightarrow L\).

5.3 Comparisons with state-of-the-art domain adaptation algorithms

In this subsection, we compare the proposed MDG measure with the TIT [5], IGLDA [6], LPJT [19] and MIDA [18] algorithms from the literature to show its performance. Note that these algorithms contain not only domain adaptation measures but also other regularization terms that help ensure classification performance. For an objective assessment of our MDG, we replace the domain adaptation measure used in each algorithm with the MDG measure, which yields four nearly new domain adaptation algorithms, namely TIT_MDG, IGLDA_MDG, LPJT_MDG and MIDA_MDG. For example, TIT_MDG is obtained by replacing the domain adaptation measure used in the TIT algorithm with the MDG measure while keeping the original regularization of TIT unchanged; IGLDA_MDG, MIDA_MDG and LPJT_MDG are constructed analogously. In total, we have four comparisons: TIT vs TIT_MDG, IGLDA vs IGLDA_MDG, MIDA vs MIDA_MDG and LPJT vs LPJT_MDG. Since the four original algorithms all apply SVM classifiers, standard SVM classifiers with different kernels are used to classify the target domain samples. Besides, the dimension of the RKHS subspace is set to 100.

  1. TIT versus TIT_MDG. In this experiment, the Office-Caltech10 dataset is used and twelve tasks are set up in total. For each task, we randomly select two domains, one as the source domain and the other as the target domain. In addition, we use the SVM classifier with an RBF kernel.

  2. IGLDA versus IGLDA_MDG. We conduct 12 tasks on the Office-Caltech10 dataset to compare IGLDA_MDG with IGLDA; the task setup is as above. In addition, the SVM classifier with a linear kernel is used.

  3. LPJT versus LPJT_MDG. In this experiment, we conduct two tasks on the handwritten digits dataset, namely MNIST \(\rightarrow \) USPS and USPS \(\rightarrow \) MNIST, and we use the RBF kernel-based SVM classifier.

  4. MIDA versus MIDA_MDG. Here, we compare the combined algorithm MIDA_MDG with MIDA to verify the effectiveness of our MDG measure on the handwritten digits dataset. The SVM classifier with a linear kernel is used.

5.4 Classification results

Under the experimental settings of Sects. 5.2 and 5.3, we obtain all the classification results and report them in Tables 2, 3, 4, 5, 6, 7, 8 and 9, where the best result of each classification task is shown in bold.

Table 6 Accuracy comparison of the Office-Caltech10: TIT versus TIT_MDG
Table 7 Accuracy comparison of the Office-Caltech10: IGLDA versus IGLDA_MDG
Table 8 Accuracy comparison of the handwritten digits dataset: LPJT versus LPJT_MDG
Table 9 Accuracy comparison of the Text dataset: MIDA vs MIDA_MDG

The classification results of Sect. 5.2 on the four real-world datasets are collected in Tables 2, 3, 4 and 5 and visualized in Figs. 4, 5, 6 and 7, respectively. Among these domain adaptation measures, the MDG measure works much better than the MMD measure and the un-optimized covariance measure, delivering higher classification accuracy. This shows that the MDG measure can learn a good common subspace for the source and target domain samples, in which their distributions match better than in the subspaces learned by MMD or covariance. For 80% of the tasks in this subsection, the covariance measure achieves higher classification accuracy than the MMD measure; the reason is that the MMD measure uses only first-order moment statistics of the domains, while the covariance measure uses second-order statistics. The generally low classification accuracy of the tasks in this subsection is due to the fact that the domain adaptation measures only focus on global information, i.e., the inter-domain distribution difference, and ignore local information such as the intra-class distance within a domain, the local geometric structure, and discriminative information [5, 6, 8, 11, 16, 18, 19]. For this reason, current domain adaptation algorithms consider global and local information at the same time. However, since the contribution of this paper is a neat and effective MDG measure for aligning different distributions, local information is not considered for the time being.

Fig. 4 The classification accuracy of the MDG, MMD and covariance measures for different values of k on the Office-Caltech10 dataset

Fig. 5 The classification accuracy of the MDG, MMD and covariance measures for different values of k on the text dataset

Fig. 6 The classification accuracy of the MDG, MMD and covariance measures for different values of k on the handwritten digits dataset

Fig. 7 The classification accuracy of the MDG, MMD and covariance measures for different values of k on the VLSIC dataset

Tables 6, 7, 8 and 9 show that the proposed MDG also works well in the four comparisons: it transforms the source and target data into a latent RKHS subspace in which the distribution gap is smaller than that achieved by the original algorithms, thereby enhancing the classification ability.

In addition, we compared the running speeds of the measures on the domain adaptation task orgs \(\rightarrow \) places. The code for MMD, covariance and MDG was written in MATLAB R2018a, and no parallel computing was used. The running times of MMD, covariance and MDG were 0.5 s, 96.3 s and 29.1 s, respectively. Although the running time of our measure is not the shortest, it is acceptable compared with the covariance method; considering the classification results as well, MDG is more practical to use than the MMD and covariance measures.

5.5 RKHS subspace dimension sensitivity analysis

The RKHS is an infinite-dimensional linear space, so its subspace dimension can be arbitrarily large or even infinite, which makes RKHS subspace learning difficult or even impossible to realize directly on a computer. For domain adaptation methods based on RKHS subspace learning, the subspace is constructed from linear combinations of the transformed samples in the RKHS. According to Sect. 3.3, the upper limit of the subspace dimension d is the rank of the kernel matrix K, which is at most N. In practice, d is often selected adaptively according to the input data.

We perform an experiment on tuning \(d\ (d<N)\) to show that the proposed RKHS subspace learning based on the MDG measure is robust to the parameter d, namely that the classification accuracy remains stable when d changes over a large range. Keeping the other parameters unchanged, we decrease the subspace dimension d from 350 to 60 in steps of 10 and conduct the classification task orgs \(\rightarrow \) places at each dimension. The results for different d are shown in Fig. 8.
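
The protocol of this experiment amounts to a simple sweep over d; the sketch below re-uses the hypothetical helpers and toy variables from the sketch in Sect. 4.5 and only illustrates the loop structure, not the paper's actual data or numbers.

```python
# Sweep the subspace dimension d as in Fig. 8, keeping all other parameters fixed.
accuracies = {}
for d in range(350, 50, -10):                     # d = 350, 340, ..., 60 (requires d < N)
    W = solve_mdg_subspace(K, build_psi(K, ns, nt), d)
    Y = W.T @ K
    Ys, Yt = Y[:, :ns], Y[:, ns:]
    clf = KNeighborsClassifier(n_neighbors=1).fit(Ys.T, ys)
    accuracies[d] = np.mean(clf.predict(Yt.T) == yt)
```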

Fig. 8 The classification accuracy of the orgs \(\rightarrow \) places task with the RKHS subspace dimension d varying from 350 to 60

From Fig. 8, we can see that the classification results remain robust even when d changes over a large range.

6 Conclusion

In this paper, we study a neat and effective MDG measure for the RKHS subspace domain adaptation classification problem. The MDG measure optimizes the RKHS subspace so that the distribution difference between the source-domain data and the target-domain data is as small as possible. Compared with the first-order moment MMD measure and the second-order moment covariance measure, the MDG measure has the advantage of capturing higher-order moments of the distributions. Compared with the complicated covariance measure, it also has the advantage of being easy to use and of admitting an analytical optimization: a rigorous mathematical formula has been derived for the weighting matrix of the optimized orthogonal Hilbert subspace basis via the LMM. Finally, extensive experiments on four real-world datasets have been carried out. Comparisons with four state-of-the-art domain adaptation algorithms from the literature, based on both the MMD and covariance measures, show that the MDG-based RKHS subspace approach achieves better classification performance in general.

As discussed in Sect. 2, some recent works have applied the MMD and covariance measures to deep neural networks as additional loss terms to enhance the transferability of feature representations. Hence, in future work, we will consider extending the MDG measure to deep learning architectures.