RKHS subspace domain adaption via minimum distribution gap

Qiu, Yanzhen; Zhang, Chuangfeng; Xiong, Chenkui; Ma, Zhengming; Liao, Shaolin

doi:10.1007/s10044-023-01170-y

RKHS subspace domain adaption via minimum distribution gap

Theoretical Advances
Published: 17 May 2023

Volume 26, pages 1425–1439, (2023)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Pattern Analysis and Applications Aims and scope Submit manuscript

RKHS subspace domain adaption via minimum distribution gap

Download PDF

Yanzhen Qiu¹,
Chuangfeng Zhang¹,
Chenkui Xiong¹,
Zhengming Ma¹ &
…
Shaolin Liao ORCID: orcid.org/0000-0002-4432-3448¹

247 Accesses
1 Citation
Explore all metrics

Abstract

Subspace learning of Reproducing Kernel Hilbert Space (RKHS) is most popular among domain adaption applications. The key goal is to embed the source and target domain samples into a common RKHS subspace where their distributions could match better. However, most existing domain adaption measures are either based on the first-order statistics that can’t accurately qualify the difference of distributions for non-Guassian distributions or complicated co-variance matrix that is difficult to be used and optimized. In this paper, we propose a neat and effective RKHS subspace domain adaption measure: Minimum Distribution Gap (MDG), where the rigorous mathematical formula can be derived to learn the weighting matrix of the optimized orthogonal Hilbert subspace basis via the Lagrange Multiplier Method. To show the efficiency of the proposed MDG measure, extensive numerical experiments with different datasets have been performed and the comparisons with four other state-of-the-art algorithms in the literature show that the proposed MDG measure is very promising.

Domain adaption based on source dictionary regularized RKHS subspace learning

Article 17 June 2021

Transfer subspace learning joint low-rank representation and feature selection

Article 23 April 2022

Low-rank representation-based regularized subspace learning method for unsupervised domain adaptation

Article 05 December 2019

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Usually, since the distributions of samples from the source and target domain are different from each other, directly applying the classifiers trained on source domain samples to target domain samples would lead to poor classification performance. It’s unwise to retrain a new classifier on target domain samples due to the deficiency of labeled samples. And domain adaption can address this classification problem, because it can transfer the knowledge learned from source domain to target domain [1,2,3,4,5,6,7,8,9,10,11]. For example, with the help of domain adaption, the classifier trained on labeled source domain that consists of ID photos under controlled condition stored in police stations can work well on the unlabeled target domain that consists of target photos captured by some video monitors [12, 13]. At present, a common way of domain adaption based on distribution difference is that source and target domains are transformed into a Reproducing Kernel Hilbert Space (RKHS) subspace shared by the domains, which should be optimized so that their distributions are as close as possible [2, 5,6,7,8]. It can be seen that the distribution difference metric named as domain adaption measure is vital for RKHS subspace learning. The Maximum Mean Difference (MMD) is the most representative domain adaption measure. Many related studies [4,5,6,7, 10, 11, 14] used the MMD measure to judge the distribution gap between different domains. Generally, the MMD measure between the source domain data $\left\{ x_{i}^{s}\mid i=1\cdots n_{s}\right\} $ and the target domain data $\left\{ x_{j}^{t}\mid j=1,\ldots n{_{t}} \right\} $ can be written as

$$\begin{aligned} {\mathop {\hbox{argmin}}\limits _{H_s}}\left\| \frac{1}{n_{s}}\sum _{i=1}^{n_{s}}y_{i}^{s}-\frac{1}{n_{t}}\sum _{j=1}^{n_{t}} y_{j}^{t}\right\| _{H}^{2}, \end{aligned}$$

where $H_s$ is a RKHS subspace and $\left\| \cdot \right\| _{H}$ is the RKHS norm; also, $\left\{ y_{i}^{s}\mid i=1,\ldots n{_{s}} \right\} $ and $\left\{ y_{j}^{t} \mid j=1,\ldots n{_{t}} \right\} $ represent the RKHS subspace projections of the original data $\left\{ x_{i}^{s}\mid i=1\cdots n_{s}\right\} $ and $\left\{ x_{j}^{t}\mid j=1,\ldots n{_{t}} \right\}$, respectively. Although MMD measure is simple and easy to implement, it has its theoretical defect in terms of measuring the distribution difference between two domains: (1) MMD only considers their mean values but ignores their higher-order moments, such as variances; (2) domains with diverse distributions maybe have the same mean value. So, the MMD measure is unable to accurately measure the distribution discrepancy.

In addition, co-variance matrix measure based on the second-order moment proposed by Li [8] and Sun [15] is also commonly used to measure the distribution distance between source and target domains. And the co-variance matrix measure of source domain data and target domain data is

$$\begin{aligned} {\mathop {\hbox{arg min}}\limits _{H_s}}\left\| \sum {_{s}}-\sum {_{t}} \right\| _{F}^{2}, \end{aligned}$$

where $\left\| \cdot \right\| _{F}$ is Frobenius norm and the definition of co-variance matrices is

$$\begin{aligned}{} & {} \sum {_{s}}=\frac{1}{n_{s}}\left( y_{i}^{s} -\bar{\mu ^{s}}\right) \left( y_{i}^{s} -\bar{\mu ^{s}}\right) ^{T}, \\{} & {} \sum {_{t}}=\frac{1}{n_{t}}\left( y_{i}^{t} -\bar{\mu ^{t}}\right) \left( y_{i}^{t} -\bar{\mu ^{t}}\right) ^{T}, \end{aligned}$$

with

$$\begin{aligned} \bar{\mu ^{s}}=\frac{1}{n_{s}}\sum _{i=1}^{n_{s}}y_{i}^{s},\quad \bar{\mu ^{t}} =\frac{1}{n_{t}}\sum _{i=1}^{n_{t}}y_{i}^{t}. \end{aligned}$$

From the perspective of distribution matching, MMD-based or co-variance-based domain adaption methods aim to align the mean (MMD measure) and covariance (co-variance matrix measure) of different domains to align the distributions of domains, which are suitable for the domains obeying Guassian distribution. However, in real world, domains usually obey complex non-Guassian distributions. So, the MMD measure and co-variance matrix measure cannot fully display the performance of domain adaption based on the RKHS subspace learning. In addition, the complicated co-variance matrix measure has large computational costs, because it needs iterative optimization.

To solve the above limitations, we propose a new domain adaption measure MDG. The distributions of the source data $X_s$ and target data $X_t$ in subspace can be matched better by MDG measure, which enhance the transferability of the models trained on source domain. And the MDG measure of the source data $X_s$ and target data $X_t$ is as followed:

$$\begin{aligned} \hbox{MDG} \left( X_s,X_t \right) = E \left[ \left\| X_s-X_t \right\| ^{2} \right] . \end{aligned}$$

(1)

Our main contributions are as follows, (1) we prove that the MDG measure is effective for the RKHS subspace classification; (2) the optimized RKHS subspace has been analytically derived through the Lagrange Multiplier Method (LMM); and (3) the results of extensive experiments on different dataset have verified the advantages of the proposed MDG measure, compared to the approaches based on the MMD measure and co-variance matrix measure.

The rest of this paper is organized as follows: In Sect. 2, we briefly review partial-related works on traditional domain adaption based on RKHS subspace learning and deep domain adaption-based neural network; In Sect. 3, we introduce some necessary background of second-order random variable, the related definitions of RKHS and RKHS subspace learning; In Sect. 4, we give the proof of the transformation validity of RKHS subspace, propose the MDG measure and apply it into RKHS subspace classification. In addition, the optimization problem, algorithm and computational complexity analysis of MDG measure are added in Sect. 4. In Sect. 5, the experiments show the validity of MDG measure from the aspects of classification accuracy, running time and RKHS subspace dimension stability; And the conclusion is made in Sect. 6.

2 Related work

Domain adaption [16] aims to transfer the knowledge learned from the well-labeled source domain to help the poor-labeled target domain. The domain adaption based on RKHS subspace learning [2] is the very popular among domain adaption methods, which learn a latent RKHS subspace for source and target domains to reduce their distribution difference. So, the key problem of RKHS subspace is how to measure the distribution gap of two domains. Gretton et al. [14] proposed the MMD to measure the distribution distance of two domains, which simply takes the two means of two domains in RKHS as their distributions, respectively. Currently, the MMD-based methods is the most common among the RKHS subspace learning. For instance, TCA proposed by Pan et al. [17] learned a shared and latent RKHS subspace by using the MMD to reduce distribution divergence and preserving the data properties as much as possible, where the distribution of target domain can align the source domain better. Therefore, the trained models on source domain could apply and perform well on target domain. What’s more, Pan et al. put forward a semi-supervised TCA (SSTCA) [17], which considers the label information in subspace learning. IGLDA [6] not only uses MMD to measure the distribution distance of two domains but also retains the local geometry of the labeled source domain data to unearth a suitable subspace, where the distributions could be as much as similar. In 2017, the proposed MIDA [18] reduces the distribution gap between the source domain and target domain by minimizing the MMD distance of them, in the meantime, keeps the maximum independence of the domain features. The MMD-based TIT [5] and LPJT [19] extend domain adaption into the heterogeneous domain adaption [20, 21] which handles domains with arbitrary features and dimensionalities by learning different transformations for different domains. In addition to MMD, the co-variance matrix measure based on second-order moment proposed by DACoM model is used to measure distribution gap to match the distributions of domains. And in DACoM model [8], the local geometric structure and discriminative information are preserved simultaneously.

Deep domain adaption integrates domain adaption into the neural networks to learn more transferable features, which conducts to adapt models trained on source domain to a different but related target domain. For instance, DDC [22] proposed a new CNN architecture that introduces an domain adaption layer and an additional domain adaption loss term based on MMD measure to learn domain invariant representation. Therefore, DDC improves the problem of domain shift between source domain and target domain. In order to further reduce the distribution discrepancy between source domain and target domain, DAN [23] proposed multi-kernel MMD measure(MK-MMD), and then applied MK-MMD measure into pre-trained AlexNet model. Benefiting from CNN and MK-MMD measure, the DAN is likely to learn features that work well on the target domain. In 2017, Deep CORAL [15] extended co-variance matrix measure into the deep neural network, that is, co-variance measure between the source and target feature activation’s was added as a domain adaption loss term. Joint training with co-variance loss and classification loss, Deep CORAL could enhance the transferability of feature representation. In addition to combining domain adaption and neural network for classification, Liang et al. [24] applied MK-MMD measure into CNNs and proposed a transferable reconstruction neural network for the compressed signal (CTCS), which applied MK-MMD measure to fine-tuning the pre-trained network. Therefore, the reconstruction capability on target domain signals can be achieved by only fine-tuning the network trained on source domain signals.

3 Preliminary

In this section, some related background knowledge are introduced. First of all, we give the definition of the second-order moment random variable and the necessary and sufficient condition for two second-order moment random variables to be equal. Next, we review some basic concept of RKHS. Finally, we introduce the framework of the RKHS subspace learning. The notions appeared in this paper is collected in Table 1.

Table 1 The table of symbols used in the paper

Full size table

3.1 Second-order moment random variable

Given a random variable X which obeys a distribution $p\left( x \right) $, it becomes a second-order moment random variable if the condition $E \left[ \left| X \right| ^{2} \right] =\int _{\Omega }x^{^{2}}p\left( x \right) \textrm{d} x<+\infty $ is satisfied. From a physical point of view, a second-order random variable is the limited-energy random signal. And in real life, all signals have limited energy. So, the source and target domain data in original space can be treated as the samplings from two second-order random variables with different distributions.

Assuming that a set of random variables satisfies $\left\{ X\left| E\left[ \left| X \right| ^{2} \right] <+\infty \right\} \right. $, it is called as a $L^{2}$ space that is a Hilbert space and its inner product is defined as [25, 26]

$$\begin{aligned} \left( X,Y \right) _{L^{2}}=E\left[ X{Y}^{*} \right] , \end{aligned}$$

where $\forall X,Y\in L^{2}$, the star denotes the complex conjugate, and the inner product specified by round brackets on $L^{2}$ space. Besides, the norm in $L^{2}$ is defined as [27]

$$\begin{aligned} \left\| X \right\| _{L^{2}}=\sqrt{\left( X,X \right) _{L^{2}}}. \end{aligned}$$

In light of the positive definiteness of inner product defined in Hilbert space $L^{2}$, the necessary and sufficient condition for two second-order random variables to be equal is that the mean squared error between them is zero, which can be formulated as follows (see details in “Appendix A”),

$$\begin{aligned} X_1= & {} X_2\Leftrightarrow \left\| X_1-X_2 \right\| _{L^{2}}^{2}= \left( X_1-X_2,X_1-X_2 \right) _{L^{2}}\\= & {} E\left[ \left| X_1-X_2 \right| ^{2} \right] =0. \end{aligned}$$

where $X_1$ and $X_2$ all are second-order random variables from the $L^2$ space.

3.2 Reproducing kernel Hilbert space

Similarly, the continuous square integrable function space H is given by [28]

$$\begin{aligned} H: \left\{ f \left| f:\Omega \rightarrow {\mathbb {R}},\int _{\Omega }{\left| f\left( x \right) \right| }^{2}\textrm{d}x< +\infty \right. \right\} , \end{aligned}$$

H is a Hilbert space and the inner product of H space is [9],

$$\begin{aligned} \left\langle f,g \right\rangle _{H}\ =\int _{\Omega } f\left( x \right) {g^{*}\left( x \right) }\textrm{d} x, \end{aligned}$$

where the star denotes the complex conjugate.

In particular, a Hilbert space is called a RKHS space if its kernel $k(x', x):\Omega \times \Omega $ satisfies the following [10, 25, 26]:

For $\forall f\in H $, it can be reproduced through the RKHS inner product of the function itself and the feature vector $k\left( \cdot ,x \right) :$

$$\begin{aligned} f\left( x \right) =\left\langle f, k\left( \cdot ,x \right) \right\rangle _{H}. \end{aligned}$$

from which the following also holds,

$$\begin{aligned} \left\langle k\left( \cdot , x' \right) , k\left( \cdot , x \right) \right\rangle _{H} = k\left( x', x \right) . \end{aligned}$$

3.3 The RKHS subspace learning framework

In domain adaption applications, $X_s=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s}\right\} $ and $X_t=\left\{ x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} $ are from the source and target domains respectively and obey different distributions. Domain adaption based on RKHS subspace learning tries to find a better RKHS subspace to minimize their distribution difference.

First, the kernel transformation $\varphi \left( x \right) =k\left( \cdot ,x \right) $ maps the data samples $X=X_s\cup X_t=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s},x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} =\left\{ x_1,\ldots ,x_N \right\} \subseteq \Omega $ into the RKHS space H. And the new orthogonal basis $ \vartheta _i $ of RKHS subspace $H_s$ can be constructed through linear combination of these non-orthogonal feature vectors:

$$\begin{aligned} \vartheta _i=\sum _{j=1}^{N}w_{ji}\varphi \left( x_j \right) ,\quad i=1,\ldots ,d, \end{aligned}$$

(2)

which can be cast into the matrix form as follows

$$\begin{aligned} \Theta = \Phi W, \end{aligned}$$

(3)

with

$$\begin{aligned} W= & {} \begin{bmatrix} w_{11} &{} \cdots &{} w_{1d}\\ \vdots &{} \ddots &{}\vdots \\ w_{N1}&{}\cdots &{} w_{Nd} \\ \end{bmatrix}, \quad \Theta = \begin{bmatrix} \vartheta _1&\dots&\vartheta _d \end{bmatrix}, \\ \Phi= & {} \begin{bmatrix} \varphi \left( x_1 \right)&\dots&\varphi \left( x_N \right) \end{bmatrix}. \end{aligned}$$

The orthogonality of the new basis $\Theta $ satisfies the following condition

$$\begin{aligned} \begin{bmatrix} \left\langle \vartheta _1, \theta _1\right\rangle _{H} &{} \cdots &{} \left\langle \vartheta _1,\vartheta _d \right\rangle _{H}\\ \vdots &{} \ddots &{} \vdots \\ \left\langle \vartheta _d,\vartheta _1 \right\rangle _{H} &{}\cdots &{}\left\langle \vartheta _d,\vartheta _d \right\rangle _{H}\\ \end{bmatrix} = \Theta ^T \Theta =I_{d}. \end{aligned}$$

(4)

Substituting Eq. (3) into Eq. (4), the following is obtained:

$$\begin{aligned} \begin{bmatrix} \left\langle \vartheta _1, \theta _1\right\rangle _{H} &{} \cdots &{} \left\langle \vartheta _1,\vartheta _d \right\rangle _{H}\\ \vdots &{} \ddots &{} \vdots \\ \left\langle \vartheta _d,\vartheta _1 \right\rangle _{H} &{}\cdots &{}\left\langle \vartheta _d,\vartheta _d \right\rangle _{H}\\ \end{bmatrix}=W^TKW=I_{d}, \end{aligned}$$

(5)

where K is the kernel matrix given by

$$\begin{aligned} K=\begin{bmatrix} k\left( x_1,x_1 \right) &{} \cdots &{} k\left( x_1,x_N \right) \\ \vdots &{} \ddots &{} \vdots \\ k\left( x_N,x_1\right) &{}\cdots &{} k\left( x_N,x_N \right) \end{bmatrix} \end{aligned}$$

(6)

Then, a certain domain adaption measure is used to achieve the optimal RKHS subspace $H_s$ with basis $\Theta $ characterized by its weighting matrix W.

Finally, the feature vector $\varphi (x_i)$ is projected onto the RKHS subspace $H_s$ that satisfies the constraint formula of Eq. (5). According to the subspace projection theorem in Hilbert space [27], the coordinates $y_i$ of the feature vector $\varphi (x_i)$ in the RKHS subspace basis $H_s$ with $\Theta $ is given by

$$\begin{aligned} y_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{id} \end{bmatrix} =\begin{bmatrix} \left\langle \varphi \left( x_i\right) ,\vartheta _1 \right\rangle _{H} \\ \vdots \\ \left\langle \varphi \left( x_i\right) ,\vartheta _d \right\rangle _{H}\\ \end{bmatrix}\in {\mathbb {R}}^d. \end{aligned}$$

(7)

where d is dimension of the RKHS subspace $H_s$.

4 RKHS subspace classification with MDG

In this section, we first introduce the proposed MDG measure; second, we confirm the mapping validity of RKHS subspace, that is, a second-order moment random variable in the original data space is still a second-order moment random variable when is transformed into RKHS subspace; then, we apply our MDG measure for the RKHS classification and derive its optimized formula via the LMM; at last, we analysis the algorithm of MDG-based RKHS subspace learning and its computational cost.

4.1 Minimum distribution gap

Suppose there are two second-order moment variables, that is, source domain $X_s\sim p(x)$ and target domain $X_t\sim q(x)$ where $p(x)\ne q(x)$. In order to achieve the goal of aligning the different distributions, we propose an effective MDG measure to reduce the discrepancy between $X_s$ and $X_t$, as shown in Eq. (1).

In real application, the exact joint probability density functions of $X_s$ and $X_t$ are unknown, and only the sampling data sets from $X_s$ and $X_t$ are available, namely $X_s=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s}\right\} $ and $X_t=\left\{ x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} $. So, Eq. (1) can be rewritten as:

$$\begin{aligned} \hbox{MDG}\left( X_s,X_t \right)= & {} E\left[ \left| X_{s}- X_{t} \right| ^{2} \right] \nonumber \\= & {} \int _{\Omega \times \Omega }\left\| x^s-x^t \right\| ^2p\left( x^s,x^t \right) \textrm{d}x^s\textrm{d}x^t \nonumber \\\approx & {} \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| x^s_i-x^t_j \right\| ^2, \end{aligned}$$

(8)

where $ p\left( x^s,x^t \right) $ is the joint probability density function of $X_s$ and $X_t$ and it is replaced by the uniform distribution.

4.2 The mapping validity of RKHS subspace

Here, we give a proof for the mapping validity of RKHS subspace. In other words, a second-order moment variable is still second-order moment through the transformation of RKHS subspace. And this proof is essential for the MDG measure to be further extended into the RKHS subspace.

For a second-order moment random variable $X \in \Omega $, we get a random variable Y in light of the projection theorem Eq. (7):

$$\begin{aligned} Y= \begin{bmatrix} Y_1 \\ \vdots \\ Y_d\\ \end{bmatrix}= \begin{bmatrix} \left\langle \varphi \left( X \right) ,\vartheta _1 \right\rangle _{H} \\ \vdots \\ \left\langle \varphi \left( X \right) ,\vartheta _d \right\rangle _{H} \end{bmatrix}, \end{aligned}$$

which represents the projection of $\varphi \left( X \right) $ in subspace $H_s$ with the orthogonal basis $ \vartheta _i(i =1\cdots d)$. Now, we prove Y is a second-order moment random variable by proving that each component $Y_i$ of Y meets the condition $E \left[ \left| Y_i \right| ^{2} \right] =\int _{\Omega } y_i^{^{2}}p\left( y_i \right) \textrm{d} y_i<+\infty $. In fact, we have

$$\begin{aligned} E\left[ {\left| Y_i \right| }^2 \right]&=E\left[ {\left| \left\langle \varphi \left( X \right) ,\vartheta _i \right\rangle _{H} \right| }^2 \right] \\&=\int _\Omega \left| \left\langle \varphi \left( x \right) ,\vartheta _i \right\rangle _{H} \right| ^2p\left( x \right) \textrm{d}x \\&\le \int _\Omega \left| \left\langle \varphi \left( x \right) ,\vartheta _i \right\rangle _{H} \right| ^2\textrm{d}x\\&=\int _\Omega \left| \left\langle \varphi \left( x \right) ,\sum _{j=1}^{N}\omega _{ji}\varphi \left( x _j\right) \right\rangle _{H} \right| ^2\textrm{d}x \\&=\int _\Omega \left| \sum _{j=1}^{N} \omega _{ji}\left\langle \varphi \left( x \right) ,\varphi \left( x_j \right) \right\rangle _{H} \right| ^2\textrm{d}x \\&=\int _\Omega \left| \sum _{j=1}^{N} \omega _{ji}k\left( x,x_j \right) \right| ^2\textrm{d}x \\&\le \sum _{p=1}^{N}\sum _{q=1}^{N}\left| \omega _{pi}\omega _{qi} \right| \left| \int _\Omega k\left( x,x_q \right) k\left( x,x_p \right) \textrm{d}x \right| \\&\le \sum _{p=1}^{N}\sum _{q=1}^{N} \left| \omega _{pi}\omega _{qi}\right| \sqrt{\int _\Omega k^2\left( x,x_q \right) \textrm{d}x}\\&\quad \sqrt{\int _\Omega k^2\left( x,x_p \right) \textrm{d}x} < +\infty , \end{aligned}$$

where the X follows the probability density function $0\le p\left( x \right) \le 1$ and $\varphi \left( x \right) = k \left( \cdot ,x \right) $ is an absolutely integrable function.

From the above derivation, we can make a useful conclusion that mapping second-order moment variables into RKHS subspace, these variables still are second-order moment. In light of this conclusion, we can apply the MDG measure to the domain adaption based on RKHS subspace learning.

4.3 MDG for RKHS subspace classification

In this paper, the MDG as domain adaption measure is used for the domain adaption shown in Fig. 1. Specifically, we first transform the source domain $X_s=\left\{ x_{1}^{s},\ldots , x_{n_{s}}^{s}\right\} $ and target domain $X_t=\left\{ x_{1}^{t},\ldots , x_{n_{t}}^{t}\right\} $ into the RKHS subspace $H_s$ to get ${Y}_s=\left[ y_{1}^{s},\ldots ,y_{n_s}^{s} \right] $ and ${Y}_t=\left[ y_{1}^{t},\ldots ,y_{n_t}^{t} \right] $, which represent the coordinates of the corresponding projection on the orthogonal basis $\Theta $ of subspace $H_s$. According to the proof in Sect. 4.2, ${Y_s}$ and ${Y_t}$ are second-order moment variables.

Then, we minimize the MDG between $ Y_s$ and $ Y_t$ to learn a optimal RKHS subspace $H_{s}$ so that their distributions are as close as possible. So, our goal is to minimize the following problem:

$$\begin{aligned}&{\mathop {\hbox{arg min}}\limits _{W}} \left\| {Y}_s-{Y}_t \right\| ^2 =E\left[ \left| Y_{s}- Y_{t} \right| ^{2} \right] ,\nonumber \\&\quad \hbox{s.t.} \quad W^{T}KW =I_{d}. \end{aligned}$$

(9)

With the help of Sects. 4.1 and 4.2, Eq. (9) can be derived as follows

$$\begin{aligned}{} & {} \left\| Y_s-Y_t \right\| ^2 = \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| y_i^s-y_j^t \right\| ^2 \nonumber \\{} & {} \quad = \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| W^T\left( K(:,i)-K(:,n_s+j)\right) \right\| ^2 \nonumber \\{} & {} \quad = \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\left\| W^T\varphi _{ij} \right\| ^2 \nonumber \\{} & {} \quad =\frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}tr\left( W^T\varphi _{ij}\varphi _{ij}^TW\right) \nonumber \\{} & {} \quad = \hbox{tr}\left( W^T\left( \frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\varphi _{ij}\varphi _{ij}^T \right) W \right) \nonumber \\{} & {} \quad =\hbox{tr}\left( W^T\Psi W \right) , \end{aligned}$$

(10)

where $\varphi _{ij}=K(:,i)-K(:,n_s+j) $, and

$$\begin{aligned} \Psi =\frac{1}{n_sn_t}\sum _{i=1}^{n_s}\sum _{j=1}^{n_t}\varphi _{ij}\varphi _{ij}^T. \end{aligned}$$

(11)

Finally, optimization problem of Eq. (9) reduces to

$$\begin{aligned} {\mathop {\hbox{arg min}}\limits _{W}} \hbox{tr}\left( W^{T} \Psi W\right) , \quad \hbox{s.t.}\quad W^{T}KW=I_{d}. \end{aligned}$$

(12)

4.4 Optimization problem

Next, we explain in detail how Eq. (12) is solved by LMM: Because K is a SPD matrix, it can be factorized through its eigenvalues and eigenvectors matrices as follows

$$\begin{aligned} K=U\Lambda U^{T}=U\Lambda ^{\frac{1}{2}}\Lambda ^{\frac{1}{2}} U^{T}, \end{aligned}$$

(13)

where $UU^{T}=I$ and $\Lambda $ is a diagonal matrix.

Denoting $\Lambda ^{\frac{1}{2}} U^{T}$ by L, the following are obtained

$$\begin{aligned} K=L^{T}L,\quad W^{T}KW= W^{T}L^{T}LW=I_{d}, \end{aligned}$$

(14)

from which the matrix trace of Eq. (12) is given by

$$\begin{aligned} tr\left( W^{T} \Psi W\right) =tr\left( W^{T}L^{T}\left( L^{T} \right) ^{-1} \Psi L^{-1}LW\right) . \end{aligned}$$

Now, denoting $G=LW$, the optimization problem of Eq. (12) is transformed into

$$\begin{aligned} {\mathop {\hbox{arg\;min}}\limits _{G}} tr\left( G^{T} A G\right) ,\quad \hbox{s.t.}\quad G^{T}G=I_{d}, \end{aligned}$$

(15)

with

$$\begin{aligned} A=\left( L^{T} \right) ^{-1} \Psi L^{-1}, \end{aligned}$$

(16)

which can be solved through the LMM with the Lagrangian function given by [29]

$$\begin{aligned} L\left( G,Z \right) =\hbox{tr}\left( G^{T}AG \right) -tr\left( \left( G^{T}G-I_{d} \right) Z \right) , \end{aligned}$$

(17)

where Z is a symmetric matrix and $z_{ij}$ are Lagrange multipliers.

Equation (17) can be solved as follows

$$\begin{aligned} \left\{ \begin{array}{ll} \frac{\partial L\left( G,Z \right) }{\partial G}=2AG-2GZ=0\\ \\ \frac{\partial L\left( G,Z \right) }{\partial Z}=\left( G^{T}G-I_{d}\right) ^{T}=0 \end{array}\right. \Rightarrow \left\{ \begin{array}{ll} AG= GZ;\\ G^{T}G=I_{d}. \end{array}\right. \end{aligned}$$

(18)

When Z is a diagonal matrix, if G is the eigenvectors matrix of A, then Eq. (18) is satisfied and the minimization problem of Eq. (15) can be achieved by selecting the smallest d eigenvalues and the corresponding eigenvectors.

When Z is not a diagonal matrix but symmetric, it can be factorized in terms of its eigenvalues matrix $\Sigma $ and eigenvectors matrix V as follows,

$$\begin{aligned} Z = V \Sigma V^T. \end{aligned}$$

(19)

Substituting Eq. (19) into Eq. (18), the following is obtained,

$$\begin{aligned} \left\{ \begin{matrix} AG= GV \Sigma V^T \Rightarrow AGV= GV \Sigma \Rightarrow A\tilde{G}= \tilde{G} \Sigma ,\\ G^{T}G=I_{d} \Rightarrow V^T G^{T} G V =I_{d} \Rightarrow \tilde{G}^{T} \tilde{G} =I_{d}, \end{matrix}\right. \end{aligned}$$

(20)

where $\tilde{G} = GV$ and we have used the orthogonality relation of the eigenvectors matrix $V^TV = I_d$.

It’s clear that the eigenvalues matrix $\Sigma $ is a diagonal matrix and the minimization problem of Eq. (15) can be achieved when $\tilde{G}$ is formed with d smallest eigenvectors of A.

According to the above analysis, the minimization problem of Eq. (15) can be achieved by selecting the smallest d eigenvectors of A.

Finally, we can get the optimized weighting matrix W of the original optimization problem of Eq. (12) as follows

$$\begin{aligned} W = L^{-1}G=\left( \Lambda ^{\frac{1}{2}} U^{T}\right) ^{-1}G=U\Lambda ^{-\frac{1}{2}}G. \end{aligned}$$

(21)

4.5 Algorithm of RKHS classification with MDG

The procedure for the solution of MDG is summarized in Algorithm 1, which is explained as follows:

The input of the algorithm are samples from $X_s$ and $X_t$, the kernel function $k(\cdot , \cdot )$, and the RKHS subspace dimension d; and the output of the algorithm are the weighting matrix W that characterizes the orthogonal basis $\Theta $ of the RKHS subspace.

The algorithm takes the samples from both $X_s$ and $X_t$ to form the joint RKHS subspace kernel matrix K; then its eigenvectors matrix U is calculated; after that, the intermediate matrix L and A are calculated from Eqs. (14) and (16), respectively; and finally, the weighting matrix W that characterizes the orthogonal basis of the RKHS subspace is obtained by selecting the d smallest eigenvectors of the intermediate matrix A according to Eq. (21).

After having the weighting matrix W that characterizes the orthogonal basis of the RKHS subspace, the unknown labels of instances $X_t$ can be obtained as follows:

1.
The data samples set $X = X_s \cup X_t$ in $H_s$ can be projected to the RKHS subspace as $Y=W^TK$: the samples set from source domain $X_s$ are projected to get $Y_s = Y\left( :,n_s \right) $, and the data samples set $X_t$ are projected to get $Y_t = Y\left( :,n_s+1:n_s+n_t \right) $;
2.
Train the classifier with the projection samples $Y_s$;
3.
Use the trained classifier to label the projection samples $Y_t$.

4.6 Computational complexity

According to the Algorithm 1, the computation costs of our MDG-based RKHS subspace learning consists of three major parts:

1.
The Construction the kernel matrix K in step 3, and it costs $ {\mathcal {O}} \left( mn^2 \right) $ for computing (m is the dimension of samples)
2.
The construction of matrix $\Psi $ in step 3, and it costs $ {\mathcal {O}} \left( n_{s}n_{t}n^2 \right) $ for computing ($n=n_{s}+n_{t}$)
3.
The optimization of coefficient matrix W in step 4 and step 7, which costs $ {\mathcal {O}} \left( d n^2 \right) $

So, the overall computational complexity of Algorithm 1 would be $ {\mathcal {O}} \left( mn^2+n_{s}n_{t}n^2+dn^2 \right) $.

5 Experiments

In this section, we conduct two kinds of experiments to verify the classification effectiveness of our MDG measure: one is the comparison with the MMD and co-variance measures; the another is to apply our MDG measure into the four domain adaption algorithms to replace their original distribution discrepancy measures to evaluate MDG measure’s performance. In addition, we conduct the experiment to verify the insensitivity of MDG measure to RKHS subspace dimension.

5.1 The real-world datasets

We assess the performance of the proposed MDG measure on four popular datasets: Office-Caltech10 dataset, handwritten digits dataset, text dataset and VLSIC dataset. The data that support the findings of this study are available from this website.^{Footnote 1} Next, the four datasets are introduced, respectively.

1.
Office-Caltech10 dataset. Office-Caltech10 dataset consists of four domains: Amazon (A, collected from Amazon), DSLR (D, shot by SLR camera), webcam (W, collected by webcam), and Caltech (C, collected by Caltech) [30]. Each domain contains 10 classes, such as backpack, monitor, headphone and so on. Examples of headphones from A, D, W, and C domains are shown in Fig. 2. And each domain is used as source domain and target domain repeatedly.
2.
Text dataset. This dataset comes from Reuters-21,578 dataset^{Footnote 2} including 21,578 documents and 672 categories. In fact, we use a pre-processed dataset, which are divided into three categories: orgs, places, and people, with each category containing two sub-classes [6]. We regard these three categories as three domains and select randomly two categories as source and target domain respectively on each classification task.
3.
Handwritten digits dataset. The handwritten digits dataset consists of MNIST^{Footnote 3} and USPS^{Footnote 4} dataset with different distributions, which include handwritten 10 digits from 0 to 9. MNIST dataset contains 70,000 sheets of 28 $\times $ 28 gray images, and the USPS dataset contains 11,000 sheets of 16 $\times $ 16 gray images. Since the large amount of samples in this dataset and the limited processing power of our device, the subset of handwritten digits dataset are used in following experiments, which consists of 2000 images from MNIST and 1800 images from USPS that are all randomly selected. Then, some data preparation are done for this subset, which contains the uniformly scaling these gray images to $16 \times 16$ images, and then flattening each image into 256 dimensional vector. Some examples of the handwritten digits dataset are shown in Fig. 3. And MNIST and USPS dataset are taken as source and target domain by turns.
4.
VLSIC dataset. The VLSIC dataset consists of 5 domains: VOC2007(V), LabelMe(L), SUN09(S), ImageNet(I) and Caltech101(C) from different distributions. Since the original data have very high dimension, we firstly applied PCA [31] to reduce the dimension of original data from 4096 into 300. And then we selected the 5 classes shared by the five domains to conduct the experiments.

5.2 The comparison with MMD and co-variance measures

In this subsection, we conduct the comparison with MMD and co-variance measures on the above four dataset. Specially, the MMD measure is the most popular among domain adaption algorithms, and co-variance measure has recently been used in domain adaption algorithms [8, 15]. For simplicity, co-variance measure be denoted as cov in Tables 2, 3, 4 and 5. In this subsection experiments, the used parameters are set up to:

1.
The Gaussian Radial Basis function (RBF) kernel is chosen as the reproducing kernel of RKHS [9]: $k\left( x_1,x_2 \right) =e^{-\frac{{\left\| x_1-x_2 \right\| }^2}{2\delta ^2}}, \delta =10$.
2.
The dimension of the RKHS subspace $H_{s}$ has been set to $d=30$ for the handwritten digits dataset and $d=100$ for the other three datasets.
3.
k-Nearest Neighbor method (knn) [32] is used for classification, and experiments are carried out on $k = [ 1, 3, 5, 7]$. The calculation of classification accuracy is as followed:
$$\begin{aligned} \hbox{accuracy} = \frac{\sum _{x_{t}\in X_t}\left\{ x_{t}\in X_t\cap \hbox{knn}\left\{ x_t \right\} =label\left\{ x_t \right\} \right\} }{\hbox{num}\left\{ X_t \right\} }, \end{aligned}$$
where $X_t$ is target domain samples set and $\hbox{num}\left\{ X_t \right\} $ is the number of samples in $X_t$, $\hbox{knn}\left\{ x_t \right\} $ is the label predicted by knn method for a target data $x_t$ and label $\left\{ x_t \right\} $ is the ground truth label of $x_t$.
4.
The number of iterations of the co-variance domain adaption measure is set to 10.

Table 2 Classification accuracy comparison of the Office-Caltech10 dataset

Full size table

Table 3 Classification accuracy comparison of the text dataset

Full size table

Table 4 Accuracy comparison of the handwritten digits dataset

Full size table

Table 5 Accuracy comparison of the VLSIC dataset

Full size table

And the specific classification task arrangement are as followed:

1.
Office-Caltech10 dataset classification. According to IGLDA [6], the SURF (Speed Up Robust Features) [33] of the dataset are first extracted; then the features are normalized and z-scored so that their means are zero and the standard deviations are set up to one. In total, we carried out six tasks: $A \rightarrow C$, $A \rightarrow D$, $C \rightarrow A$, $D \rightarrow A$, $D \rightarrow C$, $D \rightarrow W$, $W \rightarrow A$, $W\rightarrow C$. In detail, $A \rightarrow C$ means that Amazon is the source domain and Caltech is the target domain.
2.
Text dataset classification. We set up only one classification task: orgs $\rightarrow $ places.
3.
Handwritten digits dataset classification. For the handwritten digits dataset, we set two tasks, MNIST $\rightarrow $ USPS and USPS $\rightarrow $ MNIST, where MNIST $\rightarrow $ USPS means that the MNIST dataset are selected as the source domain and USPS dataset are target domain.
4.
VLSIC dataset classification. Six tasks are set up on this dataset: $C \rightarrow L$, $C\rightarrow S$, $C \rightarrow V$, $I \rightarrow C$, $I \rightarrow V$, $V \rightarrow L$.

5.3 Comparisons with state-of-the-art domain adaption algorithms

In this subsection, we compare the proposed MDG measure with TIT [5], IGLDA [6], LPJT [19] and MIDA [18] algorithms in the literature to show its performance. It’s noted that the above algorithms consist of not only domain adaption measures, but also other regularization terms to ensure the classification performance. For the objective assessment of our MDG, we replace the domain adaption measure used in each algorithm with the MDG measure so that we get four nearly-new domain adaption algorithms, namely, TIT_MDG, IGLDA_MDG, LPJT_MDG and MIDA_MDG. For example, TIT_MDG is obtained by replacing the domain adaption measure used in the TIT algorithm with MDG method, and the original regularization of TIT algorithm remains unchanged. And IGLDA_MDG, MIDA_MDG and LPJT_MDG are generated alike. We totally have four comparison tasks: TIT vs TIT_MDG, IGLDA vs IGLDA_MDG, MIDA vs MIDA_MDG and LPJT vs LPJT_MDG. Since the four original algorithms all apply SVM to classify, the very common SVM classifiers with different kernels are used to classify the target domain samples. Besides, the dimension of RKHS subspace is 100.

1.
TIT versus TIT_MDG. In this experiment, the Office-Caltech10 dataset are used and twelve tasks are set up totally. We randomly select two domains samples for each task, and one is as source domain and the other as target domain. In addition, we use the SVM classifier with RBF kernel.
2.
IGLDA versus IGLDA_MDG. We conduct 12 tasks on Office-Caltech10 dataset to compare the IGLDA_MDG with IGLDA, and tasks setup are as above. In addition, the SVM classifier based on linear kernel is used.
3.
LPJT versus LPJT_MDG. In this experiment, we conduct two tasks on handwritten digits dataset, that is, MNIST$\rightarrow $USPS and USPS $\rightarrow $ MNIST. And we select the RBF kernel-based SVM classifier to classify.
4.
MIDA versus MIDA_MDG. Here, we compare the combined algorithm MIDA_MDG with MIDA to verify the effectiveness of our MDG measure on handwritten digits dataset. And SVM classifier based on linear kernel is used.

5.4 Classification results

Under the experiment setting of Sects. 5.2 and 5.3, we get the all classification results and report them in Tables 2, 3, 4, 5, 6, 7, 8 and 9 and the best result in each classification task is bolded for convenience.

Table 6 Accuracy comparison of the Office-Caltech10: TIT versus TIT_MDG

Full size table

Table 7 Accuracy comparison of the Office-Caltech10: IGLDA versus IGLDA_MDG

Full size table

Table 8 Accuracy comparison of the handwritten digits dataset: LPJT versus LPJT_MDG

Full size table

Table 9 Accuracy comparison of the Text dataset: MIDA vs MIDA_MDG

Full size table

The classification results of Sect. 5.2 on the four real-world dataset are all collected in Tables 2, 3 and 4 and visualize with Figs. 4, 5, 6 and 7, respectively. Among these domain adaption measures, the MDG measure works much better than the MMD measure and the un-optimized co-variance measure, which delivers more decent classification accuracy. It show that MDG measure can learn a good common subspace for source and target domain samples, where their distributions match better than in the subspace learned by MMD or co-variance. For 80% tasks of this subsection, co-variance measure achieves higher classification accuracy than MMD measure. The reason is that MMD measure use the first-order moment statistical information of domain, while the second-order statistical information are used in co-variance. And the generally low classification accuracy of tasks in the subsection is due to the fact that the domain adaption measure only focus on global information—inter-domain distribution difference, but ignores the local information such as intra-class distance within domain, the local geometric structure, and discriminative information [5, 6, 8, 11, 16, 18, 19]. So, current domain adaption algorithms all consider the global and local information at the same time. However, since the innovation of this paper is to propose a neat and effective MDG measure to align the different distributions, the local information is not considered for the time being.

Tables 6, 7, 8 and 9 show that the proposed MDG works well on the four dataset, which transforms the source and target data into a great latent RKHS subspace where the distribution gap is smaller than the original algorithms so that it enhances the ability to classify.

In addition, we compared their running speed on the domain adaption task orgs $\rightarrow $ places. The codes of MMD, co-variance and MDG were written in MATLAB R2018a, and no parallel computing was used. The running times of MMD, co-variance and MDG were 0.5 s, 96.3 s and 29.1 s, respectively. Although the running time of our measure is not the shortest, it is acceptable compared with the co-variance method. And considering the classification results, MDG is more practical to use than MMD and co-variance measures.

5.5 RKHS subspace dimension sensitivity analysis

The RKHS is an infinite linear space, so its subspace dimension can be arbitrary or even infinite. Therefore, it is difficult or even impossible to realize RKHS subspace learning by computer. For domain adaption methods based on RKHS subspace learning, the subspace is constructed by the linear combination of the transformed samples in RKHS. According to Sect. 3.3, the upper limit of the subspace dimension d is the rank of the kernel matrix K, that is N. In practice, d is often selected adaptively according to the input data.

We perform the experiment on tuning $d (d<N)$ to show that the proposed RKHS subspace learning based on MDG measure is robust on the parameter d, namely the classification accuracy remains stable when d changes over a large range. Keeping other parameters unchanged, we constantly adjust the dimension d of the subspace from 350 to 60, and conduct a classification task every 10 dimension on orgs $\rightarrow $ places. And the results of different d is showed in Fig. 8

From Fig. 8, we can see that the classification results remain robust even d changes over a large range.

6 Conclusion

In this paper, we study a neat and effective MDG measure for RKHS subspace domain adaption classification problem. The MDG measure optimizes the RKHS subspace, where distribution difference between the source-domain data and the target-domain data are as small as possible. Compared to the first-order moment MMD measure and the second-order moment co-variance, the MDG measure has the advantage of capturing the higher-order moments of the distribution. Also, compared to the complicated co-variance measure, it has the advantage of easy to use and can be optimized analytically: rigorous mathematical formula has been derived for the weighting matrix of the optimized orthogonal Hilbert subspace basis, via the LMM optimization. At last, extensive experiments with four image dataset have been carried out. Comparisons with other four state-of-the-art domain adaption algorithms in the literature with both the MMD and co-variance measures show that the RKHS subspace based on MDG measure approach does achieve better classification performance in general.

And according to Sect. 2, some recent works have applied MMD and co-variance into deep neural network as additional loss term to enhance the transferability of feature representation. Hence, in our future work, we will consider extending MDG measure into the deep learning architecture.

Notes

References

Bruzzone L, Marconcini M (2010) Domain adaptation problems: a DASVM classification technique and a circular validation strategy. IEEE Trans Pattern Anal Mach Intell 32(5):770–787. https://doi.org/10.1109/TPAMI.2009.57
Article Google Scholar
Gopalan R, Li R, Chellappa R (2014) Unsupervised adaptation across domain shifts by generating intermediate data representations. IEEE Trans Pattern Anal Mach Intell 36(11):2288–2302. https://doi.org/10.1109/TPAMI.2013.249
Article Google Scholar
Zhang Y, Deng B, Tang H, Zhang L, Jia K (2020) Unsupervised multi-class domain adaptation: theory, algorithms, and practice. IEEE Trans Pattern Anal Mach Intell https://doi.org/10.1109/TPAMI.2020.3036956
Chen B, Lam W, Tsang IW, Wong TL (2013) Discovering low-rank shared concept space for adapting text mining models. IEEE Trans Pattern Anal Mach Intell 35(6):1284–1297. https://doi.org/10.1109/TPAMI.2012.243
Article Google Scholar
Li J, Lu K, Huang Z, Zhu L, Shen HT (2019) Transfer independently together: a generalized framework for domain adaptation. IEEE Trans Cybern 49(6):2144–2155. https://doi.org/10.1109/TCYB.2018.2820174
Article Google Scholar
Jiang M, Huang W, Huang Z, Yen GG (2017) Integration of global and local metrics for domain adaptation learning via dimensionality reduction. IEEE Trans Cybern 47(1):38–51. https://doi.org/10.1109/TCYB.2015.2502483
Article Google Scholar
Pan SJ, Tsang IW, Kwok JT, Yang Q (2011) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199–210. https://doi.org/10.1109/TNN.2010.2091281
Article Google Scholar
Li L, Zhang Z (2019) Semi-supervised domain adaptation by covariance matching. IEEE Trans Pattern Anal Mach Intell 41(11):2724–2739. https://doi.org/10.1109/TPAMI.2018.2866846
Article Google Scholar
Steinwart I, Hush D, Scovel C (2006) An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels. IEEE Trans Inf Theory 52(10):4635–4643. https://doi.org/10.1109/TIT.2006.88171
Article MathSciNet MATH Google Scholar
Zhang Z, Wang M, Nehorai A (2020) Optimal transport in reproducing kernel Hilbert spaces: theory and applications. IEEE Trans Pattern Anal Mach Intell 42(7):1741–1754. https://doi.org/10.1109/TPAMI.2019.2903050
Article Google Scholar
Deng WY, Lendasse A, Ong YS, Tsang IWH, Chen L, Zheng QH (2019) Domain adaption via feature selection on explicit feature map. IEEE Trans Neural Netw Learn Syst 30(4):1180–1190. https://doi.org/10.1109/TNNLS.2018.2863240
Article MathSciNet Google Scholar
Feng Y, Yuan Y, Lu X (2021) Person reidentification via unsupervised cross-view metric learning. IEEE Trans Cybern 51(4):1849–1859. https://doi.org/10.1109/TCYB.2019.2909480
Article Google Scholar
Tao D, Jin L, Wang Y, Li X (2015) Person reidentification by minimum classification error-based KISS metric learning. IEEE Trans Cybern 45(2):242–252. https://doi.org/10.1109/TCYB.2014.2323992
Article Google Scholar
Gretton A, Borgwardt K, Rasch M, Schölkopf B, Smola A (2006) A kernel method for the two-sample-problem. Adv Neural Inf Process Syst 19:513–520
MATH Google Scholar
Sun B, Saenko K (2016) Deep coral: correlation alignment for deep domain adaptation. In: European conference on computer vision. Springer, pp 443–450
Pan SJ, Yang Q (2010) A survey on transfer learning. IEEE Trans Knowl Data Eng 22(10):1345–1359. https://doi.org/10.1109/TKDE.2009.191
Article Google Scholar
Pan SJ, Tsang IW, Kwok JT, Yang Q (2011) Domain adaptation via transfer component analysis. IEEE Trans Neural Netw 22(2):199–210. https://doi.org/10.1109/TNN.2010.2091281
Article Google Scholar
Yan K, Kou L, Zhang D (2018) Learning domain-invariant subspace using domain features and independence maximization. IEEE Trans Cybern 48(1):288–299. https://doi.org/10.1109/TCYB.2016.2633306
Article Google Scholar
Li J, Jing M, Lu K, Zhu L, Shen HT (2019) Locality preserving joint transfer for domain adaptation. IEEE Trans Image Process 28(12):6103–6115. https://doi.org/10.1109/TIP.2019.2924174
Article MathSciNet MATH Google Scholar
Tsai YHH, Yeh YR, Wang YCF (2016) Learning cross-domain landmarks for heterogeneous domain adaptation. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), pp 5081–5090
Li J, Lu K, Huang Z, Zhu L, Shen HT (2019) Heterogeneous domain adaptation through progressive alignment. IEEE Trans Neural Netw Learn Syst 30(5):1381–1391. https://doi.org/10.1109/TNNLS.2018.2868854
Article MathSciNet Google Scholar
Tzeng E, Hoffman J, Zhang N, Saenko K, Darrell T (2014) Deep domain confusion: maximizing for domain invariance. arXiv:1412.3474
Long M, Cao Y, Wang J, Jordan M (2015) Learning transferable features with deep adaptation networks. In: International conference on machine learning. PMLR, pp 97–105
Liang J, Li L, Zhao C (2021) A transfer learning approach for compressed sensing in 6G-IoT. IEEE Internet Things J 8(20):15276–15283
Article Google Scholar
Saitoh S, Sawano Y (eds) (2016) Theory of reproducing kernels and applications. Springer, Singapore
MATH Google Scholar
Gori F, Martínez-Herrero R (2021) Reproducing kernel Hilbert spaces for wave optics: tutorial. JOSA A 38(5):737–748
Article Google Scholar
Yosida K et al (1965) Functional analysis. Springer, Berlin
Book MATH Google Scholar
Paulsen VI, Raghupathi M (2016) An introduction to the theory of reproducing kernel Hilbert spaces. Cambridge University Press, Cambridge
Book MATH Google Scholar
Fukunaga K (2013) Introduction to statistical pattern recognition. Elsevier, Amsterdam
MATH Google Scholar
Gong B, Shi Y, Sha F, Grauman K (2012) Geodesic flow kernel for unsupervised domain adaptation. In: IEEE conference on computer vision and pattern recognition, vol 2012. IEEE, pp 2066–2073
Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2(1–3):37–52
Article Google Scholar
Peterson LE (2009) K-nearest neighbor. Scholarpedia 4(2):1883
Article Google Scholar
Bay H, Tuytelaars T, Van Gool L (2006) Surf: speeded up robust features. In: European conference on computer vision. Springer, pp 404–417

Download references

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Sun Yat-Sen University, Guangzhou, China
Yanzhen Qiu, Chuangfeng Zhang, Chenkui Xiong, Zhengming Ma & Shaolin Liao

Authors

Yanzhen Qiu
View author publications
You can also search for this author in PubMed Google Scholar
Chuangfeng Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Chenkui Xiong
View author publications
You can also search for this author in PubMed Google Scholar
Zhengming Ma
View author publications
You can also search for this author in PubMed Google Scholar
Shaolin Liao
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Zhengming Ma or Shaolin Liao.

Ethics declarations

Conflict of interest

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Identical random variables

Two second-order moment random variables are identical if and only if their statistical mean square error is zero,

$$\begin{aligned} E\left[ \left| Y-Y' \right| ^{2}\right] =0 \Leftrightarrow Y = Y'. \end{aligned}$$

(22)

To demonstrate this, the variance of Eq. (22) can be expressed as follows

$$\begin{aligned} E\left[ \left| Y-Y' \right| ^{2}\right] = \int \int _{\Omega (Y, Y')} \left( y-y'\right) ^2 p(y, y') dydy'. \end{aligned}$$

(23)

Because both $\left( y-y'\right) ^2$ and $ p(y, y')$ are semi-definite or non-negative, Eq. (23) is zero when one of the following two conditions are met for all points in the probability domain $\Omega $,

$$\begin{aligned} \left\{ \begin{matrix} \left( y-y'\right) ^2 = 0; \\ p(y, y') = 0. \end{matrix} \right. \end{aligned}$$

(24)

It can be shown that Eq. (24) is equivalent to the joint probability $p(y, y') = f(y) \delta (y-y')$,

$$\begin{aligned}{} & {} p(y) = \int _{y'} p(y, y') dy' = \int _{y'} f(y) \delta (y-y') dy' = f(y), \nonumber \\{} & {} p(y') = \int _{y} p(y, y')dy = \int _{y} f(y) \delta (y-y')dy = f(y), \end{aligned}$$

(25)

from which the marginal probabilities of Y and $Y'$ are identical and Eq. 22 is proved.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Qiu, Y., Zhang, C., Xiong, C. et al. RKHS subspace domain adaption via minimum distribution gap. Pattern Anal Applic 26, 1425–1439 (2023). https://doi.org/10.1007/s10044-023-01170-y

Download citation

Received: 24 April 2022
Accepted: 19 April 2023
Published: 17 May 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s10044-023-01170-y

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

RKHS subspace domain adaption via minimum distribution gap

Abstract

Similar content being viewed by others

Domain adaption based on source dictionary regularized RKHS subspace learning

Transfer subspace learning joint low-rank representation and feature selection

Low-rank representation-based regularized subspace learning method for unsupervised domain adaptation

Explore related subjects

1 Introduction

2 Related work

3 Preliminary

3.1 Second-order moment random variable

3.2 Reproducing kernel Hilbert space

3.3 The RKHS subspace learning framework

4 RKHS subspace classification with MDG

4.1 Minimum distribution gap

4.2 The mapping validity of RKHS subspace

4.3 MDG for RKHS subspace classification

4.4 Optimization problem

4.5 Algorithm of RKHS classification with MDG

4.6 Computational complexity

5 Experiments

5.1 The real-world datasets

5.2 The comparison with MMD and co-variance measures

5.3 Comparisons with state-of-the-art domain adaption algorithms

5.4 Classification results

5.5 RKHS subspace dimension sensitivity analysis

6 Conclusion

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix A: Identical random variables

Appendix A: Identical random variables

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation