
3.1 Introduction

3.1.1 Background

Recently, transfer learning has been studied extensively for different applications [1], such as text classification and indoor WiFi location estimation. Referring to Fig. 3.1 and the explanations given in Table 3.1, transfer learning is an approach that obtains an effective model of the data from the target domain by leveraging the useful information from the source domains during the learning procedure.

Fig. 3.1 An illustration of transfer learning for regression

Situations requiring transfer learning are becoming common in real-world applications. The modeling of a fermentation process [2] is one example where transfer learning is required. In the target domain of a microbiological fermentation process, the collected data may be insufficient, or some of the data may be missing due to deficiencies in the sensor setup. Thus, the fermentation process in this domain cannot be effectively modeled with the collected data alone. However, the data available from other similar microbiological fermentation processes may be sufficient, and these processes can be regarded as source domains for the target domain. In this case, transfer learning is an effective solution to the modeling task because it can enhance the model by leveraging the information available from the source domains, such as the data collected in other time frames or with other setups, thereby resulting in a model with better generalization capability.

Table 3.1 Some terms used for transfer learning in the text

A comprehensive survey of transfer learning can be found in [1]. In general, the existing work on transfer learning can be categorized into three types: (1) transfer learning for classification [3–15]; (2) transfer learning for unsupervised learning (clustering [16, 17] and dimensionality reduction [18, 19]); and (3) transfer learning for regression [20–24]. According to whether labeled data are available in the target domain, transfer learning methods can also be classified into inductive transfer learning methods and transductive transfer learning methods: in the inductive setting, a few labeled data are available in the target domain for supervised learning, whereas in the transductive setting all the target-domain data are unlabeled and unsupervised learning is implemented accordingly. Among the existing transfer learning methods, many are based on support vector learning. In this chapter, we mainly focus on novel support vector learning-based methods for inductive and transductive transfer learning.

3.1.2 Support Vector Learning-Based Inductive Transfer Learning

In the inductive transfer learning setting, the target task is different from the source task. In this case, some labeled data in the target domain are required to induce an objective predictive model for use in the target domain. The representative inductive transfer learning algorithms are reviewed below. Dai et al. [25] proposed TrAdaBoost, a boosting algorithm with the support vector machine (SVM) as the base learner and an extension of the AdaBoost algorithm, to address inductive transfer learning problems. TrAdaBoost iteratively reweights the source domain data to reduce the effect of the "bad" source data while encouraging the "good" source data to contribute more to the target domain. Wu and Dietterich [26] integrated the source domain (auxiliary) data into the SVM framework to improve classification performance. Evgeniou and Pontil [27] borrowed the idea of hierarchical Bayesian modeling for SVM-based multitask learning. Their method assumes that the parameter vector w of the SVM for each task can be separated into two terms: one common to all tasks and one specific to each task.
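The reweighting scheme at the heart of TrAdaBoost can be illustrated with a short sketch. This is our own minimal rendering, not the implementation of [25]: the function name and the inputs (per-sample absolute errors in [0, 1]) are assumptions; the update bases β and β_t follow the standard TrAdaBoost formulation.

```python
import numpy as np

def tradaboost_reweight(w_src, w_tgt, err_src, err_tgt, n_iters):
    """One TrAdaBoost-style reweighting step (sketch).

    w_src, w_tgt : current sample weights for source / target data
    err_src, err_tgt : per-sample absolute errors |h(x) - y| in [0, 1]
    n_iters : total number of boosting iterations
    """
    n = len(w_src)
    # weighted error of the current learner, measured on the target data only
    eps_t = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)
    eps_t = min(eps_t, 0.499)                       # keep beta_t well defined
    beta_t = eps_t / (1.0 - eps_t)                  # target update base (< 1)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / n_iters))  # source update base
    # "bad" source samples (large error) are down-weighted ...
    w_src_new = w_src * beta ** err_src
    # ... while misclassified target samples are up-weighted
    w_tgt_new = w_tgt * beta_t ** (-err_tgt)
    return w_src_new, w_tgt_new
```

After each step, source samples that the current hypothesis misfits lose influence, while hard target samples gain influence, which is exactly the "good"/"bad" source data trade-off described above.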

3.1.3 Support Vector Learning-Based Transductive Transfer Learning

For support vector learning-based transductive transfer learning, a major computational problem is how to reduce the difference between the distributions of the source and target domains. Several works have described how to measure the distance between distributions [28, 29]. Intuitively, discovering a good feature representation across domains is crucial [13, 30]. A good feature representation should reduce the distribution discrepancy between the two domains as much as possible while, at the same time, preserving the underlying geometric structures (or scatter information) of both the source and target domain data. Ben-David et al. [31] used an example of hyperplane classifiers to show that the performance of the hyperplane classifier that best separates the data provides a good means of measuring the distribution distance under different data representations. Along the same lines, Gretton et al. [32] showed that, for a given class of functions, this measure can be simplified by computing the discrepancy between the two means of the distributions in a reproducing kernel Hilbert space (RKHS), thus resulting in the maximum mean discrepancy (MMD) measure. Inspired by the ideas of both the transductive SVM (TSVM) and MMD, Quanz et al. [28] proposed the so-called large margin kernel projected (LMPROJ) TSVM paradigm for domain adaptation problems, based on the projected distance measure in an RKHS. The basic idea of LMPROJ is to minimize the distribution mean distance between source and target domain data by finding a feature translation in an RKHS. In a similar way to LMPROJ, but based on the multiple kernel learning framework, Duan et al. proposed a domain transfer SVM (DTSVM) for domain adaptation learning (DAL) problems such as video concept detection. Further details about DTSVM can be found in [29].
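As a concrete illustration of the MMD measure just described, the following sketch computes the biased empirical estimate of the squared MMD between two samples under a Gaussian RBF kernel, i.e., the squared distance between the kernel mean embeddings of the two empirical distributions. The function names and the fixed kernel width are our assumptions, not details from the cited works.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed for all pairs
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def mmd2(X_src, X_tgt, gamma=1.0):
    """Biased empirical estimate of the squared MMD between two samples:
    the squared RKHS distance between their kernel mean embeddings."""
    Kss = rbf_kernel(X_src, X_src, gamma)
    Ktt = rbf_kernel(X_tgt, X_tgt, gamma)
    Kst = rbf_kernel(X_src, X_tgt, gamma)
    return Kss.mean() + Ktt.mean() - 2 * Kst.mean()
```

Methods such as LMPROJ and DTSVM penalize exactly this kind of mean-embedding distance between the source and target samples while training the classifier.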

3.1.4 Main Work in This Study

In this study, a support vector learning-based inductive transfer learning approach and a support vector learning-based transductive transfer learning approach are proposed.

3.1.4.1 Support Vector Learning-Based Inductive Transfer Learning with Knowledge-Leveraged Fuzzy Logic Systems

As support vector learning-based fuzzy system modeling is an important type of modeling method [2, 33], it is promising to incorporate transfer learning into the fuzzy model. To the best of our knowledge, however, transfer learning for support vector learning-based fuzzy system modeling has not been reported before. For support vector learning-based fuzzy system modeling, transfer learning is very useful in real-world modeling tasks where traditional fuzzy modeling methods may not work well. For example, the trained fuzzy systems are much weaker in generalization capability when the training data are insufficient or only partially available [34, 35]. This situation is common in real-world applications, where the sensors and setups for data sampling are not steady due to noisy environments or other malfunctions, leading to insufficient data for the modeling task.

In order to tackle the problems of traditional support vector learning-based fuzzy system modeling described above, a feasible remedy is to boost the performance by taking advantage of useful information from source domains (or related domains), which can be the data in those domains or relevant knowledge such as the density distribution and/or fuzzy rules. The simplest way to obtain information from source domains is to directly use the data collected from them, but this approach faces two major challenges. First, due to the necessity of privacy protection in some proprietary applications, such as the aforementioned fermentation process, the data of the source domains cannot always be obtained. In this situation, knowledge about the source domains, e.g., the density distribution and model parameters, can be obtained more easily to enhance the modeling of the target domain. Second, a drifting phenomenon may exist between the source domain and the target domain, which makes it inappropriate to directly use the data from the former in the latter; otherwise, a negative effect on the modeling task will be produced. These two issues should be properly addressed in order to develop an effective transfer learning modeling strategy for fuzzy systems.

In this study, a support vector learning-based fuzzy system modeling approach with a knowledge-leverage capability from source domains is exploited for inductive transfer learning. In view of its popularity, the Takagi–Sugeno–Kang-type fuzzy system (TSK-FS) is chosen to incorporate the knowledge-leverage mechanism, and hence the knowledge-leveraged TSK-type fuzzy system (KL-TSK-FS) is proposed. A novel objective criterion is proposed to integrate the model knowledge of the source domains and the data of the target domain, and the fuzzy rules of the model are learned accordingly. The knowledge of the source domain effectively makes up for the deficiency in learning caused by the lack of data in the target domain. Hence, the proposed KL-TSK-FS is better adapted to situations where the data are only partially available from the target domain while some useful knowledge of the source domains is available. Besides, the proposed method is distinctive in preserving data privacy, as only the knowledge (e.g., the corresponding model parameters) rather than the data of the source domain is used.

3.1.4.2 Support Vector Learning-Based Transductive Transfer Learning with DAL

As is well known, the mean (or expectation) and variance (or scatter) are two main features characterizing the distribution of samples, measuring its first-order and second-order statistics, respectively. However, most existing DAL methods for support vector learning-based transductive learning focus only on first-order statistics matching, which attempts to bring the empirical means of the training and testing instances from the source and target domains closer in an RKHS [36]. Intuitively, considering only the mean of the distribution of samples is not sufficient to measure the distribution discrepancy between two domains [13, 29, 36]. Hence, the state-of-the-art MMD-based DAL methods [28, 29, 37], which focus only on the first-order statistics of the data distributions, still have considerable limitations in generalization capacity for specific domain adaptation transfer learning problems. What is more, since LMPROJ and DTSVM only focus on the consistency of domain distributions in an RKHS, they sometimes project the data onto noisy directions of separation that are completely irrelevant to the target learning task [13], which can even result in poor performance.

In this study, we claim that it is indispensable to consider both the mean and the variance (or scatter) of the data distribution in order to efficiently measure the distribution discrepancy between source and target domains. This motivates us to explicitly utilize both the MMD and the scatter information of both domains to sufficiently evaluate their distribution discrepancy. In order to overcome the drawbacks of the MMD-based methods, we propose a novel domain adaptation kernelized SVM (DAKSVM) using the GPMDD discrepancy metric on RKHS embeddings of the domain distributions, which simultaneously considers both the distribution mean and scatter discrepancies between source and target domains. The idea is to find an RKHS in which the means and variances of the training and test data distributions are brought into agreement, so that the labeled training data can be used to learn a model for the test data. In particular, we aim to obtain a linear kernel classifier, based on the Representer Theorem [32], in an RKHS, such that it achieves a trade-off between the maximal margin between classes and the minimal discrepancy between the training and test distributions.

Compared with the existing state-of-the-art DAL methods, our main contributions include the following aspects: (1) the proposed methods inherit the potential advantages of the classical TSVMs and MMD-based methods described above, and further extend them to DAL; (2) as novel large margin domain adaptation classifiers, the proposed methods can reduce the distribution gap between different domains in an RKHS as much as possible, since they effectively integrate the large margin learner with the proposed GPMDD metric; (3) in addition, we propose two extensions to the standard formulation of DAKSVM, based on the v-SVM and the least-squares SVM (LS-SVM), respectively.

The rest of this chapter is organized as follows. In Sect. 3.2, inductive transfer learning with support vector learning-based fuzzy systems is proposed. In Sect. 3.3, transductive transfer learning with a support vector learning-based domain adaptation SVM using the GPMDD metric is proposed. In Sects. 3.4 and 3.5, the experimental results for the proposed inductive and transductive transfer learning approaches are reported, respectively. The conclusions are given in the final section.

3.2 Inductive Transfer Learning with Support Vector Learning-Based Fuzzy Systems

3.2.1 Support Vector Learning-Based Fuzzy Systems

Support vector learning has been extensively used in machine learning, for example in kernel methods and other intelligent modeling methods. In this section, support vector learning-based fuzzy systems, which have strong learning abilities and good interpretability, are introduced to develop the inductive transfer learning method.

3.2.1.1 Concept and Principle of TSK-FS

Classical fuzzy logic system models include the TSK model [38], the Mamdani–Larsen (ML) model [39], and the generalized fuzzy model [40]. Among them, the TSK model is the most popular due to its effectiveness. In this study, the TSK model is adopted to develop the KL-TSK-FS for implementing inductive transfer learning.

For TSK fuzzy logic systems, the most commonly used fuzzy inference rules are defined as follows:

TSK Fuzzy Rule R^k:

$$ \mathrm{IF}\ {x}_1\ \mathrm{is}\ {A}_1^k\wedge {x}_2\ \mathrm{is}\ {A}_2^k\wedge \cdots \wedge {x}_d\ \mathrm{is}\ {A}_d^k $$
(3.1)
$$ \mathrm{Then}\kern0.5em {f}^k\left(\mathbf{x}\right)={p}_0^k+{p}_1^k{x}_1+\cdots +{p}_d^k{x}_d\kern1em k=1,\cdots, K $$

In Eq. (3.1), A_i^k is the fuzzy subset associated with the input variable x_i for the kth rule; K is the number of fuzzy rules, and ∧ is a fuzzy conjunction operator. Each rule is premised on the input vector x = [x_1, x_2, ⋯, x_d]^T and maps the fuzzy sets in the input space A^k ⊂ R^d to a varying singleton denoted by f^k(x). When multiplicative conjunction is employed as the conjunction operator, multiplicative implication as the implication operator, and additive disjunction as the disjunction operator, the output of the TSK fuzzy model can be formulated as

$$ {y}^0={\displaystyle \sum_{k=1}^K\frac{\mu^k\left(\mathbf{x}\right)}{{\displaystyle {\sum}_{k^{\prime }=1}^K{\mu}^{k^{\prime }}\left(\mathbf{x}\right)}}}\cdot {f}^k\left(\mathbf{x}\right)={\displaystyle \sum_{k=1}^K{\tilde{\mu}}^k}\left(\mathbf{x}\right)\cdot {f}^k\left(\mathbf{x}\right),$$
(3.2a)

where μ^k(x) and \( {\tilde{\mu}}^k\left(\mathbf{x}\right) \) denote the fuzzy membership function and the normalized fuzzy membership function associated with the fuzzy set A^k, respectively. These two functions can be calculated using

$$ {\mu}^k\left(\mathbf{x}\right)={\displaystyle \prod_{i=1}^d{\mu}_{A_i^k}\left({x}_i\right)}$$
(3.2b)

and

$$ {\tilde{\mu}}^k\left(\mathbf{x}\right)={\mu}^k\left(\mathbf{x}\right)/{\displaystyle \sum_{k^{\prime }=1}^K{\mu}^{k^{\prime }}\left(\mathbf{x}\right)}.$$
(3.2c)

A commonly used fuzzy membership function is the Gaussian membership function which can be expressed by

$$ {\mu}_{A_i^k}\left({x}_i\right)= \exp \left(\frac{-{\left({x}_i-{c}_i^k\right)}^2}{2{\delta}_i^k}\right),$$
(3.2d)

where the parameters c_i^k and δ_i^k can be estimated by clustering techniques or other partition methods. For example, with fuzzy c-means (FCM) clustering, c_i^k and δ_i^k can be estimated as follows:

$$ {c}_i^k={\displaystyle \sum_{j=1}^N{u}_{jk}{x}_{ji}}/{\displaystyle \sum_{j=1}^N{u}_{jk}}, $$
(3.2e)
$$ {\delta}_i^k=h\cdot {\displaystyle \sum_{j=1}^N{u}_{jk}{\left({x}_{ji}-{c}_i^k\right)}^2}/{\displaystyle \sum_{j=1}^N{u}_{jk}}, $$
(3.2f)

where u_jk denotes the fuzzy membership of the jth input datum x_j = (x_j1, ⋯, x_jd)^T to the kth cluster obtained by FCM clustering [41]. Here, h is a scale parameter that can be adjusted manually.
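Equations (3.2e) and (3.2f) can be sketched in code as follows, assuming a membership matrix U has already been produced by some FCM implementation; the function name and array layout are our assumptions.

```python
import numpy as np

def antecedent_params(X, U, h=1.0):
    """Estimate Gaussian antecedent centers c and widths delta from a
    fuzzy partition (Eqs. (3.2e)-(3.2f)).

    X : (N, d) input data
    U : (N, K) fuzzy memberships u_jk, e.g., from FCM clustering
    h : manually adjusted scale parameter
    """
    w = U / U.sum(axis=0)            # normalized memberships per cluster
    c = w.T @ X                      # c_i^k = sum_j u_jk x_ji / sum_j u_jk
    # delta_i^k = h * membership-weighted variance of x_i around c_i^k
    delta = np.stack([w[:, k] @ (X - c[k])**2 for k in range(U.shape[1])])
    return c, h * delta
```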

When the antecedents of the TSK fuzzy model are determined, let

$$\begin{array}{l} {\mathbf{x}}_e={\left(1,{\mathbf{x}}^T\right)}^T, \end{array} $$
(3.3a)
$$ \begin{array}{l} {\tilde{\mathbf{x}}}^k={\tilde{\mu}}^k\left(\mathbf{x}\right){\mathbf{x}}_e, \end{array} $$
(3.3b)
$$ {\mathbf{x}}_g={\left({\left({\tilde{\mathbf{x}}}^1\right)}^T,{\left({\tilde{\mathbf{x}}}^2\right)}^T,\cdots, {\left({\tilde{\mathbf{x}}}^K\right)}^T\right)}^T, $$
(3.3c)
$$ {\mathbf{p}}^k={\left({p}_0^k,{p}_1^k,\cdots, {p}_d^k\right)}^T $$
(3.3d)

and

$$ {\mathbf{p}}_g={\left({\left({\mathbf{p}}^1\right)}^T,{\left({\mathbf{p}}^2\right)}^T,\cdots, {\left({\mathbf{p}}^K\right)}^T\right)}^T, $$
(3.3e)

then Eq. (3.2a) can be formulated as the following linear regression problem [33]

$$ {y}^o={\mathbf{p}}_g^T{\mathbf{x}}_g. $$
(3.3f)

Thus, the problem of TSK fuzzy model training can be transformed into the learning of the parameters in the corresponding linear regression model [2, 33].
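The mapping from an input x to the vector x_g of Eqs. (3.3a)–(3.3c), under Gaussian membership functions, can be sketched as follows; the function name and array shapes are our own conventions.

```python
import numpy as np

def tsk_features(x, c, delta):
    """Map an input x (shape (d,)) to the vector x_g of Eq. (3.3c), so that
    the TSK output is the linear form y = p_g^T x_g of Eq. (3.3f).

    c, delta : (K, d) Gaussian antecedent centers and widths
    """
    # firing strength of each rule, Eqs. (3.2b) and (3.2d)
    mu = np.exp(-(x - c)**2 / (2 * delta)).prod(axis=1)   # (K,)
    mu_t = mu / mu.sum()                                  # Eq. (3.2c)
    x_e = np.concatenate(([1.0], x))                      # Eq. (3.3a)
    # x_g stacks mu_t^k * x_e over the K rules, Eqs. (3.3b)-(3.3c)
    return np.concatenate([m * x_e for m in mu_t])        # (K*(d+1),)
```

With this mapping in hand, learning the consequent parameters p_g reduces to the linear regression problem of Eq. (3.3f).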

3.2.1.2 Support Vector Learning-Based TSK-FS Training

Given a training dataset D_tr = {(x_i, y_i) | x_i ∈ R^d, y_i ∈ R, i = 1, ⋯, N}, and for fixed antecedents obtained via clustering of the input space (or by other partition techniques), the least-squares (LS) solution for the consequents minimizes the following LS criterion function [29], that is,

$$ \underset{{\mathbf{p}}_g}{ \min }E={\displaystyle \sum_{i=1}^N{\left({y}_i^o-{y}_i\right)}^2}={\displaystyle \sum_{i=1}^N{\left({\mathbf{p}}_g^T{\mathbf{x}}_{gi}-{y}_i\right)}^2}={\left(\mathbf{y}-{\mathbf{X}}_g{\mathbf{p}}_g\right)}^T\left(\mathbf{y}-{\mathbf{X}}_g{\mathbf{p}}_g\right), $$
(3.4)

where X_g = [x_g1, ⋯, x_gN]^T ∈ R^{N×K(d+1)} and y = [y_1, ⋯, y_N]^T ∈ R^N.
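For reference, the unconstrained minimizer of Eq. (3.4) is the ordinary least-squares solution; a minimal sketch (assuming the names above, and falling back to the pseudo-inverse when X_g is rank deficient):

```python
import numpy as np

def fit_consequents_ls(X_g, y):
    """Least-squares solution of Eq. (3.4): p_g minimizing ||y - X_g p_g||^2.

    X_g : (N, K*(d+1)) matrix of TSK regressor vectors x_gi
    y   : (N,) target outputs
    """
    # lstsq handles rank-deficient X_g via the pseudo-inverse
    p_g, *_ = np.linalg.lstsq(X_g, y, rcond=None)
    return p_g
```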

The most popular LS criterion-based TSK-FS training algorithm is the one used in adaptive-network-based fuzzy inference systems [42]. A main shortcoming of LS criterion-based algorithms is their weak robustness for modeling tasks involving noisy and/or small datasets. Besides the LS criterion-based TSK-FS training methods, more promising TSK-FS training methods are the support vector learning-based training algorithms, which are reviewed below.

3.2.1.2.1 Support Vector Learning-Based TSK-FS Training with L1-Norm Penalty

In addition to the LS criterion, another important criterion for TSK-FS training is the ε-insensitive criterion [33]. Given a scalar g and a vector g = [g_1, ⋯, g_d]^T, the corresponding ε-insensitive loss functions take the following forms, respectively [33]: |g|_ε = g − ε for g > ε, |g|_ε = 0 for g ≤ ε, and \( \left|\mathbf{g}\right|{}_{\varepsilon }={\displaystyle \sum_{i=1}^d\left|{g}_i\right|{}_{\varepsilon }} \). For the linear regression problem of the TSK-FS in Eq. (3.3f), the corresponding ε-insensitive loss-based criterion function [33] is defined as

$$ \underset{{\mathbf{p}}_g}{ \min }E={\displaystyle \sum_{i=1}^N\left|{y}_i^o-{y}_i\right|{}_{\varepsilon }}={\displaystyle \sum_{i=1}^N\left|{\mathbf{p}}_g^T{{\mathbf{x}}_g}_i-{y}_i\right|{}_{\varepsilon }} $$
(3.5a)
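The ε-insensitive loss just defined can be written compactly in code; this sketch simply applies the scalar definition elementwise to a vector of residuals (names are ours).

```python
import numpy as np

def eps_insensitive(residuals, eps):
    """Vectorized epsilon-insensitive loss |g|_eps: zero inside the
    eps-tube, linear outside (the per-sample terms summed in Eq. (3.5a))."""
    return np.maximum(np.abs(residuals) - eps, 0.0)
```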

In general, the inequalities y_i − p_g^T x_gi < ε and p_g^T x_gi − y_i < ε are not satisfied for all data pairs (x_gi, y_i). By introducing the slack variables ξ_i^+ ≥ 0 and ξ_i^− ≥ 0, Eq. (3.5a) can be equivalently written as

$$ \underset{{\mathbf{p}}_g,{\xi}_i^{+},{\xi}_i^{-}}{ \min }E={\displaystyle \sum_{i=1}^N\left({\xi}_i^{+}+{\xi}_i^{-}\right)} $$
(3.5b)
$$ \mathrm{s}.\mathrm{t}.\kern1em \left\{\begin{array}{l}{y}_i-{\mathbf{p}}_g^T{\mathbf{x}}_{gi}<\varepsilon +{\xi}_i^{+}\\ {}{\mathbf{p}}_g^T{\mathbf{x}}_{gi}-{y}_i<\varepsilon +{\xi}_i^{-}\end{array}\right.,\kern1.5em {\xi}_i^{+}\ge 0,{\xi}_i^{-}\ge 0\kern1.36em \forall i. $$

Further, by introducing the regularization term [30], Eq. (3.5b) is modified to become

$$ \underset{{\mathbf{p}}_g,{\boldsymbol{\upxi}}^{+},{\boldsymbol{\upxi}}^{-}}{ \min}\kern0.75em E=\frac{1}{\tau }{\displaystyle \sum_{i=1}^N\left({\xi}_i^{+}+{\xi}_i^{-}\right)}+\frac{1}{2}{{\mathbf{p}}_g}^T{\mathbf{p}}_g $$
(3.5c)
$$ \mathrm{s}.\mathrm{t}.\kern1em \left\{\begin{array}{l}{y}_i-{\mathbf{p}}_g^T{\mathbf{x}}_{gi}<\varepsilon +{\xi}_i^{+}\\ {}{\mathbf{p}}_g^T{\mathbf{x}}_{gi}-{y}_i<\varepsilon +{\xi}_i^{-}\end{array}\right.,\kern1.5em {\xi}_i^{+}\ge 0,{\xi}_i^{-}\ge 0\kern1.5em \forall i, $$

where τ > 0 controls the trade-off between the complexity of the regression model and the tolerance of the errors. Here, ξ_i^+ and ξ_i^− can be taken as L1-norm penalty terms, and thus Eq. (3.5c) is an objective function based on L1-norm penalties. A TSK training algorithm of this type is referred to as support vector learning-based L1-TSK-FS; its learning procedure is similar to that of the classical SVM. The dual of the optimization problem in Eq. (3.5c) is a quadratic programming (QP) problem, which can be expressed as

$$ \begin{aligned}&\underset{{\boldsymbol{\upalpha}}^{+},{\boldsymbol{\upalpha}}^{-}}{ \max }-\frac{1}{2}{\displaystyle \sum_{i=1}^N{\displaystyle \sum_{j=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right)}}\left({\alpha}_j^{+}-{\alpha}_j^{-}\right){\mathbf{x}}_{gi}^T{\mathbf{x}}_{gj}-{\displaystyle \sum_{i=1}^N\varepsilon \left({\alpha}_i^{+}+{\alpha}_i^{-}\right)}\\&\qquad+{\displaystyle \sum_{i=1}^N{y}_i\left({\alpha}_i^{+}-{\alpha}_i^{-}\right)}\end{aligned} $$
(3.5d)
$$ \mathrm{s}.\mathrm{t}.\kern0.75em {\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right)}=0,\kern1.25em {\alpha}_i^{+},{\alpha}_i^{-}\in \left[0,\tau \right]\kern1em \forall i. $$

Compared with the LS-criterion-based algorithms, the support vector learning-based L1-TSK-FS with the ε-insensitive criterion has been shown to be more robust when dealing with noisy and small datasets.

3.2.1.2.2 Support Vector Learning-Based TSK-FLS Training with L2-Norm Penalty

Instead of using the L1-norm penalty terms in Eq. (3.5c), another representative support vector learning-based TSK-FS learning method employs L2-norm penalty terms [3]. The insensitive parameter ε is also added as a penalty term in the objective function, similar to the approach used in other existing L2-norm penalty-based methods, e.g., L2-norm support vector regression (L2-SVR) [43]. For TSK fuzzy model training, the ε-insensitive objective function based on L2-norm penalty terms is then given by

$$ \underset{{\mathbf{p}}_g,{\boldsymbol{\upxi}}^{+},{\boldsymbol{\upxi}}^{-},\varepsilon }{ \min}\kern0.75em g\left({\mathbf{p}}_g,{\boldsymbol{\upxi}}^{+},{\boldsymbol{\upxi}}^{-},\varepsilon \right)=\frac{1}{\tau}\cdot \frac{1}{N}{\displaystyle \sum_{i=1}^N\left({\left({\xi}_i^{+}\right)}^2+{\left({\xi}_i^{-}\right)}^2\right)}+\frac{1}{2}{\mathbf{p}}_g^T{\mathbf{p}}_g+\frac{2}{\tau}\cdot \varepsilon $$
(3.6a)
$$ \mathrm{s}.\mathrm{t}.\kern1em \left\{\begin{array}{l}{y}_i-{\mathbf{p}}_g^T{\mathbf{x}}_{gi}<\varepsilon +{\xi}_i^{+}\\ {}{\mathbf{p}}_g^T{\mathbf{x}}_{gi}-{y}_i<\varepsilon +{\xi}_i^{-}\end{array}\right.\kern0.75em \forall i.$$

Compared with the L1-norm penalty-based ε-insensitive criterion, the L2-norm penalty-based criterion is advantageous for the following reasons: (1) the constraints ξ_i^+ ≥ 0 and ξ_i^− ≥ 0 in Eq. (3.5c) are not needed for the optimization; (2) the insensitive parameter ε can be obtained automatically by optimization, without the need for manual setting. Similar properties can also be found in other L2-norm penalty-based machine learning algorithms, such as L2-SVR [43]. For convenience, the L2-norm penalty-based ε-insensitive TSK fuzzy model training is referred to as L2-TSK-FS in this chapter. Based on optimization theory, the dual of the problem in Eq. (3.6a) can be formulated as the following QP problem.

$$ \begin{aligned}&\underset{{\boldsymbol{\upalpha}}^{+},{\boldsymbol{\upalpha}}^{-}}{ \max}\kern0.5em -{\displaystyle \sum_{i=1}^N{\displaystyle \sum_{j=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right)\left({\alpha}_j^{+}-{\alpha}_j^{-}\right)\cdot {\mathbf{x}}_{gi}^T{\mathbf{x}}_{gj}}}-{\displaystyle \sum_{i=1}^N\frac{ N\tau}{2}{\left({\alpha}_i^{+}\right)}^2}-{\displaystyle \sum_{i=1}^N\frac{ N\tau}{2}{\left({\alpha}_i^{-}\right)}^2}\\ &\qquad+{\displaystyle \sum_{i=1}^N{\alpha}_i^{+}\cdot {y}_i\cdot \tau }-{\displaystyle \sum_{i=1}^N{\alpha}_i^{-}\cdot {y}_i\cdot \tau} \end{aligned} $$
(3.6b)
$$ \mathrm{s}.\mathrm{t}.\kern0.5em {\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}+{\alpha}_i^{-}\right)}=1,\kern0.75em {\alpha}_i^{+},{\alpha}_i^{-}\ge 0\kern1.75em \forall i $$

Notably, the characteristics of the QP problem in Eq. (3.6b) enable the use of the core-set-based minimal enclosing ball (MEB) approximation technique to solve problems involving very large datasets [43]. The scalable L2-TSK-FS learning algorithm (STSK) has been proposed in this regard [3].
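For clarity, the primal objective of Eq. (3.6a) can be evaluated directly once the slack variables are eliminated at their optimal values (ξ_i^+ = max(y_i − p_g^T x_gi − ε, 0) and symmetrically for ξ_i^−). This is an illustrative sketch with our own naming, not the STSK solver itself.

```python
import numpy as np

def l2_tsk_objective(p_g, eps, X_g, y, tau):
    """Primal objective of Eq. (3.6a) with the slack variables eliminated.

    p_g : consequent parameter vector, eps : insensitivity width,
    X_g : (N, K*(d+1)) regressor matrix, y : (N,) targets, tau > 0.
    """
    N = len(y)
    resid = y - X_g @ p_g
    xi_p = np.maximum(resid - eps, 0.0)    # violations above the eps-tube
    xi_m = np.maximum(-resid - eps, 0.0)   # violations below the eps-tube
    return ((xi_p**2 + xi_m**2).sum() / (tau * N)
            + 0.5 * p_g @ p_g
            + 2.0 * eps / tau)
```

Minimizing this function jointly over p_g and ε reproduces the trade-off that the dual QP of Eq. (3.6b) solves.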

3.2.2 Inductive Transfer Learning with Support Vector Learning-Based TSK-FS

3.2.2.1 Framework of Knowledge-Leveraged Inductive Transfer Learning with TSK-FS

Most inductive transfer learning algorithms are developed to learn directly from the data in the source domain with some strategy. Recently, transfer learning from the knowledge of the source domain, rather than from the original data, has been investigated within the knowledge-leveraged transfer learning framework [44]. This framework is motivated by the characteristics of the two ways of learning from the source domain described below, i.e., learning from the original data and learning from the induced knowledge.

  1.

    The data in the source domains are the original information and also the most commonly used information for transfer learning. However, the data are not always available. For example, many data samples cannot be made open due to the necessity of privacy protection in the real world. Moreover, even if the data of the source domains are available, it may not always be appropriate to directly adopt them for the modeling task in the target domain, due to the following issues: first, it is difficult to control and balance the similarity and difference of the distributions of the source and target domains when using the data directly; second, a drift may exist between the distributions of the different domains, and thus some data from the source domain may have an obvious negative influence on the modeling effect in the target domain.

  2.

    The knowledge in the source domains is another kind of important information. The types of knowledge are diverse, e.g., density distributions and model parameters, and most of them can be obtained from past learning procedures. For example, the model parameters for a source domain may have been learned by a certain modeling algorithm from the data collected in that domain during a historical modeling task. Most of the knowledge obtained cannot be inversely mapped back to the original data; this is a desirable property from a privacy-preservation point of view, and such knowledge still provides important information from the source domains for improving the modeling effect in the target domain.

The characteristics above show that it can be more appropriate to exploit knowledge rather than data from the source domains to enhance the modeling/learning performance in the target domain. As shown in Fig. 3.2, a generalized learning framework was proposed in [44] for knowledge-leveraged transfer learning. Under this framework, the model in the target domain can be learned from the data in the target domain and the knowledge in the source domain simultaneously. In this study, knowledge-leveraged inductive transfer learning for the support vector learning-based TSK-FS is studied accordingly.

Fig. 3.2 Framework of knowledge-leveraged TSK-FS learning

3.2.2.2 Inductive Transfer Learning with Support Vector Learning-Based TSK-FS

To take advantage of the knowledge-leverage mechanism for the TSK-FS, the KL-TSK-FS is proposed by combining support vector learning and the L2-norm penalty-based TSK-FS learning strategy with the corresponding knowledge-leverage mechanism. The goal is to effectively use the knowledge of the source domains to remedy the deficiency caused by data insufficiency in the target domain, and to develop an efficient learning algorithm for the TSK-FS.

3.2.2.2.1 Objective Criterion Integrating the Knowledge of Source Domain

For a TSK-FS constructed by the support vector learning-based technique, the corresponding model parameters obtained in the source domains can be regarded as the knowledge. To develop an effective KL-TSK-FS for model learning of the target domain, we propose an optimization criterion which is integrated with the knowledge of the source domain as follows:

$$ \underset{{\mathbf{p}}_g}{ \min }{\displaystyle \sum_{i=1}^N{\left|{\mathbf{p}}_g^T{\mathbf{x}}_{gi}-{y}_i\right|}_{\varepsilon }}+\lambda {\left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right)}^T\left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right). $$
(3.7)

The optimization criterion in Eq. (3.7) contains two terms. The first term refers to learning from the data of the target domain, and it is included so that the desired TSK-FS fits the sampled training data of the target domain as accurately as possible. The second term refers to knowledge-leverage from the source domain, with p_g0 denoting the model parameters learned from the source domains; its purpose is to estimate the desired parameters by approximating the model obtained from the source domains. The parameter λ in Eq. (3.7) balances the influence of these two terms, and its optimal value can be determined by the commonly used cross-validation strategy in machine learning. As in the L2-TSK-FS [20], we introduce the structural risk and ε-insensitive penalty terms into Eq. (3.7) to obtain the following objective criterion:

$$ \begin{aligned}&\underset{{\mathbf{p}}_g,{\boldsymbol{\upxi}}^{+},{\boldsymbol{\upxi}}^{-},\varepsilon }{ \min}\frac{1}{\tau}\cdot \frac{1}{N}{\displaystyle \sum_{i=1}^N\left({\left({\xi}_i^{+}\right)}^2+{\left({\xi}_i^{-}\right)}^2\right)}+\frac{1}{2}{\mathbf{p}}_g^T{\mathbf{p}}_g\\&\qquad+\frac{2}{\tau}\cdot \varepsilon +\lambda {\left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right)}^T\left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right)\end{aligned} $$
(3.8)
$$ \mathrm{s}.\mathrm{t}.\kern1em \left\{\begin{array}{l}{y}_i-{\mathbf{p}}_g^T{\mathbf{x}}_{gi}<\varepsilon +{\xi}_i^{+}\\ {}{\mathbf{p}}_g^T{\mathbf{x}}_{gi}-{y}_i<\varepsilon +{\xi}_i^{-}\end{array}\right.,\kern1em \forall i $$

In fact, the first three terms in Eq. (3.8) are directly inherited from the L2-TSK-FS [20], while the last term is the knowledge-leverage term used to learn the knowledge from the source domains. Based on the objective criterion in Eq. (3.8), we can derive the corresponding learning rules for the proposed KL-TSK-FS.
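The criterion of Eq. (3.7), which Eq. (3.8) extends, can be evaluated as follows. This is an illustrative sketch with names of our own choosing; solving the resulting QP of Eq. (3.9) is a separate matter.

```python
import numpy as np

def kl_objective(p_g, p_g0, X_g, y, lam, eps):
    """Evaluate the KL-TSK-FS criterion of Eq. (3.7): epsilon-insensitive
    data fit on the target domain plus a knowledge-leverage penalty that
    pulls p_g towards the source-domain parameters p_g0."""
    resid = X_g @ p_g - y
    fit = np.maximum(np.abs(resid) - eps, 0.0).sum()   # target-domain fit
    leverage = lam * np.sum((p_g - p_g0)**2)           # stay close to source model
    return fit + leverage
```

When λ is large, the solution leans on the source-domain knowledge p_g0; when λ is small, the sparse target-domain data dominate, which is exactly the balancing role described above.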

3.2.2.2.2 Parameter Solution for KL-TSK-FS

Given the optimization problem in Eq. (3.8), Theorem 1 below gives its parameter solution.

Theorem 1

The dual problem of Eq. (3.8) is a QP problem as shown in Eq. (3.9).

$$ \begin{array}{l}\underset{{\boldsymbol{\upalpha}}^{-},{\boldsymbol{\upalpha}}^{+}}{ \max} -\frac{1}{2\left(1+2\lambda \right)}{\displaystyle \sum_{i=1}^N{\displaystyle \sum_{j=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right)\left({\alpha}_j^{+}-{\alpha}_j^{-}\right){\mathbf{x}}_{gi}^T{\mathbf{x}}_{gj}}}-\frac{ N\tau}{4}{\displaystyle \sum_{i=1}^N\left({\left({\alpha}_i^{+}\right)}^2+{\left({\alpha}_i^{-}\right)}^2\right)}\\ {}\kern2.5em -\frac{2\lambda }{1+2\lambda }{\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){\mathbf{p}}_{g0}^T{\mathbf{x}}_{gi}}+{\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){y}_i}\kern8.75em \end{array} $$
(3.9)
$$ \mathrm{s}.\mathrm{t}.\kern1.5em {\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{-}}}+{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{+}}}=\frac{2}{\tau },\kern1.5em {\displaystyle {\alpha}_i^{-}}\ge 0,{\displaystyle {\alpha}_i^{+}}\ge 0,\kern0.5em \forall i. $$

Proof

By using the Lagrangian optimization theorem, we can obtain the following Lagrangian function for Eq. (3.8)

$$ \begin{gathered}{} \begin{aligned} &L\left({\displaystyle {\mathbf{p}}_g},{\displaystyle {\boldsymbol{\upxi}}^{+}},{\displaystyle {\boldsymbol{\upxi}}^{-}},\varepsilon, {\boldsymbol{\upalpha}}^{+},{\boldsymbol{\upalpha}}^{-}\right)=\frac{1}{\tau}\cdot \frac{1}{N}{\displaystyle \sum_{i=1}^N\left({\displaystyle {\left({\displaystyle {\xi}_i^{+}}\right)}^2}+{\displaystyle {\left({\displaystyle {\xi}_i^{-}}\right)}^2}\right)}\\ &\quad+\frac{1}{2}\left({\displaystyle {\mathbf{p}}_g^T}{\displaystyle {\mathbf{p}}_g}\right)+\frac{2}{\tau}\cdot \varepsilon +\lambda {\displaystyle {\left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right)}^T}\left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right)\nonumber\\ &\quad+{\displaystyle \sum_{i=1}^N{\alpha}_i^{+}\left({y}_i-{\displaystyle {\mathbf{p}}_g^T}{\displaystyle {\mathbf{x}}_{gi}}-\varepsilon -{\displaystyle {\xi}_i^{+}}\right)}+{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{-}}\left({\displaystyle {\mathbf{p}}_g^T}{\displaystyle {\mathbf{x}}_{gi}}-{y}_i-\varepsilon -{\displaystyle {\xi}_i^{-}}\right)}.\nonumber \end{aligned}\end{gathered} $$
(3.10)

According to the dual theorem, the minimum of the Lagrangian function in Eq. (3.10) with respect to p g , ξ +, ξ , and ε equals the maximum of the function with respect to α + and α . The following equations can therefore be taken as the necessary conditions of the optimal solution:

$$ \frac{\partial L}{\partial {\mathbf{p}}_g}={\mathbf{p}}_g+2\lambda \left({\mathbf{p}}_g-{\mathbf{p}}_{g0}\right)-{\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){\mathbf{x}}_{gi}=0},$$
(3.11a)
$$ \frac{\partial L}{\partial {\displaystyle {\xi}_i^{+}}}=\frac{2}{ N\tau}{\displaystyle {\xi}_i^{+}}-{\alpha}_i^{+}=0,$$
(3.11b)
$$\frac{\partial L}{\partial {\displaystyle {\xi}_i^{-}}}=\frac{2}{ N\tau}{\displaystyle {\xi}_i^{-}}-{\alpha}_i^{-}=0, $$
(3.11c)
$$\frac{\partial L}{\partial \varepsilon }=\frac{2}{\tau }-{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{-}}}-{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{+}}}=0. $$
(3.11d)

From Eqs. (3.11a)–(3.11d), we have

$$ {\mathbf{p}}_g=\frac{2\lambda {\mathbf{p}}_{g0}+{\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){\mathbf{x}}_{gi}}}{1+2\lambda },\hfill $$
(3.12a)
$$\begin{array}{l} {\displaystyle {\xi}_i^{+}}=\frac{ N\tau \cdot {\alpha}_i^{+}}{2},\end{array}$$
(3.12b)
$$ \begin{array}{l} {\displaystyle {\xi}_i^{-}}=\frac{ N\tau \cdot {\alpha}_i^{-}}{2},\end{array} $$
(3.12c)
$$ {\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{-}}}+{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{+}}}=\frac{2}{\tau }. $$
(3.12d)

Substituting Eqs. (3.12a)–(3.12d) into Eq. (3.10), we obtain the dual problem for Eq. (3.8), i.e.,

$$ \begin{array}{l}\underset{{\boldsymbol{\upalpha}}^{-},{\boldsymbol{\upalpha}}^{+}}{ \max}\kern0.5em \frac{-1}{2\left(1+2\lambda \right)}\cdot {\displaystyle \sum_{i=1}^N{\displaystyle \sum_{j=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right)\left({\alpha}_j^{+}-{\alpha}_j^{-}\right){\mathbf{x}}_{gi}^T{\mathbf{x}}_{gj}}}-\frac{ N\tau}{4}\cdot {\displaystyle \sum_{i=1}^N\left({\left({\alpha}_i^{+}\right)}^2+{\left({\alpha}_i^{-}\right)}^2\right)}\\ {}\kern2.25em -\frac{2\lambda }{1+2\lambda}\cdot {\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){\mathbf{p}}_{g0}^T{\mathbf{x}}_{gi}}+{\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){y}_i}+\frac{\lambda }{1+2\lambda}\cdot {\mathbf{p}}_{g0}^T{\mathbf{p}}_{g0}\end{array} $$
$$\mathrm{s}.\mathrm{t}.\kern1em {\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{-}}}+{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{+}}}=\frac{2}{\tau },\kern1em {\displaystyle {\alpha}_i^{-}}\ge 0,{\displaystyle {\alpha}_i^{+}}\ge 0,\kern0.37em \forall i. $$
(3.12e)

Since the optimal solution of the dual problem, i.e., (α +), (α ), is independent of the constant term \( \frac{\lambda }{1+2\lambda}\cdot {\mathbf{p}}_{g0}^T{\mathbf{p}}_{g0} \), Eq. (3.12e) is equivalent to the following equation:

$$ \begin{array}{l}\underset{{\boldsymbol{\upalpha}}^{-},{\boldsymbol{\upalpha}}^{+}}{ \max}\frac{-1}{2\left(1+2\lambda \right)}\cdot \displaystyle \sum_{i=1}^N\displaystyle \sum_{j=1}^N\big({\alpha}_i^{+}-{\alpha}_i^{-}\big)\big({\alpha}_j^{+}-{\alpha}_j^{-}\big){\mathbf{x}}_{gi}^T{\mathbf{x}}_{gj}-\frac{ N\tau}{4}\cdot {\displaystyle \sum_{i=1}^N\left({\left({\alpha}_i^{+}\right)}^2+{\left({\alpha}_i^{-}\right)}^2\right)}\\ {}\kern2em \ -\frac{2\lambda }{1+2\lambda}\cdot \displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right) {\mathbf{p}}_{g0}^T{\mathbf{x}}_{gi}+\displaystyle \sum_{i=1}^N\left({\alpha}_i^{+}-{\alpha}_i^{-}\right){y}_i \end{array} $$
$$ \mathrm{s}.\mathrm{t}.\kern1em {\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{-}}}+{\displaystyle \sum_{i=1}^N{\displaystyle {\alpha}_i^{+}}}=\frac{2}{\tau },{\displaystyle {\alpha}_i^{-}}\ge 0,{\displaystyle {\alpha}_i^{+}}\ge 0. $$
(3.12f)

Thus, Theorem 1 holds.

It is clear from the above results that the optimization problem in Eq. (3.8) for TSK-FS training can be transformed into the QP problem in Eq. (3.9), which can be solved directly by standard QP solvers [45].
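
As a concrete illustration, note that the feasible set of the dual in Eq. (3.9) — α +, α ≥ 0 with their entries summing to 2/τ — is a scaled simplex, so a simple projected-gradient ascent can serve as a minimal stand-in for a dedicated QP solver. The sketch below maximizes a generic concave quadratic −½αᵀQα + cᵀα over such a set; Q and c are placeholder coefficients for illustration, not the exact ones of Eq. (3.9).

```python
import numpy as np

def project_scaled_simplex(v, z):
    """Euclidean projection of v onto {a >= 0, sum(a) = z} (sort-based scheme)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - z) / idx > 0)[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_dual_qp(Q, c, tau, iters=2000):
    """Maximize -0.5 a^T Q a + c^T a  s.t.  a >= 0, sum(a) = 2/tau."""
    n = len(c)
    a = np.full(n, (2.0 / tau) / n)            # feasible starting point
    step = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)
    for _ in range(iters):
        grad = c - Q @ a                       # gradient of the concave objective
        a = project_scaled_simplex(a + step * grad, 2.0 / tau)
    return a

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
Q = M @ M.T + np.eye(6)                        # positive definite, so the objective is concave
c = rng.standard_normal(6)
alpha = solve_dual_qp(Q, c, tau=0.5)
print(alpha.sum())                             # close to 2/tau = 4.0
```

Working-set methods such as [33] exploit the same structure but touch only a few active coordinates per iteration, which is how the O(N) behavior cited below is obtained.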

With the optimal solution (α +), (α ) of the dual problem in Eq. (3.9), we can obtain the optimal solution of the primal problem in Eq. (3.8) from the relations presented in Eqs. (3.12a)–(3.12d). The optimal model parameters of the trained TSK-FS, i.e., (p g ), are then given by

$$ {\left({\mathbf{p}}_g\right)}^{*}=\frac{2\lambda {\mathbf{p}}_{g0}+{\displaystyle \sum_{i=1}^N\left({\left({\alpha}_i^{+}\right)}^{*}-{\left({\alpha}_i^{-}\right)}^{*}\right)}{\mathbf{x}}_{gi}}{1+2\lambda }, $$
(3.13a)

which can be further expressed as

$$ {\left({\mathbf{p}}_g\right)}^{\ast }=\gamma {\mathbf{p}}_{g0}+\left(1-\gamma \right){\mathbf{p}}_{gc}, $$
(3.13b)

with \( \gamma =\frac{2\lambda }{1+2\lambda } \), \( {\mathbf{p}}_{gc}={\displaystyle \sum_{i=1}^N\left({\left({\alpha}_i^{+}\right)}^{\ast }-{\left({\alpha}_i^{-}\right)}^{\ast}\right){\mathbf{x}}_{gi}} \).

From Eq. (3.13b), we can see that the final optimal parameter (p g ) obtained for the desired TSK-FS contains two parts, i.e., γp g0 and (1 − γ) ⋅ p gc . While (1 − γ) ⋅ p gc denotes the knowledge learned from the data of the target domain, γp g0 can be taken as the knowledge inherited from the source domains. Thus, the final model parameter (p g ) is a balance between these two kinds of knowledge.
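
To make Eq. (3.13b) concrete, the following sketch (with made-up parameter vectors) shows how λ steers the trained model between the source knowledge p g0 and the target-data solution p gc :

```python
import numpy as np

def blend_parameters(p_g0, p_gc, lam):
    """Eq. (3.13b): p_g* = gamma * p_g0 + (1 - gamma) * p_gc, gamma = 2*lam/(1 + 2*lam)."""
    gamma = 2.0 * lam / (1.0 + 2.0 * lam)
    return gamma * p_g0 + (1.0 - gamma) * p_gc

p_g0 = np.array([1.0, 0.0, 2.0])   # hypothetical source-domain consequent parameters
p_gc = np.array([0.0, 1.0, 0.0])   # hypothetical target-data solution

print(blend_parameters(p_g0, p_gc, lam=0.0))   # pure target knowledge: equals p_gc
print(blend_parameters(p_g0, p_gc, lam=10.0))  # mostly source knowledge
```

Since γ = 2λ/(1 + 2λ) ∈ [0, 1), the result is always a convex combination: λ = 0 ignores the source domains entirely, and γ → 1 as λ → ∞.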

3.2.2.2.3 Learning Algorithm for KL-TSK-FS

Based on the findings in the previous section, the learning algorithm of the proposed KL-TSK-FS is developed and described as follows:

Algorithm KL-TSK-FS

Step 1

Introduce the knowledge of the source domains, i.e., the model parameters p g0 .

Step 2

Set the balance parameters τ, λ in Eq. (3.8).

Step 3

Use the antecedent parameters of the fuzzy model obtained from the source domains, together with Eqs. (3.2d) and (3.3e), to construct the dataset x gi for the corresponding modeling task, i.e., the linear regression model in Eq. (3.3f) associated with the fuzzy system to be constructed for the target domain.

Step 4

Use Eqs. (3.9) and (3.13a) to obtain the final consequent parameters (p g ) of the desired TSK-FS in the target domain.

Step 5

Use the antecedent parameters inherited from the source domains and the consequent parameters obtained in Step 4 to generate the fuzzy system for the target domain.

3.2.2.2.4 Computational Complexity Analysis

The computational complexity of the above algorithm is analyzed as follows. The whole algorithm includes two main parts: (1) acquisition of the antecedent parameters of the fuzzy system and (2) learning of the consequent parameters. For the first part, since the antecedent parameters are inherited directly from the source domains as the available knowledge, the computational complexity is O(1). For the second part, the consequent parameters are obtained by solving the QP problem in Eq. (3.9); the complexity is usually O(N 2) for typical QP problems, but it can be reduced to O(N) with more sophisticated algorithms, such as the working-set-based algorithm [33]. Therefore, the computational complexity of the proposed algorithm is between O(N) and O(N 2), depending on the QP solver used. In this study, we adopt the working-set-based QP solver [33] for the QP problem concerned.

3.3 Transductive Transfer Learning with DAKSVM

3.3.1 Concepts and Problem Formulation

In this subsection, we introduce several definitions to clarify our terminology and then present our algorithm and analysis for domain adaptation transfer learning problems.

Definition 1 (Domain)

A domain D is composed of a feature space χ and a marginal probability distribution P(X), i.e., D = {χ, P(X)}, where \( X={\left\{{\mathbf{x}}_i\right\}}_{i=1}^N\subset \chi \).

Definition 2 (Task)

Given a specific domain D = {χ, P(X)}, a task is composed of a label space Y and a target prediction function f(⋅), i.e., T = {Y, f(⋅)}, where f(⋅) is learned from the training dataset {x i ,y i }, with x i X and y i Y. The function f(⋅) can be used to predict the label f(x) of a new instance x. From a probabilistic point of view, f(x) = P(y|x).

Definition 3 (Domain Adaptation Learning, DAL)

Given a source domain D s with its learning task T s and a target domain D t with its learning task T t , we refer to domain adaptation learning (DAL) as the following problem: given a set of labeled training data \( {X}_s={\left\{\left({\mathbf{x}}_i,{y}_i\right)\right\}}_i\subset {D}_s\times \left\{\pm 1\right\} \) from the source domain D s , where y i Y s Y is the class label corresponding to x i , we need to learn a prediction function f t (⋅) for an unlabeled test dataset \( {X}_t={\left\{{\mathbf{x}}_j\right\}}_j\subset {D}_t \) from the target domain D t . D s with its task T s and D t with its task T t are different, respectively, but share the same feature space. When D s = D t and T s = T t , DAL degenerates into the classical machine learning problem.

Given an input space X and a label set Y of classes, a classifier is a function f(x) : XY which maps data xX to the label set Y. In this context, consider two datasets: X s = {(x s1,y s1), …,(x sn ,y sn )} drawn from X × Y with probability distribution P s (x s ,y s ), and X t = {x t1, …,x tm } drawn from X with probability distribution P t (x t ,y t ), where y t needs to be predicted. They consist of n source domain and m target domain patterns, respectively, and usually 0 ≤ m < < n. x s and x t are d-dimensional feature vectors in X s and X t , respectively. The classical large margin learning machines (such as SVMs) work well under the hypothesis P s (x s ,y s ) = P t (x t ,y t ). However, DAL can, to some extent, make accurate predictions for the unlabeled target data by learning a classifier even under the hypothesis P s (x s ,y s ) ≠ P t (x t ,y t ). The performance of DAL depends on both the complexity of the investigated problem and the correlation between P s (x s ,y s ) and P t (x t ,y t ) [6]. In this chapter, the proposed method is formulated under the following hypotheses:

  1.

    There are only one source domain and one target domain sharing the same feature space in DAL problems, which is the most popular hypothesis used by the state-of-the-art methods.

  2.

    A training dataset X s = {(x si ,y si )} i is available for D s , while a testing dataset X t = {(x tj ,y tj )} j is available for D t with y tj unknown.

  3.

    P s (x s ,y s ) ≠ P t (x t ,y t ) and P s (y s |x s ) ≠ P t (y t |x t ).

3.3.2 Distribution Discrepancy Metrics on RKHS Embedding Domain Distributions

Kernel methods are widely used as an effective way of constructing nonlinear algorithms from linear ones by embedding datasets into higher dimensional RKHSs [46]. A generalization of this idea is to embed probability distributions into an RKHS, giving us a linear method for dealing with higher order statistics [47, 48]. Let H be a complete inner product space of functions F, with gF, g : XR, where X is a nonempty compact set. If the linear evaluation functional gg(x) exists for all xX, we call H an RKHS. Under these conditions, g(x) can be written as an inner product: g(x) = < g, ϕ(x) > H , where ϕ : XH denotes the feature space mapping from x to H. The inner product of the images of any points x and x′ in feature space is called the kernel, k(x,x′) = < ϕ(x), ϕ(x′) > H . It is pointed out in [48] that the RKHS with Gaussian kernel is universal.

Definition 4 (Integral Probability Metric on RKHS Embedding Distributions [2])

Given the set Θ of all Borel probability measures defined on the topological space M, and the RKHS (H, k) of functions on M with k as its reproducing kernel, for any P ∈ Θ, denote Pk := ∫ M k(⋅,x)dP(x). If k is measurable and bounded, then we may define the embedding of P in H as PkH. The RKHS embedding distributions distance between two such mappings associated with P, Q ∈ Θ is then defined as follows:

$$ {\gamma}_k\left(P,Q\right)=\left|\right| Pk- Qk\left|\right|{}_H $$
(3.14)

We may say k is a characteristic kernel (CK) if the mapping PPk is injective [48], in which case γ k (P,Q) = 0 if and only if P = Q [49]. Hence γ k can be viewed as a distance metric on Θ. When k is not a CK, distinct distributions may have indistinguishable RKHS embeddings, causing the RKHS embedding distribution measure to fail. Hence, whether k is a CK is a key factor for the success of the RKHS embedding distribution measure. Fortunately, several popular kernel functions, such as the Gaussian kernel function and the Laplace kernel function, are characteristic and universal [48]. Particularly, it is worth noting that the Gaussian kernel mapping provides an effective RKHS embedding for the consistent estimation of the probability distribution distance between different domains [48]. Hence, in the sequel, we adopt the Gaussian kernel function \( {k}_{\sigma}\left(\mathbf{x},\mathbf{z}\right)= \exp \left(-\frac{1}{2{\sigma}^2}\left|\right|\mathbf{x}-\mathbf{z}\left|\right|{}^2\right) \), where x, zX, and σ denotes the kernel bandwidth, as the reproducing kernel in the Hilbert space in this work. It is worth noting that instead of using a fixed and parameterized kernel, one can also use a finite linear combination of kernels to compute γ k .
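
For reference, the Gaussian reproducing kernel adopted here takes only a few lines; the sample points below are illustrative:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """k_sigma(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    d = np.asarray(x, float) - np.asarray(z, float)
    return float(np.exp(-d @ d / (2.0 * sigma ** 2)))

x = np.array([0.0, 1.0])
z = np.array([1.0, 1.0])
print(gaussian_kernel(x, x))   # k(x, x) = 1.0
print(gaussian_kernel(x, z))   # symmetric: equals gaussian_kernel(z, x)
```

The bandwidth σ controls how quickly similarity decays with distance, which in turn governs how finely the embedding distinguishes distributions.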

For domain adaptation transfer learning problems, let D s and D t denote the source and target domains, and let X s D s , X t D t denote samples from D s and D t with probability measures P s and P t , respectively. Let \( {P}_{x_s,{x}_t} \) denote the joint probability measure on X s × X t . Assume all measures are Borel measures and X s , X t are two compact sets. Besides, let H be an RKHS of a class of functions F with kernel k; then for gF, g : XR, where X is a nonempty compact set, the reproducing property holds: < g(⋅), k(x,⋅) > = g(x) and < k(x,⋅), k(x′,⋅) > = k(x,x′), where <, > denotes the inner product operator. Thus, by Definition 4, the empirical estimator of the maximum mean distance (MMD) on RKHS embedding domain distributions is defined as [50]:

$$ MMD\left(F,{\mathbf{X}}_s,{\mathbf{X}}_t\right)={\left\Vert \frac{1}{n}{\displaystyle \sum_{i=1}^n\varphi \left({\mathbf{x}}_i\right)}-\frac{1}{m}{\displaystyle \sum_{j=1}^m\varphi \left({\mathbf{z}}_j\right)}\right\Vert}^2, $$
(3.15)

where x i X s , z j X t .

Specifically, based on Definition 4, we can give the following definitions of distance metrics on RKHS embedding distributions.

Definition 5 (Projected Maximum Mean Distance Metric on RKHS Embedding Domain Distributions)

Let f be the linear function f(x) = 〈w, ϕ(x)〉, where w is a projection vector. Then the projected maximum mean distance metric on RKHS embedding domain distributions is defined as follows:

$$ \begin{array}{l}{\gamma}_{KM}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right)={\left\Vert \frac{1}{n}{\displaystyle \sum_{i=1}^n{\mathbf{w}}^T\varphi \left({\mathbf{x}}_i\right)}-\frac{1}{m}{\displaystyle \sum_{j=1}^m{\mathbf{w}}^T\varphi \left({\mathbf{z}}_j\right)}\right\Vert}^2\\ {}={\mathbf{w}}^T\left(\frac{1}{n}{\displaystyle \sum_{i=1}^n\varphi \left({\mathbf{x}}_i\right)}-\frac{1}{m}{\displaystyle \sum_{j=1}^m\varphi \left({\mathbf{z}}_j\right)}\right){\left(\frac{1}{n}{\displaystyle \sum_{i=1}^n\varphi \left({\mathbf{x}}_i\right)}-\frac{1}{m}{\displaystyle \sum_{j=1}^m\varphi \left({\mathbf{z}}_j\right)}\right)}^T\mathbf{w},\end{array} $$
(3.16)

where x i X s , z j X t .

Definition 6 (Projected Maximum Scatter Distance Metric on RKHS Embedding Domain Distributions)

Let f be the linear function f(x) = 〈w, ϕ(x)〉, where w is a projection vector. Then, along the same line as Definition 5, the projected maximum scatter distance metric on RKHS embedding domain distributions is defined as

$$ {\gamma}_{KS}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right)={\mathbf{w}}^T\left|\frac{1}{n}{\displaystyle \sum_{i=1}^n\varphi \left({\mathbf{x}}_i\right){\left[\varphi \left({\mathbf{x}}_i\right)\right]}^T}-\frac{1}{m}{\displaystyle \sum_{j=1}^m\varphi \left({\mathbf{z}}_j\right){\left[\varphi \left({\mathbf{z}}_j\right)\right]}^T}\right|\mathbf{w}, $$
(3.17)

where xX s , zX t .

Definition 7 (GPMDD Metric on RKHS Embedding Domain Distributions)

By Definitions 5 and 6, the generalized projected maximum distribution distance (GPMDD) metric on RKHS embedding domain distributions, with probability distributions p, qP, is defined as

$$ {\gamma}_{KM S}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right)=\left(1-\lambda \right){\gamma}_{KM}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right)+\lambda {\gamma}_{KS}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right), $$
(3.18)

where λ ∈ [0,1]; when λ = 0, γ KMS = γ KM . The parameter λ acts as a trade-off between the distribution mean and the distribution scatter (or variance). When λ increases, γ KMS is biased in favor of preserving the scatter consistency between the two domains; when λ decreases, it is biased in favor of preserving the mean consistency. Hence, the proposed method can preserve both the distribution consistency between domains and the discriminative information in both domains.
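
As an illustration of Definitions 5–7 in the simplest setting, take the identity feature map ϕ(x) = x and read the |⋅| in Eq. (3.17) as an elementwise absolute value (an assumption on our part); then γ KM , γ KS , and their combination γ KMS reduce to a few lines of linear algebra. The data and w below are illustrative only.

```python
import numpy as np

def gamma_km(w, Xs, Xt):
    """Projected mean distance, Eq. (3.16), with identity features."""
    d = Xs.mean(0) - Xt.mean(0)
    return float((w @ d) ** 2)

def gamma_ks(w, Xs, Xt):
    """Projected scatter distance, Eq. (3.17); |.| taken elementwise (assumption)."""
    S = np.abs(Xs.T @ Xs / len(Xs) - Xt.T @ Xt / len(Xt))
    return float(w @ S @ w)

def gamma_kms(w, Xs, Xt, lam):
    """GPMDD, Eq. (3.18): (1 - lam) * gamma_KM + lam * gamma_KS."""
    return (1.0 - lam) * gamma_km(w, Xs, Xt) + lam * gamma_ks(w, Xs, Xt)

rng = np.random.default_rng(2)
Xs = rng.standard_normal((50, 3))
Xt = rng.standard_normal((60, 3)) + np.array([1.0, 0.0, 0.0])
w = np.array([1.0, 1.0, 0.0])
print(gamma_kms(w, Xs, Xt, lam=0.0))  # equals gamma_km when lam = 0
```

In the actual DAKSVM formulation the same quantities are computed in the kernel space, which is what produces the matrices Ω₁ and Ω₂ of Theorem 3 below is replaced by: which is what leads to the kernelized objective developed in Sect. 3.3.3.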

The following theorem guarantees that the GPMDD sufficiently measures the discrepancy between the two domain distributions.

Theorem 1 [51]

Let F be a unit ball defined in some universal RKHS H with a kernel k(⋅,⋅), defined on a compact metric space. Let X s and X t be two compact sets generated from Borel probability measures p and q, respectively, in this metric space. Then γ KMS (F,X s ,X t ) = 0 if and only if p = q.

3.3.3 Domain Adaptation Kernelized Support Vector Machine

Inspired by the idea of manifold regularization, MMD-based methods for transductive transfer learning (e.g., LMPROJ [28] and DTSVM [29], etc.) can be formulated as follows:

$$ \begin{array}{l}f=\underset{\mathrm{w}\in {H}_K}{ \min }C{\displaystyle \sum_{i=1}^n{\xi}_i}+\frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}_K^2+\lambda {\gamma}_{KM}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right)\\[1.3pc] {}s.t.\begin{array}{cc}\hfill \hfill & \hfill \begin{array}{l}{y}_i\left({\mathbf{w}}^T\phi \left({\mathbf{x}}_i\right)+b\right)\ge 1-{\xi}_i,\\[8pt] {}{\xi}_i\ge 0,i=1,\dots, n\end{array}\hfill \end{array}\end{array} $$
(3.19)

where w is a normal projection vector, k is a kernel with ϕ a kernel mapping, H K is a set of functions in the kernel space, λ is a balance parameter, and γ KM (f,X s ,X t ) is the projected distribution mean distance metric between source and target domains, where x i X s .

However, Eq. (3.19) discloses a key limitation of MMD-based methods: they do not sufficiently consider the scatter statistics, which may carry underlying discriminative information in both domains for DAL, and hence they may lead to overfitting in some pattern recognition applications. Therefore, in this chapter, we propose a robust DAKSVM regularized by the GPMDD metric on RKHS embedding domain distributions, which partially extends the ideas of classical SVMs and MMD. The key goal of our method is to find a feature transform such that the mean and variance distances between the distributions of the testing and training data are sufficiently minimized, while at the same time maximizing the class margin (or some classification performance criterion) for the training data, thus learning a robust model that makes effective predictions for the target domain.

3.3.3.1 Objective Function of DAKSVM

For simplicity, we first consider only binary classification problems; we then propose a least-square DAKSVM (LSDAKSVM), based on the classical LS-SVM [52], as an extension of the standard DAKSVM to multi-class classification problems.

For DAL problems, DAKSVM aims to find a linear transform f(x) = w T φ(x) in a universal RKHS with Gaussian kernel mapping, where w is a linear projection vector, in order to minimize the between-domain distribution discrepancy as well as to reduce the empirical risk of the classification decision function as much as possible, thus implementing transfer learning across domains. DAKSVM can be formulated as

$$ \min f=C{\displaystyle \sum_{i=1}^nV\left({\mathbf{x}}_i,{y}_i,f\right)+}{\gamma}_{KMS}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right),$$
(3.20)

where x i X s is a set of training data and matrix ϕ(X s ) = (ϕ(x s1), ϕ(x s2), …, ϕ(x sn )), y i Y s is the class label corresponding to x i , C > 0 is a regularization coefficient, and V measures the fitness of the function in terms of predicting the class labels for the training data and is called the risk function. The hinge loss function is a commonly used risk function in the form of V = (1 − y i f(x i ))+ [53] in which (x)+ = x if x ≥ 0 and zero otherwise.

Therefore, the linear function f in Eq. (3.20) represented by a vector w can be represented as

$$ \begin{array}{l} \arg \underset{\mathbf{w},b,\boldsymbol{\upxi}}{ \min }f=C{\displaystyle \sum_{i=1}^n{\xi}_i}+{\gamma}_{KMS}\left(f,{\mathbf{X}}_s,{\mathbf{X}}_t\right)\\ {}\mathrm{s}.\mathrm{t}.\kern0.5em {y}_i\left(\left\langle \mathbf{w},\varphi \left({\mathbf{x}}_i\right)\right\rangle +b\right)\ge 1-{\xi}_i,\kern0.5em i=1,2,\dots, n\end{array} $$
(3.21)

To solve the primal problem in Eq. (3.21) effectively, we introduce the following revised Representer Theorem for DAL problems:

Theorem 2 (Representer Theorem [54] for DAL)

For a DAL problem, let ψ : [0, ∞) → R denote a strictly monotonically increasing function, X = X s X t be a dataset, and c : (X × R 2)nR ∪ {∞} be any loss function. Then the regularized risk function is defined as

$$ R(f)=c\left({\left({\boldsymbol{x}}_i,{y}_i,f\left({\boldsymbol{x}}_i\right)\right)}_{i=1}^n\right)+\psi \left(\left|\right|f\left|\right|{}_H^2\right), $$
(3.22)

where fH is represented as

$$ f\left(\mathbf{x}\right)={\displaystyle \sum_{i=1}^n{\beta}_i k\left({\mathbf{x}}_i,\mathbf{x}\right)}+{\displaystyle \sum_{j=1}^m{\beta}_{n+j} k\left({\mathbf{z}}_j,\mathbf{x}\right)}, $$
(3.23)

where k is a kernel, x i X s , y i Y s , z j X t , and the β's are expansion coefficients.

By Theorem 2, we can have the following theorem.

Theorem 3

The primal of DAKSVM can be formulated as

$$\begin{array}{l}\underset{\boldsymbol{\upbeta}, \boldsymbol{\upxi}, b}{ \min }f=\frac{1}{2}{\boldsymbol{\upbeta}}^T\boldsymbol{\Omega} \boldsymbol{\upbeta} +C{\displaystyle \sum_{i=1}^n{\xi}_i},\end{array} $$
(3.24a)
$$ \mathrm{s}.\mathrm{t}.\kern0.5em {y}_i\left({\displaystyle \sum_{j=1}^{n+m}{\beta}_j{k}_{\sigma}\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)}+b\right)\ge 1-{\xi}_i,\kern1em i=1,\dots, n, $$

where x i X s , x j X s X t , Ω is a positive semi-definite kernel matrix with

$$ \boldsymbol{\Omega} =\left(1-\lambda \right){\boldsymbol{\Omega}}_1+\lambda {\boldsymbol{\Omega}}_2 $$
(3.24b)

where Ω 1 is an (n + m) × (n + m) symmetric positive semi-definite kernel matrix defined as

$$ {\boldsymbol{\Omega}}_1=\frac{1}{n^2}{\mathbf{K}}_s{\left[1\right]}^{n\times n}{\mathbf{K}}_s^T+\frac{1}{m^2}{\mathbf{K}}_t{\left[1\right]}^{m\times m}{\mathbf{K}}_t^T-\frac{1}{ nm}\left({\mathbf{K}}_s{\left[1\right]}^{n\times m}{\mathbf{K}}_t^T+{\mathbf{K}}_t{\left[1\right]}^{m\times n}{\mathbf{K}}_s^T\right) $$
(3.24c)

and Ω 2 is an (n + m) × (n + m) symmetric positive semi-definite kernel matrix defined as

$$ {\boldsymbol{\Omega}}_2=\left|\frac{1}{n}{\mathbf{K}}_s{\mathbf{K}}_s^T-\frac{1}{m}{\mathbf{K}}_t{\mathbf{K}}_t^T\right| $$
(3.24d)

where K s is an (n + m) × n kernel matrix for the training data, K t is an (n + m) × m kernel matrix for the test data, and [1]k × l is a k × l matrix of all ones.
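
The structure of Ω 1 in Eq. (3.24c) becomes transparent in code: since [1]k × l = 1 k 1 l T, it collapses to a rank-one outer product of the vector u = (1/n)K s 1 n − (1/m)K t 1 m . The sketch below uses random stand-ins for K s and K t , and reads the |⋅| in Eq. (3.24d) as an elementwise absolute value (an assumption):

```python
import numpy as np

def build_omega(Ks, Kt, lam):
    """Omega = (1 - lam) * Omega1 + lam * Omega2 from Eqs. (3.24b)-(3.24d)."""
    n, m = Ks.shape[1], Kt.shape[1]
    s = Ks.sum(1)                                    # Ks @ 1_n
    t = Kt.sum(1)                                    # Kt @ 1_m
    Omega1 = (np.outer(s, s) / n**2 + np.outer(t, t) / m**2
              - (np.outer(s, t) + np.outer(t, s)) / (n * m))
    Omega2 = np.abs(Ks @ Ks.T / n - Kt @ Kt.T / m)   # |.| read elementwise
    return (1.0 - lam) * Omega1 + lam * Omega2, Omega1, Omega2

rng = np.random.default_rng(3)
n, m = 5, 4
Ks = rng.random((n + m, n))                          # stand-in kernel matrices
Kt = rng.random((n + m, m))
Omega, Omega1, Omega2 = build_omega(Ks, Kt, lam=0.3)
u = Ks.sum(1) / n - Kt.sum(1) / m
print(np.allclose(Omega1, np.outer(u, u)))           # Omega1 is rank-one: u u^T
```

The rank-one form of Ω 1 mirrors the fact that the MMD term compares only the two embedded means, while Ω 2 carries the second-order (scatter) information.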

Theorem 4

The dual of the primal in Eq. (3.24) can be formulated as

$$ \begin{array}{l}\underset{\boldsymbol{\upalpha}}{ \min}\frac{1}{2}{\boldsymbol{\upalpha}}^T{\mathbf{H}}^{\phi}\boldsymbol{\upalpha} -{1}^T\boldsymbol{\upalpha} \end{array}$$
(3.25)
$$ \mathrm{s}.\mathrm{t}.\kern0.5em 0\le {\alpha}_i\le C,\kern0.5em i=1,\dots, n, $$
$$ {\displaystyle \sum_{i=1}^n{\alpha}_i{y}_i}=0, $$

where \( {\mathbf{H}}^{\phi }=\tilde{\mathbf{Y}}{\mathbf{K}}_s^T{\left(\boldsymbol{\Omega} \right)}^{-1}{\mathbf{K}}_s\tilde{\mathbf{Y}} \), and \( \tilde{\mathbf{Y}}= diag\left({y}_1,{y}_2,\dots, {y}_n\right) \), y i Y s .

In the same way as for the classical SVM, the bias b ϕ in the kernel space can be formulated as

$$ {b}^{\phi }=-\frac{1}{2}\left(\frac{1}{\left|{\mathbf{X}}_{s+}\right|}{\displaystyle \sum_{\mathbf{x}\in {\mathbf{X}}_{s+}}{\displaystyle \sum_{j=1}^{n+m}{\beta}_j{k}_{\sigma}\left({\mathbf{x}}_j,\mathbf{x}\right)}}+\frac{1}{\left|{\mathbf{X}}_{s-}\right|}{\displaystyle \sum_{\mathbf{x}\in {\mathbf{X}}_{s-}}{\displaystyle \sum_{j=1}^{n+m}{\beta}_j{k}_{\sigma}\Big({\mathbf{x}}_j,\mathbf{x}}}\Big) \right) $$
(3.26a)

Meanwhile, we can obtain the solution of β from duality as follows:

$$ \boldsymbol{\beta} ={\left(\boldsymbol{\Omega} \right)}^{-1}{\mathbf{K}}_s\overset{\sim }{\mathbf{Y}}\boldsymbol{\alpha} $$
(3.26b)

3.3.3.2 Learning Algorithm of DAKSVM

The proposed DAKSVM algorithm can be summarized as follows.

Algorithm DAKSVM

Input:

Dataset matrix \( \mathbf{X}=\left({\left\{{\mathbf{x}}_i,{y}_i\right\}}_{i=1}^n,{\left\{{\mathbf{z}}_j\right\}}_{j=1}^m\right) \), x i X s , y i Y s , z j X t ; set the Gaussian kernel bandwidths σ and σ/γ in γ KM and γ KS of GPMDD, respectively.

Output:

Decision function f(x).

Step 1:

Determine the parameter γ in γ KS of GPMDD such that the scatter consistency between source and target domains is maximized.

Step 2:

Compute the matrices Ω 1 and Ω 2 by Eqs. (3.24c) and (3.24d), respectively. For the λ given by the user, construct the matrix Ω = (1 − λ)Ω 1 + λ Ω 2.

Step 3:

For the given C, find the optimal vector β by applying Theorem 4 to solve the corresponding dual problem, and then recover the optimal normal vector w and the bias b ϕ from β.

Step 4:

Output the decision function f(x) = w T ϕ(x) + b ϕ.
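
Given β and b ϕ, the decision function of the final step evaluates, by Theorem 2, as a kernel expansion over all n + m source and target points. A minimal sketch with illustrative values (in practice β and b would come from Eqs. (3.26b) and (3.26a)):

```python
import numpy as np

def decision_function(x, points, beta, b, sigma=1.0):
    """f(x) = sum_j beta_j * k_sigma(x_j, x) + b over the n + m expansion points."""
    sq = ((points - x) ** 2).sum(1)
    return float(beta @ np.exp(-sq / (2.0 * sigma ** 2)) + b)

rng = np.random.default_rng(4)
points = rng.standard_normal((7, 2))     # stacked source + target samples
beta = rng.standard_normal(7)            # would come from Eq. (3.26b)
b = 0.1                                  # would come from Eq. (3.26a)
fx = decision_function(np.zeros(2), points, beta, b)
print(np.sign(fx))                       # predicted class
```

Note that, unlike a standard SVM, the expansion ranges over target-domain points as well, which is how the unlabeled target data influence the learned boundary.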

3.3.4 Variants and Extensions

3.3.4.1 Least-Square DAKSVM

One variant of DAKSVM is LSDAKSVM, which is based on the idea of LS-SVM [52] and can be formulated as:

$$ \begin{array}{l} \arg \underset{\mathbf{w},b,\xi }{ \min }f=\frac{C}{2}{\displaystyle \sum_{i=1}^n{\xi}_i^2}+{\gamma}_{KMS}\left(p,q\right)\\ {}\mathrm{s}.\mathrm{t}.\kern0.5em \left(\mathbf{w},\phi \left({\mathbf{x}}_i\right)\right)+b={y}_i-{\xi}_i,i=1,2,\dots, n.\end{array} $$
(3.27)

Along the same line as DAKSVM, the primal of Eq. (3.27) is defined as

$$ \begin{array}{l}\underset{\boldsymbol{\upbeta}, \boldsymbol{\upxi}, b}{ \min }f=\frac{1}{2}{\boldsymbol{\upbeta}}^T\boldsymbol{\Omega} \boldsymbol{\upbeta} +\frac{C}{2}{\displaystyle \sum_{i=1}^n{\xi}_i^2},\end{array} $$
(3.28)
$$ \mathrm{s}.\mathrm{t}.\kern1em {\displaystyle \sum_{j=1}^{n+m}{\beta}_j{k}_{\sigma /\gamma}\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)}+b={y}_i-{\xi}_i,\kern1em {\xi}_i\ge 0,i=1,\dots, n. $$

Theorem 5 (Analytic Solution to Binary Class Case)

Given the parameter λ ∈ [0,1], for a binary classification problem, the optimal solution of Eqs. (3.27) and (3.28) is equivalent to the linear system of equations with respect to variable α as follows:

$$ \left[\begin{array}{c@{\quad}c}\hfill 0\hfill & \hfill {\mathbf{1}}_n^T\hfill \\ {}\hfill {\mathbf{1}}_n\hfill & \hfill \tilde{\boldsymbol{\Omega}}\hfill \end{array}\right]\left[\begin{array}{c}\hfill b\hfill \\ {}\hfill \boldsymbol{\upalpha} \hfill \end{array}\right]=\left[\begin{array}{c}\hfill 0\hfill \\ {}\hfill {\mathbf{Y}}_s\hfill \end{array}\right], $$
(3.29)

where 1 n = [1, …,1]T, α = [α 1, …,α n ]T, Y s = [y 1, …,y n ]T, \( \tilde{\boldsymbol{\Omega}}={\mathbf{K}}_s^T{\left(\boldsymbol{\Omega} \right)}^{-1}{\mathbf{K}}_s+\frac{{\mathbf{I}}_n}{C} \), I n is an n-dimensional identity matrix.
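
Theorem 5 reduces training to a single linear solve. The sketch below builds the block system of Eq. (3.29) for a synthetic Ω̃ (a random symmetric positive definite matrix standing in for K s T Ω −1 K s + I n /C) and checks the solution by back-substitution:

```python
import numpy as np

def solve_lsdaksvm(Omega_tilde, Ys):
    """Solve Eq. (3.29): [[0, 1^T], [1, Omega_tilde]] [b; alpha] = [0; Ys]."""
    n = len(Ys)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega_tilde
    sol = np.linalg.solve(A, np.concatenate(([0.0], Ys)))
    return sol[0], sol[1:]              # bias b, dual coefficients alpha

rng = np.random.default_rng(5)
n = 6
M = rng.standard_normal((n, n))
Omega_tilde = M @ M.T + np.eye(n)       # stand-in for K_s^T Omega^{-1} K_s + I_n / C
Ys = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
b, alpha = solve_lsdaksvm(Omega_tilde, Ys)
print(abs(alpha.sum()))                 # first block row enforces sum(alpha) = 0
```

This is the usual LS-SVM trade-off: no QP is needed, at the cost of losing sparsity in α.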

As for multi-class classification problems, the traditional approach is to decompose a multi-class problem into several binary classification problems in a one-against-one (OAO) or one-against-all (OAA) manner. However, the main drawbacks of this approach are its high computational complexity and the imbalance between classes. Hence, we introduce vector-labeled outputs into the solution of LSDAKSVM, which makes the computational complexity independent of the number of classes and requires no more computation than a single binary classifier [55]. Furthermore, Szedmak and Shawe-Taylor [55] pointed out that this technique does not reduce the classification performance of a learning model and in some cases can improve it, compared with OAO and OAA. Therefore, we represent the class labels according to the one-of-c rule: if training sample x i (i = 1, …, n) belongs to the kth class, then its class label is \( {\boldsymbol{Y}}_i={\left[\underset{k}{\underbrace{0,\dots, 1}},\dots, 0\right]}^T\in {\mathbf{R}}^c \), where the kth element is one and all the other elements are zero. Hence, for multi-class classification problems, the optimization problem of LSDAKSVM can be formulated as

$$ \begin{array}{l}\underset{\boldsymbol{\upbeta}, \boldsymbol{\upxi}, b}{ \min }f=\frac{1}{2}{\tilde{\boldsymbol{\upbeta}}}^T\boldsymbol{\Omega} \tilde{\boldsymbol{\upbeta}}+\frac{C}{2}{\displaystyle \sum_{i=1}^n{\xi}_i^2},\ \end{array} $$
(3.30)
$$ \mathrm{s}.\mathrm{t}.\kern0.5em {\tilde{\boldsymbol{\upbeta}}}^T{\mathbf{K}}_s+b={Y}_i-{\xi}_i,i=1,\dots, n, $$

where \( \tilde{\boldsymbol{\upbeta}}\in {R}^{n\times c} \), bR c.
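The one-of-c encoding used above can be sketched with illustrative labels as follows:

```python
import numpy as np

def one_of_c(labels, c):
    """Encode integer class labels 0..c-1 as one-of-c rows Y_i in R^c."""
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

Y = one_of_c([0, 2, 1, 2], c=3)
print(Y)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```

Each row has exactly one nonzero entry, so a single multi-output regression replaces the c (or c(c−1)/2) separate binary machines of OAA/OAO.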

Theorem 6 (Analytic Solution to Multi-Class Case)

Given the parameter λ ∈ [0,1], for a multi-class classification problem, the optimal solution of Eq. (3.30) is equivalent to the linear system of the following equation.

$$ \left[\begin{array}{c@{\quad}c}\hfill b\hfill & \hfill \boldsymbol{\upalpha} \hfill \end{array}\right]\left[\begin{array}{c@{\quad}c}\hfill 0\hfill & \hfill {1}_n^T\hfill \\ {}\hfill {1}_n\hfill & \hfill \tilde{\boldsymbol{\Omega}}\hfill \end{array}\right]=\left[\begin{array}{cc}\hfill {0}_c\hfill & \hfill {\tilde{\mathbf{Y}}}_s\hfill \end{array}\right], $$
(3.31)

where 0 c = [0, …, 0]T, α = [α 1, …, α n ]T, \( {\tilde{\mathbf{Y}}}_s={\left[{Y}_1,{Y}_2,\dots, {Y}_n\right]}^T \), and \( \overset{\sim }{\boldsymbol{\Omega}} \) is the same as in Theorem 5.

Theorems 5 and 6 provide the LSDAKSVM formulations for binary and multi-class classification problems, respectively. Equations (3.20) and (3.23) clearly show that LSDAKSVM keeps the same solution framework for both the binary and multi-class cases.
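The analytic solution of Theorem 6 reduces training to a single linear solve. The following sketch assembles and solves a system of the same shape, using a randomly generated symmetric positive-definite stand-in for \( \overset{\sim }{\boldsymbol{\Omega}} \) and random one-of-c targets (both for illustration only, not the actual DAKSVM matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 3

# Hypothetical stand-in for Omega-tilde: any symmetric positive-definite
# n x n matrix suffices for this sketch.
A = rng.standard_normal((n, n))
Omega = A @ A.T + n * np.eye(n)

# One-of-c targets Y_s (n x c).
Ys = np.eye(c)[rng.integers(0, c, size=n)]

# Assemble the (n+1) x (n+1) system [[0, 1_n^T], [1_n, Omega]] [b; alpha] = [0_c^T; Y_s].
M = np.zeros((n + 1, n + 1))
M[0, 1:] = 1.0
M[1:, 0] = 1.0
M[1:, 1:] = Omega

rhs = np.vstack([np.zeros((1, c)), Ys])
sol = np.linalg.solve(M, rhs)
b, alpha = sol[0], sol[1:]  # bias vector b in R^c, coefficients alpha (n x c)
```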

3.3.4.2 μ-Domain Adaptation Kernelized Support Vector Machine

The ν-SVM [56] is a typical extension of SVM for classification, in which Schölkopf et al. introduced a new parameter ν in place of C to control the number of support vectors and the training errors. More details about ν-SVM can be found in [56]. Hence, as the second variant of DAKSVM based on ν-SVM, μ-DAKSVM can be formulated as:

$$ \begin{array}{l}\underset{\boldsymbol{\upbeta}, \boldsymbol{\upxi}, b}{ \min }f=\frac{1}{2}{\boldsymbol{\upbeta}}^T\boldsymbol{\Omega} \boldsymbol{\upbeta} -\mu \rho +\frac{1}{N}{\displaystyle \sum_{i=1}^n{\xi}_i}, \end{array}$$
(3.32)
$$ \mathrm{s}.\mathrm{t}.\kern0.5em {y}_i\left({\displaystyle \sum_{j=1}^N{\beta}_j{k}_{\sigma /\gamma}\left({\mathbf{x}}_i,{\mathbf{x}}_j\right)}+b\right)\ge \rho -{\xi}_i,\kern0.5em i=1,\dots, n, $$

where N = n + m, and the variables ρ ≥ 0, μ > 0, and ξ i ≥ 0 have the same meanings as in ν-SVM. Similar to ν-SVM, the dual of the primal problem in Eq. (3.32) can be formulated as:

$$ \begin{array}{l}\underset{\boldsymbol{\alpha}}{ \min}\frac{1}{2}{\boldsymbol{\alpha}}^T{\mathbf{H}}^{\phi}\boldsymbol{\alpha} \end{array}$$
(3.33)
$$ \mathrm{s}.\mathrm{t}.\kern0.5em 0\le {\alpha}_i\le \frac{1}{N},\kern0.5em i=1,\dots, n, $$
$${\displaystyle \sum_{i=1}^n{\alpha}_i{y}_i}=0, $$
$${\displaystyle \sum_{i=1}^n{\alpha}_i}\ge \mu, $$

where \( {\mathbf{H}}^{\varphi }=\overset{\sim }{\mathbf{Y}}{\mathbf{K}}_s^T{\left(\overline{\boldsymbol{\Omega}}\right)}^{-1}{\mathbf{K}}_s\overset{\sim }{\mathbf{Y}} \), \( \overset{\sim }{\mathbf{Y}}= diag\left({y}_1,{y}_2,\dots, {y}_n\right) \), and y i ∈ Y s .
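For intuition on the role of μ (the counterpart of ν in ν-SVM), the sketch below trains a plain ν-SVM with scikit-learn's NuSVC; this is the standard ν-SVM of [56], not the domain-adapted μ-DAKSVM, and the dataset is synthetic. In ν-SVM, ν upper-bounds the fraction of margin errors and lower-bounds the fraction of support vectors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = NuSVC(nu=0.2, kernel="rbf", gamma="scale").fit(X, y)

# nu = 0.2 guarantees at least 20% of the training points become support vectors.
frac_sv = clf.support_.size / len(X)
```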

3.3.5 Computational Complexity Analysis

In terms of Algorithm 1, DAKSVM and its variants can be implemented with a standard SVM solver (e.g., LibSVM [57]) using the quadratic form induced by the aforementioned matrix Ω, and the optimal solution is then used to obtain the expansion coefficients by Eqs. (3.35) and (3.13)–(3.15), respectively. It is worth noting that our algorithms compute the inverse of a dense Gram matrix Ω, which leads to O((n + m)3) training complexity, comparable to that of SVM. This seems impractical for large datasets. However, for highly sparse datasets, for example, in text categorization problems, effective conjugate gradient schemes can be used in a large-scale implementation [58]. For the nonlinear case, one may obtain approximate solutions (e.g., using greedy matching pursuit techniques) in which the optimization problem is solved over the span of a small set of basis functions instead of the full representation f(x) = w T ϕ(x). Besides, CVM [59] may be an alternative choice for addressing the scalability issues of SVM learning. The testing complexity of DAKSVM depends on the number of support vectors learned in the training stage. In fact, the proposed DAKSVM and its variants take less than half a minute to finish the whole prediction for the target-domain test samples in most of the following experiments.
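To illustrate the sparse-data escape hatch mentioned above, the sketch below solves a large, highly sparse symmetric positive-definite system with conjugate gradients instead of a cubic-cost dense inverse; the matrix here is synthetic, not an actual Gram matrix Ω:

```python
import numpy as np
from scipy.sparse import identity, random as sparse_random
from scipy.sparse.linalg import cg

n = 2000
# Synthetic, highly sparse SPD matrix standing in for a sparse kernel system.
S = sparse_random(n, n, density=0.001, random_state=0, format="csr")
A = S @ S.T + identity(n, format="csr")
b = np.ones(n)

# Each CG iteration costs O(nnz(A)) rather than the O(n^3) of a dense inverse.
x, info = cg(A, b)
```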

3.4 Experimental Results of KL-TSK-FS

3.4.1 Experimental Settings

The proposed inductive transfer learning method KL-TSK-FS is evaluated on both synthetic and real-world datasets. Details of the evaluation are given in Sects. 3.4.2 and 3.4.3, respectively. For clarity, the notations for the datasets and their definitions are listed in Table 3.2. Here, the datasets generated from the source domain and the target domain are denoted by D1 and D2, respectively. The proposed support vector learning-based KL-TSK-FS algorithm is evaluated from the following two aspects.

Table 3.2 Notations of the adopted datasets and their definitions
  1. 1.

    Comparison with traditional support vector learning-based L2-TSK-FS. The performance of KL-TSK-FS is compared comprehensively with that of three L2-TSK-FS methods implemented under different conditions. That is, four TSK-FS systems are constructed by (a) L2-TSK-FS based on the data in the source domain, (b) L2-TSK-FS based on the data in the target domain, (c) L2-TSK-FS based on the data in both the target domain and the source domain, and (d) the proposed KL-TSK-FS. They are denoted by L2-TSK-FS(D1), L2-TSK-FS (D2), L2-TSK-FS (D1 + D2), and KL-TSK-FS(D2 + Knowledge), respectively. With these four fuzzy systems, the testing data, i.e. D2_test, of the target domain are used to evaluate their generalization capability.

  2. 2.

    Comparison with regression methods designed for datasets with missing or noisy data. Three related regression methods are employed to compare with the proposed KL-TSK-FS for performance evaluation. The three methods include: (a) TS-fuzzy system-based support vector regression (TSFS-SVR) [60]; (b) fuzzy system learned through fuzzy clustering and SVM (FS-FCSVM) [61]; and (c) Bayesian task-level transfer learning for nonlinear regression method (HiRBF) [20].

The methods adopted for performance comparison from these two aspects are summarized in Table 3.3. The following generalization performance index J is used in the experiments [2],

$$ J=\sqrt{\frac{\frac{1}{N}{\displaystyle \sum_{i=1}^N{\left({y}_i^{\prime }-{y}_i\right)}^2}}{\frac{1}{N}{\displaystyle \sum_{i=1}^N{\left({y}_i-\overline{y}\right)}^2}}}, $$
(3.34)

where N is the number of test samples, y i is the desired output for the ith test input, y′ i is the fuzzy model output for the ith test input, and \( \overline{y}=\frac{1}{N}{\displaystyle \sum_{i=1}^N{y}_i} \). The smaller the value of J, the better the generalization performance.
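The index J can be computed directly from its definition in Eq. (3.34); a small sketch (function name ours):

```python
import numpy as np

def generalization_index(y_true, y_pred):
    """Index J of Eq. (3.34): root of the ratio of the model's mean squared
    error to the variance of the desired outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_pred - y_true) ** 2)
    var = np.mean((y_true - y_true.mean()) ** 2)
    return float(np.sqrt(mse / var))

# A perfect model gives J = 0; a model that always predicts the mean gives J = 1.
```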

Table 3.3 The methods adopted for performance comparison

In the experiments, the hyperparameters of all the methods adopted are determined by using the fivefold cross-validation strategy with the training datasets. All the algorithms are implemented using MATLAB on a computer with Intel Core 2 Duo P8600 2.4 GHz CPU and 2GB RAM.

3.4.2 Synthetic Datasets

3.4.2.1 Generation of Synthetic Datasets

Synthetic datasets are generated to simulate the domains in this study, subject to the following requirements: (1) the source domain should be related to the target domain, i.e., the source and target domains are different but related; (2) some of the data of the target domain are unavailable or missing, i.e., the data available from the target domain are insufficient.

Based on the above requirements, the function Y = f(x) = x · sin(x), x ∈ [−10, 10], is used to describe the source domain and to generate the dataset D1, while the function y = r · f(x) = r · x · sin(x), x ∈ [−10, 10], is used to describe the target domain and to generate the training dataset D2 and testing dataset D2_test for the target domain. Here, r is a relation parameter that controls the degree of similarity/difference between the two domains; when r = 1, the two domains are identical. The lack of information (data insufficiency) is simulated by introducing intervals with missing data into the training set generated for the target domain. The settings for generating the synthetic datasets are described in Table 3.4. For example, the two functions used to simulate the two related domains, with relation parameter r = 0.85, are shown in Fig. 3.3a. The dataset of the source domain and the training set of the target domain, generated with the same relation parameter (i.e., r = 0.85), are shown in Fig. 3.3b. The figures also show the two data-missing intervals, [−6, −3] and [0, 4], introduced into the target training dataset.
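Under the settings above, the two synthetic domains can be generated as follows (sample sizes and the random seed are our own illustrative choices; Table 3.4 gives the actual settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_domain(r, n, missing=()):
    """Sample y = r * x * sin(x) on [-10, 10], dropping points that fall
    inside any of the given data-missing intervals."""
    x = rng.uniform(-10.0, 10.0, size=n)
    for lo, hi in missing:
        x = x[(x < lo) | (x > hi)]
    return x, r * x * np.sin(x)

x_src, y_src = sample_domain(r=1.0, n=400)              # source-domain dataset D1
x_tgt, y_tgt = sample_domain(r=0.85, n=200,
                             missing=[(-6, -3), (0, 4)])  # target training set D2
```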

Table 3.4 Details of the synthetic datasets
Fig. 3.3
figure 3

Functions representing two different domains with the relation parameter r = 0.85 and the corresponding sampled data from these domains: (a) the functions representing the source domain (Y) and the target domain (y); (b) the data of the source domain and the training data of the target domain with missing data in the intervals [−6, −3] and [0, 4]

3.4.2.2 Comparing with the Traditional Support Vector Learning-Based L2-TSK-FS Modeling Methods

The performance of the proposed KL-TSK-FS and the three traditional L2-norm penalty-based TSK-FS modeling methods is evaluated and compared using the synthetic datasets. The experimental results are shown in Table 3.5 and Fig. 3.4. In Table 3.5 and the other tables in this chapter, the bold values denote the best results obtained among all the methods. The following observations can be made from the results:

Table 3.5 Generalization performance (J) of the proposed KL-TSK-FS method and the traditional L2-TSK-FS methods on the synthetic datasets
Fig. 3.4
figure 4

Modeling results of the proposed KL-TSK-FS method and three traditional L2-TSK-FS methods by using the synthetic datasets shown in Fig. 3.3b: (a) L2-TSK-FS based on the data of the source domain (D1); (b) L2-TSK-FS based on the data of the target domain (D2); (c) L2-TSK-FS based on the data of both the source and target domains (D1 + D2); (d) the proposed KL-TSK-FS (D2 + Knowledge)

  1. 1.

    It can be seen from Table 3.5 that the generalization performance of the knowledge-leverage-based fuzzy system KL-TSK-FS is better than that of the traditional L2-TSK-FS methods.

  2. 2.

    Figure 3.4a shows the modeling results of the L2-TSK-FS obtained by using the data of the source domain only. The results indicate that drifting exists between the source domain and the target domain, as evident from the discrepancies between the two curves in the figure. Hence, the generalization performance of the TSK-FS obtained by L2-TSK-FS from the source domain is weak for the target domain. The findings show that the use of the data of the source domain alone is not appropriate for the modeling of the target domain.

  3. 3.

    Figure 3.4b shows the modeling results of the L2-TSK-FS obtained by using the data of the target domain only. The results indicate that the generalization performance of the TSK-FS obtained by L2-TSK-FS is even weaker for the target domain. An obvious reason is that the data in the training set are insufficient, which degrades the generalization capability of the obtained TSK-FS. The prediction performance is especially poor in the intervals with missing data in the training dataset.

  4. 4.

    Figure 3.4c shows the modeling results of the L2-TSK-FS obtained by using the data of both the target domain and the source domain. Although the data of both domains have been used for training, the generalization performance of the obtained TSK-FS is still not good enough for the target domain. This can be explained by two reasons. First, drifting occurs between the source and target domains, i.e., not all data in the source domain are useful for the modeling task of the target domain; some may even have a negative influence. Second, the source-domain dataset is larger than the target-domain dataset, which makes the obtained TSK-FS more apt to approximate the source domain rather than the target domain.

  5. 5.

    Figure 3.4d shows the modeling results of the proposed KL-TSK-FS. The following observations can be made by comparing its results with those of the three L2-TSK-FS methods. First, comparing Fig. 3.4a and d, we see that KL-TSK-FS gives better prediction results than the L2-TSK-FS that only uses the source-domain data. Second, it is evident from Fig. 3.4b and d that, by introducing the knowledge-leverage mechanism, the proposed KL-TSK-FS effectively remedies the deficiency of the L2-TSK-FS trained on the target-domain data alone. Comparing Fig. 3.4c and d, we also find that KL-TSK-FS demonstrates better generalization performance than the L2-TSK-FS that employs the data of both the source and target domains. It is worth pointing out that KL-TSK-FS also offers better privacy protection than the methods that use the source-domain data directly. When the data in the source domains are unavailable due to privacy protection, or in situations where the knowledge is only partially revealed, methods requiring the data of all domains are no longer feasible. Therefore, the proposed KL-TSK-FS is particularly suitable for these situations owing to its privacy-preserving nature.

3.4.2.3 Comparing with Regression Methods Designed for Missing or Noisy Data

The performance of the proposed KL-TSK-FS method is evaluated by comparing its performance with that of three regression methods designed for handling noisy/missing data, i.e., TSFS-SVR, FS-FCSVM, and HiRBF. The evaluation is performed on the synthetic datasets. The experimental results are shown in Table 3.6 and Fig. 3.5, and the following observations can be obtained:

Table 3.6 Generalization performance (J) of the proposed KL-TSK-FS method and three related regression methods on the synthetic datasets
Fig. 3.5
figure 5

Modeling results of the proposed KL-TSK-FS method and three related regression methods using the synthetic datasets in Fig. 3.3b: a TSFS-SVR, b FS-FCSVM, c HiRBF, and d KL-TSK-FS

  1. 1.

    KL-TSK-FS has demonstrated better generalization performance than the other three related methods.

  2. 2.

    The results in Fig. 3.5a, b show that the support vector learning-based fuzzy modeling methods TSFS-SVR and FS-FCSVM are able to give better generalization performance to a certain extent. For example, although the data in the interval [−6, −3] are missing, these two methods still demonstrate promising generalization capability at this interval. However, the generalization abilities of these two methods in the other data-missing interval [0, 4] are not satisfactory.

  3. 3.

    Although the transfer learning-based method HiRBF has used the data of both the target domain and the source domain in training, it is evident from Fig. 3.5c that this method cannot effectively cope with the problem caused by the missing data, still exhibiting poor generalization ability in the two data-missing intervals.

  4. 4.

    Figure 3.5d shows that the proposed KL-TSK-FS achieves acceptable generalization capability in the two data-missing intervals, indicating that the method effectively leverages the useful knowledge from the source domain and remedies the loss of generalization ability in the training procedure.

3.4.3 Real-World Datasets

3.4.3.1 The Glutamic Acid Fermentation Process Modeling

To further evaluate the performance of the proposed method, an experiment is conducted to model a biochemical process with real-world datasets [2]. The datasets adopted originate from the glutamic acid fermentation process, which is a multiple-input–multiple-output system. The input variables of the dataset include the fermentation time h, glucose concentration S(h), thalli concentration X(h), glutamic acid concentration P(h), stirring speed R(h), and ventilation Q(h), where h = 0, 2, ⋯, 28. The output variables are the glucose concentration S(h + 2), thalli concentration X(h + 2), and glutamic acid concentration P(h + 2) at the future time h + 2. The TSK-FS-based biochemical process prediction model is illustrated in Fig. 3.6. The data in this experiment were collected from 21 batches of fermentation processes, each batch containing 14 effective data samples. To match the situation discussed in this study, the data are divided into two domains, i.e., the source domain and the target domain, as described in Table 3.7.

Fig. 3.6
figure 6

Illustration of the glutamic acid fermentation process prediction model based on the TSK-FS

Table 3.7 The fermentation process modeling datasets

3.4.3.2 Comparing with the Traditional L2-TSK-FS Modeling Methods

The experimental results of fermentation process modeling using the proposed inductive transfer learning method KL-TSK-FS and the traditional L2-TSK-FS are given in Table 3.8 and Fig. 3.7. The findings are similar to those presented in Sect. 3.4.2 for the experiments performed on the synthetic datasets. The modeling results of KL-TSK-FS are better than those of the three traditional L2-TSK-FS methods. As the proposed method can effectively exploit not only the data of the target domain but also the useful knowledge of the source domains, the obtained TSK-FS demonstrates better adaptive abilities. It can be seen from the experimental results that, even if data in the training set of the target domain are missing, the generalization capability of the TSK-FS obtained by the proposed KL-TSK-FS does not degrade significantly. This feature is very valuable for biochemical process modeling, since the lack of sampled data is common due to the poor sensitivity of sensors in noisy environments.

Table 3.8 Generalization performance (J) of the proposed KL-TSK-FS method and the traditional L2-TSK-FS methods in fermentation process modeling
Fig. 3.7
figure 7

Performance comparison between the proposed KL-TSK-FS method and three traditional L2-TSK-FS methods in fermentation process modeling: the prediction results of a S(h + 2) for the 20th batch; b S(h + 2) for the 21st batch; c X(h + 2) for the 20th batch; d X(h + 2) for the 21st batch; e P(h + 2) for the 20th batch; and f P(h + 2) for the 21st batch

3.4.3.3 Comparing with the Regression Methods Designed for Missing or Noisy Data

The experimental results of fermentation process modeling using the proposed inductive transfer learning method KL-TSK-FS and three regression methods (i.e., TSFS-SVR, FS-FCSVM, and HiRBF) are shown in Table 3.9 and Fig. 3.8. Similar to the findings presented in Sect. 3.4.2.3 for the experiments conducted with the synthetic datasets, in general, the proposed KL-TSK-FS has demonstrated better generalization performance than the other three regression methods in fermentation process modeling. This can be explained again by the fact that the proposed KL-TSK-FS has effectively leveraged the useful knowledge from the source domain in the training procedure such that the influence of the missing data can be properly reduced.

Table 3.9 Generalization performance (J) of the proposed KL-TSK-FS method and several related regression methods in fermentation process modeling
Fig. 3.8
figure 8

Performance comparison between the proposed KL-TSK-FS method and three regression methods in fermentation process modeling: the prediction results of a S(h + 2) for the 20th batch; b S(h + 2) for the 21st batch; c X(h + 2) for the 20th batch; d X(h + 2) for the 21st batch; e P(h + 2) for the 20th batch; and f P(h + 2) for the 21st batch

3.5 Experimental Results of DAKSVM

3.5.1 Experiment Settings

To evaluate the effectiveness of the proposed transductive learning method DAKSVM and its extensions for DAL problems, we systematically compare them with several state-of-the-art algorithms on different datasets. We investigate three classes of domain adaptation problems: (1) a series of two-dimensional synthetic problems of different complexities based on a two-moon dataset; (2) several real-world cross-domain text classification problems on domain adaptation datasets such as 20Newsgroups, Reuters, email spam filtering, a web query set, and an Amazon sentiment review set; and (3) a real intra-domain multi-class classification problem on face recognition with the Yale and ORL datasets. For all these datasets, true labels are available for both source and target-domain instances; however, prior information related to the target domain D t is used only for an objective and quantitative assessment of the performance of the proposed algorithms.

We construct synthetic datasets (two-moon) to exhibit the performance of the proposed method and choose real-world datasets to show the classification performance of the proposed method DAKSVM and its extension μ-DAKSVM. We also carry out a multi-class classification experiment to show the performance of the proposed method LSDAKSVM in multi-class classification problems.

In the sequel, we first describe the experimental details. Throughout this experimental part, we use the standard Gaussian kernel function for the related kernel methods SVM, TSVM, KMM, TCA, LMPROJ, and DTSVM. For multiple kernel learning in DTSVM, following the setting in [29], we use four Gaussian base kernels with bandwidth 1.2δσ, where δ is set as {0, 0.5, 1, 1.5}. For our methods, we use the parameterized Gaussian kernel k σ/γ (x, x i ) = exp(−‖x − x i ‖2/2(σ/γ)2) in γ KS of GPMDD, where the kernel parameter σ is obtained by minimizing the MMD with the most conservative test, following the setting in [46]. Empirically, we first select σ as the square root of the mean norm of the training data for binary classification and σ√c (where c is the number of classes) for multi-class classification. The tunable parameter γ is set by minimizing the GPMDD with the optimal target test.
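The parameterized Gaussian kernel above, together with the (biased) empirical MMD it induces between the source and target samples, can be sketched as follows; this is a generic MMD estimate and the σ heuristic from the text, not the full GPMDD criterion:

```python
import numpy as np

def gaussian_kernel(X, Z, width):
    """k(x, z) = exp(-||x - z||^2 / (2 * width^2)), with width = sigma / gamma."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def mmd2(Xs, Xt, width):
    """Biased empirical squared MMD between source samples Xs and target samples Xt."""
    return (gaussian_kernel(Xs, Xs, width).mean()
            + gaussian_kernel(Xt, Xt, width).mean()
            - 2.0 * gaussian_kernel(Xs, Xt, width).mean())

def initial_sigma(X):
    """Heuristic from the text: square root of the mean norm of the training data."""
    return float(np.sqrt(np.linalg.norm(X, axis=1).mean()))
```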

Presently, how to choose the parameters of kernel methods remains an open research topic; in general, the parameters are set manually. To evaluate the performance of an algorithm, one strategy, as pointed out in [62], is to first specify a set of candidate parameter values and then use the best cross-validation mean rate over this set to estimate the generalization accuracy. We adopt this strategy in this work: fivefold cross validation on the training set is used for parameter selection, and the mean of the experimental results on the test data is used for performance evaluation. We choose the percentage overall accuracy AC% (i.e., the percentage of correctly labeled samples over the total number of samples) as the classification accuracy measure.
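The parameter-selection strategy above (a grid of candidate values scored by fivefold cross-validation) can be sketched with scikit-learn; the grid values and dataset are our illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic training data standing in for an actual source-domain training set.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate parameter values; the best fivefold-CV mean accuracy picks the winner.
grid = {"C": [0.1, 1.0, 10.0, 100.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5).fit(X, y)

best_params = search.best_params_
cv_accuracy = search.best_score_
```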

In our experiments, the SVM-type classifiers (SVM, ν-SVM, and TSVM) are implemented with the state-of-the-art software package LIBSVM [57]. As in the experiments in Sect. 3.4, all algorithms are implemented in MATLAB on a computer with an Intel Core 2 Duo P8600 2.4 GHz CPU and 2 GB RAM.

3.5.2 Synthetic Datasets

3.5.2.1 Generation of Synthetic Datasets

In this subsection, we conduct a series of trials on two-moon datasets to justify the proposed DAKSVM. In this toy problem, a series of two-moon datasets with different complexities is used to exhibit the generalization capability of DAKSVM for domain adaptation transfer learning. We compare the proposed DAKSVM with SVM and LMPROJ on these toy datasets.

A synthetic dataset containing 600 samples, generated according to a two-dimensional pattern of two intertwining moons associated with two information classes (300 samples each), is taken as the source-domain data, as shown in Fig. 3.9a. Target-domain data were generated by rotating the original source dataset anticlockwise 11 times, by 10°, 15°, 20°, 25°, 30°, 35°, 40°, 45°, 50°, 55°, and 60°, respectively. Due to the rotation, the source and target-domain data exhibit different distributions. In particular, the greater the rotation angle, the more complex the resulting domain adaptation problem, as confirmed by the Jensen–Shannon divergence values (D JS ) [6] shown in Fig. 3.10a. The proposed DAKSVM algorithm proves particularly effective for solving this kind of problem with high accuracy. Figure 3.9b, c shows the target-domain data for rotation angles of 30° and 60°, respectively.
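This construction can be reproduced with scikit-learn's two-moon generator and a rotation matrix (the noise level and seed are our illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_moons

def rotate(X, degrees):
    """Rotate 2-D points anticlockwise about the origin."""
    t = np.deg2rad(degrees)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return X @ R.T

# 600 source samples, 300 per moon; labels mark the two classes.
X_src, y_src = make_moons(n_samples=600, noise=0.05, random_state=0)

# Eleven target domains: rotations by 10, 15, ..., 60 degrees.
targets = {angle: rotate(X_src, angle) for angle in range(10, 61, 5)}
```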

Fig. 3.9
figure 9

Performance of different classifiers on two two-moon datasets with different complexities. a The original two-moon dataset; b rotation angle 30°; c rotation angle 60°; d classification accuracy for SVM: 95.4 %; e classification accuracy for SVM: 65 %; f classification accuracy for LMPROJ: 97.3 %; g classification accuracy for LMPROJ: 78.7 %; h classification accuracy for DAKSVM: 98.7 %; i classification accuracy for DAKSVM: 87.5 %

Fig. 3.10
figure 10

Jensen–Shannon divergence values and classification accuracies on target domain data for different rotation angles: a Jensen–Shannon divergence values for different rotation angles (D JS ); b accuracies exhibited on target domain data for different rotation angles

3.5.2.2 Comparing with the Related Methods

Figure 3.9d–i shows the classification accuracies of the different methods on the datasets shown in Fig. 3.9b, c, and Fig. 3.10b compares the different methods on the 11 target datasets described above. From Figs. 3.9d–i and 3.10b, we observe that, with appropriate learning parameters, the proposed method obtains nearly perfect separation between the classes for rotation angles ranging from 10° to 50°. In addition, we can make the following observations:

  1. 1.

    From Fig. 3.9d–i, we observe that the accuracies of DAKSVM and LMPROJ are always higher than those of SVM tuned by fivefold cross-validation on the source-domain data. This result shows that SVM is unsuitable for cross-domain learning. As shown in Figs. 3.9 and 3.10, for rotation angles from 10° to 50° the proposed method and LMPROJ preserve the solution consistency with the target domain to some extent, and the proposed method is better than, or at least comparable to, LMPROJ in this experiment.

  2. 2.

    Figure 3.10b shows that, for larger rotation angles (i.e., from 50° to 60°), the classification accuracies of all methods drop dramatically, which is reasonable given the increased complexity of the corresponding domain adaptation problems. However, the accuracy of the proposed method degrades more slowly than that of the others because it preserves the distributional consistency of both the means and the variances of the different domains. When the rotation angle is large enough, no method can keep the solution consistent with the target domain; in this case, the aforementioned hypothesis no longer holds.
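The Jensen–Shannon divergence used above to quantify problem complexity is a symmetrized, smoothed Kullback–Leibler divergence; a generic sketch for discrete distributions (the chapter applies it to the empirical domain distributions):

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in bits for discrete distributions; terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p, q):
    """D_JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

D_JS is zero for identical distributions and reaches 1 bit for distributions with disjoint support, matching its use as a complexity measure for increasingly rotated target domains.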

3.5.3 Binary Class Text Classification Datasets

In this section, we demonstrate the overall efficiency and effectiveness of the proposed DAKSVM and its variant μ-DAKSVM on five different real-world domain adaptation tasks for text data: 20Newsgroups, Reuters, email spam filtering, web query classification, and Amazon sentiment review classification.

In addition to SVM, KMM, DTSVM, LMPROJ, and TSVM, we also choose two algorithms from KDD'08 for comparison: the cross-domain spectral classifier [63] and the locally weighted ensemble (LWE) classifier [9].

Table 3.10 Cross-domain text classification tasks
Table 3.11 Web query text and sentiment reviews classification tasks

While SVM and TSVM are usually run with default parameter values, to make our comparison fair we report the best performance of each method over a range of parameter settings.

3.5.3.1 Dataset Settings

A brief description of each dataset and its setup is given in this subsection. Tables 3.10 and 3.11 summarize the datasets and give the indices that we refer to in our experimental results. For example, dataset 6 is a 20Newsgroups task (Rec vs. Sci) in which the numbers of positive and negative training samples are 1,984 and 1,977, and the numbers of positive and negative testing samples are 1,993 and 1,972, respectively.

3.5.3.1.1 20Newsgroups and Reuters

Reuters and 20Newsgroups are two cross-domain text classification datasets commonly used by state-of-the-art DAL classifiers [9, 28–30, 36, 64]. Both represent text categorization tasks. Reuters consists of news articles organized in five top-level categories, among which Orgs, Places, and People are the largest; the 20Newsgroups dataset contains 20 newsgroup categories, each with approximately 1,000 documents. For these text categorization data, the goal in each case is to correctly discriminate between articles at the top level, e.g., "sci" articles vs. "talk" articles, using different sets of sub-categories within each top category for training and testing, e.g., sci.electronics and sci.med vs. talk.politics.misc and talk.religion.misc for training, and sci.crypt and sci.space vs. talk.politics.guns and talk.politics.mideast for testing. For more details about the sub-categories, see [65]. Each set of sub-categories represents a different domain in which different words are more common. Features are obtained by converting the documents into bag-of-words representations, which are then transformed into feature vectors using the term frequency; details of this procedure can also be found in [65]. Table 3.10 gives more detailed information about the experimental datasets drawn from these corpora.
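The feature construction described above (bag-of-words counts normalized to term frequencies) can be sketched without any text-mining library; tokenization here is naive whitespace splitting, and the example documents are our own:

```python
from collections import Counter

import numpy as np

def tf_vectors(documents):
    """Map documents to term-frequency feature vectors over a shared vocabulary."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(documents), len(vocab)))
    for i, doc in enumerate(documents):
        counts = Counter(doc.lower().split())
        total = sum(counts.values())
        for word, count in counts.items():
            X[i, index[word]] = count / total  # term frequency
    return X, vocab

docs = ["space shuttle launch", "gun politics debate", "space politics"]
X, vocab = tf_vectors(docs)
```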

3.5.3.1.2 Email Spam Filtering

In the email spam filtering datasets [66], there are three email subsets (denoted by User1, User2, and User3) annotated by three different users. In this trial, the task is to classify spam and non-spam emails. Since the spam and non-spam emails in the subsets were identified by different users, the data distributions of the three subsets are different but related. Each subset has 2,500 emails, half of which are non-spam (labeled as 1) and half spam (labeled as −1). Following [54], we consider three settings on this dataset: (1) User1 (source domain) and User2 (target domain); (2) User2 (source domain) and User3 (target domain); and (3) User3 (source domain) and User1 (target domain). For each setting, the training dataset contains all labeled samples from the source domain, and the samples in the target domain are used as unlabeled test samples. We report the means and standard deviations of the experimental results of all methods. Again, the word-frequency feature is used to represent each document, as in [66]. More detailed information about the datasets drawn from the email spam filtering corpus can be found in Table 3.10.

3.5.3.1.3 Web Query

We also construct a set of cross-domain query classification tasks for a search engine, e.g., Google. We use a set of search snippets gathered from Google as the training data and incoming unlabeled queries as the test data; a detailed description of the procedure can be found in [67]. We use the labeled queries from AOL provided by [68] (http://grepgsadetsky.com/aol-data) for evaluation. We consider queries from five classes, Business, Computer, Entertainment, Health, and Sports, which appear in both the training and test datasets, and form ten binary query classification tasks [64]. More detailed information is given in Table 3.11.

3.5.3.1.4 Sentiment Reviews

The data for sentiment domain adaptation [69] consist of Amazon product reviews for four product types: books, DVDs, electronics, and kitchen appliances. Each review consists of a rating with scores ranging from 0 to 5, a reviewer name and location, a product name, a review title and date, and the review text. Reviews with ratings higher than three are labeled as positive and reviews with ratings lower than three as negative; the rest are discarded because their polarity is ambiguous. The details of the data in the different domains are summarized in Table 3.11. The experimental settings are the same as in [69]. To study the performance of our methods on this task, we construct 12 pairs of cross-domain sentiment classification tasks, as shown in Table 3.11; for example, we use the reviews from domain A as the training data and then predict the sentiment of the reviews in domain B.

3.5.3.2 Comparing with the Related Methods

Tables 3.12, 3.13, and 3.14 and Fig. 3.11 show the means and standard deviations of the classification accuracies of the different methods on the above domain adaptation transfer learning tasks. From these results, we can make several interesting observations:

Table 3.12 Means and standard deviations (%) of classification accuracies (ACC) of all methods on the 20Newsgroups, Reuters datasets, and email spam filtering datasets
Table 3.13 Means and standard deviations (%) of classification accuracies (ACC) of all methods on the web query dataset
Table 3.14 Means and standard deviations (%) of classification accuracies (ACC) of all methods on the sentiment reviews dataset
Fig. 3.11

Means and standard deviations (%) of classification accuracies (ACC) of all methods on text datasets. a Text datasets: Reuters, 20Newsgroups and email spam filtering; b web query dataset; c sentiment classification dataset
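Several of the observations below turn on first-order versus second-order distribution statistics. As a minimal sketch of the idea, computed in plain input space on toy data rather than in the RKHS the actual methods use, the two kinds of discrepancy between a source and a target sample can be measured as follows:

```python
# First-order discrepancy: distance between domain means.
# Second-order discrepancy: distance between domain scatter (covariance)
# matrices. The toy domains below share a mean but differ in spread, so
# only the second-order term detects the distribution gap.
from statistics import mean

def mean_discrepancy(Xs, Xt):
    """Squared distance between the domain means (first-order statistics)."""
    ms = [mean(col) for col in zip(*Xs)]
    mt = [mean(col) for col in zip(*Xt)]
    return sum((a - b) ** 2 for a, b in zip(ms, mt))

def scatter_discrepancy(Xs, Xt):
    """Squared Frobenius distance between the domain covariance (scatter)
    matrices (second-order statistics)."""
    def cov(X):
        m = [mean(col) for col in zip(*X)]
        d, n = len(m), len(X)
        return [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / n
                 for j in range(d)] for i in range(d)]
    Cs, Ct = cov(Xs), cov(Xt)
    return sum((a - b) ** 2 for rs, rt in zip(Cs, Ct) for a, b in zip(rs, rt))

Xs = [[-1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, 1.0]]  # source domain
Xt = [[-2.0, 0.0], [2.0, 0.0], [0.0, -2.0], [0.0, 2.0]]  # target domain
```

A method that matches only the means would judge these two domains identical; this is the intuition behind additionally penalizing the scatter discrepancy.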

1. From Tables 3.12, 3.13, and 3.14, we can see that our method achieves very promising results. The major limitation of LMPROJ, DTSVM, and KMM is that they consider only first-order statistics and thus cannot generalize well. Since our methods explicitly consider both the first-order and the second-order statistics between the source and target domains, they yield better generalization capability, and our method significantly outperforms the other methods. These empirical results again show that considering second-order statistics in addition to first-order statistics helps improve domain adaptation performance.

2. SVM and TSVM have the worst performance on almost all learning tasks compared to the other classifiers, which is consistent with the experimental results on the above toy datasets. Although TSVM obtains better classification performance on the 20Newsgroups and Reuters datasets, it performs worse than the other methods on the two web text classification tasks. It is worth noting that we obtain slightly better results for SVM and TSVM than those typically reported in the previous literature on the same datasets. This is because, to make the comparison fair, we reported the best results over a set of parameters for SVM and TSVM instead of results obtained with a default parameter selected on the training data.

3. In Tables 3.12, 3.13, and 3.14 and Fig. 3.11, we can also observe that although the seven methods CDCS, LWE, LMPROJ, DTSVM, KMM, DAKSVM, and its variant μ-DAKSVM exhibit comparable classification capability on all text datasets, the proposed method DAKSVM and its variant μ-DAKSVM maintain significantly higher classification accuracy in most cases, which implies that they are more stable than the other methods, particularly on the two web text classification datasets, i.e., the web query and sentiment reviews datasets.

4. The results in Tables 3.12, 3.13, and 3.14 and Fig. 3.11 also show that the proposed method DAKSVM and its variant μ-DAKSVM perform better than the MMD-based methods LMPROJ and KMM on almost all datasets, which indicates that merely minimizing the distribution mean discrepancy between the two domains is far from sufficient for domain adaptation transfer learning. Hence, more underlying information, such as distribution scatter discrepancy minimization, should be introduced into the regularization framework of the classifier to further enhance classification performance. Besides, it is worth mentioning that DTSVM also obtains fairly robust performance on almost all datasets by adopting a multiple kernel learning scheme. A possible explanation is that multiple kernel learning can improve the learning capability for DAL.

5. μ-DAKSVM is clearly superior to DAKSVM in classification accuracy on almost all these datasets, which demonstrates that the parameter μ can be used to enhance the generalization capability of DAKSVM. Therefore, we use μ-DAKSVM instead of DAKSVM for the performance evaluation hereafter.

6. To verify whether the proposed methods are significantly better than the other methods, we also performed a paired two-tailed t-test [70] on the classification results of the 10 runs to assess the statistical significance of the proposed method μ-DAKSVM. The smaller the p-value, the more significant the difference between the two average results, and a p-value below 0.05 is the typical threshold for statistical significance. Thus, in Tables 3.12, 3.13, and 3.14, results whose p-value is less than 0.05 are marked with "*". As shown in these tables, the proposed method μ-DAKSVM significantly outperforms the other methods on most datasets.
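The paired two-tailed t-test over the 10 runs can be sketched as follows. The accuracy values are invented for illustration, and only the t statistic is computed here; a full p-value would need the t-distribution CDF (e.g., from scipy.stats), which this stdlib-only sketch avoids by comparing against the tabulated critical value:

```python
# Paired t-test sketch: the two methods are evaluated on the same 10 runs,
# so we test whether the mean of the per-run accuracy differences is zero.
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of the paired differences between two matched samples."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n))

# Illustrative accuracies over 10 runs for two methods (not real results).
acc_ours  = [0.91, 0.90, 0.92, 0.89, 0.93, 0.91, 0.90, 0.92, 0.91, 0.90]
acc_other = [0.87, 0.88, 0.86, 0.85, 0.89, 0.87, 0.86, 0.88, 0.87, 0.86]

t = paired_t_statistic(acc_ours, acc_other)
# For 10 paired runs (df = 9), the two-tailed 0.05 critical value is ~2.262;
# |t| above this threshold corresponds to p < 0.05, i.e., a "*" in the tables.
significant = abs(t) > 2.262
```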

3.5.4 Multi-Class Face Recognition Datasets

3.5.4.1 Dataset Settings

In this subsection, to evaluate the effectiveness of the proposed methods on multi-class classification problems, we investigate the performance of the proposed algorithms LSDAKSVM and μ-DAKSVM for face recognition on the two benchmark Yale and ORL face databases. The Yale face database was constructed at the Yale Center for Computational Vision and Control. It contains 165 images of 15 individuals, with 11 images per person. The images exhibit face variations in lighting condition (left-light, center-light, right-light) and facial expression (normal, happy, sad, sleepy, surprised, and wink), with or without glasses. Each image is cropped to the size of 32 × 32 pixels in our experiment. We randomly select 8 images of each individual to construct the source domain dataset. The ORL database contains 400 images grouped into 40 distinct subjects, with 10 images each. The images were captured at different times, and for some subjects, the images vary in facial expression and facial details. All the images were taken against a dark homogeneous background, with a tolerance for side movement of about 20°. The original images are 112 × 92 pixels with 256 gray levels per pixel, and are down-sampled to 32 × 32 pixels in our experiment. We randomly select eight images of each individual to construct the source domain training set. Figure 3.12a, c shows the cropped images of one person in the Yale and ORL face databases, respectively.

Fig. 3.12

Face examples from the face databases Yale and ORL. a Yale faces for an object; b Yale faces for an object with rotation angle 10°; c ORL faces for an object; d ORL faces for an object with rotation angle 10°

The target datasets are generated by rotating the original source domain dataset anticlockwise by 10°, 30°, and 50°, respectively. Due to the rotation, the source and target domain data exhibit different distributions; in particular, the greater the rotation angle, the more complex the resulting domain adaptation problem becomes. Thus, we construct three face domain adaptation transfer learning problems for each face database. Figure 3.12b, d shows the face samples with a rotation angle of 10°.
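The construction of the rotated target domains can be sketched as follows. The 32 × 32 source image is synthetic, and nearest-neighbour sampling is used for simplicity; the actual preprocessing may interpolate differently:

```python
# Build three target domains from one source image by anticlockwise rotation,
# mirroring the 10/30/50-degree setup of the face experiments.
import math

def rotate_image(img, angle_deg):
    """Rotate a 2-D grayscale image (list of lists) about its center using
    nearest-neighbour inverse mapping; pixels mapped from outside become 0."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = math.radians(angle_deg)
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel that lands at (y, x).
            sx = math.cos(theta) * (x - cx) - math.sin(theta) * (y - cy) + cx
            sy = math.sin(theta) * (x - cx) + math.cos(theta) * (y - cy) + cy
            si, sj = round(sy), round(sx)
            if 0 <= si < h and 0 <= sj < w:
                out[y][x] = img[si][sj]
    return out

# Synthetic 32 x 32 grayscale "face" image for illustration.
source_img = [[(i * 32 + j) % 256 for j in range(32)] for i in range(32)]
targets = {a: rotate_image(source_img, a) for a in (10, 30, 50)}
```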

3.5.4.2 Comparing with the Related Methods

We test the performance of LSDAKSVM and μ-DAKSVM in comparison with CDCS, LWE, DTSVM, and LMPROJ. For a comprehensive comparison, we also run the baseline method LS-SVM for face recognition with different distributions. For the above multi-class classification tasks, μ-DAKSVM, CDCS, LWE, LS-SVM, DTSVM, and LMPROJ adopt the one-against-one (OAO) strategy to perform the corresponding multi-class classification. For each evaluation, ten rounds of experiments are repeated with randomly selected training data, and the average result is recorded as the final classification accuracy in Table 3.15. Several attractive insights can be obtained from these results:

1. The overall accuracy of LS-SVM is lower than that of any other classifier on all DAL tasks, which is consistent with the earlier results for SVM.

2. As the rotation angle increases, the classification performance of all classifiers degrades gradually. However, LSDAKSVM degrades more slowly than the other methods. Notably, CDCS and DTSVM exhibit somewhat competitive performance compared to the other methods, particularly on the more complex datasets.

3. As shown in Table 3.15, the LSDAKSVM method delivers more stable results across all the datasets and is competitive with the best method on the majority of them, obtaining the best classification accuracy more often than any other method. Hence, as discussed in the above section, LSDAKSVM possesses overall DAL advantages over the other methods in terms of both computational complexity and classification accuracy.

    Table 3.15 Means and standard deviations (%) of classification accuracies (ACC) of all methods on Yale and ORL with different rotation angles
4. Table 3.15 also shows that although LSDAKSVM seems to have an overall advantage over μ-DAKSVM in classification accuracy, μ-DAKSVM is actually quite comparable to LSDAKSVM.
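The one-against-one (OAO) strategy used in the multi-class experiments above trains one binary classifier per class pair and predicts by majority vote. A minimal sketch follows, with an invented nearest-mean base learner on toy 2-D data standing in for the actual binary classifiers:

```python
# One-against-one (OAO) multi-class decomposition: for k classes, train
# k(k-1)/2 pairwise binary learners; a test point gets the most-voted class.
from itertools import combinations
from statistics import mean

def nearest_mean_fit(X, y):
    """Toy binary base learner: store class means; predict the nearest one."""
    classes = sorted(set(y))
    means = {c: [mean(col) for col in zip(*[x for x, t in zip(X, y) if t == c])]
             for c in classes}
    def predict(x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, means[c]))
        return min(classes, key=dist)
    return predict

def oao_fit(X, y):
    """Train one binary learner per class pair; predict by majority vote."""
    classes = sorted(set(y))
    learners = []
    for c1, c2 in combinations(classes, 2):
        pairs = [(x, t) for x, t in zip(X, y) if t in (c1, c2)]
        Xp, yp = zip(*pairs)
        learners.append(nearest_mean_fit(list(Xp), list(yp)))
    def predict(x):
        votes = [clf(x) for clf in learners]
        return max(set(votes), key=votes.count)
    return predict

# Three toy "subjects" in a 2-D feature space.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = [0, 0, 1, 1, 2, 2]
predict = oao_fit(X, y)
```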

3.6 Conclusions

In this chapter, we propose one inductive and one transductive transfer learning approach, both based on support vector learning. On the one hand, the proposed inductive transfer learning method, KL-TSK-FS, is well suited to situations where only partial data are available from the target domain while some useful knowledge of the source domains is available. Besides, the proposed method is distinctive in preserving data privacy, as only the knowledge (e.g., the corresponding model parameters) rather than the data of the source domains is adopted. On the other hand, the proposed transductive transfer learning method DAKSVM and its two extensions inherit the potential advantages of classical TSVMs and MMD-based methods and extend them to DAL. As novel large margin domain adaptation classifiers, the proposed methods reduce the distribution gap between different domains in an RKHS as much as possible, since they effectively integrate the large margin learner with the proposed GPMDD metric, in which both the distribution mean discrepancy and the distribution scatter discrepancy of the RKHS embeddings of the domain distributions are considered simultaneously.