Introduction

Although many cognitive computational models (e.g. neural networks and support vector machines) have been proposed to solve classification problems, these methods face challenges such as poor computational scalability and considerable human intervention. Recently, the extreme learning machine (ELM) [1,2,3,4,5], a cognitive-based technique, has been proposed for "generalized" Single-Layer Feed-forward Networks (SLFNs) [1, 2, 4, 6, 7]. ELM analytically determines the output weights using the Moore-Penrose generalized inverse by adopting the square loss of the prediction error. Huang et al. [4, 6, 8, 9] have rigorously proved that, in theory, ELM can approximate any continuous function, and have also proved the classification capability of ELM [4]. Moreover, different from traditional learning algorithms [10], ELM aims to achieve not only the smallest training error but also the smallest norm of output weights for better generalization performance [3, 7]. Its variants [11,12,13,14,15,16,17,18,19] also focus on regression, classification, and pattern recognition applications.

Over the past years, a number of improved versions of ELM have been proposed. A weighted ELM was proposed for binary/multiclass classification tasks with both balanced and imbalanced data distributions [20]. Bai et al. [21] proposed a sparse ELM for reducing storage space and testing time. Huang et al. [22] proposed a semi-supervised ELM for classification, in which a manifold regularization with graph Laplacian was imposed as a constraint, and an unsupervised ELM was also explored for clustering. Liu et al. [23,24,25] proposed several methods to address the issue of tactile object recognition. Recently, Liu et al. [26] also proposed an extreme kernel sparse learning method for tactile object recognition, which combines ELM and kernel dictionary learning.

The data distributions obtained at different stages or under different sampling conditions can differ, which is identified as the cross-domain problem. Traditional ELM assumes that the distributions of training and testing data are similar and therefore cannot address this issue. To handle this problem, domain adaptation has been proposed for heterogeneous data (i.e. the cross-domain problem) by leveraging a few labeled instances from another domain with similar semantics [27, 28]. Inspired by domain adaptation, Zhang et al. [29] proposed a domain adaptation ELM (DAELM) for classification across tasks (source domain vs. target domain), which is the first work to study the cross-domain learning mechanism in ELM. However, DAELM was proposed as a cross-domain classifier, and how to learn a shared (common) subspace between source and target domains has, to our knowledge, never been studied in ELM. Therefore, in this paper, we extend ELM to handle cross-domain problems through transfer learning and subspace learning, and explore its capability in multi-domain applications. Inspired by DAELM, we propose a cross-domain extreme learning machine (CdELM) for common subspace learning. The basic idea of the proposed CdELM method is illustrated in Fig. 1, in which we aim at learning a shared subspace β. Notably, the proposed CdELM differs from DAELM in that it learns a shared subspace for the source and target domains instead of a shared classifier.

Fig. 1

Schematic diagram of the proposed CdELM method: after a subspace projection β, the source domain and target domain with different distributions lie in a latent subspace with good distribution consistency (the centers of both domains become very close and the drift is removed). The upper coordinate system shows the raw data points of the source domain and target domain in three dimensions. We use the word "center" to denote the mean of each domain's data. From the upper figure, we can see that the difference between the mean of the source domain and the mean of the target domain is large. After the subspace projection β in the lower figure, the distance d₂ becomes smaller, which demonstrates that the distribution difference becomes small, and both domains lie in a latent common subspace with good distribution consistency

The remainder of this paper is organized as follows. Section 2 contains a brief review of ELM and subspace learning. In Section 3, the proposed CdELM method, including the detailed model formulation and optimization algorithm, is presented. The experiments and results are discussed in Section 4. Finally, Section 5 concludes this paper.

Related Work

Review of ELM

Briefly, the principle of ELM is described as follows. Given the training data \( X=\left[{x}_1,{x}_2,\dots, {x}_N\right]\in {\Re}^{N\times n} \), where n is the dimensionality and N is the number of training samples, let \( T=\left[{t}_1,{t}_2,\dots, {t}_N\right]\in {\Re}^{N\times m} \) denote the labels with respect to the data X, where m is the number of classes. The output of the hidden layer is denoted as \( \mathcal{H}\left({x}_i\right)\in {\Re}^{1\times L} \), where L is the number of hidden nodes and ℋ(⋅) is the activation function. The output weights between the hidden layer and the output layer to be learned are denoted as \( \beta \in {\Re}^{L\times m} \). The regularized ELM aims at minimizing the training error and the norm of the output weights for better generalization performance, formulated as follows:

$$ \begin{array}{l}\underset{\beta}{\min}\ \mathcal{L}=\frac{1}{2}{\left\Vert \beta \right\Vert}^2+\frac{C}{2}\sum_{i=1}^N{\left\Vert {\xi}_i\right\Vert}^2\\ {}\mathrm{s}.\mathrm{t}.\ \mathcal{H}\left({x}_i\right)\beta ={t}_i-{\xi}_i,\ i=1,\dots, N\end{array} $$
(1)

where ξ i denotes the prediction error with respect to the ith training pattern x i and C is a penalty constant on the training errors.

By substituting the constraint term in (1) into the objective function, an equivalent unconstrained optimization problem can be obtained as follows:

$$ \underset{\beta}{\min}\ \mathcal{L}=\frac{1}{2}{\left\Vert \beta \right\Vert}^2+\frac{C}{2}{\left\Vert T- H\beta \right\Vert}^2 $$
(2)

where \( H=\left[\mathcal{H}\left({x}_1\right);\mathcal{H}\left({x}_2\right);\dots; \mathcal{H}\left({x}_N\right)\right]\in {\Re}^{N\times L} \).

The minimization problem (2) is a regularized least squares problem. By setting the gradient of ℒ with respect to β to zero, we can obtain the closed-form solution of β. There are two cases when solving for β. If N is larger than L, the linear system is over-determined, and the closed-form solution can be obtained as

$$ {\beta}^{\ast }={\left({H}^T H+\frac{I}{C}\right)}^{-1}{H}^T T $$
(3)

where I denotes the identity matrix of size L.
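
For clarity, Eq. (3) follows from setting the gradient of Eq. (2) to zero:

$$ \frac{\partial \mathcal{L}}{\partial \beta}=\beta - C{H}^T\left( T- H\beta \right)=0\ \Rightarrow\ \left({H}^T H+\frac{I}{C}\right)\beta ={H}^T T $$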

Second, if the number N of training samples is smaller than L, an under-determined least squares problem must be handled. In this case, the solution of (1) can be obtained as

$$ {\beta}^{\ast }={H}^T{\left({ H H}^T+\frac{I}{C}\right)}^{-1} T $$
(4)

where I denotes the identity matrix of size N.

Therefore, for classification problems, the solution of β can be computed using Eq. (3) or Eq. (4). We direct the interested reader to [3] for more details on ELM theory and algorithms.
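
To make the closed-form solutions above concrete, the following is a minimal NumPy sketch of regularized ELM training and prediction using Eqs. (3) and (4); the function names, the sigmoid activation, and the standard-normal initialization are our own illustrative assumptions rather than choices prescribed by the text.

```python
import numpy as np

def elm_train(X, T, L=100, C=1.0, rng=np.random.default_rng(0)):
    """Train a regularized ELM (Eqs. (1)-(4)).

    X : (N, n) training samples, T : (N, m) one-hot targets.
    Returns the random hidden-layer parameters and the output weights beta.
    """
    N, n = X.shape
    W = rng.standard_normal((n, L))          # random input weights
    b = rng.standard_normal((1, L))          # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden-layer output, (N, L)

    I_L, I_N = np.eye(L), np.eye(N)
    if N >= L:   # over-determined case, Eq. (3); solve() avoids an explicit inverse
        beta = np.linalg.solve(H.T @ H + I_L / C, H.T @ T)
    else:        # under-determined case, Eq. (4)
        beta = H.T @ np.linalg.solve(H @ H.T + I_N / C, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)       # predicted class indices
```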

Subspace Learning

Subspace learning aims at learning a low-dimensional subspace. Common methods include principal component analysis (PCA) [30], linear discriminant analysis (LDA) [31], and the manifold learning-based locality preserving projections (LPP) [32]. All of these methods assume that the data distribution is consistent, i.e. they are applicable only to a single domain. However, this assumption is often violated in many real-world applications. Therefore, for heterogeneous data, we propose a new cross-domain learning method that learns a shared subspace, which is called the cross-domain extreme learning machine.

The Proposed CdELM Method

Notations

In this paper, the source domain and target domain are denoted by subscripts S and T, respectively. The training data of the source and target domains are denoted as \( {X}_S=\left[{x}_S^1,\dots, {x}_S^{N_S}\right]\in {\Re}^{D\times {N}_S} \) and \( {X}_T=\left[{x}_T^1,\dots, {x}_T^{N_T}\right]\in {\Re}^{D\times {N}_T} \), respectively, where D is the number of dimensions and N S and N T are the numbers of training samples in the two domains. Let \( \beta \in {\Re}^{L\times d} \) represent the basis transformation that maps the ELM space of the source and target data to a subspace of dimension d. ║∙║F and ║∙║2 denote the Frobenius norm and the ℓ2-norm, Tr(∙) denotes the trace operator, and (∙)T denotes the transpose operator. Throughout this paper, matrices are written in capital boldface, vectors in lower boldface, and variables in italics.

The Proposed Method

As illustrated in Fig. 1, the distributions of the source domain and target domain are different. Therefore, the performance of a classifier learned on the source domain will be dramatically degraded on the target domain. Inspired by subspace learning and ELM, the main idea of the proposed CdELM is to learn a shared subspace β in the ELM space rather than a classifier β. After projection by β, the source domain and target domain share a similar feature distribution in the latent subspace.

First, the source data and target data are mapped into the ELM space, which yields \( {H}_S=\left[{h}_S^1,\dots, {h}_S^{N_S}\right]\in {\Re}^{L\times {N}_S} \) and \( {H}_T=\left[{h}_T^1,\dots, {h}_T^{N_T}\right]\in {\Re}^{L\times {N}_T} \), where \( {h}_S^i= g\left({W}^T{x}_S^i+{b}^T\right) \) and \( {h}_T^j= g\left({W}^T{x}_T^j+{b}^T\right) \) are the output (column) vectors of the hidden layer with respect to the inputs \( {x}_S^i \) and \( {x}_T^j \), respectively, i = 1,2,...,N S , j = 1,2,...,N T , g(∙) is an activation function, L is the number of randomly generated hidden nodes, and \( W\in {\Re}^{D\times L} \) and \( b\in {\Re}^{1\times L} \) are the randomly generated weights and biases.
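
As an illustration of this shared random mapping, a minimal NumPy sketch is given below; the sigmoid form of g(∙) and the standard-normal initialization of W and b are assumptions made for illustration (the experiments later use a Gaussian/RBF activation), and the function name is ours.

```python
import numpy as np

def elm_feature_map(X_S, X_T, L=200, rng=np.random.default_rng(0)):
    """Map source and target data into a shared ELM feature space.

    X_S : (D, N_S) and X_T : (D, N_T); columns are samples, as in the paper.
    The same randomly generated W (D, L) and b (L, 1) are applied to both
    domains, giving H_S (L, N_S) and H_T (L, N_T).
    """
    D = X_S.shape[0]
    W = rng.standard_normal((D, L))            # random input weights, shared by both domains
    b = rng.standard_normal((L, 1))            # random biases, shared by both domains
    g = lambda Z: 1.0 / (1.0 + np.exp(-Z))     # one possible choice of activation g
    H_S = g(W.T @ X_S + b)
    H_T = g(W.T @ X_T + b)
    return H_S, H_T
```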

In the learned subspace β, we expect not only that the distributions of the source domain and target domain become consistent, but also that the discriminative power is improved for recognition. Inspired by linear discriminant analysis (LDA), we aim at minimizing the intra-class scatter matrix \( {S}_W^S \) and simultaneously maximizing the inter-class scatter matrix \( {S}_B^S \) of the source data, such that separability can be guaranteed in the learned linear subspace. Therefore, for the source domain, it is rational to maximize the following term:

$$ \underset{\beta}{\max}\frac{\mathrm{Tr}\left({\beta}^T{S}_B^S\beta \right)}{\mathrm{Tr}\left({\beta}^T{S}_W^S\beta \right)} $$
(5)

where the inter-class and intra-class scatter matrices are computed as \( {S}_B^S=\sum_{c=1}^C\left({\mu}_S^c-{\mu}_S\right){\left({\mu}_S^c-{\mu}_S\right)}^T \) and \( {S}_W^S=\sum_{c=1}^C\sum_{k=1,{h}_S^k\in {G}_c}^{n_c}\left({h}_S^k-{\mu}_S^c\right){\left({h}_S^k-{\mu}_S^c\right)}^T \), where μ S represents the center of the source data in the ELM feature space, \( {\mu}_S^c \) represents the center of class c of the source data, C represents the number of categories, G c represents the collection of samples belonging to class c, and n c represents the number of samples in class c.
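
A hedged NumPy sketch of these two scatter matrices, computed in the ELM feature space, is given below; the function name and the integer label encoding are our own assumptions.

```python
import numpy as np

def source_scatter_matrices(H_S, y_S):
    """Inter-class and intra-class scatter of the source domain (definitions under Eq. (5)).

    H_S : (L, N_S) mapped source data, y_S : (N_S,) integer class labels.
    """
    L = H_S.shape[0]
    mu_S = H_S.mean(axis=1, keepdims=True)       # overall source center, (L, 1)
    S_B = np.zeros((L, L))
    S_W = np.zeros((L, L))
    for c in np.unique(y_S):
        H_c = H_S[:, y_S == c]                   # samples of class c
        mu_c = H_c.mean(axis=1, keepdims=True)   # center of class c
        d = mu_c - mu_S
        S_B += d @ d.T                           # inter-class scatter
        D_c = H_c - mu_c
        S_W += D_c @ D_c.T                       # intra-class scatter
    return S_B, S_W
```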

To learn a subspace β that maximizes formulation (5), we should also ensure that the projection does not distort the data from the target domain, so that as much of the available information as possible is kept in the new subspace representation. Therefore, it is rational to maximize the following term:

$$ \underset{\beta}{ \max } Tr\left(\left({\beta}^T{H}_T\right){\left({\beta}^T{H}_T\right)}^T\right)=\underset{\beta}{ \max } Tr\left({\beta}^T{H}_T{H}_T^T\beta \right) $$
(6)

Naturally, after being projected by β, the feature distributions of the mapped source domain \( {H}_S=\left[{h}_S^1,\dots, {h}_S^{N_S}\right]\in {\Re}^{L\times {N}_S} \) and target domain \( {H}_T=\left[{h}_T^1,\dots, {h}_T^{N_T}\right]\in {\Re}^{L\times {N}_T} \) should become similar. Therefore, it is rational to minimize the mean distribution discrepancy (MDD) between H S and H T , that is, the distance between the centers of the two domains. The MDD minimization is formulated as

$$ \min {\left\Vert \frac{1}{N_S}\sum_{i=1}^{N_S}{\beta}^T{h}_S^i-\frac{1}{N_T}\sum_{j=1}^{N_T}{\beta}^T{h}_T^j\right\Vert}_F^2 $$
(7)

With the merits of ELM in mind, we also expect the norm of β to be minimized for better generalization:

$$ \underset{\beta}{ \min }{\left\Vert \beta \right\Vert}_F^2 $$
(8)

Having described the four specific parts of the proposed CdELM model, and by incorporating Eqs. (5) to (8), the complete CdELM model is formulated as follows:

$$ \underset{\beta}{\min}\frac{\mathrm{Tr}\left({\beta}^T{S}_W^S\beta \right)+{\lambda}_0{\left\Vert \beta \right\Vert}_F^2+{\lambda}_1{\left\Vert \frac{1}{N_S}\sum_{i=1}^{N_S}{\beta}^T{h}_S^i-\frac{1}{N_T}\sum_{j=1}^{N_T}{\beta}^T{h}_T^j\right\Vert}_F^2}{\mathrm{Tr}\left({\beta}^T{S}_B^S\beta \right)+{\lambda}_2\mathrm{Tr}\left({\beta}^T{H}_T{H}_T^T\beta \right)} $$
(9)

where λ 0, λ 1, and λ 2 denote the trade-off parameters.

Let \( {\mu}_S=\frac{1}{N_S}\sum_{i=1}^{N_S}{h}_S^i \) and \( {\mu}_T=\frac{1}{N_T}\sum_{j=1}^{N_T}{h}_T^j \) be the centers of the source domain and target domain in the ELM space; then the minimization problem in Eq. (9) can finally be written as

$$ \begin{array}{l}\underset{\beta}{\min}\frac{\mathrm{Tr}\left({\beta}^T{S}_W^S\beta \right)+{\lambda}_0{\left\Vert \beta \right\Vert}_F^2+{\lambda}_1{\left\Vert {\beta}^T\left(\frac{1}{N_S}\sum_{i=1}^{N_S}{h}_S^i\right)-{\beta}^T\left(\frac{1}{N_T}\sum_{j=1}^{N_T}{h}_T^j\right)\right\Vert}_F^2}{\mathrm{Tr}\left({\beta}^T{S}_B^S\beta \right)+{\lambda}_2\mathrm{Tr}\left({\beta}^T{H}_T{H}_T^T\beta \right)}\\ {}=\underset{\beta}{\min}\frac{\mathrm{Tr}\left({\beta}^T{S}_W^S\beta \right)+{\lambda}_0\mathrm{Tr}\left({\beta}^T\beta \right)+{\lambda}_1\mathrm{Tr}\left(\left({\beta}^T{\mu}_S-{\beta}^T{\mu}_T\right){\left({\beta}^T{\mu}_S-{\beta}^T{\mu}_T\right)}^T\right)}{\mathrm{Tr}\left({\beta}^T{S}_B^S\beta +{\lambda}_2{\beta}^T{H}_T{H}_T^T\beta \right)}\\ {}=\underset{\beta}{\min}\frac{\mathrm{Tr}\left({\beta}^T\left({S}_W^S+{\lambda}_0 I+{\lambda}_1\left({\mu}_S-{\mu}_T\right){\left({\mu}_S-{\mu}_T\right)}^T\right)\beta \right)}{\mathrm{Tr}\left({\beta}^T\left({S}_B^S+{\lambda}_2{H}_T{H}_T^T\right)\beta \right)}\end{array} $$
(10)

where I is an identity matrix of size L.

Model Optimization

In the minimization problem of Eq. (10), there are many possible solutions for β (i.e. the solution is not unique). To guarantee a unique solution, we impose an equality constraint on the optimization problem, and then Eq. (10) can be written as

$$ \begin{array}{l}\underset{\beta}{ \min } Tr\left({\beta}^T\left({S}_W^S+{\lambda}_0 I+{\lambda}_1\left({\mu}_S-{\mu}_T\right){\left({\mu}_S-{\mu}_T\right)}^T\right)\beta \right)\\ {} s. t. Tr\left({\beta}^T\left({S}_B^S+{\lambda}_2{H}_T{H}_T^T\right)\beta \right)=\eta \end{array} $$
(11)

where η is a positive constant.

To solve Eq. (11), the Lagrangian function is written as

$$ \mathcal{L}\left(\beta, \rho \right)=\mathrm{Tr}\left({\beta}^T\left({S}_W^S+{\lambda}_0 I+{\lambda}_1\left({\mu}_S-{\mu}_T\right){\left({\mu}_S-{\mu}_T\right)}^T\right)\beta \right)-\rho \left(\mathrm{Tr}\left({\beta}^T\left({S}_B^S+{\lambda}_2{H}_T{H}_T^T\right)\beta \right)-\eta \right) $$
(12)

where ρ denotes the Lagrange multiplier coefficient.

By setting the partial derivative of ℒ(β, ρ) with respect to β to zero, we have

$$ \frac{\partial \mathrm{\mathcal{L}}\left(\beta, \rho \right)}{\partial \beta}=0\to {\left({S}_B^S+{\lambda}_2{H}_T{H}_T^T\right)}^{-1}\left({S}_W^S+{\lambda}_0 I+{\lambda}_1\left({\mu}_S-{\mu}_T\right){\left({\mu}_S-{\mu}_T\right)}^T\right)\beta =\rho \beta $$
(13)

From Eq. (13), we can observe that β can be obtained by solving the following eigenvalue decomposition problem,

$$ A\beta =\rho \beta $$
(14)

where \( A={\left({S}_B^S+{\lambda}_2{H}_T{H}_T^T\right)}^{-1}\left({S}_W^S+{\lambda}_0 I+{\lambda}_1\left({\mu}_S-{\mu}_T\right){\left({\mu}_S-{\mu}_T\right)}^T\right) \) and ρ denotes the eigenvalues.

From (14), it is clear that the columns of β are eigenvectors of A. Since model (11) is a minimization problem, the optimal subspace consists of the eigenvectors corresponding to the d smallest eigenvalues [ρ 1,..., ρ d ], represented by

$$ {\beta}^{\ast }=\left[{\beta}_1,{\beta}_2,\dots, {\beta}_d\right] $$
(15)

For easy implementation, the proposed algorithm is summarized in Algorithm 1.

Algorithm 1 The proposed CdELM algorithm
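
As a companion to Algorithm 1, the following is a hedged end-to-end sketch of CdELM in NumPy/SciPy; it folds together the feature mapping, the scatter matrices, the MDD term, and the eigen-decomposition of Eq. (14). The use of scipy.linalg.eigh for the generalized eigenproblem and the small ridge added to the denominator matrix are our implementation choices, not part of the paper's formulation.

```python
import numpy as np
from scipy.linalg import eigh

def cdelm_fit(X_S, y_S, X_T, L=200, d=5, lam0=1.0, lam1=1.0, lam2=1.0,
              rng=np.random.default_rng(0), eps=1e-8):
    """Sketch of Algorithm 1: learn the common subspace beta (L, d).

    X_S : (D, N_S) source data, y_S : (N_S,) source labels,
    X_T : (D, N_T) unlabeled target data.
    Returns (W, b, beta) so that a raw sample matrix X can be projected
    as beta.T @ g(W.T @ X + b).
    """
    D = X_S.shape[0]
    W = rng.standard_normal((D, L))                 # shared random weights
    b = rng.standard_normal((L, 1))                 # shared random biases
    g = lambda Z: 1.0 / (1.0 + np.exp(-Z))          # assumed activation
    H_S, H_T = g(W.T @ X_S + b), g(W.T @ X_T + b)

    # Scatter matrices of the source domain (definitions under Eq. (5)).
    mu_S = H_S.mean(axis=1, keepdims=True)
    S_B = np.zeros((L, L))
    S_W = np.zeros((L, L))
    for c in np.unique(y_S):
        H_c = H_S[:, y_S == c]
        mu_c = H_c.mean(axis=1, keepdims=True)
        S_B += (mu_c - mu_S) @ (mu_c - mu_S).T
        S_W += (H_c - mu_c) @ (H_c - mu_c).T

    # Mean distribution discrepancy term (Eq. (7)).
    mu_T = H_T.mean(axis=1, keepdims=True)
    dmu = mu_S - mu_T

    # Numerator and denominator matrices of Eq. (10).
    M = S_W + lam0 * np.eye(L) + lam1 * (dmu @ dmu.T)
    B = S_B + lam2 * (H_T @ H_T.T)

    # Eq. (14) as a generalized eigenproblem M v = rho B v; eigh returns
    # ascending eigenvalues, so the first d eigenvectors give Eq. (15).
    # The eps ridge on B is a numerical safeguard we add, not part of the model.
    vals, vecs = eigh(M, B + eps * np.eye(L))
    beta = vecs[:, :d]
    return W, b, beta
```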

Experiments

Data Description

We validate the proposed method on our own dataset, which includes three subsets: master data (collected 5 years ago), slave 1 data (collected recently), and slave 2 data (collected recently). For data acquisition, the master and two slave E-nose systems were developed in [33]. Each system consists of four TGS-series sensors and an additional temperature and humidity module. Therefore, the dimensionality of each sample is 6. The dataset includes six kinds of gaseous contaminants (i.e. six classes): formaldehyde, benzene, toluene, carbon monoxide, nitrogen dioxide, and ammonia. The detailed description of the dataset is given in Table 1. To visually inspect the heterogeneous E-nose data, the PCA scatter plots of the master, slave 1, and slave 2 data are shown in Fig. 2, in which we can see that the points from different classes overlap.

Table 1 Data description of the E-nose data
Fig. 2

The PCA scatter points of the master, slave 1, and slave 2 data, respectively

Experimental Settings

The master data collected 5 years ago are used as the source domain data (no drift), and the slave 1 and slave 2 data collected recently are used as the target domain data (with drift). We then conduct experiments under the following two settings.

  1. Setting 1.

    During CdELM training and classifier learning, the labels of the target domain data are unavailable, and only the source labels are used. The classification accuracy on slave 1 data (or slave 2 data) is reported.

  2. Setting 2.

    The only difference between Setting 1 and Setting 2 is that, during classifier training, a small amount of labeled target domain data can be used. Specifically, for each class in the target domain, k labeled samples are used for classifier learning, where k = 1, 3, 5, 7, and 9 are discussed in this paper (an illustrative sketch of this protocol is given after this list).
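
The sketch below illustrates how setting 2 could be evaluated on top of the learned subspace. The choice of a nearest-class-mean classifier and the random sampling of the k labeled target samples are our own assumptions for illustration; the paper does not tie the settings to a specific classifier in this description.

```python
import numpy as np

def evaluate_setting2(project, X_S, y_S, X_T, y_T, k=3, rng=np.random.default_rng(0)):
    """Illustrative evaluation of setting 2 with a nearest-class-mean classifier.

    `project(X)` maps raw samples (D, N) to the learned subspace (d, N),
    e.g. project = lambda X: beta.T @ g(W.T @ X + b) from the CdELM sketch.
    k labeled target samples per class join the source data for training;
    the remaining target samples are used for testing.
    """
    Z_S, Z_T = project(X_S), project(X_T)
    train_Z, train_y = [Z_S], [y_S]
    test_idx = np.ones(len(y_T), dtype=bool)
    for c in np.unique(y_T):
        idx = rng.permutation(np.where(y_T == c)[0])[:k]   # k labeled target samples
        train_Z.append(Z_T[:, idx])
        train_y.append(y_T[idx])
        test_idx[idx] = False
    Z_tr = np.hstack(train_Z)
    y_tr = np.concatenate(train_y)

    classes = np.unique(y_tr)
    centers = np.stack([Z_tr[:, y_tr == c].mean(axis=1) for c in classes])   # (C, d)
    dists = ((Z_T[:, test_idx].T[:, None, :] - centers[None]) ** 2).sum(-1)  # (N_test, C)
    y_pred = classes[np.argmin(dists, axis=1)]
    return (y_pred == y_T[test_idx]).mean()
```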

Compared Methods

To show the effectiveness of the proposed method, we compare it with a set of representative machine learning methods. First, three baseline methods, namely the support vector machine (SVM), principal component analysis (PCA), and linear discriminant analysis (LDA), are compared. Second, five manifold learning-based subspace methods, including locality preserving projection (LPP) [32], multidimensional scaling (MDS) [34], neighborhood component analysis (NCA) [35], neighborhood preserving embedding (NPE) [36], and local Fisher discriminant analysis (LFDA) [37], are explored and compared. Finally, the popular subspace transfer learning method, sampling geodesic flow (SGF) [38], is also explored and compared.

Results

In this section, the experimental results under each setting are reported to validate the performance of the proposed CdELM method. Under each setting, two tasks, master → slave 1 and master → slave 2, are conducted.

Under setting 1, we first examine the qualitative result by applying the proposed CdELM method to master → slave 1 and master → slave 2, respectively. The result is shown in Fig. 3, in which the separability among data points from different classes (represented by different symbols) is much improved in the learned common subspace compared with Fig. 2. Furthermore, the odor classification accuracy on the target domain data is presented in Table 2. From the results, we observe that the proposed CdELM achieves the highest accuracy on both tasks. The best performance of CdELM is achieved when the hidden-layer activation function is the Gaussian (RBF) function. This demonstrates that the proposed CdELM performs well in cross-domain pattern recognition scenarios.

Fig. 3

The scatter plots obtained by CdELM for master → slave 1 (black dots) and master → slave 2 (red dots)

Table 2 Recognition accuracy (%) with sensor calibration under setting 1

Under setting 2, k labeled samples per class in the target domain are leveraged for classifier learning. The recognition accuracy of the first task (i.e. master → slave 1) is reported in Table 3 and that of the second task (i.e. master → slave 2) in Table 4. Notably, all the compared methods follow the same settings. From Tables 3 and 4, we observe that the proposed CdELM still outperforms the other methods. Therefore, we confirm that the proposed method is effective in handling heterogeneous measurement data.

Table 3 Recognition accuracy (%) with sensor calibration under setting 2 (task 1)
Table 4 Recognition accuracy (%) with sensor calibration under setting 2 (task 2)

Parameter Sensitivity Analysis

In the proposed CdELM model, we analyze three parameters: λ 0, λ 1, and the subspace dimensionality d. We focus on the performance variations when tuning λ 0 and λ 1 over \( {10}^t \), where \( t=\left\{-6,-4,-2,0,2,4,6\right\} \). To show the performance with respect to each parameter, one is tuned while the other is frozen. The results of tuning λ 1 with λ 0 fixed are shown in Fig. 4, and the results of tuning λ 0 with λ 1 fixed are shown in Fig. 5, from which the best values of λ 0 and λ 1 can be identified. Further, we tune the subspace dimensionality d over the set d = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, with the other model parameters fixed; the result is shown in Fig. 6. An illustrative sketch of this tuning grid is given below.
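
The following is a minimal sketch of the grid and the freeze-one-tune-the-other loop used in this sensitivity study; accuracy_for is a placeholder for whichever evaluation routine (e.g. the setting 1 accuracy) is in use and is not defined in the paper, and the fixed values are assumptions.

```python
import numpy as np

# Grid from the sensitivity analysis: lambda_0 and lambda_1 take values 10^t.
t_grid = np.array([-6, -4, -2, 0, 2, 4, 6], dtype=float)
lam_grid = 10.0 ** t_grid

def sweep(accuracy_for, lam0_fixed=1.0, lam1_fixed=1.0):
    """Sweep one trade-off parameter while freezing the other."""
    curve_lam1 = [accuracy_for(lam0=lam0_fixed, lam1=v) for v in lam_grid]  # cf. Fig. 4
    curve_lam0 = [accuracy_for(lam0=v, lam1=lam1_fixed) for v in lam_grid]  # cf. Fig. 5
    return curve_lam1, curve_lam0
```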

Fig. 4

The performance curves with respect to λ 1 under different values of λ 0

Fig. 5

The performance curves with respect to λ 0 under different values of λ 1

Fig. 6

The performance curves with respect to subspace dimensionality d

Conclusion

In this paper, we have presented a cross-domain common subspace learning approach for the heterogeneous data classification problem, called the cross-domain extreme learning machine (CdELM). The method is motivated by subspace learning, domain adaptation, and the cognitive-based extreme learning machine, so that the advantages of ELM, such as good generalization, are inherited. Traditional ELM assumes that the training and testing data follow a similar distribution; once this assumption is violated in multi-domain scenarios, ELM may no longer be applicable. The aim of this paper is to bring a new perspective to ELM in multi-domain subspace learning scenarios. Extensive experiments demonstrate that the proposed method outperforms the compared methods.