1 Introduction

Extreme learning machine (ELM), proposed by Huang et al. [1, 2], is an efficient learning algorithm for training single-hidden-layer feed-forward neural networks (SLFNs). Many studies regard ELM as a learning method for regression and multiclass classification [3–6]. Regularized ELM (RELM) has been developed for classification and regression [7]. Weighted ELM (WELM) has been used for data with imbalanced class distributions [8]. Sparse ELM achieves similar generalization performance in binary classification applications [9]. Chen et al. [10] and Lin et al. [11] present theoretical analyses of ELM. Recently, ELM has been applied to clustering, for example the ELM k-means algorithm [12] and the unsupervised extreme learning machine (US-ELM) k-means algorithm [13]. Clustering is an important part of machine learning, and many clustering methods have been proposed, such as partitioning methods [14], hierarchical methods [15, 16], density-based methods [17, 18], and graph theory methods [19, 20].

As is well known, data that are not linearly separable in the original space can become linearly separable through a nonlinear transformation. The ELM k-means algorithm performs k-means clustering in the ELM feature space. US-ELM obtains an embedded matrix by constructing a new cost function and then applies k-means clustering to this embedded matrix. However, the input weights and biases of US-ELM are randomly generated, so the outputs of the US-ELM hidden layer are not necessarily reasonable features of the original data. The extreme learning machine as an auto encoder (ELM-AE) [21] can obtain the main features of the original data, and it is not an iterative algorithm. The proposed unsupervised extreme learning machine based on embedded features of ELM-AE (US-EF-ELM) uses the features learned by ELM-AE in place of the random features of US-ELM and combines them with the US-ELM cost function to obtain the output weights.

In order to assess the performance of the proposed algorithm, we compare it with other algorithms on eight UCI data sets; the US-EF-ELM k-means algorithm achieves satisfying results on most of them. This paper is organized as follows. In Sect. 2, we review ELM, introduce the model of ELM-AE, and describe the principle of US-ELM. In Sect. 3, we give a detailed description of US-EF-ELM. In Sect. 4, experimental results on the UCI data sets and a performance analysis of the US-EF-ELM k-means algorithm are presented. Finally, conclusions and future work are given in the last section.

2 Related works

The proposed US-EF-ELM is based on ELM-AE and US-ELM. This section provides brief reviews of ELM, ELM-AE, and US-ELM.

2.1 Extreme learning machine

ELM, proposed by Huang et al., is an efficient learning algorithm for SLFNs. For N distinct samples (\(x_{i} ,y_{i}\)), i = 1, \(\ldots\), N, with \(x_{i} \in R^{j}\) and \(y_{i} \in R^{m}\), the ELM model has j input layer nodes, n hidden layer nodes, m output layer nodes, and a hidden layer activation function \(g(x)\). The outputs of the hidden layer can be expressed as Eq. 1, and the relationship between the hidden layer outputs and the output layer outputs can be expressed as Eq. 2.

$$h = g(ax + b)$$
(1)
$$h(x_{i} )\beta = y_{i} ,\;\;\;{\kern 1pt} {\text{where}}\,\,i = 1,2, \ldots ,N$$
(2)

The above equation can be rewritten as Eq. 3:

$$H\beta = Y$$
(3)

where

$$H = \left[ {\begin{array}{*{20}c} {g(\vec{a}_{1} ,b_{1} ,\vec{x}_{1} )} & {g(\vec{a}_{1} ,b_{1} ,\vec{x}_{2} )} & \cdots & {g(\vec{a}_{1} ,b_{1} ,\vec{x}_{N} )} \\ {g(\vec{a}_{2} ,b_{2} ,\vec{x}_{1} )} & {g(\vec{a}_{2} ,b_{2} ,\vec{x}_{2} )} & \cdots & {g(\vec{a}_{2} ,b_{2} ,\vec{x}_{N} )} \\ \vdots & \vdots & \ddots & \vdots \\ {g(\vec{a}_{n} ,b_{n} ,\vec{x}_{1} )} & {g(\vec{a}_{n} ,b_{n} ,\vec{x}_{2} )} & \cdots & {g(\vec{a}_{n} ,b_{n} ,\vec{x}_{N} )} \\ \end{array} } \right]^{T}\!,\quad \beta = \left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ {\beta_{2}^{T} } \\ \vdots \\ {\beta_{n}^{T} } \\ \end{array} } \right]_{n \times m},\quad Y = \left[ {\begin{array}{*{20}c} {y_{1}^{T} } \\ {y_{2}^{T} } \\ \vdots \\ {y_{N}^{T} } \\ \end{array} } \right]_{N \times m}.$$

Using ELM to obtain the output weights \(\beta\) can be divided into three steps:

Step 1: Randomly generate the input weights and the biases of the hidden layer.

Step 2: Calculate the hidden layer output matrix H.

Step 3: Calculate the output weights \(\beta\):

$$\beta = H^{\dag } Y$$
(4)

where \(H^{\dag }\) denotes the Moore–Penrose generalized inverse of the hidden layer output matrix H.
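
To make Steps 1-3 concrete, the following NumPy sketch trains a basic ELM with a sigmoid activation and the pseudo-inverse solution of Eq. 4. It is a minimal illustration: the function names and the uniform initialization range are our own assumptions, not part of the original description.

```python
import numpy as np

def elm_train(X, Y, n_hidden, rng=None):
    """Basic ELM (Eqs. 1-4): random hidden layer, pseudo-inverse output weights."""
    rng = np.random.default_rng(rng)
    j = X.shape[1]                                   # number of input features
    A = rng.uniform(-1.0, 1.0, size=(j, n_hidden))   # Step 1: random input weights a
    b = rng.uniform(-1.0, 1.0, size=n_hidden)        # Step 1: random hidden biases b
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))           # Step 2: hidden layer outputs, sigmoid g
    beta = np.linalg.pinv(H) @ Y                     # Step 3: beta = H^+ Y (Eq. 4)
    return A, b, beta

def elm_predict(X, A, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))
    return H @ beta
```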

ELM has the advantages of high training speed and good generalization. However, its robustness can be poor. To address this issue, Deng et al. [7] proposed RELM, which combines empirical risk and structural risk. RELM obtains the output weights by minimizing the regularized least squares cost function, which leads to the following formulation:

$$\hbox{min} \,\,L_{RELM} = \frac{1}{2}\left\| \beta \right\|^{2} + \frac{C}{2}\left\| {Y - H\beta } \right\|^{2}$$
(5)

where C is a parameter that balances empirical risk and structural risk.

By setting the gradient of \(L_{RELM}\) with respect to \(\beta\) to zero, we have

$$\beta - CH^{T} (Y - H\beta ) = 0$$
(6)

When the number of training samples is larger than the number of hidden layer nodes, the output weight matrix \(\beta\) in RELM can be expressed as Eq. 7:

$$\beta = \left( {\frac{I}{C} + H^{T} H} \right)^{ - 1} H^{T} Y$$
(7)

When the number of training samples is less than the number of hidden layer nodes, the output weight matrix \(\beta\) in RELM can be expressed as Eq. 8:

$$\beta = H^{T} \left( {\frac{I}{C} + HH^{T} } \right)^{ - 1} Y$$
(8)
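
For reference, here is a minimal sketch of the RELM solution, assuming the hidden layer output matrix H has already been computed; it selects whichever of Eq. 7 or Eq. 8 yields the smaller linear system. The function name is an illustrative choice.

```python
import numpy as np

def relm_output_weights(H, Y, C):
    """Regularized ELM output weights, using Eq. 7 or Eq. 8 depending on the shapes."""
    N, n = H.shape
    if N >= n:   # more samples than hidden nodes: n x n system (Eq. 7)
        return np.linalg.solve(np.eye(n) / C + H.T @ H, H.T @ Y)
    # fewer samples than hidden nodes: N x N system (Eq. 8)
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, Y)
```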

2.2 Extreme learning machine as an auto encoder

The auto encoder is an unsupervised neural network model that is commonly used in deep learning; its outputs are equal to its inputs. ELM-AE, proposed by Kasun et al. [21], is a novel neural network method that, like the auto encoder, can reproduce the input signal.

The ELM-AE model consists of an input layer, a single hidden layer, and an output layer. The model structure of ELM-AE is shown in Fig. 1; it has j input layer nodes, n hidden layer nodes, j output layer nodes, and a hidden layer activation function \(g(x)\). Depending on the relationship between j and n, ELM-AE has three different representations:

Fig. 1 The model structure of ELM-AE

j > n (compressed representation): features are mapped from a higher-dimensional input signal space to a lower-dimensional feature space.

j = n (equal dimension representation): the feature space has the same dimension as the input signal space.

j < n (sparse representation): features are mapped from a lower-dimensional input signal space to a higher-dimensional feature space.

There are two differences between ELM-AE and the traditional ELM. First, ELM is a supervised neural network whose outputs are labels, whereas ELM-AE is an unsupervised neural network whose outputs are equal to its inputs. Second, while the hidden layer parameters of ELM are simply randomly generated, the random input weights of ELM-AE are chosen to be orthogonal and its hidden layer biases are likewise orthogonal (normalized). For N distinct samples, the outputs of the ELM-AE hidden layer can be expressed as Eq. 9, and the relationship between the hidden layer outputs and the output layer outputs can be expressed as Eq. 10.

$$h = g(ax + b),\;\;\;{\text{where}}\;\;a^{T} a = I,\;b^{T} b = 1$$
(9)
$$h(x_{i} )\beta = x_{i}^{T} ,\;\;\;{\kern 1pt} {\text{where}}\,\,i = 1,2, \ldots ,N$$
(10)

Using ELM-AE to obtain the output weights \(\beta\) also involves three steps, but the way the output weights \(\beta\) are calculated in ELM-AE differs from that in ELM.

For sparse and compressed ELM-AE representations, output weights \(\beta\) are calculated by Eqs. 11 and 12:

When the number of training samples is more than the number of hidden layer nodes,

$$\beta = \left( {\frac{I}{C} + H^{T} H} \right)^{ - 1} H^{T} X$$
(11)

When the number of training samples is less than the number of hidden layer nodes,

$$\beta = H^{T} \left( {\frac{I}{C} + HH^{T} } \right)^{ - 1} X$$
(12)

For equal dimension ELM-AE representation, output weights \(\beta\) are calculated by Eq. 13:

$$\beta = H^{ - 1} X$$
(13)
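
The sketch below puts Eqs. 9-13 together under the assumptions stated above (orthogonalized random input weights, unit-norm bias, sigmoid activation). It is illustrative only; in particular, the equal-dimension case uses a pseudo-inverse, since H is N x n and generally not square in practice.

```python
import numpy as np

def elm_ae_output_weights(X, n_hidden, C, rng=None):
    """ELM-AE output weights beta (Eqs. 9-13): orthogonal random hidden layer
    parameters, regularized least-squares reconstruction of the input X."""
    rng = np.random.default_rng(rng)
    N, j = X.shape
    W = rng.standard_normal((j, n_hidden))
    if j >= n_hidden:
        A, _ = np.linalg.qr(W)               # orthonormal columns: a^T a = I
    else:
        Q, _ = np.linalg.qr(W.T)             # sparse case: orthonormal rows instead
        A = Q.T
    b = rng.standard_normal(n_hidden)
    b = b / np.linalg.norm(b)                # b^T b = 1
    H = 1.0 / (1.0 + np.exp(-(X @ A + b)))   # hidden layer outputs (Eq. 9)
    if n_hidden == j:
        # Equal-dimension case (Eq. 13); pseudo-inverse because H is generally not square.
        return np.linalg.pinv(H) @ X
    if N >= n_hidden:                        # Eq. 11
        return np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ X)
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, X)   # Eq. 12
```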

2.3 Unsupervised extreme learning machine

US-ELM proposed by Huang et al. [13] uses the least squares method to obtain an embedded matrix \(E \in R^{N \times m}\) (where m is chosen by the user) that can be used for clustering, and the clustering validity of E is better than that of the original samples \(X \in R^{N \times j}\).

If two samples \(x_{i}\) and \(x_{j}\) are close to each other, then the corresponding outputs \(y_{i}\) and \(y_{j}\) of US-ELM should also be close to each other. US-ELM therefore minimizes the following cost function,

$$\begin{aligned} L_{m} & = \frac{{\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {w_{ij} (y_{i} - y_{j} )^{2} } } }}{{2(c_{1} - c_{2} )^{2} }} = \frac{{\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} {w_{ij} (y_{i}^{2} - 2y_{i} y_{j} + y_{j}^{2} )} } }}{{2(c_{1} - c_{2} )^{2} }} \\ & = \frac{{\sum\limits_{i = 1}^{n} {\sum\limits_{j = 1}^{n} { - 2w_{ij} y_{i} y_{j} } } + \sum\limits_{i = 1}^{n} {y_{i}^{2} \left( {\sum\limits_{j = 1}^{n} {w_{ij} } } \right)} }}{{2(c_{1} - c_{2} )^{2} }} = \frac{{2\hat{Y}^{T} (D - W)\hat{Y}}}{{2(c_{1} - c_{2} )^{2} }} = \frac{{\hat{Y}^{T} L\hat{Y}}}{{(c_{1} - c_{2} )^{2} }} \\ \end{aligned}$$
(14)

where W is the similarity (adjacency) matrix with entries \(w_{ij}\), D is the diagonal degree matrix with \(D_{ii} = \sum\nolimits_{j} {w_{ij} }\), \(L = D - W\) is the graph Laplacian, and \(\hat{Y} = [\vec{y}_{1} , \ldots ,\vec{y}_{n} ]^{T}\) stacks the outputs.
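
The construction of W, D, and L is not detailed in this section; the sketch below assumes one common choice, a symmetrized k-nearest-neighbour graph with Gaussian (heat kernel) weights. The neighbourhood size and kernel width are illustrative parameters.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def graph_laplacian(X, n_neighbors=10, sigma=1.0):
    """Build an affinity matrix W, degree matrix D, and Laplacian L = D - W (Eq. 14)."""
    dist = kneighbors_graph(X, n_neighbors, mode="distance", include_self=False)
    W = dist.copy()
    W.data = np.exp(-W.data ** 2 / (2.0 * sigma ** 2))   # heat kernel weights
    W = 0.5 * (W + W.T)                                   # symmetrize the k-NN graph
    W = W.toarray()
    D = np.diag(W.sum(axis=1))                            # degree matrix
    L = D - W                                             # graph Laplacian
    return W, D, L
```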

US-ELM aims to obtain the output weights by minimizing the least squares regularization cost function, which leads to the following formulation:

$$\begin{aligned} & \hbox{min} \, L_{{US{ - }ELM}} = \left\| \beta \right\|^{2} + C\,{\text{Tr}}(\beta^{T} H^{T} LH\beta ) = {\text{Tr}}(\beta^{T} (I + CH^{T} LH)\beta ) \\ & \,\,\,\,{\text{subject}}\;{\text{to}}\;(H\beta )^{T} H\beta = I \\ \end{aligned}$$
(15)

where \({\text{Tr(}} \cdot )\) denotes the trace of a matrix.

If the number of US-ELM output layer nodes is m, then an optimal solution to Eq. 15 is to choose output weights \(\beta\) whose columns are the generalized eigenvectors corresponding to the m smallest eigenvalues. However, the Laplacian always has a constant eigenvector 1, and this eigenvector corresponds to the smallest eigenvalue 0. We therefore choose the eigenvectors corresponding to the first m smallest eigenvalues, excluding the one for eigenvalue 0, to compose the output weights \(\beta\).

When the number of training samples is more than the number of hidden layer nodes,

$$(I + CH^{T} LH)v = \gamma H^{T} Hv$$
(16)
$$\beta = [\tilde{v}_{2} ,\tilde{v}_{3} , \ldots ,\tilde{v}_{m + 1} ],\;\;\;{\text{where}}\;\tilde{v}_{i} = v_{i} /\left\| {Hv_{i} } \right\|$$
(17)

When the number of training samples is less than the number of hidden layer nodes,

$$(I + CLHH^{T} )\,u = \gamma HH^{T} u$$
(18)
$$\beta = H^{T} [\tilde{u}_{2} ,\tilde{u}_{3} , \ldots ,\tilde{u}_{m + 1} ],\;\;\;{\text{where}}\;\tilde{u}_{i} = u_{i} /\left\| {HH^{T} u_{i} } \right\|$$
(19)
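
As an illustration, the embedding for the case N > n (Eqs. 16 and 17) can be computed with SciPy's generalized symmetric eigensolver. The small diagonal jitter added to \(H^{T}H\) for numerical stability and the function signature are our own choices; the N < n case (Eqs. 18 and 19) is handled analogously.

```python
import numpy as np
from scipy.linalg import eigh

def us_elm_embedding(H, L, C, m):
    """US-ELM embedding for N > n (Eqs. 16-17): generalized eigenvectors of
    (I + C H^T L H) v = gamma H^T H v, skipping the first (constant) eigenvector."""
    N, n = H.shape
    A = np.eye(n) + C * (H.T @ L @ H)
    B = H.T @ H + 1e-10 * np.eye(n)                  # jitter so B is positive definite
    # eigh returns eigenvalues in ascending order; take indices 1..m (skip index 0).
    vals, vecs = eigh(A, B, subset_by_index=[1, m])
    V = vecs / np.linalg.norm(H @ vecs, axis=0)      # normalize: v_i / ||H v_i|| (Eq. 17)
    return H @ V                                     # embedded matrix E = H beta
```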

3 Unsupervised extreme learning machine based on embedded features of ELM-AE

US-ELM obtains the output weights by constructing a cost function different from the ELM cost function and requires no supervised information. However, its randomly generated input weights and biases mean that the outputs of the US-ELM hidden layer may not represent the features of the original samples well. ELM-AE is an artificial neural network algorithm that is able to reconstruct the input signal; to achieve this reconstruction, ELM-AE obtains the main features of the original samples, as an auto encoder does, and it finds principal components that represent the original samples without any iterative learning process. We can therefore use the features learned by ELM-AE instead of the random hidden layer outputs of US-ELM, and then use the US-ELM cost function to obtain the output weights. We call this method US-EF-ELM.

For N distinct samples (\(x_{i} ,y_{i}\)), i = 1, \(\ldots\), N, with \(x_{i} \in R^{j}\) and \(y_{i} \in R^{m}\), the model structure of US-EF-ELM is shown in Fig. 2, with j input layer nodes, n hidden layer nodes, m output layer nodes, and a hidden layer activation function g(x). First, ELM-AE is used to calculate the output weights \(\beta_{1}\), and the transpose of \(\beta_{1}\) is used as the input weights of US-EF-ELM. The hidden layer output matrix H of US-EF-ELM is then computed as follows:

$${\text{When}}\,\,n \ne j\;\;\;{\text{and}}\;\;\;N > n,\,\,\,\beta_{1} = \left( {\frac{{I_{n} }}{{C_{1} }} + H_{{ELM{ - }AE}}^{T} H_{{ELM{ - }AE}} } \right)^{ - 1} H_{{ELM{ - }AE}}^{T} X$$
(20)
$${\text{When}}\,\,n\, \ne \,j\;\;\;{\kern 1pt} {\text{and}}\;\;\;{\kern 1pt} N < n,\,\,\,\beta_{1} = H_{{ELM{ - }AE}}^{T} \left( {\frac{{I_{N} }}{{C_{1} }} + H_{{ELM{ - }AE}} H_{{ELM{ - }AE}}^{T} } \right)^{ - 1} X$$
(21)
$${\text{When}}\,\,\,n = j,\,\,\,\beta_{1} = H_{{ELM{ - }AE}}^{ - 1} X$$
(22)
$$H = g(X \cdot \beta_{1}^{T} )$$
(23)

Fig. 2 The model structure of US-EF-ELM

US-EF-ELM aims to obtain the output weights by minimizing the US-ELM cost function, which leads to the following formulation:

$$\mathop {\hbox{min} }\limits_{{\beta_{2}^{T} H^{T} H\beta_{2} = I}} \, L_{{US{ - }EF{ - }ELM}} = \left\| {\beta_{2} } \right\|^{2} + C_{2} {\text{Tr}}(\beta_{2}^{T} H^{T} LH\beta_{2} ) = {\text{Tr}}(\beta_{2}^{T} (I + C_{2} H^{T} LH)\beta_{2} )$$
(24)

As in US-ELM, we choose the eigenvectors corresponding to the first m smallest eigenvalues, excluding the one corresponding to eigenvalue 0, to compose the output weights \(\beta_{2}\).

$$\begin{aligned} & {\text{When}}\;N > n,\;\;(I_{n} + C_{2} H^{T} LH)v = \gamma H^{T} Hv, \\ & \beta_{2} = [\tilde{v}_{2} ,\tilde{v}_{3} , \ldots ,\tilde{v}_{m + 1} ],\;\;\;{\text{where}}\;\tilde{v}_{i} = v_{i} /\left\| {Hv_{i} } \right\| \\ \end{aligned}$$
(25)
$$\begin{aligned} & {\text{When}}\;N < n,\;\;(I_{N} + C_{2} LHH^{T} )u = \gamma HH^{T} u, \\ & \beta_{2} = H^{T} [\tilde{u}_{2} ,\tilde{u}_{3} , \ldots ,\tilde{u}_{m + 1} ],\;\;\;{\text{where}}\;\tilde{u}_{i} = u_{i} /\left\| {HH^{T} u_{i} } \right\| \\ \end{aligned}$$
(26)

The outputs E of US-EF-ELM can be calculated by the following formulation:

$$E = H\beta_{2}$$
(27)

Then the k-means algorithm can be adopted to perform clustering in the embedded space E of US-EF-ELM. The complete US-EF-ELM procedure can be summarized as follows.
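
The sketch below composes the helper functions outlined in the previous sections (ELM-AE output weights, graph Laplacian, US-ELM-style embedding) with k-means from scikit-learn. It illustrates the pipeline under those earlier assumptions and is not the authors' original MATLAB implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def us_ef_elm_kmeans(X, n_clusters, n_hidden, C1, C2, m, rng=None):
    """US-EF-ELM k-means: (1) ELM-AE output weights beta_1 (Eqs. 20-22);
    (2) hidden outputs H = g(X beta_1^T) (Eq. 23);
    (3) US-ELM-style embedding E = H beta_2 (Eqs. 24-27); (4) k-means on E."""
    beta1 = elm_ae_output_weights(X, n_hidden, C1, rng)       # step 1 (sketched in Sect. 2.2)
    H = 1.0 / (1.0 + np.exp(-(X @ beta1.T)))                  # step 2, sigmoid g
    _, _, L = graph_laplacian(X)                              # Laplacian of the input data
    E = us_elm_embedding(H, L, C2, m)                         # step 3 (sketched in Sect. 2.3)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(E)  # step 4
    return labels, E
```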

4 Experiments and results

In this section, the performance of the US-EF-ELM k-means clustering algorithm is compared with classical methods (k-means algorithm and spectral clustering algorithm) and the US-ELM k-means clustering algorithm.

4.1 Data sets

The data sets used in the experiments are all from the UCI Machine Learning Repository, which include Ionosphere, Breast Cancer Wisconsin (Diagnostic), Musk (Version 1), Musk (Version 2), Abalone, ISOLET, EEG Eye State, and PEMS-SF. The details of those data sets are given in Table 1.

Table 1 The details of UCI data sets

4.2 Experimental setup

The experiments are run on a workstation with an Intel Core i7 (DMI2) 3.6 GHz processor and 18 GB RAM, using MATLAB 2012b. The k-means algorithm, spectral clustering (SC) algorithm, US-ELM k-means algorithm, and US-EF-ELM k-means algorithm are each run 100 times, except that the SC algorithm is run only 10 times on the last five (larger) data sets.

The number of hidden nodes of US-ELM and US-EF-ELM is set to 2000 for all data sets, and the activation functions of US-ELM and US-EF-ELM are the sigmoid function. The kernel of the SC algorithm is the Gaussian kernel. The parameter C in US-ELM and the parameters \(C_{1}\) and \(C_{2}\) in US-EF-ELM are chosen from \(\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}, 10^{3}, 10^{4}\}\).

4.3 Quality of the clustering results

This paper uses the clustering accuracy (ACC) [22] to measure the quality of the clustering results. For N distinct samples \(x_{i} \in R^{j}\), let \(y_{i}\) and \(c_{i}\) be the inherent category label and the predicted cluster label of \(x_{i}\), respectively. The ACC is calculated as

$$ACC = \frac{{\sum\nolimits_{i = 1}^{N} {\delta (y_{i} ,map(c_{i} ))} }}{N}$$
(28)

where \(map( \cdot )\) maps each cluster label to a category label using the Hungarian algorithm [23], and this mapping is optimal; \(\delta (y_{i} ,c_{i} )\) equals 1 if \(y_{i} = c_{i}\) and 0 otherwise. The higher the ACC value, the better the clustering performance.
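
For illustration, ACC can be computed as follows; the implementation and its function name are our own, with the optimal matching provided by SciPy's linear_sum_assignment (the Hungarian algorithm).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """ACC (Eq. 28): optimally map cluster labels to category labels, then
    count the fraction of correctly matched samples."""
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # count[i, j] = number of samples in class i assigned to cluster j
    count = np.array([[np.sum((y_true == k) & (y_pred == c)) for c in clusters]
                      for k in classes])
    row, col = linear_sum_assignment(-count)   # negate to maximize total agreement
    return count[row, col].sum() / len(y_true)
```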

The comparison of these algorithms is shown in Table 2, where the dimensions of the embedded space of US-ELM and US-EF-ELM are selected based on Fig. 3. The US-EF-ELM k-means algorithm achieves satisfying results on most of the data sets. The first three data sets are small and the remaining five are larger: the SC algorithm performs better on the small data sets, whereas the US-ELM k-means and US-EF-ELM k-means algorithms perform better on the larger ones.

Table 2 The performance comparison of the proposed US-EF-ELM
Fig. 3 Clustering performance on UCI data sets as a function of the dimension of the embedded space: a Ionosphere; b Breast Cancer Wisconsin (Diagnostic); c Musk (Version 1); d Musk (Version 2); e Abalone; f ISOLET; g EEG Eye State; h PEMS-SF

The dimension of the embedded space of US-ELM and US-EF-ELM has an obvious effect on clustering performance. Figure 3 shows the clustering performance of US-ELM and US-EF-ELM for different dimensions of the embedded space. The following conclusions can be drawn from Fig. 3: (1) the best clustering performance of US-EF-ELM is better than that of US-ELM; (2) US-EF-ELM is better than US-ELM over a wide range of embedding dimensions; (3) Huang et al. [13] report that US-ELM attains its best performance with a very low-dimensional embedding, but on the ISOLET and Musk (Version 2) data sets both US-ELM and US-EF-ELM attain their best performance with a high-dimensional embedding, so the recommendation to perform k-means in a relatively low-dimensional space should be taken only as a guideline.

4.4 Computational efficiency

The training time comparison of these algorithms is shown in Table 3. The k-means algorithm clearly has the best computational efficiency. Because US-EF-ELM additionally runs ELM-AE, its training time per run is longer than that of US-ELM; however, since the selected embedded space dimensions of US-EF-ELM and US-ELM differ, the US-ELM k-means algorithm takes longer on some data sets. The results on the PEMS-SF data set show that the computational efficiency of US-EF-ELM is significantly lower than that of US-ELM when the samples are high-dimensional, because the ELM-AE step is expensive in that case. The SC algorithm has the longest training time on the large data sets. Although the training time of the US-EF-ELM k-means algorithm is longer than that of US-ELM on most of the data sets, US-EF-ELM is still efficient.

Table 3 The training time comparison of the proposed US-EF-ELM

5 Conclusions

In this paper, US-EF-ELM is presented, and the US-EF-ELM k-means algorithm achieves better clustering performance than the US-ELM k-means algorithm on the UCI data sets. Besides its good clustering performance, the proposed algorithm is also computationally efficient. US-EF-ELM is a learning algorithm for SLFNs and can be extended to a learning algorithm for multilayer feed-forward neural networks, which would use ELM-AE to train the weights of each layer and then use the US-ELM cost function to obtain the output weights. However, it is difficult to determine the model structure, so future research will focus on finding a suitable model structure for the algorithm.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 61379101), the National Key Basic Research Program of China (No. 2013CB329502), and the Natural Science Foundation of Jiangsu Province (No. BK20130209).