1 Introduction

Extreme learning machine (ELM) [1] is a learning method for training “generalized” single hidden layer feedforward neural networks (SLFNs), and it has shown good performance in various research studies [2]. Unlike traditional neural networks, which adjust the network parameters iteratively, ELM generates the input weights and hidden layer biases randomly and determines the output weights analytically using the Moore-Penrose (MP) generalized inverse. Due to its extremely fast learning speed, good generalization ability and ease of implementation, ELM has drawn great attention in academia [3, 4]. However, the randomly assigned network parameters often degrade the performance of ELM.

To enhance the performance of ELM, researchers have proposed a number of improved methods from different perspectives, such as ensemble learning [5], voting schemes [6], weighting methods [7] and instance cloning [4]. Liu and Wang [5] embedded ensemble learning into the training phase of ELM to mitigate overfitting and improve predictive stability. Cao et al. [6] incorporated multiple independent ELM models into a unified framework and combined their outputs by voting. Zong et al. [7] proposed a weighting scheme for ELM that assigns a different weight to each example. These methods have achieved good performance on some specific problems. However, they do not address the primary problem in ELM, namely the random generation of the network parameters, so their performance may still be compromised. How to select suitable network parameters for ELM remains an open problem.

In practice, the original data can provide valuable information through its different representations, so it is desirable for ELM to determine the network parameters from the original data. A straightforward way to do this is to use an autoencoder. An autoencoder [8] is a special case of artificial neural network, usually used for unsupervised learning, in which the output layer has the same number of neurons as the input layer. Its learning procedure can be divided into encoding and decoding [9, 10]. The input data is mapped to a high-level representation in the encoding stage, and the high-level representation is mapped back to the original input data in the decoding stage. In this way, an autoencoder can explore the underlying data information and encode this information into the output weights.

Based on the above observations, we propose a novel pre-trained extreme learning machine (P-ELM for short), in which an ELM-based autoencoder (ELM-AE) is adopted to pre-train suitable network parameters. The proposed P-ELM encodes the data information into the learned network parameters, which are then used for further learning. Experiments on face image recognition and handwritten image annotation applications demonstrate that the proposed P-ELM outperforms other state-of-the-art ELM algorithms in most settings. The advantages of P-ELM can be summarized as follows:

  • P-ELM falls into the category of data-driven methods, which can successfully find the proper network parameters for further learning.

  • P-ELM is simple in both theory and implementation, which inherits the advantages of the original ELM.

  • P-ELM is a nonlinear learning model and flexible in modeling different complex real-world relationships.

The remainder of the paper is structured as follows. Section 2 surveys the related work. Section 3 presents the proposed P-ELM method. The experiments are demonstrated in Sect. 4. Finally, we conclude the paper in Sect. 5.

Fig. 1. Illustration of network structures for (a) ELM and (b) ELM-AE.

2 Related Work

Extreme learning machine (ELM) is an elegant learning method, which was originally proposed for SLFNs and then extended to “generalized” SLFNs [1]. In ELM, the hidden neurons need not be neuron-like and the network parameters are not tuned iteratively. The network structure of ELM is shown in Fig. 1(a). The basic ELM can be regarded as a two-stage learning system, which can be split into feature mapping and parameter solving [11, 12]. In the feature mapping stage, ELM randomly selects the input weights and hidden biases and calculates the hidden layer output matrix via an activation function. In the parameter solving stage, the output weights are analytically determined as the smallest-norm least-squares solution of a general linear system, computed with the Moore-Penrose (MP) generalized inverse. To accelerate learning, Huang et al. [13] presented a constrained-optimization-based ELM and provided two effective solutions for different sizes of training data. The learning theories and real-world applications of ELM are well developed in the literature [2].
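The two stages described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the uniform sampling range, the sigmoid activation, the toy regression data, and the function names `elm_fit` and `elm_predict` are all chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Basic ELM: random feature mapping, then least-squares output weights."""
    n_features = X.shape[1]
    # Stage 1 (feature mapping): input weights and biases are drawn at random
    # (the range [-1, 1] is an assumption) and never updated afterwards.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)  # hidden layer output matrix, shape (N, L)
    # Stage 2 (parameter solving): minimum-norm least-squares solution
    # obtained with the Moore-Penrose pseudoinverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Toy regression problem (synthetic data, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
T = np.sin(X[:, :1]) + 0.5 * X[:, 1:2]
W, b, beta = elm_fit(X, T, n_hidden=40, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, beta) - T) ** 2))
print(train_mse)
```

No iterative optimization appears anywhere in the sketch: the only fitted quantity is `beta`, obtained from a single linear solve.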

Apart from ELM-based SLFNs, the ELM theories can also be applied to build an ELM-based autoencoder (ELM-AE) [14]. An autoencoder is commonly used as a feature extractor and usually functions as a basic unit in a multilayer learning model [15]. In recent years, autoencoders have been widely used to tackle real-world applications such as cross-language learning and domain adaptation. Similar to ELM, an ELM-AE can also be regarded as a two-stage process, where the input data is first mapped to a high-level latent representation, which is then mapped back to the original input data [16]. The network structure of ELM-AE is shown in Fig. 1(b). The main difference between ELM and ELM-AE is the output layer. In ELM, the output layer predicts the target value for the given data. By contrast, in ELM-AE, the output layer reconstructs the original input data. Due to this learning mechanism, ELM-AE extracts informative features through the hidden layer and encodes the underlying data information into the output weights. Motivated by this, we propose to employ an ELM-AE to pre-train the network parameters for P-ELM.

3 Proposed Method

In this section, we present the proposed pre-trained extreme learning machine (P-ELM). Specifically, P-ELM is achieved through the following steps: (1) Employ an ELM-AE for network parameter learning; (2) Train the P-ELM model with the learned network parameters; and (3) Predict the class labels of the testing instances. Algorithm 1 reports the learning process of the proposed P-ELM.

3.1 Parameter Learning

In P-ELM, the most important aspect is to choose suitable network parameters based on the original data. To this end, we use an ELM-AE to learn the network parameters. Given N distinct training examples \(\mathcal {D}=\{(\varvec{x}_i, \varvec{t}_i)\}_{i=1}^{N}\), where \(\varvec{x}_i{\in }\mathbb {R}^n\) is the input data and \(\varvec{t}_i{\in }\mathbb {R}^m\) is the expected output, the encoding process in ELM-AE with L hidden neurons can be written as:

$$\begin{aligned} \varvec{h}(\varvec{x}_i) = g(\varvec{\alpha }{\cdot }\varvec{x}_i+\varvec{b}),~~~~i=1,2,...,N; \end{aligned}$$
(1)

where \(\varvec{\alpha }{\in }\mathbb {R}^{L{\times }n}\) is the input weight matrix, \(\varvec{b}{\in }\mathbb {R}^{L{\times }1}\) is the hidden neuron bias vector, \(g(\cdot )\) is an activation function, and \(\varvec{h}(\varvec{x}_i)\) is the high-level latent representation for the input data \(\varvec{x}_i\). By contrast, the decoding process in ELM-AE can be formulated as follows:

$$\begin{aligned} \varvec{h}(\varvec{x}_i)\varvec{\varpi } = \varvec{x}_i,~~~~i=1,2,...,N; \end{aligned}$$
(2)

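where \(\varvec{\varpi }{\in }\mathbb {R}^{L{\times }n}\) is the output weight matrix. Equation (2) can also be written in compact form over the whole dataset, where the rows of \(\mathbf H {\in }\mathbb {R}^{N{\times }L}\) are the hidden representations \(\varvec{h}(\varvec{x}_i)\) and the rows of \(\mathbf X {\in }\mathbb {R}^{N{\times }n}\) are the inputs \(\varvec{x}_i\):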

$$\begin{aligned} \mathbf H \varvec{\varpi }=\mathbf X . \end{aligned}$$
(3)

To enhance the performance of ELM-AE, the output weight matrix \(\varvec{\varpi }\) is obtained by minimizing the objective function \(L(\varvec{\varpi })= \frac{1}{2}||\varvec{\varpi }||^2+\frac{C}{2}||\mathbf X -\mathbf H \varvec{\varpi }||^2\), where C is a regularization parameter. The output weight matrix \(\varvec{\varpi }\) is then computed by Eq. (4), whose form depends on the relationship between the number of training samples N and the number of hidden neurons L.

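$$\begin{aligned} \varvec{\varpi } = {\left\{ \begin{array}{ll} \mathbf H ^T\left( \frac{\mathbf I }{C}+\mathbf H \mathbf H ^T\right) ^{-1}\mathbf X , &{} N \le L; \\ \left( \frac{\mathbf I }{C}+\mathbf H ^T\mathbf H \right) ^{-1}\mathbf H ^T\mathbf X , &{} N > L. \end{array}\right. } \end{aligned}$$
(4)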
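For readers who want the intermediate step, the closed-form minimizer of the regularized objective can be derived as follows. This is a sketch that assumes \(||\cdot ||\) denotes the Frobenius norm and that \(\mathbf I \) is an identity matrix of the appropriate size; the two resulting expressions are algebraically equal and differ only in the size of the matrix that must be inverted (L by L versus N by N).

```latex
\begin{aligned}
\nabla_{\varpi} L(\varpi) &= \varpi - C\,\mathbf{H}^T(\mathbf{X} - \mathbf{H}\varpi) = 0 \\
\Rightarrow\quad \Big(\tfrac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\Big)\varpi &= \mathbf{H}^T\mathbf{X} \\
\Rightarrow\quad \varpi &= \Big(\tfrac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\Big)^{-1}\mathbf{H}^T\mathbf{X}. \\[4pt]
\text{Since}\quad \Big(\tfrac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\Big)\mathbf{H}^T
  &= \mathbf{H}^T\Big(\tfrac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\Big), \\
\varpi &= \mathbf{H}^T\Big(\tfrac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\Big)^{-1}\mathbf{X}.
\end{aligned}
```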

This parameter learning mechanism enables ELM-AE to encode the underlying information of the original data into the output weights, which can then be used as the input weights of the P-ELM model to achieve better performance. It is a data-driven method that adaptively searches for suitable network parameters based on the specific data.
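The parameter learning step can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: the function names `solve_output_weights` and `elm_ae_fit`, the uniform range of the random encoder weights, the sigmoid activation, and the synthetic data are not taken from the paper, and the closed form used is the standard minimizer of the regularized least-squares objective given above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def solve_output_weights(H, Y, C):
    """Closed-form minimizer of 0.5*||W||^2 + 0.5*C*||Y - H W||^2."""
    N, L = H.shape
    if N <= L:
        # Fewer samples than hidden neurons: solve an N x N system.
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, Y)
    # More samples than hidden neurons: solve an L x L system.
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)

def elm_ae_fit(X, n_hidden, C, rng):
    """Return ELM-AE output weights (shape L x n) that reconstruct X from H."""
    n = X.shape[1]
    # Random encoder parameters (the range [-1, 1] is an assumption).
    alpha = rng.uniform(-1.0, 1.0, size=(n_hidden, n))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ alpha.T + b)           # encoding: row i is h(x_i)
    return solve_output_weights(H, X, C)   # decoding: the target is X itself

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))              # synthetic inputs, N=100, n=8
varpi = elm_ae_fit(X, n_hidden=20, C=10.0, rng=rng)
print(varpi.shape)  # (20, 8)
```

Both branches of `solve_output_weights` return the same matrix up to rounding; the branch only selects the smaller linear system.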

Algorithm 1. Learning process of the proposed P-ELM.

3.2 Model Training

In this section, we aim to formulate the learning model of the proposed pre-trained extreme learning machine (P-ELM). As described in the previous section, we use the output weights \(\varvec{\varpi }\) learned by ELM-AE as the input weights for P-ELM. In P-ELM, the input weights can be represented as \(\varvec{\varpi }^T\), and the hidden layer biases can be expressed as \(\varvec{b}^\prime \), where the ith hidden layer bias is \(b_i^\prime =(\sum _{j=1}^{n}\varpi _{ij})/n, i=1,2,...,L\). Therefore, the proposed P-ELM with L hidden neurons can be formulated as:

$$\begin{aligned} \begin{aligned} \varvec{t}_i&= \sum \nolimits _{j=1}^{L} \varvec{\beta }_jg(\varvec{\varpi }^T_j{\cdot }\varvec{x}_i+b_j^\prime ) \\ {}&=g(\varvec{\varpi }^T{\cdot }\varvec{x}_i+\varvec{b}^\prime )\varvec{\beta } \\ {}&=\varvec{h}^\prime (\varvec{x}_i)\varvec{\beta } \end{aligned} ,~~~~i=1,2,...,N; \end{aligned}$$
(5)

where \(\varvec{h}^\prime (\varvec{x}_i)=g(\varvec{\varpi }^T{\cdot }\varvec{x}_i+\varvec{b}^\prime )\) is the hidden layer output for the input data \(\varvec{x}_i\) and \(\varvec{\beta }{\in }\mathbb {R}^{L{\times }m}\) is the output weight matrix of the proposed P-ELM. Equation (5) can be rewritten in the following compact form:

$$\begin{aligned} \mathbf H ^\prime \varvec{\beta }=\mathbf T . \end{aligned}$$
(6)

To calculate the output weight matrix \(\varvec{\beta }\), Eq. (6) can be solved by minimizing the objective function: \(L(\varvec{\beta })= \frac{1}{2}||\varvec{\beta }||^2+\frac{C}{2}||\mathbf T -\mathbf H ^\prime \varvec{\beta }||^2.\) Similar to Eq. (4), the output weight matrix \(\varvec{\beta }\) can be calculated as the following equation according to the relationship between the number of training samples N and the number of hidden neurons L.

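$$\begin{aligned} \varvec{\beta } = {\left\{ \begin{array}{ll} \mathbf H ^{\prime T}\left( \frac{\mathbf I }{C}+\mathbf H ^\prime \mathbf H ^{\prime T}\right) ^{-1}\mathbf T , &{} N \le L; \\ \left( \frac{\mathbf I }{C}+\mathbf H ^{\prime T}\mathbf H ^\prime \right) ^{-1}\mathbf H ^{\prime T}\mathbf T , &{} N > L. \end{array}\right. } \end{aligned}$$
(7)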

The training process of P-ELM is determined by Eq. (5). Different from the traditional ELM with randomly generated network parameters, the proposed P-ELM uses the network parameters pre-trained by ELM-AE for model training. By doing so, the performance of P-ELM can be improved. This is the major difference between P-ELM and the original ELM.
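The model training step can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `ridge_solve`, `pelm_hidden` and `pelm_fit`, the random encoder range, the use of the same C in both stages, the one-hot target encoding, and the synthetic data are all assumptions. Inputs are stored as rows, so the hidden output is computed as \(g(\mathbf X \varvec{\varpi }^T+\varvec{b}^\prime )\), and for brevity only the L by L form of the closed-form solution is used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ridge_solve(H, Y, C):
    """Minimizer of 0.5*||W||^2 + 0.5*C*||Y - H W||^2 (L x L form)."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ Y)

def pelm_hidden(X, varpi):
    """Hidden output H' = g(X varpi^T + b'), where b'_i is the mean of row i of varpi."""
    b_prime = varpi.mean(axis=1)            # shape (L,)
    return sigmoid(X @ varpi.T + b_prime)   # shape (N, L)

def pelm_fit(X, T, n_hidden, C, rng):
    # Step 1: an ELM-AE with a random encoder learns varpi by reconstructing X.
    alpha = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[1]))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    varpi = ridge_solve(sigmoid(X @ alpha.T + b), X, C)   # shape (L, n)
    # Step 2: reuse varpi as the input weights and solve for beta.
    beta = ridge_solve(pelm_hidden(X, varpi), T, C)       # shape (L, m)
    return varpi, beta

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))                       # synthetic inputs
T = np.eye(3)[rng.integers(0, 3, size=150)]         # one-hot targets (assumed encoding)
varpi, beta = pelm_fit(X, T, n_hidden=12, C=1.0, rng=rng)
print(varpi.shape, beta.shape)  # (12, 6) (12, 3)
```

The only difference from a basic ELM is where the input weights and biases come from; the solve for `beta` is unchanged.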

3.3 Label Prediction

In the testing phase, the class label of each testing instance is predicted by the trained P-ELM model. The testing instances are first used to calculate the hidden layer output based on the pre-trained input weights and hidden layer biases. The class labels of the testing instances are then determined from the network output in Eq. (6), using the learned output weights. Label prediction in the proposed P-ELM is thus similar to the prediction process in ELM.
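A minimal sketch of the prediction step is given below. It assumes one-hot target encoding during training, so that the predicted class is the index of the largest network output; the paper does not spell out this decision rule, and the function name `pelm_predict_labels` and the hand-set weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pelm_predict_labels(X_test, varpi, beta):
    """Predict class indices from pre-trained input weights and output weights."""
    b_prime = varpi.mean(axis=1)                   # biases derived from varpi
    H_test = sigmoid(X_test @ varpi.T + b_prime)   # hidden layer output for test data
    scores = H_test @ beta                         # network outputs, one column per class
    return np.argmax(scores, axis=1)               # assumed decision rule: largest output

# Hand-set toy parameters (L=2 hidden neurons, n=2 features, m=2 classes).
varpi = np.array([[1.0, 0.0], [0.0, 1.0]])   # gives b' = [0.5, 0.5]
beta = np.array([[1.0, 0.0], [0.0, 1.0]])
X_test = np.array([[3.0, -3.0], [-3.0, 3.0]])
labels = pelm_predict_labels(X_test, varpi, beta)
print(labels)  # [0 1]
```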

4 Experimental Results

To validate the performance of the proposed method, experiments are conducted on face image recognition [17] and handwritten image annotation [14]. Classification accuracy [18, 19] and running time [20] are used as the evaluation metrics. The reported results are based on 10-fold cross validation (CV). In P-ELM, the parameter C is tuned by a grid-search strategy over \(\{0.01,0.1,1,10,100,1000\}\), the sigmoid function is used as the activation function of the hidden layer, and the number of hidden neurons is set according to the specific application. For comparison, we evaluate P-ELM against four ELM-based methods: a faster ELM method (ELM) [13], ensemble based ELM (EN-ELM) [5], voting based ELM (V-ELM) [6] and weighting based ELM (W-ELM) [7].
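One way to combine the grid search over C with 10-fold CV is sketched below. The paper does not state how the two are combined, so selecting the C with the highest mean CV accuracy is an assumption. The classifier `ridge_fit_predict` is a simple placeholder (ridge regression on raw features) standing in for any of the compared models, and the two-class Gaussian data is synthetic.

```python
import numpy as np

C_GRID = [0.01, 0.1, 1, 10, 100, 1000]

def ridge_fit_predict(X_tr, y_tr, X_te, C, n_classes):
    """Placeholder classifier: ridge regression on one-hot targets, then argmax."""
    T = np.eye(n_classes)[y_tr]
    beta = np.linalg.solve(np.eye(X_tr.shape[1]) / C + X_tr.T @ X_tr, X_tr.T @ T)
    return np.argmax(X_te @ beta, axis=1)

def cross_val_accuracy(fit_predict, X, y, C, n_classes, n_folds=10, seed=0):
    """Mean and standard deviation of test accuracy over n_folds folds."""
    # A fixed seed gives the same split for every C, so the values are comparable.
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), n_folds)
    accs = []
    for k, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx], C, n_classes)
        accs.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(accs)), float(np.std(accs))

# Synthetic two-class data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.repeat([0, 1], 100)

results = {C: cross_val_accuracy(ridge_fit_predict, X, y, C, n_classes=2)
           for C in C_GRID}
best_C = max(results, key=lambda C: results[C][0])
print(best_C, results[best_C])
```

Reporting both the mean and the standard deviation across folds matches the two quantities quoted for each method in the result tables.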

4.1 Face Image Recognition

In this section, we report the performance of P-ELM on face image recognition. The datasets used in the experiments are the ORL and Yale face image recognition datasets. The ORL dataset contains 400 face images of size \(32\times 32\), which belong to 10 different people. These images were taken at different times, with varying lighting, facial expressions and facial details. The Yale dataset has 165 face images of size \(32\times 32\), showing different facial expressions of 10 different people (Fig. 2).

Fig. 2. Example images from different face image databases: (a) ORL and (b) Yale.

In Table 1, we report the experimental results of P-ELM and the baselines with 50 hidden neurons on the two face image datasets. The results indicate that P-ELM achieves higher testing accuracy and lower standard deviation than the baselines. P-ELM achieves 74.50% testing accuracy with 4.06% standard deviation on the ORL dataset, and 60.63% testing accuracy with 8.04% standard deviation on the Yale dataset. In terms of both training time and testing time, P-ELM is superior to EN-ELM and V-ELM, and slightly inferior to ELM and W-ELM. The experimental results for all compared methods with different numbers of hidden neurons are given in Fig. 3, from which we can observe that P-ELM always significantly outperforms the baselines on both the ORL and Yale datasets. We attribute the performance of P-ELM on face image recognition to its parameter learning mechanism, which derives the network parameters from the data instead of generating them randomly.

Table 1. Performance comparison on face image recognition.
Fig. 3. Performance comparison with respect to the number of hidden neurons on face image recognition: (a) ORL and (b) Yale.

4.2 Handwritten Image Annotation

In this section, we report the performance of P-ELM on handwritten image annotation. In the experiments, we use the USPS and MNIST handwritten image annotation datasets. The USPS dataset contains 9298 gray-scale handwritten digit images of size \(16\times 16\). The MNIST dataset used in the experiments consists of 10000 images of handwritten digits of size \(28\times 28\), with 1000 images per digit. Both datasets cover the 10 categories “0” through “9” (Fig. 4).

Fig. 4. Example images from different handwritten image databases: (a) USPS and (b) MNIST.

Table 2 shows the performance of P-ELM and the baselines with 100 hidden neurons on the handwritten image datasets. P-ELM achieves 91.86% testing accuracy with 0.79% standard deviation on the USPS dataset, and 88.44% testing accuracy with 0.94% standard deviation on the MNIST dataset, which shows its superiority over the baselines. In terms of training time, P-ELM needs slightly more time than ELM, is slightly faster than W-ELM, and runs much faster than EN-ELM and V-ELM. In terms of testing time, P-ELM is slightly inferior to ELM and W-ELM, and significantly superior to EN-ELM and V-ELM. In addition, the results for P-ELM and the baseline methods with various numbers of hidden neurons are presented in Fig. 5. As can be observed from Fig. 5, P-ELM is always superior to the baselines on the USPS dataset, and achieves better or comparable performance on the MNIST dataset. These observations suggest that P-ELM is also effective on handwritten image annotation, mainly because it uses an ELM-AE to learn suitable network parameters.

Table 2. Performance comparison on handwritten image annotation.
Fig. 5. Performance comparison with respect to the number of hidden neurons on handwritten image annotation: (a) USPS and (b) MNIST.

5 Conclusion

In this paper, we proposed a novel method called pre-trained extreme learning machine (P-ELM for short). The proposed P-ELM is a data-driven method, which uses an ELM-AE to intelligently determine the suitable network parameters for diverse learning tasks. The unique parameter learning mechanism, including the processes of encoding and decoding, ensures that P-ELM can encode the underlying information of the original data into the network parameters. Experiments and comparisons on face image recognition and handwritten image annotation (each application contains two datasets) demonstrate the superior performance of the proposed P-ELM compared to baseline methods.