Abstract
This paper proposes a modified ELM algorithm that selects the input weights and biases of single-hidden layer feedforward neural networks with a sigmoidal activation function before training the output weights, and proves mathematically that the resulting hidden layer output matrix has full column rank. The modified ELM thus avoids the randomness of the original ELM. Experimental results on both regression and classification problems show good performance of the modified ELM algorithm.
1 Introduction
Feedforward neural networks have been deeply and systematically studied because of their universal approximation capabilities on compact input sets and their approximation capabilities on finite sets. Cybenko [1] and Funahashi [2] proved that any continuous function can be approximated on a compact set, in the uniform topology, by a single-hidden layer feedforward neural network (SLFN) with any continuous sigmoidal activation function. Hornik [3] showed that any measurable function can be approximated by such networks. A variety of results on SLFN approximation of multivariate functions were later established by many authors: [4–9] by Cao, Xu et al. [10], [11] by Chen and Chen, [12] by Hahm and Hong, [13] by Leshno et al., etc. SLFNs have been extensively used in many fields due to their abilities: (1) to approximate complex nonlinear mappings directly from the input samples; and (2) to provide models for a large class of natural and artificial phenomena that are difficult to handle using classical parametric methods.
It is known that density and complexity results for neural networks form the theoretical basis for learning algorithms. In practical applications, close attention is paid to fast learning algorithms for neural networks on finite training sets.
Huang and Babri [14] showed that a single-hidden layer feedforward neural network with at most N hidden nodes and almost any type of nonlinear activation function can exactly learn N distinct samples. Although conventional gradient-based learning algorithms, such as back-propagation (BP) and its variant, the Levenberg–Marquardt (LM) method, have been extensively used for training multilayer feedforward neural networks, these learning algorithms are still relatively slow and may easily get stuck in a local minimum. Support vector machines (SVMs) have been extensively used for complex mappings and are famous for their good generalization ability. However, finely tuning SVM kernel parameters is time-consuming.
This paper is organized as follows. Section 2 gives theoretical deductions of modified ELM algorithm with sigmoidal activation function. Section 3 proposes a modified ELM algorithm. Section 4 presents performance evaluation. Section 5 consists of discussions and conclusions.
2 Theoretical deductions of modified ELM algorithm with sigmoidal activation function
Recently, a new learning algorithm for SLFNs named the extreme learning machine (ELM) was proposed by Huang et al. [15, 16]. Unlike traditional approaches (such as BP algorithms and SVMs), the ELM algorithm has a concise architecture and does not need to tune the input weights and biases. In particular, the learning speed of ELM can be thousands of times faster than that of traditional feedforward network learning algorithms such as BP. Compared with traditional learning algorithms, ELM tends not only to reach the smallest training error but also to obtain the smallest norm of weights. According to Bartlett [17], the smaller the training error and the smaller the norm of the weights, the better the generalization performance the network tends to have. Therefore, ELM can achieve good generalization performance. Much work has since been conducted on ELM and related problems, and relevant results have been obtained in [18–22].
The ELM algorithm of SLFNs can be summarized as the following three steps:
Algorithm of ELM: Given a training set \({{\mathcal{N}}=\{(X_{i},t_{i})\,|\,X_{i}\in {\mathbb{R}}^{d},\, t_{i}\in {\mathbb{R}},\, i=1,2,\ldots, n\}}\), activation function g and hidden node number N:

- Step 1: Randomly assign input weights \(W_{i}\) and biases \(b_{i}\) \((i=1,2,\ldots, N)\).
- Step 2: Calculate the hidden layer output matrix H.
- Step 3: Calculate the output weight β by \(\beta={\bf H}^{\dag}T,\) where \({\bf H}^{\dag}\) is the Moore–Penrose generalized inverse of H and \(T=(t_{1},t_{2},\ldots,t_{n})^{T}.\)
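The three steps above can be sketched in NumPy; the sigmoid activation and a standard-normal random initialization are assumptions for illustration, not prescriptions from the paper.

```python
import numpy as np

def elm_train(X, T, N, seed=0):
    """Basic ELM sketch: random input weights, analytic output weights."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((N, d))            # Step 1: random input weights
    b = rng.standard_normal(N)                 # Step 1: random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # Step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T               # Step 3: beta = H^dagger T
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

Note that the only trained quantity is `beta`; the hidden layer is fixed once drawn, which is the source of both ELM's speed and the randomness the modified algorithm removes.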
Several methods, such as the orthogonal projection method, iterative methods and singular value decomposition (SVD), can be used to compute the Moore–Penrose generalized inverse of H. The orthogonalization and iterative methods have limitations, as searching and iteration may consume extra training time. The orthogonal projection method can be used only when the hidden layer output matrix has full column rank. SVD can be used to calculate the Moore–Penrose generalized inverse of the hidden layer output matrix in all cases, but it is costly in time. This paper designs a learning machine that first selects the input weights and biases so that the hidden layer output matrix has full column rank, and then uses the orthogonal projection method to calculate the Moore–Penrose generalized inverse of the hidden layer output matrix.
For n arbitrary distinct samples (X<sub>i</sub>, t<sub>i</sub>), where \({X_i=(x_{i1},x_{i2},\ldots,x_{id})^T\in {\mathbb{R}}^d}\) and \({t_i=(t_{i1},t_{i2},\ldots,t_{im})\in {\mathbb{R}}^m}\), standard SLFNs with N hidden nodes and activation function g(x) are mathematically modeled as

\[ G_N(X_j)=\sum_{i=1}^{N}\beta_i\, g(W_i\cdot X_j+b_i)=o_j,\quad j=1,2,\ldots,n, \]

where \(\beta_i\) is the output weight vector connecting the ith hidden node and the output nodes, \(W_i=(w_{i1},w_{i2},\ldots,w_{id})^T\) is the input weight vector connecting the ith hidden node and the input nodes, and b<sub>i</sub> is the threshold of the ith hidden node. That the SLFNs G<sub>N</sub>(x) approximate these n samples with zero error means \(\sum_{j=1}^{n}\|o_j-t_j\|=0\), that is, there exist \(\beta_i, W_i, b_i\) such that

\[ \sum_{i=1}^{N}\beta_i\, g(W_i\cdot X_j+b_i)=t_j,\quad j=1,2,\ldots,n. \]

The above n equations can be written compactly as

\[ H\beta=T, \]

where

\[ H=\begin{pmatrix} g(W_1\cdot X_1+b_1) & \cdots & g(W_N\cdot X_1+b_N)\\ \vdots & \ddots & \vdots\\ g(W_1\cdot X_n+b_1) & \cdots & g(W_N\cdot X_n+b_N) \end{pmatrix}_{n\times N},\qquad \beta=\begin{pmatrix}\beta_1^T\\ \vdots\\ \beta_N^T\end{pmatrix},\qquad T=\begin{pmatrix}t_1^T\\ \vdots\\ t_n^T\end{pmatrix}. \]
Following Huang et al. [14–23], H is called the hidden layer output matrix of the neural network.
In most cases, the number of hidden nodes is much less than the number of training samples, N ≪ n. Then there may not exist \(w_{i}, b_{i},\beta_{i}\,(i=1,2,\ldots,N)\) such that Hβ = T, so the least-square method is used to find least-square solutions of Hβ = T. Furthermore, a special solution \(\hat{\beta}\) can be computed satisfying

\[ \|H\hat{\beta}-T\|=\min_{\beta}\|H\beta-T\|; \]

this \(\hat{\beta}\) is the minimum norm least-square solution. From [24, 27],

\[ \hat{\beta}=H^{\dag}T, \]

where \(H^{\dag}\) is the Moore–Penrose generalized inverse of the matrix H. Thus, the output weight of the ELM algorithm is \(\beta=H^{\dag}T\).
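The minimum-norm property of \(\hat\beta=H^{\dag}T\) can be verified numerically: when \(H\) is rank-deficient, many vectors achieve the same least-square residual, and the pseudoinverse solution has the smallest norm among them. A sketch with a deliberately rank-deficient matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((50, 8))
H = np.hstack([H, H[:, :2]])          # duplicate two columns -> H is rank-deficient
T = rng.standard_normal(50)

beta_hat = np.linalg.pinv(H) @ T      # minimum-norm least-square solution

# Build another least-square solution with the same residual: columns 0 and 8
# of H are identical, so shifting weight between them leaves H @ beta unchanged.
beta_alt = beta_hat.copy()
beta_alt[0] += 1.0
beta_alt[8] -= 1.0
assert np.allclose(H @ beta_hat, H @ beta_alt)           # same fit
assert np.linalg.norm(beta_hat) < np.linalg.norm(beta_alt)  # but larger norm
```

By Bartlett's argument cited above, this smallest-norm solution is precisely the one associated with better generalization.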
Suppose \({X_1, X_2\in {\mathbb{R}}^d,\; X_i=(x_{i1}, x_{i2},\ldots, x_{id}),\, i=1,2.}\) We say \(X_1\prec X_2\) if and only if there exists \(j_0\in \{1,2,\ldots,d\}\) such that \(x_{{1j_{0}}}< x_{{2j_{0}}}, \, x_{1j}=x_{2j},\, j=j_0+1, j_0+2, \ldots, d\).
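In other words, \(\prec\) compares vectors coordinate by coordinate from the last coordinate down to the first (a reversed lexicographic order). A small sketch of this comparison, with a hypothetical helper name:

```python
def prec(x1, x2):
    """x1 < x2 in the paper's ordering: the highest-index coordinate where
    the vectors differ decides, so coordinates are compared last-to-first."""
    for a, b in zip(reversed(x1), reversed(x2)):
        if a != b:
            return a < b
    return False  # equal vectors are not ordered

# Sorting n distinct vectors into X_1 < X_2 < ... < X_n under this order:
points = [(3, 1), (0, 2), (9, 0)]
ordered = sorted(points, key=lambda v: tuple(reversed(v)))
```

For instance, `(0, 1, 2) prec (3, 1, 2)` holds because the vectors agree in the last two coordinates and differ first (from the right) at \(j_0=1\), where 0 < 3.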
Lemma 2.1
For n distinct vectors \({X_1\prec X_2\prec\cdots\prec X_n\in {\mathbb{R}}^d (d\geq 2)}\), there exists a vector \({W\in {\mathbb{R}}^d}\) such that
Proof
For \(X_i=(x_{i1}, x_{i2},\ldots, x_{id}), i=1,2,\ldots,n\), we set
then
Let
For given j(1 ≤ j ≤ d), if y 1 ij = 0 for all \(i=1,2,\ldots,n-1, \) let n j = 2d; otherwise,
Obviously,
Define
and
For fixed i(1 ≤ i ≤ n − 1), owing to \(X_i\prec X_{i+1}, \) there exists \({k_0\in {\mathbb{N}}}\) such that
Write
Since
And
This gives \(W\cdot X_{i+1}- W\cdot X_{i}>0, \) that is, \(W\cdot X_{i} < W \cdot X_{i+1}, \) which completes the proof of Lemma 2.1. \(\square\)
Remark 1
From Lemma 2.1, it is deduced that the ordering of the samples is preserved under the transformation \(X_{i}\,\mapsto\, W\cdot X_{i}. \) It is noteworthy that Lemma 2.1 also provides a method, (10)–(16), to compute the weight W of the transformation, and this method improves on the one given in [22], since the weights in Lemma 2.1 are much smaller than those in [22]. With this property, it is possible to make the hidden layer output matrix full column rank.
Suppose that σ(x) is a bounded sigmoidal function, then
Set
then
Let \((x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\) be a set of samples, and \(x_1< x_2< \cdots < x_n.\) We have the following lemma.
Lemma 2.2
There exist numbers \(w_{1},w_2,\ldots,w_n\) and \(\alpha_{1},\alpha_2,\ldots,\alpha_n\) such that the matrix
is nonsingular.
Proof
Since
there exists A > 0 such that δσ(A) < 1/(4n). Hence,
and
Input weights and biases are chosen by the following criteria.
it follows that
Next, subtract the second row from the first row, the third row from the second row, and so on, keeping the last row unchanged. Denote the matrix obtained by these row operations by \(\widetilde{G}_{\sigma}=(\widetilde{g}_{ij})_{n}.\)
Now for \(i=1,2,\ldots, n-1, \) the elements of the ith row can be estimated as follows. For \(j=1, \ldots, i-1, \) since
For j = i,
For \(j=i+1,\ldots,n-1\),
implies
For j = n, it can similarly be obtained that
In the nth row of \(\widetilde{G}_{\sigma},\; \hbox{for}\; j=1,2,\ldots, n-1 \),
which implies
and
Therefore,
that is, \(\widetilde{G}_{\sigma}\) is strictly diagonally dominant which guarantees that \(\widetilde{G}_{\sigma}\) is nonsingular. This indicates the nonsingularity of G σ. \(\square\)
Remark 2
Lemma 2.2 gives a method to construct the weights and biases of a single-hidden layer neural network with a sigmoidal activation function such that the hidden layer output matrix has full column rank for any given data. With Lemma 2.2, it is possible to construct an algorithm to train the neural network.
For \({X_i\in {\mathbb{R}}^d,\, \, i=1,2,\ldots,N}\), with \(X_{1}\prec X_2\prec\cdots \prec X_N,\) Lemma 2.1 shows that there exists \({W\in {\mathbb{R}}^d}\) such that
Let \(x_i=W\cdot X_{i},\, \, i=1,2, \ldots ,N\), and set
where A satisfies δσ(A) < 1/(4N).
From Lemma 2.2, the following theorem is easily obtained.
Theorem 2.1
Suppose σ(x) is a bounded sigmoidal function, \({ X_i\in {\mathbb{R}}^d,\, \, i=1,2,\ldots, N}\), and \(X_{1}\prec X_2\prec\cdots \prec X_N.\) Then, there exist \({W_i\in {\mathbb{R}}^d,\; b_i\in{\mathbb{R}},\; i=1,2,\ldots, N}\), such that the matrix
is nonsingular.
Based on matrix theory and Theorem 2.1, we have the following conclusion.
Corollary 2.1
Suppose \({X_i\in {\mathbb{R}}^d,\, i=1,2,\ldots,n}\) are n distinct vectors, N > n. Then, there exist \({W_i\in {\mathbb{R}}^d,b_i\in {\mathbb{R}},\,i=1,2,\ldots,N}\) such that the matrix
is full column rank.
3 A modified ELM using sigmoidal activation function
According to the discussion of Sect. 2 and based on ELM algorithm, a new algorithm is proposed. It uses sigmoidal function as its activation function and can make good use of orthogonal projection method to calculate the output weights. The modified extreme learning machine is stated in the following algorithm.
Algorithm of Modified ELM: Given a training data set \({{\mathcal{N}}=\{(X_{i}^{*},t_{i}^{*})\,|\,X_{i}^{*}\in {\mathbb{R}}^{d},\, t_{i}^{*}\in {\mathbb{R}},\, i=1,2,\ldots, n\}, }\) a sigmoidal activation function (for instance, \(g(x) = 1/(1 + e^{-x})\)) and hidden node number N:

- Step 1: Select weights \(W_{i}\) and biases \(b_{i}\) \((i=1,2,\ldots,N). \)

  Sort the first N samples \((X_{i}^{*},t_{i}^{*})\,(i=1,2, \ldots, N)\) by the values \(W\cdot X_{1}^{*}, W\cdot X_{2}^{*}, \ldots, W\cdot X_{N}^{*}\) so that \(W\cdot X_{i_1}^{*}<W\cdot X_{i_2}^{*}<\cdots< W\cdot X_{i_{N}}^{*}\) (\(i_{j}\neq i_{k}\) for \(j\neq k,\, j, k=1,2,\ldots,N\)), and denote the sorted data by \((X_{i},t_{i})\,(i=1,2,\ldots,n)\) with \(X_{i}=(x_{i 1}, x_{i 2}, \ldots, x_{i d}).\) For \(j=1,2,\ldots,d, \) make the following calculations.

  Set

  Let \(a_{i}=W\cdot X_{i}\),

- Step 2: Calculate output weights \(\beta=(\beta_{1},\beta_{2},\ldots,\beta_{N}).\)

  Let \(T=(t_{1},t_{2},\ldots,t_{n})^{T}\) and

  Then,
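Setting aside the explicit weight and bias formulas from Lemmas 2.1 and 2.2 (not reproduced in this text), the overall shape of the modified ELM can be sketched as follows. The projection vector `W` and the per-node scalar `weights` and `biases` are assumed to come from the construction above; here they are simply parameters of the sketch.

```python
import numpy as np

def modified_elm_fit(X, T, W, weights, biases):
    """Sketch of the modified ELM pipeline: sort by the projection W . X,
    build H from scalar node weights applied to that projection, then solve
    by orthogonal projection (valid because H has full column rank)."""
    N = len(weights)
    # Step 1: sort the first N samples by W . X_i (Lemma 2.1 ordering)
    order = np.argsort(X[:N] @ W)
    X = np.vstack([X[:N][order], X[N:]])
    T = np.concatenate([T[:N][order], T[N:]])
    # Hidden layer output matrix: H[i, j] = sigma(w_j * (W . X_i) + b_j)
    H = 1.0 / (1.0 + np.exp(-(np.outer(X @ W, weights) + biases)))
    # Step 2: orthogonal projection instead of an SVD-based pseudoinverse,
    # since Theorem 2.1 guarantees H^T H is nonsingular
    beta = np.linalg.solve(H.T @ H, H.T @ T)
    return beta
```

The only structural difference from the ELM sketch earlier is that nothing here is random: the hidden layer is fully determined by the data once the construction's weights and biases are fixed.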
The ELM proves in practice to be an extremely fast algorithm because it randomly chooses the input weights W<sub>i</sub> and biases b<sub>i</sub> of the SLFNs instead of selecting them carefully. However, this advantage sometimes makes the algorithm less effective. As discussed in [22], the random selection of input weights and biases is likely to yield a hidden layer output matrix H that is not of full column rank, which causes two difficulties that cannot be overcome theoretically. First, the SLFNs cannot approximate the training samples with zero error, which lowers the prediction accuracy. Second, orthogonal projection is typically faster than SVD, but ELM cannot use it because the nonsingularity of H<sup>T</sup>H is not guaranteed, whereas the algorithm proposed here can make good use of the faster method.
4 Simulation results
This section measures the performance of the proposed ELM learning algorithm. The simulations for the ELM and modified ELM algorithms are carried out in the Matlab 7.10 environment on an AMD Athlon II X3 425 processor running at 2.71 GHz. The activation function used in the algorithms is the sigmoid function g(x) = 1/(1 + e<sup>−x</sup>). For classification problems, the accuracy rate is used as the performance measure, that is, the ratio of the number of testing samples correctly classified to the total number of samples. For regression problems, the root-mean-square deviation (RMSD) between the target vectors and the target vectors predicted by the SLFNs is used; we call this RMSD the error of the algorithm on the testing data. The standard deviations of the accuracy and of the error over repeated trials are used to measure the fluctuations of the algorithms.
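The two measures just described are straightforward to write down; a minimal sketch:

```python
import numpy as np

def rmsd(t_true, t_pred):
    """Root-mean-square deviation: the regression error measure."""
    t_true = np.asarray(t_true, dtype=float)
    t_pred = np.asarray(t_pred, dtype=float)
    return np.sqrt(np.mean((t_true - t_pred) ** 2))

def accuracy(labels, predicted):
    """Classification measure: fraction of correctly classified samples."""
    labels = np.asarray(labels)
    predicted = np.asarray(predicted)
    return np.mean(labels == predicted)
```

For example, `rmsd([0, 0], [3, 4])` is \(\sqrt{(9+16)/2}=\sqrt{12.5}\), and `accuracy([1, 0, 1, 1], [1, 1, 1, 1])` is 0.75.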
4.1 Benchmarking with a regression problem: approximation of ‘SinC’ function with noise
First of all, the 'SinC' function is used to measure the performance of the ELM and modified ELM algorithms. The target function is

\[ y(x)=\begin{cases}\dfrac{\sin(x)}{x}, & x\neq 0,\\[2pt] 1, & x=0.\end{cases} \]
A training set (x<sub>i</sub>, y<sub>i</sub>) and a testing set (u<sub>i</sub>, t<sub>i</sub>), each with 5,000 samples, are created, where the x<sub>i</sub> and u<sub>i</sub> are distributed over [−10, 10] with uniform step length. Large uniform noise distributed in [−0.2, 0.2] has been added to all training data to obtain a real regression problem. The experiment is carried out on these data as follows. Twenty hidden nodes are assigned to both the ELM and modified ELM algorithms. Fifty trials are conducted for the ELM algorithm to eliminate random error, and the results shown are averages. Results shown in Table 1 include training time, testing time, training accuracy, testing accuracy and the number of nodes for both algorithms. In addition, the standard deviations of time and accuracy during training and testing are recorded in Table 2.
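The benchmark data can be generated as follows; the noise interval is assumed to be [−0.2, 0.2] (the degenerate bracket printed in some versions of the text appears to be a sign typo), and the seed is arbitrary.

```python
import numpy as np

def sinc_target(x):
    """SinC target: sin(x)/x for x != 0, and 1 at x = 0."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    nz = x != 0
    out[nz] = np.sin(x[nz]) / x[nz]
    return out

rng = np.random.default_rng(0)
x_train = np.linspace(-10, 10, 5000)                                  # uniform step length
y_train = sinc_target(x_train) + rng.uniform(-0.2, 0.2, size=5000)    # noisy training targets
x_test = np.linspace(-10, 10, 5000)
y_test = sinc_target(x_test)                                          # noise-free testing targets
```

Keeping the testing targets noise-free is what makes the testing RMSD a meaningful measure of how well the network recovers the underlying function rather than the noise.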
It can be seen from Table 1 that the modified ELM algorithm spends 0.0675 s and 0.0169 s of CPU time on training and testing, respectively, whereas the ELM algorithm takes 0.0541 s and 0.0213 s for the same process. In terms of training and testing accuracy, the original ELM performs better than the modified ELM. However, as shown in Table 2, the standard deviations of the modified ELM's training and testing times and accuracies are all smaller than those of ELM, which means the modified algorithm is more stable than ELM.
Figure 1 shows the raw data and trained results of ELM algorithm, and Fig. 2 shows the raw data and the approximated results based on the modified ELM algorithm, which also demonstrates that the modified ELM algorithm has an acceptable performance.
4.2 Benchmarking with practical applications
A performance comparison of the proposed modified ELM and ELM algorithms on four real problems is carried out in this section. The four problems, shown in Table 3, include both classification and regression tasks: two are classification tasks, Diabetes and Glass Identification (Glass ID), and the other two are regression tasks, Housing and Slump (Concrete Slump). All the data sets come from the UCI repository of machine learning databases [28]. The specification of each database is given in Table 3. For databases that have only one data table, as in [25, 26, 29, 30], 75% and 25% of the samples are randomly chosen for training and testing, respectively, at each trial.
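The per-trial random 75%/25% split can be sketched as below; the helper name is illustrative, not from the paper.

```python
import numpy as np

def random_split(X, T, train_frac=0.75, rng=None):
    """Random train/test split, redrawn independently at each trial."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(X)
    idx = rng.permutation(n)          # shuffle sample indices
    k = int(round(train_frac * n))    # size of the training portion
    tr, te = idx[:k], idx[k:]
    return X[tr], T[tr], X[te], T[te]
```

Redrawing the split at every trial, then averaging over trials, is what lets the reported standard deviations reflect the stability of each algorithm rather than the luck of a single partition.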
To reduce random error, fifty trials of both algorithms are conducted for every database, and the averages over the fifty trials are taken as the final results. The results are reported in Tables 4, 5 and 6, which show that in our simulations, on average, the ELM and modified ELM algorithms have similar learning times (Table 6) and both learn fast.
However, the modified ELM has more stable training and testing accuracies, as can be seen from Table 5, especially for the regression problems. From Table 4, it can be observed that the modified ELM algorithm is more accurate than ELM for the regression cases. Detailed information is shown in Figs. 3–10.
In addition, it is worth mentioning that, compared with the ELM algorithm, the modified ELM selects the input weights and biases, which helps avoid the risk of random errors, as can be seen from Table 5 and Figs. 3–10.
5 Conclusions
This paper proposes a modified ELM algorithm based on the ELM for training single-hidden layer feedforward neural networks (SLFNs) in an attempt to solve the least-square minimization of SLFNs in a more effective way and meanwhile solves the open problem in [22].
The learning speed of modified ELM is as fast as ELM. The main difference between the modified ELM and ELM algorithms lies in the selection of input weights and biases. The modified algorithm selects the input weights and biases properly, which consumes little time compared with the training time of output weights. The modified ELM algorithm overcomes the shortcomings of EELM, the algorithm proposed in [22], that is, it can use sigmoidal function as activation function to train the networks but still keep the qualities of ELM and EELM.
References
Cybenko G (1989) Approximation by superposition of sigmoidal function. Math Control Signals Syst 2:303–314
Funahashi KI (1989) On the approximate realization of continuous mappings by neural networks. Neural Netw 2:183–192
Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Netw 4:251–257
Cao FL, Xie TF, Xu ZB (2008) The estimate for approximation error of neural networks: a constructive approach. Neurocomputing 71:626–630
Cao FL, Zhang YQ, He ZR (2009) Interpolation and rate of convergence by a class of neural networks. Appl Math Model 33:1441–1456
Cao FL, Zhang R (2009) The errors of approximation for feedforward neural networks in the Lp metric. Math Comput Model 49:1563–1572
Cao FL, Lin SB, Xu ZB (2010) Approximation capabilities of interpolation neural networks. Neurocomputing 74:457–460
Xu ZB, Cao FL (2004) The essential order of approximation for neural networks. Sci China Ser F Inf Sci 47:97–112
Xu ZB, Cao FL (2005) Simultaneous L p approximation order for neural networks. Neural Netw 18:914–923
Chen TP, Chen H (1995) Approximation capability to functions of several variables, nonlinear functionals, and operators by radial basis function neural networks. IEEE Trans Neural Netw 6:904–910
Chen TP, Chen H (1995) Universal approximation to nonlinear operators by neural networks with arbitrary activation functions and its application to dynamical systems. IEEE Trans Neural Netw 6:911–917
Hahm N, Hong BI (2004) An approximation by neural networks with a fixed weight. Comput Math Appl 47:1897–1903
Lan Y, Soh YC, Huang GB (2010, April) Random search enhancement of error minimized extreme learning machine. In: ESANN 2010 proceedings, European symposium on artificial neural networks—computational intelligence and machine learning, pp 327–332
Huang GB, Babri HA (1998) Upper bounds on the number of hidden neurons in feedforward networks with arbitrary bounded nonlinear activation functions. IEEE Trans Neural Netw 9:224–229
Huang GB, Zhu QY, Siew CK (2006) Extreme learning machine: theory and applications. Neurocomputing 70:489–501
Huang GB, Zhu QY, Siew CK (2004) Extreme learning machine: a new learning scheme of feedforward neural networks. In: 2004 IEEE international joint conference on neural networks, vol 2, pp 985–990
Bartlett PL (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans Inf Theory 44:525–536
Feng G, Huang GB, Lin Q, Gay R (2009) Error minimized extreme learning machine with growth of hidden nodes and incremental learning. IEEE Trans Neural Netw 20:1352–1357
Huang GB, Chen L (2007) Convex incremental extreme learning machine. Neurocomputing 70:3056–3062
Huang GB, Chen L (2008) Enhanced random search based incremental extreme learning machine. Neurocomputing 71:3460–3468
Huang GB, Chen L, Siew CK (2006) Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans Neural Netw 17:879–892
Wang YG, Cao FL, Yuan YB (2011) A study on effectiveness of extreme learning machine. Neurocomputing 74(16):2483–2490
Huang GB (2003) Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Trans Neural Netw 14:274–281
Rao CR, Mitra SK (1971) Generalized inverse of matrices and its applications. Wiley, New York
Rätsch G, Onoda T, Müller KR (1998) An improvement of AdaBoost to avoid overfitting. In: Proceedings of the 5th international conference on neural information processing (ICONIP 1998)
Romero E, Alquézar R (2002) A new incremental method for function approximation using feed-forward neural networks. In: Proceedings of the 2002 international joint conference on neural networks (IJCNN’2002), pp 1968–1973
Serre D (2000) Matrices: theory and applications. Springer, New York
Frank A, Asuncion A (2010) UCI machine learning repository. University of California, School of Information and Computer Science, Irvine. http://archive.ics.uci.edu/ml
Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Machine learning: proceedings of the 13th international conference, pp 148–156
Wilson DR, Martinez TR (1996, June) Heterogeneous radial basis function networks. In: IEEE international conference on neural networks (ICNN’96), pp 1263–1267
Acknowledgments
We thank Feilong Cao for his suggestions on this paper. The support of the National Natural Science Foundation of China (Nos. 90818020, 10871226, 61179041) is gratefully acknowledged.
Chen, Z.X., Zhu, H.Y. & Wang, Y.G. A modified extreme learning machine with sigmoidal activation functions. Neural Comput & Applic 22, 541–550 (2013). https://doi.org/10.1007/s00521-012-0860-2