1 Introduction

Feedforward neural networks have been studied deeply and systematically because of their universal approximation capabilities on compact input sets and their approximation abilities on finite sets. Cybenko [1] and Funahashi [2] proved that any continuous function can be approximated on a compact set, in the uniform topology, by a single-hidden layer feedforward neural network (SLFN) with any continuous sigmoidal activation function. Hornik [3] showed that any measurable function can be approximated by such networks. A variety of results on SLFN approximation of multivariate functions were later established by many authors: [4–9] by Cao, [10] by Xu et al., [11] by Chen and Chen, [12] by Hahm and Hong, [13] by Leshno et al., etc. SLFNs have been used extensively in many fields because of their abilities: (1) to approximate complex nonlinear mappings directly from the input samples; and (2) to provide models for a large class of natural and artificial phenomena that are difficult to handle with classical parametric methods.

It is known that density and complexity results for neural networks form the theoretical basis for learning algorithms. In practical applications, however, close attention is paid to fast learning algorithms for neural networks trained on a finite training set.

Huang and Babri [14] showed that a single-hidden layer feedforward neural network with at most N hidden nodes and almost any type of nonlinear activation function can exactly learn N distinct samples. Although conventional gradient-based learning algorithms, such as back-propagation (BP) and its variant, the Levenberg–Marquardt (LM) method, have been used extensively to train multilayer feedforward neural networks, these algorithms are still relatively slow and may easily get stuck in a local minimum. Support vector machines (SVMs) have been used extensively for complex mappings and are famous for their good generalization ability; however, finely tuning SVM kernel parameters is time-consuming.

This paper is organized as follows. Section 2 gives the theoretical deductions for a modified ELM algorithm with sigmoidal activation function. Section 3 proposes the modified ELM algorithm. Section 4 presents the performance evaluation. Section 5 consists of the discussion and conclusions.

2 Theoretical deductions of modified ELM algorithm with sigmoidal activation function

Recently, a new learning algorithm for SLFNs named the extreme learning machine (ELM) was proposed by Huang et al. [15, 16]. Unlike traditional approaches (such as BP algorithms and SVMs), the ELM algorithm has a concise architecture and does not need to tune the input weights and biases. In particular, the learning speed of ELM can be thousands of times faster than that of traditional feedforward network learning algorithms such as BP. Compared with traditional learning algorithms, ELM tends to reach not only the smallest training error but also the smallest norm of output weights. According to Bartlett [17], the smaller the training error and the smaller the norm of the weights, the better generalization performance the network tends to have. Therefore, ELM can achieve good generalization performance. So far, much work has been devoted to ELM and its related problems, and some relevant results have been obtained in [18–22].

The ELM algorithm of SLFNs can be summarized as the following three steps:

Algorithm of ELM: Given a training set \({{\mathcal{N}}=\{(X_{i},t_{i})\,|\,X_{i}\in {\mathbb{R}}^{d},\, t_{i}\in {\mathbb{R}},\, i=1,2,\ldots, n\}}\), an activation function g and a hidden node number N.

Step 1::

Randomly assign input weights \(W_i\) and biases \(b_i\) \((i=1,2,\ldots, N)\)

Step 2::

Calculate the hidden layer output matrix H.

Step 3::

Calculate the output weight β by \(\beta={\bf H}^{\dag}T,\) here \({\bf H}^{\dag}\) is the Moore–Penrose generalized inverse of H and \(T=(t_{1},t_{2},\ldots,t_{n})^{T}.\)
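To make the three steps concrete, the following NumPy sketch implements the basic ELM; it is a minimal illustration with our own function names (elm_train, elm_predict), not the authors' reference implementation, and it assumes the logistic sigmoid as activation function.

```python
import numpy as np

def sigmoid(x):
    """Sigmoidal activation g(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def elm_train(X, T, N, rng=np.random.default_rng(0)):
    """Basic ELM: X is n x d, T is n x m (or length n), N is the number of hidden nodes."""
    n, d = X.shape
    # Step 1: randomly assign input weights W (N x d) and biases b (length N)
    W = rng.uniform(-1.0, 1.0, size=(N, d))
    b = rng.uniform(-1.0, 1.0, size=N)
    # Step 2: hidden layer output matrix H (n x N)
    H = sigmoid(X @ W.T + b)
    # Step 3: output weights via the Moore-Penrose generalized inverse of H
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W.T + b) @ beta
```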

Several methods, such as the orthogonal projection method, iterative methods and the singular value decomposition (SVD), can be used to compute the Moore–Penrose generalized inverse of H. The orthogonalization and iterative methods have their limitations, since searching and iteration may consume extra training time. The orthogonal projection method can be used only when the hidden layer output matrix has full column rank. The SVD can be used to calculate the Moore–Penrose generalized inverse of the hidden layer output matrix in all cases, but it costs plenty of time. This paper designs a learning machine that first trains the input weights and biases properly so that the hidden layer output matrix has full column rank, and then uses the orthogonal projection method to calculate its Moore–Penrose generalized inverse.
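The difference between the two routes can be sketched as follows (our own minimal illustration; the second branch assumes H already has full column rank, so that H^T H is nonsingular):

```python
import numpy as np

def output_weights_svd(H, T):
    """General case: Moore-Penrose inverse computed via SVD, beta = pinv(H) T."""
    return np.linalg.pinv(H) @ T

def output_weights_orthogonal_projection(H, T):
    """Full-column-rank case: beta = (H^T H)^{-1} H^T T, solved as a linear system."""
    return np.linalg.solve(H.T @ H, H.T @ T)
```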

For n arbitrary distinct samples \((X_i, t_i)\), where \({X_i=(x_{i1},x_{i2},\ldots,x_{id})^T\in {\mathbb{R}}^d}\) and \({t_i=(t_{i1},t_{i2},\ldots,t_{im})^{T}\in {\mathbb{R}}^m}\), standard SLFNs with N hidden nodes and activation function g(x) are mathematically modeled as

$$ G_{N}(X)=\sum_{i=1}^{N}\beta_i g(W_i\cdot X+b_i), $$
(1)

where \(\beta_i=(\beta_{i1},\beta_{i2},\ldots, \beta_{im})^T\) is the output weight vector connecting the ith hidden node and the output nodes, \(W_i=(w_{i1},w_{i2},\ldots,w_{id})^T\) is the input weight vector connecting the ith hidden node and the input nodes, and \(b_i\) is the threshold of the ith hidden node. That the SLFN \(G_N(X)\) can approximate these n samples with zero error means

$$ G_{N}(X_j)=t_j,\quad j=1,2,\ldots,n. $$
(2)

The above n equations can be written as

$$ H\beta=T, $$
(3)

where

$$ \begin{aligned} H & = H(w_{1},w_{2},\ldots,w_{N};\, b_{1},b_{2},\ldots,b_{N};\, X_{1},X_{2},\ldots,X_{n})\\ & =\left(\begin{array}{cccc} g(W_1\cdot X_1+b_1) & g(W_2\cdot X_1+b_2) &\cdots& g(W_{N}\cdot X_1+b_{N}) \\ g(W_1\cdot X_2+b_1)&g(W_2\cdot X_2+b_2) &\cdots&g(W_{N}\cdot X_2+b_{N}) \\ \vdots&\vdots&\cdots&\vdots\\ g(W_1\cdot X_{n}+b_1)&g(W_2\cdot X_{n}+b_2)&\cdots&g(W_{N}\cdot X_{n}+b_{N}) \\ \end{array}\right)_{n\times N}, \end{aligned} $$
(4)
$$ \beta=\left(\begin{array}{l} \beta_1^T \\ \beta_2^T\\ \vdots\\ \beta_{N}^{T}\\ \end{array}\right)_{N\times m} \quad {\rm and}\quad T=\left(\begin{array}{l} t_1^T \\ t_2^T\\ \vdots\\ t_{n}^{T}\\ \end{array}\right)_{n\times m}. $$
(5)

As named by Huang et al. [14, 23], H is called the hidden layer output matrix of the neural network.

In most cases, the number of hidden nodes is much less than the number of training samples, N ≪ n. Then, there may not exist \(W_{i}, b_{i},\beta_{i}\,(i=1,2,\ldots,N)\) such that Hβ = T, so the least-squares method is used to find least-squares solutions of Hβ = T. In particular, a special solution \(\hat{\beta}\) can be computed satisfying

$$ \|H\hat{\beta}-T\|=\min\limits_{\beta}\|H\beta-T\|, $$
(6)

and this \(\hat{\beta}\) is the minimum norm least-squares solution. From [24, 27],

$$ \hat{\beta}=H^{\dag}T, $$
(7)

where \(H^{\dag}\) is called the Moore–Penrose generalized inverse of the matrix H. Thus, the output weight of the ELM algorithm is

$$ \beta=H^{\dag}T. $$
(8)

Suppose \({X_1, X_2\in {\mathbb{R}}^d,\; X_i=(x_{i1}, x_{i2},\ldots, x_{id}),\, i=1,2.}\) We say \(X_1\prec X_2\) if and only if there exists \(j_0\in \{1,2,\ldots,d\}\) such that \(x_{{1j_{0}}}< x_{{2j_{0}}}, \, x_{1j}=x_{2j},\, j=j_0+1, j_0+2, \ldots, d\).
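For concreteness, this ordering amounts to a reverse-lexicographic comparison, scanning coordinates from the last one down to the first. A small sketch (the function name precedes is our own):

```python
def precedes(X1, X2):
    """Return True if X1 ≺ X2: trailing coordinates agree until the first index j0
    (scanning from the last coordinate down) where X1[j0] < X2[j0]."""
    for j in reversed(range(len(X1))):
        if X1[j] < X2[j]:
            return True
        if X1[j] > X2[j]:
            return False
    return False  # X1 == X2: neither vector precedes the other
```

Since any two distinct vectors are comparable under this relation, any finite set of distinct vectors can be sorted so that \(X_1\prec X_2\prec\cdots\prec X_n\).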

Lemma 2.1

For n distinct vectors \({X_1\prec X_2\prec\cdots\prec X_n\in {\mathbb{R}}^d (d\geq 2)}\), there exists a vector \({W\in {\mathbb{R}}^d}\) such that

$$ W \cdot X_1 < W\cdot X_2 < \cdots < W\cdot X_n. $$
(9)

Proof

For \(X_i=(x_{i1}, x_{i2},\ldots, x_{id}), i=1,2,\ldots,n\), we set

$$ w_j^1:=\frac{1}{{1+\max\nolimits_i\{|x_{ij}|\}}},\quad j=1,2,\ldots,d, $$
(10)

then

$$ x_{ij}^1=w_j^1 x_{ij}\in [-1, 1],\quad i=1,2,\ldots,n,\,\quad j=1,2,\ldots,d. $$
(11)

Let

$$ y_{ij}^1=|x_{i+1,j}^1 -x_{ij}^1|,\quad i=1,2,\ldots,n-1,\;\quad j=1,2,\ldots,d. $$
(12)

For given j (1 ≤ j ≤ d), if \(y_{ij}^1=0\) for all \(i=1,2,\ldots,n-1,\) let \(n_j=2d\); otherwise,

$$ n_j=\frac{4d}{\min\nolimits_{y_{ij}^{1} \neq 0}\{y_{ij}^{1}\}}. $$
(13)

Obviously,

$$ n_j\geq 2d,\quad j=1,2,\ldots,d. $$
(14)

Define

$$ w_j^2:=w_j^1\Uppi_{i=1}^j n_i,\quad j=1,2,\ldots,d, $$
(15)

and

$$ W=(w_1^2,w_2^2,\ldots,w_d^2). $$
(16)

For fixed i (1 ≤ i ≤ n − 1), owing to \(X_i\prec X_{i+1},\) there exists \(k_0\in \{1,2,\ldots,d\}\) such that

$$ x_{ik_0}< x_{i+1,k_0};\quad x_{ij}=x_{i+1,j},\quad j=k_0+1,\ldots,d. $$
(17)

Write

$$ \begin{aligned} W\cdot X_{i+1}- W\cdot X_{i}&=\sum_{k=1}^{k_0-1}(x_{i+1,k}^1 -x_{i,k}^1)\Uppi_{s=1}^k n_s+(x_{i+1,k_0}^1 -x_{ik_0}^1)\Uppi_{s=1}^{k_0}n_s \\ &=:I_1+ I_2. \end{aligned} $$
(18)

Since

$$ \begin{aligned} |I_1| & \leq 2(n_1+ n_1 n_2+\cdots+ n_1 n_2\cdots n_{k_0-1})\\ & = 2\Uppi_{s=1}^{ k_0-1}n_s \left(1+\frac{ 1}{{n_{k_0-1}}}+\cdots +\frac{1}{{n_{2}\cdots {n}_{k_0-1}}}\right)\\ &\leq 2\Uppi_{s=1}^{k_0-1}n_s\cdot \frac{1}{{1-\frac{1}{{2d}}}}<2d\Uppi_{s=1}^{k_0-1}n_s. \end{aligned} $$
(19)

And

$$ \begin{aligned} I_2 & = \Uppi_{s=1}^{k_0-1}n_s\cdot n_{k_0}(x_{i+1, k_0}^1- x_{i k_0}^1)\\ &= \Uppi_{s=1}^{k_0-1}n_s\cdot n_{k_0}\cdot y_{i, k_0}^1\\ &\geq \Uppi_{s=1}^{ k_0-1}n_s\cdot y_{i,k_0}^1\cdot\frac{4d}{y_{i k_0}^1}\\ &= 4d \Uppi_{s=1}^{ k_0-1}n_s. \end{aligned} $$
(20)

Combining (19) and (20) gives \(W\cdot X_{i+1}- W\cdot X_{i}=I_1+I_2\geq I_2-|I_1|>4d\Uppi_{s=1}^{k_0-1}n_s-2d\Uppi_{s=1}^{k_0-1}n_s>0,\) that is, \(W\cdot X_{i} < W \cdot X_{i+1},\) which completes the proof of Lemma 2.1. \(\square\)

Remark 1

From Lemma 2.1, it is deduced that the ordering of the samples is preserved under the linear map \(X_{i}\,\mapsto\, W\cdot X_{i}.\) It is noteworthy that Lemma 2.1 also provides, through (10)–(16), a method to compute the weight vector W of the transformation, and this method is better than the one given in [22] because the weights obtained from Lemma 2.1 are much smaller than those in [22]. With this property, it becomes possible to make the hidden layer output matrix full column rank.
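A minimal sketch of the construction (10)–(16), with our own function name lemma21_weight, is given below; it returns a W for which the projections W·X_i are strictly increasing when the rows of X are already sorted by ≺.

```python
import numpy as np

def lemma21_weight(X):
    """X: n x d array whose rows satisfy X_1 ≺ X_2 ≺ ... ≺ X_n (d >= 2).
    Returns W such that W·X_1 < W·X_2 < ... < W·X_n, following (10)-(16)."""
    n, d = X.shape
    w1 = 1.0 / (1.0 + np.max(np.abs(X), axis=0))   # eq. (10)
    X1 = X * w1                                    # eq. (11), entries lie in [-1, 1]
    Y1 = np.abs(np.diff(X1, axis=0))               # eq. (12)
    n_factors = np.empty(d)
    for j in range(d):
        nonzero = Y1[:, j][Y1[:, j] > 0]
        n_factors[j] = 2 * d if nonzero.size == 0 else 4 * d / nonzero.min()  # eq. (13)
    w2 = w1 * np.cumprod(n_factors)                # eq. (15)
    return w2                                      # eq. (16)

# Usage: with rows sorted by ≺, np.all(np.diff(X @ lemma21_weight(X)) > 0) should hold.
```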

Suppose that σ(x) is a bounded sigmoidal function, that is,

$$ \lim\limits_{x\rightarrow +\infty}\sigma(x)=1, \quad \lim\limits_{x\rightarrow -\infty}\sigma(x)=0. $$
(21)

Set

$$ \delta_{\sigma}(A):=\sup_{x\geq A}\max\{|1-\sigma(x)|,|\sigma(-x)|\}, $$
(22)

then

$$ \lim\limits_{A\rightarrow +\infty}\delta_{\sigma}(A)=0. $$
(23)
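As a concrete illustration (our own worked example, not from the paper): for the logistic sigmoid σ(x) = 1/(1 + e^{-x}), one has δ_σ(A) = max{1 − σ(A), σ(−A)} = 1/(1 + e^{A}), so the condition δ_σ(A) < 1/(4n) used below holds whenever A > ln(4n − 1).

```python
import numpy as np

def delta_sigma_logistic(A):
    """delta_sigma(A) = sup_{x>=A} max(|1 - sigma(x)|, |sigma(-x)|) for the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(A))

def choose_A(n, margin=1.0):
    """An A (with a safety margin) satisfying delta_sigma(A) < 1/(4n)."""
    return np.log(4 * n - 1) + margin

n = 100
A = choose_A(n)
assert delta_sigma_logistic(A) < 1.0 / (4 * n)
```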

Let \((x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\) be a set of samples, and \(x_1< x_2< \cdots < x_n.\) We have the following lemma.

Lemma 2.2

There exist numbers \(w_{1},w_2,\ldots,w_n\) and \(\alpha_{1},\alpha_2,\ldots,\alpha_n\) such that the matrix

$$ G_{\sigma}= \left(\begin{array}{cccc} \sigma(w_1x_1+\alpha_1)& \sigma(w_2x_1+\alpha_2)&\cdots&\sigma(w_nx_1+\alpha_n)\\ \sigma(w_1x_2+\alpha_1)&\sigma(w_2x_2+\alpha_2)&\cdots&\sigma(w_nx_2+\alpha_n)\\ \vdots&\vdots&\cdots&\vdots\\ \sigma(w_1x_n+\alpha_1)&\sigma(w_2x_n+\alpha_2)&\cdots&\sigma(w_nx_n+\alpha_n)\\ \end{array}\right) $$
(24)

is nonsingular.

Proof

Since

$$ \lim\limits_{A\rightarrow +\infty}\delta_{\sigma}(A)=0, $$
(25)

there exists A > 0 such that \(\delta_{\sigma}(A) < 1/(4n)\). Hence,

$$ |1-\sigma(A)|<\frac{ 1}{{4n}},\quad |\sigma(-A)|<\frac{ 1}{{4n}}, $$
(26)

and

$$ |\sigma(c)|<\frac{1}{{4n}},\quad |1-\sigma(-c)|<\frac{ 1}{{4n}},\quad c\leq-A. $$
(27)

Input weights and biases are chosen by the following criteria.

$$ \begin{gathered} w_{1} = - \frac{{2A}}{{x_{2} - x_{1} }},\quad w_{2} = - \frac{{2A}}{{x_{3} - x_{2} }},\quad \ldots ,\quad w_{{n - 1}} = - \frac{{2A}}{{x_{n} - x_{{n - 1}} }},\quad w_{n} = - \frac{{2A}}{{x_{n} - x_{{n - 1}} }}, \hfill \\ \alpha _{1} = A + \frac{{2Ax_{1} }}{{x_{2} - x_{1} }},\quad \alpha _{2} = A + \frac{{2Ax_{2} }}{{x_{3} - x_{2} }},\quad \ldots ,\quad \alpha _{{n - 1}} = A + \frac{{2Ax_{n - 1} }}{{x_{n} - x_{{n - 1}} }},\quad \alpha _{n} = A + \frac{{2Ax_{n} }}{{x_{n} - x_{{n - 1}} }}, \hfill \\ \end{gathered} $$

it follows that

$$ \begin{aligned} &G_{\sigma}\\ &\quad =\left( {\begin{array}{ccccc} \sigma(A)&\sigma\left(- \frac{2A(x_1-x_2)}{x_3-x_2}+A\right)&\cdots& \sigma\left(-\frac{2A(x_1-x_{n-1})}{x_{n}-x_{n-1}}+A\right) & \sigma\left(-\frac{2A(x_1-x_n)}{x_n-x_{n-1}}+A\right) \\ \sigma(-A)&\sigma(A)&\cdots& \sigma\left(-\frac{2A(x_{2}-x_{n-1})}{x_{n}-x_{n-1}}+A\right) & \sigma\left(- \frac{2A(x_2-x_n)}{x_n-x_{n-1}}+A\right) \\ \vdots&\vdots&\cdots&\vdots&\vdots \\ \sigma\left(- \frac{2A(x_{n-1}-x_1)}{x_2-x_1}+A\right) & \sigma\left(-\frac{2A(x_{n-1}-x_2)}{x_3-x_2}+A\right)&\cdots& \sigma(A) &\sigma\left(- \frac{2A(x_{n-1}-x_n)}{x_n-x_{n-1}}+A\right) \\ \sigma\left(- \frac{2A(x_n-x_1)}{x_2-x_1}+A\right)&\sigma\left(-\frac{2A(x_n-x_2)}{x_3-x_2}+A\right)&\cdots& \sigma(-A) &\sigma(A) \\ \end{array}} \right). \end{aligned} $$
(28)

In turn, subtract the second row from the first row, the third row from the second row, and so on up to subtracting the nth row from the (n − 1)th row, and keep the last row unchanged. Denote the matrix obtained from these row operations by \(\widetilde{G}_{\sigma}=(\widetilde{g}_{ij})_{n}.\)

Now for \(i=1,2,\ldots, n-1, \) the elements of the ith row can be estimated as follows. For \(j=1, \ldots, i-1, \) since

$$ -\frac{2A(x_{i}-x_{j})}{x_{j+1}-x_{j}}+A\leq-A, $$
(29)
$$ \left|\widetilde{g}_{ij}\right|=\left|\sigma\left(-\frac{2A(x_{i}-x_{j})}{x_{j+1}-x_{j}}+A\right)- \sigma\left(-\frac{2A(x_{i+1}-x_{j})}{x_{j+1}-x_{j}}+A\right)\right| < \frac{1}{{2n}}. $$
(30)

For j = i,

$$ \left|\widetilde{g}_{ii}\right|= \left|\sigma(A)-\sigma(-A)\right|\geq 1-\left(|1-\sigma(A)|+|\sigma(-A)|\right)>1-\frac{1}{2n}. $$
(31)

For \(j=i+1,\ldots,n-1\),

$$ -\frac{2A(x_{i}-x_{j})}{x_{j+1}-x_{j}}+A>A, $$
(32)

implies

$$ |\widetilde{g}_{ij}| =\left|\sigma\left(-\frac{2A(x_{i}-x_{j})}{x_{j+1}-x_{j}}+A\right)-\sigma\left(-\frac{2A(x_{i+1}-x_{j})}{x_{j+1}-x_{j}}+A\right)\right| $$
(33)
$$ \begin{aligned} &\leq \left|1-\sigma\left(-\frac{2A(x_{i}-x_{j})}{x_{j+1}-x_{j}}+A\right)\right| +\left|1-\sigma\left(-\frac{2A(x_{i+1}-x_{j})}{x_{j+1}-x_{j}}+A\right)\right|\\ &< \frac{1}{2n}. \end{aligned} $$
(34)

For j = n, it can similarly be obtained that

$$ |\widetilde{g}_{in}|=\left|\sigma\left(-\frac{2A(x_{i}-x_{n})}{x_{n}-x_{n-1}}+A\right) -\sigma\left(-\frac{2A(x_{i+1}-x_{n})}{x_{n}-x_{n-1}}+A\right)\right| < \frac{1}{{2n}}. $$
(35)

In the nth row of \(\widetilde{G}_{\sigma}\), for \(j=1,2,\ldots, n-1\),

$$ -\frac{2A(x_{n}-x_{j})}{x_{j+1}-x_{j}}+A\leq-A, $$
(36)

which implies

$$ \left|\widetilde{g}_{nj}\right|<\frac{1}{4n} $$
(37)

and

$$ \widetilde{g}_{nn}=\sigma(A)>1-\frac{1}{4n}. $$
(38)

Therefore,

$$ |\widetilde{g}_{ii}|> \sum_{j=1,\, j\neq i}^{n}| \widetilde{g}_{ij}|,\quad i=1,2,\ldots,n, $$
(39)

that is, \(\widetilde{G}_{\sigma}\) is strictly diagonally dominant, which guarantees that \(\widetilde{G}_{\sigma}\) is nonsingular. Since the row operations above do not change the determinant, this implies the nonsingularity of \(G_{\sigma}\). \(\square\)

Remark 2

Lemma 2.2 gives a method to construct the weights and biases of a single-hidden layer neural network with a sigmoidal activation function such that the hidden layer output matrix has full column rank for any given data. With Lemma 2.2, it is possible to construct an algorithm to train the neural network.
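The construction in the proof can be checked numerically. The sketch below (our own illustration, assuming the logistic sigmoid and the choice of A discussed after (23)) builds G_σ from the w_j, α_j above and confirms that it is nonsingular.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lemma22_params(x, A):
    """x: strictly increasing 1-D samples x_1 < ... < x_n; A with delta_sigma(A) < 1/(4n).
    Returns (w, alpha) as in the proof of Lemma 2.2."""
    gaps = np.diff(x)                        # x_{j+1} - x_j, length n-1
    w = -2 * A / np.append(gaps, gaps[-1])   # w_n reuses the last gap
    alpha = A - w * x                        # alpha_j = A + 2A x_j / (x_{j+1} - x_j)
    return w, alpha

n = 50
x = np.sort(np.random.default_rng(0).uniform(-10, 10, n))
A = np.log(4 * n - 1) + 1.0                  # ensures delta_sigma(A) < 1/(4n) for the logistic sigmoid
w, alpha = lemma22_params(x, A)
G = sigmoid(np.outer(x, w) + alpha)          # G[i, j] = sigma(w_j * x_i + alpha_j)
print(np.linalg.matrix_rank(G) == n)         # expected: True (G is nonsingular)
```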

For \({X_i\in {\mathbb{R}}^d,\, i=1,2,\ldots,N}\) with \(X_{1}\prec X_2\prec\cdots \prec X_N,\) Lemma 2.1 shows that there exists \({W\in {\mathbb{R}}^d}\) such that

$$ W\cdot X_{1}<W\cdot X_2<\cdots < W\cdot X_N. $$
(40)

Let \(x_i=W\cdot X_{i},\, \, i=1,2, \ldots ,N\), and set

$$ \begin{aligned} W_{1} & = - \frac{{2A}}{{x_{2} - x_{1} }}W,\quad W_{2} = - \frac{{2A}}{{x_{3} - x_{2} }}W,\quad \ldots ,\quad W_{{N - 1}} = - \frac{{2A}}{{x_{N} - x_{{N - 1}} }}W,\quad W_{N} = - \frac{{2A}}{{x_{N} - x_{{N - 1}} }}W. \\ b_{1} & = A + \frac{{2Ax_{1} }}{{x_{2} - x_{1} }},\quad b_{2} = A + \frac{{2Ax_{2} }}{{x_{3} - x_{2} }},\quad \ldots ,\quad b_{{N - 1}} = A + \frac{{2Ax_{{N - 1}} }}{{x_{N} - x_{{N - 1}} }},\quad b_{N} = A + \frac{{2Ax_{N} }}{{x_{N} - x_{{N - 1}} }}, \\ \end{aligned} $$

where A satisfies \(\delta_{\sigma}(A) < 1/(4N)\).

From Lemma 2.2, the following theorem is easily obtained.

Theorem 2.1

Suppose σ(x) is a bounded sigmoidal function, \({ X_i\in {\mathbb{R}}^d,\, \, i=1,2,\ldots, N}\), and \(X_{1}\prec X_2\prec\cdots \prec X_N.\) Then, there exist \({W_i\in {\mathbb{R}}^d,\; b_i\in{\mathbb{R}},\; i=1,2,\ldots, N}\), such that the matrix

$$ H= \left(\begin{array}{cccc} \sigma(W_1\cdot X_1+b_1) & \sigma(W_2\cdot X_1+b_2) & \cdots & \sigma(W_N\cdot X_1+b_N) \\ \sigma(W_1\cdot X_2+b_1) & \sigma(W_2\cdot X_2+b_2) & \cdots & \sigma(W_N\cdot X_2+b_N) \\ \vdots & \vdots & \cdots & \vdots \\ \sigma(W_1\cdot X_N+b_1) & \sigma(W_2\cdot X_N+b_2) & \cdots & \sigma(W_N\cdot X_N+b_N) \\ \end{array}\right) $$
(41)

is nonsingular.

Based on elementary matrix theory and Theorem 2.1, we have the following conclusion.

Corollary 2.1

Suppose \({X_i\in {\mathbb{R}}^d,\, i=1,2,\ldots,n}\) are n distinct vectors and \(n\geq N\). Then, there exist \({W_i\in {\mathbb{R}}^d,b_i\in {\mathbb{R}},\,i=1,2,\ldots,N}\) such that the matrix

$$ H=\left(\begin{array}{cccc} \sigma(W_1\cdot X_1+b_1)&\sigma(W_2\cdot X_1+b_2)&\cdots&\sigma(W_{N}\cdot X_1+b_{N}) \\ \sigma(W_1\cdot X_2+b_1)&\sigma(W_2\cdot X_2+b_2)&\cdots&\sigma(W_{N}\cdot X_2+b_{N}) \\ \vdots&\vdots&\cdots&\vdots\\ \sigma(W_1\cdot X_{n}+b_1)&\sigma(W_2\cdot X_{n}+b_2)&\cdots&\sigma(W_{N}\cdot X_{n}+b_{N}) \\ \end{array}\right)_{n\times N} $$
(42)

is full column rank.

3 A modified ELM using sigmoidal activation function

According to the discussion in Sect. 2 and based on the ELM algorithm, a new algorithm is proposed. It uses a sigmoidal function as its activation function and can make good use of the orthogonal projection method to calculate the output weights. The modified extreme learning machine is stated in the following algorithm.

Algorithm of Modified ELM: Given a training data set \({{\mathcal{N}}=\{(X_{i}^{*},t_{i}^{*})\,|\,X_{i}^{*}\in {\mathbb{R}}^{d}, t_{i}^{*}\in {\mathbb{R}},\, i=1,2,\ldots, n\},}\) a sigmoidal activation function (for instance, \(g(x)=1/(1+e^{-x})\)) and a hidden node number N.

Step 1::

Select weights \(W_i\) and biases \(b_i\) \((i=1,2,\ldots,N).\)

Sort the first N samples \((X_{i}^{*},t_{i}^{*})\,(i=1,2, \ldots, N)\) according to the order \(\prec\), that is, find indices \(i_1,i_2,\ldots,i_N\) (\(i_j\neq i_k\) for \(j\neq k,\; i_j\in\{1,2,\ldots,N\}\)) such that \(X_{i_1}^{*}\prec X_{i_2}^{*}\prec\cdots\prec X_{i_N}^{*}\); by Lemma 2.1, the vector W constructed below then satisfies \(W\cdot X_{i_1}^{*}<W\cdot X_{i_2}^{*}<\cdots< W\cdot X_{i_{N}}^{*}.\) Denote the reordered data by \((X_{i},t_{i})\,(i=1,2,\ldots,n)\) with \(X_{i}=(x_{i 1}, x_{i 2}, \ldots, x_{i d}).\) For \(j=1,2,\ldots,d,\) make the following calculations.

$$ w_{j}^{1}=\frac{1}{\max\nolimits_{i=1,2,\ldots,N}\left\{|x_{i j}|\right\}}>0, $$
(43)
$$ x_{ij}^{1}=w_{j}^{1}x_{ij}\in [-1,1], $$
(44)
$$ y_{ij}^{1}=\left|x_{i+1,j}^{1}-x_{i j}^{1}\right| \quad(i=1,2,\ldots,N-1), $$
(45)
$$ n_{j}=\left\{\begin{array}{ll} 2d, & \hbox{if }\; y_{ij}^{1}=0\; \hbox{ for all }\; i=1, 2, \ldots, N-1,\\ \frac{4d}{\min\nolimits_{y_{ij}^{1}\neq0}\left\{y_{ij}^{1}\right\}}, & \hbox{else}, \end{array}\right. $$
(46)
$$ w_{j}^{2}=w_{j}^{1} \Uppi_{i=1}^{j} n_{i}. $$
(47)

Set

$$ W=\left(w_{1}^{2},w_{2}^{2},\ldots, w_{d}^{2}\right). $$
(48)

Let \(a_{i}=W\cdot X_{i}\,(i=1,2,\ldots,N)\) and choose A such that \(\delta_{\sigma}(A)<1/(4N)\); then set

$$ W_{i}=-\frac{2A}{a_{i+1}-a_{i}}W,\quad i=1,2,\ldots,N-1, $$
(49)
$$ W_{N}=W_{N-1}, $$
(50)
$$ b_{i}=A+\frac{2Aa_{i}}{a_{i+1}-a_{i}},\quad i=1,2,\ldots,N-1, $$
(51)
$$ b_{N}=A+\frac{2Aa_{N}}{a_{N}-a_{N-1}}. $$
(52)
Step 2::

Calculate the output weights \(\beta=(\beta_{1},\beta_{2},\ldots,\beta_{N})^{T}.\)

Let \(T=(t_{1},t_{2},\ldots,t_{n})^{T}\) and

$$ {\bf H}=\left(\begin{array}{cccc} \sigma(W_{1}\cdot X_{1}+b_{1}) & \sigma(W_{2}\cdot X_{1}+b_{2})& \ldots & \sigma(W_{N}\cdot X_{1}+b_{N})\\ \vdots & \vdots & \ldots & \vdots \\ \sigma(W_{1}\cdot X_{N}+b_{1})& \sigma(W_{2}\cdot X_{N}+b_{2})& \ldots & \sigma(W_{N}\cdot X_{N}+b_{N})\\ \sigma(W_{1}\cdot X_{N+1}+b_{1})& \sigma(W_{2}\cdot X_{N+1}+b_{2})& \ldots & \sigma(W_{N}\cdot X_{N+1}+b_{N})\\ \vdots & \vdots & \ldots & \vdots \\ \sigma(W_{1}\cdot X_{n}+b_{1})& \sigma(W_{2}\cdot X_{n}+b_{2})& \ldots & \sigma(W_{N}\cdot X_{n}+b_{N})\\ \end{array}\right)_{n\times N}. $$
(53)

Then,

$$ \beta={\bf H}^{\dag} T=({\bf H}^{T}{\bf H})^{-1}{\bf H}^{T}T. $$
(54)
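The two steps can be put together as follows. This is a minimal sketch under our own naming (modified_elm_train, precedes_cmp), assuming the logistic sigmoid and choosing A as in the worked example after (23); it is not the authors' reference implementation.

```python
import numpy as np
from functools import cmp_to_key

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def precedes_cmp(a, b):
    """Comparator realizing the order ≺: compare coordinates from the last one down."""
    for j in reversed(range(len(a))):
        if a[j] != b[j]:
            return -1 if a[j] < b[j] else 1
    return 0

def modified_elm_train(X, T, N):
    """X: n x d distinct samples, T: length-n targets, N <= n hidden nodes."""
    n, d = X.shape
    # Step 1a: sort the first N samples by ≺, keep the remaining samples as they are
    order = sorted(range(N), key=cmp_to_key(lambda i, j: precedes_cmp(X[i], X[j])))
    idx = np.concatenate([np.array(order), np.arange(N, n)])
    Xs, Ts = X[idx], np.asarray(T)[idx]
    # Step 1b: transformation vector W from Lemma 2.1 (cf. eqs. (43)-(48))
    XN = Xs[:N]
    w1 = 1.0 / (1.0 + np.max(np.abs(XN), axis=0))
    Y1 = np.abs(np.diff(XN * w1, axis=0))
    nj = np.array([2 * d if not (Y1[:, j] > 0).any()
                   else 4 * d / Y1[:, j][Y1[:, j] > 0].min() for j in range(d)])
    W = w1 * np.cumprod(nj)
    # Step 1c: input weights and biases from eqs. (49)-(52)
    a = XN @ W                                   # strictly increasing by Lemma 2.1
    A = np.log(4 * N - 1) + 1.0                  # delta_sigma(A) < 1/(4N) for the logistic sigmoid
    gaps = np.append(np.diff(a), a[-1] - a[-2])  # a_{i+1}-a_i, last gap repeated for index N
    Wi = (-2 * A / gaps)[:, None] * W            # W_i = -2A/(a_{i+1}-a_i) W, with W_N = W_{N-1}
    b = A + 2 * A * a / gaps                     # b_i = A + 2A a_i/(a_{i+1}-a_i)
    # Step 2: hidden layer output matrix and output weights via orthogonal projection, eq. (54)
    H = sigmoid(Xs @ Wi.T + b)
    beta = np.linalg.solve(H.T @ H, H.T @ Ts)
    return Wi, b, beta

def modified_elm_predict(X, Wi, b, beta):
    return sigmoid(X @ Wi.T + b) @ beta
```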

In practice, ELM proves to be an extremely fast algorithm, because it chooses the input weights \(W_i\) and biases \(b_i\) of the SLFN randomly instead of selecting them carefully. However, this big advantage sometimes makes the algorithm less effective. As discussed in [22], the random selection of input weights and biases is likely to yield a hidden layer output matrix H that is not of full column rank, which causes two difficulties that cannot be overcome theoretically. First, the SLFN cannot approximate the training samples with zero error, which lowers the prediction accuracy. Second, the orthogonal projection method is typically faster than the SVD, but it cannot be used in ELM because the nonsingularity of \({\bf H}^{T}{\bf H}\) is not guaranteed; the algorithm proposed here, by contrast, can make good use of this faster method.

4 Simulation results

This section measures the performance of the proposed modified ELM learning algorithm. The simulations for the ELM and modified ELM algorithms are carried out in the Matlab 7.10 environment running on an AMD Athlon(tm) II X3 425 processor at 2.71 GHz. The activation function used in the algorithms is the sigmoid function \(g(x)=1/(1+e^{-x})\). For classification problems, the accuracy rate, that is, the ratio of the number of testing samples that are correctly classified to the total number of testing samples, is used to measure the performance of the learning algorithms. For regression problems, the root-mean-square deviation (RMSD) between the target vectors and the targets predicted by the SLFNs is used; we call this RMSD the error of the algorithm on the testing data. The RMSDs of the accuracy and of the error across trials are used to measure the fluctuations of the algorithms.
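For reference, the two performance measures can be written as follows (a small sketch with our own function names):

```python
import numpy as np

def classification_accuracy(y_true, y_pred):
    """Ratio of correctly classified testing samples to all testing samples."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

def rmsd(t_true, t_pred):
    """Root-mean-square deviation between target and predicted target vectors."""
    t_true, t_pred = np.asarray(t_true), np.asarray(t_pred)
    return np.sqrt(np.mean((t_true - t_pred) ** 2))
```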

4.1 Benchmarking with a regression problem: approximation of ‘SinC’ function with noise

First of all, the ‘SinC’ function is used to measure the performance of the ELM and modified ELM algorithms. The target function is as follows.

$$ y = f(x) = \left\{\begin{array}{ll} \sin (x)/x, & x \ne 0, \\ 1, & x = 0. \end{array}\right. $$
(55)

A training set \((x_i, y_i)\) and a testing set \((u_i, t_i)\), each with 5,000 samples, are created, where the \(x_i\) in the training data and the \(u_i\) in the testing data are uniformly spaced in [−10, 10]. Large uniform noise distributed in [−0.2, 0.2] has been added to all training targets to obtain a real regression problem. The experiment is carried out on these data as follows. Twenty hidden nodes are assigned to both the ELM and modified ELM algorithms. Fifty trials have been conducted for the ELM algorithm to eliminate random error, and the results shown are averages. The results shown in Table 1 include training time, testing time, training accuracy, testing accuracy and the number of nodes of both algorithms. In addition, the standard deviations of time and accuracy in the training and testing processes are recorded in Table 2.
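A data-generation sketch consistent with this description (our own; the noise bound of 0.2 and the sample count of 5,000 are taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5000

def sinc(x):
    """SinC target: sin(x)/x for x != 0, and 1 at x = 0."""
    return np.where(x == 0, 1.0, np.sin(x) / np.where(x == 0, 1.0, x))

# Uniformly spaced inputs in [-10, 10] for training and testing
x_train = np.linspace(-10, 10, n_samples)
u_test = np.linspace(-10, 10, n_samples)

# Training targets carry large uniform noise in [-0.2, 0.2]; testing targets are noise-free
y_train = sinc(x_train) + rng.uniform(-0.2, 0.2, n_samples)
t_test = sinc(u_test)
```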

Table 1 Performance comparison for learning function: SinC
Table 2 Standard deviation comparison for learning function: SinC

It can be seen from Table 1 that the modified ELM algorithm spends 0.0675 and 0.0169 s of CPU time on training and testing, respectively, whereas the ELM algorithm takes 0.0541 and 0.0213 s to complete the same process. Regarding the training and testing accuracy of both algorithms, the original ELM performs better than the modified ELM. However, the standard deviations of the training and testing times and of the accuracy of the modified ELM are all smaller than those of ELM, as shown in Table 2, which means the modified algorithm is more stable than ELM.

Figure 1 shows the raw data and trained results of ELM algorithm, and Fig. 2 shows the raw data and the approximated results based on the modified ELM algorithm, which also demonstrates that the modified ELM algorithm has an acceptable performance.

Fig. 1

Outputs of ELM learning algorithm

Fig. 2

Outputs of modified ELM learning algorithm

4.2 Benchmarking with practical problems applications

A performance comparison of the proposed modified ELM and the original ELM algorithms on four real problems is carried out in this section. The four problems, shown in Table 3, comprise classification and regression tasks: two of them are classification tasks (Diabetes and Glass Identification (Glass ID)), and the other two are regression tasks (Housing and Concrete Slump). All the data sets are from the UCI repository of machine learning databases [28]. The specification of each database is shown in Table 3. For the databases that have only one data table, as in [25, 26, 29, 30], 75 and 25% of the samples are randomly chosen for training and testing, respectively, at each trial.

Table 3 Specifications of real-world applications and the number of nodes for each

In order to reduce random error, fifty trials of both algorithms are conducted for every database, and the averages over the fifty trials are taken as the final results. The results are reported in Tables 4, 5 and 6, which show that in our simulations, on average, the ELM and modified ELM algorithms have similar learning times (reported in Table 6) and both learn fast.

Table 4 Comparison training and testing accuracy/error of ELM and modified ELM
Table 5 Comparison of training and testing RMSD of accuracy/error of ELM and modified ELM
Table 6 Comparison of average training and testing time of ELM and modified ELM

However, the modified ELM has more stable training and testing accuracies, as can be seen from Table 5, especially for the regression problems. From Table 4, it can be observed that the modified ELM algorithm is more accurate than ELM on the regression cases. Detailed information is shown in Figs. 3, 4, 5, 6, 7, 8, 9 and 10.

Fig. 3

Training accuracy of ELM for Diabetes

Fig. 4

Training accuracy of modified ELM for Diabetes

Fig. 5

Testing accuracy of ELM for Diabetes

Fig. 6

Testing accuracy of modified ELM for Diabetes

Fig. 7

Testing accuracy of ELM for GlassID

Fig. 8

Testing accuracy of modified ELM for GlassID

Fig. 9

Training accuracy of ELM for Housing

Fig. 10

Training accuracy of modified ELM for Housing

In addition, it is worth mentioning that, compared with the ELM algorithm, the modified ELM selects the input weights and biases, which helps avoid the risk of random errors, as can be seen from Table 5 and Figs. 3, 4, 5, 6, 7, 8, 9 and 10.

5 Conclusions

This paper proposes a modified ELM algorithm, based on ELM, for training single-hidden layer feedforward neural networks (SLFNs), in an attempt to solve the least-squares minimization of SLFNs more effectively; meanwhile, it solves the open problem posed in [22].

The learning speed of the modified ELM is as fast as that of ELM. The main difference between the modified ELM and ELM algorithms lies in the selection of input weights and biases. The modified algorithm selects the input weights and biases properly, which consumes little time compared with the training time of the output weights. The modified ELM algorithm overcomes the shortcoming of EELM, the algorithm proposed in [22]: it can use a sigmoidal function as the activation function to train the networks while still keeping the qualities of ELM and EELM.