
1 Introduction

Single hidden layer feedforward networks (SHLFNs) can act as universal approximators [1]. With traditional training algorithms, such as backpropagation-based algorithms, we need to estimate all the connection weights, including the input weights from the input layer to the hidden layer and the output weights from the hidden layer to the output node. Training all the connection weights may suffer from problems such as local minima. Huang et al. [2] proposed the extreme learning machine (ELM) concept, in which the hidden nodes are generated randomly. In addition, they showed that SHLFNs under the ELM concept can also act as universal approximators. In [2, 3], Huang et al. developed the incremental ELM (I-ELM) algorithm [2] and the convex incremental ELM (CI-ELM) algorithm [3]. The mean square error (MSE) performance of these two algorithms is very good under the noiseless situation, where there is no node noise in the implementation.

In the implementation of neural networks, noise occurs unavoidably [4]. When we use finite-precision technology to implement a trained network, multiplicative noise or additive noise is introduced [5]. Also, when the implementation is at the nano-scale, transient noise may occur [6]. For traditional neural network models, some batch-mode learning algorithms for neural networks under imperfect conditions were reported [8]. To the best of our knowledge, there are few works related to noise-resistant ELMs.

This paper considers multiplicative node noise and additive node noise as the imperfect conditions for the SHLFN model. We first derive the training set error expression of noisy SHLFNs. Afterwards, we develop two noise-resistant incremental ELM algorithms, namely the noise-resistant I-ELM (NRI-ELM) and the noise-resistant CI-ELM (NRCI-ELM). For the NRI-ELM algorithm, we keep all the previously trained weights unchanged and adjust only the output weight of the newly inserted node. The noise-resistant performance of the NRI-ELM algorithm is better than that of I-ELM and CI-ELM. For the NRCI-ELM algorithm, we use a simple rule to update all the previously trained weights, and we estimate the output weight of the new node to maximize the reduction in the training set error of noisy SHLFNs. The noise-resistant ability of the NRCI-ELM algorithm is much better than that of I-ELM, CI-ELM, and NRI-ELM. In addition, we prove that, in terms of the training set error of noisy SHLFNs, both the NRI-ELM algorithm and the NRCI-ELM algorithm converge.

The rest of this paper is organized as follows. Section 2 presents the background of the ELM concept and the node noise models. Section 3 derives the two proposed noise-resistant incremental ELM algorithms. Section 4 presents the simulation results. Section 5 concludes the paper.

2 ELM and Node Noise

The nonlinear regression problem is considered in this paper. The training set is denoted as \(\mathbb {D}_t=\left\{ \left( \varvec{x}_k,o_k\right) :\varvec{x}_k \in \mathbb {R}^M, o_k \in \mathbb {R}, k=1,\ldots ,N \right\} \), where \(\varvec{x}_k\) and \(o_k\) are the input and the target output of the kth sample, respectively. The test set is denoted as \(\mathbb {D}_f = \left\{ \left( \varvec{x}'_{k'},o'_{k'}\right) :\varvec{x}'_{k'} \in \mathbb {R}^M, o'_{k'} \in \mathbb {R}, k'=1,\ldots ,N' \right\} \). In a SHLFN with n hidden nodes, the network output is given by \(f_n(\varvec{x})=\sum ^n_{i=1} \beta _i h_i \left( \varvec{x}\right) \), where \(h_i(\varvec{x})\) is the output of the ith hidden node, and \(\beta _i\) is the output weight of the ith hidden node. In this paper, we use the sigmoid function as the activation function. Hence the output of the ith hidden node is given by

$$\begin{aligned} h_i(\varvec{x})=\frac{1}{1+\exp \{-(\varvec{w}_i^\mathrm {T} \varvec{x}+{b_i})\}}, \end{aligned}$$
(1)

where \(b_i\) is the input bias of the ith hidden node, and \(\varvec{w}_i\) is the input weight vector of the ith hidden node.

In the ELM approach [2, 3], the bias terms \(b_i\)’s and the input weight vectors \(\varvec{w}_i\)’s are randomly generated. We only need to estimate the output weights \(\beta _i\)’s. For a trained SHLFN, the training set error is given by

$$\begin{aligned} \mathcal {E} = \sum _{k=1}^{N} \left( o_k - \sum _{i=1}^{n} \beta _i h_i(\varvec{x}_k)\right) ^2 = \left\| \varvec{o}- \sum _{i=1}^n \beta _i \varvec{h}_i \right\| _2^2, \end{aligned}$$
(2)

where \(\varvec{o}=[o_1,\ldots ,o_N]^\mathrm {T}\), and \(\varvec{h}_i=[h_i(\varvec{x}_1),\ldots ,h_i(\varvec{x}_N)]^\mathrm {T}\).
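To make (1) and (2) concrete, the following minimal NumPy sketch generates a random sigmoid hidden layer and evaluates the training set error. It is only an illustration under our own assumptions: the uniform range of the random input weights, the toy data, and names such as `hidden_output` and `training_error` are not prescribed by the original papers.

```python
import numpy as np

def hidden_output(X, w, b):
    """Sigmoid output of one hidden node, Eq. (1): h(x) = 1 / (1 + exp(-(w^T x + b)))."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def training_error(o, H, beta):
    """Training set error of Eq. (2): ||o - sum_i beta_i h_i||_2^2.
    H stacks the vectors h_i as columns; beta holds the output weights."""
    return float(np.sum((o - H @ beta) ** 2))

# Toy example: random hidden nodes (ELM style), so only beta would need training.
rng = np.random.default_rng(0)
N, M, n = 100, 8, 20                          # samples, inputs, hidden nodes
X = rng.standard_normal((N, M))               # toy inputs x_k
o = rng.standard_normal(N)                    # toy targets o_k
W = rng.uniform(-1.0, 1.0, size=(M, n))       # random input weight vectors w_i
b = rng.uniform(-1.0, 1.0, size=n)            # random biases b_i
H = np.column_stack([hidden_output(X, W[:, i], b[i]) for i in range(n)])
beta = rng.standard_normal(n)                 # placeholder output weights
print(training_error(o, H, beta))
```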

In the implementation of a network, node noise can hardly be avoided. In a digital implementation, the finite-precision effect can be modelled as multiplicative node noise or additive node noise [5]. With the floating-point approach, the round-off error can be modelled as multiplicative noise. On the other hand, with the fixed-point approach, the round-off error can be modelled as additive noise.

Given the kth input vector, when a hidden node is affected by the multiplicative noise and additive noise concurrently, its output can be modelled as

$$\begin{aligned} \tilde{h}_i (\varvec{x}_k)=(1+\delta _{ik}) h_i(\varvec{x}_k) + \epsilon _{ik}, \forall i=1,\ldots ,n\, \text { and } \forall k=1,\ldots ,N, \end{aligned}$$
(3)

where \(\delta _{ik}\)’s are the noise factors that describe the deviation due to the multiplicative node noise, and \(\epsilon _{ik}\)’s are the noise factors that describe the deviation due to the additive node noise. Note that in the multiplicative noise case, the magnitude of the noise component “\(\delta _{ik} h_i(\varvec{x}_k)\)” is proportional to the magnitude of the output \(h_i(\varvec{x}_k)\). This paper assumes that the noise factors \(\delta _{ik}\)’s and \(\epsilon _{ik}\)’s are zero-mean independent and identically distributed random variables with variances equal to \(\sigma _\delta ^2\) and \(\sigma _\epsilon ^2\), respectively.
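A quick Monte Carlo check of the noise model (3) is sketched below: it verifies empirically that the noisy output has mean \(h_i(\varvec{x}_k)\) and second moment \((1+\sigma _\delta ^2) h^2_i(\varvec{x}_k) + \sigma _\epsilon ^2\), the properties used in Sect. 3. The Gaussian choice for the noise factors and the specific numerical values are illustrative assumptions only; the analysis merely requires zero-mean i.i.d. factors with the stated variances.

```python
import numpy as np

rng = np.random.default_rng(1)
h = 0.7                      # one clean hidden-node output h_i(x_k)
sigma_d2, sigma_e2 = 0.01, 0.01
T = 200_000                  # number of simulated noise patterns

delta = rng.normal(0.0, np.sqrt(sigma_d2), T)   # multiplicative noise factors delta_ik
eps = rng.normal(0.0, np.sqrt(sigma_e2), T)     # additive noise factors epsilon_ik
h_noisy = (1.0 + delta) * h + eps               # noisy outputs, Eq. (3)

print(h_noisy.mean(), h)                                        # sample mean vs. h_i(x_k)
print((h_noisy ** 2).mean(), (1 + sigma_d2) * h**2 + sigma_e2)  # sample second moment
```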

As reported in [3], the CI-ELM algorithm works very well under the noiseless situation. For example, as shown in Fig. 1(a), the network outputs fit the training samples very well. However, when node noise exists, the network outputs are heavily corrupted by noise, as shown in Fig. 1(b). When our proposed NRCI-ELM is used, the noise in the network outputs is greatly suppressed, as shown in Fig. 1(c) and (d).

Fig. 1.

Illustration of the noise resistant ability of CI-ELM and NRCI-ELM. (a) The network output of a noiseless network with CI-ELM. (b) The network output of a noisy network with CI-ELM. (c) The network output of a noiseless network with NRCI-ELM. (d) The network output of a noisy network with NRCI-ELM. In this example, \(\sigma ^2_\epsilon =\sigma ^2_\delta =0.01\).

3 Noise Resistant Incremental Learning

For a SHLFN with a particular noise pattern, the training set error can be expressed as

$$\begin{aligned} \tilde{\mathcal {E}} = \sum _{k=1}^{N} \left( o_k \!-\! \sum _{i=1}^{n} \beta _i \tilde{h}_i (\varvec{x}_k) \right) ^2 \, . \end{aligned}$$
(4)

According to the properties of \(\delta _{ik}\)’s and \(\epsilon _{ik}\)’s, the statistics of \(\tilde{h}_i (\varvec{x}_k)\)’s are given by

$$\begin{aligned} \langle \tilde{h}_i (\varvec{x}_k) \rangle= & {} h_i (\varvec{x}_k), \end{aligned}$$
(5)
$$\begin{aligned} \langle \tilde{h}^2_i (\varvec{x}_k) \rangle= & {} (1+\sigma _\delta ^2) h^2_i (\varvec{x}_k) + \sigma _\epsilon ^2, \end{aligned}$$
(6)
$$\begin{aligned} \langle \tilde{h}_i (\varvec{x}_k) \tilde{h}_{j} (\varvec{x}_k) \rangle= & {} h_i (\varvec{x}_k) h_{j} (\varvec{x}_k), \forall \, i \ne j. \end{aligned}$$
(7)

Taking the expectation over all possible noise patterns, we obtain the training set error of noisy SHLFNs, given by

$$\begin{aligned} \Big \langle \tilde{\mathcal {E}} \Big \rangle = \left\langle \sum _{k=1}^{N} \left( o_k \!-\! \sum _{i=1}^{n} \beta _i \Big ( (1\!+\!\delta _{ik}) h_i(\varvec{x}_k) \!+\! \epsilon _{ik} \Big ) \right) ^2 \right\rangle . \end{aligned}$$
(8)

From (5)–(7), Eq. (8) becomes

$$\begin{aligned} \Big \langle \tilde{\mathcal {E}} \Big \rangle =\left\| \varvec{o}\!-\! \sum _{i=1}^n \beta _i \varvec{h}_i \right\| _2^2 \!+\!\sigma _\delta ^2\! \sum _{i=1}^n \beta _i^2 \Vert \varvec{h}_i \Vert _2^2\!\,+\,\!\sigma _\epsilon ^2\! N \sum _{i=1}^n \beta _i^2. \end{aligned}$$
(9)
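For reference, (9) can be evaluated directly from the clean hidden outputs. The sketch below is our own illustration; it assumes the vectors \(\varvec{h}_i\) are stacked as the columns of a matrix `H`. Replacing `o` and `H` by their test set counterparts gives (10).

```python
import numpy as np

def noisy_training_error(o, H, beta, sigma_d2, sigma_e2):
    """Training set error of noisy SHLFNs, Eq. (9).
    o: (N,) targets; H: (N, n) clean hidden outputs; beta: (n,) output weights."""
    N = o.shape[0]
    residual = o - H @ beta                                        # o - sum_i beta_i h_i
    mult_term = sigma_d2 * np.sum(beta**2 * np.sum(H**2, axis=0))  # sigma_d^2 * sum_i beta_i^2 ||h_i||^2
    add_term = sigma_e2 * N * np.sum(beta**2)                      # sigma_e^2 * N * sum_i beta_i^2
    return float(np.sum(residual**2) + mult_term + add_term)
```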

Similarly, we can obtain the test set error of noisy SHLFNs, given by

$$\begin{aligned} \Big \langle \tilde{\mathcal {E}}_t \Big \rangle =\left\| \varvec{o}' \!-\! \sum _{i=1}^n \beta _i \varvec{h}'_i \right\| _2^2 \!\,+\,\!\sigma _\delta ^2\! \sum _{i=1}^n \beta _i^2 \Vert \varvec{h}'_i \Vert _2^2\!\,+\,\!\sigma _\epsilon ^2\! N' \sum _{i=1}^n \beta _i^2, \end{aligned}$$
(10)

where \(\varvec{o}'=[o'_1,\ldots ,o'_{N'}]^\mathrm {T}\), and \(\varvec{h}'_i=[h_i(\varvec{x}'_1),\ldots ,h_i(\varvec{x}'_{N'})]^\mathrm {T}\).

For the NRI-ELM, at the nth iteration, a new hidden node \(h_n(\cdot )\), whose input bias and input weight vector are randomly generated, is inserted into the network. We keep the output weights \(\{\beta _1,\ldots ,\beta _{n-1}\}\) of the previously inserted hidden nodes unchanged. We need to estimate the output weight \(\beta _n\) of the nth hidden node. From (9), the training set error of the noisy networks at the nth iteration is

$$\begin{aligned} L_n = \left\| \varvec{o}\!-\! \sum _{i=1}^n \beta _i \varvec{h}_i \right\| _2^2 \!\,+\,\!\sigma _\delta ^2\! \sum _{i=1}^n \beta _i^2 \Vert \varvec{h}_i \Vert _2^2\!\,+\,\!\sigma _\epsilon ^2\! N \sum _{i=1}^n \beta _i^2. \end{aligned}$$
(11)

Define

$$\begin{aligned} \varvec{f}_{n}= \sum _{i=1}^{n} \beta _i \varvec{h}_i, \, \, \, \, \varvec{e}_{n} = \varvec{o}- \sum _{i=1}^{n} \beta _i \varvec{h}_i, \, \, \, \, v_{n} = \sum _{i=1}^n \beta _i^2 \Vert \varvec{h}_i \Vert _2^2, \, \, \, \, u_{n} = N \sum _{i=1}^n \beta _i^2. \end{aligned}$$
(12)

From (12), Eq. (11) can be rewritten as

$$\begin{aligned} L_n = \left\| \varvec{e}_n \right\| _2^2 +\sigma _\delta ^2 v_n + \sigma _\epsilon ^2 u_n. \end{aligned}$$
(13)

From (13), the change in the training set error between the nth iteration and the \((n-1)\)th iteration is given by

$$\begin{aligned} \triangle _n = L_n - L_{n-1} = - 2\beta _n \varvec{e}_{n-1}^\mathrm {T} \varvec{h}_n + (1+\sigma _\delta ^2) \beta _n^2 \Vert \varvec{h}_n \Vert _2^2 + \sigma _\epsilon ^2 \beta _n^2 N. \end{aligned}$$
(14)

Since \(\triangle _n\) is a quadratic function of \(\beta _n\) with a positive leading coefficient, it has a unique minimum, and this minimum value is non-positive. Hence the value of \(\beta _n\) that maximizes the decrease in the training set error is given by

$$\begin{aligned} \beta _n= \frac{\varvec{e}^\mathrm {T}_{n-1} \varvec{h}_n}{(1+\sigma ^2_\delta )\Vert \varvec{h}_n\Vert ^2_2+N \sigma _\epsilon ^2}. \end{aligned}$$
(15)

With (15), the change in the training set error between two consecutive iterations is

$$\begin{aligned} \triangle _n = - \frac{\left( \varvec{e}_{n-1}^\mathrm {T} \varvec{h}_n\right) ^2}{(1+\sigma _\delta ^2)\left\| \varvec{h}_n\right\| _2^2+ N \sigma _\epsilon ^2}. \end{aligned}$$
(16)

Equation (16) means that whenever we insert a new hidden node, the training set error of noisy networks decreases. That means, in terms of the training set MSE of noisy networks, the NRI-ELM algorithm converges. Algorithm 1 shows the proposed NRI-ELM algorithm. From Steps (5)–(8) in Algorithm 1, the computational complexity of the NRI-ELM algorithm is O(N) per iteration.

Algorithm 1. The proposed NRI-ELM algorithm.
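A minimal NumPy sketch of the NRI-ELM loop, consistent with (12)–(15), is shown below. The sigmoid node generation, the uniform weight range, and the function name `nri_elm` are our own illustrative choices, not the authors' exact pseudocode. Only the residual vector `e` is carried between iterations, since the previously trained weights are never revisited.

```python
import numpy as np

def nri_elm(X, o, n_max, sigma_d2, sigma_e2, rng=None):
    """Sketch of NRI-ELM: insert random sigmoid nodes one by one and set each new
    output weight by Eq. (15); previously trained weights stay unchanged."""
    rng = rng or np.random.default_rng()
    N, M = X.shape
    W, b, beta = [], [], []
    e = o.copy()                              # residual e_0 = o
    for _ in range(n_max):
        w_n = rng.uniform(-1.0, 1.0, M)       # random input weight vector
        b_n = rng.uniform(-1.0, 1.0)          # random bias
        h_n = 1.0 / (1.0 + np.exp(-(X @ w_n + b_n)))
        beta_n = (e @ h_n) / ((1 + sigma_d2) * (h_n @ h_n) + N * sigma_e2)  # Eq. (15)
        e = e - beta_n * h_n                  # e_n = e_{n-1} - beta_n h_n
        W.append(w_n); b.append(b_n); beta.append(beta_n)
    return np.array(W), np.array(b), np.array(beta)
```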

In [3], the CI-ELM algorithm was proposed. Under the noiseless situation [3], the training set error of the original CI-ELM algorithm is better than that of the I-ELM algorithm. However, as shown in Sect. 4, the original CI-ELM algorithm has very poor noise-resistant ability. Hence it is of interest to develop a noise-resistant version of CI-ELM, namely NRCI-ELM.

In the NRCI-ELM case, after we estimate the output weight \(\beta _n\) at the nth iteration, we update all the previously trained weights by

$$\begin{aligned} \beta _i^{new} = (1-\beta _n) \beta _i, \end{aligned}$$
(17)

for \(i=1\) to \(n-1\). Hence we have the recursive definitions for \(\varvec{f}_n\), \(\varvec{e}_n\), \(v_n\) and \(u_n\)

$$\begin{aligned}&\varvec{f}_n = (1-\beta _n) \varvec{f}_{n-1} + \beta _n \varvec{h}_n, \, \, \, \, \, \varvec{e}_n = \varvec{o}- \varvec{f}_n, \\&v_n = (1-\beta _n)^2 v_{n-1} + \beta ^2_n \left\| \varvec{h}_n \right\| ^2_2, \, \, u_n = (1-\beta _n)^2 u_{n-1} + \beta ^2_n N, \end{aligned}$$

where \(\varvec{f}_0=\varvec{0}\), \(\varvec{e}_0=\varvec{o}\), \(v_0=0\), and \(u_0=0\). With this new updating scheme for the previously trained output weights, the change in the training set error between the nth iteration and the \((n-1)\)th iteration is given by

$$\begin{aligned} \triangle _n = L_n - L_{n-1}= & {} - 2\beta _n \Big ( \varvec{e}_{n-1}^\mathrm {T} \varvec{r}_n + \sigma _\delta ^2 v_{n-1} + \sigma _\epsilon ^2 u_{n-1} \Big ) \nonumber \\&+ \beta _n^2 \Big ( \Vert \varvec{r}_n\Vert ^2_2 + \sigma _\delta ^2 (v_{n-1} + \Vert \varvec{h}_n \Vert _2^2) + \sigma _\epsilon ^2 (u_{n-1} + N) \Big ), \end{aligned}$$
(18)

where \(\varvec{r}_{n} = \varvec{h}_n -\varvec{f}_{n-1}\).

Similar to the NRI-ELM case, to maximize the decrease in the training set error of noisy networks, \(\beta _n\) should be given by

$$\begin{aligned} \beta _n=\frac{\varvec{e}_{n-1}^\mathrm {T} \varvec{r}_n + \sigma _\delta ^2 v_{n-1} + \sigma _\epsilon ^2 u_{n-1}}{ \Vert \varvec{r}_n\Vert ^2_2 + \sigma _\delta ^2 (v_{n-1} + \Vert \varvec{h}_n \Vert _2^2) + \sigma _\epsilon ^2 (u_{n-1} + N) }. \end{aligned}$$
(19)

With (19), the change in the training set error between two consecutive iterations is

$$\begin{aligned} \triangle _n = - \frac{\left( \varvec{e}_{n-1}^\mathrm {T} \varvec{r}_n + \sigma _\delta ^2 v_{n-1} + \sigma _\epsilon ^2 u_{n-1}\right) ^2}{\Vert \varvec{r}_n\Vert ^2_2 + \sigma _\delta ^2 (v_{n-1} + \Vert \varvec{h}_n \Vert _2^2) + \sigma _\epsilon ^2 (u_{n-1} + N)}. \end{aligned}$$
(20)

Equation (20) means that whenever we insert a new hidden node, the training set error of noisy networks decreases. That means, in terms of the training set error of noisy networks, the NRCI-ELM algorithm converges too. Algorithm 2 shows the proposed NRCI-ELM algorithm. At each iteration, the complexity of the NRCI-ELM algorithm is “\(O(N)+O(n)\)”. Compared to the NRI-ELM case, whose complexity is O(N), the additional O(n) term is due to the update of the previous weights.

Algorithm 2. The proposed NRCI-ELM algorithm.
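Similarly, the NRCI-ELM loop can be sketched as follows, with the new output weight set by (19), the old weights rescaled as in (17), and \(\varvec{f}_n\), \(\varvec{e}_n\), \(v_n\), \(u_n\) updated recursively. Again, the node generation and the function name `nrci_elm` are illustrative assumptions. The rescaling of the stored output weights is the additional O(n) work per iteration noted above.

```python
import numpy as np

def nrci_elm(X, o, n_max, sigma_d2, sigma_e2, rng=None):
    """Sketch of NRCI-ELM: the new weight follows Eq. (19) and the previously
    trained weights are rescaled by (1 - beta_n) as in Eq. (17)."""
    rng = rng or np.random.default_rng()
    N, M = X.shape
    W, b, beta = [], [], []
    f = np.zeros(N)                           # f_0 = 0
    e = o.copy()                              # e_0 = o
    v, u = 0.0, 0.0                           # v_0 = 0, u_0 = 0
    for _ in range(n_max):
        w_n = rng.uniform(-1.0, 1.0, M)
        b_n = rng.uniform(-1.0, 1.0)
        h_n = 1.0 / (1.0 + np.exp(-(X @ w_n + b_n)))
        r_n = h_n - f                         # r_n = h_n - f_{n-1}
        num = e @ r_n + sigma_d2 * v + sigma_e2 * u
        den = r_n @ r_n + sigma_d2 * (v + h_n @ h_n) + sigma_e2 * (u + N)
        beta_n = num / den                    # Eq. (19)
        beta = [(1.0 - beta_n) * bi for bi in beta]   # Eq. (17), O(n) per iteration
        beta.append(beta_n)
        f = (1.0 - beta_n) * f + beta_n * h_n # recursive updates of f_n, e_n, v_n, u_n
        e = o - f
        v = (1.0 - beta_n) ** 2 * v + beta_n ** 2 * (h_n @ h_n)
        u = (1.0 - beta_n) ** 2 * u + beta_n ** 2 * N
        W.append(w_n); b.append(b_n)
    return np.array(W), np.array(b), np.array(beta)
```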
Fig. 2.

The performance of the four incremental methods versus the number of additive nodes. The noise level is \(\sigma ^2_\epsilon =\sigma ^2_\delta =0.09\). The Abalone dataset is considered.

4 Simulation

Two real-life datasets from the UCI data repository are used: Abalone [9] and Housing Price [10]. The Abalone dataset has 4,177 samples. Each sample has eight inputs and one output. Two thousand samples are randomly taken as the training set. The other 2,177 samples are used as the test set. The Housing Price dataset has 506 samples. Each sample has 13 inputs and one output. The training set contains 250 samples, while the test set has 256 samples.

This section considers four incremental algorithms: the original I-ELM algorithm, the original CI-ELM algorithm, the proposed NRI-ELM algorithm, and the proposed NRCI-ELM algorithm. Figure 2 shows the MSE performance versus the number of hidden nodes, where the noise level is equal to \(\sigma _\epsilon ^2=\sigma _\delta ^2=0.09\). It can be seen that the proposed NRI-ELM algorithm is better than the two original incremental algorithms. The MSE performance of the original CI-ELM algorithm is very poor; as more hidden nodes are used, its performance deteriorates sharply. To sum up, the proposed NRCI-ELM algorithm is much better than the original I-ELM algorithm, the original CI-ELM algorithm, and the proposed NRI-ELM algorithm.

Table 1 shows the average test set MSE values of noisy networks over 100 trials for various node noise levels. In Table 1, the number of hidden nodes is equal to 500. It can be seen that the performance of the CI-ELM algorithm is very poor. The noise-resistant ability of the NRI-ELM algorithm is better than that of the I-ELM algorithm. In addition, the NRCI-ELM algorithm is much better than the other three algorithms. For instance, in the Abalone dataset with noise level \(\sigma ^2_\epsilon =\sigma ^2_\delta =0.01\), the test set MSE of I-ELM is equal to 0.01421. When the NRI-ELM is used, the test set error is reduced to 0.01367. The NRCI-ELM algorithm can further reduce the test set error to 0.00815.

For high noise levels, the improvement of the NRCI-ELM algorithm is more significant. For instance, with the node noise level equal to \(\sigma ^2_\epsilon =\sigma ^2_\delta =0.09\), the test set MSE of the I-ELM algorithm is equal to 0.05855. With the NRI-ELM algorithm, the test set MSE is reduced to 0.03386. When the NRCI-ELM algorithm is used, the test set MSE is reduced to 0.01002.

Another interesting property of the NRCI-ELM algorithm is that its test set error is insensitive to the node noise level. In the Abalone dataset, when the noise level is \(\sigma ^2_\delta =\sigma ^2_\epsilon =0.01\), the test set error of the NRCI-ELM algorithm is equal to 0.00815. When the noise level is greatly increased to \(\sigma ^2_\delta =\sigma ^2_\epsilon =0.25\), the test set error of the NRCI-ELM algorithm only slightly increases to 0.01174.

One may suggest using the NRCI-ELM algorithm only, because its test set error of noisy networks is the best. However, the NRCI-ELM algorithm and the NRI-ELM algorithm differ in computational complexity. For the NRI-ELM algorithm, the complexity is O(N) per iteration, whereas for the NRCI-ELM algorithm it is “\(O(N)+O(n)\)”.

Table 1. Average test set MSEs of noisy networks. The average values are taken over 100 trials. There are 500 hidden nodes.

5 Conclusion

This paper proposed two incremental ELM algorithms, namely NRI-ELM and NRCI-ELM, for handling node noise. They insert randomly generated hidden nodes into the network in a one-by-one manner. The NRI-ELM algorithm adjusts only the output weight of the newly inserted hidden node. Its noise-resistant ability is better than that of the original I-ELM algorithm and the original CI-ELM algorithm. The NRCI-ELM algorithm estimates the output weight of the newly added node and uses a simple rule to modify the previously trained output weights. In addition, we proved that for the two proposed algorithms, the training set MSE of noisy networks converges. Simulation examples illustrate that the noise-resistant ability of the NRI-ELM algorithm and the NRCI-ELM algorithm is better than that of I-ELM and CI-ELM. In addition, the NRCI-ELM algorithm has the best noise-resistant ability among the four incremental algorithms. For the NRI-ELM algorithm, the complexity is O(N) per iteration. For the NRCI-ELM algorithm, the computational complexity is “\(O(N)+O(n)\)”.