1 Introduction

Extreme learning machine (ELM) algorithms (Huang et al. 2006; Huang and Chen 2007) provide a low-computation solution for constructing a single-layer feed-forward neural network (SLFNN). In the ELM concept, the connection weights between the input and hidden layers are generated randomly. Hence only the output connection weights between the hidden and output layers need to be trained. Although the ELM concept relies on random hidden nodes, a SLFNN trained in this way still has the universal approximation ability (Barron 1993; Hornik et al. 1989; Hornik 1991). Over the last several years, many applications of ELM have been reported. For example, a modified ELM model for imbalanced data was proposed (Li et al. 2018), and the ELM concept has been used to handle biological data (Bi et al. 2018; Wang et al. 2017).

There are two kinds of ELM algorithms. One is the batch mode, in which we first generate a number of hidden nodes and then estimate all the output weights at once. The other is the incremental mode, in which we add hidden nodes one-by-one into the network until a predefined stopping condition is reached. The incremental ELM (IELM) and the convex IELM (CIELM) (Huang et al. 2006; Huang and Chen 2007) are two representative incremental ELM algorithms with simple update rules. Although many ELM algorithms have been developed, few of them have the ability to tolerate network faults and noise.

In hardware implementations of neural networks, noise or faults are prone to occur. For instance, when a neural network is implemented with field-programmable gate array (FPGA) technology, we may use a low-precision floating-point format to represent the connection weights. The roundoff error of the floating-point format can be modelled as multiplicative noise (Liu and Kaneko 1969). In an analog implementation, thermal noise and drifts always exist in operational amplifiers. Besides, the precision of analog devices, such as resistors, is specified in terms of percentage error. In addition, when a trained network is implemented with nano-scale devices, transient noise or failures may be introduced during operation (Pajarinen et al. 2011; Mahdiani et al. 2012). Traditional learning algorithms have poor fault and noise tolerance. Since biological neural networks have a certain ability to tolerate noise, we would like trained artificial neural networks to have a certain noise tolerance too.

To handle noise and faults, it is essential to understand how they affect the behaviour and performance of a trained network. Noise or fault tolerant learning algorithms aim at training a network to attain acceptable performance even under noise and fault conditions. A survey of various kinds of imperfect conditions in traditional network models, such as radial basis function (RBF) networks, was reported in Martolia et al. (2015). Besides, the failure tolerance of RBF networks has been extensively studied (Leung et al. 2010; Feng et al. 2017; Murakami and Honda 2007). However, few results on the failure tolerance of ELM networks have been reported.

This paper investigates the noise tolerant performance of the SLFNN model trained by the incremental ELM concept. We consider that multiplicative weight noise exists both in the weights between the input and hidden layers and in the weights between the hidden and output layers. Firstly, a noise tolerant training objective function for SLFNNs is formulated. Afterwards, two incremental ELM algorithms, namely the weight deviation incremental extreme learning machine (WDT-IELM) and the weight deviation convex incremental extreme learning machine (WDTC-IELM), are derived.

In the WDT-IELM algorithm, hidden nodes are added into the existing network incrementally in a one-by-one manner. After adding a new hidden node, all the previously trained output weights are left unmodified.

In the WDTC-IELM algorithm, we use a strategy similar to WDT-IELM to create a SLFNN, but a simple updating rule is used to modify the previously trained output weights.

We show that for the two proposed ELM algorithms, the training objective values are non-increasing at each training iteration. We use simulations on several commonly used datasets to validate the superiority of the two proposed algorithms. Compared to the original incremental ELM algorithms, the two proposed algorithms have a much better ability to tolerate multiplicative weight noise. In addition, we perform paired t-tests to show that the improvement of the proposed algorithms is statistically significant.

The rest of the paper is organized as follows. The background on ELM is given in Sect. 2. In Sect. 3, the weight noise model is presented and the noise tolerant objective function is derived. Sect. 4 presents the two proposed algorithms. In addition, in this section, we show that the objective values are non-increasing during training. The simulation results are presented in Sect. 5. The paper concludes in Sect. 6.

2 Mathematical background on extreme learning machine

The standard ELM was developed to train SLFNNs (Huang et al. 2006; Huang and Chen 2007). In a SLFNN, there are three layers, namely the input, hidden and output layers. In the ELM concept, the input weights between the input layer and the hidden layer are randomly generated. They do not need to be learned or tuned. Only the output weights between the hidden layer and the output layer are required to be trained. Hence, the computational cost of learning is not prohibitive and is much lower than that of traditional learning algorithms such as gradient descent (Guély and Siarry 1993).

This paper considers using the ELM concept to solve the nonlinear regression problem. Let \({\mathbb {D}}_{train}=\{({\varvec{x}}_k,t_k): \, {\varvec{x}}_k \in {\mathbb {R}}^D, \, t_k \in {\mathbb {R}}, \; k=1,\ldots ,N \}\) be the training set, where D is the number of input features, N is the number of training samples, and \({\varvec{x}}_k\) and \(t_k\) are the input vector and target output of the kth sample, respectively. Similarly, the test set is denoted as \({\mathbb {D}}_{test}=\{({\varvec{x}}'_{k'},t'_{k'}):\, {\varvec{x}}'_{k'} \in {\mathbb {R}}^{D}, \, t'_{k'} \in {\mathbb {R}}, \; k'=1,\ldots ,N' \}\), where \(N'\) is the number of samples in the test set.

The output of a SLFNN with m hidden nodes is equal to

$$\begin{aligned} f_m({\varvec{x}})=\sum _{j=1}^m \beta _j h_j({\varvec{x}}), \end{aligned}$$
(1)

where \(\beta _j\) is the jth output weight, and \(h_j({\varvec{x}})\) is the output of the jth hidden node. Several activation functions are possible. For instance, in the case of the sigmoid activation function, \(h_j(\cdot )\) is given by

$$\begin{aligned} h_{j}({\varvec{x}})=\frac{1}{1+\exp {\left( -({\varvec{a}}_j^\text {T}{\varvec{x}} + b_j)\right) }}, \end{aligned}$$
(2)

where \({\varvec{a}}_j\) and \(b_j\) are the input weights and input bias, respectively, of the jth hidden node. Grouping \({\varvec{a}}_j\) and \(b_j\) together, (2) can be rewritten as

$$\begin{aligned} h_{j}({\varvec{x}})=\frac{1}{1+\exp {(-{\varvec{w}}_j^\text {T}{\varvec{o}})}}\, , \end{aligned}$$
(3)

where \({\varvec{w}}_j=[{\varvec{a}}_j^\text {T},b_j]^\text {T}\), and \({\varvec{o}}=[{\varvec{x}}^\text {T},1]^\text {T}\). In the ELM concept, the input weight vectors \({\varvec{w}}_j\)’s are generated randomly.

Given all training samples, the training set error is equal to

$$\begin{aligned} \varepsilon _m= & {} \sum _{k=1}^N (t_k-f_m({\varvec{x}}_k))^2 \nonumber \\= & {} \sum _{k=1}^N (t_k- \sum _{j=1}^m \beta _j h_j({\varvec{x}}_k) )^2 . \end{aligned}$$
(4)

Let \(\varvec{t}\) be the collection of the target outputs, given by

$$\begin{aligned} \varvec{t}=[t_1,\ldots ,t_N]^\text {T}, \end{aligned}$$

and let \(\varvec{h}_j\) be the collection of the jth hidden node outputs over all input samples, given by

$$\begin{aligned} \varvec{h}_j=[h_j(\varvec{x}_1), \ldots , h_j(\varvec{x}_N)]^\text {T}. \end{aligned}$$

Equation (4) can be written in a compact form, given by

$$\begin{aligned} \varepsilon _m = \Vert \varvec{t}- \sum _{j=1}^m \beta _j \varvec{h}_j \Vert _2^2 = \Vert \varvec{t}- \varvec{H}_m \varvec{\beta }_m \Vert _2^2, \end{aligned}$$
(5)

where \(\varvec{\beta }_m=[\beta _1,\ldots ,\beta _m]^\text {T}\), and

$$\begin{aligned} {\varvec{H}}_m= [\varvec{h}_1,\ldots ,\varvec{h}_m]=\left( \begin{array}{ccc} h_1({\varvec{x}}_1) &{} \cdots &{} h_m({\varvec{x}}_1) \\ \vdots &{} \ddots &{} \vdots \\ h_1({\varvec{x}}_N) &{} \cdots &{} h_m({\varvec{x}}_N) \\ \end{array} \right) . \end{aligned}$$
(6)

As aforementioned, the input weight vectors \(\varvec{w}_j\)’s are randomly generated and do not need to be adjusted during training. The ELM concept uses the least squares approach to obtain the output weights. In the batch-mode ELM, given m hidden nodes, the optimal output weight vector that minimizes the training set error is given by

$$\begin{aligned} {\varvec{\beta }}_m^*= \left( \varvec{H}_m^\text {T} \varvec{H}_m \right) ^{-1}\varvec{H}_m^\text {T} \varvec{t}. \end{aligned}$$
(7)
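To make the batch-mode computation concrete, the following is a minimal NumPy sketch of (7): the input weights are drawn at random, the hidden-layer output matrix \(\varvec{H}_m\) is formed, and the output weights are obtained by least squares. The function names, the uniform sampling range for the random weights, and the use of a least-squares solver in place of the explicit normal-equation inverse are illustrative assumptions rather than part of the original specification.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_elm(X, t, m, rng=None):
    """Batch-mode ELM for regression (a sketch of Sect. 2 / Eq. (7)).

    X : (N, D) input samples, t : (N,) target outputs, m : number of hidden nodes.
    Returns the random input weights W (m, D+1) and the trained output weights beta (m,).
    """
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    O = np.hstack([X, np.ones((N, 1))])           # augmented inputs o_k = [x_k; 1]
    W = rng.uniform(-1.0, 1.0, size=(m, D + 1))   # random input weight vectors w_j
    H = sigmoid(O @ W.T)                          # hidden-layer output matrix H_m, shape (N, m)
    # beta* = (H^T H)^{-1} H^T t; lstsq solves the same least-squares problem more stably
    beta, *_ = np.linalg.lstsq(H, t, rcond=None)
    return W, beta
```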

Instead of using the batch mode to find the weights, the ELM concept also has an incremental mode, in which we incrementally add hidden nodes into the existing network until a stopping condition is reached. When we insert a new hidden node, we need to determine its output weight only. The IELM and CIELM are two incremental ELM algorithms. In the IELM algorithm, after inserting a new hidden node, all the previously trained output weights are left unchanged. The CIELM algorithm differs in that it uses a simple updating rule to modify all the existing output weights. Although the ELM concept simplifies the construction of a SLFNN, few ELM algorithms have the ability to tolerate noise. In the rest of the paper, we first define a noise tolerant objective function, and then develop the noise tolerant versions of IELM and CIELM.

3 Weight noise model and objective function

In this section, we consider that a SLFNN is affected by multiplicative weight noise in the input weight vectors \(\varvec{w}_j\)’s and the output weights \(\beta _j\)’s. We will first describe the noise model and then develop a noise tolerant objective function.

3.1 Weight noise model

When we implement a trained network, weight noise is prone to occur. Weight noise can be regarded as the deviation of a weight from its nominal value in a well trained neural network. For instance, after training a neural network, we may implement the trained network on hardware such as an FPGA. To do this, we may use a low-precision floating-point format to represent the connection weights. The roundoff error of the floating-point format can cause an implemented weight to deviate from its nominal value. Also, in analog implementations, noise is unavoidable. One commonly used noise model is multiplicative weight noise (Burr 1991; Liu and Kaneko 1969). In this model, the difference between the implemented weight value and its nominal value is proportional to the nominal value.

Let \({w}_{jl}\) be the original value of the lth element in the jth input weight vector. Under the multiplicative noise model, the deviation of an input weight \(w_{jl}\) from its nominal value is given by

$$\begin{aligned} \delta _{jl}=\upsilon _{jl} w_{jl}, \quad \forall \; j,l, \end{aligned}$$
(8)

where \(\upsilon _{jl}\)’s are independent and identically distributed (iid) random variables (RVs). Their mean is equal to zero and variance is equal to \(\sigma ^2_w\). In other words, the implemented value of an input weight is given by

$$\begin{aligned} {\tilde{w}}_{jl} = w_{jl}+\delta _{jl} =(1+\upsilon _{jl} )w_{jl} . \end{aligned}$$
(9)

In (9), the magnitude of the noise component \(\delta _{jl}\) is proportional to that of the nominal value \(w_{jl}\).

Given the deviations \(\delta _{jl}\), \(l=1,\ldots ,D+1\), of the input weights of the jth hidden node, the noisy hidden node output is

$$\begin{aligned} {\tilde{h}}_j(\varvec{x})= \frac{1}{1+\exp {(-{\tilde{\varvec{w}}}_j^\text {T}{\varvec{o}})}}. \end{aligned}$$
(10)

Note that \(\varvec{o}=[\varvec{x}^\text {T},1]^\text {T}\) and \({\tilde{\varvec{w}}}_j=[{\tilde{w}}_{j1},\ldots ,{\tilde{w}}_{j(D+1)}]^\text {T}\), where D is the number of input features of the neural network. We can use a first-order Taylor series to expand (10), given by

$$\begin{aligned} {\tilde{h}}_j(\varvec{x})= & {} {h}_j(\varvec{x}) + \sum _{l=1}^{D+1} \delta _{jl} \frac{\partial {h}_j(\varvec{x})}{\partial {w}_{jl}} \nonumber \\= & {} {h}_j(\varvec{x}) + \sum _{l=1}^{D+1} \upsilon _{jl} w_{jl} \frac{\partial {h}_j(\varvec{x})}{\partial {w}_{jl}} \, \nonumber \\= & {} {h}_j(\varvec{x}) + \sum _{l=1}^{D+1} \upsilon _{jl} w_{jl} \varDelta H_{jl} (\varvec{x}), \end{aligned}$$
(11)

where

$$\begin{aligned} \varDelta H_{jl} (\varvec{x}) = \frac{\partial {h}_j(\varvec{x})}{\partial {w}_{jl}}. \end{aligned}$$
(12)

When the sigmoid function is used as the activation function,

$$\begin{aligned} \varDelta H_{jl} (\varvec{x}) = o_l {h}_j(\varvec{x}) (1-{h}_j(\varvec{x})). \end{aligned}$$
(13)

Similarly, under the multiplicative noise, the value of an output weight becomes

$$\begin{aligned} {\tilde{\beta }}_{j} = (1+\zeta _{j} )\beta _{j} \end{aligned}$$
(14)

where \(\zeta _{j}\)’s are iid RVs. Their mean is equal to zero and variance is equal to \(\sigma ^2_\beta\).

Hence for a given input pattern \(\varvec{x}\), the weighted output of a noisy hidden node is given by

$$\begin{aligned} {\tilde{\beta }}_j {\tilde{h}}_j(\varvec{x}) \!=\! (1\!+\! \zeta _{j} )\beta _{j} \left( {h}_j(\varvec{x}) \!+\! \sum _{l=1}^{D+1} \! \upsilon _{jl} w_{jl} \! \varDelta H_{jl} (\varvec{x})\right) . \end{aligned}$$
(15)

Since \(\zeta _{j}\)’s and \(\upsilon _{jl}\)’s are iid RVs and have zero mean, the expected values of the weighted outputs are given by

$$\begin{aligned} \left\langle {\tilde{\beta }}_j {\tilde{h}}_j(\varvec{x}) \right\rangle = (1+0)\beta _j (h_j(\varvec{x}) + 0)=\beta _j h_j(\varvec{x}) \, , \end{aligned}$$
(16)

where \(\left\langle \cdot \right\rangle\) is the expectation operator. Also, the variances of \(\zeta _{j}\)’s and \(\upsilon _{jl}\)’s are equal to \(\sigma _\beta ^2\) and \(\sigma _w^2\), respectively. Hence the expected squares of the weighted outputs are given by

$$\begin{aligned} \left\langle \!\!{\tilde{\beta }}_{j}^2 {\tilde{h}}_{j}^2(\varvec{x})\!\! \right\rangle \!=\! \left( 1 \!+\! \sigma ^2_{\beta } \right) \beta _j^2 \left( h^2_j(\varvec{x}) \!+\! \sigma ^2_{w}\sum _{l=1}^{D+1} \!\!\!w^2_{jl} \!\varDelta H^2_{jl} (\varvec{x}) \right) . \end{aligned}$$
(17)

Furthermore, for \(j\ne j'\), we have

$$\begin{aligned} \left\langle {\tilde{\beta }}_{j} {\tilde{h}}_j (\varvec{x}) {\tilde{\beta }}_{j'} {\tilde{h}}_{j'} (\varvec{x}) \right\rangle = \beta _j h_j(\varvec{x})\beta _{j'} h_{j'}(\varvec{x}). \end{aligned}$$
(18)
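As an illustration of the noise model, the sketch below draws one noisy realization of a hidden node's weighted output according to (9), (14) and (15), and computes the sensitivity term \(\varDelta H_{jl}(\varvec{x})\) of (13) for a sigmoid node. The choice of Gaussian noise is an assumption made for the simulation; the analysis above only requires zero-mean iid noise with variances \(\sigma ^2_w\) and \(\sigma ^2_\beta\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_H(w_j, x):
    """Sensitivity Delta H_{jl}(x) = o_l h_j(x)(1 - h_j(x)) of Eq. (13), returned for all l."""
    o = np.append(x, 1.0)          # o = [x; 1]
    h = sigmoid(w_j @ o)
    return o * h * (1.0 - h)

def noisy_weighted_output(w_j, beta_j, x, sigma_w2, sigma_b2, rng):
    """One random realization of beta~_j h~_j(x) in Eq. (15) (Gaussian noise assumed)."""
    o = np.append(x, 1.0)
    upsilon = rng.normal(0.0, np.sqrt(sigma_w2), size=w_j.shape)   # noise on the input weights
    zeta = rng.normal(0.0, np.sqrt(sigma_b2))                      # noise on the output weight
    w_noisy = (1.0 + upsilon) * w_j                                # Eq. (9)
    beta_noisy = (1.0 + zeta) * beta_j                             # Eq. (14)
    return beta_noisy * sigmoid(w_noisy @ o)
```

Averaging many calls of `noisy_weighted_output` approximates the expectations in (16) and (17) and can be used to check the first-order approximation numerically.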

3.2 Noise tolerant objective function

Traditional ELM algorithms for regression minimize the squared error between the network output and the target output. In the noisy situation, we propose to minimize the expected error over all noise patterns. For a particular noise pattern, the training set error of a noisy network is given by

$$\begin{aligned} {\tilde{\varepsilon }}_m = \sum _{k=1}^N (t_k- \sum _{j=1}^m {\tilde{\beta }}_j {\tilde{h}}_j({\varvec{x}}_k) )^2 . \end{aligned}$$
(19)

From (16)–(18), the expected error over all noise patterns is given by

$$\begin{aligned} J_m= & {} \left\langle {\tilde{\varepsilon }}_m \right\rangle \nonumber \\= & {} \left\langle \sum _{k=1}^N (t_k- \sum _{j=1}^m {\tilde{\beta }}_j {\tilde{h}}_j({\varvec{x}}_k) )^2 \right\rangle \nonumber \\= & {} \sum _{k=1}^N (t_k- \sum _{j=1}^m {\beta }_j {h}_j({\varvec{x}}_k) )^2 \nonumber \\& +\, \sigma ^2_\beta \sum _{k=1}^N \sum _{j=1}^m \beta _j^2 h_j^2(\varvec{x}_k) \nonumber \\ & + \,(1+\sigma ^2_\beta ) \sum _{k=1}^N \sum _{j=1}^m \beta _j^2 \sum _{l=1}^{D+1} \sigma ^2_w w^2_{jl} \varDelta H^2_{jl} (\varvec{x}_k) \end{aligned}$$
(20)
$$\begin{aligned}= & {} \Vert \varvec{e}_m\Vert ^2_2 + \sigma ^2_\beta \kappa _m + (1+\sigma ^2_\beta ) \sigma ^2_w \tau _m , \end{aligned}$$
(21)

where

$$\begin{aligned} \varvec{e}_m= & {} \varvec{t}- \varvec{f}_m \end{aligned}$$
(22)
$$\begin{aligned} \varvec{f}_m= & {} \sum _{j=1}^m \beta _j \varvec{h}_j \end{aligned}$$
(23)
$$\begin{aligned} \varvec{h}_j= & {} [h_j(\varvec{x}_1),\ldots ,h_j(\varvec{x}_N)]^\text {T} \end{aligned}$$
(24)
$$\begin{aligned} \kappa _m= & {} \sum _{j=1}^m \beta _j^2 \Vert \varvec{h}_j\Vert ^2_2 \end{aligned}$$
(25)
$$\begin{aligned} \tau _m= & {} \sum _{k=1}^N \sum _{j=1}^m \beta _j^2 \sum _{l=1}^{D+1} w^2_{jl} \varDelta H^2_{jl} (\varvec{x}_k). \end{aligned}$$
(26)

In (21), the expected error of a SLFNN contains three terms. The first term \(\Vert \varvec{e}_m\Vert ^2_2\) is the error of the noiseless SLFNN. The second term is the degradation caused by the noise in the output weights \(\beta _j\)’s. The third term is the degradation caused by the noise in the input weights \(w_{jl}\)’s. In the ELM concept, the input weights \(w_{jl}\) are randomly generated and their values are then fixed. Only the output weights are required to be trained.
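The expected error (21) can be evaluated directly for a fixed set of output weights. The sketch below assumes that the noiseless hidden outputs and the per-node sensitivity sums \(\sum _{k}\sum _{l} w^2_{jl}\varDelta H^2_{jl}(\varvec{x}_k)\) have been precomputed; the function and variable names are illustrative.

```python
import numpy as np

def noise_tolerant_objective(H, nu, beta, t, sigma_b2, sigma_w2):
    """Evaluate J_m of Eq. (21) for given output weights.

    H    : (N, m) noiseless hidden outputs h_j(x_k)
    nu   : (m,) per-node sums  sum_k sum_l w_{jl}^2 DeltaH_{jl}^2(x_k)
    beta : (m,) output weights, t : (N,) targets
    """
    e = t - H @ beta                                  # noiseless error vector e_m, Eq. (22)
    kappa = np.sum(beta**2 * np.sum(H**2, axis=0))    # kappa_m, Eq. (25)
    tau = np.sum(beta**2 * nu)                        # tau_m, Eq. (26)
    return e @ e + sigma_b2 * kappa + (1.0 + sigma_b2) * sigma_w2 * tau
```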

4 Noise tolerant incremental algorithms

This section presents the two proposed incremental ELM algorithms, namely WDT-IELM and WDTC-IELM, for training SLFNNs. The WDT-IELM algorithm adds hidden nodes into the network one-by-one without modifying the existing output weights. The WDTC-IELM algorithm uses the same strategy to add hidden nodes, but a simple updating rule is applied to modify the existing output weights.

4.1 WDT-IELM

The WDT-IELM algorithm is the noise tolerant version of the original IELM. It incrementally adds hidden nodes one-by-one into the network. Suppose that a SLFNN already has \(m-1\) hidden nodes. The incremental strategy is that we determine the value of the output weight \(\beta _m\) of the newly inserted node and do not modify the existing output weights \(\{\beta _1,\ldots ,\beta _{m-1}\}\). Under this strategy, we have the following recursive relationships for \(\varvec{f}_m\), \(\kappa _m\), and \(\tau _m\), given by

$$\begin{aligned} \varvec{f}_ m= & {} \varvec{f}_{m-1} + \beta _m \varvec{h}_m \end{aligned}$$
(27)
$$\begin{aligned} \kappa _m= & {} \kappa _{m-1} + \beta _m^2 \Vert \varvec{h}_m\Vert ^2_2 \end{aligned}$$
(28)
$$\begin{aligned} \tau _m= & {} \tau _{m-1} + \beta _m^2 \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml} (\varvec{x}_k). \end{aligned}$$
(29)

With (27), the recursive equation for the error vector is given by

$$\begin{aligned} \varvec{e}_m = \varvec{e}_{m-1} - \beta _m \varvec{h}_m . \end{aligned}$$
(30)

Based on (21), and (27)–(30), \(J_m\) can be expressed as

$$\begin{aligned} J_m= & {} J_{m-1} - 2 \beta _m \varvec{e}_{m-1}^\text {T} \varvec{h}_m +(1+\sigma _\beta ^2) \beta ^2_m\Vert \varvec{h}_m\Vert ^2_2 \nonumber \\&+ \, (1+\sigma ^2_\beta ) \sigma ^2_w \beta _m^2 \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml} (\varvec{x}_k) . \end{aligned}$$
(31)

Let

$$\begin{aligned} R_m= & {} J_m-J_{m-1} \nonumber \\= & {} - 2 \beta _m \varvec{e}_{m-1}^\text {T} \varvec{h}_m +(1+\sigma _\beta ^2) \beta ^2_m\Vert \varvec{h}_m\Vert ^2_2 \nonumber \\&+ \, (1+\sigma ^2_\beta ) \sigma ^2_w \beta _m^2 \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml}(\varvec{x}_k). \end{aligned}$$
(32)

To maximize the reduction in the expected error over all noise patterns, we set

$$\begin{aligned} \frac{\partial R_m}{\partial \beta _m}=0 . \end{aligned}$$
(33)

Then the optimal value of \(\beta _m\) is given by

$$\begin{aligned} \beta ^*_m = \frac{\varvec{e}_{m-1}^\text {T} \varvec{h}_m }{(1+\sigma _\beta ^2) (\Vert \varvec{h}_m\Vert ^2_2+\sigma ^2_w \nu _m )}, \end{aligned}$$
(34)

where

$$\begin{aligned} \nu _m= \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml}(\varvec{x}_k) . \end{aligned}$$
(35)

With this optimal value, \(R_m\) is given by

$$\begin{aligned} R_m = -\frac{(\varvec{e}_{m-1}^\text {T} \varvec{h}_m)^2 }{(1+\sigma _\beta ^2) (\Vert \varvec{h}_m\Vert ^2_2+\sigma ^2_w \nu _m )}. \end{aligned}$$
(36)

Since \(R_m \le 0\), the expected training error of the noisy network never increases as hidden nodes are added. The summary of the WDT-IELM algorithm is given in Algorithm 1. It should be noted that during training we do not need to keep \(\tau _m\) and \(\kappa _m\).

Algorithm 1 The WDT-IELM algorithm
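A minimal sketch of one WDT-IELM insertion step is given below. It assumes that the outputs \(\varvec{h}_m\) of the candidate node and its sensitivity sum \(\nu _m\) of (35) are already available; the loop that draws the random input weights and the stopping condition are omitted, and the names are illustrative.

```python
import numpy as np

def wdt_ielm_step(e_prev, h_m, nu_m, sigma_b2, sigma_w2):
    """One WDT-IELM insertion: optimal beta_m of Eq. (34) and the residual update of Eq. (30).

    e_prev : (N,) residual e_{m-1};  h_m : (N,) outputs of the new hidden node;
    nu_m   : scalar sum_k sum_l w_{ml}^2 DeltaH_{ml}^2(x_k), Eq. (35).
    """
    denom = (1.0 + sigma_b2) * (h_m @ h_m + sigma_w2 * nu_m)
    beta_m = (e_prev @ h_m) / denom        # Eq. (34)
    e_new = e_prev - beta_m * h_m          # Eq. (30)
    return beta_m, e_new
```

Starting from \(\varvec{e}_0=\varvec{t}\) and repeatedly drawing a random hidden node before calling this step would reproduce the incremental loop summarized in Algorithm 1.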

4.2 WDTC-IELM

In Huang and Chen (2007), the CIELM algorithm was presented. It aims at improving the approximation ability of IELM. The performance demonstration in Huang and Chen (2007) showed that the CIELM algorithm outperforms the original IELM algorithm. However, it was designed for the noiseless situation. Hence, it is equally important to develop a noise tolerant version of CIELM.

In the proposed WDTC-IELM algorithm, we also insert hidden nodes into the network in a one-by-one manner. After determining the output weight \(\beta _m\) of the newly inserted node, we update all the previously trained weights as

$$\begin{aligned} \beta ^{new}_j=(1-\beta _m)\beta ^{old}_j, \quad \quad j=1,2,\ldots , m-1. \end{aligned}$$
(37)

With (37), the recursive equations for \(\varvec{f}_m\), \(\varvec{e}_m\), \(\kappa _m\), and \(\tau _m\) become

$$\begin{aligned} \varvec{f}_m= & {} (1-\beta _m)\varvec{f}_{m-1} + \beta _m \varvec{h}_m \end{aligned}$$
(38)
$$\begin{aligned} \varvec{e}_m= & {} \varvec{t}-\varvec{f}_m=\varvec{e}_{m-1} + \beta _m (\varvec{f}_{m-1} - \varvec{h}_m) \end{aligned}$$
(39)
$$\begin{aligned} \kappa _m= & {} (1-\beta _m)^2 \kappa _{m-1} + \beta _m^2 \Vert \varvec{h}_m\Vert ^2_2 \end{aligned}$$
(40)
$$\begin{aligned} \tau _m= & {} (1-\beta _m)^2 \tau _{m-1} + \beta _m^2 \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml} (\varvec{x}_k). \end{aligned}$$
(41)

From (38)–(41), the training objective \(J_m\) can be expressed as

$$\begin{aligned} J_m= & {} \Vert \varvec{e}_{m-1}\Vert ^2_2 \!+ \!2 \beta _m \varvec{e}^\text {T}_{m-1} (\varvec{f}_{m-1}\!-\!\varvec{h}_m) \nonumber \\&+ \,\beta _m^2 \Vert \varvec{f}_{m-1}\!-\!\varvec{h}_m\Vert ^2_2 \nonumber \\&\!+ \sigma ^2_\beta \left( (1\!-\!\beta _m)^2 \kappa _{m-1} + \beta _m^2 \Vert \varvec{h}_m\Vert ^2_2 \right) \nonumber \\&+\,(1+\sigma ^2_\beta ) \sigma _w^2 \Big ( (1-\beta _m)^2 \tau _{m-1} \nonumber \\&+\, \beta ^2_m \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml}(\varvec{x}_k)\Big ) \end{aligned}$$
(42)
$$\begin{aligned}= & {} J_{m-1} \nonumber \\&+ \,\beta ^2_m \Big [ \Vert \varvec{f}_{m-1}-\varvec{h}_m\Vert ^2_2 + \sigma ^2_\beta (\kappa _{m-1} +\Vert \varvec{h}_m \Vert ^2_2) \nonumber \\&+\,(1+\sigma ^2_\beta )\sigma ^2_w \Big (\tau _{m-1}+ \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml}(\varvec{x}_k)\Big )\Big ] \nonumber \\&+\,2 \beta _m \big [ \varvec{e}_{m-1}^\text {T} (\varvec{f}_{m-1}-\varvec{h}_m) -\sigma ^2_\beta \kappa _{m-1} \big . \nonumber \\&-\, \big . (1+\sigma ^2_\beta ) \sigma ^2_w \tau _{m-1} \big ]. \end{aligned}$$
(43)

Then we can obtain the difference between two consecutive iterations, given by

$$\begin{aligned} R_m= & {} J_m-J_{m-1} \end{aligned}$$
(44)
$$\begin{aligned}= & {} \beta ^2_m \Big [ \Vert \varvec{f}_{m-1}-\varvec{h}_m\Vert ^2_2 + \sigma ^2_\beta (\kappa _{m-1} +\Vert \varvec{h}_m \Vert ^2_2) \nonumber \\&+\, (1+\sigma ^2_\beta )\sigma ^2_w \Big (\tau _{m-1}+ \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml}(\varvec{x}_k)\Big )\Big ] \nonumber \\&+\, 2 \beta _m \big [ \varvec{e}_{m-1}^\text {T} (\varvec{f}_{m-1}-\varvec{h}_m) -\sigma ^2_\beta \kappa _{m-1} \big . \nonumber \\&\big . -(1+\sigma ^2_\beta ) \sigma ^2_w \tau _{m-1} \big ]. \end{aligned}$$
(45)

Similar to the WDT-IELM algorithm, setting \(\partial R_m/\partial \beta _m=0\) in (45) yields the optimal value of \(\beta _m\) for the WDTC-IELM algorithm, given by

$$\begin{aligned} \beta ^*_m = -\frac{\varvec{e}_{m-1}^\text {T} (\varvec{f}_{m-1}-\varvec{h}_m) -\sigma ^2_\beta \kappa _{m-1} -(1+\sigma ^2_\beta ) \sigma ^2_w \tau _{m-1}}{\varOmega }, \end{aligned}$$
(46)

where

$$\begin{aligned} \varOmega= & {} \Vert \varvec{f}_{m-1}-\varvec{h}_m\Vert ^2_2 + \sigma ^2_\beta (\kappa _{m-1} +\Vert \varvec{h}_m \Vert ^2_2) \nonumber \\&+\,(1+\sigma ^2_\beta )\sigma ^2_w \big (\tau _{m-1}+ \sum _{k=1}^N \sum _{l=1}^{D+1} w^2_{ml} \varDelta H^2_{ml}(\varvec{x}_k) \big ). \end{aligned}$$
(47)

With this optimal value, \(R_m\) is given by

$$\begin{aligned} R_m = -\frac{\varPi ^2}{\varOmega }, \end{aligned}$$
(48)

where

$$\begin{aligned} \varPi = \varvec{e}_{m-1}^\text {T} (\varvec{f}_{m-1}-\varvec{h}_m) -\sigma ^2_\beta \kappa _{m-1} -(1+\sigma ^2_\beta ) \sigma ^2_w \tau _{m-1}. \end{aligned}$$
(49)

In (48), the denominator \(\varOmega\) is positive. Hence, the expected training error of the noisy network never increases. The summary of the WDTC-IELM algorithm is given in Algorithm 2. It can be seen that in the WDTC-IELM algorithm, we need to keep two additional variables, \(\tau _m\) and \(\kappa _m\).

Algorithm 2 The WDTC-IELM algorithm
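For comparison, a sketch of one WDTC-IELM insertion step is given below. It follows the recursions (37)–(41); \(\beta _m\) is computed as the stationary point of \(R_m\) in (45), i.e. \(\beta ^*_m=-\varPi /\varOmega\) with \(\varOmega\) and \(\varPi\) as in (47) and (49). Variable names are illustrative.

```python
import numpy as np

def wdtc_ielm_step(beta, f_prev, e_prev, kappa, tau, h_m, nu_m, t, sigma_b2, sigma_w2):
    """One WDTC-IELM insertion following Eqs. (37)-(41) and the minimizer of (45).

    beta   : list of current output weights [beta_1, ..., beta_{m-1}]
    f_prev : (N,) current network outputs f_{m-1};  e_prev : (N,) residual e_{m-1}
    kappa, tau : running values kappa_{m-1}, tau_{m-1}
    h_m    : (N,) outputs of the new hidden node
    nu_m   : scalar sum_k sum_l w_{ml}^2 DeltaH_{ml}^2(x_k)
    """
    d = f_prev - h_m
    Omega = (d @ d + sigma_b2 * (kappa + h_m @ h_m)
             + (1.0 + sigma_b2) * sigma_w2 * (tau + nu_m))                     # Eq. (47)
    Pi = e_prev @ d - sigma_b2 * kappa - (1.0 + sigma_b2) * sigma_w2 * tau     # Eq. (49)
    beta_m = -Pi / Omega                     # stationary point of R_m in Eq. (45)
    beta = [(1.0 - beta_m) * b for b in beta] + [beta_m]                       # Eq. (37)
    f_new = (1.0 - beta_m) * f_prev + beta_m * h_m                             # Eq. (38)
    e_new = t - f_new                                                          # Eq. (39)
    kappa_new = (1.0 - beta_m) ** 2 * kappa + beta_m ** 2 * (h_m @ h_m)        # Eq. (40)
    tau_new = (1.0 - beta_m) ** 2 * tau + beta_m ** 2 * nu_m                   # Eq. (41)
    return beta, f_new, e_new, kappa_new, tau_new
```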
Table 1 Details of the seven UCI datasets
Table 2 The partitioning of the datasets for the tenfold evaluation
Fig. 1 Test set MSE versus the number of hidden nodes. Three noise levels are considered: \(\sigma _{\beta }^2=\sigma _{w}^2=0.04\), \(\sigma _{\beta }^2=\sigma _{w}^2=0.09\), and \(\sigma _{\beta }^2=\sigma _{w}^2=0.25\). Panels a–c are for the Abalone dataset, d–f for the Concrete Compressive Strength dataset, and g–i for the Housing Price dataset

Table 3 The performance of the four algorithms over the tenfold. The number of hidden nodes in the networks is 500
Fig. 2 Test set MSE versus noise level for each dataset considered. Four noise levels are considered: \(\sigma _{\beta }^2=\sigma _{w}^2=0.04\), \(\sigma _{\beta }^2=\sigma _{w}^2=0.09\), \(\sigma _{\beta }^2=\sigma _{w}^2=0.16\), and \(\sigma _{\beta }^2=\sigma _{w}^2=0.25\)

Table 4 The paired t-test result between WDT-IELM and IELM
Table 5 The paired t-test result between WDTC-IELM and IELM

5 Numerical study

In this section, we evaluate the performance of the proposed algorithms on some real-life datasets obtained from the UCI machine learning repository (Lichman 2013). The datasets used to validate the performance of the algorithms are Abalone, Housing Price, Concrete Compressive Strength, Airfoil Self Noise (ASN), BodyFat, Chemical Sensor, and Building Energy. Table 1 presents the properties of the datasets. The datasets are pre-processed: the target outputs are normalized to the range [0, 1], while the input features are normalized to the range [−1, 1]. In addition, we randomly generate the input weights of the hidden nodes from the range [−1, 1].

For a fair comparison, we use the tenfold evaluation method. The samples of a dataset are randomly partitioned into ten subsets. The summary of the partitioning is given in Table 2. In each run, one subset is used as the test set, and the remaining nine subsets are used as training data. The noise levels that we test are \(\sigma ^2_\beta =\sigma ^2_w \in \{0.04, 0.09, 0.16, 0.25\}\). According to the analysis in Liu and Kaneko (1969), the noise level \(\sigma ^2_\beta =\sigma ^2_w=0.04\) corresponds to around 2–3 mantissa bits in a digital implementation. For the other noise levels, we can consider that the standard deviation of the noise is around 30–50% of the nominal value in an analog implementation.
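The preprocessing and tenfold partitioning described above can be sketched as follows. The scaling functions and the random fold split are illustrative; the exact partition sizes used in the experiments are those listed in Table 2.

```python
import numpy as np

def minmax_scale(A, lo, hi):
    """Column-wise min-max scaling of A to the range [lo, hi]."""
    a_min, a_max = A.min(axis=0), A.max(axis=0)
    return lo + (A - a_min) * (hi - lo) / (a_max - a_min)

def tenfold_indices(N, rng=None):
    """Randomly partition N sample indices into ten roughly equal folds."""
    rng = np.random.default_rng() if rng is None else rng
    return np.array_split(rng.permutation(N), 10)

# Inputs are scaled to [-1, 1] and targets to [0, 1], as in the experimental setup:
# X = minmax_scale(X_raw, -1.0, 1.0)
# t = minmax_scale(t_raw.reshape(-1, 1), 0.0, 1.0).ravel()
```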

5.1 Number of hidden nodes

We use three datasets to demonstrate how the test set errors change with the number of hidden nodes. The three datasets are Abalone, Concrete Compressive Strength, and Housing Price. Three noise levels, \(\sigma ^2_\beta =\sigma ^2_w \in \{0.04, 0.09, 0.25\}\), are considered. Figure 1 shows the test set MSE versus the number of hidden nodes for a typical run. It can be seen that the test set errors of the CIELM are much higher than those of the other three algorithms. That means the noise tolerance of the original CIELM is very poor. For the IELM, WDT-IELM, and WDTC-IELM algorithms, when the number of hidden nodes is around 400–500, the test set error decreases very slowly. Thus, we use 500 hidden nodes as the reference point for the deeper analysis in the rest of the paper.

From the figure, the WDT-IELM algorithm is better than the IELM algorithm. When the noise level is high, the improvement in the test set error becomes more significant. In addition, when we use the WDTC-IELM algorithm, we can further improve the test set MSE. For instance, for the Abalone dataset with noise level \(\sigma ^2_\beta =\sigma ^2_w=0.04\), the test set MSE of the original IELM algorithm is 0.01074. When we use the WDT-IELM algorithm, we can reduce the test set MSE to 0.01053. Using the WDTC-IELM algorithm, we can further reduce the test set MSE to 0.006246.

5.2 Performance comparison

To further investigate the performance of the algorithms, we use the tenfold evaluation strategy. The setting of the tenfold is shown in Table 2. The average test set performance over the tenfold for the seven datasets is summarized in Table 3. Besides, for an easy and quick view of the results in Table 3, we also provide a chart view of the performance in Fig. 2. The table contains \(7 \times 4 \times 4=112\) entries. Each entry is the average MSE over the tenfold (ten runs).

From the table, the test set MSE values of WDT-IELM are smaller than those of IELM. This is most obvious at large noise levels, as shown in Fig. 2. Besides, the test set MSE values of WDTC-IELM are much smaller than those of the other three algorithms. For instance, for the BodyFat dataset, when the noise level is 0.04, the test set MSE value of the original IELM is 0.020329. When we use WDT-IELM, we reduce the test set MSE value to 0.019707. Furthermore, when we use the WDTC-IELM algorithm, the test set MSE value can be reduced to 0.010461. The improvement of the WDTC-IELM algorithm is more significant at high noise levels. When the noise level is 0.25, the test set MSE value of the original IELM is 0.063316. When we use WDT-IELM, we reduce the test set MSE value to 0.046406. Furthermore, when we use the WDTC-IELM algorithm, the test set MSE value can be reduced to 0.012829.

Furthermore, we observe that the WDTC-IELM is relatively insensitive to the noise level. Consider the Abalone dataset.

  • When the noise level is 0.04, the test set MSE of IELM is 0.011783. When the noise level increases to 0.25, the test set MSE increases to 0.031176.

  • When the noise level is 0.04, the test set MSE of CIELM is 0.020905. When the noise level increases to 0.25, the test set MSE increases to 0.089067.

  • When the noise level is 0.04, the test set MSE of WDT-IELM is 0.011560. When the noise level increases to 0.25, the test set MSE increases to 0.023635.

  • When the noise level is 0.04, the test set MSE of WDTC-IELM is 0.007181. When the noise level increases to 0.25, the test set MSE increases only to 0.008544.

The above phenomenon also happens in the other six datasets. Since the WDTC-IELM is the best among the four algorithms, one may argue that we do not need to consider the WDT-IELM. However, the WDTC-IELM algorithm needs to update all the previously trained weights and has a more complicated training procedure, as shown in Algorithms 1 and 2.

5.3 Paired T-test analysis

From Fig. 2 and Table 3, in terms of average test set MSE, the two proposed algorithms are better than the two original algorithms. In this section, we check whether the improvements are statistically significant by performing a significance test, namely the paired t-test. Since the CIELM has very poor performance, we do not perform the paired t-test on it.

Tables 4 and 5 summarize the paired t-test results, i.e., IELM versus WDT-IELM, and IELM versus WDTC-IELM. For the one-tailed test with ten folds and a 95% confidence level, the critical t-value is 1.8331.

Before we perform the paired test, we should check whether the data pass a normality test. In this paper, we use the Anderson–Darling goodness-of-fit hypothesis test. For the hypothesis test, the critical p value is 0.05; that is, the p value of the data should be greater than 0.05. Since there are three algorithms, four noise levels, and seven datasets, there are 84 sets of data. Most of them pass the normality test; only nine cases do not. The nine cases appear in three datasets: the ASN dataset, the Chemical Sensor dataset, and the Housing dataset. In the ASN dataset, there are three cases: WDT-IELM (noise level = 0.04) with a p value of 0.0498, WDT-IELM (noise level = 0.09) with a p value of 0.0162, and IELM (noise level = 0.09) with a p value of 0.0313. In the Housing dataset, there are two cases: IELM (noise level = 0.16) with a p value of 0.0481 and IELM (noise level = 0.25) with a p value of 0.0381. In the Chemical Sensor dataset, there are four cases: IELM (noise level = 0.16) with a p value of 0.028, IELM (noise level = 0.25) with a p value of 0.0129, WDT-IELM (noise level = 0.16) with a p value of 0.03, and WDT-IELM (noise level = 0.25) with a p value of 0.0345.
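A sketch of this statistical procedure, using SciPy, is given below. Note that SciPy's Anderson–Darling routine returns a test statistic and critical values rather than a p value, so the normality check here is a pass/fail decision at the 5% level; the p values quoted above come from a different implementation of the same test. The function name and arguments are illustrative.

```python
import numpy as np
from scipy import stats

def paired_comparison(mse_ielm, mse_proposed):
    """Paired one-tailed t-test on the per-fold test MSEs of two algorithms (ten folds)."""
    a = np.asarray(mse_ielm, dtype=float)
    b = np.asarray(mse_proposed, dtype=float)
    # Anderson-Darling normality check for each algorithm's per-fold MSEs
    for name, x in (("IELM", a), ("proposed", b)):
        ad = stats.anderson(x, dist='norm')
        crit_5 = ad.critical_values[list(ad.significance_level).index(5.0)]
        print(f"{name}: AD statistic {ad.statistic:.3f} (5% critical value {crit_5:.3f})")
    # Paired t-test; convert to a one-tailed p value for the hypothesis mean(a - b) > 0
    t_stat, p_two = stats.ttest_rel(a, b)
    p_one = p_two / 2.0 if t_stat > 0 else 1.0 - p_two / 2.0
    return t_stat, p_one
```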

In Table 4, all t-values are greater than the critical t-value (i.e., 1.8331). Besides, all confidence intervals of the average improvements exclude zero. For example, in the BodyFat dataset with the noise level equal to 0.25, the t-value is 89.1, which is greater than 1.8331. Besides, the confidence interval of the average improvement is [0.016480, 0.017339]. With these results, there is strong evidence that the WDT-IELM is better than IELM.

Again, in Table 5, all the t-values are much greater than the critical t-value. This reveals that the improvement of WDTC-IELM is statistically significant too.

As mentioned above, there are a few cases that do not pass the normality test. However, from Table 3, for those cases, the improvements of using WDT-IELM and WDTC-IELM are greater than the standard deviations of IELM, WDT-IELM and WDTC-IELM. Hence, we can conclude that the improvements of WDT-IELM and WDTC-IELM are significant.

For example, for the ASN dataset with the noise level equal to 0.09, the test set MSE of IELM is 0.049538 and the standard deviation is 0.002893. When we use the WDT-IELM, the test set MSE is 0.044534 and the standard deviation is 0.002797. Clearly, the improvement of using WDT-IELM is around 0.005004, which is around twice the standard deviations. When we use the WDTC-IELM, the test set MSE is 0.018152 and the standard deviation is 0.002. Clearly, the improvement of using WDTC-IELM is around 0.031386, which is around nine times the standard deviations.

5.4 Performance improvement ratio

We also present the performance improvement ratios of WDT-IELM and WDTC-IELM. The performance improvement ratio is calculated as

$$\begin{aligned} \frac{P_e - P_p}{P_e}\times 100\% , \end{aligned}$$
(50)

where \(P_e\) and \(P_p\) are the performances of the existing and proposed models, respectively. Table 6 summarizes the performance improvement ratios of WDT-IELM and WDTC-IELM. The 3rd and 4th columns show the performance improvement ratios of WDT-IELM and WDTC-IELM relative to IELM, respectively, and the following two columns show those relative to CIELM. The table clearly shows that our algorithms outperform the existing ones. For example, in the ASN dataset we have the following performance improvement ratios.

Table 6 Performance improvement ratios of the compared algorithms: IELM, CIELM, WDT-IELM and WDTC-IELM
  • When the noise level is 0.04, the improvement ratio of WDT-IELM to IELM is 3.06%. When the noise level increases to 0.25, the improvement ratio increases to 29.14%.

  • When the noise level is 0.04, the improvement ratio of WDTC-IELM to IELM is 42.86%. When the noise level increases to 0.25, the improvement ratio increases to 77.54%.

  • When the noise level is 0.04, the improvement ratio of WDT-IELM to CIELM is 56.37%. When the noise level increases to 0.25, the improvement ratio increases to 82.06%.

  • When the noise level is 0.04, the improvement ratio of WDTC-IELM to CIELM is 74.28%. When the noise level increases to 0.25, the improvement ratio increases to 94.32%.

The above behaviour also occurs in the other six datasets. Furthermore, Fig. 3 provides a visual view of Table 6. From the figure and the table, the improvement ratios of our proposed algorithms (WDT-IELM and WDTC-IELM) are more significant at high noise levels.

Fig. 3 Performance improvement ratios. Four noise levels are considered: \(\sigma _{\beta }^2=\sigma _{w}^2=0.04\), \(\sigma _{\beta }^2=\sigma _{w}^2=0.09\), \(\sigma _{\beta }^2=\sigma _{w}^2=0.16\), and \(\sigma _{\beta }^2=\sigma _{w}^2=0.25\)

6 Conclusion

In this paper, we have developed a noise resistant objective function for concurrent multiplicative weight noise in the input weights and output weights. Based on the developed objective function, we then proposed two incremental ELM algorithms that can handle weight noise. We also showed that during the incremental training, the training objective is non-increasing. From the simulation results, the proposed incremental algorithms are much better than the original incremental algorithms. Besides, based on the paired t-test results and the performance improvement ratios, the improvements of the proposed algorithms are statistically significant.