1 Introduction

Recently, a new learning algorithm for the architecture of single hidden layer feedforward neural networks (SLFNs), called the extreme learning machine (ELM), was proposed in [21] to overcome many of the problems of traditional feedforward neural network learning algorithms, such as the presence of local minima, an imprecise learning rate, overfitting and a slow rate of convergence. Once the input weights and hidden layer biases have been chosen randomly, ELM determines the unknown output weight vector of the network having the smallest norm by solving a system of linear equations. ELM is a simple unified algorithm for regression, binary and multiclass problems, and it has been successfully tested on benchmark problems of practical importance. It was initially proposed for SLFNs and later extended to “generalized” SLFNs whose hidden nodes need not be neuron-like [15, 16]. The essence of ELM is that the hidden layer of the SLFN need not be tuned. The growing popularity of ELM [3, 4, 10, 14, 22, 27, 28, 33, 34, 38, 39, 42] stems from its better generalization performance and much faster learning speed in comparison to traditional computational intelligence techniques [19].

The main problem with ELM is the stochastic nature of the hidden layer output matrix, which in practice may lower its learning accuracy [6]. Further, it was observed that a large number of hidden nodes might be required to achieve an acceptable level of performance [23, 44], which increases the problem size and therefore the computational cost. This suggests looking for compact networks that can still achieve good generalization performance [23, 41, 44]. The other issue with ELM is that the number of hidden nodes is a user-specified parameter, often chosen by trial and error. Two heuristic approaches, namely constructive [12, 15–17] and pruning [26] methods, have been proposed in the literature to address this problem. In fact, the optimally pruned ELM (OP-ELM) method proposed in [26] selects the hidden neurons by applying a 1-norm criterion to their outputs and then computes their weights using the classical least squares method. For an interesting study of ELM in 1-norm resulting in a sparse model representation, see [1].

By applying ELM kernels instead of support vector machine (SVM) kernels in the SVM formulation [7, 37], it was shown in [13, 24] that better generalization can be achieved. In [18], optimization based ELMs (OB-ELMs) for classification were proposed as an extension to support vector networks in which, as in SVM, the training error and the norm of the output weight vector are minimized. Further, it was observed that the proposed method tends to achieve better generalization performance than SVM and is less sensitive to the input parameters. For an interesting study on an extension of ELM to least squares SVM (LS-SVM) and proximal SVM (PSVM) as a learning algorithm providing a unified solution, see the work of Huang et al. [20]. Finally, for an excellent survey on ELM, we refer the reader to [19].

Motivated by the works of [18, 20], an enhanced optimization based ELM is proposed in its primal form, whose solution is obtained by solving an absolute value equation with a simple functional iterative method. Our formulation finds the approximate optimal solution directly in the primal space rather than in the dual space. The effectiveness of the proposed method for regression and binary classification problems is demonstrated by performing numerical experiments on a number of real-world datasets and comparing the results with those of SVM, ELM, OP-ELM and OB-ELM.

The paper is organized as follows. Section 2 briefly reviews SVM for regression and classification. In Sect. 3, a short introduction to ELM is given. The OB-ELMs for regression and classification are introduced in Sect. 4. The proposed optimization based ELM formulation, whose solution is obtained by applying a functional iterative method, is described in Sect. 5. In Sect. 6, numerical experiments performed on a number of benchmark real-world datasets with additive and radial basis function (RBF) hidden nodes are reported, and the results are compared with those of SVM, ELM, OP-ELM and OB-ELM. We conclude our work in Sect. 7.

Throughout this paper, all vectors are taken as column vectors. For any two vectors \( {\mathbf{x}},{\mathbf{y}} \in R^{n} \), their inner product is denoted by \( {\mathbf{x}}^{t} {\mathbf{y}} \), where \( {\mathbf{x}}^{t} \) is the transpose of \( {\mathbf{x}} \). The 2-norm of a vector \( {\mathbf{x}} \) is denoted by \( \left\| {\bf x} \right\| \). For any vector \( {\bf x} = (x_{1} , \ldots ,x_{n} )^{t} \in R^{n} \), let \( {\mathbf{x}}_{ + } \) be the vector in \( R^{n} \) obtained by setting all the negative components of \( {\mathbf{x}} \) to zero, and let \( |\,{\bf x}\,| = (\,|x_{1} |, \ldots ,|x_{n} |\,)^{t} \in R^{n} \). The zero vector and the column vector of ones of dimension m are denoted by 0 and e respectively. For any real matrix \( {\mathbf{\mathcal{H}}}\, \in R^{m \times \ell } \), its transpose is denoted by \( {\mathbf{\mathcal{H}}}^{t} \). For any symmetric matrix \( {\mathbf{\mathcal{A}}} \), its maximum eigenvalue is denoted by \( \lambda_{\hbox{max} } ({\mathbf{\mathcal{A}}}) \). The identity matrix of appropriate size is denoted by I. Further, diag(\( {\mathbf{x}} \)) denotes the diagonal matrix of order n whose diagonal elements are the components of the vector \( {\mathbf{x}} \in R^{n} \). If f is a real valued function of the variable \( {\mathbf{x}} = (x_{1} , \ldots ,x_{n} )^{t} \in R^{n} \), its gradient is denoted by \( \nabla f = (\partial f/\partial x_{1} , \ldots ,\partial f/\partial x_{n} )^{t} \).

2 Support vector machines for regression and classification

In this section, we briefly describe the standard support vector machines for regression and binary classification problems.

2.1 Regression formulation

Assume that a set of training examples \( \{ ({\bf x}_{i} ,y_{i} )\}_{i = 1, \ldots ,m} \) is given, where each input example \( {\bf x}_{i} = (x_{i1} , \ldots ,x_{in} )^{t} \in R^{n} \) has the corresponding observed value \( y_{i} \in R \). By mapping the input examples \( {\bf x}_{i} \in R^{n} \) into a higher dimensional feature space via a nonlinear map \( \varphi (.) \), a linear regression model \( f(.) \) of the following form is fitted in the feature space

$$ f({\bf x}) = {\bf w}^{t} \varphi ({\bf x}) + b . $$
(1)

The goal of nonlinear support vector regression (SVR) is to determine the unknowns w and b of (1) as the solution of the constrained minimization problem [7, 37]

$$ \mathop {{\mathbf{min}}}\limits_{{{\mathbf{w}},b,\xi_{1} ,\xi_{2} }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C({\mathbf{e}}^{t} {\varvec{\upxi}}_{1} + {\mathbf{e}}^{t} {\varvec{\upxi}}_{2} ) $$

subject to

$$ \begin{gathered} y_{i} - {\mathbf{w}}^{t} \varphi ({\mathbf{x}}_{i} ) - b \le \varepsilon + \xi_{1i} , \hfill \\ {\mathbf{w}}^{t} \varphi ({\mathbf{x}}_{i} ) + b - y_{i} \le \varepsilon + \xi_{2i} , \hfill \\ \xi_{1i} \ge 0\ {\text{and}}\ \xi_{2i} \ge 0\quad {\text{for}}\ i = 1, 2, \ldots ,m, \hfill \\ \end{gathered} $$
(2)

where C > 0, \( \varepsilon > 0 \) are input parameters and \( {\varvec{\upxi}}_{1} = (\xi_{11} , \ldots ,\xi_{1m} )^{t} \),\( {\varvec{\upxi}}_{2} = (\xi_{21} , \ldots ,\xi_{2m} )^{t} \in R^{m} \) are vectors of slack variables.

Instead of solving the primal problem (2), by introducing Lagrange multipliers \( \varvec{\alpha}_{1} = (\alpha_{11} , \ldots ,\alpha_{1m} )^{t} \), \( \varvec{\alpha}_{2} = (\alpha_{21} , \ldots ,\alpha_{2m} )^{t} \) \( \in R^{m} \) and applying the kernel trick, its dual problem of the following form is solved [7, 37]

$$ \mathop {\text{min}}\limits_{{\varvec{\alpha}_{1} ,\varvec{\alpha}_{2} }} \frac{1}{2}\sum\limits_{i,j = 1}^{m} {(\alpha_{1i} - \alpha_{2i} )k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} )(\alpha_{1j} - \alpha_{2j} ) + } \varepsilon \sum\limits_{i = 1}^{m} {(\alpha_{1i} + \alpha_{2i} ) - \sum\limits_{i = 1}^{m} {y_{i} (\alpha_{1i} - \alpha_{2i} )} } $$

subject to

$$ \sum\limits_{i = 1}^{m} {(\alpha_{1i} - \alpha_{2i} ) = 0} \;{\text{and}}\;{\mathbf{0}} \le {\varvec{\upalpha}}_{1} ,{\varvec{\upalpha}}_{2} \le C\,{\mathbf{e}}, $$

where \( k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) \) is a kernel function that replaces the term \( \varphi ({\mathbf{x}}_{i} )^{t} \varphi ({\mathbf{x}}_{j} ) \).

Finally, for any \( {\bf x} \in R^{n} \) its prediction by the fitted regression function f(.) is given by [7, 37]

$$ f({\mathbf{x}}) = \sum\limits_{i = 1}^{m} {(\alpha_{1i} - \alpha_{2i} )k({\mathbf{x}},{\mathbf{x}}_{i} )} + b . $$

2.2 Classification formulation

Let \( \{ ({\mathbf{x}}_{i} ,y_{i} )\}_{i = 1, \ldots ,m} \) be a given set of training examples in which each input point \( {\mathbf{x}}_{i} = (x_{i1} , \ldots ,x_{in} )^{t} \in R^{n} \) has the corresponding class label \( y_{i} = \pm 1 \). It is well known that SVM constructs a hyperplane of the form \( {\mathbf{w}}^{t} \varphi ({\mathbf{x}}) + b \) by maximizing the margin between the input points of the positive and negative classes in the feature space and, at the same time, minimizing the sum of the training errors. In fact, the unknowns w and b are obtained as the solution of the following constrained minimization problem [7, 37]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},b,\xi }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C{\mathbf{e}}^{t} {\varvec{\upxi}} $$

subject to

$$ \begin{gathered} y_{i} ({\mathbf{w}}^{t} \varphi ({\mathbf{x}}_{i} ) + b) \ge 1 - \xi_{i} , \hfill \\ \xi_{i} \ge 0\quad {\text{for}}\ i = 1,2, \ldots ,m, \hfill \\ \end{gathered} $$
(3)

where C > 0 is the regularization parameter and \( {\varvec{\upxi}} = (\xi_{1} , \ldots ,\xi_{m} )^{t} \in R^{m} \) is a vector of slack variables. Its dual can be written in the form [7, 37]

$$ \mathop {\hbox{min} }\limits_{{{\varvec{\upalpha}} \in R^{m} }} \frac{1}{2}\sum\limits_{i,j = 1}^{m} {\alpha_{i} \alpha_{j} y_{i} y_{j} k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} )} - \sum\limits_{i = 1}^{m} {\alpha_{i} } $$

subject to

$$ \sum\limits_{i = 1}^{m} {\alpha_{i} y_{i} } = 0\;{\text{and}}\;{\mathbf{0}} \le {\varvec{\upalpha}} \le C\,{\mathbf{e}}, $$

where k(.,.) is a kernel function and \( {\varvec{\upalpha}} = (\alpha_{1} , \ldots ,\alpha_{m} )^{t} \in R^{m} \) is the Lagrange multiplier vector. In this case, the decision function f(.) will become

$$ f({\mathbf{x}}) = {\text{sign}}(\sum\limits_{j = 1}^{m} {\alpha_{j} y_{j} k({\mathbf{x}},{\mathbf{x}}_{j} ) + b} )\;{\text{for}}\;{\mathbf{x}} \in R^{n} . $$
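For concreteness, the dual problems above can be solved with any standard SVM package. The following sketch (in Python with scikit-learn, rather than the MOSEK/MATLAB setup used later in Sect. 6) illustrates Gaussian kernel SVR and SVM models corresponding to these formulations; the arrays X_train, y_train, labels_train and X_test are placeholders and the parameter values are illustrative only.

```python
from sklearn.svm import SVR, SVC

# Nonlinear SVR with the Gaussian kernel k(x_i, x_j) = exp(-mu ||x_i - x_j||^2);
# scikit-learn's `gamma` plays the role of the kernel parameter mu.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.01, gamma=1.0)
svr.fit(X_train, y_train)            # y_train: real-valued targets
y_pred = svr.predict(X_test)         # fitted regression function f(x)

# Soft-margin SVM classifier with the same kernel; labels in {-1, +1}
svc = SVC(kernel="rbf", C=1.0, gamma=1.0)
svc.fit(X_train, labels_train)
labels_pred = svc.predict(X_test)    # sign of the decision function
```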

3 Extreme learning machine method

ELM is a unified learning approach for regression and classification problems proposed for training SLFNs. It is based on a least squares solution and achieves generalization performance comparable to that of the traditional SVM at a much faster learning speed.

Assume that a set of training examples \( \{ ({\mathbf{x}}_{i} ,y_{i} )\}_{i = 1, \ldots ,m} \) is given, where each input example \( {\mathbf{x}}_{i} = (x_{i1} , \ldots ,x_{in} )^{t} \in R^{n} \) has the corresponding target value \( y_{i} \in R \). For randomly assigned values of the learning parameters \( {\mathbf{a}}_{s} = (a_{s1} , \ldots ,a_{sn} )^{t} \in R^{n} \) and \( b_{s} \in R \) of the hidden nodes, ELM determines its output function f(.) such that

$$ f({\mathbf{x}}_{i} ) = \sum\limits_{s = 1}^{\ell } {w_{s} G({\mathbf{a}}_{s} ,b_{s} ,{\mathbf{x}}_{i} )} = y_{i} \quad {\text{for}}\ i = 1, \ldots ,m, $$
(4)

where the hidden layer output function G(a,b,x) is a nonlinear piecewise continuous function satisfying the conditions of the universal approximation capability theorems [17] and \( {\mathbf{w}} = (w_{1} , \ldots ,w_{\ell } )^{t} \in R^{\ell } \) is the unknown output weight vector connecting the hidden layer of ℓ nodes with the output node. The above condition (4) can be written in matrix form as

$$ {\mathbf{\mathcal{H}}}{\mathbf{w}} = {\mathbf{y}}, $$
(5)

where

$$ {\mathbf{\mathcal{H}}} = \left[ {\begin{array}{*{20}c} {G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{1} )} & \cdots & {G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{1} )} \\ \vdots & \ddots & \vdots \\ {G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{m} )} & \cdots & {G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{m} )} \\ \end{array} } \right]_{m \times \ell } $$

is the hidden layer output matrix of the network and \( {\mathbf{y}} = (y_{1} , \ldots ,y_{m} )^{t} \in R^{m} \) is the vector of observed values.

Some of the well-known hidden layer output functions are

  a) Sigmoid function: \( G({\mathbf{a}},b,{\mathbf{x}}) = \frac{1}{{1 + \exp ( - ({\mathbf{a}}^{t} {\mathbf{x}} + b))}}, \)

  b) Multiquadric function: \( G({\mathbf{a}},b,{\mathbf{x}}) = \left( {\left\| {{\mathbf{x}} - {\mathbf{a}}} \right\|^{2} + b^{2} } \right)^{1/2} , \)

  c) Gaussian function: \( G({\mathbf{a}},b,{\mathbf{x}}) = \exp \left( { - b\left\| {{\mathbf{x}} - {\mathbf{a}}} \right\|^{2} } \right). \)

Clearly, for randomly assigned values of the parameters \( {\mathbf{a}}_{s} \in R^{n} ,b_{s} \in R \) and a predefined hidden layer output function \( G({\mathbf{a}},b,{\mathbf{x}}) \), training the SLFN is equivalent to obtaining a least squares solution \( {\mathbf{w}} \in R^{\ell } \) of the rectangular linear system (5). In fact, \( {\mathbf{w}} \in R^{\ell } \) can be obtained explicitly as the smallest norm least squares solution of (5) [21]: \( {\mathbf{w}} = {\mathbf{\mathcal{H}}}^{\dag } {\mathbf{y}} \), where \( {\mathbf{\mathcal{H}}}^{\dag } \) is the Moore–Penrose generalized inverse [32] of the matrix \( {\mathbf{\mathcal{H}}} \).

Finally, with the solution \( {\mathbf{w}} \in R^{\ell } \) obtained, the fitted model f(.) for ELM regression is taken as

$$ f({\mathbf{x}}) = \sum\limits_{s = 1}^{\ell } {w_{s} G({\mathbf{a}}_{s} ,b_{s} ,{\mathbf{x}})} . $$
(6a)

For binary classification problems, the ELM decision function is, however, defined by

$$ f({\mathbf{x}}) = {\text{sign}}\left(\sum\limits_{s = 1}^{\ell } {w_{s} G({\mathbf{a}}_{s} ,b_{s} ,{\mathbf{x}})}\right). $$
(6b)

Note that, once the values of the weight vectors \( {\mathbf{a}}_{s} \in R^{n} \) and the biases \( b_{s} \in R \) are randomly assigned at the beginning of the learning algorithm, they remain fixed and therefore the matrix \( {\mathbf{\mathcal{H}}} \) remains unchanged.
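To make the above computation concrete, a minimal NumPy sketch of ELM training and prediction is given below, assuming sigmoid additive hidden nodes with parameters drawn uniformly from [−0.5, 0.5] (the range used in the experiments of Sect. 6); the function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid_hidden_layer(X, A, b):
    """Hidden layer output matrix: H[i, s] = G(a_s, b_s, x_i) = 1/(1 + exp(-(a_s' x_i + b_s)))."""
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

def elm_train(X, y, n_hidden):
    """Random hidden layer plus smallest norm least squares output weights."""
    A = rng.uniform(-0.5, 0.5, size=(n_hidden, X.shape[1]))  # random input weights a_s
    b = rng.uniform(-0.5, 0.5, size=n_hidden)                # random biases b_s
    H = sigmoid_hidden_layer(X, A, b)
    w = np.linalg.pinv(H) @ y      # Moore-Penrose solution of H w = y, as in (5)
    return A, b, w

def elm_predict(X, A, b, w):
    """Fitted model (6a); take np.sign(.) of the result for binary classification (6b)."""
    return sigmoid_hidden_layer(X, A, b) @ w
```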

4 Optimization method based ELM

Recently, Huang et al. [18] studied ELM for classification from the viewpoint of optimization methods by extending ELM to a support vector network model and showing that the minimum norm property of ELM and the maximum margin between classes in SVM are in fact consistent.

In this section, we briefly summarize the formulations for optimization method based ELM (OB-ELM) for regression and classification problems.

The mathematical formulation of the optimization method based \( \varepsilon - \)ELM for regression can be stated as a minimization problem

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,{\varvec{\upxi}}_{1} ,\,{\varvec{\upxi}}_{2} }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C{\mathbf{e}}^{t} ({\varvec{\upxi}}_{1} + {\varvec{\upxi}}_{2} ) $$

subject to

$$ \begin{gathered} {\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{1} , \hfill \\ {\mathbf{y}} - {\mathbf{\mathcal{H}}}{\mathbf{w}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{2} , \hfill \\ {\varvec{\upxi}}_{1} ,\,\,{\varvec{\upxi}}_{2} \, \ge \,\,{\mathbf{0}}, \hfill \\ \end{gathered} $$
(7)

where \( C > 0,\varepsilon > 0 \) are input parameters and \( {\varvec{\upxi}}_{1} ,{\varvec{\upxi}}_{2} \in R^{m} \) are vectors of slack variables.

By introducing Lagrange multipliers \( {\varvec{\upalpha}}_{1} ,\,{\varvec{\upalpha}}_{2} \in R^{m} \), the dual of (7) can be constructed as a minimization problem of the form

$$ \mathop {\hbox{min} }\limits_{{{\varvec{\upalpha}}_{1} ,\,{\varvec{\upalpha}}_{2} \in R^{m} }} \frac{1}{2}({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} )^{t} K_{{_{ELM} }} ({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} ) + \varepsilon \,{\mathbf{e}}^{t} ({\varvec{\upalpha}}_{1} + {\varvec{\upalpha}}_{2} ) - {\mathbf{y}}^{t} \,({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} ) $$

subject to

$$ {\mathbf{0}} \le {\varvec{\upalpha}}_{1} ,{\varvec{\upalpha}}_{2} \le C\,{\mathbf{e}}, $$
(8)

where \( (K_{ELM} )_{ij} = K_{ELM} ({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = (G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{i} ), \ldots ,G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{i} ))\,(G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{j} ), \ldots ,G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{j} ))^{t} \) is the ij-th entry of the matrix \( K_{ELM} \). With its solution \( {\varvec{\upalpha}}_{1} ,{\varvec{\upalpha}}_{2} \in R^{m} \), the regression estimation function (6a) becomes

$$ f({\mathbf{x}}) = (\,K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{1} )\,, \ldots ,\,K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{m} )\,)\,\,({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} ) . $$
(9)

Note that the \( \varepsilon \)-ELM problem (8) has fewer constraints than the SVR problem in its dual form, and the decision function (9) is similar to the estimation function of SVR but without the bias b.
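A small sketch may help fix the notation: since the i-th row of \( {\mathbf{\mathcal{H}}} \) is exactly \( (G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{i} ), \ldots ,G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{i} )) \), the ELM kernel matrix in (8) is simply \( K_{ELM} = {\mathbf{\mathcal{H}}}{\mathbf{\mathcal{H}}}^{t} \). The NumPy lines below assume H and H_test (hidden layer outputs of the training and test patterns) have already been built as in Sect. 3 and that a dual solution alpha1, alpha2 of (8) has been obtained from some box-constrained QP solver; these names are placeholders, not part of the original formulation.

```python
import numpy as np

# ELM kernel matrix of the dual problem (8): (K_ELM)_{ij} = h(x_i) h(x_j)^t
K_ELM = H @ H.T

# Regression estimates (9) for the test patterns, given a dual solution alpha1, alpha2:
# f(x) = (K_ELM(x, x_1), ..., K_ELM(x, x_m)) (alpha1 - alpha2)
f_test = (H_test @ H.T) @ (alpha1 - alpha2)
```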

For the purpose of separating the training data with an acceptable minimum training error rather than zero training error, OB-ELM for classification can be formulated as

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,\,{\varvec{\upxi}}}} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C{\mathbf{e}}^{t} {\varvec{\upxi}} $$

subject to

$$ \begin{gathered} {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}} \ge {\mathbf{e}} - {\varvec{\upxi}}, \hfill \\ {\varvec{\upxi}} \ge 0, \hfill \\ \end{gathered} $$
(10)

where \( {\mathcal{D}} = diag({\mathbf{y}}) \) is a diagonal matrix of order m whose diagonal elements are the components of the output vector \( {\mathbf{y}} \in R^{m} \) of class labels. Since the separating ELM hyperplane tends to pass through the origin in the feature space with probability one [18], the bias term “b” is not included in (10).

Recall that the aim of ELM is to obtain an output weight vector \( {\mathbf{w}} \in R^{\ell } \) with the smallest training error such that \( ||{\mathbf{w}}|| \) is also minimum. Since the distance between the separating boundaries of the two classes in the ELM feature space is \( 2/||{\mathbf{w}}|| \), minimizing the regularization term in (10) is equivalent to maximizing the margin between the boundaries in the ELM feature space.

The solution of the primal problem (10) will be obtained by solving its dual optimization problem

$$ \mathop {\hbox{min} }\limits_{{{\varvec{\upalpha}}\,\, \in \,R^{m} }} \frac{1}{2}{\varvec{\upalpha}}^{t} {\mathcal{D}}\,K_{{_{ELM} }} {\mathcal{D}}\,{\varvec{\upalpha}} - \,{\mathbf{e}}^{t} {\varvec{\upalpha}} $$

subject to

$$ {\mathbf{0}} \le {\varvec{\upalpha}} \le C\,{\mathbf{e}}, $$

where \( {\varvec{\upalpha}} \) is the vector of Lagrange multipliers.

In this case, the decision function \( f(.) \) is given by

$$ f({\mathbf{x}}) = {\text{sign}}(\,(K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{1} )\,, \ldots ,\,K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{m} )\,)\,\,{\mathcal{D}}\,{\varvec{\upalpha}}). $$

5 Proposed functional iterative extreme learning machine

In this section, we extend the optimization method based ELM learning approach for classification [18] to both regression and classification problems by formulating their primal problems as unconstrained minimization problems and solving them with a simple functional iterative method.

5.1 ε–ELM for regression

The ELM for regression in 2-norm with the ε-insensitive error loss function and input parameters C > 0 and ε > 0 can be formulated as a constrained minimization problem of the following form [2, 18]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,{\varvec{\upxi}}_{1} ,{\varvec{\upxi}}_{2} }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}({\varvec{\upxi}}_{1}^{t} {\varvec{\upxi}}_{1} + {\varvec{\upxi}}_{2}^{t} {\varvec{\upxi}}_{2} ) $$

subject to

$$ \begin{gathered} {\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{1} , \hfill \\ {\mathbf{y}} - {\mathbf{\mathcal{H}}}{\mathbf{w}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{2} \hfill \\ \end{gathered} $$
(11)

where \( {\varvec{\upxi}}_{1} ,\,{\varvec{\upxi}}_{2} \, \in R^{m} \) are vectors of slack variables. Using its solution \( {\mathbf{w}} \in R^{\ell } \), the regression estimation function f(.) defined by (6a) is obtained. The non-negativity constraints on the components of the vectors \( {\varvec{\upxi}}_{1} ,\,{\varvec{\upxi}}_{2} \, \in R^{m} \) have been dropped in the formulation (11) since none of their components is negative at optimality. Note that adding the regularization term \( \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} \) to the objective function of (11) leads to a stable solution with better generalization performance.

One can easily verify that the constrained primal problem (11) can be reformulated as the following equivalent unconstrained minimization problem [25]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}} \in \,R^{\ell } }} L_{R} ({\mathbf{w}}) = \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}(\left\| {({\mathbf{\mathcal{H}}}{\mathbf{w}} - ({\mathbf{y}} + \varepsilon \;{\mathbf{e}}))_{ + } } \right\|^{2} + \left\| {({\mathbf{y}} - \varepsilon {\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } } \right\|^{2} ). $$
(12)

In this subsection, we solve the problem (12) by computing its critical point. For this purpose, consider the problem of solving the system of non-linear equations

$$ \nabla L_{R} ({\mathbf{w}}) = {\mathbf{0}}, $$

i.e. solve for \( {\mathbf{w}} \in R^{\ell } \) such that

$$ {\mathbf{w}} + C{\mathbf{\mathcal{H}}}^{t} (({\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}})_{ + } - ({\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } ) = {\mathbf{0}}. $$
(13)

Now, using the identity \( {\mathbf{u}}_{ + } = \frac{{{\mathbf{u}} + |{\mathbf{u}}|}}{2} \), valid for any vector \( {\mathbf{u}} \in R^{m} \), the condition (13) can be written in the following equivalent form

$$ {\mathbf{w}} + C{\mathbf{\mathcal{H}}}^{t} \left[ {\frac{{({\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}) + |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}|}}{2} - \frac{{({\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}) + |{\mathbf{y}} - \varepsilon {\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}|}}{2}\;} \right] = {\mathbf{0}} $$

and from which we get

$$ \left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right){\mathbf{w}} = {\mathbf{\mathcal{H}}}^{t} \left( {{\mathbf{y}} + \frac{{|{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}| - |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}|}}{2}} \right). $$
(14)

This leads to the following simple iterative scheme which will be our functional iterative ELM algorithm for regression (FELMR)

$$ {\mathbf{w}}^{i + 1} = \left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} \left( {{\mathbf{y}} + \frac{{|{\mathbf{y}} - \varepsilon {\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} | - |{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} - {\mathbf{y}} - \varepsilon {\mathbf{e}}|}}{2}} \right)\;{\text{for}}\quad i = 0, \, 1, \, 2, \ldots $$
(15)

The iterative algorithm (15) for solving the regression problem (12) is summarized below.

Algorithm 1 (FELMR).

Input.

  • Parameter values \( C > 0 \) and \( \,\,\varepsilon > 0 \)

  • tol = tolerance value for learning accuracy, imax = maximum number of iterations

Step 1.

  • Set \( i = 0 \) and the initial vector \( {\mathbf{w}} = {\mathbf{w}}^{0} \) in \( R^{\ell } \)

Step 2.

  • Compute the vectors \( {\mathbf{ym}} = {\mathbf{y}} - \varepsilon \,{\mathbf{e}} \) and \( {\mathbf{yp}} = {\mathbf{y}} + \varepsilon \,{\mathbf{e}} \), and the matrix \( {\mathbf{\mathcal{G}}}\, = \,\left( {\frac{{\mathbf{I}}}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{{\mathbf{t}}} \)

  • Compute \( {\mathbf{w}}^{1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{y}}\, + \,\frac{{|\,{\mathbf{ym}}\, - \,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{0} |\, - \,|\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{0} \, - {\mathbf{yp}}\,|}}{2}} \right) \)

Step 3.

  • While (\( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\,\, > \,\,tol \) & \( i\,\, < \,\, \) imax)

$$ \begin{gathered} i = i + 1 \hfill \\ {\mathbf{w}}^{i + 1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{y}}\, + \,\frac{{|\,{\mathbf{ym}}\, - \,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{i} |\, - \,|\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{i} \, - {\mathbf{yp}}\,|}}{2}} \right) \hfill \\ \end{gathered} $$

end while

  • $$ {\mathbf{w}} = {\mathbf{w}}^{i + 1} $$
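A direct NumPy transcription of Algorithm 1 is sketched below. It assumes that the hidden layer output matrix H and the target vector y have already been formed as in Sect. 3; the function name felmr and the zero initial vector \( {\mathbf{w}}^{0} = {\mathbf{0}} \) are our own choices.

```python
import numpy as np

def felmr(H, y, C, eps, tol=1e-6, imax=20):
    """Functional iterative ELM for regression (FELMR), following iteration (15)."""
    m, L = H.shape
    # G = (I/C + H'H)^{-1} H', computed once (Step 2 of Algorithm 1)
    G = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T)
    ym, yp = y - eps, y + eps
    w = np.zeros(L)                                   # initial vector w^0
    for _ in range(imax):
        Hw = H @ w
        w_new = G @ (y + (np.abs(ym - Hw) - np.abs(Hw - yp)) / 2.0)
        if np.linalg.norm(w_new - w) < tol:           # termination test of Step 3
            return w_new
        w = w_new
    return w
```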

Now we show the convergence of the proposed FELMR algorithm in the next theorem.

Theorem 1

Assume that the following condition holds:

$$ \sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}\mathcal{H}}}^{t} )\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} < \frac{1}{C}. $$

Then, for any starting vector \( {\mathbf{w}}^{0} \in R^{\ell } \), the sequence of iterates \( \{ {\mathbf{w}}^{i} \} \) generated by the FELMR iterative step (15) converges to the unique solution \( {\mathbf{w}} \) of (12) at a linear rate.

Proof

Since w is the solution of (12), it satisfies (14). Therefore, using (14) and (15), we get

$$ {\mathbf{w}} - {\mathbf{w}}^{i + 1} = \left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} \left[ \begin{gathered} \frac{{|{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}| - |{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} |}}{2} \hfill \\ + \frac{{|{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}| - |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}|}}{2}\; \hfill \\ \end{gathered} \right] $$

But,

$$ |{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}| - |{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} | \le |{\mathbf{\mathcal{H}}}({\mathbf{w}} - {\mathbf{w}}^{i} )| $$

and similarly

$$ |{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}| - |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}| \le |{\mathbf{\mathcal{H}}}({\mathbf{w}} - {\mathbf{w}}^{i} )| $$

hold. Thus, we have

$$ \begin{gathered} \left\| {{\mathbf{w}} - {\mathbf{w}}^{i + 1} } \right\| \le ||\left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} ||\left\| {{\mathbf{\mathcal{H}}}({\mathbf{w}} - {\mathbf{w}}^{i} )} \right\| \hfill \\ \le ||\left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} ||\;||{\mathbf{\mathcal{H}}}^{t} ||\;\;\;||{\mathbf{\mathcal{H}}}||\left\| {({\mathbf{w}} - {\mathbf{w}}^{i} )} \right\|. \hfill \\ \end{gathered} $$

Since \( ||\left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} || \le C \) [43] and \( \left\| {\mathbf{\mathcal{H}}} \right\| = \sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} \), we get

$$ \left\| {{\mathbf{w}} - {\mathbf{w}}^{i + 1} } \right\| \le C\sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}\mathcal{H}}}^{t} )\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} \left\| {{\mathbf{w}} - {\mathbf{w}}^{i} } \right\| $$

and the result follows by our assumption.
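In practice, the sufficient condition of Theorem 1 can be checked numerically before running Algorithm 1; the lines below sketch such a check (for a nonzero \( {\mathbf{\mathcal{H}}} \), both maximum eigenvalues equal the squared largest singular value of \( {\mathbf{\mathcal{H}}} \), so the condition amounts to \( C\left\| {\mathbf{\mathcal{H}}} \right\|^{2} < 1 \)).

```python
import numpy as np

# Theorem 1 requires  sqrt(lmax(H H^t) * lmax(H^t H)) < 1/C
lmax_HHt = np.linalg.eigvalsh(H @ H.T).max()
lmax_HtH = np.linalg.eigvalsh(H.T @ H).max()
satisfies_theorem_1 = C * np.sqrt(lmax_HHt * lmax_HtH) < 1.0
```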

5.2 ELM for binary classification

Instead of imposing the zero training error condition, one may train ELM by allowing an acceptable minimum training error. In this case, the ELM in 2-norm can be formulated as a minimization problem in the primal of the form [18]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,{\varvec{\upxi}}}} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}{\varvec{\upxi}}^{t} {\varvec{\upxi}} $$

subject to

$$ {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}} \ge {\mathbf{e}} - {\varvec{\upxi}}, $$
(16)

which is very similar to the conventional SVM formulation (3) but without the bias term. Here \( {\varvec{\upxi}}\, \in R^{m} \) is a vector of slack variables and \( {\mathcal{D}} = diag(y_{1} , \ldots ,y_{m} ) \) is a diagonal matrix whose i-th diagonal element is \( y_{i} \).

At the solution of the above problem (16), the condition

$$ {\varvec{\upxi}} = ({\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } $$
(17)

is satisfied [25]. Therefore, using (17), the original ELM classification problem in 2-norm can be written as an unconstrained minimization problem of the following form

$$ \mathop {\hbox{min} }\limits_{{\mathbf{w}}} L_{C} ({\mathbf{w}}) = \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}\left\| {({\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } } \right\|^{2} . $$
(18)

Consider the problem of solving for \( {\mathbf{w}} \in R^{\ell } \) such that \( \nabla L_{C} ({\mathbf{w}}) = {\mathbf{0}} \). Since \( \nabla L_{C} ({\mathbf{w}}) = {\mathbf{w}} - C{\mathbf{\mathcal{H}}}^{t} {\mathcal{D}}({\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } \), using the identity \( {\mathbf{u}}_{ + } = \frac{{{\mathbf{u}} + |{\mathbf{u}}|}}{2} \) for any \( {\mathbf{u}} \in R^{m} \) and proceeding as in the regression case, we get the following condition to be satisfied by \( {\mathbf{w}} \in R^{\ell } \)

$$ \left( {\frac{I}{(C/2)} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right){\mathbf{w}} = {\mathbf{\mathcal{H}}}^{t} {\mathcal{D}}({\mathbf{e}} + |{\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}}|). $$

This leads to the following simple iterative scheme which will be our functional iterative ELM algorithm for classification (FELMC)

$$ {\mathbf{w}}^{i + 1} = \left( {\frac{I}{(C/2)} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} {\mathcal{D}}({\mathbf{e}} + |{\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} |)\quad{\text{for}}\ i = 0, \, 1, \, 2, \ldots $$
(19)

We summarize below the pseudo code of the iterative method (19) applied for solving the classification problem (18).

Algorithm 2 (FELMC).

Input.

  • Parameter value \( C > 0 \)

  • tol = tolerance value for learning accuracy, imax = maximum number of iterations

Step 1.

  • Set \( i = 0 \) and the initial vector \( {\mathbf{w}} = {\mathbf{w}}^{0} \) in \( R^{\ell } \)

Step 2.

  • Compute the matrix \( {\mathbf{\mathcal{G}}}\, = \,\left( {\frac{{\mathbf{I}}}{C/2} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{{\mathbf{t}}} {\mathbf{\mathcal{D}}} \)

  • Compute \( {\mathbf{w}}^{1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{e}}\, + |{\mathbf{e}} - {\mathbf{\mathcal{D}}}\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{0} |\,} \right) \)

Step 3.

  • While (\( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\,\, > \,\,tol \) & \( i\,\, < \,\, \) imax)

$$ i = i + 1 $$
$$ {\mathbf{w}}^{i + 1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{e}}\, + |{\mathbf{e}} - {\mathbf{\mathcal{D}}}\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{i} |\,} \right) $$

end while

  • $$ {\mathbf{w}} = {\mathbf{w}}^{i + 1} $$
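As with FELMR, Algorithm 2 translates directly into a few NumPy lines; the sketch below assumes that H and the label vector y with entries in {−1, +1} are given, and the names are our own.

```python
import numpy as np

def felmc(H, y, C, tol=1e-3, imax=20):
    """Functional iterative ELM for binary classification (FELMC), following iteration (19)."""
    m, L = H.shape
    D = np.diag(y.astype(float))
    # G = (I/(C/2) + H'H)^{-1} H' D, computed once (Step 2 of Algorithm 2)
    G = np.linalg.solve(np.eye(L) / (C / 2.0) + H.T @ H, H.T @ D)
    e = np.ones(m)
    w = np.zeros(L)                                   # initial vector w^0
    for _ in range(imax):
        w_new = G @ (e + np.abs(e - D @ (H @ w)))
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Class label of a test pattern x with hidden layer output h_x: np.sign(h_x @ w), cf. (6b)
```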

Following the steps of the proof of Theorem 1, the convergence of the proposed FELMC algorithm can be easily verified.

Theorem 2

Assume that the following condition holds:

$$ \sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}\mathcal{H}}}^{t} )\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} < \frac{2}{C}. $$

Then, for any starting vector \( {\mathbf{w}}^{0} \in R^{\ell } \), the sequence of iterates \( \{ {\mathbf{w}}^{i} \} \) generated by the FELMC iterative step (19) converges to the unique solution \( {\mathbf{w}} \) of (18) at a linear rate.

Finally, for any test pattern \( {\mathbf{x}} \in R^{n} \), its class label is predicted using the decision function (6b).

6 Numerical experiments and comparison of results

In this section, the performance of the proposed optimization method based ELM for regression and classification in the primal, solved by the functional iterative methods, is compared with that of SVM, ELM, OP-ELM and OB-ELM on a variety of real-world benchmark datasets.

All experiments were carried out in the MATLAB R2008a environment on a PC running 64-bit Windows XP with a 3.20 GHz Intel(R) Core(TM) 2 Duo processor and 4 GB of RAM.

The standard SVM and OB-ELM formulations were solved using the MOSEK optimization toolbox for MATLAB, available at http://www.mosek.com, and OP-ELM using the OP-ELM toolbox [26]. However, no external optimizer was used for solving ELM and the proposed FELMR and FELMC.

The performance of the proposed formulation has been tested with additive and RBF hidden nodes. In fact, the activation function \( G(\varvec{a},b,\varvec{x}) \) was chosen as the sigmoid function for additive nodes, whereas both the multiquadric and Gaussian functions were considered for RBF hidden nodes. In the case of the sigmoid and multiquadric activation functions, the hidden node parameters were chosen randomly with uniform distribution in [−0.5, 0.5], whereas for the Gaussian function they were chosen randomly from [0, 1]. Note that the input weights and biases of the hidden nodes were selected randomly at the beginning of the algorithm and remained fixed within each trial of the simulation.

All the numerical experiments on regression and classification datasets were performed after normalizing the original data by taking: \( \bar{x}_{ij} = \frac{{x_{ij} - x_{j}^{\hbox{min} } }}{{x_{j}^{\hbox{max} } - x_{j}^{\hbox{min} } }} \), where \( x_{j}^{\hbox{min} } = \mathop {\hbox{min} }\limits_{i = 1, \cdots ,m} (x_{ij} ) \) and \( x_{j}^{\hbox{max} } = \mathop {\hbox{max} }\limits_{i = 1, \cdots ,m} (x_{ij} ) \) denote the minimum and maximum values, respectively, of the j-th attribute over all the input examples \( \varvec{x}_{i} \) and \( \bar{x}_{ij} \) is the normalized value corresponding to \( x_{ij} \).
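A one-function NumPy sketch of this column-wise min–max normalization is given below (attributes that are constant over all examples would need separate handling, which is not addressed here).

```python
import numpy as np

def min_max_normalize(X):
    """Scale each attribute j as (x_ij - x_j^min) / (x_j^max - x_j^min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```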

6.1 Regression

In this sub-section, we demonstrate the effectiveness of the proposed FELMR for regression in comparison to SVR, ELM, OP-ELM and OB-ELM by performing nonlinear regression experiments on a number of benchmark datasets. They include: the inverse dynamics of a flexible robot arm from http://homes.esat.kuleuven.be/~smc/daisy/daisydata.html; the Box and Jenkins gas furnace dataset [5]; the Servo, Auto-MPG, Machine CPU, Concrete CS, Wine-quality red, Wine-quality white, Abalone and Parkinson datasets from the UCI repository [31]; Pollen grains, Bodyfat and Space_ga from the Statlib collection http://lib.stat.cmu.edu/datasets; Sunspots and SantaFeA from http://www.bme.ogi.edu/~ericwan/data.html; the financial time series datasets Citigroup, Intel, Microsoft, RedHat and Standard & Poor 500 (SNP500) from http://finance.yahoo.com; the hydraulic actuator dataset [11, 35]; and NO2, Bank-32fh, Demo and Kin-32fh from [8].

The 2-norm root mean square error (RMSE) was selected as the measure of prediction performance and it was calculated using the following formula: \( RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {(y_{i} - \tilde{y}_{i} )^{2} } } \), where y i and \( \tilde{y}_{i} \) are the observed and its corresponding predicted values respectively, and N is the number of test samples.

As the first example, the inverse dynamics of a flexible robot arm is estimated [36] as an interesting case of inverse system identification. It is assumed that the dynamics of the robot arm is a function of the measured values of the reaction torque of the structure u(t), whose corresponding output y(t) is the acceleration. As in [36], by setting \( {\mathbf{x}}(t) = (u\left( {t - 1} \right), \ldots ,u\left( {t - 5} \right),y\left( {t - 1} \right), \ldots ,y\left( {t - 4} \right))^{t} \) and \( x^{\text{out}} \left( t \right) = u\left( t \right) \), a model for the inverse identification problem was trained. The first 500 samples of the form \( \left( {{\mathbf{x}}\left( t \right),x^{\text{out}} \left( t \right)} \right) \) were used for training and the remaining 519 samples for testing. The prediction errors corresponding to the sigmoid, multiquadric and Gaussian functions on the test dataset are shown in Fig. 1a, b and c respectively.

Fig. 1 Prediction error over the test data for the flexible robot arm dataset

As the next practical example, the Box and Jenkins gas furnace dataset [5] was chosen. It consists of 296 pairs of values of the form \( \left( {u\left( t \right),y\left( t \right)} \right) \), where \( u\left( t \right) \) is the flow rate of the input gas and the output \( y\left( t \right) \) is the CO2 concentration from the gas furnace. The output \( y\left( t \right) \) is predicted based on six attributes of the form \( {\mathbf{x}}(t) = \left( {y\left( {t - 1} \right),y\left( {t - 2} \right),y\left( {t - 3} \right),u\left( {t - 1} \right),u\left( {t - 2} \right),u\left( {t - 3} \right)} \right) \) [40]. Of the resulting 293 samples \( \left( {{\mathbf{x}}\left( t \right),y\left( t \right)} \right) \), the first 100 were taken for training and the remaining samples for testing.
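The lagged regressors used in these system identification examples (and in the time series experiments below) can be built with a small helper such as the following sketch; the function name and argument conventions are our own, and u, y stand for the raw input/output sequences.

```python
import numpy as np

def lagged_regressors(u, y, u_lags, y_lags):
    """Build x(t) = (y(t-1),...,y(t-y_lags), u(t-1),...,u(t-u_lags)) with target y(t);
    u_lags = y_lags = 3 reproduces the 293 gas furnace samples. For the robot arm
    example, which predicts u(t), call with the roles of u and y swapped."""
    u, y = np.asarray(u, dtype=float), np.asarray(y, dtype=float)
    p = max(u_lags, y_lags)
    X = [np.r_[y[t - y_lags:t][::-1], u[t - u_lags:t][::-1]] for t in range(p, len(y))]
    return np.array(X), y[p:]
```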

As an important study, the problem of time series prediction was considered. Along with the popular Sunspots and SantaFeA time series datasets, the stock indices of Citigroup, Intel, Microsoft, RedHat and SNP500 were considered as examples of financial time series. For each, 755 closing stock prices from 01-01-2006 to 31-12-2008 were selected. Assuming that the current stock index is predicted using its previous five values, we get 750 samples in total. The first 200 were taken for training and the remaining 550 for testing.

Assuming different sampling rates τ = 0.05 and τ = 0.20 with parameter values ρ = 10, r = 28 and b = 8/3, two time series datasets, Lorenz 0.05 and Lorenz 0.20, were generated from the values of the variable x of the Lorenz differential equations [29, 30]: \( \dot{x} = \rho (y - x),\dot{y} = rx - y - xz \) and \( \dot{z} = xy - bz \), using the fourth-order Runge–Kutta method. After discarding the first 1,000 values to avoid initial transients, the next 3,000 values were used in our experiment. Five previous values were used to predict the current value. The first 400 samples were taken for training and the rest for testing.
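A fixed-step fourth-order Runge–Kutta sketch for generating these series is shown below; taking the integration step equal to the sampling rate τ and the particular initial state are our assumptions, since the paper does not specify them.

```python
import numpy as np

def lorenz_series(tau, n_total=4000, rho=10.0, r=28.0, b=8.0 / 3.0, s0=(1.0, 1.0, 1.0)):
    """x-component of the Lorenz system, integrated with fixed-step RK4 of step tau."""
    def f(s):
        x, y, z = s
        return np.array([rho * (y - x), r * x - y - x * z, x * y - b * z])
    s = np.array(s0, dtype=float)          # arbitrary initial state (an assumption)
    out = np.empty(n_total)
    for i in range(n_total):
        k1 = f(s)
        k2 = f(s + 0.5 * tau * k1)
        k3 = f(s + 0.5 * tau * k2)
        k4 = f(s + tau * k3)
        s = s + (tau / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = s[0]
    return out[1000:]                      # discard the first 1,000 values (transients)
```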

Finally, experiments were conducted on two more time series datasets, denoted MG 17 and MG 30, generated using the Mackey–Glass time delay differential equation [29, 30]: \( \frac{dx(t)}{dt} = - 0.1x(t) + \frac{0.2x(t - \tau )}{{1 + x(t - \tau )^{10} }},\) corresponding to the time delays τ = 17 and 30, respectively. As before, five previous values were used to predict the current value. Of the 1,495 samples obtained, the first 500 were used for training and the rest for testing.
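For completeness, the Mackey–Glass series can be generated along the following lines; the fixed-step Euler scheme, the step size dt and the constant initial history are our assumptions, as the paper does not state how this delay differential equation was integrated.

```python
import numpy as np

def mackey_glass(tau, n_samples, dt=0.1, x_hist=1.2):
    """dx/dt = -0.1 x(t) + 0.2 x(t - tau) / (1 + x(t - tau)^10), sampled at unit intervals."""
    d = int(round(tau / dt))                  # delay expressed in integration steps
    steps_per_sample = int(round(1.0 / dt))
    x = [x_hist] * (d + 1)                    # constant initial history (an assumption)
    out = []
    while len(out) < n_samples:
        for _ in range(steps_per_sample):
            x.append(x[-1] + dt * (-0.1 * x[-1]
                                   + 0.2 * x[-1 - d] / (1.0 + x[-1 - d] ** 10)))
        out.append(x[-1])
    return np.array(out)
```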

We also considered the hydraulic actuator dataset, a benchmark dataset for nonlinear system identification [11, 35]. It consists of 1,024 pairs of values \( \left( {u\left( t \right),y\left( t \right)} \right) \), where \( u\left( t \right) \) denotes the size of the valve through which oil flows into the actuator and \( y\left( t \right) \) is the oil pressure. Considering the same multi-dimensional regression model as in [11, 35], the output \( y\left( t \right) = f({\mathbf{x}}\left( t \right)) \) is predicted by setting \( {\mathbf{x}}(t) = (y\left( {t - 1} \right),y\left( {t - 2} \right),y\left( {t - 3} \right),u\left( {t - 1} \right),u\left( {t - 2} \right))^{t} \). From the resulting 1,021 samples of the form \( \left( {{\mathbf{x}}\left( t \right), y\left( t \right)} \right) \), the first 511 samples were chosen for training and the remaining 510 samples for testing.

In the implementation of SVR, ε = 0.01 was assumed. The Gaussian nonlinear kernel function \( k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = \exp \left( { - \mu \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|^{2} } \right) \) for i, j = 1,…,m was taken, where μ > 0 is a kernel parameter. The optimal values of the regularization parameter C and the kernel parameter μ were chosen by varying them over the predefined sets \( \{ 2^{ - 10} , \ldots ,2^{10} \} \) and \( \{ 2^{ - 5} , \ldots ,2^{5} \} \) respectively and applying the tenfold cross-validation methodology on the training dataset. In the case of ELM, the optimal value of the single parameter \( \ell \) was chosen from the set \( \{ \,10,\,20,\,50,\,100,\,200\,\} \), whereas for OP-ELM it was selected from the set \( \{ \,10,\,20,\,50,\,80,\,100\,\} \). For OB-ELM, the optimal values of C and \( \ell \) were chosen from \( \{ \,2^{ - 15} ,\, \ldots ,\,\,2^{20} \,\} \) and \( \{ \,10,\,20,\,50,\,100,\,200\,\} \) respectively. Finally, using these optimal parameter values, the RMSE on the test set was calculated.

In the case of FELMR, it was observed from the experiments that better generalization performance could be achieved for small to moderate values of \( \ell \). Further, since a large value of \( \ell \) results in increased computational time, we assumed \( \ell \in \{ 10,20,50,100,200\} \). Again, by varying the parameter C over \( \{ 2^{ - 10} , \ldots ,2^{20} \} \), the optimal values of C and \( \ell \) were obtained using tenfold cross-validation. Using the optimal values, the average test accuracy for each dataset was computed by performing 30 independent trials. As the proposed FELMR is an iterative method, the termination condition was chosen to be \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\, < \,\,10^{ - 6} \), and the maximum number of iterations was set to 20.

Although the performance of SVR is sensitive to its parameters C and μ, it is known that ELM is not sensitive to the choice of the parameter \( \ell \) [20]. After extensive simulations, it was found that FELMR is also not very sensitive to the user specified parameters. To illustrate this, the performance of FELMR with the sigmoid additive hidden node and with the multiquadric and Gaussian RBF hidden nodes on the Machine CPU and Bank-32fh datasets is shown in Figs. 2 and 3, respectively. Note also from the figures that better accuracy could be achieved for small to moderate values of \( \ell \) but medium to large values of C.

Fig. 2 Insensitivity performance of FELMR for regression to the user specified parameters (C, \( \ell \)) on the machine CPU dataset

Fig. 3 Insensitivity performance of FELMR for regression to the user specified parameters (C, \( \ell \)) on the bank-32fh dataset

In order to test the convergence of the proposed FELMR algorithm empirically for both additive and RBF hidden nodes in terms of the number of iterations, the values of \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} || \) were computed as the error of convergence, and the results are plotted for the Kin-32fh and Wine-quality white datasets in Fig. 6a and b respectively. One can observe from the figures that the rate of convergence of FELMR is impressively fast.

For all the regression datasets considered, the number of training and test samples chosen, the number of attributes, the optimal parameter values determined using tenfold cross-validation and the accuracies obtained by FELMR, OB-ELM, OP-ELM, ELM and SVR on the test sets are summarized in Table 1. One can observe from the table that the training time of FELMR is very close to that of ELM and OP-ELM, and much shorter than that of SVR and OB-ELM. The best accuracy is obtained by SVR, ELM, OP-ELM, OB-ELM and FELMR on 5, 3, 7, 5 and 12 datasets respectively, which indicates the effectiveness of FELMR.

Table 1 Performance comparison of FELMR with ELM, OB-ELM having sigmoid, multiquadric and Gaussian functions, OP-ELM having sigmoid function and SVR using Gaussian kernel for regression

To further analyze the comparative performance of FELMR with SVR, ELM, OP-ELM and OB-ELM, the average ranks of all the algorithms based on the RMSE values were computed and are listed in Table 2. One can clearly observe that, for ELM, OB-ELM and FELMR, the performance in terms of average rank with additive hidden nodes is better than with RBF hidden nodes.

Table 2 Average ranks of SVR, ELM, OP-ELM, OB-ELM and FELMR on RMSE values for regression

For the statistical comparison of the performance of the 11 algorithms over the 29 regression datasets, we use the Friedman test with the corresponding post hoc test recommended in [9], since it is a simple, yet safe and robust, non-parametric test. Under the null hypothesis that all the algorithms are equivalent, the Friedman statistic is computed from Table 2 as

$$ \begin{gathered} \chi_{F}^{2} = \frac{12 \times 29}{11 \times 12}[(4.3276^{2} + 4.7586^{2} + 7.8103^{2} + 7.0862^{2} + 6.1036^{2} + 5.7414^{2} + 6.7759^{2} \hfill \\ \, + \;6.7241^{2} + 4.8103^{2} + 5.4310^{2} + 6.4310^{2} )\,\, - \,\,\frac{{11 \times 12^{2} }}{4}] = \,\,31.4356 \hfill \\ F_{F} = \frac{28 \times 31.4356}{29 \times 10\,\, - \,\,31.4356} = \,\,3.4042. \hfill \\ \end{gathered} $$

With 11 algorithms and 29 datasets, \( F_{F} \) is distributed according to the \( F \) distribution with \( (\,11 - 1, \) \( (11 - 1) \times (29 - 1)\,) = (\,10,\,280\,) \) degrees of freedom. The critical value of \( F(10,\,280) \) is 1.8646 for \( \alpha = 0.05 \). Since \( F_{F} \, > \,1.8646 \), we reject the null hypothesis and proceed with the Nemenyi post hoc test for pairwise comparison of the algorithms. According to [9], the critical difference (CD) at \( p = 0.10 \) is \( 2.9778\,\sqrt {\frac{11 \times 12}{6 \times 29}} = 2.5936 \).
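The Friedman statistic, its F approximation and the Nemenyi critical difference quoted above can be reproduced directly from the average ranks of Table 2, as the sketch below shows; the value \( q_{0.10} = 2.9778 \) for 11 algorithms is taken from [9], and the use of scipy for the critical F value is only for illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

# Average ranks of the 11 algorithms over the N = 29 regression datasets (Table 2)
R = np.array([4.3276, 4.7586, 7.8103, 7.0862, 6.1036, 5.7414,
              6.7759, 6.7241, 4.8103, 5.4310, 6.4310])
N, k = 29, 11

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)  # ~31.44
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                            # ~3.40
F_crit = f_dist.ppf(0.95, k - 1, (k - 1) * (N - 1))                        # ~1.86
reject_null = F_F > F_crit

# Nemenyi critical difference at p = 0.10
CD = 2.9778 * np.sqrt(k * (k + 1) / (6 * N))                               # ~2.59
```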

The pairwise comparisons proceed as follows:

  i) Since the difference between the worst FELMR variant and SVR \( (6.4310 - 4.3276 = 2.1034) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

  ii) For the comparison of FELMR with ELM, we proceed as follows:

    (a) FELMR versus ELM with the sigmoid function: since the difference between the worst FELMR variant and ELM with the sigmoid function \( (6.4310 - 4.7586 = 1.6724) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

    (b) FELMR with the sigmoid function versus ELM with the multiquadric function: since the difference \( (7.8103 - 4.8103 = 3.0) \) is greater than 2.5936, the performance of FELMR with the sigmoid function is significantly better than that of ELM with the multiquadric function.

    (c) FELMR with the multiquadric and Gaussian functions versus ELM with the multiquadric function: since the difference between the best of these FELMR variants and ELM with the multiquadric function \( (7.8103 - 5.4310 = 2.3793) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

    (d) FELMR versus ELM with the Gaussian function: since the difference between the best FELMR variant and ELM with the Gaussian function \( (7.0862 - 4.8103 = 2.2759) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

  iii) For the comparison of FELMR with OP-ELM and OB-ELM, the difference between the best and worst of these algorithms \( (6.7759 - 4.8103 = 1.9656) \) is smaller than 2.5936, so the post hoc test is unable to detect any significant difference between the algorithms.

6.2 Classification

In order to verify the effectiveness of the proposed FELMC method for classification, its performance was compared with that of SVM, ELM, OP-ELM and OB-ELM on 14 binary classification datasets, all taken from the UCI repository [31].

For the implementation of SVM, the Gaussian nonlinear kernel function with parameter σ > 0, of the form \( k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = \exp ( - \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|^{2} /(2\sigma^{2} )) \) for i, j = 1,…,m, was assumed. The optimal values of C and σ were chosen from predefined sets of values by applying the tenfold cross-validation methodology; in fact, we assumed \( C \in \{ 10^{ - 5} , \ldots ,10^{5} \} \) and \( \sigma \in \{ 2^{ - 5} , \ldots ,2^{5} \} \). In the case of ELM, the optimal value of the single parameter \( \ell \) was chosen from the set \( \{ 10,50,100,200,500\} \), whereas for OP-ELM it was selected from the set \( \{ \,10,\,20,\,50,\,80,\,100\,\} \). Using the optimal parameter values determined by tenfold cross-validation, the prediction accuracy was calculated on the test set.

For FELMC it was observed that good generalization performance might be achieved, in general, for medium to large values of \( \ell \). However, since an increase in the number of hidden nodes results in increased computational time, we assumed \( \ell \in \{ 10,50,100,200,500\} \) for both FELMC and OB-ELM. Also, by varying the parameter C over \( \{ 2^{ - 10} , \ldots ,2^{15} \} \), the optimal values of C and \( \ell \) were obtained using tenfold cross-validation. The average test accuracy was computed by performing 30 independent trials. Since FELMC is an iterative method, the termination condition was taken as \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\, < \,\,10^{ - 3} \), and the maximum number of iterations was set to 20.

As in the case of regression, it was observed that the performance of FELMC is not sensitive to the values of its parameters C and \( \ell \). To verify this, its performance with additive and RBF hidden nodes on the Breast-cancer and German credit datasets is illustrated in Figs. 4 and 5. From the figures, one can observe that better accuracy results for medium to large values of C.

Fig. 4 Insensitivity performance of FELMC for classification to the user specified parameters (C, \( \ell \)) on the breast-cancer dataset

Fig. 5 Insensitivity performance of FELMC for classification to the user specified parameters (C, \( \ell \)) on the German dataset

As in the case of regression, the convergence of the proposed FELMC algorithm was examined by computing the values of \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} || \) as the error of convergence in terms of the number of iterations, and the results are shown for the Breast cancer and Cleve datasets in Fig. 6c and d, respectively. For these classification datasets, the error of convergence becomes stable by the 20th iteration. Also observe that, unlike the regression case, the rate of convergence of FELMC is slower.

Fig. 6 Error of convergence versus number of iterations of FELM

For all the classification datasets considered, the number of training and test samples chosen, the number of attributes, the optimal parameter values determined using tenfold cross-validation and the accuracies obtained by SVM, ELM, OP-ELM, OB-ELM and FELMC on the test sets are summarized in Table 3. One can conclude from the table that the proposed ELM model trained with the functional iterative method shows generalization performance comparable to that of the rest of the methods considered. Again from Table 3, note that the best accuracy is obtained by SVM, ELM, OP-ELM, OB-ELM and FELMC on 1, 3, 0, 1 and 10 datasets respectively, which indicates the superiority of FELMC. In terms of training time, as expected, ELM and OP-ELM outperform the other methods considered. As in the case of regression, the training time of FELMC is very close to that of ELM and OP-ELM and much shorter than that of SVM and OB-ELM. Note that, in the case of FELMC, the multiquadric function shows the best performance among the activation functions.

Table 3 Performance comparison of FELMC with ELM and OB-ELM having sigmoid, multiquadric and Gaussian functions, OP-ELM having sigmoid function and SVM using Gaussian kernel for binary classification

For the statistical comparison of the performance of the 11 algorithms over the 14 classification datasets, we use, as in the case of regression, the Friedman test with the post hoc test recommended in [9]. For this purpose, the average ranks of all the algorithms based on the prediction accuracy values were computed and are listed in Table 4. Under the null hypothesis that all the algorithms are equivalent, the Friedman statistic is computed as

$$ \begin{gathered} \chi_{F}^{2} = \frac{12 \times 14}{11 \times 12}[(6.3214^{2} + 6.8571^{2} + 5.5^{2} + 7.1071^{2} + 5.6071^{2} + 5.3214^{2} + 6.5^{2} \hfill \\ \quad + 7.75^{2} + 5.4286^{2} + 4.1429^{2} + 5.4643^{2}) - \frac{{11 \times 12^{2} }}{4}] = 13.1117 \hfill \\ F_{F} = \frac{13 \times 13.1117}{14 \times 10 - 13.1117} = 1.3433. \hfill \\ \end{gathered} $$

Table 4 Average ranks of SVM, ELM, OP-ELM, OB-ELM and FELMC on accuracy values for binary classification

In this case, \( F_{F} \) is distributed according to the \( F \) distribution with \( (\,10,\,130\,) \) degrees of freedom. The critical value of \( F(10,\,130) \) is 1.9042 for \( \alpha = 0.05 \). Since the value of \( F_{F} \) is smaller than the corresponding critical value, no significant difference between the algorithms can be detected.

7 Conclusion

In this work, a simple and novel functional iterative method for the solution of the optimization based ELM in its primal form for regression and classification has been proposed, and its linear convergence has been proved. Numerical experiments were performed with sigmoid and RBF hidden nodes on a number of real-world benchmark datasets, and the results were compared with those of SVM, ELM, OP-ELM and OB-ELM for regression and classification. The comparable generalization performance of the proposed approach with respect to the other methods considered, achieved at a faster learning speed than SVM and OB-ELM, indicates its usefulness and applicability. Future work will focus on the study of ELM in dual variables and its applications.