1 Introduction

Recently, a new learning algorithm for the architecture of single hidden layer feedforward neural networks (SLFNs), called the extreme learning machine (ELM), was proposed in [21] to overcome many of the problems of traditional feedforward neural network learning algorithms, such as the presence of local minima, an imprecise learning rate, overfitting and a slow rate of convergence. Once the input weights and hidden layer biases have been chosen randomly, ELM determines the unknown output weight vector of the network having the smallest norm by solving a system of linear equations. ELM is a simple unified algorithm for regression, binary and multiclass problems, and it has been successfully tested on benchmark problems of practical importance. It was initially proposed for SLFNs and later extended to “generalized” SLFNs whose hidden nodes need not be neuron-like [15, 16]. The essence of ELM is that the hidden layer of the SLFN need not be tuned. The growing popularity of ELM [3, 4, 10, 14, 22, 27, 28, 33, 34, 38, 39, 42] stems from its better generalization performance and much faster learning speed in comparison to traditional computational intelligence techniques [19].

The main problem with ELM is the stochastic nature of the hidden layer output matrix, which in practice may lower its learning accuracy [6]. Further, it was observed that a large number of hidden nodes might be required to achieve an acceptable level of performance [23, 44], which increases the problem size and therefore the computational cost. This suggests looking for compact networks that can still achieve good generalization performance [23, 41, 44]. The other issue with ELM is that the number of hidden nodes is a user-specified parameter, often chosen by trial and error. Two heuristic approaches, namely constructive [12, 15–17] and pruning [26] methods, have been proposed in the literature to address this problem. In fact, the optimally pruned ELM (OP-ELM) method proposed in [26] selects the hidden neurons by applying a 1-norm criterion to their outputs and then computes their weights using the classical least squares method. For an interesting study of ELM in 1-norm resulting in a sparse model representation, see [1].

By applying ELM kernels instead of support vector machine (SVM) kernels in the SVM formulation [7, 37], it was shown in [13, 24] that better generalization can be achieved. In [18], optimization based ELMs (OB-ELMs) for classification were proposed as an extension to support vector networks in which, as in SVM, the training error and the norm of the output weight vector are minimized. Further, it was observed that the proposed method tends to achieve better generalization performance than SVM and is less sensitive to the input parameters. For an interesting study on an extension of ELM to least squares SVM (LS-SVM) and proximal SVM (PSVM) as a learning algorithm providing a unified solution, see the work of Huang et al. [20]. Finally, for an excellent survey on ELM, we refer the reader to [19].

Motivated by the works of [18, 20], an enhanced optimization based ELM is proposed in its primal form, whose solution is obtained by solving an absolute value equation with a simple functional iterative method. Our formulation finds the approximate optimal solution directly in the primal space rather than in the dual space. The effectiveness of the proposed method for regression and binary classification problems is demonstrated by performing numerical experiments on a number of real-world datasets and comparing the results with those of SVM, ELM, OP-ELM and OB-ELM.

The paper is organized as follows. Section 2 briefly reviews SVM for regression and classification. In Sect. 3, a short introduction to ELM is given. The OB-ELMs for regression and classification are introduced in Sect. 4. The proposed optimization based ELM formulation, whose solution is obtained by applying a functional iterative method, is described in Sect. 5. In Sect. 6, numerical experiments performed on a number of benchmark real-world datasets with additive and radial basis function (RBF) hidden nodes are reported, and the results are compared with those of SVM, ELM, OP-ELM and OB-ELM. We conclude our work in Sect. 7.

Throughout this paper, all vectors are taken as column vectors. For any two vectors \( {\mathbf{x}},{\mathbf{y}} \in R^{n} \), their inner product is denoted by \( {\mathbf{x}}^{t} {\mathbf{y}} \), where \( {\mathbf{x}}^{t} \) is the transpose of \( {\mathbf{x}} \). The 2-norm of a vector \( {\mathbf{x}} \) is denoted by \( \left\| {\bf x} \right\| \). For any vector \( {\bf x} = (x_{1} , \ldots ,x_{n} )^{t} \in R^{n} \), let \( {\mathbf{x}}_{ + } \) be the vector in \( R^{n} \) obtained by setting all the negative components of \( {\mathbf{x}} \) to zero, and let \( |\,{\bf x}\,| = (\,|x_{1} |, \ldots ,|x_{n} |\,)^{t} \in R^{n} \). The zero vector and the column vector of ones of dimension m are denoted by 0 and e respectively. For any real matrix \( {\mathbf{\mathcal{H}}}\, \in R^{m \times \ell } \), its transpose is denoted by \( {\mathbf{\mathcal{H}}}^{t} \). For any symmetric matrix \( {\mathbf{\mathcal{A}}} \), its maximum eigenvalue is denoted by \( \lambda_{\hbox{max} } ({\mathbf{\mathcal{A}}}) \). The identity matrix of appropriate size is denoted by I. Further, diag(\( {\mathbf{x}} \)) denotes the diagonal matrix of order n whose diagonal elements are the components of the vector \( {\mathbf{x}} \in R^{n} \). If f is a real valued function of the variable \( {\mathbf{x}} = (x_{1} , \ldots ,x_{n} )^{t} \in R^{n} \), its gradient is denoted by \( \nabla f = (\partial f/\partial x_{1} , \ldots ,\partial f/\partial x_{n} )^{t} \).

2 Support vector machines for regression and classification

In this section, we briefly describe the standard support vector machines for regression and binary classification problems.

2.1 Regression formulation

Assume that a set of training examples \( \{ ({\bf x}_{i} ,y_{i} )\}_{i = 1, \ldots ,m} \) is given, where each input example \( {\bf x}_{i} = (x_{i1} , \ldots ,x_{in} )^{t} \in R^{n} \) has the corresponding observed value \( y_{i} \in R \). By mapping the input examples \( {\bf x}_{i} \in R^{n} \) into a higher dimensional feature space via a nonlinear map \( \varphi (.) \), a linear regression model \( f(.) \) of the following form is fitted in the feature space

$$ f({\bf x}) = {\bf w}^{t} \varphi ({\bf x}) + b . $$
(1)

The goal of nonlinear support vector regression (SVR) is to determine the unknowns w and b of (1) as the solution of the constrained minimization problem [7, 37]

$$ \mathop {{\mathbf{min}}}\limits_{{{\mathbf{w}},b,\xi_{1} ,\xi_{2} }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C({\mathbf{e}}^{t} {\varvec{\upxi}}_{1} + {\mathbf{e}}^{t} {\varvec{\upxi}}_{2} ) $$

subject to

$$ \begin{gathered} y_{i} - {\mathbf{w}}^{t} \varphi ({\mathbf{x}}_{i} ) - b \le \varepsilon + \xi_{1i} , \hfill \\ {\mathbf{w}}^{t} \varphi ({\mathbf{x}}_{i} ) + b - y_{i} \le \varepsilon + \xi_{2i} , \hfill \\ \xi_{1i} \ge 0\ {\text{and}}\ \xi_{2i} \ge 0\quad {\text{for}}\ i = 1, 2, \ldots ,m, \hfill \\ \end{gathered} $$
(2)

where C > 0, \( \varepsilon > 0 \) are input parameters and \( {\varvec{\upxi}}_{1} = (\xi_{11} , \ldots ,\xi_{1m} )^{t} \),\( {\varvec{\upxi}}_{2} = (\xi_{21} , \ldots ,\xi_{2m} )^{t} \in R^{m} \) are vectors of slack variables.

Instead of solving the primal problem (2), by introducing Lagrange multipliers \( \varvec{\alpha}_{1} = (\alpha_{11} , \ldots ,\alpha_{1m} )^{t} \), \( \varvec{\alpha}_{2} = (\alpha_{21} , \ldots ,\alpha_{2m} )^{t} \) \( \in R^{m} \) and applying the kernel trick, its dual problem of the following form is solved [7, 37]

$$ \mathop {\text{min}}\limits_{{\varvec{\alpha}_{1} ,\varvec{\alpha}_{2} }} \frac{1}{2}\sum\limits_{i,j = 1}^{m} {(\alpha_{1i} - \alpha_{2i} )k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} )(\alpha_{1j} - \alpha_{2j} ) + } \varepsilon \sum\limits_{i = 1}^{m} {(\alpha_{1i} + \alpha_{2i} ) - \sum\limits_{i = 1}^{m} {y_{i} (\alpha_{1i} - \alpha_{2i} )} } $$

subject to

$$ \sum\limits_{i = 1}^{m} {(\alpha_{1i} - \alpha_{2i} ) = 0} \;{\text{and}}\;{\mathbf{0}} \le {\varvec{\upalpha}}_{1} ,{\varvec{\upalpha}}_{2} \le C\,{\mathbf{e}}, $$

where \( k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) \) is a kernel function that replaces the term \( \varphi ({\mathbf{x}}_{i} )^{t} \varphi ({\mathbf{x}}_{j} ) \).

Finally, for any \( {\bf x} \in R^{n} \) its prediction by the fitted regression function f(.) is given by [7, 37]

$$ f({\mathbf{x}}) = \sum\limits_{i = 1}^{m} {(\alpha_{1i} - \alpha_{2i} )k({\mathbf{x}},{\mathbf{x}}_{i} )} + b . $$

2.2 Classification formulation

Let \( \{ ({\mathbf{x}}_{i} ,y_{i} )\}_{i = 1, \ldots ,m} \) be a given set of training examples in which each input point \( {\mathbf{x}}_{i} = (x_{i1} , \ldots ,x_{in} )^{t} \in R^{n} \) has the corresponding class label \( y_{i} = \pm 1 \). It is well known that SVM constructs a hyperplane of the form \( {\mathbf{w}}^{t} \varphi ({\mathbf{x}}) + b \) by maximizing the margin between the input points of the positive and negative classes in the feature space and, at the same time, minimizing the sum of the training errors. In fact, the unknowns w and b are obtained as the solution of the following constrained minimization problem [7, 37]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},b,\xi }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C{\mathbf{e}}^{t} {\varvec{\upxi}} $$

subject to

$$ \begin{gathered} y_{i} ({\mathbf{w}}^{t} \varphi ({\mathbf{x}}_{i} ) + b) \ge 1 - \xi_{i} , \hfill \\ \xi_{i} \ge 0\quad {\text{for}}\ i = 1,2, \ldots ,m, \hfill \\ \end{gathered} $$
(3)

where C > 0 is the regularization parameter and \( {\varvec{\upxi}} = (\xi_{1} , \ldots ,\xi_{m} )^{t} \in R^{m} \) is a vector of slack variables. Its dual can be written in the form [7, 37]

$$ \mathop {\hbox{min} }\limits_{{{\varvec{\upalpha}} \in R^{m} }} \frac{1}{2}\sum\limits_{i,j = 1}^{m} {\alpha_{i} \alpha_{j} y_{i} y_{j} k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} )} - \sum\limits_{i = 1}^{m} {\alpha_{i} } $$

subject to

$$ \sum\limits_{i = 1}^{m} {\alpha_{i} y_{i} } = 0\;{\text{and}}\;{\mathbf{0}} \le {\varvec{\upalpha}} \le C\,{\mathbf{e}}, $$

where k(.,.) is a kernel function and \( {\varvec{\upalpha}} = (\alpha_{1} , \ldots ,\alpha_{m} )^{t} \in R^{m} \) is the Lagrange multiplier vector. In this case, the decision function f(.) will become

$$ f({\mathbf{x}}) = {\text{sign}}(\sum\limits_{j = 1}^{m} {\alpha_{j} y_{j} k({\mathbf{x}},{\mathbf{x}}_{j} ) + b} )\;{\text{for}}\;{\mathbf{x}} \in R^{n} . $$
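For concreteness, the dual problems above can be solved with any standard SVM package. The following sketch (in Python with scikit-learn, rather than the MOSEK/MATLAB setup used later in Sect. 6) illustrates Gaussian kernel SVR and SVM models corresponding to these formulations; the arrays X_train, y_train, labels_train and X_test are placeholders and the parameter values are illustrative only.

```python
from sklearn.svm import SVR, SVC

# Nonlinear SVR with the Gaussian kernel k(x_i, x_j) = exp(-mu ||x_i - x_j||^2);
# scikit-learn's `gamma` plays the role of the kernel parameter mu.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.01, gamma=1.0)
svr.fit(X_train, y_train)            # y_train: real-valued targets
y_pred = svr.predict(X_test)         # fitted regression function f(x)

# Soft-margin SVM classifier with the same kernel; labels in {-1, +1}
svc = SVC(kernel="rbf", C=1.0, gamma=1.0)
svc.fit(X_train, labels_train)
labels_pred = svc.predict(X_test)    # sign of the decision function
```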

3 Extreme learning machine method

ELM is a unified learning approach for regression and classification problems proposed for training SLFNs. It is based on a least squares solution and achieves generalization performance comparable to that of the traditional SVM at a much faster learning speed.

Assume that a set of training examples \( \{ ({\mathbf{x}}_{i} ,y_{i} )\}_{i = 1, \ldots ,m} \) is given, where each input example \( {\mathbf{x}}_{i} = (x_{i1} , \ldots ,x_{in} )^{t} \in R^{n} \) has the corresponding target value \( y_{i} \in R \). For randomly assigned values of the learning parameters \( {\mathbf{a}}_{s} = (a_{s1} , \ldots ,a_{sn} )^{t} \in R^{n} \) and \( b_{s} \in R \) of the hidden nodes, ELM determines its output function f(.) such that

$$ f({\mathbf{x}}_{i} ) = \sum\limits_{s = 1}^{\ell } {w_{s} G({\mathbf{a}}_{s} ,b_{s} ,{\mathbf{x}}_{i} )} = y_{i} \quad {\text{for}}\ i = 1, \ldots ,m, $$
(4)

where the hidden layer output function G(a,b,x) is a nonlinear piecewise continuous function satisfying the conditions of the universal approximation capability theorems [17] and \( {\mathbf{w}} = (w_{1} , \ldots ,w_{\ell } )^{t} \in R^{\ell } \) is the unknown output weight vector connecting the hidden layer of ℓ nodes with the output node. The above condition (4) can be written in matrix form as

$$ {\mathbf{\mathcal{H}}}{\mathbf{w}} = {\mathbf{y}}, $$
(5)

where

$$ {\mathbf{\mathcal{H}}} = \left[ {\begin{array}{*{20}c} {G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{1} )} & \cdots & {G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{1} )} \\ \vdots & \ddots & \vdots \\ {G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{m} )} & \cdots & {G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{m} )} \\ \end{array} } \right]_{m \times \ell } $$

is the hidden layer output matrix of the network and \( {\mathbf{y}} = (y_{1} , \ldots ,y_{m} )^{t} \in R^{m} \) is the vector of observed values.

Some of the well-known hidden layer output functions are

  a) Sigmoid function: \( G({\mathbf{a}},b,{\mathbf{x}}) = \frac{1}{{1 + \exp ( - ({\mathbf{a}}^{t} {\mathbf{x}} + b))}}, \)

  b) Multiquadric function: \( G({\mathbf{a}},b,{\mathbf{x}}) = \left( {\left\| {{\mathbf{x}} - {\mathbf{a}}} \right\|^{2} + b^{2} } \right)^{1/2} , \)

  c) Gaussian function: \( G({\mathbf{a}},b,{\mathbf{x}}) = \exp \left( { - b\left\| {{\mathbf{x}} - {\mathbf{a}}} \right\|^{2} } \right). \)

Clearly, for randomly assigned values of the parameters \( {\mathbf{a}}_{s} \in R^{n} ,b_{s} \in R \) and a predefined hidden layer output function \( G({\mathbf{a}},b,{\mathbf{x}}) \), training the SLFN is equivalent to obtaining a least squares solution \( {\mathbf{w}} \in R^{\ell } \) of the rectangular linear system (5). In fact, \( {\mathbf{w}} \in R^{\ell } \) can be obtained explicitly as the smallest norm least squares solution of (5) [21]: \( {\mathbf{w}} = {\mathbf{\mathcal{H}}}^{\dag } {\mathbf{y}} \), where \( {\mathbf{\mathcal{H}}}^{\dag } \) is the Moore–Penrose generalized inverse [32] of the matrix \( {\mathbf{\mathcal{H}}} \).

Finally, with the solution \( {\mathbf{w}} \in R^{\ell } \) obtained, the fitted model f(.) for ELM regression is taken as

$$ f({\mathbf{x}}) = \sum\limits_{s = 1}^{\ell } {w_{s} G({\mathbf{a}}_{s} ,b_{s} ,{\mathbf{x}})} . $$
(6a)

For binary classification problems, the ELM decision function is, however, defined by

$$ f({\mathbf{x}}) = {\text{sign}}\left(\sum\limits_{s = 1}^{\ell } {w_{s} G({\mathbf{a}}_{s} ,b_{s} ,{\mathbf{x}})}\right). $$
(6b)

Note that, once the values of the weight vectors \( {\mathbf{a}}_{s} \in R^{n} \) and the biases \( b_{s} \in R \) are randomly assigned at the beginning of the learning algorithm, they remain fixed and therefore the matrix \( {\mathbf{\mathcal{H}}} \) remains unchanged.
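To make the above computation concrete, a minimal NumPy sketch of ELM training and prediction is given below, assuming sigmoid additive hidden nodes with parameters drawn uniformly from [−0.5, 0.5] (the range used in the experiments of Sect. 6); the function and variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid_hidden_layer(X, A, b):
    """Hidden layer output matrix: H[i, s] = G(a_s, b_s, x_i) = 1/(1 + exp(-(a_s' x_i + b_s)))."""
    return 1.0 / (1.0 + np.exp(-(X @ A.T + b)))

def elm_train(X, y, n_hidden):
    """Random hidden layer plus smallest norm least squares output weights."""
    A = rng.uniform(-0.5, 0.5, size=(n_hidden, X.shape[1]))  # random input weights a_s
    b = rng.uniform(-0.5, 0.5, size=n_hidden)                # random biases b_s
    H = sigmoid_hidden_layer(X, A, b)
    w = np.linalg.pinv(H) @ y      # Moore-Penrose solution of H w = y, as in (5)
    return A, b, w

def elm_predict(X, A, b, w):
    """Fitted model (6a); take np.sign(.) of the result for binary classification (6b)."""
    return sigmoid_hidden_layer(X, A, b) @ w
```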

4 Optimization method based ELM

Recently, Huang et al. [18] studied ELM for classification from the viewpoint of optimization methods by extending ELM to a support vector network model and showing that the minimum norm property of ELM and the maximum margin between classes in SVM are in fact consistent.

In this section, we briefly summarize the formulations for optimization method based ELM (OB-ELM) for regression and classification problems.

The mathematical formulation of the optimization method based \( \varepsilon - \)ELM for regression can be stated as a minimization problem

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,{\varvec{\upxi}}_{1} ,\,{\varvec{\upxi}}_{2} }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C{\mathbf{e}}^{t} ({\varvec{\upxi}}_{1} + {\varvec{\upxi}}_{2} ) $$

subject to

$$ \begin{gathered} {\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{1} , \hfill \\ {\mathbf{y}} - {\mathbf{\mathcal{H}}}{\mathbf{w}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{2} , \hfill \\ {\varvec{\upxi}}_{1} ,\,\,{\varvec{\upxi}}_{2} \, \ge \,\,{\mathbf{0}}, \hfill \\ \end{gathered} $$
(7)

where \( C > 0,\varepsilon > 0 \) are input parameters and \( {\varvec{\upxi}}_{1} ,{\varvec{\upxi}}_{2} \in R^{m} \) are vectors of slack variables.

By introducing Lagrange multipliers \( {\varvec{\upalpha}}_{1} ,\,{\varvec{\upalpha}}_{2} \in R^{m} \), the dual of (7) can be constructed as a minimization problem of the form

$$ \mathop {\hbox{min} }\limits_{{{\varvec{\upalpha}}_{1} ,\,{\varvec{\upalpha}}_{2} \in R^{m} }} \frac{1}{2}({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} )^{t} K_{{_{ELM} }} ({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} ) + \varepsilon \,{\mathbf{e}}^{t} ({\varvec{\upalpha}}_{1} + {\varvec{\upalpha}}_{2} ) - {\mathbf{y}}^{t} \,({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} ) $$

subject to

$$ {\mathbf{0}} \le {\varvec{\upalpha}}_{1} ,{\varvec{\upalpha}}_{2} \le C\,{\mathbf{e}}, $$
(8)

where \( (K_{ELM} )_{ij} = K_{ELM} ({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = (G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{i} ), \ldots ,G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{i} ))\,(G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{j} ), \ldots ,G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{j} ))^{t} \) is the ij-th entry of the matrix \( K_{ELM} \). With its solution \( {\varvec{\upalpha}}_{1} ,{\varvec{\upalpha}}_{2} \in R^{m} \), the regression estimation function (6a) becomes

$$ f({\mathbf{x}}) = (\,K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{1} )\,, \ldots ,\,K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{m} )\,)\,\,({\varvec{\upalpha}}_{1} - {\varvec{\upalpha}}_{2} ) . $$
(9)

Note that the \( \varepsilon \)-ELM problem (8) has fewer constraints than the SVR problem in its dual form, and the decision function (9) is similar to the estimation function of SVR but without the bias b.
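A small sketch may help fix the notation: since the i-th row of \( {\mathbf{\mathcal{H}}} \) is exactly \( (G({\mathbf{a}}_{1} ,b_{1} ,{\mathbf{x}}_{i} ), \ldots ,G({\mathbf{a}}_{\ell } ,b_{\ell } ,{\mathbf{x}}_{i} )) \), the ELM kernel matrix in (8) is simply \( K_{ELM} = {\mathbf{\mathcal{H}}}{\mathbf{\mathcal{H}}}^{t} \). The NumPy lines below assume H and H_test (hidden layer outputs of the training and test patterns) have already been built as in Sect. 3 and that a dual solution alpha1, alpha2 of (8) has been obtained from some box-constrained QP solver; these names are placeholders, not part of the original formulation.

```python
import numpy as np

# ELM kernel matrix of the dual problem (8): (K_ELM)_{ij} = h(x_i) h(x_j)^t
K_ELM = H @ H.T

# Regression estimates (9) for the test patterns, given a dual solution alpha1, alpha2:
# f(x) = (K_ELM(x, x_1), ..., K_ELM(x, x_m)) (alpha1 - alpha2)
f_test = (H_test @ H.T) @ (alpha1 - alpha2)
```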

For the purpose of separating the training data with an acceptable minimum training error rather than zero training error, OB-ELM for classification can be formulated as

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,\,{\varvec{\upxi}}}} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + C{\mathbf{e}}^{t} {\varvec{\upxi}} $$

subject to

$$ \begin{gathered} {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}} \ge {\mathbf{e}} - {\varvec{\upxi}}, \hfill \\ {\varvec{\upxi}} \ge 0, \hfill \\ \end{gathered} $$
(10)

where \( {\mathcal{D}} = diag({\mathbf{y}}) \) is a diagonal matrix of order m whose diagonal elements are the components of the output vector \( {\mathbf{y}} \in R^{m} \) of class labels. Since the separating ELM hyperplane tends to pass through the origin in the feature space with probability one [18], the bias term “b” is not included in (10).

Recall that the aim of ELM is to obtain an output weight vector \( {\mathbf{w}} \in R^{\ell } \) with the smallest training error such that \( ||{\mathbf{w}}|| \) is also minimum. Since the distance between the separating boundaries of the two classes in the ELM feature space is \( 2/||{\mathbf{w}}|| \), minimizing the regularization term in (10) is equivalent to maximizing the margin between the boundaries in the ELM feature space.

The solution of the primal problem (10) will be obtained by solving its dual optimization problem

$$ \mathop {\hbox{min} }\limits_{{{\varvec{\upalpha}}\,\, \in \,R^{m} }} \frac{1}{2}{\varvec{\upalpha}}^{t} {\mathcal{D}}\,K_{{_{ELM} }} {\mathcal{D}}\,{\varvec{\upalpha}} - \,{\mathbf{e}}^{t} {\varvec{\upalpha}} $$

subject to

$$ {\mathbf{0}} \le {\varvec{\upalpha}} \le C\,{\mathbf{e}}, $$

where \( {\varvec{\upalpha}} \) is the vector of Lagrange multipliers.

In this case, the decision function \( f(.) \) is given by

$$ f({\mathbf{x}}) = {\text{sign}}(\,(K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{1} )\,, \ldots ,\,K_{ELM} ({\mathbf{x}},{\mathbf{x}}_{m} )\,)\,\,{\mathcal{D}}\,{\varvec{\upalpha}}). $$

5 Proposed functional iterative extreme learning machine

In this section, we extend the optimization method based ELM learning approach for classification [18] to both regression and classification problems by formulating their primal problems as unconstrained minimization problems and solving them with a simple functional iterative method.

5.1 ε–ELM for regression

The ELM for regression in 2-norm with the ε-insensitive error loss function and input parameters C > 0 and ε > 0 can be formulated as a constrained minimization problem of the following form [2, 18]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,{\varvec{\upxi}}_{1} ,{\varvec{\upxi}}_{2} }} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}({\varvec{\upxi}}_{1}^{t} {\varvec{\upxi}}_{1} + {\varvec{\upxi}}_{2}^{t} {\varvec{\upxi}}_{2} ) $$

subject to

$$ \begin{gathered} {\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{1} , \hfill \\ {\mathbf{y}} - {\mathbf{\mathcal{H}}}{\mathbf{w}} \le \varepsilon \;{\mathbf{e}} + {\varvec{\upxi}}_{2} \hfill \\ \end{gathered} $$
(11)

where \( {\varvec{\upxi}}_{1} ,\,{\varvec{\upxi}}_{2} \, \in R^{m} \) are vectors of slack variables. Using its solution \( {\mathbf{w}} \in R^{\ell } \), the regression estimation function f(.) defined by (6a) is obtained. The non-negativity constraints on the components of the vectors \( {\varvec{\upxi}}_{1} ,\,{\varvec{\upxi}}_{2} \, \in R^{m} \) have been dropped in the formulation (11) since none of their components is negative at optimality. Note that adding the regularization term \( \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} \) to the objective function of (11) leads to a stable solution with better generalization performance.

One can easily verify that the constrained primal problem (11) can be reformulated as the following equivalent unconstrained minimization problem [25]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}} \in \,R^{\ell } }} L_{R} ({\mathbf{w}}) = \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}(\left\| {({\mathbf{\mathcal{H}}}{\mathbf{w}} - ({\mathbf{y}} + \varepsilon \;{\mathbf{e}}))_{ + } } \right\|^{2} + \left\| {({\mathbf{y}} - \varepsilon {\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } } \right\|^{2} ). $$
(12)

In this subsection, we solve the problem (12) by computing its critical point. For this purpose, consider the problem of solving the system of non-linear equations

$$ \nabla L_{R} ({\mathbf{w}}) = {\mathbf{0}}, $$

i.e. solve for \( {\mathbf{w}} \in R^{\ell } \) such that

$$ {\mathbf{w}} + C{\mathbf{\mathcal{H}}}^{t} (({\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}})_{ + } - ({\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } ) = {\mathbf{0}}. $$
(13)

Now, using the identity \( {\mathbf{u}}_{ + } = \frac{{{\mathbf{u}} + |{\mathbf{u}}|}}{2} \), valid for any vector \( {\mathbf{u}} \in R^{m} \), the condition (13) can be written in the following equivalent form

$$ {\mathbf{w}} + C{\mathbf{\mathcal{H}}}^{t} \left[ {\frac{{({\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}) + |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}|}}{2} - \frac{{({\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}) + |{\mathbf{y}} - \varepsilon {\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}|}}{2}\;} \right] = {\mathbf{0}} $$

and from which we get

$$ \left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right){\mathbf{w}} = {\mathbf{\mathcal{H}}}^{t} \left( {{\mathbf{y}} + \frac{{|{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}| - |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}|}}{2}} \right). $$
(14)

This leads to the following simple iterative scheme which will be our functional iterative ELM algorithm for regression (FELMR)

$$ {\mathbf{w}}^{i + 1} = \left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} \left( {{\mathbf{y}} + \frac{{|{\mathbf{y}} - \varepsilon {\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} | - |{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} - {\mathbf{y}} - \varepsilon {\mathbf{e}}|}}{2}} \right)\;{\text{for}}\quad i = 0, \, 1, \, 2, \ldots $$
(15)

The iterative algorithm (15) for solving the regression problem (12) is summarized below.

Algorithm 1 (FELMR).

Input.

  • Parameter values \( C > 0 \) and \( \,\,\varepsilon > 0 \)

  • tol = tolerance value for learning accuracy, imax = maximum number of iterations

Step 1.

  • Set \( i = 0 \) and the initial vector \( {\mathbf{w}} = {\mathbf{w}}^{0} \) in \( R^{\ell } \)

Step 2.

  • Compute the vectors \( {\mathbf{ym}} = {\mathbf{y}} - \varepsilon \,{\mathbf{e}} \) and \( {\mathbf{yp}} = {\mathbf{y}} + \varepsilon \,{\mathbf{e}} \), and the matrix \( {\mathbf{\mathcal{G}}}\, = \,\left( {\frac{{\mathbf{I}}}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{{\mathbf{t}}} \)

  • Compute \( {\mathbf{w}}^{1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{y}}\, + \,\frac{{|\,{\mathbf{ym}}\, - \,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{0} |\, - \,|\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{0} \, - {\mathbf{yp}}\,|}}{2}} \right) \)

Step 3.

  • While (\( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\,\, > \,\,tol \) & \( i\,\, < \,\, \) imax)

$$ \begin{gathered} i = i + 1 \hfill \\ {\mathbf{w}}^{i + 1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{y}}\, + \,\frac{{|\,{\mathbf{ym}}\, - \,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{i} |\, - \,|\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{i} \, - {\mathbf{yp}}\,|}}{2}} \right) \hfill \\ \end{gathered} $$

end while

  • $$ {\mathbf{w}} = {\mathbf{w}}^{i + 1} $$
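A direct NumPy transcription of Algorithm 1 is sketched below. It assumes that the hidden layer output matrix H and the target vector y have already been formed as in Sect. 3; the function name felmr and the zero initial vector \( {\mathbf{w}}^{0} = {\mathbf{0}} \) are our own choices.

```python
import numpy as np

def felmr(H, y, C, eps, tol=1e-6, imax=20):
    """Functional iterative ELM for regression (FELMR), following iteration (15)."""
    m, L = H.shape
    # G = (I/C + H'H)^{-1} H', computed once (Step 2 of Algorithm 1)
    G = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T)
    ym, yp = y - eps, y + eps
    w = np.zeros(L)                                   # initial vector w^0
    for _ in range(imax):
        Hw = H @ w
        w_new = G @ (y + (np.abs(ym - Hw) - np.abs(Hw - yp)) / 2.0)
        if np.linalg.norm(w_new - w) < tol:           # termination test of Step 3
            return w_new
        w = w_new
    return w
```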

Now we show the convergence of the proposed FELMR algorithm in the next theorem.

Theorem 1

Assume that the following condition holds:

$$ \sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}\mathcal{H}}}^{t} )\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} < \frac{1}{C}. $$

Then, for any starting vector \( {\mathbf{w}}^{0} \in R^{\ell } \), the sequence of iterates \( \{ {\mathbf{w}}^{i} \} \) generated by the FELMR iterative step (15) converges to the unique solution \( {\mathbf{w}} \) of (12) at a linear rate.

Proof

Since w is the solution of (12), it satisfies (14). Therefore, using (14) and (15), we get

$$ {\mathbf{w}} - {\mathbf{w}}^{i + 1} = \left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} \left[ \begin{gathered} \frac{{|{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}| - |{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} |}}{2} \hfill \\ + \frac{{|{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}| - |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}|}}{2}\; \hfill \\ \end{gathered} \right] $$

But,

$$ |{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}| - |{\mathbf{y}} - \varepsilon \;{\mathbf{e}} - {\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} | \le |{\mathbf{\mathcal{H}}}({\mathbf{w}} - {\mathbf{w}}^{i} )| $$

and similarly

$$ |{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}| - |{\mathbf{\mathcal{H}}}{\mathbf{w}} - {\mathbf{y}} - \varepsilon \;{\mathbf{e}}| \le |{\mathbf{\mathcal{H}}}({\mathbf{w}} - {\mathbf{w}}^{i} )| $$

hold. Thus, we have

$$ \begin{gathered} \left\| {{\mathbf{w}} - {\mathbf{w}}^{i + 1} } \right\| \le ||\left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} ||\left\| {{\mathbf{\mathcal{H}}}({\mathbf{w}} - {\mathbf{w}}^{i} )} \right\| \hfill \\ \le ||\left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} ||\;||{\mathbf{\mathcal{H}}}^{t} ||\;\;\;||{\mathbf{\mathcal{H}}}||\left\| {({\mathbf{w}} - {\mathbf{w}}^{i} )} \right\|. \hfill \\ \end{gathered} $$

Since \( ||\left( {\frac{I}{C} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} || \le C \) [43] and \( \left\| {\mathbf{\mathcal{H}}} \right\| = \sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} \), we get

$$ \left\| {{\mathbf{w}} - {\mathbf{w}}^{i + 1} } \right\| \le C\sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}\mathcal{H}}}^{t} )\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} \left\| {{\mathbf{w}} - {\mathbf{w}}^{i} } \right\| $$

and the result follows by our assumption.
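In practice, the sufficient condition of Theorem 1 can be checked numerically before running Algorithm 1; the lines below sketch such a check (for a nonzero \( {\mathbf{\mathcal{H}}} \), both maximum eigenvalues equal the squared largest singular value of \( {\mathbf{\mathcal{H}}} \), so the condition amounts to \( C\left\| {\mathbf{\mathcal{H}}} \right\|^{2} < 1 \)).

```python
import numpy as np

# Theorem 1 requires  sqrt(lmax(H H^t) * lmax(H^t H)) < 1/C
lmax_HHt = np.linalg.eigvalsh(H @ H.T).max()
lmax_HtH = np.linalg.eigvalsh(H.T @ H).max()
satisfies_theorem_1 = C * np.sqrt(lmax_HHt * lmax_HtH) < 1.0
```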

5.2 ELM for binary classification

Instead of imposing the zero training error condition, one may train ELM by allowing an acceptable minimum training error. In this case, the ELM in 2-norm can be formulated as a minimization problem in the primal of the form [18]

$$ \mathop {\hbox{min} }\limits_{{{\mathbf{w}},\,{\varvec{\upxi}}}} \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}{\varvec{\upxi}}^{t} {\varvec{\upxi}} $$

subject to

$$ {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}} \ge {\mathbf{e}} - {\varvec{\upxi}}, $$
(16)

which is very similar to the conventional SVM formulation (3) but without the bias term. Here \( {\varvec{\upxi}}\, \in R^{m} \) is a vector of slack variables and \( {\mathcal{D}} = diag(y_{1} , \ldots ,y_{m} ) \) is a diagonal matrix whose i-th diagonal element is \( y_{i} \).

At the solution of the above problem (16), the condition

$$ {\varvec{\upxi}} = ({\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } $$
(17)

is satisfied [25]. Therefore, using (17), the original ELM classification problem in 2-norm can be written as an unconstrained minimization problem of the following form

$$ \mathop {\hbox{min} }\limits_{{\mathbf{w}}} L_{C} ({\mathbf{w}}) = \frac{1}{2}{\mathbf{w}}^{t} {\mathbf{w}} + \frac{C}{2}\left\| {({\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } } \right\|^{2} . $$
(18)

Consider the problem of solving for \( {\mathbf{w}} \in R^{\ell } \) such that \( \nabla L_{C} ({\mathbf{w}}) = {\mathbf{0}} \). Since \( \nabla L_{C} ({\mathbf{w}}) = {\mathbf{w}} - C{\mathbf{\mathcal{H}}}^{t} {\mathcal{D}}({\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}})_{ + } \), using the identity \( {\mathbf{u}}_{ + } = \frac{{{\mathbf{u}} + |{\mathbf{u}}|}}{2} \) for any \( {\mathbf{u}} \in R^{m} \) and proceeding as in the regression case, we get the following condition to be satisfied by \( {\mathbf{w}} \in R^{\ell } \)

$$ \left( {\frac{I}{(C/2)} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right){\mathbf{w}} = {\mathbf{\mathcal{H}}}^{t} {\mathcal{D}}({\mathbf{e}} + |{\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}}|). $$

This leads to the following simple iterative scheme which will be our functional iterative ELM algorithm for classification (FELMC)

$$ {\mathbf{w}}^{i + 1} = \left( {\frac{I}{(C/2)} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{t} {\mathcal{D}}({\mathbf{e}} + |{\mathbf{e}} - {\mathcal{D}}{\mathbf{\mathcal{H}}}{\mathbf{w}}^{i} |)\quad{\text{for}}\ i = 0, \, 1, \, 2, \ldots $$
(19)

We summarize below the pseudo code of the iterative method (19) applied for solving the classification problem (18).

Algorithm 2 (FELMC).

Input.

  • Parameter value \( C > 0 \)

  • tol = tolerance value for learning accuracy, imax = maximum number of iterations

Step 1.

  • Set \( i = 0 \) and the initial vector \( {\mathbf{w}} = {\mathbf{w}}^{0} \) in \( R^{\ell } \)

Step 2.

  • Compute the matrix \( {\mathbf{\mathcal{G}}}\, = \,\left( {\frac{{\mathbf{I}}}{C/2} + {\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}}} \right)^{ - 1} {\mathbf{\mathcal{H}}}^{{\mathbf{t}}} {\mathbf{\mathcal{D}}} \)

  • Compute \( {\mathbf{w}}^{1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{e}}\, + |{\mathbf{e}} - {\mathbf{\mathcal{D}}}\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{0} |\,} \right) \)

Step 3.

  • While (\( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\,\, > \,\,tol \) & \( i\,\, < \,\, \) imax)

$$ i = i + 1 $$
$$ {\mathbf{w}}^{i + 1} \,\, = \,\,{\mathbf{\mathcal{G}}}\,\left( {\,{\mathbf{e}}\, + |{\mathbf{e}} - {\mathbf{\mathcal{D}}}\,{\mathbf{\mathcal{H}}}\,{\mathbf{w}}^{i} |\,} \right) $$

end while

  • $$ {\mathbf{w}} = {\mathbf{w}}^{i + 1} $$
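As with FELMR, Algorithm 2 translates directly into a few NumPy lines; the sketch below assumes that H and the label vector y with entries in {−1, +1} are given, and the names are our own.

```python
import numpy as np

def felmc(H, y, C, tol=1e-3, imax=20):
    """Functional iterative ELM for binary classification (FELMC), following iteration (19)."""
    m, L = H.shape
    D = np.diag(y.astype(float))
    # G = (I/(C/2) + H'H)^{-1} H' D, computed once (Step 2 of Algorithm 2)
    G = np.linalg.solve(np.eye(L) / (C / 2.0) + H.T @ H, H.T @ D)
    e = np.ones(m)
    w = np.zeros(L)                                   # initial vector w^0
    for _ in range(imax):
        w_new = G @ (e + np.abs(e - D @ (H @ w)))
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Class label of a test pattern x with hidden layer output h_x: np.sign(h_x @ w), cf. (6b)
```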

Following the steps of the proof of Theorem 1, the convergence of the proposed FELMC algorithm can be easily verified.

Theorem 2

Assume that the following condition holds:

$$ \sqrt {\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}\mathcal{H}}}^{t} )\lambda_{\hbox{max} } ({\mathbf{\mathcal{H}}}^{t} {\mathbf{\mathcal{H}}})} < \frac{2}{C}. $$

Then, for any starting vector \( {\mathbf{w}}^{0} \in R^{\ell } \), the sequence of iterates \( \{ {\mathbf{w}}^{i} \} \) generated by the FELMC iterative step (19) converges to the unique solution \( {\mathbf{w}} \) of (18) at a linear rate.

Finally, for any test pattern \( {\mathbf{x}} \in R^{n} \), its class label is predicted using the decision function (6b).

6 Numerical experiments and comparison of results

In this section, the performance of the proposed optimization method based ELM for regression and classification in the primal, solved by the functional iterative methods, is compared with that of SVM, ELM, OP-ELM and OB-ELM on a variety of real-world benchmark datasets.

All experiments were carried out in the MATLAB R2008a environment on a PC running 64-bit Windows XP with a 3.20 GHz Intel(R) Core(TM) 2 Duo processor and 4 GB of RAM.

The standard SVM and OB-ELM formulations were solved using the MOSEK optimization toolbox for MATLAB, available at http://www.mosek.com, and OP-ELM using the OP-ELM toolbox [26]. However, no external optimizer was used for solving ELM and the proposed FELMR and FELMC.

The performance of the proposed formulation has been tested with additive and RBF hidden nodes. In fact, the activation function \( G(\varvec{a},b,\varvec{x}) \) was chosen as the sigmoid function for additive nodes, whereas both the multiquadric and Gaussian functions were considered for RBF hidden nodes. In the case of the sigmoid and multiquadric activation functions, the hidden node parameters were chosen randomly with uniform distribution in [−0.5, 0.5], whereas for the Gaussian function they were chosen randomly from [0, 1]. Note that the input weights and biases of the hidden nodes were selected randomly at the beginning of the algorithm and remained fixed within each trial of the simulation.

All the numerical experiments on regression and classification datasets were performed after normalizing the original data by taking: \( \bar{x}_{ij} = \frac{{x_{ij} - x_{j}^{\hbox{min} } }}{{x_{j}^{\hbox{max} } - x_{j}^{\hbox{min} } }} \), where \( x_{j}^{\hbox{min} } = \mathop {\hbox{min} }\limits_{i = 1, \cdots ,m} (x_{ij} ) \) and \( x_{j}^{\hbox{max} } = \mathop {\hbox{max} }\limits_{i = 1, \cdots ,m} (x_{ij} ) \) denote the minimum and maximum values, respectively, of the j-th attribute over all the input examples \( \varvec{x}_{i} \) and \( \bar{x}_{ij} \) is the normalized value corresponding to \( x_{ij} \).
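A one-function NumPy sketch of this column-wise min–max normalization is given below (attributes that are constant over all examples would need separate handling, which is not addressed here).

```python
import numpy as np

def min_max_normalize(X):
    """Scale each attribute j as (x_ij - x_j^min) / (x_j^max - x_j^min)."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```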

6.1 Regression

In this sub-section, we demonstrate the effectiveness of the proposed FELMR for regression in comparison to SVR, ELM, OP-ELM and OB-ELM by performing nonlinear regression experiments on a number of benchmark datasets. They include: the inverse dynamics of a flexible robot arm from http://homes.esat.kuleuven.be/~smc/daisy/daisydata.html; the Box and Jenkins gas furnace dataset [5]; the Servo, Auto-MPG, Machine CPU, Concrete CS, Wine-quality red, Wine-quality white, Abalone and Parkinson datasets from the UCI repository [31]; Pollen grains, Bodyfat and Space_ga from the Statlib collection http://lib.stat.cmu.edu/datasets; Sunspots and SantaFeA from http://www.bme.ogi.edu/~ericwan/data.html; the financial time series datasets Citigroup, Intel, Microsoft, RedHat and Standard & Poor 500 (SNP500) from http://finance.yahoo.com; the hydraulic actuator dataset [11, 35]; and NO2, Bank-32fh, Demo and Kin-32fh from [8].

The 2-norm root mean square error (RMSE) was selected as the measure of prediction performance and it was calculated using the following formula: \( RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {(y_{i} - \tilde{y}_{i} )^{2} } } \), where y i and \( \tilde{y}_{i} \) are the observed and its corresponding predicted values respectively, and N is the number of test samples.

As the first example, the inverse dynamics of a flexible robot arm is estimated [36] as an interesting case of inverse system identification. It is assumed that the dynamics of the robot arm is a function of the measured values of the reaction torque of the structure u(t), whose corresponding output y(t) is the acceleration. As in [36], by setting \( {\mathbf{x}}(t) = (u\left( {t - 1} \right), \ldots ,u\left( {t - 5} \right),y\left( {t - 1} \right), \ldots ,y\left( {t - 4} \right))^{t} \) and \( x^{\text{out}} \left( t \right) = u\left( t \right) \), a model for the inverse identification problem was trained. The first 500 samples of the form \( \left( {{\mathbf{x}}\left( t \right),x^{\text{out}} \left( t \right)} \right) \) were used for training and the remaining 519 samples for testing. The prediction errors corresponding to the sigmoid, multiquadric and Gaussian functions on the test dataset are shown in Fig. 1a, b and c respectively.

Fig. 1 Prediction error over the test data for the flexible robot arm dataset

As the next practical example, the Box and Jenkins gas furnace dataset [5] was chosen. It consists of 296 pairs of values of the form \( \left( {u\left( t \right),y\left( t \right)} \right) \), where \( u\left( t \right) \) is the flow rate of the input gas and the output \( y\left( t \right) \) is the CO2 concentration from the gas furnace. The output \( y\left( t \right) \) is predicted based on six attributes of the form \( {\mathbf{x}}(t) = \left( {y\left( {t - 1} \right),y\left( {t - 2} \right),y\left( {t - 3} \right),u\left( {t - 1} \right),u\left( {t - 2} \right),u\left( {t - 3} \right)} \right) \) [40]. Of the resulting 293 samples \( \left( {{\mathbf{x}}\left( t \right),y\left( t \right)} \right) \), the first 100 were taken for training and the remaining samples for testing.
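The lagged regressors used in these system identification examples (and in the time series experiments below) can be built with a small helper such as the following sketch; the function name and argument conventions are our own, and u, y stand for the raw input/output sequences.

```python
import numpy as np

def lagged_regressors(u, y, u_lags, y_lags):
    """Build x(t) = (y(t-1),...,y(t-y_lags), u(t-1),...,u(t-u_lags)) with target y(t);
    u_lags = y_lags = 3 reproduces the 293 gas furnace samples. For the robot arm
    example, which predicts u(t), call with the roles of u and y swapped."""
    u, y = np.asarray(u, dtype=float), np.asarray(y, dtype=float)
    p = max(u_lags, y_lags)
    X = [np.r_[y[t - y_lags:t][::-1], u[t - u_lags:t][::-1]] for t in range(p, len(y))]
    return np.array(X), y[p:]
```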

As an important study, the problem of time series prediction was considered. Along with the popular Sunspots and SantaFeA time series datasets, the stock indices of Citigroup, Intel, Microsoft, RedHat and SNP500 were considered as examples of financial time series. For each, 755 closing stock prices from 01-01-2006 to 31-12-2008 were selected. Assuming that the current stock index is predicted using its previous five values, we get 750 samples in total. The first 200 were taken for training and the remaining 550 for testing.

Assuming different sampling rates τ = 0.05 and τ = 0.20 with parameter values ρ = 10, r = 28 and b = 8/3, two time series datasets, Lorenz 0.05 and Lorenz 0.20, were generated from the values of the variable x of the Lorenz differential equations [29, 30]: \( \dot{x} = \rho (y - x),\dot{y} = rx - y - xz \) and \( \dot{z} = xy - bz \), using the fourth-order Runge–Kutta method. After discarding the first 1,000 values to avoid initial transients, the next 3,000 values were used in our experiment. Five previous values were used to predict the current value. The first 400 samples were taken for training and the rest for testing.
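A fixed-step fourth-order Runge–Kutta sketch for generating these series is shown below; taking the integration step equal to the sampling rate τ and the particular initial state are our assumptions, since the paper does not specify them.

```python
import numpy as np

def lorenz_series(tau, n_total=4000, rho=10.0, r=28.0, b=8.0 / 3.0, s0=(1.0, 1.0, 1.0)):
    """x-component of the Lorenz system, integrated with fixed-step RK4 of step tau."""
    def f(s):
        x, y, z = s
        return np.array([rho * (y - x), r * x - y - x * z, x * y - b * z])
    s = np.array(s0, dtype=float)          # arbitrary initial state (an assumption)
    out = np.empty(n_total)
    for i in range(n_total):
        k1 = f(s)
        k2 = f(s + 0.5 * tau * k1)
        k3 = f(s + 0.5 * tau * k2)
        k4 = f(s + tau * k3)
        s = s + (tau / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        out[i] = s[0]
    return out[1000:]                      # discard the first 1,000 values (transients)
```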

Finally, experiments were conducted on two more time series datasets, denoted MG 17 and MG 30, generated using the Mackey–Glass time delay differential equation [29, 30]: \( \frac{dx(t)}{dt} = - 0.1x(t) + \frac{0.2x(t - \tau )}{{1 + x(t - \tau )^{10} }},\) corresponding to the time delays τ = 17 and 30, respectively. As before, five previous values were used to predict the current value. Of the 1,495 samples obtained, the first 500 were used for training and the rest for testing.
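For completeness, the Mackey–Glass series can be generated along the following lines; the fixed-step Euler scheme, the step size dt and the constant initial history are our assumptions, as the paper does not state how this delay differential equation was integrated.

```python
import numpy as np

def mackey_glass(tau, n_samples, dt=0.1, x_hist=1.2):
    """dx/dt = -0.1 x(t) + 0.2 x(t - tau) / (1 + x(t - tau)^10), sampled at unit intervals."""
    d = int(round(tau / dt))                  # delay expressed in integration steps
    steps_per_sample = int(round(1.0 / dt))
    x = [x_hist] * (d + 1)                    # constant initial history (an assumption)
    out = []
    while len(out) < n_samples:
        for _ in range(steps_per_sample):
            x.append(x[-1] + dt * (-0.1 * x[-1]
                                   + 0.2 * x[-1 - d] / (1.0 + x[-1 - d] ** 10)))
        out.append(x[-1])
    return np.array(out)
```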

We also considered the hydraulic actuator dataset, a benchmark dataset for nonlinear system identification [11, 35]. It consists of 1,024 pairs of values \( \left( {u\left( t \right),y\left( t \right)} \right) \), where \( u\left( t \right) \) denotes the size of the valve through which oil flows into the actuator and \( y\left( t \right) \) is the oil pressure. Considering the same multi-dimensional regression model as in [11, 35], the output \( y\left( t \right) = f({\mathbf{x}}\left( t \right)) \) is predicted by setting \( {\mathbf{x}}(t) = (y\left( {t - 1} \right),y\left( {t - 2} \right),y\left( {t - 3} \right),u\left( {t - 1} \right),u\left( {t - 2} \right))^{t} \). From the resulting 1,021 samples of the form \( \left( {{\mathbf{x}}\left( t \right), y\left( t \right)} \right) \), the first 511 samples were chosen for training and the remaining 510 samples for testing.

In the implementation of SVR, ε = 0.01 was assumed. The Gaussian nonlinear kernel function \( k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = \exp \left( { - \mu \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|^{2} } \right) \) for i, j = 1,…,m was taken, where μ > 0 is a kernel parameter. The optimal values of the regularization parameter C and the kernel parameter μ were chosen by varying them over the predefined sets \( \{ 2^{ - 10} , \ldots ,2^{10} \} \) and \( \{ 2^{ - 5} , \ldots ,2^{5} \} \) respectively and applying the tenfold cross-validation methodology on the training dataset. In the case of ELM, the optimal value of the single parameter \( \ell \) was chosen from the set \( \{ \,10,\,20,\,50,\,100,\,200\,\} \), whereas for OP-ELM it was selected from the set \( \{ \,10,\,20,\,50,\,80,\,100\,\} \). For OB-ELM, the optimal values of C and \( \ell \) were chosen from \( \{ \,2^{ - 15} ,\, \ldots ,\,\,2^{20} \,\} \) and \( \{ \,10,\,20,\,50,\,100,\,200\,\} \) respectively. Finally, using these optimal parameter values, the RMSE on the test set was calculated.

In the case of FELMR, it was observed from the experiments that better generalization performance could be achieved for small to moderate values of \( \ell \). Further, since a large value of \( \ell \) results in increased computational time, we assumed \( \ell \in \{ 10,20,50,100,200\} \). Again, by varying the parameter C over \( \{ 2^{ - 10} , \ldots ,2^{20} \} \), the optimal values of C and \( \ell \) were obtained using tenfold cross-validation. Using the optimal values, the average test accuracy for each dataset was computed by performing 30 independent trials. As the proposed FELMR is an iterative method, the termination condition was chosen to be \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\, < \,\,10^{ - 6} \), and the maximum number of iterations was set to 20.

Although the performance of SVR is sensitive to its parameters C and μ, it is known that ELM is not sensitive to the choice of the parameter \( \ell \) [20]. After extensive simulations, it was found that FELMR is also not very sensitive to the user specified parameters. To illustrate this, the performance of FELMR with the sigmoid additive hidden node and with the multiquadric and Gaussian RBF hidden nodes on the Machine CPU and Bank-32fh datasets is shown in Figs. 2 and 3, respectively. Note also from the figures that better accuracy could be achieved for small to moderate values of \( \ell \) but medium to large values of C.

Fig. 2 Insensitivity performance of FELMR for regression to the user specified parameters (C, \( \ell \)) on the machine CPU dataset

Fig. 3 Insensitivity performance of FELMR for regression to the user specified parameters (C, \( \ell \)) on the bank-32fh dataset

In order to test the convergence of the proposed FELMR algorithm empirically for both additive and RBF hidden nodes in terms of the number of iterations, the values of \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} || \) were computed as the error of convergence, and the results are plotted for the Kin-32fh and Wine-quality white datasets in Fig. 6a and b respectively. One can observe from the figures that the rate of convergence of FELMR is impressively fast.

For all the regression datasets considered, the number of training and test samples chosen, the number of attributes, the optimal parameter values determined using tenfold cross-validation and the accuracies obtained by FELMR, OB-ELM, OP-ELM, ELM and SVR on the test sets are summarized in Table 1. One can observe from the table that the training time of FELMR is very close to that of ELM and OP-ELM, and much shorter than that of SVR and OB-ELM. The best accuracy is obtained by SVR, ELM, OP-ELM, OB-ELM and FELMR on 5, 3, 7, 5 and 12 datasets respectively, which indicates the effectiveness of FELMR.

Table 1 Performance comparison of FELMR with ELM, OB-ELM having sigmoid, multiquadric and Gaussian functions, OP-ELM having sigmoid function and SVR using Gaussian kernel for regression

To further analyze the comparative performance of FELMR with SVR, ELM, OP-ELM and OB-ELM, the average ranks of all the algorithms based on the RMSE values were computed and are listed in Table 2. One can clearly observe that, for ELM, OB-ELM and FELMR, the performance in terms of average rank with additive hidden nodes is better than with RBF hidden nodes.

Table 2 Average ranks of SVR, ELM, OP-ELM, OB-ELM and FELMR on RMSE values for regression

For the statistical comparison of the performance of the 11 algorithms over the 29 regression datasets, we use the Friedman test with the corresponding post hoc test recommended in [9], since it is a simple, yet safe and robust, non-parametric test. Under the null hypothesis that all the algorithms are equivalent, the Friedman statistic is computed from Table 2 as

$$ \begin{gathered} \chi_{F}^{2} = \frac{12 \times 29}{11 \times 12}[(4.3276^{2} + 4.7586^{2} + 7.8103^{2} + 7.0862^{2} + 6.1036^{2} + 5.7414^{2} + 6.7759^{2} \hfill \\ \, + \;6.7241^{2} + 4.8103^{2} + 5.4310^{2} + 6.4310^{2} )\,\, - \,\,\frac{{11 \times 12^{2} }}{4}] = \,\,31.4356 \hfill \\ F_{F} = \frac{28 \times 31.4356}{29 \times 10\,\, - \,\,31.4356} = \,\,3.4042. \hfill \\ \end{gathered} $$

With 11 algorithms and 29 datasets, \( F_{F} \) is distributed according to the \( F \) distribution with \( (\,11 - 1, \) \( (11 - 1) \times (29 - 1)\,) = (\,10,\,280\,) \) degrees of freedom. The critical value of \( F(10,\,280) \) is 1.8646 for \( \alpha = 0.05 \). Since \( F_{F} \, > \,1.8646 \), we reject the null hypothesis and proceed with the Nemenyi post hoc test for pairwise comparison of the algorithms. According to [9], the critical difference (CD) at \( p = 0.10 \) is \( 2.9778\,\sqrt {\frac{11 \times 12}{6 \times 29}} = 2.5936 \).
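The Friedman statistic, its F approximation and the Nemenyi critical difference quoted above can be reproduced directly from the average ranks of Table 2, as the sketch below shows; the value \( q_{0.10} = 2.9778 \) for 11 algorithms is taken from [9], and the use of scipy for the critical F value is only for illustration.

```python
import numpy as np
from scipy.stats import f as f_dist

# Average ranks of the 11 algorithms over the N = 29 regression datasets (Table 2)
R = np.array([4.3276, 4.7586, 7.8103, 7.0862, 6.1036, 5.7414,
              6.7759, 6.7241, 4.8103, 5.4310, 6.4310])
N, k = 29, 11

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)  # ~31.44
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                            # ~3.40
F_crit = f_dist.ppf(0.95, k - 1, (k - 1) * (N - 1))                        # ~1.86
reject_null = F_F > F_crit

# Nemenyi critical difference at p = 0.10
CD = 2.9778 * np.sqrt(k * (k + 1) / (6 * N))                               # ~2.59
```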

The pairwise comparisons proceed as follows:

  i) Since the difference between the worst FELMR variant and SVR \( (6.4310 - 4.3276 = 2.1034) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

  ii) For the comparison of FELMR with ELM, we proceed as follows:

    (a) FELMR versus ELM with the sigmoid function: since the difference between the worst FELMR variant and ELM with the sigmoid function \( (6.4310 - 4.7586 = 1.6724) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

    (b) FELMR with the sigmoid function versus ELM with the multiquadric function: since the difference \( (7.8103 - 4.8103 = 3.0) \) is greater than 2.5936, the performance of FELMR with the sigmoid function is significantly better than that of ELM with the multiquadric function.

    (c) FELMR with the multiquadric and Gaussian functions versus ELM with the multiquadric function: since the difference between the best of these FELMR variants and ELM with the multiquadric function \( (7.8103 - 5.4310 = 2.3793) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

    (d) FELMR versus ELM with the Gaussian function: since the difference between the best FELMR variant and ELM with the Gaussian function \( (7.0862 - 4.8103 = 2.2759) \) is smaller than 2.5936, the post hoc test is unable to detect any significant difference between the algorithms.

  iii) For the comparison of FELMR with OP-ELM and OB-ELM, the difference between the best and worst of these algorithms \( (6.7759 - 4.8103 = 1.9656) \) is smaller than 2.5936, so the post hoc test is unable to detect any significant difference between the algorithms.

6.2 Classification

In order to verify the effectiveness of the proposed FELMC method for classification, its performance was compared with that of SVM, ELM, OP-ELM and OB-ELM on 14 binary classification datasets, all taken from the UCI repository [31].

For the implementation of SVM, the Gaussian nonlinear kernel function with parameter σ > 0, of the form \( k({\mathbf{x}}_{i} ,{\mathbf{x}}_{j} ) = \exp ( - \left\| {{\mathbf{x}}_{i} - {\mathbf{x}}_{j} } \right\|^{2} /(2\sigma^{2} )) \) for i, j = 1,…,m, was assumed. The optimal values of C and σ were chosen from predefined sets of values by applying the tenfold cross-validation methodology; in fact, we assumed \( C \in \{ 10^{ - 5} , \ldots ,10^{5} \} \) and \( \sigma \in \{ 2^{ - 5} , \ldots ,2^{5} \} \). In the case of ELM, the optimal value of the single parameter \( \ell \) was chosen from the set \( \{ 10,50,100,200,500\} \), whereas for OP-ELM it was selected from the set \( \{ \,10,\,20,\,50,\,80,\,100\,\} \). Using the optimal parameter values determined by tenfold cross-validation, the prediction accuracy was calculated on the test set.

For FELMC it was observed that good generalization performance might be achieved, in general, for medium to large values of \( \ell \). However, since an increase in the number of hidden nodes results in increased computational time, we assumed \( \ell \in \{ 10,50,100,200,500\} \) for both FELMC and OB-ELM. Also, by varying the parameter C over \( \{ 2^{ - 10} , \ldots ,2^{15} \} \), the optimal values of C and \( \ell \) were obtained using tenfold cross-validation. The average test accuracy was computed by performing 30 independent trials. Since FELMC is an iterative method, the termination condition was taken as \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} ||\, < \,\,10^{ - 3} \), and the maximum number of iterations was set to 20.

As in the case of regression, it was observed that the performance of FELMC is not sensitive to the values of its parameters C and \( \ell \). To verify this, its performance with additive and RBF hidden nodes on the Breast-cancer and German credit datasets is illustrated in Figs. 4 and 5. From the figures, one can observe that better accuracy results for medium to large values of C.

Fig. 4 Insensitivity performance of FELMC for classification to the user specified parameters (C, \( \ell \)) on the breast-cancer dataset

Fig. 5 Insensitivity performance of FELMC for classification to the user specified parameters (C, \( \ell \)) on the German dataset

As in the case of regression, the convergence of the proposed FELMC algorithm was examined by computing the values of \( ||{\mathbf{w}}^{i + 1} - {\mathbf{w}}^{i} || \) as the error of convergence in terms of the number of iterations, and the results are shown for the Breast cancer and Cleve datasets in Fig. 6c and d, respectively. For these classification datasets, the error of convergence becomes stable by the 20th iteration. Also observe that, unlike the regression case, the rate of convergence of FELMC is slower.

Fig. 6 Error of convergence versus number of iterations of FELM

For all the classification datasets considered, the number of training and test samples chosen, the number of attributes, the optimal parameter values determined using tenfold cross-validation and the accuracies obtained by SVM, ELM, OP-ELM, OB-ELM and FELMC on the test sets are summarized in Table 3. One can conclude from the table that the proposed ELM model trained with the functional iterative method shows generalization performance comparable to that of the rest of the methods considered. Again from Table 3, note that the best accuracy is obtained by SVM, ELM, OP-ELM, OB-ELM and FELMC on 1, 3, 0, 1 and 10 datasets respectively, which indicates the superiority of FELMC. In terms of training time, as expected, ELM and OP-ELM outperform the other methods considered. As in the case of regression, the training time of FELMC is very close to that of ELM and OP-ELM and much shorter than that of SVM and OB-ELM. Note that, in the case of FELMC, the multiquadric function shows the best performance among the activation functions.

Table 3 Performance comparison of FELMC with ELM and OB-ELM having sigmoid, multiquadric and Gaussian functions, OP-ELM having sigmoid function and SVM using Gaussian kernel for binary classification

For the statistical comparison of the performance of the 11 algorithms over the 14 classification datasets, we use, as in the case of regression, the Friedman test with the post hoc test recommended in [9]. For this purpose, the average ranks of all the algorithms based on the prediction accuracy values were computed and are listed in Table 4. Under the null hypothesis that all the algorithms are equivalent, the Friedman statistic is computed as

$$ \begin{gathered} \chi_{F}^{2} = \frac{12 \times 14}{11 \times 12}[(6.3214^{2} + 6.8571^{2} + 5.5^{2} + 7.1071^{2} + 5.6071^{2} + 5.3214^{2} + 6.5^{2} \hfill \\ \quad + 7.75^{2} + 5.4286^{2} + 4.1429^{2} + 5.4643^{2}) - \frac{{11 \times 12^{2} }}{4}] = 13.1117 \hfill \\ F_{F} = \frac{13 \times 13.1117}{14 \times 10 - 13.1117} = 1.3433. \hfill \\ \end{gathered} $$

Table 4 Average ranks of SVM, ELM, OP-ELM, OB-ELM and FELMC on accuracy values for binary classification

In this case, \( F_{F} \) is distributed according to the \( F \) distribution with \( (\,10,\,130\,) \) degrees of freedom. The critical value of \( F(10,\,130) \) is 1.9042 for \( \alpha = 0.05 \). Since the value of \( F_{F} \) is smaller than the corresponding critical value, no significant difference between the algorithms can be detected.

7 Conclusion

In this work, a simple and novel functional iterative method for the solution of the optimization based ELM in its primal form for regression and classification has been proposed, and its linear convergence has been proved. Numerical experiments were performed with sigmoid and RBF hidden nodes on a number of real-world benchmark datasets, and the results were compared with those of SVM, ELM, OP-ELM and OB-ELM for regression and classification. The comparable generalization performance of the proposed approach with respect to the other methods considered, achieved at a faster learning speed than SVM and OB-ELM, indicates its usefulness and applicability. Future work will focus on the study of ELM in dual variables and its applications.