Introduction

Processing a large quantity of data carries a high computational cost and slows the training process, so a fast and stable learning algorithm is needed. In 1986, Rumelhart et al. [1] proposed the back propagation neural network (BPNN), a multilayer feedforward network trained by error correction. Support vector regression (SVR), which minimizes the generalization error bound so as to achieve good generalized performance, was then presented by Vapnik et al. [2]. Single-layer feedforward neural networks (SLFNs) have a powerful nonlinear mapping capability and generally use the gradient descent algorithm to deal with classification and regression problems [3]. However, they have several disadvantages, such as low training efficiency and being easily trapped in a local minimum. In 2006, the extreme learning machine (ELM) for SLFNs was proposed by Huang et al. [4]; it is still widely used in many research fields, such as foreign accent identification [5], fault detection [6], and emotion recognition [7]. Compared with traditional methods, such as the gradient descent algorithm, ELM can significantly increase training speed and improve generalization performance [8].

In ELM, the input weights and hidden biases are generated randomly, and the output weights are obtained simply by calculating a Moore–Penrose inverse [9]. If the amplitude distribution of the singular values is relatively continuous and the minimum singular value is very close to 0, a large output weight vector will be obtained. Therefore, the basic ELM, which is based on empirical risk minimization, tends to overfit, and its prediction ability suffers [10]. Moreover, ELM uses the traditional least squares method to compute the output weights. Because the square loss function is convex and unbounded, outliers can incur very large losses [11]. When outliers exist in the dataset, the approximation function of ELM may deviate significantly from the optimal function, resulting in poor generalization.

To overcome the above shortcomings, researchers have proposed several schemes. Deng et al. [12] put forward a regularized ELM embedding the L2 norm (L2–ELM). The algorithm uses the weighted least squares method and obtains anti-noise ability by introducing the regularization factor \(\gamma\). L1–ELM, which yields a sparse solution, was proposed by Balasundaram et al. [13]. The L1 norm is less sensitive to outliers than the L2 norm, and the decision function of L1–ELM uses fewer hidden nodes than ELM. Martínez et al. [14] introduced the L1 norm and hybrid penalties to solve ELM regression problems. The purpose of introducing different penalties is to moderate the detrimental effect of outliers: taking the importance of features into consideration, these methods automatically assign different weights to different features, so the smallest weights are assigned to outliers. An ELM model based on L1 norm and L2 norm regularization (DRELM) was proposed to handle regression and multi-class classification problems [15]; it is robust in both regression and classification applications. In 2014, the C-loss function for pattern classification was presented by Abhishek et al. [16]. The proposed loss function can improve the performance of neural network classifiers, but that paper introduces the C-loss function only for classification problems. Zhao et al. [17] offered an algorithm named the C-loss-based extreme learning machine (CELM). Although CELM has good generalization performance, it has difficulty solving the problem of overfitting.

More recently, other methods have been proposed to eliminate the distraction caused by outliers. Jing et al. [18] proposed a domain-invariant feature learning framework for partial domain adaptation. Fu et al. [19] developed a novel model termed partial feature selection and alignment, which employs a feature selection vector based on the correlation among the features of multiple source and target domains. Both works show that re-weighting and feature selection can eliminate the distraction caused by outliers; however, they mainly tackle distribution shift and label shift problems.

To develop a more stable and faster algorithm with stronger anti-interference ability, we propose a doubly regularized ELM based on the C-loss function, called CDRELM. The proposed algorithm replaces the square loss function with the C-loss function and embeds the L1 norm and L2 norm in ELM. Because the L1 norm can reduce the feature dimension of samples, CDRELM not only deals with regression problems with strong generalization performance but also, as a feature selection method, decreases the dimension at high speed. CDRELM tends to be more robust and achieves much better generalization with a faster learning speed than L2–ELM, L1–ELM, CELM, DRELM, BPNN, and SVR. To solve this mathematical model, CDRELM is transformed into a least absolute shrinkage and selection operator (Lasso) problem [20]. The three main contributions of this paper are as follows:

  1. The C-loss function is used for regression problems rather than classification problems. To overcome the sensitivity of the square loss function to outliers, the square loss function used in ELM is replaced by the C-loss function, which is bounded, nonconvex, and smooth. On this basis, a novel algorithm, CDRELM, is proposed. In comparison with the traditional ELM, CDRELM overcomes the problems of overfitting and insufficient robustness to outliers, which greatly improves the generalization capability.

  2. As a new feature selection method that performs feature selection and training simultaneously, CDRELM generates a sparse output weight vector by embedding the L1 norm. In addition, the L2 norm is added so that the output weights keep a moderate amplitude and excessive sparsity is avoided. CDRELM can therefore solve regression problems much faster owing to its ability to reduce dimension; it also reduces the computational cost and processes high-dimensional datasets efficiently.

  3. The new mathematical model is transformed into a Lasso problem to compute the solution. Based on the proximal gradient descent (PGD) algorithm [21], an improved operator replaces the original operator to solve the Lasso problem. Compared with PGD, the improved method obtains the solution faster and efficiently decreases the number of iterations, and it can be applied to various datasets with fast and accurate performance.

The rest of this paper is organized as follows. "Related Work" introduces the related work, including ELM, the C-loss function, and the proximal gradient descent algorithm. In "Proposed CDRELM Method," the novel algorithm CDRELM is presented, including its mathematical model, solution, and computational complexity analysis. The proposed algorithm not only possesses a nonconvex and bounded loss function that is robust to outliers but also embeds the L1 norm and L2 norm to carry out feature selection at high speed; CDRELM is solved by an improved proximal gradient descent method. To test the effectiveness of the proposed CDRELM, "Experiments and Discussion" presents the experimental results, including the improved solution, dimensionality reduction, and regression. "Performance for Regression" covers four artificial datasets and five benchmark datasets, and the Friedman and Nemenyi tests are also reported for comparative analysis. "Conclusion" presents conclusions and future work.

Related Work

ELM

As a single-hidden-layer feedforward neural network, ELM plays a key role in academia and industry. The development of SLFNs has enabled ELM to reach enhanced generalization performance for classification and regression at high speed.

In an SLFN, for Q arbitrary distinct samples \(\left({x}_{i},{t}_{i}\right)\), where \(x_{i} = \left[ {x_{i1} ,x_{i2} , \ldots ,x_{im} } \right]^{T} \in R^{m}\) and \(t_{i} = \left[ {t_{i1} ,t_{i2} , \ldots ,t_{in} } \right]^{T} \in R^{n}\), the relationship between the input \(x_{i}\) and the output \(f\left( {x_{i} } \right)\) is given as follows:

$$f\left( {x_{i} } \right) = \sum\limits_{j = 1}^{P} {\beta_{j} G\left( {\varpi_{j} ,b_{j} ,x_{i} } \right)} = \sum\limits_{j = 1}^{P} {\beta_{j} G\left( {\varpi_{j} \cdot x_{i} + b_{j} } \right)} ,\quad i = 1,2, \ldots ,Q,$$
(1)

where \(\varpi_{j} = \left[ {\varpi_{j1} ,\varpi_{j2} , \ldots ,\varpi_{jm} } \right]^{T}\) is the randomly generated weight vector connecting the input nodes and the j-th hidden node, and \(b_{j}\) is the randomly generated bias of the j-th hidden node; \(\beta_{j} = \left[ {\beta_{j1} ,\beta_{j2} , \ldots ,\beta_{jn} } \right]^{T}\) is the weight vector connecting the j-th hidden node and the output nodes; \(G\left( \cdot \right)\) represents the activation function; P is the number of hidden nodes; Q is the number of samples. The output function of ELM is expressed as follows:

$$H\beta = y$$
(2)

Here, \(\beta = \left[ {\beta_{1} ,\beta_{2} , \ldots ,\beta_{P} } \right]^{T}\) is the matrix of output weights and \(y = \left[ {y_{1} ,y_{2} , \ldots ,y_{Q} } \right]^{T}\) is the matrix of targets. The hidden-layer output matrix is as follows:

$$H = \left[ {\begin{array}{*{20}c} {G\left( {\varpi_{1} ,b_{1} ,x_{1} } \right)} & \ldots & {G\left( {\varpi_{P} ,b_{P} ,x_{1} } \right)} \\ \vdots & \ddots & \vdots \\ {G\left( {\varpi_{1} ,b_{1} ,x_{Q} } \right)} & \cdots & {G\left( {\varpi_{P} ,b_{P} ,x_{Q} } \right)} \\ \end{array} } \right]$$
(3)

The output weights \(\beta\) can be determined by solving the linear system in Eq. (2) as follows:

$$\beta = H^{ + } y$$
(4)

where \(H^{ + }\) is the Moore–Penrose generalized inverse of the matrix H [22]. ELM computes \(H^{ + }\) in Eq. (4) based on the singular value decomposition (SVD) of H.
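For concreteness, the training procedure of Eqs. (2)–(4) can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; the sigmoid activation and the variable names are our own choices):

```python
import numpy as np

def elm_train(X, y, P, seed=0):
    """Basic ELM: random hidden parameters, pseudo-inverse output weights (Eq. (4))."""
    rng = np.random.default_rng(seed)
    Q, m = X.shape
    W = rng.standard_normal((P, m))           # random input weights, one row per hidden node
    b = rng.standard_normal(P)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # hidden-layer output matrix, Eq. (3)
    beta = np.linalg.pinv(H) @ y              # Moore-Penrose solution of H beta = y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```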

C-loss Function

There are several loss functions, such as the hinge loss function [23], the \(\psi\)-learning loss function [24], the normalized sigmoid loss function [25], and the ramp loss function [26]. Compared with the square loss function, these loss functions enhance robustness better because of their nonconvexity and boundedness. To find a better loss function, Abhishek et al. [16] proposed the C-loss function, defined by the following:

$$l_{C} \left( \theta \right) = 1 - \exp \left\{ { - \frac{{\theta^{2} }}{{2\sigma^{2} }}} \right\}$$
(5)

where \(\theta = y - f\left( x \right)\) is the error and \(\sigma\) is the window width. The comparison of various loss functions is depicted in Fig. 1.

Fig. 1 Comparison of loss functions

Compared with the other loss functions, the C-loss function is bounded, nonconvex, and smooth, which makes it more stable to outliers. The C-loss can handle errors of all sizes in classification problems. In this paper, we introduce the C-loss function into a doubly regularized ELM for regression problems.
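As a brief numerical illustration of this boundedness (our own sketch; the window width and error values are arbitrary):

```python
import numpy as np

def square_loss(theta):
    return 0.5 * theta ** 2

def c_loss(theta, sigma=1.0):
    # Eq. (5): bounded in [0, 1), so an outlier error saturates instead of exploding
    return 1.0 - np.exp(-theta ** 2 / (2.0 * sigma ** 2))

errors = np.array([0.1, 1.0, 10.0])   # the last value mimics an outlier
print(square_loss(errors))            # unbounded: 0.005, 0.5, 50.0
print(c_loss(errors))                 # bounded:   ~0.005, ~0.39, ~1.0
```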

Proximal Gradient Descent Algorithm

In 2004, Boyd et al. [21] proposed PGD to solve L1-regularized problems. It is an effective and rapid solution to Lasso problems in many applications.

Let \(\nabla\) be a differential operator. The optimization objective is as follows:

$$\mathop {\min }\limits_{x} g\left( x \right) + \eta \left\| x \right\|_{1}$$
(6)

If \(g\left( x \right)\) is differentiable and \(\nabla g\) satisfies the L-Lipschitz condition,

$$\exists L \in R^{ + } ,\quad \left\| {\nabla g\left( {x^{\prime} } \right) - \nabla g\left( x \right)} \right\|_{2}^{2} \le L\left\| {x^{\prime} - x} \right\|_{2}^{2} \quad \left( {\forall x,x^{\prime} } \right)$$
(7)

then, in the neighborhood of \(x_{k}\), \(g\left( x \right)\) can be approximated by a second-order Taylor expansion as follows:

$$\mathop g\limits^{ \wedge } \left( x \right) \simeq g\left( {x_{k} } \right) + \left\langle {\nabla g\left( {x_{k} } \right),x - x_{k} } \right\rangle + \frac{L}{2}\left\| {x - x_{k} } \right\|_{2}^{2} = \frac{L}{2}\left\| {x - \left( {x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)} \right)} \right\|_{2}^{2} + const$$
(8)

where const is a constant and \(\left\langle \cdot \right\rangle\) denotes the inner product. The minimum value of Eq. (8) is attained at the following:

$$x_{k + 1} = x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)$$
(9)

Gradient descent can thus be adopted to minimize \(g\left( x \right)\): each gradient descent iteration is equivalent to minimizing the quadratic function \(\mathop g\limits^{ \wedge } \left( x \right)\). Applying this to Eq. (6), each iteration step becomes the following:

$$x_{k + 1} = \mathop {\arg \min }\limits_{x} \frac{L}{2}\left\| {x - \left( {x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)} \right)} \right\|_{2}^{2} { + }\eta \left\| x \right\|_{1}$$
(10)

That is, each gradient descent step for \(g\left( x \right)\) must also take the minimization of the L1 norm into account.

For Eq. (10), let \(h = x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)\) and let \(x^{i}\) be the i-th component of x. Then, the closed-form solution is written as follows:

$$x_{k + 1}^{i} = \left\{ {\begin{array}{*{20}l} {h^{i} - \eta /L,} & {h^{i} > \eta /L} \\ {0,} & {\left| {h^{i} } \right| \le \eta /L} \\ {h^{i} + \eta /L,} & {h^{i} < - \eta /L} \\ \end{array} } \right.$$
(11)

where \(x_{k + 1}^{i}\) and \(h^{i}\) are the i-th components of \(x_{k + 1}\) and h, respectively.
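In other words, one PGD iteration is a gradient step on g followed by componentwise soft thresholding. A minimal sketch for the standard Lasso case, with \(g(x) = \tfrac{1}{2}\left\| y - Ax \right\|_{2}^{2}\) and L taken as the largest eigenvalue of \(A^{T}A\) (an illustrative choice, not the authors' code), is:

```python
import numpy as np

def soft_threshold(h, tau):
    # Closed-form solution of Eq. (11), applied componentwise
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)

def pgd_lasso(A, y, eta, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient of g
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)         # gradient of the smooth part g(x)
        h = x - grad / L                 # gradient step, Eq. (9)
        x = soft_threshold(h, eta / L)   # proximal step, Eqs. (10)-(11)
    return x
```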

Proposed CDRELM Method

Our framework adopts the bounded, nonconvex, and smooth C-loss function, which allows outliers to be handled successfully. In addition, CDRELM, based on L1 and L2 regularization, completes feature selection and training simultaneously, which greatly decreases the training time. Therefore, CDRELM can be regarded as a new embedded feature selection method with fast and stable performance.

Mathematical Model

Regression problems investigate the relationship between predictions and targets. To solve such problems, factors such as prediction accuracy, time, robustness, and model size should be considered.

A single-output regression problem is formulated as follows:

$$y = H\beta + \theta$$
(12)

where \(\beta = \left[ {\beta_{1} ,\beta_{2} , \ldots ,\beta_{P} } \right]^{T}\) is the vector of regression weights and \(\theta = \left[ {\theta_{1} ,\theta_{2} , \ldots ,\theta_{Q} } \right]^{T}\) is the error between the predicted and target values. The input of the problem, H, is a \(Q \times P\) matrix that can be described as follows:

$$H = \left[ {\begin{array}{*{20}c} {h_{11} } & \cdots & {h_{1P} } \\ \vdots & \ddots & \vdots \\ {h_{Q1} } & \cdots & {h_{QP} } \\ \end{array} } \right]$$
(13)

The traditional solution is estimated with the square loss function and can be defined as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \frac{{\left\| {y - H\beta } \right\|_{2}^{2} }}{2}$$
(14)

where \(\mathop \beta \limits^{ \wedge } = \left[ {\mathop {\beta_{1} }\limits^{ \wedge } ,\mathop {\beta_{2} }\limits^{ \wedge } , \ldots ,\mathop {\beta_{P} }\limits^{ \wedge } } \right]^{T}\) is the vector of estimated regression weights. The square loss function is convex and unbounded, whereas the C-loss function, which is smooth, bounded, and nonconvex, can improve robustness and reduce overfitting. To overcome the instability of the square loss with respect to outliers, the square loss function is replaced with the C-loss function, and Eq. (14) is transformed as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\}} \right\}$$
(15)

It is known that large-scale datasets lead to high computational cost. As a regularization technique, the L1 norm has been used to sparsify the coefficients and enhance generalization ability by shrinking some coefficients and setting others to 0. The Lasso estimate is shown as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \eta \left\| \beta \right\|_{1} } \right\}$$
(16)

where \(\eta\) is a positive regularization parameter and \(\left\| \cdot \right\|_{1}\) is the L1 norm. The value of \(\eta\) controls the number of nonzero components of \(\mathop \beta \limits^{ \wedge }\).

Nevertheless, Zou et al. [27] noted that when \(Q < P\), Lasso can select at most Q variables, and in the general situation of \(Q > P\), Lasso does not behave well if there are high correlations between predictors. To solve this problem, the L2 norm is added to the mathematical model, so that the output weight vector \(\mathop \beta \limits^{ \wedge }\) remains sparse while avoiding "over-sparsity." The modified system can now be expressed as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \eta \left\| \beta \right\|_{1} + \xi \left\| \beta \right\|_{2}^{2} } \right\}$$
(17)

where \(\xi\) is the L2 norm regularization parameter and \(\left\| \cdot \right\|_{2}\) is the L2 norm. According to [12,13,14,15], introducing a 2-norm regularization parameter into ELM creates the model L2–ELM, which has good generalization performance and strong control ability; compared with ELM, L2–ELM limits the model space and avoids overfitting. Embedding a 1-norm regularization parameter in ELM gives the model L1–ELM, which has a fast learning speed, achieves sparsity, and has good optimization characteristics. DRELM can control the complexity of the network and prevent overfitting. In our proposed mathematical model, the C-loss function increases the robustness to outliers, the L1 norm offers automatic variable selection through a sparse vector, and the L2 norm strengthens the control ability. All the components of the process are performed simultaneously, which significantly decreases the training time and yields strong generalization.
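For reference, the objective in Eq. (17) can be evaluated as follows (a sketch under our own reading of the vector notation, with the C-loss term summed over samples; not the authors' implementation):

```python
import numpy as np

def cdrelm_objective(beta, H, y, eta, xi, sigma):
    r = y - H @ beta                                              # residuals
    c_loss = np.sum(1.0 - np.exp(-r ** 2 / (2.0 * sigma ** 2)))   # C-loss term
    return c_loss + eta * np.sum(np.abs(beta)) + xi * np.sum(beta ** 2)
```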

It is clear that Eq. (17) is nonconvex, so CDRELM cannot be solved by traditional convex optimization algorithms. Therefore, it is necessary to develop a more efficient method for solving CDRELM.

Solution

Based on the mathematical model of CDRELM, it can be transformed into an equivalent Lasso problem [20]. According to the proximal gradient descent (PGD) algorithm [21], an improved operator replaces the original operator to solve the Lasso problem. In this paper, the improved PGD is used to compute \(\mathop \beta \limits^{ \wedge }\) of CDRELM.

Let \(\nabla\) be a differential operator. The optimization objective of CDRELM is as follows:

$$J\left( \beta \right) = \mathop {\min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \eta \left\| \beta \right\|_{1} + \xi \left\| \beta \right\|_{2}^{2} } \right\}$$
(18)

Let \(\lambda_{k} = \beta_{k} + \frac{{k\left( {\beta_{k} - \beta_{k - 1} } \right)}}{k + 5}\), where \(\beta_{k}\) is the value of \(\beta\) at the k-th step and the initial values \(\beta_{0}\) and \(\beta_{1}\) are both zero vectors of size \(P \times 1\). Replacing \(\beta_{k}\) with \(\lambda_{k}\) decreases the difference between the next gradient updating direction and the current gradient direction.

$$g\left( \lambda \right) = 1 - \exp \left\{ { - \frac{{\left( {y - H\lambda } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \xi \left\| \lambda \right\|_{2}^{2}$$
(19)
$$\nabla g = - \frac{1}{{\sigma^{2} }}H^{T} \left( {\left( {y - H\lambda } \right) \odot \exp \left\{ { - \frac{{\left( {y - H\lambda } \right)^{2} }}{{2\sigma^{2} }}} \right\}} \right) + 2\xi \lambda$$
(20)

where \(\odot\) denotes the elementwise product and the exponential is applied elementwise. \(g\left( \lambda \right)\) is differentiable and \(\nabla g\) satisfies the L-Lipschitz condition,

$$\exists L \in R^{ + } ,\quad \left\| {\nabla g\left( {\lambda^{\prime} } \right) - \nabla g\left( \lambda \right)} \right\|_{2}^{2} \le L\left\| {\lambda^{\prime} - \lambda } \right\|_{2}^{2} \quad \left( {\forall \lambda ,\lambda^{\prime} } \right)$$
(21)

In the neighborhood of \(\lambda_{k}\), \(g\left( \lambda \right)\) can be approximated by a second-order Taylor expansion as follows:

$$\mathop g\limits^{ \wedge } \left( \lambda \right) \simeq g\left( {\lambda_{k} } \right) + \left\langle {\nabla g\left( {\lambda_{k} } \right),\lambda - \lambda_{k} } \right\rangle + \frac{L}{2}\left\| {\lambda - \lambda_{k} } \right\|_{2}^{2} = \frac{L}{2}\left\| {\lambda - \left( {\lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)} \right)} \right\|_{2}^{2} + const$$
(22)

where const is a constant and \(\left\langle \cdot \right\rangle\) denotes the inner product. The minimum value of Eq. (22) is attained at the following:

$$\lambda_{k + 1} = \lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)$$
(23)

Gradient descent can be adopted to minimize \(g\left( \lambda \right)\): each gradient descent iteration is equivalent to minimizing the quadratic function \(\mathop g\limits^{ \wedge } \left( \lambda \right)\). Extending this method to Eq. (18), each iteration step becomes the following:

$$\lambda_{k + 1} = \mathop {\arg \min }\limits_{\lambda } \frac{L}{2}\left\| {\lambda - \left( {\lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)} \right)} \right\|_{2}^{2} { + }\eta \left\| \lambda \right\|_{1}$$
(24)

Namely, each gradient descent step for \(g\left( \lambda \right)\) must also take the minimization of the L1 norm into account.

For Eq. (24), let \(h = \lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)\) and let \(\lambda^{i}\) be the i-th component of \(\lambda\). Then, compute \(\lambda_{k + 1} = \mathop {\arg \min }\limits_{\lambda } \frac{L}{2}\left\| {\lambda - h} \right\|_{2}^{2} + \eta \left\| \lambda \right\|_{1}\). The closed-form solution is written as follows:

$$\lambda_{k + 1}^{i} = \left\{ {\begin{array}{*{20}l} {h^{i} - \eta /L,} & {h^{i} > \eta /L} \\ {0,} & {\left| {h^{i} } \right| \le \eta /L} \\ {h^{i} + \eta /L,} & {h^{i} < - \eta /L} \\ \end{array} } \right.$$
(25)

Here, \(\lambda_{k + 1}^{i}\) and \(h^{i}\) are the i-th components of \(\lambda_{k + 1}\) and h, respectively.

The CDRELM algorithm shown below includes the process of modeling and solving the mathematical model.

Pseudocode of the CDRELM algorithm (figure)
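A minimal sketch of the solution procedure described above is given below (our own illustration, not the published pseudocode; it assumes the elementwise reading of the gradient in Eq. (20) and takes the Lipschitz constant L as a user-supplied parameter):

```python
import numpy as np

def soft_threshold(h, tau):
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)

def cdrelm_solve(H, y, eta, xi, sigma, L, n_iter=500):
    """Improved PGD for Eq. (18): extrapolation, gradient step, soft thresholding."""
    P = H.shape[1]
    beta_prev = np.zeros(P)
    beta = np.zeros(P)
    for k in range(1, n_iter + 1):
        lam = beta + k * (beta - beta_prev) / (k + 5)        # lambda_k from the text
        r = y - H @ lam
        w = np.exp(-r ** 2 / (2.0 * sigma ** 2))             # elementwise C-loss factor
        grad = -H.T @ (r * w) / sigma ** 2 + 2.0 * xi * lam  # gradient of Eq. (19), cf. Eq. (20)
        h = lam - grad / L                                   # Eq. (23)
        beta_prev, beta = beta, soft_threshold(h, eta / L)   # Eq. (25)
    return beta
```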

Computational Complexity Analysis

In this section, we analyze the computational complexity of CDRELM.

For the matrix \(H \in R^{Q \times P}\), where P is the number of hidden nodes and Q is the number of samples, the computational complexity of SVD is \(O\left( {4QP^{2} + 8P^{3} } \right)\) [28]. As mentioned in "ELM," ELM computes its output weights based on the SVD of \(H \in R^{Q \times P}\), so the computational complexity of ELM is approximately the same as that of SVD.

According to Eq. (19), the computational complexity of each iteration step is also \(O\left( {4QP^{2} + 8P^{3} } \right)\). If the method converges after K iterations, the overall time complexity is \(O\left( {K\left( {4QP^{2} + 8P^{3} } \right)} \right)\).

Experiments and Discussion

We conducted experiments to verify the performance of the presented algorithm. “Performance of Improved PGD” shows the comparison between the traditional PGD and the improved PGD. The performance in dimensionality reduction is given in “Performance on Dimensionality Reduction.” “Performance for Regression” details the performance for regression. We evaluated related algorithms, including L2–ELM, L1–ELM, CELM, DRELM, BPNN, and SVR, by using different types of datasets and two activation functions. All experiments were performed in MATLAB R2016a on a desktop computer with an Intel Core i7 1160G7 CPU at 2.11 GHz, 16 GB of memory, and Windows 10.

Performance of Improved PGD

The input \(x_{i} = \left[ {x_{i1} ,x_{i2} , \ldots ,x_{i500} } \right]^{T} \in R^{500}\), \(i = 1, \ldots ,2500\), of CDRELM is generated randomly, where \(x_{ij} \in \left( {0,1} \right)\). According to Eq. (18), we can obtain \(\beta\) by using PGD and the improved PGD.

Compared with the original PGD, the improved PGD reaches the optimal value faster and requires only 59 iterations. Moreover, the optimal value of \(\beta\) calculated by the improved PGD gives a better fit. Table 1 shows the comparison between PGD and the improved PGD in terms of time, iterations, and optimal value. The visual comparison of the two methods is plotted in Fig. 2.

Table 1 Data comparison between PGD and improved PGD
Fig. 2 Comparison between PGD and improved PGD

Performance on Dimensionality Reduction

As a new embedded feature selection method, CDRELM can automatically select features to reduce the dimension of the sample dataset and predict samples simultaneously. It makes the coefficients sparse and enhances generalization ability by shrinking some coefficients and setting others to zero. It also reduces the computational complexity and improves computational efficiency by decreasing the number of nonzero components of \(\beta\).

The Swiss Roll dataset was created to verify different dimensionality reduction algorithms [29]. Here, r and l are two arrays of random numbers drawn from the continuous uniform distribution on (0, 1). The data on the coordinate axes are generated as follows:

$$\begin{gathered} t = \frac{3\pi }{2}\left( {1 + 2r} \right) \hfill \\ \left\{ {\begin{array}{*{20}c} {x = t\cos \left( t \right)} \\ {y = 2l} \\ {z = t\sin \left( t \right)} \\ \end{array} } \right. \hfill \\ \end{gathered}$$
(26)
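A sketch of this data generation in NumPy (our own illustration; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
r = rng.uniform(0.0, 1.0, n)
l = rng.uniform(0.0, 1.0, n)

t = 1.5 * np.pi * (1.0 + 2.0 * r)        # roll parameter, as in Eq. (26)
x = t * np.cos(t)
y = 2.0 * l                              # height of the roll
z = t * np.sin(t)

swiss_roll = np.column_stack([x, y, z])  # n x 3 input matrix
```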

The comparison before and after dimensionality reduction using CDRELM is shown in Figs. 3 and 4. Figure 3 shows the scatter of the generated data and is used to verify the feature selection effect of the proposed method. In Fig. 4, the z-axis values from Fig. 3 are plotted on the y-axis, and the x-axis denotes the index of the scatter points in Fig. 3.

Fig. 3 Comparison of value on z-axis before and after dimension reduction using CDRELM

Fig. 4 Performance of dimensionality reduction based on CDRELM

From the experimental results, it is obvious that CDRELM achieves significant dimensionality reduction, with reduced computational complexity and increased efficiency. From Figs. 3 and 4, it can be seen that CDRELM not only narrows the range of the data but also sets some values to 0. Obviously, the effect of dimension reduction on the z-axis is significant.

Performance for Regression

Four artificial datasets and five benchmark datasets from the UCI machine learning repository [30] and Kaggle [31] were used to test the proposed algorithm CDRELM. To evaluate the performance of CDRELM, it was compared with six algorithms: L2–ELM, L1–ELM, CELM, DRELM, BPNN, and SVR. In the experiments, two activation functions, sigmoid and sine, were used on the different datasets. Several parameters needed to be adjusted: the L1 norm regularization parameter, the L2 norm regularization parameter, the window width \(\sigma\) of the C-loss function, and the number of hidden layer nodes P. Taking the sinc function datasets as an example, we analyzed the sensitivity of CDRELM to the number of hidden layer nodes P. In Fig. 5, as the number of hidden layer nodes increases, \(R^{2}\) shows no obvious change. The numbers of hidden layer nodes tested were 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, and 400. During the experiments, we fixed the number of hidden layer nodes at 20 and combined grid search with a cross-validation technique to select the best parameters. Using the best parameters, we performed the experiments 30 times and report the results with variability information.

Fig. 5 Relationship between the number of hidden layer nodes and \(R^{2}\) on sinc function datasets

Two performance indices, the mean squared error (MSE) and the coefficient of determination (\(R^{2}\)), are defined as follows:

$$MSE = E\left( {\mathop {y_{i} }\limits^{ \wedge } - y_{i} } \right)^{2} ,\quad i = 1, \ldots ,Q$$
(27)
$$R^{2} = \frac{{\left( {l\sum\limits_{i = 1}^{l} {\mathop {y_{i} }\limits^{ \wedge } y_{i} } - \sum\limits_{i = 1}^{l} {\mathop {y_{i} }\limits^{ \wedge } } \sum\limits_{i = 1}^{l} {y_{i} } } \right)^{2} }}{{\left( {l\sum\limits_{i = 1}^{l} {\left( {\mathop {y_{i} }\limits^{ \wedge } } \right)^{2} } - \left( {\sum\limits_{i = 1}^{l} {\mathop {y_{i} }\limits^{ \wedge } } } \right)^{2} } \right)\left( {l\sum\limits_{i = 1}^{l} {y_{i}^{2} } - \left( {\sum\limits_{i = 1}^{l} {y_{i} } } \right)^{2} } \right)}}$$
(28)

where \(\mathop {y_{i} }\limits^{ \wedge }\) represents the prediction of the desired \(y_{i}\), and l is the number of testing samples. A smaller MSE or a larger performance index \(R^{2}\) reflects better generalization performance.
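For reference, both indices can be computed directly from the predictions (a sketch; here \(R^{2}\) is evaluated exactly as written in Eq. (28)):

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)   # Eq. (27)

def r_squared(y_pred, y_true):
    l = y_true.size
    num = (l * np.sum(y_pred * y_true) - np.sum(y_pred) * np.sum(y_true)) ** 2
    den = ((l * np.sum(y_pred ** 2) - np.sum(y_pred) ** 2)
           * (l * np.sum(y_true ** 2) - np.sum(y_true) ** 2))
    return num / den                         # Eq. (28)
```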

Here, we use two activation functions for comparing L2–ELM, L1–ELM, CELM, DRELM, and CDRELM on the same datasets.

Sigmoid function:

$$F(a,b,x) = \frac{1}{{1 + \exp \left( { - \left( {a_{i}^{T} x + b_{i} } \right)} \right)}}$$
(29)

Sine function:

$$F(a,x) = \sin \left( {a_{i}^{T} x} \right)$$
(30)

To achieve good generalization performance, the appropriate optimization parameter needs to be chosen. We combined grid search with a cross-validation technique to choose these parameters [32]. The regularization parameters \(\lambda\) for L1–ELM, \(\xi\) for L2–ELM, and \(\lambda\) and \(\xi\) for DRELM are all determined from the parameter set \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\). In CELM, the window width \(\sigma\) and the regularization parameter \(\lambda\) are chosen from the candidate set \(\left\{ {2^{ - 2} , \ldots ,2^{0} , \cdots ,2^{2} } \right\}\) and \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\), respectively. In CDRELM, the regularization parameters \(\eta\) and \(\xi\) are selected from \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\) and \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\), and the window width \(\sigma\) is taken from \(\left\{ {2^{ - 2} , \ldots ,2^{0} , \cdots ,2^{2} } \right\}\). In BPNN, we set the goal accuracy as 1e − 3 and the maximum iterations as 1500. The number of input layer nodes is equal to the number of input variables. The number of output layer nodes is set as 1. According to the MSE calculated using the trial-and-error method, the learning rate and the number of hidden layer nodes are selected according to different datasets. They are selected as 0.003 and 7 on artificial datasets. Meanwhile, the two parameters of learning rate and the number of hidden layer nodes are set as 0.009 and 3 on the octane number dataset, 0.008 and 7 on the Boston housing dataset, 0.008 and 11 on the life expectancy dataset, 0.005 and 7 on the energy consumption dataset, and 0.003 and 4 on the air quality dataset, respectively. The penalty parameter C for error entries of SVR affects the accuracy and generalization ability [2]. The penalty parameter C is taken from \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\) and the radial basis kernel function \(K\left( {{\mathbf{x}},{\mathbf{x}}_{i} } \right) = \exp \left\{ { - \frac{{\left\| {{\mathbf{x}} - {\mathbf{x}}_{i} } \right\|^{2} }}{{2\theta^{2} }}} \right\}\) is used in SVR where the parameter \(\theta\) is selected from \(\left\{ {2^{ - 2} , \ldots ,2^{0} , \cdots ,2^{10} } \right\}\).
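A sketch of this selection loop is given below (our own illustration built on the `cdrelm_solve` sketch above; the fold count and candidate grids are assumptions following the ranges quoted in the text):

```python
import numpy as np
from itertools import product

def grid_search_cdrelm(H, y, etas, xis, sigmas, L, k_folds=5):
    """Pick (eta, xi, sigma) minimizing the average validation MSE over k folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k_folds)
    best, best_err = None, np.inf
    for eta, xi, sigma in product(etas, xis, sigmas):
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            beta = cdrelm_solve(H[train], y[train], eta, xi, sigma, L)
            errs.append(np.mean((H[fold] @ beta - y[fold]) ** 2))
        if np.mean(errs) < best_err:
            best, best_err = (eta, xi, sigma), np.mean(errs)
    return best

# Example candidate grids following the ranges above (coarsened for illustration):
# etas = xis = 2.0 ** np.arange(-50.0, 21.0, 10.0); sigmas = 2.0 ** np.arange(-2.0, 3.0)
```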

Performance on Artificial Datasets

Three artificial datasets with white Gaussian noise (WGN) and a two-moon dataset were generated to verify the prediction accuracy and generalization ability. The power spectral density of WGN obeys a uniform distribution, and its amplitude distribution obeys a Gaussian distribution [33]. WGN in the form of a \(2001 \times 1\) vector was added to the three functions whose definitions are given in Table 2, with a noise power of 2 dBW for each function. In particular, to examine the effect of outliers on the performance comparison and the relationship between outliers and parameters, the performance of the seven algorithms on the sinc function datasets with WGN of power 0 dBW, 2 dBW, 5 dBW, and 10 dBW is shown in Fig. 7 and Table 4.

The details of the experiments on the four datasets are listed in Tables 2 and 3. For artificial datasets, almost nine-tenths of the whole dataset are used as the training set, and the remaining one-tenth is used as the testing set. Figure 6 shows regression shapes of three functions with the WGN of power 2 dBW. The regression results of sinc function, linear regression function, self-defining function, and two-moon are plotted in Figs. 7, 8, 9, and 10, respectively. Similarly, the experimental results are given in Tables 4, 5, 6, and 7, respectively.

Table 2 Functions used for generating regression datasets
Table 3 Details of artificial datasets
Fig. 6 Regression shapes of three functions with WGN

Fig. 7 Regression results on sinc function and sinc function with WGN

Fig. 8 Regression results on linear regression and linear regression with WGN

Fig. 9 Regression results on self-defining function and self-defining function with WGN

Fig. 10 Regression results on two-moon dataset

Table 4 Experimental results on sinc function and sinc function with WGN
Table 5 Experimental results on linear regression and linear regression with WGN
Table 6 Experimental results on self-defining function and self-defining function with WGN
Table 7 Experimental results on two-moon dataset

It can be seen from Figs. 7, 8, 9, and 10 and Tables 4, 5, 6, and 7 that CDRELM achieves performance comparable to the other algorithms with a much higher learning speed due to its ability to reduce dimensionality. It is clear that CDRELM takes much less time than the other algorithms. In particular, the proposed algorithm achieves better generalization performance and markedly higher efficiency than BPNN and SVR. In most cases, the proposed CDRELM obtains the largest \(R^{2}\) and the smallest MSE, which indicates the greatest robustness and the highest accuracy in terms of generalization performance. In general, CDRELM has the best generalization ability and the highest learning speed, while BPNN, SVR, L2–ELM, DRELM, and CELM perform well in comparison with L1–ELM. On the linear regression datasets, L2–ELM performs slightly better than CDRELM in terms of robustness and accuracy; however, when WGN is added to the datasets, CDRELM has the most stable performance. In Fig. 7 and Table 4, as the number of outliers increases, the determination coefficient \(R^{2}\) of the seven algorithms decreases slightly, whereas MSE and time are almost unaffected by outliers. It is clear that CDRELM always maintains a high level of generalization performance and fast efficiency.

As reported by Jing et al. [18] and Fu et al. [19], a feature selection method with L1 norm and L2 norm penalties has strong anti-interference capability in theory. Meanwhile, the experimental results (Figs. 7, 8, 9, and 10 and Tables 4, 5, 6, and 7) show that CDRELM has low sensitivity to outliers and stable performance. It can be seen from Table 4 that the selected parameters also change as the number of outliers increases: the larger the number of outliers, the lower the regularization parameter \(\eta\) for the L1 norm, whereas the number of outliers is positively associated with the regularization parameter \(\xi\) for the L2 norm. The experiments also demonstrate that the L1 norm offers automatic variable selection through a sparse vector and the L2 norm strengthens the control ability. As the number of outliers increases, the L2 norm penalty increases to prevent overfitting and the L1 norm penalty decreases to avoid excessive sparsity in CDRELM. In all the noisy settings, CDRELM maintains the accuracy and stability required to solve regression problems without interference.

Performance on Benchmark Datasets

To further verify the performance of the seven algorithms, five real-world datasets from Kaggle and the UCI machine learning repository were tested. The datasets are of different types, including low, medium, and high dimensions and small, medium, and large sizes.

We randomly divide each benchmark dataset into two subsets (80% for training and 20% for testing). Table 8 shows the descriptions of the five datasets, which can be classified into three groups:

  1. Datasets with relatively small size and high dimensions: the octane number dataset, which contains 60 gasoline samples scanned by Fourier transform near-infrared spectroscopy. Each sample has 401 features, and each feature is a wavelength point from the scanning range of 900 to 1700 nm.

  2. Datasets with medium size and medium dimensions: the Boston housing dataset has 506 samples with 13 features concerning housing prices in the suburbs of Boston; the life expectancy dataset includes 958 samples with 18 features for analyzing the factors that affect the average life expectancy in different countries.

  3. Datasets with large size and small dimensions: the energy consumption dataset covers 2208 instances with five features that show the hourly energy consumption from October 1 to December 31 during the years 2008 to 2012; the air quality dataset contains 9358 samples of hourly averaged responses from eight reference analyzers.

Table 8 Details of benchmark datasets

As can be seen from Figs. 11, 12, 13, 14, and 15, the six algorithms were compared with CDRELM on different types of datasets. Figure 11 shows the predictive ability of the seven algorithms on the octane number dataset, and their performances on the Boston housing, life expectancy, energy consumption, and air quality datasets are shown in Figs. 12, 13, 14, and 15, respectively. It is clear that CDRELM has the best performance and fits the true values best. Table 9 provides detailed comparisons of the seven algorithms on the three small- or medium-sized datasets, and Table 10 lists their performance on the two large-sized datasets. It can be seen that the proposed algorithm achieves better generalization performance on all three types of datasets at much higher learning speeds. Figures 16, 17, and 18 compare \(R^{2}\), MSE, and time among the seven algorithms on each dataset and show their performance with the two activation functions on the various sizes of datasets. In Fig. 16, the histogram represents the performance index \(R^{2}\) for the two activation functions. The comparison of MSE among the seven algorithms, with the datasets divided into large-sized and small- or medium-sized groups, is shown in Fig. 17. Interestingly, CDRELM solves problems on small- or medium-sized datasets much better than on large-sized datasets: L2–ELM, L1–ELM, CELM, DRELM, and CDRELM obtain slightly larger MSE values than SVR and BPNN on large datasets. Hence, CDRELM is better at processing small datasets than other types. Figure 18 compares the overall trends in running time on all the datasets. It can be seen that CDRELM is significantly faster than BPNN and SVR. Because the advantage of CDRELM is not easy to see in Fig. 18a, Fig. 18b is provided to compare the times more intuitively; CDRELM is more efficient than DRELM, CELM, L2–ELM, and L1–ELM. It is clear that the proposed CDRELM obtains the fastest and most stable performance with the highest accuracy.

Fig. 11 Regression results on octane number dataset

Fig. 12 Regression results on Boston housing dataset

Fig. 13 Regression results on life expectancy dataset

Fig. 14 Regression results on hourly energy consumption dataset

Fig. 15 Regression results on air quality dataset

Fig. 16 Comparison of \(R^{2}\) on all the datasets using different activation functions

Fig. 17 Comparison of MSE on all the datasets

Fig. 18 Comparison of time on all the datasets

To analyze the statistical accuracy more clearly, Table 11 lists the average ranks, computed from the average \(R^{2}\) values in Tables 9 and 10, for each algorithm with the two activation functions. As seen in Table 11, CDRELM is ranked first, followed in turn by L2–ELM, DRELM, CELM, BPNN, SVR, and L1–ELM. The experimental results also reflect the design goal of each algorithm: L1–ELM aims at reducing the learning time; DRELM has 1-norm and 2-norm penalties, giving high learning speed and the ability to prevent overfitting; CELM aims at improving accuracy and robustness. Having the advantages of both CELM and DRELM, CDRELM attains better generalization ability at a faster learning speed by introducing the C-loss function, the L2 norm, and the L1 norm into ELM.

Table 9 Experimental results on small samples of benchmark dataset
Table 10 Experimental results on large samples of benchmark dataset
Table 11 Accuracy average ranks

To obtain precise and credible results, the Friedman statistical test was used to determine whether all the algorithms have the same performance. Let N be the number of datasets and m the number of algorithms, and let \(R_{i}\) denote the average ranks in Table 11. The Friedman statistic [34] follows a \(\chi_{F}^{2}\) distribution with \(m - 1\) degrees of freedom and is defined as follows:

$$\chi_{{_{F} }}^{2} = \frac{12N}{{m\left( {m + 1} \right)}}\left[ {\sum\limits_{i} {R_{i}^{2} - \frac{{m\left( {m + 1} \right)^{2} }}{4}} } \right]$$
(31)

Based on Eq. (31), Iman et al. [35] proposed a better statistic:

$$F_{F} = \frac{{\left( {N - 1} \right)\chi_{F}^{2} }}{{N\left( {m - 1} \right) - \chi_{F}^{2} }}$$
(32)

which follows the F-distribution with \(m - 1\) and \(\left( {m - 1} \right)\left( {N - 1} \right)\) degrees of freedom. According to Tables 9 and 10, \(\chi_{F}^{2} = 44.52\) and \(F_{F} \approx 25.88\), while \(F_{0.05} \left( {6,54} \right) = 2.272\) from the F-distribution critical value table. Since \(F_{F} = 25.88 > F_{0.05} \left( {6,54} \right) = 2.272\), the null hypothesis is rejected; hence, the performance of the algorithms is significantly different.

To further differentiate the algorithms, the Nemenyi test [36] is used to pairwise compare seven algorithms. It is defined as follows:

$$CD = q_{\alpha } \sqrt {\frac{{m\left( {m + 1} \right)}}{6N}}$$
(33)

where \(q_{\alpha }\) is the critical value of the Tukey distribution. When \(\alpha = 0.05\) and \(m = 7\), \(q_{\alpha } = 2.949\) according to the Nemenyi critical value table. The null hypothesis that two algorithms have the same performance is rejected if their average ranks differ by at least the critical difference \(CD \approx 2.8439\). Because the average rank difference between CDRELM and L1–ELM is 5.9 − 1 = 4.9, which is much larger than the critical difference of 2.8439, the performance of CDRELM is substantially better than that of L1–ELM. Similarly, the performance of CDRELM is much superior to that of SVR (5.6 − 1 = 4.6 > 2.8439). Because 5.1 − 1 = 4.1 > 2.8439, the Nemenyi test also detects a significant difference between CDRELM and BPNN, and 4.6 − 1 = 3.6 > 2.8439 makes it clear that CDRELM performs much better than CELM. In contrast, 3.8 − 1 = 2.8 < 2.8439 and 2 − 1 = 1 < 2.8439, so CDRELM has only slightly better performance than DRELM and L2–ELM. The above comparison can be shown visually using the Friedman test chart. In Fig. 19, the vertical axis lists the algorithms; on the horizontal axis, each dot is the average rank value and the horizontal line segment centered on a dot represents the critical difference CD. If the horizontal line segments of two algorithms overlap, there is no remarkable difference between them. Hence, it can be clearly seen that CDRELM has significantly better performance than CELM, L1–ELM, BPNN, and SVR, and is slightly better than DRELM and L2–ELM. Of the seven algorithms, CDRELM has the best performance and L1–ELM has the worst accuracy.
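The statistics above can be reproduced from the average ranks (a short sketch; the rank values are read off from the differences quoted in the text, with m = 7 algorithms and N = 10 implied by the (6, 54) degrees of freedom):

```python
import numpy as np

# Average ranks: CDRELM, L2-ELM, DRELM, CELM, BPNN, SVR, L1-ELM
R = np.array([1.0, 2.0, 3.8, 4.6, 5.1, 5.6, 5.9])
N, m = 10, 7

chi2_F = 12 * N / (m * (m + 1)) * (np.sum(R ** 2) - m * (m + 1) ** 2 / 4)  # Eq. (31)
F_F = (N - 1) * chi2_F / (N * (m - 1) - chi2_F)                            # Eq. (32)
CD = 2.949 * np.sqrt(m * (m + 1) / (6 * N))                                # Eq. (33), q_0.05 = 2.949

print(chi2_F, F_F, CD)   # approximately 44.5, 25.9, and 2.85
```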

Fig. 19 Friedman test

Conclusion

The traditional ELM with the square loss function has the disadvantages of overfitting and high sensitivity to outliers. A new algorithm called CDRELM, which uses the nonconvex and bounded C-loss function and embeds the L1 norm and L2 norm in the objective function, is proposed. CDRELM is suited to problems with many outliers, high dimension, and small or medium sample sizes. It also offers a new embedded feature selection method with a strong capability for dimensionality reduction. Furthermore, CDRELM completes prediction and dimensionality reduction at the same time, which speeds up training. The improved PGD algorithm also makes the solving process faster and more accurate. Experiments on artificial datasets and benchmark datasets show that CDRELM has better generalization ability and greater robustness at a higher learning speed than BPNN, SVR, DRELM, CELM, L2–ELM, and L1–ELM.

It should be noted that we only verified the performance for regression. In future work, we will attempt to verify the classification capacity of the proposed algorithm. Based on the comparison of MSE, CDRELM obtains slightly larger values on large datasets; therefore, how to improve the accuracy and robustness on large-sized datasets will be another focus of our future research.