Introduction

Processing a large quantity of data carries a high computational cost and slows the training process, so a fast and stable learning algorithm is needed. In 1986, Rumelhart et al. [1] proposed the back propagation neural network (BPNN), a multilayer feedforward network trained by error correction. Support vector regression (SVR), which minimizes the generalization error bound so as to achieve good generalized performance, was then presented by Vapnik et al. [2]. Single-layer feedforward neural networks (SLFNs) have a powerful nonlinear mapping capability and generally use the gradient descent algorithm to deal with classification and regression problems [3]. However, they have several disadvantages, such as low training efficiency and being easily trapped in a local minimum. In 2006, the extreme learning machine (ELM) for SLFNs was proposed by Huang et al. [4]; it is still widely used in many research fields, such as foreign accent identification [5], fault detection [6], and emotion recognition [7]. Compared with traditional methods, such as the gradient descent algorithm, ELM can significantly increase training speed and improve generalization performance [8].

In ELM, the input weights and hidden biases are generated randomly, and the output weights are obtained simply by calculating a Moore–Penrose inverse [9]. If the amplitude distribution of the singular values is relatively continuous and the minimum singular value is very close to 0, a large output weight vector will be obtained. Therefore, the basic ELM, which is based on empirical risk minimization, tends to overfit, and its prediction ability suffers [10]. Moreover, ELM uses the traditional least squares method to compute the output weights. Because the square loss function is convex and unbounded, outliers can incur very large losses [11]. When outliers exist in the dataset, the approximation function of ELM may deviate significantly from the optimal function, resulting in poor generalization.

To overcome the above shortcomings, researchers have proposed several schemes. Deng et al. [12] put forward a regularized ELM embedding the L2 norm (L2–ELM). The algorithm uses the weighted least squares method and obtains anti-noise ability by introducing the regularization factor \(\gamma\). L1–ELM, which yields a sparse solution, was proposed by Balasundaram et al. [13]. The L1 norm is less sensitive to outliers than the L2 norm, and the decision function of L1–ELM uses fewer hidden nodes than ELM. Martínez et al. [14] introduced the L1 norm and hybrid penalties to solve ELM regression problems. The purpose of introducing different penalties is to moderate the detrimental effect of outliers: taking the importance of features into consideration, these methods automatically assign different weights to different features, so the smallest weights are assigned to outliers. An ELM model based on L1 norm and L2 norm regularization (DRELM) was proposed to handle regression and multi-class classification problems [15]; it is robust in both regression and classification applications. In 2014, the C-loss function for pattern classification was presented by Abhishek et al. [16]. The proposed loss function can improve the performance of neural network classifiers, but that paper introduces the C-loss function only for classification problems. Zhao et al. [17] offered an algorithm named the C-loss-based extreme learning machine (CELM). Although CELM has good generalization performance, it has difficulty solving the problem of overfitting.

More recently, other methods have been proposed to eliminate the distraction caused by outliers. Jing et al. [18] proposed a domain-invariant feature learning framework for partial domain adaptation. Fu et al. [19] developed a novel model termed partial feature selection and alignment, which employs a feature selection vector based on the correlation among the features of multiple source and target domains. Both works show that re-weighting and feature selection can eliminate the distraction caused by outliers; however, they mainly tackle distribution shift and label shift problems.

To develop a more stable and faster algorithm with stronger anti-interference ability, we propose a doubly regularized ELM based on the C-loss function, called CDRELM. The proposed algorithm replaces the square loss function with the C-loss function and embeds the L1 norm and L2 norm in ELM. Because the L1 norm can reduce the feature dimension of samples, CDRELM not only deals with regression problems with strong generalization performance but also, as a feature selection method, decreases the dimension at high speed. CDRELM tends to be more robust and achieves much better generalization with a faster learning speed than L2–ELM, L1–ELM, CELM, DRELM, BPNN, and SVR. To solve this mathematical model, CDRELM is transformed into a least absolute shrinkage and selection operator (Lasso) problem [20]. The three main contributions of this paper are as follows:

  1. The C-loss function is used for regression problems rather than classification problems. To overcome the sensitivity of the square loss function to outliers, the square loss function used in ELM is replaced by the C-loss function, which is bounded, nonconvex, and smooth. On this basis, a novel algorithm, CDRELM, is proposed. In comparison with the traditional ELM, CDRELM overcomes the problems of overfitting and insufficient robustness to outliers, which greatly improves the generalization capability.

  2. As a new feature selection method that performs feature selection and training simultaneously, CDRELM generates a sparse output weight vector by embedding the L1 norm. In addition, the L2 norm is added so that the output weights keep a moderate amplitude and excessive sparsity is avoided. CDRELM can therefore solve regression problems much faster owing to its ability to reduce dimension; it also reduces the computational cost and processes high-dimensional datasets efficiently.

  3. The new mathematical model is transformed into a Lasso problem to compute the solution. Based on the proximal gradient descent (PGD) algorithm [21], an improved operator replaces the original operator to solve the Lasso problem. Compared with PGD, the improved method obtains the solution faster and efficiently decreases the number of iterations, and it can be applied to various datasets with fast and accurate performance.

The rest of this paper is organized as follows. "Related Work" introduces the related work, including ELM, the C-loss function, and the proximal gradient descent algorithm. In "Proposed CDRELM Method," the novel algorithm CDRELM is presented, including its mathematical model, solution, and computational complexity analysis. The proposed algorithm not only possesses a nonconvex and bounded loss function that is robust to outliers but also embeds the L1 norm and L2 norm to carry out feature selection at high speed; CDRELM is solved by an improved proximal gradient descent method. To test the effectiveness of the proposed CDRELM, "Experiments and Discussion" presents the experimental results, including the improved solution, dimensionality reduction, and regression. "Performance for Regression" covers four artificial datasets and five benchmark datasets, and the Friedman and Nemenyi tests are also reported for comparative analysis. "Conclusion" presents conclusions and future work.

Related Work

ELM

As a single-hidden-layer feedforward neural network, ELM plays a key role in academia and industry. The development of SLFNs has enabled ELM to reach enhanced generalization performance for classification and regression at high speed.

In an SLFN, for Q arbitrary distinct samples \(\left({x}_{i},{t}_{i}\right)\), where \(x_{i} = \left[ {x_{i1} ,x_{i2} , \ldots ,x_{im} } \right]^{T} \in R^{m}\) and \(t_{i} = \left[ {t_{i1} ,t_{i2} , \ldots ,t_{in} } \right]^{T} \in R^{n}\), the relationship between the input \(x_{i}\) and the output \(f\left( {x_{i} } \right)\) is given as follows:

$$f\left( {x_{i} } \right) = \sum\limits_{j = 1}^{P} {\beta_{j} G\left( {\varpi_{j} ,b_{j} ,x_{i} } \right)} = \sum\limits_{j = 1}^{P} {\beta_{j} G\left( {\varpi_{j} \cdot x_{i} + b_{j} } \right)} ,\quad i = 1,2, \ldots ,Q,$$
(1)

where \(\varpi_{j} = \left[ {\varpi_{j1} ,\varpi_{j2} , \ldots ,\varpi_{jm} } \right]^{T}\) is the randomly generated weight vector connecting the input nodes and the j-th hidden node, and \(b_{j}\) is the randomly generated bias of the j-th hidden node; \(\beta_{j} = \left[ {\beta_{j1} ,\beta_{j2} , \ldots ,\beta_{jn} } \right]^{T}\) is the weight vector connecting the j-th hidden node and the output nodes; \(G\left( \cdot \right)\) represents the activation function; P is the number of hidden nodes; Q is the number of samples. The output function of ELM is expressed as follows:

$$H\beta = y$$
(2)

Here, \(\beta = \left[ {\beta_{1} ,\beta_{2} , \ldots ,\beta_{P} } \right]^{T}\) is the matrix of output weights and \(y = \left[ {y_{1} ,y_{2} , \ldots ,y_{Q} } \right]^{T}\) is the matrix of targets. The hidden-layer output matrix is as follows:

$$H = \left[ {\begin{array}{*{20}c} {G\left( {\varpi_{1} ,b_{1} ,x_{1} } \right)} & \ldots & {G\left( {\varpi_{P} ,b_{P} ,x_{1} } \right)} \\ \vdots & \ddots & \vdots \\ {G\left( {\varpi_{1} ,b_{1} ,x_{Q} } \right)} & \cdots & {G\left( {\varpi_{P} ,b_{P} ,x_{Q} } \right)} \\ \end{array} } \right]$$
(3)

The output weights \(\beta\) can be determined by solving the linear system in Eq. (2) as follows:

$$\beta = H^{ + } y$$
(4)

where \(H^{ + }\) is the Moore–Penrose generalized inverse of the matrix H [22]. ELM computes \(H^{ + }\) in Eq. (4) based on the singular value decomposition (SVD) of H.
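For concreteness, the training procedure of Eqs. (2)–(4) can be sketched in a few lines of NumPy (a minimal illustration, not the authors' implementation; the sigmoid activation and the variable names are our own choices):

```python
import numpy as np

def elm_train(X, y, P, seed=0):
    """Basic ELM: random hidden parameters, pseudo-inverse output weights (Eq. (4))."""
    rng = np.random.default_rng(seed)
    Q, m = X.shape
    W = rng.standard_normal((P, m))           # random input weights, one row per hidden node
    b = rng.standard_normal(P)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # hidden-layer output matrix, Eq. (3)
    beta = np.linalg.pinv(H) @ y              # Moore-Penrose solution of H beta = y
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```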

C-loss Function

There are several loss functions, such as the hinge loss function [23], the \(\psi\)-learning loss function [24], the normalized sigmoid loss function [25], and the ramp loss function [26]. Compared with the square loss function, these loss functions enhance robustness better because of their nonconvexity and boundedness. To find a better loss function, Abhishek et al. [16] proposed the C-loss function, defined by the following:

$$l_{C} \left( \theta \right) = 1 - \exp \left\{ { - \frac{{\theta^{2} }}{{2\sigma^{2} }}} \right\}$$
(5)

where \(\theta = y - f\left( x \right)\) is the error and \(\sigma\) is the window width. The comparison of various loss functions is depicted in Fig. 1.

Fig. 1 Comparison of loss functions

Compared with the other loss functions, the C-loss function is bounded, nonconvex, and smooth, which makes it more stable to outliers. The C-loss can handle errors of all sizes in classification problems. In this paper, we introduce the C-loss function into a doubly regularized ELM for regression problems.
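As a brief numerical illustration of this boundedness (our own sketch; the window width and error values are arbitrary):

```python
import numpy as np

def square_loss(theta):
    return 0.5 * theta ** 2

def c_loss(theta, sigma=1.0):
    # Eq. (5): bounded in [0, 1), so an outlier error saturates instead of exploding
    return 1.0 - np.exp(-theta ** 2 / (2.0 * sigma ** 2))

errors = np.array([0.1, 1.0, 10.0])   # the last value mimics an outlier
print(square_loss(errors))            # unbounded: 0.005, 0.5, 50.0
print(c_loss(errors))                 # bounded:   ~0.005, ~0.39, ~1.0
```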

Proximal Gradient Descent Algorithm

In 2004, Boyd et al. [21] proposed PGD to solve L1-regularized problems. It is an effective and rapid solution to Lasso problems in many applications.

Let \(\nabla\) be a differential operator. The optimization objective is as follows:

$$\mathop {\min }\limits_{x} g\left( x \right) + \eta \left\| x \right\|_{1}$$
(6)

If \(g\left( x \right)\) is differentiable and \(\nabla g\) satisfies the L-Lipschitz condition,

$$\exists L \in R^{ + } ,\quad \left\| {\nabla g\left( {x^{\prime} } \right) - \nabla g\left( x \right)} \right\|_{2}^{2} \le L\left\| {x^{\prime} - x} \right\|_{2}^{2} \quad \left( {\forall x,x^{\prime} } \right)$$
(7)

then, in the neighborhood of \(x_{k}\), \(g\left( x \right)\) can be approximated by a second-order Taylor expansion as follows:

$$\mathop g\limits^{ \wedge } \left( x \right) \simeq g\left( {x_{k} } \right) + \left\langle {\nabla g\left( {x_{k} } \right),x - x_{k} } \right\rangle + \frac{L}{2}\left\| {x - x_{k} } \right\|_{2}^{2} = \frac{L}{2}\left\| {x - \left( {x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)} \right)} \right\|_{2}^{2} + const$$
(8)

where const is a constant and \(\left\langle \cdot \right\rangle\) denotes the inner product. The minimum value of Eq. (8) is attained at the following:

$$x_{k + 1} = x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)$$
(9)

Gradient descent can thus be adopted to minimize \(g\left( x \right)\): each gradient descent iteration is equivalent to minimizing the quadratic function \(\mathop g\limits^{ \wedge } \left( x \right)\). Applying this to Eq. (6), each iteration step becomes the following:

$$x_{k + 1} = \mathop {\arg \min }\limits_{x} \frac{L}{2}\left\| {x - \left( {x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)} \right)} \right\|_{2}^{2} { + }\eta \left\| x \right\|_{1}$$
(10)

That is, each gradient descent step for \(g\left( x \right)\) must also take the minimization of the L1 norm into account.

For Eq. (10), let \(h = x_{k} - \frac{1}{L}\nabla g\left( {x_{k} } \right)\) and let \(x^{i}\) be the i-th component of x. Then, the closed-form solution is written as follows:

$$x_{k + 1}^{i} = \left\{ {\begin{array}{*{20}l} {h^{i} - \eta /L,} & {h^{i} > \eta /L} \\ {0,} & {\left| {h^{i} } \right| \le \eta /L} \\ {h^{i} + \eta /L,} & {h^{i} < - \eta /L} \\ \end{array} } \right.$$
(11)

where \(x_{k + 1}^{i}\) and \(h^{i}\) are the i-th components of \(x_{k + 1}\) and h, respectively.
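In other words, one PGD iteration is a gradient step on g followed by componentwise soft thresholding. A minimal sketch for the standard Lasso case, with \(g(x) = \tfrac{1}{2}\left\| y - Ax \right\|_{2}^{2}\) and L taken as the largest eigenvalue of \(A^{T}A\) (an illustrative choice, not the authors' code), is:

```python
import numpy as np

def soft_threshold(h, tau):
    # Closed-form solution of Eq. (11), applied componentwise
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)

def pgd_lasso(A, y, eta, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient of g
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)         # gradient of the smooth part g(x)
        h = x - grad / L                 # gradient step, Eq. (9)
        x = soft_threshold(h, eta / L)   # proximal step, Eqs. (10)-(11)
    return x
```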

Proposed CDRELM Method

Our framework adopts the bounded, nonconvex, and smooth C-loss function, which allows outliers to be handled successfully. In addition, CDRELM, based on L1 and L2 regularization, completes feature selection and training simultaneously, which greatly decreases the training time. Therefore, CDRELM can be regarded as a new embedded feature selection method with fast and stable performance.

Mathematical Model

Regression problems investigate the relationship between predictions and targets. To solve such problems, factors such as prediction accuracy, time, robustness, and model size should be considered.

A single-output regression problem is formulated as follows:

$$y = H\beta + \theta$$
(12)

where \(\beta = \left[ {\beta_{1} ,\beta_{2} , \ldots ,\beta_{P} } \right]^{T}\) is the vector of regression weights and \(\theta = \left[ {\theta_{1} ,\theta_{2} , \ldots ,\theta_{Q} } \right]^{T}\) is the error between the predicted and target values. The input of the problem, H, is a \(Q \times P\) matrix that can be described as follows:

$$H = \left[ {\begin{array}{*{20}c} {h_{11} } & \cdots & {h_{1P} } \\ \vdots & \ddots & \vdots \\ {h_{Q1} } & \cdots & {h_{QP} } \\ \end{array} } \right]$$
(13)

The traditional solution is estimated with the square loss function and can be defined as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \frac{{\left\| {y - H\beta } \right\|_{2}^{2} }}{2}$$
(14)

where \(\mathop \beta \limits^{ \wedge } = \left[ {\mathop {\beta_{1} }\limits^{ \wedge } ,\mathop {\beta_{2} }\limits^{ \wedge } , \ldots ,\mathop {\beta_{P} }\limits^{ \wedge } } \right]^{T}\) is the vector of estimated regression weights. The square loss function is convex and unbounded, whereas the C-loss function, which is smooth, bounded, and nonconvex, can improve robustness and reduce overfitting. To overcome the instability of the square loss with respect to outliers, the square loss function is replaced with the C-loss function, and Eq. (14) is transformed as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\}} \right\}$$
(15)

It is known that large-scale datasets lead to high computational cost. As a regularization technique, the L1 norm has been used to sparsify the coefficients and enhance generalization ability by shrinking some coefficients and setting others to 0. The Lasso estimate is shown as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \eta \left\| \beta \right\|_{1} } \right\}$$
(16)

where \(\eta\) is a positive regularization parameter and \(\left\| \cdot \right\|_{1}\) is the L1 norm. The value of \(\eta\) controls the number of nonzero components of \(\mathop \beta \limits^{ \wedge }\).

Nevertheless, Zou et al. [27] noted that when \(Q < P\), Lasso can select at most Q variables, and in the general situation of \(Q > P\), Lasso does not behave well if there are high correlations between predictors. To solve this problem, the L2 norm is added to the mathematical model, so that the output weight vector \(\mathop \beta \limits^{ \wedge }\) remains sparse while avoiding "over-sparsity." The modified system can now be expressed as follows:

$$\mathop \beta \limits^{ \wedge } = \mathop {\arg \min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \eta \left\| \beta \right\|_{1} + \xi \left\| \beta \right\|_{2}^{2} } \right\}$$
(17)

where \(\xi\) is the L2 norm regularization parameter and \(\left\| \cdot \right\|_{2}\) is the L2 norm. According to [12,13,14,15], introducing a 2-norm regularization parameter into ELM creates the model L2–ELM, which has good generalization performance and strong control ability; compared with ELM, L2–ELM limits the model space and avoids overfitting. Embedding a 1-norm regularization parameter in ELM gives the model L1–ELM, which has a fast learning speed, achieves sparsity, and has good optimization characteristics. DRELM can control the complexity of the network and prevent overfitting. In our proposed mathematical model, the C-loss function increases the robustness to outliers, the L1 norm offers automatic variable selection through a sparse vector, and the L2 norm strengthens the control ability. All the components of the process are performed simultaneously, which significantly decreases the training time and yields strong generalization.
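For reference, the objective in Eq. (17) can be evaluated as follows (a sketch under our own reading of the vector notation, with the C-loss term summed over samples; not the authors' implementation):

```python
import numpy as np

def cdrelm_objective(beta, H, y, eta, xi, sigma):
    r = y - H @ beta                                              # residuals
    c_loss = np.sum(1.0 - np.exp(-r ** 2 / (2.0 * sigma ** 2)))   # C-loss term
    return c_loss + eta * np.sum(np.abs(beta)) + xi * np.sum(beta ** 2)
```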

It is clear that Eq. (17) is nonconvex, so CDRELM cannot be solved by traditional convex optimization algorithms. Therefore, it is necessary to develop a more efficient method for solving CDRELM.

Solution

Based on the mathematical model of CDRELM, it can be transformed into an equivalent Lasso problem [20]. According to the proximal gradient descent (PGD) algorithm [21], an improved operator replaces the original operator to solve the Lasso problem. In this paper, the improved PGD is used to compute \(\mathop \beta \limits^{ \wedge }\) of CDRELM.

Let \(\nabla\) be a differential operator. The optimization objective of CDRELM is as follows:

$$J\left( \beta \right) = \mathop {\min }\limits_{\beta } \left\{ {1 - \exp \left\{ { - \frac{{\left( {y - H\beta } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \eta \left\| \beta \right\|_{1} + \xi \left\| \beta \right\|_{2}^{2} } \right\}$$
(18)

Let \(\lambda_{k} = \beta_{k} + \frac{{k\left( {\beta_{k} - \beta_{k - 1} } \right)}}{k + 5}\), where \(\beta_{k}\) is the value of \(\beta\) at the k-th step and the initial values \(\beta_{0}\) and \(\beta_{1}\) are both zero vectors of size \(P \times 1\). Replacing \(\beta_{k}\) with \(\lambda_{k}\) decreases the difference between the next gradient updating direction and the current gradient direction.

$$g\left( \lambda \right) = 1 - \exp \left\{ { - \frac{{\left( {y - H\lambda } \right)^{2} }}{{2\sigma^{2} }}} \right\} + \xi \left\| \lambda \right\|_{2}^{2}$$
(19)
$$\nabla g = - \frac{1}{{\sigma^{2} }}H^{T} \left( {\left( {y - H\lambda } \right) \odot \exp \left\{ { - \frac{{\left( {y - H\lambda } \right)^{2} }}{{2\sigma^{2} }}} \right\}} \right) + 2\xi \lambda$$
(20)

where \(\odot\) denotes the elementwise product and the exponential is applied elementwise. \(g\left( \lambda \right)\) is differentiable and \(\nabla g\) satisfies the L-Lipschitz condition,

$$\exists L \in R^{ + } ,\quad \left\| {\nabla g\left( {\lambda^{\prime} } \right) - \nabla g\left( \lambda \right)} \right\|_{2}^{2} \le L\left\| {\lambda^{\prime} - \lambda } \right\|_{2}^{2} \quad \left( {\forall \lambda ,\lambda^{\prime} } \right)$$
(21)

In the neighborhood of \(\lambda_{k}\), \(g\left( \lambda \right)\) can be approximated by a second-order Taylor expansion as follows:

$$\mathop g\limits^{ \wedge } \left( \lambda \right) \simeq g\left( {\lambda_{k} } \right) + \left\langle {\nabla g\left( {\lambda_{k} } \right),\lambda - \lambda_{k} } \right\rangle + \frac{L}{2}\left\| {\lambda - \lambda_{k} } \right\|_{2}^{2} = \frac{L}{2}\left\| {\lambda - \left( {\lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)} \right)} \right\|_{2}^{2} + const$$
(22)

where const is a constant and \(\left\langle \cdot \right\rangle\) denotes the inner product. The minimum value of Eq. (22) is attained at the following:

$$\lambda_{k + 1} = \lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)$$
(23)

Gradient descent can be adopted to minimize \(g\left( \lambda \right)\): each gradient descent iteration is equivalent to minimizing the quadratic function \(\mathop g\limits^{ \wedge } \left( \lambda \right)\). Extending this method to Eq. (18), each iteration step becomes the following:

$$\lambda_{k + 1} = \mathop {\arg \min }\limits_{\lambda } \frac{L}{2}\left\| {\lambda - \left( {\lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)} \right)} \right\|_{2}^{2} { + }\eta \left\| \lambda \right\|_{1}$$
(24)

Namely, each gradient descent step for \(g\left( \lambda \right)\) must also take the minimization of the L1 norm into account.

For Eq. (24), let \(h = \lambda_{k} - \frac{1}{L}\nabla g\left( {\lambda_{k} } \right)\) and let \(\lambda^{i}\) be the i-th component of \(\lambda\). Then, compute \(\lambda_{k + 1} = \mathop {\arg \min }\limits_{\lambda } \frac{L}{2}\left\| {\lambda - h} \right\|_{2}^{2} + \eta \left\| \lambda \right\|_{1}\). The closed-form solution is written as follows:

$$\lambda_{k + 1}^{i} = \left\{ {\begin{array}{*{20}l} {h^{i} - \eta /L,} & {h^{i} > \eta /L} \\ {0,} & {\left| {h^{i} } \right| \le \eta /L} \\ {h^{i} + \eta /L,} & {h^{i} < - \eta /L} \\ \end{array} } \right.$$
(25)

Here, \(\lambda_{k + 1}^{i}\) and \(h^{i}\) are the i-th components of \(\lambda_{k + 1}\) and h, respectively.

The CDRELM algorithm shown below includes the process of modeling and solving the mathematical model.

Pseudocode of the CDRELM algorithm (figure)
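A minimal sketch of the solution procedure described above is given below (our own illustration, not the published pseudocode; it assumes the elementwise reading of the gradient in Eq. (20) and takes the Lipschitz constant L as a user-supplied parameter):

```python
import numpy as np

def soft_threshold(h, tau):
    return np.sign(h) * np.maximum(np.abs(h) - tau, 0.0)

def cdrelm_solve(H, y, eta, xi, sigma, L, n_iter=500):
    """Improved PGD for Eq. (18): extrapolation, gradient step, soft thresholding."""
    P = H.shape[1]
    beta_prev = np.zeros(P)
    beta = np.zeros(P)
    for k in range(1, n_iter + 1):
        lam = beta + k * (beta - beta_prev) / (k + 5)        # lambda_k from the text
        r = y - H @ lam
        w = np.exp(-r ** 2 / (2.0 * sigma ** 2))             # elementwise C-loss factor
        grad = -H.T @ (r * w) / sigma ** 2 + 2.0 * xi * lam  # gradient of Eq. (19), cf. Eq. (20)
        h = lam - grad / L                                   # Eq. (23)
        beta_prev, beta = beta, soft_threshold(h, eta / L)   # Eq. (25)
    return beta
```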

Computational Complexity Analysis

In this section, we analyze the computational complexity of CDRELM.

For the matrix \(H \in R^{Q \times P}\), where P is the number of hidden nodes and Q is the number of samples, the computational complexity of SVD is \(O\left( {4QP^{2} + 8P^{3} } \right)\) [28]. As mentioned in "ELM," ELM computes its output weights based on the SVD of \(H \in R^{Q \times P}\), so the computational complexity of ELM is approximately the same as that of SVD.

According to Eq. (19), the computational complexity of each iteration step is also \(O\left( {4QP^{2} + 8P^{3} } \right)\). If the method converges after K iterations, the overall time complexity is \(O\left( {K\left( {4QP^{2} + 8P^{3} } \right)} \right)\).

Experiments and Discussion

We conducted experiments to verify the performance of the presented algorithm. “Performance of Improved PGD” shows the comparison between the traditional PGD and the improved PGD. The performance in dimensionality reduction is given in “Performance on Dimensionality Reduction.” “Performance for Regression” details the performance for regression. We evaluated related algorithms, including L2–ELM, L1–ELM, CELM, DRELM, BPNN, and SVR, by using different types of datasets and two activation functions. All experiments were performed in MATLAB R2016a on a desktop computer with an Intel Core i7 1160G7 CPU at 2.11 GHz, 16 GB of memory, and Windows 10.

Performance of Improved PGD

The input \(x_{i} = \left[ {x_{i1} ,x_{i2} , \ldots ,x_{i500} } \right]^{T} \in R^{500}\), \(i = 1, \ldots ,2500\), of CDRELM is generated randomly, where \(x_{ij} \in \left( {0,1} \right)\). According to Eq. (18), we can obtain \(\beta\) by using PGD and the improved PGD.

Compared with the original PGD, the improved PGD reaches the optimal value faster and requires only 59 iterations. Moreover, the optimal value of \(\beta\) calculated by the improved PGD gives a better fit. Table 1 shows the comparison between PGD and the improved PGD in terms of time, iterations, and optimal value. The visual comparison of the two methods is plotted in Fig. 2.

Table 1 Data comparison between PGD and improved PGD
Fig. 2 Comparison between PGD and improved PGD

Performance on Dimensionality Reduction

As a new embedded feature selection method, CDRELM can automatically select features to reduce the dimension of the sample dataset and predict samples simultaneously. It makes the coefficients sparse and enhances generalization ability by shrinking some coefficients and setting others to zero. It also reduces the computational complexity and improves computational efficiency by decreasing the number of nonzero components of \(\beta\).

The Swiss Roll dataset was created to verify different dimensionality reduction algorithms [29]. Here, r and l are two arrays of random numbers drawn from the continuous uniform distribution on (0, 1). The data on the coordinate axes are generated as follows:

$$\begin{gathered} t = \frac{3\pi }{2}\left( {1 + 2r} \right) \hfill \\ \left\{ {\begin{array}{*{20}c} {x = t\cos \left( t \right)} \\ {y = 2l} \\ {z = t\sin \left( t \right)} \\ \end{array} } \right. \hfill \\ \end{gathered}$$
(26)
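A sketch of this data generation in NumPy (our own illustration; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
r = rng.uniform(0.0, 1.0, n)
l = rng.uniform(0.0, 1.0, n)

t = 1.5 * np.pi * (1.0 + 2.0 * r)        # roll parameter, as in Eq. (26)
x = t * np.cos(t)
y = 2.0 * l                              # height of the roll
z = t * np.sin(t)

swiss_roll = np.column_stack([x, y, z])  # n x 3 input matrix
```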

The comparison before and after dimensionality reduction using CDRELM is shown in Figs. 3 and 4. Figure 3 shows the scatter of the generated data and is used to verify the feature selection effect of the proposed method. In Fig. 4, the z-axis values from Fig. 3 are plotted on the y-axis, and the x-axis denotes the index of the scatter points in Fig. 3.

Fig. 3 Comparison of value on z-axis before and after dimension reduction using CDRELM

Fig. 4 Performance of dimensionality reduction based on CDRELM

From the experimental results, it is obvious that CDRELM achieves significant dimensionality reduction, with reduced computational complexity and increased efficiency. From Figs. 3 and 4, it can be seen that CDRELM not only narrows the range of the data but also sets some values to 0. Obviously, the effect of dimension reduction on the z-axis is significant.

Performance for Regression

Four artificial datasets and five benchmark datasets from the UCI machine learning repository [30] and Kaggle [31] were used to test the proposed algorithm CDRELM. To evaluate the performance of CDRELM, it was compared with six algorithms: L2–ELM, L1–ELM, CELM, DRELM, BPNN, and SVR. In the experiments, two activation functions, sigmoid and sine, were used on the different datasets. Several parameters needed to be adjusted: the L1 norm regularization parameter, the L2 norm regularization parameter, the window width \(\sigma\) of the C-loss function, and the number of hidden layer nodes P. Taking the sinc function datasets as an example, we analyzed the sensitivity of CDRELM to the number of hidden layer nodes P. In Fig. 5, as the number of hidden layer nodes increases, \(R^{2}\) shows no obvious change. The numbers of hidden layer nodes tested were 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 200, 300, and 400. During the experiments, we fixed the number of hidden layer nodes at 20 and combined grid search with a cross-validation technique to select the best parameters. Using the best parameters, we performed the experiments 30 times and report the results with variability information.

Fig. 5 Relationship between the number of hidden layer nodes and \(R^{2}\) on sinc function datasets

Two performance indices, the mean squared error (MSE) and the coefficient of determination (\(R^{2}\)), are defined as follows:

$$MSE = E\left( {\mathop {y_{i} }\limits^{ \wedge } - y_{i} } \right)^{2} ,\quad i = 1, \ldots ,Q$$
(27)
$$R^{2} = \frac{{\left( {l\sum\limits_{i = 1}^{l} {\mathop {y_{i} }\limits^{ \wedge } y_{i} } - \sum\limits_{i = 1}^{l} {\mathop {y_{i} }\limits^{ \wedge } } \sum\limits_{i = 1}^{l} {y_{i} } } \right)^{2} }}{{\left( {l\sum\limits_{i = 1}^{l} {\left( {\mathop {y_{i} }\limits^{ \wedge } } \right)^{2} } - \left( {\sum\limits_{i = 1}^{l} {\mathop {y_{i} }\limits^{ \wedge } } } \right)^{2} } \right)\left( {l\sum\limits_{i = 1}^{l} {y_{i}^{2} } - \left( {\sum\limits_{i = 1}^{l} {y_{i} } } \right)^{2} } \right)}}$$
(28)

where \(\mathop {y_{i} }\limits^{ \wedge }\) represents the prediction of the desired \(y_{i}\), and l is the number of testing samples. A smaller MSE or a larger performance index \(R^{2}\) reflects better generalization performance.
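For reference, both indices can be computed directly from the predictions (a sketch; here \(R^{2}\) is evaluated exactly as written in Eq. (28)):

```python
import numpy as np

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)   # Eq. (27)

def r_squared(y_pred, y_true):
    l = y_true.size
    num = (l * np.sum(y_pred * y_true) - np.sum(y_pred) * np.sum(y_true)) ** 2
    den = ((l * np.sum(y_pred ** 2) - np.sum(y_pred) ** 2)
           * (l * np.sum(y_true ** 2) - np.sum(y_true) ** 2))
    return num / den                         # Eq. (28)
```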

Here, we use two activation functions for comparing L2–ELM, L1–ELM, CELM, DRELM, and CDRELM on the same datasets.

Sigmoid function:

$$F(a,b,x) = \frac{1}{{1 + \exp \left( { - \left( {a_{i}^{T} x + b_{i} } \right)} \right)}}$$
(29)

Sine function:

$$F(a,x) = \sin \left( {a_{i}^{T} x} \right)$$
(30)

To achieve good generalization performance, the appropriate optimization parameter needs to be chosen. We combined grid search with a cross-validation technique to choose these parameters [32]. The regularization parameters \(\lambda\) for L1–ELM, \(\xi\) for L2–ELM, and \(\lambda\) and \(\xi\) for DRELM are all determined from the parameter set \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\). In CELM, the window width \(\sigma\) and the regularization parameter \(\lambda\) are chosen from the candidate set \(\left\{ {2^{ - 2} , \ldots ,2^{0} , \cdots ,2^{2} } \right\}\) and \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\), respectively. In CDRELM, the regularization parameters \(\eta\) and \(\xi\) are selected from \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\) and \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\), and the window width \(\sigma\) is taken from \(\left\{ {2^{ - 2} , \ldots ,2^{0} , \cdots ,2^{2} } \right\}\). In BPNN, we set the goal accuracy as 1e − 3 and the maximum iterations as 1500. The number of input layer nodes is equal to the number of input variables. The number of output layer nodes is set as 1. According to the MSE calculated using the trial-and-error method, the learning rate and the number of hidden layer nodes are selected according to different datasets. They are selected as 0.003 and 7 on artificial datasets. Meanwhile, the two parameters of learning rate and the number of hidden layer nodes are set as 0.009 and 3 on the octane number dataset, 0.008 and 7 on the Boston housing dataset, 0.008 and 11 on the life expectancy dataset, 0.005 and 7 on the energy consumption dataset, and 0.003 and 4 on the air quality dataset, respectively. The penalty parameter C for error entries of SVR affects the accuracy and generalization ability [2]. The penalty parameter C is taken from \(\left\{ {2^{ - 50} , \ldots ,2^{0} , \cdots ,2^{20} } \right\}\) and the radial basis kernel function \(K\left( {{\mathbf{x}},{\mathbf{x}}_{i} } \right) = \exp \left\{ { - \frac{{\left\| {{\mathbf{x}} - {\mathbf{x}}_{i} } \right\|^{2} }}{{2\theta^{2} }}} \right\}\) is used in SVR where the parameter \(\theta\) is selected from \(\left\{ {2^{ - 2} , \ldots ,2^{0} , \cdots ,2^{10} } \right\}\).
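A sketch of this selection loop is given below (our own illustration built on the `cdrelm_solve` sketch above; the fold count and candidate grids are assumptions following the ranges quoted in the text):

```python
import numpy as np
from itertools import product

def grid_search_cdrelm(H, y, etas, xis, sigmas, L, k_folds=5):
    """Pick (eta, xi, sigma) minimizing the average validation MSE over k folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k_folds)
    best, best_err = None, np.inf
    for eta, xi, sigma in product(etas, xis, sigmas):
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            beta = cdrelm_solve(H[train], y[train], eta, xi, sigma, L)
            errs.append(np.mean((H[fold] @ beta - y[fold]) ** 2))
        if np.mean(errs) < best_err:
            best, best_err = (eta, xi, sigma), np.mean(errs)
    return best

# Example candidate grids following the ranges above (coarsened for illustration):
# etas = xis = 2.0 ** np.arange(-50.0, 21.0, 10.0); sigmas = 2.0 ** np.arange(-2.0, 3.0)
```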

Performance on Artificial Datasets

Three artificial datasets with white Gaussian noise (WGN) and a two-moon dataset were generated to verify the prediction accuracy and generalization ability. The power spectral density of WGN obeys a uniform distribution, and its amplitude distribution obeys a Gaussian distribution [33]. WGN in the form of a \(2001 \times 1\) vector was added to the three functions whose definitions are given in Table 2, with a noise power of 2 dBW for each function. In particular, to examine the effect of outliers on the performance comparison and the relationship between outliers and parameters, the performance of the seven algorithms on the sinc function datasets with WGN of power 0 dBW, 2 dBW, 5 dBW, and 10 dBW is shown in Fig. 7 and Table 4.

The details of the experiments on the four datasets are listed in Tables 2 and 3. For artificial datasets, almost nine-tenths of the whole dataset are used as the training set, and the remaining one-tenth is used as the testing set. Figure 6 shows regression shapes of three functions with the WGN of power 2 dBW. The regression results of sinc function, linear regression function, self-defining function, and two-moon are plotted in Figs. 7, 8, 9, and 10, respectively. Similarly, the experimental results are given in Tables 4, 5, 6, and 7, respectively.

Table 2 Functions used for generating regression datasets
Table 3 Details of artificial datasets
Fig. 6 Regression shapes of three functions with WGN

Fig. 7 Regression results on sinc function and sinc function with WGN

Fig. 8 Regression results on linear regression and linear regression with WGN

Fig. 9 Regression results on self-defining function and self-defining function with WGN

Fig. 10 Regression results on two-moon dataset

Table 4 Experimental results on sinc function and sinc function with WGN
Table 5 Experimental results on linear regression and linear regression with WGN
Table 6 Experimental results on self-defining function and self-defining function with WGN
Table 7 Experimental results on two-moon dataset

It can be seen from Figs. 7, 8, 9, and 10 and Tables 4, 5, 6, and 7 that CDRELM achieves performance comparable to the other algorithms with a much higher learning speed due to its ability to reduce dimensionality. It is clear that CDRELM takes much less time than the other algorithms. In particular, the proposed algorithm achieves better generalization performance and markedly higher efficiency than BPNN and SVR. In most cases, the proposed CDRELM obtains the largest \(R^{2}\) and the smallest MSE, which indicates the greatest robustness and the highest accuracy in terms of generalization performance. In general, CDRELM has the best generalization ability and the highest learning speed, while BPNN, SVR, L2–ELM, DRELM, and CELM perform well in comparison with L1–ELM. On the linear regression datasets, L2–ELM performs slightly better than CDRELM in terms of robustness and accuracy; however, when WGN is added to the datasets, CDRELM has the most stable performance. In Fig. 7 and Table 4, as the number of outliers increases, the determination coefficient \(R^{2}\) of the seven algorithms decreases slightly, whereas MSE and time are almost unaffected by outliers. It is clear that CDRELM always maintains a high level of generalization performance and fast efficiency.

As reported by Jing et al. [18] and Fu et al. [19], a feature selection method with L1 norm and L2 norm penalties has strong anti-interference capability in theory. Meanwhile, the experimental results (Figs. 7, 8, 9, and 10 and Tables 4, 5, 6, and 7) show that CDRELM has low sensitivity to outliers and stable performance. It can be seen from Table 4 that the selected parameters also change as the number of outliers increases: the larger the number of outliers, the lower the regularization parameter \(\eta\) for the L1 norm, whereas the number of outliers is positively associated with the regularization parameter \(\xi\) for the L2 norm. The experiments also demonstrate that the L1 norm offers automatic variable selection through a sparse vector and the L2 norm strengthens the control ability. As the number of outliers increases, the L2 norm penalty increases to prevent overfitting and the L1 norm penalty decreases to avoid excessive sparsity in CDRELM. In all the noisy settings, CDRELM maintains the accuracy and stability required to solve regression problems without interference.

Performance on Benchmark Datasets

To further verify the performance of the seven algorithms, five real-world datasets from Kaggle and the UCI machine learning repository were tested. The datasets are of different types, including low, medium, and high dimensions and small, medium, and large sizes.

We randomly divide each benchmark dataset into two subsets (80% for training and 20% for testing). Table 8 shows the descriptions of the five datasets, which can be classified into three groups:

  1. Datasets with relatively small size and high dimensions: the octane number dataset, which contains 60 gasoline samples scanned by Fourier transform near-infrared spectroscopy. Each sample has 401 features, and each feature is a wavelength point from the scanning range of 900 to 1700 nm.

  2. Datasets with medium size and medium dimensions: the Boston housing dataset has 506 samples with 13 features concerning housing prices in the suburbs of Boston; the life expectancy dataset includes 958 samples with 18 features for analyzing the factors that affect the average life expectancy in different countries.

  3. Datasets with large size and small dimensions: the energy consumption dataset covers 2208 instances with five features that show the hourly energy consumption from October 1 to December 31 during the years 2008 to 2012; the air quality dataset contains 9358 samples of hourly averaged responses from eight reference analyzers.

Table 8 Details of benchmark datasets

As can be seen from Figs. 11, 12, 13, 14, and 15, the six algorithms were compared with CDRELM on different types of datasets. Figure 11 shows the predictive ability of the seven algorithms on the octane number dataset, and their performances on the Boston housing, life expectancy, energy consumption, and air quality datasets are shown in Figs. 12, 13, 14, and 15, respectively. It is clear that CDRELM has the best performance and fits the true values best. Table 9 provides detailed comparisons of the seven algorithms on the three small- or medium-sized datasets, and Table 10 lists their performance on the two large-sized datasets. It can be seen that the proposed algorithm achieves better generalization performance on all three types of datasets at much higher learning speeds. Figures 16, 17, and 18 compare \(R^{2}\), MSE, and time among the seven algorithms on each dataset and show their performance with the two activation functions on the various sizes of datasets. In Fig. 16, the histogram represents the performance index \(R^{2}\) for the two activation functions. The comparison of MSE among the seven algorithms, with the datasets divided into large-sized and small- or medium-sized groups, is shown in Fig. 17. Interestingly, CDRELM solves problems on small- or medium-sized datasets much better than on large-sized datasets: L2–ELM, L1–ELM, CELM, DRELM, and CDRELM obtain slightly larger MSE values than SVR and BPNN on large datasets. Hence, CDRELM is better at processing small datasets than other types. Figure 18 compares the overall trends in running time on all the datasets. It can be seen that CDRELM is significantly faster than BPNN and SVR. Because the advantage of CDRELM is not easy to see in Fig. 18a, Fig. 18b is provided to compare the times more intuitively; CDRELM is more efficient than DRELM, CELM, L2–ELM, and L1–ELM. It is clear that the proposed CDRELM obtains the fastest and most stable performance with the highest accuracy.

Fig. 11 Regression results on octane number dataset

Fig. 12 Regression results on Boston housing dataset

Fig. 13 Regression results on life expectancy dataset

Fig. 14 Regression results on hourly energy consumption dataset

Fig. 15 Regression results on air quality dataset

Fig. 16 Comparison of \(R^{2}\) on all the datasets using different activation functions

Fig. 17 Comparison of MSE on all the datasets

Fig. 18 Comparison of time on all the datasets

To analyze the statistical accuracy more clearly, Table 11 lists the average ranks, computed from the average \(R^{2}\) values in Tables 9 and 10, for each algorithm with the two activation functions. As seen in Table 11, CDRELM is ranked first, followed in turn by L2–ELM, DRELM, CELM, BPNN, SVR, and L1–ELM. The experimental results also reflect the design goal of each algorithm: L1–ELM aims at reducing the learning time; DRELM has 1-norm and 2-norm penalties, giving high learning speed and the ability to prevent overfitting; CELM aims at improving accuracy and robustness. Having the advantages of both CELM and DRELM, CDRELM attains better generalization ability at a faster learning speed by introducing the C-loss function, the L2 norm, and the L1 norm into ELM.

Table 9 Experimental results on small samples of benchmark dataset
Table 10 Experimental results on large samples of benchmark dataset
Table 11 Accuracy average ranks

To obtain precise and credible results, the Friedman statistical test was used to determine whether all the algorithms have the same performance. Let N be the number of datasets and m the number of algorithms, and let \(R_{i}\) denote the average ranks in Table 11. The Friedman statistic [34] follows a \(\chi_{F}^{2}\) distribution with \(m - 1\) degrees of freedom and is defined as follows:

$$\chi_{{_{F} }}^{2} = \frac{12N}{{m\left( {m + 1} \right)}}\left[ {\sum\limits_{i} {R_{i}^{2} - \frac{{m\left( {m + 1} \right)^{2} }}{4}} } \right]$$
(31)

Based on Eq. (31), Iman et al. [35] proposed a better statistic:

$$F_{F} = \frac{{\left( {N - 1} \right)\chi_{F}^{2} }}{{N\left( {m - 1} \right) - \chi_{F}^{2} }}$$
(32)

which follows the F-distribution with \(m - 1\) and \(\left( {m - 1} \right)\left( {N - 1} \right)\) degrees of freedom. According to Tables 9 and 10, \(\chi_{F}^{2} = 44.52\) and \(F_{F} \approx 25.88\), while \(F_{0.05} \left( {6,54} \right) = 2.272\) from the F-distribution critical value table. Since \(F_{F} = 25.88 > F_{0.05} \left( {6,54} \right) = 2.272\), the null hypothesis is rejected; hence, the performance of the algorithms is significantly different.

To further differentiate the algorithms, the Nemenyi test [36] is used to pairwise compare seven algorithms. It is defined as follows:

$$CD = q_{\alpha } \sqrt {\frac{{m\left( {m + 1} \right)}}{6N}}$$
(33)

where \(q_{\alpha }\) is the critical value of the Tukey distribution. When \(\alpha = 0.05\) and \(m = 7\), \(q_{\alpha } = 2.949\) according to the Nemenyi critical value table. The null hypothesis that two algorithms have the same performance is rejected if their average ranks differ by at least the critical difference \(CD \approx 2.8439\). Because the average rank difference between CDRELM and L1–ELM is 5.9 − 1 = 4.9, which is much larger than the critical difference of 2.8439, the performance of CDRELM is substantially better than that of L1–ELM. Similarly, the performance of CDRELM is much superior to that of SVR (5.6 − 1 = 4.6 > 2.8439). Because 5.1 − 1 = 4.1 > 2.8439, the Nemenyi test also detects a significant difference between CDRELM and BPNN, and 4.6 − 1 = 3.6 > 2.8439 makes it clear that CDRELM performs much better than CELM. In contrast, 3.8 − 1 = 2.8 < 2.8439 and 2 − 1 = 1 < 2.8439, so CDRELM has only slightly better performance than DRELM and L2–ELM. The above comparison can be shown visually using the Friedman test chart. In Fig. 19, the vertical axis lists the algorithms; on the horizontal axis, each dot is the average rank value and the horizontal line segment centered on a dot represents the critical difference CD. If the horizontal line segments of two algorithms overlap, there is no remarkable difference between them. Hence, it can be clearly seen that CDRELM has significantly better performance than CELM, L1–ELM, BPNN, and SVR, and is slightly better than DRELM and L2–ELM. Of the seven algorithms, CDRELM has the best performance and L1–ELM has the worst accuracy.
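The statistics above can be reproduced from the average ranks (a short sketch; the rank values are read off from the differences quoted in the text, with m = 7 algorithms and N = 10 implied by the (6, 54) degrees of freedom):

```python
import numpy as np

# Average ranks: CDRELM, L2-ELM, DRELM, CELM, BPNN, SVR, L1-ELM
R = np.array([1.0, 2.0, 3.8, 4.6, 5.1, 5.6, 5.9])
N, m = 10, 7

chi2_F = 12 * N / (m * (m + 1)) * (np.sum(R ** 2) - m * (m + 1) ** 2 / 4)  # Eq. (31)
F_F = (N - 1) * chi2_F / (N * (m - 1) - chi2_F)                            # Eq. (32)
CD = 2.949 * np.sqrt(m * (m + 1) / (6 * N))                                # Eq. (33), q_0.05 = 2.949

print(chi2_F, F_F, CD)   # approximately 44.5, 25.9, and 2.85
```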

Fig. 19 Friedman test

Conclusion

The traditional ELM with the square loss function has the disadvantages of overfitting and high sensitivity to outliers. A new algorithm called CDRELM, which uses the nonconvex and bounded C-loss function and embeds the L1 norm and L2 norm in the objective function, is proposed. CDRELM is suited to problems with many outliers, high dimension, and small or medium sample sizes. It also offers a new embedded feature selection method with a strong capability for dimensionality reduction. Furthermore, CDRELM completes prediction and dimensionality reduction at the same time, which speeds up training. The improved PGD algorithm also makes the solving process faster and more accurate. Experiments on artificial datasets and benchmark datasets show that CDRELM has better generalization ability and greater robustness at a higher learning speed than BPNN, SVR, DRELM, CELM, L2–ELM, and L1–ELM.

It should be noted that we only verified the performance for regression. In future work, we will attempt to verify the classification capacity of the proposed algorithm. Based on the comparison of MSE, CDRELM obtains slightly larger values on large datasets; therefore, how to improve the accuracy and robustness on large-sized datasets will be another focus of our future research.