1 Introduction

Time series prediction arises in many aspects of life [1,2,3,4], such as forecasting the daily number of discharged hospital inpatients, wind power prediction, global financial forecasting, and so on. Typical methods include statistical regression [5], gray prediction [6] and machine learning [7]. The autoregressive moving average [8] is a commonly used statistical regression method, but it cannot solve nonlinear problems. Gray prediction [9] is suitable for time series prediction with partially uncertain information, but it is not suitable for static datasets. Therefore, prediction methods based on machine learning [10], which require no assumptions about the data or model, have attracted wide attention.

In the field of machine learning, the commonly used methods include the decision tree (DT) [11], the support vector machine (SVM) [12] and artificial neural networks (ANN) [13,14,15,16,17], among which the ANN has drawn much attention due to its nonlinear approximation ability. The most classical ANN is the feed-forward neural network (FNN) [16], which can approximate any nonlinear system. However, it is difficult for the FNN to capture the hidden sequence information of time series data. Therefore, the recurrent neural network (RNN) [17] was proposed to solve complex time series problems. However, its network structure and training method may lead to low training efficiency and memory loss. Therefore, Jaeger proposed a new type of RNN, named the echo state network (ESN) [18].

Nowadays, the ESN has been successfully applied to time series prediction [18,19,20,21,22,23,24,25]. Unlike the traditional RNN, the ESN uses a reservoir to store and manage information. The input weights and internal weights of the ESN are generated randomly and remain unchanged; only the output weights (also called the readout) need to be trained. In [21], the generation of reservoirs and the training of readouts are reviewed. In [22], a hierarchical ESN trained by stochastic gradient descent is proposed. In [23], an ESN with leaky-integrator neurons is designed, which can easily adapt to temporal characteristics.

In the reservoir initialization phase, hundreds of sparsely connected neurons are generated, and some neurons may have little influence on training performance. If all reservoir nodes are connected with the network outputs, the ESN will perform very well on training data but poorly on testing data, leading to the overfitting problem. Hence, how to design a suitable reservoir size to improve the performance of the ESN has always been a research focus. In [24], the singular value decomposition-based growing ESN is proposed, which can weaken the coupling among reservoir neurons. In [25], a reservoir pruning method is designed, in which the mutual information between reservoir states is used to delete nodes. However, the pruning method may destroy the echo state property of the ESN [26].

To avoid the overfitting problem, regularization techniques are widely applied to sparsify the readout of the ESN, rather than to control the size of the reservoir directly [27, 28]. In [29], the reservoir nodes are dynamically added or deleted according to their importance to network performance, and the l2 regularization is used to update the output weights. However, the l2 regularization is not able to generate a sparse ESN. In [30], the l1 penalty term is added to the objective function to shrink some irrelevant output weights to small values, so that the readout is sparse. In [31], the l0 regularization is used for sparse signal recovery, which is able to reduce computation complexity and improve classification ability simultaneously. In [32], the online sparse ESN is designed, in which the l1 and l0 norms are respectively used as penalty terms to control the network size, and the sparse recursive least squares and sub-gradient algorithms are combined to estimate the output weights. This method has shown superior performance to other ESNs in prediction accuracy and network sparseness. Hence, the l0 and l1 regularization are the focus of this paper.

In traditional regularization approaches [27,28,29,30,31,32], the regularization coefficient is used to introduce the penalty term into the objective function,

$$F\left( \mathbf{W}^{\text{out}} \right) = \left\| \mathbf{T} - \mathbf{H}\mathbf{W}^{\text{out}} \right\|_{2}^{2} + \mu \left\| \mathbf{W}^{\text{out}} \right\|_{p}$$
(1)

where the first term and the second term are the training error and the penalty term, respectively, \(\mu\) is the regularization coefficient, Wout is the output weight of the ESN, and p = 0 or 1 indicates the l0-norm or l1-norm, respectively. The regularization coefficient balances the training error and the sparseness of Wout. Different values of \(\mu\) lead to different optimal solutions [33], and a small change of the regularization coefficient can have a great influence on the training results. Thus, it is important to choose an appropriate regularization coefficient.

To avoid choosing a regularization coefficient, in this paper, the optimization of Eq. (1) is formulated as a multi-objective optimization problem (MOP), in which the two conflicting objectives are optimized together [34]. From the viewpoint of optimization, many Pareto-optimal solutions can be obtained by multi-objective optimization algorithms, and thus it is difficult to determine which solution gives the best network structure and training error. To select an appropriate solution, the preferences of the decision maker should be considered [35]. The knee point is proposed in [36]; at a knee point, a small improvement in one objective causes a large deterioration in the other [37,38,39]. Although a knee-point solution does not provide the best result for every problem, it is still a Pareto-optimal solution and offers a good tradeoff for the MOP.

In this paper, the multi-objective sparse ESN (MOS-ESN) is proposed, in which the training error and the network size are treated as two optimization objectives. The main contributions are as follows. Firstly, the MOEA/D-based multi-objective optimization algorithm is designed to optimize the network structure and network performance. Secondly, to improve algorithm convergence, a local search strategy based on the l1 or l0 regularization and the coordinate descent algorithm is designed. Thirdly, the preference information of the knee point is integrated into the weight vectors updating method, which guides the evolution of the population toward the knee region. Simulation results show that MOS-ESN can improve the training accuracy and network sparseness without involving any regularization parameters.

The paper is organized as follows. Section 2 introduces the basic description of the ESN, MOPs and MOEA/D. The proposed MOS-ESN is given in Sect. 3. The simulations are discussed in Sect. 4. The paper is summarized in Sect. 5.

2 Background

2.1 Original ESN

The original ESN (OESN) in Fig. 1 is constructed with an input layer, a reservoir and an output layer. The OESN has n input neurons in the input layer, N nodes in the reservoir and one output node. The input layer and the reservoir are connected by the input weight matrix Win ∈ \(\mathbb{R}^{N \times n}\), the elements in the reservoir are tied by the internal weight matrix W ∈ \(\mathbb{R}^{N \times N}\), while the input layer and the reservoir are related to the output layer through the output weight matrix Wout ∈ \(\mathbb{R}^{\left( N + n \right) \times 1}\). Consider L distinct samples {u(k), t(k)}, where u(k) = [u1(k), u2(k), …, un(k)]T ∈ \(\mathbb{R}^{n \times 1}\) and t(k) are the input and target, respectively, and the reservoir state x(k) is updated as below,

$$\mathbf{x}(k) = \mathbf{g}\left( \mathbf{W}\mathbf{x}(k - 1) + \mathbf{W}^{\text{in}}\mathbf{u}(k) \right)$$
(2)

where g(·) = [g1(·), …, gN(·)]T are the activation functions. The output y(k) is equal to (Wout)T[x(k); u(k)], where [x(k); u(k)] ∈ \(\mathbb{R}^{\left( N + n \right) \times 1}\) is the concatenation of the reservoir state and the input.

Fig. 1
figure 1

Description of OESN

Denote T = [t(1), t(2), …, t(L)]T as the target data matrix and H = [X(1), X(2), …, X(L)]T as the internal state matrix, where X(k) = [x(k); u(k)], as below

$$\mathbf{H} = \left[ \begin{array}{c} \mathbf{X}^{T}(1) \\ \mathbf{X}^{T}(2) \\ \vdots \\ \mathbf{X}^{T}(L) \end{array} \right] = \left[ \begin{array}{cccc} X_{1,1} & X_{1,2} & \cdots & X_{1,N+n} \\ X_{2,1} & X_{2,2} & \cdots & X_{2,N+n} \\ \vdots & \vdots & \ddots & \vdots \\ X_{L,1} & X_{L,2} & \cdots & X_{L,N+n} \end{array} \right]$$
(3)

The output weight matrix Wout can be calculated by

$${\mathbf{W}}^{{{\text{out}}}} = {\mathbf{H}}^{\dag } {\mathbf{T}}$$
(4)

where H† is the Moore–Penrose pseudoinverse of H, which can be computed by orthogonal projection methods [40], singular value decomposition, and so on. However, if the input data contain unknown random noise, the inverse calculation of H may lead to an ill-posed problem, i.e., an unstable solution is obtained.
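For concreteness, the following Python sketch implements the reservoir update of Eq. (2) and the pseudoinverse readout of Eq. (4). It is a minimal sketch: the input-weight range, the spectral-radius scaling and the tanh activation are illustrative assumptions, not settings reported in this paper.

```python
import numpy as np

def run_reservoir(U, N=100, rho=0.9, seed=0):
    """U: (L, n) input sequence -> H: (L, N + n) state matrix with rows [x(k); u(k)]."""
    rng = np.random.default_rng(seed)
    L, n = U.shape
    W_in = rng.uniform(-0.1, 0.1, size=(N, n))            # fixed random input weights
    W = rng.uniform(-1.0, 1.0, size=(N, N))               # fixed random internal weights
    W *= rho / np.max(np.abs(np.linalg.eigvals(W)))       # scale spectral radius (assumed)
    x = np.zeros(N)
    H = np.zeros((L, N + n))
    for k in range(L):
        x = np.tanh(W @ x + W_in @ U[k])                  # Eq. (2)
        H[k] = np.concatenate([x, U[k]])                  # X(k) = [x(k); u(k)]
    return H

def train_readout(H, T):
    """Eq. (4): W_out = pinv(H) @ T."""
    return np.linalg.pinv(H) @ T
```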

2.2 MOPs

MOPs contain several conflicting objective functions that should be optimized at the same time. Generally speaking, a minimization MOP can be expressed as below:

$$\min \mathbf{F}\left( \mathbf{W} \right) = \left[ f_{1}\left( \mathbf{W} \right), f_{2}\left( \mathbf{W} \right), \ldots, f_{m}\left( \mathbf{W} \right) \right]^{\text{T}}$$
(5)

subject to W ∈ Ω, where W is the decision variable, m is the number of objective functions, Ω is the decision space, F: Ω → Rm consists of the m objective functions, and Rm is named the objective space.

For two solutions W1 and W2, W1 is said to dominate W2 (denoted as W1 ≺ W2) if and only if fi(W1) ≤ fi(W2) for each objective i ∈ {1, …, m}, and fj(W1) < fj(W2) for at least one objective j ∈ {1, …, m} [40].

Furthermore, a solution W ∈ Ω is defined as Pareto-optimal if there is no other feasible solution W′ such that W′ ≺ W. The set of all Pareto-optimal solutions is named the Pareto-optimal set (PS), and its image in the objective space is called the Pareto-optimal front (PF) [34].
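The dominance relation and the resulting non-dominated filtering can be written compactly as below (a sketch assuming all objectives are minimized and stored as rows of a NumPy array).

```python
import numpy as np

def dominates(fa, fb):
    """True if the solution with objective vector fa dominates the one with fb."""
    return np.all(fa <= fb) and np.any(fa < fb)

def pareto_front(F):
    """Return indices of the non-dominated rows of F (shape: n_solutions x m)."""
    idx = []
    for i, fi in enumerate(F):
        if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i):
            idx.append(i)
    return idx
```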

2.3 MOEA/D

MOEA/D decomposes an MOP into several single-objective subproblems by multiple weight vectors, and all the subproblems can be optimized at the same time [40]. The main steps of MOEA/D are given below:

Step 1: Generate the initial population, x1, x2, …, xP, and a group of uniformly distributed weight vectors λ = (λ1, …, λP), where P is the population size.

Step 2: Compute the Euclidean distance between any two weight vectors and find the nearest T vectors of each vector λi, which are denoted as the neighborhood of λi and represented as B(i) = {i1, i2, …, iT}.

Step 3: Choose two indices k and l from B(i) randomly. Apply genetic operators on xk and xl to generate a new individual y.

Step 4: Update neighboring solutions. When the aggregate function value of y is smaller than or equal to that of xj, set xj = y, where j ∈ B(i).

Step 5: Determine the non-dominated solutions of population and update the external population (EP), which saves the non-dominated solutions.

The main feature of MOEA/D is its decomposition method [40], such as the weighted sum approach, the Tchebycheff approach and the boundary intersection approach. The weighted sum approach takes the following form

$$\min g^{\text{ws}}\left( \mathbf{W} \mid \lambda_{i} \right) = \sum\limits_{k = 1}^{m} \lambda_{i,k} f_{k}\left( \mathbf{W} \right)$$
(6)

where λi = {λi,1, λi,2, …, λi,m} represents the weight vector corresponding to each objective function, and it is noted that \(\sum\nolimits_{k = 1}^{m} {\lambda_{i,k} = 1}\).
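A minimal sketch of Steps 1-5 is shown below. The toy bi-objective problem and the arithmetic-crossover-plus-noise variation are stand-ins for the genetic operators used in this paper, and the external population maintenance of Step 5 is omitted for brevity.

```python
import numpy as np

def moead(n_gen=100, P=50, T=10, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    F = lambda w: np.array([np.sum(w ** 2), np.sum((w - 1.0) ** 2)])  # toy objectives (assumed)
    t = np.linspace(0.0, 1.0, P)
    lam = np.stack([t, 1.0 - t], axis=1)                              # Step 1: uniform weight vectors
    dist = np.linalg.norm(lam[:, None, :] - lam[None, :, :], axis=2)
    B = np.argsort(dist, axis=1)[:, :T]                               # Step 2: T nearest neighbors
    X = rng.uniform(-1.0, 2.0, size=(P, dim))                         # Step 1: initial population
    FX = np.array([F(x) for x in X])
    g_ws = lambda f, lv: lv @ f                                       # Eq. (6): weighted sum
    for _ in range(n_gen):
        for i in range(P):
            k, l = rng.choice(B[i], size=2, replace=False)            # Step 3: pick two neighbors
            a = rng.random()
            y = a * X[k] + (1.0 - a) * X[l] + 0.1 * rng.normal(size=dim)
            fy = F(y)
            for j in B[i]:                                            # Step 4: update neighborhood
                if g_ws(fy, lam[j]) <= g_ws(FX[j], lam[j]):
                    X[j], FX[j] = y, fy
    return X, FX                                                      # Step 5 (EP maintenance omitted)
```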

3 MOS-ESN

To optimize the network size and training error simultaneously, the MOS-ESN is proposed. Firstly, the design of ESN is formulated as a bi-objective optimization problem, which is solved by MOEA/D. Secondly, to improve algorithm convergence, the l1 and l0 regularization-based local search strategy is designed. Furthermore, to find more solutions around the knee point, the decision maker preference-based weight vectors updating method is proposed.

3.1 Problem Formulation

The network size of the ESN is closely related to its training performance. To illustrate their relationship, a simple experiment is designed. Firstly, an ESN with 200 nodes is randomly initialized. Then, several sparse output weights are generated, and the corresponding training errors are recorded. Finally, the training error (denoted as f1) and the number of nonzero elements of the output weights (denoted as f2) are drawn in Fig. 2. Obviously, the training error decreases as the network size increases, but a too large network will lead to the overfitting problem. However, if the network is too small, the training of the ESN will be insufficient. Hence, how to achieve a balance between network size and training error becomes the key research issue.

Fig. 2
figure 2

Relationship between training error and network size

To solve this problem, regularization methods introduce an l1 or l0 norm penalty term, and the design of the ESN is then realized by optimizing the following objective function

$$\mathop{\min}\limits_{\mathbf{W}^{\text{out}}} F\left( \mathbf{W}^{\text{out}} \right) = \min \left( \left\| \mathbf{T} - \mathbf{H}\mathbf{W}^{\text{out}} \right\|_{2}^{2} + \mu \left\| \mathbf{W}^{\text{out}} \right\|_{0/1} \right)$$
(7)

where \(\mu\) is the regularization parameter. Actually, the selection of the regularization parameter is a difficult problem, because a large \(\mu\) produces a small reservoir with a large training error, while a small \(\mu\) has the opposite effect [33].

To avoid choosing the regularization coefficient, the problem in Eq. (7) is treated as a multi-objective optimization problem,

$$\mathop {\min }\limits_{{{\mathbf{W}}^{{{\text{out}}}} }} {\text{F}}\left( {{\mathbf{W}}^{{{\text{out}}}} } \right)\,=\,\min \left( {\left\| {{\mathbf{T}} - {\mathbf{HW}}^{{{\text{out}}}} } \right\|_{{2}}^{{2}} ,\left\| {{\mathbf{W}}^{{{\text{out}}}} } \right\|_{{0/1}} } \right)$$
(8)

where the first term is the training error and the second is the network size. To minimize the training error and network size simultaneously, MOEA/D is used, in which the weighted sum approach is applied to generate a set of subproblems

$$\mathop{\min}\limits_{\mathbf{W}^{\text{out}}} g^{\text{ws}}\left( \mathbf{W}^{\text{out}} \mid \lambda \right) = \lambda_{1} f_{1}\left( \mathbf{W}^{\text{out}} \right) + \lambda_{2} f_{2}\left( \mathbf{W}^{\text{out}} \right)$$
(9)

where λ1 and λ2 represent the weight of f1 and f2, respectively, and λ1 + λ2 = 1.
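The two objectives of Eq. (8) and the scalarized subproblem of Eq. (9) can be evaluated as in the sketch below; H, T and W_out are assumed to be NumPy arrays of compatible shapes.

```python
import numpy as np

def objectives(W_out, H, T, p=0):
    """f1: squared training error; f2: l0 (count) or l1 (sum of magnitudes) of W_out."""
    f1 = np.sum((T - H @ W_out) ** 2)
    f2 = np.count_nonzero(W_out) if p == 0 else np.sum(np.abs(W_out))
    return f1, f2

def g_ws(W_out, H, T, lam1, lam2, p=0):
    """Eq. (9): weighted-sum subproblem with lam1 + lam2 = 1."""
    f1, f2 = objectives(W_out, H, T, p)
    return lam1 * f1 + lam2 * f2
```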

3.2 Local Search Method

To accelerate the convergence speed of MOEA/D, the local search strategy is proposed, in which the l1 or l0 regularization term is applied to ensure network sparsity, and the coordinate descent algorithm is introduced to update the elements of Wout.

3.2.1 The l1 regularization-based local search method

By using the l1 regularization, the problem in Eq. (8) is formulated as below:

$$\mathop{\min}\limits_{\mathbf{W}^{\text{out}}} F\left( \mathbf{W}^{\text{out}} \right) = \min \left( \left\| \mathbf{T} - \mathbf{H}\mathbf{W}^{\text{out}} \right\|_{2}^{2} + \mu \left\| \mathbf{W}^{\text{out}} \right\|_{1} \right)$$
(10)

where \(\left\| \mathbf{W}^{\text{out}} \right\|_{1} = \sum\nolimits_{i = 1}^{N + n} \left| w_{i} \right|\) represents the l1-norm of Wout. The subproblem in Eq. (9) is then described as

$$E\left( {{\mathbf{W}}^{{{\text{out}}}} } \right)\,=\,\frac{{\lambda_{{1}} }}{{2}}\left\| {{\mathbf{T}}{ - }{\mathbf{HW}}^{{{\text{out}}}} } \right\|_{{2}}^{{2}} { + }\lambda_{{2}} \left\| {{\mathbf{W}}^{{{\text{out}}}} } \right\|_{{1}}$$
(11)

To facilitate computational analysis, \(\lambda_{1}/2\) is used in Eq. (11) instead of λ1. Because the quadratic term in Eq. (11) is differentiable and the l1 penalty is separable, the coordinate descent algorithm, which has shown strong local search ability, is selected to calculate Wout. Under the framework of the coordinate descent algorithm, in each iteration the ith variable wi (i = 1, 2, …, N + n) of Wout is updated while the other elements remain unchanged. Thus, Eq. (11) becomes

$$\begin{aligned} E\left( \mathbf{W}^{\text{out}}\left( w_{i} \right) \right) &= \frac{\lambda_{1}}{2}\sum\limits_{k = 1}^{L} \left[ \mathbf{t}(k) - \mathbf{X}_{ki} w_{i} - \sum\limits_{j \ne i}^{N + n} \mathbf{X}_{kj} w_{j} \right]^{2} + \lambda_{2}\sum\limits_{i = 1}^{N + n} \left| w_{i} \right| \\ &= \frac{\lambda_{1}}{2}\sum\limits_{k = 1}^{L} \left( \mathbf{X}_{ki} w_{i} \right)^{2} - \lambda_{1}\left[ \sum\limits_{k = 1}^{L} \left( \mathbf{t}(k) - \sum\limits_{j \ne i}^{N + n} \mathbf{X}_{kj} w_{j} \right)\mathbf{X}_{ki} \right] w_{i} \\ &\quad + \frac{\lambda_{1}}{2}\sum\limits_{k = 1}^{L} \left( \mathbf{t}(k) - \sum\limits_{j \ne i}^{N + n} \mathbf{X}_{kj} w_{j} \right)^{2} + \lambda_{2}\sum\limits_{i = 1}^{N + n} \left| w_{i} \right| \end{aligned}$$
(12)

It can be found that \(\frac{\lambda_{1}}{2}\sum\nolimits_{k = 1}^{L} \left( \mathbf{t}(k) - \sum\nolimits_{j \ne i}^{N + n} \mathbf{X}_{kj} w_{j} \right)^{2}\) is irrelevant to wi; thus, minimizing E(Wout(wi)) in Eq. (12) is equivalent to minimizing Z(Wout(wi)),

$$Z\left( {{\mathbf{W}}^{{out}} \left( {w_{i} } \right)} \right) = \frac{{\lambda _{1} }}{2}\sum\limits_{{k = 1}}^{L} {\left( {{\mathbf{X}}_{{ki}} w_{i} } \right)^{2} } - \lambda _{1} \left\{ {\left[ {\sum\limits_{{k = 1}}^{L} {({\mathbf{t}}(k) - \sum\limits_{{j \ne i}}^{{N + n}} {{\mathbf{X}}_{{kj}} w_{j} } )} {\mathbf{X}}_{{ki}} } \right] \cdot w_{i} } \right\} + \sum\limits_{{{\text{i = 1}}}}^{{N + n}} {\lambda _{2} |w_{i} |}$$
(13)

The sub-gradient of the l1-norm is given below

$$\partial \left( {\left\| {w_{i} } \right\|} \right) = \left\{ {\begin{array}{*{20}l} {1,} \hfill & {{\text{if}}\,w_{i} > 0} \hfill \\ { - 1,} \hfill & {{\text{if}}\,w_{i} < 0} \hfill \\ {\alpha \in \left[ { - 1,1} \right],} \hfill & {{\text{if}}\,w_{i} = 0} \hfill \\ \end{array} } \right.$$
(14)

When the derivative of Z(Wout(wi)) with respect to wi is equal to zero, the minimum of Z(Wout(wi)) is obtained. The derivative of Z(Wout(wi)) is given as

$$\frac{\partial Z\left( \mathbf{W}^{\text{out}}\left( w_{i} \right) \right)}{\partial w_{i}} = \lambda_{1}\left[ \sum\limits_{k = 1}^{L} \mathbf{X}_{ki}^{2} w_{i} - \sum\limits_{k = 1}^{L} \left( \mathbf{t}(k) - \sum\limits_{j \ne i}^{N + n} \mathbf{X}_{kj} w_{j} \right)\mathbf{X}_{ki} \right] + \lambda_{2}\frac{\partial \left\| w_{i} \right\|_{1}}{\partial w_{i}}$$
(15)

To simplify the calculation, two parameters D and C are introduced

$$D = \sum\limits_{k = 1}^{L} {{\mathbf{X}}_{ki}^{2} }$$
(16)
$$C = \sum\limits_{k = 1}^{L} \left[ \mathbf{t}(k) - \sum\limits_{j \ne i}^{N + n} \mathbf{X}_{kj} w_{j} \right]\mathbf{X}_{ki}$$
(17)

Thus, the derivative of Z(Wout( wi)) is given,

$$\frac{{\partial Z({\mathbf{W}}^{{out}} (w_{i} ))}}{{\partial w_{i} }} = \left\{ {\begin{array}{*{20}l} {\lambda _{1} \left( {Dw_{i} - C} \right) + \lambda _{2} ,} \hfill & {{\text{if }}w_{i} > 0} \hfill \\ {\lambda _{1} \left( {Dw_{i} - C} \right) - \lambda _{2} ,} \hfill & {{\text{if }}w_{i} < 0} \hfill \\ {\left[ { - \lambda _{1} C - \lambda _{2} , - \lambda _{1} C + \lambda _{2} } \right],} \hfill & {{\text{if }}w_{i} = 0} \hfill \\ \end{array} } \right.$$
(18)

By setting \(\frac{\partial Z\left( \mathbf{W}^{\text{out}}\left( w_{i} \right) \right)}{\partial w_{i}} = 0\), the update equation of wi is obtained as Eq. (19), and the corresponding threshold function is shown in Fig. 3a.

$$w_{i} = \left\{ \begin{array}{ll} \dfrac{C - \frac{\lambda_{2}}{\lambda_{1}}}{D}, & \text{if } C > \frac{\lambda_{2}}{\lambda_{1}} \\ \dfrac{C + \frac{\lambda_{2}}{\lambda_{1}}}{D}, & \text{if } C < -\frac{\lambda_{2}}{\lambda_{1}} \\ 0, & \text{if } -\frac{\lambda_{2}}{\lambda_{1}} \le C \le \frac{\lambda_{2}}{\lambda_{1}} \end{array} \right.$$
(19)
Fig. 3
figure 3

The soft thresholding function and the modified one

In Eq. (19), wi is determined by the threshold \(\lambda_{2}/\lambda_{1}\), which is a fixed value and not related to the weight obtained at the last iteration, wi−1. Therefore, an adjustment is made to Eq. (19)

$$w_{i} = \left\{ \begin{array}{ll} \dfrac{C - \operatorname{sgn}(C)\frac{\lambda_{2}}{\lambda_{1}}\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+}}{D}, & \text{if } \left| C \right| > \frac{\lambda_{2}}{\lambda_{1}}\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+} \\ 0, & \text{if } \left| C \right| \le \frac{\lambda_{2}}{\lambda_{1}}\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+} \end{array} \right.$$
(20)

where wi−1 represents the weight at the last iteration, ε is a small positive value in (0, 1), and (x)+ equals 1/x when x ≤ 1 and 1 otherwise.

The advantage of the above method is its adjustable threshold \(\frac{\lambda_{2}}{\lambda_{1}}\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+}\). When wi−1 is small, C has a higher probability of lying between \(-\frac{\lambda_{2}}{\lambda_{1}}\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+}\) and \(\frac{\lambda_{2}}{\lambda_{1}}\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+}\). Therefore, wi is attracted to zero with a higher probability (shown in Fig. 3b), and the increased threshold can reduce ||Wout||1 effectively. On the contrary, if wi−1 is large, the threshold decreases so that the weight is not forced to zero.
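A sketch of one coordinate descent sweep with the modified soft threshold of Eq. (20) is given below. The residual-update trick and the default value of ε are implementation choices for illustration, not prescriptions from this paper.

```python
import numpy as np

def plus(x):
    """(x)_+ as defined after Eq. (20): 1/x for x <= 1, otherwise 1."""
    return 1.0 / x if x <= 1.0 else 1.0

def l1_coordinate_sweep(w, H, t, lam1, lam2, eps=0.05):
    """One pass over all coordinates of w using Eqs. (16), (17) and (20)."""
    w = w.copy()
    r = t - H @ w                                     # running residual
    for i in range(len(w)):
        r += H[:, i] * w[i]                           # remove coordinate i from the residual
        D = np.sum(H[:, i] ** 2)                      # Eq. (16)
        C = np.sum(r * H[:, i])                       # Eq. (17)
        tau = (lam2 / lam1) * plus(eps + abs(w[i]))   # modified threshold, w[i] is the last value
        w[i] = (C - np.sign(C) * tau) / D if abs(C) > tau else 0.0   # Eq. (20)
        r -= H[:, i] * w[i]                           # put the updated coordinate back
    return w
```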

3.2.2 The smoothed l0 regularization-based local search method

Actually, the l1 regularization always generates many components that are close but not equal to zero. To generate a sparser solution, the l0 regularization is considered

$$\mathop {\min }\limits_{{{\mathbf{W}}^{{{\text{out}}}} }} {\text{F}}\left( {{\mathbf{W}}^{{{\text{out}}}} } \right)\,=\,\min \left( {\left\| {{\mathbf{T}}{ - }{\mathbf{HW}}^{{{\text{out}}}} } \right\|_{{2}}^{{2}} ,\left\| {{\mathbf{W}}^{{{\text{out}}}} } \right\|_{{0}} } \right)$$
(21)

However, the minimization of Eq. (21) is NP-hard. To solve it, ||Wout||0 is approximated by \(\left\| \mathbf{W}^{\text{out}} \right\|_{0} \approx g\left( \mathbf{W}^{\text{out}} \right) = \sum\nolimits_{i = 1}^{N + n} \left( 1 - e^{-Q\left| w_{i} \right|} \right)\), where Q is an appropriate positive constant. The subdifferential of \(g\left( \mathbf{W}^{\text{out}} \right)\) with respect to wi is as below

$$\frac{{\partial g(w_{i} )}}{{\partial w_{i} }}\,=\,sgn(w_{i} ) \cdot Q \cdot e^{{{ - }Q{|}w_{i} {|}}}$$
(22)

The term e−Q|wi| is approximated by its first-order Taylor series expansion

$${\text{e}}^{{ - Q|w_{i} |}} \approx \left\{ {\begin{array}{*{20}l} {1 - Q|w_{i} |,} \hfill & {|w_{i} | \le \frac{1}{Q}} \hfill \\ {0,} \hfill & {{\text{others}}} \hfill \\ \end{array} } \right.$$
(23)

Similar to the l1 regularization case, the subdifferential of the objective function can be described as:

$$\frac{\partial Z\left( \mathbf{W}^{\text{out}}\left( w_{i} \right) \right)}{\partial w_{i}} = \left\{ \begin{array}{ll} \lambda_{1}\left( Dw_{i} - C \right) - \lambda_{2}\left( Q + Q^{2} w_{i} \right), & -\frac{1}{Q} \le w_{i} < 0 \\ \lambda_{1}\left( Dw_{i} - C \right) + \lambda_{2}\left( Q - Q^{2} w_{i} \right), & 0 < w_{i} \le \frac{1}{Q} \\ \left[ -\lambda_{1} C - \lambda_{2} Q, -\lambda_{1} C + \lambda_{2} Q \right], & w_{i} = 0 \\ \lambda_{1}\left( Dw_{i} - C \right), & w_{i} < -\frac{1}{Q} \text{ or } w_{i} > \frac{1}{Q} \end{array} \right.$$
(24)

By setting \(\frac{\partial Z\left( \mathbf{W}^{\text{out}}\left( w_{i} \right) \right)}{\partial w_{i}} = 0\), wi can be obtained by Eq. (25), and the threshold function is shown in Fig. 3c.

$$w_{i} = \left\{ {\begin{array}{*{20}l} {\frac{{\lambda _{1} C + \lambda _{2} Q}}{{\lambda _{1} D - \lambda _{2} Q^{2} }},} \hfill & {D > \frac{{\lambda _{2} Q^{2} }}{{\lambda _{1} }}\quad{\text{and}}\quad - \frac{D}{Q} \le C < - \frac{{\lambda _{2} Q}}{{\lambda _{1} }}} \hfill \\ {\frac{{\lambda _{1} C - \lambda _{2} Q}}{{\lambda _{1} D - \lambda _{2} Q^{2} }},} \hfill & {D > \frac{{\lambda _{2} Q^{2} }}{{\lambda _{1} }}\quad{\text{and}}\quad\frac{{\lambda _{2} Q}}{{\lambda _{1} }} < C \le \frac{D}{Q}} \hfill \\ {0,} \hfill & { - \frac{{\lambda _{2} Q}}{{\lambda _{1} }} \le C \le \frac{{\lambda _{2} Q}}{{\lambda _{1} }}} \hfill \\ {\frac{C}{D},} \hfill & {{\text{others}}} \hfill \\ \end{array} } \right.$$
(25)

To improve the zero-attraction effect, the modified l0 regularization method is proposed

$$w_{i} = \left\{ \begin{array}{ll} \dfrac{\lambda_{1} C - \operatorname{sgn}(C) \cdot \lambda_{2} Q}{\lambda_{1} D - \lambda_{2} Q^{2}}, & D > \frac{\lambda_{2} Q^{2}}{\lambda_{1}} \text{ and } \frac{\lambda_{2} Q\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+}}{\lambda_{1}} < \left| C \right| \le \frac{D}{Q} \\ 0, & \left| C \right| \le \frac{\lambda_{2} Q\left( \varepsilon + \left| w_{i - 1} \right| \right)_{+}}{\lambda_{1}} \\ \dfrac{C}{D}, & \text{others} \end{array} \right.$$
(26)

which modifies the threshold based on the previous value of wi, so that small components are attracted to zero with a higher probability (Fig. 3d).
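The corresponding coordinate update of Eq. (26) can be sketched as follows, reusing the quantities D and C of Eqs. (16)-(17) and the zero-attractor (x)+ defined above; the values of Q and ε are assumed to be supplied by the caller.

```python
import numpy as np

def plus(x):
    """(x)_+ as defined after Eq. (20): 1/x for x <= 1, otherwise 1."""
    return 1.0 / x if x <= 1.0 else 1.0

def l0_coordinate_update(w_prev, C, D, lam1, lam2, Q, eps=0.05):
    """Smoothed-l0 update of a single coordinate following Eq. (26)."""
    tau = lam2 * Q * plus(eps + abs(w_prev)) / lam1          # modified threshold
    if abs(C) <= tau:
        return 0.0                                           # small component attracted to zero
    if D > lam2 * Q ** 2 / lam1 and tau < abs(C) <= D / Q:
        return (lam1 * C - np.sign(C) * lam2 * Q) / (lam1 * D - lam2 * Q ** 2)
    return C / D                                             # outside the smoothed region
```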

3.3 Weight Vectors Updating Algorithm

By using MOEA/D, many non-dominated solutions can be obtained. However, only one solution is ultimately chosen for the nonlinear modeling problem. Generally speaking, an ESN with a too large training error (λ1 is too small) or a too large network size (λ2 is too small) will not be chosen. As described in [41], the knee point is able to make a tradeoff between the two objectives. Thus, the knee point is selected as the final solution. In order to generate more solutions around the knee point, the weight vectors updating method is proposed, in which the information of the knee point is incorporated.

3.3.1 Knee point

In this part, the distance-based method is introduced to find the knee point [41]. For a bi-objective optimization problem, a line L can be defined as ax + by + c = 0, where a, b, and c are determined by the two solutions that minimize f1 and f2, respectively. Then, the distance d between each solution on the PF and L can be calculated as

$$d = \frac{{\left| {ax + by + c} \right|}}{{\sqrt {a^{2} + b^{2} } }}$$
(27)

Considering the minimization problem in this paper, only the solutions in the convex region are of interest. Thus, the above equation can be modified as

$$d = \left\{ {\begin{array}{*{20}l} {\frac{{\left| {ax + by + c} \right|}}{{\sqrt {a^{2} + b^{2} } }},} \hfill & {ax + by + c < 0} \hfill \\ { - \frac{{\left| {ax + by + c} \right|}}{{\sqrt {a^{2} + b^{2} } }},} \hfill & {{\text{others}}} \hfill \\ \end{array} } \right.$$
(28)

According to Eq. (28), the solution farthest from L is defined as the knee point. For example, in Fig. 4, points A and B determine the line L. By calculating the distance d between each point and L, point E is obviously the knee point.
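A sketch of this distance-based selection is given below. The sign convention of the line coefficients differs from Eq. (28) only in orientation, so the knee point is taken as the solution with the largest signed distance toward the convex side.

```python
import numpy as np

def knee_point(F):
    """Return the index of the knee point of a bi-objective front F (rows: (f1, f2))."""
    A = F[np.argmin(F[:, 0])]                 # extreme solution minimizing f1
    B = F[np.argmin(F[:, 1])]                 # extreme solution minimizing f2
    a, b = B[1] - A[1], A[0] - B[0]           # line a*x + b*y + c = 0 through A and B
    c = -(a * A[0] + b * A[1])
    # Signed distance: positive on the convex side (toward the ideal point), cf. Eq. (28).
    d = (a * F[:, 0] + b * F[:, 1] + c) / np.hypot(a, b)
    return int(np.argmax(d))
```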

Fig. 4
figure 4

Example of knee point

3.3.2 Weight Vectors Updating Method

The weight vectors are updated according to the information of the knee point and the decision maker preference. As shown in Fig. 5, the red point K is the projection of the knee point on the line Lw, and A and B are the boundary points within which the updated weight vectors should be located, i.e., the line AB is divided into two subintervals by K. Then, the same number of weight vectors, \(\frac{P}{2}\), will be generated in each subinterval. The distance between two adjacent weight vectors is called the step size, which is calculated as

$$d_{1} = \sqrt 2 ^{{\frac{1}{\alpha }}} \left( {\frac{{x - x_{1} }}{{1^{\alpha } + 2^{\alpha } + ... + \left\lceil {\frac{P}{2}} \right\rceil ^{\alpha } }}} \right)^{{\frac{1}{\alpha }}}$$
(29)
$$d_{2} = \sqrt 2 ^{{\frac{1}{\alpha }}} \left( {\frac{{x_{2} - x}}{{1^{\alpha } + 2^{\alpha } + ... + \left\lfloor {\frac{P}{2}} \right\rfloor ^{\alpha } }}} \right)^{{\frac{1}{\alpha }}}$$
(30)

where d1 is the step size in line KA, d2 is the step size in line KB, and α > 0 is the step size parameter. The value of weight vector λi in line KA is,

$$\begin{gathered} \lambda_{i,1} = x - d_{1}^{\alpha} - \left( 2d_{1} \right)^{\alpha} - \ldots - \left( id_{1} \right)^{\alpha} \hfill \\ \lambda_{i,2} = 1 - \lambda_{i,1} \hfill \\ \end{gathered}$$
(31)
Fig. 5
figure 5

The weight vectors updating method

Similarly, the value of weight vector \(\lambda_{j}\) in line KB is

$$\begin{gathered} \lambda_{j,1} = x + d_{2}^{\alpha} + \left( 2d_{2} \right)^{\alpha} + \ldots + \left( jd_{2} \right)^{\alpha} \hfill \\ \lambda_{j,2} = 1 - \lambda_{j,1} \hfill \\ \end{gathered}$$
(32)

In the above weight vectors updating method, more weight vectors are generated near the knee point and fewer at the boundary, so the distribution is denser around the knee region. Moreover, the points A and B can be determined by the decision maker preference, which makes the algorithm converge to the region of interest.

The weight vectors updating algorithm is presented in Algorithm 1. Firstly, the two solutions that minimize f1 and f2 are selected and the line L is calculated. Then, the distance from each solution to L is computed to find the knee point, and the weight vector corresponding to the knee point is chosen. Finally, the weight vectors are updated according to Eqs. (31) and (32).
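A simplified sketch of this update is shown below. It reproduces the qualitative behaviour of Algorithm 1 (weight vectors denser near the knee projection and sparser toward the boundary points) through power-law spacing rather than implementing Eqs. (29)-(32) exactly; x, x1 and x2 denote the first components of K, A and B.

```python
import numpy as np

def update_weight_vectors(x, x1, x2, P, alpha=2.0):
    """Generate P bi-objective weight vectors clustered around the knee projection x."""
    u = np.linspace(0.0, 1.0, P // 2 + 1)[1:] ** alpha     # spacing clustered toward the knee
    lam1 = np.concatenate([x - u * (x - x1),               # half of the vectors in segment KA
                           x + u * (x2 - x)])              # half of the vectors in segment KB
    return np.stack([lam1, 1.0 - lam1], axis=1)            # lambda_2 = 1 - lambda_1
```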

3.4 Framework of MOS-ESN

The pseudo code of MOS-ESN is described in Algorithm 2. In Step 1, the population is randomly initialized. In Steps 2 and 3, two individuals are randomly chosen to generate the offspring, using the uniform crossover operation and the polynomial mutation operator. In Step 4, the neighborhoods of each weight vector are updated. In Step 5, the local search is performed for loca iterations to improve algorithm convergence. In Step 6, the knee point is selected from the EP. In Step 7, the weight vectors are updated by the weight vectors updating method.

4 Simulation

In this section, the proposed MOS-ESN models are tested on two simulated benchmark problems and one practical system modeling problem, including the Rossler chaotic time series prediction [29], a nonlinear system modeling problem [42] and the effluent ammonia nitrogen (NH4-N) prediction in the wastewater treatment process (WWTP) [28]. It is noted that the MOS-ESN with l0 regularization is named MOS-ESN-l0, while the MOS-ESN with l1 regularization is termed MOS-ESN-l1. The MOS-ESN models are compared with the OESN [29], the OESN with l1 norm regularization (OESN-l1) [33], the OESN with l0 norm regularization (OESN-l0) [32], the OESN whose output weights are updated by coordinate descent with the l1 or l0 norm (CD-ESN-l1 [43], CD-ESN-l0), as well as the OESN whose output weights are directly calculated by MOEA/D (OESN-MOEA/D). For each algorithm, 50 independent runs are carried out in the MATLAB 2018b environment on a personal computer with an Intel Core i7 CPU and 8.0 GB memory.

The training and testing RMSE values are applied to evaluate the learning and testing performance of ESNs. Furthermore, the sparsity degree (SP) of the output weight matrix [28] is also introduced. The SP and RMSE are defined as follows:

$${\text{SP}} = \frac{{\left\| {{\mathbf{W}}^{{{\text{out}}}} } \right\|_{0} }}{N + n} \times 100\%$$
(33)
$${\text{RMSE}} = \sqrt {\sum\limits_{{k = 1}}^{L} {\frac{{\left[ {{\mathbf{y}}\left( k \right) - {\mathbf{t}}\left( k \right)} \right]^{2} }}{L}} }$$
(34)

where Wout is the output weight matrix, and y(k) and t(k) stand for the actual and target outputs, respectively. A smaller RMSE means a better training or testing accuracy. Meanwhile, a smaller SP means the ESN has a sparser structure.
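Both metrics are straightforward to compute, as in the sketch below (y and t are assumed to be one-dimensional arrays of length L).

```python
import numpy as np

def sparsity(W_out):
    """Eq. (33): percentage of nonzero output weights."""
    return np.count_nonzero(W_out) / W_out.size * 100.0

def rmse(y, t):
    """Eq. (34): root mean square error between actual and target outputs."""
    return np.sqrt(np.mean((y - t) ** 2))
```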

To evaluate the searching ability of a multi-objective optimization algorithm, the C-matrix [45] is introduced, which measures the ratio of the solutions in P that are not dominated by any solution in P*,

$$C\left( {{\mathbf{P}},{\mathbf{P}}^{*} } \right) = \frac{{{\text{size}}\left( {{\mathbf{P}}{ - }\left\{ {{\text{x}} \in \left. {\mathbf{P}} \right|\exists {\text{y}} \in {\mathbf{P}}^{*} :{\text{y}} \prec {\text{x}}} \right\}} \right)}}{{{\text{size}}\left( {\mathbf{P}} \right)}}$$
(35)

The larger C(P, P*) value means a better non-dominated solution set P.
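The following sketch evaluates Eq. (35) directly on two sets of objective vectors (rows are solutions, and all objectives are minimized).

```python
import numpy as np

def c_metric(P, P_star):
    """Eq. (35): fraction of solutions in P not dominated by any solution in P_star."""
    dominated = 0
    for fx in P:
        if any(np.all(fy <= fx) and np.any(fy < fx) for fy in P_star):
            dominated += 1
    return (len(P) - dominated) / len(P)
```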

The parameter settings of MOS-ESN-l0 and MOS-ESN-l1 are as below: the reservoir size, the population size, and the neighborhood size are set as 1000, 200 and 15 in each test instance, as suggested in [44]. The optimal values of loca and α are selected by the grid search method: the number of local search operations loca is varied from 0 to 5 with a step of 1, and the step size parameter α is varied from 0 to 3 with a step of 1. For the other algorithms, the corresponding parameter settings are described in the Appendix.

4.1 Rossler chaotic time series prediction

To study the performance of MOS-ESN, the Rossler chaotic time series [29], a typical chaotic dynamical time series, is introduced as below:

$$\begin{array}{*{20}l} {\frac{dx}{{dt}} = - y - z} \hfill \\ {\frac{dy}{{dt}} = x + \alpha y} \hfill \\ {\frac{dz}{{dt}} = \beta + z\left( {x - \gamma } \right)} \hfill \\ \end{array}$$
(36)

where α = 0.2, β = 0.4, γ = 5.7. There are 2000 samples in the experiment, of which 1400 are used for training and the remaining 600 for testing. White Gaussian noise with a signal-to-noise ratio (SNR) of 20 dB is added to the original training and testing datasets.
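A possible data-generation sketch is shown below, integrating Eq. (36) with scipy and adding 20-dB white Gaussian noise; the sampling step, initial state and the choice of the x-component as the predicted series are illustrative assumptions, not settings reported in this paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

def rossler_series(n=2000, dt=0.1, alpha=0.2, beta=0.4, gamma=5.7):
    """Integrate Eq. (36) and return n samples of the x-component."""
    f = lambda t, s: [-s[1] - s[2], s[0] + alpha * s[1], beta + s[2] * (s[0] - gamma)]
    sol = solve_ivp(f, (0.0, n * dt), [1.0, 1.0, 1.0], t_eval=np.arange(n) * dt)
    return sol.y[0]

def add_awgn(signal, snr_db=20.0, seed=0):
    """Add white Gaussian noise at the given SNR (in dB)."""
    rng = np.random.default_rng(seed)
    noise_power = np.mean(signal ** 2) / 10 ** (snr_db / 10)
    return signal + rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
```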

The testing outputs and prediction errors of MOS-ESN-l0, MOS-ESN-l1 and OESN are illustrated in Figs. 6 and 7, respectively, in which the red, blue, black and cyan lines are the trends of MOS-ESN-l0, MOS-ESN-l1, the target and OESN, respectively. It is easily found that both MOS-ESN-l0 and MOS-ESN-l1 can track the trends of the testing output, while the OESN misses some outputs, as shown in the partial enlargement. Furthermore, the prediction errors of MOS-ESN-l0 are concentrated in [−0.3, 0.3], which is a smaller range than for the other methods, demonstrating its stable performance.

Fig. 6
figure 6

The prediction outputs for Rossler chaotic time series prediction

Fig. 7
figure 7

The prediction errors for Rossler chaotic time series prediction

Simulation results of all methods are presented in Table 1, including the sparsity, the CPU running time for one run, and the mean and standard deviation (Std. for short) of the training and testing RMSE values over 50 independent runs. It is easily found that the OESN has the smallest running time but the largest testing RMSE, implying poor generalization ability. Both OESN-l1 and OESN-l0 have a smaller SP, which implies that the regularization method can generate a sparse output weight matrix. Besides, the proposed MOS-ESN-l0 has the smallest testing RMSE and SP among all ESN models, which proves its effectiveness in terms of network sparseness and prediction accuracy.

Table 1 Simulation results for Rossler chaotic series prediction

4.1.1 Effect of loca

As introduced in Sect. 3.4, loca decides how many local search operations are conducted on each individual. The effects of loca on network performance are investigated through 50 independent experiments. By setting loca = 0, loca = 1 and loca = 3, the obtained non-dominated solutions of MOS-ESN-l1 and MOS-ESN-l0 are plotted in Figs. 8 and 9, respectively. The x-coordinate and y-coordinate are the objective functions \(f_{2} = \left\| \mathbf{W}^{\text{out}} \right\|_{1/0}\) and \(f_{1} = \left\| \mathbf{T} - \mathbf{H}\mathbf{W}^{\text{out}} \right\|_{2}^{2}\), respectively. Obviously, the algorithm with loca = 3 always generates more non-dominated solutions than the algorithms with loca = 0 or loca = 1, which implies that the local search algorithm can accelerate the convergence speed.

Fig. 8
figure 8

Effect of loca on MOS-ESN-l1

Fig. 9
figure 9

Effect of loca on MOS-ESN-l0

By setting loca = {1, 2, 3, 4, 5} and α = 2, the statistical results of the training time, training and testing RMSE values, C-matrix and SP of MOS-ESN-l1 and MOS-ESN-l0 are reported in Tables 2 and 3, respectively. It is noted that during the calculation of the C-matrix, P* = (P1, P2, …, Pn), where Pi is the non-dominated solution set obtained by the algorithm with loca = i (i = 1, …, 5). It is easily found that when loca is set to a small value, such as 0 or 1, a small C-matrix value is obtained, which means worse non-dominated solutions are obtained. When loca is set to a moderate value (loca = 3), a larger C-matrix value and lower SP and testing RMSE values can be obtained. On the contrary, if loca is set to a too large value (loca = 5), the training and testing RMSE values are not the best among all the models, because a too large value of loca carries a risk of converging to a local region. Furthermore, a too large loca will increase the computational complexity and training time.

Table 2 Simulation results of different value of loca on MOS-ESN-l1
Table 3 Simulation results of different value of loca on MOS-ESN-l0

4.1.2 Effect of α

For ESN design, neither extreme of the Pareto front (a very small training error with a large network, or a very sparse network with a large training error) is preferred, while the solution at the knee point may be a good choice, as it is a tradeoff between the two objectives. To help the algorithm converge to the knee point, the weight vectors updating algorithm is proposed, in which the weight updating step α is applied. A larger α implies that the updated weight vectors are closer to the weight vector corresponding to the knee point.

To show the influence of α on network performance, by setting α = {0, 1, 2}, the obtained non-dominated solution sets of MOS-ESN-l1 and MOS-ESN-l0 are compared in Figs. 10 and 11, respectively; the x-coordinate and y-coordinate are the objective functions \(f_{2} = \left\| \mathbf{W}^{\text{out}} \right\|_{1/0}\) and \(f_{1} = \left\| \mathbf{T} - \mathbf{H}\mathbf{W}^{\text{out}} \right\|_{2}^{2}\), respectively. It is easily found that the non-dominated solution sets with α = 1 or α = 2 are better than that with α = 0, which implies the effectiveness of the weight vectors updating algorithm in terms of algorithm convergence.

Fig. 10
figure 10

Effect of α on MOS-ESN-l1

Fig. 11
figure 11

Effect of α on MOS-ESN-l0

With α = {0, 1, 2, 3} and loca = 3, the statistical results of 50 independent experiments of MOS-ESN-l1 and MOS-ESN-l0 are listed in Tables 4 and 5, respectively. Obviously, when α = 2, the obtained ESN has the sparsest network structure and the best testing RMSE values. However, when α = 3, the corresponding testing RMSE values become larger. Hence, a too large or too small value of α is not preferred.

Table 4 Simulation results of different value of α on MOS-ESN-l1
Table 5 Simulation results of different value of α on MOS-ESN-l0

4.2 Nonlinear dynamic system modeling

The proposed method is performed on the nonlinear dynamic system as below

$$\begin{aligned} y(k + 1) &= 0.72y(k) + 0.025y(k - 1)u(k - 1) \hfill \\ &\quad+ 0.01u^{2} (k - 2) + 0.2u(k - 3) \hfill \\ \end{aligned}$$
(37)

where u(k) and y(k) are input and output, respectively. y(k + 1) is predicted by y(k), y(k − 1), u(k − 1), u(k − 2), u(k − 3). In the training phase, u(k) is 1.05sin(k/45). In the testing phase, u(k) is given as

$$u(k) = \left\{ \begin{array}{ll} \sin\dfrac{\pi k}{25}, & 0 < k \le 250 \\ 1.0, & 250 < k \le 500 \\ -1.0, & 500 < k \le 750 \\ 0.3\sin\dfrac{\pi k}{25} + 0.1\sin\dfrac{\pi k}{32} + 0.6\sin\dfrac{\pi k}{10}, & 750 < k \le 1000 \end{array} \right.$$
(38)

In this experiment, 2000 samples are generated by the system in Eq. (37). The first 1400 points are used in the training stage, and the remaining 600 are used in the testing phase. In addition, 20-dB Gaussian noise is added to generate the noisy environment.
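A sketch of this data generation is given below, driving the plant of Eq. (37) with the training input 1.05 sin(k/45) or the piecewise testing input of Eq. (38); the 0-based array indexing and the handling of initial conditions are illustrative simplifications.

```python
import numpy as np

def test_input(k):
    """Piecewise testing input of Eq. (38)."""
    if k <= 250:
        return np.sin(np.pi * k / 25)
    if k <= 500:
        return 1.0
    if k <= 750:
        return -1.0
    return (0.3 * np.sin(np.pi * k / 25) + 0.1 * np.sin(np.pi * k / 32)
            + 0.6 * np.sin(np.pi * k / 10))

def simulate_plant(L, training=True):
    """Generate L input/output samples of the nonlinear system in Eq. (37)."""
    u = np.array([1.05 * np.sin(k / 45) if training else test_input(k)
                  for k in range(1, L + 1)])
    y = np.zeros(L + 1)
    for k in range(3, L):                                   # Eq. (37)
        y[k + 1] = (0.72 * y[k] + 0.025 * y[k - 1] * u[k - 1]
                    + 0.01 * u[k - 2] ** 2 + 0.2 * u[k - 3])
    return u, y[1:]
```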

The prediction outputs and testing errors of the resulting MOS-ESN-l1, MOS-ESN-l0 and OESN are plotted in Figs. 12 and 13, respectively. All the algorithms show a similar predictive trend on the nonlinear dynamic system. However, the prediction error of MOS-ESN-l0 is limited to [−0.4, 0.4], which is smaller than that of the other methods. Thus, the proposed MOS-ESN-l0 has the best prediction effect among all compared algorithms.

Fig. 12
figure 12

The prediction outputs for nonlinear dynamic system

Fig. 13
figure 13

The prediction errors for nonlinear dynamic system

The statistical results of 50 independent runs of the compared algorithms are summarized in Table 6. Obviously, the OESN has the shortest training time and the smallest training error, but its testing error is the largest, which indicates an overfitting problem. Furthermore, the MOS-ESN-l0 obtains the smallest testing RMSE and SP values, which means the MOS-ESN-l0 has better prediction accuracy and a sparser reservoir for nonlinear dynamic system modeling.

Table 6 Simulation results for nonlinear dynamic system modeling

4.3 Effluent NH4 − N model in WWTP

Recently, the discharge of industrial and domestic wastewater has increased sharply, and water quality exceeding the discharge standard in the wastewater treatment process (WWTP) has become a serious problem. In the WWTP, excessive NH4-N leads to eutrophication of water bodies and affects human health. Thus, predicting NH4-N accurately is critical. However, the WWTP is a complex nonlinear system with uncertainty, so it is difficult to predict NH4-N. To solve this problem, laboratory analytical techniques are used; however, these methods always require a long time.

In this section, the proposed MOS-ESN models are applied to predict NH4-N in WWTP. This experiment contains 641 sets of data, which are collected from Chaoyang, Beijing in 2016. The first 400 groups are treated as training data and the rest 241 are set as testing data. The inputs of ESN include T, ORP, DO, TSS and pH, which are described in [29].

The prediction results of the effluent NH4-N models of MOS-ESN-l0, MOS-ESN-l1 and OESN are demonstrated in Fig. 14, and the corresponding prediction errors are shown in Fig. 15. Obviously, all the algorithms achieve similar prediction accuracy. Comparing MOS-ESN-l0 with MOS-ESN-l1, it can be found that the l0 regularization obtains a sparser structure, and thus the MOS-ESN-l0-based effluent NH4-N model has a smaller prediction error.

Fig. 14
figure 14

Prediction output of effluent NH4-N

Fig. 15
figure 15

Prediction errors of effluent NH4-N model

The comparison results of the different models are shown in Table 7, including the network sparsity SP, the training time, and the mean and standard deviation of the training and testing RMSE values over 50 independent experiments. Obviously, the OESN has a small training RMSE but a large testing RMSE, which implies that the overfitting problem occurs. Thus, the OESN has difficulty in predicting NH4-N in the WWTP. In OESN-l1 and OESN-l0, the regularization technique is applied to make the network structure sparse, but the testing RMSE value is still large. In CD-ESN-l0 and CD-ESN-l1, the regularization technique and coordinate descent are used to update the output weights, which obtains better prediction performance than OESN-l1 and OESN-l0. As compared with OESN-MOEA/D, the local search and weight vectors updating algorithm applied in MOS-ESN-l0 and MOS-ESN-l1 help to improve solution performance. Particularly, the MOS-ESN-l0 has the smallest testing error and the sparsest network structure, and it can effectively predict NH4-N in the WWTP.

Table 7 Simulation results for Effluent NH4 − N model in WWTP

5 Conclusion

In this paper, the multi-objective sparse ESN is proposed, in which the training error and network structure are optimized simultaneously. Firstly, instead of searching the regularization parameters, the design of ESN is treated as a bi-objective optimization problem. Secondly, to improve algorithm convergence performance, the local search strategy is designed, which incorporates the l1 or l0 norm regularization and coordinate descent algorithm. Furthermore, to make the algorithm converge to the region of interest, the weight vectors updating method is designed, which applies the information of knee point. The effectiveness and usability of the proposed algorithm are evaluated by experimental results. In future work, this method will be applied in other practical engineering fields, such as garbage classification and image recognition.