1 Introduction

In the electricity industry, accurate price prediction [1] is of great importance for market participants, who rely on it to reduce risk in their decision making. Historically, great research effort has been devoted to developing accurate and reliable methods for electricity price prediction [2–5]. State-of-the-art techniques include time-series methods such as the Autoregressive Integrated Moving Average (ARIMA) [6, 7] and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) [8] models, as well as machine learning methods [9]. Among these techniques, machine learning models have attracted the largest share of research attention, mainly because of their strong nonlinear modeling capacity [10, 11]. With respect to machine learning approaches for electricity price prediction and other power system applications, the three main categories of learning algorithms relevant to this paper are as follows.

  • The first category comprises conventional machine learning tools, for instance artificial neural networks (ANNs) [2, 3] and support vector machines (SVMs) [10]. The chief merit of such approaches is their ability to extract nonlinear relationships from input and output data, and they have therefore been developed and widely adopted in the power engineering domain over the past decades. However, the approaches in this category mainly rely on gradient-descent information to guide the training of their forecasting models, and they are often regarded as lacking efficient strategies to escape premature convergence and the many local minima.

  • The second category is the extreme learning machine (ELM) [12–15], a more recent approach. ELM and its variants share one distinguishing capability: fast training speed. They first initialize the hidden nodes randomly and then obtain the output weights via the Moore-Penrose pseudoinverse [12]; that is, the output weight vector is treated as a solution of a system of linear equations. For many ordinary low-dimensional regression or classification problems, this kind of method is sufficient to obtain better results than the traditional ones in both training speed and training accuracy.

  • The third category, evolutionary ELM (E-ELM), uses evolutionary algorithms to evolve an optimal result from a population, usually of constant size, consisting of candidate ELMs. Notably, ELM training depends mainly on a single step that computes a matrix solution instead of training iteratively: the output weight vector is obtained from the hidden-layer output matrix and randomly assigned parameters. This is the key reason why ELM is usually superior to conventional approaches in training speed while still maintaining good learning performance. However, computing the solution from partly random assignments can make the final output unstable, sometimes yielding a better training result and sometimes a worse one, e.g., for high-dimensional problems as in [1]. Moreover, even though ELM computes the output weights quickly, a single computation may fail to find a satisfactory regression or classification result [16]. Overall, this variation in solution quality leaves room for E-ELM to improve.

Although E-ELM can improve the final learning quality, it also increases the time complexity, mainly for two reasons. First, until the final optimum is found, the evolutionary algorithm must evaluate a whole population of ELMs, rather than a single ELM, iteratively for up to the maximum number of generations. Second, the cost of the evolutionary algorithm itself grows as the dimensionality of the decision space expands. According to [17], each random matrix of hidden weights corresponds to a vector of output weights, and better hidden and output weights lead to a lower root mean square error, i.e., a better regression or classification result for ELM. The hidden weights can be concatenated into a single row vector serving as a candidate solution for the objective function of ELM; the optimum is obtained by evaluating each candidate solution and comparing it with the other solutions in the population. For example, if the data have 100 dimensions and the number of hidden nodes is 10, the dimensionality of a solution individual reaches 1000, and the dimensionality of power market data or electricity load forecasting data is usually over 100. The self-adaptive evolutionary extreme learning machine (SaE-ELM) [18] is a representative E-ELM method, which obtains the output weights of a single-hidden-layer feed-forward network (SLFN) with some promising features. However, when training on high-dimensional data, SaE-ELM is also time-consuming in its evolutionary iterations.

To improve both the speed of E-ELM and the robustness of its optimum search in high-dimensional spaces, this research considers whether some rational analysis can be used to guide the search, especially in a high-dimensional decision space with a complicated objective function. As mentioned above, in conventional neural networks the gradient provides rapid search guidance, though it often leads to local optima. Gradient information might therefore be properly exploited in E-ELM to provide rational search directions and so accelerate the whole optimization procedure. Unfortunately, the objective function of E-ELM is too complicated to extract gradient information directly between the decision vector and its fitness value; moreover, the basic framework of ELM itself seems to leave gradient-based ideas behind. Even so, the rest of this paper proposes a new approximation model that not only dynamically provides an approximate mapping to replace the original functional relationship of E-ELM within a comparatively small region, but is also compatible with the differential evolution (DE) framework. Based on this model, a hybrid DE-like algorithm is developed to ensure that the new E-ELM not only enters the globally optimal region faster than a purely stochastic search, but also obtains high-quality solutions more reliably than plain gradient methods. Overall, faster convergence and better solution quality for E-ELM are the two mandatory objectives considered in this paper.

The rest of the paper is organized as follows. Section 2 outlines the relevant background. The new approximation model is presented in Sect. 3. In Sect. 4, a new evolutionary algorithm for E-ELM learning on high-dimensional data is proposed. Experimental results and discussion are given in Sect. 5. Finally, conclusions and future work are provided in Sect. 6.

2 Brief reviews

For completeness, this section briefly reviews the original extreme learning machine (ELM) as well as some other methods used in the rest of this paper.

2.1 Extreme learning machine (ELM)

ELM has shown good performance on generalized single-hidden-layer feed-forward networks (SLFNs) since it was proposed. It differs from traditional neural networks in its hidden layer: the random generation of hidden nodes is the main feature of ELM. Its basic working mechanism is briefly summarized as follows.

Given N training samples \(\left\{ {\left( {x_i ,t_i } \right) } \right\} _{i=1}^N \), which can also be described in matrix form as \(\left\{ {\left( {P,T_{tar} } \right) } \right\} \), where P is the \(D\times N\) real matrix of input data and \(T_{tar} \) is the \(N\times 1\) target vector, let H be an \(L\times D\) real matrix of hidden-layer parameters generated randomly and \(\beta \) the \(L\times 1\) real vector of output weights. Their mathematical relationship can be expressed as Eq. (1)

$$\begin{aligned} f\left( {H \cdot P+Bias} \right) ^{\mathrm{T}} \cdot \beta =T_{tar} \end{aligned}$$
(1)

where Bias is an \(L\times N\) real matrix and \(f\left( \cdot \right) \) is an activation function [11], for instance the log-sigmoid function,

$$\begin{aligned} \sigma \left( t \right) =\frac{1}{1+e^{-c \cdot t}} \end{aligned}$$
(2)

where c is a slope parameter.

Usually, Eq. (1) is written briefly as Eq. (3)

$$\begin{aligned} \hbox {H} \cdot \beta =T_{tar} \end{aligned}$$
(3)

where \(\hbox {H}=f\left( {H \cdot P+Bias} \right) ^{\mathrm{T}}\) is an \(N\times L\) matrix. ELM uses the Moore-Penrose pseudoinverse \(\hat{\mathrm{H}}^{\dagger }\) and the target vector \(T_{tar} \) to obtain a least-squares solution of the linear system in Eq. (3). That is, the least-squares solution for the output weight vector \(\beta \) is analytically determined by Eq. (4)

$$\begin{aligned} \hat{\beta }=\hat{\mathrm{H}}^{\dagger } \cdot T_{tar} \end{aligned}$$
(4)

where

$$\begin{aligned} \hat{\mathrm{H}}^{\dagger }=\left\{ {{\begin{array}{l@{\quad }l} \hbox {H}^{\mathrm{T}}\cdot \left( {\frac{I}{C}+\hbox {H}\cdot \hbox {H}^{\mathrm{T}}} \right) ^{-1},&{} N<L \\ \left( {\frac{I}{C}+\hbox {H}^{\mathrm{T}}\cdot \hbox {H}} \right) ^{-1}\cdot \hbox {H}^{\mathrm{T}},&{} L<N \\ \end{array} }} \right. , \end{aligned}$$
(5)

and C is a trade-off constant; see [12–15] for more details.

Instead of following the traditional gradient-descent approach, ELM minimizes the training cost function of Eq. (6) using the result obtained from Eq. (4).

$$\begin{aligned} RMSE=\sqrt{mse\left( {\hbox {H} \cdot \hat{\beta } -T_{tar} } \right) } \end{aligned}$$
(6)

where \(mse\left( \cdot \right) \) is the function measuring network performance as the mean of the squared errors.
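Putting Eqs. (1)–(6) together, one ELM training pass can be sketched as below (a hedged NumPy sketch; the function names, the uniform initialization range and the choice c = 1 are illustrative assumptions, not details fixed by the paper):

```python
import numpy as np

def elm_train(P, T_tar, L, C=1e6, seed=0):
    """Basic ELM: random hidden layer, then the analytic solution of Eq. (4).

    P : D x N input matrix, T_tar : N x 1 target vector (notation of Sect. 2.1),
    L : number of hidden nodes, C : trade-off constant of Eq. (5).
    """
    rng = np.random.default_rng(seed)
    D, N = P.shape
    Hw = rng.uniform(-1.0, 1.0, size=(L, D))        # random hidden weights H
    b = rng.uniform(-1.0, 1.0, size=(L, 1))
    Bias = b @ np.ones((1, N))                      # L x N bias matrix
    Hmat = 1.0 / (1.0 + np.exp(-(Hw @ P + Bias)))   # log-sigmoid, Eq. (2), c = 1
    Hmat = Hmat.T                                   # the N x L matrix of Eq. (3)
    # Regularized pseudoinverse, Eq. (5): invert the smaller Gram matrix
    if N < L:
        beta = Hmat.T @ np.linalg.solve(np.eye(N) / C + Hmat @ Hmat.T, T_tar)
    else:
        beta = np.linalg.solve(np.eye(L) / C + Hmat.T @ Hmat, Hmat.T @ T_tar)
    return Hw, Bias, beta, Hmat

def rmse(Hmat, beta, T_tar):
    """Cost function of Eq. (6)."""
    return float(np.sqrt(np.mean((Hmat @ beta - T_tar) ** 2)))
```

For example, with 40 hidden nodes this sketch fits a one-dimensional sine curve with a small training RMSE, illustrating the single analytic step that gives ELM its speed.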

2.2 Basic differential evolution framework

Differential evolution (DE) [19] has been applied to many practical engineering problems since it was proposed in 1995, and DE variants with enhanced search performance have been introduced in [21–23]. Especially for multimodal optimization, researchers tend to combine DE with other evolutionary methods, in so-called hybrid DEs, to improve the overall performance of the algorithm.

In the classical DE framework, the remarkably simple trial-vector generation scheme is the main characteristic distinguishing it from other EAs. It processes a scaled difference of vectors originating from a fixed-size population of decision vectors. Usually three evolutionary operators are involved: mutation, crossover and selection. During the g-th generation of basic DE, a trial vector \(u_{i,g} \) is produced by a crossover operation between the old individual \(x_{i,g} \) and a mutant vector \(v_{i,g} =x_{r0,g} +F_i \cdot \left( {x_{r1,g} -x_{r2,g} } \right) \), where \(F_i (F_i >0)\) is a scaling factor and \(x_{r0,g} ,x_{r1,g} ,x_{r2,g} \) are three independent decision vectors selected randomly from the whole population \(P=\left\{ {x_{1,g} ,x_{2,g} ,\ldots ,x_{NP,g} } \right\} \) in decision space. For each vector \(x_{i,g} \in P\) in turn, a corresponding trial vector \(u_{i,g} \) is generated, and the old vector \(x_{i,g} \) is replaced only if its trial vector \(u_{i,g} \) yields a better objective function value; consequently, \(x_{i,g} \) is also called the target vector in the literature. See [24] for more crossover operators and further DE variants.
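The generation scheme just described can be sketched as follows (a minimal DE/rand/1/bin under common default settings; the function name and parameter defaults are illustrative):

```python
import numpy as np

def de_rand_1(fobj, bounds, NP=30, F=0.5, CR=0.9, max_gen=200, seed=0):
    """Classical DE/rand/1 with binomial crossover and greedy selection.

    fobj : objective to minimize; bounds : (low, high) arrays per dimension.
    """
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    D = low.size
    pop = low + rng.random((NP, D)) * (high - low)
    fit = np.array([fobj(x) for x in pop])
    for g in range(max_gen):
        for i in range(NP):
            # mutation: three mutually distinct random indices, all != i
            r0, r1, r2 = rng.choice([j for j in range(NP) if j != i], 3,
                                    replace=False)
            v = pop[r0] + F * (pop[r1] - pop[r2])
            # binomial crossover, forcing at least one mutant component
            mask = rng.random(D) < CR
            mask[rng.integers(D)] = True
            u = np.clip(np.where(mask, v, pop[i]), low, high)
            # greedy selection: the trial replaces the target only if not worse
            fu = fobj(u)
            if fu <= fit[i]:
                pop[i], fit[i] = u, fu
    return pop[fit.argmin()], float(fit.min())
```

On a 5-dimensional sphere function this sketch converges to near zero well within the default budget, which is the baseline behavior the later experiments compare against.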

2.3 SaE-ELM

The self-adaptive evolutionary extreme learning machine (SaE-ELM) [18] is an upgrade of DE-LM [25] and E-ELM [17]; it chooses its trial-vector generation strategies and the relevant control parameters adaptively. What these methods have in common is that they search over the network input weights and hidden node biases of ELM, aiming to optimize the network output weights. Given a training data set \(X_{D\times N} \), L hidden nodes and an activation function \(f\left( \cdot \right) \), the individuals to be evolved during the g-th generation are encoded as the following vector [18],

\(\theta _{k,g} =\big ( h_{11}^g ,\ldots ,h_{1D}^g ,h_{21}^g ,\ldots ,h_{2D}^g ,\ldots ,h_{L1}^g ,\ldots ,h_{LD}^g ,b_1^g ,\ldots ,b_L^g \big ),\) where \(1\le k\le NP\), NP is the population size, and \(b_i^g ,1\le i\le L\), represents the bias value of the i-th hidden node in generation g.

Based on this encoding, the parameters H and Bias are recovered as follows,

$$\begin{aligned} H=\left[ {{\begin{array}{c} {h_{11}^g ,\ldots ,h_{1D}^g } \\ {h_{21}^g ,\ldots ,h_{2D}^g } \\ \vdots \\ {h_{L1}^g ,\ldots ,h_{LD}^g } \\ \end{array} }} \right] ,P=X_{D\times N} ,Bias=\left[ {{\begin{array}{c} {b_1^g } \\ {b_2^g } \\ \vdots \\ {b_L^g } \\ \end{array} }} \right] \times J_{1\times N}\nonumber \\ \end{aligned}$$
(7)

where \(J_{1\times N} \) is a \(1\times N\) matrix of ones. The corresponding fitness function is then formulated as Eq. (8),

$$\begin{aligned} RMSE=\sqrt{mse\left( {f\left( {H \cdot P_{test} +Bias} \right) ^{\mathrm{T}} \cdot \hat{\beta } -T_{test} } \right) } \end{aligned}$$
(8)

where \(P_{test} \) and \(T_{test} \) are the testing data set and the testing target vector, respectively.

The main aim of such algorithms is to find an optimum H from the population of \(\theta _{k,g} ( 1\le k\le {\textit{NP}})\) over \(g_{max} \) generations. The individuals that survive from generation g to the next must satisfy Eq. (9).

$$\begin{aligned} \theta _{k,g+1} =\left\{ {{\begin{array}{l@{\quad }l} {u_{k,g} ,}&{} {RMSE_{\theta _{k,g} } -RMSE_{u_{k,g} } >\varepsilon \cdot RMSE_{\theta _{k,g} } } \\ {u_{k,g} ,}&{} {|RMSE_{\theta _{k,g} } -RMSE_{u_{k,g} } |<\varepsilon \cdot RMSE_{\theta _{k,g} } } \\ &{}{and\; \Vert \beta _{u_{k,g}}\Vert < \Vert \beta _{\theta _{k,g} } }\Vert \\ {\theta _{k,g} ,}&{} {Otherwise} \\ \end{array} }} \right. \end{aligned}$$
(9)

3 Approximation model

Traditional optimization approaches can be very useful for evolutionary search as long as they are applied properly. In many situations, however, they are difficult to use directly because of a complex functional relationship, which motivates employing simpler approximation models to replace the original fitness function. For example, as shown in Fig. 1, suppose the functional relationship between a variable x and its objective function \(f\left( x \right) \) is complicated, yet in another space a variable vector \(\left( {t_1 ,t_2 } \right) \) maps to the same image \(f\left( x \right) \) through an easier mapping rule \(g\left( {t_1 ,t_2 } \right) \). If a mapping can be found between the space containing x and the space containing \(\left( {t_1 ,t_2 } \right) \), the original problem is simplified, because the complex functional relationship has been replaced by a simpler mapping model. In the following section, a simple first-order approximation model is proposed to imitate this compound mapping between the variables and their objective function. Such an approximate relationship may not be accurate enough to replace the original one over the whole hyper-plane, but within a limited region it can satisfy practical demands [26, 27].

Fig. 1
figure 1

Principle of the approximation model

3.1 First-order approximation model

Without loss of generality, a decision space can be formulated as a hyper-plane defined by one point and two vectors. Let \(x\in { \mathbb {R}}^{n}\) be an arbitrary point in the decision space \({\mathbb {R} }^{n}\), also written as the decision vector \((x_1 ,x_2 ,\ldots ,x_n )^{\mathrm{T}}\), and let L be the hyper-plane. Suppose \(x^{0}\ne x^{1}\ne x^{2}\) are three distinct points selected randomly from \({\mathbb {R}}^{n}\); then any point \(x\in L\) can be written in the form of Eq. (10)

$$\begin{aligned} x=x^{0}+ t_1 \cdot \left( {x^{1}-x^{0}} \right) + t_2 \cdot \left( {x^{2}-x^{0}} \right) \end{aligned}$$
(10)

where \(t_1 ,t_2 \) are two independent real variables.

According to Eq. (10), once three arbitrary yet independent points \(x^{0}\ne x^{1}\ne x^{2}\) are fixed, any \(x\in L\) corresponds linearly to the variable vector \(\left( {t_1 ,t_2} \right) \), because the remaining parameters are constants, i.e., \(x\Leftrightarrow \left( {t_1 ,t_2 } \right) \). In other words, any \(x\in L\) can then be evaluated from the vector \(\left( {t_1 ,t_2 } \right) \) via Eq. (10). Therefore, as the decision vector x approaches its optimum \(x^{*}\), there must exist a corresponding variable vector \(\left( {t_1^{*} ,t_2^{*} } \right) \Leftrightarrow x^{*}\), i.e.,

$$\begin{aligned} x^{{*}}=x^{0}+t_1^{*} \cdot \left( {x^{1}-x^{0}} \right) +t_2^{*} \cdot \left( {x^{2}-x^{0}} \right) \end{aligned}$$
(11)

Likewise, for any pair of fitness function \(f\left( x \right) \) and variable x, there exists a pair of image \(g\left( \cdot \right) \) and variable vector \(\left( {t_1 ,t_2 } \right) \) with the same value, i.e., \(g\left( {t_1 ,t_2 } \right) =f\left( x \right) \); the difference is that the functional relationship of \(g\left( \cdot \right) \) is simpler than that of \(f\left( \cdot \right) \). The conversion between \(g\left( \cdot \right) \) and \(f\left( \cdot \right) \) is defined by Eq. (12)

$$\begin{aligned} g\left( {t_1 ,t_2 } \right) =f\left( x \right) =g^{0}+t_1 \cdot \left( {g^{1}-g^{0}} \right) +t_2 \cdot \left( {g^{2}-g^{0}} \right) \end{aligned}$$
(12)

where \(g^{0},g^{1},g^{2}\) can be treated as constants once \(x^{0}\ne x^{1}\ne x^{2}\) have been fixed as mentioned above. To obtain these constants in a simple way, some special points are considered. Substituting \(\left( {t_1 ,t_2 } \right) \) with the vectors (0,0), (1,0), (0,1), respectively, yields \(g^{0}=f\left( {x^{0} } \right) , g^{1}=f\left( {x^{1} } \right) , g^{2}=f\left( {x^{2} } \right) \) directly from Eqs. (12) and (10). Equation (12) thereby also provides an approximation that can replace the original fitness function, since \(g\left( {t_1 ,t_2 } \right) =f\left( x \right) \). Thus the complicated functional relationship between the decision variable \(x\in {\mathbb {R}}^{n}\) and its original image \(f\left( x \right) \) no longer matters, as long as the new mapping between \( g\left( \cdot \right) \) and \(\left( {t_1 ,t_2 } \right) \) is simpler and sufficiently reliable.

3.2 Direction to optimum

As pointed out above, Eq. (12) provides a linear functional relationship between the variable vector \(\left( {t_1 ,t_2 } \right) \) and its image \(g\left( {t_1 ,t_2 } \right) \). By conventional optimization theory, \(g\left( {t_1 ,t_2 } \right) \) at point \(\left( {t_1 ,t_2 } \right) \) has a vector of first partial derivatives, the gradient \(\nabla g\big ( {t_1 ,t_2 } \big )=\big ( \big ( {g^{1}-g^{0}} \big ),\big ( {g^{2}-g^{0}} \big ) \big )\). Hence a local minimum \(\left( {t_1^{*} ,t_2^{*} } \right) \) is most probably located in the direction opposite to \(\nabla g\left( {t_1 ,t_2 } \right) \).

$$\begin{aligned} \left( {t_1^{*} ,t_2^{*} } \right) =\left( {0,0} \right) -\alpha \cdot \nabla g\left( {t_1 ,t_2 } \right) =-\alpha \cdot \nabla g\left( {t_1 ,t_2 } \right) \end{aligned}$$
(13)

where \(\alpha \) is a step parameter. Overall, any three distinct decision variables \(x^{0}\ne x^{1}\ne x^{2}\) allow the local optimum \(x^{*}\) to be deduced via Eqs. (11)–(13), which can be expressed as Eq. (14),

$$\begin{aligned} x^{*}= & {} x^{0}-\alpha \cdot \Big [ \left( {g^{1}-g^{0}} \right) \cdot \left( {x^{1}-x^{0}} \right) \nonumber \\&+\left( {g^{2}-g^{0}} \right) \cdot \left( {x^{2}-x^{0}} \right) \Big ] \end{aligned}$$
(14)
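In code, the whole construction of Eqs. (10)–(14) collapses to a few lines: evaluate the fitness at the three points and step opposite to the plane-model gradient (a sketch; the function name and the convex test function are illustrative):

```python
import numpy as np

def rational_step(f, x0, x1, x2, alpha=1.0):
    """Local-optimum estimate of Eq. (14) from three distinct points.

    g0, g1, g2 are the constants of Eq. (12); moving x0 against the plane
    model's gradient (g1 - g0, g2 - g0) realizes the descent step of Eq. (13).
    """
    g0, g1, g2 = f(x0), f(x1), f(x2)
    return x0 - alpha * ((g1 - g0) * (x1 - x0) + (g2 - g0) * (x2 - x0))
```

On a convex quadratic, a single such step with a moderate alpha already lowers the fitness, without ever differentiating f itself.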

4 Proposed algorithm

In general, no method is flawless, and neither is the new rational approximation model. Aiming at a tradeoff between global exploration and local exploitation, another DE mutation strategy, ‘DE/current-to-best/1’ [24], is enlisted as well to construct the hybrid Rational and Self-adaptive Mutation (RSM) strategy, whose pseudo-code is shown in Fig. 2.

Fig. 2
figure 2

Pseudo-code of producing hybrid trial vectors

4.1 Historical pool

Experimental results show that fitness values evaluated by the new approximation model are very sensitive to the shape formed by the three input individuals \(x_{r0,g} , x_{r1,g} \) and \(x_{r2,g} \); the basic idea is therefore to avoid excessive similarity between the candidates. Borrowing from JADE [21], the RSM mutation uses a historical pool to temporarily hold some of the individuals sifted out of the population. Each time, one of the three distinct individuals, denoted \({\hat{x}}_{r2,g} \), is picked from the union of the current population and the historical pool, while the other two, \(x_{r0,g} \) and \(x_{r1,g} \), are still selected from the current population. The size of the historical pool is set to a quarter of the population size, and the pool is initially empty. Once the pool is full, an individual eliminated from the current population replaces the worst pool member only if the eliminated one is better.

4.2 Self-adaptive parameters

Motivated by [20–22], the new algorithm extends many control parameters into the solution individuals for self-adaptive control (see Table 1). These parameters evolve simultaneously while the classical population of solutions is processed in the evolutionary procedure.

Table 1 Encoding format of self-adapting individuals

In general, better control parameter values correspond to optimal trial vectors. The proposed algorithm therefore uses statistics of recently successful parameters to guide the generation of parameters for the next generation.

The main self-adaptive control parameters, \(Step_{i,g} \in \left[ {0,2} \right] , CR_i^1 ,CR_i^2 \in \left[ {0,0.9} \right] \) and \(F_{i,g} \in \left[ {0.1,1.0} \right] \), are initialized within their definition domains. Successful parameters survive to the next generation, while unsuccessful ones are replaced by draws from a normal distribution with mean \(P_{m,g} \) and standard deviation sigma, as shown in Eq. (15).

$$\begin{aligned} P_{i,g} =P_{m,g} +sigma \cdot randn_{i,g} \end{aligned}$$
(15)

where \(P_{i,g} \) represents the parameter value of the i-th individual in generation g. The sigma of each parameter equals \({\mathrm{min}}\left( {\left| {P_{m,g} -P_{Ub,g} } \right| ,\left| {P_{m,g} -P_{Lb,g} } \right| } \right) \). The mean values are initialized as \(Step_{m,1} =1.1, F_{m,1} =0.6, CR_{m,1}^k =0.6\,\left( {k=1,2} \right) \). The parameter \(Step_{i,g} \) controls the incremental degree of the mutation shown in Fig. 2,

\(v_{i,g}^1 =x_{r0,g} +s \cdot \left[ {t_1 \cdot \left( {x_{r1,g} -x_{r0,g} } \right) +t_2 \cdot \left( {\hat{x}_{r2,g} -x_{r0,g} } \right) } \right] ,\) where \(s=Step_{i,g} /\sqrt{t_1^2 +t_2^2 }\). At the beginning of the evolutionary procedure, \(Step_i \ge 1\) helps the population converge to the optimum quickly, while \(Step_i <1\) favors effective exploitation, especially for solutions approaching the optimum.
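This mutant computation can be sketched as follows (hedged: here \((t_1, t_2)\) is taken as the negative plane-model gradient of Eq. (13), which is one natural reading of Fig. 2; the function name and the zero-direction guard are illustrative assumptions):

```python
import numpy as np

def rsm_mutant(f, x_r0, x_r1, x_r2_hat, step):
    """Rational mutant v1 of Sect. 4.2 with a normalized step length.

    (t1, t2) points opposite the plane-model gradient (Eq. (13)), and
    s = Step / sqrt(t1^2 + t2^2) rescales the move so its length in
    (t1, t2)-space equals Step_{i,g}.
    """
    g0, g1, g2 = f(x_r0), f(x_r1), f(x_r2_hat)
    t1, t2 = -(g1 - g0), -(g2 - g0)
    s = step / max(np.hypot(t1, t2), 1e-12)   # guard against a zero direction
    return x_r0 + s * (t1 * (x_r1 - x_r0) + t2 * (x_r2_hat - x_r0))
```

With a small step the mutant moves downhill on a convex function; larger steps (Step ≥ 1) overshoot deliberately to aid exploration early in the run.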

4.3 Hybrid strategy for trial vector

In the selection procedure, shown in Fig. 3, the two new trial vectors are handled according to three cases:

  • Case 1: \( f\left( {u_{i,g}^1 } \right)<f\left( {u_{i,g}^2 } \right) <f\left( {x_{i,g} } \right) \)

Both trial vectors are successful, i.e., success(i,1) \(=\) 1 and success(i,2) \(=\) 1, and all their parameters are kept for the next generation.

  • Case 2: \( f\left( {u_{i,g}^1 } \right)<f\left( {x_{i,g} } \right) <f\left( {u_{i,g}^2 } \right) \)

\(u_{i,g}^1 \) is a successful trial vector and success(i,1) is set to 1.

  • Case 3: \( f\left( {x_{i,g} } \right)<f\left( {u_{i,g}^1 } \right) <f\left( {u_{i,g}^2 } \right) \)

Neither trial vector succeeds, which means all the parameters need to be adjusted.

At the end of each generation, the mean of each parameter is adjusted by Eq. (16)

$$\begin{aligned} P_{m,g+1} =0.85 \cdot P_{m,g} +0.15 \cdot mean\left( {P_{success,g} } \right) \end{aligned}$$
(16)

where mean(\(\cdot \)) is the arithmetic mean function.
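Equations (15) and (16) amount to the following two helpers (a sketch; the clipping back to the parameter bounds is an assumption on our part, while the constants 0.85 and 0.15 come from Eq. (16)):

```python
import numpy as np

def refresh_params(params, success, P_m, lb, ub, rng):
    """Eq. (15): keep successful parameter values; redraw the rest from
    N(P_m, sigma) with sigma = min(|P_m - ub|, |P_m - lb|)."""
    sigma = min(abs(P_m - ub), abs(P_m - lb))
    fresh = P_m + sigma * rng.standard_normal(params.shape)
    return np.clip(np.where(success, params, fresh), lb, ub)

def update_mean(P_m, successful_values):
    """Eq. (16): pull the mean 15% toward the mean of successful values."""
    if len(successful_values) == 0:
        return P_m
    return 0.85 * P_m + 0.15 * float(np.mean(successful_values))
```

The 0.85/0.15 weighting keeps the mean stable across generations while still letting a run of successful parameter values shift it steadily in their direction.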

Fig. 3
figure 3

Pseudo-code of selection operator

4.4 RSM-DE algorithm

The main body of the RSM-DE algorithm:

Input:

NP: the size of the population;

Maxgen: the maximum number of iterations;

Fitness function;

D: the dimension of the decision space.

Output:

Optimum of the fitness function.

Step 1 Initialization

Create a random initial population \(\{x_{i,0} |i=1,\ldots ,NP\}\) and initialize the parameters within their definition regions.

For \({{\varvec{g}}}={{\varvec{1}}},\ldots ,{{\varvec{Maxgen}}}\) , do

Step 2 Evolution Items

For \({{\varvec{i}}}=\mathbf{1},\ldots ,{{\varvec{NP}}}\) do

Step 2.1 New Parameter Generation: unsuccessful parameters are refreshed based on Eq. (15).

Step 2.2 Mating: One of the P best individuals, \(x_{best,g}^P \), and three other independent individuals, \(x_{r0,g} \ne x_{r1,g} \ne \hat{x} _{r2,g} \), are picked out. \(\hat{x} _{r2,g} \) comes from the union of the current population and the historical pool, while \(x_{best,g}^P \) comes from the current population; \(P=5\) in this paper.

Step 2.3 Call Function RSM-Trial(): produce two trial vectors using the two strategies, respectively.

Step 2.4 Call Function Selection(): select successful trial vectors and parameters into the next generation.

Step 2.5 Renew Historical Pool: if the historical pool is not full, the eliminated individuals are pushed into the pool; otherwise, the worst one in the pool is replaced when the eliminated one is better.

Step 2.6 Summarize the Statistics of Successful Trial Vectors: evaluate the arithmetic mean of each parameter by Eq. (16).

Step 3 Stopping criteria: when the stopping criterion is satisfied, the algorithm stops and outputs the corresponding results; otherwise, it goes to Step 2.

Fig. 4
figure 4

Pseudo-code of RSM-DE-ELM

Table 2 Six single-objective benchmarks

4.5 RSM-DE-ELM

Given a set of training data, a set of testing data, a candidate range for the number of hidden nodes L and an objective function g(\(\cdot )\), the RSM-DE-ELM algorithm is summarized in Fig. 4. \(RSM\_DE\_ELM\left( \cdot \right) \) denotes the procedure that optimizes ELM with the RSM-DE algorithm and returns the optimum of the net. \(\left[ {L_1 ,L_2 } \right] \) is the candidate range, and Train and Test denote the training set and testing set, respectively.

5 Experimental results

In this section, the DE-like part of the proposed algorithm is first compared with other representative differential evolution approaches on several single-objective benchmarks, picked from different test categories, to examine its capacity for finding optima. Then, to assess the performance of the new E-ELM method in the machine learning area, the integrated algorithm is applied to electricity price prediction, and the experimental results are compared against other state-of-the-art algorithms.

Table 3 Summary of QLD RRP from June 2006 to May 2007 (AUD/MWH)
Fig. 5
figure 5

Comparison of convergence curves of DE/rand/1, JADE and RSM-DE on six representative benchmarks

5.1 Performance on single-objective benchmarks

At the beginning of this section, the six representative single-objective benchmark functions listed in Table 2 are used to test the performance of the RSM-DE algorithm. The indices of the benchmarks are kept the same as in [21] for convenient comparison. Functions \(f_2 ,f_4 \) are continuous unimodal functions, \(f_7 \) is a noisy quartic function, and \(f_8 ,f_9 ,f_{12} \) are multimodal functions whose number of local minima increases exponentially with the dimensionality of the problem. The main control parameters are set as follows: the population size is 100 with a 30-dimensional decision space. JADE is a state-of-the-art self-adaptive algorithm; in [21] it showed better performance on many benchmarks than SaDE [20], jDE [22] and PSO [28], so this paper chooses JADE as the main reference for measuring the new algorithm. The experimental results for RSM-DE, JADE [21] and DE/rand/1 [19] are summarized in Table 4, and Fig. 5 shows a group of semi-log convergence graphs for the 30-dimensional problems with median values over 50 independent runs. From these convergence curves, it is easy to see that the RSM-DE algorithm performs very well in the 30-dimensional space. In particular, the new algorithm is very successful on benchmark \(f_4 \), which clearly shows that the approximation model works well, with both a better convergence rate and more robust reliability. Even in the face of the many local minima of benchmarks \(f_8 ,f_9 ,f_{12} \), RSM-DE still accelerates the optimization procedure with promising results, although its mutation operator employs a greedy strategy. The results indicate that if the diversity of the population is maintained properly during evolution, the rational method plays a positive role rather than becoming a crucial problem.

Table 4 Six 30-dimensional benchmarks with 50 independent runs

5.2 RSM-DE-ELM for market price prediction

In this section, several sequential data series extracted from the Australian Energy Market Operator (AEMO) website [29] are used to test the performance of our new method.

For convenient comparison, the first dataset in our case study is a whole year of RRPs from the QLD market, as in [1]. It includes 17,520 observations in total, covering the period 1 June 2006 to 31 May 2007. Without loss of generality, the dataset can be divided into the four Australian seasons; the main features are summarized in Table 3.

  • The dataset format is constructed as follows,

  1. All observation points in every two weeks are defined as input attributes, and their targets are the observations from the following day; that is, the data from the last day of each fifteen-day window form the target set, and each day contains 48 evenly spaced observation points.

  2. In each season, the data from the last seven successive days are used as the testing dataset. For an efficient comparison, four other representative methods, BPNN, RBFNN, ELM and SaE-ELM, are compared with the proposed approach using three criteria, MAE, MAPE and RMSE, which are listed in Eq. (17).

    $$\begin{aligned} \left\{ {{\begin{array}{l} {MAE=\frac{1}{n}\mathop {\sum }\nolimits _{i=1}^n \left| {y_i -t_i } \right| } \\ {MAPE=\frac{1}{n}\mathop {\sum }\nolimits _{i=1}^n \frac{\left| {y_i -t_i } \right| }{y_i }\times 100\,\% } \\ {RMSE=\sqrt{\frac{1}{n}\mathop {\sum }\nolimits _{i=1}^n \left( {y_i -t_i } \right) ^{2}}} \\ \end{array} }} \right. \end{aligned}$$
    (17)
  • Parameter setting for RSM-DE-ELM:

    The population size is \(\textit{NP}=100\) and the maximum number of generations is 60. Since the number of RSM-DE iterations is not high, the mean values of the parameters are initialized as \(Step_{m,1} =1.1,F_{m,1} =0.7,CR_{m,1}^k =0.75\,\left( {k=1,2} \right) \). The candidate range \([L_1 ,L_2 ]\) of the number of hidden nodes L is set to [10, 150].

  • Results analysis:

Table 5 compares the new approach with four existing methods [1, 18]. All results are mean values over multiple trials, comprising 50 independent forecasts for each season model. From Table 5, the new algorithm achieves most of the lowest testing criteria on the four seasonal datasets among the five approaches. For the Spring and Autumn tests, performance is improved dramatically; for example, the testing MAE for Spring using RSM-DE-ELM is 1.9443, whereas for the other four methods the testing MAEs on this dataset are all greater than 2.2000.
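The three criteria of Eq. (17) can be computed directly (a minimal sketch; the function name is illustrative, and \(y_i\) is read as the actual price and \(t_i\) as the forecast):

```python
import numpy as np

def forecast_errors(y, t):
    """MAE, MAPE (in %) and RMSE of Eq. (17); y is actual, t is forecast."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    mae = float(np.mean(np.abs(y - t)))
    mape = float(np.mean(np.abs(y - t) / y) * 100.0)
    rmse = float(np.sqrt(np.mean((y - t) ** 2)))
    return mae, mape, rmse
```

Note that MAPE divides by the actual value, so it is only meaningful when prices stay away from zero, which can matter for spiky RRP series.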

Table 5 Comparison of five methods on RRP Forecast

Figure 6 shows the prediction results of RSM-DE-ELM on the first half of the 7 × 48 observation points of the testing dataset. The error curve shows that the new algorithm forecasts with a low and stable error rate at most points.

In terms of training time, since our approach falls into the E-ELM category, the training procedure practically consists of several sub-trainings of the basic ELM, so it takes longer than a single basic ELM. However, the proposed approach is definitely faster than SaE-ELM [18], because only two basic DE strategies are included rather than four. Secondly, the rational DE model gives our method fast convergence in addition to promising experimental results; e.g., RSM-DE-ELM can reach better results within 60 generations, whereas SaE-ELM needs more than 100 generations to reach the same magnitude. Moreover, our approach no longer proceeds in the way described in the previous literature [17, 18], where the number of hidden nodes is gradually increased and the one with the best generalization performance is finally adopted; instead, a binary-search frame helps the algorithm both find the optimum in the end and keep the time complexity lower.

Fig. 6
figure 6

Average RRP of forecast by RSM-DE-ELM in spring

6 Conclusion

In this paper, a self-adaptive DE-like frame embedded with a rational approximation operator is proposed to optimize E-ELM with faster speed and better solutions. The benchmark results show that the reliability of rational methods is no longer a crucial problem in optimization, i.e., they are no longer synonymous with local optima or premature convergence; on the contrary, their fast convergence becomes attractive as long as a well-designed scheme is provided. Furthermore, supported by the new rational approach, E-ELM obtains better results than many state-of-the-art methods in the practical application of electricity price prediction. Overall, the experimental results illustrate that mathematical guidance during the evolution toward optima can yield better performance for E-ELM than purely stochastic strategies.