1 Introduction

1.1 Aims and difficulties

Effective and accurate electricity price forecasting is a critical issue for energy-market participants, because it helps them manage risk in competitive electricity markets and thus maximize their profits. Market participants rely on price forecasts to decide on their bidding strategies, allocate assets, and plan facility investment (Taherian et al. 2013). The main goal of electricity-market participants is a transparent and cost-effective market; hence, all market players need an accurate and robust estimate of future electricity prices to set their bidding strategies in the real market and maximize their profit. On the other hand, predicting electricity prices is very difficult, as prices are more volatile in electricity markets than in any other financial market. This is partly because electrical supply and demand must be balanced in real time and, unlike other commodities, it is not financially viable to store large quantities of electricity. In addition, many parameters, such as weather conditions, the availability of relatively inexpensive generation facilities (e.g., nuclear and hydro), and sudden disturbances or faults in generation and transmission systems, can affect electricity market price volatility (Shayeghi et al. 2015). This volatility clearly indicates a strong need in the electric power industry for reasonably accurate methods of forecasting electricity prices.

1.2 Literature review

In deregulated electricity markets, participants involved in trading make extensive use of price prediction techniques, either to bid or to hedge against volatility. With reliable daily price forecasts, producers and energy service companies are able to negotiate good bilateral contracts and make better financial decisions. In recent years, several models have been applied to predict prices in electricity markets (Taherian et al. 2013; Shayeghi et al. 2015; Contreras et al. 2003; Garcia et al. 2005; Mohamed and Bodger 2005; Melek and Derya 2010; Anbazhagan and Kumarappan 2012; Da et al. 2014). The available forecasting models can be broadly classified into three groups: classical, intelligent and hybrid computing methods.

The classical group consists of the auto-regressive integrated moving average (ARIMA) model (Contreras et al. 2003), the generalized auto-regressive conditional heteroskedastic (GARCH) approach (Garcia et al. 2005), dynamic regression (DR) and multiple linear regression (Mohamed and Bodger 2005). These methods usually rely on linear models with limited or no capability to characterize the non-linearity of electricity price patterns. In addition, the stationarity assumed in most of these studies cannot capture the non-stationary features of the price time series.

The intelligent group uses data-driven models in which an input–output mapping is learned from historical examples, for example, fuzzy inference (Melek and Derya 2010), the artificial neural network (ANN) (Anbazhagan and Kumarappan 2012), and the wavelet transform (WT) \(+\) support vector machine (SVM) (Da et al. 2014).

However, a single model may not be able to predict electricity prices accurately. Thus, hybrid models have been widely used in various applications, including price forecasting. The goal of hybrid forecasting is to combine different models and improve the final forecast accuracy. Hybrid methods have been investigated in several studies, such as MI \(+\) WT \(+\) FNN (Osórioa et al. 2014), MI \(+\) WT \(+\) SVM (Shayeghi and Ghasemi 2013), a hybrid intelligent method (Ghofrani et al. 2015) and the WT \(+\) ARIMA \(+\) LSSVM method (Zhang et al. 2012). A hybrid model combining WT with ARIMA and GARCH models was used in Tan et al. (2010). Sousa et al. (2012) proposed a hybrid method that applied a neural network as an auxiliary forecasting tool for predicting electricity market prices; through the analysis of prediction error patterns, the method predicts the expected error for the next forecast and uses it to adapt the actual forecast. Sharma and Srinivasan (2013) combined a FitzHugh–Nagumo model, for mimicking spiky price behavior, with an Elman network, for regulating the former, and a feed-forward ANN, for modeling the residuals; this hybrid model is used for point and interval forecasting in the Australian, Ontario, Spanish and Californian markets. To capture the erratic behavior of electricity prices, a feed-forward neural network with one-dimensional discrete cosine transform (DCT) input features, known as DCT-FFNN, is modeled in Anbazhagan and Kumarappan (2014) and compared with other models for estimating the market clearing prices of mainland Spain. The common disadvantage of most of the previously reviewed methods is that they suffer from over-fitting. Therefore, this paper proposes a new hybrid algorithm for price forecasting that can efficiently tackle these shortcomings.

1.3 Motivation and contribution

The main contributions of this paper are summarized below:

  • As the dimensionality of the data fed into a learning machine (e.g., an ANN) increases, data classification models become correspondingly harder to use. High-dimensional data are a serious difficulty for many classification models because of their high computational cost and memory usage. Dimensionality reduction and best-feature-subset selection are two available methods for reducing the attribute space of an input feature set, which is a significant factor in both supervised and unsupervised regression and classification models (Shayeghi and Ghasemi 2013). MI is a suitable tool (minimizing relative entropy) for evaluating how well the joint probability is approximated by the product of the marginal probabilities. Our review shows that existing MI-based techniques still need refinement to provide a complete, high-performing model: although previous MI techniques can consider relevancy and redundancy simultaneously, they may lose valuable information as the amount of input data grows. Therefore, in this paper, a new three-way feature selection algorithm is proposed that filters out irrelevant and redundant candidate inputs.

  • Specifically, this paper employs the WPT to construct features that are highly informative about the behavior of the price signal. The WPT is chosen because of its ability to deal with stationary, non-stationary, or transitory characteristics of different signals, including abrupt changes, spikes, drifts, and trends.

  • To select the best tree from the WPT output and decrease the computational time, the Shannon entropy criterion is applied in this paper. It measures the predictability of future subset values based on the probability distribution of the values already observed in the data.

  • This paper presents LSSVM-B as the learning engine that models the nonlinear pattern in the price signal. The combination of LSSVM and Bayesian statistics allows both accurate day-ahead price forecasts and an estimate of their uncertainty. In fact, the Bayesian method can calibrate an ensemble forecast to give better forecasts than linear methods. To exploit the full learning potential of LSSVM-B, its control parameters are optimized by the S-OLABC algorithm. In other words, to cope with the disadvantage of the standard ABC method, which is often trapped in local optima when the optimization problem has a large number of variables, two modifications of ABC are proposed based on a self-adaptive model and orthogonal learning.

1.4 Structure of the paper

The remainder of this paper is organized as follows. Section 2 provides the mathematical formulation of the forecasting problem. Section 3 introduces the proposed price forecasting hybrid algorithm. To demonstrate the advantages of the proposed algorithm, three test markets are considered; simulation results and comparisons with previously reported results are discussed in Sect. 4. Finally, the paper is concluded in Sect. 5.

2 Price forecasting problem formulation

2.1 Proposed WPT model

Application of the WT in signal processing is one of the active areas in wavelet studies. In contrast to other transforms, e.g., the cosine or Fourier transforms that work purely in the frequency domain, the DWT decomposes a signal into detail and approximation terms so that it is represented more efficiently and localized in both the time (space) and frequency domains. The mathematical formulation of the DWT is given in Wickerhauser et al. (1992). This paper uses the WPT to decompose a price signal into detail and approximation terms (Wickerhauser et al. 1992). Let resolution \(2^{-j}\) define an orthogonal level on \(L^{2}(R)\), where the scale index j is an integer. \(V_{j}\) is a possible approximation at resolution \(2^{-j}\). The orthogonal projection of a variable x onto the space \(V_{j}\) is the function \(x_{j}\) that minimizes \(\Vert x-x_{j}\Vert \). The details of a signal at resolution \(2^{-j}\) are the difference between the approximations at resolutions \(2^{-j+1}\) and \(2^{-j}\). According to this description and the DWT (Shayeghi and Ghasemi 2013; Wickerhauser et al. 1992), a signal with approximation space \(V_{j}\) can be decomposed into \(V_{j+1}+W_{j+1}\) (approximation \(+\) detail). Therefore, the original orthogonal basis \(\{\phi _j (t-2^jn)\}_{n\in Z} \) can be split into the approximation \(V_{j+1} =\{\phi _{j+1} (t-2^{j+1}n)\}_{n\in Z} \) and detail \(W_{j+1} =\{\varphi _{j+1} (t-2^{j+1}n)\}_{n\in Z} \) spaces. The decompositions of \(\varphi _{j+1}\) and \(\phi _{j+1} \) into \(\{\phi _j (t-2^jn)\}_{n\in Z}\) are achieved by the two filters H (low-pass) and G (high-pass), which can be expressed as follows:

$$\begin{aligned}&H(n)=\left\langle {\frac{\phi (0.5t)}{\sqrt{2} },\phi (t-n)} \right\rangle \end{aligned}$$
(1)
$$\begin{aligned}&\left\langle {\phi _{j+1,p} ,\phi _{j,n} } \right\rangle =\left\langle {\frac{\phi (0.5t)}{\sqrt{2} },\phi (t-n+2p)} \right\rangle =H(n-2p)\nonumber \\ \end{aligned}$$
(2)
$$\begin{aligned}&\phi _{j+1,p} =\sum \limits _{n\in (-\infty ,+\infty )} {H(n-2p)\phi _{j,n} }\end{aligned}$$
(3)
$$\begin{aligned}&G(n)=\left\langle {\frac{\varphi (0.5t)}{\sqrt{2} },\phi (t-n)} \right\rangle \end{aligned}$$
(4)
$$\begin{aligned}&\left\langle {\varphi _{j+1,p} ,\phi _{j,n} } \right\rangle =\left\langle {\frac{\varphi (0.5t)}{\sqrt{2} },\phi (t-n+2p)} \right\rangle =G(n-2p) \nonumber \\ \end{aligned}$$
(5)
$$\begin{aligned}&\varphi _{j+1,p} =\sum \limits _{n\in (-\infty ,+\infty )} {G(n-2p)\phi _{j,n}}\end{aligned}$$
(6)
$$\begin{aligned}&G(n)=(-1)^{-n+1}H(-n+1) \end{aligned}$$
(7)

Note that this theory can be generalized to each space \(D_{j}\) with approximation \(n2^{-j}\), \(n\in Z\). It is clear that the WPT offers a more suitable and flexible analysis than the DWT. Figure 1 shows a wavelet packet decomposition tree with 3 levels. It can be seen that an n-level wavelet packet decomposition produces \(2^{n}\) different sets of coefficients, as opposed to the \(n+1\) sets of the DWT (Wickerhauser et al. 1992).

Fig. 1
figure 1

Wavelet packet decomposition tree
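For illustration, the following is a minimal sketch of a three-level wavelet packet decomposition of an hourly price series, assuming the PyWavelets library, a Daubechies-4 mother wavelet and symmetric padding (the paper does not specify these choices), with synthetic data standing in for real market prices:

```python
import numpy as np
import pywt  # PyWavelets

# Synthetic hourly "price" series standing in for real market data (50 days)
prices = np.random.rand(24 * 50)

# Three-level wavelet packet decomposition (cf. Fig. 1); 'db4' and symmetric
# padding are assumed choices, not taken from the paper
wp = pywt.WaveletPacket(data=prices, wavelet='db4', mode='symmetric', maxlevel=3)

# A 3-level WPT yields 2**3 = 8 terminal nodes, versus 3 + 1 coefficient sets
# for the DWT at the same depth
for node in wp.get_level(3, order='natural'):
    print(node.path, len(node.data))

# The inverse WPT rebuilds the signal from the packet tree (padding may make the
# reconstruction slightly longer than the input)
reconstructed = wp.reconstruct(update=False)[:len(prices)]
```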

2.2 Shannon entropy

The Shannon entropy is a way of quantifying the “disorder” of a random variable and provides a powerful approach for evaluating the contributions of the different components. The Shannon entropy H(p) of a probability distribution \(p\in \hbox {Pr}_c ( A)\) is:

$$\begin{aligned} H(p)=\sum \limits _{x\in A} {F(p( x))} ,\quad F(x) = x \log \frac{1}{x} \end{aligned}$$
(8)

with the convention that \(F(0) = 0\). Given an A-valued random variable X, we have \(H(X) = H(p_{X})\). More details are given in previous work (Shayeghi et al. 2015).
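As a small illustration, the following sketch computes the Shannon entropy of Eq. (8) for an empirical distribution obtained by binning a price-like series; the natural logarithm, the 16-bin histogram and the synthetic data are assumptions made here for demonstration:

```python
import numpy as np

def shannon_entropy(p, eps=1e-12):
    """Shannon entropy H(p) = sum_x p(x) log(1/p(x)) (Eq. 8), with F(0) = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > eps]                      # convention F(0) = 0
    return float(np.sum(p * np.log(1.0 / p)))

# Empirical distribution of a discretized (binned) price-like series
values = np.random.rand(1000)
hist, _ = np.histogram(values, bins=16)
print(shannon_entropy(hist / hist.sum()))
```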

2.3 Least squares support vector machine-Bayesian model (LSSVM-B)

The SVM algorithm, developed by Vapnik (1995), is based on statistical learning theory. However, SVM formulates the training process as a quadratic program, which can take too much time. Suykens and Vandewalle (1999) proposed a variant of SVM known as LSSVM, which solves a linear system instead and is therefore faster and more straightforward. In this paper, the Bayesian model is chosen among all learning methods for the artificial network because it shows the best performance when only limited training data are available (Neal 1996). In fact, this paper integrates the Bayesian model with a Gaussian process to enhance the prediction performance of the LSSVM. The proposed method enables the uncertainty in the LSSVM weight estimates to be explicitly accounted for, which in turn enables the generation of probabilistic predictions (Shayeghi and Ghasemi 2013). In other words, the Bayesian method can be used to overcome various limitations that prevent SVM from becoming more widely accepted and reaching its full potential, such as the lack of consideration of forecasting uncertainty, the complexity of estimating appropriate parameter values, the difficulty of selecting the optimum complexity, and the lack of a good method to properly validate the model and interpret the modeled relationship (Melek and Derya 2010). Assume input variables \(x_{i}\) and training targets \(y_{i}= f(x_{i})\) \((i = 1,2,{\ldots },l)\) in a Gaussian process. Let \(y=f({\mathbf {x}})+\delta \) denote the relation between the two variables x and y, where \(\delta \) is the error term. Then the covariance between two input variables \(x_{i}, x_{j}\) with the relevancy factors w (\(w_{k}\) measures the relevance of the kth variable to the prediction of the output variable) is as follows:

$$\begin{aligned} E[f(x_i )f(x_j )]=\hbox {e}^{\left( -\frac{1}{2}\sum \limits _{k=1}^l {\omega _k (x_i^k -x_j^k )^2} \right) } \end{aligned}$$
(9)

Let us collect the relevancy factors w together with the parameters A and \(\varepsilon \) into the hyperparameter vector (denoted by \(\vartheta \)), where \(\varepsilon \) relates to the noise density function. Note that the noise may be biased in the model proposed in this paper. The noise density and the posterior probability of f can then be expressed as follows:

$$\begin{aligned}&p(\delta )=\frac{\exp (-A\chi _{\varepsilon ,\beta } (\delta ))}{\int {\exp (-A\chi _{\varepsilon ,\beta } (\delta ))} \,d\delta }\nonumber \\&\chi _{\varepsilon ,\beta } (\delta )=\left\{ {\begin{array}{ll} -(\varepsilon +\delta )&{}\quad \delta <-N \\ \frac{(\delta +(1-\beta )\varepsilon )^2}{4\beta \varepsilon }&{}\quad -N\le \delta \le -M \\ 0&{}\quad -M<\delta <M \\ \frac{(\delta -(1-\beta )\varepsilon )^2}{4\beta \varepsilon }&{}\quad M\le \delta \le N \\ -(\varepsilon -\delta )&{}\quad N<\delta \\ \end{array}} \right. ,\quad \left\{ {\begin{array}{l} M=(1-\beta )\varepsilon \\ N=(1+\beta )\varepsilon \\ \end{array}} \right. \nonumber \\ \end{aligned}$$
(10)
$$\begin{aligned}&p(f\vert U,\vartheta )=\frac{p(U\vert f,\vartheta )p(f\vert \vartheta )}{p(U\vert \vartheta )} \end{aligned}$$
(11)

where \(\varepsilon > 0\) and \(0<\beta \le 1\). \(p(U\vert f,\vartheta )\) and \(p(f\vert \vartheta )\) are the likelihood and prior probability functions of the dataset U based on the Gaussian process, respectively; then,

$$\begin{aligned}&p(f\vert \vartheta )=\frac{\exp \left( -\frac{1}{2}f^T\Sigma ^{-1}f \right) }{(2\pi )^{\frac{1}{2}l}\sqrt{\vert \Sigma \vert } },\quad f=\left[ {\begin{array}{l} f(x_1 ) \\ f(x_2 ) \\ \vdots \\ f(x_l ) \\ \end{array}} \right] \end{aligned}$$
(12)
$$\begin{aligned}&p(U\vert f,\vartheta )=\frac{\exp \left( -A\sum \nolimits _{k=1}^n {\chi _{\varepsilon ,\beta } (y_i -f(x_i) )}\right) }{\int {\exp ((-A\cdot \chi _{\varepsilon ,\beta } (\delta )))} } \end{aligned}$$
(13)

where \(| \Sigma |\) denotes the determinant of the covariance matrix; in practice, \(| \Sigma |\) is computed from the triangular factors obtained by Gaussian elimination. Using the normalized \(p(f\vert \vartheta )\) in Eq. (11), the inference of f can be obtained. Based on Eqs. (12)–(13) and Bayes’ theorem, the posterior probability can be expressed as follows:

$$\begin{aligned}&p(f\vert U,\vartheta )\nonumber \\&=\frac{\exp \left( -A\sum \nolimits _{k=1}^n \chi _{\varepsilon ,\beta } ( y_k -f(x_k ))-\frac{1}{2}f^T\Sigma ^{-1}f \right) }{\displaystyle \int \exp \left( -A\sum \nolimits _{k=1}^n \chi _{\varepsilon ,\beta } ( y_k -f(x_k ))-\frac{1}{2}f^T\Sigma ^{-1}f \right) \,\hbox {d}f }\nonumber \\ \end{aligned}$$
(14)

If \(h(f)=A\sum _{k=1}^n \chi _{\varepsilon ,\beta } (y_k -f(x_k ))+\frac{1}{2}f^T\Sigma ^{-1}f \), then \(p(f\vert U,\vartheta )\propto e^{-h(f)}\), and the minimization problem can be written as follows:

$$\begin{aligned} \mathop {\min }\limits _f h(f)=\mathop {\min }\limits _f A\sum \limits _{k=1}^n \chi _{\varepsilon ,\beta } (y_k -f(x_k ))+\frac{1}{2}f^T\Sigma ^{-1}f\nonumber \\ \end{aligned}$$
(15)

To solve Eq. (15), define two slack variables \(\zeta _i\) and \(\zeta _i^*\):

$$\begin{aligned} \begin{aligned}&y_i -f(x_i )-(1-\beta )\varepsilon =\zeta _i\\&{-}y_i +f(x_i )-(1-\beta )\varepsilon =\zeta _i^*\end{aligned} \end{aligned}$$
(16)

Consequently, LSSVM-B based on regression can be defined by:

$$\begin{aligned}&\mathop {\min }\limits _{f,\zeta ,\zeta ^*} J=\mathop {\min }\limits _{f,\zeta ,\zeta ^*} A\sum \limits _{k=1}^n (a(\zeta _k^*)+a(\zeta _k ))+\frac{1}{2}f^T\Sigma ^{-1}f \nonumber \\&\hbox {S.t.} \nonumber \\&(1-\beta )\varepsilon +\zeta _k \ge y_k -f(x_k ) \nonumber \\&(1-\beta )\varepsilon +\zeta _k^*\ge -y_k +f(x_k ) \nonumber \\&\zeta _k^*,\zeta _k \ge 0 \end{aligned}$$
(17)

where f is the hyper-plane function, \(y_{k}\) is the kth output of LSSVM-B, and \(x_{k}\) is the kth input data point of the model. Based on the \(\zeta \)-insensitive Vapnik loss function, then (Vapnik 1995):

$$\begin{aligned}&\mathop {\min }\limits _{f,\zeta ,\zeta ^*} A\sum \limits _{k=1}^n (a(\zeta _k^*)+a(\zeta _k ))+\frac{1}{2}f^T\Sigma ^{-1}f \nonumber \\&a(\zeta _k^*)=\left\{ {\begin{array}{ll} \frac{(\zeta _k^*)^2}{4\varepsilon \beta },&{}\quad 0\le \zeta _k^*\le 2\varepsilon \beta \\ -\varepsilon \beta +\vert \zeta _k^*\vert ,&{}\quad 2\varepsilon \beta <\zeta _k^*\\ \end{array}} \right. \end{aligned}$$
(18)

The corresponding Lagrange function can be written as follows:

$$\begin{aligned}&\mathop {\min }\limits _{\alpha ,\alpha ^*} W(\alpha ,\alpha ^*)=\frac{1}{2}(\alpha -\alpha ^*)^T\sum (\alpha -\alpha ^*)\nonumber \\&\quad +\sum \limits _{k=1}^n \left( -(\alpha _k -\alpha _k^*)y_k +(\alpha _k +\alpha _k^*)(1-\beta )\varepsilon \right. \nonumber \\&\quad \left. +\frac{\varepsilon \beta }{A}\left( (\alpha _k +\alpha _k^*)^2-2\alpha _k \alpha _k^*\right) \right) \nonumber \\&\hbox {S.t.} \nonumber \\&\upsilon \in (0,A)\end{aligned}$$
(19)
$$\begin{aligned}&L(\alpha _k ,\alpha _k^*,\zeta ,\zeta ^*,\beta )=J(f,\zeta ,\zeta ^*)-W(\alpha _k ,\alpha _k^*) \end{aligned}$$
(20)

By the Karush–Kuhn–Tucker (KKT) conditions, the solutions using partial differentiation with respect to \(\alpha _k ,\alpha _k^*,\zeta ,\beta \) and \(\zeta ^*\)can be written as:

$$\begin{aligned} \left\{ {\begin{array}{l} \frac{\partial L}{\partial w}=0\Rightarrow J=\sum \limits _{k=1}^n {W(\alpha _k ,\alpha _k^*)} \\ \frac{\partial L}{\partial \zeta }=0\Rightarrow A\sum \limits _{k=1}^n {a(\zeta _k )} =0 \\ \frac{\partial L}{\partial \zeta ^*}=0\Rightarrow A\sum \limits _{k=1}^n {a(\zeta _k^*)} =0 \\ \frac{\partial L}{\partial \beta }=0\Rightarrow -\varepsilon \gamma =0 \\ \frac{\partial L}{\partial \alpha _k }=0\Rightarrow \sum \limits _{k=1}^n {W_{(\alpha _k ,\alpha _k^*)}^{'} \vert _{\partial \alpha _k } =0} \\ \frac{\partial L}{\partial \alpha _k^*}=0\Rightarrow \sum \limits _{k=1}^n {W_{(\alpha _k ,\alpha _k^*)}^{'} \vert _{\partial \alpha _k^*} =0} \\ \end{array}} \right. \end{aligned}$$
(21)

By eliminating w and \(\beta \), the equations can be written as:

$$\begin{aligned}&\left[ {\begin{array}{l} \beta \\ \alpha \\ \end{array}} \right] =\left[ {\begin{array}{ll} 0 &{}\quad I^{T} \\ {I} &{}\quad {\Omega +\gamma ^{-1}I} \\ \end{array}} \right] ^{-1}\left[ {\begin{array}{l} 0 \\ y \\ \end{array}} \right] ,\quad \hbox {where}\nonumber \\&\left\{ {\begin{array}{l} y=[y_1 ,y_2 ,\ldots ,y_l ]^T \\ I=[1,1,\ldots ,1]^T \\ \alpha =[\alpha _1 ,\alpha _2 ,\ldots ,\alpha _l ]^T\rightarrow \text{ Lagrange } \text{ multipliers } \\ \end{array}} \right. \end{aligned}$$
(22)

Finally, the LSSVM-B regression can be calculated by:

$$\begin{aligned} y(x)=\sum \limits _{i=1}^l {\alpha _i K(x,x_i )+\beta } \end{aligned}$$
(23)

where K refers to the kernel function \(K(x_{i}, x_{j})\), and the matrix \(\Omega +\gamma ^{-1}I_l \in {{\mathbb {R}}}^{l\times l}\) is positive definite. Here, \(\Omega \in {{\mathbb {R}}}^{l\times l}\) is defined by its elements \(\Omega _{ij} =\varphi (x_i )^T\varphi (x_j )=K(x_i ,x_j )\) for \(\forall (i,j)\in {{\mathbb {N}}}_l \times {{\mathbb {N}}}_l \), and \(K(\cdot ,\cdot )\) is a kernel function satisfying Mercer’s theorem (Vapnik 1995). \(\beta \) is the bias term. \(\alpha _{i},\beta \in R\) are the solution of Eq. (22), \(x_{i}\) is the training data, x is the new input vector, and y(x) is the output of the LSSVM-B model.
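A minimal sketch of the resulting estimator is given below: it solves the linear system of Eq. (22) and evaluates Eq. (23). The Gaussian (RBF) kernel and the values of the regularization and width parameters are assumptions for illustration; the Bayesian calibration and the biased noise model of Eq. (10) are omitted, and in the full method these parameters would be tuned by the S-OLABC algorithm:

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Gaussian (RBF) kernel matrix; an assumed Mercer kernel choice."""
    d2 = (np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-d2 / (2.0 * sigma**2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve the linear system of Eq. (22) for the bias and Lagrange multipliers."""
    l = X.shape[0]
    Omega = rbf_kernel(X, X, sigma)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = 1.0                      # first row:    [0, 1^T]
    A[1:, 0] = 1.0                      # first column: [1; ...]
    A[1:, 1:] = Omega + np.eye(l) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]              # alpha, bias beta

def lssvm_predict(X_train, alpha, beta, X_new, sigma=1.0):
    """Evaluate Eq. (23): y(x) = sum_i alpha_i K(x, x_i) + beta."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + beta

# Toy usage on synthetic data
X = np.random.rand(100, 5)
y = np.sin(X.sum(axis=1)) + 0.05 * np.random.randn(100)
alpha, beta = lssvm_train(X, y)
print(lssvm_predict(X, alpha, beta, X[:3]))
```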

2.4 Theory of MI

Feature selection is a commonly used process in data classification, wherein a subset of the available features is selected for use in forecasting (Lee and Kim 2013). In this section, the concept of MI is briefly reviewed; more details can be found in Lee and Kim (2013). Let \(X= X_{1}\), \(X_{2}\), \({\ldots }\), \(X_{n}\) and \(P(X)= P(X_{1}), P(X_{2}),\ldots , P(X_{n})\) denote a vector of variables and its probability distribution. The entropy H(X) can then be expressed as:

$$\begin{aligned} H(X)=-\sum \limits _{i=1}^n {P(X_i )\log _2 ( {P(X_i )})} \end{aligned}$$
(24)

More generally, let X and Y be discrete random variables with a joint probability distribution P(X, Y); then the joint entropy H(X, Y) can be calculated by:

$$\begin{aligned} H(X,Y)=-\sum \limits _{i=1}^n {\sum \limits _{j=1}^m {P(X_i ,Y_j )\log _2 \left( {P(X_i ,Y_j )}\right) } } \end{aligned}$$
(25)

Moreover, when certain variables are known and others are not, the remaining uncertainty is calculated by the conditional entropy (Akadi et al. 2008):

$$\begin{aligned} H(Y/X)= & {} \sum \limits _{i=1}^n {P(X_i )H(Y/ {X=X_i })}\nonumber \\= & {} -\sum \limits _{i=1}^n {P(X_i )\sum \limits _{j=1}^m {P({Y_j }/ {X_i })\log _2 ( {P({Y_j }/ {X_i })})} } \nonumber \\= & {} -\sum \limits _{i=1}^n {\sum \limits _{j=1}^m {P(X_i ,Y_j )\log _2 \left( {P({Y_j }/ {X_i })}\right) } } \end{aligned}$$
(26)

Then, H(X, Y) and H(X/Y) or H(Y/X) are related by Eq. (27), known as the chain rule (Fig. 2):

$$\begin{aligned} H(X,Y)=H(X)+H(Y/X)=H(Y)+H(X/Y) \end{aligned}$$
(27)
Fig. 2
figure 2

Representation of MI and different entropies

Thus, the mathematical formulation of \(\hbox {MI}(X,Y)\) between two variables X and Y can be expressed as:

$$\begin{aligned} \hbox {MI}(X,Y)=\sum \limits _{i=1}^n {\sum \limits _{j=1}^m {P(X_i ,Y_j )\log _2 \left( {\frac{P(X_i ,Y_j )}{P(X_i )P(Y_j )}}\right) } } \end{aligned}$$
(28)

A large MI value indicates that X and Y are closely related, and vice versa.
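The following sketch estimates MI(X, Y) of Eq. (28) from a two-dimensional histogram; the 16-bin plug-in estimator and the synthetic lagged-price example are assumptions, since the paper does not state how the probabilities are estimated:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate MI(X, Y) of Eq. (28) from histogram-based probability estimates."""
    p_xy, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = p_xy / p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                      # convention 0 * log(0) = 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Example: MI between a candidate input (24-h lagged price) and the target price
prices = np.random.rand(2000)
print(mutual_information(prices[:-24], prices[24:]))
```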

2.5 Proposed GMI model

Dimensionality reduction of the raw input variable space is a fundamental step in most forecasting tasks. Focusing on the most relevant information in a potentially overwhelming amount of data leads to a better understanding of the data (Lee and Kim 2013). In this paper, a new feature selection method combining feature subset selection (FSS) and dimensionality reduction (DR) is proposed (Akadi et al. 2008).

There are two different kinds of relevance, as shown in Fig. 3:

Strong relevance A feature \(f_{i}\) is strongly relevant if its removal degrades the performance of the Bayes optimum classifier.

Weak relevance A feature \(f_{i}\) is weakly relevant if it is not strongly relevant and there exists a subset of features \(\vec {{F}'}\) such that the performance on \(\vec {{F}'}\cup \{f_i \}\) is better than the performance on just \(\vec {{F}'}\).

The rest of the features will be irrelevant. Among them we could say that there are redundant features and noisy features.

Fig. 3
figure 3

Representation of relevant concept

A different classification of the features consists of:

Relevant features those features which, by themselves or in a subset with other features, have information about the class.

Redundant features those which can be removed because there is another feature or subset of features which already supplies the same information about the class. This feature selection scheme is fully described in previous work (Shayeghi et al. 2015). The conditional mutual information (CMI) between two subsets, e.g., A and B, conditioned on subset C, can be defined as:

$$\begin{aligned} I(A;B\vert C)&\mathop \equiv \limits ^\Delta \sum \limits _{{a,b,c}} {P(a,b,c)\log _2 \left( {\frac{P(a,b\vert c)}{P(a\vert c)P(b\vert c)}}\right) } \nonumber \\&=H(A\vert C)+H(B\vert C)-H(AB\vert C) \nonumber \\&=H(A\vert C)-H(A\vert B,C)\nonumber \\&=H(AC)+H(BC)-H(C)-H(ABC)\nonumber \\ \end{aligned}$$
(29)

In a similar way, the three-way interaction information, which captures the influence of relevancy and redundancy, can be expressed as follows:

$$\begin{aligned} I(A;B;C)&\mathop \equiv \limits ^\Delta I(A;B\vert C)-I(A;B)\nonumber \\&=I(A,B;C)-I(A;C)-I(B;C)\nonumber \\&=H(AB)+H(BC)+H(AC)-H(A)\nonumber \\&\quad -H(B)-H(C)-H(ABC) \end{aligned}$$
(30)

In this paper, the relevance and redundancy of the candidate feature can be calculated by:

$$\begin{aligned} \hbox {GMI}=\arg \max \left( {I(x_i ;C)+\mathop {\min }\limits _{x_s \in S} ( {I(x_i ;x_s ;C)})}\right) \end{aligned}$$
(31)

where \(I(x_{i};x_{s};C)\) is the redundancy term and S is the set of already selected features (Fig. 4).

Fig. 4
figure 4

Flowchart of GMI to select best data for WPT
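A sketch of how the GMI criterion of Eq. (31) could drive a greedy forward selection is given below. The quantile discretization, the plug-in entropy estimates, the forward-selection loop and the helper names are assumptions made here for illustration and are not taken from the paper:

```python
import numpy as np

def discretize(x, bins=8):
    """Map a continuous variable to integer bin labels for entropy estimation."""
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(x, edges)

def joint_entropy(*vars_):
    """Plug-in estimate of H over one or more discretized variables."""
    keys = np.stack(vars_, axis=1)
    _, counts = np.unique(keys, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mi(a, b):
    """MI(A, B) via the entropy identity of Eq. (27)-(28)."""
    return joint_entropy(a) + joint_entropy(b) - joint_entropy(a, b)

def interaction_info(a, b, c):
    """Three-way interaction information I(A;B;C) of Eq. (30)."""
    return (joint_entropy(a, b) + joint_entropy(b, c) + joint_entropy(a, c)
            - joint_entropy(a) - joint_entropy(b) - joint_entropy(c)
            - joint_entropy(a, b, c))

def gmi_select(X, target, n_features):
    """Greedy forward selection using the GMI criterion of Eq. (31)."""
    C = discretize(target)
    cand = [discretize(X[:, j]) for j in range(X.shape[1])]
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        scores = []
        for j in remaining:
            relevance = mi(cand[j], C)
            redundancy = (min(interaction_info(cand[j], cand[s], C) for s in selected)
                          if selected else 0.0)   # first pick: relevance only
            scores.append(relevance + redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: pick 5 of 20 candidate lagged inputs for the next-hour price
X = np.random.rand(2000, 20)
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * np.random.randn(2000)
print(gmi_select(X, y, 5))
```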

3 Self-adaptive orthogonal learning artificial bee colony (S-OLABC)

3.1 Overview of standard ABC

In this section, the standard ABC is briefly reviewed. Interested readers are referred to Karaboga and Akay (2009) for more details. The pseudo-code of the standard ABC algorithm is shown in Fig. 5.

Fig. 5
figure 5

The pseudo-code of the ABC algorithm, \(\hbox {fit}_{i} \rightarrow \) fitness value of the solution i, \(\varphi _{ij} \rightarrow \) random number between \((-1,1)\), \(\hbox {SN} \rightarrow \) number of solutions, ik and j \(\rightarrow \) randomly chosen indexes, MCN \(\rightarrow \) maximum Cycle Number, \(v_{ij}\rightarrow \) candidate food position

3.2 Self-adaptive orthogonal learning artificial bee colony (S-OLABC)

Although the standard ABC is a successful algorithm for finding the optimal solution of an optimization problem, it often converges to local optima during the search process. Therefore, to balance the exploration and exploitation capabilities of the search and achieve high optimization performance, two modifications are proposed in this paper. Because the original generation of candidate solutions is governed by the random variable \(\varphi _{ij} \) alone, it is not strong enough to maximize the exploitation capacity. Hence, an extra coefficient \(\mho _{ij}\) is introduced as follows:

$$\begin{aligned} x_{ij} (t+1)=x_{ij} (t)+\mho _{ij} \varphi _{ij} [x_{ij} (t)-x_{kj} (t)] \end{aligned}$$
(32)

where \(\mho _{ij}\) is adaptively updated at each iteration. If \(\mho _{ij}\) is large, then \(\mho _{ij} \varphi _{ij} \) will be large and exploration is enhanced, helping the search escape local optima. Conversely, when \(\mho _{ij}\) is small, then \(\mho _{ij} \varphi _{ij} \) will be small and exploitation becomes more efficient for finding the best solution in the search space. To select the best value of \(\mho _{ij}\) for exploitation and exploration, two variables \(T_{1}<0\) and \(T_{2}>0\) are defined and two random variables \(R_{1}\) and \(R_{2}\) are generated in \((T_{1}, 0)\) and \((0, T_{2})\), respectively. The factors \(\mho _{1j} =2^{R_1 }\) and \(\mho _{2j} =2^{R_2 }\) are then used for exploitation and exploration, respectively. In fact, two population vectors are generated using \(\mho _{1j}\) and \(\mho _{2j} \). Since \(T_{1}<0\), \(R_{1}\) is negative, \(\mho _{1j}\) is small, and exploitation is amplified. Next, the best \(\mho _{ij} \) is chosen according to the best fitness values obtained from the populations generated with \(\mho _{1j}\) and \(\mho _{2j} \). This factor is then updated by:

$$\begin{aligned} \mho _{ij} =\lambda \mho _{ij} +(1-\lambda )\mho _{kj} ,\quad k=1,2 \end{aligned}$$
(33)

where \(\lambda \) is a constant factor and initially \(\mho _{ij} =1\). Another disadvantage of the standard ABC algorithm is that each scout bee uses its historical/neighborhood best information through a simple model. Such a learning approach is easy to implement but is inefficient when searching complex optimization problem spaces. Thus, a novel OL model (Parsopoulos and Vrahatis 2004) is proposed for ABC in this paper to recover more useful information via an orthogonal operator. Let us consider two initial solution vectors \(X_1 =\left\{ {x_1^1 ,x_1^2 ,\ldots ,x_1^D } \right\} \) and \(X_2 =\left\{ {x_2^1 ,x_2^2 ,\ldots ,x_2^D } \right\} \). \(X_{1 }\) and \(X_{2}\) define a search range \(\left[ {\min ( {x_1^j ,x_2^j }),\max ( {x_1^j ,x_2^j })} \right] \) for the jth variable of the design variable vector X. This search range is quantized into N levels, \(\rho _{j,1} ,\rho _{j,2} ,\ldots ,\rho _{j,N} \), as follows:

$$\begin{aligned} \rho _{j,l} =\frac{(N-1)\min (x_1^j ,x_2^j )+(l-1)(\max (x_1^j ,x_2^j )-\min (x_1^j ,x_2^j ))}{N-1}\!\!\!\nonumber \\ \end{aligned}$$
(34)

This quantization cannot be applied directly to the orthogonal array. Thus, the variables are divided into k subset vectors as follows:

$$\begin{aligned} K=\left[ {{\begin{array}{ll} {x_1 ,x_2 ,\ldots ,x_{n_1 } } \\ {x_{n_1 +1} ,x_{n_1 +2} ,\ldots ,x_{n_2 } } \\ {x_{n_2 +1} ,x_{n_2 +2} ,\ldots ,x_{n_3 } } \\ \vdots \\ {x_{n_k +1} ,x_{n_k +2} ,\ldots ,x_D } \\ \end{array} }} \right] ,\quad x_{n_1 } \le x_{n_2 } \le \cdots \le x_D\nonumber \\ \end{aligned}$$
(35)

Finally, the proposed OL solutions can be defined as follows:

$$\begin{aligned}&L_{k1} =[\rho _{n_{k-1} +1,1} ,\rho _{n_{k-1} +2,1} ,\ldots ,\rho _{n_k ,1} ]\nonumber \\&L_{k2} =[\rho _{n_{k-1} +1,2} ,\rho _{n_{k-1} +2,2} ,\ldots ,\rho _{n_k ,2} ]\nonumber \\&\vdots \nonumber \\&L_{kN} =[\rho _{n_{k-1} +1,N} ,\rho _{n_{k-1} +2,N} ,\ldots ,\rho _{n_k ,N} ] \end{aligned}$$
(36)

The algorithm analysis is given in the Appendix.
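The following sketch illustrates the two modifications: the self-adaptive coefficient of Eqs. (32)–(33) and the range quantization of Eq. (34) used by the orthogonal learning step. The values of T1, T2, λ, the minimization convention and the sphere test function are assumptions for illustration; the full orthogonal-array construction of Eqs. (35)–(36) is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_candidate(x_i, x_k, U_ij, fitness, T1=-2.0, T2=2.0, lam=0.9):
    """Candidate generation with the self-adaptive coefficient of Eqs. (32)-(33)."""
    phi = rng.uniform(-1.0, 1.0, size=x_i.shape)
    R1 = rng.uniform(T1, 0.0)        # negative exponent -> small step -> exploitation
    R2 = rng.uniform(0.0, T2)        # positive exponent -> large step -> exploration
    best_U, best_x, best_f = None, None, np.inf
    for U_try in (2.0 ** R1, 2.0 ** R2):
        x_new = x_i + U_try * phi * (x_i - x_k)          # Eq. (32)
        f = fitness(x_new)
        if f < best_f:                                   # keep the better trial
            best_U, best_x, best_f = U_try, x_new, f
    U_new = lam * U_ij + (1.0 - lam) * best_U            # Eq. (33)
    return best_x, U_new

def quantize_range(x1, x2, N=5):
    """Quantize each dimension of the range defined by two solutions into N levels
    (Eq. 34), as a prelude to the orthogonal learning step."""
    lo, hi = np.minimum(x1, x2), np.maximum(x1, x2)
    levels = np.linspace(0.0, 1.0, N)[:, None]
    return lo[None, :] + levels * (hi - lo)[None, :]     # shape (N, D)

# Toy usage on a 4-dimensional sphere function
sphere = lambda x: float(np.sum(x ** 2))
x_i, x_k = rng.random(4), rng.random(4)
cand, U_next = adaptive_candidate(x_i, x_k, U_ij=1.0, fitness=sphere)
print(cand, U_next)
print(quantize_range(x_i, x_k, N=3))
```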

4 Application of S-OLABC to the day-ahead electricity price forecasting problem

The process of the day-ahead electricity price forecasting by S-OLABC algorithm can be summarized as:

  • Step 1 Read the price data containing the price series and divide them into training and testing sets, denoted \(T_{r}\) and \(I_{n}\), respectively.

  • Step 2 Set the adjustable S-OLABC parameter values, such as the maximum cycle number, colony size and limit value, to 50, 40 and 10, respectively.

  • Step 3 Decompose the price data (historical data up to hour 24 of day \(d-1\)) via the WPT into a set of four constitutive series \(a_{h}, b_{h}, c_{h}\) and \(d_{h}, h=1,2 ,{\ldots },T\), where T usually ranges from 1 week to 2 months. \(a_{h}, b_{h}, c_{h}\) are detail series with small adjustments and \(d_{h}\) is the approximation series, which is the main component of the transform. Applying the WPT to the original price series can therefore be formulated as:

    $$\begin{aligned} W(p_h ;h=1,\ldots ,T)=\{a_h ,b_h ,c_h ,d_h ;h=1,\ldots ,T\}\nonumber \\ \end{aligned}$$
    (37)
  • Step 4 As a contribution of this paper, and unlike other work in the prediction area, treat both the detail and approximation terms as valuable candidate inputs and send them to the GMI system. In addition, use Shannon entropy to select the best branch of the WPT tree. The GMI system then selects the data with the most relevancy and the least redundancy. The data passed by this system are arranged as a vector {\(x_{1}\), \(x_{2}\), \({\ldots }\), \(x_{n}\)}, which is then sent to Step 5 for training.

  • Step 5 This step is the engine of the machine, being the main part of the forecasting algorithm. When the LSSVM-B is launched with the vector {\(x_{1}\), \(x_{2}\), \({\ldots }\), \(x_{n}\)}, the training process starts. Further details on the learning process of the LSSVM-B model are given in Sect. 2.3. In the first iteration the outputs of the LSSVM-B system may not be adequate; therefore, completing the training process also requires Steps 6 and 7.

  • Step 6 Calculate the fitness value. Sort the initial population as well as all their related data in the descending order of fitness.

    $$\begin{aligned} \hbox {Fitness}=\frac{1}{N}\sum \limits _{i=1}^N {\frac{\hbox {act}_i -\hbox {forc}_i }{\hbox {act}_i }} \end{aligned}$$
    (38)

    where \(\hbox {act}_{i}\) and \(\hbox {forc}_{i}\) represent the actual and forecast values, respectively.

  • Step 7 Use the S-OLABC to find the best agent. The best solution so far is taken as the initial solution; if a newly found solution is better than the previous food source, it replaces it.

  • Step 8 Use the inverse WPT to estimate the hourly prices for day d from the day-d estimates of the constitutive series. The inverse WPT reconstructs the estimated price series, i.e.,

    $$\begin{aligned} W^{-1}(\{a_h^{\mathrm{est}} ,b_h^{\mathrm{est}} ,c_h^{\mathrm{est}} ,d_h^{\mathrm{est}} ;h=T+1,\ldots ,T+24\})=P_h^{W,\mathrm{est}} \end{aligned}$$
    (39)
  • Step 9 Update the positions of the agents based on the S-OLABC scheme. In other words, move the agents through the search area to discover new solutions.

  • Step 10 If the maximum number of iterations is reached, stop. Otherwise, go to Step 2. The graphical process of the proposed algorithm is depicted in Fig. 6.

Fig. 6
figure 6

Flowchart of the proposed forecasting scheme
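To make Steps 3 and 8 concrete, the sketch below decomposes a synthetic hourly history with a two-level WPT (yielding the four constitutive series), extends each node, and reconstructs a 24-h day-ahead estimate with the inverse WPT. A naive extrapolation (repeating each node's last day of coefficients) stands in for the GMI feature selection, LSSVM-B training and S-OLABC tuning of Steps 4–7; the Haar wavelet, periodization mode and synthetic data are assumptions made here for illustration:

```python
import numpy as np
import pywt

# 60 days of hourly history; a synthetic daily pattern stands in for real prices
rng = np.random.default_rng(1)
hours = 24 * 60
history = 30 + 10 * np.sin(2 * np.pi * np.arange(hours) / 24) + rng.normal(0, 1, hours)

# Step 3: two-level WPT gives the four constitutive series a_h, b_h, c_h, d_h
wavelet, level = 'db1', 2            # Haar + periodization keeps lengths exactly L / 4
wp = pywt.WaveletPacket(history, wavelet=wavelet, mode='periodization', maxlevel=level)
leaves = [node.path for node in wp.get_level(level, order='natural')]

# Steps 4-7 (GMI selection, LSSVM-B training, S-OLABC tuning) are replaced here by a
# naive extrapolation: each node is extended by repeating its last day's coefficients.
# 24 hours correspond to 24 / 2**level = 6 coefficients per node.
n_ext = 24 // 2 ** level
new_wp = pywt.WaveletPacket(data=None, wavelet=wavelet, mode='periodization',
                            maxlevel=level)
for path in leaves:
    coeffs = wp[path].data
    new_wp[path] = np.concatenate([coeffs, coeffs[-n_ext:]])

# Step 8: the inverse WPT reconstructs the extended series; its last 24 samples are
# the day-ahead estimate P_h^{W,est} of Eq. (39)
extended = new_wp.reconstruct(update=False)
day_ahead_forecast = extended[-24:]
print(day_ahead_forecast.round(2))
```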

5 Simulation results and discussion

The proposed hybrid algorithm is tested using real data of the hourly Iran electricity price (HIEP), Spanish and New South Wales (NSW) markets. In the proposed algorithm, the adjustable parameters, namely the population size, limit value, \(G_{0}\) and number of iterations, are set to 50, 100, 10 and 200, respectively.

5.1 Evaluating the forecasting error

As indicated in Shayeghi et al. (2015), this paper uses similar error-based indices to compare the forecast accuracy. These indices are:

The daily MAPE can be defined as

$$\begin{aligned}&\hbox {MAPE}=\frac{1}{N}\sum \limits _{i=1}^N {\frac{\vert P_{i\mathrm{ACT}} -P_{i\mathrm{FOR}} \vert }{P_{\mathrm{AVE-ACT}} }}\end{aligned}$$
(40)
$$\begin{aligned}&P_{\mathrm{AVE-ACT}} =\frac{1}{N}\sum \limits _{i=1}^N {P_{i\mathrm{ACT}} } \end{aligned}$$
(41)

where \(P_{i\mathrm{ACT}}\) and \(P_{i\mathrm{FOR}}\) are actual and predicted values, and \(P_{\mathrm{AVE-ACT}}\) denotes the average of actual prices. The FMSE is:

$$\begin{aligned} \hbox {FMSE}_{\mathrm{day}} =\sqrt{\frac{1}{N}\sum \limits _{i=1}^N {( {P_{i\mathrm{ACT}} -P_{i\mathrm{FOR}} })^2} } \end{aligned}$$
(42)

The ESD is given by:

$$\begin{aligned} \left\{ {\begin{array}{l} \hbox {ESD}_{\mathrm{day}} =\sqrt{\frac{1}{N}\sum \limits _{i=1}^N {( {E_i -E_i^{\mathrm{Ave}} })^2} }\\ E_i =P_{i\mathrm{ACT}} -P_{i\mathrm{FOR}} \\ E_i^{\mathrm{Ave}} =\frac{1}{N}\sum \limits _{i=1}^N {E_i } \\ \end{array}} \right. \end{aligned}$$
(43)
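A direct transcription of these indices is sketched below; the toy actual/forecast arrays are placeholders, and the MAPE is returned as a fraction (multiply by 100 for a percentage):

```python
import numpy as np

def mape(actual, forecast):
    """Daily MAPE of Eqs. (40)-(41), normalized by the average actual price."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs(actual - forecast)) / np.mean(actual)

def fmse(actual, forecast):
    """Forecast mean square error of Eq. (42) (a root-mean-square error)."""
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(forecast)) ** 2))

def esd(actual, forecast):
    """Error standard deviation of Eq. (43)."""
    e = np.asarray(actual) - np.asarray(forecast)
    return np.sqrt(np.mean((e - e.mean()) ** 2))

# Example on one forecast day (24 hourly values)
act = np.array([40.1, 38.5, 37.2] + [35.0] * 21)
pred = act * (1 + 0.05 * np.sin(np.arange(24)))
print(mape(act, pred), fmse(act, pred), esd(act, pred))
```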

5.2 Spanish electricity market

To perform forecasting, the target and candidate input variables are linearly normalized to (0, 1). Then, the proposed GMI-based feature selection is employed to select the best input variables for the learning machine. The training inputs and validation samples are constructed from the data of the previous 50 days. Each of the 24 forecasted outputs is trained on its respective 49 training trials and validated on 1 validation trial. The proposed price forecasting approach uses price series data up to the midnight of the previous day. Generally, if the kth forecaster engine requires hourly prices of the forecast day, the price values predicted for those hours by the earlier forecasters are used. For example, \(P_{h-1}\) for the second forecaster is the price predicted by the first forecaster. As shown in Table 1, the results of the proposed algorithm for 4 test weeks of the Spanish electricity market in the year 2002 are compared with some other existing methods in the literature (Amjady and Hemmati 2009; Pindoriya et al. 2008; Catalão et al. 2010; Anbazhagan and Kumarappan 2013; Shafie-khah et al. 2011; Amjady and Keynia 2009, 2010; Amjady and Daraeepour 2009). For a fair comparison, the same test weeks, i.e., the fourth week of February, May, August, and November, in the winter, spring, summer, and fall seasons, respectively, are considered for all calculations (Pindoriya et al. 2008; Catalão et al. 2010; Anbazhagan and Kumarappan 2013; Shafie-khah et al. 2011; Amjady and Keynia 2009, 2010; Amjady and Daraeepour 2009; Keynia et al. 2012).
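A minimal sketch of the linear (min–max) normalization mentioned above is given here, assuming prices are held in a NumPy array; the helper names and the inverse mapping used to bring forecasts back to the original price scale are illustrative assumptions:

```python
import numpy as np

def minmax_normalize(x):
    """Linearly normalize a series to (0, 1); return the scaler for later inversion."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), (lo, hi)

def minmax_denormalize(z, scaler):
    """Map normalized forecasts back to the original price scale."""
    lo, hi = scaler
    return z * (hi - lo) + lo

prices = np.array([42.3, 39.8, 37.5, 41.1, 55.0, 61.2])
z, scaler = minmax_normalize(prices)
print(minmax_denormalize(z, scaler))
```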

Table 1 Weekly MAPE values in terms of percentage (%) for 4 weeks of the Spanish electricity market

According to Table 1, the proposed hybrid algorithm has a better weekly MAPE than the other methods in all seasons. Moreover, the average MAPE of the proposed method is considerably lower than that of all other methods (last row of Table 1). The improvements in average MAPE of the proposed price forecasting approach with respect to the other methods are shown in Fig. 7 and are calculated by Eq. (44). Table 2 shows the variance of the forecasting errors as a measure of uncertainty. This index for the proposed hybrid algorithm is lower than for the other forecasting methods, and its average variance is significantly lower as well. The improvement in average error variance of the proposed method compared to HEA (a recently published method (Osórioa et al. 2014) and the best-performing of the other forecasting methods in Table 1) is about 20 %. The electricity price data of the Spanish market can be found online (Informe de operación del sistema eléctrico 2015).

$$\begin{aligned}&\hbox {Improvement}\nonumber \\&=\frac{\left( {\hbox {MAPE}_{{\mathrm{Other \,Method}}} -\hbox {MAPE}_{{\mathrm{Proposed\, Method}}} }\right) }{\hbox {MAPE}_{{\mathrm{Other\,Method}}} }\times 100\nonumber \\ \end{aligned}$$
(44)
Table 2 Error variance for 4 weeks of the Spanish electricity market
Fig. 7
figure 7

Improvement of the proposed method compared to other methods based on Table 1

Furthermore, sample results for the MI (Shayeghi and Ghasemi 2013) and GMI are presented. For conciseness, results for only one of the test days are shown in Table 3. \(P_{h}\) denotes the price of hour h. In Table 3, the selected features (columns 2 and 4) and normalized MI values (columns 3 and 5) are shown. The results were obtained after the convergence of the iterative search procedure. According to Table 3, out of 600 candidate inputs considered for the Spanish electricity market, 18 inputs are selected by MI, which indicates a filtering ratio (\(600/18=33.33\,\%\)) for that feature selection technique, while the GMI selects only 14 inputs, indicating a higher filtering ratio (\(600/14=42.85\,\%\)) than traditional MI methods. To additionally provide a graphical view of the day-ahead and week-ahead price forecasting accuracy obtained by the proposed hybrid algorithm, the forecast, the actual signal and the forecast error for the day-ahead price forecasting are shown in Fig. 8.

Table 3 Selected features plus their ranks and normalized MI and GMI values for Spanish market
Fig. 8
figure 8

Spring forecast results for the Spanish electricity market; a daily, b weekly

5.3 Iranian electricity market

In this case study, the proposed hybrid forecasting algorithm shown in Fig. 6 is used to forecast day-ahead prices in the Iranian electricity market. The effectiveness of the proposed algorithm is demonstrated by forecasting prices for Iran’s electricity market over March 2012. The data of Iran’s electricity market can be found in Informe de operación electricity market (2015). The simulation mechanism is similar to that of the Spanish market. Price data of Iran’s electricity market in the year 2012 are shown in Fig. 9. For comparison, the multi-layer perceptron (MLP) neural network, WT \(-\) MI \(+\) NN and the proposed hybrid algorithm are used for price forecasting. To provide a graphical view of the forecast accuracy of the proposed hybrid algorithm, results for Iran’s electricity market are shown in Fig. 10a and b. Table 4 compares the results obtained via the proposed hybrid algorithm, MLP and WT \(-\) MI \(+\) NN methods for the 24 h.

Fig. 9
figure 9

Price data of Iran electricity market in year 2012

Table 4 shows the forecast results of the proposed hybrid algorithm, MLP and WT \(-\) MI \(+\) NN methods for 24 h (daily) and 168 h (weekly). The comparison shows that the prediction is occasionally unacceptable, because the error obtained by some methods is above 40 % and the forecasts move away from the actual price signal at price valleys or peaks. One important reason for this shortcoming is the small amount of available historical data. The WT \(-\) MI \(+\) NN method improves neural network (NN) performance considerably, although it shows large deviations in some hours; except for hour 24 in Table 4, its error percentage is always above 15 %. Compared to NN and WT \(-\) MI \(+\) NN, the proposed hybrid algorithm (with errors lower than 10 %) has superior accuracy and more stable predictions.

Fig. 10
figure 10

Spring forecast results for the Iran’s electricity market; a daily, b weekly

Table 4 A sample of 24 forecasted hours in a working day in Iran

In addition, Table 4 compares the proposed method, MLP and WT \(-\) MI \(+\) NN techniques for forecasting electricity prices using the MAPE, FMSE and ESD criteria. As the proposed forecasting algorithm shows lower MAPE, FMSE and ESD values, its predictions are more stable. During the valley hours, the volatility of the spot price increases, and thus the MLP shows considerably larger prediction errors; the forecasts during specific off-peak hours are also less accurate. Furthermore, the overall setup time of the proposed hybrid algorithm, including the WPT-GMI, training of the forecaster (LSSVM \(+\) S-OLABC) and fine-tuning of the adaptable parameters, is about 26 min on a Pentium P4 3.2 GHz PC with 1 GB RAM, which is acceptable within a day-ahead decision-making framework.

Table 5 MAPE (%) analysis of the forecasting error of NSW electricity market in year 2007

5.4 New South Wales (NSW) market

As the final test market, the New South Wales (NSW) region of Australia’s national grid is used to demonstrate the performance of the proposed forecasting algorithm. The historical data for this market in 2007 can be found in Australian Energy Market Operator (2015). To perform forecasting, the target and candidate input variables (\(T_{r}\) and \(I_{n})\) are normalized to the range of 0 to 1. Then, the proposed feature selection algorithm is used to pick the best input candidates. Twelve months were considered for simulation: all days of each month, except the last day, were selected for training and the final day was used for forecast evaluation. The forecast obtained by the proposed hybrid algorithm is compared with other existing methods from the literature in Table 5, namely ARIMA, LSSVM, PLSSVM, ARIMA \(+\) LSSVM, ARIMA \(+\) PLSSVM, and WT \(+\) ARIMA \(+\) LSSVM (Zhang et al. 2012). According to Table 5, the proposed hybrid algorithm has a better daily MAPE than the other methods in all simulations (except for some months, such as April and November, with nearly the same results). Furthermore, the ARIMA predictions were less accurate than those of the proposed hybrid algorithm, with the average MAPE increasing from 2.10 % for the proposed method to 13.63 % for ARIMA.

5.5 Forecasting results for three markets

For a fair comparison between MI and the proposed feature selection (GMI), the data set is kept constant so that the efficiency of the feature selection methods can be compared. The MI technique (in Table 3) is a well-known feature selection method that only considers the relevancy of candidate inputs with the target variable. The proposed GMI method considers both the relevancy and the redundancy of candidate inputs. In earlier feature selection methods, however, the whole common information content of two candidate inputs is treated as redundant information, whereas the GMI algorithm adopts a more accurate formulation of redundancy based on interaction gain, which accounts for the common information content of two candidate inputs with respect to the target. The numerical results of Tables 1, 2, 4 and 5 show that the proposed hybrid forecasting algorithm outperforms all other methods in price forecasting. It is also shown that the proposed algorithm has the lowest MAPE, FMSE and ESD values among all methods of Tables 1, 2, 4 and 5 in all test weeks, indicating better forecast accuracy and stability.

6 Conclusions

To improve the accuracy of electricity price forecasting, a novel hybrid method is proposed in this paper, combining WPT, GMI, and LSSVM-B optimized by the S-OLABC algorithm. The proposed forecasting algorithm is examined using data from the Spanish, NSW, and Iranian electricity markets as three representative electricity markets. The superior performance of the proposed hybrid algorithm can be attributed to three causes. First, the WPT can convert an ill-behaved price signal into subsets of better-behaved signals. Second, the GMI can be computed from a reasonable amount of historical data with a low computational burden while maximizing relevancy and minimizing redundancy, and the LSSVM-B model has nonlinear mapping capabilities that can more easily capture the nonlinear component of electricity prices. Third, the S-OLABC algorithm can tune suitable control variables of the LSSVM-B model, for which inappropriate settings would cause either under- or over-fitting. The proposed method can also be used for other forecasting tasks such as wind power forecasting. The GMI considers three-way interactions; extending this method to include higher-order interactions will be considered in future work.