
1 Introduction

With the worldwide energy shortage and growing environmental problems, the use of renewable energy is receiving increasing attention [1, 2]. As a clean, pollution-free, widely distributed renewable energy source that is easy to exploit, wind energy has attracted widespread interest [3, 4].

However, wind energy is easily influenced by wind speed and air pressure, so wind power fluctuates strongly and intermittently, which hampers its integration into the power grid and hinders its development [5, 6]. Accurate and reliable short-term wind power forecasting is an effective way to overcome this obstacle: it helps wind farms plan their operation and reduces the impact on the power grid.

As research has progressed, wind power forecasting methods have been continuously optimized and updated, from early physical methods to statistical methods and, in recent years, to intelligent algorithms for data analysis [7]. Physical methods are usually based on complex models and make predictions from various kinds of meteorological data. They are suitable for medium- and long-term wind forecasting but perform less well in short-term wind power forecasting [8]. Traditional statistical methods forecast from historical wind power data; the most popular model is the autoregressive integrated moving average (ARIMA) [9]. It predicts the linear part of wind power fluctuation well, but it cannot learn the nonlinear characteristics of wind power changes, so it fails to deliver consistently satisfactory results.

A study of the recurrent neural network (RNN) finds that it achieves higher accuracy than the ARIMA method [10]. Many other neural networks, such as the deep belief network (DBN) [11], the convolutional neural network (CNN) [13], long short-term memory (LSTM) [14] and the echo state network (ESN) [12], are also valued for their strengths in feature extraction and nonlinear fitting. Among these models, LSTM is promising because of its advantages in processing long time series [15]. Hybrid models have also been proposed for wind speed forecasting [16]. For example, the empirical wavelet transform [17] and ensemble empirical mode decomposition [18] have each been combined with the LSTM network to pre-process the data. To select the parameters of neural network models, methods such as the differential evolution algorithm (DE) [19], grey wolf optimization (GWO) [20] and the multi-objective whale optimization algorithm (MOWOA) [21] have been proposed, because poorly chosen network parameters degrade the prediction results [22]. Another problem with these networks is that when the input wind power fluctuates strongly, the prediction accuracy drops sharply.

To extract features from long input sequences and smooth the data, our network first uses a CNN to extract local informative features from the input. LSTM is widely used for time series prediction because of its excellent performance on long, correlated sequences [23], so we then use an LSTM to perform time series prediction on the nonlinear wind power data. Because there is currently no established rule for selecting neural network parameters, we use a reliable improved differential evolution algorithm to optimize parameters such as the number of layers and the dropout rate of the LSTM network, achieving better prediction results.

2 Strategy for Wind Power Forecasting

To achieve smart operation of wind farms and predict wind power in advance, so that the start and stop of wind turbines can be better scheduled, neural network methods are used for short-term wind power forecasting. Based on historical wind power data over a period of time, they achieve better forecast results than traditional methods and provide a basis for subsequent optimal control.

Wind power prediction mostly adopts the LSTM model. It is well suited to time series prediction, but it is sensitive to fluctuations, and if the network parameters are not properly selected, the prediction accuracy is often low. Therefore, a comprehensive method integrating CNN, LSTM and DE is proposed here. First, a CNN smooths the data and extracts high-dimensional features; an LSTM network then produces the prediction; finally, the differential evolution algorithm optimizes the parameters of the network (Fig. 1).

Fig. 1. Structure of the CNN-LSTM-DE method
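
As a rough illustration of this pipeline, the following is a minimal Keras sketch of the CNN-LSTM predictor. The layer sizes, the window length `n_steps` and the Conv1D settings are illustrative placeholders, not the values used in this paper; the DE step of Sect. 4 would tune such hyperparameters.

```python
# Minimal CNN-LSTM sketch; all layer sizes are illustrative placeholders.
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dropout, Dense

n_steps = 72  # length of the historical input window (assumed)

model = Sequential([
    # CNN front end: extract local features and smooth the raw power series
    Conv1D(32, kernel_size=3, activation='relu', input_shape=(n_steps, 1)),
    MaxPooling1D(pool_size=2),
    # LSTM back end: model the temporal dependence of the extracted features
    LSTM(64, return_sequences=True),
    Dropout(0.5),
    LSTM(16),
    Dropout(0.5),
    Dense(1),  # one-step-ahead wind power forecast
])
model.compile(optimizer='adam', loss='mse')
```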

3 CNN-LSTM Structure

3.1 CNN

CNN is a successful deep learning architecture that is effective for feature extraction and pattern recognition [24]. It extracts high-dimensional features of the input data through its convolution layers (Fig. 2).

Fig. 2. Structure of CNN

In a CNN, the input data are convolved, pooled, passed through a fully connected layer, and then emitted through the output layer [27]. The convolution layer perceives the input locally, extracting high-dimensional features through the inner product of the convolution kernel with a sliding window. After the convolution layer comes the pooling layer, also called the subsampling layer. The convolution layer reduces the number of connections between neurons, but the number of neurons itself does not decrease much, so the pooling layer is needed to reduce the feature dimension and avoid overfitting [28]. The last step is the fully connected layer, which produces the results: after the data have been processed by the convolution and pooling layers, they are fed into the fully connected layer to obtain the final output. This series of operations significantly reduces the data volume and the computing cost while improving efficiency. The core convolution operation of a CNN is shown in formula (1):

$$ O_t = \sigma \left( {W*x_t + b} \right) $$
(1)

where \(W\) represents the weight coefficients of the filter; \(x_t\) represents the input sample at time t; * represents the discrete convolution of \(x_t\) and \(W\); \(b\) is the bias parameter, which is learned during model training; \(\sigma\) represents the activation function; and \(O_t\) represents the output of the convolution operation.

Each neuron contains a filter. The input data are convolved with each filter and the results are summed to form the output. For example, when the input sequence is all 2s, the filter is (1, 1), and the sliding-window stride is 1, then every output is 4 if there is no bias.
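
This toy example can be checked directly with NumPy: `np.convolve` in `'valid'` mode performs the sliding inner product, and the kernel flip inherent in convolution does not matter here because the filter is symmetric.

```python
import numpy as np

x = np.full(6, 2.0)       # input sequence: all values equal to 2
w = np.array([1.0, 1.0])  # filter (1, 1)

# Sliding-window inner product with stride 1 and no bias (Eq. 1 without sigma and b)
out = np.convolve(x, w, mode='valid')
print(out)                # [4. 4. 4. 4. 4.] -- every output is 4
```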

3.2 LSTM

The LSTM network is a variant of the recurrent neural network (RNN). RNNs are very effective for sequential data and can mine the temporal information in the data [25], but during training, vanishing or exploding gradients occur as time steps accumulate. This causes the loss of earlier input information or generates invalid information, leading to wrong predictions.

The LSTM network stores information selectively by introducing a forget gate, an input gate and an output gate into each cell to control the transmission of information, which to a certain extent solves the vanishing and exploding gradient problems of RNNs on long sequences [26]. The cell structure of LSTM is shown in Fig. 3.

Fig. 3. LSTM cell structure

In the figure, the left end of the LSTM structure is the input layer, the right end is the output layer, and the middle consists of the three gating units. The input gate first determines whether information enters, the forget gate then decides whether to discard information in the memory cell, and the output gate finally decides whether to output information at the current moment.

The forget gate deletes memories. First, \(h_{t - 1}\) and \(x_t\) are combined and adjusted by the weight matrix \(W_f\) to the same dimension as the hidden layer at time t. The result is then passed through the sigmoid function, which compresses the output to between 0 and 1: values close to 0 are eliminated, and values close to 1 are retained. The operation is shown in formula (2).

$$ f_t = \sigma \left( {W_f \cdot h_{t - 1} + W_f \cdot x_t + b_f } \right) $$
(2)

In the formula, \(f_t\) represents the past memory measurement factor, \(W_f\) represents the weight, \(b_f\) represents the bias, \(h_{t - 1}\) represents the state information of the previous hidden layer, and \(x_t\) represents the input vector at time t.

$$ i_t = \sigma \left( {W_i \cdot h_{t - 1} + W_i \cdot x_t + b_i } \right) $$
(3)
$$ k_t = \tanh \left( {W_k \cdot h_{t - 1} + W_k \cdot x_t + b_k } \right) $$
(4)
$$ c_t = f_t *c_{t - 1} + i_t *k_t $$
(5)

The input gate updates the memory. First, \(h_{t - 1}\) and \(x_t\) are put into the sigmoid function for information screening. Meanwhile, \(h_{t - 1}\) and \(x_t\) are passed to the tanh function to create a new candidate value vector. The output \(i_t\) of the sigmoid function is then multiplied by the output \(k_t\) of the tanh function, and this product is added to the product of \(f_t\) and the past memory \(c_{t - 1}\) to obtain the updated memory \(c_t\).

$$ g_t = \sigma \left( {W_g \cdot h_{t - 1} + W_g \cdot x_t + b_g } \right) $$
(6)
$$ h_t = g_t *\tanh (c_t ) $$
(7)

The output of the output gate is determined by the cell state. First, \(h_{t - 1}\) and \(x_t\) are put into a sigmoid function to decide which part of the cell state should be output; the cell state is then processed through the tanh function, and the two results are multiplied to obtain the final output.
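
The gate equations (2)-(7) can be traced step by step in a small NumPy sketch. The weights here are random, and each gate is given separate matrices for \(h_{t-1}\) and \(x_t\) (an assumption needed for consistent dimensions; the paper writes both with a single symbol per gate).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, p):
    """One LSTM step following Eqs. (2)-(7); p holds the weights and biases."""
    f_t = sigmoid(p['Wf'] @ h_prev + p['Uf'] @ x_t + p['bf'])  # forget gate, Eq. (2)
    i_t = sigmoid(p['Wi'] @ h_prev + p['Ui'] @ x_t + p['bi'])  # input gate, Eq. (3)
    k_t = np.tanh(p['Wk'] @ h_prev + p['Uk'] @ x_t + p['bk'])  # candidate, Eq. (4)
    c_t = f_t * c_prev + i_t * k_t                             # memory update, Eq. (5)
    g_t = sigmoid(p['Wg'] @ h_prev + p['Ug'] @ x_t + p['bg'])  # output gate, Eq. (6)
    h_t = g_t * np.tanh(c_t)                                   # hidden state, Eq. (7)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_h = 1, 8  # input and hidden sizes (illustrative)
p = {f'W{g}': rng.standard_normal((n_h, n_h)) * 0.1 for g in 'fikg'}
p.update({f'U{g}': rng.standard_normal((n_h, n_in)) * 0.1 for g in 'fikg'})
p.update({f'b{g}': np.zeros(n_h) for g in 'fikg'})

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in np.array([[0.2], [0.4], [0.3]]):  # a toy three-step power sequence
    h, c = lstm_cell(x_t, h, c, p)
```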

4 Differential Evolution Algorithm

The differential evolution algorithm (DE) is a heuristic search algorithm that conducts a random search under the evolutionary law of "survival of the fittest". It preserves good samples through mutation, crossover and one-to-one greedy selection based on the difference vectors between parent samples, which improves local search ability, robustness and convergence.

By random selection, M samples are generated in the n-dimensional space subject to the following constraints. In the search space, let \(x_{ij}^U\) be the upper bound of the ith sample in the jth dimension and \(x_{ij}^L\) the lower bound. Samples that satisfy the constraints within this range are initialized as

$$ x_{ij} \left( 0 \right) = rand_{ij} \left( {0,1} \right)\left( {x_{ij}^U - x_{ij}^L } \right) + x_{ij}^L $$
(8)

In the formula, \(x_{ij} (0)\) represents the initialization sample.
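
Formula (8) is a one-liner in NumPy; the population size, dimension and bounds below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
M, n = 20, 5       # population size and dimension (illustrative)
x_L = np.zeros(n)  # per-dimension lower bounds (assumed)
x_U = np.ones(n)   # per-dimension upper bounds (assumed)

# Formula (8): uniform random samples scaled into [x_L, x_U]
pop = rng.random((M, n)) * (x_U - x_L) + x_L
```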

In the mutation operation, the mutation factor F is introduced to control population diversity and convergence. In the traditional differential evolution algorithm, F ranges over [0, 2]. When F is small, the search may fail to break out of a local extremum during evolution, resulting in premature convergence. When F is large, the search escapes local extrema more easily, but convergence slows down.
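For orientation, the most common mutation operator that uses F is the standard DE/rand/1 form shown below; the paper does not state which specific operators its variant employs.

$$ v_i = x_{r_1 } + F \cdot \left( {x_{r_2 } - x_{r_3 } } \right),\quad r_1 \ne r_2 \ne r_3 \ne i $$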

In the crossover operation, the mutant vector is mixed with a predetermined target vector to generate a trial vector, and the balance between global and local search is controlled through the parameter value in each dimension. The trial vector is then compared with the original vector, the better one is chosen as the new solution vector, the population is updated, and the algorithm proceeds to the next step. The traditional structure of DE is shown in the figure below (Fig. 4).

Fig. 4. Flow chart of the DE algorithm

This paper improves the traditional differential evolution algorithm in two respects. The first improvement is that two pairs of popular mutation operators are used in the mutation stage of the search. In each generation, the choice depends on the absolute error between the best value and the mean value of the objective function in the previous generation. When the error is large in early generations, one pair of mutation operators is used to explore the region containing the global optimum. When the error becomes smaller after several rounds of global search, the other pair of mutation operators is used to increase the convergence speed [29].

The second improvement can be called an elitist selection strategy. The trial vectors generated by the crossover operation are merged with their parent population, and the best individuals of the whole merged group are selected to form the population of the next generation. In this way, the best individuals in the entire population are passed on to the next generation, which gives the differential evolution algorithm a better convergence speed.
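
A compact Python sketch of this improved DE loop is given below. The error threshold `eps`, the choice of DE/rand/1 and DE/best/1 as the two operators, and the `fitness` function are all illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def improved_de(fitness, x_L, x_U, M=30, F=0.6, CR=0.9, eps=1e-2, gens=100, seed=0):
    """DE sketch with (i) error-driven switching between two mutation operators
    and (ii) elitist selection over the merged parent + trial population.
    x_L, x_U are per-dimension bound arrays; fitness is minimized."""
    rng = np.random.default_rng(seed)
    n = len(x_L)
    pop = rng.random((M, n)) * (x_U - x_L) + x_L  # initialization, formula (8)
    fit = np.array([fitness(x) for x in pop])
    for _ in range(gens):
        best = pop[fit.argmin()]
        # Operator switch: large |best - mean| error of the last generation -> explore
        explore = abs(fit.min() - fit.mean()) > eps
        trials = np.empty_like(pop)
        for i in range(M):
            r1, r2, r3 = rng.choice(M, size=3, replace=False)
            if explore:
                v = pop[r1] + F * (pop[r2] - pop[r3])  # DE/rand/1: global search
            else:
                v = best + F * (pop[r2] - pop[r3])     # DE/best/1: fast convergence
            v = np.clip(v, x_L, x_U)
            mask = rng.random(n) < CR                  # binomial crossover
            mask[rng.integers(n)] = True               # keep at least one mutant gene
            trials[i] = np.where(mask, v, pop[i])
        # Elitist selection: the best M of parents + trials form the next generation
        merged = np.vstack([pop, trials])
        merged_fit = np.concatenate([fit, [fitness(x) for x in trials]])
        keep = np.argsort(merged_fit)[:M]
        pop, fit = merged[keep], merged_fit[keep]
    return pop[fit.argmin()], fit.min()
```

For hyperparameter tuning, `fitness` would train the CNN-LSTM with a candidate parameter vector and return its validation error, e.g. the SMAPE of Sect. 5.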

5 Case Study

We apply the CNN-LSTM-DE method to short-term wind power prediction on a wind farm. The experimental data come from an offshore wind farm and cover January 1st 2020 to January 16th 2020, with a sampling interval of 10 min. The data consist of a single column of wind power over time. The original wind power output of the farm is shown in the figure below.

Fig. 5. Original wind farm output power sequence

The symmetric mean absolute percentage error (SMAPE) is used to evaluate the prediction. It is a correction of MAPE that avoids MAPE's tendency to blow up when the real value is small. The formula is as follows.

$$ SMAPE = \frac{1}{n}\sum_{t = 1}^n {\frac{{\left| {F_t - A_t } \right|}}{(A_t + F_t )/2}} $$
(9)

where \(A_t\) is the actual value and \(F_t\) is the forecast value. When the predicted \(F_t\) and the real \(A_t\) are exactly the same, SMAPE attains its minimum value of 0, so the closer SMAPE is to 0, the better the prediction.
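
Formula (9) translates directly into a short NumPy function (a minimal sketch; the variable names are ours):

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, formula (9)."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs(forecast - actual) / ((actual + forecast) / 2))
```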

First, we study the case where the differential evolution algorithm is not used for optimization, setting the LSTM hyperparameters to conventional values. The LSTM network has two layers, with 64 neurons in the first layer and 16 in the second. The batch size is 12, the learning rate is 0.01, and the dropout proportion of both layers is 0.5. The difference between the predicted and actual values is shown in the following figure, and the resulting SMAPE is 0.0165.

Fig. 6. Comparison of CNN-LSTM model predicted value and actual value

Then, we applied the differential evolution algorithm to the original model and optimized the parameters of the LSTM. The numbers of neurons in the two layers become 176 and 124, while the dropout proportions are 0.442 and 0.809. The batch size is 14 and the learning rate decreases to 0.000696. The difference between the predicted and actual values is shown in Figs. 7 and 8, and the resulting SMAPE is 0.0109.

Fig. 7. Comparison of CNN-LSTM-DE model predicted value and actual value

Fig. 8. Scatter plot of CNN-LSTM-DE predictions against observed values

Comparing the CNN-LSTM-DE model with the CNN-LSTM model not optimized by the differential evolution algorithm, the prediction results of the optimized model are closer to the actual values, which demonstrates the strong effect of the optimization algorithm on model performance.

6 Conclusion

This paper proposes a combined CNN-LSTM-DE method for short-term wind power forecasting. From the existing historical data, a CNN first extracts features and an LSTM network then makes the prediction, making full use of the historical data. To improve the accuracy of the model, an improved differential evolution algorithm is used to optimize parameters such as the number of layers, the number of neurons in each layer, and the learning rate of the network. Tests on the offshore wind power dataset show that the model using the optimization algorithm has higher prediction accuracy. The reliable performance of the model suggests that this structure can be applied to a variety of time series forecasting tasks, especially power forecasting.