Data-driven random forest forecasting method of monthly electricity consumption

Pang, Xinfu; Luan, Changfeng; Liu, Li; Liu, Wei; Zhu, Yuancheng

doi:10.1007/s00202-021-01457-5

Original Paper
Published: 12 January 2022

Volume 104, pages 2045–2059, (2022)

Electrical Engineering

Xinfu Pang ORCID: orcid.org/0000-0001-6981-596X¹,
Changfeng Luan¹,
Li Liu¹,
Wei Liu¹ &
…
Yuancheng Zhu²

Abstract

Accurate forecasting of monthly electricity consumption guides the economic dispatch of the power system and is a prerequisite for power companies to formulate reasonable sales plans. Traditional forecasting methods perform poorly on monthly electricity consumption series with a dual trend, cannot consider multiple influencing factors at the same time, and cannot screen those factors. This paper proposes a random forest prediction method of monthly electricity consumption based on the maximum mutual information coefficient. First, the maximum mutual information coefficient between monthly electricity consumption and its influencing factors is calculated; second, high-relevance factors are selected based on the maximum mutual information coefficient value; third, the data of the high-relevance factors are combined, and random forest is used to predict monthly electricity consumption; finally, the method is implemented in Python, with the electricity consumption data of the whole society in Shenyang, Liaoning Province as an actual calculation example, and is compared with the method that does not use correlation factor identification. Simulation results show that the proposed method has high prediction accuracy and can provide a basis for making reasonable grid operation plans and sound power decisions.

1 Introduction

1.1 Literature review

Prediction of monthly electricity consumption is important in the power system [1]. Accurate forecast of monthly electricity consumption is the premise for power departments to allocate power resources and power companies to make reasonable sales plans [2]. According to the study of the relationship between different socio-economic factors and electricity consumption, the monthly electricity consumption forecast can help power enterprises better understand the service demand of all walks of life and provide data support for the future development of power grid and the formulation of power demand-side response policies [3].

Traditional forecasting methods of monthly electricity consumption often use the time series method [4, 5], exponential smoothing [6], ARIMA [7], the gray model [8], and regression analysis [9]. These methods require the original data to be highly stable, whereas the curve of monthly electricity consumption is a typical nonstationary and nonlinear time series. Moreover, the traditional forecasting methods perform poorly on monthly electricity consumption series with the double trend of growth and volatility. In reference [4], a prediction method of monthly electricity consumption based on the STL (Seasonal and Trend decomposition using Loess) model is proposed. The STL model is used to decompose the time series of electricity consumption, and the decomposed components are then predicted. However, the method ignores other factors, such as the economy. In recent years, quantile regression has been widely used in the prediction of monthly electricity consumption abroad; this method has the advantage of being insensitive to outliers [10]. In reference [11], a constrained quantile regression averaging (CQRA) method is proposed; this method creates an improved overall prediction from multiple individual probability predictions. The parameter estimation problem of CQRA is formulated as a linear programming problem whose goal is to minimize the pinball loss. In reference [12], a new joint forecasting system is established; this system improves the online probability forecasting of single load, and the refining process is based on multiple quantile regression. However, the main disadvantage of quantile regression is that the computational complexity of the solution process leads to a long prediction time [13]. In reference [14], a load residual forecasting method based on quantile regression is proposed; this method significantly improves the accuracy of load forecasting. However, the framework only considers the conditional distribution of load errors, and it ignores the relationship between multi-point load errors.

Data-driven forecasting methods of monthly electricity consumption mainly include artificial neural networks [15,16,17,18] and support vector machines [19,20,21]. These methods can model nonlinearity and have self-learning ability. In reference [22], a medium-term load forecasting method based on singular spectrum analysis and neural network is proposed. Singular spectrum analysis is introduced to filter and decompose the monthly power consumption series into sub-series. Then, a neural network model is used to predict each sub-series. Finally, the predicted power consumption is reconstructed. However, this method only uses the power consumption series to forecast, and it ignores many influencing factors of monthly power consumption. In reference [23], a conditional probability density forecasting method of residential load based on a deep hybrid network is proposed, and an end-to-end probabilistic residential load forecasting composite model composed of a convolutional neural network and a gated recurrent unit is designed. However, neural network algorithms converge slowly and easily fall into local minima, which can cause network training to fail. In reference [24], a new online integrated learning method is proposed; this method combines batch learning with online learning for load forecasting to ensure the adaptability of online application. In reference [25], subspace clustering is used to analyze the power consumption-related factors of different types of users, and the random forest algorithm is then used for prediction, with good prediction results.

1.2 Motivation

The traditional forecasting method of monthly electricity consumption cannot fully consider the various factors affecting monthly electricity consumption and is sensitive to outliers and noise. The data-driven methods also have some problems, such as a tendency to overfit and slow convergence. The random forest algorithm is applicable to many kinds of data sets and has the advantages of resisting overfitting, being insensitive to outliers and noise [10], handling many input variables, and converging quickly.

On the basis of the abovementioned discussion, this paper proposes a random forest prediction method based on the maximum mutual information coefficient. The maximum mutual information coefficient is used to identify the correlation between the monthly electricity consumption and its influencing factors, screen out the strong correlation factors, simplify the input variables of prediction, and use the random forest algorithm for prediction. Finally, the monthly electricity consumption and socio-economic variable data of Shenyang City in Liaoning Province from 2005 to 2014 are taken as the training set, and the monthly electricity consumption data of Shenyang City in 2015 are used as the verification set. The effectiveness of the studied prediction method of monthly electricity consumption is proven by an example.

2 Problem description of forecast of monthly electricity consumption

In the forecast of monthly electricity consumption, we need to consider the historical data of monthly electricity consumption together with the factors related to electricity consumption: GDP, the total fixed assets investment (electricity, heat, gas, water production and supply, transportation, storage, and postal industry), the total number of hotel accommodations, the total registered population, tap water sales, natural gas sales, the production of industrial steel above designated size, crude oil output of industries above designated size, added value of industries above designated size, sales volume of wholesale and retail trade enterprises above designated size, and total value of import and export. However, considering more factors does not necessarily improve the prediction of monthly electricity consumption. Thus, the strong correlation factors of monthly electricity consumption should be identified. At the same time, because the forecast operates on a data set that contains many variables, outliers and noise will be present. Thus, an appropriate algorithm for the forecast of monthly electricity consumption should be selected. The problem description of the prediction of monthly electricity consumption is shown in Fig. 1.

3 Strategy structure of random forest forecasting method of monthly electricity consumption considering mutual information

The main contents of the prediction method of monthly electricity consumption based on the maximum mutual information coefficient are as follows. The maximum mutual information coefficient is used to analyze the correlation between the monthly electricity consumption and the potential correlation factors, and the strong correlation factors are screened out. The training sample set is constructed on the basis of the data of monthly electricity consumption and its strong correlation factors. After the parameters of the decision tree are optimized, the random forest algorithm is used to predict the monthly electricity consumption of the whole society. The specific implementation strategy of the method is shown in Fig. 2.

3.1 Identification of related factors

Monthly electricity consumption correlates to different degrees with many factors. The maximum mutual information coefficient is used to analyze and rank the relevant influencing factors of monthly electricity consumption, and the factors that have a greater effect on, and a stronger correlation with, the monthly electricity consumption are selected [26]. The factors that have a lower correlation with the monthly electricity consumption are eliminated, which reduces the number of inputs and the complexity of modeling and improves the prediction accuracy.

The maximum information coefficient is based on information and mutual information theories [27], and it can better measure the linear and nonlinear relationship between variables by dividing the data interval with grid [28]. The maximum mutual information coefficient is a standard to determine the correlation between two variables.

$X$ and $Y$ are the monthly electricity consumption and related factors of the whole society in data set $D$, where $X = \left\{ {x_{i} ,\quad i = 1,2, \ldots } \right\}$, $Y = \left\{ {y_{j} ,\quad j = 1,2, \ldots } \right\}$. The mutual information between $X$ and $Y$ is defined as

$$ {\text{MI}}\left( {X;Y} \right) = \sum\limits_{y \in Y} {\sum\limits_{x \in X} {p\left( {x,y} \right)} } \log \frac{{p\left( {x,y} \right)}}{p\left( x \right)p\left( y \right)}, $$

(1)

where $p\left( {x,y} \right)$ is the joint probability density of $X$ and $Y$; $p\left( x \right)$ and $p\left( y \right)$ are the marginal probability densities of $X$ and $Y$, respectively.
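
As a minimal sketch of Eq. (1), the mutual information of two already discretized sequences can be estimated from empirical frequencies. The function name `mutual_information` is an illustrative assumption, not part of the original program, and the natural logarithm is assumed for the base of the logarithm.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) of two equal-length
    sequences of discrete labels, following Eq. (1)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts used to estimate p(x, y)
    px = Counter(xs)              # counts used to estimate p(x)
    py = Counter(ys)              # counts used to estimate p(y)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) p(y))) written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi
```

Two identical binary sequences with balanced classes give $\log 2$, and two independent ones give 0, which matches the interpretation of mutual information as shared information.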

All values of monthly electricity consumption $X$ and related factor $Y$ in data set $D$ are divided into $a$ and $b$ intervals, respectively; such a grid division is called an $a \times b$ grid and is recorded as $R = \left( {a,b} \right)$. Many grid partitions are possible for the same $a \times b$, and data set $D$ has different distributions under different partitions. If the maximum value of ${\text{MI}}\left( {X;Y} \right)$ over the different partitions is taken as the mutual information value of grid $R$, then the maximum mutual information can be defined as

$$ {\text{MI}}_{D|R}^{\max } \left( {X;Y} \right) = \mathop {\max }\limits_{{R = \left( {a,b} \right)}} {\text{MI}}_{D|R} \left( {X;Y} \right), $$

(2)

where $D|R$ is the partition of data set $D$ under grid $R$.

The maximum mutual information is then normalized by the grid size:

$$ \overline{{\text{MI}}}_{D|R} \left( {X;Y} \right) = \frac{{{\text{MI}}_{D|R}^{\max } \left( {X;Y} \right)}}{{\log \min \left( {a,b} \right)}} $$

(3)

The maximum information coefficient of $X$ and $Y$ is defined as

$$ {\text{MIC}}\left( D \right) = \mathop {\max }\limits_{ab < B\left( n \right)} \left\{ {\overline{{\text{MI}}}_{D|R} \left( {X;Y} \right)} \right\}, $$

(4)

where $ab < B\left( n \right)$ is the upper bound of mesh generation; $n = \max \left( {i,j} \right)$, $B\left( n \right) = n^{0.6}$.
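
The grid search described by Eqs. (2) to (4) can be illustrated with a deliberately simplified sketch. It assumes equal-width bins for every $a \times b$ grid instead of searching over all grid placements, so it only gives a lower bound on the true maximum information coefficient. The names `mic_equal_width`, `_bin`, and `_mi` are hypothetical helpers and are not taken from the original program.

```python
import math
from collections import Counter


def _bin(values, k):
    """Assign each value to one of k equal-width bins (indices 0..k-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]


def _mi(xs, ys):
    """Empirical mutual information (in nats) of two label sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def mic_equal_width(x, y, exponent=0.6):
    """Simplified MIC: maximum normalized MI over a x b grids with
    a * b < n ** exponent, using equal-width bins only (assumption)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    bound = n ** exponent          # B(n) = n^0.6 by default
    best = 0.0                     # stays 0.0 if no grid satisfies a*b < B(n)
    for a in range(2, int(bound) + 1):
        for b in range(2, int(bound) + 1):
            if a * b >= bound:
                break
            score = _mi(_bin(x, a), _bin(y, b)) / math.log(min(a, b))
            best = max(best, score)
    return best
```

For a perfectly linear relationship the sketch returns a value close to 1, and for a constant factor it returns 0, which is the behavior expected of the coefficient. A full implementation would also optimize the bin boundaries for each grid size.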

The correlation criteria of maximum mutual information coefficient analysis are shown in Table 1.

Table 1 Relationship between correlation coefficient and degree

The sample set $D$ is composed of monthly electricity consumption $X_{i}$ and related factors $Y_{j}$.

The detailed steps of maximum mutual information coefficient are given in Algorithm 1.

3.2 Power consumption prediction modeling based on random forest

3.2.1 Random selection of training sample subset

The random selection of training sample subsets is realized by the Bootstrap method [29]. Rather than dividing the original data into separate data sets, the Bootstrap method forms different data sets by repeatedly drawing samples from the original data set with replacement. Each Bootstrap data set is the same size as the original data set. Specifically, if the size of the original data set is $N$, then $N$ samples are drawn from it with replacement, and the size of the resulting Bootstrap data set is also $N$. An observation may appear many times in a Bootstrap sample, or it may not appear at all.
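
The resampling step can be sketched as follows. This is an illustration under the assumption that the training set is held in a Python list; the function name `bootstrap_sample` and the returned list of unused observations are illustrative additions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) observations from data with replacement.

    Returns the bootstrap sample and the observations that were
    never drawn (they do not appear in the sample at all).
    """
    n = len(data)
    indices = [rng.randrange(n) for _ in range(n)]
    drawn = set(indices)
    sample = [data[i] for i in indices]
    unused = [data[i] for i in range(n) if i not in drawn]
    return sample, unused
```

Calling `bootstrap_sample(rows, random.Random(0))` once per tree yields the $w$ subsets; passing a seeded generator keeps the experiment reproducible.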

With the original training sample set $S_{h}$ as the input, $S_{h}$ is composed of the power consumption of the whole society and its potential related factors, including GDP, investment in fixed assets of the whole society, number of star hotels and accommodation, population, sales of tap water, sales of natural gas, steel production, industrial added value, total import and export value, and automobile production. The resampling of $S_{h}$ is conducted, and its working process is shown in Fig. 3.

Using the Bootstrap sampling method, we randomly select $w$ training sample subsets $S_{h1} ,S_{h2} , \ldots ,S_{hw}$ (each subset contains the abovementioned two types of data) from $S_{h}$ to construct $w$ classification and regression trees (CART). The test set is used to estimate the error of each CART decision tree. By averaging the error estimates of the $w$ decision trees, the generalization error estimate of the random forest can be obtained, and the accuracy of the prediction model of power consumption can be quantitatively measured.

3.2.2 Construction of CART decision tree

Random forest is a combination of multiple decision trees. The prediction results of the individual decision trees are put to a vote, and the result with the most votes is regarded as the final random forest prediction result.

CART algorithm constructs binary decision tree [30]. The CART algorithm selects the features by Gini coefficient when constructing decision tree. The principle is as follows:

If the sample data are divided into $K$ classes and the probability that the data belong to class $k$ is $p_{k}$, then the Gini coefficient of the sample data is defined as

$$ {\text{Gini}}\left( p \right) = \sum\limits_{k = 1}^{K} {p_{k} \left( {1 - p_{k} } \right)} . $$

(5)

For the training sample set $D$, the number of samples is $\left| D \right|$. Assuming $K$ categories, with $\left| {C_{k} } \right|$ samples in category $k$, the Gini coefficient is as follows:

$$ {\text{Gini}}\left( D \right) = \sum\limits_{k = 1}^{K} {\frac{{\left| {C_{k} } \right|}}{\left| D \right|}} \left( {1 - \frac{{\left| {C_{k} } \right|}}{\left| D \right|}} \right). $$

(6)

For the training sample set $D$, the number of samples is $\left| D \right|$. According to the value of feature $A_{m}$, $D$ can be divided into two parts $D_{1}$ and $D_{2}$, which contain $\left| {D_{1} } \right|$ and $\left| {D_{2} } \right|$ samples, respectively. Under the condition of feature $A_{m}$, the Gini coefficient of $D$ is as follows:

$$ {\text{Gini}}\left( {D,A_{m} } \right) = \frac{{\left| {D_{1} } \right|}}{\left| D \right|}{\text{Gini}}\left( {D_{1} } \right) + \frac{{\left| {D_{2} } \right|}}{\left| D \right|}{\text{Gini}}\left( {D_{2} } \right). $$

(7)

The Gini coefficient ${\text{Gini}}\left( D \right)$ represents the uncertainty of training sample set $D$, and ${\text{Gini}}\left( {D,A_{m} } \right)$ represents the uncertainty of $D$ after it is partitioned according to feature $A_{m}$. The smaller ${\text{Gini}}\left( {D,A_{m} } \right)$ is, the purer the resulting subsets are. Therefore, the Gini coefficient can represent the ability of feature $A_{m}$ to classify data set $D$.
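
Equations (6) and (7) can be illustrated with a short sketch for a single numeric feature. The function names are hypothetical, and the choice of observed feature values as candidate thresholds (with the rule "value <= threshold goes left") is an assumption made for brevity.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a list of class labels, Eq. (6)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Weighted Gini coefficient after splitting on one feature, Eq. (7)."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Threshold with the smallest weighted Gini coefficient,
    or None if the feature takes a single value."""
    # The largest value is skipped because it would leave the right part empty.
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None
    return min(candidates, key=lambda t: gini_split(values, labels, t))
```

For the values `[1, 2, 3, 4]` with labels `["lo", "lo", "hi", "hi"]`, the threshold 2 separates the classes perfectly, so its weighted Gini coefficient is 0 and it is selected.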

With the training sample subset $S_{hj}$ as an example, $S_{hj}$ is composed of feature set $Y$ ($Y$ is composed of GDP, investment in fixed assets of the whole society, number of star-rated hotels and accommodation, population, sales of tap water, sales of natural gas, steel production, industrial added value, total import and export value, and automobile production). For data set $X$ of monthly electricity consumption of the whole society, the threshold value of node sample number is $\delta$, and the threshold value of Gini coefficient is $\varepsilon$. CART binary decision tree is the output, and the construction process of decision tree is shown in Fig. 4.

First, the Gini coefficients of all the possible segmentation points of the feature set $Y$ in the training sample subset $S_{hj}$ are calculated. Then, whether the Gini coefficients of the samples are less than the given threshold $\varepsilon$ is determined. If they are all less than the given threshold $\varepsilon$, then a single node tree is generated, whose category is the class with the largest number of samples in $S_{hj}$; otherwise, the feature with the smallest Gini coefficient and the corresponding segmentation point $\alpha$ are selected as the eigenvalue and segmentation standard of the root node. $S_{hj}$ is divided into two subsets $S_{hj1}$ and $S_{hj2}$, and $S_{hj1}$ and $S_{hj2}$ are allocated to the two sub nodes, respectively. Next, whether the number of samples of the sub nodes is less than the given threshold $\delta$ is judged. Whether the Gini coefficients of sub node samples are less than $\varepsilon$ is also determined. If it is true, then the child node is a leaf node. If the two child nodes are leaf nodes, then the decision tree is generated; otherwise, for the non-leaf node, $S_{hj}$ is equal to the corresponding data set of the child node, and the feature with the smallest Gini coefficient is removed. The Gini coefficients of all possible segmentation points of $S_{hj}$ are recalculated.

The CART algorithm is used to generate a decision tree for each subset of training samples based on the principle of minimum Gini coefficient, and the $w$ decision trees together form a “forest.” Half of the strong correlation factors are randomly selected to participate in the node splitting process of each decision tree to ensure the randomness of decision tree construction and avoid overfitting. In addition, the number of decision trees in the whole random forest should be adjusted according to the prediction results.

The detailed steps of generating CART decision tree are given in Algorithm 2.

3.3 Voting of forecast results of electricity consumption

The final output of the prediction model based on random forest algorithm is generated by voting:

$$ F_{h} \left( X \right) = \arg \mathop {\max }\limits_{Y} \sum\limits_{i = 1}^{w} {I\left( {f_{hi} \left( X \right) = Y} \right)} , $$

(8)

where $F_{h}$ is the prediction model of monthly electricity consumption; $f_{hi}$ is the prediction model of a single decision tree; $I\left( \cdot \right)$ is the indicator function.

For the same data set, when $w$ CART decision trees are constructed, $w$ prediction results will be obtained. At this time, we need to vote and select the prediction result with the highest number of votes. The sample data are tested for simulation, the data $X$ of correlation factors related to power consumption $Y$ are taken as the input, the prediction result series $\left\{ {f_{h1} \left( X \right),f_{h2} \left( X \right), \ldots ,f_{hw} \left( X \right)} \right\}$ of each decision tree model are obtained, and voting is conducted to derive the final prediction result of power consumption. The process is shown in Fig. 5.
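
The voting rule of Eq. (8) can be sketched in a few lines, assuming the outputs of the $w$ trees are collected in a list of class labels. The function name `majority_vote` is illustrative.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees, Eq. (8).

    Ties are resolved in favor of the label that appears first.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    return Counter(predictions).most_common(1)[0][0]
```

For example, if three trees output the growth-rate classes 3.78, 2.10, and 3.78, the forest outputs 3.78.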

4 Example analysis

4.1 Data source

The monthly electricity consumption of the whole society, of agriculture, forestry, animal husbandry and fishery, of industry, of finance, real estate, business and residential services, and of urban and rural residents’ life, together with the potential related factors of the abovementioned monthly electricity consumption, are all from the Huibo database, with the time span from January 2005 to December 2015. The aforementioned industries cover 6 categories: transportation, storage, postal, commerce, accommodation, and catering. A total of 14 potential related factors are also considered: GDP of Shenyang, GDP of the secondary industry of Shenyang, investment in fixed assets of Shenyang (production and supply of electricity, heat, gas, and water), investment in fixed assets of Shenyang (transportation, storage, and postal industry), the total number of accommodation and reception in Shenyang star hotels, the total registered population of Shenyang, the volume of tap water sold in Shenyang, the sales of natural gas in Shenyang, the output of industrial steel above designated size in Shenyang, the industrial crude oil production above designated size in Shenyang, the industrial added value above designated size in Shenyang, the sales volume of wholesale and retail trade enterprises above the quota of Shenyang City, Shenyang’s total import and export value, and Shenyang’s industrial automobile output above designated size, as shown in Table 2.

Table 2 Data source

The GDP data among the potential related factors are quarterly, but this study requires monthly data. Thus, each quarterly GDP value is distributed evenly across the months of that quarter.
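
The even distribution of quarterly GDP can be sketched as follows, under the assumption that each month receives one third of its quarter's total. The function name `quarterly_to_monthly` is illustrative.

```python
def quarterly_to_monthly(quarterly_values):
    """Spread each quarterly total evenly over its three months."""
    monthly = []
    for q in quarterly_values:
        monthly.extend([q / 3.0] * 3)
    return monthly
```

With this rule, quarterly totals of 300 and 600 become three monthly values of 100 followed by three monthly values of 200, so the quarterly sums are preserved.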

4.2 Identification of power consumption-related factors

The monthly power consumption data are taken as the explanatory variable $X$, and the matrix $X = \left\{ {X_{1} ,X_{2} ,X_{3} ,X_{4} ,X_{5} } \right\}$ is set. Among them, $X_{1} ,X_{2} ,X_{3} ,X_{4} ,X_{5}$ represent the monthly power consumption of agriculture, forestry, animal husbandry, and fishery in Shenyang, the monthly energy consumption of Shenyang’s industry, the monthly power consumption of finance, real estate, business, and residential service industry, the monthly consumption of urban and rural residents in Shenyang, and the monthly power consumption of the whole society of Shenyang.

The monthly data of potential related factors are taken as conditional variable $Y$, and $Y = \left\{ {Y_{1} ,Y_{2} , \ldots ,Y_{14} } \right\}$ is set. Among them, $Y_{1}$ is Shenyang GDP, $Y_{2}$ is the second industry GDP of Shenyang, $Y_{3}$ is the fixed assets investment (production and supply of electricity, heat, gas, and water) in Shenyang, $Y_{4}$ is the fixed assets investment (transportation, storage, and postal industry) of Shenyang, $Y_{5}$ is the total number of hotel accommodation and reception in Shenyang City, $Y_{6}$ is the total population of Shenyang household registration, $Y_{7}$ is the sales volume of tap water in Shenyang, $Y_{8}$ is the sales volume of natural gas in Shenyang, $Y_{9}$ is the industrial steel production above Shenyang scale, $Y_{10}$ is the industrial crude oil production above Shenyang scale, $Y_{11}$ is the industrial added value above Shenyang scale, $Y_{12}$ is the sales volume of wholesale and retail trade enterprises above the quota of Shenyang, $Y_{13}$ is the total import and export value of Shenyang, and $Y_{14}$ is the output of industrial vehicles above Shenyang.

The maximum mutual information coefficient of explanatory and conditional variables is analyzed by Python, and the maximum mutual information coefficient is obtained. Thus, a correlation coefficient table, as shown in Table 3, is formed.

Table 3 Maximum mutual information coefficient results

Python’s Seaborn and Matplotlib packages are used to analyze the maximum mutual information coefficient data in Table 3 for more intuitively identifying the strong correlation factors of monthly electricity consumption. The data are displayed in the form of a heat map, as shown in Fig. 6.

Figure 6 shows that the maximum mutual information coefficients between $X_{1}$ and $Y_{1}$, $Y_{2}$, $Y_{3}$, $Y_{4}$, $Y_{5}$, $Y_{12}$ are relatively large, indicating strong correlation. The maximum mutual information coefficients between the industrial monthly electricity consumption $X_{2}$ in Shenyang and $Y_{6}$, $Y_{7}$, $Y_{9}$, $Y_{11}$, $Y_{13}$, $Y_{14}$ are relatively large, indicating strong correlation. The maximum mutual information coefficients between $X_{3}$ and $Y_{6}$, $Y_{7}$, $Y_{8}$, $Y_{11}$, $Y_{13}$, $Y_{14}$ are relatively large, and the correlation is strong. The maximum mutual information coefficients between $X_{4}$ and $Y_{6}$, $Y_{7}$, $Y_{8}$, $Y_{9}$, $Y_{11}$, $Y_{13}$, $Y_{14}$ are relatively large, and the correlation is strong. The maximum mutual information coefficients between the monthly electricity consumption of the whole society in Shenyang $X_{5}$ and $Y_{1}$, $Y_{2}$, $Y_{6}$, $Y_{7}$, $Y_{8}$, $Y_{9}$, $Y_{11}$, $Y_{13}$, $Y_{14}$ are relatively large, and the correlation is strong. That is, besides $Y_{1}$ and $Y_{2}$, $X_{5}$ is strongly correlated with the total registered population of Shenyang $Y_{6}$, the volume of Shenyang tap water sales $Y_{7}$, the sales volume of Shenyang natural gas $Y_{8}$, the industrial steel output above designated size in Shenyang $Y_{9}$, the industrial added value above designated size in Shenyang $Y_{11}$, the total import and export value of Shenyang $Y_{13}$, and the output of industrial vehicles above designated size $Y_{14}$.

4.3 Forecast of monthly electricity consumption

The monthly electricity consumption of Shenyang from January to December is forecast. With the monthly data of the 9 strong correlation factors from 2005 to 2014 as the input and the monthly power consumption of Shenyang as the output, 12 original training sets, one for each month from January to December, are formed. Then, the random forest algorithm is used to predict the monthly power consumption. If the raw monthly data are used directly for the random forest forecast, the relationship between the data cannot be determined well, which greatly reduces the prediction accuracy. This study therefore transforms the strong correlation factors and the monthly electricity consumption data into monthly year-on-year growth rates, which serve as the input and output of the forecast, to ensure its accuracy and stability. The monthly year-on-year growth rate $R_{m,n}$ is calculated as follows:

$$ R_{m,n} = \frac{{d_{m,n} - d_{m - 1,n} }}{{d_{m - 1,n} }} \times 100\% , $$

(9)

where $d_{m,n}$ represents the monthly data of the $n$th month of the $m$th year.
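
Equation (9) and its inverse, which converts a predicted growth rate back into a consumption value using the same month of the previous year as the benchmark, can be sketched as follows. Both function names are illustrative assumptions.

```python
def yoy_growth_rate(current, previous):
    """Year-on-year growth rate in percent, Eq. (9):
    current = d_{m,n}, previous = d_{m-1,n}."""
    if previous == 0:
        raise ZeroDivisionError("previous-year value must be non-zero")
    return (current - previous) / previous * 100.0


def from_growth_rate(previous, rate_percent):
    """Invert Eq. (9): recover d_{m,n} from d_{m-1,n} and the growth rate."""
    return previous * (1.0 + rate_percent / 100.0)
```

For instance, a value of 110 following a value of 100 in the same month of the previous year corresponds to a growth rate of 10%, and applying a 10% rate to 100 recovers 110.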

Python’s train test split function is used to divide the data into training and test sets. The random forest classifier is called with the bootstrap parameter set to true. $w$ training sample subsets are drawn with replacement from the training set, and $w$ decision trees are generated from these training sample subsets. The test set is used to estimate the error of the random forest prediction model. When each decision tree is generated, half of the strong correlation factors are randomly selected as random characteristic variables to participate in the node splitting process.

Given different numbers of decision trees, the prediction accuracy of the random forest algorithm will be different. The mean absolute percentage error (MAPE) is used to calculate the error value, and the formula is as follows:

$$ {\text{MAPE}} = \frac{1}{n}\sum\limits_{t = 1}^{n} {\left| {\frac{{y_{f\left( t \right)} - y_{a\left( t \right)} }}{{y_{a\left( t \right)} }}} \right|} \times 100\% , $$

(10)

where $y_{f\left( t \right)}$ is the predicted value, and $y_{a\left( t \right)}$ is the actual value.
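
Equation (10) translates directly into code. The sketch below assumes that the predicted and actual values are given as equal-length lists and that no actual value is zero; the function name `mape` is illustrative.

```python
def mape(predicted, actual):
    """Mean absolute percentage error in percent, Eq. (10)."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((f - a) / a) for f, a in zip(predicted, actual))
    return total / len(actual) * 100.0
```

Predictions of 110 and 90 against actual values of 100 and 100 each miss by 10%, so the MAPE is 10%.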

With August as an example, Fig. 7 shows the MAPE between the predicted value and the actual value of the monthly electricity consumption forecast based on the random forest algorithm for different numbers of decision trees.

As shown in the figure, MAPE tends to a stable value as the number of decision trees increases. However, when more decision trees are used, the amount of calculation increases rapidly and the prediction time becomes longer. A total of 150 decision trees are selected to form the random forest to balance modeling speed against prediction error.

The importance of each strong correlation factor to the prediction model differs. We use the mean decrease in accuracy to determine intuitively the importance of each strong correlation factor to the prediction model: after a strong correlation factor is removed, the prediction accuracy declines, and a larger decline means that the factor is more important to the prediction model. The importance of the nine strong correlation factors is shown in Fig. 8. As shown in the figure, $Y_{6}$, $Y_{8}$, $Y_{11}$, and $Y_{14}$ are of high importance, which is consistent with the analysis results in Fig. 6.

With the 73rd decision tree in the monthly electricity consumption forecast of random forest in August 2015 as an example, the working process of CART decision tree is analyzed. The 73rd CART decision tree is shown in Fig. 9.

In the figure, $Y_{1}$, $Y_{6}$, $Y_{7}$, $Y_{8}$, and $Y_{13}$ are the strong correlation factors of Shenyang’s social monthly electricity consumption in August 2015; ${\text{gini}}$ is Gini coefficient, which is used for purity measurement. If all the training samples contained in a node are of the same category, then node is pure (${\text{gini}} = 0$); ${\text{samples}}$ represents how many training sample instances the current node is applied to; ${\text{value}}$ represents the number of samples for each category in the current node; ${\text{class}}$ is the classification result.

The specific prediction process of the CART decision tree is as follows. Starting from the root node, whether $Y_{2}$ is less than or equal to 5.1 is determined; if yes, then the process moves to the left child node; otherwise, it moves to the right child node. In this power consumption prediction, because $Y_{2} = 0.33 < 5.1$, the process moves to the left child node. Then, because $Y_{2} = 0.33 < 4.215$, it again moves to the left child node. Finally, because $Y_{7} = 4.6 > 3.35$, it moves to the right child node. This node is a leaf with ${\text{gini}} = 0$, and it determines that the year-on-year growth rate of power consumption in August 2015 is the same as that in August 2014. The year-on-year growth rate of power consumption in August 2014, 3.78, is therefore taken as the forecast value of the year-on-year growth rate for August 2015 and is converted into a monthly power consumption of 2634.47 million kWh.

4.4 Analysis of prediction results

The random forest algorithm is used for prediction with the 2015 data of the strong correlation factors. The model first predicts the monthly year-on-year growth rate of electricity consumption; the forecast of monthly electricity consumption is then obtained by applying this growth rate to the electricity consumption of the same month of the previous year. For comparison, the random forest algorithm is also trained on the monthly data of all factors and on the monthly year-on-year growth rate data of all factors. Forecast result 1 is that of the proposed method, which uses the monthly year-on-year growth rate data of the strong correlation factors as training samples. Forecast result 2 uses the monthly year-on-year growth rate as the training sample and considers all factors. Forecast result 3 uses monthly data as training samples and considers all factors.
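The conversion from a predicted growth rate to a consumption forecast can be sketched as below. The sketch assumes that the growth rate is expressed in percent relative to the same month of the previous year; the function names and the numbers in the example are hypothetical and are not taken from the Shenyang data.

```python
def consumption_from_growth(previous_year_value, growth_rate_percent):
    """Apply a year-on-year growth rate (in percent) to last year's consumption."""
    return previous_year_value * (1.0 + growth_rate_percent / 100.0)


def forecast_months(previous_year_monthly, growth_rates_percent):
    """Convert a list of predicted growth rates into consumption forecasts."""
    if len(previous_year_monthly) != len(growth_rates_percent):
        raise ValueError("one growth rate is needed for each baseline month")
    return [
        consumption_from_growth(base, rate)
        for base, rate in zip(previous_year_monthly, growth_rates_percent)
    ]


# Hypothetical baseline values and growth rates, for illustration only.
example_forecast = forecast_months([2000.0, 2500.0], [5.0, -2.0])
```

Predicting the growth rate rather than the consumption itself removes the level of the previous year from the target, which is one way to handle a series that has both a long-term trend and a seasonal pattern.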

Because support vector machines are widely used in monthly electricity consumption forecasting, the proposed method is also compared with a support vector machine to further verify its effectiveness. Forecast result 4 is obtained by the support vector machine with the monthly data of all factors as the training sample. The actual monthly electricity consumption and forecast results of Shenyang in 2015 are shown in Table 4 and Fig. 10.

Table 4 Comparison of prediction results


As shown in Fig. 10 and Table 4, when mutual information is not used to screen correlation factors, training on the monthly year-on-year growth rate gives higher prediction accuracy than training on the monthly data directly. When the monthly growth rate is used, training on the growth rate data of the strong correlation factors gives higher prediction accuracy than training on the growth rate data of all factors. Therefore, considering more factors does not necessarily give better results: too many factors with low correlation make the prediction worse. When the factors with high correlation are screened out by mutual information, the prediction error is significantly reduced. Compared with the support vector machine algorithm, the method proposed in this paper improves the MAPE index by 7.29%. The experimental results show that the method proposed in this paper has a better prediction effect.
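For reference, the MAPE index used in the comparison is the mean absolute percentage error. A minimal sketch of its computation follows; the function name and the sample values are assumptions and do not reproduce Table 4.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical values: errors of 10% and 5% average to a MAPE of 7.5%.
example_mape = mape([100.0, 200.0], [110.0, 190.0])
```

Because each error is divided by the actual value, months with different consumption levels contribute on the same relative scale.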

5 Conclusion and prospect

In this study, the maximum mutual information coefficient is introduced to identify the influencing factors of the monthly electricity consumption of the whole society in Shenyang, and the strong correlation factors of the monthly electricity consumption of the whole society are selected. The random forest algorithm is used to predict the monthly electricity consumption of the whole society with the strong correlation factors as the input. The predicted value of the monthly electricity consumption of the whole society is obtained. The effectiveness and correctness of the proposed method are verified by an example.

(1) The maximum mutual information coefficient of mutual information theory is used to quantitatively calculate the correlation between the influencing factors and the monthly electricity consumption of the whole society. With the maximum mutual information coefficient, the more effective correlation factors are screened out from the many influencing factors.
(2) The random forest algorithm is used for prediction. The bootstrap resampling and the random selection of features in the random forest algorithm reduce the risk of overfitting and make the algorithm suitable for a wide range of data sets. Combined with mutual information, the factors with low correlation to the whole society’s monthly electricity consumption are eliminated. As a result, the prediction accuracy is higher.
(3) The random forest forecast for each month is trained on the historical data of the same month. As a result, the prediction accuracy is improved.
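The same-month training strategy in item (3) can be sketched as a simple selection step before the model is fitted. The record layout, the function name, and the values below are assumptions for illustration; the actual data organization used in this paper is not specified here.

```python
def same_month_training_set(records, target_month):
    """Select the historical samples that belong to the month being forecast.

    Each record is assumed to be (year, month, factor_values, growth_rate).
    """
    X, y = [], []
    for year, month, factors, growth_rate in sorted(records, key=lambda r: (r[0], r[1])):
        if month == target_month:
            X.append(list(factors))
            y.append(growth_rate)
    return X, y


# Hypothetical history with two factors per month.
history = [
    (2013, 8, [1.2, 0.4], 3.1),
    (2013, 9, [0.9, 0.7], 2.5),
    (2014, 8, [1.5, 0.3], 3.8),
    (2014, 9, [1.1, 0.6], 2.9),
]

X_august, y_august = same_month_training_set(history, target_month=8)
```

Only the August rows are kept, so a model fitted on `X_august` and `y_august` sees the seasonal conditions of the month it is asked to predict.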

In addition to the factors mentioned in this article, monthly electricity consumption forecasts are also greatly affected by economic factors. Moreover, energy is consumed for cooling when the temperature is higher than the upper limit of the comfortable temperature range and for heating when it is lower than the lower limit. Thus, weather data will be introduced in future work to further improve the prediction accuracy.

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China (61773269), the Natural Science Foundation of Liaoning Province of China (2019-KF-03-08), the Program for Liaoning Excellent Talents in University (LR2019045), and the Program for Shenyang High Level Innovative Talents (RC190042).

Author information

Authors and Affiliations

Key Laboratory of Energy Saving and Controlling in Power System of Liaoning Province, Shenyang Institute of Engineering, Shenyang, 110136, China
Xinfu Pang, Changfeng Luan, Li Liu & Wei Liu
State Grid Yingkou Electric Power Supply Company, Yingkou, 115200, China
Yuancheng Zhu

Corresponding author

Correspondence to Xinfu Pang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pang, X., Luan, C., Liu, L. et al. Data-driven random forest forecasting method of monthly electricity consumption. Electr Eng 104, 2045–2059 (2022). https://doi.org/10.1007/s00202-021-01457-5

Received: 10 July 2021
Accepted: 15 November 2021
Published: 12 January 2022
Issue Date: August 2022
DOI: https://doi.org/10.1007/s00202-021-01457-5
