1 Introduction

1.1 Literature review

Prediction of monthly electricity consumption is important in the power system [1]. Accurate forecast of monthly electricity consumption is the premise for power departments to allocate power resources and power companies to make reasonable sales plans [2]. According to the study of the relationship between different socio-economic factors and electricity consumption, the monthly electricity consumption forecast can help power enterprises better understand the service demand of all walks of life and provide data support for the future development of power grid and the formulation of power demand-side response policies [3].

Traditional forecasting methods for monthly electricity consumption include time series methods [4, 5], exponential smoothing [6], ARIMA [7], gray models [8], and regression analysis [9]. These methods require the original data to be highly stable, whereas the curve of monthly electricity consumption is a typical nonstationary and nonlinear time series. Moreover, the traditional forecasting methods perform poorly on monthly electricity consumption series that exhibit the dual trend of growth and volatility. In reference [4], a prediction method of monthly electricity consumption based on the STL (Seasonal and Trend decomposition using Loess) model is proposed. The STL model is used to decompose the time series of electricity consumption, and the decomposed components are then predicted. However, the method ignores other factors, such as the economy. In recent years, quantile regression has been widely used abroad in the prediction of monthly electricity consumption; this method has the advantage of being insensitive to outliers [10]. In reference [11], a constrained quantile regression averaging (CQRA) method is proposed; this method creates an improved overall prediction from multiple individual probabilistic predictions. The parameter estimation problem of CQRA is formulated as a linear programming problem whose goal is to minimize the pinball loss. In reference [12], a new joint forecasting system is established; this system improves the online probabilistic forecasting of single loads, and the refining process is based on multiple quantile regression. However, the main disadvantage of quantile regression is that the computational complexity of the solution process leads to long prediction times [13]. In reference [14], a load residual forecasting method based on quantile regression is proposed; this method significantly improves the accuracy of load forecasting. However, the framework only considers the conditional distribution of load errors, and it ignores the relationship between multi-point load errors.

Data-driven forecasting methods of monthly electricity consumption mainly include artificial neural networks [15,16,17,18] and support vector machines [19,20,21]. These methods can handle nonlinearity and have self-learning ability. In reference [22], a medium-term load forecasting method based on singular spectrum analysis and a neural network is proposed. Singular spectrum analysis is introduced to filter and decompose the monthly power consumption series into sub-series. Then, a neural network model is used to predict each sub-series. Finally, the predicted power consumption is reconstructed. However, this method uses only the power consumption series to forecast, and it ignores many influencing factors of monthly power consumption. In reference [23], a conditional probability density forecasting method of residential load based on a deep hybrid network is proposed, and an end-to-end probabilistic residential load forecasting composite model composed of a convolutional neural network and a gated recurrent unit is designed. However, neural network algorithms converge slowly and are prone to falling into local minima, which can cause network training to fail. In reference [24], a new online ensemble learning method is proposed; this method combines batch learning with online learning for load forecasting to ensure adaptability in online applications. In reference [25], subspace clustering is used to analyze the power consumption-related factors of different types of users. Then, the random forest algorithm is used for prediction, and good prediction results are obtained.

1.2 Motivation

The traditional forecasting methods of monthly electricity consumption cannot fully consider the various factors affecting monthly electricity consumption and are sensitive to outliers and noise. The data-driven methods also have problems, such as a tendency to overfit and slow convergence. The random forest algorithm is applicable to many kinds of data sets and has the advantages of resisting overfitting, being insensitive to outliers and noise [10], accommodating many input variables, and converging quickly.

On the basis of the abovementioned discussion, this paper proposes a random forest prediction method based on the maximum mutual information coefficient. The maximum mutual information coefficient is used to identify the correlation between the monthly electricity consumption and its influencing factors and to screen out the strong correlation factors, which simplifies the input variables of the prediction; the random forest algorithm is then used for prediction. Finally, the monthly electricity consumption and socio-economic variable data of Shenyang City in Liaoning Province from 2005 to 2014 are taken as the training set, and the monthly electricity consumption data of Shenyang City in 2015 are used as the verification set. The effectiveness of the proposed prediction method of monthly electricity consumption is demonstrated by an example.

2 Problem description of forecast of monthly electricity consumption

In the forecast of monthly electricity consumption, we need to consider the historical data of monthly electricity consumption together with factors related to electricity consumption: GDP, the total fixed assets investment (production and supply of electricity, heat, gas, and water; transportation, storage, and postal industry), the total number of hotel accommodations, the total registered population, tap water sales, natural gas sales, the output of industrial steel above designated size, the output of industrial crude oil above designated size, the added value of industries above designated size, the sales volume of wholesale and retail trade enterprises above designated size, and the total value of import and export. However, considering more factors does not necessarily improve the prediction of monthly electricity consumption. Thus, the strong correlation factors of monthly electricity consumption should be identified. At the same time, because the forecast operates on a data set that contains many variables, outliers and noise will be present. Thus, an appropriate algorithm for the forecast of monthly electricity consumption should be selected. The problem description of prediction of monthly electricity consumption is shown in Fig. 1.

Fig. 1 Problem description of forecast of monthly electricity consumption

3 Strategy structure of random forest forecasting method of monthly electricity consumption considering mutual information

The main contents of the prediction method of monthly electricity consumption based on the maximum mutual information coefficient are as follows. The maximum mutual information coefficient is used to analyze the correlation between the monthly electricity consumption and the potential correlation factors, and the strong correlation factors are screened out. The training sample set is constructed on the basis of the data of monthly electricity consumption and its strong correlation factors. After the parameters of the decision tree are optimized, the random forest algorithm is used to predict the monthly electricity consumption of the whole society. The specific implementation strategy of the method is shown in Fig. 2.

Fig. 2 Strategy chart of forecasting method of monthly electricity consumption

3.1 Identification of related factors

Monthly electricity consumption correlates to different degrees with many factors. The maximum mutual information coefficient is used to analyze and rank the influencing factors of monthly electricity consumption, and the factors that have a greater effect on, and a stronger correlation with, the monthly electricity consumption are screened out [26]. The factors that have a lower correlation with the monthly electricity consumption are eliminated; this reduces the number of prediction inputs and the complexity of modeling and improves the prediction accuracy.

The maximum mutual information coefficient is based on information theory and mutual information theory [27], and it can measure both linear and nonlinear relationships between variables by partitioning the data range with a grid [28]. The maximum mutual information coefficient is a standard for determining the correlation between two variables.

\(X\) and \(Y\) are the monthly electricity consumption and related factors of the whole society in data set \(D\), where \(X = \left\{ {x_{i} ,\quad i = 1,2, \ldots } \right\}\), \(Y = \left\{ {y_{j} ,\quad j = 1,2, \ldots } \right\}\). The mutual information between \(X\) and \(Y\) is defined as

$$ {\text{MI}}\left( {X;Y} \right) = \sum\limits_{y \in Y} {\sum\limits_{x \in X} {p\left( {x,y} \right)} } \log \frac{{p\left( {x,y} \right)}}{{p\left( x \right)p\left( y \right)}}, $$
(1)

where \(p\left( {x,y} \right)\) is the joint probability density of \(X\) and \(Y\); \(p\left( x \right)\) and \(p\left( y \right)\) are the marginal probability densities of \(X\) and \(Y\), respectively.

All values of monthly electricity consumption \(X\) and related factor \(Y\) in data set \(D\) are divided into \(a\) and \(b\) intervals, respectively; such a grid division is called an \(a \times b\) grid and is recorded as \(R = \left( {a,b} \right)\). Many grid partitions are possible for the same \(a \times b\), and data set \(D\) has a different distribution under each partition. If the maximum value of \({\text{MI}}\left( {X;Y} \right)\) over the different partitions is taken as the mutual information value of grid \(R\), then the maximum mutual information can be defined as

$$ {\text{MI}}_{D|R}^{\max } \left( {X;Y} \right) = \mathop {\max }\limits_{{R = \left( {a,b} \right)}} {\text{MI}}_{D|R} \left( {X;Y} \right), $$
(2)

where \(D|R\) is the partition of data set \(D\) under grid \(R\).

Normalizing the maximum mutual information by \(\log \min \left( {a,b} \right)\) gives

$$ {\text{MI}}_{D|R} \left( {X;Y} \right) = \frac{{{\text{MI}}_{D|R}^{\max } \left( {X;Y} \right)}}{{\log \min \left( {a,b} \right)}}. $$
(3)

The maximum mutual information coefficient of \(X\) and \(Y\) is defined as

$$ {\text{MIC}}\left( D \right) = \mathop {\max }\limits_{ab < B\left( n \right)} \left\{ {{\text{MI}}_{D|R} \left( {X;Y} \right)} \right\}, $$
(4)

where \(B\left( n \right)\) is the upper bound on the grid size \(ab\); \(n = \max \left( {i,j} \right)\), and \(B\left( n \right) = n^{0.6}\).

The correlation criteria of maximum mutual information coefficient analysis are shown in Table 1.

Table 1 Relationship between correlation coefficient and degree

The sample set \(D\) is composed of monthly electricity consumption \(X_{i}\) and related factors \(Y_{j}\).

The detailed steps of maximum mutual information coefficient are given in Algorithm 1.

figure a
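As a rough illustration of Eqs. (1) to (4), the sketch below estimates a grid-based, normalized mutual information score in plain Python. It is a simplification and not the full search over grid partitions described above: each \(a \times b\) grid uses equal-frequency bins instead of optimizing the bin boundaries, and the function names (equal_freq_bins, mutual_information, mic_approx) are illustrative assumptions rather than the implementation used in this study.

```python
import math
from collections import Counter


def equal_freq_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 by rank (equal-frequency bins).

    Simplification: tied values are ordered by their position in the input.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(x_bins, y_bins):
    """Discrete mutual information of two bin-label sequences, as in Eq. (1)."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))
    px = Counter(x_bins)
    py = Counter(y_bins)
    mi = 0.0
    for (bx, by), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[bx] / n) * (py[by] / n)))
    return mi


def mic_approx(x, y, exponent=0.6):
    """Largest normalized MI over all grids with a * b < n ** exponent (Eq. 4)."""
    n = len(x)
    bound = n ** exponent
    best = 0.0
    a = 2
    while a * 2 < bound:
        b = 2
        while a * b < bound:
            mi = mutual_information(equal_freq_bins(x, a), equal_freq_bins(y, b))
            # Normalization by log(min(a, b)), as in Eq. (3).
            best = max(best, mi / math.log(min(a, b)))
            b += 1
        a += 1
    return best
```

Because the bin boundaries are fixed rather than optimized, the score from this sketch is a lower bound on what a full partition search would return for the same grids.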

3.2 Power consumption prediction modeling based on random forest

3.2.1 Random selection of training sample subset

The random selection of training sample subsets is realized by the Bootstrap method [29]. Instead of dividing the original data into separate data sets, the Bootstrap method forms different data sets by repeatedly drawing samples from the original data set with replacement. Each Bootstrap data set is the same size as the original data set. Specifically, if the size of the original data set is \(N\), then \(N\) samples are drawn from it with replacement, and the size of the resulting Bootstrap data set is also \(N\). An observation may appear many times in a Bootstrap sample, or it may not appear at all.

With the original training sample set \(S_{h}\) as the input, \(S_{h}\) is composed of the power consumption of the whole society and its potential related factors, including GDP, investment in fixed assets of the whole society, number of star hotels and accommodation, population, sales of tap water, sales of natural gas, steel production, industrial added value, total import and export value, and automobile production. The resampling of \(S_{h}\) is conducted, and its working process is shown in Fig. 3.

Fig. 3 Random selection of training sample subset

Using the Bootstrap sampling method, we randomly select \(w\) training sample subsets \(S_{h1} ,S_{h2} , \ldots ,S_{hw}\) (each subset contains the abovementioned two types of data) from \(S_{h}\) to construct \(w\) classification and regression trees (CART). The test set is used to estimate the error of each CART decision tree. By averaging the error estimates of the \(w\) decision trees, the generalization error estimate of the random forest can be obtained, and the accuracy of the prediction model of power consumption can be quantitatively measured.
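The resampling step can be sketched in a few lines of standard-library Python. The function name bootstrap_subsets and the fixed seed are assumptions made for illustration; the sketch also returns the observations that were never drawn, only to show that some observations may not appear in a given Bootstrap sample.

```python
import random


def bootstrap_subsets(samples, w, seed=0):
    """Draw w Bootstrap subsets, each the same size as `samples`, with replacement.

    Returns a list of (in_bag, not_drawn) pairs.
    """
    rng = random.Random(seed)
    n = len(samples)
    subsets = []
    for _ in range(w):
        picked = [rng.randrange(n) for _ in range(n)]  # N draws with replacement
        chosen = set(picked)
        in_bag = [samples[i] for i in picked]
        not_drawn = [samples[i] for i in range(n) if i not in chosen]
        subsets.append((in_bag, not_drawn))
    return subsets
```

Each in_bag list has exactly \(N\) entries, so repeated observations in it are balanced by observations that end up in not_drawn.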

3.2.2 Construction of CART decision tree

Random forest is a combination of multiple decision trees. The prediction results of the individual decision trees are put to a vote, and the result with the most votes is taken as the final random forest prediction.

CART algorithm constructs binary decision tree [30]. The CART algorithm selects the features by Gini coefficient when constructing decision tree. The principle is as follows:

If the sample data are divided into \(K\) classes and the probability that a sample belongs to class \(k\) is \(p_{k}\), then the Gini coefficient of the sample data is defined as

$$ {\text{Gini}}\left( p \right) = \sum\limits_{k = 1}^{K} {p_{k} \left( {1 - p_{k} } \right)} . $$
(5)

For the training sample set \(D\) with \(\left| D \right|\) samples, assume that there are \(K\) categories and that the number of samples in category \(k\) is \(\left| {C_{k} } \right|\). The Gini coefficient is then as follows:

$$ {\text{Gini}}\left( D \right) = \sum\limits_{k = 1}^{K} {\frac{{\left| {C_{k} } \right|}}{\left| D \right|}} \left( {1 - \frac{{\left| {C_{k} } \right|}}{\left| D \right|}} \right). $$
(6)

For the training sample set \(D\), the number is \(\left| D \right|\). According to the value of feature \(A_{m}\), it can be divided into two parts \(D_{1}\) and \(D_{2}\). The number of \(D_{1}\) and \(D_{2}\) is \(\left| {D_{1} } \right|\) and \(\left| {D_{2} } \right|\), respectively. Under the condition of feature \(A_{m}\), the Gini coefficient of \(D\) is as follows:

$$ {\text{Gini}}\left( {D,A_{m} } \right) = \frac{{\left| {D_{1} } \right|}}{\left| D \right|}{\text{Gini}}\left( {D_{1} } \right) + \frac{{\left| {D_{2} } \right|}}{\left| D \right|}{\text{Gini}}\left( {D_{2} } \right). $$
(7)

The Gini coefficient \({\text{Gini}}\left( D \right)\) represents the uncertainty of training sample set \(D\), and \({\text{Gini}}\left( {D,A_{m} } \right)\) represents the uncertainty of \(D\) after it is divided according to feature \(A_{m}\). Therefore, the Gini coefficient can represent the ability of feature \(A_{m}\) to classify data set \(D\): the smaller \({\text{Gini}}\left( {D,A_{m} } \right)\) is, the better the feature separates the classes.
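Equations (6) and (7) translate directly into code. The following minimal sketch assumes class labels are stored in a plain list and that a split is of the form "feature value less than or equal to a threshold"; the function names gini and gini_split are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a list of class labels, Eq. (6)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())


def gini_split(values, labels, threshold):
    """Weighted Gini coefficient after splitting on value <= threshold, Eq. (7)."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

For four samples with labels a, a, b, b and feature values 1, 2, 3, 4, the unsplit Gini coefficient is 0.5, while splitting at threshold 2 gives two pure subsets and a weighted Gini coefficient of 0.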

Take the training sample subset \(S_{hj}\) as an example. \(S_{hj}\) is composed of feature set \(Y\) (GDP, investment in fixed assets of the whole society, number of star-rated hotel accommodations, population, sales of tap water, sales of natural gas, steel production, industrial added value, total import and export value, and automobile production) and data set \(X\) of monthly electricity consumption of the whole society. The threshold value of the node sample number is \(\delta\), and the threshold value of the Gini coefficient is \(\varepsilon\). The output is a CART binary decision tree, and the construction process of the decision tree is shown in Fig. 4.

Fig. 4 Flowchart of decision tree construction

First, the Gini coefficients of all the possible segmentation points of the feature set \(Y\) in the training sample subset \(S_{hj}\) are calculated. Then, whether the Gini coefficients of the samples are less than the given threshold \(\varepsilon\) is determined. If they are all less than \(\varepsilon\), then a single-node tree is generated, whose category is the class with the largest number of samples in \(S_{hj}\); otherwise, the feature with the smallest Gini coefficient and the corresponding segmentation point \(\alpha\) are selected as the splitting feature and segmentation standard of the root node. \(S_{hj}\) is divided into two subsets \(S_{hj1}\) and \(S_{hj2}\), which are allocated to the two child nodes, respectively. Next, whether the number of samples of each child node is less than the given threshold \(\delta\) and whether the Gini coefficient of its samples is less than \(\varepsilon\) are determined. If so, then the child node is a leaf node. If both child nodes are leaf nodes, then the decision tree is complete; otherwise, for each non-leaf node, \(S_{hj}\) is set to the data set of that child node, the feature with the smallest Gini coefficient is removed, and the Gini coefficients of all possible segmentation points of \(S_{hj}\) are recalculated.

The CART algorithm is used to generate a decision tree for each subset of training samples based on the principle of minimum Gini coefficient, and the \(w\) decision trees are combined to form a “forest.” Half of the strong correlation factors are randomly selected to participate in the node splitting process of each decision tree to ensure the randomness of decision tree construction and to avoid overfitting. In addition, the number of decision trees in the random forest should be adjusted according to the prediction results.

The detailed steps of generating CART decision tree are given in Algorithm 2.

figure b
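The construction procedure can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 2: samples are dictionaries keyed by feature name, the defaults for delta and epsilon are arbitrary, every feature stays available at every node (the procedure above removes the feature just used), and the random selection of half of the features is omitted.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of a list of class labels."""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values()) if n else 0.0


def best_split(rows, labels, features):
    """Return (weighted_gini, feature, threshold) with the smallest weighted Gini."""
    best = None
    for f in features:
        # The largest distinct value cannot separate anything, so skip it.
        for t in sorted({row[f] for row in rows})[:-1]:
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels, features, delta=2, epsilon=1e-9):
    """Grow a binary tree; a node becomes a leaf when it has fewer than delta
    samples or its Gini coefficient is below epsilon."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(labels) < delta or gini(labels) < epsilon:
        return {"leaf": majority}
    split = best_split(rows, labels, features)
    if split is None:  # no feature takes two distinct values in this node
        return {"leaf": majority}
    _, f, t = split
    left = [(r, lab) for r, lab in zip(rows, labels) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left],
                           features, delta, epsilon),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right],
                            features, delta, epsilon),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]
```

Every split sends at least one sample to each side, so the recursion always terminates.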

3.3 Voting of forecast results of electricity consumption

The final output of the prediction model based on random forest algorithm is generated by voting:

$$ F_{h} \left( X \right) = \arg \mathop {\max }\limits_{Y} \sum\limits_{i = 1}^{w} {I\left( {f_{hi} \left( X \right) = Y} \right)} , $$
(8)

where \(F_{h}\) is the prediction model of monthly electricity consumption; \(f_{hi}\) is a single decision tree prediction model; \(I\left( \cdot \right)\) is the indicator function.

For the same data set, when \(w\) CART decision trees are constructed, \(w\) prediction results will be obtained. At this time, we need to vote and select the prediction result with the highest number of votes. The sample data are tested for simulation, the data \(X\) of correlation factors related to power consumption \(Y\) are taken as the input, the prediction result series \(\left\{ {f_{h1} \left( X \right),f_{h2} \left( X \right), \ldots ,f_{hw} \left( X \right)} \right\}\) of each decision tree model are obtained, and voting is conducted to derive the final prediction result of power consumption. The process is shown in Fig. 5.
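The vote of Eq. (8) amounts to counting identical predictions. In the minimal sketch below, each tree is assumed to be a callable that maps an input to a class label; the name forest_predict is illustrative, and ties are resolved in favor of the label that was predicted first.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the label predicted by the largest number of trees, as in Eq. (8)."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]
```

For example, if three trees return 3.78, 3.78, and 4.10 for the same input, the forest output is 3.78.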

Fig. 5 Forecasting modeling of monthly electricity consumption based on random forest

4 Example analysis

4.1 Data source

The monthly electricity consumption of the whole society; of agriculture, forestry, animal husbandry, and fishery; of industry; of finance, real estate, business, and residential services; and of urban and rural residents, together with the potential related factors of the abovementioned monthly electricity consumption, all come from the Huibo database, with a time span from January 2005 to December 2015. The aforementioned industries cover 6 categories: transportation, storage, postal, commerce, accommodation, and catering. A total of 14 potential related factors are also considered: GDP of Shenyang, GDP of the secondary industry of Shenyang, investment in fixed assets of Shenyang (production and supply of electricity, heat, gas, and water), investment in fixed assets of Shenyang (transportation, storage, and postal industry), the total number of accommodations and receptions in Shenyang star hotels, the total registered population of Shenyang, the volume of tap water sold in Shenyang, the sales of natural gas in Shenyang, the output of industrial steel above designated size in Shenyang, the output of industrial crude oil above designated size in Shenyang, the industrial added value above designated size in Shenyang, the sales volume of wholesale and retail trade enterprises above the quota in Shenyang, Shenyang’s total import and export value, and Shenyang’s industrial automobile output above designated size, as shown in Table 2.

Table 2 Data source

The GDP data among the potential related factors are quarterly, but this study requires monthly data. Thus, each quarterly GDP value is distributed evenly across the months of that quarter.
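A minimal sketch of this preprocessing step follows. It assumes that each quarterly figure is a total for the quarter, so an even distribution means dividing by three; the function name quarterly_to_monthly and the numbers in the example are hypothetical.

```python
def quarterly_to_monthly(quarterly_values):
    """Spread each quarterly total evenly over the three months of the quarter."""
    monthly = []
    for q in quarterly_values:
        monthly.extend([q / 3.0] * 3)
    return monthly
```

With the hypothetical quarterly values 300.0 and 600.0, the result is three months of 100.0 followed by three months of 200.0.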

4.2 Identification of power consumption-related factors

The monthly power consumption data are taken as the explanatory variable \(X\), and the matrix \(X = \left\{ {X_{1} ,X_{2} ,X_{3} ,X_{4} ,X_{5} } \right\}\) is set. Here, \(X_{1} ,X_{2} ,X_{3} ,X_{4} ,X_{5}\) represent, respectively, the monthly power consumption of agriculture, forestry, animal husbandry, and fishery in Shenyang; the monthly power consumption of industry in Shenyang; the monthly power consumption of the finance, real estate, business, and residential service industries; the monthly power consumption of urban and rural residents in Shenyang; and the monthly power consumption of the whole society of Shenyang.

The monthly data of potential related factors are taken as the conditional variable \(Y\), and \(Y = \left\{ {Y_{1} ,Y_{2} , \ldots ,Y_{14} } \right\}\) is set. Here, \(Y_{1}\) is the GDP of Shenyang, \(Y_{2}\) is the GDP of the secondary industry of Shenyang, \(Y_{3}\) is the fixed assets investment (production and supply of electricity, heat, gas, and water) in Shenyang, \(Y_{4}\) is the fixed assets investment (transportation, storage, and postal industry) in Shenyang, \(Y_{5}\) is the total number of accommodations and receptions in Shenyang star hotels, \(Y_{6}\) is the total registered population of Shenyang, \(Y_{7}\) is the sales volume of tap water in Shenyang, \(Y_{8}\) is the sales volume of natural gas in Shenyang, \(Y_{9}\) is the output of industrial steel above designated size in Shenyang, \(Y_{10}\) is the output of industrial crude oil above designated size in Shenyang, \(Y_{11}\) is the industrial added value above designated size in Shenyang, \(Y_{12}\) is the sales volume of wholesale and retail trade enterprises above the quota in Shenyang, \(Y_{13}\) is the total import and export value of Shenyang, and \(Y_{14}\) is the output of industrial automobiles above designated size in Shenyang.

The maximum mutual information coefficient between each explanatory variable and each conditional variable is computed in Python, and the results form the correlation coefficient table shown in Table 3.

Table 3 Maximum mutual information coefficient results

Python’s Seaborn and Matplotlib packages are used to analyze the maximum mutual information coefficient data in Table 3 for more intuitively identifying the strong correlation factors of monthly electricity consumption. The data are displayed in the form of a heat map, as shown in Fig. 6.

Fig. 6 Maximum mutual information coefficient results

Figure 6 shows that the maximum mutual information coefficients between \(X_{1}\) and \(Y_{1}\), \(Y_{2}\), \(Y_{3}\), \(Y_{4}\), \(Y_{5}\), \(Y_{12}\) are relatively large, indicating strong correlation. The maximum mutual information coefficients between the industrial monthly electricity consumption \(X_{2}\) in Shenyang and \(Y_{6}\), \(Y_{7}\), \(Y_{9}\), \(Y_{11}\), \(Y_{13}\), \(Y_{14}\) are relatively large, indicating strong correlation. The maximum mutual information coefficients between \(X_{3}\) and \(Y_{6}\), \(Y_{7}\), \(Y_{8}\), \(Y_{11}\), \(Y_{13}\), \(Y_{14}\) are relatively large, and the correlation is strong. The maximum mutual information coefficients between \(X_{4}\) and \(Y_{6}\), \(Y_{7}\), \(Y_{8}\), \(Y_{9}\), \(Y_{11}\), \(Y_{13}\), \(Y_{14}\) are relatively large, and the correlation is strong. The maximum mutual information coefficients between the monthly electricity consumption of the whole society \(X_{5}\) in Shenyang and \(Y_{6}\), \(Y_{7}\), \(Y_{8}\), \(Y_{9}\), \(Y_{11}\), \(Y_{13}\), \(Y_{14}\), \(Y_{1}\), and \(Y_{2}\) are relatively large, and the correlation is strong. In other words, in addition to \(Y_{1}\) and \(Y_{2}\), \(X_{5}\) is strongly correlated with the total registered population of Shenyang \(Y_{6}\), the sales volume of tap water in Shenyang \(Y_{7}\), the sales volume of natural gas in Shenyang \(Y_{8}\), the output of industrial steel above designated size in Shenyang \(Y_{9}\), the industrial added value above designated size in Shenyang \(Y_{11}\), the total import and export value of Shenyang \(Y_{13}\), and the output of industrial automobiles above designated size \(Y_{14}\).

4.3 Forecast of monthly electricity consumption

The monthly electricity consumption of Shenyang from January to December is forecast. With the monthly data of the 9 strong correlation factors from 2005 to 2014 as the input and the monthly power consumption of Shenyang as the output, 12 original training sets, one for each month from January to December, are formed. Then, the random forest algorithm is used to predict the monthly power consumption. If the monthly data are used directly in the random forest forecast, then the relationship between the data cannot be determined well, which greatly reduces the prediction accuracy. To ensure the accuracy and stability of the forecast, this study transforms the strong correlation factors and monthly electricity consumption data into monthly year-on-year growth rates, which serve as the input and output of the forecast. The monthly year-on-year growth rate \(R_{m,n}\) is calculated as follows:

$$ R_{m,n} = \frac{{d_{m,n} - d_{m - 1,n} }}{{d_{m - 1,n} }} \times 100\% , $$
(9)

where \(d_{m,n}\) represents the monthly data of month \(n\) of year \(m\).
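Equation (9) can be applied to a whole data series as sketched below. The nested-dictionary layout (year, then month, then value) and the function name yoy_growth are assumptions for illustration, and the numbers in the example are hypothetical.

```python
def yoy_growth(data):
    """Year-on-year growth rate in percent, Eq. (9).

    data[year][month] holds the monthly value; the result maps (year, month)
    to the growth rate relative to the same month of the previous year.
    """
    rates = {}
    for year, months in data.items():
        previous = data.get(year - 1)
        if previous is None:
            continue  # first year in the series has no reference value
        for month, value in months.items():
            base = previous.get(month)
            if base:  # skip missing or zero reference values
                rates[(year, month)] = (value - base) / base * 100.0
    return rates
```

For hypothetical August values of 200.0 in one year and 210.0 in the next, the growth rate for the second year is 5%.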

Python’s train_test_split function is used to divide the data into training and test sets. The random forest classifier is called with the bootstrap parameter set to true. \(w\) training sample subsets are drawn with replacement from the training set, and \(w\) decision trees are generated from these training sample subsets. The test set is used to estimate the error of the random forest prediction model. When each decision tree is generated, half of the strong correlation factors are randomly selected as random characteristic variables to participate in the node splitting process.

The prediction accuracy of the random forest algorithm varies with the number of decision trees. The mean absolute percentage error (MAPE) is used to measure the error, and the formula is as follows:

$$ {\text{MAPE}} = \frac{1}{n}\sum\limits_{t = 1}^{n} {\left| {\frac{{y_{f\left( t \right)} - y_{a\left( t \right)} }}{{y_{a\left( t \right)} }}} \right|} \times 100\% , $$
(10)

where \(y_{f\left( t \right)}\) is the predicted value, and \(y_{a\left( t \right)}\) is the actual value.
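A direct transcription of Eq. (10) is sketched below; the function name mape is illustrative, and actual values are assumed to be nonzero.

```python
def mape(predicted, actual):
    """Mean absolute percentage error in percent, Eq. (10)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((f - a) / a) for f, a in zip(predicted, actual))
    return total / len(actual) * 100.0
```

For instance, predictions of 110 and 90 against actual values of 100 and 100 give a MAPE of 10%.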

With August as an example, Fig. 7 shows the MAPE between the predicted and actual values of the monthly electricity consumption forecast based on the random forest algorithm for different numbers of decision trees.

Fig. 7 Error analysis of different numbers of decision trees

As shown in the figure, MAPE tends to a stable value as the number of decision trees increases. However, when more decision trees are used, the amount of calculation increases rapidly and the prediction time becomes longer. A total of 150 decision trees are therefore selected to form the random forest, balancing modeling speed against prediction error.

The importance of each strong correlation factor to the prediction model differs. We use the mean decrease in accuracy to determine the importance of each strong correlation factor intuitively: a strong correlation factor is removed, and the resulting decline in prediction accuracy is measured. The greater the decline in accuracy, the more important the factor is to the prediction model. The importance of the nine strong correlation factors is shown in Fig. 8. As shown in the figure, \(Y_{6}\), \(Y_{8}\), \(Y_{11}\), and \(Y_{14}\) are of high importance, which is consistent with the analysis results in Fig. 6.
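One way to express this procedure in code is sketched below. The callable evaluate, which is assumed to retrain the model on a given list of factors and return its accuracy, is a hypothetical placeholder, as is the function name accuracy_drop_importance; the sketch shows only the bookkeeping, not the model training used in this study.

```python
def accuracy_drop_importance(features, evaluate):
    """Importance of each factor as the accuracy lost when it is removed.

    evaluate(feature_list) is a user-supplied callable that returns the
    prediction accuracy obtained with exactly those factors as input.
    """
    baseline = evaluate(list(features))
    importance = {}
    for removed in features:
        remaining = [f for f in features if f != removed]
        importance[removed] = baseline - evaluate(remaining)
    return importance
```

Sorting the returned dictionary by value in descending order gives a ranking of the factors by importance.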

Fig. 8 Influence of different correlation factors on prediction accuracy

With the 73rd decision tree in the monthly electricity consumption forecast of random forest in August 2015 as an example, the working process of CART decision tree is analyzed. The 73rd CART decision tree is shown in Fig. 9.

Fig. 9 CART decision tree

In the figure, \(Y_{1}\), \(Y_{6}\), \(Y_{7}\), \(Y_{8}\), and \(Y_{13}\) are the strong correlation factors of Shenyang’s social monthly electricity consumption in August 2015; \({\text{gini}}\) is the Gini coefficient, which is used to measure purity: if all the training samples contained in a node are of the same category, then the node is pure (\({\text{gini}} = 0\)); \({\text{samples}}\) is the number of training samples that reach the current node; \({\text{value}}\) is the number of samples of each category in the current node; \({\text{class}}\) is the classification result.

The specific prediction process of the CART decision tree is as follows. Starting from the root node, whether \(Y_{2}\) is less than or equal to 5.1 is determined; if yes, then the process moves to the left child node; otherwise, it moves to the right child node. In the specific power consumption prediction process, because \(Y_{2} = 0.33 < 5.1\), the process moves to the left child node. Then, because \(Y_{2} = 0.33 < 4.215\), it moves to the left child node again. Finally, because \(Y_{7} = 4.6 > 3.35\), it moves to the right child node. Because \({\text{gini}} = 0\) at this node, the year-on-year growth rate of power consumption in August 2015 is determined to be the same as that in August 2014. The year-on-year growth rate of power consumption in August 2014, 3.78, is therefore taken as the forecast value of the year-on-year growth rate of power consumption in August 2015, and it is converted into a monthly power consumption of 2634.47 million kWh.

4.4 Analysis of prediction results

The random forest algorithm is used for prediction by using the data of strong correlation factors in 2015. On the basis of obtaining the monthly growth rate of electricity consumption, the monthly electricity consumption of the same period of the previous year is taken as the benchmark to obtain the monthly electricity consumption forecast value. At the same time, the monthly data considering all factors and the monthly growth rate data considering all factors are taken as the training samples, and the random forest algorithm is used for prediction and comparison with the proposed method. Forecast result 1 is that of the proposed method, and the training samples are the monthly growth rate data. Forecast result 2 uses the monthly year-on-year growth rate as the training sample and considers all factors. Forecast result 3 uses monthly data as training samples and considers all factors.
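Converting a forecast growth rate back into a consumption value is the inverse of Eq. (9). The short sketch below assumes that growth rates are expressed in percent and that the benchmark is the value of the same month in the previous year; the function name growth_to_consumption and the numbers in the example are hypothetical.

```python
def growth_to_consumption(base_values, growth_percents):
    """Invert Eq. (9): forecast = previous-year value * (1 + R / 100)."""
    if len(base_values) != len(growth_percents):
        raise ValueError("inputs must have the same length")
    return [b * (1.0 + r / 100.0) for b, r in zip(base_values, growth_percents)]
```

For a hypothetical benchmark of 2000.0 and a forecast growth rate of 5%, the forecast consumption is 2100.0.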

At the same time, because support vector machines are widely used in monthly electricity consumption forecasting, the support vector machine is compared with the proposed method to further verify its effectiveness. Forecast result 4 uses the monthly data considering all factors as the training sample and the support vector machine for prediction. The actual monthly electricity consumption and forecast results of Shenyang in 2015 are shown in Table 4 and Fig. 10.

Table 4 Comparison of prediction results

Fig. 10 Forecast results of monthly electricity consumption of Shenyang in 2015

As shown in Fig. 10 and Table 4, when mutual information is not used to screen correlation factors, the prediction accuracy of training samples using the monthly year-on-year growth rate is higher than that of training samples using monthly data directly. On the basis of using the monthly growth rate, the prediction accuracy of using the monthly growth rate data of strong correlation factors as the training sample is higher than that of using the monthly growth rate data of all factors as the training sample. Therefore, considering more correlation factors does not necessarily yield better results: too many factors with low correlation make the prediction result worse. When the factors with high correlation are screened out by mutual information, the prediction error is significantly reduced. Compared with the support vector machine algorithm, the method proposed in this paper improves the MAPE index by 7.29%. The experimental results show that the method proposed in this paper has a better prediction effect.

5 Conclusion and prospect

In this study, the maximum mutual information coefficient is introduced to identify the influencing factors of the monthly electricity consumption of the whole society in Shenyang, and the strong correlation factors of the monthly electricity consumption of the whole society are selected. The random forest algorithm is used to predict the monthly electricity consumption of the whole society with the strong correlation factors as the input. The predicted value of the monthly electricity consumption of the whole society is obtained. The effectiveness and correctness of the proposed method are verified by an example.

(1) The maximum mutual information coefficient of mutual information theory is used to quantitatively calculate the correlation between the influencing factors and the monthly electricity consumption of the whole society. With the maximum mutual information coefficient, more effective correlation factors are screened out among many influencing factors.

(2) The random forest algorithm is used for prediction. The bootstrap resampling of the random forest algorithm and the random selection of features enable the algorithm to avoid overfitting and make it suitable for many kinds of data sets. Combined with mutual information, the factors with low correlation to the whole society’s monthly electricity consumption are eliminated. As a result, the prediction accuracy is higher.

(3) The monthly forecast of random forest is conducted using the strategy of monthly forecast with the historical data of the same month as the training set. As a result, the prediction accuracy is improved.

In addition to the factors mentioned in this article, monthly electricity consumption is also greatly affected by weather factors. Energy is consumed for cooling or heating when the temperature is higher than the upper limit or lower than the lower limit of the comfortable temperature range. Thus, weather data will be introduced in future work to further improve the prediction accuracy.