1 Introduction

The rapid development of the economy and society produces huge amounts of information. When these data are collected, omissions are almost inevitable, whether because some values are temporarily inaccessible or because of carelessness during acquisition. These missing items, however, may carry important information about the data object. If data containing missing values are used directly for data mining or data analysis, the missing values confuse the mining process and lead to unreliable output, which can have a very serious impact on decision-making.

At present there are many filling algorithms; the commonly used ones include hot-deck filling, regression filling, KNN filling, and multiple imputation. Hot-deck filling mainly includes two steps: (1) partition the records that contain no missing values into categories related to the value to be filled; (2) from each category, select suitable complete records as donors for filling according to their correlation with the missing record [7]. Building on [7], literature [8] adds weights to the donor-selection calculation and proposes that the number of times each donor is used for missing value filling be proportional to its weight. Regression filling is a conditional-mean filling: a regression model between the missing values and the known values is built on the complete data set, and the missing value is estimated from parameters learned from historical data. When the variables are not linearly correlated or only weakly correlated, this causes a large deviation between the filled value and the true value. Literature [3, 4] uses a series of linear and nonlinear regression models to fill in missing values, and literature [1, 2] uses a kernel function to build the regression model. The advantage of regression filling is that it suits both categorical and continuous data; the disadvantage is that determining the model parameters is complicated and the filling accuracy is not easy to guarantee. The main idea of KNN filling is to replace the missing value with the weighted mean of its K nearest neighbors. For genetic data, literature [6] uses Euclidean distance to find the K nearest neighbors of the missing value in the data matrix and then fills the missing value. Literature [9] searches for the K nearest neighbors of the missing record based on the degree of gray correlation and then fills the missing values. Literature [10] proposes partial filling based on KNN, which uses outlier-detection techniques from machine learning and data mining to decide whether a missing value can be filled: if it cannot, filling is abandoned; if it can, the left and right neighbors of the missing value are used for weighted KNN filling. This method therefore does not fill every missing value. To further improve filling accuracy, literature [11] proposed a non-parametric missing data filling method based on EM, which resembles the EM algorithm except that non-parametric models such as KNN or kernel-function regression replace the parametric model.
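To make the KNN filling idea above concrete, the following is a minimal sketch of inverse-distance-weighted KNN imputation in Python. The function name knn_impute and the weighting scheme are illustrative assumptions, not the exact formulation of the cited literature.

    import numpy as np

    def knn_impute(data, row, col, k=5):
        """Fill data[row, col] with the inverse-distance-weighted mean of
        the target column over the k nearest complete rows."""
        # Candidate donors: rows with no missing values at all
        complete = data[~np.isnan(data).any(axis=1)]
        # Distance is computed on the observed attributes of the target row
        obs = [c for c in range(data.shape[1])
               if c != col and not np.isnan(data[row, c])]
        dists = np.linalg.norm(complete[:, obs] - data[row, obs], axis=1)
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] + 1e-9)   # inverse-distance weights
        return np.average(complete[nearest, col], weights=weights)

    X = np.array([[1.0, 2.0, 3.0],
                  [1.1, 2.1, np.nan],
                  [0.9, 1.9, 2.9],
                  [5.0, 6.0, 7.0]])
    X[1, 2] = knn_impute(X, row=1, col=2, k=2)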

Power grid historical data has many attribute parameters, and missing data is widely distributed across them. The research above generally handles data with a single missing attribute; for data with multiple missing attributes it is rarely applicable or its performance degrades noticeably. This paper therefore proposes an improved random forest algorithm based on attribute comprehensive weighting. The method first proposes an attribute comprehensive weighting strategy based on error expectation and performs a comprehensive attribute calculation on the initial missing set to generate a complete instance set; second, the similar set is obtained according to the attribute comprehensive weighting strategy and the random forest model is trained; finally, the improved random forest algorithm based on attribute comprehensive weighting is used to identify missing data and improve the identification accuracy. Simulation results show that the algorithm can fill in missing values of multiple attributes, which verifies the accuracy and effectiveness of the method.

2 Random Forest Theory

Random forest (RF) is a supervised ensemble learning algorithm proposed by Leo Breiman. Its main idea is to combine regression decision trees (CART) into a forest according to certain rules and to aggregate the outputs of the trees in the forest into the final prediction.

2.1 CART Regression Decision Tree

A decision tree is a set of rules for classification learning. The classification process seeks the tree model with the smallest deviation from the data set, where the smallest deviation can be understood as the optimal balance between training-set and test-set errors: the tree both fits the training set well and predicts the test set accurately. Figure 1 shows the structure of the decision tree:

Fig. 1. Decision tree structure.

CART is a data set division strategy based on the Gini index (GI). By evaluating all candidate splits at each node it grows a largest tree, and a decreasing sequence of subtrees is then obtained by pruning; the subtree that generalizes best is taken as the target decision tree. The formula for GI is:

$$ GI_{U} = 1 - \sum\limits_{j = 1}^{m} {p_{j}^{2} } $$
(1)

In the formula, pj is the frequency of occurrence of class-j elements, U represents the data set, and m represents the number of categories [11].

Different attributes of the individuals must be evaluated as split candidates. Splitting the data set on any attribute T divides U into U1 and U2, and the GI of the sample set U after splitting on attribute T is shown in formula (2):

$$ GI_{U,T} = \frac{{|U_{1} |}}{{|U|}}GI(U_{1} ) + \frac{{|U_{2} |}}{{|U|}}GI(U_{2} ) $$
(2)

For any attribute, the division that produces the subsets with the smallest GI is taken as the split. The smaller GIU,T is on attribute T, the better the division on T. The growth of the entire decision tree is completed by applying this rule recursively.
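The following short sketch illustrates formulas (1) and (2): it computes the Gini index of a label set and picks the candidate threshold that minimizes the post-split Gini. The helper names and toy data are illustrative assumptions.

    import numpy as np

    def gini(labels):
        """Gini index GI_U = 1 - sum_j p_j^2 (formula (1))."""
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def split_gini(values, labels, threshold):
        """Weighted Gini after splitting U into U1/U2 (formula (2))."""
        left = labels[values <= threshold]
        right = labels[values > threshold]
        n = len(labels)
        return len(left) / n * gini(left) + len(right) / n * gini(right)

    # Pick the candidate threshold with the smallest post-split Gini
    values = np.array([2.0, 3.5, 1.0, 4.2, 3.9])
    labels = np.array([0, 1, 0, 1, 1])
    best = min(np.unique(values)[:-1],
               key=lambda t: split_gini(values, labels, t))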

2.2 Bagging Algorithm Based on Bootstrap Sampling

Although CART has good dividing ability, its prediction accuracy alone is usually not ideal. The Bagging algorithm was proposed by Breiman in the 1990s. Its key feature is the use of Bootstrap resampling: each CART tree is trained on a subset of the same size drawn with replacement from the initial set. In addition, Breiman gave a method to improve node splitting: randomly draw a subset of the attributes and split only within that subset. The Bagging algorithm not only improves the performance of a single decision tree in the forest but, by sampling in the attribute subspace, further weakens the association between different decision trees, which plays an important role in reducing the generalization error of the random forest.
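A minimal sketch of the Bootstrap resampling step described above; the helper name and synthetic data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    def bootstrap_sample(X, y, rng):
        """Draw a resampled training set of the same size as the initial set."""
        idx = rng.integers(0, len(X), size=len(X))  # sampling with replacement
        return X[idx], y[idx]

    # Each tree in the forest is trained on its own bootstrap replicate,
    # which weakens the correlation between trees.
    X = rng.normal(size=(100, 4))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)
    X_b, y_b = bootstrap_sample(X, y, rng)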

2.3 Random Forest Algorithm

Random forest regression (RFR) is a machine learning algorithm with high accuracy. It overcomes the shortcomings of single-model prediction and is widely used in economics, medicine, energy, and other fields.

Definition 1:

The set of all decision trees {h(X, θk), k = 1, ⋯, Ntree} constitutes a random forest f, where h(X, θk) denotes an unpruned CART and θk is an independently and identically distributed random vector associated with the k-th decision tree. Majority voting is used for classification problems, and the arithmetic average is used for regression problems to obtain the final prediction value of the random forest.

Definition 2:

Define the margin function Q(X, Y):

$$ Q(X,Y) = a_{k} I(h(X,\theta_{k} ) = Y) - \mathop {\max }\limits_{j \ne Y} a_{k} I(h(X,\theta_{k} ) = j) $$
(3)

X: the input vector; Y: the correct output category; j: one of the categories; I: the indicator function; ak: the averaging operator over k = 1, 2, ⋯, n.

It can be seen from formula (3) that the larger the margin function value, the higher the confidence that the classification is correct. The generalization error of RFR can therefore be defined as shown in formula (4):

$$ E^{*} = S_{X,Y} (Q(X,Y) < 0) $$
(4)

In the formula, SX,Y is the classification error rate function over the input vector X and output Y. Applying the law of large numbers to formula (4) yields the following theorem:

Theorem 1:

For all sequences θk, as the number of trees increases, E* converges almost surely to:

$$ S_{X,Y} \left( S_{\theta } (h(X,\theta ) = Y) - \mathop {\max }\limits_{j \ne Y} S_{\theta } (h(X,\theta ) = j) < 0 \right) $$
(5)

Sθ is the classification error rate over θ. The theorem shows that the generalization error of RFR converges to an upper bound, so increasing the number of trees does not cause the prediction result to overfit.

Theorem 2:

The upper bound of the RFR generalization error is shown in Eq. (6):

$$ E^{*} \le \frac{{\eta (1 - \xi^{2} )}}{{\xi^{2} }} $$
(6)

η: average correlation coefficient of the tree; ξ: average strength of the tree.

Theorem 2 shows that as η decreases and ξ increases, the upper bound of the RFR generalization error is further reduced, which is more conducive to error control. The ways to improve RFR prediction accuracy are therefore: 1) reduce the correlation between trees; 2) improve the accuracy of a single decision tree. The RFR algorithm flow is shown in Fig. 2:

Fig. 2. Random forest algorithm schematic diagram.
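A minimal random forest regression sketch, assuming scikit-learn is available; it illustrates the convergence behavior of Theorem 1, namely that increasing the number of trees does not degrade held-out error. The synthetic data and parameter values are assumptions for demonstration.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.2, size=500)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for n_trees in (10, 50, 200):
        model = RandomForestRegressor(n_estimators=n_trees,
                                      random_state=0).fit(X_tr, y_tr)
        err = mean_squared_error(y_te, model.predict(X_te))
        # Test error levels off rather than rising as trees are added
        print(n_trees, err)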

3 Attribute Comprehensive Weighting Strategy Based on Correlation Analysis

The random forest algorithm given in the previous section can fill different types of missing data, but the effect of the RFR model depends largely on the data set. To improve the training efficiency and training effect of the random forest, it is particularly important to find original training data that is similar to the missing value. This paper proposes an attribute comprehensive weighting strategy based on correlation analysis to find a sample set highly similar to the missing value, which is used to train the random forest model. This both ensures consistency of the input features and simplifies the training model.

3.1 Correlation Analysis Based on Pearson Coefficient

The Pearson correlation coefficient measures the degree of linear correlation between random variables. Applied to a population, it is defined as shown in formula (7):

$$ \rho_{X,Y} = \frac{{{\text{cov}} (X,Y)}}{{\sigma_{X} \sigma_{Y} }} $$
(7)

X and Y are two random variables, σX and σY are the standard deviations of X and Y respectively, and cov(X, Y) is the covariance, as shown in formula (8):

$$ {\text{cov}} (X,Y) = \frac{{\sum\nolimits_{i = 1}^{n} {(X_{i} - \overline{X} )} (Y_{i} - \overline{Y} )}}{n - 1} $$
(8)

n represents the number of samples. The covariance reflects how one random variable changes with another: if they change in the same direction the result is positive (positive correlation); otherwise it is negative (negative correlation).

When the Pearson coefficient is applied to a sample, it is computed as shown in Eq. (9):

$$ r_{x,y} = \frac{{\sum\nolimits_{i = 1}^{n} {(x_{i} - \overline{x} )} (y_{i} - \overline{y} )}}{{\sqrt {\sum\nolimits_{i = 1}^{n} {(x_{i} - \overline{x} )}^{2} } \sqrt {\sum\nolimits_{i = 1}^{n} {(y_{i} - \overline{y} )}^{2} } }} $$
(9)

xi and yi are the i-th observations of variables X and Y, and \(\overline{x}\) and \(\overline{y}\) are the corresponding sample means.
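A short numeric sketch of formula (9); the function name is illustrative, and the result is cross-checked against NumPy's built-in corrcoef.

    import numpy as np

    def pearson(x, y):
        """Sample Pearson correlation coefficient r_{x,y} (formula (9))."""
        dx, dy = x - x.mean(), y - y.mean()
        return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

    x = np.array([1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 3.9, 6.2, 8.1])
    r = pearson(x, y)     # close to 1: strong positive linear correlation
    assert np.isclose(r, np.corrcoef(x, y)[0, 1])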

3.2 Attribute Comprehensive Weighting Strategy Based on Error Expectation

To find the historical data set most correlated with the missing power grid data, all associated attributes at the moment of the missing data are first screened to obtain those closest to the data being filled. These attributes are then weighted to obtain a comprehensive attribute weight, and finally the historical data set with the highest correlation is obtained by sorting by comprehensive weight. The specific steps are as follows (a code sketch of the strategy follows the list):

  1) Select all associated attributes corresponding to the missing data;

  2) Calculate the correlation coefficient between each pair of attributes with the Pearson correlation coefficient, select the attributes whose correlation coefficient is greater than α (a given threshold), and store them in the cross-correlation set HG;

  3) Further calculate the error expectation \(EXPError(X_{k} ,Y_{k} )\) of all attributes in the HG set;

    $$ EXPError(X_{k} ,Y_{k} ){ = }\frac{{Cov(X_{k} ,Y_{k} )}}{{\sqrt {Var[X_{k} ]Var[Y_{k} ]} }} $$
    (10)

    Cov(Xk, Yk) is the covariance of Xk and Yk; Var[Xk] is the variance of Xk; Var[Yk] is the variance of Yk;

  4) If EXPError(Xk, Yk) > β (β is the strong correlation threshold), the attribute is strongly related and is retained in the strongly related attribute set QX;

  5) The entropy weight method is used to establish weights for the attributes in the set QX, giving the weight vector:

    $$ W = [w_{1} ,w_{2} ,...,w_{m} ] $$
    (11)

    m is the number of strongly associated attributes;

  6) The comprehensive weighted value of the attributes is obtained from the strong correlation coefficients:

    $$ SX = W_{1} S_{1} + W_{2} S_{2} + ... + W_{m} S_{m} $$
    (12)

  7) Sort the historical section data from large to small by comprehensive weighted value, set a selection threshold, and take the samples above the threshold as the similar sample set.
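A condensed sketch of steps 2)–7) above, assuming numeric pandas data. The thresholds alpha and beta, the entropy-weight helper, and all names are illustrative assumptions rather than the paper's exact implementation.

    import numpy as np
    import pandas as pd

    def entropy_weights(df):
        """Entropy weight method: more dispersed attributes get larger weights."""
        # Min-max normalize so the entropy calculation works on nonnegative data
        norm = (df - df.min()) / (df.max() - df.min() + 1e-12)
        p = norm / (norm.sum(axis=0) + 1e-12)
        p = p.replace(0, 1e-12)
        e = -(p * np.log(p)).sum(axis=0) / np.log(len(df))  # entropy per column
        d = 1 - e                                           # diversification degree
        return d / d.sum()                                  # weight vector W

    def similar_samples(df, target_col, alpha=0.5, beta=0.6, top_frac=0.2):
        corr = df.corr(method="pearson")[target_col].drop(target_col)
        hg = corr[corr.abs() > alpha]             # cross-correlation set HG
        qx = hg[hg.abs() > beta].index            # strongly related set QX
        w = entropy_weights(df[qx])               # formula (11)
        score = (df[qx] * w).sum(axis=1)          # composite score SX, formula (12)
        n = max(1, int(top_frac * len(df)))
        return df.loc[score.sort_values(ascending=False).index[:n]]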

The overall flow of the algorithm is shown in Fig. 3:

Fig. 3. Algorithm overall flow chart.

4 Improved Random Forest Algorithm Based on Attribute Comprehensive Weighting

In summary, this paper proposes an improved random forest algorithm framework based on attribute comprehensive weighting. By finding the training set most similar to the missing values of the power grid and using it as the training set of the RFR model, the identification accuracy of the RFR model and the filling accuracy for missing power grid values are improved. The specific steps are as follows (a condensed code sketch follows the list).

  • Step 1: Obtain the historical data of the power grid and construct a training sample set and a verification sample set;

  • Step 2: Use the attribute weighting strategy based on error expectation to find the similar data set R;

  • Step 3: Take the similar data set R as input to train the RFR model;

  • Step 4: Use Bootstrap resampling to form K data sets;

  • Step 5: Generate K CART decision trees;

  • Step 6: Generate a random forest from the K CART decision trees;

  • Step 7: Check whether the training of the random forest is complete; if so, go to Step 8, otherwise return to Step 6;

  • Step 8: Use the RFR model to predict the new data, taking the predicted mean of all trees as the filling result;

  • Step 9: Evaluate the filling result; if it is within the tolerance range, the filling calculation is complete, otherwise retrain the random forest.
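A minimal end-to-end sketch of the steps above, assuming scikit-learn and reusing the illustrative similar_samples helper from Sect. 3; fill_missing and all parameter values are assumptions for demonstration, not the paper's exact implementation.

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    def fill_missing(df, target_col):
        """Fill NaNs in target_col following Steps 1-9 (sketch)."""
        known = df[df[target_col].notna()]          # Step 1: usable history
        missing = df[df[target_col].isna()]
        train = similar_samples(known, target_col)  # Step 2: similar set R
        features = [c for c in df.columns if c != target_col]
        # Steps 3-7: Bootstrap resampling and CART growth happen inside
        # RandomForestRegressor; assumes the feature columns are complete.
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(train[features], train[target_col])
        # Step 8: the forest prediction is the mean over all trees
        df.loc[missing.index, target_col] = model.predict(missing[features])
        return df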

The overall process is shown in Fig. 4:

Fig. 4. Algorithm overall flow chart.

5 Case Analysis

This paper selects one year of historical data from a city-level power grid as the data set and selects the grid voltage data as the missing item for filling analysis. To better demonstrate the advantages of the proposed algorithm, the SVM method, the traditional random forest algorithm, and the algorithm proposed in this paper are compared, and the filling accuracy of the different methods is analyzed.

5.1 Error Analysis Standard

The research object of this paper is missing voltage values of the power grid in a certain area. The data is selected from the historical database, with a sampling period of 5 min. The Pearson correlation coefficient and error expectation (threshold 0.5) are calculated for all attributes in the database; the resulting strongly correlated attributes, {reactive load, active load, current value}, are selected as the data set fields. Applying the attribute comprehensive weighting strategy proposed in Sect. 3.2 of this article (selection threshold 0.6) finally yields a sample set of about 2000 groups of data, as Table 1 shows:

Table 1. Sample data set (excerpt).

According to the data selected from the database and the characteristics of the grid voltage data, this paper evaluates the filling of missing data with the root mean square error (RMSE) and the filling accuracy (Accuracy). σRMSE indicates the filling error: the smaller the value of σRMSE, the better the filling result. The formula is shown in (13):

$$ \sigma_{RMSE} = \sqrt {\frac{{\sum\limits_{i = 1}^{n} {(x_{r} - x{}_{i})^{2} } }}{n}} $$
(13)

In the formula, xr and xi are the true value and the filled value respectively, and n is the number of missing values. σRMSE reflects the gap between the filled value and the true value; the smaller the value, the higher the credibility of the filling result.

Accuracy represents the accuracy of the filling result, the formula of Accuracy is shown in (14):

$$ Accuracy{ = }\frac{{n_{r} }}{n} \times 100{\text{\% }} $$
(14)

In the formula, nr is the number of correct estimates. The tolerance range of Accuracy selected in this article is [−8%, 8%]: if the replaced or filled value finally lies within the given range, the result of the replacement or filling is considered usable.
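A short sketch of the two evaluation measures, formulas (13) and (14); the ±8% tolerance follows the text, and the toy values are assumptions for illustration.

    import numpy as np

    def rmse(true, filled):
        """Root mean square error, formula (13)."""
        return np.sqrt(np.mean((true - filled) ** 2))

    def accuracy(true, filled, tol=0.08):
        """Share of filled values within the [-8%, 8%] band, formula (14)."""
        within = np.abs(filled - true) <= tol * np.abs(true)
        return within.mean() * 100  # percent

    true = np.array([220.0, 221.5, 219.8])
    filled = np.array([222.0, 190.0, 220.1])
    print(rmse(true, filled), accuracy(true, filled))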

5.2 Analysis of Filling Results

First, the missing attributes are selected according to the situation, and missing data sets with missing rates of 1%, 3%, 5%, 10%, 15%, 20%, 25%, and 30% are constructed by random deletion. The proposed IRFNNIS algorithm, the traditional random forest algorithm, and the SVM algorithm are tested at each missing rate, and the experimental results of each algorithm are analyzed and compared in terms of root mean square error and filling accuracy.

Taking the missing values of an actual power grid voltage as the filling target, missing data sets of each ratio are constructed to test the performance of the three algorithms. To fully express the performance of each algorithm, 10 missing data sets are constructed for each missing rate by randomly generating missing values; the results of each algorithm over the 10 data sets are averaged as the final experimental result, and these results are analyzed and compared.

Fig. 5. Comparison of the root mean square error of the filling results of different algorithms.

It can be seen from Fig. 5 that the IRFNNIS algorithm proposed in this paper has the smallest root mean square error at all missing rates and the best filling effect. As the missing rate increases, the root mean square error increases.

The filling accuracy decreases as the missing rate increases. As shown in Fig. 6, when the missing rate is 1%, the filling accuracy of all three algorithms exceeds 60%, indicating that each algorithm fills well when only a small amount of data is missing. When the missing rate is 3%–15%, the IRFNNIS algorithm proposed in this paper is significantly better than the comparison algorithms. When the missing rate is greater than 15%, the accuracies of the SVM and traditional random forest algorithms differ little. In all missing-rate cases, the filling effect of IRFNNIS is significantly better than that of the other two algorithms.

Fig. 6. Comparison of the accuracy of the filling results of different algorithms.

The above analysis of root mean square error and filling accuracy shows that the filling effect of the proposed IRFNNIS algorithm is better than that of the other two algorithms. To show the actual filling effect more intuitively, a data set with a missing rate of 10% containing multiple runs of consecutive missing values is constructed and filled with the proposed algorithm. Figure 7 compares the filling results of 27 consecutive missing values with the true values. The filled values correlate highly with the true values, which meets the data filling requirements.

Fig. 7. Comparison of the filling results of the proposed algorithm with the true values.

6 Conclusion

This article explains the theory related to missing data, including the reasons data go missing and the necessity of handling it, and introduces global optimization strategies. To improve the efficiency of missing value filling, this paper proposes an improved random forest algorithm based on attribute comprehensive weighting. The method first proposes an attribute comprehensive weighting strategy based on error expectation and performs a comprehensive attribute calculation on the initial missing set to generate a complete instance collection; second, the similar set is obtained according to the attribute comprehensive weighting strategy and the random forest model is trained; finally, the improved random forest algorithm based on attribute comprehensive weighting is used to identify missing data and improve the identification accuracy. The simulation results show that the algorithm can fill in the missing values of multiple attributes and has advantages over the other algorithms, which verifies the accuracy and effectiveness of the method.