1 Introduction

Recent advances in microelectromechanical systems (MEMS) and flexible manufacturing systems (FMS) have enabled significant sensor size reductions while retaining or extending advanced functionality and reducing cost [1]. Wireless communication technology has also enabled sensors to be embedded into many devices [2]. As a result, continuous collection of large amounts of data through various sensors has become commonplace in modern systems [3].

Sensors measure physical quantities that exist in nature, such as temperature, humidity, and pressure, and convert them into electrical signals [4]. Most data collected through sensors are time-series data, i.e., recorded at regular time intervals. These data change due to various sources of variation, such as trend, cyclical, and seasonal variation [5]. The importance of time-series data is growing because analyzing them enables understanding of past changes and prediction of future ones [6].

Sensor-based time-series data have been used in many fields [7]. The smart city is one field that actively uses these data, and many smart city systems utilize time-series data to develop applications [8]. For example, smart grid systems optimize energy operations by analyzing data collected throughout all power utilization processes within the smart city [9]. A smart grid is an intelligent power grid that combines information and communication technologies with the existing power grid; it can mitigate environmental and energy shortage problems by optimizing energy efficiency [10]. Through the smart grid, consumers can reduce their electricity bills, and suppliers can resolve demand and supply imbalances through real-time information sharing and control [11].

One of the main components enabling smart grid technology is the smart meter [12]. Smart meters are digital electronic meters that record energy consumption in real-time and help control energy use by communicating the information to both power suppliers and consumers through a communication network. The data recorded by smart meters are used to analyze current energy usage or predict future usage.

Anomaly detection is essential to ensure smart meter data security and integrity [13]. Anomalies (outliers) are data points that differ significantly from other observations. They are commonly caused by malfunctioning smart meters, consumer behavior changes, energy leakage, or tampering. If the smart meter data are damaged or tampered with during acquisition or transmission, many problems such as incorrect electricity bills or inefficient smart grid operation can arise [14]. Therefore, it is essential to determine whether smart meter data include anomalies to avoid such problems [15].

Various statistical techniques have been proposed for anomaly detection [16], with the interquartile range (IQR) being one of the most popular [17]. However, IQR cannot detect some anomaly types, such as high-leverage anomalies. To address this limitation, machine learning-based anomaly detection models have been proposed [18]. While many studies have worked to improve anomaly detection performance, they do not consider how to repair the detected anomalies [19]. Training a predictive model on data with anomalies removed can be expected to improve prediction performance, but repairing the anomalous values to appropriate estimates before training can be expected to improve it further [20].

Therefore, this paper proposes an accurate electric load forecasting scheme based on anomaly detection and repair. We used a variational autoencoder (VAE) for anomaly detection, a random forest (RF) for data repair, and a sliding window-based LightGBM for electric load forecasting. Figure 1 shows the overall structure of the proposed model.

Fig. 1 Overall structure of proposed electric load forecasting scheme

The main contributions of this paper are as follows.

1. We propose a VAE anomaly detection scheme and verify that it achieves better accuracy than other popular anomaly detection schemes.

2. We propose an RF anomaly repair scheme and verify its effectiveness using various input variables.

3. We propose a sliding window-based LightGBM model for electric load forecasting and verify that it forecasts power consumption more accurately than other popular machine learning models.

The remainder of this paper is organized as follows. Section 2 discusses related anomaly detection, data repair, and electric load forecasting studies. Section 3 describes the input variable configuration for constructing the proposed forecasting model, and Sect. 4 discusses the proposed VAE anomaly detection and repair schemes. Section 5 presents the proposed sliding window-based LightGBM model for load forecasting, and Sect. 6 discusses several experiments we performed to evaluate the performance of the proposed model. Finally, Sect. 7 summarizes and concludes the paper.

2 Related works

Many previous studies have considered anomaly detection, data repair, and electric load forecasting [21]. This section introduces various anomaly detection models and data repair methods and discusses relevant studies regarding electric load forecasting.

2.1 Anomaly detection

Breunig et al. [22] proposed an anomaly detection model based on the local outlier factor (LOF), i.e., how isolated an object is with respect to its surrounding neighborhood. Experimental results verified that LOF-based anomaly detection was a promising approach, identifying meaningful local outliers that previous approaches could not. Liu et al. [23] proposed an isolation forest (IForest) approach, which yields an algorithm with linear time complexity, a low constant factor, and low memory requirements. Their approach performed well in terms of area under the curve and processing time, particularly for large datasets. Chen et al. [24] proposed a randomized neural network (NN), i.e., a randomly connected autoencoder-based ensemble model combining adaptive sample size with random edge sampling, to achieve high-quality results while avoiding overfitting and improving robustness compared with conventional NN outlier detection techniques. Akouemo et al. [25] employed autoregressive models with exogenous inputs (ARX) and an artificial neural network (ANN) to detect and impute anomalies in time-series data, performing hypothesis testing on residual extrema to verify that their approach could identify and impute anomalous data points. Araya et al. [26] proposed two frameworks for anomaly detection in building energy consumption: a collective contextual anomaly detection using sliding windows (CCAD-SW) framework and an ensemble anomaly detection (EAD) framework. The CCAD-SW framework identified anomalous consumption patterns using overlapping sliding windows, and the EAD framework combined several anomaly detection classifiers using majority voting. They verified that the EAD framework improved CCAD-SW sensitivity by 3.6% and reduced the false alarm rate by 2.7%.

2.2 Data repair

Xu et al. [27] proposed a point estimation model based on biased sentinel hospital area disease estimation to interpolate missing data in temperature datasets. The technique estimates missing values as a weighted sum of observations from surrounding stations, using the ratio and covariance between stations to calculate weights that yield an unbiased estimate with minimum error variance. They achieved improved interpolation accuracy for the missing temperature data and obtained the best linear unbiased estimation. Habermann et al. [28] proposed cubic spline interpolation as an alternative to linear or standard B-spline interpolation. The proposed interpolation could be implemented faster and more easily than B-spline interpolation but had restrictive preconditions; once the preconditions were satisfied, it provided a quick solution and could be used for various approximation problems, such as those in computational economics. Gan et al. [29] proposed a seislet transform for sparsity-based interpolation of highly undersampled seismic data based on the classic projection onto convex sets framework. Seismic data undersampled at very low boundary frequencies can be low-pass filtered to obtain accurate estimates, which are then used for interpolation. They verified that the proposed approach achieved better performance than traditional frequency-wavenumber-based approaches.

2.3 Electric load forecasting

Jurado et al. [30] constructed several prediction models using RF, ANN, and fuzzy inductive reasoning (FIR) approaches. They compared these models with an autoregressive integrated moving average model by predicting electric energy consumption in three different buildings at the Technical University of Catalonia in Catalonia, Spain, verifying that the FIR approach achieved the best prediction performance. Grolinger et al. [31] proposed electric load forecasting models based on support vector machines (SVM) and ANN for a large entertainment building in Canada and compared their performance under various model configurations to discuss the strengths and weaknesses of each model. They also presented a model selection algorithm to determine SVM and ANN hyperparameters; the ANN achieved better accuracy than the SVM models with daily data. Abbasi et al. [32] proposed an extreme gradient boosting (XGBoost) electrical load forecasting model, using feature importance to extract input variables from the historical load over a week. They verified that historical loads close to, or exactly a week before, the prediction time point had high importance for model construction. Using Australian Energy Market Operator electrical load data to confirm prediction performance, the proposed XGBoost model exhibited a mean absolute percentage error (MAPE) of 10% with an accuracy of 97%. Kuo et al. [33] proposed an electric load forecasting model based on a convolutional NN, using historical electric load data as input variables, and verified that their model was more accurate than models based on SVM, RF, decision trees (DT), etc.

3 Input variable configuration

This study used electric load data collected at a private university in Seoul, South Korea. The university grouped its buildings into four clusters according to purpose or location and collected their power consumption data in real-time using the i-Smart system operated by the Korea Electric Power Corporation (KEPCO). The data were collected every 15 min from January 1, 2016, to December 31, 2019. Cluster A comprised 32 academic buildings, including the central library and college of humanities buildings. Cluster B contained 16 residential buildings, and Clusters C and D contained 19 and 5 science and engineering buildings, respectively. Table 1 summarizes the collected data.

Table 1 Statistical analysis of electric load data in the four building clusters

We also used time and weather data as input variables for anomaly detection and electric load forecasting. The following subsections describe these data in detail.

3.1 Time data

Since electric load patterns differ across timescales (minutes, hours, days of the week, months, etc.), we considered all variables that express the date and time as input variables [34], including month, day, hour, minute, day of the week, and holiday. Month, day, hour, and minute are sequential values, which makes it difficult for machine learning algorithms to capture their periodicity. For example, 11:59 pm and midnight are adjacent in time, but the difference between their minute values in sequential form is 59. To solve this problem, we encoded the time data into two dimensions,

$${\text{time}}_{x} = \sin \left( {\left( {360/{\text{cycle}}} \right) \times {\text{time}}} \right)$$
(1)

and

$${\text{time}}_{y} = \cos \left( {\left( {360/{\text{cycle}}} \right) \times {\text{time}}} \right),$$
(2)

where cycle represents the time data periodicity, e.g., the month and minute cycles are 12 and 60, respectively. We retained both the one- and two-dimensional data to better represent temporal characteristics [35]. In addition, we used a vector of 0s and 1s for the day of the week and holiday data. Days of the week can be expressed in continuous or binary format. If the days of the week are represented as continuous data, the difference between two consecutive days is 1, while the difference between Sunday and Monday is 6, which could negatively impact the forecasting model. Thus, we represented the day of the week as binary data using one-hot encoding. Likewise, the holiday input variable is 1 if the input day is a holiday and 0 otherwise. Table 2 lists the resulting 20 time-related input variables; a minimal encoding sketch follows the table.

Table 2 Input variable configuration for time data
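As an illustration, the following Python sketch applies the cyclical encoding of Eqs. (1) and (2) (in the equivalent radian form, since 360 degrees equals 2π radians) and one-hot encodes the day of the week; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with raw time fields extracted from the timestamp
df = pd.DataFrame({
    "minute": [0, 15, 30, 45, 59],
    "weekday": ["Mon", "Tue", "Sun", "Sat", "Mon"],
})

def cyclical(values, cycle):
    """Eqs. (1) and (2): 360/cycle degrees per unit = 2*pi/cycle radians."""
    angle = 2 * np.pi * values / cycle
    return np.sin(angle), np.cos(angle)

df["minute_x"], df["minute_y"] = cyclical(df["minute"], cycle=60)

# One-hot encode the day of the week (one binary variable per day)
df = pd.concat([df, pd.get_dummies(df["weekday"], prefix="dow")], axis=1)
```

With this encoding, 11:59 pm and midnight map to nearly identical (sin, cos) pairs, so their temporal adjacency is preserved.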

3.2 Weather data

As power usage is closely related to weather conditions, we considered weather data as input variables [36]. The Korea Meteorological Administration (KMA) provides diverse short- and long-range weather forecasts. We considered short-range forecasts of daily minimum temperature, daily maximum temperature, temperature, humidity, wind speed, cloudiness, and precipitation. Short-range forecasts provide weather data up to 67 h ahead at 3-h resolution, and we obtained finer-resolution weather data using linear interpolation. Figure 2 shows an example of a short-range weather forecast provided by the KMA.

Fig. 2 Example of short-range weather forecast provided by KMA

We also calculated wind chill (WC) and discomfort index (DI) to establish a more direct association with power consumption [37],

$${\text{WC}} = 13.12 + 0.6215 \times T - 11.37 \times {\text{WS}}^{0.16} + 0.3965 \times T \times {\text{WS}}^{0.16}$$
(3)

and

$${\text{DI}} = 1.8 \times T - 0.55\left( {1.8 \times T - 26} \right) \times \left( {1 - 0.01 \times H} \right) + 32,$$
(4)

where T, H, and WS represent temperature (°C), humidity (%), and wind speed (km/h), respectively. Thus, we used nine weather data types. Table 3 provides the Pearson correlation coefficient (PCC) value of the weather data for each cluster; a short computational sketch follows the table.

Table 3 PCC value of weather data for each cluster
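As a short sketch, the following Python code implements Eqs. (3) and (4) and the linear interpolation of 3-h forecast values to the 15-min load resolution; the sample forecast values are illustrative, not actual KMA data.

```python
import numpy as np

def wind_chill(T, WS):
    # Eq. (3): T in deg C, WS in km/h
    return 13.12 + 0.6215 * T - 11.37 * WS**0.16 + 0.3965 * T * WS**0.16

def discomfort_index(T, H):
    # Eq. (4): T in deg C, H in percent
    return 1.8 * T - 0.55 * (1.8 * T - 26) * (1 - 0.01 * H) + 32

# Interpolate 3-h forecast temperatures onto a 15-min grid
hours_3h = np.arange(0, 24, 3)                       # forecast time points
temp_3h = np.array([3., 2., 4., 8., 12., 13., 9., 5.])
hours_15m = np.arange(0, 21.01, 0.25)                # 15-min resolution
temp_15m = np.interp(hours_15m, hours_3h, temp_3h)
```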

4 Anomaly detection model configuration

An autoencoder (AE) is an unsupervised deep learning network comprising encoder and decoder networks [38]. The encoder maps input data from a high-dimensional space to a low-dimensional space and expresses it as a latent variable. Latent variables compressed by the encoder preserve the input data characteristics, so the decoder can restore the original input data from the latent variable. The difference between the output and input values serves as the AE loss function and is used for learning via backpropagation, enabling unsupervised learning.

The basic VAE principle is the same as for the AE, but the latent variable is generated from a Gaussian distribution [39]. Latent variables generated by an AE are arbitrary values, so it is difficult to understand what each latent variable means. In contrast, the VAE constructs a Gaussian probability distribution from the mean and standard deviation derived for the latent variable and then uses variables randomly sampled from that distribution as input values for the decoder. Figure 3 shows a typical VAE model structure.

Fig. 3 Typical variational autoencoder model structure

There are considerable differences between the AE and VAE loss functions. The AE loss function uses only the reconstruction error, an index of how well the decoder restores the input. Since the AE cannot learn how latent variables are generated from specific input data, completely different latent variables can be generated from similar input data. In contrast, the VAE loss function combines the reconstruction error with the Kullback–Leibler (KL) divergence, an index of whether the VAE latent variable follows a specific distribution.

The VAE loss function can be expressed as

$$L_{i}\left( \phi ,\theta ,x_{i} \right) = - \mathbb{E}_{q_{\phi }\left( z|x_{i} \right)}\left[ \log p_{\theta }\left( x_{i}|z \right) \right] + \mathrm{KL}\left( q_{\phi }\left( z|x_{i} \right) \,\|\, p\left( z \right) \right),$$
(5)

where the first term represents the reconstruction error, i.e., the cross-entropy between \(x_{i}\) and the result of reconstructing \(x_{i}\) from the latent variable \(z\) generated by the encoder, and the second term represents the KL divergence between the encoder's distribution \(q_{\phi }(z|x_{i})\) and the prior \(p(z)\). As this term is minimized during training, the latent variable distribution converges to the prior.

Using the KL divergence in the learning process helps the encoder form a common cluster for each data class. Thus, latent variable characteristics, and hence anomalies, can be defined more clearly using a VAE than an AE. Therefore, this study employed a VAE-based anomaly detection model.

AE-based anomaly detection uses the reconstruction errors of all input variables: an observation is determined to be an outlier if its reconstruction error exceeds a given threshold. However, if load data anomalies are detected using the reconstruction error over all input variables, normal load data may be misclassified as abnormal. Therefore, we calculated the reconstruction error for each input variable in the VAE and used only the load data reconstruction error. A minimal sketch of this procedure is given below.
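To make the procedure concrete, here is a minimal TensorFlow sketch of VAE-based detection under stated assumptions: a small fully connected architecture, squared error in place of the cross-entropy of Eq. (5) (the inputs are continuous), the load in column 0, and a percentile-based threshold. None of these settings are the paper's exact configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

class VAE(tf.keras.Model):
    def __init__(self, n_features, latent_dim=2):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            layers.Dense(16, activation="relu"),
            layers.Dense(2 * latent_dim),       # outputs [z_mean, z_log_var]
        ])
        self.decoder = tf.keras.Sequential([
            layers.Dense(16, activation="relu"),
            layers.Dense(n_features),
        ])

    def call(self, x):
        z_mean, z_log_var = tf.split(self.encoder(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(z_mean))
        z = z_mean + tf.exp(0.5 * z_log_var) * eps   # reparameterization
        x_hat = self.decoder(z)
        # Eq. (5): reconstruction error (squared error here) + KL divergence
        recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
        kl = -0.5 * tf.reduce_sum(
            1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
        self.add_loss(tf.reduce_mean(recon + kl))
        return x_hat

X = np.random.rand(1000, 30).astype("float32")   # placeholder scaled inputs
vae = VAE(n_features=30)
vae.compile(optimizer="adam")                    # loss comes from add_loss
vae.fit(X, epochs=50, batch_size=64, verbose=0)

# Threshold only the load column's reconstruction error (column 0 assumed)
x_hat = vae.predict(X)
load_err = np.abs(X[:, 0] - x_hat[:, 0])
threshold = np.percentile(load_err, 99)          # hypothetical threshold rule
anomalies = load_err > threshold
```

The per-variable reconstruction error is computed for every input, but only the load column's error is compared against the threshold, as described above.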

5 Load forecasting model configuration

LightGBM is a popular ensemble model released in 2016 that uses a boosting algorithm [40] to combine several weak learners into a more accurate model. Boosting trains weak models sequentially, with each subsequent model compensating for the deficiencies of the previous one. LightGBM shortens the long data processing time of previous boosting algorithms using gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB). GOSS excludes most data instances with small gradients and uses the remainder to estimate the information gain; since data with large gradients contribute more to the information gain, it can be estimated quickly and accurately even from a small sample. EFB reduces the number of variables by bundling mutually exclusive variables, i.e., variables that rarely take nonzero values simultaneously, without significantly impairing accuracy. Thus, LightGBM achieves good performance with short training times.

We also employ a sliding window algorithm to reflect the latest trends [41]: the model uses the observations in the previous window to predict the next step. Figure 4 shows the sliding window approach for time-series data.

Fig. 4 Sliding window approach for time-series data

The sliding window algorithm requires considerable learning time because the model must be retrained to predict each next point. Thus, we need a model that provides excellent prediction performance even with short training times, and hence we selected LightGBM, as illustrated in the sketch below.
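As a minimal sketch under stated assumptions (a seven-day window of 15-min data, as determined in Sect. 6.3, and an illustrative LightGBM configuration rather than the paper's tuned hyperparameters), the sliding window forecasting loop can be written as follows.

```python
import numpy as np
import lightgbm as lgb

WINDOW = 7 * 96  # seven days of 15-min observations (see Sect. 6.3)

def sliding_window_forecast(X, y, n_test):
    """Retrain on the most recent WINDOW samples before each test point."""
    preds = []
    for t in range(len(y) - n_test, len(y)):
        model = lgb.LGBMRegressor(n_estimators=100)  # hypothetical setting
        model.fit(X[t - WINDOW:t], y[t - WINDOW:t])
        preds.append(model.predict(X[t:t + 1])[0])
    return np.array(preds)

# Illustrative usage with random placeholder data
X = np.random.rand(10 * 96, 20)   # 20 hypothetical time/weather features
y = np.random.rand(10 * 96)
y_hat = sliding_window_forecast(X, y, n_test=96)  # forecast the last day
```

Because LightGBM trains quickly, retraining at every step remains tractable, which is the design choice motivating its selection here.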

6 Experimental results

6.1 Anomaly detection

This section verifies the effectiveness of the proposed VAE anomaly detection scheme by comparison with several popular anomaly detection models: IQR, LOF, and IForest. For comparison, we constructed several datasets by increasing the ratio of anomalies to the total data from 1 to 10% in 1% increments. Anomalies were randomly generated with values below 0.8 times or above 1.2 times the normal data value, and they include both one-point anomalies and continuous anomalies. Figure 5 illustrates an example of the collected electric load data with generated anomalies; points 1 to 4 represent one-point anomalies, and point 5 represents a continuous anomaly. A sketch of this anomaly injection follows Fig. 5.

Fig. 5 Example of the electric load data including generated anomalies
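A minimal numpy sketch of this anomaly injection, assuming illustrative scaling ranges of [0.3, 0.8) and (1.2, 1.7] and a four-point run for the continuous anomaly:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def inject_anomalies(load, ratio, run_len=4):
    """Scale selected points outside the [0.8, 1.2]x normal band."""
    y = load.copy()
    n = max(1, int(ratio * len(y)))
    # one-point anomalies at random positions
    idx = rng.choice(len(y), size=n, replace=False)
    low = rng.uniform(0.3, 0.8, size=n)      # below 0.8x the normal value
    high = rng.uniform(1.2, 1.7, size=n)     # above 1.2x the normal value
    y[idx] *= np.where(rng.random(n) < 0.5, low, high)
    # one continuous anomaly: scale a run of consecutive points
    start = rng.integers(0, len(y) - run_len)
    y[start:start + run_len] *= rng.uniform(1.2, 1.7)
    return y, idx, start
```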

Anomaly detection scheme accuracy was measured as

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}},$$
(6)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Figures 6, 7, 8, and 9 compare the selected model performances for the four building clusters. The proposed VAE anomaly detection scheme exhibits the best performance in most cases, with less accuracy reduction as the anomaly rate increases than all other methods. For example, as the anomaly rate increased from 1 to 10%, the overall accuracy of LOF decreased by 20.6%, whereas that of the proposed scheme decreased by only 9.4%.

Fig. 6 Accuracy comparison of selected models according to anomaly rate in Cluster A

Fig. 7 Accuracy comparison of selected models according to anomaly rate in Cluster B

Fig. 8 Accuracy comparison of selected models according to anomaly rate in Cluster C

Fig. 9 Accuracy comparison of selected models according to anomaly rate in Cluster D

6.2 Data repair

This section investigates the effectiveness of the proposed RF data repair scheme compared with zero and linear interpolation. Linear interpolation is effective for repairing single-point anomalies but not continuous anomalies; repairing continuous anomalies effectively requires external variables that represent the situation at the time the anomalies occurred. We excluded popular models such as SVM and deep neural networks from the comparison because they require significant time for hyperparameter tuning and model training, whereas RF is a flexible machine learning algorithm that performs well even without hyperparameter tuning. Since RF works well with large amounts of data and many input variables, it is suitable for repairing anomalies using various external variables. Hence, we proposed an RF-based data repair scheme and compared it with zero and linear interpolation. We repaired the randomly generated anomalies for the different anomaly ratio cases and compared the repaired and original data using the mean absolute percentage error (MAPE), defined as

$${\text{MAPE = }}\frac{100}{n}\mathop \sum \limits_{i = 1}^{n} \left| {\frac{{y_{i} - \widehat{{y_{i} }}}}{{y_{i} }}} \right|,$$
(7)

where n, \(y_{i}\), and \(\widehat{{y_{i} }}\) represent the number of data points, the actual electric load, and the estimated electric load, respectively. MAPE allows the results of all clusters to be compared at once. Table 4 compares the repair methods for the four building clusters; values in bold font indicate the best repair performance for each anomaly rate. The proposed RF repair method achieves better repair performance than linear interpolation in all cases. A sketch of the RF repair procedure follows Table 4.

Table 4 MAPE comparison of repair methods for each cluster
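A minimal scikit-learn sketch of the RF repair idea, assuming X holds the time and weather input variables of Sect. 3, anomaly_idx holds the indices flagged by the VAE detector, and illustrative hyperparameters:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def repair_anomalies(X, y, anomaly_idx):
    """Fit RF on normal points; predict replacement values for anomalies.

    X: time and weather input variables (Sect. 3); y: electric load.
    """
    normal = np.ones(len(y), dtype=bool)
    normal[anomaly_idx] = False
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X[normal], y[normal])
    y_repaired = y.copy()
    y_repaired[anomaly_idx] = rf.predict(X[anomaly_idx])
    return y_repaired
```

Because the RF conditions on external variables rather than on neighboring load values alone, it can produce plausible replacements even for long runs of continuous anomalies.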

6.3 Determining optimal window size

We constructed a sliding window-based LightGBM model for electric load forecasting and determined the optimal window size empirically by comparing performance for window sizes of 1-10 days, where each day comprises 96 points at the 15-min resolution. For the same reason that we used MAPE as the data repair indicator, we used MAPE to compare the forecasting performance of the sliding window-based LightGBM models with different window sizes.

Across all clusters, prediction performance improved as the window size increased up to 7 days; beyond that, the improvement was not significant relative to the increase in training time. Therefore, we set the window size to 7 days. Figure 10 shows the training times and MAPE of the proposed model for the different window sizes.

Fig. 10 MAPE and training time of proposed model

Table 5 Hyperparameter settings for each model

6.4 Electric load forecasting

Tables 6, 7, 8, and 9 compare the forecasting performance of various machine learning models: linear regression (LR), DT, gradient boosting machine (GBM), RF, XGBoost, LightGBM, deep neural network (DNN), and long short-term memory (LSTM). The DNN and LSTM models were implemented using TensorFlow, and the remaining models were implemented using the scikit-learn library. Table 5 shows the hyperparameter settings for each forecasting model. We divided the dataset into training and test sets, comprising data collected from January 1, 2016, to December 31, 2018, and from January 1, 2019, to December 31, 2019, respectively.

Table 6 MAE comparison of forecasting models
Table 7 RMSE comparison of forecasting models
Table 8 RMSLE comparison of forecasting models
Table 9 MAPE comparison of forecasting models

We compared forecasting performance using the mean absolute error (MAE), root-mean-square error (RMSE), root-mean-square logarithmic error (RMSLE), and MAPE, defined as

$${\text{MAE}} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left| {y_{i} - \widehat{{y_{i} }}} \right|}}{n},$$
(8)
$${\text{RMSE}} = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \widehat{{y_{i} }}} \right)^{2} }}{n}} ,$$
(9)
$${\text{RMSLE}} = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {\log \left( {y_{i} + 1} \right) - \log \left( {\widehat{{y_{i} }} + 1} \right)} \right)^{2} }}{n}} ,$$
(10)

and MAPE as defined in (7). In Tables 6, 7, 8, and 9, values in bold font indicate the best performance in each case.
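For reference, a direct numpy transcription of Eqs. (7)-(10); this is a sketch, not the exact evaluation code used in the experiments:

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                               # Eq. (8)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))                       # Eq. (9)

def rmsle(y, y_hat):
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2))   # Eq. (10)

def mape(y, y_hat):
    return 100 * np.mean(np.abs((y - y_hat) / y))                   # Eq. (7)
```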

The proposed RF-based data repair method provides superior performance across all metrics. Also, the sliding window-based LightGBM model achieves the best forecasting performance compared with the other machine learning techniques in most cases.

Finally, we conducted a Wilcoxon signed-rank test to verify whether the differences between the results of the proposed model and those of the other models are statistically significant. In the test, if the p value is less than the significance level, the null hypothesis that the two dependent samples come from the same distribution is rejected, indicating a significant difference between them. Table 10 shows the results of the Wilcoxon test with a significance level of 0.05. As the p value is lower than the significance level in all cases, the differences between the proposed model and the other models are statistically significant. This confirms that the improvement of the sliding window-based LightGBM model is statistically meaningful even though it is trained on a relatively small amount of data per window.

Table 10 Result of Wilcoxon test
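As a sketch, a paired Wilcoxon signed-rank comparison of two models' per-point errors can be run with scipy; the error arrays here are random placeholders:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)
err_proposed = np.abs(rng.normal(0.0, 1.0, size=96))   # placeholder errors
err_baseline = np.abs(rng.normal(0.5, 1.0, size=96))

# Paired test on the per-point error differences of the two models
stat, p_value = wilcoxon(err_proposed, err_baseline)
print(f"p = {p_value:.4f}; significant difference: {p_value < 0.05}")
```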

7 Conclusion

This paper proposed an accurate electric load forecasting scheme that detects anomalies using a VAE, repairs data using RF, and forecasts electric load using a sliding window-based LightGBM. We used 15-min resolution electric load data collected at a private university in Seoul, South Korea, and performed data preprocessing for the proposed scheme.

We proposed a VAE-based anomaly detection method and compared its performance with popular anomaly detection methods such as IQR, LOF, and IForest. The proposed method showed the best performance in most cases, with less accuracy reduction as the anomaly rate increased than all other methods. In addition, we used an RF model to repair the detected anomalies to appropriate values. Repairing continuous anomalies effectively requires external variables that represent the situation at the time the anomalies occurred; since RF works well with large amounts of data and many input variables, it is suitable for this task. Comparing the proposed RF-based data repair method with widely used missing data interpolation methods, such as zero and linear interpolation, confirmed that RF repairs anomalies better. Finally, we proposed a sliding window-based LightGBM model for electric load forecasting. Experiments with various window sizes showed that prediction performance improved as the window size increased up to 7 days, with no significant improvement thereafter; therefore, the proposed model uses a seven-day window.

The performance of the proposed models was verified in terms of MAE, RMSE, RMSLE, and MAPE against other popular machine learning and deep learning methods. Experiments using data repaired by the zero, linear, and RF interpolation techniques confirmed that the best forecasting performance was obtained with the RF-repaired data. In addition, comparison with various models confirmed that the proposed model achieved the best performance on all indicators.

Despite meaningful experimental outcomes, our study has some limitations, which suggest future research directions. First, some models that showed better performance in previous studies could not be used because of their long training times; to apply the scheme to an actual smart grid system, we plan to explore models that perform well while requiring less time for model training. Second, it is difficult to explain how the proposed model derives its predicted values, so we plan to develop a more accurate electric load forecasting model by analyzing the influence of the various input variables using explainable artificial intelligence (XAI) techniques. In addition, we will verify the applicability of the scheme in the smart grid by linking the electric load forecasting model with systems such as energy management systems (EMS) and energy storage systems (ESS).