Introduction

Nonintrusive (appliance) load monitoring (NILM or NIALM) is the process of determining what loads or appliances are running in a house by analyzing the power signal from the whole-house power meter. NILM, sometimes called load disaggregation, can be used in systems that inform occupants about how energy is used within a home without the need to purchase additional power monitoring sensors. Once the occupants know which appliances are running and how much power those appliances consume, they can make informed decisions about conserving power, whether motivated by economic or ecological concerns (or both).

A review of NILM algorithms and research has led us and others (Kim et al. 2010; Zeifman and Roth 2011; Makonin 2014) to the conclusion that there is no consistent way to measure performance accuracy. Although some researchers still use the most basic forms of accuracy measurement, more sophisticated measures have been discussed. The most basic accuracy measure, used by the majority of NILM researchers (e.g., Chang et al. 2010; Tsai and Lin 2012; Makonin et al. 2013), is defined as

$$ \text{Acc.} = \frac{\text{correct}\ \text{matches}}{\text{total}\ \text{possible}\ \text{matches}} = \frac{\text{correct}}{\text{correct} + \text{incorrect}}. $$
(1)

Kim et al. (2010) point out that accuracy results are “very skewed because using an appliance is a relatively rare event .... appliances [that] are off will achieve high accuracy” (Table 1). Better accuracy performance measures must therefore be considered. Expanding on our previous work (Makonin 2014), we present a unified approach that allows for consistent accuracy testing amongst NILM and load disaggregation researchers.

Table 1 Basic accuracy measures

The rest of our short communication is organized as follows. We first define data noise (Section Data noise), then discuss strategies using ground truth (Section Ground truth and bias). Next, we focus on classification accuracy testing (Section Classification accuracy) and estimation testing (Section Estimation accuracy). We end the discussion with a look at why researchers need to report accuracies with respect to both the overall performance and appliance-specific performance (Section Overall and appliance-specific accuracies). Finally, we demonstrate some of the issues that we discussed previously by examining the results from an experiment (Section Experiment example).

Data noise

Data noise can be understood as unexpected or unaccounted-for anomalies that appear in the stream of data that an algorithm analyzes. Noise can take a number of forms in disaggregation. Readings can be missing, leaving gaps in a time series. Data streams can have timestamps that are out of sync. Data can be corrupted, with measurements within a reading missing or measured wrongly due to sensor miscalibration or malfunction. Aside from miscalibration or malfunction, data can contain Gaussian noise due to small fluctuations in sensor/ADC (analog-to-digital converter) precision and in the power consumed by an appliance. Specific to disaggregation, noise can be unmetered appliances that create large, unexpected patterns of energy consumption. For our purposes, we define noise as the amount of power remaining in the observed aggregate power reading once the disaggregated appliance power readings (in ground truth) have been subtracted. Mathematically, it is defined as

$$ \text{noise} = y_{t} - \sum\limits^{M}_{m = 1} y_{t}^{(m)}, $$
(2)

where $y_t$ is the total ground truth or observed value at time $t$, $M$ is the number of appliances, and $y_{t}^{(m)}$ is the ground truth power consumed at time $t$ by appliance $m$.
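
As a minimal illustration of Eq. (2), the following sketch computes the noise residual at every time step (the function and array names are ours, assuming NumPy arrays rather than any particular NILM toolkit):

```python
import numpy as np

def noise_residual(aggregate, submeters):
    """Eq. (2): power left in the aggregate reading once every
    ground-truth appliance reading has been subtracted at each t.

    aggregate: shape (T,), whole-house readings y_t
    submeters: shape (M, T), per-appliance ground truth y_t^(m)
    """
    return aggregate - submeters.sum(axis=0)
```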

Ground truth and bias

NILM researchers need to describe in detail the data they use to build models, train, and test their NILM algorithms. If researchers are using publicly available datasets such as REDD (Kolter and Johnson 2011) or AMPds (Makonin et al. 2013), they need to discuss the method used to clean the data: for instance, how they dealt with incomplete or erroneous data and with different meters having different sample rates.

There also needs to be a clear statement on whether the testing included noise or was denoised. In denoised data, the whole-house power reading is equal to the summation of all appliance power readings; in noisy data, the residual is what we often refer to as the unmetered load or appliance. Using denoised data for testing will cause higher accuracies to be reported, and it does not reflect a real-world application, where a significant amount of noise would come from unmetered loads running in the home. Furthermore, the percentage of noise in each test needs to be reported. This percent-noisy measure (%-NM) is calculated on the ground truth data as follows:

$$ \%-NM = \frac{{\sum}^{T}_{t = 1} | y_{t} - {\sum}^{M}_{m = 1} y^{(m)}_{t}|}{{\sum}^{T}_{t = 1} y_{t}} , $$
(3)

where $y_t$ is the aggregate observed current/power amount at time $t$ and $y^{(m)}_{t}$ is the ground truth current/power amount for each appliance $m$ to be disaggregated. For example, a denoised test would result in 0 %, whereas a %-NM of 0.40 would mean that 40 % of the aggregate observed current/power over the whole test was noise.
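
Under the same array conventions as before, Eq. (3) might be computed as follows (a sketch, not a reference implementation):

```python
import numpy as np

def percent_noisy(aggregate, submeters):
    """Eq. (3): fraction of the observed aggregate that is noise
    (unmetered load), accumulated over the whole test period."""
    residual = np.abs(aggregate - submeters.sum(axis=0))
    return residual.sum() / aggregate.sum()
```

A perfectly denoised test returns 0.0; a return value of 0.40 corresponds to the 40 % example above.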

Finally, researchers should use standard methods to minimize the effects of bias. Bias occurs when some of the data used for training is also used for testing; when present, it results in the reporting of higher accuracies. A well-accepted method used by the data mining community to avoid bias is 10-fold cross-validation (Liu and Motoda 1998, p. 109). This simple method splits the ground truth data into ten subsets of size $\frac{n}{10}$. NILM algorithms are then trained on nine of the subsets, and accuracy testing is performed on the excluded subset. This is repeated ten times (each time a different subset is used for testing), and the mean accuracy is then calculated and reported.
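
A minimal sketch of this procedure is given below; `train_fn` and `test_fn` are hypothetical placeholders for training and scoring a NILM algorithm, and the splits are taken in time order (shuffling before splitting is a common variant):

```python
import numpy as np

def ten_fold_cv(data, train_fn, test_fn, folds=10):
    """Train on nine subsets, test on the held-out one, repeat for
    every fold, and return the mean accuracy over all folds."""
    idx = np.arange(len(data))
    scores = []
    for test_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, test_idx)  # the other 9 subsets
        model = train_fn(data[train_idx])
        scores.append(test_fn(model, data[test_idx]))
    return np.mean(scores)
```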

Classification accuracy

Researchers need to measure how accurately NILM algorithms can predict which appliance is running in which state. Classification accuracy measures, such as the f-score (a.k.a. f-measure), are well suited for this task. The f-score, often used in information retrieval and text/document classification, has also been used by NILM researchers (Figueiredo et al. 2012; Berges et al. 2010; Kim et al. 2010). It is the harmonic mean of precision and recall:

$$ F_{1} = 2 \cdot \frac{precision \cdot recall}{precision + recall} , \qquad precision = \frac{tp}{tp + fp}, \quad recall = \frac{tp}{tp + fn}, $$

where precision is the positive predictive value and recall is the true positive rate (or sensitivity); $tp$ counts true-positives (correctly predicted that the appliance was ON), $fp$ counts false-positives (predicted the appliance was ON when it was OFF), and $fn$ counts false-negatives (the appliance was ON but was predicted OFF). Note that these measures ($tp$, $fp$, $fn$) are accumulations over a given experimental time period. However, the f-score is generally used for binary classification.
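
For the binary case, the f-score follows directly from the accumulated counts; a sketch, assuming boolean NumPy arrays of ON/OFF states for one appliance:

```python
import numpy as np

def binary_fscore(pred_on, true_on):
    """F_1 over a test period for one appliance."""
    tp = np.sum(pred_on & true_on)    # predicted ON, was ON
    fp = np.sum(pred_on & ~true_on)   # predicted ON, was OFF
    fn = np.sum(~pred_on & true_on)   # predicted OFF, was ON
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```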

Kim et al. (2010) showed how the f-score could be modified to account for non-binary outcomes, such as a power signal (which we call M-fscore). Their approach combines appliance state classification and power estimation accuracies, even though in many instances classification and estimation are two distinct functions of a NILM algorithm. Combining them hides important diagnostic information about which parts of a NILM algorithm have low accuracy. Each function, whether classification or estimation, requires an accuracy measure suited to its performance; matching function with measure provides more detailed diagnostic and performance information.

To calculate the accuracy of non-binary classifications, we now define the finite-state f-score (FS-fscore). We introduce a partial penalization measure called the inaccurate portion of true-positives (inacc), which converts the binary nature of $tp$ into a discrete measure. The inacc of a given experimental test is

$$ inacc = \sum\limits^{T}_{t = 1} \cfrac{| \hat{x}^{(m)}_{t} - x^{(m)}_{t} |}{K^{(m)}}, $$
(4)

where $\hat{x}^{(m)}_{t}$ is the estimated state of appliance $m$ at time $t$, $x^{(m)}_{t}$ is the ground truth state, and $K^{(m)}$ is the number of states for appliance $m$. In other words, we penalize based on the distance (or difference) between the estimated state and the ground truth state. Precision and recall can now be redefined to account for these partial penalizations:

$$ precision = \frac{tp - inacc}{tp + fp} \quad \text{ and } \quad recall = \frac{tp - inacc}{tp + fn} . $$
(5)

The definition of the f-score remains the same. Summing $tp$, inacc, $fp$, and $fn$ over all $M$ appliances (and recalculating precision, recall, and f-score) allows the overall classification accuracy of the experimental test to be reported.
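
Putting Eqs. (4) and (5) together for a single appliance might look like the sketch below. We assume state 0 means OFF and count a true-positive whenever the predicted and ground truth states are both nonzero; this bookkeeping is our assumption, as the text leaves it to the implementer:

```python
import numpy as np

def fs_fscore(pred_states, true_states, num_states):
    """Finite-state f-score for one appliance with K = num_states.
    pred_states, true_states: integer state sequences of length T."""
    on_pred, on_true = pred_states > 0, true_states > 0
    tp = np.sum(on_pred & on_true)
    fp = np.sum(on_pred & ~on_true)
    fn = np.sum(~on_pred & on_true)
    # Eq. (4): partial penalty proportional to the state distance
    inacc = np.sum(np.abs(pred_states - true_states)) / num_states
    precision = (tp - inacc) / (tp + fp)  # Eq. (5)
    recall = (tp - inacc) / (tp + fn)     # Eq. (5)
    return 2 * precision * recall / (precision + recall)
```

For the overall score, the $tp$, inacc, $fp$, and $fn$ values would be accumulated across all $M$ appliances before recomputing precision, recall, and f-score.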

Estimation accuracy

Accuracies based on power estimation also need to be reported, showing how closely a NILM algorithm's estimate of consumed power matches actual consumption. This is important because systems that use NILM need to report to the occupants what portion of the power bill can be attributed to each appliance. Additionally, when dealing with time-of-use billing (charging more per kWh at peak times), occupants need to know how much might have been saved if certain appliances (e.g., a clothes dryer) had not been used during the peak period.

Different accuracy measures have been used to compare consumption estimation. Parson et al. (2012) used root mean square error (RMSE) to report estimation accuracy. However, RMSE is not normalized, so it is hard to compare how the disaggregation of one appliance performed relative to another. This becomes a bigger problem when comparing an appliance that consumes a large amount of power (e.g., heating) against an appliance that consumes very little (e.g., a fridge).

Normalized disaggregation error (NDE) (Kolter and Jaakkola 2012; Parson et al. 2012; Dong et al. 2013) has also been used to measure the estimation accuracy of an appliance. With this measure, we would subtract the summation of all $T$ ground truths from the summation of all $T$ estimations. However, subtracting the summations tends to report inflated accuracies because errors can cancel each other out. For example, given an estimation of 2 A and a ground truth of 0 A at time $t_1$, and an estimation of 0 A and a ground truth of 2 A at time $t_2$, the NDE would be 0 % when in fact 100 % would be the correct error score. The estimation accuracy measure of Kolter and Johnson (2011) and Johnson and Willsky (2013) calculates the correct value of 0 % accuracy (or 100 % error). We have chosen to use this estimation accuracy measure, defined as

$$ Est.\ Acc. = 1 - \cfrac{{\sum}^{T}_{t = 1} {\sum}^{M}_{m = 1} \lvert \hat{y}^{(m)}_{t} - y^{(m)}_{t} \rvert}{2 \cdot {\sum}^{T}_{t = 1} {\sum}^{M}_{m = 1} y^{(m)}_{t}} $$
(6)

where $T$ is the length of the time sequence (the number of disaggregated readings), $M$ is the number of appliances, $\hat{y}^{(m)}_{t}$ is the estimated power consumed at time $t$ by appliance $m$, and $y^{(m)}_{t}$ is the corresponding ground truth power. This measure allows overall estimation accuracy to be reported. By eliminating the summation over $M$, we can report the estimation accuracy for each appliance:

$$ Est.\ Acc.^{(m)} = 1 - \cfrac{{\sum}^{T}_{t = 1} \lvert \hat{y}^{(m)}_{t} - y^{(m)}_{t} \rvert}{2 \cdot {\sum}^{T}_{t = 1} y^{(m)}_{t}}. $$
(7)
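
Eqs. (6) and (7) reduce to the same expression once the arrays are shaped appropriately, since the sums simply run over whichever axes are present (a sketch under our array conventions):

```python
import numpy as np

def estimation_accuracy(pred_power, true_power):
    """Eq. (6) for arrays of shape (M, T); Eq. (7) for shape (T,)."""
    err = np.abs(pred_power - true_power).sum()
    return 1.0 - err / (2.0 * true_power.sum())

# Toy check of the cancellation example above: estimates [2, 0] A and
# ground truth [0, 2] A at t1, t2 give 1 - 4/4 = 0.0 (100 % error),
# whereas comparing the summed totals would have reported no error.
```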

Overall and appliance-specific accuracies

Both classification accuracy and estimation accuracy need to be reported as overall scores and as appliance-specific scores. Reporting how each appliance scores is important for identifying the strengths and weaknesses of different NILM algorithms. With this more detailed accuracy information, one could imagine a system that selects different algorithms depending on the context (including specific history) of the disaggregation task. It is also important to keep in mind that reported accuracies need to be normalized. Normalized results allow readers to understand the relative standings from one appliance to another and from each appliance to the overall accuracy. Finally, although more detailed information has its advantages, reporting specific scores for appliance states is not necessary, because different makes/models of appliances will have a different number of states at different power levels.

Experiment example

We investigated how basic accuracy can mislead by reporting high confidence numbers that do not reflect inaccuracies in predicting rare events (Supplementary data). This would be the case for most loads that are used sporadically. We also show why modified f-score, which combines classification and estimation, is not a detailed enough measure. We used the more detailed AMPds (Makonin et al. 2013) rather than REDD (Kolter and Johnson 2011) to illustrate the issues with these different measurements, using our own NILM algorithm (Makonin et al. 2014). Current draw ($I$) values were rounded up to the nearest whole ampere, and 10-fold cross-validation was used on the entire year of data. The whole-house current draw measurement was denoised so that it equalled the summation of the current draw from the 11 loads chosen for disaggregation (a %-NM of 0.00). The classification and estimation results are listed in Table 2. We have also provided other basic measures in Table 1. Additionally, we include true-negative ($tn$) counts and the accurate/inaccurate true-positives ($atp$ and $itp$) used in M-fscore, where $atp + itp = 1$; this can be seen as assigning partial accuracy, avoiding the binary nature of the true-positive ($tp$) score.

Table 2 Classification and estimation accuracy results

In all cases, basic accuracy scores far better than FS-fscore. This is most noticeable in the garage results. The inacc results show partial penalization, which is apparent when comparing the f-score with the FS-fscore. When we examine the M-fscore, we see that it scores lower than either the f-score or the FS-fscore, but it is hard to understand why. When examining the RMSE scores, it is hard to compare how appliances performed relative to each other or to the overall results because this score is not normalized. When comparing NDE with estimation accuracy, we see that in most instances NDE scores better. This is most apparent in the ent/tv/dvd load. Overall, the FS-fscore and estimation accuracy of our test score high, but this masks the fact that some loads (clothes washer, garage, and home office) did not score well. Furthermore, the home office and garage results show that a load can have a higher score for estimation but a lower score for classification, and the ent/tv/dvd results show the reverse: a higher score for classification and a lower score for estimation.

Conclusion

We presented a unified approach that allows for consistent accuracy testing amongst NILM researchers. Our approach takes into account both classification performance and estimation performance, not one or the other. Additionally, we include performance reporting at both the overall level and the appliance level. This evaluation strategy has been incorporated into our research, and we look forward to continuing the discussion and refinement of this framework as other NILM researchers address the issue of inconsistent accuracy reporting.