Introduction

Nonintrusive (appliance) load monitoring (NILM or NIALM) is the process of determining what loads or appliances are running in a house by analyzing the power signal from the whole-house power meter. NILM, sometimes called load disaggregation, can be used in systems that inform occupants about how energy is used within a home without the need to purchase additional power monitoring sensors. Once the occupants know which appliances are running and how much power those appliances consume, they can make informed decisions about conserving power, whether motivated by economic or ecological concerns (or both).

A review of NILM algorithms and research has led us and others (Kim et al. 2010; Zeifman and Roth 2011; Makonin 2014) to the conclusion that there is no consistent way to measure performance accuracy. Although some researchers still use the most basic forms of accuracy measurement, more sophisticated measures have been discussed. The most basic accuracy measure, used by the majority of NILM researchers (e.g., Chang et al. 2010; Tsai and Lin 2012; Makonin et al. 2013), is defined as

$$ \text{Acc.} = \frac{\text{correct}\ \text{matches}}{\text{total}\ \text{possible}\ \text{matches}} = \frac{\text{correct}}{\text{correct} + \text{incorrect}}. $$
(1)

Kim et al. (2010) point out that accuracy results are “very skewed because using an appliance is a relatively rare event .... appliances [that] are off will achieve high accuracy” (Table 1). Better accuracy performance measures must therefore be considered. Expanding on our previous work (Makonin 2014), we present a unified approach that allows for consistent accuracy testing amongst NILM and load disaggregation researchers.

Table 1 Basic accuracy measures

The rest of our short communication is organized as follows. We first define data noise (Section Data noise), then discuss strategies using ground truth (Section Ground truth and bias). Next, we focus on classification accuracy testing (Section Classification accuracy) and estimation testing (Section Estimation accuracy). We end the discussion with a look at why researchers need to report accuracies with respect to both the overall performance and appliance-specific performance (Section Overall and appliance-specific accuracies). Finally, we demonstrate some of the issues that we discussed previously by examining the results from an experiment (Section Experiment example).

Data noise

Data noise can be understood as unexpected or unaccounted-for anomalies that appear in the stream of data that an algorithm analyzes. Noise can take a number of forms in disaggregation. Readings can be missing, leaving gaps in a time series. Data streams can have timestamps that are out of sync. Data can be corrupted, with measurements within a reading missing or measured wrongly due to sensor miscalibration or malfunction. Aside from miscalibration or malfunction, data can contain Gaussian noise due to small fluctuations in sensor/ADC (analog-to-digital converter) precision and in the power consumed by an appliance. Specific to disaggregation, noise can be unmetered appliances that create large, unexpected patterns of energy consumption. For our purposes, we define noise as the amount of power remaining in the observed aggregate power reading once the disaggregated appliance power readings (in ground truth) have been subtracted. Mathematically, it is defined as

$$ \text{noise} = y_{t} - \sum\limits^{M}_{m = 1} y_{t}^{(m)}, $$
(2)

where $y_t$ is the total ground truth or observed value at time $t$, $M$ is the number of appliances, and $y_{t}^{(m)}$ is the ground truth power consumed at time $t$ by appliance $m$.
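
As a minimal illustration of Eq. (2), the following sketch computes the noise residual at every time step (the function and array names are ours, assuming NumPy arrays rather than any particular NILM toolkit):

```python
import numpy as np

def noise_residual(aggregate, submeters):
    """Eq. (2): power left in the aggregate reading once every
    ground-truth appliance reading has been subtracted at each t.

    aggregate: shape (T,), whole-house readings y_t
    submeters: shape (M, T), per-appliance ground truth y_t^(m)
    """
    return aggregate - submeters.sum(axis=0)
```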

Ground truth and bias

NILM researchers need to describe in detail the data they use to build models, train, and test their NILM algorithms. If researchers are using publicly available datasets such as REDD (Kolter and Johnson 2011) or AMPds (Makonin et al. 2013), they need to discuss the method used to clean the data: for instance, how they dealt with incomplete or erroneous data and with different meters having different sample rates.

There also needs to be a clear statement on whether the testing included noise or was denoised. In denoised data, the whole-house power reading is equal to the summation of all appliance power readings; in noisy data, the residual is what we often refer to as the unmetered load or appliance. Using denoised data for testing will cause higher accuracies to be reported, and it does not reflect a real-world application, where a significant amount of noise would come from unmetered loads running in the home. Furthermore, the percentage of noise in each test needs to be reported. This percent-noisy measure (%-NM) is calculated on the ground truth data as follows:

$$ \%-NM = \frac{{\sum}^{T}_{t = 1} | y_{t} - {\sum}^{M}_{m = 1} y^{(m)}_{t}|}{{\sum}^{T}_{t = 1} y_{t}} , $$
(3)

where $y_t$ is the aggregate observed current/power amount at time $t$ and $y^{(m)}_{t}$ is the ground truth current/power amount for each appliance $m$ to be disaggregated. For example, a denoised test would result in 0 %, whereas a %-NM of 0.40 would mean that 40 % of the aggregate observed current/power over the whole test was noise.
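
Under the same array conventions as before, Eq. (3) might be computed as follows (a sketch, not a reference implementation):

```python
import numpy as np

def percent_noisy(aggregate, submeters):
    """Eq. (3): fraction of the observed aggregate that is noise
    (unmetered load), accumulated over the whole test period."""
    residual = np.abs(aggregate - submeters.sum(axis=0))
    return residual.sum() / aggregate.sum()
```

A perfectly denoised test returns 0.0; a return value of 0.40 corresponds to the 40 % example above.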

Finally, researchers should use standard methods to minimize the effects of bias. Bias occurs when some of the data used for training is also used for testing; when present, it results in the reporting of higher accuracies. A well-accepted method used by the data mining community to avoid bias is 10-fold cross-validation (Liu and Motoda 1998, p. 109). This simple method splits the ground truth data into ten subsets of size $\frac{n}{10}$. NILM algorithms are then trained on nine of the subsets, and accuracy testing is performed on the excluded subset. This is repeated ten times (each time a different subset is used for testing), and the mean accuracy is then calculated and reported.
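
A minimal sketch of this procedure is given below; `train_fn` and `test_fn` are hypothetical placeholders for training and scoring a NILM algorithm, and the splits are taken in time order (shuffling before splitting is a common variant):

```python
import numpy as np

def ten_fold_cv(data, train_fn, test_fn, folds=10):
    """Train on nine subsets, test on the held-out one, repeat for
    every fold, and return the mean accuracy over all folds."""
    idx = np.arange(len(data))
    scores = []
    for test_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, test_idx)  # the other 9 subsets
        model = train_fn(data[train_idx])
        scores.append(test_fn(model, data[test_idx]))
    return np.mean(scores)
```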

Classification accuracy

Researchers need to measure how accurately NILM algorithms can predict which appliance is running in which state. Classification accuracy measures, such as the f-score (a.k.a. f-measure), are well suited for this task. The f-score, often used in information retrieval and text/document classification, has also been used by NILM researchers (Figueiredo et al. 2012; Berges et al. 2010; Kim et al. 2010). It is the harmonic mean of precision and recall:

$$ F_{1} = 2 \cdot \frac{precision \cdot recall}{precision + recall} , \qquad precision = \frac{tp}{tp + fp}, \quad recall = \frac{tp}{tp + fn}, $$

where precision is the positive predictive value and recall is the true positive rate (or sensitivity); $tp$ counts true-positives (correctly predicted that the appliance was ON), $fp$ counts false-positives (predicted the appliance was ON when it was OFF), and $fn$ counts false-negatives (the appliance was ON but was predicted OFF). Note that these measures ($tp$, $fp$, $fn$) are accumulations over a given experimental time period. However, the f-score is generally used for binary classification.
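
For the binary case, the f-score follows directly from the accumulated counts; a sketch, assuming boolean NumPy arrays of ON/OFF states for one appliance:

```python
import numpy as np

def binary_fscore(pred_on, true_on):
    """F_1 over a test period for one appliance."""
    tp = np.sum(pred_on & true_on)    # predicted ON, was ON
    fp = np.sum(pred_on & ~true_on)   # predicted ON, was OFF
    fn = np.sum(~pred_on & true_on)   # predicted OFF, was ON
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```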

Kim et al. (2010) showed how the f-score could be modified to account for non-binary outcomes, such as a power signal (which we call M-fscore). Their approach combines appliance state classification and power estimation accuracies, even though in many instances classification and estimation are two distinct functions of a NILM algorithm. Combining them hides important diagnostic information about which parts of a NILM algorithm have low accuracy. Each function, whether classification or estimation, requires an accuracy measure suited to its performance; matching function with measure provides more detailed diagnostic and performance information.

To calculate the accuracy of non-binary classifications, we now define the finite-state f-score (FS-fscore). We introduce a partial penalization measure called the inaccurate portion of true-positives (inacc), which converts the binary nature of $tp$ into a discrete measure. The inacc of a given experimental test is

$$ inacc = \sum\limits^{T}_{t = 1} \cfrac{| \hat{x}^{(m)}_{t} - x^{(m)}_{t} |}{K^{(m)}}, $$
(4)

where $\hat{x}^{(m)}_{t}$ is the estimated state of appliance $m$ at time $t$, $x^{(m)}_{t}$ is the ground truth state, and $K^{(m)}$ is the number of states for appliance $m$. In other words, we penalize based on the distance (or difference) between the estimated state and the ground truth state. Precision and recall can now be redefined to account for these partial penalizations:

$$ precision = \frac{tp - inacc}{tp + fp} \quad \text{ and } \quad recall = \frac{tp - inacc}{tp + fn} . $$
(5)

The definition of the f-score remains the same. Summing $tp$, inacc, $fp$, and $fn$ over all $M$ appliances (and recalculating precision, recall, and f-score) allows the overall classification accuracy of the experimental test to be reported.
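
Putting Eqs. (4) and (5) together for a single appliance might look like the sketch below. We assume state 0 means OFF and count a true-positive whenever the predicted and ground truth states are both nonzero; this bookkeeping is our assumption, as the text leaves it to the implementer:

```python
import numpy as np

def fs_fscore(pred_states, true_states, num_states):
    """Finite-state f-score for one appliance with K = num_states.
    pred_states, true_states: integer state sequences of length T."""
    on_pred, on_true = pred_states > 0, true_states > 0
    tp = np.sum(on_pred & on_true)
    fp = np.sum(on_pred & ~on_true)
    fn = np.sum(~on_pred & on_true)
    # Eq. (4): partial penalty proportional to the state distance
    inacc = np.sum(np.abs(pred_states - true_states)) / num_states
    precision = (tp - inacc) / (tp + fp)  # Eq. (5)
    recall = (tp - inacc) / (tp + fn)     # Eq. (5)
    return 2 * precision * recall / (precision + recall)
```

For the overall score, the $tp$, inacc, $fp$, and $fn$ values would be accumulated across all $M$ appliances before recomputing precision, recall, and f-score.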

Estimation accuracy

Accuracies based on power estimation also need to be reported, showing how closely a NILM algorithm's estimate of consumed power matches actual consumption. This is important because systems that use NILM need to report to the occupants what portion of the power bill can be attributed to each appliance. Additionally, when dealing with time-of-use billing (charging more per kWh at peak times), occupants need to know how much might have been saved if certain appliances (e.g., a clothes dryer) had not been used during the peak period.

Different accuracy measures have been used to compare consumption estimation. Parson et al. (2012) used root mean square error (RMSE) to report estimation accuracy. However, RMSE is not normalized, so it is hard to compare how the disaggregation of one appliance performed relative to another. This becomes a bigger problem when comparing an appliance that consumes a large amount of power (e.g., heating) against an appliance that consumes very little (e.g., a fridge).

Normalized disaggregation error (NDE) (Kolter and Jaakkola 2012; Parson et al. 2012; Dong et al. 2013) has also been used to measure the estimation accuracy of an appliance. With this measure, we would subtract the summation of all $T$ ground truths from the summation of all $T$ estimations. However, subtracting the summations tends to report inflated accuracies because errors can cancel each other out. For example, given an estimation of 2 A and a ground truth of 0 A at time $t_1$, and an estimation of 0 A and a ground truth of 2 A at time $t_2$, the NDE would be 0 % when in fact 100 % would be the correct error score. The estimation accuracy measure of Kolter and Johnson (2011) and Johnson and Willsky (2013) calculates the correct value of 0 % accuracy (or 100 % error). We have chosen to use this estimation accuracy measure, defined as

$$ Est.\ Acc. = 1 - \cfrac{{\sum}^{T}_{t = 1} {\sum}^{M}_{m = 1} \lvert \hat{y}^{(m)}_{t} - y^{(m)}_{t} \rvert}{2 \cdot {\sum}^{T}_{t = 1} {\sum}^{M}_{m = 1} y^{(m)}_{t}} $$
(6)

where $T$ is the length of the time sequence (the number of disaggregated readings), $M$ is the number of appliances, $\hat{y}^{(m)}_{t}$ is the estimated power consumed at time $t$ by appliance $m$, and $y^{(m)}_{t}$ is the corresponding ground truth power. This measure allows overall estimation accuracy to be reported. By eliminating the summation over $M$, we can report the estimation accuracy for each appliance:

$$ Est.\ Acc.^{(m)} = 1 - \cfrac{{\sum}^{T}_{t = 1} \lvert \hat{y}^{(m)}_{t} - y^{(m)}_{t} \rvert}{2 \cdot {\sum}^{T}_{t = 1} y^{(m)}_{t}}. $$
(7)
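
Eqs. (6) and (7) reduce to the same expression once the arrays are shaped appropriately, since the sums simply run over whichever axes are present (a sketch under our array conventions):

```python
import numpy as np

def estimation_accuracy(pred_power, true_power):
    """Eq. (6) for arrays of shape (M, T); Eq. (7) for shape (T,)."""
    err = np.abs(pred_power - true_power).sum()
    return 1.0 - err / (2.0 * true_power.sum())

# Toy check of the cancellation example above: estimates [2, 0] A and
# ground truth [0, 2] A at t1, t2 give 1 - 4/4 = 0.0 (100 % error),
# whereas comparing the summed totals would have reported no error.
```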

Overall and appliance-specific accuracies

Both classification accuracy and estimation accuracy need to be reported as overall scores and as appliance-specific scores. Reporting how each appliance scores is important for identifying the strengths and weaknesses of different NILM algorithms. With this more detailed accuracy information, one could imagine a system that selects different algorithms depending on the context (including specific history) of the disaggregation task. It is also important to keep in mind that reported accuracies need to be normalized. Normalized results allow readers to understand the relative standings from one appliance to another and from each appliance to the overall accuracy. Finally, although more detailed information has its advantages, reporting specific scores for appliance states is not necessary, because different makes/models of appliances will have a different number of states at different power levels.

Experiment example

We investigated how basic accuracy can mislead by reporting high confidence numbers that do not reflect inaccuracies in predicting rare events (Supplementary data). This would be the case for most loads that are used sporadically. We also show why modified f-score, which combines classification and estimation, is not a detailed enough measure. We used the more detailed AMPds (Makonin et al. 2013) rather than REDD (Kolter and Johnson 2011) to illustrate the issues with these different measurements, using our own NILM algorithm (Makonin et al. 2014). Current draw ($I$) values were rounded up to the nearest whole ampere, and 10-fold cross-validation was used on the entire year of data. The whole-house current draw measurement was denoised so that it equalled the summation of the current draw from the 11 loads chosen for disaggregation (a %-NM of 0.00). The classification and estimation results are listed in Table 2. We have also provided other basic measures in Table 1. Additionally, we include true-negative ($tn$) counts and the accurate/inaccurate true-positives ($atp$ and $itp$) used in M-fscore, where $atp + itp = 1$; this can be seen as assigning partial accuracy, avoiding the binary nature of the true-positive ($tp$) score.

Table 2 Classification and estimation accuracy results

In all cases, basic accuracy scores far better than FS-fscore. This is most noticeable in the garage results. The inacc results show partial penalization, which is apparent when comparing the f-score with the FS-fscore. When we examine the M-fscore, we see that it scores lower than either the f-score or the FS-fscore, but it is hard to understand why. When examining the RMSE scores, it is hard to compare how appliances performed relative to each other or to the overall results because this score is not normalized. When comparing NDE with estimation accuracy, we see that in most instances NDE scores better. This is most apparent in the ent/tv/dvd load. Overall, the FS-fscore and estimation accuracy of our test score high, but this masks the fact that some loads (clothes washer, garage, and home office) did not score well. Furthermore, the home office and garage results show that a load can have a higher score for estimation but a lower score for classification, and the ent/tv/dvd results show the reverse: a higher score for classification and a lower score for estimation.

Conclusion

We presented a unified approach that allows for consistent accuracy testing amongst NILM researchers. Our approach takes into account both classification performance and estimation performance, not one or the other. Additionally, we include performance reporting at both the overall level and the appliance level. This evaluation strategy has been incorporated into our research, and we look forward to continuing the discussion and refinement of this framework as other NILM researchers address the issue of inconsistent accuracy reporting.