Introduction

Energy consumption has increased significantly in recent years, particularly in buildings, growing at a rate of approximately 0.9% per year in the USA [1]. Consistent with this trend, residential buildings consume approximately 38% of electricity and 21% of this energy [2]. Given buildings’ significant overall contribution to energy use, as well as environmental concerns and climate change, methods are needed to reduce this consumption. This is particularly the case for residential buildings, whose operation is highly dependent on occupants and their behavior [3,4,5,6]. There are many possible strategies to reduce energy use in residential buildings, the most common of which is retrofitting an existing building with more energy-efficient systems. Which retrofits are completed is often decided by the homeowner, based on a variety of factors [7, 8]. While non-energy-related factors can be influential in making such decisions [9], the most strongly cited reason is cost, i.e., the economics of the upfront costs, the rebates or incentives provided, and the energy savings that the retrofit(s) will achieve over time. Another method to reduce consumption is occupant energy behavior interventions, which aim to reduce consumption by altering the behavior of occupants, particularly how they use energy-consuming systems [10].

The ultimate decision of the homeowner to implement retrofits or change occupants’ behaviors can depend on the information provided to quantify the energy and cost savings achieved by the interventions, particularly if cost is the driving factor [11]. This quantification requires (a) a prediction of the energy consumption of the building in its existing state, (b) a prediction of energy consumption after the interventions, and (c) a translation of their relative difference into energy and cost savings [12]. Therefore, building energy use prediction methods, used to determine (a), (b), and ultimately (c), play a highly important role in building energy conservation [13]. The methods proposed and used in recent literature to predict the energy use of buildings range in complexity and in the frequency and duration of input data needed. Some methods have also been developed and tested only for specific building types. A recent review of these methods and their use for different types of data and building applications is thus needed, particularly as the availability and range of types of data for developing these algorithms are highly limited.
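The relationship between (a), (b), and (c) can be illustrated with a minimal sketch. The monthly values, the flat electricity price, and the function name `estimate_savings` are hypothetical and chosen only for illustration; in practice the two input series would come from one of the prediction methods reviewed later.

```python
def estimate_savings(baseline_kwh, post_kwh, price_per_kwh):
    """Return (energy savings in kWh, cost savings) from paired predictions.

    baseline_kwh: predicted use in the existing state, item (a)
    post_kwh: predicted use after the intervention, item (b)
    price_per_kwh: assumed flat energy price (hypothetical simplification)
    """
    if len(baseline_kwh) != len(post_kwh):
        raise ValueError("both predictions must cover the same periods")
    # Item (c): the difference between the two predictions, summed over periods.
    energy_saved = sum(b - p for b, p in zip(baseline_kwh, post_kwh))
    return energy_saved, energy_saved * price_per_kwh


# Hypothetical monthly predictions (kWh) for three months.
baseline = [900.0, 1100.0, 1000.0]
post_retrofit = [800.0, 950.0, 900.0]
energy_saved, cost_saved = estimate_savings(baseline, post_retrofit, price_per_kwh=0.5)
```

Because the savings are a difference of two predictions, errors in either prediction carry directly into the estimated savings, which is one reason prediction accuracy matters for homeowner decisions.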

This work reviews two critical topics in this research area. The first is a review of known available data and published information relevant to the development of methods to predict residential building energy consumption. This review covers the type(s), frequency, quality, and duration of data, and identifies the challenges and needs in the area of building energy datasets. The second is a review of recent published literature on the methods used to predict energy consumption in residential buildings, as well as methods developed for other building types that could be applied to residential buildings. The work concludes with the limitations of existing data and methods, and future research needs in this area.

Residential Building Energy and Non-energy Data: Sources, Availability, and Characteristics

Critical to the ability to develop, test, and validate methods to predict building energy use is the availability of data for algorithm development. This includes real building energy data, as well as non-energy data, such as characteristics of the building(s) and their occupants, and/or weather data, all of which have demonstrated impacts on energy use. For residential buildings, much of this information can be challenging to obtain, particularly for a large number of buildings. This section is divided into two main sub-sections, including first, an overview of residential energy data and second, residential non-energy use data. Both these sub-sections review the sources of data, data availability, and characteristics of datasets, such as frequency, quality, and duration.

Residential Energy Data

Historical energy use data includes electricity use, gas use, and, in some cases, other fuel use data collected at a regular frequency. In many cases, this historical data is used to train, test, and validate building energy use prediction methodologies. One of the largest sources of energy data is electric and gas utilities, which maintain large sets of energy use data from their residential customers. This data is collected and stored at a minimum of monthly frequency for all residential buildings, with some locations having higher-frequency data through the use of AMR (automatic meter reading) and AMI (advanced metering infrastructure) technologies [14, 15]. However, the barriers to using this energy use data, particularly for residential buildings, are often related to privacy and law [16]. There are a small number of exceptions, such as the city of Gainesville, Florida [17, 18], which provides public access to 6 years of monthly electricity and gas consumption data for all homes in the city; however, this type and availability of energy data is not common.

As a result, methods for predicting building energy use must often be developed and tested using limited data: energy measurements from a small number of occupied homes, energy measurements from real building(s) using simulated occupancy methods (e.g., using [19]), or energy use data from simulated buildings produced by a building energy modeling program such as EnergyPlus. While these limited datasets provide valuable information, larger datasets of real data can encompass energy use information for a wider variety of home types, locations, occupant behaviors, and other natural variations in energy consumption that smaller datasets cannot. Given the significant variations in energy and occupant patterns that can occur in residential buildings, larger datasets can provide a more comprehensive understanding of how well a methodology works in comparison to others.

An alternative to collecting energy use data from utilities is to obtain it directly from homeowners, who have access to utility-collected monthly data and, in some cases, 15-min or hourly data if a smart meter is installed in their home [20]. In rare cases, homeowners may have minute, sub-minute, or sub-metered data from a home energy monitoring system; however, these systems are not currently common. Thus, with homeowner consent, energy data can be obtained for algorithm development, although large-scale collection of this information is time-consuming and costly. There are, however, some efforts towards more open access to energy use data, some available datasets, and broader platforms created to enable easier sharing of datasets.

Arguably, more information is currently available on commercial building energy use than on residential building energy use. For commercial buildings, there are more policies supporting the public availability of energy information, particularly in large cities and for publicly owned buildings. Large cities such as Boston [21], New York City [22], and Washington D.C. [23], among others, have enacted laws and/or ordinances requiring energy benchmarking. Under these laws, buildings must report energy consumption on a regular basis; the reports are compiled into databases and often made publicly available. In some cases, such as Boston [21, 24], this data includes larger non-residential and multi-family residential buildings. However, the data in these datasets is reported only at the annual level, which is of limited use for building energy prediction methods.

Similar benchmarking efforts could also be beneficial for residential buildings, particularly if the data were collected at an appropriate frequency. For example, the ECAD Ordinance [25] in Austin, TX, requires that an energy audit be completed for all residential buildings over 10 years of age that are bought and sold, the results of which are compiled into a centralized database, and the city of Chicago allows for disclosure of energy use and/or costs during the sale of a home [26]. These and other policy-enforced energy data sources could be highly valuable. Some local policy-enforced data sources are available, such as energy use by census block in Chicago [27], energy use by zipcode in New York [28,29,30], and aggregated annual energy use savings for homes in Austin [31]. However, these datasets are also aggregated and, in most cases, available only at the annual level.

Other efforts collect data from a variety of sources on commercial and/or residential energy use in a common location. The Building Information Database [32], supported by the US Department of Energy (DOE) consists of datasets of residential and commercial building energy use intensity on an annual basis, building characteristics and systems, and location. Similarly, the DOE-supported Building Dataset [33] contains information on energy use, building operations and analysis tools for buildings-related datasets, and the Energy Data Resources site [34] collects information on sources of energy data and tools from energy-related projects. The types of data vary, but do include datasets with energy consumption at varying levels of frequency.

There are a small number of datasets of residential energy use information that provide higher-frequency and, in some cases, disaggregated end-use energy data for residential buildings. A large-scale study in the Pacific Northwest in the 1980s and 1990s collected whole-home and end-use data for residential buildings [35]. Many research papers were written based on this dataset, and the aggregated data is available online [36]. The results of this effort are also still used today in residential energy modeling programs [37, 38] for end-use modeling. The most recent large-scale US data collection effort for residential building data known to the authors is in Austin, Texas [39••]. This database provides up to 1-min interval electricity and gas consumption for a large number of homes from 2012 to the present and includes whole-home and end-use consumption. A number of recent research papers have used this database to study residential energy use [40,41,42]. Given the current cost of the equipment needed to obtain higher-frequency and disaggregated data, it is unlikely that other efforts of this scale will occur frequently moving forward. However, given recent efforts to improve the ease of installing energy data equipment and collecting data, as well as improved abilities to disaggregate energy use using higher-frequency whole-home energy data (e.g., [43]), lower-cost tools and/or equipment to obtain energy data of this frequency and quality for a larger number of residential buildings may become more feasible.

Non-energy Data

Non-energy data, linked with the energy data, also has an important role in energy use prediction. Weather data is among the most critical non-energy factors affecting residential building energy use, particularly the energy use of HVAC systems, which are present in a high percentage of US residential buildings. Weather data is often available from public sources of ground-based weather station data, most commonly at airports [41, 44, 45]. However, as some recent research efforts have found, this weather data is not necessarily representative of the conditions where the studied residential buildings are located. For example, recent efforts have found variations in localized wind speeds and temperatures (e.g., [46]). The state of the art in this general area has been summarized in several recent research articles (e.g., [47, 48]) and thus is not the focus of discussion herein. However, it is still important to note that while modeling methods and research efforts in building microclimates are significant, access to raw weather data that represents well the actual conditions experienced by buildings is still a challenge. More recently, some fields of study have adopted publicly available satellite-based weather data from MERRA [49], which is available worldwide on a regularly spaced grid. The use of this dataset reduces the dependency on ground-based weather stations.

Building characteristics, such as size, fuel type, HVAC system type, age, efficiency, appliance types, thermostat preferences, air exchange rate, and building envelope characteristics, can also have a strong impact on energy consumption. Thus, while knowing this information can be highly beneficial, in many cases it is not available or not linked with building-specific energy use data. The best publicly available building data originate from disparate sources, including assessors’ data, MLS data, cities’ GIS databases, and LIDAR data. However, if energy use datasets are anonymized for privacy reasons, linking energy and non-energy datasets becomes very challenging.
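The linking problem described above can be sketched as a join on a shared building identifier. All records, field names, and identifiers below are hypothetical; the sketch only illustrates that an anonymized energy record has no key in common with the characteristics data and therefore drops out of the linked set.

```python
def link_records(energy_by_id, traits_by_id):
    """Join energy and building-characteristic records on a shared building ID."""
    linked = {}
    for building_id, annual_kwh in energy_by_id.items():
        traits = traits_by_id.get(building_id)
        if traits is not None:  # keep only buildings present in both sources
            linked[building_id] = {"annual_kwh": annual_kwh, **traits}
    return linked


# Hypothetical energy data; the last record carries an anonymized identifier.
energy_by_id = {"parcel-001": 11200, "parcel-002": 8400, "anon-7f3a": 9900}
# Hypothetical assessor-style characteristics keyed by parcel identifier.
traits_by_id = {
    "parcel-001": {"floor_area_m2": 180, "year_built": 1995},
    "parcel-002": {"floor_area_m2": 120, "year_built": 2008},
    "parcel-003": {"floor_area_m2": 150, "year_built": 1978},
}

linked = link_records(energy_by_id, traits_by_id)
unlinked = sorted(set(energy_by_id) - set(linked))  # energy records left without traits
```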

Some datasets, such as the national-level US Census data [50], RECS data [51], and American Community Survey data [52], and localized datasets such as the Green Building Aggregate data in Austin, Texas [31], provide aggregate-level residential building and occupant characteristic data that enable an understanding of building characteristics at a broader scale than the building level. The Better Building Neighborhood Program [53] provides a large anonymized building-level dataset representing over 75,000 building energy-related characteristics, specified by region and zipcode information. This and the aggregated datasets can be useful for determining the likely characteristics of a building in a specific area, or for use in community-scale energy use prediction methods (e.g., [54]), but they are of limited benefit for building-level energy consumption prediction because they are not linked to energy use data for specific residential buildings. The datasets mentioned in the previous section, including the Building Information Database [32], the Building Dataset [33], and the Energy Data Resources dataset [34], do contain some building energy use information linked to building characteristic data.

In summary, building energy and non-energy datasets are available, and their characteristics range significantly. There are some promising sources of quality, higher-frequency data which can be valuable for residential energy consumption prediction methods. There are also promising methods of encouraging the sharing of data that can be further explored. However, significant opportunities remain to improve data availability in this field, which, if realized, will be highly beneficial to improving the capabilities of energy performance prediction methods.

Building Energy Performance Prediction Methods

Building energy performance prediction methods, which use energy and non-energy data sources, range significantly in complexity and in the required types and frequencies of input data. Most recent efforts have followed similar methodologies for model development, as discussed in Wang and Srinivasan [13]: (a) data are collected for model development; (b) the raw data are processed to ensure that they are of sufficient quality and in a suitable format; (c) historical data are used to train the model to follow the patterns of use in the training dataset, and to determine which of the available inputs are significant and are ultimately used in the model; and (d) the model is tested by determining and evaluating its fit to input data not included in the training dataset. Common metrics and statistical indices used for evaluation include root mean square error, coefficient of determination, coefficient of variation of the root mean square error, sum of squares error, mean squared error, and normalized mean bias error. Energy use prediction methods can be physics-based approaches, data-driven inverse modeling approaches, or a combination of the two [55•]. In this section, the most recent efforts in energy performance prediction methods are reviewed, most of which are data-driven methods.
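The evaluation metrics listed above can be computed as in the following minimal sketch. It assumes the simplest form of each index, dividing by the number of samples n; some guidelines instead adjust the denominator for the number of model parameters, and sign conventions for the bias error vary between sources. The function name and sample values are hypothetical.

```python
import math


def evaluation_metrics(measured, predicted):
    """Compute common fit metrics for paired measured and predicted energy values."""
    n = len(measured)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_measured = sum(measured) / n
    residuals = [m - p for m, p in zip(measured, predicted)]  # measured minus predicted
    sse = sum(r * r for r in residuals)                       # sum of squares error
    mse = sse / n                                             # mean squared error
    rmse = math.sqrt(mse)                                     # root mean square error
    # Total sum of squares; zero if all measured values are identical.
    sst = sum((m - mean_measured) ** 2 for m in measured)
    return {
        "SSE": sse,
        "MSE": mse,
        "RMSE": rmse,
        "R2": 1.0 - sse / sst,                                # coefficient of determination
        "CV_RMSE_pct": 100.0 * rmse / mean_measured,          # CV of the RMSE, in percent
        "NMBE_pct": 100.0 * sum(residuals) / (n * mean_measured),  # normalized mean bias error
    }


# Hypothetical daily energy use (kWh) held out from training.
metrics = evaluation_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
```

In this example the over- and under-predictions cancel, so the bias error is zero even though the RMSE is not; this is why bias and scatter metrics are usually reported together.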

Change-Point Modeling

Change-point modeling is among the simpler methods; these models are typically single-variate, using dry-bulb temperature as the predictor. A balance point, at which building energy use switches between seasonal trends, is determined as the point that best fits the trends in the energy data [55•]. Linear regression is then used to create a multi-parameter model based on the chosen fit criteria [56••, 57]. Perez et al. [58] focused on its use to predict daily consumption of residential HVAC systems in Austin, TX, using data from [39••]. Kim and Haberl [59] used three-parameter change-point models to calibrate daily whole-building energy simulations for two single-family homes based on monthly billing data. Do et al. [40, 60] used a large number of homes across multiple climate zones to study change-point models, demonstrating that these methods can fit a wide range of homes’ use patterns. Zhang et al. [56••] used change-point modeling to predict hourly and daily HVAC hot water energy, and Abushakra and Paulus [61,62,63] used a hybrid inverse change-point model to predict consumption in simulated and actual buildings; however, both of these efforts focused on commercial buildings.
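As an illustration of the general idea, the sketch below fits a three-parameter cooling model, energy = base + slope × max(T − change point, 0), by trying each candidate change point and keeping the one whose least-squares fit has the lowest squared error. The grid search, the function names, and the synthetic data are assumptions made for this example and are not the procedure of any particular study cited above.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0.0:  # no variation in x: fall back to a flat line
        return mean_y, 0.0
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_three_parameter_cooling(temps, energy, candidates):
    """Grid-search the change point; fit base load and cooling slope at each."""
    best = None
    for change_point in candidates:
        # Only degrees above the change point drive temperature-dependent use.
        excess = [max(t - change_point, 0.0) for t in temps]
        base, slope = fit_line(excess, energy)
        sse = sum((e - (base + slope * x)) ** 2 for x, e in zip(excess, energy))
        if best is None or sse < best["sse"]:
            best = {"change_point": change_point, "base": base, "slope": slope, "sse": sse}
    return best


# Synthetic daily data generated from a known model (base 10, slope 2, change point 18 C).
temps = [float(t) for t in range(5, 36)]
energy = [10.0 + 2.0 * max(t - 18.0, 0.0) for t in temps]
model = fit_three_parameter_cooling(temps, energy, candidates=range(10, 31))
```

On real residential data the fit would not be exact, and occupant-driven outliers can shift the selected change point noticeably.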

The strength of change-point models is their simpler development and lower computational effort in comparison to other methods [55•, 56••]. The accuracy of prediction with change-point models depends on the type and frequency of data available, but has been shown to be similar to that of more complex models in some situations [56••]. This method can be particularly advantageous for buildings with a limited number of data points. However, as discussed in [40, 59], some data points can be considered outliers that may significantly impact the model fit, particularly for highly occupant-dependent residential buildings. With acceptable methods to assess what data is appropriate to use for residential building models, as well as model improvements such as those suggested by Abushakra and Paulus [61,62,63], this modeling method provides a simpler but often sufficiently accurate method.

Artificial Neural Networks

Artificial Neural Networks (ANN) consist of an input layer, one or more hidden layers, and an output layer, and have mostly been used for higher-frequency (hourly or sub-hourly) building energy consumption prediction in recent literature [56••,64•]. Input variables typically include outdoor temperature, wind speed, solar radiation, and relative humidity. These methods have been used to predict whole-home, HVAC, and appliance use in residential buildings [64•, 65], and hot water [56••], heating energy [66], total electricity [54, 67], and chilled water use [68] for commercial buildings. ANN has also been combined with other methods and/or enhancements to improve and/or optimize performance, including feed forward backpropagation neural network, radial basis function network, and adaptive neuro-fuzzy inference system [66], backpropagation algorithms [64•,69], particle swarm optimization and genetic algorithms [54], principal component analysis [54, 70], and hybrid lightning search algorithms [65].
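The layer structure described above can be illustrated with a minimal sketch of a network with one hidden layer, trained by stochastic gradient descent with backpropagation. The two scaled inputs (standing in for outdoor temperature and solar radiation), the synthetic target, the network size, and the learning rate are all hypothetical; studies in the literature typically rely on established libraries rather than hand-written training loops.

```python
import math
import random


def init_network(n_inputs, n_hidden, rng):
    """Small random weights for one tanh hidden layer and a single linear output."""
    return {
        "w1": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "w2": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(net, x):
    """Return hidden activations and the predicted energy use."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(net["w1"], net["b1"])]
    output = sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"]
    return hidden, output


def train_step(net, x, target, lr):
    """One backpropagation update for the loss 0.5 * (output - target)^2."""
    hidden, output = forward(net, x)
    error = output - target
    for j, h in enumerate(hidden):
        # Gradient at the hidden pre-activation, using the output weight before its update.
        grad_hidden = error * net["w2"][j] * (1.0 - h * h)
        net["w2"][j] -= lr * error * h
        net["b1"][j] -= lr * grad_hidden
        for i, xi in enumerate(x):
            net["w1"][j][i] -= lr * grad_hidden * xi
    net["b2"] -= lr * error


def mean_squared_error(net, samples):
    return sum((forward(net, x)[1] - y) ** 2 for x, y in samples) / len(samples)


# Synthetic samples: inputs scaled to [0, 1]; the target is non-linear in temperature.
samples = [([t / 10, s], 4.0 * (t / 10 - 0.5) ** 2 + 0.5 * s)
           for t in range(11) for s in (0.0, 0.5, 1.0)]
rng = random.Random(0)
net = init_network(n_inputs=2, n_hidden=4, rng=rng)
before = mean_squared_error(net, samples)
for _ in range(200):
    for x, y in samples:
        train_step(net, x, y, lr=0.05)
after = mean_squared_error(net, samples)
```

The sketch evaluates error on the training samples only; in the workflow described earlier, testing would use data held out from training.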

ANN generally performs well with sufficient training data and can be particularly advantageous for non-linear electricity consumption [64•, 68]. Wang and Srinivasan [13] also found that ANN methods perform better than regression methods in short-term prediction. Enhancements to ANN methods have further improved accuracy [54, 71] and lowered error [70]. However, the complexity of the model also increases computational time [72], and ANN models have limited physical interpretability, which limits their applicability outside the range of the training data [13]. In some cases, ANN has also been found to perform worse than simpler models [56••]. In recent literature, ANN has been used to predict whole-home consumption only for unoccupied, rather than occupied, residential buildings [64•].

Genetic Programming

Genetic programming is an automated computational method based on the process of biological evolution [73] and has been used in combination with other methods to predict residential energy consumption. Castelli et al. [73] applied different genetic programming systems that use genetic semantic operators to predict residential HVAC use. Jung et al. [74] combined a hybrid of the direct search optimization algorithm and a conventional real-coded genetic algorithm with a least-squares support vector machine to predict daily commercial building energy. Genetic programming has been shown to be an effective method that produces lower errors than other methods [73], to provide an effective approach for parameter selection, and to perform better in terms of convergence time and iterations than conventional least-squares support vector machine methods. However, similar to ANN, genetic programming typically requires a larger set of input data. It has also been used in only a limited number of studies for residential buildings.
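To make the evolutionary idea concrete, the sketch below evolves small arithmetic expressions toward a synthetic target using random mutation and elitist selection. It is a deliberately reduced illustration: crossover and the semantic operators used in the cited work are omitted, and the operator set, population size, and target function are hypothetical.

```python
import math
import random

OPERATORS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b, "mul": lambda a, b: a * b}


def random_tree(rng, depth):
    """Build a random expression over the input 'x' and small constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else float(rng.randint(-2, 2))
    op = rng.choice(sorted(OPERATORS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, x):
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPERATORS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree


def mutate(tree, rng, depth=2):
    """Replace a randomly chosen subtree with a fresh random one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, depth)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng, depth), right)
    return (op, left, mutate(right, rng, depth))


def fitness(tree, samples):
    """Mean squared error; non-finite results are treated as infinitely bad."""
    total = 0.0
    for x, y in samples:
        diff = evaluate(tree, x) - y
        total += diff * diff
    return total / len(samples) if math.isfinite(total) else float("inf")


def evolve(samples, rng, population_size=30, generations=20):
    population = [random_tree(rng, 3) for _ in range(population_size)]
    history = []  # best error at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda t: fitness(t, samples))
        history.append(fitness(population[0], samples))
        survivors = population[: population_size // 2]  # elitist selection
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
    population.sort(key=lambda t: fitness(t, samples))
    return population[0], history


# Synthetic target: y = x^2 + x, sampled on a small grid.
samples = [(x / 2, (x / 2) ** 2 + x / 2) for x in range(-4, 5)]
best, history = evolve(samples, random.Random(0))
best_error = fitness(best, samples)
```

Because the best half of each generation is carried forward unchanged, the best error cannot increase from one generation to the next, although nothing guarantees that the exact target expression is found.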

Bayesian Networks

Bayesian Network models are probabilistic graphical models in which nodes represent random variables, such as outdoor temperature and energy use, with statistical and probabilistic dependencies between the cause nodes and the effect nodes [76]. The parameters of such models are the conditional distributions at every node, and inference uses Bayes’ rule. This method has been used to predict appliance energy use in residential buildings [75] and hot water HVAC use in a commercial building [76]. Bassamzadeh and Ghanem [77] also used this model to forecast the aggregated electricity demand in smart grids. In the limited number of studies that have used this method for building energy use prediction, the accuracy of the model predictions was within the recommended limits developed by ASHRAE for commercial buildings [76]. The uncertainties from input variables were also determined to be well represented by this type of method [77]. However, similar to the ANN and genetic programming methods discussed above, this method requires significant input data and can be highly complex to implement.
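A two-node network, with outdoor temperature as the cause node and energy use as the effect node, is enough to illustrate how conditional distributions and Bayes’ rule work together. All states and probabilities in this sketch are hypothetical; networks in the cited studies contain more nodes and have parameters learned from data.

```python
# Prior distribution at the cause node (outdoor temperature category).
prior_temp = {"cold": 0.25, "mild": 0.5, "hot": 0.25}

# Conditional distribution at the effect node: P(energy level | temperature).
p_energy_given_temp = {
    "cold": {"low": 0.25, "high": 0.75},
    "mild": {"low": 0.75, "high": 0.25},
    "hot": {"low": 0.5, "high": 0.5},
}


def marginal_energy(level):
    """P(energy = level), obtained by summing over the cause node."""
    return sum(prior_temp[t] * p_energy_given_temp[t][level] for t in prior_temp)


def posterior_temp(level):
    """P(temperature | energy = level), by Bayes' rule."""
    evidence = marginal_energy(level)
    return {t: prior_temp[t] * p_energy_given_temp[t][level] / evidence
            for t in prior_temp}


p_high = marginal_energy("high")        # forward prediction of the effect
posterior = posterior_temp("high")      # diagnosis of the cause from the effect
```

The same network therefore answers both a prediction question (how likely is high energy use) and a diagnostic question (which conditions most plausibly produced an observed high use), with uncertainty carried through as probabilities.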

Gaussian Mixture Model

The Gaussian mixture model (GMM) establishes a weighted sum of Gaussian component densities based on a parametric probability density function and a multivariate non-linear regression function [56••]. This method has been used in a number of recent studies for a range of buildings. Li et al. [78] utilized GMM to design feasible time-of-use tariffs to minimize the electricity bills of residential customers. Also for residential buildings, Melzi et al. [79] used GMM to optimize smart meter electricity consumption and to better understand consumer behavior and electricity use profiles. For other types of buildings, Zhang et al. [56••] applied GMM to predict daily and hourly commercial hot water energy, and Carpenter et al. [80•] predicted supplied energy for a range of manufacturing processes in an industrial building. The advantage of this method found in [56••] was that its energy performance predictions had the lowest error compared to change-point and ANN models for commercial buildings. Studies have also found that other statistical measures of fit are higher for GMM than for change-point modeling [80•]. The GMM has also been found to capture non-linearity in a simpler way than Bayesian or ANN methods [56••, 80•] for non-residential buildings. However, its performance in comparison to other methods for residential buildings is not well studied.
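The two ingredients named above, a weighted sum of Gaussian densities and a regression function derived from it, can be sketched for a single predictor. In this illustration each component describes the joint behavior of outdoor temperature x and energy use y, and the prediction is the responsibility-weighted average of the per-component conditional means. The component parameters are hypothetical and are not fitted to any data; fitting (for example by expectation-maximization) is omitted.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


# Hypothetical joint components over temperature x (deg C) and energy use y (kWh):
# a heating regime (use falls as x rises) and a cooling regime (use rises with x).
components = [
    {"weight": 0.5, "mean_x": 5.0, "mean_y": 40.0, "var_x": 16.0, "cov_xy": -24.0},
    {"weight": 0.5, "mean_x": 28.0, "mean_y": 30.0, "var_x": 16.0, "cov_xy": 32.0},
]


def mixture_density(x):
    """Weighted sum of Gaussian component densities for the predictor."""
    return sum(c["weight"] * normal_pdf(x, c["mean_x"], c["var_x"]) for c in components)


def predict_energy(x):
    """Expected energy use given x: a non-linear blend of per-component linear fits."""
    density = mixture_density(x)
    estimate = 0.0
    for c in components:
        # Probability that this observation belongs to the component.
        responsibility = c["weight"] * normal_pdf(x, c["mean_x"], c["var_x"]) / density
        conditional_mean = c["mean_y"] + c["cov_xy"] / c["var_x"] * (x - c["mean_x"])
        estimate += responsibility * conditional_mean
    return estimate


midpoint_prediction = predict_energy(16.5)
```

Each component contributes only a straight line, yet the blended prediction bends smoothly between the heating and cooling regimes, which is how the mixture captures non-linearity with relatively simple parts.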

Support Vector Machines

The final modeling method discussed is Support Vector Machines (SVM). This method has been shown to be effective in solving regression estimation problems and forecasting time series [72]. Jain et al. [81] used a version of SVM for regression estimation, Support Vector Regression, to evaluate the effect of the temporal and spatial granularity of data on the prediction of energy use in multi-family buildings. SVM has also been combined with genetic algorithms to predict energy use [74]. SVM has been assessed as a highly accurate and effective method for energy prediction [72]. However, SVM requires multi-step forecasts, implemented using various features and selected techniques [81]; it is therefore more complicated and requires more computational effort than the other models discussed. Similar to other methods, it would also benefit from additional evaluation for residential building energy performance prediction.
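The core of Support Vector Regression is a loss that ignores residuals smaller than a tolerance epsilon, combined with a penalty on model complexity. The sketch below trains a single-feature linear model on this objective using plain subgradient descent. It is an assumption-laden simplification: kernels, the dual formulation solved by standard SVM libraries, and multi-step forecasting are omitted, and the data and hyperparameters are hypothetical.

```python
def epsilon_insensitive(residual, epsilon):
    """Loss that is zero inside the epsilon tube and linear outside it."""
    return max(abs(residual) - epsilon, 0.0)


def svr_objective(w, b, xs, ys, epsilon, c):
    """Regularization term plus c times the total epsilon-insensitive loss."""
    penalty = sum(epsilon_insensitive(y - (w * x + b), epsilon) for x, y in zip(xs, ys))
    return 0.5 * w * w + c * penalty


def train_linear_svr(xs, ys, epsilon=0.1, c=1.0, lr=0.01, epochs=500):
    """Subgradient descent on the primal objective of a linear model."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        grad_w, grad_b = w, 0.0  # gradient of the 0.5 * w^2 term
        for x, y in zip(xs, ys):
            residual = y - (w * x + b)
            if residual > epsilon:       # prediction too low
                grad_w -= c * x
                grad_b -= c
            elif residual < -epsilon:    # prediction too high
                grad_w += c * x
                grad_b += c
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical scaled data: one predictor and a linear energy response.
xs = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]
initial_objective = svr_objective(0.0, 0.0, xs, ys, epsilon=0.1, c=1.0)
w, b = train_linear_svr(xs, ys)
final_objective = svr_objective(w, b, xs, ys, epsilon=0.1, c=1.0)
```

With a constant step size the iterates hover near the minimum rather than converging exactly, so the sketch is useful for showing the shape of the objective rather than as a practical solver.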

In summary, a number of different types of methods have been used in recent literature to predict the energy consumption of residential buildings. Table 1 presents a summary of the six main methods of building energy performance prediction. However, particularly for residential buildings, it is challenging to compare capabilities and determine the overall “best” model for residential energy performance prediction, in part because few studies compare the performance of the models using residential datasets. Many of the algorithms have been developed, utilized, and tested for commercial building applications, and may be well suited for residential buildings as well. Some residential building energy prediction studies have used larger datasets [58, 77]; however, the number of studies with datasets of this size is limited, for both residential and commercial buildings. The type of energy data being predicted also varies. Some studies focus on methods to predict whole-building consumption [54, 67], while others focus on HVAC [58] or other end uses [75]. Finally, the frequency and type of energy use data used to develop and test these models range significantly.

Table 1 Summary of the building energy performance prediction methods

Conclusions

In summary, this review discusses both sources of energy and non-energy data, as well as methods that use these data to predict energy consumption. This review points to the need for the availability of more residential building energy and non-energy data sources to be able to improve energy performance prediction models, and the need to more comprehensively and comparatively study the accuracy of these models for residential buildings across a range of frequencies of data, and whole-home as well as end-use consumption. More specifically, the following conclusions can be drawn:

  • Most available datasets provide energy or non-energy data; however, these are generally not linked together or cannot be linked because they are anonymized; this limits the usability of these datasets for energy use prediction methods. Datasets that link energy and non-energy data, with higher frequencies and quantities of data, are needed

  • Many available national-level and local-level datasets of energy use provide annual-level data. Given that energy use prediction methods are often developed with the goal of predicting energy use at higher frequencies, this limits the usability of the data. There are some recent efforts to make data from large-scale studies and law-mandated data available; however, more efforts are needed in this area, including for the datasets associated with publications in this area, almost none of which are available for broader use. Recent efforts to improve the infrastructure, ease, and motivation for energy data sharing [82, 83] may help to improve this moving forward

  • Further and more comprehensive testing is needed to assess the different energy prediction methods at different data frequencies; this will help to assess which models are most appropriate and best able to predict consumption for each frequency level, as this is currently not well established

  • Similarly, many of the prediction methods discussed have been tested for commercial buildings more than for residential, and in many cases, only tested for specific end uses; testing of the possible methods across larger sets of diverse residential buildings could provide a more comprehensive picture of capabilities of these methods

  • The complexity of prediction models ranges significantly, as well as the amount of input data needed. Further clarity is needed as to the positives and negatives associated with more complex versus less computationally complex methods

As more technologies become available that connect to the internet and are able to collect energy and non-energy data, such as through the internet of things, there is a significant opportunity to improve energy prediction methods. As energy efficiency continues to be a priority, improved data combined with improvements in prediction algorithms using this data will help to improve the accuracy and reliability of such models, and as a result, likely drive efficiency improvements as well.