1 Introduction

Buildings account for 40% of global energy consumption and 60% of global electricity use [1,2,3]. Heating, Ventilation and Air-Conditioning (HVAC) systems consume around 40% of building energy and are closely related to indoor thermal comfort [4].

Researchers in recent years have found that acquiring occupancy information may be the key to making HVAC systems more energy-efficient. Balaji et al. found that billions of dollars could be saved in building operations if HVAC systems were operated according to actual indoor occupancy patterns [5]. Other studies showed that integrating indoor occupancy into HVAC controls can greatly enhance energy efficiency and indoor thermal comfort [6,7,8]. To ensure building energy efficiency, it is therefore essential to obtain accurate information on indoor occupancy and to integrate such information into building system controls.

Studies related to indoor occupancy estimation can be divided into two groups. The first predicts whether the room is occupied or not, which is in essence a binary classification task. The second predicts the actual number of occupants. Previous studies typically adopted two types of devices for prediction, i.e., devices with or without terminals. Devices with terminals include smart bracelets and wireless transmitters that support communication technologies such as Wi-Fi, Bluetooth and Radio-Frequency Identification. Devices without terminals include video surveillance, occupant counters, smart meters and so on. Despite their high prediction accuracy, such devices may compromise privacy, especially personal devices and video surveillance.

To address such challenges, studies are being conducted to adopt environmental data for indoor occupancy estimation. Such solutions are particularly attractive due to their low implementation costs and non-intrusive nature [9]. Candanedo et al. collected data on light intensity, temperature, relative humidity and CO2 concentration and used them as inputs for predicting indoor occupant numbers [10]. Pedersen et al. adopted volatile organic compound sensors to predict whether a room is occupied or not [11]. The results indicated that CO2 and volatile organic compounds had almost the same importance in predictive models. Diaz and Jimenez conducted an experiment to show that CO2 measurements were useful in predicting indoor occupancy status, but not occupant numbers [12]. Wang et al. enhanced the accuracy of indoor occupant number prediction to 91% and proposed an energy-saving control strategy for buildings [13]. Despite these encouraging results, there is still a lack of studies that systematically investigate the usefulness of different environmental data and machine learning techniques for both indoor occupancy classification and regression tasks.

This study proposes a non-intrusive method to accurately predict indoor occupancy using environmental data and machine learning techniques. A two-month experiment was designed to obtain high-frequency data on the environmental conditions of a conference room. Five environmental variables, namely temperature, relative humidity, CO2 concentration, light intensity and noise level, were collected and used as model inputs. Four machine learning techniques, namely fully connected neural networks (FCNN), extreme gradient boosting trees (XGB), long short-term memory (LSTM) networks and support vector machines (SVM), were used together with over-sampling and under-sampling techniques to predict indoor occupancy status and occupant numbers.

This paper is organized as follows. Section 2 introduces the research methodology. Section 3 describes the experimental settings and the data obtained. Section 4 presents the results on the binary classification and regression tasks of indoor occupancy estimation. The conclusions are drawn in Sect. 5. The main contributions are summarized as follows:

  1. A thorough investigation of the impacts of different environmental variables and their collection frequencies on occupancy estimation.

  2. A quantification of the performance of machine learning techniques in occupancy estimation.

  3. A quantification of the value of data sampling techniques, especially for imbalanced modeling tasks.

2 Methodology

2.1 Data Collection

The human body metabolizes continuously, giving off heat to the surrounding environment. A healthy adult radiates 100 Watts of heat at rest and up to 1000 Watts during exercise [14]. At the same time, CO2 and water are produced by metabolism. Occupants are the main source of indoor CO2 increments [15]. Even in the daytime, people often turn on the lamps for sufficient light when they enter a room. Driven by awareness of energy conservation, people usually turn off the lamps when they leave. People also inevitably produce sound through conversation or interaction with objects. Therefore, the indoor temperature, relative humidity, CO2 concentration, light intensity and noise level are all affected by the number of occupants in the room, which makes it possible to detect indoor occupancy from these environmental data.

Occupant positions may vary from case to case. To reduce the data fluctuations and variations caused by occupants in different positions, sensors should be evenly placed in each direction of the room. These environmental variables also fluctuate naturally over the daily cycle without any interference from occupants. In the indoor environment, temperature, relative humidity and CO2 concentration fluctuate most noticeably. Therefore, when detecting indoor occupancy, the corresponding outdoor environmental data also need to be collected as a reference.

2.2 Data Preprocessing

Raw data collected by sensors often contain outliers and missing values. Feeding the raw data directly into a model for indoor occupancy detection often produces unsatisfactory results. The data preprocessing steps are as follows:

  1. Identify abnormal values and treat them as missing values. For example, a noise level of 0 dB is an abnormal value, because it is impossible in a normal environment.

  2. Apply different methods to deal with the missing values according to their types.

  3. Perform data standardization to ensure the validity of predictive modeling. Different environmental data may have different ranges. For example, the normal range of CO2 concentration is 300–3000 ppm, while the absolute value of temperature is typically less than \(50\,^\circ {\text{C}}\). As shown in Eq. 1, Z-score standardization is commonly used for data standardization.

    $${x}^{*}= \frac{x-\overline{x}}{\sigma }$$
    (1)

    where \(\overline{x }\) and \(\sigma\) are the mean and standard deviation of the raw data respectively.
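The three preprocessing steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, linear interpolation is an assumed strategy for missing values (the text does not state which methods were used), and the population standard deviation is assumed for \(\sigma\) in Eq. 1.

```python
from statistics import mean, pstdev


def mark_abnormal_as_missing(values, is_abnormal):
    """Step 1: replace abnormal readings with None so they count as missing."""
    return [None if (v is None or is_abnormal(v)) else v for v in values]


def fill_missing_linear(values):
    """Step 2 (assumed strategy): linear interpolation between the nearest
    valid neighbours; gaps at either end copy the nearest valid value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no valid observations to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def z_score(values):
    """Step 3: Z-score standardization as in Eq. 1 (population sigma assumed)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; cannot standardize")
    return [(v - mu) / sigma for v in values]


# Illustrative noise-level readings in dB; the 0 dB reading is abnormal.
noise_raw = [42.0, 0.0, 46.0, 44.0]
noise_clean = mark_abnormal_as_missing(noise_raw, lambda v: v <= 0)
noise_filled = fill_missing_linear(noise_clean)
noise_std = z_score(noise_filled)
```

After standardization each variable has zero mean and unit standard deviation, so that CO2 concentration (hundreds of ppm) and temperature (tens of degrees) contribute on comparable scales.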

2.3 Prediction Model Development

This study investigates the relationships between indoor occupancy and environmental variables. It contains two types of prediction models. The first focuses on indoor occupancy detection, which is in essence a binary classification model. The second focuses on predicting the actual occupant numbers, which is in essence a regression model.

For each type of prediction, four state-of-the-art supervised machine learning techniques are used in this study: (1) fully connected neural networks (FCNN); (2) support vector machine or support vector regression (SVM or SVR); (3) long short-term memory networks (LSTM); (4) extreme gradient boosting trees (XGB).

Model parameters should be fine-tuned to ensure the overall model performance. In this study, the hidden unit number, learning rate, dropout ratio are optimized for FCNN and LSTM models. For SVM and SVR models, the penalty cost is optimized. For XGB, the max depth of the decision tree, learning rate, and the data sampling ratio for each iteration are considered. The grid-search method is adopted for parameter optimization.
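As an illustration of the grid-search idea, the sketch below enumerates every combination in a parameter grid and keeps the best-scoring one. The grid values, the parameter names and the `toy_evaluate` function are hypothetical placeholders; in practice the evaluation function would train a model on the training data and return a validation score.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score).

    `evaluate` maps a parameter dict to a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for XGB: tree depth, learning rate, per-iteration sampling ratio.
xgb_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}


def toy_evaluate(params):
    """Stand-in for 'train the model, then score it on validation data'."""
    return -(
        abs(params["max_depth"] - 5)
        + abs(params["learning_rate"] - 0.1)
        + abs(params["subsample"] - 0.8)
    )


best_params, best_score = grid_search(xgb_grid, toy_evaluate)
```

Grid search is exhaustive, so its cost grows with the product of the grid sizes (27 combinations here), which is why only a few parameters per model are tuned.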

2.4 Prediction Model Evaluation

2.4.1 Evaluation Metrics for Binary Classification Tasks

As shown in Table 1, a two-dimensional confusion matrix can be used to evaluate the binary classification performance of indoor occupancy detection, i.e., occupied or unoccupied. Each element in the confusion matrix represents the number of test observations. Each row represents actual values while each column represents predicted values. The diagonal elements are correct predictions, while the non-diagonal elements represent incorrect predictions.

As shown in Eq. 2 and Eq. 3, evaluation metrics such as accuracy and F1-score can be calculated based on true positive (TP), true negative (TN), false positive (FP) and false negative (FN), where N represents the number of observations in the test data set. Accuracy is often used as a simple indicator to evaluate the overall performance of classification models. F1 score considers both the precision and recall of the classification models.

Table 1. An illustrative confusion matrix of occupancy detection.

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}=\frac{TP+TN}{N}$$
(2)
$$F1= \frac{2TP}{2TP+FN+FP}$$
(3)
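A short sketch of how these metrics follow from the confusion matrix counts is given below. The labels are illustrative (1 = occupied, 0 = unoccupied) and the function names are hypothetical; accuracy is the share of correct predictions among all N test observations, and the F1 score follows Eq. 3.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels (1 = occupied, 0 = unoccupied)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 0:
            tn += 1
        elif a == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by the number of test observations N."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, tn, fp, fn):
    """F1 score as in Eq. 3; returns 0.0 when there are no positives at all."""
    denominator = 2 * tp + fn + fp
    return 2 * tp / denominator if denominator else 0.0


# Illustrative labels, not data from the experiment.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]
counts = confusion_counts(actual, predicted)
acc = accuracy(*counts)
f1 = f1_score(*counts)
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive and 1 false negative, so both accuracy and F1 equal 0.75.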

2.4.2 Evaluation Metrics for Regression Tasks

The mean absolute error (MAE) is adopted in this study to evaluate the regression model performance. The MAE is the average of the absolute errors between the actual and predicted occupant numbers. The equation for MAE is shown in Eq. 4, where \(m\) is the number of observations, and \(h({x}^{i})\) and \({y}^{i}\) are the predicted and actual occupant numbers for the \(i\)th observation respectively.

$$MAE=\frac{1}{m}\sum_{i=1}^{m}\left|h\left({x}^{i}\right)-{y}^{i}\right|$$
(4)
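Eq. 4 translates directly into code. The sketch below uses illustrative occupant numbers and a hypothetical function name.

```python
def mean_absolute_error(actual, predicted):
    """MAE as in Eq. 4: average absolute gap between predicted and actual values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative occupant numbers, not data from the experiment.
actual_counts = [0, 2, 5, 3]
predicted_counts = [0.5, 2.0, 4.0, 3.5]
mae = mean_absolute_error(actual_counts, predicted_counts)
```

Here the absolute errors are 0.5, 0, 1 and 0.5, giving an MAE of 0.5, i.e., the predictions are off by half an occupant on average.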

3 Experimental Settings

The experiment was conducted in a conference room at Shenzhen University. It lasted from March 16, 2021 to May 17, 2021, resulting in data measurements collected over 62 days. The meeting room has a floor area of 8 m × 5 m and a height of 3.5 m. The indoor occupancy does not follow fixed schedules, e.g., there may be meetings at midnight. People entering the meeting room can freely control the parameters of the HVAC equipment.

3.1 Experimental Equipment

As shown in Fig. 1, five kinds of sensors were used to collect environmental data inside and outside the conference room, i.e., temperature, relative humidity, light intensity, noise level and CO2 concentration. The wireless information collection terminal in the upper left corner of Fig. 1 records data from the temperature, relative humidity and CO2 concentration sensors. The wired environmental monitoring host in the upper right corner of Fig. 1 records data from the noise level and light intensity sensors. All data were collected at 1-min intervals. The sensor specifications are shown in Table 2. In addition, a counting device was installed on the ceiling above the door of the conference room to provide the ground truth. The instrument detects the entry and exit of occupants, based on which the indoor occupant number is calculated at 5-min intervals.

Fig. 1. Hosts for communication and environmental sensors.
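One way the door counter's entry and exit events could be turned into occupant numbers at 5-min intervals is sketched below. The event format (minute, +1 for entry, -1 for exit), the assumption that the room starts empty, and the clipping of the count at zero are all assumptions for illustration; the text does not describe the device's internal logic.

```python
def occupancy_series(events, total_minutes, step=5):
    """Running occupant count sampled every `step` minutes.

    `events` is a list of (minute, delta) pairs, where delta is +1 for an
    entry and -1 for an exit (hypothetical format). The room is assumed
    to start empty, and the count is never allowed to drop below zero.
    """
    ordered = sorted(events)
    counts = []
    current = 0
    index = 0
    for t in range(step, total_minutes + 1, step):
        # Apply every event that happened up to and including minute t.
        while index < len(ordered) and ordered[index][0] <= t:
            current = max(0, current + ordered[index][1])
            index += 1
        counts.append(current)
    return counts


# Illustrative events: two entries, one exit, one entry, then two exits.
door_events = [(2, +1), (3, +1), (7, -1), (12, +1), (14, -1), (14, -1)]
occupants = occupancy_series(door_events, total_minutes=15)
```

For these events the counts at minutes 5, 10 and 15 are 2, 1 and 0 respectively.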

As shown in Fig. 2, sensors are evenly arranged across the conference room. Data collected by sensors of the same type are integrated to ensure data quality. The sensors detecting indoor noise were placed close to the seats for better sensitivity. Sunlight was blocked out to ensure that the main light source for the sensors measuring indoor illuminance is the indoor lamps. Several sensors were also deployed outside the room to collect outdoor environmental data.

Table 2. Sensor specifications.

Fig. 2. The conference room layout and sensor placements.

3.2 Data Description

In this experiment, 87591 valid observations of environmental data were collected at 1-min intervals, and 17519 valid observations of occupant numbers were collected at 5-min intervals. Figure 3 shows the data distributions of the indoor and outdoor environmental variables. Figure 4 shows the distribution of occupant numbers during the experiment.

Fig. 3. Density diagram of temperature, relative humidity, CO2 concentration, noise level and light intensity across the entire data set.

Fig. 4. Histogram of occupant numbers during the experiment period.

As shown in Fig. 3, except for the CO2 concentration, the indoor environmental data are more concentrated than the outdoor data, which indicates that the indoor environment is more stable. It is noteworthy that the x-axis of the light intensity density map is log-transformed. In the daytime, the outdoor light intensity is usually more than 10000 lx, while at night it is concentrated around 1 lx. As mentioned above, the main light source for the indoor sensors is the lamps, so the range of indoor light intensity is relatively small compared with the outdoor range. Table 3 lists statistical descriptors of the input variables over the total data set.

Table 3. The statistics of the variables of the total data set.

A preliminary exploration of the data set showed that outdoor noise level, outdoor light intensity, outdoor relative humidity and outdoor CO2 concentration have little effect on detecting indoor occupancy. Therefore, in the rest of this paper, only the remaining variables, i.e., indoor temperature, outdoor temperature, indoor relative humidity, indoor CO2 concentration, indoor noise level, indoor light intensity and hour, are taken as the input variables of the machine learning models.

Fig. 5. Pairs plot of input variables.

Figure 5 is a pairs plot showing the relationships between input variables. In Fig. 5, the blue dots indicate that the room is occupied, while the red dots indicate that it is unoccupied. “Hour” refers to the time at which the environmental data were collected. A clear separation boundary between the red and blue dots in the row and column of an input variable shows that this variable has great potential for detecting occupancy. Otherwise, the input variable contributes little to detecting indoor occupancy.

For example, there is a clear separation boundary between the red and blue dots in the row and column of the indoor light intensity, which shows that indoor light intensity has great potential for detecting indoor occupancy. The same pattern appears in the row and column of the indoor noise level.

The joint effect of two input variables on detecting indoor occupancy can be examined in the panel where the two variables intersect. For example, in the panel of indoor temperature and indoor relative humidity, there is no clear boundary between the red and blue dots, which indicates that it is difficult to build a model that accurately detects indoor occupancy based on these two input variables alone. The same holds for the combinations of indoor temperature and outdoor temperature, outdoor temperature and indoor relative humidity, indoor temperature and indoor CO2 concentration, outdoor temperature and indoor CO2 concentration, and indoor relative humidity and indoor CO2 concentration.

4 Discussions

4.1 Results on Occupancy Detection

In this section, four machine learning methods are applied to answer three questions. First, which machine learning method is the most effective in detecting indoor occupancy? Second, when the models perform poorly because the proportions of occupied and unoccupied data in the training data set are highly imbalanced, what method can effectively solve this problem? Third, can the accuracy of the models be improved by increasing the frequency of data collection?

We divided the 17519 observations into two parts: a training data set (12263 observations) containing 70% of the total data set, and a test data set containing the remaining 5256 observations. In general, the prediction accuracy of a machine learning model on the training set is of limited interest. Instead, we focus on how well the model predicts on the test set. The best model is the one with the highest test accuracy.

To ensure that the accuracy obtained on the test data set reflects the real performance of the model, we set the ratio of occupied to unoccupied data in the test set to 1:1. Table 4 details the numbers of occupied and unoccupied observations in the training and test sets. It is noteworthy that under this data partitioning pattern there are only 1561 occupied observations in the training data set (12.73% of the whole training data set). In general, models trained on such an imbalanced data set do not perform satisfactorily.

Table 4. Summary on data partitioning.
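The partitioning described above can be sketched as follows. The function name and the random selection are assumptions; the class totals (4189 occupied and 13330 unoccupied observations) are not stated in the text but are derived from the reported figures, assuming the 5256 test observations are split exactly 1:1 and 1561 occupied observations remain for training.

```python
import random


def split_with_balanced_test(labels, test_size, seed=0):
    """Draw a test set with equal numbers of each class; the rest is training.

    Returns (train_indices, test_indices). Labels are 1 = occupied,
    0 = unoccupied.
    """
    rng = random.Random(seed)
    occupied = [i for i, y in enumerate(labels) if y == 1]
    unoccupied = [i for i, y in enumerate(labels) if y == 0]
    per_class = test_size // 2
    if per_class > min(len(occupied), len(unoccupied)):
        raise ValueError("not enough observations in one class for a 1:1 test set")
    test_idx = rng.sample(occupied, per_class) + rng.sample(unoccupied, per_class)
    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)


# Class totals derived from the reported figures (see the assumptions above).
all_labels = [1] * 4189 + [0] * 13330
train_idx, test_idx = split_with_balanced_test(all_labels, test_size=5256)
train_occupied = sum(all_labels[i] for i in train_idx)
train_occupied_share = train_occupied / len(train_idx)
```

Reserving half of the test set for the minority class leaves only 1561 occupied observations among the 12263 training observations, which is the 12.73% share that makes the training set imbalanced.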

Figure 6 shows the input patterns corresponding to the two environmental data collection frequencies. With a 1-min collection interval, the average of the environmental data at the current time and over the previous 4 min is used to predict the current indoor occupancy. With a 5-min collection interval, only the environmental data at the current moment are used to predict the indoor occupancy at the current moment.

Fig. 6. Data entry patterns for different data collection frequencies.
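The two input patterns can be sketched for a single environmental variable as follows. The function name, the mode labels and the assumption that occupancy labels align with every fifth 1-min reading are illustrative choices, not details given in the text.

```python
def five_minute_inputs(minute_values, mode):
    """Build one model input per 5-min occupancy label from 1-min readings.

    mode "1min": average of the current reading and the previous four.
    mode "5min": only the reading at the current moment.
    Labels are assumed to align with indices 4, 9, 14, ... of `minute_values`.
    """
    inputs = []
    for end in range(4, len(minute_values), 5):
        if mode == "1min":
            window = minute_values[end - 4 : end + 1]
            inputs.append(sum(window) / len(window))
        elif mode == "5min":
            inputs.append(minute_values[end])
        else:
            raise ValueError("mode must be '1min' or '5min'")
    return inputs


# Illustrative CO2 readings in ppm over ten minutes.
co2_ppm = [400, 410, 420, 430, 440, 500, 520, 540, 560, 580]
averaged_inputs = five_minute_inputs(co2_ppm, "1min")
instant_inputs = five_minute_inputs(co2_ppm, "5min")
```

For these readings the 1-min pattern yields 420 and 540 ppm, while the 5-min pattern yields 440 and 580 ppm. Averaging smooths out sensor noise within each window, which may partly explain any benefit of the higher collection frequency.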

Figure 7 shows the accuracy of the different machine learning methods at the two data collection frequencies. To minimize the influence of random factors on model performance, every reported accuracy is the average over 10 repeated runs of the model.

When the training data set is highly imbalanced (occupied observations account for only 12.73%, and the rest are unoccupied observations), only LSTM among the four classification methods performs well, with accuracies of 92.67% and 91.41% at data collection intervals of 1 min and 5 min respectively. XGB is the worst classification method, with accuracies of 62.84% and 64.39% respectively. Given the powerful classification capabilities of machine learning methods, this is clearly not a satisfactory result.

Fig. 7. Accuracy of different classification methods with different data collection frequencies.

The reason for this low performance is that our training data set is highly imbalanced. This phenomenon clearly illustrates the influence of class distribution on the learning of classification models.

To alleviate the class imbalance in our training data set, we used over-sampling and under-sampling respectively to balance the class ratio. Over-sampling was achieved with the synthetic minority over-sampling technique (SMOTE). In this method, new samples of the minority class (occupied class) are artificially generated using the K nearest neighbors of each minority class sample. The minority class was over-sampled using the SMOTE function in the R package smotefamily with K = 5. As shown in Table 4, the number of occupied observations in the original training data set was only 1561. With SMOTE, 7805 new occupied observations (500% of the original size) were generated. Table 5 shows how the composition of the training data set changes after over-sampling. The proportion of occupied observations increased from 12.73% to 46.67%, which alleviates the imbalance of the training data set.
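The core idea of SMOTE can be illustrated with a minimal pure-Python sketch. This is not the smotefamily implementation used in the study; the function name, the Euclidean distance and the toy data are assumptions. Each synthetic sample is placed at a random point on the line segment between a minority sample and one of its K nearest minority neighbors.

```python
import math
import random


def smote_sketch(minority, n_per_sample, k=5, seed=0):
    """Generate `n_per_sample` synthetic points for every minority sample.

    Each synthetic point interpolates between a minority sample and one of
    its k nearest minority neighbours (Euclidean distance assumed).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        neighbours = sorted(others, key=lambda j: math.dist(x, minority[j]))[:k]
        for _ in range(n_per_sample):
            neighbour = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy standardized feature vectors for six occupied observations.
occupied_samples = [
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [1.1, 1.3], [0.9, 0.7], [1.3, 1.0],
]
synthetic_samples = smote_sketch(occupied_samples, n_per_sample=5, k=5)
```

With five synthetic points per original sample, the 1561 occupied training observations yield 1561 × 5 = 7805 new observations, so the occupied class grows to 9366 of 20068 training observations, i.e., 46.67%.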

Under-sampling equalizes the numbers of positive and negative observations by removing some of the majority class (unoccupied class) observations from the original data set.

For this data set, removing 9141 unoccupied observations makes the numbers of occupied and unoccupied observations equal, which resolves the class imbalance.
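Under-sampling can be sketched as below. Random selection of the majority observations to keep is an assumption, since the text does not state how the removed observations were chosen; the function name is hypothetical. The class sizes follow from the training set described earlier (1561 occupied out of 12263 observations).

```python
import random


def undersample_indices(labels, seed=0):
    """Keep all minority observations and an equal-sized random subset of the
    majority class. Returns the sorted indices of the observations kept."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted((positives, negatives), key=len)
    kept_majority = rng.sample(majority, len(minority))
    return sorted(minority + kept_majority)


# Training set composition: 1561 occupied and 12263 - 1561 = 10702 unoccupied.
train_labels = [1] * 1561 + [0] * 10702
kept = undersample_indices(train_labels)
removed = len(train_labels) - len(kept)
kept_occupied = sum(train_labels[i] for i in kept)
```

Here `removed` equals 9141, matching the number of unoccupied observations dropped to reach a 1:1 class ratio.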

Table 5. Comparison of training data set before and after over-sampling and under-sampling.

Figure 8 shows the accuracy of each classification method before and after applying the over-sampling and under-sampling techniques. Because the LSTM algorithm requires temporally continuous input data, the over-sampling technique is not suitable for LSTM. After applying over-sampling, the accuracy of the other three classification methods (FCNN, XGB, SVM) is greatly improved. XGB performs best, with accuracies of 97.11% and 95.86% at data collection intervals of 1 min and 5 min respectively, compared with only 62.84% and 64.39% before over-sampling.

For LSTM, whose classification performance is already excellent before under-sampling, the under-sampling technique still improves performance. Its accuracy increases from 92.67% and 91.41% to 95.04% and 94.46% at data collection intervals of 1 min and 5 min respectively. These results indicate that the performance of all four classification methods can be improved by applying under-sampling or over-sampling techniques to alleviate the imbalance of the training data set.

As shown in Fig. 8, the accuracies of models built on the over-sampled data set (20068 samples) and the under-sampled data set (8378 samples) are very close, which indicates that the number of data samples has little effect on the accuracy of the model.

Fig. 8. Accuracy of different classification methods before and after over-sampling and under-sampling.

Fig. 9. Accuracy of classification methods in different data collection frequencies.

As shown in Fig. 9, for all four machine learning methods applied in this study, the accuracy with a collection interval of 1 min is higher than that with 5 min. This indicates that once the imbalance of the original data set is resolved, increasing the data collection frequency can slightly improve the performance of machine learning models in detecting indoor occupancy.

4.2 Results on Occupant Numbers Predictions

Predicting indoor occupant numbers can be regarded as a regression problem. The negative influence of imbalanced data on machine learning techniques is not as great in regression problems as in classification problems, so there is no need to perform over-sampling or under-sampling. The 17,519 observations were randomly divided into two parts, of which the training set accounted for 70% (12,263 observations) and the test set accounted for 30% (5,256 observations). It is noteworthy that the ratio of the various occupant numbers in the training and test sets was not deliberately controlled when dividing the data.

Figure 10 shows the MAE of the four machine learning models when predicting occupant numbers at the two data collection frequencies. The results show that XGB is the best model, with MAEs of 0.32 and 0.39 for the 1-min and 5-min data collection intervals respectively. LSTM is the worst model, with MAEs of 0.47 and 0.51 for the two collection intervals. Figure 11 shows the MAE across different hours. It is evident that the prediction errors are relatively high during office hours, especially from 12 p.m. to 2 p.m. and from 8 p.m. to 10 p.m. One possible explanation is that the data variations during these periods are relatively large due to irregular meeting events.

Fig. 10. MAE values of the occupant numbers predicted by the four machine learning methods.

Fig. 11. MAE values predicted by the four machine learning methods at different times.
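An hourly error breakdown of the kind shown in Fig. 11 can be computed by grouping absolute errors by the hour of each observation. The sketch below uses a hypothetical function name and illustrative values rather than data from the experiment.

```python
from collections import defaultdict


def mae_by_hour(hours, actual, predicted):
    """Return {hour: MAE} by averaging absolute errors within each hour."""
    error_sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, a, p in zip(hours, actual, predicted):
        error_sums[hour] += abs(p - a)
        counts[hour] += 1
    return {hour: error_sums[hour] / counts[hour] for hour in sorted(error_sums)}


# Illustrative observations: hour of day, actual and predicted occupant numbers.
obs_hours = [9, 9, 13, 13, 13]
obs_actual = [0, 1, 4, 6, 2]
obs_predicted = [0, 2, 3, 8, 2]
hourly_mae = mae_by_hour(obs_hours, obs_actual, obs_predicted)
```

In this toy example the MAE is 0.5 at 9:00 and 1.0 at 13:00, mirroring the pattern of higher errors during busier periods.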

5 Conclusions

In this study, non-intrusive solutions are proposed for indoor occupancy detection using environmental sensors including temperature, relative humidity, noise level, CO2 concentration and light intensity. Four state-of-the-art machine learning techniques together with data sampling techniques have been adopted for both binary classification and regression tasks. The methods proposed have been validated through experimental data. The main findings are summarized as below:

  1. For the occupancy detection task (i.e., a binary classification problem), the main data challenge is the potential data imbalance issue, e.g., the room is unoccupied for most of the time. In such a case, the XGB, FCNN and SVM models do not perform well with an accuracy of around 70%. By contrast, the LSTM model has rather good consistency and the classification accuracy can be as high as 92%.

  2. Data sampling techniques can be applied to enhance the performance of binary classification tasks. For instance, the accuracy of the FCNN model can be enhanced from 71.6% to 95.3% and 96.2% when over-sampling and under-sampling techniques are used respectively.

  3. For the occupant numbers prediction task (i.e., a regression problem), the XGB model performs the best with MAEs of 0.32 and 0.39 considering 1-min and 5-min data collection intervals respectively. The LSTM model performs the worst with an averaged MAE of around 0.49.

  4. The higher the data collection frequency, the better the prediction performance. For instance, compared with the 5-min collection interval, the use of 1-min data helps to decrease the MAE of SVM models in occupant numbers prediction from 0.44 to 0.36.

The results validate the usefulness of non-intrusive methods for indoor occupancy estimation. The research outcomes are helpful for devising occupancy-centric measures for building energy conservation.