
1 Introduction

Accurate prediction of the electricity demand of buildings is vital for effective and cost-efficient energy management in commercial buildings. It also plays a significant role in maintaining a balance between electricity supply and demand in modern power grids. However, forecasting energy usage during anomalous periods, such as the COVID-19 pandemic, can be challenging due to changes in occupancy patterns and energy usage behavior. One of the primary reasons is the shift in the distribution of occupancy patterns: with many people working or learning from home, residential occupancy increased while occupancy in offices, schools, and most retail establishments decreased. Essential businesses, such as grocery stores and restaurants, might experience a divergence between occupancy and energy usage, as they have fewer dine-in customers but still require energy for food preparation and sales. This has created a need for new forecasting methods that can adapt to changing occupancy patterns.

Online learning has emerged as a promising solution to this challenge, as it enables building managers to adapt to changes in occupancy patterns and adjust energy usage accordingly. With online learning, models can be updated incrementally with each new data point, allowing them to learn and adapt in real time [13].

Furthermore, continual learning methods offer an even more powerful solution by addressing the issue of catastrophic forgetting [6, 17]. These methods allow models to retain previously learned information while accommodating new data, preventing the loss of valuable insights and improving generalization in out-of-distribution scenarios. By combining online learning with continual learning techniques, energy forecasting models can achieve robustness, adaptability, and accuracy, making them well-suited for handling the challenges posed by spatiotemporal data with evolving distributions.

Another solution is to use human mobility data as a proxy for occupancy, leveraging the prevalence of mobile devices to track movement patterns and infer occupancy levels. Human mobility data can be useful in this context as it provides a way to monitor occupancy patterns without relying on traditional sensors or manual data collection methods [28].

In this study, we evaluate the effectiveness of mobility data and continual learning for forecasting building energy usage during anomalous periods. We utilized real-world data from Melbourne, Australia, a city that experienced one of the strictest lockdowns globally [4], making it an ideal case for studying energy usage patterns during out-of-distribution periods. We conducted experiments using data from four building complexes to empirically assess the performance of these methods.

2 Related Work

2.1 Energy Prediction in Urban Environments

Electricity demand profiling and forecasting has been a task of importance for many decades. Nevertheless, only a limited body of work in the literature investigates how human mobility patterns relate directly to urban-scale energy consumption, both during normal periods and during adverse or extreme events. Energy modelling in the literature is done at different granularities: occupant level (personal energy footprinting), building level, and city level. Models used for energy consumption prediction in urban environments are known as Urban Building Energy Models (UBEM). While top-down UBEMs predict aggregated energy consumption in urban areas using macro-economic variables and other aggregated statistical data, bottom-up UBEMs are better suited for building-level modelling of energy by clustering buildings into groups with similar characteristics [2]. Examples include SUNtool, CitySim, UMI, CityBES, TEASER, and HUES. Software modelling (simulation-based) is also a heavily used approach for building-wise energy prediction (e.g., EnergyPlus [7]). Owing to their fine-grained end-user-level modelling, bottom-up UBEMs can incorporate occupant schedules as inputs. Occupant-wise personal energy footprinting systems also exist. However, such occupant-wise energy footprinting requires monitoring infrastructure and sensors for indoor occupant behaviour, which are not always available. Moreover, due to privacy issues, publicly available data at finer temporal resolutions (for both occupancy and energy) can be hard to obtain for modelling at end-user granularity [33]; building-wise energy models face the same problem. Simulation-based models have scalability issues at the city level because one model must be built per building, and the assumptions they make about the data render their outputs less accurate [1]. Consequently, how to conduct energy forecasting under data distribution shifts remains largely an open research area.

2.2 Mobility Data as Auxiliary Information in Forecasting

The study of human mobility patterns involves analysing the behaviours and movements of occupants in a particular area in a spatio-temporal context [28]. The amount of information that mobility data encompasses can be vast, and the behaviour patterns of humans drive decision making in many use cases. In particular, mobility data can act as a proxy for dynamic (time-varying) human occupancy at various spatial densities (building-wise, city-wise, etc.). Such data are therefore leveraged extensively for many tasks in urban environments that depend on human activity, including predicting water demand [31], urban flow forecasting [34], predicting patterns in hospital patient rooms [8], and electricity use [12].

During the COVID-19 pandemic in particular, mobility data proved very useful for disease propagation modelling. For example, the authors of [32] developed a Graph Neural Network (GNN) based deep learning architecture to forecast daily new COVID-19 cases state-wise in the United States. The GNN is constructed such that each node represents one region and each edge represents the interaction between two regions in terms of mobility flow. The daily new case counts, death counts, and intra-region mobility flow are used as node features, whereas the inter-region mobility flow and the flow of active cases are used as edge features. Comparisons against classical models that do not use mobility data have demonstrated the competitiveness of the developed model.

Nevertheless, as [28] state, the existing studies involving human mobility data lack diversity in their datasets in terms of social demographics, building types, locations, etc. Owing to the heterogeneity, sparsity, and difficulty of obtaining diverse mobility data, incorporating them into modelling techniques remains a significant research challenge [2]. The potential of extracting valuable information from such real-world data sources thus remains largely untapped, despite its promise for building smarter automated decision-making systems for urban planning [28].

2.3 Deep Learning for Forecasting

Deep learning has gained significant popularity in the field of forecasting, with various studies demonstrating its effectiveness in different domains [11]. For instance, it has been widely applied in mobility data forecasting, including road traffic forecasting [24,25,26], and flight delay forecasting [30]. In the realm of electricity forecasting, Long Short-Term Memory (LSTM) networks have been widely utilized [21]. Another popular deep learning model for electricity load forecasting is Neural basis expansion analysis for interpretable time series forecasting (N-BEATS) [20].

However, a common challenge faced by these deep learning methods is performance degradation when data distributions change rapidly, especially during out-of-distribution (OOD) periods. Online learning methods have been proposed to address this issue [14, 16, 18], but they can suffer from catastrophic forgetting, where newly acquired knowledge erases previously learned information [28]. To mitigate this, continual learning methods have been developed that aim to retain previously learned information while accommodating new data, thereby improving generalization in OOD scenarios.

One approach to continual learning is Experience Replay [6, 17], a technique that re-exposes the model to past experiences to improve learning efficiency and reduce the effects of catastrophic forgetting. Building upon this idea, the Dark Experience Replay++ algorithm [5] utilizes a memory buffer to store past experiences and a deep neural network to learn from them, employing a dual-memory architecture that allows for the storage of both short-term and long-term memories separately. Another approach is the Fast and Slow Network (FSNet) [22], which incorporates a future adaptor and an associative memory module. The future adaptor facilitates quick adaptation to changes in the data distribution, while the associative memory module retains past patterns to prevent catastrophic forgetting. These continual learning methods have shown promise in mitigating catastrophic forgetting and improving generalization in OOD scenarios.

In the context of energy forecasting, the utilization of continual learning techniques holds great potential for addressing the challenges posed by OOD spatiotemporal data. By preserving past knowledge and adapting to new patterns, these methods enable more robust and accurate energy forecasting even during periods of rapid data distribution shifts.

3 Problem Definition

3.1 Time Series Forecasting

Consider a multivariate time series \(\mathcal {X}\in \textbf{R}^{T\times N}\) comprising mobility data, weather data, and the target variable, the energy consumption. The time series consists of T observations and N dimensions. To perform H-step-ahead time series forecasting, a model f takes as input a look-back window of L historical observations \((\textbf{x}_{t-L+1},\textbf{x}_{t-L+2},...,\textbf{x}_{t})\) and generates forecasts for H future observations of the target variable y, which corresponds to the energy consumption of a building. We have:

$$\begin{aligned} f_{\boldsymbol{\omega }}(\textbf{x}_{t-L+1},\textbf{x}_{t-L+2},...,\textbf{x}_{t}) = (y_{t+1},y_{t+2},...,y_{t+H}), \end{aligned}$$
(1)

where \(\boldsymbol{\omega }\) denotes the parameters of the model.
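As a concrete illustration, the following minimal NumPy sketch (the function and variable names are our own, purely illustrative) constructs the look-back/horizon pairs of Eq. 1 from a \(T\times N\) series, with the target taken to be the last column:

```python
import numpy as np

def make_windows(X: np.ndarray, L: int, H: int, target_col: int = -1):
    """Slice a (T, N) multivariate series into look-back/horizon pairs.

    Returns inputs of shape (num_windows, L, N) and targets of shape
    (num_windows, H), taken from the target (energy) column.
    """
    T = X.shape[0]
    inputs, targets = [], []
    for t in range(L, T - H + 1):
        inputs.append(X[t - L:t])                # (x_{t-L+1}, ..., x_t)
        targets.append(X[t:t + H, target_col])   # (y_{t+1}, ..., y_{t+H})
    return np.stack(inputs), np.stack(targets)

# Day-ahead setting used in this paper: hourly data with L = H = 24.
X = np.random.rand(1000, 3)          # e.g. mobility, temperature, energy
inputs, targets = make_windows(X, L=24, H=24)
print(inputs.shape, targets.shape)   # (953, 24, 3) (953, 24)
```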

3.2 Continual Learning for Time Series Forecasting

In a continual learning setting, the conventional machine learning practice of separating data into training and testing sets with a \(70\%\) to \(30\%\) ratio does not apply, as learning occurs continuously over the entire period. After an initial pre-training phase using a short period of training data, typically the first 3 months, the model continually trains on incoming data and generates predictions for future time windows. Evaluation of the model’s performance is commonly done by measuring its accumulated errors throughout the entire learning process [27].
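A minimal sketch of this protocol, using a toy PyTorch linear model on a synthetic univariate stream (all names and sizes are illustrative, and the pre-training phase is elided): each incoming window is first forecast, its error accumulated, and only then used to update the model.

```python
import torch
import torch.nn as nn

# Toy setup: a linear model forecasting H steps from L lagged values of a
# synthetic univariate stream; all names and sizes are illustrative.
L, H = 24, 24
stream = torch.randn(2000)
model = nn.Linear(L, H)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

errors = []  # accumulated test errors over the whole stream
for t in range(L, len(stream) - H):
    x = stream[t - L:t].unsqueeze(0)   # look-back window
    y = stream[t:t + H].unsqueeze(0)   # future horizon
    with torch.no_grad():
        errors.append(loss_fn(model(x), y).item())  # forecast first ...
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()    # ... then update on the newly
    optimizer.step()                   # revealed observations

print("accumulated MSE:", sum(errors) / len(errors))
```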

Fig. 1. The convolution architecture in TCN.

4 Method

Continual learning presents unique challenges that necessitate specialized algorithms and evaluation metrics. In this context, a continual learner must strike a balance between retaining previously acquired knowledge and facilitating the learning of new tasks. In time-series forecasting, the challenge lies in balancing the need to learn new temporal dependencies quickly against remembering past patterns, a tension commonly referred to as the stability-plasticity dilemma [9]. Building on complementary learning systems theory for dual learning systems [15], a Temporal Convolutional Network (TCN) is utilized as the underlying architecture, pre-trained to extract temporal features from the training dataset. Subsequently, the convolutional layers of the TCN are augmented with a future adaptor and an associative memory module to address the challenges of continual learning. The future adaptor facilitates quick adaptation to changes, while the associative memory module retains past patterns to prevent catastrophic forgetting. In this section, we describe the architecture of FSNet [22] in detail.

4.1 Backbone: Temporal Convolutional Network

FSNet adopts the TCN proposed by Bai et al. [3] as the backbone architecture for extracting features from time series data. Although traditional Convolutional Neural Networks (CNNs) have shown great success in image-processing tasks, their performance in time-series forecasting is often unsatisfactory. This is due to several reasons: (a) the difficulty of capturing contextual relationships using CNNs, (b) the risk of information leakage caused by traditional convolutions that incorporate future temporal information, and (c) the loss of detail associated with pooling layers that extract contour features. In contrast, TCN's superiority over CNNs can be attributed to its use of causal and dilated convolutions, which capture temporal dependencies more effectively.

Causal Convolutions. In contrast to traditional convolutions, which may incorporate future temporal information and violate causality, causal convolutions avoid leaking future data. By considering only information up to and including the current time step, causal convolutions preserve the order in which data is modelled and are therefore well suited for temporal data. To ensure that the output tensor has the same length as the input tensor, zero-padding is required; performing the zero-padding only on the left side of the input tensor guarantees causality. In Fig. 1(a), this left-side zero-padding is shown in light colours. No padding is needed on the right side, because the last element of the input sequence is the latest element on which the rightmost output element depends. The kernel window of the second-to-last output element is shifted one position to the left, so its latest dependency in the input sequence is the second-to-last input element. By induction, the latest dependency of each output element has the same index in the input sequence as the element itself.
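As an illustration, the following PyTorch sketch (channel counts and kernel size are arbitrary, chosen only for this example) shows that padding \(k-1\) zeros on the left only preserves both the sequence length and causality:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

kernel_size = 3
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=kernel_size)

x = torch.randn(1, 1, 10)                  # (batch, channels, time)
x_padded = F.pad(x, (kernel_size - 1, 0))  # k-1 zeros on the left only
y = conv(x_padded)
print(y.shape)  # torch.Size([1, 1, 10]): length preserved, and output t
                # depends only on inputs up to and including time t
```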

Dilated Convolutions. Dilated convolution is an important component of TCN because causal convolution alone can only access past inputs up to a depth determined by the kernel size of the convolutional layers; in a deep network, the receptive field of the last layer may still not be large enough to capture long-term dependencies in the input sequence. In a dilated convolution, the dilation factor determines the spacing between the values in the kernel. More formally, we have:

$$\begin{aligned} Conv(\textbf{x})_{i} = \sum _{m=0}^{k-1}w_m\cdot \textbf{x}_{i-m\times d} \end{aligned}$$
(2)

where i indexes the output element, w denotes the kernel, d is the dilation factor, and k is the filter size. Dilation introduces a fixed step between adjacent filter taps: if the dilation factor d is set to 1, the dilated convolution reduces to a regular convolution, whereas for \(d > 1\) the filter taps are spaced d units apart, allowing the network to capture longer-term dependencies in the input sequence. A dilated causal convolution architecture is shown in Fig. 1(a).
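Eq. 2 can be implemented directly. The following NumPy sketch (purely illustrative; out-of-range taps are treated as zeros, mirroring the left zero-padding used for causality) shows how the dilation factor spaces the filter taps:

```python
import numpy as np

def dilated_causal_conv(x: np.ndarray, w: np.ndarray, d: int) -> np.ndarray:
    """Direct implementation of Eq. 2: Conv(x)_i = sum_m w_m * x_{i - m*d}."""
    y = np.zeros_like(x)
    for i in range(len(x)):
        for m in range(len(w)):        # filter taps m = 0, ..., k-1
            j = i - m * d              # reach m*d steps into the past
            if j >= 0:                 # out-of-range taps act as zero-padding
                y[i] += w[m] * x[j]
    return y

x = np.arange(8, dtype=float)
w = np.array([0.5, 0.3, 0.2])
print(dilated_causal_conv(x, w, d=1))  # d = 1: regular causal convolution
print(dilated_causal_conv(x, w, d=2))  # d = 2: taps spaced two steps apart
```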

4.2 Fast Adaptation

FSNet modifies the convolution layers in TCN to achieve fast adaptation and associative memory; the modified structure is illustrated in Fig. 1(b). In this subsection, we first introduce the fast-adaptation module.

To enable rapid adaptation to changes in data streams and effective learning with limited data, Sahoo et al. [27] and Phuong and Lampert [23] propose shallow networks and single layers that can quickly adapt to changes in data streams or learn more efficiently with limited data. Rather than limiting the depth of the network, however, it is more advantageous to let each layer adapt independently. In this research, we adopt an independent monitoring and modification approach for each layer to enhance the learning of the current loss. An adaptor maps the recent gradients of a layer to a smaller, more condensed set of transformation parameters that adapt the backbone. However, in continual time-series forecasting the gradient of a single sample can fluctuate significantly and introduce noise into the adaptation coefficients. As a solution, we use an Exponential Moving Average (EMA) of the gradient to mitigate the noise in online training and capture the temporal information in the time series:

$$\begin{aligned} \hat{g_l} = \gamma \hat{g_l} + (1 - \gamma )\hat{g_l^t}, \end{aligned}$$
(3)

where \(\hat{g_l^t}\) denotes the gradient of the l-th layer at time t, \(\hat{g_l}\) denotes the EMA gradient, and \(\gamma \) represents the momentum coefficient. For brevity, we omit the superscript t in the remainder of this manuscript. We take \(\hat{g_l}\) as input and obtain the adaptation coefficient \(\mu _l\):

$$\begin{aligned} \mu _l = \varOmega (\hat{g_l}; \phi _l), \end{aligned}$$
(4)

where \(\varOmega (\cdot )\) is the chunking operation of [10], which partitions the gradient into uniformly sized chunks. These chunks are then mapped to the adaptation coefficients characterized by the trainable parameters \(\phi _l\). Specifically, the adaptation coefficient \(\mu _l\) comprises two components, a weight adaptation coefficient \(\alpha _l\) and a feature adaptation coefficient \(\beta _l\), with which we conduct weight adaptation and feature adaptation respectively. The weight adaptation coefficient \(\alpha _l\) performs an element-wise multiplication on the corresponding weights of the backbone network, as described in:

$$\begin{aligned} \tilde{\theta _l} = tile(\alpha _l) \odot \theta _l, \end{aligned}$$
(5)

where \(\theta _l\) denotes the weights of all channels in a TCN layer and \(\tilde{\theta _l}\) the adapted weights. The weight adaptor is applied per channel to all filters using the tile function, which repeats a vector along new axes, as indicated by \(tile(\alpha _l)\). Finally, element-wise multiplication is denoted by \(\odot \). Likewise, we have:

$$\begin{aligned} \tilde{h_l} = tile(\beta _l) \odot h_l, \end{aligned}$$
(6)

where \(h_l = \tilde{\theta _l} *\tilde{h}_{l - 1}\) is the output feature map and \(*\) denotes the convolution operation.
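To make the fast-adaptation path concrete, the following condensed PyTorch sketch covers Eqs. 3-6 for a single convolutional layer. It is a simplified stand-in rather than the official FSNet implementation: in particular, the chunking operation \(\varOmega \) is replaced by a plain linear map, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConvSketch(nn.Module):
    """Illustrative stand-in for FSNet's per-layer fast adaptation."""

    def __init__(self, channels: int, kernel_size: int, gamma: float = 0.9):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size - 1)
        self.gamma = gamma
        n_w = self.conv.weight.numel()
        # Simplified adaptor: a linear map from the flattened EMA gradient
        # to per-channel coefficients (alpha, beta), standing in for the
        # chunking operation Omega of Eq. 4.
        self.adaptor = nn.Linear(n_w, 2 * channels)
        self.register_buffer("g_ema", torch.zeros(n_w))

    def forward(self, x):
        # Eq. 3: EMA of the layer's gradient (available after backward()).
        if self.conv.weight.grad is not None:
            g = self.conv.weight.grad.detach().flatten()
            self.g_ema = self.gamma * self.g_ema + (1 - self.gamma) * g
        mu = self.adaptor(self.g_ema)            # Eq. 4: adaptation coeffs
        alpha, beta = mu.chunk(2)
        # Eq. 5: per-channel (tiled) weight adaptation.
        w = alpha.view(-1, 1, 1) * self.conv.weight
        h = F.conv1d(x, w, self.conv.bias, padding=self.conv.padding[0])
        h = h[..., :x.shape[-1]]                 # crop right: keep causality
        return beta.view(1, -1, 1) * h           # Eq. 6: feature adaptation

layer = AdaptiveConvSketch(channels=4, kernel_size=3)
out = layer(torch.randn(2, 4, 24))               # (batch, channels, time)
print(out.shape)                                  # torch.Size([2, 4, 24])
```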

4.3 Associative Memory

In order to prevent a model from forgetting old patterns during continual learning on time series, it is crucial to preserve the appropriate adaptation coefficients \(\mu \), which encapsulate the temporal patterns needed for forecasting. These coefficients reflect the model's prior adaptation to a specific pattern, and thus retaining and recalling the corresponding \(\mu \) can facilitate learning when the pattern resurfaces in the future. Consequently, we incorporate an associative memory to store the adaptation coefficients of recurring events encountered during training. This associative memory is denoted as \(M_l \in \textbf{R}^{N\times d}\), where N is the number of memory slots and d is the dimensionality of \(\mu _l\), set to a default value of 64.

Memory Interaction Triggering. To circumvent the computational burden and noise that arise from storing and querying coefficients at every time step, FSNet proposes to activate this interaction only when there is a significant change in the representation. The overlap between the current and past representations can be evaluated by taking the dot product of their respective gradients. FSNet leverages an additional EMA gradient \(\hat{g'}_l\), with a smaller coefficient \(\gamma '\) than that of the original EMA gradient \(\hat{g}_l\), and measures the cosine similarity between the two to determine when to trigger the memory. We use a hyper-parameter \(\tau \), set to 0.7, to ensure that the memory is activated only by significant pattern changes that are likely to recur. The interaction is triggered when \(cosine(\hat{g}_l, \hat{g'}_l) < - \tau \).

To guarantee that the present adaptation coefficients account for the entire event, which may span an extended period, memory read and write operations are carried out using the EMA of the adaptation coefficients with coefficient \(\gamma '\); the EMA of \(\mu _l\) is computed following the same procedure as Eq. 3. When a memory interaction is triggered, the adaptor retrieves the most similar transformations from the past through an attention read operation, a weighted sum over the memory items:

$$\begin{aligned} \textbf{r}_l = softmax(M_l\hat{\mu }_l), \end{aligned}$$
(7)
$$\begin{aligned} \tilde{\mu }_l = \sum _{i=1}^k TopK(\textbf{r}_l)[i]M_l[i], \end{aligned}$$
(8)

where \(TopK(\cdot )\) selects the top k values from \(\textbf{r}_l\) and [i] denotes the i-th element. Retrieving the adaptation coefficient from memory enables the model to recall past experience of adapting to the current pattern and improves its learning in the present. The retrieved coefficient is combined with the current coefficient through a weighted sum: \(\mu _l = \tau \mu _l + (1 - \tau )\tilde{\mu }_l\). Subsequently, the memory is updated using the updated adaptation coefficient:

$$\begin{aligned} M_l = \tau M_l + (1 - \tau )\tilde{\mu }\otimes TopK(\textbf{r}_l), \end{aligned}$$
(9)

where \(\otimes \) denotes the outer-product operator. In this way, new knowledge is effectively incorporated into the most pertinent locations, as identified by the top-k attention values of \(\textbf{r}_l\). Since the memory is updated in place through this weighted sum, the memory \(M_l\) does not grow as learning progresses.
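The read and write logic of Eqs. 7-9, together with the triggering condition, can be sketched as follows (a simplified illustration with arbitrary sizes, not the official FSNet implementation):

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes: N memory slots, coefficient dimension d, top-k read.
N, d, k, tau = 32, 64, 2, 0.7
M = torch.randn(N, d)                        # associative memory M_l
mu_ema = torch.randn(d)                      # EMA of adaptation coefficients

# Trigger only on a strong disagreement between the two gradient EMAs.
g_fast, g_slow = torch.randn(100), torch.randn(100)
if F.cosine_similarity(g_fast, g_slow, dim=0) < -tau:
    r = F.softmax(M @ mu_ema, dim=0)                       # Eq. 7
    vals, idx = torch.topk(r, k)                           # top-k attention
    mu_tilde = (vals.unsqueeze(1) * M[idx]).sum(0)         # Eq. 8: read
    mu = tau * mu_ema + (1 - tau) * mu_tilde               # blend coefficients
    M[idx] = tau * M[idx] + (1 - tau) * torch.outer(vals, mu_tilde)  # Eq. 9
```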

5 Datasets and Contextual Data

This paper is based on two primary data sources: energy usage data and mobility data, as well as two contextual datasets: COVID lockdown dates and temperature data. The statistical summary of the main datasets is provided in Table 1 and visualized in Fig. 2. These datasets were collected from four building complexes in the Melbourne CBD area of Australia between 2018 and 2021.

Table 1 outlines the essential statistical properties of energy usage and mobility data collected from the four building complexes. It is evident from the data that energy usage varies significantly between the buildings, with BC2 having over ten times the average energy usage of BC4. Similarly, the mobility data shows distinct differences, with BC2 having a mean pedestrian count over three times greater than BC4. These differences emphasize the complexity of forecasting energy usage across different building complexes.

Table 1. The summary statistics of the four datasets, each of which represents an aggregated and anonymized building complex (BC).
Fig. 2. Visualizing the four datasets and their features, showing the significant changes in distributions due to lockdowns. Plots in the left column are smoothed with a Gaussian filter with sigma = 24 h. Red areas are lockdowns. (Color figure online)

It is worth noting that lockdown had a more significant impact on mobility than energy usage, as illustrated in Fig. 2. Additionally, both energy usage and mobility started declining even before the start of lockdown.

5.1 Energy Usage Data

The energy usage data was collected from the energy suppliers for each building complex and measures the amount of electricity used by the buildings. To protect the privacy of the building owners, operators, and users, the energy usage data from each building was aggregated into complexes and anonymized. Buildings in the same complex can have different primary uses (e.g., residential, office, retail).

5.2 Mobility Data

The mobility data was captured by an automated pedestrian counting system installed by the City of Melbourne (http://www.pedestrian.melbourne.vic.gov.au/) [19], and provides information on the movement patterns of individuals in and around each building complex. The system records the number of pedestrians passing through a given zone, as shown in Fig. 3. As no images are recorded, no individual information is collected. Some sensors were installed as early as 2009 and others as late as 2021, and some devices were moved, removed, or upgraded at various times. Of the seventy-nine sensors installed, we chose four, one for each building complex, by manually selecting the sensor closest to each complex.

Fig. 3. Diagram of the automated pedestrian counting system. Obtained from the City of Melbourne website [19].

5.3 COVID Lockdown Dates

We used data on the dates of the COVID lockdowns in Melbourne, which were among the strictest in the world. Our datasets coincide with the first lockdown, from March 30, 2020 to May 12, 2020 (43 days), and the second lockdown, from July 8 to October 27, 2020 (111 days). We also divided the time into pre-lockdown and post-lockdown periods, taking the start date of the first lockdown (March 30, 2020) as the boundary. This information was taken from https://www.abc.net.au/news/2021-10-03/melbourne-longest-lockdown/100510710 [4].

5.4 Temperature Data

Temperature records are extracted from the National Renewable Energy Laboratory (NREL) Asia Pacific Himawari Solar Data [29]. As the building complexes are located in close proximity to one another, we utilized the same temperature data for all of them.

5.5 Dataset Preprocessing

For this study, we fixed an observation window of \(L=24\) h and a forecast horizon of \(H=24\) h to mimic a day-ahead forecasting experiment. To accurately link the foot-traffic mobility data with each building, we carefully handpicked the pedestrian counting sensor located in the immediate vicinity of the building and used its corresponding mobility signal. The building's energy usage load, the foot-traffic volume, and the temperature readings were all aligned based on their timestamps.
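A hypothetical pandas sketch of this alignment step follows; the file and column names are invented for illustration and do not reflect the actual dataset schema.

```python
import pandas as pd

# File and column names below are illustrative, not the actual schema.
energy = pd.read_csv("energy_bc1.csv",
                     index_col="timestamp", parse_dates=True)
mobility = pd.read_csv("pedestrian_sensor.csv",
                       index_col="timestamp", parse_dates=True)
temperature = pd.read_csv("himawari_temperature.csv",
                          index_col="timestamp", parse_dates=True)

# Inner join on timestamps keeps only the hours present in all sources.
df = energy.join([mobility, temperature], how="inner").sort_index()
print(df.columns.tolist())   # e.g. ['energy_kwh', 'pedestrians', 'temp_c']
```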

6 Experiments and Results

We conducted two sets of experiments to evaluate the effectiveness of our proposed methods for predicting energy usage during anomalous periods. The first set of experiments evaluated the impact of including mobility contextual data in our models. The second set of experiments assessed the importance of continual learning. In addition, we conducted ablation experiments on FSNet to investigate the impact of different components of the model on the overall performance.

6.1 Experimental Setup

The experiments were conducted on a high-performance computing (HPC) node cluster with an Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz and Tesla V100-SXM2. The software specifications included intel-mkl 2020.3.304, nvidia-cublas 11.11.3.6, cudnn 8.1.1-cuda11, fftw3 3.3.8, openmpi 4.1.0, magma 2.6.0, cuda 11.2.2, pytorch 1.9.0, python3 3.9.2, pandas 1.2.4, and numpy 1.20.0.

The data was split into three months for pre-training, three months for validating the pre-training, and the remainder for the usual continual learning setup. No hyperparameter tuning was conducted; default settings were used. The loss function used is MSE.

6.2 Mobility

Table 2. Performance comparison between different contextual features. Results are averaged over 10 runs with different random seeds; the standard deviation is shown. The algorithm used was FSNet with continual learning. +M is the improvement of adding mobility over no context, +T is the improvement of adding temperature over no context, and T+M is the improvement of adding mobility over temperature only.

To assess the significance of the mobility context in predicting energy usage during anomalous periods, we performed a contextual feature ablation analysis, comparing pre- and post-lockdown performance. Table 2 presents the results of our experiments. Our findings suggest that the importance of mobility context is unclear in pre-lockdown periods, with mixed improvements that are small compared to the standard deviations. Post-lockdown, however, the importance of mobility context is more pronounced, and the best performance was achieved when both mobility and temperature contexts were utilized. Notably, post-lockdown, the improvement brought about by the mobility context is larger than that achieved through temperature alone, as observed in BC1, BC2, and BC4. This could be because temperature has a comparatively simple and regular periodic pattern that deep learning models can deduce from the energy data alone.

6.3 Continual Learning

Table 3. Comparing the performance of different algorithms with and without continual learning (CL). The metric used is MAE. Results are averaged over 10 runs with different random seeds; the standard deviation is shown.

We conducted an experiment to determine the significance of continual learning by comparing the performance of various popular models with and without continual learning.

The models used in the experiment are:

  • FSNet [22]: Fast and Slow Network, described in detail in the method section of this paper. In the ‘no CL’ version, we use the exact same architecture but train it with traditional offline learning.

  • TCN [3]: Temporal Convolutional Network, the offline-learning baseline. It modifies the typical CNN using causal and dilated convolutions, which enhance its ability to capture temporal dependencies more effectively. The next three methods are continual learning methods that use TCN as the backbone.

  • OGD: Ordinary gradient descent, a popular optimization algorithm used in machine learning. It updates the model parameters by taking small steps in the direction opposite to the gradient of the loss function.

  • ER [6, 17]: Experience Replay, a technique used to re-expose the model to past experiences in order to improve learning efficiency and reduce the effects of catastrophic forgetting.

  • DER++ [5]: Dark Experience Replay++ is an extension of the DER (Dark Experience Replay) algorithm, which uses a memory buffer to store past experiences and a deep neural network to learn from them. DER++ improves upon DER by using a dual-memory architecture, which allows it to store short-term and long-term memories separately.

Table 3 displays the results, which demonstrate the consistent importance of continual learning in both the pre- and post-lockdown periods, with improvements multiple times larger than the standard deviations.

7 Conclusion

In this study, we investigated the impact of mobility contextual data and continual learning on building energy usage forecasting during out-of-distribution periods. We used data from Melbourne, Australia, a city that experienced one of the strictest lockdowns during the COVID-19 pandemic, as a prime example of such periods. Our results indicated that energy usage and mobility patterns vary significantly across different building complexes, highlighting the complexity of energy usage forecasting. We also found that the mobility context had a greater impact than the temperature context in forecasting energy usage during lockdown. We evaluated the importance of continual learning by comparing the performance of several popular models with and without continual learning, including FSNet, OGD, ER, and DER++. The results consistently demonstrated that continual learning is important in both pre- and post-lockdown periods, with significant improvements in performance observed across all models. Our study emphasizes the importance of considering contextual data and implementing continual learning techniques for robust energy usage forecasting in buildings.