1 Introduction

To achieve sustainable development, the effective management of production, distribution, transport and consumption in smart energy systems has become a focus for researchers and engineers [23]. As the operation of energy systems can be disrupted by various events such as equipment failures, power outages and malfunctions, energy systems have started to use Internet of Things (IoT) sensors and smart meters for monitoring and automation. Anomaly detection in smart meter data therefore plays an important role in ensuring the healthy operation of an energy system. Three types of anomalies are widely detected: global, contextual and collective anomalies [7]. In this paper, we mainly focus on global and contextual anomalies (defined in Sect. 3.1) in smart meter data. Global and contextual anomalies may indicate equipment failures or incorrect operations, and detecting them is needed to provide early warnings, thereby reducing or avoiding economic losses. Figure 1 illustrates examples of these two types of anomalies in smart meter data.

Fig. 1. A fragment of the water supply temperature data set in our paper, with global and contextual anomalies marked as red dots. (Color figure online)

However, using smart meter data to detect anomalies faces some key challenges. First, smart meter data are time series of production or consumption and are characterised by diverse seasonal patterns and high nonstationarity; such varied and nonstationary data require more generic and robust anomaly detection methods. Second, smart meter data are typically of high volume and high dimensionality and lack labeled anomalies, which necessitates the use of unsupervised or semi-supervised approaches. In addition, the collected data suffer from many quality issues, such as missing values, outliers and temporal inconsistencies; how such “dirty” data are handled affects detection performance.

Currently, there are several algorithms for detecting anomalies in energy data, such as [6, 15, 22, 29], but these algorithms are mainly designed to detect point anomalies and do not distinguish between global and contextual point anomalies. In addition, irregularly missing points, which occur very frequently in smart meter time series due to transmission or meter failures, should also be considered as anomalies. It is therefore necessary to develop an effective and reliable anomaly detection model for smart energy systems.

In this paper, we propose an unsupervised anomaly detection algorithm for global and contextual anomalies in smart meter data using variational recurrent autoencoders (VRAE) with attention. The algorithm works without labels, takes advantage of pre-detected global anomalies during training, and can also exploit occasional labels when they are available. Our main contributions are as follows:

  • We adapt and extend VRAE models for anomaly detection on smart meter data. The proposed model is capable of detecting not only global anomalies but also contextual anomalies. Although the model is presented for smart meter data, it can also be applied to other time series with temporal dependence.

  • We propose a method for minimising the impact of global anomalies and missing points on the latent variables during model training, using linear interpolation and an improved evidence lower bound function, which improves model performance.

  • We evaluate the method comprehensively by comparing it with baseline methods on a synthetic data set, and present a real-world case study of the proposed method.

2 Related Work

2.1 Traditional Anomaly Detection Methods

Traditional anomaly detection methods include statistical approaches, e.g., [10, 16, 19, 33, 41, 43], clustering-based approaches, e.g., [8, 42], prediction-based approaches, e.g., [20,21,22], nearest neighbour approaches, e.g., [5, 16, 18], dimensionality reduction approaches, e.g., [10, 32], and other complementary models. These approaches can show good performance and effectiveness in their specific applications. However, due to the wide variation in energy data, e.g., in their patterns, domain expert effort is often required to select a suitable detector for a particular type of anomaly. In addition, since most existing methods have constraints or limitations in terms of parameterisation, interpretability and generalisability, even a detection framework based on ensemble learning may not achieve better results.

2.2 Unsupervised Deep Learning Models

A rich body of literature presents unsupervised learning algorithms for detecting anomalies using deep learning techniques, including [11, 26, 27, 38, 40]. Deep learning approaches can be further categorised into predictive models [27], VAE [1], Generative Adversarial Networks (GAN) [9] and VRNN [35]. For modeling sequential data such as time series, Recurrent Neural Networks (RNNs) show an advantage over others because of their capability to model long-term temporal dependence. RNNs (e.g. Long Short-Term Memory (LSTM) [14] and the Gated Recurrent Unit (GRU)) introduce so-called internal self-looping states into the network, which can accumulate information from the past. [31] combined ARIMA and LSTM to train a prediction model for energy anomaly detection. In this paper, we use LSTM in our neural network architecture to model the temporal dependence of time series.

VAE has been successfully applied in several anomaly detection applications, including [30, 36, 38, 39]. Hollingsworth et al. [15] proposed an autoencoder-based ensemble method to detect anomalies in building energy consumption data and evaluated its performance in terms of reconstruction ability, high-level feature quality and computational efficiency. Compared to plain autoencoders, the variational inference technique [12] encodes the latent space as a distribution rather than a single generated value and enables probabilistic reconstruction by a probabilistic model [1]. However, in the field of smart energy, few applications have previously used generative models to detect anomalies, and existing VAE-based work is not designed for energy smart meter data and requires domain experts to detect different types of anomalies.

2.3 Attention Mechanism for Deep Learning Models

Attention mechanisms [3, 24, 37] have been introduced to obtain state-of-the-art performance when modeling sequences, e.g., in natural language processing. An attention mechanism can model the relationship between different positions of a single sequence, or across multiple sequences, to obtain representative sequence encodings. For example, Pereira et al. [29] used a weighted sum of all encoder hidden states as the attention, which is then fed to the decoder. The attention mechanism can therefore tackle the weakness of neural networks in processing long sequences. However, there are still few attempts to apply it to anomaly detection for energy time series data, which exhibit temporal interdependency across different time positions.

3 Problem Statement and Proposed Method

3.1 Problem Statement

Given the historical data of an n-dimensional time series of length T, i.e. \(X=\left( \mathbf {x}_{1}, \cdots , \mathbf {x}_{n}\right) ^{T} \in \mathbb {R}^{n \times T}\), our method is capable of detecting two types of anomalies:

  (a) Global anomalies: given an input time-series X, a global anomaly is a timestamp-value pair \(\left\langle t, x_{t}\right\rangle \) where the observed value is far from the rest of the data.

  (b) Contextual anomalies: given an input time-series X, a contextual anomaly is a timestamp-value pair \(\left\langle t, x_{t}\right\rangle \) where the observed value differs significantly from its neighbours in the same context, but is not a global anomaly.

3.2 Proposed Method

Global Anomaly Detection and Labeling. Data collected from real-world applications are often dirty and require preprocessing before being used for analysis. The training process for anomaly detection should ideally learn from “normal” data rather than from abnormal data, and one of the challenges of unsupervised anomaly detection methods is to minimise the impact of abnormal data as much as possible. Hence, we detect global anomalies and sequential missing points and label them as anomalies before training. We use a statistical method based on histograms of each dimension. For a multivariate time series with n dimensions, we first construct a univariate histogram with k bins for each dimension. Second, the frequency of samples in each histogram (dimension) is used as a density estimate of those samples, and samples that fall into low-frequency bins receive high anomaly scores; the higher the score of a sample, the higher the probability that it is an anomaly.
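A minimal sketch of this histogram-based scoring step is shown below, assuming the data are held in a NumPy array; the bin count k and the quantile used to flag global anomalies are illustrative choices, not values taken from the paper.

```python
import numpy as np

def histogram_anomaly_scores(X, k=30):
    """Histogram-based density scores for a (T, n) multivariate series.

    A univariate histogram with k bins is built per dimension; the bin
    frequency of every sample serves as its density estimate, and low
    density translates into a high (global) anomaly score.
    """
    T, n = X.shape
    scores = np.zeros(T)
    for d in range(n):
        counts, edges = np.histogram(X[:, d], bins=k)
        # Map each observation to its bin index in [0, k-1].
        bin_idx = np.clip(np.digitize(X[:, d], edges[1:-1]), 0, k - 1)
        density = counts[bin_idx] / T
        # Sum negative log-densities over dimensions.
        scores += -np.log(density + 1e-12)
    return scores

# Example: flag the top 1% of samples as pre-detected global anomalies
# (the quantile threshold is an assumption, not specified in the paper).
X = np.random.randn(1000, 3)
scores = histogram_anomaly_scores(X)
global_anomaly_mask = scores > np.quantile(scores, 0.99)
```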

Fig. 2. The network architecture of the proposed model.

For the missing points, we categorise them into two categories: single missing values and sequential missing values. Single missing values are filled with synthetically generated values using linear interpolation. For sequential missing values, the imputation error accumulates with the length of the missing subsequence; as it is difficult to generate sequential data that follow the original patterns, we fill these sequential missing values with zeros and label them as anomalies.
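The following sketch illustrates this preprocessing step, assuming the series is held in a pandas Series; the gap-length threshold separating single from sequential missing values is an illustrative parameter.

```python
import numpy as np
import pandas as pd

def fill_missing(values, max_single_gap=1):
    """Fill single missing values by linear interpolation; fill longer
    (sequential) gaps with zeros and label them as anomalies.

    Returns the filled series and a boolean mask of labelled anomalies."""
    s = pd.Series(values, dtype="float64")
    is_na = s.isna()
    # Group consecutive NaNs and measure the length of each gap.
    gap_id = (is_na != is_na.shift()).cumsum()
    gap_len = is_na.groupby(gap_id).transform("sum")
    sequential = is_na & (gap_len > max_single_gap)
    # Linear interpolation fills every gap; long gaps are then reset to zero
    # and marked as anomalies, matching the strategy described above.
    s = s.interpolate(method="linear")
    s[sequential] = 0.0
    return s.to_numpy(), sequential.to_numpy()

# e.g. [1, NaN, 3, NaN, NaN, NaN, 7] -> [1, 2, 3, 0, 0, 0, 7],
# with the three consecutive zeros labelled as anomalies
filled, anomaly_mask = fill_missing([1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0])
```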

Network Architecture and Implementation. Figure 2 shows the overall neural network architecture of the proposed model. As shown in the figure, multivariate time series data come from industrial smart meters. Given a multivariate time series X, we first use a sliding window of length W to segment the time series into subsequences, e.g. (\(x_{t-W+1}, \ldots , x_{t}\)). The subsequences are then used as the input to the proposed model, which uses a variational autoencoder architecture with LSTMs to learn normal patterns from the training data. The right side of Fig. 2 shows the detailed network structure with the attention mechanism. The network structure is a variational recurrent autoencoder composed of an encoder and a decoder.
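A minimal sketch of the sliding-window segmentation, assuming NumPy arrays; the window length of 168 follows Sect. 4.1, while the step size of one is an assumption.

```python
import numpy as np

def sliding_windows(X, window=168, step=1):
    """Segment a (T, n) multivariate series into overlapping subsequences
    of length `window`, returned with shape (num_windows, window, n)."""
    T = X.shape[0]
    starts = np.arange(0, T - window + 1, step)
    return np.stack([X[i:i + window] for i in starts])

# e.g. hourly data with a one-week window, as in Sect. 4.1
windows = sliding_windows(np.random.randn(1000, 3), window=168)
```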

In the VRAE, the encoder compresses the input time series into a fixed-length latent representation z based on the variational distribution \(q_{\phi }(z \mid x)\), and outputs the hidden state \(\mathbf {h}_{t}\) as a summary of the past information up to time t. The latent variables z are drawn from a distribution with a given prior \(p_\theta (z)\), which is usually a multivariate unit Gaussian distribution; here, we assume \(p_\theta (z) \sim \mathcal {N}(0, \mathrm {I})\). The outputs of the encoder are the parameters (\(\boldsymbol{\mu }_{z}\) and \(\boldsymbol{\sigma }_{z}\)) of the posterior \(q_{\phi }(z \mid x)\). The approximate posterior of z is a diagonal Gaussian, \(q_{\phi }(z \mid x)\sim \mathcal {N}\left( \boldsymbol{\mu _{\mathrm {z}}}, \boldsymbol{\sigma _{\mathrm {z}}}^{2} \mathrm {I}\right) \), where the mean \(\boldsymbol{\mu _{\mathrm {z}}}\) and the covariance \(\varSigma _{\mathrm {z}}=\boldsymbol{\sigma _{\mathrm {z}}}^{2} \mathrm {I}\) are derived from two fully connected layers (the \(\boldsymbol{\mu _{\mathrm {z}}}\) and \(\boldsymbol{\sigma _{\mathrm {z}}}\) layers in Fig. 2) with Linear and SoftPlus activations, respectively. The latent variable \(\mathbf {z}\) (chosen to be K-dimensional) is then sampled from the approximate posterior with the reparameterization trick, \( \mathbf {z}= \boldsymbol{\mu _{\mathbf {z}}} +\boldsymbol{\sigma _{\mathrm {z}}} \cdot \boldsymbol{\epsilon }\), where \(\boldsymbol{\epsilon } \sim {\text {Normal}}(\mathbf {0}, \mathbf {I})\) is an independent random variable that makes stochastic gradient descent feasible. The decoder also uses an LSTM network to reconstruct the data from the latent variable \(\mathbf {z}\) through the generation distribution \(p_{\theta }(x \mid z)\), and outputs the parameters (\(\boldsymbol{\mu _{\mathbf {x}}}\) and \(\boldsymbol{\sigma _{\mathrm {x}}}\)) of \(p_{\theta }(x \mid z)\).
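The following PyTorch sketch illustrates this encoder-decoder structure and the reparameterization trick. The attention path of Fig. 2 is omitted, and details such as how z conditions the decoder and what the decoder consumes as inputs are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class VRAE(nn.Module):
    """Minimal VRAE sketch: LSTM encoder -> (mu_z, sigma_z) -> z ->
    LSTM decoder -> (mu_x, sigma_x)."""

    def __init__(self, n_features, hidden_dim=64, latent_dim=3):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden_dim, batch_first=True)
        self.mu_z = nn.Linear(hidden_dim, latent_dim)                 # Linear activation
        self.sigma_z = nn.Sequential(nn.Linear(hidden_dim, latent_dim),
                                     nn.Softplus())                   # SoftPlus activation
        self.z_to_h = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.LSTM(n_features, hidden_dim, batch_first=True)
        self.mu_x = nn.Linear(hidden_dim, n_features)
        self.sigma_x = nn.Sequential(nn.Linear(hidden_dim, n_features),
                                     nn.Softplus())

    def forward(self, x):                         # x: (batch, W, n_features)
        _, (h_n, _) = self.encoder(x)
        mu_z = self.mu_z(h_n[-1])
        sigma_z = self.sigma_z(h_n[-1])
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu_z + sigma_z * torch.randn_like(sigma_z)
        # Condition the decoder on z via its initial hidden state
        # (one possible design choice, not specified in the paper).
        h0 = torch.tanh(self.z_to_h(z)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        dec_in = torch.zeros_like(x)              # decoder inputs are also an assumption
        dec_out, _ = self.decoder(dec_in, (h0, c0))
        return self.mu_x(dec_out), self.sigma_x(dec_out), mu_z, sigma_z
```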

The objective of a VAE is to maximise the evidence lower bound (ELBO), \(\mathcal {L}\left( \theta , \phi ; \mathbf {x}\right) \), which can be written as follows:

$$\begin{aligned} \begin{aligned} \log p_{\theta }(\mathbf {x})&\ge \mathcal {L}\left( \theta , \phi ; \mathbf {x}\right) \\&=E_{q_{\phi }\left( \mathbf {z} \mid \mathbf {x}\right) }\left[ \log p_{\theta }(\mathbf {x} \mid \mathbf {z})\right] -\mathcal {D}_{\mathrm {KL}}\left( q_{\phi }(\mathbf {z} \mid \mathbf {x}) \Vert p_{\theta }(\mathbf {z})\right) \end{aligned} \end{aligned}$$
(1)

where \(\phi \) and \(\theta \) are the parameters of the encoder and decoder, respectively. The first term on the right-hand side of the equation is the reconstruction loss, which can be approximated by Monte Carlo integration [1]. The second term, \(\mathcal {D}_{\mathrm {KL}}\), is the Kullback-Leibler (KL) divergence between the approximate posterior and the prior distribution of the latent variable z.

To tackle posterior collapse in variational inference and the weakness of neural networks in processing long sequences, we additionally apply a self-attention mechanism that promotes interaction between the inference model and the generative model. The attention model extracts a context vector based on all hidden states encoded from the input time series. The LSTM encoder computes all hidden states \(\left\{ \mathbf {s}_{i}\right\} _{i=1}^{T_{x}}\) from the input time series, while the LSTM decoder estimates the hidden state \(\mathbf {h_t}\) at each time t by a recurrent function using the previous hidden state \(\mathbf {h_{t-1}}\) and the context vector, denoted by:

$$\begin{aligned} \mathbf {h}_{t}=f\left( \mathbf {h}_{t-1}, \mathbf {c}_{t}\right) \quad \text{ where } \quad \mathbf {c}_{t}=\sum _{i=1}^{T_{x}} \alpha _{t i} \mathbf {s}_{i}\end{aligned}$$
(2)

where \(\mathbf {c}_{t}\) is the context vector containing the weighted sum of all source hidden states \(\mathbf {s}_{i}\) encoded from the input time series. The attention weights, \(\boldsymbol{\alpha }_{t}=\left\{ \alpha _{t i}\right\} _{i=1}^{T_{x}}\), are computed by a score function that measures the similarity between the hidden state \(\mathbf {s}_{t}\) at time t and all hidden states \(\left\{ \mathbf {s}_{i}\right\} _{i=1}^{T_{x}}\) of the last recurrent layer in the encoder. The self-attention thus models the relevance of each pair of hidden states at different time instances in the encoder. Here, we use the scaled dot-product similarity [37] as the score function because of its learning efficiency for large inputs.

Due to the bypass phenomenon [4], the variational latent space may learn little when paired with a powerful attention mechanism. We therefore use the variational attention mechanism to model the context vectors as probability distributions. We choose the prior distribution of the context vectors \( \mathbf {c}_{t}\) to be the standard Gaussian distribution, i.e., \(\mathbf {c}_{t} \sim {\text {Normal}}(\mathbf {0}, \mathbf {I})\), as we do for the latent variables. The encoder first computes the deterministic context vector \(\mathbf {c}_{t}=\sum _{i=1}^{T_{x}} \alpha _{t i} \mathbf {s}_{i}\), then passes it to linear layers to compute the parameters \(\boldsymbol{\mu }_{\mathbf {c}_{t}}\) and \(\mathbf {\Sigma }_{\mathbf {c}_{t}}\) of the approximate posterior \(q_{\boldsymbol{\phi }}^{(a)}\left( \mathbf {c}_{t} \mid \mathbf {x}\right) \sim {\text {Normal}}(\boldsymbol{\mu }_{\mathbf {c}_{t}}, \mathbf {\Sigma }_{\mathbf {c}_{t}})\). The decoder takes the concatenation of the sampled \(\mathbf {z}\) and the sampled \(\mathbf {c}_{t}\), drawn from their approximate posteriors, as its input, and generates the parameters of \(p_{\theta }(x \mid z, c)\) as the output.
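A sketch of this variational self-attention step is given below, assuming the encoder hidden states are available as a single tensor; a diagonal covariance is assumed for \(q_{\phi }^{(a)}(\mathbf {c}_{t} \mid \mathbf {x})\), and the layer names are illustrative.

```python
import math
import torch
import torch.nn as nn

class VariationalSelfAttention(nn.Module):
    """Scaled dot-product self-attention over the encoder states yields a
    deterministic context vector c_t, which is then mapped to the parameters
    of q(c_t | x) and sampled with the reparameterization trick."""

    def __init__(self, hidden_dim):
        super().__init__()
        self.mu_c = nn.Linear(hidden_dim, hidden_dim)                 # Linear
        self.sigma_c = nn.Sequential(nn.Linear(hidden_dim, hidden_dim),
                                     nn.Softplus())                   # SoftPlus

    def forward(self, s):                       # s: (batch, T, hidden_dim)
        d = s.size(-1)
        # Scaled dot-product similarity between every pair of hidden states.
        scores = torch.bmm(s, s.transpose(1, 2)) / math.sqrt(d)       # (batch, T, T)
        alpha = torch.softmax(scores, dim=-1)
        c_det = torch.bmm(alpha, s)                                   # deterministic c_t
        mu, sigma = self.mu_c(c_det), self.sigma_c(c_det)
        c = mu + sigma * torch.randn_like(sigma)                      # sampled context
        return c, mu, sigma
```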

3.3 Loss Function – ELBO+

With the variational attention mechanism, the variational lower bound \(\mathcal {L}\left( \theta , \phi ; \mathbf {x}\right) \) in Eq. 1 becomes:

$$\begin{aligned} \begin{aligned} \mathcal {L}(\theta , \phi , \boldsymbol{x})= E_{\boldsymbol{z}, \boldsymbol{c} \sim q_{\phi }\left( \boldsymbol{z}, \boldsymbol{c} \mid \boldsymbol{x}\right) }\left[ \log p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{c}\right) \right] \\ -\mathcal {D}_{\mathrm {KL}}\left( q_{\phi }\left( \boldsymbol{z}, \boldsymbol{c} \mid \boldsymbol{x}\right) \Vert p(\boldsymbol{z}, \boldsymbol{c})\right) \end{aligned} \end{aligned}$$
(3)

To minimise the effects of learning from abnormal data, we mitigate the contribution of (pre-detected) global anomalies and missing points by introducing a weight vector, \(\boldsymbol{\beta } = \left\{ \beta _{i} \right\} _{i=1}^{T_{x}}\), into \(\log p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{c}\right) \), as shown in Eq. 4. If \(x_i\) is an anomaly, then \(\beta _i=0\); otherwise \(\beta _i=1\). We name this improved ELBO ELBO+, where \(\lambda _{kl}\) balances the reconstruction loss against the KL losses and \(\eta _{a}\) balances the latent KL loss against the attention KL loss. The training objective is to maximise the ELBO+ in Eq. 4, whose negative is the loss function of the VAE. Theoretically, the anomalies present can also influence the KL losses, but the hyperparameters \(\lambda _{kl}\) and \(\eta _{a}\) already reduce the weight of the KL terms, so we do not additionally reduce the anomalies' contribution to them.

$$\begin{aligned} \begin{aligned} \mathcal {L}(\theta , \phi , \boldsymbol{x})^+= E_{\boldsymbol{z} \sim q_{\phi }^{(z)}\left( \boldsymbol{z} \mid \boldsymbol{x}\right) , \boldsymbol{c_t} \sim q_{\phi }^{(a)}\left( \boldsymbol{c_t} \mid \boldsymbol{x}\right) }\left[ \boldsymbol{\beta } \log p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}, \boldsymbol{c}\right) \right] \\ -\lambda _{kl}\left[ { \mathcal {D}_{\mathrm {KL}}\left( q_{\phi }^{(z)}\left( \boldsymbol{z} \mid \boldsymbol{x}\right) \Vert p(\boldsymbol{z})\right) } \right. \\ \left. {+\eta _{a} \sum _{t=1}^{T} \mathcal {D}_{\mathrm {KL}}\left( q_{\phi }^{(a)}\left( \boldsymbol{c_t} \mid \boldsymbol{x}\right) \Vert p(\boldsymbol{c_t})\right) }\right] \\ \end{aligned} \end{aligned}$$
(4)
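A sketch of ELBO+ as a PyTorch loss function is shown below, assuming Gaussian outputs with diagonal covariances and standard-normal priors; the hyperparameter values are placeholders, not the settings reported in Table 1.

```python
import torch

def elbo_plus(x, mu_x, sigma_x, mu_z, sigma_z, mu_c, sigma_c,
              beta, lambda_kl=0.1, eta_a=0.01):
    """Negative ELBO+ (Eq. 4): the per-timestep weight vector `beta` zeroes
    out pre-detected global anomalies and missing points in the
    reconstruction term; both KL terms assume standard-normal priors."""
    # Weighted Gaussian log-likelihood (reconstruction term).
    rec = torch.distributions.Normal(mu_x, sigma_x).log_prob(x)       # (batch, T, n)
    rec = (beta.unsqueeze(-1) * rec).sum(dim=(1, 2)).mean()
    # KL(q(z|x) || N(0, I)) for a diagonal Gaussian.
    kl_z = 0.5 * (mu_z.pow(2) + sigma_z.pow(2)
                  - 2 * torch.log(sigma_z) - 1).sum(dim=-1).mean()
    # KL(q(c_t|x) || N(0, I)), summed over time steps.
    kl_c = 0.5 * (mu_c.pow(2) + sigma_c.pow(2)
                  - 2 * torch.log(sigma_c) - 1).sum(dim=(1, 2)).mean()
    elbo = rec - lambda_kl * (kl_z + eta_a * kl_c)
    return -elbo   # minimise the negative ELBO+ during training
```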

3.4 Anomaly Detection

Since the generative model reconstructs the input time series from a probability distribution, it can produce different outputs for the same input. Normally, rare events (anomalies) have lower probabilities. This rarity can be measured by the reconstruction probability, \( \log p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}\right) \), which can be calculated by the Monte Carlo method.

The encoder first generates the parameters of the approximate posterior distribution \(q_{\phi }\left( \boldsymbol{z} \mid \boldsymbol{x}\right) \) from the test data. Then, L latent variable samples are drawn from this approximate posterior. The sampling strategy takes into account the variability of the latent space in order to increase the robustness of anomaly detection. For each sample, the decoder outputs the parameters of the reconstruction distribution \(p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}\right) \). Finally, the average reconstruction probability over the samples is calculated from the output parameters, i.e.,

$$\begin{aligned} \begin{aligned} E_{\boldsymbol{z} \sim q_{\phi }\left( \boldsymbol{z} \mid \boldsymbol{x}\right) )} \left[ \log p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}\right) \right] \approx \frac{1}{L} \sum _{l=1}^{L} \log p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{\mu }_l, \boldsymbol{\sigma }_l \right) \end{aligned} \end{aligned}$$
(5)

where \( \boldsymbol{\mu }_l\) and \(\boldsymbol{\sigma }_l\) are the parameters output by the decoder for the reconstruction distribution \(p_{\theta }\left( \boldsymbol{x} \mid \boldsymbol{z}\right) \) of the l-th sample.

The reconstruction probabilities are then converted into anomaly scores (between 0 and 1), which measure how anomalous the input values are. For the experiments in the next section, we consider observations whose anomaly score is greater than 0.5 as contextual anomalies; this threshold can be tuned according to the requirements of the problem.
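The scoring procedure can be sketched as follows; the encode/decode interface and the sigmoid mapping of the averaged log-probability to a score in (0, 1) are assumptions, since the paper does not specify the exact transformation.

```python
import torch

@torch.no_grad()
def anomaly_scores(model, x, n_samples=10):
    """Monte Carlo estimate of the reconstruction probability (Eq. 5),
    turned into a per-point anomaly score. `model` is assumed to expose
    encode()/decode() helpers; the names are illustrative."""
    mu_z, sigma_z = model.encode(x)
    log_probs = []
    for _ in range(n_samples):
        z = mu_z + sigma_z * torch.randn_like(sigma_z)       # sample from q(z|x)
        mu_x, sigma_x = model.decode(z, x)
        log_probs.append(torch.distributions.Normal(mu_x, sigma_x).log_prob(x))
    rec_prob = torch.stack(log_probs).mean(dim=0)             # E_q[log p(x|z)]
    # Map low reconstruction probability to a high score in (0, 1);
    # this mapping and the 0.5 threshold are modelling choices.
    return 1.0 - torch.sigmoid(rec_prob.sum(dim=-1))          # (batch, W)

# points with a score above 0.5 are reported as contextual anomalies
```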

4 Experiments

4.1 Data Sets and Experimental Setup

We use two data sets for the experiments: a synthetic data set generated by PyOD for evaluating detection performance, and a real smart meter data set of water supply temperatures for district heating. For the real smart meter data set, we segment the time series into subsequences using a sliding window of length 168 and divide the subsequences into a training set, a validation set and a test set with a ratio of 75/15/10. We use PyTorch v1.6.0 to implement our algorithm and train the models on an Intel i9-9900 CPU and NVIDIA RTX 2080 Ti graphics cards with 16 GB of RAM, running Ubuntu 16.04. More details are included in the appendices.

4.2 Evaluation by Comparison with Baselines

To evaluate our method, we compare it with four traditional methods and a VAE baseline on the synthetic data set, using Precision (P), Recall (R) and F1 score (F1) as the comparison metrics. The compared methods include the Cluster-based Local Outlier Factor (CBLOF) [13], K-Nearest Neighbours (KNN) [2], Principal Component Analysis (PCA) [34], One-Class Support Vector Machines (OCSVM) [28] and the VAE baseline [17]. The time series data generated by PyOD have 24,000 data points with 5 features, of which 20% are abnormal; 20,000 normal points are used for training and 4,000 points for testing. Table 1 shows the hyperparameters used in our model.
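The baseline comparison can be reproduced along the following lines with PyOD and scikit-learn; hyperparameters are left at their defaults here rather than the tuned settings of the paper.

```python
from pyod.models.cblof import CBLOF
from pyod.models.knn import KNN
from pyod.models.pca import PCA
from pyod.models.ocsvm import OCSVM
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_baselines(X_train, X_test, y_test, contamination=0.2):
    """Fit the traditional baselines on the (normal) training split and
    report P/R/F1 on the labelled test split."""
    baselines = {
        "CBLOF": CBLOF(contamination=contamination),
        "KNN": KNN(contamination=contamination),
        "PCA": PCA(contamination=contamination),
        "OCSVM": OCSVM(contamination=contamination),
    }
    for name, clf in baselines.items():
        clf.fit(X_train)
        y_pred = clf.predict(X_test)          # 0 = inlier, 1 = outlier
        print(name,
              precision_score(y_test, y_pred),
              recall_score(y_test, y_pred),
              f1_score(y_test, y_pred))
```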

From the results in Table 2, we can see that our method outperforms all the others and is more effective for anomaly detection in terms of all three metrics. In general, the VAE-based networks show better performance in learning normal patterns from the training set. The results also confirm that recurrent neural networks have a good capability to model the long-term temporal dependence of time series (Table 1).

Table 1. Model hyperparameters
Table 2. Performance on the synthetic dataset

4.3 An Empirical Case Study

We next evaluate the proposed method in an empirical case study, which detects the “anomalies” in the water temperature time series of an industrial district heating company. The data are recorded from 19/09/2019 11:05:00 to 11/08/2020 15:00:00 by 23 sensors at an irregular minute-level resolution, with a total of 220,097 observations per time series. We align these fine-grained readings to an hourly resolution by aggregation, obtaining 7,582 observations per time series.

Fig. 3. Latent space visualisation of the training set.

Fig. 4. F1 scores of different models.

For visualisation purposes, we reduce the 3-dimensional latent variables of the training set to 2D using Principal Component Analysis (PCA) and t-distributed Stochastic Neighbour Embedding (t-SNE) [25]. Figure 3 shows the projected points of the subsequences in the latent space for the two dimensionality reduction methods, t-SNE and PCA, respectively. The more similar two subsequences are, the closer their projected points lie. The colour legend represents time, from the earliest (bottom) to the latest (top) subsequences.
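A sketch of this visualisation step with scikit-learn and matplotlib, assuming the per-window latent means have been collected into an array; plot styling details are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latent_space(z, timestamps):
    """Project K-dimensional latent means to 2D with PCA and t-SNE and
    colour the points by time, as in Fig. 3. `z` has shape (num_windows, K)."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    projections = [("PCA", PCA(n_components=2).fit_transform(z)),
                   ("t-SNE", TSNE(n_components=2).fit_transform(z))]
    for ax, (name, z2d) in zip(axes, projections):
        sc = ax.scatter(z2d[:, 0], z2d[:, 1], c=timestamps, cmap="viridis", s=5)
        ax.set_title(name)
    fig.colorbar(sc, ax=axes.ravel().tolist(), label="time index")
    plt.show()
```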

Effects of ELBO+ and Pre-detected Global Anomalies. Intuitively, minimising the impact of anomalies during training assists the learning process of the networks. To examine the effectiveness of the pre-detected global anomalies and our modified ELBO loss function, we calculate the F1 score on the test set of the real smart meter data set and compare the performance under different conditions. The F1 score is a measure of test accuracy calculated from the precision and recall of the test. The four models are (1) the VAE baseline, (2) VAE with global anomaly detection, (3) VAE with ELBO+, and (4) VAE with both global anomaly detection and ELBO+. Figure 4 shows that the pre-detected global anomalies and ELBO+ have a positive effect on anomaly detection and that our model outperforms the VAE baseline.

From Fig. 3, the latent space shows a clear tendency to form groups, which implies that there are distinct features between subsequences. Here, we give three examples (Fig. 5) of subsequences with distinctive features in one time series. From 19/09/2019 to 12/12/2019, the transmission water temperature for district heating stays constantly at 74.2 °C, which appears as a straight line (s1). s2 is around 68.4 °C and stationary. By contrast, s3 has a lower temperature and is nonstationary.

Fig. 5. Distinctive features (or patterns) in the hot water temperature time series.

Fig. 6. An example of a contextual anomaly with the corresponding anomaly scores.

The model outputs the reconstruction probabilities of the time series as anomaly scores for each point. When the anomaly score is higher than 0.5, the corresponding point is classified as an anomaly. We give an example of detected contextual anomalies in the hot water temperature data set. Figure 6 shows a contextual anomaly and the corresponding anomaly scores of the points. The red dot indicates the contextual anomaly.

5 Conclusions and Future Work

In this paper, we proposed an unsupervised anomaly detection algorithm for smart meter data using a VRAE with an attention mechanism. Our method can detect different types of anomalies, including global and contextual anomalies. The enhanced ELBO+ function mitigates the contribution of global anomalies and missing points. We have evaluated our method comprehensively, and the results demonstrate its effectiveness and superiority. For future work, we plan to extend our case study with a real-time architecture for online anomaly detection, and to address concept drift during the anomaly detection process.