1 Introduction

Anomaly detection aims to identify rare observations that differ considerably from the majority of the data [12, 20, 32]. In recent years, diverse research communities, e.g., cyber security [9], anomalous activity detection [24] and image processing [39], have done extensive work on anomaly detection. In real-world scenarios in particular, a wide range of anomaly detection applications [2] involve time series data. For instance, detecting malicious or abnormal activities in sequential sensor readings is vital for the control system of a power grid [18]. Hence, we focus on detecting anomalies in time series data.

Early work on anomaly detection mostly focused on univariate time series, which consider only a single time-related variable or metric. The key to detecting anomalies with a single metric is to learn the temporal dependencies of the time series. Traditional methods [7, 37] often use statistical models such as the mean, ARIMA [23] and Hidden Markov Models [29] to determine the temporal trends of sequences and thus obtain the expected value at a specified point. Recently, numerous deep learning-based methods [1, 33], e.g., CNN [25], RNN [4] and LSTM [22], have been proposed to enhance the representation of time series data and better capture its temporal dependencies. However, in many circumstances, multivariate time series (MTS) data are involved, e.g., detecting anomalies in server machines [17] based on multiple metrics such as CPU usage, bandwidth and network throughput. Given the characteristics of MTS data, e.g., high dimensionality, complex interactions and temporal dependency among variables, MTS anomaly detection remains a challenging problem.

Conventionally, MTS data consist of multiple univariate time series, and thus MTS anomaly detection can be divided into several univariate time series anomaly detection problems [14]. However, this intuitive approach of applying univariate-based methods completely ignores the relationships among different variables in MTS data. To address this issue, a few methods [11, 31] first apply dimensionality reduction techniques to the high-dimensional time series data and then apply univariate-based methods. To capture more complex relationships among variables, various methods [15, 21, 35] based on deep learning techniques have been proposed, such as the Gated Recurrent Unit (GRU) [6] with an AutoEncoder [27]. Nonetheless, most of these methods model the multivariate relationships only implicitly, which still has limitations [8]. To overcome this problem, several deep learning methods [8, 36] have been proposed recently to explicitly construct the relationships among different variables using graph neural networks [19].

Fig. 1 Illustration of dynamic relationships among variables

However, existing methods [8, 17, 36], whether modeling relationships among different variables implicitly or explicitly, tend to neglect that variable relationships can differ under different context sequences. In other words, previous methods model static correlations, while in reality these correlations are dynamic and evolve over time. As shown in Figure 1, a smart grid is equipped with three sensors, i.e., a voltage sensor, a temperature sensor and a current sensor, to monitor its health status. It is normal for the three sensors to follow the same trend, where temperature rises with increasing voltage or current, as shown in sequence 1. In sequence 2, temperature violates this trend. Nevertheless, we cannot treat it as an anomaly, since the temperature in the plant is brought down manually when the outdoor air temperature is high, which is a common situation in practice. Hence, the question is how to capture temporal dependency under different contexts and integrate it with the relationships among different variables.

To address the aforementioned problems, we propose a novel Hierarchical Attention Network for Context Anomaly Detection (HAN-CAD) model to fully exploit the relationships among different variables and their temporal characteristics with regard to various context sequences. We first use GRUs to obtain the initial feature representations of variables and sequences. Then, we construct a similarity graph for the variables and apply graph attention to capture variable-level correlations based on the similarity graph. Furthermore, another attention layer is used to learn the sequence-level temporal relationships between variables and sequences. After hierarchically integrating these relationships, we use a reconstruction model, i.e., an AutoEncoder, to detect anomalies without requiring any ground-truth information. Specifically, our contributions can be summarized as follows:

  • To the best of our knowledge, we are the first to use graph attention mechanisms to capture dynamics of variable relationships and sequences for MTS context anomaly detection.

  • Based on the hierarchical attention structure, we can obtain temporal-aware and context-aware representations so as to better detect anomalies.

  • We perform comprehensive experiments on three real-world datasets. Experimental results show that our proposed method is effective and outperforms the state-of-the-art methods.

The rest of this paper is organized as follows. In Section 2, we overview the related work. Then, our proposed method is described in Section 3, including the problem statement and details of the proposed model. Experiments and empirical evaluations are reported in Section 4. Finally, Section 5 concludes the paper.

2 Related work

Anomaly detection in MTS is an important and challenging task in many real-world applications [2]. Extensive studies have been carried out by academic researchers and industry practitioners. In this section, we briefly review the related deep learning work for MTS anomaly detection since our proposed method is based on deep learning models.

Recent work on deep learning-based MTS anomaly detection can be categorized into three groups: prediction-based models, reconstruction-based models and hybrid models. All of these models [10, 13, 21, 28, 35, 40] follow a similar procedure, which involves feature extraction for MTS using deep learning techniques and construction of different task models. The major difference lies in how the anomaly score is determined, i.e., by the prediction error for prediction-based models, the reconstruction error for reconstruction-based models, and both errors for hybrid models.

The core idea of prediction-based models is to predict the observation \(\tilde{x}_t\) at time step t based on previous observations. The observation at time step t is then flagged as an anomaly if the prediction error between the true observation \(x_t\) and \(\tilde{x}_t\) is larger than a predefined threshold. For instance, Bontemps et al. [3] proposed the first LSTM network for collective anomaly detection with several measures of prediction error. Hundman et al. [13] proposed a dynamic thresholding method based on LSTM to predict future observations for spacecraft. Furthermore, Siami-Namini et al. [26] compared the performance of different LSTM variants on time series data and concluded that BiLSTM [34] is more suitable for time series prediction.

Reconstruction-based models, which are widely used for anomaly detection, try to obtain a representation of the whole sequence and compute the reconstruction error of the observation at each time step. Most reconstruction-based models are based on two deep generative models, namely AutoEncoders (AEs) and Generative Adversarial Networks (GANs) [30]. For instance, Malhotra et al. [21] proposed an LSTM Encoder-Decoder network, in which sequences are represented by LSTM and the reconstruction process is based on AE. To address the overfitting problem of AE, Zhou et al. [38] proposed a robust anomaly detection approach based on GAN by augmenting the data using the time warping technique. For more related work on deep learning methods, readers can refer to [2, 5].

It is worth noting that our work aims to capture variable relationships and their dynamics based on graph neural networks and attention mechanisms. Among existing related work, both Zhao et al. [36] and Deng et al. [8] propose using graph attention to model inter-variable correlations. Zhao et al.'s model is based on a fully connected graph, while Deng et al. use a top-K directed graph to learn the relationships between variables, which is more flexible. Li et al. [17] propose a similar approach to capture inter-variable correlations based on a hierarchical Variational AutoEncoder. However, all of these works consider only static correlations between different variables. In contrast, our proposed method employs a hierarchical attention mechanism on top of a graph neural network to characterize the dynamic correlations for MTS anomaly detection.

3 Methodology

In this section, we present the details and implementation of the proposed method.

3.1 Problem statement

Let \(\mathcal {X}_N = \{ {\textbf {x}}_t \}_{t=1}^{N} \in R^{d \times N}\) denote a multivariate time series of length N, where \({\textbf {x}}_t \in R^{d}\) is the observation with d variables or features at time step t and N is the total number of time steps. In this paper, we aim to detect whether the sequence \(\mathcal {X}_{i:j} = \{ {\textbf {x}}_i,{\textbf {x}}_{i+1},\cdots ,{\textbf {x}}_j\}\) contains abnormal activities, i.e., anomalies, without using any ground-truth information.

As shown in Figure 1, inconsistent trends among different variables can indicate anomalies, and these trends change over time. Therefore, to effectively detect anomalies in multivariate time series, it is essential to capture the relationships among multiple variables and learn their dynamic characteristics as anomalies evolve. We address these challenges by proposing a hierarchical attention network that focuses on two key issues: 1) capturing inter-variable correlations using a graph attention network from the variable-level perspective, and 2) characterizing the dynamic relationships between sequences and variables using GRUs and an attention mechanism from the sequence-level perspective. Finally, anomalies are detected using an AutoEncoder network.

3.2 Overview of proposed model

The overall framework of the proposed method is shown in Figure 2, which involves four main modules:

  • Feature Learning Module: obtains the time-related features for variables and sequences using GRUs.

  • Variable-level Learning Module: learns the interactions between different variables using a graph attention network.

  • Sequence-level Learning Module: learns the evolving relationships between sequences and variables using GRUs and attention mechanism.

  • Reconstruction-based Detection Module: detects the sequence anomalies using an AutoEncoder network.

Fig. 2 Overview of our proposed framework

3.3 Feature representation

The Feature Learning Module takes a sequence of length L, i.e., \(\mathcal {X}_{L} \in R^{d \times L}\), as input and outputs dense vectors as features for variables and sequences, respectively. In this work, we denote \({\textbf {v}}_i\) and \({\textbf {s}}\) as the dense vectors for variable i and the sequence, respectively. More specifically, the behavior of variables in a sequence is closely related to the time step. For instance, temperature sensors in a smart grid system exhibit varying statuses at different times of the day. High temperatures during midnight hours may indicate potential device malfunctions or abnormal operating conditions. These nuanced relationships can ultimately aid in identifying anomalous patterns.

To capture temporal dependencies and acquire better representations for variables and sequences, we use a Bidirectional Gated Recurrent Unit (Bi-GRU) network, which can leverage information from both previous time steps (forward direction) and later time steps (backward direction). Specifically, let \({\textbf {v}}_i = \{ x_1^{i}, x_2^{i}, \cdots , x_L^{i}\}\) be the initial representation containing L consecutive observations for variable i. Then, the representation is updated through the following non-linear transformations sequentially:

$$\begin{aligned} z_t^{i} = \sigma (W_z^{i} [h_{t-1}^{i},x_t^{i}]) \end{aligned}$$
(1)
$$\begin{aligned} r_t^{i}= \sigma (W_r^{i} [h_{t-1}^{i},x_t^{i}]) \end{aligned}$$
(2)
$$\begin{aligned} g_t^{i} = \tanh (W^{i} [r_t^{i} \odot h_{t-1}^{i},x_t^{i}]) \end{aligned}$$
(3)

and

$$\begin{aligned} \overrightarrow{h_t^{i}} = (1-z_t^{i})h_{t-1}^{i}+z_t^{i}g_t^{i} \end{aligned}$$
(4)

where z, r and g are the update gate, the reset gate and the candidate hidden state (which incorporates the reset gate), respectively. \(W_z^{i}\), \(W_r^{i}\) and \(W^{i}\) are all trainable weights. \(h_{t-1}^{i}\) is the output at time step \(t-1\) for variable i. \(\overrightarrow{h_t^{i}}\) is the output of the forward GRU; the output of the backward GRU, \(\overleftarrow{h_t^{i}}\), is obtained analogously. Thus, the final representation for variable i can be formulated as follows:

$$\begin{aligned} {\textbf {v}}_i = h_L^{i} = \overrightarrow{h_L^{i}} + \overleftarrow{h_L^{i}} \end{aligned}$$
(5)

Following the same procedure, we can also get the representation for the whole sequence \({\textbf {s}}\).
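As an illustration of Eqs. (1)-(5), the following is a minimal pure-Python sketch of a Bi-GRU encoder for a single variable with scalar inputs. The helper names (`gru_step`, `bi_gru_encode`, `random_gru_weights`), the random initialization and the hidden size of 4 are assumptions made here for illustration and are not taken from the authors' implementation. The reset gate is applied element-wise, as in a standard GRU, and the last state of each directional pass is used in Eq. (5).

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(W, u):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def gru_step(h_prev, x_t, W_z, W_r, W):
    """One GRU step for a scalar input x_t, following Eqs. (1)-(4).

    Bias terms are omitted because the equations in the text have none.
    """
    u = h_prev + [x_t]                                # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, u)]          # update gate, Eq. (1)
    r = [sigmoid(a) for a in matvec(W_r, u)]          # reset gate, Eq. (2)
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + [x_t]
    g = [math.tanh(a) for a in matvec(W, gated)]      # candidate state, Eq. (3)
    return [(1 - zi) * hi + zi * gi                   # new hidden state, Eq. (4)
            for zi, hi, gi in zip(z, h_prev, g)]


def random_gru_weights(hidden, rng):
    """Hypothetical initializer: three (hidden x (hidden + 1)) matrices."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(hidden + 1)]
                for _ in range(hidden)]
    return mat(), mat(), mat()


def bi_gru_encode(series, fwd_weights, bwd_weights, hidden):
    """Encode one variable's window into a vector, as in Eq. (5)."""
    h_fwd = [0.0] * hidden
    for x_t in series:                 # forward pass over x_1 ... x_L
        h_fwd = gru_step(h_fwd, x_t, *fwd_weights)
    h_bwd = [0.0] * hidden
    for x_t in reversed(series):       # backward pass over x_L ... x_1
        h_bwd = gru_step(h_bwd, x_t, *bwd_weights)
    return [f + b for f, b in zip(h_fwd, h_bwd)]


rng = random.Random(0)
hidden = 4
fwd = random_gru_weights(hidden, rng)
bwd = random_gru_weights(hidden, rng)
v_i = bi_gru_encode([0.1, 0.5, 0.9, 0.4], fwd, bwd, hidden)
print(v_i)  # a 4-dimensional representation of the window
```

In practice a deep learning framework's GRU layer would replace these hand-written loops; the sketch only makes the gate arithmetic explicit.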

3.4 Variable-level learning

In multivariate time-series data, learning the features of each variable independently cannot fully capture the characteristics of anomalies. Moreover, relationships among variables reveal distinctive time-related patterns, which are also useful for detecting anomalies. Hence, rather than learning the variables independently, we leverage their mutual impacts and aggregate the features of these variables at the variable level. To this end, we use a graph attention network to model the relationships and obtain updated features for the variables.

Firstly, we construct a similarity graph among different variables, i.e., a variable-level graph, in which nodes and edges represent variables and relationships, respectively. The variable-level graph \(G = \{V,E\}\) contains a node set \(V=\{v_1, v_2, \cdots , v_d\}\) with features extracted by Bi-GRU, i.e., \(\{{\textbf {v}}_1, {\textbf {v}}_2, \cdots , {\textbf {v}}_d\}\). The similarities between variables are computed using Eq. 6 and then sorted in descending order, and we take the top K most similar pairs as edges. Thereafter, a graph attention mechanism is introduced to model the interactions among variables and conduct variable-level feature learning. Furthermore, we use multi-head attention to extract robust features for variables. Finally, the feature representation for each variable is formed by a weighted sum of all connected node features. The similarity and the updated representation can be formulated as follows:

$$\begin{aligned} s_{ij} = \frac{{\textbf {v}}_i^{\textsf{T}}{} {\textbf {v}}_j}{\Vert {\textbf {v}}_i \Vert \cdot \Vert {\textbf {v}}_j \Vert } \end{aligned}$$
(6)
$$\begin{aligned} {\textbf {v}}_i = \sigma (\frac{1}{H}\sum _{h=1}^{H}\sum _{j \in \mathcal {N}_i} \alpha _{ij}^{h} {\textbf {v}}_j) \end{aligned}$$
(7)

where H is the number of heads, \(\mathcal {N}_i\) is the set of neighbors of variable i and \(\alpha _{ij}^{h}\) is the attention weight of head h, which indicates the relevance of variable j to variable i. Omitting the head index h, it can be computed by:

$$\begin{aligned} r_{ij} = \textrm{LeakyReLU}(W^{r}({\textbf {v}}_i \oplus {\textbf {v}}_j)) \end{aligned}$$
(8)
$$\begin{aligned} \alpha _{ij} = \frac{\exp (r_{ij})}{\sum _{k \in \mathcal {N}_i}\exp (r_{ik})} \end{aligned}$$
(9)

where \(\oplus\) denotes concatenation and \(W^{r}\) is the trainable weight.
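The variable-level step in Eqs. (6)-(9) can be sketched in pure Python as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the top-K pairs are treated as undirected edges, \(\sigma\) in Eq. (7) is taken to be the sigmoid, \(W^{r}\) is modeled as one weight vector per head, the LeakyReLU slope of 0.2 is arbitrary, and a variable without neighbors keeps its original features. None of these choices is stated in the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors, Eq. (6)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_graph(features, k):
    """Keep the k most similar variable pairs as (undirected) edges."""
    d = len(features)
    pairs = [(cosine(features[i], features[j]), i, j)
             for i in range(d) for j in range(i + 1, d)]
    pairs.sort(reverse=True)           # descending similarity
    neighbors = {i: set() for i in range(d)}
    for _, i, j in pairs[:k]:
        neighbors[i].add(j)
        neighbors[j].add(i)
    return neighbors


def leaky_relu(a, slope=0.2):
    return a if a > 0 else slope * a


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def attention_weights(i, nbrs, features, w_r):
    """Softmax over the neighbors of variable i, Eqs. (8)-(9)."""
    scores = [leaky_relu(sum(w * x for w, x in
                             zip(w_r, features[i] + features[j])))
              for j in nbrs]           # list '+' concatenates v_i and v_j
    top = max(scores)                  # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def graph_attention(features, neighbors, heads):
    """Multi-head aggregation of Eq. (7); heads holds one w_r per head."""
    dim = len(features[0])
    updated = []
    for i in range(len(features)):
        nbrs = sorted(neighbors[i])
        if not nbrs:                   # assumption: isolated nodes unchanged
            updated.append(list(features[i]))
            continue
        agg = [0.0] * dim
        for w_r in heads:
            alphas = attention_weights(i, nbrs, features, w_r)
            for alpha, j in zip(alphas, nbrs):
                for c in range(dim):
                    agg[c] += alpha * features[j][c] / len(heads)
        updated.append([sigmoid(a) for a in agg])
    return updated


features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
neighbors = top_k_graph(features, k=1)
updated = graph_attention(features, neighbors, heads=[[0.0, 0.0, 0.0, 0.0]])
print(neighbors)  # {0: {1}, 1: {0}, 2: set()}
```

With K = 1 only the most similar pair (variables 0 and 1) is connected, so each of them aggregates the other's features while variable 2 is left untouched.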

3.5 Sequence-level learning

In addition, relationships between variables are not stable and evolve over time. Especially for variables with strong correlations, their anomaly patterns might vary dramatically across sequences. Previous works treat variables and sequences equally and assign the same weights to them, which cannot reveal the impact of the sequence on the variables. In order to capture the sequence-level dependency, we use another attention mechanism to learn the interaction between variables and sequences. Specifically, the attention weights \(\beta _{j}\) can be computed as follows:

$$\begin{aligned} m_{j} = \textrm{LeakyReLU}(W^{m}({\textbf {v}}_j)) \end{aligned}$$
(10)
$$\begin{aligned} \beta _{j} = \frac{\exp ({\textbf {s}}^{\textsf{T}}m_{j})}{\sum _{k=1}^{d}\exp ({\textbf {s}}^{\textsf{T}}m_{k})} \end{aligned}$$
(11)

where \({\textbf {s}}\) and \(W^{m}\) are the feature vector for sequence and the trainable parameters, respectively.

After obtaining the attention weights for sequence, the updated sequence feature representation can be computed as follows:

$$\begin{aligned} {\textbf {s}}^{\prime } = \beta _0{\textbf {s}} \oplus \sum _{j=1}^{d}\beta _j {\textbf {v}}_j \end{aligned}$$
(12)

That is, the final feature representation concatenates the original sequence representation with the attention-weighted aggregation of the variable representations, which reflects the evolving relationships among variables.
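A minimal sketch of the sequence-level attention in Eqs. (10)-(12) is given below. The function name `sequence_attention` is hypothetical, and because the text does not define \(\beta _0\), it is exposed here as a plain scalar parameter with an assumed default of 1.0. The LeakyReLU slope is likewise an arbitrary choice.

```python
import math


def leaky_relu(a, slope=0.2):
    return a if a > 0 else slope * a


def matvec(W, u):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def softmax(scores):
    top = max(scores)                  # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_attention(s, variables, W_m, beta_0=1.0):
    """Return the updated sequence feature s' and the weights beta.

    beta_0 is an assumed scalar; the text does not say how it is obtained.
    """
    # Eq. (10): project every variable representation
    m = [[leaky_relu(a) for a in matvec(W_m, v)] for v in variables]
    # Eq. (11): softmax over the scores s^T m_j
    scores = [sum(a * b for a, b in zip(s, m_j)) for m_j in m]
    betas = softmax(scores)
    # Eq. (12): concatenate beta_0 * s with the weighted sum of variables
    dim = len(variables[0])
    context = [sum(b * v[c] for b, v in zip(betas, variables))
               for c in range(dim)]
    s_prime = [beta_0 * a for a in s] + context
    return s_prime, betas


s = [1.0, 2.0]
variables = [[0.0, 2.0], [2.0, 4.0]]
W_zero = [[0.0, 0.0], [0.0, 0.0]]      # zero weights give uniform attention
s_prime, betas = sequence_attention(s, variables, W_zero)
print(s_prime)  # [1.0, 2.0, 1.0, 3.0]
```

With zero weights all scores are equal, so the context part of the output is simply the mean of the variable representations, which makes the concatenation in Eq. (12) easy to verify by hand.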

3.6 Reconstruction-based detection

Following the above hierarchical attention process, we can obtain the final feature representation \({\textbf {s}}^{\prime }\) for a sequence \(\mathcal {X}=\{x_1, x_2, ..., x_L\}\). Then, we employ an AutoEncoder network to reconstruct the sequence. Let \(f_e\left( \cdot \right)\) denote the encoder and \(f_d\left( \cdot \right)\) denote the decoder. Given the feature vector \({\textbf {s}}^{\prime }\) for the sequence \(\mathcal {X}\), the encoder maps \({\textbf {s}}^{\prime }\) into the latent representation \({\textbf {z}}\), and the decoder maps \({\textbf {z}}\) back into the reconstruction \(\hat{\mathcal {X}}\) as follows:

$$\begin{aligned} {\textbf {z}} = f_e\left( {\textbf {s}}^{\prime },W^{e}\right) \end{aligned}$$
(13)
$$\begin{aligned} \hat{\mathcal {X} }= f_d\left( {\textbf {z}},W^{d}\right) \end{aligned}$$
(14)

where both \(W^{e}\) and \(W^{d}\) are trainable parameters. Finally, the reconstruction loss can be defined as follows:

$$\begin{aligned} Loss = \frac{1}{L} \sum _{i=1}^{L} \Vert x_i - \hat{x}_i \Vert _2 \end{aligned}$$
(15)

where \(\Vert \cdot \Vert _2\) denotes the \(\ell _2\) norm. The sequence is identified as an anomaly if the reconstruction error is larger than a threshold. In this paper, we adjust the threshold to maximize the F1 score.
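The scoring and thresholding step can be illustrated with a short sketch. It computes the reconstruction error of Eq. (15) and then scans candidate thresholds for the one with the highest F1 score. The function names and the choice of candidates (the distinct observed errors) are assumptions made here; the text does not describe how the threshold search is carried out. Note that this search uses ground-truth labels, so it applies at evaluation time only.

```python
import math


def reconstruction_error(x, x_hat):
    """Eq. (15): mean l2 distance between observations and reconstructions."""
    total = 0.0
    for x_i, x_hat_i in zip(x, x_hat):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_hat_i)))
    return total / len(x)


def f1_score(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(errors, labels):
    """Scan the distinct error values and keep the F1-maximizing threshold."""
    best_tau, best_f1 = None, -1.0
    for tau in sorted(set(errors)):
        preds = [e > tau for e in errors]   # anomaly if error exceeds tau
        score = f1_score(labels, preds)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau, best_f1


print(reconstruction_error([[0.0, 0.0], [3.0, 4.0]],
                           [[0.0, 0.0], [0.0, 0.0]]))  # 2.5

errors = [0.1, 0.2, 0.9, 1.1]    # one reconstruction error per sequence
labels = [0, 0, 1, 1]            # 1 marks an anomalous sequence
print(best_threshold(errors, labels))  # (0.2, 1.0)
```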

The whole learning process of the hierarchical attention method is presented in Algorithm 1.


4 Experimental results

In this section, we conduct experiments on three real-world datasets and evaluate the effectiveness of the proposed method compared with five state-of-the-art methods.

4.1 Datasets and metrics

Experiments are conducted on three publicly available datasets that have ground truth information, which are described as follows:

  • ASD (Application Server Dataset) [17]: This dataset is a collection of 45-day-long status data from 12 servers in a large internet company. The status of servers is monitored based on 19 metrics (\(d = 19\)), e.g., CPU-related metrics, memory-related metrics and network metrics. In our experiments, we only used data from one server to speed up training. Additionally, we used 66.7% of the data for training and the rest for testing.

  • SMD (Server Machine Dataset) [27]: This is another server dataset, consisting of 5-week-long MTS data. SMD contains 12 servers with 38 metrics, including CPU load, network usage and memory usage. In our experiments, we split the data from one server into two parts: 50% was used for training, and the remaining 50% was used for testing.

  • WADI (Water Distribution) [16]: This dataset contains 16-day-long data collected in a water distribution system. Several cyber-attacks were executed, which caused various anomalies in the system. In our experiments, we chose five days of normal data for training and the remaining days containing anomalies for testing.

The detailed statistics of the three real-world datasets are shown in Table 1.

Table 1 Dataset Statistics

In order to compare with other baselines, we evaluated the performance of the proposed method using three commonly used metrics for detection tasks, i.e., precision, recall and F1-score. Furthermore, it is worth mentioning that any sequence containing at least one anomaly is considered as being correctly detected.
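For reference, the three metrics follow their standard definitions. Writing TP, FP and FN for the numbers of true positives, false positives and false negatives (notation introduced here, not used elsewhere in the text), they can be expressed as:

$$\begin{aligned} \textrm{Precision} = \frac{TP}{TP+FP}, \quad \textrm{Recall} = \frac{TP}{TP+FN}, \quad \textrm{F1} = \frac{2 \cdot \textrm{Precision} \cdot \textrm{Recall}}{\textrm{Precision} + \textrm{Recall}} \end{aligned}$$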

4.2 Baselines

We extensively compared the performance of the proposed method with five state-of-the-art MTS anomaly detection methods as follows:

  • LSTM-AE [21] is a classic reconstruction-based anomaly detection method, which exploits temporal dependencies using LSTM and detects anomalies using AutoEncoder.

  • MAD-GAN [16] is another reconstruction-based anomaly detection method based on GAN.

  • MTAD-GAT [36] is a state-of-the-art method that can efficiently model the relationships between variables using a graph attention network. The anomaly score can be inferred from the reconstruction error and prediction probability.

  • GDN [8]: models the structure among different variables using a graph neural network and provides interpretability for anomalies based on the attention weights.

  • InterFusion [17]: captures the relationships among metrics as well as temporal dependency based on hierarchical variational AutoEncoder. InterFusion also provides interpretations based on MCMC methods.

4.3 Experimental setup

In our experiments, the length of the sliding window is set to 100, 100 and 30 for ASD, SMD and WADI, respectively. The models are trained using the Adam optimizer with a learning rate of \(5e\text {-}4\). The sizes of the representations for variables and sequences are both 64. We also use dropout with a probability of 0.2 to reduce overfitting. The number of heads in the multi-head attention is 2. The state-of-the-art methods and the proposed method are trained on a Windows server with a 3.60 GHz Intel i9-9900K CPU and an 11 GB Nvidia GeForce RTX 2080 Ti GPU.
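As a small illustration of how context sequences can be cut from a series with a sliding window, consider the sketch below. The function name `sliding_windows` and the `stride` parameter are assumptions made here; the text specifies only the window lengths, not the stride.

```python
def sliding_windows(series, length, stride=1):
    """Split a list of observations into overlapping windows.

    series: list of N observations, each a list of d values.
    Returns every window of `length` consecutive observations whose start
    index is a multiple of `stride`; incomplete trailing windows are dropped.
    """
    return [series[start:start + length]
            for start in range(0, len(series) - length + 1, stride)]


series = [[t, 10 * t] for t in range(5)]   # N = 5 observations, d = 2
windows = sliding_windows(series, length=3)
print(len(windows))   # 3
print(windows[0])     # [[0, 0], [1, 10], [2, 20]]
```

With a window length of 100, as used for ASD and SMD, each window would hold 100 consecutive observations.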

4.4 Comparison of performance

Firstly, we compare our proposed method with the five baselines on the three real-world datasets. In particular, experiments are repeated 5 times, and the average performance and standard deviation are reported. Table 2 presents the results for all methods in terms of precision, recall and F1 score, in which the best results are bold-faced. In general, HAN-CAD shows promising results in most cases on precision and recall and outperforms all baselines on F1, which demonstrates the effectiveness of our method. In particular, we have the following two observations:

  • From the comparison results on the three datasets, we can observe a clear ordering of the six methods from high to low in terms of the three metrics: “HAN-CAD \(\rightarrow\) InterFusion \(\rightarrow\) GDN \(\rightarrow\) MTAD-GAT \(\rightarrow\) MAD-GAN \(\rightarrow\) LSTM-AE”. Further, it is worth noting that all methods capturing correlations between variables, i.e., InterFusion, MTAD-GAT, GDN and our method, perform better than traditional reconstruction-based methods, which reveals the importance of inter-variable relationships for MTS anomaly detection.

  • All the baselines obtain lower scores on WADI than on the other datasets, which implies that anomalies in WADI are more difficult to detect. This is probably because WADI consists of 112 variables and thus has more complex relationships among variables. However, our proposed method, HAN-CAD, is able to capture these complex relationships through dynamic context-based modeling. Therefore, HAN-CAD outperforms the baseline methods even on the challenging WADI dataset.

Table 2 Performance (\(\%\)) comparison of different methods on three real-world datasets

Fig. 3 F1 scores with different sliding window lengths

In our experiments, we used the sliding window technique to obtain context sequences. To validate the effect of context sequences on different methods, we compared HAN-CAD with MTAD-GAT, GDN and InterFusion under different sliding window lengths in terms of F1 score. As shown in Figure 3, our method consistently showed promising results with different lengths on all three datasets. Moreover, our method presented a stable trend, in which HAN-CAD achieved the best F1 when the length is 100. In contrast, the two graph neural network-based (GNN-based) methods show more fluctuations, which indicates that integrating the inter-variable relationships with context sequences makes GNN-based MTS anomaly detection more robust.

Fig. 4 F1 scores with different number of edges

Furthermore, we investigate how the graph structure impacts the effectiveness of GNN-based MTS anomaly detection. Figure 4 shows the F1 scores of MTAD-GAT, GDN and HAN-CAD with different ratios of edges on the three datasets. The findings show that our method outperforms the other two GNN-based methods in most settings. In addition, all GNN-based methods perform worse on sparse graphs, which may be due to the difficulty of extracting non-linear structural features for relationships in such graphs.

Table 3 Ablation study
Table 4 Comparison of training times in seconds

4.5 Ablation study

Finally, we investigate the impact of the three components of our method on the three datasets. In particular, the first model is trained without the Bi-GRU component, i.e., w/o feature learning. The second model is trained without the graph attention mechanism, i.e., w/o variable learning. The third model is trained without the attention mechanism between sequences and variables, i.e., w/o sequence learning. Table 3 summarizes the results of the ablation study. All three components are important, as removing any one of them decreases performance on all three measures. Additionally, the comparison indicates that the graph attention mechanism is the most critical of the three, suggesting that capturing the relationships among variables is crucial for effectively detecting anomalies in MTS. Moreover, we report the training time for the baselines and the proposed method in Table 4. As expected, the training time increases with the amount of data. Among the GNN-based methods, our approach is the most efficient, possibly due to the stable training achieved by integrating variable-level learning and sequence-level learning. Furthermore, the training time decreases the most when variable learning is not employed in our method.

5 Conclusion

In this paper, we focus on detecting anomalies in multivariate time series. We argue that relationships among different variables are dynamic with regard to context sequences, and that capturing these dynamic relationships can improve the accuracy of anomaly detection. Hence, we propose a novel Hierarchical Attention Network for Context Anomaly Detection in Multivariate Time Series. Two attention layers are hierarchically integrated into our model: a graph attention layer obtains inter-variable relationships, and a second attention layer captures dynamic relationships. The effectiveness of our method is validated on three real-world datasets, and extensive comparison experiments show that it outperforms the state-of-the-art baselines.