
1 Introduction

Anomaly detection for multivariate time series has recently emerged as a prominent research topic. In production and IT systems, time series data directly reflect the working status and operating conditions of the system, which makes them an important basis for anomaly detection. In the past, domain experts usually relied on their expertise to set thresholds for each indicator based on empirical observations. However, with the unprecedented growth in data complexity and scale brought about by rapid technological advances, such traditional techniques are no longer sufficient to address the challenges of anomaly detection. To tackle this problem, many unsupervised methods based on classical machine learning have been developed in recent years, including density estimation-based methods [6] and distance-based methods [3, 14]. Nevertheless, these approaches fail to capture the intricate, high-dimensional relationships that exist among time series.

Recently, deep learning-based methods have substantially improved anomaly detection for multivariate time series. For example, AutoEncoders (AE) [5], VAE [12], GAN [17], and Transformers [23] are popular anomaly detection methods that encode time series data through sequence reconstruction. In addition, Long Short-Term Memory (LSTM) networks [9] and Recurrent Neural Networks (RNN) [22] have also shown promising results for detecting anomalies in multivariate time series. However, most of these methods fail to consider the associations among the individual time series; moreover, they do not offer a clear explanation of which time series are correlated with each other, which impedes the interpretation of detected anomalies. Yet a complex set of multivariate time series is often intrinsically interlinked.

Fig. 1. Multivariate time series segments from the SWaT dataset, with anomalies shaded in red. (Color figure online)

In Fig. 1, the time series are obtained from five sensors of the same process at the SWaT water treatment testbed [16]. The red shaded region corresponds to an anomaly, during which the LIT101 value exceeds its threshold. In addition, the readings of FIT101, MV101, and P101 all changed during this period, and P102 changed after the anomaly had ended. From the fault log, we know that LIT101 is the level transmitter responsible for measuring the water level of the tank, and the anomalous segment corresponds to the tank overflowing. The root cause of this anomaly is the premature opening of MV101 (the inlet valve) in the same process. Since the state of MV101 is limited to only two values (open and closed) and changes irregularly, identifying its abnormal behavior from temporal features alone is challenging. Consequently, integrating spatial features becomes crucial for detecting and explaining the anomaly. Several methods have employed graph neural networks for anomaly detection because of their remarkable capability to leverage spatial structural information, such as MTAD-GAT [26] and GDN [8]. However, MTAD-GAT assumes a complete graph structure for the spatial characteristics of multivariate time series, which may not accurately reflect their asymmetric correlations in real-world scenarios. GDN [8] is limited to a single time point and fails to capture the detailed associations between a time point and a whole sequence. GTA [7] combines graph structures for spatial feature learning with Transformers for temporal modeling. However, it utilizes Gumbel-Softmax, which is insufficient for accurately representing the spatial relationships among multivariate time series.

This paper presents a method for anomaly detection that leverages the spatio-temporal relationships among multivariate time series. The proposed approach jointly optimizes Graph Attention Networks (GAT) and Transformers for unsupervised anomaly detection. To explore the complex temporal and spatial dependencies among diverse time series, a novel graph structure learning strategy is proposed, which treats each time series as a separate node and learns attention weights for each node to obtain a bidirectional graph structure. The proposed method employs GAT together with the learned graph structure to integrate the information of nodes with that of their neighbors, while the temporal features of the time series are modeled with Transformers. Transformers are used because of their capability to capture long-term dependencies, compute global dependencies, and enable efficient parallel computation. To further enhance detection performance, an anomaly amplification strategy based on local and global differences is also introduced. In summary, this paper makes the following major contributions:

  • We propose a new method for learning the graph structure in multivariate time series.

  • We propose a novel method for multivariate time series anomaly detection, which efficiently captures spatio-temporal information using GAT and Transformers.

  • Extensive experiments are conducted on four popular datasets to demonstrate the effectiveness of our proposed method. Additionally, ablation studies are conducted to understand the impact of each component of our architecture.

2 Related Work

2.1 Traditional Anomaly Detection for Multivariate Time Series

Traditional methods for time series anomaly detection are typically density-based, distance-based, or isolation-based. LOF (Local Outlier Factor) [6] is a density-based method, which determines the degree of anomaly by comparing the local density of each data point with that of its surrounding neighbors. KNN [3] is a distance-based outlier detection method, which detects anomalies by calculating the distances between each data point and its K nearest neighbors. IsolationForest [14] partitions the data with randomly constructed trees and uses the path length needed to isolate a point as its anomaly measure. Traditional unsupervised methods are limited in their ability to identify anomalies, as they do not take into account the spatio-temporal relationships inherent in the data.

2.2 Deep Learning Anomaly Detection for Multivariate Time Series

Prediction-Based Models: LSTMNDT [10] leverages an LSTM [9] network to predict time series collected from spacecraft, but it ignores spatial correlations. MTAD-GAT [26] employs two GAT layers to model the spatio-temporal relationships simultaneously; however, it assumes that the spatial structure of multivariate time series is a complete graph, whereas in most cases time series are associated in an asymmetric manner. GDN [8] uses node embeddings for graph structure learning and encodes spatial information using GAT. However, GDN is limited to a single time point and cannot capture the detailed associations between a time point and a whole sequence.

Reconstruction-Based Models: LSTM-VAE [19] combines an LSTM network with a variational autoencoder (VAE) [12] to reconstruct time series. DAGMM [27] combines a deep autoencoder with a Gaussian Mixture Model, but the Gaussian Mixture Model is not suitable for datasets with complex distributions. OmniAnomaly [22] employs a stochastic RNN built on the LSTM-VAE model for anomaly detection. GAN-based methods [13, 17, 20] use generators for reconstruction. Anomaly Transformer [25] learns a prior-association and a series-association and compares them to better identify anomalies. USAD [4] uses a deep autoencoder trained adversarially to learn normal behavior and detect anomalies in new data. However, all the methods mentioned above take into account either temporal or spatial associations but not both, and they may lack sufficient ability to accurately localize anomalies. GTA [7] combines graph structures for spatial feature learning with Transformers for temporal modeling. However, it utilizes Gumbel-Softmax, which is insufficient for accurately representing the spatial relationships between multivariate time series.

3 Method

In this section, we give the details of the proposed spatio-temporal relationship anomaly detection (STAD) method for multivariate time series. First, we present the problem statement and the overall architecture of STAD. We then elaborate on the graph structure learning, the GAT-based spatial model, the Transformers, and the anomaly amplification modules.

3.1 Problem Statement

In our study, a multivariate time series is represented by \(\left\{ \textbf{X}_{1},\textbf{X}_{2},\cdots ,\textbf{X}_{d} \right\} \). For time series i, \(\textbf{X}_{i}=\left[ x_{1i}, x_{2i}, \cdots , x_{Ni} \right] \), where \(\textbf{X}_{i}\in \mathbb {R}^N\) denotes the observed values of time series i, N is the length of \(\textbf{X}\), and d is the number of time series. Our goal is to model the multivariate time series data in order to identify any anomalous behaviors.
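For concreteness, the data layout can be sketched as follows (a minimal illustration with made-up shapes, not part of the method itself):

```python
import torch

# A multivariate time series with d = 5 series observed over N = 100 time steps.
N, d = 100, 5
X = torch.randn(N, d)   # observation matrix; columns are the individual series X_1, ..., X_d
X_i = X[:, 0]           # one series X_i in R^N (here i = 1)
assert X_i.shape == (N,)
```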

3.2 Overview

The overall architecture of the model in this paper is shown in Fig. 2. It consists of three main components:

  (1) Graph structure learning: learns a graph structure that represents the spatial relationships between the multivariate time series.

  (2) GAT-based spatial model: fuses the time series with spatial features using GAT and the learned graph structure.

  (3) Transformers based on an anomaly amplification strategy: Transformers reconstruct the spatio-temporal representation of each time series, and the anomaly amplification strategy amplifies the anomalies.

Fig. 2. An overview of the proposed STAD method.

3.3 Graph Structure Learning

For our model, the primary task is to reconstruct the spatial and temporal relationships of the multivariate time series. For spatial modeling, we utilize a learnable graph to represent the relationships between the time series. We consider each time series as a node, and the relationships between time series are represented as edges in the graph. An adjacency matrix \(A\in \mathbb {R}^{d\times d}\) expresses this graph, where \(A_{ij}\) indicates whether there is an edge from node i to node j. Our framework is flexible and can learn the graph automatically, without prior knowledge of the graph structure. To uncover the hidden dependencies between nodes, and unlike GDN [8], we do not rely on node embeddings to learn the graph structure. Instead, we learn a weight matrix that assigns a weight score to each node based on its own features and its similarity to other nodes, and then apply a TopK filter to retain the most relevant edges for the graph structure:

$$\begin{aligned} e_{ij}=\textrm{LeakyReLU}\left( \textbf{w}^\textrm{T}\cdot \left( \textbf{X}_{i}\oplus \textbf{X}_{j} \right) \right) \end{aligned}$$
(1)
$$\begin{aligned} \mathbf {\rho } _{ij}=\frac{\textrm{exp}\left( e_{ij} \right) }{\sum _{k=1}^{d}\textrm{exp}\left( e_{ik} \right) } \end{aligned}$$
(2)
$$\begin{aligned} A_{ij}=1\Big \{ j\in \textrm{TopK}\left( \big \{ \rho _{ik}:k\in C_{i} \big \} \right) \Big \} \end{aligned}$$
(3)

where \(\oplus \) denotes the concatenation of two node feature vectors, \(\textbf{X}_{i}\in \mathbb {R}^{N}\) is the feature vector of node i, \(\textbf{w}\in \mathbb {R}^{2N}\) is a learnable parameter vector, LeakyReLU is a nonlinear activation function, \(C_{i}\) is the set of candidate neighbors of node i, and \(\rho _{ij}\), an entry of the weight matrix \(\mathbf {\rho } \in \mathbb {R}^{d\times d}\), is the weight score between source node i and target node j. Next, we define a GAT-based spatial model that utilizes the learned adjacency matrix A to model the spatial features of the multivariate time series.
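For illustration, Eqs. (1)-(3) correspond to a procedure along the following lines. This is a minimal PyTorch sketch under our own naming; the dense pairwise concatenation is chosen for readability, not efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphStructureLearner(nn.Module):
    """Learns a sparse adjacency matrix A from raw series features (Eqs. 1-3)."""
    def __init__(self, n_timesteps: int, top_k: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(2 * n_timesteps))  # w in R^{2N}
        self.top_k = top_k

    def forward(self, X: torch.Tensor):
        # X: (d, N), each row is the feature vector of one node (time series).
        d = X.size(0)
        # Concatenate every pair X_i (+) X_j -> (d, d, 2N).
        pairs = torch.cat(
            [X.unsqueeze(1).expand(d, d, -1), X.unsqueeze(0).expand(d, d, -1)], dim=-1
        )
        e = F.leaky_relu(pairs @ self.w)           # e_ij, shape (d, d)            (Eq. 1)
        rho = F.softmax(e, dim=-1)                 # row-normalized weight scores  (Eq. 2)
        # Keep only the top-k most relevant neighbours of each node               (Eq. 3)
        topk_idx = rho.topk(self.top_k, dim=-1).indices
        A = torch.zeros_like(rho)
        A.scatter_(-1, topk_idx, 1.0)
        return A, rho
```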

3.4 GAT-Based Spatial Model

We use GAT and graph structure learning to fuse the information of the nodes with their neighbors. For the input multivariate time series \(\textbf{X}\in \mathbb {R}^{N\times d}\), we compute the aggregated representation \(\mu _i\) of node i as follows:

$$\begin{aligned} \mu _{i}=\textrm{ReLU}\left( \alpha _{i,i}\textbf{WX}_{i}+\sum _{j\in N\left( i \right) }^{}\alpha _{i,j}\textbf{WX}_{j} \right) \end{aligned}$$
(4)

where \(\textbf{X}_{i}\in \mathbb {R}^{N}\) is the input feature of node i, \(N\left( i \right) =\left\{ j \mid A_{ij}> 0 \right\} \) is the neighborhood set of node i obtained from the matrix A, and \(\textbf{W}\in \mathbb {R}^{N\times N}\) is a trainable weight matrix that applies a linear transformation to each node. Unlike GDN [8], we concatenate the node features with the weight scores \(\rho \), so that not only the local but also the global spatial dependencies in the graph can be captured. The attention coefficient \(\alpha _{i,j}\) is computed as follows:

$$\begin{aligned} Concat_i=\mathbf {\rho } _{i}\oplus \textbf{WX}_{i} \end{aligned}$$
(5)
$$\begin{aligned} \pi _{i,j}=\textrm{LeakyReLU}\Big (\textbf{a}^\textrm{T} \big ( Concat_{i}\oplus Concat_{j} \big ) \Big ) \end{aligned}$$
(6)
$$\begin{aligned} \alpha _{i,j}=\frac{\textrm{exp}\left( \pi _{i,j} \right) }{\sum _{k\in N\left( i \right) \cup \left\{ i \right\} }\textrm{exp}\left( \pi _{i,k} \right) } \end{aligned}$$
(7)

where \(\oplus \) denotes concatenation, \( Concat_{i}\) concatenates the weight scores \(\mathbf {\rho } _{i}\) with \(\textbf{WX}_i\), and the vector \(\textbf{a}\) contains the learnable coefficients of the attention mechanism. LeakyReLU is applied when computing the attention coefficients, and the Softmax function normalizes them. Next, we use Transformers to model the temporal features.
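A condensed sketch of the \(\rho \)-augmented aggregation of Eqs. (4)-(7) is shown below, assuming a single attention head and dense matrices for clarity; all names are our own illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGAT(nn.Module):
    """Fuses each node with its graph neighbours via rho-augmented attention (Eqs. 4-7)."""
    def __init__(self, n_timesteps: int, n_nodes: int):
        super().__init__()
        self.W = nn.Linear(n_timesteps, n_timesteps, bias=False)          # W in R^{N x N}
        self.a = nn.Parameter(torch.randn(2 * (n_nodes + n_timesteps)))   # attention vector a

    def forward(self, X: torch.Tensor, A: torch.Tensor, rho: torch.Tensor):
        # X: (d, N) node features; A: (d, d) learned adjacency; rho: (d, d) weight scores.
        d = X.size(0)
        WX = self.W(X)                                   # (d, N)
        concat = torch.cat([rho, WX], dim=-1)            # Concat_i = rho_i (+) WX_i   (Eq. 5)
        pair = torch.cat(
            [concat.unsqueeze(1).expand(d, d, -1), concat.unsqueeze(0).expand(d, d, -1)], dim=-1
        )
        pi = F.leaky_relu(pair @ self.a)                 # pi_{i,j}                    (Eq. 6)
        # Restrict attention to N(i) plus the self-loop and normalize with Softmax    (Eq. 7)
        mask = (A + torch.eye(d, device=A.device)) > 0
        alpha = F.softmax(pi.masked_fill(~mask, float("-inf")), dim=-1)
        mu = F.relu(alpha @ WX)                          # aggregated representation   (Eq. 4)
        return mu                                        # (d, N)
```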

3.5 Transformers Based on an Anomaly Amplification Strategy

We feed \(\mathbf {\mu } \in \mathbb {R}^{N\times d}\) into the Transformers for reconstruction by alternately stacking Multi-Mix Attention and feed-forward layers. This structure better captures the details and patterns present in time series data. The overall equations of layer l are as follows:

$$ \begin{aligned} \textbf{Z}^{^{l}}=\mathrm{Add \& Norm} \Big ( \mathrm{Multi\text{- }Mix \ Attention}\big ( \mathbf {\mu } ^{l-1} \big )+\mathbf {\mu } ^{l-1} \Big ) \end{aligned}$$
(8)
$$ \begin{aligned} \mathbf {\mu } ^{l}=\mathrm{Add \& Norm} \Big ( \mathrm{Feed\text{- }Forward} \big ( \textbf{Z}^{l} \big )+\textbf{Z}^{l} \Big ) \end{aligned}$$
(9)

where \(\mathbf {\mu } ^{l}\in \mathbb {R}^{N\times d_{model}}\), \(l\in \left\{ 1,2,\cdots, L \right\} \), denotes the output of layer l with \(d_{model}\) channels. The initial input is \(\mathbf {\mu } ^{0}=\textrm{Embedding}(\mathbf {\mu })\), and \(\textbf{Z}^{l}\in \mathbb {R}^{N\times d_{model}} \) is the hidden representation of layer l.

Multi-Mix Attention: Inspired by Anomaly Transformer [25], we propose Multi-Mix Attention, which combines local associations and global associations to amplify anomalies. The local associations are derived from a learnable Gaussian kernel, which focuses on adjacent time points and thus amplifies local associations. To prevent the weights from decaying too rapidly or overfitting, we make the scale parameter \(\sigma \) learnable, which allows the kernel to adapt to different time series patterns. In addition, we use the Transformers' self-attention scores as the global associations, which can adaptively find the most effective global distributions. The Multi-Mix Attention of layer l is computed as follows:

$$\begin{aligned} \textbf{Q},\textbf{K},\textbf{V},\mathbf {\sigma } =\mathbf {\mu } ^{l-1}\textbf{M}_{\textbf{Q}}^{l},\mathbf {\mu } ^{l-1}\textbf{M}_{\textbf{K}}^{l},\mathbf {\mu } ^{l-1}\textbf{M}_{\textbf{V}}^{l},\mathbf {\mu } ^{l-1}\textbf{M}_{\mathbf {\sigma } }^{l} \end{aligned}$$
(10)
$$\begin{aligned} \mathrm{Local\text{- }Association:}\textbf{G}^{^{l}}=\textrm{Rescale}\Big ( \Big [ \frac{1}{\sqrt{2\pi }\sigma _{i}} \textrm{exp}\big ( -\frac{\left| j-i \right| ^2}{2\sigma _{i}^{2}} \big )\Big ]_{i,j\in \left\{ 1,\cdots ,N \right\} } \Big ) \end{aligned}$$
(11)
$$\begin{aligned} \mathrm{Global\text{- }Association:}\textbf{S}^{l}=\textrm{Softmax}\left( \frac{\textbf{QK}^{T} }{\sqrt{d_{model}}}\right) \end{aligned}$$
(12)
$$\begin{aligned} \mathrm{Reconstruction:}\widehat{\textbf{Z}}^{l}=\textbf{S}^{l}\textbf{V} \end{aligned}$$
(13)

where \(\textbf{Q},\textbf{K},\textbf{V}\in \mathbb {R}^{N\times d_{model}} \) and \(\mathbf {\sigma } \in \mathbb {R}^{N\times 1}\) denote the query, key, value, and learnable scale, respectively. \(\textbf{M}_{\textbf{Q}}^{l},\textbf{M}_{\textbf{K}}^{l},\textbf{M}_{\textbf{V}}^{l} \in \mathbb {R}^{d_{model}\times d_{model}}\) and \(\textbf{M}_{\mathbf {\sigma }}^{l} \in \mathbb {R}^{d_{model}\times 1}\) denote the parameter matrices of \(\textbf{Q}\), \(\textbf{K}\), \(\textbf{V}\), and \(\sigma \) in the l-th layer. We use the Gaussian kernel to calculate the association weight between every pair of time points and then convert these weights into a discrete distribution through row-wise normalization (Rescale) to obtain \(\textbf{G}^{l} \in \mathbb {R}^{N\times N}\). \(\textbf{S}^{l} \in \mathbb {R}^{N\times N}\) is the attention map of the Transformers; we found that it contains abundant information and can serve as the learned global association. \(\widehat{\textbf{Z}}^{l} \in \mathbb {R}^{N\times d_{model}}\) is the hidden representation after the Multi-Mix Attention in the l-th layer.
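An illustrative single-head sketch of one Multi-Mix Attention layer (Eqs. 10-13) is given below; the softplus applied to \(\sigma \) is only a numerical-stability choice of this sketch, and all names are ours rather than the released implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiMixAttention(nn.Module):
    """One head of Multi-Mix Attention: global self-attention plus a learnable local Gaussian prior."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(d_model, d_model) for _ in range(3))
        self.sigma = nn.Linear(d_model, 1)
        self.d_model = d_model

    def forward(self, mu: torch.Tensor):
        # mu: (N, d_model), hidden representation of the previous layer.
        N = mu.size(0)
        Q, K, V = self.q(mu), self.k(mu), self.v(mu)                 # (Eq. 10)
        sigma = F.softplus(self.sigma(mu)) + 1e-5                    # positive scale, (N, 1)

        # Local association: row-normalized Gaussian kernel over |j - i|      (Eq. 11)
        idx = torch.arange(N, dtype=mu.dtype, device=mu.device)
        dist = (idx.unsqueeze(1) - idx.unsqueeze(0)).abs()           # (N, N)
        G = torch.exp(-dist ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
        G = G / G.sum(dim=-1, keepdim=True)                          # Rescale: rows sum to 1

        # Global association: scaled dot-product attention map                (Eq. 12)
        S = F.softmax(Q @ K.transpose(0, 1) / math.sqrt(self.d_model), dim=-1)

        Z_hat = S @ V                                                # reconstruction (Eq. 13)
        return Z_hat, G, S
```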

We use the KL divergence to measure the difference between the local and global associations [18]. Averaging the association differences over multiple layers fuses more information, and the combined association difference is:

$$\begin{aligned} \textrm{Dis}\left( \textbf{G},\textbf{S} \right) =\Big [\frac{1}{L}\sum _{l=1}^{L}\Big (\textrm{KL}(\textbf{G}_{i,:}^{l}\left| \right| \textbf{S}_{i,:}^{l})+\textrm{KL}(\textbf{S}_{i,:}^{l}\left| \right| \textbf{G}_{i,:}^{l})\Big )\Big ]_{i=1,\cdots ,N} \end{aligned}$$
(14)

where KL(\(\cdot \parallel \cdot \)) is the Kullback-Leibler divergence between the corresponding rows of \(\textbf{G}^l\) and \(\textbf{S}^l\). \(\textrm{Dis}\left( \textbf{G},\textbf{S} \right) \in \mathbb {R}^{N \times 1} \) measures how much the input time series deviates between the local association \(\textbf{G}\) and the global association \(\textbf{S}\). Because the Gaussian kernel is unimodal and local, the local association concentrates on adjacent time points for both normal and anomalous data, whereas normal points tend to exhibit smoother global associations over the whole series. Consequently, the Dis values of anomalous points tend to be smaller than those of normal points, so Dis provides good anomaly discrimination.
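Equation (14) can be read as the following small helper, which averages the row-wise symmetric KL divergence over the L layers (a sketch under our own naming):

```python
import torch

def association_discrepancy(G_layers, S_layers, eps: float = 1e-8):
    """Dis(G, S): per-time-point symmetric KL divergence averaged over layers (Eq. 14).

    G_layers, S_layers: lists of L tensors of shape (N, N) whose rows sum to 1.
    Returns a tensor of shape (N,), one discrepancy value per time point.
    """
    dis = 0.0
    for G, S in zip(G_layers, S_layers):
        kl_gs = (G * ((G + eps).log() - (S + eps).log())).sum(dim=-1)   # KL(G_i || S_i)
        kl_sg = (S * ((S + eps).log() - (G + eps).log())).sum(dim=-1)   # KL(S_i || G_i)
        dis = dis + kl_gs + kl_sg
    return dis / len(G_layers)
```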

3.6 Joint Optimization

Finally, we optimize the spatio-temporal model jointly. We employ an additional loss term to amplify Dis, which further enlarges the difference between normal and abnormal points. The loss functions are:

$$\begin{aligned} \textbf{L}_{1}=\left\| \mu -\textbf{X} \right\| _{F}^{2} \end{aligned}$$
(15)
$$\begin{aligned} \textbf{L}_{2}=\left\| \widehat{\textbf{X}} -\mu \right\| _{F}^{2} \end{aligned}$$
(16)
$$\begin{aligned} \textbf{L}_{total}=\beta \times \textbf{L}_{1}+\left( 1-\beta \right) \times \textbf{L}_{2}-\lambda \times \left\| \textrm{Dis}\left( \textbf{G},\textbf{S} \right) \right\| _{1} \end{aligned}$$
(17)

where \(\widehat{\textbf{X}}\) represents the reconstruction of \(\mathbf {\mu }\) by the Transformers, \(\left\| \cdot \right\| _{F}\) and \(\left\| \cdot \right\| _{1}\) denote the Frobenius norm and the \(\ell _{1}\)-norm, \(\beta \) is a balance parameter in the interval [0, 1], and \(\lambda \) weights the loss terms. When \(\lambda >0\), the optimization amplifies Dis.

Note that excessively amplifying the difference can compromise the accuracy of the Gaussian kernel [18], rendering the local association devoid of meaningful interpretation. To avoid this, Anomaly Transformer proposes a minimax strategy [25]. In the minimization phase, the local association \(\textbf{G}\) is optimized to approximate the global association \(\textbf{S}\) learned from the original sequence. In the maximization phase, the global association is optimized to increase the difference. The loss functions of the two phases are as follows:

$$\begin{aligned} \mathrm{Minimize\ Phase:}\ \textbf{L}_{total}=\beta \times \textbf{L}_{1}+\left( 1-\beta \right) \times \textbf{L}_{2}+\lambda \times \left\| \textrm{Dis}\left( \textbf{G},\textbf{S}_{detach} \right) \right\| _{1} \end{aligned}$$
(18)
$$\begin{aligned} \mathrm{Maximize\ Phase:}\ \textbf{L}_{total}=\beta \times \textbf{L}_{1}+\left( 1-\beta \right) \times \textbf{L}_{2}-\lambda \times \left\| \textrm{Dis}\left( \textbf{G}_{detach},\textbf{S} \right) \right\| _{1} \end{aligned}$$
(19)

where detach denotes stopping gradient backpropagation and \(\lambda >0 \). During the minimization phase, the gradient of \(\textbf{S}\) is not backpropagated, enabling \(\textbf{G}\) to approximate \(\textbf{S}\). Conversely, during the maximization phase, the gradient of \(\textbf{G}\) is not backpropagated, and \(\textbf{S}\) is optimized to amplify anomalies.
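Under these definitions, the two phases of Eqs. (18)-(19) could be implemented roughly as below; `detach()` realizes the stop-gradient, the `association_discrepancy` helper is the one sketched after Eq. (14), and all other names are illustrative.

```python
import torch

def minimax_losses(X, mu, X_hat, G_layers, S_layers, beta=0.5, lam=3.0):
    """Minimize/maximize-phase losses of Eqs. 18-19, built from Eqs. 15-17."""
    l1 = torch.norm(mu - X, p="fro") ** 2          # spatial reconstruction,  Eq. 15
    l2 = torch.norm(X_hat - mu, p="fro") ** 2      # temporal reconstruction, Eq. 16
    recon = beta * l1 + (1 - beta) * l2

    # Dis(G, S) with one side detached in each phase (stop-gradient).
    dis_min = association_discrepancy(G_layers, [S.detach() for S in S_layers])
    dis_max = association_discrepancy([G.detach() for G in G_layers], S_layers)

    loss_min = recon + lam * dis_min.abs().sum()   # minimize phase, Eq. 18
    loss_max = recon - lam * dis_max.abs().sum()   # maximize phase, Eq. 19
    return loss_min, loss_max
```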

Anomaly Score: By combining association differences with joint optimization, we obtain the anomaly score:

$$\begin{aligned} Score=\textrm{Softmax}\Big (-\textrm{Dis}\left( \textbf{G},\textbf{S} \right) \Big )\odot \Big (\beta \times \textbf{L}_{1}+\left( 1-\beta \right) \times \textbf{L}_{2}\Big ) \end{aligned}$$
(20)

where \(\odot \) denotes element-wise multiplication. This design allows the reconstruction error and the anomaly amplification strategy to synergistically improve detection performance.
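Reading \(\textbf{L}_{1}\) and \(\textbf{L}_{2}\) in Eq. (20) as per-time-point reconstruction errors, as the element-wise product \(\odot \) suggests, the score can be sketched as follows (function and argument names are illustrative):

```python
import torch

def anomaly_score(X, mu, X_hat, dis, beta=0.5):
    """Per-time-point anomaly score of Eq. 20.

    X, mu, X_hat: (N, d) tensors; dis: (N,) association discrepancy Dis(G, S).
    """
    recon = beta * (mu - X) ** 2 + (1 - beta) * (X_hat - mu) ** 2   # element-wise errors
    recon = recon.sum(dim=-1)                                       # aggregate over the d series -> (N,)
    return torch.softmax(-dis, dim=0) * recon                       # larger where Dis is small
```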

4 Experiments

4.1 Datasets

To evaluate our method, we carry out detailed experiments on four datasets. The characteristics of these datasets are summarized in Table 1.

  • Secure Water Treatment Testbed (SWaT): The SWaT dataset is derived from genuine industrial control system data obtained from a water treatment plant [16]. It contains 51 sensors.

  • Water Distribution Testbed (WADI): This is an extension of the SWaT system, but has a larger and more complex data scale compared to the SWaT dataset [2].

  • Server Machine Dataset (SMD) [22]: SMD consists of 38-dimensional data collected over a 5-week period from a large Internet company. Only a subset of the dataset is used for evaluation, because service changes affected some of the machines; the subset consists of 7 entities (machines) that did not undergo any service changes.

  • Pooled Server Metrics (PSM) [1]: The PSM dataset, provided by eBay, reflects the status of servers and contains 25 dimensions in total.

Table 1. Details of the datasets.

4.2 Baseline and Evaluation Metrics

We compare our STAD with several baseline approaches, including a traditional method, Isolation Forest [14], and deep-learning-based models: USAD [4], GDN [8], OmniAnomaly [22], LSTM-VAE [19], DAGMM [27], and Anomaly Transformer [25]. We use Precision, Recall, and F1 score, which are widely used in anomaly detection, to evaluate the performance of our method.

4.3 Implementation Details

Adhering to the protocol established in Anomaly Transformer [25], we use non-overlapping sliding windows to obtain sub-sequences, with the window size uniformly set to 100. We use grid search to obtain the anomaly threshold and the hyperparameters that yield the highest F1 score. The Top-K values for SWaT, PSM, WADI, and SMD are 10, 5, 30, and 15, respectively. The Transformer model consists of 3 layers; we set the number of heads to 8 and the \(d_{model}\) dimension to 512. The value of \(\lambda \) is set to 3, \(\beta \) to 0.5, and we employ the Adam optimizer [11] with a learning rate of \(10^{-4}\). Training uses an early stopping strategy and a batch size of 32. All experiments were conducted in PyTorch on a single NVIDIA Titan RTX 12 GB GPU. To ensure that every timestamp within an anomaly event can be detected, we adopt the widely used point adjustment strategy [21, 22, 24]. For fairness, the same point adjustment strategy was applied to all baseline experiments.
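For reference, the point adjustment strategy marks an entire ground-truth anomaly segment as detected as soon as any point inside it is flagged; a minimal sketch of this common protocol (our own implementation, not the authors' code) is:

```python
import numpy as np

def point_adjust(pred: np.ndarray, label: np.ndarray) -> np.ndarray:
    """If any point inside a ground-truth anomaly segment is predicted anomalous,
    mark the whole segment as detected (standard point adjustment)."""
    pred = pred.copy()
    in_segment, start = False, 0
    for i in range(len(label)):
        if label[i] == 1 and not in_segment:
            in_segment, start = True, i
        if in_segment and (label[i] == 0 or i == len(label) - 1):
            end = i if label[i] == 0 else i + 1
            if pred[start:end].any():
                pred[start:end] = 1
            in_segment = False
    return pred
```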

4.4 Result Analysis

In many real-world anomaly detection applications, failing to detect anomalies can have severe consequences; detecting all genuine attacks or anomalies is therefore more crucial than achieving high accuracy. As shown in Table 2, our proposed STAD outperforms the other methods in terms of F1 score. It is noteworthy that while most methods perform well on datasets such as SWaT, PSM, and SMD, whose anomalies are comparatively easy to detect, our model still achieves a higher F1 score. On more complex multivariate time series datasets such as WADI, most existing methods yield poor results, while our model shows a significant improvement over them. We also observe that: (1) compared to traditional unsupervised methods, deep learning-based techniques generally demonstrate superior detection performance; (2) compared to models that learn only a single type of relationship, jointly learning temporal and spatial relationships significantly improves anomaly detection.

Table 2. Experimental results on four public datasets (%).

4.5 Ablation Experiments

To investigate the efficacy of each component of our method, we conducted ablation experiments and observed how the model performance varies on the four datasets. First, we investigated the significance of using GAT to model spatial dependencies by applying the raw data directly as input to the Transformers. Second, we replaced the learned graph with a static graph to verify the effectiveness of our proposed graph structure learning. Finally, to validate the necessity of Multi-Mix Attention, we removed it and used only the spatial relations for reconstruction.

Table 3. Experimental results of STAD and its variants (%).

The summarized results are presented in Table 3, from which we make the following observations: (1) the gap between the models that do not learn a graph structure and our proposed model highlights the importance of spatial features in anomaly detection for multivariate time series; (2) our graph structure learning is more effective than using a static graph; (3) the Transformer architecture with Multi-Mix Attention achieves remarkable performance in handling time series data. Overall, each component of our model is effective and indispensable, endowing the framework with powerful capabilities for detecting anomalies in multivariate time series.

4.6 Interpretability

We visualize the anomaly amplification strategy in Fig. 3. On real-world datasets, our model correctly detects the anomalies. On the SWaT dataset, our approach detects anomalies at an early stage, indicating its potential for practical applications such as providing early warning of faults.

Fig. 3. Visualization of model learning on a real-world dataset. Anomalies are marked by red shading. (Color figure online)

Fig. 4. Visualization of graph structure learning.

Additionally, the visualization of the learned graph structure further demonstrates the effectiveness of our proposed model. Figure 4(a) shows the process diagram of the secure water treatment testbed [16]. The SWaT system is mainly divided into 6 processes, and sensors in the same process stage are more likely to be interdependent. Figure 4(b) displays the t-SNE [15] plot of the sensor embeddings learned by our model on the SWaT dataset, where most nodes belonging to the same process cluster together. This demonstrates the effectiveness of our graph structure learning.
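A plot in the style of Fig. 4(b) can be produced with scikit-learn's t-SNE; the snippet below is a generic sketch with stand-in data, not the authors' exact pipeline.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-in data: one learned feature vector per SWaT sensor and its process stage (1-6).
node_features = np.random.randn(51, 64)
process_ids = np.random.randint(1, 7, size=51)

# Project the per-sensor features to 2D and colour points by process stage.
emb_2d = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(node_features)
plt.scatter(emb_2d[:, 0], emb_2d[:, 1], c=process_ids, cmap="tab10")
plt.colorbar(label="SWaT process stage")
plt.show()
```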

4.7 Case Analysis

We use the example in Fig. 1 to illustrate how our model helps with anomaly interpretation. From the earlier analysis, we know that the anomaly manifests as a water tank overflow, but its root cause is the early opening of MV101. It is hard to detect an anomaly in MV101 from its irregular switching status alone. However, as shown in Fig. 5(a), our model successfully detects the anomaly in MV101.

Fig. 5. Case study showing the attack in SWaT.

In addition, the other sensors are expected to be correlated with MV101 when the system is functioning normally. Figure 5 presents the weight scores between MV101 and the other sensors of the same process. As depicted in Fig. 5(b), our model effectively learns the features associated with MV101 under normal conditions. When anomalies occur (corresponding to the red section in Fig. 1), the sensors' weight scores are visualized in Fig. 5(c). It is evident that the attacked sensor (MV101) is more closely associated (darker in color) with the other sensors in the same subprocess. This is reasonable: when an anomaly occurs, the sensors associated with the anomaly are affected more strongly.

5 Conclusion

This paper proposes a novel approach for multivariate time series anomaly detection by leveraging spatio-temporal relationships. The proposed approach utilizes a graph attention network (GAT) and a graph structure learning strategy to capture spatial associations among multivariate time series. Additionally, Transformers are used to model temporal relationships within the time series. An anomaly amplification strategy is also employed to enhance the anomaly scores. Experimental results demonstrate that the proposed method outperforms existing approaches in identifying anomalies and is effective in explaining anomalies. Future work may involve incorporating online training techniques to better handle complex real-world scenarios.