1 Introduction

Modern industrial systems are becoming increasingly complex due to technological advancements, changing customer demands, globalization, and regulatory requirements [1]. In such systems, the operation and maintenance of interconnected sensors generate massive multivariate time series data with high dimensionality and spatio-temporal complexity [2, 3]. Efficient and precise anomaly detection techniques for multivariate time series enable companies to continuously monitor their key indicators and receive timely alerts for potential events [4, 5]. With these techniques, operators can perform scheduled maintenance based on detected anomalies, minimizing downtime and emergency repairs. In terms of security, anomaly detection helps monitor data traffic and system behavior, identifying cyber security threats or unauthorized access attempts and triggering the necessary security protocols. Recently, rapid progress in neural network-based methods has led to significant performance improvements in anomaly detection models.

Some typical methods [6,7,8,9] adopt models such as the autoencoder (AE) and the recurrent neural network (RNN) as their core modules to obtain representations and construct networks. However, such models focus entirely on extracting temporal information and do not explicitly model the correlations in multivariate time series. These correlations arise because sensors in actual industrial systems are interconnected in complex, nonlinear ways; for instance, a change in flow will affect both temperature and pressure [10]. Determining whether the entire system is running normally based on a single sensor is therefore challenging. Consequently, such correlations should be considered for anomaly detection in multivariate time series.

Note that graph neural networks (GNNs) [11, 12] can effectively extract and discover features and patterns from graph-structured data. Several works have employed GNNs to model topological structure relationships among multiple sensors [10, 13,14,15,16]. As depicted in the top part of Fig. 1, such methods employ embedding vectors to capture each sensor's distinctive characteristics. Then, at iteration i, the graph structure relationships \(A_{i}\) are learned from the sensor embedding alone. In other words, the update of the graph structure relationships \(A_{i}\) is not subject to \(A_{i-1}\): each iteration learns independently, neglecting the valuable information from previous ones. This learning scheme lacks continuity in modeling the graph structure between sensors, leading to inadequate modeling of sensor relationships. The motivation of this paper is therefore to introduce a sustainable graph structure learning process. Sustainable updating ensures that the knowledge gained in each iteration is not discarded, leading to more accurate and robust anomaly detection.

Fig. 1: Comparison of graph structure learning between the existing method and our proposed method

In this paper, we present a Time-Series Graph Attention network (TS-GAT) approach to improve anomaly detection performance. As shown in the bottom part of Fig. 1, we propose a sustainably updating graph structure learning method that continually learns relationships between sensors from the sensor embedding. Specifically, the graph structure relationships \(A_{i}\) are not only learned from the sensor embedding but are also subject to \(A_{i-1}\). Our learning scheme fully exploits the previously learned graph structure relationships \(A_{i-1}\) to implement sustainable updating, which improves the efficiency of modeling relationships between sensors. Besides, we construct a time series encoder that generates temporal views of the multivariate time series. Our contributions are summarized as follows:

  • We present a novel graph attention network-based model that concurrently learns from temporal and spatial perspectives, thereby providing more knowledge of the relationships between multivariate sensors.

  • We present a sustainable updating method for graph structure learning that combines a similarity-constrained loss and a threshold selection strategy. It facilitates more efficient learning of topological sensor dependencies.

  • Experimental results on three publicly available real-world datasets demonstrate that our model surpasses the state-of-the-art approaches.

The remaining sections are organized as follows. Section 2 reviews relevant literature on deep learning anomaly detection models. Section 3 describes the methodology, including the model framework, training loss function, and anomaly score calculation. Section 4 presents the experimental results, displayed visually through tables and graphs. Section 5 briefly summarizes the proposed model.

2 Related work

Currently, numerous deep learning approaches are being applied to anomaly detection. This section sorts out the mainstream models.

AE has a strong nonlinear representation ability and employs the reconstruction residuals as a criterion for anomaly discrimination. Zong et al. [6] proposed an end-to-end hybrid model that combined an AE with a Gaussian method. Another study [7] adopted an AE architecture consisting of one encoder and two decoders, and leveraged an adversarial learning technique to train the model. Naito et al. [17] adopted a two-stage AE model to enhance the interpretability and accuracy of anomaly detection. Moreover, RNNs capture the temporal characteristics needed for time series tasks. Park et al. [9] proposed a framework combining a variational autoencoder (VAE) and a long short-term memory network (LSTM) to tackle anomaly detection on multimodal data. Fährmann et al. [18] also proposed a hybrid model of VAE and LSTM, but focused on being lightweight. To characterize the temporal correlation of time series distributions, Li et al. [8] built a generative adversarial network-like architecture using LSTM and RNN as basic modules and performed anomaly detection through a combination of reconstruction and discrimination errors. In addition, the prevalent transformer [17] also appears in some studies. Zeng et al. [19] constructed a deep transformer-based model, trained in an adversarial manner, which used anomaly scores combining reconstruction residuals and probability during detection. However, the models discussed above are essentially incapable of explicitly modeling the relationships among multivariate sensors.

To deal with the above issue, some research has shifted to sensor relationship modeling. By virtue of their excellent graph relationship modeling capabilities [20,21,22,23], GNNs have been extensively applied in anomaly detection. Deng and Hooi [14] adopted a graph attention network (GAT) to learn sensor relationships, first employing a linear layer to embed sensors and then using the top-k technique to determine the graph structure. Tang et al. [16] is highly similar to [14], except that [16] utilized a gated recurrent unit to learn temporal features. Zhan et al. [15] established a reconstruction GAT model focused on multi-scale feature learning. In terms of model design, Zhao et al. [13] and [10] are comparable: both employed two GATs to simultaneously learn spatio-temporal linkages, optimizing the reconstruction and prediction networks. However, these models share a main drawback: their graph structure learning lacks continuity in modeling sensor relationships.

To compare the above models more clearly, we have compiled a summary in Table 1, from which we can draw two conclusions. First, the non-graph models are hybrids of existing deep models that emphasize the extraction of temporal features but ignore the modeling of sensor relationships. Second, existing graph models use GAT to model relationships, but they suffer from independent updating during training. In contrast to the above methods, our proposed model uses sustainably updated graph structure learning to model sensor relationships. The proposed model is explained in detail next.

Table 1 Comparison of the pros and cons of existing work

3 Method overview

3.1 Problem statement

The objective of anomaly detection in multivariate time series is to find anomalous data during testing, which is usually implemented using the paradigm of unsupervised learning. Under this paradigm, the training dataset consists solely of normal data, whereas the test dataset contains both normal and abnormal data. The two data types are distinguished based on their patterns and behaviors. As depicted in Fig. 2, the highlighted area indicates an anomaly segment, with apparent fluctuations in the data, while the normal area is relatively stable.

Fig. 2: Example of normal and anomalous data

Generally, the multivariate time series data are denoted as \(X=\left\{ x_1, x_2, \ldots , x_n\right\} \in \mathbb {R}^{n \times m}\), where n represents the length of the data and m denotes the number of features observed from sensors. For any time step t, \(x_t \in \mathbb {R}^m\) is an m-dimensional vector. For the input data, we perform normalization and sliding windowing (window w and step c, as plotted in Fig. 3). The final output is a vector \(Y=\left\{ y_1, y_2, \ldots , y_n\right\} \), where \(y_t \in \{0,1\}\) and \(y_t=1\) declares that the current sample is an anomaly.
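As an illustration of this preprocessing, the sketch below normalizes the series and slices it into windows; the function names and the choice of min-max scaling are our assumptions rather than details specified in the paper.

```python
import numpy as np

def normalize(X):
    # Min-max normalization per feature (an assumed choice; the paper
    # does not specify the exact normalization scheme).
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-8)

def sliding_windows(X, w=15, c=3):
    # Slice a (n, m) series into overlapping windows of length w,
    # advancing by step c (w=15 and c=3 follow Sect. 4.3).
    return np.stack([X[s:s + w] for s in range(0, len(X) - w + 1, c)])

X = np.random.rand(1000, 51)   # e.g., n=1000 time steps, m=51 sensors
windows = sliding_windows(normalize(X))
print(windows.shape)           # (329, 15, 51)
```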

Fig. 3: Example of sliding window

3.2 Model architecture

We comprehensively elucidate the proposed model TS-GAT in this section, as depicted in Fig. 4. Our model mainly comprises three core components: time series encoder, sustainable updating graph structure learning, and forecasting-based decoder. Each is then thoroughly explained.

Fig. 4: A high-level framework of TS-GAT

3.2.1 Time series encoder

In multivariate time series data, distinct sensors often have unique properties and may be related to each other in complex ways. For example, consider a smart home system with sensors measuring temperature, humidity, and light. It is reasonable to assume that temperature and humidity sensors in different rooms will behave similarly. Within the same room, however, there is often a close correlation between these sensors: a rise in temperature may be accompanied by a drop in humidity. Therefore, a flexible and diverse way to describe the behavior of each sensor is required. Sensor embedding [14] can convert sensor data and features into embedding representations in a high-dimensional vector space. We therefore propose a time series encoder to flexibly obtain embedding vectors from multivariate time series.

As described in Fig. 4, the time series encoder is formed with an LSTM layer, a timestamp masking layer, and a fully connected layer. Among them, LSTM can better capture long-term dependencies by introducing special memory units and gating mechanisms, as shown in Fig. 5.

Fig. 5: A flow diagram of the LSTM model. The LSTM consists of a forget gate \(f_t\), an input gate \(i_t\), an output gate \(o_t\), and a memory unit

For input data x, the LSTM layer handles long-term temporal dependencies in a recurrent, memorized manner, effectively capturing temporal features. The timestamp masking layer occludes the latent features at stochastically selected timestamps to obtain more robust views. The fully connected layer maps the masked features to produce the representative sensor embedding z, as shown in Eq. 1.

$$\begin{aligned} z=F(x) \end{aligned}$$
(1)

where F is the time series encoder and \(z \in \mathbb {R}^{d}\), with d denoting the embedding dimension.

Through the above sensor encoding, the model maps each unique sensor behavior into a semantically rich representation vector. This representation ability helps in deeply understanding the patterns and laws behind sensor behavior and provides more information for subsequent tasks.
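To make the encoder concrete, the following PyTorch sketch stacks the three layers described above, encoding each sensor's window into a d-dimensional embedding so that the per-sensor vectors \(z_i\) can later be compared in Eq. 4; the class name, the per-sensor processing, the masking probability, and the use of the last hidden state are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class TimeSeriesEncoder(nn.Module):
    """Sketch of the encoder F in Eq. 1: LSTM -> timestamp masking ->
    fully connected layer, applied per sensor (an assumed reading)."""

    def __init__(self, d=128, mask_prob=0.1):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=d, batch_first=True)
        self.fc = nn.Linear(d, d)
        self.mask_prob = mask_prob  # fraction of timestamps occluded (assumed value)

    def forward(self, x):                      # x: (m, w) — one window per sensor
        h, _ = self.lstm(x.unsqueeze(-1))      # temporal features: (m, w, d)
        if self.training:
            # Timestamp masking: occlude latent features at stochastically
            # selected timestamps to obtain more robust views.
            keep = (torch.rand(h.shape[0], h.shape[1], 1, device=h.device)
                    > self.mask_prob).float()
            h = h * keep
        z = self.fc(h[:, -1, :])               # sensor embeddings z: (m, d)
        return z
```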

3.2.2 Sustainable-updating graph structure learning

Based on the sensor embeddings obtained from the above time series encoder, this section introduces graph structure learning in detail. When modeling sensor relationships, the entire set of multivariate sensors is treated as a graph. Since the relationships in the graph structure are not required to be symmetric, a directed graph is defined to represent the structure: nodes represent sensors, and edges represent relationships between sensors.

The proposed sustainable updating method integrates a similarity-constrained loss and a threshold selection strategy. First, we define a global, learnable similarity matrix H, created through random initialization, as shown in Eq. 2. Then, we leverage a threshold selection strategy to determine the adjacency matrix A instead of the inflexible top-k form, as indicated in Eq. 3. Concretely, nodes i and j are considered related if their similarity in H is at least the threshold \(\delta \). The threshold selection strategy avoids a redundant, fully connected graph. Finally, we compute another similarity S between node i and all of its potential neighbors j, as in Eq. 4. Through the similarity-constrained loss between H and S, the learning of H ensures the sustainable updating of A from the data: the graph structure, i.e., the adjacency matrix A, is continuously updated based on previous learning, guaranteeing continuity.

$$\begin{aligned} H=\mathrm {Rand}() \end{aligned}$$
(2)
$$\begin{aligned} A_{ji}=\mathbb {1}\left\{ (i, j) \in \mathrm {Index}(H_{ij}\ge \delta )\right\} \end{aligned}$$
(3)
$$\begin{aligned} S_{ij}=\frac{z_{i}^{\top } \cdot z_{j}}{\Vert z_{i}\Vert \cdot \Vert z_{j}\Vert } \end{aligned}$$
(4)

where \(H \in \mathbb {R}^{m \times m}\) denotes a learnable matrix derived from the Rand stochastic function. Index stands for the index-pair operation of a matrix. z is the embedding vector obtained from the time series encoder. S denotes the similarity matrix. Both i and j belong to [1, m].
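A minimal sketch of Eqs. 2–4 for m sensors follows; the variable names and the small epsilon guard are illustrative assumptions.

```python
import torch

m, d, delta = 51, 128, 0.5

# Eq. 2: global, learnable similarity matrix H, randomly initialized.
H = torch.nn.Parameter(torch.rand(m, m))

def adjacency(H, delta=0.5):
    # Eq. 3: threshold selection — nodes i, j are connected iff H_ij >= delta,
    # avoiding both a fully connected graph and a fixed top-k cutoff.
    return (H >= delta).float()

def cosine_similarity_matrix(z):
    # Eq. 4: pairwise cosine similarity between sensor embeddings.
    z_norm = z / z.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return z_norm @ z_norm.T

z = torch.randn(m, d)            # embeddings from the time series encoder
A = adjacency(H.detach(), delta)
S = cosine_similarity_matrix(z)
```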

Following the adjacency matrix A, we utilize GAT to capture the sensor dependencies, employing a flexible graph structure to represent the associations between individual sensors. As depicted in Fig. 6, the graph attention mechanism of GAT enables the network to learn the strength of relationships, i.e., attention coefficients, between each node and its neighboring nodes. This coefficient allocation allows the network to focus on neighboring nodes highly relevant to the current node, enhancing the model’s expressive ability. Its calculation process is shown in Eqs. 5 and 6.

Fig. 6: A simple representation of the graph attention mechanism

$$\begin{aligned} e_{i, j}=\mathrm {LeakyReLU}\left( a^{\top } \cdot \left( W z_i \oplus W z_j\right) \right) \end{aligned}$$
(5)
$$\begin{aligned} \alpha _{i, j}=\frac{\exp \left( e_{i, j}\right) }{\sum _{k \in \psi (i) \cup \{i\}} \exp \left( e_{i, k}\right) } \end{aligned}$$
(6)

where a denotes the learnable attention weights. W is a learnable weight matrix. z represents the embedding vector obtained from the time series encoder. \(\oplus \) means the operation of concatenation. \(\psi (i)=\left\{ j \mid A_{j i}>0\right\} \) represents the set of neighbors of node i. \(\alpha _{i, j}\) denotes the attention coefficients. Using \(\alpha \), the aggregate representation \(v_i\) of node i is defined in Eq. 7.

$$\begin{aligned} v_i=ReLU\left( \alpha _{i, i} W z_i+\sum _{j \in \psi (i)} \alpha _{i, j} W z_j\right) \end{aligned}$$
(7)
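The attention computation of Eqs. 5–7 can be sketched as a single-head layer as follows; the dense pairwise formulation and the parameter shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Sketch of Eqs. 5-7: single-head graph attention over the learned
    adjacency A."""

    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)       # shared weight matrix W
        self.a = nn.Parameter(torch.randn(2 * d))  # attention vector a

    def forward(self, z, A):                   # z: (m, d), A: (m, m)
        Wz = self.W(z)                         # (m, d)
        m = z.size(0)
        # Eq. 5: e_ij = LeakyReLU(a^T (Wz_i ⊕ Wz_j)) for every pair (i, j).
        pairs = torch.cat([Wz.unsqueeze(1).expand(m, m, -1),
                           Wz.unsqueeze(0).expand(m, m, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a)       # (m, m)
        # Eq. 6: softmax over each node's neighbors ψ(i) ∪ {i}.
        mask = (A + torch.eye(m)) > 0
        e = e.masked_fill(~mask, float('-inf'))
        alpha = torch.softmax(e, dim=1)
        # Eq. 7: aggregate the neighborhood into v_i.
        return torch.relu(alpha @ Wz)          # v: (m, d)
```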

3.2.3 Forecasting-based decoder

Through the representations v obtained above, we can obtain the predicted output. The prediction paradigm portrays the future behavior of sensors by modeling historical observations of normal data. Therefore, we construct a forecasting-based decoder G, formed from a causal convolution [24] and a fully connected layer. The fully connected layer performs the feature mapping. Causal convolution is a strictly time-constrained model that obeys the fundamental contextual dependency of data on temporal order. As shown in Fig. 7, causal convolution only uses historical data in its computations, making the model more suitable for capturing patterns of time series and helping to improve forecasting performance.

Fig. 7: Visualization of causal convolution

To produce the model output \({\hat{x}}\), we integrate the information from the embedding vectors z and the graph attention-based representations v and feed it to the decoder G, as shown in Eq. 8.

$$\begin{aligned} {\hat{x}}=G\left( z \circ v\right) \end{aligned}$$
(8)

where \(\circ \) denotes an element-wise operation.
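Causal convolution, the core of the decoder G, can be realized by left-padding a standard 1-D convolution so that the output at time t depends only on inputs up to t; a minimal sketch follows (the channel sizes are illustrative).

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at current and past time steps,
    matching the strict temporal-order constraint in Fig. 7."""

    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        self.left_pad = kernel_size - 1          # pad the history side only
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)

    def forward(self, x):                        # x: (batch, in_ch, time)
        x = nn.functional.pad(x, (self.left_pad, 0))  # no future leakage
        return self.conv(x)

# Output at each step t uses only inputs t-4 ... t (kernel size 5, Sect. 4.3).
y = CausalConv1d(in_ch=128, out_ch=128)(torch.randn(1, 128, 15))
print(y.shape)   # torch.Size([1, 128, 15])
```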

3.3 Model training and anomaly detection

Model training. To reduce the discrepancy between the input data x and the forecasting output \({\hat{x}}\), we adopt the mean squared error (MSE) as the prediction loss in Eq. 9. Furthermore, we compute a similarity loss from the two similarity matrices S and H, as shown in Eq. 10. Imposing the similarity loss to constrain H further ensures the sustainable updating of the adjacency matrix A. In sum, when training the proposed model, we minimize the loss function in Eq. 11.

$$\begin{aligned} L_{pred} =\frac{1}{n} \sum _{i=1}^n \Vert x_{i}-{\hat{x}}_{i} \Vert ^2_{2} \end{aligned}$$
(9)
$$\begin{aligned} L_{sim} = \Vert S-H \Vert ^2_{2} \end{aligned}$$
(10)
$$\begin{aligned} L =L_{pred} +\lambda \cdot L_{sim} \end{aligned}$$
(11)

where \(\lambda \in (0,1]\) balances the two loss terms.
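For clarity, a minimal sketch of the training objective in Eqs. 9–11 is given below; the function name and the use of a squared Frobenius-style error for the similarity term are our illustrative assumptions.

```python
import torch

def total_loss(x, x_hat, S, H, lam=0.8):
    # Eq. 9: MSE prediction loss between input x and forecast x_hat.
    l_pred = torch.mean((x - x_hat) ** 2)
    # Eq. 10: similarity-constrained loss between S (Eq. 4) and H (Eq. 2).
    l_sim = torch.sum((S - H) ** 2)
    # Eq. 11: combined objective; lam = 0.8 follows Sect. 4.3.
    return l_pred + lam * l_sim
```

Minimizing the similarity term pulls the learnable matrix H toward the embedding similarities S, so the adjacency matrix derived from H in Eq. 3 is refined continuously across iterations rather than recomputed from scratch.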

Based on the above loss functions, we provide Algorithm 1 to clearly show the model’s training process.

Algorithm 1: The TS-GAT model training algorithm

Anomaly detection. Following [14], we leverage graph deviation scoring (GDS) to obtain the anomaly scores. GDS computes anomaly statistics that are less sensitive to severe biases from specific sensor behavior. Firstly, we obtain the prediction error specified in Eq. 12. To prevent the impact of the different scales of sensors, it is more robust to normalize each error by its median and interquartile range (IQR) rather than its mean and standard deviation, as shown in Eq. 13. Then, the max function aggregates over sensors to obtain the final scores, as shown in Eq. 14. When performing anomaly detection, the threshold is selected on the validation dataset; if the anomaly score exceeds the threshold, we label the timestamp in the test dataset as an anomaly.

$$\begin{aligned} err_{i}(t)=|x_{i}(t)-{\hat{x}}_{i}(t)| \end{aligned}$$
(12)
$$\begin{aligned} p_i(t)=\frac{err_i(t)-\tilde{\mu }_i}{\tilde{\sigma }_i} \end{aligned}$$
(13)
$$\begin{aligned} P(t)=\max _{i}\left( p_i(t)\right) \end{aligned}$$
(14)

where \(err_{i}(t)\) indicates the error of sensor i at time t, and \(\tilde{\mu }_i\) and \(\tilde{\sigma }_i\) denote the median and IQR, respectively.
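The scoring procedure of Eqs. 12–14 can be sketched as follows; the function name and the epsilon guard are illustrative assumptions.

```python
import numpy as np

def anomaly_scores(x, x_hat):
    """Sketch of Eqs. 12-14: per-sensor errors, robust normalization by
    median and IQR, then max-aggregation across sensors."""
    err = np.abs(x - x_hat)                          # Eq. 12: (n, m) errors
    med = np.median(err, axis=0)                     # per-sensor median
    q75, q25 = np.percentile(err, [75, 25], axis=0)
    iqr = (q75 - q25) + 1e-8                         # per-sensor IQR
    p = (err - med) / iqr                            # Eq. 13: robust scores
    return p.max(axis=1)                             # Eq. 14: P(t) per step
```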

4 Experiment

4.1 Datasets

We leverage three publicly available datasets throughout the experiments: Secure Water Treatment (SWaT) [13], Water Distribution (WADI) [10], and Hardware-in-the-loop-based Augmented Industrial control systems security (HAI) [25]. SWaT and WADI are operational water treatment testbeds primarily used for cyberattack and anomaly detection research. HAI, a cyber-physical system, can simulate various complicated processes to generate sophisticated attacks. Table 2 summarizes each dataset. To show the data clearly, we visualize some multivariate time series observations in Fig. 8.

Table 2 Dataset statistics
Fig. 8: Visualization of multivariate time series data. The pink highlights represent anomalous segments

4.2 Evaluation metric

As performance metrics for the proposed model, we employ precision (P), recall (R), and F1-score (F1), as defined in Eqs. 15, 16, and 17. P denotes the proportion of samples predicted as anomalous that are true anomalies. R indicates the proportion of all true anomalies that the model correctly detects. F1 comprehensively considers P and R; the higher the F1, the better the anomaly detection accuracy.

$$\begin{aligned} P=\frac{TP}{TP+FP} \end{aligned}$$
(15)
$$\begin{aligned} R=\frac{TP}{TP+FN} \end{aligned}$$
(16)
$$\begin{aligned} F1=2 \times \frac{P \times R}{P+R} \end{aligned}$$
(17)

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively.

Anomalous observations frequently occur continuously over a period of time, generating abnormal segments. Previous work [26] provides a point-adjust technique that deems a whole abnormal segment correctly detected if any anomalous observation within it is correctly recognized. Audibert et al. [7], Zhao et al. [13], and Su et al. [27] adopted this strategy in evaluation. Additionally, another work [28] focuses on the optimal threshold, evaluating performance using the best F1 score (F1 score hereafter). In this paper, we adopt the above F1 and the point-adjust strategy to evaluate anomaly detection performance.
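A minimal sketch of the point-adjust technique [26] on binary prediction and label arrays; the function name is an illustrative assumption.

```python
import numpy as np

def point_adjust(pred, label):
    """Point-adjust: if any point inside a true anomalous segment is
    flagged, the whole segment counts as detected."""
    pred, label = pred.copy(), label.astype(bool)
    i = 0
    while i < len(label):
        if label[i]:
            j = i
            while j < len(label) and label[j]:
                j += 1                 # [i, j) is one anomalous segment
            if pred[i:j].any():
                pred[i:j] = 1          # credit the entire segment
            i = j
        else:
            i += 1
    return pred
```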

4.3 Implementation details

Our experiments use PyTorch on a machine with an NVIDIA RTX 3090 GPU. Following empirical values in the existing literature, the learning rate is 0.001 and the batch size is 128. We adopt Adam [29] as the network optimizer with \(\beta_1=0.9\) and \(\beta_2=0.999\). We set the number of training epochs to 100 and use early stopping with patience = 8 to prevent overfitting. The following parameters are set to the optimal values found during parameter tuning: the sliding window w and step c are 15 and 3, the embedding dimension d of the time series encoder is 128, the kernel size of the causal convolution is 5, \(\delta \) in Eq. 3 is 0.5, and \(\lambda \) in Eq. 11 is 0.8.

4.4 Comparison with state-of-the-art methods

We quantitatively compare the proposed TS-GAT model with existing approaches, including LSTM-VAE [9], DAGMM [6], MAD-GAN [8], USAD [7], MTAD-GAT [13], GDN [14], STGAT-MAD [15], GTA [30], HAD-MDGAT [10], and GRN [16]. Table 3 presents the comparison of the models on P, R, and F1. The best performance is highlighted in bold, and the second best is underlined.

From Table 3, several significant conclusions can be drawn. Firstly, our model consistently achieves the best F1 scores across all datasets. It also achieves the highest recall in all circumstances except on HAI, where it performs worse than GTA (81.84\(\%\) vs. 86.58\(\%\)). However, the precision of our model (93.14\(\%\)) is dramatically higher than that of GTA (83.22\(\%\)) on the same dataset. Secondly, models such as LSTM-VAE, DAGMM, and MAD-GAN, which do not account for sensor relationships, exhibit noticeably lower F1 scores than models that incorporate these relationships, underscoring the criticality of graph relationships for anomaly detection. Thirdly, incorporating sensor relationships enhances the performance of graph models such as MTAD-GAT and STGAT-MAD, but their inability to continuously update the graph structure hampers further performance gains. In contrast, our model addresses this limitation and achieves superior performance. Finally, the trade-off between high recall and precision observed in the above models has critical implications for real-world applications. In most cases, an inevitable trade-off must be made between FPs and FNs. Minimizing alarms triggered by FPs is prominent for application efficiency, but in the long run, detecting as many potential anomalies as possible enhances system stability, since even rare abnormalities may lead to malfunction of the whole system. Maintenance staff with specialized knowledge will therefore incline toward high sensitivity rather than specificity to avoid missing informative and pivotal events [31].

Table 3 Comparison results of TS-GAT and various models

4.5 Anomaly detection visualization

To intuitively show the performance of the TS-GAT model, we visualize the test values, predicted values, and anomaly scores on the HAI dataset in Fig. 9. For sensors P1_FCV02Z and P4_ST_PT01, the values vary quite steadily, and the model detects all the abnormal segments accurately. For the sensor P4_ST_PO, which is characterized by dense fluctuations, the model still sharply distinguishes the middle two anomalies. Figure 9 shows that the predictions essentially follow the same trends as the test values, indicating that the model's forecasting is sufficiently precise. Under the prediction-based paradigm, the proposed model is able to detect all anomalies.

Fig. 9: Display of anomaly detection performance. Each sensor has two subplots, and the highlighted areas denote abnormal segments. The first subplot shows the test and predicted values; the second shows the anomaly scores

4.6 Ablation study

To verify the efficacy and necessity of each component of TS-GAT, we carry out ablation experiments with simplified counterparts of the model, reported in Table 4. Concretely, we first evaluate the significance of the time series encoder by replacing it with a linear embedding. Then, to examine the effect of the threshold selection strategy on graph structure learning, we adopt a fully static graph in which every node is connected to all other nodes. Lastly, we discard the similarity loss and evaluate the model's performance.

These variants, each with the corresponding component removed, clearly underperform the full TS-GAT model, revealing that each component is indispensable and contributes to performance. In addition, the model's performance deteriorates significantly in the absence of the similarity loss, confirming the efficacy of the sustainable updating graph structure learning method proposed in this paper.

Table 4 Comparison of TS-GAT ablation experiments on F1

4.7 Interpretability of model

Interpretability via time series embedding. We investigate the interpretability of the time series encoder through t-distributed Stochastic Neighbor Embedding (t-SNE) [32] visualization in Fig. 10. We are interested in whether the features mapped by the encoder are reflected in the t-SNE space; for instance, the similarity of representation features may reflect similar sensor behaviors. According to Fig. 10, some sensors essentially form local clusters, suggesting that the features obtained by the proposed model accurately capture the behavioral similarities among local sensors.

Fig. 10: t-SNE visualization of latent features mapped from the time series embedding component on the HAI dataset. Different colors indicate distinct processes. The nodes in the red dashed circle basically form a local cluster (color figure online)

Interpretability via learned relationships. From the learned model, we recover the connections in the graph structure, which further provide interpretability by revealing which sensors are relevant to one another. As plotted in Fig. 11a, the sensors P1_LCV01D and P1_FCV03Z are linked, and their responses to the two abnormal segments are nearly identical. In Fig. 11b, sensor P1_FT02Z connects with P1_FT03, and both exhibit a strong reaction to the last abnormal segment. This is most likely caused by the connections between these sensors. Therefore, we conclude that connected sensors typically behave similarly, which is beneficial for anomaly detection.

Fig. 11: The left side displays the connection relationships; the right contains four subplots, where the first column shows test and predicted values and the second shows anomaly scores. The highlights indicate anomaly segments

5 Conclusion

In this work, we present a graph attention network-based anomaly detection approach for multivariate time series. The proposed model integrates a time series encoder, a sustainable updating graph structure learning module, and a forecasting-based decoder. The encoder has an excellent embedding ability for effectively generating temporal features. The graph learning module improves the efficiency of modeling sensor relationships. The decoder incorporates temporal causality, which gives excellent predictive capability. Experimental results on three publicly available real-world datasets indicate the superior performance of the proposed model over state-of-the-art approaches. Remarkably, it also provides good interpretability.

This research focuses on the relational modeling of multivariate time series from both temporal and spatial views to better handle anomaly detection. In the future, we will concentrate on detecting subtle abnormalities and explore more possibilities of graph neural networks for anomaly detection.