1 Introduction

With the rapid development of cloud computing technology, a substantial amount of multivariate time series data is generated and stored in microservice systems (Di Francesco et al., 2017). The microservice architecture decomposes an application into multiple small services that work interdependently, creating a streamlined delivery pipeline that speeds up development and maintenance and provides greater flexibility. The distributed nature of microservices also gives them high scalability.

In real-world applications, a critical task is to detect anomalies in the multivariate time series data produced by microservice systems. Diverse anomalies arise from the cooperation of microservice components, such as memory leaks, network delays, and high CPU usage. Currently, most anomaly detection methods target one-class anomalies (Wen et al., 2022; Chen et al., 2022; Song et al., 2023). Compared with traditional anomaly detection, multi-classification of the diverse anomalies in a microservice architecture is more complex. Convolutional neural networks (CNNs) are commonly used in general time series classification, but they may perform poorly because they cannot fully capture spatial information and pay little attention to the correlations between convolutional channels during feature extraction (Fauvel et al., 2021). The deep learning network GDN (Deng & Hooi, 2021) uses graphs to model the spatial features of multivariate time series but does not consider temporal features. Multivariate time series have become a typical data type, and for multivariate streams both the temporal dependence and the correlation between observations should be considered. Thus, extracting both spatial and temporal features from monitoring data is the key challenge of multivariate time series anomaly detection.

The attention mechanism plays an important role in deep neural networks. Attention gives a model the ability to discriminate: among a large amount of information, it focuses on the information most critical to the current task (Woo et al., 2018). The attention mechanism improves the extraction of diverse features and makes neural network models more flexible.

An anomaly propagates along the connections among microservices and eventually affects the performance of the whole system if the fault cannot be located in time. Log-based methods (Yang et al., 2021) detect and locate bugs based on log parsing. Although they can discover more informative causes, they are hard to run in real time and require abnormal information to appear in log files. Thus, efficiently diagnosing a runtime system fault, identifying it, and locating it after anomaly detection is a great challenge. The graph structure provides an idea for fault diagnosis: we use a graph model and localize root causes with an algorithm similar to a random walk.

Our main motivation is to accurately detect anomalies in the multivariate time series monitoring data of microservice scenarios. Our main research questions are as follows: (1) How can we improve the accuracy of anomaly multi-classification in microservice systems, where the data collected from monitors are generally stored as multivariate time series? (2) When anomalies occur in the monitoring data, how can we effectively identify the dimensions that contribute most to the anomaly in order to better locate the root cause? We propose a method that classifies the various anomaly events in monitored microservice data in the cloud and identifies the abnormal time series most likely to be the cause of each anomaly in the system. The proposed PCAC model includes two parts, anomaly detection and fault localization, as shown in Fig. 1. The anomaly detection part contains two modules: feature capturing and anomaly multi-classification. First, we construct a convolutional structure with two parallel branches. To capture the association features in microservice data, one branch extracts channel features with attention and the other extracts spatial features with attention. Anomaly multi-classification is then performed on the combined features. The fault localization part includes anomalous graph construction and causal inference modules, in which we use causal inference methods to learn the fault propagation paths produced by the graph method.

Fig. 1

PCAC module structure

The main contributions of this work are summarized as follows:

  • To address the difficulty of extracting spatial and temporal features from multivariate time series, we design a parallel convolution architecture, which better captures the spatio-temporal dependencies of multivariate time series simultaneously and achieves better anomaly detection for microservice systems.

  • To solve the problem of incomplete feature extraction in ordinary CNNs, we propose a method with channel and spatial attention mechanisms to extract features in subnetworks independently and reduce the loss of feature representations.

  • To effectively determine the fault cause after anomaly detection, we analyze and compare several causal inference-based cause localization methods to identify the specific fault service.

  • We conduct experiments against eight state-of-the-art baseline methods on six public microservice datasets, achieving improvements of 37.9% in average macro-F1 and 4.4% in average micro-F1.

Section 2 reviews different anomaly detection methods for microservice systems. Section 3 introduces the proposed model in detail. Section 4 evaluates the effectiveness of the model through comparative experiments and ablation experiments, analyzes the abnormal cause, and diagnoses the fault service. Section 5 summarizes the work and presents potential future research.

2 Related work

The study of anomaly detection has been carried out for several decades and is an active research area gaining increasing attention in deep learning. At the same time, many anomaly multi-classification methods for microservice systems have been proposed. We mainly review related work on statistics-based, machine learning-based, deep learning-based, and root cause localization methods.

2.1 Statistics-based methods

Generalized autoregressive conditional heteroskedasticity (GARCH) (Engle, 1982) is a method for modeling the volatility of monitoring microservice metrics through their conditional mean and conditional heteroscedasticity. It calculates each point’s anomaly score, clusters the points, and then detects multiple categories of anomalies in the microservice system. Principal component analysis (PCA) (Shyu et al., 2003) extracts data features by dimensionality reduction and performs anomaly classification on the low-dimensional data.

2.2 Machine learning-based methods

Support vector machine (SVM) (Kriegel et al., 2011) is a binary classification model. Multiple binary classifiers are constructed for the microservice system, and the predicted probabilities are obtained by comparing the classifiers on the test set. The goal is to find a maximum-margin hyperplane that separates points of different classes. The K-nearest neighbor (KNN) algorithm (Kiss et al., 2014) predicts the label of a data point from the labels of its K nearest pre-labeled neighbors. A decision tree (Lewis, 2000) is a classification model built from the features and anomaly labels that encodes the relations among the data points.

2.3 Deep learning-based methods

Autoencoder (AE) (Fan et al., 2018; Xin et al., 2023) is a common neural network model that consists of an encoder and a decoder. The encoder extracts features through its neural architecture, and the decoder, which mirrors the encoder, converts the encoding back to the original data. After the AE is trained, the encoded features are used to train a classifier for anomaly classification. A CNN (convolutional neural network) (LeCun et al., 1998) extracts features of microservice system monitoring metrics with convolution kernels applied over sliding windows and performs classification using the cross-entropy function. An FCN (fully convolutional network) (Long et al., 2017) replaces the fully connected layers of a convolutional neural network with convolutional layers. LSTM (long short-term memory) (Graves & Graves, 2012) is a variant of the RNN (recurrent neural network) that mitigates the problems of vanishing and exploding gradients to some extent. The output of the LSTM passes through a fully connected layer and a softmax function to produce the probability distribution over the categories. TapNet (Zhang et al., 2020; Xu et al., 2022) stacks LSTM and CNN layers to model microservice system monitoring metrics and classifies them via softmax. MTEXCNN (Assaf et al., 2019) employs three cascaded 2D convolutions to extract spatial information, followed by a 1D convolution to extract temporal information and classify microservice system monitoring metrics. TranAD (Tuli et al., 2022) uses transformer-based adversarial training to detect anomalies, while GDN (Deng & Hooi, 2021) employs graph structure learning to capture the relationships between different sensors. Both TranAD and GDN are relatively novel anomaly detection methods with excellent performance. We apply the same modifications as those used for LSTM above to adapt these two models for multi-class anomaly classification, enabling a comparison with our proposed model.

2.4 Root cause localization method based on causal inference

The dependencies between services in a microservice application may cause the propagation of faults. Root cause localization helps our anomaly multi-classification models diagnose the source of anomalies and find the most fundamental reason for their occurrence. Based on fault propagation paths, graph-based methods have been developed to locate the root cause of faults. For example, AutoMap (Deng & Hooi, 2020) treats the different components in the system as individual nodes whose interdependencies form a graph, and then finds the root cause with the PC (Spirtes et al., 2000) and PageRank (Page et al., 1999) algorithms. CauseInfer (Chen et al., 2016) uses the PC algorithm to build a causal graph and then applies Breadth First Search (BFS) to infer the root cause on the causal graph. MicroDiag (Wu et al., 2021) uses the linear non-Gaussian acyclic model (LiNGAM) (Hyvärinen et al., 2010) to learn the fault propagation relationships between microservices, builds a fault propagation graph, and applies PageRank to the propagation graph for root cause localization. The above methods ignore the fault patterns of entity measurement data. However, faults in the measurement data of entities during a system fault may affect the final root cause localization results (Dongjie et al., 2023). Thus, capturing the fault patterns of measurement data and improving localization accuracy remain challenges in root cause localization.

3 Method

3.1 Overall architecture

The architecture of the proposed PCAC is shown in Fig. 2. PCAC is composed of four modules: feature capturing (1), anomaly multi-classification (2), anomalous graph (3), and causal inference (4). The input T of the model consists of multivariate time series system metrics from microservice data monitoring. First, the input enters the feature capturing module (1), which consists of a channel attention branch and a spatial attention branch. The two branches process the data in parallel and produce a weighted feature map. Module (1) solves the problem of incomplete feature extraction and captures spatio-temporal dependencies. Then, anomaly multi-classification is achieved by flattening and softmax operations on the attention map in module (2). Module (2) uses the cross-entropy loss to update the parameters and reduce the loss of features. Based on the multi-classification results, root cause analysis is carried out through anomalous graph generation in module (3). Finally, module (4) outputs the probabilities of faulty services for cause localization, effectively avoiding fault propagation among microservices.

Fig. 2

Architecture of PCAC

3.2 Feature capturing: parallel convolution with attention

Data is input into the upper and lower branches simultaneously. In the upper branch, the data passes through two one-dimensional convolutions with corresponding activation functions before entering the channel attention function, which adjusts the importance of each channel by learning attention weights and strengthens the ability to capture correlations between channels. In the lower branch, the spatial attention branch, the data undergoes the same convolutions and then enters the spatial attention function, which weights different positions so that the model can better focus on critical local features of the system metrics. Finally, the feature maps output by the two branches are combined to obtain the final feature map.

Conv1D denotes a one-dimensional convolution, and ReLU denotes the ReLU activation function. The feature map denotes the weighted feature map obtained by fusing the features from the channel attention and spatial attention branches. For a given input feature tensor F, we compute the channel attention map \(M_c(F)\) and the spatial attention map \(M_s(F)\) in two separate branches, and then compute the attention map M(F) as follows:

$$\begin{aligned} M(F)=M_c(F)+M_s(F) \end{aligned}$$
(1)
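As an illustration only, the following minimal PyTorch sketch shows the two parallel Conv1D branches described above and the fusion of the two attention maps by addition following Eq. (1). The hidden width, kernel sizes, and the use of identity placeholders for the attention modules are assumptions for readability; concrete sketches of the channel and spatial attention follow in the next two subsections.

```python
import torch
import torch.nn as nn

class ParallelConvBranches(nn.Module):
    """Sketch of the feature-capturing module: two parallel Conv1D branches with attention."""

    def __init__(self, in_channels, hidden_channels=64,
                 channel_attention=None, spatial_attention=None):
        super().__init__()
        # Upper branch: Conv1D -> ReLU -> Conv1D -> ReLU, followed by channel attention
        self.upper = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Lower branch: the same convolutional stem, followed by spatial attention
        self.lower = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Placeholders until concrete attention modules are plugged in (assumption)
        self.channel_attention = channel_attention or nn.Identity()
        self.spatial_attention = spatial_attention or nn.Identity()

    def forward(self, x):                                # x: (batch, channels, time)
        m_c = self.channel_attention(self.upper(x))      # M_c(F)
        m_s = self.spatial_attention(self.lower(x))      # M_s(F)
        return m_c + m_s                                 # M(F) = M_c(F) + M_s(F), Eq. (1)
```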

Channel attention

The process of channel attention based on the attention mechanism (Fauvel et al., 2021) is shown in detail in Fig. 3. To aggregate the feature map in each channel, we apply two global pooling operations to the feature F and produce the channel attention feature \(M_c(F)\). As shown in Fig. 3, channel attention mainly includes a shared multi-layer perceptron (MLP) network, a maximum pooling (MaxPool), and an average pooling (AvgPool).

Fig. 3

Channel attention

First, MaxPool and AvgPool are used to extract feature information, which is input into the shared MLP network to obtain MaxPool_OUT and AvgPool_OUT. These two outputs are then combined and activated by the sigmoid function to obtain the attention score matrix. Finally, the attention score matrix is multiplied with the original input feature tensor F to obtain the channel attention map \(M_c(F)\). The calculation is as follows:

$$\begin{aligned} M_c(F)=F\times \sigma (MLP(AvgPool(F))+MLP(MaxPool(F))) \end{aligned}$$
(2)

where \(\sigma\) is the sigmoid function and \(\times\) is matrix multiplication. Each channel has its own attention score because channel attention assigns a weight to its feature information. Using the sigmoid function ensures that the scores of different channels are independent. The attention score matrix multiplies the original input, assigning different weights according to importance; in this way the original data is filtered and selected.
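A minimal PyTorch sketch of the channel attention in Eq. (2) is given below: global average and max pooling over the time axis, a shared MLP, sigmoid scoring, and rescaling of the input tensor. The reduction ratio and hidden sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of Eq. (2): M_c(F) = F * sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels, reduction_ratio=16):
        super().__init__()
        # Shared MLP applied to both pooled channel descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction_ratio),
            nn.ReLU(),
            nn.Linear(channels // reduction_ratio, channels),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                                 # f: (batch, channels, time)
        avg_out = self.mlp(f.mean(dim=-1))                # AvgPool over time -> MLP
        max_out = self.mlp(f.max(dim=-1).values)          # MaxPool over time -> MLP
        scores = self.sigmoid(avg_out + max_out)          # per-channel attention scores
        return f * scores.unsqueeze(-1)                   # M_c(F): reweight each channel of F
```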

Spatial attention

The proposed model introduces a spatial attention mechanism (Fauvel et al., 2021) to enhance the ability to capture features in different spatial locations. As shown in Fig. 4, the spatial attention map \(M_s(F)\) is calculated as follows:

$$\begin{aligned} M_s(F)=F\times \sigma (Conv1D(ReLU(Conv1D(F)))) \end{aligned}$$
(3)

where \(\sigma\) is the softmax function. The spatial attention mechanism assigns different weights to features at different positions, considering the relationship between the score at each position and the scores at other positions. Using the softmax function ensures that the scores sum to 1, thereby ensuring global consistency.
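A minimal PyTorch sketch of the spatial attention in Eq. (3) follows: two Conv1D layers produce a score per time position, softmax normalizes the scores over positions, and the input is reweighted. The kernel size and the choice to collapse the channel dimension to a single score per position are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of Eq. (3): M_s(F) = F * softmax(Conv1D(ReLU(Conv1D(F))))."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=padding)
        self.conv2 = nn.Conv1d(channels, 1, kernel_size, padding=padding)  # one score per position
        self.relu = nn.ReLU()

    def forward(self, f):                                      # f: (batch, channels, time)
        scores = self.conv2(self.relu(self.conv1(f)))          # (batch, 1, time)
        weights = torch.softmax(scores, dim=-1)                # scores over positions sum to 1
        return f * weights                                     # M_s(F): reweight each position of F
```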

Fig. 4

Spatial attention

3.3 Anomaly multi-classification

The final attention map from the parallel convolution with attention module is fed into the flatten and softmax functions. The data is converted into a one-dimensional vector by the flatten function and then mapped to probabilities p for multi-classification by the softmax function. That is, the data is labeled as normal or abnormal, and abnormal data is further labeled as a specific type of anomaly based on the probabilities, such as memory leak, network delay, or CPU hog, which usually occur during service invocation in a microservice system.

In the training phase, the loss is calculated with the cross-entropy loss function defined in Eq. 4 and used to update the parameters. Using cross-entropy as the optimization objective allows the model to continuously adjust its parameters during training to minimize the difference between the predicted probabilities and the actual labels.

$$\begin{aligned} loss=-\frac{1}{n}\sum \limits _{i=0}^{n-1}\sum \limits _{j=0}^{m-1}y_{ij}\log (p_{ij}) \end{aligned}$$
(4)

where n is the number of training samples, m is the number of classes, y represents the actual label, and p represents the probability of the label predicted by the model.
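The following is a minimal sketch of the classification head and the cross-entropy loss in Eq. (4). Note that PyTorch's nn.CrossEntropyLoss applies log-softmax internally, so the head below outputs raw logits; the class count, feature size, and dummy tensors are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

num_classes = 4            # e.g. normal, CPU hog, memory leak, network latency (assumed)
feature_dim = 64 * 30      # flattened size of the attention map (assumed)

head = nn.Sequential(
    nn.Flatten(),                          # flatten the attention map into a 1-D vector
    nn.Linear(feature_dim, num_classes),   # logits for each class
)
criterion = nn.CrossEntropyLoss()          # implements Eq. (4) over softmax probabilities

features = torch.randn(128, 64, 30)        # dummy batch of attention maps
labels = torch.randint(0, num_classes, (128,))
loss = criterion(head(features), labels)
loss.backward()                            # gradients used to update the model parameters
```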

In the testing phase, the test data is processed the same way as in the training phase, but the model is not updated. Instead, the macro-F1 and micro-F1 scores of the model are calculated.

3.4 Fault localization

Once anomalies are detected, the fault localization engine in the microservice system starts to trace the execution paths and locate faulty services. The engine consists of two main procedures: anomalous graph construction and causal inference. The fault localization procedure is as follows:

  • Step 1: Select a causal inference algorithm and construct a directed acyclic graph (DAG) with minimum information loss as the anomalous graph G from the data after anomaly multi-classification.

  • Step 2: Use the PageRank algorithm on G to compute the score of each anomalous node.

  • Output: Anomalous graph G and probability of each anomalous node.

We choose four common causal inference algorithms to construct causal graphs in order to find the best one for root cause analysis: the Peter-Clark (PC) (Spirtes et al., 2000) algorithm, the Greedy Equivalence Search (GES) (Chickering & Boutilier, 2003) algorithm, and the linear non-Gaussian acyclic model (LINGAM) (Hyvärinen et al., 2010), which includes ICA-LINGAM (Shimizu et al., 2006) and Direct-LINGAM (Shimizu et al., 2011). To locate the faulty services, the graph centrality algorithm PageRank (Page et al., 1999) is applied to the anomalous graph and outputs the probability of each anomalous node. In the root cause inference phase, these probabilities serve as the basis for diagnosing which microservice is most likely to cause faults.
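A minimal sketch of the two localization steps is shown below, assuming the `lingam` package for Direct-LiNGAM and `networkx` for PageRank; the paper does not specify libraries, and the handling of edge weights, edge directions, and the input matrix shape (rows = time points, columns = microservice metrics) are simplifying assumptions.

```python
import numpy as np
import networkx as nx
import lingam  # assumed causal discovery package providing DirectLiNGAM

def localize_root_cause(X, service_names, top_k=5):
    # Step 1: learn a DAG over the anomalous metrics (Direct-LiNGAM as one of the four candidates)
    model = lingam.DirectLiNGAM()
    model.fit(X)                                   # X: (n_samples, n_services)
    adjacency = np.abs(model.adjacency_matrix_)    # weighted adjacency of the anomalous graph G

    # Step 2: run PageRank on G to score each anomalous node
    G = nx.from_numpy_array(adjacency, create_using=nx.DiGraph)
    scores = nx.pagerank(G)

    # Output: the top-k services most likely to be the root cause
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    return [(service_names[i], round(p, 3)) for i, p in ranked]
```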

4 Experiments

4.1 Datasets and experimental setup

Datasets

Sock Shop is a widely used microservice benchmark designed to test and evaluate microservices technology. It consists of 13 microservices, from which we mainly choose the front-end, catalogue, users, orders, payment, and shipping services. The microservice architecture of Sock Shop is shown in Fig. 5. The complex connections between these services make the multi-classification task more challenging for the multivariate time series data in the microservice system. We deploy Sock Shop using Kubernetes on multiple virtual machines (VMs) in the cloud. The Kubernetes cluster includes one master node and three worker nodes. We deploy the open-source monitoring and visualization tools Prometheus and Grafana on the master node to monitor the application and collect data. Furthermore, we use the load generation tool Locust on the master node to simulate workloads for the microservice application. All services of Sock Shop are automatically deployed on nodes allocated to different VMs.

Fig. 5

The microservice architecture of Sock Shop

To simulate realistic scenarios, we inject three types of anomalies into our experiment: CPU hog, memory leak, and network latency (Mariani et al., 2018; Chen et al., 2015). The Pumba tool is used to simulate network failures, and Docker container resources are stress-tested to induce anomalies. Each anomaly lasts 1 to 5 min, while the application runs normally for 10 to 30 min, and the process is repeated at least five times for each anomaly. Data is collected in real time every 5 s according to the Prometheus configuration and includes service-level and resource-level data. At the service level, the latency of each service is recorded. At the resource level, metrics related to container resources are collected, such as CPU usage, memory usage, and network transmit bytes.

Table 1 shows the details of the six microservice datasets, including the sizes of the training and test sets and the number of feature dimensions. The ratio of training to test size is seven to three. In addition, the proportions of the three anomaly types are reported.

Table 1 The details of datasets used in experiments

Metrics

We use macro-F1 and micro-F1 scores as evaluation metrics to verify the performance of the model and compare it with other baseline anomaly detection methods. Both macro-F1 and micro-F1 are commonly used to evaluate models in multi-classification scenarios. Macro-F1 averages the per-class precision and recall scores, treating all classes as equally important. Micro-F1 is suitable for datasets with an unbalanced class distribution.
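As a brief illustration, both metrics are available directly in scikit-learn through the averaging modes of f1_score; the label vectors below are toy examples, not the paper's data.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 3]          # actual anomaly classes (toy example)
y_pred = [0, 1, 1, 2, 2, 3]          # predicted anomaly classes (toy example)

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
micro_f1 = f1_score(y_true, y_pred, average="micro")  # global counts, robust to class imbalance
print(macro_f1, micro_f1)
```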

Baseline methods

We compare our model with different types of anomaly multi-classification models to validate its effectiveness. These include (i) the classical machine learning models GaussianNB, KNN, SVM, and SGD, and (ii) the deep learning models CNN, DNN, LSTM, OmniAnomaly, the transformer-based TranAD, and the graph structure-based GDN. TranAD and GDN are relatively new anomaly detection models.

Experimental settings

All experiments are implemented in Python 3.7.11 and PyTorch 1.6.0 using a single NVIDIA GeForce 940MX (12 G) GPU, an Intel(R) Core(TM) i7-7500U CPU @ 2.70 GHz, and 12 G of RAM. The convolutional kernel sizes in Conv1D are 3 and 5, the number of epochs is 80, the batch size is 128, and the neural networks are optimized with the Adam optimizer with an initial learning rate of \(10^{-4}\).
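For concreteness, the following sketch reproduces the stated training configuration (Adam, learning rate 1e-4, batch size 128, 80 epochs); the `model` and the synthetic dataset are placeholder assumptions, not the PCAC network or the paper's data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(10, 4)                          # placeholder for the PCAC network (assumed)
train_dataset = TensorDataset(torch.randn(512, 10),     # placeholder monitoring features (assumed)
                              torch.randint(0, 4, (512,)))

loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(80):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```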

4.2 Main results

We compare PCAC with eight baseline methods on six microservice datasets in Table 2 and Fig. 6 in terms of macro-F1 and micro-F1. The best performance is bolded.

Table 2 Performance of baseline models and ours
Fig. 6

Performance comparison

Table 2 shows that PCAC achieves the highest macro-F1 and micro-F1 scores overall among the baseline methods on the six datasets. Furthermore, we provide the ranking of PCAC and all baseline methods on macro-F1. The specific micro-F1 ranking scores differ slightly from those of macro-F1, but the model ranking is the same. The ranking results show that our model exceeds the other methods, demonstrating that it is effective for multi-classification.

The detailed anomaly detection results of our method are shown in Fig. 7. On the catalogue, front-end, orders, payment, shipping, and users datasets (Fig. 7a–f), the detection error rates are only 1.95%, 3.18%, 2.78%, 1.59%, 2.86%, and 4.12%, respectively. The users dataset requires more accurate detection of CPU hog and memory leak anomalies. In summary, our method achieves an average false alarm rate of only 2.75% on the six datasets, indicating its effectiveness for detecting the three types of anomalies and providing a solid basis for the subsequent fault diagnosis.

Fig. 7

Confusion matrix of anomaly detection results under the different microservices

4.3 Ablation study

To investigate the impact of each component branch on the PCAC performance, we repeat the experiments without channel attention or spatial attention successively on the six datasets.

Channel attention

The macro-F1 and micro-F1 results of PCAC and of PCAC without channel attention (CA) are shown in Fig. 8. Both the macro-F1 and micro-F1 values of PCAC without CA decrease on all datasets. In particular, the decrease of 7.45% in macro-F1 and 2.10% in micro-F1 on the shipping dataset indicates that channel attention increases the model’s attention to specific channels and improves performance.

Fig. 8

Performance comparison between PCAC and PCAC without CA

Spatial attention

Similarly, as shown in Fig. 9, PCAC performs better than PCAC without spatial attention (SA) on most datasets. The macro-F1 and micro-F1 values decrease by 2.15% and 0.62% on the payment and front-end datasets, respectively, indicating that spatial attention is beneficial for capturing spatial information.

Fig. 9

Performance comparison between PCAC and PCAC without SA

4.4 Parameter sensitivity analysis

Batch_size

This parameter represents the batch size. Sensitivity analysis of the batch size is helpful for hyperparameter tuning. We apply different batch sizes on the catalogue dataset used in the experiment. The experimental results are shown in Table 3.

Table 3 Results of different batch_size on catalogue dataset

Table 3 shows that the macro-F1 score is best when batch_size is 128, but the macro-F1 scores produced by other batch sizes differ only slightly. This indicates that the batch size has little effect on the classification results of the model: performance decreases slightly at a batch_size of 256 and is optimal at 128. Thus, the batch_size can be adjusted dynamically; in practice, we can choose different batch sizes to make full use of computational resources without sacrificing computational efficiency.

Reduction_ratio

This parameter controls the dimension reduction ratio of the fully connected layer in the channel attention module. For example, when reduction_ratio is 16, the output dimension of the fully connected layer is \(\nicefrac {1}{16}\) of the input dimension. We also perform this analysis on the catalogue dataset used in the experiment, and the experimental results are shown in Table 4.

Table 4 Results of different reduction_ratio on catalogue dataset

Table 4 shows that the macro-F1 score is highest when the reduction_ratio is 16, but it differs little from the macro-F1 scores obtained with other reduction_ratio values. Therefore, similar to batch_size, we can choose different reduction_ratio values according to the dimensionality of the dataset to balance the model’s performance and computational cost.

Parameter sensitivity experiments show that the two parameters batch_size and reduction_ratio in the model are stable, and their values do not have much influence on the results of anomaly detection. Therefore, the model proposed in this paper has strong robustness, and the performance does not fluctuate greatly with the values of the two parameters. In practical applications, we can adjust the values of batch_size and reduction_ratio according to the computational cost consideration.

4.5 Fault location

In this subsection, we select the front-end dataset as a sample to complete fault diagnosis based on the anomaly multi-classification results of our model. First, we apply the four common algorithms PC, GES, ICA-LINGAM, and Direct-LINGAM to find a directed acyclic graph corresponding to the anomalies with minimum information loss. Then, we use the PageRank algorithm to perform a random walk on the anomalous graph and calculate the probability of each anomalous node. Finally, based on the ranking of these probabilities, we analyze the most likely fault cause in the system.

The anomalous graphs generated by the Direct-LINGAM algorithm are shown in Table 5. The nodes represent seven microservices: 0 (front-end), 1 (user), 2 (catalogue), 3 (orders), 4 (carts), 5 (payment), and 6 (shipping).

Table 5 Anomalous graph

Different microservices may cause various anomalies. To obtain the microservice most fundamentally responsible for an anomaly and diagnose its source, we apply the PageRank algorithm to the anomalous propagation graphs in Table 5 to calculate the probability of an abnormality occurring at each node, and we select the top 5 nodes as the most likely root causes of the anomaly. The PageRank results are shown in Table 6.

Table 6 Probabilities of anomalous nodes

The table above shows that different microservices may cause various anomalies, and the fraction of exceptions attributed to each microservice varies. For example, for CPU anomalies, the most likely causes are orders and payment; for memory anomalies, front-end and shipping; and for latency anomalies, carts and front-end.

In a network, most failures are not caused by a single cause. In particular, in microservice systems, an exception in any service may lead to the failure of a series of related services, because the microservices call and affect each other. Table 6 reports the failure probability of each microservice. For example, network latency somewhere in the microservice system may be caused by the 4 (carts) and 0 (front-end) services, because the two services have a combined failure probability of 79%.

Metrics

To quantify the performance of each algorithm on a set of anomalies A, we use two widely used metrics: PR@k and AVG@k. PR@k represents the probability that the top k results given by an algorithm include the real root cause; a higher PR@k score, especially for small values of k, indicates that the algorithm correctly identifies the root cause of each anomaly a. AVG@k evaluates the overall performance of a method by computing the average of PR@k. They are defined as follows:

$$\begin{aligned} PR@k=\frac{1}{|A|}\sum _{a\in A}\frac{\sum _{i<k}\mathbb {1}\left( R^a(i)\in V^a\right) }{\min (k,|V^a|)} \end{aligned}$$
(5)
$$\begin{aligned} AVG@k=\frac{1}{k}\sum \limits _{1\le j \le k}PR@j \end{aligned}$$
(6)

Let \(R^a(i)\) be the rank of the i-th predicted cause, and \(V^a\) be the set of true root causes of anomaly a. This paper uses \(A=\{\text {CPU hog},\ \text {memory leak},\ \text {network latency}\}\); \(V^a\) contains the real root cause, and \(R^a\) is the predicted ranking. We set k from 1 to 5 for PR@k and compute AVG@5 as the average localization accuracy.
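The following is a minimal sketch of PR@k and AVG@k following Eqs. (5) and (6); the ranking lists and ground-truth sets below are toy examples, not the paper's data.

```python
def pr_at_k(rankings, ground_truth, k):
    """rankings / ground_truth: dicts mapping each anomaly type to a ranked list / set of causes."""
    total = 0.0
    for a, ranked in rankings.items():
        hits = sum(1 for cause in ranked[:k] if cause in ground_truth[a])  # sum over i < k
        total += hits / min(k, len(ground_truth[a]))
    return total / len(rankings)                                            # average over anomalies A

def avg_at_k(rankings, ground_truth, k):
    return sum(pr_at_k(rankings, ground_truth, j) for j in range(1, k + 1)) / k

# Toy example (assumed values for illustration only)
rankings = {"cpu hog": ["orders", "payment", "carts"],
            "memory leak": ["front-end", "shipping", "users"],
            "network latency": ["carts", "front-end", "orders"]}
ground_truth = {"cpu hog": {"orders"}, "memory leak": {"front-end"}, "network latency": {"carts"}}
print(avg_at_k(rankings, ground_truth, k=3))
```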

Table 7 shows the performance of cause locating three types of faults under different anomalous graphs.

Table 7 Performance of cause locating

The results show that the anomalous graph based on the Direct-LINGAM algorithm achieves the best AVG@5 and effectively locates the root causes of all three types of anomalies. The causal propagation graph obtained by Direct-LINGAM most accurately reflects the connections among the different microservices. In the future, other root cause algorithms could be investigated to improve the accuracy of fault diagnosis and suit more diverse time series distributions.

5 Summary

Since the multivariate time series data monitored in microservices can occasionally and unexpectedly become abnormal, it is necessary to classify the anomalies and analyze their root causes. This paper proposes an effective convolutional model with attention that uses a parallel structure to classify diverse anomalies and analyze the root cause from the classified anomaly data. Our model has better anomaly classification ability and achieves state-of-the-art results in a detailed set of empirical studies. For future research, we hope to design an unsupervised model to address the challenge of label collection in microservice environments. Furthermore, we would like to explore the root cause of anomalies at a finer granularity, not only at the service level but also at the host and server level. More generally, we plan to increase the generalization and universality of the model so that it can be applied in non-microservice environments such as Internet of Things (IoT) systems.