
1 Introduction

Trace-based anomaly detection is a key step in troubleshooting microservice systems because the structure of a trace helps operators understand the anomaly propagation chain and then locate the root cause.

Existing approaches can be divided into two categories: trace-level approaches and invocation-level approaches. Trace-level approaches such as [3, 5] require further analysis to locate the abnormal microservice. Invocation-level approaches such as [2, 4] can directly detect invocation anomalies, but their trace-level results highly depend on the accuracy of the invocation-level detection.

We propose a supervised trace anomaly detection method called TICAD (Trace Invocation Callee Anomaly Detection), which can effectively learn the sequential patterns in the invocations and then infer the anomalies in the traces. First, TICAD reorganizes the invocations of the traces. Then, each invocation's callee state is represented as a vector using multiple metrics. TICAD mines the inherent relationship between the previous invocations and the current one through a neural network based on LSTM and self-attention. After the invocation anomalies are detected, whether a trace is abnormal can be inferred from them.

Our main contributions are as follows. (1) We propose TICAD, which detects invocation anomalies and subsequently infers the anomalies of the traces. (2) We propose a neural network based on LSTM and self-attention to detect anomalies in the invocations, which learns the contextual dependencies and patterns between the invocation vectors. (3) We conduct extensive experiments on a public dataset to verify the effectiveness of TICAD.

2 Related Works

Supervised Machine Learning Approaches: MEPFL [8] is proposed for multiple tasks such as latent error detection from trace logs, which are collected from both normal and faulty versions of the application. Seer [1] is presented to detect QoS violations in massive trace data; it trains a deep learning model containing CNN and LSTM layers to predict the abnormal microservices.

Unsupervised Machine Learning Approaches: Most unsupervised approaches are based on the assumption that the training data are normal. AVEB [4] trains a variational autoencoder to learn the response time feature of normal cases for each microservice; target data with large reconstruction errors are then determined to be anomalous. TraceAnomaly [3] also trains a variational autoencoder, with posterior flow, to model the normal pattern of traces. In [5], a multimodal LSTM model is proposed to learn the sequential patterns of the invocation type and response time.

Fig. 1. The framework of TICAD

3 TICAD Design

As shown in Fig. 1, we first reorganize all the traces. After that, each invocation is transformed into a vector according to the metrics. For each invocation, its vector is fed into a neural network based on LSTM and self-attention, along with the vectors of the previous invocations. The network then automatically learns the potential features associated with anomalies.

3.1 Trace Pre-processing

In this section, we process the original trace data and adjust its structure to better detect anomalies. We group all the invocations with the same microservice pair and reorder them by timestamp. After that, the original dataset is divided into \(n_c\) datasets, where \(n_c\) is the number of unique microservice pairs. In the following steps, invocations of different groups can be processed and learned in parallel without affecting each other. Next, we vectorize the invocations from different perspectives. More precisely, we vectorize the callee state of the invocation, which means the label represents whether the callee is normal when the invocation occurs. To avoid the problems caused by using latency alone, we use additional resource utilization metrics of the callee to enrich the representation of the invocation. We directly concatenate the latency and the resource utilization metrics to form the vector of the invocation and standardize the values of each dimension of the vector.
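To make the pre-processing concrete, the following Python sketch groups invocations by microservice pair, orders them by timestamp, and builds standardized callee-state vectors; the column names (caller, callee, timestamp, latency, callee_cpu, callee_memory) are illustrative assumptions rather than the exact fields of the dataset.

```python
import pandas as pd

def preprocess(invocations: pd.DataFrame) -> dict:
    """Group invocations by microservice pair, order by timestamp, and vectorize."""
    metric_cols = ["latency", "callee_cpu", "callee_memory"]  # hypothetical metric names
    groups = {}
    for pair, df in invocations.groupby(["caller", "callee"]):
        df = df.sort_values("timestamp")
        vectors = df[metric_cols].to_numpy(dtype="float64")
        # Standardize each dimension of the invocation vectors (z-score per group).
        mean, std = vectors.mean(axis=0), vectors.std(axis=0) + 1e-8
        groups[pair] = (vectors - mean) / std
    return groups  # n_c groups, one matrix of invocation vectors per microservice pair
```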

3.2 Anomaly Detection

After the vectorization of the invocations, we detect whether each invocation is abnormal. For the target invocation, in addition to its own feature vector, we also use the vectors of the previous invocations to enrich the current information. Instead of relying only on the vector of the target invocation, this learning method helps to decrease the false positives caused by noisy data. In practice, a reasonable window is selected to slice the invocations. For each invocation waiting to be detected, the input is a matrix \(X_i\) consisting of \(w + 1\) vectors:

$$\begin{aligned} X_i = [v_{i-w}, v_{i-w+1}, \dots , v_{i}]^\top \end{aligned}$$
(1)

where \(v_i\) is the vector of the current invocation and \(w\) is the window length.
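A minimal sketch of the window slicing follows; padding the first w invocations of each group with zero vectors, so that every invocation obtains a complete input matrix, is an assumption rather than a detail stated above.

```python
import numpy as np

def make_windows(vectors: np.ndarray, w: int) -> np.ndarray:
    """Build one (w + 1) x n_f input matrix X_i per invocation, following Eq. (1)."""
    n, n_f = vectors.shape
    # Zero-pad so that the first w invocations also get a full window (an assumption).
    padded = np.vstack([np.zeros((w, n_f)), vectors])
    # X[i] = [v_{i-w}, ..., v_i]
    return np.stack([padded[i:i + w + 1] for i in range(n)])
```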

Now that we have the input matrix \(X_i\), TICAD requires a neural network to automatically learn the relation between the input and whether the current invocation is anomalous. Therefore, we propose a neural network based on both LSTM and self-attention [6]. Briefly, the same input matrix is processed by the LSTM and self-attention sub-networks separately, and their outputs for the target invocation are concatenated to detect anomalies.

For the self-attention part, a neural network takes the input and uses the multi-head scaled dot-product attention mechanism to aggregate the information. Instead of directly using the whole Transformer encoder, we only utilize a few components that are simple but effective. We first scale the input and add positional information to the original vectors:

$$\begin{aligned} X_h = dropout(\sqrt{h}(X_iW^h) + W^{pos}) \end{aligned}$$
(2)

where \(h\) is a scalar representing the hidden size. \(W^h \in R^{n_f \times h}\) is the weight of the linear transformation and \(n_f\) is the number of features, i.e., the size of the original invocation vector. \(W^{pos} \in R^{(w + 1) \times h}\) represents the learned positional embedding.

Then \(X_h\) is fed into the multi-head scaled dot-product attention layer, which aggregates the information according to the attention scores. In practice, multiple heads can be calculated in parallel. The \(Q_i\), \(K_i\), and \(V_i\) of each head, which are the indispensable elements of the attention mechanism, are transformed from the same input \(X_h\):

$$\begin{aligned} Q_i = X_hW^Q_i, K_i = X_hW^K_i, V_i = X_hW^V_i \end{aligned}$$
(3)

where \(W^Q_i, W^K_i, W^V_i \in R^{h \times d_{head}}\), \(d_{head} = h/n_{heads}\), and \(n_{heads}\) is the number of heads.

For each head, the scaled dot-product attention mechanism calculates the attention scores and then takes the weighted sum of the values, as shown in the following equation:

$$\begin{aligned} head_i = softmax(\frac{Q_iK_i^\top }{\sqrt{d_{head}}})V_i\end{aligned}$$
(4)

All the results of the heads are concatenated and transformed into \(X_{h'}\), as shown below:

$$\begin{aligned} X_{h'} = (head_1 \oplus head_2 \oplus \dots \oplus head_{n_{heads}})W^{h'} \end{aligned}$$
(5)

where \(W^{h'} \in R^{h \times h}\).

The final part of the self-attention branch consists of layer normalization and residual dropout, and the aggregated vector of the target invocation is denoted as \(v_s\), as shown in the following equations:

$$\begin{aligned} X_{f} = LayerNormalization(X_h + dropout(X_{h'}))\end{aligned}$$
(6)
$$\begin{aligned} v_s =X_{f}[w]\end{aligned}$$
(7)
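The self-attention branch (Eqs. (2)-(7)) can be sketched in PyTorch as follows. nn.MultiheadAttention is used instead of spelling out Eqs. (3)-(5) by hand, since it implements the same multi-head scaled dot-product attention; the hyper-parameter defaults are assumptions.

```python
import math
import torch
import torch.nn as nn

class SelfAttentionBranch(nn.Module):
    """Eqs. (2)-(7): scaled input with positional embedding, multi-head attention,
    residual dropout and layer normalization."""

    def __init__(self, n_f: int, h: int, w: int, n_heads: int = 4, p_drop: float = 0.1):
        super().__init__()
        self.h = h
        self.in_proj = nn.Linear(n_f, h, bias=False)                      # W^h
        self.pos = nn.Parameter(torch.zeros(w + 1, h))                    # W^pos, learned
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)   # Eqs. (3)-(5)
        self.dropout = nn.Dropout(p_drop)
        self.norm = nn.LayerNorm(h)

    def forward(self, x: torch.Tensor):
        # x: (batch, w + 1, n_f)
        x_h = self.dropout(math.sqrt(self.h) * self.in_proj(x) + self.pos)  # Eq. (2)
        attn_out, _ = self.attn(x_h, x_h, x_h)                              # Eqs. (3)-(5)
        x_f = self.norm(x_h + self.dropout(attn_out))                       # Eq. (6)
        # Return v_s (last row, Eq. (7)) together with X_h, which the Bi-LSTM branch also uses.
        return x_f[:, -1, :], x_h
```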

For the LSTM part, we adopt a variant of LSTM called Bi-LSTM (Bidirectional Long Short-Term Memory), whose detailed structure is shown in Fig. 2. Each row of \(X_h\) is input into the Bi-LSTM model at each time step. As shown in the figure, \(h^f_w \in R^{h/2}\) and \(h^b_0 \in R^{h/2}\) are the final hidden state vectors of the forward and backward directions, respectively, which are concatenated to represent the result of Bi-LSTM:

$$\begin{aligned} v_{l} = h^f_w \oplus h^b_0 \end{aligned}$$
(8)

Finally, \(v_s\) and \(v_{l}\) will be concatenated to calculate the anomaly probability:

$$\begin{aligned} Anomaly\_Probability = \sigma ((v_s \oplus v_l)^{\top }W^{a} + b^{a}) \end{aligned}$$
(9)

where \(\sigma \) represents the sigmoid function and \(W^{a} \in R^{2h \times 1}\).
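A minimal sketch of the Bi-LSTM branch (Eq. (8)) and the classification head (Eq. (9)), reusing the SelfAttentionBranch sketched above; feeding \(X_h\) to both branches follows the description above, while the hyper-parameter defaults are assumptions.

```python
import torch
import torch.nn as nn

class TICADDetector(nn.Module):
    """Two-branch detector: self-attention output v_s plus Bi-LSTM output v_l."""

    def __init__(self, n_f: int, h: int, w: int, n_heads: int = 4):
        super().__init__()
        self.attn_branch = SelfAttentionBranch(n_f, h, w, n_heads)
        # Each direction has hidden size h/2, so the two final states concatenate to size h.
        self.bilstm = nn.LSTM(h, h // 2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * h, 1)                       # W^a and b^a

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        v_s, x_h = self.attn_branch(x)                        # (batch, h), (batch, w + 1, h)
        _, (h_n, _) = self.bilstm(x_h)                        # h_n: (2, batch, h / 2)
        v_l = torch.cat([h_n[0], h_n[1]], dim=-1)             # Eq. (8)
        logit = self.head(torch.cat([v_s, v_l], dim=-1))      # Eq. (9)
        return torch.sigmoid(logit).squeeze(-1)               # anomaly probability
```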

If a trace has at least one abnormal invocation, the trace will be judged as abnormal.
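This trace-level rule can be written as a one-line check (the 0.5 decision threshold is an assumption):

```python
def trace_is_abnormal(invocation_probs, threshold: float = 0.5) -> bool:
    """A trace is abnormal if any of its invocations is detected as abnormal."""
    return any(p > threshold for p in invocation_probs)
```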

Fig. 2. The structure of the LSTM-based neural network

4 Evaluation

4.1 Datasets and Criteria

Datasets. To make the experiments more convincing, we use the public dataset proposed in TraceRCA [2] to evaluate the effectiveness of TICAD. This dataset contains traces collected from the Train Ticket [7] system, one of the largest open-source microservice systems.

Baselines. To demonstrate the effectiveness of TICAD, we compare it with TraceAnomaly [3] and MEPFL-RF [8], where MEPFL-RF refers to the Random Forest version of MEPFL. The parameters of both baselines are tuned for the best accuracy.

Evaluation Metrics. As in previous research, we use three evaluation metrics: precision, recall, and F1 score, which are calculated as follows: Precision = TP/(TP + FP), Recall = TP/(TP + FN), F1 score = (2*Precision*Recall)/(Precision + Recall).
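For reference, the three criteria can be computed directly from the confusion-matrix counts, as a direct transcription of the formulas above:

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```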

4.2 Preparation Experiments

For TICAD, if we directly divide all the invocations, it is likely that there will be no complete trace in the test set. This would make it impossible to compare the effectiveness of the methods, because TICAD cannot infer trace-level results from incomplete trace invocations. Therefore, we randomly select 5% of the normal trace IDs and 5% of the abnormal trace IDs as reserved IDs, which means all the invocations of these traces are reserved for the test set. For supervised methods such as TICAD and MEPFL, we directly copy the abnormal traces or invocations to mitigate the lack of positive samples.
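A minimal sketch of this preparation step follows, assuming the invocations carry a trace_id and a binary label column (the column names, the random seed, and the single extra copy of abnormal samples are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def split_and_oversample(invocations: pd.DataFrame, reserve_ratio: float = 0.05, seed: int = 0):
    """Reserve 5% of normal and 5% of abnormal trace IDs for the test set,
    then duplicate abnormal samples in the training set."""
    rng = np.random.default_rng(seed)
    reserved = []
    for label in (0, 1):  # 0: normal, 1: abnormal
        ids = invocations.loc[invocations["label"] == label, "trace_id"].unique()
        reserved.extend(rng.choice(ids, int(len(ids) * reserve_ratio), replace=False))
    test = invocations[invocations["trace_id"].isin(reserved)]
    train = invocations[~invocations["trace_id"].isin(reserved)]
    train = pd.concat([train, train[train["label"] == 1]], ignore_index=True)
    return train, test
```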

4.3 Experiments on Trace-Level Anomaly Detection

In this section, we use different methods to perform trace-level anomaly detection, which shows the effectiveness of each method on whole traces. The results are shown in Table 1. It can be seen that TICAD achieves the highest F1 score (0.974) and the highest recall (0.986). Although its precision is not the highest, it does not lag far behind the other methods. In general, TICAD shows its applicability and effectiveness in trace-level anomaly detection tasks.

Table 1. Trace-level Anomaly Detection Results

5 Conclusion

In this paper, we propose an end-to-end trace anomaly detection method called TICAD. It can effectively learn the sequential patterns in the invocations and then infer the anomalies in the traces. TICAD mines the inherent relationship between the current invocation and the previous ones through a neural network based on LSTM and self-attention.