
1 Introduction

Microservice architectures are increasingly used to develop various applications due to advantages such as efficient development, quick deployment, and flexible scaling. In recent years, software applications based on a microservice architecture have been widely deployed in cloud computing data centers, and their infrastructures (e.g., Kubernetes, Mesos) have also developed rapidly to support and manage large-scale microservices. As various microservice applications have different resource requirements and behavior characteristics, data center operators pay close attention to microservice management strategies. A microservice architecture includes heterogeneous software, e.g., open-source software, third-party services, and application-specific software. Analyzing and understanding microservice-based applications is the key to ensuring their reliability. Existing methods usually study the attributes of a single service or application, e.g., executable file name, port number, and file metadata. However, these attributes are not inherent to microservices and can be dynamically adjusted or hidden during operation and maintenance. Thus, application-specific analysis cannot dynamically adapt to changing cloud computing environments. Some tools inspect packets, but analyzing the runtime behavior of deployed microservices in this way incurs significant overhead. Furthermore, these tools cannot accurately analyze the characteristics of microservices at the code level. Moreover, the strict privacy policies of cloud computing prohibit intrusive analysis of microservices, which increases the difficulty of profiling them.

To address the above challenges, this paper proposes a sequential trace-based fault diagnosis framework for microservices called Midiag. We collect sequential system calls to trace the runtime behaviors of various microservices in a unified, efficient, and non-invasive way when microservices interact with the host operating system, e.g., accessing file systems or synchronizing threads. We then employ k-means to cluster the collected system call sequences into sequence patterns using the longest common subsequence (LCS). Finally, we employ a GRU-based neural network to model each sequence pattern and predict the next system call, and diagnose faults by comparing the expected system call with the actual one in the specific pattern.

2 Related Work

Monitoring technologies are the basis of fault diagnosis, identifying deviations from normal system behaviors. Existing works have proposed many models for fault diagnosis, such as subsequence analysis [1], behavioral Markov models [2], finite state automata [3], dynamic Bayesian networks [4], and deep neural networks [5]. Ref. [6] proposes a mandatory security policy generated from normal application behaviors at the system call level, which realizes simple admission control but ignores dependencies across system call sequences. This method is limited to network-based applications and their communication protocols [7], while Midiag is generally applicable to various applications. Refs. [8] and [9] propose outlier detection methods based on semi-supervised learning (e.g., clustering) with labeled and unlabeled samples. These methods are suitable for small-scale systems, but they are difficult to apply to actual deployment scenarios with numerous microservices and complex dependencies [10]. Midiag diagnoses faults by automatically analyzing system calls without applications' domain knowledge.

3 Midiag Design

Figure 1 shows the system architecture of Midiag, which includes a trace collector, a trace pattern miner, a microservice modeler, and a fault diagnostor.

Fig. 1. Midiag system architecture

3.1 Trace Collector

We deploy a trace collector in every host to collect the traces of the Docker containers deployed on that host. The trace collector employs IO Visor (https://github.com/iovisor), a kernel virtualization tool, to collect kernel events of interest by dynamically injecting user-defined bytecode into kernel hook functions, without customizing the kernel. IO Visor combines open-source components to build networking, security, and tracing in data centers. We adopt bcc, a component of IO Visor that supports just-in-time compilation, which allows IO Visor programs to run at native speed in the kernel. When a microservice is loaded, the Docker container notifies the user-space daemon of its PID, and the trace collector starts to collect system calls. The trace collector monitors the system calls of every microservice deployed on the operating system, registers the PIDs of monitored Docker containers in a PID table, and sends the system call sequences of the registered microservices to the trace pattern miner for further pattern mining.
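
For illustration, the following is a minimal sketch of such a collector written with bcc's Python front end. It attaches to the raw_syscalls:sys_enter tracepoint and filters events against a set of registered container PIDs; the PID values and the print-based forwarding are placeholders rather than Midiag's actual implementation.

# Minimal sketch of a trace collector built on bcc (IO Visor); assumes bcc is
# installed and the script runs with root privileges.
from bcc import BPF

bpf_text = r"""
BPF_PERF_OUTPUT(events);

struct data_t {
    u32 pid;
    u64 syscall_id;
};

/* Fires on entry of every system call. */
TRACEPOINT_PROBE(raw_syscalls, sys_enter) {
    struct data_t data = {};
    data.pid = bpf_get_current_pid_tgid() >> 32;
    data.syscall_id = args->id;
    events.perf_submit(args, &data, sizeof(data));
    return 0;
}
"""

monitored_pids = {1234, 5678}  # hypothetical PIDs registered for Docker containers

b = BPF(text=bpf_text)

def handle_event(cpu, data, size):
    event = b["events"].event(data)
    if event.pid in monitored_pids:
        # In Midiag this would be appended to the container's sequence and
        # forwarded to the trace pattern miner; here we only print it.
        print(event.pid, event.syscall_id)

b["events"].open_perf_buffer(handle_event)
while True:
    b.perf_buffer_poll()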

3.2 Trace Pattern Miner

Microservices carry out a series of activities by invoking system calls. Since different microservices have various system call patterns, we classify microservices with similar system call sequences to improve the accuracy of fault diagnosis. The system calls collected from the trace collector are stored in a database for persistence, and the trace pattern miner clusters the historical system call sequences. In the training stage, the trace pattern miner clusters the system call sequences collected from the trace collectors into k microservice types with k-means. In the testing stage, the trace pattern miner takes a test system call sequence collected from a trace collector as input, measures the similarities between this sequence and the central points of the k clusters, and selects the most similar cluster as its microservice type.

First, we calculate the similarity between system call sequences. The longest common subsequence (LCS) is the longest subsequence shared by two sequences. Suppose that sequence Z = (z1, z2,…,zk) is the LCS of sequence X = (x1, x2,…,xm) and sequence Y = (y1, y2,…,yn); then:

  • If xm = yn, then zk = xm = yn, and Zk−1 is the LCS of Xm−1 and Yn−1;

  • If xm ≠ yn, then Zk is the LCS of Xm and Yn−1 or the LCS of Xm−1 and Yn.

We compute c[i, j], the length of the LCS of the prefixes Xi and Yj, as:

$$ c[i,j] = \begin{cases} 0, & i = 0 \text{ or } j = 0 \\ c[i-1, j-1] + 1, & i,j > 0 \text{ and } x_i = y_j \\ \max\left( c[i, j-1], c[i-1, j] \right), & i,j > 0 \text{ and } x_i \ne y_j \end{cases} $$

The LCS of X and Y can then be computed recursively as follows:

  • When xm = yn, we compute the LCS of Xm−1 and Yn−1 and append xm (= yn) to its tail to obtain the LCS of X and Y.

  • When xm ≠ yn, we compute the LCS of Xm−1 and Y and the LCS of X and Yn−1, and take the longer of the two as the LCS of X and Y.

After obtaining the system call sequence of each microservice, we calculate the distance between two system call sequences and use it to measure their similarity:

$$ D(X,Y) = 1 - \frac{\left| lcs(X,Y) \right|}{\left| X \right| + \left| Y \right| - \left| lcs(X,Y) \right|} $$

where |X| and |Y| are the lengths of system call sequences X and Y, and lcs(X, Y) is the LCS of X and Y. If X and Y are identical, then D(X, Y) = 0; if X and Y have no common subsequence, then D(X, Y) = 1.
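
For concreteness, the following Python sketch transcribes the recurrence for c[i, j] and the distance D(X, Y) defined above; the function names and the example sequences are ours, not taken from the paper.

def lcs_length(x, y):
    """Length of the longest common subsequence of two system call sequences."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c[m][n]

def lcs_distance(x, y):
    """D(X, Y) = 1 - |lcs(X, Y)| / (|X| + |Y| - |lcs(X, Y)|)."""
    if not x and not y:
        return 0.0
    l = lcs_length(x, y)
    return 1 - l / (len(x) + len(y) - l)

# Identical sequences give 0, disjoint sequences give 1.
print(lcs_distance(["open", "read", "write"], ["open", "write", "close"]))  # 0.5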

With this distance between system call sequences, we cluster microservices' system call sequences so that their system call patterns can be categorized to improve the accuracy of fault diagnosis. The k-means method first randomly selects a representative system call sequence for each cluster and assigns every other sequence to a cluster according to its distance from each representative. If replacing a cluster's representative with another member improves the quality of the cluster, the representative is replaced. After several iterations, the system call sequences are classified into k categories.
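
The following Python sketch illustrates this representative-swapping clustering over the LCS distance. The number of clusters, the iteration count, and the distance function passed in (e.g., lcs_distance from the previous sketch) are placeholders, not Midiag's exact configuration.

import random

def cluster_sequences(sequences, k, distance, iterations=10):
    """Group system call sequences around k representative sequences."""
    representatives = random.sample(sequences, k)
    clusters = {i: [] for i in range(k)}
    for _ in range(iterations):
        # Assign every sequence to its nearest representative.
        clusters = {i: [] for i in range(k)}
        for seq in sequences:
            nearest = min(range(k), key=lambda i: distance(seq, representatives[i]))
            clusters[nearest].append(seq)
        # Swap in a member as the new representative if it lowers the total distance.
        for i, members in clusters.items():
            if not members:
                continue
            total = lambda cand: sum(distance(cand, s) for s in members)
            best = min(members, key=total)
            if total(best) < total(representatives[i]):
                representatives[i] = best
    return representatives, clusters

# Usage with hypothetical data: reps, clusters = cluster_sequences(train_seqs, 5, lcs_distance)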

3.3 Microservice Modeler

We take system call sequences as the input of GRU neural networks with an attention mechanism and train them, obtaining one trained GRU network per pattern, where each network corresponds to one type of system call sequence; that is, k GRU network patterns are established for the k types of microservices' system call sequences. We construct a GRU-based neural network model for each system call pattern as follows.

The first (n − 1) system calls of a system call sequence are encoded as the hidden variables of the network's input layer. The hidden variables represent context variables containing data flow information of the whole system call sequence, and the attention mechanism is employed to allocate weight coefficients to them. The more layers a network has, the stronger its ability to learn and predict system call sequences; however, when the number of layers is too large, training is difficult to converge, so we employ a 3-layer GRU network. We add a fully connected layer at the end to reduce the output's dimension, and a Softmax output layer whose label is the category of the next system call. The network is trained with gradient descent and back-propagation of the loss; the parameters are continuously adjusted until the trained GRU-based neural network pattern is obtained.
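
A minimal PyTorch sketch of one such per-cluster model is shown below: a 3-layer GRU over embedded system call identifiers, a simple additive attention over the hidden states, a fully connected layer, and a Softmax output over the next system call. The vocabulary, embedding, and hidden sizes are assumptions, and this attention variant is only one plausible reading of the description above.

import torch
import torch.nn as nn

class SyscallGRU(nn.Module):
    def __init__(self, vocab_size=400, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=3, batch_first=True)
        self.attn = nn.Linear(hidden, 1)          # scores each hidden state
        self.fc = nn.Linear(hidden, vocab_size)   # fully connected layer before Softmax

    def forward(self, x):                         # x: (batch, n-1) system call ids
        h, _ = self.gru(self.embed(x))            # h: (batch, n-1, hidden)
        w = torch.softmax(self.attn(h), dim=1)    # attention weights over time steps
        context = (w * h).sum(dim=1)              # context variable of the sequence
        return torch.log_softmax(self.fc(context), dim=-1)

# Training pairs the first n-1 calls with the n-th call as the label, e.g.:
# loss = nn.NLLLoss()(model(batch[:, :-1]), batch[:, -1])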

3.4 Fault Diagnostor

The similarities between the system call sequence to be detected and the representative sequences of the k clusters are measured, and the sequence is first assigned to the cluster with the greatest similarity. For each cluster, we train a GRU-based model on that cluster's dataset of system call sequences. The input sequence is converted into an encoding vector before being fed into the GRU-based model. The output of the GRU layer is repeated S times to construct an intermediate sequence, where S is the length of the input sequence. The intermediate sequence passes through a time-distributed dense layer with a Softmax activation function, and the sequence is decoded back into the original input sequence by another GRU layer; fault diagnosis is then carried out on newly arriving sequences.

The system call sequence to be detected is fed into the GRU neural network built for its cluster, and the difference between the predicted system call and the actual one is measured as the abnormality degree. After the last system call is removed from the sequence, the remaining sequence is used as the input of the GRU neural network trained in the corresponding cluster. The pattern encodes this sequence as hidden variables and, with the attention mechanism, generates a context variable containing data flow information. The GRU predicts the category of the next system call and outputs a normalized discrete probability distribution through the Softmax function. The Manhattan distance between the probability distribution vector predicted by the GRU network and the vector of the actual next system call is taken as the anomaly degree: the larger the distance, the more anomalous the system call.
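
Assuming a trained per-cluster model of the form sketched in the previous subsection, the anomaly degree described above can be computed as follows; representing the actual next call as a one-hot vector is our assumption.

import torch

def anomaly_degree(model, sequence):
    """sequence: 1-D tensor of system call ids; the last call is the one checked."""
    model.eval()
    with torch.no_grad():
        log_probs = model(sequence[:-1].unsqueeze(0))   # predict the next call
        predicted = log_probs.exp().squeeze(0)          # normalized distribution
    actual = torch.zeros_like(predicted)
    actual[sequence[-1]] = 1.0                          # vector of the actual call
    return torch.abs(predicted - actual).sum().item()   # Manhattan distance

# A sequence is reported as faulty when this degree exceeds the chosen threshold.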

4 Evaluation

The experimental environment includes eight virtual machines (VMs) running Ubuntu 18; each VM has a 2.40 GHz virtual CPU core and 32 GB of memory, and each VM employs bcc with JIT compilation to collect system calls.

This section evaluates Midiag in terms of precision, i.e., the ratio of correctly detected faults to injected faults, following ref. [11]. We choose sixteen microservices categorized as SQL databases (PostgreSQL, Ingres r3, MaxDB, InterBase), NoSQL databases (MongoDB, Cassandra, HBase, Memcached), web servers (Apache, Nginx, Lighttpd, Appweb), and FTP clients (FileZilla, FireFTP, gFTP, LFTP). To train the GRU-based models, we collect one thousand system call sequences for each microservice. As a baseline, a single GRU-based model is trained on the system call sequences generated by all microservices, and samples are detected with this unified model. In contrast, Midiag trains multiple GRU-based models on the datasets of system call sequences generated by the microservices in different clusters. If the loss returned by the GRU-based model is higher than a threshold, the sample is reported as a fault. Each experiment is repeated 100 times.

Firstly, we evaluate the effect of the system call sequence length on precision. Figure 2 shows that the longer the sequence, the higher the precision, until the sequence length reaches 900, where Midiag achieves its best precision of 0.91. Beyond that, precision decreases because longer sequences cause the trained model to overfit. Secondly, we compare Midiag with the traditional single GRU-based model. Figure 3 shows the effect of the loss threshold on the accuracy of fault diagnosis. If the threshold is too high, more faulty system call sequences are classified as normal (false negatives); if it is too low, normal sequences are incorrectly classified as abnormal (false positives). The achievable precision of a single GRU-based model is less than 0.80, while Midiag achieves a precision of 0.91 by categorizing the sixteen microservices into five clusters and diagnosing faults within each cluster according to its system call sequences.

Fig. 2. System call sequence length on precision

Fig. 3. Fault diagnosis threshold on precision

5 Conclusion

The microservice architecture poses great challenges to the operation and maintenance of applications in cloud computing. Existing operation technologies usually employ a unified model to analyze applications' status; however, the behaviors of various microservices vary greatly, and describing them with a single model is difficult. To address this issue, this paper proposes Midiag, a microservice fault diagnosis framework based on mining system call patterns. After collecting system calls with a non-invasive, lightweight tool, we employ k-means to cluster system call sequences into sequence patterns with LCS. A GRU-based neural network is adopted to model each sequence pattern and predict the next system call, so that faults can be detected by comparing the predicted system call with the actual one in a specific pattern. Experimental results show that Midiag can effectively distinguish system call sequences and achieves much higher precision in detecting faults.