
1 Introduction

With the continuous development of software systems, system scale has grown so large that it is almost impossible to detect system anomalies manually. Over the past decade, many automated log-based approaches have been introduced [1, 2]. These methods typically extract useful information from logs to detect system anomalies, and several of them adopt data mining and machine learning techniques to analyze log data and detect the occurrence of system anomalies. For example, Xu et al. [2] treated log-based anomaly detection as an unsupervised learning problem and utilized Principal Component Analysis (PCA) to detect anomalies.

However, most of these log-based anomaly detection approaches are not sufficiently robust for real-world deployment. Therefore, in this paper we propose Sprelog, a novel log-based anomaly detection approach that achieves accurate and robust anomaly detection on real-world, ever-changing, and noisy log data. We evaluate the proposed approach on public log data collected from Hadoop. Specifically, we vary the injection ratios of the Hadoop log data to evaluate the effectiveness of the proposed approach. Our experimental results show that when the injection rate increases from 5% to 20%, the F1-score merely decreases from 0.97 to 0.94. Hence, the experiments demonstrate not only that our approach can effectively detect anomalies of an online service system with ever-changing and noisy log data, but also, more importantly, that it is highly robust.

We summarize the main contributions of this paper as follows:

  1. We aim to solve the task of unstable log anomaly detection. Specifically, we adopt a self-matching network that captures “inconsistent” information during the evolution of log messages by exploring word-word interaction features. This self-matching network helps our method handle both the instability introduced by evolving log messages and the noise in the log data.

  2. We adopt two semantic representations: a local static word embedding (word2vec [3] or GloVe [4]) and a global dynamic word embedding (RoBERTa [5]). We also apply a low-rank bilinear pooling approach to integrate these two semantic representations effectively.

  3. We evaluate Sprelog on two public datasets. The results confirm the effectiveness of our approach.

2 The Proposed Model

2.1 Task Description

In this research, the log-based anomaly detection task is described as a tuple of three elements (S, I, y), where \(S=[s^1,s^2,\dots ,s^g ]\) represents the log event whose length is g, I denotes the current log message task ID, and \(y\in Y\) conveys the anomaly detection status of the log. More specifically, Y = {Yes, No}, where Yes indicates that the log is normal and No indicates that it is abnormal. The goal of our model Sprelog is to assign a label to each task ID based on the conditional probability Pr(y | S, I) given the pair \(\{S,I\}\), thereby solving the log-based anomaly detection task.
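For illustration only, the instance structure described above can be sketched in Python as follows; the class and field names are ours, not part of Sprelog.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogInstance:
    """One anomaly detection instance, i.e. the tuple (S, I, y)."""
    S: List[str]  # the log event [s^1, ..., s^g] of length g
    I: str        # task ID of the current log message
    y: str        # label in Y = {"Yes", "No"}

# The model assigns to each task ID the label maximizing Pr(y | S, I).
```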

2.2 Overview of the Proposed Model

In this section, we describe our proposed model in detail. The model architecture is depicted in Fig. 1. First, we parse the unstructured raw log data into structured log events and convert each word of the log events into a vector by word embedding. Then, to transform log events into fixed-dimensional log sequence semantic vectors, we combine word-to-word interaction vectorization with RoBERTa-based semantic feature vectorization. Finally, to further synthesize the vectorized log event sequences, we use low-rank bilinear pooling to integrate the log sequence information for the log-based anomaly detection task.

Fig. 1. Overview of our proposed Sprelog model.

2.3 The Representation of the Local Semantic Log Sequence

In this section, we acquire the representation of the local semantic information. First, we construct word-to-word interaction features from the parsed logs. Second, we combine the word information with the local semantic information of the log sequence via our self-matching network [6]. After these steps, we obtain a fixed-dimensional word-to-word interaction vector for the whole log sequence.

It is worth noting that the self-matching network generates an attended feature vector for the input log sequence: \(f_a=S\cdot a\), where \( a\in R^n\) is the self-matched attention vector. Prior work [7] demonstrates the effectiveness of semantic incongruity as a predictor for log-based anomaly detection. Hence, the attention vector a can be designed to capture log sequence incongruity.

Our approach is inspired by the “co-attention” network proposed by Lu et al. to address the Visual Question Answering (VQA) task [8]. They introduce an affinity matrix C to jointly attend to the input image feature map V and the text question representation Q. C is calculated by:

$$\begin{aligned} \mathrm {C}=\tanh \left( \mathrm {Q} \cdot \mathrm {W}_{a} \cdot \mathrm {V}\right) \end{aligned}$$
(1)

where \(W_a\) contains attention weights.

A joint activation approach (e.g., maximizing by rows and columns) is adopted to adjust the attention weights for V and Q simultaneously. We modify this approach by introducing a weight matrix between each word-word pair to improve the ability to capture joint information between words.

Given a word pair \((e_i,e_j)\), the joint feature \(w_{i,j}\) is computed as:

$$\begin{aligned} w_{i, j}=\tanh \left( \mathbf {e}_{i} \cdot \mathbf {M}_{i, j} \cdot \mathbf {e}_{j}^{\mathrm {T}}\right) \end{aligned}$$
(2)

where \(e_i\) and \(e_j\) are the word embeddings of words i and j [9], \(w_{i,j}\in R\) measures the joint information between word i and word j, and \(M_{i,j}\in R^{k\times k}\) is a parameter matrix.

The self-matching information matrix W, based on all joint information values \(w_{i,j}\), \(i,j\in (1,2,\ldots ,n)\), is computed as:

$$\begin{aligned} \mathbf {W}=\left( \begin{array}{cccc} w_{1,1} &{} w_{1,2} &{} \ldots &{} w_{1, n} \\ w_{2,1} &{} w_{2,2} &{} \ldots &{} w_{2, n} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ w_{n, 1} &{} w_{n,2} &{} \ldots &{} w_{n, n} \end{array}\right) \end{aligned}$$
(3)

A maximization activation approach is applied to calculate the self-matched attention vector a. We first calculate an intermediate vector \(m\in R^n\) by maximizing the elements of W row-wise:

$$\begin{aligned} m_{i}=\max \left( w_{i, 1}, w_{i, 2}, \ldots , w_{i, n}\right) , \forall i \in (1,2, \cdots , n) \end{aligned}$$
(4)

Then, we feed m into a standard softmax function to calculate a: \(a=softmax(m)\). The softmax function is adopted for normalization.
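To make Eqs. (2)-(4) and the resulting attended vector \(f_a=S\cdot a\) concrete, the following PyTorch sketch computes them for one log sequence. It assumes a single bilinear matrix M shared across all word pairs (the paper writes \(M_{i,j}\)); class and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMatchingAttention(nn.Module):
    """Sketch of the self-matching attention (Eqs. 2-4); M is shared
    across all word pairs here for simplicity."""

    def __init__(self, emb_dim: int):
        super().__init__()
        self.M = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)

    def forward(self, E: torch.Tensor) -> torch.Tensor:
        # E: (n, k) word embeddings e_1, ..., e_n of one log sequence
        # Eqs. (2)-(3): W[i, j] = tanh(e_i . M . e_j^T)
        W = torch.tanh(E @ self.M @ E.t())   # (n, n)
        # Eq. (4): row-wise maximization
        m, _ = W.max(dim=1)                  # (n,)
        # a = softmax(m)
        a = F.softmax(m, dim=0)              # (n,)
        # Attended feature vector f_a = S . a (cf. the start of Sect. 2.3)
        f_a = E.t() @ a                      # (k,)
        return f_a
```

A separate matrix per pair, as written in Eq. (2), would require on the order of n² parameter matrices; sharing one M is the usual practical simplification.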

2.4 The Representation of the Global Semantic Log Sequence

In this section, we use RoBERTa to obtain global semantic features of the log sequence. Specifically, the RoBERTa model we apply contains token embeddings, segment embeddings, position embeddings, and a Transformer with multi-head attention [5] as the encoder layer, producing an H-dimensional encoding of the log sequence that contains contextual information. To obtain this representation, we again use \(S=[s^1,s^2, \dots ,s^g]\) to represent the log event whose length is g. With the pre-trained RoBERTa model, we obtain the global contextual log sequence representation \(f_l\) by taking the fixed-dimensional output vector of the [CLS] token as the semantic representation of the entire log event:

$$\begin{aligned} f_l= \mathrm{RoBERTa}_{cls} (S) \end{aligned}$$
(5)
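For concreteness, one way to obtain \(f_l\) with the Hugging Face transformers library is sketched below; the checkpoint name is an assumption, and we take the hidden state of the first token (RoBERTa's analogue of [CLS]) as the sequence representation.

```python
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")  # assumed checkpoint
encoder = RobertaModel.from_pretrained("roberta-base")

def encode_log_event(log_event: str) -> torch.Tensor:
    """Return the global semantic vector f_l for one parsed log event."""
    inputs = tokenizer(log_event, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # first token (<s>) plays the role of [CLS] in RoBERTa
    f_l = outputs.last_hidden_state[:, 0, :].squeeze(0)  # shape: (H,)
    return f_l
```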

2.5 Low-Rank Bilinear Pooling

After the previous steps, two log sequence feature vectors have been acquired: \(f_{a}\), generated by the self-matching network, and \(f_l\), generated by the RoBERTa encoder. Here, we combine these two feature vectors for the final prediction. We employ a Low-rank Bilinear Pooling (LRBP) method based on the Hadamard product to reduce the dimension of the final input vector, controlling potential information redundancy without reducing the discriminative power of the feature vector [10].

In this work, we follow the concept of LRBP to pool information from two input feature vectors: \(f_a\in R^k\) and \(f_l\in R^d\). The final projection feature vector for the input log sequence is calculated:

$$\begin{aligned} f=U^T\cdot f_a\circ V^T\cdot f_l+g \end{aligned}$$
(6)

where \(\circ \) represents the Hadamard product, \(f\in R^c\) is the pooled feature vector, \(U\in R^{k\times c}\) and \(V\in R^{d\times c}\) are parameters to be learned, \(g\in R^c\) is a bias term, and c, k, and d are hyperparameters.
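A minimal PyTorch sketch of Eq. (6) is given below, assuming the dimensions k, d, and c are provided; the module name is ours.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Hadamard-product low-rank bilinear pooling (Eq. 6)."""

    def __init__(self, k: int, d: int, c: int):
        super().__init__()
        self.U = nn.Linear(k, c, bias=False)   # computes U^T . f_a
        self.V = nn.Linear(d, c, bias=False)   # computes V^T . f_l
        self.g = nn.Parameter(torch.zeros(c))  # bias g

    def forward(self, f_a: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
        # f = U^T f_a o V^T f_l + g, where o is the element-wise product
        return self.U(f_a) * self.V(f_l) + self.g
```

Choosing c much smaller than the product of k and d is what provides the dimensionality reduction mentioned above.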

f is the final feature vector for the input log sequence. We feed f into a standard softmax classification layer to make the log-based anomaly detection prediction:

$$\begin{aligned} p_i=softmax(W_f\cdot f+b) \end{aligned}$$
(7)

where \(p_i\in R^2\) is the predicted distribution over the normal and abnormal classes for the input log event sequence, and \(W_f\in R^{2\times c}\), \(b\in R^2\) are parameters to be learned.

2.6 Training Objective

The loss function of this log-based anomaly detection classification task is the standard cross-entropy:

$$\begin{aligned} J(\theta )=-\sum _{i=1}^{N}\left[ y_{i} \cdot \log p_{i}+\left( 1-y_{i}\right) \cdot \log \left( 1-p_{i}\right) \right] +\lambda \cdot R \end{aligned}$$
(8)

where N is the size of the training dataset, \(y_i\) is the true label of log sequence i, and \(\theta =\{M_{i,j},U,V,g,W_f,b\}\) are the model parameters. \(R=\Vert \theta \Vert _{L 2}\) is the regularization term and \(\lambda \) is a hyperparameter weighting the regularization term.
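Putting Eqs. (7) and (8) together, a sketch of the prediction layer and training objective might look as follows; the projection size c is a placeholder value, and the L2 term is computed explicitly here as a squared norm over the classifier parameters only, a common simplification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c = 64                        # placeholder projection size from Eq. (6)
classifier = nn.Linear(c, 2)  # W_f in R^{2 x c}, bias b in R^2

def objective(f: torch.Tensor, y: torch.Tensor, lam: float = 1e-5) -> torch.Tensor:
    """Eq. (7) prediction plus Eq. (8) cross-entropy with L2 regularization."""
    logits = classifier(f)            # W_f . f + b; softmax is applied inside cross_entropy
    ce = F.cross_entropy(logits, y)   # cross-entropy over the batch
    l2 = sum(p.pow(2).sum() for p in classifier.parameters())  # stands in for R
    return ce + lam * l2

# usage (shapes only): f is (batch, c), y is (batch,) with labels in {0, 1}
```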

3 Experimental Setup

3.1 Dataset and Hyper Parameters

We evaluate our proposed Sprelog on two datasets: the original HDFS dataset [11] and synthetic unstable HDFS datasets. To prepare the synthetic datasets, we randomly collect 51,000 log sequences from the original HDFS dataset, consisting of 50,000 normal and 1,000 anomalous sequences. We then inject unstable log data into this collection and create two testing sets, NewTesting1 and NewTesting2, which contain injected unstable log events and unstable log sequences, respectively. The details of the two datasets are shown in Table 1:

Table 1. The synthetic HDFS datasets

We fix all the hyper-parameters applied to our model. Specifically, we train our networks by stochastic gradient descent with a learning rate of 0.1, a momentum of 0.9, a weight decay of 0.0005, a dropout ratio of 0.5, and gradient clipping at 0.1. The training batch size for all datasets is tuned among {64, 128, 256}. The L2 regularization weight is set to \({10^{-5}}\) for the original HDFS dataset and \({10^{-3}}\) for the synthetic unstable HDFS datasets.
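A training-loop sketch under the hyper-parameters above is shown below; the model here is a placeholder, and dropout would live inside the actual Sprelog network rather than in this loop.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(300, 2)  # placeholder for the assembled Sprelog network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)

def training_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y)  # stands in for Eq. (8)
    loss.backward()
    # gradient clipping at 0.1, as in the hyper-parameter setting above
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.item()
```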

Table 2. Experiment results on the synthetic HDFS dataset of unstable log events (the NewTesting1 set)

3.2 Results and Analysis

To analyze the effectiveness of our model, we take some current competitive methods as baselines on the above two datasets to compare the performance of Sprelog with other models. The results are demonstrated as follows.

(1) Experiments on the Synthetic HDFS Dataset

Our work focuses on anomaly detection in unstable logs. To demonstrate that the proposed model can effectively handle anomaly detection in unstable logs, we conduct two groups of experiments. First, for unstable log events, our model is trained on the original HDFS log dataset and tested on the synthetic unstable log event dataset (NewTesting1). The comparison results on the NewTesting1 set are shown in Table 2. We can observe that as the proportion of injected unstable logs increases, the performance of the five baselines continues to decline. The F1-score of our model Sprelog stays around 0.94 across the different injection rates, which strongly suggests that our method is highly robust and can effectively solve the anomaly detection problem in unstable logs. The main reason is that, during the log representation process, our model applies the pre-trained model and semantic vectors, projecting the logs into a higher-dimensional space so that higher-level semantic information can be captured.

Next, we conduct another group of experiments on the unstable log data. Specifically, our model is trained on the original HDFS log dataset and tested on the synthetic unstable log sequence dataset (NewTesting2). The experimental results are shown in Table 3.

Table 3. Experiment results on the synthetic HDFS dataset of unstable log sequences (the NewTesting2 set)

We can observe that LogAnomaly [14], LogRobust [7], and our proposed Sprelog are semantic-based models. When the injection ratio of log sequences increases, the performance of these models degrades more slowly than that of the other methods. In particular, our proposed Sprelog still maintains strong performance even when the increased injection ratio causes log sequences to be missed, duplicated, or shuffled. The F1-score of Sprelog remains around 0.96. The reason is that our model uses a self-matching network in the log sequence representation. The self-matching network explores the contextual information embedded in the log sequence and learns the different importance of log events through the attention mechanism. Therefore, our model is robust to small changes in the sequence.

4 Conclusion

Engineers can use logs (for example, system log messages) to investigate anomalies. However, due to the continuous evolution of log statements and the presence of noise in log data, current log anomaly analysis models are not robust enough. To overcome this issue, we propose a new log-based anomaly detection method, Sprelog, which can capture the inconsistency in the evolution of log sequences as well as higher-level semantic information. Experimental results on publicly available datasets show that our proposed Sprelog model achieves state-of-the-art performance, outperforming the most advanced existing log-based anomaly detection models. In the future, we would like to improve the computing speed of Sprelog to further improve our solution.