
1 Introduction

A data center has to gather information about the state of its computer systems. It is common practice to have programs log events that provide insight into service activity and report their internal state, making it possible to detect anomalies. Logs are semi-structured texts, usually appended to a file with the ‘.log’ extension, that grows in size and becomes very large; consequently, having system administrators assess system health by reading log files does not scale. This motivates solutions that automate log processing and reduce human intervention. Software components and applications produce heterogeneous log files. Some services use logging methods with extremely flexible syntax, like syslog [22]. Log files usually do not contain the same type of information: a ‘syslog’ event logs a system activity, while a ‘crond’ event logs the CRON entries that show up in ‘syslog’. Each file tends to describe a partial view of the whole data center. The stored information can contain the time and date of a specific event, to record exactly what happened, and uses a simple formatting. Furthermore, services can use different keywords to express normal or erroneous behaviour. Log analysis involves processing large amounts of data that are not easy to read and understand manually, which makes log summarization difficult. Non-automatic processes for troubleshooting [40] are discouraged in large computing systems. Several studies have proposed approaches to scale up log analysis [20], moving error handling from manual to automated operation. However, they may still require log processing and conversion of log files into a format that can be understood by analysis tools. Some methods are used to compact the information into more readable formats, such as ‘.csv’ and ‘.json’, and to automatically extract any suspicious information, like the cause of an ‘error’ message. Quite often it is not feasible to simply rename the file with a ‘.csv’ or ‘.json’ extension, because the transformed file could be wrongly formatted. Furthermore, a file usually contains daily service information, so it can exceed \(60-120~K\) rows, penalizing some analysis tools, which are likely to become unresponsive when performing operations such as selecting and adding new variables in the data sets. Before starting any analysis, it is essential to decide which variables have to be included in the resulting data sets. Spreading the data across multiple columns is another aspect to consider in order to organise the data set into a manageable format. The resulting files can be used to identify trends and unusual activity, which is beneficial for both short- and long-term data center management. Machine Learning techniques are a solution to identify anomaly patterns automatically and predict machine failures.

In this paper, we focus on determining problem patterns by combining the results obtained by different machine learning (ML) techniques. Log files contain a considerable amount of text, therefore we have used modern Natural Language Processing (NLP) methods [1, 33] to map the log files’ words to a high-dimensional metric space in order to help define site administrators’ actions. The aim is to cluster and, if possible, classify the various system events. Furthermore, since we work with unlabeled data, we have used a (deep) machine learning technique called autoencoder [56]: the decisions taken by, and the information available to, site experts are not stored in any local framework. We have also used an invariant mining technique [37] that supports, for example, the detection of failures and anomalies. During the log analysis, we have adopted the invariant mining model because it is a general approach that does not rely on the nature of the data and does not require meaningful domain knowledge or fixed rules. Furthermore, it does not require training labels, and messages are not grouped with respect to distance only. The proposed approach is repeatable in other contexts and domains. With minimal setup effort and the usage of machine learning tools, it is possible to automatically extract relevant information about system state. Through experiments, we illustrate the potential benefits of our approach by answering our research questions.

The remainder of this paper is organized as follows. Section 2 introduces the background of log analysis and related work. Section 3 presents the rationale of our approach. Section 4 provides some promising experimental results. Section 5 concludes the paper with future work.

2 Log Mining and Related Works

Existing commercial tools, such as ELK [18], Splunk [45], Loggly [36] or OverOps [41], are mainly used for Big Data visualization, but are not appropriate for anomaly detection. Moreover, diagnosability utilities such as Salsa [30] or LogEnhancer [55] do not cope well with large data volumes because of their heavy processing. Log mining is an approach that applies statistical and ML algorithms to structured event lists in order to extract knowledge from logs, inferring a model of the system and detecting outliers.

Log Compression. Preliminary phases to log mining are log compression, which uses, for example, dictionary-based and statistic-based techniques or Jaccard similarity [28], and log parsing, which relies on clustering, heuristics, frequent pattern mining and evolutionary algorithms. Frequent pattern mining identifies frequencies, correlations and causalities between sets of items in transactional databases: among the most used algorithms, Apriori [46] and FP-growth [23] can be mentioned. The Apriori algorithm is based on a level-wise search and is useful to find the most frequent patterns [5] in a large dataset, quickly excluding non-recurring ones. FP-growth is based on the construction of a tree structure that models the considered dataset: the frequent item sets are extracted by traversing the tree, without having to read the database several times.
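To make the level-wise idea concrete, the following is a minimal, self-contained Python sketch of Apriori-style frequent itemset mining over log “transactions”; the token sets and the support threshold are purely illustrative and do not come from the cited tools.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5):
    """Level-wise (Apriori-style) frequent itemset mining.

    transactions: list of sets of items (e.g. the tokens of a log event).
    Returns a dict mapping each frequent itemset to its support.
    """
    n = len(transactions)
    # Level 1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(current)
    k = 2
    while current:
        # Candidate generation (simplified): all k-combinations of the items
        # that appear in some frequent (k-1)-itemset.
        items = set().union(*current)
        candidates = [frozenset(c) for c in combinations(sorted(items), k)]
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: v / n for c, v in counts.items() if v / n >= min_support}
        result.update(current)
        k += 1
    return result

# Toy example: each "transaction" is the token set of a log message.
logs = [{"error", "disk", "failed"}, {"error", "disk"},
        {"job", "started"}, {"error", "disk", "queue"}]
print(frequent_itemsets(logs, min_support=0.5))
```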

Log Parsing. In the category of preprocessing techniques, also called log abstraction, there is the log parsing activity, used to abstract and simplify the initial data. Among the clustering methods for log parsing, we can mention the Simple Logfile Clustering Tool (SLCT) [49], Hierarchical Event Log Organizer (HELO) [21] and Iterative Partitioning Log Mining (IPLoM) [38]. In addition, POP [25], a parallel log analyzer, is suitable for large-scale data processing. During the log parsing process, string metrics are sometimes used, for example the Levenshtein distance [34] or the Longest Common Subsequence [48] in Spell [14].
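As an illustration of the string metrics mentioned above, the following is a compact dynamic-programming implementation of the Levenshtein distance; the sample messages are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca with cb
        prev = curr
    return prev[-1]

# Messages produced by the same template differ only in their dynamic fields,
# so their edit distance is small compared with their length.
print(levenshtein("session opened for user root", "session opened for user alice"))
```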

Data Transformation. The preprocessing phase also includes data transformation. Through this process the log set is transformed into an appropriate form, such as a histogram matrix of events or a binary representation, which becomes the input of the next data mining step. El-Masri et al. [17] compare aspects such as scalability and efficiency of the main log abstraction techniques.

Log Analysis. Logs can be analyzed statically (off-line method) or dynamically (on-line method): the latter is time consuming, while the former is less precise but faster. Hybrid analysis is a mixture of the two. The most popular on-line log parsing approaches are SHISO [39] and LenMa (which are based on clustering), Drain [26] (based on heuristics) and Logram [9] (based on frequent pattern mining). A more detailed description of clustering and log parsing methods can be found in the survey [27]. Log analysis techniques can be divided into the following branches: failure detection techniques, which include, for instance, decision trees and associative rule learning; anomaly detection techniques, which comprise agglomerative or divisive hierarchical clustering, the nearest neighbor approach, the chi-squared test and the Naive Bayes (NB) classifier [13]; and failure diagnosis, which includes the SherLog [54] tool and Decision Tree (DT) [42] approaches.

Machine Learning Techniques. Some traditional ML algorithms applied in the literature include Principal Component Analysis (PCA) [53], LogCluster [50], Support Vector Machines (SVM) [51] and mixtures of Hidden Markov models. Deep Learning approaches for this scope include DeepLog (LSTM-RNN) [15], LogGAN (CNN) [52], Aarohi (a failure prediction method) [10], AirAlert (a framework based on Bayesian networks) [7] and TCFG (Time Weighted Control Flow Graphs) [29]. Machine Learning techniques can be divided into classification techniques (such as SVM and Bayesian networks), clustering techniques (including K-means [35], Self-Organizing Feature Maps, SOFM [32], and DBSCAN [44]) and statistical techniques (for example PCA). A clustering algorithm divides data into groups according to the criterion that elements belonging to the same cluster are very similar. Outliers, i.e., elements that do not belong to any cluster, often contain useful information on the anomalous characteristics of the system, as they are generally different from the other data. To estimate how well the results of the algorithms used in the log mining process correspond to reality, one can proceed in several ways. A first strategy is to measure true and false positives and true and false negatives, and to plot the receiver operating characteristic (ROC) curve. Other procedures, used to evaluate clustering methods, include internal evaluation (given by the Davies-Bouldin index [11] or the Dunn index [16]) and external evaluation (with the Rand measure [43], F-measure or Jaccard index).
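As a small example of internal and external clustering evaluation, the following scikit-learn sketch works on synthetic data; the features and labels are invented for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, adjusted_rand_score

# Hypothetical event-count features: rows are time windows, columns are event types.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 5)),    # normal windows
               rng.normal(5, 1, size=(10, 5))])   # windows with bursts of events
true_labels = np.array([0] * 50 + [1] * 10)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Internal evaluation (no ground truth needed): lower Davies-Bouldin is better.
print("Davies-Bouldin:", davies_bouldin_score(X, kmeans.labels_))
# External evaluation (against known labels): adjusted Rand of 1.0 is a perfect match.
print("Adjusted Rand:", adjusted_rand_score(true_labels, kmeans.labels_))
```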

Related Works. Breier and Branisova [3] process log files in parallel employing the Apache Hadoop framework. Their method, based on the generation of dynamic rules, spots anomalies through MapReduce. The parallel implementation, compared with the Apriori and FP-growth algorithms, has accelerated the process of detecting security breaches in log records.

Borghesi et al. [2] perform anomaly detection in high performance computing systems through a deep learning technique. The dataset of the Examon infrastructure (the code is available on GitHub [19]) is explored through a neural network called autoencoder. It is trained with the Adam optimizer [31] using an off-line approach. Finally, the metric used to evaluate the accuracy of the obtained results is the F-measure. Layer et al. [33] use NLP to predict the operator’s action for the workflow handling of the CMS experiment [8]. During the ML training phase, they employ the information produced by the computing operators when taking decisions in case of failures, which is stored in the experiment framework. Bertero et al. [1] also use NLP, in their case to detect anomalies. They emulate different types of errors related to CPU consumption, misuse of memory, abnormal numbers of disk accesses, network packet loss and network latency.

The results presented in previous papers show that the optimal procedure for analyzing logs is the combination of, first, heuristics or filtering and, next, machine learning steps. In this article, we have chosen to use invariant mining and autoencoder techniques, because we are interested in the relationships between different messages and source hostnames. We have also applied NLP to identify anomaly categories that can be used to label the log entries and contribute to defining administrators’ operations. Tools like SLCT have the disadvantage of searching only for the most frequent types of messages in the log file, neglecting the less frequent ones. For anomaly detection, however, it may be necessary to find rare types of messages, so this option has been discarded. Furthermore, applying clustering methods in the core log analysis phase would risk losing significant results on anomalies that should be considered individually, if they were aggregated with others.

3 Approach

The approach proposed as the contribution of this paper is presented in Fig. 1. Once the data has been collected and the characteristics of the log files have been identified, we have carried out the following phases: data preprocessing and transformation, which have allowed us to build a set of datasets to train and build our anomaly detection models.

Fig. 1. General approach overview.

Source Data. The log files examined in this study are related to a set of services running at the INFN Tier-1 data center [12], used by the Large Hadron Collider experiments [4]. Low level services, such as crond and sudo, are shared among the highlighted groups. The log files mainly belong to Linux system services, such as the software utility crond, the free and open-source mail transfer agent postfix and the standard for message logging syslog. Table 1 summarizes the first 30 filenames according to their frequency, that is, the number of times each unique filename occurs. The frequencies of the file suffixes are as follows: .gz 10869, .log 3562759, .manifest 5, .meta 6, .pending 1, .txt 2.

Table 1. The top 30 log files per frequency.
Fig. 2. A log entry sample from the puppet-agent service.

Log files. Each of these log files contains a different number of lines. They contain numerical and textual data that describe system states and run-time information. Each log entry includes a message that contains a natural-language text (i.e. a list of words) describing some event. Logs are generally generated by logging statements inserted, either by software developers in the source code or by system administrators in configuration files, to record particular events and software behaviour. Each log entry in the log file represents a specific event. Figures 2 and 3 show two different log entry samples composed of a log header and a log message: the former is generally composed of a timestamp, custom configuration information (such as the hostname where the service runs in Fig. 2 and the log verbosity level in Fig. 3) and the name of the service the message is associated with; the latter is just the message that contains the information of the logged event.
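A minimal sketch of such a header/message split is shown below; it assumes a syslog-like layout, whereas the actual header fields vary across the services considered in this study.

```python
import re

# Assumed syslog-like layout: '<timestamp> <hostname> <service>[<pid>]: <message>'.
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\w{3}\s+\d{1,2}\s[\d:]{8})\s+"       # e.g. 'Jun 14 15:16:01'
    r"(?P<hostname>\S+)\s+"                              # machine that emitted the event
    r"(?P<service>[\w\-/\.]+)(?:\[(?P<pid>\d+)\])?:\s+"  # service name and optional pid
    r"(?P<msg>.*)"                                       # free-text log message
)

def parse_entry(line: str):
    """Return the header fields and the message of a log entry, or None."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = "Jun 14 15:16:01 host_a crond[12345]: (root) CMD (run-parts /etc/cron.hourly)"
print(parse_entry(line))
```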

Fig. 3. A log entry sample from the puppet-agent service.

Figures 4 and 5 show two examples of the log message of a log entry, characterized by a natural-language text whose interpretation is difficult because there is no official standard defining the message format. The text is usually composed of different fields, called dynamic and static: a dynamic field (DF) is a string or a set of strings assigned at run time; a static field (SF) does not change across events. Such fields can be delimited by different separators, such as a comma, a white space or a parenthesis. Logging practice is scarcely documented and mainly depends on human expertise [24]. Administrators often have to analyze a large volume of information that may be unrelated to the problematic scenarios, leading to an overwhelming number of messages [6].

Fig. 4. A log message with just one dynamic field and one static field.

Data Preprocessing. During this phase the log files are converted into ‘.csv’ files. In addition, the following variables are added to each file: date, time, timestamp, hostname, internet protocol (ip) address, service name, process identifier (id), component name, message (msg). The hostname and ip are not always both available, especially when the service runs on a virtual machine. Each file is related to a particular service that runs on a well-known machine in the data center; its location is retrieved from a local database and included in the resulting file. During this phase we have tackled some site administration peculiarities: the same service named in lower case or capital letters; the process identifier included in the service logging filename; the service name included (or not) in the log message; the process identifier included (or not) in the log message; typos in the logging filename. Before performing any cleaning operation, we have excluded meaningless services’ log files, especially those with a small number of events. In the remaining logs, the following changes have been applied to the message field: the removal, through regular expressions, of unwanted text, such as punctuation, non-alphabetic symbols and any other characters that are not part of the language; the exclusion of non-English characters; the removal of stopwords, i.e., frequent general words with little meaning, like ‘of’, ‘are’, ‘the’, ‘it’, ‘is’. According to the amount and types of services, in this phase we have started to trace the types of log events, such as assert, fail, error and debug, and to identify anomaly key terms that can be used to classify the cause of the problems in the service. This part of the study is still ongoing and requires experts’ checks. However, Table 2 summarizes a few examples of message lines that describe a wrong service behaviour.
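The following sketch illustrates the kind of message cleaning described above (regular expressions plus stopword removal) on a hypothetical row; it is a simplified version, not the exact pipeline used in this study.

```python
import re
import pandas as pd
from nltk.corpus import stopwords   # requires a one-off nltk.download('stopwords')

STOP = set(stopwords.words("english"))

def clean_message(msg: str) -> str:
    msg = msg.lower()
    msg = re.sub(r"[^a-z\s]", " ", msg)                       # drop punctuation, digits, other symbols
    return " ".join(t for t in msg.split() if t not in STOP)  # drop stopwords

# Hypothetical rows after the header has been split into columns.
df = pd.DataFrame({
    "timestamp": ["2020-06-14 15:16:01"],
    "hostname": ["host_a"],
    "service": ["crond"],
    "msg": ["(root) CMD (run-parts /etc/cron.hourly)"],
})
df["msg_clean"] = df["msg"].apply(clean_message)
df.to_csv("crond_preprocessed.csv", index=False)
```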

Fig. 5. A log message with a sequence of dynamic and static fields.

Table 2. Examples of message lines for the crond log file.

Features Transformation. Data transformation consists in the creation of a new dataset containing a matrix, whose dimension and values change according to the machine learning technique used to build the anomaly detection model. The resulting dataset can contain either binary or numerical data. In the binary case, a feature has value 1 if it is present in the log event and 0 otherwise. When NLP is used, the tokenization phase (see Table 3) determines the elements of the matrix: the message string is split up into single words. Once the message events have been transformed into a matrix of features, if its dimension is too large it is possible to apply rules to reduce the data, e.g., according to the uniqueness of the messages. For NLP, low frequency words or words that are not important for the meaning of the anomalies are filtered out. To limit computation time, the event count matrix has been created for each month: its elements indicate the occurrences of all the input messages, related to the hostnames they come from, in that time window.
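A minimal pandas sketch of the event count matrix construction, with purely illustrative messages and hostnames, is the following:

```python
import pandas as pd

# Hypothetical preprocessed entries: one row per log event.
events = pd.DataFrame({
    "hostname": ["host_a", "host_a", "host_b", "host_b", "host_b"],
    "msg_clean": ["job started", "job completed", "job started",
                  "medium error disk", "job started"],
})

# Event count matrix: rows are source hostnames, columns are unique messages,
# cells are the occurrences within the chosen time window (e.g. one month).
count_matrix = pd.crosstab(events["hostname"], events["msg_clean"])
print(count_matrix)

# Binary variant: 1 if a message appears at least once for that hostname.
binary_matrix = (count_matrix > 0).astype(int)
```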

Table 3. Examples of message strings split up in single words for the crond log file.

Machine Learning. Starting from the event count matrix, NLP, autoencoder and invariant mining techniques have been considered. NLP includes the word2vec [47] unsupervised algorithm, a popular embedding approach by Google to process natural language. word2vec takes a text corpus as input and maps each unique word in the corpus to a point in a high-dimensional Euclidean space. This technique is able to produce meaningful word embeddings: similar words end up close to each other, while unrelated words end up far away in the embedding space. By applying the same approach to all log files, it is possible to form clusters of similar anomaly events and use traditional supervised classifiers to, e.g., determine anomalous and normal behaviour of the services.
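As an example, a word2vec model can be trained on tokenized messages with gensim as sketched below; the corpus, vector size and window are illustrative choices, not the exact configuration used in this study.

```python
from gensim.models import Word2Vec

# Tokenized (cleaned) log messages; the contents are invented for illustration.
sentences = [
    ["session", "opened", "user", "root"],
    ["session", "closed", "user", "root"],
    ["medium", "error", "disk"],
    ["disk", "failed", "abort"],
]

# Small vectors and window, because log messages are short and the corpus is limited.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, workers=2, seed=1)

# Words used in similar contexts end up close in the embedding space.
print(model.wv.most_similar("disk", topn=3))
print(model.wv["error"].shape)   # (50,)
```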

Autoencoder is a type of neural network that can be used to detect anomalies at the service level. Compared with Borghesi et al. [2], in this study we have created a set of separate autoencoder models, one for each service in the system. Each model is trained by using a loss function that ensures that the output is close to the input.
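A minimal sketch of such a per-service autoencoder, using Keras and random data in place of a real event count matrix, could look as follows; layer sizes and the anomaly threshold are assumptions made for illustration.

```python
import numpy as np
from tensorflow import keras

def build_autoencoder(n_features: int) -> keras.Model:
    """A small fully connected autoencoder for one service's event count matrix."""
    inputs = keras.Input(shape=(n_features,))
    encoded = keras.layers.Dense(16, activation="relu")(inputs)
    decoded = keras.layers.Dense(n_features, activation="linear")(encoded)
    model = keras.Model(inputs, decoded)
    model.compile(optimizer="adam", loss="mse")   # the output should reconstruct the input
    return model

# Hypothetical normalized event counts: rows are time windows, columns are message types.
X_train = np.random.rand(200, 40).astype("float32")
autoencoder = build_autoencoder(n_features=40)
autoencoder.fit(X_train, X_train, epochs=10, batch_size=32, verbose=0)

# Windows whose reconstruction error exceeds a threshold are flagged as anomalous.
errors = np.mean((autoencoder.predict(X_train, verbose=0) - X_train) ** 2, axis=1)
threshold = np.percentile(errors, 95)
anomalous_windows = np.where(errors > threshold)[0]
```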

The invariant mining model is based on the observation that a number of initial operations corresponds to the same number of final operations of the same or a similar type: for example, the number of jobs arrived should match the number of jobs started, and the number of jobs arrived should match the number of jobs completed. A linear program invariant is then a predicate that always holds with the same values across different normal operations, such as opening and closing files. Invariants reflect the underlying correlation between variables and the properties of the execution paths. Whenever an invariant is broken during a system operation, the corresponding log messages are flagged as anomalous, because a broken invariant is considered a sign of a probable malfunction. The final result is a vector whose length equals the number of unique input messages and whose values are “0” and “1”: each element corresponds to a log message, where “1” indicates that it is labeled as anomalous and “0” as non-anomalous.
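The following simplified sketch conveys the idea of a count-based invariant check on one time window; in the real model the invariants are mined from past normal periods, whereas here they are hard-coded and the counts are invented for illustration.

```python
# Event counts for one hostname in one time window (illustrative values).
counts = {"job arrived": 120, "job started": 120, "job completed": 118}

def holds(counts, lhs, rhs):
    """A simple count invariant: the two event types must occur the same number of times."""
    return counts.get(lhs, 0) == counts.get(rhs, 0)

invariants = [("job arrived", "job started"), ("job arrived", "job completed")]

# A broken invariant marks the involved messages as anomalous ("1"), the others as "0".
labels = {pair: 0 if holds(counts, *pair) else 1 for pair in invariants}
print(labels)   # {('job arrived', 'job started'): 0, ('job arrived', 'job completed'): 1}
```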

Some messages erroneously labeled as anomalous could be log patterns that, at the beginning of a month, correctly close operations started at the end of the previous month, but are no longer linked to the messages that precede them once the dataset has been split by month. These boundary cases can be assessed individually, considering a limited time window, e.g. 24 h before and after, or taking into account the average closing time of the execution flows. Once the logs with anomalies have been identified automatically, through this strategy we can trace the causes and the files that potentially triggered the errors. This is useful to know in advance what the alarm messages are in real time and to act in time to prevent problems from occurring, that is, to perform predictive maintenance.

Performance Metrics. To assess the performance of the ML techniques we have considered the following metrics. Precision measures the model’s performance with respect to false positives (FP), which are the messages identified as anomalies when no anomaly occurred. False negatives (FN), on the other hand, are messages mislabeled as non-anomalous, which should instead be indicated as anomalous. From the expression \(precision=\frac{TP}{TP + FP}\), where true positives (TP) are messages that correctly report an occurred anomaly, it is evident that high precision values indicate a low rate of FP. Recall measures the model’s performance with respect to false negatives, according to the formula \(recall=\frac{TP}{TP + FN}\). The F-measure, expressed by \(f1=2\times \frac{precision \times recall}{precision + recall}\), is the harmonic mean of precision and recall. It is less affected by extreme values; the closer its value is to 1, the better the model performs.
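These metrics can be computed directly with scikit-learn, as in the short sketch below; the label vectors are invented for illustration.

```python
from sklearn.metrics import precision_recall_fscore_support

# 1 = message labeled as anomalous, 0 = non-anomalous.
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]   # expert/manual labels
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]   # labels produced by the model

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```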

4 Results

For this study, we have created a set of GitLab projects to store the code we have implemented so far to perform the various phases of the presented approach. The collected and preprocessed data contains reserved information, therefore at the time of the conference we have preferred to keep our GitLab projects private. We have developed our analysis in Jupyter notebooks for the Python 3 programming language, leveraging data analysis libraries such as pandas, nltk, and scikit-learn.

In the following, we show the effectiveness of the presented approach through the results obtained for a subset of services (see Table 1). The analyzed logs, after their preparation in the preprocessing phase, relate to the period June–October 2020. The causes of the detected anomalies are various: we can mention, for example, bugs related to memory or silent data corruption, jobs aborted due to file validation, software outages or errors caused by some users of the data center services. We have grouped the extracted anomalies into specific message templates, so that they can be used as warnings on future system states without waiting for further data acquisition. Our approach is off-line; however, an on-line method is simulated by comparing the anomalous results of one month's log files with both the previous and the following periods, and specific ML evaluation metrics are used. Next, the success is assessed through a manual check of the most frequent messages, in order to assign labels to certain messages and compare the experimental results with the real and expected ones. The precision, recall and F-measure performance metrics of the ML techniques have been calculated over the data of two consecutive months. Table 4 shows the mean values of the ML performance metrics. We can assert that these models are effectively able to describe the anomalies.

Table 4. Mean values of the ML performance metrics.

All the methods highlight the same period of time for the anomalies, due to the presence of problem key terms in the files. Quite self-explanatory messages labeled as anomalous by the methods include, for example: “failing disk”, “left power supply failure”, “recovered: medium error disk”, “internal queue full”, “Total Queue full”, “process abort” and “Disk Failed: Abort”.

In the following we present details of the different methods.

Invariant Mining Model. The invariant mining technique is able to detect the intrinsic linear characteristics of the workflows linked to a machine; in particular, it has been used to automatically collect the frequent patterns of error messages that generate a problem in the Tier-1 data center. The message-hostname pair has been chosen because it is the most significant for the search of anomalies and because other variables, such as the ip address, can be obtained from the hostname. The percentage of anomalous messages, calculated as the ratio between the number of unique suspicious messages of each month and the total number of monthly log messages, varies from a minimum of \(0.06\%\) to a maximum of \(10.3\%\).

Figure 6 shows some examples of anomaly messages detected from the Tier-1 logs, where the histograms report the number of occurrences per day. Their frequency is high at particular dates and can therefore be associated with some problems. These patterns come from two specific hostnames and ips, which for privacy we indicate as “host_a, ip_XXX.XXX.XXX.XY” and “host_b, ip_XXX.XXX.XXX.XZ”; this information is therefore useful for monitoring the future operations of the associated machines.

Natural Language Processing Method. To establish the word2vec training set, we use the concatenation of 300 log files that constitute the basis of our model training; therefore, the message corpus is relatively small. We have used the NLP method without any computational optimization, in order to retain the maximum amount of information. The output of word2vec is a file containing the coordinates of the distinct words of our training corpus.

Fig. 6. Some anomaly messages detected with invariant mining.

Autoencoder Method. An autoencoder is a feed-forward multi-layer neural network with the same number of input and output neurons. We have used it to learn a more efficient representation of the data while minimizing the reconstruction error.

5 Conclusions

The anomalies in logs can be related to permanent or transient errors. Performing manual diagnostics in a data center is not feasible, given the large-scale increase in unstructured information produced by machines in operation. For a human operator, even dividing the data analysis into small time intervals would be very time-consuming. Most traditional tools for error analysis are based on verifying the standard behavior of the system by knowing a priori the behavior of specific software; they are therefore tied to the specificity of the system and to the knowledge of the operators. If new software is used, manual analysis cannot be accurate after only a short period of use, because in that case it is not possible to outline all the different normal system behaviors and tag the remaining unknowns as errors; doing so for all the prototypes of unknown messages could lead to categorizing false positives.

In this work, log mining is performed on monthly time slots and on the frequencies of each anomalous message, which are automatically extracted together with the hostnames associated with them. The results are promising: we have obtained an average F-measure above \(86\%\). A subset of services has been used during the study, therefore we aim at improving our study by considering all the available datasets.