1 Introduction

As cloud computing applications scale to high loads and complex distributed environments, they become vulnerable to software bugs and failures. Usually, these large-scale systems are designed for robustness and stability. However, when a service fault or outage occurs, the mean time to recovery is considerable due to the inherent complexity of the various components. Recently, Google Cloud was affected by an incorrect configuration change for nearly two hours [1], while AWS faced a major outage that caused traffic delays and latency for about 7 h [2]. Thus, diagnosing system anomalies plays a vital role in building trustworthy and reliable software. Rapid and precise failure detection helps troubleshoot issues faster and also motivates fully self-healing workflows that can perform automated actions.

Logs are the preferred data source for root cause diagnosis of anomalies, because they record important events and holistic system status information, and are available on most platforms [3]. However, using logs comes with certain challenges. Firstly, the structural format and semantics of logs can vary across systems. Log messages generated by different system modules are often hard to link because of their diverse semantics [4]. Further, concurrent/parallel execution of tasks can interleave events at different times, so that messages do not follow a deterministic ordering [5]. Also, unexpected events that occur in real-time need to be evaluated to see if they are anomalous or not, which can be demanding even for domain experts [6]. Though heuristics such as regular expressions or keyword searching help automate this to a certain extent, their scope is highly limited. They generally report false positives, do not scale, and need to be constantly updated for new failures and code churn.

Such practical difficulties call for efficient analytical models that can leverage massively large logs to automatically mine valuable information for anomaly capture. Consequently, many works explore Artificial Intelligence (AI) for processing logs. The state of the art can be roughly classified into methods that use template-based features [4, 6, 10, 18, 19, 20], and ones that encode logs into vectors [7, 9, 11, 12, 15]. In the first category, a log entry is parsed into its invariant part (template) by discarding numeric information such as parameter values and timestamps. The templates are assigned unique event indexes and anomaly detection is performed over the event index sequences using LSTM. Some methods create vectors for time windows by counting unique events, or by TF-IDF aggregation. Then, the anomaly is learnt through unsupervised techniques like Principal Component Analysis (PCA) or Invariant Mining (IM). However, when only using these event indexes as features, valuable information can be lost because the templates can still contain semantic relationships. Considering these drawbacks, some methods directly use the extracted log templates as textual features, but they cannot handle templates not seen before. Moreover, in both cases, the quality of features degrades when the parser is inaccurate. Commonly used FT-Tree or Longest Common Subsequence (LCS) based parsing requires ideal settings and extensive tuning to obtain the best results. In contrast, log vectorization has gained traction in more recent works because of its capacity to encode word context. The sequential ordering of words can also be retained by embedding logs using transformers or recurrent neural networks. However, word2vec or skip-gram models create huge vocabulary spaces that can dilute learning over large datasets. In addition, words not observed in training make these models less resilient to new events that emerge in real-time.

There are a few limitations in the state of the art for learning robust log embeddings. First, representability of out-of-vocabulary (OOV) tokens is a key research gap. Although the Log2vec [7] approach is able to produce semantically/lexically charged word vectors, it relies on an offline compositional MIMICK technique to handle OOV words. Tokenizing logs into subwords was tried by a few works, but it substantially increases the sequence length. There is a need to develop a robust word2vec strategy for logs that inherently eliminates OOV without computation overheads. It should process from subword granularity but render embeddings that reflect the semantic/lexical properties at the word level. Second, few works explore feature selection in the context of logs. Since anomalies typically account for less than 2% of the entire log volume, there is a need to enhance their visibility by eliminating noisy and less useful logs that do not affect anomaly detection.

In this paper, we propose a novel log encoder and feature selection mechanism to process log texts. This encoder operates at subword precision but ensures that the aggregated word embeddings reflect the semantic/lexical relations. Additionally, the most desired set of events to learn about the presence of anomaly are auto-selected for anomaly classification. This is accomplished by modelling the probability distribution of event occurrences. For example, heartbeat signals, and daemon start/stops correlate to anomaly and are prioritized over standard info messages.

The main contributions of this work are as follows.

  • We propose a novel Subword Encoder Neural network (SEN) as an enhancement over the current Log2vec algorithm. This novelty boosts the representability of program/system-level entities such as machine IDs, IP/MAC addresses, error tracebacks, etc.

  • The SEN implicitly takes care of handling out-of-vocabulary words. It is robust to new kinds of events, as semantic/lexical dependencies are encoded in the distributed vectors through subword-level self-attention.

  • A novel Naive Bayes-based algorithm is proposed for feature selection. By evaluating the occurrence probabilities of different log events under failures, this model precisely filters the key log messages aiding failure classification.

  • The proposed selection mechanism improves the efficacy of existing Machine Learning (ML) detectors. It ensures better separability of features for outlier tagging.

The remainder of the paper is as follows. Section 2 reviews the latest trends in literature. Section 3 describes the proposed work in detail, followed by a comprehensive results analysis in Sect. 4. The conclusion section summarizes the key findings of this work.

2 Related works

This section reviews the recent AI developments in anomaly detection from logs. In addition, popular Natural Language Processing (NLP) strategies for feature selection are discussed.

2.1 Anomaly detection

Several works have explored Long Short-Term Memory (LSTM) networks to learn log dependencies and predict the likely next event in normal log sequences. An anomaly is flagged in real-time when the LSTM assigns a low probability to the observed event. This approach is unsupervised in nature, as the anomalous patterns are unknown to the model until inference time. For example, Du et al. proposed the DeepLog model that relied on the sequential occurrence of log events to judge anomalous next logs [8]. An LSTM was fit over the event identifiers and parameters to detect execution path and performance anomalies. Zhou et al. proposed a log pattern-driven anomaly detection model that uses statistical features (frequency, surge) to find transient anomalies [4]. The approach adopts LSTM to correlate long-range temporal patterns which are collectively supplied into a Back Propagation (BP) neural network to obtain anomaly decisions. Yin et al. developed a dual LSTM model that jointly analyses both the sequence of log templates and the components that emitted them [5]. The fused outputs are iterated over time steps to predict the next logs. Along similar lines, Chen et al. exploited pre-order and post-order relationships across log event sequences to train a dual LSTM [10].

Meng et al. introduced the concept of lexical contrast embeddings to enhance the template representations over DeepLog [9]. The template vectors were modelled using LSTM to identify sequential/quantitative anomalies. In an earlier work Log2vec, the authors employed similar semantic dependency parsing and synonym/antonym concepts to learn word vectors in log texts [7]. Lv et al. filtered invalid log words based on parts of speech and learnt word vectors [11]. The derived log-level embedding was passed through LSTM to determine anomaly. Yang et al. added a self-attention layer over the LSTM to enhance the embedding representations [12]. Li et al. proposed time-semantic sentence embeddings that reflect changes in sequence order, log time interval, and events. This dual nature of logs is encoded into two vector sequences that are transformed using bidirectional LSTM [13]. In an updated version of this framework, the authors introduced identifier-based relationship graphs to group interleaved messages from different processes [14]. It also included an enhanced data-driven log parser for producing effective semantic-temporal embeddings.

Another set of studies involved Bidirectional Encoder Representations from Transformers (BERT) for efficiently encoding the log information. Lee et al. trained BERT with only normal system logs [15]. The predictive probability for masked tokens turned out to be low when abnormal logs were passed to this trained model. Wang et al. applied BERT and a variational autoencoder to extract statistical and semantic features through contrastive adversarial training [16]. The LogBERT method proposed by Guo et al. modelled the log contexts through masked log key prediction and by minimizing the distance between normal logs [17].

Some works explore Convolutional Neural Networks (CNN) for spatio-temporal processing of log sequence data. For instance, to process streaming logs, Lu et al. and Wang et al. devised a lightweight temporal CNN [6, 33]. The logs are parsed into key sequences which are then embedded via convolutions and classified as anomalous or not. Similarly, Hashemi et al. designed a character-based hierarchical CNN that performs sequence classification [18]. It aggregates features level-wise from character to line to a sliding window sequence.

To effectively utilize raw logs without any prior anomaly information, unsupervised techniques are vastly explored. For instance, Niwa et al. created a relationship graph based on the interconnections of system components and their running state metrics [19]. A centroid-based clustering was applied to detect outliers in an unsupervised fashion. Similarly, Farzad et al. learnt an autoencoder to vectorize logs and spotted anomalies using isolation forest [34]. Zeufack et al. also exploited log keys to create event counting features useful for density-based clustering [20]. In this approach, an anomaly is evaluated considering its core and reachability distances from the found clusters.

2.2 Feature selection

Feature selection on textual datasets like logs is a strategy to distil significant keywords/variables that can form a basis for classification [21]. They effectively process high-dimensional feature spaces, while preserving information gain and reducing runtime complexity. Bommert et al. experimented with 22 filter methods to correlate and rank the best ones for different kinds of data [22]. It is seen that while the essential features contribute to the objective, their combination is more significant. In another work, Iqbal et al. discuss the taxonomy of feature selection and its applicability for text categorization, remote sensing, and image retrieval [23].

Typically, there are two approaches for picking the ideal features, the wrapper and filter models. The wrapper models generate different feature sets and apply classifiers to evaluate and identify the best combinations [24]. On the other hand, the filter techniques use statistical measures such as correlations between predictor and target variables to decide feature weight scoring [25]. These metrics are commonly drawn from information theory. For example, Prasetiyowati et al. used entropy criteria as the basis to weigh important features [26]. Wang et al. investigated trends in the number of features and their relationship to the classification performance to reach the optimal selections [27].

The mechanism proposed in our work defines feature affinity based on their Naive Bayes occurrence probabilities. It is a derivative of the Naive Bayes algorithm widely used in text classification tasks, like sentiment analysis, topic labelling, etc. [28]. Other feature selection techniques include Chi-Square, Information Gain (IG), and Recursive Feature Elimination (RFE) [29]. Ismail et al. proposed a system that can accurately distinguish between real-human and bot-generated texts based on the measurements gathered in a study using the Naive Bayes and entropy classifiers [30]. Similarly, Bird et al. applied relative entropy to select highly polar sentiments from a dataset of word stem attributes [31]. This enabled precise sentiment recognition. In summary, both the percentage of information gain and the dependencies between predictor/target variables are major aspects in the design of accurate feature selectors.

3 Proposed work

The goal of the proposed deep learning model is to learn robust distributed representations for the individual words comprising the logs. These word embeddings are formed such that they accurately capture the semantic sense of system/program entities (numbers, IPs, emails) typically found in logs, and generalize well to out-of-vocabulary words in unseen logs.

From the architecture diagram presented in Fig. 1, the logs are considered in blocks of a fixed number of lines, N. In the first step, the log messages are converted to a vector sequence via a novel semantic encoder. The subsequent step selects the most useful log message vectors for anomaly detection. This is to ensure that noisy/irrelevant messages are discarded while the remaining are retained in time order. The embeddings are merged to produce a log-level representation which can be passed into an ML classifier. In summary, the model ascertains whether the log block carries an anomaly or not in a supervised fashion.

Fig. 1 Schematic diagram of the proposed log anomaly detection framework

3.1 Subword encoder neural network (SEN)

Log parsing identifies the underlying format used to generate the logging statements. Conventional template extraction methods such as FT-Tree (Frequent Template Tree) or Longest Common Subsequence (LCS) produce distinct log event categories. However, these event categories are not entirely unique but can still be semantically similar/related to each other. As illustrated in Fig. 2, FT-Tree assigned different template IDs to the first two log events, although they convey the same information. Hence there is a need to cluster the log messages based on their underlying meaning that strengthens the semantic relationships for determining the degree of similarity.

Fig. 2 Examples of logs ideal for lexical constraints and semantic similarity

In this log parsing layer, we present a new mechanism to learn rich contextual log embeddings. The steps are illustrated in Fig. 3. A log message is split into words on the space character. The aim here is to generate a unique word-level embedding for any log word in the dataset. A log message can then be expressed as a sum of its word vectors.

Fig. 3 Generating log event vectors

However, two key challenges are faced in creating robust word vectors. Firstly, the vocabulary of words in a log base constantly changes, resulting in new words/events being observed during real-time detection. Secondly, the parameter terms inherent to logs such as machine IDs, memory addresses, variables, etc. need to be precisely expressed so that any unseen IDs or entities can still have similar embeddings. To overcome these issues, we introduce an encoder neural network that operates at the subword level. Specifically, the Byte-Pair Encoding (BPE) algorithm is run over all tokens in the corpus to form an all-inclusive vocabulary set. Starting from 256 base byte tokens, a vocabulary size of 50,257 is reached by performing 50,000 merges. This ensures that even complex entities can be efficiently decomposed into the most representative granular sub-entities.
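The merge procedure behind BPE can be illustrated with a small sketch. This is not the authors' tokenizer: it starts from characters rather than the 256 base bytes, and the function names (`learn_bpe`, `merge_pair`, `encode`) and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (character-level start)."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def encode(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the learned merges."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Hypothetical toy corpus of log tokens.
corpus = ["10.0.0.1", "10.0.0.2", "10.0.1.7", "node-12", "node-13", "error"]
merges = learn_bpe(corpus, num_merges=20)
subwords = encode("10.0.0.99", merges)  # an address never seen during merge learning
```

Because encoding falls back to single characters when no merge applies, an unseen identifier is always decomposed into known subwords, which is the property the encoder relies on for OOV handling.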

This proposed log word encoder is presented in Fig. 4. Here, every word \(w\) is treated as a sequence of \(m\) subwords. They are converted to \(D\)-dimensional features, \(F\in {\mathbb{R}}^{m\times D}\) through an embedding lookup. To integrate deep contextual information across subword features in the word, self-attention learning is performed over \(F\).

Fig. 4 Architecture of the subword encoder neural network

The self-attention layer works by first deriving three sets of features from \(F\): queries \(Q\), keys \(K\), and values \(V\). These entities are formed as linear projections on \(F\) using the weight matrices \({W}_{Q}\in {\mathbb{R}}^{D\times D}\), \({W}_{K}\in {\mathbb{R}}^{D\times D}\), and \({W}_{V}\in {\mathbb{R}}^{D\times D}\), as given by Eqs. (1)–(3).

$$Q=F{W}_{Q}$$
(1)
$$K=F{W}_{K}$$
(2)
$$V=F{W}_{V}$$
(3)

The matrix product of \(Q\) and \({K}^{T}\), scaled by \(\sqrt{D}\), yields the attention coefficient map. Its values are row-wise softmax normalised to ensure appropriate weighting of the features. Each cell in this attention map carries the degree of correlation between the subwords referred to by that row and column. Finally, the transformed feature-set, \(A\in {\mathbb{R}}^{m\times D}\), is calculated as a weighted combination of the values \(V\) using the attention coefficients. Equation (4) summarizes the computations involved.

$$A = softmax\left(\frac{Q{K}^{T}}{\sqrt{D}}\right)V$$
(4)

By attending to every part of the word, each subword encoding selectively infuses semantic attributes and linkage information depending on the words it is present in. The attention mechanism also retains salient features suppressing any irrelevant details, thereby enhancing the intermediate subword representations for downstream processing.
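As a rough illustration of Eqs. (1)–(4), the following sketch runs scaled dot-product self-attention over the subword features of one word and then sums the rows into a single word vector. It uses plain Python lists instead of PyTorch tensors, and the identity projection matrices and toy features are assumptions chosen only to keep the example small.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(F, W_Q, W_K, W_V):
    """Return (A, attention map) for subword features F of shape m x D."""
    D = len(F[0])
    Q, K, V = matmul(F, W_Q), matmul(F, W_K), matmul(F, W_V)   # Eqs. (1)-(3)
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)                                    # m x m correlations
    attn = [softmax([s / math.sqrt(D) for s in row]) for row in scores]
    return matmul(attn, V), attn                               # Eq. (4)

def pool(A):
    """Additive pooling along the rows: one D-dimensional word embedding."""
    return [sum(col) for col in zip(*A)]

# Toy word made of m = 3 subwords with D = 2 features each (hypothetical values).
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I = [[1.0, 0.0], [0.0, 1.0]]  # identity projections, for illustration only
A, attn = self_attention(F, I, I, I)
E = pool(A)
```

Each row of `attn` sums to one, so every subword receives a convex combination of the value vectors before pooling.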

In the subsequent step, \(A\) is additively pooled along the rows to generate the word embedding \(E\). Given such deep representations from the encoder, the goal is to train context-based word vectors to predict the target word. Let \(W\in {\mathbb{R}}^{|W|\times D}\) be defined as the collection of SEN encodings for all word tokens in the dataset \(V\). Then this skip-gram objective can be expressed as per Eq. (5).

$${L}_{skipgram} =-\sum\nolimits_{-c\le j\le c}log\left(p\left({w}_{i+j}|{w}_{i}\right)\right)$$
(5)

where \(c\) is the length of the context window. \({w}_{i}\) is the input central word and \({w}_{i+j}\) denotes its neighbouring words. The conditional probability function is given by Eq. (6).

$$p\left({w}_{x}|{w}_{y}\right)= \frac{exp\left({W}_{x} . {W}_{y}\right)}{\sum_{k=1}^{V}exp\left({W}_{k} . {W}_{y}\right)}$$
(6)

Here, \({W}_{x}\) and \({W}_{y}\) refer to rows in the SEN matrix. In addition to contextual meaning, it is useful to encode lexical similarities and contrast among the log words. As demonstrated in Fig. 2, certain synonymous words convey the same semantic sense. Naturally, their word vectors should be closer compared to their antonyms. We adopt the LSWE (Lexical-contrast Semantic Word embeddings) to constrain such semantic relations on the learnt embeddings [7], as in Eq. (7).

$${L}_{LSWE} =-\sum_{u\in {SYN}_{{w}_{i}}}log\left(p\left({w}_{i}|u\right)\right)+\sum_{u\in {ANT}_{{w}_{i}}}log\left(p\left({w}_{i}|u\right)\right)$$
(7)

where, \({SYN}_{{w}_{i}}\) and \({ANT}_{{w}_{i}}\) denote the synonym and antonym sets of \({w}_{i}\).
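To make Eqs. (5)–(7) concrete, the sketch below evaluates the softmax probability and the two loss terms on a tiny embedding matrix. It is an illustration under assumptions: words are represented by integer row indices, the centre position \(j=0\) is skipped in the context sum (the usual skip-gram convention), and the matrix values are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cond_prob(W, x, y):
    """p(w_x | w_y) as in Eq. (6): softmax over dot products with row y."""
    scores = [dot(W[k], W[y]) for k in range(len(W))]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    return exps[x] / sum(exps)

def skipgram_loss(W, sentence, i, c):
    """Eq. (5) for the centre word at position i with context window c."""
    loss = 0.0
    for j in range(-c, c + 1):
        if j == 0 or not 0 <= i + j < len(sentence):
            continue  # assumption: skip the centre word and out-of-range positions
        loss -= math.log(cond_prob(W, sentence[i + j], sentence[i]))
    return loss

def lswe_loss(W, w_i, synonyms, antonyms):
    """Eq. (7): reward high p(w_i | synonym), penalise high p(w_i | antonym)."""
    loss = -sum(math.log(cond_prob(W, w_i, u)) for u in synonyms)
    loss += sum(math.log(cond_prob(W, w_i, u)) for u in antonyms)
    return loss

# Hypothetical 3-word vocabulary: rows 0 and 1 are near-synonyms, row 2 points away.
W = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
sg = skipgram_loss(W, sentence=[0, 1, 2], i=1, c=1)
ls = lswe_loss(W, w_i=0, synonyms=[1], antonyms=[2])
```

In a real training loop these quantities would be computed on the SEN output matrix and differentiated automatically; the sketch only evaluates them.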

We further augment the semantic embeddings by associating word dependencies, as described in the Log2vec technique by Meng et al. [7]. Dependency parsing examines the grammatical relationships that exist between phrases in the log message. From the examples in Fig. 2, a dependency links a head/root node to a child node via an association. Therefore, connected pairs \((x, y)\) need to be closer than other words \(z\) in the message. Consequently, a set of relation triples \((x, y, z)\) can be formed to represent this inequality, where \(sim(x, y) > sim(y, z)\) and \(sim(x, y) > sim(x, z)\). This constraint can be imposed on the objective function as a triplet loss [32].

$${L}_{DP} =\sum_{u\in {REL}_{{w}_{i}}}max\left({\Vert {w}_{i}-u\Vert }^{2}-{\Vert {w}_{i}-v\Vert }^{2}+\alpha , 0\right)$$
(8)
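A single term of the triplet loss in Eq. (8) can be sketched as follows. The reading of \(u\) as a dependency-related word, \(v\) as an unrelated word from the same message, and \(\alpha\) as the margin is an assumption based on the standard triplet-loss formulation; the vectors are made-up values.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_term(w_i, u, v, alpha):
    """One summand of Eq. (8): hinge on the gap between related and unrelated distances."""
    return max(sq_dist(w_i, u) - sq_dist(w_i, v) + alpha, 0.0)

anchor = [0.0, 0.0]      # embedding of w_i
related = [0.0, 1.0]     # u: linked to w_i by a dependency
unrelated = [3.0, 0.0]   # v: another word in the message

satisfied = triplet_term(anchor, related, unrelated, alpha=1.0)   # 1 - 9 + 1 < 0, clipped to 0.0
violated = triplet_term(anchor, unrelated, related, alpha=1.0)    # 9 - 1 + 1 = 9.0
```

The term vanishes once the related word is closer than the unrelated one by at least the margin, so only violated constraints contribute gradient.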

Thus, to train the SEN model, the final objective to minimize is the addition of these individual losses as presented in Eq. (9).

$$L={L}_{skipgram}+{L}_{LSWE}+{L}_{DP}$$
(9)

Post optimization, any log message can be represented as the vector sum of its component word embeddings.

3.2 Naive Bayes feature selection (NBFS)

In practical scenarios, the occurrence of certain log events can lead to an anomalous pattern. Instead of supplying all log events in a time window for ML analysis, one level of feature engineering can be performed to determine the most helpful log messages that aid in anomaly detection. The advantages of template selection are three-fold: 1) eliminates unnecessary noise, 2) compresses input without information loss, and 3) enables faster convergence and precise model fitting.

To achieve this goal, we propose a novel probability-based technique to determine suitable templates. Leveraging the SEN model, all log lines in the dataset can be mapped to their corresponding log embeddings. An Agglomerative Nesting (AGNES) clustering algorithm is run to group similar messages into unique event categories. These log clusters are indicative of a specific kind of event, similar in terms to a template extracted by FT-Tree or LCS techniques. Let \(C\) be the set of distinct clusters obtained.
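The clustering step can be pictured with a minimal bottom-up (agglomerative) procedure. The text does not state the linkage criterion or the stopping rule, so average linkage and a distance threshold are assumptions here, as are the toy embeddings; a production pipeline would more likely call a library implementation.

```python
import math

def agnes(vectors, distance_threshold):
    """Merge the closest clusters until the smallest average-link distance exceeds the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_link(a, b):
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = avg_link(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > distance_threshold:
            break  # remaining clusters are considered distinct event categories
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

# Hypothetical 2-D log embeddings: two tight groups far apart.
log_vectors = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
C = agnes(log_vectors, distance_threshold=1.0)
```

Each returned list holds the indices of log lines that fall into one event category, playing the role of a template ID in the later selection step.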

Since the ML classifier is trained on a fixed-length log block (N events at a time), let the dataset be prepared as sliding windows, \(X\). Let \(y\in \{0,1\}\) be the target for anomaly classification, i.e., whether \(X\) points to a failure or not. From Bayes theorem, the posterior probability of determining \(y\) from \(X\) is given by Eq. (10).

$$P\left(y|X\right)=\frac{P\left(X|y\right) \times P\left(y\right)}{P\left(X\right)}$$
(10)

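Here, assuming that the events in a block are conditionally independent given the class (the naive Bayes assumption), the likelihood \(P\left(X|y\right)\) can further be expressed as a product of per-event terms, as per Eq. (11).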

$$P\left(X|y\right)=P\left({x}_{1}|y\right)\times P\left({x}_{2}|y\right)\times \dots \times P\left({x}_{n}|y\right)$$
(11)

where \({x}_{i}\) denotes a log event type from \(C\) that is found in the block \(X\). Therefore, the probability of an event \({x}_{i}\) to occur in a specific class of logs can be calculated as per Eq. (12).

$$P\left({x}_{i}|y\right)=\frac{\#\,\text{occurrences of }{x}_{i}\text{ in }y}{\#\,\text{templates in }y}$$
(12)

Using Eq. (12), \(P\left({x}_{i}|y=0\right)\) gives the plausibility for event \({x}_{i}\) to manifest in a normal log. Similarly, \(P\left({x}_{i}|y=1\right)\) is the possibility of seeing \({x}_{i}\) in an anomalous log.

The proposed feature selection technique computes the \(P\left({x}_{i}|y=0\right)\) and \(P\left({x}_{i}|y=1\right)\) statistics for all events \({x}_{i}\in C\) by traversing the entire dataset. In case the difference between these values for an event falls below a set threshold \(T\), that event is discarded. It means that the distribution of occurrence of that event did not significantly vary/distinguish between normal and abnormal logs. Evidently, it is a frequent event that bears no significance for the task of anomaly classification. The value of \(T\) is determined empirically. In summary, only messages belonging to log clusters that satisfy this property are subjected to ML.
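The selection rule can be sketched in a few lines. Two points are assumptions rather than statements from the text: the denominator of Eq. (12) is taken to be the total number of event occurrences in blocks of class \(y\), and the "difference" is taken as an absolute difference. Event names and the threshold value are hypothetical.

```python
from collections import Counter

def event_likelihoods(blocks, labels):
    """Estimate (P(x|y=0), P(x|y=1)) for every event type, following Eq. (12)."""
    counts = {0: Counter(), 1: Counter()}
    for events, y in zip(blocks, labels):
        counts[y].update(events)
    totals = {y: sum(c.values()) for y, c in counts.items()}
    all_events = set(counts[0]) | set(counts[1])
    return {
        e: tuple(counts[y][e] / totals[y] if totals[y] else 0.0 for y in (0, 1))
        for e in all_events
    }

def select_events(likelihoods, T):
    """Keep events whose occurrence probability differs by at least T between classes."""
    return {e for e, (p0, p1) in likelihoods.items() if abs(p1 - p0) >= T}

def filter_block(events, selected):
    """Drop non-selected events while preserving the time order of the rest."""
    return [e for e in events if e in selected]

# Hypothetical event-ID blocks: two normal (y=0) and one anomalous (y=1).
blocks = [
    ["info", "info", "login", "info"],
    ["info", "login", "info", "info"],
    ["info", "info", "info", "daemon_stop"],
]
labels = [0, 0, 1]
probs = event_likelihoods(blocks, labels)
selected = select_events(probs, T=0.1)
kept = filter_block(["info", "daemon_stop", "info"], selected)
```

Here `info` occurs with probability 0.75 in both classes and is discarded, whereas `login` and `daemon_stop` differ by 0.25 and are retained.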

3.3 Anomaly detection

In a log stream, a specific time frame is sampled. The log messages are encoded via SEN and filtered on the NBFS event selector layers. Assuming the feature stack to have \(k\) log row embeddings in D-dimensions, they are reshaped into a 1-D vector. The concatenated feature-set is padded to a uniform length and classified on supervised ML models, as illustrated in Fig. 1.
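The reshaping and padding step might look as follows. The function name, the zero padding value, and the truncation of blocks longer than `max_rows` are assumptions; the text only states that features are flattened and padded to a uniform length.

```python
def flatten_and_pad(embeddings, max_rows, dim, pad_value=0.0):
    """Concatenate up to `max_rows` D-dimensional rows into one vector of length max_rows * dim."""
    flat = [value for row in embeddings[:max_rows] for value in row]
    flat.extend([pad_value] * (max_rows * dim - len(flat)))
    return flat

# k = 2 selected log embeddings with D = 2, padded to room for 3 rows.
features = flatten_and_pad([[1.0, 2.0], [3.0, 4.0]], max_rows=3, dim=2)
```

Padding after flattening keeps the retained messages in time order at the front of the vector, so the classifier always sees a fixed-length input regardless of how many events survive selection.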

4 Results and Discussions

This section discusses the datasets, experiments and findings based on various aspects of the model. In the final subsection, a performance comparison is drawn with the state-of-the-art methods for anomaly detection.

4.1 Data collection

The proposed log analytics framework was evaluated on three standard open-source datasets: 1) BGL, 2) HDFS, and 3) OpenStack. Their details are listed in Table 1. These datasets are a collection of real-time system logs that were labelled and freely released by authors of previous works. In the existing literature, they are commonly used as a benchmark for assessing the effectiveness of anomaly detectors.

Table 1 Datasets obtained from three sources. Provided are the counts of normal and erroneous log events present in the files

The BGL logs are from the BlueGene/L supercomputer at the Lawrence Livermore National Labs (LLNL) in Livermore, California [35]. The log messages are marked with alert category tags to indicate anomaly. This label information was utilized for supervised learning.

OpenStack is a set of software components to manage cloud services and infrastructure [36]. The dataset was generated on the CloudLab platform. It provides VM instances that have injected anomalies and the abnormal log sections pertaining to them.

The Hadoop Distributed File System (HDFS) log events describe the insertion, deletion, and update of blocks in the Hadoop ecosystem [8]. An error/failure here can be traced to the associated block IDs.

4.2 Experimental setup

All experiments are run on an Ubuntu 18.04 server with 32 GB NVIDIA Tesla V100 GPUs, 128 GB of RAM, and 96 CPUs. The embedding layers are learnt as PyTorch modules through Adam optimization, while the log classification is performed using Sklearn libraries. The data processing scripts involve NLTK, regex and Spacy functions. For the log vector neural network, the initial learning rate is set to 5.0. It is controlled by a multiplicative learning rate decay scheduler that reduces the rate by a factor of 0.1 when no improvement is observed over 10 epochs.
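The decay rule described above can be mimicked in plain Python. This is only a behavioural sketch, not the scheduler class used in the experiments: the class name is hypothetical, and "no improvement" is interpreted as the monitored validation loss failing to drop below its best value for 10 consecutive epochs.

```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=5.0, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

scheduler = PlateauDecay()
scheduler.step(1.0)                    # first epoch sets the best loss
for _ in range(10):                    # ten stagnant epochs trigger one decay
    current_lr = scheduler.step(1.0)
```

In PyTorch, a comparable effect is typically obtained with the built-in plateau-based scheduler, whose exact patience semantics should be checked against its documentation.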

4.3 Model training and validation

To perform training, the logs are traversed as sliding windows. Each window constitutes a chunk of log events that occurred in that fixed-length time frame. The window size is a hyperparameter whose optimal value depends on the dataset. The log blocks are labelled as normal or anomalous based on the labelling information provided (refer to Table 1).
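A possible way to build the labelled windows is sketched below. The rule that a block is anomalous when any of its lines is flagged is an assumption (the datasets provide labels at different granularities), as are the function name and the stride parameter.

```python
def sliding_windows(events, line_labels, size, step=1):
    """Cut an event stream into fixed-length blocks with one binary label per block."""
    blocks, targets = [], []
    for start in range(0, len(events) - size + 1, step):
        blocks.append(events[start:start + size])
        # Assumption: one anomalous line makes the whole block anomalous.
        targets.append(int(any(line_labels[start:start + size])))
    return blocks, targets

events = ["e0", "e1", "e2", "e3", "e4"]
line_labels = [0, 0, 1, 0, 0]
X_blocks, y_blocks = sliding_windows(events, line_labels, size=2)
```

With a window of size 2 and stride 1, the single anomalous line `e2` marks the two blocks that contain it.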

Such data instances are partitioned into train, validation, and test splits in the ratio of 80:10:10 respectively. A shuffled and stratified data sampling is applied to preserve the proportions of normal and erroneous blocks in these split sets. The evaluation criteria are Precision, Recall and F1-score. These metrics are widely used for benchmarking log failure prediction models.
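The split and the evaluation metrics can be sketched without external libraries. The helper names, the fixed seed, and the rounding of per-class split sizes are assumptions made for illustration; the experiments themselves use Sklearn utilities.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices per class and split them so class proportions are preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_train = int(round(ratios[0] * len(idxs)))
        n_val = int(round(ratios[1] * len(idxs)))
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the anomalous class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

labels = [0] * 80 + [1] * 20          # hypothetical block labels, 20% anomalous
train_idx, val_idx, test_idx = stratified_split(labels)
scores = precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0])
```

Splitting per class before concatenating is what keeps the rare anomalous blocks represented in the validation and test sets.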

The entire log corpus is first subjected to SEN training as per Eq. (9) in order to learn appropriate weights for embedding layers. Table 2 captures the decay in this semantic word2vec loss function. The subword encoder is able to converge well on all three datasets and generate effective representations of their textual contents. This is evident as the loss on the unseen validation set dropped closer to the training loss towards convergence. It took 3–4 epochs to fully absorb the semanticity constraints. These log features are then classified as normal or anomalous using supervised ML. In the experiments, models tried include logistic regression, Support Vector Machine (SVM), random forests, and extreme gradient boosting.

Table 2 Trends observed during training on the datasets. The embedding loss applies to the word2vec learning at the SEN phase. Cross-entropy loss is used to classify log blocks at the final detection stage. F1 scores emitted at this ML classifier are tracked through epochs

In Table 2, learning curves are reported for the best classifier and the set of hyperparameters that gave the highest results on each dataset. The cross-entropy loss decays steadily through epochs. The number of epochs differs based on dataset complexity and type of classifier. Since HDFS blocks require processing a larger contextual window, training takes longer. However, accuracy on unseen data matches that of the training set over time. Similarly, the F1 scores on BGL fluctuate initially but stabilize after 30 epochs. OpenStack logs are the most amenable to anomaly detection out of the three, hence their learning is smooth across iterations.

4.4 Ablation studies

This section aims to evaluate the efficacy of individual components in the proposed framework. The effectiveness of the two key building blocks: 1) SEN and 2) NBFS is studied via ablation experiments.

Firstly, a vanilla baseline is established so that performance gains from the proposed modules can be measured incrementally. This base algorithm is a combination of parsed log template features and LSTM for self-supervised next log prediction. Such an approach is popular in the current state of the art. It utilizes the error-free log sequences prepared in the previous step to model the distribution of logs, thereby identifying outliers on the unseen set. Table 3 captures a holistic view of the ablation studies. It is evident that this base method registers reasonable accuracy close to 90%, leaving much scope for enhancements. In particular, it needs to handle new events not observed during training. Also, the template extraction demands rigorous fine-tuning to predict the event types precisely.

Table 3 Experimental results to validate efficacy of the proposed enhancements

To augment this baseline, the SEN is introduced at the embedding learning stage. It addresses both challenges by learning robust representations for out-of-vocabulary words and by eliminating the need for pattern-based log parsing. The SEN technique improves recall by 9.19% on BGL compared to the conventional method. A major contributing factor is self-attention learning, which focuses only on significant details, unlike the LSTM that processes the entire context. On HDFS and OpenStack, the encoder enhances precision by 5.74% and 10.84% respectively, reducing false positives.

By adding the NBFS layer, only the differential log events whose occurrence pattern greatly differs in the anomaly scenarios are supplied to ML. Inducing this event awareness into the context increases precision close to 1.00 on all the datasets. Evidently, the model exploits useful attributes to enhance boundary separation and distinguish the outliers, resulting in a higher recall by 4%.

4.5 Effects of embedding dimensionality on anomaly detection

Embedding size is a determining factor of model performance. Choosing an appropriate length can improve the semantic effectiveness of the produced embedding features. Consequently, it aids in the discriminability of keywords representing anomalous patterns. In this experiment, the dimensionality is varied across 64, 256, and 1024, and the best classification model is trained on each dataset (refer to Fig. 5).

Fig. 5 Selecting optimal embedding size on different datasets

The BGL logs contain a greater number of distinct tokens and synonym-antonym sets. Therefore, they require a larger dimensionality of 1024 elements to encode this knowledge. Comparatively, the OpenStack logs have fewer semantic dependencies, but more machine entities and identifiers. So, a small size such as 64 dimensions does not capture categorical similarity in these variable words. A length of 256 fits well. On the other hand, HDFS comprises fewer unique events and variations in the structure of messages. It responds well at 64 and 256 dimensions but degrades after that due to overfitting.

4.6 Effects of window size on the accuracy of anomaly detection

The size of the contextual window is a critical hyperparameter in the design of a log anomaly detector. A sufficiently wide range allows enough correlations across events to determine the presence of an abnormality. Too small a size can deprive the ML model of essential details. On the other hand, a very large size invites noise and complex boundary fitting, and also incurs more memory and processing.

Specifically, the ideal window size is a characteristic of the dataset that depends on its nature and complexity. To obtain the optimal value, the search space for this parameter is varied through 1, 5, 10, 20, 50, 100, and 200. The F1 scores registered for the three datasets are visualized in Fig. 6.

Fig. 6 Choosing the best window size to maximize accuracy. Scores are computed on the test set

For BGL, a single alert/non-alert message indicates whether an aberration occurred, hence a window of size one gives the highest F1 score. For OpenStack, a window size of 5 achieves the best performance. Since anomalies are present as transactions revolving around VM instances, they are expected to span a few lines. In contrast, models tried on HDFS converge only at a window size of 100, and performance drops slightly beyond that. As an HDFS anomaly manifests at the file-block level, a bigger window can capture most of the block-related messages amongst other log events. Therefore, HDFS demands a larger window to observe event activity.
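A window-size search of this kind might be organised as in the sketch below. It rests on several assumptions that the text does not confirm: windows are counted in log lines, they do not overlap, and a window is labelled anomalous if any line inside it is anomalous. The callback `score_fn` is a hypothetical placeholder for training a classifier and returning its F1 score.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def make_windows(events: Sequence[str],
                 labels: Sequence[int],
                 size: int) -> Tuple[List[List[str]], List[int]]:
    """Cut a log sequence into non-overlapping windows of `size` lines.

    Assumption: a window is anomalous (1) if any of its lines is.
    """
    xs: List[List[str]] = []
    ys: List[int] = []
    for start in range(0, len(events), size):
        xs.append(list(events[start:start + size]))
        ys.append(int(any(labels[start:start + size])))
    return xs, ys


def best_window_size(
    events: Sequence[str],
    labels: Sequence[int],
    sizes: Sequence[int],
    score_fn: Callable[[List[List[str]], List[int]], float],
) -> Tuple[int, Dict[int, float]]:
    """Evaluate every candidate size and return the best one together
    with all scores. `score_fn` is assumed to train and evaluate a model."""
    results = {s: score_fn(*make_windows(events, labels, s)) for s in sizes}
    best = max(results, key=results.get)
    return best, results


# Candidate sizes taken from the experiment described above.
CANDIDATE_SIZES = [1, 5, 10, 20, 50, 100, 200]
```

Calling `best_window_size(events, labels, CANDIDATE_SIZES, score_fn)` once per dataset would yield the per-dataset optimum, provided that `score_fn` evaluates on held-out data.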

4.7 Improving efficiency of base classifiers using naive bayes feature selection

To demonstrate the impact of the proposed NBFS logic in pruning unnecessary logs, the behaviour of four classifiers in the presence and absence of this layer is investigated. Figure 7 plots the trends observed in F1 score before and after applying NBFS.

Fig. 7 Trends in ML model performance before and after applying NBFS. (a) BGL. (b) HDFS. (c) OpenStack

It is seen that, regardless of the dataset or type of classification model, NBFS improves the distinguishability of anomalies. For SVM, it leads to a 25% and 7% increase in the correctness of predictions on OpenStack and HDFS respectively. With NBFS, the non-linear kernel feature space is well-formed to enable large-margin separation of outliers. Similarly, Random Forest improves by 3% uniformly across datasets; the decision trees overfit less, as the ideal depth decreased post-NBFS. In contrast, Logistic Regression and Extreme Gradient Boosting models respond less to NBFS, as these base classifiers had already reached maximal results.
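For readers reproducing such a before/after comparison, the following sketch computes precision, recall, and F1 from confusion counts and reports the change between two scores. The text does not state whether the quoted percentages are absolute or relative differences, so the hypothetical helper `gain` returns both; all names here are assumptions.

```python
from typing import Dict, Optional, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def gain(before: float, after: float) -> Dict[str, Optional[float]]:
    """Report the change in a score both ways, since the convention
    used for the percentages in the text is not specified."""
    absolute = after - before
    relative = absolute / before if before else None
    return {"absolute": absolute, "relative": relative}
```

For example, `gain(0.5, 0.75)` reports an absolute gain of 0.25 and a relative gain of 0.5, which shows how far the two conventions can diverge.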

4.8 Generalizability to unseen logs

Resiliency to new log events is a key strength of the proposed model. Even new messages only encountered in real-time can still be assessed as normal or erroneous. Figure 8 provides a plot of total log events encountered at different percentages of the log corpus.

Fig. 8 Number of unique events discovered in the logs by parsing as proportions in the timestamp order

While the HDFS and OpenStack logs contain a smaller number of distinct events that remains constant over time, the BGL logs show comparatively more variation. The BGL dataset produces new log message clusters frequently, as its trendline rises steeply. This property makes it an ideal candidate to test the robustness of the subword encoder in handling words unseen during training. The results of such an experiment for all three datasets are summarized in Table 4.
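A growth curve of this kind can be computed as sketched below. The sketch assumes that each log line has already been mapped to an event identifier by a parser and that the list is sorted by timestamp; the function name and the default fractions are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def unique_event_growth(
    event_ids: Sequence[str],
    fractions: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5,
                                  0.6, 0.7, 0.8, 0.9, 1.0),
) -> List[Tuple[float, int]]:
    """Count distinct events seen in the first `f` share of the corpus,
    for each fraction `f`, scanning the log once in timestamp order."""
    n = len(event_ids)
    seen: Set[str] = set()
    curve: List[Tuple[float, int]] = []
    pos = 0
    for f in sorted(fractions):
        end = round(f * n)
        seen.update(event_ids[pos:end])
        pos = max(pos, end)
        curve.append((f, len(seen)))
    return curve


# Toy sequence of parser-assigned event identifiers.
toy_events = ["A", "A", "B", "A", "C", "B", "D", "A", "A", "E"]
toy_curve = unique_event_growth(toy_events, fractions=(0.2, 0.5, 1.0))
```

A curve that flattens early corresponds to a corpus with a stable event set, whereas a curve that keeps rising indicates that new events continue to appear.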

Table 4 Measuring accuracy when only a proportion of logs is used for training and rest taken into testing

Especially for BGL, even when only the first 10% of logs were used for learning, the model still gave a reasonable 0.93 F1-score on the remaining 90% of the data. It improves with more data, reaching 0.96 at the halfway mark. This trend confirms the efficacy of the approach even under new logs. In slight contrast, on HDFS and OpenStack the model already achieved maximal accuracy at 30% and 60% of the dataset respectively. These datasets had a redundant pattern of messages that did not greatly impact the unknown subword representations.
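The evaluation protocol of training on an early portion of the logs and testing on the rest can be sketched as follows. The split is assumed to be a simple chronological cut, and the hypothetical helper `oov_rate` measures, at the whitespace-token level, how many test words never occurred in training; neither detail is specified in the text.

```python
from typing import List, Sequence, Tuple


def chronological_split(records: Sequence[str],
                        train_fraction: float) -> Tuple[List[str], List[str]]:
    """Use the first share of time-ordered records for training and
    the remainder for testing (no shuffling)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(records) * train_fraction)
    return list(records[:cut]), list(records[cut:])


def oov_rate(train_msgs: Sequence[str], test_msgs: Sequence[str]) -> float:
    """Share of test tokens that were never seen in training.

    Assumption: tokens are obtained by whitespace splitting.
    """
    vocab = {w for msg in train_msgs for w in msg.split()}
    test_words = [w for msg in test_msgs for w in msg.split()]
    if not test_words:
        return 0.0
    return sum(w not in vocab for w in test_words) / len(test_words)


# Toy time-ordered messages (illustrative only).
toy_msgs = ["node up", "node down", "node up", "disk full"]
toy_train, toy_test = chronological_split(toy_msgs, 0.5)
toy_oov = oov_rate(toy_train, toy_test)
```

A high out-of-vocabulary rate at small training fractions is the condition under which a word-level vocabulary would struggle and subword composition becomes useful.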

4.9 Performance comparison

This section presents a comparison of the proposed work with several state-of-the-art methods for log anomaly detection. To ensure fairness in comparison, only works that have experimented with the same datasets as the current work have been considered. Tables 5, 6 and 7 show the analysis of existing techniques on the BGL, HDFS, and OpenStack datasets respectively. It is seen that the proposed SEN encoder alongside NBFS-augmented classification reaches better Precision, Recall and F1 scores than most of the other works. The biggest advantage of our method lies in its low memory requirement, fast compute times and simplified workflow integration for streaming logs.

Table 5 Performance analysis of existing methods on the BGL logs dataset
Table 6 Results comparison of the proposed model with similar research works on the HDFS dataset
Table 7 Evaluation of AI methods on the OpenStack dataset

From Table 5, Chen et al. achieved a below-par score on BGL, because the conventional log parsing to extract template/parameters does not cover all possible events (keys) in the training set for the next log key prediction [10]. This method gave a 0.97 F1-score on HDFS logs that contained fixed event types, whereas it could not adapt to the characteristics of the BGL dataset that produces more irregular events. Similar approaches such as Du et al., Meng et al., and Yang et al. also face the same drawback of not being generalizable for new log keys [7, 8, 12]. Another popular approach involves modelling BERT to predict the likelihood of masked tokens. Lee et al. and Guo et al. obtain adequate F1 scores of 0.96 and 0.90 using BERT [15, 17]. However, due to diverse variability in logs, BERT predictive probabilities are unlikely to cover all possibilities in each context, resulting in false positives. Instead, considering these deep BERT features as input for a supervised classifier layer prevents precision errors.

Employing word vectors in place of log templates makes the embeddings more resilient to new log formats, and these methods display better results. For instance, Z. Wang et al. designed a semantic vector space model that cleanly highlights anomalous logs on a temporal CNN [33]. Li et al. use time and semantic embeddings to detect sequential anomalies [13]. These approaches acquire efficient representations but draw in excessive contextual detail and, invariably, noise. The NBFS module proposed in our work prevents such less relevant factors from impacting decision making. In place of text parsing, an end-to-end character-level neural network was presented by Hashemi et al. that achieved a 0.99 F1 score on BGL and HDFS [18]. This technique fails on OpenStack logs, where the anomalies are too finely spread over certain VM instance messages to be distinguished solely at the character level.

Amongst LSTM methods, analysing multiple relationships such as pre/post order of events and component-aware templates improves the efficacy of LSTM. Yin et al. and Chen et al. demonstrate Dual LSTM models that can inspect such patterns [5, 10]. These models capture long-term dependencies yet are not explicitly trained for semantic/contextual similarities between log words. On the other hand, Lv et al., who utilized word vectors, converged with better precision [11]. In our proposed architecture, the SEN module ensures appropriate semantic-aware features for anomaly detection. Handcrafted features such as frequency, surge and variables have also been effective inputs for LSTM, as shown by Zhou et al. [4]. Nevertheless, they are derivative statistics and do not directly express the inherent log contents, which are what enable the LSTM to form better correlations.

Besides these approaches, autoencoders are shown to generate useful low-dimensional features for differentiating anomalies [16, 34]. However, an autoencoder presents a risk of lossy transformation. Training an autoencoder also requires a lot of data, processing time and hyper-parameter tuning, whereas the proposed SEN is tuned directly for the word2vec objective and converges faster. Overall, in terms of lower complexity and high accuracy on multiple datasets, our method is on par with the state of the art.

5 Conclusion

This article proposes a novel approach to learning log word embeddings that takes advantage of semantic/lexical relationships across words. It operates on a subword byte-pair vocabulary but ensures that contextuality is retained in the word-level embeddings. Learning such compositional word vectors inherently solves the representability of out-of-vocabulary tokens, which is a key research challenge in this area. The ability of this module to operate under irregular events was confirmed through experiments. By observing only the first 10% of logs, it gave a 93% F1 score on the BGL dataset, which demonstrates its resiliency to new messages. Additionally, this paper introduces a probabilistic mechanism for selecting the most significant logs that can aid anomaly detection. It learns a Naive Bayes probability distribution for the occurrence pattern of events. Then, it identifies the salient ones that reflect the difference between regular logs and abnormal logs. To the best of our knowledge, this is the first attempt to develop such a feature selector for logs. Empirically, it was observed that this module improves the performance of the base classifiers, by as much as 25% for the Support Vector Machine on the OpenStack dataset.

The proposed framework was demonstrated on three benchmark datasets. The learning curves imply that the models converged optimally. The framework reached a mean F1 score of 0.99 on all three datasets, which exceeds the current state of the art. As future work, the model can be extended to more kinds of logs. The explainability of target predictions can be back-traced to features in the log file, thereby opening pathways to self-healing workflows.