
1 Introduction

Today, devices are becoming increasingly ubiquitous, and troubleshooting them when they are affected by an internal error is difficult. One simple solution is to use event logs to record every event that occurs in these devices. These event logs come in various formats. Some are unstructured, such as plain text files, while others are semi-structured, such as CDF files, XML files, or files with a log extension (i.e., ‘.log’ formats). These files can be used to analyze device functionality. For example, any Windows system provides event logging; to view these event logs, open Event Viewer under Administrative Tools in the System and Security section of the Control Panel.

Analysis of log events can be done in different ways. Traditionally, this was a manual process: the analysis team would extract the data from the log files in the format they required, understand the device behavior, and generate a report on their findings. Automated techniques include tool-based analysis such as SWATCH [1], LogSurfer [2], and SEC [3]. Other automated approaches involve clustering techniques [4, 5], which compute distances between log files and classify them accordingly. Still other techniques mine internal patterns within a single log file and classify those patterns; a few such approaches use a state-machine model to analyze log files [6]. Generally, variants of the Apriori algorithm [7] are used in log analysis. We compared Apriori with ECLAT [8] and decided to use ECLAT for this technique.

In this paper, we use a sequential mining algorithm, ECLAT, to discover interesting rules over patterns of events. The major objective of the work is to detect anomalies in event logs; however, the scope of this paper is restricted to mining interesting patterns. We consider an example to illustrate the generation of all possible subsets of event patterns over event logs.

2 ECLAT Algorithm Over Event Logs

The ECLAT algorithm is used to perform itemset mining [9]. ECLAT stands for Equivalence Class clustering And bottom-up Lattice Traversal. The algorithm uses TID sets (transaction ID sets) to avoid generating unwanted subsets that do not occur in the current context. ECLAT is applicable to both sequential and ordinary patterns. For the proposed technique, we apply ECLAT over sequential patterns that are separated by bounding events. Let us illustrate how ECLAT can be adapted to event logs. Let E denote the set of all events of an event log, with E1, E2, E3, E4 ∈ E (Table 1).

Table 1 Initial input to ECLAT

When this input is given, the algorithm transforms these transactions into the following table format. The right-hand side of Table 2 shows the TID set for each event Ei ∈ E; a short sketch of this transformation follows Table 2.

Table 2 Generating TID lists
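To make the transformation concrete, the following Python sketch builds the TID sets from a small set of hypothetical transactions. The actual contents of Table 1 are not reproduced here, so the data below is chosen only to be consistent with the final output reported at the end of this section ({E1}, {E2}, {E4}, and {E2, E4} at a minimum support of 2).

```python
# Hypothetical transactions standing in for Table 1 (the real data is not
# given in the text); chosen to be consistent with the final output below.
transactions = {
    1: ["E1", "E3"],
    2: ["E2", "E4"],
    3: ["E1"],
    4: ["E2", "E4"],
}

# Invert the transactions: map each event to the set of transaction IDs
# (its TID set) in which it occurs, as in Table 2.
tidsets = {}
for tid, events in transactions.items():
    for event in events:
        tidsets.setdefault(event, set()).add(tid)

print(tidsets)  # {'E1': {1, 3}, 'E3': {1}, 'E2': {2, 4}, 'E4': {2, 4}}
```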

In the next step, the algorithm computes the support of every event, as shown in Table 3.

Table 3 Events with support

Here, if we consider the minimum support to be 2, E3 is eliminated and the other events are considered frequent itemsets. This is illustrated in Table 4; a short sketch of this pruning step follows the table.

Table 4 ECLAT output
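Continuing the sketch above, the support of an event is simply the size of its TID set, and events below the minimum support are pruned:

```python
# TID sets produced in the previous sketch (hypothetical data).
tidsets = {"E1": {1, 3}, "E2": {2, 4}, "E3": {1}, "E4": {2, 4}}

# The support of an event is the size of its TID set (Table 3).
support = {event: len(tids) for event, tids in tidsets.items()}

# Prune events below the minimum support (Table 4): E3 is eliminated.
MIN_SUPPORT = 2
frequent = {e: t for e, t in tidsets.items() if len(t) >= MIN_SUPPORT}

print(support)   # {'E1': 2, 'E2': 2, 'E3': 1, 'E4': 2}
print(frequent)  # {'E1': {1, 3}, 'E2': {2, 4}, 'E4': {2, 4}}
```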

The process shown above is for single events over a set of sequences. The next step is to include itemsets with more than one event. Table 5 illustrates this scenario.

Table 5 Generating TID lists for two itemsets

From Table 6, it can be observed that this process cannot proceed to three-itemsets, since {E2, E4} is the only frequent two-itemset. The final output of the algorithm is {E1}, {E2}, {E4}, and {E2, E4}; a sketch of the intersection step follows Table 6. In the next section, we discuss the implementation process using a sample event log.

Table 6 ECLAT output for two-itemsets
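The key step for larger itemsets is that the TID set of a candidate two-itemset is the intersection of its members' TID sets, so no rescan of the transactions is needed. A sketch on the same hypothetical data:

```python
from itertools import combinations

# Frequent single events and their TID sets from the previous step.
frequent_1 = {"E1": {1, 3}, "E2": {2, 4}, "E4": {2, 4}}

# Candidate two-itemsets (Table 5): intersect the TID sets of each pair.
MIN_SUPPORT = 2
frequent_2 = {}
for (a, ta), (b, tb) in combinations(frequent_1.items(), 2):
    tids = ta & tb
    if len(tids) >= MIN_SUPPORT:
        frequent_2[(a, b)] = tids

print(frequent_2)  # {('E2', 'E4'): {2, 4}} -- the only frequent pair,
                   # so no three-itemset can be frequent (Table 6).
```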

3 Implementation of ECLAT Algorithm

In this paper, we consider a sample device log, shown in Fig. 1. Our initial step is to generate pattern boundaries so as to separate the different sequential patterns. This can be done in different ways: one technique is to use the temporal relationships between patterns; another is to manually identify boundary patterns and generate pattern sequences. In our case, we use the second approach. Figure 2 shows the set of bounded patterns indicating the start and end of events.

Fig. 1 Sample event log

Fig. 2 Boundary events

Implementation of the proposed technique requires three steps:

  1. Creating the event transaction matrix;

  2. Deciding the minimum support;

  3. Applying ECLAT.

A. Creating the event transaction matrix

In the above data, every event carries an identifier in column V8. These identifiers group events into a number of categories. For example, two events indicating user login and logout can have the same identifier; these two events can be considered events under the user-session category. Similarly, if the user enters a set of commands to load variables into memory, increment a memory pointer, and so on, such events can be categorized as memory events and share an identifier. With the help of these identifiers, we can generate the boundary events shown in Fig. 2, as sketched below.
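As an illustration of this step, the following sketch splits a stream of (identifier, event) records into bounded sequences. The start and end identifier values are hypothetical, since the actual values in column V8 appear only in the figures.

```python
# Hypothetical boundary identifiers; the actual values in column V8 of the
# sample log are not given in the text.
START_IDS = {"100"}
END_IDS = {"199"}

def split_into_sequences(records):
    """Group (identifier, event) records into sequences delimited by
    boundary (start/end) identifiers."""
    sequences, current = [], None
    for identifier, event in records:
        if identifier in START_IDS:
            current = []                # a boundary start opens a sequence
        elif identifier in END_IDS and current is not None:
            sequences.append(current)   # a boundary end closes it
            current = None
        elif current is not None:
            current.append(event)
    return sequences

log = [("100", "session start"), ("12", "E1"), ("13", "E2"),
       ("199", "session end"), ("100", "session start"),
       ("12", "E1"), ("199", "session end")]
print(split_into_sequences(log))  # [['E1', 'E2'], ['E1']]
```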

Now, by considering these bounded events, sequences of events can be generated from the above data and collected into a single table. Each sequence is assigned a sequence ID (pattern ID), indicated by column V1, in the order in which it is generated. Figure 3 illustrates the generation of these sequences. Since each sequence has a different length, they are called variant patterns.

Fig. 3 Sequence of events

The above patterns are generated by considering the identifiers present in the original data, which indicate the event flow. After obtaining these event sequences, they must be converted into a transaction matrix to avoid “NA” entries in the itemsets. Figure 4 shows the transaction matrix obtained from the above log sequences; a sketch of the conversion follows the figure.

Fig. 4 Transaction sets
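A minimal sketch of this conversion, using hypothetical sequences: each sequence becomes a boolean row over the full set of event types, so shorter sequences need no “NA” padding.

```python
# Hypothetical variable-length event sequences (cf. Fig. 3).
sequences = [["E1", "E2", "E4"], ["E2", "E4"], ["E1"]]

# One boolean row per sequence, one column per event type, so shorter
# sequences need no "NA" padding (cf. Fig. 4).
events = sorted({e for seq in sequences for e in seq})
matrix = [[event in seq for event in events] for seq in sequences]

print(events)   # ['E1', 'E2', 'E4']
for row in matrix:
    print(row)  # [True, True, True] / [False, True, True] / [True, False, False]
```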

B. Deciding support

Before applying ECLAT to the transaction matrix, we need to decide the support value. This can be done using an item frequency plot, which indicates a suitable support value for the algorithm; the plot is shown in Fig. 5 in the results section. Here, all the events appear in about 50% of the sequences, so the support can be kept near 0.5. In this paper, we have selected a support of 0.4.

Fig. 5 Item frequency plot
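A sketch of how such a plot can be produced from a boolean transaction matrix (the matrix below is hypothetical, not the paper's data):

```python
import matplotlib.pyplot as plt

# Hypothetical boolean transaction matrix (rows = sequences, columns = events).
events = ["E1", "E2", "E4"]
matrix = [[True, True, True], [False, True, True], [True, False, False]]

# Relative frequency (support) of each event across the sequences.
n = len(matrix)
freq = [sum(row[i] for row in matrix) / n for i in range(len(events))]

plt.bar(events, freq)
plt.ylabel("Relative frequency (support)")
plt.title("Item frequency plot")
plt.show()
```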

C. Applying ECLAT

ECLAT is applied to the transaction matrix with the minimum support from the previous step, following the steps described in the previous section. The result of the algorithm is shown in Fig. 6 in the results section.
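A compact, self-contained sketch of the full ECLAT recursion over a boolean transaction matrix; the data is hypothetical, and the minimum support is given as a fraction of the number of sequences, as in this paper:

```python
def eclat(tidsets, min_tids, prefix=(), results=None):
    """Recursively extend itemsets by intersecting TID sets (ECLAT)."""
    if results is None:
        results = {}
    items = sorted(tidsets)
    for i, item in enumerate(items):
        tids = tidsets[item]
        if len(tids) < min_tids:
            continue  # prune infrequent extensions
        itemset = prefix + (item,)
        results[itemset] = len(tids)
        # Conditional TID sets for items that come after `item`.
        suffix = {o: tids & tidsets[o] for o in items[i + 1:]}
        eclat(suffix, min_tids, itemset, results)
    return results

# Hypothetical transaction matrix (rows = sequences, columns = events).
events = ["E1", "E2", "E4"]
matrix = [[True, True, True], [False, True, True], [True, False, False]]

min_support = 0.4  # as selected in this paper
min_tids = min_support * len(matrix)
tidsets = {e: {t for t, row in enumerate(matrix) if row[i]}
           for i, e in enumerate(events)}

print(eclat(tidsets, min_tids))
# {('E1',): 2, ('E2',): 2, ('E2', 'E4'): 2, ('E4',): 2}
```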

4 Results

In the previous section, we discussed the transaction matrix; an item frequency plot was generated from the transactions to decide the minimum support. This plot is shown in Fig. 5.

From Fig. 5, it can be noticed that most of the events appear in 30–50% of the mined patterns. Hence, the minimum support can be selected anywhere between 0.3 and 0.5; in our case, we selected 0.4.

After selecting the support, the transaction matrix and minimum support are fed to the algorithm, and the output is a set of itemsets of various lengths. The result of our implementation is shown in Fig. 6.

Fig. 6 Frequent itemsets with minimum support

5 Conclusion and Future Work

In this paper, we have presented how to use the ECLAT algorithm over event logs. The result in Fig. 6 shows subsets of patterns that appear in around 40–45% of the total events in the log file. However, the result shown is only the head of the result set; many other subsets have support of around 50% and above. Thus, our approach to selecting the minimum support is valid. The ECLAT algorithm reports the support of every frequent subset; its output therefore suggests strong patterns that appear frequently in the event logs with respect to the selected minimum support.

After obtaining frequent patterns from the event logs, our further work will be to classify the patterns as anomalies or normal events. One way to do this is to use a Naïve Bayes filter to classify the events as anomalous or normal; this is a supervised learning technique in which we feed the system known error patterns and normal patterns and classify the results. Another way is to use time-series analysis for anomaly detection. In our case, by observing the data, anomaly detection over a time series is difficult because the time differences between events are very small. Thus, we plan to use the Naïve Bayes approach.
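As an illustration of this proposed future step (not an implementation from this paper), a Naïve Bayes classifier such as scikit-learn's BernoulliNB could be trained on boolean event features of known error and normal patterns; all data below is hypothetical.

```python
from sklearn.naive_bayes import BernoulliNB

# Hypothetical labeled patterns: each row marks which events occur in a
# mined pattern (columns E1, E2, E3, E4), labeled normal or anomalous.
X_train = [
    [1, 1, 0, 1],  # known normal pattern
    [0, 1, 0, 1],  # known normal pattern
    [1, 0, 1, 0],  # known error pattern (contains rare event E3)
]
y_train = ["normal", "normal", "anomaly"]

clf = BernoulliNB().fit(X_train, y_train)
print(clf.predict([[1, 0, 1, 1]]))  # classify an unseen pattern
```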