
1 Introduction

Information systems record an abundance of data to support the automation of business processes in different domains, e.g., supply management, banking, software development, maintenance, and healthcare. A process is a collection of tasks or activities with one or more links between them. Actors perform the tasks to achieve the business goals. For example, the actors in a banking system include the bank manager, cashiers, other employees, and the customers, and a subset of these actors performs the activities of any given process. Opening a bank account is a process in which a customer performs the following activities: selecting the account type, choosing the right bank, gathering the required documents, completing the application, collecting the account number, and depositing funds into the account. Each organisation maintains a detailed description of its processes; in a banking system, for example, the documents required for opening an account and the transaction limits are clearly specified. Any process flow may contain areas of improvement that allow reducing the overall working time or cost of the system, improving productivity and quality, and balancing resource utilization [18]. This raises the need for techniques that can understand the flow of, and deviations in, the processes. Process mining techniques facilitate the discovery of process models from the data stored in the form of an event log and use these models for further analysis. In the real world, process mining has found applications in healthcare, information and communication technology, manufacturing, education, finance, and logistics [18].

In this paper, we present a genetic mining approach for discovering process models. The experimentation was done on 21 unbalanced event logs, and the proposed algorithm was compared with state-of-the-art algorithms, namely, Genetic process mining, Heuristic miner, \(\alpha \) \(^{++}\), and Inductive logic programming. The results show that the proposed algorithm generates process models that are competitive with those of these algorithms and that it produces “good” quality process models for all the datasets.

The rest of the paper is organized as follows: Sect. 2 gives a brief summary of process mining concepts and the related work. Section 3 describes the proposed approach. Section 4 outlines the experimentation and explains the results. Section 5 concludes the paper.

2 Process Mining and the Related Work

Process mining algorithms abstract the event log data into a well-structured form known as a process model. An event log is a collection of cases that represents a process. A case, also known as a trace, is an ordered set of events [16]. Each row of the event log represents an occurrence that pertains to a single case. Events are ordered within a case and have attributes such as case ID, activity/task name, and timestamp, among others. Table 1 depicts an example event log with three tasks and three cases. Each distinct task is assigned an integer symbol: ‘Search product’ (1), ‘Check availability’ (2), and ‘Order product’ (3). The timestamp indicates the occurrence time of the corresponding event. A trace is formed by picking all the events with the same case ID and arranging them in ascending order of their timestamps. Each trace can be of any length and is expressed as a sequence of integer symbols. In the example log, the traces corresponding to the case IDs 7011 (events 1, 2, 5), 7012 (events 3, 7), and 7013 (events 4, 6) are 123, 12, and 13, respectively. Process mining approaches can be divided into three categories:

  • Process discovery: the process of creating a process model from an event log without any prior knowledge [2, 17].

  • Conformance checking: determining whether the model provided by an algorithm matches the given event log [2, 17].

  • Enhancement: using information from the original process captured in an event log, improve or extend an existing process model [2, 17].
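The trace construction described in the running example (grouping events by case ID and ordering them by timestamp) can be sketched as follows; the tuple layout of the event records is an assumption for illustration, not a prescribed log format:

```python
from collections import defaultdict

def build_traces(events):
    """Group events by case ID, order each group by timestamp,
    and emit one trace (sequence of task symbols) per case.

    `events` is a list of (case_id, task_symbol, timestamp) tuples;
    the field layout is illustrative only.
    """
    cases = defaultdict(list)
    for case_id, task, ts in events:
        cases[case_id].append((ts, task))
    traces = {}
    for case_id, items in cases.items():
        items.sort()  # ascending by timestamp
        traces[case_id] = "".join(str(task) for _, task in items)
    return traces

# The example log from Table 1: tasks 1-3 across cases 7011-7013.
log = [
    (7011, 1, "09:00"), (7011, 2, "09:05"), (7012, 1, "09:10"),
    (7013, 1, "09:12"), (7011, 3, "09:20"), (7013, 3, "09:25"),
    (7012, 2, "09:30"),
]
print(build_traces(log))  # {7011: '123', 7012: '12', 7013: '13'}
```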

In this paper, we focus on process discovery algorithms. State-of-the-art process discovery techniques include \(\alpha \) [1], \(\alpha \) \(^{+}\) [15], Multi-phase miner [10, 19], Heuristic miner [22], Genetic process mining (GPM) [16], \(\alpha \) \(^{++}\) [23], \(\alpha \) \(^{\#}\) [24], \(\alpha \) \(^{*}\) [14], Fuzzy miner [3], and Inductive logic programming (ILP) [12]. A few of these algorithms have explored the use of evolutionary methodology for discovering process models [11, 16, 21]. Recent algorithms in the domain of process discovery include the Evolutionary tree miner (ETM) [6], Inductive miner [13], Multi-paradigm miner [8], a hybrid process mining approach that integrates GPM, particle swarm optimization (PSO), and discrete differential evolution (DE) to extract process models from event logs [7], the Fodina algorithm, which is an extension of the Heuristic miner [4], and Binary differential evolution [9]. Most of these algorithms either propose a hybrid technique combining multiple evolutionary approaches or optimize a weighted function of popular quality metrics of the process model.

Table 1. An example event log

3 Proposed Genetic Algorithm for Process Discovery (GA-ProM)

The process mining data is encapsulated in the form of an event log, which is the starting point for process discovery algorithms. An individual in the initial population is generated from the event log in the form of a causality relation matrix, in which each bit is either 0 or 1. The steps of the proposed genetic mining algorithm for process discovery are outlined in Algorithm 2 and explained as follows:

The first task is to compute the dependency links between the tasks/activities in an event log V with n tasks/activities. The dependency relations, which represent domain knowledge, are stored in the form of the dependency measure matrix D [16]. The matrix D, which provides information about tasks in self-loops, short loops (length-one and length-two loops), and tasks in parallel, is then used to generate the causality relation matrix M (Algorithm 1) [16]. Each individual in the initial population is represented by a causality relation matrix, which is computed as:

$$\begin{aligned} m_{t1,t2} = {\left\{ \begin{array}{ll} 1, &{} \text {if } r_{t1,t2} < D(t1, t2, V)^P\\ 0, &{} \text {otherwise}\\ \end{array}\right. } \end{aligned}$$
(1)

where \(t1,\ t2\ \in [1, n]\) and r is an n \(\times \) n matrix of random numbers drawn from [0, 1), generated afresh for each individual in the population of N individuals. P regulates the influence of the dependency measure while creating the causality relation; [16] suggested the value P = 1. A process mining algorithm should, in an ideal situation, generate a model that is simple (simplicity), precise (precision), general (generalization), and fits the available logs (completeness). All the quality dimensions are measured on the scale [0, 1]. Completeness [16] measures the ability of the mined process model to reproduce the traces in the event log. Preciseness [20, 21] measures how much extra behaviour a model allows that is not visible in the event log. Simplicity indicates the number of nodes in the mined model [20, 21]. Generalization [11] measures the degree to which the mined model can reproduce future unforeseen behaviour.
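The sampling of the initial population according to Eq. (1) can be sketched as follows; the representation of D as a nested list and the helper names are assumptions for illustration, not taken from [16]:

```python
import random

def sample_individual(D, P=1.0, rng=random):
    """Sample one causality relation matrix M from the dependency
    measure matrix D (n x n, values in [0, 1]) following Eq. (1):
    m[t1][t2] = 1 if a fresh random number r < D[t1][t2] ** P.

    P = 1 follows the setting suggested in [16].
    """
    n = len(D)
    return [
        [1 if rng.random() < D[t1][t2] ** P else 0 for t2 in range(n)]
        for t1 in range(n)
    ]

def initial_population(D, N, P=1.0):
    """Generate N individuals, each with its own random matrix r."""
    return [sample_individual(D, P) for _ in range(N)]

# Strong dependency from task 1 to task 2, weak in reverse: most
# sampled individuals set the causal bit m[0][1] and omit m[1][0].
D = [[0.0, 0.95], [0.05, 0.0]]
pop = initial_population(D, N=100)
```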

Algorithm 1 (figure a): generation of the causality relation matrix M.
Algorithm 2 (figure b): the proposed genetic mining algorithm (GA-ProM).

Tournament selection over a population of N individuals uses binary tournaments among its members. The individuals are arranged into N random pairs such that every individual participates in exactly two tournaments; from each pair, the better individual, judged on the quality dimensions under consideration, wins and goes into the selection pool for the next iteration. Then, on the basis of a random crossover point, the causality relations of an individual are exchanged with those of another randomly chosen individual at that crossover point, and the input and output tasks are updated in the causality matrix. The individuals are then mutated by adding a causality relation, deleting a causality relation, or redistributing the relations in the input or output sets of the causality matrices. Mutations that are inconsistent with the event log are rejected. The proposed algorithm stops when there is no improvement in the solutions for ten generations in a row, or when it has executed the maximum number of specified generations.
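The binary tournament step can be sketched as follows, using a single scalar fitness function as a stand-in for the quality dimensions under consideration; the pairing scheme (two shuffled copies of the indices) is one way to ensure every individual plays exactly twice:

```python
import random

def binary_tournament(population, fitness, rng=random):
    """Binary tournament selection: every individual enters exactly
    two pairings; the fitter member of each pair goes into the
    selection pool, which again has N members.
    """
    N = len(population)
    # Two shuffled copies of the indices give 2N slots = N pairs.
    slots = list(range(N)) * 2
    rng.shuffle(slots)
    pool = []
    for i in range(0, 2 * N, 2):
        a, b = slots[i], slots[i + 1]
        winner = a if fitness(population[a]) >= fitness(population[b]) else b
        pool.append(population[winner])
    return pool

# Toy usage: integers as "individuals", identity as fitness.
pool = binary_tournament(list(range(10)), lambda x: x)
```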

4 Experimentation and Results

The population size is set to 100. The proposed algorithm (GA-ProM) is run for a maximum of 100 iterations. The total number of runs is fixed at 30, and the results are based on the average performance over these runs. We have picked 21 complex case scenarios without noise [21]. The logs used in this experimentation are imbalanced, i.e., they contain traces with very different frequencies. The event logs were available in the .xes, .xml, and .hn formats. Using ProM 6, we first converted the .xes files into .csv format; traces were then formed by picking all the events with the same case ID and arranging them in ascending order of their timestamps, giving the desired input files for the experimentation. A summary of the experimental datasets is given in Table 2.

Table 2. Details of the datasets characteristics

The proposed algorithm (GA-ProM) is run on the 21 unbalanced logs (Table 2), namely, ETM, g2–g10, g12–g15, and g19–g25 [21]. These logs contain traces of varying frequencies; as a result, they are well suited for evaluating an algorithm’s tendency to overfit or underfit the data. The proposed algorithm is compared with the \(\alpha \) \(^{++}\) [23], Heuristic miner [22], Genetic miner [16], and ILP [12] algorithms. The values of completeness, preciseness, and simplicity for the ETM, g2–g10, g12–g15, and g19–g25 datasets are taken as reported by [21]. However, [21] does not report the value of generalization for these datasets. The generalization values for the models generated by the \(\alpha \) \(^{++}\), Heuristic miner, Genetic miner, and ILP algorithms on these datasets are therefore computed using the Cobefra tool [5, 21]. The parameter settings for each of these algorithms are provided in Table 3.

Table 3. Parameter settings for the algorithms

4.1 Results and Analysis

In this paper, we compare the proposed algorithm (GA-ProM) with the \(\alpha \) \(^{++}\), Heuristic miner, Genetic process mining (GPM), and ILP algorithms. \(\alpha \) \(^{++}\) is an abstraction-based algorithm that deals with non-free-choice constructs; it is an extension of one of the earliest process discovery algorithms, known as the \(\alpha \)-algorithm [23]. Heuristic miner is a heuristic-based algorithm and also an extension of the \(\alpha \)-algorithm; it takes the frequency of activities in the traces into account to deal with noise and imperfection in the logs [22]. GPM, a main contender for the proposed algorithm, is a pure genetic approach in the process mining domain that optimizes a function of completeness and preciseness. ILP was created to add artificially generated negative events, i.e., traces describing paths that are not permitted in a process [12]. A process model with a “good” completeness value indicates that the model is able to reproduce the behavior expressed in the event log [6]; completeness may therefore be considered an important metric for measuring the quality of process models. Its importance was highlighted by the authors of [6], who assigned completeness a weight ten times higher than that of the other quality dimensions.

The proposed algorithm optimizes completeness to discover the process models. In order to assess the quality of the discovered models, we also compute the remaining three quality dimensions, namely, preciseness, simplicity, and generalization. We have also computed a weighted average of the quality dimensions, so as to rank the algorithms and compare the quality of the models generated by the different algorithms [6].
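A minimal sketch of such a weighted average is given below; the 10:1:1:1 weighting reflects the emphasis on completeness attributed to [6], but the exact weights used in producing Table 5 are an assumption here:

```python
def weighted_average(completeness, preciseness, simplicity, generalization,
                     w=(10.0, 1.0, 1.0, 1.0)):
    """Weighted average of the four quality dimensions, each on [0, 1].

    The default weights follow the 10:1:1:1 scheme in the spirit of [6],
    giving completeness ten times the weight of the other dimensions.
    """
    wc, wp, ws, wg = w
    total = wc + wp + ws + wg
    return (wc * completeness + wp * preciseness +
            ws * simplicity + wg * generalization) / total

# A model that replays the log perfectly but is imprecise still
# scores high under this scheme:
score = weighted_average(1.0, 0.5, 0.9, 0.8)  # (10 + 0.5 + 0.9 + 0.8) / 13
```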

Table 4 shows the values of the quality dimensions for the process models discovered by the different algorithms. The proposed algorithm achieves completeness values for the discovered process models that are better than, or at least as good as, those achieved by the other algorithms. The results show that the discovered process models also exhibit “good” values for the other quality dimensions. Table 5 and Fig. 1 show that, in terms of the weighted average, the process models discovered by the proposed algorithm are better on 14 of the 21 datasets than those of the other algorithms.

Table 4. Quality dimensions for the process models obtained using \(\alpha \) \(^{++}\), HM, GPM, ILP, and the proposed algorithm (GA-ProM) for synthetic datasets (C: Completeness, P: Preciseness, S: Simplicity, G: Generalization)
Table 5. Weighted average of the quality dimensions for synthetic datasets
Fig. 1. Plots of the weighted average of the quality dimensions for the process models generated by Heuristic miner, \(\alpha \) \(^{++}\), ILP, Genetic process miner, and the proposed algorithm (GA-ProM). In the figures, GA-ProM is abbreviated as GA.

5 Conclusion

In the past decade, the domain of process mining has developed into an active research area. Process mining techniques empower organizations to gain valuable insights into their business processes from the digital data recorded in information systems. The discovery of process models from the data stored in the form of an event log is usually the first step towards process improvement. The quality of the discovered process models is measured using four metrics, namely, simplicity, preciseness, generalization, and completeness. Completeness is considered more important than the others, as a “good” completeness value indicates that the process model is able to reproduce the behavior expressed in the event log.

In this paper, we present a genetic mining algorithm (GA-ProM) for process model discovery from the given event data. We experimented with completeness as the fitness function. The algorithm was tested on 21 unbalanced logs and compared with the state-of-the-art algorithms, namely, Heuristic miner, \(\alpha \) \(^{++}\), Genetic process mining (GPM), and Inductive logic programming (ILP). The results show the effectiveness of the proposed approach, as it produces “good” quality process models for all the datasets. The proposed approach compares well with the other algorithms in terms of the completeness values of the discovered process models and also in terms of the weighted average of the quality dimensions for the process models.