1 Introduction

Process mining (Van der Aalst 2011) is an emerging area whose main focus is on extracting information about the run-time behavior of business processes from the event logs recorded by the information systems available in an organization. Process mining can be used for three different purposes (Van der Aalst 2012):

  • Discovery, which consists in deriving a process model from the behavior recorded in an event log, without using any previous knowledge about the process that generated the event log. There are many techniques for this purpose, for example the α-algorithm (Van der Aalst et al. 2004), the heuristics miner (Weijters et al. 2006), or the genetic miner (Alves et al. 2007).

  • Conformance, which consists in comparing the behavior recorded in an event log with an existing process model, in order to determine to what extent the process is being performed as described in the existing model. There are special metrics for conformance checking (Rozinat and Van der Aalst 2008), and there are also ways to replay the event log in the process model in order to check conformance (Van der Aalst et al. 2012).

  • Enhancement, which consists in improving or extending an existing process model based on the run-time behavior recorded in an event log. For example, if a process has a decision between two branches at some point, it will be possible to determine the exact percentage of cases that follow each branch, or even to discover other branches which were not in the original model. There has been comparatively less work on enhancement than on discovery or conformance.

This work focuses on model enhancement, and particularly on the problem of how to enhance a given process model when there is a mismatch between the high-level activities in the model and the low-level events that are recorded in the event log.

1.1 Problem description

To illustrate the problem, consider the simple example in Fig. 1. Here the process model is expressed in terms of the high-level activities A, B and C. However, when these activities are performed, they generate low-level events, namely U, V, W, X, Y and Z. In particular, activity A generates the sequence UV; B generates WXX...; and C generates YZYZ... A sample trace from such a process could be: UVWXXYZYZ.

Fig. 1: A simple example of how high-level activities may generate low-level events

In this example, we have purposefully ensured that each kind of event (such as U, V, W, etc.) can belong to one and only one activity. The more general case of allowing an event to be produced by multiple activities (such as U being produced both by A and B) has been addressed in previous work, which we discuss shortly. Here, the fact that each event belongs to a single activity allows us to establish the concept of mapping used throughout this work.

To illustrate the kind of model enhancement we seek here, suppose that, initially, the process is thought to allow the sequence ABC only. However, suppose that in the event log we find the sequences UVWXXYZYZ and UVYZYZ. If the following mapping is true {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C} then one must conclude that the sequence of events UVYZYZ must have been produced by the sequence of activities AC. Since AC is not in the original model, we can add it by allowing B to be skipped, as in the model of Fig. 1. The model has been enhanced with a new possible route. Furthermore, if we observe an equal number of UVWXXYZYZ and UVYZYZ traces in the event log, we can conclude that 50 % of cases follow the route through B, and 50 % of cases follow the route around B. Such information is not usually available in the process model, so we consider it to be another kind of enhancement.

The problem with this approach is that, to begin with, the mapping of low-level events to high-level activities is unknown. This mapping must be mined from the event log and from the given process model. Also, the enhancements that will be suggested for the process model depend on the particular mapping being used. Therefore, when there are multiple possible mappings (as is usually the case), one must choose the mapping carefully.

This work provides a solution to this problem, i.e. it presents an approach that is able to mine possible mappings, select a candidate mapping, and provide suggestions as to how the process model can be enhanced in order to capture the behavior observed in the event log.

1.2 Previous work

In previous work we have introduced a hierarchical Markov model (Ferreira et al. 2013a, b) to capture the relationship between high-level activities and low-level events, and in Szimanski et al. (2013) we introduced an approach for improving process models which has the hierarchical Markov model at its core. Through a combination of process mining and agent-based simulation, we described how a process model can be mined and improved iteratively, where at each step a new event log is generated by agent-based simulation.

However, the application of such an approach in real-world experiments involving noisy event logs, such as the Volvo IT case study that we present here, produced models that were difficult for the business analyst to interpret, because each high-level activity ended up containing a complex aggregate of low-level behavior, when the analyst was expecting, in some cases, a one-to-one mapping between a low-level event and a high-level activity.

In other words, the approach was capturing patterns of behavior which were too complex to interpret, because each event could be produced by multiple activities and each activity contained a different pattern involving many kinds of low-level events. It became clear to us that, in some situations, the business analyst tends to see certain low-level events as being associated with a single, particular activity. When attempting to embed this restriction in our previous approach, we found that not only did it change the nature of the underlying model, but it also opened the way for an entirely different approach to mine that model.

Therefore, while in Szimanski et al. (2013) we used a hierarchical Markov model, here we replaced that model with the new concept of mapping, which serves a similar purpose, but has some distinct advantages (one of which is that there is no need for preprocessing of the event log, since the approach can handle any amount of noise). Also, while in Szimanski et al. (2013) we used an iterative approach with agent-based simulation, here we opted for a greedy optimization approach which offers more guarantees of finding meaningful improvements to the process model. As a by-product of this new approach, we are able to automatically generate suggestions for the user regarding possible improvements to the process model.

1.3 Structure of presentation

The structure of this paper is as follows. Section 2 provides a formal definition of what a mapping is, and describes how to find all mappings from a given set of traces and from a set of possible paths through the process model. Then Section 3 explains how to select the best mapping based on concepts such as range and coverage; this task is complicated by the fact that it may be possible to combine existing mappings in order to create an even better mapping. Section 4 presents a case study where we apply the approach to a real-world event log, showing how the simplest possible model can be improved in a stepwise manner until it covers all the behavior in the event log (including all noise). Finally, Section 5 discusses several branches of related work, and Section 6 summarizes and concludes the paper.

2 The concept of mapping

In general, business processes are defined at a high level of abstraction, using modeling languages such as BPMN (BPMN 2011), EPCs (Scheer 2000), and Petri nets (Van der Aalst 1998). However, when looking at an event log, the recorded behavior often refers to low-level events which have no obvious relationship to the high-level activities defined in a process model.

A mapping is a way to establish a relationship between each kind of event observed in the event log and a specific high-level activity defined in the business process. In particular, an event can be mapped to one and only one activity. Even though this assumption may seem limiting at first sight, in practice there is more than one way to make the assumption hold, as the following example illustrates.

Example 1

Suppose that one observes the trace UVWXUVYZ in the event log. This trace does not seem to fit the sequence ABC if the mapping is {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C}. If one is to believe this mapping, then it suggests that UVWXUVYZ was generated by ABAC, and therefore the process model should include this possibility. Alternatively, if one knows that the sequence ABAC cannot take place, then it is possible to come up with a different mapping to explain how the trace UVWXUVYZ was generated from the current model. Such a mapping is {U ↦ A, V ↦ A, W ↦ A, X ↦ A, Y ↦ C, Z ↦ C}, implying that UVWXUVYZ was generated by the sequence AC.

As illustrated in the example above, the problem of mapping an event to a single activity can be turned into the question of whether the current model or the current mapping is actually a good description of the process. Since a process model can be freely designed or redesigned to accommodate the behavior observed in an event log (if it cannot be freely redesigned, then it probably cannot be enhanced either), we will keep the assumption that an event can be mapped to a single activity, leaving it up to the process analyst to come up with a model that is coherent with this assumption.

The following definitions explain in more formal terms what a mapping actually is:

Definition 1 (Process model)

A process model is any kind of state machine defined in a high-level modeling language. Each state in the process model is referred to as an activity.

Definition 2 (Activity)

An activity represents an abstract unit of work, such as a human task in the process. Even though an activity may be regarded as a single step in the process, its execution may generate multiple events.

Definition 3 (Event)

An event is defined as a tuple \(e = (u, t)\), where u is a symbol denoting the type of event and t is the timestamp of the event. Let \(\pi_{1}(\cdot)\) be a function (i.e. a projection) that returns the symbol of an event, and let \(\pi_{2}(\cdot)\) be a function that returns the timestamp of an event. For \(e = (u, t)\) we have \(\pi_{1}(e) = u\) and \(\pi_{2}(e) = t\).

Definition 4 (Trace)

A trace is a sequence of events \(\tau = \langle e_{1}, e_{2}, \ldots, e_{n}\rangle\) such that \(\forall i<j:\ \pi_{2}(e_{i}) \leq \pi_{2}(e_{j})\), i.e. the events are ordered by timestamp.

Definition 5 (Micro-sequence)

A micro-sequence is a sequence of symbols obtained by projection of a trace: \(\mu = \langle \pi_{1}(e_{1}), \pi_{1}(e_{2}), \ldots, \pi_{1}(e_{n})\rangle\) such that \(\forall i<j:\ \pi_{2}(e_{i}) \leq \pi_{2}(e_{j})\). In other words, a micro-sequence corresponds to the sequence of symbols in a trace.

Definition 6 (Macro-sequence)

A process model may allow one or more paths of execution. Each path corresponds to a sequence of activities \(\sigma = \langle a_{1}, a_{2}, \ldots, a_{m}\rangle\), called a macro-sequence. As these activities are performed, multiple events are generated. The sequence of events generated by executing a macro-sequence σ is a trace τ (see Definition 4 above).

Definition 7 (Mapping)

Let \(\mu = \langle u_{1}, u_{2}, \ldots, u_{n}\rangle\) be a micro-sequence, which is the projection of some trace τ. Let \(\sigma = \langle a_{1}, a_{2}, \ldots, a_{m}\rangle\) be the macro-sequence which generated the trace τ. Now, let U be the set of symbols in the micro-sequence μ, i.e. \(U = \{u \mid u \in \mu\}\), and let A be the set of activities in the macro-sequence σ, i.e. \(A = \{a \mid a \in \sigma\}\).

Then a mapping is defined as a function \(f : U \rightarrow A\) which maps each symbol \(u \in U\) to an activity \(a \in A\).

A mapping f can also be represented as a set of pairs: \(\lambda = \{u \mapsto f(u) \mid u \in U \wedge f(u) \in A\}\).
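In code, a mapping is naturally represented as a dictionary from symbols to activities. The following is a minimal Python sketch of Definition 7 (the dict representation is ours; the definitions do not prescribe one):

```python
# A mapping (Definition 7) as a Python dict: each low-level symbol
# is associated with exactly one high-level activity.
lam = {"U": "A", "V": "A", "W": "B", "X": "B", "Y": "C", "Z": "C"}

micro = "UVWXXYZYZ"                    # a micro-sequence (Definition 5)
translated = [lam[u] for u in micro]   # apply the mapping symbol by symbol
print("".join(translated))             # prints AABBBCCCC
```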

For a given micro-sequence and macro-sequence, there can be multiple possible mappings of events (in the micro-sequence) to activities (in the macro-sequence). However, the set of possible mappings is finite and constrained by the given macro-sequence. The following examples illustrate this point.

Example 2

Given the micro-sequence UVWXXYZYZ and the macro-sequence ABC, the possible mappings are:

λ 1 = {U ↦ A, V ↦ A, W ↦ A, X ↦ A, Y ↦ A, Z ↦ A}

λ 2 = {U ↦ A, V ↦ A, W ↦ A, X ↦ A, Y ↦ B, Z ↦ B}

λ 3 = {U ↦ A, V ↦ A, W ↦ A, X ↦ B, Y ↦ B, Z ↦ B}

λ 4 = {U ↦ A, V ↦ A, W ↦ A, X ↦ B, Y ↦ C, Z ↦ C}

λ 5 = {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ B, Z ↦ B}

λ 6 = {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C}

λ 7 = {U ↦ A, V ↦ A, W ↦ B, X ↦ C, Y ↦ C, Z ↦ C}

λ 8 = {U ↦ A, V ↦ B, W ↦ B, X ↦ B, Y ↦ B, Z ↦ B}

λ 9 = {U ↦ A, V ↦ B, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C}

λ 10 = {U ↦ A, V ↦ B, W ↦ B, X ↦ C, Y ↦ C, Z ↦ C}

λ 11 = {U ↦ A, V ↦ B, W ↦ C, X ↦ C, Y ↦ C, Z ↦ C}

Example 3

For the micro-sequence UVWXXYZYZ and the macro-sequence ABC, here are some examples of impossible mappings:

λ 12 = {U ↦ A, V ↦ A, W ↦ A, X ↦ A, Y ↦ A, Z ↦ B}

(impossible because the end of the micro-sequence YZYZ would be mapped to ABAB, which is not allowed by the macro-sequence, since A cannot occur after B)

λ 13 = {U ↦ A, V ↦ A, W ↦ A, X ↦ B, Y ↦ B, Z ↦ C}

(impossible because the end of the micro-sequence YZYZ would be mapped to BCBC, which is not allowed by the macro-sequence, since B cannot occur after C)

λ 14 = {U ↦ B, V ↦ B, W ↦ B, X ↦ C, Y ↦ C, Z ↦ C}

(impossible because U at the beginning of the micro-sequence would be mapped to B, which is not allowed since the macro-sequence cannot begin with B)

λ 15 = {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ A, Z ↦ A}

(impossible because XY in the micro-sequence would be mapped to BA, which is not allowed by the macro-sequence, since A cannot occur after B)

Not all mappings in Example 2 are interesting. Mapping λ 1, while certainly possible, is not interesting because every event has been mapped to activity A. Similarly, mappings λ 2, λ 3, λ 5, and λ 8 are not interesting because not every activity in the macro-sequence appears in the mapping. In this work we are interested in complete mappings, meaning that the mapping contains every activity in the macro-sequence.

Definition 8 (Complete mapping)

Recall that U is the set of symbols in the micro-sequence and A is the set of activities in the macro-sequence. A mapping is said to be complete if and only if the function \(f : U \rightarrow A\) is surjective, i.e. for every activity \(a \in A\) there exists a symbol \(u \in U\) such that \(f(u) = a\).

The following example illustrates that not all of the mappings in Example 2 are admissible according to Definition 8.

Example 4

Given the micro-sequence UVWXXYZYZ and the macro-sequence ABC, the complete mappings are:

λ 4 = {U ↦ A, V ↦ A, W ↦ A, X ↦ B, Y ↦ C, Z ↦ C}

λ 6 = {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C}

λ 7 = {U ↦ A, V ↦ A, W ↦ B, X ↦ C, Y ↦ C, Z ↦ C}

λ 9 = {U ↦ A, V ↦ B, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C}

λ 10 = {U ↦ A, V ↦ B, W ↦ B, X ↦ C, Y ↦ C, Z ↦ C}

λ 11 = {U ↦ A, V ↦ B, W ↦ C, X ↦ C, Y ↦ C, Z ↦ C}

2.1 Finding all complete mappings

The problem of finding all complete mappings for a given micro-sequence and macro-sequence is relatively easy to solve. Basically, one starts by assigning the first event in the micro-sequence to the first activity in the macro-sequence. For each subsequent event, if it has not been mapped already, then it can be mapped either to the current activity in the macro-sequence or to the next activity in the macro-sequence (by advancing the current position in the macro-sequence). By the time the end of the micro-sequence is reached, the end of the macro-sequence should have been reached as well. If this is true, then the mapping is complete.

Algorithm 1 specifies this procedure. Besides the micro-sequence μ and the macro-sequence σ, the procedure has the following parameters: Λ is the set of complete mappings, which is initially empty (Λ = {}); i is the current position in the micro-sequence, which begins at the first position (i = 1); j is the current position in the macro-sequence, which begins at the first position (j = 1); and λ is the current mapping, which is initially empty (λ = {}) and is built during recursion. At the end of the micro-sequence, if λ is a complete mapping, then it will be added to Λ (line 18).

[Algorithm 1]

The algorithm begins by checking if i is at the beginning of the micro-sequence (line 2). In this case, the first event is mapped to the first activity (line 3), and the same procedure is called for the next position in the micro-sequence (line 4).

For i > 1 (line 5), it may be the case that the event has been mapped already or not:

  • If it has already been mapped, then the only viable options are that it has been mapped to the current activity (line 6) or to the next activity (line 8).

  • If the event has not been mapped already, then the same possibilities should be considered, i.e. either mapping it to the current activity (line 11), or mapping it to the next activity (line 14).

In any of these cases, the procedure advances to the next position (i + 1) in the micro-sequence (lines 7, 9, 12, 15).

If i gets past the end of the micro-sequence (line 16), then all events have been mapped, and the last thing to check is if j reached the end of the macro-sequence (line 17). This ensures that the mapping is complete.
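For concreteness, here is a minimal Python sketch of this recursion, reconstructed from the description above (the line numbers discussed in the preceding paragraphs refer to the original listing of Algorithm 1, not to this sketch):

```python
def find_mappings(micro, macro, i=0, j=0, mapping=None, result=None):
    """Enumerate all complete mappings of micro-sequence `micro` onto
    macro-sequence `macro` (both given as strings); returns a list of dicts."""
    if mapping is None:
        mapping, result = {}, []
    if i == len(micro):
        # end of the micro-sequence: keep the mapping only if the end of
        # the macro-sequence has been reached as well (complete mapping)
        if j == len(macro) - 1:
            result.append(dict(mapping))
        return result
    u = micro[i]
    if i == 0:
        # the first event is always mapped to the first activity
        mapping[u] = macro[0]
        find_mappings(micro, macro, 1, 0, mapping, result)
        del mapping[u]
    elif u in mapping:
        # symbol already mapped: only viable if it was mapped to the
        # current activity, or to the next one (advancing j)
        if mapping[u] == macro[j]:
            find_mappings(micro, macro, i + 1, j, mapping, result)
        elif j + 1 < len(macro) and mapping[u] == macro[j + 1]:
            find_mappings(micro, macro, i + 1, j + 1, mapping, result)
    else:
        # unmapped symbol: try the current activity and the next one
        mapping[u] = macro[j]
        find_mappings(micro, macro, i + 1, j, mapping, result)
        if j + 1 < len(macro):
            mapping[u] = macro[j + 1]
            find_mappings(micro, macro, i + 1, j + 1, mapping, result)
        del mapping[u]
    return result

print(len(find_mappings("UVWXXYZYZ", "ABC")))   # 6 complete mappings (Example 4)
```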

Taking into account that each symbol \(u_{i}\) in the micro-sequence can be mapped either to the current activity \(a_{j}\) or to the next one \(a_{j+1}\), at first sight it would seem that the complexity of Algorithm 1 is of the order of \(O(2^{n})\), where n is the length of the micro-sequence. However, once a symbol has been mapped to an activity, every other occurrence of that symbol in the micro-sequence is already determined, so the complexity is actually \(O(2^{|U|})\), where |U| (the size of U) is the number of distinct symbols in the micro-sequence. Additionally, the first symbol in the micro-sequence is always mapped to the first activity in the macro-sequence, so there is only one choice for that symbol. Therefore, the complexity of Algorithm 1 is \(O(2^{|U|-1})\).

Still, many branches of the recursion get pruned because, in general, it is often the case that a symbol \(u_{i}\) can be mapped to only one of the two possibilities \(a_{j}\) and \(a_{j+1}\), or even to none of them. As an example, for the micro-sequence UVWXXYZYZ and the macro-sequence ABC, there are only 11 mappings remaining at the end of the micro-sequence (see Example 2); this is less than what the estimate \(2^{|U|-1} = 2^{5} = 32\) would suggest.

2.2 Working with multiple macro- and micro-sequences

In practice, an event log may contain several different micro-sequences, and each of these micro-sequences may occur more than once. We define the support of a micro-sequence as the number of times that the micro-sequence appears in an event log.

Definition 9 (Event log)

An event log L is defined as a bag of micro-sequences, e.g. \(L = [\mu_{1}, \mu_{2}, \mu_{1}, \mu_{3}, \mu_{2}, \ldots]\), where these micro-sequences are not necessarily different (i.e. there can be duplicates among them).

Definition 10 (Support)

The support of a micro-sequence μ is the number of times that μ appears in an event log L, and it is denoted by μ @ L. Let \(q_{i} = \mu_{i}\,@\,L\) be the support of \(\mu_{i}\) in L. Then an alternative representation for L is \(L=\left [\mu _{1}^{q_{1}},\mu _{2}^{q_{2}},\mu _{3}^{q_{3}},\ldots \right ]\), where all micro-sequences are now different.
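In code, this bag representation maps directly onto Python's collections.Counter, where the count stored for each micro-sequence is its support (a sketch under our own representation):

```python
from collections import Counter

# The event log as a bag of micro-sequences (Definition 9); the count
# of each micro-sequence is its support (Definition 10).
L = Counter({"UVWXXYZYZ": 5, "UVYZYZ": 3})

print(L["UVWXXYZYZ"])    # support of this micro-sequence in L: 5
print(sum(L.values()))   # total number of traces in the log: 8
```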

Also, as explained in Definition 6, a process model may have several possible paths of execution and, typically, each of these paths generates a different macro-sequence. We define the macro-model as the set of possible macro-sequences that can be generated by a process model.

Definition 11 (Macro-model)

A macro-model is the set of all macro-sequences \(M = \{\sigma_{1}, \sigma_{2}, \ldots\}\) that can be generated by a given process model.

In general, the macro-model can be obtained from any kind of process model (BPMN, EPC, Petri net, etc.) by following all possible paths of execution. Even if the process model contains loops, which could give rise to an infinite set of macro-sequences, it is usually possible to establish a maximum number of iterations for any given loop. In fact, it would be unrealistic to expect a loop in a business process to run an infinite number of times.

In Definitions 9 and 10, we have seen that an event log may contain several different micro-sequences. Now in Definition 11 we see that a process model may originate several different macro-sequences. This means that, if we are given an event log and a process model, we can obtain the bag of micro-sequences L and the set of macro-sequences M. However, in general we do not know which macro-sequence in M produced each micro-sequence in L.

Let \(L=[\mu _{1}^{q_{1}},\mu _{2}^{q_{2}},\mu _{3}^{q_{3}},\ldots ]\) and M = {σ 1, σ 2, …}. If we are interested in finding all possible mappings, then we need to consider the possibility that each micro-sequence μ i L might have been generated by any of the macro-sequences σ j M. Therefore, we need to extend Algorithm 1 in order to consider all of these possibilities.

Algorithm 2 provides a function that returns all complete mappings as a table T of records in the form (μ i , σ j , λ k ) where μ i is a micro-sequence, σ j is a macro-sequence, and λ k is a mapping between μ i and σ j . There may be several possible mappings between μ i and σ j , so table T may contain several records for the same μ i and σ j , as illustrated in Example 5.

[Algorithm 2]
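As a sketch, Algorithm 2 amounts to running the enumeration of Section 2.1 over every pair of micro-sequence and macro-sequence (find_mappings is the sketch given earlier; the log is a Counter as in Definition 10):

```python
def build_table(log, macro_model):
    """Sketch of Algorithm 2: collect all complete mappings for every
    (micro-sequence, macro-sequence) pair as records (mu, sigma, lam)."""
    table = []
    for mu in log:                 # each distinct micro-sequence in the log
        for sigma in macro_model:  # each macro-sequence of the macro-model
            for lam in find_mappings(mu, sigma):
                table.append((mu, sigma, lam))
    return table
```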

Example 5

Let \(L=[\mu _{1}^{q_{1}},\mu _{2}^{q_{2}}]\) be an event log with the micro-sequences μ 1 = UVWXXYZYZ and μ 2 = UVYZYZ, and some arbitrary q 1 and q 2. Also, let M = {σ 1, σ 2} be a macro-model with the macro-sequences σ 1 = ABC and σ 2 = AC.

Then Algorithm 2 produces the following table of mappings between each micro-sequence μ i and each macro-sequence σ j :

[Table of all complete mappings λ k between each micro-sequence μ i and each macro-sequence σ j ]

2.3 Subsumption and compatibility

In table T it is possible that the same mapping λ k applies to different pairs of micro-sequences and macro-sequences. For example, suppose that T contains the tuples (μ 1, σ 1, λ 3) and (μ 2, σ 2, λ 3); then λ 3 covers both μ 1 and μ 2. This is not only possible but also desirable since, ideally, we would like to cover all micro-sequences with the same mapping.

Now, suppose that T contains the tuples (μ 1, σ 1, λ 3) and (μ 2, σ 2, λ 4) and that λ 4 is “contained” in λ 3 in the sense that every pair ua that appears in λ 4 also appears in λ 3 (but λ 3 may contain more). Then we say that λ 3 subsumes λ 4. It is clear that if λ 3 subsumes λ 4 then λ 3 covers all the micro-sequences which are covered by λ 4. The following example illustrates this point.

Example 6

Let UVWXXYZYZ be a micro-sequence which can be mapped to the macro-sequence ABC through the following mapping:

$${\lambda}_{1}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A}, \textsf{W}\mapsto\textsf{B}, \textsf{X}\mapsto\textsf{B}, \textsf{Y}\mapsto\textsf{C}, \textsf{Z}\mapsto\textsf{C}\} $$

Let UVYZYZ be a micro-sequence which can be mapped to the macro-sequence AC through the following mapping:

$${\lambda}_{2}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A},\textsf{Y}\mapsto\textsf{C}, \textsf{Z}\mapsto\textsf{C}\} $$

Since λ 1 subsumes λ 2, λ 1 covers both micro-sequences.

Definition 12 (Subsumption)

Let λ 1 and λ 2 be two given mappings. We say that λ 1 subsumes λ 2, denoted as \({\lambda}_{1} \sqsupseteq {\lambda}_{2}\), if and only if for every \((u \mapsto a) \in {\lambda}_{2}\) we have \((u \mapsto a) \in {\lambda}_{1}\).

In addition to the concept of subsumption, there is the related concept of compatibility. This concept is illustrated by means of the following example.

Example 7

Let UVWXXYZYZ be a micro-sequence which can be mapped to the macro-sequence ABC through the following mapping:

$${\lambda}_{1}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A}, \textsf{W}\mapsto\textsf{B}, \textsf{X}\mapsto\textsf{B}, \textsf{Y}\mapsto\textsf{C}, \textsf{Z}\mapsto\textsf{C}\} $$

Let UVST be a micro-sequence which can be mapped to the macro-sequence AD through the following mapping:

$${\lambda}_{2}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A},\textsf{S}\mapsto\textsf{D}, \textsf{T}\mapsto\textsf{D}\} $$

Then λ 1 does not subsume λ 2, but λ 1 is compatible with λ 2 in the sense that it is possible to create a third mapping λ 3 = {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C, S ↦ D, T ↦ D} which combines λ 1 and λ 2.

Definition 13 (Compatibility)

Let λ 1 and λ 2 be two given mappings, and let U 1 and U 2 be their domains, respectively. We say that λ 1 is compatible with λ 2 if and only if every event \(u \in U_{1} \cap U_{2}\) is mapped to the same activity a in both λ 1 and λ 2, i.e. \((u \mapsto a) \in {\lambda}_{1}\) and \((u \mapsto a) \in {\lambda}_{2}\). In particular, if \(U_{1}\cap U_{2}=\varnothing \) then the two mappings are compatible.

If two mappings are compatible, then they can be merged into a third mapping, as λ 3 in Example 7.

Definition 14 (Merge)

Let λ 1 and λ 2 be two compatible mappings. Then λ 1 and λ 2 can be merged into a third mapping \({\lambda}_{3} = \{u \mapsto a \mid (u \mapsto a) \in {\lambda}_{1} \vee (u \mapsto a) \in {\lambda}_{2}\}\). This merge operation is denoted as \({\lambda}_{3} \leftarrow \textsc{Merge}({\lambda}_{1}, {\lambda}_{2})\).
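In the dict representation used in our sketches, Definitions 12–14 translate into a few lines of Python:

```python
def subsumes(lam1, lam2):
    # Definition 12: lam1 subsumes lam2 iff every pair of lam2 is also in lam1
    return all(lam1.get(u) == a for u, a in lam2.items())

def compatible(lam1, lam2):
    # Definition 13: symbols in both domains must map to the same activity
    return all(lam1[u] == lam2[u] for u in lam1.keys() & lam2.keys())

def merge(lam1, lam2):
    # Definition 14: the union of two compatible mappings
    assert compatible(lam1, lam2)
    return {**lam1, **lam2}
```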

2.4 Coverage of a mapping

With the table of mappings provided by Algorithm 2, it is possible to find which micro-sequences are covered by a given mapping λ. These are the micro-sequences that can be mapped to some macro-sequence σ j (it does not matter which one in particular) through the mapping λ. In essence, this is just a matter of finding the tuples (μ i , σ j , λ k ) in T where λ k is subsumed by λ, i.e. \({\lambda }_{k}\sqsubseteq {\lambda }\).

Algorithm 3 describes how the coverage of a given mapping λ is computed. The event log \(L=[\mu _{1}^{q_{1}},\mu _{2}^{q_{2}},\mu _{3}^{q_{3}},\ldots ]\) and the macro-model \(M = \{\sigma_{1}, \sigma_{2}, \ldots\}\) are assumed to be global variables, and table T is the result of Algorithm 2, as indicated in line 1. In line 3, the function initializes \(S_{cov}\), which will contain the set of micro-sequences covered by λ. Similarly, in line 4 it initializes \(S_{all}\), which will contain all micro-sequences (both covered and non-covered). The sets \(S_{cov}\) and \(S_{all}\) are built through lines 5–8.

[Algorithm 3]

In line 9, \(s_{cov}\) counts how many micro-sequences from the event log L are actually covered. Recall from Definition 10 that μ @ L is the support of μ in L, i.e. the number of times that μ appears in L. Line 10 does a similar count, but for all the micro-sequences in T. The coverage of λ is then the ratio of \(s_{cov}\) to \(s_{all}\), usually expressed as a percentage, as in line 11.
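A sketch of this computation, with the log as a Counter and the table produced by the Algorithm 2 sketch (here we take the denominator as the total support of the log, which agrees with the ratios used in the examples of Section 3):

```python
def coverage(lam, log, table):
    """Sketch of Algorithm 3: the percentage of traces in `log` whose
    micro-sequence is covered by `lam` via some subsumed mapping."""
    covered = {mu for mu, sigma, lam_k in table if subsumes(lam, lam_k)}
    s_cov = sum(log[mu] for mu in covered)   # support of covered sequences
    s_all = sum(log.values())                # total support of the log
    return 100.0 * s_cov / s_all
```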

2.5 Coverage vs. range of a mapping

Ideally, we would like to find the mapping λ with the highest possible coverage. This could be one of the mappings λ k in T, or it could be a merge of several such mappings (like λ 3 in Example 7). If the coverage of λ is 100 %, then the mapping λ covers all the micro-sequences in the event log.

However, it turns out that coverage is not the only factor that should be considered. In fact, it is possible to get a mapping with 100 % coverage, but this mapping may not be interesting at all, as the following example illustrates.

Example 8

For the micro-sequence UVWXXYZYZ and the macro-sequence ABC, one possible mapping is:

$${\lambda}_{1}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A}, \textsf{W}\mapsto\textsf{B}, \textsf{X}\mapsto\textsf{B}, \textsf{Y}\mapsto\textsf{C}, \textsf{Z}\mapsto\textsf{C}\}$$

Now suppose that we observe the micro-sequence UV in the event log. There is no complete mapping between the micro-sequence UV and the macro-sequence ABC. However, λ 1 suggests that UV might have been produced by a macro-sequence with a single activity A. If we add this macro-sequence to the macro-model, then Algorithm 2 comes up with the following mapping:

$${\lambda}_{2}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A}, \textsf{W}\mapsto\textsf{A}, \textsf{X}\mapsto\textsf{A}, \textsf{Y}\mapsto\textsf{A}, \textsf{Z}\mapsto\textsf{A}\}$$

which covers both UVWXXYZYZ and UV, and therefore has a coverage of 100% (assuming that there are no other micro-sequences in the event log).

The problem is that both UVWXXYZYZ and UV have been mapped to the macro-sequence A, when UVWXXYZYZ should have been mapped to ABC and UV should have been mapped to A. Fortunately, by trying all pairings of μ i and σ j , Algorithm 2 also comes up with λ 1 again:

$${\lambda}_{1}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A}, \textsf{W}\mapsto\textsf{B}, \textsf{X}\mapsto\textsf{B}, \textsf{Y}\mapsto\textsf{C}, \textsf{Z}\mapsto\textsf{C}\}$$

which (now with macro-sequences ABC and A) covers both micro-sequences as well.

Clearly, λ 1 is preferable to λ 2. The reason for this is that the range of λ 1 includes activities A, B and C, whereas the range of λ 2 includes only activity A.

Definition 15 (Range)

The range of a mapping λ is the set of activities that appear in λ. It is defined as

$$R({\lambda}) = \{a\mid(u\mapsto a)\in {\lambda}\}$$

To compare two given mappings λ′ and λ″, it is often useful to compare the sizes of their ranges, i.e. |R(λ′)| and |R(λ″)|. Therefore, we define the function RANGE(λ) ← |R(λ)|.
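In the dict representation, the range is simply the set of values of the mapping (a one-line sketch):

```python
def mapping_range(lam):
    # Definition 15: R(lam) is the set of activities appearing in lam
    return set(lam.values())

lam1 = {"U": "A", "V": "A", "W": "B", "X": "B", "Y": "C", "Z": "C"}
lam2 = {"U": "A", "V": "A", "W": "A", "X": "A", "Y": "A", "Z": "A"}
print(len(mapping_range(lam1)), len(mapping_range(lam2)))   # prints: 3 1
```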

As in Example 8, we will be looking for mappings with the largest possible range (ideally, this range should contain all the activities in the macro-model M). Then, from all the mappings with the largest possible range, we pick the mapping (or mappings) with the highest coverage. The next section explains the mining algorithm in more detail.

3 Mining approach

In general terms, the problem to be addressed here consists in finding a mapping between the low-level events in an event log \(L=[\mu _{1}^{q_{1}},\mu _{2}^{q_{2}},\mu _{3}^{q_{3}},\ldots ]\) and the high-level activities in a macro-model \(M = \{\sigma_{1}, \sigma_{2}, \ldots\}\). Such a mapping should have the following desired characteristics:

  1. Range – the range of the mapping should include, to the furthest extent possible, all the activities that appear in the macro-model M.

  2. Coverage – the mapping should cover, to the furthest extent possible, all the micro-sequences that appear in the event log L.

The first approach that comes to mind would be to generate all complete mappings through Algorithm 2 and then pick the mapping (or mappings) that best fit the characteristics above. However, there are additional possibilities to be considered. In Example 7, we have seen that it is possible to generate new mappings by merging existing ones, and those new mappings should be considered as well.

Suppose that for a given micro-sequence μ i and a given macro-sequence σ j there are k complete mappings. Then if there are n micro-sequences and m macro-sequences, the total number of complete mappings is of the order of n × m × k. In addition, we have to consider merges of two mappings (for those pairs of mappings which are compatible), merges of three mappings (for those triples of mappings which are compatible), and so on.

For practical purposes, the set of mappings to be considered is simply too large to be searched exhaustively. Therefore, we use a greedy approach to search for a mapping \(\hat {\lambda }\) that maximizes range and coverage. Such a greedy approach is described in Algorithm 4. To make things clearer (even if at the expense of some rigor), this algorithm is described in plain words whenever possible.

[Algorithm 4]

In essence, the idea of Algorithm 4 is to keep merging compatible mappings into \(\hat {\lambda }\) (which is initially empty). At each iteration, a new mapping \({\lambda }_{k}^{\prime }\) is merged into \(\hat {\lambda }\) (step 6), so that all micro-sequences that are covered by \({\lambda }_{k}^{\prime }\) are now covered by \(\hat {\lambda }\) as well. This mapping \({\lambda }_{k}^{\prime }\) is chosen according to three selection criteria:

  • it must be compatible with \(\hat {\lambda }\) (step 3);

  • it must be the mapping (or one of the mappings) which, when merged with \(\hat {\lambda }\), produces a new mapping with the largest possible range (step 4);

  • it must be the mapping (or one of the mappings) with the highest coverage (step 5).

After running steps 2–8 iteratively, Algorithm 4 ends in one of three possible ways:

  • either there are no mappings for the remaining (i.e. non-covered) micro-sequences in L (T is empty in step 2),

  • or there are mappings, but none of them are compatible with \(\hat {\lambda }\) (T′ is empty in step 3),

  • or there are no micro-sequences left to be covered in L (L is empty in step 8).

In the first two cases, there will be some micro-sequences left in L that cannot be covered by the mapping \(\hat {\lambda }\), and therefore the coverage of \(\hat {\lambda }\) will be lower than 100 %. In the last case, \(\hat {\lambda }\) has 100 % coverage, and we can say that the macro-model M seems to be a perfect description of the behavior observed in the event log.
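Example 9 below traces the algorithm step by step. As a complement, here is a compact Python sketch of the greedy loop, reconstructed from the description above and reusing build_table, compatible, merge and subsumes from the earlier sketches (ties among equally good candidates are broken arbitrarily, as in step 5):

```python
from collections import Counter

def greedy_mine(log, macro_model):
    """Sketch of Algorithm 4: grow a mapping lambda-hat by greedily merging
    compatible mappings, maximizing range first and coverage second."""
    remaining = Counter(log)   # micro-sequences not yet covered
    best = {}                  # lambda-hat, initially empty
    while remaining:
        table = build_table(remaining, macro_model)
        merged = [merge(best, lam) for _, _, lam in table
                  if compatible(best, lam)]
        if not merged:
            break              # remaining micro-sequences cannot be covered

        def support(cand):     # coverage of a candidate merged mapping
            covered = {mu for mu, _, lam in table if subsumes(cand, lam)}
            return sum(remaining[mu] for mu in covered)

        widest = max(len(set(c.values())) for c in merged)   # largest range
        best = max((c for c in merged if len(set(c.values())) == widest),
                   key=support)
        for mu in {mu for mu, _, lam in table if subsumes(best, lam)}:
            del remaining[mu]  # drop the micro-sequences now covered
    return best
```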

Example 9

Let \(L=[\mu _{1}^{q_{1}},\mu _{2}^{q_{2}}]\) be an event log with the following micro-sequences:

$$\begin{array}{*{20}l} &\mu_{1}=\textsf{UVWXXYZYZ} & q_{1}=5\\ &\mu_{2}=\textsf{UVYZYZ} & q_{2}=3 \end{array} $$

Let M = {σ 1, σ 2} be a macro-model with the macro-sequences σ 1 = ABC and σ 2 = AC.

Given L and M, Algorithm 4 runs as follows:

  1. \(\hat {\lambda }\) is initialized as an empty mapping.

  2. Algorithm 2 produces the same table as in Example 5.

  3. Since \(\hat {\lambda }\) is empty, every mapping in T is compatible with \(\hat {\lambda }\), so T′ = T.

  4. The mappings which yield the largest range are λ 1 through λ 6, and also λ 11.

  5. The mappings with the highest coverage are λ 1 through λ 6, because they cover μ 1 (with q 1 = 5), whereas λ 11 covers μ 2 (with q 2 = 3). We can pick one of the mappings λ 1 through λ 6 arbitrarily, so let us pick, for example: λ 2 = {U ↦ A, V ↦ A, W ↦ B, X ↦ B, Y ↦ C, Z ↦ C}.

  6. Since \(\hat {\lambda }\) is empty, merging λ 2 into \(\hat {\lambda }\) yields \(\hat {\lambda }={\lambda }_{2}\).

  7. Since μ 1 has been covered, it is removed from L, which becomes \(L=[\mu _{2}^{q_{2}}]\).

  8. Since L is not empty, go back to step 2.

  2. (second iteration) Table T is now the following:

     [Table of mappings between the remaining micro-sequence μ 2 and the macro-sequences σ 1 and σ 2]

  3. Only λ 12 is compatible with \(\hat {\lambda }\).

  4. There is only λ 12 to choose from.

  5. There is only λ 12 to choose from.

  6. Merging λ 12 into \(\hat {\lambda }\) yields the same mapping \(\hat {\lambda }\) as before, because \(\hat {\lambda }\) subsumes λ 12.

  7. Since μ 2 has been covered, it is removed from L, which becomes empty.

  8. Since L is empty, \(\hat {\lambda }=\{\textsf {U}\mapsto \textsf {A}, \textsf {V}\mapsto \textsf {A}, \textsf {W}\mapsto \textsf {B}, \textsf {X}\mapsto \textsf {B}, \textsf {Y}\mapsto \textsf {C}, \textsf {Z}\mapsto \textsf {C}\}\) is returned.

In Algorithm 4 and in the example above, there is an arbitrary choice of mapping \({\lambda }_{k}^{\prime }\) in step 5. In this particular example, any of the available choices would lead to a mapping \(\hat {\lambda }\) with the same coverage, but in general this choice could have an impact on the final coverage of \(\hat {\lambda }\). This is one of the problems of using a greedy approach: at each iteration we choose a mapping that appears to maximize coverage immediately, but it could happen that another mapping with equal coverage (or even lower coverage) would yield a higher coverage in the long run (i.e. in subsequent iterations).

On the other hand, it is possible that a mapping \({\lambda }_{k}^{\prime }\) which has not been chosen in the current iteration will end up being chosen in a subsequent iteration. (The only reason why this may not occur is if an incompatible mapping has been merged into \(\hat {\lambda }\) in the meantime.) In any case, regardless of whether \({\lambda }_{k}^{\prime }\) or another mapping is chosen, the coverage of \(\hat {\lambda }\) will increase, and it will keep increasing for as long as it is possible to find mappings that are compatible with (and therefore that can be merged into) \(\hat {\lambda }\).

For this reason, Algorithm 4 is usually capable of finding a mapping with good coverage (i.e. higher than 50 %). This is also helped by the fact that the distribution of sequences in an event log is typically unbalanced (a few very frequent sequences account for most occurrences, while many sequences occur only a few times). By focusing on the mappings with the highest coverage, Algorithm 4 can usually find a mapping that covers most of the event log. For those micro-sequences that remain to be covered, it is usually possible to come up with a suggestion, as described below.

3.1 Generating suggestions for improvement

As stated above, when Algorithm 4 gets to a point where it cannot cover the micro-sequences that are left in L, it stops and returns the current mapping \(\hat {\lambda }\). However, in practice it is often the case that the mapping cannot cover the remaining micro-sequences not because the mapping itself is wrong, but because the macro-model does not include the behavior that could generate such micro-sequences. The following example illustrates this point.

Example 10

Let \(L=[\mu _{1}^{q_{1}},\mu _{2}^{q_{2}},\mu _{3}^{q_{3}}]\) be an event log with the following micro-sequences:

$$\begin{array}{*{20}l} &\mu_{1}=\textsf{UVWXXYZYZ} & q_{1}=5\\ &\mu_{2}=\textsf{UVWXUVYZ} & q_{2}=2 \\ &\mu_{3}=\textsf{UV} & q_{3}=1 \end{array} $$

Let M = {σ 1} be a macro-model with a single macro-sequence σ 1 = ABC.

Given L and M, Algorithm 4 finds the mapping:

$$\hat{\lambda}=\{\textsf{U}\mapsto\textsf{A}, \textsf{V}\mapsto\textsf{A}, \textsf{W}\mapsto\textsf{B}, \textsf{X}\mapsto\textsf{B}, \textsf{Y}\mapsto\textsf{C}, \textsf{Z}\mapsto\textsf{C}\}$$

which covers μ 1. Therefore, \(\hat {\lambda }\) covers 5/8=62.5 % of the micro-sequences in the event log.

The most frequent micro-sequence which is not covered is μ 2 = UVWXUVYZ. If we believe \(\hat {\lambda }\) to be correct, then \(\hat {\lambda }\) suggests that μ 2 was generated by a macro-sequence σ 2 = ABAC. If we add this macro-sequence to the macro-model, then \(\hat {\lambda }\) increases its coverage to 7/8 = 87.5 %.

Now there is only one micro-sequence left to be covered: μ 3 = UV. If we keep the mapping \(\hat {\lambda }\), then \(\hat {\lambda }\) suggests that μ 3 was generated by the macro-sequence σ 3 = A. By adding σ 3 to M, we achieve 100 % coverage for \(\hat {\lambda }\).

In this example, the macro-sequences σ 2 = ABAC and σ 3 = A can be regarded as suggestions for improving the macro-model in order to increase the coverage of mapping \(\hat {\lambda }\). Of course, this mapping could be wrong from the perspective of a business analyst, and in that case these suggestions should not be followed. In any case, they are just suggestions; if the analyst believes that the mapping \(\hat {\lambda }\) makes sense, then the additional macro-sequences can be incorporated into the macro-model, resulting in a new, improved version.

A suggestion can be generated by applying the mapping \(\hat {\lambda }\) to a given non-covered micro-sequence μ i . In the example above, \(\hat {\lambda }\) has been applied to μ 2 and to μ 3. When \(\hat {\lambda }\) is applied to μ 2 = UVWXUVYZ the result is \(\sigma _{2}^{\star }=\textsf {AABBAACC}\). By eliminating all consecutive repetitions of activities, we get σ 2 = ABAC. Similarly, when \(\hat {\lambda }\) is applied to μ 3 = UV the result is \(\sigma _{3}^{\star }=\textsf {AA}\) and σ 3 = A. The following definitions establish the notation for these operations.

Definition (Apply)

A mapping \(\lambda = \{u_{i} \mapsto f(u_{i})\}\) can be applied to a micro-sequence \(\mu = \langle u_{1}, u_{2}, \ldots, u_{n}\rangle\), resulting in a new sequence \(\sigma^{\star} = \langle f(u_{1}), f(u_{2}), \ldots, f(u_{n})\rangle\). This operation is denoted as \(\sigma^{\star} \leftarrow \textsc{Apply}({\lambda},\mu)\).

Definition (Collapse)

A sequence of symbols:

$$\sigma^{\star}=\langle \underbrace{a_{1},a_{1},...,a_{1}}_{a_{1}}, \underbrace{a_{2},a_{2},...,a_{2}}_{a_{2}},\;\;\ldots\;\;,\underbrace{a_{n},a_{n},...,a_{n}}_{a_{n}}\rangle$$

can be collapsed into:

$$\sigma=\langle a_{1}, a_{2}, \ldots, a_{n} \rangle$$

by eliminating all consecutive repetitions of the same symbol in \(\sigma^{\star}\). This operation is denoted as \(\sigma \leftarrow \textsc{Collapse}(\sigma^{\star})\).

To generate a suggested macro-sequence σ j for a non-covered micro-sequence μ i , first we apply the mapping \(\hat {\lambda }\) to the micro-sequence μ i in order to generate the sequence \(\sigma _{j}^{\star }\), i.e. \(\sigma _{j}^{\star }\leftarrow \textsc{Apply}({\hat {\lambda },\mu _{i}})\). Then we eliminate all consecutive repetitions of the same symbol in \(\sigma _{j}^{\star }\) in order to obtain the suggested macro-sequence σ j , i.e. \(\sigma _{j}\leftarrow \textsc {Collapse} ({\sigma _{j}^{\star }})\).
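A sketch of these two operations, with mappings as dicts and sequences as strings or lists:

```python
from itertools import groupby

def apply_mapping(lam, micro):
    # APPLY: translate each low-level symbol into its activity
    return [lam[u] for u in micro]

def collapse(seq):
    # COLLAPSE: drop consecutive repetitions of the same activity
    return [a for a, _ in groupby(seq)]

lam = {"U": "A", "V": "A", "W": "B", "X": "B", "Y": "C", "Z": "C"}
print(collapse(apply_mapping(lam, "UVWXUVYZ")))   # ['A', 'B', 'A', 'C']
```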

Example 11

Given the micro-sequence μ 4 = UVYZYZ and the mapping \(\hat {\lambda }=\{\textsf {U}\mapsto \textsf {A}, \textsf {V}\mapsto \textsf {A}, \textsf {W}\mapsto \textsf {B}, \textsf {X}\mapsto \textsf {B}, \textsf {Y}\mapsto \textsf {C}, \textsf {Z}\mapsto \textsf {C}\}\) from Example 10, it is possible to generate a suggested macro-sequence in the form \(\sigma _{4}^{\star }=\textsf {AACCCC}\) and σ 4 = AC.

3.2 Visualizing the (new) process model

In Example 10 we started with a single macro-sequence σ 1 = ABC and then we added σ 2 = ABAC and σ 3 = A to the macro-model. How has the process model changed when these new macro-sequences were added? Clearly, the point of adding σ 2 and σ 3 was to be able to map μ 1 to σ 1 = ABC, μ 2 to σ 2 = ABAC, and μ 3 to σ 3 = A. However, once we added σ 2 and σ 3 to the macro-model, the process itself is no longer a simple sequence of activities ABC.

In Example 10, there are 5 instances of μ 1, 2 instances of μ 2, and 1 instance of μ 3. In total, there are 8 instances of this process. Since μ 1, μ 2 and μ 3 are being mapped to σ 1, σ 2 and σ 3 respectively, it is as if the process has been executed 8 times in the following way: 5 instances in the form σ 1 = ABC; 2 instances in the form σ 2 = ABAC; and 1 instance in the form σ 3 = A. With this information, we can calculate the transition probabilities between the activities in the process in order to visualize the process model as a Markov chain.

Of course, another kind of model could be used to visualize the process, such as a dependency graph (Weijters et al. 2006) or a Petri net (Van der Aalst 1998). Our goal here is just to illustrate the process (and its changes from one version to another) in an intuitive way, and for that purpose we use the simplest possible framework. In particular, the type of Markov chain that we use here, which is extended with special “open” and “closed” states, is somewhat similar to the concept of WF-net, as defined in Van der Aalst (1998), which has special “input” and “output” places.

Example 12

Let ∘ (“open”) and ∙ (“closed”) be two special symbols to denote the beginning and the end of the process, respectively. With σ 1 = ∘ABC∙, σ 2 = ∘ABAC∙, and σ 3 = ∘A∙, we can say that the process always begins with activity A. We write P(∘,A) = 1.0 to say that the transition probability between ∘ and A is 1.0.

With regard to the transition probabilities from A to other activities, first we note that there is one A in σ 1, two A’s in σ 2, and one A in σ 3. Since there are 5 instances of σ 1, 2 instances of σ 2, and 1 instance of σ 3, we get a total number of 5 × 1 + 2 × 2 + 1 × 1 = 10 occurrences of A. The transition probabilities are as follows:

  • in 5 × 1 + 2 × 1 = 7 of those occurrences, the process goes from A to B, therefore P(A,B) = 7/10;

  • in 2 × 1 = 2 of those occurrences, the process goes from A to C, therefore P(A,C) = 2/10;

  • in 1 × 1 = 1 of those occurrences, the process ends with A, therefore P(A,∙) = 1/10.

With regard to the transition probabilities from B to other activities, we note that there is one B in σ 1, one B in σ 2, and none in σ 3. Therefore, there is a total number of 5 × 1 + 2 × 1 + 1 × 0 = 7 occurrences of B. The transition probabilities are:

  • in 5 × 1 = 5 of those occurrences, the process goes from B to C, therefore P(B,C) = 5/7;

  • in 2 × 1 = 2 of those occurrences, the process goes from B to A, therefore P(B,A) = 2/7.

With regard to the transition probabilities from C to other activities, we note that there is one C in σ 1 and one C in σ 2. Therefore, there is a total number of 5 × 1+2 × 1=7 occurrences of C. In all of those occurrences, the process ends with C. Therefore, P(C,∙) = 1.0.

All other transition probabilities are zero. Therefore, the transition matrix for this process is:

[Transition matrix]

This transition matrix can be depicted as a graphical model, as shown in Fig. 2.

Fig. 2: A graphical depiction of the transition matrix for a macro-model with three macro-sequences

As this example illustrates, calculating the transition matrix for the process is a matter of counting the transitions from one activity to another, and dividing by the total number of occurrences of the first activity. The procedure is relatively simple, as is the conversion of the transition matrix to the type of graphical model shown in Fig. 2. Therefore, we do not provide a formal description of the procedure that has just been illustrated in Example 12.
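Nevertheless, for the interested reader, the counting can be sketched as follows (the open and closed states are written here as 'o' and '*'; the input maps each macro-sequence to its number of instances):

```python
from collections import defaultdict

def transition_matrix(weighted_macros):
    """Sketch of the counting in Example 12; 'o' and '*' stand for the
    open and closed states. `weighted_macros` maps each macro-sequence
    (a string) to its number of instances."""
    counts = defaultdict(lambda: defaultdict(int))
    for macro, n in weighted_macros.items():
        path = ["o"] + list(macro) + ["*"]
        for a, b in zip(path, path[1:]):
            counts[a][b] += n          # n instances take this transition
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

P = transition_matrix({"ABC": 5, "ABAC": 2, "A": 1})
print(P["A"])   # {'B': 0.7, 'C': 0.2, '*': 0.1}, as computed in Example 12
```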

However, it is important to mention how the required data for computing the transition matrix are obtained. For each micro-sequence \(\hat {\mu }\) that is covered by \(\hat {\lambda }\), we can look up the corresponding macro-sequence \(\hat {\sigma }\) in table T by selecting the tuple (μ i , σ j , λ k ) where \(\mu _{i}=\hat {\mu }\) and \({\lambda }_{k}\sqsubseteq \hat {\lambda }\). As for the number of instances that should be assigned to σ j , this is equal to the sum of the support (μ i @ L) of every micro-sequence μ i that ends up being mapped to σ j .

4 Case study: Volvo IT Belgium

This case study is based on a real-world event log collected from an incident management system at Volvo IT Belgium. Originally, this event log was made publicly available for the BPI Challenge 2013. In that challenge, the goal was to analyze the event log in order to answer a number of specific questions posed by the process owner. Here we focus on arriving at a high-level process model that captures the low-level behavior recorded in the event log.

For the purpose of BPI Challenge 2013, there were actually three event logs available for two different processes: incident management, and problem management. Here, we illustrate the application of our mining approach to the incident management process, which is the one that is described in more detail in existing documentation.

4.1 The event log

In the event log, we find the events that correspond to all changes in the status of each incident (an incident is also referred to as a service request). Each event has:

  • an SR number, which is a unique identifier for the service request;

  • a Status, which can be Accepted, Queued or Completed;

  • a Sub-status, which is one of In Progress, Awaiting Assignment, Assigned, Resolved, Closed, etc.;

  • the Involved ST, which is the support team working on the incident and which changed its status or sub-status;

  • a timestamp for the event, in the form of a date and time;

  • other fields, such as the impact of the incident (high, medium, low), the product that the incident refers to, and the country that the support team belongs to.

An excerpt of the event log with the fields SR number, Status, Sub-status, Involved ST and timestamp is shown in Table 1. Here it is possible to see that the first event in the log is from March 2010 and the last event is from May 2012. There are 7554 cases (i.e. incidents) recorded in the event log, and a total of 65533 events. From these 7554 cases, there are 2278 different traces (i.e. micro-sequences).

Table 1 An excerpt of the event log from Volvo IT Belgium

The Involved ST field contains the identifier of the support team, sometimes with a suffix “2nd” or “3rd”. This means that the support team belongs to the second and/or third line, rather than to the first line. The first line is the service helpdesk that serves as a frontend to the user or customer who submitted the service request. The second and third lines refer to support teams that take care of those incidents which cannot be solved by the first line. In particular, the third line comprises experts in product development who should only be involved if the incident cannot be solved within the first or second line.

The act of passing work from the first line to the second or even to the third line is known as escalation in the popular ITIL framework (Bon and Pieper 2005). One of the issues to be analyzed in the BPI Challenge 2013 was whether escalation of service requests takes place too often or too soon. Here we will be focusing on the Status and Sub-status columns, in order to map these statuses to a set of high-level activities.

4.2 The process model

An interesting feature of BPI Challenge 2013 is that the event log was made available together with some documentation about the process of handling service requests. A BPMN model for this process is shown in Fig. 3.

Fig. 3: A BPMN model of the high-level process associated with handling service requests

The model shows how an incident that is being “investigated” within the first line can escalate to the second line, and possibly to the third line. This happens in a similar way across lines: if no solution can be found within the current line, the incident escalates to the next line, where it will be matched against a database of known issues; if the incident is recognized as a known issue, then an existing solution is applied to resolve it; otherwise, the support team will try to come up with a solution to the problem. Eventually, if the problem is traced back to a certain component, the support team may get in contact with the respective supplier in order to fix the problem.

Our goal is to map the low-level events in Table 1 to the high-level activities in Fig. 3. For this purpose, we define a low-level event as being the concatenation of the Status and Sub-status fields in Table 1. For example, if the status is Accepted and the sub-status is In Progress, we concatenate these fields into the string AcceptedInProgress. Then the mapping we are looking for will contain pairs of event and activity, such as:

AcceptedInProgress ↦ Investigate
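As an illustration, the symbols can be built by concatenating the two fields while grouping events by service request. The sketch below is hypothetical: the file name and the exact column labels are assumptions based on Table 1, and may differ from the actual BPI Challenge files.

```python
import csv
from collections import Counter, defaultdict

traces = defaultdict(list)
with open("volvo_incidents.csv") as f:        # hypothetical file name
    for row in csv.DictReader(f):             # rows assumed ordered by timestamp
        symbol = row["Status"] + row["Sub-status"]   # e.g. AcceptedInProgress
        traces[row["SR number"]].append(symbol)

# The event log as a bag of micro-sequences (Definition 9)
log = Counter(tuple(trace) for trace in traces.values())
```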

In fact, in the event log we find that about 84.4 % of cases begin with AcceptedInProgress, 15.3 % begin with QueuedAwaitingAssignment, and the remaining 0.3 % begin with other events. The fact that such a large percentage of cases begin with AcceptedInProgress appears to suggest that this event should be mapped to the first activity in the process, i.e. Register. However, Table 1 shows that the event AcceptedInProgress can appear multiple times in each trace, so it cannot belong to the activity Register.

See, for example, the escalation from the first line (V30) directly(!) to the third line (V5 3rd) that takes place from the second to the third event in Table 1. According to Fig. 3, there is no Register activity in the third line, so AcceptedInProgress cannot be mapped to Register. Therefore, we conclude that the observable behavior recorded in the event log begins with the activity Investigate.

Before proceeding with the analysis, we should mention that no form of filtering or preprocessing was applied to the event log; we simply take the micro-sequences as they are, with duplicate events included. This is important to mention because, in practice, it is quite difficult to analyze real-world event logs without some sort of preprocessing (Mans et al. 2009). However, with the present approach we are able to analyze the event log in its entirety, without worrying about noise (Van der Aalst and Weijters 2004).

4.3 Analysis

We start with the event log and a macro-model with a single macro-sequence:

$$\sigma_{1}=\langle\textsf{Investigate},\textsf{Resolve},\textsf{Close}\rangle$$

On a Linux laptop with an Intel Core i5-4200U processor, an implementation of Algorithm 4 in Python 2.7 takes about 1.26 seconds to run, and produces the following mapping with 68.7 % coverage:

$$\begin{array}{*{20}l} \textsf{AcceptedAssigned} & \mapsto \textsf{Investigate} \\ \textsf{AcceptedInProgress} & \mapsto \textsf{Investigate} \\ \textsf{AcceptedWait} & \mapsto \textsf{Investigate} \\ \textsf{AcceptedWaitCustomer} & \mapsto \textsf{Investigate} \\ \textsf{AcceptedWaitImplementation} & \mapsto \textsf{Investigate} \\ \textsf{AcceptedWaitUser} & \mapsto \textsf{Investigate} \\ \textsf{AcceptedWaitVendor} & \mapsto \textsf{Investigate} \\ \textsf{CompletedClosed} & \mapsto \textsf{Close} \\ \textsf{CompletedInCall} & \mapsto \textsf{Investigate} \\ \textsf{CompletedResolved} & \mapsto \textsf{Resolve} \\ \textsf{QueuedAwaitingAssignment} & \mapsto \textsf{Investigate} \\ \end{array} $$

Since at this stage there is only one macro-sequence, all covered micro-sequences are mapped to σ 1. The current process model is shown in Fig. 4.

Fig. 4: Process model, version 1

The most frequent micro-sequence that cannot be covered is: μ 1 = 〈AcceptedInProgress, AcceptedInProgress, CompletedInCall 〉. This corresponds to those incidents which end up being resolved during a call with the customer. The mapping above suggests that this micro-sequence μ 1 might have been generated by a macro-sequence in the form:

$$\sigma_{2}=\langle\textsf{Investigate}\rangle$$

Therefore, the suggestion is to add this macro-sequence to the macro-model. If we accept this suggestion and run Algorithm 4 again, it takes 2.59 seconds to generate a mapping with 93.7 % coverage. The new mapping is equal to the previous one, with the addition of the following pair:

$$\textsf{CompletedCancelled} \mapsto \textsf{Investigate}$$

This means that, provided with σ 1 and σ 2, Algorithm 4 was able to cover not only the micro-sequence μ 1 but also other traces containing the status CompletedCancelled.

Based on the macro-sequences which the covered micro-sequences have been mapped to, a new version of the process model is shown in Fig. 5.

Fig. 5: Process model, version 2

The most frequent micro-sequence that cannot be covered is now: μ 2 = 〈AcceptedInProgress, AcceptedInProgress, CompletedResolved, AcceptedInProgress, CompletedResolved, CompletedClosed 〉. This corresponds to those incidents which were thought to have been resolved on a first attempt, but were resolved only after a second attempt. The current mapping suggests that this micro-sequence μ 2 might have been generated by a macro-sequence in the form:

$$\sigma_{3}=\langle\textsf{Investigate},\textsf{Resolve},\textsf{Investigate},\textsf{Resolve},\textsf{Close}\rangle$$

After inserting this suggestion into the macro-model and running Algorithm 4 again, it takes 9.66 seconds to generate a mapping with 96.9 % coverage. The new mapping adds the following pair:

$$\textsf{UnmatchedUnmatched} \mapsto \textsf{Resolve}$$

This means that, provided with σ 1, σ 2 and σ 3, Algorithm 4 was also able to cover those micro-sequences with the (undocumented) status UnmatchedUnmatched.

Based on the macro-sequences which the covered micro-sequences have been mapped to, a third version of the process model is shown in Fig. 6.

Fig. 6: Process model, version 3

The most frequent micro-sequence that is not covered is now: μ 3 = 〈AcceptedInProgress, AcceptedInProgress, AcceptedWaitUser, CompletedResolved 〉. The current mapping suggests that this micro-sequence μ 3 might have been generated by a macro-sequence in the form:

$$\sigma_{4}=\langle\textsf{Investigate},\textsf{Resolve}\rangle$$

We add this macro-sequence to the macro-model and run Algorithm 4 again, which takes 18.05 seconds to find that the current mapping, without any further change, is now able to cover 98.0 % of the traces in the event log. However, because we added σ 4, there has been a slight change in the process model, as shown in Fig. 7.

Fig. 7: Process model, version 4

By carrying out the same procedure repeatedly – i.e. by adding the suggested macro-sequences and running Algorithm 4 again – the mapping does not change anymore, but the new macro-sequences keep refining the process model, until eventually all the micro-sequences in the event log are covered. Figures 8–19 show the successive versions of the process model, as each new macro-sequence is added.

Fig. 8: Process model version 5, after adding σ 5 = 〈Investigate, Resolve, Investigate, Resolve, Investigate, Resolve, Close 〉, coverage 98.29 %

Fig. 9 Process model, version 6, after adding σ 6 = 〈Investigate, Resolve, Close, Investigate, Resolve, Close〉, coverage 99.59 %

Fig. 10 Process model, version 7, after adding σ 7 = 〈Investigate, Resolve, Investigate, Resolve〉, coverage 99.63 %

Fig. 11 Process model, version 8, after adding σ 8 = 〈Investigate, Resolve, Close, Investigate, Resolve, Investigate, Resolve, Close〉, coverage 99.67 %

Fig. 12 Process model, version 9, after adding σ 9 = 〈Resolve, Investigate, Resolve〉, coverage 99.69 %

Fig. 13 Process model, version 10, after adding σ 10 = 〈Resolve, Close, Resolve〉, coverage 99.71 %

Fig. 14 Process model, version 11, after adding σ 11 = 〈Investigate, Resolve, Investigate, Resolve, Close, Investigate, Resolve, Close〉, coverage 99.79 %

Fig. 15 Process model, version 12, after adding σ 12 = 〈Investigate, Resolve, Close, Investigate, Resolve, Close, Investigate, Resolve, Close〉, coverage 99.91 %

Fig. 16 Process model, version 13, after adding σ 13 = 〈Investigate, Resolve, Investigate, Resolve, Investigate, Resolve, Close, Investigate, Resolve, Close〉, coverage 99.92 %

Fig. 17 Process model, version 14, after adding σ 14 = 〈Investigate, Resolve, Close, Investigate, Resolve, Investigate, Resolve, Close, Investigate, Resolve, Close〉, coverage 99.93 %

Fig. 18 Process model, version 15, after adding σ 15 = 〈Investigate, Resolve, Investigate, Resolve, Investigate, Resolve, Investigate, Resolve, Close〉, coverage 99.99 %

Fig. 19 Process model, version 16, after adding σ 16 = 〈Investigate, Resolve, Investigate, Resolve, Close, Investigate, Resolve, Investigate, Resolve, Investigate, Resolve, Close, Investigate, Resolve, Close〉, coverage 100.00 %

From version 6 (Fig. 9) onwards, it can be seen that the coverage is already very close to 100 %. From that point on, the new macro-sequences that are added to the macro-model are intended to cover some very particular cases, which represent a very small fraction of the event log (of the order of 0.1 % or even less).

For example, from version 8 (Fig. 11) to version 9 (Fig. 12) the macro-sequence σ 9 = 〈Resolve, Investigate, Resolve 〉 was added. This is an interesting point in the analysis because σ 9 is the first macro-sequence to be added that begins with an activity other than Investigate. However, as can be seen in the model of Fig. 12, only 0.03 % of cases are actually covered by that macro-sequence.

Eventually, by adding the suggested macro-sequences, the mapping is able to cover all the cases in the event log. The resulting process model is shown in Fig. 19. Even if the analyst does not follow the suggestions and prefers to add different macro-sequences at each step, it will still be possible to achieve a macro-model with 100 % coverage, albeit a different one from that of Fig. 19.

4.4 Execution time and coverage

Figure 20 plots the execution time and coverage achieved by Algorithm 4 as each new macro-sequence is added to the macro-model. These plots illustrate two important facts:

  • Even though Algorithm 2 computes all possible mappings between each micro-sequence μ i and each macro-sequence σ j, the total execution time of Algorithm 4 does not appear to rise exponentially with the number of macro-sequences. This provides confidence that, in practice, it will be possible to add an arbitrary number of macro-sequences and still carry out the analysis within a reasonable amount of time.

  • Even though it is possible to achieve 100 % coverage by adding all of the suggested macro-sequences, for practical purposes it will hardly be necessary to do so since, at a certain point, the gain in coverage does not justify the extra execution time. Once the mapping covers 99 % of the event log (or another percentage that the analyst finds appropriate), the remaining traces can be dismissed as noise, and the current process model can be considered a sufficiently accurate description of the behavior observed in the event log. A possible implementation of this stopping criterion is sketched after this list.
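As a possible way to operationalize this stopping criterion, the following sketch ties together the two helpers introduced earlier (suggest_macro_sequence and most_frequent_uncovered). Here compute_mapping is a hypothetical stand-in for Algorithm 4, assumed to return the best mapping, its coverage, and the corresponding coverage predicate; this is not necessarily the actual interface of the algorithm.

def enhance_until_threshold(event_log, macro_model, compute_mapping,
                            threshold=0.99):
    """Add suggested macro-sequences until coverage reaches the threshold.

    macro_model:     list of macro-sequences (lists of activities).
    compute_mapping: stand-in for Algorithm 4, assumed to return
                     (mapping, coverage, is_covered).
    """
    mapping, cov, is_covered = compute_mapping(event_log, macro_model)
    while cov < threshold:
        micro_seq = most_frequent_uncovered(event_log, is_covered)
        if micro_seq is None:
            break  # nothing left to cover
        suggestion = suggest_macro_sequence(micro_seq, mapping)
        if suggestion is None or suggestion in macro_model:
            break  # no usable suggestion; leave the decision to the analyst
        macro_model.append(suggestion)
        mapping, cov, is_covered = compute_mapping(event_log, macro_model)
    return macro_model, mapping, cov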

Fig. 20 Time and coverage vs. number of macro-sequences in the macro-model

5 Related work

The gap between high-level activities and low-level events is a well-known problem in the field of process mining (Greco et al. 2005; Günther and Van der Aalst 2007; Günther et al. 2010; Jagadeesh Chandra Bose et al. 2012). Despite the wide range of process mining techniques that have been developed, most of them discover process models at the same level of abstraction as the events recorded in the event log, i.e. they assume a one-to-one mapping between events and activities that is known a priori. However, end users need solutions to analyze event data and to visualize the results at a higher level of abstraction, preferably at the same level of abstraction as their process models.

The research community has been looking into this problem and, while it is still a topic of ongoing research, some approaches have already been proposed to produce more abstract models from an event log. These approaches can be divided into two main groups:

  • First, there are techniques that work on the basis of models, by extracting a low-level model from the event log and then creating more abstract representations of that model. Examples are Greco et al. (2005) and Günther and Van der Aalst (2007). Basically, these techniques work by aggregating nodes in the model, in order to produce a simplified and more abstract picture of the process. In general, these approaches allow a stepwise simplification of the process until, in the limit, everything is aggregated into a single node. It is the end user who must know how far to carry the simplification in order to obtain meaningful results. A disadvantage of these approaches is that it is not possible to automatically identify aggregated nodes as meaningful business activities (they are simply labeled as “Cluster A”, “Cluster B”, etc.), so it may be difficult for the end user to understand and analyze the results.

  • Second, there are techniques that work on the basis of the events, by translating the event log into a more abstract sequence of events and then producing a model from that translated event log. Examples are Günther et al. (2010) and Jagadeesh Chandra Bose et al. (2012). Basically, these techniques work by identifying frequent patterns of events in the event log, and then replacing each of these patterns with a single, higher-level event. After all substitutions have been made, the event log becomes a sequence of more abstract events. As a final step, it is possible to extract a model with standard techniques, such as the α-algorithm (Van der Aalst et al. 2004), the heuristics miner (Weijters et al. 2006), or the genetic miner (De Medeiros and Weijters 2005). As with other approaches, it is possible to perform this abstraction in multiple stages, but there is no guarantee that the patterns of events identified in the event log correspond to meaningful business activities, so it is up to the end user to determine whether such a correspondence actually exists. A schematic illustration of the pattern-substitution step is sketched after this list.
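The sketch below is a purely schematic illustration of the pattern-substitution idea in this second group of techniques, not a reproduction of any of the cited algorithms; the pattern table and the greedy scanning order are invented for illustration.

def abstract_trace(trace, patterns):
    """Rewrite a trace by replacing known low-level patterns with
    abstract (higher-level) events.

    trace:    list of event types.
    patterns: dict from a tuple of event types to an abstract event
              name; the contents are illustrative only.
    """
    result, i = [], 0
    # Greedy left-to-right scan, trying longer patterns first.
    ordered = sorted(patterns, key=len, reverse=True)
    while i < len(trace):
        for pattern in ordered:
            if tuple(trace[i:i + len(pattern)]) == pattern:
                result.append(patterns[pattern])
                i += len(pattern)
                break
        else:
            result.append(trace[i])  # no pattern matches: keep the event
            i += 1
    return result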

The problem with these approaches is that, although they are able to create new layers of abstraction over an event log, in general they do not take into account that there may already exist an abstract notion of the business process in the form of a process model, or in another form (e.g. a flowchart or a procedure manual) that can be translated into a process model. This information can be a valuable input for the analysis, since it identifies the high-level activities that one should use when building abstractions over an event log. Our approach differs by taking this information into account.

On the other hand, there has been a surge of interest in techniques that are based on aligning the events in an event log with the activities in a process model (Van der Aalst et al. 2012; De Leoni et al. 2012b; De Leoni et al. 2012a; De Leoni and Van der Aalst 2013b, a). The concept of mapping that we use here is somewhat related to that kind of alignment, in the sense that an alignment also establishes a correspondence between events and activities. Also, there is a search space of possible alignments that is somewhat analogous to the search space of possible mappings. However, in those alignments the process model is at the same level of abstraction as the event log, whereas in the present work we use the concept of mapping to bridge the gap between two different levels of abstraction. A further difference is that we focus on model enhancement, whereas the works cited above focus either on discovery (Greco et al. 2005; Günther and Van der Aalst 2007; Günther et al. 2010; Jagadeesh Chandra Bose et al. 2012; De Leoni and Van der Aalst 2013b) or conformance (Van der Aalst et al. 2012; De Leoni et al. 2012b; De Leoni et al. 2012a; De Leoni and Van der Aalst 2013a).

Another branch of work that can be related to the present approach is the generation of recommendations in the context of process modeling (Hornung et al. 2008; Schonenberg et al. 2008; Vanderfeesten et al. 2008; Koschmider et al. 2009; Koschmider et al. 2011). Of special interest is the work of Schonenberg et al. (2008), where the recommendations are based on the past history of the process, as recorded in an event log. However, once again there is a one-to-one mapping between the events in the log and the activities in the process. In contrast, our suggestions for new macro-sequences are based on a mapping which must be mined from the event log and from a higher-level model of the process. This model may be inaccurate at first, but through the addition of the suggested macro-sequences, or other macro-sequences that the analyst finds appropriate, the model becomes able to cover more and more of the event log.

6 Conclusion

In this work we have shown that it is possible to enhance a given process model with information about the run-time behavior of the process, as recorded in an event log. More importantly, we have shown that it is possible to do this even when the relationship between the activities in the process model and the events in the log is unknown.

Assuming that each event can be mapped to a single activity, we introduced the concept of mapping and we showed how to mine the set of possible mappings between a given micro-sequence and a given macro-sequence (Algorithm 1). Then we generalized this procedure in order to find the set of possible mappings between all the micro-sequences in the event log and all the macro-sequences in the process model (Algorithm 2).
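Under the assumption that each activity produces one contiguous, non-empty run of events, the enumeration behind Algorithm 1 can be sketched as follows; this is our reading of the procedure, not a transcription of it, and the helper name is invented. For a micro-sequence of length n and a macro-sequence of length k there are at most C(n-1, k-1) candidate segmentations, which is why the selection strategy discussed next matters.

from itertools import combinations

def possible_mappings(micro_seq, macro_seq):
    """Enumerate the consistent mappings under which macro_seq could
    have generated micro_seq, assuming each activity produced one
    contiguous, non-empty segment of micro_seq.
    """
    n, k = len(micro_seq), len(macro_seq)
    mappings = []
    # Choose k-1 cut points that split micro_seq into k non-empty segments.
    for cuts in combinations(range(1, n), k - 1):
        bounds = [0, *cuts, n]
        mapping, consistent = {}, True
        for j, activity in enumerate(macro_seq):
            for event in micro_seq[bounds[j]:bounds[j + 1]]:
                # An event type must not be assigned to two activities.
                if mapping.setdefault(event, activity) != activity:
                    consistent = False
                    break
            if not consistent:
                break
        if consistent and mapping not in mappings:
            mappings.append(mapping)
    return mappings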

In general, the set of possible mappings can be large, and one should focus on the mapping which has the highest coverage, meaning that the chosen mapping should be applicable to the largest possible number of traces in the event log (Algorithm 3). However, coverage is not the only factor to take into account, since a mapping where all events are mapped to the same activity will have 100 % coverage but will be of no practical interest.
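Under one plausible reading (a mapping covers a trace when the trace's image under the mapping, with consecutive repetitions collapsed, is one of the model's macro-sequences), coverage can be computed as below, reusing suggest_macro_sequence from the earlier sketch. This is a simplified stand-in for the computation behind Algorithm 3, not its actual definition.

def coverage(mapping, event_log, macro_model):
    """Fraction of traces covered by the mapping (simplified reading:
    a trace is covered when its collapsed image under the mapping is
    one of the model's macro-sequences)."""
    if not event_log:
        return 0.0
    covered = sum(1 for trace in event_log
                  if suggest_macro_sequence(list(trace), mapping)
                  in macro_model)
    return covered / len(event_log)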

The range of a mapping, i.e. the set of activities that appear in the mapping, is an even more important factor. Therefore, we first select the mappings with the largest range, and from those we select the mappings with the highest coverage. If possible, we combine (i.e. merge) mappings in order to create a new mapping with an even larger range and higher coverage. There may be many such possible combinations, so Algorithm 4 adopts a greedy approach, where it tries to merge those mappings which, by themselves, already appear to have the largest range and highest coverage among their peers.
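The selection and merging strategy can be sketched as follows, where the range of a mapping is taken as the number of distinct activities it uses, and two mappings are compatible when they never assign the same event type to different activities. The scoring and the tie-breaking are our assumptions about the greedy step in Algorithm 4, not its exact behavior.

def greedy_merge(candidates, event_log, macro_model):
    """Greedily merge compatible mappings, preferring larger range
    and, among equals, higher coverage.

    candidates: list of mappings (dicts from event type to activity).
    """
    def score(m):
        # Range first (number of distinct activities), then coverage.
        return (len(set(m.values())), coverage(m, event_log, macro_model))

    merged = {}
    for candidate in sorted(candidates, key=score, reverse=True):
        # Compatible: no event type is reassigned to a different activity.
        if all(merged.get(event, act) == act
               for event, act in candidate.items()):
            merged.update(candidate)
    return merged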

In practice, such a greedy approach is enough to find a mapping that covers a large fraction of the event log. For those traces which are not covered by the mapping, it is possible to generate a suggestion in the form of a macro-sequence which, if added to the process model, will allow the mapping to increase its coverage. These suggestions, or whatever macro-sequences the analyst decides to add, are the basis for enhancing the process model.

In a case study involving a real-world event log from the BPI Challenge 2013, we have shown how the successive inclusion of the suggested macro-sequences contributed to reaching a model that captures the run-time behavior of the process in terms of the high-level activities used to document that process. We have also highlighted the fact that the approach scales reasonably well as the number of macro-sequences increases, and that it is able to deal with noise by eventually covering all traces in the event log.