
1 Introduction

More and more processes executed in companies are supported by information systems, which record each executed event in a so-called event log. For each event, such an event log typically contains a name identifying the executed activity, the respective case specifying the execution instance of the process, the time when the event was observed, and often additional data related to the activity and/or process instance. An example event log is shown in Fig. 1.

In the context of process mining, many algorithms and software tools have been developed to utilize the data contained in event logs: in conformance checking, the goal is to determine whether the behavior given by a process model complies with the behavior recorded in an event log. In process enhancement, existing process models are improved. Finally, in process discovery, a process model is constructed that aims to reflect the behavior defined by the given event log: the observed events are put into relation to each other; preconditions, choices, concurrency, etc. are discovered and brought together in a model, e.g. a Petri net.

Fig. 1.

Excerpt of an example event log. The two visible cases correspond to the activity sequences and .

Process discovery is non-trivial for a variety of reasons. The behavior recorded in an event log cannot be assumed to be complete, since behavior allowed by the process specification might simply not have happened yet. Additionally, real-life event logs often contain rare patterns, caused either by infrequent behavior or by logging errors. Especially the latter should not be taken into account when discovering the process model, but finding a balance between filtering out noise and keeping all desired information is often a non-trivial task. Ideally, a discovered model should be able to produce the behavior contained in the event log, not allow for behavior that was not observed, represent all dependencies between the events, and at the same time be uncomplicated enough to be understood by a human interpreter. It is rarely possible to fulfill all these requirements simultaneously. Depending on the capabilities and focus of the algorithm used, the discovered models can vary greatly. Often there is no single true model; instead, a trade-off between the aspects noted above has to be found. In Sect. 2 we give an overview of work related to our paper. For a detailed introduction to process mining we refer the interested reader to [1].

In this paper we suggest an algorithm inspired by language-based regions, which guarantees finding a model defining the minimal language containing the input language [2]. Due to our assumptions, the usually infinite set of all possible places is finite. In contrast to prominent discovery approaches based on language-based regions (see Sect. 2), we do not use integer linear programming to find the subset of fitting places. Instead, we replay the event log on candidate places to evaluate whether they are fitting. We achieve this by playing the token game for each trace in the log, and then utilizing the results to skip uninteresting sections of the search space as suggested in [3]. In contrast to the brute-force approach of evaluating every single candidate place, our technique drastically increases the efficiency of candidate evaluation by combining this skipping of candidates with a smart candidate traversal strategy, while still providing the same guarantees. Additionally, our algorithm lends itself to efficient noise filtering, as well as to other user-definable constraints on the set of fitting places. As a final step, we suggest post-processing the discovered set of fitting places, thereby reducing the complexity of the resulting model and making it interpretable by humans. Altogether, our approach has the potential to combine the capability of discovering complex model structures, typical for region-based approaches, with the ability to handle noise and simplify the model according to user-definable constraints. We illustrate the capabilities of our algorithm by measuring its decrease in computation time compared to the brute-force approach, testing its noise-handling abilities, and illustrating the rediscovery of complex models.

An overview of related work is given in the next section. In the remainder of this paper we provide a detailed description, formalization and discussion of our discovery approach. In Sect. 3 basic notations and definitions are given. We present a detailed motivation and overview of our approach in Sect. 4. Section 5 provides an extensive explanation and formalization. In Sect. 6, we briefly discuss our implementation, including some tweaks and optimizations that can be used to further improve our approach. A comparison to existing discovery algorithms is given in Sect. 7 together with results and evaluation of testing. Finally, we conclude the paper with a summary and suggestion of future work in Sect. 8.

2 Related Work

Process discovery algorithms make use of a variety of formal and informal representations to model the behavior they extract from a given event log. However, the basic idea is similar: based on the event data given by an event log, the different event types are coordinated and ordered using some kind of connection between them. In this paper, we focus on the formal representation by Petri nets, where the event types correspond to transitions and the coordinating connections correspond to places. However, our ideas can be adapted to other representations as well. Discovery algorithms that produce a formal process model can provide desirable formal guarantees for their discovered models, for example, the ability to replay each sequence of events contained in the log, or the ability to re-discover a model based on a sufficient description of its behavior.

As noted above, in process discovery there are several, often conflicting, quality criteria for a discovered model. To decrease computation time and the complexity of the resulting Petri net, many existing discovery algorithms further abstract from a given log to another representation, containing only a fraction of the original information, based on which a formal model is created. These algorithms can rediscover only subclasses of Petri nets, and often the resulting model does not allow for the log to be fully replayed, or allows for much more behavior than the log suggests. Examples are the Alpha Miner variants [4] and the Inductive Mining family [5]. Other miners based on heuristics, such as genetic algorithms or the Heuristic Miner [6], cannot provide such guarantees at all.

By omitting part of the information contained in the log, the miners described above lose the ability to discover complex model structures, most prominently non-free-choice constructs. The task of creating a Petri net that precisely corresponds to a given description of its behavior is known as the synthesis problem and is closely related to region theory [7]. Traditionally, this description was given in the form of a transition system, which was then transformed into a Petri net using state-based region theory [8]. The approach has been adapted for process discovery by developing algorithms, e.g. the FSM Miner, that generate a transition system based on the log, which is then transformed into a Petri net [9, 10]. Other approaches use language-based region theory, where the given description is a language rather than a transition system [11,12,13]. An event log can be directly interpreted as a language. Therefore, language-based regions can be applied directly to synthesize a Petri net from a given event log [2]. Here the basic strategy is to start with a Petri net that has one transition for each activity type contained in the log. Then all places that allow replaying the log are added. The result is a Petri net that defines a language which is a minimal superset of the input language. Currently such algorithms, most prominently the ILP Miner, are based on integer linear programming [14,15,16]. However, available implementations make use of an abstraction of the log to increase performance, thus losing their ability to find all possible places.

Fig. 2.

There exists a variety of discovery algorithms that are able to mine a process model based on the log from Fig. 1. As illustrated, the resulting Petri nets differ significantly between the algorithms. In particular, the places connecting A to D and B to E, which ensure that the first choice implies the second choice, are rarely discovered by existing algorithms.

In contrast to most other discovery algorithms, region-based approaches guarantee that the model can replay the complete event log, and at the same time does not allow for much additional behavior. In particular, complex model structures like non-free-choice constructs are reliably discovered. On the downside, region-based discovery algorithms often lead to complex process models that are hard or impossible for the typically human interpreter to understand. They are also known to exhibit severe issues with respect to low-frequent behavior often contained in real-life event logs. Finally, finding all the fitting places out of all possible places tends to be extremely time-consuming.

To illustrate the differences between existing discovery algorithms, in Fig. 2 we show the results of selected discovery algorithms applied to the log shown in Fig. 1. The original model that produced the log can be rediscovered by the approach suggested in this paper. The Alpha Miner, Inductive Miner and ILP Miner cannot discover this model because they are restricted to mine only for certain structures and/or a subset of possible places. In particular, the implication of the second choice (C or D) by the first choice (A or B) is not discovered by any of these algorithms.

3 Basic Notation, Event Logs and Process Models

Throughout our paper we will use the following notations and definitions: A set, e.g. \(\{ {a, b, c} \}\), does not contain any element more than once, while a multiset, e.g. \([ {a, a, b, a} ] = [ {a^3, b} ] \), may contain multiples of the same element. By \(\mathbb {P}({X})\) we refer to the power set of the set X, and \(\mathbb {M}({X})\) is the set of all multisets over this set. In contrast to sets and multisets, where the order of elements is irrelevant, in sequences the elements are given in a certain order, e.g. \(\langle {a, b, a, b}\rangle \ne \langle {a, a, b, b}\rangle \). We refer to the i-th element of a sequence \(\sigma \) by \(\sigma (i)\). The size of a set, multiset or sequence X, that is \({|{X}|}\), is defined to be the number of elements in X.

We define activities, traces, and logs as usual, except that we require each trace to begin with a designated start activity and end with a designated end activity. Since process models, in general, have a start and end, this is a reasonable assumption. It implies that in any discovered model all places are intermediate places that are not part of an initial or final marking. Thus, every candidate place we consider during execution of our algorithm does not contain any initial tokens. This greatly simplifies the presentation of our work. Note that any log can easily be transformed accordingly.

Definition 1

(Activity, Trace, Log). Let \(\mathcal {A}\) be the universe of all possible activities (actions, events, operations, ...), let \(\blacktriangleright \in \mathcal {A}\) be a designated start activity and let \(\blacksquare \in \mathcal {A}\) be a designated end activity. A trace is a sequence containing \(\blacktriangleright \) as the first element, \(\blacksquare \) as the last element and in-between elements of \(\mathcal {A} \backslash \{ {\blacktriangleright, \blacksquare} \} \). Let \(\mathcal {T}\) be the set of all such traces. A log \(L_{} \in \mathbb {M}({\mathcal {T}}) \) is a multiset of traces.
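To make the definition concrete, a log can be represented as a multiset of traces, for instance with Python's `collections.Counter`. This is a minimal sketch; the activity names and the markers `"S"`/`"E"`, standing in for the designated start and end activities, are our own illustrative choices.

```python
from collections import Counter

# Illustrative encoding: a trace is a tuple of activity names, beginning
# with the designated start activity "S" and ending with the end activity
# "E"; the activity names are made up for this example.
log = Counter({
    ("S", "a", "c", "E"): 3,  # this trace was observed three times
    ("S", "b", "c", "E"): 2,  # this trace was observed twice
})

# A log is a multiset of traces: the same trace may occur several times.
print(sum(log.values()))  # 5 traces in total
total_events = sum(len(trace) * count for trace, count in log.items())
print(total_events)       # 3*4 + 2*4 = 20 events
```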

In the following definition of Petri nets, note that we require the set of transitions to correspond to a subset of the universe of activities. Therefore, our Petri nets are free of silent or duplicate transitions. In combination with not having to deal with markings, this results in a finite set of candidate places. Also note that we do not use weighted arcs, and can therefore assume the arcs to be implicitly defined by the sets of ingoing and outgoing transitions of the places. This definition of a subset of Petri nets, natural with respect to our discovery algorithm, removes a lot of notational overhead (Fig. 3).

Definition 2

(Petri nets). A Petri net is a pair \(N = ( {A_{}, \mathcal {P}_{}}) \), where \(A_{} \subseteq \mathcal {A} \) is the set of transitions, and \(\mathcal {P}_{} \subseteq \{ {({I} | {O}) \; | \; I \subseteq A_{} \wedge I \ne \emptyset \wedge O \subseteq A_{} \wedge O \ne \emptyset } \} \) is the set of places. We call I the set of ingoing activities of a place and O the set of outgoing activities.

Places that do not allow perfect replay of the given log can be unfitting in two ways: If at some point during replay there is no token at the place, but the log requires the consumption of a token anyway, we call the place underfed. If at the end of a replayed trace there is at least one token left at the place, we call the place overfed. This categorization has been extensively discussed in [3] and is the key to our efficient candidate traversal: as detailed in Sect. 4, by evaluating one place to be underfed (overfed) we can determine a whole set of places to be underfed (overfed), without even looking at them.

Definition 3

(Overfed/Underfed/Fitting Places, see [3]). Let \(N=( {A, \mathcal {P}_{}}) \) be a Petri net, let \(p_{} = ({I} | {O}) \in \mathcal {P}_{} \) be a place, and let \(\sigma _{} \) be a trace. With respect to the given trace \(\sigma _{}\), \(p_{}\) is called

  • underfed, denoted by \(\bigtriangledown ^{}_{\sigma _{}}{(p_{})}\), if and only if \(\exists k \in \{ {1, 2, \ldots , {|{\sigma _{}}|}} \} \) such that \({|{\{ {i \; | \; i \in \{ {1, 2, \ldots , k-1} \} \wedge \sigma _{} (i) \in I} \}}|} < {|{\{ {i \; | \; i \in \{ {1, 2, \ldots , k} \} \wedge \sigma _{} (i) \in O} \}}|} \),

  • overfed, denoted by \(\bigtriangleup ^{}_{\sigma _{}}{(p_{})}\), if and only if \({|{\{ {i \; | \; i \in \{ {1, 2, \ldots , {|{\sigma _{}}|}} \} \wedge \sigma _{} (i) \in I} \}}|} > {|{\{ {i \; | \; i \in \{ {1, 2, \ldots , {|{\sigma _{}}|}} \} \wedge \sigma _{} (i) \in O} \}}|} \),

  • fitting, denoted by \(\Box ^{}_{\sigma _{}}{(p_{})}\), if and only if not \(\bigtriangledown ^{}_{\sigma _{}}{(p_{})}\) and not \(\bigtriangleup ^{}_{\sigma _{}}{(p_{})}\).

Note that a place can be underfed and overfed at the same time.
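Definition 3 can be checked by a single pass over the trace. The following sketch plays the token game for one place and one trace; the encoding of a place as a pair of frozensets and the activity names are illustrative assumptions, not part of the original formalization.

```python
def replay(trace, place):
    # Token game of Definition 3 for one place (I | O): the counts in the
    # definition amount to consuming a token before producing one at each
    # position of the trace.
    ingoing, outgoing = place
    tokens, underfed = 0, False
    for activity in trace:
        if activity in outgoing:
            if tokens == 0:
                underfed = True   # consumption required from an empty place
            tokens -= 1
        if activity in ingoing:
            tokens += 1
    overfed = tokens > 0          # tokens remain after the whole trace
    return underfed, overfed

trace = ("S", "a", "b", "E")  # activity names made up for this example
print(replay(trace, (frozenset({"S"}), frozenset({"a"}))))  # (False, False): fitting
print(replay(trace, (frozenset({"b"}), frozenset({"a"}))))  # (True, False): underfed
```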

Definition 4

(Behavior of a Petri net). We define the behavior of the Petri net \(( {A, \mathcal {P}_{}})\) to be the set of all fitting traces, that is \(\{ {\sigma _{} \in \mathcal {T} \; | \; \forall p_{} \in \mathcal {P}_{} :\Box ^{}_{\sigma _{}}{(p_{})}} \}\).

Fig. 3.

Example of a Petri net \(N=( {A_{}, \mathcal {P}_{}}) \) with transitions and places . The behavior of N is the set of fitting traces . A possible place \(({\{ {b} \}} | {\{ {a} \}})\) is underfed with respect to the trace .

4 Algorithmic Framework

As input our algorithm takes a log \(L_{}\), that can be directly interpreted as a language, and a parameter \(\tau \in [0,1]\). This parameter \(\tau \) in principle determines the fraction of the log that needs to be replayable by a fitting place, and is essential for our noise handling strategy. We provide details on this threshold at the end of this section.

Inspired by language-based regions, the basic strategy of our approach is to begin with a Petri net whose transitions correspond to the activity types used in the given log. From the finite set of unmarked, intermediate places \(\mathcal {P}_{\text {all}}\) we want to select a subset \(\mathcal {P}_{\text {final}}\), such that the resulting net defines the minimal language containing the input language, while, for human readability, using only a minimal number of places to do so. Note that by filtering the log for noise the definition of the desired language becomes less rigorous, since the allegedly noisy parts of the input language are ignored, and thus the aforementioned property can no longer be guaranteed.

We achieve this aim by applying several steps detailed in Sect. 5. First of all, we observe that the set \(\mathcal {P}_{\text {all}}\) contains several places that can never be part of a solution, independently of the given log. By ignoring these places we reduce our set of candidates to \(\mathcal {P}_{\text {cand}} \subseteq \mathcal {P}_{\text {all}} \). Next, we apply the main step of our approach: while utilizing the token-game to skip large parts of the candidate space, we actually evaluate only a subset of candidates, \(\mathcal {P}_{\text {visited}} \subseteq \mathcal {P}_{\text {cand}} \). We can guarantee that the set of fitting places is a subset of these evaluated places, that is \(\mathcal {P}_{\text {fit}} \subseteq \mathcal {P}_{\text {visited}} \). Finally, aiming to achieve a model that is interpretable by human beings, we reduce this set of fitting places to a set of final places \(\mathcal {P}_{\text {final}} \subseteq \mathcal {P}_{\text {fit}} \) by removing superfluous places.

The main challenge of our approach lies in the size of the candidate space: there are \({|{\mathcal {P}_{\text {all}}}|} = {|{(\mathbb {P}({A_{}}) \backslash \{ \emptyset \}) \times (\mathbb {P}({A_{}}) \backslash \{ \emptyset \})}|} \approx (2^{{|{A_{}}|}})^2\) possible places to be considered. Keeping all of them in memory, and even more so replaying the log for this exponentially large number of candidates, quickly becomes infeasible, even for comparably small numbers of activities. Reducing the set \(\mathcal {P}_{\text {all}}\) to \(\mathcal {P}_{\text {cand}}\) is by far not a sufficient improvement.

Towards a solution to this performance issue, we propose an idea allowing us to drastically reduce the number of traversed candidate places, while still providing the complete set \(\mathcal {P}_{\text {fit}}\) as outcome. The monotonicity results on Petri net places introduced in [3] form the basis of our approach. Intuitively, if a candidate place \(p_{1} = ({I_1} | {O_1}) \) is underfed with respect to some trace \(\sigma _{}\), then at some point during the replay of \(\sigma _{}\) there are not enough tokens in \(p_{1}\). By adding another outgoing arc to \(p_{1}\) connecting it to some transition \(a \notin O_1\) we certainly do not increase the number of tokens in the place, and therefore the resulting place \(p_{2} = ({I_1} | {O_1 \cup \{ {a} \}}) \) must be underfed as well. Thus, by evaluating \(p_{1}\) to be underfed we can infer that all candidates \(({I_1} | {O_2})\) with \(O_1 \subseteq O_2\) are underfed as well, without having to evaluate them. A similar reasoning can be applied to overfed places. This is formalized in Lemma 1. For more details, we refer the reader to the original paper [3].

Lemma 1

(Monotonicity Results (see [3])). Let \(p_{1}\) =\(({I_1} | {O_1})\) be a place and let \(\sigma _{}\) be a trace. If \(\bigtriangledown ^{}_{\sigma _{}}{(p_{1})}\), then \(\bigtriangledown ^{}_{\sigma _{}}{(p_{2})}\) for all \(p_{2} =({I_2} | {O_2}) \) with \(I_1 \supseteq I_2\) and \(O_1 \subseteq O_2\). If \(\bigtriangleup ^{}_{\sigma _{}}{(p_{1})}\), then \(\bigtriangleup ^{}_{\sigma _{}}{(p_{2})}\) for all \(p_{2} =({I_2} | {O_2}) \) with \(I_1 \subseteq I_2\) and \(O_1 \supseteq O_2\).

As detailed in [3], these monotonicity results allow us to determine a whole set of places to be unfitting by evaluating a single unfitting candidate. Combining this idea with the candidate traversal strategy presented in Sect. 5 allows us to skip most unfitting places when traversing the candidate space, without missing any other, possibly interesting place. It is important to note that we do not simply skip the replay of the log, but actually do not traverse these places at all. This greatly increases the performance of our algorithm.
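The monotonicity argument can be observed directly in the token game. The sketch below (the toy trace and activity names are our own) shows that once a place is underfed, adding further outgoing activities keeps it underfed:

```python
def is_underfed(trace, ingoing, outgoing):
    # Token game as in Definition 3, reporting only "underfed".
    tokens, underfed = 0, False
    for activity in trace:
        if activity in outgoing:
            if tokens == 0:
                underfed = True
            tokens -= 1
        if activity in ingoing:
            tokens += 1
    return underfed

trace = ("S", "a", "b", "c", "E")  # toy trace, names made up
assert is_underfed(trace, {"b"}, {"a"})            # (b | a) is underfed
# Extending the outgoing set never adds tokens, so the place stays underfed:
assert is_underfed(trace, {"b"}, {"a", "c"})
assert is_underfed(trace, {"b"}, {"a", "c", "E"})
print("monotonicity confirmed on this trace")
```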

Setting our algorithm apart from other region-based approaches is its ability to directly integrate noise filtering. When evaluating a visited place, we refer to a user-definable parameter \(\tau \) as detailed in the following definition:

Definition 5

(Fitness with Respect to a Threshold). With respect to a given log \(L_{}\) and threshold \(\tau \), we consider a place \(p_{}\) =\(({I} | {O})\) to be

  • fitting, that is \(\Box ^{\tau }_{L_{}}{(p)}\), if and only if \( \frac{ {|{\{ { \sigma _{} \in L_{} \; | \; \Box ^{}_{\sigma _{}}{(p)} \wedge \sigma \cap (I \cup O) \ne \emptyset } \} }|}}{{|{ \{ {\sigma \in L_{} \; | \; \exists a \in \sigma _{} :a \in (I \cup O)} \} }|} } \ge \tau \),

  • underfed, that is \(\bigtriangledown ^{\tau }_{L_{}}{(p)}\), if and only if \( \frac{ {|{\{ { \sigma _{} \in L_{} \; | \; \bigtriangledown ^{}_{\sigma _{}}{(p)} } \} }|}}{{|{ \{ {\sigma \in L_{} \; | \; \exists a \in \sigma _{} :a \in (I \cup O)} \} }|} } >(1-\tau )\),

  • overfed, that is \(\bigtriangleup ^{\tau }_{L_{}}{(p)}\), if and only if \( \frac{ {|{\{ { \sigma _{} \in L_{} \; | \; \bigtriangleup ^{}_{\sigma _{}}{(p)} } \}}|} }{{|{ \{ {\sigma \in L_{} \; | \; \exists a \in \sigma _{} :a \in (I \cup O)} \} }|} } >(1-\tau )\).

Intuitively, a place \(p_{}\) is fitting/underfed/overfed with respect to \(L_{}\) and \(\tau \), if it is fitting/underfed/overfed with respect to a certain fraction of the traces in \(L_{}\) that involve the activities of \(p_{}\). This fraction is determined by the threshold \(\tau \). By defining the value of \(\tau \), the user of our algorithm can choose to ignore a fraction of traces when evaluating the places, making the result much more robust with respect to infrequent behavior. If \(\tau = 1\), then all places are required to be perfectly fitting. In [3] it is shown that Lemma 1 can be extended to the use of such a threshold. Despite our slightly modified definition, their proof remains valid. The impact of different values of \(\tau \) will be investigated in Sect. 7.
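Definition 5 can be sketched as follows; the example log, the activity names, and the chosen \(\tau = 0.75\) are illustrative assumptions. Note that only traces involving at least one activity of the place enter the denominator:

```python
from collections import Counter

def replay(trace, ingoing, outgoing):
    # Token game of Definition 3.
    tokens, underfed = 0, False
    for activity in trace:
        if activity in outgoing:
            if tokens == 0:
                underfed = True
            tokens -= 1
        if activity in ingoing:
            tokens += 1
    return underfed, tokens > 0

def classify(log, ingoing, outgoing, tau):
    # Definition 5: only traces involving an activity of the place count.
    relevant = fit = under = over = 0
    for trace, count in log.items():
        if not (ingoing | outgoing).intersection(trace):
            continue
        relevant += count
        u, o = replay(trace, ingoing, outgoing)
        under += count * u
        over += count * o
        fit += count * (not u and not o)
    return {
        "fitting": relevant > 0 and fit / relevant >= tau,
        "underfed": relevant > 0 and under / relevant > 1 - tau,
        "overfed": relevant > 0 and over / relevant > 1 - tau,
    }

# The place (a | b) fits 4 of its 5 relevant traces, so it is accepted
# at tau = 0.75 although one trace cannot replay it.
log = Counter({("S", "a", "b", "E"): 4, ("S", "b", "E"): 1})
print(classify(log, {"a"}, {"b"}, tau=0.75))
```

With \(\tau = 1\) the same place is rejected, since perfect fitness is then required.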

5 Computing a Desirable Subset of Places

As input, our algorithm expects a log \(L_{}\), and a user-definable parameter \(\tau \in [0,1]\). The activities contained in \(L_{}\), \(A_{} \subseteq \mathcal {A} \), define the set of transitions of the Petri net we want to discover. These define a finite set of unmarked, intermediate places \(\mathcal {P}_{\text {all}}\), that we could insert, as the starting point of our algorithm.

5.1 Pre-pruning of Useless Places

Within the set \(\mathcal {P}_{\text {all}}\), there are many candidates that are guaranteed to be unfitting, independently of the given log. These are all places \(({I} | {O})\) with \({\blacktriangleright } \in O\) or \({\blacksquare } \in I\), for the designated start and end activities \(\blacktriangleright \) and \(\blacksquare \). By completely excluding such places from our search space, we remain with a much smaller set of candidates \(\mathcal {P}_{\text {cand}} \subseteq \mathcal {P}_{\text {all}} \): For a set of activities \(A_{}\) the number of candidates is bounded by \(2^{|{A_{}}|} \times 2^{|{A_{}}|} \). By removing all places with \({\blacktriangleright } \in O\) or \({\blacksquare } \in I\), we effectively decrease the size of the activity set \(A_{}\) by one for each of the ingoing and outgoing activity sets. The new bound on the number of candidates is \(2^{{|{A_{}}|}-1} \times 2^{{|{A_{}}|}-1} \), thus reducing the size of the candidate space to \(25\%\) of the original.
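The claimed reduction can be checked numerically. The sketch below compares the trivial bound with the pruned bound, counting the nonempty subsets explicitly, so the exact counts lie slightly below the \(2^{|{A}|} \times 2^{|{A}|}\) bounds and the ratio approaches \(25\%\) from below:

```python
def candidates(n_activities, pruned):
    # Number of candidate places (I | O) with nonempty I and O. Pruning
    # forbids the end activity in I and the start activity in O, which
    # shrinks both ground sets by one element each.
    n = n_activities - 1 if pruned else n_activities
    nonempty_subsets = 2 ** n - 1
    return nonempty_subsets ** 2

for n in (4, 6, 8):
    full, kept = candidates(n, False), candidates(n, True)
    print(n, full, kept, round(kept / full, 3))
```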

5.2 Developing an Efficient Candidate Traversal Strategy

We illustrate our idea with the help of a running example, where \(A_{} = \{ {\blacktriangleright}, a, b, {\blacksquare} \} \) is the given set of activities. The corresponding pre-pruned candidate space is visualized in Fig. 4. The organization and coloring are chosen to clarify the relations between the candidates, which we are going to utilize to further prune the candidate space.

Fig. 4.

Example of the candidate space based on four activities \({\blacktriangleright }, a, b\) and \({\blacksquare }\). The horizontal levels and colored edges indicate the relations between the activity sets: each candidate place \(({I_1} | {O_1})\) is connected to a place \(({I_2} | {O_2})\) from the level above by a blue line if \(I_1=I_2\) and \(O_1 \supseteq O_2\), and by a red line if \(I_1 \supseteq I_2\) and \(O_1=O_2\).

Our strategy for traversing the set of candidates is the key to the effective utilization of the described monotonicity results. Since we cannot efficiently store and retrieve an exponential amount of candidate places, we need a scheme to deterministically compute the next place we want to evaluate based on a limited set of information we keep in storage. This scheme should at the same time guarantee that all fitting places are visited, each place is visited only a limited number of times (preferably at most once), and we are able to prune the search space by employing the results obtained by previous place evaluations. In the following, we are going to develop such a candidate traversal scheme.

5.3 Organization of the Candidate Space as Trees

We organize the candidate space in a set of trees, as described in Definition 6. An example is shown in Fig. 5. Note that there are many ways to organize the candidates in such a structure; this example shows merely one of these possibilities.

Let \(A_{} \subseteq \mathcal {A} \) be the given set of activities and let \(>_i\) and \(>_o\) be two total orderings on \(A_{}\). In the remainder, we assume all sets of ingoing activities of a place to be ordered lexicographically according to \(>_i\), and sets of outgoing activities according to \(>_o\). Possible strategies of computing such orderings are noted in Sect. 6.

Definition 6

(Complete Candidate Tree). A complete candidate tree is a pair \(CT=( {N, F}) \) with \(N=\mathcal {P}_{\text {cand}} \). We have that \(F= F_\text {red} \cup F_\text {blue}\), with

\(F_\text {red} = \{ { ( {({I_1} | {O}), ({I_2} | {O})}) \; | \; {|{O}|} =1 \wedge {|{I_1}|} \ge 2 \wedge I_2 = I_1 \backslash {\{ {\max _{>_i}(I_1)} \}} } \} \) and

\(F_\text {blue} = \{ { ( {({I} | {O_1}), ({I} | {O_2})}) \; | \; {|{O_1}|} \ge 2 \wedge O_2 = O_1 \backslash {\{ {\max _{>_o}(O_1)} \}} } \} \).

If \(( {({I_1} | {O_1}), ({I_2} | {O_2})}) \in F\), we call the candidate \(({I_1} | {O_1})\) the child of its parent \(({I_2} | {O_2})\). A candidate \(({I} | {O})\) with \({|{I}|} ={|{O}|} =1\) is called a base root.

Note, that there is a certain asymmetry in the definition of \(F_\text {red}\) and \(F_\text {blue}\): red edges connect only places which have exactly one outgoing transition, while for blue edges the number of outgoing transitions is unlimited. This is necessary to obtain the collection of trees we are aiming for, and which is further investigated below: if we did not restrict one of the edge types in this way, the resulting structure would contain cycles. If we restricted both types of edges, many candidates would not be connected to a base root at all. However, the choice of restricted edge type, red or blue, is arbitrary.

In the following we show that each candidate is organized into exactly one tree, which can be identified by its unique base root. Therefore, the number of connected trees contained in one structure CT as described in Definition 6 is exactly the number of base roots. In our running example (Fig. 5) there are 9 such trees.

Lemma 2

The structure CT described by Definition 6 organizes the candidate space into a set of trees rooted in the base roots, where every candidate is connected to exactly one base root.

Proof

(Lemma 2). Let \(CT=(N, F)\) be the structure defined in Definition 6. We show that every candidate \(({I} | {O}) \in N\) has a unique parent, and, by induction on the number of activities of a candidate, that each candidate is the descendant of exactly one of the base roots. This implies that there are no cycles and the structure is indeed a set of connected trees rooted in the base roots.

If \({|{I \cup O}|} = 2\), the candidate is a base root that cannot have any parents and the claim holds. Now assume that the claim holds for all candidates with \({|{I \cup O}|} =n\).

Consider a candidate \(p_{1} =({I_1} | {O_1}) \) with \({|{I_1 \cup O_1}|} =n+1\). Let \(p_{2} =({I_2} | {O_2}) \) be any potential parent of \(p_{1}\). We distinguish two cases:

Case \({|{O_1}|} =1\): This implies \(I_1=\{ {a_1, a_2, ..., a_{n-1}, a_n} \} \). We have that \(( {p_{1}, p_{2}}) \notin F_\text {blue}\), because, otherwise, we would have \(O_2=\emptyset \). If \(( {p_{1}, p_{2}}) \in F_\text {red}\), then by definition \(O_1=O_2\) and \(I_2=I_1\backslash {\{ {a_n} \}} \).

Case \({|{O_1}|} \ge 2\): This implies \(O_1=\{ {a_1, a_2, ..., a_{k-1}, a_k} \} \), for some \(k \in [2,n-1]\). We have that \(( {p_{1}, p_{2}}) \notin F_\text {red}\), because red edges require \({|{O_1}|} =1\). If \(( {p_{1}, p_{2}}) \in F_\text {blue}\), then by definition we have that \(I_1=I_2\) and \(O_2=O_1\backslash {\{ {a_k} \}} \).

In both cases, \(p_{2} \) is fully defined based on \(p_{1}\) and therefore the unique parent. Since \({|{I_2 \cup O_2}|} =n\), the claim holds for \(p_{2} \) and thus for \(p_{1} \) as well. By induction, the claim holds for all candidates in N.    \(\square \)
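Assuming, purely for illustration, that both \(>_i\) and \(>_o\) are the alphabetical order and that activity sets are kept as sorted tuples, the unique-parent computation from the proof above can be written down directly:

```python
def parent(place):
    # Unique parent of a candidate per the proof of Lemma 2; a place is a
    # pair of alphabetically sorted tuples (I, O). Base roots have no parent.
    ingoing, outgoing = place
    if len(ingoing) == len(outgoing) == 1:
        return None                   # base root
    if len(outgoing) == 1:
        # red edge: drop the maximal ingoing activity
        return (ingoing[:-1], outgoing)
    # blue edge: drop the maximal outgoing activity
    return (ingoing, outgoing[:-1])

print(parent((("a", "b"), ("c",))))   # red edge up: (('a',), ('c',))
print(parent((("a",), ("c", "d"))))   # blue edge up: (('a',), ('c',))
print(parent((("a",), ("c",))))       # base root: None
```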

Fig. 5.

Example of a complete candidate tree based on the activity set \(\{ {\blacktriangleright}, a, b, {\blacksquare} \} \). In reference to Lemma 3, an example subtree of red edges is marked by a red background, while all subtrees of blue edges attached to it are marked by a blue background. Together they form the whole tree rooted in the base root \(({\{ {{\blacktriangleright }} \}} | {\{ {a} \}})\).

Fig. 6.

We illustrate our tree cutting strategy using the tree rooted in the base root \(({\{ {{\blacktriangleright }} \}} | {\{ {a} \}})\) from our running example (Fig. 5). For the example place \(({\{ {{\blacktriangleright }, a} \}} | {\{ {a} \}})\) we highlight the blue-edged subtree (\(T_2\)) and red-edged subtree (\(T_1\)) referred to in Lemma 3. In reference to Lemma 4 we indicate the subtrees attached by blue edges, which coincides with the blue-edged subtree (\(T_2\)), and the subtrees attached by red edges (\(T_3\)). We emphasize that \(T_1 \ne T_3\).

5.4 Tree Traversal and Pruning of Candidate Space

Definition 7

(Tree Traversal Strategy). Let \(L_{}\) be a log, \(\tau \) a threshold, and A the set of activities contained in \(L_{}\). Let \(CT=( {N, F}) \) be the complete candidate tree based on A, with \(N=\mathcal {P}_{\text {cand}} \). We traverse one connected tree after the other, using a depth-first strategy starting on each base root of CT.

Let \(p_{}\) =\(({I} | {O})\) be a visited place. If \(p_{}\) is fitting, i.e. \(\Box ^{\tau }_{L_{}}{(p_{})}\), we add it to \(\mathcal {P}_{\text {fit}}\). If \(p_{}\) is underfed (\(\bigtriangledown ^{\tau }_{L_{}}{(p_{})}\)), we skip all subtrees attached to \(p_{}\) by a blue edge. If \(p_{}\) is overfed (\(\bigtriangleup ^{\tau }_{L_{}}{(p_{})}\)) and \(O=\{ {a} \} \) with \(\not \exists a' \in A_{} :a' >_o a\), we skip all subtrees attached to \(p_{}\) by a red edge. After skipping a subtree this way, we proceed with our traversal as if we had already visited those places.

As formalized in Theorem 1, this strategy allows us to skip sets of unfitting places while guaranteeing that no place in \(\mathcal {P}_{\text {fit}}\) is missed.
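Under the same illustrative encoding as before (traces as tuples, `"S"`/`"E"` as start and end markers, alphabetical orderings for \(>_i\) and \(>_o\)), the traversal of Definition 7 can be sketched as a depth-first search that generates children on the fly and skips pruned subtrees:

```python
from collections import Counter

def replay(trace, ingoing, outgoing):
    # Token game of Definition 3: consume before produce at each position.
    tokens, underfed = 0, False
    for activity in trace:
        if activity in outgoing:
            if tokens == 0:
                underfed = True
            tokens -= 1
        if activity in ingoing:
            tokens += 1
    return underfed, tokens > 0

def classify(log, I, O, tau):
    # Thresholded classification of Definition 5.
    relevant = fit = under = over = 0
    for trace, count in log.items():
        if not (set(I) | set(O)).intersection(trace):
            continue
        relevant += count
        u, o = replay(trace, set(I), set(O))
        under += count * u
        over += count * o
        fit += count * (not u and not o)
    if relevant == 0:
        return False, False, False
    return fit / relevant >= tau, under / relevant > 1 - tau, over / relevant > 1 - tau

def discover(log, tau, start="S", end="E"):
    acts = sorted({a for trace in log for a in trace})
    in_acts = [a for a in acts if a != end]     # pruning: end never ingoing
    out_acts = [a for a in acts if a != start]  # pruning: start never outgoing
    fitting, visited = [], 0

    def visit(I, O):
        nonlocal visited
        visited += 1
        fit, under, over = classify(log, I, O, tau)
        if fit:
            fitting.append((I, O))
        if not under:                 # underfed: skip all blue subtrees
            for a in out_acts:
                if a > O[-1]:
                    visit(I, O + (a,))
        # red children exist only while |O| = 1; skip them when the place is
        # overfed and its single outgoing activity is the >_o-maximal one
        if len(O) == 1 and not (over and O[0] == out_acts[-1]):
            for a in in_acts:
                if a > I[-1]:
                    visit(I + (a,), O)

    for i in in_acts:                 # one depth-first search per base root
        for o in out_acts:
            visit((i,), (o,))
    return fitting, visited

log = Counter({("S", "a", "b", "E"): 3})
places, visited = discover(log, tau=1.0)
print((("a",), ("b",)) in places, visited)
```

The `visited` counter makes the pruning effect observable: it stays below the 49 pre-pruned candidates for this four-activity example.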

Lemma 3

Let \(L_{}, \tau \) be a given log and threshold, \(CT=( {N, F}) \) a complete candidate tree and \(p_{1} \in N\) any candidate. If \(\bigtriangledown ^{\tau }_{L_{}}{(p_{1})}\) (\(\bigtriangleup ^{\tau }_{L_{}}{(p_{1})}\)), then for all candidates \(p_{2} \in N\) within the blue-edged (red-edged) subtree rooted in \(p_{1}\), we have that \(\bigtriangledown ^{\tau }_{L_{}}{(p_{2})}\) (\(\bigtriangleup ^{\tau }_{L_{}}{(p_{2})}\)).

Proof

(Lemma 3). Let \(p_{1} = ({I_1} | {O_1}) \in N\) be a candidate with \(\bigtriangledown ^{\tau }_{L_{}}{(p_{1})}\) (\(\bigtriangleup ^{\tau }_{L_{}}{(p_{1})}\)). Consider any descendant \(p_{2} = ({I_2} | {O_2}) \) attached to \(p_{1}\) via a path of blue (red) edges. From the definition of blue (red) edges (see Definition 6) we infer that \(I_1=I_2\) and \(O_1 \subseteq O_2\) (\(O_1=O_2\) and \(I_1 \subseteq I_2\)). Then from the monotonicity results presented in Lemma 1 it follows that \(\bigtriangledown ^{\tau }_{L_{}}{(p_{2})}\) (\(\bigtriangleup ^{\tau }_{L_{}}{(p_{2})}\)).    \(\square \)

According to Lemma 3, if a place \(p_{}\) is underfed, then the blue-edged subtree rooted in \(p_{}\) contains only underfed places, and if \(p_{}\) is overfed, then the red-edged subtree rooted in \(p_{}\) contains only overfed places. In the following, we try to extend these results to subtrees rooted in \(p_{}\) in general. We will show that for subtrees attached to \(p_{}\) by blue edges this is possible in a straightforward way. Unfortunately, due to the asymmetry of the definition of the edges, the same argument does not always apply to subtrees attached by red edges: here it is entirely possible that there are blue edges within the corresponding subtree, and "overfedness" of \(p_{}\) cannot necessarily be extended to the contained places. This is illustrated in Fig. 6.

Lemma 4

Let \(L_{}, \tau \) be a given log and threshold, \(CT=( {N, F}) \) a complete candidate tree and \(p_{1} = ({I_1} | {O_1}) \in N\) any candidate. If \(\bigtriangledown ^{\tau }_{L_{}}{(p_{1})}\), then for all candidates \(p_{2}\) within the subtrees attached to \(p_{1}\) with a blue edge we have \(\bigtriangledown ^{\tau }_{L_{}}{(p_{2})}\). If \(\bigtriangleup ^{\tau }_{L_{}}{(p_{1})}\) and \(O_1=\{ {a} \} \) with \(\not \exists a' \in A_{} :a' >_o a\), then for all candidates \(p_{2} \) within the subtrees attached to \(p_{1}\) we have \(\bigtriangleup ^{\tau }_{L_{}}{(p_{2})}\).

Proof

(Lemma 4). Assume \(\bigtriangledown ^{\tau }_{L_{}}{(p_{1})}\). Due to the definition of blue edges, it holds for each child \(p_{} '=({I'} | {O'}) \) in a subtree attached to \(p_{1}\) by such an edge that \({|{O'}|} > {|{O_1}|} \), and thus in particular \({|{O'}|} \ge 2\). By definition, all descendants of \(p_{} '\) have at least two outgoing activities. This implies that there is no red edge in the whole subtree, since red edges by definition require at most one outgoing activity. Thus, by Lemma 3, for every place \(p_{2} \) in a subtree rooted in such a \(p_{} '\) we have \(\bigtriangledown ^{\tau }_{L_{}}{(p_{2})}\), and the first claim holds.

Now assume \(\bigtriangleup ^{\tau }_{L_{}}{(p_{1})}\) and \(O_1=\{ {a} \} \) with \(\not \exists a' \in A_{} :a' >_o a\). Due to the definition of red edges, it holds for \(p_{1}\) and each descendant \(p_{} '=({I'} | {O'}) \) of \(p_{1}\) reachable by red edges that \(O'= \{ {a} \} \). This implies that there is no blue edge in the whole subtree rooted in \(p_{1}\), since blue edges require the existence of an activity \(a'\) with \(a <_o a'\). Thus, by Lemma 3, for every place \(p_{2} \) in the subtree rooted in \(p_{1}\), we have \(\bigtriangleup ^{\tau }_{L_{}}{(p_{2})}\).    \(\square \)

Theorem 1

Given a log \(L_{}\), a threshold \(\tau \) and the complete candidate tree \(CT=(N, F)\) with \(N=\mathcal {P}_{\text {cand}} \), let \(\mathcal {P}_{\text {fit}}\) be the set of fitting places with respect to \(L_{}\) and \(\tau \), that is \(\{p \in N \mid \Box ^{\tau }_{L_{}}{(p_{})} \}\). Let \(\mathcal {P}_{\text {visited}}\) be the set of places visited by our tree traversal strategy described in Definition 7. It holds that \(\mathcal {P}_{\text {fit}} \subseteq \mathcal {P}_{\text {visited}} \subseteq \mathcal {P}_{\text {cand}} \).

Proof

(Theorem 1). As proven in Lemma 2, every candidate is contained in exactly one tree rooted in a base root and is thus visited exactly once by standard depth-first search. Thus \(\mathcal {P}_{\text {visited}} \subseteq \mathcal {P}_{\text {cand}} \).

Using depth-first search for tree traversal guarantees that for any visited candidate \(p_{}\), we visit the complete subtree rooted in \(p_{}\) before proceeding to another subtree. Thus skipping a subtree does not influence the traversal of other subtrees. If \(p_{}\) is underfed/overfed, we can apply Lemma 4 to guarantee that all places contained in the skipped subtrees are underfed/overfed as well. Thus no fitting places are skipped, and we have that \(\mathcal {P}_{\text {fit}} \subseteq \mathcal {P}_{\text {visited}} \subseteq \mathcal {P}_{\text {cand}} \).    \(\square \)

As mentioned earlier, we cannot store the complete candidate tree due to its exponential size; thus the challenge of the tree traversal lies in deterministically computing the next candidate based on limited information only. In our algorithm, this information consists only of the last candidate and its fitness status. We define a total ordering on the set of places that is easily computable from the given orderings \(>_i, >_o\). We use this ordering to deterministically select the next subtree to traverse based on the current candidate. The fitness status is used to decide whether to actually traverse the selected subtree or to skip it and select the next one.
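
The skipping traversal can be sketched as follows. This is our own simplified illustration, not the paper's Definition 7: the tree encoding and helper names are ours, and for brevity only the underfed/blue-edge rule of Lemma 4 is applied (the overfed rule requires an additional side condition on the outgoing set).

```python
def traverse(roots, blue_children, red_children, status):
    """Depth-first traversal of the candidate tree with subtree skipping.

    `status(p)` returns a subset of {"underfed", "overfed"}; the empty set
    means p is fitting.  Blue subtrees of an underfed place contain only
    underfed places (Lemma 4) and are skipped; red subtrees may still
    contain fitting places and are always visited.
    """
    fitting = []

    def visit(p):
        s = status(p)
        if not s:
            fitting.append(p)
        if "underfed" not in s:          # skip blue subtrees of an
            for c in blue_children(p):   # underfed place
                visit(c)
        for c in red_children(p):        # always descend along red edges
            visit(c)

    for r in roots:
        visit(r)
    return fitting

# A tiny example tree: p2 is underfed, so its blue child p3 is never visited.
BLUE = {"p1": ["p2"], "p2": ["p3"], "p3": [], "p4": []}
RED = {"p1": ["p4"], "p2": [], "p3": [], "p4": []}
STATUS = {"p1": set(), "p2": {"underfed"}, "p3": {"underfed"}, "p4": set()}
visited = []

def status(p):
    visited.append(p)
    return STATUS[p]

found = traverse(["p1"], BLUE.__getitem__, RED.__getitem__, status)
```

In this example, `found` contains the fitting places `p1` and `p4`, while `p3` is skipped without ever being evaluated.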

While the theoretical worst-case scenario still requires traversing the full candidate space, we have achieved a drastic increase in performance in practical applications. This is presented in detail in Sect. 7.

5.5 Evaluation of Potentially Fitting Candidates

For each place visited during candidate traversal, we need to determine its fitness status. Fitting places are added to the set of fitting places \(\mathcal {P}_{\text {fit}} \), which will be the input for the post-processing step (see Sect. 5.6). Overfed and underfed places are not added; instead, their fitness status is used during candidate traversal to skip sets of unfitting places.

To determine the fitness status of a place \(p_{}\) = \(({I} | {O})\), we use token-based replay. We replay every trace \(\sigma _{} \in L_{} \) on the place \(p_{}\): for each activity \(a \in \sigma _{} \), from first to last, if \(a \in O\) we decrement the number of tokens at the place by one. Then, if \(a \in I\) we increment the number of tokens by one. If the number of tokens becomes negative, we consider the place to be underfed with respect to this trace, that is \(\bigtriangledown ^{}_{\sigma _{}}{(p_{})}\). Furthermore, if the trace has been fully replayed and the number of tokens is larger than zero, we consider the place to be overfed, that is \(\bigtriangleup ^{}_{\sigma _{}}{(p_{})}\). Note that a place can be underfed and overfed at the same time. Based on the replay results of the single traces and the user-definable threshold \(\tau \) (see Definition 5), we evaluate the fitness status of the place with respect to the whole log.
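
The replay just described can be sketched in a few lines. The following stand-alone Python version is our own simplification (the actual implementation is the Java ProM plug-in of Sect. 6):

```python
def replay_trace(trace, ingoing, outgoing):
    """Replay one trace on the place (ingoing | outgoing).

    Returns the subset of {"underfed", "overfed"} holding for this trace;
    the empty set means the trace fits the place perfectly.
    """
    tokens = 0
    result = set()
    for activity in trace:
        if activity in outgoing:    # first consume a token ...
            tokens -= 1
        if activity in ingoing:     # ... then produce one
            tokens += 1
        if tokens < 0:
            result.add("underfed")  # token count went negative
    if tokens > 0:
        result.add("overfed")       # tokens remain after full replay
    return result
```

For the place \(({\{a\}} | {\{b\}})\), the trace \(\langle a,b \rangle \) fits, \(\langle b,a \rangle \) is underfed, \(\langle a \rangle \) is overfed, and \(\langle b,a,a \rangle \) is both underfed and overfed.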

5.6 Post-processing

In the previous step, a subset of places \(\mathcal {P}_{\text {fit}} \) was computed. These are exactly the places considered to be fitting with respect to the given log \(L_{}\) and threshold \(\tau \). However, many of them can be removed without changing the behavior of the Petri net implicitly defined by \(\mathcal {P}_{\text {fit}}\). Since the process model we return is likely to be interpreted by a human being, such places are undesirable. These places are called implicit (or sometimes redundant) and have been studied extensively in the context of Petri nets [17,18,19,20,21]. In the post-processing step, we find and remove these undesirable places by solving an integer linear programming problem, as suggested, for example, in [21]. The resulting set of places \(\mathcal {P}_{\text {final}} \subseteq \mathcal {P}_{\text {fit}} \) is finally inserted into the Petri net that forms the output of our algorithm.

6 Implementation

We implemented our algorithm in Java as a plug-in for ProM [22] named eST-Miner. There are many ways in which our idea of organizing the candidate space into a tree structure can be optimized, for example with respect to certain types of models or logs, or towards a certain goal. Further ideas on how to improve performance can be found in [3].

In our implementation, we investigated the impact of the orderings of the ingoing and outgoing activity sets (\(>_i, >_o\) in Sect. 5) on the fraction of cut-off places. These orderings determine the position of candidates within our tree structure. If they position underfed/overfed places close to the root, the subtrees rooted in these places are large, and thus many places are cut off due to our monotonicity results. In Sect. 7, we present initial results of testing different activity orderings and evaluate their impact.

7 Testing Results and Evaluation

In this section, we present the results of testing our algorithm on various data sets. We use a selection of artificial log-model pairs to demonstrate our ability to rediscover complex structures and deal with noise. The efficiency of our search space pruning technique and the resulting increase in speed are evaluated using artificial logs as well as real-life logs. An overview of these logs is given in Table 1.

Table 1. List of logs used for evaluation. The upper part lists real-life logs while the lower part shows artificial logs. Logs are referred to by their abbreviations. The Sepsis log has been reduced by removing all traces that occur at most twice. The log HD2018 has not yet been published; the much smaller 2017 version can be found at [23]. We use a reduced version with all single trace variants removed. The log Teleclaims* results from removing the natural start and end activities in the log Teleclaims and removing the 15% least common traces. The Artificial1 log contains a single trace \(\langle {a,b,c,d,e,f}\rangle \). The log Artificial2 was generated from a random model containing a loop, an XOR-split and a silent transition.

Rediscoverability of Models: In Fig. 7, a simple model is shown, which includes non-free-choice constructs: the places \(({\{ {a} \}} | {\{ {e} \}})\) and \(({\{ {b} \}} | {\{ {f} \}})\) are not part of any directly-follows relation, and are therefore not discovered by most existing algorithms that provide formal guarantees. Well-known discovery techniques like Alpha-Miner variants [4] or the Inductive Miner family [5] fail at such tasks. Attempts to extend the capabilities of the Alpha Miner to discover such places [4] have been only partially successful: these approaches may result in process models that cannot replay the event log, and in some cases the model may have no fitting traces due to livelocks or deadlocks. More complex structures involving non-free-choice constructs, like the model depicted in Fig. 8, are difficult to mine and not rediscovered by most algorithms [4].

In contrast to existing non-region-based algorithms, our approach guarantees that all fitting places are found, and thus we are able to rediscover every model that is free of duplicate and silent transitions, assuming that the log is complete (i.e., non-fitting places can be identified). In particular, we can rediscover both models shown in Figs. 7 and 8.

Fig. 7.
figure 7

The shown model can be rediscovered by our algorithm. Since \(( {a,e})\) and \(( {b,f})\) are not part of any directly-follows relation, most other discovery algorithms fail to discover the corresponding places [4].

Fig. 8.
figure 8

The Petri net shown in this figure is an example of a model rediscoverable by our algorithm that existing non-region-based algorithms fail to find [4]. The only free choice is whether to do BC or D after doing A; the rest of the trace is implicitly determined by this choice.

Dealing with Noise: By setting the threshold \(\tau \) to 1 we require every fitting place, and thus the whole discovered Petri net, to be able to perfectly replay every trace of the log. However, event logs often contain some noise, be it due to exceptional behavior or due to logging errors. Often, we want to ignore such irregularities within the log when searching for the underlying model. Therefore, we suggest using the parameter \(\tau \) as a noise filtering technique utilizing the internal workings of our algorithm. In contrast to approaches modifying the whole event log, this allows us to filter for each place individually: based on the portion of the log relevant for the current place, we can ignore a certain fraction of irregular behavior specified by \(\tau \) without losing other valuable information.
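
To illustrate the per-place filtering, the following sketch assumes a simplified reading of Definition 5 in which a place counts as fitting when at least a \(\tau \)-fraction of the traces replays perfectly; the paper's exact aggregation may differ, and all names are ours.

```python
def fits_trace(trace, ingoing, outgoing):
    """True iff the trace replays on the place (ingoing | outgoing) without
    the token count going negative and with zero tokens remaining."""
    tokens = 0
    for activity in trace:
        if activity in outgoing:  # consume before producing
            tokens -= 1
        if activity in ingoing:
            tokens += 1
        if tokens < 0:
            return False          # underfed for this trace
    return tokens == 0            # overfed if tokens remain

def is_fitting(log, ingoing, outgoing, tau):
    # Assumed aggregation: at least a tau-fraction of traces must fit.
    fitting = sum(fits_trace(t, ingoing, outgoing) for t in log)
    return fitting / len(log) >= tau

# 2% of the traces are noisy; tau = 0.95 filters them out for this place,
# while tau = 1.0 would reject it:
log = [("a", "b")] * 98 + [("b", "a")] * 2
assert not is_fitting(log, {"a"}, {"b"}, tau=1.0)
assert is_fitting(log, {"a"}, {"b"}, tau=0.95)
```

Note that the decision is made per place: a trace that is noisy with respect to one place may still support other places.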

We test our implementation using different values of \(\tau \) on logs modified to contain several levels of noise: we run our algorithm on logs with 1000 traces each and 2, 4, 10 and 20% added random noise. The resulting models are tested for precision with respect to the original log using the ETC Align precision metric implemented in ProM [22]. As shown in Fig. 9, a lower threshold \(\tau \) in general leads to increased precision of the discovered model. In fact, for adequate values of \(\tau \) the original models could be rediscovered. Thus, with a suitable choice of \(\tau \), our algorithm is able to handle reasonable levels of noise within the given event log.

Fig. 9.
figure 9

For various levels of noise (0, 2, 4, 10, 20%) introduced randomly on logs with 1000 traces each, we present the precision results (ETC Align) resulting from applying our algorithm with different values for the threshold \(\tau \). Lower values of \(\tau \) have a positive impact on precision, and all tested models could be rediscovered for certain values of \(\tau \).

Measuring Performance: The main contribution of our approach lies in our strategy for computing the set of fitting places \(\mathcal {P}_{\text {fit}}\). The post-processing step, in which implicit/redundant places are removed, has not been optimized and constitutes an application of existing results. Therefore, our performance evaluation focuses on the first step. Based on several real-life as well as artificial logs we investigate two measures of performance: first, the absolute computation time needed to discover \(\mathcal {P}_{\text {fit}}\), compared to the time needed by a brute-force approach traversing the whole set of candidates (\(\mathcal {P}_{\text {cand}}\)), and second, the fraction of places that were cut off, that is \(\mathcal {P}_{\text {skipped}}/\mathcal {P}_{\text {cand}} \). For each log, we performed 10 runs of our algorithm using two mutually independent random activity orderings for in- and outgoing activity sets to survey their influence on these performance measures.

Fraction of Skipped Places: The fraction of cut-off places can vary greatly between different event logs, but also between several runs of our algorithm on the same event log, depending on the chosen activity orderings for in- and outgoing activities. In Fig. 10, we present the results for several logs, based on 10 runs of our algorithm for each. Interestingly, the fraction of cut-off places is highest for the real-life event logs RTFM, HD2018, Sepsis and Teleclaims; for HD2018, up to 96% of the candidate places were never visited. The reason for this could be the more restrictive nature of these large and complex logs, resulting in a smaller set of fitting places \(\mathcal {P}_{\text {fit}}\) and thus more possibilities to cut off branches. The lowest results are obtained for the artificial log Artificial1. Here we could confirm the expectation stated in Sect. 6: an ordering with high average index first for ingoing activities and low average index first for outgoing activities leads to significantly more places being cut off than using the same ordering for both activity sets.

Fig. 10.
figure 10

Depending on the given log and activity orderings, the fraction of places skipped by our algorithm can vary greatly. In this figure, we present box plots showing the fraction of cut-off places \(\mathcal {P}_{\text {skipped}}\) for 10 sample runs of our algorithm on different logs, given as the fraction of the complete candidate space (\(\mathcal {P}_{\text {skipped}}/\mathcal {P}_{\text {cand}} \)). The threshold \(\tau \) has been set to 1 for all runs.

Comparison to the Brute Force Approach: We evaluate the increase in performance gained by computing \(\mathcal {P}_{\text {fit}}\) with our candidate-space-pruning algorithm, in comparison to the brute-force approach traversing the candidate space without any pruning. We choose three real-life event logs, RTFM, HD2018 and Sepsis, and perform 10 runs of our algorithm on each. In Fig. 11 the results of these tests are presented: we compare the time needed by the brute-force approach to the minimum, maximum and average time needed by our algorithm for each of the three logs. As is to be expected, the impact of applying our technique is most significant for large logs, where replaying the log on a place takes a long time, and thus cutting off places has a big effect. This is the case for the RTFM and HD2018 logs. The Sepsis log, on the other hand, has shorter and fewer traces, which is reflected in a smaller difference between the time needed by our algorithm and the brute-force approach.

Fig. 11.
figure 11

Minimal, maximal and average time in milliseconds needed by 10 sample runs of our tree cutting algorithm on three real-life logs, with \(\tau =1\), compared to the time needed by the brute force approach traversing the complete candidate space.

8 Conclusion

We have introduced a process discovery algorithm inspired by language-based regions that finds a model fitting with respect to a given event log and threshold \(\tau \). In contrast to non-region-based approaches, our algorithm is able to discover complex structures, most prominently implicit dependencies expressed by non-free-choice constructs. In particular, for \(\tau =1\), we can guarantee that the set of fitting places we discover defines a Petri net whose language is the minimal language containing the input language (given by the event log). Our candidate traversal strategy allows us to skip large parts of the search space, giving our algorithm a substantial performance advantage over the brute-force approach.

A well-known disadvantage of applying region theory is that, in the presence of infrequent behavior within the log, the resulting models tend to be overfitting, not reflecting the core of the underlying process. We can avoid this issue, since our approach lends itself to using the threshold \(\tau \) as an intuitive noise control mechanism utilizing the internal workings of our algorithm.

We see several possibilities for future research based on the presented ideas. The most important contribution of our work is the reduction of the search space to merely a fraction of its original size: we organize all candidate places into trees and are able to cut off subtrees that are known to contain only unfitting places. Our approach would strongly benefit from any strategy allowing for more subtrees to be cut off or for pre-pruning of the candidate space. Note that within our work we focus on formal strategies that guarantee that all fitting places are discovered. For practical applications it is important to develop heuristic techniques that increase the number of skipped places; compared to many other approaches, such techniques are relatively easy to integrate into ours.

New insights can be gained from further testing and evaluating different activity orderings or tree traversal schemes. By developing heuristics on how to choose orderings based on certain characteristics of the given log, one can optimize the number of skipped candidate places without losing formal guarantees. These guarantees are no longer given when applying heuristic approaches that allow for fast identification and skipping of subtrees that are likely to be uninteresting. However, for practical applications, the increase in performance is likely to justify the loss of a few fitting places. Alternatively, by evaluating the more interesting subtrees first, the user can be shown a preliminary result while the remaining candidates are evaluated in the background. User-definable or heuristically derived combinations of ingoing and outgoing activity sets can either be cut off during traversal or excluded from the search space from the very beginning. Finally, note that our approach allows each subtree to be evaluated independently, and thus lends itself to a parallelized implementation for increased performance.

A major issue of our algorithm is the inability to deal with silent and duplicate activities. There exist approaches to identify such activities, either as a general preprocessing step [27] or tailored towards a particular algorithm [28, 29]. The applicability of such strategies to our approach remains for future investigation.

We emphasize that our idea is applicable to all process mining related to Petri net definable models, and therefore we see potential not only in our discovery algorithm itself, but also in the combination with, and enhancement of, existing and future approaches.