
1 Introduction

Process mining [1] is a field of study involved with the discovery, conformance checking, and enhancement of processes, using event data recorded during process execution. In process discovery, we aim to discover process models based on traces of executed event data. In conformance checking, we assess to what degree a process model (potentially discovered) is in line with recorded event data. Finally, in process enhancement, we aim at improving or extending the process based on facts derived from event data.

Modern information systems allow us to track, often in great detail, the behaviour of the processes they support. Moreover, instrumentation and/or program tracing tools allow us to track the behavioural profile of the execution of enterprise-level software systems [2, 3]. Such behavioural data is often referred to as an event log, which can be seen as a multiset of log traces, i.e. sequences of events observed in the system. However, due to noise or under-/over-specification, the observed behaviour often does not constitute a valid process instance, i.e., it deviates from the intended behaviour as specified by the reference model.

Conformance checking assesses to what degree the event log and model conform to each other. Early conformance checking techniques [4] are based on simple heuristics and therefore, may yield ambiguous/unpredictable results.

Alignments [5, 6] were introduced to overcome the limitations of early conformance checking techniques. Alignments map observed behaviour onto behaviour described by the process model. As such, we identify four types of relations between the model and event log in an alignment:

  1. A log move, in which we are unable to map an observed event, recorded in the event log, onto the reference model.

  2. A model move, in which an action is described by the reference model, yet this is not reflected in the event log.

  3. A synchronous move, in which we are able to map an event, observed in the event log, to a corresponding action described by the reference model.

  4. A silent move, in which the model performs a silent or invisible action (denoted with \(\tau \)).

Consider the example model of a simple file reading system given in Fig. 1 and the trace \(\sigma = \langle A,D,B,D \rangle \). An alignment for the model and \(\sigma \) is given by \(\gamma ^0\) (top right in Fig. 1). Here, the upper-part depicts the trace and the bottom-part depicts an execution path described by the model, starting at state \(\text {p}_\text {0}\) and ending at state \(\text {p}_\text {5}\). The first pair, \(|\frac{A}{A}|\), represents a synchronous move, in which both the log and the path in the model describe the execution of an A activity. The next pair, \(|\frac{D}{\gg }|\), is a log move where the log trace describes the execution of a D activity that is not mapped to a model move. The skip (\(\gg \)) symbol is used to represent such a mismatch. Observe that the model remains in the same state. This is continued by a model move in which the model executes a C activity, which is not recorded in the trace, i.e., \(|\frac{\gg }{C}|\). Finally, the alignment ends with two synchronous moves.

Fig. 1.

Example process model (in Petri net formalism) for a simple file reading system and an alignment for the trace \(\sigma =\langle A,D,B,D \rangle \) (\(\gamma ^0\)). For the trace \(\sigma =\langle B \rangle \), two optimal alignments are given using the standard- (\(\gamma ^1\)) and variant (\(\gamma ^2\)) cost functions.

An optimal alignment is an alignment that minimizes a given cost function. Typically, each type of move is assigned a cost in \(\mathbb {R}_{\ge 0}\). The cost of an alignment is simply the sum of the costs of its individual moves. The most common choice is to assign a cost of 1 to both model and log moves and 0 to synchronous and silent moves. In practice, the A* shortest path algorithm [7] is often used for computing optimal alignments.
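To illustrate how such a search operates, the following sketch computes an optimal alignment under the standard cost function using a plain Dijkstra search (i.e., A* with a zero heuristic, without the ILP-based heuristic used in practice). The dictionary encoding of the marking graph and the state names are assumptions made for illustration.

```python
import heapq

def optimal_alignment(trace, edges, q0, qF, tau="tau"):
    """Dijkstra search (A* with a zero heuristic) for a cheapest
    alignment under the standard cost function.
    `edges` maps a model state to a list of (label, next_state) pairs."""
    start = (q0, 0)                       # (model state, trace position)
    best = {start: 0}
    heap = [(0, start, [])]
    while heap:
        cost, (m, i), moves = heapq.heappop(heap)
        if cost > best.get((m, i), float("inf")):
            continue                      # stale queue entry
        if m == qF and i == len(trace):
            return cost, moves            # reached final marking, trace consumed
        succ = []
        if i < len(trace):                # log move: skip the next event
            succ.append((1, (m, i + 1), (trace[i], ">>")))
        for a, m2 in edges.get(m, []):
            c = 0 if a == tau else 1      # silent move is free, model move costs 1
            succ.append((c, (m2, i), (">>", a)))
            if i < len(trace) and a == trace[i]:
                succ.append((0, (m2, i + 1), (a, a)))  # synchronous move
        for c, s, mv in succ:
            if cost + c < best.get(s, float("inf")):
                best[s] = cost + c
                heapq.heappush(heap, (cost + c, s, moves + [mv]))
    return None
```

On a hypothetical sequential model whose only path is \(\langle A,C,B,D \rangle \), aligning the trace \(\langle A,D,B,D \rangle \) yields cost 2 (one log move and one model move), in line with the example alignment \(\gamma ^0\).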

We argue that the standard cost function is not always the best-suited function for optimal alignments. Consider the model from Fig. 1 again, with the trace \(\sigma ' = \langle B \rangle \). An optimal alignment using the standard cost function would result in \(\gamma ^1\). Considering that event B is observed behaviour, i.e., the system logged “parse file”, it seems illogical to map this behaviour onto a path in the model indicating that the file was not found. If we instead set up the cost function such that the number of synchronous moves is maximized, an optimal alignment would result in \(\gamma ^2\). Arguably, the more likely scenario is that not all parts of the program produced log output, so \(\gamma ^2\) would be preferred.

Motivated by the example shown in Fig. 1, we consider the applicability of a cost function that maximizes the number of synchronous moves in a more general setting and study its effects. Our contributions are as follows.

  • We formalise the relation between the event log and the reference model to distinguish different cases of alignment problems. We show how the cost functions affect the resulting alignments for these cases. We further show that when the reference model is an abstraction of the event log, the alignment computation process can be significantly improved.

  • We study the differences in alignments and their computation times on over a thousand large instances that exhibit various characteristics. We also compare the results from the A* algorithm with a recent symbolic algorithm [8].

  • We present a new algorithm for computing alignments that exploits our new cost function in a preprocessing step. Using a set of industrial models, we show that it performs an order of magnitude faster than the A* algorithm.

The remainder of this paper is structured as follows. Section 2 introduces preliminaries. Then, in Sects. 3 and 4 we introduce the synchronous cost function and formalise the relation between the event log and the reference model. We discuss existing algorithms for computing alignments in Sect. 5. In Sect. 6 we present the new algorithm that preprocesses the model to improve the alignment computation process. Experiments are presented in Sect. 7. Section 8 discusses related work. Section 9 concludes the paper.

2 Preliminaries

We assume that the reader is familiar with the basics of automata theory and Petri nets. We denote a trace or sequence by \(\sigma =\langle \sigma _0, \sigma _1, \ldots , \sigma _{|\sigma |-1} \rangle \); two sequences are concatenated using the \({\cdot }\) operator. Given a sequence \(\sigma \) and a set of elements S, we refer to \(\sigma \setminus S\) as the sequence without any elements from S, e.g., \(\langle a,b,b,c,a,f \rangle \setminus \{b,f\} = \langle a,c,a \rangle \). For two sequences \(\sigma _1\) and \(\sigma _2\), we call \(\sigma _1\) a subsequence of \(\sigma _2\) (denoted by \(\sigma _1 \sqsubseteq \sigma _2\)) if \(\sigma _1\) can be formed from \(\sigma _2\) by deleting elements without changing the order of the remaining elements, e.g., \(\langle c,a,t \rangle \sqsubseteq \langle a,c,r,a,t,e \rangle \). Similarly, \(\sigma _1 \sqsubset \sigma _2\) denotes that \(\sigma _1\) is a strict subsequence of \(\sigma _2\), i.e., additionally \(\sigma _1 \ne \sigma _2\).
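The two sequence operations used throughout the paper can be sketched directly; the function names below are illustrative, not from the paper.

```python
def project_out(seq, s):
    """sigma \\ S: drop every element of seq that occurs in the set s."""
    return [x for x in seq if x not in s]

def is_subsequence(s1, s2):
    """True iff s1 can be formed from s2 by deleting elements
    without changing the order of the remaining elements."""
    it = iter(s2)                 # `x in it` consumes s2 up to the match
    return all(x in it for x in s1)
```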

Traces are sequences \(\sigma \in \varSigma ^*\), where each element is called an event and is contained in the alphabet \(\varSigma \), also called the set of events. We globally define the alphabet \(\varSigma \), which contains neither the skip event (\(\gg \)) nor the invisible action or silent event (\(\tau \)). Given a set S, we denote the set of all possible multisets over S as \(\mathcal {B}(S)\), and its power set by \(2^S\). An event log \(\text {E}\) is a multiset of traces, i.e., \(\text {E}\in \mathcal {B}(\varSigma ^*)\).

2.1 Preliminaries on Petri Nets

Petri nets are a mathematical formalism that allows us to describe processes, typically containing parallel behaviour, in a compact manner. Consider Fig. 1, which shows a simple example of a Petri net. A Petri net consists of places, visualized as circles, that express the state (or marking) of the net, and of transitions, visualized as boxes, that manipulate this state. A place can never be connected directly to another place, nor a transition to another transition. Thus, from a graph-theoretical perspective, a Petri net is a bipartite graph.

Definition 1

(Petri net, marking). A Petri net is defined as a tuple \(\text {N}= (\text {P}, \text {T}, \text {F}, \varSigma _\tau , \lambda , \text {m}_\mathrm {0}, \text {m}_\mathrm {F})\) such that:

  • \(\text {P}\) is a finite set of places,

  • \(\text {T}\) is a finite set of transitions such that \(\text {P}\cap \text {T}= \emptyset \),

  • \(\text {F}\subseteq (\text {P}\times \text {T}) \cup (\text {T}\times \text {P})\) is a set of directed arcs, called the flow relation,

  • \(\varSigma _\tau \) is a set of activity events, with \(\varSigma _\tau = \varSigma \cup \{\tau \}\),

  • \(\lambda : \text {T}\rightarrow \varSigma _\tau \) is a labelling function for each transition,

  • \(\text {m}_\mathrm {0}\in \mathcal {B}(\text {P})\) is the initial marking of the Petri net,

  • \(\text {m}_\mathrm {F}\in \mathcal {B}(\text {P})\) is the final marking of the Petri net.

A marking is defined as a multiset of places, denoting where tokens reside in the Petri net. A transition \(t\in T\) can be fired if, according to the flow relation, all places directing to t contain a token. After firing a transition, the tokens are removed from these places and all places having an incoming arc from t receive a token. It may be possible for a place to contain more than one token.
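The firing rule can be sketched with multisets as Python `Counter`s. Arcs in the paper are unweighted (\(\text {F}\) is a set), so all arc counts below are 1; the `Counter`-valued `pre`/`post` maps are an illustrative encoding, not the paper's.

```python
from collections import Counter

def enabled(marking, t, pre):
    """t can fire iff every input place of t holds enough tokens."""
    return all(marking[p] >= n for p, n in pre[t].items())

def fire(marking, t, pre, post):
    """Consume tokens from t's input places, produce on its outputs."""
    if not enabled(marking, t, pre):
        raise ValueError(f"transition {t} is not enabled")
    m = Counter(marking)
    m.subtract(pre[t])     # remove tokens from input places
    m.update(post[t])      # add tokens to output places
    return +m              # unary + drops zero-count entries
```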

Definition 2

(Marking graph). For a Petri net \(\text {N}= (\text {P}, \text {T}, \text {F}, \varSigma _\tau , \lambda , \text {m}_\mathrm {0}, \text {m}_\mathrm {F})\), the corresponding marking graph or state-space \( MG = (\text {Q}, \varSigma _\tau , \delta , q_0, q_\mathrm {F})\) is a non-deterministic automaton such that:

  • \(\text {Q}\subseteq \mathcal {B}(P)\) is the (possibly infinite) set of vertices in \( MG \), which corresponds to the set of reachable markings from \(\text {m}_\mathrm {0}\) (obtained from firing transitions),

  • \(\delta \subseteq (\text {Q}\times \text {T}\times \text {Q})\) is the set of edges in \( MG \), i.e., \((\text {m}, t, \text {m}') \in \delta \) iff there is a \(t\in \text {T}\) such that \(\text {m}'\) is obtained from firing transition t from marking \(\text {m}\).

  • \(q_0= \text {m}_\mathrm {0}\) is the initial state of the graph,

  • \(q_\mathrm {F}= \text {m}_\mathrm {F}\) is the final state of the graph.

For an edge \(e=(m,t,m')\in \delta \), we write \(\lambda (e)\) to denote \(\lambda (t)\) and use the notation \(m\xrightarrow {a}m'\) to represent the edge e for which \(\lambda (e)=a\) (we assume that for two edges \((m,t_1,m')\in \delta \) and \((m,t_2,m')\in \delta \), if \(\lambda (t_1)=\lambda (t_2)\) then \(t_1=t_2\)). The source and target markings of edge e are respectively denoted by \(\texttt {src}(e)\) and \(\texttt {tar}(e)\).

Definition 3

(Path, language). Given a Petri net \(\text {N}\) and corresponding marking graph \( MG = (\text {Q}, \varSigma _\tau , \delta , q_0, q_\mathrm {F})\), a sequence of edges \(P=\langle P_0,P_1,\ldots ,P_n \rangle \in \delta ^*\) is called a path in \(\text {N}\) if it forms a path on the marking graph of \(\text {N}\): \(\texttt {src}(P_0)=\text {m}_\mathrm {0}\wedge \texttt {tar}(P_n)=\text {m}_\mathrm {F}\wedge \forall _{0\le i < n}: \texttt {tar}(P_i) = \texttt {src}(P_{i+1})\). The set of all paths in \(\text {N}\) is denoted by \( Paths (\text {N})\). With \(\lambda (P)\) we refer to the sequence of labels visited in P, i.e., \(\lambda (P) = \langle \lambda (P_0), \lambda (P_1), \ldots , \lambda (P_n) \rangle \) (there may be different paths P and \(P'\) such that \(\lambda (P)=\lambda (P')\)). We define the language \(\mathcal {L}\) of a Petri net \(\text {N}\) by \(\mathcal {L}(\text {N})=\{\lambda (P) \mid P\in Paths (\text {N})\}\).

Definition 4

(Trace to Petri net). Given a trace \(\sigma =\langle \sigma _0,\sigma _1,\ldots ,\sigma _{n-1} \rangle \in \varSigma ^*\), its corresponding Petri net is defined as \(\text {N}_\sigma = (\text {P}, \text {T}, \text {F}, \varSigma _\tau , \lambda , \text {m}_\mathrm {0}, \text {m}_\mathrm {F})\) with \(\text {P}=\{p_0,p_1,\ldots , p_n\}\), \(\text {T}=\{t_0,t_1, \ldots , t_{n-1}\}\), \(\text {F}=\{(p_i,t_i) \mid 0\le i<n\} \cup \{(t_i,p_{i+1}) \mid 0\le i<n\}\), \(\varSigma _\tau = \{\sigma _i \mid 0\le i<n\} \cup \{\tau \}\), \(\forall _{0\le i<n}:\lambda (t_i)=\sigma _i\), \(\text {m}_\mathrm {0}=[p_0]\), and \(\text {m}_\mathrm {F}=[p_n]\).
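A minimal sketch of this construction, assuming 0-based indexing with places \(p_0,\ldots,p_n\) for a trace of length n; the string-based place and transition names are an illustrative encoding.

```python
def trace_to_net(sigma):
    """Build the sequential Petri net N_sigma that accepts exactly sigma."""
    n = len(sigma)
    places = [f"p{i}" for i in range(n + 1)]
    transitions = [f"t{i}" for i in range(n)]
    # Flow relation: p_i -> t_i -> p_{i+1} for every event
    flow = {(f"p{i}", f"t{i}") for i in range(n)} | \
           {(f"t{i}", f"p{i + 1}") for i in range(n)}
    labels = {f"t{i}": sigma[i] for i in range(n)}
    return places, transitions, flow, labels, "p0", f"p{n}"
```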

2.2 Preliminaries on Alignments

Definition 5

(Alignment). Let \(\sigma \in \varSigma ^*\) be a log trace and let \(\text {N}\) be a Petri net model, for which we obtain the marking graph \( MG = (\text {Q}, \varSigma _\tau , \delta , q_0, q_\mathrm {F})\). We refer to \(\varSigma _{\gg }\) as the alphabet containing skips: \(\varSigma _{\gg }= \varSigma \cup \{\gg \}\) and \(\varSigma _{\tau \gg }\) as the alphabet that also contains the silent event: \(\varSigma _{\tau \gg }= \varSigma \cup \{\gg ,\tau \}\). Let \(\gamma \in (\varSigma _{\gg }\times \varSigma _{\tau \gg })^*\) be a sequence of log-model pairs (note that \(\tau \) steps are only possible in the model). For \(\gamma =\langle (\gamma _0^0,\gamma _0^1), (\gamma _1^0,\gamma _1^1), \ldots , (\gamma _{|\gamma |-1}^0,\gamma _{|\gamma |-1}^1) \rangle \), we define \(\gamma ^\ell \) as \(\gamma ^\ell =\langle \gamma _0^0, \gamma _1^0,\ldots , \gamma _{|\gamma |-1}^0 \rangle \setminus \{\gg \}\) and \(\gamma ^m\) by \(\gamma ^m = \langle \gamma _0^1, \gamma _1^1,\ldots , \gamma _{|\gamma |-1}^1 \rangle \setminus \{\gg \}\). We call \(\gamma \) an alignment if the following conditions hold:

  1. \(\gamma ^\ell = \sigma \) (the log part of \(\gamma \), with skips removed, equals \(\sigma \)),

  2. \(\gamma ^m\in \mathcal {L}(\text {N})\) (\(\gamma ^m\) forms a path in \(\text {N}\)),

  3. \(\forall a,b\in \varSigma : a\ne b \Rightarrow (a,b) \notin \gamma \) (no mismatching pairs),

  4. \((\gg , \gg ) \notin \gamma \) (the ‘empty’ move does not occur in \(\gamma \)).

Definition 6

(Alignment cost). Let \(\gamma \in (\varSigma _{\gg }\times \varSigma _{\tau \gg })^*\) be an alignment for \(\sigma \in \varSigma ^*\) and the Petri net \(\text {N}\). The cost function c for alignment pairs is given as \(c: (\varSigma _{\gg }\times \varSigma _{\tau \gg }) \rightarrow \mathbb {R}_{\ge 0}\), and we overload c for alignments: \(c: (\varSigma _{\gg }\times \varSigma _{\tau \gg })^* \rightarrow \mathbb {R}_{\ge 0}\) with \(c(\gamma ) = \sum _{i=0}^{|\gamma |-1} c(\gamma _i)\).

We call an alignment \(\gamma \) under cost function c optimal iff \(\not \exists \gamma ' : c(\gamma ') < c(\gamma )\), i.e., there does not exist an alignment \(\gamma '\) with a smaller cost.

Definition 7

(Standard cost function). The standard cost function \(c_\mathrm {st}\) is defined for an alignment pair \((\ell ,m)\in (\varSigma _{\gg }\times \varSigma _{\tau \gg })\) as follows:

$$\begin{aligned} c_\mathrm {st}(\ell ,m) = {\left\{ \begin{array}{ll} 0 &{} \ell = {\gg } \text { and } m = \tau \text { (silent move, e.g., } (\gg ,\tau )) \\ 0 &{} \ell \in \varSigma \text { and } m\in \varSigma \text { and } \ell =m \text { (e.g., synchronous move } (a,a)) \\ 1 &{} \ell \in \varSigma \text { and } m={\gg }\, (\text {e.g., log move } (a,\gg )) \\ 1 &{} \ell ={\gg }\text { and } m\in \varSigma \, (\text {e.g., model move } (\gg ,a)) \\ \end{array}\right. } \end{aligned}$$
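The standard cost function and the summed alignment cost of Definition 6 translate directly into code; the `">>"`/`"tau"` strings are stand-ins for the \(\gg \) and \(\tau \) symbols.

```python
SKIP, TAU = ">>", "tau"   # stand-ins for >> and tau

def c_st(l, m):
    """Standard cost of one (log, model) pair.  Assumes the pair is
    legal: mismatching pairs (a, b) with a != b cannot occur."""
    if l == SKIP and m == TAU:        # silent move
        return 0
    if l != SKIP and l == m:          # synchronous move
        return 0
    return 1                          # log move or model move

def cost(gamma):
    """Alignment cost: the sum over its pairs (Definition 6)."""
    return sum(c_st(l, m) for l, m in gamma)
```

Applied to the alignment \(\gamma ^0\) from Fig. 1 (sync A, log move D, model move C, sync B, sync D), this gives a cost of 2.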

3 Maximizing Synchronous Moves

The standard cost function from Definition 7 is the most commonly used cost function in the literature [1, 7, 9, 10], though any cost function could be used. It may, however, lead to undesired results, as illustrated by the example from Fig. 1. We therefore consider a new cost function that maximizes the number of synchronous moves, thereby explaining as many observed events as possible. We propose this alternative cost function as follows.

Definition 8

(max-sync cost function). We define the max-sync cost function \(c_\mathrm {sync}\) for an alignment pair as follows (for small \(\varepsilon >0\)):

$$\begin{aligned} c_\mathrm {sync}(\ell ,m) = {\left\{ \begin{array}{ll} 0 &{} \ell ={\gg } \text { and } m = {\tau }\, (\text {silent move, e.g., } (\gg ,\tau )) \\ 0 &{} \ell \in \varSigma \text { and } m\in \varSigma \text { and } {\ell =m}\, (\text {e.g., synchronous move } (a,a)) \\ 1 &{} \ell \in \varSigma \text { and } m={\gg }\, (\text {e.g., log move } (a,\gg ))\\ \varepsilon &{} \ell ={\gg }\text { and } m\in \varSigma \, (\text {e.g., model move } (\gg ,a)) \\ \end{array}\right. } \end{aligned}$$

This cost function only penalizes log moves, which as a consequence causes an optimal alignment to minimize the number of log moves and thus maximize the number of synchronous moves. The \(\varepsilon \) cost for model moves further filters optimal alignments to only include shortest paths through the model that maximize synchronous moves.
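The effect can be sketched on two hypothetical alignments for the trace \(\langle B \rangle \) (the concrete \(\gamma ^1\) and \(\gamma ^2\) of Fig. 1 are not reproduced here, and the model labels used below are assumptions): `g1` sacrifices the B event as a log move, while `g2` synchronizes it at the price of extra model moves.

```python
SKIP, TAU, EPS = ">>", "tau", 1e-6

def c_sync(l, m):
    """Max-sync cost: log moves cost 1, model moves a small epsilon."""
    if l == SKIP and m == TAU:        # silent move
        return 0
    if l != SKIP and l == m:          # synchronous move
        return 0
    if m == SKIP:                     # log move
        return 1
    return EPS                        # model move

def cost(gamma, c):
    return sum(c(l, m) for l, m in gamma)

# Hypothetical alignments for the trace <B>:
g1 = [("B", SKIP), (SKIP, "E")]                        # B lost as log move
g2 = [(SKIP, "A"), (SKIP, "C"), ("B", "B"), (SKIP, "D")]  # B synchronized
```

Under `c_sync`, `g2` costs only a few epsilons while `g1` costs more than 1, so the alignment that synchronizes B is preferred.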

An advantage of the max-sync cost function over the standard one is that synchronized behaviour is not sacrificed for shorter paths through the model (as Fig. 1 illustrates). A disadvantage is that in order to maximize the number of synchronous moves, it may be possible that many model moves are required.

4 Relating the Model and Event Log

Given a Petri net model \(\text {N}\) and an event log \(\text {E}\in \mathcal {B}(\varSigma ^*)\), we can distinguish four cases based on the languages that they describe. By distinguishing the relative granularities of \(\text {N}\) and \(\text {E}\) we define cases of alignment problems as follows.

C1:

\(\forall \sigma _1\in E: (\exists \sigma _2\in \mathcal {L}(\text {N}) : \sigma _1 = \sigma _2)\); all log traces correspond to paths in the model. Then, every log trace can be mapped onto the model by only using synchronous and silent moves, which is optimal for \(c_\mathrm {st}\) and \(c_\mathrm {sync}\).

C2:

\(\forall \sigma _1\in E: (\exists \sigma _2\in \mathcal {L}(\text {N}) : \sigma _1\sqsubseteq \sigma _2)\); all log traces correspond to subsequences of paths in the model. Then, every log trace can be mapped onto the model without using any log moves. The example from Fig. 1 for \(\sigma =\langle B \rangle \) is such an instance. We hypothesize that \(c_\mathrm {sync}\) provides better alignments in such instances as \(c_\mathrm {st}\) may avoid synchronization in favour of shorter paths through the model.

C3:

\(\forall \sigma _1\in E: (\exists \sigma _2\in \mathcal {L}(\text {N}) : \sigma _2\sqsubseteq \sigma _1)\); for every log trace there is a path that forms a subsequence of the log trace. Then, every log trace can be mapped onto the model without using any model moves. Here, \(c_\mathrm {sync}\) and to some extent \(c_\mathrm {st}\) can arguably lead to bad results as model moves may be taken to synchronize with ‘undesired’ behaviour.

C4:

None of the properties hold. All move types may be necessary for alignments. We regard this as a standard scenario. Depending on the use case, either \(c_\mathrm {st}\) or \(c_\mathrm {sync}\) could be preferred.

Aside from C4, we consider cases C2 and C3 common in practice, as logging software often causes either too many or too few events to be logged, or the model may be over- or underspecified. Discrepancies then show whether the model is of the right granularity. We note that it is also possible to hide certain activities in the model or log before alignment. This is, however, not trivial, especially if there are (slight) deviations in the log such that the alignment problem no longer fits C2 or C3 exactly.

When considering instances that exactly fit case C2 or C3, we can construct alignments by respectively removing all log or model moves from the product of the model and log. We define the cost functions \(c_\mathrm {add}\) and \(c_\mathrm {rem}\) to be variants of \(c_\mathrm {st}\) such that model and log moves respectively have a cost of \(\infty \). We argue that this results in a better ‘alignment quality’ and reduces the time for its construction.

5 Algorithms for Computing Alignments

We consider two algorithms for computing alignments, which we discuss below. Both algorithms take the product Petri net as input.

A*. The A* algorithm [7] computes the shortest path from the initial marking to the final marking on the marking graph for a given cost function. The heuristic function for A* exploits the Petri net marking equation, solved using Integer Linear Programming (ILP), to prune the search space.

Symbolic Algorithm. The symbolic algorithm [8] was recently developed as an improvement over A* for large state spaces. It exploits symbolic reachability to search for an alignment, i.e., considering sets of markings instead of single ones. By restricting the cost function to only allow 0 or 1-cost moves, optimal alignments can be computed by only taking a 1-cost move after exploring all markings reachable via 0-cost steps. We refer to this algorithm by Sym.

6 Preprocessing Reference Models for Large Event Logs

When constructing an alignment under the \(c_\mathrm {sync}\) cost function, we can disregard the cost for model moves to a certain extent. The goal is to find a path through the model that maximizes the number of synchronous moves. We can achieve this by searching for a subsequence in the log trace that is also included in the language of the reference model. By computing the transitive closure of the model’s marking graph, we find all paths and subsequences of paths through the model. For every log trace we can use dynamic programming to search for the maximum-length subsequence in the log trace that can be replayed in the transitive closure graph (TCG), from which we can construct a path through the marking graph and obtain an optimal alignment.

We construct a TCG as described in Definition 9. Here, \(\tau \)-edges are added to the marking graph such that every marking is reachable via \(\tau \)-steps. After determinization, for every path P in the original marking graph the TCG contains all paths \(P'\) such that \(\lambda (P')\sqsubseteq \lambda (P)\).

Fig. 2.

Example Petri net model (left), its corresponding marking graph (middle) and transitive closure graph (right) with the sequence \(\langle D,E \rangle \) highlighted.

We can use this property to search for a subsequence of the log trace that can fully synchronize with the model. For instance, in the example of Fig. 2, consider a log trace \(\sigma =\langle F,D,E,B \rangle \). The F event can be fired from \({\mathbb {Q}}_{0}\), after which the TCG is in state \({\mathbb {Q}}_{10}\). From this state, it is not possible to perform any other event from the log trace. A better choice would be to skip the F event (which then becomes a log move) and form the subsequence \(\langle D,E \rangle \), as highlighted in Fig. 2. We call the maximum-length subsequence \(\hat{\sigma }\) from the log trace a maximum fitting subsequence if \(\hat{\sigma }\) also forms a path through the TCG, as defined in Definition 10.

Definition 9

(Transitive closure graph). Given a marking graph \( MG = (\text {Q}, \varSigma _\tau , \delta , q_0, q_\mathrm {F})\), we first construct an extended marking graph \( MG ' = (\text {Q}, \varSigma _\tau , \delta ', q_0, q_\mathrm {F})\) with \(\delta '=\delta \cup \{(\texttt {src}(e),\tau ,\texttt {tar}(e)) \mid e \in \delta \}\). A transitive closure graph (TCG), \( TCG = (\mathbb {Q}, \varSigma , \varDelta , \mathbb {Q}_0, \mathbb {Q}_\mathrm {F})\) is defined as the result of determinizing \( MG '\) (by using a standard determinization algorithm [11]) and by then removing all non-final states from the \( TCG \) such that \(\mathbb {Q}\subseteq 2^\text {Q}\), \(\varSigma = \varSigma _\tau \setminus \{\tau \}\), \(\varDelta \subseteq (\mathbb {Q}\times \varSigma \times \mathbb {Q})\), \(\mathbb {Q}_0= \text {Q}\), and \(\mathbb {Q}_\mathrm {F}= \mathbb {Q}\).

For an edge \(e\in \varDelta \) we also use the notation \(\texttt {src}(e)\) and \(\texttt {tar}(e)\) to respectively refer to the source and target marking sets in the TCG. Paths over the TCG are defined analogously to paths over marking graphs (Definition 3) and we use \( Paths ( TCG )\) and \(\mathcal {L}( TCG )\) to respectively denote the set of all paths in the \( TCG \) and the language of the \( TCG \).
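A sketch of the TCG construction via subset construction. Since every labelled edge also acts as a \(\tau \)-edge, the closure of a state set is simply everything reachable from it, and the initial subset is the set of all reachable markings. The adjacency-dictionary encoding is an assumption, and the final step of Definition 9 (removing non-final subsets) is omitted for brevity.

```python
def build_tcg(states, edges):
    """Determinize a marking graph in which every labelled edge also
    acts as a tau-edge.  `edges` maps state -> list of (label, state);
    `states` is the set of reachable markings."""
    def closure(S):
        # Everything reachable from S (the tau-copies of the edges
        # make every edge silently traversable).
        seen, stack = set(S), list(S)
        while stack:
            q = stack.pop()
            for _, q2 in edges.get(q, []):
                if q2 not in seen:
                    seen.add(q2)
                    stack.append(q2)
        return frozenset(seen)

    start = closure(states)          # initial TCG state: all markings
    tcg, work = {}, [start]
    while work:
        S = work.pop()
        if S in tcg:
            continue
        by_label = {}
        for q in S:
            for a, q2 in edges.get(q, []):
                by_label.setdefault(a, set()).add(q2)
        tcg[S] = {a: closure(T) for a, T in by_label.items()}
        work.extend(tcg[S].values())
    return start, tcg
```

On a sequential model with the single path \(\langle A,B \rangle \), the resulting TCG accepts exactly the subsequences of \(\langle A,B \rangle \), as expected.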

Definition 10

(Maximum fitting subsequence). Given a sequence (log trace) \(\sigma \in \varSigma ^*\) and \( TCG = (\mathbb {Q}, \varSigma , \varDelta , \mathbb {Q}_0, \mathbb {Q}_\mathrm {F})\), then \(\hat{\sigma }\sqsubseteq \sigma \) is a maximum fitting subsequence if and only if \(\hat{\sigma }\in \mathcal {L}( TCG ) \wedge \forall \hat{\sigma }'\sqsubseteq \sigma : \hat{\sigma }'\in \mathcal {L}( TCG ) \Rightarrow |\hat{\sigma }| \ge |\hat{\sigma }'|\). We construct \(\hat{\sigma }\) by using dynamic programming to search for a subsequence of \(\sigma \) that is a maximum-length path in the \( TCG \).
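The dynamic programme can be sketched as follows, keeping for every TCG state the longest subsequence replayed so far; the dictionary encoding of \(\varDelta \) and the state names in the test are assumptions based on the Fig. 2 discussion.

```python
def max_fitting_subsequence(sigma, start, delta):
    """Longest subsequence of sigma replayable in the (deterministic)
    TCG from `start`.  `delta` maps state -> {label: successor state}.
    All TCG states are final, so any replay end state is accepted."""
    best = {start: (0, [])}          # state -> (length, subsequence)
    for a in sigma:
        updates = {}
        for S, (n, seq) in best.items():
            T = delta.get(S, {}).get(a)
            if T is None:
                continue             # a cannot be replayed from S; skip it
            if n + 1 > updates.get(T, best.get(T, (-1, None)))[0]:
                updates[T] = (n + 1, seq + [a])
        for T, v in updates.items():  # merge after the pass over `best`
            if v[0] > best.get(T, (-1, None))[0]:
                best[T] = v
    return max(best.values(), key=lambda v: v[0])[1]
```

On a fragment of the running example (F leads to a dead end, D and E chain through the TCG), the trace \(\langle F,D,E,B \rangle \) yields \(\hat{\sigma }=\langle D,E \rangle \).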

Algorithm 1. Path construction from a maximum fitting subsequence (listing not reproduced).

Once we have found the maximum fitting subsequence \(\hat{\sigma }\) for a given model and log trace, we still have to determine which model moves should be applied to form a path through the original model. This can be achieved by using the TCG and traversing \(\hat{\sigma }\) in a backwards fashion as we show in Algorithm 1.

We first construct a path \(\texttt {MFP}\) from the subsequence \(\hat{\sigma }\) (line 3); in the example from Fig. 2 with \(\hat{\sigma }=\langle D,E \rangle \) (see also Fig. 3 for an illustration of the path construction process), this would be \(\texttt {MFP}=\langle ({\mathbb {Q}}_{0},D,{\mathbb {Q}}_{8}),({\mathbb {Q}}_{8},E,{\mathbb {Q}}_{9}) \rangle \). Then, in line 4, a backward search procedure (\(\texttt {BWD}\)) is called to search for a path P in the marking graph from an E-edge to the final marking (\(\text {p}_\text {7}\)).

The \(\texttt {BWD}\) procedure takes a target marking m, label a and search space S as arguments. A sequence W is maintained to process unvisited markings from S and a mapping \(F:Q\rightarrow \delta \) is used for reconstructing the path. Starting from the target marking m (which is \(W_0\)), the procedure searches for edges e directing towards m in line 16–18 such that \(\texttt {src}(e)\) is in S and not already visited. For every such edge e, its source is appended to W (to be considered in a future iteration) and \(\texttt {src}(e)\) is mapped to e for later path reconstruction.

Following iterations of the for loop in line 11–18 consider a predecessor \(W_i\) of m and search for edges directing to \(W_i\). This way, the search space is traversed backwards in a breadth-first manner, resulting in shortest paths to m.

Fig. 3.

Path construction using Algorithm 1 on the example from Fig. 2 for a maximum fitting subsequence \(\hat{\sigma }=\langle D,E \rangle \sqsubseteq \langle F,D,E,B \rangle \). Markings in the grey region are not part of the path. The resulting alignment \(\gamma \) is shown on the right.

In line 12–15 the \(\texttt {BWD}\) procedure checks whether there is an edge \(m'\xrightarrow {a}W_i\) for some \(m'\) (or an edge \(q_0\xrightarrow {a'}W_i\) for arbitrary \(a'\) in case \(a=\bot \)) and if so, constructs a path towards m in line 14 which is then returned. In the example, the path will be returned for the first \(\texttt {BWD}\) call.

After the first \(\texttt {BWD}\) call, the main function iterates backwards over all remaining edges from \(\texttt {MFP}\) (line 5–6) to create paths between \(\hat{\sigma }_i\) and \(\hat{\sigma }_{i+1}\), which are inserted in the path before P. Finally, in line 7 a path from the initial marking \(q_0\) towards the first label \(\hat{\sigma }_0\) is inserted before P to complete the path (here the label is set to \(\bot \) to search for \(q_0\) in the \(\texttt {BWD}\) procedure).
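The \(\texttt {BWD}\) procedure described above can be sketched as a backward BFS; this is an illustrative reconstruction, not the paper's Algorithm 1 verbatim, and the edge-list encoding of the marking graph is an assumption. As in the text, the accepting labelled edge may originate outside the search space.

```python
from collections import deque

def bwd(target, label, space, edges, q0=None):
    """Backward BFS: find a shortest path that starts with a
    `label`-edge (or at q0 when label is None), runs through markings
    in `space`, and ends at `target`.
    `edges` is a list of (src, lbl, dst) triples."""
    def chain(q):                    # reconstruct path from visited q to target
        path = []
        while q != target:
            e = F[q]
            path.append(e)
            q = e[2]
        return path

    F = {}                           # marking -> outgoing edge towards target
    W = deque([target])
    while W:
        m = W.popleft()
        # 1. Can we enter m with the wanted edge (source may lie outside space)?
        for src, lbl, dst in edges:
            if dst == m and ((label is not None and lbl == label)
                             or (label is None and src == q0)):
                return [(src, lbl, dst)] + chain(m)
        # 2. Otherwise expand backwards inside the search space.
        for e in edges:
            src, lbl, dst = e
            if dst == m and src in space and src not in F and src != target:
                F[src] = e
                W.append(src)
    return None
```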

In the example, we first compute the path from the E-edge to the final marking in line 4, then in line 5–6 we insert the path connecting the D-edge to the E-edge, and in line 7 we insert the path from the initial state \(q_0=\text {p}_\text {0}\) to the D-edge, creating the complete minimal-length path P in the marking graph such that \(\hat{\sigma }\sqsubseteq \lambda (P)\).

The alignment can be reconstructed by marking all events in the maximum fitting subsequence as synchronous moves, by marking the remaining labels in the log trace as log moves, and inserting the model and silent moves (as computed by Algorithm 1) at the appropriate places.

Note that the TCG algorithm does not exactly compute an alignment for the cost function \(c_\mathrm {sync}\). The backward BFS does ensure a shortest path through the model from the initial to the final marking while synchronizing with the maximum fitting subsequence. However, there might exist a different maximum fitting subsequence that leads to a path through the model with a lower total cost (fewer model moves). This can be repaired by computing the alignments for all maximum fitting subsequences. If the marking graph contains cycles, the corresponding markings are contracted to a single state in the TCG with a self-loop for each activity in the cycle. Also, the TCG may in theory contain exponentially more states than there are markings in the marking graph. However, for the industrial models of Sect. 7.3, we found that in many cases the number of states in the TCG is at most twice the number of markings in the marking graph.

7 Experiments

For the experiments, we considered two types of alignment problems. On the one hand, a large reference model accompanied by an event log consisting of a single log trace, and on the other hand a smaller reference model accompanied by an event log of many traces. All experiments were performed on an Intel® Core™ i7-4710MQ processor at 2.50 GHz with 7.4 GiB of memory. For all experiments, we have set a timeout of 60 s. When computing averages, a timeout also counts as 60 s.

We investigate differences between the alignments resulting from using the standard- and max-sync cost functions, and compare alignment computation times for A* (with ILP, using the implementation from RapidProM [12]) and the symbolic algorithm (implemented in the LTSmin model checker [13]). We further investigate specific alignment problems, namely cases C2 and C3 as discussed in Sect. 4. Finally, we also look at models accompanied by many log traces to compare the performance of the TCG algorithm (implemented in ProM [14]) with the other algorithms. For all large models with singleton log traces we used 8 threads for computing alignments, and for smaller models with many log traces we used a single thread per alignment computation. All results are available online at https://github.com/utwente-fmt/MaxSync-BPM2018.

7.1 Experiments Using Large Models and Singleton Event Logs

Model Generation. Using the PTandLogGenerator [15] we generated Petri net models with process operators and additional features set to their defaults; where the respective probabilities for sequence, XOR, parallel, loop, OR are set to 45%, 20%, 20%, 10%, and 5%. The additional features for the occurrence of silent and duplicate activities, and long-term dependencies were all set to 20%.

To examine scalability, we ranged the average number of activities over 25, 50, and 75, resulting in 110, 271, and 370 transitions on average, respectively. For each setting, we generated 30 models (thus 90 in total) and a single log trace per model. To this log trace we added 10%, 30%, 50%, and 70% noise in three different ways (thus 12 noisy singleton logs are created per model): (1) by adding, removing, and swapping events (resembling case C4), (2) by only adding events (resembling case C3), and (3) by only removing events (resembling case C2). In total, there are 1,080 noisy singleton logs. We first consider noise of type 1.
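The three noise types can be sketched as follows; this is an illustrative reconstruction of the described procedure (the paper's exact noise generator and its parameters are not specified here), with the noise ratio interpreted as a fraction of the trace length.

```python
import random

def add_noise(trace, alphabet, ratio, mode, seed=None):
    """Inject noise into a trace: 'add' inserts random events
    (towards case C3), 'remove' deletes events (towards case C2),
    'mix' also swaps adjacent events (towards case C4)."""
    rng = random.Random(seed)
    t = list(trace)
    for _ in range(max(1, int(ratio * len(t)))):
        op = mode if mode != "mix" else rng.choice(["add", "remove", "swap"])
        if op == "add":
            t.insert(rng.randrange(len(t) + 1), rng.choice(alphabet))
        elif op == "remove" and t:
            t.pop(rng.randrange(len(t)))
        elif op == "swap" and len(t) > 1:
            i = rng.randrange(len(t) - 1)
            t[i], t[i + 1] = t[i + 1], t[i]
    return t
```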

Alignment Differences. In Table 1 we compare the resulting alignments, produced by Sym, for the different cost functions. When comparing the overall results of \(c_\mathrm {st}\) and \(c_\mathrm {sync}\) (rightmost column), we observe that \(c_\mathrm {sync}\) uses about 43% fewer log moves, which are instead turned into synchronous moves. However, in doing so, more than six times as many model moves are required.

When the amount of noise increases, the relative difference in the number of log moves remains the same, while the difference in model moves drops slightly. When increasing the number of activities from 25 to 75, we observe that the number of model moves for \(c_\mathrm {sync}\) grows from 3.2 times to 9.3 times that of \(c_\mathrm {st}\). Correspondingly, the difference in log moves between \(c_\mathrm {sync}\) and \(c_\mathrm {st}\) remains roughly the same for an increasing number of activities.

We conclude that for \(c_\mathrm {sync}\) the relative reduction in log moves remains mostly the same when varying the amount of noise or the size of the model. The size of the model, however, greatly affects the number of model moves for \(c_\mathrm {sync}\), making the alignments from \(c_\mathrm {st}\) and \(c_\mathrm {sync}\) diverge more for larger models.
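To make the contrast concrete, the two cost functions can be sketched as simple per-move-type cost tables (a simplified encoding of our reading; the formal definitions in Sect. 4 are authoritative, and tie-breaking between optima may differ):

```python
def alignment_cost(alignment, cost):
    """Cost of an alignment given a per-move-type cost table.

    alignment: list of (log_event, model_event) pairs, where None
    marks the ">>" skip symbol for a log or model move.
    """
    total = 0
    for log_ev, model_ev in alignment:
        if log_ev is not None and model_ev is not None:
            total += cost["sync"]   # synchronous move
        elif log_ev is not None:
            total += cost["log"]    # log move
        else:
            total += cost["model"]  # model move
    return total

# Standard cost: minimize deviations (log and model moves).
c_st = {"sync": 0, "log": 1, "model": 1}
# Max-sync cost: maximize synchronous moves; under this simplified
# encoding, model moves are free, which is consistent with the
# observed detours through the model.
c_sync = {"sync": -1, "log": 0, "model": 0}
```

Under this reading, model moves cost nothing for \(c_\mathrm {sync}\), which explains why optimal alignments may take large detours through the model to gain even a single synchronous move.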

Table 1. Comparison between alignments generated using the \(c_\mathrm {st}\) and \(c_\mathrm {sync}\) cost functions. The numbers show averages, e.g., the value of 2.3 in the top-left corner denotes the average number of log moves for all computed alignments for which 10% noise is added, using the \(c_\mathrm {st}\) cost function.

Performance Results. We observed that while Sym computes alignments faster than the A* algorithm for \(c_\mathrm {st}\) (on average 15.8 s per alignment for A* versus 10.5 s for Sym), for the \(c_\mathrm {sync}\) cost function A* outperforms the symbolic algorithm (13.7 s for A* versus 16.5 s for Sym). This is because the symbolic algorithm explores the entire model before attempting a single log move, whereas A* does not.

7.2 Alignment Problems that Only Add or Remove Events

Alignment Differences. In Table 2 we compare the resulting alignments for noise that only adds or only removes events. When inspecting the Add case, we find that \(c_\mathrm {st}\) already avoids model moves for the most part, as we would expect. Moreover, there are only small differences between the alignments from \(c_\mathrm {st}\) and \(c_\mathrm {add}\). For \(c_\mathrm {sync}\), many model moves may be chosen to increase the number of synchronous moves. These additional synchronous moves are arguably not part of the ‘desired’ alignment, since they require a large detour through the model.

When removing events from the log trace, the \(c_\mathrm {st}\) cost function only partly describes the removal of events, as it still chooses log moves. The \(c_\mathrm {sync}\) cost function does not take any log moves, as this maximizes the number of synchronous moves, making it equal to \(c_\mathrm {rem}\). Comparing \(c_\mathrm {st}\) and \(c_\mathrm {sync}\), one could argue that for the Add case the \(c_\mathrm {st}\) cost function better represents a ‘correct’ alignment, while for the Rem case \(c_\mathrm {sync}\) is better suited.

Performance Results. We observed that for \(c_\mathrm {st}\), A* performs relatively poorly in the Rem case (14.1 s on average), but significantly better for \(c_\mathrm {sync}\) (2.3 s on average). We argue that for \(c_\mathrm {st}\), A* tries to perform many log moves, which results in a lot of backtracking, while for \(c_\mathrm {sync}\) the algorithm avoids log moves entirely. The symbolic algorithm uses 6.6 s and 7.3 s on average for \(c_\mathrm {st}\) and \(c_\mathrm {sync}\), respectively. For the Rem case, we do not observe a significant difference in computation times when considering \(c_\mathrm {rem}\), i.e., when removing the log moves. This is because both algorithms already avoid log moves for the \(c_\mathrm {sync}\) cost function.

For the Add case, both A* and Sym require more time to compute alignments for \(c_\mathrm {sync}\) than for \(c_\mathrm {st}\). When removing model moves (\(c_\mathrm {add}\)), A* and Sym need respectively 36% and 77% of the time required for \(c_\mathrm {st}\) (thus 3.4 s and 9.3 s). By removing the model moves, both algorithms no longer have to explore a large part of the state space and only have to decide which log moves, synchronous moves, and silent actions to choose, which is especially beneficial for A*.

Table 2. Comparison between alignments generated using the \(c_\mathrm {st}\) and \(c_\mathrm {sync}\) cost functions for alignment problems where the noise only consists of adding (Add) or removing (Rem) events. The cost functions \(c_\mathrm {add}\) and \(c_\mathrm {rem}\) are variations on \(c_\mathrm {st}\) such that model and log moves, respectively, have a cost of \(\infty \).

7.3 Experiments Using Event Logs with More Traces

We now consider smaller models that have to align many log traces. For our experiments, we selected 9 instances from the 735 industrial business process Petri net models from financial services, telecommunications and other domains, obtained from the data sets presented in Fahland et al. [16].

For our selection, we computed the transitive closure graph (TCG) and considered the instances for which we were able to compute the TCG within 60 s. From this set, we selected the 9 most interesting cases, e.g., the models with the largest Petri nets, the largest marking graphs, the largest TCGs, and the largest TCG construction times. On average, the marking graph contains 108 markings and the TCG 134 states. In the worst case, the number of states in the TCG was 200, double the number of markings in the corresponding marking graph. We did not find a large difference between the performance results of the individual experiments.
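As a generic illustration of the closure computation underlying the TCG, the reachability closure of a marking graph can be computed with a breadth-first search from every node (a sketch only; the actual TCG construction in ProM involves additional bookkeeping):

```python
from collections import defaultdict, deque

def transitive_closure(edges):
    """Reachability closure of a directed graph given as (src, dst) pairs.

    Returns a dict mapping each node to the set of nodes reachable
    from it in one or more steps.
    """
    succ = defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        nodes.update((u, v))
    closure = {}
    for start in nodes:
        # BFS from each node; quadratic overall, acceptable for
        # marking graphs with ~100-200 states as reported above.
        seen, queue = set(), deque([start])
        while queue:
            u = queue.popleft()
            for v in succ[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        closure[start] = seen
    return closure
```

Note that a node on a cycle is reachable from itself, so closures of cyclic marking graphs contain their own start nodes.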

For each model, we generated sets of 10, 100, 1,000, and 10,000 log traces with 10%, 30%, 50%, and 70% noise added by adding, removing, and swapping events. Thus, in total, we have 16 event logs per model. We compared the performance of the TCG algorithm with that of A* using a single thread. We also experimented with the symbolic algorithm, but its setup time per alignment computation is too large to provide meaningful results. Note that in our experiments, we only consider the \(c_\mathrm {sync}\) cost function, as the TCG algorithm is not applicable to the \(c_\mathrm {st}\) cost function.

Table 3. Alignment computation time (in milliseconds) for models with many log traces. TCG-comp, TCG-align, and TCG respectively denote the time for computing the TCG, the time for aligning all log traces, and the sum of the two.

Results. The results are summarized in Table 3. On average, the TCG algorithm used 270 ms for computing the transitive closure graph. When increasing the number of log traces (left table), we see that this preprocessing step remains a significant part of the TCG algorithm's total time for up to 1,000 log traces. The A* algorithm has to create a synchronous product of the model and log trace for each instance and, as expected, takes more time in total. For 10,000 log traces, A* is 17 times slower than the TCG algorithm, but even for 10 log traces, the TCG algorithm outperforms A* by almost a factor of two.

When comparing the results for different amounts of noise (right table), we see practically no difference in the computation times for the TCG algorithm, whereas the A* algorithm requires significantly more time for 30%, 50%, and 70% noise compared to the 10% case. We argue that from 30% noise onwards, A* has to visit most of the state space to construct an optimal alignment, while noise does not seem to affect the performance of the TCG algorithm.

8 Related Work

One of the earliest works in conformance checking is that of Cook and Wolf [17], who compared log traces with paths generated from the model.

One technique to check conformance is token-based replay [4]. The idea is to ‘replay’ the event log by trying to fire the corresponding transitions, while keeping track of missing and remaining tokens in the model. However, this technique does not provide a path through the model, and when traces in the event log deviate significantly, the Petri net may get flooded with tokens, which then no longer provide useful insights.
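A minimal sketch of this replay idea (simplified; full implementations of [4] also account for the initial and final markings when counting produced and consumed tokens):

```python
def token_replay(trace, net, initial, final):
    """Token-based replay, sketched for a Petri net given as
    {transition_label: (input_places, output_places)}.

    Returns (produced, consumed, missing, remaining) token counts,
    from which a fitness value can be derived.
    """
    marking = dict(initial)            # place -> token count
    produced = consumed = missing = 0
    for label in trace:
        pre, post = net[label]
        for p in pre:                  # force-fire: create missing tokens
            if marking.get(p, 0) == 0:
                missing += 1
                marking[p] = 1
            marking[p] -= 1
            consumed += 1
        for p in post:
            marking[p] = marking.get(p, 0) + 1
            produced += 1
    # tokens left beyond the final marking count as remaining
    remaining = sum(max(0, marking.get(p, 0) - final.get(p, 0))
                    for p in marking)
    return produced, consumed, missing, remaining
```

A perfectly fitting trace yields zero missing and zero remaining tokens; the ‘flooding’ problem mentioned above shows up as force-fired transitions leaving stray tokens behind.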

Alignments were introduced [5, 7] to overcome the limitations of the token-based replay technique. Alignments formulate conformance checking as an optimization problem, i.e., minimizing the alignment cost function. Since their introduction, alignments have quickly become the standard technique for conformance checking, along with the A* algorithm for computing them [9]. In previous work [8], we presented the symbolic algorithm for alignments and analysed how different model characteristics influence the computation times for \(c_\mathrm {st}\).

For larger models, techniques have been developed to decompose the Petri net into smaller subprocesses [18]. For instance, fragments that have a single-entry and single-exit node (SESE) represent an isolated part of the model. This way, localizing conformance problems in large models becomes easier. It would be interesting to combine the TCG algorithm with such decomposed models.

A sub-field of alignments is the computation of prefix-alignments for incomplete log traces. This is useful for analysing processes in real time instead of a posteriori. Several techniques exist for computing prefix-alignments [7, 19], and the TCG approach that we introduced in this paper could also be suitable for this task. Recently, Burattin and Carmona [20] introduced a technique similar to the TCG approach, in which the marking graph is extended with additional edges to allow for deviations. However, it cannot guarantee optimality, as a single successor marking is chosen per event, whereas we consider all possible successors and can therefore better adapt to future events.

In a more general setting, conformance checking is related to finding a longest common subsequence, computing a diff, or computing minimal edit distances. Here, the problem is translated into searching for a string \(B\) from a regular language \(L\) such that the edit distance between \(B\) and an input word \(\alpha \) is minimal [21].
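This formulation can be sketched as a shortest-path search over the product of a DFA for \(L\) and the positions of \(\alpha \) (an illustrative implementation of the general idea, not the specific algorithm of [21]):

```python
import heapq

def edit_distance_to_language(word, delta, q0, accepting):
    """Minimal edit distance between `word` and any string accepted by a
    DFA (delta: dict (state, symbol) -> state), via Dijkstra on the
    product of DFA states and word positions.
    """
    dist = {(q0, 0): 0}
    heap = [(0, q0, 0)]
    while heap:
        d, q, i = heapq.heappop(heap)
        if d > dist.get((q, i), float("inf")):
            continue                     # stale queue entry
        if i == len(word) and q in accepting:
            return d                     # first settled goal is optimal
        moves = []
        if i < len(word):
            moves.append((q, i + 1, 1))           # delete word[i]
        for (state, a), nxt in delta.items():
            if state != q:
                continue
            moves.append((nxt, i, 1))             # insert symbol a
            if i < len(word):
                cost = 0 if a == word[i] else 1
                moves.append((nxt, i + 1, cost))  # match / substitute
        for nq, ni, c in moves:
            nd = d + c
            if nd < dist.get((nq, ni), float("inf")):
                dist[(nq, ni)] = nd
                heapq.heappush(heap, (nd, nq, ni))
    return None  # the language is empty
```

The deletions, insertions, and matches correspond directly to log moves, model moves, and synchronous moves in an alignment.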

9 Conclusion

In this paper, we considered a max-sync cost function that, instead of minimizing discrepancies between the log trace and the model, maximizes the number of synchronous moves. We empirically evaluated the differences with the standard cost function and compared the alignment computation times. The max-sync cost function also led to a new algorithm for computing alignments.

We observed that, in general, a considerable number of model moves may be required to add a few additional synchronous moves when comparing max-sync with the standard cost function. However, when alignment problems are structured such that log moves are at a lower granularity than the model, a max-sync cost function may be better suited. We also observed a significant performance improvement in alignment construction if alignments can be formed without taking any model moves or without any log moves.

On industrial models with many log traces, we showed that our new algorithm, which uses a preprocessing step on the model, is an order of magnitude faster in computing alignments on many log traces for the max-sync cost function.

We conclude that the max-sync cost function is complementary to the standard one as it provides an alternative view that may be preferable in some contexts, and it may also significantly reduce the alignment construction time.