
1 Introduction

Process discovery, one of the most crucial process mining tasks, entails the construction of process models from the event logs of information systems [1]. Its most arduous challenge is that discovery algorithms are unable to generate accurate and comprehensible process models from event logs stemming from highly flexible environments.

Trace clustering is an effective solution: it partitions the traces such that each resulting cluster corresponds to a coherent set of cases that can be adequately represented by a single process model [3]. Figure 1 shows the basic procedure of trace clustering.

Fig. 1.

Illustration of the basic procedure of trace clustering in process mining.

Nevertheless, most currently available trace clustering techniques are not precise enough because they treat the activities captured in traces indiscriminately. As a result, the impact of some important activities is diluted, and some typical process information may be distorted or even lost during comparison.

To address this drawback, this paper presents a novel similarity measurement based on constrained trace alignment. First, typical causal sequences that reflect the “backbone” of the process are identified. These sequences are then exploited as constraints to guarantee the priority of important activities in traces. Subsequently, we suggest two clustering strategies that agree with the process mining perspective. Agglomerative hierarchical clustering (AHC) was selected for its built-in flexibility over abstraction levels, which provides an overall insight into a complex process, while spectral clustering offers a sound recommendation for the number of clusters corresponding to a generic abstraction level.

In brief, this work contributes a novel constrained trace similarity measurement that guarantees the priority of important process episodes, and adapts two appropriate clustering techniques to the process mining perspective. In addition, experiments on real-life logs demonstrate the improvements achieved by our method relative to six existing methods.

The rest of the paper is organized as follows: Sect. 2 provides a brief overview of related work. Sect. 3 introduces our novel constrained trace similarity measurement and the process-adapted clustering strategies we selected. Sect. 4 discusses the experimental results. Finally, Sect. 5 draws conclusions and spells out directions for future work.

2 Related Work

The main distinction between trace clustering techniques lies in the clustering bias (distance/similarity measure) they propose. Existing approaches in the literature can be classified into two major categories.

2.1 Distance-Based Trace Clustering

2.1.1 Vector-Based Trace Clustering

Vector-based trace clustering approaches transform traces into a vector space, after which clustering can be performed with various distance metrics in that space. Greco et al. [8] pioneered the study of clustering log traces within the process mining domain: they build trace clusters over a vector space of activities and their transitions to discover expressive process models. They also introduced the notion of disjunctive workflow schemas (DWS) for divisive trace clustering [5]. Song et al. [13] elaborated on constructing so-called profiles, associated with multiple trace perspectives, as the feature vector.

2.1.2 Context-Aware Trace Clustering

Context-aware trace clustering approaches regard the entire trace as one sequence that carries all the process context information, so that various string edit distance metrics can be applied in conjunction with standard clustering techniques. In [2], Bose and van der Aalst propose a generic edit distance that derives specific edit operation costs so as to take the behavior in traces into account. The context-aware idea is developed further in [3], which leverages conserved patterns or subsequences as feature sets to characterize a trace.

2.2 Model-Based Trace Clustering

2.2.1 Sequence Clustering

Sequence clustering creates a first-order Markov chain per cluster and uses the expectation-maximization (EM) algorithm to determine the assignment of each sequence. It has been used in bioinformatics to automatically group large protein datasets when searching for homologous gene sequences. This technique was migrated to trace clustering in [6].

2.2.2 Active Trace Clustering

Active trace clustering inherits the underlying idea of sequence clustering [15]. A trace is added to the current cluster if the model discovered from the cluster including that trace satisfies a target fitness threshold. The result is a distribution of execution traces over a given number of clusters that maximizes the combined accuracy of the associated process models; in this way, the quality of the discovered models is kept under control. Extensions supporting further objectives have been developed in [7].

3 Approach Design

The distance/similarity measurement and the clustering strategy, with its specific characteristics, are both important cluster-theoretical aspects. We therefore introduce our approach in two steps. The framework of the approach is depicted in Fig. 2.

Fig. 2.

Framework of Constrained Trace Clustering.

3.1 Similarity Measurement Based on Constrained Traces Alignment

Just as noted in [10], “specifying an appropriate dissimilarity measure is far more important than choice of clustering algorithm. This aspect of the problem is emphasized less in the clustering literature since it depends on domain knowledge specifics.” It is therefore essential to refine the distance/similarity metrics used in trace clustering so as to provide more appropriate information to the clustering algorithm.

Identifying significant behaviors in traces would help mine better sub-process models, since traces could then be clustered on those behaviors. However, because behaviors in traces are treated indiscriminately and domain knowledge is lacking, capturing them directly from event logs is difficult.

Fortunately, association rules from data mining shed light on this task. Employing them, we can reveal the “backbone” of the process. Typical causal sequences that reflect this backbone are then identified and exploited as constraints to guarantee the priority of significant behaviors during similarity comparison.

3.1.1 Dependency Measures to Reveal Process Backbone

Definition 1.

(Dependency Measures) Let \( L \) be an event log over an activity set \( A \), and let \( a \) and \( b \) be activities occurring in \( L \); \( \left|\upsigma \right| \) denotes the length of a trace \( \upsigma \).

\( \left| {a\,\, >_{L} \,b} \right| \) is the number of times \( a \) is directly followed by \( b \) in \( L \), i.e.,

$$ \left| {a >_{L} b} \right| = \sum\limits_{\upsigma \in L} L(\upsigma) \times \left| {\left\{ 1 \le i < \left| \upsigma \right| \;\middle|\; \upsigma(i) = a \wedge \upsigma(i + 1) = b \right\}} \right| $$
(1)

\( \left| {a \Rightarrow_{L} \,b} \right| \) is the value of the dependency relation between \( a \) and \( b \):

$$ \left| {a \Rightarrow_{L} b} \right| = \begin{cases} \dfrac{\left| {a >_{L} b} \right| - \left| {b >_{L} a} \right|}{\left| {a >_{L} b} \right| + \left| {b >_{L} a} \right| + 1} & \text{if } a \ne b \\[2ex] \dfrac{\left| {a >_{L} a} \right|}{\left| {a >_{L} a} \right| + 1} & \text{if } a = b \end{cases} $$
(2)

\( \left| {a \Rightarrow_{L} b} \right| \) produces a value between –1 and 1. If \( \left| {a \Rightarrow_{L} b} \right| \) is close to 1, then there is a strong positive dependency between \( a \) and \( b \).

By setting \( \upmu \), a threshold on \( \left| {a >_{L} b} \right| \), we can filter out infrequent items. And when \( \left| {a \Rightarrow_{L} b} \right| \) meets a certain threshold \( \upupsilon \), we specify that there is a connection between \( a \) and \( b \).

To illustrate the basic concepts, we use the following event log \( L \):

$$ \begin{aligned} & L = [ < a,d,e,f,g,i >^{16} , < a,b,e,f,i >^{1} , < a,c,e,f,h,i >^{1} , < a,c,b,e,f,h,i >^{12} , \\ & < a,e,f,g,i >^{5} , < d,d,a,d,e,f,g,i >^{1} , < a,b,c,e,f,h,i >^{13} , < a,d,d,d,e,f,g,i >^{1} ] \\ \end{aligned} $$

Figure 3(A) depicts the dependency graph for the thresholds \( \upmu = 2 \), \( \upupsilon = 0.7 \), and Fig. 3(B) shows another dependency graph for \( \upmu = 5 \), \( \upupsilon = 0.85 \). The dependency graph does not show routing logic, but it reveals the “backbone” of the process model. The derived dependency graph (denoted \( G \)) is then used as a reference revealing the general order of typical activities.

Fig. 3.

Dependency graphs according to the dependency measures.
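The dependency measures above reduce to weighted directly-follows counts over the log. The following Python fragment is an illustrative sketch of Eqs. (1) and (2) on the running example (the function names are ours, and we assume both thresholds are inclusive, which the text leaves unspecified):

```python
from collections import Counter

# Running-example event log: (trace, multiplicity) pairs.
log = [
    (("a","d","e","f","g","i"), 16), (("a","b","e","f","i"), 1),
    (("a","c","e","f","h","i"), 1),  (("a","c","b","e","f","h","i"), 12),
    (("a","e","f","g","i"), 5),      (("d","d","a","d","e","f","g","i"), 1),
    (("a","b","c","e","f","h","i"), 13), (("a","d","d","d","e","f","g","i"), 1),
]

def directly_follows(log):
    """Weighted directly-follows counts |a >_L b| of Eq. (1)."""
    df = Counter()
    for trace, mult in log:
        for a, b in zip(trace, trace[1:]):
            df[(a, b)] += mult
    return df

def dependency(df, a, b):
    """Dependency measure |a =>_L b| of Eq. (2), a value between -1 and 1."""
    if a != b:
        ab, ba = df[(a, b)], df[(b, a)]
        return (ab - ba) / (ab + ba + 1)
    aa = df[(a, a)]
    return aa / (aa + 1)

def backbone(log, mu, upsilon):
    """Edges of the dependency graph G; both thresholds treated as inclusive
    (an assumption, since the paper does not fix the comparison)."""
    df = directly_follows(log)
    return {(a, b) for (a, b), n in df.items()
            if n >= mu and dependency(df, a, b) >= upsilon}
```

For instance, \( a \) is directly followed by \( d \) 18 times in \( L \) and \( d \) by \( a \) once, giving a dependency of \( 17/20 = 0.85 \); with \( \upmu = 5 \) the infrequent \( d \) self-loop (count 3) is filtered out.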

3.1.2 Constrained Similarity Measurement

Let \( A_{i} \) denote the set of activities involved in trace \( \upsigma_{i} \) (its alphabet) and \( B \) the set of activities involved in the “process backbone”. \( A_{i} \cap A_{j} \cap B \) is the set of activities common to \( \upsigma_{i} \) and \( \upsigma_{j} \) with respect to the backbone. Referring to the dependency graph above, we can obtain their causal sequences. For example, the common activities between the traces \( < d,d,a,d,e,f,g,i > \) and \( < a,d,d,d,e,f,g,i > \) in \( L \) and the dependency graph are \( < a,d,e,f,g,i > \); the resulting constraints are shown as the black highlights in Fig. 4. We call these common activities, attached to their causal sequences, typical behaviors.

Fig. 4.

Illustration of constraint instance.

These behaviors approximate the essence of both traces being compared from a global perspective. We utilize them as constraint conditions to guarantee the priority of typical behaviors between traces. In the following, the typical behaviors between traces \( \upsigma_{i} \) and \( \upsigma_{j} \) concerning the process backbone are denoted \( C_{i,j} \).

Definition 2.

(Constrained Similarity Measurement) Let \( \upsigma_{i} \) and \( \upsigma_{j} \) be two traces with constraints \( C_{i,j} \). The similarity of \( \upsigma_{i} \) and \( \upsigma_{j} \), \( Sim(\upsigma_{i} ,\upsigma_{j} ) \), is defined as:

$$ Sim(\upsigma_{i} ,\upsigma_{j} ) = \frac{{length(CLCS(\upsigma_{i} ,\upsigma_{j} ,C_{i,j} ))}}{{\hbox{max} (length(\upsigma_{i} ),\,\,length\;(\upsigma_{j} ))}} $$
(3)

The constrained measurement relies on the constrained longest common subsequence (CLCS). The CLCS has already been applied in bioinformatics to compute the homology of two biological sequences. For more details on CLCS, please refer to [14].
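As a sketch of how Eq. (3) can be computed, the fragment below implements the standard \( O(|\upsigma_{i}| \cdot |\upsigma_{j}| \cdot |C|) \) dynamic program for the CLCS from the literature [14]; it is our own illustration (names ours), and the constraint \( C_{i,j} \) is assumed to be given:

```python
from functools import lru_cache

def clcs_length(x, y, c):
    """Length of the longest common subsequence of x and y that
    contains the constraint c as a subsequence (CLCS)."""
    NEG = float("-inf")

    @lru_cache(maxsize=None)
    def f(i, j, k):
        # f(i, j, k): CLCS length of x[:i] and y[:j] embedding c[:k].
        if i == 0 or j == 0:
            return 0 if k == 0 else NEG  # constraint cannot be embedded
        best = max(f(i - 1, j, k), f(i, j - 1, k))
        if x[i - 1] == y[j - 1]:
            best = max(best, 1 + f(i - 1, j - 1, k))          # match; c already embedded
            if k > 0 and x[i - 1] == c[k - 1]:
                best = max(best, 1 + f(i - 1, j - 1, k - 1))  # match consumes c[k-1]
        return best

    return f(len(x), len(y), len(c))

def similarity(s1, s2, c):
    """Constrained similarity of Eq. (3)."""
    return max(clcs_length(s1, s2, c), 0) / max(len(s1), len(s2))
```

For the two example traces of Fig. 4 with constraint \( < a,d,e,f,g,i > \), the CLCS has length 6, so the similarity is \( 6/8 = 0.75 \); note the unconstrained LCS has length 7 but omits the important activity \( a \).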

3.2 Adapted Trace Clustering Strategies

Traditional trace clustering merely adopts data-centric clustering algorithms. However, as described in [3], the most important evaluation dimension for trace clustering is the process discovery perspective, which demands that the adopted clustering strategies be compatible with process features. Here, we suggest two apposite clustering strategies for different applications.

3.2.1 Agglomerative Hierarchical Clustering

Thanks to its hierarchical character, AHC naturally matches the continuum ranging from unstructured to structured processes. The bottom of the hierarchy assigns one cluster to each trace, whose mined process is surely structured, while the top is the single cluster containing everything, whose mined process is usually unstructured in a flexible environment. Thus, we can either pick the applicable level as desired or traverse the whole hierarchy to gain an overall insight into a complex process.

The method we adopt, proposed in [4], is termed GuideTreeMiner. GuideTreeMiner uses AHC to build a guide tree (also known as a dendrogram); any horizontal cut across the dendrogram corresponds to a practical clustering at a specific abstraction level.
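A guide tree in this spirit can be sketched with plain average-linkage AHC over trace distances (e.g. \( 1 - Sim \)). The following minimal sketch is our own illustration, not the GuideTreeMiner implementation; it returns the merge sequence that defines the dendrogram:

```python
def average_linkage(dist):
    """Average-linkage AHC over a symmetric distance matrix.
    Returns the merge sequence (id_a, id_b, distance) defining a
    guide tree; leaves are 0..n-1, merged clusters get fresh ids."""
    clusters = {i: [i] for i in range(len(dist))}
    merges, next_id = [], len(dist)
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    # average pairwise distance between clusters a and b
                    pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                    d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        merges.append((a, b, d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Cutting the resulting tree at increasing merge distances yields the clusterings at successive abstraction levels.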

3.2.2 Spectral Clustering

Apart from hierarchical algorithms, most trace clustering techniques require predefined parameters such as the number of clusters or the maximum cluster size. In truth, setting these specialized parameters is far from easy for general users, who lack domain knowledge; even for experts it is not trivial, owing to the complexity and flexibility of real-life processes. Against this background, spectral clustering was selected, as it provides a good recommendation for the number of clusters [11], which can guide us to a generic abstraction level.

It is worth pointing out that the affinity matrix is usually computed with a Gaussian kernel, which does not reflect the nature of processes; in this work it is instead computed with the constrained similarity described in the previous section. Likewise, the Laplacian is often normalized, but owing to the robustness of the constrained similarity measurement against infrequent behavior, and in pursuit of a stable clustering indication, we select the unnormalized Laplacian. The indication of the cluster number \( k \) can be recognized by a sudden drop in the eigenvalues. In many cases, the two solutions can be used in combination.
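The eigenvalue-drop indication can be sketched with the eigen-gap heuristic: build the affinity matrix \( W \) from the constrained similarities, form the unnormalized Laplacian \( L = D - W \), and take the largest gap between consecutive eigenvalues as the suggested \( k \). A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def suggest_k(W, max_k=None):
    """Eigen-gap heuristic on the unnormalized Laplacian L = D - W:
    return the k with the largest gap between consecutive eigenvalues."""
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    vals = np.sort(np.linalg.eigvalsh(L))   # ascending; L is symmetric
    max_k = max_k if max_k is not None else len(vals) - 1
    gaps = np.diff(vals[:max_k + 1])
    return int(np.argmax(gaps)) + 1
```

For an affinity matrix with \( c \) well-separated groups, \( L \) has \( c \) (near-)zero eigenvalues, so the first large gap appears after position \( c \) and the heuristic returns \( k = c \).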

4 Experiments

4.1 Experiment Configuration and Evaluation Criterion

We performed the experiments in the ProM framework, which has been developed to support process mining algorithms. The data comes from a Dutch financial institute. We adopted the HeuristicsMiner to derive the process models, as it copes best with real-life event logs. The approaches compared against are: DWS Mining [5], Trace Clustering [13], GuideTree Miner [2, 3], Sequence Clustering [6] and ActiTraC [15].

We evaluate the results with respect to model complexity, measured by the comprehensive list of metrics reported in [12]:

  1. \( \left| A \right| \) signifies the number of arcs in the process model.

  2. \( \left| N \right| \) signifies the number of nodes in the process model.

  3. \( \left| {CN} \right| = \left| A \right| - \left| N \right| + 1 \) signifies the cyclomatic number of the process model.

  4. \( CNC = \tfrac{\left| A \right|}{\left| N \right|} \) signifies the coefficient of connectivity of the process model.

  5. \( \Delta = \tfrac{\left| A \right|}{\left| N \right| \cdot \left( \left| N \right| - 1 \right)} \) signifies the density of the process model.
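The three derived metrics reduce to simple arithmetic on the arc and node counts; as a small worked sketch (the helper name is ours):

```python
def complexity_metrics(num_arcs, num_nodes):
    """Derived model-complexity metrics of [12] from arc/node counts."""
    A, N = num_arcs, num_nodes
    return {
        "CN": A - N + 1,               # cyclomatic number
        "CNC": A / N,                  # coefficient of connectivity
        "density": A / (N * (N - 1)),  # density
    }
```

For example, a model with 10 arcs and 8 nodes has \( CN = 3 \), \( CNC = 1.25 \) and density \( 10/56 \approx 0.179 \).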

4.2 Clustering Results

We compared different numbers of clusters at three abstraction levels. Level 1 represents the original trace set, i.e., a single cluster; level 2 stands for 2/3/4 clusters; level 3 contains 5/6/7 clusters. We calculated \( \left| A \right| \) as a weighted average over the clusters' corresponding models, and likewise for \( \left| N \right| \).

The aggregated results are presented in Table 1, and all the data is depicted in the radar plots of Fig. 5. All clustering techniques lead to models with lower complexity than the original log. However, the DWS, ActiTraC and Trace Clustering approaches yield clusters whose models have higher density than the unclustered log, although they perform well on the other metrics. The smaller area and more balanced capabilities shown in the radar plots at both abstraction levels prove the effectiveness of our constraints.

Table 1. The aggregated clustering results
Fig. 5.

Radar plots of different trace clustering techniques.

Moreover, Fig. 6 depicts Constrained Trace Clustering at different levels. As the number of clusters increases, all aspects improve only slightly; considering the extra elaboration that more clusters require, setting the number of clusters to a higher value is inefficient and of little benefit. This is consistent with the spectral clustering result: as the eigenvalue scatterplot in Fig. 7 shows, only the first three clusters are well separated. Spectral clustering can thus guide us to the right level of abstraction by recommending the number of clusters, instead of iterating over the hierarchy.

Fig. 6.

Constrained Trace Clustering from different abstraction levels.

Fig. 7.

Similarity matrix eigenvalues.

In a nutshell, these experiments confirm the effectiveness and efficiency of Constrained Trace Clustering in delivering comprehensible process models from flexible and complex logs.

5 Conclusions and Future Perspectives

In this paper, we contribute to trace clustering by imposing constraints on the trace similarity/distance measurement to guarantee the priority of important activities in traces. In this way, significant process information is preserved as much as possible, yielding a more accurate trace similarity measurement. Moreover, we integrate two clustering strategies that agree with the process mining perspective to cluster the traces.

A number of challenging issues remain open for future work. Firstly, more refined similarity/distance measurements that felicitously match process domain knowledge are encouraged; for processing sequence data, we can learn from bioinformatics, which has more mature applications of sequence data mining. Another direction is integrating advanced clustering strategies, as currently available techniques only adapt traditional, data-centric clustering techniques to process mining; more generally, designing ad hoc clustering strategies usually leads to more suitable results. Finally, trace clustering only sorts in the horizontal direction; it would be promising to combine it with techniques that abstract in the vertical direction, such as the FuzzyMiner [9], which brings cartography to process mining by clustering activities.