1 Introduction

The use of Internet of Things (IoT) devices in organisations is becoming increasingly common, where they support business processes (BPs), then known as IoT-enhanced BPs [16, 36]. The execution of BP activities is usually recorded in event logs, which can be analysed to gain insights into the BP and identify opportunities for improvement. When BPs are augmented with IoT devices, these devices can also provide critical contextual information. One of the main domains where IoT-enhanced BPs are found is smart manufacturing. In these BPs, sensors can track time series (TS) data on various process parameters, such as flow, temperature, and pressure, which can aid in predicting process outcomes and automating tasks. However, due to the unique characteristics of IoT data, such as their granularity and their storage independent of the process system [4], it is necessary to develop new process mining (PM) techniques designed specifically for them. This emerging field of IoT-enhanced PM is still in its early stages [4], with only limited research done so far, focusing primarily on decision mining using IoT data [2, 32].

In this paper, we propose TROPIC (TRace attributes, cOntrol-flow Plus Iot Clustering), a novel approach for multi-perspective trace clustering that is capable of integrating the TS sensor data perspective, in addition to the control-flow and trace attribute data perspectives. By integrating these different perspectives, multi-perspective trace clustering can effectively identify process variants and anomalous process executions in smart manufacturing that may not be apparent from analysing the control-flow or another single perspective alone. Knowing these variants can, in turn, help organisations identify and propagate best practices to enhance process efficiency and increase the likelihood of positive process outcomes. To demonstrate the effectiveness of our approach, we apply it to a real-life manufacturing process and provide a detailed evaluation of the results. This case study highlights the potential of our approach to analyse and improve IoT-enhanced BPs.

The remainder of the paper is organised as follows. First, Sect. 2 provides an overview of previous research in multi-perspective PM, IoT-enhanced PM, and trace clustering. In Sect. 3, we present TROPIC, our two-level approach for multi-perspective trace clustering, and apply it to the manufacturing process in question in Sect. 4. The experimental results are discussed in Sect. 5, and Sect. 6 concludes the paper with final remarks and suggestions for future work.

2 Background

2.1 Trace Clustering

Trace clustering is a technique used to group similar process instances, e.g., based on their shared sequential activity patterns. Traditionally, trace clustering has been used to improve process discovery by splitting the event log into sublogs consisting of instances that share comparable activity sequences, and mining a model of each sublog separately. This approach produces simpler and better fitting models that describe different process variants [5, 9, 13]. More recently, however, trace clustering has been applied for other goals, such as concept drift detection and process evolution analysis [19], and outlier detection [11]. Although improving process discovery results can typically rely only on the control-flow perspective, other objectives can greatly benefit from incorporating context information in clustering.

According to [8], three main categories of trace clustering approaches have been proposed: distance-based, feature-based, and model-based. Distance-based approaches directly cluster traces based on the distances between traces as sequences of activities, using distance metrics such as the Hamming distance, Levenshtein distance, Damerau-Levenshtein distance, and geodesic distance. Feature-based techniques, on the other hand, derive features from the traces, such as scalars, graphs and embeddings, and cluster based on the feature values. Finally, model-based techniques aim to create clusters of traces that produce the best process models [9], optimising criteria such as model fitness. These three approaches have their advantages and disadvantages depending on the nature of the data and the intended application. Choosing the appropriate approach is critical to the effectiveness of the trace clustering process.

2.2 Multi-perspective Process Mining

Multi-perspective PM refers to process analysis techniques that take more than one process perspective into account, e.g., the control-flow and data attributes. The following perspectives are listed in [22]:

  • Control-flow perspective: Activity ordering in each process instance;

  • Resource perspective: Human and non-human resources executing tasks;

  • Data perspective: Trace and event attributes;

  • Time perspective: Activity duration, throughput time, business rules, etc.;

  • Function perspective: Granularity of the activities of the process.

Multi-perspective techniques have been proposed for various PM tasks, such as multi-perspective process discovery [18, 24] and multi-perspective conformance checking [14, 23]. In trace clustering, a multi-perspective approach is proposed in [15], where distance metrics are presented to compare traces based on the control-flow perspective, the resource perspective, and the data perspective. The (possibly weighted) average of these metrics is computed and used as a pairwise multi-perspective distance measure to perform hierarchical clustering.

However, extending such a technique to TS data can be challenging, as TS typically need to be characterised by many features. For example, [12] reviewed the proposed TS characteristics in the literature and identified a list of approximately 7,700 characteristics to fully represent the TS data. Therefore, proceeding in one step, inputting TS features in a feature vector or including them in an average as in [15], would likely either result in TS features dominating over other perspectives or require very careful selection of TS features beforehand. This problem grows dramatically when considering multivariate TS, which are very common in manufacturing. To address this issue, we propose a two-step approach that is more versatile than the simple average of distances computed over multiple perspectives.
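For reference, the one-step combination used in [15] can be sketched as a (possibly weighted) average of per-perspective distance matrices. This is only a minimal illustration: the function names, the toy matrices and the weights are hypothetical, and the sketch assumes that each matrix has already been scaled to a comparable range. The hierarchical clustering step that would consume the combined matrix is not shown.

```python
def combined_distance(distances, weights=None):
    """Weighted average of the per-perspective distances for one pair of traces."""
    if weights is None:
        weights = [1.0] * len(distances)
    if len(weights) != len(distances):
        raise ValueError("one weight per perspective is required")
    return sum(w * d for w, d in zip(weights, distances)) / sum(weights)


def combined_matrix(matrices, weights=None):
    """Combine several n x n distance matrices (one per perspective) into one."""
    n = len(matrices[0])
    return [
        [combined_distance([m[i][j] for m in matrices], weights) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical pairwise distances for two traces in three perspectives.
control_flow = [[0.0, 2.0], [2.0, 0.0]]
resource = [[0.0, 1.0], [1.0, 0.0]]
data = [[0.0, 0.5], [0.5, 0.0]]

# Control-flow counts twice as much as the other two perspectives.
multi = combined_matrix([control_flow, resource, data], weights=[2.0, 1.0, 1.0])
```

With thousands of TS features in one of the perspectives, the choice of weights and scaling in such an average becomes delicate, which is the difficulty the two-step approach is meant to avoid.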

2.3 IoT-Enhanced PM

Event Log Derivation. The existing literature on IoT-enhanced PM has primarily focused on deriving high-level events of the process from low-level IoT data to create event logs. Subsequently, traditional PM techniques have been employed to analyse these event logs and discover control-flow models of the processes. Several techniques have been proposed specifically for manufacturing processes. In [35], a four-step framework is presented to generate event logs from industrial IoT data, including data preprocessing, clustering of low-level data, classification to derive events from clusters, and creation of the final event log. Also, focussing on industrial applications, [34] propose to transform raw IoT data into an XES event log using complex event processing and event detection and refinement techniques. The authors present another approach in [33] to detect activities interactively from sensor data based on visualisation and exploratory analysis. In [37], a domain-specific language is developed to extract event logs from IoT data by specifying the case and activity identifiers.

Process Contextualisation. Next to event log derivation, some context-aware techniques have also been investigated, e.g., IoT data-aware process discovery [2, 20], sensor TS-aware decision mining [32], and IoT-aware conformance checking [28]. In a manufacturing context, [32] outlines an approach to derive decision rule patterns from TS sensor data by automatically featurising the sensor data and training a decision tree to learn the rules. A different problem is addressed by [28], who present an approach for IoT-enhanced deviation detection. In their paper, they argue that traditional conformance checking cannot take into account data that changes over time independently of the events of the process (i.e., TS data). They subsequently proposed a method to detect patterns in the TS data directly.

3 Methodology

TROPIC involves a two-step clustering process (see Fig. 1) currently tailored to the setting of smart manufacturing, typically characterised by highly structured processes around which sensor data are collected in the form of TS. Indeed, in such manufacturing BPs, sensor data and process activities are usually correlated, with process activities leaving recognisable patterns in the sensor data and certain sensor data values triggering the execution of certain process activities. In the clustering process of TROPIC, process instances are first clustered separately according to three perspectives: the control-flow, trace attribute data and TS sensor data perspectives. In this step, each perspective is considered independently, providing a detailed view of each aspect of the process. Then, the distances to each centroid in each clustering are computed and used as features for a second clustering step, which takes into account all three perspectives together. This results in a multi-perspective clustering that groups instances based on their unique combinations of control-flow, trace attributes and TS sensor data, providing a comprehensive view of the process.

Next, we explain the approach applied for each single-perspective clustering, followed by the multi-perspective clustering.

Fig. 1. Overview of TROPIC, our proposed approach.

3.1 Control-Flow Perspective

As mentioned in Sect. 2, three main categories of trace clustering have been proposed: distance-based, feature-based, and model-based approaches. Our approach follows the first by using the Damerau-Levenshtein (DL) distance. The DL distance is a string metric used to compute the edit distance between two strings, which is the minimum number of single-character edits (i.e., insertions, deletions, substitutions, and transpositions) required to transform one string into the other. It extends the Levenshtein distance by also including transpositions of characters. The DL distance between strings A and B, denoted DL(A,B), is computed as follows:

$$\begin{aligned} DL(A,B) = \left\{ \begin{array}{ll} \max (|A|,|B|) & \text{ if } \min (|A|,|B|) = 0 \\ \min {\left\{ \begin{array}{ll} DL(A_{1..i-1},B) + 1 \\ DL(A,B_{1..j-1}) + 1 \\ DL(A_{1..i-1},B_{1..j-1}) + 1 - \delta _{a_i,b_j} \\ DL(A_{1..i-2},B_{1..j-2}) + 1 \end{array}\right. } & \text{ otherwise } \end{array} \right. \end{aligned}$$
(1)

where |A| denotes the length of string A, \(i=|A|\) and \(j=|B|\), \(A_{1..k}\) denotes the prefix of A consisting of its first k characters, \(a_i\) denotes the i-th character of string A, and \(\delta _{a_i,b_j}\) is the Kronecker delta function, which is equal to 1 if \(a_i=b_j\), and 0 otherwise, so that a substitution costs 1 only when the two characters differ. The last term in the minimum function corresponds to transposition, and is only included if \(i,j>1\) and \(a_{i-1}=b_j\) and \(b_{j-1}=a_i\).
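As an illustration, the recurrence can be evaluated bottom-up with dynamic programming. The sketch below is not taken from the TROPIC implementation. It assumes that each trace is given as a sequence of activity labels, so that one "character" is one activity, and it implements the restricted variant (often called optimal string alignment) that the transposition condition above describes. The function name and the example labels are hypothetical.

```python
def dl_distance(a, b):
    """Restricted Damerau-Levenshtein distance between two sequences."""
    n, m = len(a), len(b)
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i elements
    for j in range(m + 1):
        d[0][j] = j  # insert all j elements
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (free on a match)
            )
            # Transposition of two adjacent elements.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]


# Two hypothetical traces that differ by one swap of adjacent activities.
trace_1 = ["load", "mix", "filter", "bottle"]
trace_2 = ["load", "filter", "mix", "bottle"]
swap_distance = dl_distance(trace_1, trace_2)  # 1
```

Computing this value for every pair of traces gives the control-flow distance matrix used for clustering.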

Due to the strictly ordered nature of control-flow data in many manufacturing processes, other trace clustering paradigms are usually less suitable. Additionally, activities are often logged at a fairly low level of granularity, making model-based techniques less appropriate. It is worth noting that manufacturing processes tend to be more structured in nature, and thus may not require more complex trace clustering techniques designed for less structured processes.

3.2 Trace Attribute Data Perspective

Trace attributes are usually numerical, categorical, or ordinal features that can be clustered using traditional clustering techniques. Common clustering techniques include hierarchical techniques [38]; distance-based techniques, such as K-means [21] or K-medoids [27]; model-based techniques, such as self-organising maps [17]; and density-based techniques, such as DBSCAN [10]. TROPIC uses K-means, as a generic technique for mixed-type input features, which is most often the case in smart manufacturing. Moreover, its simplicity makes it easily understandable for non-experts. However, depending on the specific process, other techniques could be applied as well; for a general discussion of clustering techniques, see [31].
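To make the K-means step concrete, the following is a minimal sketch of Lloyd's algorithm in plain Python. It assumes that the trace attributes have already been encoded as numerical vectors and scaled, which the text above does not detail. The initialisation is plain random sampling, the function names and toy data are hypothetical, and in practice an established library implementation would normally be preferred.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical, already scaled attribute vectors for four batches.
batches = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(batches, k=2)
```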

3.3 Time Series Sensor Data Perspective

In TS analysis, [1] distinguishes three categories of techniques to cluster whole TS: distance-based features, using measures such as Euclidean or dynamic time warping (DTW) distance [30]; structure-based features, which characterise the whole TS; and shape-based features, created by searching for common motifs.

We use DTW distance, which allows a direct comparison of whole TS and is suitable for TS that are expected to share a common general structure, as is the case in most manufacturing processes, but can differ in length and speed (i.e., certain subsequences can last longer in one TS than in the other). Intuitively, it corresponds to the distance remaining between two series after eliminating timing differences, i.e., correcting for varying activity duration. It relies on the computation of a warping function mapping time points from two series together to minimise the distance between the two series. More specifically, given two series \(A=a_1 , a_2 , ... , a_i , ... , a_n\) and \(B=b_1, b_2, ..., b_j, ..., b_m\), with distance \(d_{i,j}=||a_i - b_j||\), DTW aims at finding an optimal mapping function \(F=c(1), c(2), ..., c(k), ..., c(l)\), where each element \(c(k)=(i_k, j_k)\) pairs a point of A with a point of B and \(d(c(k))=d_{i_k,j_k}\), such that the total weighted distance \(E(F)= \sum _{k=1}^{l} d(c(k)) \cdot w(k)\), normalised by the sum of the weights, is minimised:

$$\begin{aligned} DTW(A,B) = \min _{F} \left[ \frac{ \sum _{k=1}^{l} d(c(k)) \cdot w(k) }{\sum _{k=1}^{l} w(k)} \right] \end{aligned}$$
(2)

where w(k) is a weight coefficient for the elements of the mapping function.
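The minimisation in Eq. (2) is usually solved with dynamic programming. The sketch below is one possible instance, not necessarily the configuration used in TROPIC. It assumes univariate series and the common symmetric weighting, in which a diagonal step has weight 2 and a horizontal or vertical step has weight 1. Under that assumption, the sum of the weights equals n + m for every warping path, so the normalisation can be applied once at the end. The function name is hypothetical.

```python
def dtw_distance(a, b):
    """Normalised DTW distance between two univariate series."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both series must be non-empty")
    inf = float("inf")
    # g[i][j] is the minimal weighted cost of aligning a[:i] with b[:j].
    g = [[inf] * (m + 1) for _ in range(n + 1)]
    g[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            g[i][j] = min(
                g[i - 1][j] + d,          # step in A only, weight 1
                g[i][j - 1] + d,          # step in B only, weight 1
                g[i - 1][j - 1] + 2 * d,  # diagonal step, weight 2
            )
    # With symmetric weights, the weights along any path sum to n + m.
    return g[n][m] / (n + m)


# The same shape at two different speeds is at distance zero.
slow = [0.0, 0.0, 1.0, 1.0]
fast = [0.0, 1.0]
same_shape = dtw_distance(slow, fast)  # 0.0
```

The nested loops make the cost proportional to n times m, which is one practical reason to shorten long sensor series before computing pairwise distances.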

Applying this for each pair of batches yields a distance matrix which can be used as input for clustering techniques like K-medoids or hierarchical clustering.
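A precomputed distance matrix is all that K-medoids needs, since medoids are actual instances rather than averages. The following sketch uses the simple alternating scheme (assign each instance to its nearest medoid, then re-pick the medoid of each cluster) with random initialisation. It is an illustrative assumption rather than the exact variant used in our experiments, and the function name and toy matrix are hypothetical.

```python
import random


def kmedoids(dist, k, n_iter=100, seed=0):
    """Alternating K-medoids on an n x n distance matrix.

    Returns (labels, medoids), where medoids are row indices of dist.
    """
    n = len(dist)
    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)  # random initial medoids

    def assign(current):
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(n_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # The new medoid minimises the total distance to its cluster members.
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


# Hypothetical distance matrix for four batches lying at 0, 1, 10 and 11.
positions = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
labels, medoids = kmedoids(toy_dist, k=2)
```

The rows of the distance matrix restricted to the medoid columns directly give the distance of every instance to every cluster representative.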

3.4 Multi-perspective Clustering

Once process instances are clustered separately in each perspective, the results are combined by clustering them together. Single-perspective clusters can be represented in different ways, such as using their labels as categorical features or computing distances to the centroids. We follow the latter approach, which retains more information for multi-perspective clustering.

Moreover, perspectives can be weighted to adjust their contribution to the multi-perspective clustering. For example, control-flow can be given more weight to ensure it has sufficient influence on the final clustering. Weights can also be used to account for differences in the number of clusters generated by each perspective, where more clusters may result in more features and a greater impact on the final clustering.
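The construction of the second-level feature vectors can be sketched as follows. Each perspective contributes one block of distances to its centroids (or medoids), and each block is scaled before concatenation. The scaling shown here, a user-chosen weight divided by the square root of the number of clusters of the perspective, is only one plausible choice and an assumption of this sketch: since K-means relies on squared Euclidean distances, it prevents a perspective with more clusters from contributing more squared terms. The function name and the toy values are hypothetical.

```python
import math


def centroid_distance_features(blocks, weights=None):
    """Concatenate per-perspective distance-to-centroid blocks.

    blocks[p][i] lists the distances of instance i to the K_p centroids
    of perspective p. Returns one feature vector per instance.
    """
    if weights is None:
        weights = [1.0] * len(blocks)
    n = len(blocks[0])
    features = [[] for _ in range(n)]
    for block, weight in zip(blocks, weights):
        if len(block) != n:
            raise ValueError("every perspective must cover the same instances")
        k_p = len(block[0])
        scale = weight / math.sqrt(k_p)  # assumed compensation for K_p
        for row, feature in zip(block, features):
            feature.extend(scale * d for d in row)
    return features


# Two instances; the first perspective has two clusters, the second has one.
control_flow_block = [[1.0, 3.0], [2.0, 0.0]]
ts_block = [[4.0], [0.0]]
features = centroid_distance_features([control_flow_block, ts_block])
```

The resulting vectors can then be clustered with K-means to obtain the multi-perspective clusters.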

4 A Case Study in Smart Manufacturing

4.1 Use Case

Process Description. We applied TROPIC to a real use case at a partner company active in the production of chemical products. Their production process can be summarised in four main steps:

  1. Preparing raw material and loading the tank;

  2. Mixing the raw material in the tank;

  3. Circulating the product through filters to remove impurities;

  4. Bottling and packing the finished product.

Sometimes, the quality of the product is not high enough after filtering, i.e., some characteristics of the product do not meet the specifications. In this case, an adjustment is applied by loading additional raw materials into the tank and repeating steps two and three, resulting in the high-level production process depicted in Fig. 2.

Fig. 2. High-level model of the process analysed in the experiment.

This seemingly simple process has to be executed with extreme precision and care, as the slightest presence of impurities in the finished product greatly diminishes its quality. This is why the company is interested in analysing production logs and TS sensor data together to uncover variation in process execution.

Data. Two main data sources are used in this use case: 1) logs from the production system, which contain the sequences of activities executed for each process instance and trace attributes and 2) TS data from sensors tracking the flow of the product in the four tanks and in the pipes leading through the filters every second. The data span a period from October 2020 to April 2022, representing 161 complete process instances and 199.4 million rows of sensor data.

Data Preprocessing. First, relevant TS pump circulation flow data was extracted for each batch. The data were resampled to one measurement per minute for smoothing and to reduce their length (the raw TS for the longest batch counted more than one million measurements before resampling), and some missing values due to the storage format were imputed. Finally, all data were normalised.
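A minimal sketch of such a preprocessing pipeline is given below. The concrete choices are assumptions made for illustration only, since the text does not specify them: resampling by averaging blocks of 60 one-second readings, imputing gaps by carrying the last observed value forward (and back-filling a leading gap), and z-normalisation. All function names are hypothetical.

```python
import math


def resample_mean(series, factor=60):
    """Average consecutive blocks of `factor` readings, ignoring missing ones."""
    out = []
    for start in range(0, len(series), factor):
        window = [x for x in series[start:start + factor] if x is not None]
        out.append(sum(window) / len(window) if window else None)
    return out


def fill_gaps(series):
    """Forward-fill missing values; a leading gap takes the first observed value."""
    first = next((x for x in series if x is not None), None)
    if first is None:
        raise ValueError("series has no observed values")
    filled, last = [], first
    for x in series:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def z_normalise(series):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)
    return [(x - mean) / std for x in series]


# Hypothetical raw flow readings with one missing value, resampled in pairs.
raw = [1.0, 2.0, 3.0, 4.0, None, 6.0]
clean = z_normalise(fill_gaps(resample_mean(raw, factor=2)))
```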

4.2 Clustering and Evaluation Approach

Multi-perspective Trace Clustering. We applied our two-step multi-perspective trace clustering approach to the obtained data. First, for the control-flow perspective, we followed a distance-based approach by computing the DL distance between the event sequences for each pair of batches and using the resulting distance matrix as input for K-medoids. The number of clusters was set to five by plotting inertia and following the elbow method. The clusters contained 28, 23, 52, 22 and 36 instances, respectively. Second, regarding the trace attributes perspective, we applied the K-means algorithm with K = 5 (based on the elbow method). This yielded clusters of 23, 48, 41, 25 and 24 instances. Note that the attributes “tank open time” and “time in tank” are considered trace attributes as they measure batch quality and not timeliness. Third, we applied a distance-based TS clustering approach for the TS sensor data perspective, computing the DTW distance between the TS of each pair of batches to obtain a TS distance matrix used as input for K-medoids, with K = 6 (based on the elbow method), which formed clusters of sizes 9, 44, 59, 20, 21 and 8. Finally, to perform multi-perspective clustering, we computed the distances to centroids for each single-perspective clustering. Then we weighted the clusterings to take into account the different values of K in each clustering and applied K-means to all distances to centroids together, with K = 4 based on the elbow method. When K-means was applied, centroid initialisation was optimised to speed up convergence of the clustering by sampling centroids based on marginal inertia decrease, whereas when K-medoids was applied, medoids were randomly initialised.

Clustering Evaluation. The evaluation of clustering results is a challenging task that often depends on the specific domain and task at hand. A range of metrics are available to score clusterings based on intrinsic properties, such as the Davies-Bouldin (DB) score [6], which measures the similarity of clusters to their respective most similar cluster (lower value is better), or the Silhouette index [29], which compares the similarity between an instance and instances in its cluster with the similarity between this instance and instances in other clusters (higher value is better). Other metrics compare clusterings with known classes in the data or other clusterings, such as the Rand index [26], entropy, or purity. However, it is worth noting that better-formed clusters may not necessarily be more useful in practice, hence obtaining external validation from experts is critical to evaluate clustering results.
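As an example of an intrinsic metric, the Silhouette index can be computed directly from a pairwise distance matrix, which is convenient when the clustering itself is distance-based. The sketch below follows the standard definition: for each instance, a is the mean distance to the other members of its cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). Singleton clusters are given a score of 0, a common convention assumed here. The function name and toy data are hypothetical.

```python
def silhouette_index(dist, labels):
    """Mean silhouette score from an n x n distance matrix and cluster labels."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the instance's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical instances at positions 0, 1, 10 and 11 on a line.
positions = [0.0, 1.0, 10.0, 11.0]
toy_dist = [[abs(p - q) for q in positions] for p in positions]
good = silhouette_index(toy_dist, [0, 0, 1, 1])  # well-separated clusters
bad = silhouette_index(toy_dist, [0, 1, 0, 1])   # clusters that mix both groups
```

In this toy case the well-separated labelling scores close to 1, while the mixed labelling scores below 0.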

Fig. 3. Visualisation of multi-perspective clustering with t-SNE (cluster 1 = purple, cluster 2 = blue, cluster 3 = green, cluster 4 = yellow). (Color figure online)

In our case study, we compared the clusters obtained from the multi-perspective approach with those derived from single-perspective clustering, using both metrics and expert feedback. We computed silhouette indexes and DB scores for each clustering to assess the quality of the clusters in each approach. We also computed adjusted Rand indexes (ARI; where a higher value indicates higher similarity) and entropy scores (where a lower value indicates higher similarity) to determine the degree of similarity between the clusterings and to identify which perspective has the most influence on multi-perspective clustering. To validate our clustering results, we presented them to a senior process engineer at our partner company. Specifically, we showed the engineer the centroids of each multi-perspective cluster, as well as an overview of each cluster, including a directly-follows graph (DFG) for the control-flow, the mean or mode of trace attributes, and the DTW barycenter averaging (DBA) [25] result for the TS perspective, DBA being a method to compute the average of several TS taking into account potential time shifts.
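The ARI used to compare two clusterings can be computed from their contingency table. The sketch below follows the standard pair-counting formula and uses only the Python standard library. The function name is hypothetical, and the convention of returning 1.0 when both clusterings are trivial is an assumption borrowed from common implementations.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two clusterings of the same instances."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs of instances grouped together in both clusterings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together in each clustering separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings have one cluster
    return (together - expected) / (maximum - expected)


# Cluster names do not matter, only which instances are grouped together.
same_grouping = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # 1.0
cross_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # -0.5
```

Because the index is corrected for chance, values near 0 indicate that two clusterings agree no more than random groupings would.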

4.3 Results

Multi-perspective clustering with K = 4 resulted in clusters of sizes 18, 53, 69, and 21 (see Fig. 3). In the remainder of this section, we provide visualisations of the clusters and report the values of the metrics and the interpretation and evaluation of the clusters by the process expert for each perspective.

Clustering Quality Assessment and Visualisation. The silhouette index and the DB score are reported in Table 1. As can be seen, multi-perspective clustering has better scores than other clusterings for both metrics. Trace attributes clustering has the worst scores, while control-flow and TS clusterings have similar values.

Table 1. Internal validation metrics for each clustering.

Table 2 reports the cluster similarity metrics. Both entropy and ARI show that multi-perspective and control-flow clusterings have the highest similarity, i.e., they most often group the same instances together. On the other hand, trace attribute data clustering has high entropy and low ARI for all other clusterings, indicating that it forms very different clusters than the other perspectives.

Table 2. Pairwise similarity metrics values.

We visualised the multi-perspective clusters by modelling the DFGs of their control-flows (see Figs. 4 and 5, where high-level steps from Fig. 2 are highlighted), computing the mean and the mode of their attributes (see Table 3) and plotting the DBAs of their TS data (see Figs. 6 and 7). DFGs and DBAs were used and are put forward in this paper as they can provide intuitive visualisations of the control-flow and the TS data of many instances of a process at once, enabling business experts to quickly understand and analyse whole clusters. Note that while all the results of the multi-perspective clustering are shown, only particularly interesting results are displayed for the other clusterings, and that activity labels as well as some trace attribute values were anonymised on request of the company.

Fig. 4. DFGs for each cluster of the multi-perspective clustering.

Fig. 5. DFGs for clusters 1 and 5 of the control-flow clustering.

Table 3. Mean or mode of the trace attributes for each cluster of all clusterings (standard deviations between brackets).

Fig. 6. DBAs for each cluster of the multi-perspective clustering.

Fig. 7. DBAs for clusters 1 and 5 of the TS clustering.

Expert-Based Validation. When shown the multi-perspective clusters, the process expert categorised them as follows. Cluster 3, the largest cluster and the one with the fewest distinctive characteristics, was identified as representing the standard execution of the process. Cluster 2 typically included traces with fewer adjustment activities and a lower material adjustments attribute than those in the other clusters, as shown in Fig. 4b and Table 3. In contrast, cluster 1 represented batches that required more adjustment activities and had a higher value for the material adjustments attribute (see Fig. 4a and Table 3) than batches in the other clusters. Having more adjustments also caused the filtering step to last longer, which can also be seen in the TS data by comparing Figs. 6a and 6b (filtering being characterised by long periods with a stable flow). Finally, cluster 4 included traces with missing activities that were necessary for proper process execution. These instances were identified as anomalies caused by improper logging of these activities.

4.4 Comparison of the Clusterings

In general, single-perspective clusters are more difficult to interpret than multi-perspective clusters. While control-flow clustering also groups together batches that required more adjustments, no cluster groups instances with fewer adjustments as neatly as multi-perspective cluster 2 (see Figs. 5a–5b). It is particularly difficult to recognise consistent patterns across perspectives in trace attribute data clusters, while TS clusters succeed to some extent in grouping together instances with similar control-flows. In addition, the most difficult perspective to interpret in all clusterings seems to be the TS perspective, where DBAs have difficulty capturing typical TS shapes, partly due to the presence of batches with missing data. That being said, DBAs based on TS clustering (see Fig. 7) seem more distinct and more easily interpretable.

5 Discussion

TROPIC successfully integrates TS sensor data in multi-perspective trace clustering, resulting in clusters that consider different process perspectives. The two-step structure makes it easy to disentangle the different perspectives, adjust their importance, and compare them. In our manufacturing use case, comparing multi-perspective trace clustering with single-perspective clustering showed that by leveraging underlying relationships between different perspectives, multi-perspective trace clustering could outperform single-perspective clustering even in their own perspective. For instance, multi-perspective trace clustering grouped instances with few adjustments better than control-flow clustering, as other perspectives helped recognise these instances.

In addition, the process expert found multi-perspective clusters more meaningful from a business point of view, as they identified variants and anomalies. This insight could help the company investigate the differences between clusters 1 and 2 to reduce the number of necessary adjustments in the future.

Furthermore, some anomalous process instances were detected in the use case, although we did not apply any anomaly detection technique. This observation highlights the potential of multi-perspective anomaly detection using TROPIC by applying outlier detection to the distances to centroids.
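One simple way to explore this idea is sketched below. It is an assumption for illustration, not a technique evaluated in this paper. Each instance is summarised by its distance to the nearest centroid, and instances whose value lies far above the median, measured in units of the median absolute deviation (MAD), are flagged. The function name, the threshold and the toy values are hypothetical.

```python
import statistics


def flag_outliers(centroid_distances, threshold=3.0):
    """Flag instances that lie unusually far from every centroid.

    centroid_distances[i] lists the distances of instance i to all centroids.
    Returns one boolean per instance.
    """
    nearest = [min(row) for row in centroid_distances]
    median = statistics.median(nearest)
    mad = statistics.median([abs(x - median) for x in nearest])
    if mad == 0:
        return [False] * len(nearest)  # no spread, so nothing can be flagged
    return [(x - median) / mad > threshold for x in nearest]


# Hypothetical distances of five instances to two centroids.
toy_distances = [
    [1.0, 5.0],
    [1.2, 4.0],
    [0.9, 6.0],
    [1.1, 5.5],
    [9.0, 8.0],  # far from both centroids
]
flags = flag_outliers(toy_distances)
```

Any instance flagged in this way would still need to be reviewed by a process expert before being treated as an anomaly.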

In addition, the choice of K for K-means and K-medoids clustering could have a great impact on the results of clustering at both stages. In this paper, the popular elbow method was used and yielded good results, as the clusters formed were insightful from a business perspective. Future work could investigate more complex methods to determine the value of K, e.g., based on stability or separation as in [7].

However, ARI and entropy indicated that the control-flow perspective produced the clustering most similar to the multi-perspective clustering. This result suggests that the control-flow perspective might be more important than other perspectives in the multi-perspective trace clustering. Weighting the perspectives could rebalance their contributions, but as all perspectives are correlated, weighting may not fundamentally change the clustering in the use case.

Finally, although we focused on three specific perspectives in this paper, we believe our approach could be extended to consider other perspectives. For example, a similar approach to that applied to the TS data obtained from IoT sensors could be applied to other processes that evolve over time, such as process performance. Such a different perspective could serve as a substitute for one of the current three dimensions, or the approach could easily be adapted to a higher dimensionality, allowing for several other perspectives to be included, such as the resource perspective.

6 Conclusion

In this paper, we presented a novel approach for multi-perspective trace clustering of manufacturing processes that considers three perspectives: control-flow, trace attributes, and TS sensor data. This approach can reveal process variants that are homogeneous across all three perspectives simultaneously. We evaluated the approach in a real-life use case of a smart manufacturing process, where it revealed meaningful clusters and anomalous instances for a specific IoT-enhanced BP, both actionable insights to improve process design and execution.

In future work, we plan to extend this approach in various ways. One possibility is to propose a generalisation to n arbitrary perspectives. We could also consider including event attributes and incorporating TS data at the event level. Furthermore, we could explore other clustering techniques for the multi-perspective clustering if our approach were to be used for more flexible types of processes, such as ensemble clustering methods or soft clustering techniques. Finally, we find that integrating contextual information in the log in the form of events, as suggested in [3], is an interesting alternative approach.