
1 Introduction

Business process analysis is important in modern organizations because it enables the comprehension, optimization, and enhancement of operational processes based on recorded data [38]. These processes are often complex, reflecting the intricacy of the business domains in which they operate. To address this complexity, log file formats and standards have emerged as conceptual models that capture the essential information required to support the analysis [2, 14, 21, 33]. These conceptual models drive the development of algorithms and facilitate the processing and analysis of recorded data.

Process mining is a research area that facilitates data-driven business process analysis based on recorded event logs [38]. Log files are crucial for analyzing business processes. Thus, extensive efforts have been made to define conceptual models, in the forms of log file formats and standards, that enable the analysis of recorded data using different software systems [21, 23, 34]. These formats and standards ensure compatibility and interoperability across various systems while providing a consistent and structured format for recording process-related information.

Traditionally, event log formats assume a single case notion as an obligatory element to which all other recorded information is correlated. For example, a purchase order event log could be extracted using either the order or the item as the case notion, even though the process involves both objects as potential cases. In reality, business processes deal with different perspectives, which may require several case notions. Hence, restricting log formats to a single case notion limits the applicability of process mining in practice.

To circumvent this limitation, researchers and practitioners have flattened recorded event logs before analysis, which introduces its own limitations, including false behavior and false analysis results [16] (consequences of the so-called divergence and convergence problems [39]). For example, one order may contain many items. If we choose the “item” as the case notion during log extraction, events like “create order” must be repeated for each item. Mapping events onto a single case notion in this way is called flattening. One consequence is false statistics when retrieving, for instance, the number of created orders. If the log is instead flattened around the “order” case notion, the relation between “select item” and “approve item” can be lost, because all items are grouped under one order, discarding the relations between individual items. The lack of these relations can introduce spurious loops between the corresponding activities in discovered process models, which constitutes false behavior. These issues compromise the accuracy of the analysis [39].
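The convergence problem described above can be made concrete with a small sketch. The mini-log below is hypothetical (event names and identifiers are illustrative, not taken from any BPIC dataset): one order `o1` with two items, flattened around the “item” case notion, duplicates the “create order” event.

```python
# Hypothetical mini-log: one order (o1) with two items (i1, i2).
events = [
    {"activity": "create order", "order": "o1", "items": ["i1", "i2"]},
    {"activity": "select item",  "order": "o1", "items": ["i1"]},
    {"activity": "select item",  "order": "o1", "items": ["i2"]},
]

def flatten(events, case_notion):
    """Duplicate each event once per related object of the chosen case notion."""
    flat = []
    for e in events:
        objects = e["items"] if case_notion == "item" else [e["order"]]
        for obj in objects:
            flat.append({"case": obj, "activity": e["activity"]})
    return flat

flat = flatten(events, "item")
# "create order" now appears once per item, so counting its occurrences
# over-reports the number of orders (the convergence problem).
created = sum(1 for e in flat if e["activity"] == "create order")
```

Here `created` is 2 although only one order exists, illustrating the false statistics mentioned above.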

The Object-Centric Event Log (OCEL) [21] has been proposed to address the limitation of having only one case notion when extracting log files, and it is part of a new and emerging paradigm in process mining called Object-Centric Process Mining (OCPM) [39]. This paradigm aims to support analyzing business processes with multiple case notions, which requires developing algorithms, techniques, and methods for multi-dimensional process analysis. Although OCPM has emerged only recently, due to the highly relevant problem that it targets, several algorithms, tools, and libraries have been developed to support such analysis, e.g., [3, 4, 6, 11, 26, 34, 35, 39, 40]. This development can also be observed in commercial tools like Celonis, showing the relevance of the problem in practice.

Another recent alternative for recording event logs is knowledge graphs, which have proven powerful within information systems thanks to their support for heterogeneous data sources, scalability, semantic reasoning, and flexible schema evolution [24]. They have therefore recently been used to record and process event logs with multiple case notions, in the form of the multi-dimensional Event Knowledge Graph (EKG) [14]. However, the lack of process mining tools for analyzing EKGs limits the practical application of this approach. Additionally, there is a lack of comparative analysis of the performance, strengths, weaknesses, limitations, and differences between processing data represented in these two formats (EKG and OCEL). Therefore, this paper aims to address the following research questions:

RQ1: How can an event knowledge graph be transformed into an object-centric event log?

RQ2: How does the performance of processing an event knowledge graph compare to that of processing an object-centric event log in process mining?

RQ3: What are the differences and similarities in applying process mining to an event knowledge graph compared to an object-centric event log?

To answer the first research question, we define an algorithm that transforms an EKG into a set of OCELs. We implemented this algorithm as part of a Python library, called neo4pm, that can be used to perform the transformation. In this paper, we use this implementation to transform five real EKGs into OCEL files, which are publicly available [28,29,30,31,32]. We then compare the performance of processing the EKGs and the transformed OCELs, which answers the second research question. In addition, we compare similarities and differences in analyzing processes based on event logs represented as EKGs and as transformed OCELs, which answers the third research question.

The structure of the paper is as follows. Section 2 gives an overview of related work. Section 3 provides preliminaries that are used in Sect. 4, where we define the algorithm formally. Section 5 reports the results and discusses the findings. Finally, Sect. 6 concludes the paper and outlines future directions.

2 Related Work

In this section, we provide an overview of the research that offers tool support for processing event logs represented in multi-dimensional Event Knowledge Graph (EKG) and Object-centric Event Log (OCEL). Table 1 summarizes the process mining tools developed for OCEL and EKG. The table categorizes the level of support into eight use cases: transformation, exploration, monitoring, performance analysis, discovery, conformance checking, enhancement, and predictive process monitoring.

The tool support for EKG focuses on transforming traditional log files into the EKG data model [14]. A recent study has proposed a method for transforming OCEL to EKG [16]; however, the existing implementation does not yet support the transformation from serialized standard OCEL files. Furthermore, there is a lack of support for EKG in the other use cases. In contrast to EKG, existing contributions for OCEL vary across use cases. These categories are represented as columns in Table 1.

In the transformation use case, we have identified three sub-categories. Firstly, there are approaches focused on transforming traditional logs to OCEL [37]. Secondly, there are methods for transforming data recorded in databases or Enterprise Resource Planning (ERP) systems to OCEL [10, 42]. Lastly, there are techniques available for flattening OCEL to traditional logs [11, 21]. In the exploration use case, we have identified four sub-categories: support for filtering events based on certain criteria [11], identifying concept drift in event data [8], supporting variant analysis on event logs [5, 7], and splitting the log into several clusters based on similarity in underlying behaviour [26].

In the monitoring use case, Park and van der Aalst present a tool for monitoring object-centric constraints [36]. In the performance analysis use case, a tool called OC-PM is available for calculating the duration time of objects [11]. Additionally, performance metrics computation is supported by [36] and [35]. In the discovery use case, the discovery of object-centric Petri nets is supported by [40] and [4]. In addition, the discovery of Markov Directly-Follow Multigraphs is supported by [26] by extending the discovery of Markov Directly-Follow Graphs [27]. In the conformance checking use case, Berti and van der Aalst provide a tool for conformance checking [11]. Also, tool support is provided for calculating precision and fitness [4, 6]. In the enhancement use case, tool support is provided for enhancing process models through feature extraction [3, 4]. In the predictive process monitoring use case, Adams et al. [4] provide a tool for predictive monitoring, and Gherissi et al. [22] offer a tool for predicting the next event time, activity, and remaining sequence time.

Table 1. Summary of studies providing tool support for OCEL or EKG

3 Preliminaries

This section introduces the notions of the Event Knowledge Graph (EKG) and the Object-Centric Event Log (OCEL), which serve as the foundation for defining the transformation algorithm in Sect. 4. We explain the EKG definition using a running example, which will also be utilized to demonstrate the approach and algorithm in the subsequent sections of this paper.

Figure 1 illustrates a running example that is used to explain the components of EKG. The figure represents information recorded in an EKG for a fictitious business process involving a customer order (\(\textsf{o1}\)) with two items (\(\textsf{i1}\) and \(\textsf{i2}\)). Orders and items are depicted as ovals (annotated with \(\mathsf {:Entity}\)), while events are represented as rectangles (annotated with \(\mathsf {:Event}\)). Each event has an activity name and a timestamp (e.g., \(\mathsf {Submit\ Order}\) and \(\mathsf {15:00}\) for \(\textsf{e1}\), respectively). Some events also record the performing resource (e.g., \(\textsf{Elin}\) for \(\textsf{e3}\)). The figure illustrates the chronological sequence of events: \(\mathsf {Submit\ Order}\), two instances of \(\mathsf {Check\ Availability}\) (one for each item), and \(\mathsf {Pick\ Items}\). The following definitions formalize the elements of this graph, based on which we define the transformation algorithm.

Fig. 1. Running example showing an event log represented in an EKG

Definition 1 (Universes)

We define the following universes to be used throughout the paper, some of which are adopted from [39]:

  • \(\mathbb {U}_{ lbl }\) is an infinite set of strings representing labels,

  • \(\mathbb {U}_{ att }\) is an infinite set of strings representing attribute names,

  • \(\mathbb {U}_{ val }\) is an infinite set of strings representing attribute values containing the following disjoint subsets:

    • \(\mathbb {U}_{ eid }\subset \mathbb {U}_{ val }\) represents the universe of event identifiers,

    • \(\mathbb {U}_{ time }\subset \mathbb {U}_{ val }\) represents the universe of timestamps,

    • \(\mathbb {U}_{ act }\subset \mathbb {U}_{ val }\) represents the universe of activity names,

    • \(\mathbb {U}_{ ot }\subset \mathbb {U}_{ val }\) represents the universe of object types,

    • \(\mathbb {U}_{ oid }\subset \mathbb {U}_{ val }\) represents the universe of object identifiers,

  • \( type : \mathbb {U}_{ oid } \rightarrow \mathbb {U}_{ ot }\) is a function that assigns exactly one object type to each object identifier,

  • \(\mathbb {U}_{ omap } = \{ omap : \mathbb {U}_{ ot } \rightarrow \mathcal{P}(\mathbb {U}_{ oid }) \mid \forall _{ ot \in dom ( omap )}\ \forall _{ oid \in omap ( ot )}\ type ( oid ) = ot \}\) is the universe of all object mappings indicating which object identifiers are included per object type,

  • \(\mathbb {U}_{ vmap } = \{ vmap :\mathbb {U}_{ att } \not \rightarrow \mathbb {U}_{ val }\}\) is the universe of value assignments, and

  • \(\mathbb {U}_{ event } = \mathbb {U}_{ eid } \times \mathbb {U}_{ act } \times \mathbb {U}_{ time } \times \mathbb {U}_{ omap } \times \mathbb {U}_{ vmap }\) is the universe of events.

Definition 2 (Labeled Property Graph (LPG))

An LPG (adopted from  [9, 16]) is a tuple \(G=(N, R, \gamma , \lambda , \rho ) \), where:

  • N and R are finite sets of nodes and relations, respectively,

  • \(\gamma : R \rightarrow N \times N\) is a total function assigning a pair of nodes (representing the source and target, respectively) to a relation,

  • \(\lambda : (R \cup N)\rightarrow \mathbb {U}_{ lbl }\) is a total function assigning a label to a node or a relation,

  • \(\rho : (N\cup R) \times \mathbb {U}_{ att } \nrightarrow \mathbb {U}_{ val }\) is a partial function assigning a value to an attribute of a node or a relation.

Given an LPG \(G=(N, R, \gamma , \lambda , \rho ) \), we call \(E=N\cup R\) the set of elements in the graph containing both nodes and relations. Considering a Label \(l \in \mathbb {U}_{ lbl }\), we write \(E^{l}\) to denote the subset of E consisting of all the elements with Label l. Formally, we show this as \(E^{l}\)=\(\{e \in E \mid \lambda (e)=l\}\). We use the same notation for the subsets N and R of E (e.g., \(N^{l}\)). Moreover, for every element \(e \in E\) and every attribute name \(a \in \mathbb {U}_{ att }\), if \((e,a) \in dom (\rho )\), we write e.a to refer to the value \(v \in \mathbb {U}_{ val }\) for which it holds that \(\rho (e,a)=v\); if \((e,a) \not \in dom (\rho )\), then e.a denotes a special value \(\perp \) that is not in \(\mathbb {U}_{ val }\).
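The tuple notation above maps naturally onto plain data structures. The following sketch encodes a fragment of the running example's LPG in Python (identifiers and attribute names are illustrative); `None` stands in for the special undefined value \(\perp \).

```python
# A minimal encoding of an LPG G = (N, R, gamma, lambda, rho) covering the
# e1 -> e2 fragment of the running example.
N = {"e1", "e2"}
R = {"r1"}
gamma = {"r1": ("e1", "e2")}                      # source/target per relation
lam = {"e1": "Event", "e2": "Event", "r1": "df"}  # the labeling function
rho = {("e1", "act"): "Submit Order", ("e1", "time"): "15:00"}  # attributes

def elements_with_label(label):
    """E^l: all elements (nodes and relations) carrying label l."""
    return {e for e in N | R if lam[e] == label}

def attr(e, a):
    """e.a: the attribute value, or None (standing in for the undefined value)."""
    return rho.get((e, a))
```

With this encoding, `elements_with_label("Event")` yields \(N^{\textsf{Event}}=\{\textsf{e1}, \textsf{e2}\}\) and `attr("e1", "act")` yields the activity name of \(\textsf{e1}\).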

Example 1

In Fig. 1, we can see ten nodes. One is annotated with \(\textsf{e1}\), where we refer to it by n and its activity name by act in this example. Thus, we can say \(\rho (n,act)=\textsf{Submit Order}\) representing that the activity name of this event is \(\textsf{Submit Order}\). We can also write \(n.act=\textsf{Submit Order}\). As this node is labeled with \(\textsf{Event}\), we can say \(\lambda (n)=\textsf{Event}\) or \(n\in N^\textsf{Event}\). This node has a relation to another event annotated with \(\textsf{e2}\). We refer to this event by \(n^\prime \) and to its relation to \(n^\prime \) by r. We can say \(\gamma (r)=(n,n^\prime )\). This relation is labeled with \(\textsf{df}\), so \(\lambda (r)=\textsf{df}\) or \(r\in R^\textsf{df}\).

After defining LPG, we now introduce a special kind of LPG that uses a specific schema, named Event Knowledge Graph. We define the schema as \(\mathcal {S}=\) \(\big \{ \big (\textsf{has}, (\textsf{Log}, \textsf{Event})\big ),\) \( \big (\textsf{observed}, (\textsf{Event}, \textsf{Class})\big ),\) \( \big (\textsf{corr}, (\textsf{Event}, \textsf{Entity})\big ),\) \( \big (\textsf{rel}, (\textsf{Entity}, \textsf{Entity})\big ),\) \( \big (\textsf{df},\) \((\textsf{Event},\) \(\textsf{Event})\big ),\) \( \big (\textsf{dfc}, (\textsf{Class}, \textsf{Class})\big ) \big \}\). This schema specifies the possible labels of the source and target nodes of each relation based on the relation’s label. Each member of the set is a tuple, where the first element is a possible relation label, and the second element gives the labels of the source and target nodes, respectively. In the Event Knowledge Graph definition, we restrict the universe of labels to \(\mathbb {U}_{ lbl }=\bigcup _{(l,(s,t))\in \mathcal{S}}\ \{l\}\cup \{s\}\cup \{t\}\), meaning that \(\mathbb {U}_{ lbl }\)={\(\textsf{Event}\), \(\textsf{Entity}\), \(\textsf{Class}\), \(\textsf{Log}\), \(\textsf{observed}\), \(\textsf{has}\), \(\textsf{rel}\), \(\textsf{df}\), \(\textsf{dfc}\), \(\textsf{corr}\)}. Note that an EKG can have multiple nodes labeled as \(\textsf{Log}\), meaning that it can record events related to multiple logs in one graph.
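The construction of \(\mathbb {U}_{ lbl }\) from the schema can be written out directly. In the sketch below, we assume the \(\textsf{corr}\) relation (Event to Entity) is part of the schema, as the listed label universe suggests:

```python
# The EKG schema S as a set of (relation label, (source label, target label))
# tuples; corr (Event -> Entity) is included, matching the stated label universe.
S = {
    ("has", ("Log", "Event")),
    ("observed", ("Event", "Class")),
    ("corr", ("Event", "Entity")),
    ("rel", ("Entity", "Entity")),
    ("df", ("Event", "Event")),
    ("dfc", ("Class", "Class")),
}

# U_lbl: the union of all relation, source, and target labels occurring in S.
U_lbl = set()
for l, (s, t) in S:
    U_lbl |= {l, s, t}
```

Running this yields exactly the ten labels enumerated in the text.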

Definition 3 (Event Knowledge Graph (EKG))

An EKG is an LPG \(G=(N, R, \gamma , \lambda , \rho ) \) that has the following properties.

  a) \(\forall _{e \in N^{\textsf{Event}}} \left( e.id \in \mathbb {U}_{ eid }\wedge e.act \in \mathbb {U}_{ act } \wedge e.time \in \mathbb {U}_{ time } \right) \) indicating that each node with the label \(\textsf{Event}\) has attributes called id, act, and time whose values are an event identifier, an activity name, and a timestamp, respectively,

  b) \(\forall _{e \in N^{\textsf{Entity}}} (e.id \in \mathbb {U}_{ oid } \wedge e.type \in \mathbb {U}_{ ot })\) indicating that each node with the label \(\textsf{Entity}\) has attributes called id and type whose values are an object identifier and an object type, respectively,

  c) the relations between nodes are specified as \(\forall _{(l,(s,t))\in \mathcal{S},\ r\in R \text { with } \gamma (r)=(e,e^\prime )}\ (e\in N^s\ \wedge \ e^\prime \in N^t)\Leftrightarrow r\in R^l\) indicating that a relation is labeled as specified in the schema if and only if its source and target nodes are labeled accordingly,

  d) \(\forall _{r \in R^{\textsf{rel}}}\ r.type \in \mathbb {U}_{ ot }\cup \{\textsf{Reified}\}\) indicating that each relation with the label \(\textsf{rel}\) has an attribute called type whose value is an object type or the special value \(\textsf{Reified}\). The \(\textsf{Reified}\) type is used to model the relation between derived entities and other entities.

We keep the definition of EKG minimal in this paper, without elaborating on detailed properties that are not needed for the transformation algorithm. For example, we omit the properties that must hold for \(\textsf{df}\) and \(\textsf{dfc}\) relations. More details can be found in [14, 16].
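Property (c) of Definition 3 amounts to a simple schema-conformance check over the graph. The sketch below illustrates it on a hypothetical fragment (node identifiers are illustrative; as above, we assume \(\textsf{corr}\) belongs to the schema):

```python
# Schema fragment: (relation label, (source label, target label)).
S = {("has", ("Log", "Event")), ("corr", ("Event", "Entity")),
     ("df", ("Event", "Event")), ("rel", ("Entity", "Entity"))}

labels = {"l1": "Log", "e1": "Event", "e2": "Event", "o1": "Entity"}
relations = {"r1": ("has", "l1", "e1"),   # Log  -has->  Event
             "r2": ("df", "e1", "e2"),    # Event -df->  Event
             "r3": ("corr", "e1", "o1")}  # Event -corr-> Entity

def conforms(relations, labels, schema):
    """Property (c): each relation's label must match its endpoints' labels."""
    return all((lbl, (labels[src], labels[tgt])) in schema
               for lbl, src, tgt in relations.values())
```

`conforms(relations, labels, S)` holds for the fragment above, while a mislabeled relation such as `("has", "e1", "o1")` would make it fail.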

Example 2

Our running example graph fulfills the properties stated in Definition 3 (a-b). As required by Definition 3 (a), each event in our graph has an event identifier (e.g., \(\textsf{e1}\)), an activity name (e.g., \(\textsf{Submit Order}\) for \(\textsf{e1}\)), a timestamp (e.g., \(\mathsf {15:00}\) for \(\textsf{e1}\)). Also, all entities have an identifier as well as a type as required by Definition 3 (b), e.g., the mustard-colored oval has an identifier with the value of \(\textsf{o1}\) and type of \(\textsf{Order}\).

Our running example graph also fulfills the properties stated in Definition 3 (c-d). As required by Definition 3 (c), every relation whose source and target are labeled with \(\textsf{Log}\) and \(\textsf{Event}\), respectively, is labeled with \(\textsf{has}\), e.g., the relation between \(\textsf{l1}\) and \(\textsf{e1}\). The same applies to the other relations, such as \(\textsf{observed}\), \(\textsf{rel}\), \(\textsf{df}\), and \(\textsf{dfc}\), whose source and target nodes are labeled as specified in the schema. As required by Definition 3 (d), every relation labeled \(\textsf{rel}\) has an attribute named \(\textsf{type}\), e.g., the relation between \(\textsf{i1}\) and \(\textsf{o1}\), whose \(\textsf{type}\) has the value \(\textsf{oit}\).

The following two definitions are adopted from [39] describing an OCEL, the target format to which we will transform the described EKG.

Definition 4

(Event Projection (adopted from [39])). An event e is a tuple \(( eid , act ,\) \( time , omap , vmap )\) where \( eid \in \mathbb {U}_{ eid }\), \( act \in \mathbb {U}_{ act }\), \( time \in \mathbb {U}_{ time }\), \( omap \) is an object mapping, and \( vmap \) is a value assignment. For each such event \(e=( eid , act , time ,\) \( omap , vmap )\), we write \(\pi _{ eid }(e)\) to denote \( eid \), \(\pi _{ act }(e)\) to denote \( act \), \(\pi _{ time }(e)\) to denote \( time \), \(\pi _{ omap }(e)\) to denote \( omap \), and \(\pi _{ vmap }(e)\) to denote \( vmap \).

Definition 5

(Object-Centric Event Log (OCEL) [39]). An event log L is a pair \((E,\preceq _E)\) with \(E \subseteq \mathbb {U}_{ event }\) and \(\preceq _E\ \subseteq E \times E\) such that:

  • \(\preceq _E\) defines a partial order (reflexive, antisymmetric, and transitive),

  • \(\forall _{e_1,e_2 \in E} \ \pi _{ eid }(e_1)=\pi _{ eid }(e_2) \ \Rightarrow \ e_1 = e_2\), and

  • \(\forall _{e_1,e_2 \in E} \ e_1 \preceq _E e_2 \ \Rightarrow \ \pi _{ time }(e_1) \le \pi _{ time }(e_2)\).
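The third condition of Definition 5 (the event order respects timestamps) can be checked mechanically. The sketch below uses illustrative integer timestamps in place of real ones:

```python
# Events as (eid, act, time, omap, vmap) tuples from Definition 4;
# integer times stand in for timestamps.
e1 = ("e1", "Submit Order", 1, {"Order": {"o1"}}, {})
e2 = ("e2", "Check Availability", 2, {"Item": {"i1"}}, {})
E = [e1, e2]

# A partial order on event identifiers: reflexive pairs plus e1 before e2.
order = {("e1", "e1"), ("e2", "e2"), ("e1", "e2")}

def respects_time(E, order):
    """Third OCEL condition: e1 <= e2 in the order implies time(e1) <= time(e2)."""
    time = {e[0]: e[2] for e in E}
    return all(time[a] <= time[b] for a, b in order)
```

Here `respects_time(E, order)` holds; adding the pair `("e2", "e1")` would violate the condition given these timestamps.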

4 Approach

This section introduces a transformation algorithm that transforms an Event Knowledge Graph (EKG) into a set of Object-Centric Event Logs (OCELs), addressing RQ1. In defining this algorithm, the following Design Choices (DC) have been made:

DC1. EKG with Multiple Logs: The algorithm converts an EKG with multiple logs (i.e., an EKG with multiple nodes with the label \(\textsf{Log}\)) into a set of OCEL files. This choice aligns with the OCEL standard, which allows one global log element per file [21]. An alternative would be to include all events in one log file and mark each event’s originating log using a vmap. However, this alternative deviates from the standard, as the vmap value does not represent logs according to the standard. Our approach can easily support this alternative design choice by merging the generated OCELs into one, with a new vmap indicating the log file.

DC2. Event Lifecycles: Unlike XES, OCEL does not explicitly define event lifecycles, which specify the events representing different states of an operational task in a business process. As a result, we chose not to transform event classes (which represent lifecycles in EKG) to OCEL. Event classes in EKG can be related to multiple lifecycle states, and an explicit definition of the event lifecycle in a log file can enable the development of lifecycle-aware algorithms, similar to those developed for XES. If OCEL is extended to support lifecycles in the future, our transformation algorithm can easily incorporate the corresponding logic. As an alternative design choice, the lifecycle could be transformed into event attributes or related objects, yet this would still not enable lifecycle-aware algorithms, as this information needs to be explicitly supported by standards so that algorithms can take it into account.

Algorithm 1. Transformation of an EKG into a set of OCELs

DC3. Relations Between Entities: The algorithm also omits EKGs’ reified entities from the transformation. OCEL does not support relations between entities, so they are left out of the transformation process.

By making these design choices, the algorithm ensures compliance with the current version of the OCEL standard while accommodating potential future extensions for lifecycle support and entity relations. Algorithm 1 describes the transformation logic, where the input is an EKG and the output is a set of OCELs.

Here, we elaborate on this algorithm. Line 5 assigns the set of non-reified entities to a dedicated variable; in our running example, this set is \(\{\textsf{o1}, \textsf{i1}, \textsf{i2}\}\). We exclude reified entities of the EKG because OCEL does not capture relations among entities, so only the non-reified entities are needed. Then, the algorithm iterates over each log node. It defines a set for capturing all events of the log, i.e., E (line 7). Then, for each event, it defines two empty functions (lines 9 and 10) that will be configured accordingly: if the log has a \(\textsf{has}\) relation to the event, the algorithm i) retrieves all non-reified entities to which the event has a corr relation (line 12), and ii) retrieves the types of all retrieved entities and assigns them to \(\mathcal{O}\mathcal{T}\) (line 13). In our running example, the algorithm sets these variables for \(\textsf{e1}\) as follows: the retrieved entities are \(\{\textsf{o1}, \textsf{i1}, \textsf{i2}\}\), and \(\mathcal{O}\mathcal{T}=\{\textsf{Order}, \textsf{Item}\}\).

Then, the algorithm sets omap and vmap through two loops. The first loop configures the omap function by relating each retrieved object type to the set of related object identifiers (line 15). This means that \(omap(\textsf{Order})=\{o1\}\) and \(omap(\textsf{Item})=\{i1,i2\}\) for \(\textsf{e1}\). The second loop configures the vmap function by assigning all of the event’s attributes (except for id, act, and time) to vmap (line 19). For event \(\textsf{e1}\), vmap will be empty as the event has no other attributes. However, if we consider \(\textsf{e3}\), \(vmap(\textsf{Resource})\) \(=\textsf{Elin}\).

Finally, the algorithm adds the constructed event to the variable capturing all events of the log being processed, i.e., E (line 22). In our example, after processing \(\textsf{e1}\), E contains the tuple \((\textsf{e1},\ \mathsf{Submit\ Order},\) \(\mathsf {15\!:\!00},\ omap,\ \{\})\) with \(omap(\textsf{Order})\!=\!\{o1\}\) and \(omap(\textsf{Item})\!=\!\{i1, i2\}\). Iterating these steps produces one OCEL per log, and line 26 returns the set of OCELs transformed from the EKG.
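The per-event core of this walkthrough (lines 12-22) can be sketched as follows. This is a simplified illustration, not the authors' neo4pm implementation; the function name and parameters are ours:

```python
# Sketch of building one OCEL event (eid, act, time, omap, vmap) from an
# EKG event node and its corr-related, non-reified entities.
def build_event(eid, act, time, corr_entities, attributes):
    """corr_entities: iterable of (object_id, object_type) pairs;
    attributes: all key/value pairs recorded on the event node."""
    omap = {}
    for oid, otype in corr_entities:
        omap.setdefault(otype, set()).add(oid)   # group identifiers by type
    # vmap keeps every attribute except the reserved id/act/time keys
    vmap = {k: v for k, v in attributes.items()
            if k not in {"id", "act", "time"}}
    return (eid, act, time, omap, vmap)

# e1 from the running example: correlated to the order and both items.
e1 = build_event("e1", "Submit Order", "15:00",
                 [("o1", "Order"), ("i1", "Item"), ("i2", "Item")], {})
```

For \(\textsf{e1}\) this yields omap = {Order: {o1}, Item: {i1, i2}} and an empty vmap, matching the walkthrough; an event carrying a `Resource` attribute (like \(\textsf{e3}\)) would get it in vmap.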

5 Evaluation

This section presents the results of evaluating the transformed OCELs against the EKGs. Through this evaluation, we analyze the differences and similarities between the two approaches; we also conduct a comparative performance analysis between EKG and OCEL.

5.1 Data Processing

The transformation algorithm was implemented as part of an open-source Python library called neo4pm. For the evaluation, each EKG was transformed to OCEL using our implementation, and the transformed logs are publicly available at [28,29,30,31,32]. Due to the large size of the log files, the transformation was performed on a server. Subsequently, EKG and OCEL were evaluated and compared on a laptop, replicating the environment typically used by analysts.

Data Transformation: To evaluate our approach, we transformed five open-access real-world EKGs: BPIC14 [17], BPIC15 [18], BPIC16 [19], BPIC17 [20], and BPIC19 [15]. As a result, we obtained nine OCELs (one OCEL file per EKG, except for BPIC15, which produced five OCEL files).

Evaluation Setup: For the evaluation setup, we used a laptop with the following specifications: two 6-core Intel Core i9 CPUs running at 2.90 GHz, 32 GB of RAM, a 1 TB HDD, and a 64-bit Windows 11 Enterprise operating system. Docker (v.4.17.1) was installed on the laptop to host the running evaluations. Neo4j (community edition 3.5) and PM4Py (v.2.7.3) were utilized for the evaluations [12, 13].

5.2 Information Preserving Evaluation

Table 2 presents the information-preservation evaluation results, comparing the number of different elements in the EKG and the transformed OCEL. The table reports the counts of Logs, Events, non-reified Entities (objects in OCEL), Classes (activity names in OCEL), observed relations (capturing activity lifecycles), corr relations, and directly-follows relations (df), shown as columns. The rows give the evaluation results for the different BPICs. BPIC15 consists of multiple logs, so the OCEL numbers are detailed per log and then aggregated for comparison with the EKG. In the following, we explore the differences observed in these elements.

Table 2. Information preserving evaluation result

#: Number of, \(*\): Non-Reified, OCEL\(_n\): n\(^{th}\) sublog

As can be seen in the table, information is preserved for all BPICs except BPIC15 and BPIC17, which exhibit some differences. BPIC15 involves process data associated with multiple log files, leading to the transformation of its EKG into multiple OCEL files (following DC1). The EKG for BPIC17, on the other hand, captures information regarding the lifecycle of each event. These differences are elaborated below.

Differences in BPIC 15: Three differences can be observed when comparing the EKG with the generated OCELs: differences in the total number of Entities (referred to as Objects in OCEL), Classes, and directly-follows relations.

The difference in the total number of Entities and Classes results from splitting the BPIC15 data into multiple OCELs, as shown in Table 2, which is due to OCEL’s inability to capture multiple logs. Consequently, some entities are repeated across different log files, leading to double counting when aggregating the numbers; the same applies to the count of classes. However, these differences do not affect the analysis, as each OCEL represents a subset of the log.
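The double-counting effect is simple arithmetic. The sketch below uses hypothetical sublog contents (the identifiers are illustrative, not BPIC15 data):

```python
# Entities per OCEL sublog; "shared" appears in both files.
sublog_entities = {
    "OCEL_1": {"a1", "a2", "shared"},
    "OCEL_2": {"b1", "shared"},
}

# Summing per-file counts counts "shared" twice ...
aggregated = sum(len(s) for s in sublog_entities.values())

# ... while the EKG stores each entity once.
distinct = len(set().union(*sublog_entities.values()))
```

Here `aggregated` is 5 while `distinct` is 4, mirroring how the aggregated OCEL entity counts in Table 2 exceed the EKG counts.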

An additional disparity lies in the number of directly-follows relations. These relations significantly impact process discovery and conformance-checking algorithms, warranting a detailed analysis of the reasons behind the difference. We identified 860 missing directly-follows relations after transforming the BPIC15 EKG into OCEL. Notably, this number does not match the difference reported in the table. The reason is that directly-follows relations are computed at runtime for a given OCEL, whereas an EKG materializes them. Hence, additional directly-follows relations may be inferred in OCEL that were not present in the source EKG. To illustrate this, Fig. 2 presents a sub-graph extracted from the BPIC15 EKG, which allows us to examine the issue in detail.

In Fig. 2, we can observe two types of directly-follows (DF) relations: intra-log and inter-log directly-follows relations. The two red DF flows represent intra-log relations, indicating that these relations exist among events within a single log, i.e., events related to BPIC15_1. Additionally, there is one intra-log directly-follows relation involving events related to BPIC15_3, denoted by a thin mustard-colored (DF) relation. The figure’s two thick mustard-colored DF relations represent inter-log directly-follows relations. These relations occur when the source and target events are associated with different logs in the graph.

Figure 3 shows the directly-follows relations discovered with the PM4Py Python library [12] from the transformed OCEL for BPIC15_1. Several similarities and differences can be observed compared to Fig. 2: i) the two intra-log directly-follows relations for BPIC15_1 are preserved in the transformed OCEL; ii) the two inter-log directly-follows relations are lost, i.e., they are not captured in the transformed OCEL; iii) an additional intra-log directly-follows relation is introduced between the register submission date request and enter senddate acknowledgement events for the Case_R object type. Note that we do not discuss the intra-log directly-follows relation for BPIC15_3 here, as it is present in the other log file.

The absence of the two inter-log relations in the transformed OCEL is indeed expected, as OCEL does not support multi-log event storage. Based on this observation, we can conclude that:

  • Finding 1. Analyzing a process using multiple OCEL logs (following DC1) can result in missing inter-log relations. On the one hand, an Event Knowledge Graph (EKG) supports analyzing multiple logs simultaneously, so it does not miss these relations; on the other hand, merging multiple logs into one OCEL and keeping the log information as event attributes can be considered a technique to handle this shortcoming.

Fig. 2. Intra- and inter-log directly-follows relations (shown by thin and thick flows, respectively) for a part of BPIC15_1 & BPIC15_3 in the Event Knowledge Graph

Fig. 3. Inter-log directly-follows relations for a part of BPIC15_1 equivalent to Fig. 2

As previously mentioned, some directly-follows relations in the transformed log were not present in the original EKG. For instance, the relation between register submission date request and enter senddate acknowledgement for the Case_R object type was not captured in the EKG. The reason for this discrepancy lies in the runtime computation of directly-follows relations in Object-Centric Process Mining. In the EKG, two other events occurred between these two events, so no direct relation exists. However, when we project events onto a specific event log, events from other logs are removed, leading to different computations of directly-follows relations among events.

The addition of directly-follows relations can also be observed when filtering event logs on certain event attributes. An important difference arises when filtering out specific events, such as the enter senddate procedure confirmation event in the EKG (as depicted in Fig. 2). In that case, the EKG contains no directly-follows (DF) relation between the register submission date request and enter senddate acknowledgement events for the Application entity. However, applying the same filter in OCEL introduces a new DF relation between these two events, because directly-follows relations in OCEL are calculated at runtime from the remaining timestamps.
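This runtime behavior can be reproduced with a minimal sketch: events of one entity ordered by timestamp, where removing the middle event creates a DF pair that the materialized EKG would not contain. The integer timestamps are illustrative; the activity names follow the example above:

```python
# One entity's events as (activity, timestamp) pairs.
trace = [("register submission date request", 1),
         ("enter senddate procedure confirmation", 2),
         ("enter senddate acknowledgement", 3)]

def directly_follows(trace):
    """Compute DF pairs at runtime from timestamps, as done for OCEL."""
    ordered = sorted(trace, key=lambda e: e[1])
    return {(a[0], b[0]) for a, b in zip(ordered, ordered[1:])}

before = directly_follows(trace)
# Filter out the middle event, then recompute DF from what remains.
after = directly_follows(
    [e for e in trace if e[0] != "enter senddate procedure confirmation"])

new_pair = ("register submission date request",
            "enter senddate acknowledgement")
```

The pair `new_pair` is absent from `before` but present in `after`: filtering plus runtime recomputation introduces a DF relation, whereas a materialized EKG would simply retain its pre-calculated relations.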

It is important to note that we do not conclude which approach is correct or incorrect. However, this discrepancy is a significant difference that analysts should be aware of to avoid drawing incorrect conclusions.
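Both discrepancies stem from the same mechanism: OCEL derives directly-follows pairs at runtime from whichever events remain after projection or filtering, while the EKG keeps them materialized. A minimal sketch of the runtime computation, reusing the activity names from the example above over a hypothetical in-memory event layout (not the actual OCEL or PM4Py API):

```python
from collections import defaultdict

def directly_follows(events):
    """Runtime-style directly-follows (DF) computation: group the
    (timestamp, activity, entity) tuples per entity, sort each group
    by timestamp, and pair neighbouring events."""
    per_entity = defaultdict(list)
    for ts, activity, entity in events:
        per_entity[entity].append((ts, activity))
    df = set()
    for entity, evs in per_entity.items():
        evs.sort()
        for (_, a), (_, b) in zip(evs, evs[1:]):
            df.add((a, b, entity))
    return df

# Hypothetical fragment: three events of one Application entity.
events = [
    (1, "register submission date request", "Application"),
    (2, "enter senddate procedure confirmation", "Application"),
    (3, "enter senddate acknowledgement", "Application"),
]

full_df = directly_follows(events)
# Filtering out the middle event *before* recomputing creates a DF pair
# that does not exist in the unfiltered data.
filtered = [e for e in events
            if e[1] != "enter senddate procedure confirmation"]
filtered_df = directly_follows(filtered)
```

In a pre-materialized store (the EKG approach), the filter would merely remove existing DF edges; it would never introduce the new pair that the recomputation above produces.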

  • Finding 2. Inter-log directly-follows relations are not preserved when transforming an EKG into multiple OCELs (following design choice DC1). If those relations matter in the analysis, an analyst may follow the alternative design choice stated in DC1.

  • Finding 3. Analyzing processes with multiple logs using OCEL can introduce additional directly-follows relations due to the absence of inter-log directly-follows relations. The alternative design choice stated in DC1 can be followed to overcome this challenge.

  • Finding 4. Filtering OCEL event logs based on specific events can introduce extra directly-follows relations due to the removal of the filtered events, similar to the case of filtering traditional logs. These relations are not added when analyzing event knowledge graphs, as all directly-follows relations are pre-calculated.

Differences in BPIC 17: In the EKG, each event is associated with two classes. For instance, event 9 with the activity name O_Created is linked to two classes: one of type Activity, named after the activity, and one of type Activity+Lifecycle with the lifecycle value COMPLETE. However, when transforming to OCEL, the lifecycle information is not carried over, since the OCEL standard does not include lifecycle specifications.

  • Finding 5. The OCEL standard does not include support for the event lifecycle, but it is supported in EKG. One option to overcome this limitation is to map this information as event attribute values or related objects, as explained in the alternative choice for DC2.
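One way to realize the attribute-value mapping is sketched below with a hypothetical event representation (this is not the actual transformation code; only the "ocel:activity"/"ocel:vmap" keys follow the OCEL JSON convention): the lifecycle value of the Activity+Lifecycle class is stored as an ordinary event attribute during the transformation.

```python
def ekg_event_to_ocel(activity, classes):
    """Map a hypothetical EKG event and its observed classes to an
    OCEL-style event dict, keeping the lifecycle as a plain event
    attribute since OCEL has no dedicated lifecycle field."""
    event = {"ocel:activity": activity, "ocel:vmap": {}}
    for cls in classes:
        if cls["type"] == "Activity+Lifecycle":
            event["ocel:vmap"]["lifecycle"] = cls["lifecycle"]
    return event

# Event 9 from the BPIC 17 example: activity O_Created with two classes.
ev = ekg_event_to_ocel("O_Created", [
    {"type": "Activity", "name": "O_Created"},
    {"type": "Activity+Lifecycle", "name": "O_Created",
     "lifecycle": "COMPLETE"},
])
```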

5.3 Performance Evaluation

Table 3. Performance comparison (in seconds)

\(*\): Non-Reified

Table 3 shows the results of the performance comparison for processing event data in EKG and OCEL. The column labeled “Loading Time” represents the time required to prepare the log file for analysis. For OCEL, it indicates the time taken to load the log file into memory. For the EKG, it refers to the time required to load the dump file into Neo4j.

  • Finding 6. Analyzing OCEL using PM4Py requires the log file to fit within the computer’s memory. In contrast, EKG (stored in Neo4j) can handle large data sizes without such memory limitations because parts of the graph content are loaded into memory and processed on demand [1], as also demonstrated in [25]. This distinction is crucial when dealing with big data in process analysis, as it can enable scaling process mining in practice.

  • Finding 7. Loading logs into EKG is a one-time process, similar to loading data into databases. Once the data is loaded, multiple analyses can be performed without reloading it. With OCEL and PM4Py, however, the loading time must be considered for every new analysis, and keeping large datasets in memory for extended periods may not be efficient when dealing with big data.

The columns labeled #Log, #Event, and #Entity\(^*\) represent the query execution times for retrieving the number of logs, events, and non-derived entities in OCEL and EKG, respectively. The queries on OCEL are extremely fast, with execution times rounded to zero. The query execution times for EKG are also reasonable: in the worst case, it takes approximately 7 seconds for BPIC16, which is a substantially large EKG. Similar observations can be made for #Class and #observed. However, there is one exception for BPIC16 in the case of #corr: retrieving the number of #corr elements takes around one minute due to the size of the EKG, and the additional filtering of #corr relations for non-reified entities significantly increases the query execution time.

Considering the query execution times, a significant difference is observed in calculating the number of directly-follows relations in the log file. These relations play a crucial role as fundamental information for many process mining algorithms. EKG outperforms OCEL in this aspect because all directly-follows relations are materialized in the EKG, whereas in OCEL, these relations are computed at runtime during processing. Based on this observation, we can conclude that:
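The difference can be illustrated with a small sketch: runtime computation re-derives all pairs on every query, while a materialized store (the EKG approach) builds them once and afterwards only reads them. The data layout and names below are hypothetical.

```python
def compute_df(events_by_entity):
    """Runtime style (OCEL): derive all DF pairs from scratch on every
    call, from {entity: [(timestamp, activity), ...]} input."""
    df = []
    for entity, evs in events_by_entity.items():
        ordered = sorted(evs)                      # order by timestamp
        for (_, a), (_, b) in zip(ordered, ordered[1:]):
            df.append((a, b, entity))
    return df

class MaterializedDF:
    """EKG style: DF relations are built once at load time and every
    later query is a cheap read on the stored edges."""
    def __init__(self, events_by_entity):
        self.df = compute_df(events_by_entity)     # one-time materialization
    def count(self):
        return len(self.df)                        # no recomputation needed

log = {"order1": [(1, "create order"), (2, "approve order")],
       "item1": [(1, "create order"), (3, "select item"),
                 (4, "approve item")]}

store = MaterializedDF(log)
```

Counting DF relations via `store.count()` touches only the stored edges, whereas `compute_df(log)` must sort and pair all events again on each invocation, which mirrors the gap observed in Table 3.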

  • Finding 8. Discovering directly-follows relations on the entire log file is more efficient in the EKG than in OCEL, because the relations are materialized in the EKG, whereas in OCEL they are computed at runtime. The pre-calculation of directly-follows relations thus enhances the efficiency and performance of process mining analyses on the EKG.

Applying process mining without appropriate filters can lead to unhelpful and complex process models, often called “spaghetti” models, which is considered a fundamental weakness of most early process mining algorithms [41]. Hence, filtering event logs and focusing on a subset of directly-follows relations is common practice. In our paper, we compare the performance of retrieving different subsets of directly-follows relations from EKG and OCEL on all listed BPICs. We employ common filtering operations such as i) dicing the log based on a timestamp, ii) slicing the log based on an entity type, iii) slicing and dicing the log based on a timestamp and an entity type, iv) slicing the log based on an entity, and v) slicing and dicing the log based on a timestamp and an entity. The performance of slicing and dicing based on the timestamp can be improved in Neo4j if an index is defined for the timestamp. However, this solution may not be applicable to all attribute types, e.g., if we slice or dice based on the similarity of a textual attribute; thus, we test both approaches here. For the timestamp, we follow a pessimistic approach by selecting a timestamp and an entity type that do not exist in the data, which mandates traversing the whole graph when there is no index. Table 4 shows the performance comparison result of retrieving directly-follows relations after applying the above filtering. The numbers in parentheses represent the total query execution time after creating an index on the timestamp.

Table 4. Execution time by filtering (in seconds)

The numbers in parentheses are the execution times after creating an index on the timestamp.

From the third column, it is evident that the performance of retrieving directly-follows relations using PM4Py is significantly better than that of EKG when applying a filter based solely on the event’s timestamp without an index. If an index can be defined, EKG performs better. The main reason behind this difference is that applying such a filter in the EKG without an index necessitates traversing all nodes in the graph, a time-consuming operation; with an index, the EKG does not need to traverse the whole graph. PM4Py, on the other hand, executes this operation by processing data in memory.

As observed from the remaining columns, the disparity mentioned above becomes less significant when filtering the log based on other log elements, such as entity type (referred to as object type in OCEL) and entities (referred to as objects). In summary, we can conclude with the following findings:

  • Finding 9. Analyzing a process using an OCEL log is much more efficient than using an EKG without a relevant index when filtering only by dicing the data. If such an index can be defined, EKG performs better.

  • Finding 10. There is no significant performance difference when analyzing a process using sliced data for an OCEL or EKG.
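The five filter variants compared above can be sketched as simple predicates over events; the field names below are hypothetical and do not follow the OCEL or EKG schema.

```python
def dice(events, t_from, t_to):
    """Dice: keep events inside a timestamp window."""
    return [e for e in events if t_from <= e["timestamp"] <= t_to]

def slice_by_type(events, entity_type):
    """Slice: keep events correlated to a given entity type."""
    return [e for e in events if entity_type in e["entity_types"]]

def slice_by_entity(events, entity):
    """Slice: keep events correlated to one concrete entity."""
    return [e for e in events if entity in e["entities"]]

events = [
    {"timestamp": 1, "activity": "create order",
     "entities": {"order1"}, "entity_types": {"Order"}},
    {"timestamp": 2, "activity": "select item",
     "entities": {"item1"}, "entity_types": {"Item"}},
    {"timestamp": 3, "activity": "approve item",
     "entities": {"item1"}, "entity_types": {"Item"}},
]

# Slice-and-dice composes both predicates, e.g. Item events in [2, 3]:
subset = dice(slice_by_type(events, "Item"), 2, 3)
```

The pessimistic timestamp filter used in the experiments corresponds to choosing a window such as `dice(events, 100, 200)` that matches nothing: without an index, every event must still be visited to establish the empty result.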

Some limitations and threats to validity shall be discussed as well. We emphasize that some findings can be affected by following the alternative design choices discussed in this section. Currently, we limit the comparison to the design choices taken, but we will extend it by considering the alternative choices in the future. We also emphasize that our analysis is based on the current version of the OCEL standard. Our findings and other investigations can influence the extension of this standard in the future, which may relax or change some of the identified findings.

6 Concluding Remarks

This study conducted a comparative analysis of multi-dimensional process analysis using two contemporary conceptual models, namely the Object-Centric Event Log (OCEL) and the Event Knowledge Graph (EKG). A novel algorithm was introduced to transform an EKG into a set of OCELs, implemented in Python as part of an open-source library. Five real-life log files represented as EKGs were transformed into OCELs using this algorithm, and the resulting log files were utilized for the comparative analysis.

A total of ten findings emerged from this study, with several noteworthy ones highlighted here. The research shows that transforming an EKG containing multiple log files into separate OCELs can cause a loss of inter-log relations between events. Moreover, the study demonstrated differences in analyzing directly-follows relations, attributing them to these relations being materialized in the EKG but calculated at runtime for OCEL. Additionally, it was found that analyzing a process using an OCEL log is more efficient than using an EKG without an index when only dicing the data, and it was shown how the possibility of applying an index can shift the advantage toward the EKG.

As a future direction, it will be interesting to investigate how the OCEL standard can be extended to address some of the reported limitations. It would also be interesting to evaluate the difference between these two approaches in calculating directly-follows relations in real use cases, where stakeholders and process experts can help assess those relations.