
1 Introduction

Health care processes are complex and involve steps performed by people from various disciplines and sub-areas. This complexity makes the area interesting, but also difficult to analyze and understand. These processes rely on information systems that record large volumes of data, which are, however, difficult to exploit.

Process mining aims to gain knowledge about a particular running process and to obtain an accurate model of its behavior. The purpose is to improve the implementation and evaluation of health care processes. Moreover, the model can help configure additional requirements not covered by the system [1].

To evaluate the feasibility of several process mining algorithms on health care processes, each one will be tested to understand its limitations and advantages. For this purpose, the Process Mining for Python framework (PM4Py) [2] will be used, as it allows a degree of algorithmic customization that other tools do not. Furthermore, it offers a good variety of other features of interest.

Usually, the event logs used for analysis provide timestamps for the steps/activities that compose the process, as well as their description and other information. The dataset used was extracted from the MIMIC-III database [3].

Subsequently, specific scenarios with certain characteristics are created in order to test different situations that may expose limitations of the algorithms. In addition, the variants are analyzed [4]: the variants with the most occurrences are selected, and specific cases that could introduce noise into the process analysis are excluded. It is also of interest to filter the dataset according to features that make sense for the set of logs. Finally, we intend to test techniques and tools to verify the conformance of the generated model.

It is also possible to calculate different statistics on the event logs of the dataset, as well as to create graphs that help to understand various aspects of the dataset.

This document is organized into 6 sections. In Sect. 2, the main process discovery algorithms are presented. Section 3 analyzes PM4Py, justifying the choice of this tool. Section 4 explains the experimental scenarios. Section 5 shows the results. Finally, Sect. 6 presents conclusions about this work and future work.

2 Process Mining Algorithms

In the last decade, process mining has emerged as a new field of research that focuses on process analysis using event data. Classic data mining techniques do not focus on business process models [5]. Thus, process mining focuses on processes step by step: as the availability of event data and of new techniques increases, it becomes possible to discover processes and to verify their conformance [6].

Process models are used to analyze process execution through Business Process Management (BPM) systems. These process management tools are widely used to support operational process administration. However, they do not use event data [7].

The activities performed by people, machines, and software leave traces in the so-called event logs [5]. Process mining techniques use these logs to discover, analyze and improve business processes [8].

Process mining is used to find patterns and understand the causes of certain process behaviors. It also helps to understand how processes are actually being performed. For this purpose, specialized mining algorithms are applied to identify patterns in the event data recorded in information management systems [9].

There are several algorithms for process mining. The internal local relations between the activity data are modeled by the Heuristic Miner algorithm. This is the most widely used algorithm, mainly due to its ability to deal with noisy and incomplete data, common in the health area. Global or external relationships between activities are modeled by the Genetic Miner [10] and Fuzzy Miner [11] algorithms.

The Alpha Miner algorithm examines the event log for specific patterns. It processes the set of event sequences, following the order of activities in the event log, and presents the result as a Petri Net diagram. For example, if activity X is followed by Y, but Y is never followed by X, then it is assumed that there is a causal dependency between X and Y. However, Alpha Miner is unable to highlight the bottlenecks of the process [12].
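
To make this causality notion concrete, the following toy sketch (plain Python, with illustrative traces rather than this paper's data) derives the directly-follows pairs and the causal dependencies that Alpha Miner builds on:

    # Toy illustration of the relations Alpha Miner builds on (illustrative traces only).
    traces = [["A", "B", "D"], ["A", "C", "D"]]

    follows = set()
    for trace in traces:
        for x, y in zip(trace, trace[1:]):
            follows.add((x, y))  # x is directly followed by y at least once

    # Causal dependency: x is followed by y, but y is never followed by x.
    causal = {(x, y) for (x, y) in follows if (y, x) not in follows}
    print(causal)  # {('A', 'B'), ('A', 'C'), ('B', 'D'), ('C', 'D')} (set order may vary)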

Most business process mining tools use Directly-Follows Graphs (DFGs) as a first approach to exploring event data. To deal with complexity, DFGs are simplified by removing nodes and edges based on frequency restrictions. This simplification can make DFGs misleading, as they can be misinterpreted and lead to incorrect conclusions. In addition, bottleneck information can be misleading, especially after simplifying the model. This can lead to all kinds of interpretation problems, due to “invisible gaps” in the model [15].

Heuristic algorithms use the order or sequence of activities and the frequency of events. They find the frequent and infrequent paths in the process and, in this sense, are more robust with respect to process frequencies [16]. Heuristic Miner is similar to the Alpha algorithm in the problems it addresses, but it captures more real-world behavior. It uses logical XOR and AND connectors for dependency relationships, and its result is a heuristic net that helps to visualize the process and predict the flow [12].

Inductive Miner is widely used in different areas, with very promising results. It improves over the Alpha and Heuristic Miners, as it explores an event log more easily. It guarantees sound models and is able to deal with infrequent behavior and large event logs. The basic concept of Inductive Miner is to detect a pattern in the logs and then to recurse on that pattern until a base case is found [17].

Table 1 shows a comparison of the algorithms, according to their characteristics and limitations [18].

Table 1. Comparison of Process Mining algorithms.

3 Process Mining for Python (PM4Py)

The Process Mining for Python framework (PM4Py) [2] is an easily extensible process mining tool. It allows large-scale experiments to be conducted easily, as well as algorithmic customization. In addition, it can be integrated into large-scale applications as a process mining library, and it interoperates with other libraries such as pandas, numpy, scipy and scikit-learn [20].

The main advantages of the PM4Py library are:

  • Allows algorithmic development and customization more easily, when compared to existing tools like ProM [21], RapidProM [21], Disco [22] or Celonis [23];

  • Enables easy integration of process mining algorithms with algorithms from other areas of data science, implemented in several state-of-the-art Python packages;

  • PM4Py provides support for different types of event data structures, namely event logs, where each trace is a list of events; events are structured as key-value maps;

  • Provides conversion features to transform event data objects from one format to another. PM4Py also supports the use of pandas data frames, which are efficient when handling larger event data. Other objects currently supported by PM4Py include heuristic nets, Petri nets, process trees and transition systems.

PM4Py provides several main process mining techniques, including:

  • Process discovery, based on Alpha Miner algorithms [24], Directly-Follows Graph [15], Heuristic Miner [16] and Inductive Miner [17];

  • Conformance checking, through token-based replay and alignments [25];

  • Measurement of fitness (replay adequacy), precision, generalization, and simplicity of process models [26];

  • Filtering based on time interval, case performance, input and output events, variants, attributes, and paths;

  • Case management: statistics on variants and cases;

  • Graphs: duration of the case, events by time, distribution of numeric attribute values;

  • Social Network Analysis [27]: work handover, joint work, subcontracting and networks of related activities.

PM4Py also provides Python visualization libraries, such as:

  • GraphViz: representation of directly-follows graphs, Petri Nets, transition systems, and process trees;

  • NetworkX: static representation of social networks;

  • Pyvis: dynamic web-based social network representation.
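
As an illustration of how discovery and visualization fit together, the following minimal sketch uses the simplified PM4Py interface; the XES file name is hypothetical and function names may differ slightly between PM4Py versions.

    import pm4py

    # Hypothetical XES file; function names follow the simplified PM4Py interface.
    log = pm4py.read_xes("admissions.xes")

    # Process discovery with different algorithms.
    net, im, fm = pm4py.discover_petri_net_inductive(log)  # Inductive Miner -> Petri Net
    heu_net = pm4py.discover_heuristics_net(log)           # Heuristic Miner -> heuristics net
    dfg, start_acts, end_acts = pm4py.discover_dfg(log)    # Directly-Follows Graph

    # GraphViz-based visualization.
    pm4py.view_petri_net(net, im, fm)
    pm4py.view_heuristics_net(heu_net)
    pm4py.view_dfg(dfg, start_acts, end_acts)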

4 Data and Experimental Scenario

For the experimental scenarios, data was selected and processed from the MIMIC-III database. The data was then converted into the format required by the PM4Py process discovery algorithms. Finally, for a better analysis of the algorithms, certain test scenarios were defined in order to expose the algorithms to different challenges.

4.1 Data Processing

The table schema of the MIMIC-III database (demo version) was analyzed to find the desired information for the test dataset. A subset of tables satisfying the proposed requirements was selected, Fig. 1.

Fig. 1. Scheme of test data.

Analyzing the scheme, the main table is TRANSFERS, which contains the physical locations of patients during hospitalization. The main attributes of this table are: the care unit (CURR_CAREUNIT), when it is a specialty; the entry date and time (INTIME) and the exit date and time (OUTTIME); and the type of event (EVENTTYPE), which can take one of three values: admit, for procedures performed in the patient’s admission/evaluation phase; transfer, for the patient’s transfer/stay phases; and discharge, for the patient’s discharge phase.

Note that, when there is no specialty, the acronym GCU, for General Care Unit, was inserted. The descriptions of the remaining acronyms of the specialized care units are presented in Table 2.

Table 2. Description of care units (Adapted from https://mimic.physionet.org/mimictables/transfers/).

The SUBJECT_ID attribute connects to the PATIENTS table, which has information about the patients, namely the attributes gender (GENDER) and date of birth (DOB).

Subsequently, the HADM_ID attribute is used to access the ADMISSIONS table, which contains information about the patient’s admission. Using this table, it is possible to collect information about the type/place of admission (ADMISSION_LOCATION) and the date of discharge (DISCHTIME) or death (DEATHTIME). It also gives access to the PROCEDUREEVENTS_MV table to obtain data related to the events performed in each admission.

The PROCEDUREEVENTS_MV table includes the name of the process (ORDERCATEGORYNAME) and the date and time of start (STARTTIME) and end (ENDTIME) of the process. Notice that, at this stage, the processes are synchronized with the physical locations, by the respective start/entry and end/exit dates.
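
A possible way to assemble this schema is sketched below with pandas and hypothetical CSV exports of the tables; the synchronization of procedures with physical locations by their start/end times is omitted for brevity.

    import pandas as pd

    # Hypothetical CSV exports of the MIMIC-III demo tables.
    transfers = pd.read_csv("TRANSFERS.csv")
    patients = pd.read_csv("PATIENTS.csv")
    admissions = pd.read_csv("ADMISSIONS.csv")
    procedures = pd.read_csv("PROCEDUREEVENTS_MV.csv")

    df = (transfers
          .merge(patients[["SUBJECT_ID", "GENDER", "DOB"]], on="SUBJECT_ID")
          .merge(admissions[["HADM_ID", "ADMISSION_LOCATION", "DISCHTIME", "DEATHTIME"]],
                 on="HADM_ID")
          .merge(procedures[["HADM_ID", "ORDERCATEGORYNAME", "STARTTIME", "ENDTIME"]],
                 on="HADM_ID", how="left"))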

4.2 Preparation of the Dataset

After importing the Excel data, it was processed to obtain the dataset format required by the PM4Py process discovery algorithms.

The required format consists of 3 types of information:

  • Case ID - a unique identifier for each process;

  • Event - a step in the process, any activity that is part of the process;

  • Timestamp - date and time for a given event.

The HADM_ID was used as a case identifier, because it is unique. For each stage of the process, the type of event (admission, transfer, or discharge), the care unit (an identifying acronym) and the name of the process performed were added. For the timestamp of the stage, the start date of the process was used or, in cases where a process was not identified, the date of entry into the care unit.

Moreover, other information was used, such as the type of admission, the type of exit (death or discharge), the patient’s date of birth, the gender, and the day of the week on which the event occurred. In the end, possible duplicates arising from the synchronization of processes with physical locations were removed. Notice that, to apply the algorithms, the dataset was converted into log format, ordered by timestamp, yielding a total of 1163 logs.
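
The conversion into the event log format expected by PM4Py can be sketched as follows; EVENT and TIMESTAMP are placeholder column names for the activity label and timestamp described above.

    import pandas as pd
    from pm4py.objects.conversion.log import converter as log_converter
    from pm4py.objects.log.util import dataframe_utils

    # EVENT and TIMESTAMP are placeholders for the activity label
    # (event type + care unit + procedure name) and the chosen timestamp.
    df = df.rename(columns={"HADM_ID": "case:concept:name",
                            "EVENT": "concept:name",
                            "TIMESTAMP": "time:timestamp"})
    df = dataframe_utils.convert_timestamp_columns_in_df(df)
    df = df.sort_values("time:timestamp")
    event_log = log_converter.apply(df)  # list of traces, each trace a list of events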

4.3 Test Scenarios for the Algorithms

From the analysis and comparison of the algorithms, it was verified that loops between steps and duplicate steps may arise. Thus, admissions were selected that allow testing each of these scenarios in isolation.

In an initial scenario, simple admissions were chosen, in which none of the situations described above occurs. The main objective of this scenario, Fig. 2, is a first interaction to test the algorithms and their models.

Fig. 2. Simple scenario.

Next, a scenario with duplicate steps, Fig. 3, was selected: there is a step that occurs repeatedly.

Fig. 3. Scenario with duplicate steps.

In the last scenario, the algorithms were exposed to loops between steps. Figure 4 shows a loop occurrence between 2 steps.

Fig. 4. Scenario with loops between stages.

5 Results

In this section, the results of the tests are presented. The models generated for each test scenario, the execution times, the analysis of the log variants, the filtering of the event data, the log set statistics, and the conformance verification are presented and analyzed.

5.1 Models Analysis for Each Test Scenario

Table 3 presents the models resulting from the test scenarios. Alpha Miner is unable to create a valid Petri Net model, because it isolates duplicate steps. This result was predictable because this algorithm, admittedly, supports neither duplicate steps nor loops of length one or two [18]. For loops between 2 steps, it generates an invalid Petri Net model, isolating one of the loop steps.

For the full set of logs, the Directly-Follows Graph generated a model that is too large, a spaghetti model. Despite generating this DFG model, it is not a valid one, as the admissions are broken. This result can be justified by the fact that DFGs are simplified by removing nodes and connections based on frequency limits [15]. However, for the other scenarios, the performance of the DFG model can be considered good.

Heuristic Miner allows the presentation of the frequency of the stages and connections, but it does not mark the most frequent stages and connections [16], which is a disadvantage relative to the Petri Net. This algorithm copes with the duplicate step and loop challenges. When the result was converted to a Petri Net, the model showed hidden transitions. For a larger number of logs, the resulting model is difficult to analyze, as it creates spaghetti models.

Considering all logs, Inductive Miner seems to generate a smaller model, with fewer steps and connections. An explanation of this result may be the improvement that this algorithm brings in the search for splits/patterns in the logs. Moreover, it uses many hidden transitions to overcome loops in parts of the model [26].

Table 3. Models for each test scenario.

5.2 Execution Times

Table 4 shows the average execution times, in seconds, of the Heuristic Miner and Inductive Miner algorithms for the entire set of logs; these were the algorithms able to present a valid model. Notice that each algorithm and model was tested 5 times, under the same conditions, and the average time was calculated after removing the maximum and minimum times. The execution times correspond only to the execution of the algorithms, since all logs were already loaded into memory.

Table 4. Execution times for the entire set of logs.
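
The timing procedure can be reproduced with a small helper along these lines (illustrative only; event_log is the converted log from Sect. 4.2):

    import time
    import pm4py

    def avg_runtime(discover, log, runs=5):
        # Run discovery `runs` times, drop the fastest and slowest runs, average the rest.
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            discover(log)
            times.append(time.perf_counter() - start)
        times.remove(max(times))
        times.remove(min(times))
        return sum(times) / len(times)

    print(avg_runtime(pm4py.discover_petri_net_inductive, event_log))
    print(avg_runtime(pm4py.discover_heuristics_net, event_log))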

5.3 Variant Analysis

The analysis of variants is extremely important, as it considers the number of occurrences of the variants. This analysis allows the least relevant variants to be removed. A variant is a set of cases that share the same control-flow perspective, that is, a set of cases that share the same events/activities, in the same order [4].

The Inductive Miner algorithm was used in this analysis. The results are shown in Table 5 and contain the description of the variant, the number of occurrences, and the respective percentage. Table 6 presents the models generated for different variant frequencies. The remaining variants have a lower number of occurrences and were discarded. If they were considered, all logs would be included, which could make the analysis unfeasible and inefficient.

Table 5. Variants.
Table 6. Models of different frequencies of variants.
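
A sketch of this variant selection with the simplified PM4Py interface is given below; the return type of pm4py.get_variants differs between PM4Py versions, hence the defensive counting, and event_log is the log prepared in Sect. 4.2.

    import pm4py
    from collections import Counter

    # pm4py.get_variants maps each variant to its traces or to a count,
    # depending on the PM4Py version.
    variants = pm4py.get_variants(event_log)
    counts = Counter({v: (len(t) if hasattr(t, "__len__") else t)
                      for v, t in variants.items()})

    top_variants = [v for v, _ in counts.most_common(3)]  # e.g. keep the 3 most frequent
    filtered_log = pm4py.filter_variants(event_log, top_variants)
    net, im, fm = pm4py.discover_petri_net_inductive(filtered_log)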

5.4 Filtering Event Data

In this section, filters were tested, Table 7, on the most frequent variants, in order to analyze the process at a different level of detail. The Inductive Miner algorithm was used, with the result presented as a Petri Net model.

Table 7. Filtering models.
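
Typical filters of this kind can be expressed with the simplified PM4Py interface as sketched below; the date range and activity labels are hypothetical examples, not values taken from the dataset.

    import pm4py

    # Hypothetical date range and activity labels.
    by_time = pm4py.filter_time_range(event_log, "2130-01-01 00:00:00",
                                      "2140-12-31 23:59:59", mode="traces_contained")
    by_start = pm4py.filter_start_activities(event_log, {"admit GCU"})
    by_end = pm4py.filter_end_activities(event_log, {"discharge GCU"})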

5.5 Log Set Statistics

In PM4Py, it is possible to calculate different statistics on the event logs. In Table 8, two statistics computed on the dataset can be analyzed: the average case duration and the case dispersion ratio. The latter is the average distance between the completion of two consecutive cases in the log.

Table 8. Statistics results for the dataset.
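
Both statistics can also be recomputed directly from the event data frame with pandas, as a minimal cross-check of the same idea (using the renamed data frame from Sect. 4.2, not PM4Py's own implementation):

    import pandas as pd

    # Average case duration, in seconds.
    durations = (df.groupby("case:concept:name")["time:timestamp"]
                   .agg(lambda s: (s.max() - s.min()).total_seconds()))
    avg_case_duration = durations.mean()

    # Case dispersion: average gap between the completion of two consecutive cases.
    completions = df.groupby("case:concept:name")["time:timestamp"].max().sort_values()
    case_dispersion = completions.diff().dropna().dt.total_seconds().mean()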

It is also possible to create graphs, Table 9, to understand various aspects of the dataset used in the model, such as the distribution of a numeric attribute, the distribution of case durations, or the distribution of events over time.

Table 9. Event distribution graphs.

5.6 Conformance Verification for Test Logs

Conformance verification is a technique to compare the predicted/expected model of the process with the real behavior of the process, that is, the set of real event logs for that process. The objective is to verify whether the logs are in accordance with the model and vice versa [30]. In PM4Py, two fundamental techniques can be applied: token-based replay and alignments [26].

For this analysis, the Inductive Miner algorithm was used, with the result as a Petri Net model. Variants with a single occurrence were removed. Furthermore, when necessary, a set of two test logs was used, where one of them belongs to the predicted model and the other does not.

Token-Based Replay

Token-based replay tracks the Petri Net model, starting from the initial marking, to find out which transitions are fired and in which places there are remaining or missing tokens for the tested log instance. A log conforms to the model if, during its replay, all transitions can be fired without the need to insert any missing tokens [31].
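
A minimal call sketch, assuming the Petri Net (net, im, fm) discovered earlier and a small test log test_log (function name from the simplified PM4Py interface; the dictionary keys match those discussed below):

    import pm4py

    diagnostics = pm4py.conformance_diagnostics_token_based_replay(test_log, net, im, fm)
    for d in diagnostics:
        print(d["trace_is_fit"], round(d["trace_fitness"], 3),
              d["missing_tokens"], d["remaining_tokens"])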

For the model and the set of logs tested, the result is represented in Fig. 5. For the first log, the model was unable to satisfy it: the attribute trace_is_fit is False because, in the attribute transitions_with_problems, there was a transition that the replay was unable to follow. Hence, 9 produced tokens were consumed and 1 token was missing. Since a large part of the path was satisfied, trace_fitness is still close to 1, approximately 0.889.

For the second log, trace_is_fit is True. Thus, the model satisfied all the transitions of the log, having consumed all 10 produced tokens, with no remaining or missing tokens.

Fig. 5. Conformance results using token-based replay.

Alignments

Alignment-based replay aims to find one of the best alignments between the log and the model. For each log trace, the output of an alignment is a list of pairs where the first element is a log event and the second element is a model transition. Each pair can be classified as follows [32]:

  • Synchronous move: the event label corresponds to the name of the transition; in this case, the log and the model advance together during the replay;

  • Move on log: pairs where the second element is “>>”. This symbol in the second element corresponds to a replay move in the log that has no counterpart in the model. This type of move is unfit and indicates a deviation between the log and the model;

  • Move on model: pairs where the first element is “>>”. This corresponds to a replay move in the model that has no counterpart in the log. For moves on model, we can make the following distinction:

    • Moves on model involving hidden transitions: in this case, even though it is not a synchronous move, the move is fit;

    • Moves on model that do not involve hidden transitions: in this case, the move is unfit and indicates a deviation between the log and the model.

Each log conformance check is associated with a dictionary containing, among others, the following information (a call sketch is given after this list):

  • Alignment: contains the alignment (synchronous moves, moves on log, moves on model);

  • Cost: contains the cost of the alignment according to the cost function provided, which can be customized;

  • Fitness: equal to 1 if the log fits the model perfectly.
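
A corresponding call sketch for alignments, again using the simplified PM4Py interface and the same hypothetical test_log and (net, im, fm):

    import pm4py

    # ">>" marks moves on log or on model that have no counterpart on the other side.
    aligned = pm4py.conformance_diagnostics_alignments(test_log, net, im, fm)
    for a in aligned:
        print(round(a["fitness"], 3), a["cost"])
        print(a["alignment"])  # list of (move on log, move on model) pairs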

For the model and the set of logs tested, the first log had a fitness close to 1 (but lower than 1), indicating that the replay was not able to complete the entire process path of the model. In the second log, on the other hand, the replay was able to complete the path, Fig. 6.

Fig. 6. Results of alignments.

Overall Assessment of the Model by the Set of Test Logs

In PM4Py, it is possible to obtain different information on the comparison between the behavior contained in the test logs and the behavior contained in the model, to verify if and how they correspond. There are four different dimensions of conformance in process mining: the measurement of the adequacy of the replay (fitness), of precision, of generalization, and of simplicity.

The calculation of the adequacy of the replay aims to determine how much of the behavior in the log is admitted by the process model. Two methods are available for this calculation: token-based replay and alignments, both used previously for individual logs.

For precision, the set of transitions in the process model is compared with the set of activities in the logs that follow the model [26]. For that, unvisited branches are counted: decisions that are possible in the model but never observed in the event log. If there are none, the precision is perfect. This analysis can also be obtained with the two methods mentioned above: token-based replay is faster, but based on heuristics, so the result may not be exact [31]; alignments are exact and work on any type of net, but can be slow [32].

Generalization is the third dimension, used to analyze how well the log and the process model match. Basically, a model is general if the elements of the model are visited often enough during replay.

Finally, simplicity is the fourth dimension for analyzing a process model. In this case, simplicity is defined considering only the Petri Net model; the metric considers the number of incoming and outgoing connections of each transition [33]. For all these metrics, the resulting value varies between 0 and 1. A computation sketch is given below.
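
A sketch of how the four dimensions can be computed; the module paths for the generalization and simplicity evaluators follow PM4Py 2.x and may differ in other versions.

    import pm4py
    from pm4py.algo.evaluation.generalization import algorithm as generalization_evaluator
    from pm4py.algo.evaluation.simplicity import algorithm as simplicity_evaluator

    fitness = pm4py.fitness_token_based_replay(test_log, net, im, fm)
    precision = pm4py.precision_alignments(test_log, net, im, fm)
    generalization = generalization_evaluator.apply(test_log, net, im, fm)
    simplicity = simplicity_evaluator.apply(net)  # depends only on the Petri Net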

Figure 7 describes these evaluations, showing that, according to the adequacy of the replay, the fit of the set of tested logs to the model was high, but not complete. Precision, on the other hand, proved to be quite low for both approaches; the use of hidden transitions and the fact that one of the logs did not complete its path can explain this result. The set of tested logs has many repeated steps, leading to a low generalization; besides, only a few steps of the model were visited. As the model has hidden transitions and loops, simplicity is also low, since there are join or split situations in the steps.

Fig. 7. Results of the evaluation of the Log-Model.

6 Conclusion

From the results with the experimental scenarios, Alpha Miner was not able to deal with duplicate steps and loops between two steps. The Directly-Follows Graph handled those, but, in turn, for a larger set of logs, the generated model was invalid, being unable to represent cases with more than 5 steps.

The other algorithms were able to deal with the challenges and with larger volumes of logs. Inductive Miner was the algorithm that best handled duplicate steps and loops between 2 steps; it uses hidden transitions more frequently, mainly in loop parts.

Considering the models tested, the Process Trees are the most difficult to analyze due to their syntax. The Petri Net models proved to be more efficient and structured. Based on the execution times, Petri Net is the type of model that takes longer to run for a larger volume of logs but allows a better analysis.

For large amounts of data, the Petri Net model of Inductive Miner was the one that had the longest execution time, but it was also the one that had the best result. Due to the improvement that this algorithm has, the model, in general, is more organized and easier to analyze [26].

Table 10 summarizes the results achieved, with the comparison parameters presented in order of priority. If an algorithm shows limitations on the challenges, it is not analyzed for the subsequent parameters. Thus, the most suitable algorithm is the Inductive Miner.

Table 10. Summary of the conclusions.

As future work, we intend to extend these experiments to different areas and types of datasets. Another important aspect would be to run these same tests in other existing tools, as they may have different implementations of the algorithms and functionalities.