
1 Introduction

Process Mining is a specialised form of data-driven process analytics in which data about process executions, collated from the different IT systems typically available in organisations, is analysed to uncover the real behaviour and performance of business operations [2]. Without question, the extent to which the outcomes of process mining analyses can be relied upon for insights is directly related to the quality of the input data. The onus is usually on a process analyst to identify, assess and appropriately remedy data quality issues, so as to avoid inadvertently introducing errors into the data while minimising information loss. It is widely acknowledged that eighty percent of the work of data scientists is taken up by data preparation and handling data quality issues. The case of the process analyst is no different [17].

There has been increased interest in research investigating responsible data science [3, 15]. Key dimensions of responsible data science, such as fairness, accuracy, transparency, and confidentiality [3], are being explored, including within specific domains (e.g., healthcare). To take steps towards responsible process mining, there is a dual need: to raise awareness of data quality and to reduce the opportunity for erroneous conclusions, while helping process analysts manage the burden of handling data quality.

In this tutorial paper, we focus on event logs as the primary form of input to process mining. Accordingly, we first present a brief summary of existing work on understanding data quality requirements for event logs. In the words of W. Edwards Deming, the father of quality management, “you can’t manage what you can’t measure”. Hence, the next section outlines key techniques for measuring data quality in event logs. Finally, we provide a synopsis of current contributions to, and future needs of, data quality awareness in process mining.

2 Understanding Data Quality Requirements for Event Logs

An event log used for process mining contains a collection of cases, where each case can be seen as a sequence of events [2]. Each event refers to a case, an activity being undertaken, a point in time, and a transaction type. An event may also refer to a resource or an organisational role, and carry other data attributes (e.g., customer details or case outcomes).
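To make this structure concrete, the following minimal sketch represents such a log as a pandas DataFrame in Python. The column names (case_id, activity, timestamp, transition, resource) and the data are illustrative assumptions of this sketch, not a fixed standard.

```python
import pandas as pd

# A minimal, illustrative event log: each row is one event. The column
# names follow common conventions but are assumptions of this sketch.
log = pd.DataFrame([
    {"case_id": "C1", "activity": "Register", "timestamp": "2021-03-01 09:00",
     "transition": "complete", "resource": "Alice"},
    {"case_id": "C1", "activity": "Triage", "timestamp": "2021-03-01 09:20",
     "transition": "complete", "resource": "Bob"},
    {"case_id": "C2", "activity": "Register", "timestamp": "2021-03-01 09:05",
     "transition": "complete", "resource": "Alice"},
])
log["timestamp"] = pd.to_datetime(log["timestamp"])

# Each case is the time-ordered sequence of its events.
for case_id, events in log.sort_values("timestamp").groupby("case_id"):
    print(case_id, "->", list(events["activity"]))
```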

The process mining manifesto [1] highlighted the need for high-quality event logs for process mining. The manifesto describes five levels of log maturity, ranging from one star to five stars. At the lowest level of maturity (*), where events are recorded manually, one may find events that are incorrectly entered (e.g., with incorrect timestamps or activity labels) or events that are missing altogether. At the highest level of maturity (*****), event logs are considered to be complete and accurate, as events are recorded automatically by a system (e.g., a process-aware information system). Most real-life event logs fall in between these two extremes of the scale, with many quality issues [6, 17].

As most process mining techniques make use of key event data, namely case identifiers, activity labels, and timestamps, missing, inaccurate or erroneous values for any of these attributes (e.g., only a date is recorded but no time, incorrect spellings, or variations in how activities are labelled) may mean that a case or an event has to be filtered out, an erroneous value replaced, or a missing value inferred.
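The sketch below illustrates these three remedies, filtering, replacing and flagging, on a fabricated three-event log; all names, values and the date-only heuristic are assumptions of this sketch.

```python
import pandas as pd

# Illustrative log with typical issues: a missing case id, inconsistent
# spellings of one activity label, and a date-only timestamp (which
# parses to midnight). All names and values are made up for this sketch.
log = pd.DataFrame([
    {"case_id": "C1", "activity": "Register patient", "timestamp": "2021-03-01 09:00"},
    {"case_id": None, "activity": "Triage",           "timestamp": "2021-03-01 09:10"},
    {"case_id": "C1", "activity": "register patient", "timestamp": "2021-03-01 00:00"},
])
log["timestamp"] = pd.to_datetime(log["timestamp"])

# Option 1: filter out events lacking a case identifier.
log = log.dropna(subset=["case_id"])

# Option 2: replace erroneous values, e.g. normalise label variants.
log["activity"] = log["activity"].str.strip().str.lower()

# Option 3: flag imprecise timestamps (here: no time-of-day recorded)
# rather than silently treating midnight as the true occurrence time.
log["date_only"] = log["timestamp"].dt.normalize().eq(log["timestamp"])
print(log)
```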

Given the diversity of data quality problems, it is important to understand the key requirements. While Juran and Godfrey [10] provide the fundamental “fitness for use” principle, decades of data quality research have produced varied understandings of data quality requirements through its underlying dimensions [8, 14, 16, 20]. Over time, many of the definitions for different data quality dimensions have come to overlap, and the same dimensions have attracted conflicting interpretations, resulting in a level of disparity that does not support a shared understanding. Recent work offers an empirically validated consolidation of these dimensions covering both academic and practitioner perspectives [9], providing 33 dimensions clustered into eight categories: Completeness, Accuracy, Validity, Consistency, Currency, Availability and Accessibility, Reliability and Credibility, and Usability and Interpretability. These studies indicate that data quality requirements cover both objective (e.g., uniqueness and format consistency) and subjective (e.g., relevance and freshness) dimensions.

There have been efforts by process mining researchers to classify the data quality issues typically found in event logs [6, 12, 17, 18], with a view to addressing these issues and thus increasing the reliability of analysis results.

Bose et al. [6] identify four broad categories of issues affecting event log quality: missing data (data items are not recorded in the event log), incorrect data (data items are incorrectly recorded), imprecise data (recorded values are too coarse to be useful) and irrelevant data (data items contain irrelevant information). The authors also identify 27 classes of event log quality issues (e.g., problems related to timestamps, imprecise activity names, and missing events), organised by where they occur: cases, events, activity labels, timestamps, or resources. Their intention is to “encourage systematic logging approaches (to prevent event log issues), repair techniques (to alleviate event log issues) and analysis techniques (to deal with the manifestation of process characteristics in event logs)” [6]. These issues were illustrated through the analysis of five real-life event logs from different application domains.

Suriadi et al. [17] identify eleven event log imperfection patterns based on their experience with over 20 Australian industry data sets, confirming the severity of data quality issues in process data and their potential impact on process mining analyses. The eleven patterns are: form-based event capture, inadvertent time travel, unanchored event, scattered event, elusive case, scattered case, collateral event, polluted label, distorted label, synonymous labels and homonymous label. Each pattern is described using the following components: a description of the pattern, a real-life example, the effect of the pattern’s occurrence on process mining outcomes, the event and event log entities affected by the pattern, a strategy to detect the presence of the pattern, potential remedies and their side-effects, and indicative rules for detection.
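As an illustration of such indicative detection rules, the sketch below flags candidate distorted or synonymous labels by pairwise string similarity. It is a minimal approximation in Python, not the rule set from [17]: the labels, the use of difflib, and the 0.85 threshold are all assumptions of this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Distinct activity labels observed in a log; "Referr patient" is a
# hypothetical distortion of "Refer patient".
labels = ["Refer patient", "Referr patient", "Discharge", "Triage"]

# Flag label pairs whose similarity exceeds a threshold as candidate
# distorted/synonymous labels for an analyst to review. The 0.85
# threshold is an illustrative choice, not taken from [17].
for a, b in combinations(labels, 2):
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if ratio > 0.85:
        print(f"possible distorted labels: {a!r} ~ {b!r} ({ratio:.2f})")
```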

Lu and Fahland [12] propose a conceptual framework for better understanding event data quality for process mining analysis. The framework categorises event data quality into three aspects: the quality of events, the quality of the ordering of events, and the quality of event labels. These are then evaluated along two dimensions: individual trustworthiness, which concerns the intrinsic qualities of event data (e.g., the accuracy or correctness dimensions), and global conclusiveness, which indicates whether a significant pattern is being observed.

3 Measuring Data Quality of Event Logs

Data quality requirements continue to be dictated by the fitness for use principle [10], making them highly dependent on the use context. Further, a plethora of diverse requirements (i.e., dimensions) exists, each in turn deeply bound to its use context; this makes the requirements complicated to model, analyse, and re-use, and makes a common set of measures for detecting and quantifying data quality difficult to achieve.

Batini et al. [4] provide a comprehensive analysis of existing approaches to data quality assessment. We note that most, if not all, of these approaches are user-centric: requirements are solicited from users before the data is explored (see, e.g., [5, 11, 19]).

However, in the process mining context, access to the creators of the data that constitutes an event log cannot be relied on; this is especially the case for publicly available event logs. Furthermore, a process analyst typically cannot influence data capture practices, and hence an expectation that the source data will be cleaned may be misplaced. Thus it is imperative to measure the quality of an event log with respect to the particular type of analysis intended, such as process discovery, performance analysis or conformance checking. For instance, the missing values metric assesses the fraction of the log for which a particular attribute is populated, contributing to quantifying the Completeness dimension. In a log where the majority of events carry only “complete” (rather than both “start” and “complete”) timestamps, i.e., where start times have a high degree of missing values, the suitability of that log for performance analysis is low, while its suitability for process discovery may not be negatively affected. On the other hand, if recorded timestamps do not accurately reflect when an activity occurred, process discovery itself will be compromised.
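A minimal sketch of such a metric follows, in Python with pandas; the column names and the lifecycle values “start”/“complete” are conventions assumed here.

```python
import pandas as pd

# Fabricated log: one missing timestamp, and only "complete" lifecycle
# events. Column names and values are assumptions of this sketch.
log = pd.DataFrame({
    "case_id":    ["C1", "C1", "C2", "C2"],
    "activity":   ["Register", "Triage", "Register", "Triage"],
    "timestamp":  pd.to_datetime(["2021-03-01 09:00", None,
                                  "2021-03-01 09:05", "2021-03-01 09:30"]),
    "transition": ["complete", "complete", "complete", "complete"],
})

def missing_values(log: pd.DataFrame, attribute: str) -> float:
    """Fraction of events for which `attribute` is NOT populated."""
    return float(log[attribute].isna().mean())

# Contribution to the Completeness dimension, per key attribute.
for col in ["case_id", "activity", "timestamp"]:
    print(col, missing_values(log, col))

# Fitness for performance analysis: without "start" events, activity
# durations cannot be computed, although discovery may still work.
print("start events present:", bool((log["transition"] == "start").any()))
```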

In [18], the authors propose an extensible framework to measure event data quality based on twelve dimensions collated from prior literature, and use it to quantify the prevalence of data quality issues in event data. The dimensions are completeness, uniqueness, timeliness, validity, accuracy/correctness, consistency, believability/credibility, relevancy, security/confidentiality, complexity, coherence, and representation/format.
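[18] does not prescribe an implementation; purely as an illustration of what an extensible measurement framework can look like, the sketch below keeps a registry of metric functions keyed by dimension, so that adding a dimension amounts to registering one function. All metric definitions here are assumptions of the sketch, not taken from [18].

```python
import pandas as pd

# Hypothetical registry mirroring the extensibility idea in [18]: each
# dimension maps to a metric scoring a log column between 0 (worst) and
# 1 (best). The concrete metrics are illustrative only.
METRICS = {
    "completeness": lambda s: 1.0 - s.isna().mean(),
    "uniqueness":   lambda s: s.nunique(dropna=True) / max(len(s), 1),
}
# Extending the framework amounts to registering another function:
METRICS["validity"] = lambda s: s.dropna().astype(str).str.strip().ne("").mean()

def profile(log: pd.DataFrame) -> dict:
    """Score every column of the log against every registered dimension."""
    return {dim: {col: round(float(fn(log[col])), 2) for col in log.columns}
            for dim, fn in METRICS.items()}

log = pd.DataFrame({"case_id": ["C1", "C1", None],
                    "activity": ["Register", "Triage", "Triage"]})
print(profile(log))
```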

Another early advocate of detecting data quality issues in event logs is Anna Rozinat, co-founder of Fluxicon, the company behind the Disco process mining tool. Through a number of blog posts, since collated into a book on process mining in practice, various data quality issues in event logs, along with ways to detect and (potentially) repair them, were discussed. The quality issues covered in the book include formatting errors; missing data (events, attribute values, case IDs, activities, timestamps, attribute history, and timestamps for activity repetitions); as well as zero timestamps, wrong timestamps, the same timestamp for multiple activities, and differing timestamp granularities.
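Several of these timestamp issues can be detected with simple checks. The sketch below, over made-up data, flags epoch “zero” timestamps, multiple activities sharing one timestamp within a case, and a midnight-only heuristic for date-only granularity; the checks are illustrative, not taken from the book.

```python
import pandas as pd

# Illustrative events exhibiting three of the timestamp issues above.
log = pd.DataFrame({
    "case_id":   ["C1", "C1", "C1", "C2"],
    "activity":  ["Register", "Triage", "Treat", "Register"],
    "timestamp": pd.to_datetime([
        "1970-01-01 00:00:00",   # zero timestamp (unrecorded time)
        "2021-03-01 09:20:00",
        "2021-03-01 09:20:00",   # same timestamp as Triage: order lost
        "2021-03-02 00:00:00",   # midnight: likely date-only capture
    ]),
})
ts = log["timestamp"]

print("zero timestamps:", int((ts == pd.Timestamp("1970-01-01")).sum()))

same_ts = log[log.duplicated(subset=["case_id", "timestamp"], keep=False)]
print("events sharing a timestamp within a case:", len(same_ts))

# Heuristic for mixed granularity: exact-midnight timestamps often
# indicate that only a date was captured (a hint, not a certainty).
midnight = (ts.dt.hour == 0) & (ts.dt.minute == 0) & (ts.dt.second == 0)
print("likely date-only timestamps:", int(midnight.sum()))
```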

4 Data Quality Awareness in Process Mining

Keeping a detailed record of the origins of data, and of how the data is transformed along the way, increases its traceability and trustworthiness. Where such information is unavailable, the extent and effect of changes to the data will be opaque to the analyst, who may view the data as ‘ground truth’, i.e., as direct observations rather than already modified data. Such a view can result in inaccurate or misleading analysis results, or in inappropriate further transformations. For instance, where the analyst is unaware that a data set extracted from a hospital’s emergency department has been modified through time-shifting in order to de-identify patients (as in the case of the MIMIC critical care data set), using this data for performance analysis will lead to incorrect results.
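To see why unawareness of such de-identification matters, the sketch below applies a per-case random time-shift (in the spirit of, though not identical to, MIMIC’s per-patient approach) and shows that intra-case durations survive while cross-case temporal measures do not. The data and offsets are fabricated.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Fabricated emergency-department events for two cases.
log = pd.DataFrame({
    "case_id":   ["C1", "C1", "C2", "C2"],
    "activity":  ["Arrive", "Treat", "Arrive", "Treat"],
    "timestamp": pd.to_datetime(["2021-03-01 09:00", "2021-03-01 10:00",
                                 "2021-03-01 09:30", "2021-03-01 11:00"]),
})

# De-identification by per-case time-shifting: every case is moved by
# its own random offset (illustrative, not MIMIC's actual procedure).
offsets = {c: pd.Timedelta(days=int(rng.integers(-3650, 3650)))
           for c in log["case_id"].unique()}
shifted = log.assign(timestamp=log.apply(
    lambda e: e["timestamp"] + offsets[e["case_id"]], axis=1))

# Intra-case durations survive the shift unchanged...
for name, df in [("original", log), ("shifted", shifted)]:
    durations = df.groupby("case_id")["timestamp"].agg(lambda s: s.max() - s.min())
    print(name, list(durations))

# ...but cross-case measures (arrival rates, queue lengths, workload
# over time) are destroyed, so performance analysis that treats the
# shifted timestamps as real will produce incorrect results.
```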

There has been some work on detecting and repairing quality issues associated with event logs. For example, Dixit et al. [7] present a user-guided technique to detect event-ordering imperfection patterns related to timestamps in a log, and then to repair the identified issues based on user input. Timestamp-related quality issues such as mixed granularities, order anomalies and statistical anomalies are detected and repaired. Similarly, Lu et al. [13] present an interactive approach that assists users in exploring data quality patterns of interest using the context information contained in an event log. Five measures to quantify the pervasiveness of a pattern in an event log are also proposed: pattern support, pattern confidence, case support, case confidence and case coverage.
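The exact formulas for these five measures are given in [13]; the sketch below only illustrates one plausible reading of the measure names, applied to a hypothetical “trailing whitespace” label pattern. Both the pattern and the formulas here are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical pattern: an activity label with trailing whitespace.
# The measure definitions below are plausible readings of the names
# in [13], not the paper's exact formulas.
log = pd.DataFrame({
    "case_id":  ["C1", "C1", "C2", "C3"],
    "activity": ["Register ", "Triage", "Register", "Triage "],
})
matches = log["activity"].str.endswith(" ")

pattern_support    = int(matches.sum())        # events matching the pattern
pattern_confidence = float(matches.mean())     # fraction of events matching
cases_with_pattern = set(log.loc[matches, "case_id"])
case_support       = len(cases_with_pattern)   # cases containing the pattern
case_confidence    = case_support / log["case_id"].nunique()
# Case coverage: how much of each affected case the pattern touches, averaged.
case_coverage = float(log[log["case_id"].isin(cases_with_pattern)]
                      .groupby("case_id")["activity"]
                      .apply(lambda s: s.str.endswith(" ").mean()).mean())

print(pattern_support, pattern_confidence, case_support,
      case_confidence, case_coverage)
```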

To date there has been little research aimed at developing a comprehensive framework to address the issue of incorrect analysis results stemming from inadequate data quality of event logs. Lessons from prior work on quality awareness for databases (e.g., [21]) indicate that there are at least three essential components of such a framework, each presenting a number of research challenges: (1) data quality profiling that builds on a shared understanding of data quality dimensions and associated metrics, (2) user preference modelling that allows users’ analytic needs to be captured, and (3) visibility of quality profiles together with analysis (process mining) results, to improve understanding of the impact of inadequate data quality. We invite process mining researchers to tackle these challenges and move towards responsible process mining, with the aim of improving the credibility of, and stakeholders’ trust in, process mining results.