1 Introduction

Recently, the necessity of managing and analyzing a large number of processes, together with their growing complexity, has brought an increasing interest in methods and technologies that support the representation and comparison of process models. The comparison activity might need to focus, for example, on the discrepancies between the real behavior as captured by event logs and the reference process model, the analysis of process variants to understand the differences, and even cross-organizational process comparisons to describe the peculiar characteristics of each system and to identify the best practices for process improvement. Process mining [2], a research domain formed by combining data mining and process analysis techniques, has developed techniques to analyze processes by making use of event logs. Nevertheless, within the reference literature related to process mining and business process management, the visualization dimension of comparison is still in an exploratory stage and there is a demand for effective solutions that facilitate this activity for both process analysts and stakeholders.

At the same time, we note the availability of a broad and deep corpus of research in the information visualization field, containing techniques generally not applied to business process data, resulting in a lack of specific contributions exploring the aspect of visualization for process comparison [4, 24]. Research has shown the superior utility of visual representations as compared to table data [24], and we argue that the intended audience for this research, business analysts, reason about business systems from a control flow perspective, so processes should be represented as a graph of temporally ordered activities that matches their internal model of the business [28]. In addition, there are a number of perspectives on these processes (control, resource and data [3, 12]) that need to be understood clearly by the analyst in order to improve the process in line with the model.

One of the ways to better understand how to improve business processes of an organization is to compare the behavior and performance of processes within the organization against others who are carrying out similar kinds of operations. Process variants represent alternative ways of performing business activities to accomplish a goal. It is important to understand the reasons for these variations as well as the effects of such variations on process performance in order to make process improvement recommendations. Regrettably, this potential is not fully realized yet, as the majority of existing process mining techniques analyze a single log at a time and this step then needs to be repeated for all the process variants of interest [10, 14, 16]. As a result, the comparison between the behavioural and performance aspects of different process variants is carried out by manually (and potentially subjectively) interpreting the results.

This concept of process benchmarking or learning from the results of other similar processes in businesses is a well-accepted notion in business process management [7, 20], which will be applied in this paper. We motivate our multi-perspective approach in this paper by noting that since particular analysis tasks are aligned with these perspectives [1, 12], any visualization approach, sensitive to these requirements [27], should be able to visualize all these perspectives effectively. Thus, this paper proposes new comparative process visualization techniques which combine approaches from process management and information visualization fields to communicate the similarities and differences between the behavior and performance of business processes.

In Sect. 2, an analysis of related work is presented. In Sects. 3 and 4 we present a series of techniques designed for comparative process visualization to assess performance and behavior differences among various cohorts. The new techniques engage the comparative perspective through three different views: general model, superimposed model and side-by-side comparison, with the ultimate goal of extracting indications for process improvement. Section 5 continues with a description of the preliminary evaluation we have performed, including example visualizations created using real hospital process data and the feedback from presentation to hospital management stakeholders. Section 6 concludes the paper.

2 Related Work

Our exploration stems from two streams of BPM research, process visualization and process comparison. At the same time our research refers to more general concepts belonging mainly to information visualization, in order to find possible intersections and useful techniques applicable to multi-perspective process comparison.

As more and more organizations rely heavily on IT systems to support their business operations, a vast amount of detailed records of business operations (i.e., which activities are carried out by whom at what time for which customer and at which cost) becomes available for analysis. Sophisticated process mining techniques can be applied to this data in order to reveal the real behaviour and performance of these operations [2]. While visualization techniques have been widely recognized as crucial for supporting decision making and analysis tasks as well as the emergence of behavioral patterns [13, 24, 29, 31], within BPM we register just a few relevant contributions with an interest in visualization aspects, especially regarding personalized views [26], process change [15, 32] and dynamic visualization [19].

A number of papers recently explored aspects related to process comparison in different ways. Kleiner [14] analyzed the technique of delta analysis for comparing the actual process represented by a process model with some reference process considered as a prescriptive process model. Delta analysis provides a basis for process comparison by generating a similarity measure between the reference and the discovered process models by using an estimate of the equivalence of event logs. The analysis though is performed only from the data perspective and does not focus on the implementation of a graphical model to show the control flow perspective. The time dimension also emerges from process visualization literature as particularly significant for process data [4, 19]. Although processes are intrinsically characterized by the time dimension, process modeling has rarely visualized it. Currently, the only time structure that is represented in process graphs is the ordering of activities as a workflow sequence, without any indication of duration of activities or waiting time between them.

A number of contributions concerning the relationship between several process variants and a reference or general model have emerged in the literature. Küster [17] focuses on the consolidation of process models through the automatic detection and resolution of differences between process versions. Li, Reichert and Wombacher [20] concentrate their analysis on the minimization of the derived reference model from a set of process variants. Process similarity has also been studied by Dijkman [10], mostly in terms of metrics and search algorithms for business process model repositories, focusing on model structure and behavior similarity metrics. The technique can be usefully applied to the computation of a difference map, which, together with a side-by-side arrangement, represents the main approach to process comparison. While the side-by-side arrangement mainly relies on the user’s visual memory to operate the comparison by placing the models alongside each other, the difference map consists of computing a merged model summarizing the differences and similarities of the compared processes. Few contributions, though, consider the two approaches together [5, 15].

A number of papers also tackle the aspect of comparison of process variants with graphical representations, using mostly the color variable to represent the differences across both activities and links [9, 15, 16, 21]. Most of the visualization approaches perform the comparison only on two process models, using color-coding to present the difference analysis as a comparison to a reference model, referring to differences as “deletion”, “addition” and “changes”. As a consequence they always use one of the two processes as a reference to operate the comparison. Focusing on instance traffic, Kriglstein et al. [16] explore a number of visualization techniques to compare process models. A difference analysis is performed between two process models and the visualization specifies the discrepancies on activities and edges through a color-coding approach. A more appropriate approach [15] has been adopted by the same authors for the visualization of changes in business processes to highlight the intermediate steps that lead to an updated process. Andrews [5] instead presents a semantic graph visualizer to calculate and visualize the similarity of graph components. The approach applies a difference map visualization by associating a color to each graph, merging the two hues in a gradient for common nodes in a difference map. A different color-coding has been applied by Buijs [7] for a dual comparison visualization of process models and their executable logs. The alignment matrix visualization proved to be too complex and difficult for participants to interpret.

We also explored contributions outside BPM and information systems disciplines, such as graph drawing [6] and information visualization [13, 29]. The field of uncertainty visualization has also investigated the representation of similarity measures [8, 22]. The literature review has highlighted the lack of significant disciplinary connections between the fields of BPM and process mining and the information visualization disciplines, suggesting a need for a design approach to guide the development of novel visualization techniques to support process comparison activities.

3 Comparative Visualization Design Approach

Process mining is a well-established research discipline that exploits event data using a combination of process analysis and data mining techniques [2]. Using process mining techniques, one can automatically discover a process model (and related resource usage and performance metrics) from an event log [2]. However, in order to carry out a comparative analysis of processes, process mining techniques are first applied to a single log (optionally with a single process model) and this step is then repeated for all processes of interest. As a result, the comparison between behavioural and performance aspects of different processes is then carried out by manually interpreting the results. As most existing process mining techniques do not cater for comparisons in an automated and straightforward manner, there are also challenges in making use of existing visualizations from process mining frameworks and tools such as ProM or Disco. Hence, there is a real need for novel comparative visualizations that can highlight key differences in terms of process behaviour and performance.

4 Data Requirements

In this paper, we address the key requirements in process analysis to be able to visualize the differences in terms of process behaviour and performance of two or more processes, while making use of different process-related information including process models and historical records of process executions. We identify the three main inputs to the visualizations: logs, process models and visualization configurations.

The main input for the proposed visualization solutions is one or more event logs. The event log(s) are used to extract data regarding process behaviour and performance. In particular, the information regarding the set of completed activities, the frequency of those activity executions and the min/max/avg duration of those activities will be used as objective measures for the visualizations. An event log could be as minimal as having only one transition type (i.e., “complete” events). With richer logs such as those with start and complete timestamps, additional customer information or employee data, it is possible to have a more accurate picture regarding wait times, bottlenecks and resource utilizations.
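As an illustration of the measures listed above, the following minimal sketch derives activity frequency and min/max/avg duration from a log that carries start and complete timestamps. The log layout, the timestamp format, the "Triage" activity and the sample values are assumptions made for the example; they are not taken from the datasets used in this paper.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical log: one (case id, activity, start, complete) tuple per executed activity.
EVENTS = [
    ("c1", "Triage", "2015-01-01 08:00", "2015-01-01 08:10"),
    ("c1", "ED Admission", "2015-01-01 08:30", "2015-01-01 09:00"),
    ("c2", "Triage", "2015-01-01 09:00", "2015-01-01 09:20"),
]


def activity_stats(events, fmt="%Y-%m-%d %H:%M"):
    """Return {activity: {frequency, min, max, avg}} with durations in minutes."""
    durations = defaultdict(list)
    for _case, activity, start, complete in events:
        delta = datetime.strptime(complete, fmt) - datetime.strptime(start, fmt)
        durations[activity].append(delta.total_seconds() / 60)
    return {
        activity: {"frequency": len(d), "min": min(d), "max": max(d), "avg": mean(d)}
        for activity, d in durations.items()
    }


stats = activity_stats(EVENTS)
```

With a log that only records "complete" events, the duration columns cannot be derived this way and only the frequency would remain available.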

Furthermore, our proposed visualization solutions heavily rely on the existence of one or more process models to map performance differences upon or to compare and contrast different ways of executing processes. The process models are used to visualize the order in which activities are being carried out. It is, of course, possible to use the input event log(s) to discover these process models using existing process discovery techniques [2]. In theory, any process modelling language (e.g., BPMN, Petri Nets, EPC or Fuzzy model) can be supported.

The final input is the desired visualization configurations which enable the selection of data streams (logs) and related process models and mapping of data to generate relevant visualizations. Thus, this overview of techniques should be seen within the context of a complete interactive system for manipulation of process mining data for comparison purposes, providing the ability to obtain an overview, drill down and compare models as required [29]. In Sect. 5 we show an example where a visualization is configured and viewed for hospital data case studies. We now proceed to describe in detail the design of these visualizations from basic principles.

4.1 Visualization Techniques

The requirements analyzed in the previous section have motivated some design examples to tackle representational issues in process comparison. Design solutions were developed for cohort comparison in general, in one single organization or across multiple organizations. The comparison has been tackled from different perspectives in order to capture the different aspects of variability in the processes. In order to bridge some of the gaps identified in the literature we directed design efforts to the different perspectives of process mining, in particular the time, performance and resource perspectives. Concerning the comparative perspective, we consider the possibility of comparing more than two models. Although the comparison of multiple models has already been explored in [7, 25, 30], none of the analyzed contributions examines the design of an actual difference model that considers the characteristics of all compared models.

The proposed visualization techniques have been conceived to allow the exploration both globally as an overview and individually on the single processes, supporting the user moving across different abstraction levels [24]. All three views have been designed aiming at comparing processes both at the model level and event log data, in order to include information regarding the performance, time and resource perspectives. Each view is complementary to the others, focusing on different process mining perspectives and users’ points of view, in order to highlight varied aspects of comparison. The proposed examples visualize the comparison across the process models for three cohorts, which are identified by three different color hues (red, blue and green).

General Model. The aim of the general model view is to observe the differences between cohorts with a focus on the differences in the performance and resource perspectives (Fig. 1). The starting point of the visualization is one process model which represents the general process model for the different cohorts.

In order to illustrate our approach, we consider the three main attributes that represent the basic components associated to activity execution, that is activity name, median duration and frequency, in addition to the number of resources (see Table 1). The values for median duration and frequency for each cohort are normalized on each activity proportionally for each cohort, to obtain performance related data. Next, for each activity, resources are aggregated per organizational level across the different cohorts, in order to display the ratio between performance (frequency/duration) and number and type of resources involved in each activity (see activity A in Fig. 1). Resources have been classified into three organizational levels for explanatory reasons, following a typical hierarchy of managerial, professional and technical staff. The examples thus indicate the different resources levels performing the particular task, shown by circles with differing color fills (refer to the left part of Fig. 2).
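The per-activity normalization can be sketched as follows. The sketch assumes that "proportionally" means each cohort's value is divided by the sum over all cohorts for that activity, so that the shares can directly partition a stacked bar; the function name, the sample durations and the bar width are hypothetical.

```python
def cohort_shares(values):
    """Normalize one activity's per-cohort values so that the shares sum to 1."""
    total = sum(values.values())
    if total == 0:
        return {cohort: 0.0 for cohort in values}
    return {cohort: value / total for cohort, value in values.items()}


# Hypothetical median durations (minutes) of one activity for three cohorts.
median_duration = {"C1": 10.0, "C2": 30.0, "C3": 60.0}
shares = cohort_shares(median_duration)

BAR_WIDTH = 200  # assumed total width of the stacked bar, in pixels
segments = {cohort: share * BAR_WIDTH for cohort, share in shares.items()}
```

The same function can be applied to the frequency values of each activity.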

The example visualization in Fig. 1 applies the stacked bar pattern (described below) for highlighting the differences in performance of the cohorts in each activity. For each activity a stacked colored bar is partitioned according to the different execution time of each cohort. Color transparency is used to map activity frequency, assigning a higher alpha value to a lower frequency.

Table 1. Sample data attributes used in the visualizations
Fig. 1. General model example, with merged model and log data annotations.

Different visual patterns, by way of glyphs (see Fig. 2), have been explored for the representation of performance variations across different cohorts at the activity level. In each case, the different blocks of color represent different cohorts performing the activities. The stacked bar (Fig. 2a), applied also in the example in Fig. 1, constitutes an immediate way to map the differences across activities directly to the model, obtaining both an analytic and global view. By implementing multiple color dimensions, other information such as the absolute frequency of each activity can be mapped within the stacked bar, allowing for comparison across other processes. In order to maintain readability, color transparency has been rendered through a range of four different non-continuous levels.
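The four non-continuous transparency levels could be obtained by quantizing the frequency, as in the sketch below. It assumes that less frequent activities are drawn more transparently, that the levels are equally wide, and that the transparency values are 0.0, 0.25, 0.5 and 0.75; none of these specifics is fixed by the design described here.

```python
def transparency_for(frequency, max_frequency, levels=4):
    """Map a frequency to one of `levels` discrete transparency values.

    0.0 means fully opaque; lower frequencies receive higher transparency.
    """
    if max_frequency <= 0:
        raise ValueError("max_frequency must be positive")
    relative = min(max(frequency / max_frequency, 0.0), 1.0)
    bucket = min(int(relative * levels), levels - 1)  # 0 = least frequent bucket
    step = levels - 1 - bucket                        # 0 = most frequent bucket
    return step / levels


# Hypothetical activity frequencies, with 100 as the highest frequency in the model.
values = [transparency_for(f, 100) for f in (100, 50, 30, 10)]
```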

A similar alternative for the representation of this data type is a space-filling visualization of hierarchies, such as a treemap representation (Fig. 2b). Keeping the hue variable associated with cohort categorical values and transparency to map frequency data, the performance/temporal value is represented by the space (area), providing more uniformity in case of a high variability in values. A different solution applies overlapping circle sections for each cohort (Fig. 2c), by mapping the frequency to the radius and the median duration to the subtended angle of the arc section. This solution has been designed to stress the difference between cohorts and to represent the time dimension as a percentage of the maximum completion time. An overlapping principle has also been explored through triangle shapes (Fig. 2d) associated with each cohort. This allows a mapping of the performance values, i.e. frequency and median duration, to height and base width respectively. This type of pattern might be more appropriate for models that are not particularly complex, when the design goal is to perform a comparison at the activity level rather than at the control-flow level. At the same time it might reveal some issues in readability in case performance data is too similar, causing the superimposed triangles to overlap. For particularly complex models, a more suitable solution is to concentrate on the control-flow perspective and eliminate all possible sources of visual occlusion, thus delegating the comparative perspective to single activities with interaction elements that can be activated and deactivated whenever necessary, as displayed in Fig. 1. We are currently working on an online survey with the goal of deeply assessing the strengths and weaknesses both of the three views and of the different visual patterns.
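The geometric mappings of the circle-section and triangle glyphs can be sketched as follows. Linear scaling, the maximum sizes in pixels and the function names are assumptions made for illustration; only the pairing of frequency with radius or height, and of median duration with angle or base width, comes from the description above.

```python
def sector_glyph(frequency, median_duration, max_frequency, max_duration,
                 max_radius=40.0):
    """Circle section: radius encodes frequency, subtended angle encodes duration.

    The angle is the duration expressed as a share of the maximum duration,
    mapped onto a full circle of 360 degrees.
    """
    radius = max_radius * frequency / max_frequency
    angle_deg = 360.0 * median_duration / max_duration
    return radius, angle_deg


def triangle_glyph(frequency, median_duration, max_frequency, max_duration,
                   max_height=40.0, max_base=60.0):
    """Triangle: height encodes frequency, base width encodes median duration."""
    height = max_height * frequency / max_frequency
    base = max_base * median_duration / max_duration
    return height, base


# Hypothetical cohort: frequency 50 (maximum 100), median duration 15 (maximum 60).
sector = sector_glyph(50, 15, 100, 60)
triangle = triangle_glyph(50, 15, 100, 60)
```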

Fig. 2. Visualization of the resource perspective and associated glyphs for the general model shown in Fig. 1

Superimposed Model. The superimposed model view is devoted to the comparison of different cohorts following the perspective of one process model, that we identify here with the first cohort (C1). The main aspect for consideration is the correspondence of activities in the model, visualized through the alignment and superposition of an activity element as in [11].

Fig. 3. Visualization of superimposed cohorts: C2 and C3 over C1

The main aspects considered in the different cohorts are the process flow (i.e. activity ordering) and the similarity of activities. The similarity level of activities can be based on different values depending on the aspect to be observed, varying from unidimensional factors such as execution time and frequency to metrics modelling the general performance. The example presented in Table 1 and Fig. 3 considers similarity in terms of cohort performance, that is, the ratio between the frequency and the median duration of each activity in C1 compared to the average value of the same ratio for C2 and C3. The resulting values are grouped by level of similarity in three partitions: high, medium and low. The similarity scores of activities in C1, with respect to C2 and C3, are mapped by applying three levels of blur as in [8], according to the partitions, where the highest level of blur corresponds to the lowest similarity level for the activity across the cohorts. The superimposition of the models is based on the match of activity position within the process flow across the different cohorts. The presence of each activity is checked in the three models, as well as its direct predecessors and successors, to verify if the same activity is executed in different parts of the process, thus establishing the presence of a shift in the ordering of activity execution, forward or backwards. The matching activities are mapped as stacked rectangles on top of the reference process model (C1). The rectangle is then slightly shifted towards the left when the same activity is found in the model but in a different position, earlier in the flow, and towards the right when the same activity occurs at some point later in the flow.
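A minimal sketch of the two checks described above follows. The performance ratio (frequency over median duration) and the comparison of C1 against the average of C2 and C3 follow the text; the min/max form of the similarity score, the partition thresholds (0.75 and 0.5) and the simplification of each model to a linear list of activities are assumptions made for illustration.

```python
def performance(frequency, median_duration):
    """Performance ratio of one activity in one cohort."""
    return frequency / median_duration


def similarity(reference, others):
    """Score in [0, 1]: reference ratio versus the mean ratio of the other cohorts."""
    mean_other = sum(others) / len(others)
    low, high = sorted((reference, mean_other))
    return 1.0 if high == 0 else low / high


def blur_level(score, high=0.75, medium=0.5):
    """0 = high similarity (no blur), 1 = medium, 2 = low (strongest blur)."""
    if score >= high:
        return 0
    if score >= medium:
        return 1
    return 2


def shift(activity, reference_flow, other_flow):
    """Position of an activity in another model relative to the reference model."""
    if activity not in other_flow:
        return "absent"
    delta = other_flow.index(activity) - reference_flow.index(activity)
    if delta == 0:
        return "same"
    return "earlier" if delta < 0 else "later"


# Hypothetical values for one activity: (frequency, median duration) per cohort.
c1 = performance(100, 20)                          # 5.0
c2, c3 = performance(80, 20), performance(60, 30)  # 4.0 and 2.0
score = similarity(c1, [c2, c3])
level = blur_level(score)
moved = shift("ECG", ["Triage", "ECG", "Discharge"], ["Triage", "Discharge", "ECG"])
```

In a branching model the list index would have to be replaced by a comparison of direct predecessors and successors, as the text describes.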

Side-by-side Model. This type of comparison technique aims at exploring, more deeply, the time perspective of the processes at a broader level, by integrating the information on the waiting time between an activity and its successor: a very common event that causes the delay of completion times for the whole process.

Fig. 4. Side by side model comparison

The three models are analyzed separately, focusing specifically on the ordering of activities. The proposed diagram (Fig. 4) exploits the process model logical flow to describe temporal dependencies between activities through predecessor and successor nodes of a directed graph [23]. In order to capture the variability across the models we applied a visualization approach that highlights just the matching flows that correspond to the comparison scenario, leaving the irrelevant branches in the background [8]. This approach requires a further analysis of log data. Besides the main properties used for the general and superimposed model, the information related to the waiting time is extracted and stored in a separate source/target table, identifying the couples of consecutive activities. The waiting time between each couple of activities is represented by the length of the arcs, while activity duration is displayed by extending the activity box with a grey texture. This visualization method is also consistent with a configurable process model approach [18]. This type of comparison might present some issues in case of particularly complex models. Especially if the models present a large variability in the waiting time between activities, further calculations are required in terms of data normalization, in order to maintain the readability of the diagram.
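The source/target table of waiting times could be derived from the log along the following lines. The trace layout (minutes since case start), the sample values and the use of the median as the aggregate are assumptions for the sketch; the pairing of consecutive activities follows the description above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical traces: case id -> list of (activity, start minute, complete minute).
TRACES = {
    "c1": [("Triage", 0, 10), ("Medical Note", 25, 40), ("Discharge Letter", 100, 110)],
    "c2": [("Triage", 0, 5), ("Medical Note", 40, 50), ("Discharge Letter", 80, 95)],
}


def waiting_time_table(traces):
    """Return {(source, target): median waiting time} for consecutive activities."""
    waits = defaultdict(list)
    for events in traces.values():
        ordered = sorted(events, key=lambda event: event[1])  # order by start time
        for (src, _start, src_complete), (tgt, tgt_start, _complete) in zip(ordered, ordered[1:]):
            # Waiting time: gap between completing the source and starting the target.
            waits[(src, tgt)].append(tgt_start - src_complete)
    return {pair: median(values) for pair, values in waits.items()}


table = waiting_time_table(TRACES)
```

Each entry of the table would then set the length of the corresponding arc, after whatever normalization is needed to keep the diagram readable.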

5 Evaluation

The evaluation approach adopted for the proposed visualization framework is three-fold. Firstly, we made use of event logs and discovered process models from two hospitals (H1 and H2) and developed a set of visualizations by hand. This serves as a preliminary evaluation and feasibility analysis of the proposed design principles. Secondly, we showed the resulting visualizations to the stakeholders in order to (1) gauge the understandability and usefulness of the proposed visualizations and (2) solicit further user requirements.

Finally, we are in the process of developing a set of software plug-ins for the process mining framework, ProM, based on their input and are also preparing an anonymous online survey in order to obtain the opinions of BPM practitioners and academics from around the world. In this paper, we present the evaluation outcomes from the first two steps: visualizations created using real datasets and stakeholder feedback about the visualizations. Please note that due to the lack of resource information in the datasets, the visualizations do not include the resource perspective.

Hospital One. One of the comparative analysis questions from stakeholders at Hospital One (H1) is “Are there any differences in terms of process behaviour and durations for patients who present at ED at different times of the day?” In order to answer this question, patients are put into four cohorts depending on their arrival times at ED (i.e., midnight - 6am, 6am - 12noon, 12noon - 6pm and 6pm - midnight). A process model, together with dominant paths, is discovered from the event log containing data for all four cohorts. The names of the activities, their frequencies and median execution times of activities are calculated for each cohort.
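The cohort assignment by arrival time can be sketched as below. The assumption that each six-hour window includes its lower bound and excludes its upper bound is ours; the text does not state how boundary arrivals were handled.

```python
from datetime import datetime

# Cohort labels as named in the text, in order of the four six-hour windows.
COHORTS = ["midnight - 6am", "6am - 12noon", "12noon - 6pm", "6pm - midnight"]


def arrival_cohort(arrival):
    """Assign a patient to one of four cohorts based on the hour of ED arrival."""
    return COHORTS[arrival.hour // 6]


# Hypothetical arrival timestamps.
early = arrival_cohort(datetime(2015, 1, 1, 5, 59))
morning = arrival_cohort(datetime(2015, 1, 1, 6, 0))
late = arrival_cohort(datetime(2015, 1, 1, 23, 30))
```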

Figure 5 depicts the resulting visualization. From this figure, it is easy to see the performance comparison across two dimensions (frequency and duration) for four different cohorts. As the number of cases for each cohort varies across the different time periods (i.e. 147, 244, 320 and 173 cases), the relative frequencies are used in the visualization. The visualization made use of a number of metric classes: the absolute frequency for activities (the height of the triangles), the absolute frequency for paths (the strength of the edges and activity darkness), and the median duration from one activity to another (the width of the triangles).
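A relative frequency that compensates for the different cohort sizes can be computed as in the sketch below. The cohort sizes are those reported above, assumed to be listed in chronological order of the time windows; the activity counts are hypothetical, and dividing the activity count by the number of cases in the cohort is one plausible definition rather than the one necessarily used for Fig. 5.

```python
# Cohort sizes as reported in the text; the order of assignment is an assumption.
COHORT_CASES = {
    "midnight - 6am": 147,
    "6am - 12noon": 244,
    "12noon - 6pm": 320,
    "6pm - midnight": 173,
}

# Hypothetical number of executions of one activity in each cohort.
activity_counts = {
    "midnight - 6am": 147,
    "6am - 12noon": 122,
    "12noon - 6pm": 80,
    "6pm - midnight": 0,
}


def relative_frequencies(counts, cases):
    """Divide each cohort's activity count by the number of cases in that cohort."""
    return {cohort: counts[cohort] / cases[cohort] for cohort in cases}


relative = relative_frequencies(activity_counts, COHORT_CASES)
```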

Fig. 5. Visualizing the behaviour and performance differences between four patient cohorts in H1. The ED Admission activity is blown up on the bottom right.

One example of a pattern being easily seen is the difference in the ED Admission activity for the 6am to Noon cohort, compared to the others, shown by the wide triangle indicating a large difference in duration compared to other cohorts (see highlighted box bottom right in Fig. 5). As this was the first visualization created with the real data sets, further refinements to the original design were necessary. For instance, we needed to adjust the dimensions of the visualization elements in order to accommodate very high/low frequencies. We also realized that it might be necessary to set the maximum limit with respect to the number of cohorts being compared. This visualization was presented to stakeholders (including doctors from the emergency department at H1, as well as healthcare researchers from different QLD Hospitals) as a part of three presentations to demonstrate preliminary results from the process mining analysis being conducted at H1. These stakeholders found the visualization to be intuitive and they were very receptive to being presented with visual comparisons of the four cohorts across the two performance dimensions.

Hospital Two. Another comparative analysis question from stakeholders at Hospital Two (H2) is “What are the differences in terms of process behaviour and durations for patients who are discharged from ED within four hours of arrival and those who stayed longer than four hours?” In order to answer this question, the dataset is split into two cohorts, those who stayed in ED for less than or equal to four hours and those who stayed for more than four hours. All three types of visualizations were created using the data from H2. Process models were created for both cohorts as well. For this evaluation we concentrate on the superimposition and side by side visuals as the performance general model visualization is similar to the example for H1.

Fig. 6. Side-by-side comparison of two patient cohorts in H2, with a blown up selected example at the bottom.

Figure 6 depicts the visualization that reflects the side-by-side comparison of patients in the two cohorts. Here, the emphasis is on the time perspective, whereby cases in Cohort One (C1) have a throughput time of up to 4 hours and cases in Cohort Two (C2) have a throughput time of over 4 hours. The design also allows the comparison of dissimilar models by the selection of two similar segments of the H1 and H2 models for comparison. As seen in the example, the portion of the process between Medical Note_final and Discharge Letter is significantly longer in C1, due to the waiting time as well as the median duration of both activities involved.

Fig. 7. Superimposed model of two patient cohorts in H2, with a blown up selected example at the bottom.

Figure 7 depicts the superimposition of the process model for C1 onto the model for C2 with emphasis on whether the activities are being shifted forward or backward in relation to a model. The example shows that the activity related to ECG (ordered) is executed later in the model in C2 with respect to C1, while Medical Note final has the same position in both cohorts but with a lower level of similarity, as displayed by the blur.

These two visualizations were shown to the head of the emergency department from H2. This doctor found all three visualizations to be useful for different purposes. He noted that the performance models (e.g., Fig. 5) provide salient patterns that pop out easily. Figure 6, showing time-based visuals using alignment analysis, was seen as useful because it highlighted the differences in time easily and showed activities in relation to particular antecedents. Figure 7, which highlights the differences between the process behaviour of the two cohorts, was found to be sub-optimal for this dataset due to the high degree of similarity between the two cohorts, which resulted in minimal blurring. However, he recognized the potential use of this type of visualization in comparing different departments or different hospitals with a high level of variation in process behaviour.

Findings from these preliminary evaluations also highlight the need for an integrated system starting at a high level, filtering and drilling down to activity comparisons, with interactions assisting with insight in real time. We are currently working on a software plug-in to support these visualizations with interaction and filtering capabilities.

6 Conclusion

In this paper we have presented research on a collection of multi-perspective visualization techniques for process comparisons. These designs emerge from a need to better communicate process comparisons within the process mining domain. Our research has highlighted the lack of design approaches for comparative process visualizations, and the scarcity of efforts in visual patterns innovation for the representation of processes. In particular, we developed a design approach to tackle representational issues within process comparison activities and a series of display techniques for comparing multiple cohorts across four perspectives, namely: control-flow, time, performance and resources. The evaluation phase drew attention towards the positive response of stakeholders in respect to experimentation on visual patterns in process representation, as well as in the availability of different views to address different process perspectives. As a general objective, we intend to continue to broaden the research in process visualization and search for improvements in the visual patterns and interaction modes for process mining analysis activities. For future work, we plan to work on the implementation of the proposed visual solution within a dynamic environment, such as ProM. We also aim to expand the evaluation of the visualizations with a systematic survey to assess the effectiveness of the different representations.