
1 Introduction

Moving from predicting the outcome of a running process to optimizing it with respect to a goal implies making decisions about actions that will change its course. Prescriptive Process Monitoring (PresPM) [3, 9, 14, 19] is a young subfield of Process Mining (PM) studying such business process optimization methods. Optimization in the PresPM context concerns decisions an agent has to take to optimize the outcome of a running case given certain goals (metrics). It does not concern enhancing the underlying process itself, as practiced in PM. Our literature review of PresPM (see below) reveals that two methods, reinforcement learning (RL) [16] and causal inference (CI) [8, 12], emerge as pathways. However, a quantitative comparison of the two is currently missing. Most PresPM research works with offline historical data, which creates two limitations: online RL is not possible, and experimental results are difficult to evaluate accurately for lack of counterfactuals.

This research gap defines our contribution. In our experiments, we introduce online RL to business processes and benchmark it against CI. Our use of synthetic data, rather than historical event logs as in earlier PresPM research, is not only instrumental in permitting both online RL and CI, but also enables deeper insights, a correct evaluation of the experiments’ results, and the calculation of perfect policies as an absolute benchmark.

Solutions to timed process interventions, such as escalating a customer complaint process to higher management echelons, calling a customer to speed up an administrative process or to maximize turnover, or conducting an additional test to reduce a patient's length of stay in hospital, can be seen as a gateway to solving the more generic problem of recommending the next best activities in a process. In timed interventions, the agent has one chance to make an intervention sometime during the process, whereas in next best activity problems, the agent has to choose between all possible activities at every step in the process. The relative simplicity of timed process interventions in terms of combinatorial possibilities (state space) results in lower data and computational requirements. It permits clearer insights into the characteristics of the models used and should ease real-world adoption. Furthermore, a vast number of relevant applications for timed process interventions exist. For these reasons, our experiments focus on timed process interventions rather than next best activity recommendations.

This paper is structured as follows. Section 2 introduces concepts pertaining to PresPM and refers to related work around them. We then move on to the experimental Sect. 3 comparing RL to CI using two synthetic datasets. The insights gained from the literature review and experiments lead to a deeper discussion of the two PresPM methods in Sect. 4. We conclude this paper and suggest avenues for future work in Sect. 5.

2 Background and Related Work

In line with the vast majority of PresPM research, we limit our focus to the optimization of a single process in isolation (assuming process independence), even though in practice, many processes affect and even interact with each other.

2.1 Causal Inference

By default, CI [8, 12] works with offline, logged data. The field can be subdivided into two components. The first concerns the detection of causal relationships: “Which treatment(s) have an effect on the process’ outcome?”. A treatment is an action that can take one of several classes; in the CI literature, treatments are often, but not always, binary (two classes: take or do not take the action). The second CI component involves estimating the effect of treatments. We concentrate on the individual treatment effect (ITE) [13], which is the difference between predicted outcomes of (possible) treatment(s) and non-treatment for a given sample. For example, when our model predicts that calling customer x (treatment) will increase revenue by 200€, while revenue is expected to drop by 100€ if x is not called (non-treatment), then the ITE\(_x\) is 300€. Note that the ITE is an expectation, not a hard-coded causality. Usually, a threshold (e.g., 50€ in our example) is determined to arrive at a policy for selecting (non-)treatments. The main challenge is the absence of counterfactuals in the dataset. A counterfactual is the unobserved outcome of a case assuming another treatment than the one factually applied. In the absence of randomized controlled trials (RCTs), realized by a policy of random interventions, selection bias will occur as the data-gathering policy leads to different distributions of treatment and non-treatment samples in the datasets. Combating selection bias is an important aspect of CI (e.g., [13]). Real-world CI applications include marketing (e.g., churn reduction, discounting, ...), education, recommender systems, etc. Most of these applications, however, are cross-sectional rather than longitudinal: There are no timing issues, let alone sequential treatments as seen in processes.
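To make the ITE and threshold mechanics concrete, the following minimal Python sketch computes an ITE from a hypothetical outcome model and applies a threshold policy; the model interface, feature layout, and names are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of an ITE-based policy, assuming a hypothetical `outcome_model`
# that predicts the process outcome from case features plus a treatment flag.
import numpy as np

def ite(outcome_model, features):
    """Individual treatment effect: predicted outcome with minus without treatment."""
    y_treated = outcome_model(np.append(features, 1.0))    # treatment flag = 1
    y_untreated = outcome_model(np.append(features, 0.0))  # treatment flag = 0
    return y_treated - y_untreated

def treat(outcome_model, features, threshold=50.0):
    """Treat only when the expected gain exceeds the threshold (50 EUR in the example above)."""
    return ite(outcome_model, features) > threshold
```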

In line with the overview provided by [9], we found no papers in the sparse PresPM literature published before 2020 claiming to use CI for process outcome optimization. [6] and [19] did apply a form of indirect CI with that aim, albeit without carrying the CI label. The indirect approach consists of first predicting the most likely (or distribution of) process continuation(s) (suffix) for every possible treatment given a certain ongoing case (prefix). In the second step, another model predicts outcomes for all these suffixes, which will then be used to choose a treatment. In the direct CI approach, the process outcomes for all possible treatments for a given prefix are directly predicted. Direct CI implementations can be found in [2] (without timing considerations) and [3] and [14] (including timing). With the exception of [2], none of the PresPM papers addresses selection bias. [1], in contrast, use a sequence-to-sequence recurrent neural network that automatically builds a treatment-invariant representation of the prefixes to combat the selection bias in a medical treatment problem.
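The difference between the indirect and direct approaches can be summarized in hedged pseudocode; `suffix_model`, `outcome_model`, and `direct_model` are hypothetical predictive models with an assumed call interface.

```python
# Hedged pseudocode contrasting the indirect and direct CI approaches described above.

def indirect_ci(prefix, treatments, suffix_model, outcome_model):
    """Step 1: predict the likely continuation per treatment; step 2: predict its outcome.
    The errors of the two successive models compound."""
    scores = {}
    for t in treatments:
        suffix = suffix_model(prefix, t)
        scores[t] = outcome_model(prefix + suffix)
    return max(scores, key=scores.get)

def direct_ci(prefix, treatments, direct_model):
    """A single model predicts the outcome for every candidate treatment directly."""
    scores = {t: direct_model(prefix, t) for t in treatments}
    return max(scores, key=scores.get)
```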

The lack of counterfactuals in the test set hinders the accurate evaluation of CI methods’ results: For a given prefix, the action recommended by the CI model may be absent from the cases in the test set. Researchers cope with this problem by relying on a predictive model to estimate outcomes, a distance-minimizing algorithm to find the nearest case in the available data, or a generative model that produces augmented data [10].

2.2 Reinforcement Learning

RL [16] is an important class of ML algorithms that learn policies guiding an agent’s behavior or sequence of actions in an environment in order to maximize an expected cumulative reward. Early successes in computer games drew much attention to RL, which has since then expanded into industrial processes, robotics, autonomous driving, healthcare, engineering, finance, etc. RL comes in many flavors. We will discuss and use the widespread Q-learning variant. In processes, the most important reward is often the process outcome that becomes known at the conclusion (last event) of the case. Nevertheless, intermediate rewards can easily be included in RL should they occur. The cost of actions can be viewed as a negative reward. At its core, RL assumes an online environment that the agent can interact with. RL does not need an environment (\(\rightarrow \) process) model. Instead, real episodes (\(\rightarrow \) cases) are executed and their rewards (\(\rightarrow \) outcomes) are observed. For every encountered state (\(\rightarrow \) prefix), a state-action value (Q) is learned for every possible action. Q represents the reward of the transition to the next state (\(\rightarrow \) prefix), minus the cost of the action, plus the state-action value of that next state. For any given prefix, the state-action values can be interpreted similarly to the effects of the possible treatments learned by CI. The difference between the state-action value for a treatment and the one for the non-treatment corresponds to the ITE at that state (\(\rightarrow \) prefix). The state-action value of the last prefix of a (completed) process is its final outcome. Given the size of the state space (\(\rightarrow \) number of possible prefixes) in most processes, these state-action values cannot be stored in tabular form (Q-table). Instead, they are approximated by an artificial neural network (NN). This is called deep reinforcement learning. At every state (\(\rightarrow \) prefix), the policy will be to choose the action with the highest relative state-action value. Learning is achieved by playing out many processes and iteratively updating the NN that replaces the Q-table after each (batch of) observed rewards (\(\rightarrow \) outcomes). In order to explore all areas of the state space and to prevent prematurely settling into a sub-optimal policy, a certain degree of exploration is introduced: The agent will sometimes overrule the policy and choose another action, especially at the beginning of the learning process. RL has found many applications in process optimization in general, but few researchers [4, 5] have applied it to process outcome optimization in PresPM.
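As an illustration of the Q-learning update described above, the following sketch performs one optimization step of a deep Q-network on a batch of observed transitions; it is a generic sketch (PyTorch, assumed tensor shapes, rewards assumed to already net out action costs), not the paper's exact implementation.

```python
# Sketch of one deep Q-learning update: `q_net` maps an encoded prefix to one
# Q-value per possible action; `actions` is an int64 tensor, `done` a float tensor.
import torch
import torch.nn.functional as F

def q_update(q_net, optimizer, prefixes, actions, rewards, next_prefixes, done, gamma=1.0):
    """One optimization step on a batch of observed transitions."""
    q_pred = q_net(prefixes).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: transition reward plus the best Q-value of the next prefix;
        # for the last prefix of a case, the target is simply the observed outcome.
        q_next = q_net(next_prefixes).max(dim=1).values
        target = rewards + gamma * (1.0 - done) * q_next
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```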

In practice, the online, real-life form of data gathering is often too slow and too expensive. It can even be dangerous at the early stages of learning when the NN is insufficiently trained and significant exploration happens. An entire spectrum of alternative data-gathering methods at different proximities to reality exists. For instance, [11] work with synthetic data for industrial process control. [17] use simulation to train robots and investigate pathways to close the reality gap, the mismatch between reality and simulation. Offline RL, finally, allows working with existing datasets (supervised data). PM discovery techniques, for example, yield grid graphs of business processes as representations of the RL agent’s environment in [4] and [5].

2.3 Research Gap

There exists no quantitative comparative analysis of CI and RL for process outcome optimization problems. This is the main research gap we address in this paper. As explained in Subsect. 2.1, the use of historical data for the test sets hampers the evaluation of CI methods for lack of counterfactuals. A similar issue appears in the RL literature, which exhibits a prevalence of non-real-life work. Here, simulations or models based on reality are used to train and test online models without considering the performance on the original problems, thus ignoring the reality gap (exception: [15]). We also address this issue by making use of entirely artificial synthetic data in our experiments. This form of data allows us to accurately evaluate CI, to test online RL and eliminate the reality gap, and to share the same test set between both methods. Additionally, none of the aforementioned papers compared their results to perfect-policy results, which are needed to gain an intuition for the absolute performance of their methods. A perfect policy leads to better expected process outcomes (as defined by the chosen metric) than any other policy. The majority of the discussed papers treat rather complex problems for which computing such a perfect policy is intractable, hence the need for techniques such as CI and RL. We opt for relatively simple timed interventions, so that we can easily compute results for a perfect policy. Section 3 describes our contribution: an accurately evaluated comparison of CI, RL, and a perfect policy based on synthetic datasets.

Fig. 1. Synthetic generative processes: Petri net visualization

3 Experimental Comparison of CI and RL

In the following three subsections, we describe our data generation, experimental setup, and results.

3.1 Data Generation

We work with two synthetic processes generating both the offline dataset for CI and the online environment for online RL. We first describe these two processes and then motivate our choice.

Two Synthetic Processes. The process models (as Petri nets) and the key features of our two synthetic processes are shown in Fig. 1 and Table 1, respectively. Process_1 is a sequence of three activities, each either “A” or “B”, with an associated integer attribute representing an arbitrary event attribute (e.g., age, amount, ...). At one of the three events, a (free) intervention can be made. The outcome of the process is the sum of the attributes, where the attribute of the event at which the intervention took place is multiplied by 2 if activity “A” occurred at least once in the process, and by \(-2\) otherwise. Process_2 consists of five events and includes both an AND and an XOR construct. Every Process_2 case carries an integer case attribute known from the start. Event attributes are integers as well, and an intervention can be made once per case at any event. When an intervention is made (at a cost of 5), the attribute of the first “D1” or “D2” event to occur thereafter (if any) will be multiplied by 2 in case the process passes through its “B” branch, and by \(-4\) otherwise. The final outcome is the sum of the attributes times the case attribute.
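For concreteness, the sketch below simulates Process_1 as just described; the activity and attribute distributions are assumptions made for illustration (the paper's exact settings are listed in Table 1).

```python
# Illustrative generator for Process_1; distributions are assumed for this sketch.
import random

def simulate_process_1(intervene_at=None):
    """Simulate one case; `intervene_at` is the index (0-2) of the intervened event, or None."""
    activities = [random.choice(["A", "B"]) for _ in range(3)]   # assumed 50/50 choice
    attributes = [random.randint(1, 10) for _ in range(3)]       # assumed integer range
    multiplier = 2 if "A" in activities else -2                  # intervention effect
    outcome = sum(
        attr * multiplier if i == intervene_at else attr
        for i, attr in enumerate(attributes)
    )
    return activities, attributes, outcome
```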

Table 1. Synthetic generative processes: key features

Motivation. These processes and interventions were designed to be simple enough for clear insights, yet representative of real-world processes by incorporating their main challenges. Since PresPM concerns actionable decisions, we can reduce sub-processes that do not contain any decision points (and are not a branch of a parallel structure whose other branch contains such a decision point) to one event, thus significantly shrinking the process model. Moreover, when all control-flow variants starting from a given decision point merge at one location/activity later in the process model without containing any further intermediate decision points, they can be reduced to one event as well. The value of this event’s attribute will vary according to which decision was made and which control-flow variant was followed earlier. In our experiments, the interventions only change event attributes, but in reality they may alter the control-flow as well; that would not change the CI and RL algorithms.

Both processes have a strong stochastic component to reflect the uncertainty accompanying real-life processes. The three activities and their attributes in Process_1 are sampled from probability distributions, whereas the activities in Process_2 are governed by the given structure, with the attributes and case variables sampled from probability distributions as well. A real-life decision-maker is not only confronted with stochasticity; the information available to make decisions may also differ between cases. Our synthetic processes incorporate this aspect too: as long as no “A” has appeared in Process_1, or Process_2 has not yet passed through its “B” or “C” branch, it cannot be known for sure whether intervening will be beneficial or detrimental. As in many real-world processes, the outcomes of both processes only become known at their conclusion. Including intermediate rewards or penalties, however, would not significantly alter the CI or RL algorithms.

Our experiments investigate binary actions (interventions). This simplification allows for clearer insights without loss of generality. As direct CI is generally not suited for sequences of actions, we further simplified by opting for one-off actions (timed interventions) to permit a CI-RL comparison; the RL method, however, extends to sequential or continuous actions without modification. In combination with the use of synthetic data, the resulting small state space also makes calculating perfect policies practical.

3.2 Experimental Setup

Since the indirect CI approach inevitably compounds the errors of two successive prediction models, we opted for the simplicity of working with one model in the direct approach. We use one NN to predict process outcomes for CI; the intervention (Boolean) is part of its inputs, and the batch size is 1,024. RL is achieved with a standard Q-learning architecture using a first-in-first-out memory of 1,024 transitions for stabilization. A penalty of 100 is applied for intervening more than once. An NN predicts Q for both possible actions (“intervention” and “non-intervention”) at every encountered state (prefix). For a balanced comparison, the same NN architecture is used for both CI and RL. Our NNs have two long short-term memory (LSTM) layers and two dense layers, as displayed in Fig. 2.
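The following sketch outlines one episode of the online interaction loop under this setup. The environment wrapper `env` and its reset()/step() interface are assumptions; the FIFO memory of 1,024 transitions, the binary action space, and the penalty of 100 for a second intervention follow the description above.

```python
# Sketch of one online Q-learning episode with epsilon-greedy exploration.
from collections import deque
import random
import torch

memory = deque(maxlen=1024)  # first-in-first-out transition memory

def run_episode(env, q_net, epsilon=0.1):
    prefix, done, intervened = env.reset(), False, False
    while not done:
        if random.random() < epsilon:
            action = random.choice([0, 1])                      # explore
        else:
            with torch.no_grad():
                action = int(q_net(prefix.unsqueeze(0)).argmax())  # exploit highest Q-value
        next_prefix, reward, done = env.step(action)
        if action == 1 and intervened:
            reward -= 100                                       # penalty for a second intervention
        intervened = intervened or action == 1
        memory.append((prefix, action, reward, next_prefix, done))
        prefix = next_prefix
```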

Fig. 2. The NN architectures (regression model) are almost identical for CI and RL. The first layer includes an additional “intervention” feature for CI. The last layer outputs a scalar (outcome) for CI and a 2-dimensional vector (Q for “intervention” and “non-intervention”) for RL.
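A minimal Keras sketch of this shared architecture could look as follows; the layer sizes and feature counts are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the shared architecture of Fig. 2: two LSTM layers followed by two dense layers.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(seq_len, n_event_features, n_outputs):
    """n_outputs = 1 for the CI regression model, 2 for the RL Q-network."""
    inputs = keras.Input(shape=(seq_len, n_event_features))
    x = layers.LSTM(64, return_sequences=True)(inputs)   # assumed layer size
    x = layers.LSTM(64)(x)
    x = layers.Dense(32, activation="relu")(x)           # assumed layer size
    outputs = layers.Dense(n_outputs)(x)
    return keras.Model(inputs, outputs)

# Process_1: 3 events; per-event features = one-hot activity (2) + attribute (1),
# plus the "intervention" flag for the CI model.
ci_model = build_model(seq_len=3, n_event_features=4, n_outputs=1)
rl_model = build_model(seq_len=3, n_event_features=3, n_outputs=2)
```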

Table 2. Experimental settings.

For the CI learning phase, an RCT dataset of 10,000 samples is generated. This largely exceeds both processes’ state space sizes and should, therefore, offset CI’s offline handicap. For RL, data are generated on the fly. The test set consists of 1,000 samples for which all counterfactuals are computed. The data generated by the synthetic processes are preprocessed as follows: the activity labels are one-hot encoded; the outcomes, attributes, and case variables (Process_2) are standardized. For CI, the intervention decision (1 or 0) is concatenated to the other event features. For every case sample, we build a sequence (sequence length = total process length) for every prefix, using padding to complete the sequence for ongoing process instances. We thus arrive at a two-dimensional data structure per prefix that is fed into the models’ input layer, as sketched below. For Process_2, the case variable enters the models separately after the LSTM layers.
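The sketch below illustrates this prefix encoding for Process_1; the exact feature layout (one-hot activity, standardized attribute, optional intervention flag) is an assumption for illustration.

```python
# Sketch of the padded prefix encoding described above for Process_1.
import numpy as np

def encode_prefixes(activities, attributes, interventions=None, seq_len=3):
    """Return one zero-padded (seq_len, n_features) array per prefix of a single case."""
    with_flag = interventions is not None
    n_features = 4 if with_flag else 3
    prefixes = []
    for k in range(1, seq_len + 1):
        seq = np.zeros((seq_len, n_features))
        for i in range(k):
            seq[i, 0] = float(activities[i] == "A")   # one-hot "A"
            seq[i, 1] = float(activities[i] == "B")   # one-hot "B"
            seq[i, 2] = attributes[i]                 # (standardized) attribute
            if with_flag:
                seq[i, 3] = interventions[i]          # CI: intervention decision (0/1)
        prefixes.append(seq)                           # events beyond the prefix stay padded
    return np.stack(prefixes)
```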

Every experiment is carried out five times, and learning is stopped by an early-stopping algorithm for both methods. A CI-based policy requires a threshold, which we set to the value that maximizes the ITE score on a 20% validation set. Uplift [7] is the metric used to evaluate the results: the difference between the process outcomes obtained by implementing the policy and by not intervening at all, cumulated over the complete test set. The experimental settings are summarized in Table 2.
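A minimal sketch of the uplift metric and the threshold search is given below; it assumes candidate thresholds are scored with the CI model's outcome predictions on the validation prefixes, and all names are illustrative.

```python
# Sketch of the uplift metric and a simple threshold search on a validation set.
import numpy as np

def uplift(policy_outcomes, no_intervention_outcomes):
    """Cumulative difference between following the policy and never intervening."""
    return float(np.sum(np.asarray(policy_outcomes) - np.asarray(no_intervention_outcomes)))

def best_threshold(pred_if_treated, pred_if_not, candidates):
    """Pick the candidate threshold whose policy yields the highest predicted uplift."""
    ites = np.asarray(pred_if_treated) - np.asarray(pred_if_not)
    def score(threshold):
        chosen = np.where(ites > threshold, pred_if_treated, pred_if_not)
        return uplift(chosen, pred_if_not)
    return max(candidates, key=score)
```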

3.3 Results

We summarize our experimental results in Table 3. RL clearly outperforms CI for both processes: the mean scores are significantly higher. The standard deviations of the RL scores are much lower, making RL by far the more robust method. Most or all of this outperformance can be attributed to RL’s innately superior ability to find the optimal policy (see Sect. 4). The fact that online RL permits exploring all parts of the state space plays virtually no role here, as the CI training sets in our experiments cover the complete state space as well. Were this not the case, the observed CI-RL divergence would certainly widen.

Table 3. Experimental results comparing CI, online RL, a perfect and random policy for both processes. Online RL reaches the highest uplift but requires much more computational effort than CI.

The precise knowledge of the (stochastic) synthetic generative processes enables computing perfect policies. Table 3 shows that RL comes to within 3% of the perfect-policy results for both processes (some stochasticity is normal). The CI policy also constitutes a substantial improvement over the RCT data-gathering policy that originally created the dataset, albeit to a lesser extent than RL. Having set both RL’s memory and CI’s batch size to 1,024, one optimization step of the NN involves the same number of samples for both methods. Since every RL transition (except for the first 1,023) was followed by an NN optimization step, we can directly compare the number of RL transitions to the number of CI epochs. Table 3 shows that RL’s computational requirements are one to two orders of magnitude higher than CI’s.

4 Discussion

In this section, we discuss the suitability of CI and RL for PresPM and show why RL outperformed CI in our experiments. We also address the issues of RL’s online requirement, reward specification, and inefficiency.

Causal Inference. Learning counterfactuals and treatment effects is at the core of CI. The sequential aspect of processes, however, poses a problem: The decision to not treat at a certain time in a running process does not preclude treatments later on in the process. For any given prefix in our experiments, direct CI relied on a predictive model to estimate the process outcomes for both intervention and non-intervention. This is problematic in the latter case: The predictive model cannot discern the optimal path from that prefix, and will instead consider the outcomes for all encountered treatments under the data-gathering policy that produced the relevant samples in the training set, as illustrated in the simplified example in Fig. 3a.

Direct CI, therefore, only operates safely on problems with one pre-determined decision point and will become increasingly suboptimal when moving to processes with one flexible decision point or with several decision points. Thresholds are sub-optimal compromises and products of optimization algorithms themselves. Dependencies between processes, e.g., when resources (space, manpower) are limited or processes interact with each other, cannot be incorporated into the CI framework. Because of these deficits, optimal policies are theoretically out of CI’s reach, as confirmed in our experiments. Nevertheless, CI policy results are still better than those that the data-gathering policy yields.

As with all other predictive models used for prescriptive or decision-making purposes, feedback loops risk deteriorating results: Implementing the CI policy will progressively shift the real-life data distribution away from the original training data, degrading the models’ predictive accuracy. Frequent updates of the CI models would help, but would at the same time introduce new bias in the data (a new data-gathering policy). However, with a sufficient degree of randomness in the decisions taken (as in RCTs, and similar to exploration in RL), this iterative approach (in the limit, online CI) would neutralize the feedback loops.

Fig. 3. Simple process to compare CI to RL. Both agree on the policy at the second event. At the first event, CI correctly estimates the outcome for intervention (\(Y_{I}\)), whereas the prediction model for the outcome for non-intervention (\(Y_{NI}\)) will observe two different outcomes (5 and 0) and summarize (here: average) them to 2.5. This value depends on the loss function and the samples’ distribution (percentages in the graph) in the training set, which itself depends on the data-gathering policy. In contrast, RL selects the maximum of the two Q-values at the second event.

Reinforcement Learning. RL has many theoretical advantages over CI. It does not require a process outcome prediction model and can rely on observed outcomes. RL is entirely generic: Theoretically, the same RL algorithm can deal with anything from timed interventions to next best activity prediction, which represents the ultimate complexity. RL models are very flexible: Constraints, rewards, and penalties can be added at liberty to avoid detrimental or unacceptable actions, pursue secondary goals, etc. With online RL, agents can freely interact with their environment, and dependencies between processes can be taken into account if the processes are treated concurrently by one model. Exploration in online RL theoretically visits the complete state space (all possible prefixes). Given sufficient exploration, online RL policies will automatically adapt to a changing environment (concept drift). Convergence theorems even show that online Q-learning converges given enough time. Both online and offline RL, however, are known to be inefficient, requiring many transitions to converge to the optimal policy, as demonstrated by our experiments.

The max operator over the Q-values (see Fig. 3b) explains RL’s outperformance of direct CI with equal data access. For every prefix, the learned Q-values represent the expected outcomes for intervention and non-intervention respectively, assuming a perfect policy thereafter, whereas the ITEs in CI represent the difference between the expected outcomes, each of which depends on the sample distribution induced by the data-gathering policy and on the loss function. Note, however, that with an online CI approach (with real-time updating after every finished process observed) and allowing exploration, this data-gathering policy would converge to the optimal policy as well, practically obliterating the differences between CI and RL.
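A tiny numerical illustration of this point, using the Fig. 3 example and assuming the two non-intervention continuations occur in equal proportions (matching the 2.5 in the caption):

```python
# Outcomes observed after "non-intervention" in the Fig. 3 example.
observed = [5.0, 0.0]
ci_estimate = sum(observed) / len(observed)  # 2.5: an average shaped by the data-gathering policy
rl_estimate = max(observed)                  # 5.0: Q-learning assumes optimal play afterwards
print(ci_estimate, rl_estimate)
```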

Real-World Implementation. Despite its power and versatility, RL suffers from some important drawbacks, yet many of these are not unique to RL but apply to CI and PresPM in general. The first such drawback is the risk of committing errors during real-time implementation. This implementation risk, however, can be reduced to that of the data-gathering policy (the de facto policy in place upon which the CI dataset is based) by inserting constraints into the RL algorithm, which can easily accommodate them. Rules mined earlier with a process discovery algorithm can frame the agent’s actions. Even human intuition can be incorporated by allowing the human agent to overrule the RL algorithm’s proposed action. In other words, implementing RL should be no riskier than either the original, existing policy or implementing CI; the latter two policies occasionally make or propose costly mistakes too. If necessary, a two-stage offline-online approach can further reduce the risk: Offline RL based on simulations or predictive models can serve as an initialization for an online RL that then continues to learn while acting in the real world, thereby closing the reality gap.

A similar argument can be made for the related challenge of reward specification. The desired outcome of a process to be optimized will not always be one-dimensional: The primary goal may be to reduce throughput time, but without compromising employees’ well-being and product quality. Moreover, such goals may shift over time or may need adjustment in the face of concept drift. Again, this challenge is not unique to RL and exists regardless of the solution method, if any. When possible, these goals will be consolidated into one metric for use by both CI and RL. If not, RL can be extended with constraints on undesired actions and/or rewards/penalties that promote secondary goals. As before, the human agent can also overrule the RL algorithm’s suggestions.

RL is inefficient: It is data-hungry and slow to converge. Our experiments were based on relatively short and simple processes. Longer and more complicated processes (greater action width/depth) will have an exponentially larger state space, suggesting that RL will no longer be a viable option where CI could still be. Yet, in deep RL, the Q-table is replaced by an NN, which to some extent removes the need to visit the complete state space, as unseen state-action (prefix-action) pairs can be interpolated. Working examples are video games with very large state spaces and autonomous driving with a near-infinite one. The more similar regions the state space contains, the better this works. Additionally, limiting the number of actions to the most relevant ones with causal discovery techniques (the first CI component in [12]) may be a worthwhile investment before starting with RL (and CI as well).

5 Conclusions and Future Work

We conducted experiments on timed process interventions with synthetic data that render genuine online RL and a comparison to CI possible and allow for an accurate evaluation of the results. We showed how the theoretical problems burdening CI can be overcome by online RL, contingent upon the strong assumption of real-time implementation of the learned policies in the real world. In our experiments, online RL produced better and more robust policies than CI. In fact, RL nearly reached the theoretically optimal solution, which we could verify thanks to the use of synthetic data. The RL methods we used for timed interventions can also be applied, without any modification, to problems of intermediate complexity or, in the limit, to next best activity prediction. When computational effort and/or the real-time implementation requirement preclude online RL, CI may be a viable alternative in scenarios where the dataset covers a large and evenly distributed share of the state space and action depth is limited.

With this work, we contributed to the nascent field of PresPM. We chose a simplified setting to gain some important insights. Reaching PresPM maturity will depend on exploring other, perhaps more sophisticated approaches, in ever more realistic settings. Further extensions of this work are, therefore, plentiful. First, an initial investigation of the merits of loss attenuation [20], uncertainty [20], and future individual intervention effects [14] revealed promising insights but should be corroborated. Future work could also shed light on the conditions under which RL remains efficient enough on realistic problems with sequences of multiple possible actions (greater action width and depth). Further complications could include the introduction of outcome noise, uncertain inputs, and concept drift. Since the rewards of processes often only happen (or become known) at their conclusion, Monte-Carlo methods (as in [4]) could be a faster alternative to the classical Q-learning we used. RL does adapt to concept drift, but only very slowly. As a consequence, RL is not suited to deal with disruptions (e.g., caused by a pandemic). Digital twins for processes or organizations have been proposed as a solution [18] and are an avenue for future research. Instead of including the complete state space in the data for CI, as we did, it could be investigated to what extent CI would fall further behind online RL when the dataset only covers part of the state space (and contains selection bias caused by the data-gathering policy). For applications where online RL is not an option, more research on offline RL is recommended. Lifting the assumption of process independence would move the problem setting even closer to reality and would pose additional challenges: Process independence is a necessary assumption in CI [8]. The combinatorial explosion caused by interdependent processes is challenging for RL as well. In the domain of CI, adaptations to the standard algorithms could lead to more capabilities in terms of sequences of actions (possibly with a discounting mechanism as used in RL). Indirect CI’s theoretical ability to handle sequences of actions could be weighed against the accuracy loss due to the compounding of two predictive models. Combating selection bias in processes (as in [1] for an environment without exogenous actors) beckons more research as well. We did not elaborate on how the decision points and the set of possible actions available to the agents at those points are determined.