1 Introduction

An increasing number of companies are using Process Aware Information Systems (PAIS) to support their business. All these systems record execution traces, called event logs, with information regarding the executed activities. In typical scenarios, these traces do not only report which activities have been performed, but also additional attributes. The extraction of useful information out of large amounts of data is the main challenge of the data mining field. Recently, however, a new topic branched off data mining: process mining [15, 31]. In this case the analyzed data are assumed to refer specifically to executions of business processes, and this assumption holds throughout all the analysis phases. Under the umbrella of the term “process mining”, three main activities are frequently distinguished: the discovery of models, such as control-flow models, describing the process actually being executed (namely, process discovery); the assessment of the conformance of a given event log with respect to a given process model (namely, conformance checking); and the extension of existing process models with additional information (namely, process enhancement). Apart from these classical activities, it is also possible to exploit the event log in order to create predictive models, i.e., models that are useful to predict future characteristics of incomplete process instances. In general, it is useful to distinguish two types of process mining techniques, according to when the analysis takes place: (i) a-posteriori, and (ii) at runtime. A-posteriori, or “off-line”, techniques use a finite portion of historical data to extract knowledge from it. Runtime, or “on-line”, approaches, on the other hand, provide information while the business process is running. Defining approaches of the latter category is more difficult: the process generating the data, or the frequency of event emission, may be subject to modifications or drifts (which may also be seasonal). To tackle these on-line problems, it is therefore necessary to use tools able to adapt to new or unobserved scenarios. One of the most challenging tasks in process mining is prediction. Obviously, predictions are useful only if their object has not been observed yet. For this reason, prediction per se is, inherently, a task performed on incomplete traces, at runtime. Moreover, predicting the evolution of a running case is also difficult due to the peculiarities of each process instance and because of the “human factor”, which might introduce considerable flexibility.

In this work, we focus on predicting the remaining time of running cases. We ground our approach not only on the control flow, but also on the additional data we can observe. Moreover, the system we present is also capable of dealing with unexpected scenarios and evolving conditions. With respect to our seminal work [25], here we propose a slightly modified version of that method and two novel approaches able to overcome its limitations. In particular, we define two different scenarios: in the first one we assume the process has a well defined static workflow, the same assumption made in [25]; in the second one we remove such an assumption and the process is considered dynamic, e.g., subject to seasonal drift. We will show that the proposed approaches improve state-of-the-art performance in the first scenario, and that one of them is also able to deal with dynamic processes. We also leverage one of these models to predict the future sequence of activities of a running case. We assess the remaining time prediction accuracy of our methods against state-of-the-art ones using three real-life case studies: a process of an Italian software company, the management of road-traffic fines by a local police office of an Italian municipality, and the BPI 2012 challenge. The remainder of this paper is structured as follows: Sect. 2 reviews recent works concerning prediction tasks in the framework of process mining. Section 3 gives some essential background on process mining and machine learning concepts used throughout the paper, while Sect. 4 describes the prediction approaches. Section 5 discusses the implementation and the experimental results, and finally Sect. 6 briefly summarizes the presented content and concludes the paper.

2 Related work

The first framework focused on the time perspective was proposed by van der Aalst et al. [33]. They describe a prediction approach which extracts a transition system from the event log and decorates it with time information extracted from historical cases. The transition system consists of a finite state machine built with respect to a given abstraction of the events inside the event log. The time prediction is made using the time information (e.g., the mean duration of a specific class of cases) stored in the state of the transition system corresponding to the current running instance. A closely related approach is presented in [4], where the same idea of a decorated (a.k.a. annotated) transition system is used. In this approach the annotations are decision trees used to predict the completion time and also the next activity of the process instance. The work presented in [17] considers the data perspective in order to identify SLA (Service Level Agreement) violations. In this work, the authors try to estimate the amount of unknown data in order to improve the quality of the final prediction. The actual forecast is built using a multilayer perceptron trained with the backpropagation algorithm. Two works by Folino et al. [9, 10] present an extended version of the technique described in [33]. In particular, they cluster the log traces according to the corresponding “context features” and then, for each cluster, they create a predictive model using the same method as in [33]. The clustering phase is performed using predictive clustering trees. In order to produce a forecast, the approach assigns the new running instance to a cluster and then uses the model belonging to that specific cluster. One of the weaknesses of these methods based on [33] is that they assume a static process, where the event log used for the training phase contains all the possible process behaviours. Unfortunately, in real-life cases this assumption is usually not valid. Lakshmanan et al. [16] report an approach which uses Instance-specific Probabilistic Process Models (PPM) and is able to predict the likelihood of future activities. Even though the method does not provide a time prediction, it gives business managers useful information regarding the progress of the process. This work also shows that the PPM built is actually Markovian. A probabilistic method, based on Hidden Markov Models (HMM), is also proposed in [24]. Results show how the proposed method outperforms the state-of-the-art; however, the achieved results do not seem to differ much from the ones achieved by [33]. Ghattas et al., in a recent work [12], exploit Generic Process Models and decision trees, based on the process context, to provide decision criteria defined according to the actual process goals. In [18], de Leoni et al. propose a general framework able to find correlations between business process characteristics. They manipulate the event log in order to enrich it with derived information, and then generate a decision tree in order to discover correlations. In particular, one of the correlation problems suggested there is the forecast of the remaining time of running cases. Being based on decision trees, numeric values need to be discretized, and this lowers the accuracy of the method. For this reason, they do not provide any prediction example. In [25], Polato et al. present an approach based on [33] in which the additional attributes of the events are taken into account in order to refine the prediction quality.
This method exploits the idea of annotating a transition system with machine learning models, such as Naïve Bayes classifiers and Support Vector Regressors. The experimental results show how the additional attributes can positively influence the prediction quality.

In this paper we propose two new approaches based on Support Vector Regression and we discuss their strengths and weaknesses compared with the approaches presented in [23] and [33]. In particular, we emphasize in which scenarios one approach is better than the others and why. A similar approach is presented in [4], in which the authors create a process model similar to a transition system where infrequent behaviours of the process are discarded. Predictions are made on top of this model using decision trees. This method is very similar to the one presented in [25]; the main difference is the process model, which is a kind of reduced transition system. More recently, approaches based on deep neural networks have been proposed. In [8], the authors present a recurrent neural network (RNN) for the prediction of the next event; however, they do not predict the remaining time. Their network is composed of two hidden RNN layers using basic LSTM (Long Short-Term Memory) cells. GPUs have been used to train the model because of the size of the deep network. Another recent approach based on LSTM is the one by Tax et al. [23]. In that work the authors propose an LSTM neural network architecture for predicting the next activity and its execution time from a partial trace. Such a network can be adopted for remaining time prediction by iteratively predicting the next activity and its duration until a special end-of-case activity is predicted. Neither of the cited deep NN based methods takes additional attributes into account. A known problem of deep NN based approaches is their training time and their sensitivity to the choice of hyperparameters [6].

Approaches coming from different areas can also be used to achieve similar results. For example, queueing theory [2, 14] and queue mining can be seen as a very specific type of process mining, and recent works are starting to aim for similar prediction purposes [27]. In this particular case, the authors decided to focus on the delay prediction problem (i.e., providing information to users waiting in line, in order to improve customer satisfaction). The method presented there is based on the construction of an annotated transition system. Such a model is then used to make delay predictions using simple averages or non-linear regression algorithms. They also show other methods based on queue mining techniques. Even though this approach concerns predictions on the time perspective, the slant is totally different and the goal is narrower than the aim of the approach we propose in this work. Another example of a prediction-focused paper has recently been published by Rogge-Solti and Weske [26]. In this case the focus is on the prediction of the remaining time in order to avoid missing deadlines. However, only information about the workflow is used to build the model and to make predictions.

Remaining time prediction is a particular instantiation of the so-called predictive monitoring task, in which the goal is to provide early advice so that users can steer ongoing process executions towards the achievement of business constraints. Maggi et al. [20] proposed a decision tree based prediction model. The presented framework continuously estimates how likely it is that a user-defined constraint will be fulfilled by the ongoing process instances. The same task is faced in [19]. Here the authors present an alternative approach where traces are treated as complex symbolic sequences. They propose two possible encodings: (i) an index-based one and (ii) a combination of the first one and an HMM based one. From an Area Under the ROC Curve perspective, the proposed encoding outperforms the simple baselines (i.e., boolean and frequency-based). Using the just mentioned encoding, Verenich et al. [35] propose a two-phase approach: in the first phase the dataset is divided into groups by means of unsupervised clustering; in the second phase a predictor is trained for each cluster. In particular, they use random forests as the prediction algorithm. A very similar approach is presented in [11]. Finally, Teinemaa et al. [29] show how textual information, when sufficiently large, can be useful to improve accuracy and earliness. In this work too, the best performing predictors are random forests.

3 Background

In this section we define some useful process mining concepts which we will use throughout the paper. First of all, we introduce the concept of a sequence and then the basic definitions of event, trace and event log.

Given a set A, a finite sequence over A of length n is a mapping \(s \in \mathbb {S} : ([1, n] \subset \mathbb {N}) \rightarrow A\), and it is represented by a string, i.e., \(s = \langle s_1, s_2, \dots , s_n\rangle \), \(|s|=n\). Over a sequence s we define the following functions:

  • selection operator \((\cdot )\): \(s(i) = s_i,\, \forall \, 1 \le i \le n\);

  • \( hd ^k(s) = \langle s_1, s_2, \dots , s_{\min (k, n)} \rangle \);

  • \( tl ^k(s) = \langle s_w, s_{w+1}, \dots , s_n \rangle \) where \(w = \max (n - k + 1, 1)\);

  • \(\textit{set}(s) = \{ s_i \mid s_i \in s \}\), e.g., \(\textit{set}(\langle a, b, b, c, d, d, d, e \rangle ) = \{a, b, c, d, e\}\).

Definition 1

(Event) An event is a tuple \(e = (a, c, t, d_1, \dots , d_m)\), where \(a \in \mathcal {A}\) is the process activity associated with the event, \(c \in \mathcal {C}\) is the case id, \(t \in \mathbb {N}\) is the event timestamp (seconds since 1/1/1970) and \(d_1, \dots , d_m\) is a list of additional attribute values, where \(\forall \, 1 \le i \le m, \, d_i \in \mathcal {D}_i\), with \(\mathcal {D}_i\) the domain of the i-th additional attribute. We call \(\mathcal {E} = \mathcal {A} \times \mathcal {C} \times \mathcal {T} \times \mathcal {D}_1 \times \dots \times \mathcal {D}_m\) the event universe.

Over an event e we define the following projection functions: \(\pi _\mathcal {A}(e) = a\), \(\pi _\mathcal {C}(e) = c\), \(\pi _\mathcal {T}(e) = t\) and \(\pi _{\mathcal {D}_i}(e) = d_i, \forall \, 1 \le i \le m\). If e does not contain the attribute value \(d_i\) for some \(i \in [1, m] \subset \mathbb {N}\), then \(\pi _{\mathcal {D}_i}(e) = \perp \). In the remainder of this paper we will denote the number of additional attributes m as \(|\mathcal {D}|\).

Definition 2

(Trace, Partial Trace) A trace is a finite sequence of events \(\sigma _c = \langle e_1, e_2, \dots , e_{|\sigma _c|} \rangle \in \mathcal {E}^*\) such that \(\forall \, 1 \le i \le |\sigma _c|, \pi _\mathcal {C}(e_i) = c \wedge \forall \, 1 \le j < |\sigma _c|\), \(\pi _\mathcal {T}(\sigma _c(j)) \le \pi _\mathcal {T}(\sigma _c(j + 1))\). We define a partial trace of length k as \(\sigma ^k_c = hd^k(\sigma _c)\), for some \(k \in [1, |\sigma _c|] \subset \mathbb {N}\). We call \(\varSigma \) the set of all possible (partial) traces.

While a trace corresponds to a complete process instance, i.e., an instance which is both started and completed, a partial trace represents a process instance which is still in execution and hence has not completed yet. Over a trace \(\sigma _c = \langle e_1, e_2, \dots , e_{|\sigma _c|} \rangle \) we define the following projection functions: \(\Pi _\mathcal {A}(\sigma _c) = \langle \pi _\mathcal {A}(e_1), \pi _\mathcal {A}(e_2), \dots , \pi _\mathcal {A}(e_{|\sigma _c|}) \rangle \), \(\Pi _\mathcal {T}(\sigma _c) = \langle \pi _\mathcal {T}(e_1), \pi _\mathcal {T}(e_2), \dots , \pi _\mathcal {T}(e_{|\sigma _c|}) \rangle \) and \(\Pi _{\mathcal {D}_i}(\sigma _c) = \langle \pi _{\mathcal {D}_i}(e_1), \pi _{\mathcal {D}_i}(e_2), \dots , \pi _{\mathcal {D}_i}(e_{|\sigma _c|}) \rangle \) for all \(1 \le i \le |\mathcal {D}|\).
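To make the notation concrete, the following minimal Python sketch (illustrative only; the event structure, helper names and toy trace are our own choices, not part of the formalization) represents events as tuples, traces as time-ordered lists and implements the projection and prefix functions defined above.

```python
from collections import namedtuple

# An event carries the activity, the case id, the timestamp and a dict of additional attributes.
Event = namedtuple("Event", ["activity", "case_id", "timestamp", "attrs"])

def pi_A(e): return e.activity                 # π_A(e)
def pi_C(e): return e.case_id                  # π_C(e)
def pi_T(e): return e.timestamp                # π_T(e)
def pi_D(e, name): return e.attrs.get(name)    # π_{D_i}(e); None plays the role of ⊥

def hd(sigma, k):                              # partial trace σ^k = hd^k(σ)
    return sigma[:k]

def proj_A(sigma):                             # Π_A(σ): sequence of activity names
    return [pi_A(e) for e in sigma]

# Toy trace for case "c1", timestamps in seconds
sigma_c1 = [Event("A", "c1", 0,   {"amount": 30}),
            Event("B", "c1", 60,  {}),
            Event("C", "c1", 200, {"amount": 35})]

assert proj_A(hd(sigma_c1, 2)) == ["A", "B"]
assert pi_D(sigma_c1[1], "amount") is None     # missing attribute value -> ⊥
```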

Definition 3

(Event log [33]) An event log L is a set of traces, \(L = \{\sigma _c \mid c \in \mathcal {C}\}\) such that each event appears at most once in the entire log, i.e., \(\forall \sigma _1, \sigma _2 \in L, \sigma _1 \ne \sigma _2 : \textit{set}(\sigma _1) \cap \textit{set}(\sigma _2) = \emptyset \).

A transition system is one of the simplest process modelling notations: it consists of states and transitions, where each transition connects two states (not necessarily different). Formally, it is defined as follows:

Definition 4

(Transition System (TS)) A transition system is a triplet \(\textit{TS} = (S, A, T)\), where S is the set of states, \(A \subseteq \mathcal {A}\) is the set of activities and \(T \subseteq S \times A \times S\) is the set of transitions. \(S^{\textit{start}} \subseteq S\) is the set of initial states, and \(S^{\textit{end}} \subseteq S\) is the set of final (accepting) states.

Given a state \(s \in S\), it is possible to define the set of states reachable from s in one step as \(s\bullet = \{s' \in S \mid \exists \, a \in A\, \text {s.t.}\, (s, a, s') \in T \}\). According to [33], to construct a transition system which maps each partial trace in the log to a state, we need the so-called event and state representation functions.

Definition 5

(Event representation function) Let \(\mathcal {R}_e\) be the set of possible event representations. An event representation function \(f^{\textit{event}} \in \mathcal {E} \rightarrow \mathcal {R}_e\) is a function that, given an event e, produces some representation of it (e.g., the projection functions over \(e \in \mathcal {E}\): \(\pi _\mathcal {A}(e), \pi _\mathcal {T}(e)\)).

Definition 6

(State representation function) Let \(\mathcal {R}_s\) be the set of possible state representations. A state representation function \(f^{\textit{state}} \in \varSigma \rightarrow \mathcal {R}_s\) is a function that, given a (partial) trace \(\sigma \), returns some representation of it (e.g., a sequence, a set or a multiset over some event properties).

Even though these representation functions are conceptually not directly correlated, in practice \(f^\textit{state}\) needs an \(f^\textit{event}\) in order to create a meaningful state representation. Choosing the right functions \(f^{\textit{state}}\) and \(f^{\textit{event}}\), also referred to as abstractions, is not a trivial task. The main issue with the representation choice is the overfitting vs. underfitting problem, as discussed in [32, 33], where the authors also propose possible good choices for \(f^{\textit{state}}\) and \(f^{\textit{event}}\). A common event abstraction is \(f^{\textit{event}}(e) = \pi _\mathcal {A}(e)\), which maps an event onto the name of its activity, while common state abstractions are: the set abstraction, i.e., \(f^{\textit{state}}(\sigma _c) = \{\pi _\mathcal {A}(e) \mid e \in \sigma _c\}\), the multiset abstraction, i.e., \(f^{\textit{state}}(\sigma _c) = \{(a, m) \mid a = \pi _\mathcal {A}(e) \wedge m = |\Pi _\mathcal {A}(\sigma _c) \uparrow \{a\}| \}\), and the list abstraction, i.e., \(f^{\textit{state}}(\sigma _c) = \langle \pi _\mathcal {A}(\sigma _c(1)), \dots , \pi _\mathcal {A}(\sigma _c(|\sigma _c|)) \rangle \).
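As a small illustration, the following sketch shows the three state abstractions applied to the activity projection of a partial trace; returning hashable values is a choice of this sketch so that states can later be used as dictionary keys.

```python
from collections import Counter

def f_state_set(activities):        # set abstraction
    return frozenset(activities)

def f_state_multiset(activities):   # multiset (bag) abstraction, as pairs (activity, multiplicity)
    return frozenset(Counter(activities).items())

def f_state_list(activities):       # list (sequence) abstraction
    return tuple(activities)

prefix = ["A", "B", "B"]             # Π_A of a partial trace
print(f_state_set(prefix))           # frozenset({'A', 'B'})
print(f_state_multiset(prefix))      # frozenset({('A', 1), ('B', 2)})
print(f_state_list(prefix))          # ('A', 'B', 'B')
```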

Fig. 1

Example of a transition system extracted from a log containing three trace types \(\langle A, B, C, F\rangle \), \(\langle A, B, D, F\rangle \) and \(\langle A, B, E, F\rangle \), with \(f^\textit{event}(e) = \pi _\mathcal {A}(e)\) and \(f^\textit{state}(\sigma ) = \{f^\textit{event}(\sigma (|\sigma |))\}\). The state \(s_0\) is the initial state, while \(s_6\) is the accepting (i.e., final) state. The notation \({s_1}_{\{A\}}\) means that the state \(s_1\) has a state representation equal to \(\{A\}\). Each transition is labeled with the corresponding event representation value

Using these two functions \(f^\textit{event}\) and \(f^\textit{state}\), it is possible to define a (labeled) transition system where states correspond to prefixes of traces in the log mapped to some representations using \(f^\textit{state}\), and transitions correspond to representation of events through \(f^\textit{event}\).

Definition 7

(Labeled Transition System (LTS)) Given a state representation function \(f^{\textit{state}}\), an event representation function \(f^{\textit{event}}\) and an event log L, we define a Transition System as \(\textit{LTS} = (S, E, T)\), where:

  • \(S = \{f^{\textit{state}}( hd ^k(\sigma )) \mid \sigma \in L \wedge 0 \le k \le |\sigma |\}\) is the state space;

  • \(E = \{f^{\textit{event}}(\sigma (k)) \mid \sigma \in L \wedge 1 \le k \le |\sigma |\}\) is the set of event labels;

  • \(T (\subseteq S \times E \times S) = \{(f^{\textit{state}}(hd^k(\sigma )), \, f^{\textit{event}}(\sigma (k+1)), \, f^{\textit{state}}(hd^{k+1}(\sigma ))) \, |\) \(\sigma \in L \wedge 0 \le k < |\sigma | \}\) is the transition relation.

\(S^{\textit{start}} = \{f^{\textit{state}}(\langle \rangle )\}\) is the set with the unique initial state, and \(S^{\textit{end}} = \{f^{\textit{state}}(\sigma ) \mid \sigma \in L\}\) is the set of final (accepting) states.

We say that a trace is compliant with the transition system if it corresponds to a walk from \(s_i \in S^{\textit{start}}\) to \(s_e \in S^{\textit{end}}\). We also call a trace \(\sigma \) non-fitting with respect to a transition system if \(f^\textit{state}(\sigma ) \notin S\).

Figure 1 depicts an example of a transition system extracted from the log fragment reported in Table 1.
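The following sketch (a simplified illustration, not the ProM implementation) builds the labeled transition system of Definition 7 from the three trace variants of Fig. 1, using the activity name as event representation and the “set of the last activity” abstraction of the figure as state representation.

```python
def f_event(activity):
    return activity

def f_state(prefix):                 # {f_event(σ(|σ|))}: set containing the last activity, as in Fig. 1
    return frozenset(prefix[-1:])

def build_lts(log):
    S, E, T = set(), set(), set()
    for trace in log:
        for k in range(len(trace) + 1):
            S.add(f_state(trace[:k]))
            if k < len(trace):
                E.add(f_event(trace[k]))
                T.add((f_state(trace[:k]), f_event(trace[k]), f_state(trace[:k + 1])))
    s_start = {f_state([])}                      # state of the empty prefix
    s_end = {f_state(trace) for trace in log}    # states reached by complete traces
    return S, E, T, s_start, s_end

log = [["A", "B", "C", "F"], ["A", "B", "D", "F"], ["A", "B", "E", "F"]]
S, E, T, s_start, s_end = build_lts(log)
print(len(S), len(E), len(T))   # 7 states (including the initial one), 6 labels, 8 transitions
```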

4 Remaining time prediction

In this section we present a spectrum of different approaches able to predict the remaining time of running business process instances. In particular, we will emphasize the pros and cons of each approach and the situations in which each one should be preferred over the others.

The problem of predicting the remaining time of running process instances can be summarized as follows: given an event log containing historical traces of the execution of a business process, we want to predict, for an ongoing process instance, how much time remains until its completion. All the approaches described in this work are based on the idea of making predictions with a model constructed (i.e., learned) from the information collected so far (i.e., the event log). The obtained model takes as input the partial trace, which represents the running process instance, and returns the remaining time forecast. The remaining time of a case is the time that will be spent from now until the end of the execution of the last activity of the process. This amount of time, given a complete trace \(\sigma _c\), is easily computable for each event \(e_i \in \sigma _c\). We define the function \(\textit{rem} : \varSigma \times \mathbb {N} \rightarrow \mathbb {N}\) as follows: \(\textit{rem}(\sigma _c, i) = \pi _\mathcal {T}(\sigma _c(|\sigma _c|)) - \pi _\mathcal {T}(\sigma _c(i))\), where i is an event index. If \(\sigma _c = \langle \rangle \) or \(i \notin \{1, 2, \dots , |\sigma _c|\}\), then \(\textit{rem}(\sigma _c, i) = 0\). This function calculates the time difference between the last event in the trace and the i-th event. In the remainder of this section we present our new time prediction approaches based on the application of machine learning models. Moreover, we discuss how to exploit one of the proposed models to also predict the future sequence of activities.
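The function rem translates directly into code; a minimal sketch over the timestamp sequence of a complete trace (the unit is whatever granularity the timestamps use) is the following.

```python
def rem(timestamps, i):
    """Remaining time after the i-th event (1-based) of a complete trace."""
    if not timestamps or i < 1 or i > len(timestamps):
        return 0
    return timestamps[-1] - timestamps[i - 1]

assert rem([0, 60, 200], 2) == 140   # time left after the second event
assert rem([0, 60, 200], 3) == 0     # nothing remains after the last event
assert rem([], 1) == 0               # empty trace
```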

Table 1 Example of an event log fragment with events sorted using their Timestamp and grouped by Case Id

4.1 Approach 1: simple regression

This prediction task has all the characteristics of a regression problem: (partial) traces in the event log are the input vectors and the remaining time at each event is the target value. In this section we present a direct application of the \(\epsilon \)-SVR algorithm. Despite the simplicity of this method, some aspects need particular attention. First of all, we describe how to move from the event log to a representation suitable for the \(\epsilon \)-SVR algorithm.

In our setting, the input consists of traces and, in particular, of the attributes of the corresponding events. Since SVR takes as input vectors \(\mathbf {x} \in \mathbb {R}^l\), for some \(l \in \mathbb {N}^+\), we have to convert sequences of events into some representation in \(\mathbb {R}^l\). Let us consider a (partial) trace \(\sigma _c = \langle e_1, e_2, \dots , e_n\rangle \) of length \(n \in \mathbb {N}^+\). For each event \(e_i = (a^i, c, t^i, d_1^i, \dots , d_m^i) \in \sigma _c\), the attributes \(d_1^i, \dots , d_m^i\) may have different values, because they can change as the process instance evolves. As additional attribute values we consider the last values chronologically observed. Formally, we define the function \(\textit{last} : \varSigma \times \mathbb {N} \rightarrow \mathcal {D} \cup \{\perp \}\) as:

\( \textit{last}(\sigma _c, i) = \pi _{\mathcal {D}_i}(e_j), \quad \text {with } j = \max \{ h \mid 1\le h \le |\sigma _c| \wedge \pi _{\mathcal {D}_i}(e_h) \ne \perp \}, \) where i is the index of the attribute. If there is no index j such that \(\pi _{\mathcal {D}_i}(\sigma _c(j)) \ne \perp \), then \( last (\sigma _c, i) = \perp \). What we need to do next is to transform the domains \(\mathcal {D}_i\) into a numerical representation that can be given as input to the SVR. To do that, we use the one-hot encoding for all nominal attributes and for the activity. This encoding converts a nominal data value \(d_i\) into a binary vector \(\mathbf {v} \in \{0, 1\}^{|\mathcal {D}_i|}\), with \(|\mathcal {D}_i|\) components and with all values set to 0 except for the component referring to \(d_i\), which is set to 1. For example, given \(\mathcal {D}_j = \{a, b, c, d\}\) and \(\pi _{\mathcal {D}_j}(e) = b\), the one-hot encoding would return the vector \(\mathbf {v} = [0, 1, 0, 0] \in \{0, 1\}^4\), where \(\mathbf {v}_1\) corresponds to the value a, \(\mathbf {v}_2\) to the value b, and so on. Instead, all the other attribute values \(d_j\) such that \(\mathcal {D}_j \subseteq \mathbb {R}\) are simply put in a single-component vector, e.g., let \(\mathcal {D}_i \equiv \mathbb {N} \subset \mathbb {R}\) and \(\pi _{\mathcal {D}_i}(e) = 17\), then the output vector is \(\mathbf {u} = [17] \in \mathbb {R}\). In the remainder of this paper we will call \(\mathbbm {1} : A \rightarrow \mathbb {R}^n\), for some \(n \in \mathbb {N}^+\), the function which maps an attribute value to its one-hot encoded vector.
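A minimal sketch of the \(\mathbbm {1}\) mapping, under the assumption that each nominal domain is kept as an ordered list (the ordering itself is an implementation choice), is the following.

```python
def one_hot(value, domain):
    """domain: ordered list of the possible values of a nominal attribute."""
    v = [0] * len(domain)
    if value is not None:            # None stands for ⊥: all components stay at 0
        v[domain.index(value)] = 1
    return v

D_j = ["a", "b", "c", "d"]
assert one_hot("b", D_j) == [0, 1, 0, 0]
assert one_hot(None, D_j) == [0, 0, 0, 0]
```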

After the transformation just described, all vectors are concatenated together (keeping the same fixed order). We use the symbol \(\mathbf {v}\,||\,\mathbf {w}\) to denote the concatenation of the vectors \(\mathbf {v}\) and \(\mathbf {w}\), e.g., recalling \(\mathbf {v} \in \mathbb {R}^4\) and \(\mathbf {u} \in \mathbb {R}\) from the previous examples, their concatenation is \(\mathbf {z} = \mathbf {v} \, || \, \mathbf {u} = [0, 1, 0, 0]||[17] = [0, 1, 0, 0, 17] \in \mathbb {R}^5\). Note that if \(\pi _{\mathcal {D}_i}(e) = \perp \) and \(\mathcal {D}_i\) is nominal, then \(\perp \) is projected to a vector (\(\in \{0, 1\}^{|\mathcal {D}_i|}\)) with all components set to zero. Otherwise, if \(\mathcal {D}_i \subseteq \mathbb {R}\), the value \(\perp \) is simply interpreted as a zero. Eventually, the concatenation of all vectors constitutes an input vector \(\mathbf {x} \in \mathbb {R}^l\) for the \(\epsilon \)-SVR. We summarize all of these steps with the function \(\gamma ^* : \varSigma \rightarrow \mathbb {R}^l\), i.e., \(\forall i, \gamma ^*(\sigma ^i_c) = \mathbf {x}_i\). Finally, the corresponding target value is calculated using the function \(\textit{rem}\), e.g., \(y = \textit{rem}(\sigma _c, i)\) for some \(1 \le i \le |\sigma _c|\). Hence, starting from a trace \(\sigma _c = \langle e_1, e_2, \dots , e_n\rangle \), the corresponding set of n training examples \((\mathbf {x}, y)\) will be \((\gamma ^*(\sigma ^i_c), \textit{rem}(\sigma _c, i)), \forall i \in \{1, \dots , n\}\).
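Putting the pieces together, the following hedged sketch builds \(\gamma ^*(\sigma ^i_c)\) and the corresponding training pairs for a single trace; the event representation, the attribute domains and all helper names are illustrative and do not reflect the actual implementation.

```python
def one_hot(value, domain):
    v = [0] * len(domain)
    if value is not None:
        v[domain.index(value)] = 1
    return v

def last_value(events, attr):                     # last(σ, i): most recent non-⊥ value of an attribute
    for e in reversed(events):
        if e["attrs"].get(attr) is not None:
            return e["attrs"][attr]
    return None

def gamma_star(prefix, activities, nominal_domains, numeric_attrs):
    x = one_hot(prefix[-1]["activity"], activities)               # current activity, one-hot encoded
    for attr, domain in nominal_domains.items():
        x += one_hot(last_value(prefix, attr), domain)            # nominal attributes, one-hot encoded
    for attr in numeric_attrs:
        v = last_value(prefix, attr)
        x.append(float(v) if v is not None else 0.0)              # numeric attributes; ⊥ becomes 0
    return x

def training_pairs(trace, **enc):
    ts = [e["timestamp"] for e in trace]
    return [(gamma_star(trace[:i], **enc), ts[-1] - ts[i - 1])    # (γ*(σ^i), rem(σ, i))
            for i in range(1, len(trace) + 1)]

trace = [{"activity": "A", "timestamp": 0,  "attrs": {"channel": "web", "amount": 30}},
         {"activity": "B", "timestamp": 60, "attrs": {"amount": None}}]
enc = dict(activities=["A", "B", "C"], nominal_domains={"channel": ["web", "desk"]},
           numeric_attrs=["amount"])
print(training_pairs(trace, **enc))
```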

Training A training set for an \(\epsilon \)-SVR algorithm is defined as \(\textit{Tr} = \{(\mathbf {x}_i, y_i)\}_{i=1}^l\), where \(\mathbf {x}_i \in \mathbb {R}^n\), for some \(n \in \mathbb {N}^+\), and \(y_i \in \mathbb {R}\). In order to map the data contained in an event log L, we exploit the transformations described in the previous section. In particular, the training set is created by Algorithm 1.

Algorithm 1: construction of the training set \(\textit{Tr}\) from the event log L
Algorithm 2: remaining time prediction for a partial trace using the trained \(\epsilon \)-SVR model

The value returned by the function \(\textit{rem}\) depends on the time granularity (e.g., hours, minutes, seconds). It is important to keep the same granularity for all the instances. Once the training set \(\textit{Tr}\) is constructed, the training phase learns the parameters of the \(\epsilon \)-SVR.

Prediction After the training phase, the \(\epsilon \)-SVR model is created and can be directly used to predict the remaining time of partial traces. First of all, the new trace is converted into a vector \(\mathbf {x}\) suitable for the \(\epsilon \)-SVR, applying the same procedure illustrated in Algorithm 1, in particular lines 4 to 7. In line 4 the information about the activity is encoded into the vector \(\mathbf {x}\), and the for loop appends to it the encoding of the additional attributes. Then this vector \(\mathbf {x}\) is given as input to \(f^*\), i.e., the \(\epsilon \)-SVR model, which produces the time prediction. This prediction value has to be interpreted with the same granularity used in the creation of the training instances. Algorithm 2 shows the prediction algorithm.
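As a minimal end-to-end sketch of this approach (using scikit-learn’s \(\epsilon \)-SVR for illustration, whereas the implementation described in Sect. 5 relies on Weka), assuming the feature vectors and targets come from the \(\gamma ^*\) encoding and the rem function above:

```python
from sklearn.svm import SVR

# Toy training set: (encoded partial trace, remaining time in seconds)
X_train = [[1, 0, 0, 30.0], [0, 1, 0, 30.0], [1, 0, 0, 10.0], [0, 0, 1, 10.0]]
y_train = [3600.0, 1800.0, 2400.0, 0.0]

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)    # hyper-parameters would be grid-searched in practice
model.fit(X_train, y_train)

x_running = [[0, 1, 0, 25.0]]                    # encoding of a running (partial) trace
print(model.predict(x_running))                  # remaining time, same granularity as y_train
```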

4.2 Approach 2: regression with contextual information

This approach differs from the previous one since it makes use of control-flow information in order to add contextual knowledge. The basic idea consists of adding a limited set of features able to encapsulate the control-flow path followed by a partial trace. We chose a transition system as control-flow model because it generally represents a good trade-off between expressiveness and compactness. Given a labelled transition system \(\textit{TS} = (S, E, T)\) built from an event log L, we have to transform it into a series of features and encode them into a form suitable for the \(\epsilon \)-SVR algorithm. As for the literal attributes, we use the one-hot encoding: the set \(S {\setminus } S^\textit{start}\) is treated as a literal domain, whose possible values are the states \(s \in S\) excluding the initial state, because a non-empty trace always maps onto a state not included in \(S^\textit{start}\). So, we enumerate the states \(s_1, s_2, \dots , s_n \in S {\setminus } S^\textit{start}\), and we map a state \(s_i\) into a vector \(\mathbf {v} \in \{0, 1\}^n\) by setting the i-th component of the n-dimensional vector to 1, and all the others to 0. For example, given the state set \(S {\setminus } S^\textit{start} = \{s_1, s_2, s_3, s_4\}\), we encode the state \(s_3\) into \(\mathbf {v} \in \{0, 1\}^4\) such that \(\mathbf {v} = [0, 0, 1, 0]\). In the following we will refer to this method as SVR+TS.
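A short sketch of the extra control-flow features of SVR+TS, under the assumption that the states in \(S {\setminus } S^\textit{start}\) are kept in a fixed enumeration, is the following; the resulting vector is appended to \(\gamma ^*(\sigma )\).

```python
def encode_state(state, states):            # states: fixed enumeration of S \ S_start
    v = [0.0] * len(states)
    if state in states:
        v[states.index(state)] = 1.0
    return v

states = ["s1", "s2", "s3", "s4"]
assert encode_state("s3", states) == [0.0, 0.0, 1.0, 0.0]
# Full SVR+TS input (conceptually): x = gamma_star(prefix, ...) + encode_state(f_state(prefix), states)
```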

4.2.1 Non-fitting traces

Like the previous approach, this one is also able to handle non-fitting traces. However, without any adjustment, the prediction would be calculated overlooking the control-flow. Indeed, using the encoding described above, if \(f^\textit{state}(\sigma ) \notin S\) then the corresponding vector would be null (i.e., \(\mathbf {v} = [0, 0, \dots , 0]\)). We cope with this problem by mapping the non-compliant state \(s = f^\textit{state}(\sigma ) \notin S\) onto a set of lawful states \(s_i \in S\). The idea is to associate the non-fitting trace with states that are, to some degree, similar. The vector \(\mathbf {v}\) will then contain, for each state, the normalized similarity value. It is very important to define a thoughtful similarity function: we assume that two states are similar if their representations are similar. In particular, since we are focusing on control-flow, we use an event representation function such as \(f^\textit{event}(e) = \pi _\mathcal {A}(e)\). This implies that abstractions are aggregate representations of a set of activities. Let us define a similarity function for each abstraction considered in Sect. 3 (i.e., set, bag and sequence):

Definition 8

(Set similarity function) Given two sets \(x_1, x_2 \subseteq \mathcal {X}\), with \(\mathcal {X}\) the set of all possible values, we define the similarity function \(f^\textit{sim}_{\textit{set}} \in 2^\mathcal {X} \times 2^\mathcal {X} \rightarrow [0, 1]\) as the Jaccard similarity [21].

Definition 9

(Bag similarity function) Given two multi-sets over a root set \(\mathcal {X}\), \(x_1, x_2 \in \mathbb {B}(\mathcal {X})\), we define the similarity function \(f^\textit{sim}_{\textit{bag}} \in \mathbb {B}(\mathcal {X}) \times \mathbb {B}(\mathcal {X}) \rightarrow [0, 1]\) as the Jaccard similarity [21].

Definition 10

(List similarity function) Given two finite sequences over \(\mathcal {X}\), \(x_1, x_2 \in \mathbb {S}(\mathcal {X})\), we define the similarity function \(f^\textit{sim}_{\textit{list}} \in \mathbb {S}(\mathcal {X}) \times \mathbb {S}(\mathcal {X}) \rightarrow [0, 1] \subset \mathbb {R}\) as the Damerau-Levenshtein similarity [5].

The computational complexity of these similarities is quadratic in the size of the abstractions. On the basis of the abstraction used to construct the transition system, the corresponding similarity function is chosen. Other similarity functions could be used as well, e.g., the Levenshtein similarity. A reasonable choice should take into account how the state representation is defined; in our case, in particular with sequences, we consider the analogy between this abstraction and strings. Using more in-depth knowledge about the process, other (custom) similarities could be adopted.

Every time a non-fitting trace comes into play, its representation is compared with all the state representations of the TS (excluding the initial state). So, given a transition system \(\textit{TS} = (S, E, T)\) (created using \(f^\textit{event}\) and \(f^\textit{state}\)), a similarity function \(f^\textit{sim}\) and a trace \(\sigma \) such that \(s' = f^\textit{state}(\sigma ) \notin S\), the value \(f^\textit{sim}(s, s')\) is computed for each state \(s \in S {\setminus } S^\textit{start}\). After that, each similarity value is normalized and finally put into the vector \(\mathbf {v}\). We will call this kind of TS a similarity-based transition system.
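The following hedged sketch illustrates this encoding for the set abstraction, using the Jaccard similarity of Definition 8; the known states and the representation of the non-fitting trace are toy values.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0   # convention: two empty sets are identical

def similarity_vector(new_state, states):
    """Replace the one-hot state vector of a non-fitting trace with normalized similarities."""
    sims = [jaccard(new_state, s) for s in states]
    total = sum(sims)
    return [s / total for s in sims] if total > 0 else sims

states = [{"A"}, {"A", "B"}, {"A", "B", "C"}]        # S \ S_start, set abstraction
print(similarity_vector({"A", "B", "D"}, states))    # [0.222..., 0.444..., 0.333...]
```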

Training The training phase of this method is almost the same as the preceding one. The main difference lies in the introduction of a new derived feature in the training set. This can be done by making some minor changes to Algorithm 1. Since we assume the construction of \(\textit{TS}\), we need it as input along with the state representation function \(f^{\textit{state}}\) and the similarity function \(f^\textit{sim}\). We calculate the state associated with each partial trace and we encode it into a one-hot vector, or into the normalized similarity vector if it is a non-fitting trace. Finally, we construct the rest of the training instances as in Algorithm 1.

Prediction In this phase, as for the Simple Regression approach, the \(\epsilon \)-SVR model created in the previous step is used to forecast the remaining time of running process instances. The novelty of the method just described lies in the resulting model obtained from the training phase: the introduction of contextual information generally leads to a different optimization problem and consequently to a different final model. The only changes to make in Algorithm 2 are the addition of \(f^\textit{state}\) as input, and the substitution of the right-hand side of line 1 with the one-hot (or the non-fitting) encoding of \(f^\textit{state}(\sigma _p)\).

4.3 Approach 3: Data-aware transition system (DATS)

The approach presented in this section is a refinement of [25] which exploits the same idea described in [33]. The main difference from [25] is the removal of the expected sojourn time from the prediction, because that time is now included in the SVR estimation, which also takes into account the additional data. Let us recall the main characteristics of the method presented in [33]. In their work, van der Aalst et al. introduced the concept of annotated transition system: each state of the transition system is “decorated” with predictive information, called measurements. Since we want to predict the remaining time, a possible set of measurements collected in a state is the set of remaining times measured from the state itself. Formally, in [33] a measurement is defined as:

Definition 11

(Measurement) A measurement function \(f^\textit{measure}\) is a function that, given a trace \(\sigma \) and an event index \(i \in [1, 2, \dots , |\sigma |]\), produces some measurement. Formally, \(f^\textit{measure} \in \varSigma \times \mathbb {N} \rightarrow \mathcal {M}\), where \(\mathcal {M}\) is the set of possible measurement values (e.g., remaining time).

In [33] different kinds of measurements are proposed. Once the suitable measurement is chosen, an annotated transition system is created according to the event log:

Definition 12

(Annotated transition system) Let L be an event log and \(\textit{TS} = (S, E, T)\) a labeled transition system constructed over L using the representation functions \(f^\textit{event}\) and \(f^\textit{state}\). Given a particular measurement function \(f^\textit{measure} : \varSigma \times \mathbb {N} \rightarrow \mathcal {M}\), we define an annotation \(A \in S \rightarrow \mathbb {B}(\mathcal {M})\), such that \(\forall s \in S\):

$$\begin{aligned} A(s) = \biguplus _{\sigma \in L} \biguplus _{{\mathop {s = f^\textit{state}(\sigma ^k)}\limits ^{1\le k \le |\sigma |}}} f^\textit{measure}(\sigma , k), \end{aligned}$$

where \(\mathbb {B}(A)\) denotes the set of multisets over a finite set A (i.e., functions \(A \rightarrow \mathbb {N}\)) and \(\uplus \) is the multiset union, i.e., the multiset \(X \uplus X'\) such that \(\forall \, c, \, (X \uplus X')(c) = X(c) + X'(c)\). An annotated transition system is the tuple \((S, E, T, A)\).

Since we are facing the remaining time prediction problem, our \(f^\textit{measure}\) function coincides with the function rem defined in Sect. 4. The last step consists of the definition of a prediction function \(\in \mathbb {B}(\mathcal {M}) \rightarrow \mathcal {M}\) such that, given a multiset of measurements, it produces some prediction, e.g., the average or the median value. So, in the operative setting we have an annotated transition system \((S, E, T, A)\) constructed over L and a prediction function \(\mathcal {P} : \mathbb {B}(\mathcal {M}) \rightarrow \mathcal {M}\). Using these tools, a prediction is made in a straightforward way: given a partial trace \(\sigma _p\) observed so far, such that \(f^\textit{state}(\sigma _p) = s\), the prediction value is \(\mathcal {P}(A(s))\). It is worth noticing that the prediction is calculated using merely control-flow (i.e., transition system) and temporal (i.e., remaining time) information.

The main difference introduced by our approach is the use, as annotations, of classifiers and regressors which take advantage of the additional attributes. Let us give a brief overview of this approach. As in [33], we start with the transition system construction; then we enrich each state with a Naïve Bayes classifier [22] and each transition with a Support Vector Regressor [1, 7, 28] (\(\epsilon \)-SVR), both trained on historical data considering all attributes. The introduction of these two machine learning models is based on the intuition that, in a state s, the Naïve Bayes classifier estimates the probability of a transition from s to \(s' \in s\bullet \), while the \(\epsilon \)-SVR predicts the remaining time given that the next state will be \(s'\).

Fig. 2

A state annotated with a Naïve Bayes classifier. The probability of going from state \(s_2\) to any of its exiting states is reported next to the corresponding edge

Figure 2 proposes an example of a state (\(s_2\)) annotated with a Naïve Bayes classifier. In this state, the Naïve Bayes classifier provides the probabilities of reaching each exiting state: the probabilities of reaching states \(s_3, s_4\) and \(s_5\) are respectively 0.6, 0.1 and 0.3. Such values are used to weigh the remaining time values obtained from the Support Vector Regressors. In Fig. 2 the state \(s_2\) corresponds to the current partial trace \(\sigma '\). For each outgoing state (i.e., \(s_3, s_4, s_5\)) the remaining time is estimated using the SVR associated with the incoming transition (i.e., \(s_2 \rightarrow s_3\), \(s_2 \rightarrow s_4\) and \(s_2 \rightarrow s_5\)). These estimations are multiplied by the probability values obtained from the NB classifier, and finally summed together in order to compute a weighted average over all the possible continuations from s. Figure 3 presents an example of remaining time estimation for each outgoing state. Formally, let \(\hat{p}_{s'}\) be the Naïve Bayes estimated probability of reaching state \(s' \in s\bullet \) from state s, and \(\hat{\tau }_{s \rightarrow s'}\) the estimated remaining time returned by the \(\epsilon \)-SVR associated with transition \(s \rightarrow s'\). Then, given the state s reached after observing a (partial) trace \(\sigma \), the prediction returned by the annotated transition system is \(\sum _{s' \in s\bullet } (\hat{p}_{s'} \cdot \hat{\tau }_{s \rightarrow s'})\).

Fig. 3

Example of a Support Vector Regressor application. The estimated remaining times are indicated by the labels Rem

Let us now define the annotations used to decorate the transition system and then the predictor transition system.

Definition 13

(Naïve Bayes Annotation) Let \(\textit{TS}\) be a labeled transition system, obtained from an event log L, based on an event representation function \(f^{\textit{event}}\) and a state representation function \(f^{\textit{state}}\). Let us call k the size of the \(\gamma ^*(\sigma )\) vector calculated for traces \(\sigma \in L\). A Naïve Bayes Annotation is a function \( NB : S \times \mathbb {R}^k \times S \rightarrow [0, 1] \subset \mathbb {R}\), which, given two states \(s_i, s_j \in S\) and a data attribute vector \(\mathbf {x} \in \mathbb {R}^k\), returns the probability of reaching the state \(s_j\) starting from \(s_i\) through a single transition.

Definition 14

(SVR Annotation) Let \(\textit{TS}\) be a labeled transition system, obtained from an event log L, based on an event representation function \(f^{\textit{event}}\) and a state representation function \(f^{\textit{state}}\). An SVR Annotation is a function \(R : T \times \mathbb {R}^k \rightarrow \mathbb {R}\), such that, given a transition \(t \in T\) and a data attribute vector \(\mathbf {x} \in \mathbb {R}^k\), it applies Support Vector Regression to produce an estimation of the remaining time.

Definition 15

(Predictor TS) Let \(\textit{TS} = (S, E, T)\) be a labeled transition system, obtained from an event log L, based on an event representation function \(f^{\textit{event}}\) and a state representation function \(f^{\textit{state}}\). A predictor transition system is a tuple \(\textit{PTS} = (S, E, T, \textit{NB}, R)\) where \(\textit{NB}\), R are respectively a Naïve Bayes and a SVR annotation, based on the event log L and the transition system \(\textit{TS}\).

Training In this section, we describe how to construct a predictor transition system. Algorithm 3 shows the construction procedure.

Algorithm 3: construction of the predictor transition system from the event log L

The first loop (line 1) initializes all the training sets for the \(\epsilon \)-SVR models to the empty set, while the second loop (lines 4–16) creates the training sets and updates the NB classifiers. In particular, line 6 constructs the training instances by extracting the additional data from the partial traces and calculating the remaining time. Since a NB model can be built incrementally, this is done in line 13. The last loop (lines 17–19) trains the \(\epsilon \)-SVR models using the training sets built previously. The training of this model can be time consuming because, especially with complex processes, many SVRs and NB classifiers have to be trained in addition to the construction of the TS. However, for processes that are not very complex, the training time is reasonable.

Algorithm 4: remaining time prediction using a predictor transition system

Prediction In this section, we describe how to predict the remaining time of a running case using a predictor transition system constructed with Algorithm 3. Algorithm 4 shows the prediction procedure. The algorithm simply applies the formula seen at the beginning of this section, \(\sum _{s' \in s\bullet } (\hat{p}_{s'} \cdot \hat{\tau }_{s \rightarrow s'})\). Each \(\hat{p}_{s'}\) is the value produced by the application of the NB classifiers and each \(\hat{\tau }_{s \rightarrow s'}\) by the \(\epsilon \)-SVR. Specifically, let \(s = f^\textit{state}(\sigma _p)\) be the current state; then \(\hat{p}_{s'} = \textit{NB}(s, \gamma ^*(\sigma _p), s')\) and \(\hat{\tau }_{s \rightarrow s'} = R(s \rightarrow s', \gamma ^*(\sigma _p))\), for all \(s' \in s\bullet \). A core difference w.r.t. [25] is the absence of the expected sojourn time (in the current state): in this revised version this information is implicitly embedded inside the \(\epsilon \)-SVR and hence can be removed from the formulation. The computational complexity of the algorithm is linear in the average number of outgoing transitions of the states of the transition system. Figure 4 puts together the two representations depicted in Sect. 4.3: it shows an example of NB and SVR application for a partial trace \(\sigma = \langle A, B\rangle \) (see the event log in Table 1). It is noteworthy that, given a predictor transition system (the result of the learning procedure), the computation of the prediction requires a constant number of operations which, in the worst case, corresponds to the size of the largest set \(s\bullet \) of the transition system. This property allows the application of the approach in on-line settings [3], where each event is allowed to trigger only a constant number of operations.

Fig. 4

Example of a prediction calculated by a predictor transition system: \(P(s_2) = 0.6 \cdot 2 + 0.1 \cdot 3 + 0.3 \cdot 1 = 1.8\)
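To make the weighted combination concrete, the following compact sketch reproduces the prediction step with stub models standing in for the trained Naïve Bayes and \(\epsilon \)-SVR annotations; the numbers are those of Figs. 2, 3 and 4.

```python
def dats_predict(state, x, nb_models, svr_models, outgoing):
    """x: feature vector of the running case; outgoing: dict state -> list of next states."""
    prediction = 0.0
    for next_state in outgoing[state]:
        p_hat = nb_models[state].predict_proba(x, next_state)     # \hat{p}_{s'}
        tau_hat = svr_models[(state, next_state)].predict(x)      # \hat{tau}_{s -> s'}
        prediction += p_hat * tau_hat
    return prediction

class StubNB:                                   # placeholder for a trained Naïve Bayes annotation
    def __init__(self, probs): self.probs = probs
    def predict_proba(self, x, next_state): return self.probs[next_state]

class StubSVR:                                  # placeholder for a trained SVR annotation
    def __init__(self, value): self.value = value
    def predict(self, x): return self.value

nb_models = {"s2": StubNB({"s3": 0.6, "s4": 0.1, "s5": 0.3})}
svr_models = {("s2", "s3"): StubSVR(2.0), ("s2", "s4"): StubSVR(3.0), ("s2", "s5"): StubSVR(1.0)}
outgoing = {"s2": ["s3", "s4", "s5"]}
print(round(dats_predict("s2", None, nb_models, svr_models, outgoing), 2))   # 0.6*2 + 0.1*3 + 0.3*1 = 1.8
```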

4.3.1 Future path prediction

The model used in this last approach can also be exploited to predict, for a running case, the most likely sequence of activities until the end of the case. Let us take, for example, the situation depicted in Fig. 2, which is a fragment of the TS in Fig. 1: the most likely sequence of states starting from \(s_{2\{B\}}\) is \(s_{2\{B\}} \rightarrow s_{3\{C\}} \rightarrow s_{6\{F\}}\), where \(s_{6\{F\}}\) is an accepting node and so the process instance is complete, with a probability equal to \(0.6 \times 1 = 0.6\). Note that the transition between \(s_{3\{C\}}\) and \(s_{6\{F\}}\) has probability equal to 1 because it is the only way the process can proceed. In general, the sequence can go through many split states in which the transition probabilities are \(< 1\), and finding the full sequence with the highest probability is not a trivial task. We face this problem as a shortest path problem, in which the goal is to find a path between two vertices of a graph such that the sum of the weights of its constituent edges is minimized. Specifically, let us consider the transition system as a directed graph: we would like to find the shortest path between the current node and an accepting node. In order to leverage this idea, we need to define a suitable distance measure (i.e., cost) between nodes.

Given a possible sequence of states (in a Predictor TS) from state \(s_1\) to \(s_n\), i.e., \(s_1, s_2, \dots , s_n\), with the corresponding transition probabilities (obtained using the NB annotations), i.e., \(p_i\) for each \(s_i \rightarrow s_{i+1}, 1 \le i < n\), the likelihood of such a sequence is defined as \(\prod _{1\le i < n} p_i\). Since probabilities have to be multiplied to get the likelihood of a sequence, we cannot use them directly as edge costs, because the definition of the shortest path problem requires edge costs to be summed. However, we can exploit the properties of the logarithm function to transform probabilities into distance-like values: \(\log (pq) = \log (p) + \log (q)\), and, since p is a probability, \(\log (p) \le 0\) for all \(0 \le p \le 1\), where low values of \(\log (p)\) mean that the transition probability is low or, from a graph viewpoint, that the distance between the nodes is high. Using this idea, we can use as edge cost the opposite of the logarithm of the transition probability. Note that this logarithmic transformation works as we desire: probabilities close to 0 (i.e., rare occurrences) correspond to high costs, while probabilities close to 1 (i.e., very likely transitions) correspond to almost negligible costs. Using the just mentioned transformation, we can construct a graph corresponding to the TS on which the shortest path problem can be solved by applying a best first search. Hence, from a computational point of view, the prediction of the activity sequence has a cost which is linear in the number of nodes (i.e., states of the TS).
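A minimal sketch of this search, assuming the transition probabilities are already available from the NB annotations, is the following; Dijkstra’s algorithm is used here since all edge costs \(-\log (p)\) are non-negative.

```python
import heapq, math

def most_likely_path(graph, start, accepting):
    """graph: dict state -> list of (next_state, probability)."""
    queue = [(0.0, start, [start])]          # (accumulated -log probability, state, path)
    best = {start: 0.0}
    while queue:
        cost, state, path = heapq.heappop(queue)
        if state in accepting:
            return path, math.exp(-cost)     # path and its overall likelihood
        for nxt, p in graph.get(state, []):
            c = cost - math.log(p)           # edge cost: opposite of the log of the probability
            if c < best.get(nxt, float("inf")):
                best[nxt] = c
                heapq.heappush(queue, (c, nxt, path + [nxt]))
    return None, 0.0

# The example of Fig. 2: from s2, C is chosen with probability 0.6 and the case then completes.
graph = {"s2": [("s3", 0.6), ("s4", 0.1), ("s5", 0.3)],
         "s3": [("s6", 1.0)], "s4": [("s6", 1.0)], "s5": [("s6", 1.0)]}
path, likelihood = most_likely_path(graph, "s2", {"s6"})
print(path, round(likelihood, 2))            # ['s2', 's3', 's6'] 0.6
```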

4.4 Application scenarios

The proposed methods are designed to be used in different operative scenarios. Specifically, we can classify a business process on the basis of its complexity and its dynamicity over time. Whenever the process is stable and not very complex (i.e., structured and with a reasonable number of activities), DATS should be preferred over the other methods. Thanks to the TS and the information on the states/transitions, it is able to take advantage of the peculiarities of each single activity while keeping the learning time (and space) acceptable. In all other cases, i.e., complex and/or very dynamic processes, the SVR based approaches should be considered. In particular, SVR+TS is a better choice in the case of dynamic but simple processes, because it uses more information and, in general, should be more accurate. In the case of dynamic processes, SVR or SVR+TS can be used depending on how complex the process is: it is a matter of trade-off between (expected) accuracy and training time. With particularly complex processes, SVR is of course the most reasonable choice. Table 2 summarizes the possible scenarios and which method should be preferred. From a computational point of view, all the models, once trained, can be used in an almost real-time setting, because a prediction can be calculated in the order of milliseconds. However, the training could take several hours, in particular if auto-tuning of the parameters is required.

Table 2 Possible application scenarios

From a scalability point of view, the use of an explicit process model, such as a TS, can be an issue. In particular, by using a TS, the number of states can easily grow, and this can cause the following problems: (i) for DATS, the number of prediction models that have to be trained could be too high and the training time could become unreasonable; (ii) for SVR+TS, the number of features can become much greater than the number of training examples, which can cause a drop in performance.

It is always possible to limit the number of states of a TS by using a horizon in the abstraction, but we have to be careful not to underfit the real process model. However, we think that in most real world business processes it is possible to obtain a reasonable TS without overgeneralizing the actual model.

5 Experiments

These techniques have been implemented for the ProM framework [34]. To mine the transition systems, we rely on the miner implementation available inside the framework. Naïve Bayes classification and Support Vector Regression are performed using the implementations in the Weka framework [13]. In particular, for SVR we used the SMO (Sequential Minimal Optimization) based implementation provided by the framework.

The experiments reported in this section aim to assess whether the techniques leveraging the similarity-based transition system give more accurate predictions for variants of process executions that have not been observed in the training set. Moreover, in the static scenario we want to assess the prediction quality against state-of-the-art methods. The comparison is made with respect to the technique reported in van der Aalst et al. [33] and the LSTM based approach presented in [23], both discussed in Sect. 2. Work [25] shows that not much difference can be appreciated when a polynomial or an RBF kernel is used; therefore, the experiments only make use of the latter. To measure and compare the accuracy, we used three indicators: the Mean Absolute Percentage Error (MAPE), the Root Mean Square Percentage Error (RMSPE) and, only in the comparison with [23], the Mean Absolute Error (MAE). The first two metrics are the percentage form of the two standard measures, Mean Absolute Error and Root Mean Square Error. We use their percentage form in order to have values which do not depend on the duration of the process, because they are scaled. Let n be the number of samples and let \(A_i\) and \(F_i\) be respectively the actual value and the predicted value for the i-th sample. MAPE is defined as \(100\%\frac{1}{n}\sum _{i=1}^n \left| \frac{A_i-F_i}{A_i}\right| \), while RMSPE as \(100\% \sqrt{\frac{1}{n}\sum _{i=1}^n \left( \frac{A_i-F_i}{A_i} \right) ^2}\). Note that we discard the last event of each trace in order to avoid \(A_i=0\), which would cause a division by zero. Even though MAPE and RMSPE are similar, the latter tends to punish larger errors more than the former. So, depending on how important it is to avoid large errors in single predictions, one metric can be preferred over the other.
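For reference, a direct transcription of the two measures on toy values (the last event of each trace is discarded beforehand, so \(A_i\) is never zero):

```python
import math

def mape(actual, forecast):
    n = len(actual)
    return 100.0 / n * sum(abs((a - f) / a) for a, f in zip(actual, forecast))

def rmspe(actual, forecast):
    n = len(actual)
    return 100.0 * math.sqrt(sum(((a - f) / a) ** 2 for a, f in zip(actual, forecast)) / n)

A = [100.0, 200.0, 400.0]   # actual remaining times
F = [110.0, 180.0, 460.0]   # predicted remaining times
print(round(mape(A, F), 2), round(rmspe(A, F), 2))   # 11.67 11.9
```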

We performed our experiments on three real case studies. The first case study log concerns the execution of process instances in an information system for the management of road-traffic fines by a local police office of an Italian municipality.

The second case study log concerns the ticketing management process of the help desk of an Italian software company. In particular, this process consists of 14 activities: it starts with the insertion of a new ticket, to which a seriousness level is assigned. Then the ticket is managed by a resource and processed. When the problem is resolved, the ticket is closed and the process instance ends. In this case the process is not as linear as the previous one, but it is well structured. The third log is a reduced version of the dataset from the BPI 2012 challenge (the same dataset as in [23]). It is a real-life log of a Dutch Financial Institute. The event log describes an application process for a personal loan or overdraft within a global financing organization. Table 3 summarizes the logs’ details.

Table 3 Logs’ information: total number of events, total number of traces, number of activities (\(|\mathcal {A}|\)), average case duration (\(\mu _{\textit{D}}\)), average number of events per case (\(\mu _{\textit{E}}\)) and finally the size of the TS (\(|\mathcal {S}|\)) with the set, bag and sequence abstraction

5.1 Comparison on static processes

5.1.1 Road fines log

We extracted from the information system an event log that refers to executions ending with a sending for credit collection. With this set of experiments we want to show that all the methods presented here perform well under the assumption that the training set contains all the possible process behaviours (the same assumption made by van der Aalst et al. [33]). The experiments were performed using fivefold cross validation. The test has been performed over all possible prefixes of the traces in the test log. The SVR hyper-parameters have been tuned automatically using a grid search strategy. In particular, we sought the best combination of C, the penalty in the optimization problem, and \(\gamma \) for the RBF kernel. The transition system has been mined using different abstractions, namely set, multi-set and sequence, with no horizon limit. Since, in this event log, 99% of the traces perform every activity at most once, and the process is almost linear, the multi-set and sequence abstractions were not considered in the experiments.

Table 4 reports the results of the experiments. The acronyms used in the tables have the following meaning: VDA is the approach presented in [33], DATS is the data-aware transition system, SVR is the approach which uses a simple support vector regression machine, and SVR+TS is the method which incorporates contextual information (via the TS) in the training instances.

Table 4 Road fines log experiment results using fivefold cross validation: (*) means that the approach is the baseline

Results show that every proposed approach outperforms the baseline, with similar results. However, it is worth noticing that the approaches based on the transition system, i.e., DATS and SVR+TS, achieve the best performance on RMSPE and MAPE respectively.

Fig. 5 Error (MAPE on the left, RMSPE on the right) with respect to the number of events to the completion for the Road fines log

From these results, we can argue that the improvements with respect to the baseline are due to the introduction of the additional data in our models. In this log, we can also notice that the effect of considering the workflow (i.e., the transition system) is limited in comparison to the data perspective. Since this log has only two variants, the importance of the process model is limited; nonetheless, SVR+TS achieves better results than the simple SVR, which does not use any information about the process model. Figure 5 shows how the error is distributed with respect to the number of events the trace still needs to complete. As expected, the error increases with the number of future events. However, it is worth noticing that, for the proposed methods, the error with only one event to completion is higher than with two. This may be due to the fact that, in the TS, the last non-final state is the only one with two possible continuations. Moreover, since SVR shows the same behaviour but does not depend on the TS, we believe that the data attributes can be misleading when determining which last activity will be executed, and this causes the increase of the error.

5.1.2 Help desk log

As in the previous case study, the experiments were performed using fivefold cross validation and the SVR hyper-parameters have been tuned automatically using a grid search strategy. Since this process is more complex, we performed two kinds of experiments: (i) we compared the performance of our approaches against the baseline, assuming that all the possible behaviours of the process are present at least once in the training phase; (ii) we removed some variants from the training set in order to have completely new activity sequences in the test set, so that the assumption no longer holds (see Sect. 5.2). The second type of experiment aims to show that the approaches based on a single SVR perform well even with noisy instances or with new events. Table 5 shows the results. The test has been performed over all possible prefix sub-traces of the test log. We tested all three trace abstractions, i.e., set, multi-set and sequence. As in the previous case study, the best performing methods are DATS and SVR+TS; however, the lowest MAPE and RMSPE are achieved by DATS using the set abstraction. Since the SVR approach did not achieve good results (it is in line with VDA), the transition system information used by SVR+TS played a decisive role. In fact, even though VDA is the worst performing method, it is relatively closer to the others than in the previous case study. The improvements with respect to the baseline are on average \(14\%\) for MAPE and \(4\%\) for RMSPE.

Table 5 Help desk log experiment MAPE and RMSPE results using fivefold cross validation: (*) means that the approach is the baseline
Fig. 6 Error (MAPE on the left, RMSPE on the right) with respect to the number of events to the completion for the Help Desk log using the set abstraction

Figure 6 shows the error of the methods with respect to the number of events the instance will take to terminate. In this case the error behaves as expected. We can also notice that SVR has a very high error when the number of future events is greater than 3. This means that the TS information is helpful; in fact, VDA performs better even without using any additional data. From a workflow point of view, this case study has a more complex structure, and we can see how the workflow information, conveyed via the TS, has a different impact on the results. In this experiment the simple SVR falls short of SVR+TS and DATS, emphasizing the importance of the information brought by the TS.

5.1.3 Comparison with LSTM based method

In this section we compare our proposals and [33] against the LSTM-based approach presented in [23] in the static process scenario (see Sect. 4.4). For [23] we used the implementation provided by the authors.

In these experiments the setting differs from the previous ones: we perform a single fold, because the deep network would have required too much time to be trained and tested. Since processes are considered stable, we did not include the simple SVR, because SVR+TS in this scenario can be considered a kind of generalization of SVR. Differently from [23], we consider all prefix lengths (as done in the other experiments), including partial traces with a single event. We temporally divided the log as follows: the first 2/3 of the traces as training (and validation) data and the remaining 1/3 as test set. We tested different network configurations, validating the number of LSTM neurons in each layer in the set \(\{100,150,200\}\) and the number of layers in the range [1, 6] (shared layers not included). We fixed a single shared layer, as suggested in [23]. The validation set is used for selecting the best hyper-parameters and network architecture. The evaluation is done in terms of Mean Absolute Error (MAE), an error measure between continuous variables, which is the same metric optimized by the LSTM network over single events. Let \(A_i\) be the desired output and \(F_i\) the predicted output for a dataset of n samples; then \(MAE = \frac{1}{n}\sum_{i=1}^n |A_i - F_i|\). The MAE has an intuitive interpretation as the average absolute difference between the prediction and the expected output. The obtained results are reported in Table 6.
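As an illustration, the temporal split and the MAE computation can be sketched as follows; the function and variable names are ours, and we assume each case is identified by its start timestamp:

```python
import numpy as np

def temporal_split(case_ids, start_times, test_fraction=1/3):
    """Split case ids temporally: the earliest cases form the training
    (and validation) partition, the latest ones form the test partition."""
    order = np.argsort(start_times)
    ordered = np.asarray(case_ids)[order]
    cut = int(len(ordered) * (1 - test_fraction))
    return set(ordered[:cut]), set(ordered[cut:])

def mae(actual, predicted):
    """Mean Absolute Error."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted))
```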

Table 6 MAE performances: errors are reported in days

From the table we can notice that the LSTM-based approach achieves good performance on the Help Desk log, while on the other two datasets our proposals perform better. In particular, on the more complex log (it has many repeated activities), i.e., BPI 12, the best performing method is SVR+TS, as we expected (see Sect. 4.4). However, DATS also performs quite well on the BPI 12 log. As underlined previously, our approach can suffer with processes that have many activities; in fact, with BPI 12 we had to limit the abstraction horizon in order to be able to compute the prediction. Another observation is that the approach in [23] has a significant drawback regarding parameter selection. A reasonable way to select the parameters of the network is to look at the error on the validation set. However, since the function the method optimizes is not the overall MAE (indeed, the network predicts the time until the next event instead of the whole remaining time), we empirically found that the configuration with the lowest validation loss is generally not the best performing one on the test set. The results reported in Table 6 are based on the best performing parameters on the validation set. For some datasets this may not be a problem: for example, on the Help Desk dataset the difference between the selected configuration and the best performing one on the test set is small, namely \(< 0.1\) days. However, in other cases this difference may be large. This is the case for the Road Fines dataset, where the selected configuration has an MAE that is almost double that of the best one. We would like to underline that our results are consistent with the ones obtained (on the same logs) by Tax et al. [23].

Table 7 Help Desk log experiment (without some process variants) results using fivefold cross validation: (*) means that the approach is the baseline

5.2 Comparison on dynamic processes

In order to assess our approaches on dynamic processes we used a modified version of the Help Desk log, in which we removed from the training set some process variants or a subset of the activities. Table 7 shows the results of fivefold cross validation on the log without some variants, using the set abstraction. We use this abstraction because it is the most general one, so there are more chances that states with high similarity to the target one exist. In particular, column “Variants” shows results using training instances without half of the variants present in the starting event log. Column “Activity” instead shows results obtained by removing from the training set all the traces with a specific activity, i.e., “Waiting”, which was present in almost 25% of the instances. The test phase then uses the remaining part of the log, so the test set contains completely new process behaviours, which cause problems to the VDA and DATS approaches. Results show that in both cases SVR+TS outperforms all the other approaches, with a MAPE around 34% and a RMSPE of 55%. The introduction of the similarity mechanism in SVR+TS makes this approach less sensitive to noise or changes in the workflow, because it is able to mitigate the lack of the correct state using information from correlated ones. It is worth highlighting that both VDA and DATS would not return any prediction in the case of unseen activity sequences. However, in order to be able to compare the approaches, we implemented a safety mechanism: if a trace does not map onto a valid state of the transition system, the last event is removed and the mapping is redone. This process is repeated until a prefix that can be mapped is obtained or the prefix becomes empty (and is discarded); in the latter case, the average remaining time over the entire training set is returned. This also explains why SVR does not perform better than VDA and DATS. These experiments show the effectiveness of the SVR+TS approach and the importance of the similarity-based transition system.
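The safety mechanism can be sketched as follows; this is a simplified illustration in which the abstraction function and the dictionary-like lookup of per-state predictors are assumed interfaces, not the actual implementation:

```python
def predict_with_fallback(trace, transition_system, abstraction, avg_remaining_time):
    """If the abstracted prefix does not map to a state of the transition
    system, drop the last event and retry; if the prefix becomes empty,
    fall back to the average remaining time of the training set."""
    prefix = list(trace)
    while prefix:
        state = abstraction(prefix)          # e.g. set/bag/sequence abstraction of the prefix
        if state in transition_system:       # assumed: mapping from state to a per-state predictor
            return transition_system[state].predict(prefix)
        prefix = prefix[:-1]                 # remove the last event and retry the mapping
    return avg_remaining_time                # empty prefix: return the training-set average
```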

5.3 Next activities prediction

Using the Help Desk event log we also tested the future path prediction (FPP) method described in Sect. 4.3.1. We evaluated our method against a random predictor, which chooses the next activity at random among the continuations observed in the event log. If an activity is also a possible termination, the method randomly decides whether to stop or not. With these experiments we want to show that our model can also be used to predict future activities; however, since DATS was not developed for this precise task (but for predicting the remaining time), there could be existing methods, e.g., [4, 8, 30], which perform better than ours. As for the previous experiments, we used fivefold cross validation. To evaluate the methods, i.e., to assess how closely the predicted path matches the actual one, we used two metrics: the Damerau-Levenshtein similarity (DAM) [5] and the common prefix (PRE). PRE simply counts the length of the common prefix between the prediction and the actual trace. The DAM metric gives an overall similarity between the predicted sequence and the actual one, while the PRE metric focuses only on the very next activities. Table 8 shows the results.
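For reference, the two metrics can be computed as in the following sketch; we use the optimal-string-alignment variant of the Damerau-Levenshtein distance, normalised by the length of the longer sequence, as an illustrative implementation rather than the exact one used in the experiments:

```python
def common_prefix(predicted, actual):
    """PRE: length of the common prefix between the predicted and actual sequences."""
    n = 0
    for p, a in zip(predicted, actual):
        if p != a:
            break
        n += 1
    return n

def dam_similarity(predicted, actual):
    """Damerau-Levenshtein similarity in [0, 1], based on the
    optimal-string-alignment distance (a sketch)."""
    m, n = len(predicted), len(actual)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == actual[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and predicted[i - 1] == actual[j - 2]
                    and predicted[i - 2] == actual[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    max_len = max(m, n)
    return 1.0 - d[m][n] / max_len if max_len else 1.0

# Example with hypothetical activity labels
print(dam_similarity(["A", "C", "B"], ["A", "B", "C"]))  # 0.66..., one transposition away
print(common_prefix(["A", "C", "B"], ["A", "B", "C"]))   # 1
```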

Table 8 Activity sequence prediction results obtained with a transition system created using different types of abstractions (abst.)

Each row in the tables represents the similarity considering sequences of exactly n activities, where n is indicated in the \(\#\) column. Rows with the \(\mathbb {E}_\#\) symbol report the average similarities over every sequence length. With every abstraction the proposed method outperforms the random one. In particular, FPP is able to identify the next activity almost 94% of the times, and the similarity for the next two activities is almost 0.86 on average, with a hit rate of 71% on average. We achieve good results, with respect to the Damerau-Levenshtein similarity, even with sequences of 3, 4 and 5 activities; this could mean that the algorithm finds many of the correct future activities but in the wrong order. We can also notice that the degradation in prediction accuracy becomes evident as the number of future activities increases. This is due to the accumulation of the uncertainty of each step into the future, which makes the prediction of FPP closer to the random one. However, on average, our approach obtained 0.81 for DAM and 0.63 for PRE, which are far better than the 0.36 and 0.09 obtained by the random method.

6 Conclusions

In this work we presented new methods to tackle the problem of predicting the time-to-completion of running business process instances. The contributions presented in this work can be summarized as follows: (i) we proposed three new prediction methods, which take advantage of the additional data present in the event log; (ii) we leveraged solid and well-studied machine learning techniques in order to build models able to manage the additional information; (iii) we designed our approaches to deal with unexpected behaviours or noisy data, by looking at the closeness between the new trace and the most similar process flows already observed.

The proposed set of methods aims to cover different scenarios and to overcome the limitations of the state-of-the-art approaches. Moreover, we distinguished two main application scenarios of the prediction problem: (1) the process is stable and, consequently, the event log used to train the model contains all the possible process behaviours; (2) the process is dynamic (i.e., it might contain drifts) and, consequently, the event log used for training does not contain all the possible process behaviours, e.g., because of seasonality of the process. The experiments in [25] and those reported in Sect. 5 have shown that the DATS approach has state-of-the-art performance in the first scenario. Assuming that the training event log contains all the possible process behaviours, it is possible to take advantage of the static nature of the process and rely on the transition system structure. A central role in this method is played by the Naïve Bayes classifiers, which are also involved in the prediction of the future sequence of activities in Sect. 4.3.1. Together, these two tools can be very useful for business managers because, in addition to the remaining time estimation, they also provide hints about the sequence of activities that the instance is going to take. Using this information, business managers can act preventively and try to avoid undesirable situations. However, we also obtained good results with the single SVR-based methods, which are well suited for the second scenario, in which not all the process behaviours are present in the training phase. Here, the SVR+TS method is not affected by the lack of information in the training set because it is able to generalize the workflow thanks to the similarity-based transition system and the nature of the single SVR itself. Experimental results point out that methods which strongly depend on the TS structure have problems with new process variants, while the SVR method with the similarity-based TS outperforms all the other approaches. As future work, we plan to improve the parameter calibration of our approaches, in order to improve the overall results. We would also like to investigate whether taking into account only working hours in the prediction is valuable. Moreover, we would like to deploy our approaches in real scenarios, in order to stress the whole approach under production-level constraints.