1 Introduction

Over the last years, we have witnessed the rise of a novel research field called process mining [45], which deals with the analysis of business processes starting from a log. The increasing interest in process mining falls within the wide-ranging framework of Business Process Management (BPM), a discipline dealing with the study and control of process execution. In particular, the tasks of learning a model from a log (commonly referred to as process discovery) and verifying the conformance of a log w.r.t. a model (i.e. compliance monitoring or conformance checking) have brought significant advantages to business processes, thus gaining increasing attention in industrial and academic research.

For both process discovery and compliance monitoring, the availability of the input log is a fundamental requirement. Unfortunately, in a number of real contexts such a log might not be available; for example, a newly designed process might have a model but, due to its novelty, lack execution logs. Even when the log is available, its quality might be a major concern, as incompleteness or noise can heavily affect both mining and conformance checking results. Furthermore, in order to accurately compare novel process discovery and conformance checking approaches with existing techniques, it is extremely important to rely on a solid benchmark suite composed of logs with specific, known characteristics. For these reasons, a common approach to evaluate process mining techniques is based on the use of synthetic logs created via process simulation. Log generators are software tools that take as input a model and a set of desired features and emit artificial logs. The generators differ in the supported model language and in the log features that can be requested. A small number of these tools [25, 51, 52] can also produce negative examples, i.e. sequences of activities representing process instances that diverge from the expected behaviour, for example, executions that are not compliant with specific time or resource constraints or that lack some events in a sequence. Indeed, the availability of negative examples in the log is very important when evaluating the performance of different process mining techniques and their robustness to noise. Real-life logs might also contain negative examples, but the information about which instances are positive and which are negative is often unavailable. Obviously, manually annotating logs is viable only for small data sets.

The cited approaches [25, 51, 52] to log generation with the ability to produce negative examples are intrinsically procedural and therefore not suitable for the evaluation of process discovery techniques based on declarative process models. As pointed out in [37, 49], differently from procedural models (which work under a closed world assumption, where all the allowed behaviours need to be explicitly specified), declarative models are open, thus enjoying a flexibility that makes them more suitable for describing highly variable behaviours in a compact way.

In previous works [14, 16], we adopted an artificial intelligence technique, namely abduction, to complete partial logs, and introduced a novel notion of weak compliance of incomplete process instances w.r.t. a procedural model. The focus was on determining whether incomplete logs (i.e. those reporting process instances with missing or partially specified events) might be considered compliant with a model, and under which conditions/hypotheses. To that end, we exploited Social Constrained IFF (SCIFF) [6], an abductive logic programming (ALP) framework, and in particular the hypothesis-making reasoning capabilities typical of abduction.

In this work, we investigate the link between abduction and process mining by exploiting the SCIFF framework for a different purpose: the generation of synthetic logs according to predefined, user-selectable properties. Thanks to its highly expressive notation, SCIFF is able to operate with both procedural and declarative formalisms for expressing the behavioural model. The resulting tool provides logs with several interesting features. In particular, synthetic negative examples can be generated, i.e. examples that are not compliant with the original input model according to user-defined properties, such as the absence of some events, the presence of events in a non-compliant order, or events with wrong timing, duration or resource values. While our previous paper [13] contains some preliminary ideas about the design of a synthetic log generator for positive examples only, the contributions of this work are manifold:

  • a study of the theoretical aspects of positive and negative log generation in both open declarative and closed procedural environments;

  • a detailed description of our log generation methodology as well as underlying foundations of the proposed technique;

  • a standalone log generation prototype available for download together with an evaluation of its performance on an average hardware architecture;

  • a study of the existing approaches to log generation and a discussion of the advantages and shortcomings of each one when compared with the proposed methodology.

2 Preliminaries

A key concept in the field of BPM is that of event log: a collection of observed (i.e. logged) executions of a business process, in terms of all the occurred events. Each event in the log refers to an activity, i.e. a well-defined step in the business process, and is related to a specific process instance. The logged description of a process instance in terms of all its constituent activities is referred to as a trace or case. In this regard, a storing standard has also been developed to clearly define a common way to exchange and analyse event logs: eXtensible Event Stream (XES) [55]. In order to reason upon the business execution, a model of the process is often employed. It is intended as the set of relevant activities that may occur in the observed environment and the constraints on them. This model must be specified through some shared language in order to be understood by different experts from the business domain. The languages to express the business model are often classified according to their procedural or declarative nature. Petri nets [54], for instance, are a widely used example of a procedural modelling language. As such, they tend to represent the model as a flow of activities that can be executed sequentially, alternatively or in parallel, from a beginning to an end. Declarative languages, on the other hand, specify the constraints that should be satisfied by all the traces, without focusing on the exact paths to be followed. One of the most famous examples of declarative languages is Declare [40, 48]. In the following, we use the term positive trace to address a logged process instance that fulfils the requirements of the business model, whereas negative traces diverge from the expected behaviour, thus being non-compliant w.r.t. the model.
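To give a flavour of the storage format, a minimal XES document might look as follows. This is a sketch based on the standard's basic log/trace/event structure; the keys concept:name and time:timestamp come from the XES concept and time extensions, while the activity names and values are invented for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<log xes.version="1.0">
  <trace>
    <!-- the case identifier of this process instance -->
    <string key="concept:name" value="case_1"/>
    <event>
      <string key="concept:name" value="a"/>
      <date key="time:timestamp" value="2020-01-15T10:05:00.000+01:00"/>
    </event>
    <event>
      <string key="concept:name" value="b"/>
      <date key="time:timestamp" value="2020-01-15T10:08:00.000+01:00"/>
    </event>
  </trace>
</log>
```

Each trace collects the events of one process instance; further attributes (resources, data payloads) are attached to events in the same key/value style.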

Our approach to log generation involves the concept of abduction [30]: a general, logic-based technique for making hypotheses, originally conceived as a mechanism for providing explanations given some observations and some rules/constraints (which capture the domain knowledge). The set of hypotheses that explains the observations is usually called an abductive explanation. This form of reasoning is the basis of the SCIFF framework [6] employed here. SCIFF provides a language based on ALP for expressing the domain knowledge, together with a proof procedure supporting the abductive reasoning process. It was initially developed with the goal of run-time checking the compliance of the observed system behaviour w.r.t. a given model. To this end, SCIFF extends abductive reasoning with a few fundamental compliance-related concepts, such as observed events, expected behaviour (in terms of expected events) and prohibitions, and the formal notions of compliance and violation of traces against expectations.

A relevant point for this work is SCIFF's ability to deal with both ground traces and templates (or non-ground traces). A trace is indeed a container for various pieces of information. For example, consider an activity addressing the measurement of the temperature D in a room. The trace containing such an event is likely to report the observed value for D and the timestamp T of the measurement. These data can be seen as variables assuming different values. A trace is said to be ground if all the data in it are bound to a value. A template is instead a trace containing some variables, e.g. generic indications of D and T instead of their effective values. Templates can possibly contain constraints restricting the variable domains. As such, they are a powerful abstraction to represent sets of traces in a compact way.
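The distinction can be sketched in a few lines of code. This is an illustrative Python rendering of ours, not part of any SCIFF implementation; the activity name, domains and constraint are invented for the example:

```python
# Hypothetical illustration: a ground trace vs. a trace template.
# An event is (activity, data, timestamp); in a template the data value D
# and the timestamp T are variables, restricted by constraints.

ground_trace = [("measure_temperature", {"D": 21.5}, 10)]

# Template: D and T are unbound; the constraint restricts their domains.
template = {
    "events": [("measure_temperature", "D", "T")],
    "constraints": lambda D, T: 18 <= D <= 25 and T > 0,
}

def groundings(template, d_values, t_values):
    """Enumerate the ground traces a template stands for (finite domains)."""
    for d in d_values:
        for t in t_values:
            if template["constraints"](d, t):
                yield [("measure_temperature", {"D": d}, t)]

# A template compactly represents a whole set of ground traces:
traces = list(groundings(template, d_values=[17.0, 21.5], t_values=[0, 10]))
```

Enumerating the groundings only terminates because the variable domains here are finite; Sect. 6 discusses how the tool bounds timestamps for the same reason.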

3 Motivations for positive and negative log generation

The availability of an event log and the presence of positive and negative examples in it are very important for the validation of conformance checking and process discovery techniques. Indeed, as conformance checking algorithms aim to identify the traces that are not compliant with a predefined model, the availability of a log containing both positive and negative examples allows the developer to evaluate the performance of a certain technique, for example, in terms of the number of false positives and false negatives obtained, or of the computation time required. Since in real-life logs the information about which traces diverge from the expected behaviour is usually unavailable, the main process discovery techniques are limited to the harder setting of unsupervised learning with positive examples only. The generation of artificial logs with both compliant and non-compliant traces could foster the adoption of more effective supervised techniques. For example, some machine learning schemes based on inductive logic programming (ILP) [39] rely on a set of negative as well as positive examples in order to extract a formal description of the business process model from an event log.

Furthermore, in the field of process discovery, the presence of negative events allows developers to verify the robustness of their approach. For example, suppose we have a log generator able to take a process model as input and produce a log with a mixed composition, reporting both positive and negative traces in predefined percentages, according to configurable parameters. The emitted event log can be used as input to the process discovery algorithm to test its performance by comparing the extracted process model with the original one. Varying the percentage of negative examples in the log gives an evaluation of the robustness of the process discovery algorithm to noise and incompleteness.

In the field of business processes, a strong concern relates to the accuracy of the model, i.e. its ability to capture allowed behaviours and to highlight unwanted actions. For this purpose, a domain expert is often involved but, when the model is composed of several actions, constraints, and possibly concurrent paths, the analysis of accuracy is far from straightforward. For this reason, when the process model is given, the generation of synthetic traces can be useful to provide domain experts with a collection of examples that clearly shows which process instances are allowed or not allowed by the input model, thus yielding quicker feedback about accuracy.

Synthetic logs can also be employed for the conversion of models from a declarative to a procedural formalization (and vice versa). For example, if the business model of a process is provided by means of declarative constraints (e.g. through a Declare formalization), the synthetic traces produced through log generation can be used as input to a process miner, for example, the \(\alpha \)-algorithm [50], to extract a procedural version of the original model (in the mentioned miner, a Petri net model).

Another interesting application is related to the planning problem. Indeed, when the model is known, the generated positive traces are examples of process executions compliant with a certain set of constraints and can therefore be used for production purposes. For example, consider a production line that must complete its job in 10 minutes. If the execution model is known, positive trace generation can suggest which execution paths of the production line can fulfil the requirement on the total production time. Furthermore, being able to emit both ground and non-ground traces, abduction is a perfect candidate to suggest which sets of cases fulfil the constraints, thus tackling the planning problem. It is interesting to note that abduction has been investigated in the past for supporting the planning task [42] and that, in turn, planning itself has recently been proposed for the generation of trace templates and instances [36].

Finally, when the business model is complex, involving several different actions and including constraints on time and resources, even the evaluation of model consistency (i.e. determining whether at least one trace can satisfy all the constraints), which may appear a simple operation in general, can be very difficult for a human being. Log generation makes it possible to immediately identify situations in which no trace can be compliant with the model: in such cases, indeed, as the intersection of the sets of traces satisfying each constraint is empty, the generation produces no output.

4 On different modelling approaches and how they affect log generation

Within the BPM research field, a number of different formalisms and languages have been proposed for the definition of business processes. Different aspects of the business process itself have been researched, leading to the emergence of several proposals, each with its own merits and limits. Providing a complete and detailed view of all these approaches is outside the scope of this work. However, in the following we introduce the few aspects that we deem fundamental to comprehend the issues related to positive and negative trace generation.

4.1 Procedural versus declarative and open versus closed modelling approaches

A first important aspect of log generation is how the flow-related aspects of a business process are defined. At one end of the spectrum, there are the procedural approaches, where the process flow is defined as a number of steps from a beginning to an end. Usually, two artificial activities, the start and the stop, are employed. They are not meant to represent real events, but only to support the concepts of beginning and conclusion of a process instance. Typical constructs of procedural languages provide notions such as sequence of activities, or–split/join (alternative, non-exclusive choices of different flow paths), xor–split/join (exclusive choices), and and–split/join (parallel execution of different flow paths). Figure 1 shows a simple example of a procedural workflow defined using the YAWL graphical language [47], where after the start, activity a should be executed, then either b or {c, d} should be executed (exclusive choice between the two paths), and finally activity e should be executed.

Fig. 1 A simple business process defined through the procedural language YAWL

On the other end of the spectrum, there are declarative approaches, which focus more on the constraints that should be satisfied by all the process executions, rather than on fixing exact flows/paths. In Fig. 2a, a Declare [40] constraint is placed over the two activities of having a coffee and paying for a coffee: the intended meaning is that the order of execution does not matter (notice the absence of arrows), but if you have a coffee then you must pay for it and, vice versa, if you pay for a coffee you must have it. In Fig. 2b, another constraint is shown: after the check-out of a shopping cart, the user is not allowed to add more items. The two small vertical lines in the connection between the two activities stand for a prohibition, while the presence of an arrow also sets a temporal ordering.

Fig. 2 Two simple examples of Declare constraints

Another important aspect regards the degree of openness allowed by the modelling approach. Closed models are characterized by the fact that only the specified activities, at the specified time instants, can be executed. For example, let us consider again the model depicted in Fig. 1: a closed modelling approach would imply that only the envisaged activities (namely, {a,b,c,d,e}) are allowed to be executed, and any other activity is prohibited. Such approaches are typical of a number of application domains, for example, secure communication protocols or bank protocols for financial transactions, where even the smallest bit of information that is not envisaged should trigger alarms and invalidate the interaction.

At the opposite end, open approaches are characterized by the fact that they specify both activities that should be executed and activities that should not be executed (i.e. prohibitions). When nothing is specified about an activity, the usual intended meaning is that the appearance of such an activity in a case does not influence the compliance (or non-compliance) of the trace w.r.t. the model. As an example, let us consider the process depicted in Fig. 2a: besides having a coffee and paying for it, any other activity, such as chatting with the barman, is allowed as well. Theoretically, open approaches allow for the execution of any activity for which no prohibition has been expressed. However, for practical reasons, many approaches allow for the execution of any (non-forbidden) activity within a specified set: in other words, the set of allowed activities is defined a priori and finite. This paper adopts the same view: we will generate only traces whose activities belong to a user-defined finite set \(\mathbb {A}\).

4.1.1 Relation between two orthogonal dimensions

A noteworthy consideration regards the relation between these orthogonal dimensions: procedural vs. declarative models and open vs. closed approaches. A number of works have identified procedural models with closed ones and declarative models with open ones. This association is by no means mandatory. Indeed, a declarative approach can be used to model a closed process, provided that a set of additional constraints ensures that only the specified activities are allowed and any other is prohibited. Likewise, a procedural formalism can be used to express an open model by relaxing some constraints.

Although our approach remains general and can be applied to all the cases, for convenience’s sake, in this paper we focus on the most popular combination of the orthogonal dimensions, i.e. procedural closed models and declarative open ones. The interested reader can refer to [37] for an in-depth discussion on closed vs. open models.

4.2 Positive and negative traces w.r.t. declarative open process models

The open/closed nature of the model affects the reasoning on positive and negative traces. In an open model, if nothing is said about a certain activity X, then any positive trace, whether or not it contains X, is compliant with the process model. This observation leads to two further considerations:

  1. one might be tempted to ignore traces containing activities like X. However, if X is not in the model (i.e. it does not appear in any constraint), but a process designer regards it as an activity that could happen, he might be interested in observing both traces that contain X and traces that do not;

  2. the number of traces that are compliant with an open declarative process model is potentially infinite w.r.t. activities like X.

To address the issues above, we restrict the generation to traces containing only events belonging to a specified finite set \(\mathbb {A}\) of activities. For example, let us consider the process model shown in Fig. 2a. For the sake of the example, we could say:

$$\begin{aligned} \mathbb {A}=\{{\textsf {have a coffee}}, \textsf {pay a coffee}, {\textsf {chat with the barman}}\}, \end{aligned}$$

where the activity chat with the barman is not subject to any constraint. In this case, a subset of all the positive traces would be the following one:

$$\begin{aligned} \tau _1&= \emptyset \\ \tau _2&= [{\textsf {have a coffee}}, {\textsf {pay a coffee}}]\\ \tau _3&= [{\textsf {chat with the barman}}, {\textsf {have a coffee}}, {\textsf {pay a coffee}}]\\ \tau _4&= [{\textsf {have a coffee}}, {\textsf {chat with the barman}}, {\textsf {pay a coffee}}]\\ \tau _5&= [{\textsf {have a coffee}}, {\textsf {pay a coffee}}, {\textsf {chat with the barman}}]\\ \tau _6&= [{\textsf {pay a coffee}}, {\textsf {have a coffee}}]\\ \ldots&\end{aligned}$$

Negative traces are those ones that violate one or more constraints of the process model. If we consider again the example shown in Fig. 2a, negative traces are those that violate the only constraint present: they should contain pay a coffee without have a coffee or should contain have a coffee without pay a coffee. A subset of all the negative traces would be the following one:

$$\begin{aligned} \overline{\tau _7}&= [{\textsf {have a coffee}}]\\ \overline{\tau _8}&= [{\textsf {pay a coffee}}]\\ \overline{\tau _9}&= [{\textsf {chat with the barman}}, \, {\textsf {have a coffee}}]\\ \overline{\tau _{10}}&= [{\textsf {have a coffee}}, \, {\textsf {chat with the barman}}]\\ \overline{\tau _{11}}&= [{\textsf {chat with the barman}}, {\textsf {pay a coffee}}]\\ \overline{\tau _{12}}&= [{\textsf {pay a coffee}}, \, {\textsf {chat with the barman}}]\\ \dots&\end{aligned}$$

where the overline on the trace symbol \(\tau _i\) indicates that the trace violates the process model.
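The classification above can be mimicked with a few lines of code. This is our simplification, not the SCIFF encoding: for the coexistence constraint of Fig. 2a we merely test the presence of the two activities, ignoring multiple occurrences and any other activity in \(\mathbb {A}\):

```python
# Sketch (our illustration): classify traces against the coexistence
# constraint of Fig. 2a -- "have a coffee" and "pay a coffee" must either
# both occur or both be absent; "chat with the barman" is unconstrained.

def is_positive(trace):
    """True iff the trace satisfies the coexistence constraint."""
    return ("have a coffee" in trace) == ("pay a coffee" in trace)

assert is_positive([])                                            # tau_1
assert is_positive(["have a coffee", "pay a coffee"])             # tau_2
assert not is_positive(["have a coffee"])                         # tau_7
assert not is_positive(["chat with the barman", "pay a coffee"])  # tau_11
```

The unconstrained activity never flips the outcome: it may appear in positive and negative traces alike, which is exactly the open-model behaviour described above.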

4.3 Positive and negative traces w.r.t. procedural closed process models

Positive traces w.r.t. procedural models are only those that are allowed and explicitly considered by the model. Intuitively, it suffices to walk the structure (the graph) of the process model to get instances of positive traces. For example, if we consider the process model shown in Fig. 1, only two positive traces are allowed:

$$\begin{aligned} \tau _{13}&= [{\textsf {a}}, {\textsf {b}}, {\textsf {e}}]\\ \tau _{14}&= [{\textsf {a}}, {\textsf {c}}, {\textsf {d}}, {\textsf {e}}]. \end{aligned}$$

Similarly to the case of open declarative models, negative traces w.r.t. closed procedural models are those traces that violate one or more constraints. More in detail, procedural constraints are the flow constructs explicitly defined in the process model, plus the constraint (implicit in the closed nature of the model) that anything not considered by the model is strictly prohibited. Thus, negative traces can be grouped into two sets: (i) traces that contain only (possibly, a subset of) events explicitly mentioned in the model and that violate one or more flow structures; and (ii) traces that contain unknown events. Referring to Fig. 1, two examples of negative traces that belong to group (i) would be:

$$\begin{aligned} \overline{\tau _{15}}&= [{\textsf {a}},\, {\textsf {e}}] \\ \overline{\tau _{16}}&= [{\textsf {a}},\, {\textsf {d}}, \, {\textsf {c}},\, {\textsf {e}}] \\ \ldots&\end{aligned}$$

In \(\overline{\tau _{15}}\), the activity b is missing; in \(\overline{\tau _{16}}\), instead, activities c and d are executed in the wrong order. The set of negative traces belonging to group (ii) is instead potentially infinite, since the set of activities not explicitly envisaged by the procedural model is potentially infinite. To cope with this aspect, we restrict our approach by again considering a finite set \(\mathbb {A}\) of activities that will be used for generating the traces. Referring again to Fig. 1, such a set could be:

$$\begin{aligned} \mathbb {A}=\{{\textsf {a}},\, {\textsf {b}},\, {\textsf {c}}, \, {\textsf {d}},\, {\textsf {e}},\, {\textsf {f}},\, {\textsf {g}}\}. \end{aligned}$$

The process model does not envisage activities \(\{{\textsf {f}}, {\textsf {g}}\}\): any trace containing one or more of these activities is a negative trace. For example,

$$\begin{aligned} \overline{\tau _{17}}&= [{\textsf {a}},\, {\textsf {f}}, \, {\textsf {b}},\, {\textsf {e}}] \\ \overline{\tau _{18}}&= [{\textsf {a}},\, {\textsf {c}}, \, {\textsf {d}},\, {\textsf {e}},\, {\textsf {g}}] \\ \ldots&\end{aligned}$$
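As a compact illustration (ours, not the generator's code), compliance with the closed model of Fig. 1 over the finite alphabet \(\mathbb {A}\) boils down to membership in the set of allowed walks of the graph:

```python
# Sketch: compliance with the closed procedural model of Fig. 1.
# Under the closed assumption, only the two walks of the graph are
# positive; every other sequence over A = {a,...,g} is negative.

POSITIVE = {("a", "b", "e"), ("a", "c", "d", "e")}   # tau_13, tau_14

def classify(trace):
    """Label a trace w.r.t. the closed model of Fig. 1."""
    return "positive" if tuple(trace) in POSITIVE else "negative"

assert classify(["a", "b", "e"]) == "positive"
assert classify(["a", "e"]) == "negative"            # b missing (tau_15)
assert classify(["a", "d", "c", "e"]) == "negative"  # wrong order (tau_16)
assert classify(["a", "f", "b", "e"]) == "negative"  # unknown event f (tau_17)
```

Note how both groups of negative traces, (i) flow violations and (ii) unknown events, fall out of the same membership test: the closed model admits nothing beyond its explicit walks.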

5 The SCIFF abductive capabilities

In order to clearly explain our generation approach, we first briefly recall the main concepts behind the SCIFF abductive capability. The interested reader can refer to [6] for a complete treatment of this topic. Formally, a SCIFF specification is a triple \(\langle \mathcal {KB},\, \mathcal {A},\, \mathcal {IC}\rangle \), where:

  • \(\mathcal {KB}\) is a knowledge base (i.e. a Logic Program as for [34]);

  • \(\mathcal {A}\) is a set of abducible predicates with functor \(\mathbf{ABD}\), \(\mathbf{E}\), or \(\mathbf{EN}\);

  • \(\mathcal {IC}\) is a set of Integrity Constraints (ICs).

Among the elements of \(\mathcal {A}\), predicates with functor \(\mathbf{ABD}\) correspond to usual abducibles as in ALP [30], i.e. predicates that can be hypothesized. \(\mathbf{E}\) and \(\mathbf{EN}\) are particular abducibles respectively used for modelling positive expectations—i.e. expectations about the happening of certain events—and negative expectations—i.e. prohibitions about the happening of certain events.

Happened events are represented through predicates with functor \(\mathbf H \). Hence, in SCIFF a trace is a set of predicates \(\mathbf H (EvDesc, T)\), where each predicate stands for the observation of the happening of an event described by EvDesc at timestamp T. For example, trace \(\tau _2\) introduced in Sect. 4.2 would be represented in SCIFF as:

$$\begin{aligned} \tau _2 = [\mathbf{H}({\textsf {have\_a\_coffee}}, 5), \mathbf{H}({\textsf {pay\_a\_coffee}}, 8)], \end{aligned}$$

meaning that at time instant 5 the event have_a_coffee has been observed, and then, at time instant 8, the event pay_a_coffee has been observed too.

\(\mathcal {IC}\) is a set of forward rules of the form \(body\rightarrow head\). They are used to dynamically link the observation of the happening of events with positive and negative expectations. Roughly speaking, when body becomes true, head must also become true. The body contains conjunctions of special terms with functors H, E/EN or ABD, while the head is made of disjunctions of conjunctions of terms with functors E, EN or ABD. For example, the following IC

$$\begin{aligned} \mathbf H ({\textsf {have\_a\_coffee}},T_a) \rightarrow \mathbf E ({\textsf {pay\_a\_coffee}},T_b) \end{aligned}$$
(1)

states that every time the event have_a_coffee is observed at a timestamp \(T_a\), an event pay_a_coffee is expected to happen (to be observed) at some timestamp \(T_b\).

The IC shown in (1) already provides a powerful hint of how the SCIFF proof procedure determines whether a trace is compliant w.r.t. a model. Let us suppose a process model is described in terms of (1). Also, let us suppose to observe the event \(\mathbf{H}({\textsf {have\_a\_coffee}}, 12)\). The event triggers IC (1): as a consequence, the positive expectation \(\mathbf E ({\textsf {pay\_a\_coffee}},T_b)\) is abduced. The SCIFF proof procedure then waits for further events. If an event happens that matches the positive expectation, we say that the expectation is fulfilled. If no matching event happens, the expectation is violated. We do not report here the definitions of compliance and violation of a trace w.r.t. a model: the interested reader can refer to [6]. Rather, we highlight the following: given a process model described with a SCIFF specification, a trace is compliant with (or violates) that specification if it is compliant with (violates) the positive and negative expectations generated by the ICs.
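A drastically simplified sketch of this fulfilment check for IC (1) could read as follows. This is our illustration only: the real SCIFF proof procedure handles arbitrary ICs, constraints and negative expectations, whereas here each observed have_a_coffee simply abduces one expectation of a pay_a_coffee event (IC (1), as written, places no constraint on \(T_b\)):

```python
# Sketch (our simplification of the SCIFF semantics): each observed
# "have_a_coffee" abduces an expectation E(pay_a_coffee, Tb); the trace
# is compliant iff every such expectation can be fulfilled by a distinct
# observed "pay_a_coffee" event.

def compliant(trace):
    """trace: list of (event, timestamp) pairs."""
    expectations = [t for (e, t) in trace if e == "have_a_coffee"]
    pays = [t for (e, t) in trace if e == "pay_a_coffee"]
    # Fulfilment reduces to a counting argument here, since IC (1)
    # does not constrain the timestamp Tb of the expected event.
    return len(pays) >= len(expectations)

assert compliant([("have_a_coffee", 12), ("pay_a_coffee", 15)])
assert not compliant([("have_a_coffee", 12)])   # expectation violated
```

Adding the missing temporal constraint \(T_b > T_a\) would require matching each expectation against a strictly later pay event, which is exactly the kind of reasoning the constraint solver inside SCIFF performs.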

In this paper, the goal is to generate traces that are compliant with (or violate) a given model. To this end, we resort to a technique similar to that of previous works: we start from a process model represented in terms of ICs. However, instead of using positive and negative expectations, the ICs are defined in terms of abducibles only, each abducible representing the hypothesis that an event happens. For example, let us model the procedural process shown in Fig. 1. First of all, the sequence between the start and the execution of activity a would be modelled as:

$$\begin{aligned} ic_1:\quad \mathbf{ABD}(\textsf {start}, T_s) \rightarrow \mathbf{ABD}({\textsf {a}}, T_a) \wedge T_a > T_s. \end{aligned}$$

Then, the xor disjunction between b and c would be:

$$\begin{aligned} ic_2:\quad \mathbf{ABD}({\textsf {a}}, T_a) \rightarrow \mathbf{ABD}({\textsf {b}}, T_b) \wedge T_b> T_a \;\vee \; \mathbf{ABD}({\textsf {c}}, T_c) \wedge T_c > T_a. \end{aligned}$$

Finally, further sequence constraints are needed:

$$\begin{aligned} ic_3:\quad \mathbf{ABD}({\textsf {b}}, T_b) \rightarrow&\mathbf{ABD}({\textsf {e}}, T_e) \wedge T_e> T_b.\\ ic_4:\quad \mathbf{ABD}({\textsf {c}}, T_c) \rightarrow&\mathbf{ABD}({\textsf {d}}, T_d) \wedge T_d> T_c.\\ ic_5:\quad \mathbf{ABD}({\textsf {d}}, T_d) \rightarrow&\mathbf{ABD}({\textsf {e}}, T_e) \wedge T_e> T_d.\\ ic_6:\quad \mathbf{ABD}({\textsf {e}}, T_e) \rightarrow&\mathbf{ABD}({\textsf {stop}}, T_s) \wedge T_s > T_e. \end{aligned}$$

Summing up, the process model in Fig. 1 would be represented by the set \(\{ic_1,ic_2,ic_3,ic_4,ic_5,ic_6\}\). Given the start symbol as input, the SCIFF Proof Procedure would produce the following two trace templates:

$$\begin{aligned} \tau _{\alpha }&= [\mathbf{ABD}({\textsf {a}},T_a), \mathbf{ABD}({\textsf {b}},T_b), \mathbf{ABD}({\textsf {e}},T_e), T_a<T_b<T_e]\\ \tau _{\beta }&= [\mathbf{ABD}({\textsf {a}},T_a), \mathbf{ABD}({\textsf {c}},T_c), \mathbf{ABD}({\textsf {d}},T_d), \mathbf{ABD}({\textsf {e}},T_e),T_a<T_c<T_d<T_e], \end{aligned}$$

where \(\tau _{\alpha }\) matches \(\tau _{13}\) and \(\tau _{\beta }\) matches \(\tau _{14}\), both introduced in Sect. 4.3.

Finally, notice that \(\tau _{\alpha }\) and \(\tau _{\beta }\) are what we call trace templates, since the timestamps are not grounded to specific values, but are rather variables that can be assigned any set of values respecting the inequalities.
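The derivation of the two templates can be mimicked by a small forward-chaining search. This is an illustrative sketch of ours, not the proof procedure itself: each rule abduces its head activity, the xor-split in \(ic_2\) opens two alternative branches explored by backtracking, and timestamps are left implicit as ordered variables (the order of the list encodes the inequalities):

```python
# Sketch: forward chaining over ic1..ic6 of Fig. 1's SCIFF encoding.
# Each rule maps a body activity to its alternative heads; in this model
# every alternative abduces a single activity (a real and-split would
# need conjunctive heads, omitted here for brevity).

RULES = {
    "start": [["a"]],          # ic1
    "a":     [["b"], ["c"]],   # ic2: xor-split, two alternatives
    "b":     [["e"]],          # ic3
    "c":     [["d"]],          # ic4
    "d":     [["e"]],          # ic5
    "e":     [["stop"]],       # ic6
}

def templates(current="start", prefix=()):
    """Yield activity sequences of all templates, branching on alternatives."""
    if current == "stop":
        yield [act for act in prefix if act not in ("start", "stop")]
        return
    for alternative in RULES[current]:
        for nxt in alternative:
            yield from templates(nxt, prefix + (nxt,))

# Querying from the start symbol yields tau_alpha and tau_beta.
assert list(templates()) == [["a", "b", "e"], ["a", "c", "d", "e"]]
```

Backtracking over the alternatives of \(ic_2\) is what lets the proof procedure emit one template, then be asked for the next, as described in Sect. 6.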

6 Generation of a synthetic log through SCIFF

In order to simplify the comprehension of our approach, we first introduce the reader to the generation of positive traces from a high-level perspective. The process consists of three steps:

  1. the process model is properly translated into a SCIFF specification;

  2. the SCIFF proof procedure takes as input the process model expressed as a SCIFF specification and generates the trace templates;

  3. trace templates are grounded with values respecting the constraints imposed by the model.

Translation into SCIFF The translation of the process model into a SCIFF specification depends on the language adopted for the model definition. In previous works, we investigated how closed procedural approaches can be easily translated into SCIFF [14, 15], even with support for activity data and temporal constraints [16, 18]. In those works, the focus was on establishing whether a partial trace could possibly be considered compliant. If it was not compliant because of missing events, we showed how hypothetical reasoning and abduction could be exploited to determine a possible set of further activities that, once merged with the initial trace, would make it compliant w.r.t. the model. However, what if the partial trace is empty? The approach presented here is an extension of the previous ones, where the starting point is an empty trace and SCIFF is queried about the existence of a compliant trace. Differently from [14, 15], here there is no need to verify compliance, since the initial trace is empty.

Generation of trace templates Starting from a SCIFF model specification, the proof procedure is queried about the existence of a compliant trace. As also highlighted in [18], thanks to its abductive nature, the same generation process can also be carried out starting from partially specified traces. SCIFF generates each trace through hypothetical reasoning, i.e. by abducing the events according to the rules given in the specification. At this stage, timestamps and activity data (if present) are indicated as variables, thus realizing trace templates. Once a solution (i.e. a first template) is found, SCIFF is asked to look for another one: iteratively, all the templates are emitted. To ensure the termination of this iterative process, we ask the user to provide as input a piece of meta-information about the process: a maximum number of events for each trace must be specified. This avoids problems due, for example, to the presence of (non-deterministic) loops in the process model.

Grounding the trace templates Finally, SCIFF substitutes the variables in the templates with all their possible values. Of course, a variable may have an infinite set of possible values. For example, consider a timestamp of an event whose only constraint is to be greater than zero: infinitely many values satisfy such a constraint. For this reason, we require the user to provide a further piece of meta-information as input: the maximum allowed timestamp (the minimum is automatically assumed to be zero). Together with the assumption that timestamps are represented by integer numbers, the maximum timestamp limit ensures the termination of the grounding process.
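As an illustration of the grounding step, the following Python sketch (not the actual SCIFF machinery, which works on constraint stores) enumerates all integer-timestamp groundings of a template under a user-supplied maximum timestamp; the helper name `ground_template` and the constraint representation are our own:

```python
from itertools import product

def ground_template(events, constraints, max_ts):
    """Enumerate all groundings of a trace template.

    events      -- list of activity names, one per templated event
    constraints -- predicates over the timestamp tuple (e.g. ordering)
    max_ts      -- user-supplied maximum timestamp (minimum is 0)
    """
    for ts in product(range(max_ts + 1), repeat=len(events)):
        if all(c(ts) for c in constraints):
            yield list(zip(events, ts))

# Template: activity b must follow activity a (T_b > T_a), max timestamp 2.
traces = list(ground_template(["a", "b"], [lambda t: t[1] > t[0]], 2))
# Yields the groundings with (T_a, T_b) in {(0,1), (0,2), (1,2)}.
```

The bound `max_ts`, combined with integer timestamps, makes the search space finite, mirroring the termination argument above.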

As regards the generation of negative examples, since in our approach models are represented by means of SCIFF’s Integrity Constraints (ICs), a negative trace can be seen as a trace that violates one or more ICs. Hence, to generate a negative trace, it suffices to take the initial SCIFF specification, obtain a new specification by negating one or more ICs, and use it to generate traces. The output will be compliant with the model containing the negated ICs; thus, it will violate the initial model.

First of all, it is important to understand what it means to negate an IC. In SCIFF, ICs are (forward) implications: they are violated when the premises are true and the consequences are false. In SCIFF, the consequences (named the head) are disjunctions of conjunctions of literals and abducibles. Let us focus on a simple example of one IC whose head is made of two conjuncts:

$$\begin{aligned} \mathbf{ABD }(a,T_a) \rightarrow \mathbf{ABD }(b,T_b) \wedge T_b > T_a. \end{aligned}$$
(2)

The intended meaning is that the execution of activity a should be followed by the execution of activity b. In other words, every time we observe a, we should also observe b afterwards. The head of the IC contains two conjuncts: the first states that whenever a happens, b should also happen; the second requires that b happens afterwards. So, (2) can be violated in two ways:

  1. (i) by having a trace containing activity a, and not containing activity b;

  2. (ii) by having a trace containing a, and containing b happened before a.Footnote 3

The two cases lead to two different trace templates, each one violating the original IC (2) in its own way.
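To make the two violation modes concrete, here is a small Python sketch (outside the SCIFF formalism; the function name and trace representation as (activity, timestamp) pairs are illustrative) that classifies how a trace violates the response constraint of IC (2):

```python
def response_violation(trace, a="a", b="b"):
    """Return how `trace` violates response(a, b), or None if compliant.

    Case "(i)":  a occurs but b never does.
    Case "(ii)": a occurs and no b occurs after it (only before).
    """
    a_times = [t for act, t in trace if act == a]
    b_times = [t for act, t in trace if act == b]
    for ta in a_times:
        if not any(tb > ta for tb in b_times):
            return "(i)" if not b_times else "(ii)"
    return None  # every a is followed by some b: compliant

assert response_violation([("a", 1), ("c", 2)]) == "(i)"
assert response_violation([("b", 0), ("a", 1)]) == "(ii)"
assert response_violation([("a", 1), ("b", 2)]) is None
```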

Given this explanation of what IC negation implies, the generation process takes place as follows. The approach starts from a SCIFF specification \(\mathcal {S}=\langle \mathcal {KB}, \mathcal {A}, \mathcal {IC} \rangle \) of a process model, where \(\mathcal {IC}\) is a set of one or more ICs \(ic_i\):

$$\begin{aligned} \mathcal {IC}\equiv \{ ic_1, ic_2, \ldots , ic_n \}. \end{aligned}$$
(3)

New sets \(\overline{\mathcal {IC}_j}\) are obtained by negating one or more \(ic_i, 1\le i \le n\). Notice, however, that, as explained for (2), negating a single \(ic_i\) can lead to different ICs. Hence, for each \(ic_i\) we could obtain \(\overline{ic_{i,1}}, \ldots , \overline{ic_{i,m_i}}\). For example, suppose we negate \(ic_1\), and that this leads to three different ICs. We would get:

$$\begin{aligned} \overline{\mathcal {IC}_1}&\equiv \{ \overline{ic_{1,1}}, ic_2, \ldots , ic_n \}\\ \overline{\mathcal {IC}_2}&\equiv \{ \overline{ic_{1,2}}, ic_2, \ldots , ic_n \}\\ \overline{\mathcal {IC}_3}&\equiv \{ \overline{ic_{1,3}}, ic_2, \ldots , ic_n \}. \end{aligned}$$

Then, we could negate \(ic_2\) (suppose this is possible in two different ways):

$$\begin{aligned} \overline{\mathcal {IC}_4}&\equiv \{ ic_1, \overline{ic_{2,1}}, \ldots , ic_n \}\\ \overline{\mathcal {IC}_5}&\equiv \{ ic_1, \overline{ic_{2,2}}, \ldots , ic_n \}\\ \ldots&\end{aligned}$$

The number of models \(\overline{\mathcal {IC}_j}\) that can be obtained is given by the cardinality of the power set of \(\mathcal {IC}\), multiplied by the cardinality of the power set of all the possible ways of negating one or more ICs. SCIFF specifications \(\overline{\mathcal {S}_{j}} =\langle \mathcal {KB}, \mathcal {A}, \overline{\mathcal {IC}_{j}} \rangle \) are obtained accordingly, and each one is used to generate traces. The number of \(\overline{\mathcal {S}_{j}}\) obtained in this way can be very large. However, not all these specifications will allow for the existence of traces: indeed, it might happen that, by negating two different ICs, an inconsistent model is obtained, thus not allowing the existence of any trace compliant with that model. It might also happen that a trace generated from a \(\overline{\mathcal {IC}_{j}}\) is again compliant with the initial \(\mathcal {IC}\). For example, if the initial model admits two alternative paths (i.e. \(\mathcal {IC}\) includes a xor constraint) and the specification derived from a \(\overline{\mathcal {IC}_{j}}\) negates the constraints of only one path, the traces generated following the other, unmodified path are again compliant with the initial model. To remove these actually positive traces, after the generation, each trace is checked again for non-compliance with the original specification \(\mathcal {S}\). Algorithm 1 summarizes the steps of negative trace generation.
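The negative-trace generation loop just described can be sketched in Python as follows, restricted for brevity to negating one IC at a time; the parameters `negations`, `generate` and `is_compliant` are hypothetical stand-ins for the SCIFF machinery:

```python
def generate_negative_traces(ics, negations, generate, is_compliant):
    """Sketch of negative trace generation (one negated IC at a time).

    ics          -- the original list of integrity constraints
    negations    -- maps an IC to the list of its possible negations
    generate     -- produces traces from a (possibly negated) IC set
    is_compliant -- checks a trace against the original ICs
    """
    negative = []
    for i, ic in enumerate(ics):
        for neg_ic in negations(ic):
            # Replace one IC with one of its negated forms.
            variant = ics[:i] + [neg_ic] + ics[i + 1:]
            for trace in generate(variant):
                # Discard traces that are still compliant with the
                # original model (e.g. produced via an untouched xor
                # branch); inconsistent variants yield no traces at all.
                if not is_compliant(trace, ics):
                    negative.append(trace)
    return negative
```

The final `is_compliant` filter implements the closing check described above, removing "actually positive" traces.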

Algorithm 1 (figure)

6.1 Positive and negative trace generation for open declarative processes

Among the several languages proposed for modelling open declarative processes, we focus on Declare [40]. In previous works [37, 38], we showed how Declare constructs can be mapped into SCIFF while preserving the notion of compliance. Here, the shift of perspective towards log generation motivates some changes in the way we represent the Declare constructs. Moreover, a further issue is that, as explained in Sect. 4.1, in open declarative models there can be activities that are part of the process but are not part of any constraint. When generating traces, these activities are also of interest and should appear in the log.

6.1.1 Generating positive traces

As regards the first step of log generation, i.e. the translation of the Declare model into a SCIFF specification, it stems from the one proposed in [38], which was oriented to compliance monitoring. Differently from [38], here the focus is restricted to the problem of generating traces; thus, two variations are introduced: (i) there is no longer a need for happened events and positive/negative expectations, since every SCIFF IC is expressed in terms of abducibles only; (ii) prohibitions can be expressed in terms of a particular type of IC, named denials, as for instance in the neg_response(A,B) constraint, where the fail constant is used with the special purpose of signalling an inconsistency or a failure. Table 1 reports a few examples of how the Declare constructs can be easily mapped.

Table 1 Mapping of some relevant Declare formulas onto SCIFF

The translation of the Declare constraints alone is not enough: activities that are not subject to any existence constraint, or activities that are not mentioned at all by any constraint (but are allowed due to the openness of the approach), would not appear in any trace. To address this issue, in the SCIFF specification of a Declare model we add a further set of ICs. In particular, we assume a finite set \(\mathbb {A}\) of all activities that have to be considered. For each activity \(\textsf {x}\in \mathbb {A}\), we add the following:

$$\begin{aligned}&true \rightarrow \ true \vee \mathbf{ABD}(\textsf {x},T_x), \end{aligned}$$
(4)
$$\begin{aligned}&\mathbf{ABD}(\textsf {x},T_1) \rightarrow \ {true} \vee \mathbf{ABD}(\textsf {x},T_2) \wedge T_2>T_1. \end{aligned}$$
(5)

IC (4) ensures that an activity \(\textsf {x}\) either does not appear in a trace or is included at least once. IC (5) instead ensures that, once an activity \(\textsf {x}\) has been included in a trace, there is also the possibility of generating traces with two, three, \(\ldots \), many instances of \(\textsf {x}\). Notice that, to avoid termination issues, we exploit an implementation feature of the SCIFF proof procedure: it explores the disjuncts following the syntactic order in which they have been defined. By always placing the true option as the first disjunct in both IC (4) and (5), we ensure that the procedure terminates; only when further solutions are requested does it explore the other disjuncts.
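The combined effect of ICs (4) and (5), with the true disjunct explored first, can be mimicked by a simple Python generator (an illustrative sketch with timestamps omitted; the function name is our own): the first solution contains no occurrence of x, and further occurrences are added only on demand, up to the maximum trace length that guarantees termination:

```python
def with_optional_activity(base_trace, x, max_len):
    """Enumerate traces containing 0, 1, 2, ... occurrences of x.

    Yielding the zero-occurrence trace first mirrors the SCIFF proof
    procedure exploring the `true` disjunct before the others: a first
    solution is found immediately, and more occurrences of x appear
    only when further solutions are requested. The max_len bound plays
    the role of the maximum number of events per trace.
    """
    count = 0
    while len(base_trace) + count <= max_len:
        yield base_trace + [x] * count
        count += 1

gen = with_optional_activity(["a", "b"], "x", 4)
assert next(gen) == ["a", "b"]            # the `true` disjunct: no x
assert next(gen) == ["a", "b", "x"]       # IC (4): one occurrence of x
assert next(gen) == ["a", "b", "x", "x"]  # IC (5): a further occurrence
```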

6.1.2 Generating negative traces

The generation of negative traces follows exactly the schema presented in Algorithm 1. For each Declare constraint, one or more ICs are generated, corresponding to the possible ways of negating the constraint. For example, let us consider again IC (2), which indeed corresponds to a response(A,B) Declare constraint (i.e. the execution of an activity \(\textsf {A}\) must be followed by the execution of an activity \(\textsf {B}\)). We already discussed that there are two significant ways (out of three) of negating the IC. Formally, they are:

$$\begin{aligned} {true}&\rightarrow \mathbf{ABD}(a,T_a) \nonumber \\ \mathbf{ABD}(a,T_a) \wedge \mathbf{ABD}(b,T_b)&\rightarrow {\textit{fail}} \end{aligned}$$
(6)

and

$$\begin{aligned} {true}&\rightarrow \mathbf{ABD}(a,T_a) \nonumber \\ \mathbf{ABD}(a,T_a) \wedge \mathbf{ABD}(b,T_b) \wedge T_b>T_a&\rightarrow {\textit{fail}}.\\ \mathbf{ABD}(a,T_a)&\rightarrow \mathbf{ABD}(b,T_b) \wedge T_b<T_a. \nonumber \end{aligned}$$
(7)

IC (6) states the prohibition of the occurrence of \(\textsf {b}\) if \(\textsf {a}\) has happened. The reader might wonder why the denial is needed: if we simply refrained from generating any activity \(\textsf {b}\), we would already achieve the goal of a trace with \(\textsf {a}\) and without \(\textsf {b}\). However, other Declare constraints might call for the occurrence of \(\textsf {b}\). The denial guarantees that the traces will not contain \(\textsf {b}\), thus ensuring the violation of IC (2).

A final consideration regards inconsistent models. When negating one or more ICs, inconsistent models may be generated: for example, a model with an IC requiring the presence of activity \(\textsf {a}\) and also an IC prohibiting \(\textsf {a}\). From the perspective of the SCIFF proof procedure, inconsistent models are not an issue: they simply will not produce any trace.

6.2 Positive and negative trace generation for closed procedural processes

A huge number of specification languages for closed procedural processes have been proposed; the number of approaches is so vast that listing them would be out of the scope of this work. We restrict ourselves to a well-known class of processes, often referred to as structured process models [31] with unique tasks [50]. Although our approach could in theory deal with unstructured models, we prefer to adopt this restriction since structured models have a clearer semantics, in particular when dealing with loop patterns and termination conditions. Such a choice comes at the price that not all real processes can be easily represented, and more complex models would be needed to capture them. The choice of unique tasks, instead, is due to the simpler management of the translation of the process model into our framework: indeed, any process model with repeated activities can always be translated into one with unique tasks by simply renaming the repeated activities.
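The renaming of repeated activities mentioned above is straightforward; a minimal Python sketch (the `act_2`, `act_3`, ... naming scheme is an arbitrary choice of ours):

```python
from collections import Counter

def make_tasks_unique(activities):
    """Rename repeated activities so every task label is unique,
    e.g. ["a", "b", "a"] -> ["a", "b", "a_2"]."""
    seen = Counter()
    renamed = []
    for act in activities:
        seen[act] += 1
        # First occurrence keeps its name; later ones get a suffix.
        renamed.append(act if seen[act] == 1 else f"{act}_{seen[act]}")
    return renamed

assert make_tasks_unique(["a", "b", "a", "c", "a"]) == ["a", "b", "a_2", "c", "a_3"]
```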

6.2.1 Generating positive traces

The generation of positive traces from procedural models follows the general three-step procedure. As regards the translation of structured process models into SCIFF, we refer to the previous works [13, 15, 18]. As for template generation, it is worth noting that, differently from open approaches, in closed procedural models only those activities that are explicitly mentioned by the model are allowed in the trace. Hence, the specification of the procedural model in terms of SCIFF constraints is sufficient to ensure the generation of the traces. There is no need for additional constraints such as ICs (4) and (5).

6.2.2 Generating negative traces

When negating flow constructs of a closed procedural model, it is important to consider that the correct chaining of events might be broken, and certain negative traces might not be generated. Let us explain the issue through the workflow example shown in Fig. 3, where a very simple sequence of activities \(\textsf {a}\), \(\textsf {b}\), \(\textsf {c}\) is proposed.

Fig. 3 A closed, procedural process defined in YAWL

Let us focus, for the sake of comprehension, on generating those traces that violate the sequence constraint between activities \(\textsf {a}\) and \(\textsf {b}\) because they do not contain \(\textsf {b}\). These traces are generated through IC (6). Intuitively, they would be:

$$\begin{aligned} \overline{\tau _{19}}&= [ {\textsf {a}} ]\\ \overline{\tau _{20}}&= [ {\textsf {a}}, {\textsf {c}} ]. \end{aligned}$$

Since the sequence construct between \(\textsf {b}\) and \(\textsf {c}\) is represented as:

$$\begin{aligned} \mathbf{ABD}(b,T_b) \rightarrow \mathbf{ABD}(c,T_c) \wedge T_c>T_b, \end{aligned}$$

in our formalism the generation of activity \(\textsf {c}\) is triggered by the presence of \(\textsf {b}\). Its absence would clearly lead to not generating \(\textsf {c}\). Practically, only trace \(\overline{\tau _{19}}\) would be generated, while we would miss trace \(\overline{\tau _{20}}\).

We overcome the issue through the same approach discussed in Sect. 6.1.1: for each activity \(\textsf {x} \in \mathbb {A}\) (thus also for \(\textsf {c}\)), we add the two ICs (4) and (5). This choice leads to the generation of a number of non-compliant traces due to the presence of events at the wrong position in the trace and, in the example, to the generation of \(\overline{\tau _{20}}\).

It is worth underlining that this is not the only way in which we could have dealt with the issue. For example, a simple alternative could have been to generate positive traces and then randomly insert further interesting activities from \(\mathbb {A}\). Nonetheless, we believe that our approach comes with the advantage of using the same logic framework for dealing with both negative and positive traces. Indeed, constraints (4) and (5) are used: (i) in the case of open models, to insert into the trace additional, allowed activities not mentioned by any constraint, thus generating other positive traces; (ii) in the case of closed models, to generate negative traces by inserting additional activities not envisaged by the model (or activities that do belong to the model but end up in the wrong position inside the trace). In this sense, our approach uniformly deals with the generation of both negative and positive traces. Furthermore, the random insertion of activities into an already generated trace could leave some cases uncovered.

Another point is related to the strategy of identifying all possible ways to negate each constraint. It is indeed crucial when we deal with alternative flows in the model, because it allows the generation of negative traces mixing activities from both paths. When a xor gate is in the model, as, for example, in Fig. 1 (see IC \({ic}_2\) of Sect. 5 for its SCIFF specification), there are several ways to negate such a constraint. As previously discussed for IC (2), there are two significant ways to negate each of the two alternative sequences \([\textsf {a, b}]\) and \([\textsf {a, c}]\) entailed by the xor constraint. Furthermore, there are two ways to negate the disjunction: (i) after a, neither b nor c is executed; (ii) after a, both b and c are executed. The latter negated form of the xor constraint generates negative traces containing both b and c, which originally belonged to alternative paths.

7 Evaluation of the prototype

To empirically verify our approach, we developed a first software prototype that employs the SCIFF abduction capabilities to generate positive and negative traces starting from a given business model.

The software, which is publicly available for download [41], allows the user to specify the input model in terms of the ICs that must be fulfilled and a set \(\mathbb {A}\) of possible activities that can occur in the traces. As discussed in the previous sections, \(\mathbb {A}\) can include further activities that do not appear in any constraint. The user can also specify various options, useful to control the generation process and the characteristics of the emitted traces. It is indeed possible to specify:

  • whether the model is open declarative or closed procedural;

  • whether positive, negative or both types of traces must be generated;

  • the time limit for each generated trace, i.e. its maximum time length in seconds;

  • the maximum length of the generated traces, intended as the maximum number of events in each trace; this is particularly relevant to guarantee the termination of the generation procedure when loops are in the model;

  • the number of instances for each path: in case the model includes alternative choices among different flow paths (as shown in Fig. 1), the maximum number of traces to be generated for each alternative path.

The emitted traces are provided in XES format [55], each one reporting a trace identifier and the list of the events with the corresponding timestamps. The download package includes a codification of the two models in Figs. 1 and 2a—as examples of procedural and declarative processes, respectively—that can be used to verify the generation abilities of the prototype as regards both positive and negative traces. The prototype generates 10 positive and 20 negative traces from the declarative model in Fig. 2a when the options of a 5-second time limit, five activities as maximum length and one instance for each path are provided. With the same options, the procedural model in Fig. 1 yields four positive and 4450 negative traces (as a consequence of its closed nature).
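For illustration, a minimal sketch of how traces could be serialized into XES-like XML with the Python standard library (this is not the prototype's actual exporter; real XES uses date-typed timestamps and declared extensions, while here integer timestamps are kept as int elements for simplicity):

```python
import xml.etree.ElementTree as ET

def traces_to_xes(traces):
    """Serialize traces into a minimal XES-like document.

    traces -- list of (trace_id, [(activity, timestamp), ...]) pairs.
    """
    log = ET.Element("log", {"xes.version": "1.0"})
    for trace_id, events in traces:
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string",
                      {"key": "concept:name", "value": trace_id})
        for activity, ts in events:
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string",
                          {"key": "concept:name", "value": activity})
            # Simplification: integer timestamps instead of XES dates.
            ET.SubElement(event, "int",
                          {"key": "time:timestamp", "value": str(ts)})
    return ET.tostring(log, encoding="unicode")

xml = traces_to_xes([("t1", [("a", 0), ("b", 1)])])
```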

As discussed in [22], the complexity of the main abductive decision problem (i.e. determining whether an explanation exists) is located at the second level of the polynomial hierarchy (\(\Sigma ^{P}_2\)-complete). In order to empirically evaluate the generation performance of the proposed tool, we employ an arbitrarily large procedural model composed of a long sequence of activities; thus, there is only one path allowed by the model from the start to the stop activity. We first estimate the time to generate an increasing number of only positive traces on an average hardware architecture (a quad-core Intel i7 2.9 GHz CPU and 16 GB RAM), when SCIFF is given three different models composed of 250, 500 and 750 activities (see the graph in Fig. 4a). In order to force the generation of a certain number N of traces in each experiment, we set the time limit and maximum length to high values (so that the generation process is not influenced by them) and we ask to generate N instances for each path. Since there is only one possible path in the given model, we obtain exactly N traces in the output. As highlighted in Fig. 4a, the generation time increases linearly with the number of emitted traces.

Figure 4b reports the performance of the tool when we fix the number of generated positive traces to 10, 20 and 30, but progressively increase the number of activities in the model. In this case, as a bigger model corresponds to more ICs to be considered, the generation time shows a superlinear trend in the number of activities. This is indeed expected, since a bigger model entails more activities for each instance of the process execution and requires more time to generate the corresponding trace.

Fig. 4 Evaluation of the times to generate positive traces for an arbitrarily long procedural model

Fig. 5 Evaluation of the times to generate positive and negative traces for an arbitrarily long procedural model. The performance decreases rapidly (superlinearly) with the number of activities in the process model (Fig. 5a), but linearly with the number of generated traces (Fig. 5b). Figure 5b reports the data in logarithmic scale on both axes to better appreciate the trend

The same test is repeated while asking the tool to generate both positive and negative traces. As the model is closed procedural, there are far more forbidden process instances than allowed ones. Therefore, as shown in Fig. 5a, a slight increase in the number of model activities corresponds to a huge increase in the space of negative traces to be explored and, consequently, to a higher generation time. This is further confirmed by the graph in Fig. 5b, where we relate the number of traces generated in each experiment of Fig. 5a to the required time (with logarithmic scale on both axes to better appreciate the trend): the two dimensions are still linearly proportional.

8 Related work

Automated discovery of process models from event logs is one of the main research areas of process mining. As it focuses on extracting knowledge from business process logs, the evaluation of process discovery techniques inevitably requires the availability of event logs [20]. For this purpose, some real-life logs have been made publicly available [9] and are often used as benchmarks for testing process discovery tools. Given the real-life origin of these datasets, they may contain imperfections (i.e. non-compliant traces) or show incompleteness (i.e. miss some examples of traces that should be considered compliant), thus causing alterations in the discovered business model. Indeed, real-life logs usually report all the observed traces without any indication of which cases are non-compliant or which execution paths are missing. For this reason, a widespread approach in this field is based on process simulation to artificially generate event logs with predefined characteristics. This allows researchers to perform a finer tuning of the developed algorithms and better control the experimental evaluation. Some model simulators and log generators have been developed for this purpose [27, 28].

In this section, we propose a classification of some relevant works on log generation according to the following main criteria:

  • the ability to generate process execution examples with the employment of a procedural approach;

  • the ability to support declarative constraints;

  • the possibility to generate positive compliant traces;

  • the possibility to generate negative non-compliant traces;

  • the availability of generation mechanisms from partially specified traces;

  • the ability to deal with time constraints;

  • the ability to deal with data constraints;

  • the possibility to generate traces with a user-defined probability distribution of workflow execution paths and values;

  • the online availability of the proposed generation tool.Footnote 4

Table 2 Classification of log/model generation approaches according to their most relevant features. Since providing a survey of all available works on log generation is beyond the scope of this paper, this list is not intended to be exhaustive

The approaches described in the following are classified according to these criteria in Table 2.

The work [53] introduces a framework for the automated generation of Petri nets representing processes, according to user-defined rules. In particular, the authors suggest gradually refining Workflow nets [46] in a top-down approach, in order to generate all possible process models belonging to the class of Jackson nets. A similar approach has been proposed in [8], where the authors describe a technique to generate Petri nets according to a different set of refinement rules. In both cases, the generated process models are intended to be used as benchmarks for process discovery algorithms, but the proposed approaches do not address the problem of generating traces from the developed Petri nets; hence, we exclude the works [53] and [8] from the classification in Table 2.

In the works [10, 11], the authors present a tool, the Processes Logs Generator (PLG), developed for the specific purpose of generating process discovery benchmarks. The software allows the user to randomly create business models according to some predefined parameters and then “execute” the generated model while recording each activity in a log file. In [24], the authors present an approach based on CPN Tools [29] to simulate business process models. The key component of the simulator performs a template-oriented transformation from BPMN process models into CPNs. Another approach based on CPN Tools is described in [7], where the authors generate XML event logs through the simulation of a CPN. The work [57] uses simulation to evaluate the performance impact of re-engineering business processes. This approach creates simulation models from workflow processes and uses the simulation results as feedback to calibrate the model.

All the approaches described so far support only procedural business process models, plus additional information such as the mean execution time of activities, the probability of choices, etc. Some approaches also support the simulation of limited resources, their allocation, queues, etc. Nevertheless, these procedural approaches show some limitations when the process is characterized by high variability and allows for many alternative execution paths. In these cases, declarative process models have proven to perform better. For this reason, an extension [56] of the traditional procedural-based modeller of CPN Tools has been presented to graphically add Declare constraints [48]. The simulation of such hybrid models can be carried out by the user or executed in a random way. The work [20] presents a synthetic log generator that allows the user to define the process model by listing a number of declarative constraints expressed in Declare. More recently, another log generator based on a declarative approach has been presented in [3,4,5]. Here, the authors employ the Declarative Process Intermediate Language (DPIL), a declarative process modelling language, to create a simulation model, and then generate XES event logs composed of only positive traces through a second component called Multi-perspective Declarative Process Simulation (MuDePS). Each generated log describes an exhaustive, distinct set of traces of the desired length.

The SCIFF approach inherently supports the definition of both declarative and procedural models, and naturally exploits abduction to enable the generation. Furthermore, differently from SCIFF, all these solutions to log generation ignore the necessity of negative examples in the benchmark event log to determine how robust to noise a certain process discovery technique is [21]. Since information about state transitions that were prevented from taking place is often unavailable in real-life logs, it cannot be exploited to guide the learning task. For this reason, as also underlined in [25], in most cases process discovery techniques are limited to the harder setting of unsupervised learning. Our notion of negative traces may recall the concept of syntactical noise introduced by Günther in [26]. However, this similarity is limited to the appearance of the trace (i.e. showing missing, out-of-order or unexpected events), whereas the hypothesized use of these instances remains different: Günther’s noisy traces are intended as a consequence of distortions in transmitting or recording the log (and as such, should be identified and discarded during process discovery), whereas in this work, negative traces are intended as meaningful aspects of the process. Their non-compliance can itself be a source of knowledge, for example, to instruct an inductive mining tool.

In [23], the authors assume to have a set of negative events collected from domain experts who suggest whether a proposed execution plan is feasible or not. Given this knowledge, the authors combine ILP and partial-order planning techniques for process discovery. Similarly, other approaches [6, 17, 32, 33] envisage the presence of “non-compliant” traces as a fundamental requirement to enable process mining tasks through supervised learning techniques. In [2, 43, 44], the authors present a tool that takes a series of business process specifications as input and generates an event log. Based upon other user-defined security and compliance requirements, deviations from the defined control flow are generated. In particular, specific trace properties can be either enforced or violated on a random basis in the resulting event log. This approach shares with SCIFF the possibility to generate negative examples, but it supports the definition of procedural business process models only. The works [25, 51, 52] propose a process discovery learning technique that starts by generating artificially induced negative events and later uses these examples to guide the learning process. Differently from our approach, in all these works the negative traces are not generated by modifying the model’s constraints, but rather by replacing the positive events of each trace and then checking whether a state transition of interest (corresponding to a candidate negative event) could occur. Specifically, the authors assume that a generated transition can be considered a negative example if it is not present in any trace with a similar history in the log. Although this assumption has the great advantage of allowing the user to deal with negative traces before having extracted the process model, it can lead to incorrect results. Indeed, the initial hypothesis of having all the positive examples in the log cannot always be guaranteed, causing allowed state transitions that are not reported in the log to be classified as negative examples. Conversely, if the log contains some non-allowed traces, the generator will erroneously consider them as positive examples. Differently from these works, the SCIFF approach is able to ensure the validity of the generated positive and negative traces w.r.t. a model. This feature could be useful, for example, in conjunction with an inductive miner (employing both positive and negative instances) to iteratively refine a business process model through subsequent steps of generation and mining.
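The artificially-induced-negative-events idea of [25, 51, 52] can be sketched as follows (a simplification: we use exact prefix matching, whereas the cited works consider similar histories; all names are illustrative):

```python
def induced_negative_events(log, alphabet):
    """Sketch of artificially induced negative events.

    At each position of each trace, an activity is a candidate negative
    event if no trace in the log with the same history (prefix)
    continues with it. Returns (history, activity) pairs.
    """
    # Collect, for every observed prefix, the activities that follow it.
    successors = {}
    for trace in log:
        for i, act in enumerate(trace):
            successors.setdefault(tuple(trace[:i]), set()).add(act)
    negatives = []
    for trace in log:
        for i in range(len(trace)):
            history = tuple(trace[:i])
            for candidate in alphabet - successors.get(history, set()):
                negatives.append((history, candidate))
    return negatives

log = [["a", "b", "c"], ["a", "c"]]
negs = induced_negative_events(log, {"a", "b", "c"})
# e.g. after history ("a", "b"), no trace in the log continues with "a"
assert (("a", "b"), "a") in negs
```

This sketch also exposes the weakness discussed above: any allowed continuation that simply never appears in the log is wrongly classified as negative.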

9 Conclusions and future works

In recent years, as the interest in BPM and process mining techniques has increased, the need for event log benchmarks with predefined characteristics has become more and more pressing.

This work presents a novel approach for generating synthetic logs starting from a declarative or procedural business process model. We analysed the theoretical aspects of positive and negative trace generation and evaluated the feasibility of our method by providing a standalone prototype able to generate trace templates as well as completely grounded traces. Also, custom constraints on data and time can be specified to meet application-dependent requirements. The performance evaluation of the proposed tool is promising, although a superlinear trend in the generation time for increasing model dimension is clearly highlighted. The proposed approach is then put through a thorough qualitative comparison with the existing literature on business process simulation, revealing the numerous advantages of employing the SCIFF framework in this field: namely, the possibility to deal with both open declarative and closed procedural model specifications while producing positive and negative traces, and the ability to deal with partially specified traces.

This analysis of the state of the art has also highlighted some shortcomings of our approach, which will be addressed in the future. In particular, the generation of traces and their grounding does not consider any probability information, while some paths in the process model might be more frequent than others, just as some data values and activity durations might be more probable than others. Furthermore, as data constraints are currently supported by directly modifying the SCIFF constraints, in the future we plan to introduce some higher-level mechanism aimed at meeting the needs of non-technical users.

Since positive/negative log generation explores all the possible allowed and non-allowed traces given a business model, the process can require a significant amount of time, depending on the length of the model, the number of possible paths and the quantity and complexity of the constraints to be checked. When the SCIFF framework is used for compliance monitoring purposes, previous works [12, 35] have proven the possibility to significantly speed up the checking process through the employment of programming models for distributed computation such as MapReduce [19]. A similar approach could be useful to accelerate positive/negative log generation when a collection of computing nodes is available. In the future, we plan to explore this possibility in order to partition the generation process into smaller tasks that can easily be executed concurrently with the support of a large-scale data processing engine.