1 Introduction

Today, organizations need to make rapid changes to their process models to survive in the competitive world of business. Implementing these changes first calls for an accurate understanding of the current organizational processes. To attain this goal, organizations have resorted to process-aware information systems such as workflow management systems (WFMSs), enterprise resource planning (ERP) systems, and customer-relationship management (CRM) systems (Van der Aalst 2014). These systems record information on the actual organizational processes but do not present an accurate model of them. Organization managers generally hold presumptive models of the organizational processes, but these often differ from the actual process models.

Consequently, there is a need for techniques that can automatically extract the actual process models from the data sets stored in information systems. The concept of process mining was framed to achieve this goal (Burattin 2015). One of the data sets stored in information systems is the event log. An event log contains information on the events that unfold during the organizational processes; every event records an activity (task), a timestamp, a case, and the performer of the activity. Event logs are of central importance because the information they contain is used in all process mining studies. Process mining techniques are generally classified into three categories (Van der Aalst 2014; Burattin 2015; Van Dongen and van der Aalst 2005):

  1. (a)

    Process discovery: In these techniques, the process model is extracted without any presumptive model using the data in the information systems.

  2. (b)

    Conformance checking: In these techniques, the actual process model obtained through process discovery methods is compared to a presumptive model.

  3. (c)

    Extension: In these techniques, the presumptive model serves as the input and it is improved based on the information extracted from the information systems.

This research revolves around process discovery techniques. Previous research on process discovery has been fraught with the following problems (Aleem et al. 2015):

  1. (a)

    The inability to identify some complicated structures such as the non-free choice structures, short-length loops, hidden tasks, and duplicate tasks.

  2. (b)

    The inability to resist noise.

  3. (c)

    The inability to tackle incompleteness.

  4. (d)

    The inability to simplify the process model, which leads to overly complex models for complicated processes (the so-called spaghetti models).

  5. (e)

    The high computational complexity.

All of these problems eventually result in process models that either lack adequate efficiency or are incomprehensible and unanalyzable. For this reason, the present research proposes a novel process discovery technique that solves some of the problems described above to the extent possible and discovers more accurate process models than the previous methods.

All of the existing process discovery methods use only event logs. However, other types of information, such as information on the data exchanged among process roles, also exist in information systems. Since organizational roles perform the process activities, the relations among these roles can disclose information about the process model. Therefore, in the approach proposed in this research, the event logs are used along with the information on data exchange among the roles, which is derived from physical generalized flow diagrams (PGFDs).

In the proposed solution, network analysis techniques are used to discover the process model. It is also assumed that the information in the information system concerns a single process. To this end, the event logs and the PGFD model are used to build a bilayer network under a suitable scenario; this network represents the handover flow and the data exchange flow in the process. Afterward, by extracting the network motifs and analyzing their structure, rules are set to identify the logical relationships among the process activities. Finally, the organization's current process model is discovered by applying these rules.

This article is composed of seven sections. Section one presented an introduction to process mining and its necessity, followed by the research problems, objectives, and methodology. Section two reviews the process discovery studies and compares their features in a table; based on this table, the differences between the proposed solution and the previous methods are explained, and previous work on process discovery that uses social network analysis is also introduced. Section three introduces the fundamental concepts used in the proposed approach. The research problem is mathematically modeled in section four, and section five offers a detailed description of each step of the proposed method together with the evaluation method. The assessment results are presented and analyzed in section six, and finally, section seven summarizes the research and suggests directions for future work.

2 Related works

Process discovery methods are the pivot of this research; hence, previous research on process discovery is reviewed in this section. In addition, since this research utilizes social network analysis techniques, previous work that applies these techniques to process discovery is also introduced.

2.1 Process model discovery methods

The studies on process mining have mainly attempted to propose a process model discovery method. These studies are classified into five major categories by approach:

  1. (i)

    Deterministic mining approaches: The methods developed based on the alpha algorithm, such as the beta and alpha plus algorithms (Wen et al. 2006, 2009), belong to this group. The alpha algorithm introduced by Van der Aalst and Song (2004) was based on the running time of tasks in event logs; this algorithm defines a set of dependencies including causal, parallel, and choice relationships and maps each relation to a Petri net. The most important pitfall of these algorithms is that they overlook noise. Their advantage is that they can discover a workflow per process and display it as an SWF-net.Footnote 1 A SWF-net is a workflow for all of whose transitions it is always possible to reach to the end place and for any transition there is a firing sequence enabling it (Van der Aalst et al. 2004).

  2. (ii)

    Heuristic mining methods: These methods use dependencies similar to the deterministic approaches, but they consider the dependencies together with their frequencies. These algorithms rely on the observation that the more frequently a dependency is observed, the lower the probability that it occurred by chance (and vice versa). Heuristic algorithms consist of three steps: (1) creating a dependency/frequency table from the event log; (2) creating a dependency/frequency graph based on a set of heuristic rules; and (3) creating a Petri net using the information in the dependency/frequency graph and dependency/frequency table (Weijters and van der Aalst 2003). The primary advantage of these methods is their ability to resist noise and handle incompleteness. (A small sketch of step 1 is given after this list.)

  3. (iii)

    Inductive approaches: These algorithms follow a divide-and-conquer mechanism and involve two major steps. In step one, a stochastic activity graph (SAG) is created for the process instances; the SAG is a directed graph showing the direct dependencies among the activities. Afterward, the SAG is transformed into a workflow model (Herbst and Karagiannis 2003). The mergeSeq, splitSeq, and splitPar algorithms proposed by Herbst (2000, 2002) and the algorithm introduced by Schimm (2003) fall into this category. The important advantage of these algorithms is their ability to detect duplicate tasks.

  4. (iv)

    Evolutionary approaches: Consisting of three main steps, these methods are based on the genetic algorithm (GA). First, a random initial population of process models is created. Next, a fitness index is calculated for each process model in the initial population; the fitness index measures the degree to which the model explains the behavior contained in the event log. The population then evolves by means of the crossover and mutation operators to create the next generation, so that each generation improves on the previous one until a model with a sufficiently high fitness index is found. Despite the ability of evolutionary approaches to detect most structures, they impose high computational complexity (Medeiros et al. 2006).

  5. (v)

    Clustering-based approaches: In these algorithms, a process model is first developed for each process instance. Afterward, the models are clustered, and the juxtaposition of the clusters yields the main process model. In these approaches, process instances are clustered by the k-gram and bag-of-activities algorithms. To cluster the process instances using the algorithm introduced by Song et al. (2009), each process instance is represented as a vector, but no information on the contents of the activities per process instance is preserved. In the edit distance approach introduced by Bose and van der Aalst (2009), this problem is solved by treating each process instance as a string and comparing the strings; the difference between two strings is expressed as a cost. In trace clustering, introduced by Bose and van der Aalst (2010), process instances are compared and their similarities are identified based on a fixed set of features. Finally, similar instances are placed in the same cluster.
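
To make step (1) of the heuristic approach concrete, the following minimal sketch builds a dependency/frequency table from a toy event log and computes a dependency measure of the kind used in heuristic miners; the traces and function names are illustrative only and do not reproduce the exact formulation of any specific tool.

```python
from collections import Counter

# Toy traces: each trace is the sequence of activities observed for one case.
traces = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "b", "c", "d"],
]

# Step 1: count how often each activity directly follows another one.
direct_succession = Counter()
for trace in traces:
    for x, y in zip(trace, trace[1:]):
        direct_succession[(x, y)] += 1

# Dependency measure used by heuristic miners: values close to 1 suggest a
# causal dependency x -> y, values close to -1 suggest the reverse direction.
def dependency(x, y):
    xy = direct_succession[(x, y)]
    yx = direct_succession[(y, x)]
    return (xy - yx) / (xy + yx + 1)

print(dict(direct_succession))
print(f"dependency(a, b) = {dependency('a', 'b'):.2f}")
print(f"dependency(b, c) = {dependency('b', 'c'):.2f}")
```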

Table 1 presents a comparison of the previous researches. It shows the features, advantages, and disadvantages of each method.

Table 1 Comparison of the previous process discovery methods

As seen in Table 1, all of the previous process discovery methods use only the event log data. In this paper, a novel process model discovery approach is proposed that uses the event log data as well as the information on the data exchanged among the roles. Because this approach combines the data exchange information with the event log data, it is expected to generate a more accurate process model. In this approach, social network analysis techniques are employed to discover the process model.

2.2 Application of social network analysis in the area of process discovery

Process mining research can focus on several perspectives, the most important of which are (Van der Aalst et al. 2004):

  1. (a)

    Process perspective (‘How’): It focuses on the control flow, that is, the ordering of activities. The goal here is to find a good characterization of all the possible paths, expressed in terms of a Petri net.

  2. (b)

    Organizational perspective (‘Who’): It focuses on the resources, that is, which resources are involved and how they are related. The goal is either to structure the organization by classifying people in terms of roles and organizational units or to show relations among individual resources.

  3. (c)

    Case perspective (‘What’): It focuses on the properties of cases. Cases can be characterized by their paths through the process or by the values of the corresponding data elements.

All previous studies that use social networks in the area of process mining focus on the organizational perspective. These studies extract a social network whose nodes are resources; in this network, edges are formed based on three scenarios (Song and van der Aalst 2008):

  1. (a)

    Handover scenario: In this scenario, an edge is formed between two nodes if, in a process trace, a role performs an activity on a case and then hands the case over to another role that performs the next activity.

  2. (b)

    Joint activity scenario: In this scenario, edges are formed among resources that perform the same activities in different traces.

  3. (c)

    Joint cases scenario: In this scenario, edges are formed among resources that cooperate on a single case; in other words, edges are formed among resources that perform activities in the same trace. The difference between this scenario and the handover scenario is that the order in which activities are performed in a trace is not important here.

In previous research, based on the extracted social networks and using social network analysis techniques and metrics such as betweenness, resources are classified in terms of organizational roles (Van der Aalst et al. 2007). In this research, social network analysis techniques are instead applied from the process perspective; in other words, they are used to discover the process control flow.

3 Fundamental concepts

The approach proposed in this article is built on a set of fundamental concepts, which are introduced in this section.

3.1 Social network analysis

A social network is a network of people or groups and the relationships among them. People and groups constitute the nodes, while the edges are formed of the social relationships among people such as friendship, kinship, business, and shared interests. Social network analysis is a subgroup of network analysis techniques and refers to the use of processes, methods, and instruments that lend a better insight into the relationships and structures and help extract significant information (Esmaeili et al. 2011; Dehghan et al. 2014).

Data mining and analysis methods are generally classified into two categories: (1) classic techniques such as clustering and classification and (2) network analysis techniques. The classic methods generate a large volume of data and operate offline, whereas network analysis methods convert a large amount of data into a network, offer higher processing speed, and can operate online (Arif 2015). Hence, the network analysis techniques were adopted in this study. Given that the organizational roles constitute the network nodes in this study, the network is considered a social network.

3.1.1 Motifs

In network analysis, subgraphs with special features that recur in a network are called motifs. When a subgraph with a particular feature is repeated in a network, a specific behavior or structure is manifested in that network. Techniques known as motif discovery algorithms have been defined to discover these motifs. All motif discovery algorithms involve three steps: (1) searching for and determining the frequency of subgraphs of a certain size in the network; (2) identifying the isomorphic subgraphs and classifying them into isomorphic classes; and (3) determining the statistical significance of the classes (by comparing the frequency of each class in the network to its frequency in random networks). This statistical significance is measured by the z-score and p value criteria; a class with a high z-score and a low p value is statistically significant (Kavurucu 2015).
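
As a minimal illustration of step (3), the sketch below computes the z-score and an empirical p value for a single subgraph class by comparing its frequency in the observed network with its frequencies in a set of randomized networks; all numbers are invented for illustration.

```python
import statistics

# Frequency of one isomorphic subgraph class in the observed network, and its
# frequencies in an ensemble of randomized networks (illustrative numbers; in
# practice the ensemble typically contains hundreds or thousands of networks).
observed_frequency = 42
random_frequencies = [18, 25, 21, 30, 22, 19, 27, 24, 20, 23]

mean_rand = statistics.mean(random_frequencies)
std_rand = statistics.stdev(random_frequencies)

# z-score: how many standard deviations the observed frequency lies above the
# mean frequency in the random ensemble.
z_score = (observed_frequency - mean_rand) / std_rand

# Empirical p value: fraction of random networks in which the class is at least
# as frequent as in the observed network.
p_value = sum(f >= observed_frequency for f in random_frequencies) / len(random_frequencies)

print(f"z-score = {z_score:.2f}, p value = {p_value:.3f}")
```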

The motif discovery algorithms are classified by: (1) the method of subgraph enumeration and (2) the method of subgraph searching. As for subgraph enumeration, motif discovery algorithms are grouped into two classes (Kavurucu 2015; Wong and Baur 2010):

  1. (a)

    Tree-based algorithms: These algorithms enumerate all network subgraphs using a tree. The disadvantages of these algorithms are that they are time-consuming and costly because, with an increase in the network size and the subgraph size, the number of subgraphs grows exponentially.

  2. (b)

    Probabilistic algorithms: These algorithms sample the subgraphs instead of fully enumerating them and estimate their frequencies. The probabilistic algorithms have shorter run times than the full enumeration algorithms. Therefore, they are more suitable for discovering motifs in large networks and large-sized motifs. However, their sampling procedures are not entirely unbiased.

Motif discovery algorithms are also classified into two categories by subgraph searching mechanism (Kavurucu 2015; Wong and Baur 2010):

  1. (a)

    Network-based algorithms: These algorithms search all subgraphs of a given size in a network and estimate the total frequency of the subgraphs.

  2. (b)

    Motif-based algorithms: These algorithms only search a given subgraph in the network and calculate the frequency of only that subgraph in the entire main network.

Table 2 lists all motif discovery algorithms devised so far. This table categorizes the motif discovery algorithms.

Table 2 Classification of motif discovery algorithms

In this research, the RAND-ESU algorithm introduced by Wernicke (2005) is used for motif discovery. This algorithm is selected for three reasons. First, it is a probabilistic algorithm with low computational complexity. Second, it is a network-based algorithm that searches all subgraphs of a certain size; since the logical structures of the relationships among the activities are manifested as different subgraphs in the network, motif-based algorithms, which search for only one given subgraph, do not suit this research. Third, its sampling approach is nearly unbiased compared to the other probabilistic algorithms (Wernicke 2005). This algorithm has been implemented in FANMOD,Footnote 2 which is also used in this research.

3.2 Physical generalized flow model

The physical generalized flow model is a diagram that identifies the organizational flows among the organizational entities (Whitten et al. 2007). Here, organizational entities represent the organizational roles. Organizational flows are grouped into three categories: (1) data flows; (2) material and service flows; and (3) financial flows (Whitten et al. 2007). These flows are identified by the physical generalized flow diagram (PGFD), which is defined on three levels: (1) context level, (2) system level, and (3) process level (Whitten et al. 2007). In this paper, the PGFD on the process level is used. Figure 1 depicts a process-level PGFD; in this figure, the flows among organizational roles are identified.

Fig. 1
figure 1

Physical generalized flow diagram on the process level

In this research, only the information on the data flows is used; the financial flows and the material and service flows are neglected. The PGFD model information is recorded as a table in the information system. Table 3 shows a sample PGFD table in the information system. This table shows which kind of flow is transferred from which role to which role; it also identifies the flow content and the flow form.

Table 3 The PGFD table on the process level in an information system
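
Conceptually, the PGFD table is a simple relation over roles, activities, and flows. The sketch below shows one hypothetical in-memory representation with field names taken from the description above; the example rows and names are illustrative and are not taken from Table 3.

```python
from dataclasses import dataclass

@dataclass
class PGFDFlow:
    """One row of a process-level PGFD table (hypothetical field names)."""
    from_role: str
    from_activity: str
    to_role: str
    to_activity: str
    flow_type: str   # 'data', 'financial', or 'item'
    content: str
    form: str

pgfd_table = [
    PGFDFlow("sale manager", "order checking", "installer", "set install time",
             "data", "approved order", "electronic document"),
    PGFDFlow("customer", "register order", "sale manager", "order checking",
             "financial", "service fee", "invoice"),
]

# Only the data flows are used by the proposed approach.
data_flows = [f for f in pgfd_table if f.flow_type == "data"]
print(data_flows)
```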

4 Problem statement and modeling

The present research approach is based on a series of fundamental hypotheses about data sets. Table 4 presents a summary of these hypotheses.

Table 4 Data set-related hypotheses

In this section, the research problem is modeled based on the above hypotheses. In this research, the problem inputs include the event logs, the process-level PGFD model, and the role-activity matrix. The event log is defined as an ordered tuple of sets as follows:

$$ {\text{EL}} = \left( {{\text{CN}},{\text{St}},{\text{Ct}},A,{\text{RS}}} \right). $$
(1)

In Eq. (1), CN denotes the input cases set, St is the activity start time, Ct represents the activity completion time, A shows the process activities set, and RS is the process resources set. Each of the aforementioned sets is defined as follows:

$$ {\text{CN}} = \left\{ {c_{1} ,c_{2} , \ldots ,c_{n} } \right\}. $$
(2)
$$ A = \left\{ {a_{1} ,a_{2} , \ldots ,a_{n} } \right\} . $$
(3)
$$ {\text{RS}} = \left\{ {rs_{1} ,rs_{2} , \ldots ,rs_{n} } \right\} . $$
(4)

In Eq. (2) ci shows the input case number, in Eq. (3) ai is the performed activity, and in Eq. (4) rsi represents the performer of the activity. The other input is the process-level PGFD model; this input is also defined as an ordered tuple of sets as follows:

$$ {\text{PGFD}} = \left( {{\text{FR}}_{i} ,{\text{FA}}_{i} ,{\text{TR}}_{i} ,{\text{TA}}_{i} ,T} \right). $$
(5)

In Eq. (5), FRi is the set of roles sending the flows, FAi is the set of activities performed by the flow-sending roles, TRi shows the set of roles receiving the flows, TAi denotes the set of activities performed by the flow-receiving roles, and finally, T denotes the flows type. Each of the above sets and the roles set are also defined as follows:

$$ R = \left\{ {r_{1} ,r_{2} , \cdots ,r_{n} } \right\}. $$
(6)
$$ {\text{FA}}_{i} \subseteq A. $$
(7)
$$ {\text{TA}}_{i} \subseteq A. $$
(8)
$$ {\text{TR}}_{i} = \left\{ {{\text{TR}}_{i} \subseteq R,\;a_{i} \in {\text{TA}}_{i} } \right\} . $$
(9)
$$ {\text{FR}}_{i} = \left\{ {{\text{FR}}_{i} \subseteq R,\;a_{i} \in {\text{FA}}_{i} } \right\}. $$
(10)
$$ T = \left\{ {{\text{data}},{\text{financial}},{\text{item}}} \right\}. $$
(11)

In Eq. (6), ri shows an organization role, and in Eq. (11) ‘data’ refers to the data flow type, ‘financial’ represents the financial flow type, and ‘item’ shows the material and service flow type. Finally, the third input is the role-activity matrix stored in the information system, which is defined as follows:

$$ {\text{RA}} = \left[ {x_{{a_{i} r_{j} }} } \right],\quad x_{{a_{i} r_{j} }} \in \left\{ {0,1} \right\},\quad a_{i} \in A,\quad r_{j} \in R. $$
(12)

In Eq. (12), RA is the role-activity matrix, wherein the columns show the roles and the rows show the activities. The entries of this matrix are either 0 or 1; if entry \( x_{{a_{i} r_{j} }} \) equals 1, activity ai is performed by role rj. As mentioned, in the proposed approach the input information is transformed into a bilayer network, under a suitable scenario, to represent the workflow and the data exchange. This network is defined as the following triple:

$$ N = \left( {V_{i,j} , E,{\text{FT}}} \right). $$
(13)
$$ V_{i,j} = \left\{ {v_{i,j} |i \in R, j \in A} \right\}. $$
(14)
$$ E \subseteq V_{i,j} \times V_{i,j} . $$
(15)
$$ {\text{FT}} = \left\{ {{\text{data}},{\text{handover}}} \right\}. $$
(16)

In Eq. (14), Vi,j is the set of network nodes, in Eq. (15) E is the set of edges, and in Eq. (16) FT reflects the edge type: an edge is either a data flow, denoted by ‘data,’ or a workflow handover, denoted by ‘handover.’ After creating the bilayer network of the workflow and data exchange, the network motifs are extracted, and motifs with certain features are mapped to logical structures of the activity relationships in accordance with certain rules. Finally, the discovered logical relationships are integrated, and the resulting model is mapped to a Petri net. Each step of the solution is described in detail below.
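
To make the notation of Eqs. (13)–(16) concrete, the sketch below represents a node v_{i,j} as a (role, activity) pair and an edge as a typed, weighted connection; the class and field names are illustrative only and are not part of the formal model.

```python
from dataclasses import dataclass
from typing import Literal

# A node v_{i,j} pairs a role i with an activity j (Eq. 14).
@dataclass(frozen=True)
class Node:
    role: str
    activity: str

# An edge carries a flow type FT in {'data', 'handover'} (Eq. 16) and a weight.
@dataclass(frozen=True)
class Edge:
    source: Node
    target: Node
    flow_type: Literal["data", "handover"]
    weight: int = 1

# The bilayer network N = (V, E, FT) of Eq. (13) as plain collections.
V: set[Node] = set()
E: list[Edge] = []

v1 = Node("sale manager", "order checking")
v2 = Node("installer", "set install time")
V.update({v1, v2})
E.append(Edge(v1, v2, "handover"))
print(V, E)
```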

5 Research methodology

In this section, first the proposed method of this research will be described, and then, the assessment strategy will be introduced.

5.1 Proposed method

The methodology of this research consists of five steps, which are illustrated in Fig. 2.

Fig. 2
figure 2

Five steps of proposed solution to the process discovery problem

5.1.1 Identifying the roles performing activities in the event log

In this research, a standard format is considered for the event logs. In this format, the event log initially contains no information on the organizational roles that perform the activities; this information therefore needs to be added to the event log to determine which role performs each activity. To add this information, the role-activity matrix stored in the information system is used. An example of this matrix is presented in Table 5, in which the rows show the activities and the columns represent the roles. As seen in Table 5, a role can be responsible for several activities; for instance, the ‘customer relationship’ and ‘order checking’ activities are both performed by the ‘sale manager.’ However, the opposite is not true: each activity is assigned to only one role. Based on these rules and the role-activity matrix, the role information can be added to the event log.

Table 5 Role-activity matrix

When the role information is added to the event log, the new event log is defined as an ordered tuple of sets as follows:

$$ {\text{EL}}^{{\prime }} = \left( {{\text{CN}},{\text{St}},{\text{Ct}},A,{\text{RS}},R} \right). $$
(17)

In Eq. (17), as in Eq. (1), CN denotes the input cases set, St is the activity start time, Ct represents the activity completion time, A shows the process activities set, RS is the process resources set, and R represents the role set defined in Eq. (6).
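
A minimal sketch of this enrichment step, under the assumption that the role-activity matrix has been flattened into a mapping from each activity to its single responsible role (which is valid because every activity is assigned to exactly one role); the event fields and names below are illustrative.

```python
# Role-activity matrix flattened to a lookup: each activity -> its single role.
activity_to_role = {
    "register order": "customer",
    "order checking": "sale manager",
    "customer relationship": "sale manager",
    "set install time": "installer",
}

# Raw event log rows: (case, start time, completion time, activity, resource).
event_log = [
    ("c1", "09:00", "09:10", "register order", "alice"),
    ("c1", "09:15", "09:40", "order checking", "bob"),
]

# EL' (Eq. 17): the same events extended with the role of each activity.
enriched_log = [
    (case, st, ct, activity, resource, activity_to_role[activity])
    for (case, st, ct, activity, resource) in event_log
]
print(enriched_log)
```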

5.1.2 Creating a multilayer network based on inter-role relationships, from event logs and PGFD

In this step, a bilayer social network is developed to represent the workflow and the data exchange flow in the process. This network is built from the information in the new event log, which now includes the roles, and from the PGFD-derived information on the exchange of data among the roles. The network modeling scenario is as follows:

  • Nodes These are the roles responsible for performing the different activities in the process. It is worth mentioning that some roles can handle multiple activities; hence, when defining the nodes, the activity titles are used along with the role titles so that the activity performed by a given role can be determined in the next steps. According to Eq. (14), vi,j is a network node in which i refers to a role and j refers to the activity performed by role i.

  • First-layer edges In the first layer, the edges between nodes are formed based on ‘handover.’ In the handover scenario, if two consecutive activities with a causal relationship exist for a case and the first activity is performed by role i and the second activity is performed by role j, a handover from i to j takes place. The formation of the network under the handover scenario is described in detail by Van der Aalst and Song (2004). Based on Eqs. (13) and (16), the value of FT for the first-layer edges is ‘handover.’ According to Eq. (17), if ei ∈ ELʹ is an event, two operations are defined on ELʹ:

    $$ \pi_{r} \left( {e_{i} } \right) = r_{i} \in R. $$
    (18)
    $$ \pi_{a} \left( {e_{i} } \right) = a_{i} \in A. $$
    (19)

    If tc is a process trace belonging to case c ∈ CN and → denotes a causal relationship:

    $$ \begin{aligned} v_{{a_{i} r_{i} }} \trianglerighteq v_{{a_{j} r_{j} }} \iff \exists \, 0 \le n < \left| {t_{c} } \right| :\; & \pi_{r} \left( {e_{i} } \right) = r_{i} \wedge \pi_{a} \left( {e_{i} } \right) = a_{i} \\ & \wedge \, \pi_{r} \left( {e_{i + n} } \right) = r_{j} \wedge \pi_{a} \left( {e_{i + n} } \right) = a_{j} \wedge e_{i} \to e_{i + n} . \end{aligned} $$
    (20)

    In Eq. (20), \( v_{{a_{i} r_{i } }} { \trianglerighteq }v_{{a_{j} r_{j} }} \) denotes a handover flow, meaning that in process trace tc, belonging to case c, role ri executes activity ai on case c in event ei, and then role rj executes activity aj on case c in event ei+n.

  • Second-layer edges An edge is formed between two roles responsible for different activities if, in the PGFD model, a data flow is exchanged between those two roles. Based on Eqs. (13) and (16), FT for the second-layer edges is ‘data.’ Based on Eq. (5), if ei ∈ PGFD, five operations are defined on PGFD:

    $$ \pi_{fa} \left( {e_{i} } \right) = a \in {\text{FA}}_{i} . $$
    (21)
    $$ \pi_{fr} \left( {e_{i} } \right) = r \in {\text{FR}}_{i} . $$
    (22)
    $$ \pi_{ta} \left( {e_{i} } \right) = a^{{\prime }} \in TA_{i} . $$
    (23)
    $$ \pi_{tr} \left( {e_{i} } \right) = r^{{\prime }} \in TR_{i} . $$
    (24)
    $$ \pi_{t} \left( {e_{i} } \right) = t \in T. $$
    (25)

    Then, a data flow between nodes \( v_{{a_{i} r_{i} }} \) and \( v_{{a_{i}^{{\prime }} r_{i}^{{\prime }} }} \) is defined as:

    $$ \begin{aligned} v_{{a_{i} r_{i} }} \triangleright v_{{a_{i}^{{\prime }} r_{i}^{{\prime }} }} \iff \exists \, e_{i} \in {\text{PGFD}} :\; & \pi_{fa} \left( {e_{i} } \right) = a_{i} \wedge \pi_{fr} \left( {e_{i} } \right) = r_{i} \wedge \pi_{ta} \left( {e_{i} } \right) = a_{i}^{{\prime }} \\ & \wedge \, \pi_{tr} \left( {e_{i} } \right) = r_{i}^{{\prime }} \wedge \pi_{t} \left( {e_{i} } \right) = {\text{data}} . \end{aligned} $$
    (26)

    In Eq. (26), \( v_{{a_{i} r_{i } }} \triangleright v_{{a_{i}^{{\prime }} r_{i}^{{\prime }} }} \) denotes a data flow, meaning that role ri, which executes activity ai, sends data to role rʹi, which is going to execute activity aʹi.

The network described above is a directed and weighted network.
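
The sketch below builds such a network with networkx, assuming the enriched log has already been grouped into per-case traces ordered by start time and that every direct succession inside a trace is treated as a handover (the causal-relationship check of Eq. (20) is omitted for brevity); all names and example flows are illustrative.

```python
import networkx as nx

# Per-case traces from EL': each event is (activity, role), ordered by start time.
traces = {
    "c1": [("register order", "customer"), ("order checking", "sale manager"),
           ("set install time", "installer")],
    "c2": [("register order", "customer"), ("order checking", "sale manager")],
}

# Data flows taken from the PGFD model: (from_activity, from_role, to_activity, to_role).
data_flows = [
    ("order checking", "sale manager", "set install time", "installer"),
]

G = nx.MultiDiGraph()  # directed, typed, weighted bilayer network

# First layer: handover edges between consecutive (activity, role) nodes of a trace.
for trace in traces.values():
    for (a_i, r_i), (a_j, r_j) in zip(trace, trace[1:]):
        G.add_edge((r_i, a_i), (r_j, a_j), flow_type="handover", weight=1)

# Second layer: data-exchange edges derived from the PGFD model.
for a_i, r_i, a_j, r_j in data_flows:
    G.add_edge((r_i, a_i), (r_j, a_j), flow_type="data", weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```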

5.1.3 Discovering motifs in the multilayer network

In this phase, ternary motifs are extracted from the bilayer workflow and data exchange flow network. Ternary motifs are selected because smaller motifs can be detected faster, and size three motifs are the smallest network motifs that can reveal logical structures in process models; they provide adequate information for identifying the logical structures among the activity relationships without the need to examine larger motifs. To extract the motifs, the RAND-ESU algorithm implemented in FANMOD is used in this research. The computational complexity of RAND-ESU is \( O(n^{k}) \), where n is the number of nodes and k is the size of the subgraphs (Wernicke 2005); thus, for ternary motifs, the computational complexity is \( O(n^{3}) \).

One of the advantages of FANMOD is that edges and nodes can be distinguished according to their type. According to Eqs. (13) and (16), the FT set shows that there are two types of flow in the bilayer network N: ‘data’ and ‘handover.’ These flows divide the network edges into three categories: (1) edges containing only the data exchange flow; (2) edges containing only the handover flow; and (3) edges containing both the data exchange flow and the handover flow. These three types are distinguished in FANMOD by assigning specific codes to the edges. Figure 3 presents some samples of size three subgraphs resulting from FANMOD; in this figure, the three groups of edges are observable (a small coding sketch follows the figure).

Fig. 3
figure 3

Classification of size three subgraphs existing in the bilayer network
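
The three edge categories can be derived directly from the flow types of the bilayer network. The sketch below assigns an integer code to every ordered node pair depending on whether it carries only a data flow, only a handover flow, or both; the codes and node labels are placeholders, since the concrete coding expected by FANMOD depends on how its input file is configured.

```python
# Flow types observed between each ordered node pair of the bilayer network.
pair_flows = {
    ("sale manager|order checking", "installer|set install time"): {"handover", "data"},
    ("customer|register order", "sale manager|order checking"): {"handover"},
    ("sale manager|order checking", "sale manager|customer relationship"): {"data"},
}

# Placeholder edge-colour codes for the motif-discovery input file.
EDGE_CODE = {
    frozenset({"data"}): 1,              # data exchange only
    frozenset({"handover"}): 2,          # handover only
    frozenset({"data", "handover"}): 3,  # both flows on the same pair
}

edge_list = [
    (src, dst, EDGE_CODE[frozenset(flows)])
    for (src, dst), flows in pair_flows.items()
]
for src, dst, code in edge_list:
    print(src, "->", dst, "code", code)
```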

After extracting the subgraphs, the identified subgraphs are filtered by their z-score and p value to identify the significant motifs as follows:

$$ \left| {z{\text{ - score}}} \right| > 1.65. $$
(27)
$$ \left| {p{\text{ - value}}} \right| \le 0.05. $$
(28)

These threshold limit values (TLVs) in Eqs. (27) and (28) are selected because subgraphs with z-score values higher than 1.65 and p values lower than 0.05 are most probably significant motifs, and the likelihood that these motifs occur by chance is below 5% (Bjorn and Falk 2008).

5.1.4 Classification of motifs for identifying logical structures in the process

In this step, considering the structure of the resulting ternary motifs, rules are set to map motifs with specific structural features to valid logical structures in process models. These rules are as follows:

  1. (i)

    Rule no. 1 If the nodes in the ternary motif are connected consecutively by two edges containing handover flows, such that no node has an in-degree or out-degree greater than one, the activities corresponding to these nodes are also consecutive in the logical structure of the process. Figure 4 depicts the motif structure to which rule no. 1 applies. In this figure, the data exchange flow is neglected, and the edges contain only the handover flow.

    Fig. 4
    figure 4

    The motif representing a sequence among the activities

  2. (ii)

    Rule no. 2 In a size three motif, if two edges containing handover flows enter the same node, the activity corresponding to that node is a joint node in the process Petri net. The structure of the motif to which rule no. 2 applies is illustrated in Fig. 5. In this figure, the data exchange flow is overlooked, and the edges contain only the handover flow.

    Fig. 5
    figure 5

    The motif representing a joint node

  3. (iii)

    Rule no. 3 In a size three motif, if two edges containing handover flows leave the same node, the activity corresponding to that node is a split node in the process Petri net. The structure of the motif explained by rule no. 3 is depicted in Fig. 6. In this figure, the data exchange flow is overlooked, and the edges contain only the handover flow.

    Fig. 6
    figure 6

    The motif representing a split node

Most of the logical structures of the process model can be figured out based on the above three rules. The remaining question is: since all of the aforementioned rules are based on the handover flows, what information can be derived from the data exchange flows? Although the second and third rules identify the joint and split structures, they do not determine the types of these structures. Here, the data exchange flows are used to determine whether a joint/split structure is an ‘AND’ or an ‘XOR’ structure. Hence, two further rules are set as follows (a sketch applying all five rules is given after this list):

  1. (iv)

    Rule no. 4 If data exchange takes place among the activities in a joint/split structure, those activities are parallel and the joint/split nodes are ‘AND’ nodes. The rationale behind this rule is that parallel activities are somehow connected; for instance, they may all belong to the same organizational unit, so there are most probably data exchange flows among the roles performing these activities.

  2. (v)

    Rule no. 5 If there is no data exchange among the activities in a joint/split structure, those activities can be in a choice relation with one another, and the joint/split nodes are ‘XOR’ nodes. The rationale behind this rule is that choice structures divide the process model into several branches; as a result, activities that are in a choice relation with each other are independent of one another, so there is no relationship, including a data exchange flow, among their roles.
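
The sketch below applies the five rules to a single ternary motif given as a list of typed edges; it is a simplified illustration under the assumption that sequence, joint, and split structures are decided from the handover edges alone (rules no. 1–3) and that the ‘AND’/‘XOR’ type is decided by the presence of a data edge between the two branch activities (rules no. 4–5).

```python
from collections import Counter

def classify_motif(edges):
    """edges: list of (source, target, flow_type) tuples over three nodes."""
    handover = [(s, t) for s, t, ft in edges if ft == "handover"]
    data_pairs = {frozenset((s, t)) for s, t, ft in edges if ft == "data"}

    in_deg = Counter(t for _, t in handover)
    out_deg = Counter(s for s, _ in handover)

    # Rule 1: two consecutive handovers, no node with in- or out-degree > 1.
    if len(handover) == 2 and max(in_deg.values()) == 1 and max(out_deg.values()) == 1:
        return "sequence"

    # Rules 2 and 4/5: two handovers entering the same node -> joint node.
    joins = [n for n, d in in_deg.items() if d == 2]
    if joins:
        branches = frozenset(s for s, t in handover if t == joins[0])
        return "AND-joint" if branches in data_pairs else "XOR-joint"

    # Rules 3 and 4/5: two handovers leaving the same node -> split node.
    splits = [n for n, d in out_deg.items() if d == 2]
    if splits:
        branches = frozenset(t for s, t in handover if s == splits[0])
        return "AND-split" if branches in data_pairs else "XOR-split"
    return "unclassified"

# Example: activities B and C both hand over to D and exchange data -> AND joint.
print(classify_motif([("B", "D", "handover"), ("C", "D", "handover"), ("B", "C", "data")]))
```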

5.1.5 Discovering the process control flow model by integrating the resulting motifs

In this step, which is the last step of the proposed approach, the process control flow structure is obtained based on the rules defined in the previous step. First, the nodes hosting each motif must be identified; for this purpose, one of the features of FANMOD is used. In FANMOD, nodes as well as edges can be distinguished, which allows the identification of the nodes related to a given motif. By assigning a code to each node, FANMOD differentiates among isomorphic motifs on different nodes; hence, it is possible to identify the activities and roles related to each node. Generally speaking, motif discovery algorithms use one of the following three strategies to calculate the importance of the subgraphs (Wong and Baur 2010):

  1. (i)

    F1: This strategy allows for the overlap of nodes and edges.

  2. (ii)

    F2: In this strategy, only nodes are allowed to overlap and edges must be entirely separated.

  3. (iii)

    F3: In this strategy, neither nodes nor edges are allowed to overlap.

FANMOD uses the F1 strategy; hence, motifs may share nodes. Next, the separate motifs are joined according to their shared nodes. Afterward, the logical relationships among the activities corresponding to each node are identified based on rules no. 1 to 3. Finally, the types of the joint and split nodes are determined by rules no. 4 and 5, which yields the process control flow structure.
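
A sketch of the integration step: because the F1 strategy lets motifs share nodes, overlapping motifs can be merged by taking the union of their node and edge sets, for example by composing per-motif graphs in networkx; the motif contents below are illustrative.

```python
import networkx as nx

# Each discovered motif as its own small directed graph (illustrative contents).
motif_1 = nx.DiGraph([("Register order", "Check orders"),
                      ("Check orders", "Set install time")])
motif_2 = nx.DiGraph([("Check orders", "Set install time"),
                      ("Set install time", "Send install address")])

# Motifs sharing nodes are joined by graph union (nodes and edges are merged).
integrated = nx.compose_all([motif_1, motif_2])

print(sorted(integrated.nodes()))
print(sorted(integrated.edges()))
```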

5.2 Evaluation strategy

To evaluate the proposed approach, first the effect on the proposed method of the factors that change the process behavior is studied. Next, the efficiency of this method is compared to that of previous methods. For this comparison, the alpha, alpha plus, and heuristic mining algorithms are selected from the group of prior methods. The mechanism of each algorithm is described in the following.

  • Alpha algorithm This algorithm, explained by Van der Aalst and Song (2004), is a deterministic algorithm. In short, three kinds of dependency relationships, namely succession, parallel, and choice, are derived from the execution times of the activities in the event logs; afterward, each relation is mapped to a Petri net.

  • Alpha plus algorithm This algorithm is an extended version of the alpha algorithm; it not only takes the explicit dependencies into account but also defines a set of implicit relationships among the activities based on several incidence theories (Wen et al. 2009).

  • Heuristic mining algorithm This algorithm functions similarly to the alpha algorithm, except that it not only takes the dependencies into account but also uses the frequency of relationships to detect noise (Weijters and van der Aalst 2003).

In this research, since the primary process models are simulated and available, the efficiency of the proposed method is measured using the F-measure, which is the weighted harmonic mean of Precision and Recall and is obtained via Eq. (29). Precision and Recall are calculated by Eqs. (30) and (31), respectively. Here, Precision refers to the ratio of the correctly detected relations to all relations found by the solution, while Recall refers to the ratio of successfully identified relationships to all actual relationships in the process model. The proposed method is therefore evaluated from the viewpoint of business process management; in other words, the conformance between the discovered process model and the initial process model is checked.

$$ f = \frac{2}{{\frac{1}{{\text{Precision}}} + \frac{1}{{\text{Recall}}}}} . $$
(29)
$$ {\text{Precision}} = \frac{\text{tp}}{{{\text{tp}} + {\text{fp}}}} . $$
(30)
$$ {\text{Recall}} = \frac{\text{tp }}{{{\text{tp}} + {\text{fn }}}}. $$
(31)

In Eqs. (30) and (31), tp (true positives) is the number of relationships among the process activities that are detected correctly by the solution, fp (false positives) is the number of relationships that do not actually exist among the process activities but are wrongly identified by the solution, and fn (false negatives) is the number of relationships that exist in the primary process model but are not detected by the solution.
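
A minimal sketch of this evaluation, treating both the reference model and the discovered model as sets of directed activity relations; the relation sets are illustrative.

```python
# Relations (directed activity pairs) in the reference model and the discovered model.
actual = {("a", "b"), ("b", "c"), ("c", "d")}
discovered = {("a", "b"), ("b", "c"), ("b", "d")}

tp = len(discovered & actual)   # correctly discovered relations
fp = len(discovered - actual)   # discovered but not in the reference model
fn = len(actual - discovered)   # relations in the reference model that were missed

precision = tp / (tp + fp)                      # Eq. (30)
recall = tp / (tp + fn)                         # Eq. (31)
f_measure = 2 / (1 / precision + 1 / recall)    # Eq. (29), harmonic mean

print(f"Precision={precision:.2f}, Recall={recall:.2f}, F-measure={f_measure:.2f}")
```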

6 Results and discussion

To assess the proposed method, a set of experiments is designed. First, workflow matrices are created for three different processes. A workflow matrix identifies which activities are performed by which roles or resources, as well as the order in which the activities are performed. Figure 7 depicts part of the workflow matrix of the first process; this process belongs to a company that provides various internet services such as ADSL.

Fig. 7
figure 7

Part of workflow matrix for the first process

In Fig. 7, the rows are activities and the columns are organizational roles; the matrix also identifies the order of performing the activities. From a workflow matrix, a process model can be generated. The process models are simulated in WoPeD,Footnote 3 an open-source tool developed for modeling, simulating, and analyzing processes described by workflow nets. Event logs can be generated by simulating the processes in WoPeD. Figure 8 illustrates the first process model, generated based on the workflow matrix depicted in Fig. 7.

Fig. 8
figure 8

First process which is modeled and simulated in WoPeD

In the next step, the PGFD models of all three processes are built from the related workflow matrices. Table 6 illustrates the PGFD model of the first process, created based on the workflow matrix shown in Fig. 7.

Table 6 PGFD table of the first process

Table 6 depicts the different flows exchanged among the organizational roles responsible for certain activities; in this research, only the data flows are used. After that, eight event logs are generated per process by changing the parameters that determine the process behavior and simulating the processes in WoPeD; hence, a total of 24 event logs (8 × 3) are created. According to previous research, the behavior of a process is affected by three factors (Weijters and van der Aalst 2003):

  • The number of input cases within a given period of time: In this research, the number of input cases varies from 100 cases to 400 cases within a 30-day period.

  • Noise level: There are five ways of producing noise in an event log: (1) missing head, (2) missing body, (3) missing tail, (4) missing event, and (5) exchanged events. In this study, noise was produced using all of these methods in the range between 0 and 40% (a sketch of these noise operations is given after this list).

  • The probability of activities being fired in the process: For example, if activities a and b are in a choice state with each other, probabilities p and 1 − p lead to the selection of activities a and b in each iteration, respectively.
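
As an illustration of the five noise types, the sketch below applies each operation to a copy of a short trace; the trace contents, the portion sizes, and the random choices are illustrative only.

```python
import random

trace = ["a", "b", "c", "d", "e"]

def missing_head(t):    # drop a prefix of the trace
    return t[len(t) // 3:]

def missing_body(t):    # drop a middle section of the trace
    return t[:len(t) // 3] + t[2 * len(t) // 3:]

def missing_tail(t):    # drop a suffix of the trace
    return t[:2 * len(t) // 3]

def missing_event(t):   # drop one randomly chosen event
    i = random.randrange(len(t))
    return t[:i] + t[i + 1:]

def exchange_event(t):  # swap two neighbouring events
    i = random.randrange(len(t) - 1)
    return t[:i] + [t[i + 1], t[i]] + t[i + 2:]

for op in (missing_head, missing_body, missing_tail, missing_event, exchange_event):
    print(op.__name__, op(list(trace)))
```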

Finally, the proposed method is applied to each experiment set, which is the combination of an event log and the PGFD model of a process. Figure 9 shows the integrated motifs for one of the experiment sets of the first process. These motifs are extracted from the bilayer network of data flow and handover flow and integrated based on rules no. 1, 2, and 3, which determine the sequential, split, and joint patterns among the activities.

Fig. 9
figure 9

Integrated motifs for the first process resulting from the proposed method

In Fig. 9, based on rule no. 2 and rule no. 3, there are split structures among the activities {‘Register order’, ’Contact customers’, ’Check orders’} and {‘send install address’, ‘Quick installation’, ‘Normal installation’}, and there are joint structures among the activities {‘Register order’, ’Contact customers’, ‘Set install time’} and {‘Quick installation’, ‘Normal installation’, ‘Get confirm’}. According to rule no. 4 and rule no. 5, the split/joint structures among {‘Register order’, ’Contact customers’, ’Check orders’} and {‘Register order’, ’Contact customers’, ‘Set install time’} are of type ‘AND,’ since there is a data flow between ‘Contact customers’ and ‘Check orders,’ and the split/joint structures among {‘send install address’, ‘Quick installation’, ‘Normal installation’} and {‘Quick installation’, ‘Normal installation’, ‘Get confirm and Register’} are of type ‘XOR,’ because there is no data flow between ‘Quick installation’ and ‘Normal installation.’ After applying the mentioned rules and mapping the motifs to Petri net patterns, the control flow of the process is derived. Figure 10 depicts the control flow of the first process created by the proposed method.

Fig. 10
figure 10

Control flow of the first process resulting from the proposed method

In the following, the characteristics of experiments and the results of implementing the proposed method on the experiment sets are explained.

6.1 Results

In Tables 7, 8, and 9, the specifications of the eight experiments carried out on the first, second, and third processes are listed, respectively. These tables show the number of input cases fed into the process in each experiment, along with the probability of activities being fired and the noise percentage variations. For the probability of activities being fired, each row belongs to one choice structure. For instance, in Table 9, which belongs to the third process, three rows correspond to the probability of activities being fired; these three rows show the existence of three choice structures in the third process model, and each row shows the probability of the activities being fired in one choice structure.

Table 7 The specifications of the experiments carried out in the first process
Table 8 The specifications of the experiments carried out in the second process
Table 9 The specifications of the experiments carried out in the third process

Figures 11, 12, and 13 show the Precision, Recall, and F-measure values for each of the eight experiments on the first, second, and third processes, respectively. In these diagrams, the horizontal axis indicates the experiment number and the vertical axis shows the Precision, Recall, and F-measure values.

Fig. 11
figure 11

Results from the experiments in the first process

Fig. 12
figure 12

Results from the experiments in the second process

Fig. 13
figure 13

Results from the experiments in the third process

In the first four experiments, the noise for each process is set to zero (Tables 7, 8, and 9), while the rate of cases fed into the process and the probability of activities being fired vary. In the second four experiments, a noise percentage is applied to the event logs. As seen in Figs. 11, 12, and 13, the results of the first four experiments on each process are the same, whereas in the second four experiments the quality of the results partly declines as the noise increases. Therefore, the number of cases fed into the process and the probability of activities being fired change the behavior recorded in the event log but do not affect the results of the proposed method; the only factor influencing the performance of this method is noise.

After studying the determining factors, the number of cases fed into the process and the probability of activities being fired are ruled out in this step, and only the noise percentage is taken into account. Next, the performance of the proposed approach is compared to the performances of the alpha, heuristic mining, and alpha plus algorithms. These three algorithms are all implemented as plug-ins in ProM; therefore, all event logs containing the generated noise are loaded into the ProM software per process. Afterward, each of the three algorithms is applied to the inputs, and the Precision, Recall, and F-measure of the results are measured. In Figs. 14, 15, and 16, the F-measure is plotted against the noise percentage for each selected algorithm along with the results of applying the proposed method to all three processes. In these diagrams, the horizontal axis shows the noise percentage and the vertical axis indicates the F-measure value.

Fig. 14
figure 14

F-measure-to-noise (%) ratio in the first process

Fig. 15
figure 15

F-measure-to-noise (%) ratio in the second process

Fig. 16
figure 16

F-measure-to-noise (%) ratio in the third process

6.2 Discussion

Figures 11, 12, and 13 suggest that in the first four experiments on each process, the results remain unchanged. In these experiments, noise is zero and only the number of input cases and the probability of activities being fired change. In the next four experiments, noise is added to the event log, and the increase in noise degrades the results. Therefore, as seen in Figs. 11, 12, and 13, the number of input cases and the probability of activities being fired do not affect the results of the proposed method; noise is the only determinant.

Moreover, Figs. 11, 12, and 13 indicate that although Recall is high in all eight experiments, it is not equal to 1; in other words, the proposed method is not capable of identifying some structures, namely hidden and duplicate tasks. Hidden tasks are tasks that are only in charge of routing in the process model; no information on these tasks is stored in the information system, so the proposed method cannot identify them. However, it may be possible to enable the proposed approach to determine the hidden tasks in the future using a set of incidence theories.

Duplicate tasks, on the other hand, are activities that appear more than once in the process model, but the proposed method is currently unable to differentiate among them. This pitfall can also be avoided in the future by initially relabeling the activities. To sum up, as seen in Figs. 11, 12, and 13, the results of the proposed method are influenced by the logical structures in the primary process model and by the noise recorded in the event log.

In Figs. 14, 15, and 16, the number of input cases and the probability of activities being fired are ruled out in order to consider only the effect of the noise percentage. These figures compare the results of the proposed method to the results of the alpha, alpha plus, and heuristic mining algorithms. As seen in Figs. 14, 15, and 16, despite the effect of the noise recorded in the event log on the proposed approach, this method is more noise resistant than the previous methods; with an increase in the noise percentage, the F-measure value remains high and does not decrease significantly. As reflected by Figs. 14, 15, and 16, among the three selected algorithms, only the alpha plus algorithm is initially highly accurate; however, with an increase in noise, the efficiency of this algorithm decreases severely.

7 Conclusions and future work

In this research, a new approach for process model discovery is introduced. This approach identifies the process model based on the information in the event log and the information on the exchange of data among organizational roles, which is derived from the PGFD model. In this approach, first a bilayer social network of the handover flow and the data exchange flow among the roles in charge of the different process activities is created from the event log and the PGFD model. Next, the process model, consisting of the logical relationships among the process activities and their order, is obtained by discovering the network motifs. To assess the proposed approach, three different processes are designed and simulated, and a set of experiments is carried out on the data sets related to each process. Finally, the efficiency of the resulting models is measured using the F-measure. The assessment results reflect the higher effectiveness of the proposed method as compared to the previous methods. Among the advantages of the proposed method are its high resistance to noise and its ability to handle incompleteness. Moreover, since the information on data exchange is used alongside the event log to identify the logical relationships among activities, this approach is capable of detecting many complicated structures (such as non-free choice structures); hence, it is more precise than the previous methods.

As indicated by the assessment results, the proposed approach is highly precise and highly resistant to noise. Therefore, the implementation of this approach within a standard framework (e.g., as a plug-in in ProM) can help managers to discover accurate organizational process models. This tool enables organizational managers to identify the actual current process models; thus, they will be able to make rapid changes.

Despite the high efficiency and the noise resistance of the proposed method, the following requirements should be addressed in future research:

  • Adding a simplification mechanism to the proposed approach to prevent the development of spaghetti models; this is possible through community detection methods.

  • Improving the recommended approach to identify all complicated structures, including duplicate and hidden tasks, by defining incidence theories and relabeling activities.

  • Applying the proposed method to actual data sets.

  • Evaluating the computational complexity of the proposed method against the other methods.