1 Introduction

The use of reference models offers many advantages for the development of individual enterprise models in practice as well as in science (Fettke & Loos, 2004; Frank, 2008, p. 42). However, it is undisputed that the realisation of these advantages requires the availability of reference models. Thus, methods for the systematic development of high-potential reference models are highly relevant.

Based on the established distinction between rationalism and empiricism as two basic paths to knowledge, reference modelling differentiates between a deductive and an inductive strategy for developing a reference model (Becker & Schütte, 1997, pp. 428–430; Thomas, 2006, p. 102f):

  • Deductive strategy: Common principles and theories are the basics for the development of a reference model. The reference model will be refined and concretised during the development phase.

  • Inductive strategy: On the basis of individual enterprise models, a reference model is developed through the identification of commonalities between the individual models and through the abstraction of particularities. An increasing abstraction from specificities of individual enterprise models is one characteristic of this development process.

Even though both strategies are known in the field, a deeper analysis of the current state-of-the-art reveals a significant gap. Most methods follow the deductive strategy, while the inductive strategy is supported only by a few. However, the inductive strategy also has much potential for reference modelling:

  1. Numerous reference models have been constructed inductively, particularly in practice [cf. attribute “construction method” in the reference model catalogue at http://rmk.iwi.uni-sb.de; analogously in (Thomas 2006, p. 103)]. Note that it cannot be concluded from this finding that inductive methods for the reference model development are well known. Rather the opposite seems to be true as, in these works, the exact development steps are not very detailed or not explicitly described. Thus, in particular, the important question of a possible generalization of the actually selected development steps remains unclear.

  2. Both development strategies can be combined without problems. Thus, it is possible to use a deductively developed reference model together with individual enterprise models as a basis for a further inductive development of reference models.

  3. Enterprise modelling has gained more importance in organizational practice. Thus, more individual enterprise models, target models and reference models which can be used for inductive reference modelling are available. Some of the models are available as so called “open models” (Koch, Strecker, & Frank, 2006).

To summarise, although there is a considerable lack of methodological knowledge about the inductive development of reference models, the potential of inductive methods is extremely attractive. Especially if one distinguishes between the standard case of reference modelling and non-standard cases, the inductive strategy seems highly beneficial. The standard case delivers a reference model which serves as a basis for creating individual models, e.g. in terms of adapting and enhancing with respect to an individual use case. In contrast to this, the non-standard cases cover different variations: (1) variation of the modelling demand, e.g. best practice, common practice or model reusability; (2) variation of the object, e.g. companies with several locations/offices, parent/subsidiary companies of a horizontally organized enterprise, organization units with comparable function in different sectors; (3) variation of the modelling level, e.g. software reference model or (4) variation of the modelling purpose, e.g. model merging, developing multi-perspective reference models, analysis of big model collections. Against this background, the present work aims at a contribution to closing the identified gap in research. Thus, our research objective is to develop a method for inductive reference modelling.

The research approach of this work stands in the tradition of German design science oriented research in the modelling of enterprise information systems (Frank, 2006): On the basis of theoretically as well as practically relevant problems in the (inductive) development of reference models, where no satisfying solutions exist, the authors study and present particular techniques supporting an inductive model development. Following an inductive strategy implies the need for methods and techniques, e.g. identifying correspondences or structural analogies between different process models. Different approaches for merging, abstracting and aggregating particular process models are necessary as well. These streams of research, the development of corresponding methods and the implementation of particular techniques, are therefore highly important in the context of our applied research approach and in the paper at hand. The applicability and usefulness of these methods and techniques are shown by means of an application scenario: Developing reference models for some Dutch government processes based on existing models from 10 municipalities in an inductive manner.

After this introduction, the next section gives an overview of related work on inductive reference modelling. Thereafter, a specific seven-phase method for inductive reference modelling is presented. Section 3 describes some central subject areas with corresponding particular techniques supporting this inductive approach, while Sect. 4 introduces a software tool, the RefMod-Miner, realizing these and more techniques. In Sect. 5, the mentioned application scenario is presented. Finally, Sect. 6 closes the article with a conclusion and an outlook on future work. This work is an extended version of Fettke (2014); in particular, different techniques, software tools and application scenarios are described in greater detail in Sects. 3–5.

2 Related Work

Several authors describe a procedure model for reference modelling development (Ahlemann & Gastl, 2007; Becker, Delfmann, Knackstedt, & Kuropka, 2002; Delfmann, 2006; Fettke & Loos, 2004; Schlagheck, 2000; Schütte, 1998; Schwegmann, 1999; Thomas, 2006; vom Brocke, 2003). A first analysis of these methods shows that the inductive strategy does not play a prominent role with regard to most methods. Typically, starting from a general definition of the problem, a reference model is derived by a stepwise refinement and concretisation. In contrast, activities such as the creation of individual enterprise models or the abstraction of enterprise-specific features that would be expected for the inductive strategy are not listed at the top level of the life cycle models.

Although the inductive strategy plays no prominent role in these methods, none of them explicitly argues against it. On the contrary, some even note that existing individual enterprise models and other knowledge sources should be identified and taken into account as part of the reference model development (cf. Becker et al., 2002, p. 49; Schwegmann, 1999, p. 167; Thomas, 2006, pp. 278–280). Nevertheless, besides the programmatic call to consider existing individual enterprise models, only a few actual suggestions exist for the systematic derivation of reference models from these models.

Also, the question remains open as to what can be done if appropriate individual enterprise models are neither available nor identifiable prior to the reference model development. Must the development of individual enterprise models for reference modelling be waived in this case? Or is it possible that reference model development benefits from the development of individual enterprise models while, in a second step, a reference model is derived in an inductive manner? Besides the mentioned methods, Gottschalk, van der Aalst, & Jansen-Vullers (2008) and Li, Reichert, & Wombacher (2010) present first ideas for an inductive strategy. However, these works do not provide general inductive methods for the development of reference models. Instead, reference modelling is mainly seen as an algorithmic problem: it is essentially assumed that a reference model can be derived from a set of given process models. Questions, for example with regard to the collection of individual models or the terminological harmonization of the labels of the process models, remain largely unaddressed. In addition, these works focus mainly on the process control view and do not consider the modelling of business information systems in general.

Furthermore, some approaches utilize an inductive strategy (Aier, Fichter, & Fischer, 2011; Daun & Matheis, 2005; Karow, Pfeiffer, & Räckers, 2008). However, these approaches focus on the development of a particular reference model. The authors do not claim to present a general method for the inductive development of reference models.

In addition to the works specific to the development of reference models, various approaches are known that have a certain similarity to the inductive development of reference models, e.g. approaches for model comparison (Dijkman, Dumas, van Dongen, Käärik, & Mendling, 2011) or for the integration of enterprise models (Rahm & Bernstein, 2001a, 2001b). These approaches provide very interesting concepts for the analysis of enterprise models but they have not been applied in reference modelling so far.

In conclusion, it can be stated that the deductive strategy significantly dominates the previous methods for reference model development. The inductive strategy and its fundamental ideas are basically known. Nevertheless, there is a lack of general methods for the inductive construction of reference models.

3 Towards a Seven Phase Method for Inductive Reference Modelling

For the inductive development of reference models no concrete requirements are known. Instead, the different requirements for such a method are justified by arguments:

  • Inductive development: The method is intended to support a modeller so that a reference model can be derived systematically from individual enterprise models. One cannot speak of an inductive development in a meaningful way if this requirement is not met.

  • Identification of commonalities: If the individual enterprise models contain similarities, these have to be represented in the reference model. In this way, the reference model represents the typical structures of an application domain.

  • Abstraction: Reference models do not claim to represent all company-specific features. Therefore, the derived reference model should be more abstract than the individual enterprise models.

  • Generativity: In contrast to the first requirement, it should be possible to derive the individual enterprise models from the inductively generated reference model. This ensures that the reference model is not too far away from the individual enterprise models that it represents.

  • Properties of natural languages: Natural language is a common part of enterprise models, and known phenomena such as homonymy, synonymy and linguistic fuzziness are typical of it. A method must take these aspects into account.

In the following, the seven phases of the proposed method for the inductive development of reference models (Walter, Fettke, & Loos, 2012a) will be presented in greater detail.

Phase 1: Initiation of Reference Model Development

The goal of the first step is to identify the requirements that a derived reference model should fulfil. To determine the requirements, the following alternatives are available:

  • Interviews: Interviews with domain experts or potential model users can give guidance concerning the requirements that the reference model should fulfil.

  • Literature review: A literature review of relevant literature provides an insight into the aspects to be taken into account by a derived reference model.

  • Analysis of existing reference models: An analysis of existing reference models provides an overview of the requirements that are already fulfilled by other reference models. It is useful to consider the models of other domains besides directly similar models.

The derived requirements have to be prioritized in order to evaluate the relevancies of the different requirements.

The result is a prioritized list of requirements for the reference model.

Phase 2: Acquisition of Individual Process Models

The goal of this step is to collect individual enterprise models that are used for the inductive development of reference models. This should be done in four sub-steps:

  • Class definition: The class of enterprises for which the reference model should be developed has to be determined. For example, a class can be created by an explicit list of companies or by a specification of characteristic features that a business must meet in this industry branch or domain.

  • Enterprise selection: In general, individual enterprise models are collected not for all, but only for selected companies of the previously defined class. The selection of suitable companies should take into account at least three aspects: (a) representativeness of the selected companies (b) accessibility to a company or individual enterprise models (c) effort to collect individual enterprise models. In a concrete decision, conflicts between these aspects will occur. For example, the costs will rise if additional enterprise models have to be collected. But this can be essential for reasons of representativeness.

  • Unified modelling conventions: Modelling conventions concern: (a) the chosen modelling language, e.g. event-driven process chains (EPC) or Business Process Modelling Notation (BPMN); (b) layout conventions, e.g. sequential processes have to be aligned top to bottom; (c) naming conventions, e.g. a single process step has to be described by “subject + predicate”; (d) terminological conventions, e.g. “A customer is a business partner buying goods regularly”. The definition of unified modelling conventions noticeably reduces the effort of later analysis. However, it is rather unlikely that such conventions can be enforced, especially in inter-company contexts. Thus, phase 3 contains further measures.

  • Collecting individual enterprise models: Enterprise models of the selected enterprises have to be ascertained. The known methods for enterprise modelling can be used. The inductive development of the reference model can be carried out at a lower cost, especially when individual enterprise models have already been created in the past and can be reused. It is important to document the source (“provenance”) of the collected enterprise models because important conclusions can often be drawn from this information (e.g., What was the purpose of the original model? Which changes took place? Are there legal restrictions which have to be obeyed?).

The result is a definition of classes of enterprises as well as individual enterprise models.

Phase 3: Pre-processing of Individual Process Models

The goals of the third step are an adjustment and a harmonisation as well as a pre-processing of the individual enterprise models in order to derive an initial reference model. For this purpose, several sub-steps are required:

  • Checking the unified modelling conventions: If the modelling conventions could be enforced in the collection of individual enterprise models in the second step, it is necessary to check the extent to which they have actually been applied [appropriate techniques are given in Delfmann (2010)]. Otherwise, the individual enterprise models have to be transformed in this step according to the unified modelling conventions.

  • Generating modelsynsets: As a next step, modelsynsets have to be built in order to prepare an appropriate grouping of the models in phase 4. The definition of modelsynsets is based on the concept of a linguistic synset, which designates a set of interchangeable words in certain contexts (Miller, 1998, p. 23): A modelsynset is a set consisting of a single word or a group of words that can be interchanged in an enterprise model without changing the intended purpose of the model. An example of a modelsynset is “creditor, supplier (a business partner who has obligations for goods and services)”. A synset and a modelsynset are conceptually similar, but they do not have to be the same: General dictionaries for the English colloquial language, such as WordNet, are usually not appropriate because individual business terms are often not available at the necessary level of detail. But such terms are important within individual enterprise models. In addition, individual enterprise models often contain business-specific characteristics which are not covered by general dictionaries. Nonetheless, digital dictionaries can be used as a first step for an automatic generation of modelsynsets, which must be checked afterwards.

The results are homogeneous individual enterprise models and modelsynsets.
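
How modelsynsets might be applied when harmonising labels can be sketched as follows. This is only an illustration under simplifying assumptions: apart from the pair “creditor, supplier” taken from the text, the synset entries are invented, and the helper names `build_lookup` and `harmonise` are hypothetical. Each word of a label is replaced by one canonical representative of its modelsynset, so that later comparisons treat interchangeable terms as equal.

```python
# Hypothetical modelsynsets: each set lists terms treated as interchangeable.
MODELSYNSETS = [
    {"creditor", "supplier"},  # example pair taken from the text
    {"invoice", "bill"},       # assumed entry, for illustration only
]


def build_lookup(synsets):
    """Map every term to one canonical representative of its modelsynset."""
    lookup = {}
    for synset in synsets:
        canonical = min(synset)  # deterministic choice: alphabetically first term
        for term in synset:
            lookup[term] = canonical
    return lookup


def harmonise(label, lookup):
    """Replace each word of a label by its canonical modelsynset term."""
    return " ".join(lookup.get(word, word) for word in label.lower().split())


LOOKUP = build_lookup(MODELSYNSETS)
```

In line with the text, an automatically generated lookup of this kind would still have to be checked manually before it is used.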

Phase 4: Exploitation of the Reference Model

The goal of this step is the generation of a reference model out of homogeneous individual enterprise models. The following sub-steps have to be processed:

  • Clustering: In a clustering step, the individual models are grouped such that models within one group are similar and models belonging to different groups are different. Here, typical techniques of cluster analysis or multivariate statistics can be used. The modelsynsets created in phase 3 can support the grouping. Known similarity measures for enterprise models can also be applied (Dijkman et al., 2011). However, these similarity measures focus on the similarity of enterprise models as a whole and do not take into account the similarity of single model fragments. The identification of similarities between individual sub-models provides great potential for the derivation of reference models: individual enterprise models as a whole can exhibit significant differences even though some parts are very similar and, thus, could be summarized in a reference model.

  • Deriving a reference model: For each cluster, a reference model has to be derived. The main idea is based on identifying similar model fragments within a cluster, which are then transformed into a reference model. In this step, individual enterprise models are interpreted as graphs. Within the various graphs, isomorphic sub-graphs have to be identified. These sub-graphs should be as large as possible. The relative frequencies of a sub-graph can be used in order to check which fragments can be used as a reference model. An abstraction parameter α and a configuration parameter β are introduced to describe the extent to which characteristics of individual enterprise models are reflected by the reference model. If α is equal to 0 %, all sub-graphs are used and if α is equal to 100 %, only sub-graphs occurring in all individual enterprise models become part of the reference model. The configuration parameter β determines the value at which a sub-graph becomes a mandatory part of the reference model.

The result of this step is a raw reference model.
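
The role of the two parameters can be illustrated with a small sketch. It assumes that the recurring sub-graphs have already been identified and are given by an identifier together with the set of models in which they occur. The function name, the use of "greater than or equal" for both comparisons and the labels "mandatory" and "optional" are assumptions made for illustration.

```python
def select_fragments(fragment_occurrences, n_models, alpha, beta):
    """Select sub-graphs for the raw reference model.

    fragment_occurrences: dict mapping a fragment id to the set of model ids
    that contain the fragment. alpha and beta are fractions between 0 and 1.
    """
    selected = {}
    for fragment, models in fragment_occurrences.items():
        frequency = len(models) / n_models  # relative frequency of the sub-graph
        if frequency >= alpha:              # abstraction parameter
            # configuration parameter: frequent enough to be mandatory?
            selected[fragment] = "mandatory" if frequency >= beta else "optional"
    return selected


occurrences = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {1}}
```

With four models, `select_fragments(occurrences, 4, 0.5, 1.0)` keeps fragment A as mandatory and B as optional, while C is abstracted away. Setting `alpha` to 0 keeps all three fragments, and setting it to 1 keeps only A.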

Phase 5: Post-processing of the Reference Model

The goal of the fifth step is the post-processing of the previously derived raw reference model. Here, three different approaches are possible:

  • Concatenation of model fragments: Interesting relationships can occur between parts of the raw reference model, which should be reflected in the final reference model. For example, some sequences can occur in several different individual enterprise models, so these dependencies should be included in the reference model.

  • Integration of deductively developed reference model fragments: If fragments of a reference model cannot be derived with the inductive strategy, these fragments can be derived deductively and integrated into the final reference model.

  • Manual extensions: As a last option, manual extensions can be made in order to correct the reference model, because it is obvious that not all steps can be completely automated.

The result is the reference model.

Phase 6: Evaluation of the Reference Model

The goal of this step is to evaluate the developed reference model. In principle, the evaluation can be made from different perspectives where the scope and the content of the perspectives can hardly be defined a priori. Instead, these have to be negotiated in a discourse between the model developers, the users and the evaluators. Within such a discourse, it should be checked to what extent the criteria are justified, how they are weighted and to what extent they are fulfilled. Typical perspectives are:

  • Evaluation with respect to requirements: It is necessary to check to what extent the reference model fulfils the requirements defined in the first step.

  • Evaluation with respect to individual enterprise models: It is necessary to examine how individual enterprise models can be derived from the reference model. As a benchmark, the initial individual enterprise models or other models can be used.

  • Evaluation based on an existing framework: Literature provides several criteria for the assessment of reference models, e.g. the framework by Frank (Frank, 2007), the guidelines for enterprise modelling (Becker, Rosemann, & Schütte, 1995) or ontological quality criteria (Fettke, 2006).

The result is an evaluated reference model.

Phase 7: Maintenance and Enhancement

The goal of the seventh step is to maintain and improve the reference model after the initial construction. This includes corrections of the reference model as well as necessary additions. It is possible that further individual enterprise models are developed and should be integrated into the reference model during enhancement. In this case, it has to be decided whether the previously created reference model should be developed again from scratch (redraft) or whether it suffices to check how far aspects of the new individual enterprise models are already covered by the reference model, so that only slight changes have to be made (modification draft). Important considerations here are the stability of the reference model, the planned development costs and the complexity of the necessary changes.

The result is an enhanced reference model.

4 Particular Techniques for Inductive Reference Modelling

4.1 Process Matching

Matching describes the process that takes two schemata as input, referred to as the source and the target, and produces a number of matches between the elements of these two schemata based on a particular correspondence (Rahm & Bernstein, 2001a, 2001b). Thereby, the term schema has a broad interpretation and can comprise database schemata (e.g. Evermann, 2009) as well as arbitrary other model schemata.

Process matching can be divided into two different fields: matching process models (1) and matching nodes of process models (2) (Thaler, Hake, Fettke, & Loos, 2014). Matching process models describes the mapping of process models to other models based on criteria like similarity, equality or analogy. A prominent application scenario is the handling of company mergers, where it is necessary to synchronize different processes, e.g. in the context of administration.

In contrast to this, matching nodes of process models, which is what the term process matching mostly refers to, describes the mapping of single nodes, a set of nodes or node blocks of one model to the corresponding elements of another model. Important application scenarios are the harmonization of business process models and the inductive derivation of reference models from different individual models. In order to determine the matches between process models (1), node matching techniques as described in Becker and Laue (2012) and Weidlich, Dijkman, & Mendling (2010) are used in most cases. While Becker and Laue (2012) present 19 different similarity measures for business process models with their underlying, mostly 1:1, node matching techniques, Weidlich et al. (2010) develop a similarity measure for process models based on M:N node matches. The cardinality describes the number of nodes in the node sets which are matched to each other. An example of a node matching with both 1:1 and M:N matches is visualized in Fig. 1.

Fig. 1 Node matching example (Thaler et al., 2014)

Generally, it is not only possible to match nodes (activities, events and connectors in terms of EPCs), but also edges. However, most of the existing techniques and algorithms only take activities into account. There are several different approaches for the automatic detection of correspondences. A common technique is the consideration of (normalized) edit distances like the Levenshtein distance (Dijkman et al., 2011). Many approaches (Cayoglu et al., 2013) also use wordnets with tools like WordNet or GermaNet to take semantic information concerning synonyms, homonyms or antonyms into consideration. Thereby, node labels are split into single terms or n-grams, stop words like “is”, “are”, “at” etc. are removed and the remaining terms or n-grams are matched to the terms or n-grams of other labels.
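
A minimal sketch of such a label comparison is given below. It removes stop words and then computes a normalized Levenshtein similarity. The stop word list contains only the three examples from the text, and dividing the distance by the length of the longer string is one common way of normalizing, chosen here as an assumption. The function names are illustrative.

```python
STOP_WORDS = {"is", "are", "at"}  # only the examples from the text; real lists are longer


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def tokens(label):
    """Lower-case a label, split it into terms and drop stop words."""
    return [word for word in label.lower().split() if word not in STOP_WORDS]


def label_similarity(label_1, label_2):
    """Normalized edit similarity in [0, 1] of two stop-word-free labels."""
    s1, s2 = " ".join(tokens(label_1)), " ".join(tokens(label_2))
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s1, s2) / longest
```

A purely syntactic measure of this kind cannot recognize synonyms, which is why the approaches mentioned above additionally consult lexical resources.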

As mentioned above, there are several possibilities for identifying correspondences between nodes. As an example, the particular technique RefMod-Mine/NSCM is introduced. First of all, the technique uses a semantic error detection to validate the correctness of node types. The form and the order of nouns and verbs of a label are analyzed, so that the algorithm is able to determine whether a node should be an activity or an event (in the case of EPCs).

Generally, one can also distinguish between matching exactly two models to each other (binary matching) and matching a set of models (n-ary matching). N-ary matching yields a transitive matching over multiple models, which is generally not the case for binary matches. The RefMod-Mine/NSCM algorithm conducts an n-ary cluster matching: the nodes of all models to be matched are compared pairwise using a semantic similarity measure. The agglomerative cluster algorithm (Jain, Murty, & Flynn, 1999) starts with clusters of size 1 (activities) and consolidates two activities into one cluster if their similarity value exceeds a specific threshold.
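
The clustering idea can be sketched in a few lines. The sketch is not the RefMod-Mine/NSCM implementation: it uses a simple word-overlap (Jaccard) measure as a stand-in for the semantic similarity measure, merges clusters in a single-linkage fashion via a union-find structure, and the function names and the default threshold of 0.5 are assumptions.

```python
from itertools import combinations


def jaccard(label_1, label_2):
    """Stand-in similarity: share of common words (not the NSCM measure)."""
    a, b = set(label_1.lower().split()), set(label_2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_nodes(nodes, similarity=jaccard, threshold=0.5):
    """Cluster nodes of several models.

    nodes: list of (model_id, label) pairs. Returns a list of clusters,
    each cluster being a list of such pairs.
    """
    parent = list(range(len(nodes)))  # every node starts as its own cluster

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Compare all nodes pairwise and merge clusters above the threshold.
    for i, j in combinations(range(len(nodes)), 2):
        if similarity(nodes[i][1], nodes[j][1]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, node in enumerate(nodes):
        clusters.setdefault(find(i), []).append(node)
    return list(clusters.values())


example_nodes = [
    ("m1", "check invoice"),
    ("m2", "check the invoice"),
    ("m3", "invoice check"),
    ("m1", "ship goods"),
]
```

Because clusters are merged whenever any pair of their members is similar enough, the resulting matching is transitive across models, which corresponds to the n-ary property described above.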

The similarity measure consists of three phases: (1) splitting each node label L into single words \( {w}_{i_L} \), so that \( split(L)=\left\{{w}_{1_L},\dots, {w}_{n_L}\right\} \), whereby stop words and waste characters such as additional spaces are removed; (2) computing the Porter stem \( stem\left({w}_{i_L}\right) \) of each word (Porter, 1997) and comparing the stem sets of the labels; and (3) an antonym check, which is described after Eq. 1. The similarity is defined as the number of matching stems divided by the sum of all words (cf. Eq. 1).

Equation 1: RefMod-Mine/NSCM node similarity measure

If the similarity value exceeds a specific threshold, the labels are checked for antonyms using a lexical database (3), which decides whether the similarity is 0 or \( sim(L_1, L_2) \).

In the end, the RefMod-Mine/NSCM technique extracts binary matchings from the calculated node clusters. For each model pair, all clusters are analyzed for the occurrence of nodes in both models. The node set of the first model contained in a cluster is then matched to the node set of the second model contained in the same cluster. Finally, the algorithm returns binary simple or complex matches for the nodes of each model pair.
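
The extraction of binary matches from node clusters can be sketched as follows. The data layout (clusters as lists of model and node pairs) and the function name are assumptions. For each cluster and each pair of models occurring in it, the nodes of the first model are matched to the nodes of the second one.

```python
from itertools import combinations


def binary_matches(clusters):
    """Derive binary matches from node clusters.

    clusters: list of clusters, each a list of (model_id, node) pairs.
    Returns a dict mapping a model pair (a, b) to a list of matches, each
    match being a pair (nodes of model a, nodes of model b).
    """
    result = {}
    for cluster in clusters:
        by_model = {}
        for model, node in cluster:
            by_model.setdefault(model, []).append(node)
        # A cluster yields a match for every pair of models it touches.
        for a, b in combinations(sorted(by_model), 2):
            match = (sorted(by_model[a]), sorted(by_model[b]))
            result.setdefault((a, b), []).append(match)
    return result


example_clusters = [
    [("m1", "a1"), ("m2", "b1"), ("m2", "b2")],  # leads to a 1:2 (complex) match
    [("m1", "a2")],                              # occurs in one model only: no match
]
```

In this example, node a1 of model m1 is matched to the two nodes b1 and b2 of model m2, which is a complex match. A cluster with exactly one node per model would give a simple 1:1 match.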

4.2 Structural Analogies

One of the main problems in reference modelling is the identification of correspondences (cf. 4.1). But if there is neither a suitable definition of correspondences between elements nor the means to identify correspondences between elements in another way, it is almost impossible to calculate a useful matching. This is especially the case if the considered schemata belong to different domains utilising completely different vocabulary.

One way to overcome such vocabulary problems is to focus on structural aspects only. Typically, the induced underlying graph structure of most modelling languages is used for the identification of schema matches. One of the most common approaches is the calculation of graph edit distances (GED) (Dijkman et al., 2011; Li, Reichert, & Wombacher, 2008). The derived measure relies on the number of change operations (insertion, modification or deletion of nodes) that are needed to transform one schema into a second one. Commonly, the lower the number of change operations, the greater the similarity.
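
The relation between change operations and similarity can be illustrated with a strongly simplified sketch. It does not search for an optimal node mapping as the cited approaches do. Instead, it assumes that corresponding nodes already carry the same identifier, counts only insertions and deletions of nodes and edges (a modified node counts as one deletion plus one insertion), and normalizes by the total size of both graphs. All names and the normalization are assumptions.

```python
def edit_operations(graph_a, graph_b):
    """Count node and edge insertions/deletions under a fixed correspondence.

    Each graph is a tuple (nodes, edges): a set of node identifiers and a
    set of (source, target) pairs.
    """
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    # Elements present in only one of the graphs must be inserted or deleted.
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


def ged_similarity(graph_a, graph_b):
    """Similarity in [0, 1]: the fewer change operations, the higher the value."""
    size = sum(len(part) for part in graph_a + graph_b)
    if size == 0:
        return 1.0
    return 1.0 - edit_operations(graph_a, graph_b) / size


g1 = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
g2 = ({"a", "b", "d"}, {("a", "b"), ("b", "d")})
```

For `g1` and `g2`, four change operations are needed (node c and its incoming edge are deleted, node d and its incoming edge are inserted), which yields a similarity of 0.6 under this normalization.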

Another approach in the context of process matching is the refined process structure tree (RPST) (Vanhatalo, Völzer, & Koehler, 2009), in the course of which the underlying graph of a process is decomposed into a hierarchy of fragments. Each fragment is a small subgraph which has exactly one single entry node and exactly one single exit node (SESE). Multiple entry and exit nodes can be handled by adding single dummy starting or end nodes. The fragments of an RPST can be separated into several fragment types, as for example trivial, bond or rigid fragments. Based on this kind of graph decomposition, the analogy is determined through the comparison of the resulting RPSTs.

In contrast to the RPST, the approach in Walter, Fettke, & Loos (2012b) utilises all subgraphs of the underlying graph to determine the degree of structural analogy between two process models, especially EPCs. The main advantage is that this approach is not restricted to SESE fragments. Moreover, this technique is also independent of any previous knowledge about correspondences of elements. For example, in Fig. 2 two EPCs are presented that are structurally analogous although they describe different processes. Only three elements have equal labels (“start”, “finish order”, “order finished”). In order to match (cf. 4.1) further elements, it is necessary to use advanced mapping algorithms that are able to identify semantically related labels like “invoice settled” and “payment received”. Otherwise, such elements cannot be mapped.

Fig. 2 Structurally analogous process chains

The degree of structural analogy \( d_s \) of two given EPCs A and B is calculated as follows (Walter et al., 2012a, 2012b):

Equation 2: Degree of structural analogy

In a survey, the method was applied to the Y-CIM reference models (Scheer, 1998) and the Retail-H reference model (Becker & Schütte, 2004). The results show that the reference models contain structurally analogous parts: about 75 % of the structures consisting of 4 nodes are structurally analogous, 54 % with 5 nodes, 36 % with 6 nodes, 23 % with 7 nodes and 14 % with 8 nodes.

In contrast to the linear-time computation of an RPST (Vanhatalo et al., 2009), the subgraph isomorphism problem is NP-complete (Garey & Johnson, 1979). Nonetheless, due to the nature of EPCs, several structural characteristics, e.g. different node types, can be used to speed up the calculation of subgraph isomorphism. Thus, this approach can be used to calculate further process matches which are then utilised for the inductive development of a reference model (Rehse, Fettke, & Loos, 2013).
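
How node types can prune the search is illustrated by the following sketch. It is a brute-force check for small fragments and not the algorithm of the cited works: it tests whether two fragments of equal size are isomorphic when labels are ignored but node types (e.g. event, function) must agree. The data layout and names are assumptions.

```python
from itertools import permutations


def structurally_analogous(types_a, edges_a, types_b, edges_b):
    """Type-respecting isomorphism test for two small directed graphs.

    types_*: dict mapping a node id to its type, e.g. "event" or "function".
    edges_*: set of (source, target) pairs. Labels play no role.
    """
    # Cheap structural checks first: sizes and type distributions must agree.
    if len(types_a) != len(types_b) or len(edges_a) != len(edges_b):
        return False
    if sorted(types_a.values()) != sorted(types_b.values()):
        return False
    nodes_a = list(types_a)
    for candidate in permutations(types_b):
        mapping = dict(zip(nodes_a, candidate))
        # Prune mappings that would map a node to a node of another type.
        if any(types_a[n] != types_b[mapping[n]] for n in nodes_a):
            continue
        if {(mapping[s], mapping[t]) for s, t in edges_a} == set(edges_b):
            return True
    return False


chain_a = ({"e1": "event", "f1": "function", "e2": "event"},
           {("e1", "f1"), ("f1", "e2")})
chain_b = ({"x": "event", "y": "function", "z": "event"},
           {("x", "y"), ("y", "z")})
join_c = ({"x": "event", "y": "function", "z": "event"},
          {("x", "y"), ("z", "y")})
```

Here `chain_a` and `chain_b` are structurally analogous although they share no node names, whereas `join_c` has the same node types but a different edge structure. The permutation search grows factorially, so the sketch is only suitable for fragments of a few nodes, such as the sizes of 4 to 8 nodes reported in the survey above.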

4.3 Reference Model Development

The terms reference modelling and reference model have not been consistently defined in literature and a lively discussion about this topic is still underway. In general, business process reference models can be understood as business process models which ought to fulfil certain criteria and offer certain features. However, these criteria are still under discussion. Referring to Fettke and Loos (2007), the following features are considered important:

  • Reusability: Business process reference models represent blueprints for the development of process-oriented IS which can be reused in different IS development projects.

  • Exemplary practices: Business process reference models can provide common, good or even best practices, describing how business processes are actually designed in practice or how they could or should be designed and executed in order to reach certain goals. In this context, a descriptive as well as a prescriptive or even normative connotation of business process reference models becomes apparent, depending on their interpretation.

  • Universal applicability: Business process reference models not only represent business processes of one particular organization, but also aim at providing universally applicable business process representations which are valuable for different organizations in a certain domain.

Reference models can provide benefits for both theory and practice. Besides the provision of general descriptions of enterprises, which is especially interesting from a theoretical point of view, practice profits, e.g. from reductions in modelling costs, modelling time and modelling risk, as reference models can represent proven solutions (Becker & Meise, 2011). Furthermore, increases in model quality based on the reuse and adaptation of already validated process models can be expected.

In the following, the minimal cost of change (MCC) approach (Ardalani, Houy, Fettke, & Loos, 2013) is presented in greater detail as a solution for inductive reference model development. This approach supports the development of reference models with the minimal cost of change, in the sense of a minimized graph edit distance to a set of given underlying process models.

The MCC algorithm comprises three main steps: In the first step, a set of candidateRelations is derived from the existing nodes and edges in the given process models. In the second step, this set is filtered by means of a threshold. In the third step, the reference model is generated based on the filtered set of step 2.

In the first step, all existing relations (edges) in the given process models are extracted into the set of candidateRelations. For each relation, a savedValue (\( Nodes\to \left[- cost(del), cost(ins)\right] \)) is calculated to prioritize the relations for the later reference model. This value is based on a cost function (\( cost:O\times Nodes\to \mathrm{\mathbb{N}} \)), which indicates the costs for the change operations (\( O=\left\{ins,\ mov,\ del\right\} \)) needed to transform one model into another model. Relations with greater savedValues have higher priority to appear in the final reference model. Then, in the filtering step, a threshold (\( t\in \left[- cost(del),\; cost(ins)\right] \)) is used to filter the candidateRelations. Relations that have a savedValue greater than the threshold are added to the reference model. With a higher threshold, only relations with a higher savedValue are inserted into the final model, which results in smaller reference models. In the final step, the reference model is created from the filtered set, taking several refinement rules into account.
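The first two steps can be sketched as follows. This is a simplified illustration, not the MCC implementation of Ardalani et al. (2013). It makes several assumptions that the text does not state: each model is reduced to a set of directed relations between node labels, insertion and deletion costs are constants instead of a node-dependent cost function, and the savedValue of a relation is taken to be the average saving over all models (an insertion saved for each model that contains the relation, a deletion incurred for each model that does not). This assumed formula was chosen only because it stays within the range \( \left[- cost(del), cost(ins)\right] \). The function names are hypothetical, and the refinement rules of step 3 are omitted.

```python
from typing import Dict, List, Set, Tuple

Relation = Tuple[str, str]  # directed edge (source label, target label)


def candidate_relations(models: List[Set[Relation]]) -> Set[Relation]:
    """Step 1a: collect every relation that occurs in at least one model."""
    return set().union(*models)


def saved_values(
    models: List[Set[Relation]],
    cost_ins: float = 1.0,
    cost_del: float = 1.0,
) -> Dict[Relation, float]:
    """Step 1b (ASSUMED formula, not taken from the source):
    average saving per model if the relation is put into the reference model."""
    n = len(models)
    values: Dict[Relation, float] = {}
    for rel in candidate_relations(models):
        k = sum(1 for model in models if rel in model)  # models containing rel
        values[rel] = (k * cost_ins - (n - k) * cost_del) / n
    return values


def filter_relations(values: Dict[Relation, float], threshold: float) -> Set[Relation]:
    """Step 2: keep relations whose savedValue is greater than the threshold."""
    return {rel for rel, value in values.items() if value > threshold}


if __name__ == "__main__":
    # Three hypothetical model variants of the same process.
    variants = [
        {("A", "B"), ("B", "C")},
        {("A", "B"), ("B", "C"), ("C", "D")},
        {("A", "B"), ("B", "D")},
    ]
    values = saved_values(variants)
    for t in (-1.0, 0.0, 0.5):
        print(t, sorted(filter_relations(values, t)))
```

Raising the threshold in the usage example shrinks the selected relation set, which mirrors the statement that higher thresholds result in smaller reference models.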

By changing the parameters of the MCC approach, such as the cost function or the threshold, different reference models can be created. To assess the created reference models, a further criterion is necessary. In this contribution, a totalSavedValue is defined as the sum of the savedValues of the relations contained in a created reference model. The reference model with the highest totalSavedValue within the set of reference models is then retrieved as the final reference model. To maximise the totalSavedValue, exactly the relations with a positive savedValue should be inserted into the reference model; therefore, the threshold should be equal to zero. In analogy to the parameter α defined in phase 3 of the method, the threshold can be mapped onto a normalised range between 0 and 100 %. As mentioned, if α is equal to 0 %, all elements are inserted, and if α is equal to 100 %, only the relations occurring in all individual models become part of the reference model.
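The selection criterion and the normalisation can be written compactly. The sum merely restates the definition of the totalSavedValue, where \( R \) denotes the set of relations of a generated reference model (a symbol introduced here for illustration). The linear mapping between the threshold \( t \) and α is an assumption: the text only fixes the two endpoints, and a linear interpolation is the simplest mapping that agrees with them.

```latex
\[
  totalSavedValue(R) \;=\; \sum_{r \in R} savedValue(r)
\]
\[
  \alpha \;=\; \frac{t + cost(del)}{cost(ins) + cost(del)} \cdot 100\,\%,
  \qquad t \in \left[- cost(del),\; cost(ins)\right]
\]
```

Under this assumed mapping, \( t = - cost(del) \) corresponds to α = 0 % and \( t = cost(ins) \) to α = 100 %. Since every relation with a positive savedValue increases the sum and every relation with a negative savedValue decreases it, the sum is maximal for \( t = 0 \).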

Although the MCC approach focuses especially on providing an abstracted reference model, which contains the most relevant relations of the underlying process models, the algorithm is also able to produce a completely integrated model containing all nodes of the underlying process models if a low threshold is defined. To shed more light on the input and output of this approach, an example is shown in Fig. 3. Three sample EPCs in a model variant collection represent the input data. The approach, with different thresholds set, generates common practice reference models as shown below.

Fig. 3: Given process models and generated reference models with different thresholds

It should be emphasized that the models generated by this approach are not always the desired reference models. However, by adjusting the parameters, a reference model that meets the expectations can be created. Thus, some generated models can be considered reference models for certain purposes, while others may not meet the requirements for a reference model.

5 RefMod-Miner

In order to support the inductive reference modelling approach, a corresponding software tool was developed. The goal of the tool development was not to support a fully automated development of a reference model. Rather, the tool supports a developer in creating a reference model in an inductive manner.

In order to achieve platform independence, Java was used as the programming language. The architecture of the tool consists of three layers, which are shown in Fig. 4. At the lowest layer, functionalities for loading, storing, conversion and transformation as well as versioning of model data are available. Two file formats are supported: the ARIS Markup Language (AML) and the EPC Markup Language (EPML). The second layer contains concepts and algorithms which support the analysis of individual enterprise models and the derivation of a reference model. The top layer contains several functions for model visualization and editing as well as the possibility of browsing repositories and functions to explain the derivation process.

Fig. 4: Architecture of the reference model miner

6 Application Scenario

In order to demonstrate some particular techniques, a holistic application scenario is presented. Within this scenario, 80 process models from the Dutch government (Vogelaar, Verbeek, Luka, & van der Aalst, 2012), which cover 8 different processes with 10 variants each, are used. The objective is to derive a reference model for each of these 8 processes based on the variants from different municipalities.

In a first step, clustering techniques are used to identify and reconstruct the given model groups. Since the model repository consists of 80 single models with 8 different processes and 10 variants each, it is necessary to identify the relevant models for generating the reference models. Therefore, a process model similarity measure is used, which quantifies the similarity between two process models based on the percentage of common nodes and edges (Minor, Tartakovski, & Bergmann, 2007) on a scale between 0 and 1.
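A minimal sketch of such a measure is given below. It is one simple reading of "percentage of common nodes and edges" and not necessarily the exact definition of Minor et al. (2007): the similarity is assumed to be the share of common nodes and edges among all nodes and edges of both models. Nodes are assumed to be identified by their labels, i.e. correspondences between the models are taken as given. The function name `model_similarity` is hypothetical.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def model_similarity(
    nodes_a: Set[str],
    edges_a: Set[Edge],
    nodes_b: Set[str],
    edges_b: Set[Edge],
) -> float:
    """Share of common nodes and edges among all nodes and edges of two
    models (ASSUMED reading of the measure), on a scale between 0 and 1."""
    common = len(nodes_a & nodes_b) + len(edges_a & edges_b)
    total = len(nodes_a | nodes_b) + len(edges_a | edges_b)
    # Two empty models are treated as identical.
    return common / total if total else 1.0


if __name__ == "__main__":
    # Two hypothetical variants that share the beginning of the process.
    sim = model_similarity(
        {"a", "b", "c"}, {("a", "b"), ("b", "c")},
        {"a", "b", "d"}, {("a", "b"), ("b", "d")},
    )
    print(round(sim, 3))
```

Pairwise values of this kind can serve as input for a clustering step, so that variants of the same process end up in the same group.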

The results show that it is possible to automatically derive a reference model from a given set of enterprise models. Furthermore, typical similarities and differences of the enterprise models are made explicit. Hence, the application scenario provides substantive support for the claim that the inductive development of reference models can become much more efficient and effective.

7 Conclusion and Future Work

Reference modelling offers several advantages for the practice of enterprise modelling (see also chapters by Becker (2015) and Malinova and Mendling (2015)). These benefits, however, can only be realised if high-performing reference models are available. As the predominant methods almost exclusively follow deductive approaches, this work presents the possibilities and challenges of an inductive approach, contributing to a more innovative approach to process management (see also chapter by Schmiedel and vom Brocke (2015)). Although the presented method does not allow a purely algorithmic approach, it is still able to significantly support the modeller in the reference model development, e.g. in terms of process standardization or developing common or best practices. This potential support is particularly attractive since neither a deductive nor an inductive strategy has preeminent advantages. Consequently, in the practice of reference modelling, it is advisable to combine both strategies.

As deductive methods for reference model development are already well known, the paper at hand focuses on the inductive approach and presents a specific seven-phase method for the inductive development of reference models. The authors introduced several important subject areas and described particular techniques which are relevant in the given context. These techniques are also applied to a concrete application scenario in order to give an impression of what is already possible today. However, there is a need for intensive further research, since the performance of the techniques is in many fields still far from adequate. In fact, the presented RefMod-Mine/NSCM approach for process matching won the Process Model Matching Contest 2013 (Cayoglu et al., 2013), although its results in terms of precision, recall and f-measure were only moderate. Thus, a further development of corresponding techniques is still necessary and of high importance.

It is the vision of the authors to develop a comprehensive model corpus containing models in a standardized, digital and processable format. This vision entails the following long-term research objectives: (1) creating a consistent understanding of business application systems in different domains, (2) reusing the contained models in other contexts, and (3) creating a homogeneous data basis for different application and analysis scenarios. Moreover, the authors aim at publishing the model corpus in terms of open models, much as in the open source idea in the context of software development. However, this very much depends on the license holders of the model corpus’ content.

The initial starting point for this ongoing work is the reference model catalogue provided by Fettke & Loos (2002) (rmk.iwi.uni-sb.de/). It contains 98 reference model entries with lexical data and meta-data, such as the number of contained single models. However, this catalogue contains neither digitally processable models (in terms of the used modelling language and a consistent exchange format) nor entries of individual models from different domains. Each model collection or each model within the developed model corpus could be assigned to exactly one of the following three categories based on their origin or type: (1) reference model, (2) individual model, and (3) model from controlled modelling scenarios.

In order to support this further research, the authors developed a process model corpus which could serve as a standardized basis for the evaluation of methods and techniques. Indeed, the corpus currently covers 2,290 single models. Nevertheless, it is limited in terms of scope and diversity. Against this background the corpus should be enhanced by additional models and collections, and provided to other scientists.

Further needs for future work are mentioned in the following:

  • the development of further high-performing concepts for inductive reference modelling,

  • a wide application of new methods to gain more experience in terms of performance,

  • an application of the inductive method to develop new reference models and

  • the development and application of techniques and algorithms for the corpus development.