1 Introduction

AI-powered chatbots “have a considerable impact in many domains directly related to the design, operation, and application of information systems” and at the same time need to be handled with care [70], as the models provide information without considering the limitations of their own technology. Business process management as an information systems discipline seems a viable candidate to benefit from chatbots and hence from the recent advances in large language models, in particular when supporting users in creating and improving process-related content, most prominently process models and process descriptions. Process models enable participants to understand the processes in which they are involved [17] and to improve business performance [6]. However, errors in process models may have adverse business consequences [24] and may lead to problems during process execution as well as quality issues [15].

Currently, the creation of process models is often based on the interaction between domain experts, who hold the knowledge of the process, and process modellers/analysts, who are proficient in process modelling and analysis techniques. Hence, the acquisition of as-is models can consume up to 60% of the time spent on process management projects [29]. The overarching question of this work is thus how and to what degree chatbots can replace the process modeller/analyst when creating process models through conversational modelling (CM) with the domain expert.

CM refers to conversation flow modelling in which the chatbot can receive and interpret inputs from the user (e.g., follow-up questions, unexpected inputs, or changes of topic) and provide appropriate responses that keep the conversation coherent [49].

This question can be broken down into the following research questions:

RQ1 How can CM methods/tools be employed for process modelling?

RQ2 Which CM methods/tools exist for process modelling?

RQ3 How can we evaluate CM methods/tools with respect to process modelling?

RQ4 Which implications do chatbots have for BPM modelling practice/research?

RQ1–RQ4 are tackled as follows: Based on the concept of conversational process modelling, initial application scenarios are posed along the process life cycle (cf. Sect. 2). These initial application scenarios provide the keywords for the subsequent literature review (cf. Sect. 3), which aims at refining the scenarios along a taxonomy of existing approaches. For evaluating existing chatbots, a test set of process descriptions, process models, and quality assessments is collected and prepared (cf. Sect. 4.1). The systematic analysis of the chatbots along the refined application scenarios (cf. Sect. 4.2) is conducted based on key performance indicators and provides the basis for deriving practical implications and research directions in conversational process modelling (cf. Sect. 5).

2 Conversational Process Modelling

Only a few papers address conversational modelling, mostly by focusing on the design of virtual human agents (aka chatbots), e.g., [49, 61]. However, there is no common understanding of conversational process modelling yet; we hence provide the informal Concept 1, which takes up characteristics of conversational modelling regarding the participants in the conversation, i.e., the domain expert and the chatbot, and the iterative nature of the conversation.

Concept 1

(Conversational process modelling) describes the process of creating and improving process models and process descriptions based on the iterative exchange of questions/answers between domain experts and chatbots.

Concept 1 reflects the overarching goal of conversational process modelling, i.e., to enable process modelling and improvement based on interaction between the domain expert and the chatbot, instead of interaction between the domain expert and the process analyst/modeller. This goal constitutes the first pillar for analyzing the BPM life cycle w.r.t. the process modelling scenarios where conversational process modelling can be applied. The second pillar reflects the assumption that conversational process modelling is exclusively based on domain expert/chatbot interaction and does not employ any other tool. In the conclusion, we sketch how conversational process modelling can be extended if the chatbot usage is augmented by other tools such as process simulation tools.

In the following, Concept 1 is fleshed out for application scenarios along the BPM life cycle as provided in [27]. The BPM life cycle is chosen as it provides a systematic structuring of the different process-oriented tasks and capabilities towards creating business value.

Process discovery subsumes a range of methods to create process models (not to be confused with process discovery as a process mining task, which is necessarily based on event logs). The typical input in a process discovery project consists of textual process descriptions gathered through interviews or workshops. Based on the process descriptions, process models are created by process modellers/analysts. We identified the following steps as suitable for being supported by chatbots: (1) gathering the process descriptions for creating the process model; this also includes the preparation of the process descriptions, i.e., increasing their quality in terms of, for example, precision, through automatic paraphrasing; (2) taking a process description as input and producing a process model (accompanied by the process description); here, the chatbot can be employed for analyzing the text and extracting process model relevant information such as activities and their relations as well as actors [12]; and finally (3) assessing a process model (with the accompanying process description) regarding model quality, based on quality metrics such as cohesion [72] and guidelines such as number of elements or label style [8].

The process analysis phase builds the bridge between the as-is process model created in the process discovery phase and the to-be model created in the process redesign phase. It is concerned with the qualitative and quantitative assessment of a process model. A qualitative analysis comprises, for example, an assessment of whether certain activities can be automated; this can then be analogously reflected by an action recommendation, e.g., if the automation potential is not yet fully exploited. The chatbot can support this assessment based on the activities extracted in the process discovery phase. The results of the qualitative assessment can then be used in the process redesign phase for corresponding redesign actions. Quantitative process analysis comprises, for example, detecting bottlenecks based on process simulations. As mentioned before, for this work we assume that the chatbot is used without invoking further tools and systems such as a process simulator. Hence, quantitative process analysis does not include tasks for conversational process modelling at this stage, but is left for future work as discussed in Sect. 4.3.

Process redesign comprises the definition of the redesign goal, which again is considered a managerial task. The chatbot can support the domain expert by proposing existing redesign methods such as Lean Six Sigma, as well as by querying models (cf. [56]) or applying the redesign instructions. Especially important is the refactoring of process descriptions, based on existing guidelines for process model refactoring or catalogues of process smells such as [73].

The phases of process implementation and process monitoring are considered part of future work on conversational process modelling as they will require the invocation of additional tools and systems such as a process engine or a process-aware information system.

Table 1 summarizes the initial application scenarios for conversational process modelling along the process life cycle phases and steps which constitute the input for the subsequent literature and test set based analyses.

Table 1. Application Scenarios and Chatbot Tasks along Process Life Cycle

The BPMN model depicted in Fig. 1 assembles and refines the application scenarios, together with their inputs, outputs, and related chatbot tasks as summarized in Table 1, into a generic process model for conversational process modelling, reflecting its interactive and iterative characteristics: at first, the domain expert provides a process description which is refined (→ paraphrase) and the results are displayed (→ extract). Then an assessment of the result quality is conducted (→ compare and assess). If the quality is insufficient, the process models/descriptions are refined (→ query, refactor), possibly based on a specific method (→ select method), until the quality reaches a sufficient level.

Fig. 1. The Process of Conversational Process Modeling (modeled in BPMN using SAP Signavio)

3 State of the Art

The literature analysis consists of two steps: i) a pre-review based on the initial application scenarios and life cycle phases summarized in Table 1, and, based on the outcome of the pre-review, ii) a more generalized review including, for example, NLP-based methods for the extraction of model information from process descriptions. Both steps follow the guiding principles of [37].

i) Pre-review: The pre-review is conducted based on the keywords resulting from building the cross product of the application scenarios summarized in Table 1 and the keyword “chatbot”, e.g., “process modelling” chatbot. These keywords are then used in a title search (allintitle) on scholar.google.com (Footnote 1). Next, we use the keywords resulting from the cross product of application scenarios and chatbot tasks, e.g., “process modelling” paraphrase, and the keywords resulting from the cross product of the keyword “conversational” and the application scenarios (allintitle), e.g., conversational “process modelling”. In order to broaden the pre-review, we repeated the searches for application scenario and chatbot, but without the keyword “process”. Most of these searches resulted in zero or only a couple of hits, which were rejected due to quality issues or domain irrelevance. A minimal sketch of the query generation is given below.
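As an illustration of the search methodology, the following sketch enumerates the cross-product query strings; the scenario and keyword lists are abbreviated examples, not the full sets used in the review.

```python
# Minimal sketch (not the authors' script): build the pre-review query
# strings as the cross product of application scenarios and chatbot-related
# keywords. The lists below are abbreviated, illustrative examples.
from itertools import product

scenarios = ['"process modelling"', '"process analysis"', '"process redesign"']
terms = ["chatbot", "paraphrase", "conversational"]

# allintitle: restricts Google Scholar matches to the paper title
for scenario, term in product(scenarios, terms):
    print(f"allintitle: {scenario} {term}")
```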

The pre-review did not yield deeper insights into techniques, opportunities, and limitations of conversational process modelling. The results rather point towards generalizing the keywords used for the search, particularly covering NLP-based methods. Hence, for the ii) second search, we used https://scholar.google.com to produce Table 2. It lists 52 papers covering a wide variety of relevant topics. Papers were selected based on the occurrence of the enumerated keywords (selection criteria) in the abstract or the title of the first 20 hits.

In the following, we will discuss the literature collected in Table 2 regarding five fundamental questions that partly correspond to the research questions and partly to the pointers derived from the pre-review.

How do chatbots work, and what are important areas of application? A chatbot is a form of human-computer interaction used to simulate conversations that solve particular user problems [3]. Chatbots work by processing language input from humans (henceforth referred to as natural language processing (NLP) [21, 50]) and reacting to it. The interpretation of human input is achieved through a set of rules [20, 26, 40] or by utilizing large language models (LLMs) [42], which are trained to understand meaning/intent/context [18, 44] and to generate new content based on different statistical and probabilistic techniques. According to [51], the main areas of chatbot application are human resources, e-commerce, learning management systems, customer service, and sales.

Table 2. Literature Queries, Hits, and Selections

How are responses generated? After receiving user input, the chatbot processes it into a machine-readable form and, based on that input, generates natural language output utilizing different types of response generation methods [77]. Chatbot systems can be divided into six categories based on the type of response generator [44]: (1) template-based: the response is selected from a list of predefined pairs of query patterns and responses; (2) corpus-based: the user query is converted into a structured query language (SQL) query and passed to professional knowledge management techniques (e.g., database, ontology); (3) intent-based: a task-oriented system that tries to recognise the user intent behind a query with the help of advanced NLU techniques; (4) RNN-based: a Recurrent Neural Network (RNN) chatbot generates the response directly from the user query with the help of a model trained on a dialogue data set; (5) RL-based: Reinforcement Learning (RL) chatbots use rewarding and punishing functions to achieve the desired behaviour; (6) hybrid: a combination of the approaches listed above to achieve better performance or to overcome limitations faced when using one approach only.
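To make category (1) concrete, the following sketch implements a minimal template-based responder; the patterns and canned responses are invented for illustration and do not stem from any of the cited systems.

```python
# Minimal sketch of a template-based response generator: the user query is
# matched against predefined patterns and the paired response is returned.
# Patterns and responses are invented for illustration.
import re

TEMPLATES = [
    (re.compile(r"\b(hello|hi)\b", re.I),
     "Hello! How can I help you with your process?"),
    (re.compile(r"\bmodel\b", re.I),
     "Please describe the process you would like to model."),
]

def respond(query: str) -> str:
    for pattern, response in TEMPLATES:
        if pattern.search(query):
            return response
    return "Sorry, I did not understand that."  # fallback if nothing matches

print(respond("Hi, can you model my ordering process?"))
```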

How can response generation be implemented? All of the above types utilize some type of knowledge graph to formalize the configuration [7, 76] and the intended output format of the conversation [4, 55]. The knowledge graph is either accessed by simple querying languages such as AIML or SPARQL, or it is encoded as part of a neural network through training. Responses are thus either queried explicitly or generated implicitly by a neural network. Both approaches have different strengths and weaknesses. For conversation-centred applications such as entertainment, neural networks work well, but for applications requiring specialized output, other approaches remain valid solutions. Low-code solutions to control explicit responses [25] as well as BPMN-based solutions to encode potential progressions of a conversation [60] have been proposed. One example of such a system is PACA [41]. Automatically learning from user interactions can be achieved not only with neural networks (e.g., reinforcement learning) but also by automatically encoding interactions into rules, as in [5, 36].

Can chatbots deal with business processes? According to the survey of chatbot integration [9], 2 out of 347 chatbot systems support the business process interface pattern, namely [34, 43], which convert BPMN process models into dialogue models/chatbots. Currently, there are no chatbots that are able to generate BPMN models themselves. However, interest in the generation of models from various types of document sources has recently increased [29, 31, 64]. According to [32], use case diagrams, business rules, standard operating procedures, and plain unstructured text are considered as inputs for business process model generation. Based on the approaches mentioned above, the following three steps for creating BPMN models can be summarized [12, 66]: (1) Sentence Level Analysis: extraction of basic BPMN artefacts such as tasks, events, and actors (see the sketch below); (2) Text Level Analysis: exploration of relationships between the basic items, e.g., gateways; (3) Process Model Generation: creation of a syntactically correct model that captures the semantics of the input. [67] proposes a machine-readable intermediate format generated from natural language (either through automatic or manual annotation), which is then easy for computers to interpret.
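As an illustration of step (1), the following sketch extracts verb-object pairs from a process description as candidate task labels; it is a naive dependency-parsing heuristic, not the pipeline of the cited approaches, and assumes the spaCy model en_core_web_sm is installed.

```python
# Minimal sketch of sentence-level analysis: verb-object pairs are taken
# as candidate BPMN task labels. Requires:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The clerk checks the invoice. "
        "Afterwards, the manager approves the payment.")

for sent in nlp(text).sents:
    for token in sent:
        if token.pos_ == "VERB":
            # a direct object of a verb yields a simple "verb noun" label
            for child in token.children:
                if child.dep_ == "dobj":
                    print(f"candidate task: {token.lemma_} {child.text}")
```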

How can we evaluate chatbots with respect to BPM modelling? Currently, there are no gold-standard data sets that can be used to evaluate and compare the efficiency of process extraction from unstructured text [10]. In [29], a set of 47 text-model pairs from industry and textbooks is introduced, which could be converted from text to model with an accuracy of 77% (up to 96% similarity in some cases). In [39], 53 model-text pairs were used to evaluate the performance of a novel model-to-text transformation method. To avoid the necessity of constantly creating new data sets by hand, data augmentation techniques (increasing the training set size with the help of modified copies of already existing data set items) can be used [2, 79]. Another important tool is paraphrasing [35], i.e., generating texts from a source that are lexically and syntactically different while remaining semantically equal; a minimal sketch is given below.
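The following sketch shows the simplest flavour of such synonym-based augmentation; it is a naive illustration that ignores word sense and inflection, unlike the more sophisticated techniques cited above.

```python
# Minimal sketch of synonym-based data augmentation: each word is replaced
# by a random WordNet synonym where one exists, yielding a lexically
# different but (roughly) semantically similar paraphrase. Naive: no word
# sense disambiguation, no inflection handling.
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def paraphrase(sentence: str) -> str:
    out = []
    for word in sentence.split():
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(word)
                    for lemma in synset.lemmas()}
        synonyms.discard(word)
        out.append(random.choice(sorted(synonyms)) if synonyms else word)
    return " ".join(out)

print(paraphrase("the clerk checks the invoice"))
```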

4 Performance of Current Generation LLMs for Conversational Process Modelling

In order to assess the performance of conversational process modelling tools and answer RQ3, it is necessary to come up with a data set, an evaluation method, and a set of KPIs. Extending the three steps required to create a BPMN model (see Sect. 3), a fully integrated conversational process modelling toolchain would comprise: (a) extraction of tasks from textual descriptions, (b) extraction of logic such as decisions or parallel branchings from textual descriptions, (c) creation and layout of a BPMN model, and (d) application of modifications for the refinement of BPMN models. As a fully integrated conversational process modelling tool does not exist yet, in this paper we concentrate on how well current LLMs, namely the GPT models text-davinci-001 (GPT1), text-davinci-002 (GPT2), and text-davinci-003 (GPT3) from the OpenAI playground (Footnote 2), as well as gpt-3.5-turbo (GPT3.5) from writesonic.com (Footnote 3), perform at extracting tasks from textual descriptions (see (a) above). Task extraction is the starting point of the conversational process modelling toolchain, as the task is an atomic element of the process flow, representing a unit of work that should be performed [65].

4.1 Test Set Generation

The test set [46] utilized in this paper contains 21 textual process descriptions from 6 topics or domains. For each process description, between 8 and 11 BPMN process models have been created by modelling novices. These models represent different possible ways of interpreting the textual process description. Each model has at least one start and one end event, 3 exclusive gateways, one parallel gateway, and an average of 14 tasks. Some models also contain sub-processes, pools, and lanes. Each model was evaluated by a modelling expert using a quality value from 0 to 5, reflecting how well the textual description has been transformed into a BPMN model, i.e., whether all tasks and decisions from the textual description are in the BPMN model, tasks which can run in parallel have been correctly identified, and the BPMN model is well-formed.

An example of a textual description and an associated interpretation, i.e., the BPMN model, can be seen in Fig. 2.

Fig. 2. Textual Description and BPMN Model from the Evaluation Data Set

4.2 Evaluation

In this section, we will use the following KPIs and discuss their impact on conversational process modelling approaches: KPI1 - Text Similarity; KPI2 - Set Similarity; KPI3 - Set Overlap; KPI4 - Restricted Text Similarity; KPI5 - Restricted Set Similarity; KPI6 - Restricted Set Overlap; KPI7 - Average Augmented Task Extraction Prevalence and Similarity (GPT3 only). All results, including non-averaged data, are also available in [47].

Prompt engineering and KPIs: KPIs 1–3 are used to assess task extraction from the original process descriptions. This is realized by passing the following prompt to the LLMs: “Considering following <process_description> return the list of main tasks in it”. For the assessment based on KPIs 4–6, the original prompt is changed to “Considering following <process_description> return the list of main tasks (each 3–5 words) in it” in order to improve the granularity of the extracted tasks and to refine the quality of the obtained task labels. KPI7 is used to evaluate how stable task extraction by the LLMs is when the set of original process descriptions is extended using different paraphrasing algorithms. A minimal sketch of the extraction call is given below.
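The following sketch shows how such a prompt can be sent to the text-davinci models, assuming the legacy openai Python client (versions before 1.0) that was current for these models; the request parameters (max_tokens, temperature) are our assumptions, not reported settings.

```python
# Minimal sketch of the task extraction prompt, assuming the legacy
# openai Python client (< 1.0). Parameter values are illustrative
# assumptions, not the settings reported in the paper.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def extract_tasks(description: str, restricted: bool = False) -> str:
    granularity = " (each 3-5 words)" if restricted else ""
    prompt = (f"Considering following {description} "
              f"return the list of main tasks{granularity} in it")
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,   # assumed
        temperature=0.0,  # assumed: deterministic output
    )
    return response.choices[0].text.strip()
```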

Task extraction from the associated models is realised by parsing the XML documents and extracting the relevant BPMN activities, preserving their sequence in the process flow. A minimal sketch is given below.
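The following sketch shows one way to do this with the Python standard library; it collects only plain task elements in document order and ignores the userTask/serviceTask variants that real model files may contain.

```python
# Minimal sketch: extract task labels from a BPMN 2.0 XML file using only
# the standard library. Document order is used as an approximation of the
# sequence in the process flow.
import xml.etree.ElementTree as ET

BPMN_NS = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"

def model_tasks(path: str) -> list[str]:
    root = ET.parse(path).getroot()
    return [element.get("name", "")
            for element in root.iter(f"{BPMN_NS}task")]

print(model_tasks("model.bpmn"))  # hypothetical file name
```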

As the basis for each similarity measurement, we utilize contextual (BERT) and non-contextual (TF-IDF) vectorisers with a cosine similarity metric [19]. The contextual and non-contextual approaches are denoted as C and NC, respectively.
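A minimal sketch of both measurements follows, using scikit-learn for the non-contextual TF-IDF variant and the sentence-transformers library as a stand-in for the BERT vectoriser; the concrete embedding model is our assumption.

```python
# Minimal sketch of the two similarity measurements: non-contextual (NC)
# TF-IDF vectors and contextual (C) BERT-style sentence embeddings, both
# compared via cosine similarity. The embedding model choice is assumed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

def nc_similarity(a: str, b: str) -> float:
    vectors = TfidfVectorizer().fit_transform([a, b])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def c_similarity(a: str, b: str) -> float:
    emb = _model.encode([a, b])
    return float(cosine_similarity([emb[0]], [emb[1]])[0][0])

print(nc_similarity("check invoice", "verify the invoice"))
print(c_similarity("check invoice", "verify the invoice"))
```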

For KPI1, each LLM (GPT1, GPT2, GPT3, GPT3.5) is instructed to extract the tasks from the original process descriptions. The answer is then compared to the original text to assess the completeness of the extraction. The results are depicted in Table 3. For this KPI, GPT3.5 is the most successful LLM.

Table 3. Text Similarity (KPI1): Comparison of tasks extracted by LLM and original text using contextual (BERT) and non-contextual (TF-IDF) vectorisers

Table 4 shows the results for KPI2. The four LLMs are instructed to extract tasks from each textual description. This set of tasks is then compared to the set of tasks extracted from each BPMN model mentioned above (see Sect. 4.1). As multiple BPMN models exist for every textual description, the results are averaged per textual description. The averages are then again averaged over all textual descriptions. GPT3 is the most successful for this KPI, with a 74% extraction rate.

Table 4. Set Similarity (KPI2): Comparison of tasks extracted by LLM with tasks extracted from BPMN models. For each text, a set of n tasks is extracted. Each text has 8–11 associated models, from which again a set m of tasks can be extracted. Each set n is compared with all sets m, yielding a set of similarities which is averaged for the contextual (C) and non-contextual (NC) similarity methods

For KPI3, the goal is to quantify the overlap between the tasks extracted from the original text and those extracted from its associated models: (1) how similar are individual tasks, and (2) how many tasks exist only in one of the two extractions. The results, shown in Table 5, indicate that between 6 and 7 tasks extracted from the models are also found in the text, while about 6 tasks could not be found among the tasks extracted from the text. When looking at it from the point of view of the tasks extracted from the text, the ratio becomes 4:3. So almost 50% of the tasks are not similar between model and text (see the discussion for details). A minimal sketch of this matching follows Table 5.

Table 5. Set Overlap (KPI3): Each task extracted from the text is compared (for each associated model) with the tasks extracted from the model. If the similarity is larger than a threshold, a task is deemed common; otherwise it is deemed to occur only in either the model or the text.
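The following sketch shows the thresholded matching under these definitions; the threshold value is an illustrative assumption, and nc_similarity refers to the TF-IDF function sketched above.

```python
# Minimal sketch of the set overlap computation: tasks are deemed common
# when their pairwise similarity exceeds a threshold. The threshold is an
# illustrative assumption; nc_similarity is the TF-IDF function above.

def set_overlap(text_tasks, model_tasks, threshold=0.5):
    matched_text = sum(
        1 for t in text_tasks
        if any(nc_similarity(t, m) > threshold for m in model_tasks))
    matched_model = sum(
        1 for m in model_tasks
        if any(nc_similarity(m, t) > threshold for t in text_tasks))
    return {
        "common": matched_text,                         # found in both
        "text_only": len(text_tasks) - matched_text,    # only in the text
        "model_only": len(model_tasks) - matched_model  # only in the model
    }
```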

KPI4 focuses on restricting the number of words per extracted task in order to coax the bot into extracting more tasks, as the number of tasks extracted from the text is generally lower than the number of tasks contained in the models (see the discussion for more details). Table 6 shows that this restriction decreases the similarity when comparing against the text (due to stronger paraphrasing), but KPI5 (cf. Table 7) and KPI6 (cf. Table 8) show an increase in the number of extracted tasks by one while not decreasing the similarity when compared to the tasks from the model.

Table 6. Restricted Text Similarity (KPI4): Task names are allowed to have only 3–5 words, cf. Table 3.
Table 7. Restricted Set Similarity (KPI5): Task names are allowed to have only 3–5 words, cf. Table 4.
Table 8. Restricted Set Overlap (KPI6): Task names are allowed to have only 3–5 words, cf. Table 5.

Finally, for KPI7, we assessed the effects of paraphrasing on prevalence and similarity, i.e., how stable the LLMs are for task extraction with similar input. We use nine different algorithms for paraphrasing text [2] (rewriting sentences using synonyms), which is, for example, useful for cleaning up textual descriptions provided by humans. The results are displayed in Table 9 and show that especially the contextual similarity does not decrease significantly, while the number of extracted tasks even increases in comparison to the original text.

Table 9. Average Augmented Task Extraction Prevalence and Similarity for GPT3 (KPI7): for nine different paraphrasing methods, the average number of tasks and the similarity measures are calculated. The second row holds the values for the original text from Table 6

4.3 Discussion

Tables 3, 4, 5, 6, 7, 8 and 9 clearly show that GPT3 currently supports task extraction the best, beating GPT3.5. A potential reason for GPT3's success could be the size of the model (175 billion parameters versus 1.3 billion for GPT3.5). The GPT3.5 model is optimized for chat and may not be as effective for more complex language tasks [1].

Another important insight is that manually designed and refined models contain additional tasks that cannot be directly extracted from the original text but exist due to a human's ability to “read between the lines” or to reason about task granularity. GPT extracts tasks exactly as written in the text but does not have the capability to reason about when it makes more sense to have multiple small tasks instead of one big task. We tried to coax GPT3 into extracting more tasks by restricting the number of words describing a task (i.e., its label), which increased the average number of extracted tasks slightly, by one, as can be seen in Table 7.

On average, GPT extracted a third fewer tasks than existed in the model. When strictly looking at the capability of extracting tasks from the original text, GPT3, on average, achieves a text similarity of 80%. The interpretation of this value is difficult: it could mean that the LLM missed about 20% of the tasks or, alternatively, that 20% of the text consists of filler words that have been ignored by the LLM. Together with the observation that the LLM is reluctant to split up tasks, the 30% fewer tasks extracted from the text in comparison to the model hint at a possible explanation.

5 Conclusion: Practical Implications and Research Directions

From the state-of-the-art discussion in Sect. 3 and the results of the evaluation presented in Sect. 4.3, the following two main managerial implications can be derived:

  1. For the chatbot application scenarios “gather information” and “process modelling” (cf. Table 1), chatbots are in principle ready to be applied in practice as-is, yet the results have to be taken with a grain of salt, i.e., the domain expert should always check the results. However, the lack of an appropriate, human-readable output format, e.g., a BPMN process model, significantly limits the space of early adopters in a company to experts at the intersection of their domain and computer science. This limitation is particularly unfortunate, as it counteracts the goal of conversational process modelling to minimize the necessary technical skills of the domain expert.

  2. For the chatbot application scenarios “compare and assess”, “select method, query models”, and “query and refactor models”, off-the-shelf chatbots are not yet ready to be applied due to their inability to output process models and to understand process model semantics.

As business process modelling has become an important tool for managing organizational change and for capturing software requirements, the first managerial implication is that conversational process modelling can already have a significant business impact. Considering that the central problem in this area – the acquisition of as-is models – consumes up to 60% of the time spent on process management projects [29], chatbot-based partial automation can be sufficiently impactful even if substantial human refinement is required.

The second managerial implication is that future research should focus on integrating the strong language capabilities of chatbots with the specialized capabilities of existing knowledge-based tools. This integrative research direction is more promising than training chatbots on specialized process modelling training sets featuring native process models, e.g., process models in BPMN format, and a number of semantic targets, such as information on the existence of deadlocks in a process model. First, training the chatbot with respect to business process models ignores the vast modelling knowledge already encoded in existing tools. Second, semantics are clearly defined and encoded in existing tools, such that training chatbots with the aim of understanding formal semantics is futile unless it serves as an intermediate step that unlocks further value.

To conclude, while advanced tasks such as model querying, refinement, and analysis presumably require domain-specific solutions, traditional, knowledge-based approaches to business process modelling can relatively straightforwardly be augmented by machine learning-based chatbots to facilitate tedious tasks such as information gathering and basic model creation.