1 Introduction

Workflows are a popular means to specify and implement processes in business and science. For example, they are used in modern science to specify and implement scientific experiments, allowing scientists to better understand the phenomenon or hypothesis they are studying. Designing scientific workflows can be a difficult task, however, as it requires a thorough knowledge of the field as well as of the services available to implement the steps of the workflow. To overcome this obstacle and facilitate workflow design, many workflow repositories have emerged, such as myExperiment [28], Crowdlabs [21] and Galaxy [14], which allow workflows to be shared, published and reused.

However, sharing and publishing workflows is not enough to allow their reuse. In recent years, a significant number of workflows have been shared by scientists in several areas on the myExperiment workflow repository, yet their users experience difficulties when exploring and querying them. Indeed, users still have to go through the published workflows to identify those that are relevant to their needs. The situation is exacerbated by the fact that a significant number of workflows published in these repositories can no longer be executed due to the volatility of the services that implement their steps.

Over the past few years, we have developed solutions to improve workflow reuse with the above concerns in mind [8, 9, 16, 18]. These solutions can be organized within a framework that contains five inter-connected steps, as shown in Fig. 1.

Fig. 1. Five steps for promoting workflow reuse.

The first component extracts frequent workflow fragments from existing workflow repositories; the rationale is that frequent fragments are likely to be reused in the construction of new workflows. The second component searches for workflows, or fragments thereof, to be used in the design of a new workflow. The third component repairs workflow fragments that can no longer be executed due to the volatility of their underlying services. The fourth component composes a new workflow, taking into account existing workflows previously retrieved by the user. The last component enriches the repository with fragments of the newly specified workflow. The remainder of this paper presents each of the first three steps of the framework before closing with a discussion highlighting the outstanding issues.

2 Mining Frequent Workflow Fragments

The literature on business and scientific workflows is rich in proposals that seek to exploit existing workflows. Existing work has mainly focused on deriving a workflow specification (usually in the form of a Petri net) from workflow execution logs. Very few proposals have focused on mining knowledge from workflow specifications. The work in [29] extracts action patterns from repositories, capturing chunks of actions on business objects that often appear together in business processes. Our work is related to Diamantini's proposal [12], which applies hierarchical clustering of graphs to extract common workflow fragments. We focus on fragments rather than entire workflows, as there are more (realistic) possibilities for reuse at the fragment level than at the level of the entire workflow. In other words, the chances that users will find a workflow that suits their needs are slim, whereas the chances that they will find workflows containing one or more fragments that can contribute to their workflow are greater.

Given a workflow repository, we would like to extract frequent workflow fragments, i.e. fragments that are used in several workflows. These fragments are likely to implement tasks that can be reused in newly specified workflows. Mining such fragments raises the following questions.

  • How to manage the heterogeneity of the labels used by different users to model their workflow activities within the repository? Different designers use different labels to name the activities that make up their workflows. We need a way to homogenize the labels before mining so that we do not miss relevant fragments (false negatives).

  • Which graph representation is best suited for encoding workflows so that frequent fragments can be extracted? We argue that the effectiveness and efficiency of the mining algorithm used to retrieve frequent fragments depend on the representation used to encode the workflow specifications.

Homogenizing Activity Labels. To be able to extract frequent workflow fragments, we first need to identify common activities. This assumes that activities implementing the same functionality have the same names. Some workflow modeling tools (see for example Signavio) maintain a dictionary and allow reusing its entries via search or auto-completion (when the user starts typing an activity label, the system suggests similar labels from the dictionary). If the models in the repository come from different tools and use different naming conventions, a pre-processing step can be applied to homogenize the activity labels using a dictionary [27]. To facilitate this step, we rely on existing techniques such as [20], which are capable of recognizing label styles and automatically refactoring labels with quality problems. These labels are transformed into verb-object labels by deriving actions and business objects from the activity labels. Verb-object style labels contain an action followed by a business object, such as Create Invoice and Validate Order. The advantages of this labeling style have been highlighted by several practical studies and modeling guidelines. Synonyms are also handled in this step: they are recognized using the WordNet lexical database and replaced by a common term.
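As a minimal illustration of the synonym-handling step, the following Python sketch canonicalizes the action part of verb-object labels using the NLTK interface to WordNet. The canonicalization rule (taking the first lemma of the first verb synset) and the helper names are our own simplifications, not the exact procedure of [20].

    # Minimal sketch of label homogenization; assumes labels are already
    # refactored into verb-object style ("Create Invoice").
    # Requires: pip install nltk, then nltk.download('wordnet')
    from nltk.corpus import wordnet as wn

    def canonical_verb(verb):
        """Map a verb to a canonical representative of its WordNet synonyms."""
        synsets = wn.synsets(verb.lower(), pos=wn.VERB)
        if not synsets:
            return verb.lower()
        # Illustrative rule: first lemma of the most common synset.
        return synsets[0].lemma_names()[0].lower()

    def homogenize(label):
        verb, _, business_object = label.partition(" ")
        return (canonical_verb(verb) + " " + business_object).strip()

    # Verbs sharing a dominant synset collapse to the same canonical label.
    print(homogenize("Make Invoice"), homogenize("Create Invoice"))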

Workflow Representation. To extract frequent patterns, we make use of an existing graph mining algorithm, SUBDUE [19]. SUBDUE is a heuristic algorithm that uses an information-theoretic measure (minimum description length) to find important sub-graphs. To use SUBDUE, however, we first need to transform the workflow specifications into a graph format. To this end, we studied a number of possible representations, including those proposed in the state of the art (see [12]), and proposed one that improves the execution time, the memory space required, and the meaningfulness of the extracted patterns. In fact, we have shown (see [9]) that the representation model is of major importance.

The new representation model for workflows that we propose is compact, in the sense that it contains a small number of nodes, yet complete, since it can be used to recover the original BPMN specification of the workflow. In this model, nodes represent activities; control operators are not encoded as nodes. Instead, the edge connecting two activity nodes is labeled to indicate the kind of control operator(s) connecting them. The results of an experimental evaluation we conducted showed that, compared with existing representation models, this model is the most effective and efficient when it comes to mining frequent workflow graphs [9].
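To make the encoding concrete, the sketch below builds such a graph with the networkx library; the edge-label vocabulary ("seq", "xor-split") is illustrative and not necessarily the exact alphabet used in [9].

    # Sketch of the compact representation: activity nodes only, control
    # operators pushed onto edge labels (label names are illustrative).
    import networkx as nx

    g = nx.DiGraph()
    g.add_edge("Receive Order", "Check Stock", control="seq")
    g.add_edge("Check Stock", "Ship Order", control="xor-split")
    g.add_edge("Check Stock", "Reject Order", control="xor-split")

    # A frequent-subgraph miner such as SUBDUE can run on g directly; no
    # nodes are spent on gateways, which keeps the graph small.
    print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")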

3 Searching Workflow Fragments

To facilitate the reuse of workflow fragments, two retrieval scenarios can be identified. In the first, the user describes the functionality of the fragment that could complete his or her specification using keywords (or, alternatively, using a dedicated visual query language as in [2]). In the second, the specification of a workflow fragment is already available and the user would like to find similar ones in the repository. In the following we present solutions for these two cases.

3.1 Keyword-Based Search of Workflow Fragments

To obtain a workflow fragment that can be used to complete an initial workflow, the user issues a query that characterizes the desired fragment. The query consists of a set of keywords \(\{kw_1,\dots ,kw_n\}\) characterizing the functionality that the fragment must implement. The user chooses the terms freely; in other words, we do not impose any vocabulary for the specification of keywords.

To identify the fragments in the repository that are relevant to the specified keywords, we adopt the widely used TF/IDF (term frequency/inverse document frequency) measure. It should be mentioned here that the workflow fragments in the repository are indexed by the activity labels obtained after label homogenization (see the previous section); we refer to this index as IDX. Direct application of TF/IDF to the set of keywords provided by the user is likely to miss some fragments. Indeed, the designer provides keywords of his or her choice; a given keyword may not be present in the repository, whereas one of its synonyms may be. To solve this problem, we complement the traditional TF/IDF with a semantic enrichment phase. More precisely, given the set of keywords \(\{kw_1,\dots ,kw_n\}\) provided by the user, for each keyword \(kw_i\) we retrieve its synonyms, referred to as \(syn(kw_i)\), from an existing thesaurus such as WordNet [30], or from a domain-specific thesaurus if we are dealing with workflows from a specific domain. The IDX index is then searched for a term in \(syn(kw_i) \cup \{kw_i\}\). If one appears, that term is used to represent \(kw_i\) in the vector that will be compared with the vectors representing the workflow fragments in the repository.

Once the vector representing the user query is constructed and the TF/IDF weights of its terms are computed, it is compared to the vectors representing the workflow fragments in the repository using cosine similarity [3]. The set of fragments retrieved in this step, which we call candidate fragments, is then examined by the user, who can choose the fragment that best meets his or her needs.
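The following sketch illustrates the pipeline with scikit-learn; the toy fragment index and the stand-in thesaurus SYN are hypothetical (in practice WordNet would play the role of SYN).

    # Sketch of keyword search: synonym-aware query expansion followed by
    # TF/IDF and cosine ranking over the fragment index IDX.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    fragment_docs = [                          # activity labels indexing
        "identify protein search homolog",     # fragment 0
        "fetch sequence align sequence",       # fragment 1
    ]
    SYN = {"homologous": {"homolog", "similar"}}  # toy thesaurus

    vectorizer = TfidfVectorizer()
    index = vectorizer.fit_transform(fragment_docs)
    vocab = set(vectorizer.vocabulary_)

    def expand(kw):
        # Keep the keyword if indexed; otherwise try a synonym found in IDX.
        hits = ({kw} | SYN.get(kw, set())) & vocab
        return hits.pop() if hits else kw

    query = " ".join(expand(kw) for kw in ["identify", "homologous"])
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    print(sorted(enumerate(scores), key=lambda p: -p[1]))  # best match first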

3.2 Similarity-Based Search of Workflow Fragments

A second scenario for reusing fragments arises when the user already has a workflow fragment specification and would like to find all similar specifications in the repository [16]. The user may be looking for all the variants of his or her model, or for a correct version when the fragment contains an error [15].

Using graphs as a representation formalism for both the user model and the repository models, the above search problem turns into a graph searching and matching problem: we want to compare the process graph representing the user requirements with the target graphs in the repository. The matching process can be formulated as a search for a graph or subgraph isomorphism. However, there may exist no process model for which an exact graph or subgraph isomorphism can be found; we are therefore interested in finding workflow models with similar structures when models with an identical structure do not exist. Error-correcting graph matching integrates the concept of error correction (or inexact matching) into the matching process. As with string edit distance, graph edit operations are defined and the cost of applying these operations to a first graph to render it identical to the second one is evaluated; the graph edit distance is the minimum cost over all possible sequences of such operations. For a process graph-based formalism, specific cost functions have to be defined for the edit operations, such as the cost of modifying activity parameters or changing control-flow structures. We have also defined an edit operation that merges two activities, and its opposite, which decomposes an activity into two, to deal with different granularity levels of activity modelling. Complex (one-to-many) mappings between activities may appear due to multiple ways of implementing an interaction pattern (for instance, BPEL synchronous or asynchronous interaction [16]) or to implementing a functionality as a single task or as multiple tasks [13].
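As an illustration, networkx ships a generic error-correcting matcher; the cost functions below are placeholders for the process-specific costs just described (the merge/decompose operations are not modelled by this generic routine).

    # Sketch of inexact matching via graph edit distance.
    import networkx as nx

    def node_subst_cost(n1, n2):        # relabelling an activity
        return 0.0 if n1["label"] == n2["label"] else 1.0

    def edge_subst_cost(e1, e2):        # changing a control-flow edge
        return 0.0 if e1["control"] == e2["control"] else 0.5

    g1 = nx.DiGraph()
    g1.add_node(1, label="Create Invoice"); g1.add_node(2, label="Send Invoice")
    g1.add_edge(1, 2, control="seq")

    g2 = nx.DiGraph()
    g2.add_node(1, label="Create Invoice"); g2.add_node(2, label="Archive Invoice")
    g2.add_edge(1, 2, control="seq")

    dist = nx.graph_edit_distance(g1, g2, node_subst_cost=node_subst_cost,
                                  edge_subst_cost=edge_subst_cost)
    print(dist)  # minimum total edit cost; smaller means more similar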

In the framework presented in [17] we made use of semantic annotations of process models to propose a fast similarity search technique that returns a ranked list of similar models in the repository. In a second step, a more fine-grained matching method can be applied between the query and one or several matching candidates.

The first step can be considered a filtering step: we use an abstraction function that represents a process as a finite set of flow dependencies between its activities, so that the similarity of two processes is defined based on the similarity of their flow dependencies. To speed up the comparison of the flow dependencies of the query with those of the repository processes, we build indexes on the flow dependencies and on the annotations of the input/output parameters of the processes' activities.
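A minimal sketch of this filtering step follows; the Jaccard overlap on dependency sets is our choice of set similarity for the illustration, and the inverted index restricts scoring to processes sharing at least one dependency with the query.

    # Sketch of the filtering step: a process is abstracted to the set of
    # flow dependencies between its activities.
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    query = {("identify protein", "search homolog")}
    repo = {
        "wf1": {("identify protein", "search homolog"),
                ("search homolog", "get go term")},
        "wf2": {("fetch sequence", "align sequence")},
    }

    # Inverted index: flow dependency -> ids of processes containing it.
    index = {}
    for pid, deps in repo.items():
        for d in deps:
            index.setdefault(d, set()).add(pid)

    candidates = set().union(*(index.get(d, set()) for d in query))
    print(sorted(((jaccard(query, repo[p]), p) for p in candidates),
                 reverse=True))  # pre-ranked list of similar processes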

For applications requiring more precise matching, and for which the query response time is not constrained, the user can choose to compute a more fine-grained similarity measure based on the graph edit distance. A survey of similarity measures can be found in [4], and an overview of process matching techniques is presented in the corresponding chapter of [5].

Besides the control-flow specification, quality constraints and preferences can be specified by the user and taken into account in the searching and matching process (see for example [1]).

4 Repairing Workflows

A problem that frequently hinders the reuse of workflows (and thus of their fragments) is the volatility of the services that implement their steps. This is not surprising: there is usually no agreement committing the service providers who implement the workflow steps to supply their services on a continuous basis.

We have shown in a previous study [6] that semantic annotations of web services can be used to identify suitable substitutes for unavailable service operations, thus allowing the preservation of scientific workflows. More precisely, we have developed an algorithm which, given annotations that semantically describe missing service operations using the concepts of domain ontologies [22, 31], identifies available service operations that can play the same role as the unavailable ones.

Although the algorithm we have developed is sound, its practical applicability is hampered by the following facts. First, semantic annotations of web services are rare: the number of annotated web services is much lower than the growing number of available web services. Consequently, the probability of locating substitute operations using the algorithm described in [6] is low. Second, our experience suggests that a large proportion of existing semantic annotations suffer from inaccuracies: annotators tend to use concepts that are sub-concepts or super-concepts of the concepts that should be used for annotating web service parameters. As a result, a substitute discovered using such annotations may be unsuitable and, conversely, a suitable substitute may be discarded. Finally, scientific workflows may contain operations implemented using mechanisms other than web services, such as local programs or scripts, for which no annotations are available.

To address the above issues, we proposed a heuristic for locating substitutes in the absence of semantic annotations of web services. The proposed method uses the data links connecting the inputs and outputs of service operations in existing workflow specifications to locate operations whose parameters are compatible with those of the missing operations. In addition, it exploits provenance data [25] collected from previous workflow runs to ensure that candidate substitutes perform tasks similar to those of the missing operations.

To illustrate this idea, we use a real in silico workflow that performs value-added protein identification, in which the results of the protein identification are complemented by additional information on proteins homologous to the identified protein [7]. Figure 2 illustrates the workflow that implements this experiment.

Fig. 2. Value-added protein identification workflow (identified in the text by \( proteinIdentificationWf_1 \))

The workflow consists of three operations. The operation IdentifyProtein takes as input the peptide masses obtained during the digestion of a protein, together with an identification error, and provides the Uniprot accession number of the best match. If a protein is found, the operation GetHomologous performs a homology search and returns the list of similar proteins. The homologous protein accessions are then used to feed the execution of the operation GetGOTerm, which obtains the corresponding term from the Gene Ontology. A few months after the development of this workflow, it could no longer be run: the GetHomologous operation used to perform the protein homology search no longer existed, and as such the user was unable to execute the workflow. We therefore had to search for an available web service that performs homology searches and use it instead, which proved to be a lengthy process.

To identify suitable substitute service operations, we defined the notions of parameter compatibility and task compatibility, which we present in the remainder of this section.

4.1 Parameter Compatibility

Consider the workflow illustrated in Fig. 3. The operation GetSimilarProteins of this workflow is linked to both IdentifyProtein and GetGOTerm, which are also linked to the unavailable operation GetHomologous. If the data links in this workflow are free of discrepancies, then the input and output of GetSimilarProteins are compatible with the output of IdentifyProtein and the input of GetGOTerm, respectively, to the extent that they can be connected within a workflow.

Fig. 3. Value-added protein identification workflow (identified in the text by \( proteinIdentificationWf_2 \))
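The following sketch captures this heuristic on the running example; the repository encoding (workflows as lists of data links between operations) is our own simplification.

    # Sketch of parameter compatibility: an available operation is a
    # candidate substitute for a missing one if some workflow links it to
    # the same upstream and downstream operations.
    repo_links = {
        "proteinIdentificationWf_2": [
            ("IdentifyProtein", "GetSimilarProteins"),
            ("GetSimilarProteins", "GetGOTerm"),
        ],
    }

    def candidate_substitutes(missing, upstream, downstream):
        candidates = set()
        for links in repo_links.values():
            fed_by_upstream = {dst for src, dst in links if src == upstream}
            feeding_downstream = {src for src, dst in links if dst == downstream}
            candidates |= (fed_by_upstream & feeding_downstream) - {missing}
        return candidates

    # GetHomologous sat between IdentifyProtein and GetGOTerm in Fig. 2.
    print(candidate_substitutes("GetHomologous", "IdentifyProtein", "GetGOTerm"))
    # -> {'GetSimilarProteins'}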

4.2 Task Compatibility

In addition to compatibility in terms of inputs and outputs, it must be ensured that the candidate substitute performs a task that is compatible with that of the unavailable operation. This raises the question of how to verify task compatibility in the absence of semantic web service annotations.

To perform this test, we exploit the following observation: if GetSimilarProteins performs a task that is compatible with that of GetHomologous, then for the same input the two operations should in principle provide the same output. More generally, an operation can replace another in terms of task if, for each input instance it is able to consume, it provides the same output as the one obtained by invoking the other operation. Applying this test directly, however, would require calling the missing operation! To overcome this problem, we adopted a solution that uses the logs of past executions of the workflow. These provenance logs contain information about past workflow executions and, more importantly, the intermediate data consumed and produced by the operations making up the workflow when it was still operational. Examples of workflow provenance systems are Janus [24] and PASOA [23]. Using such provenance information, we can check whether a newly identified candidate service operation delivers the same results as the unavailable one, by exploiting the executions of the latter stored in the provenance logs.
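A sketch of this check is given below; the log entries are invented for illustration, and invoke stands for an actual call to the candidate service operation (real provenance records would come from a system such as Janus or PASOA).

    # Sketch of the task-compatibility test: replay the inputs recorded in
    # the provenance log of the unavailable operation against a candidate
    # substitute and compare the outputs.
    provenance_log = {   # hypothetical records of past GetHomologous runs
        "GetHomologous": [
            ({"accession": "P68871"}, ["P69905", "P02042"]),
            ({"accession": "P02768"}, ["P02769"]),
        ],
    }

    def invoke(operation, inputs):
        ...  # placeholder for invoking the candidate service operation

    def task_compatible(candidate, missing, log=provenance_log):
        """True if the candidate reproduces every logged run of the missing op."""
        return all(invoke(candidate, inputs) == outputs
                   for inputs, outputs in log.get(missing, []))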

The effectiveness of the above solution has been tested against a small number of real-world workflows (see [10] for more details).

The presented solution verifies the parameter and task compatibility of two services. When mismatches exist at the level of operation signatures (different numbers or types of parameters), adapters can be semi-automatically generated to ensure service replaceability [11]. The approach is based on a catalogue of mismatch patterns, each containing both a formal and an informal description of the type of adapter (called an adapter template) used to resolve that type of mismatch.

5 Concluding Remarks

In this paper, we have presented a framework of solutions that can be used to promote the reuse of service-based workflows or, more specifically, of fragments thereof. Although we have been able to demonstrate that the above solutions can be useful, reuse is a difficult problem, and new service-based environments bring new challenges that need to be taken into account.

Indeed, while the reuse of software artifacts is a long-standing issue, the open science initiative emphasizes the need to improve the reusability of data-driven workflows. Semantic description is an essential ingredient for finding appropriate artifacts and reusing them. Semantic descriptions of services and workflows, although widely studied, still pose problems that hinder their adoption and use. Open issues include, in particular, support for the definition of domain-specific lightweight descriptions for users unfamiliar with logic, and the enrichment of the descriptions of web services in general and data services in particular (e.g. relationships between input and output parameters (see [26]), tasks/capabilities, prerequisites and domain-specific effects).

Data services, as well as entire workflows, are increasingly hosted in cloud environments. This raises questions that need to be addressed, such as how to compose a workflow using fragments hosted in different, and therefore heterogeneous, cloud environments. Services hosted in the cloud typically require accreditation and/or are not free of charge. This raises an optimization issue: reducing the cost of a new workflow composed from fragments of existing workflows. Another line of research worth studying is to revisit traditional service-based workflows by considering micro-services as a solution that improves their reliability and reusability.