1 Introduction

Researchers have defined provenance from different perspectives. Davidson et al. defined provenance as the documented history of data, which includes each step of the transformation process applied to the data source [1]. Herschel et al. considered provenance as “information describing the production process of some end product” [2]. Similarly, Freire et al. quoted the Oxford English Dictionary definition of provenance as “its history and pedigree; the source or origin of an object; a record of the ultimate derivation and passage of an item through its various owners” [3]. Ragan et al. indicated that provenance has been used in different ways to describe the histories and origins of items of various types [4]. Moreover, Uri et al. described provenance as a causality graph of nodes and edges that explains the process by which an object reached its current state [5]. Almeida et al. mentioned that provenance, sometimes based on scientific workflows, can be used to preserve the execution log history of particular data as a traceable resource [6]. Allen et al. referred to provenance as the record of the creation, updates, and activities that influence a piece of data, which helps to build trust in cross-organizational collaboration [7]. In this paper, we propose a relatively general definition: provenance is a record of the lifecycle of a piece of data or a thing that accounts for its generation, transformation, manipulation, and consumption, together with an explanation of how and why it arrived at its present place.

Depending on the application scenario, provenance is mainly divided into four categories: data provenance, workflow provenance, information systems provenance, and provenance meta-data [2], which form a hierarchy ranging from the most general to the most specific. In workflow domains, provenance is further distinguished into three types: retrospective, prospective, and evolution [8,9,10,11].

In scientific research, provenance can serve several purposes. For instance, scientists and engineers track provenance information to identify the contributors, time of occurrence, and execution process of a given data product [12]; provenance helps to assess, maintain, and improve the quality of products [13]; provenance can be used to enhance the transparency, authenticity, and integrity of a piece of data [6, 14]; in particular, scientists expend substantial effort tracking provenance data to ensure the repeatability and reproducibility of the production process in scientific experiments [15]; and, perhaps more significantly, provenance lets scientists gain insight into the chain of reasoning, which helps them discover, analyze, and explain unexpected results [16]. In short, these diverse purposes give provenance a wide range of applications.

Many scholars have tackled provenance issues across numerous domains, such as Medical Sciences [6], Biology [17], Biomedicine [18], Genomics [19], Geography [20], and Geoinformatics [21], with applications in scientific workflows [22], medical records [6], financial reports [4], supply chains [23], data exploration [1], and network diagnosis [24].

As the literature illustrates, the problems of systematically modeling [25], capturing [26], storing [27], and querying [28] provenance have attracted extensive attention from researchers in a wide range of applications. In this article, we focus on provenance-modeling issues in multidisciplinary collaboration applications. The aim is to provide users with guiding principles and sound trade-offs for designing or choosing their own provenance model. The contributions of this work are threefold. First, we identify critical components of the provenance model and compare the diverse methodologies used in them. Second, we propose a collaborative model for provenance practice in multidisciplinary collaboration. Finally, we summarize several open problems in current model-centric provenance research.

The rest of this paper is organized as follows. Section 2 outlines a comparison of existing provenance models. Section 3 designs a provenance model for multidisciplinary collaboration. Section 4 discusses several open issues concerning provenance models, systems, and practice. Finally, Sect. 5 concludes the paper with a brief summary of the main contributions and future work.

2 Core Components of Provenance Model: An Overview

2.1 Two Classical Model Specifications

In the current literature, researchers have proposed different provenance models and related solutions in their respective fields. However, differences between those models make it difficult to understand the expressiveness of provenance representations, to access and use provenance without obstacles, and especially to exchange information between provenance-enabled systems. Against this background, the scientific community began to reach a consensus on provenance standardization in 2007, releasing and later revising the Open Provenance Model (OPM) [29] to address these challenges. Subsequently, inspired by OPM, another conceptual model named PROV-DM [30] was endorsed by the World Wide Web Consortium (W3C) in 2013, which provides well-established concepts and definitions for the interoperable interchange of provenance information in heterogeneous contexts.

OPM comprises three types and the dependencies among them, as shown in Fig. 1(a). An Artifact represents an immutable object during process execution, which can take a physical form (such as a device) or a digital representation (such as data). A Process is a series of actions performed on artifacts, which may result in new artifacts. As a contextual entity, an Agent can enable, facilitate, control, and influence the execution of processes. In terms of causal relationships, an artifact, which may be derived from another artifact, can be used or generated by a process; a process may be triggered by another process and is controlled by one or more agents. Similarly, PROV-DM contains core types and their relationships, which form the essence of provenance information. As depicted in Fig. 1(b), there are three element types and seven relationships. An Entity, either real or imaginary, is something with certain fixed aspects that can be physical, digital, or conceptual. An Activity occurs over a period of time and acts upon or with entities; it may include generating, transforming, modifying, processing, and consuming entities. An Agent bears responsibility for an activity taking place, for the existence of an entity, or for the activities of other agents.
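To make the PROV-DM core concrete, the following minimal sketch lists the three element types and the seven core relations with their subject and object types as given in the W3C PROV-DM core structures. The names `CORE_RELATIONS` and `check_statement` are hypothetical helpers introduced for illustration only; they are not part of any PROV library.

```python
# Sketch of the PROV-DM core: three element types, seven relations.
# Each relation maps to (subject type, object type).
# CORE_RELATIONS and check_statement are illustrative names, not a real API.
CORE_RELATIONS = {
    "wasGeneratedBy":    ("entity", "activity"),
    "used":              ("activity", "entity"),
    "wasInformedBy":     ("activity", "activity"),
    "wasDerivedFrom":    ("entity", "entity"),
    "wasAttributedTo":   ("entity", "agent"),
    "wasAssociatedWith": ("activity", "agent"),
    "actedOnBehalfOf":   ("agent", "agent"),
}


def check_statement(relation, subject_type, object_type):
    """Return True if the statement respects the relation's signature."""
    return CORE_RELATIONS.get(relation) == (subject_type, object_type)
```

For example, `check_statement("used", "activity", "entity")` holds, whereas swapping the two types does not.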

Fig. 1. (a) The OPM core composition [29], (b) The PROV-DM core composition [30]

2.2 Characteristic Comparison Among Existing Models

OPM and PROV-DM are currently regarded as the fundamental model specifications. Even so, provenance models in practice vary with applications and user requirements. Instead of reinventing the wheel, numerous researchers have adopted or extended either OPM or PROV-DM to build their own provenance models. In this section, we review relevant studies on existing provenance models to illuminate their underlying principles, so that users can make informed decisions when designing or selecting a provenance model.

Review Method.

In our study, we adopted an explicit strategy for literature search and selection to explore existing research. We describe it step by step below.

Search Strategy.

First, we framed the research question (RQ) to explore specific aspects of existing provenance models, with the aim of helping users gain a comprehensive view of the principles behind provenance models. Next, we identified relevant studies (RS). Six databases made up the search scope: (1) IEEE Xplore; (2) ACM Digital Library; (3) Scopus; (4) Springer Database; (5) ScienceDirect; (6) CNKI. Matching on Title, Abstract, or Keywords, the English search string (SS) was (“provenance model” OR “lineage model” OR “derivation model” OR “pedigree model”). Likewise, the Chinese search string was (“起源模型” OR “溯源模型” OR “世系模型”). Under the filtering criteria (FC), articles pertaining to provenance models were included, whereas articles available only as abstracts, workshop summaries, or systematic reviews were excluded. We mainly surveyed literature published between January 2014 and January 2018.

Study Selection.

Initially, we obtained 608 articles from the six databases by matching the search strings against titles, abstracts, or keywords. Second, we excluded 93 duplicates, leaving 515 articles. Third, we screened abstracts for relevance and removed 456 articles, leaving 59. Fourth, we reviewed the remaining articles and excluded those not related to provenance models; that is, 38 articles with no evidence of implementation, such as a clear statement, an enforcement method, or a model analysis, were removed, leaving 21 articles. Finally, we used an inductive coding methodology to analyze all full-text articles, excluding those not suitable for classification. Each article was coded with predefined categories, including specification, type, domain, and purpose. As a result, 20 articles in total were selected for subsequent analysis.
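The selection funnel can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the text; the size of the last exclusion (one article) is not stated explicitly and is inferred from the step from 21 to 20 articles. The variable names are illustrative.

```python
# Study-selection funnel, using the counts reported in the text.
initial = 608
steps = [
    ("duplicates removed", 93),
    ("removed by abstract screening", 456),
    ("no evidence of implementation", 38),
    ("not suitable for classification", 1),  # inferred from 21 -> 20
]

remaining = initial
trail = []  # (reason, articles remaining after this step)
for reason, removed in steps:
    remaining -= removed
    trail.append((reason, remaining))

for reason, left in trail:
    print(f"{reason}: {left} articles remain")
```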

Comparative Results of the Search.

As shown in Table 1, twenty provenance models are surveyed. Across the nine properties, most models [11, 31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47] use or extend PROV (57%) or OPM (33%) while taking the features of their respective fields into account; only a small share (nearly 10%) are proprietary [48] or based on other standards, such as RWS (Read-Write-Reset) [49]. In terms of element types, all the models pertain to either workflow provenance (25%) or data provenance (75%). That is, those models usually have a modeling structure specific to a certain domain, specialize in a particular type of process, and rely on high-level instruments such as structured query languages, e.g., SQL [50], Cypher [51], SPARQL [52], and ProQL [53]. In particular, all workflow-based models support some form of retrospective provenance, whereas few provide means to collect prospective (40%) and evolution (20%) provenance. However, recent literature shows that researchers are turning towards model extensions for capturing prospective provenance [36, 50,51,52], and a few have proposed extensions that integrate retrospective and evolution provenance at both design time and run time [11, 54]. It has also been pointed out that the transition from one type to another is gradual rather than sharp [9].

Table 1. Characteristic dimensions of existing provenance models (the check “√” denotes a clear statement, whilst the star “*” indicates no explicit expression in related full-text article)

Additionally, provenance models have been applied in various applications and for diverse purposes. Fig. 2 shows a wide range of domains, of which the most frequent are Biomedicine or Healthcare (20%) [32, 34, 36, 46], Security-related (15%) [33, 41, 45], Data Analysis (15%) [39, 42, 43], and Web-based (15%) [40, 47, 48], together with Domain-Agnostic (10%) and Collaboration (10%) areas [31, 36]. As shown in Fig. 3, the dominant purposes of existing models are Data Quality (nearly 21%), Replication (nearly 21%), Recall (nearly 18%), Data Security (nearly 15%), and Presentation (nearly 15%).

Fig. 2. Domains of existing provenance models

Fig. 3. Purposes of existing provenance models

Moreover, most models support compatibility (75%) and interoperability (90%), whereas scalability (20%) is rarely addressed. Among those traits, compatibility is evaluated by the degree to which a model coincides with the OPM and PROV-DM standards. A model-enabled system with good scalability can handle large amounts of provenance at the back end [55] and the front end [56]. Interoperability is the ability to exchange provenance information among multiple systems [15, 29, 30, 54]; in this paper, it is judged essentially by whether the model accords with these standards.

Finally, the results cover three types of publications: most (75%) were published in conference proceedings, 20% in journals, and one Ph.D. thesis (5%) was also included. The evidence indicates that the number of publications increased (by 267%) from 2014 to 2017.

Note that the selected models are not intended to be exhaustive; they serve only as a reference for gaining insight into model properties and construction, and may not cover or represent every situation in the current literature.

Discussion and Analysis.

Due to space constraints, other provenance models are not enumerated in this article. Nevertheless, we can tentatively draw three conclusions from the existing literature: (1) the majority of models adopt or extend the OPM or PROV-DM standards, and have been applied in a broad spectrum of domains, such as biomedicine or healthcare, data analysis, web-based, and security-related areas, for diverse purposes including data quality, replication, recall, data security, and presentation; (2) most available models emphasize compatibility and interoperability, and to a lesser extent scalability, with information exchange among systems receiving attention in emerging research; (3) existing models mostly focus on a specific discipline and deal with structured provenance information, while research on unstructured collaboration, especially interdisciplinary collaboration, is lacking.

3 A Collaborative Provenance Model for Multidisciplinary Applications

Multidisciplinary collaboration requires researchers from different disciplines, such as physics, chemistry, computer science, and medicine, to carry out creative and intellectual work together by exchanging and sharing various resources. In this section, we summarize the specific characteristics of multidisciplinary collaboration and design a provenance model that records the collaborative process and its associated data evolution for research collaboration across disciplines. On this basis, we also briefly evaluate the effectiveness of the proposed model.

3.1 Multidisciplinary Collaboration Characteristics

Here, we identify four features: varying collaboration patterns, complex team composition, different communication schemas, and a dynamic collaborative process, each of which keeps changing during collaboration. Together, these characteristics make it challenging to design a single provenance-based model that fully depicts the process of human interaction and data evolution.

3.2 Typical Scenario

For ease of exposition in this paper, we simplify the actual scenario of multidisciplinary collaboration. As illustrated in Fig. 4, multidisciplinary collaboration is a problem-solving process in which cross-disciplinary researchers work together towards a common goal by exchanging and sharing diverse resources, such as hardware devices, system software, and information technologies, and during which relevant scientific data are continually generated, transformed, modified, and consumed by those collaborators. Overall, this kind of collaboration consists of two sub-processes, human interaction and data evolution, which influence each other.

Fig. 4. The multidisciplinary collaboration scenario

3.3 A Collaborative Provenance Model: CollabPG

We extend PROV-DM [30] into our collaborative provenance model, which comprises two kinds of information: components and dependencies. We use a directed acyclic graph CollabPG(R, A, RE, RU, E) to collect the associated provenance information, in which R, A, RE, and RU are vertex sets whose elements are expressed as triples, and E is the edge set. Fig. 5 gives an example of this model. Resources (R) are drawn as yellow ovals, Activities (A) as blue rectangles, Researchers (RE) as orange pentagons, and Rules (RU) as green circles. The attributes of each element are shown in gray. The two blue cloud-shaped scopes show scientific collaboration among multidisciplinary researchers under various rules. The black scope shows the data evolution process, including generation, transformation, and modification.

Fig. 5. An example of collaborative provenance model

Components.

There are four element types:

Resource(rid, attributes, state): denotes a resource, which can be any physical, digital, or conceptual artifact with a certain utility value, where rid is a unique identifier. Attributes is a set of attribute-value pairs representing fixed aspects of the resource, such as the type attribute, which distinguishes information resources (hardware devices, information technologies, system software, etc.) from scientific data (referenced resources, intermediate data, execution results, etc.) that may be collected in electronic documents. State is the resource’s lifecycle phase: generation, transformation, modification, or invalidation. A resource can be used once it has been generated and is no longer available for use after invalidation.

Activity(aid, attributes, timeRange): refers to a collaborative activity, acting upon or with resources, that takes place over a period of time. Here, aid is the unique identifier, and attributes is a set of attribute-value pairs, such as the type attribute. The timeRange, written as [startTime, endTime], denotes the time interval (including the start and end times) during which the activity occurs.

Researcher(reid, attributes, subject): denotes a scientific researcher responsible for the occurrence of an activity or the existence of a resource. Here, reid is the unique identifier; attributes is a set of attribute-value pairs such as the type attribute, which contains role, level, etc.; and subject is the discipline to which the researcher belongs, such as physics, chemistry, biology, mathematics, or mechanics.

Rule(ruid, attributes, category): denotes a set of restriction rules that resources, activities, or researchers have to obey. Here, ruid is the unique identifier, and attributes is a set of attribute-value pairs. The category distinguishes structured from unstructured rules: the former represent process logic such as causality and concurrency, and the latter include disciplinary paradigms, confidentiality protocols, and privacy mechanisms.
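The four element types can be sketched as simple record types. The following is a minimal illustration, assuming string-valued attributes and Python `datetime` values for timeRange; the field names follow the triples above, while the validation logic and the constant names `RESOURCE_STATES` and `RULE_CATEGORIES` are assumptions made for this example rather than part of the model definition.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Tuple

# Allowed values taken from the component descriptions above.
RESOURCE_STATES = {"generation", "transformation", "modification", "invalidation"}
RULE_CATEGORIES = {"structured", "unstructured"}


@dataclass
class Resource:
    rid: str
    attributes: Dict[str, str] = field(default_factory=dict)
    state: str = "generation"

    def __post_init__(self):
        if self.state not in RESOURCE_STATES:
            raise ValueError(f"unknown resource state: {self.state}")


@dataclass
class Activity:
    aid: str
    attributes: Dict[str, str]
    time_range: Tuple[datetime, datetime]  # [startTime, endTime]

    def __post_init__(self):
        start, end = self.time_range
        if start > end:
            raise ValueError("startTime must not be later than endTime")


@dataclass
class Researcher:
    reid: str
    attributes: Dict[str, str]
    subject: str  # discipline, e.g. "physics"


@dataclass
class Rule:
    ruid: str
    attributes: Dict[str, str]
    category: str

    def __post_init__(self):
        if self.category not in RULE_CATEGORIES:
            raise ValueError(f"unknown rule category: {self.category}")


# Hypothetical instances for illustration.
dataset = Resource("r1", {"type": "scientific data"})
simulation = Activity("a1", {"type": "simulation"},
                      (datetime(2018, 1, 1), datetime(2018, 1, 2)))
physicist = Researcher("re1", {"role": "principal investigator"}, "physics")
paradigm = Rule("ru1", {"type": "disciplinary paradigm"}, "unstructured")
```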

Dependencies.

In this model, the edge set E expresses dependency relationships between the above vertices. Sixteen dependencies are included:

  • wasDerivedFrom(r2, r1) ∈ R × R: Transforming one resource into another, i.e., resource r2 is transformed from r1, with changes to certain attributes.

  • wasRevisionOf(r2, r1) ∈ R × R: Modifying a resource into a newer one, i.e., resource r2 is the revised version of r1, with only minor value updates to the same attributes.

  • wasGeneratedBy(r, a) ∈ R × A: Producing a resource by an activity, i.e., activity a generates resource r.

  • Used(r, a) ∈ R × A: Utilizing a resource by an activity, i.e., activity a uses an existing resource r.

  • wasInvalidatedBy(r, a) ∈ R × A: Invalidating a resource by an activity, i.e., activity a invalidates an existing resource r, due to its destruction, cessation, or expiry.

  • wasAttributedTo(r, re) ∈ R × RE: Ascribing a resource to a researcher, i.e., researcher re is responsible for the existence of resource r.

  • wasExchangedBy(a2, a1) ∈ A × A: Exchanging specific resources between two activities, i.e., activity a2 uses some resources generated by activity a1.

  • wasExecutedBy(a, re) ∈ A × RE: Executing an activity by a researcher, i.e., researcher re plays a role in activity a.

  • dependOn(re2, re1) ∈ RE × RE: The outcome of researcher re2 depends on the contributions of re1.

  • consultWith(re2, re1) ∈ RE × RE: Researcher re2 carries out activities together with re1 via joint supervision, consultation, and decision-making.

  • wasGuidedBy(re2, re1) ∈ RE × RE: Researcher re2 directly acts on activities under the guidance of re1.

  • wasConformedTo(r, ru) ∈ R × RU: Conforming a resource to a certain rule, i.e., resource r conforms to rule ru, such as a disciplinary paradigm.

  • wasConstrainedBy(a, ru) ∈ A × RU: Restricting an activity by a certain rule, i.e., activity a is constrained by rule ru, such as process logic.

  • complyWith(re, ru) ∈ RE × RU: Making a researcher’s behavior comply with a certain rule, i.e., researcher re complies with rule ru, such as a confidentiality protocol.

  • exclusiveWith(ru2, ru1) ∈ RU × RU: Rules ru2 and ru1 are mutually exclusive.

  • Precede(ru2, ru1) ∈ RU × RU: Rule ru2 has priority over ru1, whether or not they are mutually exclusive.
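The dependency list above amounts to a typed edge vocabulary over the four vertex sets. The sketch below records each dependency with its source and target vertex kinds and rejects edges that violate the signature or would close a cycle. It is an illustration under stated assumptions, not the authors' implementation: the class layout, the method names, and the choice to direct each edge from the first argument to the second are all made up for this example.

```python
# Vertex kinds: "R" resource, "A" activity, "RE" researcher, "RU" rule.
# Each dependency maps to (kind of first argument, kind of second argument).
DEPENDENCIES = {
    "wasDerivedFrom":   ("R", "R"),
    "wasRevisionOf":    ("R", "R"),
    "wasGeneratedBy":   ("R", "A"),
    "Used":             ("R", "A"),
    "wasInvalidatedBy": ("R", "A"),
    "wasAttributedTo":  ("R", "RE"),
    "wasExchangedBy":   ("A", "A"),
    "wasExecutedBy":    ("A", "RE"),
    "dependOn":         ("RE", "RE"),
    "consultWith":      ("RE", "RE"),
    "wasGuidedBy":      ("RE", "RE"),
    "wasConformedTo":   ("R", "RU"),
    "wasConstrainedBy": ("A", "RU"),
    "complyWith":       ("RE", "RU"),
    "exclusiveWith":    ("RU", "RU"),
    "Precede":          ("RU", "RU"),
}


class CollabPG:
    """Hypothetical container for a CollabPG graph (illustrative only)."""

    def __init__(self):
        self.kinds = {}  # vertex id -> vertex kind
        self.edges = []  # (dependency, source id, target id)

    def add_vertex(self, vid, kind):
        if kind not in {"R", "A", "RE", "RU"}:
            raise ValueError(f"unknown vertex kind: {kind}")
        self.kinds[vid] = kind

    def _reachable(self, start, goal):
        # Depth-first search along existing edges.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(t for _, s, t in self.edges if s == node)
        return False

    def add_edge(self, dependency, source, target):
        expected = DEPENDENCIES.get(dependency)
        if expected is None:
            raise ValueError(f"unknown dependency: {dependency}")
        actual = (self.kinds[source], self.kinds[target])
        if actual != expected:
            raise TypeError(f"{dependency} expects {expected}, got {actual}")
        # Keep the graph acyclic: refuse an edge whose target already reaches its source.
        if self._reachable(target, source):
            raise ValueError("edge would create a cycle")
        self.edges.append((dependency, source, target))


# Hypothetical example: activity a1 uses r1 and generates r2, executed by re1.
g = CollabPG()
for vid, kind in [("r1", "R"), ("r2", "R"), ("a1", "A"), ("re1", "RE")]:
    g.add_vertex(vid, kind)
g.add_edge("Used", "r1", "a1")
g.add_edge("wasGeneratedBy", "r2", "a1")
g.add_edge("wasDerivedFrom", "r2", "r1")
g.add_edge("wasExecutedBy", "a1", "re1")
```

Note that under this edge direction a symmetric relation such as exclusiveWith can only be stored once per pair, since adding both directions would be rejected as a cycle.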

3.4 Evaluation of the CollabPG Model

Here, we evaluate our model against three criteria: compatibility, scalability, and interoperability.

PROV-DM [30] defines conceptual standards, covering aspects such as information collection, storage methods, and query technologies, with the goal of exchanging information between heterogeneous systems. The proposed CollabPG model is compatible with it. Specifically, resource, activity, and researcher play roles similar to entity, activity, and agent in PROV-DM. Considering factors such as the privacy, sensitivity, and control flow of provenance information, we add the rule element and its related dependencies to our model. The relationships wasExchangedBy and wasExecutedBy correspond to wasInformedBy and wasAssociatedWith, respectively, while dependOn, consultWith, and wasGuidedBy can be viewed as extensions of actedOnBehalfOf. Based on this analysis, we can build a mapping from the collaborative provenance model to PROV-DM, which ensures compatibility and supports information exchange with other provenance-enabled models. In particular, collaboration characteristics are reflected explicitly in our model; a detailed comparison with PROV-DM is given in Table 2.
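The mapping described in this paragraph can be written down as a lookup table. The sketch below covers only the correspondences named in the text; the names `COLLABPG_TO_PROV` and `to_prov` are hypothetical, and returning `None` for the rule element and other unlisted terms is an assumption of this example that reflects their status as additions beyond the mapping stated here.

```python
# CollabPG terms mapped to the PROV-DM terms named in the text.
COLLABPG_TO_PROV = {
    # element types
    "Resource": "Entity",
    "Activity": "Activity",
    "Researcher": "Agent",
    # dependencies
    "wasExchangedBy": "wasInformedBy",
    "wasExecutedBy": "wasAssociatedWith",
    "dependOn": "actedOnBehalfOf",      # treated as an extension
    "consultWith": "actedOnBehalfOf",   # treated as an extension
    "wasGuidedBy": "actedOnBehalfOf",   # treated as an extension
}


def to_prov(term):
    """Return the PROV-DM counterpart of a CollabPG term, or None if the text names none."""
    return COLLABPG_TO_PROV.get(term)


print(to_prov("wasExecutedBy"))
print(to_prov("Rule"))
```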

Table 2. Comparison of collaborative provenance model with PROV-DM

In addition, because it accords with the PROV-DM standard, our model is interoperable and supports information exchange among multiple systems. We will also pursue good scalability in subsequent model-based system design.

4 Challenges and Opportunities on Provenance Model in Multidisciplinary Collaboration

Having surveyed the state of the art, this section addresses specific research issues concerning the balance between models, systems, and practice in provenance research for multidisciplinary collaboration. We discuss each in turn.

Trade-Off Between Core Principles and Extension Requirements for Provenance-Bound Models.

Several researchers have summarized criteria for evaluating the quality of a model. Examples include Completeness, Correctness, Clarity, Consistency, Simplicity, and Comprehensibility. That is, the model should contain all relevant ingredients of the domain and should conform to the syntax of the modeling language with authentic and correct information. Moreover, the statements in the model should be unambiguous, consistent, and free of redundancy. Lastly, the model should be easy for its users and developers to understand. Meanwhile, in practical applications it may be unavoidable to adapt a provenance model by extending its original components to satisfy specific needs. In this case, requirements such as compatibility, scalability, and interoperability can enhance the capabilities of a model. Therefore, future approaches are expected to consider both the core and the extensions of provenance models comprehensively.

Trade-Off Between General Models and Specific Systems.

On the one hand, a desirable model is versatile, domain-agnostic, loosely coupled with systems, and supports interoperability and interchange among systems. On the other hand, no model is likely to be self-contained and to represent all provenance systems. Sometimes, the representativeness of a model matters more than its completeness. In practice, the model should be integrated with a specific application system. For instance, provenance models and capture systems are correlated: the information granularity of a model determines which granularity of capture system should be adopted. However, integration effort, provenance granularity, false positives, and analysis scope all have to be taken into account when choosing appropriate capture methods and systems [26]. Specifically, the granularity of capture drives provenance costs, i.e., fine-grained capture approaches aggravate information overload, performance overhead, and memory load. Therefore, coarse-grained, fine-grained, or hybrid systems should be selected according to the actual models and scenarios.

Trade-Off Between Privacy and Utility of Information in Provenance Model.

Several studies have indicated that provenance disclosure raises security and privacy concerns [16, 22], because part of the provenance is sensitive and confidential, particularly for interactions among individuals from different disciplines in collaborative environments. With regard to security, researchers employ diverse access control strategies, such as authentication, authorization, and sandboxing, to determine which view of provenance particular users can access. As described in the existing literature, customized techniques, including sanitization, abstraction, obscuring, and redaction, are employed to render an abstracted overview of provenance by omitting sensitive pieces of information. However, those pruning methods inevitably cause varying degrees of utility loss, which may have undesirable side effects when exploring provenance details. Consequently, how to balance confidentiality protection against utility preservation in provenance information remains to be explored. More specifically, partial provenance could be revealed to targeted users depending on their ownership roles, trust levels, and access privileges. At the same time, parts of the information could be concealed according to their sensitive attributes, privacy requirements, and application tendencies.
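The trade-off between concealment and utility can be illustrated with a toy redaction routine. The sketch below is one possible reading of the idea, under assumptions that are not in the text: each attribute carries an integer sensitivity level, each user an integer trust level, and utility is measured simply as the fraction of attributes that remain visible. The function name, record fields, and levels are all hypothetical.

```python
def redacted_view(record, sensitivity, trust_level):
    """Return the attributes visible at this trust level and the fraction retained."""
    # Attributes without a listed sensitivity default to level 0 (public).
    visible = {key: value for key, value in record.items()
               if sensitivity.get(key, 0) <= trust_level}
    utility = len(visible) / len(record) if record else 1.0
    return visible, utility


# Hypothetical provenance record and sensitivity levels.
record = {
    "rid": "r1",
    "type": "intermediate data",
    "owner": "re1",
    "raw_measurements": "measurements.csv",
}
sensitivity = {"owner": 1, "raw_measurements": 2}

public_view, public_utility = redacted_view(record, sensitivity, trust_level=0)
full_view, full_utility = redacted_view(record, sensitivity, trust_level=2)
```

In this toy setting a user at trust level 0 sees half of the attributes, which makes the utility loss caused by redaction explicit and comparable across access policies.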

In the domain of multidisciplinary collaboration, the challenges mentioned above cover only some of the modeling issues, and the proposals may apply to other applications as well. In provenance practice, one important point is that the most appropriate model-based solution depends on the needs at hand.

5 Conclusion and Future Work

In this paper, we gave an overview of the core components of provenance models, including model specification, characteristic comparison, and model analysis. We proposed a collaborative model that accounts for multi-faceted factors in multidisciplinary applications, designed to depict the collaboration process of cross-disciplinary scientists as they exchange and share diverse resources, together with the associated evolution of provenance data. We summarized fundamental issues in existing provenance models to facilitate the understanding of model dimensions and construction. This summary is intended to help both domain experts and general users make reasonable decisions about which model-based provenance solution to choose in interdisciplinary applications.

In future research, we intend to further explore dependency path calculation, tracking mechanisms, storage methods, query technologies, and access visualization of provenance in multidisciplinary applications, taking their collaborative characteristics and attributes into account.