Abstract
In many domains of science, engineering, and commerce, data analysis systems are employed to derive new data (and ultimately, one hopes, knowledge) from datasets describing experimental results or simulated phenomena. To support such analyses, we have developed a “virtual data system” that allows users first to define, then to invoke, and finally to explore the provenance of procedures (and workflows comprising multiple procedure calls) that perform such data derivations. The underlying execution model is “functional” in the sense that procedures read (but do not modify) their input and produce output via deterministic computations. This property makes it straightforward for the virtual data system to record not only the recipe for producing any given data object but also information about the environment in which the recipe was executed, all with sufficient fidelity that the steps used to create a data object can be re-executed to reproduce it at a later time or a different location. The virtual data system maintains this information in an integrated schema alongside semantic annotations. It thus enables a powerful query capability that exploits the rich semantic information implied by the structure of data derivation procedures, providing an information environment that fuses recipe, history, and application-specific semantics. We provide here an overview of this integration, the queries and transformations that it enables, and examples of how these capabilities can serve scientific processes.
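The functional model described above can be illustrated with a minimal sketch. It is not the virtual data system's actual interface or notation: the `Catalog` and `Derivation` classes, their method names, and the example procedures are all hypothetical, and the recorded "environment" is reduced to the Python version. The sketch only shows the idea that, when procedures are deterministic and never modify their inputs, recording each call is enough to trace a data object's lineage and to re-execute it.

```python
import platform
from dataclasses import dataclass, field


@dataclass
class Derivation:
    """One recorded procedure call (hypothetical record layout)."""
    procedure: str          # name of the procedure that was invoked
    inputs: tuple           # names of the datasets it read (never modified)
    output: str             # name of the dataset it produced
    environment: dict = field(default_factory=dict)  # where it ran


class Catalog:
    """Toy stand-in for a provenance catalog; all names are assumptions."""

    def __init__(self):
        self.procedures = {}   # procedure name -> callable
        self.data = {}         # dataset name -> value
        self.derivations = {}  # output dataset name -> Derivation

    def define(self, name, func):
        """Register a deterministic, side-effect-free procedure."""
        self.procedures[name] = func

    def invoke(self, name, inputs, output):
        """Run a procedure and record the recipe and environment."""
        args = [self.data[i] for i in inputs]
        self.data[output] = self.procedures[name](*args)
        self.derivations[output] = Derivation(
            name, tuple(inputs), output,
            {"python": platform.python_version()},
        )
        return self.data[output]

    def lineage(self, dataset):
        """Return the recorded steps needed to produce `dataset`, in order."""
        step = self.derivations.get(dataset)
        if step is None:
            return []  # a raw input: nothing derived it
        steps = []
        for name in step.inputs:
            for earlier in self.lineage(name):
                if earlier not in steps:
                    steps.append(earlier)
        steps.append(step)
        return steps

    def reproduce(self, dataset, raw_inputs):
        """Re-execute the recorded recipe from raw inputs in a fresh catalog."""
        fresh = Catalog()
        fresh.procedures = dict(self.procedures)
        fresh.data = dict(raw_inputs)
        for step in self.lineage(dataset):
            fresh.invoke(step.procedure, step.inputs, step.output)
        return fresh.data[dataset]


# Define, invoke, then explore provenance.
catalog = Catalog()
catalog.define("scale", lambda xs: [2 * x for x in xs])
catalog.define("total", lambda xs: sum(xs))
catalog.data["raw"] = [1, 2, 3]
catalog.invoke("scale", ["raw"], "scaled")
catalog.invoke("total", ["scaled"], "sum")

# Re-running the recorded steps yields the same data object.
assert catalog.reproduce("sum", {"raw": [1, 2, 3]}) == catalog.data["sum"]
```

Because the procedures are pure, `reproduce` needs nothing beyond the recorded steps and the raw inputs. A real system would also have to capture far more of the execution environment than this sketch does.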
© 2006 Springer-Verlag Berlin Heidelberg
Cite this paper
Zhao, Y., Wilde, M., Foster, I. (2006). Applying the Virtual Data Provenance Model. In: Moreau, L., Foster, I. (eds) Provenance and Annotation of Data. IPAW 2006. Lecture Notes in Computer Science, vol 4145. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11890850_16
DOI: https://doi.org/10.1007/11890850_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46302-3
Online ISBN: 978-3-540-46303-0
eBook Packages: Computer Science, Computer Science (R0)