Abstract
Users can determine the precise origins of their data by collecting detailed provenance records. However, auditing at a finer grain produces large amounts of metadata. To efficiently manage the collected provenance, several provenance management systems, including SPADE, record provenance on the hosts where it is generated. Distributed provenance raises the issue of efficient reconstruction during the query phase. Recursively querying provenance metadata or computing its transitive closure is known to have limited scalability and cannot be used for large provenance graphs. We present matrix filters, which are novel data structures for representing graph information, and demonstrate their utility for improving query efficiency with experiments on provenance metadata gathered while executing distributed workflow applications.
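To make the cost of recursive querying concrete, the following minimal sketch walks the ancestry of one item in a small provenance graph whose vertices live on different hosts, and counts how many steps cross a host boundary. It illustrates only the recursive baseline that the abstract describes as scaling poorly; it is not the matrix filter itself, whose construction the abstract does not give. The graph, the host assignment, and the names `PARENTS`, `HOST`, and `ancestors` are all hypothetical.

```python
from collections import deque

# Hypothetical provenance edges: each vertex maps to the vertices it was derived from.
PARENTS = {
    "report.pdf": ["plot.png", "stats.csv"],
    "plot.png": ["stats.csv"],
    "stats.csv": ["raw.dat"],
    "raw.dat": [],
}

# Hypothetical placement: provenance is stored on the host where it was generated.
HOST = {
    "report.pdf": "hostA",
    "plot.png": "hostA",
    "stats.csv": "hostB",
    "raw.dat": "hostC",
}


def ancestors(start):
    """Return (ancestor set, cross-host lookup count) using breadth-first traversal."""
    seen = set()
    remote_lookups = 0
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for parent in PARENTS[vertex]:
            # Following an edge to a vertex stored elsewhere needs a remote query.
            if HOST[parent] != HOST[vertex]:
                remote_lookups += 1
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen, remote_lookups


if __name__ == "__main__":
    found, remote = ancestors("report.pdf")
    print(sorted(found), remote)
```

In this toy graph every cross-host edge costs one remote lookup, so the number of remote queries grows with the number of such edges reached during traversal. A compact summary of which hosts hold related provenance is one way such lookups might be avoided.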
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Malik, T., Gehani, A., Tariq, D., Zaffar, F. (2013). Sketching Distributed Data Provenance. In: Liu, Q., Bai, Q., Giugni, S., Williamson, D., Taylor, J. (eds) Data Provenance and Data Management in eScience. Studies in Computational Intelligence, vol 426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29931-5_4
DOI: https://doi.org/10.1007/978-3-642-29931-5_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29930-8
Online ISBN: 978-3-642-29931-5
eBook Packages: Engineering, Engineering (R0)