Abstract
We introduce StarFlow, a script-centric environment for data analysis. StarFlow has four main features: (1) extraction of control and data-flow dependencies through a novel combination of static analysis, dynamic runtime analysis, and user annotations; (2) command-line tools for exploring and propagating changes through the resulting dependency network; (3) support for workflow abstractions that enable robust parallel execution of complex analysis pipelines; and (4) a seamless interface with the Python scripting language. We describe real applications of StarFlow, including automatic parallelization of complex workflows in the cloud.
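To make features (1) and (2) concrete, the following is a minimal sketch, not StarFlow's implementation or API. It assumes a simplified setting in which each analysis step is a top-level Python function and a dependency exists whenever one step calls another by name. It covers only the static-analysis part (dynamic analysis and user annotations are omitted), and the names `static_call_dependencies`, `downstream`, and `EXAMPLE` are hypothetical.

```python
import ast
from collections import defaultdict, deque


def static_call_dependencies(source):
    """Map each top-level function to the top-level functions it calls by name.

    Assumption: dependencies are approximated by direct name calls only;
    attribute calls, imports, and file I/O are ignored in this sketch.
    """
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    deps = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            called = {
                n.func.id
                for n in ast.walk(node)
                if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
            }
            # Keep only calls to other functions defined in the same script.
            deps[node.name] = (called & defined) - {node.name}
    return deps


def downstream(deps, changed):
    """Return every function that transitively depends on `changed`."""
    # Invert the graph: callee -> set of callers.
    reverse = defaultdict(set)
    for func, callees in deps.items():
        for callee in callees:
            reverse[callee].add(func)
    # Breadth-first walk over the inverted graph.
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in reverse[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical three-step analysis script used as input.
EXAMPLE = '''
def load():
    return [1, 2, 3]

def clean():
    return [x for x in load() if x > 1]

def plot():
    return len(clean())
'''

deps = static_call_dependencies(EXAMPLE)
stale = downstream(deps, "load")  # steps to re-run after `load` changes
```

In this sketch, a change to `load` marks `clean` and `plot` as stale, which is the kind of change propagation through a dependency network that the abstract describes.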
© 2010 Springer-Verlag Berlin Heidelberg
Cite this paper
Angelino, E., Yamins, D., Seltzer, M. (2010). StarFlow: A Script-Centric Data Analysis Environment. In: McGuinness, D.L., Michaelis, J.R., Moreau, L. (eds) Provenance and Annotation of Data and Processes. IPAW 2010. Lecture Notes in Computer Science, vol 6378. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17819-1_27
DOI: https://doi.org/10.1007/978-3-642-17819-1_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17818-4
Online ISBN: 978-3-642-17819-1
eBook Packages: Computer Science (R0)