Abstract
Big data is considered a proprietary asset of companies, organizations, and even nations. Turning big data into real treasure requires the support of big data systems. A variety of commercial and open source products have been released for big data storage and processing. While big data users face the choice of which system best suits their needs, big data system developers face the question of how to evaluate their systems with regard to general big data processing needs. System benchmarking is the classic way of meeting both demands. However, existing big data benchmarks either fail to represent the variety of big data processing requirements or target only one specific platform, e.g., Hadoop.
In this paper, with our industrial partners, we present BigOP, an end-to-end system benchmarking framework, featuring the abstraction of representative Operation sets, workload Patterns, and prescribed tests. BigOP is part of an open-source big data benchmarking project, BigDataBench. BigOP’s abstraction model not only guides the development of BigDataBench, but also enables automatic generation of tests with comprehensive workloads.
We illustrate the feasibility of BigOP by implementing an automatic test generation tool and benchmarking three widely used big data processing systems: Hadoop, Spark, and MySQL Cluster. Three tests targeting three different application scenarios are prescribed. The tests involve relational data, text data, and graph data, as well as all operations and workload patterns. We report results according to the test specifications.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Zhu, Y. et al. (2014). BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8422. Springer, Cham. https://doi.org/10.1007/978-3-319-05813-9_32
DOI: https://doi.org/10.1007/978-3-319-05813-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05812-2
Online ISBN: 978-3-319-05813-9
eBook Packages: Computer Science (R0)