BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework

Zhu, Yuqing; Zhan, Jianfeng; Weng, Chuliang; Nambiar, Raghunath; Zhang, Jinchao; Chen, Xingzhen; Wang, Lei

doi:10.1007/978-3-319-05813-9_32

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8422)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Big data is considered a proprietary asset of companies, organizations, and even nations. Turning big data into real treasure requires the support of big data systems. A variety of commercial and open-source products have been released for big data storage and processing. While big data users face the choice of which system best suits their needs, big data system developers face the question of how to evaluate their systems with regard to general big data processing needs. System benchmarking is the classic way of meeting both demands. However, existing big data benchmarks either fail to represent the variety of big data processing requirements or target only one specific platform, e.g., Hadoop.

In this paper, together with our industrial partners, we present BigOP, an end-to-end system benchmarking framework that features the abstraction of representative Operation sets, workload Patterns, and prescribed tests. BigOP is part of an open-source big data benchmarking project, BigDataBench. BigOP's abstraction model not only guides the development of BigDataBench but also enables automatic generation of tests with comprehensive workloads.

We illustrate the feasibility of BigOP by implementing an automatic test generation tool and benchmarking three widely used big data processing systems, namely Hadoop, Spark, and MySQL Cluster. Three tests targeting three different application scenarios are prescribed. The tests involve relational data, text data, and graph data, as well as all operations and workload patterns. We report results according to the test specifications.

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), Huawei, Cisco, China
Yuqing Zhu, Jianfeng Zhan, Chuliang Weng, Raghunath Nambiar, Jinchao Zhang, Xingzhen Chen & Lei Wang

Editor information

Editors and Affiliations

School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore, Singapore
Sourav S. Bhowmick
Department of Computer Science, Utah State University, Old Main Hill, 4205, 84322-4205, Logan, UT, USA
Curtis E. Dyreson
Department of Computer Science, Aalborg University, Selma Lagerløfs Vej, 300, 9220, Aalborg Øst, Denmark
Christian S. Jensen
Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
Mong Li Lee
Department of Computer Science, Udayana University, Jl. Kampus Unud Jimbaran Bali, 80364, Badung, Bali, Indonesia
Agus Muliantara
Information Systems Engineering, Christian-Albrechts-Universität zu Kiel, Olshausenstrasse 40, 24098, Kiel, Germany
Bernhard Thalheim

About this paper

Cite this paper

Zhu, Y. et al. (2014). BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework. In: Bhowmick, S.S., Dyreson, C.E., Jensen, C.S., Lee, M.L., Muliantara, A., Thalheim, B. (eds) Database Systems for Advanced Applications. DASFAA 2014. Lecture Notes in Computer Science, vol 8422. Springer, Cham. https://doi.org/10.1007/978-3-319-05813-9_32

DOI: https://doi.org/10.1007/978-3-319-05813-9_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05812-2
Online ISBN: 978-3-319-05813-9
eBook Packages: Computer Science, Computer Science (R0)
