Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation

Gadiraju, Krishna Karthik; Davis, Karen C.; Talaga, Paul G.

doi:10.1007/978-3-319-12256-4_6

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8823)

Included in the following conference series: International Conference on Conceptual Modeling

Abstract

Many organizations rely on relational database platforms for OLAP-style querying (aggregation and filtering) in small- to medium-sized applications. We investigate the impact of scaling up the data sizes for such queries, with the goal of illustrating the performance an organization could expect if it migrates current applications to a big data environment. This paper benchmarks the performance of Hive [20], a parallel data warehouse platform that is part of the Hadoop software stack. We set up a 4-node Hadoop cluster using Hortonworks HDP 1.3.2 [10] and use the data generator provided by the TPC-DS benchmark [3] to generate data at different scales. We take a representative query from the TPC-DS query set and run its SQL and Hive Query Language (HiveQL) versions on a relational database installation (MySQL) and on the Hive cluster, respectively. We measure the query-execution speedup at each dataset size produced by the scale-up. Hive loads the large datasets faster than MySQL, while it is marginally slower than MySQL when loading the smaller datasets.
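The speedup measurement mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming speedup is defined in the usual way as MySQL time divided by Hive time for the same query and dataset scale; the abstract does not state the formula. The function names and the timing values are hypothetical placeholders, not results from the paper.

```python
def speedup(baseline_seconds: float, parallel_seconds: float) -> float:
    """Return baseline time / parallel time.

    A value above 1 means the parallel system (here, Hive) was faster;
    a value below 1 means the baseline (here, MySQL) was faster.
    """
    if baseline_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / parallel_seconds


def speedup_by_scale(mysql_times: dict, hive_times: dict) -> dict:
    """Compute the speedup for every dataset scale present in both inputs."""
    mismatched = set(mysql_times) ^ set(hive_times)
    if mismatched:
        raise ValueError(f"scales measured on only one system: {sorted(mismatched)}")
    return {
        scale: speedup(mysql_times[scale], hive_times[scale])
        for scale in sorted(mysql_times)
    }


# Placeholder timings in seconds, keyed by an arbitrary scale label.
# These numbers are invented for illustration only.
mysql_times = {1: 10.0, 10: 200.0}
hive_times = {1: 40.0, 10: 100.0}

for scale, ratio in speedup_by_scale(mysql_times, hive_times).items():
    print(f"scale {scale}: speedup {ratio:.2f}")
```

Reporting the ratio per scale, rather than a single average, keeps visible the kind of crossover the abstract describes for loading, where the parallel system is slower on small inputs and faster on large ones.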

References

1. Baru, C., Bhandarkar, M., Nambiar, R., Poess, M., Rabl, T.: Setting the direction for big data benchmark standards. In: Nambiar, R., Poess, M. (eds.) TPCTC 2012. LNCS, vol. 7755, pp. 197–208. Springer, Heidelberg (2013)
2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1), 107–113 (2008)
3. DSGen v1.1.0, data generation tool for TPC-DS, http://www.tpc.org/tpcds/
4. Ghazal, A., Rabl, T., Hu, M., Raab, F., Poess, M., Crolotte, A., Jacobsen, H.-A.: BigBench: Towards an Industry Standard Benchmark for Big Data Analytics (2013)
5. GridMix program. Available in Hadoop source distribution: src/benchmarks/gridmix
6. Gruenheid, A., Omiecinski, E., Mark, L.: Query optimization using column statistics in Hive. In: Proceedings of the 15th Symposium on International Database Engineering & Applications, pp. 97–105. ACM (2011)
7. Hadoop TeraSort program. Available in Hadoop source distribution since 0.19 version: src/examples/org/apache/hadoop/examples/terasort
8. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: A Self-tuning System for Big Data Analytics. In: CIDR, vol. 11, pp. 261–272 (2011)
9. Hive Performance Benchmark, https://issues.apache.org/jira/browse/hive-396
10. Hortonworks HDP 1.3.2, http://hortonworks.com/products/hdp/hdp-1-3/#overview
11. Hortonworks Stinger Initiative, http://hortonworks.com/labs/stinger/
12. Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: Characterization of the MapReduce-based data analysis. In: 2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW), pp. 41–51. IEEE (2010)
13. Nambiar, R.O., Poess, M.: The making of TPC-DS. In: Proceedings of the 32nd International Conference on Very Large Data Bases, pp. 1049–1058. VLDB Endowment (2006)
14. Pansare, N., Cai, Z.: Using Hive to perform medium-scale data analysis (2010)
15. Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pp. 165–178. ACM (2009)
16. Running the TPC-H benchmark on Hive, https://issues.apache.org/jira/secure/attachment/12416257/TPC-H_on_Hive_2009-08-11.pdf
17. Sort program. Available in Hadoop source distribution: src/examples/org/apache/hadoop/examples/sort
18. Shi, Y., Meng, X., Zhao, J., Hu, X., Liu, B., Wang, H.: Benchmarking cloud-based data management systems. In: Proceedings of the Second International Workshop on Cloud Data Management, pp. 47–54. ACM (2010)
19. TPC-DS benchmarking standard, http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
20. Thusoo, A., Sarma, J.S., Jain, N., Shao, Z., Chakka, P., Anthony, S., Liu, H., Wyckoff, P., Murthy, R.: Hive: A warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment 2(2), 1626–1629 (2009)
21. White, T.: Hadoop: The definitive guide. O’Reilly (2012)

Author information

Authors and Affiliations

Electrical Engineering and Computing Systems, University of Cincinnati, Cincinnati, OH, 45221-0030, USA
Krishna Karthik Gadiraju, Karen C. Davis & Paul G. Talaga

Editor information

Editors and Affiliations

UQ Business School, The University of Queensland, 4072, St Lucia, QLD, Australia
Marta Indulska
316B Information Sciences and Technology Building, Penn State University, 16802, University Park, PA, USA
Sandeep Purao

About this paper

Cite this paper

Gadiraju, K.K., Davis, K.C., Talaga, P.G. (2014). Benchmarking Performance for Migrating a Relational Application to a Parallel Implementation. In: Indulska, M., Purao, S. (eds) Advances in Conceptual Modeling. ER 2014. Lecture Notes in Computer Science, vol 8823. Springer, Cham. https://doi.org/10.1007/978-3-319-12256-4_6

DOI: https://doi.org/10.1007/978-3-319-12256-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12255-7
Online ISBN: 978-3-319-12256-4
eBook Packages: Computer Science, Computer Science (R0)
