Understanding Vertical Scalability of I/O Virtualization for MapReduce Workloads: Challenges and Opportunities

Nicolae, Bogdan

doi:10.1007/978-3-642-54420-0_1

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8374)

Included in the following conference series:

European Conference on Parallel Processing

Abstract

As the explosion of data sizes continues to push the limits of our ability to efficiently store and process big data, next-generation big data systems face multiple challenges. One important challenge is the limited scalability of I/O, a determining factor in the overall performance of big data applications. Paradigms like MapReduce have long been used to take advantage of local disks and to avoid data movement over the network as much as possible. With increasing core counts per node, however, local storage itself comes under increasing I/O pressure, which prompts the need to equip nodes with multiple disks. At the same time, given the rising need to virtualize large datacenters in order to provide more flexible allocation and consolidation of physical resources (transforming them into public or private/hybrid clouds), the following questions arise: is it possible to take advantage of multiple local disks at the virtual machine (VM) level in order to speed up big data analytics? If so, what are the best practices for achieving high aggregated I/O throughput under virtualization? This paper aims to answer these questions in the context of I/O-intensive MapReduce workloads: it analyzes and characterizes their behavior under different virtualization scenarios in order to propose best practices for current approaches and to speculate on future areas of improvement.

Author information

Authors and Affiliations

IBM Research, Dublin, Ireland
Bogdan Nicolae

Editor information

Editors and Affiliations

Rechen- und Kommunikationszentrum, RWTH Aachen, Seffenter Weg 23, 52074, Aachen, Germany
Dieter an Mey
TU Vienna, 1040, Vienna, Austria
Michael Alexander
RWTH Aachen University, Seffenter Weg 23, 52074, Aachen, Germany
Paolo Bientinesi & Carsten Clauss
University Magna Graecia of Catanzaro, 88100, Catanzaro, Italy
Mario Cannataro
Inria Rennes - Bretagne Atlantique, 35042, Rennes, France
Alexandru Costan & Christine Morin
University of Innsbruck, 6020, Innsbruck, Austria
Gabor Kecskemeti
Department of Computer Science, University of Pisa, 56126, Pisa, Italy
Laura Ricci
Universitat Politècnica de València, 46022, València, Spain
Julio Sahuquillo
LLNL, USA
Martin Schulz
Dipartimento di Informatica, Università di Salerno, 84084, Salerno, Italy
Vittorio Scarano
Tennessee Tech University and Oak Ridge National Laboratory, 38505, Cookeville, TN, USA
Stephen L. Scott
Technische Universität München, 80333, Munich, Germany
Josef Weidendorfer

Cite this paper

Nicolae, B. (2014). Understanding Vertical Scalability of I/O Virtualization for MapReduce Workloads: Challenges and Opportunities. In: an Mey, D., et al. Euro-Par 2013: Parallel Processing Workshops. Euro-Par 2013. Lecture Notes in Computer Science, vol 8374. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54420-0_1

DOI: https://doi.org/10.1007/978-3-642-54420-0_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54419-4
Online ISBN: 978-3-642-54420-0
eBook Packages: Computer Science, Computer Science (R0)
