Abstract
The MapReduce model uses a barrier between the Map and Reduce stages. This provides simplicity in both programming and implementation. However, in many situations, this barrier hurts performance because it is overly restrictive. Hence, we develop a method to break the barrier in MapReduce in a way that improves efficiency. Careful design of our barrier-less MapReduce framework results in equivalent generality and retains ease of programming. We motivate our case with, and experimentally study our barrier-less techniques in, a wide variety of MapReduce applications divided into seven classes. Our experiments show that our approach can achieve better job completion times than a traditional MapReduce framework. This is due primarily to the interleaving of I/O and computation, and forgoing disk-intensive work. We achieve a reduction in job completion times that is 25% on average and 87% in the best case.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Apache hadoop. http://hadoop.apache.org (2011)
Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., Harris, E.: Reining in the outliers in map-reduce clusters using mantri. In: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, pp. 1–16. USENIX Association, Berkeley (2010)
Black, F., Scholes, M.S.: The pricing of options and corporate liabilities. J. Polit. Econ. 81(3), 637–654 (1973)
Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867 (2007)
Chu, C., Kim, S., Lin, Y., Yu, Y., Bradski, G., Ng, A., Olukotun, K.: Map-reduce for machine learning on multicore. In: Advances in Neural Information Processing Systems (NIPS), pp. 281–288 (2007)
Condie, T., Conway, N., Alvaro, P., Hellerstein, Elmeleegy, K., Sears, R.: Online aggregation and continuous query support in mapreduce. In: NSDI’10: Seventh USENIX Symposium on Networked Systems Design and Implementation. USENIX Association, Berkeley (2010)
Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 137–149 (2004)
DeWitt, D., Gerber, R.: Multiprocessor hash-based join algorithms. In: Proc. of VLDB Conference. Citeseer, Princeton (1985)
Dyer, C., Cordova, A., Mont, A., Lin, J.: Fast, easy, and cheap: construction of statistical machine translation models with mapreduce. In: Proc. of the Third Workshop on Statistical Machine Translation (StatMT), pp. 199–207. Association for Computational Linguistics, Stroudsburg (2008)
Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proc. of the Annual Meeting of the Association for Computational Linguistics on Human Language Technologies (HLT), pp. 265–268. Association for Computational Linguistics, Stroudsburg (2008)
Farivar, R., Verma, A., Chan, E., Campbell, R.: Mithra: multiple data independent tasks on a heterogeneous resource architecture. In: Proc. of IEEE International Conference on Cluster Computing and Workshops (CLUSTER), pp. 1–10 (2009). doi:10.1109/CLUSTR.2009.5289201
Fix, E., Hodges, J.: Discriminatory analysis, nonparametric discrimination: Consistency properties. Tech. rep., USAF School of Aviation Medicine, Randolph Field, Texas (1951)
Guibas, L.J., Sedgewick, R.: A dichromatic framework for balanced trees. In: Proc. of Symposium on Foundations of Computer Science, pp. 8–21 (1978). doi:10.1109/SFCS.1978.3
Gupta, A., Mumick, I.: Maintenance of materialized views: problems, techniques, and applications. Bull. Tech. Commun. 51, 3 (2010)
Illinois Cloud Computing Testbed. http://cloud.cs.illinois.edu/ (2011)
Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D.: Dryad: distributed data-parallel programs from sequential building blocks. In: Proc. of European Conference on Computer Systems (EuroSys), pp. 59–72 (2007)
Mongodb. http://www.mongodb.org (2011)
Olson, M.A., Bostic, K., Seltzer, M.: Berkeley db. In: Proc. of the USENIX Annual Technical Conference (ATEC), p. 43. USENIX Association, Berkeley (1999)
O’Malley, O., Murthy, A.C.: Winning a 60 second dash with a yellow elephant. http://sortbenchmark.org/Yahoo2009.pdf (2009)
Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., Dewitt, D.J., Madden, S., Stonebraker, M.: A comparison of approaches to large-scale data analysis. In: Proc. of the ACM SIGMOD International Conference. ACM (2009). URL http://database.cs.brown.edu/sigmod09/benchmarks-sigmod09.pdf
Popa, L., Budiu, M., Yu, Y., Isard, M.: DryadInc: Reusing work in large-scale computations. In: Proc. of Workshop on Hot Topics in Cloud Computing (HotCloud) (2009)
Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)
Tokyo cabinet. http://1978th.net/tokyocabinet/ (2011)
Verma, A., Llora, X., Goldberg, D.E., Campbell, R.H.: Scaling genetic algorithms using mapreduce. In: Proc. of International Conference on Intelligent Systems Design and Applications (2009)
Verma, A., Zea, N., Cho, B., Gupta, I., Campbell, R.: Breaking the MapReduce stage barrier. In: Proc. of IEEE International Conference on Cluster Computing (CLUSTER), pp. 235–244 (2010)
White, T.: Hadoop: The Definitive Guide. O’Reilly Media, Sebastopol (2009)
Yang, H., Dasdan, A., Hsiao, R., Parker, D.: Map-reduce-merge: simplified relational data processing on large clusters. In: Proc. of the 2007 ACM SIGMOD International Conf. on Management of data (2007)
Zaharia, M., Konwinski, A., Joseph, A.D., Katz, R., Stoica, I.: Improving mapreduce performance in heterogeneous environments. In: Proc. of the USENIX Symposium on Operating Systems Design and Implementation (OSDI) (2008)
Author information
Authors and Affiliations
Corresponding author
Additional information
This is an extended version of our IEEE Cluster 2010 paper [25].
Rights and permissions
About this article
Cite this article
Verma, A., Cho, B., Zea, N. et al. Breaking the MapReduce stage barrier. Cluster Comput 16, 191–206 (2013). https://doi.org/10.1007/s10586-011-0182-7
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-011-0182-7