Advances in MapReduce Big Data Processing: Platform, Tools, and Algorithms

Abualigah, Laith; Masri, Bahaa Al

doi:10.1007/978-981-33-6400-4_6

Part of the book series: Studies in Big Data (SBD, volume 85)

Abstract

This chapter reviews recent research progress on big data processing platforms and algorithms based on the MapReduce programming model. First, 12 typical MapReduce-based big data processing platforms are introduced; their implementation principles and applicable scenarios are analyzed and compared, and their commonalities are abstracted. Next, MapReduce-based big data analysis algorithms are introduced, including search algorithms, data cleaning/transformation algorithms, aggregation algorithms, join algorithms, sorting algorithms, preference queries, optimization calculation methods, graph algorithms, and data mining algorithms. These algorithms are classified according to their MapReduce implementation, and the factors affecting their performance are analyzed. Finally, big data processing algorithms are abstracted as external memory algorithms, and the characteristics of external memory algorithms are summarized. Research ideas and open problems concerning performance optimization methods for general external memory algorithms are proposed for researchers' reference. Specifically, these include optimizing the disk I/O of external memory algorithms, optimizing the locality of external memory algorithms, and designing incremental iterative algorithms. Existing research on big data processing platforms and algorithms mostly focuses on dynamic platform performance optimization based on resource allocation and task scheduling, the parallelization of specific algorithms, and specific algorithmic [...]. This chapter provides researchers with a broad survey of MapReduce big data processing: platforms, tools, and algorithms.
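
As a rough illustration of the MapReduce programming model that the abstract refers to, the following single-process Python sketch runs a word-count aggregation through explicit map, shuffle, and reduce phases. It is not taken from the chapter: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_word_count`) are hypothetical, and a real platform would distribute these phases across machines and spill intermediate data to disk.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

KeyValue = Tuple[str, int]


def map_phase(records: Iterable[str],
              mapper: Callable[[str], List[KeyValue]]) -> List[KeyValue]:
    """Apply the mapper to every input record and collect its key/value pairs."""
    pairs: List[KeyValue] = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs: Iterable[KeyValue]) -> Dict[str, List[int]]:
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]],
                 reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the reducer once per key to the list of values collected for that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line: str) -> List[KeyValue]:
    """Emit (word, 1) for each whitespace-separated word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def word_count_reducer(key: str, values: List[int]) -> int:
    """Sum the partial counts for one word."""
    return sum(values)


def run_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases for the word-count aggregation."""
    pairs = map_phase(lines, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, word_count_reducer)


if __name__ == "__main__":
    print(run_word_count(["big data platforms", "Big data algorithms"]))
```

Keeping the shuffle step separate makes it visible that the reducer only ever sees all values for one key at a time, which is the property that aggregation-style MapReduce algorithms rely on.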

Author information

Authors and Affiliations

Faculty of Computer Sciences and Informatics, Amman Arab University, Amman, Jordan
Laith Abualigah & Bahaa Al Masri

Corresponding author

Correspondence to Laith Abualigah.

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Annamalai University, Chennai, Tamil Nadu, India
Kalaiselvi Geetha Manoharan
Department of Computer Science and Engineering, SRM Institute of Science and Technology, Chennai, Tamil Nadu, India
Jawaharlal Arun Nehru
Department of Mechanical Engineering, Annamalai University, Chennai, Tamil Nadu, India
Sivaraman Balasubramanian

About this chapter

Cite this chapter

Abualigah, L., Masri, B.A. (2021). Advances in MapReduce Big Data Processing: Platform, Tools, and Algorithms. In: Manoharan, K.G., Nehru, J.A., Balasubramanian, S. (eds) Artificial Intelligence and IoT. Studies in Big Data, vol 85. Springer, Singapore. https://doi.org/10.1007/978-981-33-6400-4_6

DOI: https://doi.org/10.1007/978-981-33-6400-4_6
Published: 13 February 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6399-1
Online ISBN: 978-981-33-6400-4
eBook Packages: Intelligent Technologies and Robotics (R0)
