Abstract
This chapter reviews recent research progress on big data processing platforms and algorithms based on the MapReduce programming model. First, 12 typical MapReduce-based big data processing platforms are introduced; their implementation principles and applicable scenarios are analyzed and compared, and their commonalities are abstracted. Next, MapReduce-based big data analysis algorithms are introduced, including search algorithms, data cleaning/transformation algorithms, aggregation algorithms, join algorithms, sorting algorithms, preference queries, optimization computation methods, graph algorithms, and data mining algorithms. These algorithms are classified according to their MapReduce implementation, and the factors affecting their performance are analyzed. Finally, big data processing algorithms are abstracted as external memory algorithms, and the characteristics of external memory algorithms are summarized. Research ideas and open problems concerning performance optimization methods for general external memory algorithms are proposed for researchers' reference. Specifically, these include optimizing the disk I/O of external memory algorithms, optimizing their locality, and designing incremental iterative algorithms. Existing research on big data processing platforms and algorithms mostly focuses on dynamic platform performance optimization based on resource allocation and task scheduling, and on the parallelization of specific algorithms. This chapter provides researchers with a broad survey of MapReduce big data processing: platforms, tools, and algorithms.
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Abualigah, L., Masri, B.A. (2021). Advances in MapReduce Big Data Processing: Platform, Tools, and Algorithms. In: Manoharan, K.G., Nehru, J.A., Balasubramanian, S. (eds) Artificial Intelligence and IoT. Studies in Big Data, vol 85. Springer, Singapore. https://doi.org/10.1007/978-981-33-6400-4_6
DOI: https://doi.org/10.1007/978-981-33-6400-4_6
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6399-1
Online ISBN: 978-981-33-6400-4
eBook Packages: Intelligent Technologies and Robotics (R0)