Advances in MapReduce Big Data Processing: Platform, Tools, and Algorithms

Artificial Intelligence and IoT

Part of the book series: Studies in Big Data (SBD, volume 85)

Abstract

This chapter reviews recent research progress on big data processing platforms and algorithms based on the MapReduce programming model. First, 12 typical MapReduce-based big data processing platforms are introduced; their implementation principles and applicable scenarios are analyzed and compared, and their commonalities are abstracted. Next, MapReduce-based big data analysis algorithms are introduced, including search algorithms, data cleaning/transformation algorithms, aggregation algorithms, join algorithms, sorting algorithms, preference queries, optimization methods, graph algorithms, and data mining algorithms. These algorithms are classified according to their MapReduce implementation, and the factors affecting their performance are analyzed. Finally, big data processing algorithms are abstracted as external memory algorithms, and the characteristics of external memory algorithms are summarized. Research ideas and open problems for optimizing the performance of general external memory algorithms are proposed for researchers' reference; specifically, these include optimizing the disk I/O of external memory algorithms, optimizing their locality, and designing incremental iterative algorithms. Existing research on big data processing platforms and algorithms mostly focuses on dynamic platform performance optimization based on resource allocation and task scheduling, and on the parallelization of specific algorithms. This chapter provides researchers with a broad survey of MapReduce big data processing platforms, tools, and algorithms.
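
As a rough illustration of the MapReduce programming model that the chapter surveys, the following Python sketch runs the map, shuffle, and reduce phases in a single process for word counting. The names `map_reduce`, `word_count_mapper`, and `word_count_reducer` are hypothetical and are not taken from the chapter or from any particular platform; real platforms distribute these phases across a cluster and spill intermediate data to disk.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process sketch of map -> shuffle -> reduce (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each input record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: group intermediate values by key.
            groups[key].append(value)
    # Reduce phase: combine the values collected for each key.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.lower().split():
        yield word, 1


def word_count_reducer(word, counts):
    # Sum the partial counts emitted for this word.
    return sum(counts)


counts = map_reduce(
    ["big data", "big clusters"],
    word_count_mapper,
    word_count_reducer,
)
```

Here `counts` maps each word to its total number of occurrences across the input lines. Several of the algorithm families listed in the abstract (aggregation, join, sorting) can be expressed by swapping in a different mapper and reducer, although this sketch makes no claim about how the chapter itself implements them.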

Author information

Corresponding author

Correspondence to Laith Abualigah.

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Abualigah, L., Masri, B.A. (2021). Advances in MapReduce Big Data Processing: Platform, Tools, and Algorithms. In: Manoharan, K.G., Nehru, J.A., Balasubramanian, S. (eds) Artificial Intelligence and IoT. Studies in Big Data, vol 85. Springer, Singapore. https://doi.org/10.1007/978-981-33-6400-4_6
