Analyzing Performance of Apache Spark MLlib with Multinode Clusters on Azure HDInsight: Spark-Perf Case Study

Minukhin, Sergii; Brynza, Natalia; Sitnikov, Dmytro

doi:10.1007/978-3-030-54215-3_8

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1246)

Included in the following conference series:

International Scientific Conference “Intellectual Systems of Decision Making and Problem of Computational Intelligence”

Abstract

This paper presents an experimental analysis of the performance of machine learning tasks run with the Apache Spark MLlib library on Microsoft Azure HDInsight, using the Spark-perf benchmark test set. To support the experiments, software and a supporting methodology were developed on top of the monitoring tools SparkMeasure and Ambari. Metrics are proposed for analyzing the performance of Apache Spark computations; they use statistical characteristics of the training and testing processes recorded while the Spark-perf benchmark tests run. Formulas are proposed for determining Apache Spark parameter settings. Compared with the standard parameter values, these settings reduce the execution time of sets of machine learning test tasks on both heterogeneous and homogeneous Apache Spark cluster configurations in Azure HDInsight. To assess the computing performance of the machine learning methods in Spark-perf, a metric is proposed that is calculated as the ratio of the average testing time to the average training time. The results of the computational experiments confirm that the proposed Apache Spark settings are more effective than the standard values for the machine learning methods of the Spark-perf test set on heterogeneous and homogeneous clusters deployed on Azure HDInsight.

Author information

Authors and Affiliations

Simon Kuznets Kharkiv National University of Economics, Kharkiv, Ukraine
Sergii Minukhin & Natalia Brynza
Kharkiv National University of Radio Electronics, Kharkiv, Ukraine
Dmytro Sitnikov

Corresponding author

Correspondence to Sergii Minukhin.

Editor information

Editors and Affiliations

Department of Informatics, Jan Evangelista Purkyně University in Ústí nad Labem, Ústí nad Labem, Czech Republic
Sergii Babichev
Department of Informatics and Computer Science, Kherson National Technical University, Kherson, Ukraine
Volodymyr Lytvynenko
Institute of Electronics and Information, Lublin University of Technology, Lublin, Poland
Waldemar Wójcik
Department of Informatics and Computer Science, Kherson National Technical University, Kherson, Ukraine
Svetlana Vyshemyrskaya

About this paper

Cite this paper

Minukhin, S., Brynza, N., Sitnikov, D. (2021). Analyzing Performance of Apache Spark MLlib with Multinode Clusters on Azure HDInsight: Spark-Perf Case Study. In: Babichev, S., Lytvynenko, V., Wójcik, W., Vyshemyrskaya, S. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2020. Advances in Intelligent Systems and Computing, vol 1246. Springer, Cham. https://doi.org/10.1007/978-3-030-54215-3_8

DOI: https://doi.org/10.1007/978-3-030-54215-3_8
Published: 26 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54214-6
Online ISBN: 978-3-030-54215-3
eBook Packages: Intelligent Technologies and Robotics (R0)
