Abstract
This paper presents an analysis and the results of experimental research on the performance of machine learning workloads run with the Apache Spark MLlib library in the Microsoft Azure HDInsight ecosystem, using the Spark-perf benchmark test suite. To address this problem, software, information support, and a methodology have been developed based on the monitoring tools SparkMeasure and Ambari. Metrics are proposed for analyzing the performance of Apache Spark computations; they use statistical characteristics of the training and testing processes recorded while the Spark-perf benchmark tests run. Formulas are proposed for determining the values of Apache Spark parameters. Compared with the standard (default) Spark parameter values, these formulas reduce the execution time of sets of machine learning test tasks on both heterogeneous and homogeneous Apache Spark cluster configurations in Azure HDInsight. To assess the computing performance of the machine learning methods in Spark-perf, a metric is proposed that is calculated as the ratio of the average testing time to the average training time. The results of the computational experiments are presented. They confirm that, for the machine learning methods in the Spark-perf test set, the proposed algorithms for choosing Apache Spark settings are effective relative to the standard values on heterogeneous and homogeneous clusters deployed on Azure HDInsight.
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Minukhin, S., Brynza, N., Sitnikov, D. (2021). Analyzing Performance of Apache Spark MLlib with Multinode Clusters on Azure HDInsight: Spark-Perf Case Study. In: Babichev, S., Lytvynenko, V., Wójcik, W., Vyshemyrskaya, S. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2020. Advances in Intelligent Systems and Computing, vol 1246. Springer, Cham. https://doi.org/10.1007/978-3-030-54215-3_8
DOI: https://doi.org/10.1007/978-3-030-54215-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54214-6
Online ISBN: 978-3-030-54215-3
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)