Abstract
Cluster analysis is one of the most widely used techniques for uncovering patterns in data mining. Most clustering methods require the number of clusters to be fixed in advance; however, given the size of the datasets used today, estimating that value is in most cases computationally expensive. In this article we present a clustering technique that avoids this requirement by using hierarchical clustering. There are many examples of this procedure in the literature, most of them focusing on the divisive (top-down) subtype, whereas in this article we address the agglomerative (bottom-up) subtype. Although it is more expensive in computation and running time, it yields very valuable information about the membership of elements in clusters and about how those clusters merge, that is, the dendrogram. Finally, several datasets of varying dimensionality have been used. For each of them, we compute internal validation indexes to evaluate the developed algorithm, studying which index gives the best results for obtaining the best possible clustering.
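To make the general idea concrete, the following is a minimal sketch (not the authors' Spark/Scala implementation) of agglomerative hierarchical clustering combined with internal validation indexes to choose the number of clusters. It uses SciPy and scikit-learn on synthetic data; the dataset, the linkage criterion, and the candidate range of cluster counts are illustrative assumptions.

# Minimal sketch: build an agglomerative hierarchy (the dendrogram), cut it at
# several candidate numbers of clusters, and compare internal validation indexes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data; the true number of clusters is unknown to the procedure.
X, _ = make_blobs(n_samples=500, centers=4, n_features=8, random_state=0)

# Agglomerative (bottom-up) clustering; the linkage matrix encodes the dendrogram.
Z = linkage(X, method="ward")

# Cut the dendrogram at each candidate k and score the resulting partition.
results = {}
for k in range(2, 11):  # candidate numbers of clusters (assumption)
    labels = fcluster(Z, t=k, criterion="maxclust")
    results[k] = {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

best_k = max(results, key=lambda k: results[k]["silhouette"])
print("best k by Silhouette:", best_k, results[best_k])

Here the Silhouette index is used to pick the final cut, with the Davies-Bouldin index reported alongside it for comparison; in practice the indexes may disagree, which is precisely the comparison the paper studies.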
About this paper
Cite this paper
Martín-Fernández, J.D., Luna-Romera, J.M., Pontes, B., Riquelme-Santos, J.C. (2020). Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds) 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019). SOCO 2019. Advances in Intelligent Systems and Computing, vol 950. Springer, Cham. https://doi.org/10.1007/978-3-030-20055-8_1
DOI: https://doi.org/10.1007/978-3-030-20055-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20054-1
Online ISBN: 978-3-030-20055-8
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)