Abstract
Cluster analysis is one of the most widely used techniques for uncovering patterns in data mining. Most clustering methods require the number of clusters to be fixed in advance; however, given the size of the datasets used today, estimating that value is in most cases computationally expensive. In this article we present a clustering technique that avoids this requirement by using hierarchical clustering. There are many examples of this procedure in the literature, most of them focusing on the divisive (top-down) subtype, whereas in this article we address the agglomerative (bottom-up) subtype. Although it is more expensive in computation and running time, it yields very valuable information about the membership of elements in clusters and about how those clusters merge, that is, the dendrogram. Finally, several datasets of varying dimensionality have been used. For each of them, we compute internal validation indexes to evaluate the developed algorithm, studying which index gives the best results for obtaining the best possible clustering.
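To make the general idea concrete, the following is a minimal sketch (not the authors' Spark/Scala implementation) of agglomerative hierarchical clustering combined with internal validation indexes to choose the number of clusters. It uses SciPy and scikit-learn on synthetic data; the dataset, the linkage criterion, and the candidate range of cluster counts are illustrative assumptions.

# Minimal sketch: build an agglomerative hierarchy (the dendrogram), cut it at
# several candidate numbers of clusters, and compare internal validation indexes.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic data; the true number of clusters is unknown to the procedure.
X, _ = make_blobs(n_samples=500, centers=4, n_features=8, random_state=0)

# Agglomerative (bottom-up) clustering; the linkage matrix encodes the dendrogram.
Z = linkage(X, method="ward")

# Cut the dendrogram at each candidate k and score the resulting partition.
results = {}
for k in range(2, 11):  # candidate numbers of clusters (assumption)
    labels = fcluster(Z, t=k, criterion="maxclust")
    results[k] = {
        "silhouette": silhouette_score(X, labels),          # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
    }

best_k = max(results, key=lambda k: results[k]["silhouette"])
print("best k by Silhouette:", best_k, results[best_k])

Here the Silhouette index is used to pick the final cut, with the Davies-Bouldin index reported alongside it for comparison; in practice the indexes may disagree, which is precisely the comparison the paper studies.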
About this paper
Cite this paper
Martín-Fernández, J.D., Luna-Romera, J.M., Pontes, B., Riquelme-Santos, J.C. (2020). Indexes to Find the Optimal Number of Clusters in a Hierarchical Clustering. In: Martínez Álvarez, F., Troncoso Lora, A., Sáez Muñoz, J., Quintián, H., Corchado, E. (eds) 14th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2019). SOCO 2019. Advances in Intelligent Systems and Computing, vol 950. Springer, Cham. https://doi.org/10.1007/978-3-030-20055-8_1
DOI: https://doi.org/10.1007/978-3-030-20055-8_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-20054-1
Online ISBN: 978-3-030-20055-8
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)