Abstract
Clustering documents by similarity is useful for applications that need to group similar documents and separate dissimilar ones, which reduces ambiguity when searching for relevant documents. This requires a robust clustering algorithm that clusters documents efficiently and effectively. In this paper, the performance of two clustering algorithms, K-Means and DBSCAN, each with optimal parameters, is compared on various textual datasets using two similarity measures: cosine and hybrid similarity. The optimal value of epsilon in DBSCAN is found with the DMDBSCAN algorithm, and the optimal value of K in K-Means with the within-cluster sum of squares method. Hybrid similarity accounts for both single words and phrases, so the phrases shared across the corpus are extracted and the phrase similarity is computed. Hybrid similarity is then formed by combining cosine similarity and phrase similarity. To capture the shared phrases, the Document Index Graph document representation model is implemented in the Neo4j graph database. We use the silhouette score to evaluate the performance of the clustering algorithms. Experimental results show that DBSCAN performs better than K-Means on both similarity measures.
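The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how cosine and phrase similarity are combined, so the weighted blend and its `alpha` parameter are assumptions, as are the function names and the use of raw term counts. The phrase similarity is taken as an input because the abstract derives it from the Document Index Graph, which is not reproduced here.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity of two texts using raw term counts (assumed weighting)."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def hybrid_similarity(cosine_sim, phrase_sim, alpha=0.5):
    """Blend phrase and single-term similarity; alpha is a hypothetical weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * phrase_sim + (1.0 - alpha) * cosine_sim


def silhouette_score(dist, labels):
    """Mean silhouette over all points, given a square distance matrix
    (list of lists) and a list of cluster labels."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(labels)
    total = 0.0
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            continue  # a singleton cluster contributes a silhouette of 0
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == other) / labels.count(other)
            for other in clusters
            if other != labels[i]
        )
        denom = max(a, b)
        if denom > 0.0:
            total += (b - a) / denom
    return total / n
```

One way to connect the pieces, again as an assumption, is to use `1 - hybrid_similarity(...)` as the pairwise distance passed to `silhouette_score`. How DBSCAN noise points should enter the score is not stated in the abstract, so the sketch leaves that choice to the caller.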
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kathiria, P., Pandya, V., Arolkar, H., Patel, U. (2023). Performance Analysis of Document Similarity-Based DBSCAN and K-Means Clustering on Text Datasets. In: Singh, Y., Singh, P.K., Kolekar, M.H., Kar, A.K., Gonçalves, P.J.S. (eds) Proceedings of International Conference on Recent Innovations in Computing. Lecture Notes in Electrical Engineering, vol 1001. Springer, Singapore. https://doi.org/10.1007/978-981-19-9876-8_5
DOI: https://doi.org/10.1007/978-981-19-9876-8_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-9875-1
Online ISBN: 978-981-19-9876-8
eBook Packages: Intelligent Technologies and Robotics; Intelligent Technologies and Robotics (R0)