Performance Analysis of Document Similarity-Based DBSCAN and K-Means Clustering on Text Datasets

Kathiria, Preeti; Pandya, Vandan; Arolkar, Harshal; Patel, Usha

doi:10.1007/978-981-19-9876-8_5

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1001)

Abstract

Clustering documents by similarity is useful for applications that need to group similar documents together and separate dissimilar ones, reducing ambiguity when searching for relevant documents. This calls for a robust clustering algorithm that clusters documents efficiently and effectively. In this paper, the performance of two clustering algorithms, K-Means and DBSCAN, each with optimal parameters, is compared on various textual datasets using two similarity measures: cosine and hybrid similarity. The optimal value of epsilon in DBSCAN is found with the DMDBSCAN algorithm, and the optimal value of K in K-Means with the within-cluster sum of squares. Hybrid similarity accounts for both single words and phrases: the phrases shared across the corpus are extracted, the phrase similarity is computed, and the hybrid similarity is then formed by combining the cosine and phrase similarities. To capture the shared phrases, the document representation model, the Document Index Graph, is implemented in the Neo4j graph database. We use the silhouette score to evaluate the performance of the clustering algorithms. Experimental results show that DBSCAN performs better than K-Means under both similarity measures.
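As a rough illustration of the quantities named in the abstract, the following sketch blends a cosine term similarity with a phrase similarity and scores a clustering with the silhouette coefficient. It is not the chapter's implementation: the bigram-overlap `phrase_similarity` is a simple stand-in for a Document Index Graph phrase measure, the weight `alpha` and its default of 0.5 are assumptions, and the four toy documents and their labels are invented for the example.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between raw term-frequency vectors of two token lists."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def phrase_similarity(doc_a, doc_b):
    """Stand-in phrase measure (assumption): Jaccard overlap of word bigrams."""
    grams_a = set(zip(doc_a, doc_a[1:]))
    grams_b = set(zip(doc_b, doc_b[1:]))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def hybrid_similarity(doc_a, doc_b, alpha=0.5):
    """Weighted blend of phrase and cosine similarity; alpha is hypothetical."""
    phrase = phrase_similarity(doc_a, doc_b)
    term = cosine_similarity(doc_a, doc_b)
    return alpha * phrase + (1 - alpha) * term


def silhouette_score(dist, labels):
    """Mean silhouette coefficient from a precomputed distance matrix.

    Expects at least two distinct cluster labels.
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # A singleton cluster gets a silhouette of 0 by convention.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy corpus (invented): two documents per topic, already tokenised.
docs = [
    "graph database stores document phrases".split(),
    "graph database stores shared phrases".split(),
    "silhouette score evaluates clustering quality".split(),
    "silhouette score measures clustering quality".split(),
]

# Convert similarity to distance so the matrix can feed a clustering algorithm.
dist = [[1.0 - hybrid_similarity(a, b) for b in docs] for a in docs]

good_score = silhouette_score(dist, [0, 0, 1, 1])
bad_score = silhouette_score(dist, [0, 1, 0, 1])
```

A distance matrix built this way could be passed to any clustering routine that accepts precomputed distances; comparing `good_score` with `bad_score` shows how the silhouette coefficient rewards the labelling that keeps same-topic documents together.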

Author information

Authors and Affiliations

Institute of Technology, Nirma University, Sarkhej—Gandhinagar Hwy, Gota, Ahmedabad, India
Preeti Kathiria, Vandan Pandya & Usha Patel
Faculty of Computer Technology, GLS University, Ahmedabad, India
Harshal Arolkar

Corresponding author

Correspondence to Usha Patel.

Editor information

Editors and Affiliations

Computer Science and IT Department, Central University of Jammu, Jammu, Jammu and Kashmir, India
Yashwant Singh
Department of Computer Science, KIET Group of Institutions, Ghaziabad, Uttar Pradesh, India
Pradeep Kumar Singh
Department of Electrical Engineering, Indian Institute of Technology Patna, Patna, Bihar, India
Maheshkumar H. Kolekar
School of Artificial Intelligence, Indian Institute of Technology Delhi, New Delhi, Delhi, India
Arpan Kumar Kar
IDMEC, Polytechnic Institute of Castelo Branco, Castelo Branco, Portugal
Paulo J. Sequeira Gonçalves

Cite this paper

Kathiria, P., Pandya, V., Arolkar, H., Patel, U. (2023). Performance Analysis of Document Similarity-Based DBSCAN and K-Means Clustering on Text Datasets. In: Singh, Y., Singh, P.K., Kolekar, M.H., Kar, A.K., Gonçalves, P.J.S. (eds) Proceedings of International Conference on Recent Innovations in Computing. Lecture Notes in Electrical Engineering, vol 1001. Springer, Singapore. https://doi.org/10.1007/978-981-19-9876-8_5

DOI: https://doi.org/10.1007/978-981-19-9876-8_5
Published: 03 May 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-9875-1
Online ISBN: 978-981-19-9876-8
eBook Packages: Intelligent Technologies and Robotics (R0)
