Performance Analysis of Document Similarity-Based DBSCAN and K-Means Clustering on Text Datasets

  • Conference paper
Proceedings of International Conference on Recent Innovations in Computing

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 1001)

Abstract

Clustering documents by similarity is useful for applications that need to group similar documents and separate dissimilar ones, which reduces ambiguity when searching for relevant documents. This calls for a clustering algorithm that groups documents both efficiently and effectively. In this paper, the performance of two clustering algorithms, K-Means and DBSCAN, each with optimal parameters, is compared on various textual datasets using two similarity measures: cosine and hybrid similarity. The optimal value of epsilon in DBSCAN is found with the DMDBSCAN algorithm, and the optimal value of K in K-Means with the within-cluster sum of squares method. Hybrid similarity accounts for both single words and phrases, so the phrases shared across the corpus are extracted and the phrase similarity is computed. Hybrid similarity is then formed by combining cosine similarity and phrase similarity. To capture the shared phrases, the Document Index Graph document representation model is implemented in the Neo4j graph database. We use the silhouette score to evaluate the performance of the clustering algorithms. Experimental results show that DBSCAN performs better than K-Means on both similarity measures.
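As a rough illustration of how a hybrid similarity can combine single-word and phrase evidence, the sketch below blends cosine similarity over term-frequency vectors with a phrase overlap score. It is not the paper's implementation. The abstract does not give the combination formula, so the weighted blend and the `alpha` parameter are assumptions, and a Jaccard overlap of word bigrams stands in for the shared phrases that the paper extracts with a Document Index Graph in Neo4j. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the raw term-frequency vectors of two documents."""
    tf_a, tf_b = Counter(tokenize(doc_a)), Counter(tokenize(doc_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def phrase_similarity(doc_a, doc_b, n=2):
    """Jaccard overlap of word n-grams.

    Assumption: this is a simple stand-in for shared-phrase matching,
    not the Document Index Graph procedure the paper describes.
    """
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    grams_a, grams_b = ngrams(tokenize(doc_a)), ngrams(tokenize(doc_b))
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


def hybrid_similarity(doc_a, doc_b, alpha=0.5):
    """Weighted blend of phrase and cosine similarity.

    Assumption: alpha (the weight on the phrase score) is illustrative;
    the abstract does not state how the two scores are combined.
    """
    phrase = phrase_similarity(doc_a, doc_b)
    cosine = cosine_similarity(doc_a, doc_b)
    return alpha * phrase + (1 - alpha) * cosine
```

Because both component scores lie in [0, 1], the blend does too, so `1 - hybrid_similarity(a, b)` can serve as a distance when building a precomputed distance matrix for a clustering algorithm.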

Author information

Corresponding author

Correspondence to Usha Patel.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kathiria, P., Pandya, V., Arolkar, H., Patel, U. (2023). Performance Analysis of Document Similarity-Based DBSCAN and K-Means Clustering on Text Datasets. In: Singh, Y., Singh, P.K., Kolekar, M.H., Kar, A.K., Gonçalves, P.J.S. (eds) Proceedings of International Conference on Recent Innovations in Computing. Lecture Notes in Electrical Engineering, vol 1001. Springer, Singapore. https://doi.org/10.1007/978-981-19-9876-8_5
