Abstract
Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Aggarwal CC, Gates SC, Yu PS (1999) On the merits of building categorization systems by supervised clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 352–356
Anderberg MR (1973) Cluster analysis for applications. Academic Press Inc, New York, NY
Boley D, Gini M, Gross R, Han EH, Hastings K, Karypis G et al (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3): 329–341. doi:10.1016/S0167-9236(99)00055-X
Cios K, Pedrycs W, Swiniarski R (1998) Data mining methods for knowledge discovery. Kluwer Academic Publishers
Coello CAC (2000) Constraint-handling using an evolutionary multi objective optimization technique. Civ Eng Environ Syst 17: 319–346. doi:10.1080/02630250008970288
Cui X, Potok TE, Palathingal P (2005) Document clustering using particle swarm optimization. In: IEEE swarm intelligence symposium, pp 185–191
Cutting DR, Pedersen JO, Karger DR, Tukey JW (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR. Copenhagen, pp 318–329
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Knowledge discovery and data mining, pp 269–274
Everitt B (1980) Cluster analysis, 2nd edn. Halsted Press, New York
Geem ZW, Kim JH, Loganathan GV (2002) Harmony search optimization: application to pipe network design. Int J Model Simul 22: 125–133
Geem ZW, Tseng C, Park Y (2005) Harmony search for generalized orienteering problem: best touring in China, Springer. Lect Notes Comput Sci 3412: 741–750
Grira N, Crucianu M, Boujemaa N (2005) Unsupervised and semi-supervised clustering: a brief survey. In: 7th ACM SIGMM international workshop on multimedia information retrieval, pp 9–16
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 264–323. CSUR. doi:10.1145/331499.331504
Jones G, Robertson AM, Santimetvirul C, Willett P (1995) Non-hierarchic document clustering using a genetic algorithm. Inform Res 1(1)
Kennedy J, Eberhart RC, Shi Y (2001) Swarm intelligence. Morgan Kaufmann, New York
Klein RW, Dubes RC (1989) Experiments in projection and clustering by simulated annealing. Pattern Recognit 22: 213–220. doi:10.1016/0031-3203(89)90067-8
Labroche N, Monmarche’ N, Venturini G (2003) AntClust: ant clustering and web usage mining. In: Genetic and evolutionary computation conference, pp 25–36
Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 16–22
Lee KS, Geem ZW (2004) A new meta-heuristic algorithm for continues engineering optimization: harmony search theory and practice. Comput Method Appl Mech Eng 194: 3902–3933. doi:10.1016/j.cma.2004.09.007
Mahdavi M, Fesanghary M, Damangir E (2007) An improved harmony search algorithm for solving optimization problems. Appl Math Comput 188: 1567–1579
McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth berkeley symposium on mathematical statistics and probability, pp 281–297
Merwe VD, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: Proceedings of IEEE congress on evolutionary computation (CEC 2003), Australia, pp 215–220
Omran M, Salman A, Engelbrecht AP (2002) Image classification using particle swarm optimization. In: Proceedings of the 4th Asia-Pacific conference on simulated evolution and learning (SEAL 2002), Singapore, pp 370–374
Quinlan RJ (1993) C4.5: programs for machine learning. Morgan Kaufmann
Raghavan VV, Birchand K (1979) A clustering strategy based on a formalism of the reproductive process in a natural system. In: Proceedings of the second international conference on information storage and retrieval, pp 10–22
Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1): 96–101. doi:10.1109/72.265964
Salton G (1989) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24: 513–523. doi:10.1016/0306-4573(88)90021-0
Selim SZ, Ismail MA (1984) K-means type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6: 81–87
Steinbach M, Karypis G, Kumar V (2000a) A comparison of document clustering techniques. KDD’2000. Technical report of University of Minnesota
Steinbach M, Karypis G, Kumar V (2000b) A comparison of document clustering techniques. In: KDD workshop on text mining
Stumme G, Hotho A, Berendt B (2001) Semantic Web Mining. Freiburg, September 3rd, 12th Europ. Conf. on Machine Learning (ECML’01)/5th Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’01)
Stumme G, Hotho A, Berendt B (2006) Semantic Web Mining State of the art and future directions. J Web Semantics: Sci Serv Agents World Wide Web 124–143
TREC (1999) Text REtrieval conference. http://trec.nist.gov
TRECQ (1999) Text REtrieval conference relevance judgments. http://trec.nist.gov/data/qrels_eng/index.html
Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected riterion functions for document clustering. Mach Learn 55(3): 311–331. doi:10.1023/B:MACH.0000027785.44527.d6
Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10: 141–168. doi:10.1007/s10618-005-0361-3
Zhong S (2006) Semi-supervised model-based document clustering: a comparative study. Mach Learn 65(1). doi:10.1007/s10994-006-6540-7
Zhong S, Ghosh J (2003) A unified framework for model-based lustering. J Mach Learn Res 4: 1001–1037. JMLR doi:10.1162/jmlr.2003.4.6.1001
Zhong S, Ghosh J (2005) Generative model-based clustering of documents: a comparative study. Knowl Inf Syst 8: 374–384. KAIS doi:10.1007/s10115-004-0194-1
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Charu Aggarwal.
Rights and permissions
About this article
Cite this article
Mahdavi, M., Abolhassani, H. Harmony K-means algorithm for document clustering. Data Min Knowl Disc 18, 370–391 (2009). https://doi.org/10.1007/s10618-008-0123-0
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-008-0123-0