Abstract
Rapid technological advances mean that the amount of data stored in databases is growing very quickly, and data mining can uncover useful implicit information in these large databases. A central concern in data mining is how to detect such information with low time cost, high correctness and a high noise-filtering rate, in a way that scales to large databases, which is why many clustering schemes have been proposed in recent decades. This investigation presents a new data clustering approach called PHD, an enhanced version of KIDBSCAN. PHD is a hybrid density-based algorithm that first partitions the data set with K-means and then clusters the resulting partitions with IDBSCAN. Finally, the closest pairs of clusters are merged until the natural number of clusters in the data set is reached. Experimental results show that the proposed algorithm can perform the entire clustering while reducing run-time cost. They also indicate that it performs better than several well-known existing schemes such as the K-means, DBSCAN, IDBSCAN and KIDBSCAN algorithms. Consequently, the proposed PHD algorithm is efficient and effective for data clustering in large databases.
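The three stages named in the abstract (K-means partitioning, density-based clustering inside each partition, and merging of the closest cluster pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no parameter or merge details, so textbook DBSCAN stands in for IDBSCAN, centroid distance is assumed as the "closest pair" criterion, K-means uses evenly spaced initial centres for reproducibility, and all function and parameter names (`phd_sketch`, `eps`, `min_pts`, `target`) are hypothetical.

```python
import math

def centroid(group):
    """Mean point of a non-empty list of coordinate tuples."""
    return tuple(sum(axis) / len(group) for axis in zip(*group))

def kmeans(points, k, iters=20):
    """Lloyd's K-means with evenly spaced initial centres (an assumption)."""
    step = len(points) // k
    centers = [points[i * step] for i in range(k)]
    parts = [[] for _ in range(k)]
    for _ in range(iters):
        parts = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            parts[nearest].append(p)
        # Keep the previous centre when a partition ends up empty.
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(parts)]
    return parts

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN (stand-in for IDBSCAN); noise points are dropped."""
    labels = {}  # point index -> cluster id, or -1 for noise

    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cid = 0
    for i in range(len(points)):
        if i in labels:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        labels[i] = cid
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels.get(j, -1) != -1:
                continue  # already assigned to a cluster
            labels[j] = cid
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # only core points expand
                queue.extend(reachable)
        cid += 1
    clusters = [[] for _ in range(cid)]
    for i, lab in labels.items():
        if lab >= 0:
            clusters[lab].append(points[i])
    return clusters

def merge_closest(clusters, target):
    """Merge the pair with the nearest centroids until `target` clusters remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > target:
        pairs = [(math.dist(centroid(clusters[a]), centroid(clusters[b])), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters

def phd_sketch(points, k, eps, min_pts, target):
    """Partition with K-means, cluster each partition, then merge."""
    local = []
    for part in kmeans(points, k):
        local.extend(dbscan(part, eps, min_pts))
    return merge_closest(local, target)

# Toy data: two 4x4 grids far apart plus one isolated noise point.
blob_a = [(x, y) for x in range(4) for y in range(4)]
blob_b = [(x + 20, y + 20) for x in range(4) for y in range(4)]
noise = (30, 10)
points = blob_a + blob_b + [noise]

clusters = phd_sketch(points, k=4, eps=1.5, min_pts=3, target=2)
print([len(c) for c in clusters])
```

In this toy run K-means deliberately over-partitions the data (`k=4` for two natural groups), so the merge stage has to reunite the fragments, while the isolated point is filtered out as noise by the density step. The value of `target` is supplied by hand here; how PHD determines the natural number of clusters is not stated in the abstract.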
Cite this article
Tsai, CF., Yeh, HF., Chang, JF. et al. PHD: an efficient data clustering scheme using partition space technique for knowledge discovery in large databases. Appl Intell 33, 39–53 (2010). https://doi.org/10.1007/s10489-010-0239-y
DOI: https://doi.org/10.1007/s10489-010-0239-y