A Unified Metric for Categorical and Numerical Attributes in Data Clustering

Cheung, Yiu-ming; Jia, Hong

doi:10.1007/978-3-642-37456-2_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7819)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Most existing clustering approaches apply to purely numerical or purely categorical data, but not both. Clustering mixed data composed of numerical and categorical attributes is nontrivial because of the awkward gap between the similarity metrics for categorical and numerical data. This paper therefore presents a general clustering framework based on the concept of object-cluster similarity and gives a unified similarity metric that can be applied directly to data with categorical, numerical, or mixed attributes. Accordingly, an iterative clustering algorithm is developed, and its efficacy is demonstrated experimentally on different benchmark data sets.
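As a rough illustration of the idea, the sketch below shows one way an object-cluster similarity could drive an iterative clustering loop on mixed data. The abstract does not state the formulas, so every detail here is an assumption made for illustration and should not be read as the authors' definition: the categorical similarity is taken as the fraction of cluster members sharing the object's value, the numerical similarity as a normalized exponential of the squared distance to the cluster mean, and the numerical vector is weighted as a single extra attribute. The names `object_cluster_similarities` and `cluster_mixed` and the round-robin initialization are hypothetical.

```python
import math


def object_cluster_similarities(cat_row, num_row, clusters):
    """Return the similarity of one object to every cluster (assumed form).

    clusters: list of dicts with keys "cat" and "num", each holding the
    categorical and numeric tuples of the cluster's current members.
    """
    # Numerical part: squared distance to each cluster mean.
    dists = []
    for c in clusters:
        if c["num"]:
            mean = [sum(col) / len(c["num"]) for col in zip(*c["num"])]
            dists.append(sum((a - b) ** 2 for a, b in zip(num_row, mean)))
        else:
            dists.append(None)  # empty cluster: no mean available
    # Subtract the smallest distance before exponentiating to avoid underflow.
    d_min = min(d for d in dists if d is not None)
    weights = [0.0 if d is None else math.exp(-0.5 * (d - d_min)) for d in dists]
    total = sum(weights)

    sims = []
    for c, w in zip(clusters, weights):
        size = len(c["cat"])
        cat_part = 0.0
        if size:
            # Categorical part: share of members with the same value, per attribute.
            for r, value in enumerate(cat_row):
                matches = sum(1 for member in c["cat"] if member[r] == value)
                cat_part += matches / size
        # Assumption: the numeric vector counts as one additional attribute.
        sims.append((cat_part + w / total) / (len(cat_row) + 1))
    return sims


def cluster_mixed(cat_rows, num_rows, k, max_iter=20):
    """Assign each object to its most similar cluster until labels stop changing."""
    n = len(cat_rows)
    labels = [i % k for i in range(n)]  # deterministic round-robin start (assumption)
    for _ in range(max_iter):
        clusters = [{"cat": [], "num": []} for _ in range(k)]
        for i, j in enumerate(labels):
            clusters[j]["cat"].append(cat_rows[i])
            clusters[j]["num"].append(num_rows[i])
        new_labels = []
        for cat_row, num_row in zip(cat_rows, num_rows):
            sims = object_cluster_similarities(cat_row, num_row, clusters)
            new_labels.append(max(range(k), key=sims.__getitem__))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy data (made up for this example): two categorical attributes, one numeric.
cat_rows = [("red", "s"), ("red", "s"), ("red", "m"),
            ("blue", "l"), ("blue", "l"), ("blue", "m")]
num_rows = [(0.1,), (0.2,), (0.0,), (5.0,), (5.1,), (4.9,)]
labels = cluster_mixed(cat_rows, num_rows, k=2)
print(labels)
```

On this toy data the first three objects end up in one cluster and the last three in the other. Because both parts of the similarity lie in [0, 1], neither attribute type dominates the comparison by scale alone, which is the kind of gap between categorical and numerical metrics that the abstract refers to.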

Author information

Authors and Affiliations

Department of Computer Science and Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Hong Kong SAR, China
Yiu-ming Cheung & Hong Jia
United International College, Beijing Normal University - Hong Kong Baptist University, Zhuhai, China
Yiu-ming Cheung

Editor information

Editors and Affiliations

School of Computing Science, Simon Fraser University, 8888 University Drive, V5A 1S6, Burnaby, BC, Canada
Jian Pei
Dept. of Computer Science and Information Engineering, Institute of Medical Informatics, National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Faculty of Engineering and Information Technology, University of Technology Sydney, Broadway, P.O. Box 123, 2007, Sydney, NSW, Australia
Longbing Cao & Guandong Xu
Asian Office of Aerospace Research and Development (AOARD), Air Force Office of Scientific Research (AFOSR), Air Force Research Laboratory USA, Osaka University, 7-23-17 Roppongi, 106-0032, Minato-ku, Tokyo, Japan
Hiroshi Motoda

About this paper

Cite this paper

Cheung, Ym., Jia, H. (2013). A Unified Metric for Categorical and Numerical Attributes in Data Clustering. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes in Computer Science, vol 7819. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37456-2_12

DOI: https://doi.org/10.1007/978-3-642-37456-2_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37455-5
Online ISBN: 978-3-642-37456-2
eBook Packages: Computer Science, Computer Science (R0)
