A Comparison of Categorical Attribute Data Clustering Methods

Hautamäki, Ville; Pöllänen, Antti; Kinnunen, Tomi; Lee, Kong Aik; Li, Haizhou; Fränti, Pasi

doi:10.1007/978-3-662-44415-3_6

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8621)

Included in the following conference series:

Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)

Abstract

Clustering data in Euclidean space has a long tradition, and considerable attention has been paid to analyzing several different cost functions. Unfortunately, these results rarely generalize to clustering of categorical attribute data. Instead, a simple heuristic, k-modes, is the most commonly used method despite its modest performance. In this study, we model clusters by their empirical distributions and use expected entropy as the objective function. A novel clustering algorithm is designed based on local search for this objective function and compared against six existing algorithms on well-known data sets. The proposed method provides better clustering quality than the other iterative methods at the cost of higher time complexity.
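
As an illustration of the objective named in the abstract, the following is a minimal sketch of how an expected-entropy cost might be evaluated for a given partition of categorical data. The abstract does not give the exact formula, so the definition used here is an assumption: each cluster's entropy is taken as the sum, over attributes, of the Shannon entropy of that attribute's empirical value distribution within the cluster, and clusters are weighted by their share of the data. The function names `cluster_entropy` and `expected_entropy` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def cluster_entropy(rows):
    """Entropy (in bits) of one cluster, summed over attributes.

    Assumption: attributes are treated independently, so the cluster
    entropy is the sum of the per-attribute empirical entropies.
    """
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate over attributes
        counts = Counter(column)  # empirical distribution of one attribute
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total


def expected_entropy(data, labels):
    """Cluster entropies weighted by relative cluster size (assumed objective)."""
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row)
    n = len(data)
    return sum(len(rows) / n * cluster_entropy(rows) for rows in clusters.values())


# Hypothetical toy data: four objects with two categorical attributes.
data = [
    ("red", "small"),
    ("red", "small"),
    ("blue", "large"),
    ("blue", "large"),
]

pure_partition = [0, 0, 1, 1]   # each cluster holds identical objects
mixed_partition = [0, 1, 0, 1]  # each cluster mixes both kinds of objects

print(expected_entropy(data, pure_partition))
print(expected_entropy(data, mixed_partition))
```

Under this assumed definition, a partition whose clusters contain identical objects has a lower cost than one that mixes them, which is the property a local search over cluster assignments would exploit when deciding whether to accept a move.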

Author information

Authors and Affiliations

School of Computing, University of Eastern Finland, Finland
Ville Hautamäki, Antti Pöllänen, Tomi Kinnunen & Pasi Fränti
Institute for Infocomm Research, A*STAR, Singapore
Kong Aik Lee & Haizhou Li

Editor information

Editors and Affiliations

School of Computing, University of Eastern Finland, 80101, Joensuu, Finland
Pasi Fränti
School of Computer Science, The University of Manchester, Manchester, UK
Gavin Brown
Delft University of Technology, Delft, The Netherlands
Marco Loog
Universidad de Alicante, Spain
Francisco Escolano
Università Ca’ Foscari Venezia, Venezia Mestre, Italy
Marcello Pelillo

About this paper

Cite this paper

Hautamäki, V., Pöllänen, A., Kinnunen, T., Lee, K.A., Li, H., Fränti, P. (2014). A Comparison of Categorical Attribute Data Clustering Methods. In: Fränti, P., Brown, G., Loog, M., Escolano, F., Pelillo, M. (eds) Structural, Syntactic, and Statistical Pattern Recognition. S+SSPR 2014. Lecture Notes in Computer Science, vol 8621. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44415-3_6

DOI: https://doi.org/10.1007/978-3-662-44415-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44414-6
Online ISBN: 978-3-662-44415-3
eBook Packages: Computer Science, Computer Science (R0)
