Abstract
Clustering data in Euclidean space has a long tradition, and considerable attention has been paid to analyzing several different cost functions. Unfortunately, these results rarely generalize to the clustering of categorical attribute data. Instead, a simple heuristic, k-modes, is the most commonly used method despite its modest performance. In this study, we model clusters by their empirical distributions and use expected entropy as the objective function. A novel clustering algorithm based on local search is designed for this objective function and compared against six existing algorithms on well-known data sets. The proposed method provides better clustering quality than the other iterative methods, at the cost of higher time complexity.
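The abstract does not give the objective in closed form, so the following is only a minimal sketch of one common reading of "expected entropy": each cluster is summarized by the empirical distribution of every attribute, the cluster's entropy is the sum of its per-attribute Shannon entropies (which assumes attributes are treated independently), and the objective is the average of those entropies weighted by cluster size. The function names `cluster_entropy` and `expected_entropy` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def cluster_entropy(rows):
    """Sum of per-attribute Shannon entropies (in bits) for one cluster.

    Assumption: attributes are treated independently, so each column gets
    its own empirical distribution built from value counts.
    """
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate over attributes
        counts = Counter(column)
        # Each term (c/n) * log2(c/n) is <= 0, so subtracting adds entropy.
        total -= sum((c / n) * log2(c / n) for c in counts.values())
    return total


def expected_entropy(data, labels):
    """Size-weighted average of cluster entropies for a given partition."""
    clusters = {}
    for row, label in zip(data, labels):
        clusters.setdefault(label, []).append(row)
    n = len(data)
    return sum(len(rows) / n * cluster_entropy(rows)
               for rows in clusters.values())


# Toy data: two attributes, two obvious groups.
data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
pure = expected_entropy(data, [0, 0, 1, 1])   # each cluster is homogeneous
mixed = expected_entropy(data, [0, 1, 0, 1])  # each cluster mixes both groups
print(pure, mixed)
```

Under these assumptions, a lower value indicates more homogeneous clusters. A local search over partitions would repeatedly propose a change, such as moving one object to another cluster, and keep it only if this value decreases.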
Keywords
- Categorical Cluster
- Cluster Representative
- Single Proton Emission Compute Tomography
- Good Cluster Quality
- Small Model Size
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hautamäki, V., Pöllänen, A., Kinnunen, T., Lee, K.A., Li, H., Fränti, P. (2014). A Comparison of Categorical Attribute Data Clustering Methods. In: Fränti, P., Brown, G., Loog, M., Escolano, F., Pelillo, M. (eds) Structural, Syntactic, and Statistical Pattern Recognition. S+SSPR 2014. Lecture Notes in Computer Science, vol 8621. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44415-3_6
DOI: https://doi.org/10.1007/978-3-662-44415-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44414-6
Online ISBN: 978-3-662-44415-3
eBook Packages: Computer Science, Computer Science (R0)