Abstract
We present an optimization approach that generates k-means-like clustering algorithms. The batch k-means and the incremental k-means are two well-known versions of the classical k-means clustering algorithm (Duda et al. 2000). To benefit from the speed of the batch version and the accuracy of the incremental version we combine the two in a “ping–pong” fashion. We use a distance-like function that combines the squared Euclidean distance with relative entropy. In the extreme cases our algorithm recovers the classical k-means clustering algorithm and generalizes the Divisive Information Theoretic clustering algorithm recently reported independently by Berkhin and Becher (2002) and Dhillon et al. (2002). Results of numerical experiments that demonstrate the viability of our approach are reported.
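The combined measure and the alternating batch/incremental scheme described above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the names `dist`, `ping_pong_kmeans`, and the weights `nu`/`mu` are assumptions, the arithmetic-mean centroid update is exact only for the squared-Euclidean term (`mu = 0`), and the relative-entropy term assumes nonnegative data.

```python
import numpy as np

def dist(x, a, nu=1.0, mu=0.0, eps=1e-12):
    """nu * squared Euclidean distance + mu * generalized relative entropy.

    nu=1, mu=0 recovers the classical k-means measure; nu=0, mu=1 an
    information-theoretic one. The entropy term assumes nonnegative data.
    """
    se = float(np.sum((x - a) ** 2))
    if mu == 0.0:
        return nu * se
    kl = float(np.sum(x * np.log((x + eps) / (a + eps)) - x + a))
    return nu * se + mu * kl

def ping_pong_kmeans(X, k, nu=1.0, mu=0.0, max_rounds=20, seed=0):
    """Alternate a batch pass ("ping") with an incremental pass ("pong")."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.array([min(range(k), key=lambda j: dist(x, centers[j], nu, mu))
                       for x in X])
    for _ in range(max_rounds):
        changed = False
        # Batch pass: recompute every centroid, then reassign all points at once.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)  # exact only for the nu-term
        new = np.array([min(range(k), key=lambda j: dist(x, centers[j], nu, mu))
                        for x in X])
        if not np.array_equal(new, labels):
            labels, changed = new, True
        # Incremental pass: move points one at a time, refreshing the two
        # affected centroids immediately so later moves see the update.
        for i in range(len(X)):
            best = min(range(k), key=lambda j: dist(X[i], centers[j], nu, mu))
            if best != labels[i] and np.sum(labels == labels[i]) > 1:
                old, labels[i] = labels[i], best
                for j in (old, best):
                    centers[j] = X[labels == j].mean(axis=0)
                changed = True
        if not changed:
            break
    return labels, centers
```

The batch pass converges quickly to a local stationary partition; the incremental pass can escape it by evaluating single-point moves, after which another batch pass is cheap again, which is the "ping–pong" idea.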
References
Auslender A, Teboulle M and Ben–Tiba S (1999) Interior proximal and multiplier methods based on second order homogeneous kernels. Mathematics of Operations Research, 24:645–668.
Berkhin P and Becher JD (2002) Learning simple relations: Theory and applications. In: Proceedings of the Second SIAM International Conference on Data Mining.
Berry M and Browne M (1999) Understanding Search Engines. SIAM.
Bertsekas DP and Tsitsiklis JN (1989) Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, New Jersey.
Dempster A, Laird N and Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.
Dhillon IS, Guan Y and Kogan J (2002) Refining clusters in high-dimensional text data. In: Proceedings of the Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with the Second SIAM International Conference on Data Mining).
Dhillon, IS, Kogan J and Nicholas C (2003) Feature selection and document clustering, In Berry MW Ed. A Comprehensive Survey of Text Mining, pp. 73–100.
Dhillon IS and Modha DS (2001) Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1):143–175.
Duda RO, Hart PE and Stork DG (2000) Pattern Classification. John Wiley & Sons.
Forgy E (1965) Cluster analysis of multivariate data: Efficiency vs. interpretability of classifications. Biometrics, 21(3):768.
Dhillon IS, Mallela S and Kumar R (2002) Enhanced word clustering for hierarchical text classification. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), pp. 191–200.
Kogan J (2001) Clustering large unstructured document sets. In: Berry MW Ed. Computational Information Retrieval, pp. 107–117.
Kogan J (2001) Means clustering for text data. In: Proceedings of the Workshop on Text Mining at the First SIAM International Conference on Data Mining, pp. 47–54.
Kogan J, Teboulle M and Nicholas C (2003) Optimization approach to generating families of k-means like algorithms. In: Proceedings of the Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with the Third SIAM International Conference on Data Mining).
Kogan J, Teboulle M and Nicholas C (2003) The entropic geometric means algorithm: An approach for building small clusters for large text datasets. In: Proceedings of the Workshop on Clustering Large Data Sets (held in conjunction with the Third IEEE International Conference on Data Mining), pp. 63–71.
Xu L and Jordan MI (1995) On convergence properties of the EM algorithm for Gaussian mixtures. MIT A.I. Memo No. 1520, C.B.C.L. Paper No. 111.
Rockafellar RT (1970) Convex Analysis. Princeton University Press, Princeton, NJ.
Teboulle M (1992) On ϕ-divergence and its applications. In: Phillips FY and Rousseau J Eds. Systems and Management Science by Extremal Methods–Research Honoring Abraham Charnes at Age 70, Kluwer Academic Publishers, pp. 255–273.
Teboulle M (1997) Convergence of proximal-like algorithms. SIAM Journal on Optimization, 7:1069–1083.
Additional information
This research was supported in part by the US Department of Defense, the United States–Israel Binational Science Foundation (BSF), and Northrop Grumman Mission Systems (NG/MS).
Cite this article
Kogan, J., Teboulle, M. & Nicholas, C. Data Driven Similarity Measures for k-Means Like Clustering Algorithms. Inf Retrieval 8, 331–349 (2005). https://doi.org/10.1007/s10791-005-5666-8