Optimal Time Bounds for Approximate Clustering

Mettu, Ramgopal R.; Plaxton, C. Greg

doi:10.1023/B:MACH.0000033114.18632.e0


Published: July 2004

Machine Learning, volume 56, pages 35–60 (2004)


Abstract

Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just \(O(k \log \frac{n}{k})\)) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say \(\frac{1}{100}\)) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
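To make the objects in the abstract concrete, the following Python sketch shows the k-median objective (average distance to the nearest center) and a simplified sampling loop in the spirit of successive sampling. The abstract does not spell out the procedure, so everything here is an assumption for illustration: the function names `kmedian_cost` and `successive_sample`, the constants `alpha` and `beta`, and the rule of discarding the `beta` fraction of remaining points closest to each sample. It should not be read as the authors' exact algorithm or analysis.

```python
import math
import random


def kmedian_cost(points, centers):
    """k-median objective: average distance from each point to its nearest center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


def successive_sample(points, k, alpha=2.0, beta=0.5, seed=0):
    """Illustrative sketch only (alpha, beta and the discard rule are assumptions).

    Repeatedly draw about alpha*k points from the remaining set, add them to the
    summary, and discard the beta fraction of remaining points nearest to the draw.
    """
    rng = random.Random(seed)
    remaining = list(points)
    summary = []
    batch = max(1, int(alpha * k))
    while len(remaining) > batch:
        n = len(remaining)
        sample_idx = rng.sample(range(n), batch)
        sample = [remaining[i] for i in sample_idx]
        summary.extend(sample)
        # Distance from every remaining point to its nearest sampled point.
        by_dist = sorted(
            range(n),
            key=lambda i: min(math.dist(remaining[i], s) for s in sample),
        )
        # Drop the sampled points plus the closest beta fraction of the rest.
        drop = set(sample_idx) | set(by_dist[: math.ceil(beta * n)])
        remaining = [p for i, p in enumerate(remaining) if i not in drop]
    # Whatever is left is small enough to keep in full.
    summary.extend(remaining)
    return summary


if __name__ == "__main__":
    rng = random.Random(1)
    data = [(rng.random(), rng.random()) for _ in range(1000)]
    summary = successive_sample(data, k=5)
    print(len(summary), kmedian_cost(data, summary))
```

Under these assumptions each round costs time proportional to the number of remaining points times the batch size, and the remaining set shrinks by a constant factor per round, so the loop runs for roughly log(n/k) rounds and collects on the order of k points per round. That is consistent with the summary size and the O(nk) running time stated above, though the sketch carries none of the paper's approximation guarantees.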




Cite this article

Mettu, R.R., Plaxton, C.G. Optimal Time Bounds for Approximate Clustering. Machine Learning 56, 35–60 (2004). https://doi.org/10.1023/B:MACH.0000033114.18632.e0
