Abstract
Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique that we call successive sampling that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just \(O(k \log \frac{n}{k})\)) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say \(\frac{1}{100}\)) probability. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
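To make the objective concrete, the following is a minimal sketch of how the k-median cost of a candidate set of centers might be evaluated. It is not the successive sampling algorithm of the paper. It assumes unweighted points in Euclidean space (the abstract itself does not fix a particular metric), and the function name `kmedian_cost` and the sample data are hypothetical, chosen only for illustration.

```python
import math


def kmedian_cost(points, centers):
    """Average distance from each point to its nearest center.

    Assumption: points and centers are coordinate tuples and distance
    is Euclidean. Evaluating the cost takes n * k distance computations.
    """
    if not points or not centers:
        raise ValueError("points and centers must be non-empty")
    total = sum(min(math.dist(p, c) for c in centers) for p in points)
    return total / len(points)


# Hypothetical data: two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One center per group versus both centers placed in the left group.
spread_centers = [(0.0, 1.0), (10.0, 1.0)]
crowded_centers = [(0.0, 1.0), (0.0, 2.0)]

spread_cost = kmedian_cost(points, spread_centers)
crowded_cost = kmedian_cost(points, crowded_centers)
print(spread_cost, crowded_cost)
```

With one center per group, every point lies at distance 1 from its nearest center, so the cost is lower than when both centers sit in the same group. Merely evaluating a candidate solution this way already takes time proportional to nk.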
Cite this article
Mettu, R.R., Plaxton, C.G. Optimal Time Bounds for Approximate Clustering. Machine Learning 56, 35–60 (2004). https://doi.org/10.1023/B:MACH.0000033114.18632.e0
DOI: https://doi.org/10.1023/B:MACH.0000033114.18632.e0