Abstract
We give a sampling-based algorithm for the k-Median problem, with running time O(k \((\frac{{k^2 }}{ \in } \log k)^2 \) log\((\frac{k}{ \in } \log k)\)), where k is the desired number of clusters and ∈ is a confidence parameter. This is the first k-Median algorithm with fully polynomial running time that is independent of n, the size of the data set. It gives a solution that is, with high probability, an O(1)-approximation, if each cluster in some optimal solution has Ω\((\frac{{n \in }}{k})\) points. We also give weakly-polynomial-time algorithms for this problem and a relaxed version of k-Median in which a small fraction of outliers can be excluded. We give near-matching lower bounds showing that this assumption about cluster size is necessary. We also present a related algorithm for finding a clustering that excludes a small number of outliers.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Alon, N., Dar, S., Parnas, M., & Ron, D. (2000). Testing of clustering. In Proc. FOCS.
Arora, S., Raghavan, P., & Rao, S. (1998). Approximation schemes for euclidean k-medians and related problems. In Proc. STOC.
Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., & Pandit, V. (2001). Local search heuristics for k-median and facility location problems. In Proc. STOC.
Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. In Proc. FOCS.
Bradley, P. S., Fayyad, U. M., & Reina, C. (1998). Scaling clustering algorithms to large databases. In Proc. KDD.
Charikar, M., Chekuri, C., Feder, T., & Motwani, R. (1997). Incremental clustering and dynamic information retrieval. In Proc. STOC.
Charikar, M., & Guha, S. (1999). Improved combinatorial algorithms for the facility location and k-median problems. In Proc. FOCS.
Charikar, M., Guha, S., Tardos, É., & Shmoys, D. (2002). A constant-factor approximation algorithm for the k-median problem. Journal of Computer and System Sciences, 65:1, 129–149.
Charikar, M., Khuller, S., Mount, D., & Narasimhan, G. (2001). Algorithms for facility location problems with outliers. In Proc. SODA.
Charikar, M., & Panigrahy, R. (2001). Clustering to minimize the sum of cluster diameters. In Proc. 33rd STOC.
Dasgupta, S. (1999). Learning mixtures of gaussians. In Proc. FOCS.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1984). Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. Ser. B, 39.
Duda, R. O., & Hart, P. E. (1972). Pattern classification and scene analysis. Wiley.
Feder, T., & Greene, D. (1988). Optimal algorithms for approximate clustering. In Proc. STOC.
Gonzalez, T. F. (1985). Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38, 2–3.
Guha, S. (2000). Approximation algorithms for facility location problems. Ph.D. Thesis, Stanford University.
Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams. In Proc. FOCS.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proc. SIGMOD.
Hartigan, J. A. (1975). Clustering algorithms. John Wiley and Sons.
Hochbaum, D. S., & Shmoys, D. B. (1985). A best possible approximation algorithm for the k-center problem. Mathematics of Operations Research, 10, 180–184.
Indyk, P. (1999). Sublinear time algorithms for metric space problems. In Proc. STOC.
Jain, K., Mahdian, M., Markakis, E., Saberi, A., & Vazirani, V. (to appear). Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. Journal of the ACM.
Jain, K., & Vazirani, V. (1999). Primal-dual approximation algorithms for metric facility location and k-median problems. In Proc. FOCS.
Joze-Hkajavi, N., & Salem, K. (1998). Two-phase clustering of large datasets. Tech. rpt. CS-98-27, Univ. of Waterloo.
Lin, J.-H., & Vitter, J. S. (1992). ε-Approximations with minimum packing constraint violation. In Proc. STOC.
Marroquin, J., & Girosi, F. (1993). Some extensions of the k-means algorithm for image segmentation and pattern recognition. AI Memo 1390, MIT AI Lab..
Mettu, R., & Plaxton, C. G. (2002). Optimal time bounds for approximate clustering. In Proc. 18th UAI.
Meyerson, A. (2001). Online facility location. In Proc. FOCS.
Mishra, N., Oblinger, D., & Pitt, L. (2001). Sublinear time approximate clustering. In Proc. SODA.
Palmer, C. R., & Faloutsos, C. (2000). Density biased sampling: An improved method for data mining and clustering. In Proc. SIGMOD.
Pelleg, D., & Moore, A. W. (1999). Accelerating exact k-means algorithms with geometric reasoning. In Proc. KDD.
Pitt, L., & Reinke, R. E. (1988). Criteria for polynomial-time (conceptual) clustering. Machine Learning, 2, 4.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 2.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. Tech rept. No. 00-034, Univ. of Minn.
Thorup, M. (2001). Quick k-median, k-center, and facility location for sparse graphs. In Proc. ICALP.
Young, N. E. (2000). k-medians, facility location, and the Chernoff-Wald bound. In Proc. SODA (pp. 86–95).
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proc. SIGMOD (pp. 103–114).
Rights and permissions
About this article
Cite this article
Meyerson, A., O'Callaghan, L. & Plotkin, S. A k-Median Algorithm with Running Time Independent of Data Size. Machine Learning 56, 61–87 (2004). https://doi.org/10.1023/B:MACH.0000033115.78247.f0
Issue Date:
DOI: https://doi.org/10.1023/B:MACH.0000033115.78247.f0