Abstract
We consider two related discrete optimization problems of searching for a subset in a finite set of points in Euclidean space. Both problems are induced by versions of a fundamental problem in data analysis, namely, that of selecting a subset of similar elements in a set of objects. In each problem, given an input set and a positive real number, it is required to find a cluster (i.e., a subset) of the largest size under constraints on a quadratic clusterization function. The points in the input set, which are outside the sought-for subset, are treated as a second (complementary) cluster. In the first problem, the function under the constraint is the sum over both clusters of the intracluster sums of the squared distances between the elements of the clusters and their centers. The center of the first (i.e., the sought-for) cluster is unknown and determined as a centroid, while the center of the second one is fixed at a given point in Euclidean space (without loss of generality, at the origin of coordinates). In the second problem, the function under the constraint is the sum over both clusters of the weighted intracluster sums of the squared distances between the elements of the clusters and their centers. As in the first problem, the center of the first cluster is unknown and determined as a centroid, while the center of the second one is fixed at the origin of coordinates. In this paper, we show that both problems are strongly NP-hard. Also, we present exact algorithms for the problems in which the input points have integer components. If the space dimension is bounded by some constant, the algorithms are pseudopolynomial.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
MacQueen, J.B., Some Methods for Classification and Analysis of Multivariate Observations, Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: Univ. of California Press, 1967, vol. 1, pp. 281–297.
Rao, M., atCluster Analysis and Mathematical Programming, J.Am. Stat. Assoc, 1971, vol. 66, pp. 622–626.
Hansen, P., Jaumard, B., and Mladenovich, N., Minimum Sum of Squares Clustering in a Low Dimensional Space, J. Classific, 1998, vol. 15, pp. 37–55.
Hansen, P. and Jaumard, B., Cluster Analysis and Mathematical Programming, Math. Programming, 1997, vol. 79, pp. 191–215.
Fisher, R.A., Statistical Methods and Scientific Inference, New York: Hafner, 1956.
Jain, A.K., Data Clustering: 50 Years beyond A;-Means, Patt. Recog. Lett., 2010, vol. 31, iss. 8, pp. 651–666.
Aloise, D., Deshpande, A., Hansen, P., and Popat, P., NP-Hardness of Euclidean Sum-of-Squares Clustering, Mach. Learning, 2009, vol. 75, iss. 2, pp. 245–248.
Drineas, P., Frieze, A., Kannan, R., Vempala, S., and Vinay, V. Clustering Large Graphs via the Singular Value Decomposition, Mach. Learning, 2004, vol. 56, pp. 9–33.
Dolgushev, A.V. and Kel'manov, A.V., On the Algorithmic Complexity of a Problem in Cluster Analysis, J. Appl. Ind. Math., 2011, vol. 2011, no. 5, pp. 2–191.
Mahajan, M., Nimbhorkar, P., and Varadarajan, K., The Planar k;-Means Problem Is NP-Hard, Theor. Comp. Sci., 2012, vol. 442, pp. 3–21.
Brucker, P., On the Complexity of Clustering Problems, Led. Not. Econ. Math. Systems, 1978, vol. 157, pp. 45–54.
Bern, M. and Eppstein, D., Approximation Algorithms for Geometric Problems, in Approximation Algorithms for NP-Hard Problems, Boston: PWS, 1997, pp. 296–345.
Indyk, P., A Sublinear Time Approximation Scheme for Clustering in Metric Space, Proc. of the 40th Ann. IEEE Symp. on Foundations of Computer Science (FOCS), 1999, pp. 154–159.
De la Vega, F. and Kenyon, C., A Randomized Approximation Scheme for Metric Max-Cut, I. Comput. Syst. Set., 2001, vol. 63, pp. 531–541.
De la Vega, E. Karpinski, M., Kenyon, C., and Rabani, Y., Polynomial Time Approximation Schemes for Metric Min-Sum Clustering, Electronic Colloquium on Computational Complexity (ECCC), 2002, rep. no. 25.
Hasegawa, S., Imai, E., Inaba, M., Katoh, N., and Nakano, J., Efficient Algorithms for Variance-Based fc-Clustering, Proc. of the 1st Pacific Conference on Computer Graphics and Applications (Pacific Graphics'93, Seoul, K.rea), River Edge, N.: World Scientific, 1993, vol. 1, pp. 75–89.
Inaba, M., Katoh, N., and Imai, E., Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k;-Clustering, SCG'94 Proc. of the Tenth Annual Symposium on Computational Geometry, Stony Brook, N., USA, 1994, pp. 332–339.
Sahni, S. and Gonzalez, T. P-Complete Approximation Problems, I. ACM, 1976, vol. 23, pp. 555–566.
Ageev, A.A., Kel'manov, A.V., and Pyatkin, A.V., NP-Hardness of the Euclidean Max-Cut Problem, Dokl. Math., 2014, vol. 2014, no. 89, pp. 3–343.
Ageev, A.A., Kel'manov, A.V., and Pyatkin, A.V., Complexity of the Weighted Max-Cut in Euclidean Space, 7. Appl. Ind. Math., 2014, vol. 2014, no. 8, pp. 4–453.
Kel'manov, A.V. and Pyatkin, A.V., On the Complexity of a Search for a Subset of “Similar” Vectors, Dokl. Math., 2008, vol. 78, no. 1, pp. 574/575.
Kel'manov, A.V. and Pyatkin, A.V, On a Version of the Problem of Choosing a Vector Subset, I. Appl. Ind. Math., 2009, vol. 2009, no. 3, pp. 4–447.
Kel'manov, A.V. and Pyatkin, A.V., NP-Hardness of Some Quadratic Euclidean 2-Clustering Problems, Dokl. Math., 2015, vol. 2015, no. 92, pp. 2–634.
Kel'manov, A.V. and Pyatkin, A.V, On the Complexity of Some Quadratic Euclidean 2-Clustering Problems, Comput. Math. Math. Phys., 2016, vol. 2016, no. 56, pp. 3–491.
Bishop, C.M., Pattern Recognition and Machine Learning, New York: Springer, 2006.
James, G., Witten, D., Hastie, T., and Tibshirani, R., An Introduction to Statistical Learning, New York: Springer, 2013.
Hastie, T. Tibshirani, R., and Friedman, J., The Elements of Statistical Learning, 2nd ed., Springer, 2009.
Aggarwal, C.C., Data Mining: The Textbook, Springer, 2015.
Goodfellow, I., Bengio, Y. and Courville, A., Deep Learning (Adaptive Computation and Machine Learning Series), MIT Press, 2017.
Shirkhorshidi, A.S., Aghabozorgi, S., Wah, T.Y., and Herawan, T. Big Data Clustering: A Review, LNCS, 2014, vol. 8583, pp. 707–720.
Pach, J. and Agarwal, R.K., Combinatorial Geometry, New York: Wiley, 1995.
Kel'manov, A.V. and Khandeev, V.., A 2-Approximation Polynomial Algorithm for a Clustering Problem, 7. Appl. Ind. Math., 2013, vol. 2013, no. 7, pp. 4–515.
Gimadi, E.Kh., Kel'manov, A.V., Kel'manova, M.A., and Khamidullin, S.A., A Posteriori Detection of a Quasiperiodic Fragment in a Numerical Sequence with a Given Number of Recurrences, Sib. Zh. Ind. Mat., 2006, vol. 2006, no. 9, pp. 1–55.
Gimadi, E.Kh., Kel'manov, A.V., Kel'manova, M.A., and Khamidullin, S.A., A Posteriori Detecting a Quasiperiodic Fragment in a Numerical Sequence, Pattern Recogn. Image An., 2008, vol. 2008, no. 18, pp. 1–30.
Baburin, A.E., Gimadi, E.Kh., Glebov, N.I., and Pyatkin, A.V., The Problem of Finding a Subset of Vectors with the Maximum Total Weight, 7. Appl. Ind. Math., 2008, vol. 2008, no. 2, pp. 1–32.
Gimadi, E.Kh., Pyatkin, A.V, and Rykov, L.., On Polynomial Solvability of Some Problems of a Vector Subset Choice in a Euclidean Space of Fixed Dimension, 7. Appl. Ind. Math., 2010, vol. 2010, no. 4, pp. 1–48.
Shenmaier, V.V., Solving Some Vector Subset Problems by Voronoi Diagrams, 7. Appl. Ind. Math., 2016, vol. 2016, no. 10, pp. 4–560.
Kel'manov, A.V. and Khandeev, V.., An Exact Pseudopolynomial Algorithm for a Problem of the Two-Cluster Partitioning of a Set of Vectors, 7. Appl. Ind. Math., 2015, vol. 2015, no. 9, pp. 4–497.
Dolgushev, A.V. and Kel'manov, A.V., An Approximation Algorithm for Solving a Problem of Cluster Analysis, 7. Appl. Ind. Math., 2011, vol. 2011, no. 5, pp. 4–551.
Dolgushev, A.V., Kel'manov, A.V., and Shenmaier, V.V., Polynomial-Time Approximation Scheme for a Problem of Partitioning a Finite Set into Two Clusters, Proc. Steklov Inst. Math., 2016, vol. 295, suppl. 1, pp. 47–56.
Kel'manov, A.V. and Khandeev, V.I., Fully Polynomial-Time Approximation Scheme for a Special Case of a Quadratic Euclidean 2-Clustering Problem, J. Appl. Ind. Math., 2016, vol. 2016, no. 56, pp. 2–334.
Kel'manov, A.V., Motkova, A.V., and Shenmaier, V.V., An Approximation Scheme for a Weighted Two-Cluster Partition Problem, LNCS, 2018, vol. 10716, pp. 323–333.
Kel'manov, A.V. and Khandeev, V.I., A Randomized Algorithm for Two-Cluster Partition of a Set of Vectors, Comput. Math. Math. Phys., 2015, vol. 2015, no. 55, pp. 2–330.
Kel'manov, A.V. and Motkova, A.V, Polynomial-Time Approximation Algorithm for the Problem of Cardinality-Weighted Variance-Based 2-Clustering with a Given Center, Comput. Math. Math. Phys., 2018, vol. 2018, no. 58, pp. 1–130.
Kel'manov, A.V. and Motkova, A.V, Exact Pseudopolynomial Algorithms for a Balanced 2-Clustering Problem, I. Appl. Ind. Math., 2016, vol. 2016, no. 10, pp. 3–349.
Kel'manov, A.V. and Motkova, A.V, A Fully Polynomial-Time Approximation Scheme for a Special Case of a Balanced 2-Clustering Problem, LNCS, 2016, vol. 9869, pp. 182–192.
Kel'manov, A.V, Khandeev, V.., and Panasenko, A.V, Randomized Algorithms for Some Clustering Problems, Comm. Comput. Inform. Sci., 2018, vol. CCIS-871, pp. 109–119.
Author information
Authors and Affiliations
Corresponding authors
Additional information
Russian Text © The Author(s), 2019, published in Sibirskii Zhurnal Vychislitel’noi Matematiki, 2019, Vol. 22, No. 2, pp. 121–136.
Rights and permissions
About this article
Cite this article
Kel′manov, A.V., Panasenko, A.V. & Khandeev, V.I. Exact Algorithms of Search for a Cluster of the Largest Size in Two Integer 2-Clustering Problems. Numer. Analys. Appl. 12, 105–115 (2019). https://doi.org/10.1134/S1995423919020010
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1134/S1995423919020010