1 Introduction

The subject of this study is a strongly NP-hard quadratic Euclidean problem of partitioning a finite set of points into two clusters. We will show a fully polynomial-time approximation scheme (FPTAS) for a special case of the problem.

Our research is motivated by the insufficient algorithmic study of the problem and by its importance in a number of applications, including geometry, cluster analysis, statistical problems of joint estimation and hypothesis testing with heterogeneous samples, and data interpretation.

The paper has the following structure. Section 2 contains the problem formulation, some applications, and closely related problems; known results and our new result are also discussed there. In Sect. 3 we formulate and prove the basic properties exploited by our algorithm. In Sect. 4 we present the approximation algorithm and show that it implements a fully polynomial-time approximation scheme when the space dimension is fixed.

2 Problem Formulation, Its Origin, Related Problems, Known and New Results

Everywhere below we use standard notation: \(\mathbb {R}\) is the set of real numbers, \(\mathbb {R}_{+}\) is the set of positive real numbers, \(\mathbb {Z}\) is the set of integers, \(\Vert \cdot \Vert \) is the Euclidean norm, and \(\langle \cdot ,\cdot \rangle \) is the scalar product.

The problem under consideration is formulated as follows (see also [1, 2]).

Problem 1

(Balanced Variance-based 2-Clustering with given center). Given a set \(\mathcal {Y} =\{y_{1},\ldots ,y_{N}\}\) of points from \(\mathbb {R}^{q}\) and a positive integer M, find a partition of \(\mathcal {Y}\) into two non-empty clusters \(\mathcal {C}\) and \(\mathcal {Y}\setminus \mathcal {C}\) such that

$$\begin{aligned} F(\mathcal {C})= |\mathcal {C}|\sum _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^2 +|\mathcal {Y}\setminus \mathcal {C}|\sum _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^2 \longrightarrow \min \,, \end{aligned}$$
(1)

where \(\overline{y}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}}y \) is the geometric center (centroid) of \(\mathcal {C}\) and such that \(|\mathcal {C}| = M\).
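For concreteness, here is a minimal sketch of the objective (1) in Python with NumPy; the data layout (an \(N \times q\) array) and the helper name are our own illustration, not part of the paper.

```python
import numpy as np

def F(Y, C_idx):
    """Objective (1): the scatter of cluster C around its centroid, weighted
    by |C|, plus the sum of squared norms over Y \\ C, weighted by |Y \\ C|.
    Y is an (N, q) array; C_idx is an index array with M entries."""
    C = Y[C_idx]
    rest = np.delete(Y, C_idx, axis=0)          # the complement Y \ C
    centroid = C.mean(axis=0)                   # geometric center of C
    return (len(C) * np.sum((C - centroid) ** 2)
            + len(rest) * np.sum(rest ** 2))
```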

The problem has an obvious geometric interpretation: a finite set of points in Euclidean space is partitioned into two geometric structures so as to minimize (1). In formula (1) the sums are weighted by the cardinalities of the desired clusters. So Problem 1 can be interpreted both as a problem of optimal weighted (by the cluster cardinalities) summing and as a problem of balanced partitioning (clustering).

In addition, the problem has applications in data mining (see, for example, [3–5]). The essence of this multifaceted field is the approximation of data by a mathematical model that plausibly explains the origin of the data in terms of the model. In particular, the following statistical hypothesis can serve as such a model: the input data \(\mathcal {Y}\) is an inhomogeneous sample from two distributions, one of which has zero mean while the mean of the other is unknown and nonzero. To test this hypothesis, one first needs to find an optimal solution of Problem 1, and only then can the classical results on statistical hypothesis testing be applied.

Applied researchers who study and analyze data widely use clustering algorithms as basic mathematical tools for a variety of problems in which clusters consist of objects that are similar or related by certain criteria. The creation of such tools for data mining motivates the development of new algorithms with guaranteed bounds on accuracy and time complexity.

The strong NP-hardness of Problem 1 was proved in [1, 2]. This fact implies that, unless P\(=\)NP, there is neither an exact polynomial-time nor an exact pseudopolynomial-time algorithm for it [6]. In addition, the nonexistence of an FPTAS for Problem 1 (unless P\(=\)NP) was shown in [1, 2]. Thus, finding subclasses of the problem that admit an FPTAS is a question of topical interest.

Note that there is only one known algorithmic result for Problem 1, namely, an exact algorithm [7] for the case of integer components of the input points. The time complexity of this algorithm is \(\mathcal {O}(qN(2MB+1)^q)\), where B is the maximum absolute value of the components of the input points. If the dimension q of the space is bounded by a constant, the time complexity becomes \(\mathcal {O}(N(MB)^q)\), so in this case the algorithm is pseudopolynomial.

At the same time, there are many results for problems closely related to Problem 1. Properties of algorithms for these problems can be found in the papers cited below.

The NP-hard Balanced variance-based 2-clustering problem is one of the problems most closely related to Problem 1. Its objective function differs from (1) in that the center of the cluster \(\mathcal {Y}\setminus \mathcal {C}\) is not fixed:

$$\begin{aligned} |\mathcal {C}|\sum _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^2 +|\mathcal {Y}\setminus \mathcal {C}|\sum _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y-\overline{y}(\mathcal {Y}\setminus \mathcal {C})\Vert ^2 \longrightarrow \min . \end{aligned}$$
(2)

In problem (2) the centroids \(\overline{y}(\mathcal {C})\) and \(\overline{y}(\mathcal {Y}\setminus \mathcal {C})\) of both clusters \(\mathcal {C}\) and \(\mathcal {Y}\setminus \mathcal {C}\) are functions of \(\mathcal {C}\). It is well known that this problem is equivalent to the Min-sum all-pairs 2-clustering problem, in which it is required to find a partition such that

$$\begin{aligned} \sum _{x\in \mathcal {C}} \sum _{z\in \mathcal {C}}\Vert x-z\Vert ^2 +\sum _{x\in \mathcal {Y}\setminus \mathcal {C}} \sum _{z\in \mathcal {Y}\setminus \mathcal {C}}\Vert x-z\Vert ^2 \longrightarrow \min \,. \end{aligned}$$
(3)

Algorithmic questions for problems (2) and (3) were studied, for example, in [1, 2, 8–13].

The well-known NP-hard [14] Minimum sum-of-squares 2-clustering problem is also close to Problem 1. In this problem (which goes back to the classical work of Fisher [15] and is also called 2-Means [16]), we need to find two clusters \(\mathcal {C}\) and \(\mathcal {Y}\setminus \mathcal {C}\) such that

$$\begin{aligned} \sum _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^2 +\sum _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y-\overline{y}(\mathcal {Y}\setminus \mathcal {C})\Vert ^2 \longrightarrow \min \,. \end{aligned}$$
(4)

In problem (4), as well as in problem (2), the centroids of both clusters are functions of \(\mathcal {C}\), but in problem (4) the sums are not weighted by the cluster cardinalities. Thousands of publications are dedicated to problem (4) and its applications.

The strongly NP-hard problem Minimum sum-of-squares 2-clustering with given center has been actively studied in the last decade. In this problem we need to find a 2-partition such that

$$\begin{aligned} \sum _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^2 +\sum _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^2 \longrightarrow \min \,. \end{aligned}$$
(5)

Problem (5) differs from Problem 1 in that the sums are not weighted by the cardinalities of the desired clusters. Algorithmic results for this problem can be found in [17–24].

In the considered Problem 1, the centroid \(\overline{y}(\mathcal {C})\) of the cluster \(\mathcal {C}\) is unknown, while the center of the cluster \(\mathcal {Y}\setminus \mathcal {C}\) is fixed at the origin, as in problem (5). Since Problem 1 is neither equivalent to nor a special case of problems (2)–(5), the known algorithmic results for these closely related problems do not apply to it, and new tools are required.

In this work we present an approximation algorithm for Problem 1. Given a relative error \(\varepsilon \), the algorithm finds a \((1 + \varepsilon )\)-approximate solution in \(\mathcal {O}\left( qN^{2}(\sqrt{\frac{2q}{\varepsilon }}+1)^{q}\right) \) time. In the case of a fixed space dimension q, the running time is \(\mathcal {O}\left( N^{2}\left( \frac{1}{\varepsilon }\right) ^{q/2}\right) \), so the algorithm implements a fully polynomial-time approximation scheme.

3 Foundations of the Algorithm

In this section, we provide some basic statements exploited by our algorithm.

The following two lemmas are well known. Their proofs are presented in many publications (see, for example, [25, 26]).

Lemma 1

For an arbitrary point \(x \in \mathbb {R}^q\) and a finite set \(\mathcal {Z} \subset \mathbb {R}^q\), it is true that

$$\begin{aligned} \sum \limits _{z \in \mathcal {Z}} \Vert z - x \Vert ^2 = \sum \limits _{z \in \mathcal {Z}} \Vert z - \overline{z} \Vert ^2 + |\mathcal {Z}| \cdot \Vert x - \overline{z} \Vert ^2 \,, \end{aligned}$$

where \(\overline{z}\) is the centroid of \(\mathcal {Z}\).

Lemma 2

Let the conditions of Lemma 1 hold. If a point \(u\in \mathbb {R}^{q}\) is closer (in terms of distance) to the centroid \(\overline{z}\) of \(\mathcal {Z}\) than any point in \(\mathcal {Z}\), then

$$\begin{aligned} \sum _{z \in \mathcal {Z}} \Vert z - u \Vert ^2 \le 2 \sum _{z \in \mathcal {Z}} \Vert z - \overline{z} \Vert ^2\;. \end{aligned}$$

Lemma 3

Let

$$\begin{aligned} S(\mathcal {C},x) = |\mathcal {C}|\sum _{y\in \mathcal {C}}\Vert y-x\Vert ^2 +|\mathcal {Y}\setminus \mathcal {C}|\sum _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^2, \,\, \mathcal {C}\subseteq \mathcal {Y}, \,\,x\in \mathbb {R}^{q}\,, \end{aligned}$$
(6)

where \(\mathcal {Y}\) is the input set of Problem 1. Then it is true that

$$\begin{aligned} S(\mathcal {C}, x) =F(\mathcal {C})+|\mathcal {C}|^{2}\Vert x-\overline{y}(\mathcal {C})\Vert ^{2}\,. \end{aligned}$$

Proof

Applying Lemma 1 to the set \(\mathcal {C}\) and its centroid, we have

$$\begin{aligned} \sum \limits _{y\in \mathcal {C}}\Vert y-x\Vert ^{2} =\sum \limits _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^{2} +|\mathcal {C}|\cdot \Vert x-\overline{y}(\mathcal {C})\Vert ^{2}\,. \end{aligned}$$
(7)

After the substitution of (7) in the definition (6), we obtain

$$\begin{aligned} S(\mathcal {C}, x)&= |\mathcal {C}|\sum _{y\in \mathcal {C}}\Vert y-x\Vert ^2 +|\mathcal {Y}\setminus \mathcal {C}|\sum _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^2 \\&= |\mathcal {C}|\sum \limits _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^{2} +|\mathcal {C}|^{2}\Vert x-\overline{y}(\mathcal {C})\Vert ^{2}+|\mathcal {Y}\setminus \mathcal {C}|\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^{2}\\&=F(\mathcal {C})+|\mathcal {C}|^{2}\Vert x-\overline{y}(\mathcal {C})\Vert ^{2}\,. \end{aligned}$$

   \(\square \)
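As a quick numerical sanity check of the identity of Lemma 3 (reusing the helper F sketched above; the function S and the test data are purely illustrative):

```python
import numpy as np

def S(Y, C_idx, x):
    """The weighted sum (6) with an arbitrary tentative center x for C."""
    C = Y[C_idx]
    rest = np.delete(Y, C_idx, axis=0)
    return len(C) * np.sum((C - x) ** 2) + len(rest) * np.sum(rest ** 2)

# Verify S(C, x) = F(C) + |C|^2 * ||x - centroid(C)||^2 on random data.
rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 3))
C_idx, x = np.arange(4), rng.normal(size=3)
lhs = S(Y, C_idx, x)
rhs = F(Y, C_idx) + 4 ** 2 * np.sum((x - Y[C_idx].mean(axis=0)) ** 2)
assert np.isclose(lhs, rhs)
```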

For any function \(f(x, y)\), we denote by \(f^{x}(y)\) the function of y obtained by fixing the argument x, and by \(f^{y}(x)\) the function of x obtained by fixing the argument y.

Lemma 4

For the conditional minima of the function (6), the following statements are true:

(1) for any nonempty fixed set \(\mathcal {C}\subseteq \mathcal {Y}\) the minimum of the function \(S^{\mathcal {C}}(x)\) over \(x\in \mathbb {R}^{q}\) is reached at the point \(x=\overline{y}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}} y\) and is equal to \(F(\mathcal {C})\);

(2) if \(|\mathcal {C}|=M=\mathrm {const}\), then, for any fixed point \(x\in \mathbb {R}^{q}\), the minimum of the function \(S^{x}(\mathcal {C})\) over \(\mathcal {C}\subseteq \mathcal {Y}\) satisfies

$$\begin{aligned} \arg \min \limits _{\mathcal {C}\subseteq \mathcal {Y}}S^{x}(\mathcal {C}) =\arg \min \limits _{\mathcal {C}\subseteq \mathcal {Y}}G^{x}(\mathcal {C})\,, \end{aligned}$$

where

$$\begin{aligned} G^{x}(\mathcal {C}) = \sum \limits _{y \in \mathcal {C}} g^{x}(y)\,, \end{aligned}$$
(8)
$$\begin{aligned} g^{x}(y) = (2M-N)\Vert y\Vert ^{2} - 2M\left\langle y, x\right\rangle , \,\,\,\, y \in \mathcal {Y}\,, \end{aligned}$$
(9)

and

$$\begin{aligned} \min \limits _{\mathcal {C}\subseteq \mathcal {Y}}G^{x}(\mathcal {C}) = \sum \limits _{y \in \mathcal {B}^{x}} g^{x}(y)\,, \end{aligned}$$
(10)

where the set \(\mathcal {B}^{x}\) consists of the M points of the set \(\mathcal {Y}\) at which the function \(g^{x}(y)\) takes its smallest values.

Proof

The first statement follows from Lemma 3.

Since \(|\mathcal {Y}| = N\) and \(|\mathcal {C}| = M\), the second statement follows from the next chain of equalities:

$$\begin{aligned}&S^{x}(\mathcal {C}) = M\sum \limits _{y\in \mathcal {C}}\Vert y-x\Vert ^{2} + (N-M)\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^{2}\\&\qquad = M\sum \limits _{y\in \mathcal {C}}\Vert y\Vert ^{2} + M^{2}\Vert x\Vert ^{2} - 2M\sum \limits _{y\in \mathcal {C}}\left\langle y, x\right\rangle + (N-M)\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^{2}\\&\quad = (N-M)\sum \limits _{y\in \mathcal {Y}}\Vert y\Vert ^{2} + M^{2}\Vert x\Vert ^{2} + (2M-N)\sum \limits _{y\in \mathcal {C}}\Vert y\Vert ^{2} - 2M\sum \limits _{y\in \mathcal {C}}\left\langle y, x\right\rangle \\&= (N-M)\sum \limits _{y\in \mathcal {Y}}\Vert y\Vert ^{2} + M^{2}\Vert x\Vert ^{2}+\sum \limits _{y \in \mathcal {C}} g^{x}(y) = (N-M)\sum \limits _{y\in \mathcal {Y}}\Vert y\Vert ^{2} + M^{2}\Vert x\Vert ^{2}+G^{x}(\mathcal {C})\,. \end{aligned}$$

It remains to note that, in the last two equalities, the first two addends do not depend on \(\mathcal {C}\). Formula (10) is obvious.    \(\square \)
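Statement (2) of Lemma 4 is the computational core of the algorithm in Sect. 4: for a fixed tentative center x, the best cluster of cardinality M is found by a linear scan. A minimal sketch, assuming the NumPy layout of the earlier snippets (the function name is ours):

```python
import numpy as np

def best_cluster_for_center(Y, x, M):
    """For a fixed center x, return indices of the M points of Y with the
    smallest values of g^x(y) = (2M - N)||y||^2 - 2M<y, x> (formula (9)).
    By Lemma 4, this set B^x minimizes S^x(C) over all C with |C| = M."""
    N = len(Y)
    g = (2 * M - N) * np.sum(Y ** 2, axis=1) - 2 * M * (Y @ x)
    # argpartition performs linear-time selection of the M smallest values,
    # in the spirit of the selection algorithm cited in the paper as [27].
    return np.argpartition(g, M - 1)[:M]
```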

Lemma 5

Let the conditions of Lemma 4 hold and \(\mathcal {C}^{*}\) be the optimal solution of Problem 1. Then, for a fixed point \(x\in \mathbb {R}^{q}\), the following inequality is true

$$\begin{aligned} F(\mathcal {B}^{x}) \le F(\mathcal {C}^{*})+M^{2}\Vert x-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\,. \end{aligned}$$

Proof

The definitions (1) and (6), and Lemma 4 imply

$$\begin{aligned} F(\mathcal {B}^{x})=S^{\overline{y}(\mathcal {B}^{x})}(\mathcal {B}^{x}) \le S^{x}(\mathcal {B}^{x}) \le S^{x}(\mathcal {C}^{*})\,. \end{aligned}$$
(11)

Applying Lemma 3 to the right-hand side of (11), we obtain

$$\begin{aligned} S^{x}(\mathcal {C}^{*}) =F(\mathcal {C}^{*})+M^{2}||x-\overline{y}(\mathcal {C}^{*})||^{2}\,. \end{aligned}$$
(12)

Combining (11) and (12) yields the statement of the lemma.    \(\square \)

Lemma 6

Let the conditions of Lemma 5 hold and let \(t=\arg \min \limits _{y\in \mathcal {C}^{*}}\Vert y-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\) be the point of the subset \(\mathcal {C}^{*}\) closest to its centroid. Then the following inequality holds:

$$\begin{aligned} \Vert t-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\le \frac{1}{M^{2}}F(\mathcal {B}^{t})\,, \end{aligned}$$
(13)

where \(\mathcal {B}^{t}\) is the set defined in Lemma 4 (for \(x = t\)).

Proof

By the definition of point t we have

$$\begin{aligned} \Vert t-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\le \Vert y-\overline{y}(\mathcal {C}^{*})\Vert ^{2} \end{aligned}$$

for each \(y\in \mathcal {C}^{*}\). Summing up both sides of this inequality over all \(y\in \mathcal {C}^{*}\), we obtain

$$\begin{aligned} M\Vert t-\overline{y}(\mathcal {C}^{*})\Vert ^{2} \le \sum \limits _{y\in \mathcal {C}^{*}}\Vert y-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\,. \end{aligned}$$
(14)

Since \(\mathcal {C}^{*}\) is the optimal solution,

$$\begin{aligned} F(\mathcal {C}^{*})\le F(\mathcal {B}^{t})\,. \end{aligned}$$
(15)

Then (14), (1) and (15) imply

$$\begin{aligned} M\Vert t-\overline{y}(\mathcal {C}^{*})\Vert ^{2} \le \sum \limits _{y\in \mathcal {C}^{*}}\Vert y-\overline{y}(\mathcal {C}^{*})\Vert ^{2} \le \frac{1}{M}F(\mathcal {C}^{*})\le \frac{1}{M} F(\mathcal {B}^{t})\,. \end{aligned}$$

   \(\square \)

Lemma 7

Let the conditions of Lemma 6 hold. Let

$$\begin{aligned} \Vert x-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\le \frac{\varepsilon }{2M^{2}}F(\mathcal {B}^{t}) \end{aligned}$$
(16)

for some \(\varepsilon > 0\) and \(x\in \mathbb {R}^{q}\). Then the subset \(\mathcal {B}^{x}\) (defined in Lemma 4) is a \((1+\varepsilon )\)-approximate solution of Problem 1.

Proof

From (1), Lemma 4 and the definition of the point t we have

$$\begin{aligned} F(\mathcal {B}^{t})=S^{\overline{y}(\mathcal {B}^{t})}(\mathcal {B}^{t}) \le S^{t}(\mathcal {B}^{t}) \le S^{t}(\mathcal {C}^{*})\,. \end{aligned}$$
(17)

Applying Lemma 2 to the set \(\mathcal {C}^{*}\) and the point t, we have

$$\begin{aligned} \sum \limits _{y\in \mathcal {C}^{*}}\Vert y-t\Vert ^{2}\le 2\sum \limits _{y\in \mathcal {C}^{*}}\Vert y-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\,. \end{aligned}$$

Therefore, definition (6) yields

$$\begin{aligned} S^{t}(\mathcal {C}^{*})&= M\sum \limits _{y\in \mathcal {C}^{*}}\Vert y-t\Vert ^{2} +(N-M)\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}^{*}}\Vert y\Vert ^{2} \nonumber \\&\qquad \le 2M\sum \limits _{y\in \mathcal {C}^{*}}\Vert y-\overline{y}(\mathcal {C}^{*})\Vert ^{2} +(N-M)\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}^{*}}\Vert y\Vert ^{2}\le 2F(\mathcal {C}^{*})\,. \end{aligned}$$
(18)

Combining (16), (17), and (18) we obtain

$$\begin{aligned} \Vert x-\overline{y}(\mathcal {C}^{*})\Vert ^{2} \le \frac{\varepsilon }{2M^{2}}F(\mathcal {B}^{t}) \le \frac{\varepsilon }{2M^{2}}S^{t}(\mathcal {C}^{*}) \le \frac{\varepsilon }{M^{2}}F(\mathcal {C}^{*})\,. \end{aligned}$$
(19)

Finally, from Lemma 5 and (19) for the subset \(\mathcal {B}^{x}\) we obtain the following estimate of the value of the objective function

$$\begin{aligned} F(\mathcal {B}^{x})\le F(\mathcal {C}^{*})+M^{2}\Vert x-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\le (1+\varepsilon )F(\mathcal {C}^{*})\,. \end{aligned}$$

This estimate means that the subset \(\mathcal {B}^{x}\) is a \((1+\varepsilon )\)-approximate solution for Problem 1.    \(\square \)

4 Approximation Algorithm

In this section, we present our approximation algorithm for Problem 1. Its main idea is as follows. For each point of the input set, a domain (a cube) is constructed in such a way that the centroid of the desired cluster necessarily belongs to one of these domains. Given (as input) the prescribed relative error \(\varepsilon \) of the solution, a lattice (grid) discretizing the cube with a uniform step in every coordinate is generated. For each lattice node, the subset of M points of the input set with the smallest values of the function (9) is formed (the minimum of (8) is attained at this subset), and this subset is declared a candidate solution. The candidate minimizing the objective function is chosen as the final solution.

For an arbitrary point \(x\in \mathbb {R}^{q}\) and positive numbers h and H, we define the set of points

$$\begin{aligned} \mathcal {D}(x,h,H)=\{d\in \mathbb {R}^{q}|\, d=x+h\cdot (i_{1},\dots ,i_{q}), i_{k}\in \mathbb {Z},|hi_{k}|\le H, k\in \{1,\dots ,q\}\} \end{aligned}$$
(20)

which is a cubic lattice of size 2H centered at the point x with node spacing h.

For any point \(x\in \mathbb {R}^{q}\) the number of nodes in this lattice is

$$\begin{aligned} |\mathcal {D}(x,h,H)|\le \left( 2\Big \lfloor {\frac{H}{h}}\Big \rfloor +1\right) ^{q}\le \left( 2\frac{H}{h}+1\right) ^{q}\,. \end{aligned}$$
(21)

Remark 1

If a point z from \(\mathbb {R}^{q}\) and the center x of the lattice \(\mathcal {D}(x,h,H)\) satisfy the inequality \(\Vert z-x\Vert \le H\), then the distance from z to the nearest node of the lattice obviously does not exceed \(\frac{h\sqrt{q}}{2}\).

To construct an algorithmic solution, we need to determine, adaptively for each point y of the input set \(\mathcal {Y}\), the size H of the lattice and its node spacing h so that the domain of the lattice contains the centroid of the desired cluster. The node spacing is determined by the relative error \(\varepsilon \). To this end, we define the functions:

$$\begin{aligned} H(y)=\frac{1}{M}\sqrt{F(\mathcal {B}^{y})},\,\,\,y\in \mathcal {Y}\,, \end{aligned}$$
(22)
$$\begin{aligned} h(y,\varepsilon )=\frac{1}{M}\sqrt{\frac{2\varepsilon }{q}F(\mathcal {B}^{y})},\,\,\,y\in \mathcal {Y},\,\,\varepsilon \in \mathbb {R}_{+}\,, \end{aligned}$$
(23)

where \(\mathcal {B}^{y}\) is the set defined in Lemma 4 (with \(x = y\)).

Note that all calculations in the algorithm described below amount to constructing candidate (approximate) solutions of Problem 1 in the form of subsets \(\mathcal {B}^{x}\) (defined in Lemma 4) for points x from two support sets. The first support set is the input set \(\mathcal {Y}\); the second is the set of nodes of the lattice \(\mathcal {D}(y,h,H)\) centered at y, which is computed adaptively by formulae (22) and (23) for each input point \(y\in \mathcal {Y}\). The approximation factor is finally bounded using the basic statements of Sect. 3.

Remark 2

For any point \(y \in \mathcal {Y}\) the cardinality \(|\mathcal {D}(y,h,H)|\) of the lattice does not exceed the value

$$\begin{aligned} L=\left( \sqrt{\frac{2q}{\varepsilon }} + 1\right) ^{q} \end{aligned}$$

due to (21), (22), and (23).

Below is the step-by-step description of the algorithm.

Algorithm \(\mathcal {A}\).

Input: a set \(\mathcal {Y}\) and numbers M and \(\varepsilon \).

For each point \(y\in \mathcal {Y}\) Steps 1–6 are executed.

Step 1. Compute the values \(g^{y}(z), \,\, z\in \mathcal {Y},\) using formula (9); find a subset \(\mathcal {B}^{y} \subseteq \mathcal {Y}\) with the M smallest values \(g^{y}(z)\); compute \(F(\mathcal {B}^{y})\) using formula (1).

Step 2. If \(F(\mathcal {B}^{y}) = 0\), then put \(\mathcal {C}_{\mathcal {A}}=\mathcal {B}^{y}\); exit.

Step 3. Compute H and h using formulae (22) and (23).

Step 4. Construct the lattice \(\mathcal {D}(y,h,H)\) using formula (20).

Step 5. For each node x of the lattice \(\mathcal {D}(y,h,H)\), compute the values \(g^{x}(z)\), \(z\in \mathcal {Y}\), using formula (9) and find a subset \(\mathcal {B}^{x}\subseteq \mathcal {Y}\) with the M smallest values \(g^{x}(z)\). Compute \(F(\mathcal {B}^{x})\) using formula (1); remember this value and the set \(\mathcal {B}^{x}\).

Step 6. If \(F(\mathcal {B}^{x}) = 0\), then put \(\mathcal {C}_{\mathcal {A}}=\mathcal {B}^{x}\); exit.

Step 7. In the family \(\{\mathcal {B}^{x}| \, x\in \mathcal {D}(y,h,H), y\in \mathcal {Y}\}\) of candidate sets that have been constructed in Steps 1–6, choose as a solution \(\mathcal {C}_{\mathcal {A}}\) the set \(\mathcal {B}^{x}\) for which \(F(\mathcal {B}^{x})\) is minimal.

Output: the set \(\mathcal {C}_{\mathcal {A}}\).
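Putting the pieces together, the following sketch transcribes Steps 1–7 directly, reusing the helpers F, best_cluster_for_center, and lattice sketched above (an illustration of the scheme under our assumed data layout, not the authors' code):

```python
import numpy as np

def algorithm_A(Y, M, eps):
    """Steps 1-7 of Algorithm A for an (N, q) NumPy array Y."""
    N, q = Y.shape
    best_val, best_set = np.inf, None
    for y in Y:
        # Steps 1-2: candidate built from the input point y itself.
        B = best_cluster_for_center(Y, y, M)
        f = F(Y, B)
        if f == 0:
            return B                        # an optimal solution; exit early
        if f < best_val:
            best_val, best_set = f, B
        # Step 3: lattice size (22) and node spacing (23).
        H = np.sqrt(f) / M
        h = np.sqrt(2 * eps * f / q) / M
        # Steps 4-6: candidates built from the lattice nodes.
        for x in lattice(y, h, H):
            Bx = best_cluster_for_center(Y, x, M)
            fx = F(Y, Bx)
            if fx == 0:
                return Bx
            if fx < best_val:
                best_val, best_set = fx, Bx
    return best_set                         # Step 7: best candidate found
```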

Theorem 1

For any fixed \(\varepsilon > 0\) Algorithm \(\mathcal {A}\) finds a \((1+\varepsilon )\)-approximate solution of Problem 1 in \(\mathcal {O}\left( qN^{2}\left( \sqrt{\frac{2q}{\varepsilon }}+1\right) ^{q}\right) \) time.

Proof

Let us bound the approximation factor of the algorithm. If the equality \(F(\mathcal {B}^{y}) = 0\) holds at Step 2 for some point \(y\in \mathcal {Y}\), then the subset \(\mathcal {B}^{y}\subseteq \mathcal {Y}\) is an optimal solution of Problem 1, since \(F(\mathcal {C}) \ge 0\) for any set \(\mathcal {C} \subseteq \mathcal {Y}\). In the same way, we get an optimal solution at Step 6.

Consider the case when the condition \(F(\mathcal {B}^{y}) = 0\) at Step 2 never holds. Obviously, the point \(t=\arg \min \limits _{y\in \mathcal {C}^{*}}\Vert y-\overline{y}(\mathcal {C}^{*})\Vert \) belongs to \(\mathcal {Y}\), so the algorithm encounters it while iterating over the input set. By Lemma 6, inequality (13) holds for this point. Together with (22), this means that \(\Vert t-\overline{y}(\mathcal {C}^{*})\Vert \le H(t)\), i.e., the centroid of the optimal cluster lies within the lattice \(\mathcal {D}(t,h,H)\) of size \(H=H(t)\) with node spacing \(h=h(t,\varepsilon )\).

Let \(x^{*}=\arg \min \limits _{x\in \mathcal {D}(t,h,H)}\Vert x-\overline{y}(\mathcal {C}^{*})\Vert \) be the node of the grid \(\mathcal {D}(t,h,H)\) nearest to the centroid of the optimal cluster. Since the squared distance from the optimal centroid \(\overline{y}(\mathcal {C}^{*})\) to the nearest node \(x^{*}\) of the lattice does not exceed \(\frac{h^{2}q}{4}\) (by Remark 1), we have the estimate

$$\begin{aligned} \Vert x^{*}-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\le \frac{h^{2}q}{4}=\frac{\varepsilon }{2M^{2}}F(\mathcal {B}^{t})\,. \end{aligned}$$

Therefore, the point \(x^{*}\) satisfies the conditions of Lemma 7 and, hence, the set \(\mathcal {B}^{x^{*}}\) is a \((1+\varepsilon )\)-approximate solution of Problem 1.

It is clear that any subset \(\mathcal {B}^{x}\) in the family of candidate solutions at Step 7 that is constructed for a node x with \(\Vert x-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\le \Vert x^{*}-\overline{y}(\mathcal {C}^{*})\Vert ^{2}\) also guarantees a \((1+\varepsilon )\)-approximation.

Let us evaluate the time complexity of the algorithm.

At Step 1, the calculation of the values \(g^{y}(z)\) requires \(\mathcal {O}(qN)\) time. Finding the M smallest of N elements takes \(\mathcal {O}(N)\) operations (for example, using a linear-time algorithm for finding the M-th smallest value in an unordered array [27]). Computing the value \(F(\mathcal {B}^{y})\) takes \(\mathcal {O}(qN)\) time.

Steps 2, 3, and 6 are executed in \(\mathcal {O}(1)\) operations. Generating the lattice at Step 4 requires \(\mathcal {O}(qL)\) operations (by Remark 2).

At Step 5, the computation of the set \(\mathcal {B}^{x}\) for each node of the grid requires \(\mathcal {O}(qN)\) time, and the same is true for the computation of \(F(\mathcal {B}^{x})\) (as at Step 1). Thus, the computational time at this step over all nodes of the grid is \(\mathcal {O}(qNL)\).

Since Steps 1–6 are performed N times, the time complexity of these steps is \(\mathcal {O}(qN^{2}L)\). The time complexity of Step 7 is bounded by \(\mathcal {O}(NL)\), and the total time complexity of all Steps is \(\mathcal {O}(qN^{2}L)\). Therefore, the time complexity of Algorithm \(\mathcal {A}\) is \(\mathcal {O}\left( qN^{2}\left( \sqrt{\frac{2q}{\varepsilon }} + 1\right) ^{q}\right) \).    \(\square \)

Remark 3

In the case when the dimension q of the space is bounded by a constant and \(\varepsilon < 2q\), we have

$$\begin{aligned} qN^{2}\left( 1+\sqrt{\frac{2q}{\varepsilon }}\right) ^{q}\le qN^{2}2^{q}\left( \frac{2q}{\varepsilon }\right) ^{q/2} =\mathcal {O}\left( N^{2}\left( \frac{1}{\varepsilon }\right) ^{q/2}\right) \,, \end{aligned}$$

and it means that Algorithm \(\mathcal {A}\) is an FPTAS.

Remark 4

It is clear that the constructed algorithm can also be applied to the problem in which the cardinalities of the clusters are optimization variables. For this purpose, it suffices to solve Problem 1 with Algorithm \(\mathcal {A}\) for each \(M = 1, \dots ,N\) and then choose the best of the N obtained solutions with respect to the objective function (see the sketch below). The time complexity of this approach is obviously \(\mathcal {O}\left( N^{3}\left( \frac{1}{\varepsilon }\right) ^{q/2}\right) \). Constructing algorithms of lower time complexity, without enumerating all such candidate solutions, remains an interesting open question.
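Under the same assumptions as the sketches above, the exhaustive scheme of Remark 4 might look as follows:

```python
def algorithm_A_all_M(Y, eps):
    """Remark 4: when the cluster cardinality is itself an optimization
    variable, run Algorithm A for every M = 1, ..., N (as in the paper)
    and keep the candidate with the smallest objective value."""
    N = len(Y)
    candidates = (algorithm_A(Y, M, eps) for M in range(1, N + 1))
    return min(candidates, key=lambda B: F(Y, B))
```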

5 Conclusion

In this paper, we presented an approximation algorithm for a strongly NP-hard quadratic Euclidean problem of balanced partitioning of a finite set of points into two clusters. It was proved that our algorithm is a fully polynomial-time approximation scheme when the space dimension is bounded by a constant.

In the algorithmic sense, the considered problem remains poorly studied. Therefore, it seems important to continue investigating the approximability of the problem.