1 Introduction

The subject of this study is a strongly NP-hard problem of partitioning a finite set of points in Euclidean space into clusters. Our goal is to analyze the computational complexity of the problem in the one-dimensional case. The research is motivated by the fact that this mathematical question has remained open, as well as by the importance of the problem for a number of applications, in particular, for data analysis, data mining, pattern recognition, and data processing.

The paper has the following structure. In Sect. 2, the problem formulation is given. In the same section, a connection is established with the well-known problem that is closest to the one we consider. The next section presents auxiliary statements that reveal the structure of the optimal solution of the problem; these statements allow us to prove the main result. In Sect. 4, our main result, the polynomial solvability of the problem in the 1D case, is presented.

2 Problem Formulation, Its Sources and Related Problems

In the well-known K-Means clustering problem, an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space and a positive integer K are given. It is required to find a partition of the input set \(\mathcal {Y}\) into non-empty clusters \(\mathcal {C}_{1},\ldots , \mathcal {C}_{K}\) minimizing the sum

$$\begin{aligned} \sum _{k=1}^{K} \sum _{y \in \mathcal {C}_{k}} \Vert y - \overline{y}(\mathcal {C}_{k})\Vert ^2 , \end{aligned}$$

where \(\overline{y}(\mathcal {C}_{k}) = \frac{1}{|\mathcal {C}_{k}|}\sum _{y\in \mathcal {C}_{k}}y\) is the centroid of the k-th cluster.
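For concreteness, here is a minimal Python sketch (ours, not from the cited literature; the names centroid, sq_dist, and mssc_objective are illustrative) of evaluating this objective for a given partition:

```python
# Minimal illustrative sketch: evaluating the K-Means (MSSC) objective
# for a given partition of points in R^d. Names are ours, not the paper's.
from typing import List

Point = List[float]

def centroid(cluster: List[Point]) -> Point:
    """Coordinate-wise mean of a non-empty cluster."""
    d = len(cluster[0])
    return [sum(p[i] for p in cluster) / len(cluster) for i in range(d)]

def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mssc_objective(clusters: List[List[Point]]) -> float:
    """Sum over clusters of squared distances to each cluster's centroid."""
    return sum(sq_dist(p, centroid(cl)) for cl in clusters for p in cl)
```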

Another common name for the K-Means problem is MSSC (Minimum Sum-of-Squares Clustering). In statistics, this problem has been known since the last century and is associated with Fisher (see, for example, [1, 2]). In practice (in a wide variety of applications), this problem arises when the following hypothesis on the structure of some given numerical data holds. Namely, one assumes that the set \(\mathcal {Y}\) of sample (input) data contains K homogeneous clusters (subsets) \(\mathcal {C}_{1}, \ldots , \mathcal {C}_{K}\), and in all clusters the points are scattered around the corresponding unknown mean values \(\overline{y}(\mathcal {C}_{1}), \ldots , \overline{y}(\mathcal {C}_{K})\). However, the correspondence between points and clusters is unknown. Obviously, in this situation, for the correct application of classical statistical methods (hypothesis testing or parameter estimation) to the processing of sample data, one must first divide the data into homogeneous groups (clusters). This situation is typical, in particular, for the above-mentioned (see Sect. 1) applications.

The strong NP-hardness of K-Means was proved relatively recently [3]. The polynomial solvability of this problem on a line was proved in [4] in the last century. The cited paper presents an algorithm with \(\mathcal {O}(KN^2)\) running time that implements a dynamic programming scheme. This well-known algorithm relies on an exact polynomial algorithm for the well-known Nearest neighbor search problem [5]. Note that the polynomial solvability of the 1D case of the K-Means problem in \(\mathcal {O}(KN\log N)\) time follows directly from results obtained in [6,7,8,9] earlier than [4]. In the cited papers, the authors constructed faster polynomial-time algorithms for some special cases of the Nearest neighbor search problem. Nevertheless, in recent years, some new exact algorithms with \(\mathcal {O}(KN\log N)\) running time have been constructed for the one-dimensional case of the K-Means problem. An overview of these algorithms and their properties can be found in [10, 11].

The object of our research is the following problem, which is close in its formulation to K-Means but poorly studied.

Problem 1

(K-Means and Given J-Centers). Given an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space, a positive integer K, and a tuple \((c_1,\ldots ,c_J)\) of points. Find a partition of \(\mathcal {Y}\) into non-empty clusters \(\mathcal {C}_{1}, \ldots , \mathcal {C}_{K}\), \(\mathcal {D}_1, \ldots , \mathcal {D}_{J}\) such that

$$\begin{aligned} F = \sum _{k=1}^{K} \sum _{y \in \mathcal {C}_{k}} \Vert y - \overline{y}(\mathcal {C}_{k})\Vert ^2 + \sum _{j=1}^{J} \sum _{y \in \mathcal {D}_{j}} \Vert y - c_{j}\Vert ^2 \rightarrow \min , \end{aligned}$$

where \(\overline{y}(\mathcal {C}_{k})\) is the centroid of the k-th cluster.
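As an illustration, the objective F extends the MSSC sketch above with the given-center terms. This is again our sketch, with illustrative names, reusing mssc_objective and sq_dist from the previous snippet:

```python
# Illustrative sketch of the Problem 1 objective F: centroid-based scatter
# for the clusters C_k plus scatter around the given centers c_j for the D_j.
# Reuses centroid/sq_dist/mssc_objective from the sketch above.
def problem1_objective(c_clusters: List[List[Point]],
                       d_clusters: List[List[Point]],
                       centers: List[Point]) -> float:
    f = mssc_objective(c_clusters)               # sum over the C_k terms
    for cluster, c in zip(d_clusters, centers):  # sum over the D_j terms
        f += sum(sq_dist(p, c) for p in cluster)
    return f
```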

On the one hand, Problem 1 may be considered a modification of K-Means. On the other hand, the introduced notation allows us to refer to Problem 1 as K-Means and Given J-Centers.

Unlike K-Means, Problem 1 models an applied clustering problem in which, for some of the clusters (i.e., for \(\mathcal {D}_1, \ldots , \mathcal {D}_{J}\)), the centers of the quadratic scatter of the data (i.e., \(c_1,\ldots ,c_J\)) are known in advance, i.e., they are given as part of the input instance. This applied problem is also typical for data analysis, data mining, pattern recognition, and data processing. In particular, the two-cluster Problem 1, i.e., 1-Mean and Given 1-Center, is related to an applied signal processing problem, namely, the problem of jointly detecting a quasi-periodically repeated pulse of unknown shape in a pulse train and evaluating this shape under Gaussian noise with given zero mean (see [12,13,14]). In this two-cluster Problem 1, the zero mean corresponds to the cluster whose center is specified at the origin. Apparently, this two-cluster Problem 1 was first mentioned in [12]. It should be noted that simpler optimization problems induced by the applied problems of noise-proof detection and discrimination of pulses of specified shapes are typical, in particular, for radar, electronic reconnaissance, hydroacoustics, geophysics, technical and medical diagnostics, and space monitoring (see, for example, [15,16,17]).

The strong NP-hardness of Problem 1 was proved in [18,19,20]. Note that the K-Means problem is not equivalent to Problem 1 and is not a special case of it. Therefore, the solvability of Problem 1 in the 1D case requires independent study. Until now, this question has remained open.

The main result of this paper is a proof of the polynomial solvability of Problem 1 in the one-dimensional case.

3 Some Auxiliary Statements: Properties of the Optimal Solution of Problem 1 in the 1D Case

In what follows, we assume that \(d = 1\). Below, we refer to the one-dimensional case of Problem 1 as Problem 1D.

Our proof is based on the few auxiliary statements given below, which reveal the structure of the optimal solution of Problem 1D. For brevity, we present these statements without proofs, limiting ourselves to sketching their ideas.

Denote by \(\mathcal {C}_{1}^*, \ldots , \mathcal {C}_{K}^*\), \(\mathcal {D}_1^*, \ldots , \mathcal {D}_{J}^*\) the optimal clusters in Problem 1D.

Lemma 1

If in Problem 1D \(c_{m} < c_{\ell }\), where \(1 \le m \le J\), \(1 \le \ell \le J\), then for each \(x \in \mathcal {D}_{m}^*\) and \(z \in \mathcal {D}_{\ell }^*\) the inequality \(x \le z\) holds.

Lemma 2

If in Problem 1D \(\overline{y}(\mathcal {C}_{m}^*) < \overline{y}(\mathcal {C}_{\ell }^*)\), where \(1 \le m \le K\), \(1 \le \ell \le K\), then for each \(x \in \mathcal {C}_{m}^*\) and \(z \in \mathcal {C}_{\ell }^*\) the inequality \(x \le z\) holds.

Lemma 3

For an optimal solution of Problem 1D, the following statements are true:

  1. If \(\overline{y}(\mathcal {C}_{m}^*) < c_{\ell }\), where \(1 \le m \le K\), \(1 \le \ell \le J\), then for each \(x \in \mathcal {C}_{m}^*\) and \(z \in \mathcal {D}_{\ell }^*\) the inequality \(x \le z\) holds.

  2. If \(\overline{y}(\mathcal {C}_{m}^*) > c_{\ell }\), where \(1 \le m \le K\), \(1 \le \ell \le J\), then for each \(x \in \mathcal {C}_{m}^*\) and \(z \in \mathcal {D}_{\ell }^*\) the inequality \(x \ge z\) holds.

The proofs of Lemmas 1–3 are carried out by contradiction, using the following equality:

$$\begin{aligned} ( x - c_{m} )^2 + ( z - c_{\ell } )^2 = 2 (x - z) (c_{\ell } - c_{m}) + ( z - c_{m} )^2 + ( x - c_{\ell } )^2 . \end{aligned}$$

The validity of this equality follows from the well-known formula for the sum of the squares of the diagonals of a trapezoid.
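Alternatively, the identity can be verified by direct expansion: all purely quadratic terms on the two sides coincide, and the remaining cross terms give

$$\begin{aligned} \bigl ( ( x - c_{m} )^2 + ( z - c_{\ell } )^2 \bigr ) - \bigl ( ( z - c_{m} )^2 + ( x - c_{\ell } )^2 \bigr ) = 2 x ( c_{\ell } - c_{m} ) - 2 z ( c_{\ell } - c_{m} ) = 2 ( x - z ) ( c_{\ell } - c_{m} ) . \end{aligned}$$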

Lemma 4

In Problem 1D, for each \(k \in \{1, \ldots , K\}\) and \(j \in \{1, \ldots , J\}\) it is true that \(\overline{y}(\mathcal {C}_{k}^*) \ne c_j\).

Lemma 5

In Problem 1D, for each \(k, j \in \{1, \ldots , K\}\), \(k \ne j\), it is true that \(\overline{y}(\mathcal {C}_{k}^*) \ne \overline{y}(\mathcal {C}_{j}^*)\).

The proofs of Lemmas 4 and 5 are also carried out by contradiction.

Lemmas 1–5 establish the relative position of the optimal clusters \(\mathcal {D}_1^*, \ldots , \mathcal {D}_{J}^*\) and \(\mathcal {C}_{1}^*, \ldots , \mathcal {C}_{K}^*\) on a line. These lemmas form the basis of the following statement.

Theorem 1

Let the points \(y_1, \ldots , y_N\) of \(\mathcal {Y}\) and the points \(c_1, \ldots , c_J\) in Problem 1D be ordered so that

$$\begin{aligned}&y_1< \ldots< y_N,\\&c_1< \ldots < c_J. \end{aligned}$$

Then the optimal partition of \(\mathcal {Y}\) into clusters \(\mathcal {C}_1^*, \ldots , \mathcal {C}_K^*, \mathcal {D}_1^*, \ldots , \mathcal {D}_{J}^*\) corresponds to a partition of the sequence \(1,\ldots ,N\) of positive integers into disjoint segments.

4 Polynomial Solvability of the Problem in the 1D Case

The following theorem is the main result of the paper.

Theorem 2

There exists an algorithm that finds an optimal solution of Problem 1D in polynomial time.

Our proof of Theorem 2 is constructive. Namely, we justify an algorithm that implements a dynamic programming scheme and finds an exact solution of Problem 1D in \(\mathcal {O}(KJN^2)\) time.

The idea of the proof is as follows. Without loss of generality, we assume that the points \(y_1, \ldots , y_N\) of \(\mathcal {Y}\), as well as the points \(c_1, \ldots , c_J\), are ordered as in Theorem 1.

Let \(\mathcal {Y}_{s,t} = \{ y_s, \ldots , y_t \}\), where \(1 \le s \le t \le N\), be the subset of \(t - s + 1\) points of \(\mathcal {Y}\) with indices from s to t.

Let

$$\begin{aligned} f_{s, t}^{j} = \sum _{i = s}^t ( y_i - c_j )^2 , \,\,\,\,j=1,\ldots ,J, \end{aligned}$$
$$\begin{aligned} f_{s, t} = \sum _{i = s}^t ( y_i - \overline{y}(\mathcal {Y}_{s,t}) )^2 , \end{aligned}$$

where \(\overline{y}(\mathcal {Y}_{s,t})\) is the centroid of the subset \(\mathcal {Y}_{s,t}\).
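Both interval costs admit \(\mathcal {O}(1)\) evaluation after \(\mathcal {O}(N)\) preprocessing via prefix sums of \(y_i\) and \(y_i^2\), using \(\sum _{i=s}^{t} (y_i - a)^2 = Q - 2 a S + (t - s + 1) a^2\), where S and Q are the interval sums of \(y_i\) and \(y_i^2\). A sketch of this standard device (ours; 0-based inclusive indices, illustrative names):

```python
# Sketch (ours): O(1) interval costs f_{s,t} and f_{s,t}^j after O(N)
# prefix-sum preprocessing. Indices s, t are 0-based and inclusive.
from itertools import accumulate
from typing import Callable, List, Tuple

def make_cost_functions(y: List[float]) -> Tuple[Callable, Callable]:
    ps = [0.0] + list(accumulate(y))                 # prefix sums of y_i
    pq = [0.0] + list(accumulate(v * v for v in y))  # prefix sums of y_i^2

    def f_centroid(s: int, t: int) -> float:
        """f_{s,t}: scatter of y_s..y_t around its centroid (= Q - S^2/n)."""
        n = t - s + 1
        S, Q = ps[t + 1] - ps[s], pq[t + 1] - pq[s]
        return Q - S * S / n

    def f_center(s: int, t: int, c: float) -> float:
        """f_{s,t}^j: scatter of y_s..y_t around a given center c."""
        n = t - s + 1
        S, Q = ps[t + 1] - ps[s], pq[t + 1] - pq[s]
        return Q - 2.0 * c * S + n * c * c

    return f_centroid, f_center
```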

We prove that the optimal value of the objective function of Problem 1D is given by the formula

$$\begin{aligned} F^* = F_{K,J}(N) , \end{aligned}$$

and the values

$$\begin{aligned} F_{k,j}(n), \,\,\,k = -1, 0, 1, \ldots , K; \,\,\, j = -1, 0, 1, \ldots , J; \,\,\,n = 0, \ldots , N, \end{aligned}$$

are calculated by recurrent formulas. The formula

$$\begin{aligned} F_{k,j}(n) = \left\{ \begin{array}{l} 0, \;\;\;\;\;\;\, \text {if } n = k = j = 0; \\ +\infty , \;\; \text {if } n = 0; \; k = 0, \ldots , K; \; j = 0, \ldots , J; \; k + j \ne 0; \\ +\infty , \;\; \text {if } k = -1; \; j = -1, \ldots , J; \; n = 0, \ldots , N; \\ +\infty , \;\; \text {if } j = -1; \; k = -1, \ldots , K; \; n = 0, \ldots , N; \\ \end{array} \right. \end{aligned}$$
(1)

sets the initial and boundary conditions for subsequent calculations. Formula (1) follows from the properties of the optimal solution. The basic formula

$$\begin{aligned} F_{k,j}(n) = \min \Bigl \{ \min _{i = 1}^n \Bigl \{ F_{k-1, j}(i-1) + f_{i, n} \Bigr \} , \; \min _{i = 1}^n \Bigl \{ F_{k, j-1}(i-1) + f_{i, n}^j \Bigr \} \Bigr \}, \\ k = 0, \ldots , K; \; j = 0, \ldots , J; \; n = 1, \ldots , N, \end{aligned}$$
(2)

defines the recursion. Together, formulas (1) and (2) implement the forward part of the algorithm.
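For illustration, a direct transcription of formulas (1) and (2) might look as follows. This is our sketch (the name forward_dp is ours), using make_cost_functions from the snippet above; the \(k = -1\) and \(j = -1\) rows of formula (1) are handled implicitly by bounds checks:

```python
# Sketch (ours) of the forward pass, formulas (1)-(2). Assumes y and c are
# sorted ascending, as in Theorem 1. Points are 1-based in the DP indexing,
# so the last cluster of the first n points consists of y_i, ..., y_n.
import math
from typing import List

def forward_dp(y: List[float], c: List[float], K: int) -> list:
    N, J = len(y), len(c)
    f_centroid, f_center = make_cost_functions(y)  # O(1) interval costs
    INF = math.inf
    # F[k][j][n]: formula (1) gives 0 at (0,0,0) and +inf on the boundaries.
    F = [[[INF] * (N + 1) for _ in range(J + 1)] for _ in range(K + 1)]
    F[0][0][0] = 0.0
    for k in range(K + 1):
        for j in range(J + 1):
            for n in range(1, N + 1):
                best = INF
                for i in range(1, n + 1):
                    if k > 0:   # close y_i..y_n as a centroid cluster C_k
                        best = min(best, F[k-1][j][i-1] + f_centroid(i-1, n-1))
                    if j > 0:   # close y_i..y_n as a given-center cluster D_j
                        best = min(best, F[k][j-1][i-1] + f_center(i-1, n-1, c[j-1]))
                F[k][j][n] = best
    return F  # the optimal value F* is F[K][J][N]
```

The triple loop over \((k, j, n)\) with the inner minimum over i directly yields the \(\mathcal {O}(KJN^2)\) bound discussed below.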

Further, we have proved that the optimal clusters \(\mathcal {C}_1^*, \ldots , \mathcal {C}_K^*, \mathcal {D}_1^*, \ldots , \mathcal {D}_{J}^*\) may be found using the following recurrent rule, which implements the backward part of the algorithm.

The step-by-step rule is as follows:

  • Step 0. \(k := K\), \(j := J\), \(n := N\).

  • Step 1. If

    $$\begin{aligned} \min _{i = 1}^n \Bigl ( F_{k-1, j}(i-1) + f_{i, n} \Bigr ) \le \min _{i = 1}^n \Bigl ( F_{k, j-1}(i-1) + f_{i, n}^j \Bigr ) , \end{aligned}$$

    then

    $$\begin{aligned} \mathcal {C}_k^* = \{ y_{i^*}, y_{i^*+1}, \ldots , y_{n} \}, \end{aligned}$$

    where

    $$\begin{aligned} i^* = \arg \min \limits _{i = 1}^n \Bigl ( F_{k-1, j}(i-1) + f_{i, n} \Bigr ) ; \end{aligned}$$

    \(k := k - 1\); \(n := i^* - 1\). If, however,

    $$\begin{aligned} \min _{i = 1}^n \Bigl ( F_{k-1, j}(i-1) + f_{i, n} \Bigr ) > \min _{i = 1}^n \Bigl ( F_{k, j-1}(i-1) + f_{i, n}^j \Bigr ) , \end{aligned}$$

    then

    $$\begin{aligned} \mathcal {D}_j^* = \{ y_{i^*}, y_{i^*+1}, \ldots , y_{n} \}, \end{aligned}$$

    where

    $$\begin{aligned} i^* = \arg \min \limits _{i = 1}^n \Bigl ( F_{k, j-1}(i-1) + f_{i, n}^j \Bigr ) ; \end{aligned}$$

    \(j := j - 1\); \(n := i^* - 1\).

  • Step 2. If \(k > 0\) or \(j > 0\), then go to Step 1; otherwise, the calculations are complete.

We have proved the validity of this rule by induction.
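A possible transcription of the backward rule (again our sketch, reusing make_cost_functions and the table F produced by forward_dp above, and assuming a feasible instance with \(N \ge K + J\)):

```python
# Sketch (ours) of the backward pass: starting from (K, J, N), re-derive the
# argmin of formula (2) to peel off the last cluster, then shrink n.
import math
from typing import List, Tuple

def backtrack(y: List[float], c: List[float], K: int, F: list) -> Tuple[list, list]:
    N, J = len(y), len(c)
    f_centroid, f_center = make_cost_functions(y)
    C, D = [None] * K, [None] * J   # C[k-1] holds cluster C_k, D[j-1] holds D_j
    k, j, n = K, J, N
    while k > 0 or j > 0:
        best_c = best_d = math.inf
        i_c = i_d = None
        for i in range(1, n + 1):
            if k > 0:
                v = F[k-1][j][i-1] + f_centroid(i-1, n-1)
                if v < best_c:
                    best_c, i_c = v, i
            if j > 0:
                v = F[k][j-1][i-1] + f_center(i-1, n-1, c[j-1])
                if v < best_d:
                    best_d, i_d = v, i
        if best_c <= best_d:            # Step 1, first branch: cluster C_k
            C[k-1] = y[i_c-1:n]         # points y_{i*}, ..., y_n
            k, n = k - 1, i_c - 1
        else:                           # Step 1, second branch: cluster D_j
            D[j-1] = y[i_d-1:n]
            j, n = j - 1, i_d - 1
    return C, D
```

For example, `F = forward_dp(y, c, K)` followed by `C, D = backtrack(y, c, K, F)` recovers an optimal partition, with \(F^* = F[K][J][N]\).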

Finally, we have proved that the running time of the algorithm is \(\mathcal {O}(KJN^2)\); that is, the algorithm is polynomial. The algorithm's running time is determined by the complexity of implementing formula (2). This formula is evaluated \(\mathcal {O}(KJN)\) times, and every evaluation of \(F_{k,j}(n)\) requires \(\mathcal {O}(N)\) operations.

5 Conclusion

In the present paper, we have proved the polynomial solvability of the one-dimensional case of a strongly NP-hard problem of partitioning a finite set of points in Euclidean space into clusters. Constructing efficient approximation algorithms with guaranteed accuracy bounds for the general case of Problem 1, as well as faster polynomial-time exact algorithms for the 1D case of this problem, seems to be a promising direction for future studies.