1 Introduction

The subject of this research is a family of hard-to-solve discrete optimization problems that model some of the simplest applied problems of cluster analysis and data interpretation. Our goal is to analyze and systematize the issues of constructing efficient algorithms that solve these problems quickly and exactly in the one-dimensional case.

It is known that, in the computer processing of large-scale data [1], the existing exact and approximate polynomial-time algorithms with theoretical accuracy guarantees but quadratic or higher running time are often unclaimed in applications [2, 3]. In other words, these rigorously justified polynomial-time algorithms are rarely, if ever, used in practice because of their “large” (quadratic and higher) running time. On the other hand, many hard-to-solve computational geometry problems arising in data analysis and interpretation [4, 5] are solvable in polynomial time when the space dimension is fixed. At the same time, fast polynomial-time algorithms are efficient tools for discovering the structure of large data (i.e., data mining [4]) by projecting it into spaces of lower dimension (for example, into three-dimensional space, a plane, or the number line). These projective mathematical tools are popular among data analysts [6] since they allow one to interpret data through visual representation. In this connection, the construction of fast algorithms with linear or almost linear running time for special cases of the problems (in which the space dimension is fixed) is an important direction of mathematical research. This paper belongs to this direction.

The paper has the following structure. Section 2 presents the mathematical formulations of the problems under consideration. Interpretations of the problems are presented in the next section to demonstrate their origins and their connection with problems of data analysis. Section 4 provides a brief overview of results on the computational complexity of the problems. In the next section, we present existing results on the polynomial solvability of the problems in the case of fixed space dimension. Finally, in Sect. 6, existing and new algorithms are presented that find exact solutions of the problems in linear or almost linear time in the one-dimensional case.

2 Formulations of the Problems

Everywhere below \(\mathbb {R}\) denotes the set of real numbers, \(\Vert \cdot \Vert \) denotes the Euclidean norm, and \(\langle \cdot , \cdot \rangle \) denotes the scalar product.

All the problems considered below are problems of 2-partitioning an input set of points. In the problems of searching for a single subset, the second cluster is understood as the subset that complements this cluster to the input set. A point in d-dimensional space is interpreted either as the result of measuring a set of d characteristics (features) of an object, or as a vector (force), i.e., as the segment directed from the origin to this point in the space.

The problems under consideration have the following formulations.

Problem 1

(Longest Normalized Vector Sum). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space. Find: a nonempty subset \(\mathcal {C} \subseteq \mathcal {Y}\) such that

$$\begin{aligned} F(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \Bigl \Vert \sum \limits _{y \in \mathcal {C}} y \Bigr \Vert ^2 \rightarrow \max . \end{aligned}$$

Problem 2

(1-Mean and Given 1-Center Clustering). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space. Find: a 2-partition of \(\mathcal {Y}\) into clusters \(\mathcal {C}\) and \(\mathcal {Y} \setminus \mathcal {C}\) such that

$$\begin{aligned} S(\mathcal {C}) = \sum \limits _{y \in \mathcal {C}} \Vert y - \overline{y}(\mathcal {C})\Vert ^2 + \sum \limits _{y \in \mathcal {Y} \setminus \mathcal {C}} \Vert y\Vert ^2 \rightarrow \min , \end{aligned}$$
(1)

where \(\overline{y}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum _{y\in \mathcal {C}}y\) is the centroid of \(\mathcal {C}\).

Problem 3

(Longest M-Vector Sum). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space and some positive integer M. Find: a nonempty subset \(\mathcal {C} \subseteq \mathcal {Y}\) of size M such that

$$\begin{aligned} H(\mathcal {C}) = \Bigl \Vert \sum \limits _{y \in \mathcal {C}} y \Bigr \Vert \rightarrow \max . \end{aligned}$$
(2)

Problem 4

(Constrained 1-Mean and Given 1-Center Clustering). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space and some positive integer M. Find: a 2-partition of \(\mathcal {Y}\) into clusters \(\mathcal {C}\) and \(\mathcal {Y} \setminus \mathcal {C}\) minimizing the value of (1) under constraint \(|\mathcal {C}|=M\).

Problem 5

(Longest Vector Sum). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space. Find: a subset \(\mathcal {C} \subseteq \mathcal {Y}\) maximizing the value of (2).

Problem 6

(M-Variance). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space and some positive integer M. Find: a subset \(\mathcal {C} \subseteq \mathcal {Y}\) of size M such that

$$\begin{aligned} Q(\mathcal {C}) = \sum _{y \in \mathcal {C}} \Vert y - \overline{y}(\mathcal {C})\Vert ^2 \rightarrow \min . \end{aligned}$$

Problem 7

(Maximum Size Subset of Points with Constrained Variance). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space and some real number \(\alpha \in (0, 1)\). Find: a subset \(\mathcal {C} \subset \mathcal {Y}\) of the largest size such that

$$\begin{aligned} Q(\mathcal {C}) \le \alpha \sum _{y \in \mathcal {Y}} \Vert y - \overline{y}(\mathcal {Y})\Vert ^2, \end{aligned}$$

where \(\overline{y}(\mathcal {Y})= \frac{1}{|\mathcal {Y}|} \sum _{y \in \mathcal {Y}} y\) is the centroid of \(\mathcal {Y}\).

Problem 8

(Smallest M-Enclosing Ball). Given: an N-element set \(\mathcal {Y}\) of points in d-dimensional Euclidean space and some positive integer M. Find: a ball of minimum radius that covers M points of \(\mathcal {Y}\).

3 Interpretations and Origins of the Problems

All of the formulated optimization problems have simple interpretations in geometric, statistical, physical, biomedical, geophysical, industrial, economic, anti-terrorism, and social terms. Some interpretations can be found in the papers cited below, and an interested reader can easily devise their own. In this paper, we limit ourselves to a few simple interpretations.

Problems 1–4 arose in connection with the solution of an applied signal processing problem, namely, the problem of jointly detecting a quasiperiodically repeating pulse of unknown shape and evaluating this shape under Gaussian noise with zero mean [7,8,9]. Apparently, the first note on these problems was made in [7]. In these problems, the cluster center fixed at the origin corresponds to the zero mean of the noise.

It should be noted that simpler optimization problems, induced by the applied problems of detecting and discriminating pulses of given shapes under noise, are typical, in particular, of radar, electronic reconnaissance, hydroacoustics, geophysics, technical and medical diagnostics, and space monitoring (see, for example, [10,11,12]).

The objective function of Problem 1 can be rewritten as

$$\begin{aligned} F(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}}\sum \limits _{x \in \mathcal {C}} \langle y, x \rangle = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}} \Big \langle y, \sum \limits _{x \in \mathcal {C}} x \Big \rangle = \sum \limits _{y \in \mathcal {C}} \langle y, \overline{y}(\mathcal {C}) \rangle . \end{aligned}$$

Therefore, maximization Problem 1 can be interpreted as a search for a subset \(\mathcal {C}\) of objects (forces) that are most similar to each other in terms of the average value of the sum of all scalar products. Another interpretation is the search for a subset of forces that are most co-directed with the vector from the origin to the point \(\overline{y}(\mathcal {C})\), i.e., to the unknown centroid. Maximization Problems 1 and 5 have similar interpretations.
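The identity above is easy to check numerically. The following sketch evaluates both sides on a random point set; the helper names `F_norm_form` and `F_scalar_form` are ours, introduced only for illustration:

```python
import random

def F_norm_form(points):
    # F(C) = ||sum of the points of C||^2 / |C|
    d = len(points[0])
    s = [sum(p[k] for p in points) for k in range(d)]
    return sum(x * x for x in s) / len(points)

def F_scalar_form(points):
    # F(C) = sum over y in C of <y, centroid(C)>
    d, n = len(points[0]), len(points)
    c = [sum(p[k] for p in points) / n for k in range(d)]
    return sum(sum(p[k] * c[k] for k in range(d)) for p in points)

random.seed(1)
C = [tuple(random.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(5)]
assert abs(F_norm_form(C) - F_scalar_form(C)) < 1e-9
```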

Apparently, Problem 6 was first formulated in [13]. This problem models one of the simplest data analysis problems, namely, finding a subset of M similar objects in a set of N objects. In this problem, the similarity of objects is interpreted as the minimum total quadratic scatter of the points with respect to some unknown “average” object (the centroid), which may not belong to the input set. An equivalent treatment of similarity is the minimum of the sum of squares of all possible pairwise distances between objects, since the objective function of Problem 6 satisfies the following equality

$$\begin{aligned} Q(\mathcal {C}) = \frac{1}{2|\mathcal {C}|} \sum _{x\in \mathcal {C}} \sum _{y\in \mathcal {C}}\Vert x-y\Vert ^2. \end{aligned}$$
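This equality, too, can be verified directly; the helper names below (`variance_Q`, `pairwise_Q`) are illustrative only:

```python
def variance_Q(points):
    # Q(C): total squared deviation from the centroid of C
    n, d = len(points), len(points[0])
    c = [sum(p[k] for p in points) / n for k in range(d)]
    return sum(sum((p[k] - c[k]) ** 2 for k in range(d)) for p in points)

def pairwise_Q(points):
    # Q(C) = (1 / (2|C|)) * sum of ||x - y||^2 over all ordered pairs
    n, d = len(points), len(points[0])
    total = sum(sum((x[k] - y[k]) ** 2 for k in range(d))
                for x in points for y in points)
    return total / (2 * n)

C = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
assert abs(variance_Q(C) - pairwise_Q(C)) < 1e-9
```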

Problem 7 models [14] the search for the largest subset of similar objects under an upper bound (restriction) on the similarity criterion of Problem 6, i.e., on the total quadratic scatter of points in the desired cluster. In accordance with this restriction, Problem 7 can be interpreted as cleaning data of so-called outliers that violate intracluster homogeneity (see, for example, [15,16,17]). As a result of solving the problem, all data with a significant quadratic scatter will belong to the complementary cluster. The degree of homogeneity of the desired cluster is governed by the given number \(\alpha \).

Finally, Problem 8, formulated in [18] as a generalization of the well-known Chebyshev center problem, has a simple geometric formulation that requires no explanation. On the other hand, in applications this problem arises whenever it is necessary to cover (for example, surround, or locate on a territory) a given number of objects under limited resources (for example, financial or energy resources).

4 The Computational Complexity of Problems: Existing Results

In [19] the authors proved the strong NP-hardness of Problem 1 by polynomially reducing the known NP-hard 3-Satisfiability problem [20] to it. This result implies the strong NP-hardness of Problem 3, since the objective functions of these problems are related by the equality

$$\begin{aligned} F(\mathcal {C}) = \frac{1}{|\mathcal {C}|}H^{2}(\mathcal {C}). \end{aligned}$$

Indeed, it follows from this equality that polynomial solvability of Problem 3 would imply polynomial solvability of Problem 1 (it would suffice to iterate over the finite number of admissible values of M). Note that, chronologically, the NP-hardness of Problem 3 was proved first (however, the NP-hardness of Problem 3 does not imply the NP-hardness of Problem 1). Recall that, to prove the intractability of Problem 3, the authors of [8, 9, 21] constructed a polynomial-time reduction of the known NP-hard Clique problem [20] to Problem 3.

By virtue of equality

$$\begin{aligned} S(\mathcal {C}) = \sum _{y \in \mathcal {Y}} \Vert y \Vert ^2 - F(\mathcal {C}), \end{aligned}$$
(3)

the strong NP-hardness of Problem 2 follows from its polynomial equivalence to Problem 1. The strong NP-hardness of Problem 4 follows from that of Problem 2 for the same reason that the strong NP-hardness of Problem 3 follows from that of Problem 1: polynomial solvability of Problem 4 would imply polynomial solvability of Problem 2.
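Equality (3) itself is easy to verify numerically on a small instance; `S_value` and `F_value` below are hypothetical names for the objective functions \(S\) and \(F\):

```python
def F_value(C):
    # F(C) = ||sum of the points of C||^2 / |C|
    d = len(C[0])
    t = [sum(p[k] for p in C) for k in range(d)]
    return sum(x * x for x in t) / len(C)

def S_value(Y, idx):
    # S(C) for C = {Y[i] : i in idx}: scatter of C about its centroid
    # plus scatter of the complement Y \ C about the origin
    d = len(Y[0])
    C = [Y[i] for i in idx]
    cen = [sum(p[k] for p in C) / len(C) for k in range(d)]
    s = sum(sum((p[k] - cen[k]) ** 2 for k in range(d)) for p in C)
    s += sum(sum(Y[i][k] ** 2 for k in range(d))
             for i in range(len(Y)) if i not in idx)
    return s

Y = [(1.0, 2.0), (3.0, -1.0), (0.0, 4.0), (-2.0, 1.0)]
idx = {0, 2}
norms = sum(sum(x * x for x in p) for p in Y)
assert abs(S_value(Y, idx) - (norms - F_value([Y[i] for i in idx]))) < 1e-9
```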

In [22] the authors proved the strong NP-hardness of Problem 5 by polynomially reducing the known NP-hard 3-Satisfiability problem [20] to it.

Further, the paper [23] presents a proof of the strong NP-hardness of Problem 6. That paper gives a simple proof of the polynomial reducibility of the well-known NP-hard [24] Clique problem on a homogeneous graph of non-fixed degree to Problem 6.

The authors of [14] proved the strong NP-hardness of Problem 7 by showing that decision forms of Problems 6 and 7 are equivalent.

Finally, the paper [18] presents a proof of the strong NP-hardness of Problem 8. To do this, the authors of the cited paper showed the polynomial reducibility of the Clique problem to Problem 8.

5 Exact Algorithms for the Problems in the Multidimensional Case: Existing Results

Exact algorithms of exponential time complexity were constructed for the multidimensional case of Problems 1–6 in a number of papers. These algorithms are polynomial in the case of fixed space dimension (or in the case when the space dimension is bounded from above by some constant).

For Problem 1, an algorithm given in [25] finds the exact solution of the problem in \(\mathcal {O}(d^{2}N^{2d})\) time. The authors of [29] proposed an accelerated algorithm with \(\mathcal {O}(dN^{d+1})\) running time.

Since Problem 2 is polynomially equivalent to Problem 1, the exact solution of Problem 2 can be found in the same time as the exact solution of Problem 1.

Polynomial solvability of Problem 3 in the case of fixed space dimension follows from [28]. The authors of [25] and [29] presented exact algorithms for Problem 3 with running time \(\mathcal {O}(d^{2}N^{2d})\) and \(\mathcal {O}(dN^{d+1})\), respectively.

Problem 4 is polynomially equivalent to Problem 3. Therefore, the solution of Problem 4 can be found in the same time as the solution of Problem 3.

Further, polynomial solvability of Problem 5 in the case of fixed space dimension follows from [27]. The authors of [26] presented an algorithm with \(\mathcal {O}(d^{2}N^{d})\) running time. An improved algorithm with \(\mathcal {O}(dN^{d+1})\) running time is proposed in [29]. In addition, the author of [30] presented a faster algorithm with \(\mathcal {O}(N^{d-1}(d + \log N))\) running time for the case \(d\ge 2\).

The algorithms proposed in [13, 29] find the exact solution of Problem 6 in \(\mathcal {O}(dN^{d+1})\) time. A feature of the algorithm proposed in [29] is that it finds solutions for all admissible values of M at once.

An exact algorithm for Problem 7 can obviously be obtained from the exact algorithm proposed in [29] for Problem 6. The running time of this algorithm is \(\mathcal {O}(dN^{d+1})\).

Finally, the question of polynomial-time solvability of Problem 8 for an arbitrary but fixed space dimension d is still open.

It follows from the above results that, for \(d=1\), these algorithms find solutions in time that depends quadratically on the cardinality N of the input set.

Below we present simple and fast exact algorithms that find solutions of the one-dimensional case of the problems in \(\mathcal {O}(N)\) time (for Problems 3, 4, and 5) or \(\mathcal {O}(N\log N)\) time (for Problems 2, 7, and 8). For completeness, we also present known algorithms for Problems 1 and 6, whose time complexity is \(\mathcal {O}(N\log N)\).

6 Fast and Exact Algorithms for the Problems in the One-Dimensional Case

Hereafter, the one-dimensional (\(d=1\)) cases of Problems 1–8 will be denoted as Problem X-1D, where X is the number of the problem.

Let us formulate algorithms for solving the problems.

figure a (pseudocode of Algorithm \(\mathcal {A}_1\))

Proposition 1

Algorithm \(\mathcal {A}_1\) finds an optimal solution of Problem 1-1D in \(\mathcal {O}(N\log N)\) time.

This algorithm was proposed in [19] for constructing an approximation scheme that is polynomial in the case of fixed space dimension. The same paper established the accuracy and the running time (determined by the sorting step) of the algorithm.
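The pseudocode itself is given only as a figure above; the sketch below is our own hedged reconstruction of an \(\mathcal {O}(N\log N)\) procedure consistent with Proposition 1, not necessarily the exact algorithm of [19]. It rests on the observation that, for a fixed cluster size k, the largest \(|\sum y|\) is attained either by the k largest or by the k smallest points:

```python
def longest_normalized_vector_sum_1d(ys):
    # Problem 1-1D: maximize F(C) = (sum of chosen points)^2 / |C|.
    # For each size k the best |sum| comes from the k largest or the
    # k smallest points, so running sums over the sorted array suffice;
    # the sort dominates, giving O(N log N).
    ys = sorted(ys)
    n = len(ys)
    lo_sum = hi_sum = 0.0
    best_val, best_k, take_high = float("-inf"), 0, True
    for k in range(1, n + 1):
        lo_sum += ys[k - 1]        # sum of the k smallest points
        hi_sum += ys[n - k]        # sum of the k largest points
        for s, high in ((lo_sum, False), (hi_sum, True)):
            val = s * s / k
            if val > best_val:
                best_val, best_k, take_high = val, k, high
    cluster = ys[n - best_k:] if take_high else ys[:best_k]
    return best_val, cluster

val, cluster = longest_normalized_vector_sum_1d([3.0, -1.0, 2.0])
assert abs(val - 12.5) < 1e-9 and cluster == [2.0, 3.0]
```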

figure b (pseudocode of Algorithm \(\mathcal {A}_2\))

Proposition 2

Algorithm \(\mathcal {A}_2\) finds an optimal solution of Problem 2-1D in \(\mathcal {O}(N\log N)\) time.

The validity of the statement follows from the fact that, in accordance with (3), the following holds in the one-dimensional case

$$\begin{aligned} S(\mathcal {C}) = \sum _{y \in \mathcal {Y}} y^2 - F(\mathcal {C}) . \end{aligned}$$

figure c (pseudocode of Algorithm \(\mathcal {A}_3\))

Proposition 3

Algorithm \(\mathcal {A}_3\) finds an optimal solution of Problem 3-1D in \(\mathcal {O}(N)\) time.

The accuracy of the algorithm follows from the fact that, in the one-dimensional case, the function (2) satisfies

$$\begin{aligned} H(\mathcal {C}) = \Bigl | \sum \limits _{y \in \mathcal {C}} y \Bigr | . \end{aligned}$$
(4)

The running time of the algorithm is determined by the time complexity of selecting the M largest (or smallest) elements. This selection can be done in \(\mathcal {O}(N)\) operations without sorting (see, for example, [31]).
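A sketch of such a procedure (our own illustration, not the figure's pseudocode): take the M largest and the M smallest points and keep the candidate with the larger absolute sum. Python's `heapq` gives \(\mathcal {O}(N\log M)\); a true linear-time selection (e.g., median of medians) would match the \(\mathcal {O}(N)\) bound.

```python
import heapq

def longest_m_sum_1d(ys, m):
    # Problem 3-1D: among all subsets of size m, maximize |sum|.
    # The optimum is either the m largest or the m smallest points.
    # heapq.nlargest/nsmallest run in O(N log m); a linear-time
    # selection would yield the O(N) bound of Proposition 3.
    top = heapq.nlargest(m, ys)
    bot = heapq.nsmallest(m, ys)
    return max(abs(sum(top)), abs(sum(bot)))

assert longest_m_sum_1d([-5.0, -4.0, 1.0, 2.0], 2) == 9.0
```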

figure d (pseudocode of Algorithm \(\mathcal {A}_4\))

Proposition 4

Algorithm \(\mathcal {A}_4\) finds an optimal solution of Problem 4-1D in \(\mathcal {O}(N)\) time.

The validity of the statement follows from the polynomial equivalence of Problems 3 and 4.

figure e (pseudocode of Algorithm \(\mathcal {A}_5\))

Proposition 5

Algorithm \(\mathcal {A}_5\) finds an optimal solution of Problem 5-1D in \(\mathcal {O}(N)\) time.

The accuracy of the algorithm follows from (4). The running time follows from the fact that the subsets \(\mathcal {C}_1\) and \(\mathcal {C}_2\) can be constructed in \(\mathcal {O}(N)\) time.
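The construction can be sketched as follows (an illustration under our naming, not the original pseudocode): \(\mathcal {C}_1\) collects the positive points, \(\mathcal {C}_2\) the negative ones, and the candidate with the larger absolute sum is returned.

```python
def longest_sum_1d(ys):
    # Problem 5-1D: choose any subset maximizing |sum|. The optimum is
    # either C1 (all positive points) or C2 (all negative points);
    # both are built in one O(N) pass.
    c1 = [y for y in ys if y > 0]
    c2 = [y for y in ys if y < 0]
    return max(c1, c2, key=lambda c: abs(sum(c)))

assert sorted(longest_sum_1d([1.0, -3.0, 2.0, -4.0])) == [-4.0, -3.0]
```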

figure f (pseudocode of Algorithm \(\mathcal {A}_6\))

Proposition 6

Algorithm \(\mathcal {A}_6\) finds an optimal solution of Problem 6-1D in \(\mathcal {O}(N\log N)\) time.

This statement is based on the fact that each of the values \(\sum _{k = i}^j y_k\) and \(\sum _{k = i}^j y_k^2\) can be found in \(\mathcal {O}(1)\) time using sliding-window sums. This algorithm was recently justified in [32].
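A sketch of such a sliding-window computation (our own illustration): after sorting, the optimal size-M cluster is a window of consecutive points, and \(Q\) of each window follows from its sum and sum of squares.

```python
def m_variance_1d(ys, m):
    # Problem 6-1D: the optimal size-m cluster is a window of the
    # sorted points. Sliding sums update the window sum and sum of
    # squares in O(1), so the sort dominates: O(N log N) overall.
    ys = sorted(ys)
    s = sum(ys[:m])                   # window sum
    s2 = sum(y * y for y in ys[:m])   # window sum of squares
    best_q, best_i = s2 - s * s / m, 0
    for i in range(1, len(ys) - m + 1):
        s += ys[i + m - 1] - ys[i - 1]
        s2 += ys[i + m - 1] ** 2 - ys[i - 1] ** 2
        q = s2 - s * s / m            # Q(window) = sum y^2 - (sum y)^2 / m
        if q < best_q:
            best_q, best_i = q, i
    return best_q, ys[best_i:best_i + m]

assert m_variance_1d([0.0, 10.0, 11.0], 2) == (0.5, [10.0, 11.0])
```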

figure g (pseudocode of Algorithm \(\mathcal {A}_7\))

Proposition 7

Algorithm \(\mathcal {A}_7\) finds an optimal solution of Problem 7-1D in \(\mathcal {O}(N\log N)\) time.

The accuracy of the algorithm follows from the monotonicity property [14] of the function \(Q(\mathcal {C})\). Sorting determines the running time, since the calculations in Step 2 can be performed in \(\mathcal {O}(1)\) time using prefix sums, and this step is executed at most \(\mathcal {O}(N)\) times.
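One plausible \(\mathcal {O}(N\log N)\) realization (our sketch, not necessarily the procedure of [14]) combines prefix sums with a two-pointer sweep over the sorted points, using the fact that adding a point never decreases \(Q\):

```python
def max_subset_constrained_variance_1d(ys, alpha):
    # Problem 7-1D: find the largest cluster C with Q(C) <= alpha * Q(Y).
    # After sorting, the minimum-Q cluster of each size is a window of
    # consecutive points, and Q only grows as a window is extended
    # (the monotonicity property), so a two-pointer sweep finds the
    # longest feasible window; the sort dominates the running time.
    ys = sorted(ys)
    n = len(ys)
    pref, pref2 = [0.0], [0.0]
    for y in ys:                      # prefix sums and sums of squares
        pref.append(pref[-1] + y)
        pref2.append(pref2[-1] + y * y)

    def q(i, j):
        # Q of the window ys[i:j] (requires j > i)
        s, s2, m = pref[j] - pref[i], pref2[j] - pref2[i], j - i
        return s2 - s * s / m

    bound = alpha * q(0, n)           # alpha * Q(Y)
    best, i = [], 0
    for j in range(1, n + 1):
        while q(i, j) > bound:        # shrink until the window is feasible
            i += 1
        if j - i > len(best):
            best = ys[i:j]
    return best

assert max_subset_constrained_variance_1d([0.0, 1.0, 10.0], 0.5) == [0.0, 1.0]
```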

figure h (pseudocode of Algorithm \(\mathcal {A}_8\))

Proposition 8

Algorithm \(\mathcal {A}_8\) finds an optimal solution of Problem 8-1D in \(\mathcal {O}(N\log N)\) time.

The accuracy of the algorithm follows from the fact that, in the one-dimensional case, a minimum-radius ball enclosing the points \(y_{i}, y_{i+1}, \ldots , y_{j}\), where \(y_{i}< y_{i+1}< \ldots < y_{j}\), has radius \((y_{j} - y_{i})/2\). The running time of the algorithm is determined by sorting.
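A sketch of the resulting procedure (illustrative naming, not the figure's pseudocode): after sorting, scan all windows of M consecutive points and keep the narrowest one.

```python
def smallest_m_enclosing_ball_1d(ys, m):
    # Problem 8-1D: a minimum-radius ball (interval) covering m points
    # spans m consecutive points of the sorted array, so one linear
    # scan after sorting finds it; the sort dominates: O(N log N).
    ys = sorted(ys)
    best = min(range(len(ys) - m + 1),
               key=lambda i: ys[i + m - 1] - ys[i])
    radius = (ys[best + m - 1] - ys[best]) / 2
    center = (ys[best + m - 1] + ys[best]) / 2
    return radius, center

assert smallest_m_enclosing_ball_1d([0.0, 5.0, 6.0, 20.0], 2) == (0.5, 5.5)
```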

7 Conclusion

The paper provides a brief overview of the complexity of some recently identified optimization problems of 2-clustering a finite set of points in Euclidean space. We present fast and exact algorithms for the one-dimensional cases of these problems. In our opinion, these algorithms will serve as a good tool for the projective analysis and interpretation of big data.