
1 Introduction

This study addresses two strongly NP-hard problems of clustering a finite sequence of points in Euclidean space. Our goal is to construct randomized algorithms for these problems. The research is motivated by the fact that the considered problems arise in mathematical time series analysis, approximation, and discrete optimization, and by their importance for applications such as signal analysis and recognition, remote object monitoring, etc. (see the next section and the references therein).

The paper has the following structure. In Sect. 2, the formulations of the problems are given and the known results are listed. Section 3 contains the auxiliary problem and an algorithm for solving it, which are needed to construct our proposed algorithms. In Sect. 4, the randomized algorithms for the considered problems are presented.

2 Problems Formulation, Related Problems, and Known Results

We consider the following two problems.

Problem 1

Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\) and positive integers \(T_{\min }\), \(T_{\max }\) and \(M > 1\). Find a subset \(\mathcal {M} = \{n_1, \ldots , n_M\} \subseteq \mathcal {N} = \{1, \ldots , N\}\) of the index set of \(\mathcal {Y}\) such that

$$\begin{aligned} F_{1}(\mathcal {M}) = \sum _{j \in \mathcal {M}} \Vert y_{j} - \overline{y}(\mathcal {M}) \Vert ^2\longrightarrow \min \;, \end{aligned}$$

where \(\overline{y} (\mathcal {M})=\frac{1}{|\mathcal {M}|}\sum _{i \in \mathcal {M}} y_{i}\) is the centroid of \(\{ y_{j} \,|\, j \in \mathcal {M} \}\), under the constraints

$$\begin{aligned} T_{\min } \le n_m - n_{m - 1} \le T_{\max } \le N , \,\,\,\, m = 2, \ldots , M \;, \end{aligned}$$
(1)

on the elements of the tuple \((n_1, \ldots , n_M)\).
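For concreteness, the objective \(F_1\) and constraint (1) can be sketched in Python as follows; the function names and the use of 0-based indices are ours, not the paper's.

```python
import numpy as np

def is_feasible(indices, t_min, t_max):
    """Check constraint (1): t_min <= n_m - n_{m-1} <= t_max
    for every pair of consecutive chosen indices."""
    return all(t_min <= b - a <= t_max
               for a, b in zip(indices, indices[1:]))

def f1(points, indices):
    """Objective F1: sum of squared Euclidean distances of the
    chosen points to their centroid."""
    chosen = points[list(indices)]
    centroid = chosen.mean(axis=0)
    return float(((chosen - centroid) ** 2).sum())
```

For example, choosing the first two of the points \((0,0), (2,0)\) gives the centroid \((1,0)\) and \(F_1 = 2\).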

Problem 2

Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\) and positive integers \(T_{\min },\, T_{\max }\), and \(M > 1\). Find a subset \(\mathcal {M} = \{n_1, \ldots , n_M\} \subseteq \mathcal {N}=\{1,\dots ,N\}\) of the index set of \(\mathcal {Y}\) such that

$$\begin{aligned} F_{2}(\mathcal {M})=|\mathcal {M}| \sum \limits _{j\in \mathcal {M}}\Vert y_j-\overline{y}(\mathcal {M})\Vert ^2 + |\mathcal {N}\setminus \mathcal {M}|\sum \limits _{i\in \mathcal {N}\setminus \mathcal {M}}\Vert y_i\Vert ^2\longrightarrow \min \;, \end{aligned}$$

where \(\overline{y}(\mathcal {M}) = \frac{1}{|\mathcal {M}|} \sum _{i \in \mathcal {M}}y_i\) is the centroid of \(\{y_j\,|\, j\in \mathcal {M}\}\), under the constraints (1) on the elements of the tuple \((n_1, \ldots , n_M)\).
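The objective \(F_2\) weights each cluster's scatter by its cardinality, with the complement cluster centered at the origin. A minimal sketch (0-based indices; names are ours):

```python
import numpy as np

def f2(points, indices):
    """Objective F2: cardinality-weighted within-cluster sums of
    squared distances. The chosen cluster uses its centroid; the
    complement cluster has the fixed center 0 (the origin)."""
    chosen = points[list(indices)]
    rest = np.delete(points, list(indices), axis=0)
    centroid = chosen.mean(axis=0)
    term1 = len(chosen) * ((chosen - centroid) ** 2).sum()
    term2 = len(rest) * (rest ** 2).sum()
    return float(term1 + term2)
```

For example, with points \((0,0), (2,0), (3,4)\) and the first two chosen, \(F_2 = 2 \cdot 2 + 1 \cdot 25 = 29\).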

Problem 1 is induced by the following applied problem. Suppose we are given a sequence \(\mathcal {Y}\) of N time-ordered measurements of d numerical characteristics of some object. Exactly M of these measurements correspond to a repeating (identical) state of the object. Each measurement result contains an error, and the correspondence between the measurement results and the states of the object is unknown. However, it is known that the time interval between two consecutive identical states is bounded from below and above by the specified constants \(T_{\min }\) and \(T_{\max }\). It is required to find the subsequence of indices corresponding to the measurements of the repeated state of the object.

In the special case when \(T_{\min } = 1\) and \(T_{\max } = N\), Problem 1 is equivalent to the well-known M-variance problem (see, e.g.,  [1]). A list of known results for the M-variance problem can be found in  [2].

When \(T_{\min }\) and \(T_{\max }\) are parameters, Problem 1 is strongly NP-hard for any \(T_{\min } < T_{\max }\)  [3]. When \(T_{\min } = T_{\max }\), it is solvable in polynomial time.

In  [4], a 2-approximation algorithm with \(\mathcal {O}(N^2 (MN + d))\) running time was proposed.

An exact algorithm for the case of integer inputs was substantiated in  [5]. When the space dimension is fixed, this algorithm is pseudopolynomial and runs in \(\mathcal {O}(N^3 (MD)^d)\) time, where D is the maximum absolute value of the coordinates of the input points.

In  [6], an FPTAS was presented for the case of Problem 1 when the space dimension is fixed. Given relative error \(\varepsilon \), this algorithm finds a \((1+\varepsilon )\)-approximate solution to the problem in \(\mathcal {O}(M N^3 (1/\varepsilon )^{q/2})\) time.

Problem 2 models the following applied problem. As in Problem 1, we have a sequence \(\mathcal {Y}\) of N time-ordered measurement results for d characteristics of some object. This object can be in two different states (active and passive, for example). Each measurement has an error, and the correspondence between the elements of the input sequence and the states is unknown. It is known that the object was in the active state exactly M times (or the probability of the active state is \(\frac{M}{N}\)) and that the time interval between every two consecutive active states is bounded from below and above by some constants \(T_{\min }\) and \(T_{\max }\). It is required to find a 2-partition of the input sequence and to evaluate the object's characteristics.

If \(T_{\min } = 1\) and \(T_{\max } = N\), Problem 2 is equivalent to the Cardinality-weighted variance-based 2-clustering with given center problem. A list of known results for this special case can be found in  [8].

The Cardinality-weighted variance-based 2-clustering with given center problem is related, but not equivalent, to the well-known Min-sum all-pairs 2-clustering problem (see, e.g.,  [9, 10]). Many algorithmic results are known for the latter problem, but they are not directly applicable to the former.

Problem 2 is strongly NP-hard [11]. Only two algorithmic results have been proposed for this problem so far.

An exact pseudopolynomial algorithm was proposed in [11] for the case of integer instances and the fixed space dimension d. The running time of this algorithm is \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d)(2MD+1)^d)\), where D is the maximum absolute value of coordinates of the input points.

In [12], a 2-approximation algorithm was presented. The running time of the algorithm is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\).

The main results of this paper are randomized algorithms for Problems 1 and 2. Given \(\varepsilon > 0\) and \(\gamma \in (0, 1)\), and under the assumption \(M \ge \beta N\) for some \(\beta \in (0, 1)\), these algorithms find a \((1+\varepsilon )\)-approximate solution with probability not less than \(1-\gamma \) in \(\mathcal {O}(dMN^2)\) time. We also establish conditions under which these algorithms are asymptotically exact (i.e., they find a \((1 + \varepsilon _N)\)-approximate solution with probability \(1 - \gamma _N\), where \(\varepsilon _N, \gamma _N \rightarrow 0\)) and run in \(\mathcal {O}(d M N^3)\) time.

3 Auxiliary Problem

To construct the algorithms for Problems 1 and 2, we need the following auxiliary problem.

Problem 3

Given a sequence g(n), \(n = 1, \ldots , N\), of real values, positive integers \(T_{\min }\), \(T_{\max }\) and \(M > 1\). Find a subset \(\mathcal {M} = \{ n_1, \ldots , n_M \} \subseteq \mathcal {N}\) of indices of sequence elements such that

$$\begin{aligned} G(\mathcal {M}) = \sum _{i \in \mathcal {M}} g(i) \rightarrow \min \;, \end{aligned}$$

under constraints (1) on the elements of the tuple \(( n_1, \ldots , n_M)\).

The following algorithm finds an optimal solution of Problem 3.

Algorithm \(\mathcal {A}\) (pseudocode figure)

Remark 1

It follows from  [4, 7] that Algorithm \(\mathcal {A}\) finds the optimal solution of Problem 3 in \(\mathcal {O}(N M (T_{\max } - T_{\min } + 1))\) time.
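The complexity bound in Remark 1 is consistent with the standard dynamic-programming scheme for such gap-constrained selection problems. The following Python sketch is our reconstruction under that assumption (the pseudocode of Algorithm \(\mathcal {A}\) itself is given only as a figure); the function name and 0-based indexing are ours.

```python
import math

def solve_problem3(g, m_total, t_min, t_max):
    """Dynamic program for Problem 3 over 0-based indices.
    dp[m][n] = minimum of the sum of g over m+1 chosen indices,
    over all feasible tuples whose last chosen index is n."""
    n_total = len(g)
    dp = [[math.inf] * n_total for _ in range(m_total)]
    back = [[-1] * n_total for _ in range(m_total)]
    dp[0] = list(g)
    for m in range(1, m_total):
        for n in range(n_total):
            # the previous index p must satisfy t_min <= n - p <= t_max
            for p in range(max(0, n - t_max), n - t_min + 1):
                cand = dp[m - 1][p] + g[n]
                if cand < dp[m][n]:
                    dp[m][n], back[m][n] = cand, p
    last = min(range(n_total), key=lambda j: dp[m_total - 1][j])
    if dp[m_total - 1][last] == math.inf:
        return None, math.inf  # no feasible tuple exists
    sol = [last]
    for m in range(m_total - 1, 0, -1):
        sol.append(back[m][sol[-1]])
    return sol[::-1], dp[m_total - 1][last]
```

The inner loop examines at most \(T_{\max } - T_{\min } + 1\) predecessors for each of the \(NM\) states, which matches the \(\mathcal {O}(N M (T_{\max } - T_{\min } + 1))\) bound of Remark 1.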

4 Randomized Algorithms

Below is a randomized algorithm for Problem 1.

Algorithm \(\mathcal {A}_1\) (pseudocode figure)

The next randomized algorithm allows one to find an approximate solution of Problem 2.

Algorithm \(\mathcal {A}_2\) (pseudocode figure)
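Since the pseudocode of \(\mathcal {A}_1\) is given only as a figure, the following Python sketch is a hypothetical reconstruction for Problem 1, consistent with the proof idea below: draw a multiset \(\mathcal {T}\) of k indices uniformly at random, try the centroid of every nonempty sub-multiset of the sampled points as a surrogate center c, reduce to the auxiliary Problem 3 with \(g(n) = \Vert y_n - c \Vert ^2\), and keep the best candidate under the true objective \(F_1\). All function names and the sampling scheme are our assumptions, not a transcription of the figure.

```python
import itertools
import math
import random

import numpy as np

def _solve_p3(g, m_total, t_min, t_max):
    # DP for the auxiliary Problem 3 (0-based indices)
    n_total = len(g)
    dp = [[math.inf] * n_total for _ in range(m_total)]
    back = [[-1] * n_total for _ in range(m_total)]
    dp[0] = list(g)
    for m in range(1, m_total):
        for n in range(n_total):
            for p in range(max(0, n - t_max), n - t_min + 1):
                if dp[m - 1][p] + g[n] < dp[m][n]:
                    dp[m][n], back[m][n] = dp[m - 1][p] + g[n], p
    last = min(range(n_total), key=lambda j: dp[m_total - 1][j])
    if dp[m_total - 1][last] == math.inf:
        return None
    sol = [last]
    for m in range(m_total - 1, 0, -1):
        sol.append(back[m][sol[-1]])
    return sol[::-1]

def _f1(points, indices):
    # true objective F1: scatter around the centroid of the chosen points
    chosen = points[indices]
    c = chosen.mean(axis=0)
    return float(((chosen - c) ** 2).sum())

def randomized_a1(points, m_total, t_min, t_max, k, rng=random):
    """Hypothetical sketch of algorithm A1 (sampling scheme assumed)."""
    n_total = len(points)
    sample = [rng.randrange(n_total) for _ in range(k)]  # multiset T
    best_sol, best_val = None, math.inf
    for r in range(1, k + 1):
        for subset in itertools.combinations(sample, r):
            c = points[list(subset)].mean(axis=0)  # surrogate center
            g = [float(((y - c) ** 2).sum()) for y in points]
            sol = _solve_p3(g, m_total, t_min, t_max)
            if sol is None:
                continue
            val = _f1(points, sol)
            if val < best_val:
                best_sol, best_val = sol, val
    return best_sol, best_val
```

For fixed k this tries a constant number of candidate centers, each costing \(\mathcal {O}(dN)\) to build g plus one run of the auxiliary DP, which is compatible with the \(\mathcal {O}(dMN^2)\) bound of Theorem 1 below.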

The following theorem describes the properties of algorithms \(\mathcal {A}_1\) and \(\mathcal {A}_2\).

Theorem 1

Assume that in Problems 1 and 2, \(M \ge \beta N\) for \(\beta \in (0, 1)\). Then, given \(\varepsilon > 0\) and \(\gamma \in (0, 1)\), for a fixed parameter

$$\begin{aligned} k = \max \left( \left\lceil \frac{2}{\beta } \left\lceil \frac{2}{\gamma \varepsilon } \right\rceil \right\rceil , \, \left\lceil \frac{8}{\beta } \ln \frac{2}{\gamma } \right\rceil \right) \end{aligned}$$

algorithms \(\mathcal {A}_1\) and \(\mathcal {A}_2\) find \((1+\varepsilon )\)-approximate solutions of Problems 1 and 2 with probability at least \(1 - \gamma \) in \(\mathcal {O}(d M N^2)\) time.
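As a quick numerical illustration, the parameter k of Theorem 1 can be computed directly from the formula above (the function name is ours):

```python
import math

def sample_size(beta, gamma, eps):
    """k from Theorem 1: the maximum of the accuracy-driven and
    the concentration-driven lower bounds on the sample size."""
    k1 = math.ceil((2.0 / beta) * math.ceil(2.0 / (gamma * eps)))
    k2 = math.ceil((8.0 / beta) * math.log(2.0 / gamma))
    return max(k1, k2)
```

For instance, \(\beta = 1\), \(\gamma = 0.5\), \(\varepsilon = 1\) gives \(k = \max(8, \lceil 8 \ln 4 \rceil) = 12\). Note that k depends only on \(\beta\), \(\gamma\), \(\varepsilon\), not on N, which is why it enters the running time only as a constant factor.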

Finally, in the next theorem, conditions are established under which algorithms \(\mathcal {A}_1\) and \(\mathcal {A}_2\) are polynomial and asymptotically exact.

Theorem 2

Assume that in Problems 1 and 2, \(M \ge \beta N\) for some \(\beta \in (0, 1)\). Then, for fixed \(k = \lceil \log _2 N \rceil \), algorithms \(\mathcal {A}_1\) and \(\mathcal {A}_2\) find \((1 + \varepsilon _N)\)-approximate solutions of Problems 1 and 2 with probability \(1 - \gamma _N\) in \(\mathcal {O}(d M N^3)\) time, where \(\varepsilon _N, \gamma _N \rightarrow 0\).

The idea of the proofs of Theorems 1 and 2 is to estimate the probability of the events \(F_i(\mathcal {M}_{\mathcal {A}_i}) \ge (1 + \frac{1}{\delta t}) F_i(\mathcal {M}_i^*)\) in the case when the multiset \(\mathcal {T}\) contains at least t elements of the optimal solution \(\mathcal {M}_i^*\), where \(\delta \in \mathbb {R}\), \(t \in \mathbb {N}\), \(i = 1, 2\). To do this, we use Markov's inequality. Then, using Chernoff's inequality, we show that it is sufficient to put \(\delta = \gamma / 2\), \(t = \lceil 2 / (\gamma \varepsilon ) \rceil \) in Theorem 1 and \(\delta = (\log _2 N)^{-1/2}\), \(t = \lceil k M / (2 N) \rceil \) in Theorem 2.

5 Conclusion

In the present paper, we have proposed randomized algorithms for two sequence clustering problems. The algorithms find \((1+\varepsilon )\)-approximate solutions with probability not less than \(1-\gamma \) in \(\mathcal {O}(dMN^2)\) time. Conditions are found under which the algorithms are polynomial and asymptotically exact.

In our opinion, the algorithms presented in this paper can be used to quickly obtain solutions to large-scale applied problems of signal analysis and recognition.