
1 Introduction

The subject of this study is one quadratic cardinality-weighted problem of partitioning a sequence of points in Euclidean space into two subsequences of given sizes under additional constraints. The goal of our study is to analyze the computational complexity of the problem and to substantiate a polynomial-time approximation algorithm for it. The motivation of our study is the relevance of the considered problem, for example, to data mining and data clustering, when the data at hand is a time series.

The paper is organized as follows. In Sect. 2 the problem formulation and its interpretation are presented; there we also present one closely related problem with some known results for it, and analyze the complexity of the problem. In Sect. 3 we formulate an auxiliary problem and some statements that underlie the quality estimates for the proposed algorithm. Section 4 contains the approximation algorithm for the problem considered, together with an analysis of its properties.

2 Problem Formulation, Related Problem, and Complexity

Everywhere below \(\mathbb {R}\) denotes the set of real numbers, \(\Vert \cdot \Vert \) denotes the Euclidean norm, and \(\langle \cdot , \cdot \rangle \) denotes the scalar product.

We consider the following

Problem 1

Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\) and positive integers \(T_{\min },\, T_{\max }\), and \(M>1\). Find a subset \(\mathcal {M}=\{n_1, n_2, \dots \} \subset \mathcal {N}=\{1,\dots ,N\}\) of indices in \(\mathcal {Y}\) such that

$$\begin{aligned} F(\mathcal {M})=M\sum \limits _{n\in \mathcal {M}}\Vert y_n-\overline{y}(\{y_n| n\in \mathcal {M}\})\Vert ^2 +(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}}\Vert y_n\Vert ^2\longrightarrow \min , \end{aligned}$$
(1)

where \(\overline{y}(\{y_n| n\in \mathcal {M}\}) = \frac{1}{|\mathcal {M}|} \sum _{i \in \mathcal {M}}y_i\) is the centroid of \(\{y_n| n\in \mathcal {M}\}\), with the following constraints:

$$\begin{aligned} T_{\min }\le n_m-n_{m-1}\le T_{\max },\quad m=2,\dots ,M, \end{aligned}$$
(2)

and \(|\mathcal {M}|=M\).

The following applied interpretation can be proposed for Problem 1. As the input, we have a sequence \(\mathcal {Y}\) of N measurement results. This sequence is time-ordered (i.e., the input is a time series or a discrete signal). The measurements are taken for d characteristics of some object. There are two possible states for this object (active and passive, for example). We know that the object was in the active state exactly M times (or the probability of the active state is \(\frac{M}{N}\)). We also know that each measurement contains an error. The correspondence between the elements of the input sequence and the states is unknown, but the time interval between every two consecutive active states is bounded from below and above by the constants \(T_{\min }\) and \(T_{\max }\). The problem is to find a 2-partition of the input sequence and to evaluate the object's characteristics (i.e., \(\overline{y}(\{y_n| n\in \mathcal {M}\})\) in accordance with (1)).
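To make the interpretation concrete, here is a minimal Python sketch of the two ingredients just described: the cardinality-weighted objective minimized in (1) and the gap constraints (2) on consecutive chosen indices. The names `objective` and `feasible` are ours, and indices are 0-based here, unlike the 1-based indices used in the text.

```python
def objective(points, chosen):
    """Objective (1) of Problem 1: M times the scatter of the chosen points
    about their centroid, plus (N - M) times the squared norms of the rest.
    `chosen` is a set of 0-based indices with |chosen| = M."""
    N, M, d = len(points), len(chosen), len(points[0])
    c = [sum(points[i][k] for i in chosen) / M for k in range(d)]  # centroid
    sq = lambda a, t: sum((x - y) ** 2 for x, y in zip(a, t))
    inside = sum(sq(points[i], c) for i in chosen)
    outside = sum(sq(points[i], [0.0] * d) for i in range(N) if i not in chosen)
    return M * inside + (N - M) * outside

def feasible(chosen, T_min, T_max):
    """Constraints (2): consecutive chosen indices differ by at least T_min
    and at most T_max."""
    ns = sorted(chosen)
    return all(T_min <= b - a <= T_max for a, b in zip(ns, ns[1:]))
```

For instance, on a three-point sequence the pair of "active" positions with the smaller scatter and smaller out-of-cluster energy yields the smaller objective value.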

This applied problem is typical for time-series and discrete-signal processing, for example, in remote object monitoring, in geophysics, in technical and medical diagnostics, etc. (see, for example, [1,2,3,4,5]).

The considered problem is closely related to

Problem 2

(Cardinality-weighted variance-based 2-clustering with given center). Given an N-element set \(\mathcal {Y}\) of points in \(\mathbb {R}^d\) and a positive integer M. Find a partition of \(\mathcal {Y}\) into two non-empty clusters \(\mathcal {C}\) and \(\mathcal {Y} \setminus \mathcal {C}\) such that

$$\begin{aligned} M\sum \limits _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^2 +(N-M)\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^2\longrightarrow \min , \end{aligned}$$

where \(\overline{y}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}} y\) is the centroid of \(\mathcal {C}\), subject to the constraint \(|\mathcal {C}|=M\).

The strong NP-hardness of Problem 2 was established in [6]. The strong NP-hardness of Problem 1 follows from this result, since Problem 2 is a special case of Problem 1 with \(T_{\min }=1\) and \(T_{\max }=N\).

Problem 2 has been studied in the algorithmic direction in [7,8,9,10,11].

In [7], an exact pseudo-polynomial algorithm was constructed for the case of integer components of the input points and fixed dimension d of the space. The running time of this algorithm is \(\mathcal {O}(N (M D)^d)\), where D is the maximum absolute value of coordinates of the input points.

In [8], an approximation scheme that finds a \((1+\varepsilon )\)-approximate solution in \(\mathcal {O}\left( dN^2\big (\sqrt{\frac{2d}{\varepsilon }}+2\big )^d\right) \) time was proposed. It implements an FPTAS in the case of fixed space dimension.

Moreover, in [9], a modification of this algorithm with the improved time complexity \(\mathcal {O}\Big (\sqrt{d}N^2\big (\frac{\pi e}{2}\big )^{d/2}\big (\sqrt{\frac{2}{\varepsilon }}+2\big )^d\Big )\) was proposed. The algorithm implements an FPTAS in the case of fixed space dimension and remains polynomial for instances of dimension \(\mathcal {O}(\log N)\). In this case, it implements a PTAS with \(\mathcal {O}\left( N^{C\,(1.05+\log (2+\sqrt{\frac{2}{\varepsilon }}))}\right) \) time, where C is a positive constant.

In [10], an approximation algorithm that allows one to find a 2-approximate solution to the problem in \(\mathcal {O}\left( dN^2\right) \) time was constructed.

In [11], a randomized algorithm was constructed. It allows one to find \((1 + \varepsilon )\)-approximate solution with probability not less than \(1 - \gamma \) in \(\mathcal {O}(d N)\) time for an established parameter value, a given relative error \(\varepsilon \) and fixed \(\gamma \). The conditions are found under which the algorithm is asymptotically exact and runs in \(\mathcal {O}(d N^2)\) time.

In this paper, we present the first algorithmic result for the strongly NP-hard Problem 1: a 2-approximation algorithm. This algorithm is based on the approaches and results presented in [10, 12,13,14]. Its running time is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\).

3 Foundations of the Algorithm

In this section, we formulate some statements (e.g., about the indices of \(\mathcal {M}\)) and one auxiliary problem that can be solved in polynomial time. These statements and the auxiliary problem are necessary to substantiate our algorithm.

The following lemma is well known (see, for example, [15, 16]).

Lemma 1

For a finite set \(\mathcal {Z} \subset \mathbb {R}^d\), if a point \(t\in \mathbb {R}^{d}\) is at least as close (in terms of distance) to the centroid \(\overline{z}\) of \(\mathcal {Z}\) as every point of \(\mathcal {Z}\), then

$$\begin{aligned} \sum _{z \in \mathcal {Z}} \Vert z - t \Vert ^2 \le 2 \sum _{z \in \mathcal {Z}} \Vert z - \overline{z} \Vert ^2. \end{aligned}$$
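Lemma 1 is easy to check numerically. In the sketch below (helper names are ours), `t` is taken as the point of \(\mathcal {Z}\) nearest to the centroid, which satisfies the premise of the lemma.

```python
import random

def centroid(Z):
    d = len(Z[0])
    return [sum(z[k] for z in Z) / len(Z) for k in range(d)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(1)
Z = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
zbar = centroid(Z)
# The point of Z closest to the centroid satisfies the premise of Lemma 1.
t = min(Z, key=lambda z: sq_dist(z, zbar))
lhs = sum(sq_dist(z, t) for z in Z)          # scatter about t
rhs = 2 * sum(sq_dist(z, zbar) for z in Z)   # twice the scatter about the centroid
assert lhs <= rhs
```

The factor 2 is what ultimately yields the approximation guarantee of Theorem 2.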

The following two lemmas were proved in [12, 13].

Lemma 2

If the elements of \(\mathcal {M}=\{n_1,\dots ,n_M\}\) belong to \(\mathcal {N}=\{1,\ldots ,N\}\) and satisfy the system of constraints (2), then for every fixed \(M\in \{2,\dots ,N\}\) we have:

  1.

    the parameters of this system are related by the inequality

    $$\begin{aligned} (M-1)T_{\min }\le N-1, \end{aligned}$$
    (3)
  2.

    the element \(n_m\) of \(\{n_1,\dots ,n_m,\dots ,n_M\}\) belongs to the set

    $$\begin{aligned} \omega _m=\{n\,|\,1+(m-1)T_{\min }\le n\le N-(M-m)T_{\min }\},\quad m=1,\dots ,M, \end{aligned}$$
    (4)
  3.

    the feasibility domain of the component \(n_{m-1}\) from this set under the condition \(n_m=n\) is defined by the formula

    $$\begin{aligned} \gamma _{m-1}^{-}(n)=\{j| \max \{1+(m-2)T_{\min },n-T_{\max }\}\le j \le n-T_{\min }\} , \end{aligned}$$
    (5)

    where \(n\in \omega _m, \, m=2,\dots ,M.\)
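The index sets of Lemma 2 are straightforward to enumerate. The sketch below assumes the form \(\omega _m=\{n\,|\,1+(m-1)T_{\min }\le n\le N-(M-m)T_{\min }\}\) for (4) and follows (5) literally; the function names are ours.

```python
def omega(m, M, N, T_min):
    # Feasible positions of n_m (1-based), assuming the set (4) has the form
    # {n : 1 + (m-1)*T_min <= n <= N - (M-m)*T_min}.
    return range(1 + (m - 1) * T_min, N - (M - m) * T_min + 1)

def gamma_minus(m, n, T_min, T_max):
    # Formula (5): feasible values j of n_{m-1} under the condition n_m = n.
    lo = max(1 + (m - 2) * T_min, n - T_max)
    return range(lo, n - T_min + 1)
```

For example, with \(N=10\), \(M=3\), \(T_{\min }=2\), \(T_{\max }=4\): the first chosen index ranges over \(1,\dots ,6\), the last over \(5,\dots ,10\), and if \(n_3=7\) then \(n_2\in \{3,4,5\}\).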

Lemma 3

For every \(M\in \{2,\dots ,N\}\) the system of constraints (2) is feasible if and only if inequality (3) holds.

Consider the following function:

$$\begin{aligned} S(\mathcal {M},b) =M\sum \limits _{n\in \mathcal {M}}\Vert y_n-b\Vert ^2 +(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}}\Vert y_n\Vert ^2, \,\,b\in \mathbb {R}^d, \,\, \mathcal {M}\subset \mathcal {N}. \end{aligned}$$

It is similar to the objective function of Problem 1. The only difference is the point b instead of the centroid \(\overline{y}(\{y_n| n\in \mathcal {M}\})\). This function can be rewritten as follows:

$$\begin{aligned} S(\mathcal {M},b) =(N-M)\sum \limits _{n\in \mathcal {N}}\Vert y_n\Vert ^2-\sum \limits _{n\in \mathcal {M}}\Big (2M\langle y_n,b\rangle -(2M-N)\Vert y_n\Vert ^2-M\Vert b\Vert ^2\Big ). \end{aligned}$$
(6)

Note that the first summand is a constant if M and b are fixed. Hence the minimum of \(S(\mathcal {M},b)\) is reached on the subsequence that maximizes the second summand. This expression motivates us to formulate the following auxiliary problem.

Problem 3

Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\), a point \(b\in \mathbb {R}^d\), and some positive integers \(T_{\min },\, T_{\max }\) and M. Find a subset \(\mathcal {M}=\{n_1,\dots ,n_M\} \subset \mathcal {N}=\{1,\dots ,N\}\) of indices in the sequence \(\mathcal {Y}\) such that

$$\begin{aligned} G^b(\mathcal {M})=\sum \limits _{n\in \mathcal {M}}g^b(n)\longrightarrow \max , \end{aligned}$$
(7)

where

$$\begin{aligned} g^b(n)=2M\langle y_n,b\rangle -(2M-N)\Vert y_n\Vert ^2-M\Vert b\Vert ^2,\quad n\in \mathcal {N}, \end{aligned}$$
(8)

with additional constraints (2) on the elements of \(\mathcal {M}\), if \(M\ne 1\).
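The rewriting (6) behind Problem 3 can be checked numerically. The following sketch (all names are ours) compares the direct form of \(S(\mathcal {M},b)\) with the rewritten one on random data.

```python
import random

random.seed(0)
d, N, M = 2, 8, 3
Y = [[random.uniform(-2, 2) for _ in range(d)] for _ in range(N)]
b = [0.5] * d
Mset = {0, 3, 6}  # a 0-based index set with |Mset| = M

dot = lambda a, c: sum(x * y for x, y in zip(a, c))
nsq = lambda a: dot(a, a)

# Direct form of S(M, b).
S1 = (M * sum(nsq([x - y for x, y in zip(Y[n], b)]) for n in Mset)
      + (N - M) * sum(nsq(Y[n]) for n in set(range(N)) - Mset))

# Rewritten form (6): a constant minus the sum of the weights g^b(n) from (8).
S2 = ((N - M) * sum(nsq(Y[n]) for n in range(N))
      - sum(2 * M * dot(Y[n], b) - (2 * M - N) * nsq(Y[n]) - M * nsq(b)
            for n in Mset))

assert abs(S1 - S2) < 1e-9
```

Since the constant does not depend on \(\mathcal {M}\), minimizing `S1` over index sets is the same as maximizing the sum of weights, which is exactly objective (7).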

Let us define the set \(\varPsi _{M}\) of admissible index sets in the auxiliary problem:

$$\begin{aligned} \varPsi _M=\big \{\mathcal {M}=\{n_1,\dots ,n_M\}\subset \mathcal {N}\ \big |\ T_{\min }\le n_m-n_{m-1}\le T_{\max },\ m=2,\dots ,M\big \}. \end{aligned}$$
(9)

For \(M=1\) the set \(\varPsi _M\) is not empty for any parameters \(T_{\min }\) and \(T_{\max }\) by definition (9). For other feasible values of M we have [12, 13]:

Lemma 4

If \(M\ge 2\), then the set \(\varPsi _M\) is not empty if and only if an inequality (3) holds.

The proofs of the following lemma and its corollary in [12, 13] do not rely on the specific form of (8), so they hold in our case as well.

Lemma 5

Let \(\varPsi _M\ne \emptyset \) for some \(M\ge 1\). Then for this M, the optimal value \(G_{\max }^b=\max \limits _{\mathcal {M}}G^b(\mathcal {M})\) of objective function (7) can be found by formula

$$\begin{aligned} G_{\max }^b=\max \limits _{n\in \omega _M}G_M^b(n). \end{aligned}$$
(10)

The values \(G_M^b(n)\), \(n\in \omega _M\), can be calculated by the following recurrent formulae:

$$\begin{aligned} G_m^b(n)={\left\{ \begin{array}{ll} g^b(n), \text { if } n\in \omega _1, m=1,\\ g^b(n)+\max \limits _{j\in \gamma ^{-}_{m-1}(n)}G_{m-1}^b(j), \text { if } n\in \omega _m, m=2,\dots ,M, \end{array}\right. } \end{aligned}$$
(11)

where sets \(\omega _m\) and \(\gamma ^{-}_{m-1}(n)\) are defined by formulae (4) and (5).

Corollary 1

The elements \({n}^b_1,\dots ,{n}^b_M\) of the optimal set \({\mathcal {M}^{b}} =\arg \max \limits _{\mathcal {M}}G^b(\mathcal {M})\) can be found by the following recurrent formulae:

$$\begin{aligned} {n}^b_M=\arg \max \limits _{n\in \omega _M}G_M^b(n), \end{aligned}$$
(12)
$$\begin{aligned} {n}^b_{m-1}=\arg \max \limits _{j\in \gamma ^{-}_{m-1}({n}^b_{m})}G_{m-1}^b(j),\quad m=M,\dots ,2. \end{aligned}$$
(13)

The following algorithm finds an optimal solution of auxiliary Problem 3. Its step-by-step description is as follows.

  • Algorithm \(\mathcal {A}_1\).

  • Input: a sequence \(\mathcal {Y}\), a point b, positive integers \(T_{\min }\), \(T_{\max }\), and M.

  • Step 1. Compute \(g^b(n)\), \(n\in \mathcal {N}\), using formula (8).

  • Step 2. Using recurrent formulae (11), compute \(G_m^b(n)\) for each \(n\in \omega _m\) and \(m=1,\dots ,M\).

  • Step 3. Find the maximal value \(G_{\max }^b\) of the objective function \(G^b\) using formula (10) and the optimal set \({\mathcal {M}^b}=\{{n}^b_1,\dots ,{n}^b_M\}\) by (12) and (13) from Corollary 1.

  • Output: the value of \(G_{\max }^b\), the set \({\mathcal {M}}^b\).

The following theorem has been established in [13].

Theorem 1

Algorithm \(\mathcal {A}_1\) finds an optimal solution of Problem 3 in \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d))\) time.

Remark 1

The value \((T_{\max }-T_{\min }+1)\) is bounded by N. So the algorithm has a complexity at most \(\mathcal {O}(N(MN+d))\).

Remark 2

If the values \(T_{\max }\) and \(T_{\min }\) are fixed then the algorithm has a complexity at most \(\mathcal {O}(N(M+d))\).

4 Approximation Algorithm

The idea of the proposed algorithm is to replace the solution of the original NP-hard Problem 1 by an efficient exact algorithmic solution of auxiliary Problem 3. These auxiliary solutions are found by Algorithm \(\mathcal {A}_1\) from the previous section for each point \(b \in \mathcal {Y}\), and then the best one is chosen.

  • Algorithm \(\mathcal {A}\).

  • Input: a sequence \(\mathcal {Y}\), some positive integers \(T_{\min }\), \(T_{\max }\), M.

  • Step 1. \(i := 0\), \(\mathcal {M}_\mathcal {A} := \emptyset \), \(H := -\infty \).

  • Step 2. \(i := i + 1\), \(b := y_i\).

  • Step 3. Find the optimal solution \(\mathcal {M}^b\) and the maximal value \(G_{\max }^b\) of objective function (7) using algorithm \(\mathcal {A}_1\).

  • Step 4. If \(H < G_{\max }^b\) then \(b_\mathcal {A} := b\), \(H := G_{\max }^b\), \(\mathcal {M}_\mathcal {A} := \mathcal {M}^b\).

  • Step 5. If \(i < N\) then go to Step 2, else go to Step 6.

  • Step 6. Find the centroid \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\) and the value of the objective function \(F(\mathcal {M_A})\) by (1).

  • Output: the set \(\mathcal {M}_\mathcal {A}\), the value \(F(\mathcal {M_A})\), the points \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\) and \(b_\mathcal {A}\).
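The scheme above can be sketched end to end. This is our illustrative version: since Sect. 3's dynamic programming is not repeated here, a brute-force search over feasible index sets stands in for Algorithm \(\mathcal {A}_1\) at Step 3, while the record logic of Steps 1–5 and the final Step 6 follow the description above (1-based indices).

```python
from itertools import combinations

def algorithm_A(Y, T_min, T_max, M):
    """Sketch of Algorithm A on the objective (7)/(8); brute force replaces
    A_1 for brevity, so it is exponential but matches the record scheme."""
    N, d = len(Y), len(Y[0])
    dot = lambda a, c: sum(x * y for x, y in zip(a, c))
    feasible = [S for S in combinations(range(1, N + 1), M)
                if all(T_min <= j - i <= T_max for i, j in zip(S, S[1:]))]

    def G(S, b):  # objective (7) with the weights appearing in (6)
        return sum(2 * M * dot(Y[n - 1], b)
                   - (2 * M - N) * dot(Y[n - 1], Y[n - 1])
                   - M * dot(b, b) for n in S)

    H, MA, bA = float('-inf'), None, None              # Step 1 (record H)
    for b in Y:                                        # Steps 2 and 5
        S_best = max(feasible, key=lambda S: G(S, b))  # stand-in for Step 3
        if G(S_best, b) > H:                           # Step 4
            H, MA, bA = G(S_best, b), list(S_best), b
    # Step 6: centroid of the chosen subsequence and objective value (1).
    c = [sum(Y[n - 1][k] for n in MA) / M for k in range(d)]
    sq = lambda a, t: sum((u - v) ** 2 for u, v in zip(a, t))
    F = (M * sum(sq(Y[n - 1], c) for n in MA)
         + (N - M) * sum(dot(Y[n - 1], Y[n - 1])
                         for n in range(1, N + 1) if n not in MA))
    return MA, F, c, bA
```

Note that the record comparison is strict, so among equally good candidates the earliest point \(b\in \mathcal {Y}\) is kept, mirroring Step 4.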

Theorem 2

Algorithm \(\mathcal {A}\) finds a 2-approximate solution of Problem 1 in \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\) time.

Proof

Let \(\mathcal {M}^*\) be an optimal solution of Problem 1, \(y^*=\overline{y}(\{y_n|n \in \mathcal {M}^*\})\) be the centroid of this optimal solution, and \(\mathcal {M}_{\mathcal {A}}\), \(F(\mathcal {M_A})\), \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\), \(b_\mathcal {A}\) be the output of Algorithm \(\mathcal {A}\). Let \(t=\arg \min \limits _{i\in \mathcal {M}^{*}}||y_i-y^*||^{2}\) be a point from the optimal subsequence that is the closest to \(y^*\).

Let us substantiate the accuracy of the algorithm. The following chain of equalities and inequalities holds:

$$\begin{aligned} F(\mathcal {M}_{\mathcal {A}})&=S(\mathcal {M}_{\mathcal {A}},\overline{y}(\{y_n|n \in \mathcal {M_A}\})) \overset{(1)}{\le } S(\mathcal {M}_{\mathcal {A}},b_{\mathcal {A}}) \overset{(2)}{=}(N-M)\sum \limits _{n\in \mathcal {N}}\Vert y_n\Vert ^2-\max \limits _{b\in \mathcal {Y}}G_{\max }^b\\ &\overset{(3)}{=}\min \limits _{b\in \mathcal {Y}}\min \limits _{\mathcal {M}\in \varPsi _M}S(\mathcal {M},b) \overset{(4)}{\le } S(\mathcal {M}^{*},t)=M\sum \limits _{n\in \mathcal {M}^{*}}\Vert y_n-t\Vert ^2+(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}^{*}}\Vert y_n\Vert ^2\\ &\overset{(5)}{\le } 2M\sum \limits _{n\in \mathcal {M}^{*}}\Vert y_n-y^{*}\Vert ^2+(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}^{*}}\Vert y_n\Vert ^2\le 2F(\mathcal {M}^{*}). \end{aligned}$$
Let us explain the numbered relations (we comment only on those that are not obvious). One can check by differentiation that the minimum of \(S(\mathcal {M}_\mathcal {A},\cdot )\) (for fixed \(\mathcal {M}_\mathcal {A}\)) is attained at \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\), so inequality (1) holds. Equality (2) follows from Theorem 1 and the description of Algorithm \(\mathcal {A}_1\). Equality (3) is obvious by (6). Inequality (4) holds since \(\mathcal {M}^{*} \in \varPsi _{M}\) and \(t\in \mathcal {Y}\), and inequality (5) holds by Lemma 1.

Hence \(F(\mathcal {M}_{\mathcal {A}}) / F(\mathcal {M}^*)\le 2\) and Algorithm \(\mathcal {A}\) finds a 2-approximate solution of Problem 1.

Let us estimate the time complexity. Step 1 and Step 2 require \(\mathcal {O}(d)\) operations. At Step 3 we solve the auxiliary problem, which takes \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d))\) time. Step 4 and Step 5 require at most \(\mathcal {O}(dN)\) operations. Steps 2–5 are executed N times. Step 6 requires \(\mathcal {O}(dN)\) operations. Thus, the total time complexity of the algorithm is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\).   \(\square \)

Remark 3

The value \((T_{\max }-T_{\min }+1)\) is bounded by N. So the algorithm has a complexity at most \(\mathcal {O}(N^2(MN+d))\).

Remark 4

If the values \(T_{\max }\) and \(T_{\min }\) are fixed then the algorithm has a complexity at most \(\mathcal {O}(N^2(M+d))\).

5 Conclusion

In this paper, we have shown the strong NP-hardness of one cardinality-weighted quadratic problem of partitioning a sequence of points in Euclidean space into two clusters, in which the center of one of the desired clusters is fixed and the center of the other is unknown and is determined as the geometric center (centroid). We have also presented the first algorithmic result for this problem: a polynomial-time 2-approximation algorithm. It seems important to continue studying the algorithmic approximability of the problem, since it is still poorly studied in the algorithmic direction.