
1 Introduction

The subject of this study is one quadratic cardinality-weighted problem of partitioning a sequence of points in Euclidean space into two subsequences of given sizes under additional constraints. The goal of our study is to analyze the computational complexity of the problem and to substantiate a polynomial-time approximation algorithm for it. The motivation of our study is the relevance of the considered problem, for example, to data mining and data clustering, when the data at hand is a time series.

The paper is organized as follows. In Sect. 2 the problem formulation and its interpretation are presented; there we also present one closely related problem with some known results for it, and analyze the complexity of the problem. In Sect. 3 we formulate an auxiliary problem and some statements that underlie the quality estimates for the proposed algorithm. Section 4 contains the approximation algorithm for the problem considered, together with an analysis of its properties.

2 Problem Formulation, Related Problem, and Complexity

Everywhere below \(\mathbb {R}\) denotes the set of real numbers, \(\Vert \cdot \Vert \) denotes the Euclidean norm, and \(\langle \cdot , \cdot \rangle \) denotes the scalar product.

We consider the following

Problem 1

Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\) and positive integers \(T_{\min },\, T_{\max }\), and \(M>1\). Find a subset \(\mathcal {M}=\{n_1, n_2, \dots \} \subset \mathcal {N}=\{1,\dots ,N\}\) of indices in \(\mathcal {Y}\) such that

$$\begin{aligned} F(\mathcal {M})=M\sum \limits _{n\in \mathcal {M}}\Vert y_n-\overline{y}(\{y_n| n\in \mathcal {M}\})\Vert ^2 +(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}}\Vert y_n\Vert ^2\longrightarrow \min , \end{aligned}$$
(1)

where \(\overline{y}(\{y_n| n\in \mathcal {M}\}) = \frac{1}{|\mathcal {M}|} \sum _{i \in \mathcal {M}}y_i\) is the centroid of \(\{y_n| n\in \mathcal {M}\}\), with the following constraints:

$$\begin{aligned} T_{\min }\le n_m-n_{m-1}\le T_{\max },\quad m=2,\dots ,M, \end{aligned}$$
(2)

and \(|\mathcal {M}|=M\).

The following applied interpretation can be proposed for Problem 1. As the input, we have a sequence \(\mathcal {Y}\) of N measurement results. This sequence is time-ordered (i.e., the input is a time series or a discrete signal). The measurements are taken for d characteristics of some object. There are two possible states for this object (active and passive, for example). We know that the object was in the active state exactly M times (or the probability of the active state is \(\frac{M}{N}\)). We also know that each measurement contains an error. The correspondence between the elements of the input sequence and the states is unknown, but the time interval between every two consecutive active states is bounded from below and above by the constants \(T_{\min }\) and \(T_{\max }\). The problem is to find a 2-partition of the input sequence and to evaluate the object's characteristics (i.e., \(\overline{y}(\{y_n| n\in \mathcal {M}\})\) in accordance with (1)).
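To make the interpretation concrete, here is a minimal Python sketch of the two ingredients just described: the cardinality-weighted objective minimized in (1) and the gap constraints (2) on consecutive chosen indices. The names `objective` and `feasible` are ours, and indices are 0-based here, unlike the 1-based indices used in the text.

```python
def objective(points, chosen):
    """Objective (1) of Problem 1: M times the scatter of the chosen points
    about their centroid, plus (N - M) times the squared norms of the rest.
    `chosen` is a set of 0-based indices with |chosen| = M."""
    N, M, d = len(points), len(chosen), len(points[0])
    c = [sum(points[i][k] for i in chosen) / M for k in range(d)]  # centroid
    sq = lambda a, t: sum((x - y) ** 2 for x, y in zip(a, t))
    inside = sum(sq(points[i], c) for i in chosen)
    outside = sum(sq(points[i], [0.0] * d) for i in range(N) if i not in chosen)
    return M * inside + (N - M) * outside

def feasible(chosen, T_min, T_max):
    """Constraints (2): consecutive chosen indices differ by at least T_min
    and at most T_max."""
    ns = sorted(chosen)
    return all(T_min <= b - a <= T_max for a, b in zip(ns, ns[1:]))
```

For instance, on a three-point sequence the pair of "active" positions with the smaller scatter and smaller out-of-cluster energy yields the smaller objective value.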

This applied problem is typical for time-series and discrete-signal processing, for example, in remote object monitoring, in geophysics, in technical and medical diagnostics, etc. (see, for example, [1,2,3,4,5]).

The considered problem is closely related to

Problem 2

(Cardinality-weighted variance-based 2-clustering with given center). Given an N-element set \(\mathcal {Y}\) of points in \(\mathbb {R}^d\) and a positive integer M. Find a partition of \(\mathcal {Y}\) into two non-empty clusters \(\mathcal {C}\) and \(\mathcal {Y} \setminus \mathcal {C}\) such that

$$\begin{aligned} M\sum \limits _{y\in \mathcal {C}}\Vert y-\overline{y}(\mathcal {C})\Vert ^2 +(N-M)\sum \limits _{y\in \mathcal {Y}\setminus \mathcal {C}}\Vert y\Vert ^2\longrightarrow \min , \end{aligned}$$

where \(\overline{y}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}} y\) is the centroid of \(\mathcal {C}\), subject to the constraint \(|\mathcal {C}|=M\).

The strong NP-hardness of Problem 2 was established in [6]. The strong NP-hardness of Problem 1 follows from this result, since Problem 2 is a special case of Problem 1 with \(T_{\min }=1\) and \(T_{\max }=N\).

Problem 2 has been studied in the algorithmic direction in [7,8,9,10,11].

In [7], an exact pseudo-polynomial algorithm was constructed for the case of integer components of the input points and fixed dimension d of the space. The running time of this algorithm is \(\mathcal {O}(N (M D)^d)\), where D is the maximum absolute value of coordinates of the input points.

In [8], an approximation scheme that finds a \((1+\varepsilon )\)-approximate solution in \(\mathcal {O}\left( dN^2\big (\sqrt{\frac{2d}{\varepsilon }}+2\big )^d\right) \) time was proposed. It implements an FPTAS in the case of fixed space dimension.

Moreover, in [9], a modification of this algorithm with the improved time complexity \(\mathcal {O}\Big (\sqrt{d}N^2\big (\frac{\pi e}{2}\big )^{d/2}\big (\sqrt{\frac{2}{\varepsilon }}+2\big )^d\Big )\) was proposed. The algorithm implements an FPTAS in the case of fixed space dimension and remains polynomial for instances of dimension \(\mathcal {O}(\log N)\). In this case, it implements a PTAS with \(\mathcal {O}\left( N^{C\,(1.05+\log (2+\sqrt{\frac{2}{\varepsilon }}))}\right) \) time, where C is a positive constant.

In [10], an approximation algorithm that allows one to find a 2-approximate solution to the problem in \(\mathcal {O}\left( dN^2\right) \) time was constructed.

In [11], a randomized algorithm was constructed. It allows one to find \((1 + \varepsilon )\)-approximate solution with probability not less than \(1 - \gamma \) in \(\mathcal {O}(d N)\) time for an established parameter value, a given relative error \(\varepsilon \) and fixed \(\gamma \). The conditions are found under which the algorithm is asymptotically exact and runs in \(\mathcal {O}(d N^2)\) time.

In this paper, we present the first algorithmic result for the strongly NP-hard Problem 1: a 2-approximation algorithm. This algorithm is based on the approaches and results presented in [10, 12,13,14]. Its running time is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\).

3 Foundations of the Algorithm

In this section, we formulate some statements (e.g., about the indices of \(\mathcal {M}\)) and one auxiliary problem that can be solved in polynomial time. These statements and the auxiliary problem are necessary to substantiate our algorithm.

The following lemma is well known (see, for example, [15, 16]).

Lemma 1

For a finite set \(\mathcal {Z} \subset \mathbb {R}^d\), if a point \(t\in \mathbb {R}^{d}\) is at least as close (in terms of distance) to the centroid \(\overline{z}\) of \(\mathcal {Z}\) as every point of \(\mathcal {Z}\), then

$$\begin{aligned} \sum _{z \in \mathcal {Z}} \Vert z - t \Vert ^2 \le 2 \sum _{z \in \mathcal {Z}} \Vert z - \overline{z} \Vert ^2. \end{aligned}$$
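Lemma 1 is easy to check numerically. In the sketch below (helper names are ours), `t` is taken as the point of \(\mathcal {Z}\) nearest to the centroid, which satisfies the premise of the lemma.

```python
import random

def centroid(Z):
    d = len(Z[0])
    return [sum(z[k] for z in Z) / len(Z) for k in range(d)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(1)
Z = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(10)]
zbar = centroid(Z)
# The point of Z closest to the centroid satisfies the premise of Lemma 1.
t = min(Z, key=lambda z: sq_dist(z, zbar))
lhs = sum(sq_dist(z, t) for z in Z)          # scatter about t
rhs = 2 * sum(sq_dist(z, zbar) for z in Z)   # twice the scatter about the centroid
assert lhs <= rhs
```

The factor 2 is what ultimately yields the approximation guarantee of Theorem 2.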

The following two lemmas were proved in [12, 13].

Lemma 2

If the elements of \(\mathcal {M}=\{n_1,\dots ,n_M\}\) belong to \(\mathcal {N}=\{1,\ldots ,N\}\) and satisfy the system of constraints (2), then for every fixed \(M\in \{2,\dots ,N\}\) we have:

  1.

    the parameters of this system are related by the inequality

    $$\begin{aligned} (M-1)T_{\min }\le N-1, \end{aligned}$$
    (3)
  2.

    the element \(n_m\) of \(\{n_1,\dots ,n_m,\dots ,n_M\}\) belongs to the set

    $$\begin{aligned} \omega _m=\{n\,|\,1+(m-1)T_{\min }\le n\le N-(M-m)T_{\min }\},\quad m=1,\dots ,M, \end{aligned}$$
    (4)
  3.

    the feasibility domain of the component \(n_{m-1}\) from this set under the condition \(n_m=n\) is defined by the formula

    $$\begin{aligned} \gamma _{m-1}^{-}(n)=\{j| \max \{1+(m-2)T_{\min },n-T_{\max }\}\le j \le n-T_{\min }\} , \end{aligned}$$
    (5)

    where \(n\in \omega _m, \, m=2,\dots ,M.\)
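The index sets of Lemma 2 are straightforward to enumerate. The sketch below assumes the form \(\omega _m=\{n\,|\,1+(m-1)T_{\min }\le n\le N-(M-m)T_{\min }\}\) for (4) and follows (5) literally; the function names are ours.

```python
def omega(m, M, N, T_min):
    # Feasible positions of n_m (1-based), assuming the set (4) has the form
    # {n : 1 + (m-1)*T_min <= n <= N - (M-m)*T_min}.
    return range(1 + (m - 1) * T_min, N - (M - m) * T_min + 1)

def gamma_minus(m, n, T_min, T_max):
    # Formula (5): feasible values j of n_{m-1} under the condition n_m = n.
    lo = max(1 + (m - 2) * T_min, n - T_max)
    return range(lo, n - T_min + 1)
```

For example, with \(N=10\), \(M=3\), \(T_{\min }=2\), \(T_{\max }=4\): the first chosen index ranges over \(1,\dots ,6\), the last over \(5,\dots ,10\), and if \(n_3=7\) then \(n_2\in \{3,4,5\}\).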

Lemma 3

For every \(M\in \{2,\dots ,N\}\) the system of constraints (2) is feasible if and only if inequality (3) holds.

Consider the following function:

$$\begin{aligned} S(\mathcal {M},b) =M\sum \limits _{n\in \mathcal {M}}\Vert y_n-b\Vert ^2 +(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}}\Vert y_n\Vert ^2, \,\,b\in \mathbb {R}^d, \,\, \mathcal {M}\subset \mathcal {N}. \end{aligned}$$

It is similar to the objective function of Problem 1. The only difference is the point b instead of the centroid \(\overline{y}(\{y_n| n\in \mathcal {M}\})\). This function can be rewritten as follows:

$$\begin{aligned} S(\mathcal {M},b) =(N-M)\sum \limits _{n\in \mathcal {N}}\Vert y_n\Vert ^2-\sum \limits _{n\in \mathcal {M}}\Big (2M\langle y_n,b\rangle -(2M-N)\Vert y_n\Vert ^2-M\Vert b\Vert ^2\Big ). \end{aligned}$$
(6)

Note that the first summand is a constant if M and b are fixed. Hence the minimum of \(S(\mathcal {M},b)\) is reached on the subsequence that maximizes the second summand. This expression motivates us to formulate the following auxiliary problem.

Problem 3

Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\), a point \(b\in \mathbb {R}^d\), and some positive integers \(T_{\min },\, T_{\max }\) and M. Find a subset \(\mathcal {M}=\{n_1,\dots ,n_M\} \subset \mathcal {N}=\{1,\dots ,N\}\) of indices in the sequence \(\mathcal {Y}\) such that

$$\begin{aligned} G^b(\mathcal {M})=\sum \limits _{n\in \mathcal {M}}g^b(n)\longrightarrow \max , \end{aligned}$$
(7)

where

$$\begin{aligned} g^b(n)=2M\langle y_n,b\rangle -(2M-N)\Vert y_n\Vert ^2-M\Vert b\Vert ^2,\quad n\in \mathcal {N}, \end{aligned}$$
(8)

with additional constraints (2) on the elements of \(\mathcal {M}\), if \(M\ne 1\).
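The rewriting (6) behind Problem 3 can be checked numerically. The following sketch (all names are ours) compares the direct form of \(S(\mathcal {M},b)\) with the rewritten one on random data.

```python
import random

random.seed(0)
d, N, M = 2, 8, 3
Y = [[random.uniform(-2, 2) for _ in range(d)] for _ in range(N)]
b = [0.5] * d
Mset = {0, 3, 6}  # a 0-based index set with |Mset| = M

dot = lambda a, c: sum(x * y for x, y in zip(a, c))
nsq = lambda a: dot(a, a)

# Direct form of S(M, b).
S1 = (M * sum(nsq([x - y for x, y in zip(Y[n], b)]) for n in Mset)
      + (N - M) * sum(nsq(Y[n]) for n in set(range(N)) - Mset))

# Rewritten form (6): a constant minus the sum of the weights g^b(n) from (8).
S2 = ((N - M) * sum(nsq(Y[n]) for n in range(N))
      - sum(2 * M * dot(Y[n], b) - (2 * M - N) * nsq(Y[n]) - M * nsq(b)
            for n in Mset))

assert abs(S1 - S2) < 1e-9
```

Since the constant does not depend on \(\mathcal {M}\), minimizing `S1` over index sets is the same as maximizing the sum of weights, which is exactly objective (7).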

Let us define the set \(\varPsi _{M}\) of admissible index sets in the auxiliary problem:

$$\begin{aligned} \varPsi _M=\big \{\mathcal {M}=\{n_1,\dots ,n_M\}\subset \mathcal {N}\ \big |\ T_{\min }\le n_m-n_{m-1}\le T_{\max },\ m=2,\dots ,M\big \}. \end{aligned}$$
(9)

For \(M=1\) the set \(\varPsi _M\) is not empty for any parameters \(T_{\min }\) and \(T_{\max }\) by definition (9). For other feasible values of M we have [12, 13]:

Lemma 4

If \(M\ge 2\), then the set \(\varPsi _M\) is not empty if and only if an inequality (3) holds.

The proofs of the following lemma and its corollary in [12, 13] do not rely on the specific form of (8), so they hold in our case as well.

Lemma 5

Let \(\varPsi _M\ne \emptyset \) for some \(M\ge 1\). Then for this M, the optimal value \(G_{\max }^b=\max \limits _{\mathcal {M}}G^b(\mathcal {M})\) of objective function (7) can be found by formula

$$\begin{aligned} G_{\max }^b=\max \limits _{n\in \omega _M}G_M^b(n). \end{aligned}$$
(10)

The values \(G_M^b(n)\), \(n\in \omega _M\), can be calculated by the following recurrent formulae:

$$\begin{aligned} G_m^b(n)={\left\{ \begin{array}{ll} g^b(n), \text { if } n\in \omega _1, m=1,\\ g^b(n)+\max \limits _{j\in \gamma ^{-}_{m-1}(n)}G_{m-1}^b(j), \text { if } n\in \omega _m, m=2,\dots ,M, \end{array}\right. } \end{aligned}$$
(11)

where sets \(\omega _m\) and \(\gamma ^{-}_{m-1}(n)\) are defined by formulae (4) and (5).

Corollary 1

The elements \({n}^b_1,\dots ,{n}^b_M\) of the optimal set \({\mathcal {M}^{b}} =\arg \max \limits _{\mathcal {M}}G^b(\mathcal {M})\) can be found by the following recurrent formulae:

$$\begin{aligned} {n}^b_M=\arg \max \limits _{n\in \omega _M}G_M^b(n), \end{aligned}$$
(12)
$$\begin{aligned} {n}^b_{m-1}=\arg \max \limits _{j\in \gamma ^{-}_{m-1}({n}^b_{m})}G_{m-1}^b(j),\quad m=M,\dots ,2. \end{aligned}$$
(13)

The following algorithm finds an optimal solution of auxiliary Problem 3. Its step-by-step description is as follows.

  • Algorithm \(\mathcal {A}_1\).

  • Input: a sequence \(\mathcal {Y}\), a point b, positive integers \(T_{\min }\), \(T_{\max }\), and M.

  • Step 1. Compute \(g^b(n)\), \(n\in \mathcal {N}\), using formula (8).

  • Step 2. Using recurrent formulae (11), compute \(G_m^b(n)\) for each \(n\in \omega _m\) and \(m=1,\dots ,M\).

  • Step 3. Find the maximal value \(G_{\max }^b\) of the objective function \(G^b\) using formula (10) and the optimal set \({\mathcal {M}^b}=\{{n}^b_1,\dots ,{n}^b_M\}\) by (12) and (13) from Corollary 1.

  • Output: the value of \(G_{\max }^b\), the set \({\mathcal {M}}^b\).

The following theorem has been established in [13].

Theorem 1

Algorithm \(\mathcal {A}_1\) finds an optimal solution of Problem 3 in \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d))\) time.

Remark 1

The value \((T_{\max }-T_{\min }+1)\) is bounded by N. So the algorithm has a complexity at most \(\mathcal {O}(N(MN+d))\).

Remark 2

If the values \(T_{\max }\) and \(T_{\min }\) are fixed then the algorithm has a complexity at most \(\mathcal {O}(N(M+d))\).

4 Approximation Algorithm

The idea of the proposed algorithm is to replace the solution of the original NP-hard Problem 1 by an efficient exact algorithmic solution of auxiliary Problem 3. These auxiliary solutions are found by Algorithm \(\mathcal {A}_1\) from the previous section for each point \(b \in \mathcal {Y}\), and then the best one is chosen.

  • Algorithm \(\mathcal {A}\).

  • Input: a sequence \(\mathcal {Y}\), some positive integers \(T_{\min }\), \(T_{\max }\), M.

  • Step 1. \(i := 0\), \(\mathcal {M}_\mathcal {A} := \emptyset \), \(H := -\infty \).

  • Step 2. \(i := i + 1\), \(b := y_i\).

  • Step 3. Find the optimal solution \(\mathcal {M}^b\) and the maximal value \(G_{\max }^b\) of objective function (7) using algorithm \(\mathcal {A}_1\).

  • Step 4. If \(H < G_{\max }^b\) then \(b_\mathcal {A} := b\), \(H := G_{\max }^b\), \(\mathcal {M}_\mathcal {A} := \mathcal {M}^b\).

  • Step 5. If \(i < N\) then go to Step 2, else go to Step 6.

  • Step 6. Find the centroid \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\) and the value of the objective function \(F(\mathcal {M_A})\) by (1).

  • Output: the set \(\mathcal {M}_\mathcal {A}\), the value \(F(\mathcal {M_A})\), the points \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\) and \(b_\mathcal {A}\).
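The scheme above can be sketched end to end. This is our illustrative version: since Sect. 3's dynamic programming is not repeated here, a brute-force search over feasible index sets stands in for Algorithm \(\mathcal {A}_1\) at Step 3, while the record logic of Steps 1–5 and the final Step 6 follow the description above (1-based indices).

```python
from itertools import combinations

def algorithm_A(Y, T_min, T_max, M):
    """Sketch of Algorithm A on the objective (7)/(8); brute force replaces
    A_1 for brevity, so it is exponential but matches the record scheme."""
    N, d = len(Y), len(Y[0])
    dot = lambda a, c: sum(x * y for x, y in zip(a, c))
    feasible = [S for S in combinations(range(1, N + 1), M)
                if all(T_min <= j - i <= T_max for i, j in zip(S, S[1:]))]

    def G(S, b):  # objective (7) with the weights appearing in (6)
        return sum(2 * M * dot(Y[n - 1], b)
                   - (2 * M - N) * dot(Y[n - 1], Y[n - 1])
                   - M * dot(b, b) for n in S)

    H, MA, bA = float('-inf'), None, None              # Step 1 (record H)
    for b in Y:                                        # Steps 2 and 5
        S_best = max(feasible, key=lambda S: G(S, b))  # stand-in for Step 3
        if G(S_best, b) > H:                           # Step 4
            H, MA, bA = G(S_best, b), list(S_best), b
    # Step 6: centroid of the chosen subsequence and objective value (1).
    c = [sum(Y[n - 1][k] for n in MA) / M for k in range(d)]
    sq = lambda a, t: sum((u - v) ** 2 for u, v in zip(a, t))
    F = (M * sum(sq(Y[n - 1], c) for n in MA)
         + (N - M) * sum(dot(Y[n - 1], Y[n - 1])
                         for n in range(1, N + 1) if n not in MA))
    return MA, F, c, bA
```

Note that the record comparison is strict, so among equally good candidates the earliest point \(b\in \mathcal {Y}\) is kept, mirroring Step 4.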

Theorem 2

Algorithm \(\mathcal {A}\) finds a 2-approximate solution of Problem 1 in \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\) time.

Proof

Let \(\mathcal {M}^*\) be an optimal solution of Problem 1, \(y^*=\overline{y}(\{y_n|n \in \mathcal {M}^*\})\) be the centroid of this optimal solution, and \(\mathcal {M}_{\mathcal {A}}\), \(F(\mathcal {M_A})\), \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\), \(b_\mathcal {A}\) be the output of Algorithm \(\mathcal {A}\). Let \(t=\arg \min \limits _{i\in \mathcal {M}^{*}}||y_i-y^*||^{2}\) be a point from the optimal subsequence that is the closest to \(y^*\).

Let us substantiate the accuracy of the algorithm. The following chain of equalities and inequalities holds:

$$\begin{aligned} F(\mathcal {M}_{\mathcal {A}})&=S(\mathcal {M}_{\mathcal {A}},\overline{y}(\{y_n|n \in \mathcal {M_A}\})) \overset{(1)}{\le } S(\mathcal {M}_{\mathcal {A}},b_{\mathcal {A}}) \overset{(2)}{=}(N-M)\sum \limits _{n\in \mathcal {N}}\Vert y_n\Vert ^2-\max \limits _{b\in \mathcal {Y}}G_{\max }^b\\ &\overset{(3)}{=}\min \limits _{b\in \mathcal {Y}}\min \limits _{\mathcal {M}\in \varPsi _M}S(\mathcal {M},b) \overset{(4)}{\le } S(\mathcal {M}^{*},t)=M\sum \limits _{n\in \mathcal {M}^{*}}\Vert y_n-t\Vert ^2+(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}^{*}}\Vert y_n\Vert ^2\\ &\overset{(5)}{\le } 2M\sum \limits _{n\in \mathcal {M}^{*}}\Vert y_n-y^{*}\Vert ^2+(N-M)\sum \limits _{n\in \mathcal {N}\setminus \mathcal {M}^{*}}\Vert y_n\Vert ^2\le 2F(\mathcal {M}^{*}). \end{aligned}$$
Let us explain the numbered relations (we comment only on those that are not obvious). One can check by differentiation that the minimum of \(S(\mathcal {M}_\mathcal {A},\cdot )\) (for fixed \(\mathcal {M}_\mathcal {A}\)) is attained at \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\), so inequality (1) holds. Equality (2) follows from Theorem 1 and the description of Algorithm \(\mathcal {A}_1\). Equality (3) is obvious by (6). Inequality (4) holds since \(\mathcal {M}^{*} \in \varPsi _{M}\) and \(t\in \mathcal {Y}\), and inequality (5) holds by Lemma 1.

Hence \(F(\mathcal {M}_{\mathcal {A}}) / F(\mathcal {M}^*)\le 2\) and Algorithm \(\mathcal {A}\) finds a 2-approximate solution of Problem 1.

Let us estimate the time complexity. Step 1 and Step 2 require \(\mathcal {O}(d)\) operations. At Step 3 we solve the auxiliary problem, which takes \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d))\) time. Step 4 and Step 5 require at most \(\mathcal {O}(dN)\) operations. Steps 2–5 are executed N times. Step 6 requires \(\mathcal {O}(dN)\) operations. Thus, the total time complexity of the algorithm is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\).   \(\square \)

Remark 3

The value \((T_{\max }-T_{\min }+1)\) is bounded by N. So the algorithm has a complexity at most \(\mathcal {O}(N^2(MN+d))\).

Remark 4

If the values \(T_{\max }\) and \(T_{\min }\) are fixed then the algorithm has a complexity at most \(\mathcal {O}(N^2(M+d))\).

5 Conclusion

In this paper, we have shown the strong NP-hardness of one cardinality-weighted quadratic problem of partitioning a sequence of points in Euclidean space into two clusters, in which the center of one of the desired clusters is fixed and the center of the other is unknown and is determined as the geometric center (centroid). We have also presented the first algorithmic result for this problem: a polynomial-time 2-approximation algorithm. It seems important to continue studying the algorithmic approximability of the problem, since it is still poorly studied in the algorithmic direction.