Abstract
We consider a problem of 2-partitioning a finite sequence of points in Euclidean space into clusters of the given sizes with some constraints. The solution criterion is the minimum of the sum of weighted intracluster sums of squared distances between the elements of each cluster and its center. The weight of the intracluster sum is equal to the cluster size. The center of one cluster is given as input (is the origin without loss of generality), while the center of the other one is unknown and is determined as a geometric center. The following constraints hold: the difference between the indices of two subsequent points included in the first cluster is bounded from above and below by some given constants. In this paper, we have shown that the considered problem is the strongly NP-hard one and propose a polynomial-time 2-approximation algorithm for solving the problem.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
- Euclidean space
- Sequence of points
- Weighted 2-partition
- Quadratic variation
- NP-hard problem
- Approximation algorithm
- Polynomial time
1 Introduction
The subject of this study is one quadratic cardinality-weighted problem of partitioning a sequence of points in Euclidean space into two subsequences of the given sizes with some additional constraints. The goal of our study is to analyze the computational complexity of the problem and to substantiate a polynomial-time approximation algorithm for this problem. The motivation of our study is the relevance of the considered problem, for example, for data mining and data clustering, when the data having in the hands is a time series.
The paper is organized as follows. In Sect. 2 the problem formulation and its interpretation are presented. Also, we present one closely related problem and some known results for it there. In addition, in the same section, we analyze the problem complexity. In Sect. 3 we formulate an auxiliary problem and some statements which underlie quality estimates for the proposed algorithm. Section 4 contains the approximation algorithm for the solution to the problem considered. The analysis of the algorithm properties is also in this section.
2 Problem Formulation, Related Problem, and Complexity
Everywhere below \(\mathbb {R}\) denotes the set of real numbers, \(\Vert \cdot \Vert \) denotes the Euclidean norm, and \(\langle \cdot , \cdot \rangle \) denotes the scalar product.
We consider the following
Problem 1
Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\) and some positive integers \(T_{\min },\, T_{\max }\), and \(M>1\). Find a subset \(\mathcal {M}=\{n_1, n_2, \dots \} \subset \mathcal {N}=\{1,\dots ,N\}\) of indices in \(\mathcal {Y}\) such that
where \(\overline{y}(\{y_n| n\in \mathcal {M}\}) = \frac{1}{|\mathcal {M}|} \sum _{i \in \mathcal {M}}y_i\) is the centroid of \(\{y_n| n\in \mathcal {M}\}\) with the following constraints
and \(|\mathcal {M}|=M\).
The following applied interpretation can be proposed for Problem 1. As the input, we have a sequence \(\mathcal {Y}\) of N measurement results. This sequence is the time-ordered one (i.e., the input is time series or discrete signal). The measurements are taken for d characteristics of some object. There are two different possible states for this object (active and passive, for example). We know that exactly M times the object was in the active state (or the probability of the active state is \(\frac{M}{N}\)). And we also know that there is an error for each measurement. The correspondence between the elements of the input sequence and the states is unknown. But the time interval between every two consecutive active states is bounded from below and above by some constants \(T_{\min }\) and \(T_{\max }\). It requires to find 2-partition of the input sequence and evaluate the object characteristics (i.e., \(\overline{y}(\{y_n| n\in \mathcal {M}\})\) in accordance with (1)).
This application problem is very typical for processing time-series or discrete signals, for example, in distant object monitoring and in geophysics, in technical and medical diagnostics, etc. (see, for example, [1,2,3,4,5]).
The considered problem is closely related to
Problem 2
(Cardinality-weighted variance-based 2-clustering with given center). Given an N-element set \(\mathcal {Y}\) of points in \(\mathbb {R}^d\), and positive integer number M. Find a partition of \(\mathcal {Y}\) into two non-empty clusters \(\mathcal {C}\) and \(\mathcal {Y} \setminus \mathcal {C}\) such that
where \(\overline{y}(\mathcal {C}) = \frac{1}{|\mathcal {C}|} \sum \limits _{y \in \mathcal {C}} y\) is the centroid of \(\mathcal {C}\), subject to constrain \(|\mathcal {C}|=M\).
The strong NP-hardness of Problem 2 was established in [6]. The strong NP-hardness of Problem 1 follows from this result, as Problem 2 is the special case of Problem 1 when \(T_{\min }=1\) and \(T_{\max }=N\).
Problem 2 has been studied in algorithmic direction in [7,8,9,10,11].
In [7], an exact pseudo-polynomial algorithm was constructed for the case of integer components of the input points and fixed dimension d of the space. The running time of this algorithm is \(\mathcal {O}(N (M D)^d)\), where D is the maximum absolute value of coordinates of the input points.
In [8], an approximation scheme that allows one to find \((1+\varepsilon )\)-approximate solution in \(\mathcal {O}\left( dN^2\big (\sqrt{\frac{2d}{\varepsilon }}+2\big )^d\right) \) time was proposed. It implements an FPTAS in the case of the fixed space dimension.
Moreover, in [9], the modification of this algorithm with improved time complexity: \(\mathcal {O}\Big (\sqrt{d}N^2\big (\frac{\pi e}{2}\big )^{d/2}\big (\sqrt{\frac{2}{\varepsilon }}+2\big )^d\Big ),\) was proposed. The algorithm implements an FPTAS in the case of fixed space dimension and remains polynomial for instances of dimension \(\mathcal {O}(\log n)\). In this case, it implements a PTAS with \(\mathcal {O}\left( N^{C\,(1.05+\log (2+\sqrt{\frac{2}{\varepsilon }}))}\right) \) time, where C is a positive constant.
In [10], an approximation algorithm that allows one to find a 2-approximate solution to the problem in \(\mathcal {O}\left( dN^2\right) \) time was constructed.
In [11], a randomized algorithm was constructed. It allows one to find \((1 + \varepsilon )\)-approximate solution with probability not less than \(1 - \gamma \) in \(\mathcal {O}(d N)\) time for an established parameter value, a given relative error \(\varepsilon \) and fixed \(\gamma \). The conditions are found under which the algorithm is asymptotically exact and runs in \(\mathcal {O}(d N^2)\) time.
In this paper, we present the first result for strongly NP-hard Problem 1. Namely, we present a 2-approximation algorithm. This algorithm is based on the approaches and results presented in [10, 12,13,14]. The running time of this algorithm is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\).
3 Foundations of the Algorithm
In this section, we formulate some statements (e.g. about indices of \(\mathcal {M}\)) and formulate one more auxiliary problem which can be solved in polynomial time. All these statements and auxiliary problem are necessary for substantiation of our algorithms.
The following lemma is well known (see, for example, [15, 16]).
Lemma 1
For a finite set \(\mathcal {Z} \subset \mathbb {R}^d\), if a point \(t\in \mathbb {R}^{d}\) is closer (in terms of distance) to the centroid \(\overline{z}\) of \(\mathcal {Z}\) than any point in \(\mathcal {Z}\), then
Two following lemmas were proved in [12, 13].
Lemma 2
If the elements of \(\mathcal {M}=\{n_1,\dots ,n_M\}\) belong to \(\mathcal {N}=\{1,\ldots ,N\}\) and satisfy the system of constraints (2), then for every fixed \(M\in \{2,\dots ,N\}\) we have:
-
(1)
the parameters of this system are related by inequality
$$\begin{aligned} (M-1)T_{\min }\le N-1, \end{aligned}$$(3) -
(2)
the element \(n_m\) in \(\{n_1,\dots ,n_m,\dots ,n_M\}\) belongs to the set
(4) -
(3)
the feasibility domain of components \(n_{m-1}\) from this set under condition \(n_m=n\) is defined by formula
$$\begin{aligned} \gamma _{m-1}^{-}(n)=\{j| \max \{1+(m-2)T_{\min },n-T_{\max }\}\le j \le n-T_{\min }\} , \end{aligned}$$(5)where \(n\in \omega _m, \, m=2,\dots ,M.\)
Lemma 3
For every \(M\in \{2,\dots ,N\}\) the system of constraints (2) is feasible if and only if inequality (3) holds.
Consider the following function:
It is similar to the objective function of Problem 1. The only difference is the point b instead of the centroid \(\overline{y}(\{y_n| n\in \mathcal {M}\})\). This function can be rewritten as follows:
Note that the first summand is a constant if M and b are the fixed values. Hence the minimum of \(S(\mathcal {M},b)\) is reached on the subsequence that maximizes the second summand. This expression motivates us to formulate auxiliary:
Problem 3
Given a sequence \(\mathcal {Y} = (y_1, \ldots , y_N)\) of points in \(\mathbb {R}^d\), a point \(b\in \mathbb {R}^d\), and some positive integers \(T_{\min },\, T_{\max }\) and M. Find a subset \(\mathcal {M}=\{n_1,\dots ,n_M\} \subset \mathcal {N}=\{1,\dots ,N\}\) of indices in the sequence \(\mathcal {Y}\) such that
where
with additional constraints (2) on the elements of \(\mathcal {M}\), if \(M\ne 1\).
Let us define the set \(\varPsi _{M}\) of subsets of admissible index tuples in the auxiliary problem:
For \(M=1\) the set \(\varPsi _M\) is not empty for any parameters \(T_{\min }\) and \(T_{\max }\) by definition (9). For other feasible values of M we have [12, 13]:
Lemma 4
If \(M\ge 2\), then the set \(\varPsi _M\) is not empty if and only if an inequality (3) holds.
Proofs of the following lemma and its corollary in [12, 13] do not use (8) and so they hold for our case too.
Lemma 5
Let \(\varPsi _M\ne \emptyset \) for some \(M\ge 1\). Then for this M, the optimal value \(G_{\max }^b=\max \limits _{\mathcal {M}}G^b(\mathcal {M})\) of objective function (7) can be found by formula
The values \(G_M^b(n)\), \(n\in \omega _M\), can be calculated by the following recurrent formulae:
where sets \(\omega _m\) and \(\gamma ^{-}_{m-1}(n)\) are defined by formulae (4) and (5).
Corollary 1
The elements \({n}^b_1,\dots ,{n}^b_M\) of the optimal set \({\mathcal {M}^{b}} =\arg \max \limits _{\mathcal {M}}G^b(\mathcal {M})\) can be found by the following recurrent formulae:
The following algorithm finds an optimal solution for auxiliary Problem 3. The step-by-step description of the algorithm looks like as follows.
-
Algorithm \(\mathcal {A}_1\).
-
Input: a sequence \(\mathcal {Y}\), a point b, some positive integer \(T_{\min }\), \(T_{\max }\), M.
-
Step 1. Compute \(g^b(n)\), \(n\in \mathcal {N}\), using formula (8).
-
Step 2. Using recurrent formulae (11), compute \(G_m^b(n)\) for each \(n\in w_n\) and \(m=1,\dots ,M\).
-
Step 3. Find the maximal value \(G_{\max }^b\) of the objective function \(G^b\) using formula (10) and the optimal set \({\mathcal {M}^b}=\{{n}^b_1,\dots ,{n}^b_M\}\) by (12) and (13) from Corollary 1.
-
Output: the value of \(G_{\max }^b\), the set \({\mathcal {M}}^b\).
The following theorem has been established in [13].
Theorem 1
Algorithm \(\mathcal {A}_1\) finds an optimal solution of Problem 3 in \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d))\) time.
Remark 1
The value \((T_{\max }-T_{\min }+1)\) is bounded by N. So the algorithm has a complexity at most \(\mathcal {O}(N(MN+d))\).
Remark 2
If the values \(T_{\max }\) and \(T_{\min }\) are fixed then the algorithm has a complexity at most \(\mathcal {O}(N(M+d))\).
4 Approximation Algorithm
The idea of the proposed algorithm is that the solution of the original NP-hard Problem 1 is replaced by an efficient algorithmic exact solution of auxiliary Problem 3. The solutions are found by Algorithm \(\mathcal {A}_1\) presented in the previous section for each point \(b \in \mathcal {Y}\) and then the best one is to be chosen.
-
Algorithm \(\mathcal {A}\).
-
Input: a sequence \(\mathcal {Y}\), some positive integers \(T_{\min }\), \(T_{\max }\), M.
-
Step 1. \(i := 0\), \(\mathcal {M}_\mathcal {A} : = \emptyset \), \(N := -\infty \).
-
Step 2. \(i := i + 1\), \(b := y_i\).
-
Step 3. Find the optimal solution \(\mathcal {M}^b\) and the maximal value \(G_{\max }^b\) of objective function (7) using algorithm \(\mathcal {A}_1\).
-
Step 4. If \(H < G_{\max }^b\) then \(b_\mathcal {A} := b\), \(H := G_{\max }^b\), \(\mathcal {M}_\mathcal {A} = \mathcal {M}^b\).
-
Step 5. If \(i < N\) then go to Step 2, else go to Step 6.
-
Step 6. Find the centroid \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\) and the value of the objective function \(F(\mathcal {M_A})\) by (1).
-
Output: the set \(\mathcal {M}_\mathcal {A}\), the value \(F(\mathcal {M_A})\), the points \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\) and \(b_\mathcal {A}\).
Theorem 2
Algorithm \(\mathcal {A}\) finds a 2-approximate solution of Problem 1 in \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\) time.
Proof
Let \(\mathcal {M}^*\) be an optimal solution of Problem 1, \(y^*=\overline{y}(\{y_n|n \in \mathcal {M}^*\})\) be the centroid of this optimal solution, and \(\mathcal {M}_{\mathcal {A}}\), \(F(\mathcal {M_A})\), \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\), \(b_\mathcal {A}\) be the output of Algorithm \(\mathcal {A}\). Let \(t=\arg \min \limits _{i\in \mathcal {M}^{*}}||y_i-y^*||^{2}\) be a point from the optimal subsequence that is the closest to \(y^*\).
Let us substantiate the accuracy of the algorithm. The following chain of equalities and inequalities holds:
Let us explain the numbered signs (we have chosen those that are not obvious). One can check by the differentiation that the minimum of \(S(\mathcal {M}_\mathcal {A},\cdot )\) (for the fixed \(\mathcal {M}_\mathcal {A}\)) is attained at \(\overline{y}(\{y_n|n \in \mathcal {M_A}\})\), so inequality (1) holds. Equality (2) follows from Theorem 1 and the description of Algorithm \(\mathcal {A}_1\). Equality (3) is obvious by (6). Inequality (4) holds since \(\mathcal {M}^{*} \in \varPsi _{M}\) and \(t\in \mathcal {Y}\) and inequality (5) holds by Lemma 1.
Hence \(F(\mathcal {M}_{\mathcal {A}}) / F(\mathcal {M}^*)\le 2\) and Algorithm \(\mathcal {A}\) finds a 2-approximate solution of Problem 1.
Let us estimate the time complexity. At Step 1 and Step 2 we need \(\mathcal {O}(d)\) operations. At Step 3 we solve the auxiliary problem and so we need \(\mathcal {O}(N(M(T_{\max }-T_{\min }+1)+d))\) time. At Step 4 and Step 5 we need at most \(\mathcal {O}(dN)\) operations. Step 2 – Step 5 are executed for N times. Step 6 requires \(\mathcal {O}(dN)\) operations. Thus, the total time complexity of the algorithm is \(\mathcal {O}(N^2(M(T_{\max }-T_{\min }+1)+d))\). \(\square \)
Remark 3
The value \((T_{\max }-T_{\min }+1)\) is bounded by N. So the algorithm has a complexity at most \(\mathcal {O}(N^2(MN+d))\).
Remark 4
If the values \(T_{\max }\) and \(T_{\min }\) are fixed then the algorithm has a complexity at most \(\mathcal {O}(N^2(M+d))\).
5 Conclusion
In this paper, we have shown the strong NP-hardness of one cardinality-weighted quadratic partitioning problem of a sequence of points in Euclidean space into two clusters when the center of one of the desired clusters is fixed and the center of the other one is unknown and is determined as a geometric center. We also presented the first algorithmic result for the considered problem. This result is the polynomial-time 2-approximation algorithm. It seems important to continue studying the questions on the algorithmic approximability of the problem since the considered problem is poorly studied in the algorithmic direction.
References
Fu, T.: A review on time series data mining. Eng. Appl. Artif. Intell. 24(1), 164–181 (2011)
Kuenzer, C., Dech, S., Wagner, W.: Remote Sensing Time Series. RDIP, vol. 22. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-15967-6
Liao, T.W.: Clustering of time series data – a survey. Pattern Recogn. 38(11), 1857–1874 (2005)
Kel’manov, A.V., Jeon, B.: A posteriori joint detection and discrimination of pulses in a quasiperiodic pulse train. IEEE Trans. Signal Process. 52(3), 645–656 (2004)
Carter, J.A., Agol, E., et al.: Kepler-36: a pair of planets with neighboring orbits and dissimilar densities. Science 337(6094), 556–559 (2012)
Kel’manov, A.V., Pyatkin, A.V.: NP-hardness of some quadratic Euclidean 2-clustering problems. Dokl. Math. 92(2), 634–637 (2015)
Kel’manov, A.V., Motkova, A.V.: Exact Pseudopolynomial Algorithms for a Balanced 2-Clustering Problem. J. Appl. Ind. Math. 10(3), 349–355 (2016)
Kel’manov, A., Motkova, A.: A fully polynomial-time approximation scheme for a special case of a balanced 2-clustering problem. In: Kochetov, Y., Khachay, M., Beresnev, V., Nurminski, E., Pardalos, P. (eds.) DOOR 2016. LNCS, vol. 9869, pp. 182–192. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-44914-2_15
Kel’manov, A., Motkova, A., Shenmaier, V.: An approximation scheme for a weighted two-cluster partition problem. In: van der Aalst, W., et al. (eds.) AIST 2017. LNCS, vol. 10716, pp. 323–333. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-73013-4_30
Kel’manov, A.V., Motkova, A.V.: Polynomial-time approximation algorithm for the problem of cardinality-weighted variance-based 2-clustering with a given center. Comput. Math. Math. Phys. 58(1), 130–136 (2018)
Kel’manov, A., Khandeev, V., Panasenko, A.: Randomized algorithms for some clustering problems. In: Eremeev, A., Khachay, M., Kochetov, Y., Pardalos, P. (eds.) OPTA 2018. CCIS, vol. 871, pp. 109–119. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-93800-4_9
Kel’manov, A.V., Khamidullin, S.A.: Posterion detection of a given number of identical subsequences in a quasi-periodic sequence. Comp. Math. Math. Phys. 41(5), 762–774 (2001)
Kel’manov, A.V., Khamidullin, S.A.: An approximation polynomial algorithm for a sequence partitioning problem. J. Appl. Ind. Math. 8(2), 236–244 (2014)
Kel’manov, A.V., Khamidullin, S.A., Khandeev, V.I.: Exact pseudopolynomial algorithm for one sequence partitioning problem. Autom. Remote Control 78(1), 67–74 (2017)
Kel’manov, A.V., Romanchenko, S.M.: An FPTAS for a vector subset search problem. J. Appl. Ind. Math. 8(3), 329–336 (2014)
Kel’manov, A.V., Romanchenko, S.M.: An approximation algorithm for solving a problem of search for a vector subset. J. Appl. Ind. Math. 6(1), 90–96 (2012)
Acknowledgments
The study presented in Sects. 3 and 4 was supported by the Russian Foundation for Basic Research, projects 19-07-00397, 19-01-00308 and 18-31-00398. The study presented in the other sections was supported by the Russian Academy of Science (the Program of basic research), project 0314-2019-0015, and by the Russian Ministry of Science and Education under the 5-100 Excellence Programme.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Kel’manov, A., Khamidullin, S., Panasenko, A. (2020). 2-Approximation Polynomial-Time Algorithm for a Cardinality-Weighted 2-Partitioning Problem of a Sequence. In: Sergeyev, Y., Kvasov, D. (eds) Numerical Computations: Theory and Algorithms. NUMTA 2019. Lecture Notes in Computer Science(), vol 11974. Springer, Cham. https://doi.org/10.1007/978-3-030-40616-5_34
Download citation
DOI: https://doi.org/10.1007/978-3-030-40616-5_34
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-40615-8
Online ISBN: 978-3-030-40616-5
eBook Packages: Computer ScienceComputer Science (R0)