1 Introduction

Clustering is one of the most fundamental problems in computer science, and finds numerous applications in many different areas, such as data management, machine learning, bioinformatics, networking, etc. [38]. The common goal of many clustering problems is to partition a set of given data items into a number of clusters so as to minimize the total cost measured by a certain objective function. For example, the popular k-means (or k-median) clustering seeks k mean (or median) points to induce a partition of the given data items so that the average squared distance (or the average distance) from each data item to its closest mean (or median) point is minimized. Most existing clustering techniques assume that the data items are independent of each other and therefore can “freely” determine their memberships in the resulting clusters (i.e., a data item does not need to pay attention to the clustering of others). However, in many real-world applications, data items are often constrained or correlated, and handling such additional constraints requires a great deal of effort. In recent years, considerable attention has been paid to various types of constrained clustering problems, such as l-diversity clustering [46], r-gather clustering [2, 27, 36], capacitated clustering [5, 20, 41], chromatic clustering [7, 23], and probabilistic clustering [19, 33, 45]. In this paper, we study a class of constrained clustering problems of points in Euclidean space.

Given a set of points P in \(\mathbb {R}^d\), a positive integer k, and a constraint \(\mathbb {C}\), the constrained k-means (or k-median) clustering problem is to partition P into k clusters so as to minimize the objective function of the ordinary k-means (or k-median) clustering and satisfy the constraint \(\mathbb {C}\). In general, the problems are denoted by k-CMeans and k-CMedian, respectively.

The detailed definition for each individual problem is given in Sect. 4. Roughly speaking, data constraints can be imposed at either cluster or item level. Cluster level constraints are restrictions on the resulting clusters, such as the size of the clusters [2] or their mutual differences [59], while item level constraints are mainly on data items inside each cluster, such as the coloring constraint which prohibits items of the same color being clustered into one cluster [7, 23, 46].

Fig. 1

a The Voronoi diagram induced by the mean points (i.e. the grey points) of k-means clustering for \(k=3\); b the Voronoi diagram induced by the mean points of chromatic k-means clustering, where the points sharing the same color should be in different clusters

The additional constraints could considerably change the nature of the clustering problems. For instance, one key property exhibited in many unconstrained clustering problems is the so called locality property, which indicates that each cluster is located entirely inside the Voronoi cell of its center (e.g., the mean, median, or center point) in the Voronoi diagram of all the centers [37] (see Fig. 1a). Existing algorithms for these clustering problems often rely on such a property [8, 10, 17, 31, 37, 43, 48, 50]. However, due to the additional constraints, the locality property may no longer exist (see Fig. 1b). Therefore, we need new techniques to overcome this challenge.

1.1 Our Main Results

In this paper, we present a unified framework called Peeling-and-Enclosing (PnE), based on two standalone geometric techniques called the Simplex Lemma and the Weaker Simplex Lemma, to solve a class of constrained clustering problems without the locality property in Euclidean space, where the dimensionality of the space could be rather high and the number k of clusters is assumed to be some fixed number. Particularly, we investigate the constrained k-means (k-CMeans) and k-median (k-CMedian) versions of these problems. For the k-CMeans problem, our unified framework generates in \(O(n(\log n)^{k+1}d)\) time a set of k-tuple candidates of cardinality \(O((\log n)^{k})\) for the to-be-determined k mean points. We show that among the set of candidates, one of them induces a \((1+\epsilon )\)-approximation for k-CMeans. To find the best k-tuple candidate, a problem-specific selection algorithm is needed for each individual constrained clustering problem (note that due to the additional constraints, the selection problems may not be trivial). Combining the unified framework with the selection algorithms, we obtain a \((1+\epsilon )\)-approximation for each constrained clustering problem in the considered class. Our results considerably improve (in various ways) the best known algorithms for all these problems (see the table in Sect. 1.2). Below is a list of the constrained clustering problems considered in this paper. Our techniques can also be extended to achieve \((1+\epsilon )\)-approximations for the k-CMedian version of these constrained clustering problems. We expect that our technique will be applicable to other variants of k-means and k-median clustering problems without the locality property, as long as the corresponding selection problems can be solved.

  1. l-Diversity Clustering. In this problem, each input point is associated with a color, and each cluster has no more than a fraction \(\frac{1}{l}\) (for some constant \(l>1\)) of its points sharing the same color. The problem is motivated by a widely-used privacy preserving model called l-diversity [46, 47] in data management, which requires that each block contains no more than a fraction \(\frac{1}{l}\) of items sharing the same sensitive attribute.

  2. Chromatic Clustering. In [23], Ding and Xu introduced a new clustering problem called chromatic clustering, which requires that points with the same color be placed in different clusters. It is motivated by a biological application for clustering chromosome homologs in a population of cells, where homologs from the same cell should be clustered into different clusters. A similar problem also appears in applications related to transportation system design [7].

  3. Fault Tolerant Clustering. The problem of fault tolerant clustering assigns each point p to its l nearest cluster centers for some \(l\ge 1\), and counts all the l distances as its cost. The problem has been extensively studied in various applications for achieving better fault tolerance [16, 35, 40, 44, 54].

  4. r-Gather Clustering. This clustering problem requires each of the clusters to contain at least r points for some \(r>1\). It is motivated by the k-anonymity model for privacy preservation [2, 55], where each block contains at least k items.

  5. Capacitated Clustering. This clustering problem has an upper bound on the size of each cluster, and finds various applications in data mining and resource assignment [20, 41].

  6. Semi-Supervised Clustering. Many existing clustering techniques, such as ensemble clustering [52, 53] and consensus clustering [3, 18], make use of a priori knowledge. Since such clusterings are not always based on the geometric cost (e.g., the k-means cost) of the input, a more accurate way of clustering is to consider both the a priori knowledge and the geometric cost. We consider the following semi-supervised clustering problem: given a set P of points and a clustering \(\overline{\mathcal {S}}\) of P (based on the a priori knowledge), partition P into k clusters so as to minimize the sum (or some function) of the geometric cost and the difference from the given clustering \(\overline{\mathcal {S}}\). Another related problem is evolutionary clustering [15], where the clustering at each time point needs to minimize not only the geometric cost, but also the total shifting from the clustering at the previous time point (which can be viewed as \(\overline{\mathcal {S}}\)).

  7. Uncertain Data Clustering. Due to unavoidable errors, data for clustering are not always precise. This motivates us to consider the following probabilistic clustering problem [19, 33, 45]: given a set of “nodes” with each represented as a probabilistic distribution over a point set in \(\mathbb {R}^d\), group the nodes into k clusters so as to minimize the expected cost with respect to the probabilistic distributions.

Note Following our work published in [25], Bhattacharya et al. [14] improved the running time for finding the candidates of k-cluster centers from nearly linear to linear based on the elegant \(D^2\)-sampling. Their work also follows the framework for clustering constrained data presented in this paper, i.e., generating the candidates and then selecting the best one by a problem-specific selection algorithm. Our paper represents the first systematic theoretical study of the constrained clustering problems. Some of the underlying techniques, such as the Simplex Lemma and the Weaker Simplex Lemma, are interesting in their own right, and have already been used to solve other problems [26] (e.g., the popular “truth discovery” problem in data mining).

1.2 Related Works

Table 1 Existing and our new results for the class of constrained clustering problems

The above seven constrained clustering problems have been extensively studied in the past and a number of theoretical results have been obtained (in addition to many heuristic/practical solutions). Table 1 lists the best known theoretical results for each of them. It is clear that most existing results are either constant-factor approximations or hold only for restricted versions (e.g., constant dimensional space), and therefore can be improved by our techniques.

For the related traditional Euclidean k-means and k-median clustering problems, extensive research has been done in the past. Inaba et al. [37] showed that an exact k-means clustering can be computed in \(O(n^{O(dk)})\) time for n points in \(\mathbb {R}^{d}\). Arthur and Vassilvitskii [8] presented the k-means++ algorithm that achieves an expected \(O(\log k)\) approximation ratio. Ostrovsky et al. [50] provided a \((1+\epsilon )\)-approximation for well-separated point sets. Based on the concept of stability, Awasthi et al. [9] presented a PTAS for the k-means and k-median clustering problems. Matousek [48] obtained a nearly linear time \((1+\epsilon )\)-approximation for any fixed d and k. A similar result for k-median has also been achieved by Kolliopoulos and Rao [42]. Later, Fernandez de la Vega et al. [31] and Bădoiu et al. [10] achieved nearly linear time \((1+\epsilon )\)-approximations for high dimensional k-means and k-median clustering, respectively, for fixed k. Kumar et al. [43] showed that linear-time randomized \((1+\epsilon )\)-approximation algorithms can be obtained for several Euclidean clustering problems (such as k-means and k-median) in any dimensional space. Recently, this technique has been further extended to several clustering problems with non-metric distance functions [1]. Later, Jaiswal et al. [39] applied a non-uniform sampling technique, called \(D^2\)-sampling, to simplify and improve the result in [43]; their algorithm can also handle the non-metric distance clustering problems studied in [1]. Using the core-set technique, a series of improvements has been achieved for high dimensional clustering problems [30].

As for the hardness of the problem, Dasgupta [21] showed that k-means clustering in high dimensional space is NP-hard even for \(k=2\); Awasthi et al. [4] proved that there is no PTAS for k-means clustering if both d and k are large, unless \(P=NP\). Guruswami and Indyk [34] showed that it is NP-hard to obtain any PTAS for k-median clustering if k is not a constant and d is \(\varOmega (\log n)\).

Besides the traditional clustering models, Balcan et al. considered the problem of finding the clustering with small difference from the unknown ground truth [11, 12].

1.3 Our Main Ideas

Most existing k-means or k-median clustering algorithms in Euclidean space consist of two main steps: (1) identify the set of k mean or median points and (2) partition the input points into k clusters based on these mean or median points (we call this step Partition). Note that for unconstrained clustering problems, the Partition step can be done by using the Voronoi diagram of the obtained mean or median points; however, for some constrained clustering problems, the Partition step may not be trivial (see Fig. 1a, b). More formally, we have the following definition.

Definition 1

(Partition Step) Given an instance P of k-CMeans (or k-CMedian) and k cluster centers (i.e., the mean or median points), the Partition step is to form k clusters of P, where the clusters should satisfy the constraint and each cluster is assigned to an individual cluster center, such that the objective function of the ordinary k-means (or k-median) clustering is minimized.

To determine the candidate set of the k mean or median points in step (1), most existing algorithms (either explicitly or implicitly) rely on the locality property. To shed some light on this, consider a representative and elegant approach by Kumar et al. [43] for k-means clustering. Let \(\{Opt_1,\ldots ,Opt_k \}\) be the set of k unknown optimal clusters in non-increasing order of their sizes. Their approach uses random sampling and sphere peeling to iteratively find k mean points. At the jth iterative step, it draws \(j-1\) peeling spheres centered at the \(j-1\) already obtained mean points, and takes a random sample from the points outside the peeling spheres to find the jth mean point. Due to the locality property, the points belonging to the first \(j-1\) clusters lie inside their corresponding \(j-1\) Voronoi cells; that is, for each peeling sphere, most of the points it covers belong to its corresponding cluster, which ensures the correctness of the peeling step.

However, when the additional constraint (such as coloring or size) is imposed on the points, the locality property may no longer exist (see Fig. 1b), and thus the correctness of the peeling step cannot always be guaranteed. In this scenario, the core-set technique [30] is also unlikely to be able to resolve the issue. The main reason is that although the core-set can greatly reduce the size of the input points, it is quite challenging to impose the constraint through the core-set.

To overcome this challenge, we present a unified framework, called Peeling-and-Enclosing (PnE), in this paper, based on a standalone new geometric technique called Simplex Lemma. The simplex lemma aims to address the major obstacle encountered by the peeling strategy in [43] for constrained clustering problems. More specifically, due to the loss of locality, at the jth peeling step, the points of the jth cluster \(Opt_{j}\) could be scattered over all the Voronoi cells of the first \(j-1\) mean points, and therefore their mean point can no longer be simply determined by the sample outside the \(j-1\) peeling spheres. To resolve this issue, our main idea is to view \(Opt_{j}\) as the union of j unknown subsets, \(Q_{1}, \ldots , Q_{j}\), with each \(Q_{l}\), \(1\le l \le j-1\), being the set of points inside the Voronoi cell (or peeling sphere) of the obtained lth mean point and \(Q_{j}\) being the set of remaining points of \(Opt_{j}\). After approximating the mean point of each unknown subset by using random sampling, we build a simplex to enclose a region which contains the mean point of \(Opt_{j}\), and then search the simplex region for a good approximation of the jth mean point. To make this approach work, we need to overcome two difficulties: (a) how to generate the desired simplex to contain the jth mean point, and (b) how to efficiently search for the (approximate) jth mean point inside the simplex.

For difficulty (a), our idea is to use the already determined \(j-1\) mean points (which can be shown to also be approximate mean points of \(Q_{1}, \ldots , Q_{j-1}\), respectively) and another point, which is the mean of those points in \(Opt_{j}\) outside the peeling spheres (or Voronoi cells) of the first \(j-1\) mean points (i.e., \(Q_{j}\)), to build a (\(j-1\))-dimensional simplex to contain the jth mean point. Since we do not know how \(Opt_{j}\) is partitioned (i.e., how \(Opt_{j}\) intersects the \(j-1\) peeling spheres), we vary the radii of the peeling spheres \(O(\log n)\) times to guess the partition and generate a set of simplexes, where the radius candidates are based on an upper bound of the optimal value determined by a novel estimation algorithm (in Sect. 3.4). We show that among the set of simplexes, one of them contains the jth (approximate) mean point.

For difficulty (b), our simplex lemma (in Sect. 2) shows that if each vertex \(v_{l}\) of the simplex \(\mathcal {V}\) is the (approximate) mean point of \(Q_{l}\), then we can find a good approximation of the mean point of \(Opt_j\) by searching a small-size grid inside \(\mathcal {V}\). A nice feature of the simplex lemma is that the grid size is independent of the dimensionality of the space and thus can be used to handle high dimensional data. In some sense, our simplex lemma can be viewed as a considerable generalization of the well-known sampling lemma (i.e., Lemma 4 in this paper) in [37], which has been widely used for estimating the mean of a point set through random sampling [29, 37, 43]. Different from Lemma 4, which requires a global view of the point set (meaning that the sample needs to be taken from the point set), our simplex lemma only requires some partial views (e.g., sample sets are taken from those unknown subsets whose size might be quite small). If \(Opt_j\) is the point set, our simplex lemma enables us to bound the error by the variance of \(Opt_j\) (i.e., a local measure) and the optimal value of the clustering problem on the whole instance P (i.e., a global measure), and thus helps us to ensure the quality of our solution.

For the k-CMedian problem, we show that although the simplex lemma no longer holds since the median point may lie outside the simplex, a weaker version (in Sect. 5.1) exists, which searches a surrounding region of the simplex. Thus our Peeling-and-Enclosing framework works for both k-CMeans and k-CMedian. It generates in total \(O((\log n)^{k})\) k-tuple candidates for the constrained k mean or median points. To determine the best k mean or median points, we need to use the property of each individual problem to design a selection algorithm. The selection algorithm takes each k-tuple candidate, computes a clustering (i.e., completing the Partition step) satisfying the additional constraint, and outputs the k-tuple with the minimum cost. We present a selection algorithm for each considered problem in Sects. 4 and 5.4.
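To make the two phases concrete, the following minimal Python sketch shows how they fit together; generate_candidates and partition are placeholder names (not fixed by this paper) for the Peeling-and-Enclosing procedure of Sect. 3 and the problem-specific Partition step of Definition 1, respectively.

```python
def constrained_clustering(P, k, generate_candidates, partition):
    """Skeleton of the framework: enumerate candidate k-tuples of cluster centers,
    run the problem-specific Partition step on each, and keep the cheapest result.
    Both callables are placeholders for the routines developed in Sects. 3-5."""
    best_cost, best = float("inf"), None
    for centers in generate_candidates(P, k):
        clusters, cost = partition(P, centers)   # must produce a clustering satisfying the constraint
        if cost < best_cost:
            best_cost, best = cost, (centers, clusters)
    return best, best_cost
```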

2 Simplex Lemma

Fig. 2

Examples for Lemmas 1 and 3 with \(j=4\) respectively

In this section, we present the Simplex Lemma for approximating the mean point of an unknown point set Q, where the only known information is a set of j points with each of them being an approximate mean point of an unknown subset of Q. In Sect. 5.1, we show how to extend the idea to approximating the median point via the Weaker Simplex Lemma. These two lemmas are crucially used to solve the k-CMeans and k-CMedian problems.

Lemma 1

(Simplex Lemma I) Let Q be a set of points in \(\mathbb {R}^d\) with a partition of \(Q=\cup ^j_{l=1} Q_l\) and \(Q_{l_1}\cap Q_{l_2}=\emptyset \) for any \(l_1\ne l_2\). Let o be the mean point of Q, and \(o_l\) be the mean point of \(Q_l\) for \(1\le l\le j\). Let the variance of Q be \(\delta ^2=\frac{1}{|Q|}\sum _{q\in Q}||q-o||^2\), and \(\mathcal {V}\) be the simplex determined by \(\{o_1, \ldots , o_j\}\). Then for any \(0<\epsilon \le 1\), it is possible to construct a grid of size \(O((8j/\epsilon )^j)\) inside \(\mathcal {V}\) such that at least one grid point \(\tau \) satisfies the inequality \(||\tau -o||\le \sqrt{\epsilon }\delta \).

Figure 2a gives an example for Lemma 1. To prove Lemma 1, we first introduce the following lemma.

Lemma 2

Let Q be a set of points in \(\mathbb {R}^d\), and \(Q_{1}\) be a subset containing \(\alpha |Q|\) points for some \(0<\alpha \le 1\). Let o and \(o_1\) be the mean points of Q and \(Q_{1}\), respectively. Then \(||o_{1}-o||\le \sqrt{\frac{1-\alpha }{\alpha }}\delta \), where \(\delta ^2=\frac{1}{|Q|}\sum _{q\in Q}||q-o||^2\).

Proof of Lemma 2

The following claim has been proved in [43].

Claim 1

Let Q be a set of points in \(\mathbb {R}^d\) space, and o be the mean point of Q. For any point \(\tilde{o}\in \mathbb {R}^d\), \(\sum _{q\in Q}||q-\tilde{o}||^2=\sum _{q\in Q}||q-o||^2+|Q| \times ||o-\tilde{o}||^2\).

Let \(Q_2=Q{\setminus } Q_1\), and \(o_{2}\) be its mean point. By Claim 1, we have the following two equalities.

$$\begin{aligned} \sum _{q\in Q_{1}}||q-o||^2= & {} \sum _{q\in Q_{1}}||q-o_1||^2+|Q_1| \times ||o_1-o||^2, \end{aligned}$$
(1)
$$\begin{aligned} \sum _{q\in Q_{2}}||q-o||^2= & {} \sum _{q\in Q_{2}}||q-o_2||^2+|Q_2| \times ||o_2-o||^2. \end{aligned}$$
(2)

Let \(L=||o_{1}-o_{2}||\). By the definition of mean point, we have

$$\begin{aligned} o=\frac{1}{|Q|}\sum _{q\in Q} q=\frac{1}{|Q|}\left( \sum _{q\in Q_1} q+\sum _{q\in Q_2} q\right) =\frac{1}{|Q|}(|Q_1|o_1+|Q_2|o_2). \end{aligned}$$
(3)

Thus the three points \(\{o, o_1, o_2\}\) are collinear, while \(||o_{1}-o||=(1-\alpha ) L\) and \(||o_{2}-o||=\alpha L\). Meanwhile, by the definition of \(\delta \), we have

$$\begin{aligned} \delta ^2=\frac{1}{|Q|}\left( \sum _{q\in Q_{1}}||q-o||^2+\sum _{q\in Q_{2}}||q-o||^2\right) . \end{aligned}$$
(4)

Combining (1) and (2), we have

$$\begin{aligned} \delta ^2= & {} \frac{1}{|Q|}\left( \sum _{q\in Q_{1}}||q-o_1||^2+ |Q_1| \times ||o_1-o||^2 \nonumber \right. \\&\left. +\sum _{q\in Q_{2}}||q-o_2||^2+|Q_2| \times ||o_2-o||^2\right) \nonumber \\\ge & {} \frac{1}{|Q|}\left( |Q_1| \times ||o_1-o||^2+|Q_2| \times ||o_2-o||^2\right) \nonumber \\= & {} \alpha ((1-\alpha )L)^2+(1-\alpha )(\alpha L)^2\nonumber \\= & {} \alpha (1-\alpha )L^2. \end{aligned}$$
(5)

Thus, we have \(L\le \frac{\delta }{\sqrt{\alpha (1-\alpha )}}\), which means that \(||o_{1}-o||=(1-\alpha ) L\le \sqrt{\frac{1-\alpha }{\alpha }}\delta \).    \(\square \)
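Lemma 2 is easy to verify numerically. The following snippet (a sanity check with arbitrary synthetic data, not part of the algorithm) draws a random point set, takes a subset containing an \(\alpha \) fraction of the points, and confirms the claimed bound.

```python
import numpy as np

rng = np.random.default_rng(1)
Q = rng.standard_normal((20000, 10))                      # a point set in R^d
o = Q.mean(axis=0)
delta = np.sqrt(np.mean(np.sum((Q - o) ** 2, axis=1)))    # sqrt of the variance of Q

alpha = 0.3
Q1 = Q[: int(alpha * len(Q))]                             # any subset of alpha*|Q| points works
o1 = Q1.mean(axis=0)

# Lemma 2: ||o1 - o|| <= sqrt((1 - alpha)/alpha) * delta
print(np.linalg.norm(o1 - o) <= np.sqrt((1 - alpha) / alpha) * delta)   # True
```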

Proof of Lemma 1

We prove this lemma by induction on j.

Base case For \(j=1\), since \(Q_1=Q\), \(o_1=o\). Thus, the simplex \(\mathcal {V}\) and the grid are all simply the point \(o_1\). Clearly \(\tau =o_{1}\) satisfies the inequality.

Induction step Assume that the lemma holds for any \(j\le j_0\) for some \(j_{0} \ge 1\) (i.e., the induction hypothesis). Now we consider the case of \(j=j_0+1\). First, we assume that \(\frac{|Q_l |}{|Q|}\ge \frac{\epsilon }{4j}\) for each \(1\le l\le j\). Otherwise, we can reduce the problem to the case of a smaller j in the following way. Let \(I=\{l| 1\le l\le j, \frac{|Q_l |}{|Q|}< \frac{\epsilon }{4j}\}\) be the index set of small subsets. Then, \(\frac{\sum _{l\in I}|Q_l |}{|Q|}<\frac{\epsilon }{4}\), and \(\frac{\sum _{l\not \in I}|Q_l |}{|Q|}\ge 1-\frac{\epsilon }{4}\). By Lemma 2, we know that \(||o'-o||\le \sqrt{\frac{\epsilon /4}{1-\epsilon /4}}\delta \), where \(o'\) is the mean point of \(\cup _{l\not \in I}Q_l\). Let \((\delta ')^2\) be the variance of \(\cup _{l\not \in I}Q_l\). Then, we have \((\delta ')^2\le \frac{|Q|}{|\cup _{l\not \in I}Q_l|}\delta ^2\le \frac{1}{1-\epsilon /4}\delta ^2\). Thus, if we replace Q and \(\epsilon \) by \(\cup _{l\not \in I}Q_l\) and \(\frac{\epsilon }{16}\), respectively, and find a point \(\tau \) such that \(||\tau -o'||^2\le \frac{\epsilon }{16}(\delta ')^2\le \frac{\epsilon /16}{1-\epsilon /4}\delta ^2\), then we have

$$\begin{aligned} ||\tau -o||^2\le (||\tau -o'||+||o'-o||)^2\le \frac{\frac{9}{16}\epsilon }{1-\epsilon /4}\delta ^2\le \epsilon \delta ^2 , \end{aligned}$$
(6)

where the last inequality is due to the fact \(\epsilon <1\). This means that we can reduce the problem to a problem with the point set \(\cup _{l\not \in I}Q_l\) and a smaller j (i.e., \(j-|I|\)). By the induction hypothesis, we know that the reduced problem can be solved, where the new simplex would be a subset of \(\mathcal {V}\) determined by \(\{o_l\mid 1\le l\le j, l\not \in I\}\), and therefore the induction step holds for this case. Note that in general, we do not know I, but we can enumerate all the \(2^j\) possible combinations to guess I if j is a fixed number as is the case in the algorithm in Sect. 3.2. Thus, in the following discussion, we can assume that \(\frac{|Q_l |}{|Q|}\ge \frac{\epsilon }{4j}\) for each \(1\le l\le j\).

For each \(1\le l\le j\), since \(\frac{|Q_l |}{|Q|}\ge \frac{\epsilon }{4j}\), by Lemma 2, we know that \(||o_l-o||\le \sqrt{\frac{1- \frac{\epsilon }{4j}}{ \frac{\epsilon }{4j}}}\delta \le 2\sqrt{\frac{j}{\epsilon }}\delta \). This, together with triangle inequality, implies that for any \(1\le l, l'\le j\),

$$\begin{aligned} ||o_l-o_{l'}||\le ||o_l-o||+||o_{l'}-o||\le 4\sqrt{j/\epsilon }\delta . \end{aligned}$$
(7)

Thus, if we pick any index \(l_0\), and draw a ball \(\mathcal {B}\) centered at \(o_{l_0}\) and with radius \(r=\max _{1\le l\le j}\{||o_l-o_{l_0}||\}\le 4\sqrt{j/\epsilon }\delta \) [by (7)], the whole simplex \(\mathcal {V}\) will be inside \(\mathcal {B}\). Note that \(o=\sum ^j_{l=1}\frac{|Q_l|}{|Q|}o_l\), so o lies inside the simplex \(\mathcal {V}\). To guarantee that o is contained by the ball \(\mathcal {B}\), we can construct \(\mathcal {B}\) only in the (\(j-1\))-dimensional space spanned by \(\{o_1, \ldots , o_j\}\), rather than the whole \(\mathbb {R}^d\) space. Also, if we build a grid inside \(\mathcal {B}\) with grid length \(\frac{\epsilon r}{4j}\), i.e., generating a uniform mesh with each cell being a \((j-1)\)-dimensional hypercube of edge length \(\frac{\epsilon r}{4j}\), the total number of grid points is no more than \(O((\frac{8j}{\epsilon })^j)\). With this grid, we know that for any point p inside \(\mathcal {V}\), there exists a grid point g such that \(||g-p||\le \sqrt{j (\frac{\epsilon r}{4j})^2}=\frac{\epsilon }{4\sqrt{j}}r\le \sqrt{\epsilon }\delta \). This means that we can find a grid point \(\tau \) inside \(\mathcal {V}\), such that \(||\tau -o||\le \sqrt{\epsilon }\delta \). Thus, the induction step holds, and the lemma is true for any \(j\ge 1\). \(\square \)
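The grid construction in the proof is straightforward to implement. The sketch below builds the mesh of side length \(\epsilon r/(4j)\) in the affine span of the given vertices and keeps the mesh points inside the ball \(\mathcal {B}\); it assumes the vertices are given exactly (as in Lemma 1) and does not enumerate the index set I used above for very small subsets, so it corresponds to the balanced case \(\frac{|Q_l |}{|Q|}\ge \frac{\epsilon }{4j}\).

```python
import itertools
import numpy as np

def simplex_grid(vertices, eps):
    """Grid construction of Lemma 1 (sketch): a mesh of side eps*r/(4j) in the
    (j-1)-dimensional affine span of the vertices, restricted to the ball B
    centered at vertices[0] with radius r = max_l ||o_l - o_1||."""
    V = np.asarray(vertices, dtype=float)
    j = len(V)
    if j == 1:
        return V.copy()
    base = V[0]
    D = (V[1:] - base).T                          # d x (j-1); columns span the affine hull
    U, _ = np.linalg.qr(D)                        # orthonormal basis of that span
    r = max(np.linalg.norm(V - base, axis=1).max(), 1e-12)
    step = eps * r / (4 * j)
    ticks = np.arange(-r, r + step, step)         # about 8j/eps ticks per dimension
    grid = []
    for coeffs in itertools.product(ticks, repeat=U.shape[1]):
        p = base + U @ np.asarray(coeffs)
        if np.linalg.norm(p - base) <= r + 1e-9:  # keep only the mesh points inside B
            grid.append(p)
    return np.asarray(grid)                       # O((8j/eps)^j) grid points in total

# Usage sketch: if `vertices` are the exact subset means o_1,...,o_j, the grid point
# closest to the (unknown) overall mean o plays the role of tau in Lemma 1.
```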

In the above lemma, we assume that the exact positions of \(\{o_1, \ldots , o_j\}\) are known (see Fig. 2a). However, in some scenarios (e.g., in the algorithm in Sect. 3.2), we only know an approximate position of each mean point \(o_{l}\) (see Fig. 2b). The following lemma shows that an approximate position of o can still be similarly determined (see Sect. 7.1 for the proof).

Lemma 3

(Simplex Lemma II) Let Q, o, \(Q_{l}, o_{l}, 1\le l \le j\), and \(\delta \) be defined as in Lemma 1. Let \(\{o'_1, \ldots , o'_j\}\) be j points in \(\mathbb {R}^{d}\) such that \(||o'_l-o_l ||\le L\) for \(1\le l\le j\) and \(L>0\), and \(\mathcal {V}'\) be the simplex determined by \(\{o'_1, \ldots , o'_j\}\). Then for any \(0<\epsilon \le 1\), it is possible to construct a grid of size \(O((8j/\epsilon )^j)\) inside \(\mathcal {V}'\) such that at least one grid point \(\tau \) satisfies the inequality \(||\tau -o||\le \sqrt{\epsilon }\delta +(1+\epsilon )L\).

3 Peeling-and-Enclosing Algorithm for k-CMeans

In this section, we present a new Peeling-and-Enclosing (PnE) algorithm for generating a set of candidates for the mean points of k-CMeans. Our algorithm uses peeling spheres and the simplex lemma to iteratively find a good candidate center for each unknown cluster. An overview of the algorithm is given in Sect. 3.1.

Some notations Let \(P=\{p_1, \ldots , p_n\}\) be the set of input points in \(\mathbb {R}^{d}\) for k-CMeans, and \(\mathcal {OPT}=\{Opt_1, \ldots , Opt_k\}\) be the k unknown optimal constrained clusters with \(m_{j}\) being the mean point of \(Opt_j\) for \(1\le j\le k\). Without loss of generality, we assume that \(|Opt_1|\ge |Opt_2|\ge \cdots \ge |Opt_k |\). Denote by \(\delta ^2_{opt}\) the optimal objective value, i.e., \(\delta ^2_{opt}=\frac{1}{n}\sum ^k_{j=1}\sum _{p \in Opt_j}||p-m_j||^2\). We also use \(\epsilon >0\) as the parameter controlling the quality of the approximate clustering result.

Fig. 3

Illustration for one iteration of Peeling-and-Enclosing. a Beginning of iteration 4; b generate 3 spheres (in white) to peel the optimal cluster \(Opt_{4}\) (in green); c build a simplex (in red) to contain \(m_{4}\); d find an approximate mean point \(p_{v_{4}}\) for \(m_{4}\) (Color figure online)

3.1 Overview of the Peeling-and-Enclosing Algorithm

Our Peeling-and-Enclosing algorithm needs an upper bound \(\varDelta \) on the optimal value \(\delta ^2_{opt}\). Specifically, \(\delta ^2_{opt}\) satisfies the condition \(\varDelta /c\le \delta ^2_{opt}\le \varDelta \) for some constant \(c\ge 1\). In Sect. 3.4, we will present a novel algorithm to determine such an upper bound for general constrained k-means clustering problems. The algorithm then searches for a \((1+\epsilon )\)-approximation \(\delta ^{2}\) of \(\delta ^{2}_{opt}\) in the set

$$\begin{aligned} H=\{\varDelta /c, (1+\epsilon )\varDelta /c, (1+\epsilon )^2\varDelta /c, \ldots , (1+\epsilon )^{\lceil \log _{1+\epsilon }c\rceil }\varDelta /c\ge \varDelta \}. \end{aligned}$$
(8)

Obviously, there exists one element of H lying inside the interval \([\delta ^{2}_{opt}, (1+\epsilon )\delta ^{2}_{opt}]\), and the size of H is \(O(\frac{1}{\epsilon }\log c)\).
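For concreteness, the guesses in H can be generated by a small helper such as the following (a hypothetical routine; \(\varDelta \) and c come from the estimation algorithm in Sect. 3.4).

```python
import math

def objective_guesses(Delta, c, eps):
    # The geometric sequence H in (8): start at Delta/c and multiply by (1 + eps)
    # until Delta is reached; one element lies in [delta_opt^2, (1 + eps) * delta_opt^2].
    m = math.ceil(math.log(c) / math.log(1 + eps))
    return [(Delta / c) * (1 + eps) ** i for i in range(m + 1)]
```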

At each searching step, our algorithm performs a sphere-peeling and simplex-enclosing procedure to iteratively generate k approximate mean points for the constrained clusters. Initially, our algorithm uses Lemmas 4 and 5 to find an approximate mean point \(p_{v_1}\) for \(Opt_{1}\) (note that since \(Opt_1\) is the largest cluster, \(|Opt_1|/n \ge 1/k\) and the sampling lemma applies). At the \((j+1)\)th iteration, it already has the approximate mean points \(p_{v_1}, \ldots , p_{v_j}\) for \(Opt_1, \ldots , Opt_j\), respectively (see Fig. 3a). Due to the lack of locality, some points of \(Opt_{j+1}\) could be scattered over the regions (e.g., Voronoi cells or peeling spheres) of \(Opt_{1}, \ldots , Opt_{j}\) and are difficult to distinguish from the points in these clusters. Since the number of such points could be small (compared to the sizes of the first j clusters), they need to be handled differently from the remaining points. Our idea is to separate them using j peeling spheres, \(B_{j+1,1}, \ldots , B_{j+1,j}\), centered at the j approximate mean points respectively and with some properly guessed radius (see Fig. 3b). Let \(\mathcal {A}\) be the set of unknown points in \(Opt_{j+1}{\setminus } (\cup ^j_{l=1}B_{j+1,l})\). Our algorithm considers two cases, (a) \(|\mathcal {A}|\) is large enough and (b) \(|\mathcal {A}|\) is small. For case (a), since \(|\mathcal {A}|\) is large enough, we can use Lemmas 4 and 5 to find an approximate mean point \(\pi \) of \(\mathcal {A}\), and then construct a simplex determined by \(\pi \) and \(p_{v_1}, \ldots , p_{v_j}\) to contain the \((j+1)\)th mean point (see Fig. 3c). Note that \(\mathcal {A}\) and \(Opt_{j+1}\cap B_{j+1,l}, 1\le l \le j,\) can be viewed as a partition of \(Opt_{j+1}\), where the points covered by multiple peeling spheres can be assigned to any one of them, and \(p_{v_{l}}\) can be shown to be an approximate mean point of \(Opt_{j+1}\cap B_{j+1,l}\); thus the simplex lemma applies. For case (b), it directly constructs a simplex determined just by \(p_{v_1}, \ldots , p_{v_j}\). In either case, our algorithm builds a grid inside the simplex and uses Lemma 3 to find an approximate mean point for \(Opt_{j+1}\) (i.e., \(p_{v_{j+1}}\), see Fig. 3d). The algorithm repeats the Peeling-and-Enclosing procedure k times to generate the k approximate mean points.
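The sketch below illustrates a single peel-and-enclose step in Python. It is a simplification: the actual Algorithm 2 enumerates a set of radius guesses and uses the sample sizes dictated by the analysis in Sect. 3.3, and the size threshold used here to separate cases (a) and (b) is only a stand-in. The returned vertex set is then fed to the grid construction of Lemma 3 to produce the candidate \((j+1)\)th mean points.

```python
import numpy as np

def peel_and_build_simplex(P, prev_means, radius, rng, sample_size=50):
    """One peeling step (sketch): peel spheres of the guessed radius around the
    already-found approximate means, estimate the mean of the surviving points by
    random sampling, and return the vertices of the simplex to be searched."""
    P = np.asarray(P, dtype=float)
    centers = np.asarray(prev_means, dtype=float)
    dist_to_nearest = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    A = P[dist_to_nearest > radius]           # points outside all peeling spheres
    if len(A) >= sample_size:                 # "case (a)": enough points survive the peeling
        sample = A[rng.choice(len(A), size=sample_size, replace=False)]
        pi = sample.mean(axis=0)              # approximate mean of A via sampling (Lemmas 4 and 5)
        return np.vstack([centers, pi])       # simplex spanned by p_{v_1},...,p_{v_j} and pi
    return centers                            # "case (b)": the few leftover points are ignored
```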

3.2 Peeling-and-Enclosing Algorithm

Before presenting our algorithm, we first introduce two basic lemmas from [24, 37] for random sampling. Let S be a set of n points in \(\mathbb {R}^d\) space, and T be a randomly selected subset of size t from S. Denote by m(S) and m(T) the mean points of S and T respectively.

Lemma 4

[37] With probability \(1-\eta \), \(||m(S)-m(T)||^2<\frac{1}{\eta t}\delta ^2\), where \(\delta ^2=\frac{1}{n}\sum _{s\in S}||s-m(S)||^2\) and \(0<\eta <1\).

Lemma 5

[24] Let \(\varOmega \) be a set of elements, and S be a subset of \(\varOmega \) with \(\frac{|S|}{|\varOmega |}=\alpha \) for some \(\alpha \in (0,1)\). If we randomly select \(\frac{t\ln \frac{t}{\eta }}{\ln (1+\alpha )}=O(\frac{t}{\alpha }\ln \frac{t}{\eta })\) elements from \(\varOmega \), then with probability at least \(1-\eta \), the sample contains at least t elements from S for \(0<\eta <1\) and \(t\in \mathbb {Z}^+\).
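As a quick numerical illustration of Lemma 4 (a sanity check only, with arbitrarily chosen parameters), one can verify that the stated bound holds in far more than a \(1-\eta \) fraction of trials:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((100000, 20))
m_S = S.mean(axis=0)
delta2 = np.mean(np.sum((S - m_S) ** 2, axis=1))     # delta^2 as defined in Lemma 4

t, eta, trials = 200, 0.05, 2000
hits = 0
for _ in range(trials):
    T = S[rng.choice(len(S), size=t, replace=False)]
    hits += np.sum((T.mean(axis=0) - m_S) ** 2) < delta2 / (eta * t)
print(hits / trials)     # Lemma 4 guarantees a frequency of at least 1 - eta = 0.95
```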

Our Peeling-and-Enclosing algorithm is shown in Algorithm 1.

[Algorithm 1 and Algorithm 2: pseudocode figures]

Theorem 1

Let P be the set of n points in \(\mathbb {R}^{d}\) and \(k\in \mathbb {Z}^+\) be a fixed constant. Given a constant \(\epsilon \in (0, \frac{1}{4k^2})\), in \(O(2^{poly(\frac{k}{\epsilon })}n(\log n)^{k+1} d )\) time, Algorithm 1 outputs \(O(2^{poly(\frac{k}{\epsilon })}(\log n)^{k})\) k-tuple candidate mean points. With constant probability, there exists one k-tuple candidate in the output which is able to induce a \(\big (1+O(\epsilon )\big )\)-approximation of k-CMeans.

Remark 1

To increase the success probability to be close to 1, e.g., \(1-\frac{1}{n}\), one only needs to repeatedly run the algorithm \(O(\log n)\) times; both the time complexity and the number of k-tuple candidates increase by a factor of \(O(\log n)\).

3.3 Proof of Theorem 1

Let \(\beta _j= |Opt_j|/n\), and \(\delta ^2_j=\frac{1}{|Opt_j |}\sum _{p\in Opt_j}||p-m_j||^2\), where \(m_j\) is the mean point of \(Opt_j\). By our assumption in the beginning of Sect. 3, we know that \(\beta _1\ge \cdots \ge \beta _k\). Clearly, \(\sum ^k_{j=1}\beta _j =1\) and the optimal objective value \(\delta ^2_{opt}=\sum ^k_{j=1}\beta _j\delta ^2_j\).

Proof Synopsis Instead of directly proving Theorem 1, we consider the following Lemma 6 and Lemma 7 which jointly ensure the correctness of Theorem 1. In Lemma 6, we show that there exists such a root-to-leaf path in one of the returned trees that its associated k points along the path, denoted by \(\{p_{v_1}, \ldots , p_{v_k}\}\), are close enough to the mean points \(m_{1}, \ldots , m_{k}\) of the k optimal clusters, respectively. The proof is based on mathematical induction; each step needs to build a simplex, and applies Simplex Lemma II to bound the error, i.e., \(||p_{v_j}-m_j ||\) in (9). The error is estimated by considering both the local (i.e., the variance of cluster \(Opt_j\)) and global (i.e., the optimal value \(\delta _{opt}\)) measurements. This is a more accurate estimation, compared to the widely used Lemma 4 which considers only the local measurement. Such an improvement is due to the increased flexibility in the Simplex Lemma II, and is a key to our proof. In Lemma 7, we further show that the k points, \(\{p_{v_1}, \ldots , p_{v_k}\}\), in Lemma 6 induce a \((1+O(\epsilon ))\)-approximation of k-CMeans.

Lemma 6

Among all the trees generated by Algorithm 1, with constant probability, there exists at least one tree, \(\mathcal {T}_i\), which has a root-to-leaf path with each of its nodes \(v_{j}\) at level j (\(1\le j\le k\)) associating with a point \(p_{v_j}\) and satisfying the inequality

$$\begin{aligned} ||p_{v_j}-m_j ||\le \epsilon \delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt} . \end{aligned}$$
(9)

Before proving this lemma, we first show its implication.

Lemma 7

If Lemma 6 is true, then \(\{p_{v_1}, \ldots , p_{v_k}\}\) is able to induce a \((1+O(\epsilon ))\)-approximation of k-CMeans.

Proof

We assume that Lemma 6 is true. Then for each \(1\le j\le k\), we have

$$\begin{aligned} \sum _{p\in Opt_j}||p-p_{v_j}||^2= & {} \sum _{p\in Opt_j}||p-m_j||^2+|Opt_j|\times ||m_j-p_{v_j}||^2 \nonumber \\\le & {} \sum _{p\in Opt_j}||p-m_j||^2+|Opt_j|\times 2\left( \epsilon ^2\delta ^2_j+(1+\epsilon )^2 j^2\frac{\epsilon }{\beta _j}\delta ^2_{opt}\right) \nonumber \\= & {} (1+2\epsilon ^2)|Opt_j |\delta ^2_j+2(1+\epsilon )^2 j^2\epsilon n\delta ^2_{opt}, \end{aligned}$$
(10)

where the first equation follows from Claim 1 in the proof of Lemma 2 (note that \(m_j\) is the mean point of \(Opt_j\)), the inequality follows from Lemma 6 and the fact that \((a+b)^2\le 2(a^2+b^2)\) for any two real numbers a and b, and the last equality follows from the fact that \(\frac{|Opt_j|}{\beta _j}=n\). Summing both sides of (10) over j, we have

$$\begin{aligned} \sum ^k_{j=1}\sum _{p\in Opt_j}||p-p_{v_j}||^2\le & {} \sum ^k_{j=1}((1+2\epsilon ^2)|Opt_j |\delta ^2_j+2(1+\epsilon )^2 j^2\epsilon n\delta ^2_{opt})\nonumber \\\le & {} (1+2\epsilon ^2)\sum ^k_{j=1}|Opt_j |\delta ^2_j+2(1+\epsilon )^2 k^3\epsilon n\delta ^2_{opt}\nonumber \\= & {} (1+O(k^3)\epsilon ) n\delta ^2_{opt}, \end{aligned}$$
(11)

where the last equation follows from the fact that \(\sum ^k_{j=1}|Opt_j |\delta ^2_j=n\delta ^2_{opt}\). By (11), we know that \(\{p_{v_1}, \ldots , p_{v_k}\}\) will induce a \((1+O(k^3)\epsilon )\)-approximation for k-CMeans. Note that k is assumed to be a fixed number. Thus the lemma is true. \(\square \)

Lemma 7 implies that Lemma 6 is indeed sufficient to ensure the correctness of Theorem 1 (except for the number of candidates and the time complexity). Now we prove Lemma 6.

Proof of Lemma 6

Let \(\mathcal {T}_i\) be the tree generated by Algorithm 2 when \(\delta \) falls in the interval of \([\delta _{opt}, (1+\epsilon )\delta _{opt}]\). We will focus our discussion on \(\mathcal {T}_{i}\), and prove the lemma by mathematical induction on j.

Base case For \(j=1\), since \(\beta _1=\max \{\beta _j |1\le j\le k\}\), we have \(\beta _1\ge \frac{1}{k}\). By Lemma 4 and Lemma 5, we can find the approximate mean point through random sampling. Let \(\varOmega \) and S (in Lemma 5) be P and \(Opt_1\), respectively. Also, \(p_{v_{1}}\) is the mean point of the random sample from P. Lemma 5 ensures that the sample contains a sufficient number of points from \(Opt_1\), and Lemma 4 implies that \(||p_{v_1}-m_1||\le \epsilon \delta _1\le \epsilon \delta _1+(1+\epsilon )\sqrt{\frac{\epsilon }{\beta _1}}\delta _{opt}\).

Induction step Suppose \(j>1\). We assume that there is a path in \(\mathcal {T}_i\) from the root to the \((j-1)\)th level, such that for each \(1\le l\le j-1\), the level-l node \(v_{l}\) on the path is associated with a point \(p_{v_{l}}\) satisfying the inequality \(||p_{v_l}-m_l ||\le \epsilon \delta _l+(1+\epsilon )l\sqrt{\frac{\epsilon }{\beta _l}}\delta _{opt} \) (i.e., the induction hypothesis). Now we consider the case of j. Below we will show that there is one child of \(v_{j-1}\), i.e., \(v_{j}\), such that its associated point \(p_{v_{j}}\) satisfies the inequality \(||p_{v_j}-m_j ||\le \epsilon \delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt} \). First, we have the following claim (see Sect. 7.2 for the proof).

Claim 2

In the set of radius candidates in Algorithm 2, there exists one value \(r_j\in \mathcal {R}\) such that

$$\begin{aligned} r_j\in \left[ j\sqrt{\epsilon /\beta _j}\delta _{opt}, \left( 1+\frac{\epsilon }{2}\right) j\sqrt{\epsilon /\beta _j}\delta _{opt}\right] . \end{aligned}$$
(12)

Now, we construct the \(j-1\) peeling spheres, \(\{B_{j,1}, \ldots , B_{j,j-1}\}\) as in Algorithm 2. For each \(1\le l\le j-1\), \(B_{j,l}\) is centered at \(p_{v_l}\) and with radius \(r_j\). By Markov’s inequality and the induction hypothesis, we have the following claim (see Sect. 7.3 for the proof).

Claim 3

For each \(1\le l\le j-1\), \(|Opt_l {\setminus } (\bigcup ^{j-1}_{w=1}B_{j,w})|\le \frac{4\beta _j n}{\epsilon }\).

Claim 3 shows that \(|Opt_l {\setminus } (\bigcup ^{j-1}_{w=1}B_{j,w})|\) is bounded for \(1\le l\le j-1\), which helps us to find the approximate mean point of \(Opt_j \). Induced by the \(j-1\) peeling spheres \(\{B_{j,1}, \ldots , B_{j,j-1}\}\), \(Opt_j\) is divided into j subsets, \(Opt_j\cap B_{j,1}\), \(\ldots \), \(Opt_j\cap B_{j,j-1}\) and \(Opt_j {\setminus }(\bigcup ^{j-1}_{w=1}B_{j,w})\). For ease of discussion, let \(P_l\) denote \(Opt_j\cap B_{j,l}\) for \(1\le l\le j-1\), \(P_j\) denote \(Opt_j {\setminus }(\bigcup ^{j-1}_{w=1}B_{j,w})\), and \(\tau _{l}\) denote the mean point of \(P_l\) for \(1\le l\le j\). Note that the peeling spheres may intersect with each other. For any two intersecting spheres \(B_{j,l_1}\) and \(B_{j,l_2}\), we arbitrarily assign the points in \(Opt_j\cap (B_{j,l_1}\cap B_{j,l_2})\) to either \(P_{l_1}\) or \(P_{l_2}\). Thus, we can assume that \(\{P_l\mid 1\le l\le j\}\) are pairwise disjoint.

Now consider the size of \(P_{j}\). We have the following two cases: (a) \(|P_j |\ge \epsilon ^3\frac{\beta _{j}}{j}n\) and (b) \(|P_j |<\epsilon ^3\frac{\beta _{j}}{j}n\). We show how, in each case, Algorithm 2 can obtain an approximate mean point for \(Opt_{j}\) by using the simplex lemma (i.e., Lemma 3).

For case (a), by Claim 3, together with the fact that \(\beta _l\le \beta _{j}\) for \(l>j\), we know that

$$\begin{aligned} \sum ^k_{l=1}|Opt_{l}{\setminus }\left( \bigcup ^{j-1}_{w=1}B_{j,w}\right) |\le & {} \sum ^{j-1}_{l=1}|Opt_{l}{\setminus }\left( \bigcup ^{j-1}_{w=1}B_{j,w}\right) |+|P_j|+\sum ^k_{l=j+1}|Opt_l|\nonumber \\\le & {} \frac{4(j-1)\beta _j}{\epsilon }n+|P_j|+(k-j)\beta _j n, \end{aligned}$$
(13)

where the second inequality follows from Claim 3. So we have

$$\begin{aligned} \frac{|P_j |}{\sum ^k_{l=1}|Opt_{l}{\setminus }\left( \bigcup ^{j-1}_{w=1}B_{j,w}\right) |}\ge & {} \frac{|P_j |}{\frac{4(j-1)\beta _j}{\epsilon }n+|P_j |+(k-j)\beta _j n}. \end{aligned}$$
(14)

We view the right-hand side as a function of \(|P_j|\). Given any \(h>0\), the function \(f(x)=\frac{x}{x+h}\) is an increasing function on the variable \(x\in [0, +\infty )\). Note that we assume \(|P_j |\ge \epsilon ^3\frac{\beta _{j}}{j}n\). Thus

$$\begin{aligned} \frac{|P_j |}{\sum ^k_{l=1}|Opt_{l}{\setminus }\left( \bigcup ^{j-1}_{w=1}B_{j,w}\right) |}\ge & {} \frac{\frac{\epsilon ^3}{j}\beta _j n}{\frac{4(j-1)\beta _j}{\epsilon }n+\frac{\epsilon ^3}{j}\beta _j n+(k-j)\beta _j n}\nonumber \\> & {} \frac{\epsilon ^4}{8kj}\ge \frac{\epsilon ^4}{8k^2}, \end{aligned}$$
(15)

Inequality (15) implies that \(P_{j}\) is large enough compared to the set of points outside the peeling spheres. Hence, we can obtain an approximate mean point \(\pi \) for \(P_j\) in the following way. First, we set \(t=\frac{k}{\epsilon ^5}\), \(\eta =\frac{\epsilon }{k}\), and take a sample of size \(\frac{t\ln (t/\eta )}{\epsilon ^4 /8k^2}=\frac{8k^3}{\epsilon ^9}\ln \frac{k^2}{\epsilon ^6}\). By Lemma 5, we know that with probability \(1-\frac{\epsilon }{k}\), the sample contains at least \(\frac{k}{\epsilon ^5}\) points from \(P_j\). Then we let \(\pi \) be the mean point of the \(\frac{k}{\epsilon ^5}\) points from \(P_j\), and \(a^{2}\) be the variance of \(P_j\). By Lemma 4, we know that with probability \(1-\frac{\epsilon }{k}\), \(||\pi -\tau _j||^2\le \epsilon ^4 a^2\). Also, since \(\frac{|P_j|}{|Opt_j|}=\frac{|P_j|}{\beta _j n}\ge \frac{\epsilon ^3}{j}\) (because \(|P_j |\ge \epsilon ^3\frac{\beta _{j}}{j}n\) for case (a)), we have \(a^2\le \frac{|Opt_j|}{|P_j|}\delta ^2_j\le \frac{j}{\epsilon ^3}\delta ^2_j\). Thus,

$$\begin{aligned} ||\pi -\tau _j||^2\le \epsilon j\delta ^2_j. \end{aligned}$$
(16)
Fig. 4

a, b The Simplexes of case (a) and case (b) with \(j=4\) respectively

Once \(\pi \) is obtained, we can apply Lemma 3 to find a point \(p_{v_{j}}\) satisfying the condition of \(||p_{v_{j}}-m_j||\le \epsilon \delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}\). We construct a simplex \(\mathcal {V}'_{(a)}\) with vertices \(\{p_{v_1}, \ldots , p_{v_{j-1}}\}\) and \(\pi \) (see Fig. 4a). Note that \(Opt_j\) is partitioned by the peeling spheres into j disjoint subsets, \(P_1, \ldots , P_j\). Each \(P_l\) (\(1\le l\le j-1\)) lies inside \(B_{j,l}\), which implies that \(\tau _l\), i.e., the mean point of \(P_l\), is also inside \(B_{j,l}\). Further, by Claim 2, for \(1\le l\le j-1\), we have

$$\begin{aligned} ||p_{v_l}-\tau _l||\le r_j\le \left( 1+\frac{\epsilon }{2}\right) j\sqrt{\epsilon /\beta _j}\delta _{opt}. \end{aligned}$$
(17)

Recall that \(\beta _j\delta ^2_j\le \delta ^2_{opt}\). Thus, together with (16), we have

$$\begin{aligned} ||\pi -\tau _j||\le \sqrt{\epsilon j}\delta _j\le \sqrt{\epsilon j/\beta _j}\delta _{opt}. \end{aligned}$$
(18)

By (17) and (18), if setting the value of L (in Lemma 3) to be

$$\begin{aligned} \max \{r_j,||\pi -\tau _j|| \}\le & {} \max \left\{ \left( 1+\frac{\epsilon }{2}\right) j\sqrt{\epsilon /\beta _j}\delta _{opt}, \sqrt{\epsilon j/\beta _j}\delta _{opt}\right\} \nonumber \\= & {} \left( 1+\frac{\epsilon }{2}\right) j\sqrt{\epsilon /\beta _j}\delta _{opt}, \end{aligned}$$
(19)

and the value of \(\epsilon \) (in Lemma 3) to be \(\epsilon _0=\epsilon ^2/4\), by Lemma 3 we can construct a grid inside the simplex \(\mathcal {V}'_{(a)}\) with size \(O((8j/\epsilon _0)^j)\) to ensure the existence of the grid point \(\tau \) satisfying the inequality of

$$\begin{aligned} ||\tau -m_j||\le \sqrt{\epsilon _0}\delta _j+(1+\epsilon _0)L\le \epsilon \delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}. \end{aligned}$$
(20)

Hence, let \(p_{v_j}\) be the grid point \(\tau \), and the induction step holds for this case.

For case (b), we can also apply Lemma 3 to find an approximate mean point in a way similar to case (a); the difference is that we construct a simplex \(\mathcal {V}'_{(b)}\) with vertices \(\{p_{v_1}, \ldots , p_{v_{j-1}}\}\) (see Fig. 4b). Roughly speaking, since \(|P_j|\) is small, the mean points of \(Opt_j{\setminus } P_j\) and \(Opt_j\) are very close to each other (by Lemma 2). Thus, we can ignore \(P_j\) and just consider \(Opt_j{\setminus } P_j\).

Let \(a^2\) and \(m'_j\) denote the variance and mean point of \(Opt_j{\setminus } P_j\) respectively. We know that \(\{P_1, P_2, \ldots , P_{j-1}\}\) is a partition of \(Opt_j{\setminus } P_j\). Thus, similar to case (a), we construct a simplex \(\mathcal {V}'_{(b)}\) determined by \(\{p_{v_1}, \ldots , p_{v_{j-1}}\}\) (see Fig. 4b), set the value of L to be \(r_j\le (1+\frac{\epsilon }{2})j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}\), and then build a grid inside \(\mathcal {V}'_{(b)}\) with size \(O((\frac{8j}{\epsilon _0})^j)\), where \(\epsilon _0=\epsilon ^2/4\). By Lemma 3, we know that there exists one grid point \(\tau \) satisfying the condition of

$$\begin{aligned} ||\tau -m'_j||\le \sqrt{\epsilon _0}a+(1+\epsilon _0)L\le \frac{\epsilon }{2}a+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}. \end{aligned}$$
(21)

Meanwhile, we know that \(|Opt_j{\setminus } P_j|\ge (1-\epsilon ^3/j)|Opt_j|\), since \(|P_j|\le \frac{\epsilon ^3}{j} |Opt_j|\). Thus, we have \(a^2\le \frac{|Opt_j|}{|Opt_j{\setminus } P_j|}\delta ^2_j \le \frac{1}{1-\epsilon ^3/j} \delta ^2_j\), and \(||m'_j-m_j||\le \sqrt{\frac{\epsilon ^3/j}{1-\epsilon ^3/j}}\delta _j\) (by Lemma 2). Together with (21), we have

$$\begin{aligned} ||\tau -m_j||\le & {} ||\tau -m'_j||+||m'_j-m_j||\nonumber \\\le & {} \frac{\epsilon }{2}a+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}+\sqrt{\frac{\epsilon ^3/j}{1-\epsilon ^3/j}}\delta _j\nonumber \\\le & {} \frac{\epsilon }{2} \sqrt{\frac{1}{1-\epsilon ^3/j}}\delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}+\sqrt{\frac{\epsilon ^3/j}{1-\epsilon ^3/j}}\delta _j\nonumber \\\le & {} \left( \frac{\epsilon }{2} \sqrt{\frac{1}{1-\epsilon ^3/j}}+\sqrt{\frac{\epsilon ^3/j}{1-\epsilon ^3/j}}\right) \delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}\nonumber \\\le & {} \epsilon \delta _j+(1+\epsilon )j\sqrt{\frac{\epsilon }{\beta _j}}\delta _{opt}. \end{aligned}$$
(22)

Hence, let \(p_{v_j}\) be the grid point \(\tau \), and the induction step holds for this case.

Since Algorithm 2 executes every step in our above discussion, the induction step, as well as the lemma, is true. \(\square \)

Success Probability From the above analysis, we know that in the jth iteration, only case (a) (i.e., \(|P_j |\ge \epsilon ^3\frac{\beta _{j}}{j} n\)) needs to consider the success probability of random sampling. Recall that in case (a), we take a sample of size \(\frac{8k^3}{\epsilon ^9}\ln \frac{k^2}{\epsilon ^6}\). Thus with probability \(1-\frac{\epsilon }{k}\), it contains at least \(\frac{k}{\epsilon ^5}\) points from \(P_j\). Meanwhile, with probability \(1-\frac{\epsilon }{k}\), \(||\pi -\tau _j||^2\le \epsilon ^4 a^2\). Hence, the success probability in the jth iteration is \((1-\frac{\epsilon }{k})^2\). By taking the union bound, the success probability in all k iterations is \((1-\frac{\epsilon }{k})^{2k}\ge 1-2\epsilon \).

Number of Candidates and Running Time Algorithm 1 calls Algorithm 2 \(O(\frac{1}{\epsilon }\log c)\) times (in Sect. 3.4, we will show that c can be a constant number). It is easy to see that each node in the returned tree has \(|\mathcal {R}|2^{s+j} (\frac{32j}{\epsilon ^2})^j\) children, where \(|\mathcal {R}|=O(\frac{\log n}{\epsilon })\), and \(s=\frac{8k^3}{\epsilon ^9}\ln \frac{k^2}{\epsilon ^6}\). Since the tree has height k, the complexity of the tree is \(O(2^{poly(\frac{k}{\epsilon })}(\log n)^k)\). Consequently, the number of candidates is \(O(2^{poly(\frac{k}{\epsilon })}(\log n)^k)\). Further, since each node takes \(O(|\mathcal {R}|2^{s+j} (\frac{32j}{\epsilon ^2})^j nd)\) time, the total time complexity of the algorithm is \(O(2^{poly(\frac{k}{\epsilon })} n(\log n)^{k+1} d)\).

3.4 Upper Bound Estimation

As mentioned in Sect. 3.1, our Peeling-and-Enclosing algorithm needs an upper bound \(\varDelta \) on the optimal value \(\delta ^2_{opt}\). To compute this, our main idea is to use some unconstrained k-means clustering algorithm \(\mathcal {A}_{*}\) (e.g., the linear time \((1+\epsilon )\)-approximation algorithm in [43]) on the input points P without considering the constraint, to obtain a \(\lambda \)-approximation to the k-means clustering for some constant \(\lambda >1\). Let \(\mathcal {C}=\{c_1, \ldots , c_k\}\) be the set of mean points returned by algorithm \(\mathcal {A}_{*}\). Below, we show that the Cartesian product \([\mathcal {C}]^k=\underbrace{\mathcal {C}\times \cdots \times \mathcal {C}}_{k}\) contains one k-tuple that induces an \((18\lambda +16)\)-approximation of k-CMeans on the same input P. Clearly, to select the k-tuple from \([\mathcal {C}]^k\) with the smallest objective value, we still need to solve the Partition step on each k-tuple to form the desired clusters. Similar to Remark 1, we refer the reader to Sect. 4 for the selection algorithms for the considered constrained clustering problems.
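A possible instantiation of this estimation step is sketched below; scikit-learn's k-means (Lloyd's algorithm with k-means++ seeding) is used only as a practical stand-in for the \(\lambda \)-approximation algorithm \(\mathcal {A}_{*}\), and the constrained Partition step that must then be run on every tuple is left abstract.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans   # stand-in for the unconstrained algorithm A_*

def upper_bound_candidates(P, k, seed=0):
    """Upper-bound estimation (sketch of Sect. 3.4): run an unconstrained k-means
    on P, then enumerate every k-tuple in the Cartesian product [C]^k.  One of
    these k^k tuples induces an (18*lambda + 16)-approximation of k-CMeans
    (Theorem 2); evaluating each tuple with the problem-specific Partition step
    of Sect. 4 and taking the cheapest clustering yields the upper bound Delta."""
    C = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(np.asarray(P)).cluster_centers_
    return list(itertools.product(C, repeat=k))
```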

Theorem 2

Let \(P=\{p_1, \ldots , p_n\}\) be the input points of k-CMeans, and \(\mathcal {C}=\{c_{1},\ldots ,c_{k}\}\) be the mean points of a \(\lambda \)-approximation of the k-means clustering on P (without considering the constraint) for some constant \(\lambda \ge 1\). Then \([\mathcal {C}]^k\) contains at least one k-tuple which is able to induce an \((18\lambda +16)\)-approximation of k-CMeans.

Fig. 5

a \(p_i\) is moved to \(c_t\) and becomes \(\tilde{p}_i\); b \(||\tilde{m}_l-\tilde{c}_l||\le ||\tilde{m}_l-\tilde{p}_i||\)

Proof Synopsis Let \(\omega \) be the objective value of the k-means clustering on P corresponding to the k mean points in \(\mathcal {C}\). To prove Theorem 2, we create a new instance of k-CMeans: for each point \(p_i\in P\), move it to its nearest point, say \(c_{t}\), in \(\{c_1, \ldots , c_k\}\); let \(\tilde{p}_i\) denote the new \(p_i\) (note that \(c_t\) and \(\tilde{p}_i\) coincide with each other; see Fig. 5a). The set \(\tilde{P}=\{\tilde{p}_1, \ldots , \tilde{p}_n\}\) forms a new instance of k-CMeans. Let \(\tilde{\delta }^2_{opt}\) be the optimal value of k-CMeans on \(\tilde{P}\), and \(\delta ^2_{opt}([\mathcal {C}]^k)\) be the minimum cost of k-CMeans on P by restricting its mean points to be one k-tuple in \([\mathcal {C}]^k\). We show that \(\tilde{\delta }^2_{opt}\) is bounded by a combination of \(\omega \) and \(\delta ^2_{opt}\), and \(\delta ^2_{opt}([\mathcal {C}]^k)\) is bounded by a combination of \(\omega \) and \(\tilde{\delta }^2_{opt}\) (see Lemma 8). Together with the fact that \(\omega \) is no more than \(\lambda \delta ^2_{opt}\), we consequently obtain that \(\delta ^2_{opt}([\mathcal {C}]^k)\le (18\lambda +16)\delta ^2_{opt}\), which implies Theorem 2.

Lemma 8

\(\tilde{\delta }^2_{opt}\le 2\omega +2\delta ^2_{opt}\), and \(\delta ^2_{opt}([\mathcal {C}]^k)\le 2\omega +8\tilde{\delta }^2_{opt}\).

Proof

We first prove the inequality of \(\tilde{\delta }^2_{opt}\le 2\omega +2\delta ^2_{opt}\). Consider any point \(p_i \in P\). Let \(Opt_{l}\) be the optimal cluster containing \(p_i\). Then, we have

$$\begin{aligned} ||\tilde{p}_i-m_l||^2\le & {} (||\tilde{p}_i-p_i||+||p_i-m_l||)^2\nonumber \\\le & {} 2||\tilde{p}_i-p_i||^2+2||p_i-m_l||^2, \end{aligned}$$
(23)

where the first inequality follows from triangle inequality, and the second inequality follows from the fact that \((a+b)^2\le 2a^2+2b^2\) for any two real numbers a and b. For both sides of (23), we take the averages over all the points in P, and obtain

$$\begin{aligned} \frac{1}{n}\sum ^k_{l=1}\sum _{p_i\in Opt_l}||\tilde{p}_i-m_l||^2 \le \frac{2}{n}\sum ^n_{i=1}||\tilde{p}_i-p_i||^2+\frac{2}{n}\sum ^k_{l=1}\sum _{p_i\in Opt_l}||p_i-m_l||^2. \end{aligned}$$
(24)

Note that the left-hand side of (24) is not smaller than \(\tilde{\delta }^2_{opt}\), since \(\tilde{\delta }^2_{opt}\) is the optimal objective value of k-CMeans on \(\tilde{P}\). For the right-hand side of (24), the first term \(2\frac{1}{n}\sum ^n_{i=1}||\tilde{p}_i-p_i||^2=2\omega \) (by the construction of \(\tilde{P}\)), and the second term \(2\frac{1}{n}\sum ^k_{l=1}\sum _{p_i\in Opt_l}||p_i-m_l||^2=2\delta ^2_{opt}\). Thus, we have \(\tilde{\delta }^2_{opt}\le 2\omega +2\delta ^2_{opt}\).

Next, we show the inequality \(\delta ^2_{opt}([\mathcal {C}]^k)\le 2\omega +8\tilde{\delta }^2_{opt}\). Consider k-CMeans clustering on \(\tilde{P}\). Let \(\{\tilde{m}_1, \ldots , \tilde{m}_k\}\) be the optimal constrained mean points of \(\tilde{P}\), and \(\{\tilde{O}_1, \ldots , \tilde{O}_k\}\) be the corresponding optimal clusters. Let \(\{\tilde{c}_1, \ldots , \tilde{c}_k\}\) be the k-tuple in \([\mathcal {C}]^k\) with \(\tilde{c}_l\) being the nearest point in \(\mathcal {C}\) to \(\tilde{m}_l\). Thus, by an argument similar to the one for (23), we have

$$\begin{aligned} ||\tilde{p}_i-\tilde{c}_l||^2\le & {} 2||\tilde{p}_i-\tilde{m}_l||^2+2||\tilde{m}_l-\tilde{c}_l||^2\le 4||\tilde{p}_i-\tilde{m}_l||^2. \end{aligned}$$
(25)

for each \(\tilde{p}_i\in \tilde{O}_l\). In (25), the last inequality follows from the facts that \(\tilde{c}_l\) is the nearest point in \(\mathcal {C}\) to \(\tilde{m}_l\) and \(\tilde{p}_i \in \mathcal {C}\), which imply that \(||\tilde{m}_l-\tilde{c}_l||\le ||\tilde{m}_l-\tilde{p}_i||\) (see Fig. 5b). Summing both sides of (25) over all the points in \(\tilde{P}\), we have

$$\begin{aligned} \sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||\tilde{p}_i-\tilde{c}_l||^2\le & {} 4\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||\tilde{p}_i-\tilde{m}_l||^2. \end{aligned}$$
(26)

Now, consider the following clustering on P: for each \(p_i\), if \(\tilde{p}_i\in \tilde{O}_l\), we cluster it to the corresponding \(\tilde{c}_l\). Then the objective value of this clustering is

$$\begin{aligned} \frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||p_i-\tilde{c}_l||^2\le & {} \frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}(2||p_i-\tilde{p}_i||^2+2||\tilde{p}_i-\tilde{c}_l||^2)\nonumber \\\le & {} 2\frac{1}{n}\sum ^n_{i=1}||p_i-\tilde{p}_i||^2+8\frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||\tilde{p}_i-\tilde{m}_l||^2. \end{aligned}$$
(27)

The left-hand side of (27), \(\frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||p_i-\tilde{c}_l||^2\), is no smaller than \(\delta ^2_{opt}([\mathcal {C}]^k)\) (by the definition), and the right-hand side of (27) is equal to \(2\omega +8\tilde{\delta }^2_{opt}\). Thus, we have \(\delta ^2_{opt}([\mathcal {C}]^k)\le 2\omega +8\tilde{\delta }^2_{opt}\). \(\square \)

Proof of Theorem 2

By the two inequalities in Lemma 8, we know that \(\delta ^2_{opt}([\mathcal {C}]^k)\le 18\omega +16\delta ^2_{opt}\). It is obvious that the optimal objective value of the k-means clustering is no larger than that of k-CMeans on the same set of input points P. This implies that \(\omega \le \lambda \delta ^2_{opt}\). Thus, we have

$$\begin{aligned} \delta ^2_{opt}([\mathcal {C}]^k)\le (18\lambda +16)\delta ^2_{opt}. \end{aligned}$$
(28)

So there exists one k-tuple in \([\mathcal {C}]^k\), which is able to induce an \((18\lambda +16)\)-approximation.    \(\square \)

4 Selection Algorithms for k-CMeans

As shown in Sect. 3, a set of k-tuple candidates for the mean points of k-CMeans can be obtained by our Peeling-and-Enclosing algorithm. To determine the best candidate, we need a selection algorithm to compute the clustering for each k-tuple candidate, and select the one with the smallest objective value. Clearly, the key to designing a selection algorithm is to solve the Partition step (i.e., generating the clustering) for each k-tuple candidate. We need to design a problem-specific algorithm for the Partition step, to satisfy the constraint of each individual problem.

We consider all the constrained k-means clustering problems mentioned in Sect. 1.1, except for the uncertain data clustering, since Cormode and McGregor [19] have shown that it can be reduced to an ordinary k-means clustering problem. However, the k-median version of the uncertain data clustering does not have such a property. In Sect. 5.4, we will discuss how to obtain the \((1+\epsilon )\)-approximation by applying our Peeling-and-Enclosing framework.

Fig. 6
figure 6

Minimum cost circulations for r-gather clustering (a) and l-diversity clustering (b)

4.1 r-Gather k-Means Clustering

Let P be a set of n points in \(\mathbb {R}^d\). r-Gather k-means clustering (denoted by (rk)-GMeans) on P is the problem of clustering P into k clusters with size at least r, such that the average squared Euclidean distance from each point in P to the mean point of its cluster is minimized [2].

To solve the Partition problem of (rk)-GMeans, we adopt the following strategy. For each k-tuple candidate \(P_{v}=\{p_{v_1}, \ldots , p_{v_k}\}\) returned by Algorithm 1, build a complete bipartite graph G (see Fig. 6a): each vertex in the left column corresponds to a point in P, and each vertex in the right column represents a candidate mean point in \(P_{v}\); each pair of vertices from different partite sets is connected by an edge whose weight equals their squared Euclidean distance. We can solve the Partition problem by finding a minimum cost matching in G: each vertex on the left has supply 1, and each vertex on the right has demand r and capacity n. After adding a source node s connected to all the vertices on the left and a sink node t connected to all the vertices on the right, we can reduce the Partition problem to a minimum cost circulation problem, and solve it by using the algorithm in [28]. Denote by V and E the sets of vertices and edges of G. The running time for solving the minimum cost circulation problem is \(O(|E|^2\log |V|+ |E|\cdot |V|\log ^2 |V|)\) [49]. In our case, \(|E|=O(n)\) and \(|V|=O(n)\) if k is a fixed constant. Also, the time complexity for building G is O(nd). Thus, the total time for solving the Partition problem is \(O\Big (n\big (n(\log n)^2+d\big )\Big )\).Footnote 4 Together with the time complexity in Theorem 1, we have the following theorem.
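For illustration, below is a minimal sketch of this flow formulation in Python using the networkx library. The function name rgather_partition, the integer scaling of the squared distances, and the use of node demands in place of the explicit source–sink circulation described above are choices made only for this sketch; it also assumes \(n\ge rk\) so that a feasible assignment exists.

```python
import numpy as np
import networkx as nx

def rgather_partition(points, centers, r, scale=10**6):
    """Sketch: assign each point to a candidate mean point so that every
    cluster has size at least r, minimizing the total squared distance.
    Assumes n >= r*k; costs are scaled/rounded to integers for the solver."""
    n, k = len(points), len(centers)
    G = nx.DiGraph()
    # Each point supplies one unit of flow (negative demand).
    for i in range(n):
        G.add_node(('p', i), demand=-1)
    # Each center must retain a net of r units -> cluster size >= r.
    for j in range(k):
        G.add_node(('c', j), demand=r)
    # The sink absorbs the remaining n - r*k units.
    G.add_node('t', demand=n - r * k)
    for i, p in enumerate(points):
        for j, c in enumerate(centers):
            w = int(round(scale * float(np.sum((np.asarray(p) - np.asarray(c)) ** 2))))
            G.add_edge(('p', i), ('c', j), capacity=1, weight=w)
    for j in range(k):
        G.add_edge(('c', j), 't', capacity=n, weight=0)
    flow = nx.min_cost_flow(G)
    clusters = [[] for _ in range(k)]
    for i in range(n):
        for j in range(k):
            if flow[('p', i)].get(('c', j), 0) == 1:
                clusters[j].append(i)
    return clusters
```

Encoding the lower bound r as a node demand on each candidate mean point is, for this illustration, equivalent to the demand/capacity setting on the right-side vertices described above.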

Theorem 3

There exists an algorithm yielding a \((1+\epsilon )\)-approximation for (rk)-GMeans with constant probability, in \(O\Big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}n\big (n\log n+d\big )\Big )\) time.

4.2 r-Capacity k-Means Clustering

r-Capacity k-means clustering (denoted by (rk)-CaMeans) [41] on a set P of n points in \(\mathbb {R}^d\) is the problem of clustering P into k clusters with size at most r, such that the average squared Euclidean distance from each point in P to the mean point of its cluster is minimized.

We can solve the Partition problem of (rk)-CaMeans in a way similar to that of (rk)-GMeans; the only difference is that the demand r is replaced by the capacity r.

Theorem 4

There exists an algorithm yielding a \((1+\epsilon )\)-approximation for (rk)-CaMeans with constant probability, in \(O\Big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}n\big (n\log n+d\big )\Big )\) time.

4.3 l-Diversity k-Means Clustering

Let \(P=\bigcup ^{\tilde{n}}_{i=1}P_i\) be a set of colored points in \(\mathbb {R}^d\) with \(\sum ^{\tilde{n}}_{i=1}|P_i|=n\), where the points in each \(P_i\) share the same color. l-Diversity k-means clustering (denoted by (lk)-DMeans) on P is the problem of clustering P into k clusters such that, inside each cluster, the points sharing the same color make up a fraction of no more than \(\frac{1}{l}\) of the cluster for some \(l>1\), and the average squared Euclidean distance from each point in P to the mean point of its cluster is minimized.

Similar to (rk)-GMeans, we reduce the Partition problem of (lk)-DMeans to a minimum cost circulation problem for each k-tuple candidate \(P_{v}=\{p_{v_1}, \ldots , p_{v_k}\}\). The challenge is that we do not know the size of each resulting cluster, and therefore it is difficult to control the flow on each edge if we directly use the bipartite graph built in Fig. 6a. Instead, we add a set of “gates” between the input P and the k-tuple \(P_v\) (see Fig. 6b). First, following the definition of (lk)-DMeans, we partition the “vertices” P into \(\tilde{n}\) groups \(\{P_1, \ldots , P_{\tilde{n}}\}\). For each \(P_i\), we generate a new set of vertices (i.e., the gates) \(P'_i=\{c^i_1, \ldots , c^i_k\}\), and connect each pair of \(p\in P_i\) and \(c^i_j\in P'_i\) by an edge with weight \(||p-p_{v_j}||^2\). We also connect each pair of \(c^i_j\) and \(p_{v_j}\) by an edge with weight 0. In Fig. 6b, the number of vertices is \(|V|=n+k\tilde{n}+k+2=O(kn)\), and the number of edges is \(|E|=n+kn+k\tilde{n}+k=O(kn)\). Below we show that we can use \(c^i_j\) to control the flow from \(P_i\) to \(p_{v_j}\) by setting appropriate capacities and demands.

Let \(t=\max _{1\le i\le \tilde{n}}|P_{i}|\). Consider the value \(\lfloor |Opt_j|/l\rfloor \), which is an upper bound on the number of points with the same color in \(Opt_j\) (recall that \(Opt_j\) is the jth optimal cluster defined in Sect. 3). This upper bound is either between 1 and t or larger than t; clearly, if it is larger than t, it imposes no restriction. Thus, we can enumerate all the \((t+1)^k\) cases to guess the upper bound \(\lfloor |Opt_j|/l\rfloor \) for \(1\le j\le k\). Let \(u_{j}\) be the guessed upper bound for \(Opt_j\). If \(u_{j}\) is no more than t, then each \(c^i_j\), \(1\le i\le \tilde{n}\), has capacity \(u_j\), and \(p_{v_j}\) has demand \(l\times u_j\) and capacity \(l\times (u_j+1)-1\). Otherwise (i.e., \(u_{j} >t\)), set the capacity of each \(c^i_j\), \(1\le i\le \tilde{n}\), to be n, and the demand and capacity of \(p_{v_j}\) to be \(l\times (t+1)\) and n, respectively. By using the algorithm in [49], we solve the minimum cost circulation problem for each of the \((t+1)^k\) guesses.
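To make the gate construction concrete, the following is a small sketch in Python (again with networkx) of the network in Fig. 6b for a single guessed vector \(u=(u_1, \ldots , u_k)\) with every \(u_j\le t\). The function name, the integer scaling of the weights, and the encoding of the demand/capacity of \(p_{v_j}\) via a node demand plus a surplus edge to the sink are choices made only for this illustration, and feasibility of the guess (\(\sum _j l\cdot u_j\le n\le \sum _j (l(u_j+1)-1)\)) is assumed.

```python
import numpy as np
import networkx as nx

def ldiversity_partition(color_classes, centers, l, u, scale=10**6):
    """Sketch of the gated flow network of Fig. 6b for one guessed
    upper-bound vector u = (u_1, ..., u_k), assuming every u_j <= t and
    that the guess is feasible.  color_classes: list of point lists, one
    per color; centers: the k candidate mean points."""
    k = len(centers)
    n = sum(len(P_i) for P_i in color_classes)
    low = [l * u_j for u_j in u]                 # demand of center j
    high = [l * (u_j + 1) - 1 for u_j in u]      # capacity of center j
    G = nx.DiGraph()
    G.add_node('t', demand=n - sum(low))         # sink absorbs the surplus
    for j in range(k):
        G.add_node(('ctr', j), demand=low[j])
        # surplus above the lower bound, up to the capacity high[j]
        G.add_edge(('ctr', j), 't', capacity=high[j] - low[j], weight=0)
    for i, P_i in enumerate(color_classes):
        for a in range(len(P_i)):
            G.add_node(('pt', i, a), demand=-1)  # each point supplies 1 unit
        for j, c in enumerate(centers):
            # gate c^i_j: at most u_j points of color i may enter cluster j
            G.add_edge(('gate', i, j), ('ctr', j), capacity=u[j], weight=0)
            for a, p in enumerate(P_i):
                w = int(round(scale * float(np.sum((np.asarray(p) - np.asarray(c)) ** 2))))
                G.add_edge(('pt', i, a), ('gate', i, j), capacity=1, weight=w)
    flow = nx.min_cost_flow(G)
    clusters = [[] for _ in range(k)]
    for i, P_i in enumerate(color_classes):
        for a in range(len(P_i)):
            for j in range(k):
                if flow[('pt', i, a)].get(('gate', i, j), 0) == 1:
                    clusters[j].append((i, a))
    return clusters
```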

Theorem 5

For any colored point set \(P=\bigcup ^{\tilde{n}}_{i=1}P_i\) in \(\mathbb {R}^{d}\) with \(n=|P|\) and \(t=\max _{1\le i\le \tilde{n}}|P_i|\), there exists an algorithm yielding, in \(O\Big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}(t+1)^kn\big (n\log n+d\big )\Big )\) time, a \((1+\epsilon )\)-approximation for (lk)-DMeans with constant probability.

Note We can solve the problem in [46] by slightly changing the above Partition algorithm. In [46], it requires that the size of each cluster is at least l and the points inside each cluster have distinct colors, which means that the upper bound \(u_j\) is always equal to 1 for each \(1\le j\le k\). Thus, there is no need to guess the upper bounds in our Partition algorithm. We can simply set the capacity for each \(c^i_j\) to be 1, and the demand for each \(p_{v_j}\) to be l. With this change, our algorithm yields a \((1+\epsilon )\)-approximation with constant probability in \(O\Big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}n\big (n\log n+d\big )\Big )\) time.

4.4 Chromatic k-Means Clustering

Let \(P=\bigcup ^{\tilde{n}}_{i=1}P_i\) be a set of colored points in \(\mathbb {R}^d\) and \(\sum ^{\tilde{n}}_{i=1}|P_i|=n\), where the points in each \(P_i\) share the same color. Chromatic k-means clustering (denoted by k-ChMeans) [7, 23] on P is the problem of clustering P into k clusters such that no pair of points with the same color is clustered into the same cluster, and the average squared Euclidean distance from each point in P to the mean point of its cluster is minimized.

To satisfy the chromatic requirement, each \(P_i\) should have size no more than k. Given a k-tuple candidate \(P_v=\{p_{v_1}, \ldots , p_{v_k}\}\), we can consider the partition problem for each \(P_i\) independently, since there is no mutual constraint among them. It is easy to see that finding a partition of \(P_i\) is equivalent to computing a minimum cost one-to-one matching between \(P_i\) and \(P_v\), where the cost of matching any \(p\in P_i\) to \(p_{v_j}\in P_v\) is their squared Euclidean distance. We can build this bipartite graph in \(O(k^2d)\) time, and solve the matching problem by the Hungarian algorithm in \(O(k^3)\) time. Thus, the running time of the Partition step for each \(P_v\) is \(O(k^2(k+d)n)\).
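As an illustration, the Partition step for k-ChMeans can be sketched with a linear sum assignment (Hungarian-equivalent) solver; the snippet below uses scipy.optimize.linear_sum_assignment on the rectangular cost matrix of each color class. The function name chromatic_partition and the data layout are assumptions of this sketch, not part of the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def chromatic_partition(color_classes, centers):
    """Sketch of the Partition step for k-ChMeans: for each color class
    P_i (|P_i| <= k), compute a minimum cost one-to-one matching to the
    k candidate mean points, so points of the same color land in
    pairwise distinct clusters; squared Euclidean distance is the cost."""
    centers = np.asarray(centers)           # shape (k, d)
    clusters = [[] for _ in range(len(centers))]
    total_cost = 0.0
    for P_i in color_classes:
        P_i = np.asarray(P_i)               # shape (n_i, d), n_i <= k
        # cost[a, j] = ||p_a - center_j||^2
        cost = ((P_i[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
        for a, j in zip(rows, cols):
            clusters[j].append(P_i[a])
            total_cost += cost[a, j]
    return clusters, total_cost
```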

Theorem 6

There exists an algorithm yielding a \((1+\epsilon )\)-approximation for k-ChMeans with constant probability, in \(O\big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}nd\big )\) time.

4.5 Fault Tolerant k-Means Clustering

Fault Tolerant k-means clustering (denoted by (lk)-FMeans) [54] on a set P of n points in \(\mathbb {R}^d\) and a given integer \(1\le l\le k\) is the problem of finding k points \(\mathcal {C}=\{c_1, \ldots , c_k\}\subset \mathbb {R}^d\), such that the average of the total squared distances from each point in P to its l nearest points in \(\mathcal {C}\) is minimized.

To solve the Partition problem of (lk)-FMeans, our idea is to reduce (lk)-FMeans to k-ChMeans, and use the Partition algorithm for k-ChMeans to generate the desired clusters. The reduction simply makes l monochromatic copies \(\{p^1_i, \ldots , p^l_i\}\) for each \(p_i\in P\). The following lemma shows the relation of the two problems.

Lemma 9

For any constant \(\lambda \ge 1\), a \(\lambda \)-approximation of (lk)-FMeans on P is equivalent to a \(\lambda \)-approximation of k-ChMeans on \(\bigcup ^n_{i=1}\{p^1_i, \ldots , p^l_i\}\).

Proof

We build a bijection between the solutions of (lk)-FMeans and k-ChMeans. First, we consider the mapping from (lk)-FMeans to k-ChMeans. Let \(\mathcal {C}=\{c_1, \ldots , c_k\}\) be the k mean points of (lk)-FMeans, and \(\{c_{i(1)}, \ldots , c_{i(l)}\}\subset \mathcal {C}\) be the l nearest mean points to each \(p_i \in P\). If using \(\mathcal {C}\) as the k mean points of k-ChMeans on \(\bigcup ^n_{i=1}\{p^1_i, \ldots , p^l_i\}\), the l copies \(\{p^1_i, \ldots , p^l_i\}\) of \(p_{i}\) will be respectively clustered to the l clusters of \(\{c_{i(1)}, \ldots , c_{i(l)}\}\) to minimize the cost.

Now consider the mapping from k-ChMeans to (lk)-FMeans. Let \(\mathcal {C}'=\{c'_1, \ldots , c'_k\}\) be the k mean points of k-ChMeans. For each i, \(\{c'_{i(1)}, \ldots , c'_{i(l)}\}\) are the mean points of the l clusters that \(\{p^1_i, \ldots , p^l_i\}\) are clustered to. It is easy to see that the l nearest mean points of \(p_{i}\) are \(\{c'_{i(1)}, \ldots , c'_{i(l)}\}\) if we use \(\mathcal {C}'\) as the k mean points of (lk)-FMeans.

With this bijection, we can pair up the solutions to the two problems. Clearly, each pair of solutions to (lk)-FMeans and k-ChMeans formed by the bijection has the same objective value. Further, their optimal objective values are equal, and for any pair of solutions, their approximation ratios are the same. Thus, Lemma 9 is true. \(\square \)

With Lemma 9, we immediately have the following theorem.

Theorem 7

There exists an algorithm yielding a \((1+\epsilon )\)-approximation for (lk)-FMeans with constant probability, in \(O\big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}nd\big )\) time.

Note As mentioned in [35], a more general version of fault tolerant clustering problem is to allow each point \(p_i \in P\) to have an individual l-value \(l_{i}\). From the above discussion, it is easy to see that this general version can also be solved in the same way (i.e., through reduction to k-ChMeans) and achieve the same approximation result.

4.6 Semi-supervised k-Means Clustering

As shown in Sect. 1.1, semi-supervised clustering has various forms. In this paper, we consider the semi-supervised k-means clustering problem which takes into account both the geometric cost and a priori knowledge. Let P be a set of n points in \(\mathbb {R}^d\), and \(\overline{\mathcal {S}}=\{\overline{S}_1,\ldots , \overline{S}_k \}\) be a given clustering of P. Semi-supervised k-means clustering (denoted by k-SMeans) on P and \(\overline{\mathcal {S}}\) is the problem of finding a clustering \(\mathcal {S}=\{S_1,\ldots ,S_k \}\) of P such that the following objective function is minimized,

$$\begin{aligned} \alpha \frac{Cost(\mathcal {S})}{E_1}+(1-\alpha ) \frac{dist\{\mathcal {S}, \overline{\mathcal {S}}\}}{E_2}, \end{aligned}$$
(29)

where \(\alpha \in [0,1]\) is a given constant, \(E_1\) and \(E_2\) are two given scalars to normalize the two terms, \(Cost(\mathcal {S})\) is the k-means clustering cost of \(\mathcal {S}\), and \(dist\{\mathcal {S}, \overline{\mathcal {S}}\}\) is the distance between \(\mathcal {S}\) and \(\overline{\mathcal {S}}\) defined in the same way as in [11]. For any pair of \(S_j\) and \(\overline{S}_i\), \(1\le j, i\le k\), their difference is \(|S_{j} {\setminus } \overline{S}_i |\). Given a bipartite matching \(\sigma \) between \(\mathcal {S}\) and \(\overline{\mathcal {S}}\), \(dist\{\mathcal {S}, \overline{\mathcal {S}}\}\) is defined as \(\sum ^k_{j=1}|S_{j} {\setminus } \overline{S}_{\sigma (j)}|\).

The challenge is that the bipartite matching \(\sigma \) is unknown in advance. We fix the k-tuple candidate \(P_{v}=\{p_{v_1}, \ldots p_{v_k}\}\). To find the desired \(\sigma \) to minimize the objective function (29), we build a bipartite graph, where the left (resp., right) column contains k vertices corresponding to \(p_{v_1}, \ldots , p_{v_k}\) (resp., \(\overline{S}_1,\ldots , \overline{S}_k \)), respectively. For each pair \((p_{v_j}, \overline{S}_i)\), we connect them by an edge; we calculate the edge weight \(w_{(i,j)}\) in the following way. For each \(p\in \overline{S}_i\), it could be potentially assigned to any of the k clusters in \(\mathcal {S}\); if \(i=\sigma (j)\), the induced k costs of p will be \(\{c^1_p, c^2_p, \ldots , c^k_p\}\), where \(c^l_p=\alpha \frac{||p-p_{v_l}||^2}{E_1}\) if \(l=j\), or \(c^l_p=\alpha \frac{||p-p_{v_l}||^2}{E_1}+(1-\alpha )\frac{1}{E_2}\) otherwise. Thus, we set

$$\begin{aligned} w_{(i,j)}=\sum _{p\in \overline{S}_i}\min _{1\le l\le k}c^l_p. \end{aligned}$$
(30)

We solve the minimum cost bipartite matching problem to determine \(\sigma \). To build such a bipartite graph, we first compute all the kn distances from the points in P to the k-tuple \(P_{v}\); then, we calculate the \(k^2\) edge weights via (30). The bipartite graph can thus be built in \(O(k nd+k^2n)\) time in total, and the optimal matching can be obtained via the Hungarian algorithm in \(O(k^3)\) time.
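The following is a small sketch of this computation: it evaluates the edge weights \(w_{(i,j)}\) of (30) and solves the \(k\times k\) matching with scipy.optimize.linear_sum_assignment as a stand-in for the Hungarian algorithm. The function name ssmeans_partition and the data layout are choices made only for this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ssmeans_partition(S_bar, centers, alpha, E1, E2):
    """Sketch of the Partition step for k-SMeans: compute the k x k edge
    weights w_(i,j) of Eq. (30) and find the bipartite matching sigma.
    S_bar is the given clustering (a list of k point arrays), centers
    the k-tuple candidate."""
    centers = np.asarray(centers)
    k = len(centers)
    W = np.zeros((k, k))
    for i, Si in enumerate(S_bar):
        Si = np.asarray(Si)
        # geometric term alpha * ||p - p_{v_l}||^2 / E1 for every center l
        geo = alpha * ((Si[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / E1
        for j in range(k):
            # if S_j is matched with S_bar_i, assigning p to any l != j
            # additionally pays (1 - alpha)/E2 toward dist(S, S_bar)
            penalty = np.full(k, (1.0 - alpha) / E2)
            penalty[j] = 0.0
            W[i, j] = (geo + penalty[None, :]).min(axis=1).sum()
    rows, cols = linear_sum_assignment(W)          # optimal matching
    sigma = {j: i for i, j in zip(rows, cols)}     # sigma(j) = i
    return sigma, W[rows, cols].sum()
```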

Theorem 8

There exists an algorithm yielding a \((1+\epsilon )\)-approximation for k-SMeans with constant probability, in \(O\big (2^{poly(\frac{k}{\epsilon })}(\log n)^{k+1}nd\big )\) time.

5 Constrained k-Median Clustering (k-CMedian)

In this section, we extend our approach for k-CMeans to the constrained k-median clustering problem (k-CMedian). Similar to k-CMeans, we show that the Peeling-and-Enclosing framework can be used to construct a set of candidates for the constrained median points. Combining this with the selection algorithms (with trivial modifications) in Sect. 4, we achieve \((1+\epsilon )\)-approximations for a class of k-CMedian problems.

To solve k-CMedian, a straightforward idea is to extend the simplex lemma to median points and combine it with the Peeling-and-Enclosing framework to achieve an approximate solution. However, due to the essential difference between mean and median points, such an extension of the simplex lemma is not always possible. The main reason is that the median point (i.e., the Fermat point) does not necessarily lie inside the simplex, and thus there is no guarantee of finding the median point by searching inside the simplex. Below is an example showing that the median point can actually lie outside the simplex.

Fig. 7
figure 7

An example showing non-existence of a simplex lemma for k-CMedian

Let \(P=\{p_1, p_2, \ldots , p_{9}\}\) be a set of points in \(\mathbb {R}^d\). We consider the following partition of P: \(P_1=\{p_i\mid 1\le i\le 5\}\) and \(P_2=\{p_i\mid 6\le i\le 9\}\). Assume that all the points of P are located at the three vertices of a triangle \(\varDelta abc\). Particularly, \(\{p_1, p_2, p_6\}\) coincide with vertex a, \(\{p_3,p_4,p_5\}\) with vertex b, and \(\{p_7,p_8,p_9\}\) with vertex c (see Fig. 7). It is easy to see that the median points of \(P_1\) and \(P_2\) are b and c, respectively. If the angle \(\angle bac\ge \frac{2\pi }{3}\), the median point of P is vertex a (note that the median point can be viewed as the Fermat point of \(\varDelta abc\) with each vertex associated with weight 3). This means that the median point of P is outside the simplex formed by the median points of \(P_1\) and \(P_2\) (i.e., the segment \(\overline{bc}\)). Thus, a good approximation of the median point cannot be obtained by searching a grid inside \(\overline{bc}\).
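A quick numeric check of this example (with concrete vertex coordinates chosen here only for illustration, so that \(\angle bac>\frac{2\pi }{3}\)) confirms that the cost at vertex a beats every point on the segment \(\overline{bc}\):

```python
import numpy as np

# Concrete instance of the Fig. 7 example: angle bac > 2*pi/3, three
# coincident points at each vertex (weights 3, 3, 3).
a, b, c = np.array([0.0, 0.0]), np.array([-4.0, 1.0]), np.array([4.0, 1.0])

def median_cost(x, pts=(a, b, c), w=(3, 3, 3)):
    return sum(wi * np.linalg.norm(x - p) for wi, p in zip(w, pts))

print("cost at a:", median_cost(a))                       # about 24.74
# best point restricted to the simplex of the two cluster medians (segment bc)
ts = np.linspace(0.0, 1.0, 1001)
print("best cost on segment bc:",
      min(median_cost((1 - t) * b + t * c) for t in ts))  # about 27.00
```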

To overcome this difficulty, we show that a weaker version of the simplex lemma exists for median, which enables us to achieve similar results for k-CMedian.

5.1 Weaker Simplex Lemma for Median Point

Compared to the simplex lemma in Sect. 2, the following Lemma 10 has two differences. First, the lemma requires only a partial partition covering a large fraction of P, rather than a complete partition of P. Second, the grid is built in the flat spanned by \(\{o_1, \ldots , o_j\}\), instead of inside the simplex. Later, we will show that the grid is actually built in a surrounding region of the simplex, and thus the lemma is called the “weaker simplex lemma”.

Lemma 10

(Weaker Simplex Lemma) Let P be a set of n points in \(\mathbb {R}^d\), and \(\bigcup ^j_{l=1}P_l\subset P\) be a partial partition of P with \(P_{l_1}\cap P_{l_2}=\emptyset \) for any \(l_1\ne l_2\). Let \(o_{l}\) be the median point of \(P_{l}\) for \(1\le l \le j\), and \(\mathcal {F}\) be the flat spanned by \(\{o_1, \ldots , o_j\}\). If \(|P{\setminus }(\bigcup ^j_{l=1}P_l)|\le \epsilon |P|\) for some constant \(\epsilon \in (0,1/5)\) and each \(P_l\) is contained inside a ball \(\mathcal {B}(o_l, L)\) centered at \(o_{l}\) and with radius \(L\ge 0\), then it is possible to build a grid in \(\mathcal {F}\) with size \(O(j^2(\frac{j\sqrt{j}}{\epsilon })^j)\) such that at least one grid point \(\tau \) satisfies the following inequality, where o is the median point of P (see Fig. 8).

$$\begin{aligned} \frac{1}{|P|}\sum _{p\in P} ||\tau -p||\le \left( 1+\frac{9}{4}\epsilon \right) \frac{1}{|P|}\sum _{p\in P}||p-o||+(1+\epsilon )L. \end{aligned}$$
(31)
Fig. 8
figure 8

An illustration for Lemma 10

Proof Synopsis To prove Lemma 10, we let \(\tilde{o}\) be the orthogonal projection of o to \(\mathcal {F}\) (see Fig. 8). In Claim 4, we show that the distance between o and \(\tilde{o}\) is bounded, and consequently, the induced cost of \(\tilde{o}\), i.e., \(\frac{1}{|P|}\sum _{p\in P}||p-\tilde{o}||\), is also bounded according to Claim 5. Thus, \(\tilde{o}\) is a good approximation of o, and we can focus on building a grid inside \(\mathcal {F}\) to approximate \(\tilde{o}\). Since \(\mathcal {F}\) is unbounded, we need to determine a range for the grid. Claim 6 resolves the issue. It considers two cases. One is that there are at least two subsets in the partial partition, \(\{P_1, \ldots , P_j\}\), having large enough fractions of P; the other is that only one subset is large enough. For either case, Claim 6 shows that we can determine the range of the grid using the location information of \(\{o_1, \ldots , o_j\}\). Finally, we can obtain the desired grid point \(\tau \) in the following way: draw a set of balls centered at \(\{o_1, \ldots , o_j\}\) with proper radii; build the grids inside each of the balls, and find the desired grid point \(\tau \) in one of these balls. Note that since all the balls are inside \(\mathcal {F}\), the complexity of the union of the grids is independent of the dimensionality d.

Claim 4

$$\begin{aligned} ||o-\tilde{o}||\le L+\frac{1}{1-\epsilon } \frac{1}{|P|}\sum _{p\in P}||o-p||. \end{aligned}$$
(32)
Fig. 9
figure 9

An illustration for Claim 4

Proof

Lemma 10 assumes that \(|\bigcup ^j_{l=1}P_l|\ge (1-\epsilon )|P|\). By Markov’s inequality, we know that there exists one point \(q\in \bigcup ^j_{l=1}P_l\) such that

$$\begin{aligned} ||q-o||\le \frac{1}{1-\epsilon } \frac{1}{|P|}\sum _{p\in P}||o-p||. \end{aligned}$$
(33)

Let \(P_{l_{q}}\) be the subset containing q. Then from (33), we immediately have

$$\begin{aligned} ||o-\tilde{o}||\le & {} ||o_{l_q}-o|| \nonumber \\\le & {} ||o_{l_q}-q||+||q-o||\nonumber \\\le & {} L+\frac{1}{1-\epsilon } \frac{1}{|P|}\sum _{p\in P}||o-p||. \end{aligned}$$
(34)

This implies Claim 4 (see Fig. 9). \(\square \)

Claim 5

$$\begin{aligned} \frac{1}{|P|}\sum _{p\in P}||p-\tilde{o}|| \le \frac{1}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||+L. \end{aligned}$$
(35)
Fig. 10
figure 10

An illustration for Claim 5

Proof

For any point \(p\in P_l\), let \(dist\{\overline{o\tilde{o}}, p\}\) (resp., \(dist\{\mathcal {F}, p\}\)) denote its distance to the line \(\overline{o\tilde{o}}\) (resp., the flat \(\mathcal {F}\)). See Fig. 10. Then we have

$$\begin{aligned} ||p-\tilde{o}||= & {} \sqrt{dist^2\{\overline{o\tilde{o}}, p\}+dist^2\{\mathcal {F}, p\}}, \end{aligned}$$
(36)
$$\begin{aligned} ||p-o||\ge & {} dist\{\overline{o\tilde{o}}, p\}. \end{aligned}$$
(37)

Combining (36) and (37), we have

$$\begin{aligned} ||p-\tilde{o}||-||p-o||\le & {} \sqrt{dist^2\{\overline{o\tilde{o}}, p\}+dist^2\{\mathcal {F}, p\}}- dist\{\overline{o\tilde{o}}, p\}\nonumber \\\le & {} dist\{\mathcal {F},p\}\nonumber \\\le & {} ||p-o_l||\le L. \end{aligned}$$
(38)

For any point \(p\in P{\setminus }(\bigcup ^j_{l=1}P_l)\), we have

$$\begin{aligned} ||p-\tilde{o}||\le & {} ||p-o||+||o-\tilde{o}||. \end{aligned}$$
(39)

Combining (38), (39), and (32), we have

$$\begin{aligned} \frac{1}{|P|}\sum _{p\in P}||p-\tilde{o}||= & {} \frac{1}{|P|}\left( \sum _{p\in \bigcup ^j_{l=1}P_l}||p-\tilde{o}||+\sum _{p\in P{\setminus }\left( \bigcup ^j_{l=1}P_l\right) }||p-\tilde{o}||\right) \nonumber \\\le & {} \frac{1}{|P|}\left( \sum _{p\in \bigcup ^j_{l=1}P_l}(L+||p-o||)+\sum _{p\in P{\setminus }\left( \bigcup ^j_{l=1}P_l\right) }(||p-o||+||o-\tilde{o}||)\right) \nonumber \\\le & {} (1-\epsilon )L+\frac{1}{|P|}\sum _{p\in P}||p-o||+\epsilon L+\frac{\epsilon }{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||\nonumber \\= & {} \frac{1}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||+L. \end{aligned}$$
(40)

Thus the claim is true. \(\square \)

Claim 6

At least one of the following two statements is true.

  1.

    There exist at least two points in \(\{o_1, \ldots , o_j\}\) whose distances to \(\tilde{o}\) are no more than \(L+\frac{3j}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||\).

  2.

    There exists one point in \(\{o_1, \ldots , o_j\}\), say \(o_{l_0}\), whose distance to \(\tilde{o}\) is no more than \((1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }})L\).Footnote 5

Proof

We consider two cases: (i) there are two subsets \(P_{l_1}\) and \(P_{l_2}\) of P with size at least \(\frac{1-\epsilon }{3j}|P|\), and (ii) no such pair of subsets exists.

For case (i), by Markov’s inequality, we know that there exist two points \(q\in P_{l_1}\) and \(q'\in P_{l_2}\) such that

$$\begin{aligned} ||q-o||\le \frac{3j}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||; \end{aligned}$$
(41)
$$\begin{aligned} ||q'-o||\le \frac{3j}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||. \end{aligned}$$
(42)

This, together with triangle inequality, indicates that both \(||o_{l_1}-o||\) and \(||o_{l_2}-o||\) are no more than \(L+\frac{3j}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||\). Since \(\tilde{o}\) is the orthogonal projection of o to \(\mathcal {F}\), we have \(||o_{l_1}-\tilde{o}||\le ||o_{l_1}-o||\) and \(||o_{l_2}-\tilde{o}||\le ||o_{l_2}-o||\). Thus, the first statement is true in this case.

Fig. 11
figure 11

An illustration for Claim 6

For case (ii), i.e., there do not exist two subsets with size at least \(\frac{1-\epsilon }{3j}|P|\): since \(\sum ^j_{l=1}|P_l| \ge (1-\epsilon )|P|\), by the pigeonhole principle there must exist one \(P_{l_0}\), \(1\le l_0\le j\), with size

$$\begin{aligned} |P_{l_0}|\ge \left( 1-(j-1)\frac{1}{3j}\right) (1-\epsilon )|P|\ge \frac{2}{3}(1-\epsilon )|P|. \end{aligned}$$
(43)

Let \(x=||o-o_{l_0}||\). We assume that \(x>L\), since otherwise the second statement is automatically true.

Now imagine moving o slightly toward \(o_{l_0}\) by a small distance \(\delta \). See Fig. 11. For any point \(p\in P_{l_0}\), let \(\tilde{p}\) be its orthogonal projection to the line \(\overline{oo_{l_0}}\), and a and b be the distances \(||o-\tilde{p}||\) and \(||p-\tilde{p}||\), respectively. Then, the distance between p and o is decreased by \(\sqrt{a^2+b^2}-\sqrt{(a-\delta )^2+b^2}\). Also, we have

$$\begin{aligned} \lim _{\delta \rightarrow 0}\frac{\sqrt{a^2+b^2}-\sqrt{(a-\delta )^2+b^2}}{\delta }= & {} \lim _{\delta \rightarrow 0}\frac{2a-\delta }{\sqrt{a^2+b^2}+\sqrt{(a-\delta )^2+b^2}}\nonumber \\= & {} \frac{(a/b)}{\sqrt{(a/b)^2+1}}. \end{aligned}$$
(44)

Since p is inside the ball \(\mathcal {B}(o_{l_0}, L)\), we have \(a/b\ge (x-L)/L\). For any point \(p\in P{\setminus } P_{l_{0}}\), its distance to o increases by at most \(\delta \) (and may even decrease). Thus, the average distance from the points in P to o is decreased by at least

$$\begin{aligned} \frac{2}{3}(1-\epsilon )\frac{((x-L)/L) \delta }{\sqrt{((x-L)/L)^2+1}}-\left( 1-\frac{2}{3}(1-\epsilon )\right) \delta . \end{aligned}$$
(45)

Since the original position of o is the median point of P, the value of (45) must be non-positive. Setting \(r=(x-L)/L\), this gives \(\frac{r}{\sqrt{r^2+1}}\le \frac{1+2\epsilon }{2(1-\epsilon )}\); squaring and solving for r, we have

$$\begin{aligned} (x-L)/L\le \frac{1+2\epsilon }{\sqrt{3-12\epsilon }}\Longrightarrow x\le \left( 1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }}\right) L. \end{aligned}$$
(46)

By the same argument as in case (i), we know that \(||o_{l_0}-\tilde{o}||\le ||o_{l_0}-o||\). This, together with (46), implies that the second statement is true for case (ii), which completes the proof of the claim. \(\square \)

With the above claims, we now prove Lemma 10.

Proof of Lemma 10

We build a grid in \(\mathcal {F}\) as follows. First, draw a set of balls.

  • For each \(o_l\), \(1\le l\le j\), draw a ball (called type-1 ball) centered at \(o_l\) and with radius \((1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }})L\).

  • For each pair of \(o_{l}\) and \(o_{l'}\), \(1\le l, l'\le j\), draw a ball (called type-2 ball) centered at \(o_{l}\) and with radius \((1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }})(||o_{l}-o_{l'}||+L)\).

We claim that among the above balls, there must exist one ball that contains \(\tilde{o}\). If there is only one subset in \(\{P_1, \ldots , P_j\}\) with size no smaller than \(\frac{1-\epsilon }{3j}|P|\), it corresponds to the second case in Claim 6, and thus there exists a type-1 ball containing \(\tilde{o}\). Now consider the case that there are multiple subsets, say \(\{P_{l_1}, \ldots , P_{l_t}\}\) for some \(t\ge 2\), all with size no smaller than \(\frac{1-\epsilon }{3j}|P|\). Without loss of generality, assume that \(||o_{l_1}-o_{l_2}||=\max \{||o_{l_1}-o_{l_s}||\mid 1\le s\le t\}\). Then, we can view \(\bigcup ^t_{s=1}P_{l_s}\) as one big subset of P bounded by a ball centered at \(o_{l_1}\) and with radius \(||o_{l_1}-o_{l_2}||+L\). By the same argument given in the proof of Claim 6 for (43), we know that \(|\bigcup ^t_{s=1}P_{l_s}|\ge \frac{2}{3}(1-\epsilon )|P|\). This means that we can reduce this case to the second case in Claim 6, i.e., replacing \(P_{l_0}\), \(o_{l_0}\) and L by \(\bigcup ^t_{s=1}P_{l_s}\), \(o_{l_1}\) and \(||o_{l_1}-o_{l_2}||+L\), respectively. Thus, there is a type-2 ball containing \(\tilde{o}\).

Next, we discuss how to build the grids inside these balls. For type-1 balls with radius \((1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }})L\), we build the grids inside them with grid length \(\frac{\epsilon }{\sqrt{j}}L\). For type-2 balls with radius \(r_{l, l'}=(1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }})(||o_{l}-o_{l'}||+L)\) for some l and \(l'\), we build the grids inside them with grid length

$$\begin{aligned} \frac{1}{1+\frac{1+2\epsilon }{\sqrt{3-12\epsilon }}}\frac{(1-\epsilon )\epsilon }{6j\sqrt{j}}r_{l,l'}. \end{aligned}$$
(47)

If \(\tilde{o}\) is contained in a type-1 ball, then there exists one grid point \(\tau \) whose distance to \(\tilde{o}\) is no more than \(\epsilon L\). If \(\tilde{o}\) is contained in a type-2 ball, such a distance is no more than

$$\begin{aligned} \frac{(1-\epsilon )\epsilon }{6j}(||o_{l}-o_{l'}||+L) \end{aligned}$$
(48)

by (47). By the first statement in Claim 6 and triangle inequality, we know that

$$\begin{aligned} ||o_{l}-o_{l'}||\le ||o_{l}-\tilde{o}||+||\tilde{o}-o_{l'}||\le 2\left( L+\frac{3j}{1-\epsilon }\frac{1}{|P|}\sum _{p\in P}||p-o||\right) \!. \end{aligned}$$
(49)

Equations (48) and (49) imply that there exists one grid point \(\tau \) whose distance to \(\tilde{o}\) is no more than

$$\begin{aligned} \epsilon \frac{1}{|P|}\sum _{p\in P}||p-o||+\frac{(1-\epsilon )\epsilon }{2j}L\le \epsilon \frac{1}{|P|}\sum _{p\in P}||p-o||+\epsilon L. \end{aligned}$$
(50)

Thus, in both cases (whether \(\tilde{o}\) lies in a type-1 or a type-2 ball), by the triangle inequality and Claim 5, we have

$$\begin{aligned} \frac{1}{|P|}\sum _{p\in P}||p-\tau ||\le & {} \frac{1}{|P|}\sum _{p\in P}(||p-\tilde{o}||+||\tilde{o}-\tau ||)\nonumber \\\le & {} \left( \frac{1}{1-\epsilon }+\epsilon \right) \frac{1}{|P|}\sum _{p\in P}||p-o||+(1+\epsilon )L\nonumber \\\le & {} \left( 1+\frac{9}{4}\epsilon \right) \frac{1}{|P|}\sum _{p\in P}||p-o||+(1+\epsilon )L, \end{aligned}$$
(51)

where the last inequality follows from the assumption that \(\epsilon < \frac{1}{5}\).

As for the grid size, since we build the grids inside the balls in the \((j-1)\)-dimensional flat \(\mathcal {F}\), through simple calculation, we know that the grid size is \(O(j^2(\frac{j\sqrt{j}}{\epsilon })^j)\). This completes the proof. \(\square \)
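To illustrate why the grid size depends only on j and not on the dimensionality d, the following sketch builds a grid of a given spacing inside a ball restricted to the flat spanned by the given points. The function name and the QR-based construction of an orthonormal basis of the flat are choices made only for this illustration.

```python
import numpy as np
from itertools import product

def grid_in_flat_ball(anchors, center, radius, step):
    """Sketch: enumerate grid points (spacing `step`) inside the ball
    B(center, radius) restricted to the flat spanned by `anchors`
    (the points o_1, ..., o_j); `center` is assumed to lie in the flat.
    Works in any ambient dimension d; the number of grid points depends
    only on the flat's dimension and radius/step."""
    anchors = np.asarray(anchors, dtype=float)
    o1 = anchors[0]
    diffs = anchors[1:] - o1
    if len(diffs) == 0:
        return [o1]
    # Orthonormal basis of the (<= j-1)-dimensional flat through the anchors.
    q, r = np.linalg.qr(diffs.T)               # columns of q span the flat directions
    rank = int(np.sum(np.abs(np.diag(r)) > 1e-9))
    basis = q[:, :rank].T                      # shape (rank, d)
    c_local = basis @ (np.asarray(center) - o1)  # ball center in flat coordinates
    m = int(np.ceil(radius / step))
    pts = []
    for offs in product(range(-m, m + 1), repeat=rank):
        x_local = c_local + step * np.asarray(offs, dtype=float)
        if np.linalg.norm(x_local - c_local) <= radius:
            pts.append(o1 + basis.T @ x_local)   # map back to R^d
    return pts
```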

5.2 Peeling-and-Enclosing Algorithm for k-CMedian Using Weaker Simplex Lemma

In this section, we present a unified Peeling-and-Enclosing algorithm for generating a set of candidate median points for k-CMedian. Similar to the algorithm for k-CMeans, our algorithm iteratively determines the k median points. At each iteration, it uses a set of peeling spheres and a simplex to search for an approximate median point. Since the simplex lemma no longer holds for k-CMedian, we use the weaker simplex lemma as a replacement. Thus a number of changes are needed to accommodate the differences.

Before presenting our algorithm, we first introduce the following result, proved by Bădoiu et al. in [10], for finding an approximate median point of a given point set.

Theorem 9

[10] Let P be a normalized set of n points in \(\mathbb {R}^d\), \(1>\epsilon >0\), and R be a random sample of \(O(1/\epsilon ^3\log 1/\epsilon )\) points from P. Then one can compute, in \(O(d2^{O(1/\epsilon ^4)}\log n)\) time, a point set S(P, R) of cardinality \(O(2^{O(1/\epsilon ^4)}\log n)\), such that with constant probability (over the choices of R), there is a point \(\tau \in S(P,R)\) such that \(\sum _{p\in P}||\tau -p||\le (1+\epsilon ) \sum _{p\in P}||o-p||\), where o is the optimal median point of P.

Fig. 12
figure 12

The gray area is U

Sketch of the proof of Theorem 9 Since our algorithm uses some ideas from Theorem 9, we sketch its proof for completeness. First, by Markov’s inequality, we know that there exists one point, say \(s_{1}\), in R whose distance to o is no more than \(2 \frac{1}{|P|}\sum _{p\in P}||o-p||\) with a certain probability. Then the sampling procedure can be viewed as an incremental process starting with \(s_1\): a flat \(\mathcal {F}\) spanned by all previously obtained sample points is maintained, and each time a new sample point is added, \(\mathcal {F}\) is updated. Let \(\tilde{o}\) be the projection of o on \(\mathcal {F}\), and

$$\begin{aligned} U=\left\{ p\in \mathbb {R}^d \mid \frac{\pi }{2}-\frac{\epsilon }{16}\le \angle o\tilde{o}p\le \frac{\pi }{2}+\frac{\epsilon }{16}\right\} . \end{aligned}$$
(52)

See Fig. 12. It has been shown that this incremental sampling process terminates after at most \(O(1/\epsilon ^3\log 1/\epsilon )\) points are taken, and that one of the following two events happens with constant probability: (1) \(\mathcal {F}\) is close enough to o, and (2) \(|P{\setminus } U|\) is small enough. In either event, a grid can be built inside \(\mathcal {F}\), and one of the grid points \(\tau \) is the desired approximate median point.

Below we give an overview of our Peeling-and-Enclosing algorithm for k-CMedian. Let \(P=\{p_1, \ldots , p_n\}\) be the set of points in \(\mathbb {R}^{d}\) given in k-CMedian, and \(\mathcal {OPT}=\{Opt_1, \ldots , Opt_k\}\) be the k (unknown) optimal clusters with \(m_{j}\) being the median point of cluster \(Opt_j\) for \(1\le j\le k\). Without loss of generality, we assume that \(|Opt_1|\ge |Opt_2|\ge \cdots \ge |Opt_k |\). Denote by \(\mu _{opt}\) the optimal objective value, i.e., \(\mu _{opt}=\frac{1}{n}\sum ^k_{j=1}\sum _{p\in Opt_j}||p-m_j||\).

Algorithm overview We mainly focus on the differences from the k-CMeans algorithm. First, our algorithm uses Theorem 9 (instead of Lemma 4) to find an approximation \(p_{v_{1}}\) of \(m_1\). Then, it iteratively finds the approximate median points for \(\{m_2, \ldots , m_k\}\) using the Peeling-and-Enclosing strategy. At the \((j+1)\)th iteration, it has already obtained the approximate median points \(p_{v_1}, \ldots , p_{v_j}\) for clusters \(Opt_{1}, \ldots , Opt_{j}\), respectively. To find the approximate median point \(p_{v_{j+1}}\) for \(Opt_{j+1}\), the algorithm draws j peeling spheres \(B_{j+1,1}, \ldots , B_{j+1,j}\) centered at \(\{p_{v_1}, \ldots , p_{v_j}\}\), respectively, and considers the size of \(\mathcal {A}=Opt_{j+1}{\setminus } (\bigcup ^j_{l=1}B_{j+1,l})\). If \(|\mathcal {A}|\) is small, it builds a flat (instead of a simplex) spanned by \(\{p_{v_1}, \ldots , p_{v_j}\}\), and finds \(p_{v_{j+1}}\) by using the weaker simplex lemma, where the j peeling spheres can be viewed as a partial partition of \(Opt_{j+1}\). If \(|\mathcal {A}|\) is large, it adopts a strategy similar to the one in Theorem 9 to find \(p_{v_{j+1}}\): start with the flat \(\mathcal {F}\) spanned by \(\{p_{v_1}, \ldots , p_{v_j}\}\), and grow \(\mathcal {F}\) by repeatedly adding a sample point from \(\mathcal {A}\) to it. As will be shown in Lemma 11, \(\mathcal {F}\) will become close enough to \(m_{j+1}\), and \(p_{v_{j+1}}\) can be obtained by searching a grid (built in a way similar to Lemma 10) in \(\mathcal {F}\). By choosing a proper value (i.e., \(O(\epsilon )\mu _{opt}\)) for L in Lemmas 10 and 11, we can achieve the desired \((1+\epsilon )\)-approximation.

As for the running time, although Theorem 9 introduces an extra factor of \(\log n\) for estimating the optimal cost of each \(Opt_{j+1}\), our algorithm actually does not need it as such estimations have already been obtained during the Peeling-and-Enclosing step (see Claim 2 in the proof of Lemma 6). Thus, the running time is still \(O(n(\log n)^{k+1}d)\), which is the same as k-CMeans.

The algorithm is shown in Algorithm 3. The following lemma is needed to ensure the correctness of our algorithm.

Lemma 11

Let \(\mathcal {F}\) be a flat in \(\mathbb {R}^d\) containing \(\{p_{v_1}, \ldots , p_{v_j}\}\) and having a distance to \(m_{j+1}\) no more than \(\frac{2}{|Opt_{j+1}|}\sum _{p\in Opt_{j+1}}||p-m_{j+1}||\). Assume that all the peeling spheres \(B_{j+1,1}, \ldots , B_{j+1,j}\) are centered at \(\{p_{v_1}, \ldots , p_{v_j}\}\), respectively, and have a radius \(L\ge 0\). Then if \(|Opt_{j+1}{\setminus } ((\bigcup ^j_{w=1}B_{j+1,w})\bigcup U)|\le \epsilon |Opt_{j+1}|\), we have

$$\begin{aligned}&\frac{1}{|Opt_{j+1}|}\sum _{p\in Opt_{j+1}}||p-\tilde{m}_{j+1}||\nonumber \\&\quad \le (1+2\epsilon )\frac{1}{|Opt_{j+1}|}\sum _{p\in Opt_{j+1}}||p-m_{j+1}||+L \end{aligned}$$
(53)

for any \(0\le \epsilon \le 1\), where \(\tilde{m}_{j+1}\) is the projection of \(m_{j+1}\) on \(\mathcal {F}\) and U is defined in (52) (after replacing o and \(\tilde{o}\) by \(m_{j+1}\) and \(\tilde{m}_{j+1}\), respectively).

Proof

To prove this lemma, we first compare it with Lemma 10. The main difference is that there is an extra part \(U \cap Opt_{j+1}\) in \(Opt_{j+1}\), where \(Opt_{j+1}\) can be viewed as the point set P in Lemma 10. Thus, \(Opt_{j+1}\) can be viewed as having three subsets, \((\bigcup ^j_{w=1}B_{j+1,w}) \bigcap Opt_{j+1}\), \(U \bigcap Opt_{j+1}\) and \(Opt_{j+1}{\setminus } ((\bigcup ^j_{w=1}B_{j+1,w})\bigcup U)\).

For each point \(p\in (\bigcup ^j_{w=1}B_{j+1,w})\bigcap Opt_{j+1}\), similar to (38) in Claim 5, we know that the cost increases by at most L if the median point moves from \(m_{j+1}\) to \(\tilde{m}_{j+1}\). Thus we have

$$\begin{aligned}&\sum _{p\in Opt_{j+1}\bigcap \left( \bigcup ^j_{w=1}B_{j+1,w}\right) }||p-\tilde{m}_{j+1}|| \nonumber \\&\quad \le \sum _{p\in Opt_{j+1}\bigcap \left( \bigcup ^j_{w=1}B_{j+1,w}\right) }(||p-m_{j+1}||+L). \end{aligned}$$
(54)

For the part \(Opt_{j+1}{\setminus } ((\bigcup ^j_{w=1}B_{j+1,w})\bigcup U)\), by triangle inequality we have

$$\begin{aligned}&\sum _{p\in Opt_{j+1}{\setminus } \left( \left( \bigcup ^j_{w=1}B_{j+1,w}\right) \bigcup U\right) }||p-\tilde{m}_{j+1}||\nonumber \\&\quad \le \sum _{p\in Opt_{j+1}{\setminus } \left( \left( \bigcup ^j_{w=1}B_{j+1,w}\right) \bigcup U\right) }(||p-m_{j+1}||+||m_{j+1}-\tilde{m}_{j+1}||)\nonumber \\&\quad \le \sum _{p\in Opt_{j+1}{\setminus } \left( \left( \bigcup ^j_{w=1}B_{j+1,w}\right) \bigcup U\right) }||p-m_{j+1}||+2\epsilon \sum _{p\in Opt_{j+1}}||p-m_{j+1}||, \end{aligned}$$
(55)

where the second inequality follows from the assumption that \(\mathcal {F}\)’s distance to \(m_{j+1}\) is no more than \(\frac{2}{|Opt_{j+1}|}\sum _{p\in Opt_{j+1}}||p-m_{j+1}||\) and

$$\begin{aligned} |Opt_{j+1}{\setminus } \left( \left( \bigcup ^j_{w=1}B_{j+1,w}\right) \bigcup U\right) |\le \epsilon |Opt_{j+1}|. \end{aligned}$$

For each point \(p\in Opt_{j+1}\cap U\), recall from (52) that the angle \(\angle m_{j+1}\tilde{m}_{j+1}p\in [\frac{\pi }{2}-\frac{\epsilon }{16}, \frac{\pi }{2}+\frac{\epsilon }{16}]\). Theorem 3.2 of [10] shows that \(||p-\tilde{m}_{j+1}||\le (1+\epsilon )||p-m_{j+1}||\). Therefore,

$$\begin{aligned} \sum _{p\in Opt_{j+1}\cap U}||p-\tilde{m}_{j+1}||\le (1+\epsilon )\sum _{p\in Opt_{j+1}\cap U}||p-m_{j+1}||. \end{aligned}$$
(56)

Combining (54), (55) and (56), we obtain (53). \(\square \)

To complete the Peeling-and-Enclosing algorithm for k-CMedian, we also need an upper bound for the optimal objective value. In Sect. 5.3, we will show how to obtain such an estimation. For the moment, we assume that the upper bound is available.

Algorithm 3 (Peeling-and-Enclosing for k-CMedian)

Using the same idea for proving Theorem 1, we obtain the following theorem for k-CMedian.

Theorem 10

Let P be a set of n points in \(\mathbb {R}^{d}\) and \(k\in \mathbb {Z}^+\) be a fixed constant. Given a constant \(\epsilon \in (0, \frac{1}{4k^2})\), in \(O(2^{poly(\frac{k}{\epsilon })}n(\log n)^{k+1} d )\) time, Algorithm 3 outputs \(O(2^{poly(\frac{k}{\epsilon })}(\log n)^{k})\) k-tuple candidate median points. With constant probability, there exists one k-tuple candidate in the output which is able to induce a \(\big (1+O(\epsilon )\big )\)-approximation of k-CMedian.

5.3 Upper Bound Estimation for k-CMedian

In this section, we show how to obtain an upper bound of the optimal objective value of k-CMedian.

Theorem 11

Let \(P=\{p_1, \ldots , p_n\}\) be the input points of k-CMedian, and \(\mathcal {C}\) be the set of k median points of a \(\lambda \)-approximation of k-median on P (without considering the constraint) for some constant \(\lambda \ge 1\). Then the Cartesian product \([\mathcal {C}]^k\) contains at least one k-tuple which is able to induce a \((3\lambda +2)\)-approximation of k-CMedian.

Let \(\{c_1, \ldots , c_k\}\) be the k median points in \(\mathcal {C}\), and \(\omega \) be the corresponding objective value of the k-median approximate solution on P. Recall that \(\{m_{1}, \ldots , m_{k}\}\) are the k unknown optimal constrained median points of P, and \(\mathcal {OPT}=\{Opt_{1}, \ldots , Opt_{k}\}\) are the corresponding k optimal constrained clusters. To prove Theorem 11, we create a new instance of k-CMedian in the following way: for each point \(p_i\in P\), move it to its nearest point, say \(c_{t}\), in \(\{c_1, \ldots , c_k\}\); let \(\tilde{p}_i\) denote the new \(p_i\) (note that \(c_t\) and \(\tilde{p}_i\) coincide). Then the set \(\tilde{P}=\{\tilde{p}_1, \ldots , \tilde{p}_n\}\) forms a new instance of k-CMedian. Let \(\mu _{opt}\) and \(\tilde{\mu }_{opt}\) be the optimal costs of P and \(\tilde{P}\), respectively, and \(\mu _{opt}([\mathcal {C}]^k)\) be the minimum cost of P obtained by restricting its k constrained median points to be a k-tuple in \([\mathcal {C}]^k\). The following two lemmas are key to proving Theorem 11.

Lemma 12

\(\tilde{\mu }_{opt} \le \omega + \mu _{opt}\).

Proof

For each \(p_i\in Opt_l\), by triangle inequality we have

$$\begin{aligned} ||\tilde{p}_i-m_l||\le ||\tilde{p}_i-p_i||+||p_i-m_l||. \end{aligned}$$
(57)

For both sides of (57), taking the averages over i and l, we get

$$\begin{aligned} \frac{1}{n}\sum ^k_{l=1}\sum _{p_i\in Opt_l}||\tilde{p}_i-m_l|| \le \frac{1}{n}\sum ^n_{i=1} ||\tilde{p}_i-p_i||+\frac{1}{n}\sum ^k_{l=1}\sum _{p_i\in Opt_l}||p_i-m_l||. \end{aligned}$$
(58)

Note that the left-hand side of (58) is not smaller than \(\tilde{\mu }_{opt}\), since \(\tilde{\mu }_{opt}\) is the optimal objective value of k-CMedian on \(\tilde{P}\). For the right-hand side of (58), the first term \(\frac{1}{n} \sum ^n_{i=1} ||\tilde{p}_i-p_i||=\omega \) (by the construction of \(\tilde{P}\)), and the second term \(\frac{1}{n}\sum ^k_{l=1}\sum _{p_i\in Opt_l}||p_i-m_l||=\mu _{opt}\). Thus, we have \(\tilde{\mu }_{opt} \le \omega + \mu _{opt}\). \(\square \)

Lemma 13

\(\mu _{opt}([\mathcal {C}]^k)\le \omega +2\tilde{\mu }_{opt}\).

Proof

Consider k-CMedian on \(\tilde{P}\). Let \(\{\tilde{m}_1, \ldots , \tilde{m}_k\}\) be the optimal constrained median points, and \(\{\tilde{O}_1, \ldots , \tilde{O}_k\}\) be the corresponding optimal constrained clusters of \(\tilde{P}\). Let \(\{\tilde{c}_1, \ldots , \tilde{c}_k\}\) be the k-tuple in \([\mathcal {C}]^k\) with \(\tilde{c}_l\) being the nearest point in \(\mathcal {C}\) to \(\tilde{m}_l\). Thus, by an argument similar to the one for (57), we have the following inequality, where \(\tilde{p}_i\) is assumed to be clustered in \(\tilde{O}_l\).

$$\begin{aligned} ||\tilde{p}_i-\tilde{c}_l||\le & {} ||\tilde{p}_i-\tilde{m}_l||+||\tilde{m}_l-\tilde{c}_l||\le 2||\tilde{p}_i-\tilde{m}_l||. \end{aligned}$$
(59)

In (59), the last inequality follows from the fact that \(\tilde{c}_l\) is the nearest point in \(\mathcal {C}\) to \(\tilde{m}_l\) and \(\tilde{p}_i \in \mathcal {C}\), which implies \(||\tilde{m}_l-\tilde{c}_l||\le ||\tilde{m}_l-\tilde{p}_i||\). For both sides of (59), taking the averages over i and l, we have

$$\begin{aligned} \frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||\tilde{p}_i-\tilde{c}_l||\le & {} 2\frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||\tilde{p}_i-\tilde{m}_l||. \end{aligned}$$
(60)

Now, consider the following k-CMedian on P. For each \(p_i\), if \( \tilde{p}_i\in \tilde{O}_l\), we cluster it to the corresponding median point \(\tilde{c}_l\). Then the objective value of the clustering is

$$\begin{aligned} \frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||p_i-\tilde{c}_l||\le & {} \frac{1}{n} \sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}(||p_i-\tilde{p}_i||+||\tilde{p}_i-\tilde{c}_l||)\nonumber \\\le & {} \frac{1}{n} \sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||p_i-\tilde{p}_i||+2\frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||\tilde{p}_i-\tilde{m}_l||.\nonumber \\ \end{aligned}$$
(61)

The left-hand side of (61), i.e., \(\frac{1}{n}\sum ^k_{l=1}\sum _{\tilde{p}_i\in \tilde{O}_l}||p_i-\tilde{c}_l||\), is no smaller than \(\mu _{opt}([\mathcal {C}]^k)\) (by the definition), and the right-hand side of (61) is equal to \(\omega +2\tilde{\mu }_{opt}\). Thus, we have \(\mu _{opt}([\mathcal {C}]^k)\le \omega +2\tilde{\mu }_{opt}\). \(\square \)

Proof of Theorem 11

By Lemmas 12 and 13, we know that \(\mu _{opt}([\mathcal {C}]^k)\le 3\omega +2\mu _{opt}\). It is obvious that the optimal objective value of the k-median clustering is no larger than that of k-CMedian on the same set of points in P. This implies that \(\omega \le \lambda \mu _{opt}\). Thus, we have

$$\begin{aligned} \mu _{opt}([\mathcal {C}]^k)\le (3\lambda +2)\mu _{opt}. \end{aligned}$$
(62)

The above inequality means that there exists one k-tuple in \([\mathcal {C}]^k\), which is able to induce a \((3\lambda +2)\)-approximation. \(\square \)

5.4 Selection Algorithms for k-CMedian

For each of the six constrained clustering problems studied in Sect. 4, the same results (including the approximation ratio and time complexity) can be extended to the corresponding constrained k-median version with slight modification (e.g., assigning the edge cost to be the Euclidean distance rather than squared Euclidean distance when computing the minimum cost circulation on the graph G). Thus, we only focus on the probabilistic clustering problem.

Probabilistic k-Median Clustering (k-PMedian) [19]. Let \(V=\{v_1, \ldots , v_n\}\) be a set of nodes; each node \(v_{i}\) is associated with a point set \(D_i=\{p^i_1, \ldots , p^i_h\}\subset \mathbb {R}^d\), where each \(p^i_l\) has a probability \(t^i_l\ge 0\) satisfying the condition \(\sum ^h_{l=1}t^i_l\le 1\). Let \(w_i=\sum ^h_{l=1}t^i_l\) for \(1\le i\le n\) and \(W=\sum ^n_{i=1}w_i\). k-PMedian is the problem of finding k points \(\{m_1, \ldots , m_k\}\) in \(\mathbb {R}^d\) such that \(\sum ^n_{i=1}\min _{1\le j\le k}dist\{v_i,m_j\}\) is minimized, where \(dist\{v_i,m_j\}=\sum ^h_{l=1}t^i_l ||p^i_l-m_j||\).

Note that for the k-means version of probabilistic clustering, Cormode and McGregor [19] have shown that it can be reduced to an ordinary k-means clustering problem after replacing each \(D_i\) by its weighted mean point. However, this strategy can only yield a \((3+\epsilon )\)-approximation for the k-median version [19, 58]. We briefly sketch our idea for solving k-PMedian below.

Actually, k-PMedian is equivalent to the k-median clustering on the weighted point set \(\bigcup ^n_{i=1}D_i\) with the constraint that for each \(1\le i\le n\), all the points in \(D_i\) should be clustered into the same cluster. Thus, we can use our Peeling-and-Enclosing algorithm for k-CMedian in Sect. 5.2 to generate a set of candidates for the constrained k median points; the difference is that the points have weights, and thus in each sampling step we sample points with probabilities proportional to their weights. To accommodate such a difference, several minor modifications need to be made to Lemma 10 and Lemma 11: all distances are changed to weighted distances, and the involved set sizes (such as |P|) are changed to nh.

As for the running time of the Peeling-and-Enclosing algorithm, it still builds trees of height k, but the number of children of each node is different. Recall that in the proof of Claim 2, in order to obtain an estimation of \(\beta _j=\frac{|Opt_j|}{n}\), we need to try \(O(\log n)\) values since \(\frac{1}{n}\le \beta _j\le 1\); for k-PMedian, the range of \(\beta _j\) becomes \([\frac{w_{min}}{W}, 1]\), where \(w_{min}=\min _{1\le i\le n}w_i\) (note that \(W=\sum ^n_{i=1}w_i\le n\)). Thus, the running time of the Peeling-and-Enclosing algorithm becomes \(O(nh (\log \frac{n}{w_{min}})^{k+1} d)\). Furthermore, for each k-tuple candidate, we perform the Partition step by assigning each \(D_i\) to the \(m_j\) with the smallest \(dist\{v_i,m_j\}\). Obviously, the Partition step can be finished in linear time. Thus we have the following theorem.
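For concreteness, the Partition step just described can be sketched as follows (the function name pmedian_partition and the data layout, one (points, probabilities) pair per node, are assumptions of this sketch):

```python
import numpy as np

def pmedian_partition(nodes, medians):
    """Sketch of the Partition step for k-PMedian: each node v_i carries a
    weighted point set (points D_i, probabilities t_i); assign v_i to the
    candidate median m_j minimizing dist{v_i, m_j} = sum_l t_l * ||p_l - m_j||."""
    medians = np.asarray(medians)                  # shape (k, d)
    assignment, total = [], 0.0
    for pts, probs in nodes:                       # pts: (h, d), probs: (h,)
        pts, probs = np.asarray(pts), np.asarray(probs)
        # dists[j] = sum_l t_l * ||p_l - m_j||
        dists = (probs[:, None] *
                 np.linalg.norm(pts[:, None, :] - medians[None, :, :], axis=2)).sum(axis=0)
        j = int(np.argmin(dists))
        assignment.append(j)
        total += dists[j]
    return assignment, total
```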

Theorem 12

There exists an algorithm yielding a \((1+\epsilon )\)-approximation for k-PMedian with constant probability, in \(O(2^{poly(\frac{k}{\epsilon })} nh (\log \frac{n}{w_{min}})^{k+1} d)\) time, where \(w_{min}=\min _{1\le i\le n}w_i\).

6 Future Work

Following this work, some interesting problems deserve to be further studied in the future. For example, we reduce the partition step to the minimum cost circulation problem for several constrained clustering problems in Sect. 4; however, since the goal is to find an approximate solution, one may consider using the geometric information to solve the Partition step approximately. In Euclidean space, several techniques have been developed for solving approximate matching problems efficiently [6, 51]. But it is still not clear whether such techniques can be extended to solve the constrained matching problems (such as the r-gather or l-diversity) considered in this paper, especially in high dimensional space. We leave it as an open problem for future work.