1 Introduction

Our challenge is to support growth in the beneficial use of big data while ensuring that it does not create unintended discriminatory consequences.

(Executive Office of the President, 2016 [MSP16]).

As the use of machine learning methods becomes more and more common in many areas of daily life, ranging from the automatic display of advertisements on webpages to mortgage approvals, we are faced with the question of whether the decisions made by these automatic systems are fair, i.e., free of biases with respect to race, gender or other sensitive attributes. While at first glance it seems that replacing human decisions by algorithms will remove any kind of bias, since algorithms decide based only on the underlying data, the problem is that the training data may contain all sorts of biases. As a result, the outcome of an automated decision process may still contain these biases.

Recent findings in algorithmically generated results strengthen this concern. For example, it has been discovered that the COMPAS software, which is used to predict the probability of recidivism, is much more likely to assign an incorrectly high risk score to a black defendant and an incorrectly low risk score to a white defendant [ALMK16]. This raises the general question of how we can guarantee fairness in algorithms.

This question comes with several challenges. The first challenge is to formally define the concept of fairness. And indeed, it turns out that there are several ways to define fairness which result in different optimal solutions [CPF+17], and it has recently been shown that they cannot be achieved simultaneously unless the data has some very special (unlikely) structure [KMR17].

In this paper we build upon the recent work by Chierichetti et al. [CKLV17] and consider fairness of clustering algorithms using the concept of disparate impact, a notion of (un)fairness introduced to computer science by Feldman et al. [FFM+15]. Disparate impact essentially means that the result of a machine learning task does not correlate strongly with sensitive attributes such as gender, race, or sexual or religious orientation. More formally, and illustrated for the case of a single binary sensitive attribute X and cluster variable C, a clustering algorithm does not show disparate impact if it satisfies the \(p\%\) rule (a typical value for p is 0.8), stating that \(\frac{\Pr \{C=i | X=0\}}{\Pr \{C=i | X=1\}} \le p.\) If we assume that both attribute values appear with the same frequency, then by Bayes' theorem the above translates to having at most \(p\%\) points with a specific attribute value in each cluster.

Chierichetti et al. model fairness based on the disparate impact model in the following way. They assume that every point has one of two colors (red or blue). If a set of points C has \(r_C\) red and \(b_C\) blue points, then they define its balance to be \(\min (\frac{r_C}{b_C},\frac{b_C}{r_C})\). The overall balance of a clustering is then defined as the minimum balance of any cluster in it. A clustering is then considered fair if its overall balance is close to 1, the value attained when every cluster contains equally many red and blue points.

An algorithm ensuring fairness has to proceed with care; as mentioned before, an algorithm that obliviously optimizes an objective function may retain biases inherent in the training set. Chierichetti et al. avoid this by identifying a set of fair micro-clusters via a suitably chosen perfect matching and running the subsequent optimization on these micro-clusters. This clever technique has the benefit of always computing a fair clustering, as the union of fair micro-clusters is necessarily also fair. However, the min-cost perfect matching is computationally expensive, and it needs random access to the data, which may be undesirable. This raises the following question:

Question 1

Is it possible to perform a fair data analysis efficiently, even when the size of the data set renders random access infeasible?

Our Contribution. We address the issue of scaling algorithms by investigating coresets for fair clustering problems, specifically for k-means. Given an input set P in d-dimensional Euclidean space, the problem consists of finding a set of k centers \(c_1,\dots ,c_k\) and a partition of P into k sets \(C_1,\dots ,C_k\) such that \(\sum _{i=1}^k \sum _{p \in C_i} \Vert p-c_i\Vert _2^2\) is minimized.

Roughly speaking, a coreset is a summary of a point set which approximates the cost function well for any possible candidate solution. The notion was proposed by Har-Peled and Mazumdar [HM04] and has since received a wide range of attention. For coresets for clustering, see for example [BFL16, FL11, FMS07, FGS+13, FS05, LS10]. Coresets for geometric clustering are usually composable, meaning that if \(S_1\) is a coreset for \(P_1\) and \(S_2\) is a coreset for \(P_2\), then \(S_1\cup S_2\) is a coreset for \(P_1\cup P_2\) [IMMM14]. Composability is arguably the main appeal of coresets; it enables an easy reduction from coreset constructions to streaming and distributed algorithms which scale to big data. For fair clustering, however, composability is not obvious. In this work, we initiate the study of fair coresets and their algorithmic implications:

  • The standard coreset definition does not satisfy composability for fair clustering problems. We propose an alternative definition tailored to fair clustering problems and show that this new definition satisfies composability. The definition straightforwardly generalizes to having \(\ell \) color classes, and we show how a suitable coreset (of size \(O(\ell k \epsilon ^{-d+2} \log n)\) for constant d) may be computed efficiently. This implies a PTAS for fair k-means clustering if k is a constant (we show this for the case of two colors and exact fairness preservation, as will be defined later).

  • We extend the coreset construction to the streaming setting. The main challenge to overcome is to reduce the dimension of our fair coresets in a streaming fashion. Our key ingredient to do so is a novel application of the random projections proposed by Cohen et al. [CEM+15], which may be of independent interest.

  • Then we describe a constant factor approximation algorithm for fair k-means clustering with two colors (and exact fairness preservation) based on the approach of Chierichetti et al. [CKLV17]. The general technique to obtain a constant factor algorithm is not new, but some adaptations are needed to apply it to the k-means case.

  • Finally, we extend the practical approximation algorithm k-means++ to the fair k-means setting and empirically evaluate the resulting algorithm and the coreset approach.

Additional Related Work. The research on fairness in machine learning follows two main directions. One is to find proper definitions of fairness. There are many different definitions available including statistical parity [TRT11], disparate impact [FFM+15], disparate mistreatment [ZVGRG17] and many others, e.g. [BHJ+17, HPS16]. For an overview see the recent survey [BHJ+17]. Furthermore, the effects of different definitions of fairness and their relations have been studied in [Cho16, CPF+17, KMR17]. A notion for individual fairness has been developed in [DHP+12]. The other direction is the development of algorithms for fair machine learning tasks. Here the goal is to develop new algorithms that solve learning tasks in such a way that the result satisfies a given fairness condition. Examples include [CKLV17, HPS16, ZVGRG17]. The closest result to our work is the above described paper by Chierichetti et al. [CKLV17].

Polynomial-time approximation schemes for k-means were developed, e.g., in [BFL16, FL11, KSS10], assuming that k is a constant. If d is a constant, then [CKM16, FRS16] give a PTAS. If k and d are arbitrary, then the problem is APX-hard [ACKS15, LSW17]. Lloyd’s algorithm [Llo57] is an old but very popular local search algorithm for the k-means problem which can converge to arbitrarily bad solutions. By using k-means++ seeding [AV07] as initialization, one can guarantee that the computed solution is an \(O(\log k)\)-approximation in expectation.

Chierichetti et al. [CKLV17] develop approximation algorithms for fair k-center and k-median with two colors. This approach was further improved by Backurs et al. [BIO+19], who proposed an algorithm to speed up parts of the computation. Rösner and Schmidt [RS18] extend the fairness definition to multiple colors and develop an approximation algorithm for k-center. Bercea et al. [BGK+18] develop an even more generalized notion and provide bicriteria approximations for fair variants of k-center, k-median and also k-means. For k-center, they provide a true 6-approximation. Very recently, Kleindessner et al. [KAM19] proposed a linear-time 2-approximation for fair k-center. This algorithm does not work in the streaming setting, but it is still faster than previously existing approaches for fair clustering.

The fair k-means problem can also be viewed as a k-means clustering problem with size constraints. Ding and Xu [DX15] showed how to compute an exponential sized list of candidate solutions for any of a large class of constrained clustering problems. Their result was improved by Bhattacharya et al. [BJK18].

In addition to the above cited coreset constructions, coresets for k-means in particular have also been studied empirically in various works, for example [AMR+12, AJM09, FGS+13, FS08, KMN+02]. Dimensionality reductions for k-means are proposed, for example, in [BZD10, BZMD15, CEM+15, FSS13]. In particular, [CEM+15, FSS13] show that any input to the k-means problem can be reduced to \(\lceil k/\epsilon \rceil \) dimensions by using the singular value decomposition (SVD) while distorting the cost function by no more than an \(\epsilon \)-factor. Furthermore, [CEM+15] also study random projection based dimensionality reductions. While SVD based reductions result in a smaller size, random projections are more versatile.

1.1 Preliminaries

We use \(P\subseteq \mathbb R^d\) to denote a set of n points in the d-dimensional vector space \(\mathbb R^d\). The Euclidean distance between two points \(p,q \in \mathbb R^d\) is denoted as \(\Vert p-q\Vert \). The goal of clustering is to find a partition of an input point set P into subsets of ’similar’ points called clusters. In k-means clustering we formulate the problem as an optimization problem. The integer k denotes the number of clusters. Each cluster has a center and the cost of a cluster is the sum of squared Euclidean distances to this center. Thus, the problem can be described as finding a set \(C= \{c_1,\dots ,c_k\}\) and corresponding clusters \(C_1,\dots ,C_k\) such that \(\mathrm{cost}(P,C)=\sum _{i=1}^k \sum _{p\in C_i} \Vert p-c_i\Vert ^2\) is minimized.

It is easy to see that in an optimal (non-fair) clustering each point p is contained in the set \(C_i\) such that \(\Vert p-c_i\Vert ^2\) is minimized. The above definition can easily be extended to point sets with non-negative integer weights by treating the weight of a point as its multiplicity. We denote the k-means cost of a set S weighted with w and center set C and corresponding clusters \(S_1,\ldots ,S_k\) by \(\mathrm{cost}_w(S,C)=\sum _{i=1}^k \sum _{p\in S_i} w(p)\Vert p-c_i\Vert ^2\). Finally, we recall that the best center for a cluster \(C_i\) is its centroid \(\mu (C_i):=\frac{1}{|C_i|}\sum _{p\in C_i}p\), which follows from the following proposition.

Proposition 1

Given a point set \(P\subset \mathbb {R}^d\) and a point \(c\in \mathbb {R}^d\), the 1-means cost of clustering P with c can be decomposed into \(\sum _{p\in P} \Vert p-c\Vert ^2 = \sum _{p\in P} \Vert p-\mu (P)\Vert ^2 + |P|\cdot \Vert \mu (P)-c\Vert ^2\).
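For concreteness, the following minimal NumPy sketch (the helper name kmeans_cost and the toy data are our own) evaluates the weighted k-means cost by assigning every point to its closest center, and numerically checks the decomposition of Proposition 1 for an arbitrary single center. Note that the closest-center assignment is exactly what a fair clustering is no longer free to use.

```python
import numpy as np

def kmeans_cost(P, C, w=None):
    """Weighted k-means cost: every point contributes w(p) times the squared
    distance to its closest center (the optimal non-fair assignment)."""
    P, C = np.asarray(P, float), np.asarray(C, float)
    w = np.ones(len(P)) if w is None else np.asarray(w, float)
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # squared point/center distances
    return float((w * d2.min(axis=1)).sum())

P = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
c = np.array([[3.0, 1.0]])                                    # one arbitrary center

# Proposition 1: 1-means cost w.r.t. c = cost w.r.t. the centroid + |P| * ||centroid - c||^2
mu = P.mean(axis=0)
lhs = kmeans_cost(P, c)
rhs = kmeans_cost(P, mu[None, :]) + len(P) * ((mu - c[0]) ** 2).sum()
print(np.isclose(lhs, rhs))                                   # True (both equal 290.0)
```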

Next, we give the coreset definition for k-means as introduced by Har-Peled and Mazumdar.

Definition 1

(Coreset [HM04]). A set \(S \subseteq \mathbb R^d\) together with non-negative weights \(w:S \rightarrow \mathbb {N}\) is a \((k,\epsilon )\)-coreset for a point set \(P\subseteq \mathbb R^d\) with respect to the k-means clustering problem, if for every set \(C \subseteq \mathbb R^d\) of k centers we have \((1-\epsilon ) \cdot \mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\epsilon ) \cdot \mathrm{cost}(P,C)\).

Fair Clustering. We extend the definition of fairness from [CKLV17] to sensitive attributes with multiple possible values. As in [CKLV17], we model the sensitive attribute by a color. Notice that we can model multiple sensitive attributes by assigning a different color to any combination of possible values of the sensitive attributes. We further assume that the sensitive attributes are not among the point coordinates. Thus, our input set is a set \(P \subseteq \mathbb R^d\) together with a coloring \(c: P \rightarrow \{1,\dots ,\ell \}\).

We define \(\xi (j)=|\{p\in P : c(p) = j\}| / |P|\) as the fraction that color j has in the input point set. Then we call a clustering \(C_1,\dots ,C_k\) \((\alpha ,\beta )\)-fair, \(0 < \alpha \le 1 \le \beta \), if for every cluster \(C_i\) and every color class \(j \in \{1,\dots ,\ell \}\) we have

$$ \alpha \cdot \xi (j) \le \frac{|\{p\in C_i : c(p) = j\}|}{|\{p\in C_i\}|} \le \beta \cdot \xi (j) . $$

For any set \(C=\{c_1,\dots ,c_k\}\) of k centers we define \( \mathrm{faircost}(P,C) \) to be the minimum of \(\sum _{i=1}^k \sum _{p\in C_i} \Vert p-c_i\Vert ^2\), where the minimum is taken over all \((\alpha ,\beta )\)-fair clusterings of P into \(C_1,\dots ,C_k\). The optimal \((\alpha ,\beta )\)-fair clustering \(C'\) is the one with minimal \(\mathrm{faircost}(P,C')\). Alternatively to \(\xi (j)\), we could demand that the fraction of all colors is (roughly) \(1/\ell \). However, notice that the best achievable fraction is \(\xi (j)\); thus \((\alpha ,\beta )\)-fairness is a strictly more general condition. It is also arguably more meaningful if the data set itself is heavily imbalanced. Consider an instance where the blue points outnumber the red points by a factor of 100. Then the overall balance of any clustering is at most 0.01. A (1, 1)-fair clustering is then a clustering in which all clusters achieve this best-possible ratio of 0.01.
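As a small illustration, the following sketch (the helper name is_fair and the toy input are our own) checks whether a given clustering is \((\alpha ,\beta )\)-fair by comparing each cluster's color fractions with the global fractions \(\xi (j)\).

```python
from collections import Counter

def is_fair(clusters, colors, alpha, beta):
    """clusters: list of lists of point indices; colors: dict index -> color label."""
    n = sum(len(cluster) for cluster in clusters)
    global_counts = Counter(colors[p] for cluster in clusters for p in cluster)
    for cluster in clusters:
        local = Counter(colors[p] for p in cluster)
        for j, gc in global_counts.items():
            frac = local.get(j, 0) / len(cluster)          # fraction of color j in this cluster
            if not (alpha * gc / n <= frac <= beta * gc / n):
                return False
    return True

colors = {0: 'red', 1: 'red', 2: 'blue', 3: 'blue'}
print(is_fair([[0, 2], [1, 3]], colors, alpha=1, beta=1))  # True: each cluster half red, half blue
print(is_fair([[0, 1], [2, 3]], colors, alpha=1, beta=1))  # False: monochromatic clusters
```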

2 Fair Coresets and How to Get Them

First, notice that the definition of coresets as given in Definition 1 does not translate well to the realm of fair clustering. Assume we replace \(\mathrm{cost}\) by \(\mathrm{faircost}\) in Definition 1. Now consider Fig. 1. We see two point sets \(P_1\) and \(P_2\) with eight points each, which both have an optimum cost of \(\varOmega (\varDelta )\). Replacing the four left and the four right points by one point induces an error of \(\mathcal {O}(\epsilon \varDelta )\), which is an \(\mathcal {O}(\epsilon )\)-fraction of the total cost. Thus, the depicted sets \(S_1\) and \(S_2\) are coresets. However, when we combine \(P_1\) and \(P_2\), then the optimum changes. The cost decreases dramatically to \(\mathcal {O}(\epsilon )\). For the new optimal solution, \(S_1\cup S_2\) still costs \(\varOmega (\epsilon \varDelta )\), and the inequality in Definition 1 is no longer satisfied.

Fig. 1. A simple example of non-composable coresets for the case of (1, 1)-fairness.

We thus take a detour: We define a stronger, more involved notion of coresets which regains the property of being composable. Then, we show that a special type of coreset construction for k-means can be used to compute coresets that satisfy this stronger notion. It is an interesting open question whether it is possible to design sampling based coreset constructions that satisfy our notion of coresets for fair clustering.

For our detour, we need the following generalization of the standard k-means cost. A coloring constraint for a set of k cluster centers \(C=\{c_1,\dots ,c_k\}\) and a set of \(\ell \) colors \(\{1,\dots ,\ell \}\) is a \(k\times \ell \) matrix K. Given a point set P with a coloring \(c:P \rightarrow \{1,\dots ,\ell \}\) we say that a partition of P into sets \(C_1,\dots , C_k\) satisfies K if \(|\{p\in C_i : c(p) =j\}| = K_{ij}\). The cost of the corresponding clustering is \(\displaystyle \sum _{i=1}^k \sum _{p\in C_i} \Vert p-c_i\Vert ^2\) as before. Now we define the color-k-means cost \(\mathrm{colcost}(P,K,C)\) to be the minimal cost of any clustering satisfying K. If no clustering satisfies K, \(\mathrm{colcost}(P,K,C):=\infty \).
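The following brute-force sketch (our own code, exponential in |P| and intended only to illustrate the definition on tiny instances) computes \(\mathrm{colcost}(P,K,C)\) by enumerating all assignments and keeping the cheapest one whose color counts match K.

```python
import itertools
import numpy as np

def colcost(P, colors, K, C):
    """Brute force: min cost over all assignments P -> C whose color counts match K."""
    P, C, K = np.asarray(P, float), np.asarray(C, float), np.asarray(K)
    k, ell = K.shape
    best = np.inf
    for assignment in itertools.product(range(k), repeat=len(P)):
        counts = np.zeros((k, ell), dtype=int)
        for p, i in enumerate(assignment):
            counts[i, colors[p]] += 1
        if np.array_equal(counts, K):
            cost = sum(((P[p] - C[i]) ** 2).sum() for p, i in enumerate(assignment))
            best = min(best, cost)
    return best                      # infinity if no assignment satisfies K

P = [[0.0], [1.0], [9.0], [10.0]]
colors = [0, 1, 0, 1]                # color index of each point
C = [[0.5], [9.5]]
K = [[1, 1], [1, 1]]                 # one point of each color per center
print(colcost(P, colors, K, C))      # 4 * 0.25 = 1.0
```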

Notice that we can prevent the bad example in Fig. 1 by using the color-k-means cost: If \(\mathrm{colcost}(P,K,C)\) is approximately preserved for the color constraints modeling that each cluster is either completely blue or completely red, then \(S_1\) and \(S_2\) are forbidden as coresets.

Definition 2

Let P be a point set with coloring \(c:P \rightarrow \{1,\dots ,\ell \}\). A non-negatively integer weighted set \(S \subseteq \mathbb R^d\) with a coloring \(c':S \rightarrow \{1,\dots ,\ell \}\) is a \((k,\epsilon )\)-coreset for P for the \((\alpha ,\beta )\)-fair k-means clustering problem, if for every set \(C \subseteq \mathbb R^d\) of k centers and every coloring constraint K we have

$$ \mathrm{colcost}_w(S,K,C) \in (1\pm \epsilon ) \cdot \mathrm{colcost}(P,K,C), $$

where in the computation of \(\mathrm{colcost}_w(S,K,C)\) we treat a point with weight w as w unweighted points, so that a point can be partially assigned to more than one cluster.

Definition 2 demands that the cost is approximated for any possible color constraint. This implies that it is approximated for those constraints we are interested in. Indeed, the fairness constraint can be modeled as a collection of color constraints. As an example for this, assume we have two colors and k is also two; furthermore, assume that the input is perfectly balanced, i.e., the number of points of both colors is n/2, and that we want this to be true for both clusters as well. Say we have a center set \(C=\{c_1,c_2\}\) and define \(K^i\) by \(K_{11}^i=i,K_{12}^i=i,K_{21}^i = \frac{n}{2}-i, K_{22}^i = \frac{n}{2}-i\), i.e., \(K^i\) assigns i points of each color to \(c_1\) and the rest to \(c_2\). The feasible assignments for the fairness constraint are exactly those assignments that are legal for exactly one of the color constraints \(K^i\), \(i\in \{0,\ldots ,\frac{n}{2}\}\). So since a coreset according to Definition 2 approximates \(\mathrm{colcost}(P,C,K^i)\) for all i, it in particular approximately preserves the cost of any fair clustering. This also works in the general case: We can model the \((\alpha ,\beta )\)-fair constraint as a collection of color constraints (and basically any other fairness notion based on the fraction of the colors in the clusters as well).

Proposition 2

Given a center set C, \(|C|=k\), the assignment restriction to be \((\alpha ,\beta )\)-fair can be modeled as a collection of coloring constraints.

Proof

Recall \(\xi (j)=\frac{|\{p\in P~:~c(p)=j\}|}{|P|}\). Let \(C=\{C_1,\ldots ,C_k\}\) be a clustering and let K be the coloring constraint matrix induced by C. We observe that the ith row sums up to \(|C_i|\) and the jth column sums up to \(|\{p\in P~:~c(p)=j\}|\). Then C is \((\alpha ,\beta )\)-fair if and only if \(\alpha \cdot \xi (j)\le \frac{|\{p\in C_i~:~c(p)=j\}|}{|C_i|}=\frac{K_{i,j}}{\sum _{h=1}^k K_{i,h}} \le \beta \cdot \xi (j)\) for all \(i\in \{1,\ldots ,k\}\) and \(j\in \{1,\ldots ,\ell \}\).

The main advantage of Definition 2 is that it satisfies composability. The main idea is that for any coloring constraint K, any clustering satisfying K induces specific color constraints \(K_1\) and \(K_2\) for \(P_1\) and \(P_2\); and for these, the coresets \(S_1\) and \(S_2\) also have to satisfy the coreset property. We can thus prove the coreset property for S and K by composing the guarantees for \(S_1\) and \(S_2\) on \(K_1\) and \(K_2\).

Lemma 1 (Composability)

Let \(P_1, P_2 \subset \mathbb {R}^d\) be point sets. Let \(S_1\), \(w_1\), \(c_1\) be a \((k,\epsilon )\)-coreset for \(P_1\) and let \(S_2\), \(w_2\), \(c_2\) be a \((k,\epsilon )\)-coreset for \(P_2\) (both satisfying Definition 2). Let \(S=S_1\cup S_2\) and concatenate \(w_1, w_2\) and \(c_1, c_2\) accordingly to obtain weights w and colors c for S. Then S, w, c is a \((k,\epsilon )\)-coreset for \(P=P_1 \cup P_2\) satisfying Definition 2.

Proof

Let \(C = \{c_1,\ldots ,c_k\} \subset \mathbb {R}^d\) be an arbitrary set of centers, and let \(K\in \mathbb {N}^{k \times \ell }\) be an arbitrary coloring constraint for C. We want to show that

$$ \mathrm{colcost}_w(S,K,C) \in (1\pm \epsilon )\mathrm{colcost}(P,K,C). $$

Let \(\gamma : P \rightarrow C\) be an assignment that minimizes the assignment cost among all assignments that satisfy K, implying that \(\mathrm{colcost}(P,K,C) = \sum _{p \in P} ||p-\gamma (p)||^2\). Since \(\gamma \) satisfies K, the number of points of color j assigned to each center \(c_i \in C\) is exactly \(K_{ij}\). We split K into two matrices \(K_1\) and \(K_2\) with \(K=K_1+K_2\) by counting the number of points of each color at each center which belong to \(P_1\) and \(P_2\), respectively. In the same fashion, we define two mappings \(\gamma _1 : P_1 \rightarrow C\) and \(\gamma _2: P_2 \rightarrow C\) with \(\gamma _1(p)=\gamma (p)\) for all \(p \in P_1\) and \(\gamma _2(p)=\gamma (p)\) for all \(p \in P_2\).

Now we argue that \(\mathrm{colcost}(P,C,K)=\mathrm{colcost}(P_1,C,K_1) + \mathrm{colcost}(P_2,C,K_2)\). Firstly, we observe that \(\mathrm{colcost}(P,C,K)\ge \mathrm{colcost}(P_1,C,K_1) + \mathrm{colcost}(P_2,C,K_2)\), since \(\gamma _1\) and \(\gamma _2\) are legal assignments for the color constraints \(K_1\) and \(K_2\), respectively, and together they induce exactly the same point-wise cost as \(\gamma \). Secondly, we argue that \(\gamma _1\) and \(\gamma _2\) are in fact optimal for \(K_1\) and \(K_2\), so that this inequality is tight. Assume there were an assignment \(\gamma _1'\) satisfying \(K_1\) with \(\sum _{p \in P_1} ||p-\gamma _1'(p)||^2 < \sum _{p \in P_1} ||p-\gamma _1(p)||^2\). Then we could immediately adjust \(\gamma \) to be identical to \(\gamma _1'\) on the points in \(P_1\) instead of \(\gamma _1\); the resulting assignment would still satisfy K and would have smaller cost, a contradiction to the optimality of \(\gamma \). The same argument holds for \(\gamma _2\). Thus, \(\mathrm{colcost}(P,C,K)=\mathrm{colcost}(P_1,C,K_1) + \mathrm{colcost}(P_2,C,K_2)\) is indeed true.

Now since \(S_1\), \(w_1\), \(c_1\) is a coreset for \(P_1\) and \(S_2\), \(w_2\), \(c_2\) is a coreset for \(P_2\), they have to approximate \(\mathrm{colcost}(P_1,C,K_1)\) and \(\mathrm{colcost}(P_2,C,K_2)\) well. We get from this that

$$\begin{aligned}&\mathrm{colcost}_w(S_1,C,K_1) + \mathrm{colcost}_w(S_2,C,K_2) \\&\in (1\pm \epsilon )\cdot \mathrm{colcost}(P_1,C,K_1) + (1\pm \epsilon )\cdot \mathrm{colcost}(P_2,C,K_2) \\&\in (1\pm \epsilon ) \cdot \mathrm{colcost}(P,C,K). \end{aligned}$$

Observe that \(\mathrm{colcost}_w(S,C,K) \le \mathrm{colcost}_w(S_1,C,K_1) + \mathrm{colcost}_w(S_2,C,K_2)\) since we can concatenate the optimal assignments for \(S_1\) and \(S_2\) to get an assignment for S. Thus, \(\mathrm{colcost}_w(S,C,K) \le (1+\epsilon )\cdot \mathrm{colcost}(P,C,K)\). It remains to show that \(\mathrm{colcost}_w(S,C,K) \ge (1-\epsilon )\cdot \mathrm{colcost}(P,C,K)\).

Let \(\gamma ':S\rightarrow C\) be an assignment that satisfies K and has cost \(\mathrm{colcost}_w(S,C,K)\) (for simplicity, we treat S as if it were expanded by adding multiple copies of each weighted point; recall that we allow weights to be split up for S). Let \(\gamma _1':P_1 \rightarrow C\) and \(\gamma _2':P_2\rightarrow C\) be the result of translating \(\gamma '\) to \(P_1\) and \(P_2\), and split K into \(K_1'\) and \(K_2'\) according to \(\gamma '\) as we did above. Then \(\mathrm{colcost}_w(S,C,K)=\mathrm{colcost}_w(S_1,C,K_1')+\mathrm{colcost}_w(S_2,C,K_2')\) by the same argumentation as above. Furthermore,

$$\begin{aligned} \mathrm{colcost}_w(S_1,C,K_1')+\mathrm{colcost}_w(S_2,C,K_2')&\ge (1-\epsilon )\cdot \mathrm{colcost}(P_1,C,K_1') + (1-\epsilon )\cdot \mathrm{colcost}(P_2,C,K_2') \\&\ge (1-\epsilon ) \cdot \mathrm{colcost}(P,C,K), \end{aligned}$$

where the first inequality holds by the coreset property and the second is true since we can also use \(\gamma '\) to cluster P, implying the inequality \( \mathrm{colcost}(P,C,K) \le \mathrm{colcost}(P_1,C,K_1')+\mathrm{colcost}(P_2,C,K_2')\). Together with the equation above, this gives \(\mathrm{colcost}_w(S,C,K) \ge (1-\epsilon )\cdot \mathrm{colcost}(P,C,K)\). That completes the proof.

We have thus achieved our goal of finding a suitable definition of coresets for fair clustering. Now the question is whether we can actually compute sets which satisfy the rather strong Definition 2. Luckily, we can show that a special class of coreset constructions for k-means can be adjusted to work for our purpose. A coreset construction for k-means is an algorithm that takes a point set P as input and computes a summary S with integer weights that satisfies Definition 1.

We say that a coreset construction is movement-based if

  • all weights \(w(p), p \in S\) are integers

  • there exists a mapping \(\pi : P \rightarrow S\) with \(|\pi ^{-1}(p)| = w(p)\) for all \(p \in S\) which satisfies that

    $$ \sum \limits _{x \in P} || x - \pi (x) ||^2 \le \dfrac{\varepsilon ^2}{16} \cdot \text {OPT}_k,$$

    where \(\text {OPT}_k=\min \limits _{C\subset \mathbb {R}^d,|C|=k} \mathrm{cost}(P,C)\).

Movement-based coreset constructions compute a coreset by ‘moving’ points to common places at little cost, and then replacing heaps of points by weighted points. Examples for movement-based coreset constructions can be found in [FGS+13, FS05, HM04]. Now the crucial observation is that we can turn any movement-based coreset construction for k-means (say, a black-box algorithm ALG) into an algorithm that computes coresets for fair k-means satisfying Definition 2. The main idea is to run ALG to move all points in P to common locations, and then to replace all points of the same color at the same location by one coreset point. This may result in up to \(\ell \) points for every location, i.e., the final coreset result may be larger than its colorless counterpart by a factor of at most \(\ell \). The rest of the proof then shows that Definition 2 is indeed true, following the lines of movement-based coreset construction proofs.
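The following sketch illustrates the idea behind ALG' on a hypothetical stand-in for the black box ALG: for simplicity, every point is snapped to the center of a fixed grid cell (a real movement-based construction chooses the movement adaptively so that the total squared movement is at most \((\varepsilon^2/16)\cdot \mathrm{OPT}_k\)), and one weighted point is kept per (location, color) pair. All helper names are ours.

```python
import numpy as np
from collections import defaultdict

def fair_coreset_via_grid(P, colors, cell_width):
    """Sketch of ALG': move every point to the center of its grid cell (a simplistic
    stand-in for the mapping pi produced by a movement-based ALG) and keep one
    weighted point per (location, color) pair."""
    P = np.asarray(P, float)
    groups = defaultdict(int)
    for p, col in zip(P, colors):
        loc = tuple(np.floor(p / cell_width) * cell_width + cell_width / 2)  # pi(p)
        groups[(loc, int(col))] += 1
    S = np.array([loc for loc, _ in groups])
    S_colors = np.array([col for _, col in groups])
    weights = np.array(list(groups.values()))
    return S, S_colors, weights

rng = np.random.default_rng(0)
P = rng.random((1000, 2))
colors = rng.integers(0, 2, size=1000)
S, S_colors, w = fair_coreset_via_grid(P, colors, cell_width=0.1)
print(len(S), w.sum())   # at most 2 points per non-empty cell, total weight 1000
```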

Theorem 1

Let ALG be a movement-based coreset construction for the k-means problem. Assume that given the input \(P \subset \mathbb {R}^d\), \(k \in \mathbb {N}\) and \(\epsilon \in (0,1)\), the size of the coreset that ALG computes is bounded by \(f(|P|,d,k,\epsilon )\). Then we can construct an algorithm \(ALG'\) which computes a set \(S'\) that satisfies Definition 2, and the size of this set is bounded by \(\ell \cdot f(|P|,d,k,\epsilon )\), where \(\ell \) is the number of colors.

Proof

For any P, ALG gives us a set S and a non-negative weight function w such that Definition 1 is true, i.e.,

$$\begin{aligned} (1-\epsilon )\mathrm{cost}(P,C) \le \mathrm{cost}_w(S,C) \le (1+\epsilon )\mathrm{cost}(P,C) \end{aligned}$$
(1)

holds for all sets of centers C with \(|C|=k\). Since ALG is movement-based, the weights are integers, and there exists a mapping \(\pi : P \rightarrow S\), such that at most w(p) points from P are mapped to any point \(p \in S\), and such that

$$\begin{aligned} \sum \limits _{x \in P} || x - \pi (x) ||^2 \le \dfrac{\varepsilon ^2}{16} \cdot \text {OPT}_k \end{aligned}$$
(2)

is true. Statement (2) is stronger than (1), and we will only need (2) for our proof. We will, however, need the mapping \(\pi \) to construct \(ALG'\). Usually, the mapping will be at least implicitly computed by ALG. If not or if outputting this information from ALG is cumbersome, we do the following. We assign every point in P to its closest point in S. The resulting mapping has to satisfy 2, since the distance of any point to its closest point in S can only be smaller than in any given assignment. We may now assign more than w(p) points to S. We resolve this by simply changing the weights of the points in S to match our mapping. Since we now have S, w and \(\pi \) satisfying (2), we can proceed as if ALG had given a mapping to us.

Now we do what movement-based coreset constructions do internally as well: We consolidate all points that share the same location. However, since they may not all be of the same color, we possibly put multiple (at most \(\ell \)) copies of any point in S into our coreset \(S'\). More precisely, for every \(p \in S\), we count the number \(n_{p,i}\) of points of color i. If \(n_{p,i}\) is at least one, then we put p into \(S'\) with color i and weight \(n_{p,i}\). The size of \(S'\) is thus at most \(\ell \cdot f(|P|,d,k,\epsilon )\).

The proof that \(S'\) satisfies Definition 2 is now close to the proof that movement-based coreset constructions work. To execute it, we imagine \(S'\) in its expanded form, where every point p is replaced by \(n_{p,i}\) points of color i. We call this expanded version \(P'\). Notice that \(\mathrm{cost}(P',C)=\mathrm{cost}_w(S',C)\) for all \(C \subset \mathbb {R}^d\). We only need \(P'\) for the analysis. Notice that \(\pi \) can now be interpreted as a bijective mapping between P and \(P'\), and this is how we will use it.

Let C be an arbitrary center set with \(|C|=k\) and let K be an arbitrary coloring constraint. We want to show that \((1-\epsilon ) \cdot \mathrm{colcost}(P,K,C) \le \mathrm{colcost}(P',K,C) \le (1+\epsilon ) \cdot \mathrm{colcost}(P,K,C)\). If no assignment satisfies K, then \(\mathrm{colcost}(P,K,C)\) is infinity, and there is nothing to show. Otherwise, fix an arbitrary optimal assignment \(\gamma :P\rightarrow C\) of the points in P to C among all assignments that satisfy K. Notice that \(\gamma \) and \(\pi \) are different assignments with different purposes; \(\gamma \) assigning a point in P to its center, and \(\pi \) assigning each point in P to its moved version in \(P'\).

We let \(v_c(x) := ||x-\gamma (x)||\) be the distance between \(x\in P\) and the center it is assigned to. Let \(v_c\) be the |P|-dimensional vector consisting of all \(v_c(x)\) (in arbitrary order). Then we have

$$\begin{aligned} \mathrm{colcost}(P,C,K) = \sum _{x \in P} ||x-\gamma (x)||^2 = \sum _{x \in P} v_c(x)^2 = ||v_c||^2. \end{aligned}$$

Furthermore, we set \(v_p(x) = ||\pi (x) - x||\) and let \(v_p\) be the |P|-dimensional vector of all \(v_p(x)\) (ordered in the same way as \(v_c\)). We have \(\sum _{x \in P} ||\pi (x)-x||^2 \le \dfrac{\varepsilon ^2}{16} \cdot OPT_k\) by our preconditions.

Now we want to find an upper bound on \(\mathrm{colcost}(P',C,K)\). Since we only need an upper bound, we can use \(\gamma \) for assigning the points in \(P'\) to C. We already know that \(\gamma \) satisfies K for the points in P; and the points in \(P'\) are only moved versions of the points in P. We use this and then apply the triangle inequality:

$$\begin{aligned} \mathrm{colcost}(P',C,K) \le&\sum _{y \in P'} ||y-\gamma (\pi ^{-1}(y))||^2= \sum _{x \in P} ||\gamma (x)-\pi (x)||^2\\ \le&\sum _{x \in P} (||\gamma (x)-x||+||x-\pi (x)||)^2\\ =&\sum _{x \in P} (v_c(x)+v_p(x))^2 = ||v_c+v_p||^2. \end{aligned}$$

Now we can apply the triangle inequality to the vector \(v_c+v_p\) to get \(||v_c+v_p|| \le ||v_c||+||v_p|| \le \sqrt{\mathrm{colcost}(P,C,K)} + \sqrt{\frac{\varepsilon ^2}{16}\cdot OPT_k}\). So we know that

$$\begin{aligned} \mathrm{colcost}(P',C,K)\le ||v_c+v_p||^2&\le \mathrm{colcost}(P,C,K) + \frac{\varepsilon ^2}{16} \cdot OPT_k \\&\quad + 2 \sqrt{\mathrm{colcost}(P,C,K)}\cdot \sqrt{\frac{\varepsilon ^2}{16}\cdot OPT_k}\\&\le \mathrm{colcost}(P,C,K) + \frac{\varepsilon ^2}{16} \cdot OPT_k \\&\quad + \frac{\varepsilon }{2} \cdot \mathrm{colcost}(P,C,K)\\&< (1+\epsilon ) \cdot \mathrm{colcost}(P,C,K). \end{aligned}$$

To obtain that also \(\mathrm{colcost}(P,C,K) \le (1+\epsilon )\cdot \mathrm{colcost}(P',C,K)\), we observe that the above argumentation is symmetric in P and \(P'\). No argument used that P is the original point set and \(P'\) is the moved version. So we exchange the roles of P and \(P'\) (we repeat the whole argumentation starting at the point where we fix the center set C, so for example, \(\gamma \) is now an optimal assignment of \(P'\) to C) to complete the proof.

We can now apply Theorem 1. Movement-based constructions include the original construction due to Har-Peled and Mazumdar [HM04] as well as the practically more efficient algorithm BICO [FGS+13]. For more information on the idea of movement-based coreset constructions, see Sect. 3.1 in the survey [MS18]. For BICO in particular, Lemma 5.4.3 in [Sch14] gives a proof that the construction is movement-based. Using Theorem 1 and Corollary 1 from [FGS+13], we then obtain the following.

Corollary 1

There is an algorithm in the insertion-only streaming model which computes a \((k,\epsilon )\)-coreset for the fair k-means problem according to Definition 2. The size of the coreset and the storage requirement of the algorithm is \(m\in O(\ell \cdot k \cdot \log n \cdot \epsilon ^{-d+2})\), where \(\ell \) is the number of colors in the input, and where d is assumed to be a constant.

The running time of the algorithm is \(O(N(m)(n+\log (n\varDelta )m))\), where \(\varDelta \) is the spread of the input points and N(m) is the time to compute an (approximate) nearest neighbor.

3 Streaming PTAS and Constant-Factor Approximations

In this section, we give a constant-factor approximation algorithm for fair k-means with two colors, assuming that exactly half of the input points are colored with each color, and demanding that this is true for all clusters in the clustering as well. We call this special case exactly balanced. We also show how to obtain a PTAS for the case that k is a constant.

We restrict to two colors since for multiple colors, no true approximation algorithms are known even for the related case of k-median, and there is indication that this problem might be very difficult (it is related to solving capacitated k-median/k-means, a notoriously difficult research problem). Notice that the coreset approach works for arbitrary \((\alpha ,\beta )\)-fair k-means.

Fairlet Approach. For two colors, Chierichetti et al. [CKLV17] outline how to transfer approximation algorithms for clustering to the setting of fair clustering, but derive the algorithms only for k-center and k-median. The idea is to first compute a coarse clustering where the microclusters are called fairlets, and then to cluster representatives of the fairlets to obtain the final clustering. The following algorithm extends their ideas to compute fairlets for k-means.

[Algorithm 1: fairlet computation for fair k-means]

The idea of the algorithm is the following. In any optimal solution, the points can be paired into tuples of a blue and a red point which belong to the same optimal cluster. Clustering the \(n/2 \ge k\) tuples with n/2 centers cannot be more expensive than the cost of the actual optimal k-means solution. Thus, we would ideally like to know the tuples and partition them into clusters. Since we cannot know the tuples, we instead compute a minimum cost perfect matching between the red and blue points, where the weight of an edge is the 1-means cost of clustering the two points with an optimal center (this is always half their squared distance). The matching gives us tuples; these tuples are the fairlets. The centroid of each fairlet now serves as its representative. The following theorem shows that clustering the representatives yields a good solution for the original problem.
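A minimal sketch of this fairlet computation, assuming SciPy is available (linear_sum_assignment computes a min-cost perfect matching) and using our own helper name fairlets; the final clustering is then obtained by running a k-means algorithm on the returned representatives, each carrying weight 2.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fairlets(red, blue):
    """Pair each red point with a blue point by a min-cost perfect matching; the 1-means
    cost of a pair is half its squared distance, and its representative is its centroid."""
    red, blue = np.asarray(red, float), np.asarray(blue, float)
    cost = 0.5 * ((red[:, None, :] - blue[None, :, :]) ** 2).sum(axis=2)
    r_idx, b_idx = linear_sum_assignment(cost)       # min-cost perfect matching
    reps = (red[r_idx] + blue[b_idx]) / 2            # centroids of the fairlets
    pairs = list(zip(r_idx, b_idx))
    return reps, pairs, cost[r_idx, b_idx].sum()

red = np.array([[0.0, 0.0], [10.0, 0.0]])
blue = np.array([[10.0, 1.0], [0.0, 1.0]])
reps, pairs, fairlet_cost = fairlets(red, blue)
print(pairs, fairlet_cost)    # nearby points are paired: (0, 1) and (1, 0), total cost 1.0
```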

Theorem 2

There is an algorithm that achieves the following. For any \(P \subset \mathbb {R}^d\) which contains |P|/2 blue and |P|/2 red points, it computes a set of representatives \(F \subset \mathbb {R}^d\) of size |P|/2, such that an \(\alpha \)-approximate solution for the (plain) k-means problem on F yields a \((5.5\alpha +1)\)-approximation for the fair k-means problem on P.

By combining Theorem 2 with a constant-factor approximation for k-means (the currently best being the one proposed by Ahmadian et al. [ANSW17]), we get the following corollary.

Corollary 2

There is a \(\mathcal {O}(1)\)-approximation for exactly balanced fair k-means with two colors.

Similarly to the fairlet computation, the problem of finding a fair assignment, i.e., a cost-wise optimal assignment of points to given centers which is fair, can be modeled as a matching problem. This algorithm, as well as algorithms for fairlet computation and fair assignment for weighted points, are described in the full version of the paper [SSS18].
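Under the exactly balanced two-color assumption, this matching formulation can be sketched as follows (our own helper name fair_assignment, again using SciPy's linear_sum_assignment): the cost of pairing a red with a blue point is the cheapest cost of placing both into the same cluster, and a min-cost perfect matching then induces an optimal fair assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def fair_assignment(red, blue, centers):
    """Exactly balanced two-color case: pair each red point with a blue point and
    send every pair to its cheapest common center."""
    red, blue, C = (np.asarray(x, float) for x in (red, blue, centers))
    d_red = ((red[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)     # |red| x k
    d_blue = ((blue[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)   # |blue| x k
    pair_all = d_red[:, None, :] + d_blue[None, :, :]                # |red| x |blue| x k
    pair_cost = pair_all.min(axis=2)
    pair_center = pair_all.argmin(axis=2)
    r_idx, b_idx = linear_sum_assignment(pair_cost)                  # min-cost perfect matching
    assign = [(int(i), int(j), int(pair_center[i, j])) for i, j in zip(r_idx, b_idx)]
    return assign, float(pair_cost[r_idx, b_idx].sum())

red = np.array([[0.0, 0.0], [10.0, 0.0]])
blue = np.array([[0.0, 1.0], [10.0, 1.0]])
centers = np.array([[0.0, 0.5], [10.0, 0.5]])
print(fair_assignment(red, blue, centers))   # each center receives one red and one blue point
```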

PTAS. We next give an algorithm to efficiently compute a \((1+\epsilon )\)-approximation if k is a constant and not part of the input by extending known ideas to the fair k-means++ case. The main additional step is to use an optimal fair assignment problem algorithm.

We remark that the running time of the algorithm stated below is worse than that of [BJK18, DX15]. However, it can be easily adapted to work with weighted inputs. While we believe that in principle adapting the algorithms in [BJK18, DX15] to the weighted case is possible, we preferred to stick to the simpler, slightly worse result to keep the paper concise.

Theorem 3

Let \(P\subseteq \mathbb R^d\) be a weighted point set of n points such that half of the point weight is red and the other half is blue. Then we can compute a \((1+\epsilon )\)-approximation to the fair k-means problem in time \(n^{O(k/\epsilon )}\).

We use the well-known fact that every cluster has a subset of \(O(1/\epsilon )\) points such that their centroid is a \((1+\epsilon )\)-approximation to the centroid of the cluster. We use the following lemma by Inaba et al.

Lemma 2

[IKI94] Let \(P \subseteq \mathbb R^d\) be a set of points and let S be a subset of m points drawn independently and uniformly at random from P. Let \(c(P)= \frac{1}{|P|} \sum _{p\in P} p\) and \(c(S)= \frac{1}{|S|} \sum _{p\in S} p\) be the centroids of P and S. Then with probability at least \(1-\delta \) we have

$$ \sum _{p\in P} \Vert p-c(S)\Vert _2^2 \le \left(1+\frac{1}{\delta m}\right) \cdot \sum _{p\in P} \Vert p-c(P)\Vert _2^2 . $$

It immediately follows that for \(m=\lceil 2/\epsilon \rceil \) there exists a subset S of m points that satisfies the above inequality. The result can immediately be extended to the weighted case. This implies that Algorithm 2 gives a PTAS.

[Algorithm 2: PTAS for fair k-means with constant k]

The running time of the algorithm is \(n^{O(k/\epsilon )}\) since line two can be implemented in \(k^{O(k/\epsilon )}\) time and the partition problem can be solved in \(n^{O(1)}\) time. This implies the theorem.
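Since the pseudocode of Algorithm 2 is given as a figure, the following brute-force sketch only illustrates the enumeration idea behind it (it is not the paper's exact algorithm): it tries all k-tuples of subsets of size \(m=\lceil 2/\epsilon \rceil \), uses their centroids as candidate centers, and evaluates each candidate set with the fair_assignment sketch from above, which must be in scope.

```python
import itertools
import numpy as np

def ptas_fair_kmeans(red, blue, k, eps):
    """Enumerate all k-tuples of m-subsets, take their centroids as candidate centers,
    and keep the candidate set with the cheapest optimal fair assignment."""
    P = np.vstack([red, blue])
    m = min(int(np.ceil(2 / eps)), len(P))
    subsets = list(itertools.combinations(range(len(P)), m))
    best_cost, best_centers = np.inf, None
    for choice in itertools.product(subsets, repeat=k):        # n^{O(k/eps)} candidates
        centers = np.array([P[list(s)].mean(axis=0) for s in choice])
        _, cost = fair_assignment(red, blue, centers)          # helper defined earlier
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers, best_cost

# usage on a tiny instance (k = 2, eps = 1 gives m = 2):
# centers, cost = ptas_fair_kmeans(red, blue, k=2, eps=1.0)
```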

Streaming PTAS. We would like to extend the PTAS to the streaming setting, using our coreset. Applying Corollary 1 directly incurs an exponential dependency on the dimension d. The standard way to avoid this is to project the data onto the first \(k/\varepsilon \) principal components, see [CEM+15, FSS13], and then to use a technique called merge-and-reduce. Unfortunately, the merge-and-reduce technique requires a rescaling of \(\varepsilon \) by a factor of \(\log n\). In other words, the resulting streaming coreset will have a size of \(\left( \frac{\log n}{\varepsilon }\right)^{k\cdot \frac{\log n}{\varepsilon }}\), which is even larger than the input size. To avoid this, we show how to make use of oblivious random projections to reduce the dependency on the dimension for movement-based coreset constructions, and also recover a \((1+\varepsilon )\)-approximate solution. This is a novel combination of coreset techniques with a sketching technique due to Cohen et al. [CEM+15], which may be of independent interest.

We review some of the algebraic properties. Given a matrix \(A\in \mathbb {R}^{n\times d}\), we define the Frobenius norm as \(\Vert A\Vert _F = \sqrt{\sum _{i=1}^n \Vert A_{i*}\Vert ^2}\), where \(A_{i*}\) is the ith row of A. For k-means, we will consider the rows of A to be our input points. The spectral norm \(\Vert A\Vert _2\) is the largest singular value of A.

Let us now consider the n-vector \(x=\mathbf {1}\cdot \frac{1}{\sqrt{n}}\). x is a unit vector, i.e. \(\Vert x\Vert _2 = 1\), and moreover, due to Proposition 1, the rows of \(xx^TA\) are \(\mu (A)^T\). Hence \(\Vert A-xx^TA\Vert _F^2\) is the optimal 1-means cost of A. This may be extended to an arbitrary number of centers by considering the n by k clustering matrix X with

$$X_{i,j}=\begin{cases} \sqrt{1/|C_j|} & \text {if } A_{i*}\in \text {cluster } C_j,\\ 0 & \text {otherwise.}\end{cases}$$

\(XX^T\) is an orthogonal projection matrix, and \(\Vert A-XX^TA\Vert _F^2\) corresponds to the k-means cost of the clusters \(C_1,\ldots ,C_k\). If we lift the clustering constraints on X and merely assume X to be orthogonal and of rank k, minimizing \(\Vert A-XX^TA\Vert _F^2\) becomes the rank-k approximation problem. The connection between rank-k approximation and k-means is well established, see for example [BZMD15, DFK+04, FSS13]. Specifically, we aim for the following guarantee.
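Before turning to that guarantee, a small NumPy check (with our own toy data) makes the correspondence between \(\Vert A-XX^TA\Vert _F^2\) and the k-means cost concrete.

```python
import numpy as np

A = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])    # rows of A are the points
clusters = [[0, 1], [2]]                                # a 2-clustering of the rows

# clustering matrix X as defined above: X[i, j] = 1/sqrt(|C_j|) if row i lies in cluster C_j
X = np.zeros((len(A), len(clusters)))
for j, Cj in enumerate(clusters):
    X[Cj, j] = 1.0 / np.sqrt(len(Cj))

projection_cost = np.linalg.norm(A - X @ X.T @ A, 'fro') ** 2
kmeans_cost = sum(((A[Cj] - A[Cj].mean(axis=0)) ** 2).sum() for Cj in clusters)
print(projection_cost, kmeans_cost)                     # both equal 2.0
```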

Definition 3

(Definition 1 of [CEM+15]). \(\tilde{A}\in \mathbb {R}^{n\times d'}\) is a rank-k projection-cost preserving sketch of \(A\in \mathbb {R}^{n\times d}\) with error \(0<\varepsilon <1\) if, for all rank-k orthogonal projection matrices \(XX^T\in \mathbb {R}^{n\times n}\),

$$\begin{aligned} \Vert \tilde{A}-XX^T\tilde{A}\Vert _F^2 + c \in (1\pm \varepsilon )\cdot \Vert A-XX^TA\Vert _F^2, \end{aligned}$$

for some fixed non-negative constant c that may depend on A and \(\tilde{A}\), but is independent of \(XX^T\).

Our choice of \(\tilde{A}\) is AS, where S is a scaled Rademacher matrix of target dimension \(m\in O(k/\varepsilon ^2)\), see Theorem 12 of [CEM+15]. In this case \(c=0\).

[Algorithm 3: streaming fair coreset construction via oblivious random projections]

We combine oblivious sketches with movement-based coreset constructions in Algorithm 3. The general idea is to run the coreset construction on the rows of AS (which are lower-dimensional points). Since AS has n rows and only \(O(k/\varepsilon ^2)\) columns, this has the effect that we can replace d in the coreset size by \(O(k/\varepsilon ^2)\). Furthermore, we show that by storing additional information we can still compute an approximate solution for A (the challenge is that although AS approximately preserves clustering costs, the cluster centers that achieve this cost lie in a different space and cannot be used directly as a solution for A).
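As a small illustration of the sketching step, the following snippet (our own toy parameters; the precise target dimension and guarantee are those of Theorem 12 in [CEM+15]) forms AS with a scaled Rademacher matrix and hands the lower-dimensional rows to the coreset construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k, eps = 1000, 50, 5, 0.5
A = rng.normal(size=(n, d))                              # rows of A are the input points

m = int(np.ceil(k / eps ** 2))                           # target dimension, O(k / eps^2)
S = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)    # scaled Rademacher matrix
A_sketch = A @ S                                         # rows of AS: lower-dimensional points

# The movement-based coreset construction (and the fair clustering algorithm) is then run
# on A_sketch; centers for the original A are recovered by averaging the original rows
# assigned to each cluster.
print(A_sketch.shape)                                    # (1000, 20)
```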

Theorem 4

Let \(0<\varepsilon < 1/2\). Assume there is a streaming algorithm ALG that receives the rows of a matrix \(A\in \mathbb {R}^{n \times d}\) and maintains a \((k,\epsilon )\)-coreset T with the following property: We can replace weighted points in T by a corresponding number of copies to obtain a matrix \(A'\) such that \( \sum _{i=1}^n \Vert A_{i*}-A'_{i*}\Vert ^2 \le \frac{\varepsilon ^2}{16} \cdot \text {OPT}. \) Furthermore, assume that ALG uses \(f(k,\varepsilon ,d,\log n)\) space. If we use ALG in Step 2 of Algorithm 3, then Algorithm 3 will use \(f(k,\varepsilon /25,c' \cdot (k/\varepsilon )^2,\log n)\cdot d + O(kd/\epsilon ^2)\) space to compute a set of centers C with

$$ \mathrm{faircost}(P,C) \le \gamma (1+\varepsilon ) \cdot \text {OPT} $$

where \(\text {OPT}\) is the cost of an optimal solution for A, \(\gamma \) is the approximation factor of the fair clustering algorithm employed within Algorithm 3, and \(c'>0\) is a constant such that the guarantees of Theorem 12 from [CEM+15] are satisfied.

Proof

Let X be the optimal clustering matrix on input \(A'S\) and Y be the optimal clustering matrix for input A. Let Z be the clustering matrix returned by our \((\alpha ,\beta )\)-fair approximation algorithm on input \(A'S\) (or, equivalently, on input T). Let \(\varepsilon ' = \varepsilon /25\). Since we are using a \(\gamma \)-approximation algorithm, we know that \(\Vert ZZ^TA'S - A'S\Vert _F^2 \le \gamma \cdot \Vert XX^TA'S-A'S\Vert _F^2\). We also observe that \(\Vert ZZ^TA'S - ZZ^TAS\Vert _F \le \Vert ZZ^T\Vert _2 \Vert A'S-AS\Vert _F = \Vert A'S -AS\Vert _F\) for an orthogonal projection matrix \(ZZ^T\). Furthermore, we will use that \(\Vert XX^TA'S - A'S\Vert _F \le \Vert XX^T A'S - XX^T AS\Vert _F + \Vert XX^TAS - AS\Vert _F + \Vert A'S-AS\Vert _F\) and the fact that the spectral norm and the Frobenius norm are consistent, i.e., they satisfy \(\Vert AB\Vert _F \le \Vert A\Vert _2\Vert B\Vert _F\). We obtain

$$\begin{aligned} (1-\varepsilon ')\cdot \Vert ZZ^TA-A\Vert _F^2&\le \Vert ZZ^TAS-AS\Vert _F^2 \\&\le \left( \Vert ZZ^TA'S-ZZ^TAS\Vert _F +\Vert ZZ^TA'S-AS\Vert _F \right) ^2 \\&\le \left( \Vert ZZ^TA'S - ZZ^TAS\Vert _F + \Vert ZZ^TA'S - A'S\Vert _F + \Vert A'S-AS\Vert _F\right) ^2 \\&\le \left( 2\Vert A'S-AS\Vert _F + \sqrt{\gamma } \Vert XX^TA'S-A'S\Vert _F\right) ^2 \\&\le \left( (2+\sqrt{\gamma }) \Vert A'S-AS\Vert _F + \sqrt{\gamma } \left( \Vert XX^T A'S - XX^T AS\Vert _F + \Vert XX^TAS - AS\Vert _F\right) \right) ^2 \\&\le \left( (2 + 2\sqrt{\gamma }) \Vert A'S-AS\Vert _F +\sqrt{\gamma }\Vert XX^TAS - AS\Vert _F\right) ^2\\&\le \left( \left( \frac{\varepsilon '}{4} ( 2+2\sqrt{\gamma }) + \sqrt{\gamma }\right) \Vert XX^TAS - AS\Vert _F\right) ^2 \\&\le \left( (1+\varepsilon ') \sqrt{\gamma }\,\Vert XX^TAS - AS\Vert _F\right) ^2 \\&\le (1+\varepsilon ')^2 \gamma \Vert YY^TAS -AS\Vert _F^2\\&\le (1+\varepsilon ')^3 \gamma \Vert YY^T A-A\Vert _F^2, \end{aligned}$$

where the first and the last inequality follow from the guarantee of Definition 3 and Theorem 12 of [CEM+15]. To conclude the proof, observe that \((1+\varepsilon ')^3 /(1-\varepsilon ') \le 1+25\varepsilon ' = 1+\varepsilon \) for \(0<\varepsilon <\frac{1}{2}\).

4 Practical Approximation Algorithms

Finally, we give some thoughts on how to extend the famous k-means++ algorithm [AV07] to the fair k-means setting in a way that results in high quality solutions. The k-means++ algorithm combines an efficient randomized sampling technique, which produces an \(\mathcal {O}(\log k)\)-approximate solution to the k-means problem, with a local search heuristic, Lloyd’s algorithm, which refines the solution to a local optimum.

The straightforward way to adapt k-means++ is to use Theorem 2 (the fairlet approach), run k-means++ on the fairlet representatives and use the resulting centers. There are two variants of this: One can either use the assignment that results directly from clustering the fairlets, or one can only use the computed centers and recompute the assignment by using Algorithm 4. We name the two variants CKLV-k-means++ and Reassigned-CKLV-k-means++.

[Algorithm 4: CKLV-k-means++]
[Algorithm 5: Reassigned-CKLV-k-means++]

Compared to k-means++, Reassigned-CKLV has the drawback that the computed solution is not necessarily locally optimal with respect to the original problem: After the reassignment, it may be beneficial to change the centers again. To further refine the solution, we propose to adapt Lloyd’s algorithm directly. Lloyd’s algorithm is an iterative search heuristic which, given initial centers, alternatingly assigns points to their closest center and computes the optimum center for each cluster (the centroid). The two steps are repeated until a stopping criterion is met, for example until the algorithm has converged or has reached a maximum number of iterations. For fair k-means++, we replace the assignment step by a fair assignment step and call the resulting algorithm fair k-means++.

[Algorithm 6: fair k-means++]
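A minimal sketch of this fair variant of Lloyd's algorithm for the exactly balanced two-color case, reusing the fair_assignment helper from the earlier sketch (the name fair_lloyd is ours); the initial centers could, for example, be taken from CKLV-k-means++.

```python
import numpy as np

def fair_lloyd(red, blue, init_centers, max_iter=50):
    """Fair Lloyd's: alternate an optimal fair assignment step (fair_assignment from
    the sketch above) with recomputing cluster centroids."""
    red, blue = np.asarray(red, float), np.asarray(blue, float)
    centers = np.asarray(init_centers, float).copy()
    for _ in range(max_iter):
        assign, _ = fair_assignment(red, blue, centers)    # fair assignment step
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = [red[i] for i, _, c in assign if c == j]
            members += [blue[i] for _, i, c in assign if c == j]
            if members:                                    # keep old center for empty clusters
                new_centers[j] = np.mean(members, axis=0)  # centroid step
        if np.allclose(new_centers, centers):              # stopping criterion: convergence
            break
        centers = new_centers
    return centers
```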

4.1 Empirical Evaluation

In the full version [SSS18], we give a short empirical evaluation to demonstrate the practicability of the streaming \(\mathcal {O}(\log k)\)-approximations and the coreset approach. We compare CKLV-k-means++ (Algorithm 4), Reassigned-CKLV (Algorithm 5) and fair k-means++ (Algorithm 6). The experiments clearly show that, when run on the full data, none of the algorithms scales to big data, with fair k-means++ being particularly slow due to the repeated fair reassignment. However, using the coreset allows all three algorithms to scale well.