5.1 Introduction

Clustering is a cornerstone of our learning process and, thus, of our understanding of the world. Indeed, we can all distinguish between a rose and a tulip precisely because we have learned what these flowers are. Plato would say that we learned the Idea—or Form [120]—of both the rose and the tulip, which then enables us to recognize all instances of such flowers. A machine learner would say that we learned two classes: their most discriminating features (shape, size, number of petals, smell, etc.) as well as their possible intra-class variability.

Mathematically speaking, the first step on the road to classifying objects (such as flowers) is to create an abstract representation of these objects: with each object i we associate a feature vector \({\mathbf {p}}_i\in \mathbb {R}^d\), where the dimension d of the vector corresponds to the number of features one chooses to select for the classification task. The space \(\mathbb {R}^d\) in this context is sometimes called the feature space. The choice of representation will obviously have a strong impact on the subsequent classification performance. Say that in the flower example we choose to represent each flower by only d = 3 features: the average color of each RGB channel (level of red, green, and blue) of its petals. This choice is not fail-proof: even though the archetype of the rose is red and the archetype of the tulip is yellow, we know that some varieties of both flowers can have very similar colors and thus a classification solely based on the color will necessarily lead to confusion. In fact, there are many different ways of choosing features: from features based on the expert knowledge of a botanist to features learned by a deep learning architecture from many instances of labeled images of roses and tulips, via features obtained by hybrid methods more-or-less based on human intelligence (such as the first few components of a principal component analysis of expert-based features).

The second step on the road to classifying n objects is to choose a machine learning algorithm that groups the set of n points P = (p 1, …, p n) in k classes (k may be known in advance or determined by the algorithm itself). Choosing an appropriate algorithm depends on the context:

  • Availability of pre-labeled data. Classifying the points P in k classes may be seen as assigning a label (such as “rose” or “tulip” in our k = 2 example) to each of the points. If one has access to some pre-labeled data, we are in the case of supervised learning: a more-or-less parametrized model is first learned from the pre-labeled data and then applied to the unlabeled points that need classification. If one does not have access to any pre-labeled data, we are in the case of unsupervised learning, where classes are typically inferred only via geometrical considerations on the distribution of points in the feature space. If one only has access to a few labeled data points, we are in the in-between case of semi-supervised learning, where the known labels are typically propagated in one form or another in the feature space.

  • Inductive vs transductive learning. Another important characteristic of a classification algorithm is whether it can be used to classify only the set of points P at hand (transductive), or if it can also be directly used to classify any never-seen data point p n+1 (inductive).

This chapter focuses on the family of algorithms jointly referred to as spectral clustering. These algorithms are unsupervised and transductive: no label is known in advance and one may not naturally extend the results obtained on P to never-seen data points. Another particularity of spectral clustering algorithms is that the number of classes k must be specified in advance.

Spectral clustering algorithms have received considerable attention over the last two decades due to their good performance on a wide range of datasets, as well as their ease of implementation. In a nutshell, they combine three steps:

  1. Graph construction. A sparse similarity graph is built between the n points.
  2. Spectral embedding. The first k eigenvectors of a graph representative matrix (such as the Laplacian) are computed.
  3. Clustering. k-means is performed on these spectral features, to obtain k clusters.

For background information about spectral clustering, such as several justifications of its performance, out-of-sample extensions, as well as comparisons with local methods, the interested reader is referred to the recent book chapter [144].

One of the drawbacks of spectral clustering is its computational cost as n, d, and/or k become large (see Sect. 5.2.3 for a discussion on the cost). Since the turn of the century, a large number of authors have striven to reduce the computational cost while keeping the high level of classification performance. The majority of such accelerating methods are based on sampling: they reduce the dimension of a sub-problem of spectral clustering, compute a low-cost solution in small dimension, and lift the result back to the original space.

The goal of this chapter is to review existing sampling methods for spectral clustering, focusing especially on their approximation guarantees. Some of the fundamental questions we are interested in are: where is the sampling performed and what is sampled precisely? How should the reduced approximate solutions be lifted back to the original space? What is the computational gain? What is the control on performance, if any? Given the breadth of the literature on the subject, we do not try to be exhaustive, but rather to illustrate the key ways in which sampling can be used to provide acceleration, paying special attention to recent developments on the subject.

Chapter Organization

We begin by recalling in Sect. 5.2 the prototypical spectral clustering algorithm. We also provide some intuitive and formal justification of why it works. The next three sections classify the different methods of the literature depending on where the sampling is performed with respect to the three steps of spectral clustering:

  • Section 5.3 details methods that sample directly in the original feature space.

  • Section 5.4 assumes that the similarity graph is given and details methods that sample nodes and/or edges to approximate the spectral embedding.

  • Section 5.5 assumes that the spectral embedding is given and details methods to accelerate the k-means step.

Finally, Sect. 5.6 gives perspective on the limitations of existing works and discusses key open problems.

Notation

Scalars, such as λ or d, are written with lowercase letters. Vectors, such as u, z, or the all-one vector 1, are denoted by lowercase bold letters. Matrices, such as W or L, are denoted by bold capital letters. Ensembles are denoted by serif font capital letters, such as C or X. The “tilde” will denote approximations, such as in \(\tilde {\mathbf {z}}\) or \(\tilde {\mathbf {U}}_k\). We use so-called Matlab notations to slice matrices: given a set of indices S of size m and an n × n matrix W, \(\mathbf {W}(\mathsf {S},:)\in \mathbb {R}^{m\times n}\) is W reduced to the lines indexed by S, \(\mathbf {W}(:,\mathsf {S})\in \mathbb {R}^{n\times m}\) is W reduced to the columns indexed by S, and \(\mathbf {W}(\mathsf {S},\mathsf {S})\in \mathbb {R}^{m\times m}\) is W reduced to the lines and columns indexed by S. In particular, \({\mathbf {U}}_k\) denotes the reduction of U to its first k columns. Also, \(\mathbf{C}^\top\) is the transpose of matrix C and \(\mathbf{C}^+\) its Moore–Penrose pseudo-inverse. The operator X = diag(x) takes as an input a vector \(\mathbf {x} \in \mathbb {R}^n\) and returns an n × n diagonal matrix X featuring x on its main diagonal, i.e., X(i, j) = x(i) if i = j and X(i, j) = 0 otherwise. Finally, we will consider graphs in a large part of this chapter. We will denote by G = (V, E, W) the undirected weighted graph of |V| = n nodes interconnected by |E| = e edges. The edge \(e_{ij}\in\mathsf{E}\) connects nodes \(v_i\) and \(v_j\) with weight W(i, j) ≥ 0. Matrix W is the adjacency matrix of G. As G is undirected, W is symmetric. In general, W can be any symmetric matrix with non-negative entries, but we usually prefer to work with sparse graphs without self-loops, in which case the matrix is also sparse and has a zero diagonal.

5.2 Spectral Clustering

The input of spectral clustering algorithms consists of (1) a set of points P = (p 1, p 2, …, p n) (also called feature vectors) representing n objects in a feature space of dimension d, and (2) the number of classes k in which to classify these objects. The output is a partition of the n objects into k disjoint clusters. The prototypical spectral clustering algorithm [102, 121] in fact dates back to fundamental ideas of Fiedler [46] and entails the following steps:

Algorithm 1. The prototypical spectral clustering algorithm

Input. A set of n points P = (p 1, p 2, …, p n) in dimension d and a number of desired clusters k.

  1. Graph construction (optional)
     (a) Compute the kernel matrix \(\mathbf {K}\in \mathbb {R}^{n\times n}\): ∀(i, j), \(\mathbf{K}(i,j) = \kappa(\|{\mathbf{p}}_i - {\mathbf{p}}_j\|_2)\).
     (b) Compute W = s(K), a sparsified version of K.
     (c) Interpret W as the adjacency matrix of a weighted undirected graph G.
  2. Spectral embedding
     (a) Compute the eigenvectors u 1, u 2, ⋯ , u k associated with the k smallest eigenvalues of a graph representative matrix R (usually a Laplacian) computed from W.
     (b) Set \({\mathbf {U}}_k = [ \:{\mathbf {u}}_1|\:{\mathbf {u}}_2|\:\cdots |\:{\mathbf {u}}_{k}\: ]\in \mathbb {R}^{n\times k}\).
     (c) Embed the i-th node to \({\mathbf {x}}_i = \frac {{\mathbf {U}}_k(i,:)^\top }{q(\|{\mathbf {U}}_k(i,:)\|{ }_2)}\), with q(⋅) a normalizing function.
  3. Clustering
     (a) Use k-means on x 1, …, x n in order to identify k centroids c 1, …, c k.
     (b) Voronoi tessellation: construct one cluster per centroid \({\mathbf{c}}_\ell\) and assign each object i to the cluster of the centroid closest to x i.

Output: A partition of the n points in k clusters.

A few comments are in order:

  • A common choice of kernel in step 1a is the radial basis function (RBF) kernel \(\kappa (\|{\mathbf {p}}_i - {\mathbf {p}}_j \|{ }_2) = \exp \left (-\|{\mathbf {p}}_i - {\mathbf {p}}_j \|{ }_2^2/\sigma ^2\right )\) for some user-defined σ. The sparsification s of K usually entails setting the diagonal to 0 and keeping only the k largest entries of each column (i.e., setting all others to 0). The obtained matrix K sp is not symmetric in general and a final “symmetrization” step \(\mathbf {W}={\mathbf {K}}_{\text{sp}}+{\mathbf {K}}_{\text{sp}}^\top \) is necessary to obtain a matrix W interpretable as the adjacency matrix of a weighted undirected graph G = (V, E, W). This graph is called the k nearest neighbor (k-NN) similarity graph (note that the k used in this paragraph has nothing to do with the number of clusters). Other kernel functions κ and sparsification methods are possible (see Section 2 of [138], for instance). A code sketch of this construction is given after this list.

  • There are several possibilities for choosing the graph representative matrix R in step 2a. We consider three main choices [138]. Let us denote by D the diagonal degree matrix such that D(i, i) =∑j W(i, j) is the (weighted) degree of node v i. We define the combinatorial graph Laplacian matrix L = D −W, the normalized graph Laplacian matrix \({\mathbf{L}}_n = \mathbf{I} - {\mathbf{D}}^{-1/2}\mathbf{W}{\mathbf{D}}^{-1/2}\), and the random walk Laplacian \({\mathbf{L}}_{\text{rw}} = \mathbf{I} - {\mathbf{D}}^{-1}\mathbf{W}\). Other popular choices include the non-backtracking matrix [73], degree-corrected versions of the modularity matrix [2], the Bethe–Hessian matrix [114], or similar deformed Laplacians [34].

  • The normalizing function q(⋅) used in step 2c depends on which representative matrix is chosen. In the case of the Laplacians, experimental evidence as well as some theoretical arguments [138] support using a unit norm normalization for the eigenvectors of L n (i.e., q is the identity function), and no normalization for the eigenvectors of L and L rw (i.e., q(⋅) = 1).

  • Step 1 of the algorithm is “optional” in the sense that in some cases the input is not a set of points but directly a graph. For instance, it could be a graph representing a social network between n individuals, where each node is an individual and there is an edge between two nodes if they know each other. The weight on each edge can represent the strength of their relation (for instance, close to 0 if they barely know each other, and close to 1 if they are best friends). The goal is then to classify individuals based on the structure of these social connections and is usually referred to as community detection in this context [47]. Given the input graph, and the number k of communities to identify, one can run spectral algorithms starting directly at step 2. Readers only interested in such applications can skip Sect. 5.3, which is devoted to sampling techniques designed to accelerate step 1.
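To make these first two steps concrete, here is a minimal NumPy/SciPy sketch of the graph construction and spectral embedding of Algorithm 1. It is only an illustration under assumptions (dense matrices for readability, an RBF kernel with a user-chosen σ, and a user-chosen number of neighbors for the sparsification), not a reference implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_rbf_graph(P, n_neighbors=10, sigma=1.0):
    """Step 1: RBF kernel, k-NN sparsification, symmetrization W = K_sp + K_sp^T."""
    K = np.exp(-cdist(P, P, "sqeuclidean") / sigma**2)   # RBF kernel matrix
    np.fill_diagonal(K, 0.0)                             # no self-loops
    K_sp = np.zeros_like(K)
    cols = np.arange(K.shape[0])
    top = np.argsort(-K, axis=0)[:n_neighbors, :]        # largest entries of each column
    K_sp[top, cols] = K[top, cols]
    return K_sp + K_sp.T                                 # symmetric adjacency matrix W

def spectral_embedding(W, k, laplacian="normalized"):
    """Step 2: first k eigenvectors of a graph representative matrix."""
    d = W.sum(axis=1)
    if laplacian == "combinatorial":                     # L = D - W, no row normalization
        R = np.diag(d) - W
        row_normalize = False
    else:                                                # L_n = I - D^{-1/2} W D^{-1/2}
        R = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))
        row_normalize = True
    _, U = np.linalg.eigh(R)                             # eigenvalues in ascending order
    U_k = U[:, :k]
    if row_normalize:                                    # q = identity: unit-norm rows
        U_k = U_k / np.linalg.norm(U_k, axis=1, keepdims=True)
    return U_k
```

Running k-means on the rows of the returned U_k (step 3) then yields the final clusters.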

After the spectral embedding X = ( x 1, …, x n ) has been identified, spectral clustering uses k-means in order to find the set of k centroids C = ( c 1, …, c k ) that best represents the data. Formally, the k-means cost function to minimize reads:

$$\displaystyle \begin{aligned} f(\mathsf{C}; \mathsf{X}) = \sum_{\mathbf{x}\in\mathsf{X}}\:\: \min_{\mathbf{c}\in\mathsf{C}} \|\mathbf{x} - \mathbf{c}\|{}_2^2. \end{aligned} $$
(5.1)

We would ideally hope to identify the set of k centroids C minimizing f(C;X). Solving this problem exactly is NP-hard [42], so one commonly resorts to approximation and heuristic solutions (see, for instance, [128] for details on different such heuristics). The most famous is the so-called Lloyd-Max heuristic algorithm:

Algorithm 2. The Lloyd-Max algorithm [87]

Input. Set of n points X = (x 1, x 2, …, x n) and number of desired clusters k.

  1. Start from an initial guess \(\mathsf{C}_{\text{ini}}\) of k centroids.
  2. Iterate until convergence:
     (a) Assign each point x i to its closest centroid to obtain a partition of X in k clusters.
     (b) Move each centroid \({\mathbf{c}}_\ell\) to the average position of all points of cluster ℓ.

Output: A set of k centroids C = (c 1, …, c k).

When the clusters are sufficiently separated and C ini is not too far from the optimal centroids, then the Lloyd-Max algorithm converges to the correct solution [75]. Otherwise, it typically ends up in a local minimum.
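For reference, a bare-bones NumPy version of Algorithm 2 could look as follows; the random initialization and the fixed iteration cap are assumptions of this sketch (practical codes use smarter seeding such as k-means++ and multiple restarts).

```python
import numpy as np

def lloyd_max(X, k, n_iter=100, seed=0):
    """Lloyd-Max heuristic for k-means on the rows of X (Algorithm 2)."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)]      # initial guess C_ini
    for _ in range(n_iter):
        # (a) assign each point to its closest centroid
        labels = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        # (b) move each centroid to the mean of the points assigned to it
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
                          for j in range(k)])
        if np.allclose(new_C, C):                         # converged
            break
        C = new_C
    return C, labels
```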

A Remark on Notation

Two quantities of fundamental importance in spectral clustering are the eigenvalues λ i and especially the eigenvectors u i of the graph Laplacian matrix. We adopt the graph theoretic convention of sorting eigenvalues in non-decreasing order: 0 = λ 1 ≤ λ 2 ≤… ≤ λ n. Also, for reasons of brevity, we overload notation and use the same symbol for the spectrum of the three Laplacians L, L n, and L rw. Thus, we advise the reader to rely on the context in order to discern which Laplacian gives rise to the eigenvalues and eigenvectors. Finally, the reader should keep in mind that the largest eigenvalue is always bounded by 2 for L n and L rw.

5.2.1 An Illustration of Spectral Clustering

The first two steps of the algorithm can be understood as a non-linear transformation from the initial feature space to another feature space (that we call spectral feature space or spectral embedding): a transformation of features p i in \(\mathbb {R}^d\) to spectral features x i in \(\mathbb {R}^k\). The first natural question that arises is why do we run k-means on the spectral features X = (x 1, …, x n) that are subject to parameter tuning and costly to compute, rather than directly run k-means on the original P? Figures 5.1 and 5.2 illustrate the answer.

Fig. 5.1 Left: the two half-moons synthetic dataset (n = 500, d = 2, k = 2). Right: k-means with k = 2 run directly on P fails to separate the two half-moons

In Fig. 5.1, we show the result of k-means applied directly to a set of artificial features P known as the two half-moons dataset. In this example, the intuitive ground truth is that each half-moon corresponds to a class that we want to recover. Running k-means directly in this 2D feature space outputs a linear separation between the two obtained Voronoi cells and is thus bound to fail: no straight line can separate the two half-moons.

Spectral clustering, via the computation of the spectral features of a similarity graph, transforms the original features P into spectral features X that are typically linearly separable by k-means: the two half-moons are successfully recovered! We illustrate this in Fig. 5.2. In the next section, we will examine a theoretical argument aiming to justify this phenomenon.

Fig. 5.2 Illustration of the spectral clustering algorithm on the two half-moons dataset (n = 500, d = 2, k = 2). The graph is created with an RBF kernel and sparsified with k-nearest neighbors (with k = 5). The spectral embedding is done with the two eigenvectors associated with the two smallest eigenvalues of the combinatorial Laplacian matrix L. In practice, the embedding X is here one-dimensional, as the first eigenvector of L is always constant and thus not discriminative (to confirm this, first show that L is a PSD matrix and then prove that Lc = 0 for any constant vector c). Observe how the two clusters are now linearly separable in the spectral feature space: k-means on these features successfully recovers the two half-moons
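The two figures can be reproduced in a few lines, assuming scikit-learn is available; note that its SpectralClustering differs in some details from Algorithm 1 (e.g., in its choice of Laplacian), and the noise level below is an arbitrary choice, so this is only a qualitative reproduction.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

P, ground_truth = make_moons(n_samples=500, noise=0.05, random_state=0)

# k-means directly on P: a linear separation, which fails on the half-moons (Fig. 5.1)
labels_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(P)

# spectral clustering with a 5-nearest-neighbor similarity graph (Fig. 5.2)
labels_sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                               n_neighbors=5, random_state=0).fit_predict(P)
```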

5.2.2 Justification of Spectral Clustering

A popular approach (by no means the only one, see Sect. 5.2.2.3) to justify spectral clustering algorithms stems from its connection to graph partitioning. Suppose that the similarity graph G = (V, E, W) has been obtained and that we want to compute a partition \(\mathcal {P}=\{\mathsf {V}_1, \mathsf {V}_2, \ldots , \mathsf {V}_k\}\) of the nodes V into k groups. Intuitively, a good clustering objective function should favor strongly connected nodes ending up in the same subset, and nodes that are far apart in the graph ending up in different subsets. This intuition can be formalized with graph cuts.

Considering two groups \(\mathsf{V}_1\) and \(\mathsf{V}_2\), define \(w(\mathsf {V}_1,\mathsf {V}_2)=\sum _{i\in \mathsf {V}_1}\sum _{j\in \mathsf {V}_2} \mathbf {W}(i,j)\) to be the total weight of all links connecting \(\mathsf{V}_1\) to \(\mathsf{V}_2\). Also, denote by \(\bar {\mathsf {V}}_\ell \) the complement of \(\mathsf{V}_\ell\) in V, such that \(w(\mathsf {V}_\ell , \bar {\mathsf {V}}_\ell )\) is the total weight one needs to cut in order to disconnect \(\mathsf{V}_\ell\) from the rest of the graph. Given these definitions, the simplest graph cut objective function, denoted by cut, is:

$$\displaystyle \begin{aligned} \mathtt{cut}\left(\mathcal{P}=\{\mathsf{V}_1, \ldots, \mathsf{V}_k\}\right) = \frac{1}{2}\sum_{\ell=1}^k w\left(\mathsf{V}_\ell, \bar{\mathsf{V}}_\ell\right). \end{aligned} $$
(5.2)

The best partition according to the cut criterion is \(\mathcal {P}^* = \text{argmin}_{\mathcal {P}} ~~\mathtt {cut}(\mathcal {P})\). For k = 2, solving this problem can be done exactly in \(O(n e + n^2 \log (n))\) amortized time using the Stoer–Wagner algorithm [126] and approximated in nearly linear time [68]. Nevertheless, this criterion is not satisfactory as it often separates an individual node from the rest of the graph, with no attention to the balance of the sizes or volumes of the groups. In clustering, one usually wants to partition into groups that are “large enough.” There are two famous ways to balance the previous cost in the machine learning literature: the ratio cut [143] and normalized cut [121] cost functions, respectively defined as:

$$\displaystyle \begin{aligned} \mathtt{rcut}(\mathcal{P}) = \frac{1}{2}\sum_{\ell=1}^k \frac{w(\mathsf{V}_\ell, \bar{\mathsf{V}}_\ell)}{|\mathsf{V}_\ell|} ~~~~\text{and}~~~~\mathtt{ncut}(\mathcal{P}) = \frac{1}{2}\sum_{\ell=1}^k \frac{w(\mathsf{V}_\ell, \bar{\mathsf{V}}_\ell)}{\text{vol}(\mathsf{V}_\ell)}, \end{aligned} $$
(5.3)

where \(|\mathsf{V}_\ell|\) is the number of nodes in \(\mathsf{V}_\ell\) and \(\text{vol}(\mathsf {V}_\ell )=\sum _{i\in \mathsf {V}_\ell } \sum _{j\in \mathsf {V}} \mathbf {W}(i,j)\) is the so-called volume of \(\mathsf{V}_\ell\). The difference between them is that ncut favors clusters of large volume, whereas rcut only considers cluster size (for a d-regular graph with unit weights, the two measures coincide up to a multiplicative factor 1∕d). Unfortunately, these cost functions are hard to minimize directly: minimizing the two balanced costs is NP-hard [121, 139] and a priori requires a search over the space of all possible partitions, which is of exponential size.

A Continuous Relaxation

Spectral clustering may be interpreted as a continuous relaxation of the above minimization problems. Without loss of generality, in the following we concentrate on relaxing the rcut minimization problem (ncut is relaxed almost identically). Given a partition \(\mathcal {P}=(\mathsf {V}_1, \ldots , \mathsf {V}_k)\), let us define

$$\displaystyle \begin{aligned} \mathbf{{C}} = \left(\frac{{\mathbf{z}}_1}{\sqrt{|\mathsf{V}_1|}}|\ldots|\frac{{\mathbf{z}}_k}{\sqrt{|\mathsf{V}_k|}}\right) \in\mathbb{R}^{n\times k}, \end{aligned} $$
(5.4)

where \({\mathbf {z}}_\ell \in \mathbb {R}^n\) is the indicator vector of \(\mathsf{V}_\ell\):

$$\displaystyle \begin{aligned} {\mathbf{z}}_\ell(i) &= \begin{cases} 1 & \text{if node } i\in\mathsf{V}_\ell, \\ 0 & \text{otherwise.} \end{cases} \end{aligned} $$
(5.5)

It will prove useful in the following to remark that, independently of how the partitions are chosen, we always have \(\mathbf{C}^\top \mathbf{C} = \mathbf{I}\), the identity matrix in dimension k. With this in place, the problem of minimizing rcut can be rewritten as (see discussion in [138]):

$$\displaystyle \begin{aligned} \min_{\mathbf{{C}}} \;\text{tr}\hspace{-1px}\left(\mathbf{{C}}^\top \mathbf{L} \mathbf{{C}}\right) \quad \text{subject to} \quad \mathbf{{C}} \text{ as in (5.4)}. \end{aligned} $$
(5.6)

To understand why this equivalence holds, one should simply note that, for C as in (5.4),

$$\displaystyle \begin{aligned} \text{tr}\hspace{-1px}\left(\mathbf{{C}}^\top \mathbf{L} \mathbf{{C}}\right) = \sum_{\ell=1}^k \frac{{\mathbf{z}}_\ell^\top \mathbf{L}\, {\mathbf{z}}_\ell}{|\mathsf{V}_\ell|} = \sum_{\ell=1}^k \frac{w\left(\mathsf{V}_\ell, \bar{\mathsf{V}}_\ell\right)}{|\mathsf{V}_\ell|} = 2\, \mathtt{rcut}(\mathcal{P}). \end{aligned} $$
Solving (5.6) is obviously still NP-hard as the only thing we have achieved is to rewrite the rcut minimization problem in matrix form. Yet, in this form, it is easier to realize that one may find an approximate solution by relaxing the discreteness constraint “C as in (5.4).” Once this hard combinatorial constraint is dropped, the relaxed problem is not only polynomially solvable but also possesses a closed-form solution! By the Courant–Fischer–Weyl (min-max) theorem, the solution is given by the first k eigenvectors \({\mathbf{U}}_k = [{\mathbf{u}}_1|{\mathbf{u}}_2|\cdots|{\mathbf{u}}_k]\) of L:

$$\displaystyle \begin{aligned} {\mathbf{U}}_k = \operatorname*{\mbox{arg min}}_{\mathbf{{C}} \in \mathbb{R}^{n\times k}} \;\text{tr}\hspace{-1px}\left(\mathbf{{C}}^\top \mathbf{L} \mathbf{{C}}\right) \quad \text{subject to} \quad \mathbf{{C}}^\top \mathbf{{C}} = \mathbf{I}. \notag \end{aligned} $$

This relaxation is not unique to the combinatorial Laplacian. In the same spirit, the minimum ncut optimization problem can be formulated in terms of the normalized Laplacian matrix L n, and the relaxed problem’s solution is given by the first k eigenvectors of L n.
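As a sanity check of the matrix rewriting above, the identity \(\text{tr}(\mathbf{C}^\top\mathbf{L}\mathbf{C}) = 2\,\mathtt{rcut}(\mathcal{P})\) can be verified numerically on a small random weighted graph; the partition chosen below is arbitrary.

```python
import numpy as np

n, k = 12, 3
rng = np.random.default_rng(0)
W = np.triu(rng.random((n, n)), 1)
W = W + W.T                                     # symmetric weights, zero diagonal
L = np.diag(W.sum(1)) - W                       # combinatorial Laplacian

labels = np.arange(n) % k                       # an arbitrary partition into k groups
Z = np.eye(k)[labels]                           # indicator vectors z_1, ..., z_k as columns
C = Z / np.sqrt(Z.sum(0))                       # the matrix C of (5.4); note C^T C = I

rcut = 0.5 * sum(W[labels == l][:, labels != l].sum() / (labels == l).sum()
                 for l in range(k))
print(np.trace(C.T @ L @ C), 2 * rcut)          # the two numbers coincide
```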

A difficulty still lies before us: how do we go from a real-valued U k to a partition of the nodes? The next two subsections aim to motivate the use of k-means as a rounding heuristic. The exposition starts from the simple case where there are only two clusters (k = 2) before considering the general case (arbitrary k).

5.2.2.1 The Case of Two Clusters: Thresholding Suffices

For simplicity, we first consider the case of two clusters. If one constructs a partitioning \(\mathcal {P}_t\) with V 1 = {v i : u 2(i) > t} and V 2 = {v i : u 2(i) ≤ t} for every level set t ∈ (−1, 1), then it is a folklore result that

$$\displaystyle \begin{aligned} \mathtt{rcut}(\mathcal{P}^*) \leq \min_{t}\; \mathtt{rcut}(\mathcal{P}_t) \leq 2 \sqrt{ \mathtt{rcut}(\mathcal{P}^*) \, \left( d_{\text{max}} - \frac{\lambda_2}{2}\right)}, {} \end{aligned} $$
(5.7)

with \(\mathcal {P}^* = \operatorname *{\mbox{arg min}}_{\mathcal {P}} \ \mathtt {rcut}(\mathcal {P})\) the optimal partitioning, d max the maximum degree of any node in V, and λ 2 the second smallest eigenvalue of L. The upper bound is achieved by the tree-cross-path graph constructed by Guattery and Miller [57]. In an analogous manner, if \(\mathcal {P}^* = \operatorname *{\mbox{arg min}}_{\mathcal {P}} \ \mathtt {ncut}(\mathcal {P})\) is the optimal partitioning w.r.t. the ncut cost and every \(\mathcal {P}_t\) has been constructed by thresholding the second eigenvector of L n, then

$$\displaystyle \begin{aligned} \mathtt{ncut}(\mathcal{P}^*) \leq \min_{t} \;\mathtt{ncut}(\mathcal{P}_t) \leq 2 \sqrt{ \mathtt{ncut}(\mathcal{P}^*)}. {} \end{aligned} $$
(5.8)

Inequality (5.8) can be derived as a consequence of the Cheeger inequality, a key result of spectral graph theory [32], which for the normalized Laplacian reads:

$$\displaystyle \begin{aligned} \frac{\lambda_2}{2} \leq \mathtt{ncut}(\mathcal{P}^*) \leq \min_{\mathsf{V}} \frac{w(\mathsf{V}, \bar{\mathsf{V}})}{ \min\{ w(\mathsf{V}), w(\bar{\mathsf{V}})\} } \leq \min_{t} \;\mathtt{ncut}(\mathcal{P}_t) \leq \sqrt{ 2 \lambda_2 }. \notag \end{aligned} $$

As a consequence, we have

$$\displaystyle \begin{aligned} \mathtt{ncut}(\mathcal{P}^*) \leq \min_{t} \;\mathtt{ncut}(\mathcal{P}_t) \leq \sqrt{ 2 \lambda_2 } \leq \sqrt{ 4 \mathtt{ncut}(\mathcal{P}^*)} = 2 \sqrt{ \mathtt{ncut}(\mathcal{P}^*)},\end{aligned}$$

as desired. The derivation of the rcut bound given in (5.7) follows similarly.
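The sweep over level sets is straightforward to implement; the sketch below thresholds the second eigenvector of L (for rcut) or of L n (for ncut) and keeps the best two-way cut, assuming a dense weight matrix W.

```python
import numpy as np

def best_threshold_partition(W, cost="rcut"):
    """Sweep all level sets of the second eigenvector and return the best 2-way cut."""
    d = W.sum(1)
    if cost == "rcut":
        R = np.diag(d) - W                                 # combinatorial Laplacian
    else:
        R = np.eye(len(d)) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian
    _, U = np.linalg.eigh(R)                               # ascending eigenvalues
    u2 = U[:, 1]                                           # second eigenvector
    best_mask, best_cost = None, np.inf
    for t in np.unique(u2)[:-1]:                           # all non-trivial thresholds
        mask = u2 > t
        w_cut = W[mask][:, ~mask].sum()
        if cost == "rcut":
            c = 0.5 * (w_cut / mask.sum() + w_cut / (~mask).sum())
        else:
            c = 0.5 * (w_cut / d[mask].sum() + w_cut / d[~mask].sum())
        if c < best_cost:
            best_mask, best_cost = mask, c
    return best_mask, best_cost
```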

5.2.2.2 More Than Two Clusters: Use k-Means

As the number of clusters k increases, the brute-force approach of testing every level set quickly becomes prohibitive. But why is k-means the right way to obtain the clusters in the spectral embedding? Though a plethora of experimental evidence advocates the use of k-means, a rigorous justification is still lacking. The interested reader may refer to [83] for an example of an analysis of spectral partitioning without k-means.

More recently, Peng et al. [107] came up with a mathematical argument showing that, if G is well clusterable and we use a k-means algorithm (e.g., [76]) which guarantees that the identified solution \(\tilde {\mathsf {C}}\) satisfies

$$\displaystyle \begin{aligned} f(\tilde{\mathsf{C}}; \mathsf{X}) \leq (1 + \epsilon) f(\mathsf{C}^*; \mathsf{X}), \notag \end{aligned} $$

where \(\mathsf{C}^*\) is the optimal solution of the k-means problem, then the partitioning \(\tilde {\mathcal {P}}\) produced by spectral clustering when using L n has ncut cost provably close to that of the optimal partitioning \(\mathcal {P}^*\). In particular, it was shown that, as long as \(\lambda _{k+1} \geq \text{c} k^2 \mathtt {ncut}(\mathcal {P}^*)\), then

$$\displaystyle \begin{aligned} \mathtt{ncut}(\mathcal{P}^*) \leq \mathtt{ncut}(\tilde{\mathcal{P}}) \leq \zeta \, \mathtt{ncut}(\mathcal{P}^*) \left(1 + \epsilon \, \frac{k^3}{\lambda_{k+1}}\right), \notag \end{aligned} $$

for some constants c, ζ > 0 that are independent of n and k (see also [71]). Note that, using the higher-order Cheeger inequality [83] \(\lambda _k/2 \leq \mathtt {ncut}(\mathcal {P}^*)\), the condition \(\lambda _{k+1} \geq \text{c} k^2 \mathtt {ncut}(\mathcal {P}^*)\) implies

$$\displaystyle \begin{aligned} \frac{\lambda_{k+1}}{\lambda_k} \geq \frac{\text{c} k^2}{2} = \varOmega(k^2).\end{aligned}$$

Though hopefully milder than this one, such gap assumptions are very common in the analysis of spectral clustering. Simply put, the larger the gap λ k+1 − λ k is, the stronger the cluster structure and the easier it is to identify a good clustering. Besides quantifying the difficulty of the clustering problem, the gap also encodes the robustness of the spectral embedding to errors induced by approximation algorithms [36]. The eigenvectors of a perturbed Hermitian matrix exhibit an interesting property: instead of being arbitrary, the induced changes are localized w.r.t. the eigenvalue axis, following an inverse square eigenvalue-distance law [89]. More precisely, if \(\tilde {\mathbf {u}}_i\) is the i-th eigenvector after perturbation, then the inner products \((\tilde {\mathbf {u}}_i^\top {\mathbf {u}}_j)^2\) decay in proportion to \(1/|\lambda_i - \lambda_j|^2\). As such, demanding that λ k+1 − λ k is large is often helpful in the analysis of spectral clustering algorithms in order to ensure that the majority of the useful information (contained within U k) is preserved (in \(\tilde {\mathbf {U}}_k\)) despite approximation errors.

5.2.2.3 Choice of Relaxation

The presented relaxation approach is not unique and other relaxations could be equally valid (see, for instance, [17, 24, 112]). This relaxation has nevertheless the double advantage of being theoretically simple and computationally easy to implement. Also, the justification of spectral clustering algorithms does not only come from this graph cut perspective and in fact encompasses several approaches that we will not detail here: perturbation approaches or hitting time considerations [138], a polarization theorem [23], consistency derivations [84, 135], etc. Interestingly, recent studies (for instance, [18]) on the stochastic block model have shown that spectral clustering (on matrices other than the Laplacian, such as the non-backtracking matrix [73], the Bethe–Hessian matrix [114], or other similar deformed Laplacians [34]) performs well up to the detectability threshold of the block structure.

5.2.3 Computational Complexity Considerations

What is the computational complexity of spectral clustering as a function of the number of points n, their dimension d, and the number of desired clusters k? Let us examine the three steps involved one by one. The first step entails the construction of a sparse similarity graph from the input points, which is dominated by the kernel computation and costs \(\mathcal {O}(dn^2)\). In the second step, given the graph G consisting of n nodes and e edges, one needs to compute the spectral embedding (step 2 of Algorithm 1). Without exploiting any special structure of the graph Laplacian (other than its sparsity), there are two main options:

  • Using power iterations, one may identify sequentially each non-trivial eigenvector \({\mathbf{u}}_\ell\) in time \(\mathcal {O}( e / \delta _{\ell })\), where \(\delta_\ell = \lambda_\ell - \lambda_{\ell-1}\) is the ℓ-th eigenvalue gap and e is the number of edges of the graph [136]. Computing the spectral embedding therefore takes \(\mathcal {O}(k e / \delta ) \) with \(\delta = \min_\ell \delta_\ell\). Unfortunately, there exist graphs such that \(\delta = \mathcal {O}(1/n)\), bringing the overall worst-case complexity to \(\mathcal {O}(k n e)\).

  • The Lanczos method can be used to approximate the first k eigenvectors in roughly \(\mathcal {O}(e k + nk^2)\) time. This procedure is often numerically unstable, resulting in a loss of orthogonality in the computed Krylov subspace basis. The most common way to circumvent this problem is by implicit restart [26], whose computational complexity is not easily derived. Empirically, the number of restarts depends heavily on the eigenvalue distribution in the vicinity of λ k: if λ k lies in an eigenvalue bulk, the algorithm takes longer than when λ k is isolated. We decide to write the complexity of restarted Arnoldi as \(\mathcal {O}(t(e k + nk^2))\) with t modeling the number of restarts. Note that throughout this chapter, t will generically refer to a number of iterations in algorithm complexities. We refer the interested reader to [13] for an in-depth discussion of Lanczos methods, and give a short usage sketch after this list.
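As an illustration of the second option, SciPy's ARPACK wrapper (an implicitly restarted Lanczos implementation) can be called as follows; asking for the smallest-magnitude eigenpairs with which="SM" is a simplification, and shift-invert around a small negative value is a common further acceleration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

def lanczos_embedding(L, k):
    """First k eigenvectors of a sparse symmetric Laplacian via restarted Lanczos (ARPACK)."""
    vals, vecs = eigsh(csr_matrix(L), k=k, which="SM")   # smallest-magnitude eigenpairs
    order = np.argsort(vals)
    return vals[order], vecs[:, order]
```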

The third step entails solving the k-means problem, typically by using the Lloyd-Max algorithm to converge to a local minimum of f(C;X). Since there is no guarantee that this procedure will find a good local minimum, it is usually rerun multiple times, starting in each case from randomly selected centroids C ini. The computational complexity of this third step is \(\mathcal {O}(t n k^2)\), where t is a bound on the number of iterations required until convergence multiplied by the number of retries (typically 10).

5.2.4 A Taxonomy of Sampling Methods for Spectral Clustering

For the remainder of the chapter, we propose to classify sampling methods aiming at accelerating one or more of these three steps according to when they sample. If they sample before step 1, they are detailed in Sect. 5.3. Methods that assume that the similarity graph is given or well-approximated and sample between steps 1 and 2 are found in Sect. 5.4. Finally, methods that assume that the spectral embedding has been exactly computed or well-approximated and sample before the k-means step are explained in Sect. 5.5. This classification of methods, like all classification systems, bears a few flaws. For instance, Nyström methods can be applied in the contexts of both Sects. 5.3 and 5.4 and are thus mentioned in both. Also, we decided to include the pseudo-code of only a few chosen algorithms that we think are illustrative of the literature. This choice is of course subjective and debatable. Notwithstanding these flaws, we hope that this taxonomy clarifies the landscape of existing methods.

5.3 Sampling in the Original Feature Space

This section is devoted to methods that ambitiously aim to reduce the dimension of the spectral clustering problem even before the graph has been formed. Indeed, the naive way of building the similarity graph (step 1 of spectral clustering algorithms) costs \(\mathcal {O}(dn^2)\) and, as such, is one of the main computational bottlenecks of spectral clustering. It should be remarked that the present discussion fits into the wider realm of kernel approximation, a proper review of which cannot fit in this chapter: we will thus concentrate on methods that were in practice used for spectral clustering.

5.3.1 Nyström-Based Methods

The methods of this section aim to obtain an approximation \(\tilde {\mathbf {U}}_k\) of the exact spectral embedding U k via a sampling procedure in the original feature space.

The Nyström method

This well-known algorithm computes a rough low-rank approximation of a positive semi-definite (PSD) matrix A. Here is a high-level description of the steps it entails:

Algorithm 3. Nyström’s method

Input. PSD matrix \(\mathbf {A} \in \mathbb {R}^{n \times n}\), number of samples m, desired rank k

  1. Let S be m column indices chosen by some sampling procedure.
  2. Denote by \(\mathbf {B} = \mathbf {A}(\mathsf {S},\mathsf {S}) \in \mathbb {R}^{m\times m}\) and \(\mathbf {C} = \mathbf {A}(:,\mathsf {S}) \in \mathbb {R}^{n\times m}\) the sub-matrices indexed by S.
  3. Let \(\mathbf{B} = \mathbf{Q} \boldsymbol{\Sigma} \mathbf{Q}^\top\) be the eigen-decomposition of B with the diagonal of Σ sorted in decreasing magnitude.
  4. Compute the rank-k approximation of B as \({\mathbf {B}}_k = {\mathbf {Q}}_k \boldsymbol {\Sigma }_k {\mathbf {Q}}_k^\top \), where \({\mathbf{Q}}_k\) is the reduction of Q to its first k columns and \(\boldsymbol{\Sigma}_k\) the corresponding k × k upper-left block of Σ.

Possible outputs:

  • A low-rank approximation \(\tilde {\mathbf {A}}=\mathbf {C} {\mathbf {B}}^+ {\mathbf {C}}^\top \in \mathbb {R}^{n \times n}\) of A

  • A rank-k approximation \(\tilde {\mathbf {A}}_k = \mathbf {C} {\mathbf {B}}^+_k {\mathbf {C}}^\top \in \mathbb {R}^{n \times n}\) of A

  • The top k eigenvectors of \(\tilde {\mathbf {A}}_k\), stacked as columns in matrix \(\tilde {\mathbf {{V}}}_k\in \mathbb {R}^{n\times k}\), obtained by orthonormalizing the columns of \(\tilde {\mathbf {Q}}_k=\mathbf {C} {\mathbf {Q}}_k \boldsymbol {\Sigma }_k^{-1}\in \mathbb {R}^{n\times k}\)
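A direct NumPy transcription of Algorithm 3 with uniform column sampling might look as follows; it is only a sketch and ignores numerical safeguards (e.g., near-zero eigenvalues of B).

```python
import numpy as np

def nystrom(A, m, k, seed=0):
    """Nystrom approximation of a PSD matrix A (Algorithm 3, uniform sampling)."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    S = rng.choice(n, size=m, replace=False)          # sampled column indices
    B, C = A[np.ix_(S, S)], A[:, S]                   # B = A(S,S), C = A(:,S)
    sig, Q = np.linalg.eigh(B)                        # eigen-decomposition of B
    idx = np.argsort(-np.abs(sig))[:k]                # keep the k largest in magnitude
    Q_k, sig_k = Q[:, idx], sig[idx]
    A_k = C @ (Q_k / sig_k) @ Q_k.T @ C.T             # rank-k approximation C B_k^+ C^T
    V_k, _ = np.linalg.qr(C @ Q_k / sig_k)            # orthonormalized top-k eigenvector estimate
    return A_k, V_k
```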

Various guarantees are known for the quality of \(\tilde {\mathbf {A}}\) depending on the type of sampling utilized (i.e., how the indices in S are selected in step 1) and the preferred notion of error (spectral \(\|.\|_2\) vs Frobenius \(\|.\|_F\) vs trace \(\|.\|_*\) norm) [50, 54, 77, 148]. For instance:

Theorem 5.1 (Lemma 8 for q = 1 in [54])

Let 𝜖 ∈ (0, 1) and δ ∈ (0, 1) and suppose that S contains the indices of m columns drawn i.i.d. uniformly at random (with or without replacement). Then:

$$\displaystyle \begin{aligned} \| \mathbf{A} - \tilde{\mathbf{A}}\|{}_2 \leq \left( 1 + \frac{n}{(1 - \epsilon) m} \right) \| \mathbf{A} - {\mathbf{A}}_k\|{}_2 \end{aligned}$$

holds with probability at least 1 − 3δ, provided that \(m\geq 2\epsilon ^{-2}\mu k\log {(k/\delta )}\) , where

$$\displaystyle \begin{aligned}\mu = \frac{n}{k} \max_{i = 1, \ldots, n} \|{\mathbf{V}}_k(i,:)\|{}^2_2\end{aligned}$$

is the coherence associated with the first k eigenvectors V k of A, and A k is the best rank-k approximation of A.

Guarantees independent of the coherence can be obtained for more advanced sampling methods. Perhaps the most well-known method is that of leverage scores, where one draws m samples independently by selecting (with replacement) the i-th column with probability \({\mathbf {p}}_i = \|{\mathbf {V}}_k(i,:)\|{ }_2^2/k\).

Theorem 5.2 (Lemma 5 for q = 1 in [54])

Let 𝜖 ∈ (0, 1) and δ ∈ (0, 1) and suppose that S contains the indices of m columns drawn i.i.d. with replacement from such a probability distribution. Then:

$$\displaystyle \begin{aligned} \| \mathbf{A} - \tilde{\mathbf{A}}\|{}_2 \leq \| \mathbf{A} - {\mathbf{A}}_k\|{}_2 + \epsilon^2 \| \mathbf{A} - {\mathbf{A}}_k\|{}_* \end{aligned}$$

holds with probability at least 0.8 − 2δ provided that \(m \geq \mathcal {O}(\epsilon ^{-2}k \log (k/\delta ))\).

Computing leverage scores exactly is computationally prohibitive since it necessitates a partial SVD of A, which we are trying to avoid in the first place. Nevertheless, it is possible to approximate all leverage scores with a multiplicative error guarantee in time roughly \(\mathcal {O}(e k \log (e))\) if A has O(e) non-zero entries (see Algorithms 1–3 in [54]). Many variants of the above exist [77, 78], but to the best of our knowledge, the fastest current Nyström algorithm utilizes ridge leverage scores with a complex recursive sampling scheme and runs in time nearly linear in n [100].

Nyström for Spectral Clustering

Though initially conceived for low-rank approximation, Nyström’s method can also be used to accelerate spectral clustering. The key observation is that U k, the k trailing eigenvectors of the graph representative matrix R (those associated with its k smallest eigenvalues), can be interpreted as the top k eigenvectors of the PSD matrix \(\mathbf{A} = \|\mathbf{R}\|_2 \mathbf{I} - \mathbf{R}\). As such, the span of the k top eigenvectors of \(\tilde {\mathbf {A}}_k\) obtained by running Algorithm 3 on A is an approximation of the span of the exact spectral embedding. Different variants of this idea have been considered for the acceleration of spectral clustering [19, 48, 85, 86, 97, 141].

Following our taxonomy, we hereby focus on the case where we have at our disposal n points p i in dimension d, and the similarity graph has yet to be formed. The case where the graph is known is deferred to Sect. 5.4. In this case, we cannot run Algorithm 3 on \(\mathbf{A} = \|\mathbf{R}\|_2 \mathbf{I} - \mathbf{R}\), as the graph, and a fortiori its representative matrix R, has not yet been formed. What we can access efficiently are B = s(K(S, S)) and C = s(K(:, S)), as these require only a partial computation of the kernel and cost only \(\mathcal {O}(dnm)\). Note that s is a sparsification function that is applied on a subset of the kernel matrix. The following pseudo-code exemplifies how Nyström-based techniques can be used to approximate the first k eigenvectors U k associated with the normalized Laplacian matrix (i.e., here R = L n):

Algorithm 3b. Nyström for spectral clustering [85]

Input. The set of points P, the number of desired clusters k, a sampling set S of size m ≥ k

  1. Compute the sub-matrices \(\mathbf {B} = s(\mathbf {K}(\mathsf {S},\mathsf {S}))\in \mathbb {R}^{m\times m}\) and \(\mathbf{C} = s(\mathbf{K}(:,\mathsf{S}))\in \mathbb {R}^{n\times m}\), where s is a sparsification function.
  2. Let D r = diag(B1) be the m × m degree matrix.
  3. Compute the top k eigenvalues Σ k and eigenvectors Q k of \({\mathbf {D}}_r^{-{1}/{2}} \mathbf {B} {\mathbf {D}}_r^{-{1}/{2}}\).
  4. Set \( \tilde {\mathbf {Q}}_k = \mathbf {C} {\mathbf {D}}_r^{-{1}/{2}} {\mathbf {Q}}_k \boldsymbol {\Sigma }_k^{-1}\).
  5. Let \({\mathbf {D}}_l = \text{diag}( \tilde {\mathbf {Q}}_k \boldsymbol {\Sigma }_k \tilde {\mathbf {Q}}_k^\top \mathbf {1})\) be the n × n degree matrix.
  6. Compute \(\tilde {\mathbf {U}}_k\) by orthogonalizing \({\mathbf {D}}_l^{-{1}/{2}} \tilde {\mathbf {Q}}_k\).

Output: \(\tilde {\mathbf {U}}_k\), an approximation of the spectral embedding U k.
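A NumPy sketch of Algorithm 3b, assuming the sub-matrices B = s(K(S, S)) and C = s(K(:, S)) have already been computed from a partial kernel evaluation, could read:

```python
import numpy as np

def nystrom_spectral_embedding(B, C, k):
    """Sketch of Algorithm 3b for R = L_n.

    B = s(K(S,S)) is (m, m) and C = s(K(:,S)) is (n, m).  The degrees are only
    approximated (steps 2 and 5), as discussed in the bullet points below.
    """
    d_r = B.sum(1)                                              # step 2: sample degrees
    B_n = B / np.sqrt(np.outer(d_r, d_r))                       # D_r^{-1/2} B D_r^{-1/2}
    sig, Q = np.linalg.eigh(B_n)                                # ascending eigenvalues
    Q_k, sig_k = Q[:, -k:], sig[-k:]                            # step 3: top-k eigenpairs
    Q_tilde = C @ (Q_k / np.sqrt(d_r)[:, None]) / sig_k         # step 4
    d_l = (Q_tilde * sig_k) @ Q_tilde.sum(0)                    # step 5: approximate degrees
    U_tilde, _ = np.linalg.qr(Q_tilde / np.sqrt(d_l)[:, None])  # step 6: orthogonalize
    return U_tilde
```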

This algorithm runs in \(\mathcal {O}(nm \max (d,k))\) time, which is small when m depends mildly on the other parameters of interest. Nevertheless, the algorithm (and others like it) suffers from several issues:

  • Algorithm 3b attempts to use Nyström’s method on \(\mathbf {A}=2\mathbf {I} - {\mathbf {L}}_n=\mathbf {I}+{\mathbf {D}}^{-\frac {1}{2}} s(\mathbf {K}) {\mathbf {D}}^{-\frac {1}{2}}\) via the exact computation of two sub-matrices of K. In doing so, it makes two strong (and uncontrolled) approximations. First of all, the sparsification step (step 1 in Algorithm 3b) is applied to the sub-matrices K(S, S) and K(:, S), deviating from the correct sparsification procedure that takes into account the entire kernel matrix K. Second, the degree matrix D is never exactly computed as knowing it exactly would entail computing exactly s(K), which is precisely what we are trying to avoid. Existing methods thus rely on heuristic approximations of the degree in order to bypass this difficulty (see steps 2 and 5 of Algorithm 3b).

  • Since we do not have direct access to the kernel matrix, we cannot utilize advanced sampling methods such as leverage scores to draw the sampling set S. This is particularly problematic if (due to sparsification) matrices B and C are sparse, as for sparse matrices uniform sampling is known to perform poorly [97]. Techniques that rely on distances between columns do not fare much better. Landmark-based approaches commonly perform better in simple problems but suffer when the clusters are non-convex [19]. We refer the reader to the work by Mohan et al. [97] for more information on landmark-based methods. The latter work also describes an involved sampling scheme that is aimed at general (i.e., non-convex) clusters.

For the reasons highlighted above, the low-rank approximation guarantees accompanying the classical Nyström method cannot be directly used here. A fortiori, it is an open question how much the quality of the spectral clustering solution is affected by using the centroids obtained by running k-means on \(\tilde {\mathbf {U}}_k\).

Column Sampling

Akin in spirit to Nyström methods, an alternative approach to accelerating spectral clustering was inspired by column sampling low-rank approximation techniques [37, 43].

An instance of such algorithms was put forth under the name of cSPEC (column sampling spectral clustering) by Wang et al. [141]. Let \(\mathbf {C} = {\mathbf {U}}_{C} \boldsymbol {\Sigma }_C {\mathbf {V}}_C^\top \) be the singular value decomposition of the n × m matrix C = s(K(:, S)). Then, matrices

$$\displaystyle \begin{aligned} \tilde{\boldsymbol{\Sigma}} = \sqrt{ \frac{n}{m}} \, \boldsymbol{\Sigma}_C \quad \text{and} \quad \tilde{\mathbf{U}} = \mathbf{C} {\mathbf{V}}_C \boldsymbol{\Sigma}_C^{+}\end{aligned}$$

are interpreted as an approximation of the actual eigenvalues and eigenvectors of K and thus U k can be substituted by the first k columns of \(\tilde {\mathbf {U}}\). This algorithm runs in \(\mathcal {O}(ndm+nm^2)\).
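In code, cSPEC reduces to a thin SVD of the sampled columns; a sketch, assuming C = s(K(:, S)) is available and ignoring near-zero singular values:

```python
import numpy as np

def cspec_embedding(C, k):
    """Column-sampling (cSPEC-style) approximation of the kernel's top eigenvectors.

    C = s(K(:,S)) is the (n, m) matrix of sampled columns."""
    n, m = C.shape
    U_C, sig_C, VT_C = np.linalg.svd(C, full_matrices=False)   # C = U_C Sigma_C V_C^T
    Sigma_tilde = np.sqrt(n / m) * sig_C                       # approximate eigenvalues of K
    U_tilde = C @ VT_C.T / sig_C                               # C V_C Sigma_C^+ (pseudo-inverse)
    return U_tilde[:, :k], Sigma_tilde[:k]
```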

The authors of [29] propose a hybrid method between column sampling and the representative-based methods discussed in Sect. 5.3.3, based on the following approximate factorization of the data matrix:

$$\displaystyle \begin{aligned} ({\mathbf{p}}_1|\ldots|{\mathbf{p}}_n) \simeq \mathbf{F} \mathbf{Z} \in\mathbb{R}^{d\times n}, \end{aligned} $$
(5.9)

where \(\mathbf {F}\in \mathbb {R}^{d\times m}\) concatenates the feature vectors of m sampled points and \(\mathbf {Z}\in \mathbb {R}^{m\times n}\) represents all unsampled points as approximate linear combinations of the representatives, computed via sparse coding techniques [82]. The SVD of \(\tilde {\mathbf {D}}^{-1/2}\mathbf {Z}\), with \(\tilde {\mathbf {D}}\) the diagonal matrix of the row-sums of Z, is then computed to obtain an approximation \(\tilde {\mathbf {U}}_k\) of U k. The complexity of their algorithm is also \(\mathcal {O}(ndm+nm^2)\).

In these methods, the choice of the sample set S is, of course, central and has been much debated. Popular options are to sample uniformly at random or via better-tailored probability distributions, via a first k-means (with k = m) pass on P, or via other selective sampling methods. Also, as with most extensions of Nyström’s method to spectral clustering, column sampling methods for spectral clustering do not come with end-to-end approximation guarantees on U k.

In the world of low-rank matrix approximation the situation is somewhat more advanced. Recent work in column sampling utilizes adaptive sampling with leverage scores in time \(\mathcal {O}(e + n \text{poly}(k))\), or uniformly i.i.d. after preconditioning by a fast randomized Hadamard transform [41, 145]. Others have also used a correlated version called volume sampling to obtain column indices [37]. Nevertheless, this literature extends beyond the scope of this chapter and thus we invite the interested reader to consider the aforementioned references for a more in-depth perspective.

5.3.2 Random Fourier Features

Out of several sketching techniques one could a priori use to accelerate spectral clustering, we focus on random Fourier features (RFF) [110]: a method that samples in the Fourier space associated with the original feature space. Even though RFFs were originally developed to approximate a kernel matrix K in time linear in n instead of the quadratic time necessary for its exact computation, they can in fact be used to obtain an approximation \(\tilde {\mathbf {U}}_k\) of the exact spectral embedding U k.

Let us denote by κ the RBF kernel, i.e., \(\kappa (\mathbf {t})=\exp (-{\mathbf {t}}^2/\sigma ^2)\), whose Fourier transform is:

$$\displaystyle \begin{aligned} \hat{\kappa}(\boldsymbol{\omega})=\int_{\mathbb{R}^d}\kappa(\mathbf{t})\exp^{-i\boldsymbol{\omega}^\top \mathbf{t}}\mbox{d}\mathbf{t}. \end{aligned} $$
(5.10)

The above takes real values as κ is symmetric. One may write:

$$\displaystyle \begin{aligned} \kappa(\mathbf{p},\mathbf{q})=\kappa(\mathbf{p}-\mathbf{q})=\frac{1}{Z}\int_{\mathbb{R}^d}\hat{\kappa}(\boldsymbol{\omega})\exp^{i\boldsymbol{\omega}^\top(\mathbf{p}-\mathbf{q})}\mbox{d}\boldsymbol{\omega}, \end{aligned} $$
(5.11)

where, in order to ensure that κ(p, p) = 1, the normalization constant is set to \(Z=\int _{\mathbb {R}^d}\hat {\kappa }(\boldsymbol {\omega })\mbox{d}\boldsymbol {\omega }\). According to Bochner’s theorem, and due to the fact that κ is positive-definite, \(\hat {\kappa }/Z\) is a valid probability density function. κ(p, q) may thus be interpreted as the expected value of \(\exp ^{i\boldsymbol {\omega }^\top (\mathbf {p}-\mathbf {q})}\) provided that ω is drawn from \(\hat {\kappa }/Z\):

$$\displaystyle \begin{aligned} \kappa(\mathbf{p},\mathbf{q})=\mathbb{E}_{\boldsymbol{\omega}} \left(\exp^{i\boldsymbol{\omega}^\top(\mathbf{p}-\mathbf{q})}\right) \end{aligned} $$
(5.12)

Drawing ω from the distribution \(\hat {\kappa }/Z\) is equivalent to drawing independently each of its d entries according to the normal law of mean 0 and variance 2∕σ 2. Indeed: \(\hat {\kappa }(\boldsymbol {\omega })=\pi ^{d/2}\sigma ^d\exp (-\sigma ^2\boldsymbol {\omega }^2/4)\) and \(Z=\int _{\mathbb {R}^d}\hat {\kappa }(\boldsymbol {\omega })\mbox{d}\boldsymbol {\omega }=(2\pi )^d\), leading to

$$\displaystyle \begin{aligned}\frac{\hat{\kappa}(\boldsymbol{\omega})}{Z}=\left(\frac{\sigma}{2\sqrt{\pi}}\right)^d\exp^{-\sigma^2\boldsymbol{\omega}^2/4}.\end{aligned}$$

In practice, we draw independently m such vectors ω to obtain the set of sampled frequencies Ω = (ω 1, …, ω m). For each data point p i, and given this set of samples Ω, we define the associated random Fourier feature vector:

$$\displaystyle \begin{aligned} \boldsymbol{\psi}_i=\frac{1}{\sqrt{m}} [\cos{}(\boldsymbol{\omega}_1^\top {\mathbf{p}}_i)|\cdots|\cos{}(\boldsymbol{\omega}_m^\top {\mathbf{p}}_i)| \sin{}(\boldsymbol{\omega}_1^\top {\mathbf{p}}_i)|\cdots|\sin{}(\boldsymbol{\omega}_m^\top {\mathbf{p}}_i)]^\top\in\mathbb{R}^{2m}, \end{aligned} $$
(5.13)

and call \(\boldsymbol {\varPsi }=\left (\boldsymbol {\psi }_1|\cdots |\boldsymbol {\psi }_n\right )\in \mathbb {R}^{2m\times n}\) the RFF matrix. Other embeddings are possible in the RFF framework, but this one was shown to be the most appropriate for the Gaussian kernel [127]. As m increases, \(\boldsymbol {\psi }_i^\top \boldsymbol {\psi }_j\) concentrates around its expected value κ(p i, p j): \(\boldsymbol {\psi }_i^\top \boldsymbol {\psi }_j\simeq \kappa ({\mathbf {p}}_i,{\mathbf {p}}_j)\). Proposition 1 of [127] states the tightness of this concentration: it shows that the approximation starts to be valid with high probability for \(m\geq \mathcal {O}(d\log {d})\). The Gaussian kernel matrix is thus well approximated as \(\mathbf{K} \simeq \boldsymbol{\varPsi}^\top \boldsymbol{\varPsi}\). With such a low-rank approximation of K, one can estimate the degrees, degree-normalize Ψ to obtain a low-rank approximation of the normalized Laplacian L n, and perform an SVD to directly obtain an approximation \(\tilde {\mathbf {U}}_k\) of the spectral embedding U k. The total cost to obtain this approximation is \(\mathcal {O}(ndm + nm^2)\). These ideas were developed in Refs. [31, 146], for instance.
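A sketch of the RFF construction of (5.13) for the RBF kernel; the number of frequencies m and the bandwidth σ are assumptions to be set by the user.

```python
import numpy as np

def rff_matrix(P, m, sigma, seed=0):
    """Random Fourier feature matrix Psi of (5.13), so that K ~= Psi^T Psi."""
    n, d = P.shape
    rng = np.random.default_rng(seed)
    Omega = rng.normal(scale=np.sqrt(2.0) / sigma, size=(d, m))   # omega_j ~ N(0, 2/sigma^2 I)
    Z = P @ Omega                                                 # Z[i, j] = omega_j^T p_i
    Psi = np.concatenate([np.cos(Z), np.sin(Z)], axis=1).T / np.sqrt(m)
    return Psi                                                    # shape (2m, n), columns psi_i
```

One way to proceed is then to estimate the degrees as \(\boldsymbol{\varPsi}^\top(\boldsymbol{\varPsi}\mathbf{1})\) in \(\mathcal{O}(nm)\) operations, degree-normalize Ψ, and take its SVD to obtain \(\tilde{\mathbf{U}}_k\), as described above.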

As in Nyström methods however, the concentration guarantees of RFFs for K do not extend to the degree-normalized case; moreover, the sparsification step 1b of spectral clustering is ignored. Note that improving over RFFs in terms of efficiency and concentration properties is the subject of recent research (see, for instance, [81]).

5.3.3 The Paradigm of Representative Points

The methods detailed here sample in the original feature space and directly obtain a control on the misclustering rate due to the sampling process. They are based on the following framework:

  1. Sample m so-called representatives.
  2. Run spectral clustering on the representatives.
  3. Lift the solution back to the entire dataset.

Let us illustrate this with the example of KASP:

Algorithm 4. KASP: k-means-based approximate spectral clustering [147]

Input. A set of n points P = (p 1, p 2, …, p n) in dimension d, a number of desired clusters k, and a number of representatives m.

  1. Perform k-means with k = m on P and obtain:
     (a) the cluster centroids Y = (y 1, …, y m) as the m representative points;
     (b) a correspondence table associating each p i to its nearest representative.
  2. Run spectral clustering on Y to get the cluster membership of each y i.
  3. Lift the cluster membership to each p i by looking up the cluster membership of its representative in the correspondence table.

Output: k clusters
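A compact version of KASP, assuming scikit-learn is available for the two clustering sub-routines (any implementation of Algorithm 1 could replace SpectralClustering in step 2):

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def kasp(P, k, m, seed=0):
    """KASP (Algorithm 4): collapse the n points onto m k-means representatives,
    run spectral clustering on the representatives, then lift the labels back."""
    coarse = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(P)
    Y = coarse.cluster_centers_                     # the m representative points
    assignment = coarse.labels_                     # correspondence table p_i -> representative
    rep_labels = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                    n_neighbors=min(10, m - 1),
                                    random_state=seed).fit_predict(Y)
    return rep_labels[assignment]                   # lift memberships to all n points
```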

The complexity of KASP is bounded by \(\mathcal {O}(mdnt+ m^3)\). For a summary of the analysis given in [147], let us consider the cluster memberships given by exact spectral clustering on P as well as the memberships given by exact spectral clustering on \(\tilde {\mathsf {P}}=({\mathbf {p}}_1+\epsilon _1, \ldots , {\mathbf {p}}_n+\epsilon _n)\), where the 𝜖 i are any small perturbations of the initial points. Let us denote by L n (resp. \(\tilde {\mathbf {L}}_n\)) the normalized Laplacian matrix of the similarity graph on P (resp. \(\tilde {\mathsf {P}}\)). The analysis concentrates on the study of the misclustering rate ρ:

$$\displaystyle \begin{aligned} \rho=\frac{\#\,\, \text{of points with different memberships}}{n}. \end{aligned} $$
(5.14)

The main result, building upon preliminary work in [63], stems from a perturbation approach and reads:

Theorem 5.3

Under the assumptions of Theorem 3 in [147], the misclustering rate ρ is upper bounded by a function of \(\|{\mathbf {L}}_n-\tilde {\mathbf {L}}_n\|_F\) and of g 0, a value depending on the spectral gap. Also, under the assumptions of Theorem 6 in [147], one has, with high probability:

$$\displaystyle \begin{aligned} \|{\mathbf{L}}_n-\tilde{\mathbf{L}}_n\|{}_F\leq \mathcal{O}\left(\sigma_{\epsilon}^{(2)} + \sigma_{\epsilon}^{(4)}\right), \end{aligned} $$
(5.15)

with \(\sigma _{\epsilon }^{(2)}\) and \(\sigma _{\epsilon }^{(4)}\) the second and fourth moments of the perturbation norms \(\|\boldsymbol{\epsilon}_i\|_2\).

Combining both bounds, one obtains an upper bound on the misclustering rate that depends on the second and fourth moments of the perturbation norms \(\|\boldsymbol{\epsilon}_i\|_2\). The “collapse” of points onto the m representative points, interpreted as a perturbation of the original points, should thus tend to minimize these two moments, leading the authors to propose distortion-minimizing algorithms, such as KASP. A very similar algorithm, eSPEC, is described in [141].

5.3.4 Other Methods

5.3.4.1 Approximate Nearest Neighbor Search Algorithms

The objective here is to approximate the nearest neighbor graph efficiently. Even though these methods are not necessarily based on sampling, we include them in the discussion as they are frequently used in practice.

Given the feature vectors \({\mathbf {p}}_1, \ldots , {\mathbf {p}}_n\in \mathbb {R}^d\) and a query point \(\mathbf {q}\in \mathbb {R}^d\), the exact nearest neighbor search (exact NNS) associated with P and q is \({\mathbf{p}}^* = \text{argmin}_{\mathbf{p}\in\mathsf{P}}\,\text{dist}(\mathbf{q}, \mathbf{p})\), where dist stands for any distance. Different distances are possible depending on the choice of kernel κ. We will here consider the Euclidean norm as it enters the definition of the classical RBF kernel. Computing the exact NNS costs \(\mathcal {O}(dn)\). The goal of the approximate NNS field of research is to provide faster algorithms that have the following control on the error.

Definition 5.1

A point \({\mathbf{p}}^{*}\in\mathsf{P}\) is an 𝜖-approximate nearest neighbor of query \(\mathbf {q}\in \mathbb {R}^d\) if

$$\displaystyle \begin{aligned}\forall \mathbf{p}\in\mathsf{P}\qquad \text{dist}(\mathbf{q}, {\mathbf{p}}^{*})\leq(1 +\epsilon)\,\text{dist}(\mathbf{q}, \mathbf{p}).\end{aligned}$$

For 𝜖 = 0, this reduces to exact NNS.

Extensions of this objective to the k-nearest neighbor goal are considered in the NNS literature. A k-nearest neighbor graph can then be constructed simply by running an approximate k-NNS query for each object p i. Thus, approximate NNS algorithms are interesting candidates to approximate the adjacency matrix of the nearest-neighbor affinity graph that we need in step 1 of spectral clustering. Many algorithms exist, their respective performances depending essentially on the dimension d of the feature vectors. According to [9], randomized k-d forests as implemented in the library FLANN [98] are considered state of the art for dimensions of around 100, whereas methods based on balanced box decomposition (BBD) [4, 7] are known to perform well for d roughly smaller than 100. In high dimensions, to avoid the curse of dimensionality, successful approaches are, for instance, based on hashing methods (such as locality sensitive hashing (LSH) [5] and product quantization (PQ) [66]) or on k-d generalized random forests [9]. Finally, proximity graph methods that sequentially improve over a first coarse approximation of the k-NN graph (or of other graph structures such as navigable graphs) have received considerable attention recently and are becoming state of the art in regimes where the quality of approximation is paramount (see, for instance, [8, 40, 51, 94]). Such tools come with various levels of guarantees and computation costs, the details of which are not in the scope of this chapter.

Experimentally, to obtain an approximate k-NN graph with a typical recall rate of 0.9, these algorithms are observed to achieve a complexity of \(\mathcal {O}(dn^\alpha )\) with α close to 1 (α ≃ 1.1 in [40], for instance).

5.3.4.2 Feature Selection and Feature Projection

Some methods work on reducing the factor d in the complexity \(\mathcal {O}(dn^2)\) of the kernel computation via feature selection, i.e., the sampling of features deemed more useful for the underlying clustering task, or via feature projection, i.e., the projection on usually random subspaces of dimension d′ < d. Feature selection methods are usually designed to improve the classification by removing features that are too noisy or useless for the classification. We thus do not detail these methods further as they are not approximation algorithms per se. The interested reader will find some entries in the literature via references [25, 35, 60, 149]. Projection methods use random projections of the original points P onto spaces of dimension \(d'\sim \log {n}\) in order to take advantage of the Johnson–Lindenstrauss lemma on norm conservation: the kernel computed from the projected features in dimension d′ is thus an approximation of the true kernel with high probability. We refer to the works [64, 116] for more details.
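Such a Gaussian random projection is a one-liner; the target dimension d′ below is an assumption guided by the Johnson–Lindenstrauss lemma.

```python
import numpy as np

def gaussian_projection(P, d_prime, seed=0):
    """Project the (n, d) matrix of points onto d' random directions; pairwise
    distances are approximately preserved with high probability."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(P.shape[1], d_prime)) / np.sqrt(d_prime)
    return P @ R
```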

5.4 Sampling Given the Similarity Graph

We now suppose that the similarity graph is either given (e.g., in cases where the original data is a graph) or has been well approximated (by approximate k-NN search, for instance) and concentrate on sampling-based methods that aim to reduce the cost of computing the first k eigenvectors of R.

These methods predominantly aim to approximate R by a smaller matrix \(\tilde {\mathbf {R}}\) of size m. The eigen-decomposition is done in \(\mathbb {R}^m\) which can be significantly cheaper when m ≪ n. In addition, each method comes with a fast way of lifting vectors from \(\mathbb {R}^m\) back to \(\mathbb {R}^n\) (this is usually a linear transformation). After lifting, the eigenvectors of \(\tilde {\mathbf {R}}\) are used as a proxy for those of R.

Unlike the previous section, where a strong approximation guarantee of the exact embedding U k by an efficiently computed \(\tilde {\mathbf {U}}_k\) was a distant and difficult goal to achieve in itself, we will see in this section that the knowledge of the similarity graph not only enables one to obtain such strong approximation guarantees, but also makes it possible to control how the error on U k transfers to an error on the k-means cost.

To be more precise, recall (5.1) defining the k-means cost f(C;X) associated with the n points X = (x 1, …, x n) and a centroid set C. Now, suppose that we have identified a set of n points \(\tilde {\mathsf {X}}=(\tilde {\mathbf {x}}_1|\ldots |\tilde {\mathbf {x}}_n)\) that is meant to approximate the exact spectral embedding X. Moreover, let \(\mathsf {C}^*\) (resp. \(\tilde {\mathsf {C}}^*\)) be the optimal set of k centroids minimizing the k-means cost on X (resp. \(\tilde {\mathsf {X}}\)). We will see that several (though not all) approximation methods of this section achieve an end-to-end approximation guarantee of the form

$$\displaystyle \begin{aligned} \left| f(\mathsf{C^*}; \mathsf{X})^{\mathsf{{1}/{2}}} - f(\tilde{\mathsf{C}}^*; \mathsf{X})^{\mathsf{{1}/{2}}} \right| \leq \epsilon, \end{aligned}$$

for some small 𝜖 with—at least—constant probability. Such an end-to-end guarantee is indeed more desirable than a simple guarantee on the distance between U k and \(\tilde {\mathbf {U}}_k\): it informs us on the approximation quality of the attained clustering.

5.4.1 Nyström-Based Methods

The Nyström-based methods discussed in Sect. 5.3.1 are also applicable here. Let us concentrate on the choice R = L n to illustrate the main ideas. As explained in Sect. 5.3.1, the trailing k eigenvectors U k of L n can be interpreted as the top k eigenvectors of the PSD matrix A = 2I −L n. As such, the span of the top-k eigenvectors of \(\tilde {\mathbf {A}}_k\), \(\text{span}(\tilde {\mathbf {U}}_k)\), obtained by running Algorithm 3 on A should approximate the span of U k. Now, how does one go from Nyström theorems such as Theorem 5.2 to error bounds on the k-means cost function? The first step towards an end-to-end guarantee relies on the following result:

Lemma 1 (See the Proof of Theorem 6 in [21])

Denote by \(\tilde {\mathsf {C}}^*\) the optimal centroid set obtained by solving k-means on the rows of \(\tilde {\mathbf {U}}_k\) . It holds that

$$\displaystyle \begin{aligned} \left| f(\mathsf{C^*}; \mathsf{X})^{\mathsf{{1}/{2}}} - f(\tilde{\mathsf{C}}^*; \mathsf{X})^{\mathsf{{1}/{2}}} \right| \leq 2 \|\mathbf{{E}}\|{}_F, {} \end{aligned} $$
(5.16)

where \( \mathbf {{E}}={\mathbf {U}}_k {\mathbf {U}}_k^\top - \tilde {\mathbf {U}}_k\tilde {\mathbf {U}}_k^\top . \)

This means that the error made by considering the optimal k-means solution based on \(\tilde {\mathbf {U}}_k\) (instead of U k) is controlled by the Frobenius norm of the projector difference \(\mathbf {E}={\mathbf {U}}_k {\mathbf {U}}_k^\top - \tilde {\mathbf {U}}_k\tilde {\mathbf {U}}_k^\top \). Furthermore, sinceFootnote 14 \(\|\mathbf {{E}}\|{ }_F\leq \sqrt {2k}\|\mathbf {{E}}\|{ }_2\) and \(\|\mathbf {{E}}\|{ }_2=\|\sin {}(\varTheta ({\mathbf {U}}_k, \tilde {\mathbf {U}}_k))\|{ }_2\), we can apply the Davis–Kahan \(\sin \varTheta \) perturbation theorem (see, for instance, Section VII of [16]) and, provided that \(\sigma _k-\tilde {\sigma }_{k+1}>0\), obtain:

$$\displaystyle \begin{aligned}\|\mathbf{{E}}\|{}_F\leq\sqrt{2k}\|\mathbf{{E}}\|{}_2\leq \sqrt{2k} \, \frac{\|\mathbf{A}-\tilde{\mathbf{A}}\|{}_2}{\sigma_k-\tilde{\sigma}_{k+1}},\end{aligned}$$

where {σ i} (resp. \(\{\tilde {\sigma }_i\}\)) are the singular values of A (resp. \(\tilde {\mathbf {A}}\)) ordered decreasingly.Footnote 15 The final bound is obtained by combining the above with the leverage score sampling bound given by Theorem 5.2.

Theorem 5.4

Let \(\tilde {\mathbf {U}}_k\) be the eigenvectors obtained by running Algorithm 3 on A = 2I −L n (with the leverage score sampling scheme for the m samples S of step 1). Denote by \(\tilde {\mathsf {C}}^*\) the optimal centroid set obtained by solving k-means on the rows of \(\tilde {\mathbf {U}}_k\) . Then, for some constant C > 1, we have

$$\displaystyle \begin{aligned} \left| f(\mathsf{C^*}; \mathsf{X})^{\mathsf{{1}/{2}}} - f(\tilde{\mathsf{C}}^*; \mathsf{X})^{\mathsf{{1}/{2}}} \right| \leq 2 \frac{\sqrt{2k}}{\sigma_k-\tilde{\sigma}_{k+1}}\left(\sigma_{k+1}(\mathbf{A}) + \frac{Ck\log{(k/\delta)}}{m}\sum_{j=k+1}^n \sigma_j\right) \notag \end{aligned} $$

with probability at least 0.8 − 2δ.

Examining the above bound, one notices that \(2 \sqrt {2k} \frac {\sigma _{k+1}(\mathbf {A})}{\sigma _k-\tilde {\sigma }_{k+1}}\) is independent of the number of samples. The incompressibility of this error term stems from A being (in general) different from its best rank-k approximation. On the other hand, all remaining error terms can be made independent of k and n by setting

$$\displaystyle \begin{aligned}m=\mathcal{O}\left(k\sqrt{k}\log{k}\sum_{j=k+1}^n \frac{\sigma_j}{\sigma_k-\tilde{\sigma}_{k+1}}\right).\end{aligned}$$

This end-to-end guarantee is not satisfactory for several reasons. First of all, it relies on the assumption \(\sigma _k>\tilde {\sigma }_{k+1}\), which is not necessarily true. Moreover, the Davis–Kahan theorem could in theory guarantee \(\|\mathbf {E}\|{ }_2\leq {\|{\mathbf {A}}_k-\tilde {\mathbf {A}}_k\|{ }_2/\sigma _k}\) and \(\|\mathbf {E}\|{ }_2\leq {\|\mathbf {A}-\tilde {\mathbf {A}}_k\|{ }_2}/{\sigma _k}\), which are stronger than the bound depending on \(\|\mathbf {A}-\tilde {\mathbf {A}}\|{ }_2\) that we used. Unfortunately, Nyström approximation theorems do not give controls on \(\|{\mathbf {A}}_k-\tilde {\mathbf {A}}_k\|{ }_2\) nor on \(\|\mathbf {A}-\tilde {\mathbf {A}}_k\|{ }_2\), impeding tighter end-to-end bounds.

5.4.2 Graph Coarsening

Inspired by algebraic multi-grid methods, researchers realized early on that a natural way to accelerate spectral clustering is graph coarsening [38, 61, 69]. Here, instead of solving the clustering problem directly on G, one may first reduce it to a coarser graph G c consisting of m ≪ n nodes using a multi-level graph coarsening procedure. The expensive eigen-decomposition is then carried out at a lower cost on the representative matrix of the small graph, and the final spectral embedding is obtained by inexpensively lifting and refining the result.

In the notation of [91], coarsening involves a sequence of c + 1 graphs

$$\displaystyle \begin{aligned} G = G_0 = (V_0, E_0, {\mathbf{W}}_0) \quad G_1 = (V_1, E_1, {\mathbf{W}}_1) \quad \cdots \quad G_{\textit{c}} = (V_{\textit{c}}, E_{\textit{c}}, {\mathbf{W}}_{\textit{c}}) \end{aligned} $$
(5.17)

of decreasing size n = n 0 > n 1 > ⋯ > n c = m, where each vertex of \(G_{\ell }\) represents one or more vertices of \(G_{\ell -1}\). To express coarsening in algebraic form, we suppose that L(G 0) = L is the combinatorial Laplacian associated with G. We then obtain L(G c) by applying the following repeatedly:

$$\displaystyle \begin{aligned} \mathbf{L}(G_{\ell}) = {\mathbf{P}}_\ell^{\mp} \mathbf{L}(G_{\ell-1}) {\mathbf{P}}_\ell^+, \end{aligned} $$
(5.18)

where \({\mathbf {P}}_\ell \in \mathbb {R}^{n_\ell \times n_{\ell -1}}\) is a matrix with more columns than rows, \(\ell = 1, 2, \ldots , c\) is the level of the reduction, and the symbol ∓ denotes the transposed pseudo-inverse. An eigenvector \(\tilde {\mathbf {u}} \in \mathbb {R}^{m}\) of L(G c) is lifted back to \(\mathbb {R}^n\) by the backwards recursion

$$\displaystyle \begin{aligned}\tilde{\mathbf{u}}_{\ell-1} = {\mathbf{P}}_{\ell} \tilde{\mathbf{u}}_{\ell},\end{aligned}$$

where \(\tilde {\mathbf {u}}_{\textit {c}} = \tilde {\mathbf {u}}.\)

Matrices P 1, P 2, …, P c are determined by the transformation performed at each level. Specifically, one should define for each level \(\ell \) a surjective map \(\varphi _\ell : V_{\ell -1} \to V_{\ell }\) between the original vertex set \(V_{\ell -1}\) and the smaller vertex set \(V_{\ell }\). We refer to the set of vertices \(V_{\ell -1}^{(r)} \subseteq V_{\ell -1}\) mapped onto the same vertex \(v_r^{\prime }\) of \(V_{\ell }\) as a contraction set:

$$\displaystyle \begin{aligned}V_{\ell-1}^{(r)} = \{v \in V_{\ell-1} : \varphi_{\ell}(v) = v^{\prime}_r \}\end{aligned}$$

It is easy to deduce from the above that contraction sets induce a partitioning of \(V_{\ell -1}\) into \(n_{\ell }\) subgraphs, each corresponding to a single vertex of \(V_{\ell }\).

Then, for any \(v^{\prime }_r \in V_{\ell }\) and \(v_i \in V_{\ell -1}\), matrices \({\mathbf {P}}_{\ell } \in \mathbb {R}^{n_{\ell } \times n_{\ell -1}}\) and \({\mathbf {P}}^+_{\ell } \in \mathbb {R}^{n_{\ell -1} \times n_{\ell }}\) are given by:

$$\displaystyle \begin{aligned} {\mathbf{P}}_{\ell}(r, i) = \begin{cases} \frac{1}{|V_{\ell-1}^{(r)}|} & \text{if } v_i \in V_{\ell-1}^{(r)},\\ 0 & \text{otherwise,} \end{cases} \qquad {\mathbf{P}}^+_{\ell}(i, r) = \begin{cases} 1 & \text{if } v_i \in V_{\ell-1}^{(r)},\\ 0 & \text{otherwise.} \end{cases} \end{aligned}$$

The preceding construction is the only one that guarantees that every \(\mathbf {L}(G_{\ell })\) will be the combinatorial Laplacian associated with \(G_{\ell }\) [90].

Note that from a computational perspective the reduction is very efficient and can be carried out in linear time: each coarsening level entails multiplication by a sparse matrix, meaning that \(\mathcal {O}(e)\) and \(\mathcal {O}(n)\) operations suffice, respectively, to coarsen L and lift any vector (such as the eigenvectors of L(G c)) from \(\mathbb {R}^{m}\) back to \(\mathbb {R}^n\).
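The sketch below implements one level of this reduction following the construction of P ℓ and P ℓ + given above, together with the corresponding lifting step; the function names are ours and dense NumPy arrays are used only for readability (sparse matrices should be used in practice).

    import numpy as np

    def coarsen_one_level(L, contraction_sets):
        """One coarsening level, L_coarse = P^{-T} L P^+, cf. (5.18).

        `contraction_sets` is a list of lists of node indices partitioning
        {0, ..., n-1}. The averaging/membership construction of P and P^+
        below is the one given in the text, which preserves combinatorial
        Laplacians.
        """
        n = L.shape[0]
        m = len(contraction_sets)
        P = np.zeros((m, n))        # coarsening matrix, one row per coarse vertex
        Pplus = np.zeros((n, m))    # its pseudo-inverse, used for lifting
        for r, nodes in enumerate(contraction_sets):
            P[r, nodes] = 1.0 / len(nodes)
            Pplus[nodes, r] = 1.0
        # for this construction, the transposed pseudo-inverse P^{-T} equals Pplus.T
        L_coarse = Pplus.T @ L @ Pplus
        return L_coarse, P, Pplus

    def lift(Pplus, u_coarse):
        """Lift a coarse vector back to the finer graph (one level)."""
        return Pplus @ u_coarse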

5.4.2.1 Coarsening for Spectral Clustering

Using coarsening effectively boils down to determining, for each level \(\ell \), how to partition \(G_{\ell -1}\) into \(n_{\ell }\) contraction sets \(V_\ell ^{(1)}, \ldots , V_\ell ^{(n_{\ell })}\), such that, after lifting, the first k eigenvectors \(\tilde {\mathbf {U}}_k\) of L(G c) approximate the spectral embedding U k derived from L. Alternatively, one may also solve the k-means problem in the small dimension and only lift the resulting cluster assignments [38]. This scheme is computationally superior but we will not discuss it here as it does not come with any guarantees.

Perhaps the simplest (and most common) method of forming contraction sets is the heavy-edge matching heuristic, originally developed in the multi-grid literature and first considered for graph partitioning in [69]. This method is based on the intuition that the larger the weight of an edge, the less likely it is that the vertices it connects reside in different clusters. We should therefore aim to contract pairs of vertices connected by a heavy edge (i.e., of large weight) first. Let us consider this case further. By focusing on edges, we constrain every contraction set \(V_\ell ^{(r)}\) to contain either two nodes connected by an edge or a single node, the latter signifying that said node remains as is in the coarser graph. As such, we can reformulate the problem of selecting contraction sets at each level as that of selecting the largest number of edges (to attain the largest reduction), while also striving to make the cumulative sum of selected edge weights as large as possible (giving preference to heavy edges). This is exactly the maximum weight matching problem, which can be approximated in linear time [44].

A plethora of numerical evidence motivates the use of matching-based coarsening methods, such as the heavy-edge heuristic, for accelerating spectral clustering [38, 69, 115]. From a theoretical perspective, the approximation quality of matching-based methods was characterized in [91]. Therein, the matching was constructed in the following randomized manner:

Algorithm 5. Randomized edge contraction (one level) [91]

Input. A graph G = (V, E)

  1. Associate with each e ij ∈E a probability p ij > 0.

  2. While |E| > 0:

    (a) Draw a sample e ij from E with probability ∝p ij.

    (b) Remove from E both e ij as well as all edges sharing a common endpoint with it.

    (c) Construct contraction set (v i, v j).

Output: Contraction sets
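For concreteness, a direct (unoptimized) Python transcription of Algorithm 5 with the heavy-edge potential p ij ∝ w ij could look as follows; the function name and data layout are ours, and the quadratic-time edge removal is kept only for readability.

    import numpy as np

    def randomized_edge_contraction(edges, weights, rng=None):
        """One level of Algorithm 5 with p_ij proportional to w_ij.

        `edges` is a list of (i, j) pairs and `weights` the matching array
        of positive edge weights. Returns the contraction sets (matched
        pairs); unmatched nodes remain as singletons in the coarser graph.
        """
        rng = np.random.default_rng() if rng is None else rng
        edges = list(edges)
        weights = np.asarray(weights, dtype=float)
        alive = np.ones(len(edges), dtype=bool)
        contraction_sets = []
        while alive.any():
            p = np.where(alive, weights, 0.0)
            p = p / p.sum()
            e = rng.choice(len(edges), p=p)   # draw an edge with prob. proportional to w_ij
            i, j = edges[e]
            contraction_sets.append([i, j])
            # remove the drawn edge and all edges sharing an endpoint with it
            for f, (a, b) in enumerate(edges):
                if alive[f] and ({a, b} & {i, j}):
                    alive[f] = False
        return contraction_sets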

The following approximation result is known:

Theorem 5.5 (Corollary 5.1 in [91])

Consider a graph with bounded degrees d i ≪ n and \(\lambda _k \leq \min _{e_{ij} \in E} \left \lbrace \frac {d_i + d_j}{2}\right \rbrace \) . Suppose that the graph is coarsened by Algorithm 5, using a heavy-edge potential such that p ij ∝ w ij . For sufficiently large n, a single level, and δ > 0,

$$\displaystyle \begin{aligned} \left| f(\mathsf{C^*}; \mathsf{X})^{\mathsf{{1}/{2}}} - f(\tilde{\mathsf{C}}^*; \mathsf{X})^{\mathsf{{1}/{2}}}\right| &= \mathcal{O}\left( \sqrt{\frac{1 - \frac{m}{n}}{\delta} \frac{\sum_{\ell=2}^k \lambda_\ell }{\lambda_{k+1} - \lambda_{k}} } \right) \notag \end{aligned} $$

with probability at least 1 − δ. Above, \(\tilde {\mathsf {C}}^*\) is the optimal k-means solution when using the lifted eigenvectors of L c as a spectral embedding.

We deduce that coarsening works better when the spectral clustering problem is easy (as quantified by the weighted gap \( \sum _{\ell =2}^k \lambda _\ell / (\lambda _{k+1} - \lambda _{k})\)) and that the squared error grows linearly with the reduction ratio 1 − m∕n.

There also exist more advanced techniques for selecting contraction sets that come with stronger guarantees w.r.t. the attained reduction and quality of approximation, but whose running time is not smaller than that of spectral clustering [90]. In particular, these methods also work with the normalized Laplacian and can be used to achieve multi-level reduction. Roughly, their strategy is to identify and contract sets S ⊂ V  for which x(i) ≈x(j) for all vectors x ∈U k and v i, v j ∈S. This strategy ensures that the best partitionings of G are preserved by coarsening. We will not expand on these methods here as they do not aim to improve the running time of spectral clustering.

5.4.3 Other Approaches

In the following, we present two additional approaches for approximately computing spectral embeddings. The former can be interpreted as a sampling-based method (but in a different manner than the techniques discussed so far), whereas the latter is only vaguely linked to sampling. Nevertheless, we find that both techniques are very interesting and merit a brief discussion.

5.4.3.1 Spectral Sparsification

This approach is best suited for cases when the input of spectral clustering is directly a graph.Footnote 16 Different from the methods discussed earlier, here the aim is to identify a Laplacian matrix \(\tilde {\mathbf {L}}\) of the same size as L but with fewer entries. Additionally, it should be ensured that

$$\displaystyle \begin{aligned} \frac{1}{1 + \epsilon} {\mathbf{x}}^\top \mathbf{L} \mathbf{x} \leq {\mathbf{x}}^\top \tilde{\mathbf{L}} \mathbf{x} \leq (1 + \epsilon) {\mathbf{x}}^\top \mathbf{L} \mathbf{x} \quad \text{for all} \quad \mathbf{x} \in \mathbb{R}^n \end{aligned} $$
(5.19)

for some small constant 𝜖 > 0 [125]. Most fast algorithms for spectral sparsification entail sampling \(\mathcal {O}(n \log {n})\) edges among the edges present in the graph. Different sampling schemes are possible [72, 124], but the most popular ones entail sampling edges with replacement based on their effective resistance. It should be noted that, though computing all effective resistances exactly can be computationally prohibitive, the effective resistance of edges can be approximated in time nearly linear in the number of edges based on a Johnson–Lindenstrauss argument [124].
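The following naive sketch illustrates this sampling scheme: effective resistances are computed exactly through the pseudo-inverse (viable only for small graphs; at scale one would use the nearly linear-time estimators of [124]), and q edges are drawn with replacement with probability proportional to w e R e and re-weighted accordingly. The function name and interface are ours.

    import numpy as np

    def spectral_sparsify(L, q, rng=None):
        """Naive spectral sparsification by effective-resistance sampling."""
        rng = np.random.default_rng() if rng is None else rng
        Lp = np.linalg.pinv(L)                        # exact, for the sketch only
        ii, jj = np.where(np.triu(L, k=1) < 0)        # edges of the graph
        w = -L[ii, jj]
        R = Lp[ii, ii] + Lp[jj, jj] - 2 * Lp[ii, jj]  # effective resistances
        p = w * R
        p = p / p.sum()
        L_tilde = np.zeros_like(L)
        for e in rng.choice(len(w), size=q, p=p):     # q samples with replacement
            we = w[e] / (q * p[e])                    # re-weighted edge
            i, j = ii[e], jj[e]
            L_tilde[i, i] += we; L_tilde[j, j] += we
            L_tilde[i, j] -= we; L_tilde[j, i] -= we
        return L_tilde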

There are different ways to use sparsification in order to accelerate spectral clustering. The most direct one is to exploit the fact that the eigenvalues \(\tilde {\lambda }_k\) and eigenvectors \(\tilde {\mathbf {U}}_k\) of \(\tilde {\mathbf {L}}\) approximate, respectively, the eigenvalues and eigenvectors of L up to multiplicative error. This yields the same flavor of guarantees as in graph coarsening and ensures that the computational complexity of the partial eigen-decomposition will decrease when \(e = \omega (n\log {n})\). A variation of this idea was considered in [140], though the latter did not provide a complete error and complexity analysis. Alternative approaches are also possible. We refer the interested reader to [136] for a rigorous argument that invokes a Laplacian solver.

Despite these exciting developments, we should mention that the overwhelming majority of graph sparsification algorithms remain in the realm of theory. That is, we are currently not aware of any practical and competitive implementation and thus retain a measure of skepticism with regard to their utility in the setting of spectral clustering.

5.4.3.2 Random Eigenspace Projection

There also exist approaches that do not explicitly rely on sampling. The key starting point here is that, with regard to spectral clustering, one does not need the eigenvectors exactly—any rotation of U k suffices (indeed, k-means is an algorithm based on distances and rotations conserve distances). Even more generally, consider \(\tilde {\mathbf {U}}_k \in \mathbb {R}^{n\times m}\) with m ≥ k and denote:

$$\displaystyle \begin{aligned} \epsilon = \min_{\mathbf{Q} \in \mathcal{Q}} \|{\mathbf{U}}_k {\mathbf{I}}_{k \times m} \mathbf{Q} - \tilde{\mathbf{U}}_k \|{}_F, \end{aligned}$$

where \(\mathcal {Q}\) is the space of m × m unitary matrices and I k×m consists of the first k rows of an m × m identity matrix.

The following lemma (which is a generalization of Lemma 1) shows how 𝜖 can be used to provide control on the k-means error:

Lemma 2 (Lemma 3.1 in [95])

Let \(\tilde {\mathsf {C}}^*\) be the optimal solution of the k-means problem on \(\tilde {\mathbf {U}}_k\) . It holds that Footnote 17

$$\displaystyle \begin{aligned} \left| f(\mathsf{C^*}; \mathsf{X})^{\mathsf{{1}/{2}}} - f(\tilde{\mathsf{C}}^*; \mathsf{X})^{\mathsf{{1}/{2}}} \right| \leq 2 \epsilon. {} \end{aligned} $$
(5.20)

There exist (at least) two approaches to efficiently compute \(\tilde {\mathbf {U}}_k\) while controlling 𝜖 [21, 130] (see also related work in [58]). We will consider here a simple variant of the one proposed in [130] and further analyzed in [95]. Let \(\mathbf {G} \in \mathbb {R}^{ n\times m}\) be a random Gaussian matrix with centered i.i.d. entries, each having variance \(\frac {1}{m}\). Furthermore, suppose that we project G onto span(U k) by multiplying each one of its columns by an ideal projector P k defined as

$$\displaystyle \begin{aligned} {\mathbf{P}}_k = \mathbf{U} \left( \begin{array}{cc} {\mathbf{I}}_k & 0\\ 0 & 0 \end{array} \right) {\mathbf{U}}^\top. \end{aligned} $$
(5.21)

Theorem 5.6 ([95, 130])

Let \(\tilde {\mathsf {C}}^*\) be the optimal solution of the k-means problem on the rows of \(\tilde {\mathbf {U}}_k = {\mathbf {P}}_k \mathbf {G}\) . For every δ ≥ 0, one has

$$\displaystyle \begin{aligned} \displaystyle \left| f(\mathsf{C^*}; \mathsf{X})^{\mathsf{{1}/{2}}} - f(\tilde{\mathsf{C}}^*; \mathsf{X})^{\mathsf{{1}/{2}}} \right| \leq 2 \sqrt{\frac{k}{m}} \, (\sqrt{k}+\delta), \end{aligned} $$
(5.22)

with probability at least \(1-\exp (-\delta ^2/2)\).

This result means that for an ideal projector P k, dimension \(m = \mathcal {O}(k^2)\) suffices to guarantee a good approximation (since the error becomes independent of k and n)! A similar argument also holds when the entries of G, instead of being Gaussian, are selected i.i.d. from \(\{-\sqrt {3},0,+\sqrt {3}\}\) with probabilities {1∕6, 2∕3, 1∕6}, respectively [1]. This construction has the benefit of being sparser and, moreover, is reminiscent of sampling. It should be noted that in [130], \(m=\mathcal {O}(\log {n})\) was deemed sufficient because one only required that the distance between any two rows of U k be approximated by the distance between the same two rows of \(\tilde {\mathbf {U}}_k\). There was in fact no end-to-end control on the k-means error.
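A minimal sketch of this projection scheme is given below. For illustration, the ideal projector P k is built from exact eigenvectors (computed densely), which of course defeats the purpose of acceleration; as discussed next, in practice P k is replaced by a polynomial filter h(R). Function names and the choice m = k 2 are ours.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def projected_embedding_clustering(L, k, m=None, rng=None):
        """Cluster with the projected features U_k_tilde = P_k G of Theorem 5.6."""
        rng = np.random.default_rng() if rng is None else rng
        m = m or max(k * k, k + 1)
        # exact bottom-k eigenvectors, used here only to form the ideal projector
        _, vecs = np.linalg.eigh(np.asarray(L, dtype=float))
        U_k = vecs[:, :k]
        # random Gaussian matrix with i.i.d. entries of variance 1/m
        G = rng.normal(0.0, 1.0 / np.sqrt(m), size=(L.shape[0], m))
        U_k_tilde = U_k @ (U_k.T @ G)        # P_k G, without forming P_k explicitly
        _, labels = kmeans2(U_k_tilde, k, minit='++')
        return labels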

The discussion so far assumed that P k is an ideal projector onto span(U k). However, in practice one does not have access to this projector as we are in fact in the process of computing U k. One may choose to approximate the action of P k by an application of a matrix function h on the representative matrix R [111, 129]. Assuming a point \(\lambda _*\) in the interval [λ k, λ k+1) is known, one may select a polynomial [122] or rational function [65, 92] that approximates the ideal low-pass response, i.e., h(λ) = 1 if \(\lambda \leq \lambda _*\) and h(λ) = 0 otherwise. The approximated projector \(\tilde {\mathbf {P}}_k = h(\mathbf {R})\) can be designed to be very close to P k. For instance, in the case of Chebyshev polynomials of order c, using the arguments of [80, Lemma 1] it is easy to prove that w.h.p. using h(R) instead of P k does not add more than \(\mathcal {O}(\textit {c}^{-\textit {c}}\sqrt {n})\) error in (5.20). Furthermore, the operation \(\tilde {\mathbf {P}}_k \mathbf {G}\) can conveniently be computed in \(\mathcal {O}(m\textit {c} e)\) time via this polynomial approximation.

The last ingredient needed for this approximation is \(\lambda _*\), i.e., a point in the interval [λ k, λ k+1). Efficiently finding a valid \(\lambda _*\) is difficult. An option is to rely on eigencount techniques [39, 105, 109] to find one in Footnote 18 \(\mathcal {O}(\textit {c} k^2(\log {n})(e + n \log (\lambda _n/(\lambda _{k+1} - \lambda _k))))\) time, which has a complexity similar to that of the Lanczos method (see discussion in Sect. 5.2.3). Another option is to content oneself with values of \(\lambda _*\) known only to be close to the interval [λ k, λ k+1), but thereby losing the end-to-end guarantee [130].

5.5 Sampling in the Spectral Feature Space

Having computed (or approximated) the spectral embedding X = (x 1, x 2, …, x n), what remains is to solve the k-means problem on X, in order to obtain k centroids together with the associated k classes obtained after Voronoi tessellation. The usual heuristic used to solve the k-means problem, namely the Lloyd-Max algorithm, is already very efficient as it runs in \(\mathcal {O}(nk^2t)\) time as seen in Sect. 5.2.3. Nonetheless, this section considers ways to accelerate k-means even further. In the following, we classify the relevant literature in five categories and point towards representative references for each case. In our effort to provide depth (as well as breadth) of presentation, the rest of the section details only methods that belong to the first and last categories.

  • Exact acceleration of Lloyd-Max. There exist exact accelerated Lloyd-Max algorithms, some of them based on avoiding unnecessary distance calculations using the triangle inequality [59, 101] or on optimized data organization [67], and others concentrating on clever initializations [6, 104]. Unlike the former methods, the latter involve sampling and are discussed in Sect. 5.5.1.

  • Approximate acceleration of Lloyd-Max. Approximately accelerating the Lloyd-Max algorithm has also received attention, for instance, via approximate nearest neighbor methods [108], via cluster closure [142], or via applying Lloyd-Max hierarchically (in the large-k context) [103]. An approach involving sampling is introduced in [119]: it is based on mini-batches sampled uniformly at random from X. We will not discuss this method further as it does not come with guarantees on the cost of the obtained solution.

  • Methods involving sampling in the Fourier domain. There are a few sampling-based heuristics to solve the k-means problem that are different from the Lloyd-Max algorithm. For instance, the work in [70] proposes to sample in the frequency domain to obtain a sketch from which one may recover the centroids with an orthogonal matching pursuit algorithm specifically tailored to this kind of compressive learning task [55]. These methods are reminiscent of the random Fourier features sketching approach introduced in Sect. 5.3.2. We will not discuss them further.

  • Methods involving sampling features. Similarly to ideas presented in Sect. 5.3.4.2, but here specific to the k-means setting, some works reduce the ambient dimension of the vectors, either by selecting a limited number of features [3, 20] or by embedding all points in a lower dimension using random projections [22, 33, 93]. The tightest results to date guarantee a (1 + 𝜖) multiplicative error on the k-means cost f, either by randomly selecting \(\mathcal {O}(\epsilon ^{-2}k\log {k})\) features or by projecting the points onto a random space of dimension \(\mathcal {O}(\epsilon ^{-2}\log {(k/\epsilon )})\) (sublinear in k!). The sampling result is useless in the spectral clustering setting as the ambient dimension of the spectral features is already k. The projection result could in principle be applied in our setting to reduce the cost of the k-means step to \(\mathcal {O}(tnk\log {k})\). We will nevertheless not discuss it further in this chapter.

  • Methods involving sampling points. Finally, the last group of existing methods are the ones that solve k-means on a subset S of X, before lifting back the result on the whole dataset. We classify such methods in two categories. In Sect. 5.5.2, we detail methods that are graph-agnostic, meaning that they apply to any k-means problem; and in Sect. 5.5.3 we discuss methods that explicitly rely on the fact that the features x were in fact obtained from a known graph. We argue that the latter are better suited to the spectral clustering problem.

5.5.1 Clever Initialization of the Lloyd-Max Algorithm

Recall that the k-means objective on X is to find the k centroids C = (c 1, …, c k) that minimize the following cost function:

$$\displaystyle \begin{aligned} f(\mathsf{C}; \mathsf{X}) = \sum_{\mathbf{x}\in\mathsf{X}} ~\min_{\mathbf{c}\in\mathsf{C}}\|\mathbf{x}-\mathbf{c}\|{}_2^2. \end{aligned} $$
(5.23)

and that \(\mathsf {C}^*= \operatorname *{\mbox{arg min}}_{\mathsf {C}} ~f(\mathsf {C}; \mathsf {X})\) is the optimal solution attaining cost \(f^* = f(\mathsf {C}^*; \mathsf {X})\). Recall also that the Lloyd-Max algorithm (see Algorithm 2) converges to a local minimum of f that we will denote by C lm, for which the cost function equals f lm = f(C lm;X). It is crucial to note that the initialization of the centroids C ini in the first step of the Lloyd-Max algorithm, which is usually done by randomly selecting k points in X, is what determines the distance \(|f^* - f_{\text{lm}}|\) to the optimum. As such, significant efforts have been devoted to smartly selecting C ini by various sampling schemes.

We face here the usual trade-off between sampling effectively and sampling efficiently. The fastest sampling method is of course uniform sampling, but it does not come with any guarantee on the quality of the local minimum C lm it leads to. An alternative sampling scheme, called k-means++ initialization, is based on the following more general D 2-sampling algorithm.

Algorithm 6: D 2-sampling

Input. X, m the number of required samples

  1. Initialize B with any x chosen uniformly at random from X.

  2. Iterate the following steps until B contains m elements:

    (a) Compute \(d_i=\min _{b\in \mathsf {B}} \|{\mathbf {x}}_i-b\|{ }_2^2\).

    (b) Define the probability of sampling x i as d i∕∑i d i.

    (c) Sample x new from this probability distribution and add it to B.

Output: B a sample set of size m.
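A direct NumPy transcription of Algorithm 6 could read as follows (function name ours); running it with m = k provides the k-means++ initialization discussed next.

    import numpy as np

    def d2_sampling(X, m, rng=None):
        """D^2-sampling (Algorithm 6): iteratively add points with probability
        proportional to their squared distance to the current sample set B."""
        rng = np.random.default_rng() if rng is None else rng
        n = X.shape[0]
        B = [rng.integers(n)]                       # step 1: one uniform sample
        d = np.sum((X - X[B[0]]) ** 2, axis=1)      # squared distances to B
        while len(B) < m:
            p = d / d.sum()
            new = rng.choice(n, p=p)
            B.append(new)
            d = np.minimum(d, np.sum((X - X[new]) ** 2, axis=1))
        return np.array(B)

    # k-means++ initialization: centroids = X[d2_sampling(X, m=k)]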

k-means++ initialization boils down to running Algorithm 6 with m = k to obtain a set of k initial centroids. Importantly, when the Lloyd-Max heuristic is run with this initialization, the following guarantee holds:

Theorem 5.7 ([6])

For any set of data points, the cost f lm obtained after Lloyd-Max initialized with k-means++ is controlled in expectation: \(\mathbb {E}(f_{\mathit{\text{lm}}}) \leq 8(\log k + 2)f^*\).

In terms of computation cost, D 2-sampling with m = k runs in \(\mathcal {O}(nkd)\), that is, \(\mathcal {O}(nk^2)\) in our setting of a spectral embedding X in dimension k. This work inspired other initialization techniques that come with similar guarantees and are in some cases faster [10, 12]. The interested reader is referred to the review [27] for further analyses on the initialization of k-means.

5.5.2 Graph Agnostic Sampling Methods: Coresets

The rest of Sect. 5.5 considers sampling methods that fall in the following framework: (i) sample a subset S of X, (ii) solve k-means on S, (iii) lift the result back on the whole dataset X. Section 5.5.2 focuses on coresets: general sampling methods designed for any arbitrary k-means problem, whereas in Sect. 5.5.3, we will take into account the specific nature of the spectral features encountered in spectral clustering algorithms.

5.5.2.1 Definition

Let S ⊂X be a subset of X of size m. To each element s ∈S associate a weight \(\omega (s)\in \mathbb {R}^+\). Define the estimated k-means cost associated with the weighted set S as:

$$\displaystyle \begin{aligned} \tilde{f}(\mathsf{C}; \mathsf{S}) = \sum_{s\in\mathsf{S}} \omega(s)\min_{\mathbf{c}\in\mathsf{C}}\|s-\mathbf{c}\|{}_2^2. \end{aligned} $$
(5.24)

Definition 5.2 (Coreset)

Let \(\epsilon \in (0,\frac {1}{2})\). The weighted subset S is an 𝜖-coreset for f on X if, for every set C, the estimated cost is equal to the exact cost up to a relative error:

$$\displaystyle \begin{aligned} \forall \mathsf{C}\qquad \left|\frac{\tilde{f}(\mathsf{C};\mathsf{S})}{f(\mathsf{C}; \mathsf{X})}-1\right|\leq\epsilon. \end{aligned} $$
(5.25)

This is the so-called strong coreset definition,Footnote 19 as the 𝜖-approximation is required for all C. The great interest of finding a coreset S comes from the following fact. Writing \(\tilde {\mathsf {C}}^*\) for the set minimizing \(\tilde {f}\), the following inequalities hold:

$$\displaystyle \begin{aligned} (1-\epsilon) f(\mathsf{C}^*; \mathsf{X}) \leq (1-\epsilon) f(\tilde{\mathsf{C}}^*; \mathsf{X}) \leq \tilde{f}(\tilde{\mathsf{C}}^*; \mathsf{S})\leq \tilde{f}(\mathsf{C}^*; \mathsf{S})\leq (1+\epsilon) f(\mathsf{C}^*; \mathsf{X}). \end{aligned} $$

The first inequality comes from the fact that \(\mathsf {C}^*\) is optimal for f, the second and last inequalities are justified by the coreset property of S, and the third inequality comes from the optimality of \(\tilde {\mathsf {C}}^*\) for \(\tilde {f}\). This has two consequences:

  1. First of all, since \(\epsilon <\frac {1}{2}\):

    $$\displaystyle \begin{aligned}f(\mathsf{C}^*; \mathsf{X})\leq f(\tilde{\mathsf{C}}^*; \mathsf{X})\leq (1+4\epsilon)f(\mathsf{C}^*; \mathsf{X}),\end{aligned}$$

    meaning that \(\tilde {\mathsf {C}}^*\) is a well-controlled approximation of \(\mathsf {C}^*\) with a multiplicative error on the cost.

  2. Estimating \(\tilde {\mathsf {C}}^*\) can be done by running the Lloyd-Max algorithm on the weighted subsetFootnote 20 S, thus reducing the computation time from \(\mathcal {O}(nk^2)\) to \(\mathcal {O}(mk^2)\).

Coreset methods for k-means thus follow the general procedure:

Algorithm 7. Coresets to avoid k-means on X

Input. X, sampling set size m, and number of clusters k ≤ m.

  1. Compute a weighted coreset S of size m using a coreset-sampling algorithm.

  2. Run the Lloyd-Max algorithm on the weighted set S to obtain the set of k centroids \(\tilde {\mathsf {C}}\).

  3. “Closest-centroid lifting”: classify the whole dataset X based on the Voronoi cells of \(\tilde {\mathsf {C}}\).

Output: A set of k centroids C = (c 1, …, c k).
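Step 2 of Algorithm 7 requires a weighted variant of Lloyd-Max, in which every centroid is the weighted mean of its Voronoi cell. A minimal sketch (function name and initialization choice ours):

    import numpy as np

    def weighted_lloyd(S, weights, k, n_iter=100, rng=None):
        """Lloyd-Max on a weighted set (step 2 of Algorithm 7)."""
        rng = np.random.default_rng() if rng is None else rng
        S = np.asarray(S, dtype=float)
        weights = np.asarray(weights, dtype=float)
        C = S[rng.choice(len(S), size=k, replace=False)]   # simple random initialization
        for _ in range(n_iter):
            d2 = ((S[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)                     # Voronoi assignment
            for j in range(k):
                mask = assign == j
                if mask.any():
                    # centroid = weighted mean of its cell
                    C[j] = np.average(S[mask], axis=0, weights=weights[mask])
        return C

    # Step 3 of Algorithm 7 then assigns every x in X to its closest centroid in C.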

Coreset methods compete with one another on essentially two levels: the coreset size m should be as small as possible in order to decrease the time of Lloyd-Max on S, and the coreset itself should be sampled efficiently (at least faster than running k-means on the whole dataset!), which turns out in fact to be a strong requirement. The reader interested in an overview of coreset construction techniques is referred to the recent review [99], as well as Chap. 2 of this book.

5.5.2.2 An Instance of Coreset-Sampling Algorithm

We focus on a particular coreset algorithm proposed in [11] that builds upon results developed in [45, 79]: it is not state of the art in terms of coreset size, but has the advantage of being easy to implement and fast enough to compute. It reads:

Algorithm 8: a coreset sampling algorithm [11]

Input. X, m the number of required samples, t an iteration number

  1. Repeat t times: draw a set of size k using D 2-sampling. Out of the t sets obtained, keep the set B that minimizes f(B;X).

  2. \(\alpha \leftarrow 16(\log {k}+2)\)

  3. For each \(b_\ell \in \mathsf {B}\), define \(\mathsf {B}_\ell \) as the set of points in X in the Voronoi cell of \(b_\ell \).

  4. Set \(\phi = \frac {1}{n} f(\mathsf {B};\mathsf {X})\).

  5. For each \(b_\ell \in \mathsf {B}\) and each \(\mathbf {x}\in \mathsf {B}_\ell \), define

    $$\displaystyle \begin{aligned}s(\mathbf{x}) = \frac{\alpha}{\phi} \|\mathbf{x}-b_\ell\|{}_2^2+\frac{2\alpha}{\phi|\mathsf{B}_\ell|}\sum_{\mathbf{x}'\in\mathsf{B}_\ell} \|\mathbf{x}'-b_\ell\|{}_2^2 + \frac{4n}{|\mathsf{B}_\ell|}\end{aligned}$$

  6. Define the probability of sampling x i as p i = s(x i)∕∑x s(x).

  7. S ← sample m nodes i.i.d. with replacement from p and associate to each sample s the weight \(\omega _s=\frac {1}{mp_s}\).

Output: A weighted set S of size m.
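Assuming the rough solution B of step 1 has already been computed (e.g., with the D 2-sampling sketch given earlier), steps 2–7 of Algorithm 8 translate almost line by line into NumPy; the function name is ours.

    import numpy as np

    def coreset_from_bicriteria(X, B, m, rng=None):
        """Steps 2-7 of Algorithm 8: sensitivity-based coreset sampling,
        given a rough solution B (best of t D^2-sampled sets of size k)."""
        rng = np.random.default_rng() if rng is None else rng
        n, k = X.shape[0], B.shape[0]
        d2 = ((X[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        ell = d2.argmin(axis=1)                    # index of the closest b_ell
        dist2 = d2[np.arange(n), ell]
        alpha = 16 * (np.log(k) + 2)               # step 2
        phi = dist2.sum() / n                      # step 4: phi = f(B; X) / n
        cell_size = np.bincount(ell, minlength=k)  # |B_ell|
        cell_cost = np.bincount(ell, weights=dist2, minlength=k)
        # step 5: sensitivity score s(x) for every point
        s = (alpha / phi) * dist2 \
            + (2 * alpha / (phi * cell_size[ell])) * cell_cost[ell] \
            + 4 * n / cell_size[ell]
        p = s / s.sum()                            # step 6
        idx = rng.choice(n, size=m, replace=True, p=p)   # step 7
        return X[idx], 1.0 / (m * p[idx])          # samples and their weights omega_s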

Theorem 2.5 of [11] states:

Theorem 5.8

Let 𝜖 ∈ (0, 1∕4) and δ ∈ (0, 1). Let S be the output of Algorithm 8 with \(t=\mathcal {O}(\log {1/\delta })\) . Then, with probability at least 1 − δ, S is an 𝜖-coreset provided that:

$$\displaystyle \begin{aligned} m=\varOmega\left(\frac{k^4\log{k}+k^2\log{1/\delta}}{\epsilon^2}\right). \end{aligned} $$
(5.26)

The computation cost of running this coreset-sampling algorithm, running Lloyd-Max on the weighted coreset, and lifting the result back to X is dominated, whenFootnote 21 n ≫ k, by step 1 of Algorithm 8 (the repeated D 2-sampling of Algorithm 6) and thus sums up to \(\mathcal {O}(n k^2\log {1/\delta })\).

Remark 1

The coreset-sampling strategy underlying this algorithm relies on the concept of sensitivity [79]. Many other constructions of coresets for k-means are possible [99], with better theoretical bounds than (5.26). Nevertheless, as the coreset line of research has been essentially theoretical, practical implementations of coreset-sampling algorithms are scarce. A notable exception is, for instance, the work in [49] that proposes a scalable hybrid coreset-inspired algorithm for k-means. Other exceptions are the sampling algorithms based on the farthest-first procedure, a variant of D 2-sampling that chooses each new sample to be \( \operatorname *{\mbox{arg max}}_i d_i\) instead of drawing it according to a probability proportional to d i. Once S of size m is drawn, then ∀s ∈S, each weight ω s is set to be the cardinality of the Voronoi cell associated with s. The authors of [113] show that such weighted sets computed by different variants of the farthest-first algorithm are 𝜖-coresets, but for values of 𝜖 that can be very large. For a fixed 𝜖, the number of samples necessary to obtain an 𝜖-coreset with this type of algorithm is unknown (see also Chap. 3 of this book).

5.5.3 Graph-Based Sampling Methods

The methods discussed so far in this section are graph agnostic both for the sampling procedure and the lifting: they do not take into account that, in spectral clustering, X are in fact spectral features of a known graph.

A recent line of work [52, 53, 95, 130] based on graph signal processing (GSP) [118, 123] leverages this additional knowledge to accelerate both the sampling and the lifting steps. For the purpose of the following discussion, denote by \({\mathbf {z}}_\ell \in \mathbb {R}^n\) the ground truth indicator vector of cluster \(\ell \), i.e., \({\mathbf {z}}_\ell (i) = 1\) if node v i is in cluster \(\ell \), and 0 otherwise. The goal of spectral clustering is, of course, to recover \(\{{\mathbf {z}}_\ell \}_{\ell =1,\ldots ,k}\).

Broadly, GSP-based methods can be summarized in the following general methodology [130]:

Algorithm 9. Graph-based sampling strategies to avoid k-means on X

Input. X, m the number of required samples, k the number of desired clusters

  1. Choose the random sampling strategy. Either:

    (a) uniform (i.i.d.) Draw m i.i.d. samples uniformly.

    (b) leverage score (i.i.d.) Compute \(\forall {\mathbf {x}}_i, \textit {p}_i^*=\|{\mathbf {U}}_k^\top \delta _i\|{ }_2^2 / k\). Draw m i.i.d. samples from \(\mathbf {p}^*\). (optional:) set the weight of each sample s to \(1/\textit {p}_s^*\).

    (c) DPP Sample a few times independently from a DPP with kernel \({\mathbf {K}}_k={\mathbf {U}}_k {\mathbf {U}}_k^\top \). (optional:) set the weight of each sample s to 1∕π s.

  2. Run the Lloyd-Max algorithm on the (possibly weighted) set S to obtain the k reduced cluster indicator vectors \({\mathbf {z}}^r_\ell \in \mathbb {R}^m\).

  3. Lift each reduced indicator vector \(\{{\mathbf {z}}^r_\ell \}_{\ell =1,\ldots , k}\) to the full graph either with

    (a) Least-square Solve (5.33) with \(\mathbf {y}\leftarrow {\mathbf {z}}^r_\ell \).

    (b) Tikhonov Solve (5.34) with \(\mathbf {y}\leftarrow {\mathbf {z}}^r_\ell \).

    In both cases, P S should be set to \(\frac {1}{n}{\mathbf {I}}_m\) if uniform sampling was chosen, to \(\text{diag}(\textit {p}^*_{s_1}, \ldots, \textit {p}^*_{s_m})\) if leverage score sampling was chosen, and to \(\text{diag}(\pi _{s_1}, \ldots, \pi _{s_m})\) if DPP sampling was chosen.

  4. Assign each node j to the cluster \(\ell \) for which \(\hat {\mathbf {z}}_\ell (j)/\|\hat {\mathbf {z}}_\ell \|{ }_2\) is maximal.

Output: A partition of X in k clusters

To aid understanding, let us start with a high-level description of Algorithm 9. The indicator vectors \({\mathbf {z}}_\ell \) are interpreted as graph signals that are (approximately) bandlimited on the similarity graph G (see Sect. 5.5.3.1 for a precise definition). As such, there is no need to measure these indicator vectors everywhere: one can take advantage of generalized Shannon-type sampling theorems to select the set S of m nodes to measure (step 1). Then k-means is performed on S to obtain the indicator vectors \({\mathbf {z}}_\ell ^r\in \mathbb {R}^m\) on the sample set S (step 2). These reduced indicator vectors are interpreted as noisy measurements of the global cluster indicator vectors \({\mathbf {z}}_\ell \) on S. The solutions \({\mathbf {z}}_\ell ^r\) are lifted back to X as \(\hat {\mathbf {z}}_\ell \) by solving an inverse problem taking into account the bandlimitedness assumption, or via label propagation on the graph structure, reminiscent of semi-supervised learning techniques (step 3). As the lifted solutions \(\hat {\mathbf {z}}_\ell \) do not have the binary structure that true indicator vectors should have, an additional assignment step is necessary: assign each node j to the class \(\ell \) for which \(\frac {\hat {\mathbf {z}}_\ell (j)}{\|\hat {\mathbf {z}}_\ell \|{ }_2}\) is maximal (step 4).

The rest of this section is devoted to the discussion of the three sampling schemes as well as the two lifting procedures considered in this framework. To this end, we will first introduce a few graph signal processing (GSP) concepts in Sect. 5.5.3.1 before discussing in Sect. 5.5.3.2 several examples of graph sampling theorems appropriate to the spectral clustering context.

5.5.3.1 A Brief Introduction to Graph Signal Processing (GSP)

Denote by \(\mathbf {U}=({\mathbf {u}}_1|\ldots |{\mathbf {u}}_n)\in \mathbb {R}^{n\times n}\) the matrix of orthonormal eigenvectors of the Laplacian matrix L, with the columns ordered according to their associated sorted eigenvalues: 0 = λ 1 ≤ λ 2 ≤… ≤ λ n. In the GSP literature [118, 123], these eigenvectors are interpreted as graph Fourier modes for two main reasons:

  • By analogy to the ring graph, whose Laplacian matrix is exactly the (symmetric) double derivative discrete operator, and is thus diagonal in the basis formed by the classical 1D discrete Fourier modes.

  • A variational argument stemming from the Dirichlet form can be exploited to express eigenvectors u i of L as the basis of minimal variation \({\mathbf {x}}^\top \mathbf {L} \mathbf {x} = \frac {1}{2}\sum _{ij} {\mathbf {W}}_{ij}\left [\mathbf {x}(i)-\mathbf {x}(j)\right ]^2\) on G and eigenvalues λ i as a sum of local variations of u i, i.e., a generalized graph frequency.

A graph signal \(\mathbf {z}\in \mathbb {R}^n\) is a signal that is defined on the nodes of a graph: its i-th element is associated with node v i. Given the previous discussion, the graph Fourier transform of z, denoted by \(\tilde {\mathbf {z}}\), is its projection on the graph Fourier modes: \(\tilde {\mathbf {z}}={\mathbf {U}}^\top \mathbf {z}\in \mathbb {R}^n\). The notion of graph filtering naturally follows as a multiplication in the Fourier domain. More precisely, define a real-valued filter function h(λ) on [0, λ n]. The signal x filtered by h reads \(\mathbf {U} h(\varLambda ){\mathbf {U}}^\top \mathbf {x}\), where we use the convention h(Λ) = diag(h(λ 1), h(λ 2), …, h(λ n)). In the following, we will use the following notation for graph filter operators:

$$\displaystyle \begin{aligned} h(\mathbf{L}) = Uh(\varLambda){\mathbf{U}}^\top. \end{aligned} $$
(5.27)

For more details on the graph Fourier transform and filtering, their various definitions and interpretations, we refer the reader to [133].

Of interest for the discussion in this chapter, one may define bandlimited graph signals as linear combinations of the first few low-frequency Fourier modes. Writing \({\mathbf {U}}_k=({\mathbf {u}}_1|\ldots |{\mathbf {u}}_k)\in \mathbb {R}^{n\times k}\), we have the formal definition:

Definition 5.3 (k-Bandlimited Graph Signal)

A graph signal \(\mathbf {z} \in \mathbb {R}^{n}\) is k-bandlimited if z ∈span(U k), i.e., \(\exists ~\boldsymbol {\alpha }\in \mathbb {R}^k\) such that z = U k α.

To grasp why the notion of k-bandlimitedness lends itself naturally to the approximation of spectral clustering, consider momentarily a graph with k disconnected components and \({\mathbf {z}}_\ell \in \mathbb {R}^n\) the indicator vector of cluster \(\ell \). It is a well-known property of the (combinatorial) Laplacian that \(\{{\mathbf {z}}_\ell \}_{\ell =1,\ldots ,k}\) form a set of orthogonal eigenvectors of L associated with eigenvalue 0: that is, the set of indicator vectors \(\{{\mathbf {z}}_\ell \}_{\ell =1,\ldots ,k}\) forms a basis of span(U k). Understanding arbitrary graphs with block structure as a perturbation of the ideal disconnected-component case, the indicator vectors \(\{{\mathbf {z}}_\ell \}_{\ell =1,\ldots ,k}\) of the blocks should live close to span(U k) (in the sense that the difference between any \({\mathbf {z}}_\ell \) and its orthogonal projection onto span(U k) is small). This in turn implies that every \({\mathbf {z}}_\ell \) should be approximately k-bandlimited.

As we will see next, the bandlimitedness assumption is very useful because it enables us to make use of generalized versions of Nyquist–Shannon sampling theorems, taking into account the graph.

5.5.3.2 Graph Sampling Theorems

The periodic sampling paradigm of the Shannon theorem for classical bandlimited signals does not apply to graphs without specific regular structure. In fact, a number of sampling schemes have been recently developed with the purpose of generalizing sampling theorems to graph signals [30, 109, 117, 134] (see [88] for a review of existing schemes).

Let us introduce some notations. Sampling entails selecting a set S = (s 1, …, s m) of m nodes of the graph. To each possible sampling set, we associate a measurement matrix \(\mathbf {M}=(\boldsymbol {\delta }_{s_1}|\boldsymbol {\delta }_{s_2}|\ldots |\boldsymbol {\delta }_{s_m})^{\top } \in \mathbb {R}^{m\times n}\) where \(\boldsymbol {\delta }_{s_i}(j)=1\) if j = s i, and 0 otherwise. Now, consider a k-bandlimited signal z ∈span(U k). The measurement of z on S reads:

$$\displaystyle \begin{aligned} \mathbf{y} = \mathbf{M} \mathbf{z} + \mathbf{n} \in\mathbb{R}^m, \end{aligned} $$
(5.28)

where n models measurement noise. The sampling question boils down to: how should we sample S such that one can recover any bandlimited z given its measurement y? There are three important components to this question: (i) how many samples m do we allow ourselves (m = k being the strict theoretical minimum)? (ii) how much does it cost to sample? (iii) how do we in practice recover z from y and how much does that inversion cost?

There are a series of works that propose greedy algorithms to find the “best” set S of minimal size m = k that embed all k-bandlimited signals (see, for instance, [131] and references therein). These algorithms cost \(\mathcal {O}(nk^4)\) and are thus not competitive in our setting.Footnote 22 Moreover, in our case, we do not really need to be that strict on the number of samples and can allow more than k samples. A better choice is to use random graph sampling techniques. In the following we consider two types of independent sampling (uniform and leverage-score sampling) as well as a more involved method based on determinantal point processes.

Independent Sampling

In the i.i.d. setting, one defines a discrete probability distribution \(\mathbf {p}\in \mathbb {R}^n\) over the node set V. The sampling set S is then generated by drawing m nodes independently with replacement from p. At each draw, the probability to sample node v i is denoted by p i. We have ∑i p i = 1 and write P = diag(p). Under this sampling scheme, the following restricted isometry property holds for the associated measurement matrix M [109].

Theorem 5.9

For any δ, 𝜖 ∈ (0, 1), with probability at least 1 − δ:

$$\displaystyle \begin{aligned} (1-\epsilon)\|z\|{}^2_2 \leq \frac{1}{m}\|\mathbf{MP}^{-1/2}z\|{}^2_2\leq(1 +\epsilon)\|z\|{}^2_2 \end{aligned} $$
(5.29)

for all z ∈span(U k) provided that

$$\displaystyle \begin{aligned} m\geq \frac{3}{\epsilon^2}(\nu^k_{\mathbf{p}})^2\log{\frac{2k}{\delta}} \end{aligned} $$
(5.30)

where \(\nu ^k_{\mathbf {p}}\) is the so-called graph weighted coherence:

$$\displaystyle \begin{aligned} \nu^k_{\mathbf{p}} = \max_i \left\{\textit{p}_i^{-1/2}\|{\mathbf{U}}_k^\top\delta_i\|{}_2\right\}. \end{aligned} $$
(5.31)

This property is important as it says, in a nutshell, that any two different bandlimited signals will be identifiable post-sampling provided the number of samples is large enough. The concept of large enough depends on \((\nu ^k_{\mathbf {p}})^2\): a measure of the interplay between the probability distribution and the norms of the rows of U k. In the uniform i.i.d. case since p i = 1∕n, one has \((\nu ^k_{\mathbf {p}})^2=n\max _i \|{\mathbf {U}}_k^\top \delta _i\|{ }^2_2\), which stays under control only for very regular graphs, but can be close to n in irregular graphs such as the star graph. The good news is that there exists an optimal sampling distribution (in the sense that it minimizes the right-hand side of inequality (5.30)) that adapts to the graph at hand:

$$\displaystyle \begin{aligned} \textit{p}_i^* = \frac{\|{\mathbf{U}}_k^\top\delta_i\|{}_2^2}{k} \end{aligned} $$
(5.32)

In fact, in this case, \((\nu ^k_{{\mathbf {p}}^*})^2\) matches its lower bound k and the necessary number of samples m to embed all bandlimited signals drops to \(\mathcal {O}(k\log {k})\). The distribution \(\mathbf {p}^*\) is also referred to by the name “leverage scores” in parts of the literature (see discussion in Sect. 5.3.1) [41]. As such, i.i.d. sampling under \(\mathbf {p}^*\) will be referred to as leverage score sampling.

Now, for lifting, there are several options.

  • If one uses the unbiased decoder

    $$\displaystyle \begin{aligned} \hat{\mathbf{z}} &= \operatorname*{\mbox{arg min}}_{\mathbf{w}\in\text{span}({\mathbf{U}}_k)} \|{\mathbf{P}}^{-1/2 }_{\mathsf{S}}(\mathbf{M}\mathbf{w}-\mathbf{y})\|{}_2^2 \end{aligned} $$
    (5.33)

    where \({\mathbf {P}}^{-1/2 }_{\mathsf {S}}{=}\mathbf {MP}^{-1/2 }{\mathbf {M}}^\top \), then the following reconstruction result holds [109]:

    Theorem 5.10

    Let S be the i.i.d. nodes sampled with distribution p and M be the associated sampling matrix. Let 𝜖, δ ∈ (0, 1) and suppose that m satisfies (5.30). With probability at least 1 − δ, for all z ∈span(U k) and \(\mathbf {{n}}\in \mathbb {R}^m\) , the solution \(\hat {\mathbf {z}}\) of (5.33) verifies:

    $$\displaystyle \begin{aligned}\|\hat{\mathbf{z}}-\mathbf{z}\|{}_2\leq\frac{2}{\sqrt{m(1-\epsilon)}} \|{\mathbf{P}}^{-1/2}_{\mathsf{S}}\mathbf{{n}}\|{}_2.\end{aligned}$$

    This means that a noiseless measurement of a k-bandlimited signal yields a perfect reconstruction. Also, this quantifies how increasing m reduces the error of reconstruction due to a noisy measurement. Note that this error may be large if there is a significant measurement noise on a node that has a low probability of being sampled. However, by definition, this is not likely to happen.

  • One can also use a label-propagation decoder reminiscent of semi-supervised learning techniques [15, 28]:

    $$\displaystyle \begin{aligned} \hat{\mathbf{z}} &= \operatorname*{\mbox{arg min}}_{\mathbf{w}\in\mathbb{R}^n} \|{\mathbf{P}}^{-1/2 }_{\mathsf{S}}(\mathbf{M}\mathbf{w}-\mathbf{y})\|{}_2^2 + \gamma\; {\mathbf{w}}^\top g(\mathbf{L})\mathbf{w}, \end{aligned} $$
    (5.34)

    where γ is a regularization parameter, g(L) a graph filter operator as in (5.27) with g(λ) a non-decreasing function. As g is non-decreasing, the regularization term of (5.34) penalizes high frequency solutions, that is, solutions that are not smooth along paths of the graph. Theorems controlling the error of reconstruction are more involved and we refer the reader to Section 3.3 of [109] for details.

  • Other decoders [14, 106] are in principle possible, replacing, for instance, the \(\ell _2\) Laplacian-based regularization \({\mathbf {w}}^\top g(\mathbf {L})\mathbf {w}\) by \(\ell _1\)-regularizers \(\|\nabla \mathbf {w}\|{ }_1\), but they come with an increased computation cost and lesser guarantees, and have not been used for spectral clustering: we will thus not detail them further.

Let us discuss the computation costs of the previous sampling and lifting techniques. In terms of sampling time, uniform sampling is obviously the most efficient and runs in \(\mathcal {O}(k)\). Leverage score sampling is dominated by the computation of the optimal sampling distribution \(\mathbf {p}^*\) of (5.32), which takes \(\mathcal {O}(nk)\) time.Footnote 23 In terms of lifting time, solving the decoder of (5.33) costs \(\mathcal {O}(nk + mk^2)\). Solving the decoder of (5.34) costs \(\mathcal {O}(et)\) via the conjugate gradient method, where t is the iteration number of the gradient solver (usually around 10 or 20 iterations suffice to obtain good accuracy when g(L) = L).

This discussion calls for a few remarks. First of all, these theorems are valid if we suppose that z is exactly k-bandlimited, which is in fact only an approximation if we consider z to be one of the ground truth indicator vectors of the k clusters to detect in the spectral clustering context. In this case, we can always decompose z as the sum of its orthogonal projection onto span(U k) and its complement β: \(\mathbf {z}={\mathbf {U}}_k {\mathbf {U}}_k^\top \mathbf {z} + \boldsymbol {\beta }\). Equation (5.28) then becomes \(\mathbf {y}=\mathbf {M} {\mathbf {U}}_k {\mathbf {U}}_k^\top \mathbf {z} + \mathbf {n}\), where n now represents the sum of a measurement noise and the distance-to-model term M β. The aforementioned theorems can then be applied to \({\mathbf {U}}_k {\mathbf {U}}_k^\top \mathbf {z}\). Moreover, note that the decoder of (5.34) is not only faster than the other ones in general, it also does not constrain the solution \(\hat {\mathbf {z}}\) to be exactly in span(U k), which is in fact desirable in the spectral clustering context: we thus advocate for the decoder of (5.34).
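To fix ideas, here is a compact sketch of leverage score sampling (5.32) followed by the decoder of (5.34) with g(L) = L, solved through its normal equations \((\mathbf {M}^\top \mathbf {P}_{\mathsf {S}}^{-1}\mathbf {M} + \gamma \mathbf {L})\hat {\mathbf {z}} = \mathbf {M}^\top \mathbf {P}_{\mathsf {S}}^{-1}\mathbf {y}\) by conjugate gradient. The function names, the value of γ, and the dense representation of L are illustrative choices.

    import numpy as np
    from scipy.sparse.linalg import cg

    def leverage_score_sampling(U_k, m, rng=None):
        """Draw m nodes i.i.d. from the optimal distribution p* of (5.32)."""
        rng = np.random.default_rng() if rng is None else rng
        p = (U_k ** 2).sum(axis=1)
        p = p / p.sum()                       # equals ||U_k^T delta_i||_2^2 / k
        return rng.choice(U_k.shape[0], size=m, p=p), p

    def tikhonov_decoder(L, samples, p, y, gamma=1e-2):
        """Decoder (5.34) with g(L) = L:
        solve (M^T P_S^{-1} M + gamma L) z_hat = M^T P_S^{-1} y."""
        n = L.shape[0]
        rhs = np.zeros(n)
        diag = np.zeros(n)
        np.add.at(rhs, samples, y / p[samples])       # M^T P_S^{-1} y
        np.add.at(diag, samples, 1.0 / p[samples])    # diagonal of M^T P_S^{-1} M
        A = gamma * np.asarray(L, dtype=float) + np.diag(diag)
        z_hat, _ = cg(A, rhs, atol=1e-8)              # conjugate gradient solve
        return z_hat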

DPP Sampling

Determinantal point processes (DPPs) are a class of correlated random sampling strategies that strive to increase “diversity” in the samples, based on a kernel K that encodes the similarity between variables. DPP sampling has been used successfully in a number of applications in machine learning (see, for instance, [74]).

Denote by [n] the set of all subsets of {1, 2, …, n}. An element of [n] could be the empty set, all elements of {1, 2, …, n} or anything in between. DPPs are defined as follows:

Definition 5.4 (Determinantal Point Process [74])

Consider a point process, i.e., a process that randomly draws an element S ∈ [n]. It is determinantal if, ∀ A ⊆S,

$$\displaystyle \begin{aligned}\mathbb{P}(\mathsf{A}\subseteq\mathsf{S}) = \text{det}({\mathbf{K}}_{\mathsf{A}}),\end{aligned}$$

where \(\mathbf {K}\in \mathbb {R}^{n\times n}\), a positive semi-definite matrix satisfying 0 ≼K ≼ I, is called the marginal kernel; and K A is the restriction of K to the rows and columns indexed by the elements of A.

The marginal probability π i of sampling an element i is thus K ii. Consider the following projective kernel:

$$\displaystyle \begin{aligned} {\mathbf{K}}_k = {\mathbf{U}}_k {\mathbf{U}}_k^\top. \end{aligned} $$
(5.35)

One can show that DPP samples from such projective kernels are necessarily of size k. After measuring the k-bandlimited signal z on a DPP sample S, one has the choice between the same decoders as before (see Eqs. (5.33) and (5.34)). For instance:

Theorem 5.11

For all z ∈span(U k), let \(\mathbf {y}=\mathbf {M} \mathbf {z} + \mathbf {n} \in \mathbb {R}^k\) be a noisy measurement of z on a DPP sample obtained from kernel K k . The decoder of (5.33) with P = diag(π 1, …, π n) necessarily enables perfect reconstruction up to the noise level. Indeed, one obtains:

$$\displaystyle \begin{aligned} \|\hat{\mathbf{z}}-\mathbf{z}\|{}_2 \; \leq\; \frac{1}{\sqrt{\lambda_{\text{min}}\left({\mathbf{U}}_{k}^{\top}{\mathbf{M}}^{\top}{\mathbf{P}}^{-1}_{\mathsf{S}}\mathbf{MU}_{k}\right)}}\, \|{\mathbf{P}}^{-1/2}_{\mathsf{S}}\mathbf{n}\|{}_2. \end{aligned} $$
(5.36)

Proof

The proof is only partly in [131] and we complete it here. Let us write z = U k α. Solving (5.33) entails computing \(\hat {\alpha }\in \mathbb {R}^k\) s.t. \(\|{\mathbf {P}}^{-1/2 }_{\mathsf {S}}(\mathbf {MU}_k\hat {\alpha }-\mathbf {y})\|{ }_2^2\) is minimal. Setting the derivative w.r.t. \(\hat {\alpha }\) to 0, and replacing y by MU k α + n, yields:

$$\displaystyle \begin{aligned}{\mathbf{U}}_k^\top {\mathbf{M}}^\top {\mathbf{P}}^{-1}_{\mathsf{S}} \mathbf{M}{\mathbf{U}}_k\hat{\alpha} = {\mathbf{U}}_k^\top {\mathbf{M}}^\top {\mathbf{P}}^{-1}_{\mathsf{S}} \mathbf{MU}_k\alpha + {\mathbf{U}}_k^\top {\mathbf{M}}^\top {\mathbf{P}}^{-1}_{\mathsf{S}} \mathbf{n}.\end{aligned}$$

Recall that S is a sample from a DPP with kernel K k: \(\text{det}(\mathbf {MU}_k {\mathbf {U}}_k^\top {\mathbf {M}}^\top )\) is thus strictly positive, which implies that MU k is invertible, which in turn implies that \(\hat {\alpha } = \alpha + (\mathbf {MU}_k)^{-1} \mathbf {n}\). One thus has \(\|\hat {\mathbf {z}} - \mathbf {z}\|{ }_2 = \|\hat {\alpha } - \alpha \|{ }_2=\left \|(\mathbf {MU}_k)^{-1} \mathbf {n}\right \|{ }_2=\left \|({\mathbf {P}}^{-1/2}_{\mathsf {S}}\mathbf {MU}_k)^{-1} {\mathbf {P}}^{-1/2}_{\mathsf {S}}\mathbf {n}\right \|{ }_2\). Bounding this error with the matrix 2-norm yields

$$\displaystyle \begin{aligned}\|\hat{\mathbf{z}} - \mathbf{z}\|{}_2\leq \sqrt{\lambda_{\text{max}}\left[\left({\mathbf{U}}_k^\top {\mathbf{M}}^\top {\mathbf{P}}^{-1}_{\mathsf{S}} \mathbf{MU}_k\right)^{-1}\right]} \|{\mathbf{P}}^{-1/2}_{\mathsf{S}} \mathbf{n}\|{}_2,\end{aligned}$$

as claimed.

Several comments are in order:

  • The particular choice of kernel \({\mathbf {K}}_k = {\mathbf {U}}_k {\mathbf {U}}_k^\top \) implies that the marginal probability of sampling node v i, \(\pi _i=\|{\mathbf {U}}_k^\top \delta _i\|{ }_2^2\), is proportional to the leverage score \(\textit {p}_i^*\). The major difference between the i.i.d. leverage score approach and the DPP approach comes from the negative correlations induced by the DPP. In fact, the probability of jointly sampling nodes v i and v j in the DPP case is \(\pi _i \pi _j - {\mathbf {K}}_{ij}^2=\pi _i \pi _j - (\delta _i^\top {\mathbf {U}}_k {\mathbf {U}}_k^\top \delta _j)^2\). The interaction term \((\delta _i^\top {\mathbf {U}}_k {\mathbf {U}}_k^\top \delta _j)^2\) will typically be large if v i and v j are in the same cluster, and small if not. In other words, unlike the i.i.d. leverage score case where each new sample is drawn regardless of the past, the DPP procedure avoids sampling nodes containing redundant information.

  • Whereas the leverage score approach only guarantees a RIP with high probability after \(\mathcal {O}(k\log {k})\) samples, the DPP approach has a stronger deterministic guarantee: it enables perfect invertibility (up to the noise level) after precisely m = k samples. The reconstruction guarantee of (5.36) is nevertheless not satisfactory: even corrected by the marginal probabilities P S, the matrix \({\mathbf {U}}_k^\top {\mathbf {M}}^\top {\mathbf {P}}^{-1}_{\mathsf {S}}\mathbf {MU}_k\) can still have a very small λ min, such that reconstruction may be quite sensitive to noise. Improving this control is still an open problem. In practice, sampling independently 2 or 3 times from a DPP with kernel K k creates a set S of size 2k or 3k that is naturally more robust to noise.

  • Whereas independent sampling is straightforward, sampling from a DPP with arbitrary kernel costs in general \(\mathcal {O}(n^3)\) (see Algorithm 1 of [74] due to [62]). Thankfully, in the case of a projective kernel such as K k, one can sample a set in \(\mathcal {O}(nk^2)\) based on Algorithm 3 of [132]; a sketch is given below.
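A Gram–Schmidt implementation of an \(\mathcal {O}(nk^2)\) sampler for the projective kernel K k is sketched below; the chain-rule formulation is standard for projection DPPs, but the function name and layout are ours.

    import numpy as np

    def sample_projection_dpp(U_k, rng=None):
        """Sample exactly k nodes from the DPP with kernel K_k = U_k U_k^T,
        by the chain rule applied to the rows of U_k."""
        rng = np.random.default_rng() if rng is None else rng
        V = U_k.copy()                         # rows v_i in R^k
        S = []
        for _ in range(U_k.shape[1]):
            norms2 = (V ** 2).sum(axis=1)
            p = norms2 / norms2.sum()          # conditional sampling probabilities
            s = rng.choice(len(p), p=p)
            S.append(s)
            v = V[s] / np.linalg.norm(V[s])
            V = V - np.outer(V @ v, v)         # project rows on the orthogonal
        return np.array(S)                     # complement of the selected row

Sampling independently two or three times and taking the union of the outputs yields the more robust set of size 2k or 3k mentioned above.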

5.6 Perspectives

Almost two decades have passed since spectral clustering was first introduced. Since then, a large body of work has attempted to accelerate its computation. So, has the problem been satisfactorily addressed, or is there, despite all these works, still room for improvement and further research?

To answer, we must first define what “satisfactorily addressed” would entail. As we have seen, the prototypical spectral clustering algorithm can be divided into three sub-problems: the similarity graph computation runs in \(\mathcal {O}(dn^2)\); the spectral embedding computation runs in \(\mathcal {O}(t(ek + nk^2))\) using an Arnoldi algorithm with t implicit restarts, where e is the number of edges; and the k-means step runs in \(\mathcal {O}(tnk^2)\), with t now being a bound on the number of iterations of the Lloyd-Max algorithm. Our criteria for evaluating an approximation algorithm aiming to accelerate one (or more) of these sub-problems are twofold:

  • We ask that the approximation algorithm’s computational cost be effectively lighter than the cost of the sub-problem(s) it is supposed to accelerate! The ultimate achievement is an order-of-magnitude improvement w.r.t. n (or e), d, and/or k, especially when the complexity has no hidden constants (i.e., the algorithm is practically implementable). When such a gain is not possible, a gain on the constants of the theoretical cost is also considered worthwhile.

  • The algorithm should come with convincing guarantees on the quality of the solution it finds. Heuristics or only partially motivated methods do not suffice: we require that, under mild assumptions, the proposed solution be provably close to the exact solution. Let us clarify two aspects of this statement further:

    • It is difficult to concretely classify assumptions as mild, but a useful rule of thumb is checking whether the theoretical results are meaningful for the significant majority of cases where spectral clustering would be used.

    • The control of the approximation error comes in different flavors, which we detail here from the tightest to the loosest. The best possible error control in our context is a control over the clustering solution itself, via error measures such as the misclustering rate. This is unfortunately unrealistic in many cases. An excellent alternative is the multiplicative error over the k-means cost,Footnote 24 considered the gold standard in approximation theory: it ensures that the cost of the approximation is not larger than 1 + 𝜖 times the cost of the exact solution. Next comes the additive error over the cost, ensuring that the cost difference between the approximate and exact solutions is not larger than 𝜖. All these error controls are referred to as end-to-end controls, and they represent the limit of what we will consider a satisfactory error control.

Reviewing the literature, we were surprised to discover that hardly any algorithm fully meets the proposed criteria: a faster algorithm with end-to-end control over the approximation error under mild assumptions. Let us revisit, one by one, the different approaches presented in Sects. 5.3, 5.4, and 5.5, examining them in light of our criteria for success. In each category of approximation algorithms, we order the methods according to the strength of their error control.

Sampling Methods in the Original Feature Space [Sect. 5.3]

  • Representative points methods as described in [63, 147] allow for an end-to-end control on the misclustering rate ρ, which is unfortunately quite loose. The constants involved in Theorem 5.3 are in fact unspecified, and thus potentially large, which is problematic given that ρ is by definition between 0 and 1. Also, the theorem’s assumptions include independence of the 𝜖 i, which is hard to justify in practice. On the other hand, the computational gain of such methods is very appealing.

  • Feature projection methods, where the dimension d of the original feature space is reduced to a dimension d′≤ d based on Johnson–Lindenstrauss arguments, come with a multiplicative error control on the pairwise distances in the original feature space, thus providing a control on the obtained kernel matrix. The impact of this initial approximation on the final clustering result has not been studied.

  • Nyström-inspired methods [19, 48, 85, 97] can be very efficient in practice, especially because they do not need to build the graph. However, precisely because they do not build the graph, these methods cannot exactly perform two key parts of the prototypical spectral clustering algorithm: the k-NN sparsification and the exact degree computation. The partial knowledge and sparsity of the kernel matrix also make sampling difficult: leverage score sampling is not possible anymore, and most other sampling schemes do not work very well with sparse matrices and come with weak guarantees. To the best of our knowledge, there is also no convincing mathematical argument proving that these methods yield a clustering of similar quality to that produced by the exact spectral clustering algorithm.

  • Sketching methods such as random Fourier features (RFF) [110] are yet another way of obtaining a pointwise multiplicative (1 + 𝜖) error on the Gaussian kernel computation. RFF make it possible to compute a provably good low-rank approximation of the kernel (a minimal sketch is given after this list). They nevertheless suffer from the same problems as Nyström-based techniques: without building the graph, sparsification and degree-normalization are uncontrolled. In addition, the guarantees on the low-rank approximation of the kernel do not transfer easily to guarantees on the approximation of the spectral embedding U k.

  • Approximate nearest neighbors methods are numerous and varied, and come with different levels of guarantees. Practical implementations, however, often set aside theoretical guarantees to gain in efficiency and performance, and comparisons are usually done on benchmarks rather than on theoretical grounds. In the best of cases, there is a control on how close the obtained nearest neighbor similarity graph is to the exact one, but with no end-to-end control.
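
As an illustration of the random Fourier features mentioned above, here is a minimal sketch for the Gaussian kernel; the function name, the use of NumPy, and the assumption that the bandwidth sigma has already been chosen are ours.

```python
import numpy as np

def rff_map(X, sigma, D, rng=None):
    """Map the n x d data matrix X to D random Fourier features Z such that
    Z @ Z.T approximates the Gaussian kernel matrix
    K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, D))  # frequencies drawn from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)       # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)     # n x D explicit feature map
```

A rank-k truncated SVD of Z (with cost linear in n) then yields a provably good low-rank approximation of the kernel; but, as stated above, the sparsification and degree-normalization steps remain uncontrolled.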

Spectral Embedding Approximation Methods [Sect. 5.4]

  • Random eigenspace projection is a very fast method and has been rigorously analyzed [95, 105, 111, 130] (a rough sketch is given after this list). It is true that a successful application depends on obtaining a good estimate of the k-th eigenvalue, which is very hard when the k-th eigenvalue gap is relatively small. Nevertheless, our current understanding of spectral clustering suggests that it only works well when the gap is (at least) moderately large. As such, though there are definitely situations in which random eigenspace projection will fail to provide an acceleration, these correspond to cases where one should not be using spectral clustering in the first place. The same argument can also be used in defense of all methods that come with mild gap assumptions (see coarsening and spectral sparsification).

  • Simple coarsening methods, such as the heavy-edge matching heuristic [69], have nearly linear complexity, seem to work well in practice, and are accompanied by end-to-end additive error control [91]. Nevertheless, the current analysis of these heuristics only accounts for very moderate reductions (m ≥ n∕2) and thus does not fully prove their success: in real implementations coarsening is used in a multi-level fashion, resulting in a drastic decrease in the graph size (\(m = \mathcal {O}(n/2^{\textit {c}}) \) for c levels), whereas the end-to-end control only holds for a single level.

  • Advanced coarsening methods, such as local variation methods [90], come with much stronger guarantees that allow for drastic size reduction and acceleration. Yet, thus far, all evidence suggests that finding a good enough coarsening is computationally as hard as solving the spectral clustering problem itself. As a consequence, it is at this point unclear whether these methods can be used to accelerate spectral clustering.

  • Spectral sparsification techniques come with excellent guarantees in theory: one may prove that a spectral sparsifier can be computed in nearly linear time and, moreover, the latter’s spectrum will be provably close to the original one. Yet, we have reasons to doubt their practicality. Indeed, current algorithms are very complex, feature impractically large constants, and are only relevant for dense graphs. In addition, spectral sparsifiers, by definition, approximate the entire spectrum of a graph Laplacian matrix. However, spectral clustering only needs an approximation of a tiny fraction of the spectrum. From that perspective, it is reasonable to conclude that without modification current approaches will not yield the best possible approximation.

  • Nyström approximation applied directly to the Laplacian matrix is a good option, especially when combined with leverage score sampling. Nevertheless, an end-to-end error control has only been partially derived and is not yet satisfactory.
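
To illustrate the random eigenspace projection idea discussed at the top of this list, here is a rough sketch; we deliberately replace the Jackson–Chebyshev approximation of the ideal low-pass filter used in [130] by a crude polynomial filter, and all names are our own.

```python
import numpy as np
from scipy.sparse import csgraph
from sklearn.cluster import KMeans

def random_projection_clustering(W, k, n_signals=None, power=100, seed=0):
    """Cluster without computing U_k explicitly: low-pass filter a few random
    signals with a polynomial of the normalized Laplacian and run k-means on
    the rows of the filtered signals."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    d = n_signals if n_signals is not None else 4 * k   # O(k log k) signals suffice in theory
    L = csgraph.laplacian(W, normed=True)                # spectrum contained in [0, 2]
    Y = rng.normal(size=(n, d)) / np.sqrt(d)             # random Gaussian signals
    # crude low-pass filter (I - L/2)^power: damps high graph frequencies, so each
    # filtered signal (column of Y) approximately lies in the span of the first
    # eigenvectors and each row of Y behaves like a randomly projected spectral feature
    for _ in range(power):
        Y = Y - (L @ Y) / 2.0
    Y /= np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12  # optional row normalization
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)
```

The filtering step only involves matrix-vector products with the (sparse) Laplacian, which is what makes the approach fast; its success hinges on the filter being sharp enough around λ k, hence the gap discussion above.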

Sampling to Accelerate the k-Means Step [Sect. 5.5]

  • Exact methods to accelerate the Lloyd-Max algorithm, be it by avoiding unnecessary distance calculations or by a careful initialization, are always useful and should be taken into account.

  • Coresets come with the strongest guarantees: the minimum number of samples needed to guarantee a (1 + 𝜖) multiplicative error on the cost function has been well studied. Nevertheless, practical coreset-sampling methods are scarce and, in the best cases, the sampling cost is of the same order as the Lloyd-Max running cost itself.

  • Graph-based sampling comes with strong guarantees, but not on the k-means cost: the guarantees bear on the reconstruction error under a k-bandlimited model that holds only approximately in practice. Moreover, we interpret the reduced indicator vectors \({\mathbf {z}}_\ell ^r\) obtained by running Lloyd-Max on the sampled set S as (possibly noisy) measurements of z on S. This interpretation currently lacks solid theoretical ground and impedes an end-to-end control of this approximation method. Nevertheless, the leverage-score-based sampling allows for an order-of-magnitude reduction of the Lloyd-Max running cost (a simplified sketch is given after this list).

  • Other methods to accelerate k-means are not always suited to the spectral clustering context. Dimension reduction of the spectral features is unnecessary here, since d = k; and sketching methods designed for distributed settings where n is very large are not appropriate either, as the spectral features require centralized data to be computed in any case.
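
To fix ideas on the graph-based sampling approach above, here is a simplified sketch: sample the spectral features according to their leverage scores, run Lloyd-Max on the sample only, and, in place of the decoder of (5.33), simply assign every remaining node to the nearest centroid. The nearest-centroid shortcut and the function name are our own simplifications.

```python
import numpy as np
from sklearn.cluster import KMeans

def reduced_kmeans(Uk, m, seed=0):
    """Run Lloyd-Max on m << n nodes sampled proportionally to the leverage
    scores pi_i = ||Uk.T delta_i||^2, then extend the labels to all nodes."""
    rng = np.random.default_rng(seed)
    n, k = Uk.shape
    lev = np.einsum('ij,ij->i', Uk, Uk)               # leverage scores, they sum to k
    S = rng.choice(n, size=m, replace=False, p=lev / lev.sum())
    km = KMeans(n_clusters=k, n_init=10).fit(Uk[S])   # Lloyd-Max on the sample only
    return km.predict(Uk)                             # nearest-centroid labels for all n nodes
```

With m = \(\mathcal {O}(k\log k)\) samples, the iterative Lloyd-Max part costs \(\mathcal {O}(tmk^2)\) instead of \(\mathcal {O}(tnk^2)\); the final assignment of all n nodes to centroids is a single \(\mathcal {O}(nk^2)\) pass.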

In Practice

The attentive reader will have noticed that, unsurprisingly, the tighter the error control, the more expensive the computation, and vice versa. Also, although we have put an emphasis here on approximation error controls, this should not obscure the fact that methods from the whole spectrum are useful in practice, depending on the situation at hand, and specifically on the range of values of n, d, and k. In very large d situations, a first step of random projection (or feature selection if some features are suspected to be too noisy) should be considered. Then, in situations where the exact computation of the proximity graph is too expensive, one may resort either to sketching methods or to Nyström-type methods to decrease the cost from quadratic to linear in n, and directly obtain an approximation of the spectral embedding without any explicit graph construction. These methods, however, do not take into account a sparsity constraint on the proximity graph and only make a rough degree correction.

The role of the sparsity constraint is not well understood theoretically, but seems to be important in some practical cases [138]. In such instances, a better option is to use approximate nearest neighbors methods to create a sparse similarity graph and work from there (a minimal sketch is given below). For extremely large datasets, say \(n \geq 10^8\), the only workable methods are the representative-based ones, with, if possible, a first k-means (or compressive k-means [70]) to reduce n to m or, as a last resort, a uniform random sampling strategy.
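
As an illustration of the sparse-graph route, here is a minimal sketch building a symmetrized k-nearest-neighbor similarity graph; we use scikit-learn's exact kneighbors_graph as a stand-in for an approximate nearest neighbors library, and the Gaussian bandwidth sigma is an assumed parameter.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_similarity_graph(X, n_neighbors=10, sigma=1.0):
    """Build a sparse, symmetric similarity graph from the n x d data X."""
    # sparse matrix holding the distances to the n_neighbors nearest neighbors
    D = kneighbors_graph(X, n_neighbors=n_neighbors, mode='distance')
    W = D.copy()
    W.data = np.exp(-W.data**2 / (2.0 * sigma**2))   # Gaussian similarity on the kept edges
    return W.maximum(W.T)                            # keep an edge present in either direction
```

The resulting sparse matrix can then be fed to the usual Laplacian and Arnoldi machinery, or to the embedding approximation methods discussed above.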

When the similarity graph is so large that Arnoldi iterations become too expensive for computing the spectral embedding (whether the graph was created via approximate nearest neighbors or the original data is itself a graph), projection methods such as [21, 130], coarsening methods such as [90], and Nyström-based methods are all possibilities.

Sampling methods to accelerate the last k-means step may seem to be a purely theoretical endeavor, given that the Lloyd-Max algorithm is already very efficient. Due to the quadratic term in k, such sampling is nevertheless useful in practice when k grows large. In this situation, hierarchical k-means [103] is a nice option. Coresets, because they are so stringent on the error control, have a hard time actually accelerating k-means, unless hybrid coreset-inspired methods are envisioned [49]. Finally, graph-based methods, because they take into account that the spectral features are in fact derived from the graph itself, enable significant acceleration and are well suited to the spectral clustering context.

Future Research

Different directions of research could be envisioned to improve the state of the art:

  • For Nyström-inspired methods in the context of Sect. 5.3 (directly applied to the original data), as well as the other methods based on computing a low-rank approximation of the kernel matrix K, further work is needed to control both the sparsification and the degree correction, in order to bridge the gap between a provably good low-rank approximation of K and a provably good low-rank approximation of R.

  • For Nyström methods in the context of Sect. 5.4 (applied to a known or well-approximated similarity graph), it would be interesting to extend Theorem 5.2 (for instance) to a control over \(\|{\mathbf {A}}_k - \tilde {\mathbf {A}}_k\|\) instead of \(\|\mathbf {A} - \tilde {\mathbf {A}}\|\). This would enable a tighter use of the Davis–Kahan perturbation theorem in the discussion of Sect. 5.4.1 and, ultimately, a better end-to-end guarantee.

  • Projection-based methods of Sect. 5.4.3.2 currently require computing a value λ known to lie in the interval [λ k, λ k+1). The algorithm used to do so is based on eigencount techniques that turn out to require as much computation time as the Lanczos iterations needed to compute U k exactly. One should relax this constraint and derive end-to-end guarantees as a function of the distance between a coarsely estimated λ and the target interval.

  • The derivation and analysis of randomized multi-level coarsening schemes with end-to-end guarantees remain very much an open problem. We suspect that by utilizing spectrum-dependent sampling schemes akin to leverage scores one should be able to achieve results superior to heavy-edge matching in nearly linear time.

  • There is an intriguing similarity between coreset techniques and the graph-based sampling strategies discussed in Sect. 5.5.3, and it would be worthwhile to investigate this link theoretically, perhaps paving the way to coresets for spectral clustering.

Finally, accelerating the prototypical spectral algorithm depicted in Algorithm 1 should not be the sole objective of researchers in this field. Indeed, taking the graph cut point of view of Sect. 5.2.2, Algorithm 1 makes three insufficiently motivated choices: (i) To begin with, the sparsification step in Algorithm 1 is not well understood. Apart from the fact that it is always computationally more convenient to work with a sparse similarity graph than with a dense one, the precise effect of sparsification on the clustering performance has not been analyzed. (ii) As mentioned in Sect. 5.2.2.3, the relaxation employed by spectral clustering is not the only possible one. Why should we focus our attention on this particular relaxation rather than another? See, for instance, [24, 112] for recent alternative options. (iii) Finally, the use of k-means on the spectral features is not yet fully justified. Most of the end-to-end guarantees presented here compare the k-means cost of the exact solution to the k-means cost of the approximate solution. Given that the very use of k-means is not theoretically grounded, this choice of guarantee is debatable. Other options, such as a control over the rcut or ncut objectives, are possible (as in [96]) and should be further investigated.