
1 Introduction

Graph clustering, also known as community detection, is one of the most fundamental problems in algorithms, with applications in distributed computing, machine learning, network analysis, and statistics. Over the past four decades, graph clustering algorithms have been extensively studied from both the theoretical and applied perspectives [10, 21]. On the theoretical side, the problem is known as graph partitioning and is one of the most fundamental \(\mathsf {NP}\)-hard problems. Among the many reasons for its importance, we mention its connection to several central topics in theoretical computer science, including the Unique Games Conjecture and the Small Set Expansion Conjecture. Because of this, most graph clustering algorithms with good approximation guarantees are based on involved spectral and convex optimisation techniques [17, 22], whose runtime is slow even in the centralised setting. From the practical point of view, graph clustering is a key component in unsupervised learning, and has been widely applied in data mining, the analysis of social networks, and statistics. In particular, since many graphs occurring in practice (e.g. social networks) are stored in physically distributed servers (sites), designing simple and more practical distributed algorithms with better performance has received a lot of attention in recent years [2,3,4, 8, 12, 23, 24].

We study graph clustering algorithms in the distributed setting. We assume that the input \(G=(V,E)\) is an unweighted distributed network over a set of \(|V| = n\) nodes and \(|E| = m\) edges. The set of nodes is fixed and there are no node failures. Each node is a computational unit communicating only with its neighbours. We consider the synchronous timing model, in which, in each round, a node can either send the same message to all its neighbours or choose not to communicate. We assume that every node knows a rough estimate of |V|, which is not difficult to obtain; the global structure of G, however, is unknown to each node. Every node v has a unique identifier \(\mathsf {ID}(v)\) of size \(O(\log (n))\). We will assume that any message sent by a node v also contains \(\mathsf {ID}(v)\).

In this work we study the clustering problem when the input G consists of k well-defined clusters \(S_1,\cdots , S_k\) that form a partition of V, i.e., it holds that \(S_i\cap S_j=\emptyset \) for \(i\ne j\) and \(\bigcup _{1\le i\le k} S_i=V\). We allow the nodes in the network to exchange information with their neighbours over a number of rounds T. At the end of the T rounds, every node v determines a label indicating the cluster to which it belongs. Our objective is to design a distributed algorithm which guarantees that (1) most nodes within the same cluster receive the same label, and (2) every cluster has its own unique label. The performance of our algorithm is measured by (1) the total number of rounds T, (2) the approximation guarantee, i.e., how many nodes in each cluster receive the correct label, and (3) the total message complexity, i.e., the total number of words exchanged among the nodes.

Structure of Clusters. The performance of a clustering algorithm always depends on the inherent cluster structure of the network: the more significant the cluster structure is, the easier it is for an algorithm to approximate it. To quantify the significance of the cluster structure of the underlying graph, we follow previous work [18, 19] and introduce the gap assumption. For any set \(S\subset V\), let the conductance of S be

$$\begin{aligned} \phi _G(S) \triangleq \frac{|\partial (S)|}{\mathrm {vol}(S)}, \end{aligned}$$

where \(\partial (S) = E(S, V\setminus S)\) is the set of edges between S and \(V\setminus S\), and \(\mathrm {vol}(S) \) is the sum of the degrees of the nodes in S. We define the k-way expansion of G by

$$\begin{aligned} \rho (k) \triangleq \min _{\mathrm {partitions \,} S_1, \dots , S_k} \max _{1\le i \le k} \phi _G(S_i), \end{aligned}$$

and we call a partition \(\{S_i\}_{i=1}^{k}\) that achieves \(\rho (k)\) an optimal partitioning.
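
As a concrete illustration (our own Python sketch, not part of the distributed algorithm; the 4-cycle is a toy example we made up), the conductance of a set can be computed directly from an adjacency-list representation:

from typing import Dict, Hashable, Set
def conductance(adj: Dict[Hashable, Set[Hashable]], S: Set[Hashable]) -> float:
    # phi_G(S) = |E(S, V \ S)| / vol(S) for an unweighted graph given as adjacency lists.
    boundary = sum(1 for u in S for w in adj[u] if w not in S)   # number of edges leaving S
    volume = sum(len(adj[u]) for u in S)                         # vol(S): sum of degrees in S
    return boundary / volume
# Toy example: the 4-cycle 0-1-2-3-0; the set {0, 1} has 2 crossing edges and volume 4.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(conductance(adj, {0, 1}))   # prints 0.5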

One of the basic facts in spectral graph theory is a tight connection between \(\rho (k)\) and the eigenvalues of the normalised Laplacian matrix of G. In particular, Lee et al. [14] proved the so-called higher-order Cheeger inequality:

$$\begin{aligned} \frac{\lambda _k}{2} \le \rho (k) \le O(k^3)\sqrt{\lambda _k}, \end{aligned}$$
(1)

where \(0 = \lambda _1 \le \dots \le \lambda _n \le 2\) are the eigenvalues of the normalised Laplacian of G. By definition, it is easy to see that a small value of \(\rho (k)\) ensures the existence of disjoint \(S_1, \dots , S_k\) of low conductance. On the other hand, by (1) we know that a large value of \(\lambda _{k+1}\) implies that no matter how we partition G into \(k+1\) subsets \(S_1, \dots , S_{k+1}\), there will be at least one subset \(S_i\) for which \(\phi _G(S_i) \ge \frac{\lambda _{k+1}}{2}\). To formalise this intuition, we follow previous work (e.g., [19, 23]) and define

$$ \varUpsilon _G(k) \triangleq \frac{\lambda _{k+1}}{\rho (k)}. $$

By definition, a large value of \(\varUpsilon _G(k)\) ensures that G has k well-defined clusters.
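
To make these quantities concrete, the following numpy sketch (ours; the small 3-regular graph is a hypothetical toy example) computes the spectrum of the normalised Laplacian \(I - A/d\). Since \(\rho (k)\) is \(\mathsf {NP}\)-hard to compute exactly, the sketch only reports the bounds around it given by the planted partition and by (1):

import numpy as np
def normalised_laplacian_spectrum(A: np.ndarray, d: int) -> np.ndarray:
    # Eigenvalues 0 = lambda_1 <= ... <= lambda_n <= 2 of L = I - A/d (d-regular case).
    L = np.eye(A.shape[0]) - A / d
    return np.sort(np.linalg.eigvalsh(L))
# Toy 3-regular graph with two planted clusters {0,1,2,3} and {4,5,6,7}: two copies of K_4,
# each with one edge removed, reconnected by the two crossing edges (0,4) and (1,5).
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),
         (0, 4), (1, 5)]
A = np.zeros((8, 8))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
lam = normalised_laplacian_spectrum(A, d=3)
# For k = 2: the planted partition certifies rho(2) <= 2/12, while (1) gives rho(2) >= lambda_2 / 2,
# so Upsilon_G(2) = lambda_3 / rho(2) >= 6 * lambda_3.
print("lambda_2 =", lam[1], "lambda_3 =", lam[2])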

Our Result. Our main result is an improved distributed graph clustering algorithm for inputs with k well-defined clusters. For ease of presentation, we assume that G is d-regular; however, our algorithm and its analysis carry over as long as the maximum degree \(d_{\max }\) and the minimum degree \(d_{\min }\) of G satisfy \(d_{\max }/d_{\min }=O(1)\). Our result is as follows:

Theorem 1

(Main Result). Let \(G = (V, E)\) be a d-regular network with \(|V| = n\) nodes, \(|E| = m\) edges and k optimal clusters \(S_1, \dots , S_k\). If \(\varUpsilon _G(k) = \omega \left( k^4 \log ^3(n)\right) \), there is a distributed algorithm that finishes in \(T = O \left( \frac{\log (n)}{\lambda _{k+1}}\right) \) rounds, such that the following three statements hold:

1. For any cluster \(S_j\) of size \(|S_j| \le \log (n)\), every node \(u \in S_j\) will determine the same label. Moreover, this label is \(\mathsf {ID}(v)\) for some \(v\in S_j\).

2. With probability at least 0.9, for any cluster \(S_j\) of size \(|S_j| > \log (n)\), all but \(o(|S_j|)\) nodes \(u \in S_j\) will determine the same label. Moreover, this label is \(\mathsf {ID}(v)\) for some \(v\in S_j\).

3. With probability at least 0.9, the total information exchanged among the n nodes, i.e. the message complexity, is \(\widetilde{O}\left( \frac{n^2}{\lambda _{k+1}}\right) \), where \(\widetilde{O}(\cdot )\) hides \(\mathrm {poly}\log (n)\) factors.

Now we discuss the significance of our result. First of all, notice that \(\lambda _{k+1}=\varOmega (1)\) in many practical settings [16, 18], and in this case our algorithm finishes in \(T=O(\log n)\) rounds. Secondly, our result significantly improves upon previous work with respect to the approximation guarantee. As far as we know, the vast majority of previous algorithms for distributed clustering are analysed with respect to the total volume (or number) of misclassified nodes over all clusters (e.g., [2,3,4, 23]). However, this form of approximation is unsatisfactory when the underlying graph contains clusters of very unbalanced sizes, since an upper bound on the total volume (or number) of misclassified nodes could still allow the nodes of a smaller cluster to be completely misclassified. Our work overcomes this downside by analysing the approximation guarantee with respect to every approximate cluster and its optimal correspondent. To the best of our knowledge, such a strong approximation guarantee with respect to every optimal cluster was previously known only in the centralised setting [13, 19]. We show that such a result can be obtained for distributed algorithms as well.

It is not difficult to imagine that obtaining this strong approximation guarantee requires a more refined analysis of the smaller clusters, since clusters of different sizes might have mixing times of different orders when the algorithm relies on random walk based processes. Surprisingly, we are able to show that our algorithm perfectly recovers every small cluster. To the best of our knowledge, such a result on perfectly recovering small clusters is unknown even for centralised algorithms, and our subroutine for small cluster recovery might have other applications.

Finally, as a key component of our algorithm, we present a distributed subroutine which allows most nodes to estimate the size of the cluster to which they belong. This subroutine is based on the power method applied to a number of initial vectors. We show that the information retrieved by each node after this process is sufficient for most nodes to obtain good estimates of the size of their cluster. We believe that our algorithm and the techniques developed here will inspire further applications to problems concerning multiple and parallel random walks [11, 20] or testing the cluster structure of networks [9, 15].

Related Work. There is a large amount of work on graph clustering from the past decades, and here we discuss the results most closely related to ours. First of all, there have been several studies on graph clustering where the presence of the cluster structure is guaranteed by spectral properties of the Laplacian matrix of the input graph. Von Luxburg [16] studies spectral clustering and discusses the influence of the eigen-gap on the quality of spectral clustering algorithms. Peng et al. [19] analyse spectral clustering on well-clustered graphs and show that, when there is a gap between \(\lambda _{k+1}\) and \(\rho (k)\), the approximation guarantee of spectral clustering can be theoretically analysed. Gharan and Trevisan [18] design an approximation algorithm that, under some condition on the relationship between \(\lambda _k\) and \(\lambda _{k+1}\), returns k clusters \(S_1,\ldots , S_k\) such that both the inner and outer conductance of each \(S_i\) can be theoretically analysed. Allen-Zhu et al. [1] present a local algorithm for finding a cluster with an improved approximation guarantee under a gap assumption similar to ours.

For distributed graph clustering, the work most closely related to ours is the distributed algorithm developed by Sun and Zanetti [23]. In comparison to our algorithm, the algorithm in [23] only holds for graphs that consist of clusters of balanced sizes, and the approximation guarantee (i.e., the number of misclassified nodes) of their algorithm is with respect to the volume of the input graph, while the approximation guarantee of our algorithm is with respect to every individual cluster. Becchetti et al. [3] presented a distributed graph clustering algorithm for the case \(k=2\), based on an Averaging dynamics process. However, their analysis holds only for a restricted class of graphs exhibiting sparse cuts. Becchetti et al. [4] extended these results to a more general class of volume-regular networks with k clusters. Nonetheless, their results apply to reasonably balanced networks, which is a setting more restricted than ours. Finally, we would like to mention a related line of work on decomposing graphs into expanders [6, 7]. However, we highlight that these algorithms cannot be applied in our setting for the following reasons: the number of partitioning sets could be much larger than the initial number of clusters k, and the decomposition also allows some fraction of nodes not to be in any cluster [7].

Notation. We consider the input network \(G = (V,E)\) to be an unweighted d-regular network on \(|V| = n\) nodes and \(|E| = m\) edges. The \(n \,\times \,n\) adjacency matrix is denoted \(A_G\) and is defined as \(A_G(u,v) = 1\) if \(\{u, v\} \in E\) and \(A_G(u,v) = 0\) otherwise. The normalised Laplacian of G is the \(n \times n\) matrix defined as

$$ \mathcal {L}_G \triangleq I - \frac{1}{d} \cdot A_G. $$

We denote the eigenvalues of \(\mathcal {L}_G\) by \(0 = \lambda _1 \le \dots \le \lambda _n \le 2\). For any subset \(S \subseteq V\), we define the characteristic vector \(\mathbf{1} _S \in \mathbb {R}^n\) by \(\mathbf{1} _S(v) = 1\) if \(v \in S\) and \(\mathbf{1} _S(v) = 0\) otherwise. For brevity, we will write \(\mathbf{1} _v\) whenever \(S = \{v\}\).

We will consider the setting in which the input network G contains k disjoint clusters \(S_1, \dots , S_k\) that form a partition of V. For a given node \(v \in V\), we will denote by \(\mathcal {S}(v)\) the cluster that contains v. We will write \(\mathrm {Broadcast}_u(\mathsf {Message})\) whenever a node u sends a \(\mathsf {Message}\) to its neighbours, and we will drop the subscript u whenever it is clear from the context. We will denote the label of a node v by L(v) and we will assume that initially \(L(v) = \perp \) for all nodes v. Throughout our algorithm, some nodes \(v\in V\) will become active. We will use the notation \(v^*\) whenever referring to an active node v.

2 Algorithm Description

Our algorithm consists of three major phases, Averaging, Small Detection and Large Detection, which we describe individually.

Averaging Phase: The Averaging phase (Algorithm 1) consists of the execution of n different diffusion processes, one for every node. To each diffusion process, say the one corresponding to a node v, we associate a set of vectors \(\left\{ x_i^v\right\} _{i}\) such that, after every round i, every node u in the network stores the value \(x_i^v(u)\). The value \(x_i^v(u)\) is the mass that node u has received from node v after i rounds. Initially (round 0) the diffusion process starts from \(x_0^v = \frac{1}{\sqrt{d}} \mathbf{1} _v\), i.e., the entire mass \(\frac{1}{\sqrt{d}}\) is concentrated at node v and all other nodes hold mass 0. For a general round i, the vector \(x_i^v\) is constructed iteratively via the recursive formula

$$\begin{aligned} x_{i}^{v}(u) = \frac{1}{2}x_{i-1}^v(u) + \frac{1}{2d}\sum _{\{u, w\} \in E} x_{i-1}^{v}(w), \end{aligned}$$
(2)

for any u. We remark that the only information a node u needs in round i is the values \(x_{i-1}^v(w)\) of all its neighbours w. We also note that at any round i, node u does not need to update the values \(x_i^v\) for all nodes v in the network. Instead, u focuses on the diffusion processes started at the nodes which u has already seen throughout the process. To keep track of the already seen nodes, u maintains the set \(\mathsf {Seen}(u)\) (see Line 2 of Algorithm 1).

Now let us discuss the intuition behind this phase. The goal of the diffusion process started at node v is that, after \(T = \varTheta \left( \frac{\log (n)}{\lambda _{k+1}}\right) \) rounds, the entire mass \(\frac{1}{\sqrt{d}}\) is split roughly equally among all nodes inside the cluster \(\mathcal {S}(v)\), with very little of this mass exiting the cluster. A closer look at Eq. (2) tells us that the diffusion process started at v is nothing but a T-step, 1/2-lazy random walk process starting from the vector \(\frac{1}{\sqrt{d}}\mathbf{1} _v\). It is well known [5] that, assuming G is connected, the vectors \(x_i^{v}\) converge to the uniform distribution as i goes to infinity. However, if the process runs for merely \(T = \varTheta \left( \frac{\log (n)}{\lambda _{k+1}}\right) \) iterations, one should expect the vector \(x_{T}^v\) to be close to the (normalised) indicator vector of the cluster \(\mathcal {S}(v)\). This is because \(\frac{\log (n)}{\lambda _{k+1}}\) corresponds to the local mixing time inside the cluster \(\mathcal {S}(v)\). Therefore, after T rounds, we expect the values \(x_{T}^v(u)\) to be similar and significantly greater than 0 for every \(u \in \mathcal {S}(v)\), while for nodes \(w \notin \mathcal {S}(v)\) we expect the values \(x_{T}^v(w)\) to be close to 0.
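
The following centralised simulation (our own sketch; the toy graph and the choice T = 5 are ours, not from the paper) mirrors the per-node update of Eq. (2): every node starts a diffusion with mass \(1/\sqrt{d}\) on itself and, in each round, every node applies the lazy averaging rule to the values held by its neighbours:

import math
def averaging_phase(adj: dict, d: int, T: int) -> dict:
    # adj: node -> set of neighbours of a d-regular graph.
    # Returns x with x[v][u] = x_T^v(u), the value held by u for the diffusion started at v.
    nodes = sorted(adj)
    x = {v: {u: (1.0 / math.sqrt(d) if u == v else 0.0) for u in nodes} for v in nodes}
    for _ in range(T):
        # One synchronous round of Eq. (2), applied to every diffusion process v at every node u.
        x = {v: {u: 0.5 * x[v][u] + sum(x[v][w] for w in adj[u]) / (2 * d) for u in nodes}
             for v in nodes}
    return x
# Toy run on a 3-regular graph with planted clusters {0,1,2,3} and {4,5,6,7}.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 6), (4, 7), (5, 6), (5, 7), (6, 7), (0, 4), (1, 5)]
adj = {u: set() for u in range(8)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
x = averaging_phase(adj, d=3, T=5)
print(x[0])   # the diffusion started at node 0 keeps noticeably more mass on {0,1,2,3} than on {4,...,7}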

At the end of the Averaging phase, based on the values \(\left\{ x_{T}^v(u)\right\} _v\), each node u computes an estimate \(\ell _u\) of the size of its cluster. We define the estimates as

$$ \ell _u \triangleq \frac{3}{d \cdot \sum _{v} \left[ x_{T}^v(u)\right] ^2}, $$

and we will show that, for most nodes u, the estimates satisfy \(\ell _u \in \left[ |\mathcal {S}(u)|, 4|\mathcal {S}(u)|\right] \) (see Lemma 6).
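
In code, the estimate is a one-liner over the table of values produced by a simulation such as the Averaging-phase sketch above (the helper name is ours):

def estimate_cluster_size(x: dict, u, d: int) -> float:
    # ell_u = 3 / (d * sum_v x_T^v(u)^2).  Intuitively, if x_T^v(u) is about 1/(sqrt(d)|S(u)|)
    # for the |S(u)| nodes v in u's cluster and about 0 otherwise, the sum is about 1/(d|S(u)|),
    # so ell_u is about 3|S(u)|, which indeed lies in [|S(u)|, 4|S(u)|].
    return 3.0 / (d * sum(xv[u] ** 2 for xv in x.values()))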

[Algorithm 1: the Averaging phase]

Small Detection Phase: The purpose of this phase is for every node u in a cluster of small size \(|\mathcal {S}(u)| \le \log (n)\) to determine its label. Again, we focus on the intuition behind this process and refer the reader to Algorithm 2 for a formal description. From the perspective of a node u, we would like to use the values \(\{x_{T}^v(u)\}_v\) to decide which nodes are in its own cluster. Informally, the values \(\{x_{T}^v(u)\}_{v \in \mathcal {S}(u)}\) should be similar and close to \(\frac{1}{\sqrt{d}|\mathcal {S}(u)|}\) since, for every diffusion process started at \(v \in \mathcal {S}(u)\), we expect the \(\frac{1}{\sqrt{d}}\) mass to be equally distributed among all nodes in the cluster. At the same time, we expect the values \(\{x_{T}^w(u)\}_{w \notin \mathcal {S}(u)}\) not to be very large, because they correspond to random walks started in different clusters. Therefore, if \(|\mathcal {S}(u)|\) is not too big, we expect to see a clear separation between \(x_{T}^v(u)\) and \(x_{T}^{w}(u)\) for any \(v \in \mathcal {S}(u)\) and \(w \notin \mathcal {S}(u)\).

Since u knows that it is a member of its own cluster, it can use \(x_{T}^u(u)\) as a reference point. Namely, u computes the differences

$$ y_v \triangleq |x_{T}^u(u) - x_{T}^v(u)|, $$

for all v and sorts them in increasing order. Denote these sorted values by \(y_{v_1} \le \dots \le y_{v_n}\). Based on the previous remarks, one should expect the first \(|\mathcal {S}(u)|\) values to be small and to correspond to nodes \(v \in \mathcal {S}(u)\). Then, u performs a binary search to find the exact size of its cluster \(\mathcal {S}(u)\) (Lines 4-10 of Algorithm 2). Finally, u sets its label to be the minimum \(\mathsf {ID}\) among the nodes corresponding to the smallest \(|\mathcal {S}(u)|\) values \(\{ y_{v_i}\}\).
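
The local computation of a node u can be sketched as follows (our Python rendering of this phase; x is the table of values \(x_T^v(u)\) from the Averaging phase, and node names double as IDs):

import math
def small_detection(x: dict, u, d: int):
    # Sort the differences y_v = |x_T^u(u) - x_T^v(u)| and binary-search the largest index i
    # such that y_{v_i} < 9 / (10 * sqrt(d) * i); by Lemma 7 (Sect. 3.3) this index equals |S(u)|.
    y = sorted((abs(x[u][u] - x[v][u]), v) for v in x)
    lo, hi = 1, len(y)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if y[mid - 1][0] < 9.0 / (10.0 * math.sqrt(d) * mid):
            lo = mid           # v_mid is still inside u's cluster
        else:
            hi = mid - 1       # v_mid lies outside u's cluster
    cluster_size = lo
    return min(v for _, v in y[:cluster_size])   # label: the smallest ID in the recovered cluster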

[Algorithm 2: the Small Detection phase]

Before moving on, we stress two important facts about this phase. First of all, it is crucial that the cluster \(\mathcal {S}(u)\) has size \(|\mathcal {S}(u)| \le \log (n)\). Otherwise, the values \(\{ x_{T}^v(u) \}_{v \in \mathcal {S}(u)}\) would be too small for u to distinguish them. Therefore, we cannot use this approach to determine the label of all nodes in the network. Secondly, we remark that for the algorithm to work, every node u must be in possession of the value \(x_{T}^u(u)\). This can be ensured only if all nodes start their own diffusion process.

Large Detection Phase: In this phase, the algorithm detects the remaining clusters of large and medium size, that is, clusters S with \(|S| > \log (n)\). The formal description of this phase can be found in Algorithm 3. Again, a node u in such a cluster would like to use the values \(\{x_{T}^v(u)\}_v\) to determine the composition of its cluster. Unfortunately, node u cannot trust all values \(x_{T}^v(u)\) because of the error in the diffusion processes caused by some mass exiting the cluster.

To overcome this difficulty, we use a different approach. We want each cluster \(S_j\) to select some representatives, which we will refer to as active nodes. The label of the cluster \(S_j\) will be the minimum ID among the active nodes in \(S_j\). The purpose of this selection is to avoid the (bad) nodes for which the diffusion process does not behave as expected. To this end, we let each node u activate itself independently (Line 3 of Algorithm 3), with probability

$$ p(u) \triangleq \frac{5\cdot \log \left( 100k\right) }{\ell _u}. $$

Since for most nodes \(\ell _u \approx \left|\mathcal {S}(u) \right|\), the probability p(u) is large enough to ensure that every cluster has at least one active node and that, in expectation, there are not too many active nodes overall. If a node u becomes active, it announces this to the network, along with its estimated cluster size \(\ell _u\). This happens throughout T rounds of communication (Lines 6–14 of Algorithm 3). In such a round j, every node v in the network keeps a list \(\mathsf {Act}(v)\) of the active nodes v has seen up to round j. Then, v checks which of these active nodes it has not yet forwarded, broadcasts them in a \(\mathsf {Message}\) (Line 12 of Algorithm 3) and marks them as \(\mathsf {Sent}\). This ensures that every node v announces each active node at most once, which significantly reduces the communication cost.

Coming back to node u, at the end of the T rounds u has seen all active nodes \(v^* \in \mathsf {Act}(u)\). Node u considers an active node \(v^*\) to be in \(\mathcal {S}(u)\) if two conditions are satisfied: 1) the estimated cluster sizes \(\ell _u\) and \(\ell _{v^*}\) are similar, and 2) the value \(x_T^{v^*}(u)\) is close to what u expects to see. More precisely, u sets the threshold

$$ t_u \triangleq \frac{1}{2\sqrt{d}\ell _u}, $$

and considers its set of candidates

$$ \mathsf {Cand}(u) \triangleq \left\{ v^* \big | v^*\in \mathsf {Act}(u), \frac{\ell _u}{4} \le \ell _{v^*} \le 4\ell _u \, \text { and }\, x_T^{v^*}(u) \ge t_u\right\} . $$

The set of candidates \(\mathsf {Cand}(u)\) represents the set of active nodes that u believes are in its own cluster. If \(\mathsf {Cand}(u) \ne \emptyset \), then u sets its label to be \(\min _{v^* \in \mathsf {Cand}(u)} \mathsf {ID}(v^*)\). Otherwise, if u is unlucky and \(\mathsf {Cand}(u) = \emptyset \), then u randomly chooses an active node \(v^* \in \mathsf {Act}(u)\) and selects its label as \(L(u) = \mathsf {ID}(v^*)\) (Lines \(20{-}23\) of Algorithm 3).
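
A compact centralised rendering of this phase (our sketch; the T propagation rounds are abstracted away by handing every node the full list of active nodes, and all helper names are ours) looks as follows:

import math
import random
def large_detection(x: dict, ell: dict, d: int, k: int, seed: int = 0) -> dict:
    # x[v][u] = x_T^v(u) from the Averaging phase; ell[u] is u's estimate of |S(u)|.
    rng = random.Random(seed)
    nodes = list(x)
    active = [u for u in nodes if rng.random() < min(1.0, 5 * math.log(100 * k) / ell[u])]
    labels = {}
    for u in nodes:
        t_u = 1.0 / (2 * math.sqrt(d) * ell[u])                 # threshold t_u
        cand = [v for v in active
                if ell[u] / 4 <= ell[v] <= 4 * ell[u] and x[v][u] >= t_u]
        if cand:
            labels[u] = min(cand)                               # smallest ID among the candidates (node names double as IDs)
        elif active:
            labels[u] = rng.choice(active)                      # unlucky case: a random active node
    return labels                                               # nodes absent from labels keep their label undecided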

[Algorithm 3: the Large Detection phase]

The Main Algorithm: Now we bring together all three subroutines and present our main Algorithm 4. We note that once a node u has determined its label, the label will not change in the future. This is because Algorithm 3 can only change the label if \(L(u) = \perp \) initially. Moreover, even if a node has determined its label in the Small Detection phase, it still participates in the Large Detection phase, since it takes part in the Propagation step (Lines 6–14) of Algorithm 3.

[Algorithm 4: the main algorithm]

3 Analysis of the Algorithm

In this section we analyse our distributed algorithm and prove Theorem 1. Recall that we assume G is a d-regular network with k optimal clusters \(S_1, \dots , S_k\). Moreover, we work in the regime in which G satisfies the assumption that

$$\begin{aligned} \varUpsilon _G(k) = \omega \left( k^4 \log ^3(n)\right) . \end{aligned}$$
(3)

For brevity, we will use \(\varUpsilon \) instead of \(\varUpsilon _G(k)\). We will structure the analysis of our algorithm in five subsections. In Subsect. 3.1 we recap some of the results in [23] and present the guarantees achieved after the Averaging phase of the algorithm. In Subsect. 3.2 we show that most nodes in the network can obtain a good estimate for the size of their cluster. In Subsect. 3.3 we deal with the analysis of the Small Detection phase of the algorithm. We will ultimately show that for all clusters \(S_j\) of size \(|S_j| \le \log (n)\), all nodes \(u\in S_j\) will determine the same label, unique for the cluster \(S_j\). In Subsect. 3.4 we analyse the Large Detection phase of the algorithm. In this phase, most nodes in clusters of size at least \(\log (n)\) will decide on a common label that is unique for the cluster. We also show that the number of misclassified nodes for each cluster is small. Finally, we conclude with the proof of Theorem 1 in Subsect. 3.5.

3.1 Analysis of the Averaging Phase

Recall that the Averaging phase consists of performing n diffusion processes for \(T = \varTheta \left( \frac{\log (n)}{\lambda _{k+1}}\right) \) rounds. For a node v, its own diffusion process can be viewed as a lazy random walk starting at \(x_0^v = \frac{1}{\sqrt{d}}\mathbf{1} _v\) and following the recursion

$$ x_{i+1}^v = P x_{i}^v, $$

where

$$ P \triangleq \frac{1}{2} \cdot I + \frac{1}{2d} \cdot A_G = I - \frac{1}{2} \mathcal {L}_G $$

is the transition matrix of the process. It is known that, assuming G is connected, the vectors \(\{x_i^v\}\) converge to the stationary distribution as i goes to infinity. However, if the power method runs for only \(T = \varTheta \left( \frac{\log (n)}{\lambda _{k+1}}\right) \) iterations, we expect \(x_{T}^v \approx \frac{1}{\sqrt{d}\cdot |\mathcal {S}(v)|} \cdot \mathbf{1} _{\mathcal {S}(v)}\). Sun and Zanetti [23] formalise this intuition and give a concrete version of the above observation (see Lemma 2). The intuition behind their result lies in the fact that, for graphs G with k good clusters, there is a strong connection between the bottom k eigenvectors of \(\mathcal {L}_G\) and the normalised indicator vectors of the clusters [19].
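
In matrix form, the whole Averaging phase is just T steps of the power method applied simultaneously to all n starting vectors; a small numpy sketch of ours, equivalent to the per-node recursion of Eq. (2), reads:

import numpy as np
def lazy_walk_vectors(A: np.ndarray, d: int, T: int) -> np.ndarray:
    # P = (1/2) I + (1/(2d)) A = I - (1/2) L_G; column v of the result is x_T^v = P^T x_0^v,
    # which for T ~ log(n)/lambda_{k+1} we expect to be close to (1/(sqrt(d)|S(v)|)) 1_{S(v)}.
    n = A.shape[0]
    P = 0.5 * np.eye(n) + A / (2 * d)
    X0 = np.eye(n) / np.sqrt(d)              # column v is x_0^v = (1/sqrt(d)) 1_v
    return np.linalg.matrix_power(P, T) @ X0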

We will now introduce the notation required to formalise the above discussion. Let \(f_1, \dots , f_k\) be the bottom k eigenvectors of \(\mathcal {L}_G\) and let \(\{ \chi _{S_1}, \dots , \chi _{S_k}\}\) be the normalised indicator vectors of the clusters \(S_i\), that is, \(\chi _{S_i} \triangleq \frac{1}{\sqrt{|S_i|}} \mathbf{1} _{S_i}\) for all i. Let \(\widetilde{\chi }_i\) be the projection of \(f_i\) onto \(\mathrm {span}\{\chi _{S_1}, \dots , \chi _{S_k}\}\) and let \(\{\widehat{\chi }_i\}\) be the vectors obtained from \(\{\widetilde{\chi }_i\}\) by applying the Gram-Schmidt orthonormalisation. For any node v, we define the discrepancy parameter

$$ \alpha _v \triangleq \sqrt{\frac{1}{d} \sum _{i=1}^k \left( f_i(v) - \widehat{\chi _i}(v)\right) ^2}. $$
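
For intuition, \(\alpha _v\) can be computed explicitly in a centralised numpy sketch (ours; QR factorisation plays the role of the Gram-Schmidt step, and the values are only meaningful up to the usual sign ambiguity of eigenvectors):

import numpy as np
def discrepancies(A: np.ndarray, d: int, clusters: list) -> np.ndarray:
    # Bottom k eigenvectors f_1,...,f_k of L = I - A/d, projected onto span{chi_{S_1},...,chi_{S_k}}
    # and re-orthonormalised; alpha_v is the per-node distance between f_i(v) and hat{chi}_i(v),
    # scaled by 1/sqrt(d) as in the definition above.
    n, k = A.shape[0], len(clusters)
    L = np.eye(n) - A / d
    _, eigvecs = np.linalg.eigh(L)
    F = eigvecs[:, :k]                                   # f_1, ..., f_k as columns
    Chi = np.zeros((n, k))
    for i, S in enumerate(clusters):
        Chi[list(S), i] = 1.0 / np.sqrt(len(S))          # chi_{S_i} = 1_{S_i} / sqrt(|S_i|)
    Proj = Chi @ (Chi.T @ F)                             # tilde{chi}_i: projections of the f_i
    Q, R = np.linalg.qr(Proj)                            # orthonormalise the projections
    Q = Q * np.sign(np.diag(R))                          # fix column signs, as Gram-Schmidt would
    return np.sqrt(np.sum((F - Q) ** 2, axis=1) / d)     # alpha_v for every node v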

We are now ready to state the result relating \(x_{T}^v\) to the indicator vector of the cluster \(\mathbf{1} _{\mathcal {S}(v)}\):

Lemma 2

(Adaptation of Lemma 4.4 in [23]). For any \(v \in V\), if we run the lazy random walk for \(T = \varTheta \left( \frac{\log {n}}{\lambda _{k+1}}\right) \) rounds, starting at \(x_0^v = \frac{1}{\sqrt{d}}\mathbf{1} _v\), we obtain a vector \(x_{T}^v\) such that

$$\begin{aligned} \left\Vert x_T^v - \frac{1}{\sqrt{d} \cdot |\mathcal {S}(v)|} \cdot \mathbf{1} _{\mathcal {S}(v)} \right\Vert ^2 = O \left( \frac{k^2}{d \cdot \varUpsilon \cdot |\mathcal {S}(v)|} + \alpha _v^2\right) . \end{aligned}$$
(4)

One can view the RHS of (4) as an upper bound on the total error of the diffusion process started at v. To this end, for every v let us define the vector

$$\begin{aligned} \boldsymbol{\varepsilon }_v \triangleq x_T^v - \frac{1}{\sqrt{d}\cdot |\mathcal {S}(v)|} \cdot \mathbf{1} _{\mathcal {S}(v)} \end{aligned}$$
(5)

and we write, for short, \(\varepsilon _{(v, u)} = \boldsymbol{\varepsilon }_{v}(u)\). It is important to note that the order of the pair matters, since \(\varepsilon _{(v, u)}\) corresponds to a diffusion process started at v, while \(\varepsilon _{(u, v)}\) corresponds to a diffusion process started at u.

Under this notation, Eq. (4) becomes

$$\begin{aligned} \sum _{u \in V} \varepsilon _{(v, u)}^2 \le C_{\varepsilon } \left( \frac{k^2}{d \cdot \varUpsilon \cdot |\mathcal {S}(v)|} + \alpha _v^2\right) , \end{aligned}$$
(6)

for some absolute constant \(C_{\varepsilon }\). While one should expect each individual error \(\varepsilon _{(v, u)}\) to be relatively small, i.e. \(O\left( \frac{1}{|\mathcal {S}(v)|}\right) \), it is not immediately clear why this should be the case. Indeed, the presence of \(\alpha _v\) in Eq. (6) can cause significant perturbation. Given the relatively complicated definition of this parameter, the only upper bound we are aware of is the following:

Lemma 3

It holds that

$$\begin{aligned} \sum _{v\in V} \alpha _v^2 = O \left( \frac{k^2}{d \cdot \varUpsilon }\right) \le \frac{C_{\alpha }\cdot k^2}{d\cdot \varUpsilon }, \end{aligned}$$
(7)

for some absolute constant \(C_{\alpha }>0\).

3.2 Estimating the Cluster Size

In this section we will show that most nodes in every cluster are able to estimate approximately the size of their cluster. Recall that the estimate that each node u computes is

$$ \ell _u = \frac{3}{d \sum _{v}\left[ x_{T}^v(u)\right] ^2}, $$

and we want to show that for most nodes u we have \(|\mathcal {S}(u)| \le \ell _u \le 4\cdot |\mathcal {S}(u)|\). To this end, we split the set of bad nodes, i.e. those that do not satisfy the above condition, into two categories:

$$ \mathcal {B}_{\mathrm {big}} \triangleq \left\{ \,u\Bigg |\ell _u > 4 \left|\mathcal {S}(u) \right|\,\right\} \qquad \text {and} \qquad \mathcal {B}_{\mathrm {small}} \triangleq \left\{ \,u\Bigg |\ell _u < \left|\mathcal {S}(u) \right|\,\right\} . $$

Moreover we let

$$ \mathcal {B}_{\ell }\triangleq \mathcal {B}_{\mathrm {big}}\cup \mathcal {B}_{\mathrm {small}}$$

and we will show that in each cluster only a small fraction of the nodes can be in \(\mathcal {B}_{\ell }\). We start with the set \(\mathcal {B}_{\mathrm {big}}\).

Lemma 4

For every cluster \(S_j\) it holds that

$$ \left|\mathcal {B}_{\mathrm {big}}\cap S_j \right| \le \frac{\left|S_j \right|}{2 \cdot 500k \cdot \log (nk)}. $$

Now we focus on the set \(\mathcal {B}_{\mathrm {small}}\). In this case, we will prove something stronger, namely that in each cluster \(S_j\) the fraction of nodes whose estimate falls below a given value \(\ell \) is directly proportional to \(\ell \). In other words, the smaller the estimate, the fewer the nodes that produce it. This result is crucial for the analysis of the Large Detection phase. To this end, we define the level sets

$$ \mathcal {B}_{\mathrm {small}}^i \triangleq \left\{ \,u\Bigg |\ell _u < \frac{\left|\mathcal {S}(u) \right|}{2^{i-1}}\,\right\} , $$

for \(i = 1, \dots , \log (n)\). The following result formalises our discussion.

Lemma 5

For every cluster \(S_j\) and every \(i = 1, \dots , \log (n)\) it holds that

$$ \left|\mathcal {B}_{\mathrm {small}}^i \cap S_j \right| \le \frac{\left|S_j \right|}{2^i \cdot 500k \cdot \log (nk)}. $$

Now we are ready to state and prove the main result of this subsection.

Lemma 6

Almost all nodes \(u \in V\) have a good approximation \(\ell _u \approx \left|\mathcal {S}(u) \right|\). That is, for every cluster \(S_j\) the following conditions hold:

1. \(\frac{|\mathcal {B}_{\ell }\cap S_j|}{|S_j|} \le {\frac{1}{500k \cdot \log (nk)}}\);

2. \(\forall u \notin \mathcal {B}_{\ell }\), it holds that \(|\mathcal {S}(u)| \le \ell _u \le 4 |\mathcal {S}(u)|\).

Proof

Applying Lemma 4 and Lemma 5 (the latter with \(i=1\)), each of the two fractions below is at most \(\frac{1}{2 \cdot 500k \cdot \log (nk)}\), and hence

$$ \frac{\left|\mathcal {B}_{\ell }\cap S_j \right|}{\left|S_j \right|} \le \frac{\left|\mathcal {B}_{\mathrm {small}}\cap S_j \right|}{\left|S_j \right|} + \frac{\left|\mathcal {B}_{\mathrm {big}}\cap S_j \right|}{\left|S_j \right|} \le \frac{1}{500k \cdot \log (nk)}. $$

The second statement follows directly from the definition of \(\mathcal {B}_{\ell }\).

   \(\square \)

3.3 Analysis of the Small Detection Phase

This subsection is dedicated to the analysis of the Small Detection phase of our clustering algorithm, and thus to the analysis of Algorithm 2. We will show that our algorithm perfectly recovers all clusters of small size. We first introduce some notation. Up to a permutation of the indices, we may assume without loss of generality that the clusters \(S_1, \dots , S_p\) satisfy

$$ |S_i| \le \log (n), $$

for each \(1\le i \le p \le k\). Moreover, we will denote by

$$ \mathcal {A} = S_1 \cup \dots \cup S_p $$

the union of these clusters. The proof of our claim relies on the key observation that, for nodes \(u\in \mathcal {A}\), there is a large enough gap between the mass values of diffusion processes started in the same cluster and those started in different clusters. This observation is formalised below.

Lemma 7

For any \(u \in \mathcal {A}\) and \(v\in V\) the following statements hold.

1. If \(v \in \mathcal {S}(u)\), then \(\left|x_T^u(u) - x_T^v(u) \right| \le \frac{1}{100\sqrt{d}\cdot \log (n)}\);

2. If \(v \notin \mathcal {S}(u)\), then \(\left|x_T^u(u) - x_T^v(u) \right|\ge \frac{9}{10\sqrt{d}\cdot |\mathcal {S}(u)|}\).

Now we state and prove the main result of this subsection.

Lemma 8

Let \(S_j\) be a cluster such that \(|{S_j}| \le \log (n)\). At the end of the Small Detection phase of the algorithm, all nodes \(u\in S_j\) will agree on a unique label. Moreover, this label is \(\mathsf {ID}(v)\), for some \(v \in S_j\).

Proof

Let \(S_j\) be some cluster of size \(|S_j| \le \log (n)\) and let \(u \in S_j\) be an arbitrary node. By Lemma 6 we have \(|\mathcal {B}_{\ell }\cap S_j| \le \frac{|S_j|}{500k \cdot \log (nk)} < 1\), and hence \(\mathcal {B}_{\ell }\cap S_j = \emptyset \). In particular, every node of \(S_j\) has an approximation \(|S_j| \le \ell _u \le 4 |S_j| \le 4 \log (n)\), and therefore every node \(u \in S_j\) will certainly perform the Small Detection phase (Line 6 of Algorithm 4).

Firstly, u sorts the values \(y_v = |x_T^u(u) - x_T^v(u)|\), for all \(v \in V\). Say the sorted values are \(y_{v_1} \le \dots \le y_{v_n}\). Notice that if \(w_1 \in S_j\) and \(w_2 \notin S_j\), by Lemma 7 it must be that

$$ |x_T^u(u) - x_T^{w_1}(u)| \le \frac{1}{100\sqrt{d}\cdot \log (n)} < \frac{9}{10 \sqrt{d} \cdot |S_j|} \le |x_T^u(u) - x_T^{w_2}(u)|. $$

Therefore, u knows that the first \(|S_j|\) values, namely \(y_{v_1}, \dots , y_{v_{|S_j|}}\) correspond to nodes in its own cluster and the other values correspond to nodes in different clusters. Thus, u needs to find a pair of consecutive values \(y_{v_i} \le y_{v_{i+1}}\) such that \(v_i \in S_j\) and \(v_{i+1} \notin S_j\).

To do this, u performs a binary search to find the size of its cluster. At an intermediate step of the search, u considers the value \(y_{v_i}\), for some i, and compares it with \(\frac{9}{10\sqrt{d}\cdot i}\). If \(y_{v_i} < \frac{9}{10\sqrt{d}\cdot i}\) we claim that \(i \le |S_j|\). If not, then \(v_i \notin S_j\) and by Lemma 7 we have that

$$ y_{v_i} \ge \frac{9}{10\sqrt{d}\cdot |S_j|} \ge \frac{9}{10\sqrt{d}\cdot i}, $$

which gives the contradiction. Similarly, we can show that if \(y_{v_i} \ge \frac{9}{10\sqrt{d}\cdot i}\) then \(i > |S_j|\).

Once the node u finds the exact size of its cluster \(|\mathcal {S}(u)|\), it also knows which nodes are in the same cluster, namely \(v_1, \dots , v_{|\mathcal {S}(u)|}\). Thus u can set its label to be the smallest ID among the nodes in its cluster. This holds for all nodes \(u \in S_j\) and all clusters \(S_j\).    \(\square \)

3.4 Analysis of the Large Detection Phase

At this point, we will assume all n random walks have been completed, i.e. the T rounds have been executed and each node u has a list of values \(\{x_{T}^v(u)\}_v\). From the perspective of u, we would like to use this information to decide which nodes are in the same cluster as u and which are not. Unfortunately, node u cannot trust all values \(x_{T}^v(u)\) because of the error term \(\varepsilon _{(v, u)}\). Going back to Eq. (6), we see that these errors are dependent on the parameters \(\alpha _v\). To overcome this issue, we define the notion of a \(\gamma \)-bad node, that is a node v for which the value of \(\alpha _v\) is large relative to its cluster size:

Definition 9

We say that a node u is \(\gamma \)-bad if

$$ \alpha _u^2 \ge \frac{\gamma \cdot C_{\alpha } \cdot k^2}{d \cdot \varUpsilon \cdot \left|\mathcal {S}(u) \right|}. $$

The set of \(\gamma \)-bad nodes is denoted by

$$ \mathcal {B}_{\gamma } \triangleq \left\{ \,u\Bigg |\alpha _u^2 \ge \frac{\gamma \cdot C_{\alpha } \cdot k^2}{d \cdot \varUpsilon \cdot \left|\mathcal {S}(u) \right|}\,\right\} . $$

One should think of the \(\gamma \)-bad nodes as nodes for which the diffusion process does not necessarily behave as expected. Hence we want to avoid activating them, since they are not good representatives of their own clusters. More concretely, combining Eq. (6) with the above definition, we have the following remark:

Remark 10

For every node \(v \notin \mathcal {B}_{\gamma }\) it holds that

$$ \left\Vert \boldsymbol{\varepsilon }_v \right\Vert ^2 \le C_{\varepsilon }\left( 1 + C_{\alpha } \cdot \gamma \right) \cdot \left( \frac{k^2}{d \cdot \varUpsilon \cdot |\mathcal {S}(v)|} \right) \le \frac{2 \cdot C_{\varepsilon } \cdot C_{\alpha } \cdot \gamma \cdot k^2}{d \cdot \varUpsilon \cdot \left|\mathcal {S}(v) \right| }. $$

For the rest of the analysis, we will consider the value

$$\begin{aligned} \gamma \triangleq 500k\cdot \log (100k). \end{aligned}$$
(8)

As for the question of how many \(\gamma \)-bad nodes are inside each cluster, the answer is that there are not too many, as formalised in the lemma below:

Lemma 11

Let \(S_j\) be some cluster. It holds that

$$ |\mathcal {B}_{\gamma } \cap S_j| \le \frac{|S_j|}{\gamma }. $$

The ultimate goal of the activation process is to select representatives for each cluster in such a way that the following conditions hold: (1) Every cluster has at least one active node, (2) The total number of active nodes is small and (3) No \(\gamma \)-bad node becomes active. Recall that each node activates independently with probability

$$ p(u) = \frac{5\log (100k)}{\ell _u}. $$

For most nodes, i.e. \(u \in V \setminus \mathcal {B}_{\ell }\), the activation probabilities are good enough to ensure that the three conditions hold. The tricky part is to deal with nodes \(u \in \mathcal {B}_{\ell }\). More precisely, for nodes u such that \(\ell _u \ll |\mathcal {S}(u)|\) and \(u \in \mathcal {B}_{\gamma }\), the activation probability is simply too large to argue directly that no such node becomes active. We overcome this by first showing that, with high constant probability, no node \(u \in \mathcal {B}_{\ell }\) becomes active, and, building on this, that no node in \(\mathcal {B}_{\gamma }\) becomes active either. We formalise our discussion in Lemma 12, which is the main technical result of this subsection.

Lemma 12

With probability at least 0.9, the following statements hold:

A1. No node from \(\mathcal {B}_{\ell } \cup \mathcal {B}_{\gamma }\) becomes active;

A2. Every cluster \(S_j\) contains at least one active node \(v_j^* \in S_j \setminus \mathcal {B}_{\ell }\);

A3. The total number of active nodes is \(n_a \le 500k\cdot \log (100k)\).

Now we are ready to state the main result of this subsection.

Lemma 13

At the end of the Large Detection phase, with probability at least 0.9, for any cluster \(S_j\) of size \(|S_j| > \log (n)\), all but \(o(\left|S_j \right|)\) nodes \(u\in S_j\) will determine the same label. Moreover, this label is \(\mathsf {ID}(v)\) for some \(v \in S_j\).

Proof

(Sketch). We assume the conclusions of Lemma 12 hold. Fix some cluster \(S_j\). We focus on the nodes \(u \in S_j' = S_j \setminus \mathcal {B}_{\ell }\) and pessimistically count all other nodes as misclassified. This is sufficient since, by Lemma 6, \(|S_j \cap \mathcal {B}_{\ell }| = o\left( |S_j|\right) \). Let \(s_j^*\) be the active node in \(S_j\) of smallest ID. By (A2) and (A1) we know that \(s_j^*\) exists and that \(s_j^* \notin \mathcal {B}_{\ell }\cup \mathcal {B}_{\gamma }\). Let \(u \in S_j'\) be a misclassified node. By the description of the algorithm, one of the following two conditions must hold:

1. \(s_j^* \notin \mathsf {Cand}(u)\);

2. there exists \(v^* \notin S_j\) with \(v^* \in \mathsf {Cand}(u)\).

We look at each condition separately. For the first one, since \(s_j^*, u \in S_j'\), it must be that \(x_T^{s_j^*}(u) < t_u\). This means that the error \(\varepsilon _{(s_j^*, u)}\) is large in absolute value: \(\varepsilon _{(s_j^*, u)} < - \frac{1}{2\sqrt{d}\left|S_j \right|}\). But since \(s_j^* \notin \mathcal {B}_{\gamma }\), by Remark 10 the total error \(\left\Vert \boldsymbol{\varepsilon }_{s_j^*} \right\Vert \) cannot be too large. This means that the first condition can happen only for a small number \(o(|S_j|)\) of nodes. For the second condition the argument is similar. Let \(v^*\) be an active node such that \(v^* \notin S_j\), but \(v^* \in \mathsf {Cand}(u)\). Since \(v^* \in \mathsf {Cand}(u)\), we know that \(\ell _u \approx \ell _{v^*}\) and that the error \(\varepsilon _{(v^*, u)}\) is quite large: \(\varepsilon _{(v^*, u)} \ge t_u\). However, by (A1) \(v^* \notin \mathcal {B}_{\gamma }\), so \(\left\Vert \boldsymbol{\varepsilon }_{v^*} \right\Vert \) cannot be too large. Therefore \(v^*\) can be a candidate for a limited number of \(u \in S_j'\). Summing over all active nodes and using the upper bound (A3) is sufficient to show that condition 2 can happen only for a small number \(o(|S_j|)\) of nodes.    \(\square \)

3.5 Proof of the Main Result

In this section we bring everything together and prove Theorem 1:

Proof

(Proof of Theorem 1)

Number of Rounds. Firstly, let us look at the number of rounds of our Algorithm 4. We know that the Averaging phase of the algorithm takes \(T = \varTheta \left( \frac{\log (n)}{\lambda _{k+1}}\right) \) rounds. The Small Detection phase does not require any extra rounds of communication. For the Large Detection phase, we have again T rounds. This brings the total number of rounds to

$$ 2 \cdot T = \varTheta \left( \frac{\log (n)}{\lambda _{k+1}}\right) . $$

Clustering Guarantee. Secondly, we look at the clustering guarantees of Algorithm 4. Let \(S_j\) be some cluster of G. If \(|S_j| \le \log (n)\), then by Lemma 8 we know that all nodes of \(S_j\) will choose the same label, namely the minimum \(\mathsf {ID}\) among the nodes in \(S_j\). If \(\left|S_j \right| > \log (n)\), by Lemma 13 it follows that, with probability at least 0.9, all but \(o\left( \left|S_j \right|\right) \) nodes will determine the same label, which is \(\mathsf {ID}(v)\) for some \(v \in S_j\).

Communication Cost. Let us first look at the cost of the Averaging phase. In any round \(i \le T\), every node u has to send to all its neighbours the values \(\{ x_{i-1}^v(u)\}\) for the processes it has seen. This results in a total communication cost of \(\mathrm {cost}_{\mathrm {Avg}} = O\left( T \cdot n \cdot m \right) \). Again, for the Small Detection phase there is no cost attached. For the Large Detection phase, by Lemma 12 we know that, with probability at least 0.9, the total number of active nodes is \(n_a \le 500k \cdot \log (100k)\). By the design of Algorithm 3, every node u in the network broadcasts each active node at most once. Therefore the total communication cost of the Large Detection phase is \(\mathrm {cost}_\mathrm {LD} = O(m \cdot n_a) = O(m \cdot k \cdot \log (k))\). This gives a total communication cost of \(O(\mathrm {cost}_{\mathrm {Avg}} + \mathrm {cost}_\mathrm {LD}) = O(T \cdot n \cdot m)\). We remark that, while in general the number of edges could be \(m = \varTheta (n^2)\), we can first apply the sampling scheme in [23] to sparsify our network and then run our algorithm. The sparsification ensures that the structure of the clusters, the degree sequence and the parameter \(\varUpsilon _G(k)\) are preserved up to a small constant factor, and the resulting number of edges becomes \(m = \widetilde{O}(n)\). Thus the final communication cost can be expressed as \(\widetilde{O}(T \cdot n^2)\).    \(\square \)
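
For the reader's convenience, the bound on \(\mathrm {cost}_{\mathrm {Avg}}\) used above can be spelled out as follows (a rough accounting, counting one transmitted value as one word and using \(|\mathsf {Seen}(u)| \le n\) together with \(n \cdot d = 2m\)):

$$ \mathrm {cost}_{\mathrm {Avg}} \le \sum _{i=1}^{T} \sum _{u \in V} d \cdot |\mathsf {Seen}(u)| \le \sum _{i=1}^{T} n \cdot d \cdot n = 2\, T \cdot m \cdot n = O(T \cdot n \cdot m). $$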