
1 Introduction

The explosive growth of online social networking and recent advances in modeling massive and complex data have led to a skyrocketing interest in the analysis of graph-structured data and, particularly, in discovering network communities. Indeed, many real-world networks—from brain connectivity to ecosystems to gang formation and money laundering—exhibit a phenomenon where certain features tend to cluster into local cohesive groups. Community detection has been extensively studied in statistics, computer science, the social sciences, and domain knowledge disciplines, and it still remains one of the hottest research areas in network analysis (for an overview of algorithms, see, e.g., [5, 11, 13, 21, 22, 37, 42, 46, 50, 68], and the references therein).

The current paper is motivated by four overarching challenges. First, there exists no unique and agreed-upon definition of a network community; typically, a community is thought of as a cohesive set of vertices that have stronger or better internal connections within the set than with external vertices [37, 44]. Second, community discovery is further aggravated in the presence of (usually multiple) outliers, and until recently the two tightly woven problems of outlier detection and network clustering have been studied as independent problems [5, 15, 48]. Third, vertices with a low degree (the so-called parasitic outliers of the spectrum [35]) tend to produce multiple zero eigenvalues of the graph Laplacian, which results in a higher variability of spectral clustering and thus a reduced finite sample performance in community detection. Fourth, most of the currently available methods for community discovery within a spectral clustering framework are based on the Euclidean distance as a measure of “cohesion” or “closeness” among vertices, and thus do not explicitly account for the underlying probabilistic geometry of the graph.

We propose to address the above challenges by introducing the concept of data depth into network community detection, which allows us to integrate ideas on cohesion, centrality, outliers, and community discovery under one systematized “roof.” Data depth is a nonparametric and inherently geometric tool to analyze, classify, and visualize multivariate data without making prior assumptions about underlying probability distributions. A new impetus has recently been given to data depths due to their broad utility in high-dimensional and functional data analysis (for an overview, see, e.g., [9, 26, 27, 34, 40, 47, 58, 73], and the references therein). Given a notion of data depth, we can measure the “depth” (or “outlyingness”) of a given object or a set of objects with respect to an observed data cloud. A higher value of a data depth implies a deeper location, or higher centrality, in the data cloud. By plotting such a natural center-outward ordering of depth values, which serves as a topological map of the data, the presence of clusters, outliers, and anomalies can be evaluated simultaneously in a quick and visual manner. The notion of data depth is novel to network studies. The only relevant paper on the topic is due to [14], who consider a random sample of graphs following the same probability model on the space of all graphs of a given size. This probabilistic framework, however, is not applicable to the analysis of most real-world graph-structured data, where the available data consist only of a single network. In this paper we primarily focus on the utility of the \(L_1\)-depth as the main tool for unsupervised community detection in a spectral setting. Although numerous other depth alternatives exist, our choice of a depth function is motivated by the simplicity and tractability of the \(L_1\)-depth and the fact that it can be computed using a fast and monotonically converging algorithm [29, 30, 65]. This makes the \(L_1\)-depth particularly attractive for community discovery in large complex networks.

The paper is organized as follows. Section 2 provides background on graphs, spectral clustering, and the K-means algorithm. We introduce the new K-depths method based on the \(L_1\)-depth and discuss its properties in Sect. 3. Simulation studies are presented in Sect. 4. Section 5 illustrates the application of the K-depths method to tracking communities in the online social media platform Flickr.

2 Preliminaries and Background

Graph Notations Consider an undirected and loopless graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\), with a vertex set \(\mathcal{V}\) of cardinality n and an edge set \(\mathcal{E}\). We assume that \(\mathcal{G}\) consists of K non-overlapping communities and that K is given. Let A be an n × n symmetric empirical adjacency matrix, i.e.,

$$\displaystyle{ A_{ij} = \left \{\begin{array}{@{}l@{\quad }l@{}} 1,\quad &\mbox{ if}\;(i,j) \in \mathcal{E}\\ 0,\quad &\mbox{ otherwise}. \end{array} \right. }$$

The population counterpart of A is denoted by P. Let D be a diagonal matrix of degrees, i.e., \(D_{ii} =\sum \limits _{ j=1}^{n}A_{ij}\). Then, the graph Laplacian is defined as

$$\displaystyle\begin{array}{rcl} L = D^{-1/2}AD^{-1/2}.& &{}\end{array}$$
(1)
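As a minimal illustration (our own sketch, not part of the original derivation), the Laplacian in (1) can be computed directly from a dense adjacency matrix; zero-degree vertices are handled by setting the corresponding entries of \(D^{-1/2}\) to zero.

```python
import numpy as np

def normalized_laplacian(A):
    """Graph Laplacian L = D^{-1/2} A D^{-1/2} of Eq. (1) for a dense adjacency matrix A."""
    deg = A.sum(axis=1).astype(float)                      # degrees D_ii
    with np.errstate(divide="ignore"):
        d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
```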

Spectral Clustering For smaller networks, communities can be identified by optimizing various goodness-of-partition measures, for instance, Ratio Cut [19], Normalized Cut [60], and Modularity [45], which involve a search for an optimal split over all possible partitions of vertices. However, such discrete optimization problems are typically NP-hard and thus are not feasible for larger networks. The computational challenges can be circumvented using spectral clustering (SC), which yields a continuous approximation to the discrete optimization [67]. Hence, SC is now one of the most popular procedures for tracking communities in large complex networks [66].

The key idea of SC is to embed a graph \(\mathcal{G}\) into a collection of multivariate sample points. Given K communities, we identify orthogonal eigenvectors \(v_{\cdot j}\), j = 1, …, K, of the Laplacian L (or adjacency matrix A) that correspond to the K largest eigenvalues, and construct the n × K matrix \(V = [v_{\cdot 1},\ldots,v_{\cdot K}]\). Each row of V, \(v_{i} \equiv v_{i\cdot }\), provides a representation in \(\mathbb{R}^{K}\) of a vertex in \(\mathcal{V}\). Given this embedding, we can now employ any appropriate classifier to cluster this multivariate data set into K communities, and the most conventional choice is the method of K-means [1, 31, 36, 51].
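The embedding step can be sketched as follows (a minimal illustration under our own conventions; we use NumPy for the eigendecomposition and scikit-learn's KMeans for the baseline clustering step):

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_embedding_kmeans(L, K, seed=0):
    """Embed vertices via the K leading eigenvectors of L, then cluster the rows of V."""
    eigvals, eigvecs = np.linalg.eigh(L)       # eigenvalues returned in ascending order
    V = eigvecs[:, -K:]                        # n x K matrix of top-K eigenvectors
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(V)
    return V, labels
```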

Given a set of data points \(x_{i} \in \mathbb{R}^{d}\), i = 1, …, n, the method of K-means [41] aims to group observations into K sets \(\mathbf{C} =\{ C_{1},\ldots,C_{K}\}\) in such a way that the within-cluster sum of squares is minimized, that is, we minimize

$$\displaystyle{ \mathop{\mathrm{argmin}}\limits _{\mathbf{C}}\sum _{k=1}^{K}\sum _{ x\in C_{k}}\vert \vert x -\mu _{k}\vert \vert ^{2}, }$$
(2)

where \(\mu _{k}\) is the mean of points in \(C_{k}\) and \(\vert \vert x -\mu _{k}\vert \vert ^{2}\) is the squared Euclidean distance between x and the k-th group mean \(\mu _{k}\). The optimization (2) is highly computationally intensive. As an alternative, we can employ Lloyd’s algorithm (also known as Voronoi iteration or relaxation) for (2), which is based on iterative refinement and allows us to quickly identify a (local) optimum (see the outline in Algorithm 1 and the minimal sketch that follows it). In this paper, the initial centers are chosen randomly from the data set.

Algorithm 1 The K-means algorithm
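The following minimal sketch of Lloyd's iteration (our own illustration of the procedure in Algorithm 1, with random initial centers as used in this paper) alternates the assignment and update steps until the centers stabilize:

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd iteration for (2): assign each point to its nearest mean, then update means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)                          # assignment step
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                 # centers stabilized
            break
        centers = new_centers
    return labels, centers
```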

Regularization Low-degree vertices tend to produce multiple zero eigenvalues of a Laplacian L, which in turn increases clustering variability and adversely impacts the performance of the K-means algorithm. The problem is closely connected to the concentration of L, that is, the study of how close a sample Laplacian L is to its expected value. Sparser networks tend to produce more low-degree vertices and do not concentrate. The idea of regularization in this context is to diminish the impact of such low-degree vertices by viewing them as outliers and shrinking them toward the center of the spectrum. As a result, regularization leads to a higher concentration. There are a number of regularization procedures, ranging from brute-force trimming of outliers to sophisticated methods that are closely connected to the regularization of covariance matrices (for more discussion and the most recent literature review, see [35]). One of the most popular approaches, by analogy with ridge regularization of covariance matrices, is to select some positive parameter τ and add τ/n to all entries of the adjacency matrix A [1], that is

$$\displaystyle{ A_{\tau } = A +\tau J, }$$

where \(J = n^{-1}\mathbf{1}\mathbf{1}^{T}\) and \(\mathbf{1}\mathbf{1}^{T}\) is the n × n matrix with all elements 1. The resulting regularized Laplacian then takes the form

$$\displaystyle{ L_{\tau } = D_{\tau }^{-1/2}A_{\tau }D_{\tau }^{-1/2}, }$$

where \(D_{ii,\tau } =\sum \limits _{ j=1}^{n}A_{ij}+\tau\). The optimal regularizer τ can then be selected by minimizing the Davis–Kahan bound, i.e., the bound on the distance between the sample and population Laplacians (for a study of the properties of regularized spectral clustering, see [31, 35] and the references therein). However, selecting an optimal regularizer τ is highly computationally expensive. In addition, the impact of small and weak communities on the performance of regularized spectral clustering is not clear.
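For completeness, a small sketch of the regularized Laplacian construction described above (our own illustration; the selection of τ via the Davis–Kahan bound is not shown):

```python
import numpy as np

def regularized_laplacian(A, tau):
    """Regularized Laplacian: A_tau adds tau/n to every entry of A, and
    D_tau has diagonal entries sum_j A_ij + tau."""
    n = A.shape[0]
    A_tau = A + tau / n
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1) + tau)    # strictly positive for tau > 0
    return d_inv_sqrt[:, None] * A_tau * d_inv_sqrt[None, :]
```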

In this light, an interesting question arises: can we develop an alternative data-driven and computationally inexpensive method for taming “outliers” with low degrees that bypasses the optimization of the Davis–Kahan bound? It seems natural to utilize here a statistical methodology that has been developed with a particular focus on the analysis of outliers, that is, a notion of data depth.

3 Community Detection Using the \(L_1\) Data Depth

In this section, we propose a new unsupervised K-depths algorithm for network community detection based on iterative refinement with the \(L_1\) depth.

The \(L_1\) Data Depth In this paper we consider the \(L_1\) data depth of Vardi and Zhang [65]. Consider N distinct observations \(x_{1},\ldots,x_{N}\) in \(\mathbb{R}^{p}\) which we need to partition into K clusters, and let I(k) be the set of labels for observations in the k-th cluster. Let each observation \(x_{i}\) be associated with a scalar \(\eta _{i}\), i = 1, …, N, where the \(\eta _{i}\) are viewed as weights or as “multiplicities” of \(x_{i}\), and \(\eta _{i} = 1\) if the data set has no ties. The multivariate \(L_1\)-median of the k-th cluster, \(y_{0}(k)\), is then defined as

$$\displaystyle{ y_{0}(k) =\mathop{ \mathrm{argmin}}\limits _{y}C(y\vert k), }$$
(3)

where C(y | k) is the weighted sum of distances between y and points x i in the k-th cluster

$$\displaystyle{ C(y\vert k) =\sum _{i\in I(k)}\eta _{i}\vert \vert x_{i} - y\vert \vert \qquad \forall k. }$$
(4)

Here \(\vert \vert u - v\vert \vert\), \(u,v \in \mathbb{R}^{p}\), is the Euclidean distance in \(\mathbb{R}^{p}\). If \(x_{1},\ldots,x_{N}\) are not multicollinear (which is the case in the considered spectral clustering framework), \(C(y\vert k)\) is positive and strictly convex in \(\mathbb{R}^{p}\). If the set \(x_{1},\ldots,x_{N}\) has ties, the “multiplicities” \(\eta _{i}\) can be chosen in such a way that convexity of \(C(y\vert k)\) is preserved (see [65] for further discussion).
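The weighted \(L_1\)-median in (3)–(4) can be approximated with a basic Weiszfeld-type iteration, sketched below (a simplified version of the modified Weiszfeld algorithm of [65]; we use a small numerical guard instead of the exact correction when an iterate coincides with a data point):

```python
import numpy as np

def l1_median(X, eta=None, n_iter=200, tol=1e-8):
    """Weighted multivariate L1-median of (3)-(4) via a basic Weiszfeld-type iteration."""
    eta = np.ones(len(X)) if eta is None else np.asarray(eta, dtype=float)
    y = np.average(X, axis=0, weights=eta)                    # start from the weighted mean
    for _ in range(n_iter):
        dist = np.maximum(np.linalg.norm(X - y, axis=1), 1e-12)  # guard against ties with y
        w = eta / dist
        y_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y
```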

The \(L_1\) depth was proposed by Vardi and Zhang [65], based on the notion of the multivariate \(L_1\)-median (3), and the idea has been further extended to clustering and classification in multivariate and functional settings by López-Pintado and Jörnsten [39] and Jörnsten [29]. Given a cluster assignment, the \(L_1\) depth of a point \(x \in \mathbb{R}^{K}\) with respect to the k-th cluster is defined as

$$\displaystyle{ LD(x\vert k) = 1 -\max [0,\vert \vert \bar{e}(x\vert k)\vert \vert - f(x\vert k)]. }$$
(5)

Here \(f(x\vert k) =\eta (x)/\sum _{i\in I(k)}\eta _{i}\) with \(\eta (x) =\sum _{i=1}^{N}\eta _{i}I(x = x_{i})\), and \(\bar{e}(x\vert k)\) is the average of the unit vectors from the point x to all observations in the k-th cluster, defined as

$$\displaystyle{ \bar{e}(x\vert k) =\sum _{i\in I(k),x_{i}\neq x}\eta _{i}e_{i}(x)/\sum _{j\in I(k)}\eta _{j}, }$$

where \(e_{i}(x) = (x_{i} - x)/\vert \vert x_{i} - x\vert \vert\).

The idea of \(1 - LD(x\vert k)\) is to quantify the minimal additional weight that needs to be assigned to x so that x becomes the multivariate \(L_1\)-median of the augmented k-th cluster \(x \cup \{ x_{i},i \in I(k)\}\) [65]. Hence, \(L_1\) depths provide a robust representation of the topological structure of each cluster. Since the \(L_1\) depth is non-zero outside the convex hull of the data cloud, it is a feasible depth choice for comparing multiple clusters [29].
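A direct sketch of the \(L_1\) depth in (5), written for a single query point and one cluster (our own illustration; `cluster` denotes the array of embedded points \(x_{i}\), i ∈ I(k), and `eta` the optional multiplicities):

```python
import numpy as np

def l1_depth(x, cluster, eta=None):
    """L1 depth of Eq. (5): LD(x|k) = 1 - max(0, ||e_bar(x|k)|| - f(x|k))."""
    eta = np.ones(len(cluster)) if eta is None else np.asarray(eta, dtype=float)
    diff = cluster - x
    dist = np.linalg.norm(diff, axis=1)
    ties = dist < 1e-12                               # observations x_i equal to x
    f = eta[ties].sum() / eta.sum()                   # f(x|k) = eta(x) / total cluster weight
    unit = diff[~ties] / dist[~ties, None]            # unit vectors e_i(x)
    e_bar = (eta[~ties, None] * unit).sum(axis=0) / eta.sum()
    return 1.0 - max(0.0, np.linalg.norm(e_bar) - f)
```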

The K-Depths Method It is well known that the K-means clustering algorithm is non-robust to outliers [16, 59, 69]. This is partially due to the fact that the K-means algorithm is based on a squared Euclidean norm as the measure of “distance” and only captures the information between a pair of points, i.e., a candidate center and another point (see Fig. 1a). Also, to identify a cluster, the K-means algorithm uses a presumptive cluster center defined by the cluster mean, which makes it sensitive to anomalies and outliers. Although we update centers and clusters until the assignments no longer change, there is no guarantee that the global optimum of (2) can be found [49, 54].

Fig. 1

Comparing the K-means and K-depths algorithms. Circles denote cluster centers. Each cluster is identified by color and a border around its points. (a) K-means. (b) K-depths. (c) Generalized K-depths

Our idea is motivated by two overarching questions. Is there an alternative “cohesion” measure to a squared Euclidean norm? Does such a measure allow us to achieve higher accuracy and stability by taking advantage of more information between clusters and points?

Indeed, such a “cohesion” measure exists, and it can be based on a data depth notion. As discussed earlier, a depth function evaluates how “deep” (or “central”) a point is with respect to a group of data (i.e., a cluster). Hence, depth functions allow for more informative and robust “cohesion” (or “distance”) measures than a squared Euclidean norm (Fig. 1b).

Our proposed approach is then to use a data depth (particularly, the \(L_1\) depth) to find the “nearest” clusters as a part of iterative refinement, and we call the new method the “K-depths” clustering algorithm. That is, following the spectral clustering setting, we embed a graph into a collection of multivariate sample points. Then, given K communities, we identify orthogonal eigenvectors of the Laplacian L that correspond to the K largest eigenvalues of L, and construct an n × K matrix V formed by these eigenvectors. We view each row of V as a representation of a network vertex in \(\mathbb{R}^{K}\), and thus we get n sample points in K-dimensional space. Clustering these multivariate points using the K-depths yields a partition of the network into K communities. (The K-depths method is outlined in Algorithm 2 and sketched in the code example that follows it. Note that we still use a squared Euclidean norm to initialize the K-depths iterative refinement.) Note that instead of Laplacian spectral embedding, we can also consider adjacency spectral embedding (see [36] and references therein).

Algorithm 2 Spectral clustering K-depths algorithm
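A minimal sketch consistent with the description above (an illustration rather than the full Algorithm 2; it reuses the `l1_depth` helper sketched earlier in this section and scikit-learn's KMeans for the Euclidean initialization):

```python
import numpy as np
from sklearn.cluster import KMeans

def k_depths(L, K, n_iter=20, seed=0):
    """K-depths refinement: spectral embedding, K-means initialization, then
    reassignment of each point to the cluster in which it attains the largest L1 depth."""
    eigvals, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, -K:]                                # K leading eigenvectors of L
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(V)
    for _ in range(n_iter):
        depths = np.zeros((len(V), K))
        for k in range(K):
            members = V[labels == k]
            if len(members) == 0:                      # empty clusters keep depth 0
                continue
            depths[:, k] = [l1_depth(v, members) for v in V]
        new_labels = depths.argmax(axis=1)
        if np.array_equal(new_labels, labels):         # assignments stabilized
            break
        labels = new_labels
    return labels
```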

The K-depths algorithm presented above is closely related to the modified Weiszfeld algorithm of [65]. The idea of the K-depths is to evaluate the “centrality” of any given point with respect to all points within a cluster (see Fig. 1b), that is, with respect to points located inside the cluster as well as points located close to the cluster borderline. However, points that fall in-between clusters may provide redundant or noisy information, which leads to a higher variability of the clustering algorithm. We therefore introduce the generalized K-depths measure, which accounts only for the inner part of a cluster when calculating depth values (Fig. 1c); a short sketch of this trimming step is given below. Given a contour plot of a cluster, we compute \(L_1\) depth values using points which are within an arbitrary percentage contour p ∈ [0, 1]. For instance, Fig. 1b is a special case of Fig. 1c where the locality parameter p is 1, i.e., all 100% of the available data are used in the K-depths algorithm. We can also view the locality parameter p as a trade-off between bias (i.e., detection accuracy) and variance (i.e., detection variability).
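One plausible implementation of the locality parameter p is to trim each cluster to its deepest fraction p before the depth-based reassignment, as sketched below (our reading of the “percentage contour”; within the refinement loop above, `members` would be replaced by `trim_cluster(members, p)`):

```python
import numpy as np

def trim_cluster(members, p):
    """Keep the deepest 100*p percent of a cluster (its 'inner part') for the
    generalized K-depths; depths are computed with respect to the full cluster."""
    depths = np.array([l1_depth(v, members) for v in members])
    return members[depths >= np.quantile(depths, 1.0 - p)]
```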

Remark

The optimal choice of p, similarly to the selection of an optimal trimming level, largely depends on the definition of an outlier, the types of anomalous behavior, the proportion of contamination, and the structure of the data. Conventionally, trimming and other robustifying parameters are chosen using various types of resampling, including V-fold cross-validation, jackknife, and bootstrap (see [2, 24, 25]). Under the network setting, the problem is further aggravated by the lack of an agreed-upon definition of outliers and network anomalies and by their dependence on the underlying network model structure (for overviews, see [35, 17, 18]). For instance, [5] discuss at least four kinds of outliers: mixed membership, hubs, small clusters, and independent neutral nodes. Although selecting p using cross-validation is likely to be affected by the presence of outliers in an observed network, we believe that one of the resampling ideas such as cross-validation or bootstrap [12, 63] is still arguably the most feasible approach that allows us to minimize parametric assumptions about the network model.

3.1 Properties of Spectral Clustering K-Depths Algorithm

Asymptotic properties of spectral clustering and, particularly, the K-means/K-medians algorithms have been widely studied both in probability and statistics (for the most recent overviews, see, e.g., [28, 36, 52, 53], and the references therein). While most of the results focus on denser networks, most recently [36] derive an upper error bound for spectral clustering under a moderately sparse stochastic block model with a maximum expected degree of order log n or higher.

The key result behind deriving all asymptotic properties of the K-means/K-medians algorithms is to show that there exists a sequence \(\epsilon _{n}\), \(\epsilon _{n} \geq 0\), such that \(\lim _{n\rightarrow \infty }\epsilon _{n} = 0\) and

$$\displaystyle{ z^{A_{k} }(\mathcal{G}) \leq (1 +\epsilon _{n})z^{{\ast}}(\mathcal{G}),\;n \in Z^{+}, }$$
(6)

where \(z^{A_{k}}(\mathcal{G})\) is the approximate polynomial time solution from the K-means/K-medians algorithms and \(z^{{\ast}}(\mathcal{G})\) is the optimal solution. If such a sequence \(\epsilon _{n}\) exists, [10] call the K-medians algorithm asymptotically optimal.

Defining \(z(\mathcal{G})\) in (6) in terms of the Frobenius norm of the distance between the K leading eigenvectors \(U_{1},\ldots,U_{K}\) of a population adjacency matrix P and their respective counterparts \(\hat{U } _{1},\ldots, \hat{U } _{K}\) from an empirical adjacency matrix A, [33] show that there exists an approximate polynomial time solution to the K-means algorithm with an error bound

$$\displaystyle{ \vert \vert \hat{\varTheta }\hat{X}-\hat{U } \vert \vert _{F}^{2} \leq (1+\epsilon )\min _{ \begin{array}{c}\varTheta \in \mathbb{M}_{n,K} \\ X\in \mathbb{R}_{K\times K}\end{array}}\vert \vert U -\hat{U } \vert \vert _{F}^{2},\quad \hat{U },\;U \in \mathbb{R}^{n\times K}, }$$

where \(U = [U_{1},\ldots,U_{K}]\) and \(\hat{U }= [\hat{U } _{1},\ldots, \hat{U } _{K}]\). Here Θ is the true membership matrix such that \(\varTheta _{ig_{i}} = 1\), where \(g_{i} \in \{ 1,\ldots,K\}\) is the community membership of vertex i, and \(\mathbb{M}_{n,K}\) is the collection of all n × K matrices in which each row has exactly one entry equal to 1 and the remaining K − 1 entries are 0. For discussion of analogous results on the existence of a (1 + ε)-approximate solution for the K-medians algorithm in network applications see, for instance, [36].

Since the statistical properties of the median and the \(L_1\)-depth are closely related (see [65]), we state the following conjecture about the error bound of the K-depths algorithm under the \(L_1\) depth and adjacency spectral embedding.

Conjecture 1

There exists an Ω-approximate polynomial time solution to the K-depths method under adjacency spectral embedding which attains

$$\displaystyle{ \vert \vert \hat{\varTheta }\hat{X}-\hat{U } \vert \vert _{F}^{2} \leq \varOmega \min _{ \begin{array}{c}\varTheta \in \mathbb{M}_{n,K} \\ X\in \mathbb{R}_{K\times K}\end{array}}\vert \vert U -\hat{U } \vert \vert _{F}^{2}, }$$
(7)

where Ω is a positive constant and \((\hat{\varTheta }, \hat{X} ) \in \mathbb{M}_{n,K} \times \mathbb{R}_{K\times K}\) is the output of the Ω-approximate K-depths algorithm.

Armed with (7), an upper bound on the network community detection error of the K-depths algorithm (Algorithm 2) under adjacency spectral embedding can be derived for a stochastic block model (SBM), following the derivations of [36, 52, 65]. This error bound for the K-depths increases with increasing network sparsity and with a growing number of communities. In addition, assuming the existence of a Σ-approximate solution to the K-depths algorithm, analogous error bounds can be derived under Laplacian spectral embedding [52, 53].

4 Simulations

In this section we evaluate the finite sample performance of the unsupervised K-depths classifier for detecting network communities, primarily focusing on the case of two communities. To measure the goodness of clustering, we employ such standard criteria as the misclassification rate and the normalized mutual information (NMI). We define the misclassification rate as the total percentage of mislabeled vertices, i.e.,

$$\displaystyle{ \gamma = \frac{1} {n}\sum _{i=1}^{K}\vert S_{ i}\vert, }$$

where \(\vert S_{i}\vert\) is the number of misclassified vertices in the i-th community.

Given two sets of clusters over a total of n vertices, \(\mathbb{R} =\{ r_{1},\ldots,r_{K}\}\) and \(\mathbb{C} =\{ c_{1},\ldots,c_{J}\}\), the NMI is given by Manning et al. [43]:

$$\displaystyle{ \mathrm{NMI}(\mathbb{R}, \mathbb{C}) = \frac{I(\mathbb{R}; \mathbb{C})} {[H(\mathbb{R}) + H(\mathbb{C})]/2}. }$$

Here I is the mutual information

$$\displaystyle\begin{array}{rcl} I(\mathbb{R}; \mathbb{C})& =& \sum _{k}\sum _{j}P(r_{k}\bigcap c_{j})\log \frac{P(r_{k}\bigcap c_{j})} {P(r_{k})P(c_{j})} {}\\ & =& \sum _{k}\sum _{j}\frac{\vert r_{k}\bigcap c_{j}\vert } {n} \log \frac{n\vert r_{k}\bigcap c_{j}\vert } {\vert r_{k}\vert \vert c_{j}\vert } {}\\ \end{array}$$

where \(P(r_{k})\), \(P(c_{j})\), and \(P(r_{k}\bigcap c_{j})\) are the probabilities of a vertex being in cluster \(r_{k}\), in cluster \(c_{j}\), and in the intersection of \(r_{k}\) and \(c_{j}\), respectively, and H is the entropy, defined by

$$\displaystyle\begin{array}{rcl} H(\mathbb{R}) = -\sum _{k}P(r_{k})\log P(r_{k}) = -\sum _{k}\frac{\vert r_{k}\vert } {n} \log \frac{\vert r_{k}\vert } {n}.& & {}\\ \end{array}$$

NMI takes values between 0 and 1, and we prefer a clustering partition with a higher NMI.
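In practice, this arithmetic-mean normalization of the mutual information is readily available; for instance, scikit-learn's `normalized_mutual_info_score` with `average_method="arithmetic"` matches the definition above (the toy label arrays below are purely illustrative):

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

labels_true = np.array([0, 0, 0, 1, 1, 1])   # toy ground-truth communities (illustrative)
labels_pred = np.array([0, 0, 1, 1, 1, 1])   # toy clustering output (illustrative)
# average_method="arithmetic" corresponds to the normalization [H(R) + H(C)] / 2 above
nmi = normalized_mutual_info_score(labels_true, labels_pred, average_method="arithmetic")
```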

4.1 Network Clustering with Two Groups

Here we use a benchmark simulation framework based on a 2-block stochastic block model (SBM) [61, 71]. The SBM is a particular case of the inhomogeneous Erdős–Rényi model in which edges are formed independently and the probability of an edge between two vertices is determined by their group memberships [23].

Following a simulation setting of Joseph and Yu [31], we generate 100 networks of order 3000 from an SBM with a block probability matrix

$$\displaystyle{ B = \left [\begin{array}{cc} 0.01 &0.0025\\ 0.0025 & 0.003 \end{array} \right ], }$$
(8)

and assume that the connections within the k-th community follow an independent Bernoulli distribution with probability \(B_{kk}\), k = 1, 2.
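A small sketch of this simulation design (our own illustration; the equal split of the 3000 vertices into two communities of 1500 is an assumption, since the exact block sizes are not restated here):

```python
import numpy as np

def sample_sbm(block_sizes, B, seed=0):
    """Draw a symmetric, loopless adjacency matrix from an SBM with block
    probability matrix B; edges are independent Bernoulli(B[g_i, g_j])."""
    rng = np.random.default_rng(seed)
    g = np.repeat(np.arange(len(block_sizes)), block_sizes)   # community labels
    P = B[g][:, g]                                            # n x n edge probabilities
    upper = np.triu(rng.random(P.shape) < P, k=1)             # upper triangle, no loops
    A = (upper | upper.T).astype(int)
    return A, g

# the 2-block setting of (8), with an assumed 1500/1500 split
B = np.array([[0.01, 0.0025], [0.0025, 0.003]])
A, g = sample_sbm([1500, 1500], B)
```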

Table 1(a) summarizes the clustering performance of the K-means and K-depths algorithms in terms of misclassification rate and NMI. We find that the K-depths method noticeably outperforms the K-means algorithm, delivering a 36% lower misclassification rate and a more than four times higher NMI, although with a somewhat higher variability. Remarkably, the boxplots of misclassification rate and NMI (see the left panel of Fig. 2) indicate that despite the higher variability, the lower quartile of the misclassification rates delivered by the K-depths algorithm is smaller than the upper quartile of the misclassification rates yielded by the K-means algorithm. A similar pattern is also observed for NMI (see the right panel of Fig. 2).

Table 1 Performance of the K-means and K-depths algorithms with respect to misclassification rate γ and NMI, with standard deviations in parentheses, under (a) the SBM (8) and (b) the GSBM
Fig. 2

Boxplots of clustering performance of the K-means and K-depths in terms of misclassification rate and NMI for the SBM (8)

We find that regularization of both the K-means and K-depths algorithms, where an optimal regularizer τ is selected by optimizing the Davis–Kahan bound as per [31], improves community discovery. In particular, the regularized K-means outperforms the regularized K-depths in terms of misclassification rate, i.e., 0.16 vs. 0.22, while the regularized K-depths outperforms the regularized K-means in terms of NMI, i.e., 0.44 vs. 0.40. However, regularization turns out to be highly computationally expensive: finding an optimal regularizer for a single network of 3000 vertices under the SBM (8) requires 1800 s (with one additional second for the K-means algorithm itself). In contrast, the unregularized K-depths algorithm takes only 4 s. (The elapsed time is assessed in R on an OS X 64-bit laptop with a 1.4 GHz Intel Core i5 processor and 4 GB of 1600 MHz DDR3 memory.)

Thus, being intrinsically robust to low-degree vertices, the new K-depths method provides a simple and computationally efficient alternative to the currently adopted regularization procedures based on optimizing the Davis–Kahan bound.

Choice of a Locality Parameter Let us explore the impact of the locality parameter p, p ∈ [0, 1], on the clustering performance of the K-depths algorithm. Note that p controls how many points are selected to form the “deepest” sub-clusters with which other points are compared. Figure 3 visualizes sub-clusters and the respective contour plots based on the \(L_1\)-depth, corresponding to p = 0.1, 0.3, 0.5, and 0.7. If p is 1, the whole data cloud is used, while lower values of p lead to a higher concentration of points around the cluster center and aim to minimize the impact of outlying points or noise. Hence, the locality parameter p can be viewed as a trade-off between bias and variance. Figure 4 shows the performance of the K-depths algorithm with respect to varying p under the SBM (8). We find that, in general, both the mean and variance of the misclassification rates and NMI are stable and comparable for p less than 0.5. As expected, higher values of p lead to a better performance in terms of average misclassification rates and NMI but also result in a substantially higher variability. In general, an optimal p can be selected via cross-validation, and the choice of p is likely to be linked to the sparsity of an observed network. However, given the stability of the K-depths performance, as a rule of thumb we suggest using a p of 0.5 or less.

Fig. 3

Contour plots based on the \(L_1\)-data depth and varying data proportions p, i.e., p = 0.1, 0.3, 0.5, and 0.7

Fig. 4

Boxplots of misclassification rates (a) and NMI (b) for various choices of the locality parameter p. The dashed line connects the medians of the resulting misclassification rates and NMI across the locality parameters p in plots (a) and (b), respectively

4.2 Network Clustering with Outliers

Now we evaluate the performance of the K-depths algorithm on a network with outliers. In particular, we consider the so-called Generalized Stochastic Block Model (GSBM) of Cai and Li [5], which incorporates small and weak communities (outliers) into a conventional SBM structure. More specifically, consider an undirected and loopless graph \(\mathcal{G} = (\mathcal{V},\mathcal{E})\) with N = n + m vertices, where n is the number of “inliers,” which follow the standard SBM framework, and m is the number of “outliers,” which connect with other vertices at random. Each inlier vertex is assigned to one of the two communities, while all outliers are placed into a 3rd community. An example of a GSBM is shown in Fig. 5: the two strong communities are colored red and green within solid circles, while the outliers (one small and weak community) are colored blue within a dashed circle.

Fig. 5

Network with outliers (one small and weak community) under the GSBM

In this section we construct a GSBM of Cai and Li [5] by adding 30 outliers (i.e., one small and weak community) to the standard 2-block SBM (8).

In particular, we set the probability of an edge between outliers to be 0.01. Connections between inliers and outliers are defined by an arbitrary (0, 1)-matrix Z, \(Z \in \mathbb{R}^{n\times m}\), such that \(\mathbb{E}Z =\boldsymbol{\beta } \mathbf{1}^{T} = [\boldsymbol{\beta },\ldots,\boldsymbol{\beta }]\), where the components of \(\boldsymbol{\beta }\) are 3000 i.i.d. copies of U², with U a uniform random variable on [0, 0.0025].
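A sketch of how such a GSBM network could be assembled (our reading of the setup above; it reuses the adjacency matrix `A` from the SBM sketch in Sect. 4.1, and `beta` follows the stated U² construction):

```python
import numpy as np

def sample_gsbm(A_inliers, m, p_out, beta, seed=0):
    """Append m outlier vertices to an inlier adjacency matrix: outlier-outlier edges
    are Bernoulli(p_out); the edge between inlier i and any outlier is Bernoulli(beta_i)."""
    rng = np.random.default_rng(seed)
    n = A_inliers.shape[0]
    Z = (rng.random((n, m)) < beta[:, None]).astype(int)    # inlier-outlier block
    upper = np.triu(rng.random((m, m)) < p_out, k=1)
    W = (upper | upper.T).astype(int)                       # outlier-outlier block
    return np.block([[A_inliers, Z], [Z.T, W]])

# 30 outliers with within-outlier edge probability 0.01; beta_i = U_i^2, U_i ~ Unif[0, 0.0025]
rng = np.random.default_rng(1)
beta = rng.uniform(0, 0.0025, size=3000) ** 2
A_full = sample_gsbm(A, m=30, p_out=0.01, beta=beta)
```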

Following [5], we define a misclassification rate based only on inliers in the dominant 1st and 2nd communities, i.e.,

$$\displaystyle{ \gamma = \frac{1} {n}\sum _{k=1}^{2}\vert S_{ k}\vert, }$$

where \(\vert S_{k}\vert\) is the number of misclassified vertices in the k-th community, k = 1, 2. Similarly, NMI is calculated only on inliers, and the number of clusters K is set to 3 for both the K-means and K-depths algorithms.

Table 1(b) summarizes the results for misclassification rates and NMI delivered by the K-means and K-depths algorithms. In general, misclassification rates for both methods under the GSBM are noticeably higher than the analogous rates under a standard SBM. However, the K-depths algorithm still outperforms the K-means method, yielding a 10% lower misclassification rate. In turn, the NMI delivered by the K-depths algorithm is almost twice as high as the corresponding NMI of the K-means method, i.e., 0.43 vs. 0.24, respectively. Remarkably, under the GSBM the variability of both methods is very similar, while the upper quartile of NMI for the K-means algorithm is lower than almost all values of NMI delivered by the K-depths algorithm (see Fig. 6).

Fig. 6

Boxplots of clustering performance of the K-means and K-depths in terms of misclassification rate and NMI under the GSBM

5 Application to Flickr Communities

In this section, we illustrate the application of the K-depths algorithm to tracking communities in Flickr. Flickr is a popular website where users share personal photographs and which also serves as an online social platform. The data set contains information on 80,513 Flickr bloggers; each blogger is viewed as a vertex, and friendships between bloggers are represented by undirected edges. The data are available from [70]. Bloggers are divided into 195 groups depending on their interests. As discussed by Tang and Liu [62], the network is very sparse and scale-free (i.e., its degree distribution follows a power law).

In our study, we consider a subnetwork of Flickr obtained by extracting the vertices that belong to the second and third communities and the edges within and between these communities. Isolated vertices (vertices with no edges) are removed. The resulting data represent an undirected graph with 216 vertices and 996 edges; the second community contains 155 vertices and 753 edges, while the third community contains 61 vertices and 19 edges.

We now apply the K-means and K-depths algorithms to identify clusters in the Flickr subnetwork (see Table 2). We find that the K-depths algorithm delivers a misclassification rate of 0.35, which is more than 26% lower than the misclassification rate of 0.47 yielded by the K-means algorithm. In turn, the NMI yielded by the K-depths algorithm is comparable with the NMI of the K-means algorithm.

Table 2 Misclassification rate (γ) and Normalized Mutual Information (NMI) criteria for the K-means and K-depths methods for the Flickr subnetwork

6 Conclusion and Future Work

In this paper, we introduce a new unsupervised approach to network community detection based on a nonparametric concept of data depth within a spectral clustering framework. In particular, we propose a data-driven K-depths algorithm based on iterative refinement with the \(L_1\) depth. The new method is shown to substantially outperform the classical K-means algorithm and to deliver results comparable to those of the regularized K-means. The K-depths algorithm is simple and computationally efficient, requiring up to 400 times less CPU time than the currently adopted regularization procedures based on optimizing the Davis–Kahan bound. Moreover, the K-depths algorithm is intrinsically robust to low-degree vertices and accounts for the underlying geometric structure of a graph, thus paving the way for using the \(L_1\) depth and other depth functions as an alternative to the computationally expensive selection of optimal regularizers.

In addition to the asymptotic analysis of K-depths clustering, in the future we plan to extend the K-depths approach to other types of depth functions, for example, the classical ones (half-space depth, Mahalanobis depth, random projection depth, etc. [38, 55–57, 72]) and the most recent ones, such as the Monge–Kantorovich depth [6, 20], and to explore the utility of the K-depths method as an initialization algorithm (for discussion, see [64] and the references therein). Another interesting direction is to investigate the relationship between the properties of the K-depths approach and the trimmed K-means algorithms [7, 8, 32], both in network and general multivariate clustering contexts.