Keywords

1 Introduction

Clustering (also referred to as community detection) is a critical component of complex network analysis. In the traditional sense, a cluster (community) in a network comprises a group of vertices that are more connected to each other than to vertices outside the cluster [1]. We refer to such clusters as physical clusters. Several clustering algorithms (like Girvan–Newman algorithm [2], Louvain algorithm [3], and neighborhood-overlap-based greedy algorithm [4]) are available in the literature to determine physical clusters of vertices of larger modularity. A physical cluster is considered to be more modular [1] if it has high intra-cluster density and low inter-cluster density.

With the objective of maximizing edge-based intra-cluster density, it is difficult to expect a clustering algorithm to group vertices that are similar to each other. Similarity of vertices is typically assessed on the basis of node-level metrics that could be topology or domain-driven. Centrality metrics are classical examples for topology-based metrics that quantify the extent to which a node is important on the basis of its location in the network [1]. The centrality metrics are used as the node-level metrics for our analysis in this paper. We consider the following centrality metrics: neighborhood-driven degree (DEG) [1] and eigenvector (EVC) [5] centrality metrics and the shortest path-driven betweenness (BWC) [6, 7] and closeness (CLC) centrality metrics [8, 9]. For more details on the centrality metrics, the interested reader is referred to [10]. The analysis presented in this paper could be seamlessly extended to any combination of domain-driven or topology-driven metrics as well.

In Fig. 1, we show a motivating example graph wherein the tuple next to a vertex represents the degree (DEG) and eigenvector centralities (EVC) of the vertex. Any well-known clustering algorithm in the literature would determine the two physical clusters in sub Fig. 1a that have high modularity (larger intra-cluster density and lower inter-cluster density). However, a closer look at the tuples for the vertices within a physical cluster would indicate less similarity among the vertices on the basis of their DEG and EVC values. On the other hand, in sub Fig. 1b, we show two clusters each of which comprises vertices that are exactly similar on the basis of their DEG and EVC values. The two clusters in sub Fig. 1b are not very modular (the inter-cluster density is even larger than the intra-cluster density), but each of these clusters would be more cohesive (i.e., comprised of vertices that are similar) on the basis of their DEG and EVC values.

Fig. 1
figure 1

Example graph: physical modular clusters versus logical clusters of similar vertices. a Physical clusters (larger intra-cluster density, but lower vertex similarity). b Logical clusters (lower intra-cluster density, but larger vertex similarity)

We propose the following approach (more details are in Sect. 2) to determine such cohesive clusters of similar vertices in real-world network graphs. We distribute the vertices in a coordinate system whose coordinates are the normalized values of the node-level (centrality) metrics of the vertices. For such a logical topology, we iteratively attempt to connect the vertices together (an edge exists between two vertices if the Euclidean distance between their normalized coordinate values is within a threshold) in the form of a unit-disk graph. We use a binary search approach to identify the minimum value for the threshold distance that would connect together the vertices in the unit-disk graph. We run the Louvain clustering algorithm [3] on the connected unit-disk graph to determine one or more (logical) clusters whose member vertices are more similar compared to the vertices in the physical clusters obtained by running the Louvain algorithm on the corresponding real-world network graph. We use the Silhouette Index [11] measure to assess the similarity of the vertices in the logical clusters vis-a-vis the physical clusters with respect to the centrality metrics considered. In a recent work [12], we have successfully used the unit-disk graph and binary search-based approach to quantify the similarity between any two vertices in a network (in the form of a metric called the node similarity index). Our hypothesis in this research is that for a set of centrality metrics, the Silhouette Index of the logical clusters would be larger than the Silhouette Index of the physical clusters.

The rest of the paper is organized as follows: In Sect. 2, we explain the approach to determine logical clusters of similar vertices on the basis of node-level (centrality) metrics. In Sect. 3, we explain the formulation for the Silhouette Index measure. In Sect. 4, we test our hypothesis on a suite of 25 biological networks and 25 social networks on the basis of three sets of centrality metrics: (DEG, EVC); (BWC, CLC), and (DEG, EVC, BWC, CLC). We present and analyze the Silhouette Index results for the physical clusters and logical clusters obtained for the different combinations of centrality metrics. Section 5 presents our conclusions and outline plans for future work.

2 Logical Clusters of Similar Vertices

We describe the sequence of steps (see Fig. 2) involved in determining logical clusters of similar vertices (for k centrality metrics) in a real-world network graph.

Fig. 2
figure 2

Example to illustrate the computation of the logical clusters and their evaluation

  • Step (i): Construction of a Logical Topology: Let the number of centrality metrics considered for logical clustering of a real-world network graph be k. To get started, for each of the k metrics, we determine their raw centrality values (see Fig. 2a) and then independently normalize them (using the square root of the sum of the squares of the raw values). Following this, we build a logical topology (k-dimensional coordinate system) of the vertices wherein the coordinates of a vertex are its normalized centrality values (ranging from 0 to 1). See Fig. 2b.

  • Step (ii): Binary Search Algorithm to Deduce a Connected Unit-Disk Graph: We attempt to deduce (through a sequence of iterations) a unit-disk graph that would connect all the vertices in the logical topology at a minimum threshold distance for an edge to exist. For a set of k centrality metrics, the threshold could range from (0,…, \( \sqrt k \)); if the threshold distance is \( \sqrt k \), we will have a unit-disk graph that is sure to be connected (completely connected indeed!), but not connected at a threshold distance of 0 (unless all the vertices are co-located). We use this observation as the basis to run a binary search algorithm to determine the minimum possible value for the threshold distance to obtain a connected unit-disk graph [12]. We start the binary search algorithm with the left index set to 0 and the right index set to \( \sqrt k \) and go through a sequence of iterations (see Fig. 2c).

    Across all the iterations, the following invariant is maintained: If the threshold distance value corresponds to the left index, the unit-disk graph is not connected; if the threshold distance value corresponds to the right index, the unit-disk graph is connected. In each iteration, we first determine the middle index as the average of the left index and right index, and seek to construct a unit-disk graph with the threshold distance value corresponding to the middle index. If such a unit-disk graph is connected, we set the right index to the value of the middle index; otherwise, we set the left index to the value of the middle index. We continue the iterations until the right index and left index differ not more than a cutoff parameter (\( \in \)). We use \( \in \) = 0.001 for all the analysis conducted in this paper. We set the minimum threshold distance to correspond to the value of the right index in the last iteration of the algorithm.

  • Step (iii): Logical Clustering of the Connected Unit-Disk Graph: On the connected unit-disk graph obtained for a minimum threshold distance, we run the Louvain community detection algorithm [3] to determine (logical) clusters of vertices that have a larger intra-cluster density (and a lower inter-cluster density) in the unit-disk graph. The vertices within a logical cluster are expected to be more similar to each other. The Louvain algorithm (a hierarchical community detection algorithm) is designed to identify highly modular communities. To determine the logical clusters using the Louvain algorithm, the weight of an edge in the connected unit-disk graph is the Euclidean distance between their corresponding normalized coordinate values. Note that the edge weights for the real-world network graphs are “1” when we run the Louvain algorithm to determine the physical clusters.

3 Silhouette Index

We use the Silhouette Index [11] measure (ranges from −1 to 1) to evaluate the extent of similarity among the vertices of the physical clusters and logical clusters with respect to the normalized centrality values of the vertices. The larger the values for the Silhouette Index for a cluster, the more similar are the vertices within the cluster with respect to the centrality metrics in consideration. The Silhouette Index for a cluster is the average of the Silhouette Index values of its member vertices. The Silhouette Index for a network is the weighted average of the Silhouette Index values of its clusters. The Silhouette Index for a vertex i in a cluster Ck is calculated as per formulation (1). Here, \( \overline{{d_{i,\hbox{min} } }} \) is the minimum of the average of the Euclidean distances for vertex i to vertices in the other clusters; \( \overline{{d_{i,Ck} }} \) represents the average of the Euclidean distances for vertex i to vertices in its own cluster. A negative Silhouette Index for a vertex is an indication that the vertex is not in the appropriate cluster. A negative Silhouette Index for a cluster is an indication that its member vertices should have been in other cluster(s) for better cohesiveness.

$$ {\text{Silhouette}}\,{\text{Index}}\left( i \right) = \frac{{\overline{{d_{i,\hbox{min} } }} - \overline{{d_{i,Ck} }} }}{{\hbox{max} \left\{ {\overline{{d_{i,\hbox{min} } }} ,\overline{{d_{i,Ck} }} } \right\}}} $$
(1)

4 Evaluation on Real-World Networks

In this section, we test our hypothesis on a suite of 25 biological network graphs and 25 social network graphs of diverse degree distributions and present the results of the Silhouette Index measure evaluated for the physical clusters and logical clusters obtained by running the Louvain algorithm (per the approaches described in the earlier sections). The biological networks analyzed are either gene–gene interaction networks, protein–protein interaction networks, interactions between different animal species in a particular area, etc. The social networks analyzed comprise acquaintance networks that capture the association between two users in a social network over a certain time period and friendship networks for which no such time period is used to capture association between two users.

In Fig. 3, we visually present some of the fundamental statistical information for the biological networks and social networks (for more details, see [12]). The number of nodes in the biological networks ranges from 62 to 2,640 with a median of 813. The number of nodes in the social networks ranges from 22 to 1,882 with a median of 75. The spectral radius ratio for node degree (λsp ≥ 1) [13] for a network graph is the ratio of the principal eigenvalue [5] of the adjacency matrix for the graph and the average node degree; the larger the λsp value, the larger the variation in node degree. The edge density (0 ≤ ρedge ≤ 1) is calculated as the ratio of the actual number of edges in the network and the maximum possible number of edges in the network (which is N(N − 1)/2 for a network of N nodes). The biological networks are characteristic of having larger λsp values and lower ρedge values; on the other hand, social networks are characteristic of having smaller λsp values and larger ρedge values. For more details on the individual real-world networks analyzed in each of these domains, the interested reader is referred to [12].

Fig. 3
figure 3

Statistics for the biological and social networks

In Figs. 4 and 5, we present and visually compare the Silhouette Index values for the physical vs. logical clusters on the basis of the neighborhood-based (DEG, EVC); the shortest path-based (BWC, CLC), and all the four centrality metrics together (DEG, EVC, BWC, CLC). We observe all the data points to be above the dotted diagonal line, indicating that the Silhouette Index values for the logical clusters for all the real-world networks are appreciably larger than the Silhouette Index values for the physical clusters. For each coordinate system and for each network category, we measure (see Fig. 6a for a comparative bar chart) the average of the difference in the Silhouette Index values for the logical clusters versus physical clusters. We observe the biological networks to incur larger average difference in the Silhouette Index values for all the three coordinate systems. The (DEG, EVC) coordinate system incurs, respectively, the lowest (for social networks) and largest (for biological networks) values for the average difference in the Silhouette Indexes for the logical versus physical clusters. We also measure the median (see Fig. 6b for a comparative bar chart) of the Silhouette Index values for the physical clusters versus logical clusters. Through Figs. 4,5, and 6, we observe the Silhouette Index values for the physical (logical) clusters in the biological networks to be relatively lower (larger) than those in the social networks.

Fig. 4
figure 4

Biological networks: Silhouette Index values for the physical versus logical clusters

Fig. 5
figure 5

Social networks: Silhouette Index values for the physical versus logical clusters

Fig. 6
figure 6

Statistical comparison of the Silhouette Index values. a Avg. difference in the social networks biological networks Silhouette Index values. b Median of the Silhouette Index values

5 Conclusions

We show that logical clusters of vertices could be more cohesive with respect to node-level centrality metrics compared to physical clusters of vertices in complex real-world network graphs. In this pursuit, we adapt a recently proposed unit-disk graph approach (for node similarity assessment) [12] to determine such logical clusters of similar vertices. We applied the approach on a suite of 25 biological networks and 25 social networks and evaluated the extent of similarity of the vertices in the logical clusters versus physical clusters using the Silhouette Index measure with respect to three combinations of centrality metrics (DEG, EVC), (BWC, CLC), and (DEG, EVC, BWC, CLC). For each combination, we observe all the 50 real-world network graphs to incur significantly larger Silhouette Index values for the logical clusters compared to the physical clusters. We observe the biological networks to show a relatively larger difference in the Silhouette Index values between the logical clusters and physical clusters. Likewise, the (BWC, CLC) coordinate system has been observed to incur relatively larger Silhouette Index values for several real-world networks. In future, we plan to extend the unit-disk graph approach for outlier detection in complex networks and large datasets.