Keywords

1 Introduction

Complex network analysis is fundamental to the understanding and modeling of relationships in several domains such as social networks [1], communication networks [2], biological networks [3], citation networks [4], cyber-attack networks [5] etc. Centrality metrics are the most common node-level metrics and quantify the topological importance of a node (vertex) with respect to one or more aspects [6]. There are two major categories of centrality metrics: neighborhood-based and shortest path-based. While the degree (DEG) and eigenvector centrality (EVC) metrics are considered representative metrics for the neighborhood-based category, the betweenness (BWC) and closeness (CLC) centrality metrics are considered representative metrics for the shortest path-based category. The DEG of a vertex is a measure of the number of neighbors of the vertex. The EVC of a vertex [7] is a measure of the degree of the vertex as well as the degrees of its neighbors. The BWC of a vertex [8] is a measure of the fraction of the shortest paths between any two vertices that go through the vertex. The CLC of a vertex [9] is a measure of the sum of the shortest path distances from the vertex to every other vertex. For more information about these centrality metrics, the interested reader is referred to [6,7,8,9]. Throughout the paper, the terms ‘node’ and ‘vertex’, ‘link’ and ‘edge’, ‘network’ and ‘graph’, ‘cluster’ and ‘community’ are used interchangeably. They mean the same.

Community detection is a classical problem in complex network analysis wherein one or more clusters of vertices are identified based on the topological distribution of the vertices and the edges connecting the vertices [6]. The primary focus of the community detection algorithms (e.g., [10]) has been to determine highly modular communities such that the intra-cluster density is as high as possible and the inter-cluster density is as low as possible [11]. With the above approach for clustering in complex networks, two vertices are likely to belong to the same cluster only if they have an edge between them or have a multi-hop shortest path of fewer intermediate edges. Even within a cluster of high density, it is possible that vertices significantly differ with respect to the values for one or more node-level metrics (see Sect. 33.2 for a motivating example) if the extent of similarity of the vertices (with respect to one or more node-level metrics) is not the primary criteria for cluster formation.

Our research objective is to quantify the extent to which vertices in a complex network are logically clusterable on the basis of the similarity among the values incurred for one or more node-level metrics. The nodes in such a logical cluster need not be connected, but would have very similar values for the node-level metrics. Identification of such logical clusters would be very informative for several network domains. For example: in an organizational network, a logical cluster of employees with similar profiles (but in different departments) would help in forming an integration team that could coordinate across departments without any ego differences. In a health information network, a quantitative measure of clusterability of patients with similar health parameters is critical to decide whether treatment plans need to be individualized or can be generalized at the cluster-level. The Hopkins Statistic measure (ranges from 0 to 1) has been traditionally used to assess the clusterability of a dataset [12]. We use the Hopkins Statistic to assess the similarity-based clusterability of the vertices in a complex network.

The rest of the paper is organized as follows: Section 33.2 motivates the proposed research for similarity-based logical clustering with an illustrative example. Section 33.3 illustrates the computation of the Hopkins Statistic on a coordinate system of the normalized centrality values for the example graph of Sect. 33.2. Section 33.4 presents the Hopkins Statistic values computed for 47 real-world networks (of diverse degree distributions) and evaluates the logical clusterability of the vertices on the basis of the similarity with respect to the neighborhood-based centrality metrics vs. the shortest path-based centrality metrics. Section 33.5 reviews related work on similarity assessment and clustering in complex networks, and highlights our contribution. Section 33.6 concludes the paper and outlines plans for future work.

2 Proposed Approach and Example

Our proposed approach to quantitatively assess similarity-based clusterability is as follows: Let K be the number of centrality metrics considered for the analysis. We determine the centrality values of the vertices in the complex network with respect to each of these metrics. We then individually normalize the centrality values of the vertices for each metric. The normalization of the centrality values with respect to a metric is done using the square root of the sum of the squares of the centrality values. We distribute the vertices in a K-dimensional coordinate system (wherein the range for each dimension is from 0 to 1) such that the coordinate for a vertex is a K-tuple, with an entry for each centrality metric. We refer to the above distribution as the logical topology of the vertices. If two vertices have similar values for the centrality metrics, then they should be closer to each other in the logical topology. There could exist one or more logical clusters of such similar vertices so that the Euclidean distance between vertices within a cluster is much smaller than the Euclidean distance between vertices in two different clusters. If the centrality values of the vertices are not similar to each other, then the vertices could be further away from each other in the logical topology and such a topology would not be effectively clusterable. The Hopkins Statistic is an effective measure to quantify the extent to which vertices could be clustered based on the similarity of their centrality values. The larger the Hopkins Statistic, the more effective is the clusterability of the vertices on the basis of similarity in their centrality values.

Figure 33.1a presents the communities determined by the classical Girvan-Newman algorithm [10] on a toy example graph. We observe the nodes in the individual clusters of Fig. 33.1a to be more connected to each other than to the nodes in the other clusters (i.e., a larger intra-cluster density and a lower inter-cluster density). However, from a centrality point of view, we observe the nodes in the individual clusters to be much different from each other (see Fig. 33.1b and c). Figure 33.1b and c respectively present a logical clustering of the vertices based on the similarity of the normalized values of the DEG and EVC metrics. The vertices within such logical clusters need not be physically connected to each other (Fig. 33.1b) or even if connected, the logical clusters need not have a high modularity (Fig. 33.1c), but would have similar (either identical or very close) values for the centrality metric considered.

Fig. 33.1
figure 1

Example to illustrate the clustering of the vertices based on the traditional approach vs. the normalized centrality-based coordinate system. (a) Physical clusters identified by the traditional community detection approach. (b) Logical clustering of the vertices based on the similarity with respect to degree centrality. (c) Logical clustering of the vertices based on the similarity with respect to eigenvector centrality. (d) Distribution of vertices in the logical topology respect to degree and eigenvector centrality metrics

If two or more centrality metrics are to be considered together for similarity assessment, then we distribute the vertices in a coordinate system of the normalized centrality values (ranging from 0 to 1 for each dimension) and assess the clusterability of the vertices based on their proximity to each other. Figure 33.1d presents one such distribution of the vertices based on a coordinate system of the normalized DEG and EVC values. Though the exact clusters depend on the clustering algorithm used, we observe vertex 5 to be far away from the rest of the vertices in the normalized (DEG, EVC) co-ordinate system and such a conclusion cannot be arrived at in Fig. 33.1a, b or c. Likewise, vertex 1 gets grouped with vertices 0, 2 and 3 under the traditional approach (see Fig. 33.1a), but is reasonably far away from them in the (DEG, EVC) co-ordinate system.

3 Hopkins Statistic Measure

We use the Hopkins Statistic measure [12] as the basis to quantify the similarity-based clusterability of the vertices in a complex network. The Hopkins Statistic (ranges from 0 to 1) is traditionally used to assess the clustering tendency of any dataset. The idea behind the Hopkins Statistic is as follows: Given a dataset of size n, we pick m samples (constituting the set X such that m ≪ n) from this dataset as well as generate m uniformly random samples (constituting the set Y) exhibiting the same variation as that of the given dataset. We determine the sums of the distances of the samples in the sets X and Y to the nearest data points (i.e., one nearest data point for each sample) in the given dataset. The Hopkins Statistic is the ratio of the sum of the nearest neighbor distances of the samples in the set Y to the sum of the nearest neighbor distances of the samples in the sets X and Y. If the data points in the given dataset are randomly distributed, then the sums of the nearest neighbor distances for the samples in the sets X and Y would be approximately the same, and the Hopkins Statistic is expected to be 0.5. If the data points in the given dataset are highly clusterable, then the sum of the nearest neighbor distances for the samples in the set X would be very negligible compared to the sum of the nearest neighbor distances for the samples in the set Y; hence, the Hopkins Statistic is expected to be close to 1.0 for a highly clusterable dataset. If the data points in the original dataset are uniformly distributed, then the sum of the nearest neighbor distances for the samples in the set X would be significantly larger than the sum of the nearest neighbor distances for the samples in the set Y; the Hopkins Statistic in such a case would be closer to 0. A value of 0.75 or higher for the Hopkins Statistic indicates a clustering tendency at the 90% confidence level [5] [13,14,15].

We now illustrate the procedure to compute the Hopkins Statistic measure for the logical topology of the vertices illustrated in Fig. 33.1d. The distance between two vertices in the logical topology is computed as the Euclidean distance between their coordinates (represented by the normalized centrality values with respect to degree and eigenvector centrality). There are a total of eight vertices (n = 8) in the toy example graph of Fig. 33.1. Let the parameter ‘m’ be 5. That is, we randomly pick five of the eight vertices from the graph and constitute the set X. Let the five vertices picked be {0, 3, 4, 6, 7}, and the set X would comprise of their corresponding (DEG, EVC) values as coordinates: {(0.416, 0.489), (0.416, 0.467), (0.277, 0.264), (0.277, 0.155), (0.277, 0.155)}. From Fig. 33.1b and c, we could identify the range of the DEG and EVC values to be [0.277, …, 0.416] and [0.155, …, 0.489] respectively. While the normalized DEG values for the vertices in the original dataset are either 0.277 or 0.416, the normalized EVC values do not appear to follow any particular pattern. Hence, to generate the set Y with five data points, the DEG values need to be either 0.277 or 0.416 (we randomly choose one of these two values for each data point) and the EVC values need to be uniform randomly distributed in the range [0.155, …, 0.489]. Accordingly, the five samples constituting the set Y are generated to be: {(0.416, 0.168), (0.277, 0.328), (0.416, 0.479), (0.416, 0.183), (0.277, 0.217)}.

Figure 33.2 summarizes all the calculations. The sums of the nearest distances to a vertex for the samples in the sets X and Y are respectively 0.122 and 0.276; the Hopkins Statistic for the (DEG, EVC) coordinate system is 0.276/(0.276 + 0.122) = 0.69. Since it is relatively closer to 0.5, we could conclude that the vertices in the given complex network graph are not very much clusterable on the basis of the similarity with respect to the DEG and EVC values.

Fig. 33.2
figure 2

Calculations for the Hopkins Statistic for (DEG, EVC)-Similarity based Clusterability for the graph of Fig. 33.1

4 Similarity-Based Clusterability of Real-World Networks

We applied the proposed approach to a suite of 47 real-world networks of diverse degree distributions and computed the values for the Hopkins Statistic for the logical topologies of the vertices with respect to the neighborhood-based DEG and EVC metrics and the shortest path-based BWC and CLC metrics. The real-world networks (for more details, refer to [16]) considered fall under one of these domains (the number of networks under each category is indicated within the parentheses): I. Acquaintance network (12), II. Friendship network (9), III. Co-appearance network (6), Employment network (4), Citation network (3), Collaboration network (3), Literature network (3), Biological network (3), Political network (2), Game network (2), Transportation network (1), Geographical network (1) and Trade network (1). Table 33.1 presents the Hopkins Statistic values for each network with respect to the (DEG, EVC) and (BWC, CLC) measures along with the number of nodes and edges as well as the average degree (k avg) and spectral radius ratio for node degree [17] for each of these networks. The spectral radius ratio for node degree (λ sp) is a measure of the variation in node degree.

Table 33.1 Hopkins statistic values with respect to the neighborhood-based (DEG, EVC) metrics vs. shortest path-based (BWC, CLC) metrics

Figure 33.3 plots the Hopkins Statistic values for the two coordinate systems/logical topologies: if a data point is below the diagonal line, then the Hopkins Statistic value for the (DEG, EVC) coordinate system is relatively larger; if a data point is above the diagonal line, then the Hopkins Statistic value for the (BWC, CLC) coordinate system is relatively larger. Table 33.1 highlights the Hopkins Statistic values that are (overall) relatively larger for a real-world network with respect to the two coordinate systems. From both Fig. 33.3 and Table 33.1, we could come to the following conclusions: The median of the Hopkins Statistic values for the (DEG, EVC) and (BWC, CLC)-based logical topologies are 0.89 and 0.85 respectively. Hence, majority of the real-world networks are effectively clusterable with respect to both the neighborhood and shortest path-based centrality metrics. On a relative basis: out of the 47 real-world networks, for 29 networks (i.e., for about 60% of the real-world networks): the (DEG, EVC)-based Hopkins Statistic values are larger. Also, the Hopkins Statistic values for the (DEG, EVC)-based logical topologies are relatively closer to 1.0 compared to the Hopkins Statistic values for the (BWC, CLC)-based logical topologies. We could thus conclude that real-world networks are more effectively clusterable with respect to the neighborhood-based centrality metrics compared to shortest path-based metrics.

Fig. 33.3
figure 3

Comparison of the Hopkins Statistic values for the (DEG, EVC)-based logical topologies vs. (BWC, CLC)-based logical topologies of the real-world networks

Figure 33.4 plots the distribution of the Spectral radius ratio for node degree vs. the Hopkins Statistic measure values for the (DEG, EVC) and (BWC, CLC)-based logical topologies. We observe for both the coordinate systems, the logical topologies are more effectively clusterable with increase in the spectral radius ratio for node degree. This implies: scale-free networks (larger values for the spectral radius ratio for node degree) are more effectively clusterable with respect to both categories of centrality metrics. Whereas, the random networks (that have lower values for the spectral radius ratio for node degree) are (relatively) less effectively clusterable with respect to both categories of centrality metrics, especially the shortest path-based centrality metrics.

Fig. 33.4
figure 4

Spectral radius ratio for node degree vs. the Hopkins statistic measure values

5 Related Work and Our Contribution

Clustering in complex networks has been traditionally on the notion of distance between the nodes [13]. Nodes that are closer to each other are preferred for being grouped within a cluster. Various forms of distance measures (Euclidean, Manhattan, Minkowski, Chebyshev, Mahalanobis, etc) have been used in the literature [14]. Similarity-based clustering has been typically considered only for non-network data [15, 18], broadly as a problem in data mining. Given a distribution of data points, data points that are similar with respect to one or more parameters tend to be clustered together [19, 20]. The typical approach [19] is to associate a multi-dimensional vector to each data point, measure the pair-wise similarity (using functions such as cosine similarity [6], Jaccard Index [19], SimRank [21], PathSim [22]) or the distances between the data points (using some distance function), and then cluster the data points that are similar and closer to each other. Our research proposes a combination of the above two approaches. We first determine the values for the centrality metrics (node-level parameters) of the nodes in a complex network, then treat the nodes as data points with an associated multi-dimensional vector of the normalized values of the centrality metrics, and finally analyze the clusterability of these data points.

The proposed Hopkins Statistic-based clusterability assessment of node similarity could be the first step in pursuit of developing a network-level similarity index for complex networks. If the Hopkins Statistic measure is high, there could exist one or more clusters of similar nodes (with respect to the centrality metrics) in the network. Though there could exist two or more such clusters in the network (with the nodes in one cluster significantly different from the nodes in the other clusters with respect to the centrality metrics), a lower value for the Hopkins Statistic measure is definitely an indication that the nodes in the logical topology of centrality-based coordinates are more randomly distributed and not very similar to each other (i.e., not clusterable). In other words, the computation of the Hopkins Statistic measure could be used as a prerequisite step for any algorithm to quantify/assess the network-wide similarity of all the nodes in the network; there would not be any need for the algorithm to proceed further with assessing the network-wide similarity of the nodes if the value for the Hopkins Statistic measure is low.

6 Conclusions and Future Work

Our high-level contribution in this paper is a proposal to assess the logical clusterability of the nodes in a network on the basis of node-level metrics (such as centrality metrics). We propose to distribute the nodes in a K-dimensional coordinate system (referred to as the logical topology) of the normalized centrality values of the vertices (where ‘K’ is the number of centrality metrics considered and we assign a dimension for each centrality metric). Two vertices that need not be physically connected in the complex network could be very close to each other in the logical topology if they incur similar values for the centrality metrics. The Hopkins Statistic has been identified as the quantitative measure to assess the similarity-based clusterability of nodes in such a logical topology. A lower value for the Hopkins Statistic is definitely an indication that the nodes do not exhibit similar values for the centrality metrics considered. On the other hand, a higher value for the Hopkins Statistic is an indication that there could exist one or more clusters of similar vertices in the network with respect to the centrality metrics considered. We analyzed a suite of 47 real-world networks of diverse degree distributions and determined their Hopkins Statistic values with respect to the neighborhood-based degree and eigenvector centrality metrics and the shortest path-based betweenness and closeness centrality metrics. We observed about 60% of the real-world networks to exhibit relatively larger Hopkins Statistic values with respect to the neighborhood-based centrality metrics. Also, the Hopkins Statistic measure values for the neighborhood-based centrality metrics are relatively closer to 1 compared to the values for the shortest path-based centrality metrics. As part of future work, we plan to develop a network-wide node similarity index (NSI) measure for complex network analysis and test our hypothesis that a higher Hopkins Statistic is a prerequisite for a network to exhibit a higher value for the NSI.