Abstract
The modularization of large graphs or community detection in networks is usually approached as an optimization problem of a quality function or criterion, for instance, the modularity of Newman-Girvan. There exist other clustering criteria, with their own properties leading to different solutions. In this paper we present six linear modularization criteria in relational notation such as the Newman-Girvan modularity, Zahn-Condorcet, Owsiński-Zadrożny, the Deviation to Uniformity index, the Deviation to Indetermination index and the Balanced-Modularity. We use a generic version of Louvain algorithm to approach the optimal partition of the criteria with real networks of different sizes. We have found that those partitions present important differences concerning the number of clusters. The relational formalism allows us to justify these differences from a theoretical point of view. Moreover, this notation enables to easily identify the criteria having a resolution limit (a phenomenon which causes the criterion to fail to identify modules smaller than a given scale). This finding is confirmed in artificial benchmark LFR graphs.
Access provided by CONRICYT-eBooks. Download chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
1 Introduction
Networks are studied in numerous contexts such as biology, sociology, online social networks, marketing, etc. Graphs are mathematical representations of networks, where the entities are called nodes and the connections are called edges. Very large graphs are difficult to analyse and it is often profitable to divide them in smaller homogeneous components easier to handle. The process of decomposing a network has received different names: graph clustering (in data analysis), modularization, community structure identification. The clusters can be called communities or modules; in this paper we use those words as synonyms.
Assessing the quality of a graph partition requires a modularization criterion. This function will be optimized to find the best partition. Various modularization criteria were formulated in the past to address different practical applications. Those criteria differ in the definition given to the notion of community or cluster.
To understand the differences between the optimal partitions obtained by each criterion we show how to represent them using the same basic formalism. In this paper we use the Mathematical Relational Analysis (MRA) to express six linear modularization criteria. Linear criteria are easy to handle, for instance, the Louvain method can be adapted to linear quality functions (see Campigotto et al. (2014)). The six criteria studied are: the Newman-Girvan modularity, the Zahn-Condorcet criterion, the Owsiński-Zadrozny criterion, the Deviation to Uniformity, the Deviation to Indetermination index and the Balanced Modularity (details in Sect. 3). The relational representation makes clear the properties of those modularization criteria. It allows to easily identify the criteria suffering from a resolution limit, first discussed by Fortunato and Barthelemy (2006). We will complete this theoretical study by some experiments on real and synthetic networks, demonstrating the effectiveness of our classification.
In this paper, we deal only with linear criteria. Nevertheless, it is important to mention that thanks to the formalism of the MRA it is also possible to express non-linear criteria in relational notations. For instance, we can mention some very well-known criteria such as the Mancoridis-Gansner criterion (see Mancoridis et al. (1998)) in cluster-programming, the Ratio-Cuts by Wei and Cheng (1989), the Michalski criterion (see Michalski and Stepp (1983) and its relational notation given in Decaestecker (1992)), etc. The interested reader can see Conde-Céspedes and Marcotorchino (2012) and Conde-Céspedes (2013).
This paper is organized as follows: Sect. 2 presents the Mathematical Relational Analysis approach and introduces the property of balance for linear criteria and its relation to the property of resolution limit. In Sect. 3, six linear modularization criteria in the relational formalism are formulated. Next, Sect. 4 discusses some experiments on real and artificial graphs to confirm the theoretical properties found previously.
2 Relational Analysis Approach
There is a strong link between the Mathematical Relational AnalysisFootnote 1 and graph theory: a graph is a mathematical structure that represents binary relations between objects belonging to the same set. Therefore, a non-oriented and non-weighted graph \(G=(V,E)\), with \(N=|V|\) nodes and \(M=|E|\) edges, is a binary symmetric relation on its set of nodes V represented by its adjacency matrix \(\mathbf A \) as follows:
We denote the degree \(d_i\) of node i the number of edges incident to i. It can be calculated by summing up the terms of the row (or column) i of the adjacency matrix: \(d_i=\sum _{i'}a_{ii'}=\sum _{i'}a_{i'i}=a_{i.}=a_{.i}\). We denote \(\delta =\frac{2M}{N^2}\) the density of edges of the whole graph.
Partitioning a graph implies defining an equivalence relation on the set of nodes V, that means a symmetric, reflexive and transitive relation. Mathematically, an equivalence relation is represented by a square matrix \(\mathbf X \) of order \(N=|V|\), whose entries are defined as follows:
Modularizing a graph implies to find \(\mathbf X \) as close as possible to \(\mathbf A \). A modularization criterion F(X) is a function which measures either a similarity or a distance between \(\mathbf A \) and \(\mathbf X \). Therefore, the problem of modularization can be written as a function to optimize F(X) where the unknown X is subject to the constraints of an equivalence relation. In fact, the problem of modularization can be written in the general form:
subject to the constraints of an equivalence relation:
The exact solving of this \(0-1\) linear program due to the size of the constraints is impractical for big networks. So, heuristic approaches are the only reasonable way to proceed.
We define as well \(\bar{\mathbf{X }}\) and \(\bar{\mathbf{A }}\) as the inverse relation of \(\mathbf X \) and \(\mathbf A \) respectively. Their entries are defined as \(\bar{x}_{ii'}=1-x_{ii'}\) and \(\bar{a}_{ii'}=1-a_{ii'}\) respectively. In the following we denote \(\kappa \) the optimal number of clusters, that means the number of clusters of the partition \(\mathbf X \) which maximizes the criterion F(X).
2.1 Linear Balanced Criteria
Every linear criterion is an affine function of \(\mathbf X \), therefore in relational notation it can be written as:
where \(\varPhi (a_{ii'})\) denotes any function depending only on the original data (for instance the adjacency matrix) and K denotes any constant depending only on the original data. Therefore, K does not intervene in the optimization problem.
Definition 1
(Property of linear balance). A linear criterion is balanced if it can be written in the following general form:
where \(\phi (.)\) and \(\bar{\phi }(.)\) are non negative functions depending only on the original data and verifying \(\sum _{i=1}^{N}\sum _{i'=1}^{N}\phi _{ii'}>0\) and \(\sum _{i=1}^{N}\sum _{i'=1}^{N}\bar{\phi }_{ii'}>0\).
So, they can not be all null simultaneously.
By replacing \(\bar{x}\) by its definition \(1-x_ {ii'}\), Eq. (5) can be rewritten as follows:
2.1.1 Interpretation of Functions \(\phi (.)\) and \(\bar{\phi }(.)\)
At this point, we can give the intuition behind functions \(\phi (.)\) and \(\bar{\phi }(.)\). From expression (6) we deduce the importance of the property of balance for linear criteria. If the criterion is a function to maximize, the presence and/or absence of the terms \(\phi _{ii'}\) and \(\bar{\phi }_{ii'}\) has the following impact on the optimal solution:
-
If \(\bar{\phi }_{ii'}=0 \, \forall i,i'\) the solution that maximizes F(X) is the partition where all nodes are clustered together in a single cluster, so \(\kappa =1\) and \(x_{ii'}=1 \quad \forall (i,i')\) and \(F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}\phi _{ii'}\).
-
If \(\phi _{ii'}=0 \, \forall i,i'\) then the optimal solution that maximizes F(X) is the partition where all nodes are separated, so \(\kappa = N\) and \(x_{ii'}=0\, \forall \, i\ne i'\) and \(x_{ii}=1 \, \forall i\) therefore \(F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}\bar{\phi }_{ii}\).
In other words, the optimization of a linear criterion who does not verify the property of balance will either cluster all the nodes in a single cluster or isolate each node in its own cluster, therefore forcing the user to fix the number of clusters in advance.
We can deduce from the previous paragraphs that the values taken by the functions \(\phi \) and \(\bar{\phi }\) create a sort of balance between the fact of generating as many clusters as possible, \(\kappa = N\), and the fact generating only one cluster, \(\kappa = 1\).
In the following we will call the quantity \( \sum _{i=1}^{N}\sum _{i'=1}^{N}\phi (a_{ii'})x_{ii'}\) the term of positive agreements and the quantity \(\sum _{i=1}^{N}\sum _{i'=1}^{N} \bar{\phi }(a_ {ii'})\bar{x}_{ii'}\) the term of negative agreements.
2.2 Different Levels of Balance
We define two levels of balance for all linear balanced criterion:
Definition 2
(Property of local balance). A balanced linear criterion whose functions \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) depend only upon the pair \((i,i')\) (therefore not depending on global properties of the graph) has the property of local balance.
Some remarks about Definition 2:
-
When we talk about global properties we refer to the total number of nodes, the total number of edges or other properties describing the global structure of the graph.
-
For the particular case of local balance where \(\displaystyle {\phi _{ii'}+\bar{\phi }_{ii'}=K}\) (that is \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) sum up to a constant), we can conclude that whereas \(\displaystyle {\phi _{ii'}}\) increases \(\displaystyle {\bar{\phi }_{ii'}}\) decreases and vice versa.
Let us consider the special case where \(\phi (a_{ii'})=a_{ii'}\), the general term of the adjacency matrix. A null model is a graph with the same total number of edges and nodes and where the edges are randomly distributed. Let us denote the general term of the adjacency matrix of this random graph \(\bar{\phi }(a_{ii'})\). A criterion based on a null model considers that a random graph does not have community structure. The goal of such a criterion is to maximize the deviation between the real graph, represented by \(\phi (a_{ii'})\) and the null model version of this graph, represented by \(\bar{\phi }(a_{ii'})\) as shown in Eq. (6). Since the original graph and the null model have the same number of edges M, we have \(\sum _{i=1}^N\sum _{i'=1}^N\phi _{ii'}= \sum _{i=1}^N\sum _{i'=1}^N\bar{\phi }_{ii'}=2M\). If this constraint causes \(\bar{\phi }_{ii'}\) to depend upon the total number of edges M, then a criterion based on a null model does not verify the property of local balance. Consequently, it is not scale invariant because it depends on a global characteristic of the graph.
The definition of null model for linear criteria can be generalized as follows:
Definition 3
(Criterion based on a null model). A balanced linear criterion that seeks to maximize the deviation between the real graph and a null model is a criterion based on a null model. In its formulation, the real graph is represented by \(\phi (a_{ii'})\) whereas the null model is represented by \(\bar{\phi }(a_{ii'})\). The functions \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) satisfy the following condition:
such that the functions \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) depend on global properties of the graph.
The global properties of the graph can be, for example, the total number of edges or the total number of nodes.
We can deduce from Definitions 2 and 3 that a linear criterion cannot be locally balanced and based on a null model at the same time.
In the particular case where \(\bar{\phi }\) decreases with the size of the network, it becomes negligible for large graphs. As explained previously, if this term tends towards zero, the optimization of the criterion will tend to group the nodes more easily. For instance, a single edge between two sub-graphs would be interpreted by the criterion as a sign of a strong correlation between the two clusters, and optimizing the criterion would lead to the merge of the two clusters. Such a criterion is said to have a resolution limit.
The resolution limit was introduced by Fortunato and Barthelemy (2006), where the authors studied the resolution limit of the modularity of Newman-Girvan. They demonstrated that modularity optimization may fail to identify modules smaller than a given size which depends on global characteristics of the graph. Even weakly interconnected complete sub-graphs—the best identifiable communities—would be merged by this kind of optimization criteria if the network is sufficiently large. According to Kumpula et al. (2007) the resolution limit is present in any modularization criterion based on global optimization of intra-cluster edges and extra-community links and on a comparison to any null model.
In Sect. 4, we will show how criteria having a resolution limit fail to detect certain groups of densely connected nodes.
3 Modularization Criteria in Relational Notation
Graph clustering criteria depend strongly on the meaning given to the notion of community. In this section, we describe six linear modularization criteria and their relational coding in Table 1. We assume that the graphs we want to modularize are scale-free, that means that their degree distribution follows a power law.
-
1.
The Zahn-Condorcet criterion (1785, 1964): C.T. Zahn was the first who studied the problem of finding an equivalence relation \(\mathbf X \), which best approximates a given symmetric relation \(\mathbf A \) in the sense of minimizing the distance of the symmetric difference (Zahn 1964). The criterion defined by Zahn corresponds to the dual Condorcet’s criterion (Condorcet 1785) introduced in Relational Consensus whose relational coding was given by Marcotorchino and Michaud (1979). This criterion requires that every node in each cluster be connected with at least as half as the total nodes inside the cluster. Consequently, for each cluster the fraction of within cluster edges is at least 50 % (see Conde-Céspedes (2013)) and Appendix for proof).
-
2.
The Owsiński-Zadrożny criterion (1986) (see Owsiński and Zadrożny (1986)) it is a generalization of Condorcet’s function. It has a parameter \(\alpha \), which allows, according to the context, to define the minimal percentage of required within-cluster edges: \(\alpha \). For \(\alpha =0.5\) this criterion is equivalent to Condorcet’s criterion. The parameter \(\alpha \) defines the balance between the positive agreements term and the negative agreements term. For each cluster the density of edges is at least \(\alpha \,\%\) (see Conde-Céspedes(2013)).
-
3.
The Newman-Girvan criterion (2004) (see Newman and Girvan (2004)): It is the best known modularization criterion, called sometimes simply modularity. It relies upon a null model. Its definition involves a comparison of the number of within-cluster edges in the real network and the expected number of such edges in a random graph where edges are distributed following the independence structure (a network without regard to community structure). In fact, the modularity measures the deviation to independence.
As mention in the previous section, this criterion, based on a null model and it has a resolution limit (see Fortunato and Barthelemy (2006)). In fact, as the network becomes larger \(M\longrightarrow \infty \), the term \(\displaystyle \bar{\phi }_{ii'}=\frac{a_{i.}a_{.i'}}{2M}\) tends to zero since the degree distribution follows a power law.
-
4.
The Deviation to Uniformity (2013) This criterion maximizes the deviation to the uniformity structure, it was proposed in Conde-Céspedes (2013). It compares the number of within-cluster edges in the real graph and the expected number of such edges in a random graph (the null model) where edges are uniformly distributed, thus all the nodes have the same degree equal to the average degree of the graph. This criterion is based on a null model and it has a resolution limit. indeed \(\delta \longrightarrow 0\) as \(N \longrightarrow \infty \).
-
5.
The Deviation to Indetermination (2013) Analogously to Newman-Girvan function, this criterion compares the number of within-cluster edges in the real network and the expected number of such edges in a random graph where edges are distributed following the indetermination structure Footnote 2 (a graph without regard to community structure) (Marcotorchino and Conde-Céspedes 2013; Marcotorchino 2013). The Deviation to Indetermination is based on a null model, therefore it has a resolution limit.
-
6.
The Balanced modularity Footnote 3 (2013) This criterion, introduced in Conde-Céspedes and Marcotorchino (2013), was constructed by adding to the Newman-Girvan modularity a term taking into account the absence of edges \(\bar{\mathbf{A }}\). Whereas Newman-Girvan modularity compares the actual value of \(a_{ii'}\) to its equivalent in the case of a random graph \(\displaystyle \frac{a_{i.}a_{.i'}}{2M}\), the new term compares the value of \(\bar{a}_{ii'}\) to its version in case of a random graph \(\displaystyle \frac{(N-a_{i.})(N-a_{.i'})}{N^2-2M}\). It is based on a null model and it has a resolution limit.
The six linear criteria of Table 1 verify the property of balance, so it is not necessary to set in advance the number of clusters. Table 2 specifically focuses on the fonctions \(\phi _{ii'}\) and \(\bar{\phi }_{ii'}\) for each criterion.
From Tables 1 and 2 one can easily deduce that two criteria: Zahn-Condorcet and Owsiński-Zadrożny verify the property of local balance. Furthermore, Table 2 clearly shows that the functions \(\phi _{ii'}\) and \(\bar{\phi }_{ii'}\) add up to a constant \(K_{ii'}\) for these two criteria. The quantity \(\bar{\phi }_{ii'}\) decreases with the size of the graph for all criteria that have a resolution limit.
4 The Impact of Merging Two Clusters
We modularized five real networks of different sizes: Jazz (Gleiser and Danon 2003), Internet (Hoerdt and Magoni 2003), Web nd.edu (Albert et al. 1999), Amazon (Yang and Leskovec 2012)Footnote 4 and Youtube (Mislove et al. 2007). We ran a generic version of Louvain Algorithm (see Campigotto et al. (2014) and Blondel et al. (2008)) until achievement of a stable value of each criterion. The number of clusters obtained for each network is shown in Table 3.
Table 3 shows that the Zahn-Condorcet and Owsiński-Zadrożny criteria generate many more clusters than the other criteria having a resolution limit, for which the number of clusters is rather comparable. Moreover, this difference increases with the network size. Notice that the number of clusters for the Owsiński-Zadrożny criterion decreases with \(\alpha \), that is the minimal required fraction of within-cluster edges, so the criterion becomes more flexible.
In order to explain these differences we measure the impact of merging two clusters on the value of each criterion. Let us suppose we want to merge two clusters \(\mathscr {C}_1\) and \(\mathscr {C}_2\) in the network of sizes \(n_{1}\) and \(n_2\) respectively. Let us suppose as well they are connected by l edges as shown in Fig. 1.
Let us denote \(C_F\) the contribution of merging two clusters to the value of a criterion F. The contribution \(C_F\) can be easily calculated from (6) (for the proof see Conde-Céspedes (2013)):
-
If \(C>0\) the merger of the two clusters increases the value of the criterion.
-
If \(C<0\) the merger of the two clusters decreases the value of the criterion.
Equation (7) shows that the decision of merging or not the two clusters depends on a comparison between the quantity \(\displaystyle {\sum _{i\in \mathscr {C}_1}^{n_1} \sum _{i'\in \mathscr {C}_2}^{n_2}\phi _{ii'}}\) and the quantity \(\displaystyle {\sum _{i\in \mathscr {C}_1}^{n_1} \sum _{i'\in \mathscr {C}_2}^{n_2} \bar{\phi }_{ii'}}\). Giving the fact that both are positive, it is the one with the highest value that decides to merge or not to merge. Thus, whereas the first one is for fusion the second one is against the fusion.
Table 4 shows the explicit expression of the contribution for the linear criteria described below.Footnote 5
where \(\displaystyle {d_{av}=\frac{\sum _{i\in V}^{N}a_{i.}}{N}}\) is the average degree of the whole graph, \(\displaystyle {d_{av}^1=\frac{\sum _{i\in \mathscr {C}_1}^{n_1}a_{i.}}{n_1}}\) and \(\displaystyle {d_{av}^2=\frac{\sum _{i'\in \mathscr {C}_2}^{n_2}a_{.i'}}{n_2}}\) are the average degrees of \(\mathscr {C}_1\) and \(\mathscr {C}_2\) respectively.
We can remark from Table 4 that for the five criteria the contribution compares “the number of edges between \(\mathscr {C}_1\) and \(\mathscr {C}_2\): l” to the quantity in bold. We can see as well that the contribution for locally balanced criteria depends only upon local properties: \(l, \bar{l}, n_1, n_2\). In fact, locally balanced criteria are scale invariant. In contrast, for the other criteria having a resolution limit the contribution depends and is decreasing on the global size of the network. We remark as well that for three criteria: Newman-Girvan, Deviation to Indetermination and Balanced Modularity the contribution depends on the degree distribution of the two clusters. According to Barabasi and Albert (1999) many real networks fall into the class of scale-free networks, meaning that their degree distribution follows a power-law. In a scale-free network a few nodes called hubs have many connexions whereas most nodes have few connexions.
4.1 Impact on the Optimal Number of Clusters
From the previous results we can deduce the main characteristics of the optimal partition found by the optimization of each criterion (see Table 5). In addition, we remark the following facts:
-
The Zahn-Condorcet criterion: According to Table 4 for merging the two clusters \(\mathscr {C}_1\) and \(\mathscr {C}_2\), these ones must be connected by at least as many edges as the half of the maximum possible number of edges,Footnote 6 that is \(\displaystyle {l>\frac{n_1n_2}{2}}\).
-
The Owsiński-Zadrożny criterion: For merging the two clusters \(\mathscr {C}_1\) and \(\mathscr {C}_2\), these ones must be connected by at least as \(\alpha \,\%\) as the maximum possible number of edges.
-
The Deviation to Uniformity: According to Table 4 for the merge to take place the fraction of edges between \(\mathscr {C}_1\) and \(\mathscr {C}_2\) must be at least equal to the global density of the whole graph.
-
Newman-Girvan criterion: From Table 4 we can deduce that the optimal partition does not have clusters with a single node (this result was already demonstrated in Brandes et al. (2008)). In fact, if \(\mathscr {C}_1\) has only one node with only one connection to \(\mathscr {C}_2\), thus \(n_1=1\), \(d_{av}^1=1\), \(l=1\) and consequently the contribution is always positive: \(\displaystyle {C_{NG}=\left( 1-\frac{\sum _{i=1}^{n_2}a_{i.}}{2M}\right) >0}\).
-
Balanced Modularity: It is easy to understand the behavour of the contribution of Balanced Modularity when we compare it to those of Newman-Girvan and Deviation to Indetermination (see Conde-Céspedes (2013) for proof).Footnote 7 Indeed, we demostrated in Conde-Céspedes (2013) that:
$$\begin{aligned} C_{BM}=2C_{NG}+ n_1n_2\frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{2M(1-\delta )} \end{aligned}$$(8)and
$$\begin{aligned} C_{BM}=2C_{DI}+ n_1n_2\left( 2-\frac{1}{\delta }\right) \frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{N^2(1-\delta )}. \end{aligned}$$(9)Although the contribution for the Balanced Modularity is increasing in both the contribution of Newman Girvan \(C_{NG}\) and in the contribution of Deviation to Indetermination \(C_{DI}\), in both cases \(C_{BM}\) has an additional term that we can treat as regulator: \(\left( n_1n_2\frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{2M(1-\delta )}\right) \) and \(\left( n_1n_2\left( 2-\frac{1}{\delta }\right) \frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{N^2(1-\delta )}\right) \) respectively. These two regulators have opposite sign for real networks. In fact, the coefficient \(\left( 2-\frac{1}{\delta }\right) \) of the second regulator is almost surely negative for real graphs because the density \(\delta<< 0.5\) for scale-free networks. That is why the Balanced Modularity behaves as a regulator between both criteria: Newman-Girvan and Balanced Modularity. However, when the network size increases \(N \longrightarrow \infty \) and \(M \longrightarrow \infty \) the regulator terms tend to zero.
Only ground-truth overlapping communities are defined on real networks in Table 3. This fact makes difficult to judge the quality of the obtained partitions because we can not directly compare a partition to overlapping communities. That is why in the next section we will consider artificial networks with a predefined community structure.
5 Experiments with Artificial Networks
In order to judge the quality of the partitions obtained by each criterion we generated benchmark LFR graphsFootnote 8 (see Lancichinetti et al. (2008)) of different sizes 1000, 5000, 10000, 50000, 100000 and 500000. The input parameters are the same as those considered in Lancichinetti and Fortunato (2009). The average degree is 20, the maximum degree 50, the exponent of the degree distribution is –2 and that of the community size distribution is –1. In order to test the existence of resolution limit we chose small communities sizes, ranging from 10 to 50 nodes, and low values of mixing parameter, 0.10, 0.20 et 0.30. Figure 2 shows the average number of clusters for 100 runs of the generic Louvain algorithm.
In Fig. 2 it is hard to see the curve of the real number of clusters (in black) beacuse it is almost overlapped with those of OZ1 and OZ2.
Figure 2 shows clearly the difference between the behavior of those criteria having a resolution limit (NG, DU, DI and BM) and the behavior of criteria locally defined (ZC and OZ). As the size of the network increases the four criteria suffering from resolution-limit detect fewer clusters than those predefined. The number of clusters is rather comparable for these four functions, one reason can be the fact that the term of negative agreements tends to zero when the network gets bigger. Conversely, the number of clusters of criteria locally defined increases nearly at the same rate as the real number of clusters. Whereas OZ with high \(\alpha \) identifies more clusters than those predefined, the criterion which best approaches the real number of clusters is OZ with low values of \(\alpha =0.2\) and \(\alpha =0.1\).
Figure 3 shows the Normalized Mutual Information Footnote 9 (NMI) for the partitions in Fig. 2.
Figure 3 shows that the average NMI decreases with the network size for criteria having a resolution limit. Moreover, they almost overlap. Conversely, the NMI of the criteria locally defined seem to increase with the network size. The criterion with the highest NMI is OZ with low values of \(\alpha \), 0.1 and 0.2.
Figure 4 shows the average Normalized Mutual Information for the mixing parameter ranging from 0.1 to 0.8 for different network sizes.
Figure 4 shows that for all the criteria previously presented the NMI decreases as the mixing parameter increases. This figure demonstrates once more the differences between the behavior of criteria with resolution limit and that of the criteria locally defined. For the first ones the quality decreases abruptly beyond mixing parameter equal to 0.6. For the second ones, the quality seems to decrease at a lower rate. However, it is important to remark that the quality of criteria with a resolution limit decreases not only with the mixing parameter but also with the network size. Converserly, the behavior of the NMI of locally defined criteria seem to have the same behavour independtly of the size of the whole network.
Another point to remark is that even when the mixing parameter is high all the criteria find a community structure. In fact, the pre-defined communities in the LFR graphs are based on mixing parameter, whereas all the criteria analysed in this article have their own definition of graph with no community structure which is not based on the mixing parameter.
Table 5 presents a summary of the results found by the previous analysis.
6 Conclusions
We have presented six linear modularization criteria in relational notation, Zahn-Condorcet, Owsiński-Zadrożny, the Newman-Girvan modularity, the Deviation to Uniformity index, the Deviation to Indetermination index and the Balanced- Modularity. This notation allowed us to easily identify the criteria suffering from a resolution limit. We found that the first two criteria had a local definition, whereas the others, based on a null model, had a resolution limit. These findings were confirmed by modularizing real and artificial graphs using a generic version of the Louvain algorithm. We compared the number of clusters found by the six criteria and the Normalized Mutual information for artificial graphs. The results showed that criteria based on a local definition had a better performance than those based on a null model when the size of the graph increases, experimentally the crition having the best behavior was Owsiński-Zadrożny with low values of parameter \(\alpha \). However, it is important to remark that these results are based on a particular kind of graphs, more precisely, graphs with a low mixing parameter, small communities,Footnote 10 node degrees and community sizes distributed according to a power law.
Notes
- 1.
- 2.
- 3.
Although the name of this criterion contains the word balanced, its definition is not related to the property of balance given in Definition 1.
- 4.
The data was taken from http://snap.stanford.edu/data/com-Amazon.html.
- 5.
The contribution for the Balanced Modularity will be given later.
- 6.
This result is a consequence of the rule this criterion relies on: “The rule of absolute majority of Condorcet” in voting theory.
- 7.
These expressions are deduced from the two following expressions of Balanced Modularity in terms of Newman-Girvan and Deviation to Indetermination criteria:
$$F_{BM}=2F_{NG}+\sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{2M(1-\delta )}\right) x_{ii'}$$and
$$F_{BM}=2F_{DI}+\left( 2-\frac{1}{\delta }\right) \sum _{i=1}^N \sum _{i'=1}^N \left( \frac{(a_{i.}-d_{av})(a_{.i'}-d_{av})}{N^2(1-\delta )} \right) x_{ii'}.$$ - 8.
LFR graphs are benchmark graphs introduced in Lancichinetti et al. (2008) that aim to reproduce as much as possible the structure that reflects the real properties of nodes and communities found in real networks. These artificial graphs have predefined community structure based on the mixing parameter of each node. As stated in Lancichinetti et al. (2008), for each node the mixing parameter is the fraction of its links it shares with the nodes of the network outside its community.
- 9.
The normalized mutual information (NMI) is a measure of similarity of two partitions. It was originated in information theory to measure the departure from independence between two random variables. Given a set of objects V and two partitions \(P_1\) and \(P_2\) defined on V, intuitively, the mutual information measures the information that \(P_1\) and \(P_2\) share. It is normalized between 0 and 1. It is worth 0 if the two partitions are independent and 1 if they are identical. Let p and q be the total number of clusters of partitions \(P_1\) and \(P_2\) respectively. The NMI is calculated as follows:
$$\begin{aligned} NMI(P_1,P_2)=\frac{2I(P_1,P_2)}{H(P_1)+H(P_2)} \end{aligned}$$where:
-
\(I(P_1,P_2)=\sum _{u=1}^p \sum _{v=1}^q p_{uv}\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) \) is the mutual information of partitions \(P_1\) and \(P_2\). I tells how much we learn about \(P_1\) if we know \(P_2\) and vice versa. The quantity \(p_{uv}=\frac{n_{uv}}{N}\) is the fraction of objects who belong simultaneously to cluster u of partition \(P_1\) and to cluster v of partition \(P_2\). Analogously \(p_{uv}=\frac{n_{u.}}{N}\) is the fraction of objects who belong to cluster u of partition \(P_1\) and \(p_{uv}=\frac{n_{.v}}{N}\) is the fraction of objects who belong to cluster v of partition \(P_2\) and \(|V|=N\). In the case \(n_{uv}=0\) we assume \(\ln \left( \frac{p_{uv}}{p_{u.} p_{.v}} \right) =0\).
-
\(H(P_1)=-\sum _{u=1}^p p_{u.}\ln p_{u.}\) represents the Shanon entropy of \(P_1\) and \(H(P_2)=-\sum _{v=1}^q p_{.v}\ln p_{.v}\) represents the Shanon entropy of \(P_2\) (see Shannon (1948)).
-
- 10.
What we call small are communities ranging from 10 to 50 nodes, that is the same sizes considered by the authors of LFR graphs (see Lancichinetti and Fortunato (2009)).
References
Ah-Pine, J., & Marcotorchino, F. (2007). Statistical, geometrical and logical independences between categorical variables. In Proceedings of the ASMDA2007 Symposium, Chania, Greece.
Albert, R., Jeong, H., & Barabási, A. (1999). Internet: Diameter of the world-wide web. Nature, 401(6749), 130–131.
Barabasi, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509–512.
Blondel, V., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.
Brandes, U., Delling, D., Gaertler, M., Grke, R., Hoefer, M., Nikoloski, Z., et al. (2008). On modularity clustering. IEEE Transactions on Knowledge and Data Engineering, 20(2), 172–188.
Campigotto, R., Conde-Céspedes, P., & Guillaume, J. (2014). A generalized and adaptive method for community detection. CoRR abs/1406.2518.
Conde-Céspedes, P. (2013). Modélisations et extensions du formalisme de l’Analyse Relationnelle Mathématique à la modularisation des grands graphes. Thèse de doctorat: Université Pierre et Marie Curie.
Conde-Céspedes, P., & Marcotorchino, J. (2012). Modularisation et recherche de communautés dans les réseaux complexes par unification relationnelle. In Revue des Nouvelles Technologies de l’Information, Apprentissage Artificiel et Fouille de Données, RNTI-A-6 (pp. 71–97).
Conde-Céspedes, P., & Marcotorchino, F. (2013). Comparison different modularization criteria using relational metric. In F. Nielsen & F. Barbaresco (Eds.), Proceedings First International Conference, Geometric Science of Information (Vol. 1, pp. 180–187). Paris: Springer.
Condorcet, C. A. M. (1785). Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. Journal of Mathematical Sociology, 1(1), 113–120.
Decaestecker, C. (1992). Apprentissage en classification conceptuelle incrémentale. Ph.D. thesis, Université Libre de Bruxelles (Faculté des Sciences).
Fortunato, S., & Barthelemy, M. (2006). Resolution limit in community detection. In Proceedings of the National Academy of Sciences of the United States of America.
Gleiser, P., & Danon, L. (2003). Community structure in jazz. Advances in Complex Systems (ACS), 06(04), 565–573.
Hoerdt, M., & Magoni, D. (2003). Proceedings of the 11th International Conference on Software, Telecommunications and Computer Networks (vol. 257).
Kumpula, J., Saramäki, J., Kaski, K., & Kertesz, J. (2007). Limited resolution in complex network community detection with potts model approach. The European Physical Journal B, 56(1), 41–45.
Lancichinetti, A., & Fortunato, S. (2009). Community detection algorithms: A comparative analysis. Physical Review E, 80, 056117.
Lancichinetti, A., Fortunato, S., & Radicchi, F. (2008). Benchmark graphs for testing community detection algorithms. Physical Review E, 78, 046110.
Mancoridis, S., Mitchell, B., Rorres, C., Chen, Y., & Gansner, E. (1998). Using automatic clustering to produce high-level system organizations of source code. In The IEEE Proceedings of the 1998 International Workshop on Program Understanding (IWPC 1998) (pp. 45–52). Ischia: IEEE Computer Society.
Marcotorchino, F. (1984). Utilisation des comparaisons par paires en statistique des contingences (partie i). Publication du Centre Scientifique IBM de Paris, F057, et Cahiers du Séminaire Analyse des Données et Processus Stochastiques Université Libre de Bruxelles (pp. 1–57).
Marcotorchino, F. (1985). Utilisation des comparaisons par paires en statistique des contingences (partie iii). Etude F-081 du Centre Scientifique IBM de Paris (pp. 1–39).
Marcotorchino, F. (2013). Optimal transport, spatial interaction models and related problems, impacts on relational metrics, adaptation to large graphs and networks modularity. Internal Publication of Thales.
Marcotorchino, F., & Conde-Céspedes, P. (2013). Optimal transport and minimal trade problem, impacts on relational metrics and applications to large graphs and networks modularity. In F. Nielsen & F. Barbaresco (Eds.), Proceedings of First International Conference, Geometric Science of Information (Vol. 8085, pp. 169–179). Heidelberg: Springer.
Marcotorchino, F., & Michaud, P. (1979). Optimisation en Analyse ordinale des données. Paris: Masson.
Michalski, R., & Stepp, R. (1983). Learning from observation: Conceptual clustering. In R. Michalski, J. Carbonell, T. Mitchell, & M. Kaufmann (Eds.), Machine learning: An artificial intelligence approach, Chap. 11 (Vol. 1, pp. 331–367). Heidelberg: Springer.
Mislove, A., Marcon, M., Gummadi, K., Druschel, P., & Bhattacharjee, B. (2007). Measurement and analysis of online social networks. In Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC 2007), San Diego, CA.
Newman, M., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69, 026113.
Owsiński, J., & Zadrożny, S. (1986). Clustering for ordinal data: A linear programming formulation. Control and Cybernetics, 15(2), 183–193.
Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(379–423), 623–656.
Wei, Y., & Cheng, C. (1989). Towards efficient hierarchical designs by ratio cut partitioning. In IEEE International Conference on Computer-Aided Design (pp. 298–301).
Yang, J., & Leskovec, J. (2012). Defining and evaluating network communities based on ground-truth. In International Conference on Data Mining (pp. 745–754). IEEE Computer Society. abs/1205.6233.
Zahn, C. (1964). Approximating symmetric relations by equivalence relations. SIAM Journal on Applied Mathematics, 12, 840–847.
Acknowledgments
This work is supported by REQUEST and Open Food System projects.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendix
Appendix
Theorem 1
(The density of clusters obtained by maximization of Zahn-Condorcet criterion is least 50 %). Given a connected, non-oriented and unweighted graph \(G=(V,E)\), the optimal partition obtained by optimizing the Zahn-Condorcet criterion has the following property: the number of within-cluster edges of each cluster is at least as half as the possible maximum existing within-cluster edges, that is to say the number of existing edges in the case the cluster is a clique. Furthermore, every node in each cluster is connected with at least as half as the total nodes inside the cluster.
Proof
Considering the constraints of reflexivity and symmetry of the relational variable \(x_{ii'}\) (i.e. \(x_{ii}=1 \forall i\) and \(x_{ii'}=x_{i'i}\)), the expression of Zahn-Condorcet criterion in Table 2 can be written as follows:
\(F_{ZC}(X)=\sum _{i>i'}(a_{ii'}-\bar{a}_{ii'})x_{ii'}+N^2-2M-N\).
where:
-
\(\sum _{i>i'}a_{ii'}x_{ii'}\) is the number of within-cluster edges for all clusters.
-
\(\sum _{i>i'}\bar{a}_{ii'}x_{ii'}\) is the number of missing within-cluster edges for all clusters.
If we denote \(E_j\) the number of within edges of cluster j, the total number of missing edges for the cluster j will be \(\left( \frac{n_j(n_j-1)}{2}-E_j\right) \). So, the criterion Zahn-Condorcet will become:
\(F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa \left( E_j-\left( \frac{n_j(n_j-1)}{2}-Ej\right) \right) +N^2-2M-N \),
or
\(F_{ZC}(\mathscr {C})=\sum _{j=1}^\kappa (2E_j-\frac{n_j(n_j-1)}{2})+N^2-2M-N \).
the term \((2E_j-\frac{n_j(n_j-1)}{2})\) represents the contribution of cluster j to the value of the criterion. For each cluster of the optimal partition this term must be positive or null. Otherwise it would be possible to obtain a better partition by isolating each node in cluster j (the contribution to the value of the criterion by a cluster of an isolated node is null). This implies:
\((2E_j-\frac{n_j(n_j-1)}{2})\ge 0\), or \(E_j\ge \frac{n_j(n_j-1)}{4}\).
So, each cluster j has a density of at least 50 %.
This result can be extended to every node of each cluster of the optimal partition. In fact, let us suppose that there is a cluster j containing a node \(n_0\) which is connected with less than half of the total nodes in the cluster. Let us denote \(E_{j_0}\) the connexions of \(n_0\) to nodes in \(C_j\). So, \(E_{j_0}<=\frac{(n_j-1)}{2}\).
It is always possible to obtain a better partition by isolating \(n_0\). In fact, the contribution of the two resulting clusters after isolation of node \(n_0\) is:
\(2(E_j-E_{j_0})-\frac{(n_j-1)(n_j-2)}{2}\)
this last expression is greater than the contribution of cluster j, given by \((2E_j-\frac{n_j(n_j-1)}{2})\), if \(n_0\) is connected with less than half of nodes in \(C_j\).
This also proves why the partitions obtaining by optimizing Zahn-Condorcet criterion contain sometimes clusters of isolates nodes. \(\square \)
Rights and permissions
Copyright information
© 2017 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Conde-Céspedes, P., Marcotorchino, JF., Viennet, E. (2017). Comparison of Linear Modularization Criteria Using the Relational Formalism, an Approach to Easily Identify Resolution Limit. In: Guillet, F., Pinaud, B., Venturini, G. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 665. Springer, Cham. https://doi.org/10.1007/978-3-319-45763-5_6
Download citation
DOI: https://doi.org/10.1007/978-3-319-45763-5_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45762-8
Online ISBN: 978-3-319-45763-5
eBook Packages: EngineeringEngineering (R0)