Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

Networks are studied in numerous contexts such as biology, sociology, online social networks, marketing, etc. Graphs are mathematical representations of networks, where the entities are called nodes and the connections are called edges. Very large graphs are difficult to analyse and it is often profitable to divide them in smaller homogeneous components easier to handle. The process of decomposing a network has received different names: graph clustering (in data analysis), modularization, community structure identification. The clusters can be called communities or modules; in this paper we use those words as synonyms.

Assessing the quality of a graph partition requires a modularization criterion. This function will be optimized to find the best partition. Various modularization criteria were formulated in the past to address different practical applications. Those criteria differ in the definition given to the notion of community or cluster.

To understand the differences between the optimal partitions obtained by each criterion we show how to represent them using the same basic formalism. In this paper we use the Mathematical Relational Analysis (MRA) to express six linear modularization criteria. Linear criteria are easy to handle, for instance, the Louvain method can be adapted to linear quality functions (see Campigotto et al. (2014)). The six criteria studied are: the Newman-Girvan modularity, the Zahn-Condorcet criterion, the Owsiński-Zadrozny criterion, the Deviation to Uniformity, the Deviation to Indetermination index and the Balanced Modularity (details in Sect. 3). The relational representation makes clear the properties of those modularization criteria. It allows to easily identify the criteria suffering from a resolution limit, first discussed by Fortunato and Barthelemy (2006). We will complete this theoretical study by some experiments on real and synthetic networks, demonstrating the effectiveness of our classification.

In this paper, we deal only with linear criteria. Nevertheless, it is important to mention that thanks to the formalism of the MRA it is also possible to express non-linear criteria in relational notations. For instance, we can mention some very well-known criteria such as the Mancoridis-Gansner criterion (see Mancoridis et al. (1998)) in cluster-programming, the Ratio-Cuts by Wei and Cheng (1989), the Michalski criterion (see Michalski and Stepp (1983) and its relational notation given in Decaestecker (1992)), etc. The interested reader can see Conde-Céspedes and Marcotorchino (2012) and Conde-Céspedes (2013).

This paper is organized as follows: Sect. 2 presents the Mathematical Relational Analysis approach and introduces the property of balance for linear criteria and its relation to the property of resolution limit. In Sect. 3, six linear modularization criteria in the relational formalism are formulated. Next, Sect. 4 discusses some experiments on real and artificial graphs to confirm the theoretical properties found previously.

2 Relational Analysis Approach

There is a strong link between the Mathematical Relational AnalysisFootnote 1 and graph theory: a graph is a mathematical structure that represents binary relations between objects belonging to the same set. Therefore, a non-oriented and non-weighted graph \(G=(V,E)\), with \(N=|V|\) nodes and \(M=|E|\) edges, is a binary symmetric relation on its set of nodes V represented by its adjacency matrix \(\mathbf A \) as follows:

$$\begin{aligned} a_{ii'}= {\left\{ \begin{array}{ll} 1 &{} \text {if there exists an edge between}\, i \, \text {and}\, i'\;\forall (i,i')\in V\times V\\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

We denote the degree \(d_i\) of node i the number of edges incident to i. It can be calculated by summing up the terms of the row (or column) i of the adjacency matrix: \(d_i=\sum _{i'}a_{ii'}=\sum _{i'}a_{i'i}=a_{i.}=a_{.i}\). We denote \(\delta =\frac{2M}{N^2}\) the density of edges of the whole graph.

Partitioning a graph implies defining an equivalence relation on the set of nodes V, that means a symmetric, reflexive and transitive relation. Mathematically, an equivalence relation is represented by a square matrix \(\mathbf X \) of order \(N=|V|\), whose entries are defined as follows:

$$\begin{aligned} x_{ii'}= {\left\{ \begin{array}{ll} 1 &{} \text {if}\, i \, \text {and}\, i' \,\text {are in the same cluster}\;\forall (i,i')\in V\times V \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(2)

Modularizing a graph implies to find \(\mathbf X \) as close as possible to \(\mathbf A \). A modularization criterion F(X) is a function which measures either a similarity or a distance between \(\mathbf A \) and \(\mathbf X \). Therefore, the problem of modularization can be written as a function to optimize F(X) where the unknown X is subject to the constraints of an equivalence relation. In fact, the problem of modularization can be written in the general form:

$$\begin{aligned} \underset{X}{Max}(F(X)) \end{aligned}$$
(3)

subject to the constraints of an equivalence relation:

$$\begin{aligned} \begin{array}{rcl} x_{ii'}\in \left\{ 0,1\right\} &{} &{} \text {Binary}\\ x_{ii}=1 &{}\forall i &{}\text {Reflexivity}\\ x_{ii'}-x_{i'i}=0 &{} \forall (i,i') &{}\text {Symmetry}\\ x_{ii'}+x_{i'i''}-x_{ii''}\le 1 &{} \forall (i,i',i'') &{}\text {Transitivity} \end{array} \end{aligned}$$

The exact solving of this \(0-1\) linear program due to the size of the constraints is impractical for big networks. So, heuristic approaches are the only reasonable way to proceed.

We define as well \(\bar{\mathbf{X }}\) and \(\bar{\mathbf{A }}\) as the inverse relation of \(\mathbf X \) and \(\mathbf A \) respectively. Their entries are defined as \(\bar{x}_{ii'}=1-x_{ii'}\) and \(\bar{a}_{ii'}=1-a_{ii'}\) respectively. In the following we denote \(\kappa \) the optimal number of clusters, that means the number of clusters of the partition \(\mathbf X \) which maximizes the criterion F(X).

2.1 Linear Balanced Criteria

Every linear criterion is an affine function of \(\mathbf X \), therefore in relational notation it can be written as:

$$\begin{aligned} F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}\varPhi (a_{ii'})x_{ii'}+K, \end{aligned}$$
(4)

where \(\varPhi (a_{ii'})\) denotes any function depending only on the original data (for instance the adjacency matrix) and K denotes any constant depending only on the original data. Therefore, K does not intervene in the optimization problem.

Definition 1

(Property of linear balance). A linear criterion is balanced if it can be written in the following general form:

$$\begin{aligned} F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}\phi (a_{ii'})x_{ii'}+ \sum _{i=1}^{N}\sum _{i'=1}^{N}\bar{\phi }(a_{ii'})\bar{x}_{ii'}+K. \end{aligned}$$
(5)

where \(\phi (.)\) and \(\bar{\phi }(.)\) are non negative functions depending only on the original data and verifying \(\sum _{i=1}^{N}\sum _{i'=1}^{N}\phi _{ii'}>0\) and \(\sum _{i=1}^{N}\sum _{i'=1}^{N}\bar{\phi }_{ii'}>0\).

So, they can not be all null simultaneously.

By replacing \(\bar{x}\) by its definition \(1-x_ {ii'}\), Eq. (5) can be rewritten as follows:

$$\begin{aligned} F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}(\phi _{ii'}-\bar{ \phi }_{ii'}) x_{ii'} + K. \end{aligned}$$
(6)

2.1.1 Interpretation of Functions \(\phi (.)\) and \(\bar{\phi }(.)\)

At this point, we can give the intuition behind functions \(\phi (.)\) and \(\bar{\phi }(.)\). From expression (6) we deduce the importance of the property of balance for linear criteria. If the criterion is a function to maximize, the presence and/or absence of the terms \(\phi _{ii'}\) and \(\bar{\phi }_{ii'}\) has the following impact on the optimal solution:

  • If \(\bar{\phi }_{ii'}=0 \, \forall i,i'\) the solution that maximizes F(X) is the partition where all nodes are clustered together in a single cluster, so \(\kappa =1\) and \(x_{ii'}=1 \quad \forall (i,i')\) and \(F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}\phi _{ii'}\).

  • If \(\phi _{ii'}=0 \, \forall i,i'\) then the optimal solution that maximizes F(X) is the partition where all nodes are separated, so \(\kappa = N\) and \(x_{ii'}=0\, \forall \, i\ne i'\) and \(x_{ii}=1 \, \forall i\) therefore \(F(X)=\sum _{i=1}^{N}\sum _{i'=1}^{N}\bar{\phi }_{ii}\).

In other words, the optimization of a linear criterion who does not verify the property of balance will either cluster all the nodes in a single cluster or isolate each node in its own cluster, therefore forcing the user to fix the number of clusters in advance.

We can deduce from the previous paragraphs that the values taken by the functions \(\phi \) and \(\bar{\phi }\) create a sort of balance between the fact of generating as many clusters as possible, \(\kappa = N\), and the fact generating only one cluster, \(\kappa = 1\).

In the following we will call the quantity \( \sum _{i=1}^{N}\sum _{i'=1}^{N}\phi (a_{ii'})x_{ii'}\) the term of positive agreements and the quantity \(\sum _{i=1}^{N}\sum _{i'=1}^{N} \bar{\phi }(a_ {ii'})\bar{x}_{ii'}\) the term of negative agreements.

2.2 Different Levels of Balance

We define two levels of balance for all linear balanced criterion:

Definition 2

(Property of local balance). A balanced linear criterion whose functions \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) depend only upon the pair \((i,i')\) (therefore not depending on global properties of the graph) has the property of local balance.

Some remarks about Definition 2:

  • When we talk about global properties we refer to the total number of nodes, the total number of edges or other properties describing the global structure of the graph.

  • For the particular case of local balance where \(\displaystyle {\phi _{ii'}+\bar{\phi }_{ii'}=K}\) (that is \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) sum up to a constant), we can conclude that whereas \(\displaystyle {\phi _{ii'}}\) increases \(\displaystyle {\bar{\phi }_{ii'}}\) decreases and vice versa.

Let us consider the special case where \(\phi (a_{ii'})=a_{ii'}\), the general term of the adjacency matrix. A null model is a graph with the same total number of edges and nodes and where the edges are randomly distributed. Let us denote the general term of the adjacency matrix of this random graph \(\bar{\phi }(a_{ii'})\). A criterion based on a null model considers that a random graph does not have community structure. The goal of such a criterion is to maximize the deviation between the real graph, represented by \(\phi (a_{ii'})\) and the null model version of this graph, represented by \(\bar{\phi }(a_{ii'})\) as shown in Eq. (6). Since the original graph and the null model have the same number of edges M, we have \(\sum _{i=1}^N\sum _{i'=1}^N\phi _{ii'}= \sum _{i=1}^N\sum _{i'=1}^N\bar{\phi }_{ii'}=2M\). If this constraint causes \(\bar{\phi }_{ii'}\) to depend upon the total number of edges M, then a criterion based on a null model does not verify the property of local balance. Consequently, it is not scale invariant because it depends on a global characteristic of the graph.

The definition of null model for linear criteria can be generalized as follows:

Definition 3

(Criterion based on a null model). A balanced linear criterion that seeks to maximize the deviation between the real graph and a null model is a criterion based on a null model. In its formulation, the real graph is represented by \(\phi (a_{ii'})\) whereas the null model is represented by \(\bar{\phi }(a_{ii'})\). The functions \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) satisfy the following condition:

$$\sum _{i=1}^N\sum _{i'=1}^N \phi _{ii'}=\sum _{i=1}^N\sum _{i'=1}^N\bar{\phi }_{ii'}$$

such that the functions \(\displaystyle {\phi _{ii'}}\) and \(\displaystyle {\bar{\phi }_{ii'}}\) depend on global properties of the graph.

The global properties of the graph can be, for example, the total number of edges or the total number of nodes.

We can deduce from Definitions 2 and 3 that a linear criterion cannot be locally balanced and based on a null model at the same time.

In the particular case where \(\bar{\phi }\) decreases with the size of the network, it becomes negligible for large graphs. As explained previously, if this term tends towards zero, the optimization of the criterion will tend to group the nodes more easily. For instance, a single edge between two sub-graphs would be interpreted by the criterion as a sign of a strong correlation between the two clusters, and optimizing the criterion would lead to the merge of the two clusters. Such a criterion is said to have a resolution limit.

The resolution limit was introduced by Fortunato and Barthelemy (2006), where the authors studied the resolution limit of the modularity of Newman-Girvan. They demonstrated that modularity optimization may fail to identify modules smaller than a given size which depends on global characteristics of the graph. Even weakly interconnected complete sub-graphs—the best identifiable communities—would be merged by this kind of optimization criteria if the network is sufficiently large. According to Kumpula et al. (2007) the resolution limit is present in any modularization criterion based on global optimization of intra-cluster edges and extra-community links and on a comparison to any null model.

In Sect. 4, we will show how criteria having a resolution limit fail to detect certain groups of densely connected nodes.

3 Modularization Criteria in Relational Notation

Graph clustering criteria depend strongly on the meaning given to the notion of community. In this section, we describe six linear modularization criteria and their relational coding in Table 1. We assume that the graphs we want to modularize are scale-free, that means that their degree distribution follows a power law.

Table 1 Relational notation of linear modularity functions
  1. 1.

    The Zahn-Condorcet criterion (1785, 1964): C.T. Zahn was the first who studied the problem of finding an equivalence relation \(\mathbf X \), which best approximates a given symmetric relation \(\mathbf A \) in the sense of minimizing the distance of the symmetric difference (Zahn 1964). The criterion defined by Zahn corresponds to the dual Condorcet’s criterion (Condorcet 1785) introduced in Relational Consensus whose relational coding was given by Marcotorchino and Michaud (1979). This criterion requires that every node in each cluster be connected with at least as half as the total nodes inside the cluster. Consequently, for each cluster the fraction of within cluster edges is at least 50 % (see Conde-Céspedes (2013)) and Appendix for proof).

  2. 2.

    The Owsiński-Zadrożny criterion (1986) (see Owsiński and Zadrożny (1986)) it is a generalization of Condorcet’s function. It has a parameter \(\alpha \), which allows, according to the context, to define the minimal percentage of required within-cluster edges: \(\alpha \). For \(\alpha =0.5\) this criterion is equivalent to Condorcet’s criterion. The parameter \(\alpha \) defines the balance between the positive agreements term and the negative agreements term. For each cluster the density of edges is at least \(\alpha \,\%\) (see Conde-Céspedes(2013)).

  3. 3.

    The Newman-Girvan criterion (2004) (see Newman and Girvan (2004)): It is the best known modularization criterion, called sometimes simply modularity. It relies upon a null model. Its definition involves a comparison of the number of within-cluster edges in the real network and the expected number of such edges in a random graph where edges are distributed following the independence structure (a network without regard to community structure). In fact, the modularity measures the deviation to independence.

    As mention in the previous section, this criterion, based on a null model and it has a resolution limit (see Fortunato and Barthelemy (2006)). In fact, as the network becomes larger \(M\longrightarrow \infty \), the term \(\displaystyle \bar{\phi }_{ii'}=\frac{a_{i.}a_{.i'}}{2M}\) tends to zero since the degree distribution follows a power law.

  4. 4.

    The Deviation to Uniformity (2013) This criterion maximizes the deviation to the uniformity structure, it was proposed in Conde-Céspedes (2013). It compares the number of within-cluster edges in the real graph and the expected number of such edges in a random graph (the null model) where edges are uniformly distributed, thus all the nodes have the same degree equal to the average degree of the graph. This criterion is based on a null model and it has a resolution limit. indeed \(\delta \longrightarrow 0\) as \(N \longrightarrow \infty \).

  5. 5.

    The Deviation to Indetermination (2013) Analogously to Newman-Girvan function, this criterion compares the number of within-cluster edges in the real network and the expected number of such edges in a random graph where edges are distributed following the indetermination structure Footnote 2 (a graph without regard to community structure) (Marcotorchino and Conde-Céspedes 2013; Marcotorchino 2013). The Deviation to Indetermination is based on a null model, therefore it has a resolution limit.

  6. 6.

    The Balanced modularity Footnote 3 (2013) This criterion, introduced in Conde-Céspedes and Marcotorchino (2013), was constructed by adding to the Newman-Girvan modularity a term taking into account the absence of edges \(\bar{\mathbf{A }}\). Whereas Newman-Girvan modularity compares the actual value of \(a_{ii'}\) to its equivalent in the case of a random graph \(\displaystyle \frac{a_{i.}a_{.i'}}{2M}\), the new term compares the value of \(\bar{a}_{ii'}\) to its version in case of a random graph \(\displaystyle \frac{(N-a_{i.})(N-a_{.i'})}{N^2-2M}\). It is based on a null model and it has a resolution limit.

The six linear criteria of Table 1 verify the property of balance, so it is not necessary to set in advance the number of clusters. Table 2 specifically focuses on the fonctions \(\phi _{ii'}\) and \(\bar{\phi }_{ii'}\) for each criterion.

Table 2 Balance property for linear criteria

From Tables 1 and 2 one can easily deduce that two criteria: Zahn-Condorcet and Owsiński-Zadrożny verify the property of local balance. Furthermore, Table 2 clearly shows that the functions \(\phi _{ii'}\) and \(\bar{\phi }_{ii'}\) add up to a constant \(K_{ii'}\) for these two criteria. The quantity \(\bar{\phi }_{ii'}\) decreases with the size of the graph for all criteria that have a resolution limit.

4 The Impact of Merging Two Clusters

We modularized five real networks of different sizes: Jazz (Gleiser and Danon 2003), Internet (Hoerdt and Magoni 2003), Web nd.edu (Albert et al. 1999), Amazon (Yang and Leskovec 2012)Footnote 4 and Youtube (Mislove et al. 2007). We ran a generic version of Louvain Algorithm (see Campigotto et al. (2014) and Blondel et al. (2008)) until achievement of a stable value of each criterion. The number of clusters obtained for each network is shown in Table 3.

Table 3 Ref: Zahn-Condorcet (ZC), Owsiński-Zadrożny (OZ), Deviation to Uniformity (UNIF), Newman-Girvan (NG), Deviation to Indetermination (DI) and Balanced Modularity (BM)

Table 3 shows that the Zahn-Condorcet and Owsiński-Zadrożny criteria generate many more clusters than the other criteria having a resolution limit, for which the number of clusters is rather comparable. Moreover, this difference increases with the network size. Notice that the number of clusters for the Owsiński-Zadrożny criterion decreases with \(\alpha \), that is the minimal required fraction of within-cluster edges, so the criterion becomes more flexible.

In order to explain these differences we measure the impact of merging two clusters on the value of each criterion. Let us suppose we want to merge two clusters \(\mathscr {C}_1\) and \(\mathscr {C}_2\) in the network of sizes \(n_{1}\) and \(n_2\) respectively. Let us suppose as well they are connected by l edges as shown in Fig. 1.

Fig. 1
figure 1

Two sub graphs of the entire network we want to merge

Let us denote \(C_F\) the contribution of merging two clusters to the value of a criterion F. The contribution \(C_F\) can be easily calculated from (6) (for the proof see Conde-Céspedes (2013)):

$$\begin{aligned} C_F=\sum _{i\in \mathscr {C}_1}^{n_1} \sum _{i'\in \mathscr {C}_2}^{n_2} (\phi _{ii'}-\bar{\phi }_{ii'}) \end{aligned}$$
(7)
  • If \(C>0\) the merger of the two clusters increases the value of the criterion.

  • If \(C<0\) the merger of the two clusters decreases the value of the criterion.

Equation (7) shows that the decision of merging or not the two clusters depends on a comparison between the quantity \(\displaystyle {\sum _{i\in \mathscr {C}_1}^{n_1} \sum _{i'\in \mathscr {C}_2}^{n_2}\phi _{ii'}}\) and the quantity \(\displaystyle {\sum _{i\in \mathscr {C}_1}^{n_1} \sum _{i'\in \mathscr {C}_2}^{n_2} \bar{\phi }_{ii'}}\). Giving the fact that both are positive, it is the one with the highest value that decides to merge or not to merge. Thus, whereas the first one is for fusion the second one is against the fusion.

Table 4 shows the explicit expression of the contribution for the linear criteria described below.Footnote 5

Table 4 Contribution of merging two clusters for linear criteria

where \(\displaystyle {d_{av}=\frac{\sum _{i\in V}^{N}a_{i.}}{N}}\) is the average degree of the whole graph, \(\displaystyle {d_{av}^1=\frac{\sum _{i\in \mathscr {C}_1}^{n_1}a_{i.}}{n_1}}\) and \(\displaystyle {d_{av}^2=\frac{\sum _{i'\in \mathscr {C}_2}^{n_2}a_{.i'}}{n_2}}\) are the average degrees of \(\mathscr {C}_1\) and \(\mathscr {C}_2\) respectively.

We can remark from Table 4 that for the five criteria the contribution compares “the number of edges between \(\mathscr {C}_1\) and \(\mathscr {C}_2\): l” to the quantity in bold. We can see as well that the contribution for locally balanced criteria depends only upon local properties: \(l, \bar{l}, n_1, n_2\). In fact, locally balanced criteria are scale invariant. In contrast, for the other criteria having a resolution limit the contribution depends and is decreasing on the global size of the network. We remark as well that for three criteria: Newman-Girvan, Deviation to Indetermination and Balanced Modularity the contribution depends on the degree distribution of the two clusters. According to Barabasi and Albert (1999) many real networks fall into the class of scale-free networks, meaning that their degree distribution follows a power-law. In a scale-free network a few nodes called hubs have many connexions whereas most nodes have few connexions.

4.1 Impact on the Optimal Number of Clusters

From the previous results we can deduce the main characteristics of the optimal partition found by the optimization of each criterion (see Table 5). In addition, we remark the following facts:

Table 5 Summary by criterion
  • The Zahn-Condorcet criterion: According to Table 4 for merging the two clusters \(\mathscr {C}_1\) and \(\mathscr {C}_2\), these ones must be connected by at least as many edges as the half of the maximum possible number of edges,Footnote 6 that is \(\displaystyle {l>\frac{n_1n_2}{2}}\).

  • The Owsiński-Zadrożny criterion: For merging the two clusters \(\mathscr {C}_1\) and \(\mathscr {C}_2\), these ones must be connected by at least as \(\alpha \,\%\) as the maximum possible number of edges.

  • The Deviation to Uniformity: According to Table 4 for the merge to take place the fraction of edges between \(\mathscr {C}_1\) and \(\mathscr {C}_2\) must be at least equal to the global density of the whole graph.

  • Newman-Girvan criterion: From Table 4 we can deduce that the optimal partition does not have clusters with a single node (this result was already demonstrated in Brandes et al. (2008)). In fact, if \(\mathscr {C}_1\) has only one node with only one connection to \(\mathscr {C}_2\), thus \(n_1=1\), \(d_{av}^1=1\), \(l=1\) and consequently the contribution is always positive: \(\displaystyle {C_{NG}=\left( 1-\frac{\sum _{i=1}^{n_2}a_{i.}}{2M}\right) >0}\).

  • Balanced Modularity: It is easy to understand the behavour of the contribution of Balanced Modularity when we compare it to those of Newman-Girvan and Deviation to Indetermination (see Conde-Céspedes (2013) for proof).Footnote 7 Indeed, we demostrated in Conde-Céspedes (2013) that:

    $$\begin{aligned} C_{BM}=2C_{NG}+ n_1n_2\frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{2M(1-\delta )} \end{aligned}$$
    (8)

    and

    $$\begin{aligned} C_{BM}=2C_{DI}+ n_1n_2\left( 2-\frac{1}{\delta }\right) \frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{N^2(1-\delta )}. \end{aligned}$$
    (9)

    Although the contribution for the Balanced Modularity is increasing in both the contribution of Newman Girvan \(C_{NG}\) and in the contribution of Deviation to Indetermination \(C_{DI}\), in both cases \(C_{BM}\) has an additional term that we can treat as regulator: \(\left( n_1n_2\frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{2M(1-\delta )}\right) \) and \(\left( n_1n_2\left( 2-\frac{1}{\delta }\right) \frac{(d_{av}^1-d_{av})(d_{av}^2-d_{av})}{N^2(1-\delta )}\right) \) respectively. These two regulators have opposite sign for real networks. In fact, the coefficient \(\left( 2-\frac{1}{\delta }\right) \) of the second regulator is almost surely negative for real graphs because the density \(\delta<< 0.5\) for scale-free networks. That is why the Balanced Modularity behaves as a regulator between both criteria: Newman-Girvan and Balanced Modularity. However, when the network size increases \(N \longrightarrow \infty \) and \(M \longrightarrow \infty \) the regulator terms tend to zero.

Only ground-truth overlapping communities are defined on real networks in Table 3. This fact makes difficult to judge the quality of the obtained partitions because we can not directly compare a partition to overlapping communities. That is why in the next section we will consider artificial networks with a predefined community structure.

5 Experiments with Artificial Networks

In order to judge the quality of the partitions obtained by each criterion we generated benchmark LFR graphsFootnote 8 (see Lancichinetti et al. (2008)) of different sizes 1000, 5000, 10000, 50000, 100000 and 500000. The input parameters are the same as those considered in Lancichinetti and Fortunato (2009). The average degree is 20, the maximum degree 50, the exponent of the degree distribution is –2 and that of the community size distribution is –1. In order to test the existence of resolution limit we chose small communities sizes, ranging from 10 to 50 nodes, and low values of mixing parameter, 0.10, 0.20 et 0.30. Figure 2 shows the average number of clusters for 100 runs of the generic Louvain algorithm.

Fig. 2
figure 2

Average number of cluster for artificial LFR graphs (logarithmic scale). The curve of the real number of clusters (in black) it is almost overlapped with that of OZ1 and OZ2

In Fig. 2 it is hard to see the curve of the real number of clusters (in black) beacuse it is almost overlapped with those of OZ1 and OZ2.

Figure 2 shows clearly the difference between the behavior of those criteria having a resolution limit (NG, DU, DI and BM) and the behavior of criteria locally defined (ZC and OZ). As the size of the network increases the four criteria suffering from resolution-limit detect fewer clusters than those predefined. The number of clusters is rather comparable for these four functions, one reason can be the fact that the term of negative agreements tends to zero when the network gets bigger. Conversely, the number of clusters of criteria locally defined increases nearly at the same rate as the real number of clusters. Whereas OZ with high \(\alpha \) identifies more clusters than those predefined, the criterion which best approaches the real number of clusters is OZ with low values of \(\alpha =0.2\) and \(\alpha =0.1\).

Fig. 3
figure 3

The average Normalized Mutual Information (NMI) on the graphs in Fig. 2 (logarithmic scale)

Figure 3 shows the Normalized Mutual Information Footnote 9 (NMI) for the partitions in Fig. 2.

Fig. 4
figure 4

The average Normalized Mutual Information (NMI) according to mixing parameter for networks of 6 different sizes: 1000, 5000, 10000, 50000, 100000 and 500000

Figure 3 shows that the average NMI decreases with the network size for criteria having a resolution limit. Moreover, they almost overlap. Conversely, the NMI of the criteria locally defined seem to increase with the network size. The criterion with the highest NMI is OZ with low values of \(\alpha \), 0.1 and 0.2.

Figure 4 shows the average Normalized Mutual Information for the mixing parameter ranging from 0.1 to 0.8 for different network sizes.

Figure 4 shows that for all the criteria previously presented the NMI decreases as the mixing parameter increases. This figure demonstrates once more the differences between the behavior of criteria with resolution limit and that of the criteria locally defined. For the first ones the quality decreases abruptly beyond mixing parameter equal to 0.6. For the second ones, the quality seems to decrease at a lower rate. However, it is important to remark that the quality of criteria with a resolution limit decreases not only with the mixing parameter but also with the network size. Converserly, the behavior of the NMI of locally defined criteria seem to have the same behavour independtly of the size of the whole network.

Another point to remark is that even when the mixing parameter is high all the criteria find a community structure. In fact, the pre-defined communities in the LFR graphs are based on mixing parameter, whereas all the criteria analysed in this article have their own definition of graph with no community structure which is not based on the mixing parameter.

Table 5 presents a summary of the results found by the previous analysis.

6 Conclusions

We have presented six linear modularization criteria in relational notation, Zahn-Condorcet, Owsiński-Zadrożny, the Newman-Girvan modularity, the Deviation to Uniformity index, the Deviation to Indetermination index and the Balanced- Modularity. This notation allowed us to easily identify the criteria suffering from a resolution limit. We found that the first two criteria had a local definition, whereas the others, based on a null model, had a resolution limit. These findings were confirmed by modularizing real and artificial graphs using a generic version of the Louvain algorithm. We compared the number of clusters found by the six criteria and the Normalized Mutual information for artificial graphs. The results showed that criteria based on a local definition had a better performance than those based on a null model when the size of the graph increases, experimentally the crition having the best behavior was Owsiński-Zadrożny with low values of parameter \(\alpha \). However, it is important to remark that these results are based on a particular kind of graphs, more precisely, graphs with a low mixing parameter, small communities,Footnote 10 node degrees and community sizes distributed according to a power law.