1 Introduction

Most of the real-world systems can be modeled as graphs where different fields of study use extensively the node-link representation to represent information. Wide use of this representation has been witnessed in social networks (Wasserman and Faust 1994). A social network can be defined as a set of people, or groups of people interacting with each other (Scott 2000; Wasserman and Faust 1994). These interactions can be classified into several types like friendship or business relationship. Although many examples have been studied for social networks, two classic examples that have attracted extensive attention in the computer science community and the social network community at large, are the Actor collaboration network from Internet Movie database (IMDB) and the Science collaboration network.

Social network modeling and analysis allows us to understand the different types of relationships that can either facilitate or impede knowledge creation and transfer in a society on the whole, in an organization in particular, and in individuals, providing an insight into the underlying patterns and the social structures present in these networks (Scott 2011; Cross et al. 2000).

The study of networks in general, and of social networks in particular, was revived by the pioneering work of Watts and Strogatz (1998) on the properties of small-world networks. Equally important was the work from Barabasi and Albert on the growth of networks and the property of scale-free degree distribution (Barabási and Albert 1999). Although random graphs had been studied extensively in the past (Newman 2003), most of the real-world networks have the properties of small-world and scale-free networks.

A small-world network, as defined by Watts and Strogatz (1998), is a network when compared with a random graph of same node–edge density, has higher clustering coefficient and the typical distance between any two nodes scales as the logarithm of the number of nodes. The two structural properties used to define a small-world network are the average path length and the clustering coefficient. The most popular manifestation of the concept of low average path length is the ‘six degrees of separation’, uncovered by the social psychologist Stanley Milgram, who concluded that there was a path of acquaintances with a typical length of about 6 between most pairs of people in the USA (Milgram 1967). More precisely, the path length refers to the minimum number of edges traversed to go from node A to node B. The average path length is the average calculated for all pair of nodes in a network. Another important characteristic of these networks is the average clustering coefficient of nodes (Watts and Strogatz 1998), sometimes referred as Transitivity (Newman 2003) to avoid confusion from the concept of community structure (or clusters) (Scott 2000; Wasserman and Faust 1994). The concept is very well known in social networks and can be described as the friend of your friend is likely to be your friend. Mathematically, for a graph G with nodes V and edges E, the clustering coefficient (CC) for a node v is defined as:

$$ {\rm CC}(v) = \frac{r(N(v))}{|N(v)|(|N(v)|-1)/2} $$

where \(u \in V\) and \((u,v) \in E. \) The neighborhood of a node v is defined as the set of nodes in the neighborhood of v denoted by N(v). The number of elements in set N(v) is given by |N(v)|. And the notation r(N(v)) represents the number of edges (uw) such that \(u,w \in N(v)\) and u ≠ w. To calculate the clustering coefficient of the entire network, we take the average for all nodes in the network.

A scale-free network is a network in which a few nodes have a very high number of connections (degree) and lots of nodes are connected to a few nodes. Generally, it was believed that the degree distribution in most networks follows a poisson distribution but in reality, real-world networks have a highly skewed degree distribution. These networks have no characteristic scales for the degrees, hence they are called scale-free networks (Päivinen 2007). In other words, the degree distribution of scale-free networks follow a power-law distribution (Barabási and Albert 1999).

Apart from the small-world and scale-free properties, another important characteristic of real-world systems and specially social networks is the presence of community structures. A more generic formalism for the term community is clusters, where sociologists use the term community (Coleman 1964) as compared to the statistical and data mining domain where people use the term clusters (Tryon 1939) to refer to the same concept. Roughly speaking, we like to define a community as a decomposition of a set of entities into ‘Natural Groups’. There is no universally accepted definition of clustering (Everitt et al. 2009), most researchers describe a cluster by considering the internal homogeneity and the external separation as the fundamental criteria for defining a cluster (Gordon 1981; Almeida et al. 2012; Jain et al. 1999). A number of algorithms are present in the literature to study clusters and clustering problem in social networks (Newman 2004; Jia et al. 2011; Gilbert et al. 2011).

Different network generation models have been proposed to generate artificial networks having both the small-world and scale-free properties. These models do not generate graphs with community structures by construction as the probability of connection is based solely on the degree of a node and, in some cases, the immediate neighborhood of this node.

In this study, we present a new network generation model which incorporates the properties of small-world and scale-free networks with the additional advantage of having distinct community structures. We explicitly target social networks and argue that using three fundamental concepts from the social network study, we can generate artificial networks replicating real-world social networks. Clustering or community detection remains an important technique to organize and understand complex systems (Girvan and Newman 2002; Jain et al. 1999; Schaeffer 2007; Xu and Wunsch 2005) having a wide range of applications in various fields. For empirical evaluation of algorithms, metrics and analytical methods, it is important to be able to reproduce networks having community structures with small-world and scale-free properties that are close to real-world networks. The proposed model, by construction, incorporates the presence of community structures as we determine the connectivity of nodes based on a pre-generated clustering. We explain the details in the coming sections.

For the sake of discussion and explanatory purposes, throughout this paper, we are going to discuss four social networks. Two well-studied and well-structured social networks are the Actor collaboration network where nodes represent actors and two actors are connected to each other if they appear in a movie together. The other network is the Science collaboration network where nodes represent scientists and two scientists are connected to each other if they have written an artifact together. Apart from these two networks, we consider two hypothetical cases from everyday life. Consider a person joining a new organization as an employee and a person joining a sports club as a leisure activity. We will refer them as Actor, Author, Employee and Club networks, respectively, throughout this paper.

The rest of the paper is organized as follows: the next section contains a review of the existing network generation models for small-world and scale-free networks. In Sect. 3, we discuss the metric clustering coefficient and compare it with the presence of community structures as being two separate concepts. We then discuss assortativity, transitivity and preferential attachment in social networks in Sect. 4 and argue that with a little modification to these concepts, we can understand how networks having community structures evolve in the real world. We then present a network generation model in Sect. 5. We introduce three networks in Sect. 6 that are used for experimentation and comparative analysis with the artificial network generation models. In Sect. 7, we show that the existing network generation models not only produce graphs without community structures but also have other limitations. Finally, we conclude giving possible future directions of our research in Sect. 8.

2 Existing network generation models

In this section, we review a number of network generation models proposed in the literature having small-world and scale-free properties. A comparative summary of these models is presented in Table 1.

Table 1 Comparing and summarizing different artificial network generation models for small-world and scale-free networks existing in the literature

Holme and Kim (2002) modified the well-known Barabasi and Albert model (Barabási and Albert 1999) to obtain graphs that are small world as well as scale free. The idea is pretty simple and effective. A triad formation step is added after the preferential attachment step where every node, introduced in the network, connects not only to node w but also to a randomly chosen neighbor of w, thus resulting in a triad formation. The idea is similar to another model separately proposed by Dorogovtsev and Mendes (2002) in the same year where every new node added to the network is connected to both ends of a randomly chosen link where one of the nodes of this link is selected through preferential attachment. These models inspired Jian-Guo et al. to introduce another similar model (Liu et al. (2005). The network starts with a triangle and at each time step, a new node is added to the network with two edges. The first edge would choose a node to connect preferentially, and the second edge will choose a node connected to the first node, again based on preferential attachment. This is different from the model of Holme and Kim where the second node is randomly chosen. Wang et al. (2006) proposed a similar model to that of Dorogovtsev et al. where at each time step, a new node with two edges is added to the network and the two edges are connected to the two ends of a randomly chosen existing edge.

Fu and Liao (2006) proposed another extension to the Barabasi and Albert model which they called the Relatively Preferential Attachment method. At each time step, the newly introduced node in the network connects to a node w with preferential attachment, the nodes in the immediate neighborhood of w have higher probability of connecting to this new node as compared to other nodes. The only difference in this model with the already proposed models is that the new node can have m edges instead of two edges where the value of m is chosen as an initial parameter which remains constant throughout the execution of algorithm.

Klemm and Eguiluz (2002) also proposed a model, where each node of the network is assigned a state variable. A newly generated node is in the active state and keeps attaching links until eventually deactivated. At each time step, a new node is added to the network by attaching a link to each of the z active nodes. The new node is set as active. One of the existing nodes is deactivated where the probability of a node being deactivated is inversely proportional to its degree, i.e., lower the degree, higher the probability of deactivation. To reduce the average path length of the entire graph, at every step, for each link of the newly added node, it is decided randomly whether the link connects to the active node or it connects to a random node.

Catanzaro et al. (2004) present a model taking into consideration the assortativity of social networks. Assortativity is the tendency of nodes to preferentially connect to nodes that are similar to them. This similarity in general can consider any attribute, but in case of social networks, it is referred to as the connectivity or the node degree of the nodes. At every step, a new node is added to the network based on preferential attachment and a new edge is added between two existing nodes. These existing nodes are chosen on the basis of their degree, thus forcing links between similar degree nodes. The model is innovative as it allows addition of new links between old nodes.

Another interesting model was proposed by Guillaume and Latapy (2005). They identify bipartite graph structure as a fundamental model of complex networks by giving real world examples. The two disjoint sets of a bipartite graph are called bottom and top. At each step, a new top node is added and its degree d is sampled from a prescribed distribution. For each of the d edges of the new vertex, either a new bottom vertex is added or one is picked among the pre-existing ones using preferential attachment. A more generalized model based on similar principles was proposed by Bu et al. where instead of using the bipartite structure, a network can contain t disjoint sets (instead of just two sets, as is the case of the bipartite graph), where the example of sexual web (Lilijeros et al. 2001) was considered as a model. A sexual web is a network where nodes represent men and women having relationships to opposite sex, and similar nodes do not interact with each other. At each time step, a new node and m new edges are added to the network with the sum of the probabilities equal to 1. The preferential attachment rule is followed as the new node links with the existing nodes with a probability proportional to the degree of the nodes.

Wang and Rong (2008) proposed a slightly different model, which is still a modified form of the preferential attachment model. Instead of adding one node at a time, the model proposes to add n nodes at each time step which are connected in a ring formation. Any two nodes in the n new nodes are connected to the existing network where these connections are determined through preferential attachment.

Generation models for clustered graphs exist in the literature such as the work of Condon and Karp (1999) and Virtanen (2003), where the idea is to generate graphs that are already clustered as opposed to random graph models of Rapoport (1957) and Erdos and Renyi (1960). A recent work by Zaidi (2012) addresses the issue of generating clustered small-world networks which are not scale free and the clusters are randomly connected to each other. These generation models as such do not produce graphs with small-world and scale-free properties, which are fundamental to most real-world networks. Thus, the study and comparison of these other models remain out of the scope of the paper.

Comparing the different network generation models (See Table 1), first five models are quite similar to each other, as they try to force the triad formation step, one way or the other. Another common aspect in the first five models is that, in every step, only one node and two edges are added to the network. The only other taxonomical grouping possible is the last two models where the bipartite and n-partite structures are used as the fundamental property of real-world networks. The model of Wang and Rong is slightly different as it allows the addition of m new nodes at every time step. The ideas of Klemm and Eguiluz and Catanzaro et al. are quite original, and provide another way to look at the evolution and structure of complex networks.

3 Clustering coefficient and community structures

Largely due to the terminology used to define the metric clustering coefficient, often it is misunderstood that a network having high clustering coefficient suggests the presence of clusters or community structures in a network. Clustering coefficient, by definition, determines the local cohesiveness of a set of nodes, i.e., it focuses on the immediate neighborhood of nodes but fails to capture the presence of communities on the whole as argued by different researchers (Brandes and Erlebach 2005; Girvan and Newman 2002; Zaidi and Melanço 2010). High clustering coefficient only indicates the presence of a large number of triads, i.e., three nodes connected to each other through three edges. This property is often present in social networks where it refers to the phenomena that if you know two people, there is a high probability that the two people know each other as well. This metric also measures the presence of cliques as they are a composition of triads but it cannot be used to identify the presence of densely or sparsely connected nodes in a network as we explain below using examples.

Figure 1 is an example graph that depicts the differences between clustering coefficient and community structure. Figure 1a clearly has four communities with high connectivity between nodes of the same community, and Fig. 1b has several nodes sharing common neighbors, but visually no distinct groups. Both these graphs have the same number of nodes and edges where the clustering coefficient for graph a is 0.70 and b is 0.69. No information about the presence of four communities can be deduced from the clustering coefficient of graph a.

Fig. 1
figure 1

Two graphs with the same number of nodes and edges. a Four groups of nodes well connected within and sparsely connected with other groups. b Nodes sharing neighbors in the form of triads. Clustering coefficient for graph a is 0.70 and b is 0.69. High values for clustering coefficient do not necessarily imply the presence of community structures in a network as shown in graph (b)

Another interesting example is shown in Fig. 2 a and b as clusters obtained from some graph. Both contain the same number of nodes and edges. Thus, the density (ratio of number of edges and number of nodes) of the two clusters is exactly the same. Cluster (a) is constructed using Quads instead of triads, where a quad is a set of four nodes connected through four edges in a ring. We then compare this cluster with a cluster constructed with triads. Both these clusters are shown in in Fig. 2a and b. The clustering coefficient of cluster (a) is 0.0 representing the absence of triads and as compared to cluster (b) with a value of 0.69. This is no surprise as clustering coefficient, by definition, measures the quantity of triads in a graph. Another important metric used to classify a network as small world is the average path length. Calculating the average path lengths of the two clusters, (a) has a lower value with 2.3 as compared to (b) with a value of 2.6 showing that cluster (a) is more compact and on average, the nodes lie closer to each other as compared to cluster (b) and thus is a better cluster even though its clustering coefficient is 0.

Fig. 2
figure 2

Consider two clusters of some graph with same number of nodes and edges and thus having the same density in terms of number of nodes and number of edges. a Nodes well connected to each other forming quads, b nodes sharing neighbors to form triads. Clustering coefficient for cluster a is 0.0 and b is 0.69 representing the absence of triads in cluster (a). The average path length of a is 2.3 and of b is 2.6 showing that cluster (a) is more compact and on average, the nodes are much closer to each other than cluster (b). This example shows that a cluster can exist even with a low value of clustering coefficient as shown in a

From the above two examples, we can conclude that a graph having high clustering coefficient does not necessarily suggests the presence of distinct group of nodes tightly connected to each other and loosely connected within themselves. Moreover a cluster can be a good cluster even if its nodes have a low average clustering coefficient.

4 Assortativity and triads in social networks

In this section, we present three fundamental concepts associated with the theory of social networks. First, we briefly introduce these concepts and then argue that combining these concepts, we can produce a network generation model with small-world and scale-free properties having distinct clusters.

The theory of Assortativity or Assortative Mixing refers to the principle that in a network, similar nodes tend to attach to each nodes. This similarity can be based on one or more than one attributes. An important application of this theory is the assortative mixing in social networks where nodes attach to other nodes having similar degree. This differs from biological and technological networks that exhibit disassortative mixing (Newman 2002, 2003). Disassortative mixing refers to the phenomena where dissimilar nodes tend to connect to each other. A good example is the Sexual Web (Lilijeros et al. 2001) where nodes representing men or women connect to nodes with opposite sex.

We move on to another important concept in social networks, the formation of triads introduced by Simmel and Wolff (1950) as a fundamental structure for social networks. In fact, the smallest and most elementary social unit, a dyad is a social group composed of two members while a triad is a social group composed of three members. Groups of larger size are also possible but since a variety of relationships can form in them, they are less stable (Simmel and Wolff 1950) and often less studied in sociology.

Finally, the principle of Preferential Attachment introduced by Barabasi and Albert (1999) has an ingredient for growing network model with power-law degree distribution. Some times, referred to as the ‘Rich gets richer’, the idea is that in real-world networks, nodes having high degree have a high probability of attracting more connections as compared to nodes with low connectivity. In terms of social networks, this means that a famous person is likely to become more famous as compared to a person who is not well known in the social community. The idea is the direct implication of the human trait of extraversion–introversion (Jung 1921). Extroverts, who are open to meeting new people and developing new relationships are expected to have high degree of connectivity in a social network as compared to Introverts, who tend to be more reserved, less outgoing, and less sociable.

Extending the principle of assortativity, we argue that in theory, since nodes tend to connect to similar nodes, it is not always the case that the similarity is based on node degree. We consider examples from two of the most studied social networks in computer science, the actor network and the authorship network (Watts and Strogatz 1998; Barabási and Albert 1999; Freeman 2004; Wasserman and Faust 1994). If the degree was the only criteria for associating to other nodes, in case of an actor network, actors beginning their career will never get to play a role in a movie with well-known actors. This is contrary to the reality as often, new actors are given supporting roles along with well-known actors and thus the similarity is based on some other criteria. Similarly, for the authorship network, if degree was the only criteria, an experienced professor will never take a doctorate student who has only a few publications under his supervision which is generally not the case.

Our deduction from the above two examples is that the nodes tend to associate to other nodes having similarity based on the context and not the node degree. For example, an actor starting his career as a comedian has a higher probability to act in a comedy film and thus act with a well-known actor in the domain of comedy films. In this case, the similarity is based on the domain of the two actors. Similarly, for the case of authorship network, a student having done a masters in a particular domain such as computer networks, has a high probability of collaborating with a well-know researcher of the same domain, probably as a doctorate student. Again, the connectivity preference is due to the domain or subject of research. Generalizing from this concept, for other social networks in real world, consider the example of a person joining a new organization as employee. He has a high probability of interaction with his fellow employees, people working on the same project or sharing the same office. Another example is that of a person joining a sports club. He has a high probability of interaction with people sharing the same sports activity like Tennis.

Returning to the formation of triads in social networks, our perspective is that usually when a person enters a new social network, it is not just the triads that are formed but groups of larger size, or cliques are formed. From the actor example, it is quite clear that a new actor will probably act with a well-known actor, but the social interaction will take place within the entire cast of the movie. This interaction will be represented with a clique where all the nodes representing the actors will be connected to each other. The authorship network is no different as people co-authoring an artifact will form a clique. Similarly, in real world, usually groups of larger size are formed. Continuing with the two examples, a new employee will interact with not just only one or two more people, but with different colleagues in the same organization which work together on the same project or with whom he shares an office, and for a person joining a sports club, he will interact with people sharing similar activities instead of just one or two others.

Addressing the principle of Preferential Attachment, we argue that for every node in a group (or clique), few nodes have a higher number of connectivity with other nodes. For example, in a group representing the actors playing in the same movie, the famous actors will have many connections with others as they would have played a role in many movies. Similarly, in the authorship network, an experienced researcher would have published an artifact with many other researchers and thus would have a high number of connections.

Combining the above principles together, we claim that People in social networks are most likely to interact with similar people where similarity is based on the context and the domain. People form groups of larger size and not just triads, there are few people who have a very high degree of connectivity as compared to others. Based on these ideas, we present a new network generation model in the next section.

5 Proposed network generation model with communities

The basic idea of the proposed algorithm comprises three major steps. Instead of adding one node at a time, we add cliques of various sizes. This results in the network having high clustering coefficient. Next, we associate a possible connectivity attribute drawn from a degree distribution following power law. This insures that the degree distribution of the final network follows a scale-free property. Finally, to obtain community structures where some nodes are densely connected within and sparsely connected to other nodes, we generate a cluster tree which represents the possible communities for this network. Based on the connectivity attribute, and the distances in the cluster tree, nodes within the cliques are merged together creating highly dense groups of nodes well connected within and sparsely connected to nodes distant in the cluster tree.

The proposed algorithm comprises several steps where each of these steps is explained in detail below. The following mathematical notations are used throughout the explanation: G(VE) represents an undirected multigraph where V is a set of n nodes and E is a set of e edges. The graph G is initially empty and the nodes and edges are added as the algorithm progresses. \(\mathcal{C}\) represents a set of cliques such that \(\mathcal{C}=\{C_1,C_2,\ldots,C_k\}\) are different cliques each comprising one or several nodes. \(\mathcal{T}\) represents a tree where its leaves are equal to k (the number cliques in set \(\mathcal{C}\)).

5.1 Step 1: Clique generation

In contrast to existing network generation models, instead of adding one node or triads at a time, to generate the network, we start by adding cliques of variable sizes to G. The algorithm takes as parameter, the number of cliques to be generated (k), the minimum (minSize) and the maximum size (maxSize) of the cliques to be generated. A random number is generated between these two limits and for each random number, a clique C i is added to the graph G such that nodes and edges of the clique become member of V and E, respectively. As a result, G contains nodes that are well connected to each other within a clique, and nodes from different cliques are not connected to each other. G becomes a graph comprising \(\mathcal{C}=\{C_1,C_2,\ldots,C_k\}\) as shown in Fig. 3.

Fig. 3
figure 3

Step 1: A graph G containing only cliques of different sizes. Parameters used for this example are minSize = 1, maxSize = 5 and k = 11

5.2 Step 2: Scale-free degree distribution

To have the degree distribution of G follow a scale-free behavior, we generate a separate scale-free graph G′ with the same number of nodes as in G, using (Barabási and Albert 1999). Next, we assign the degree of a node in G′ as an attribute of a node in G chosen randomly and call this attribute sf Deg. Nodes once processed are not reconsidered for another assignment. The Pseudo Code for the process is given in Algorithm 1. Every time the procedure getNode is called, it picks a unique and randomly selected node from a given graph until all the nodes have been selected. The assignment of values from nodes of G′ to G is random but since there are few nodes with very high node degree, there is a high probability that they are divided among the cliques sparsely. Thus, we end up with one or two nodes in a clique with a high node degree as shown in Fig. 4. This step assures that the final graph G has a scale-free degree distribution.

Fig. 4
figure 4

Step 2: A scale-free degree distribution is imposed as attribute of nodes in graph G containing cliques

5.3 Step 3: Merger of nodes

The next step is the merger of nodes from different cliques to form a single connected network. The merger is a simple step where two nodes to be merged are replaced by a single node and all the edges connected to the two merged nodes are connected to this new node. Figure 5 shows how two nodes from two different cliques are merged forming a connected network of two cliques. The exact details of how to select two nodes and how many nodes are selected for merger are explained in the following sub-steps.

Fig. 5
figure 5

Merging two nodes from two different cliques. Two nodes from different cliques are removed and a new node is added which takes all the connections of the two removed nodes

5.3.1 Step 3.1: Calculation of number of mergers

As a first step, we need to calculate for each node, how many merges will it perform with other nodes. This calculation is based on the attribute sfDeg. The idea is pretty simple, the more a node is merged with others, the more higher its degree will be in the final network. This phenomenon is shown in Fig. 5 where the two merged nodes result in a single high degree node. The number of merges for each node \(n \in G\) is calculated using sf Deg as follows:

$$ Node\_Merges (n)= \left\lfloor \frac{sfDeg(n)}{Avg\_Node\_Degree(G)} \right\rfloor. $$
(1)

We use the following equation to calculate the total number of merges for a clique \(C_j \in \mathcal{C}{:}\)

$$Clique\_Merges(C_j)= \sum_{\forall n \in C_j } Node\_Merges(n).$$
(2)

5.3.2 Step 3.2: Generation of cluster tree

As the main objective is to have distinct clusters in the graph, we generate a random tree \(\mathcal{T}\) with the number of leaves exactly equal to the number of cliques generated in step 1. Each clique \(C_j \in \mathcal{C}\) is assigned to a leaf of the cluster tree \(\mathcal{T}\) as shown in Fig. 6. The tree can have varying depths to generate a hierarchical clustering where one such tree is shown in Fig. 6.

Fig. 6
figure 6

Step 3: A cluster tree \(\mathcal{T}\) is generated and cliques are assigned to its leaves to decide how the nodes (in the cliques) will be connected to each other forming clusters

5.3.3 Step 3.3: Merging nodes of cliques to form clusters

For every clique \(C_j \in \mathcal{C}\) in the cluster tree \(\mathcal{T}, \), we calculate a vector of probabilities \(P_{j_{i}}\), where j represents the clique for which this vector is being calculated and i represents the clique with which the probability of connecting j is calculated. This probability is inversely proportional to the distance between two cliques in \(\mathcal{T}\) and is spread equally over the branches of \(\mathcal{T}\) as shown in Fig. 7. The vector \(P_{j_{i}}\) thus obtained represents the probability of two cliques having their nodes merged.

Fig. 7
figure 7

Calculation of probabilities of merger of the left most leaf containing a clique (encircled) with other leaf nodes based on distances in the cluster tree \(\mathcal{T}\) where the probabilities represent the likeliness of a node in the encircled leaf (containing a clique) to be merged with other nodes in the cliques

For example, to calculate the probability of connection of the encircled node with other nodes in \(\mathcal{T}, \) the probability is uniformly divided among the three branches (1/3 in this example) for each branch leading outward from the encircled node. One of these branches leads to the root of the tree which is again uniformly divided among two of its children as shown in Fig. 7.

Using probability vector \(P_{j_{i}}\) and Clique_Merges(C j ) from Eq. 2, we calculate the exact number of pairwise merges using the equation below:

$$ Pairwise\_Merges(C_j,C_i)= \left\lfloor Clique\_Merges(C_j)* P_{j_i} \right\rfloor \quad \forall \; j,i \in {\mathcal{C}} $$
(3)

Pairwise_Merges(C j C i ) is a directed vector representing the exact number (as integer) of merges between each pair of clique (C j C i ). Based on these integer values, nodes from different cliques are merged to form connections between cliques which result in a fully connected network with the desired properties. This calculation is depicted in Fig. 8 where we show the probability vector for \(C_0, P_{0_{i}}\) and its corresponding Pairwise_Merges(C 0,C b ) \(\forall b \in \mathcal{C}. \)

Fig. 8
figure 8

Calculation of Pairwise_Merges(C 0,C b ) using distances in \(\mathcal{T}\) and probability vector of C 0 to determine the number of merges between C 0 and C b where \(C_b \in \mathcal{C}\) and C 0 ≠ C b

Figure 8 also shows the probability of C 0 divided uniformly among C 1, C 2 and the rest of the cliques in the cluster tree. The close neighbors of C 0 in the tree C 1, C 2 have a very high probability of 0.33 each of merging with C 0. The merger of nodes with close neighbors results in lots of connections being built between the cliques nearer to each other in the tree and thus represents clusters in the final graph. Algorithm 2 contains the pseudo code for the merger of two nodes.

Algorithm 2 uses a procedure merge where two functions are used to select nodes from a clique named Select_One_Node rand (C a ) and Select_One_Node prob (C b ). The implementation of these functions is very simple. The function Select_One_Node rand (C a ) chooses a node n randomly such that \(n \in C_a\) and \(C_a \in \mathcal{C}\) and Node_Merges(n) > 0. Note that the equality in Eq. (2) is always preserved during the execution of algorithm. The function Select_One_Node prob (C b ) uses the sfDeg(n) to calculate a probability which is proportional to the node degree of the node. Thus, nodes having high connectivity have a high probability of being selected as compared to nodes with low connectivity.

5.4 Further explanations and possible variants to the proposed model

In this section, we provide explanations of the different steps of the proposed model and relate these explanations to real-world social networks. This helps understand how characteristics of real-world networks are incorporated in the proposed model. We also discuss possible variations in the different steps that can change the behavior of the network generated. These variations demonstrate the robustness and flexibility of the proposed model as it can be used to generate networks with varying properties.

The first possible variation to the model is in the very first step explained in Sect. 5.1 where we add cliques of different sizes. The size of the cliques can be forced to be exactly 3, in which case we would have forced the presence of only triads just as the other network generation models presented in Sect. 2. Due to the presence of cliques (or triads), the average clustering coefficient of the entire graph increases as compared to a random graph which is a fundamental property to identify a small-world network.

The assignment of values in Sect. 5.2 is easy to comprehend once considered in the context of real world. This assignment represents that, in certain social groups, there are people who have relatively high connectivity with others. Continuing with our two example social networks, a famous actor who plays in many films will have a high number of connections with other actors and similarly, a senior professor will have a high number of connections with other researchers. This value is used in step 3 of the model to determine how different cliques of step 2 are merged together to form a single connected network.

A variation to this step can be the assignment of a normal degree distribution or a uniform degree distribution. The choice results in what the final degree distribution would be for the generated network. This flexibility is quite useful as the model can be used to generate networks with any kind of degree distribution.

In Sect. 5.3.2, we discuss how a cluster tree is used to generate a network with hierarchical clusters. A possible variation is the generation of a flat or partitional clustering. We can generate only a tree with depth 2 where we have all the cliques at the bottom level, merging with other cliques at level 1 to form clusters and the root represents the regrouping of all the clusters as shown in Fig. 9, which contains 4 clusters and 12 cliques.

Fig. 9
figure 9

A cluster tree \(\mathcal{T}\) to generate flat or partitional clustering with the leaf nodes containing the cliques at the bottom level, the clusters at the second level and the root regrouping the clusters at the top level

6 Real-world social networks

For the analytical study of the network generation models, we compare the networks generated by existing models with real-world networks using a number of metrics (see Sect. 7). We consider three social networks, two of which are author networks and the third one is an actor network.

The author network is a network where nodes represent scientists and an edge between them represents a collaboration in terms of co-authoring a scientific artifact like a book or an article. The two datasets are the Network Science dataset and the Geometry dataset. The Network Science data was compiled by Newman (2006) from the bibliographies of two review articles on networks, Newman (2003) and Boccaletti et al. (2006), with a few additional references added by hand. The network contains a single connected component with 379 nodes and 914 edges.

The other author network is the authors collaboration network in computational geometry. It was produced from the BibTeX bibliography obtained from the Computational Geometry Database geombib, Footnote 1 version February 2002. Problems with different names referring to the same person are manually fixed and the data base is made available by Vladimir Batagelj and Andrej Mrvar from the Pajek datasets website. Footnote 2 Only the biggest connected component was considered for experimentation where the reduced simple network contains 3,621 vertices and 9,461 edges.

The Actor network is a network where nodes represent actors and two actors are connected to each other if they have acted in a movie together. The dataset we use here is a subset taken from the IMDB Footnote 3 database of movies made until the year 1999 and used by other researchers such as Auber et al. [3] and Archambault et al. [2]. This network contains 7,640 nodes and 277,029 edges.

The choice of selecting these models is based on two criteria. First, we wanted to use graphs that are publicly available and have been studied by other researchers. Moreover, networks having varying density and size so as to see the behavior of the different models in terms of scalability and flexibility could be evaluated.

7 Results and discussion

We calculate a number of statistics using various network generation models and compare them with the real-world networks of equal sizes. The results are shown in Tables 2, 3 and 4. In some cases, the models are not parameterized and thus the node–edge density could not be controlled. We tried to generate models of similar size in terms of number of nodes, and where possible, similar number of edges. An important observation about these networks is that since all of them use the preferential attachment to produce the scale-free property, the degree distribution for all the models follows a power law. To the best of our knowledge, there is no metric which tries to identify the presence of communities in a network by analyzing the graph on the whole in a global perspective; thus, the presence of community structure in the proposed model is only justified by construction. Using the cluster tree, the way the nodes connect to each other can be controlled and thus any network that is produced has densely connected nodes which are sparsely connected to other nodes.

Table 2 Comparing different models with the collaboration network of scientists from the network science data
Table 3 Comparing different models with the collaboration network of scientists from the computational geometry data
Table 4 Comparing different models with the Actor network from the IMDB dataset

Lets have a look at some individual results for the various models in comparison to the real-world networks. For example, graphs generated using the model of Guillaume and Latapy, the node–edge density in every case is very high and could not be controlled. The model of Fu and Liao, in all the three examples, have a very low clustering coefficient as compared to the respective real-world network and thus could not really be classified as generating similar networks to the real-world networks used as examples in our study. Looking at the clustering coefficient of the model by Wang and Rong in Table 3, it is quite clear that the model fails to generate a high clustering coefficient for a similar size network. In the model of Holme and Kim (Table 4) where the node–edge density of the network is comparatively high to other two networks but the network has a large size, the clustering coefficient drops considerably. The model of Klemm and Eguiluz scales well in terms of clustering coefficient, but in case of low node–edge density (see Table 2), the average path length is considerably high to be a small-world network. Also, from Table 4, the average path length in case of a number of models is 1.99, which is a direct implication of a node having a very high degree. As a result, most of the nodes are connected to this high degree node and thus have almost a distance which reduces the average path length of the entire network.

From the above examples, one obvious problem that can be inferred is that these models have problems with scalability, as the node–edge density is varied for a network, the models are not able to reproduce comparative values with real-world networks for various statistics. On the other hand, the proposed model in this paper has the ability to control the size of cliques as the starting point, which helps us to gauge the density and at the same time, and generate small-world and scale-free networks. The values are quite close to the ones expected and thus the proposed model is quite flexible.

8 Conclusion and future research directions

In this paper, we have studied the concepts of assortativity, triads and preferential attachment as the building blocks for the structure of social networks. We use these concepts to present a model to generate artificial social networks. We evaluated a number of network generation models that successfully generated small-world and scale-free networks but fail to capture another important characteristic of real-world network, i.e., the presence of community structures. We compared the existing and the proposed network model with real-world social networks using a number of statistics. Results show that the proposed model indeed generates networks that have community structures and are topologically similar to real-world networks as compared to the other existing models that generate small-world and scale-free networks. Moreover, we identified another problem for the existing models, the scalability in terms of node–edge density, where it is difficult to maintain the high clustering coefficient and low average path length as networks of varying sizes are produced.

In this paper, we have focused on social networks and effectively presented a model to generate networks having small-world and scale-free behavior with communities. We intend to extend our study to other types of networks such as biological and technological networks to propose network generation models for these types of networks as well, incorporating several real-world networks.