1 Introduction

Networks have been studied in many fields such as biology, mathematics, quantitative geography, sociology, and information science (Fortunato 2010). A graph is made by a number of nodes (vertices) and some links (edges) that join them to each other (Schaeffer 2007). A community (cluster) is made of a series of vertices with common or similar properties based on various criteria (Fortunato 2010). Graph clustering or community detection refers to grouping nodes connected with edges, but not to outside the group (Fortunato 2010; Schaeffer 2007). For example, strategies related to target marketing can be designed well if community detection is possible in such networks. If users are regarded as vertices, and friendship relationships are regarded as edges, graph clustering can formulate the issue of detecting communities for these users for target marketing.

1.1 Reasearch challenge

A significant characteristic of real-world networks is community structure (Fu et al. 2013) in which people have social relations sharing similar personal or professional interests, records or real-life relationships. Different communities can share nodes in graphs; therefore, overlapping can be formed among communities (Zarandi and Razaee 2010), and detecting such communities in social networks, which exists in most real social networks, is of great importance. In recent research, some fuzzy methods such as fuzzy c-means (FCM) clustering (Zhang et al. 2007; Zarinbal et al. 2014) and the possibilistic c-means clustering model (PCM) (Krishnapuram and Keller 1993) have been put forward for discovering these overlapping communities.

The problem studied in this research is to detect communities on the basis of attribute and structural similarities. The goal is to partition the graph into c communities, each of which has cohesive structures and homogeneous attributes. This is somewhat challenging because these two similarities are independent or even conflicting goals. For example, authors who cooperate with each other may have different attributes, such as research topics, whereas those who search the same topics may come from different groups and never cooperate. It is unknown how to balance these two sources of data. Most researches design a distance function between two vertices combining the structural distance and attribute distance with two different weighting factors. Although this procedure is simple, it is hard to set the factors and interpret the function so that it is not clear whether the weight of coauthor relationship should be larger or smaller than that of research topic. Moreover, making quantitative decisions on the weights is even harder.

1.2 Main contribution

This study proposes a fuzzy model and algorithm for detecting communities that overlap on the basis of analyzing the semantic of social networks data. Nodes share common attributes in groups or communities and they have many connections among themselves. Therefore, there are two sources of data for performing the clustering task. The first is information about the nodes and their attributes such as known properties, users’ profiles in social networks or authors’ publications; and the second comes from the set of connections between nodes such as interactions and collaborations that form among users.

Fuzzy clustering is very useful for cluster analysis (Yang 1993; Valente de Oliveira and Pedrycz 2007). Considering fuzzy sets in detecting communities can make it possible to identify clusters due to their various impressions in link and attribute information. Nodes with fuzzy clustering are assigned to one or more clusters with different membership functions, making it possible to have overlapping and interesting clusters of various and flexible structures. Because of these advantages, a fuzzy clustering is proposed in order to identify clusters existing in complex networks using both link information and node attribute. Determining the membership functions for assigning each node to clusters based on node attribute and link information is challenging in fuzzy clustering. Considering this problem, a new model called PCMSA is suggested to identify overlapping clusters on the basis of attribute and structural similarities. The findings indicate that PCMSA is a considerable model for detecting communities in a complex network. Here is a summary of major contributions of this research:

  1. (1)

    Community detection in social networks based on both link information and node attribute due to the importance of these sources of data in some real graphs such as social networks

  2. (2)

    Fuzzy clustering that makes it possible for overlapping clusters in which nodes are assigned to one or more clusters that have various degrees of memberships

  3. (3)

    Strict structural and attribute similarity: in the last algorithms of graph clustering, most algorithms consider weighting factors to balance between attribute and structural similarities; however, the algorithm in this paper (PCMSA) strictly considers the two similarities.

The organization of this paper is as follows: The 2nd section presents the related works. The 3rd section introduces fuzzy clustering based on center, and the 4th section deals with the explanation of the proposed algorithm and importance of weak ties. Sections 5, 6 present the clustering validation index and experimental results, respectively, and conclusions and suggestions for further research work will be presented in the 7th section.

2 Related works

In recent years, fuzzy clustering has been created and widely used in general clustering, but little research has applied it in graph clustering (Schaeffer 2007) such that using fuzzy clustering in graph clustering has been observed less during the past decade (Wang et al. 2013). It is still possible to improve the performance of some methods that are meant for discovering fuzzy overlapping communities (Schaeffer 2007). FCM is among the most common fuzzy clustering models used along with other techniques to detect communities (Zhang et al. 2007). In these studies, the structure of the models is not adapted well enough for graph clustering. Golsefid et al. proposed a fuzzy duo-centric model for community detection in social networks for which the nodes’ properties are not considered in this paper (Golsefid et al. 2015).

Most graph clustering considers only one aspect of the graph and ignores the other (Andersen et al. 2006; Flake et al. 2000; Girvan and Newman 2002; Tian et al. 2008).Consequently, the clusters either have a random distribution of vertex attributes in them, or have a non-cohesive intra-cluster structure. A good graph clustering ought to balance similarities that are both structural and attribute in order to have an intra-cluster structure that is cohesive and has homogeneous vertex properties. However, considering node attributes and network topology together is also challenging so that one must combine two very different pieces of information.

Recently, some attempts take both sources of data into consideration. Such clustering algorithms are based either on distance or on model. Distance-based methods (Zhou et al. 2009, 2010; Ruan et al. 2013) initially form an augmented network by adding the virtual links in order to connect the attributes with nodes. In that case, the clusters can be identified with the similarity between two nodes that have standard clustering algorithms (Markov or K-Medoids clustering), calculated by the distance between nodes in the augmented network. There are two challenges with this algorithm: adding new nodes and new edges leads to a big graph that cannot be solved in some cases. Moreover, it is not clear how to cluster this heterogeneous graph with two types of nodes and edges (Zhou et al. 2010). Model-based approaches, both generative and discriminative models, have been developed to simulate the complex network generated by various Bayesian networks with topic modeling. In the literature, for the first one, there exist papers such as CART (Pathak et al. 2008), iTopic Model (Sun et al. 2009), and for the second one, there exist papers such as PCL-DC (Yang et al. 2009). Moreover, Cao et al. detect prosumer-community groups considering nodes’ attributes and network structure, but they do not consider the overlapping communities. Their algorithm also cannot detect communities in large-scales networks (Cao et al. 2019). Bu et al. propose GK-mean algorithm which is formulated as a multi-objective optimization problem (MOOP). Although Graph K-means (GK-mean) algorithm considers two topological structure and attribute information, but do not work well on large-scale networks (Bu, et al. 2019).

The above clustering algorithms have a drawback in clustering in large-scale networks. Some of them did not consider both sources of data or some have taken the assumptions to the problem more easily. Moreover, they did not consider weak ties in their model in addition to the strong ties (weak ties are discussed in Sect. 4).

This is while, the proposed clustering approach based on structural and attribute similarity (PCMSA), is designed based on the semantic of social networks data in addition to the fuzzy sets by considering the important theorems in overlapping social networks such as, weak ties and homophily. These two theorems are explained in Sect. 4. Moreover, the proposed model is evaluated by an extensive evaluation using different network sizes and even real large graphs. Results show clusters with high quality, homogeneous attributes and cohesive structures.

Until now, methods that consider both topology structure and node attribute have not considered the fuzzy nature in extent of membership in graph clustering. Although, Hu and Chan propose a fuzzy clustering based on two sources of data (Hu and Chan (2016)), but in their problem formulation, the structure of the model is not adapted well enough for analyzing the semantic of fuzzy social networks and overlapping complex networks. As discussed above, in their problem formulation, the structural distance and attribute distance are combined, while they are two seemingly independent, or even conflicting goals and it does not make sense semantically.

3 Background

Fuzzy Center-Based Clustering (FCM) and the possibilistic c-means clustering (PCM) are presented in this section.

3.1 Fuzzy clustering based on center

In fuzzy clustering, each node forms a part of a cluster that has a membership function between 0 (not belonging), and 1(belonging), and each node can form a part of several clusters that have different membership degrees which are crisp values over the interval [0,1] (Höppners 1999). The most famous fuzzy clustering suggested by Dunn (Dunn (1974)) and continued by Bezdek (Bezdek (1981)) is FCM clustering algorithm.

If \(X = \{ x_{1} ,x_{2} ,...,x_{n} \} \in R^{\alpha }\) is a series of feature vectors (\(\alpha\) and n are the dimension and the number of nodes, respectively), FCM assigns nodes to clusters by making the subsequent function minimum, and partitions nodes into c clusters.

$$J = \sum\limits_{i = 1}^{c} {\sum\limits_{k = 1}^{n} {u_{ik}^{m} \left\| {x_{k} - v_{i} } \right\|_{R}^{2} } }$$
(1)

where, \(u_{ik} \in [0,1]\) is the membership degree of each node \(k = 1,...,n\) in cluster \(i = 1,...,c\), and \(v = \{ v_{1} ,...,v_{c} \} \in R^{c\alpha }\) is indicative of a series of clusters’ centers. R shows the distance norm (Fortunato 2010) and \(1 \le m < \infty\) indicates the fuzzifier parameter. Clustering based on the objective function can be considered an optimization problem, which is solved by the gradient descent technique (Tan et al. 2007).

Krishnapuram and Keller proposed the PCM clustering algorithm in order to decrease the impact of outliers on FCM with relaxation of the condition of membership values to all clusters for each node, which equals 1, and replaced it with \(\mathop {\max }\limits_{i} (u_{ik} ) > 0, \, 1 \le k \le n\).The PCM objective function is defined as:

$$J = \sum\limits_{i = 1}^{c} {\sum\limits_{k = 1}^{n} {u_{ik}^{m} \left\| {x_{k} - v_{i} } \right\|_{R}^{2} + \sum\limits_{i = 1}^{c} {\beta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } } }$$
(2)

where \(\beta_{i}\) is the average fuzzy intra-cluster distance of cluster i.

In a complex network, a number of nodes are joined to one another in a topological structure. By considering nodes as vertices, and links as edges, complex networks could be considered graphs. Many real-world problems such as social networks (Myspace, Facebook,…) have millions of users who are connected to one another as friends(Fortunato 2010). By considering users as vertices, and friendship relationships as edges, we can formulate the issue of assigning users to communities as a problem of community detection. An ideal cluster should possess an intra-cluster structure that is cohesive and has homogeneous vertex attributes. PCM considers only one aspect of the graph related to the nodes’ attributes, and ignores the other aspect related to the structure of the nodes. In this paper, both aspects are considered to detect communities considering attribute/structural similarities based on the PCM algorithm.

4 Proposed fuzzy clustering considering attribute and structural similarities

When a graph is given, different criteria can be defined to identify different graph clusters. By considering both structure and properties of the nodes, the proposed model PCMSA detects ideal communities with the following criteria:

  • (1) It identifies communities that have more densely connected nodes.

  • (2) It identifies communities that have nodes more strongly related to each other.

  • (3) The probability of adjacent nodes to belong to the same community is higher.

For PCMSA, the above criteria are considered by employing fuzzy clustering through formulation of the community detection problem as an optimization problem based on the PCM algorithm. Therefore, the main role of PCMSA is finding the best degree of membership to assign nodes to clusters so that clusters that are the most consistent with the discussed criteria can be achieved.

In order to detect communities that satisfy the above criteria, a minimum optimization problem based on PCM is formulated for the community detection problem of complex networks by considering both datasets related to node attributes and link structure. Then, the solution to this optimization problem is presented (Fig. 1).

Fig. 1
figure 1

Algorithm 1: Center-based fuzzy community detection algorithm based on link structure

4.1 Proposed model

This section deals with the proposed “fuzzy community detection model (PCMSA)” in an attempt to detect overlapping communities with regard to structural and attribute similarities in the complex networks. In the PCMSA, we want to cluster nodes on the basis of analyzing the semantic of social networks data considering the two important homophily and weak ties theorems (Kadushin 2004).

Assume that \(G(N,L)\) is a graph in which N indicates a number of nodes \((N = \{ 1,2,...,n\} )\) and L is indicative of some links \((L = \{ 1,2,...,l\} )\). The relevant terminologies are described as follows:

n

Number of nodes

l

Number of edges

e

Number of attributes

t

Number of repetitions (periods)

\(m \in [1,\infty )\)

Fuzzy parameter (weighting exponent called the fuzzifier)

\(x_{i}\)

ith object

c

Number of clusters

\(g_{i}\)

Set of nodes in cluster i

\(v_{i}\)

ith cluster centroid

\(u_{ik}\)

Membership degree of kth node to ith cluster

\(\Delta_{i}\)

Density of cluster i

\(v_{i}^{0}\)

The initial center of cluster i

\(\delta_{i}\)

Entropy of cluster i

\(\chi_{j}\)

The set of values of attribute \(\gamma_{j}\)

4.1.1 Structural similarity

The center-based community detection objective function considering link structure is formulated as:

$$J = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } D_{ik} }$$
(3)

In this formulation, \(D_{ik}\) indicates the distance of node k from the center of cluster i (\(v_{i}\)) that can be calculated with the following formulation:

$$D_{ik} = \left\{ \begin{array}{*{20}l} 0 &\quad if{\text{ there is a link between }}nodes \, v_{i} \, and \, k \, \hfill \\ \sum\limits_{j = 1}^{n} {\left| {a_{{v_{i} j}} - a_{jk} } \right|}&\quad{\text{ otherwise}} \hfill \\ \end{array} \right.$$
(4)

The transitivity theorem (Kadushin 2004) (if node A is connected to node B and node B is also connected to node C, most probably node A will be connected to node C) of social networks is used to define the structural distance between nodes. Therefore, for two connected nodes (first case) the special case is considered and the second case is based on transitivity theorem. In this equation, each \(a_{ij} \, (i = 1,...,n \, , \, j = 1,...,n)\) is indicative of the entry in the ith row and jth column of the adjacency matrix denoted by A. The entries in the adjacency matrix \((a_{ij} )\) indicate adjacent nodes. In this matrix, if nodes \(i\) and \(j\) are adjacent, then \(a_{ij} = 1\), and if nodes \(i\) and \(j\) are not adjacent, \(a_{ij} = 0\)Wasserman and Faust 1994).

Our article focuses on undirected graphs, and the links are not signed or valued. Therefore, if \(a_{ij} = 1\) then \(a_{ji} = 1\), thus, the matrix is symmetric (Wasserman and Faust 1994).

In this article, first a community detection model based on PCM is proposed to cluster nodes considering their link information where nodes represent objects, and links indicate the relationship among objects. Therefore, each cluster is presented as a number of interconnected objects which are not connected to objects out of the group (Wasserman and Faust 1994). The objective function that has been proposed is formulated as:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } D_{ik} + \sum\limits_{i = 1}^{c} {\Delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$

The first part of this objective function lessens the distance from cluster centers to the extent that is possible considering the data link structure (\(D_{ik}\)). The second term causes \(u_{ik}\) to become as large as possible, and in this way avoid the trivial solution (Malek et al. 2015). \(\Delta_{i}\) equals the proportion of existing links in a cluster to all the links that can be presented in this cluster \((\left| {L_{i} } \right|)\), which has become maximized.

$$\mathop {\Delta_{i} }\limits_{i = 1,...,c} = \frac{{\left| {\left. {\{ (p,q)} \right|p,q \in g_{i} ,(p,q) \in L_{i} \} } \right|}}{{\left| {L_{i} } \right|}}$$

Now, the first proposed model identifies a new fuzzy clustering model that is center-based in order to identify communities that overlap in complex networks. This model is defined on the basis of the PCM clustering model and detects overlapping communities on the basis of the link structure. The defined model is formulated as:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } D_{ik} + \sum\limits_{i = 1}^{c} {\Delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$
(7)
$$\begin{gathered} subject \, to \hfill \\ \left[ \begin{gathered} \, u_{ik} \in [0,1] \, , \, 1 \le i \le c \, , \, 1 \le k \le n \hfill \\ \hfill \\ \mathop { \, \max }\limits_{i} (u_{ik} ) > 0 \, , \, 1 \le k \le n \hfill \\ \, 0 < \sum\limits_{k = 1}^{n} {u_{ik} } < n \, , \, 1 \le i \le c \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
(8)

Theorem 1. Assume that \(G(N,L)\) is a graph in which N indicates the set of nodes \((N = \{ 1,2,...,n\} )\) and L is indicative of some links \((L = \{ 1,2,...,l\} )\). In our model, \(U_{k}\) indicates the kth column of \(U\); that is, \(U_{k} = \{ u_{1k} ,...,u_{ck} \} ,{ 1} \le {\text{k}} \le n\). Then, U could be a global minimum for \(J_{m} (U,V)\) only if the updating fuzzy membership value is:

$$u_{ik} = \left( {1 + \left( {\frac{{D_{ik} }}{{\Delta_{i} }}} \right)^{1/(m - 1)} } \right)^{ - 1}$$
(9)

and the center of cluster is as follows:

$$v_{i}^{ * } = \mathop {\arg \min }\limits_{{v_{i} \in [1,n]}} \left(\sum\limits_{k = 1}^{n} {\sum\limits_{j = 1}^{n} {u_{ik}^{m} } } D_{ik} \right)$$
(10)

Theorem 1 will be proved in “Appendix 1”.

Now, consider the algorithm of fuzzy community detection to identify overlapping communities based on link structure.

The \(\Delta_{i}\) moves to 0 if there are no links in a cluster, and it moves to 1 if all the links exist. The higher the value of \(\Delta_{i}\), the more connections exist between nodes, leading to a denser cluster.

The most important step in a clustering that is center-based is choosing the proper initial central node. An approach that is more common is random choice of initial centers, however, the outcomes are often weak (Malek et al. 2015). The Nodal degree, the number of lines incident with the node in the graph, can be a good criterion to choose the initial centers. This can be achieved by summing with regard to elements in the adjacency matrix as follows (Wasserman and Faust 1994):

$$v_{i}^{0} = \max_{{^{{i^{\prime} \in [1,n]}} }} \, \sum\limits_{j = 1}^{n} {a_{{i^{\prime}j}} }$$
(11)

In the proposed clustering, clusters have an intra-cluster structure that is cohesive. A favorable cluster should have an intra-cluster structure that is cohesive and has homogeneous vertex attributes. Therefore, in this paper, re-clustering is proposed to re-cluster communities considering a threshold on the basis of the homophily theorem in social networks in which, if two people have characteristics that match in a proportion greater than expected in the population from which they are drawn or the network of which they are apart, then they are more likely to be connected. The converse is also true: if two people are connected, then they are more likely to have common characteristics or attributes (Kadushin 2004).

Moreover, “The strength of weak ties”, which has attracted a lot of research attention, is an article that has been presented by Mark Granovetter (Granovetter 1977). Weak ties concentrate on holes in the network (Kadushin 2004). Our acquaintances (weak ties) may have less relationship with us than our close friends (strong ties). Thus, if we have a set of people with their acquaintances in whom many of the possible ties are absent, their network will constitute a low-density network (Kadushin 2004).

Weak ties cause the information to easily flow from remote parts of a network. Objects that have few weak ties are deprived of information from remote parts of a network and only get provincial news and information from their close friends. Compared to the strong ties, weak ties may serve as bridges between network segments. Thus, social systems that do not have weak ties are incoherent and will be fragmented as weak ties helping to integrate social systems. Without considering weak ties, new ideas will spread slowly, and scientific efforts will not achieve their success (Kadushin 2004). Due to the importance of weak ties they are considered in this paper with the proposed model that detect communities based on two sources of data, structural and attribute similarities.

Therefore, the re-clustering should be employed by the following measurement according to the above theorems.

This operation works by using a pair wise similarity measure to find groups of clusters that could benefit from re-clustering their component nodes and edges. In order to find groups of clusters needing to be re-clustered, the similarity of each pair of clusters based on their nodes’ attributes is found. The similarity measure for two especial communities (clusters) i and j is defined as:

$$p_{\begin{subarray}{l} i,j \\ i < j \end{subarray} } = 1 - \frac{{\sum\limits_{{q \in g_{i} }}^{{}} {\sum\limits_{{q^{\prime} \in g_{j} }}^{{}} {\left\| {x_{q} - x_{{q^{\prime}}} } \right\|} } }}{{\sum\limits_{w = 1}^{c - 1} {\sum\limits_{v = w + 1}^{c} {\sum\limits_{{h \in g_{w} }}^{{}} {\sum\limits_{{h^{\prime} \in g_{v} }}^{{}} {\left\| {x_{h} - x_{{h^{\prime}}} } \right\|} } } } }}$$

Equation (12) indicates the proposed similarity measure that calculates the percentage of similarity levels of cluster i and cluster j considering their nodes’ attributes. For especial case \(c = 2\), the re-clustering is done and it is then decided by the proposed validation index whether the re-clustering is good or not.

By using this similarity measure, groups of clusters that could benefit from re-clustering their component nodes and edges can be defined. If \(p_{i,j} > B\), (\(B\) identifies by try and error) then the re-clustering algorithm is run to re-cluster all the nodes and edges in cluster i and cluster j. This is similar to the “The rich get richer and the poor get poorer” phrase as the nodes in the cluster with the structural algorithm that possess similar attributes become denser with the re-clustering algorithm.

It is worth mentioning that B is not limited and depends on the considering graph. For each clustering, the value of B is defined by try and error and the scenario with the highest validity index. According to the experiments, the value of B is obtained smaller in graphs with more weak ties. As a result our method is flexible in which it is considered for each graph according to its similarities. If the attribute similarity is greater than B that is identified by the validity index, the re-clustering is applied. This is while some graphs don’t need to be re-clustered according to the weak ties and homophily theorems.

As it is mentioned before the value of B depends on the graph and its attribute similarities. Therefore the minimum and maximum values of \(p_{i,j \in [1,c]}\) are obtained for the graph. B is searched in this bound \(([\min ,\max ])\) and the value with the highest validity index in clustering is selected.

4.1.2 Attribute similarity (Re-clustering)

The proposed re-clustering objective function is defined as follows:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } d_{ik} - \sum\limits_{i = 1}^{c} {\delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$
(13)

In this equation, \(\delta_{i}\) is measured as:

$$\begin{gathered} \delta_{i \in [1,c]} = entropy(C_{i} ) = \sum\limits_{j = 1}^{e} {\frac{{\left| {g_{i} } \right|}}{n}} \, entropy(\gamma_{j} ,C_{i} ) \hfill \\ entropy(\gamma_{j} ,C_{i} ) = - \sum\limits_{{g \in \chi_{j} }}^{{}} {p_{ijg} *\ln p_{ijg} + (1 - } p_{ijg} )*\ln (1 - p_{ijg} ) \hfill \\ \end{gathered}$$
(14)

In this equation, \(p_{ijg}\) refers to the percentage of vertices existing in cluster i with value \(\gamma_{jg}\) on attribute \(\gamma_{j}\).\(\delta_{i}\) measures the weighted entropy from all attributes over c clusters. Moreover, for continues values of an attribute, the fuzzy membership function of that attribute is used and then \((\alpha - cut)\) in fuzzy sets (Mendel and Mendel 2017) is applied to create a finite set of values.

The parameterization of \(d_{ik}\) should be specified. Referring to the Gustafson and Kessel’s definition, \(d_{ik}\) can be obtained as follows (Gustafson and Kessel 1978):

$$d_{ik} (\Omega_{i} ) = (x_{k} - v_{i} )^{T} H_{i} (x_{k} - v_{i} ), \, 1 \le i \le c$$
(15)

This form of \(d_{ik}\) indicates the norm metric of an inner product with \(H_{i}\) symmetric and positive-definite matrix. Note that we take \(\Omega_{i} = \left\{ {v_{i} ,H_{i} } \right\}\) and J is linear in \(H_{i}\) inducing a singular problem (Gustafson and Kessel 1978). Gustafson and Kessel restricted the determinant \(\left| {H_{i} } \right|\) of matrix \(H_{i}\) in order not to allow the metric to grow without bound (Gustafson and Kessel 1978).

Now the proposed re-clustering model is as follows:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } d_{ik} - \sum\limits_{i = 1}^{c} {\delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$
(16)
$$\begin{gathered} subject \, to \hfill \\ \left[ \begin{gathered} \, \left| {H_{i} } \right| = \upsilon_{i} \, , \, \upsilon_{i} > 0 \, \hfill \\ \, u_{ik} \in [0,1] \, , \, 1 \le i \le c \, , \, 1 \le k \le n \hfill \\ \mathop { \, \max }\limits_{i} (u_{ik} ) > 0 \, , \, 1 \le k \le n \hfill \\ \, 0 < \sum\limits_{k = 1}^{n} {u_{ik} } < n \, , \, 1 \le i \le c \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$
(17)

Constraint \(\left| {H_{i} } \right| = \upsilon_{i}\) guarantees that \(H_{i}\) is positive-definite (Bezdek et al. 1999).

Now, the augmented objective function is defined as:

$$\begin{aligned} \min J_{m} (u,\Omega ,\lambda ) &= \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } d_{ik} (\Omega_{i} ) - \sum\limits_{i = 1}^{c} {\delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } } \hfill \\&\quad + \sum\limits_{i = 1}^{c} {\lambda_{i} } \left[ {\left| {H_{i} } \right| - \upsilon_{i} } \right] \hfill \\ \end{aligned}$$
(18)

where \(\left\{ {\lambda_{i} } \right\}\) is a set of Lagrange multipliers.

Theorem 2. U could be a global minimum for \(J_{m} (U,V)\) only if the updating fuzzy membership value is:

$$u_{ik}^{*} = \left( {1 + \left( {\frac{{d_{ik} (\Omega_{i} )}}{{ - \delta_{i} }}} \right)^{1/(m - 1)} } \right)^{ - 1}$$
(19)

and the cluster center is:

$$v_{i}^{*} = \frac{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } x_{k} }}{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } }}$$
(20)

and finally,

$$H_{i}^{{*^{ - 1} }} = \frac{1}{{\lambda_{i} \left| {H_{i}^{*} } \right|}}\sum\limits_{i = 1}^{c} {u_{ik}^{m} } (x_{k} - v_{i}^{*} )(x_{k} - v_{i}^{*} )^{T}$$
(21)

Theorem 2 will be proved in Appendix 1.

Now, \(FC_{i}\) is the fuzzy covariance matrix that can be defined as follows (Gustafson and Kessel 1978):

$$FC_{i} = \frac{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } (x_{k} - v_{i} )(x_{k} - v_{i} )^{T} }}{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } }} \, ; \, m > 1$$
(22)

Then, using (22) and \(\left| {H_{i} } \right| = \upsilon_{i}\) in (21), \(H_{i}^{{*^{ - 1} }}\) gives:

$$H_{i}^{{*^{ - 1} }} = \left[ {\frac{1}{{\upsilon_{i} \left| {FC_{i} } \right|}}} \right]^{1/\alpha } FC_{i}$$
(23)

In which \(\alpha\) is the feature space dimension.

The previous discussion and then re-clustering algorithm induce the following proposed algorithm for community detection considering both sources of data related to topological structure and vertex properties. Figure 2 illustrates the proposed algorithm.

Fig. 2
figure 2

PCMSA Algorithm: Center-based fuzzy community detection algorithm to identify overlapping communities based on link structure and nodes’ attributes

The essential condition for converging the algorithm suggested in Fig. 2 is met when:

$$\lim_{t \to \infty } \left\| {U^{(t)} - U^{(t - 1)} } \right\| = 0$$

The reason for this condition as well as the proposed algorithm convergence is to be offered in Appendix 2.

In the PCMSA algorithm, steps 1–4 detect communities on the basis of the structural similarities and steps 5–11 detect the last communities based on the attribute similarities.

For the proposed algorithm the complexity is \(O(c*n*t + c*(c - 1)*t*e*n_{g} )\) in which c, n, t, e, and \(n_{g}\) indicate the number of communities, the number of nodes, the number of iterations, the number of attributes, and the number of nodes in two re-clustering communities, respectively.

5 Clustering validation index

The appropriate criteria for evaluating the performance of clustering process can be the number of links in the community and those outside the community, which are the base of most community definitions (Zarandi et al. 2010). Suppose a sub-graph \(G_{i}\) of a graph G in which \(\left| {G_{i} } \right| = g_{i}\) and \(\left| G \right| = g\). The internal and external degree of sub-graph \(G_{i}\) can be defined as the number of links in the sub-graph connecting nodes to each other and that of links connecting nodes inside the sub-graph to the remainder of the graph, respectively. The quality of clusters with two measures of density, intra-cluster density and inter-cluster density is evaluated. The ratio between the number of internal links of cluster \(C_{i}\) and that of possible internal edges \((L_{i} )\) is called the intra-cluster density.

$$\mathop {\Delta_{i} }\limits_{i = 1,...,c} = \frac{{\left| {\left. {\{ (p,q)} \right|p,q \in C_{i} ,(p,q) \in L_{i} \} } \right|}}{{\left| {L_{i} } \right|}}$$
(25)

In addition, the inter-cluster density can be defined as the ratio between the number of edges from the nodes of \(C_{i}\) to the remainder of the graph and that of possible inter-cluster edges.

$$\mathop {\Delta_{i}^{ext} }\limits_{i = 1,...,c} = \frac{{\left| {\left. {\{ (p,q)} \right|p \in C_{i} ,q \notin C_{i} ,(p,q) \in L^{\prime}_{i} \} } \right|}}{{\left| {L^{\prime}_{i} } \right|}}$$
(26)

In this equation, \(L^{\prime}_{i}\) indicates edges between nodes inside cluster \(C_{i}\) to the remainder of the graph.

For \(C_{i}\) to be a community with homogeneous attributes, the homogeneity is expected to be appreciably maximum inside a community and minimum between the communities. This is defined as the separation measure. The separation measure is defined as follows:

$$\hom_{i}^{{}} = \frac{{\sum\limits_{j = 1}^{c} {\sum\limits_{{z \in g_{i} }}^{{}} {\sum\limits_{{z^{\prime} \in g_{j} }}^{{}} {\left\| {x_{z} - x_{{z^{\prime}}} } \right\|} } } }}{{\sum\limits_{w = 1}^{c - 1} {\sum\limits_{h = w + 1}^{c} {\sum\limits_{{p \in g_{w} }}^{{}} {\sum\limits_{{p^{\prime} \in g_{h} }}^{{}} {\left\| {x_{p} - x_{{p^{\prime}}} } \right\|} } } } }}$$

The suggested index considers two criteria: compactness and separation. The compactness measure is determined based on inter- and intra-density of communities \(\Delta dense_{i} = \Delta_{i} - \Delta_{i}^{ext}\).

As a result, a desired community is one with a maximum level of compactness and a larger level of separation. Therefore searching for the best trade-off between density and homogeneity is the goal of our algorithm, and one method to do that is maximization:

$$\Lambda_{i} = \Delta dense_{i} *\hom_{i}$$
(28)

It is expected that by maximizing \(\Lambda_{i}\), \(C_{i}\) can be a community. By considering this criterion for each community, the validity index \(\Lambda\) is determined by Eq. (29):

$$\Lambda = \Delta dense*\hom$$
(29)

In this equation \(\Delta dense\) is the average of \(\Delta dense_{i} (i = 1,...,c)\) and \(\hom\) is the average of \(\hom_{i} (i = 1,...,c)\). Equation (29) is considered the criterion to assess the performance of the proposed community detection and determines the most favorable number of clusters.

6 Experimental results

In this section, the performance of the proposed model is tested in several artificial and large scale real networks.

Example 1

In the first step, a simple dataset with 10 nodes is considered as shown in Fig. 3a, which indicates a co-worker graph where nodes show workers, and edges indicate relationships between them. Each number shows a worker ID. Moreover, two attributes describe features of a node. The first letter indicates gender (Male “M” or Female “F”) and the second letter indicates where they live (Montreal “M” or Toronto “T”). As shown in Fig. 3a, workers 4, 6 and 7 have the same properties, worker 3 is male and lives in Montreal and the others have the same properties. Suppose the cluster number is \(c = 2\). As it can be seen, depending on the clustering criteria, several clustering ways are obtained:

Fig. 3
figure 3

a co-worker Graph. b Structure-based cluster. c Attribute-based Cluster. s Proposed clustering (PCMSA)

Figure 3b indicates clusters based on structure similarity that only considers relationships between co-workers and ignores their attributes where each cluster has varieties of genders and places. In this clustering method, co-workers in clusters are closely connected and coherent.

Figure 3c indicates clusters considering attribute similarity. This means that clusters have co-workers with the same properties as much as possible; therefore, the resulted clusters are homogenous, however, the vertex connectivity may not be considered.

Figure 3d illustrates the result of the proposed clustering algorithm considering both sources of data related to topology structure and the node’s attribute. The workers in one cluster are closely connected and have the same property so that the coherent and homogenous clusters are resulted. In addition, as discussed earlier in Sect. 4, the proposed algorithm detects overlap communities where node 3 is assigned to both communities.

In this section, the function of the proposed model is tested in “m” (fuzzy parameter) and \(\upsilon\) (\(\left| H \right|\) in (17)). By considering the proposed validity index, to find the best value for fuzzy parameter (m), the values of \(1 - \Lambda\) for different values of m (fuzzy parameter) and c (number of communities) are obtained as follows. Thus by minimizing \(1 - \Lambda\), the best value for “m” can be 2.5.

The membership values generated by the proposed model with different values for \(\upsilon\) are shown in Table 1. In Fig. 3d the left cluster is cluster 1 and the other is cluster 2. By increasing the value of “\(\upsilon\)” only, the variations in the shape of clusters can become greater without any limitation that leads to the generation of clusters without homogenous attributes or coherent structures depending on the data. The historical membership functions for two critical points are shown in Table 1. Note that point 3 is strongly related to cluster 1 for \(\upsilon = 0.5\) and \(\upsilon = 1\), but it starts to form correctly from \(\upsilon = 1.5\). Moreover, node 4 starts to form incorrectly from \(\upsilon = 2\). Therefore, \(\upsilon\) can be set to 1.5, according to Table 1, causing reasonable and desired membership functions. Moreover, there is a weak tie between node 3 and node 8 in which they are not coworker. But they are assigned to the same cluster considering PCMSA algorithm based on their similar attributes. The same results can be seen for node 3 compared to nodes 9 and 10 (Fig. 4).

Table 1 Membership functions history for different values of \(\upsilon\)
Fig. 4
figure 4

Proposed validity index for different degrees of fuzziness (Example 1)

Note that in this paper,\(\upsilon_{i}\) is considered the same for all clusters.

Example 2

The second sample having 11 nodes is depicted in Fig. 5 (Zhou et al. 2009). There are two communities in this example. This graph contains vertices that indicate authors, and the edges between them indicate the co-author relationship between them. In addition to an ID number for every author, the associated topic related to an author is given by describing its attribute. Authors 1–7 work on XML, authors 9–11 work on Skyline and author 8 works on both. Obtained clusters with different clustering ways are shown in Fig. 5b–e. Figure 5d is a clustering algorithm, named SA-clustering, proposed by Zhou et al. (2009). Although it considers both relationships between nodes and their attributes, it cannot detect overlap communities as well as the PCMSA algorithm. Note that the proposed community detection PCMSA detects overlap communities in addition to considering both sources of data.

Fig. 5
figure 5

a Co-author graph. b Structure-based cluster. c Attribute-based Cluster. d SA-cluster Approach. e Proposed clustering (PCMSA)

Figure 6 indicates membership values generated by PCMSA for different values of m. Again By considering the proposed validity index, to find the best value for fuzzy parameter (m), the values of \(1 - \Lambda\) for different values of m (fuzzy parameter) and c (number of communities) are obtained as follows. Thus by minimizing \(1 - \Lambda\), the best value for “m” can be 2.

Fig. 6
figure 6

SC index for different degrees of fuzziness (Example 2)

Table 2 shows the membership values generated by the proposed model with different values for \(\upsilon\). Note that the membership function of nodes 3, 4 and 8 assign correctly until \(\upsilon = 1\). Therefore,\(\upsilon\) can be set to 1 according to Table 2. In Fig. 5e, the left cluster is named cluster 1 and the other is cluster 2.

Table 2 Membership functions history for different values of \(\upsilon\)

Moreover, there is a weak tie between node 8 and node 5 in which they are not coauthor. But they are assigned to the same cluster considering PCMSA algorithm based on their similar attributes. The same results can be seen for node 8 compared to nodes 6 and 7.

Example 3

The third example to evaluate the proposed model is a partial Facebook network. Facebook is an American online social medium and social networking service company. Sample three is part of this social network from Northwestern University (Traud et al. 2011) (The nodes are selected randomly from the original dataset). This network has 256 nodes with 401 links as shown in Fig. 7a. A total of six communities have been identified in this dataset (Traud et al. 2011). Moreover, the other sizes of the network have also evaluated in Fig. 9 that the largest of which has 3452788 nodes, and 76849635 edges.

Fig. 7
figure 7

Facebook network

In addition, each node has 7 attributes which indicate its properties. These attributes include student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year, and high school.

As it is mentioned before, parameter B is defined by a procedure which is explained in Sect. 4. In this graph \(p_{i,j}^{\min }\) and \(p_{i,j}^{\max }\) are obtained 0.01 and 0.78, respectively. B is searched in this bound [0.01, 0.78] and B= 0.2 resulting in a best validity index.

The distinction between inter-cluster and intra-cluster density is explained in Table 3. The smaller the cluster number is, the lower the difference between intra-cluster and inter-cluster distance is. Additionally, the value of homogeneity is obtained according to (27). As the results show, the smaller the cluster number is, the lower the homogeneity is. The trade-off between \(\Delta dense\) and hom by using (29) indicates that the first local maximum occurs in \(c = 6\).

Table 3 Validity Index

In this part, two validity indices (NMI and Accuracy (Hu and Chan 2016)) in addition to the proposed validity index (\(\Lambda\)) are tested on the dataset in which the optimum number of communities c for each index is shown in Table 7. The reason why we choose these two metrics is that they are widely used to evaluate the community detection algorithms. NMI and Accuracy metrics (Hu and Chan 2016) are defined in Appendix 3.

The parameters of the PCMSA are set to \(\varepsilon = 0.001\) and \(m = 2.5\). As discussed in Appendix 3, NMI and Accuracy obtain optimal value with maximization. Table 4 shows the optimal values of the validity indices for \(c = 2,3,...,c_{\max } = \sqrt n = 16\). Among the indices, only Accuracy and \(\Lambda\) correctly result in the six clusters and NMI does not recognize the optimal c value and results in four clusters as the optimal number.

Table 4 Cluster validity values for \(c = 2,...,c_{\max } = \sqrt n = 16\)

The objective function of the PCMSA depends on the weighting exponent \(m \in [1,\infty )\). When a validity index is insensitive to change in m, it can be said that it is reliable. Therefore, here, all validity indices are considered for various values of both c and m. The results of reliability of each index by changing m are reported, and the results are listed in Tables 5, 6, 7.

Table 5 Cluster validity values for\(c = 2,...,8\):\(m = 1.5\)
Table 6 Cluster validity values for\(c = 2,...,8\):\(m = 2\)
Table 7 Values of c preferred by validity indices with\(c = 2,...,8\)and different values of m

Tables 5, 6 show the results of three validity indices for \(c = 2,...,8\) for weighting exponent \(m = 1.5\) and 2, respectively. In Table 5, for \(m = 1.5\), only \(\Lambda\) results in the optimal value for c \((c = 6)\). In Table 6, for \(m = 2\), \(\Lambda\) and NMI are the only indices that result in the optimal \(c = 6\).

Table 7 lists the optimal number of clusters for all validity indices in varieties values of m. As shown in Table 7, \(\Lambda\) demonstrates a better result, and is the least sensitive to change in m. The bold numbers in the Table 7 show the correct number of communities and the bold numbers in the Tables 4, 5, and 6 show the best validity values for each validity index and different values of m.

From the tests, the suggested validity index \(\Lambda\) is the only index which obtains the optimal c in all of experiments; moreover, when the different values of m are considered again, \(\Lambda\) has a better performance. Therefore, the proposed index \(\Lambda\) is used to evaluate PCMSA.

The result of community detection by algorithm 1 (structure-based community detection) is illustrated in Fig. 8a. It clusters nodes based on their relationship in the university and does not pay attention to their properties compared to Fig. 8c which clusters nodes based on the PCMSA algorithm considering both the nodes’ structure and properties. As shown in Fig. 8c, nodes in a community do not necessarily only have a relationship with each other, but they also have the same properties according to the dataset. Moreover, the result of community detection by SA-cluster algorithm (Zhou et al. 2009) is shown in Fig. 8b. In Fig. 8, the exclusive members of 6 clusters are shown using colors: yellow, dark blue, pink, green, light blue, and red. However, the black nodes are shared between each pair of clusters. The function of the suggested model (PCMSA) is compared with algorithm 1 and SA-cluster using the validation index which is equal to \(\Lambda_{a\lg orithm1} = 0.14\) and \(\Lambda_{SA - cluster} = 0.11\) for algorithm 1 and SA-cluster, respectively, using (29). However, the validity index for PCMSA is obtained \(\Lambda_{PCMSA} = 0.28\). Hence, the function of the suggested model, PCMSA, is superior to that of other models.

Fig. 8
figure 8

a Community detection based on node’s structure. b SA-clustering. c PCMSA Algorithm

In this experiment, the efficiency of different clustering algorithms is compared. Figure 9 indicates the clustering time on different number of nodes. The following observations on the runtime costs of different methods are made. First, PCMSA is usually 1.9–2.4 times slower than Algorithm 1, as it iteratively computes the re-clustering procedure. Although PCMSA is more expensive, but the iterative re-clustering improves the clustering quality a lot, as it is demonstrated in Fig. 8 and Table 15. According to our analysis CODICIL (Ruan et al. 2013), SA-cluster (Zhou et al. 2009), and FCAN (Hu and Chan 2016) algorithms are slower than the proposed algorithm (PCMSA). K-SNAP (Tian et al. 2008) is a little faster than PCMSA in three samples, but the quality of PCMSA is more than K-SNAP according to Table 15. Figure 9 suggests our method has promising scalability and computational complexity for analysis on even large social networks.

Fig. 9
figure 9

Clustering efficiency

To understand more about the unique capabilities of the PCMSA, we choose this network and looked in to its communities in details. To investigate how content information may affect cluster determination, we have also considered the attributes of nodes. Nodes 17 and 410 in Fig. 8a are assigned to different communities because there is not any edge between them, while with PCMSA algorithm (Fig. 8c) they are assigned to the same community according to their similar attributes(nodes 17 and 410 attributes: student, male, chemical engineering, -, dorm, 1980, chio). This confirms the existence of a weak tie between nodes 17 and 410 which they do not have any link (strong link) to each other but they are assigned to the same community considering their attribute similarities. This result is also clear for node 22, so that, it is shared between two communities with PCMSA according to its attribute similarities with two communities, while in Fig. 8a it is assigned to one community. Moreover, nodes 5 and 112 are assigned to different communities in Fig. 8a, but in Fig. 8c, node 112 is shared between two communities according to PCMSA algorithm (node 5 attributes: faculty, female, management, art, house, 1976, berh and node 112 attributes: faculty, male, management,-, house, 1975, berh). These results can also be seen for nodes 77 and 146.This, again, indicates the importance of considering content information in addition to the link structure when determining communities.

Example 4s

The fourth sample is a social network that has received particular attention in the context of community detection known as a Political Blogs Dataset, which could be downloaded from http://www-personal.umich.edu/mejn/netdata/ (Adamic and Glance 2004). 1490 weblogs politics are considered as vertices with 19025 hyperlinks between these weblogs as shown in Fig. 10a. These nodes are distributed in two clusters (Adamic and Glance 2004). Moreover, the other size of the network has also considered in Fig. 12 which has 6577 nodes, and 87345 edges. Political leaning as either liberal or conservative describes an attribute for each blog in the dataset. Several papers use this network for evaluating their cluster analysis. The average degree of this network is 12.768 and the average density is 0.009.

Fig.10
figure 10

Political blogs dataset

The intra and inter-cluster density are calculated by using (36) and (37). Then,\(\Delta dense\) is calculated and illustrated in Fig. 11. According to Fig. 11, with an increase in the number of clusters, \(\Delta dense\) has an increasing trend with a lower rate for \(c \ge 10\).

Fig. 11
figure 11

difference in density \((\Delta dense)\)

By considering the proposed validity index, to find the best value for fuzzy parameter (m), the values of \(1 - \Lambda\) for different values of m (fuzzy parameter) and c (number of communities) are obtained as follows in Fig. 12. Thus by minimizing \(1 - \Lambda\), the best value for “m” can be 2. Figure 12 indicates an average trending of \(\Lambda\). The first local maximum (minimum of \(1 - \Lambda\)) occurs in \(c = 2\). However, there are other favorable numbers of clusters like \(c = 14,28,48,...\).

Fig. 12
figure 12

Validity index for different degrees of fuzziness (Example 4)

In this part, two validity indices (NMI and Accuracy (Hu and Chan 2016)) in addition to the proposed validity index (\(\Lambda\)) are tested on the dataset in which the optimum number of communities c for each index is shown in Table 10.

The parameters of the PCMSA are set to \(\varepsilon = 0.001\) and \(m = 2\). Table 8 shows the optimal values of the validity indices. It is worth noticing that \(c_{\max } = \sqrt n = 38\), and numbers after \(c = 16\) are not presented in Table 8 due to the space consideration. Among the indices, only \(\Lambda\) correctly results in the two clusters and Accuracy and NMI do not recognize the optimal c value and results in 3 clusters as the optimal number.

Table 8 cluster validity values for\(c = 2,...,16\)
Table 9 Cluster validity values for\(c = 2,...,8\):\(m = 1.5\)

The objective function of the PCMSA depends on the weighting exponent \(m \in [1,\infty )\). When a validity index is insensitive to change in m, it can be said that it is reliable. Therefore, here, all validity indices are considered for various values of both c and m. The results of reliability of each index by changing m are reported, and the results are listed in Tables 9, 10, 11.

Table 10 Cluster validity values for\(C = 2,...,8\):\(m = 2.5\)
Table 11 Values of c preferred by validity indices with\(c = 2,...,8\)and different values of m

Tables 9, 10 show the results of three validity indices for \(c = 2,...,8\) for weighting exponent \(m = 1.5\) and 2.5, respectively. In Table 9, for \(m = 1.5\), only \(\Lambda\) and NMI result in the optimal value for c \((c = 2)\). In Table 10, for \(m = 2.5\), all indices result in the optimal \(c = 2\).

Table 11 lists the optimal number of clusters for all validity indices in varieties values of m. As shown in Table 11, \(\Lambda\) demonstrates a better result, and is the least sensitive to change in m. The bold numbers in the Table 11 show the correct number of communities and the bold numbers in the Tables 8, 9, and 10 show the best validity values for each validity index and different values of m.

From the tests, the suggested validity index \(\Lambda\) is the only index which obtains the optimal c in all of experiments; moreover, when the different values of m are considered again, \(\Lambda\) has a better performance. Therefore, the proposed index \(\Lambda\) is used to evaluate PCMSA.

The result of clustering by center-based clustering algorithms such as PCM is shown in Fig. 13a which reveals no difference in assigning nodes to clusters in comparison to previous algorithms. Moreover, Fig. 13b indicates the result of the SA-cluster algorithm proposed by Zhou et al. (2009). It is considered that the left cluster is cluster 1 (blue color) and the other is cluster 2 (green color). According to Fig. 13b node 1037 is assigned to cluster 2 instead of cluster 1. In addition, nodes 167, 170, 564 and 979 and some others are assigned to one cluster; however, they have similar attributes with the other cluster according to dataset. This means that the SA-cluster might not consider overlap clusters. Figure 13c indicates the results of the proposed community detection model (PCMSA). In this figure, shared nodes between two clusters are shown with red color. The function of the proposed model, PCM, and SA-cluster models are evaluated considering validation index using (29). The validation index for PCM and SA-cluster are \(\Lambda_{PCM} = 0.29\) and \(\Lambda_{SA - cluster} = 0.32\), respectively, however, it is \(\Lambda_{PCMSA} = 0.59\) for proposed model which indicates the better performance of PCMSA compared with PCM and SA-cluster models.

Fig. 13
figure 13

a PCM algorithm. b SA-cluster algorithm. c PCMSA algorithm

The efficiency of different clustering algorithms is compared in Fig. 14. This Figure indicates the clustering time on different number of nodes. The following observations on the runtime costs of different methods are made. First, PCMSA is usually 1.5–2.2 times slower than Algorithm 1, as it iteratively computes the re-clustering procedure. Although PCMSA is more expensive, but the iterative re-clustering improves the clustering quality a lot, as it is demonstrated in Fig. 13 and Table 15. The following observations show PCMSA is faster than the CODICIL (Ruan et al. 2013), SA-cluster (Zhou et al. 2009), K-SNAP (Tian et al. 2008), and FCAN (Hu and Chan 2016). K-SNAP is a little slower than PCMSA but the quality of PCMSA is much better than K-SNAP according to Table 15. Figure 14 suggests our method has promising scalability and computational complexity for analysis on even large social networks.

Fig. 14
figure 14

Clustering efficiency

Example 5

This example is used to identify communities from social networks extracted from Facebook. This dataset is available in Stanford Large Network Dataset Collection http://snap.stanforf.edu/data/index.html. This network consists of 4089 nodes and 170714 edges as it is shown in Fig. 15. A total of 193 communities have been distinguished and gender, job titles, institutions, etc., are considered as the attributes and they are 175 in total.

Fig.15
figure 15

Facebook dataset

As it is mentioned before, the objective function of the PCMSA depends on the weighting exponent \(m \in [1,\infty )\). When a validity index is insensitive to change in m, it can be said that it is reliable. Therefore, here, all validity indices are considered for various values of both c and m. After running PCMSA, the results of reliability of each index by changing m are reported in Table 12.

Table 12 Values of c preferred by validity indices with different values of c and m

As shown in Table 12, \(\Lambda\) demonstrates a better result, and is the least sensitive to change in m. Therefore, the proposed index \(\Lambda\) is used to evaluate PCMSA. The bold numbers in the Table 12 show the correct number of communities.

After running algorithms (FCAN, CODICIL, SA-cluster, K-SNAP, Algorithm 1, and PCMSA) with this dataset, the results are obtained as follows:\(\Lambda_{FCAN} = 0.45\), \(\Lambda_{CODICIL} = 0.33\),\(\Lambda_{SA - cluster} = 0.49\), \(\Lambda_{K - SNAP} = 0.29\), \(\Lambda_{A\lg orithm1} = 0.38\), \(\Lambda_{PCMSA} = 0.56\). PCMSA, again, performs better than the other algorithms and SA-cluster is ranked second. As it is mentioned before SA-cluster does not consider the overlapping communities and cannot handle noise in data. After SA-cluster, FCAN is the better one. FCAN cannot handle and balance attribute and structural similarities as well as PCMSA. In their problem formulation, the structural distance and attribute distance are combined, while they are two seemingly independent, or even conflicting goals and it does not make sense semantically.

Moreover, the efficiency of different clustering algorithms is compared in Fig. 16. According to our analysis PCMSA is faster than the all algorithms, except the Algorithm 1 as it iteratively computes the re-clustering procedure. Figure 16 suggests our method has promising scalability and computational complexity.

Fig. 16
figure 16

Clustering efficiency

Example 6:

The sixth sample is the twitter dataset that is available for download from Stanford Network Dataset Collection http://snap.stanford.edu/data/index.html. This dataset has 125,120 nodes and 2,248,406 edges and 3140 communities. This network is indicated in Fig. 17. The number of attributes for each node is 33569. The hashtags used by users in their tweets are considered as the attributes.

Fig. 17
figure 17

Twitter dataset

Like the previous examples, all validity indices are considered for various values of both c and m. After running PCMSA, the results of reliability of each index by changing m are reported in Table 13.

Table 13 Values of c preferred by validity indices with different values of c and m

As shown in Table 13, \(\Lambda\) demonstrates a better result, and is the least sensitive to change in m. Therefore, the proposed index \(\Lambda\) is used to evaluate PCMSA. The bold numbers in the Table 13 show the correct number of communities.

After running algorithms (FCAN, CODICIL, SA-cluster, K-SNAP, Algorithm 1, and PCMSA) with this dataset, the results are obtained as follows:\(\Lambda_{FCAN} = 0.34\), \(\Lambda_{CODICIL} = 0.26\),\(\Lambda_{SA - cluster} = 0.38\), \(\Lambda_{K - SNAP} = 0.31\), \(\Lambda_{A\lg orithm1} = 0.41\), \(\Lambda_{PCMSA} = 0.62\). PCMSA performs better than the other algorithms and Algorithm 1 is ranked second. After Algorithm 1, SA-cluster is the better one.

The efficiency of different clustering algorithms considering computational complexity is compared in Fig. 19.

Example 7

The final sample is the twitter dataset that is available for download from Stanford Network Dataset Collection http://snap.stanford.edu/data/index.html. This network consists of 5342870 nodes and 20037423 edges. A part of this network is shown in Fig. 18. A total of 5020 communities have been distinguished. The number of attributes for each node is 42342. The hashtags used by users in their tweets are considered as the attributes.

Fig. 18
figure 18

Twitter dataset

Here, all validity indices are considered for various values of both c and m. After running PCMSA, the results of reliability of each index by changing m are reported in Table 14.

Table 14 Values of c preferred by validity indices with different values of c and m

As shown in Table 14, \(\Lambda\) demonstrates a better result, and is the least sensitive to change in m. Therefore, the proposed index \(\Lambda\) is used to evaluate PCMSA. The bold numbers in the Table 14 show the correct number of communities.

After running algorithms (FCAN, CODICIL, SA-cluster, K-SNAP, Algorithm 1, and PCMSA) with this dataset, the results are obtained as follows:\(\Lambda_{FCAN} = 0.41\), \(\Lambda_{CODICIL} = 0.21\),\(\Lambda_{SA - cluster} = 0.50\), \(\Lambda_{K - SNAP} = 0.16\), \(\Lambda_{A\lg orithm1} = 0.52\), \(\Lambda_{PCMSA} = 0.59\). PCMSA, again, performs better than the other algorithms.

The efficiency of different clustering algorithms for examples 6 and 7 is compared in Fig. 19. The clustering time is indicated in this figure. The following observations on the runtime costs of different methods are made. PCMSA is 1.4–2.1 times slower than Algorithm 1, as it iteratively computes the re-clustering procedure. Although PCMSA is more expensive, but the iterative re-clustering improves the clustering quality a lot, as it is demonstrated in Table 15. The following observations show PCMSA is faster than the CODICIL (Ruan et al. 2013), SA-cluster (Zhou et al. 2009), K-SNAP (Tian et al. 2008), and FCAN (Hu and Chan 2016). K-SNAP is a little slower than PCMSA but the quality of PCMSA is much better than K-SNAP according to Table 15. Figure 19 suggests our method has promising scalability and computational complexity for analysis on even large social networks.

Fig. 19
figure 19

Clustering efficiency

Table 15 Comparison of different methods

Finally, the validity indices of different methods for all samples are illustrated in Table 15. As it is shown in Table 15, in all samples the performance of the PCMSA is better than the others. Samples 1 and 2 are described before in Examples 1 and 2, respectively. Samples 3–10 are described in Example 3 and discussed in Fig. 9. Samples 11 and 12 are discussed in Example 4 and described in Fig. 14. Sample 13 is discussed in example 5 and described in Fig. 16. Finally samples 14, and 15 are mentioned in Examples 6, and 7, respectively and described in Fig. 19.

Considering Table 15, PCMSA has performed better not only in low dimensional data sets, but also in high dimensional ones where complexity increases. Also as it is shown in Figs. 9, 14, 16, and 19, the time efficiency of PCMSA was better in all examples.

7 Conclusion

In this research, we proposed a fuzzy model for overlapping community detection based on the PCM algorithm in complex networks. This model (PCMSA) identifies communities in the social networks based on both resources of data related to the nodes’ attributes and nodes’ structure. Moreover, PCMSA strictly takes attribute and structural similarities into consideration instead of balancing them. Therefore, the proposed community detection algorithm is capable of clustering the graph into quality partitions that have high structural and attribute similarities. The performance of the proposed model was shown with several real-world and artificial networks from small to very large sizes. The trade-off between density and homogeneity was used to assess the model and to specify the most favorable number of clusters. In addition, the structure of the validity index was well adapted for graph clustering considering both link information and node attribute. Results indicated that the proposed model detects communities in a better manner than the other algorithms, especially when some nodes are shared between clusters. Our experiments showed that PCMSA detects meaningful and insightful patterns in both synthetic and large scale real social networks. Also the experimental findings reveal the superiority of this novel model and its promising scalability and computational complexity over others. In future works, the community detection based on other clustering algorithms will be considered.