Fuzzy community detection on the basis of similarities in structural/attribute in large-scale social networks

Naderipour, Mansoureh; Fazel Zarandi, Mohammad Hossein; Bastani, Susan

doi:10.1007/s10462-021-09987-x

Published: 12 April 2021

Volume 55, pages 1373–1407, (2022)
Artificial Intelligence Review

Mansoureh Naderipour¹,
Mohammad Hossein Fazel Zarandi¹ &
Susan Bastani²

Abstract

Community detection aims to partition a network into sets of nodes that are more similar to each other than to nodes outside the set, according to criteria such as neighborhood similarity or vertex connectivity. Most present-day community detection methods concentrate on the topological structure and largely ignore the heterogeneous properties of the vertices. This paper proposes a new community detection model, based on the possibilistic c-means model, that uses both structural and attribute similarities in large-scale social networks. In the majority of real social networks, different clusters share nodes, resulting in overlapping communities. The proposed model, based on structural and attribute similarity (PCMSA), is a fuzzy community detection model that addresses the overlapping community detection problem and detects communities such that each community is a densely connected sub-graph with homogeneous attribute values. The performance of the proposed model is assessed by a trade-off between intra-cluster and inter-cluster density and homogeneity. To validate the proposed community detection algorithm (PCMSA) and its results, an index compatible with the proposed model is defined. To assess the efficiency of the proposed fuzzy community detection, experimental results on real social networks ranging from very small to very large are reported and compared with other community detection models such as FCAN, CODICIL, SA-cluster, K-SNAP and PCM. The experimental findings show the superiority of the new model and its promising scalability and computational complexity relative to the others.

1 Introduction

Networks have been studied in many fields such as biology, mathematics, quantitative geography, sociology, and information science (Fortunato 2010). A graph consists of a number of nodes (vertices) and links (edges) that join them to each other (Schaeffer 2007). A community (cluster) is a set of vertices with common or similar properties according to various criteria (Fortunato 2010). Graph clustering, or community detection, refers to grouping nodes that are connected by edges within the group but have few connections to nodes outside it (Fortunato 2010; Schaeffer 2007). For example, target-marketing strategies can be designed well if communities can be detected in such networks. If users are regarded as vertices and friendship relationships as edges, the problem of detecting communities of users for target marketing can be formulated as graph clustering.

1.1 Research challenge

A significant characteristic of real-world networks is community structure (Fu et al. 2013), in which people with social relations share similar personal or professional interests, records or real-life relationships. Different communities can share nodes in graphs; therefore, communities can overlap (Zarandi and Razaee 2010), and detecting such overlapping communities, which exist in most real social networks, is of great importance. In recent research, some fuzzy methods such as fuzzy c-means (FCM) clustering (Zhang et al. 2007; Zarinbal et al. 2014) and the possibilistic c-means clustering model (PCM) (Krishnapuram and Keller 1993) have been put forward for discovering these overlapping communities.

The problem studied in this research is the detection of communities on the basis of attribute and structural similarities. The goal is to partition the graph into c communities, each of which has a cohesive structure and homogeneous attributes. This is challenging because the two similarities are independent or even conflicting goals. For example, authors who cooperate with each other may have different attributes, such as research topics, whereas those who research the same topics may come from different groups and never cooperate. It is not known how to balance these two sources of data. Most studies design a distance function between two vertices that combines the structural distance and the attribute distance with two different weighting factors. Although this procedure is simple, the factors are hard to set and the function is hard to interpret: it is not clear, for instance, whether the weight of the coauthor relationship should be larger or smaller than that of the research topic. Making quantitative decisions on the weights is harder still.

1.2 Main contribution

This study proposes a fuzzy model and algorithm for detecting overlapping communities by analyzing the semantics of social network data. Nodes in a group or community share common attributes and have many connections among themselves. Therefore, there are two sources of data for performing the clustering task. The first is information about the nodes and their attributes, such as known properties, users’ profiles in social networks or authors’ publications; the second comes from the set of connections between nodes, such as the interactions and collaborations that form among users.

Fuzzy clustering is very useful for cluster analysis (Yang 1993; Valente de Oliveira and Pedrycz 2007). Considering fuzzy sets in detecting communities can make it possible to identify clusters due to their various impressions in link and attribute information. In fuzzy clustering, nodes are assigned to one or more clusters with different membership degrees, which allows overlapping clusters of various and flexible structures. Because of these advantages, a fuzzy clustering is proposed to identify clusters in complex networks using both link information and node attributes. Determining the membership functions that assign each node to clusters based on node attributes and link information is challenging in fuzzy clustering. To address this problem, a new model called PCMSA is proposed to identify overlapping clusters on the basis of attribute and structural similarities. The findings indicate that PCMSA is an effective model for detecting communities in a complex network. The major contributions of this research are summarized as follows:

(1) Community detection in social networks based on both link information and node attributes, since both sources of data are important in real graphs such as social networks.
(2) Fuzzy clustering that allows overlapping clusters, in which nodes are assigned to one or more clusters with various degrees of membership.
(3) Strict structural and attribute similarity: most recent graph clustering algorithms use weighting factors to balance attribute and structural similarities, whereas the algorithm in this paper (PCMSA) considers the two similarities strictly.

The organization of this paper is as follows: Section 2 presents the related works. Section 3 introduces center-based fuzzy clustering, and Section 4 explains the proposed algorithm and the importance of weak ties. Sections 5 and 6 present the clustering validation index and the experimental results, respectively, and Section 7 presents conclusions and suggestions for further research.

2 Related works

In recent years, fuzzy clustering has been developed and widely used in general clustering, but little research has applied it to graph clustering (Schaeffer 2007), and its use in graph clustering has remained rare during the past decade (Wang et al. 2013). The performance of some methods for discovering fuzzy overlapping communities can still be improved (Schaeffer 2007). FCM is among the most common fuzzy clustering models used along with other techniques to detect communities (Zhang et al. 2007). In these studies, the structure of the models is not adapted well enough for graph clustering. Golsefid et al. proposed a fuzzy duo-centric model for community detection in social networks, but the nodes’ properties are not considered in that paper (Golsefid et al. 2015).

Most graph clustering methods consider only one aspect of the graph and ignore the other (Andersen et al. 2006; Flake et al. 2000; Girvan and Newman 2002; Tian et al. 2008). Consequently, the clusters either have a random distribution of vertex attributes or have a non-cohesive intra-cluster structure. A good graph clustering ought to balance structural and attribute similarities in order to obtain a cohesive intra-cluster structure with homogeneous vertex properties. However, considering node attributes and network topology together is challenging, because two very different pieces of information must be combined.

Recently, some approaches have taken both sources of data into consideration. Such clustering algorithms are either distance-based or model-based. Distance-based methods (Zhou et al. 2009, 2010; Ruan et al. 2013) first form an augmented network by adding virtual links that connect attributes with nodes. The clusters can then be identified by applying standard clustering algorithms (Markov or K-Medoids clustering) to the similarity between two nodes, calculated from the distance between the nodes in the augmented network. There are two challenges with this approach. First, adding new nodes and new edges leads to a large graph that cannot be solved in some cases. Second, it is not clear how to cluster this heterogeneous graph with two types of nodes and edges (Zhou et al. 2010). Model-based approaches, both generative and discriminative, have been developed to simulate the complex network generated by various Bayesian networks with topic modeling. Examples of generative models in the literature include CART (Pathak et al. 2008) and iTopic Model (Sun et al. 2009), and an example of a discriminative model is PCL-DC (Yang et al. 2009). Moreover, Cao et al. detect prosumer-community groups considering nodes’ attributes and network structure, but they do not consider overlapping communities, and their algorithm cannot detect communities in large-scale networks (Cao et al. 2019). Bu et al. propose the GK-mean algorithm, which is formulated as a multi-objective optimization problem (MOOP). Although the Graph K-means (GK-mean) algorithm considers both topological structure and attribute information, it does not work well on large-scale networks (Bu et al. 2019).

The above clustering algorithms have drawbacks when clustering large-scale networks. Some of them do not consider both sources of data, and some make simplifying assumptions about the problem. Moreover, they do not consider weak ties in addition to strong ties in their models (weak ties are discussed in Sect. 4).

In contrast, the proposed clustering approach based on structural and attribute similarity (PCMSA) is designed around the semantics of social network data as well as fuzzy sets, and it takes into account important theorems for overlapping social networks such as weak ties and homophily. These two theorems are explained in Sect. 4. Moreover, the proposed model is evaluated extensively on networks of different sizes, including real large graphs. The results show high-quality clusters with homogeneous attributes and cohesive structures.

Until now, methods that consider both topological structure and node attributes have not considered the fuzzy nature of the extent of membership in graph clustering. Hu and Chan (2016) propose a fuzzy clustering based on the two sources of data, but the structure of their model is not adapted well enough for analyzing the semantics of fuzzy social networks and overlapping complex networks. As discussed above, their problem formulation combines the structural distance and the attribute distance, although these are two seemingly independent, or even conflicting, goals, so the combination does not make sense semantically.

3 Background

Fuzzy c-means clustering (FCM) and possibilistic c-means clustering (PCM) are presented in this section.

3.1 Fuzzy clustering based on center

In fuzzy clustering, each node belongs to a cluster with a membership degree between 0 (not belonging) and 1 (belonging), and each node can belong to several clusters with different membership degrees, which are crisp values in the interval [0,1] (Höppners 1999). The best-known fuzzy clustering algorithm, suggested by Dunn (1974) and extended by Bezdek (1981), is the FCM clustering algorithm.

If $X = \{ x_{1} ,x_{2} ,...,x_{n} \} \in R^{\alpha }$ is a set of feature vectors ($\alpha$ and n are the dimension and the number of nodes, respectively), FCM partitions the nodes into c clusters by minimizing the following objective function:

$$J = \sum\limits_{i = 1}^{c} {\sum\limits_{k = 1}^{n} {u_{ik}^{m} \left\| {x_{k} - v_{i} } \right\|_{R}^{2} } }$$

(1)

where, $u_{ik} \in [0,1]$ is the membership degree of each node $k = 1,...,n$ in cluster $i = 1,...,c$, and $v = \{ v_{1} ,...,v_{c} \} \in R^{c\alpha }$ is indicative of a series of clusters’ centers. R shows the distance norm (Fortunato 2010) and $1 \le m < \infty$ indicates the fuzzifier parameter. Clustering based on the objective function can be considered an optimization problem, which is solved by the gradient descent technique (Tan et al. 2007).

Krishnapuram and Keller proposed the PCM clustering algorithm to decrease the impact of outliers on FCM. It relaxes the condition that the membership values of each node over all clusters sum to 1 and replaces it with $\mathop {\max }\limits_{i} (u_{ik} ) > 0, \, 1 \le k \le n$. The PCM objective function is defined as:

$$J = \sum\limits_{i = 1}^{c} {\sum\limits_{k = 1}^{n} {u_{ik}^{m} \left\| {x_{k} - v_{i} } \right\|_{R}^{2} + \sum\limits_{i = 1}^{c} {\beta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } } }$$

(2)

where $\beta_{i}$ is the average fuzzy intra-cluster distance of cluster i.
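As a minimal sketch of Eqs. (1) and (2), the following Python functions evaluate the FCM and PCM objectives for a given membership matrix and set of centers. The squared Euclidean distance is assumed for the norm R, and the function names are illustrative. The helper `pcm_beta` assumes the common form $\beta_{i} = K\sum_{k} u_{ik}^{m} \|x_{k} - v_{i}\|^{2} / \sum_{k} u_{ik}^{m}$ with $K = 1$, which is one usual way to compute the average fuzzy intra-cluster distance; the paper does not spell out the formula here.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def sq_dist(x: Vector, v: Vector) -> float:
    """Squared Euclidean distance (assumed choice for the norm R)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fcm_objective(X: Sequence[Vector], V: Sequence[Vector], U: Matrix, m: float = 2.0) -> float:
    """Eq. (1): sum over clusters i and nodes k of u_ik^m * ||x_k - v_i||^2.

    U[i][k] is the membership of node k in cluster i.
    """
    return sum(
        U[i][k] ** m * sq_dist(X[k], V[i])
        for i in range(len(V))
        for k in range(len(X))
    )


def pcm_beta(X: Sequence[Vector], V: Sequence[Vector], U: Matrix,
             m: float = 2.0, K: float = 1.0) -> List[float]:
    """Average fuzzy intra-cluster distance per cluster (assumed form, K = 1)."""
    betas = []
    for i in range(len(V)):
        num = sum(U[i][k] ** m * sq_dist(X[k], V[i]) for k in range(len(X)))
        den = sum(U[i][k] ** m for k in range(len(X)))
        betas.append(K * num / den)
    return betas


def pcm_objective(X: Sequence[Vector], V: Sequence[Vector], U: Matrix,
                  beta: Sequence[float], m: float = 2.0) -> float:
    """Eq. (2): the FCM term plus sum_i beta_i * sum_k (1 - u_ik)^m."""
    penalty = sum(
        beta[i] * sum((1.0 - U[i][k]) ** m for k in range(len(X)))
        for i in range(len(V))
    )
    return fcm_objective(X, V, U, m) + penalty
```

The second term of `pcm_objective` grows when memberships are small, which is what keeps the possibilistic model from collapsing to the all-zero solution.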

In a complex network, a number of nodes are joined to one another in a topological structure. By considering nodes as vertices and links as edges, complex networks can be treated as graphs. Many real-world systems such as social networks (Myspace, Facebook, …) have millions of users who are connected to one another as friends (Fortunato 2010). By considering users as vertices and friendship relationships as edges, the task of assigning users to communities can be formulated as a community detection problem. An ideal cluster should have a cohesive intra-cluster structure and homogeneous vertex attributes. PCM considers only one aspect of the graph, the nodes’ attributes, and ignores the other aspect, the link structure among the nodes. In this paper, both aspects are considered in order to detect communities from attribute and structural similarities based on the PCM algorithm.

4 Proposed fuzzy clustering considering attribute and structural similarities

When a graph is given, different criteria can be defined to identify different graph clusters. By considering both structure and properties of the nodes, the proposed model PCMSA detects ideal communities with the following criteria:

(1) It identifies communities that have more densely connected nodes.
(2) It identifies communities that have nodes more strongly related to each other.
(3) The probability of adjacent nodes to belong to the same community is higher.

For PCMSA, the above criteria are considered by employing fuzzy clustering through formulation of the community detection problem as an optimization problem based on the PCM algorithm. Therefore, the main role of PCMSA is finding the best degree of membership to assign nodes to clusters so that clusters that are the most consistent with the discussed criteria can be achieved.

In order to detect communities that satisfy the above criteria, a minimization problem based on PCM is formulated for community detection in complex networks, using both the data on node attributes and the data on link structure. Then, the solution to this optimization problem is presented (Fig. 1).

4.1 Proposed model

This section presents the proposed “fuzzy community detection model (PCMSA)”, which detects overlapping communities with regard to structural and attribute similarities in complex networks. In PCMSA, we want to cluster nodes by analyzing the semantics of social network data, considering two important theorems, homophily and weak ties (Kadushin 2004).

Assume that $G(N,L)$ is a graph in which N is the set of nodes $(N = \{ 1,2,...,n\} )$ and L is the set of links $(L = \{ 1,2,...,l\} )$. The relevant terminology is as follows:

n	Number of nodes
l	Number of edges
e	Number of attributes
t	Number of repetitions (periods)
$m \in [1,\infty )$	Fuzzy parameter (weighting exponent called the fuzzifier)
$x_{i}$	ith object
c	Number of clusters
$g_{i}$	Set of nodes in cluster i
$v_{i}$	ith cluster centroid
$u_{ik}$	Membership degree of kth node to ith cluster
$\Delta_{i}$	Density of cluster i
$v_{i}^{0}$	The initial center of cluster i
$\delta_{i}$	Entropy of cluster i
$\chi_{j}$	The set of values of attribute $\gamma_{j}$

4.1.1 Structural similarity

The center-based community detection objective function considering link structure is formulated as:

$$J = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } D_{ik} }$$

(3)

In this formulation, $D_{ik}$ indicates the distance of node k from the center of cluster i ($v_{i}$) that can be calculated with the following formulation:

$$D_{ik} = \begin{cases} 0 & \text{if there is a link between nodes } v_{i} \text{ and } k \\ \sum\limits_{j = 1}^{n} \left| a_{v_{i} j} - a_{jk} \right| & \text{otherwise} \end{cases}$$

(4)

The transitivity theorem (Kadushin 2004) of social networks (if node A is connected to node B and node B is also connected to node C, most probably node A will be connected to node C) is used to define the structural distance between nodes. Therefore, two connected nodes are treated as a special case with zero distance (first case), and the second case is based on the transitivity theorem. In this equation, each $a_{ij} \, (i = 1,...,n \, , \, j = 1,...,n)$ is the entry in the ith row and jth column of the adjacency matrix denoted by A. The entries of the adjacency matrix $(a_{ij} )$ indicate adjacent nodes. In this matrix, if nodes $i$ and $j$ are adjacent, then $a_{ij} = 1$, and if nodes $i$ and $j$ are not adjacent, $a_{ij} = 0$ (Wasserman and Faust 1994).

Our article focuses on undirected graphs, and the links are not signed or valued. Therefore, if $a_{ij} = 1$ then $a_{ji} = 1$, thus, the matrix is symmetric (Wasserman and Faust 1994).
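The structural distance of Eq. (4) can be sketched in a few lines of Python. The function name is illustrative, nodes are indexed from 0 rather than 1, and the adjacency matrix is assumed to be a symmetric 0/1 list of lists without self-loops.

```python
from typing import List


def structural_distance(A: List[List[int]], center: int, k: int) -> int:
    """Eq. (4): distance of node k from the node used as cluster center.

    Returns 0 when the two nodes are linked; otherwise counts the positions
    in which the adjacency row of the center and the adjacency column of k
    differ, i.e. the neighbors they do not share.
    """
    if A[center][k] == 1:
        return 0
    n = len(A)
    return sum(abs(A[center][j] - A[j][k]) for j in range(n))


# Path graph 0 - 1 - 2 - 3
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
distances_from_0 = [structural_distance(A, 0, k) for k in range(4)]
```

In this example node 2 shares its only common neighbor (node 1) with node 0, so it is closer to node 0 than node 3, which shares none.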

In this article, first a community detection model based on PCM is proposed to cluster nodes considering their link information where nodes represent objects, and links indicate the relationship among objects. Therefore, each cluster is presented as a number of interconnected objects which are not connected to objects out of the group (Wasserman and Faust 1994). The objective function that has been proposed is formulated as:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } D_{ik} + \sum\limits_{i = 1}^{c} {\Delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$

(5)

The first term of this objective function reduces the distance from cluster centers as far as possible given the link structure of the data ($D_{ik}$). The second term forces $u_{ik}$ to be as large as possible, thereby avoiding the trivial solution (Malek et al. 2015). $\Delta_{i}$ is the ratio of the number of existing links in a cluster to the number of all links that could be present in this cluster $(\left| {L_{i} } \right|)$; it is to be maximized.

$$\Delta_{i} = \frac{\left| \{ (p,q) \mid p,q \in g_{i} ,\ (p,q) \in L_{i} \} \right|}{\left| L_{i} \right|}, \quad i = 1,...,c$$

(6)

The first proposed model is thus a new center-based fuzzy clustering model for identifying overlapping communities in complex networks. It is defined on the basis of the PCM clustering model and detects overlapping communities from the link structure. The model is formulated as:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } D_{ik} + \sum\limits_{i = 1}^{c} {\Delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$

(7)

$$\begin{gathered} subject \, to \hfill \\ \left[ \begin{gathered} \, u_{ik} \in [0,1] \, , \, 1 \le i \le c \, , \, 1 \le k \le n \hfill \\ \hfill \\ \mathop { \, \max }\limits_{i} (u_{ik} ) > 0 \, , \, 1 \le k \le n \hfill \\ \, 0 < \sum\limits_{k = 1}^{n} {u_{ik} } < n \, , \, 1 \le i \le c \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$

(8)

Theorem 1. Assume that $G(N,L)$ is a graph in which N is the set of nodes $(N = \{ 1,2,...,n\} )$ and L is the set of links $(L = \{ 1,2,...,l\} )$. In our model, $U_{k}$ denotes the kth column of $U$; that is, $U_{k} = \{ u_{1k} ,...,u_{ck} \} , \, 1 \le k \le n$. Then, U can be a global minimum of $J_{m} (U,V)$ only if the fuzzy membership values are updated as:

$$u_{ik} = \left( {1 + \left( {\frac{{D_{ik} }}{{\Delta_{i} }}} \right)^{1/(m - 1)} } \right)^{ - 1}$$

(9)

and the center of cluster is as follows:

$$v_{i}^{ * } = \mathop {\arg \min }\limits_{{v_{i} \in [1,n]}} \left(\sum\limits_{k = 1}^{n} {\sum\limits_{j = 1}^{n} {u_{ik}^{m} } } D_{ik} \right)$$

(10)

Theorem 1 will be proved in “Appendix 1”.
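The two update rules of Theorem 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it assumes $m > 1$ and $\Delta_{i} > 0$, indexes nodes from 0, repeats the structural distance of Eq. (4) so that the block is self-contained, and resolves ties in Eq. (10) by taking the lowest node index. All function names are hypothetical.

```python
from typing import List


def structural_distance(A: List[List[int]], center: int, k: int) -> int:
    """Eq. (4): 0 for linked nodes, otherwise the count of unshared neighbors."""
    if A[center][k] == 1:
        return 0
    return sum(abs(A[center][j] - A[j][k]) for j in range(len(A)))


def update_memberships(D: List[List[float]], delta: List[float], m: float = 2.0) -> List[List[float]]:
    """Eq. (9): u_ik = 1 / (1 + (D_ik / Delta_i)^(1/(m-1))).

    D[i][k] is the distance of node k from center i; delta[i] must be > 0.
    """
    exponent = 1.0 / (m - 1.0)
    return [
        [1.0 / (1.0 + (D[i][k] / delta[i]) ** exponent) for k in range(len(D[i]))]
        for i in range(len(D))
    ]


def update_center(A: List[List[int]], u_i: List[float], m: float = 2.0) -> int:
    """Eq. (10): the node that, used as center, minimizes sum_k u_ik^m * D_ik."""
    n = len(A)

    def cost(v: int) -> float:
        return sum(u_i[k] ** m * structural_distance(A, v, k) for k in range(n))

    return min(range(n), key=cost)  # ties go to the lowest index (assumption)


# Path graph 0 - 1 - 2 - 3 with equal memberships: an inner node is chosen.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
center = update_center(A, [1.0, 1.0, 1.0, 1.0])
U = update_memberships([[0.0, 1.0, 2.0]], [0.5])
```

A node at distance 0 from the center always receives membership 1, and membership decays faster in sparse clusters (small $\Delta_{i}$) than in dense ones.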

Now, consider the algorithm of fuzzy community detection to identify overlapping communities based on link structure.

$\Delta_{i}$ approaches 0 if there are no links in a cluster and approaches 1 if all possible links exist. The higher the value of $\Delta_{i}$, the more connections exist between nodes, leading to a denser cluster.
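A small sketch of the density $\Delta_{i}$ follows. It assumes a crisp member set for the cluster (how $g_{i}$ is derived from the fuzzy memberships is not specified in this section) and takes the number of possible links in an undirected simple graph to be $|g_{i}|(|g_{i}| - 1)/2$. The function name is illustrative.

```python
from itertools import combinations
from typing import Iterable, List


def cluster_density(A: List[List[int]], members: Iterable[int]) -> float:
    """Ratio of existing links among `members` to the number of possible links."""
    nodes = sorted(set(members))
    possible = len(nodes) * (len(nodes) - 1) // 2
    if possible == 0:
        return 0.0  # a cluster with fewer than two nodes has no possible links
    existing = sum(1 for p, q in combinations(nodes, 2) if A[p][q] == 1)
    return existing / possible


# Triangle 0-1-2 with a pendant node 3 attached to node 2
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
dense = cluster_density(A, [0, 1, 2])       # every possible link exists
mixed = cluster_density(A, [0, 1, 2, 3])    # 4 of 6 possible links
empty = cluster_density(A, [0, 3])          # no link between the two nodes
```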

The most important step in center-based clustering is choosing proper initial central nodes. A common approach is to choose the initial centers at random; however, the outcomes are often weak (Malek et al. 2015). The nodal degree, i.e., the number of lines incident with the node in the graph, can be a good criterion for choosing the initial centers. It is obtained by summing the elements of the corresponding row of the adjacency matrix as follows (Wasserman and Faust 1994):

$$v_{i}^{0} = \max_{i^{\prime} \in [1,n]} \, \sum\limits_{j = 1}^{n} {a_{i^{\prime} j} }$$

(11)
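Degree-based initialization can be sketched as below. Equation (11) identifies the node of maximum degree; taking the c highest-degree nodes as the c initial centers, with ties broken by node index, is an assumption made for illustration, since the selection of several centers is not detailed here.

```python
from typing import List


def initial_centers(A: List[List[int]], c: int) -> List[int]:
    """Return the indices of the c nodes with the highest nodal degree."""
    degrees = [sum(row) for row in A]  # row sums of the adjacency matrix
    ranked = sorted(range(len(A)), key=lambda i: (-degrees[i], i))
    return ranked[:c]


# Triangle 0-1-2 with a pendant node 3 attached to node 2
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
centers = initial_centers(A, 2)
```

Note that the highest-degree nodes may be adjacent to each other, so a practical variant might also require the chosen centers to be some distance apart.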

In the proposed clustering, clusters have a cohesive intra-cluster structure. A favorable cluster, however, should have both a cohesive intra-cluster structure and homogeneous vertex attributes. Therefore, in this paper, communities are re-clustered subject to a threshold, on the basis of the homophily theorem in social networks: if two people have characteristics that match in a proportion greater than expected in the population from which they are drawn or the network of which they are a part, then they are more likely to be connected. The converse is also true: if two people are connected, then they are more likely to have common characteristics or attributes (Kadushin 2004).

Moreover, “The strength of weak ties”, an article by Mark Granovetter (Granovetter 1977), has attracted a lot of research attention. Weak ties concentrate on holes in the network (Kadushin 2004). Our acquaintances (weak ties) are less closely related to us than our close friends (strong ties). Thus, a set of people together with their acquaintances, among whom many of the possible ties are absent, constitutes a low-density network (Kadushin 2004).

Weak ties allow information to flow easily from remote parts of a network. Objects that have few weak ties are deprived of information from remote parts of the network and only get provincial news and information from their close friends. Compared to strong ties, weak ties may serve as bridges between network segments. Thus, social systems without weak ties are incoherent and fragmented, because weak ties help to integrate social systems. Without weak ties, new ideas spread slowly and scientific efforts do not succeed (Kadushin 2004). Because of their importance, weak ties are considered in this paper in the proposed model, which detects communities based on two sources of data, structural and attribute similarities.

Therefore, in line with the above theorems, re-clustering is triggered by the following measure.

This operation uses a pairwise similarity measure to find groups of clusters that could benefit from re-clustering their component nodes and edges. To find the groups of clusters that need to be re-clustered, the similarity of each pair of clusters is computed from their nodes’ attributes. The similarity measure for two particular communities (clusters) i and j is defined as:

$$p_{i,j} = 1 - \frac{\sum\limits_{q \in g_{i}} \sum\limits_{q^{\prime} \in g_{j}} \left\| x_{q} - x_{q^{\prime}} \right\|}{\sum\limits_{w = 1}^{c - 1} \sum\limits_{v = w + 1}^{c} \sum\limits_{h \in g_{w}} \sum\limits_{h^{\prime} \in g_{v}} \left\| x_{h} - x_{h^{\prime}} \right\|}, \quad i < j$$

(12)

Equation (12) gives the proposed similarity measure, which calculates the degree of similarity between cluster i and cluster j from their nodes’ attributes. For the special case $c = 2$, the re-clustering is performed, and the proposed validation index then decides whether the re-clustering is good or not.
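The similarity measure of Eq. (12) can be sketched as follows, assuming the Euclidean norm for $\|\cdot\|$ and crisp member lists for each cluster. The function names are illustrative, and raising an error when all attribute vectors coincide (zero denominator) is a choice made for this sketch only.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def pair_distance_sum(X: Sequence[Sequence[float]], gi: List[int], gj: List[int]) -> float:
    """Sum of attribute distances between every node of gi and every node of gj."""
    return sum(math.dist(X[q], X[r]) for q in gi for r in gj)


def cluster_similarity(X: Sequence[Sequence[float]],
                       clusters: List[List[int]]) -> Dict[Tuple[int, int], float]:
    """Eq. (12): p_ij = 1 - (distance sum for pair i, j) / (distance sum over all pairs)."""
    sums = {
        (i, j): pair_distance_sum(X, clusters[i], clusters[j])
        for i, j in combinations(range(len(clusters)), 2)
    }
    total = sum(sums.values())
    if total == 0:
        raise ValueError("all attribute vectors coincide; p_ij is undefined")
    return {pair: 1.0 - s / total for pair, s in sums.items()}


# Three single-node clusters with one attribute each
X = [[0.0], [1.0], [3.0]]
p = cluster_similarity(X, [[0], [1], [2]])
p_two = cluster_similarity(X, [[0, 1], [2]])
```

With only two clusters the numerator and denominator coincide, so $p_{1,2} = 0$ regardless of the data, which is consistent with the text treating $c = 2$ as a special case that is decided by the validation index instead.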

By using this similarity measure, groups of clusters that could benefit from re-clustering their component nodes and edges can be identified. If $p_{i,j} > B$ (where $B$ is determined by trial and error), the re-clustering algorithm is run to re-cluster all the nodes and edges in cluster i and cluster j. This resembles the phrase “the rich get richer and the poor get poorer”: the nodes that the structural algorithm has placed in a cluster and that possess similar attributes become denser with the re-clustering algorithm.

It is worth mentioning that B is not fixed and depends on the graph under consideration. For each clustering, the value of B is determined by trial and error, selecting the scenario with the highest validity index. According to the experiments, B takes smaller values in graphs with more weak ties. As a result, our method is flexible, since B is chosen for each graph according to its similarities. If the attribute similarity is greater than the value of B identified by the validity index, the re-clustering is applied. Some graphs, however, do not need to be re-clustered according to the weak ties and homophily theorems.

As mentioned before, the value of B depends on the graph and its attribute similarities. Therefore, the minimum and maximum values of $p_{i,j}$, $i,j \in [1,c]$, are obtained for the graph. B is searched within these bounds $([\min ,\max ])$, and the value that gives the highest validity index for the clustering is selected.
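The search for B described above might be organized as in the sketch below. The text only states that B is searched between the minimum and maximum similarity and chosen by the validity index; the evenly spaced grid, the `steps` parameter, and the `evaluate` callback are assumptions. `evaluate` is a hypothetical stand-in for "re-cluster the selected pairs and return the validity index of the result".

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[int, int]


def search_threshold(p: Dict[Pair, float],
                     evaluate: Callable[[List[Pair]], float],
                     steps: int = 10) -> Tuple[float, float]:
    """Scan B over [min p, max p] and keep the value with the highest validity index.

    For each candidate B, the cluster pairs with p_ij > B are the ones that
    would be re-clustered; `evaluate` scores the resulting clustering.
    """
    low, high = min(p.values()), max(p.values())
    best_b, best_score = low, float("-inf")
    for s in range(steps + 1):
        b = low + (high - low) * s / steps
        selected = [pair for pair, value in p.items() if value > b]
        score = evaluate(selected)
        if score > best_score:  # the first best value wins on ties
            best_b, best_score = b, score
    return best_b, best_score


# Toy usage: a made-up validity index that is highest when only pair (1, 2) is re-clustered
p = {(0, 1): 0.0, (0, 2): 0.5, (1, 2): 1.0}
best_b, best_score = search_threshold(p, lambda pairs: 1.0 if pairs == [(1, 2)] else 0.0, steps=4)
```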

4.1.2 Attribute similarity (Re-clustering)

The proposed re-clustering objective function is defined as follows:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } d_{ik} - \sum\limits_{i = 1}^{c} {\delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$

(13)

In this equation, $\delta_{i}$ is measured as:

$$\begin{gathered} \delta_{i} = entropy(C_{i} ) = \sum\limits_{j = 1}^{e} {\frac{{\left| {g_{i} } \right|}}{n}} \, entropy(\gamma_{j} ,C_{i} ), \quad i \in [1,c] \hfill \\ entropy(\gamma_{j} ,C_{i} ) = - \sum\limits_{{g \in \chi_{j} }} \left[ p_{ijg} \ln p_{ijg} + (1 - p_{ijg} ) \ln (1 - p_{ijg} ) \right] \hfill \\ \end{gathered}$$

(14)

In this equation, $p_{ijg}$ refers to the percentage of vertices in cluster i that take value $\gamma_{jg}$ on attribute $\gamma_{j}$. $\delta_{i}$ measures the weighted entropy from all attributes over c clusters. Moreover, for continuous values of an attribute, the fuzzy membership function of that attribute is used, and then an $\alpha$-cut in fuzzy sets (Mendel and Mendel 2017) is applied to create a finite set of values.
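The entropy of Eq. (14) can be sketched for categorical attributes as follows. The sketch reads the minus sign as applying to both terms of the summand, takes $0 \ln 0 = 0$, and assumes a crisp member list per cluster; the function names and the data layout (`attrs[k][j]` is the value of attribute j for node k) are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def binary_term(p: float) -> float:
    """p*ln(p) + (1-p)*ln(1-p), with 0*ln(0) taken as 0."""
    return sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)


def attribute_entropy(values: Sequence[Hashable], domain: Sequence[Hashable]) -> float:
    """entropy(gamma_j, C_i): values are the attribute values of the cluster's nodes."""
    counts = Counter(values)
    size = len(values)
    return -sum(binary_term(counts[g] / size) for g in domain)


def cluster_entropy(attrs: List[List[Hashable]], members: List[int],
                    domains: List[Sequence[Hashable]], n: int) -> float:
    """delta_i: sum over attributes of (|g_i| / n) * entropy(gamma_j, C_i)."""
    weight = len(members) / n
    return sum(
        weight * attribute_entropy([attrs[k][j] for k in members], domains[j])
        for j in range(len(domains))
    )


# Four nodes, one categorical attribute with domain {"a", "b"}
attrs = [["a"], ["b"], ["a"], ["a"]]
domains = [["a", "b"]]
mixed = cluster_entropy(attrs, [0, 1], domains, n=4)   # half "a", half "b"
pure = cluster_entropy(attrs, [2, 3], domains, n=4)    # homogeneous cluster
```

A homogeneous cluster has zero entropy, while the evenly mixed two-node cluster gives $0.5 \cdot 2\ln 2 = \ln 2$.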

The parameterization of $d_{ik}$ should be specified. Following Gustafson and Kessel’s definition, $d_{ik}$ can be obtained as follows (Gustafson and Kessel 1978):

$$d_{ik} (\Omega_{i} ) = (x_{k} - v_{i} )^{T} H_{i} (x_{k} - v_{i} ), \, 1 \le i \le c$$

(15)

This form of $d_{ik}$ is an inner-product norm induced by a symmetric, positive-definite matrix $H_{i}$. Note that we take $\Omega_{i} = \left\{ {v_{i} ,H_{i} } \right\}$, and that J is linear in $H_{i}$, which makes the minimization problem singular (Gustafson and Kessel 1978). Gustafson and Kessel therefore restricted the determinant $\left| {H_{i} } \right|$ of matrix $H_{i}$ so that the metric cannot grow without bound (Gustafson and Kessel 1978).
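For illustration, the distance of Eq. (15) is a quadratic form that can be computed directly; the function name is hypothetical and $H_{i}$ is passed as a plain list of lists.

```python
from typing import List, Sequence


def gk_distance(x: Sequence[float], v: Sequence[float], H: List[List[float]]) -> float:
    """Eq. (15): (x - v)^T H (x - v) for a symmetric positive-definite H."""
    diff = [a - b for a, b in zip(x, v)]
    dim = len(diff)
    H_diff = [sum(H[r][c] * diff[c] for c in range(dim)) for r in range(dim)]
    return sum(diff[r] * H_diff[r] for r in range(dim))


identity = [[1.0, 0.0], [0.0, 1.0]]
stretched = [[2.0, 0.0], [0.0, 0.5]]  # determinant 1, but axes weighted differently

d_euclid = gk_distance([3.0, 4.0], [0.0, 0.0], identity)
d_shaped = gk_distance([1.0, 2.0], [0.0, 0.0], stretched)
```

With the identity matrix the distance reduces to the squared Euclidean distance; a non-identity $H_{i}$ with the same determinant changes the shape of the cluster without changing its volume.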

Now the proposed re-clustering model is as follows:

$$\min J_{m} (u,v) = \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } d_{ik} - \sum\limits_{i = 1}^{c} {\delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } }$$

(16)

$$\begin{gathered} subject \, to \hfill \\ \left[ \begin{gathered} \, \left| {H_{i} } \right| = \upsilon_{i} \, , \, \upsilon_{i} > 0 \, \hfill \\ \, u_{ik} \in [0,1] \, , \, 1 \le i \le c \, , \, 1 \le k \le n \hfill \\ \mathop { \, \max }\limits_{i} (u_{ik} ) > 0 \, , \, 1 \le k \le n \hfill \\ \, 0 < \sum\limits_{k = 1}^{n} {u_{ik} } < n \, , \, 1 \le i \le c \hfill \\ \end{gathered} \right. \hfill \\ \end{gathered}$$

(17)

Constraint $\left| {H_{i} } \right| = \upsilon_{i}$ guarantees that $H_{i}$ is positive-definite (Bezdek et al. 1999).

Now, the augmented objective function is defined as:

$$\begin{aligned} \min J_{m} (u,\Omega ,\lambda ) &= \sum\limits_{k = 1}^{n} {\sum\limits_{i = 1}^{c} {(u_{ik} )^{m} } d_{ik} (\Omega_{i} ) - \sum\limits_{i = 1}^{c} {\delta_{i} \sum\limits_{k = 1}^{n} {(1 - u_{ik} )^{m} } } } \hfill \\&\quad + \sum\limits_{i = 1}^{c} {\lambda_{i} } \left[ {\left| {H_{i} } \right| - \upsilon_{i} } \right] \hfill \\ \end{aligned}$$

(18)

where $\left\{ {\lambda_{i} } \right\}$ is a set of Lagrange multipliers.

Theorem 2. U could be a global minimum for $J_{m} (U,V)$ only if the updating fuzzy membership value is:

$$u_{ik}^{*} = \left( {1 + \left( {\frac{{d_{ik} (\Omega_{i} )}}{{ - \delta_{i} }}} \right)^{1/(m - 1)} } \right)^{ - 1}$$

(19)

and the cluster center is:

$$v_{i}^{*} = \frac{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } x_{k} }}{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } }}$$

(20)

and finally,

$$H_{i}^{{*^{ - 1} }} = \frac{1}{{\lambda_{i} \left| {H_{i}^{*} } \right|}}\sum\limits_{k = 1}^{n} {u_{ik}^{m} } (x_{k} - v_{i}^{*} )(x_{k} - v_{i}^{*} )^{T}$$

(21)

Theorem 2 is proved in Appendix 1.
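The membership and center updates of Eqs. (19) and (20) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the distances $d_{ik}$ are assumed to be precomputed, and $\delta_{i}$ is assumed to be negative so that $d_{ik} / ( - \delta_{i} )$ is positive and the power in Eq. (19) is real-valued.

```python
def update_membership(d, delta, m):
    """Eq. (19): u_ik = 1 / (1 + (d_ik / (-delta_i)) ** (1 / (m - 1))).

    d[i][k] is the distance of point k to cluster i, delta[i] the
    parameter of cluster i (assumed negative here), m the fuzzifier.
    """
    if m <= 1:
        raise ValueError("the fuzzy parameter m must be greater than 1")
    u = []
    for d_i, delta_i in zip(d, delta):
        eta = -delta_i  # assumption: delta_i < 0, hence eta > 0
        if eta <= 0:
            raise ValueError("this sketch assumes delta_i < 0")
        u.append([1.0 / (1.0 + (d_ik / eta) ** (1.0 / (m - 1))) for d_ik in d_i])
    return u


def update_centers(u, x, m):
    """Eq. (20): each center is the u_ik^m-weighted mean of the points x_k."""
    dim = len(x[0])
    centers = []
    for u_i in u:
        w = [u_ik ** m for u_ik in u_i]
        total = sum(w)
        centers.append(
            [sum(w_k * x_k[j] for w_k, x_k in zip(w, x)) / total for j in range(dim)]
        )
    return centers


# One cluster, three points on a line, squared distances 0, 1 and 4.
x = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
u = update_membership([[0.0, 1.0, 4.0]], delta=[-1.0], m=2.0)
v = update_centers(u, x, m=2.0)
```

With these inputs the memberships decrease with distance (1, 0.5 and 0.2), and the center is pulled toward the point with the highest membership.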

Now, $FC_{i}$ is the fuzzy covariance matrix that can be defined as follows (Gustafson and Kessel 1978):

$$FC_{i} = \frac{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } (x_{k} - v_{i} )(x_{k} - v_{i} )^{T} }}{{\sum\limits_{k = 1}^{n} {u_{ik}^{m} } }} \, ; \, m > 1$$

(22)

Then, substituting (22) and $\left| {H_{i} } \right| = \upsilon_{i}$ into (21) gives $H_{i}^{{*^{ - 1} }}$:

$$H_{i}^{{*^{ - 1} }} = \left[ {\frac{1}{{\upsilon_{i} \left| {FC_{i} } \right|}}} \right]^{1/\alpha } FC_{i}$$

(23)

where $\alpha$ is the dimension of the feature space.
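To make Eqs. (22) and (23) concrete, the following sketch computes the fuzzy covariance matrix of one cluster and the resulting norm-inducing matrix $H_{i}$. It is a minimal illustration under stated assumptions: the feature space is two-dimensional ($\alpha = 2$) so that the determinant and inverse can be written out by hand, and all function names are hypothetical.

```python
def fuzzy_covariance(u_i, x, v_i, m):
    """Eq. (22): u_ik^m-weighted covariance of the points around center v_i."""
    dim = len(v_i)
    w = [u_ik ** m for u_ik in u_i]
    total = sum(w)
    fc = [[0.0] * dim for _ in range(dim)]
    for w_k, x_k in zip(w, x):
        diff = [x_k[a] - v_i[a] for a in range(dim)]
        for a in range(dim):
            for b in range(dim):
                fc[a][b] += w_k * diff[a] * diff[b]
    return [[value / total for value in row] for row in fc]


def det2(a):
    """Determinant of a 2x2 matrix."""
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def inv2(a):
    """Inverse of a 2x2 matrix."""
    det = det2(a)
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def norm_matrix(fc, upsilon):
    """Eq. (23) with alpha = 2: build H^{-1} from FC, then invert to get H."""
    alpha = 2  # assumption: two-dimensional feature space
    scale = (1.0 / (upsilon * det2(fc))) ** (1.0 / alpha)
    h_inv = [[scale * value for value in row] for row in fc]
    return inv2(h_inv)


# Four points forming a 2 x 1 rectangle, all with full membership.
x = [[0.0, 0.0], [2.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
fc = fuzzy_covariance([1.0, 1.0, 1.0, 1.0], x, v_i=[1.0, 0.5], m=2.0)
h = norm_matrix(fc, upsilon=1.5)
```

A useful sanity check is that the determinant of the returned matrix equals $\upsilon_{i}$, which is exactly the constraint in (17).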

The preceding discussion, together with the re-clustering model, leads to the following proposed algorithm for community detection, which uses both sources of data: the topological structure and the vertex properties. Figure 2 illustrates the proposed algorithm.

The algorithm in Fig. 2 converges when the following condition is met:

$$\lim_{t \to \infty } \left\| {U^{(t)} - U^{(t - 1)} } \right\| = 0$$

(24)

The justification for this condition, together with the convergence of the proposed algorithm, is given in Appendix 2.
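In an implementation, a limit condition of this kind is usually replaced by a finite tolerance. The sketch below shows one way this might look; the choice of the maximum absolute entry as the norm, the tolerance value `eps`, and the function names are all assumptions, not details taken from the paper.

```python
def max_abs_change(u_new, u_old):
    """Largest absolute change between two membership matrices (max norm)."""
    return max(
        abs(a - b)
        for row_new, row_old in zip(u_new, u_old)
        for a, b in zip(row_new, row_old)
    )


def has_converged(u_new, u_old, eps=1e-5):
    """Stop iterating once the membership matrix changes by less than eps."""
    return max_abs_change(u_new, u_old) < eps


u_prev = [[0.90, 0.10], [0.20, 0.80]]
u_curr = [[0.90, 0.10], [0.20, 0.80]]
done = has_converged(u_curr, u_prev)
```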

In the PCMSA algorithm, steps 1–4 detect communities on the basis of structural similarity, and steps 5–11 determine the final communities on the basis of attribute similarity.

The complexity of the proposed algorithm is $O(c*n*t + c*(c - 1)*t*e*n_{g} )$, where c, n, t, e, and $n_{g}$ denote the number of communities, the number of nodes, the number of iterations, the number of attributes, and the number of nodes in the two communities being re-clustered, respectively.
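As a rough illustration of how the two terms compare, take the purely hypothetical values $c = 2$, $n = 10$, $t = 20$, $e = 2$ and $n_{g} = 10$ (these are not taken from the experiments). The operation count suggested by the expression is then:

$$c \cdot n \cdot t + c(c - 1) \cdot t \cdot e \cdot n_{g} = 2 \cdot 10 \cdot 20 + 2 \cdot 1 \cdot 20 \cdot 2 \cdot 10 = 400 + 800 = 1200$$

With these values the re-clustering term is the larger of the two, and it is the only term that grows quadratically in c.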

5 Clustering validation index

Appropriate criteria for evaluating the performance of a clustering process are the number of links inside a community and the number of links leaving it, which underlie most community definitions (Zarandi et al. 2010). Consider a sub-graph $G_{i}$ of a graph G with $\left| {G_{i} } \right| = g_{i}$ and $\left| G \right| = g$. The internal degree of $G_{i}$ is the number of links connecting nodes of the sub-graph to each other, and its external degree is the number of links connecting nodes inside the sub-graph to the remainder of the graph. The quality of the clusters is evaluated with two density measures: intra-cluster density and inter-cluster density. The intra-cluster density is the ratio of the number of internal links of cluster $C_{i}$ to the number of possible internal edges $(L_{i} )$.

$$\mathop {\Delta_{i} }\limits_{i = 1,...,c} = \frac{{\left| {\left. {\{ (p,q)} \right|p,q \in C_{i} ,(p,q) \in L_{i} \} } \right|}}{{\left| {L_{i} } \right|}}$$

(25)

In addition, the inter-cluster density can be defined as the ratio between the number of edges from the nodes of $C_{i}$ to the remainder of the graph and that of possible inter-cluster edges.

$$\mathop {\Delta_{i}^{ext} }\limits_{i = 1,...,c} = \frac{{\left| {\left. {\{ (p,q)} \right|p \in C_{i} ,q \notin C_{i} ,(p,q) \in L^{\prime}_{i} \} } \right|}}{{\left| {L^{\prime}_{i} } \right|}}$$

(26)

In this equation, $L^{\prime}_{i}$ denotes the edges between nodes inside cluster $C_{i}$ and the remainder of the graph.
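The two density measures can be sketched for an undirected simple graph as follows. This is an illustration only: it assumes that the number of possible internal edges is $n_{i} (n_{i} - 1)/2$ and the number of possible inter-cluster edges is $n_{i} (n - n_{i} )$, where $n_{i}$ is the size of the cluster and n the number of nodes, and the function name is hypothetical.

```python
def cluster_densities(edges, cluster, n_nodes):
    """Return (intra-cluster density, inter-cluster density) for one cluster.

    edges is an iterable of undirected edges (p, q) without duplicates,
    cluster the collection of node ids in the cluster, n_nodes the size
    of the whole graph.
    """
    members = set(cluster)
    internal = 0
    external = 0
    for p, q in edges:
        inside = (p in members) + (q in members)
        if inside == 2:
            internal += 1  # both endpoints in the cluster
        elif inside == 1:
            external += 1  # exactly one endpoint in the cluster
    n_i = len(members)
    possible_internal = n_i * (n_i - 1) / 2  # assumption: simple graph
    possible_external = n_i * (n_nodes - n_i)
    intra = internal / possible_internal if possible_internal else 0.0
    inter = external / possible_external if possible_external else 0.0
    return intra, inter


# Two triangles joined by the single edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
intra, inter = cluster_densities(edges, cluster=[1, 2, 3], n_nodes=6)
```

For the triangle {1, 2, 3} all three possible internal edges are present, while only one of the nine possible edges to the rest of the graph exists, so the cluster is dense inside and sparse outside.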

For $C_{i}$ to be a community with homogeneous attributes, homogeneity is expected to be high inside the community and low between communities. This is captured by the separation measure, which is defined as follows:

$$\hom_{i} = \frac{{\sum\limits_{j = 1}^{c} {\sum\limits_{{z \in g_{i} }} {\sum\limits_{{z^{\prime} \in g_{j} }} {\left\| {x_{z} - x_{{z^{\prime}}} } \right\|} } } }}{{\sum\limits_{w = 1}^{c - 1} {\sum\limits_{h = w + 1}^{c} {\sum\limits_{{p \in g_{w} }} {\sum\limits_{{p^{\prime} \in g_{h} }} {\left\| {x_{p} - x_{{p^{\prime}}} } \right\|} } } } }}$$

(27)

The suggested index considers two criteria: compactness and separation. The compactness measure is determined from the intra-cluster and inter-cluster densities of the communities: $\Delta dense_{i} = \Delta_{i} - \Delta_{i}^{ext}$.

As a result, a desirable community is one with a high level of compactness and a high level of separation. The goal of our algorithm is therefore to find the best trade-off between density and homogeneity, and one way to do so is to maximize:

$$\Lambda_{i} = \Delta dense_{i} *\hom_{i}$$

(28)

It is expected that by maximizing $\Lambda_{i}$, $C_{i}$ can be a community. By considering this criterion for each community, the validity index $\Lambda$ is determined by Eq. (29):

$$\Lambda = \Delta dense*\hom$$

(29)

In this equation $\Delta dense$ is the average of $\Delta dense_{i} (i = 1,...,c)$ and $\hom$ is the average of $\hom_{i} (i = 1,...,c)$. Equation (29) is considered the criterion to assess the performance of the proposed community detection and determines the most favorable number of clusters.
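The computation of the validity index can be sketched as follows. The code implements the separation measure literally as written above, so the sum over j also includes $j = i$; the function names, the use of the Euclidean norm, and the toy data are assumptions made for illustration.

```python
import math


def hom_values(x, groups):
    """Separation measure hom_i for every community.

    x[node] is the attribute vector of a node, groups a list of node-id
    lists, one per community.
    """
    def cross_sum(g_a, g_b):
        # Sum of Euclidean distances over all pairs (p in g_a, q in g_b).
        return sum(math.dist(x[p], x[q]) for p in g_a for q in g_b)

    c = len(groups)
    denominator = sum(
        cross_sum(groups[w], groups[h])
        for w in range(c - 1)
        for h in range(w + 1, c)
    )
    return [
        sum(cross_sum(g_i, g_j) for g_j in groups) / denominator
        for g_i in groups
    ]


def validity_index(dense, hom):
    """Eq. (29): product of the average Delta-dense_i and the average hom_i."""
    return (sum(dense) / len(dense)) * (sum(hom) / len(hom))


# Four nodes with 2-D attributes, split into two communities.
x = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
hom = hom_values(x, groups=[[0, 1], [2, 3]])
index = validity_index(dense=[0.5, 0.7], hom=hom)
```

In practice the `dense` values would come from the intra- and inter-cluster densities of each community, and the index would be evaluated for several candidate values of c to pick the most favorable one.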

6 Experimental results

In this section, the performance of the proposed model is tested on several artificial and large-scale real networks.

Example 1

In the first step, a simple dataset with 10 nodes is considered, as shown in Fig. 3a. It represents a co-worker graph in which nodes are workers and edges indicate relationships between them. Each number is a worker ID. In addition, two attributes describe each node: the first letter indicates gender (Male “M” or Female “F”) and the second letter indicates where the worker lives (Montreal “M” or Toronto “T”). As shown in Fig. 3a, workers 4, 6 and 7 have the same properties, worker 3 is male and lives in Montreal, and the others have the same properties. Suppose the number of clusters is $c = 2$. Depending on the clustering criterion, several different clusterings are obtained:

Figure 3b shows clusters based on structural similarity, which considers only the relationships between co-workers and ignores their attributes, so each cluster contains a mix of genders and places. With this clustering method, the co-workers in each cluster are closely connected and coherent.

Figure 3c shows clusters based on attribute similarity. Here the clusters contain co-workers with the same properties as far as possible, so the resulting clusters are homogeneous, but the vertex connectivity may not be taken into account.

Figure 3d illustrates the result of the proposed clustering algorithm, which considers both sources of data: the topological structure and the node attributes. The workers in one cluster are closely connected and have the same properties, so the resulting clusters are both coherent and homogeneous. In addition, as discussed earlier in Sect. 4, the proposed algorithm detects overlapping communities, with node 3 assigned to both communities.

In this section, the behavior of the proposed model is examined with respect to the fuzzy parameter m and $\upsilon$ ($\left| H \right|$ in (17)). To find the best value of the fuzzy parameter m with the proposed validity index, the values of $1 - \Lambda$ are computed for different values of m and c (number of communities). Minimizing $1 - \Lambda$ gives 2.5 as the best value for m.

The membership values generated by the proposed model for different values of $\upsilon$ are shown in Table 1. In Fig. 3d the left cluster is cluster 1 and the other is cluster 2. Increasing $\upsilon$ alone lets the cluster shapes vary without limit, which, depending on the data, can produce clusters that lack homogeneous attributes or coherent structure. The membership history of two critical nodes is shown in Table 1. Note that node 3 is strongly tied to cluster 1 for $\upsilon = 0.5$ and $\upsilon = 1$, and its memberships start to form correctly from $\upsilon = 1.5$. Moreover, the memberships of node 4 start to form incorrectly from $\upsilon = 2$. Therefore, according to Table 1, $\upsilon$ can be set to 1.5, which yields reasonable membership values. Moreover, node 3 and node 8 have only a weak tie, since they are not co-workers, yet the PCMSA algorithm assigns them to the same cluster because of their similar attributes. The same holds for node 3 with respect to nodes 9 and 10 (Fig. 4).

Table 1 Membership functions history for different values of $\upsilon$
