1 Introduction

Unraveling community structure in a complex network has been extensively investigated in the past two decades (Fortunato and Hric 2016; Fortunato 2010). A community is intuitively regarded as a group of nodes closely connected inside the group but rarely connected with nodes outside the group (Girvan and Newman 2002). In the real world, communities appear in many forms, for example, social circles manually categorized by users in ego networks (McAuley and Leskovec 2012), authors from the same institution in collaboration networks (Newman 2001), proteins with the same functionality in biochemical networks (Girvan and Newman 2002), etc. The research topic of finding such groups is known as community detection.

While classic community detection imposes the restriction that a node belongs to one and only one community, real-world networks do not usually behave this way. For example, it is common for a person in a social network to have multiple social identities, e.g., an alumna, a family member, a club member, a star fan, an employer, etc. Each social identity can be defined by a social community, which implies that these communities overlap with each other. Overlapping community detection is the emerging topic that aims to discover such overlapping communities, and it has become a research trend in the last decade.

Unlike classic community detection, overlapping community detection cannot be directly turned into a traditional graph clustering (i.e., node clustering) problem. Thus, many heuristics have been proposed to deal with this task (Xie et al. 2013). Matrix factorization (MF), one of the standard frameworks for overlapping community detection, detects communities from a global view. Taking the adjacency matrix A of the given network as input, matrix factorization-based methods fix the number of communities in advance and seek a node-community membership matrix F that matches the information revealed by the input as accurately as possible. A typical way to achieve this is to use an objective function involving both A and F. Most previous work (Psorakis et al. 2011; Wang et al. 2011) simply adopts a value-approximation-based objective function which approximates A entry by entry with the product of factorized matrices. However, an entry in A is a label representing whether there is a link between two nodes, whereas an entry in F is a real value representing the weight of a node in a community. The value-approximation-based objective function clearly suffers from the mismatch between a label and a real value.

In order to tackle this mismatch issue, we propose a Preference-based Non-negative Matrix Factorization (PNMF) model (Zhang et al. 2015a). This model is established on two intuitions: (1) a node is more likely to build links with other nodes in the same community than with those outside its community, and (2) a link reflects the preferences of both sides to some extent, e.g., when Alice and Bob are friends, it is reasonable to argue that Alice prefers to connect with Bob rather than with strangers, and vice versa. These intuitions can be regarded as one aspect of homophily in social networks (McPherson et al. 2001). Different from previous value-approximation-based objectives, our model maximizes the probability that a node's preference on neighbors is larger than its preference on non-neighbors, which avoids the mismatch between a real value and a label. Specifically, assuming that \(A_{i,j} = 1\) and \(A_{i,k} = 0\), value-approximation-based objectives try to make \(F_iF_j^\mathrm{T}\) close to 1 and \(F_iF_k^\mathrm{T}\) close to 0, while our preference-based objective tries to make \(F_iF_j^\mathrm{T}\) as much larger than \(F_iF_k^\mathrm{T}\) as possible while making no assumption about their values.

However, the PNMF model simply separates nodes into two parts, i.e., neighbors and non-neighbors, ignoring the fact that not all non-neighbors should be treated equally. Inspired by the famous saying "my friend's friend is also my friend", we propose a Locality-based Non-negative Matrix Factorization (LNMF) model (Zhang et al. 2015b) that refines the preference system of the PNMF model by further splitting the non-neighbors into two parts, namely "local non-neighbors" and "distant non-neighbors". We define a "K-degree local network" to distinguish these two kinds of non-neighbors. We modify the objective function of PNMF to explicitly reflect our new assumptions: (1) neighbors are preferred to local non-neighbors, and (2) local non-neighbors are preferred to distant non-neighbors.

We employ projected stochastic gradient descent as our learning algorithm and provide efficient sampling strategies for both the PNMF and LNMF models. To accelerate the learning process, we implement a parallel paradigm to update parameters simultaneously with multiple processors. Experiments conducted on various benchmark datasets show that both models outperform state-of-the-art overlapping community detection approaches and that the LNMF model detects communities of better quality than the PNMF model. The results provide strong evidence that preference and locality help improve the performance of overlapping community detection.

To sum up, in this paper, we present the PNMF model (Zhang et al. 2015a) and its generalization, the LNMF model (Zhang et al. 2015b), in a systematic way, and design a parallel paradigm for both models to speed up parameter learning. We conduct comparisons in experiments and show the improvement of this parallel paradigm over the original single-process implementation.

Roadmap The rest of this paper is organized as follows. In Sect. 2, we give the formal definition of the community detection problem. In Sect. 3, we review the most related work in more detail. We present our PNMF model and its generalized version, the LNMF model, for overlapping community detection in Sects. 4 and 5, respectively, together with the corresponding learning techniques, sampling strategies and parameter tuning issues. We show our experimental results in Sect. 6, followed by the conclusions in Sect. 7.

2 Problem definition

To formally define the problem of community detection, we need a graph in the first place. We denote the graph by G(V, E), where V is the node set and E is the link (edge) set.

Definition 1

(Community) A community C is a subset of V with a certain characteristic.

According to this definition, nodes in a community share the same characteristic and thus are more likely to make connections with each other. Therefore, a community usually has stronger internal connections and weaker external connections, which is directly proposed as the definition of a community in Girvan and Newman (2002).

Definition 2

(Community detection) Given a graph G(V, E), community detection aims to find a set of communities \(\mathbb {S} = \{C_i \mid C_i\ne \emptyset , C_i \ne C_j \text { for } i\ne j, 1\le i,j \le p\}\), which minimizes an objective function f, i.e.,

$$\begin{aligned} \mathop {arg\, min} \limits _\mathbb {S} f(G,\mathbb {S}), \end{aligned}$$
(1)

where p is the number of communities.

While classic community detection requires exhaustive and disjoint communities, i.e., \(C_1\cup \cdots \cup C_p = V\) and \(C_i \cap C_j = \emptyset\) for any \(i\ne j\), overlapping community detection has no such constraints. This relaxation makes it more suitable for real-world scenarios.

As we mentioned, matrix factorization is a popular class of methods to deal with the overlapping community detection problem. It sets the number of communities in advance and then assigns each node to its corresponding communities based on an optimization objective. We formally define matrix factorization approach for overlapping community detection as follows.

Definition 3

(Overlapping community detection via matrix factorization) Given a graph G(V, E) with its adjacency matrix \(A\in \{0,1\}^{n\times n}\), the objective of overlapping community detection via matrix factorization is to find a node-community membership matrix \(F\in \mathbb {R}^{n\times p}\), whose entry \(F_{u,c}\) represents the weight of node \(u\in V\) in community \(c\in C\), that minimizes a loss function l, i.e.,

$$\begin{aligned} \arg \min _{F} l(A,F), \end{aligned}$$
(2)

 where n is the number of nodes, p is the number of communities and C is the set of communities. In the end, we obtain the final set of communities C according to F.
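To make this setup concrete, the following minimal sketch (in Python with NumPy; the toy graph, the number of communities and the threshold are our own illustrative choices, not taken from the paper) shows the objects involved: an adjacency matrix A, a membership matrix F, and the extraction of communities from F.

```python
import numpy as np

# Toy 4-node path graph 0-1-2-3 (symmetric adjacency matrix).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

n, p = A.shape[0], 2          # n nodes, p communities (p is fixed in advance)
F = np.random.rand(n, p)      # node-community membership matrix, to be learned

def communities_from_F(F, threshold):
    """Assign node u to community c whenever F[u, c] >= threshold."""
    return [set(np.where(F[:, c] >= threshold)[0]) for c in range(F.shape[1])]

print(communities_from_F(F, threshold=0.5))
```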

3 Related work

3.1 Overlapping community detection

Massive efforts have been devoted to the research of community detection in the past two decades (Danon et al. 2005; Fortunato 2010; Lancichinetti and Fortunato 2009; Leskovec et al. 2010). Since most classic community detection methods disallow multiple memberships of a node, overlapping community detection has attracted more and more attention in the past decade, and many approaches have been proposed to solve this specific problem (Xie et al. 2013). According to their key ideas, we classify these approaches into local approaches and global approaches.

Local approaches employ a divide-and-conquer strategy to deal with subgraphs of relatively small size instead of the whole network. For example, clique percolation first finds all the k-cliques in the network and combines those sharing \(k-1\) nodes until no further combinations can be made (Kumpula et al. 2008; Palla et al. 2005). Demon applies a label propagation algorithm to detect small communities on the ego network of each node, i.e., a node with its neighbors and the corresponding links, and finally merges those small communities with large overlap into the final communities (Coscia et al. 2012). Seed set expansion approaches first pick seed nodes with a seeding strategy and then run particular algorithms on each seed node to expand the community around it (Li et al. 2018; Whang et al. 2016).

Global approaches, on the other hand, assume that communities exist in the first place and aim to assign each node to a subset of all communities. Some typical frameworks include game theory (Chen et al. 2010), stochastic block models (Airoldi et al. 2008; Jin et al. 2015), and matrix factorization (MF). Here we review several important works in the matrix factorization class. Psorakis et al. (2011) propose the earliest method, which uses the basic \(||A-WH^{\rm T}||\) as its optimization objective. Due to the vague social meaning of W and H, Wang et al. (2011) refine the objective function to \(||A-FF^{\rm T}||\). Zhang and Yeung (2012) extend the matrix factorization model to a matrix tri-factorization model by incorporating a community interaction matrix B, which results in an objective function of \(||A-FBF^{\rm T}||\). Yang and Leskovec (2013) explicitly define the probability of having an edge between u and v as a function of \(F_u\) and \(F_v\), and then build the likelihood function by fitting the original graph. Most recently, Wu et al. (2018) introduce a hypergraph as regularization, where hyperedges are created based on structural similarity. Though these objective functions are different, they are all based on value approximation, which is problematic because the 0/1 value in the adjacency matrix is more like a label than a value.

3.2 Bayesian personalized ranking

Bayesian personalized ranking (BPR) (Rendle et al. 2009b) was originally proposed to rank items for a user in recommender systems when only implicit feedback is available. Distinguished from explicit feedback like ratings, implicit feedback only records whether a user has labeled an item, e.g., click history, purchase history, etc. The basic assumption of BPR is that a user prefers labeled items to unlabeled ones. While traditional methods replace missing values with zeros or negative ones, BPR uses pairwise preferences as training data to learn the model parameters. Technically, the objective function maximizes a posterior probability \(p(\Theta |>_u)\), where \(\Theta\) is the parameter and \(>_u\) is a latent preference structure for user u. BPR has become a widely used model in one-class collaborative filtering, and several further works build on top of it. For example, Rendle et al. (2009a) extend the original matrix factorization to a tensor factorization to recommend personalized tags for a user given an item. Zhao et al. (2014) leverage social connections to improve item recommendation by building a new preference system.

We adapt the basic assumption of BPR to the overlapping community detection task. To be specific, a node prefers its neighbors over its "non-neighbors". Our objective function maximizes the probability \(p(>_u|F)\), where F is a non-negative matrix representing the latent node-community membership. Moreover, our extension of the basic assumption by incorporating locality is similar to that in Zhao et al. (2014).

4 A preference-based non-negative matrix factorization (PNMF) model

In this section, we introduce our PNMF model for overlapping community detection. We start with model assumptions, followed by model formulation and parameter learning scheme. Particularly, we will briefly illustrate our parallel paradigm for parameter learning acceleration.

4.1 Preliminaries

The set of i's neighbors is denoted by \(N^+(i)\), and \(N^-(i) := N^+(i)^c \backslash \{i\}\) is defined as the set of "non-neighbors" of i, where \(N^+(i)^c\) denotes the complement of \(N^+(i)\). It follows that \(V = N^+(i)\cup N^-(i)\cup \{i\}\) for every i. Moreover, we define a learning set \(S\subseteq V\times V \times V\) by

$$\begin{aligned} S=\{(i,j,k)|i\in V, j\in N^+(i), k\in N^-(i)\}, \end{aligned}$$

which consists of all the triples (i, j, k) where j is a neighbor of i while k is not.

4.2 Model assumption

Our PNMF model is based on the intuitive idea that two nodes are more likely to connect with each other if they share more common communities. If we treat a link as the consequence of mutual preference, we can safely claim that a node has a higher preference on its neighbors than on its non-neighbors. Thus, the main assumption of our PNMF model can be formally stated as

$$\begin{aligned} r_{u,i}\ge r_{u,j}, \quad \text {if } A_{u,i}=1\,\, \text { and }\,\, A_{u,j}=0, \end{aligned}$$
(3)

where \(r_{u,i}\) is the preference of node u on node i, i.e., how much u wants to build a link with i, and \(A_{u,i}\) is the corresponding entry in adjacency matrix A.

Moreover, we need two more independent assumptions before we start model formulation.

  1.

    Node independence. The preference order of each node is independent of that of any other node. There will be a link between u and v if and only if u prefers to connect with v and, symmetrically, v prefers to connect with u.

  2.

    Pair independence. For a fixed node i, its preference between the pair (j, k) is independent of its preference between the pair (u, v), given \((j,k)\ne (u,v)\), \(j,u\in N^+(i)\) and \(k,v \in N^-(i)\).

4.3 Model formulation

The goal of our PNMF model is to find a node-community membership matrix that maximizes the likelihood of the observed preference orders of all nodes. According to the "node independence" assumption, the overall likelihood can be expressed as a product over the individual nodes. Thus, our objective function can be written as

$$\begin{aligned} \max _{F\in \mathbb {R}_{+}^{n\times p}} \prod _{i\in V}\mathcal {P}(>_i|F), \end{aligned}$$
(4)

where \(>_i\) denotes the observed preferences for node i according to adjacency matrix A and F is the node-community membership matrix.

According to our main assumption as well as the “pair independence” assumption, the probability of preference order for a single node i can be written as

$$\begin{aligned} \mathcal {P}(>_i|F)&=\prod _{(j,k)\in V\times V}\mathcal {P}(j>_ik|F)^{\delta (j\in N^+(i))\delta (k\in N^-(i))}\cdot (1-\mathcal {P}(j>_ik|F))^{1-\delta (j\in N^+(i))\delta (k\in N^-(i))}\\ &=\prod _{(j,k)\in V\times V}\mathcal {P}(j>_ik|F)^{\delta ((i,j,k)\in S)}\cdot (1-\mathcal {P}(j>_ik|F))^{\delta ((i,j,k)\notin S)}, \end{aligned}$$
(5)

where S is the learning set mentioned in preliminaries and \(\delta\) is the indicator function, i.e.,

$$\begin{aligned} \delta (a) = \left\{ \begin{array}{ll} 1 & \quad \text {if }\, a \text { is true,}\\ 0 & \quad \text {otherwise.} \end{array} \right. \end{aligned}$$

For a triple (i, j, k), if \((i,j,k)\in S\), then \((i,k,j)\notin S\). Given \(\mathcal {P}(j>_ik|F)+\mathcal {P}(k>_ij|F)=1\), we have \(\mathcal {P}(j>_ik|F)^{\delta ((i,j,k)\in S)} = (1-\mathcal {P}(k>_ij|F))^{\delta ((i,k,j)\notin S)}\). Applying this to Eq. (5), maximizing \(\mathcal {P}(>_i|F)\) is equivalent to

$$\begin{aligned} \max _{F\in \mathbb {R}_{+}^{n\times p}} \prod _{(j,k)\in V\times V}\mathcal {P}(j>_ik|F)^{\delta ((i,j,k)\in S)}. \end{aligned}$$
(6)

Combining Eqs. (4) and (6), our objective function can be rewritten as

$$\begin{aligned} \max _{F\in \mathbb {R}_{+}^{n\times p}} \prod _{(i,j,k)\in S}\mathcal {P}(j>_ik|F). \end{aligned}$$
(7)

Now the problem becomes how to define the preference of one node on another. A node's community membership vector is the corresponding row of the node-community membership matrix F. Thus, the more communities two nodes share, the more similar their membership vectors will be. Considering both our intuition that two nodes have a higher probability of being linked if they share more communities and our main assumption that a node has a higher preference on its neighbors, we define the probability that i prefers j over k, given the node-community membership matrix, as

$$\begin{aligned} \mathcal {P}(j>_ik|F) = \sigma (F_i\cdot F_j^{\rm T}-F_i\cdot F_k^{\rm T}), \end{aligned}$$
(8)

where \(\sigma\) is the sigmoid function \(\sigma (x):=\frac{1}{1+e^{-x}}\).

The sigmoid function maps any real number into (0, 1). The probability that i prefers j over k is 0.5 when \(F_iF_j^{\rm T}=F_iF_k^{\rm T}\), approaches 0 when \(F_iF_j^{\rm T}\ll F_iF_k^{\rm T}\), and approaches 1 when \(F_iF_j^{\rm T}\gg F_iF_k^{\rm T}\). These properties precisely satisfy the requirements of our model.

For simplicity, we define \(\hat{x}(i,j):=F_i\cdot F_j^{\rm T}\). Equation (8) can be rewritten as

$$\begin{aligned} \mathcal {P}(j>_ik|F) = \sigma (\hat{x}(i,j)-\hat{x}(i,k)). \end{aligned}$$
(9)

Now combining Eqs. (7), (8), and (9), we can construct the final objective function of our PNMF model as

$$\begin{aligned} l(F) &:= \max _{F\in \mathbb {R}_{+}^{n\times p}} \ln \prod _{(i,j,k)\in S}\mathcal {P}(j>_ik|F)-\lambda \cdot reg(F)\\ &= \max _{F\in \mathbb {R}_{+}^{n\times p}} \sum _{(i,j,k)\in S} \ln \mathcal {P}(j>_ik|F)-\lambda \cdot reg(F)\\ &= \max _{F\in \mathbb {R}_{+}^{n\times p}} \sum _{(i,j,k)\in S}\ln \sigma (\hat{x}(i,j)-\hat{x}(i,k))-\lambda \cdot reg(F), \end{aligned}$$
(10)

where reg(F) is the regularization term we add to avoid over-fitting and \(\lambda\) is the regularization parameter. We choose the squared Frobenius norm as the regularization term, i.e., \(reg(F)=\frac{1}{2}||F||_F^2\), because it is differentiable and fits our parameter learning process.
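For reference, here is a minimal sketch of Eq. (10): the preference probability of Eq. (8) and the regularized log-likelihood evaluated over a set of sampled triples. The function names and toy inputs are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def pref_prob(F, i, j, k):
    """P(j >_i k | F) = sigmoid(F_i . F_j^T - F_i . F_k^T), cf. Eq. (8)."""
    x = F[i] @ F[j] - F[i] @ F[k]
    return 1.0 / (1.0 + np.exp(-x))

def pnmf_objective(F, triples, lam):
    """Regularized log-likelihood of Eq. (10) over sampled triples (i, j, k)."""
    ll = sum(np.log(pref_prob(F, i, j, k)) for (i, j, k) in triples)
    return ll - lam * 0.5 * np.linalg.norm(F, 'fro') ** 2

# toy usage
F = np.random.rand(5, 2)                 # 5 nodes, 2 communities
triples = [(0, 1, 3), (2, 1, 4)]         # (node, neighbor, non-neighbor)
print(pnmf_objective(F, triples, lam=0.001))
```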

4.4 Parameter learning

We employ the widely used stochastic gradient descent (SGD) as our learning method because it is efficient and easy to implement. In each updating step, SGD randomly selects a triple from the learning set S and updates the corresponding parameters \(\Theta\) by moving along the gradient direction,

$$\begin{aligned} \Theta ^{t+1} = \Theta ^t + \alpha \frac{\partial l}{\partial \Theta }, \end{aligned}$$
(11)

where \(\alpha\) is the learning rate. Specifically, the gradient in Eq. (11) is calculated by

$$\begin{aligned} \frac{\partial l}{\partial \Theta } &= \frac{\partial }{\partial \Theta }\ln \sigma (\hat{x}(i,j)-\hat{x}(i,k))-\lambda \frac{\partial }{\partial \Theta }reg(F)\\ &= \frac{e^{\hat{x}(i,k)-\hat{x}(i,j)}}{1+e^{\hat{x}(i,k)-\hat{x}(i,j)}}\cdot \frac{\partial }{\partial \Theta }(\hat{x}(i,j)-\hat{x}(i,k))-\lambda \Theta \end{aligned}$$
(12)

and

$$\begin{aligned} \frac{\partial }{\partial \Theta }(\hat{x}(i,j)-\hat{x}(i,k)) = \left\{ \begin{array}{ll} F_{j,t}-F_{k,t} & \quad \text {if }\, \Theta = F_{i,t},\\ F_{i,t} & \quad \text {if }\, \Theta = F_{j,t},\\ -F_{i,t} & \quad \text {if }\, \Theta = F_{k,t},\\ 0 & \quad \text {otherwise,} \end{array} \right. \end{aligned}$$
(13)

where \(\lambda\) is the regularization parameter. Regarding the non-negativity constraint, we exploit the idea of projected gradient methods for non-negative matrix factorization, which project any negative parameter value back to zero after each update (Lin 2007).

The whole learning process is described in Algorithm 1 and the sampling strategy is described in Algorithm 2. Let the sample size be t. The time complexity of each iteration is O(tp) and the space complexity is O(np), where n is the number of nodes and p is the number of communities, since we need to keep the node-community membership matrix in memory.

[Algorithm 1: parameter learning for the PNMF model; Algorithm 2: triple sampling strategy (figures not reproduced)]
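Since the algorithm figures are not reproduced here, the following sketch illustrates one projected SGD step in the spirit of Eqs. (11)–(13): sample a triple, update the three affected rows of F along the gradient, and project negative entries back to zero. The sampling helper, the hyper-parameter values and the toy graph are illustrative assumptions rather than the paper's exact implementation.

```python
import random
import numpy as np

def sample_triple(neighbors, n):
    """Sample (i, j, k): j is a neighbor of i, k is a non-neighbor of i."""
    i = random.randrange(n)
    while not neighbors[i]:                       # skip isolated nodes
        i = random.randrange(n)
    j = random.choice(tuple(neighbors[i]))
    k = random.randrange(n)
    while k == i or k in neighbors[i]:
        k = random.randrange(n)
    return i, j, k

def sgd_step(F, i, j, k, alpha=0.05, lam=0.001):
    """One projected SGD update for the triple (i, j, k), cf. Eqs. (11)-(13)."""
    Fi, Fj, Fk = F[i].copy(), F[j].copy(), F[k].copy()
    x = Fi @ Fj - Fi @ Fk                         # x_hat(i, j) - x_hat(i, k)
    coeff = 1.0 / (1.0 + np.exp(x))               # d ln(sigma(x)) / dx = sigma(-x)
    F[i] = np.maximum(Fi + alpha * (coeff * (Fj - Fk) - lam * Fi), 0.0)
    F[j] = np.maximum(Fj + alpha * (coeff * Fi - lam * Fj), 0.0)
    F[k] = np.maximum(Fk + alpha * (-coeff * Fi - lam * Fk), 0.0)
    return F

# toy usage with the path graph 0-1-2-3
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
F = np.random.rand(4, 2)
for _ in range(100):
    F = sgd_step(F, *sample_triple(neighbors, n=4))
print(F)
```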

4.5 A parallel paradigm for accelerating parameter learning

In Algorithm 1, the execution time of each iteration is proportional to the sample size, and it increases significantly when the network scales to hundreds of thousands of nodes. To accelerate the learning process, we design a parallel paradigm that distributes the stochastic gradient descent (SGD) work over multiple processes.

The most time-consuming steps in Algorithm 1 are Steps 5 to 10, i.e., triple sampling and parameter updating. The idea of our parallel paradigm is straightforward: given n sub-processes, each sub-process conducts sampling and updating independently in one iteration. Since the node-community membership matrix F costs a lot of memory to store, we cannot make a copy of it for each process. In the ideal case, we expect this paradigm to run n times as fast as the single-process paradigm. However, while sampling always works in parallel, the updating part suffers from synchronization problems. Here we discuss two possible solutions.

The first solution is to simply use locks. When a sub-process updates a particular row of the node-community membership matrix, a lock is assigned to this row to prevent other sub-processes from updating it. The correctness of this paradigm is easily justified, since two processes only update simultaneously when there is no collision. However, if collisions happen frequently, the computational efficiency will be severely affected, and in the worst case the time cost will be as high as that of the single-process paradigm.

The other solution is a lock-free paradigm (Recht et al. 2011). In this paradigm, each sub-process has read-only access to the global node-community membership matrix and keeps its updating results in its own memory. After each iteration, the main process updates the global node-community membership matrix according to the information it collects from all sub-processes. To be specific, for each row, the changes \(\Delta\) from all sub-processes are summed and the sum is added to this row of the global node-community membership matrix. Thus, in one iteration, each sub-process starts from the same node-community membership matrix from the previous iteration and is totally independent of the others. When no collision happens, this lock-free paradigm is exactly the same as the previous paradigm. In contrast, when collisions happen frequently, this paradigm avoids any waiting time among sub-processes within an iteration. Although it may introduce some errors in the updates, the updates still move along the gradient in the right direction. Since we care more about speed, the lock-free paradigm is clearly the better choice for implementation.
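The following is a minimal sketch of this lock-free scheme using Python's multiprocessing: each worker reads a snapshot of F, accumulates its own \(\Delta\) over a batch of sampled triples, and the main process sums the deltas and applies them once per iteration. All names, the batch size and the process count are illustrative; for simplicity this sketch passes F to each worker, whereas the paradigm described above gives sub-processes read-only access to a single shared copy.

```python
import numpy as np
from multiprocessing import Pool

_F = None           # read-only snapshot of the membership matrix (worker side)
_NEI = None         # adjacency as a dict of neighbor sets

def _init(F, neighbors):
    global _F, _NEI
    _F, _NEI = F, neighbors

def _worker(args):
    """Sample `batch` triples and return the accumulated delta on F."""
    batch, alpha, lam, seed = args
    rng = np.random.default_rng(seed)
    n = _F.shape[0]
    delta = np.zeros_like(_F)
    for _ in range(batch):
        i = int(rng.integers(n))
        if not _NEI[i]:
            continue
        j = int(rng.choice(list(_NEI[i])))
        k = int(rng.integers(n))
        while k == i or k in _NEI[i]:
            k = int(rng.integers(n))
        x = _F[i] @ _F[j] - _F[i] @ _F[k]
        c = 1.0 / (1.0 + np.exp(x))
        delta[i] += alpha * (c * (_F[j] - _F[k]) - lam * _F[i])
        delta[j] += alpha * (c * _F[i] - lam * _F[j])
        delta[k] += alpha * (-c * _F[i] - lam * _F[k])
    return delta

def lock_free_iteration(F, neighbors, n_proc=4, batch=1000, alpha=0.05, lam=0.001):
    with Pool(n_proc, initializer=_init, initargs=(F, neighbors)) as pool:
        deltas = pool.map(_worker, [(batch, alpha, lam, s) for s in range(n_proc)])
    F += sum(deltas)                      # apply all sub-process updates at once
    np.maximum(F, 0.0, out=F)             # projection to non-negativity
    return F

if __name__ == "__main__":
    neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    F = np.random.rand(4, 2)
    print(lock_free_iteration(F, neighbors, n_proc=2, batch=100))
```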

4.6 Other issues

There are several remaining details to be discussed.

  • The number of communities. The nature of matrix factorization requires us to set the number of communities, which is unknown in advance. A cross-validation paradigm is used. In detail, we first reserve \(10\%\) of the nodes as a validation set. After learning the node-community membership matrix F, we compute the sum of the log-likelihood function over all nodes in the validation set via Eq. (17). Since the computational cost of cross-validation is huge, only a small set of quadruples is sampled.

  • The community membership threshold. Obtaining the node-community membership matrix F is still one step away from the final node-community assignment. We need to set a threshold \(\delta\) to decide whether a community accepts a node (a small numeric illustration of this computation follows this list): if \(F_{u,c}\ge \delta\), we say that node u belongs to community c. According to Eq. (8), we need \(p(j>_ik|F)\) to be closer to 1 than to 0 if i prefers j over k. We assume that i, j share exactly one community and i, k do not share any community, so \(F_iF_k^{\rm T}=0\). Due to the symmetry of i and j, we have

    $$\begin{aligned} \sigma (F_iF_j^{\rm T}-F_iF_k^{\rm T}) = \sigma (\delta ^2-0)=\frac{1}{1+e^{-\delta ^2}}=\beta , \end{aligned}$$

    where \(\beta\) is in the range of (0.5, 1). When \(\beta\) is given, we can compute \(\delta\) by

    $$\begin{aligned} \delta = \sqrt{-\ln \left(\frac{1}{\beta }-1\right)}. \end{aligned}$$
    (14)
  • The convergence criterion. First, we randomly generate a subset of triples as our loss sample and compute the initial loss on this set according to Eq. (10). After each iteration, we compute the loss again and stop stochastic gradient descent when the absolute difference between the current loss and the previous loss is smaller than a very small percentage, say \(\epsilon\), of the initial loss.
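As a small numeric illustration of Eq. (14), with an illustrative choice of \(\beta\):

```python
import math

beta = 0.9                                    # desired preference confidence, illustrative choice
delta = math.sqrt(-math.log(1.0 / beta - 1))  # Eq. (14)
print(round(delta, 3))                        # ~1.482, i.e., F[u, c] >= 1.482 assigns u to c
```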

5 A locality-based non-negative matrix factorization (LNMF) model

In this section, we first define the concept of K-degree locality and formalize our LNMF model for community detection. Then we discuss our parameter learning process, including the sampling strategy.

5.1 Preliminaries

Definition 4

(K-degree local network) Given an undirected and unweighted graph G, for a node \(u\in G\), u’s K-degree local network \(L_K(u)\) is the subgraph consisting of all nodes whose shortest path length to u is less than or equal to K.

According to the definition above, \(L_0(u)\) consists of only node u, \(L_1(u)\) is the subgraph including node u and all its neighbors, \(L_\infty (u)\) is the whole graph, and so on. We denote the node set of \(L_t(u)\) excluding u itself by \(V_t(u)\), where \(t=1,2,\ldots\)

Now we further define the terms of “local non-neighbors” and “distant non-neighbors”.

Definition 5

(K-degree local non-neighbors) Given a K-degree local network \(L_K(u)\), the set of K-degree local non-neighbors \(S_K(u)\) is defined as \(S_K(u):=L_K(u)\backslash L_1(u)\), where \(K\ge 1\).

Definition 6

(K-degree distant non-neighbors) Given a K-degree local network \(L_K(u)\), the set of K-degree distant non-neighbors \(T_K(u)\) is defined as \(T_K(u):=L_\infty (u)\backslash L_K(u)\), where \(K\ge 1\).
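A short sketch of how these sets might be computed for a given node, using networkx shortest-path lengths with a cutoff of K (the helper name and the toy graph are our own):

```python
import networkx as nx

def local_sets(G, u, K):
    """Return (N_plus, S_K, T_K) for node u, following Definitions 4-6 (K >= 1)."""
    dist = nx.single_source_shortest_path_length(G, u, cutoff=K)
    V_K = set(dist) - {u}                      # nodes within distance K, excluding u
    N_plus = set(G.neighbors(u))               # neighbors of u
    S_K = V_K - N_plus                         # K-degree local non-neighbors
    T_K = set(G.nodes) - set(dist)             # K-degree distant non-neighbors
    return N_plus, S_K, T_K

G = nx.path_graph(5)                           # toy graph: 0-1-2-3-4
print(local_sets(G, 0, K=2))                   # ({1}, {2}, {3, 4})
```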

Table 1 A summary of notations

A summary of notations is shown in Table 1. Four simple propositions can be drawn from the above notations.

Proposition 1

\(V_K(u) = N^+(u)\bigcup S_K(u)\).

Proposition 2

\(N^+(u)\bigcap S_K(u) = \emptyset\).

Proposition 3

\(N^-(u) = S_K(u)\bigcup T_K(u)\).

Proposition 4

\(S_K(u)\bigcap T_K(u)= \emptyset\).

5.2 Model assumption

Recall the main assumption of the PNMF model in Eq. (3). Incorporating the concept of K-degree local network, we can employ K-degree local non-neighbors to enhance the model assumption. The new model assumption for our LNMF model can be represented as

$$\begin{aligned} r_{i,j}\ge r_{i,k} \quad \text {and} \quad r_{i,k}\ge r_{i,d}, \qquad j\in N^+(i),\ k\in S_K(i),\ d\in T_K(i), \end{aligned}$$
(15)

where \(r_{i,j}\) is still the preference of node i on node j. It can be interpreted as: (1) neighbors are preferred to local non-neighbors, and (2) local non-neighbors are preferred to distant non-neighbors. We note that when \(K=1\), \(S_K(i)=\emptyset\) and \(T_K(i)=N^-(i)\); in this case, our model assumption degenerates to that of the PNMF model. Thus, the LNMF model can be viewed as a generalization of the PNMF model.

We also adopt the two independence assumptions listed in Sect. 4.2, i.e., node independence and pair independence, when formalizing our LNMF model.

5.3 Model formulation

Given the model assumptions, we are ready to present our LNMF model formally. Since nodes are independent of each other, we first consider a single node.

For a node i, the optimization criterion is to maximize the likelihood of its preference order, which can be represented as a product of pairwise preferences, i.e.,

$$\begin{aligned} \prod _{j,k\in V_K(i)}&\left[\mathcal {P}(r_{i,j}\ge r_{i,k}|F)^{\delta (i,j,k)}(1-\mathcal {P}(r_{i,j}\ge r_{i,k}|F))^{1-\delta (i,j,k)}\right]\\ \cdot \prod _{k,d\in N^-(i)}&\left[\mathcal {P}(r_{i,k}\ge r_{i,d}|F)^{\xi (i,k,d)}(1-\mathcal {P}(r_{i,k}\ge r_{i,d}|F))^{1-\xi (i,k,d)}\right], \end{aligned}$$
(16)

where \(\delta (\cdot )\) and \(\xi (\cdot )\) are two indicator functions that

$$\begin{aligned} \delta (i,j,k) = \left\{ \begin{array}{ll} 1 & \quad \text {if }\, j\in N^+(i) \text { and } k\in S_K(i),\\ 0 & \quad \text {otherwise,} \end{array} \right. \end{aligned}$$

and

$$\begin{aligned} \xi (i,k,d) = \left\{ \begin{array}{ll} 1 & \quad \text {if }\, k\in S_K(i) \text { and } d\in T_K(i),\\ 0 & \quad \text {otherwise.} \end{array} \right. \end{aligned}$$

Recall the four propositions in the preliminaries: \(V_K(i)\) and \(N^-(i)\) can each be split into two disjoint sets with different levels of preference. Following the scheme proposed in Rendle et al. (2009b) and Zhao et al. (2014), we can simplify Eq. (16) to

$$\begin{aligned} \frac{\sum _{j\in N^+(i),k\in S_K(i)}{\mathcal {P}(r_{i,j}\ge r_{i,k}|F)}}{|N^+(i)|\cdot |S_K(i)|}+\frac{\sum _{k\in S_K(i),d\in T_K(i)}{\mathcal {P}(r_{i,k}\ge r_{i,d}|F)}}{|S_K(i)|\cdot |T_K(i)|}. \end{aligned}$$
(17)

Applying the sigmoid function \(\sigma (x):=\frac{1}{1+e^{-x}}\) to interpret \(\mathcal {P}(r_{i,j}\ge r_{i,k}|F)\), i.e., \(\mathcal {P}(r_{i,j}\ge r_{i,k}|F)=\sigma (\hat{x}(i,j)-\hat{x}(i,k))\), we sum up the log-likelihood functions of all nodes:

$$\begin{aligned} \sum _{i}\Big [&\sum _{j\in N^+(i),k\in S_K(i)}\ln \sigma (\hat{x}(i,j)-\hat{x}(i,k))\\ &+\lambda (i)\cdot \sum _{k\in S_K(i),d\in T_K(i)}\ln \sigma (\hat{x}(i,k)-\hat{x}(i,d))\Big ], \end{aligned}$$
(18)

where \(\hat{x}(i,j):=F_i\cdot F_j^{\rm T}\) can be regarded as the correlation between i and j, and \(\lambda (i):=\frac{|N^+(i)|}{|T_K(i)|}\) can be regarded as the coefficient of locality influence.

In the end, to prevent our model from over-fitting, we add a regularization term \(reg(F)=\frac{1}{2}||F||_F^2\) based on the Frobenius norm of the node-community membership matrix. Thus, the final objective function l is written as

$$\begin{aligned} l(F) = \sum _{i}\Big [&\sum _{j\in N^+(i),k\in S_K(i)}\ln \sigma (\hat{x}(i,j)-\hat{x}(i,k))\\ &+\lambda (i)\cdot \sum _{k\in S_K(i),d\in T_K(i)}\ln \sigma (\hat{x}(i,k)-\hat{x}(i,d))\Big ]-\lambda _r\, reg(F), \end{aligned}$$
(19)

where \(\lambda _r\) is the regularization coefficient.
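A minimal sketch evaluating the objective of Eq. (19), given the per-node sets \(N^+(i)\), \(S_K(i)\) and \(T_K(i)\); the helper names and the toy path graph (with its sets pre-computed for K = 2) are our own illustrative choices.

```python
import numpy as np

def log_sigmoid(x):
    return -np.log1p(np.exp(-x))

def lnmf_objective(F, Nplus, S_K, T_K, lam_r):
    """Eq. (19): locality-aware log-likelihood with Frobenius regularization."""
    xhat = lambda a, b: float(F[a] @ F[b])
    total = 0.0
    for i in Nplus:
        first  = sum(log_sigmoid(xhat(i, j) - xhat(i, k)) for j in Nplus[i] for k in S_K[i])
        second = sum(log_sigmoid(xhat(i, k) - xhat(i, d)) for k in S_K[i] for d in T_K[i])
        lam_i = len(Nplus[i]) / len(T_K[i]) if T_K[i] else 0.0   # lambda(i) = |N+(i)| / |T_K(i)|
        total += first + lam_i * second
    return total - lam_r * 0.5 * np.linalg.norm(F, 'fro') ** 2

# toy usage: path graph 0-1-2-3-4 with the sets pre-computed for K = 2
F = np.random.rand(5, 2)
Nplus = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
S_K   = {0: {2}, 1: {3}, 2: {0, 4}, 3: {1}, 4: {2}}
T_K   = {0: {3, 4}, 1: {4}, 2: set(), 3: {0}, 4: {0, 1}}
print(lnmf_objective(F, Nplus, S_K, T_K, lam_r=0.001))
```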

5.4 Parameter learning

Parameter learning for the LNMF model is mostly the same as that for the PNMF model, except that we now sample a set of (source, neighbor, local non-neighbor, distant non-neighbor) quadruples instead of (source, neighbor, non-neighbor) triples at each learning step. The learning process for the LNMF model is described in Algorithm 3 and the sampling strategy is described in Algorithm 4. For the sampling of k, we pre-process the whole graph to record the set of local nodes of each node i. Based on the fact that \(N^-(i) = S_K(i)\cup T_K(i)\), we keep sampling a random node until we get a node d that is neither in \(N^+(i)\) nor in \(S_K(i)\).

The same parallel paradigm as in the PNMF model is adopted to accelerate the LNMF model. In each iteration, multiple sub-processes sample quadruples and update the node-community membership matrix F stored in the main process in parallel. We expect the parallel version to run much faster than the single-process version as well.

[Algorithm 3: parameter learning for the LNMF model; Algorithm 4: quadruple sampling strategy (figures not reproduced)]
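Since Algorithms 3 and 4 are not reproduced here, the following sketch shows one way to sample a (source, neighbor, local non-neighbor, distant non-neighbor) quadruple from pre-computed sets, using the rejection strategy described above (all names and the toy data are illustrative):

```python
import random

def sample_quadruple(Nplus, S_K, n):
    """Sample (i, j, k, d): j in N+(i), k in S_K(i), d in T_K(i) (by rejection)."""
    while True:
        i = random.randrange(n)
        # need non-empty N+(i), S_K(i) and at least one distant non-neighbor
        if Nplus[i] and S_K[i] and len(Nplus[i]) + len(S_K[i]) + 1 < n:
            break
    j = random.choice(tuple(Nplus[i]))
    k = random.choice(tuple(S_K[i]))
    d = random.randrange(n)
    while d == i or d in Nplus[i] or d in S_K[i]:
        d = random.randrange(n)
    return i, j, k, d

# toy usage with the path graph 0-1-2-3-4 and K = 2
Nplus = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
S_K   = {0: {2}, 1: {3}, 2: {0, 4}, 3: {1}, 4: {2}}
print(sample_quadruple(Nplus, S_K, n=5))
```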

5.5 Discussion on further extension

Our LNMF model can be further generalized by extending the preference system from three levels to n levels. Mathematically, following the notations in Sect. 5.1, let \(P_K(u):=L_K(u)\backslash L_{K-1}(u)\) for \(K=1,2,\ldots ,n\) be the node set of a particular preference level of node u, and assume that u's preference on \(P_i(u)\) is larger than its preference on \(P_j(u)\) if \(i>j\). For the model formulation, we only need to modify a few spots to obtain this more general model. However, our learning algorithm is not efficient enough for it. Specifically, a trivial extension of our sampling strategy suffers from two problems: (1) the upper limit of K can change for different nodes, and (2) for a node u, pre-processing the whole graph to record the node sets of all preference levels amounts to a breadth-first search starting from u, which is too time-consuming for large-scale networks. We do not yet have an adequate solution to this problem.

6 Experiments

In this section, we compare our PNMF and LNMF models with both classic and state-of-the-art overlapping community detection approaches on both synthetic and real-world datasets. The experimental results are evaluated by three metrics, namely modularity, normalized mutual information (NMI) and \(F_1\) score. We also compare our parallel implementation with the single-process one.

6.1 Data description

For synthetic datasets, we use the Lancichinetti–Fortunato–Radicchi (LFR) benchmark networks (Lancichinetti et al. 2008). We generate three binary undirected networks of 1000 nodes with different settings.

For real-world datasets, we have six benchmark networks collected by Newman, which have no ground-truth communities and relatively small sizes. Simple statistics of these datasets can be found in Table 2. We also have three networks with ground-truth communities collected by SNAP (Yang and Leskovec 2015):

  • Amazon: a product co-purchasing network based on the "customers who bought this item also bought" feature of the Amazon website;

  • DBLP: a collaboration network of research paper authors in computer science;

  • YouTube: a social network of a video-sharing website.

These SNAP networks have large sizes so that we can use them to test the scalability of our models. The statistics are shown in Table 3.

Table 2 Statistics of six Newman’s datasets
Table 3 Statistics of three SNAP datasets

6.2 Experimental setup

We conduct our experiments on a computer with 32 Xeon 2.40 GHz CPUs and 128 GB memory.

6.2.1 Baselines

We select both classic and state-of-the-art methods to compare with our models.

  • SCP (Kumpula et al. 2008) accelerates the original CP method (Palla et al. 2005) in a sequential manner. We set k to be 4 or 5 when finding k-cliques.

  • LC (Ahn et al. 2010) clusters links instead of nodes to obtain overlapping communities. We ignore all communities with only one or two nodes because they are meaningless.

  • Demon (Coscia et al. 2012) uses a label propagation algorithm to detect small communities on the ego network of each node and then merges communities with large overlap.

  • BigCLAM (Yang and Leskovec 2013) is also claimed to be a scalable model. It can search for the best number of communities within a given range.

6.2.2 Evaluation metrics

We use modularity as the evaluation metric for datasets without ground-truth communities, and NMI and \(F_{1}\) score for datasets with ground-truth communities.

  • Modularity. The classic modularity is defined as

    $$\begin{aligned} Q = \frac{1}{2|E|}\underset{u,v\in V}{\sum }(G_{u,v} - \frac{d(u)d(v)}{2|E|})I_{u,v}, \end{aligned}$$

    where d(u) is the degree of node u, \(G_{u,v}\) is the (u, v) entry of the adjacency matrix G, and \(I_{u,v}=1\) if u and v are in the same community and 0 otherwise (Newman 2006). In the overlapping scenario, since a node pair may share more than one community, a minor modification is made by replacing \(I_{u,v}\) with \(|C_{u}\cap C_{v}|\), i.e., the number of communities shared by u and v:

    $$\begin{aligned} \hat{Q} = \frac{1}{2|E|}\underset{u,v\in V}{\sum }\left( G_{u,v} - \frac{d(u)d(v)}{2|E|}\right) |C_{u}\cap C_{v}|. \end{aligned}$$

    From the definition, we can see that a greater value of modularity indicates denser connectivity within the detected communities, because only linked node pairs sharing common communities contribute positively to the value. This metric has been frequently used in other MF-based works as well (Yang and Leskovec 2013; Zhang and Yeung 2012; Psorakis et al. 2011). Modularity has also been directly used as an optimization objective in community detection, and those approaches are called modularity-based methods (Clauset et al. 2004; Duch and Arenas 2005; Guimera et al. 2004; Newman 2004). Nevertheless, when comparing the quality of communities detected by non-modularity-based models, modularity can still serve as a useful metric.

  • NMI. Normalized mutual information is very popular in evaluating community detection algorithms (Lancichinetti et al. 2009). The mutual information between two community partitions A and B is defined as the Kullback–Leibler (KL) divergence between the joint distribution \(P_{AB}(a,b)\) and the product of the two marginal distributions \(P_A(a)P_B(b)\):

    $$\begin{aligned} I(P_A,P_B) = H(P_A) + H(P_B) - H(P_{AB}), \end{aligned}$$

    where \(H(P_A) = - {\sum\nolimits_{a\in A} }P_A(a) \log P_A(a)\) is the entropy of distribution \(P_A\) and \(H(P_{AB})\) is the entropy of the joint distribution \(P_{AB}\). Here \(P_A(a)=\frac{n_a}{n}\) and \(P_{AB}(a,b)=\frac{n_{ab}}{n}\), where n is the number of all nodes, \(n_a\) is the number of nodes in community a, and \(n_{ab}\) is the number of nodes in both communities a and b. After normalization, NMI is defined as

    $$\begin{aligned} NMI(P_A,P_B) = \frac{2I(P_A,P_B)}{H(P_A) + H(P_B)}. \end{aligned}$$

    In our experiments, we employ a variant of this normalization corrected for the overlapping scenario to avoid unintended behavior (McDaid et al. 2011).

  • \({{ F}_1}\) score. \(F_1\) score is also one of the best measurements for datasets with ground-truth communities (a short sketch of its computation follows this list). The \(F_1\) score of a detected community \(S_i\) is defined as the harmonic mean of precision\((S_i)\) and recall\((S_i)\), which are defined as

    $$\begin{aligned} \text {precision}(S_i) = \max _j\frac{|C_j\cap S_i|}{|C_j|}, \end{aligned}$$

    and

    $$\begin{aligned} \text {recall}(S_i) = \max _j\frac{|C_j\cap S_i|}{|S_i|}, \end{aligned}$$

    where \(C_j\) is the node set of a ground-truth community. The average \(F_1\) score for the set of detected communities S is

    $$\begin{aligned} \overline{F_1}(S) = \frac{1}{|S|}\sum _{S_i\in S}{F_1(S_i)}. \end{aligned}$$
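A minimal sketch of the average \(F_1\) computation defined above, with communities represented as Python sets (the toy data are our own):

```python
def f1_of_community(S_i, ground_truth):
    """Harmonic mean of the best-match precision and recall of detected community S_i."""
    precision = max(len(C_j & S_i) / len(C_j) for C_j in ground_truth)
    recall    = max(len(C_j & S_i) / len(S_i) for C_j in ground_truth)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def average_f1(detected, ground_truth):
    return sum(f1_of_community(S_i, ground_truth) for S_i in detected) / len(detected)

# toy usage
detected     = [{0, 1, 2}, {2, 3}]
ground_truth = [{0, 1}, {2, 3, 4}]
print(average_f1(detected, ground_truth))     # 0.8
```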

6.2.3 Setting the degree

Recall that if we set \(K=1\) in the K-degree local network, the LNMF model degenerates to the PNMF model. We tried both \(K=2\) and \(K=3\) on our datasets, and the results show that the average number of common communities shared by two nodes in a K-degree local network is about the same in both cases. For \(K\ge 4\), generating the K-degree local networks costs too much time. Thus, we set \(K=2\) for all experiments, which means only a friend's friends are regarded as local non-neighbors.

6.3 Results

Table 4 Experimental results on LFR benchmark datasets in terms of NMI
Table 5 Experimental results on LFR benchmark datasets in terms of \({F_1}\) score

The experiments in Sects. 6.3.1 and  6.3.2 are conducted on a single process. The experiments in Sect. 6.3.3 are conducted on multiple processes.

6.3.1 Performance on LFR benchmark

For all synthetic networks, we set the number of nodes to 1000, the average degree to 10, the maximum degree to 50, the exponent of the degree distribution to 2, the exponent of the community size distribution to 1 and the mixing parameter to 0.2. The only two variables are the number of overlapping nodes (on) and the number of memberships of the overlapping nodes (om). We use both NMI and \(F_1\) score as metrics, and the results are shown in Tables 4 and 5, respectively. We can see that the PNMF model is comparable to the other methods while the LNMF model outperforms them.

6.3.2 Performance on real-world datasets

We set the regularization coefficient to around 0.001 and the convergence parameter \(\epsilon\) to 0.001 for all experiments. The sample size t is determined by the data size. For the UMich networks (Newman's datasets in Table 2), we set \(t=m\), i.e., the number of links. For the SNAP networks, we set \(t=10\sqrt{n}\), where n is the number of nodes, so that one iteration finishes without taking too much time. The maximum number of iterations is set to 100, though in fact all datasets converge before reaching this limit.

The experimental results on the UMich networks are shown in Table 6. We use modularity as the metric since these networks have no ground-truth communities. The comparison clearly reveals the superiority of our PNMF and LNMF models over the baseline methods, especially the MF-based approaches. Moreover, the LNMF model performs slightly better than the PNMF model, which validates the use of locality information.

Table 6 Experimental results on UMich datasets in terms of modularity

The experimental results on the SNAP networks are shown in Table 7. \(F_1\) score is used as the measurement because these datasets have ground-truth communities. Here we only compare with BigCLAM because none of the other baselines can scale to datasets of such sizes, which itself reflects the scalability of our PNMF and LNMF models to some extent. From the results we can see that both of our models outperform the BigCLAM model overall. Specifically, the LNMF model shows fair improvement over the PNMF model on both the DBLP and Amazon datasets. For the YouTube dataset, we examined the data and found that most of its communities are quite sparse due to the small size of communities and the large variety of users; in other words, locality information is not useful on the YouTube dataset. This explains why the LNMF model fails to improve on this dataset.

Table 7 Experimental results on SNAP datasets in terms of \({F_1}\) score

During parameter learning of both the PNMF and LNMF models, the running time of one iteration is about 1–2 h for DBLP and Amazon and about 4–5 h for YouTube. Figure 1a, b demonstrates the convergence speed of the learning algorithms of the PNMF model and the LNMF model on the SNAP datasets, respectively. Each point in the figure represents the ratio of the current loss to the initial loss after a certain number of iterations. The results show that both of our models converge within a fair number of iterations. However, we notice that the final loss of the LNMF model on the YouTube dataset is quite large compared with the other two datasets. This observation matches our previous discussion of Table 7 that locality is possibly not a feature of the YouTube dataset.

[Fig. 1: Convergence time of the learning algorithm on SNAP datasets]

6.3.3 Performance of parallel paradigm

With the same model parameter settings as the single-process experiments, we test the efficiency of our parallel paradigm on the SNAP datasets. Figure 2 shows the average training time per iteration for the PNMF model and the LNMF model using different numbers of processes. The number of total samples per iteration remains the same so that we can fairly compare the average time per iteration as the number of processes varies. It can be seen from the curves that the training time per iteration decreases dramatically at first and then levels off as the number of processes increases. The significant drop when the number of processes is less than 10 demonstrates the effectiveness of our parallel paradigm. When the number of processes is greater than 10, the training time hardly decreases, or even increases a bit, because the synchronization after each iteration becomes the bottleneck.

[Fig. 2: Comparison of the single-process implementation with the parallel implementation in terms of time on SNAP datasets]

As for the quality of detected communities, we do not show the detailed comparison, but the parallel version achieves nearly the same \(F_1\) score as the single-process one regardless of the number of processes we use, and the loss drops at a similar pace as well. To summarize, the parallel paradigm of our learning algorithm not only decreases the running time but also preserves the quality of detected communities.

7 Conclusion and future work

In this paper, we incorporate both link preference information and locality information into overlapping community detection. First, we propose a Preference-based Non-negative Matrix Factorization (PNMF) model with the intuition that a node prefers its neighbors over its non-neighbors. We maximize the likelihood of a preference order for each node instead of simply approximating the original adjacency matrix in value. Our PNMF model thus eliminates the unreasonable indiscriminate penalty on pairs inside and between communities. Second, we propose a Locality-based Non-negative Matrix Factorization (LNMF) model to further improve the performance of the PNMF model with the help of locality. We exploit the local area around a node, formally defined as the K-degree local network, to enhance the previous preference system. In detail, we extend a two-level preference system, which only distinguishes neighbors and non-neighbors, to a three-level preference system, which splits the set of non-neighbors into local non-neighbors and distant non-neighbors. For parameter learning of both models, we employ stochastic gradient descent with bootstrap sampling to learn the node-community membership matrix. Experiments on several benchmark datasets, including large ones with ground-truth communities, show that both models outperform state-of-the-art approaches under multiple metrics and that the LNMF model achieves better results than the PNMF model.

Currently, our work only takes binary networks as input, but it can be extended to multiplex networks as well. In multiplex networks, all layers share the same set of nodes but may have different topologies, and inter-layer edges are allowed (Boccaletti et al. 2014). For each layer, we can apply either the PNMF or the LNMF model to obtain an objective function. Since all layers share the same node set, we can simply sum up all the objective functions and learn via stochastic gradient descent. However, such an extension cannot deal with inter-layer edges.