
1 Introduction

Graph clustering is a widely studied research problem that has received considerable attention in data mining and machine learning in recent years [1,2,3,4,5,6,7,8]. It aims to partition a given graph into several connected components based on structural similarity: vertices in the same component are expected to be densely connected, while those in different components are weakly tied. Graph clustering is popularly used in community detection, protein network analysis, etc. [4,5,6]. Previous work focused mainly on finding clusters by exploiting topology structures. Recently, with the advance of social networks and Web 2.0, many graph datasets come with both topology and node attribute information. For example, a webpage (i.e., a vertex) can be associated with other webpages via hyperlinks, and it may also have inherent attributes of its own, such as its text description. Such graphs are known as attributed graphs. Because the topology and the attributes together offer a better chance of finding high-quality clusters, attributed graph clustering has become a hot research topic.

However, finding clusters in attributed graphs is not trivial, and there are two important challenges to address. Challenge 1: how to effectively utilize the topology information and the attributes together. Conventional graph clustering methods exploit only the topology structure to find clusters, whereas conventional feature-based clustering algorithms take only the attributes into account. Different from these two types of approaches, attributed graph clustering algorithms should effectively use both types of information together. Challenge 2: how to automatically determine the importance of different attributes. It is well known that weighting features appropriately helps to find the inherent clusters, especially when a large portion of the features is noisy. We face the same challenge in attributed graph clustering. For example, in the aforementioned webpage scenario, each webpage may contain different textual information at different locations, e.g., title, body, and advertisement, so the features extracted from them may contribute to clusters quite differently. Although some methods have been put forward recently to address the first challenge [9,10,11,12,13,14,15,16], few of them notice the second one.

In this paper, we introduce a Joint Weighted Nonnegative Matrix Factorization method for clustering attributed graphs, namely JWNMF, which addresses both challenges. NMF [17, 18] is a well-known technique that has produced promising performance in graph clustering [7, 8, 19]. For a given attributed graph, our method uses a joint NMF to integrate the structural and attribute information. Specifically, we design two matrix factorization terms: one models the topology structure and the other models the attributes. Meanwhile, we modify the NMF by introducing a weighting variable for each attribute, which is automatically updated and determined in each iteration. Experiments are performed on seven real-world datasets, including two Amazon information networks, one CMU email network, one DBLP information network, one webpage link network and two citation networks. Our experimental results show that the proposed JWNMF method outperforms state-of-the-art attributed graph clustering algorithms such as BAGC [11], PICS [13] and SANS [14].

The remainder of this paper is organized as follows: Sect. 2 reviews some existing work on attributed graph clustering. In Sect. 3, we introduce the proposed JWNMF method. Section 4 presents and discusses the experimental results. Finally, the conclusions are given in Sect. 5.

2 Related Work

Several clustering methods have been introduced for mining attributed graphs recently. They can mainly be categorized into two types, namely distance-based methods [9, 10, 14] and model-based methods [11,12,13, 15, 16]. The idea of distance-based methods is to design a unified distance that combines structural and attribute information, and then to apply existing clustering methods, e.g., k-means or spectral clustering, to the attributed graph based on that unified distance. Model-based methods leverage the interactions between edges and node attributes to construct a joint model for clustering purposes.

2.1 Distance-Based Methods

Zhou et al. proposed the distance-based method SA-Cluster [9] in 2009 and its efficient version Inc-Cluster [10] in 2011. The key idea of the two methods is to construct a new graph by treating node attributes as new nodes and linking the original nodes to these attribute nodes whenever the corresponding attribute values are non-zero. A unified distance for the augmented graph is designed using a random walk process. Finally, k-medoids is performed to partition the augmented graph. As the augmenting step may increase the size of the graph considerably, the two methods are hard to run on large-scale attributed graphs.

SANS was introduced in 2015 [14]; it partitions attributed graphs by leveraging both structural and node attribute information. In the method, a weighting vector is predefined. SANS chooses the node with the largest degree (out-degree plus in-degree) as a cluster center, and the nodes connected with this center are partitioned into the cluster. SANS then assigns the unclustered nodes whose attribute similarities with the assigned nodes exceed a threshold to the cluster. After that, the weighting vector and the attribute similarities are updated. The procedure is repeated until all nodes are clustered. This method can automatically partition an attributed graph without a predefined number of clusters.
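Based solely on the description above, the procedure might be sketched as follows. This is a rough, hypothetical reconstruction with our own names, using cosine similarity as the attribute measure; the weighting-vector update of [14] is omitted.

```python
import numpy as np

def sans_sketch(adj, attrs, threshold=0.5):
    """Hypothetical reconstruction of the SANS procedure described above.
    adj: (n, n) adjacency matrix; attrs: (n, m) node-attribute matrix."""
    n = adj.shape[0]
    degree = adj.sum(axis=0) + adj.sum(axis=1)   # in-degree plus out-degree
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # 1. Pick the unassigned node with the largest degree as a center.
        center = max(unassigned, key=lambda v: degree[v])
        # 2. Unassigned nodes connected with the center join its cluster.
        cluster = {center} | {v for v in unassigned
                              if adj[center, v] or adj[v, center]}
        # 3. Pull in unassigned nodes whose attribute similarity to the
        #    cluster exceeds the threshold (cosine similarity assumed here).
        centroid = attrs[sorted(cluster)].mean(axis=0)
        for v in sorted(unassigned - cluster):
            denom = np.linalg.norm(attrs[v]) * np.linalg.norm(centroid)
            if denom > 0 and attrs[v] @ centroid / denom > threshold:
                cluster.add(v)
        unassigned -= cluster
        clusters.append(sorted(cluster))
    return clusters
```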

2.2 Model-Based Methods

Xu et al. proposed a model-based approach, BAGC, in 2012 [11]. This method introduces a Bayesian probabilistic model by assuming that the vertices in the same cluster share a common multinomial distribution for each node attribute and a Bernoulli distribution for node connections. As a result, the attributed graph clustering problem can be transformed into a standard probabilistic inference problem, and the clusters can be identified from the node-to-cluster probabilities. The drawback is that this method cannot handle weighted attributed graphs; to overcome this problem, Xu et al. later extended BAGC to GBAGC [12].

PICS was proposed by Akoglu et al. in 2012 [13]. It is a matrix compression based model clustering approach that treats the clustering problem as a data compression problem, in which the structure matrix and the attribute matrix are compressed simultaneously. Each cluster is regarded as a compressed dense connected subset, and the nodes in the same cluster have similar connectivity and attribute properties. Owing to its low computational complexity, PICS can deal with large-scale attributed graphs.

In 2014, Perozzi et al. proposed a user-interest based attributed graph clustering method, namely FocusCo [15]. The method utilizes the similarities of users' interests to find an optimal clustering for attributed graphs. CESNA [16] models the correlations between structures and node attributes to improve intra-cluster similarities; it differs from other attributed graph clustering methods in that it can detect overlapping communities in social networks.

Different from the existing studies, we propose a joint weighted nonnegative matrix factorization method to leverage both the topology and the attribute information. Moreover, we design a weighting vector, determined automatically, to differentiate the contributions of attributes to clusters. Our method thus addresses the two challenges mentioned in the introduction.

3 Proposed Method

An attributed graph can be defined as \(G=(V, E, A)\), where \(V=\{v_{1}, v_{2}, \ldots , v_{n}\}\) denotes the set of nodes, \(E=\{(v_{i}, v_{j}), 1 \le i, j \le n, i \ne j\}\) denotes the set of edges, and \(A=[\mathbf{a}_{1}, \mathbf{a}_{2}, \ldots , \mathbf{a}_{m}]\) denotes the set of node attributes. In an attributed graph G, each node \(v_{i}\) in V is associated with an attribute vector \((a_{1}^{i}, a_{2}^{i}, \ldots , a_{m}^{i})\), where each element of the vector is the attribute value of \(v_{i}\) on the corresponding attribute.
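For concreteness, here is a minimal toy example of this representation (our illustration, not code from the paper): S is the adjacency matrix over n = 4 nodes, and row i of A holds the attribute vector of node \(v_i\) with m = 2 attributes.

```python
import numpy as np

# Adjacency matrix S of a toy undirected graph with 4 nodes.
S = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Attribute matrix A: row i is the attribute vector (a_1^i, a_2^i) of node v_i.
A = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.8, 0.2],
              [0.1, 0.9]])
```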

The key difference between attributed graph clustering and conventional graph clustering is that the former needs to take node attributes into account. Consequently, an ideal clustering result should satisfy two properties: (1) nodes in the same cluster are densely connected, while nodes in different clusters are sparsely connected; (2) nodes in the same cluster have similar attribute values, while nodes in different clusters have diverse attribute values.

3.1 Overview of NMF

Here, we briefly review Nonnegative Matrix Factorization (NMF) [17, 18]. Let X denote an \(M \times N\) matrix whose elements are all nonnegative. The goal of NMF is to find two nonnegative matrix factors \(V=(V_{i,j})_{M \times K}\) and \(U=(U_{i,j})_{N \times K}\), where K denotes the desired reduced dimension of the original matrix X; in general, \(K \le \min (M,N)\). X can then be approximated by \(X \approx VU^{T}\). A commonly used objective function for NMF is the following Frobenius-norm optimization problem:

$$\begin{aligned} \min _{V, U \ge 0} \Vert X - VU^{T}\Vert ^{2}_{F} \end{aligned}$$

where \(\Vert \cdot \Vert _{F}\) is the Frobenius norm and \(V,U \ge 0\) are the nonnegativity constraints of the matrix factorization.
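For illustration, the classical multiplicative-update algorithm of Lee and Seung [18] for this objective can be sketched in a few lines of NumPy (a minimal sketch with our own names and defaults, not the implementation used in this paper):

```python
import numpy as np

def nmf(X, K, iters=200, eps=1e-10, seed=0):
    """Multiplicative-update NMF: find nonnegative V (M x K), U (N x K)
    such that X is approximated by V @ U.T under the Frobenius norm."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    V = rng.random((M, K))
    U = rng.random((N, K))
    for _ in range(iters):
        V *= (X @ U) / (V @ (U.T @ U) + eps)   # eps avoids division by zero
        U *= (X.T @ V) / (U @ (V.T @ V) + eps)
    return V, U
```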

3.2 Objective Function

Following the definition of attributed graphs above, let S denote the adjacency matrix of the topology structure, and let A denote the attribute matrix, whose rows correspond to nodes and whose columns correspond to attributes. In addition, we introduce a diagonal matrix \(\varLambda \) to assign a weight to each attribute. Inspired by SymNMF [7, 8], which often delivers promising results for graph clustering, we apply this idea to attributed graph clustering. Specifically, we use the factorizations \(S \approx VV^{T}\) and \(A\varLambda \approx VU^{T}\), where V is a fused representation of the topology and attribute information of the nodes.

To integrate the two approximations into the NMF framework, we propose a weighted joint NMF optimization problem over \(V, U, \varLambda \):

$$\begin{aligned} \min _{V, U, \varLambda \ge 0} \Vert S - VV^{T}\Vert ^{2}_{F} + \lambda \Vert A\varLambda - VU^{T}\Vert ^{2}_{F} \end{aligned}$$
(1)

where \(S \in \mathbb {R}^{n \times n}_{+}\), \(A \in \mathbb {R}^{n \times m}_{+}\), \(\varLambda \in \mathbb {R}^{m \times m}_{+}\), \(V \in \mathbb {R}^{n \times k}_{+}\), \(U \in \mathbb {R}^{m \times k}_{+}\), and \(\mathbb {R}_{+}\) denotes the set of nonnegative real numbers. Here n is the number of nodes, m is the number of attributes, and k is the number of clusters; \(\varLambda \) is a diagonal matrix satisfying \(\sum _{i=1}^{m}\varLambda _{i,i}=1\), and \(\lambda > 0\) is a weight that balances the structural and attribute terms. Before optimizing Eq. 1, we preprocess the adjacency matrix S and the attribute matrix A as:

$$\begin{aligned} \begin{aligned} S = \frac{S}{\sum _{i=1}^n\sum _{j=1}^n S_{i,j}}, A = \frac{A}{\sum _{i=1}^n\sum _{j=1}^m A_{i,j}} \end{aligned} \end{aligned}$$
(2)
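A small sketch of the preprocessing in Eq. (2) and of evaluating the objective in Eq. (1), assuming NumPy arrays with the shapes given above (our illustration):

```python
import numpy as np

def preprocess(S, A):
    """Eq. (2): scale S and A so that their entries each sum to 1."""
    return S / S.sum(), A / A.sum()

def jwnmf_loss(S, A, V, U, Lam, lam):
    """Objective value of Eq. (1); Lam is the diagonal weight matrix."""
    return (np.linalg.norm(S - V @ V.T, 'fro') ** 2
            + lam * np.linalg.norm(A @ Lam - V @ U.T, 'fro') ** 2)
```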

Next, we will derive the updating rules of V, U and \(\varLambda \).

3.3 Updating Rules

Let \(\alpha \), \(\beta \) and \(\gamma \) denote the Lagrange multiplier matrices for the constraints \(V \ge 0\), \(U \ge 0\) and \(\varLambda \ge 0\), respectively. Using the Lagrange formulation, we obtain the unconstrained loss function:

$$\begin{aligned} L = \frac{1}{2}(\Vert S - VV^{T}\Vert ^{2}_{F} + \lambda \Vert A\varLambda - VU^{T}\Vert ^{2}_{F}) + Tr(\alpha ^T V) + Tr(\beta ^T U) + Tr(\gamma ^T\varLambda ) \end{aligned}$$

Taking partial derivatives of L with respect to V, U and \(\varLambda \), we have

$$\begin{aligned} \begin{aligned} \frac{\partial {L}}{\partial {V}}=-(SV+S^{T}V+\lambda A\varLambda U)+(2VV^{T}V+\lambda VU^{T}U+\alpha ) \end{aligned} \end{aligned}$$
(3)
$$\begin{aligned} \begin{aligned} \frac{\partial {L}}{\partial {U}}=-\lambda \varLambda A^{T} V+\lambda UV^{T}V+\beta \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} \begin{aligned} \frac{\partial {L}}{\partial {\varLambda }}=-\lambda A^{T}VU^{T}+\lambda A^{T}A\varLambda +\gamma \end{aligned} \end{aligned}$$
(5)

In terms of the Karush-Kuhn-Tucker (KKT) conditions \(\alpha _{p,r}V_{p,r}=0\), \(\beta _{q,r}U_{q,r}=0\) and \(\gamma _{q,q}\varLambda _{q,q}=0\), together with the stationarity conditions \(\frac{\partial {L}}{\partial {V}}=0\), \(\frac{\partial {L}}{\partial {U}}=0\) and \(\frac{\partial {L}}{\partial {\varLambda }}=0\), we can derive the following updating rules for V, U and \(\varLambda \):

$$\begin{aligned} \begin{aligned} V \longleftarrow V.*(SV+S^{T}V+\lambda A\varLambda U)./(2VV^{T}V+\lambda VU^{T}U) \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} U \longleftarrow U.*(\varLambda A^{T}V)./(UV^{T}V) \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} \varLambda \longleftarrow \varLambda .*(A^{T}VU^{T})./(A^{T}A\varLambda ) \end{aligned} \end{aligned}$$
(8)

where .* and ./ represent elementwise multiplication and division, respectively. To keep the weights in \(\varLambda \) on a common scale, we normalize it as:

$$\begin{aligned} \begin{aligned} \varLambda = \frac{\varLambda }{\sum _{i=1}^m \varLambda _{i,i}} \end{aligned} \end{aligned}$$
(9)
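Put together, one round of the updates in Eqs. (6)-(9) can be sketched as follows (a NumPy illustration with our own names; the small constant eps guards against division by zero):

```python
import numpy as np

def jwnmf_step(S, A, V, U, Lam, lam, eps=1e-10):
    """One multiplicative update of V, U and Lam, following Eqs. (6)-(9)."""
    V *= (S @ V + S.T @ V + lam * A @ Lam @ U) / (
        2 * V @ (V.T @ V) + lam * V @ (U.T @ U) + eps)      # Eq. (6)
    U *= (Lam @ A.T @ V) / (U @ (V.T @ V) + eps)            # Eq. (7)
    Lam *= (A.T @ V @ U.T) / (A.T @ A @ Lam + eps)          # Eq. (8)
    # Off-diagonal entries of Lam start at zero and stay zero under the
    # elementwise update, so Lam remains diagonal.
    Lam /= np.trace(Lam)                                    # Eq. (9)
    return V, U, Lam
```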

Next, we briefly analyze the convergence and the computational complexity of the above updating rules.

To prove convergence, we can adopt the auxiliary function technique described in [18]. In addition, the KKT conditions, which characterize the stationary points of the objective function, also support the convergence of these updating rules.

We now discuss the computational complexity. Supposing the algorithm stops after t iterations, the overall cost of SymNMF [7, 8] is \(O(n^2kt)\). Since our objective function adds one more linear matrix factorization term, the overall cost of the updating rules is \(O((n^2+m^2+mn)kt)\).

3.4 The Joint Weighted NMF Algorithm

By combining the parts above, our attributed graph clustering algorithm JWNMF can be summarized as follows. First, we preprocess the adjacency matrix S and the attribute matrix A, randomly initialize the matrices U and V, and set each diagonal entry of \(\varLambda \) to 1/m. Then we iteratively update U, V and \(\varLambda \) by Eqs. (6)-(9) until convergence. Finally, LiteKmeans is performed on the factorization result V to identify the k clusters.
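The end-to-end procedure might look like the following sketch, reusing preprocess and jwnmf_step from above. We substitute scikit-learn's KMeans for LiteKmeans here; that substitution, along with all names and defaults, is an assumption of this illustration (the paper's implementation is in Matlab).

```python
import numpy as np
from sklearn.cluster import KMeans  # stand-in for LiteKmeans

def jwnmf(S, A, k, lam=0.5, iters=100, seed=0):
    """JWNMF sketch: preprocess, iterate Eqs. (6)-(9), then k-means on V."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    S, A = preprocess(S, A)          # Eq. (2)
    V = rng.random((n, k))
    U = rng.random((m, k))
    Lam = np.eye(m) / m              # initialize attribute weights to 1/m
    for _ in range(iters):
        V, U, Lam = jwnmf_step(S, A, V, U, Lam, lam)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(V)
    return labels, Lam
```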

4 Experimental Study

In this section, we evaluate the performance of our algorithm and compare it with three state-of-the-art attributed graph clustering methods, BAGC [11], PICS [13] and SANS [14], as well as a baseline clustering approach, S-Cluster, which is implemented using LiteKmeans and exploits only the structure information. All algorithms were implemented in Matlab R2014b and tested on a Windows 10 PC with an Intel Core i5-4460 3.20 GHz CPU and 32 GB of memory.

4.1 Datasets

Seven real-world datasets are employed in our experiments, four without ground truth and three with ground truth. The datasets without ground truth include two Amazon information networks (Amazon Fail and Disney), a CMU email network (Enron) and a DBLP information network (DBLP-4AREA). The datasets with ground truth (WebKB, Citeseer and Cora) are all from text categorization applications. We represent all of these datasets as undirected networks. Table 1 summarizes the characteristics of the seven datasets.

Table 1. Description of seven real-world datasets

4.2 Evaluation Measures

The goal of attributed graph clustering is to effectively leverage both the topology and the attribute information, so we evaluate clustering results from these two aspects. Specifically, we employ modularity to assess results from the topological point of view and average entropy from the attribute point of view. Modularity [20] is a widely used evaluation measure for graph partitions, and average entropy is often used to evaluate feature-based clustering results.

Let \(C=(C_1,\ C_2, \ \dots , \ C_k)\) denote a partition of an attributed graph into k clusters. The modularity Q and the average entropy \(Avg\_entropy\) are defined as:

$$\begin{aligned} Q = \sum _{i=1}^{k}(e_{i,i} - c_i^{2}) \end{aligned}$$
(10)
$$\begin{aligned} Avg\_entropy = \sum _{t=1}^{m}\sum _{j=1}^{k}\frac{|C_{j}|}{nm}{entropy(a_{t},C_j)} \end{aligned}$$
(11)

where \(e_{i,j}\) is the fraction of edges whose start node is in cluster i and end node is in cluster j, \(c_i\) is the fraction of edge ends attached to nodes in cluster i, and \(entropy(a_t,C_j)\) is the information entropy of attribute \(a_t\) in cluster \(C_j\). Modularity takes values in \([-1, 1]\) and average entropy in \([0, +\infty )\). Higher modularity indicates dense connections between nodes within clusters but sparse connections between nodes in different clusters, and lower average entropy indicates similar attribute values within clusters but dissimilar attribute values across clusters, i.e., a better clustering result.
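For reference, both measures can be computed as in the sketch below, assuming an undirected adjacency matrix S, integer cluster labels, and attributes discretized into a small number of integer values (the discretization is our assumption):

```python
import numpy as np

def modularity(S, labels):
    """Eq. (10): Q = sum_i (e_ii - c_i^2)."""
    total = S.sum()  # total weight over all edge ends
    Q = 0.0
    for i in range(labels.max() + 1):
        in_i = labels == i
        e_ii = S[np.ix_(in_i, in_i)].sum() / total  # edges inside cluster i
        c_i = S[in_i, :].sum() / total              # edge ends in cluster i
        Q += e_ii - c_i ** 2
    return Q

def avg_entropy(A, labels, bins=2):
    """Eq. (11), for attributes taking integer values in [0, bins)."""
    n, m = A.shape
    total = 0.0
    for j in range(labels.max() + 1):
        Cj = A[labels == j]
        for t in range(m):
            counts = np.bincount(Cj[:, t].astype(int), minlength=bins)
            p = counts / counts.sum()
            ent = -np.sum(p[p > 0] * np.log2(p[p > 0]))
            total += len(Cj) / (n * m) * ent
    return total
```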

In addition to modularity and average entropy, we also utilize Normalized Mutual Information (NMI) to evaluate the clustering performance for the datasets with ground truth. Generally, higher NMI values indicate better clustering results.

Fig. 1. Clustering qualities on Amazon Fail

Fig. 2. Clustering qualities on Enron

4.3 Performance on Datasets Without Ground Truth

Effectiveness Evaluation. Figure 1 shows how modularity and average entropy change with the number of clusters on Amazon Fail. We observe that JWNMF outperforms the four baseline methods in terms of modularity when varying the number of clusters. Meanwhile, in terms of average entropy, JWNMF performs the best except when the number of clusters is set to 8. Similar observations hold on Enron and Disney (Figs. 2 and 3). From Fig. 4, we can see that our method achieves the lowest average entropy on DBLP-4AREA; in terms of modularity, however, JWNMF is inferior to PICS. The reason is that PICS treats attributed graph clustering as a data compression problem and thus favors datasets with a large number of nodes but a sparse topology. Moreover, Figs. 1, 2, 3 and 4 show that average entropy has a descending trend as the number of clusters increases, because a larger number of clusters raises the chance that nodes with similar attributes are put into the same cluster.

Fig. 3. Clustering qualities on Disney

Fig. 4. Clustering qualities on DBLP-4AREA

Efficiency Evaluation. Table 2 shows the running time of all methods on the four datasets without ground truth. JWNMF runs much faster than the three state-of-the-art attributed graph clustering methods PICS, BAGC and SANS, because its iterative computation converges quickly (usually within 100 iterations). Although S-Cluster achieves the best efficiency, its clustering results can be rather poor, as shown in Fig. 4.

Table 2. Running time (sec) on datasets without ground truth

Parameter Setting. In our experiments, we search the parameter \(\lambda \) over the set {\(10^{-10}\), \(10^{-8}\), \(10^{-7}\), \(10^{-6}\), \(10^{-5}\), \(10^{-4}\), \(10^{-3}\), 0.5} to find its optimal setting on Amazon Fail, Enron, Disney and DBLP-4AREA, as sketched below. Based on our experience, we advise setting \(\lambda \) according to the sparsity of the topology structure; specifically, a small value of \(\lambda \) is more appropriate for datasets with a dense topology.
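One way to carry out this search, reusing the jwnmf and modularity sketches above and selecting the value with the highest modularity (the selection criterion is our assumption; the paper does not prescribe one):

```python
import numpy as np

def select_lambda(S, A, k,
                  grid=(1e-10, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 0.5)):
    """Grid-search lambda, keeping the value with the highest modularity."""
    best_lam, best_q = None, -np.inf
    for lam in grid:
        labels, _ = jwnmf(S, A, k, lam=lam)  # sketch from Sect. 3.4
        q = modularity(S, labels)            # sketch from Sect. 4.2
        if q > best_q:
            best_lam, best_q = lam, q
    return best_lam
```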

4.4 Performance on Datasets with Ground Truth

Since PICS and SANS cannot produce a clustering with the ground-truth number of clusters, we do not compare with them in this section. Table 3 reports the performance of S-Cluster, BAGC and JWNMF on the three datasets with ground truth. For JWNMF, we set \(\lambda = 1.5\), 0.5 and 4.5 for the three datasets, respectively. Overall, our method performs better than the baseline methods; in particular, the improvements are significant in terms of modularity and NMI. In terms of average entropy, however, the superiority of JWNMF is slight. The reason is that the textual attributes are high-dimensional but very sparse, which makes the computed entropies nearly equal.

Table 3. Performance on three textual datasets (%)
Table 4. Performance on three noisy textual datasets (%)

In JWNMF, we introduced a weighting matrix \(\varLambda \) to handle noisy features. To demonstrate the merits of the weighting scheme, we inject 30% noisy attributes with random 0/1 values into the three datasets (see the sketch below). Table 4 reports the results on these noisy datasets, where JNMF denotes the variant of our method obtained by removing the weighting matrix. JWNMF significantly outperforms the other methods, including JNMF. These results show that the weighting scheme of our model is very useful, especially in the presence of noisy attributes.
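A possible sketch of the noise-injection step described above (our illustration; the exact generation procedure in the paper may differ):

```python
import numpy as np

def inject_noise(A, ratio=0.3, seed=0):
    """Append ratio * m noisy attributes with random 0/1 values to A."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    noise = rng.integers(0, 2, size=(n, int(np.ceil(ratio * m))))
    return np.hstack([A, noise.astype(float)])
```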

5 Conclusion

In this paper, we developed a joint weighted nonnegative matrix factorization method, namely JWNMF, to solve the attributed graph clustering problem. Using two joint factorization terms, JWNMF fuses the topology and attribute information of attributed graphs for clustering. Moreover, a weighting scheme is incorporated into JWNMF to differentiate the importance of attributes to clusters. An iterative algorithm is proposed to find the solutions of JWNMF. Extensive experimental results show that our method outperforms state-of-the-art attributed graph clustering algorithms.