1 Introduction

Many real-world datasets can be modeled as heterogeneous information networks (HINs), which consist of multiple types of objects. For example, the heterogeneous bibliographic network in Fig. 1(a) contains multi-typed objects including authors, venues (conferences or journals) and terms. The heterogeneous server network in Fig. 1(d) contains switches, email servers, database servers and web servers. Clustering in HINs has attracted increasing attention in recent years. For instance, [1] finds that clustering analysis in a heterogeneous bibliographic network helps generate more effective and comprehensive ranking results for authors and venues. In a heterogeneous server network, if an attack script runs on some compromised web servers, reads data from database servers and sends out spam emails through the email servers, we say that these servers compose an “attack sub-network”. [2] proposes a framework to find such “attack sub-networks” with the help of clustering in HINs.

Fig. 1. Example of network schemas and meta paths

However, most existing studies on HIN clustering have limitations. Many studies [3-5] treat HINs as homogeneous networks, i.e., networks consisting of a single type of objects. All types (author, venue or term) are treated in the same way in these algorithms, which therefore fail to use the rich heterogeneous information during the clustering process. Some studies, RankClus [1] and GenClus [6] for example, distinguish different types in HINs. They designate one type in the HIN as the target type and the other types as attribute types, and then generate clusters only for objects of the target type. HIN clustering aims at finding K partitions for multi-typed objects, so that objects in the same partition are more similar (have more connections) to each other than to those in other partitions. For example, in a heterogeneous bibliographic network, we can tell which venues belong to “Information Security” and which belong to “Information Retrieval” by clustering the venues. However, if we are interested in which authors are authorities in “Information Security” and which are authorities in “Information Retrieval”, we have to cluster the authors. Therefore, a better way to analyze heterogeneous bibliographic networks is to generate clusters for venues and authors simultaneously. Unlike RankClus [1] and GenClus [6], we try to find partitions for more than one type of objects. In this case, bi-clustering is a feasible technique because it generates clusters for two types of objects. In recent years, bi-clustering based on Nonnegative Matrix Tri-Factorization (NMTF) [7, 8] has attracted increasing attention because of its mathematical elegance and encouraging empirical results on document data and relational data. Nevertheless, few algorithms utilize NMTF for HIN clustering. The main challenge of applying NMTF to HINs is how to incorporate rich heterogeneous information into the clustering process.

To address the problem, in this paper we propose a novel bi-clustering algorithm, BMFClus (HIN Bi-Clustering based on Matrix tri-Factorization), which simultaneously generates clusters for two types of objects, and takes advantage of rich heterogeneous information during the clustering process. To achieve this goal, the NMTF is adopted as a basic bi-clustering method of BMFClus to cluster two types of objects in HIN. Furthermore, a similarity regularization term is introduced to the objective function of NMTF. The similarity regularization term enables BMFClus to utilize rich heterogeneous information in HINs, which leads to an improvement over the basic NMTF method. Our contributions are summarized as follows:

  1. We propose a bi-clustering algorithm for HINs based on NMTF.

  2. We incorporate rich heterogeneous information into the bi-clustering process by a similarity regularization term.

  3. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithm in comparison with the state-of-the-art algorithms.

The rest of this paper is organized as follows: Sect. 2 introduces the problem statement and related work. Section 3 describes the details of the proposed algorithm. Section 4 reports the performance of the proposed algorithm compared with the state-of-the-art algorithms. Finally, Sect. 5 concludes the paper and outlines future work.

2 Problem Statement and Related Work

An information network is a graph \(G=(V, E)\), where \(V=\bigcup _{i=1}^{t}{X_i}\), \(X_1 = \{x_{11}, ..., x_{1n_1}\}, ..., X_t = \{x_{t1}, ..., x_{tn_t}\}\) denote the t different types of nodes, and E is the set of links between any two data objects in V. If \(t=1\), G is a homogeneous information network. If \(t>1\), G is a heterogeneous information network.

As described in [9], a graph \(S=(T,R)\) is called a network schema if S is an undirected connected graph defined over object types T, with edges as relations from R. A network schema provides a meta-structure description of a heterogeneous information network.

For example, Fig. 1(a) is the network schema of a HIN, specifically a heterogeneous bibliographic network. It contains three types of objects: authors, venues and terms. For this HIN, \(T = \{Author (A), Venue (V), Term (T)\}\), \(R = \{A-V, V-A, A-T, T-A \}\), \(t = 3\), \(X_1\) denotes the objects of author type, \(X_2\) the objects of venue type, and \(X_3\) the objects of term type. Following [9], we define a meta path as a path in the network schema that connects two types. In Fig. 1, (b) and (c) are two meta paths selected from (a). Meta path (b) is composed of relations \(A-T\) and \(T-A\), and is denoted as ATA. ATA encodes whether two authors are interested in the same term, e.g. both authors use “kmeans”. Meta path (c), i.e. AVA, is composed of relations \(A-V\) and \(V-A\), and encodes whether two authors are interested in the same venue, e.g. both authors publish papers in “SIGKDD”.

Given a meta path, we can use PathCount to measure the similarity between a pair of objects [9]. The PathCount of \(x_{1i}\) and \(x_{1j}\) is the number of paths from \(x_{1i}\) to \(x_{1j}\) following a certain meta path. For instance, in Fig. 1, two authors are connected via an “author-term-author” (ATA) path if they use the same term in their papers, and \(PathCount(x_{1i}, x_{1j})\) under path ATA is the number of common terms used by authors \(x_{1i}\) and \(x_{1j}\). Meta path “author-venue-author” (AVA) denotes a relation between authors via venues (i.e., publishing in the same venues), and \(PathCount(x_{1i}, x_{1j})\) under path AVA is the number of common venues in which authors \(x_{1i}\) and \(x_{1j}\) publish. Given a meta path, the higher the value of \(PathCount(x_{1i}, x_{1j})\), the more similar \(x_{1i}\) and \(x_{1j}\) are considered to be. Since a meta path encodes the relationship between different types, it captures rich heterogeneous information of a HIN [9-11].
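As a minimal sketch of PathCount for a two-hop meta path such as ATA, assume the author-term links are stored in a binary incidence matrix. The name `author_term` and the toy values below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical toy incidence matrix: rows are authors, columns are terms.
# author_term[i, t] = 1 if author i uses term t, and 0 otherwise.
author_term = np.array([
    [1, 1, 0, 0],   # author 0
    [1, 1, 1, 0],   # author 1
    [0, 0, 1, 1],   # author 2
])

def path_count(incidence):
    """PathCount for a two-hop meta path X-Y-X, given the X-by-Y incidence matrix.

    Entry (i, j) counts the paths i -> y -> j over all intermediate objects y.
    """
    return incidence @ incidence.T

W_ata = path_count(author_term)
print(W_ata)
```

In this toy case authors 0 and 1 share two terms, authors 1 and 2 share one, and authors 0 and 2 share none. The diagonal counts each author's own terms; it cancels out when a graph Laplacian is later formed from the matrix.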

Several approaches have been proposed to find K partitions for the multi-typed objects in a HIN. SpectralBiclustering [12] bi-clusters two types of objects using spectral clustering. RankClus [1] combines ranking and clustering techniques to analyze two types of objects in a HIN. PathSelClus [11] utilizes the rich heterogeneous information encoded by meta paths. However, PathSelClus only generates clusters for a single type in a HIN. Our work differs from these approaches, as we focus on simultaneously generating clusters for two types of objects. In addition, we show how to incorporate rich heterogeneous information into the NMTF clustering process.

3 Proposed Algorithm

3.1 NMTF

We give a brief review of Nonnegative Matrix Tri-Factorization (NMTF) [8] which is an effective bi-clustering method. Given a data matrix \(M \in \mathbb {R}_{+}^{m \times n}\), the objective function of the NMTF is

$$\begin{aligned} \min _{F \ge 0,G \ge 0, S \ge 0}{\left\| M - FSG^T \right\| _F^2}, \end{aligned}$$
(1)

where \(\left\| \cdot \right\| _F \) denotes the matrix Frobenius norm, \(F \in \mathbb {R}_{+}^{m \times k}\), \( S \in \mathbb {R}_{+}^{k \times k}\) and \(G \in \mathbb {R}_{+}^{n \times k}\). S provides additional degrees of freedom such that the low-rank matrix representation remains accurate, while F gives m row cluster assignment vectors and G gives n column cluster assignment vectors. Equation (1) can be computed using the following update rules [8].

$$\begin{aligned} G_{jk} \leftarrow G_{jk} \sqrt{\frac{(M^TFS)_{jk}}{(GG^TM^TFS)_{jk}}}, \end{aligned}$$
(2)
$$\begin{aligned} F_{ik} \leftarrow F_{ik} \sqrt{\frac{(MGS^T)_{ik}}{(FF^TMGS^T)_{ik}}}, \end{aligned}$$
(3)
$$\begin{aligned} S_{ik} \leftarrow S_{ik} \sqrt{\frac{(F^TMG)_{ik}}{(F^TFSG^TG)_{ik}}}. \end{aligned}$$
(4)

Although the objective function in Eq. (1) is not convex in all variables together, the above update rules have been proved to find a local minimum of Eq. (1). Using NMTF for data clustering has the following merits:

  1. We can obtain the clusters of the rows and the columns of a data matrix simultaneously. In fact, it has also been proved that NMTF is equivalent to performing kernel K-means clustering on both columns and rows [8].

  2. NMTF conducts a knowledge transformation between the row feature space and the column feature space [13]. This means that the quality of the row clustering and that of the column clustering are mutually enhanced during the update iterations.

The performance of NMTF on document data and relational data has been well studied. However, to the best of our knowledge, we are the first to apply NMTF to HIN clustering.
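The update rules in Eqs. (2)-(4) can be sketched in NumPy as follows. This is an illustrative sketch only: the random initialization, the fixed iteration count, and the small constant `eps` that guards against division by zero are assumptions, not details given in the text.

```python
import numpy as np

def nmtf(M, k, n_iter=200, eps=1e-9, seed=0):
    """Multiplicative updates of Eqs. (2)-(4) for M ~ F S G^T."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    # Random positive initialization (an assumption of this sketch).
    F = rng.random((m, k)) + eps
    S = rng.random((k, k)) + eps
    G = rng.random((n, k)) + eps
    for _ in range(n_iter):
        # Eq. (2): update the column factor G.
        MtFS = M.T @ F @ S
        G *= np.sqrt(MtFS / (G @ (G.T @ MtFS) + eps))
        # Eq. (3): update the row factor F.
        MGSt = M @ G @ S.T
        F *= np.sqrt(MGSt / (F @ (F.T @ MGSt) + eps))
        # Eq. (4): update the middle factor S.
        S *= np.sqrt((F.T @ M @ G) / (F.T @ F @ S @ G.T @ G + eps))
    return F, S, G

# Toy block-structured matrix (hypothetical data).
M_toy = np.kron(np.eye(2), np.ones((4, 3))) + 0.05
F_toy, S_toy, G_toy = nmtf(M_toy, k=2)
```

Because every update multiplies by a nonnegative ratio, the factors stay nonnegative throughout without an explicit projection step.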

3.2 BMFClus

Although NMTF can be used to generate clusters for two types in a HIN, it fails to take advantage of the rich heterogeneous information captured by meta paths. Therefore, we propose BMFClus, which not only inherits the advantages of NMTF, but also takes into account the rich heterogeneous information of the HIN.

First, we use NMTF to model a HIN. Two types are selected from T. Then, a nonnegative edge-weight matrix M is constructed, where \(M_{i,j}\) is the number of links between nodes i and j. For example, in Fig. 1, we choose the author type and the venue type. \(M_{i,j}\) then denotes how many papers author i has published in venue j. If the topic of venue j is “data mining” and the value of \(M_{i, j}\) is high, then author i has a high probability of being labeled as “data mining”, and vice versa. According to Eq. (1), F gives the cluster assignment vectors of authors and G gives the cluster assignment vectors of venues. If \(k=3\) and \( F_{i,*} = [0.9, 0.1, 0] \), then author i will be assigned to cluster 1. If \(G_{j,*} = [0.05, 0.95, 0.05]\), then venue j will be assigned to cluster 2.
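The construction of M and the reading of cluster assignments can be illustrated with a short sketch. The publication records below are hypothetical, and taking the largest entry of each assignment vector is the rule implied by the two numeric examples above (each row of F or G is treated as one assignment vector).

```python
import numpy as np

# Hypothetical publication records: one (author index, venue index) pair per paper.
papers = [(0, 0), (0, 0), (0, 1), (1, 1), (2, 1), (2, 1)]
n_authors, n_venues = 3, 2

# M[i, j] = number of papers that author i published in venue j.
M = np.zeros((n_authors, n_venues))
for author, venue in papers:
    M[author, venue] += 1

def assign_clusters(factor):
    """Return 1-based cluster labels: the index of the largest entry in each row."""
    return factor.argmax(axis=1) + 1

F_example = np.array([[0.9, 0.1, 0.0]])      # assignment vector of one author
G_example = np.array([[0.05, 0.95, 0.05]])   # assignment vector of one venue
author_label = assign_clusters(F_example)
venue_label = assign_clusters(G_example)
```

Here `author_label` is cluster 1 and `venue_label` is cluster 2, matching the example in the text.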

Next, we describe how to incorporate rich heterogeneous information into NMTF. As mentioned before, meta paths encode the rich heterogeneous information of a HIN. Given a meta path, a nonnegative similarity matrix is constructed using PathCount. For example, in Fig. 1, if we use meta path \(p=ATA\) (author-term-author), a nonnegative similarity matrix \(W^{(F)}\) between authors can be constructed using \(W_{i,j}^{(F)} = PathCount(x_{1i}, x_{1j})\), where \(W_{i,j}^{(F)}\) is the number of paths from object \(x_{1i} \) to object \(x_{1j} \) following p.

Our goal is to encourage two objects (\(x_{1i} \) and \(x_{1j} \)) that have a high similarity (\(W_{i,j}^{(F)}\)) to have similar cluster assignment vectors (\(F_i \approx F_j\)). To achieve this goal, we introduce the following similarity regularization term.

$$\begin{aligned} O_1 = \frac{1}{2} \sum _{i,j}{\Vert F_i - F_j \Vert _2^2W_{i,j}^{(F)}}. \end{aligned}$$
(5)

The regularization term in Eq. (5) acts as a cost function. If \(x_{1i} \) and \(x_{1j} \) have a high similarity value \(W_{i,j}^{(F)}\), then \( \Vert F_i - F_j \Vert _2^2\) must be small to keep the penalty in Eq. (5) low. Minimizing \(O_1\) therefore smooths the cluster distributions between an object and its similar objects. Define the diagonal matrix \(D^{(F)}\) with \(D_{i,i}^{(F)} = \sum _{j}{W_{i,j}^{(F)}} \). Then, we construct the corresponding Laplacian matrix \(L_{F} = D^{(F)} - W^{(F)}\). Now, we rewrite the regularization term in trace form:

$$\begin{aligned} O_1= & {} \frac{1}{2} \sum _{i,j}{ \Vert F_i - F_j \Vert _2^2W_{i,j}^{(F)}} \nonumber \\= & {} \sum _i{F_i D_{i,i}^{(F)}{F_i}^T} - \sum _{i,j}{F_i W_{i,j}^{(F)}{F_j}^T} \nonumber \\= & {} Tr ({F}^TL_{F}F ). \end{aligned}$$
(6)
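The equality between the pairwise form in Eq. (5) and the trace form in Eq. (6) can be checked numerically. The sketch below assumes a symmetric similarity matrix, which holds for PathCount under symmetric meta paths such as ATA; the random matrices are purely illustrative.

```python
import numpy as np

def pairwise_penalty(F, W):
    """O_1 as the weighted sum of squared distances in Eq. (5)."""
    diff = F[:, None, :] - F[None, :, :]          # diff[i, j] = F_i - F_j
    return 0.5 * np.sum(W * np.sum(diff ** 2, axis=2))

def trace_penalty(F, W):
    """O_1 as Tr(F^T L_F F) with L_F = D - W, the last line of Eq. (6)."""
    L = np.diag(W.sum(axis=1)) - W
    return np.trace(F.T @ L @ F)

rng = np.random.default_rng(0)
F_rand = rng.random((5, 3))
W_rand = rng.integers(0, 4, size=(5, 5)).astype(float)
W_rand = W_rand + W_rand.T                        # make the similarity symmetric

print(pairwise_penalty(F_rand, W_rand), trace_penalty(F_rand, W_rand))
```

The two printed values agree up to floating-point error. When all rows of F are identical, both forms are zero, so the penalty vanishes if all objects share one assignment vector.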

Similarly to the construction of \(O_1\), we construct another regularization term for the venue type:

$$\begin{aligned} O_2 = Tr ({G}^TL_{G}G ). \end{aligned}$$
(7)

We now define the objective function of BMFClus by adding the regularization terms in Eqs. (6) and (7) to Eq. (1):

$$\begin{aligned} \min _{F \ge 0,G \ge 0, S \ge 0}{\left\| M - FSG^T \right\| _F^2} + \lambda (Tr ({F}^TL_{F}F ) + Tr ({G}^TL_{G}G )), \end{aligned}$$
(8)

where the first term represents the reconstruction error for the nonnegative edge-weight matrix M, and the second term represents the similarity regularization. \(\lambda \) is a trade-off parameter. The results are not very sensitive to this parameter, and we set it to 0.1 in our experiments.

Algorithm 1. BMFClus

BMFClus can be computed using the following update rules [14]:

$$\begin{aligned} G_{jk} \leftarrow G_{jk} \sqrt{\frac{ [ \lambda L_{G}^{-}G + A^{+}+GB^{-} ]_{jk} }{ [\lambda L_{G}^{+}G + A^{-} + GB^{+}]_{jk}} }, \end{aligned}$$
(9)
$$\begin{aligned} F_{ik} \leftarrow F_{ik} \sqrt{\frac{ [ \lambda L_{F}^{-}F + P^{+}+FQ^{-} ]_{ik} }{ [\lambda L_{F}^{+}F + P^{-} + FQ^{+}]_{ik}} }, \end{aligned}$$
(10)
$$\begin{aligned} S = (F^TF)^{-1}F^TMG(G^TG)^{-1}, \end{aligned}$$
(11)

where \(L_{G} = L_{G}^{+} - L_{G}^{-}\), \(L_{F} = L_{F}^{+} - L_{F}^{-}\), \(A = M^TFS = A^{+} - A^{-}\), \(B=S^TF^TFS=B^{+} - B^{-}\), \(P = MGS^T = P^{+} - P^{-}\), \(Q = SG^TGS^T = Q^{+} - Q^{-}\), and the positive and negative parts of a matrix are defined entrywise as \( A_{ij}^{+} = (|A_{ij}| + A_{ij})/2 \) and \( A_{ij}^{-} = (|A_{ij}| - A_{ij})/2 \) [15]. The detailed process of BMFClus is summarized in Algorithm 1.
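One possible reading of the update rules in Eqs. (9)-(11) is sketched below in NumPy. Several choices here are assumptions rather than details from the text: the random initialization, the order in which S, G and F are updated within an iteration, the constant `eps` in the denominators, and the use of a pseudo-inverse in place of the matrix inverse in Eq. (11) to tolerate a singular \(F^TF\) or \(G^TG\).

```python
import numpy as np

def pos(X):
    """Entrywise positive part, (|X| + X) / 2."""
    return (np.abs(X) + X) / 2

def neg(X):
    """Entrywise negative part, (|X| - X) / 2."""
    return (np.abs(X) - X) / 2

def laplacian(W):
    """Graph Laplacian L = D - W built from a similarity matrix W."""
    return np.diag(W.sum(axis=1)) - W

def objective(M, F, S, G, L_F, L_G, lam):
    """Value of Eq. (8)."""
    recon = np.linalg.norm(M - F @ S @ G.T, "fro") ** 2
    reg = np.trace(F.T @ L_F @ F) + np.trace(G.T @ L_G @ G)
    return recon + lam * reg

def bmfclus(M, W_F, W_G, k, lam=0.1, n_iter=100, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    m, n = M.shape
    F = rng.random((m, k)) + eps
    G = rng.random((n, k)) + eps
    L_F, L_G = laplacian(W_F), laplacian(W_G)
    history = []
    for _ in range(n_iter):
        # Eq. (11), with pinv instead of inv (assumption of this sketch).
        S = np.linalg.pinv(F.T @ F) @ F.T @ M @ G @ np.linalg.pinv(G.T @ G)
        # Eq. (9): update G.
        A, B = M.T @ F @ S, S.T @ F.T @ F @ S
        num = lam * neg(L_G) @ G + pos(A) + G @ neg(B)
        den = lam * pos(L_G) @ G + neg(A) + G @ pos(B)
        G *= np.sqrt(num / (den + eps))
        # Eq. (10): update F.
        P, Q = M @ G @ S.T, S @ G.T @ G @ S.T
        num = lam * neg(L_F) @ F + pos(P) + F @ neg(Q)
        den = lam * pos(L_F) @ F + neg(P) + F @ pos(Q)
        F *= np.sqrt(num / (den + eps))
        history.append(objective(M, F, S, G, L_F, L_G, lam))
    return F, S, G, history

# Hypothetical toy inputs: a block-structured M with matching similarities.
M_toy = np.kron(np.eye(2), np.ones((3, 2))) + 0.1
W_F_toy = np.kron(np.eye(2), np.ones((3, 3)))
W_G_toy = np.kron(np.eye(2), np.ones((2, 2)))
F_out, S_out, G_out, hist = bmfclus(M_toy, W_F_toy, W_G_toy, k=2, n_iter=50)
```

Since Eq. (11) solves for S in closed form, S may contain negative entries, which is why the update rules split A, B, P and Q into positive and negative parts. The returned `history` can be plotted to inspect how the objective in Eq. (8) evolves.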

4 Experiments

In this section, we conduct a series of experiments to show the effectiveness of BMFClus on both synthetic and real datasets.

Table 1. Statistics of the datasets

4.1 Dataset

We give a brief description of the datasets used in our experiments as follows:

  1. SynData: We generate a synthetic HIN following the properties of real-world HINs. A HIN is composed of several bipartite networks. We apply the method described in [1] to generate 3 synthetic bipartite networks and construct a synthetic HIN, SynData. SynData contains 3 clusters and 3 types, denoted as A, B and C. The numbers of objects are \(N_a = \{1000, 1200, 1300\}\) for type A, \(N_b = \{3000, 3200, 3500\}\) for type B, and \(N_c = \{1000, 1200, 1300\}\) for type C.

  2. DBLP4 [16] is a real-world HIN extracted from the DBLP bibliography dataset in four research areas: database (DB), data mining (DM), information retrieval (IR) and artificial intelligence (AI). DBLP4 contains 3 types of objects: 4236 authors, 20 venues and 11771 unique terms. Each author object links to several venues and terms. The link weight of an author-venue pair is the number of papers the author publishes in the venue. The author-term sub-network contains all the terms appearing in the abstracts of each author's papers, with stopwords removed. All venue objects and author objects are labeled.

  3. Flickr [9, 17] is a HIN that contains three types of objects: image, user and tag. Each image object links to several tags and one user. Image objects are labeled.

Fig. 2. Network schemas of the datasets. (a) SynData; (b) DBLP4; (c) Flickr. The labeled object types are in grey.

The statistics of the datasets are summarized in Table 1. The network schemas of the datasets are shown in Fig. 2.

4.2 Baselines

We compare the proposed method with the following state-of-the-art algorithms:

  • SpectralBiclustering (SBC) [12] is a well-known bi-clustering algorithm based on spectral clustering. Given an \(m \times n\) matrix, SBC generates clusters for the m rows and the n columns.

  • GreedyCoClustering [9] is an information-theoretic greedy bi-clustering algorithm. It uses a greedy KL-divergence based bi-clustering method to cluster an \(m \times n\) matrix.

  • NMTF [8] is the basic form of the proposed method.

  • RankClus [1] is a rank-based algorithm that integrates ranking and clustering for a heterogeneous bibliographic network. RankClus treats an \(m \times n\) matrix as a bipartite graph. It generates clusters for the m rows and applies ranking to the n columns. Then, it generates a better cluster structure for the m rows based on the ranking distribution on the n columns. After several such iterations, the quality of clustering and that of ranking are mutually enhanced.

4.3 Evaluation Metrics

The clustering performance is evaluated by comparing the ground truth labels with the predicted labels. Two popular metrics, i.e., accuracy (ACC) and normalized mutual information (NMI), are used to measure the clustering performance [1, 18, 19].

Given an object \( v_i \) of a certain type \(T_a\) \((1 \le a \le t)\), let \(c_i\) and \(r_i\) be the ground truth label and the predicted cluster label of \(v_i\), respectively. The ACC of type \(T_a\) is defined as follows:

$$\begin{aligned} ACC = \frac{{\sum \nolimits _{i = 1}^{ n_a} {\delta \left( {{c_i},map\left( {{r_i}} \right) } \right) } }}{n_a}, \end{aligned}$$
(12)

where \(n_a\) is the number of objects of type \(T_a\), and \(\delta (x,y)\) equals one if \( x=y\) and zero otherwise. \(map(r_i)\) is the permutation mapping function that maps each cluster label \(r_i\) to the equivalent ground truth label. The Kuhn-Munkres algorithm [20] is used to find the best mapping.
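A small sketch of the ACC computation is given below. Note the swapped-in technique: for brevity it searches all one-to-one label mappings by brute force instead of using the Kuhn-Munkres algorithm, which yields the same optimum but is only practical for a handful of clusters. The function and argument names are illustrative.

```python
from itertools import permutations

def clustering_accuracy(truth, predicted):
    """ACC of Eq. (12), maximized over one-to-one mappings of cluster labels.

    Brute force over permutations stands in for Kuhn-Munkres here.
    """
    true_labels = sorted(set(truth))
    pred_labels = sorted(set(predicted))
    if len(pred_labels) > len(true_labels):
        raise ValueError("more predicted clusters than ground truth classes")
    best = 0
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))    # predicted label -> true label
        hits = sum(mapping[p] == t for p, t in zip(predicted, truth))
        best = max(best, hits)
    return best / len(truth)

truth_toy = [0, 0, 1, 1, 2, 2]
pred_toy = [2, 2, 0, 0, 1, 0]
acc_toy = clustering_accuracy(truth_toy, pred_toy)
```

In the toy case the best mapping sends predicted labels 2, 0 and 1 to true labels 0, 1 and 2, so five of the six objects match.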

Given the clustering result of type \(T_a\), let \(n(i,j), i,j = 1, 2, ..., K\), denote the number of objects that are predicted as label i and labeled as j in the ground truth. From \(n(i,j)\), we define the joint distribution \(p(i, j)=\frac{n(i,j)}{n_a}\), the row distribution \(p_1(j)=\sum _{i=1}^{K}{p(i, j)}\) and the column distribution \(p_2(i) = \sum _{j=1}^{K}{p(i, j)}\). The NMI of type \(T_a\) is defined as follows:

$$\begin{aligned} NMI = \frac{ \sum _{i = 1}^K {\sum _{j = 1}^K {p(i,j)\log ( \frac{p(i,j)}{p_1(j)p_2(i)} )}}}{\sqrt{ \sum _{j = 1}^K {p_1(j) \log p_1(j)} \sum _{i = 1}^K {p_2(i) \log p_2(i)} } }. \end{aligned}$$
(13)

NMI does not require a mapping function between the predicted labels and the ground truth labels.
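Eq. (13) can be computed directly from label counts, as in the following sketch. Returning 0 when either marginal has zero entropy is a convention assumed here for degenerate clusterings; the text does not specify that case.

```python
from collections import Counter
from math import log, sqrt

def normalized_mutual_information(predicted, truth):
    """NMI of Eq. (13), computed from the joint label counts n(i, j)."""
    n = len(truth)
    joint = Counter(zip(predicted, truth))   # n(i, j)
    pred_counts = Counter(predicted)         # n_a * p_2(i)
    true_counts = Counter(truth)             # n_a * p_1(j)
    mutual_info = sum(
        (c / n) * log((c / n) / ((pred_counts[i] / n) * (true_counts[j] / n)))
        for (i, j), c in joint.items()
    )
    h_pred = -sum((c / n) * log(c / n) for c in pred_counts.values())
    h_true = -sum((c / n) * log(c / n) for c in true_counts.values())
    if h_pred == 0 or h_true == 0:
        return 0.0   # assumed convention for a single-cluster labeling
    return mutual_info / sqrt(h_pred * h_true)

nmi_relabelled = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_independent = normalized_mutual_information([0, 1, 0, 1], [0, 0, 1, 1])
```

The first call scores a clustering that is perfect up to a renaming of labels, and the second scores one that is statistically independent of the ground truth, which illustrates why no label mapping is needed.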

Both metrics are in the range from 0 to 1 and a higher value indicates a better clustering performance in terms of the ground truth labels.

4.4 Settings

For the DBLP4 dataset, the author type and the venue type are selected as the two types of objects to cluster. The nonnegative edge-weight matrix M is constructed from the link counts between pairs of objects. \(L_F\) and \(L_G\) are constructed using meta paths ATA (author-term-author) and VAV (venue-author-venue), respectively. Meta path ATA means that two authors sharing more common terms are more similar, and meta path VAV means that two venues sharing more common authors are more similar. For the Flickr dataset, we choose the image type and the tag type as the types of objects to cluster. M is constructed from the links between image objects and tag objects. Since only image objects are labeled, the clustering performance is evaluated on the image type. \(L_F\) and \(L_G\) are constructed using meta paths IUI (image-user-image) and TIT (tag-image-tag), respectively. Meta path IUI means that two images provided by the same user are similar, and meta path TIT means that two tags sharing more common images are more similar. For the SynData dataset, we choose types A and B as the clustering targets. M is constructed from the links between A and B. \(L_F\) and \(L_G\) are constructed using meta paths ACA and BCB, respectively. M also serves as the input data for SBC [12], GreedyCoClustering [9], NMTF [8] and RankClus [1]. Each result is the average of 10 runs.

4.5 Results

The results are shown in Table 2, with the best results highlighted in boldface. On the DBLP4 and SynData datasets, which are nicely structured, all methods achieve strong performance, and NMTF outperforms the other baselines. As expected, owing to the rich heterogeneous information captured by the similarity regularization term, BMFClus performs much better than NMTF on the author type and the venue type with respect to both NMI and ACC. On the Flickr dataset, SBC [12] achieves the best accuracy on the image type, whereas BMFClus outperforms SBC with respect to NMI. BMFClus achieves the best results on the DBLP4 and SynData datasets, and the second best results on the Flickr dataset. Overall, we conclude that BMFClus outperforms the baseline methods.

Table 2. Clustering performance of different methods.

Fig. 3. Convergence of BMFClus

4.6 Algorithm Convergence

The update rules for minimizing the objective function of BMFClus are iterative, so we investigate the convergence of BMFClus. Figure 3 shows the convergence curves of the objective function (in log values) on the three datasets.

The objective values of BMFClus fall quickly during the first several iterations on each dataset.

5 Conclusion and Future Work

In this paper, we propose a bi-clustering algorithm (BMFClus) for HINs based on NMTF. Specifically, given a HIN, BMFClus simultaneously generates clusters for two types of objects. In addition, BMFClus takes rich heterogeneous information into account by using a similarity regularization term. Experiments on both synthetic and real-world datasets demonstrate that BMFClus outperforms the state-of-the-art methods. In future work, we will investigate how to extend BMFClus to arbitrary multi-typed heterogeneous information networks.