1 Introduction

Heterogeneous Information Network (HIN) [15], as a new network modeling method, has drawn much attention due to its ability to model complex objects and their rich relations. Moreover, in many real HINs, objects are often associated with various attributes. For example, in the Yelp dataset, businesses and users can be regarded as nodes: business objects carry attributes such as location, rating and business type, while user objects carry attributes such as age and gender. To model HINs with node attributes, researchers have brought forward the concept of the Attributed Heterogeneous Information Network (AHIN) [9].

Clustering is a fundamental task in data analysis. Given a set of objects, the goal is to partition them into clusters such that objects in the same cluster are similar to each other, while objects in different clusters are dissimilar. Semi-supervised clustering incorporates supervision about clusters into the algorithm in order to improve the clustering results. To date, many semi-supervised clustering algorithms for information networks have been proposed. Some methods [1, 6, 8] target homogeneous information networks, where links are assumed to be of the same type. For heterogeneous information networks, GNetMine [4], PathSelClus [16] and SemiRPClus [10] have been proposed. None of the algorithms above considers node attributes. Recently, SCHAIN [9] studied the problem of semi-supervised clustering in AHINs and achieved state-of-the-art clustering performance.

Although these methods achieve satisfactory performance to some extent, they have two notable shortcomings. First, they do not consider the coupling relationships among object attributes. In the real world, attributes are associated with each other through complex relationships known as coupling relationships [10, 16]. To illustrate, in a movie dataset, the “budget” and “gross” attributes of a movie depend largely on its “country” attribute. Ignoring such dependencies between attributes inevitably leads to inferior clustering results. Second, previous works cannot flexibly adjust the importance of supervision across datasets, even though real applications may require weighting supervision differently for different tasks.

To handle these issues, we put forward a novel model, Semi-supervised Clustering with Coupled Attributes in Attributed Heterogeneous Information Networks (SCAN). Inspired by the newly emerging field of non-IID learning [3], we go one step further to mine the coupled similarity between node attributes, thus capturing the inter-dependent relationships between attributes. Moreover, we propose a novel constraint method to flexibly leverage the supervision information, through which users can adjust the importance of supervision for different clustering objectives.

The main contributions of this paper can be summarized as follows:

  • To the best of our knowledge, we are the first to mine the coupling relationships between node attributes in AHINs. We propose a coupled node similarity measure to better model the inter-dependent relationships between node attributes of mixed data types.

  • We propose a novel approach to using supervision information, which is able to balance node similarity and supervision constraints flexibly.

  • We conduct extensive experiments on two real-world datasets. The proposed method is shown to be effective, outperforming state-of-the-art methods.

2 Definitions and Model Overview

2.1 Problem Definition

In this paper, we consider the problem of semi-supervised clustering in AHINs. The terms are defined as follows.

Definition 1

Attributed Heterogeneous Information Network (AHIN) [9]. Let \(\mathcal {T}=\{T_1,\ldots ,T_m\}\) be a set of m object types. For each type \(T_i\), let \(\mathcal {X}_i\) be the set of objects of type \(T_i\) and \(A_i\) be the set of attributes defined for objects of type \(T_i\). An object \(x_j\) of type \(T_i\) is associated with an attribute vector \(\varvec{f}_j=(f_{j_1}, f_{j_2}, \dots , f_{j_{\left| A_i \right| }})\). An AHIN is a graph \(G=(V,E,\mathcal {A})\), where \(V=\bigcup _{i=1}^m\mathcal {X}_i\) is a set of nodes, E is a set of links, each representing a binary relation between two objects in V, and \(\mathcal {A}=\bigcup _{i=1}^mA_i\). If \(m=1 \) (i.e., there is only one object type), G reduces to a homogeneous information network.

Figure 1a shows an example movie AHIN consisting of four types of objects: \(\mathcal {T}\)={movies (M), actors (A), genres (G), directors (D)}. The network schema [15] is shown in Fig. 1b. In this AHIN, the M, A and D node types are associated with both numerical and categorical attributes, as shown in Fig. 1c.

A meta-path [15] \(\mathcal {P}: T_1{\mathop {\longrightarrow }\limits ^{R_1}}\cdots {\mathop {\longrightarrow }\limits ^{R_l}} T_{l+1}\) defines a composite relation \(R=R_1\circ \cdots \circ R_l\) that relates objects of type \(T_1\) to objects of type \(T_{l+1}\). We say \(\mathcal {P}\) is symmetric if the defined relation R is symmetric. Given a meta-path \(\mathcal {P}\), the specific paths under the meta-path are called path instances, denoted by P. To illustrate, in the AHIN shown in Fig. 1a, the meta-path MDM expresses that two movies are directed by the same director, and \(M_{1} D_{1} M_{4}\) is a path instance of the MDM meta-path. Different meta-paths preserve different semantics in an AHIN.
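As a concrete illustration (a sketch of the standard commuting-matrix view of meta-paths, not part of the original formulation; the toy matrix below is hypothetical), path instances of a symmetric meta-path such as MDM can be counted with adjacency-matrix products:

```python
import numpy as np

# Hypothetical movie-director incidence matrix: A_md[i, j] = 1 iff movie i
# is directed by director j (4 movies, 2 directors, as a toy example).
A_md = np.array([
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 0],
])

# Commuting matrix of the MDM meta-path: entry (a, b) counts the number of
# MDM path instances between movie a and movie b (shared directors).
C_mdm = A_md @ A_md.T
print(C_mdm)  # C_mdm[0, 3] == 1: M1-D1-M4 is an MDM path instance
```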

Definition 2

Semi-supervised clustering in an AHIN [9]. Given an AHIN \(G=(V,E,\mathcal {A})\), a supervision constraint \((\mathcal {M},\mathcal {C})\) consisting of a must-link set \(\mathcal {M}\) and a cannot-link set \(\mathcal {C}\), a target object type \(T_i\), the number of clusters k, and a set of meta-paths \(\mathcal {PS}\), the problem of semi-supervised clustering of type \(T_i\) objects in G is to (1) discover an object similarity measure S that is based on object attributes and meta-paths, and (2) partition the objects in \(\mathcal {X}_i\) into k disjoint clusters \(\mathbf {C}=\{C_1,\dots ,C_k\}\) based on the similarity measure S such that the clustering results best agree with the constraint \((\mathcal {M},\mathcal {C})\).

Fig. 1. An AHIN (a), its network schema (b), and its node attributes (c)

2.2 Model Overview

In this section, we take a brief look at the proposed model SCAN. As introduced in Definition 2, the task can be decomposed into two steps, namely node similarity calculation and clustering. In the following, we detail the two steps.

To begin with, our model calculates the similarity of every node pair based on attribute similarity and network connectedness: the former is obtained by an attribute similarity measure that considers coupling relationships amongst attributes, while the latter is derived from the meta-paths connecting the pair. Moreover, SCAN assigns a weight to each object attribute and meta-path when computing the overall node similarity S, enabling the model to learn different weights for different clustering objectives. For the supervision constraint, SCAN introduces a flexible penalty function to balance clustering quality and supervision. Finally, SCAN employs the iterative, two-step learning process of SCHAIN [9] to determine the optimal weights and cluster assignments.

The rest of the paper is organized as follows: the calculation of coupled similarity between nodes is explained in Sect. 3, the clustering and weight learning process in Sect. 4, and extensive experiments along with a discussion of parameters in Sect. 5. Finally, we conclude in Sect. 6.

3 Node Similarity with Coupled Attributes

In this section, we introduce how to calculate the similarity between two nodes in an AHIN. Given a node pair \({<}x_a,x_b{>}\) of the same node type \(T_i\), their similarity is twofold: on the one hand, it can be obtained from attribute similarity; on the other hand, it can be measured by the connectedness of the pair, namely link similarity. The overall node similarity matrix S is obtained by aggregating the coupled attribute similarity matrix \(S_A\) and the link similarity matrix \(S_L\). To mine the complex coupling relationships between node attributes, we go one step further and calculate the coupled similarity between node attributes, as detailed in the following subsection.

3.1 Coupled Node Attribute Similarity

In this section we explain how to capture the coupling relations between node attributes. Given a node pair \({<}x_a,x_b{>}\) of type \(T_i\), let \(\varvec{f}_a\) and \(\varvec{f}_b\) be the attribute vectors of \(x_a\) and \(x_b\), respectively (see Definition 1). Recall that \(A_i\) is the set of attributes associated with type \(T_i\) objects. We define an attribute weight vector \(\varvec{\omega }\), whose j-th component, \(\omega _j\), captures the importance of the j-th attribute in \(A_i\) for the clustering task. The entries of \(S_A\) are calculated by:

$$\begin{aligned} S_A(x_a,x_b)=\sum _{j=1}^{\left| A_i \right| }\omega _j \cdot valSim(f_{aj},f_{bj}). \end{aligned}$$
(1)
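A direct transcription of Eq. (1) may help fix ideas (a minimal sketch; the helper names are ours, and `val_sim` dispatches on the attribute type as described next):

```python
def attribute_similarity(f_a, f_b, omega, val_sim):
    """Eq. (1): weighted sum of per-attribute value similarities.

    f_a, f_b -- attribute vectors of x_a and x_b
    omega    -- attribute weight vector (one weight per attribute in A_i)
    val_sim  -- callable val_sim(j, f_aj, f_bj) returning the value similarity
    """
    return sum(w * val_sim(j, fa, fb)
               for j, (w, fa, fb) in enumerate(zip(omega, f_a, f_b)))
```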

The valSim() function calculates the similarity between the attribute values \(f_{aj}\) and \(f_{bj}\). Since, in the real world, many nodes carry attributes of mixed types, we define \(A_{ij}\) as the j-th attribute in \(A_i\) and calculate the value similarity of different attribute types separately.

For numerical attributes, we first normalize the attribute values to [0, 1], and then use Eq. (2) to convert distance to similarity. Euclidean distance is adopted as the distance metric in this paper.

$$\begin{aligned} valSim(f_{aj},f_{bj})=1-d(f_{aj},f_{bj}) ,\,\,\,if\,A_{ij}\,is\,numerical. \end{aligned}$$
(2)
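A minimal sketch of the numerical branch of valSim(), assuming the attribute values have already been min-max normalized to [0, 1] (the function name is ours):

```python
def val_sim_numerical(f_a: float, f_b: float) -> float:
    """Eq. (2): similarity of two normalized numerical attribute values.

    For scalars in [0, 1], the Euclidean distance reduces to the absolute
    difference, so the similarity is 1 - |f_a - f_b|.
    """
    return 1.0 - abs(f_a - f_b)

# Example: two normalized "budget" values 0.3 apart yield similarity 0.7.
print(val_sim_numerical(0.8, 0.5))
```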

However, categorical attributes are associated with each other through certain coupling relationships, and ignoring these couplings when computing attribute value similarity leads to coarse similarity estimates [17]. This motivates us to design a coupled similarity measure for categorical attributes. Thus, we go one step further and mine the coupled value similarity between categorical attribute values:

$$\begin{aligned} \begin{aligned} valSim(f_{aj},f_{bj})&=(1-\eta )S_p(f_{aj},f_{bj})\\&\quad +\, \eta S_c(f_{aj},f_{bj}) ,\,\,\,if\,A_{ij}\,is\,categorical. \\ \end{aligned} \end{aligned}$$
(3)

We capture two levels of categorical value similarity: the plain similarity \(S_p\), which measures the difference between the categorical values themselves, and the coupled similarity \(S_c\), which measures value similarity in terms of couplings. The overall categorical attribute similarity is obtained by a weighted fusion of \(S_p\) and \(S_c\) using the coupled coefficient \(\eta \).

For plain similarity, we use SMS [7], which assigns similarity 1 to identical and 0 to distinct categorical values. For the coupled similarity, following IeASV [17], we measure the co-occurrence of the measured attribute with the other attributes. To elaborate, we bring forward the inter-coupled attribute value similarity.

Definition 3

Inter-coupled attribute value similarity (IAVS). Given two attribute values \(f_{aj}\) and \(f_{bj}\) of attribute \(A_{ij}\) in an AHIN, the inter-coupled attribute value similarity between \(f_{aj}\) and \(f_{bj}\) in terms of another attribute \(A_{ik}\) is denoted by \(\delta _{j|k} (f_{aj},f_{bj})\).

Although the definition of IAVS is similar to that of IeASV, IeASV only handles categorical data. Since we are dealing with mixed data types, we generalize the inter-coupled relative similarity candidate \(A_{ik}\) to numerical data. The coupled similarity of two attribute values, \(S_c (f_{aj},f_{bj})\), is then obtained by aggregating the similarity over the other attributes:

$$\begin{aligned} S_c(f_{aj},f_{bj})=\frac{ \sum _{k=1,k\ne j}^{\left| A_i \right| }\delta _{j|k} (f_{aj},f_{bj})}{\left| A_i \right| -1}. \end{aligned}$$
(4)

Let us explain the intuition behind the equation with a toy example. In the movie AHIN shown in Fig. 1, suppose we are calculating the coupled similarity \(S_c (f_{aj},f_{bj})\) between the attribute values \(f_{aj}=\) “USA” and \(f_{bj}=\) “UK” of attribute \(A_{ij}=\) “country”. The coupled similarity of “USA” and “UK” is then the aggregated similarity of USA movies and UK movies in terms of the other attributes \(A_{ik}\in \) {“language”, “budget”, “gross”, “IMDB score”}, calculated by Eq. (4).
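Putting Eqs. (3) and (4) together, a hedged sketch of the categorical branch of valSim() (helper names are ours; `delta_jk` stands for the per-attribute similarity \(\delta _{j|k}\), defined in Eq. (5) below):

```python
def val_sim_categorical(f_aj, f_bj, j, attrs, delta_jk, eta=0.5):
    """Eq. (3): fuse plain (SMS) and coupled similarity of categorical values.

    attrs    -- list of attribute indices of the node type (the set A_i)
    delta_jk -- callable delta_jk(f_aj, f_bj, k) implementing Eq. (5)
    eta      -- coupled coefficient balancing S_p and S_c
    """
    # Plain similarity S_p (SMS): 1 for identical values, 0 otherwise.
    s_p = 1.0 if f_aj == f_bj else 0.0
    # Coupled similarity S_c, Eq. (4): average of delta_{j|k} over the
    # |A_i| - 1 other attributes.
    others = [k for k in attrs if k != j]
    s_c = sum(delta_jk(f_aj, f_bj, k) for k in others) / len(others)
    return (1 - eta) * s_p + eta * s_c
```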

For categorical attribute similarity, we adopt the inter-coupled relative similarity based on intersection set (IRSI) [17] to capture the co-occurrence relationship between categorical attributes. However, IRSI is designed to measure dependencies between categorical variables, while the coupling relationships between categorical and numerical attributes remain unaddressed.

To capture the dependency between a categorical attribute and a numerical attribute, we define \(Val(f_{aj},A_{ik})\) as the set of values of attribute \(A_{ik}\) generated by the objects with value \(f_{aj}\). In our example, if \(A_{ik}\) = “gross”, then \(\delta _{j|k} (f_{aj},f_{bj})\) stands for the similarity between USA and UK movies in terms of gross, and \(Val(f_{aj},A_{ik})\) and \(Val(f_{bj},A_{ik})\) are the gross values of USA and UK movies, respectively. The task is to judge the similarity between the distributions of \(Val(f_{aj},A_{ik})\) and \(Val(f_{bj},A_{ik})\): the closer the two distributions, the more similar the two categorical values. This is a problem of comparing two finite, continuous distributions.

Inspired by histograms, we simplify the problem by discretizing the two distributions into finite discrete distributions. The discretization method can be any of the methods described in [5]. We define \(\varPsi (V_{ajk},P_{ajk})\) as the discretized distribution of \(Val(f_{aj},A_{ik})\), where \(V_{ajk}\) is the set of discretized values (the bins of the histogram) of \(Val(f_{aj},A_{ik})\) and \(P_{ajk}\) holds the corresponding value occurrence frequencies. For simplicity, we use uniform quantization as the discretization method in our experiments.

By now, the dependency between a categorical attribute and a numerical attribute has been reduced to the similarity between two discrete distributions. We first calculate the distance between the two distributions and then convert the distance to a similarity. Thus, \(\delta _{j|k} (f_{aj},f_{bj})\) can be calculated by the following equation:

$$\begin{aligned} \delta _{j|k} (f_{aj},f_{bj})=\left\{ \begin{aligned}&IRSI(f_{aj},f_{bj},A_{ik}),&if\,A_{ik}\,is\,categorical. \\&1-d(\varPsi (V_{ajk},P_{ajk}),\varPsi (V_{bjk},P_{bjk})) ,&if\,A_{ik}\,is\,numerical. \end{aligned} \right. \end{aligned}$$
(5)

When \(A_{ik}\) is numerical, the difference between the discretized value sets \(V_{ajk}\) and \(V_{bjk}\) can be measured by the distance between the corresponding discretized numerical attribute values (the bins of the histogram). To capture this cross-bin relationship, we adopt the earth mover's distance (EMD) [12] to measure the distance between the two value distributions.
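For the numerical branch of Eq. (5), the following sketch combines uniform quantization with SciPy's one-dimensional Wasserstein distance, which coincides with the EMD for 1-D distributions. It assumes the numerical attribute has been normalized to [0, 1], so the resulting distance also lies in [0, 1]; the function and variable names are ours:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def delta_numerical(vals_a, vals_b, n_bins=10):
    """delta_{j|k} of Eq. (5) when A_ik is numerical.

    vals_a, vals_b -- Val(f_aj, A_ik) and Val(f_bj, A_ik), e.g. the
    normalized gross values of USA movies and of UK movies.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    centers = (bins[:-1] + bins[1:]) / 2
    # Uniform quantization: histogram counts play the role of P_ajk / P_bjk.
    p_a, _ = np.histogram(vals_a, bins=bins)
    p_b, _ = np.histogram(vals_b, bins=bins)
    # EMD between the two discretized distributions (bin centers as support).
    d = wasserstein_distance(centers, centers, u_weights=p_a, v_weights=p_b)
    return 1.0 - d  # convert distance to similarity
```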

3.2 Link Similarity and Similarity Aggregation

In this section, we calculate the similarity between a node pair in terms of network connectedness. We adopt the widely used meta-path approach to calculate link similarity. Given a meta-path set \(\mathcal {PS}\), we define a weight vector \(\varvec{\lambda }\); each meta-path \(\mathcal {P}_j \in \mathcal {PS}\) is assigned a weight \(\lambda _j\), and the overall link similarity \(S_L\) is obtained by a weighting scheme:

$$\begin{aligned} S_L(x_{a},x_{b})= \sum _{j=1}^{|\mathcal {PS}|}\lambda _j \cdot S_{\mathcal {P}_{j}}(x_a,x_b). \end{aligned}$$
(6)

SCHAIN [9] uses PathSim [15] to measure the meta-path based similarity \(S_{\mathcal {P}_{j}}\) for a given meta-path \(\mathcal {P}_j\). However, PathSim can only compute similarity along symmetric meta-paths. In real applications, asymmetric meta-paths are also useful and cannot be ignored: in movie networks, for instance, MAMDM and MDMAM are asymmetric yet useful meta-paths that preserve the cooperation relationship between actors and directors. Thus, we use HeteSim [13] to calculate \(S_{\mathcal {P}_{j}}\), so that SCAN is capable of handling both symmetric and asymmetric meta-paths:

$$\begin{aligned} S_{\mathcal {P}_{j}}(x_a,x_b)= HeteSim(x_a,x_b). \end{aligned}$$
(7)

The overall node similarity matrix S represents the aggregated similarity of link similarity and coupled attribute similarity, which can be calculated by a weighted sum of \(S_A\) and \(S_L\):

$$\begin{aligned} S=\alpha S_A+(1-\alpha )S_L, \end{aligned}$$
(8)

where \(\alpha \) is a weighting factor that controls the relative importance of the two similarity matrices.
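A numpy sketch of the aggregation in Eqs. (6) and (8), assuming the per-meta-path HeteSim matrices and the coupled attribute similarity matrix have already been computed (variable names are ours):

```python
import numpy as np

def aggregate_similarity(S_A, S_P_list, lam, alpha):
    """Eqs. (6) and (8): combine attribute and link similarity.

    S_A      -- n x n coupled attribute similarity matrix
    S_P_list -- list of n x n HeteSim matrices, one per meta-path in PS
    lam      -- meta-path weight vector (lambda), one weight per meta-path
    alpha    -- balance factor between attribute and link similarity
    """
    S_L = sum(l * S_P for l, S_P in zip(lam, S_P_list))  # Eq. (6)
    return alpha * S_A + (1 - alpha) * S_L               # Eq. (8)
```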

4 Clustering and Weight Learning

4.1 Leveraging the Supervision

This part introduces the penalty function. Recall that our objective is to obtain \(\mathbf {C}=\{C_1,\dots ,C_k\}\) based on the similarity matrix S such that the clustering results best agree with the constraint \((\mathcal {M},\mathcal {C})\). After generating the similarity matrix S, we adopt the clustering and optimization framework of SCHAIN. Different from SCHAIN, we propose a more flexible penalty function to better leverage the supervision information.

We use a semi-supervised normalized cut [14] to measure the separation between clusters. The penalty function for clustering is:

$$\begin{aligned} \mathcal {J}&=\sum _{r=1}^{k} \frac{links(C_r,\mathcal {X}_i \backslash C_r)}{links(C_r,\mathcal {X}_i)}\\&\quad -\sum _{r=1}^{k}\Bigg (\beta \sum _{{<}x_a,x_b{>}\in \mathcal {M},\,x_a,x_b\in C_r}\frac{S(x_a,x_b)}{links(C_r,\mathcal {X}_i)}-\rho \sum _{{<}x_a,x_b{>}\in \mathcal {C},\,x_a,x_b\in C_r}\frac{S(x_a,x_b)}{links(C_r,\mathcal {X}_i)}\Bigg ). \end{aligned}$$

(9)

This penalty function is composed of two parts:

  1. Clustering quality based on the similarity matrix S: the normalized cut [14] measures the similarity between clusters, \(NC=\sum _{r=1}^{k} \frac{links(C_r,\mathcal {X}_i \backslash C_r)}{links(C_r,\mathcal {X}_i)}\), where \(links(C_p,C_q)=\sum _{x_a \in C_p, x_b \in C_q}S(x_a,x_b)\). The larger the similarity between different clusters, the worse the clustering quality.

  2. Supervision constraints: for an object pair \({<}x_a,x_b{>}\) whose objects belong to the same cluster \(C_r\), the presence of \({<}x_a,x_b{>}\) in \(\mathcal {M}\) indicates good clustering quality, while its presence in \(\mathcal {C}\) indicates bad clustering quality.

For flexibility, we define two supervision importance parameters \(\beta \) and \(\rho \) to balance the influence of the supervision sets. The penalty function of SCHAIN is equal to Eq. (9) with \(\beta \) and \(\rho \) set to 1; that is, SCHAIN regards the clustering objective and the supervision from \(\mathcal {M}\) and \(\mathcal {C}\) as equally important. However, when handling different datasets and different clustering objectives, it is desirable to cluster with different \(\beta \) and \(\rho \). In this paper, we fix \(\rho \) to 1 and use \(\beta \) as a hyper-parameter: a larger \(\beta \) places more weight on the object pairs in the must-link set \(\mathcal {M}\).

By defining \(\{\varvec{z}_r\}_{r=1}^{k}\), where each \(\varvec{z}_r\) is a binary indicator vector of length \(n=|\mathcal {X}_i|\) with \(\varvec{z}_r(a)=1\) if \(x_a\) is assigned to cluster \(C_r\) and 0 otherwise, we obtain the matrix form of the penalty function for clustering:

$$\begin{aligned} \mathcal {J}(\varvec{\lambda ,\omega },\{\varvec{z}_r\}_{r=1}^{k})=\sum _{r=1}^{k} \frac{\varvec{z}_r^T(D-S-W\circ S)\varvec{z}_r}{\varvec{z}_r^T D \varvec{z}_r}. \end{aligned}$$
(10)

\(\circ \) denotes the Hadamard (element-wise) product of two matrices. \(W \in \mathbb {R}^{n \times n}\) is a constraint matrix, where \(W(x_a,x_b)=\beta \) for \({<}x_a,x_b{>}\, \in \mathcal {M}\), \(W(x_a,x_b)=-1\) for \({<}x_a,x_b{>}\, \in \mathcal {C}\), and 0 otherwise. Furthermore, we add a regularization term to Eq. (10) and obtain the overall penalty function:

$$\begin{aligned} \mathcal {J}(\varvec{\lambda ,\omega },\{\varvec{z}_r\}_{r=1}^{k})=\sum _{r=1}^{k} \frac{\varvec{z}_r^T(D-S-W\circ S)\varvec{z}_r}{\varvec{z}_r^T D \varvec{z}_r}+\gamma (\left\| \varvec{\lambda }\right\| ^2+\left\| \varvec{\omega }\right\| ^2). \end{aligned}$$
(11)

Finally, to find the best clustering, we minimize the penalty function subject to the following constraints: \(\sum _{r=1}^k \varvec{z}_r(a)=1\); \(\varvec{z}_r(a)\in \{0,1\}\); \(\sum _{l=1}^{\left| A_i \right| } \omega _l =1\); \(\omega _l >0\); and \(\lambda _j>0\). Note that \(\alpha \), \(\beta \), \(\eta \) and \(\gamma \) are hyper-parameters of the objective.
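As a reference sketch of Eqs. (10)–(11) (not the authors' code; we take D to be the diagonal degree matrix of S, following the normalized-cut formulation, and all helper names are ours):

```python
import numpy as np

def build_constraint_matrix(n, must_link, cannot_link, beta):
    """W(x_a, x_b) = beta for must-link pairs, -1 for cannot-link pairs, 0 otherwise."""
    W = np.zeros((n, n))
    for a, b in must_link:
        W[a, b] = W[b, a] = beta
    for a, b in cannot_link:
        W[a, b] = W[b, a] = -1.0
    return W

def penalty(S, W, Z, lam, omega, gamma):
    """Eq. (11): supervised normalized-cut penalty plus regularization.

    S -- n x n overall similarity matrix; W -- n x n constraint matrix;
    Z -- n x k binary indicator matrix whose r-th column is z_r.
    """
    D = np.diag(S.sum(axis=1))           # diagonal degree matrix of S
    M = D - S - W * S                    # W * S is the Hadamard product
    J = sum((z @ M @ z) / (z @ D @ z) for z in Z.T)
    return J + gamma * (np.sum(lam ** 2) + np.sum(omega ** 2))
```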

4.2 Model Optimization

There are two objectives to be optimized in the learning process: the clustering \(\{\varvec{z}_r\}_{r=1}^{k}\) and the weighting vectors \(\varvec{\lambda }\) and \(\varvec{\omega }\). In this section, we introduce the optimization of our model. Following SCHAIN [9], we use a mutually updating optimization method. First, given the weights \( \varvec{\lambda }\) and \(\varvec{\omega }\), we find the optimal clustering \(\{\varvec{z}_r\}_{r=1}^{k}\). Second, given \(\{\varvec{z}_r\}_{r=1}^{k}\), we find the optimal \(\varvec{\lambda }\) and \(\varvec{\omega }\). We iterate until \(\mathcal {J}\) converges, i.e., until its decrease between iterations falls below a given threshold \(\epsilon \). We briefly explain how the two update steps are performed.

Step 1: Optimize \(\{{\varvec{z}}_{\varvec{r}}\}_{{\varvec{r=1}}}^{{\varvec{k}}}\) given \(\varvec{\lambda }\) and \(\varvec{\omega }\)

In this step, the task is to find the best clustering \(\{\varvec{z}_r\}_{r=1}^{k}\). We define a matrix \(\hat{Z}\) whose r-th column equals \(D^{\frac{1}{2}}\varvec{z}_r /(\varvec{z}_r^{T}D \varvec{z}_r)^{\frac{1}{2}}\), so that \(\hat{Z}^T \hat{Z}=I_k\), where \(I_k\) is the \(k\times k\) identity matrix. Since the weights of meta-paths and attributes, namely \(\varvec{\lambda }\) and \(\varvec{\omega }\), are given, the objective function \(\mathcal {J}(\varvec{\lambda ,\omega },\{\varvec{z}_r\}_{r=1}^{k})\) becomes a function of the clustering indicators \(\{\varvec{z}_r\}_{r=1}^{k}\) alone, and the minimization of \(\mathcal {J}\) reduces to a trace maximization problem:

$$\begin{aligned} \max _{\hat{Z}^T \hat{Z}=I_k}trace(\hat{Z}^TD^{-\frac{1}{2}}(S+W\circ S)D^{-\frac{1}{2}}\hat{Z}), \end{aligned}$$
(12)

which has a closed-form solution obtained from the top k eigenvectors [2]. K-means is then applied to \(\hat{Z}\) to obtain the hard clustering \(\{\varvec{z}_r\}_{r=1}^{k}\).
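A sketch of this step under the usual spectral-clustering conventions (again taking D as the degree matrix of S and assuming strictly positive degrees; an illustrative implementation, not the authors' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def update_clusters(S, W, k):
    """Eq. (12): trace maximization via the top-k eigenvectors, then k-means."""
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    M = D_inv_sqrt @ (S + W * S) @ D_inv_sqrt  # symmetric, so eigh applies
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    Z_hat = vecs[:, -k:]                       # top-k eigenvectors
    return KMeans(n_clusters=k, n_init=10).fit_predict(Z_hat)
```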

Step 2: Optimize \(\varvec{\lambda }\) and \(\varvec{\omega }\) given \(\{{\varvec{z}}_{\varvec{r}}\}_{{\varvec{r=1}}}^{{\varvec{k}}}\)

In this step, the model finds the best \(\varvec{\lambda }\) and \(\varvec{\omega }\) given a fixed clustering \(\{\varvec{z}_r\}_{r=1}^{k}\). As proved in SCHAIN [9], minimizing \(\mathcal {J}\) is equivalent to maximizing:

$$\begin{aligned} \max _{\varvec{\lambda },\varvec{\omega }}\sum _{r=1}^{k}\frac{\varvec{z}_r^T(S+W\circ S)\varvec{z}_r}{\varvec{z}_r^TD\varvec{z}_r}-\gamma (\left\| \varvec{\lambda }\right\| ^2+\left\| \varvec{\omega }\right\| ^2), \end{aligned}$$
(13)

which can be rewritten as:

$$\begin{aligned} H(\varvec{\lambda },\varvec{\omega })=\max _{\varvec{\lambda }, \varvec{\omega }}\frac{f(\varvec{\lambda },\varvec{\omega })}{g(\varvec{\lambda },\varvec{\omega })}, \end{aligned}$$
(14)

where \(f(\varvec{\lambda },\varvec{\omega })\) and \(g(\varvec{\lambda },\varvec{\omega })\) are two nonlinear multivariate polynomial functions. This can be optimized by solving the following non-linear parametric programming problem: for a given \(\mu \), find \(F(\mu )=\max _{\varvec{\lambda },\varvec{\omega }}\{f(\varvec{\lambda },\varvec{\omega })-\mu g(\varvec{\lambda },\varvec{\omega })\}\), s.t. \(\sum _{l=1}^{\left| A_i \right| } \omega _l =1\), \(\omega _l >0\) and \(\lambda _j>0\). Readers may refer to SCHAIN [9] for the proof and further details.
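One standard way to solve such a ratio maximization is a Dinkelbach-style iteration: alternately solve \(\max f-\mu g\) for fixed \(\mu \) and update \(\mu =f/g\) until \(F(\mu )\approx 0\). A hedged sketch using a generic constrained optimizer (SLSQP is our illustrative choice, not necessarily the solver used by SCHAIN):

```python
import numpy as np
from scipy.optimize import minimize

def maximize_ratio(f, g, x0, bounds, constraints, tol=1e-6, max_iter=50):
    """Maximize f(x)/g(x) over x = (lambda, omega), assuming g(x) > 0."""
    x = np.asarray(x0, dtype=float)
    mu = f(x) / g(x)
    for _ in range(max_iter):
        # Inner problem: max_x f(x) - mu * g(x), i.e. min_x -(f(x) - mu * g(x)).
        res = minimize(lambda y: -(f(y) - mu * g(y)), x, method="SLSQP",
                       bounds=bounds, constraints=constraints)
        x = res.x
        if abs(f(x) - mu * g(x)) < tol:  # F(mu) = 0 at the optimal ratio
            break
        mu = f(x) / g(x)
    return x, mu
```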

To summarize the whole model: SCAN first calculates the coupled attribute similarity and the link similarity between nodes. Second, SCAN computes the overall similarity by aggregating the similarity matrices with the weighting vectors \(\varvec{\lambda }\) and \(\varvec{\omega }\) and the balance factor \(\alpha \). The constraint matrix is generated from the supervision using the supervision importance \(\beta \). Finally, SCAN adopts an iterative mutual-update process to learn the clustering and the weighting vectors.

5 Experiments

In this section, we first introduce the datasets used in the experiments. Then we discuss the attribute similarity matrix of the IMDB dataset. Further, we show the effectiveness of the SCAN model against six representative algorithms. Finally, we examine the influence of the hyper-parameters and the weight learning process.

5.1 Datasets

Yelp. This experiment is similar to the Yelp-Restaurant experiment in SCHAIN [9]. The clustering task is to cluster restaurants by three sub-categories: “Fast Food”, “Sushi Bars” and “American (New) Food”. The AHIN is composed of 2,614 business objects (B), 33,360 review objects (R), 1,286 user objects (U) and 82 food-relevant keyword objects (K). Five attributes are considered: three categorical attributes (reservation, service, and parking) and two numerical attributes (review count and quality star). For meta-paths, we choose BRURB (businesses reviewed by the same user) and BRKRB (businesses receiving the same keyword in their reviews).

IMDB5k. We extracted an AHIN from the IMDB5k dataset. The AHIN is composed of four node types: 4,140 movies (M), 4,907 actors (A), 1,867 directors (D) and 24 genres (G). The movies are divided into three classes by their IMDB scores. Each movie has two categorical attributes (content rating and country) and three numerical attributes (gross, critic reviews and movie Facebook likes). Three meta-paths, MAM (movies sharing an actor), MGM (movies sharing a genre) and MDM (movies sharing a director), are chosen in our experiments for the link-based algorithms. Note that, as mentioned before, PathSim-based clustering algorithms cannot handle asymmetric meta-paths; although SCAN can handle asymmetric meta-paths, we choose symmetric ones for the sake of fairness.

Fig. 2. Attribute value similarity of IMDB (Color figure online)

5.2 Discussion of Attribute Value Similarity

The SCAN model captures the coupling relationships between attributes and therefore obtains a better node attribute similarity. We now take a closer look at the attribute value similarity of the IMDB dataset. There are two categorical attributes in the IMDB dataset, namely “country” and “content rating”. The similarity between the attribute values is visualized in Fig. 2, where attribute value pairs with higher similarity are shown in darker colors.

Table 1. The MPAA rating system.

The attribute similarity of the “country” attribute of node type “movie” is shown in Fig. 2a. The top 15 countries in the dataset are analyzed; the remaining countries are combined into the “Others” category. The USA, UK, France, Canada and Australia are all developed countries, while India is a developing country. By mining the dependencies between movie attributes, our model successfully discovers that the USA, UK, France, Canada and Australia are similar countries, while India is dissimilar to all of the other selected countries. Interestingly, our algorithm also reveals that New Zealand and India are the most dissimilar attribute values among the selected countries.

The Motion Picture Association of America (MPAA) film rating system is used to rate a film’s suitability for certain audiences based on its content. The attribute similarity of the “content rating” attribute of node type “movie” is shown in Fig. 2b. The top 5 rating levels in the dataset are analyzed (“G”, “PG”, “PG-13”, “R”, and “Unrated”); the remaining levels are combined into the “Others” category. The meanings of the rating levels are shown in Table 1. Generally speaking, the attribute value similarity we obtain is reasonable. To illustrate, our model finds that “G” is most similar to “PG”, and that “G” is more similar to “PG-13” than to “R”. Meanwhile, “R” is most similar to “PG-13”, its neighboring level in Table 1. “Others” is somewhat similar to all other rating levels since it is a combined categorical value, while “Unrated” is relatively dissimilar to the other rating levels.

5.3 Clustering Performance

We conduct experiments with three groups of comparison algorithms: attribute-only, link-only and attribute+link. The algorithms considered in each group are listed as follows.

  • Attribute-only: The clustering algorithms in the first group consider only object attributes. These are traditional methods that ignore the network structure of an AHIN. We choose Spectral-Learning [6] and a semi-supervised version of normalized cut [8] as representatives, denoted SL and SNcuts, respectively.

  • Link-only: These methods utilize only the link information of the network and ignore object attribute values. GNetMine [4] and PathSelClus [16] are chosen as representatives of this category.

  • Attribute+Link: Methods of this group use both attribute and link information. We consider FocusCO [11], SCHAIN [9], and SCAN-C. Since FocusCO does not consider the heterogeneity of networks, the AHINs are converted to homogeneous networks for the FocusCO experiments. SCAN-C is a variant of our model that uses CASV [17] as the categorical attribute value similarity measure; since CASV only handles categorical data, numerical attributes are converted to categorical ones in the SCAN-C experiments.

Table 2 shows the clustering performance in terms of NMI on Yelp-Restaurant and IMDB5k. We run every experiment 10 times and report the average NMI. As can be observed, SCAN outperforms the other algorithms on both datasets.

For Yelp-Restaurant, the attribute-only algorithms perform worse than the link-only algorithms, suggesting that link similarity is more important for this clustering objective. Since our improvement mostly concerns attribute similarity, SCAN turns out to be only slightly better than SCHAIN.

For IMDB, due to the sparsity of the network, the attribute-only algorithms perform much better than the link-only algorithms, indicating that attribute-based similarity is more important for this clustering objective. By taking the coupling relationships between attributes into account, SCAN outperforms the other algorithms significantly on the IMDB dataset. Additionally, SCAN generally outperforms SCAN-C, which demonstrates the importance of mining the coupling relationships between mixed-type attributes.

Table 2. NMI comparison on Yelp-Restaurant and IMDB5k

5.4 Weight Learning

SCAN retains the ability of SCHAIN [9] to learn the weights of attributes and meta-paths. We take the clustering of the IMDB dataset as an illustration; in the following discussion, we assume 25% seed objects. Figures 3a and b exhibit how our algorithm learns the weights of attributes and meta-paths through the iterations. From Fig. 3a, we see that SCAN identifies the “Facebook likes” attribute as the most useful attribute for clustering movies. As shown in Fig. 3b, the weights of the meta-paths remain almost unchanged during optimization, since the network is very sparse and link similarity does not contribute much to the clustering task. In Fig. 3c, we observe that the overall clustering performance improves over the iterations and finally converges to an optimum.

Fig. 3. Weight learning on IMDB

Fig. 4. Influence of hyper-parameters on Yelp

5.5 Parameters Discussion

There are three hyper-parameters to be tuned in our model: the balance coefficient \(\alpha \) between link similarity and attribute similarity, the supervision importance \(\beta \) for leveraging constraints, and the coupled coefficient \(\eta \) for balancing plain and coupled similarity. The influence of these hyper-parameters on Yelp and IMDB is shown in Fig. 4.

Figure 4a shows the relationship between clustering performance and the balance coefficient \(\alpha \). If \(\alpha =0\), only link similarity is considered; if \(\alpha =1\), only attribute similarity is considered. We observe that attribute similarity is more important for IMDB, while link similarity is more important for Yelp; it is therefore best to balance link similarity and attribute similarity.

Figure 4b shows the relationship between clustering performance and the supervision importance \(\beta \). For visualization, we map \(\beta \) from \([0,\infty )\) to (0, 1] by plotting \(\frac{1}{\beta +1}\). We observe that, on both datasets, focusing too much on the supervision (a large \(\beta \)) always leads to inferior results. On Yelp the optimal value is \(\beta =1.5\), whilst on IMDB it is better to set \(\beta =0.43\). This indicates that the importance of supervision differs across clustering tasks, so it is better to balance supervision and node similarity flexibly.

Figure 4c shows the relationship between clustering performance and the coupled coefficient \(\eta \). If \(\eta =0\), only the plain similarity (SMS [7]) is used as the categorical value similarity; if \(\eta =1\), the plain similarity is ignored and only the coupled similarity is considered. We observe that, for both datasets, it is better to balance the plain similarity and the coupled similarity. The optimal values of \(\eta \) for Yelp and IMDB are 0.5 and 0.7 respectively, which indicates that attributes in different datasets exhibit different degrees of coupling. For highly correlated datasets, a high coupled coefficient is recommended.

6 Conclusion

In this paper, we study semi-supervised clustering in attributed heterogeneous information networks. To mine the coupling relationships between node attributes, we propose a new model, SCAN. In our model, we present a coupled attribute similarity measure to capture the dependencies between mixed attribute types. Furthermore, we use a flexible approach to better balance the importance of supervision information and network similarity. Extensive experiments on two real-world datasets show that our model is capable of capturing the coupling relationships between attributes and outperforms state-of-the-art algorithms.