1 Introduction

Heterogeneous information network (HIN) embedding learns representations of nodes in a low-dimensional vector space while preserving their rich semantic information [38]. The key idea of HIN embedding is to measure the similarity between nodes in the heterogeneous information network: the more similar two nodes are, the closer they lie in the mapping space [3, 29]. Meta-path based methods have been widely used in HIN embedding, since meta-paths can capture rich semantics through multiple connections between nodes in a HIN [7, 11, 18]. The basic idea of a meta-path is to establish a path pattern between nodes based on semantic relatedness over HINs. A meta-path is used to sample pairs of nodes and compute their semantic similarity to learn node embeddings. In a HIN, each meta-path captures its own semantic information [9, 15]. Take Figure 1 as an example: in a DBLP HIN, author a2 is connected with a3 following the meta-path M1 (APTPA), and author a1 is connected with a2 following the meta-path M2 (APVPA). The underlying semantics are that authors a2 and a3 are related because their papers share the same topic, while authors a1 and a2 are related because their papers appear in the same venue.

Figure 1

An example of incompatibility problem in HIN embedding, in which (a) gives a DBLP Network schema, (b) lists three similarity matrices based on three meta-paths. The key problem is how to aggregate the similarity matrices with incompatible meta-paths into the matrix of (c)

Existing methods [8, 15] usually assume that different meta-paths share the same semantic space and ignore the incompatibility problem between meta-paths. These methods either average the similarities calculated by multiple meta-paths [6] or directly concatenate the embeddings of each meta-path to obtain an overall representation [4]. However, the similarities between nodes calculated under different meta-paths may differ: two nodes may be similar according to one meta-path but dissimilar according to another. The distances between nodes in the mapping spaces of different meta-paths also differ. This similarity difference results from the inconsistent semantics of different meta-paths. The similarity relationships of nodes in different similarity matrices are inconsistent, which leads to the incompatibility problem of meta-paths. Methods that ignore this incompatibility can produce unreliable results.

Figure 1 presents an example of node embedding in a HIN to illustrate the incompatibility problem of meta-paths. In the DBLP HIN, assume that the node similarity matrices M1S, M2S and M3S are calculated based on three different meta-paths M1, M2 and M3. There are contradictions among the similarity matrices calculated from different meta-paths: for some pairs of nodes, the similarity relationships under different meta-paths are opposite. The similarity of (v1,v2) is 0.5 in both M1S and M2S, but in M1S the similarity of (v1,v2) is smaller than that of (v1,v3), while in M2S it is higher. Because the semantics of different meta-paths differ, the same similarity value may carry different strength in different similarity matrices. Conversely, different similarity values may represent the same similarity strength: the similarity of (v2,v3) is 0.4 in M1S and 0.1 in M3S, yet it is the weakest in both matrices. This shows that similarities calculated from different meta-paths are not compatible. Averaging the similarities calculated by multiple meta-paths [6] or directly concatenating the node embeddings of each meta-path [4] cannot handle this incompatibility, and may hurt node classification and node clustering performance. In addition, existing methods fill missing entries of a similarity matrix with zeros, as in the case of M2S, which does not truly reflect the similarity between nodes.

To solve the problems of the existing works, this paper proposes a novel Semantic-Aware HIN Embedding (SAHE) method, which aggregates incompatible meta-paths in their own semantic spaces. The key idea of the proposed model is to convert node similarities into similarity relationships on each meta-path in its own semantic space and then aggregate the multiple meta-path based similarity matrices. By not computing node embeddings from raw node similarities in a single shared semantic space, the proposed method avoids the noise caused by similarity differences, so the incompatibility of different meta-paths can be resolved. The SAHE method first uses PathSim to calculate the meta-path based similarity matrices for the different meta-paths, and then defines a Kendall tau distance over similarity relationships to measure the distance between the aggregated similarity matrix and the meta-path based similarity matrices. Next, the semantic preference is extracted as a constraint to optimize the aggregated similarity matrix. Finally, KL divergence [21] is used to minimize the distribution difference between the aggregated similarity matrix and the node embeddings to obtain the node representations.

The main contributions of this work are as follows:

  1. This work studies the problem of incompatible meta-paths in HIN embedding. The semantic similarity between nodes can be preserved while accounting for the incompatibility problem, improving the embedding performance.

  2. A novel Semantic-Aware HIN embedding (SAHE) method is proposed to embed HINs. We measure the similarity relationships of each meta-path in its own semantic space to solve the incompatibility problem.

  3. We extract the semantic preference ranking from the meta-path based similarity matrices and use it to optimize the aggregated similarity matrix.

The rest of the paper is organized as follows: we first review the related works in Section 2. Then Section 3 introduces the preliminaries. Section 4 presents the details of the SAHE model. Section 5 shows the experimental results and performance analysis. Finally, the conclusion and future works are shown in Section 6.

2 Related works

Research on heterogeneous information networks (HINs) has recently received widespread attention because HINs can represent rich semantic information. A HIN describes a network with multiple types of nodes and edges, which is not possible in a homogeneous network [4, 32, 38]. HIN analysis is becoming a new research direction in data mining, since HINs integrate more effective information and contain richer semantics in their nodes and edges. Mining a HIN allows the data in the network to be analyzed more fully. How to mine effective information from different types of nodes and edges has become a new challenge, and analyzing heterogeneous information networks is difficult due to their complexity.

HIN embedding aims at learning representations of heterogeneous nodes in a mapped low-dimensional vector space, which can improve the performance of data mining tasks such as classification, clustering and recommendation. However, previous works mainly focus on learning vector representations of nodes in homogeneous information networks. DeepWalk [17] is inspired by ideas from natural language processing: it treats nodes as words, uses sequences generated by random walks as sentences, and then applies the Skip-gram model to these random walk sequences to learn node embeddings. Node2vec [10] is an embedding method similar to DeepWalk that also uses the Skip-gram model; unlike DeepWalk, Node2vec has its own random walk strategy, combining depth-first and breadth-first search to sample nodes and generate random walk sequences. LINE [28] considers not only first-order but also second-order neighbor relationships when learning node embeddings: even if two nodes are not directly connected, they can establish an indirect connection through their shared first-order neighbors. In addition, it uses negative sampling techniques to optimize the model. GCN [13] learns node embeddings through a graph convolutional neural network, but GCN cannot learn the weight of each neighbor node during convolution. GAT [30] was proposed to solve this problem of GCN: it uses an attention mechanism to learn the weight of each neighbor node and aggregate the neighbor information of the node. The deep-learning-based MVC-DNE [35] method performs network embedding by fusing data from the structure and attribute views. Most of the above methods learn node embeddings of homogeneous information networks without considering the types of nodes and edges. Applying them directly to a highly complex HIN may reduce the embedding quality.
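
To make the random-walk-plus-Skip-gram idea concrete, the following is a minimal DeepWalk-style sketch (not the code of any cited method). It assumes an undirected networkx graph and the gensim 4.x Word2Vec API; all parameter values are illustrative.

```python
import random

import networkx as nx
from gensim.models import Word2Vec

def deepwalk_embeddings(G: nx.Graph, num_walks=10, walk_len=40, dim=64):
    """Generate uniform random walks and feed them to Skip-gram (sg=1)."""
    walks = []
    for _ in range(num_walks):
        for v in G.nodes():
            walk = [v]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(u) for u in walk])  # nodes become "words"
    model = Word2Vec(walks, vector_size=dim, window=5, min_count=0, sg=1)
    return {v: model.wv[str(v)] for v in G.nodes()}
```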

Some existing methods have been proposed for HIN embedding [2, 6, 12, 24, 25, 36]. Metapath2vec [6] extends DeepWalk to HIN embedding: it sets different meta-paths for random walks according to the semantic information in the HIN and, like DeepWalk, applies Skip-gram to the random walk sequences of heterogeneous nodes to obtain node embeddings. Unlike most methods that apply Skip-gram to random walk sequences, Hin2vec [8] builds a neural network model that learns node embeddings and meta-path embeddings by maximizing the likelihood of node relations; Hin2vec does not rely on manually set meta-paths but learns them automatically from the network data. EOE [34] learns the latent node representations of two networks and uses a harmonious embedding matrix to transform the representations of the different networks into the same space. These methods ignore the incompatibility of different meta-paths and miss the different semantic relationships of nodes.

Though several works notice the incompatible semantics of HINs, they ignore that different meta-paths have their own semantic spaces and cannot solve the problem of aggregating node similarities across different semantic spaces. Esim [22] is a deep-learning-based HIN embedding model that transcribes semantics in HINs through meta-paths and fuses multiple meta-paths by assigning them different weights. However, it relies heavily on manually set meta-path weights for representation learning, which may not match the real network and thus cannot achieve a better embedding effect. HERec [23] uses a fusion function to fuse the node embeddings learned from different meta-paths into the final node embedding. DMGI [16] is a multi-relational network embedding method that uses an attention mechanism to learn node embeddings. A multi-relational network is a special type of heterogeneous network with a single type of nodes and multiple types of edges, so DMGI cannot learn node embeddings of general heterogeneous information networks (that is, networks with multiple types of both nodes and edges). DyHNE [33] gives different weights to the various meta-paths so as to evaluate the importance of each meta-path to node embedding. The HAN [32] method uses semantic-level attention and node-level attention to learn the importance of meta-paths and node neighbors simultaneously, and aggregates the node representations of multiple meta-paths to obtain the final node representation. These weight-combination methods cannot solve the incompatibility problem of meta-paths. Since meta-paths can capture the similarity between higher-order neighbors and better capture the various aspects of semantic information in a HIN, it is necessary to study the incompatibility of meta-paths.

3 Preliminaries

Definition 3.1 (Heterogeneous information network)

Let a graph G = (V,E) represent a heterogeneous information network, where V is the set of all nodes vi and E is the set of all edges ei in the graph G. Each node vi ∈ V has its own node type and each edge ei ∈ E has its own edge type. T and R represent the set of node types and the set of edge types, respectively. A network is a heterogeneous information network if and only if |T| + |R| > 2.

Figure 1(a) illustrates a small bibliographic HIN with |T| = 4 and |R| = 3. It contains three edge types connecting the four types of nodes: author (A), paper (P), venue (V), and topic (T). To simplify the model, each node is assumed to belong to a single type.

Definition 3.2 (Network schema)

The network schema is defined as TG = (T,R), which is the meta template for a heterogeneous information network G = (V,E). The network schema comes with a node-to-node-type mapping \(\varphi: V \rightarrow T\) and an edge-to-edge-type mapping \(\psi: E \rightarrow R\).

Figure 2 shows the schema of the DBLP HIN in Figure 1.

Figure 2

The HIN schema of DBLP with four node types and four directed edge types

Definition 3.3 (Meta-path)

A meta-path M is denoted as a path pattern connecting different types of nodes, \(T_{1} \overset{R_{1}}{\longrightarrow} T_{2} \overset{R_{2}}{\longrightarrow} {\cdots} \overset{R_{l-1}}{\longrightarrow} T_{l}\), where R1, R2, …, Rl−1 represent the edge types between node types T1 to Tl.

Figure 1(a) shows a meta-path \(A \overset {write}{\longrightarrow } P\overset {publish}{\longrightarrow }V\overset {publish^{-1}}{\longrightarrow }P\overset {write^{-1}}{\longrightarrow }A\) in DBLP, whose semantics are that authors (A) publish papers (P) in the same venue (V); it is abbreviated as “APVPA”. M is the set of all meta-paths Mm. A path instance m = (v1v2…vl) following the meta-path Mm ∈ M represents a path between v1 and vl in the network G, abbreviated as \(m_{v_{1} \sim v_{l}} \).
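
As an illustration of counting path instances along a meta-path, the sketch below builds the commuting matrix of APVPA from two hypothetical incidence matrices (AP for the write relation, PV for the publish relation); these matrix names are assumptions for illustration only.

```python
import numpy as np

def apvpa_path_counts(AP: np.ndarray, PV: np.ndarray) -> np.ndarray:
    """AP[i, p] = 1 if author i wrote paper p; PV[p, v] = 1 if paper p
    appeared in venue v. The product AV[i, v] counts an author's papers in
    each venue, and AV @ AV.T counts APVPA path instances between authors."""
    AV = AP @ PV
    return AV @ AV.T
```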

Definition 3.4 (Node Embedding of HIN)

In a HIN G = (V,E), each node is mapped to a d-dimensional space \(\mathcal {S}^{d}\) and represented by a vector \(\mathbf{v} \in \mathbb{R}^{d}\), where d ≪ |V|. The purpose of the proposed model is to learn such node embeddings in the heterogeneous information network.

4 The proposed model

To solve the incompatibility of different meta-paths in HIN embedding, this work proposes the SAHE model based on similarity relationships, which aggregates multiple meta-paths to obtain the node representations. The key idea of the SAHE method is to convert node similarities into similarity relationships on each meta-path in its own semantic space. Figure 3 shows the framework of the SAHE model. The model consists of four main steps: node similarity calculation, node similarity aggregation, semantic preference extraction and embedding learning.

Figure 3

The framework of the proposed SAHE model

The following subsections detail the steps of this method.

4.1 Node similarity calculation on single meta-path

The proposed model first calculates the node similarity based on each meta-path. Due to the skewed degree distribution of HINs, similarity methods based on random walks are biased toward nodes with large degrees and cannot reflect the real network topology. PathSim is used in our model because it considers all nodes connected through the meta-path, thus avoiding the skewed distribution problem. PathSim is a method commonly used to measure node similarity in HINs. It uses symmetric meta-paths to extract the connecting paths between two nodes and measure their similarity, which not only uses the related heterogeneous information but also extracts the rich semantics of nodes and edges [5, 27].

The basic idea of PathSim is that if two nodes in the network are connected by more meta-path instances, the similarity between them is higher. Given a meta-path Mm and two nodes of the same type vi and vj, PathSim is defined as follows:

$$ m s\left( v_{i}, v_{j}\right)=\frac{2 \times\left|m_{v_{i} \sim v_{j}}\right|}{\left|m_{v_{i} \sim v_{i}}\right|+\left|m_{v_{j} \sim v_{j}}\right|}, $$
(1)

where \(ms\left (v_{i}, v_{j}\right )\) denotes the similarity between nodes vi and vj, abbreviated as msij; \(m_{v_{i} \sim v_{j}}\), \(m_{v_{i} \sim v_{i}}\) and \(m_{v_{j} \sim v_{j}}\) are the path instances following meta-path Mm between (vi,vj), (vi,vi) and (vj,vj), respectively, and |·| denotes their number.

As shown in (1), given a meta-path Mm, msij consists of two parts: the similarity between two nodes depends on the number of path instances between them following meta-path Mm, normalized by the number of path instances from each node to itself. For each meta-path Mm, PathSim is used to calculate the node similarities and obtain a symmetric matrix MmS with a unit diagonal: msij = msji and msii = 1.
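
A minimal sketch of (1) in matrix form, assuming a precomputed commuting matrix C of path-instance counts along a symmetric meta-path (e.g., the output of apvpa_path_counts above); zero diagonal entries would need guarding in practice.

```python
import numpy as np

def pathsim(C: np.ndarray) -> np.ndarray:
    """PathSim matrix from path-instance counts C along a symmetric
    meta-path: ms_ij = 2 * C[i, j] / (C[i, i] + C[j, j]), per Eq. (1)."""
    diag = np.diag(C).astype(float)
    return 2.0 * C / (diag[:, None] + diag[None, :])
```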

4.2 Node similarity aggregation on multiple meta-paths

After calculating the node similarities on single meta-paths, the key step is to aggregate the node similarities across these incompatible meta-paths. According to the definition of node similarity in (1), the proposed method determines the meta-path based similarity matrices of the |M| meta-paths. However, node similarities under different meta-paths are not always consistent with each other: two nodes may be similar according to one meta-path but not according to another. A Kendall tau distance is defined to aggregate the different meta-paths by converting node similarities into similarity relationships.

Kendall tau distance is generally used to count the discordant pairs between two rankings [31]. It has many advantages in measuring sequence differences and attracts attention in many fields. We extend it to convert node similarities into similarity relationships and to measure the consistency of two matrices: the smaller the Kendall tau distance, the more consistent the two matrices are. Our method obtains the aggregated similarity matrix by minimizing the Kendall tau distance, so that it has the smallest difference from all the meta-path based similarity matrices; the problem of incompatible meta-paths is thereby solved. Since each meta-path has its own semantics, the similarity relationships in the meta-path similarity matrices are compared through the Kendall tau distance. First, the Kendall tau distance between two similarity matrices is defined as

$$ K\left( \mathbf{M_{m}}\mathbf{S}, \mathbf{AS}\right)=\sum\limits_{i=1}^{|V|} K\left( \mathbf{M_{m}}\mathbf{S}_{\mathbf{i}}, \mathbf{A S}_{\mathbf{i}}\right), $$
(2)

where AS is the aggregated similarity matrix, and the Kendall tau distance between the corresponding row vectors of the two matrices is defined as follows:

$$ K\left(\mathbf{M_{m}}\mathbf{S}_{\mathbf{i}}, \mathbf{AS}_{\mathbf{i}}\right)=\sum\limits_{j<k} \kappa_{i}(j, k), \quad \kappa_{i}(j, k)=\left\{\begin{array}{ll} 0, & \text{if } \operatorname{sgn}\left(ms_{ij}-ms_{ik}\right)=\operatorname{sgn}\left(as_{ij}-as_{ik}\right), \\ 1, & \text{otherwise}, \end{array}\right. $$
(3)

The indicator \(\kappa_{i}(j,k)\) compares the relative similarity relationships over all column pairs (j,k): if the relative similarity relationship in the two row vectors is the same, the distance remains unchanged; if it is reversed, or a tie in one row is broken in the other, the distance is accumulated. The total distance between the meta-path based similarity matrices and the aggregated similarity matrix is:

$$ \sum\limits_{m=1}^{|M|} K\left( \mathbf{M_{m}}\mathbf{S}, \mathbf{AS}\right)=\sum\limits_{m=1}^{|M|} \sum\limits_{i=1}^{|V|} K\left( \mathbf{M_{m}}\mathbf{S_{i}}, \mathbf{A S}_{\mathbf{i}}\right). $$
(4)

Based on the above analysis, the similarity aggregation problem is defined as follows: given the |M| similarity matrices MmS calculated from the |M| meta-paths, find an aggregated similarity matrix AS that minimizes the total distance to all MmS. Stochastic gradient descent [1] is used to solve this optimization problem:

$$ \mathbf{AS}^{*}=\underset{\mathbf{AS}}{\arg \min } \sum\limits_{m=1}^{|M|} \sum\limits_{i=1}^{|V|} K\left( \mathbf{M}_{\mathbf{m}} \mathbf{S}_{\mathbf{i}}, \mathbf{A} \mathbf{S}_{\mathbf{i}}\right). $$
(5)
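
The following brute-force sketch evaluates the objective of (4)-(5) for a candidate AS. It enumerates all column pairs per row, so it is O(|V|^3) per matrix and purely illustrative; the actual optimization of AS by stochastic gradient descent is not reproduced here.

```python
import numpy as np

def kendall_tau_distance(MS: np.ndarray, AS: np.ndarray) -> int:
    """Count discordant similarity relationships between two matrices
    following Eq. (3): a pair (j, k) in row i contributes 1 whenever the
    ordering of (ms_ij, ms_ik) disagrees with that of (as_ij, as_ik)."""
    n = MS.shape[0]
    dist = 0
    for i in range(n):
        for j in range(n):
            for k in range(j + 1, n):
                if np.sign(MS[i, j] - MS[i, k]) != np.sign(AS[i, j] - AS[i, k]):
                    dist += 1
    return dist

def total_distance(mats, AS):
    """Objective of Eq. (4): sum the distance over all meta-path matrices."""
    return sum(kendall_tau_distance(MS, AS) for MS in mats)
```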

In the example of Figure 1, the aggregated similarity matrix calculated by SAHE is shown in Figure 4. We can calculate the Kendall tau distance between AS and (M1S,M2S,M3S). Since there is one reversed similarity relationship between row vector v1 in M2S and AS (m2s12 > m2s13 while as12 < as13), \(K\left (\mathbf {M_{2}} \mathbf {S}_{1},\mathbf {AS}_{1}\right ) = 1\). There is no reversed similarity relationship between the other row vectors of M2S and AS, so \(K\left (\mathbf {M_{2}}\mathbf {S},\mathbf {AS}\right ) = 1\). The similarity relationships in the other two similarity matrices M1S and M3S are the same as in the aggregated similarity matrix AS, so \(K\left (\mathbf {M_{1}}\mathbf {S}, \mathbf {AS}\right )\) + \(K\left (\mathbf {M_{2}}\mathbf {S}, \mathbf {AS}\right )\) + \(K\left (\mathbf {M_{3}}\mathbf {S}, \mathbf {AS}\right )\) = 1. Only one pair of nodes has a reversed similarity relationship across (M1S,M2S,M3S), which means the aggregated similarity matrix AS maintains high consistency with the meta-path based similarity matrices.

Figure 4

The aggregated similarity matrix AS calculated by SAHE in the example of Figure 1

4.3 Semantic preference extraction

Semantic preferences are extracted from the multiple meta-path based similarity matrices to optimize the aggregated similarity matrix. The majority principle [14], as the name suggests, is the obedience of the minority to the majority. It is widely used in economics and sociology due to its low complexity and ease of understanding. The majority principle represents the majority opinion, and here the semantic relationship is determined by the semantic information of the majority of meta-paths. The majority principle is extended to a semantic preference ranking: given three nodes \(\left \{v_{i}, v_{j}, v_{k}\right \}\), if the pair (vi,vj) is more similar than (vi,vk) in the majority of similarity matrices, then the aggregated similarity matrix should satisfy the ranking that the similarity of (vi,vj) is higher than that of (vi,vk); in other words, vi prefers vj to vk in the aggregated similarity matrix. Formally, if \(\left |ms_{i j}>ms_{i k}\right |>\left |ms_{i j}<ms_{i k}\right |\), then vj ≻ vk, where '≻' denotes the preference relation. Conversely, if (vi,vj) is less similar than (vi,vk) in the majority of similarity matrices, the aggregated similarity matrix should satisfy the ranking that the similarity of (vi,vj) is lower than that of (vi,vk): if \(\left |ms_{i j}>ms_{i k}\right |<\left |ms_{i j}<ms_{i k}\right |\), then vk ≻ vj. In particular, \(\left |ms_{i j}>ms_{i k}\right |=\left |ms_{i j}<ms_{i k}\right |\) occurs in some cases. Pj and Nj are defined to resolve such ties:

$$ P_{j}=\sum\limits_{k=1, k \neq j}^{|V|}\left|ms_{i j}>ms_{i k}\right|. $$
(6)
$$ N_{j}=\sum\limits_{k=1, k \neq j}^{|V|}\left|ms_{i j}<ms_{i k}\right|. $$
(7)

Pj is the number of cases in which a node vk is less similar to vi than vj is. If Pj > Pk, then vj ≻ vk; otherwise vk ≻ vj. Nj is the number of cases in which a node vk is more similar to vi than vj is. If Nj < Nk, then vj ≻ vk; otherwise vk ≻ vj. Based on the above semantic preference ranking, for any node vi, the similarity relationship ranking between any two other nodes (vj,vk) with respect to vi can be determined. This ranking is used as a constraint to optimize the aggregated matrix. Algorithm 1 describes the process of semantic preference ranking.

In the example of Figure 1, three similarity relationship rankings are calculated based on the semantic preference ranking: v1 : v1 ≻ v3 ≻ v2, v2 : v2 ≻ v1 ≻ v3, v3 : v3 ≻ v1 ≻ v2. This ranking is consistent with the similarity ranking in Figure 4.
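
As a concrete reading of Eqs. (6)-(7) (Algorithm 1 itself is not reproduced in the text), the sketch below interprets \(|ms_{ij}>ms_{ik}|\) as the number of (meta-path matrix, node k) combinations in which the inequality holds, and ranks nodes for a fixed vi by Pj descending with Nj ascending as the tie-breaker; this interpretation is an assumption.

```python
import numpy as np

def preference_ranking(sim_mats, i):
    """Semantic preference ranking of all nodes with respect to node v_i:
    P[j] counts cases where v_j is more similar to v_i than some v_k,
    N[j] counts cases where it is less similar (Eqs. (6)-(7))."""
    n = sim_mats[0].shape[0]
    P = np.zeros(n, dtype=int)
    N = np.zeros(n, dtype=int)
    for MS in sim_mats:
        row = MS[i]
        for j in range(n):
            for k in range(n):
                if j != k and row[j] > row[k]:
                    P[j] += 1
                elif j != k and row[j] < row[k]:
                    N[j] += 1
    return sorted(range(n), key=lambda j: (-P[j], N[j]))
```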

Algorithm 1 Semantic preference ranking

4.4 Embedding learning

Given a heterogeneous information network G = (V,E) and multiple meta-paths Mm ∈ M, an aggregated similarity matrix AS that preserves node similarities is obtained with Algorithm 1. The purpose of the SAHE method is to learn the embedding of nodes in the HIN, which is cast as minimizing the distribution difference between the aggregated similarity matrix AS and the node embedding matrix H. Since the elements stored in the two matrices are not directly comparable, the sigmoid activation function is used to convert the embedding matrix H into a similarity matrix HS:

$$ hs\left( v_{i}, v_{j}\right)=\frac{1}{1+e^{-{h_{i}^{T}} h_{j}}}, $$
(8)

where hi and hj are the embedding vectors of nodes vi and vj, respectively. The embedding matrix can thus be converted into a fitting similarity matrix, denoted HS, where \(hs_{i j}=h s\left (v_{i}, v_{j}\right )\).
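
Eq. (8) vectorizes directly over the whole embedding matrix; a minimal sketch:

```python
import numpy as np

def fitting_similarity(H: np.ndarray) -> np.ndarray:
    """Fitting similarity matrix HS of Eq. (8): element-wise sigmoid of the
    pairwise inner products of the embedding matrix H (|V| x d)."""
    return 1.0 / (1.0 + np.exp(-(H @ H.T)))
```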

KL divergence [26] measures the information lost when a theoretical distribution is used to approximate a real distribution. The aggregated similarity matrix AS is the real distribution, while the fitting similarity matrix HS is the theoretical distribution. Our goal is to minimize the KL divergence between AS and HS:

$$ L=KL(\mathbf{AS} \| \mathbf{HS}). $$
(9)

Since the similarities between a node vi and all other nodes vj in a similarity matrix can be treated as a probability distribution, the loss is written as:

$$ L=\sum\limits_{i=1}^{|V|} \sum\limits_{j=1}^{|V|} KL\left( as_{ij} \| hs_{i j}\right)=\sum\limits_{i=1}^{|V|} \sum\limits_{j=1}^{|V|} as_{ij} \log \frac{1}{hs_{i j}}. $$
(10)

The optimization objective is to find the node embedding matrix H that minimizes the following loss function; stochastic gradient descent [1] is used to solve this optimization problem:

$$ \mathbf{H}^{*}=\underset{\mathbf{H}}{\arg \min }\sum\limits_{i=1}^{|V|} \sum\limits_{j=1}^{|V|} as_{i j} \log \frac{1}{h s_{i j}}. $$
(11)

The details of the optimization of embedding learning are shown in Algorithm 2. A binary classifier is used to distinguish node samples coming from the aggregated similarity distribution AS from those coming from a noise distribution. An auxiliary random variable D is used for this classification: D = 1 for a node from the aggregated similarity distribution and D = 0 for a sample drawn from the noise distribution. σ is the sigmoid function and n is the number of nodes extracted from the noise distribution; in our experiments, n = 3.
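
Since Algorithm 2 is not reproduced in the text, the sketch below illustrates one plausible form of the negative-sampling step it describes; the uniform noise distribution is an assumption.

```python
import numpy as np

def negative_sampling_loss(H, AS, i, j, n=3, rng=None):
    """One stochastic term of Eq. (11): (i, j) is a positive pair (D = 1)
    weighted by as_ij, plus n nodes drawn from a noise distribution (D = 0)."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -AS[i, j] * np.log(sigmoid(H[i] @ H[j]))
    for k in rng.integers(0, len(H), size=n):  # uniform noise, an assumption
        loss -= np.log(1.0 - sigmoid(H[i] @ H[k]))
    return loss
```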

Algorithm 2 Embedding learning

5 Experimental results

The embedding performance of the SAHE method is verified on three real-world HIN datasets. First, the experiments compare the node classification and clustering performance of the proposed model with recent existing methods. Second, the influence of the parameters on the performance of the proposed method is examined.

5.1 Datasets

Three datasets from different fields, the DBLP network [19], the Movielens network [37], and the Yelp network [20], are used in the experiments to evaluate the performance of the proposed model.

  • DBLP is an author-centric dataset in the field of computer science. The DBLP network consists of four types of nodes: author (A), paper (P), topic (T) and venue (V). The edge types include authorship (P-A), topic of paper (P-T), publishing venue (P-V), and the citation relation (P-P). The schema is shown in Figure 2.

  • Movielens is a movie-centric dataset in the movie field consisting of four types of nodes: movie (M), user (U), age (A), and occupation (O), and three edge types: user's like (U-M), user's age (U-A) and user's occupation (U-O). Figure 5(a) shows the schema of the Movielens HIN.

  • Yelp is a business-centric dataset containing user information, business information, and user reviews of businesses. We extract data including business (B), user (U), city (Ci) and compliment (Co) as nodes, and the evaluation relation (U-B), user's like (U-Co) and location (B-Ci) as edges. Figure 5(b) shows the schema of the Yelp HIN.

Figure 5

Schemas of two heterogeneous information networks

The networks of the original datasets are sparse and have skewed degree distributions: they contain a large number of nodes, most of which have small degrees, making it difficult to analyze the entire network. We therefore randomly sample the original networks and set a minimum degree to filter out nodes whose degree is below it. The minimum degree for DBLP, Movielens and Yelp is 3, 4 and 4, respectively. The basic statistics and meta-paths of the three HIN networks are summarized in Table 1. The length of a meta-path is the number of relations it contains; for example, in the DBLP network, the meta-path APA has length 2.
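
A sketch of this preprocessing step, assuming a networkx graph; the text does not state whether the filter is applied once or repeated until stable, so a single pass is shown.

```python
import networkx as nx

def filter_by_min_degree(G: nx.Graph, min_degree: int) -> nx.Graph:
    """Drop nodes below the minimum degree (3 for DBLP, 4 for Movielens
    and Yelp) in one pass over the sampled network."""
    G = G.copy()
    G.remove_nodes_from([v for v, d in G.degree() if d < min_degree])
    return G
```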

Table 1 The basic statistics and meta-paths of three HIN datasets

5.2 Baseline methods

We compare the proposed method with two types of embedding methods: methods for homogeneous information networks and methods for heterogeneous information networks. These methods were introduced in the related works in Section 2; here we present the details of their experimental setup. For fairness, the dimension d of the embedding vector space is set to 64 for all methods.

For homogeneous information networks, the DeepWalk [17], Node2vec [10] and LINE [28] methods are compared with the proposed method. DeepWalk and Node2vec are network embedding methods based on random walks, while LINE samples node pairs directly. When applying these methods to HIN embedding, the heterogeneity of nodes is ignored and heterogeneous nodes are regarded as homogeneous. For these three methods, we used the default settings from the original articles.

For heterogeneous information networks, the meta-path based methods Metapath2vec [6], Esim [22] and HAN [32] and the non-meta-path based method Hin2vec [8] are compared with the proposed method. To verify the ability of the proposed method to solve the meta-path incompatibility problem, a variant of SAHE, called SAHEavg, is built by removing the Kendall tau distance and averaging the similarities calculated by the different meta-paths; this variant is compared with SAHE. For each dataset, the meta-paths used by Metapath2vec, Esim and HAN in the experiments are the same as those of SAHE. Since different datasets require different weight settings, various weights are assigned to Esim and the best-performing one is chosen for each dataset. The parameter settings of the Hin2vec method are the same as in the original article.

5.3 Node classification

Node classification assigns nodes to different categories, where the input of the node classifier is the learned embeddings. In the experiments, we mainly focus on the central node type of each dataset: author, movie and business are used as the central nodes for classification on DBLP, Movielens and Yelp, respectively. A group of nodes is randomly selected as labeled training nodes, and the remaining nodes are used as test nodes. The embeddings of the training nodes are used as input to train a logistic regression classifier that predicts the most likely labels for the test nodes. The datasets are divided into training and test sets, with the training ratio gradually increasing from 5% to 50%. Each experiment is repeated 10 times, and the average Micro-F1 and Macro-F1 scores are recorded. The results are shown in Tables 2 and 3.
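
A sketch of this evaluation protocol using scikit-learn; classifier hyperparameters are not given in the text, so defaults are used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def classification_scores(embeddings, labels, train_ratio, seed=0):
    """Train a logistic regression on a fraction of labeled node embeddings
    and report Micro-F1 and Macro-F1 on the remaining nodes."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, train_size=train_ratio, random_state=seed)
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    return (f1_score(y_te, pred, average="micro"),
            f1_score(y_te, pred, average="macro"))
```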

Table 2 Micro-F1 scores of node classification task
Table 3 Macro-F1 scores of node classification task

As shown in Tables 2 and 3, performance generally improves as the ratio of training data increases, indicating that accuracy is positively related to the training ratio. (1) The node classification performance of the SAHE method is consistently better than the baselines. The improvement obtained by SAHE over the best baseline (HAN) is about 2%-8%. When the training ratio is less than 20%, the performance of SAHE is much higher than that of the baselines, which means the proposed method requires only a small amount of training data to obtain effective embeddings. (2) DeepWalk, Node2vec and LINE perform worse than Metapath2vec, Esim, Hin2vec, HAN and SAHE. Thus, homogeneous network embedding methods are significantly worse than heterogeneous network embedding methods on the node classification task, because they ignore the heterogeneity of nodes. (3) Comparing the heterogeneous information network baselines (Metapath2vec, Esim, HAN, Hin2vec), HAN is the best because the fused embedding learned through the attention mechanism improves the embedding ability. Metapath2vec is the worst-performing heterogeneous baseline because it only extracts the semantic information of a single meta-path. (4) The node classification performance of SAHE is superior to that of SAHEavg in all cases, which shows that the proposed method improves embedding performance by solving the meta-path incompatibility problem.

5.4 Link prediction

We conduct experiments on the three datasets to compare the link prediction performance of the node embeddings learned by SAHE and the baseline methods. In this task, we predict the citation relationship (P-P) between papers in DBLP and the friend relationship (U-U) between users in Movielens and Yelp. We first randomly divide the edges of each dataset into a training set and a test set, using 75% of the edges for training and the rest for testing. Then the node embeddings are used as features to train a classifier on the training set, and link prediction performance is evaluated on the test set. The AUC and AP scores are used as indicators of the quality of the node embeddings: the higher the AUC and AP, the better the link prediction performance. Table 4 shows the link prediction performance of the various methods on the three datasets, with the best performance highlighted in bold.
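
A sketch of this protocol; the paper does not specify how an edge is featurized from its two node embeddings, so the element-wise (Hadamard) product, a common choice, is assumed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

def link_prediction_scores(H, train_pairs, train_y, test_pairs, test_y):
    """Represent a candidate edge (i, j) by the element-wise product of its
    node embeddings, train a classifier on the 75% edge split, and report
    AUC and AP on the held-out edges."""
    feats = lambda pairs: np.array([H[i] * H[j] for i, j in pairs])
    clf = LogisticRegression(max_iter=1000).fit(feats(train_pairs), train_y)
    scores = clf.predict_proba(feats(test_pairs))[:, 1]
    return roc_auc_score(test_y, scores), average_precision_score(test_y, scores)
```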

Table 4 AUC and AP scores of the link prediction task

Similar to the node classification task, SAHE performs significantly better than all baseline methods on the three datasets. Compared with the baselines, the AUC and AP scores of SAHE increase by about 2%-22% on DBLP, 3%-11% on Movielens, and 4%-12% on Yelp. This shows that the proposed method improves the performance of network embedding by capturing heterogeneous semantic information. The link prediction performance of SAHE is also superior to that of SAHEavg in all cases, which verifies the ability of the proposed method to solve the meta-path incompatibility problem on the link prediction task. Unlike in the node classification task, Node2vec performs better than the traditional heterogeneous method Metapath2vec on the DBLP dataset. This may be because Node2vec obtains more structural and semantic information on DBLP than the single-meta-path Metapath2vec method.

5.5 Node clustering

We also conduct experiments on the three datasets to verify the performance of the proposed SAHE method on the node clustering task. Node clustering groups nodes into clusters according to the similarity between them. The labeled node type for the clustering task is the same as in the node classification task: author in DBLP, movie in Movielens and business in Yelp. Specifically, the node embeddings are treated as node features, and the k-means algorithm is used to cluster the nodes. The ground-truth groups consist of nodes with the same label and are compared with the groups obtained by clustering. Clustering performance is evaluated with Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI). The average NMI and ARI scores over 10 repeated trials are shown in Table 5.
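
A sketch of this protocol with scikit-learn; k-means settings beyond the number of clusters are not given in the text, so defaults are assumed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(embeddings, labels, n_runs=10):
    """k-means on node embeddings, scored by NMI and ARI against the
    ground-truth labels and averaged over repeated trials."""
    k = len(set(labels))
    nmi, ari = [], []
    for seed in range(n_runs):
        pred = KMeans(n_clusters=k, random_state=seed,
                      n_init=10).fit_predict(embeddings)
        nmi.append(normalized_mutual_info_score(labels, pred))
        ari.append(adjusted_rand_score(labels, pred))
    return np.mean(nmi), np.mean(ari)
```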

Table 5 NMI and ARI scores of the node clustering task

The experimental results show that the node clustering performance of the proposed SAHE method is consistently better than the other baselines on the different datasets. Compared with the baselines, the NMI and ARI scores of SAHE increase by about 3%-9% on DBLP, 1%-9% on Movielens, and 1%-7% on Yelp.

Similar to the node classification task, the clustering performance of HAN is still second best. The homogeneous network embedding methods are significantly worse than the heterogeneous ones in most cases, and the node clustering performance of SAHE is superior to that of SAHEavg in all cases. We also observe that Hin2vec does not perform well in the clustering experiments, sometimes even worse than the homogeneous information network embedding methods, which shows that its performance is unstable.

Comparing the heterogeneous information network methods, HAN, Metapath2vec and Esim achieve higher NMI and ARI than Hin2vec on the DBLP, Movielens and Yelp datasets. This result shows that methods with manually set meta-paths (Metapath2vec, Esim) outperform methods that select meta-paths automatically (such as Hin2vec), possibly because manually set meta-paths are more consistent with the semantic relationships in real networks. In other words, meta-path selection strategies require expert prior knowledge, as they are not only related to the direct connections of nodes in the network.

5.6 Parameter analysis

The experiments evaluate the node clustering task of the SAHE model with different values of the node embedding dimension d and different numbers of meta-paths |M|. Other parameters, such as those of the optimization process, are not analyzed because they are not relevant to the contributions of this paper. To study the influence of d, we observe the clustering performance while increasing d from 16 to 256. The results of the node classification task are similar to the clustering results, so Tables 6 and 7 only show the node clustering results.

Table 6 Influence of Node embedding dimension d on SAHE
Table 7 Influence of meta-paths number |M| on SAHE

We observe that when d is less than 64, a smaller d leads to worse performance, because a small dimension cannot capture enough valid information about the nodes. The clustering performance is best at d = 64 or d = 128 and then decreases as d increases further, possibly due to over-fitting caused by an excessively large dimension. d = 64 performs well on DBLP and Yelp, while d = 128 performs well on Movielens; different networks require different dimensions.

Table 7 shows that as the number of meta-paths increases, the clustering performance improves. This is reasonable since multiple meta-paths can obtain richer information from heterogeneous information networks. It further verifies that the proposed SAHE model can solve the incompatibility of different meta-paths.

6 Conclusions and future works

In this paper, a novel method named Semantic-Aware HIN Embedding (SAHE) is proposed to learn node embeddings in heterogeneous information networks. In particular, we solve the incompatibility problem of the different meta-paths existing in real-world HINs. The core of the SAHE method is to convert node similarities into similarity relationships on each meta-path in its own semantic space. Since a similarity matrix calculated from a single meta-path has no incompatibility problem, minimizing the distance between the aggregated similarity matrix and the meta-path based similarity matrices solves the incompatibility problem. In addition, an innovative semantic preference ranking is proposed as a constraint to optimize the aggregated similarity matrix. Extensive experimental evaluations confirm that the SAHE model can capture more semantic information of complex HINs and overall exceeds the baselines.

To improve the proposed method, future works will focus on the following aspects. First, the proposed method selects meta-paths manually; we plan to upgrade it to select the most influential meta-paths automatically while ensuring that these meta-paths preserve the rich semantic information in the HIN. Second, the proposed model will be extended to node representation on more complex networks, such as dynamic heterogeneous networks, attributed heterogeneous networks and large-scale heterogeneous networks.