1 Introduction

Heterogeneous information network (HIN) embedding learns representations of nodes in a low-dimensional vector space while preserving their rich semantic information [38]. The key idea of HIN embedding is to measure the similarity between nodes in the heterogeneous information network: the more similar two nodes are, the closer they lie in the mapping space [3, 29]. Meta-path based methods have been widely used in HIN embedding, since meta-paths can capture rich semantics through multiple connections between nodes in a HIN [7, 11, 18]. The basic idea of a meta-path is to establish a path pattern between nodes based on semantic relatedness over HINs. A meta-path is used to sample pairs of nodes and compute their semantic similarity to learn node embeddings. In a HIN, each meta-path captures its own semantic information [9, 15]. Take Figure 1 as an example: in a DBLP HIN, author a2 is connected with a3 following the meta-path M1 (APTPA), and author a1 is connected with a2 following the meta-path M2 (APVPA). The underlying semantics are that authors a2 and a3 are related because their papers share the same topic, while authors a1 and a2 are related because their papers appear in the same venue.

Figure 1

An example of incompatibility problem in HIN embedding, in which (a) gives a DBLP Network schema, (b) lists three similarity matrices based on three meta-paths. The key problem is how to aggregate the similarity matrices with incompatible meta-paths into the matrix of (c)

Existing methods [8, 15] usually assume that different meta-paths share the same semantic space and ignore the incompatibility problem between meta-paths. These methods either average the similarities calculated by multiple meta-paths [6] or directly concatenate the embeddings of each meta-path to obtain an overall representation [4]. However, the similarities between nodes calculated under different meta-paths may differ: two nodes may be similar according to one meta-path but dissimilar according to another. The distances between nodes in the mapping spaces of different meta-paths also differ. This similarity difference results from the inconsistent semantics of different meta-paths. The similarity relationships of nodes in different similarity matrices are inconsistent, which leads to the incompatibility problem of meta-paths. Methods that ignore this incompatibility can produce unreliable results.

Figure 1 presents an example of node embedding in a HIN to illustrate the incompatibility problem of meta-paths. In the DBLP HIN, assume that the node similarity matrices M1S, M2S and M3S are calculated based on three different meta-paths M1, M2 and M3. There are contradictions among the similarity matrices calculated from different meta-paths: for some pairs of nodes, the similarity relationships under different meta-paths are opposite. The similarity of (v1,v2) is 0.5 in both M1S and M2S, but in M1S the similarity of (v1,v2) is smaller than that of (v1,v3), while in M2S it is higher. Because the semantics of different meta-paths differ, the same similarity value may carry different strength in different similarity matrices. Conversely, different similarity values may represent the same similarity strength: the similarity of (v2,v3) is 0.4 in M1S and 0.1 in M3S, yet it is the weakest in both matrices. This shows that similarities calculated from different meta-paths are not compatible. Averaging the similarities calculated by multiple meta-paths [6] or directly concatenating the node embeddings of each meta-path [4] cannot handle this incompatibility, and may hurt node classification and node clustering performance. In addition, existing methods fill missing entries of a similarity matrix with zeros, as in the case of M2S, which does not truly reflect the similarity between nodes.

To solve the problems of the existing works, this paper proposes a novel Semantic-Aware HIN Embedding (SAHE) method, which aggregates incompatible meta-paths in their own semantic spaces. The key idea of the proposed model is to convert node similarities into similarity relationships on each meta-path in its own semantic space and then aggregate the multiple meta-path based similarity matrices. By not computing node embeddings from raw node similarities in a single shared semantic space, the proposed method avoids the noise caused by similarity differences, so the incompatibility of different meta-paths can be resolved. The SAHE method first uses PathSim to calculate the meta-path based similarity matrices for the different meta-paths, and then defines a Kendall tau distance over similarity relationships to measure the distance between the aggregated similarity matrix and the meta-path based similarity matrices. Next, the semantic preference is extracted as a constraint to optimize the aggregated similarity matrix. Finally, KL divergence [21] is used to minimize the distribution difference between the aggregated similarity matrix and the node embeddings to obtain the node representations.

The main contributions of this work are as follows:

  1. This work studies the problem of incompatible meta-paths in HIN embedding. The semantic similarity between nodes can be preserved while accounting for the incompatibility problem, improving the embedding performance.

  2. A novel Semantic-Aware HIN embedding (SAHE) method is proposed to embed HINs. We measure the similarity relationships of each meta-path in its own semantic space to solve the incompatibility problem.

  3. We extract the semantic preference ranking from the meta-path based similarity matrices and use it to optimize the aggregated similarity matrix.

The rest of the paper is organized as follows: we first review the related works in Section 2. Then Section 3 introduces the preliminaries. Section 4 presents the details of the SAHE model. Section 5 shows the experimental results and performance analysis. Finally, the conclusion and future works are shown in Section 6.

2 Related works

Research on heterogeneous information networks (HINs) has recently received widespread attention because HINs can represent rich semantic information. A HIN describes a network with multiple types of nodes and edges, which is not possible in a homogeneous network [4, 32, 38]. HIN analysis is becoming a new research direction in data mining, since HINs integrate more effective information and contain richer semantics in their nodes and edges. Mining a HIN allows the data in the network to be analyzed more fully. How to mine effective information from different types of nodes and edges has become a new challenge, and analyzing heterogeneous information networks is difficult due to their complexity.

HIN embedding aims at learning representations of heterogeneous nodes in a mapped low-dimensional vector space, which can improve the performance of data mining tasks such as classification, clustering and recommendation. However, previous works mainly focus on learning vector representations of nodes in homogeneous information networks. DeepWalk [17] is inspired by ideas from natural language processing: it treats nodes as words, uses sequences generated by random walks as sentences, and then applies the Skip-gram model to these random walk sequences to learn node embeddings. Node2vec [10] is an embedding method similar to DeepWalk that also uses the Skip-gram model; unlike DeepWalk, Node2vec has its own random walk strategy, combining depth-first and breadth-first search to sample nodes and generate random walk sequences. LINE [28] considers not only first-order but also second-order neighbor relationships when learning node embeddings: even if two nodes are not directly connected, they can establish an indirect connection through their shared first-order neighbors. In addition, it uses negative sampling techniques to optimize the model. GCN [13] learns node embeddings through a graph convolutional neural network, but GCN cannot learn the weight of each neighbor node during convolution. GAT [30] was proposed to solve this problem of GCN: it uses an attention mechanism to learn the weight of each neighbor node and aggregate the neighbor information of the node. The deep-learning-based MVC-DNE [35] method performs network embedding by fusing data from the structure and attribute views. Most of the above methods learn node embeddings of homogeneous information networks without considering the types of nodes and edges. Applying them directly to a highly complex HIN may reduce the embedding quality.
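
To make the random-walk-plus-Skip-gram idea concrete, the following is a minimal DeepWalk-style sketch (not the code of any cited method). It assumes an undirected networkx graph and the gensim 4.x Word2Vec API; all parameter values are illustrative.

```python
import random

import networkx as nx
from gensim.models import Word2Vec

def deepwalk_embeddings(G: nx.Graph, num_walks=10, walk_len=40, dim=64):
    """Generate uniform random walks and feed them to Skip-gram (sg=1)."""
    walks = []
    for _ in range(num_walks):
        for v in G.nodes():
            walk = [v]
            while len(walk) < walk_len:
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(u) for u in walk])  # nodes become "words"
    model = Word2Vec(walks, vector_size=dim, window=5, min_count=0, sg=1)
    return {v: model.wv[str(v)] for v in G.nodes()}
```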

Some existing methods have been proposed for HIN embedding [2, 6, 12, 24, 25, 36]. Metapath2vec [6] extends DeepWalk to HIN embedding: it sets different meta-paths for random walks according to the semantic information in the HIN and, like DeepWalk, applies Skip-gram to the random walk sequences of heterogeneous nodes to obtain node embeddings. Unlike most methods that apply Skip-gram to random walk sequences, Hin2vec [8] builds a neural network model that learns node embeddings and meta-path embeddings by maximizing the likelihood of node relations; Hin2vec does not rely on manually set meta-paths but learns them automatically from the network data. EOE [34] learns the latent node representations of two networks and uses a harmonious embedding matrix to transform the representations of the different networks into the same space. These methods ignore the incompatibility of different meta-paths and miss the different semantic relationships of nodes.

Though several works notice the incompatible semantics of HINs, they ignore that different meta-paths have their own semantic spaces and cannot solve the problem of aggregating node similarities across different semantic spaces. Esim [22] is a deep-learning-based HIN embedding model that transcribes semantics in HINs through meta-paths and fuses multiple meta-paths by assigning them different weights. However, it relies heavily on manually set meta-path weights for representation learning, which may not match the real network and thus cannot achieve a better embedding effect. HERec [23] uses a fusion function to fuse the node embeddings learned from different meta-paths into the final node embedding. DMGI [16] is a multi-relational network embedding method that uses an attention mechanism to learn node embeddings. A multi-relational network is a special type of heterogeneous network with a single type of nodes and multiple types of edges, so DMGI cannot learn node embeddings of general heterogeneous information networks (that is, networks with multiple types of both nodes and edges). DyHNE [33] gives different weights to the various meta-paths so as to evaluate the importance of each meta-path to node embedding. The HAN [32] method uses semantic-level attention and node-level attention to learn the importance of meta-paths and node neighbors simultaneously, and aggregates the node representations of multiple meta-paths to obtain the final node representation. These weight-combination methods cannot solve the incompatibility problem of meta-paths. Since meta-paths can capture the similarity between higher-order neighbors and better capture the various aspects of semantic information in a HIN, it is necessary to study the incompatibility of meta-paths.

3 Preliminaries

Definition 3.1 (Heterogeneous information network)

Let a graph G = (V,E) represent a heterogeneous information network, where V is the set of all nodes vi and E is the set of all edges ei in the graph G. Each node vi ∈ V has its own node type and each edge ei ∈ E has its own edge type. T and R represent the set of node types and the set of edge types, respectively. A network is a heterogeneous information network if and only if |T| + |R| > 2.

Figure 1(a) illustrates a small bibliographic HIN with |T| = 4 and |R| = 3. It contains three edge types connecting the four types of nodes: author (A), paper (P), venue (V), and topic (T). To simplify the model, each node is assumed to belong to a single type.

Definition 3.2 (Network schema)

The network schema is defined as TG = (T,R), which is the meta template for a heterogeneous information network G = (V,E). The network schema comes with a node-to-node-type mapping \(\varphi: V \rightarrow T\) and an edge-to-edge-type mapping \(\psi: E \rightarrow R\).

Figure 2 shows the schema of the DBLP HIN in Figure 1.

Figure 2

The HIN schema of DBLP with four node types and four directed edge types

Definition 3.3 (Meta-path)

A meta-path M is denoted as a path pattern connecting different types of nodes, \(T_{1} \overset{R_{1}}{\longrightarrow} T_{2} \overset{R_{2}}{\longrightarrow} {\cdots} \overset{R_{l-1}}{\longrightarrow} T_{l}\), where R1, R2, …, Rl−1 represent the edge types between node types T1 to Tl.

Figure 1(a) shows a meta-path \(A \overset {write}{\longrightarrow } P\overset {publish}{\longrightarrow }V\overset {publish^{-1}}{\longrightarrow }P\overset {write^{-1}}{\longrightarrow }A\) in DBLP, whose semantics are that authors (A) publish papers (P) in the same venue (V); it is abbreviated as “APVPA”. M is the set of all meta-paths Mm. A path instance m = (v1v2…vl) following the meta-path Mm ∈ M represents a path between v1 and vl in the network G, abbreviated as \(m_{v_{1} \sim v_{l}} \).
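
As an illustration of counting path instances along a meta-path, the sketch below builds the commuting matrix of APVPA from two hypothetical incidence matrices (AP for the write relation, PV for the publish relation); these matrix names are assumptions for illustration only.

```python
import numpy as np

def apvpa_path_counts(AP: np.ndarray, PV: np.ndarray) -> np.ndarray:
    """AP[i, p] = 1 if author i wrote paper p; PV[p, v] = 1 if paper p
    appeared in venue v. The product AV[i, v] counts an author's papers in
    each venue, and AV @ AV.T counts APVPA path instances between authors."""
    AV = AP @ PV
    return AV @ AV.T
```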

Definition 3.4 (Node Embedding of HIN)

In a HIN G = (V,E), each node is mapped to a d-dimensional space \(\mathcal {S}^{d}\) and represented by a vector \(\mathbf{v} \in \mathbb{R}^{d}\), where d ≪ |V|. The purpose of the proposed model is to learn such node embeddings in the heterogeneous information network.

4 The proposed model

To solve the incompatibility of different meta-paths in HIN embedding, this work proposes the SAHE model based on similarity relationships, which aggregates multiple meta-paths to obtain the node representations. The key idea of the SAHE method is to convert node similarities into similarity relationships on each meta-path in its own semantic space. Figure 3 shows the framework of the SAHE model. The model consists of four main steps: node similarity calculation, node similarity aggregation, semantic preference extraction and embedding learning.

Figure 3

The framework of the proposed SAHE model

The following subsections detail the steps of this method.

4.1 Node similarity calculation on single meta-path

The proposed model first calculates the node similarity based on each meta-path. Due to the skewed degree distribution of HINs, similarity methods based on random walks are biased toward nodes with large degrees and cannot reflect the real network topology. PathSim is used in our model because it considers all nodes connected through the meta-path, thus avoiding the skewed distribution problem. PathSim is a method commonly used to measure node similarity in HINs. It uses symmetric meta-paths to extract the connecting paths between two nodes and measure their similarity, which not only uses the related heterogeneous information but also extracts the rich semantics of nodes and edges [5, 27].

The basic idea of PathSim is that if two nodes in the network are connected by more meta-path instances, the similarity between them is higher. Given a meta-path Mm and two nodes of the same type vi and vj, PathSim is defined as follows:

$$ m s\left( v_{i}, v_{j}\right)=\frac{2 \times\left|m_{v_{i} \sim v_{j}}\right|}{\left|m_{v_{i} \sim v_{i}}\right|+\left|m_{v_{j} \sim v_{j}}\right|}, $$
(1)

where \(ms\left (v_{i}, v_{j}\right )\) denotes the similarity between nodes vi and vj, abbreviated as msij; \(m_{v_{i} \sim v_{j}}\), \(m_{v_{i} \sim v_{i}}\) and \(m_{v_{j} \sim v_{j}}\) are the path instances following meta-path Mm between (vi,vj), (vi,vi) and (vj,vj), respectively, and |·| denotes their number.

As shown in (1), given a meta-path Mm, msij consists of two parts: the similarity between two nodes depends on the number of path instances between them following meta-path Mm, normalized by the number of path instances from each node to itself. For each meta-path Mm, PathSim is used to calculate the node similarities and obtain a symmetric matrix MmS with a unit diagonal: msij = msji and msii = 1.
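
A minimal sketch of (1) in matrix form, assuming a precomputed commuting matrix C of path-instance counts along a symmetric meta-path (e.g., the output of apvpa_path_counts above); zero diagonal entries would need guarding in practice.

```python
import numpy as np

def pathsim(C: np.ndarray) -> np.ndarray:
    """PathSim matrix from path-instance counts C along a symmetric
    meta-path: ms_ij = 2 * C[i, j] / (C[i, i] + C[j, j]), per Eq. (1)."""
    diag = np.diag(C).astype(float)
    return 2.0 * C / (diag[:, None] + diag[None, :])
```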

4.2 Node similarity aggregation on multiple meta-paths

After calculating the node similarities on single meta-paths, the key step is to aggregate the node similarities across these incompatible meta-paths. According to the definition of node similarity in (1), the proposed method determines the meta-path based similarity matrices of the |M| meta-paths. However, node similarities under different meta-paths are not always consistent with each other: two nodes may be similar according to one meta-path but not according to another. A Kendall tau distance is defined to aggregate the different meta-paths by converting node similarities into similarity relationships.

Kendall tau distance is generally used to count the discordant pairs between two rankings [31]. It has many advantages in measuring sequence differences and attracts attention in many fields. We extend it to convert node similarities into similarity relationships and to measure the consistency of two matrices: the smaller the Kendall tau distance, the more consistent the two matrices are. Our method obtains the aggregated similarity matrix by minimizing the Kendall tau distance, so that it has the smallest difference from all the meta-path based similarity matrices; the problem of incompatible meta-paths is thereby solved. Since each meta-path has its own semantics, the similarity relationships in the meta-path similarity matrices are compared through the Kendall tau distance. First, the Kendall tau distance between two similarity matrices is defined as

$$ K\left( \mathbf{M_{m}}\mathbf{S}, \mathbf{AS}\right)=\sum\limits_{i=1}^{|V|} K\left( \mathbf{M_{m}}\mathbf{S}_{\mathbf{i}}, \mathbf{A S}_{\mathbf{i}}\right), $$
(2)

where AS is the aggregated similarity matrix, and the Kendall tau distance between the corresponding row vectors of the two matrices is defined as follows:

$$ K\left(\mathbf{M_{m}}\mathbf{S}_{\mathbf{i}}, \mathbf{AS}_{\mathbf{i}}\right)=\sum\limits_{j<k} \kappa_{i}(j, k), \quad \kappa_{i}(j, k)=\left\{\begin{array}{ll} 0, & \text{if } \operatorname{sgn}\left(ms_{ij}-ms_{ik}\right)=\operatorname{sgn}\left(as_{ij}-as_{ik}\right), \\ 1, & \text{otherwise}, \end{array}\right. $$
(3)

The indicator \(\kappa_{i}(j,k)\) compares the relative similarity relationships over all column pairs (j,k): if the relative similarity relationship in the two row vectors is the same, the distance remains unchanged; if it is reversed, or a tie in one row is broken in the other, the distance is accumulated. The total distance between the meta-path based similarity matrices and the aggregated similarity matrix is:

$$ \sum\limits_{m=1}^{|M|} K\left( \mathbf{M_{m}}\mathbf{S}, \mathbf{AS}\right)=\sum\limits_{m=1}^{|M|} \sum\limits_{i=1}^{|V|} K\left( \mathbf{M_{m}}\mathbf{S_{i}}, \mathbf{A S}_{\mathbf{i}}\right). $$
(4)

Based on the above analysis, the similarity aggregation problem is defined as follows: given the |M| similarity matrices MmS calculated from the |M| meta-paths, find an aggregated similarity matrix AS that minimizes the total distance to all MmS. Stochastic gradient descent [1] is used to solve this optimization problem:

$$ \mathbf{AS}^{*}=\underset{\mathbf{AS}}{\arg \min } \sum\limits_{m=1}^{|M|} \sum\limits_{i=1}^{|V|} K\left( \mathbf{M}_{\mathbf{m}} \mathbf{S}_{\mathbf{i}}, \mathbf{A} \mathbf{S}_{\mathbf{i}}\right). $$
(5)
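
The following brute-force sketch evaluates the objective of (4)-(5) for a candidate AS. It enumerates all column pairs per row, so it is O(|V|^3) per matrix and purely illustrative; the actual optimization of AS by stochastic gradient descent is not reproduced here.

```python
import numpy as np

def kendall_tau_distance(MS: np.ndarray, AS: np.ndarray) -> int:
    """Count discordant similarity relationships between two matrices
    following Eq. (3): a pair (j, k) in row i contributes 1 whenever the
    ordering of (ms_ij, ms_ik) disagrees with that of (as_ij, as_ik)."""
    n = MS.shape[0]
    dist = 0
    for i in range(n):
        for j in range(n):
            for k in range(j + 1, n):
                if np.sign(MS[i, j] - MS[i, k]) != np.sign(AS[i, j] - AS[i, k]):
                    dist += 1
    return dist

def total_distance(mats, AS):
    """Objective of Eq. (4): sum the distance over all meta-path matrices."""
    return sum(kendall_tau_distance(MS, AS) for MS in mats)
```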

In the example of Figure 1, the aggregated similarity matrix calculated by SAHE is shown in Figure 4. We can calculate the Kendall tau distance between AS and (M1S,M2S,M3S). Since there is one reversed similarity relationship between row vector v1 in M2S and AS (m2s12 > m2s13 while as12 < as13), \(K\left (\mathbf {M_{2}} \mathbf {S}_{1},\mathbf {AS}_{1}\right ) = 1\). There is no reversed similarity relationship between the other row vectors of M2S and AS, so \(K\left (\mathbf {M_{2}}\mathbf {S},\mathbf {AS}\right ) = 1\). The similarity relationships in the other two similarity matrices M1S and M3S are the same as in the aggregated similarity matrix AS, so \(K\left (\mathbf {M_{1}}\mathbf {S}, \mathbf {AS}\right )\) + \(K\left (\mathbf {M_{2}}\mathbf {S}, \mathbf {AS}\right )\) + \(K\left (\mathbf {M_{3}}\mathbf {S}, \mathbf {AS}\right )\) = 1. Only one pair of nodes has a reversed similarity relationship across (M1S,M2S,M3S), which means the aggregated similarity matrix AS maintains high consistency with the meta-path based similarity matrices.

Figure 4

The aggregated similarity matrix AS calculated by SAHE in the example of Figure 1

4.3 Semantic preference extraction

Semantic preferences are extracted from the multiple meta-path based similarity matrices to optimize the aggregated similarity matrix. The majority principle [14], as the name suggests, is the obedience of the minority to the majority. It is widely used in economics and sociology due to its low complexity and ease of understanding. The majority principle represents the majority opinion, and here the semantic relationship is determined by the semantic information of the majority of meta-paths. The majority principle is extended to a semantic preference ranking: given three nodes \(\left \{v_{i}, v_{j}, v_{k}\right \}\), if the pair (vi,vj) is more similar than (vi,vk) in the majority of similarity matrices, then the aggregated similarity matrix should satisfy the ranking that the similarity of (vi,vj) is higher than that of (vi,vk); in other words, vi prefers vj to vk in the aggregated similarity matrix. Formally, if \(\left |ms_{i j}>ms_{i k}\right |>\left |ms_{i j}<ms_{i k}\right |\), then vj ≻ vk, where '≻' denotes the preference relation. Conversely, if (vi,vj) is less similar than (vi,vk) in the majority of similarity matrices, the aggregated similarity matrix should satisfy the ranking that the similarity of (vi,vj) is lower than that of (vi,vk): if \(\left |ms_{i j}>ms_{i k}\right |<\left |ms_{i j}<ms_{i k}\right |\), then vk ≻ vj. In particular, \(\left |ms_{i j}>ms_{i k}\right |=\left |ms_{i j}<ms_{i k}\right |\) occurs in some cases. Pj and Nj are defined to resolve such ties:

$$ P_{j}=\sum\limits_{k=1, k \neq j}^{|V|}\left|ms_{i j}>ms_{i k}\right|. $$
(6)
$$ N_{j}=\sum\limits_{k=1, k \neq j}^{|V|}\left|ms_{i j}<ms_{i k}\right|. $$
(7)

Pj is the number of cases in which a node vk is less similar to vi than vj is. If Pj > Pk, then vj ≻ vk; otherwise vk ≻ vj. Nj is the number of cases in which a node vk is more similar to vi than vj is. If Nj < Nk, then vj ≻ vk; otherwise vk ≻ vj. Based on the above semantic preference ranking, for any node vi, the similarity relationship ranking between any two other nodes (vj,vk) with respect to vi can be determined. This ranking is used as a constraint to optimize the aggregated matrix. Algorithm 1 describes the process of semantic preference ranking.

In the example of Figure 1, three similarity relationship rankings are calculated based on the semantic preference ranking: v1 : v1 ≻ v3 ≻ v2, v2 : v2 ≻ v1 ≻ v3, v3 : v3 ≻ v1 ≻ v2. This ranking is consistent with the similarity ranking in Figure 4.
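
As a concrete reading of Eqs. (6)-(7) (Algorithm 1 itself is not reproduced in the text), the sketch below interprets \(|ms_{ij}>ms_{ik}|\) as the number of (meta-path matrix, node k) combinations in which the inequality holds, and ranks nodes for a fixed vi by Pj descending with Nj ascending as the tie-breaker; this interpretation is an assumption.

```python
import numpy as np

def preference_ranking(sim_mats, i):
    """Semantic preference ranking of all nodes with respect to node v_i:
    P[j] counts cases where v_j is more similar to v_i than some v_k,
    N[j] counts cases where it is less similar (Eqs. (6)-(7))."""
    n = sim_mats[0].shape[0]
    P = np.zeros(n, dtype=int)
    N = np.zeros(n, dtype=int)
    for MS in sim_mats:
        row = MS[i]
        for j in range(n):
            for k in range(n):
                if j != k and row[j] > row[k]:
                    P[j] += 1
                elif j != k and row[j] < row[k]:
                    N[j] += 1
    return sorted(range(n), key=lambda j: (-P[j], N[j]))
```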

Algorithm 1 Semantic preference ranking

4.4 Embedding learning

Given a heterogeneous information network G = (V,E) and multiple meta-paths Mm ∈ M, an aggregated similarity matrix AS that preserves node similarities is obtained with Algorithm 1. The purpose of the SAHE method is to learn the embedding of nodes in the HIN, which is cast as minimizing the distribution difference between the aggregated similarity matrix AS and the node embedding matrix H. Since the elements stored in the two matrices are not directly comparable, the sigmoid activation function is used to convert the embedding matrix H into a similarity matrix HS:

$$ hs\left( v_{i}, v_{j}\right)=\frac{1}{1+e^{-{h_{i}^{T}} h_{j}}}, $$
(8)

where hi and hj are the embedding vectors of nodes vi and vj, respectively. The embedding matrix can thus be converted into a fitting similarity matrix, denoted HS, where \(hs_{i j}=h s\left (v_{i}, v_{j}\right )\).
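
Eq. (8) vectorizes directly over the whole embedding matrix; a minimal sketch:

```python
import numpy as np

def fitting_similarity(H: np.ndarray) -> np.ndarray:
    """Fitting similarity matrix HS of Eq. (8): element-wise sigmoid of the
    pairwise inner products of the embedding matrix H (|V| x d)."""
    return 1.0 / (1.0 + np.exp(-(H @ H.T)))
```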

KL divergence [26] measures the information lost when a theoretical distribution is used to approximate a real distribution. The aggregated similarity matrix AS is the real distribution, while the fitting similarity matrix HS is the theoretical distribution. Our goal is to minimize the KL divergence between AS and HS:

$$ L=KL(\mathbf{AS} \| \mathbf{HS}). $$
(9)

Since the similarities between a node vi and all other nodes vj in a similarity matrix can be treated as a probability distribution, the loss is written as:

$$ L=\sum\limits_{i=1}^{|V|} \sum\limits_{j=1}^{|V|} KL\left( as_{ij} \| hs_{i j}\right)=\sum\limits_{i=1}^{|V|} \sum\limits_{j=1}^{|V|} as_{ij} \log \frac{1}{hs_{i j}}. $$
(10)

The optimization objective is to find the node embedding matrix H that minimizes the following loss function; stochastic gradient descent [1] is used to solve this optimization problem:

$$ \mathbf{H}^{*}=\underset{\mathbf{H}}{\arg \min }\sum\limits_{i=1}^{|V|} \sum\limits_{j=1}^{|V|} as_{i j} \log \frac{1}{h s_{i j}}. $$
(11)

The details of the optimization of embedding learning are shown in Algorithm 2. A binary classifier is used to distinguish node samples coming from the aggregated similarity distribution AS from those coming from a noise distribution. An auxiliary random variable D is used for this classification: D = 1 for a node from the aggregated similarity distribution and D = 0 for a sample drawn from the noise distribution. σ is the sigmoid function and n is the number of nodes extracted from the noise distribution; in our experiments, n = 3.
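
Since Algorithm 2 is not reproduced in the text, the sketch below illustrates one plausible form of the negative-sampling step it describes; the uniform noise distribution is an assumption.

```python
import numpy as np

def negative_sampling_loss(H, AS, i, j, n=3, rng=None):
    """One stochastic term of Eq. (11): (i, j) is a positive pair (D = 1)
    weighted by as_ij, plus n nodes drawn from a noise distribution (D = 0)."""
    rng = rng or np.random.default_rng()
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -AS[i, j] * np.log(sigmoid(H[i] @ H[j]))
    for k in rng.integers(0, len(H), size=n):  # uniform noise, an assumption
        loss -= np.log(1.0 - sigmoid(H[i] @ H[k]))
    return loss
```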

Algorithm 2 Embedding learning

5 Experimental results

The embedding performance of the SAHE method is verified on three real-world HIN datasets. First, the experiments compare the node classification and clustering performance of the proposed model with recent existing methods. Second, the influence of the parameters on the performance of the proposed method is examined.

5.1 Datasets

Three datasets from different fields, the DBLP network [19], the Movielens network [37], and the Yelp network [20], are used in the experiments to evaluate the performance of the proposed model.

  • DBLP is an author-centric dataset in the field of computer science. The DBLP network consists of four types of nodes: author (A), paper (P), topic (T) and venue (V). The edge types include authorship (P-A), topic of paper (P-T), publishing venue (P-V), and the citation relation (P-P). The schema is shown in Figure 2.

  • Movielens is a movie-centric dataset in the movie field consisting of four types of nodes: movie (M), user (U), age (A), and occupation (O), and three edge types: user's like (U-M), user's age (U-A) and user's occupation (U-O). Figure 5(a) shows the schema of the Movielens HIN.

  • Yelp is a business-centric dataset containing user information, business information, and user reviews of businesses. We extract data including business (B), user (U), city (Ci) and compliment (Co) as nodes, and the evaluation relation (U-B), user's like (U-Co) and location (B-Ci) as edges. Figure 5(b) shows the schema of the Yelp HIN.

Figure 5

Schemas of two heterogeneous information networks

The networks of the original datasets are sparse and have skewed degree distributions: they contain a large number of nodes, most of which have small degrees, making it difficult to analyze the entire network. We therefore randomly sample the original networks and set a minimum degree to filter out nodes whose degree is below it. The minimum degree for DBLP, Movielens and Yelp is 3, 4 and 4, respectively. The basic statistics and meta-paths of the three HIN networks are summarized in Table 1. The length of a meta-path is the number of relations it contains; for example, in the DBLP network, the meta-path APA has length 2.
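
A sketch of this preprocessing step, assuming a networkx graph; the text does not state whether the filter is applied once or repeated until stable, so a single pass is shown.

```python
import networkx as nx

def filter_by_min_degree(G: nx.Graph, min_degree: int) -> nx.Graph:
    """Drop nodes below the minimum degree (3 for DBLP, 4 for Movielens
    and Yelp) in one pass over the sampled network."""
    G = G.copy()
    G.remove_nodes_from([v for v, d in G.degree() if d < min_degree])
    return G
```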

Table 1 The basic statistics and meta-paths of three HIN datasets

5.2 Baseline methods

We compare the proposed method with two types of embedding methods: methods for homogeneous information networks and methods for heterogeneous information networks. These methods were introduced in the related works in Section 2; here we present the details of their experimental setup. For fairness, the dimension d of the embedding vector space is set to 64 for all methods.

For homogeneous information networks, the DeepWalk [17], Node2vec [10] and LINE [28] methods are compared with the proposed method. DeepWalk and Node2vec are network embedding methods based on random walks, while LINE samples node pairs directly. When applying these methods to HIN embedding, the heterogeneity of nodes is ignored and heterogeneous nodes are regarded as homogeneous. For these three methods, we used the default settings from the original articles.

For heterogeneous information networks, the meta-path based methods Metapath2vec [6], Esim [22] and HAN [32] and the non-meta-path based method Hin2vec [8] are compared with the proposed method. To verify the ability of the proposed method to solve the meta-path incompatibility problem, a variant of SAHE, called SAHEavg, is built by removing the Kendall tau distance and averaging the similarities calculated by the different meta-paths; this variant is compared with SAHE. For each dataset, the meta-paths used by Metapath2vec, Esim and HAN in the experiments are the same as those of SAHE. Since different datasets require different weight settings, various weights are assigned to Esim and the best-performing one is chosen for each dataset. The parameter settings of the Hin2vec method are the same as in the original article.

5.3 Node classification

Node classification assigns nodes to different categories, where the input of the node classifier is the learned embeddings. In the experiments, we mainly focus on the central node type of each dataset: author, movie and business are used as the central nodes for classification on DBLP, Movielens and Yelp, respectively. A group of nodes is randomly selected as labeled training nodes, and the remaining nodes are used as test nodes. The embeddings of the training nodes are used as input to train a logistic regression classifier that predicts the most likely labels for the test nodes. The datasets are divided into training and test sets, with the training ratio gradually increasing from 5% to 50%. Each experiment is repeated 10 times, and the average Micro-F1 and Macro-F1 scores are recorded. The results are shown in Tables 2 and 3.
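
A sketch of this evaluation protocol using scikit-learn; classifier hyperparameters are not given in the text, so defaults are used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def classification_scores(embeddings, labels, train_ratio, seed=0):
    """Train a logistic regression on a fraction of labeled node embeddings
    and report Micro-F1 and Macro-F1 on the remaining nodes."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, train_size=train_ratio, random_state=seed)
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    return (f1_score(y_te, pred, average="micro"),
            f1_score(y_te, pred, average="macro"))
```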

Table 2 Micro-F1 scores of node classification task
Table 3 Macro-F1 scores of node classification task

As shown in Tables 2 and 3, performance generally improves as the ratio of training data increases, indicating that accuracy is positively related to the training ratio. (1) The node classification performance of the SAHE method is consistently better than the baselines. The improvement obtained by SAHE over the best baseline (HAN) is about 2%-8%. When the training ratio is less than 20%, the performance of SAHE is much higher than that of the baselines, which means the proposed method requires only a small amount of training data to obtain effective embeddings. (2) DeepWalk, Node2vec and LINE perform worse than Metapath2vec, Esim, Hin2vec, HAN and SAHE. Thus, homogeneous network embedding methods are significantly worse than heterogeneous network embedding methods on the node classification task, because they ignore the heterogeneity of nodes. (3) Comparing the heterogeneous information network baselines (Metapath2vec, Esim, HAN, Hin2vec), HAN is the best because the fused embedding learned through the attention mechanism improves the embedding ability. Metapath2vec is the worst-performing heterogeneous baseline because it only extracts the semantic information of a single meta-path. (4) The node classification performance of SAHE is superior to that of SAHEavg in all cases, which shows that the proposed method improves embedding performance by solving the meta-path incompatibility problem.

5.4 Link prediction

We conduct experiments on the three datasets to compare the link prediction performance of the node embeddings learned by SAHE and the baseline methods. In this task, we predict the citation relationship (P-P) between papers in DBLP and the friend relationship (U-U) between users in Movielens and Yelp. We first randomly divide the edges of each dataset into a training set and a test set, using 75% of the edges for training and the rest for testing. Then the node embeddings are used as features to train a classifier on the training set, and link prediction performance is evaluated on the test set. The AUC and AP scores are used as indicators of the quality of the node embeddings: the higher the AUC and AP, the better the link prediction performance. Table 4 shows the link prediction performance of the various methods on the three datasets, with the best performance highlighted in bold.
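
A sketch of this protocol; the paper does not specify how an edge is featurized from its two node embeddings, so the element-wise (Hadamard) product, a common choice, is assumed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

def link_prediction_scores(H, train_pairs, train_y, test_pairs, test_y):
    """Represent a candidate edge (i, j) by the element-wise product of its
    node embeddings, train a classifier on the 75% edge split, and report
    AUC and AP on the held-out edges."""
    feats = lambda pairs: np.array([H[i] * H[j] for i, j in pairs])
    clf = LogisticRegression(max_iter=1000).fit(feats(train_pairs), train_y)
    scores = clf.predict_proba(feats(test_pairs))[:, 1]
    return roc_auc_score(test_y, scores), average_precision_score(test_y, scores)
```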

Table 4 AUC and AP scores of the link prediction task

Similar to the node classification task, SAHE performs significantly better than all baseline methods on the three datasets. Compared with the baselines, the AUC and AP scores of SAHE increase by about 2%-22% on DBLP, 3%-11% on Movielens, and 4%-12% on Yelp. This shows that the proposed method improves the performance of network embedding by capturing heterogeneous semantic information. The link prediction performance of SAHE is also superior to that of SAHEavg in all cases, which verifies the ability of the proposed method to solve the meta-path incompatibility problem on the link prediction task. Unlike in the node classification task, Node2vec performs better than the traditional heterogeneous method Metapath2vec on the DBLP dataset. This may be because Node2vec obtains more structural and semantic information on DBLP than the single-meta-path Metapath2vec method.

5.5 Node clustering

We also conduct experiments on the three datasets to verify the performance of the proposed SAHE method on the node clustering task. Node clustering groups nodes into clusters according to the similarity between them. The labeled node type for the clustering task is the same as in the node classification task: author in DBLP, movie in Movielens and business in Yelp. Specifically, the node embeddings are treated as node features, and the k-means algorithm is used to cluster the nodes. The ground-truth groups consist of nodes with the same label and are compared with the groups obtained by clustering. Clustering performance is evaluated with Normalized Mutual Information (NMI) and the Adjusted Rand Index (ARI). The average NMI and ARI scores over 10 repeated trials are shown in Table 5.
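
A sketch of this protocol with scikit-learn; k-means settings beyond the number of clusters are not given in the text, so defaults are assumed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(embeddings, labels, n_runs=10):
    """k-means on node embeddings, scored by NMI and ARI against the
    ground-truth labels and averaged over repeated trials."""
    k = len(set(labels))
    nmi, ari = [], []
    for seed in range(n_runs):
        pred = KMeans(n_clusters=k, random_state=seed,
                      n_init=10).fit_predict(embeddings)
        nmi.append(normalized_mutual_info_score(labels, pred))
        ari.append(adjusted_rand_score(labels, pred))
    return np.mean(nmi), np.mean(ari)
```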

Table 5 NMI and ARI scores of the node clustering task

The experimental results show that the node clustering performance of the proposed SAHE method is consistently better than the other baselines on the different datasets. Compared with the baselines, the NMI and ARI scores of SAHE increase by about 3%-9% on DBLP, 1%-9% on Movielens, and 1%-7% on Yelp.

Similar to the node classification task, the clustering performance of HAN is still second best. The homogeneous network embedding methods are significantly worse than the heterogeneous ones in most cases, and the node clustering performance of SAHE is superior to that of SAHEavg in all cases. We also observe that Hin2vec does not perform well in the clustering experiments, sometimes even worse than the homogeneous information network embedding methods, which shows that its performance is unstable.

Comparing the heterogeneous information network methods, HAN, Metapath2vec and Esim achieve higher NMI and ARI than Hin2vec on the DBLP, Movielens and Yelp datasets. This result shows that methods with manually set meta-paths (Metapath2vec, Esim) outperform methods that select meta-paths automatically (such as Hin2vec), possibly because manually set meta-paths are more consistent with the semantic relationships in real networks. In other words, meta-path selection strategies require expert prior knowledge, as they are not only related to the direct connections of nodes in the network.

5.6 Parameter analysis

The experiments evaluate the node clustering task of the SAHE model with different values of the node embedding dimension d and different numbers of meta-paths |M|. Other parameters, such as those of the optimization process, are not analyzed because they are not relevant to the contributions of this paper. To study the influence of d, we observe the clustering performance while increasing d from 16 to 256. The results of the node classification task are similar to the clustering results, so Tables 6 and 7 only show the node clustering results.

Table 6 Influence of Node embedding dimension d on SAHE
Table 7 Influence of meta-paths number |M| on SAHE

We observe that when d is less than 64, a smaller d leads to worse performance, because a small dimension cannot capture enough valid information about the nodes. The clustering performance is best at d = 64 or d = 128 and then decreases as d increases further, possibly due to over-fitting caused by an excessively large dimension. d = 64 performs well on DBLP and Yelp, while d = 128 performs well on Movielens; different networks require different dimensions.

Table 7 shows that as the number of meta-paths increases, the clustering performance improves. This is reasonable since multiple meta-paths can obtain richer information from heterogeneous information networks. It further verifies that the proposed SAHE model can solve the incompatibility of different meta-paths.

6 Conclusions and future works

In this paper, a novel method named Semantic-Aware HIN Embedding (SAHE) is proposed to learn node embeddings in heterogeneous information networks. In particular, we solve the incompatibility problem of the different meta-paths existing in real-world HINs. The core of the SAHE method is to convert node similarities into similarity relationships on each meta-path in its own semantic space. Since a similarity matrix calculated from a single meta-path has no incompatibility problem, minimizing the distance between the aggregated similarity matrix and the meta-path based similarity matrices solves the incompatibility problem. In addition, an innovative semantic preference ranking is proposed as a constraint to optimize the aggregated similarity matrix. Extensive experimental evaluations confirm that the SAHE model can capture more semantic information of complex HINs and overall exceeds the baselines.

To improve the proposed method, future works will focus on the following aspects. First, the proposed method selects meta-paths manually; we plan to upgrade it to select the most influential meta-paths automatically while ensuring that these meta-paths preserve the rich semantic information in the HIN. Second, the proposed model will be extended to node representation on more complex networks, such as dynamic heterogeneous networks, attributed heterogeneous networks and large-scale heterogeneous networks.