
1 Introduction

Heterogeneous information networks (HINs) [17, 19] typically include multiple types of nodes and edges and thus carry rich semantic information. They arise naturally in real-world data such as social networks [7, 24], citation networks [1, 12] and recommendation systems [2, 32]. The complicated heterogeneity and rich semantics within HINs pose great challenges for heterogeneous graph tasks such as node classification, link prediction and graph classification. Recently, representation learning for heterogeneous graphs [28] has received a surge of research attention, presenting a great opportunity for analyzing HINs.

To capture both heterogeneity and structural information, heterogeneous graph neural networks (HGNNs) have been proposed and widely used to model HINs in recent years. Existing HGNNs can broadly be categorized into metapath-based and metapath-free models. Metapath-based approaches capture heterogeneity through predefined metapaths [5, 26, 29, 31], but appropriate metapaths must be redefined for every new heterogeneous graph. To remove this dependency, metapath-free approaches encode graph heterogeneity with additional tailored modules [11, 13, 14]. By encoding both graph structure and heterogeneity, existing approaches improve performance on a variety of downstream tasks on HINs, demonstrating that leveraging the heterogeneity of HINs can significantly boost model performance.

Since nodes of different types may have diverse attributes, most existing metapath-free methods first align node features by projecting them into a shared low-dimensional space [13, 14]. For example, as illustrated in Fig. 1, the input feature vectors of papers, authors and venues are first mapped into low-dimensional embeddings of the same dimension. Although these low-dimensional node embeddings preserve the original feature information and topological information [28], node type information is not retained, which in turn leads to a loss of heterogeneity information in the subsequent neighborhood aggregation. Under this condition, when aggregating information from a node's neighbors, most existing methods can only identify which neighbors share a type, but cannot determine the exact types of those neighbors. Based on the above analysis, the first challenge is to design an encoder that seamlessly integrates graph heterogeneity, including both node types and edge types, with node features and graph structure.

Fig. 1. An illustration of the feature processing for a toy citation network. \(\text {W}_P\), \(\text {W}_A\) and \(\text {W}_V\) are type-specific transformation matrices w.r.t. node types.

Additionally, most existing approaches only consider interactions between nodes while neglecting the latent interactions among different node features [14, 26, 30]. Specifically, each convolutional layer can be viewed as a first-order interaction between a node and its neighbors, and stacking multiple convolutional layers captures high-order information from multi-hop neighbors. However, feature-level high-order information is also useful for label prediction. For example, in a co-authorship network, the attributes of paper nodes include keywords, and the target is to predict their research topics. If we consider only one keyword of a paper, such as graph neural networks (GNNs), it is difficult to tell whether the paper's label is Information Retrieval (IR) or Artificial Intelligence (AI), as both fields have sub-topics related to GNNs. If we consider three keywords, graph neural networks, personalized search and query recommendation, simultaneously, it becomes much easier to classify the paper as IR rather than AI. This indicates that modeling such feature-level interactions can boost the model's capability. Therefore, the second challenge is to design an encoder that can capture latent interactions among node features and leverage feature-level high-order information to enhance node embeddings.

To address these challenges, we propose a novel Heterogeneous graph Cascade Attention Network (HetCAN) in this paper. For the first challenge, we put forward a type-aware encoder composed of multiple type-aware layers, in which learnable type embeddings are explicitly introduced for both nodes and edges. The key idea of introducing node type embeddings is to compensate for the type information lost when node embeddings are projected into the low-dimensional space. To this end, we first fuse node feature embeddings with node type embeddings to obtain fused node embeddings, and then use the fused node embeddings together with edge type embeddings to perform attention-based weighted aggregation, yielding type-aware node embeddings. Owing to the type information of both nodes and edges, the type-aware encoder captures heterogeneity, node attributes and graph structure simultaneously.

For the second challenge, inspired by the Transformer's outstanding capability of modeling interactions among tokens in input sequences [22, 30], we propose a dimension-aware encoder that enhances hidden embeddings by learning feature-level high-order information. To highlight the importance of node types, we also introduce node type embeddings as a type encoding, similar to the positional encoding in the Transformer, which distinguishes the feature interaction patterns of different node types. Regarding each dimension (or group of dimensions) of a node embedding as a token, we construct an input sequence for each node. We then apply multi-head self-attention to these sequences, allowing each dimension to attend to the others and thereby learn their latent interactions. The outputs of the dimension-aware encoder are concatenated with those of the type-aware encoder to form the final node representations, i.e., the outputs of a one-layer cascade block.

Overall, the proposed HetCAN can be regarded as a cascade model with dual-level awareness, i.e., node-level and feature-level awareness. On the one hand, each type-aware layer utilizes the type embeddings of both nodes and edges, allowing nodes to be aware of their neighborhood's type information as well as feature information. On the other hand, each dimension-aware layer employs multi-head self-attention on the sequences expanded from hidden embeddings and makes each dimension aware of the others, thereby learning the high-order information behind latent feature interactions. The main contributions of this work are summarized as follows:

  • We present a type-aware encoder to compensate for the loss of heterogeneity information and capture node-level high-order information, as well as a dimension-aware encoder for learning feature-level high-order information.

  • We propose HetCAN, a metapath-free model built upon the above encoders, which encodes graph heterogeneity in a learnable way and obtains more expressive node representations in an end-to-end manner.

  • We conduct extensive experiments to demonstrate the effectiveness and efficiency of the proposed HetCAN.

2 Related Work

Heterogeneous Network Embedding. A large number of graph embedding approaches have been proposed in recent years [6, 15, 21], aiming to map nodes or substructures into a low-dimensional space in which the connectivity of the graph is preserved. Meanwhile, as most real-world networks are composed of various types of nodes and relationships [28], research on heterogeneous network embeddings (HNEs) [17, 25] has also received significant attention. HNE approaches can broadly be categorized into random walk methods [3, 4] and first/second-order proximity methods [18, 20].

Heterogeneous Graph Neural Networks. To capture the rich semantic information contained in heterogeneous graphs, a series of heterogeneous graph neural networks have been proposed in recent years [13, 27]. According to how they utilize graph heterogeneity, HGNNs fall into two categories: metapath-based HGNNs [5, 16, 26, 29] and metapath-free HGNNs [11, 13, 14]. Typically, metapath-based approaches aggregate messages from type-specific neighboring nodes to generate semantic vectors, and then fuse the semantic vectors of different metapaths to produce the final node representations. Their dependence on predefined metapaths makes them difficult to apply to complex real-world networks. Metapath-free models update node representations by directly employing the message passing mechanism, with additional tailored modules to model the heterogeneity, which frees them from metapath selection. However, such tailored modules tend to keep the encoding of heterogeneity separate from node features [14], failing to capture the relations between them. We aim to encode both graph heterogeneity and feature information in a unified embedding for all nodes in HINs, which allows us to learn more expressive node representations and improve performance on downstream tasks.

3 Preliminaries

3.1 Heterogeneous Information Network

A heterogeneous information network (HIN) can be defined as \(G=\{\mathcal {V},\mathcal {E}, \mathcal {A}, \mathcal {R} \}\), where \(\mathcal {V}\) is the set of nodes and \(\mathcal {E}\) is the set of edges. In an HIN, each node \(v \in \mathcal {V}\) has a type \(\phi (v)\) and each edge \(e \in \mathcal {E}\) has a type \(\psi (e)\). The sets of node types and edge types are denoted by \(\mathcal {A} = \{ \phi (v): \forall v \in \mathcal {V} \}\) and \(\mathcal {R} = \{ \psi (e): \forall e \in \mathcal {E} \}\), respectively, where \(\phi :\mathcal {V} \rightarrow \mathcal {A}\) is the node type mapping function and \(\psi :\mathcal {E} \rightarrow \mathcal {R}\) is the edge type mapping function. Typically, a heterogeneous graph satisfies \(|\mathcal {A}| + |\mathcal {R}| > 2\); when \(|\mathcal {A}| = |\mathcal {R}| = 1\), the graph degenerates into a homogeneous graph.
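To make the notation concrete, the following minimal Python sketch (our own illustration with hypothetical variable names, not part of the original formulation) represents a toy citation HIN together with its type mapping functions \(\phi \) and \(\psi \).

```python
# Toy citation HIN with papers (P), authors (A) and venues (V).
# The dictionaries play the roles of the mapping functions phi and psi.
node_type = {0: "P", 1: "P", 2: "A", 3: "V"}                     # phi: V -> A
edge_type = {(2, 0): "writes", (2, 1): "writes",                 # psi: E -> R
             (0, 3): "published_in", (1, 3): "published_in"}

A = set(node_type.values())      # node type set, |A| = 3
R = set(edge_type.values())      # edge type set, |R| = 2
assert len(A) + len(R) > 2       # heterogeneous; |A| = |R| = 1 would be homogeneous
```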

3.2 Graph Neural Networks

GNNs [12, 23] and HGNNs [13, 16, 26] commonly rely on the key operation of aggregating neighborhood information in a layer-wise manner, namely node-level aggregation. In this way, messages are recursively passed and transformed from neighboring nodes to the target node. In the l-th layer, the representation of node v is calculated by

$$\begin{aligned} {\textbf {h}}_v^l = \textsc {Aggr}({\textbf {h}}_v^{l-1}, \{{\textbf {h}}_u^{l-1} : u \in \mathcal {N}_v\}; \theta _g^l), \end{aligned}$$
(1)

where \(\mathcal {N}_v\) is the set of neighboring nodes of node v (or type-specific neighboring nodes for HGNNs), and \(\textsc {Aggr}(\cdot ;\theta _g^l)\) denotes the neighborhood aggregation function parameterized by \(\theta _g^l\) in the l-th layer. Different neighborhood aggregation functions exist, e.g., mean-pooling aggregation in GCN [12] and attention-based aggregation in GAT [23]. Since GAT can distinguish the different importance of neighboring nodes, we adopt it as the backbone of the proposed type-aware encoder, which will be discussed in the next section.
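As a concrete rendering of Eq. (1), the snippet below sketches one message-passing layer with a simple mean aggregator over an adjacency list; the function and variable names are ours, and GCN/GAT would replace the mean with normalized or attention-based weights.

```python
import torch

def aggregate_layer(h, neighbors, W_self, W_neigh):
    """One layer of Eq. (1) with a mean aggregator standing in for Aggr(.).

    h: (n, d) embeddings from layer l-1; neighbors: dict node id -> list of neighbor ids;
    W_self, W_neigh: (d, d') learnable matrices (the parameters theta_g of this layer).
    """
    out = torch.zeros(h.size(0), W_self.size(1))
    for v, nbrs in neighbors.items():
        msg = h[nbrs].mean(dim=0) if nbrs else torch.zeros(h.size(1))
        out[v] = torch.relu(h[v] @ W_self + msg @ W_neigh)
    return out
```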

3.3 Transformer-Style Architecture

In the following, we briefly introduce the Transformer encoder. The Transformer encoder [22] is composed of one or multiple Transformer blocks, each of which mainly contains a multi-head self-attention (MHSA) module and a feed-forward network (FFN). In natural language processing, the MHSA module, the critical component, aims to capture the semantic correlations among input tokens. Regarding each node feature as a token, it can also be generalized to learn the interactions among node features.

Suppose we have an input \({\textbf {H}} \in \mathbb {R}^{n \times d}\), where n is the number of input tokens and d is the hidden dimension. MHSA first projects \({\textbf {H}}\) to \({\textbf {Q}}\), \({\textbf {K}}\) and \({\textbf {V}}\) by three linear transformations as

$$\begin{aligned} {\textbf {Q}} = {\textbf {H}} {\textbf {W}}_{\text {q}}, {\textbf {K}} = {\textbf {H}} {\textbf {W}}_{\text {k}}, {\textbf {V}} = {\textbf {H}} {\textbf {W}}_{\text {v}}, \end{aligned}$$
(2)

where \({\textbf {W}}_{\text {q}}, {\textbf {W}}_{\text {k}} \in \mathbb {R}^{d \times d_{\text {k}}}\) and \({\textbf {W}}_{\text {v}} \in \mathbb {R}^{d\times d_{\text {v}}}\). Then we calculate the output of MHSA by the scaled dot-product attention mechanism as

$$\begin{aligned} \text {MHSA} ({\textbf {H}}) = \textsc {Softmax} (\frac{{\textbf {Q}} {\textbf {K}}^T}{\sqrt{d_{\text {k}}}}){\textbf {V}}, \end{aligned}$$
(3)

where \(\sqrt{d_{\text {k}}}\) is the scaling factor. For simplicity, we use a single-head self-attention module in this description. The MHSA module is followed by the FFN module, and each of the two sub-modules is wrapped with a residual connection [8] and Layer Normalization (\(\textsc {LayerNorm}\)). The output of the l-th Transformer block is then given by

$$\begin{aligned} \begin{aligned} & {\textbf {H}}^l = \textsc {LayerNorm} (\text {FFN} (\widetilde{{\textbf {H}}}^l) + \widetilde{{\textbf {H}}}^l) \\ & \widetilde{{\textbf {H}}}^l = \textsc {LayerNorm} (\text {MHSA} ({\textbf {H}}^{l-1}) + {\textbf {H}}^{l-1}). \end{aligned} \end{aligned}$$
(4)

By stacking L Transformer blocks, we obtain the final output representation \({\textbf {H}}^L \in \mathbb {R}^{n \times d}\), which can be used as input for downstream tasks such as node classification and link prediction.
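The following minimal PyTorch sketch (our own, single-head, with illustrative module names) instantiates Eqs. (2)-(4) as one Transformer block.

```python
import math
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block of Eqs. (2)-(4): self-attention and FFN, each with residual + LayerNorm."""
    def __init__(self, d: int, d_ff: int):
        super().__init__()
        self.Wq = nn.Linear(d, d, bias=False)   # W_q
        self.Wk = nn.Linear(d, d, bias=False)   # W_k
        self.Wv = nn.Linear(d, d, bias=False)   # W_v
        self.ffn = nn.Sequential(nn.Linear(d, d_ff), nn.ReLU(), nn.Linear(d_ff, d))
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, H):                        # H: (n, d) token embeddings
        Q, K, V = self.Wq(H), self.Wk(H), self.Wv(H)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1)), dim=-1)  # Eq. (3)
        H_tilde = self.ln1(attn @ V + H)               # LayerNorm(MHSA(H) + H)
        return self.ln2(self.ffn(H_tilde) + H_tilde)   # LayerNorm(FFN(H~) + H~), Eq. (4)
```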

4 The Proposed Model

4.1 Overall Architecture

The overall framework of HetCAN is illustrated in Fig. 2. Given a heterogeneous graph G, we first adopt type-specific linear transformations to project nodes with different feature spaces into a shared feature space. The aligned embeddings serve as the initial node feature matrix and are fed into the type-aware encoder, where each node can simultaneously perceive the heterogeneity, feature and structural information within its neighborhood. After multiple type-aware layers, the hidden node embeddings are passed to the dimension-aware encoder, where latent feature interactions are modeled through a multi-head self-attention mechanism. Afterward, we concatenate the outputs of the type-aware encoder and the dimension-aware encoder to construct the updated node embeddings, which are also referred to as the outputs of each cascade block. HetCAN typically stacks N cascade blocks. Finally, we perform downstream tasks based on the normalized final node representations. In the following parts, we describe the type-aware encoder and the dimension-aware encoder in detail.
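As a high-level summary of this pipeline, the sketch below (our own, with illustrative callables rather than the authors' modules) shows how the type-specific projection, the N cascade blocks and the final concatenation could be composed.

```python
import torch
import torch.nn.functional as F

def hetcan_forward(x, node_types, edge_index, edge_types,
                   project, cascade_blocks, task_head):
    """Illustrative forward pass: type-specific projection, N cascade blocks, task head.

    `project`, each block's `type_aware` / `dim_aware`, and `task_head` are callables
    standing in for the modules described in Sects. 4.2 and 4.3.
    """
    h = project(x, node_types)                                   # Eq. (5): align feature spaces
    for block in cascade_blocks:                                 # N cascade blocks
        h_node = block.type_aware(h, node_types, edge_index, edge_types)   # L type-aware layers
        h_feat = block.dim_aware(h_node, node_types)                        # L_d dimension-aware layers
        h = torch.cat([h_node, h_feat], dim=-1)                  # Eq. (13): concatenate outputs
    return task_head(F.normalize(h, dim=-1))                     # normalized final representations
```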

Fig. 2. The overall framework of HetCAN. Each cascade block consists of L type-aware layers and \(L_d\) dimension-aware layers.

4.2 Type-Aware Encoder

In the type-aware encoder, we first introduce learnable type embeddings for both nodes and edges and integrate feature embeddings and type embeddings into a whole. Formally, we first initialize a node type matrix \(\textbf{M} \in \mathbb {R}^{|\mathcal {A}|\times d_t}\), where \(|\mathcal {A}|\) is the number of node types. For each node \(v_i\), its node type embedding \(\textbf{t}_i \in \mathbb {R}^{d_t}\) is obtained as \(\textbf{t}_i = \textbf{M}[\phi (v_i),:]\), and the type embeddings of all nodes are collected in \(\textbf{T} \in \mathbb {R}^{n \times d_t}\). As node features of different types in HINs usually lie in different feature spaces, we project them into a shared feature space before passing them to the type-aware encoder. Formally, the feature processing is

$$\begin{aligned} \textbf{h}_i = \textbf{W}_{\phi (v_i)}\textbf{x}_i + \textbf{b}_{\phi (v_i)}, \end{aligned}$$
(5)

where \(\textbf{W}_{\phi (v_i)} \in \mathbb {R}^{d\times d_x}\) is the learnable parameter matrix corresponding to node type \(\phi (v_i)\) and \(\textbf{b}_{\phi (v_i)} \in \mathbb {R}^{d}\) is an optional bias term. The resulting node feature embeddings are denoted by \(\textbf{H} \in \mathbb {R}^{n\times d}\). After that, to supplement the node type information, we apply a combination function to integrate node feature embeddings with node type embeddings as

$$\begin{aligned} \mathbf {\widetilde{H}} = \textsc {Combine}(\textbf{H}, \textbf{T}), \end{aligned}$$
(6)

where \(\textsc {Combine}(\cdot )\) can be any operator, such as a learnable function or a non-parametric function. In practice, we simply implement it with the Hadamard product, an element-wise operation. Based on [13], we then extend the attention mechanism with the integrated node embeddings that contain node type information. In this way, each type-aware layer calculates the attention coefficient between node \(v_i\) and node \(v_j\) as follows (the layer marker (l) is omitted for simplicity):

$$\begin{aligned} \alpha _{ij} = \frac{ \text {exp}\left( \sigma \left( \textbf{a}^T[\textbf{W}\mathbf {\tilde{h}}_i || \textbf{W}\mathbf {\tilde{h}}_j || \textbf{W}_r\textbf{r}_{\psi (v_i, v_j)}] \right) \right) }{ \sum _{k \in \mathcal {N}_i}{\text {exp}\left( \sigma \left( \textbf{a}^T[\textbf{W}\mathbf {\tilde{h}}_i || \textbf{W}\mathbf {\tilde{h}}_k || \textbf{W}_r\textbf{r}_{\psi (v_i,v_k)}] \right) \right) } }, \end{aligned}$$
(7)

where \(\textbf{r}_{\psi (v_i, v_j)} \in \mathbb {R}^{d_r}\) is the learnable edge type embedding w.r.t. the type of the edge between node \(v_i\) and node \(v_j\), \(\textbf{W}\) and \(\textbf{W}_r\) are learnable matrices, and \(\sigma \) is the \(\text {LeakyReLU}\) activation function. To stabilize the training process and improve performance, inspired by [8, 9], we employ a residual connection on the attention coefficients as

$$\begin{aligned} \hat{\alpha }_{ij}^{ (l)} = (1-\beta ) \alpha _{ij}^{ (l)} + \beta \hat{\alpha }_{ij}^{ (l-1)}, \end{aligned}$$
(8)

where \(\beta \in [0,1]\) denotes the attention residual weight. The normalized attention coefficients are then used to update the hidden node embedding \(\textbf{h}'_i\) for each node \(v_i \in \mathcal {V}\) as

$$\begin{aligned} \textbf{h}'_i = \sigma \left( \sum _{j\in \mathcal {N}_i}\hat{\alpha }_{ij}\textbf{W}\mathbf {\tilde{h}}_j\right) . \end{aligned}$$
(9)

To enhance the model's capacity and stabilize the learning process, we employ multi-head attention and average over the heads:

$$\begin{aligned} \textbf{h}'_i = \sigma \left( \frac{1}{K} \sum _{k=1}^{K}\sum _{j\in \mathcal {N}_i}\hat{\alpha }^{k}_{ij}\textbf{W}^{k}\mathbf {\tilde{h}}_j\right) , \end{aligned}$$
(10)

where K is the number of heads. Overall, with the type-aware encoder, the hidden node embeddings seamlessly fuse graph heterogeneity, node features and graph structure, giving them more expressive power.
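A minimal single-head sketch of one type-aware layer is given below; it is our own simplified rendering (dense per-node softmax loop, \(d_t = d\) so that Combine is a Hadamard product, and the attention residual of Eq. (8) and the multi-head averaging of Eq. (10) omitted), not the authors' implementation.

```python
import torch
import torch.nn as nn

class TypeAwareLayer(nn.Module):
    """Single-head sketch of Eqs. (6), (7) and (9): type-fused, edge-type-aware attention."""
    def __init__(self, d, d_r, num_node_types, num_edge_types):
        super().__init__()
        self.node_type_emb = nn.Embedding(num_node_types, d)    # matrix M with d_t = d
        self.edge_type_emb = nn.Embedding(num_edge_types, d_r)  # edge type embeddings r
        self.W = nn.Linear(d, d, bias=False)
        self.W_r = nn.Linear(d_r, d_r, bias=False)
        self.a = nn.Linear(2 * d + d_r, 1, bias=False)          # attention vector a
        self.act = nn.LeakyReLU(0.2)

    def forward(self, h, node_types, edge_index, edge_types):
        # h: (n, d); node_types: (n,); edge_index: (2, m) rows = (source j, target i); edge_types: (m,)
        h_tilde = self.W(h * self.node_type_emb(node_types))    # Combine = Hadamard product, Eq. (6)
        r = self.W_r(self.edge_type_emb(edge_types))
        src, dst = edge_index
        e = self.act(self.a(torch.cat([h_tilde[dst], h_tilde[src], r], dim=-1))).squeeze(-1)
        out = torch.zeros_like(h_tilde)
        for i in range(h.size(0)):                              # softmax over each node's neighbors, Eq. (7)
            mask = dst == i
            if mask.any():
                alpha = torch.softmax(e[mask], dim=0)
                out[i] = (alpha.unsqueeze(-1) * h_tilde[src[mask]]).sum(dim=0)  # Eq. (9)
        return torch.relu(out)
```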

4.3 Dimension-Aware Encoder

The success of the Transformer has demonstrated its outstanding capability of learning interactions among the tokens in a sequence. Motivated by this, we propose a dimension-aware encoder with a Transformer architecture to capture feature-level high-order information, which further enhances the expressiveness of node embeddings.

After acquiring the hidden embeddings \(\mathbf {H'}\in \mathbb {R}^{n\times d}\) from L type-aware layers, the dimension-aware encoder constructs an input sequence for each node to fit the Transformer architecture. Specifically, for each node \(v \in \mathcal {V}\), we expand its hidden embedding \(\textbf{h}'_v \in \mathbb {R}^d\) into a sequence \(\mathbf {\hat{h}}'_v \in \mathbb {R}^{d \times 1}\), treating each dimension (or group of dimensions) as a token represented by a one-dimensional (or multi-dimensional) vector. We then perform multi-head self-attention on each input sequence to learn the interactions among its tokens.

Besides, to learn distinct feature interaction patterns for different node types, inspired by the positional encoding in the Transformer, we introduce the node type embeddings \(\textbf{T} \in \mathbb {R}^{n \times d_t}\) as a type encoding and combine them with the hidden node embeddings \(\textbf{H}'\) before applying the attention mechanism. We denote this step as

$$\begin{aligned} \begin{aligned} &\mathbf {\hat{H}} = \textsc {Combine} (\mathbf {H'}, \textbf{T}) \\ &\mathbf {\hat{H}'} = \textsc {Expand} (\mathbf {\hat{H}}) \end{aligned} \end{aligned}$$
(11)

where \(\mathbf {\hat{H}'} \in \mathbb {R}^{n\times d \times 1}\) denotes the constructed sequences of all nodes, as illustrated in the upper right of Fig. 2. Similar to the type-aware encoder, \(\textsc {Combine}(\cdot )\) is implemented with the Hadamard product by simply setting \(d_t = d\), so that the shape of \(\mathbf {\hat{H}}\) remains consistent with that of \(\mathbf {H'}\). With the type encoding, the dimension-aware encoder can distinguish node types and learn unique interaction patterns for them. Then, we perform multi-head self-attention on the input sequences \(\mathbf {\hat{H}'}\) and obtain the output of each dimension-aware layer as

$$\begin{aligned} \mathbf {\overline{H}}= \textsc {Mhsa} (\mathbf {\hat{H}'}), \end{aligned}$$
(12)

where \(\textsc {Mhsa}(\cdot )\) denotes multi-head self-attention (cf. Sect. 3.3) and \(\mathbf {\overline{H}} \in \mathbb {R}^{n\times d}\) contains rich feature-level information. Finally, we concatenate the output of the dimension-aware encoder \(\mathbf {\overline{H}}\) with that of the type-aware encoder \(\mathbf {H'}\) to construct the final node representations as

$$\begin{aligned} \textbf{H}_{\text {f}} = \mathbf {H'} \parallel \mathbf {\overline{H}}, \end{aligned}$$
(13)

where \(\textbf{H}_{\text {f}} \in \mathbb {R}^{n \times 2d}\) is the output of a one-layer cascade block. For simplicity, we only illustrate one cascade block. After one or multiple cascade blocks, we obtain enhanced node representations with greater expressive power and use them for various downstream tasks.
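To illustrate Eqs. (11)-(13), the sketch below (our own, single head, \(d_t = d\), each dimension treated as a one-dimensional token) applies self-attention across the dimensions of every node embedding and concatenates the result with the type-aware output.

```python
import math
import torch
import torch.nn as nn

class DimensionAwareLayer(nn.Module):
    """Sketch of Eqs. (11)-(13): self-attention across the dimensions of each node embedding."""
    def __init__(self, d, num_node_types, d_head=8):
        super().__init__()
        self.node_type_emb = nn.Embedding(num_node_types, d)   # type encoding with d_t = d
        self.Wq = nn.Linear(1, d_head, bias=False)             # each token is a single scalar dimension
        self.Wk = nn.Linear(1, d_head, bias=False)
        self.Wv = nn.Linear(1, 1, bias=False)

    def forward(self, h, node_types):
        # h: (n, d) hidden embeddings H' from the type-aware encoder
        h_hat = h * self.node_type_emb(node_types)              # Combine, Eq. (11)
        tokens = h_hat.unsqueeze(-1)                            # Expand: (n, d, 1), one token per dimension
        Q, K, V = self.Wq(tokens), self.Wk(tokens), self.Wv(tokens)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1)), dim=-1)
        h_bar = (attn @ V).squeeze(-1)                          # (n, d), Eq. (12)
        return torch.cat([h, h_bar], dim=-1)                    # Eq. (13): H_f = H' || H_bar
```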

4.4 Time Complexity Analysis

In this subsection, we analyze the time complexity of the proposed components in HetCAN. Let \(|{\mathcal {V}}|\) and \(|{\mathcal {E}}|\) be the numbers of nodes and edges, d the dimension of both node feature embeddings and node type embeddings, and \(d_r\) the dimension of edge type embeddings. For each type-aware layer, the time complexity of a single attention head is \(O(|{\mathcal {V}}|\times d^2+|{\mathcal {E}}| \times {d_r}^2 + |{\mathcal {E}}| \times (2d + d_r))\). For each dimension-aware layer, the time complexity of a single attention head is \(O(|{\mathcal {V}}|\times d^2)\). Thus, the overall time complexity of HetCAN is linear in both the number of nodes \(|{\mathcal {V}}|\) and the number of edges \(|{\mathcal {E}}|\). The efficiency studies of our model are shown in Fig. 4.
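Composing these per-head, per-layer costs (our own back-of-the-envelope summary, not stated explicitly in the original), a model with N cascade blocks, each containing L type-aware layers and \(L_d\) dimension-aware layers, costs roughly

$$\begin{aligned} O\Big (N\big (L\,(|\mathcal {V}|d^2 + |\mathcal {E}|{d_r}^2 + |\mathcal {E}|(2d+d_r)) + L_d\,|\mathcal {V}|d^2\big )\Big ), \end{aligned}$$

and, up to a constant factor from the number of attention heads K, this remains linear in \(|\mathcal {V}|\) and \(|\mathcal {E}|\) for fixed d, \(d_r\), L, \(L_d\) and N.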

5 Experiments

We evaluate HetCAN by conducting extensive experiments on node classification and link prediction, comparing it with various competitive approaches, including plain homogeneous GNNs, metapath-based HGNNs and metapath-free HGNNs. In addition, to further investigate the superiority of our model, we conduct three studies: an ablation study, an efficiency study and a parameter study. The source code and datasets are available at https://github.com/zzyzeyuan/HetCAN.

5.1 Experimental Setups

Datasets. For node classification, we test our model on five public datasets: DBLP, IMDB, ACM and Freebase from [13], and OGB-MAG from [10]. For link prediction, we test our model on three public datasets from [13]. The Heterogeneous Graph Benchmark (HGB) standardizes the processing pipeline for fair comparison, so we follow its pipeline in our experiments. For datasets without node features, we assign one-hot or all-one vectors as features to denote node existence. The statistics of all datasets are summarized in Table 1.

Table 1. Statistics of all datasets.

Baselines. To comprehensively evaluate our proposed model against state-of-the-art methods, we select a collection of baselines, including basic models (GCN [12], GAT [23], Transformer [22]), metapath-based models (RGCN [16], HAN [26], HetGNN [31], MAGNN [5], SeHGNN [29]) and metapath-free models (HGT [11], Simple-HGN [13], HINormer [14]). Specifically, as none of the baselines utilize extra label embeddings, we report the results of SeHGNN without extra label embeddings.

Settings. For the datasets from HGB, we follow the split proportion of 24:6:70 for the training, validation and test sets, respectively. For the OGB-MAG dataset, we train on papers published until 2017, validate on those published in 2018, and test on those published since 2019. We evaluate classification performance with Micro-F1 and Macro-F1 on the HGB datasets and with accuracy on OGB-MAG. Following HGB, we use ROC-AUC (area under the ROC curve) and MRR (mean reciprocal rank) to evaluate link prediction performance. Since our experimental setup is consistent with HGB and OGB, we directly quote the results reported in the HGB and OGB leaderboards. For results not available there, we run the experiments with the original experimental setups.

Table 2. Experiment results on four HGB datasets. The best result is bolded and the runner-up is underlined. The error bar (±) denotes the standard deviation of the results over five runs. “–” denotes that the model runs out of memory.

5.2 Node Classification

Tables 2 and 3 summarize experimental results on node classification over five runs. From the tables, we observe that:

(1) The plain models, i.e., GCN, GAT and Transformer, perform well on all datasets when using the proper inputs from HGB, indicating that the preprocessing of input node features has a great impact on model performance.

(2) Compared to the plain models mentioned above, SeHGNN and HINormer demonstrate superior performance, with SeHGNN being the best among metapath-based models and HINormer the best among metapath-free models. By using predefined metapaths, SeHGNN exploits semantic information to boost performance. HINormer samples a fixed-length sequence for each node and designs an additional heterogeneous relation encoder, which enlarges the receptive field of each node and also models the heterogeneity.

(3) HetCAN achieves superior results across all HGB datasets, demonstrating its ability to generalize to datasets with varying degrees of heterogeneity. We attribute the generalization ability of our model to the Cascade Block, which allows us to simultaneously learn node-level and feature-level information. This enables the node representations to have more powerful expressive capabilities, thereby boosting both node classification and link prediction tasks.

Table 3. Experiment results on the large-scale dataset OGB-MAG. \(*\) denotes metapath-free models. The best result is bolded and the runner-up is underlined.

(4) From the results on the large-scale dataset OGB-MAG (see Table 3), we observe that HetCAN outperforms all metapath-free competitors. This indicates that our method further narrows the gap between metapath-free and metapath-based models on large-scale datasets. In addition, the efficiency studies on three datasets shown in Fig. 4 demonstrate that our method is faster than SeHGNN, the winner on the OGB-MAG dataset. In particular, as shown in Fig. 4(c), our model converges much faster than SeHGNN in scenarios with a large number of edge types (39 edge types in Freebase). This is because metapath-based methods must aggregate information along metapaths, and this inherent property results in slower training and convergence.

5.3 Link Prediction

Table 4 summarizes the results on the downstream link prediction task over five runs. Based on this table, we observe that:

(1) Our method HetCAN consistently outperforms all advanced methods on both the ROC-AUC and MRR metrics. In particular, we achieve significant improvements on the Amazon and LastFM datasets. This indicates that our method learns more expressive node representations with the cascade structure, which also benefits the link prediction task.

(2) Compared to Simple-HGN, the runner-up on link prediction, our method achieves better performance. Our method introduces both node-level and feature-level high-order information through the cascade block, whereas Simple-HGN only uses learnable type embeddings to compensate for graph heterogeneity and ignores high-order feature interactions.

5.4 Model Analysis

Ablation Studies. To validate the effectiveness of the proposed components, we conduct ablation studies on four datasets by comparing HetCAN with two variants: (1) we remove the node type embeddings and replace them with all-one vectors, denoted by w/o Type-encoder; (2) we remove the dimension-aware encoder, denoted by w/o Dim-encoder. We report the results in Fig. 3 and make the following observations.

Table 4. Experiment results on link prediction. The best result is bolded and the runner-up is underlined. The error bar (±) denotes the standard deviation of the results over five runs. “–” denotes that the results are not available due to the lack of metapaths on those datasets.
Fig. 3. Ablation studies.

First, without the type-aware encoder, HetCAN fails to consider node types when performing the attention mechanism, resulting in degraded classification performance. Compared to the other datasets, the absence of node type embeddings has a more prominent impact on Freebase, which has more node types, indicating that explicitly introducing learnable node type embeddings compensates for the missing type information and benefits the model's performance. On the other hand, the improvements on some datasets are not as significant as on Freebase, so how to further exploit the underlying semantic relations between nodes remains a promising direction that we will investigate in future work.

Second, without the dimension-aware encoder, HetCAN fails to capture latent feature interactions, resulting in a significant drop in performance. We also notice that the absence of the dimension-aware encoder has a larger impact on Macro-F1 scores than on Micro-F1. In particular, on Freebase the Macro-F1 drops by 2.88% with a larger standard deviation, indicating that the dimension-aware encoder benefits the robustness of our model.

Fig. 4. Efficiency study: the x-axis shows the training time and the y-axis the Micro-F1 score on the validation set.

Efficiency Studies. To assess the efficiency of HetCAN, we compare the training times of several advanced methods in the same experimental environment, using the hyper-parameters corresponding to each method's optimal performance. The results are illustrated in Fig. 4. Specifically, on IMDB, HetCAN converges in around 10 s, while SeHGNN and HGT take more than 30 s, indicating that our model is as efficient as other metapath-free methods and significantly faster than SeHGNN and HGT. On Freebase, HetCAN reaches its optimal performance in around 20 s, while HINormer and Simple-HGN approach their optimal state in around 40 s. This also demonstrates the efficiency and robustness of HetCAN on information networks with a greater variety of node and edge types. Surprisingly, SeHGNN takes approximately 500 s, about 20 times as long as our model, to converge to its optimal state, which demonstrates the superiority and flexibility of being free from predefined metapaths.

Fig. 5. Parameter comparison. The numbers below the model names represent the ratio of the total number of parameters relative to GAT; for example, “1.24” below HGT means its total parameters are 1.24 times those of GAT.

Parameter Studies. We experiment on DBLP (Fig. 5) to compare HetCAN's total parameter count with that of other competitors, using the hyper-parameters corresponding to the optimal performance of each model. We observe that SeHGNN achieves its peak performance with a large hidden size (512), which leads to slower convergence. In contrast, HetCAN achieves state-of-the-art performance with an affordable number of parameters, ensuring both efficiency and effectiveness.

Fig. 6. Hyper-parameter sensitivity studies.

We also examine the sensitivity of hyper-parameters, including the number of type-aware layers (L), the number of dimension-aware layers (\(L_d\)), and the hidden size (d). The results are depicted in Fig. 6. We consistently observe strong performance across a wide range of \(L_d\) values on both DBLP and IMDB. The impact of the hidden size d is more significant on IMDB than on DBLP. Increasing L initially improves performance, but further increases eventually lead to a decline, indicating that overly deep stacks of type-aware layers can be harmful.

6 Conclusion

In this paper, we investigate the problem of exploiting graph heterogeneity and high-order feature information. To this end, we propose HetCAN, which is composed of multiple cascade blocks, each comprising multiple type-aware layers and dimension-aware layers. The type-aware encoder seamlessly integrates node types with node features to comprehensively leverage graph heterogeneity, while the dimension-aware encoder attends to latent interactions among node features, exploiting the high-order information inherent in such interactions through a Transformer architecture. Extensive experiments and studies demonstrate the superiority, efficiency and robustness of the proposed HetCAN.