1 Introduction

Graphs can encode complex geometric structures that lie in the non-Euclidean domain. They can be studied with powerful mathematical tools [1] and have become ubiquitous. For example, in e-commerce, making accurate recommendations requires exploiting the interactions between users and products [2, 3]. In chemistry, new drugs can be discovered with graph-based learning methods that model molecules as graphs [4]. In citation networks, papers can be categorized into different groups through their citation graphs [5, 6]. Moreover, a large amount of real-world data is unlabeled, and labeling it is often impractical and time-consuming. In the semi-supervised setting, only a small amount of labeled data is available to train the model. Consequently, analyzing graphs in this setting is often crucial, and the key issue is to make the most effective use of the feature information of the unlabeled data [7].

As an approach to graph analysis, graph neural networks (GNNs) are closely related to graph embedding. Graph embedding [8] is a family of methods that learn to represent graph nodes as low-dimensional vectors. Approaches such as word embedding [9], DeepWalk [10], and node2vec [11] have achieved breakthroughs. However, because they are unsupervised algorithms, they cannot be trained in an end-to-end fashion. Early GNNs employed a recursive mechanism. Gori et al. [12] and Scarselli et al. [13] first introduced neural network operations on graphs, repeatedly applying a propagation function until the node states reach equilibrium; as a result, these models suffer from a high computational cost. This problem was alleviated by Li et al. [14], who proposed using gated recurrent units in the propagation step. As deep learning has achieved great success on Euclidean data [15, 16], a more promising direction is to generalize convolution to the graph domain. However, existing deep learning algorithms cannot be applied directly to irregular graph data. Specifically, each graph has a different number of nodes, and each node has a different number of neighbors, so important operations such as convolution cannot be applied directly. Furthermore, the core assumption that instances are independent of each other does not hold in non-Euclidean space [17, 18].

Graph convolutional networks (GCNs) [19] generally fall into two categories: spatial-based and spectral-based. Spatial-based methods construct a new feature vector for each node from its neighborhood information. Spectral-based methods define convolution by decomposing the graph signal in the spectral domain and applying a spectral filter to the spectral components [20]. However, because the learned filters depend on the graph structure, a trained model cannot be directly applied to another graph. Consequently, spatial-based methods have become increasingly popular owing to their ability to work across different graphs and to share weights. More recently, graph attention networks (GATs) [5] have been developed by introducing attention mechanisms into GCNs. They go a step further by making the operator focus on the most relevant parts of the input.

In the present work, a unique architecture, the hierarchical graph attention network (HGAT), was proposed for semi-supervised node classification on graphs. Our work was based on GAT, introduced by Veličković et al. [5]. We applied a hierarchical mechanism in our model, which could increase the receptive field of nodes and transfer node features more effectively. The HGAT consists of an input layer, a hierarchical layer, and a prediction layer. Specifically, every level in the hierarchical layer comprises two symmetrical operations: a coarsening operation and a refining operation. After the coarsening operation, we obtained a smaller graph with hyper-nodes, which could reflect the local structure and, when the operation is stacked, help to exploit global information. Before the refining operation, we concatenated the node representations of the coarsened graph with the refined node representations of the next level. The refining operation was used to recover the graph structure of the previous level. Such a model can capture node information from the most relevant neighbors, leading to better node representations. The experiments show that, to the best of our knowledge, our method had the best overall performance across different citation and knowledge graph datasets.

There are three main contributions of our work. First, for the first time, the attention mechanism was combined with hierarchical representation learning on graphs, which helps capture global structure information. Second, the coarsening and refining operations based on contraction sets were defined. Third, our method achieved better performance than previous work on the semi-supervised node classification task. Notably, the parameter sensitivity analysis showed that the proposed HGAT obtains a larger receptive field and transfers node features more effectively. All the experiment code will be available at https://github.com/LeeKangjie after the review process.

The remaining sections are organized as follows. A review of the related work is given in Section 2. The methodology adopted in this paper is described in Section 3. Section 4 provides a comparative experiment and analysis. Finally, we summarize the paper in Section 5.

2 Related work

In this section, previous work on hierarchical representation learning on graphs and on graph convolutional networks is reviewed.

2.1 Hierarchical representation learning on graph

Several works have been proposed to capture global information using hierarchical models. Chen et al. [21] propose hierarchical representation learning for networks (HARP). It works by finding a smaller graph that approximates the global structure, using an embedding method to learn the initial representation, and then inductively embedding the hierarchy of graphs from the coarsest one back to the original graph. Similarly, Liang et al. [22] use a hybrid matching technique to maintain the backbone structure of the graph; they apply existing embedding methods on the coarsest graph and refine the embedding back to the original graph. However, these two methods are both unsupervised and do not use node features. Moreover, they are both designed for large-graph embedding, which introduces a considerable computational overhead and makes them unsuitable for the node classification task considered here. Hu et al. [23] propose hierarchical graph convolutional networks (HGCN) to increase the receptive field. HGCN first repeatedly assigns nodes with similar structures to a hyper-node and then refines the representation of each node. However, each hierarchical level includes a GCN operation, which suffers from the Laplacian smoothing problem. Ying et al. [24] propose a differentiable graph pooling module that generates hierarchical representations of graphs and can be combined with numerous graph neural network architectures in an end-to-end fashion. However, it is oriented toward link prediction and graph classification and therefore cannot be directly applied to node classification tasks. Lv et al. [25] propose ant colony based multi-level network embedding (ACE) to preserve the features of hierarchical clustering structures. It coarsens the graph with an ant-colony-based algorithm, and the embedding vectors are then generated from multiple layers of the coarsened graph. The idea is instructive, but the final vector can be very large, and the method cannot handle node classification tasks either. In conclusion, hierarchical modeling is a useful technique that can yield better results when used appropriately.

2.2 Graph neural networks for semi-supervised learning

Advances in graph convolutional networks are generally categorized into spectral approaches and spatial approaches. The spectral approaches have been successfully applied to node classification tasks. In the work of Bruna et al. [20], the eigendecomposition of the graph Laplacian is computed and the convolution operation is defined on top of it, setting a precedent for graph convolutional networks. Inspired by this, Henaff et al. [26] introduce a parameterization of the spectral filters with smooth coefficients to localize them in space. Because computing the Laplacian eigenvectors is computationally inefficient, Defferrard et al. [27] propose a method to avoid it: they build a K-localized ChebNet by approximating the spectral filters with Chebyshev polynomials up to the Kth order. Later, Kipf et al. [6] simplify this model by restricting the filters to the 1st order, which amounts to using 1-step neighborhood information. Building on Kipf's work, Zhuang et al. [28] propose a method that jointly considers global consistency and local consistency. However, all spectral-based methods depend on the Laplacian eigenbasis of the given graph, which is not portable, and the computation is difficult to parallelize. These characteristics limit their development to some extent. As for spatial approaches, the main challenge is to define an operator that can work with different-sized neighborhoods while maintaining the weight-sharing property. To achieve this, Duvenaud et al. [29] present a method that learns a specific weight matrix for each node degree in a molecular feature extraction task. Atwood et al. [30] present a model for node classification based on diffusion-based representations; it defines the neighborhood using a transition matrix while learning weights for each neighborhood degree and each input channel. Niepert et al. [31] then present a general approach that extracts and normalizes locally connected regions to learn convolutional neural networks for arbitrary graphs. After that, Hamilton et al. [32] introduce GraphSAGE, a general inductive framework for generating node embeddings that samples the local neighborhood of each node and applies aggregators over it. In addition, to enable better structure-aware representations, Xu et al. [33] explore a jumping knowledge network (JK-Net) that leverages different neighborhood ranges for each node. Based on GCNs, Veličković et al. [5] propose GAT by introducing the attention mechanism. The architecture leverages masked self-attentional layers to assign different weights to different nodes, and by stacking layers, nodes can also attend over their neighborhoods' features. It achieves state-of-the-art results on established benchmarks and allows for dealing with variable-sized inputs. Owing to these benefits, the attention mechanism is employed in our method.

3 Methodology

In this section, we define the notation used to describe the HGAT architecture. We then decompose the architecture into an input layer, a hierarchical layer, and a prediction layer, and explain the implementation details.

3.1 Preliminaries

An undirected graph is represented by G = (V,E), where V is the set of nodes (also known as vertices) and E is the set of edges. The notation |·| denotes the number of elements in a set. The notations vi and ei represent the ith node and the ith edge, respectively. The adjacency matrix is represented by A = [aij], which is nonnegative. The notation D = diag(d1,d2,...,dn) is the degree matrix of A, where \(d_{i}=\sum _{j}a_{ij}\) is the degree of vi. The undirected graph is associated with a node representation matrix \(H \in \mathbb{R}^{n\times f}\) (also known as node features), where n is the total number of nodes and f is the node feature dimension.
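To make this notation concrete, the following minimal NumPy sketch (our own illustration, not part of the original paper) builds A, D, and H for a small toy graph.

```python
import numpy as np

# A toy undirected graph with n = 4 nodes and f = 3 features per node.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n, f = 4, 3

A = np.zeros((n, n))              # adjacency matrix A = [a_ij], nonnegative
for i, j in edges:
    A[i, j] = A[j, i] = 1.0       # undirected graph: A is symmetric

d = A.sum(axis=1)                 # d_i = sum_j a_ij
D = np.diag(d)                    # degree matrix D = diag(d_1, ..., d_n)

H = np.random.rand(n, f)          # node representation matrix H in R^{n x f}
```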

3.2 HGAT architecture

As shown in Fig. 1, the workflow of our HGAT architecture is divided into three parts: the input layer, the hierarchical layer, and the prediction layer. Specifically, the hierarchical layer includes l levels, each of which consists of two symmetrical calculations: a coarsening operation and a refining operation. We employed GAT as the input layer, combined with a multi-head attention mechanism [5] to stabilize the learning process. It takes the graph and the node representation H0 as input and outputs the node representation H1. In the hierarchical layer, taking the ith level as an example, the coarsening operation derives a coarsened graph Gi+1 and node representation matrix Hi+1, which will be fed into the next level. We then concatenated Hi+1 with the refined node representation matrix from the next level, resulting in \(H^{*}_{i+1}\). By feeding \(H^{*}_{i+1}\) into the refining operation, we finally obtained the ith-level node representation matrix. After that, in order to classify each node, we employed a softmax classifier after the GAT in the prediction layer, outputting the class of each node in one-hot encoding. An example of a two-level hierarchical layer is shown in Fig. 2. After each coarsening operation, the graph becomes smaller; before the concatenation, the smaller graph needs to be refined first.

Fig. 1

The architecture of our method. From left to right, there are three layers: input, hierarchical, and prediction. The input layer employs multi-head GAT to learn the node representations. The hierarchical layer includes l levels, each of which has symmetric coarsening and refining operations. A softmax classifier is employed after the GAT in the final prediction layer

Fig. 2

Example of a two-level hierarchical layer. From left to right, the coarsened graphs appear with the corresponding node representations after the coarsening operations. Then, through iterative refining operations and concatenations, we obtain the final output

We employed the hierarchical mechanism in the architecture of our model because it plays a key role in increasing the receptive field and thus helps improve classification accuracy. At each hierarchical level, after the graph coarsening operation, the refining operation was used to help recover the graph structure. This leads to an effective transfer of node features to the most relevant nodes. Through iterative refining operations, a node can receive information from farther away. More details can be found in the description of the hierarchical layer.

As for the running time of our algorithm, it is basically the same as that of GAT. GAT needs to compute attention coefficients between neighboring nodes, which involves learnable parameters. In the hierarchical layer of our method, each level only requires a matrix multiplication, as shown in the subsequent section, with no parameters to be learned. As a result, our method can improve accuracy without a significant sacrifice of efficiency compared to GAT.

3.2.1 Input layer

Because of the promising performance of GAT, we employed attention mechanisms in the input layer, together with a multi-head mechanism to stabilize the learning process. The layer takes the initial graph G0 and node representation H0 as input, and the output node representation matrix H1 is calculated as:

$$ {H_{1}=\bigcup\limits_{k=1}^{K}\sigma(\alpha^{k}W^{k}H_{0})} $$
(1)

where ∪ denotes concatenation, K is the number of attention heads, αk is a normalized attention coefficient matrix, Wk is the corresponding transformation matrix [5], and σ(·) is the ELU nonlinearity.
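As an illustration of (1), the sketch below performs the multi-head aggregation with dense NumPy operations under a row-wise feature convention; the attention matrices αk are taken as given rather than computed by the learned, neighborhood-masked attention mechanism of GAT, and all function and variable names are ours.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(x) - 1.0)

def input_layer(H0, alphas, Ws):
    """Dense illustration of (1): H1 is the concatenation over K heads of
    sigma(alpha_k (H0 W_k)).

    alphas: list of K normalized (n x n) attention matrices (in GAT these come
            from a learned self-attention mechanism masked by the graph);
    Ws:     list of K (f x f') transformation matrices.
    """
    heads = [elu(alpha @ (H0 @ W)) for alpha, W in zip(alphas, Ws)]
    return np.concatenate(heads, axis=1)     # multi-head concatenation
```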

3.2.2 Hierarchical layer

The hierarchical layer involves l levels, as shown in Fig. 1, each level consisting of a coarsening operation and a refining operation. The hierarchical layer is based on the assumption that nodes with similar connections are likely to share their features with each other.

Coarsening is a type of graph reduction that interprets the graph transformation using a set of constraints. We first express a surjective map from the node set Vi to Vi+1 as φi, and then define the set of nodes \({V^{r}_{i}}\) of Vi mapped onto the same node vr of Vi+1 as a contraction set, as formulated in (2). We conducted the equivalent structure selection and the similar structure selection one after another to construct all the contraction sets.

$$ {{V_{i}^{r}}=\{v\in V_{i} :\varphi_{i} (v)=v_{r} \}} $$
(2)

During the equivalent structure selection, we selected nodes having the same neighbors to form contraction sets; that is, nodes are equivalent if their corresponding rows in the adjacency matrix are identical. Each set corresponds to a hyper-node in the coarsened graph. For instance, in the example shown in Fig. 3, nodes B and C were selected to form a contraction set.
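A minimal sketch of the equivalent structure selection (our own illustration, not the authors' code): nodes whose adjacency rows are identical are grouped together, and we assume here that only groups of two or more nodes become contraction sets at this stage, leaving the remaining nodes to the similar structure selection described next.

```python
import numpy as np
from collections import defaultdict

def equivalent_structure_selection(A):
    """Group nodes whose adjacency rows are identical into contraction sets."""
    groups = defaultdict(list)
    for v in range(A.shape[0]):
        groups[tuple(A[v])].append(v)     # identical row -> same group key
    # assumption: singleton groups are left for the similar structure selection
    return [g for g in groups.values() if len(g) > 1]
```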

Fig. 3

Example of graph coarsening from G1 to G2. From left to right, we performed the equivalent structure selection and the similar structure selection one after another to construct the contraction sets. From top to bottom, we used the contraction sets to derive the coarsened graph G2. The adjacency matrix values of G2, calculated directly using (6), are shown on the edges

Then, we performed the similar structure selection. Inspired by heavy edge matching [34], the connection strength between nodes vj and vk is defined as in (3).

$$ {s_{i}(v_{j} , v_{k})=\frac{A_{i}(v_{j},v_{k})}{\sqrt{D_{i}(v_{j})D_{i}(v_{k})}}} $$
(3)

where Ai and Di are the adjacency matrix and degree matrix of graph Gi, respectively.

Because nodes with fewer neighbors have less chance of being selected into a contraction set, we sorted all the nodes outside the contraction sets in ascending order of degree to give such nodes higher selection priority. If nodes had the same degree, we sorted them in ascending order of their row number in the adjacency matrix. We iteratively picked a node vj outside the contraction sets and calculated the connection strength between it and all of its neighbors outside the contraction sets. We then selected the pair with the largest connection strength as a new contraction set. In particular, if a node had no neighbors outside the contraction sets, we selected it alone as a contraction set. Eventually, every node is assigned to a contraction set. As the example in Fig. 3 shows, after sorting the degrees of nodes A, D, E, F, and G, we first picked node G. Because it only had one neighbor, F, we selected nodes F and G to form a contraction set. We then picked the remaining node with the minimum degree, namely node D. After calculating the connection strengths, we selected the pair (D, E) as a contraction set, since it had the largest value. Finally, because the single node A had no neighbors outside the contraction sets, it was selected as a contraction set by itself.
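The following sketch gives our reading of this procedure (it is not the authors' code); it takes the contraction sets produced by the equivalent structure selection as input, and all names are ours.

```python
import numpy as np

def similar_structure_selection(A, existing_sets):
    """One reading of the similar structure selection described above.

    Nodes not yet covered by a contraction set are visited in ascending order
    of degree (ties broken by row index). Each visited node is paired with the
    uncovered neighbor of largest connection strength (3), or kept alone if no
    uncovered neighbor exists.
    """
    d = A.sum(axis=1)
    sets = [list(s) for s in existing_sets]
    covered = {v for s in sets for v in s}
    order = sorted((v for v in range(A.shape[0]) if v not in covered),
                   key=lambda v: (d[v], v))
    for v in order:
        if v in covered:
            continue
        nbrs = [u for u in np.nonzero(A[v])[0] if u not in covered and u != v]
        if nbrs:
            # connection strength s(v, u) = A[v, u] / sqrt(d_v * d_u)
            u = max(nbrs, key=lambda u: A[v, u] / np.sqrt(d[v] * d[u]))
            sets.append([v, int(u)])
            covered.update((v, int(u)))
        else:
            sets.append([v])              # singleton contraction set
            covered.add(v)
    return sets
```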

To obtain the coarsened graph Gi+1 from Gi, including its node representation matrix Hi+1 and adjacency matrix Ai+1, we defined the contraction matrix Mi based on all the contraction sets.

$$ M_{i}(r,h)=\begin{cases} \frac{1}{|{V^{r}_{i}}|}, & \text{if } v_{h} \in {V^{r}_{i}}, \\ 0, & \text{otherwise}. \end{cases} $$
(4)

where r indexes the contraction sets, h indexes the nodes in Vi, and \({V^{r}_{i}}\) is a contraction set in Gi.
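As a small sketch (function name is ours), the contraction matrix of (4) can be built directly from the contraction sets:

```python
import numpy as np

def contraction_matrix(sets, n):
    """Build M_i of (4): M_i[r, h] = 1 / |V_i^r| if node h is in set r, else 0."""
    M = np.zeros((len(sets), n))
    for r, nodes in enumerate(sets):
        M[r, nodes] = 1.0 / len(nodes)
    return M
```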

The node representation matrix Hi+1 and adjacency matrix Ai+1 of Gi+1 are determined by (5) and (6), respectively.

$$ H_{i+1}=M_{i}H_{i} $$
(5)
$$ A_{i+1}=M_{i}A_{i}{M^{T}_{i}} $$
(6)

The graph coarsening operation captures the global structure while neglecting some details. As the hierarchical level increases, the number of nodes decreases; at each level, the coarsening reduces the size of the graph by an approximate factor of two, which offers control over the size. We then introduced the refining operation to help recover the graph structure. Refinement means calculating the node representation of the current graph from the node representation of its coarsened graph, as illustrated in (7). It spreads the averaged hyper-node features back to their constituent nodes, leading to an effective transfer of node features among the most relevant nodes. Through iterative refining operations, a node can receive information from farther away, that is, obtain a larger receptive field.

$$ H_{i+1}=M_{2l+1-i}^{T} H_{i} $$
(7)

where Hi+1 is the refined node representation and Hi is the node representation before refinement, with indices following the overall numbering H1, ..., H2l+1.
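Given a contraction matrix M, the coarsening operations (5)-(6) and the refining operation (7) reduce to plain matrix products. The sketch below (ours) makes the index bookkeeping of (7) implicit by simply passing the contraction matrix of the matching level.

```python
import numpy as np

def coarsen(M, H, A):
    """Equations (5) and (6): pool node features and contract the adjacency matrix."""
    return M @ H, M @ A @ M.T

def refine(M, H_coarse):
    """Equation (7): spread each hyper-node representation back to its member
    nodes using the transpose of the contraction matrix of that level."""
    return M.T @ H_coarse
```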

Algorithm 1 summarizes the steps of the hierarchical layer calculation. The hierarchical layer takes H1 as input and finally outputs H2l+1.

Algorithm 1 Hierarchical layer calculation
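Since the typeset algorithm box is not reproduced here, the following Python sketch gives our paraphrase of the hierarchical layer, reusing the hypothetical helpers sketched above; the order of concatenation and refinement follows our reading of the description in Section 3.2 and is not guaranteed to match the original implementation exactly.

```python
import numpy as np

def hierarchical_layer(A1, H1, levels, build_sets, build_M):
    """Paraphrase of Algorithm 1: coarsen `levels` times, then refine back,
    concatenating coarse and refined representations before each refinement.

    build_sets(A) -> contraction sets (equivalent + similar structure selection);
    build_M(sets, n) -> contraction matrix as in (4).
    """
    feats, adjs, Ms = [H1], [A1], []
    for _ in range(levels):                      # coarsening: G_1 -> ... -> G_{l+1}
        sets = build_sets(adjs[-1])
        M = build_M(sets, adjs[-1].shape[0])
        Ms.append(M)
        feats.append(M @ feats[-1])              # (5)
        adjs.append(M @ adjs[-1] @ M.T)          # (6)

    H_ref = Ms[-1].T @ feats[-1]                 # deepest refinement, (7)
    for i in reversed(range(levels - 1)):        # remaining levels, coarse -> fine
        H_star = np.concatenate([feats[i + 1], H_ref], axis=1)
        H_ref = Ms[i].T @ H_star                 # (7) applied to H*_{i+1}
    return H_ref                                 # plays the role of H_{2l+1}
```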

3.2.3 Prediction layer

Finally, we applied a softmax classifier after GAT to make the prediction.

$$ H^{out}=softmax(\frac{1}{K}\sum\limits_{k=1}^{K}\alpha^{k}W^{k}H^{*}_{1}) $$
(8)

where \(H^{out} \in \mathbb {R}^{|V|\times |Y|}\) contains the predicted probabilities of each node belonging to each class yi ∈ Y, and \(H^{*}_{1}\) is the concatenation of the node representations H1 and H2l+1.

To train the proposed model for classification, the cross-entropy error on the labeled nodes was defined:

$$ {L=-\sum\limits_{i\in Y_{L}} \sum\limits_{j=1}^{|Y|}Z_{ij} \log H^{out}_{ij}} $$
(9)

where YL denotes the indices of the labeled nodes and \(Z \in \mathbb {R}^{|V|\times |Y|}\) is the label mask matrix: Zij is 1 if node i belongs to class j and 0 otherwise.
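A minimal NumPy sketch of the prediction layer (8) and the masked cross-entropy loss (9); the attention and weight matrices of the output head are again taken as given, and all names are ours.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def prediction_and_loss(H1_star, alphas, Ws, Z, labeled_idx):
    """Prediction layer (8) and masked cross-entropy loss (9).

    H1_star:     concatenation of H_1 and H_{2l+1};
    alphas, Ws:  K attention and transformation matrices of the output GAT;
    Z:           one-hot label matrix of shape (|V|, |Y|);
    labeled_idx: indices Y_L of the labeled nodes.
    """
    K = len(alphas)
    logits = sum(a @ (H1_star @ W) for a, W in zip(alphas, Ws)) / K      # head average
    H_out = softmax(logits)                                              # (8)
    loss = -np.sum(Z[labeled_idx] * np.log(H_out[labeled_idx] + 1e-12))  # (9)
    return H_out, loss
```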

4 Experiment and analysis

In this section, the validity of our methodology is verified through several experiments. The experiments were run on a Windows 10 system with 4 GPUs and 32 GB of RAM, implemented with TensorFlow 1.14.0 and programmed in Python 3.7. First, we present the datasets used in the semi-supervised node classification task. Then, we list the experiment parameters and compare the results with previous methods. Finally, a sensitivity analysis is conducted.

4.1 Experiment setup

4.1.1 Datasets

The citation network datasets closely followed the experimental setup of Yang et al. [35]. Specifically, there were three citation network datasets: Cora, Citeseer, and Pubmed. The nodes are documents and the edges are citation links; each document has a sparse bag-of-words feature vector and a class label. We also included a knowledge graph dataset, never-ending language learning (NELL). Because preprocessing the full NELL dataset requires more than 64 GB of memory, which was infeasible for us, we employed the simplified NELL provided by Zhuang et al. [28] to further verify the algorithm's robustness. The details of the datasets are summarized in Table 1. The labeling rate refers to the ratio of labeled nodes to total nodes.

Table 1 Summary of datasets used in experiments

4.1.2 Parameter setting

The hyper-parameters were set as follows. All datasets were trained for a maximum of 100,000 epochs using Adam [36] with a learning rate of 0.005 and early stopping with a window size of 100, meaning training stopped if the validation loss did not decrease for 100 consecutive epochs. The hyper-parameters were optimized for each dataset. For the Cora dataset, the hierarchical level was 1, and the input layer consisted of 8 attention heads computing 8 features each. The prediction layer consisted of 1 attention head, with L2 regularization of 0.001 and dropout of 0.6 applied to the layer input. The hyper-parameters for the Citeseer dataset were the same as for Cora except for a higher L2 regularization value of 0.004 and a hierarchical level of 2. For the Pubmed dataset, we chose a lower dropout rate of 0.4, set the L2 regularization to 0.002, used 8 attention heads in the prediction layer, and set the hierarchical level to 1. For the simplified NELL dataset, we computed 210 features for each head, set the L2 regularization to 0.0003, and set the hierarchical level to 2; the remaining parameter settings were the same as for Cora.
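For reference, the per-dataset settings above can be collected into a configuration dictionary; the field names are ours, and values not stated explicitly in the text are left as None.

```python
# Hyper-parameters as described above (field names are ours; None = not stated).
CONFIGS = {
    "cora":     {"levels": 1, "input_heads": 8,    "feat_per_head": 8,
                 "output_heads": 1, "l2": 0.001,  "dropout": 0.6},
    "citeseer": {"levels": 2, "input_heads": 8,    "feat_per_head": 8,
                 "output_heads": 1, "l2": 0.004,  "dropout": 0.6},
    "pubmed":   {"levels": 1, "input_heads": None, "feat_per_head": None,
                 "output_heads": 8, "l2": 0.002,  "dropout": 0.4},
    "nell":     {"levels": 2, "input_heads": 8,    "feat_per_head": 210,
                 "output_heads": 1, "l2": 0.0003, "dropout": 0.6},
}
# Shared across datasets: Adam, learning rate 0.005, at most 100,000 epochs,
# early stopping with a 100-epoch patience window.
```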

4.2 Node classification results

The performance of our method was compared with the baseline methods on semi-supervised node classification. Table 2 summarizes the experimental results over the four datasets. We report the mean accuracy of our method, obtained in an inductive manner, over 20 runs with random weight initializations.

Table 2 Results of node classification in terms of accuracies for Cora, Citeseer, Pubmed and simplified NELL

The comparison results in Table 2 are very encouraging. To the best of our knowledge, HGAT outperforms all the baseline methods, which verifies the effectiveness of the hierarchical layer. Specifically, HGAT exceeds GAT on Cora, Citeseer, and Pubmed by 0.7%, 0.8%, and 0.8%, respectively. On the simplified NELL dataset, our method outperformed GAT by 2.9%, a much larger improvement than on the other datasets, demonstrating the significance of learning global structure information through the effective transfer of node features. DeepWalk is a random-walk-based algorithm that cannot model attribute information, which leads to poor performance. Another random-walk-based algorithm, Planetoid, also performed relatively poorly, because its random sampling strategies prevent it from fully utilizing the graph structure. To avoid this problem, GCN employs a neighborhood aggregation scheme, producing node embeddings through a linear function on the graph Laplacian spectrum; however, the shallow model restricts the scale of the receptive field. GAT can, to some extent, be viewed as our algorithm with the hierarchical layer removed. It improves on GCN by assigning different weights to different neighbors, but it is still inferior to our method. In conclusion, because there are few training samples in the semi-supervised node classification task, baseline methods with a fixed receptive field are unable to transfer features to enough nodes. Although the variance intervals overlap with those of GAT, our algorithm has a relatively low variance and a high mean value. Testing the hypothesis that our algorithm's accuracy is lower than GAT's, the significance level on all datasets is less than 0.001, which means we have 99.9% confidence that our method outperforms GAT. Our HGAT method is more promising for improving accuracy than the others because it learns global information: in the hierarchical layer, a larger receptive field can be obtained and, most importantly, node features can be effectively transferred.

For an intuitive understanding of the coarsening operation in the hierarchical layer, we made t-SNE [37] plots of the node representations, as seen in Fig. 4. Different node colors correspond to different classes, each of which is obtained by applying K-means to the node representations; there are seven different colors in the graph. The node size is proportional to the number of nodes it contains. Figure 4a is a visualization of the original Cora dataset features, in which nodes of different classes are mixed together. Figure 4b is the visualization of the 1st hierarchical level outputs; the total number of nodes is the same as in Fig. 4a, but the clusters of different classes are now visible, which verifies the discriminative power of our algorithm across the seven topic classes. Figure 4c is the visualization of the 2nd level output. As the hierarchical level increases, the graph size becomes smaller.

Fig. 4

Visualization of graphs on the Cora dataset. a t-SNE plot of the initial node representations. b t-SNE plot of node representations of the 1st level hierarchical output. c t-SNE plot of node representations of the 2nd level hierarchical output. The clusters are represented by different colors. The node size is proportional to the number of nodes it contains

4.3 Parameter sensitivity

4.3.1 Lower labeling rate impact

In reality, there are many situations in which more training data cannot be obtained, so it is important for the algorithm to work in this scenario. In this section, we decrease the labeling rate of two datasets of different types, Citeseer and simplified NELL, and compare the performance of our method with the others. We decrease the number of labeled nodes per class of the citation dataset from 20 to 15, 10, and 5, giving labeling rates of 0.036, 0.027, 0.018, and 0.009, respectively. For the simplified NELL dataset, we use labeling rates of 0.01, 0.006, 0.003, and 0.001, following the method of Zhuang et al. [28]. The corresponding average results of 20 runs are reported in Tables 3 and 4.

Table 3 Results in terms of classification accuracies for Citeseer with different labeling rates
Table 4 Results in terms of classification accuracies for simplified NELL with different labeling rates

From the tables above, it can be seen that our method beats the baselines at all labeling rates. As the labeling percentage decreases, the accuracy margin between HGAT and the best baseline method becomes larger. In particular, when the labeling rate is 0.009 on the Citeseer dataset, the accuracy of HGAT exceeds GCN and GAT by 9.8% and 3.2%, respectively; at a labeling rate of 0.001 on simplified NELL, it exceeds GCN and GAT by 5.6% and 5.4%, respectively. As the labeling rate decreases, the connections between the labeled and unlabeled nodes become fewer, that is, the number of edges available for propagating features becomes smaller. Only when the receptive field is large enough and the information from the unlabeled nodes is passed to the labeled ones efficiently can a better result be obtained. Thus, the results show that the proposed HGAT obtains a larger receptive field and transfers node features more effectively. In other words, by introducing the hierarchical layer, our method is more robust when much less training data is available.

4.3.2 Effects of the hierarchical level

The hierarchical level is a key factor in Algorithm 1. Too high a value produces a "smoothing" effect, causing inferior results, while too low a value fails to exploit the information of a larger receptive field. For a better understanding of the hierarchical effect, we analyzed the accuracy at two different labeling rates with different hierarchical levels, as shown in Fig. 5.

Fig. 5

Results of HGAT with varying hierarchical levels in terms of accuracy on the (a) Cora, (b) Citeseer, (c) Pubmed, and (d) simplified NELL datasets. Two labeling rates for each dataset are shown in the legend

It can be seen from Fig. 5 that, before the peak point, the accuracy grows as the hierarchical level increases, and then falls at higher hierarchical levels. This can be explained by the fact that a higher hierarchical level helps capture useful node features by increasing the receptive field, whereas too high a level degrades the features because of the smoothing effect. The best hierarchical level for Cora and Citeseer becomes larger as the labeling rate decreases, moving from 1 to 3 for the Cora dataset and from 2 to 4 for the Citeseer dataset. However, the best hierarchical levels for Pubmed and simplified NELL are the same at the two labeling rates. For Pubmed, this can be explained by its extremely sparse labeling rate, under which the advantages of the hierarchical mechanism are not obvious. Although the labeling rate of simplified NELL is of the same order of magnitude as Citeseer's, its classification task has many more classes, so on average the number of training samples per class is still extremely sparse.

5 Conclusions and future work

In this work, we presented a novel hierarchical graph attention network for semi-supervised node classification. By employing a hierarchical layer, a larger receptive field of nodes can be obtained and node features can be effectively transferred. Moreover, our method does not need costly matrix operations and can be parallelized across all nodes. The results show that our method achieved state-of-the-art performance on four different datasets. There are several possible improvements that could be addressed in future work. First, our method could be extended to other interesting tasks such as graph classification, which is useful in practice. Second, many other kinds of data, such as text, can also be treated as graphs, and how to apply our method to such datasets remains to be explored. Finally, our method cannot be directly applied to directed graphs, which also remains to be addressed.