1 Introduction

Graph-structured data is pervasive in various applications, ranging from citation networks and social networks to e-commerce networks. Mining knowledge in graphs, such as predicting node properties, is desirable and meaningful to both academic and industrial communities. For example, given an academic citation network, we may be interested in predicting the research area of an author. Making such predictions has become the focus of graph analysis, which broadly includes graph classification [13], link prediction [21] and community detection [16], among others. Among various graph analysis problems, semi-supervised node classification is an essential and widespread task, and it has attracted great interest [12, 35, 46].

Graph representation learning is an effective technique for tackling this task. Early shallow approaches [7, 29] typically follow a two-step framework, which aims to learn a continuous, compact, and low-dimensional embedding (vector) for each node in the graph. These embeddings are then fed into a classifier to infer the labels of nodes. Since the node representations are not optimized for the specific classifier, this two-step process inevitably leads to sub-optimal performance. More recently, several semi-supervised graph neural networks (GNNs) [12, 35, 41] were proposed. They utilize deep learning techniques [15] such as convolution or attention mechanisms to encode both the local graph structure and node attributes into embeddings, which are then followed by a prediction layer (e.g., softmax or logistic sigmoid function) for classification. Thanks to the powerful feature extraction ability of deep learning and the integrated end-to-end framework, they have achieved state-of-the-art performance on the node classification task.

While these GNN approaches have become the de facto solution for graph semi-supervised learning, they still suffer from two shortcomings, the over-smoothing and over-fitting problems, due to the inherent training and test procedure of semi-supervised learning [30]. First, a GNN essentially employs a message passing neural network with neighborhood representation aggregation to train a model that maps the feature space into the label space [6, 47]. When the network architecture goes deep, the excessive aggregation causes all nodes' representations to converge to a stationary point, making the representations of nodes in different classes indistinguishable [4]. This issue is known as the over-smoothing problem and seriously affects the performance of GNNs [18]. Second, a GNN minimizes a loss function over the labeled nodes, which are typically very limited in semi-supervised learning. Therefore, GNNs easily overfit the training samples, leading to degraded generalization performance.

In semi-supervised learning, an alternative and promising learning paradigm is the label propagation algorithm (LPA) [48, 49]. Different from representation learning or GNNs, LPA builds a graph over the labeled and unlabeled data, where edges connect semantically similar nodes and the edge weights reflect how strong the similarities are. LPA then infers the labels of unlabeled nodes by propagating known labels through the neighbors of each node. The edge weights in LPA are often set heuristically based on the observed node attributes (e.g., using a Gaussian kernel function).

LPA has nice properties that avoid both the over-smoothing and over-fitting problems faced by GNNs. Concretely, GNNs learn the feature mapping with multiple layers of aggregation, which causes over-smoothing (node embeddings become indistinguishable) as the network goes deep. In contrast, LPA directly spreads the labels over the graph without involving a feature learning process, so it does not lead to indistinguishable node representations. Furthermore, LPA classifies nodes by propagating the labels instead of training a classifier to fit the limited training data. Therefore, the learned model does not over-fit the patterns in the training set.

However, propagating labels effectively is not trivial, since the classic LPA still has the following intrinsic limitations.

  • Limitation 1: Limited capacity to exploit features. Classic LPA derives edge weights from the original high-dimensional node attribute space, which contains a large portion of sparse, redundant or noisy information. Such approaches cannot effectively exploit the expressiveness of the features. Furthermore, computing similarities directly on the raw attributes may lead to noisy weight values and the loss of key information.

  • Limitation 2: Inability to capture the strength of relations corresponding to the labels. In LPA, the edge weights are computed in a step separate from the label propagation. As a result, the label information is ignored when estimating the strength of relations. Since the edge weights are calculated only once, based on the similarity of raw attributes, they cannot be updated in turn by the label propagation process. Such fixed edge weights limit the performance of semi-supervised classification.

To overcome the aforementioned limitations, in this paper, we propose a novel framework for graph semi-supervised learning named Cyclic Label Propagation (CycProp for short). Our theme is to integrate GNNs into the process of label propagation in a cyclic and mutually reinforcing manner, so as to exploit the advantages of both GNNs and LPA. More specifically, to overcome Limitation 1, we derive a novel label-adaptive graph neural network module to learn low-dimensional embeddings of nodes in a graph. To enhance the representation power of the embeddings, we exploit the highly reliable labels obtained from label propagation in the negative sampling process, so that the label information is nicely injected into the node embedding component. For Limitation 2, we develop an embedding-adaptive label propagation module, which utilizes the node embeddings to refine the edge weights for label propagation. With the label information injected into the node embeddings, the weights essentially capture the strength of the relations corresponding to the labels. A self-paced learning scheme is devised to adaptively control the cyclic learning process, in which embedding learning and edge refining are optimized alternately so that they mutually benefit each other. Once the model has converged, the unknown node labels are obtained on-the-fly without training an extra classifier or performing a sophisticated inference procedure. A concept map of our framework is given in Figure 1.

Figure 1

A concept map of our proposed framework. The node representations and the predicted labels update each other in a cyclic manner, which helps each component achieve better performance.

To summarize, the main contributions are as follows:

  • We propose CycProp, a unified graph semi-supervised learning framework that exploits the advantages of both GNN and LPA. By updating the node representations and the weighted graph in a cyclic and mutually reinforcing way, the proposed framework obtains label estimations and node embeddings simultaneously.

  • We design a novel label-adaptive graph neural network module for graph representation learning, which leverages not only structure context but also self-adaptive augmented label context to learn the node embeddings.

  • We conduct extensive experiments on various datasets to demonstrate the effectiveness of CycProp and its superiority compared to a range of state-of-the-art methods.

2 Related work

2.1 Graph representation learning

Graph embedding, an important branch of graph representation learning, aims to embed nodes into latent vector spaces, where the inherent properties of the graph are preserved. Motivated by the success of Word2vec [25, 26], the Skip-gram model has been adapted from word embeddings to node embeddings based on the graph topology. For instance, DeepWalk [29] and node2vec [7] use different sampling strategies to generate random walk sequences, which are then fed into the Skip-gram model to learn low-dimensional embedding vectors. LINE [33] optimizes both first-order and second-order proximity preserving objectives. While the above methods only utilize the graph structure information, some recent approaches attempt to preserve both the structure and attribute proximities in a unified space. For example, SNE [20] leverages a deep neural network architecture to capture the complex interrelations between graph structure and node attribute information. GraphSAGE [8] generates embeddings by sampling and aggregating attributes from nodes' local neighborhoods in an inductive setting. EP [5] tries to learn vector representations by utilizing a propagation design.

Another category of graph embedding algorithms follows a semi-supervised manner, in which the available information includes not only node attributes but also node labels. Among them, TriDNR [27] models tri-party information sources including node structure, node attributes and node labels to jointly learn node representations. Planetoid [46] simultaneously optimizes the prediction of known labels and the corresponding graph contexts to learn node representations. SEANO [19] learns low-dimensional node representations that consider topological proximity, attribute affinity and label similarity. MDAL [42] introduces a multi-task dual attention LSTM model to capture multiple information sources for graph semi-supervised learning.

Recently, graph neural networks (GNNs) have attracted extensive attention and attained state-of-the-art performance in several graph analysis applications, especially the semi-supervised node classification task. By applying deep learning techniques [15] to non-Euclidean domains, GNNs can learn node representations and predict labels simultaneously in an end-to-end way [43]. GCN [12], a representative GNN model, performs spectral convolutions on the graph to encode both the local graph structure and the attributes of nodes into hidden representations. GAT [35] applies an attention mechanism over a node's neighborhood contents to generate node embeddings. SGC [41] simplifies GCN by removing the non-linearity and still achieves competitive performance. GIN [45] brings the power of the graph isomorphism test to GNN design, and DGI [36] leverages a contrastive scheme to train GNNs in an unsupervised way. Beyond the graph domain, GNNs have also been applied to various machine learning problems, including time-series prediction [44], object detection [31] and few-shot learning [22, 23]. Unfortunately, GNNs for node classification usually suffer from two main obstacles, over-fitting and over-smoothing, which seriously hurt the performance of the models [30].

In this paper, we leverage GNNs as an important component of our learning framework, but successfully avoid these obstacles with the help of label propagation. Compared to the aforementioned methods, CycProp does not suffer from over-fitting or over-smoothing when making classification predictions.

2.2 Label propagation algorithm

The label propagation algorithm (LPA) has been proposed as an efficient method to infer missing labels for graph data in a semi-supervised setting. GFHF [49] learns the predicted labels by optimizing harmonic functions based on a Gaussian random field model. LLGC [48] considers local and global prior consistency by combining a smoothness constraint and a fitting constraint. LNP [38] studies graph construction by approximating the whole graph with linear neighborhood structures, where labels are propagated to the remaining unlabeled nodes using the constructed graph. DLP [37] deals with the multi-label propagation problem by considering label correlation information. Moreover, inspired by LPA's formulation, other common approaches train a supervised learner to classify data features while regularizing it with graph information. For example, manifold regularization [1] trains a support vector machine with a graph Laplacian regularizer. LSHM [11] addresses the node classification task in heterogeneous social networks via learned node representations.

Several recent works also combine LPA with neural networks. For instance, NGM [3] utilizes the power of neural networks and constrains neighboring nodes to learn similar representations for classification. LP-DSSL [10] utilizes label propagation to generate pseudo-labels for the unlabeled data, which expands the training sample set for neural network training. GCN-LPA [39] builds a GCN with learnable edge weights and views LPA as a regularization that assists the GCN in learning proper edge weights. Our proposed CycProp also combines label propagation and neural networks, but has several essential differences from the above methods: (1) All of the aforementioned methods predict the unknown labels with neural networks, whereas CycProp makes classification predictions by label propagation, which effectively avoids the over-smoothing and over-fitting problems; (2) In the above methods, the main component of the objective function is the cross-entropy loss and LPA serves as a regularization term or pseudo-label generator. In contrast, we set the label propagation loss as the main objective and also design a structure-label-aware graph embedding loss function.

Compared to traditional LPA methods, CycProp introduces GNNs to revise the edge weights iteratively. With such adaptive weighting of the propagation, it gains a significant advantage in classification performance.

3 Cyclic label propagation for graph semi-supervised learning

In this section, we first define the notations and present our problem formulation. Then, we introduce the two major components in our unified framework: (1) label-adaptive graph neural network module, and (2) embedding-adaptive label propagation module. After that, a joint training framework that integrates the two components is presented.

3.1 Notations and problem formulation

Given an attributed graph \(\mathcal {G} = (\mathcal {V}, \mathcal {E}, \mathbf {X})\), where \(\mathcal {V} = \{v_{1}, \cdots , v_{l}, v_{l+1}, \cdots , v_{n}\}\) and \(\mathcal {E}\) denote the set of nodes and edges, respectively; \(\mathbf {X} \in \mathbb {R}^{n \times m}\) is a matrix that represents all node attributes, and \(\mathbf {x}_{i} \in \mathbb {R}^{m}\) denotes the attributes affiliated with node \(v_{i}\). Let the label set {1, 2, ⋯, K} represent the different classes and \(\mathbf {Y} \in \mathbb {R}^{l \times K}\) be a label matrix, in which \(\mathbf {y}_{i} \in \mathbb {R}^{K}\) denotes the label distribution of node \(v_{i}\), i.e., if \(v_{i}\) belongs to class j, then \(y_{ij} = 1\), otherwise \(y_{ij} = 0\). The first l nodes \(v_{i}\) (\(1 \le i \le l\)) are labeled, and the remaining nodes \(v_{u}\) (\(l + 1 \le u \le n\)) are unlabeled. With the above notations, we formally define our problem as follows.

Definition 1

Given an attributed graph \(\mathcal {G}\), with partially labeled nodes \(\{v_{1}, \dots , v_{l}\}\) and the desired node embedding dimension d, our goal is to learn the label assignments \(\mathbf {F} \in \mathbb {R}^{n \times K}\) and node embeddings \(\mathbf {E} \in \mathbb {R}^{n \times d}\) simultaneously. Each node has a probability distribution over the set of labels.

3.2 Label-adaptive graph neural network module

To learn meaningful node embeddings, it is desirable to incorporate the various kinds of available graph information. Different from [32], which models the attribute information as augmented nodes, or [9], which uses node attributes to calculate a similarity matrix, we design a label-adaptive graph neural network module to capture the deep semantics of nodes.

Specifically, we generate node embeddings as follows,

$$ \mathbf{e}_{i} = g_{\boldsymbol{\theta}}(\mathbf{x}_{i}), $$
(1)

where g𝜃(⋅) can be any kind of GNN [12, 35, 41, 43], and 𝜃 is its parameter set. These methods typically work by propagating representations throughout the graph. Here, we choose GraphSAGE [8] as the graph neural network module in our experiments, due to its effectiveness and efficiency. We then optimize it in a label-adaptive manner by predicting each node's associated graph context. Formally, let (i, c) denote a node-context pair, i.e., node vc is a graph context of node vi; our goal is to minimize the following negative sampling objective,

$$ -\log \ \sigma(\mathbf{e}_{c}^{\mathrm{T}}\mathbf{e}_{i}) - \sum\limits_{s=1}^{s_{neg}}\mathbb{E}_{v_{j} \sim P_{n}(v)}[\log \ \sigma(-\mathbf{e}_{j}^{\mathrm{T}}\mathbf{e}_{i})], $$
(2)

where \(P_{n}(v) \propto d_{v}^{3/4}\) as suggested in [26], and dv is the degree of node v. The goal is thus transformed into classifying the node-context pairs (i, c) into positive contexts (γ = + 1) or negative contexts (γ = − 1) sampled from a noise distribution. Therefore, the graph embedding loss with negative sampling can be rewritten as,

$$ \mathcal{L}_{GE} = - \sum\limits_{i=1}^{n}\mathbb{E}_{(i,c,\gamma)}\log\ \sigma(\gamma\mathbf{e}_{c}^{\mathrm{T}}\mathbf{e}_{i}), $$
(3)

where σ(⋅) is the sigmoid function, i.e., \(\sigma (x) = 1/(1 + e^{-x})\).
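
To make this objective concrete, the following is a minimal PyTorch-style sketch of the noise distribution in (2) and the loss in (3); the function names, tensor shapes and the mean reduction are our own illustrative choices rather than the authors' released code.

```python
import torch
import torch.nn.functional as F


def noise_distribution(degrees):
    """Negative-sampling noise distribution P_n(v) proportional to d_v^(3/4), cf. Eq. (2)."""
    p = degrees.float() ** 0.75
    return p / p.sum()


def graph_embedding_loss(emb, pairs, gammas):
    """Negative-sampling form of the graph embedding loss L_GE in Eq. (3).

    emb:    (n, d) tensor of node embeddings e_i produced by g_theta
    pairs:  (B, 2) long tensor of sampled (i, c) node-context index pairs
    gammas: (B,) float tensor holding +1 for positive contexts and -1 for negatives
    """
    e_i = emb[pairs[:, 0]]
    e_c = emb[pairs[:, 1]]
    scores = (e_i * e_c).sum(dim=1)               # e_c^T e_i for each sampled pair
    # -log sigma(gamma * score) equals softplus(-gamma * score)
    return F.softplus(-gammas * scores).mean()
```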

Algorithm 1 (graph context sampling)

We now present how to generate the (i, c, γ) graph context pairs with a structure-label-aware sampling process. We develop two graph context sampling mechanisms, which are depicted in Algorithm 1.

Structure aware graph context sampling

The first type is based on the graph structure, which encodes the structure information and regards the neighborhood nodes as positive node-context pairs.

Label aware graph context sampling

The second type is based on the label set: it injects label information into the context and treats nodes having the same label as positive node-context pairs. Moreover, an indicator variable φ is introduced to control the label-related graph context. We first initialize the 0/1 indicator vector \(\boldsymbol {\varphi } \in \mathbb {R}^{n}\) with the known label set, i.e., φi = 1 means node vi is a label context candidate. We then augment it through the label propagation module, where highly reliable learned labels are gradually incorporated into φ to expand the label context candidates. The generated context pairs are dynamically refreshed during the training process as the parameter φ and the node labels are updated. The details of how the indicator vector φ is modeled are described in the following subsection.
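
As an illustration of the two mechanisms, here is a hypothetical Python sketch of the sampling step (the exact procedure is Algorithm 1; the function signature, the candidate filtering and the default of 10 negatives per node are our own assumptions):

```python
import random
import numpy as np


def sample_context_pairs(adj, labels, phi, noise_probs, num_neg=10):
    """Structure- and label-aware graph context sampling (illustrative sketch).

    adj:         adjacency list, adj[i] is the list of neighbours of node i
    labels:      current label estimate of every node (e.g. argmax of f_i)
    phi:         0/1 indicator vector; phi[i] = 1 marks a label-context candidate
    noise_probs: noise distribution P_n(v), proportional to d_v^(3/4)
    """
    n = len(adj)
    candidates = [j for j in range(n) if phi[j] == 1]
    pairs = []
    for i in range(n):
        # Structure-aware positive pair: node i with one of its neighbours.
        if adj[i]:
            pairs.append((i, random.choice(adj[i]), +1))
        # Label-aware positive pair: node i with a reliable node of the same label.
        if phi[i] == 1:
            same = [j for j in candidates if j != i and labels[j] == labels[i]]
            if same:
                pairs.append((i, random.choice(same), +1))
        # Negative pairs drawn from the noise distribution P_n(v).
        for j in np.random.choice(n, size=num_neg, p=noise_probs):
            pairs.append((i, int(j), -1))
    return pairs
```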

3.3 Embedding-adaptive label propagation module

In order to overcome the limitations of existing label propagation approaches [37, 38, 48], we propose to infer the edge weights in an informative embedding space in a mutually reinforcing manner. In detail, for each node vi, its corresponding embedding vector \(\mathbf {e}_{i} \in \mathbb {R}^{d}\) is obtained from the label-adaptive graph neural network module, where d is the embedding dimension (d ≪ m). Basically, the edge weights can be calculated with the following function, similar to [38, 48, 49],

$$ s_{ij} = \exp\left( -\frac{\Vert \mathbf{e}_{i} - \mathbf{e}_{j} \Vert^{2}}{2\delta^{2}}\right), $$
(4)

where δ is the length-scale parameter. With this measure, the estimated edge weights reflect the degree of similarity between each connected node pair, and they are dynamically adjusted during training as the node embeddings are updated.
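
A direct NumPy sketch of this weighting step is given below; the dictionary representation of the weighted graph and the function name are our own choices.

```python
import numpy as np


def edge_weights(emb, edges, delta):
    """Gaussian-kernel edge weights of Eq. (4), recomputed from the current
    node embeddings. `edges` is an iterable of (i, j) index pairs and
    `delta` is the length-scale parameter."""
    weights = {}
    for i, j in edges:
        diff = emb[i] - emb[j]
        weights[(i, j)] = float(np.exp(-np.dot(diff, diff) / (2.0 * delta ** 2)))
    return weights
```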

The key to semi-supervised learning on graphs is to respect the prior consistency, such that the labels are smooth over the graph. Following this principle, we devise the regularized objective function for embedding-adaptive label propagation as follows,

$$ \begin{array}{@{}rcl@{}} \mathcal{L}_{LP} &=& \sum\limits_{i=1}^{n}\sum\limits_{j \in \mathcal{N}(i)}s_{ij}\Vert \mathbf{f}_{i} - \mathbf{f}_{j} {\Vert_{2}^{2}} + \mu \sum\limits_{i=1}^{l}\Vert \mathbf{f}_{i} - \mathbf{y}_{i} {\Vert_{2}^{2}} \\ &&+ \sum\limits_{i=1}^{n}\varphi_{i} H(\mathbf{f}_{i}) + \lambda \sum\limits_{i=1}^{n}-\varphi_{i} \\ &&\text{s.t.} \ f_{ik} \ge 0; \ \sum\limits_{k=1}^{K}f_{ik} = 1; \ \varphi_{i} \in \{0,1\}, \ i = \{1, \cdots, n\}, \end{array} $$
(5)

where sij is the edge weight between nodes vi and vj calculated according to (4), yi is the ground-truth label and \(\mathcal {N}(i)\) represents the neighborhood of node vi. μ is a trade-off hyper-parameter between the smoothness and fitness terms, and fi is the learned label distribution of node vi. In this manner, the label propagation procedure benefits from the graph neural network module.

In addition, we introduce a self-paced regularizer [14] to prioritize the label learning task and to select highly reliable node labels in each training iteration. The regularizer is composed of a Shannon entropy function H(⋅) and an indicator variable φ. H(fi) penalizes uniform label distributions and is formally defined as follows,

$$ H(\mathbf{f}_{i}) = -\sum\limits_{k=1}^{K} \ f_{ik}\times\log(f_{ik}), $$
(6)

where fik denotes the probability of node vi belonging to class k. The smaller the Shannon entropy, the more informative the distribution; a small Shannon entropy implies that fi puts a significantly higher probability on one specific class. For instance, in the extreme case where the probability of node vi belonging to class k is 1, the Shannon entropy of fi is 0. The self-paced parameter φ serves as an indicator vector in the graph neural network module, which determines the potential label context based on the current label information. The binary value of φi indicates whether node vi's learned label is reliable or not, and λ acts as a threshold that distinguishes informative labels from uninformative ones. If the Shannon entropy of fi is smaller than the threshold, we set φi to 1 to indicate that node vi can be utilized as a label context. As the training process goes on, λ is gradually increased so that more highly reliable learned labels are included in the graph embedding procedure to adaptively update the node embeddings. In this way, the graph embedding procedure benefits from the label propagation module, and the overall framework naturally forms a closed loop in a mutually reinforcing manner.
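
The self-paced selection can be written in a few lines; the following sketch, with our own function name and a small epsilon for numerical stability, computes H(fi) as in (6) and applies the threshold rule described above.

```python
import numpy as np


def update_indicator(F_pred, lam, eps=1e-12):
    """Self-paced selection of reliable labels: phi_i = 1 iff H(f_i) <= lambda.

    F_pred: (n, K) matrix whose rows are the current label distributions f_i
    lam:    self-paced threshold lambda, gradually increased during training
    """
    entropy = -np.sum(F_pred * np.log(F_pred + eps), axis=1)   # H(f_i) per node
    return (entropy <= lam).astype(int)
```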

3.4 CycProp: a joint learning framework

As demonstrated above, each of the two modules can learn beneficial information from the other. Hence, we design a cyclic learning framework in which the two modules are trained in an iterative and alternating way. Figure 2 provides an overview of the proposed CycProp model. The two main components of our framework, the label-adaptive graph neural network module and the embedding-adaptive label propagation module, output the node embeddings and the label assignments (classification results), respectively. First, in the label-adaptive graph neural network, the attribute matrix X and the graph structure information are fed into the graph neural network g𝜃(⋅), which generates the node embeddings E. Then, in the embedding-adaptive label propagation module, the weighted graph is computed by a similarity function over the node embeddings E. After that, the predicted label assignments F and the indicator variable φ are obtained by label propagation with the weighted graph and the known labels Y. Next, the predicted labels and the indicator are exploited to generate the label context, which is in turn used, together with the structural context, to train the graph neural network. In this way, the node embeddings E are updated to generate a more reasonable weighted graph, which further improves the prediction F of label propagation. As shown by the blue arrows, by jointly learning E and F in such a cyclic and mutually beneficial manner, the model finally outputs informative node embeddings as well as accurate label predictions. It is worth noting that the final classification results are produced by the label propagation module, which effectively prevents the over-smoothing and over-fitting problems caused by GNNs.

Figure 2

The overall framework of the CycProp model. Nodes 1-3 have known labels while the labels of nodes 4-6 need to be predicted. In the weighted graph, thicker edges denote larger weights. In the pool of sampled context pairs, the pairs in black denote positive samples (γ = 1) and those in red denote negative samples (γ = − 1). The blue arrows indicate the data flow of cyclic learning.

The objective function of the proposed model is formulated as the weighted combination of \({\mathscr{L}}_{LP}\) and \({\mathscr{L}}_{GE}\) defined in (5) and (3), respectively,

$$ \mathcal{L} = \mathcal{L}_{LP} + \alpha \mathcal{L}_{GE}. $$
(7)

Considering the parameters involved in both terms of \({\mathscr{L}}\), the set of learnable parameters of the CycProp model is {F, φ, 𝜃}. To minimize \({\mathscr{L}}\), we employ stochastic gradient descent [2] and a proximal algorithm [28] to optimize the model in an alternating manner. We first give the partial derivatives of the key parameters as follows.

Updating F

We utilize proximal gradient descent [17, 28] to solve this constrained optimization problem. In proximal algorithms, the constraint terms are handled via the proximal operator. Satisfying the non-negativity and sum-to-one constraints in our objective function, i.e., \(\mathcal {D} = \{\mathbf {f}|\mathbf {f} \ge 0, \mathbf {f}^{\mathrm {T}}\mathbf {1} = 1\}\), amounts to computing the projection onto the probability simplex. Here, we employ the efficient algorithm proposed in [40] to calculate the proximal operator. The procedure for finding \(\mathbf {f} \in \mathcal {D}\) given z is shown in Algorithm 2.

Algorithm 2 (projection onto the probability simplex)

Then we have the following equation as our proximal operator,

$$ \mathbf{f} = {\mathbf{prox}}_{\mathcal{D}}(\mathbf{z}) = (\mathbf{z} + \eta\mathbf{1})_{+}, $$
(8)

where \((x)_{+} ={\max \limits } \{x, 0\}\) and η is computed by the procedure shown in Algorithm 2. Note that the proximal operator always keeps the updated label distribution within the constraint set \(\mathcal {D}\). The partial derivative with respect to fi can be formulated as,

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{f}_{i}} = \sum\limits_{j \in \mathcal{N}(i)}2s_{ij}(\mathbf{f}_{i} - \mathbf{f}_{j}) + 2\mu(\mathbf{f}_{i} - \mathbf{y}_{i})\cdot\mathbb{I}(i) - \varphi_{i}(\log(\mathbf{f}_{i}) + \mathbf{1}), $$
(9)

where \(\mathbb {I}(i)\) is an indicator function specifying whether i is a labeled node and \(\mathbf {1} \in \mathbb {R}^{K}\) is the all-one vector. Based on the proximal gradient method, the parameter fi is updated as follows,

$$ \mathbf{f}_{i} = {\mathbf{prox}}_{\mathcal{D}}\left(\mathbf{f}_{i} - lr \cdot\frac{\partial \mathcal{L}}{\partial \mathbf{f}_{i}}\right), $$
(10)

where lr denotes the learning rate in gradient descent.
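
As a small illustration of this update, the following NumPy sketch combines the simplex projection (our reading of the sorting-based procedure of [40], cf. Algorithm 2) with the proximal step of (10); the gradient of (9) is assumed to be computed elsewhere, and the function names are ours.

```python
import numpy as np


def project_to_simplex(z):
    """Euclidean projection onto the probability simplex, i.e. the proximal
    operator of Eq. (8); eta is obtained by the sorting procedure of [40]."""
    K = len(z)
    u = np.sort(z)[::-1]                          # sort entries in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, K + 1) > 0)[0][-1]
    eta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(z + eta, 0.0)


def update_label_distribution(f_i, grad_i, lr):
    """One proximal gradient step on f_i (Eq. (10)): a gradient step with the
    derivative of Eq. (9), followed by projection back onto the simplex."""
    return project_to_simplex(f_i - lr * grad_i)
```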

Updating φ

We first relax φ to take any real value in the interval [0,1]. Then the partial derivative of φi can be derived as,

$$ \frac{\partial{\mathcal{L}}}{\partial{\varphi_{i}}} = H(\mathbf{f}_{i}) - \lambda. $$
(11)

Since the optimal value of φ is constrained to either 1 or 0 for all samples, the closed-form solution to update φi is,

$$ \begin{array}{@{}rcl@{}} \varphi_{i} = \begin{cases} 1, &H(\mathbf{f}_{i}) \le \lambda\\ 0, & \text{Otherwise} \end{cases}. \end{array} $$
(12)
Algorithm 3 (the joint training procedure of CycProp)

Note that calculating the partial derivatives with respect to the parameter set 𝜃 of the graph neural network module is straightforward, so we omit the detailed mathematical derivations due to space limitations. After obtaining the derivatives of all the parameters, the whole optimization procedure can be efficiently performed via back-propagation. To sum up, the joint training procedure of the CycProp model is depicted in Algorithm 3. First, the parameters F, φ, 𝜃 are initialized. Then, in each iteration, the graph neural network module and the label propagation module are updated for T1 and T2 steps, respectively. At the end of each iteration, the indicator φ is updated and the self-paced hyper-parameter λ is increased. When the algorithm converges, the predicted classification results F and the node embeddings E are returned simultaneously.
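
The following Python sketch outlines this alternating loop, reusing the helper sketches given earlier (sample_context_pairs, edge_weights, project_to_simplex, update_indicator). The gnn object, its step()/embed() interface, the graph attributes and the default hyper-parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def train_cycprop(gnn, graph, Y, num_iters=20, T1=50, T2=50,
                  lam=0.1, lam_step=0.05, mu=10.0, lr=0.01, delta=0.1):
    """Schematic of the alternating training loop of Algorithm 3 (sketch only)."""
    n, K = graph.num_nodes, Y.shape[1]
    l = Y.shape[0]                                   # number of labeled nodes
    F_pred = np.full((n, K), 1.0 / K)
    F_pred[:l] = Y                                   # keep the known labels
    phi = np.zeros(n, dtype=int)
    phi[:l] = 1                                      # labeled nodes start as candidates
    labeled = np.zeros(n); labeled[:l] = 1.0
    Y_full = np.zeros((n, K)); Y_full[:l] = Y
    for _ in range(num_iters):
        # (1) Train the GNN for T1 steps on structure- and label-aware contexts.
        for _ in range(T1):
            pairs = sample_context_pairs(graph.adj, F_pred.argmax(1), phi,
                                         graph.noise_probs)
            gnn.step(pairs)                          # minimizes the alpha * L_GE term
        # (2) Recompute the weighted graph from the new embeddings (Eq. (4)).
        emb = gnn.embed(graph)
        S = np.zeros((n, n))
        for (i, j), w in edge_weights(emb, graph.edges, delta).items():
            S[i, j] = S[j, i] = w
        # (3) T2 proximal gradient steps on F (Eqs. (9)-(10)).
        for _ in range(T2):
            grad = (2.0 * (S.sum(1, keepdims=True) * F_pred - S @ F_pred)
                    + 2.0 * mu * labeled[:, None] * (F_pred - Y_full)
                    - phi[:, None] * (np.log(F_pred + 1e-12) + 1.0))
            F_pred = np.vstack([project_to_simplex(f - lr * g)
                                for f, g in zip(F_pred, grad)])
        # (4) Refresh the self-paced indicator and enlarge the threshold.
        phi = update_indicator(F_pred, lam)
        lam += lam_step
    return F_pred, gnn.embed(graph)
```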

3.5 Complexity analysis

We analyze the time complexity of the proposed CycProp by considering the two main components separately. For the graph neural network module, the time complexity can be reduced to \(\mathcal {O}(|\mathcal {E}|)=\mathcal {O}(n \mathcal {D})\) with a sparse computation package, where \(|\mathcal {E}|\) is the number of edges and \(\mathcal {D}\) is the average degree of the graph. The label propagation module has linear complexity \(\mathcal {O}(n)\), which is lower than that of the graph neural network module when the graph is dense (\(\mathcal {D}\) is large). Therefore, compared to other GNN-based methods, CycProp does not introduce a significant increase in computation.

4 Experiments

In this section, we report the results of our experiments to verify the effectiveness of our proposed CycProp model. We first describe the datasets and experimental setups in detail, and then we present the results with insights.

4.1 Datasets

We adopt three citation networks and two social networks for empirical studies. Statistics of the five datasets are summarized in Table 1 with more descriptions as follows,

  • Citation Networks. Cora, Citeseer and Pubmed [24] are three publicly available datasets composed of scientific publications. In these networks, nodes represent published papers and edges denote citation relationships. Node labels indicate the categories to which each paper belongs, and the text contents are treated as node attributes. We remove papers that have no connection in the network and extract the maximum connected component.

  • Social Networks. Blogcatalog and Flickr [34] are two typical social networks in which nodes represent users and links denote following relationships. In social networks, users usually generate personalized content such as blog posts or shared photos with tag descriptions, so these text contents are regarded as node attributes. We set the groups that users joined as labels, and users with no follower or no predefined category have been removed.

4.2 Competitors

We compare the proposed CycProp model against several state-of-the-art baselines that can be categorized into the following groups:

  • Classical LPA. These methods perform label propagation based on the edge weights calculated from the original attribute vectors. We consider three popular methods GFHF [49], LLGC [48] and DLP [37] as our compared algorithms.

  • Unsupervised Node Representation Learning. Methods of this group first employ graph embedding techniques to learn the optimal node representations and then classify each node independently in the latent representation space. These approaches can be further classified into the following two classes:

    1) Structure-only. In this group, we choose three baselines DeepWalk [29], LINE [33] and node2vec [7], which utilize graph topological information only, and the node attributes are not taken into consideration.

    2) Attribute + Structure. This category of algorithms aims to encode both node attributes proximity and graph structure proximity into the latent representation space. We consider three recently proposed methods SNE [20], EP [5] and GraphSAGE [8] as our baselines.

  • Semi-supervised Node Representation Learning. Methods in this group further leverage additional label information to model the underlying representations. These approaches can be further classified into the following three classes:

    1) Semi-supervised Node Embedding. These methods train node embeddings under the supervision of label data. Planetoid [46] is a typical method of this kind and is selected for comparison.

    2) GNN. By employing deep learning techniques [15], GNNs learn node representations as well as the classifier jointly in an end-to-end way. We select two representative GNN models, GCN [12] and GAT [35], as our baselines.

    3) Neural Network with Label Propagation. This group of methods combines neural networks with label propagation to enhance classification performance. We choose three recently proposed methods, NGM [3], GCN-LPA [39] and LP-DSSL [10], as our competitors.

Table 1 Statistics of the datasets. The information of the five benchmark datasets is given, including the number of nodes \(\#|\mathcal {V}|\), the number of edges \(\#|\mathcal {E}|\), the number of attributes #|Attrs|, the number of labels #|L|, and the average degree of graphs \(\mathcal {D}\)

For the baseline algorithms, we use the source code released by the authors, and the node embedding dimension is set to 64 for all methods on all datasets. Note that LP-DSSL [10] was originally designed with a convolutional neural network for image classification. To adapt it to the network datasets, we replace the convolutional neural network with a two-layer GCN [12] and use the topology graph of the dataset for label propagation instead of a KNN graph. We randomly sample 30% of the labeled nodes as the training set, and another 100 labeled nodes are sampled as a validation set to tune the hyper-parameters. The remaining unlabeled nodes are used to test the performance of the different algorithms. For our proposed CycProp model, the graph neural network used in our experiments performs 2-hop neighborhood aggregation with dimensions of 128 and 64, respectively. The sampled neighborhood size and the negative context size are both set to 10 for all datasets. We use rectified linear units as the activation function to introduce non-linearity. To measure the classification results, we use both Micro-F1 (Mi-F1) and Macro-F1 (Ma-F1) as evaluation metrics. For the unsupervised node representation methods, the learned node representations are used as features to train a one-vs-rest logistic regression classifier implemented with scikit-learn. Evaluations are repeated 10 times with resampled labels, and the average score and its standard deviation are recorded as the final result.
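
For reference, a sketch of this evaluation protocol for the unsupervised baselines is shown below; the function name, the max_iter value and the index handling are our own choices, not the exact evaluation script.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier


def evaluate_embeddings(emb, labels, train_idx, test_idx):
    """Train a one-vs-rest logistic regression on the learned embeddings and
    report Micro-F1 and Macro-F1 on the held-out nodes."""
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(emb[train_idx], labels[train_idx])
    pred = clf.predict(emb[test_idx])
    return (f1_score(labels[test_idx], pred, average="micro"),
            f1_score(labels[test_idx], pred, average="macro"))
```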

4.3 Results and analysis

The experimental results of different algorithms over different datasets are presented in Table 2. To summarize, we have the following observations.

Table 2 Classification performance on different datasets. Values in (⋅) are the standard deviations over multiple runs. The best performing method in each experiment is in bold

Generally, we find that our proposed CycProp beats all baselines on all datasets under all settings. As expected, structure-only node representation learning methods such as DeepWalk, LINE and node2vec perform worse than approaches that use node attributes (i.e., SNE, EP and GraphSAGE). The reason is that they only capture the graph's topology, which provides limited information for the node classification task compared to node attributes. It is worth noting that the three classical LPAs, LLGC, GFHF and DLP, achieve better performance than the structure-only node representation learning baselines. This further indicates the effectiveness of propagating labels on the graph, which gives a solid foundation for the proposed CycProp model.

In addition, the semi-supervised methods consistently outperform the unsupervised baselines with various gains by incorporating the partially known node labels into the model. One major reason for the performance lift is that these semi-supervised methods are trained in an end-to-end manner, so the learned node representations are specifically optimized for the classifier and show strong discriminability. Finally, our proposed CycProp model is an efficient and direct way to learn the unknown node labels on a graph, as it propagates labels rather than classifying each node independently. In classical two-step LPAs, the edge weights are predetermined and cannot change during the learning process, so their performance is bounded by the first step. To overcome these limitations, the proposed CycProp integrates GNN and label propagation in a unified framework in a mutually reinforcing manner, which results in a considerable performance boost.

It is remarkable that, compared with the other methods that combine neural networks and label propagation, CycProp shows better performance and generalization. This performance gap has two causes. First, while the other methods make predictions with neural networks, CycProp classifies each node by label propagation, which does not suffer from the over-smoothing and over-fitting problems caused by GNN classifiers. Second, different from the other methods that view label propagation as an auxiliary tool for GNNs (such as the regularization term in GCN-LPA or the pseudo-label generator in LP-DSSL), we treat the two algorithms as equal and mutually reinforcing components and integrate them into a joint learning framework. In this way, the advantages of both components are fully leveraged.

Besides, we observe that CycProp has larger performance gains on the two social network datasets (BlogCatalog and Flickr). Specifically, compared to the strongest baseline, the average Mi-F1 gain on the social networks is 1.95%, while that on the citation networks is 0.47%. A possible reason is that the social networks have a higher average degree (see Table 1). On these denser networks, the edge weights play a more important role in label propagation. Different from the other methods, our proposed CycProp adaptively optimizes the edge weights, which brings significant performance gains on the dense networks.

4.4 Ablation study

In this subsection, we investigate how each of the two individual components contributes to the performance of CycProp. Figure 3 shows the classification results of CycProp and its two variants: CycProp-P denotes the case where only the label propagation module is used, and CycProp-G the case where only the graph neural network module is used. The best results are achieved by the full CycProp, which validates the effectiveness of combining the two modules in a mutually reinforcing manner. Moreover, we observe that CycProp-P always outperforms CycProp-G, which indicates that the propagation module plays a more important role in the joint framework. A possible reason is that label propagation efficiently prevents the shortcomings of GNNs, e.g., over-fitting and over-smoothing, especially when the training labels are scarce. This result also verifies the advantage of propagating labels on the graph rather than classifying each node independently.

Figure 3
figure 3

The performance of CycProp and its variants. CycProp-P and CycProp-G indicate the variants only using label propagation module and graph neural network module respectively

4.5 Parameter sensitivity

In this subsection, we study the impact of several parameters by varying them over different scales. Due to the limited space, we only show the sensitivities of the trade-off parameter α in (7) and the node embedding dimension d in Figure 4. As we can see, α = 0.1 is mostly the best across the different datasets; when α is too small or too large, the performance becomes worse. For the node embedding dimension d, the performance first increases with d and then remains stable. Besides, λ is initialized to 0.1, and μ = 10, δ = 0.1 usually give the best results; we do not observe much difference when varying them.

Figure 4
figure 4

The sensitivity of the trade-off parameter α and the node embedding dimension d

5 Conclusions

In this paper, we investigate the semi-supervised learning task on graphs and introduce a unified framework, CycProp, which integrates label propagation and GNNs in a cyclic and mutually reinforcing manner. Specifically, in each iteration, we employ the graph neural network module to learn informative node embeddings that refine the edge weights and thereby facilitate label propagation; the highly reliable labels obtained from the label propagation module are then incorporated into the model to fine-tune the node embedding procedure, forming a closed cyclic training loop. Extensive experiments on five real-world datasets demonstrate the effectiveness of CycProp and its superiority over a range of state-of-the-art methods. The most significant advantage of the proposed CycProp framework is that it avoids the shortcomings of both GNNs and LPA while leveraging the strengths of both. However, CycProp relies on alternating updates to learn the two components, which is practical but may result in a suboptimal solution. We will address this disadvantage in our future work, which is mainly two-fold. First, we plan to extend the idea of combining GNNs and LPA to heterogeneous graphs where nodes and links are of different types. Second, we will investigate a deeper integration of GNNs and LPA based on the advanced techniques of each type of method, focusing on the aforementioned suboptimality problem.