Abstract
Graph neural networks (GNNs) have proven to be powerful tools for graph analysis. The key idea is to recursively propagate and gather information along the edges of a given graph. Despite their success, GNNs are still limited by over-smoothing and by noise in the graph. Over-smoothing means that the representations of all nodes converge to similar values as the number of layers increases. “Noise” edges refer, in this study, to edges with no positive effect on the graph representation. To solve these problems, we propose DropNEdge (Drop “Noise” Edge), which filters useless edges based on two indicators, namely, feature gain and signal-to-noise ratio. DropNEdge can effectively alleviate over-smoothing and remove “noise” edges from the graph. It requires no changes to the network’s structure and can be widely applied to various GNNs. We also show that using DropNEdge in GNNs can be interpreted as an approximation of Bayesian GNNs, so the model’s uncertainty can be obtained.
Supported by ZJFund 2019KB0AB03, NSFC 62076178, TJ-NSF (19JCZDJC31300, 19ZXAZNGX00050).
1 Introduction
Graph neural networks (GNNs) and their many variants have achieved success in graph representation learning by extracting high-level features of nodes from their topological neighborhoods. However, several studies have shown that the performance of GNNs decreases significantly as the number of layers increases [15, 16]. The reason is that node features converge to similar values under continuous aggregation of information. Some existing methods (DropEdge [2], Dropout [3]) address over-smoothing by randomly dropping some information in the graph. Although these methods are efficient, they cannot tell whether the dropped information is harmful or beneficial; hence, they bring only sub-optimal effects.
In addition, GNNs are affected by noise edges [1, 17]. Many real-world graphs contain noise edges, which requires GNNs to be able to identify and remove them. The recursive aggregation mode of GNNs makes each node susceptible to the influence of surrounding nodes. Therefore, a principled way to decide what information not to aggregate will have a positive effect on GNNs’ performance. Topological denoising [1] is an effective solution to this problem: by trimming off edges with no positive impact on the task, it prevents GNNs from aggregating unnecessary information.
In this paper, we propose DropNEdge (Drop “Noise” Edge), which takes the structure and content information of a graph as input and deletes edges with little or no positive effect on the final task based on each node’s signal-to-noise ratio and feature gain. The differences between our method and DropEdge are as follows. First, DropNEdge treats edges unequally and deletes them based on the graph’s information, which is a more reasonable and effective way to address the limitations of GNNs. Second, deleting edges according to the above two indicators ensures that the dropped edges have little or no positive effect on the final task. Therefore, DropNEdge not only alleviates over-smoothing but also removes “noise” edges. It can be applied to most GNNs without changing the network’s structure. Because DropNEdge changes the topology of the graph, it can also be used as a graph data augmentation method.
Considering that Dropout can be used as a Bayesian approximation for general neural networks, we prove that DropNEdge can be used as a Bayesian approximation for GNNs. If DropNEdge is used during both the training and test phases, the model’s uncertainty can be obtained.
The main contributions of our work are presented as follows:
-
We propose DropNEdge, a plug-and-play layer that can be applied to various GNNs. It effectively alleviates the over-smoothing phenomenon and removes “noise” edges from the graph.
-
We show that using DropNEdge in GNNs is an approximation of Bayesian GNNs. In this way, the uncertainty of GNNs can be obtained.
2 Related Work
Deep stacking of layers usually results in a significant decrease in the performance of GNNs such as GCN [13] and GAT [14]. Chen et al. [5] measured and alleviated the over-smoothing problem of GNNs from a topological perspective. Hou et al. [6] proposed two indicators to measure the quantity and quality of information obtained from graph data and designed a new GNN model called CS-GNN. PairNorm [7], a normalization layer based on an analysis of graph convolution operations, was proposed to prevent node embeddings from becoming too similar. DropEdge [2] also effectively relieves over-smoothing by randomly removing a given percentage of edges in the graph.
Another limitation is noise in the graph. A large number of papers show that GNNs are not robust to noise. Recently, graph sampling has been investigated to speed up GNN computation and improve generalization, including neighbor-level [8], node-level [9], and edge-level [10] sampling methods. Unlike these methods, which randomly sample edges during the training phase, PTDNet [4] uses a parametric network to actively remove “noise” edges for specific tasks. Moreover, graph data augmentation strategies have been shown to effectively improve the robustness of GNNs [17].
Bayesian inference is critical for many machine learning systems. Since exact Bayesian inference is intractable, many approximation methods have been proposed, such as the Laplace approximation [20], Markov chain Monte Carlo (MCMC) [21], stochastic gradient MCMC [22], and variational inference [23]. Bernoulli Dropout and its extensions are commonly used in practice because they are fast to compute and easy to implement. Bayesian methods also have applications in GNNs. Zhang et al. [19] proposed Bayesian graph convolutional neural networks for semi-supervised classification. Hasanzadeh et al. [25] proposed a unified framework for adaptive connection sampling in GNNs and showed that training GNNs with adaptive connection sampling is equivalent to an efficient approximation of training Bayesian GNNs.
3 Notations
Let \(\mathcal {G}=\left( \mathcal {V},\mathcal {E}\right) \) represent the input graph of size N with nodes \(v_{i} \in \mathcal {V}\) and edges \(\left( v_{i},v_{j}\right) \in \mathcal {E}\). The node features are denoted as \(\boldsymbol{X} = \{x_{1},x_{2},\cdots ,x_{N}\} \in R^{N \times C}\), and the adjacency matrix is defined as \(\mathcal {A} \in R^{N \times N}\), which associates each edge \(\left( v_{i},v_{j}\right) \) with its element \(\mathcal {A}_{ij}\). The node degrees are given by \({d} = \{d_{1},d_{2},\cdots ,d_{N}\}\), where \(d_{i}\) is the sum of the weights of the edges connected to node i. \(\mathcal {N}_{v_{i}} = \{v_{j}:\left( v_{i},v_{j}\right) \in \mathcal {E}\}\) denotes the set of neighbors of node \(v_{i}\).
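As a concrete illustration of these notations, the adjacency matrix, degrees, and neighbor sets of a small toy graph (hypothetical example data, not from the paper) can be built as follows:

```python
import numpy as np

# Toy undirected graph with N = 4 nodes and unit-weight edges
# (hypothetical example data, not from the paper).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

N = 4
A = np.zeros((N, N))          # adjacency matrix A in R^{N x N}
for i, j in edges:
    A[i, j] = A[j, i] = 1.0   # A_ij holds the weight of edge (v_i, v_j)

d = A.sum(axis=1)             # d_i: sum of edge weights connected to node i
# N_{v_i}: the set of neighbors of each node
neighbors = {i: set(np.flatnonzero(A[i]).tolist()) for i in range(N)}
```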
4 Methodology
4.1 Drop “Noise” Edge
GNNs are superior to existing Euclidean-based methods because they obtain a wealth of information from each node’s neighbors. Therefore, the performance improvement brought by graph data is highly related to the quantity and quality of neighborhood information [11]. DropEdge [2] randomly drops edges in the graph. Although efficient, it does not consider the influence of adjacent nodes’ information on the current node, so it cannot determine whether the deleted information is beneficial or harmful to the task. In contrast, DropNEdge treats edges unequally based on the influence of adjacent nodes’ information on the current node and deletes edges with little or no positive impact on the final task. We use two indexes, signal-to-noise ratio and feature gain, to measure the influence of adjacent nodes on the current node.
Feature Gain. Feature gain is used to measure the information gain of adjacent nodes’ information relative to the current node. Considering that Kullback-Leibler (KL) divergence can measure the amount of information lost when an approximate distribution is adopted, it is used to calculate the information gain of the current node from its adjacent nodes [6]. The definition of KL divergence is stated as follows.
Definition 1
\(C\left( K\right) \) refers to the probability density function (PDF) of \(\tilde{c}_{v_{i}}^{k}\), which is the ground truth and can be estimated by non-parametric methods with a set of samples; each sample point is sampled with probability \(|\mathcal {N}_{v_{i}}|/2|\mathcal {E}|\). S(k) is the PDF of \(\sum _{v_{j} \in N_{v_{i}}} a^{\left( k\right) }_{i,j} \cdot \tilde{c}^{k}_{v_{j}}\), which can be estimated with the set of samples \(\{\sum _{v_{j} \in N_{v_{i}}} a^{\left( k\right) }_{i,j} \cdot \tilde{c}^{k}_{v_{j}}\}\), each point also sampled with probability \(|\mathcal {N}_{v_{i}}|/2|\mathcal {E}|\) [6]. The information gain can be computed by KL divergence [12] as
\[ D_{KL}\left( S\,\Vert\, C\right) = \int S(k) \log \frac{S(k)}{C(k)}\, dk. \]
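The non-parametric, bin-based estimate of this divergence (described in the proof of Theorem 1 below) can be sketched as follows; the bin resolution `r` and the smoothing constant `eps` are illustrative choices, not values from the paper:

```python
import numpy as np

def kl_histogram(samples_s, samples_c, r=4, eps=1e-8):
    """Estimate D_KL(S || C) from samples in [0,1]^d by dividing the
    feature space into r^d hyper-cube bins and comparing the resulting
    histograms. A sketch; `r` and the `eps` smoothing are assumptions."""
    d = samples_s.shape[1]
    # Map each sample to its bin index along every dimension.
    bins_s = np.minimum((samples_s * r).astype(int), r - 1)
    bins_c = np.minimum((samples_c * r).astype(int), r - 1)
    flat_s = np.ravel_multi_index(bins_s.T, (r,) * d)
    flat_c = np.ravel_multi_index(bins_c.T, (r,) * d)
    # |H_i|_S and |H_i|_C as normalized histograms (with smoothing).
    p_s = np.bincount(flat_s, minlength=r ** d) / len(flat_s) + eps
    p_c = np.bincount(flat_c, minlength=r ** d) / len(flat_c) + eps
    return float(np.sum(p_s * np.log(p_s / p_c)))
```

Identical sample sets give zero divergence, while samples from disjoint regions of the feature space give a large positive value.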
In the actual calculation, the true and simulated distributions of the data are unknown. Thus, we use the feature gain, which measures the feature difference between the current node and its adjacent nodes, to approximate the KL divergence. The feature gain of node v is defined as
\[ FG_{v} = \frac{1}{|\mathcal {N}_{v}|}\sum _{v^{\prime } \in \mathcal {N}_{v}} \Vert x_{v}-x_{v^{\prime }}\Vert ^{2}, \]
where \(|\mathcal {N}_{v}|\) is the number of adjacent nodes of node v, and \(x_{v}\) is the representation of node v. Moreover, the feature gain has the following relationship with KL divergence.
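A minimal sketch of this feature-gain computation, assuming features are stored in an (N, C) array and neighbor sets in a dict:

```python
import numpy as np

def feature_gain(X, neighbors, v):
    """FG_v = (1 / |N_v|) * sum over v' in N_v of ||x_v - x_{v'}||^2.
    X: (N, C) node feature matrix; neighbors: dict node -> iterable of
    neighbor indices. Returns 0 for an isolated node (a convention we
    assume here)."""
    nbrs = list(neighbors[v])
    if not nbrs:
        return 0.0
    diffs = X[nbrs] - X[v]                       # x_{v'} - x_v per neighbor
    return float(np.mean(np.sum(diffs ** 2, axis=1)))
```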
Theorem 1
For a node v with feature \(x_{v}\) in space \([0,1]^{d}\), the information gain of the node from the surrounding \(D_{KL}\left( S||C\right) \) is positively related to its feature gain \(FG_{v}\); (i.e., \(D_{KL}\left( S||C\right) \sim FG_{v}\)). In particular, \(D_{KL}\left( S||C\right) =0\), when \(FG_{v}=0\).
Thus, the information gain is positively correlated with the feature gain: the greater the feature gain, the more information the node can obtain from its adjacent nodes. Therefore, we should first deal with nodes with less information gain. If the feature similarity of the nodes on both sides of an edge exceeds a given threshold, the edge should be dropped. In this way, edges with a significant impact on the task can be retained. The proof of Theorem 1 is as follows.
Proof
For \(D_{KL}(S\Vert C)\), since the PDFs of C and S are unknown, they are estimated in a non-parametric way. Specifically, the feature space \(X=[0,1]^{d}\) is divided uniformly into \(r^{d}\) bins \(\{H_{1}, H_{2},\cdots , H_{r^{d}}\}\) of length 1/r and dimension d. To simplify notation, \(|H_{i}|_{C}\) and \(|H_{i}|_{S}\) denote the numbers of samples of C and S that fall in bin \(H_{i}\). Thus, we yield
\[ D_{KL}(\hat{S}\,\Vert\, \hat{C}) \propto \sum _{i=1}^{r^{d}} |H_{i}|_{S} \left( \log |H_{i}|_{S} - \log \left( |H_{i}|_{S}+\varDelta _{i}\right) \right) , \]
where \(\varDelta _{i}=\left| H_{i}\right| _{C}-\left| H_{i}\right| _{S}\). Regarding \(\varDelta _{i}\) as an independent variable, we expand the term \(\sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left( \left| H_{i}\right| _{S}+\varDelta _{i}\right) \) with a second-order Taylor approximation at point 0 as
\[ \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \log \left( \left| H_{i}\right| _{S}+\varDelta _{i}\right) \approx \sum _{i=1}^{r^{d}}\left( \left| H_{i}\right| _{S} \log \left| H_{i}\right| _{S}+\varDelta _{i}-\frac{\varDelta _{i}^{2}}{2\left| H_{i}\right| _{S}}\right) . \]
Note that the numbers of samples for the context and the surrounding are the same, so we have
\[ \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{C}=\sum _{i=1}^{r^{d}}\left( \left| H_{i}\right| _{S}+\varDelta _{i}\right) =\sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S}. \]
Thus, we obtain \(\sum _{i=1}^{r^{d}} \varDelta _{i}=0\). Therefore, \(D_{KL}(\hat{S}\Vert \hat{C})\) can be written as
\[ D_{KL}(\hat{S}\,\Vert\, \hat{C}) \propto \sum _{i=1}^{r^{d}}\frac{\varDelta _{i}^{2}}{2\left| H_{i}\right| _{S}}. \]
Regarding \(\left| H_{i}\right| _{S}\) as constant, we see that if \(\varDelta _{i}^{2}\) is large, the information gain \(D_{KL}(S\Vert C)\) tends to be large. The above derivation follows Reference [6].
Considering a node and its adjacent nodes, the samples of C equal \(x_{v}\) and the samples of S are sampled from \(\{x_{v^{\prime }}:{v^{\prime }} \in {\mathcal {N}_{v}}\}\). For the distribution of the difference between the surrounding and the context, we consider \(x_{v^{\prime }}\) as noise on the “expected” signal, and \(x_{v}\) is the “observed” signal. Then the difference between C and S is \(\frac{1}{|\mathcal {N}_{v}|}\sum _{v^{\prime } \in \mathcal {N}_{v}}||x_{v}-x_{v^{\prime }}||^{2}\), which is exactly the definition of \(FG_{v}\). Thus, we obtain
\[ \sum _{i=1}^{r^{d}} \varDelta _{i}^{2} \sim FG_{v}. \]
Therefore,
\[ D_{KL}\left( S\,\Vert\, C\right) \sim FG_{v}. \]
If \(FG_{v}=0\), the feature vectors of the current node and its adjacent nodes are identical; thus, \(D_{KL}(S \Vert C) = 0\). \(\square \)
Signal-to-Noise Ratio. The reason for the over-smoothing of GNNs is the low signal-to-noise ratio of the received information. When aggregation among samples in different categories is excessive, the node representations of different classes become similar. Thus, we assume that aggregation between nodes of different categories is harmful and brings noise, while aggregation between nodes of the same category brings useful signal. Here the signal-to-noise ratio of node v is defined as
\[ SNR_{v} = \frac{ds_{v}}{dh_{v}}, \]
where \(ds_{v}\) and \(dh_{v}\) represent the sums of the weights of the edges connecting node v to homogeneous and heterogeneous nodes, respectively. Therefore, for a node with a small signal-to-noise ratio, we drop the edges connected to heterogeneous nodes until its signal-to-noise ratio exceeds the given threshold.
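A sketch of this pruning step under the stated rule; the choice to drop the heaviest heterogeneous edge first is an assumption, and node labels are assumed known (e.g., on the training set):

```python
import numpy as np

def snr_drop(A, labels, v, threshold=1.0):
    """Drop edges from node v to heterogeneous (different-label) neighbors
    until ds_v / dh_v exceeds `threshold`. A sketch: the drop order
    (heaviest heterogeneous edge first) is an illustrative assumption."""
    A = A.copy()
    while True:
        nbrs = np.flatnonzero(A[v])
        ds = A[v, nbrs[labels[nbrs] == labels[v]]].sum()   # homogeneous weight
        het = nbrs[labels[nbrs] != labels[v]]
        dh = A[v, het].sum()                               # heterogeneous weight
        if dh == 0 or ds / dh > threshold:
            return A
        u = het[np.argmax(A[v, het])]                      # heaviest noise edge
        A[v, u] = A[u, v] = 0.0
```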
Algorithm of DropNEdge. The specific procedure of DropNEdge is shown in Algorithm 1. In this algorithm, if the ratio of deleted “noise” edges \(r_{1}\) is set to 0, DropNEdge reduces to DropEdge.
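Algorithm 1 itself is not reproduced here; the following is a plausible sketch of the feature-gain branch only, in which edges whose endpoint features are nearly identical are deleted first. The cosine-similarity criterion and the parameter names `r1` and `sim_threshold` are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def drop_noise_edges(A, X, r1=0.3, sim_threshold=0.9):
    """Among all edges whose endpoint features are more similar than
    `sim_threshold` (cosine similarity), delete a fraction r1, most
    similar first. A sketch of the feature-gain branch of DropNEdge;
    the full Algorithm 1 also uses the signal-to-noise ratio."""
    A = A.copy()
    ii, jj = np.triu_indices_from(A, k=1)
    mask = A[ii, jj] > 0
    ii, jj = ii[mask], jj[mask]
    # Cosine similarity of endpoint features as a proxy for low feature gain.
    num = np.sum(X[ii] * X[jj], axis=1)
    den = np.linalg.norm(X[ii], axis=1) * np.linalg.norm(X[jj], axis=1) + 1e-12
    sim = num / den
    cand = np.argsort(-sim)                 # most similar edges first
    cand = cand[sim[cand] > sim_threshold]  # only near-duplicate endpoints
    n_drop = int(r1 * len(cand))
    for k in cand[:n_drop]:
        A[ii[k], jj[k]] = A[jj[k], ii[k]] = 0.0
    return A
```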
4.2 Connection with Bayesian GNNs
Dropout can be viewed as an approximation of Bayesian neural networks [18]; hence, we show that DropNEdge can be viewed as an approximation of Bayesian GNNs. We target the inference of the joint posterior of the random graph parameters, the weights in the GNN, and the nodes’ labels. Given that we are usually not directly interested in inferring the graph parameters, posterior estimates of the labels are obtained by marginalization [13]. The goal is to compute the posterior probability of labels:
\[ p\left( \boldsymbol{Z} \mid \boldsymbol{Y}, \boldsymbol{X}, \mathcal {G}_{obs}\right) =\int p\left( \boldsymbol{Z} \mid W, \mathcal {G}, \boldsymbol{X}\right) p\left( W \mid \boldsymbol{Y}, \boldsymbol{X}, \mathcal {G}\right) p\left( \mathcal {G} \mid \lambda \right) p\left( \lambda \mid \mathcal {G}_{obs}\right) \, dW \, d\mathcal {G} \, d\lambda , \]
where W is a random variable representing the weights of the Bayesian GNN over the graph \(\mathcal {G}\), and \(\lambda \) characterizes a family of random graphs. This integral is intractable; we can adopt a number of strategies, including variational methods [24] and Markov chain Monte Carlo (MCMC) [21], to approximate it. A Monte Carlo approximation of it is [13]
\[ p\left( \boldsymbol{Z} \mid \boldsymbol{Y}, \boldsymbol{X}, \mathcal {G}_{obs}\right) \approx \frac{1}{V} \sum _{v=1}^{V} \frac{1}{N_{G}} \sum _{i=1}^{N_{G}} \frac{1}{S} \sum _{s=1}^{S} p\left( \boldsymbol{Z} \mid W_{s,i,v}, \mathcal {G}_{i,v}, \boldsymbol{X}\right) . \]
In this approximation, V samples \(\lambda _{v}\) are drawn from \(p\left( \lambda |\mathcal {G}_{obs}\right) \), \(N_{G}\) graphs \(\mathcal {G}_{i,v}\) are sampled from \(p\left( \mathcal {G}|\lambda _{v}\right) \), and S weight matrices \(W_{s,i,v}\) are sampled from \(p\left( W|\boldsymbol{Y},\boldsymbol{X},\mathcal {G}_{i,v}\right) \) in the Bayesian GNN corresponding to the graph \(\mathcal {G}_{i,v}\) [19]. The sampled \(W_{s,i,v}\) and \(\mathcal {G}_{i,v}\) can be obtained from GNNs with DropNEdge. Thus, if we turn on DropNEdge during both the training and test phases, the model’s uncertainty can be obtained.
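This Monte Carlo estimate can be sketched as follows: keep DropNEdge active at test time and average the class probabilities of several stochastic forward passes. Here `predict_fn` is a stand-in for a GNN whose perturbed graph is re-sampled on every call; the entropy-based uncertainty measure is an illustrative choice:

```python
import numpy as np

def mc_predict(predict_fn, X, T=10):
    """Average T stochastic forward passes (DropNEdge left on at test
    time) to approximate the label posterior; disagreement across the
    passes signals predictive uncertainty."""
    probs = np.stack([predict_fn(X) for _ in range(T)])  # (T, N, classes)
    mean = probs.mean(axis=0)            # approximate posterior over labels
    # Predictive entropy per node: high when the passes disagree.
    entropy = -np.sum(mean * np.log(mean + 1e-12), axis=1)
    return mean, entropy
```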
5 Experiments
5.1 Performance Comparison
We compare the performance of four GNN models with DropNEdge (DNE) and DropEdge (DE). The results are shown in Table 1. In most cases, DropNEdge improves the performance of all four models. The improvement is depicted more clearly in Fig. 1, which shows the average improvement brought by DropNEdge for different numbers of layers. For example, on the Cora dataset, DropNEdge brings a 6.9% average improvement to the models with 32 layers.
The results of models with and without DropNEdge are shown in Table 2. The models with DropNEdge consistently outperform those without it, which demonstrates the effect of DropNEdge. Figure 2 (a) and (b) compare the validation losses of models with DropNEdge and with DropEdge; models with DropNEdge converge faster, and their losses are smaller.
The superiority of DropNEdge lies in the following reasons: (1) DropNEdge avoids excessive aggregation of node information by dropping “noise” edges, which alleviates the over-smoothing phenomenon effectively. (2) It removes “noise” edges and retains meaningful edges, which prevents the transmission of harmful information. (3) DropNEdge can be used as a graph data augmentation method.
5.2 Remove “Noise” Edges
We randomly add a given proportion of edges, set to 0.3 in this experiment, to the graph of the Cora dataset. The added edges are considered “noise” edges. We vary the ratio of deleted “noise” edges \(r_{1}\) and count the proportion of deleted added edges to total added edges (\(r_{N}\)) and of deleted non-added edges to real edges (\(r_{T}\)). The model used is GCN-8, and the results are shown in Fig. 3 (a) and (b). From Fig. 3 (a), we can see that DropNEdge removes “noise” edges preferentially, because \(r_{N}\) is always greater than \(r_{T}\) regardless of the ratio \(r_{1}\). Figure 3 (b) shows that the model’s accuracy increases as \(r_{1}\) increases. However, when too many edges are deleted, meaningful aggregation of information also decreases, so the accuracy of the model drops.
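The two ratios can be computed with simple set arithmetic; edges are stored as frozensets so that direction does not matter (the function and argument names are illustrative):

```python
def noise_removal_rates(real_edges, added_edges, kept_edges):
    """Compute r_N (fraction of injected noise edges deleted) and r_T
    (fraction of real edges deleted) after pruning. Edge lists contain
    (u, v) pairs; frozensets make them direction-independent."""
    real = {frozenset(e) for e in real_edges}
    added = {frozenset(e) for e in added_edges}
    kept = {frozenset(e) for e in kept_edges}
    r_n = len(added - kept) / len(added)   # deleted added / total added
    r_t = len(real - kept) / len(real)     # deleted real / total real
    return r_n, r_t
```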
5.3 Suppress Over-Smoothing
Over-smoothing occurs when the top-level output of a GNN converges to a subspace and becomes irrelevant to the input as the depth increases. Considering that the convergent subspace cannot be derived explicitly, we measure the degree of smoothing by calculating the Euclidean distance between the outputs of the current layer and the previous layer. The smaller the distance, the more severe the over-smoothing. This experiment is carried out on GCN-8 with the Cora dataset; the results are shown in Fig. 4.
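The smoothing metric used here can be sketched as the Euclidean distance between the node representations of consecutive layers:

```python
import numpy as np

def layer_distances(hidden_states):
    """Euclidean distance between consecutive layers' node
    representations; smaller distances indicate more severe
    over-smoothing. `hidden_states` is a list of (N, C) arrays,
    one per layer."""
    return [float(np.linalg.norm(h2 - h1))
            for h1, h2 in zip(hidden_states, hidden_states[1:])]
```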
DropNEdge is better than DropEdge at suppressing over-smoothing. As the number of layers increases, the distances between layers increase for models with both DropEdge and DropNEdge. Furthermore, the distance increases faster for the model with DropNEdge than for the model with DropEdge.
5.4 Layer Independent DropNEdge
In the DropNEdge described above, all layers share the same perturbed adjacency matrix. In fact, DropNEdge can be performed independently for each layer, so that different layers have different adjacency matrices. This layer-independent (LI) version brings more randomness and distortion of the original data. We experimentally compare its performance with that of shared DropNEdge on the Cora dataset. The model used is GCN-8, and the validation and training losses of shared and layer-independent DropNEdge are compared in Fig. 5 (a). Although layer-independent DropNEdge may achieve better results, we still prefer shared DropNEdge, which not only reduces the risk of over-fitting but also reduces the computational complexity.
5.5 Model Uncertainty
To obtain the model’s uncertainty, we turn on DropNEdge during both the training and test phases and set the ratio of deleted “noise” edges to 0.3. The experiment is carried out on GCN-8 with the Cora dataset. When the model predicts multiple times, different predictions may be produced for the same sample. Figure 5 (b) shows the ratios of the different labels obtained in 10 predictions for 10 samples. For example, for sample one, 40% of the ten predictions are class 0 and 60% are class 3. Thus, the confidence of the predictions can be obtained by using DropNEdge. For high-confidence samples, that is, samples with consistent results over multiple predictions, the model’s predictions can be used directly. If the model’s predictions for some samples vary greatly, other models should be used or the labels should be determined manually to obtain more reasonable predictions.
6 Conclusion
This paper proposes DropNEdge, a novel and effective method to alleviate the over-smoothing phenomenon and remove “noise” edges in graphs. It considers two indicators based on the graph’s information, namely, feature gain and signal-to-noise ratio. By using DropNEdge, over-smoothing is alleviated and “noise” edges with no positive impact on the final task are removed, thereby improving the performance of GNNs. DropNEdge does not require changes to the network’s structure and can be applied to various GNNs.
References
Luo, D., et al.: Learning to drop: robust graph neural network via topological denoising. In: Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM 2021), pp. 779–787. Association for Computing Machinery (2021). https://doi.org/10.1145/3437963.3441734
Rong, Y., Huang, W., Xu, T., Huang, J.: DropEdge: towards deep graph convolutional networks on node classification. In: Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa (2020). https://openreview.net/forum?id=Hkx1qkrKPr
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Li, Q., Han, Z., Wu, X.-M.: Deeper insights into graph convolutional networks for semi-supervised learning. In: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 3538–3545. AAAI Press, Louisiana (2018). https://arxiv.org/abs/1801.07606
Chen, D., Lin, Y., Li, W., Li, P., Zhou, J., Sun, X.: Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pp. 3438–3445. AAAI Press, New York (2020). https://doi.org/10.1609/aaai.v34i04.5747
Hou, Y., et al.: Measuring and improving the use of graph information in graph neural networks. In: The Eighth International Conference on Learning Representations (ICLR 2020), Addis Ababa (2020). https://openreview.net/forum?id=rkeIIkHKvS
Zhao, L., Akoglu, L.: PairNorm: tackling oversmoothing in GNNs. In: Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa (2020). https://arxiv.org/abs/1909.12223
Hamilton, W.L., Ying, R., Leskovec, J.: Inductive representation learning on large graphs. In: Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017), pp. 1025–1035. Neural Information Processing Systems Foundation, California (2017). https://arxiv.org/abs/1706.02216
Chen, J., Ma, T., Xiao, C.: FastGCN: fast learning with graph convolutional networks via importance sampling. In: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver (2018). https://arxiv.org/abs/1801.10247
Fortunato, S.: Community detection in graphs. Phys. Rep. 486(3–5), 75–174 (2010)
Zhou, J., et al.: Graph neural networks: a review of methods and applications. arXiv:1812.08434 (2021)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon (2017). https://openreview.net/pdf?id=SJU4ayYgl
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver (2018). https://arxiv.org/abs/1710.10903
Yang, C., Wang, R., Yao, S., Liu, S., Abdelzaher, T.: Revisiting over-smoothing in deep GCNs. arXiv:2003.13663 (2020)
Cai, C., Wang, Y.: A note on over-smoothing for graph neural networks. In: The Thirty-Seventh International Conference on Machine Learning (ICML 2020), International Machine Learning Society, Online (2020). https://arxiv.org/abs/2006.13318
Fox, J., Rajamanickam, S.: How robust are graph neural networks to structural noise? arXiv:1912.10206 (2019)
Gal, Y., Ghahramani, Z.: Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In: The Thirty-Third International Conference on Machine Learning (ICML 2016), International Machine Learning Society, New York (2016). https://arxiv.org/abs/1506.02142v1
Zhang, Y., Pal, S., Coates, M., Üstebay, D.: Bayesian graph convolutional neural networks for semi-supervised classification. In: The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 5829–5836. AAAI Press, Hawaii (2019). https://arxiv.org/abs/1811.11103
MacKay, D.J.C.: Bayesian Methods for Adaptive Models. Ph.D. thesis, California Institute of Technology (1992)
Neal, R.M.: Bayesian Learning for Neural Networks. Springer, New York (1996)
Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient Langevin dynamics. In: Proceedings of the 28th International Conference on Machine Learning (ICML 2011), pp. 681–688. Association for Computing Machinery, Washington (2011). https://dl.acm.org/doi/10.5555/3104482.3104568
Blei, D.M., Kucukelbir, A., McAuliffe, J.D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112(518), 859–877 (2017)
Hoffman, M.D., Blei, D.M., Wang, C., Paisley, J.: Stochastic variational inference. J. Mach. Learn. Res. 14(4), 1303–1347 (2013)
Hasanzadeh, A., et al.: Bayesian graph neural networks with adaptive connection sampling. arXiv:2006.04064 (2020)
© 2022 Springer Nature Switzerland AG
Cite this paper
Zhou, X., Wu, O. (2022). Drop “Noise” Edge: An Approximation of the Bayesian GNNs. In: Wallraven, C., Liu, Q., Nagahara, H. (eds) Pattern Recognition. ACPR 2021. Lecture Notes in Computer Science, vol 13189. Springer, Cham. https://doi.org/10.1007/978-3-031-02444-3_5
Print ISBN: 978-3-031-02443-6
Online ISBN: 978-3-031-02444-3