1 Introduction

Graph neural networks (GNNs) and their many variants have achieved success in graph representation learning by extracting high-level features of nodes from their topological neighborhoods. However, several studies have shown that the performance of GNNs decreases significantly as the number of layers increases [15, 16], because node representations converge to similar values under repeated aggregation. Some existing methods (DropEdge [2], Dropout [3]) address over-smoothing by randomly dropping part of the information in the graph. Although these methods are efficient, they cannot determine whether the dropped information is harmful or beneficial, and hence they yield only sub-optimal results.

In addition, GNNs are affected by noise edges [1, 17]. Many real-world graphs contain noise edges, which requires GNNs to be able to identify and remove them. The recursive aggregation scheme of GNNs makes a node susceptible to the influence of its surrounding nodes. Therefore, a principled way to decide what information not to aggregate has a positive effect on GNNs’ performance. Topological denoising [1] is an effective solution to this problem: by trimming off edges with no positive impact on the task, GNNs are prevented from aggregating unnecessary information.

In this paper, we propose DropNEdge (Drop “Noise” Edge), which takes the structure and content information of a graph as input and deletes edges with little or no positive effect on the final task based on each node’s signal-to-noise ratio and feature gain. The differences between our method and DropEdge are as follows. First, DropNEdge treats edges unequally and deletes them based on the graph’s information, which is a more reasonable and effective way to address the limitations of GNNs. Second, deleting edges according to these two measures ensures that the dropped edges have little or no positive effect on the final task, so DropNEdge not only alleviates over-smoothing but also removes “noise” edges. DropNEdge is applicable to most GNNs and does not require changing the network structure. Because it changes the topology of the graph, it can also serve as a graph data augmentation method.

Considering that Dropout can be used as a Bayesian approximation for general neural networks, we prove that DropNEdge can be used as a Bayesian approximation for GNNs. If DropNEdge is applied during both the training and test phases, the model’s uncertainty can be obtained.

The main contributions of our work are presented as follows:

  • We propose DropNEdge, a plug-and-play layer that is widely applicable to various GNNs. It effectively alleviates the over-smoothing phenomenon and removes “noise” edges in the graph.

  • We show that the use of DropNEdge in GNNs is an approximation of the Bayesian GNNs. In this way, the uncertainty of GNNs can be obtained.

2 Related Work

Deep stacking of layers usually results in a significant decrease in the performance of GNNs such as GCN [13] and GAT [14]. Chen et al. [5] measured and alleviated the over-smoothing problem of GNNs from a topological perspective. Hou et al. [6] proposed two indicators to measure the quantity and quality of information obtained from graph data and designed a new GNN model called CS-GNN. To prevent node embeddings from becoming too similar, PairNorm [7], a normalization layer based on an analysis of graph convolution operations, was proposed. DropEdge [2] also effectively relieves over-smoothing by randomly removing a given percentage of edges in the graph.

Another limitation is noise in the graph: a large number of studies show that GNNs are not robust to noise. Recently, graph sampling has been investigated to speed up computation and improve the generalization ability of GNNs, including neighbor-level [8], node-level [9], and edge-level [10] sampling methods. Unlike these methods, which randomly sample edges during the training phase, PTDNet [4] uses a parametric network to actively remove “noise” edges for specific tasks. Moreover, graph data augmentation strategies have been shown to effectively improve the robustness of GNNs [17].

Bayesian neural networks are an active research topic and are critical for many machine learning systems. Since exact Bayesian inference is intractable, many approximation methods have been proposed, such as the Laplace approximation [20], Markov chain Monte Carlo (MCMC) [21], stochastic gradient MCMC [22], and variational inference [23]. Bernoulli Dropout and its extensions are commonly used in practice because they are fast and easy to implement. Bayesian methods also have applications in GNNs. Zhang et al. [19] proposed a Bayesian graph convolutional neural network for semi-supervised classification. Hasanzadeh et al. [25] proposed a unified framework for adaptive connection sampling in GNNs and showed that training GNNs with adaptive connection sampling is equivalent to an efficient approximation of training Bayesian GNNs.

3 Notations

Let \(\mathcal {G}=\left( \mathcal {V},\mathcal {E}\right) \) denote the input graph of size N with nodes \(v_{i} \in \mathcal {V}\) and edges \(\left( v_{i},v_{j}\right) \in \mathcal {E}\). The node features are denoted as \(\boldsymbol{X} = \{x_{1},x_{2},\cdots ,x_{N}\} \in R^{N \times C}\), and the adjacency matrix is defined as \(\mathcal {A} \in R^{N \times N}\), which associates each edge \(\left( v_{i},v_{j}\right) \) with its element \(\mathcal {A}_{ij}\). The node degrees are given by \({d} = \{d_{1},d_{2},\cdots ,d_{N}\}\), where \(d_{i}\) is the sum of the edge weights connected to node \(v_{i}\). \(\mathcal {N}_{v_{i}} = \{v_{j}:\left( v_{i},v_{j}\right) \in \mathcal {E}\}\) denotes the set of neighbors of node \(v_{i}\).
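For concreteness, this notation maps directly onto simple array-based data structures. The following minimal NumPy sketch uses an illustrative unweighted toy graph (not one of the data sets used later):

import numpy as np

N, C = 5, 3                                    # number of nodes N, feature dimension C
X = np.random.rand(N, C)                       # node features X in R^{N x C}
A = np.zeros((N, N))                           # adjacency matrix A in R^{N x N}
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # undirected, unweighted edges (v_i, v_j)
    A[i, j] = A[j, i] = 1.0
d = A.sum(axis=1)                              # degrees d_i: sum of incident edge weights
neighbors = {i: np.flatnonzero(A[i]) for i in range(N)}  # neighbor sets N_{v_i}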

4 Methodology

4.1 Drop “Noise” Edge

GNNs are superior to existing Euclidean-based methods because they obtain a wealth of information from the nodes’ neighbors. Therefore, the performance improvement brought by graph data is highly related to the quantity and quality of neighborhood information [11]. DropEdge [2] randomly drops edges in the graph. Although it is efficient, it does not consider the influence of adjacent nodes’ information on the current node, so it cannot determine whether the deleted information is beneficial or harmful to the task. In contrast, DropNEdge treats edges unequally according to the influence of adjacent nodes’ information on the current node and deletes edges with little or no positive impact on the final task. We use two measures, the signal-to-noise ratio and the feature gain, to quantify the influence of adjacent nodes on the current node.

Feature Gain. Feature gain is used to measure the information gain of adjacent nodes’ information relative to the current node. Considering that Kullback-Leibler (KL) divergence can measure the amount of information lost when an approximate distribution is adopted, it is used to calculate the information gain of the current node from its adjacent nodes [6]. The definition of KL divergence is stated as follows.

Definition 1

\(C^{\left( k\right) }\) refers to the probability density function (PDF) of \(\tilde{c}_{v_{i}}^{k}\), which is the ground truth and can be estimated by non-parametric methods from a set of samples, each sampled with probability \(|\mathcal {N}_{v_{i}}|/2|\mathcal {E}|\). \(S^{\left( k\right) }\) is the PDF of \(\sum _{v_{j} \in \mathcal {N}_{v_{i}}} a^{\left( k\right) }_{i,j} \cdot \tilde{c}^{k}_{v_{j}}\), which can be estimated from a set of samples \(\{\sum _{v_{j} \in \mathcal {N}_{v_{i}}} a^{\left( k\right) }_{i,j} \cdot \tilde{c}^{k}_{v_{j}}\}\), each also sampled with probability \(|\mathcal {N}_{v_{i}}|/2|\mathcal {E}|\) [6]. The information gain can be computed by the KL divergence [12] as:

$$\begin{aligned} D_{KL}\left( S^{\left( k\right) }||C^{\left( k\right) }\right) =\int _{x}S^{\left( k\right) }\left( x\right) \cdot \log \frac{S^{\left( k\right) }\left( x\right) }{C^{\left( k\right) }\left( x\right) }\,dx. \end{aligned}$$
(1)

In practice, the true and estimated distributions of the data are unknown. Thus, we use the feature gain, which measures the feature difference between the current node and its adjacent nodes, as a surrogate for the KL divergence. The feature gain of node v is defined as

$$\begin{aligned} FG_{v}=\frac{1}{|\mathcal {N}_{v}|}\sum _{v^{\prime } \in \mathcal {N}_{v}}||x_{v}-x_{v^{\prime }}||^{2}, \end{aligned}$$
(2)

where \(|\mathcal {N}_{v}|\) is the number of adjacent nodes of node v, and \(x_{v}\) is the representation of node v. Moreover, the feature gain has the following relationship with KL divergence.

Theorem 1

For a node v with feature \(x_{v}\) in space \([0,1]^{d}\), the information gain of the node from its surrounding, \(D_{KL}\left( S||C\right) \), is positively related to its feature gain \(FG_{v}\) (i.e., \(D_{KL}\left( S||C\right) \sim FG_{v}\)). In particular, \(D_{KL}\left( S||C\right) =0\) when \(FG_{v}=0\).

Thus, the information gain is positively correlated with the feature gain; that is, a larger feature gain means that the node can obtain more information from its adjacent nodes. Therefore, we deal first with the nodes that have less information gain. If the feature similarity of the nodes on both sides of an edge exceeds a given threshold, the edge is dropped. In this way, edges with a significant impact on the task are retained. The proof of Theorem 1 is as follows.

Proof

For \(D_{KL}(S||C)\), since the PDFs of C and S are unknown, they are estimated in a non-parametric way. Specifically, the feature space \(X=[0,1]^{d}\) is divided uniformly into \(r^{d}\) bins \(\{H_{1}, H_{2},\cdots , H_{r^{d}}\}\), each of side length 1/r in d dimensions. To simplify notation, \(|H_{i}|_{C}\) and \(|H_{i}|_{S}\) denote the numbers of samples of C and S, respectively, that fall in bin \(H_{i}\). Thus, we have

$$\begin{aligned} D_{K L}(S \Vert C)&\approx D_{K L}(\hat{S} \Vert \hat{C}) \nonumber \\&=\sum _{i=1}^{r^{d}} \frac{\left| H_{i}\right| _{S}}{2|\mathcal {E}|} \cdot \log \frac{\frac{\left| H_{i}\right| _{S}}{2|\mathcal {E}|}}{\frac{\left| H_{i}\right| _{C}}{2|\mathcal {E}|}} \nonumber \\&=\frac{1}{2|\mathcal {E}|} \cdot \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \frac{\left| H_{i}\right| _{S}}{\left| H_{i}\right| _{C}} \nonumber \\&=\frac{1}{2|\mathcal {E}|} \cdot \left( \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left| H_{i}\right| _{S}-\sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left| H_{i}\right| _{C}\right) \nonumber \\&=\frac{1}{2|\mathcal {E}|} \cdot \left( \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left| H_{i}\right| _{S}-\sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left( \left| H_{i}\right| _{S}+\varDelta _{i}\right) \right) , \end{aligned}$$
(3)

where \(\varDelta _{i}=\left| H_{i}\right| _{C}-\left| H_{i}\right| _{S}\). Regarding \(\varDelta _{i}\) as an independent variable, we expand the term \(\sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left( \left| H_{i}\right| _{S}+\varDelta _{i}\right) \) with a second-order Taylor approximation around 0 as

$$\begin{aligned} \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \log \left( \left| H_{i}\right| _{S}+\varDelta _{i}\right) \approx \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S} \cdot \left( \log \left| H_{i}\right| _{S}+\frac{\ln 2}{\left| H_{i}\right| _{S}} \cdot \varDelta _{i}-\frac{\ln 2}{2\left( \left| H_{i}\right| _{S}\right) ^{2}} \cdot \varDelta _{i}^{2}\right) . \end{aligned}$$
(4)

Note that the numbers of samples for the context and the surrounding are the same, i.e.,

$$\begin{aligned} \sum _{i=1}^{r^{d}}\left| H_{i}\right| _{C}=\sum _{i=1}^{r^{d}}\left| H_{i}\right| _{S}=2 \cdot |\mathcal {E}| . \end{aligned}$$
(5)

Thus, we obtain \(\sum _{i=1}^{r^{d}} \varDelta _{i}=0\). Therefore, \(D_{KL}(\hat{S}||\hat{C})\) can be written as

$$\begin{aligned} D_{KL}\left( S||C\right)&\approx D_{KL}\left( \hat{S}||\hat{C}\right) \nonumber \\&=\frac{1}{2|\mathcal {E}|} \cdot \left( \sum _{i=1}^{r^{d}}|H_{i}|_{S} \cdot \log |H_{i}|_{S}-\sum _{i=1}^{r^{d}}|H_{i}|_{S} \cdot \log \left( |H_{i}|_{S}+\varDelta _{i}\right) \right) \nonumber \\&\approx \frac{1}{2|\mathcal {E}|} \cdot \left( \sum _{i=1}^{r^{d}}|H_{i}|_{S} \cdot \left( -\frac{\ln 2}{|H_{i}|_{S}} \cdot \varDelta _{i}+\frac{\ln 2}{2\left( |H_{i}|_{S}\right) ^{2}} \cdot \varDelta _{i}^{2}\right) \right) \nonumber \\&=\frac{1}{2|\mathcal {E}|} \cdot \sum _{i=1}^{r^{d}}\left( \frac{\ln 2}{2|H_{i}|_{S}} \cdot \varDelta _{i}^{2}-\ln 2 \cdot \varDelta _{i}\right) \nonumber \\&=\frac{\ln 2}{4|\mathcal {E}|} \cdot \sum _{i=1}^{r^{d}} \frac{\varDelta _{i}^{2}}{|H_{i}|_{S}}. \end{aligned}$$
(6)

If we regard \(\left| H_{i}\right| _{S}\) as a constant, then a large \(\varDelta _{i}^{2}\) implies a large information gain \(D_{K L}(S|| C)\). The above derivation follows Reference [6].

Considering the case of a node and its adjacent nodes, the samples of C are all equal to \(x_{v}\), and the samples of S are drawn from \(\{x_{v^{\prime }}:{v^{\prime }} \in {\mathcal {N}_{v}}\}\). For the distribution of the difference between the surrounding and the context, we regard each \(x_{v^{\prime }}\) as a noisy version of the “expected” signal and \(x_{v}\) as the “observed” signal. The difference between C and S is then \(\frac{1}{|\mathcal {N}_{v}|}\sum _{v^{\prime } \in \mathcal {N}_{v}}||x_{v}-x_{v^{\prime }}||^{2}\), which is exactly the definition of \(FG_{v}\). Thus, we obtain

$$\begin{aligned} \sum _{i=1}^{r^{d}} {\varDelta _{i}^{2}}\sim FG_{v}. \end{aligned}$$
(7)

Therefore,

$$\begin{aligned} D_{K L}(S \Vert C) =\frac{\ln 2}{4|\mathcal {E}|} \sum _{i=1}^{r^{d}} \frac{\varDelta _{i}^{2}}{\left| H_{i}\right| _{S}} \sim FG_{v}. \end{aligned}$$
(8)

If \(FG_{v}=0\), the feature vectors of the current node and its adjacent nodes are identical; thus, \(D_{K L}(S || C) = 0\).   \(\square \)

Signal-to-Noise Ratio. One reason for the over-smoothing of GNNs is the low signal-to-noise ratio of the received information. When aggregation across samples of different categories is excessive, the node representations of different classes become similar. Thus, we assume that aggregation between nodes of different categories is harmful and introduces noise, whereas aggregation between nodes of the same category provides useful signal. Here the signal-to-noise ratio is defined as

$$\begin{aligned} In_{v} = \frac{ds_{v}}{dh_{v}}, \end{aligned}$$
(9)

where \(ds_{v}\) and \(dh_{v}\) denote the sums of the edge weights connecting node v to its homogeneous and heterogeneous neighbors, respectively. Therefore, for a node with a small signal-to-noise ratio, we drop edges connected to heterogeneous nodes until its signal-to-noise ratio exceeds the given threshold.
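A minimal sketch of how the two indicators might be computed per node, assuming NumPy arrays for the features and adjacency matrix and a vector of node labels (in practice only training labels or pseudo-labels would be available; the function names are illustrative):

import numpy as np

def feature_gain(X, A):
    """FG_v = (1/|N_v|) * sum_{v' in N_v} ||x_v - x_v'||^2, cf. Eq. (2)."""
    fg = np.zeros(X.shape[0])
    for v in range(X.shape[0]):
        nbrs = np.flatnonzero(A[v])
        if nbrs.size > 0:
            fg[v] = np.mean(np.sum((X[v] - X[nbrs]) ** 2, axis=1))
    return fg

def signal_to_noise(A, labels):
    """In_v = ds_v / dh_v, cf. Eq. (9); labels are assumed known or estimated."""
    snr = np.zeros(A.shape[0])
    for v in range(A.shape[0]):
        nbrs = np.flatnonzero(A[v])
        same = labels[nbrs] == labels[v]
        ds = A[v, nbrs[same]].sum()    # edge weight to homogeneous neighbors
        dh = A[v, nbrs[~same]].sum()   # edge weight to heterogeneous neighbors
        snr[v] = ds / dh if dh > 0 else np.inf
    return snr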

Algorithm of DropNEdge. The overall procedure of DropNEdge is shown in Algorithm 1; a rough code sketch follows the algorithm. In this algorithm, if the ratio of deleted “noise” edges \(r_{1}\) is set to 0, DropNEdge reduces to DropEdge.

Algorithm 1. DropNEdge
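Since the published pseudocode is not reproduced above, the following rough sketch only illustrates how the two criteria might drive edge deletion, building on the feature_gain and signal_to_noise helpers sketched earlier; the thresholds, processing order, and the way the budget \(r_{1}\) is spent are simplifying assumptions, not the exact published procedure:

import numpy as np

def drop_noise_edges(X, A, labels, dist_threshold, snr_threshold, r1):
    """Return a copy of A with up to a fraction r1 of its edges removed."""
    A = A.copy()
    budget = int(r1 * (np.count_nonzero(A) // 2))   # number of undirected edges to drop
    for v in np.argsort(feature_gain(X, A)):        # low-information-gain nodes first
        if budget <= 0:
            break
        snr_v = signal_to_noise(A, labels)[v]
        for u in np.flatnonzero(A[v]):
            if budget <= 0:
                break
            too_similar = np.sum((X[v] - X[u]) ** 2) < dist_threshold   # endpoints nearly identical
            noisy = labels[u] != labels[v] and snr_v < snr_threshold    # heterogeneous edge, low SNR
            if too_similar or noisy:
                A[v, u] = A[u, v] = 0.0             # delete the "noise" edge
                budget -= 1
    return A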

4.2 Connection with Bayesian GNNs

Considering that Dropout can serve as an approximation of Bayesian neural networks [18], we show that DropNEdge can likewise serve as an approximation of Bayesian GNNs. We target the inference of the joint posterior over the random graph parameters, the weights of the GNN, and the nodes’ labels. Since we are usually not directly interested in inferring the graph parameters, posterior estimates of the labels are obtained by marginalization [13]. The goal is to compute the posterior probability of the labels, which is

$$\begin{aligned} p\left( \boldsymbol{Z} \mid \boldsymbol{Y}, \boldsymbol{X}, \mathcal {G}_{o b s}\right) =\int p(\boldsymbol{Z} \mid W, \mathcal {G}, \boldsymbol{X}) p(W \mid \boldsymbol{Y}, \boldsymbol{X}, \mathcal {G}) p(\mathcal {G} \mid \lambda ) p\left( \lambda \mid \mathcal {G}_{o b s}\right) d W d \mathcal {G} d \lambda \end{aligned}$$
(10)

where W is a random variable representing the weights of the Bayesian GNN over graph \(\mathcal {G}\), and \(\lambda \) characterizes a family of random graphs. This integral is intractable; to approximate it, we can adopt a number of strategies, including variational methods [24] and Markov chain Monte Carlo (MCMC) [21]. A Monte Carlo approximation of it is [13]

$$\begin{aligned} p\left( \boldsymbol{Z}|\boldsymbol{Y},\boldsymbol{X},\mathcal {G}_{obs}\right) \approx \frac{1}{V}\sum _{v}^{V}\frac{1}{N_{G}S}\sum _{i=1}^{N_{G}}\sum _{s=1}^{S}p\left( \boldsymbol{Z}|W_{s,i,v},\mathcal {G}_{i,v},\boldsymbol{X}\right) . \end{aligned}$$
(11)

In this approximation, V samples \(\lambda _{v}\) are drawn from \(p\left( \lambda |\mathcal {G}_{obs}\right) \), \(N_{G}\) graphs \(\mathcal {G}_{i,v}\) are sampled from \(p\left( \mathcal {G}|\lambda _{v}\right) \), and S weight matrices \(W_{s,i,v}\) are sampled from \(p\left( W|\boldsymbol{Y},\boldsymbol{X},\mathcal {G}_{i,v}\right) \) in the Bayesian GNN corresponding to graph \(\mathcal {G}_{i,v}\) [19]. The samples \(W_{s,i,v}\) and \(\mathcal {G}_{i,v}\) can be obtained from GNNs with DropNEdge. Thus, if we keep DropNEdge on during both the training and test phases, the model’s uncertainty can be obtained.
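A minimal sketch of this Monte Carlo estimate in the spirit of Eq. (11), where the graph samples come from applying DropNEdge at test time; gnn_forward is a hypothetical function returning class probabilities, drop_noise_edges is the sketch from Sect. 4.1, and the sampling of \(\lambda \) is folded into a fixed drop ratio:

import numpy as np

def mc_predict(X, A, labels, weight_samples, num_graphs=5, **dne_kwargs):
    """Average predictions over sampled graphs and weights; return probs and entropy."""
    preds = []
    for W in weight_samples:                         # S weight samples W_{s,i,v}
        for _ in range(num_graphs):                  # N_G sampled graphs G_{i,v}
            A_drop = drop_noise_edges(X, A, labels, **dne_kwargs)
            preds.append(gnn_forward(X, A_drop, W))  # hypothetical forward pass, p(Z | W, G, X)
    probs = np.mean(preds, axis=0)                   # Monte Carlo average over all samples
    uncertainty = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # predictive entropy per node
    return probs, uncertainty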

Fig. 1. The average absolute improvement by DropNEdge.

Fig. 2. (a) and (b) show the validation loss of different backbones with DropNEdge or DropEdge on the Cora and Citeseer data sets.

Table 1. Accuracies of models with DropNEdge or DropEdge.

5 Experiments

5.1 Performance Comparison

We compare the performance of the four GNN models with DropNEdge (DNE) and with DropEdge (DE). The results are shown in Table 1: in most cases, DropNEdge improves the performance of all four models. The improvement is depicted more clearly in Fig. 1, which reports the average improvement brought by DropNEdge for different numbers of layers. For example, on the Cora data set, DropNEdge brings a 6.9% average improvement to models with 32 layers.

The results of models with and without DropNEdge are shown in Table 2. All models with DropNEdge are consistently improved compared with their counterparts without it, which demonstrates the effectiveness of DropNEdge. Figure 2 (a) and (b) compare the validation loss of models with DropNEdge or DropEdge and indicate that models with DropNEdge converge faster and reach smaller losses.

Fig. 3. The effect of removing noise edges.

Table 2. Accuracies of models with and without DropNEdge. “OOM” represents out of memory.

The superiority of DropNEdge lies in the following: (1) it avoids excessive aggregation of node information by dropping “noise” edges, which effectively alleviates the over-smoothing phenomenon; (2) it removes “noise” edges and retains meaningful edges, which prevents the transmission of harmful information; (3) it can be used as a graph data augmentation method.

5.2 Remove “Noise” Edges

We randomly add a given proportion of edges (0.3 in this experiment) to the graph of the Cora data set and regard the added edges as “noise” edges. We then vary the ratio of deleted “noise” edges \(r_{1}\) and count the proportion of deleted added edges relative to all added edges (\(r_{N}\)) and the proportion of deleted non-added edges relative to the real edges in the graph (\(r_{T}\)). The model used is GCN-8, and the results are shown in Fig. 3 (a) and (b). Figure 3 (a) shows that DropNEdge does remove “noise” edges, because \(r_{N}\) is always greater than \(r_{T}\) regardless of the ratio \(r_{1}\). Figure 3 (b) shows that the model’s accuracy also increases as \(r_{1}\) increases. However, when too many edges are deleted, the meaningful aggregation of information also decreases, and the accuracy of the model drops.
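A small sketch of how the two ratios might be computed, assuming the added, real, and dropped edges are kept as sets of node pairs in a fixed orientation (names are illustrative):

def noise_removal_ratios(dropped_edges, added_edges, real_edges):
    """r_N: share of injected ("noise") edges that were dropped;
    r_T: share of real (original) edges that were dropped."""
    dropped = set(dropped_edges)
    r_N = len(dropped & set(added_edges)) / len(added_edges)
    r_T = len(dropped & set(real_edges)) / len(real_edges)
    return r_N, r_T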

5.3 Suppress Over-Smoothing

Over-smoothing occurs when, as the depth increases, the top-level output of a GNN converges to a subspace and becomes irrelevant to the input. Since the convergent subspace cannot be derived explicitly, we measure the degree of smoothing by the difference between the outputs of the current layer and the previous layer, computed as the Euclidean distance: the smaller the distance, the more severe the over-smoothing. This experiment is carried out on GCN-8 with the Cora data set, and the results are shown in Fig. 4.
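A sketch of this smoothness measure, assuming the per-layer node embeddings are collected in a list of NumPy arrays:

import numpy as np

def layer_distances(hidden_states):
    """Euclidean distance between consecutive layers' node embeddings.
    Smaller distances indicate more severe over-smoothing."""
    return [float(np.linalg.norm(hidden_states[l + 1] - hidden_states[l]))
            for l in range(len(hidden_states) - 1)]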

DropNEdge is better than DropEdge at suppressing over-smoothing. As the number of layers increases, the distances between layers increase for models with both DropEdge and DropNEdge, and the distance grows faster with DropNEdge than with DropEdge.

Fig. 4. Comparison between DropEdge and DropNEdge in suppressing over-smoothing.

Fig. 5. (a) Comparison of the training and validation loss between DropNEdge and LI DropNEdge. (b) Uncertainty of the model with DropNEdge.

5.4 Layer Independent DropNEdge

In the DropNEdge described above, all layers share the same perturbed adjacency matrix. In fact, DropNEdge can also be performed independently for each layer, so that different layers use different adjacency matrices. This layer-independent (LI) version introduces more randomness and distortion of the original data; a rough sketch is given at the end of this subsection. We experimentally compare its performance with that of shared DropNEdge on the Cora data set. The model used is GCN-8, and the comparison of the validation and training losses between shared and layer-independent DropNEdge is shown in Fig. 5 (a). Although layer-independent DropNEdge may achieve better results, we still prefer shared DropNEdge, which reduces both the risk of over-fitting and the computational complexity.
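A rough sketch of the layer-independent variant under the same assumptions as before (drop_noise_edges is the earlier sketch; the un-normalized GCN-style propagation is purely illustrative):

import numpy as np

def layer_independent_forward(X, A, labels, layer_weights, **dne_kwargs):
    """Resample a perturbed adjacency matrix for every layer instead of sharing one."""
    H = X
    for W in layer_weights:
        A_l = drop_noise_edges(H, A, labels, **dne_kwargs)  # fresh DropNEdge sample per layer
        H = np.maximum(A_l @ H @ W, 0.0)                    # illustrative propagation + ReLU
    return H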

5.5 Model Uncertainty

To obtain the model’s uncertainty, we keep DropNEdge on during both the training and test phases and set the ratio of deleted “noise” edges to 0.3. The experiment is carried out on GCN-8 with the Cora data set. Because the predictions are stochastic, repeated runs of the model may produce different predictions for the same sample. Figure 5 (b) shows the proportions of the different labels obtained in 10 predictions for 10 samples. For example, for sample one, 40% of the ten predictions are class 0 and 60% are class 3. Thus, the confidence of the predictions can be obtained by using DropNEdge. For high-confidence samples, i.e., samples with consistent results over multiple predictions, the model’s predictions can be used directly. If the model’s predictions for a sample vary greatly, other models should be consulted or the label should be determined manually to obtain a more reasonable prediction.
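A minimal sketch of how such per-sample confidence might be read off from repeated stochastic predictions (illustrative only):

import numpy as np

def prediction_confidence(pred_labels):
    """pred_labels: array of shape (num_runs, num_samples) holding predicted classes.
    Returns each sample's most frequent label and its relative frequency."""
    num_runs, num_samples = pred_labels.shape
    top = np.zeros(num_samples, dtype=int)
    conf = np.zeros(num_samples)
    for i in range(num_samples):
        classes, counts = np.unique(pred_labels[:, i], return_counts=True)
        top[i] = classes[np.argmax(counts)]
        conf[i] = counts.max() / num_runs   # e.g. 0.6 when 6 of 10 runs agree
    return top, conf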

6 Conclusion

This paper proposes DropNEdge, a novel and effective method to alleviate the over-smoothing phenomenon and remove “noise” edges in graphs. It relies on two indicators computed from the graph’s information, namely the feature gain and the signal-to-noise ratio. By using DropNEdge, over-smoothing is alleviated and “noise” edges with no positive impact on the final task are removed, thereby improving the performance of GNNs. DropNEdge does not require changing the network structure and is widely applicable to various GNNs.