
1 Introduction

Node classification involves a network with known graph structure, where each node is associated with a set of known attributes, as well as a label (or ‘class’) which is known only for some of the nodes and unknown for the others. The task is to predict the unknown labels given all the known information about the network. This task finds application in many complex network settings, such as information networks [21], complex systems [30], and protein function identification [5].

Previous Work. Motivated by theories of homophily [15] and social influence [14], a common assumption is that adjacent nodes tend to have similar labels. For instance, in a social network, friends may be likely to vote for the same political party. Methods for node classification that rely on this assumption typically enforce homophily by assigning the same label to proximate (i.e., nearby) nodes. In label propagation, for example, labels diffuse from labeled nodes to their unlabeled neighbors in an iterative manner until convergence [29]. Other methods induce label uniformity within cuts or clusters of the network [1, 2, 9]; or consider node proximity in a latent space that preserves network distances, as in DeepWalk [17] and similar matrix factorization methods [18].

However, the aforementioned methods ignore other node attributes, which can be detrimental. For example, Hamilton et al. [5] show that, for certain predictive tasks on citation and social networks, a linear classifier built only on node attributes outperforms methods such as DeepWalk that are based on node proximity but ignore node attributes. Similar cases are reported elsewhere in the literature [13, 20]. Therefore, it is important to appropriately leverage both node proximity and node attributes for label prediction.

Partially addressing this limitation, AANE [7] and DANE [4] combine low-dimensional encodings of node attributes with network-distance-preserving node embeddings, and use them as input for label prediction. However, they do not account for known labels during training, thus potentially ignoring information that would be useful in predicting the unknown labels. LANE [8] overcomes this limitation by learning joint latent representations of node attributes, proximity, and labels. However, LANE does not directly address the node classification task, i.e., it does not optimize the conditional probability distribution of node labels given the node attributes and network structure, but rather targets the joint distribution of all quantities.

Separately, graph convolutional networks (e.g., GCN [10], GAT [23], GraphSAGE [5]) use network topology for low-pass filtering on node attributes. However, these convolutions are equivalent to repeated smoothing over the node attributes, and performance quickly degrades [12]. Subsequent approaches such as DiffPool [28] have sought to address this limitation, but these, too, aggressively enforce homophily and require that nodes with the same labels have similar representations. The recent AM-GCN [25] attempts to weaken this assumption by analyzing the fusing capabilities of convolutional models: it defines two modules, one each for the topology space and the attribute space, and adaptively combines them using an attention mechanism.

Our Contribution. We develop an approach to node classification that flexibly adapts to a range of settings, from cases where node labels exhibit strong homophily (i.e., a node’s label can be predicted from the labels of proximate nodes) to cases where labels are predicted solely from the attributes of the node, as well as cases that lie between these two extremes. Unlike the aforementioned approaches that rely heavily on label homophily (e.g., label propagation), the proposed approach is based on a generative probabilistic model that jointly captures the role of network structure and node similarity in predicting labels. Our analysis leads to JANE, an algorithm that identifies maximum-likelihood node labels. Moreover, unlike AANE [7] and DANE [4], JANE learns a low-dimensional node representation informed by labels, which is then used for prediction. And unlike LANE [8], it directly optimizes the conditional probability of the unknown labels given the known information about the network. We empirically validate JANE’s performance on real datasets and compare it to standard baselines.

2 Problem Setting

Let us consider a network with undirected and connected graph structure \(\mathcal {G}= \left( \mathcal {V}, \mathcal {E}\right) \) with \(|\mathcal {V}| = n\) nodes. Let \({\mathbf {A}}\) be the adjacency matrix, \({\mathbf {D}} \) the (diagonal) degree matrix, and \({\mathbf {L}} = {\mathbf {D}}- {\mathbf {A}} \) the unnormalized Laplacian. Let \({\mathbf {e}}_{i}\left( {\mathbf {L}} \right) \) be the i-th spectral eigenvector, corresponding to the i-th smallest eigenvalue of \({\mathbf {L}}\).

Each node j is associated with: d observed attributes \({\mathbf {x}}_{j} \in \mathbb {R}^{d}\), k latent attributes \({\mathbf {u}}_{j} \in \mathbb {R}^{k}\), and one possibly unobserved label \({\mathbf {y}}_{j} \in \left\{ 1,2,\ldots ,M\right\} \). To refer to the attributes and labels of all nodes, we use the notation \({\mathbf {X}} = \left\{ {\mathbf {x}}_{j} \right\} _{j = 1\ldots n}\), \({\mathbf {U}} = \left\{ {\mathbf {u}}_{j} \right\} _{j = 1\ldots n}\), and \({\mathbf {Y}} = \left\{ {\mathbf {y}}_{j} \right\} _{j = 1\ldots n}\), respectively.

Node classification is formalized as Problem 1.

Problem 1 (Node-Classification)

Given adjacency matrix \({\mathbf {A}} \), node attributes \({\mathbf {X}} \), and labels \({\mathbf {Y}} _{L}\) for a subset \(L\subseteq \mathcal {V}\) of nodes, predict labels \({\mathbf {Y}} _{\mathcal {V}-L}\) for the remaining nodes \({\mathcal {V}-L}\) in the network.

3 Our Approach

Following a principled probabilistic approach to Problem 1, we use an appropriate generative model (defined in Sect. 3.1) to derive JANE, our node classification method (Sect. 3.2).

3.1 Model

Figure 1 illustrates our generative model. First, to model homophily, we assume that two nodes are more likely to be adjacent in \({\mathbf {A}}\) if they are similar in terms of some attributes. Formally, we assume that the adjacency matrix \({\mathbf {A}}\) is generated from latent attributes \({\mathbf {U}} \). Specifically, the probability that there exists an edge between two nodes i and j decreases with their distance in the latent space, according to the following formula, for some scaling parameter \(s ^2\):

$$\begin{aligned} \text {Pr}\left[ ~ \left( i,j\right) \in \mathcal {E}~ | ~ {\mathbf {u}}_{i}, {\mathbf {u}}_{j} ~\right] = p_{ij} = e^{-\left\Vert {\mathbf {u}}_{i} - {\mathbf {u}}_{j}\right\Vert ^2 / s^2}\end{aligned}$$
(1)

In light of the above, \({\mathbf {U}}\) represents a low-dimensional Euclidean embedding of the nodes that preserves connectivity in the form of Eq. 1.
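To make Eq. 1 concrete, the following NumPy sketch (an illustration under the model's assumptions, not code from our implementation; the function names are ours) computes the pairwise edge probabilities from a latent embedding \({\mathbf {U}}\) and samples a symmetric adjacency matrix with independent Bernoulli edges.

```python
import numpy as np

def edge_probabilities(U, s2=1.0):
    """Pairwise edge probabilities p_ij = exp(-||u_i - u_j||^2 / s^2), as in Eq. 1."""
    sq_dists = np.sum((U[:, None, :] - U[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / s2)

def sample_adjacency(U, s2=1.0, rng=None):
    """Sample an undirected adjacency matrix with independent Bernoulli edges."""
    rng = np.random.default_rng() if rng is None else rng
    P = edge_probabilities(U, s2)
    A = (rng.random(P.shape) < P).astype(int)
    A = np.triu(A, k=1)   # keep the upper triangle: no self-loops
    return A + A.T        # symmetrize for an undirected graph
```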

Fig. 1.

The generative model. Observed node attributes \({\mathbf {X}} \) (circular box) and latent attributes \({\mathbf {U}} \) (square box) jointly generate node labels \({\mathbf {Y}} \). The (observed) adjacency matrix \({\mathbf {A}} \) is generated from \({\mathbf {U}} \) and indirectly correlates with \({\mathbf {Y}} \) via \({\mathbf {U}} \).

Second, to build the desired flexibility into our approach, we assume that labels \({\mathbf {Y}}\) may be determined both from node attributes \({\mathbf {X}}\) (directly) and from node proximity, as expressed directly via \({\mathbf {U}}\) and thus indirectly via adjacency \({\mathbf {A}}\). Formally, we assume that the conditional probability of \({\mathbf {Y}}\) given \({\mathbf {X}}\) and \({\mathbf {U}}\) is given by a simple two-layer neural network,

$$\begin{aligned} \text {Pr}\left[ {\mathbf {Y}} |{\mathbf {X}},{\mathbf {U}},\mathbf {W} \right] = \sigma (\text {ReLU} (\left[ {\mathbf {X}} {\mathbf {U}} \right] {W}^{\left( 0\right) }) {W}^{\left( 1\right) } ) \end{aligned}$$
(2)

where \(\sigma \) denotes the softmax function and weight matrices \(\mathbf {W} = \{{W}^{\left( 0\right) }, {W}^{\left( 1\right) } \}\) are parameters that control the effect of \({\mathbf {X}} \) and \({\mathbf {U}} \) on labels \({\mathbf {Y}}\). We make this choice because we found this model to be sufficiently expressive for our empirical evaluation – however note that other models (e.g., neural networks with more hidden layers) could also be used in Eq. 2.
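As a minimal illustration of Eq. 2, the two-layer network can be written as the following PyTorch module (a sketch; the hidden dimension, the absence of bias terms and dropout, and the use of log-probabilities are implementation choices, not prescriptions).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelModel(nn.Module):
    """Two-layer network for Pr[Y | X, U, W], as in Eq. 2."""
    def __init__(self, d, k, hidden_dim, num_classes):
        super().__init__()
        self.W0 = nn.Linear(d + k, hidden_dim, bias=False)        # W^(0)
        self.W1 = nn.Linear(hidden_dim, num_classes, bias=False)  # W^(1)

    def forward(self, X, U):
        Z = torch.cat([X, U], dim=1)  # [X U]: concatenate observed and latent attributes
        # Return log-probabilities (log-softmax) so a negative log-likelihood
        # loss can be applied directly during training.
        return F.log_softmax(self.W1(F.relu(self.W0(Z))), dim=1)
```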

3.2 Algorithms

Problem 1 asks for predictions for \({\mathbf {Y}} _{\mathcal {V}-L}\) given the data \(\mathcal {D} = ({\mathbf {X}}, {\mathbf {A}}, {\mathbf {Y}} _L)\) that are provided as input. Towards this end, we treat the remaining quantities in the model, i.e., the latent variables \({\mathbf {U}}\) and the weights of the neural network \(\mathbf {W}\), as unobserved parameters \(\theta = ({\mathbf {U}}, \mathbf {W})\), and use their maximum-likelihood values \(\hat{{\mathbf {U}}}\) and \(\hat{\mathbf {W}}\) in making the predictions. In summary, JANE consists of two algorithms: first, a training algorithm from which we learn the maximum-likelihood estimates \(\hat{{\mathbf {U}}}\) and \(\hat{\mathbf {W}}\); second, a prediction algorithm, in which we use the learned parameter values to predict the missing labels. We provide the algorithms below.

Training. The training algorithm is defined in Algorithm 1. In summary, it uses gradient descent on the negative log-likelihood (\(-\log \mathcal {L}\)) to iteratively update \(\hat{{\mathbf {U}}}\) and \(\hat{\mathbf {W}}\) towards their maximum-likelihood estimates. The initial value of \(\hat{{\mathbf {U}}}\) is given by the first k eigenvectors of the Laplacian \({\mathbf {L}}\). Due to space constraints, the analysis that justifies this initialization and the detailed gradient calculations are omitted here but provided in the extended version [?].

Algorithm 1. The JANE training procedure.
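For illustration, a rough PyTorch sketch of this training loop is given below (not the verbatim Algorithm 1). It assumes the log-likelihood decomposes as \(\log \text {Pr}[{\mathbf {A}} |{\mathbf {U}}] + \log \text {Pr}[{\mathbf {Y}} _{L}|{\mathbf {X}},{\mathbf {U}},\mathbf {W}]\), with independent Bernoulli edges as in Eq. 1 and the label model of Eq. 2; the optimizer, learning rate, weight decay, and epoch count mirror the settings of Sect. 4.1, and the helper names are ours.

```python
import torch

def train_jane(A, X, y_L, labeled_idx, U_init, model, s2=1.0,
               epochs=200, lr=0.005, weight_decay=5e-2):
    """Sketch of training: descend -log L jointly over U and the weights W.

    A: dense (n, n) 0/1 float tensor; X: (n, d) float tensor;
    y_L: labels of the labeled nodes; labeled_idx: their indices;
    U_init: (n, k) initial latent attributes (first k Laplacian eigenvectors).
    """
    U = torch.nn.Parameter(U_init.clone())
    opt = torch.optim.Adam([U] + list(model.parameters()),
                           lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        opt.zero_grad()
        # -log Pr[A | U]: independent Bernoulli edges with p_ij from Eq. 1.
        sq = ((U.unsqueeze(1) - U.unsqueeze(0)) ** 2).sum(-1)
        p = torch.exp(-sq / s2).clamp(1e-6, 1 - 1e-6)
        nll_graph = -(A * torch.log(p) + (1 - A) * torch.log(1 - p)).triu(1).sum()
        # -log Pr[Y_L | X, U, W]: cross-entropy on the labeled nodes (Eq. 2).
        log_probs = model(X, U)
        nll_labels = torch.nn.functional.nll_loss(log_probs[labeled_idx], y_L)
        (nll_graph + nll_labels).backward()
        opt.step()
    return U.detach(), model
```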

Prediction. Given maximum-likelihood estimates \(\hat{{\mathbf {U}}}\) and \(\hat{\mathbf {W}}\), JANE invokes Eq. 2 with \({\mathbf {U}} = \hat{{\mathbf {U}}}\) and \(\mathbf {W} = \hat{\mathbf {W}}\) for each node with an unknown label, and predicts for it the most likely label under Eq. 2.
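The prediction step then amounts to a forward pass of Eq. 2 with the learned estimates, followed by an argmax per unlabeled node (again a sketch; the helper name is ours).

```python
import torch

def predict_labels(model, X, U_hat, unlabeled_idx):
    """Most likely label per unlabeled node, i.e., Eq. 2 with U = U-hat and W = W-hat."""
    model.eval()
    with torch.no_grad():
        log_probs = model(X, U_hat)
        return log_probs[unlabeled_idx].argmax(dim=1)
```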

4 Experiments

Baselines. We evaluate JANE against a range of baseline methods.

  • JANE and its variants: JANE-NU, wherein we do not update the initial estimate of \(\hat{{\mathbf {U}}} \) during training, and JANE-R, wherein the initial estimate of \({\mathbf {U}}\) is a random matrix.

  • Methods based on network structure: Label Propagation (LP) [29], and DeepWalk (DW) [17] that encodes neighbourhood information via truncated random walks. These do not incorporate node attributes.

  • Deep attributed embeddings: We evaluate LANE [8] which constructs node embeddings that encode network structure and node attribute information, in addition to node labels.

  • Methods based on graph-convolutions: GCN [10] and GraphSAGE (mean aggregator) as representatives of graph convolutional networks. We acknowledge that this is an extremely active area of research today and that several methods demonstrate improved performance along axes such as training efficiency [3], explainability [26], etc. on various real-world datasets [6]. For economy of space, we empirically compare with two representative benchmark methods to demonstrate our central point: these methods sometimes fail because they strictly and inherently enforce homophily.

We were unable to reproduce the node classification results for AANE [7] based on the available implementation. Further, we could not locate the authors’ implementations for DANE [4]. Therefore, we do not report results for them.

Experimental Setup. We implement JANE, JANE-NU, and LP in PyTorch. We use off-the-shelf PyTorch implementations of DeepWalk, GCN, and GraphSAGE, and a MATLAB implementation of LANE [8]. In all of our experiments, as is standard, all methods receive only the adjacency matrix \({\mathbf {A}} \) and the node attributes \({\mathbf {X}} \) as input, along with the same 10% and 20% of the node labels for training and validation, respectively. Wherever available, we use the default hyperparameter configurations suggested in the original papers. Otherwise, we grid search over the hyperparameter space to find the best setting for all of our baselines. We perform all experiments on a Linux machine with 4 cores and 32 GB RAM.

Reproducibility. To aid further research, we make our code publicly available. This also includes an implementation for constructing synthetic datasets as described below. The real-world datasets (cf. Sect. 4.2) are publicly available.

4.1 Node Classification on Synthetic Data

The goal of these experiments is to demonstrate differences between JANE and existing classification methods.

Synthetic Datasets. Figure 2 describes representative synthetic datasets generated according to the framework described previously. We set the number of individual node attributes \(|{\mathbf {X}} | = d = 2\) and the number of latent attributes \(|{\mathbf {U}} | = k = 2\). We generate these (Gaussian-distributed) attributes for \(n=200\) points, each of which belongs to one of \(M=4\) classes, and set the scale \(s ^2 = 1\). An influence parameter, \(\alpha \in \left[ 0, 1\right] \), controls the degree to which node labels derive from \({\mathbf {X}}\) or \({\mathbf {U}}\): \(\alpha = 0.0\) signifies that they derive only from \({\mathbf {U}} \) and are independent of \({\mathbf {X}} \); \(\alpha = 1.0\) that they derive only from \({\mathbf {X}} \) and are independent of \({\mathbf {U}} \); and \(\alpha = 0.5\) that they derive equally from \({\mathbf {X}}\) and \({\mathbf {U}}\) (specifically, without loss of generality, only the first attribute from each of \({\mathbf {X}} \) and \({\mathbf {U}} \) contributes to label assignment). Figure 2a depicts instances of synthetic networks corresponding to \(\alpha = \left\{ 0.0, 0.5, 1.0\right\} \). The colors of points represent classes.

Fig. 2.

Figure 2a shows 3 synthetic networks where class labels \({\mathbf {Y}}\) are determined only by \({\mathbf {U}} \) (\(\alpha = 0.0\)), partly by \({\mathbf {U}} \) and partly by \({\mathbf {X}} \) (\(\alpha = 0.5\)), and only by \({\mathbf {X}} \) (\(\alpha = 1.0\)). Figure 2b compares the node classification accuracy of JANE and JANE-NU with the baselines averaged over 5 random train-test splits.

Implementation Details. We use Scikit-Learn’s [16] make_classification to generate these datasets. Classification algorithms do not have access to \(\alpha \) or \({\mathbf {U}} \). JANE is trained as a two-layer neural network for a maximum of \(T=200\) epochs with dropout of 0.2 for each layer, weight decay of \(5\times 10^{-2}\), and a learning rate of 0.005 using Adam. We set the number of eigenvectors \(k=2\) and choose a scaling factor \(s ^2=0.01\).
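For reference, a simplified sketch of this construction is shown below (our own simplification: it generates four informative features with make_classification, splits them into observed and latent halves, and samples \({\mathbf {A}}\) from the latent half via Eq. 1; the \(\alpha\)-mixing of \({\mathbf {X}}\) and \({\mathbf {U}}\) is handled in full only in the released code).

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

# n=200 nodes, M=4 classes; four informative features split into X (d=2) and U (k=2).
features, labels = make_classification(
    n_samples=200, n_features=4, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, class_sep=2.0, random_state=0)
X, U = features[:, :2], features[:, 2:]

# Sample the adjacency matrix from U via Eq. 1 with s^2 = 1.
sq_dists = ((U[:, None, :] - U[None, :, :]) ** 2).sum(-1)
P = np.exp(-sq_dists)
A = (rng.random(P.shape) < P).astype(int)
A = np.triu(A, k=1)
A = A + A.T
```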

Performance. Figure 2b shows performance as a function of training set size.

  • \(\alpha = 0.0\): LP and DW infer that labels derive from \({\mathbf {A}} \) (indirectly). GCN smooths attribute values of nodes in the same cluster but is not perfectly accurate because \({\mathbf {X}} \) does not correlate with \({\mathbf {Y}} \). LANE forces the proximity representation to be similar to the attribute representation and then smooths it using the labels; it does not perform well since the attributes do not correlate with the labels.

  • \(\alpha = 0.5\): LP and DW correctly classify nodes belonging to 2 of the 4 classes, i.e., precisely those nodes whose labels are influenced by \({\mathbf {U}} \). Conversely, LANE correctly classifies the nodes belonging to the two classes that correlate with \({\mathbf {X}} \). GCN smooths attribute values of adjacent nodes and thus can correctly infer labels correlated with \({\mathbf {X}} \).

  • \(\alpha = 1.0\): LP and DW reduce to random classifiers since adjacent nodes do not have similar labels. GCN reduces to a nearly random classifier because, by forcing adjacent nodes with different attribute values to become similar, it destroys the correlation between \({\mathbf {X}} \) and the labels.

In all cases, JANE-NU and JANE achieve perfect accuracy, as they flexibly learn whether labels are predicted from \({\mathbf {X}} \), \({\mathbf {A}} \) (indirectly), or both. While these datasets are simple, they demonstrate the importance of this flexibility.

4.2 Node Classification on Real-World Data

Datasets. We show our results on four tasks for a total of 9 networks (see Table 1 for basic statistics). If the original network is disconnected, we extract the node attributes and labels belonging to its largest connected component.

  • Citation Networks: We use Cora, Citeseer, Pubmed [21], and UAI2010 [24]. Here, nodes represent academic papers, edges denote a citation between two nodes, node attributes are 0/1-valued sparse bag-of-words vectors and class labels denote the subfield of research that the papers belong to.

  • Social Networks: We focus on BlogCatalog and Flickr where the task is to predict pre-defined categories of blogs and images, respectively. Nodes are users that post content, edges represent follower relationships, and attributes are specified by a list of tags reflecting the interests of the users [11].

  • Air-traffic Networks: Based on flight records from Brazil, Europe, and the USA, each node is an airport and an edge indicates that a commercial airline route exists between them. Labels denote the level of activity in terms of people and flights passing through an airport [19]. Since the nodes have no attributes, we assign each node an all-ones dummy vector as its sole attribute.

  • Biological Networks: We use a processed protein-protein interaction (PPI) dataset [5] where the task is to identify protein roles based on gene ontology sets using positional gene sets, motif gene sets, and immunological signatures as attributes [22].

Table 1. Summary of dataset statistics.

Experimental Setup. For the citation datasets, we use the same train-validation-test splits as in Yang et al. [27], minus the nodes which do not belong to the largest connected component. These comprise 20 samples for each class and represent 5% of the entire dataset. We use 500 additional samples as a validation set for hyperparameter optimization, as per Kipf et al. [10], to enable fair comparison. For all other tasks, we use 10% and 20% of the dataset for training and validation, respectively. We evaluate the performance of all methods on the remaining nodes. The values of the hyperparameters k, the number of eigenvectors of the Laplacian, and the scaling factor \(s \) are determined so as to minimize the negative log-likelihood (details omitted).

Table 2. Classification accuracy (%) on test data, averaged over 10 independent runs of \(T=200\) epochs each. Bold denotes the best average accuracy, or an accuracy range overlapping with that of the best (within \(\pm 0.3\) standard deviations). JANE-NU and JANE perform as well as or better than LANE, GCN, and GraphSAGE on most datasets, and consistently well across datasets.
Fig. 3.

Average training time per epoch (seconds) of baselines on different datasets.

Performance Analysis. Table 2 provides the average accuracy of each method over 10 independent runs of \(T=200\) epochs each for GCN, GraphSAGE, LANE, and JANE. Values in bold denote that the method either performs best or its accuracy range overlaps with that of the best. JANE-NU and JANE consistently outperform LP and DW on all datasets by significant margins. JANE-R, on the other hand, has significantly lower performance, either because it gets stuck in a local optimum or because it requires much longer training time. This shows that the choice of the initial estimate is crucial to performance. The embedding obtained from DW is unsupervised, and thus its predictive power is limited in comparison to our \(\hat{{\mathbf {U}}} \), which is label-informed. Both variants of JANE are competitive with GCN and GraphSAGE on Cora, Citeseer, and UAI2010. For instance, GraphSAGE achieves 79.70% accuracy on Cora, while JANE gets 79.24%, which is within the margin of error. We observe strong performance by JANE-NU and JANE on Pubmed and PPI. This may be explained in part by the observation of Li et al. [12] that Pubmed exhibits strong manifold structure, as well as by JANE’s ability to better utilize the adjacency information. Lastly, we find significant gains on the social and air-traffic datasets. The authors of LANE [8] report strong performance on BlogCatalog and Flickr, but its performance is noticeably poorer on the citation, air-traffic, and biological datasets.

Runtime. Figure 3 plots the average time for a single training epoch for JANE-NU, JANE, GCN, and GraphSAGE on various datasets. JANE is comparatively the slowest model, while JANE-NU is by far the fastest. This is because JANE-NU is a vanilla neural network model that does not update its estimate of \(\hat{{\mathbf {U}}} \). Further, a single Laplacian eigenvector can be approximately computed using the Lanczos algorithm in \(\tilde{\mathcal {O}}\left( |\mathcal {E}|\right) \) (up to log factors). Thus k eigenvectors can be computed in \(\tilde{\mathcal {O}}\left( k \cdot |\mathcal {E}|\right) \), and this is a one-time operation. Note that we do not need to perform a full eigendecomposition.
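As an illustration of this one-time preprocessing step, the first k eigenvectors can be obtained with SciPy's Lanczos-based sparse solver as follows (a sketch; the helper name is ours, and shift-invert variants may be faster in practice).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def first_k_eigenvectors(A, k):
    """Eigenvectors of L = D - A for the k smallest eigenvalues (no full decomposition)."""
    A = sp.csr_matrix(A, dtype=float)
    degrees = np.asarray(A.sum(axis=1)).ravel()
    L = sp.diags(degrees) - A
    _, vecs = eigsh(L, k=k, which='SM')  # smallest-magnitude eigenvalues via Lanczos
    return vecs
```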

Parameter Sensitivity. A crucial parameter of JANE is the number of latent attributes, i.e., the dimension of \({\mathbf {U}} \), used during training. Figure 4 demonstrates its impact on performance. In each case, JANE is allowed a maximum of 200 training epochs. We find that test accuracy consistently increases as the dimension of \(\hat{{\mathbf {U}}} \) increases, up until a certain threshold. Increasing it beyond this threshold introduces noise and reduces performance; too many latent attributes yield at best marginal, diminishing returns while increasing the runtime.

Fig. 4.

JANE’s classification accuracy (%) on test data, averaged over 10 runs, as a function of the number of eigenvectors.

Limitations. The primary limitation of JANE is the time and memory required to compute the gradient of the likelihood term for \({\mathbf {A}} \) with respect to \(\hat{{\mathbf {U}}} \). These requirements grow linearly in network size. Since this gradient is computed in every training epoch, it may not be viable to fit the computation into GPU memory. Future work can outline procedures for mini-batch computation. Note, however, that JANE-NU is orders of magnitude faster than the other methods while also demonstrating strong performance.

5 Conclusion

We demonstrated an approach to node classification that flexibly adapts to settings where proximity and individual attributes have varying importance in predicting labels. Given its simplicity, interpretability, and performance, JANE can serve as a useful starting point in designing models that holistically account for different sources of node labels. As a future direction, we aim to evaluate the design and performance of advanced graph neural networks that go beyond requiring or enforcing homophily.