
1 Introduction

Many complex biological systems can be well represented as graphs, where a node represents a biological entity (e.g. a protein or gene) and a link represents an interaction between two entities. Most real-world biological graphs are inherently incomplete. For example, 99.7% of the molecular interactions in human cells are still unknown [1]. The links in biological graphs must be validated by field and/or laboratory experiments, which are expensive and time-consuming. To avoid blindly checking all possible interactions, researchers have developed link prediction methods that compute the plausibility of a link between two unconnected nodes in a graph. Formally, link prediction is the task of predicting the likelihood of a link between two nodes based on the available topological/attribute information of a graph [2]. Link prediction methods help us toward a deep understanding of the structure, evolution, and functions of biological graphs [3].

Similarity-based methods are the simplest, unsupervised methods of link prediction in biological graphs; they define the plausibility of a link by the similarity between its end nodes. The great advantage of these methods is their interpretability, which is essential for any biological system [4]. However, each similarity-based method performs well only on some particular graphs, and no single method wins on all graphs. These methods require manually formulating various heuristics based on prior beliefs or extensive knowledge of various biological graphs. The lack of universal applicability of similarity-based methods motivates researchers to study machine learning methods that learn the heuristics automatically from a graph. To this end, researchers have developed embedding-based methods, which represent nodes, edges, and graphs in a low-dimensional vector space [5]. Embedding-based methods have become popular link prediction tools over the last decade and show impressive performance on most graphs. Their downside is that they suffer seriously from the well-known ‘black-box’ problem. As link decisions in biological graphs are critical, a link prediction method should be sufficiently interpretable to earn the trust of stakeholders [6]. This interpretability requirement may limit the use of embedding-based methods in real-world biological systems. Researchers are still working on opening the ‘black-box’ of embedding-based methods [9, 10].

Another group of link prediction methods is based on traditional supervised learning. These methods extract features from a graph and train a traditional classifier for the link prediction task [11,12,13,14,15,16,17]. In many biological graphs they are nearly as performant as embedding-based methods and as interpretable as similarity-based methods. They cast link prediction as a binary classification problem with two classes: the existence and the absence of a link. In this paper, we investigate whether existing similarity-based heuristics can collaboratively improve link prediction performance in biological graphs. We use similarity-based heuristics for feature extraction and feed the features into supervised learning-based classifiers for link prediction. This is not the first attempt to apply supervised learning to the link prediction problem in graphs, but there are important differences between past works [12, 18, 19] and this study. Existing methods mostly focus on node attributes for feature extraction, which makes them application dependent; moreover, node attributes are not available in many real-world biological graphs. In contrast, our supervised learning-based method is built on topological features (similarity-based heuristics) only. Kumari et al. [17] studied a few local (four) and global (three) similarity heuristics for supervised link prediction, which is the closest work in the literature to ours. However, global methods are not the best option for large graphs, as they are computationally expensive [20]. In this study, we enrich the feature set with fourteen local similarity-based heuristics. In addition, we extract a few other topological features of nodes and derive link-based features from the end-node features. We study these features in supervised machine learning methods for link prediction in biological graphs. We find that supervised learning methods show comparable prediction results on many of the biological graphs. We also demonstrate the feature importance in different datasets for different supervised learning-based methods.

1.1 Similarity-Based Link Prediction

Link prediction is the task of discovering or inferring a set of non-existing links in a graph based on the current snapshot of the graph. Similarity-based methods form the simplest category of link prediction, built on the assumption that two nodes interact if they are similar [20]. Generally, these methods compute similarity scores for non-existent links, sort the links in decreasing order of their scores, and predict the top-L links as potential existent links. Defining the similarity is a crucial and non-trivial task that differs from graph to graph [20]. Consequently, numerous similarity-based methods exist in the literature. They are broadly grouped into three categories: local, global, and quasi-local methods. Local methods are based on local topological or neighbourhood information, whereas global methods use the global topological information of the graph to define similarity functions [20]. Quasi-local methods consider the neighbourhood up to a predefined hop count. The high computational cost of global methods motivates us to study only local and quasi-local methods. We study fourteen well-known similarity-based methods, thirteen local and one quasi-local. Thirteen of them are summarized in Table 1; the remaining one, Preferential Attachment (PA), is summarized with the derived link features in Table 2. Both tables give the basic principles and definitions of the similarity functions.

Table 1. Summary of similarity-based methods. Each method is considered as an individual link feature. S(xy) is the similarity function between two end nodes x and y. \(\varGamma x\) and \(\varGamma y\) denote the neighbour sets of nodes x and y respectively. A is the adjacency matrix and \(\lambda \) is a free parameter.
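As an illustration, two of the local heuristics in Table 1, Common Neighbours (CN) and the Jaccard coefficient, can be computed directly from the adjacency structure. The following is a minimal sketch on a toy adjacency list, not the implementation used in our experiments:

```python
# Toy undirected graph as an adjacency list; node names are illustrative.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

def common_neighbours(x, y):
    """CN: S(xy) = |Gamma(x) intersect Gamma(y)|."""
    return len(adj[x] & adj[y])

def jaccard(x, y):
    """Jaccard: S(xy) = |Gamma(x) intersect Gamma(y)| / |Gamma(x) union Gamma(y)|."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

print(common_neighbours("b", "d"))  # Gamma(b) and Gamma(d) share {a, c} -> 2
print(jaccard("b", "d"))
```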

2 Methodology

In a broader sense, we consider the similarity-based heuristics as individual features to generate the feature set for a supervised learning-based classifier.

We describe each of the steps in Sects. 2.1–2.3.

2.1 Feature Extraction

The most crucial task of a supervised learning-based classifier is to define an appropriate feature set [12]. Given a graph and a train set of links, we extract structural features for the train links. When extracting the features of a link, the link is temporarily removed from the graph and re-connected after feature extraction to ensure that the extracted features are not biased by the existence of the train link. We are motivated to use only topological features for defining our feature set as they exist in all kinds of graphs. Our feature set contains twenty topological features which are broadly categorized into two categories: similarity-based and derived link features (Fig. 1).
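The temporary-removal step above can be sketched as follows; `extract_features` is a hypothetical stand-in for computing the full set of twenty topological features:

```python
# Toy undirected graph as an adjacency list.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

def extract_features(adj, x, y):
    # Illustrative only: just CN and PA here; the real feature set has twenty.
    cn = len(adj[x] & adj[y])          # common neighbours
    pa = len(adj[x]) * len(adj[y])     # preferential attachment
    return [cn, pa]

def features_for_train_link(adj, x, y):
    adj[x].discard(y); adj[y].discard(x)   # temporarily remove the train link
    feats = extract_features(adj, x, y)    # features unbiased by the link itself
    adj[x].add(y); adj[y].add(x)           # re-connect after extraction
    return feats

print(features_for_train_link(adj, 1, 2))
```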

Fig. 1.
figure 1

Feature set for supervised learning

Similarity-Based Link Features: We define the similarity-based link features as features related to the common topological information of the end nodes of a link. We use thirteen existing similarity-based heuristics as such features, summarized in Table 1. For instance, the number of common neighbours of the end nodes of a link is used as the common neighbour (CN) feature.

Derived Link Features: A few link features are derived from the individual features of the link’s end nodes. We summarize six derived features in Table 2. These features are related to the topological information of the individual nodes only. For example, in Preferential Attachment (PA), the degrees of the end nodes are multiplied to define the similarity score. Note that the link features in Table 2, except PA, are not directly defined in the literature; we derive them from the end-node features. To compute a link feature, the features of the end nodes are simply added, except for PA, where they are multiplied. As the voterank centrality assigns low ranks to highly influential nodes in a graph, the reciprocals of the voterank scores of the end nodes are summed to define the voterank centrality feature.

Table 2. Summary of derived link features: the derived link feature function S(xy) is defined based on end nodes features.
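The derivation rules above can be sketched as follows. The node-level scores are illustrative toy values, not real degree or voterank output:

```python
# Hypothetical per-node scores for three nodes.
degree   = {1: 3, 2: 2, 3: 4}
voterank = {1: 1, 2: 5, 3: 2}   # low rank = high influence

def pa(x, y):
    """PA: end-node degrees are multiplied."""
    return degree[x] * degree[y]

def summed_feature(node_feature, x, y):
    """Default rule: end-node features are simply added."""
    return node_feature[x] + node_feature[y]

def voterank_feature(x, y):
    """Voterank: reciprocals are summed, so influential (low-rank) nodes score high."""
    return 1.0 / voterank[x] + 1.0 / voterank[y]

print(pa(1, 2))                      # 3 * 2 = 6
print(summed_feature(degree, 1, 2))  # 3 + 2 = 5
print(voterank_feature(1, 2))        # 1/1 + 1/5 = 1.2
```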

2.2 Feature Scaling

In general, the magnitude scale of different features varies across graphs [7, 8]. Supervised learning-based methods are easily affected by non-uniform scaling, as features with larger magnitudes are likely to play a more decisive role during the training of a classifier. However, it is not desirable for the classifier to be biased towards one particular feature. Hence, we normalize each feature to the range 0–1.
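A minimal min-max normalization of each feature column to [0, 1], as described above; constant columns are mapped to 0 to avoid division by zero:

```python
def min_max_scale(columns):
    """Scale each feature column to [0, 1] independently."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        rng = hi - lo
        # Constant columns (rng == 0) are mapped to 0.0.
        scaled.append([(v - lo) / rng if rng else 0.0 for v in col])
    return scaled

print(min_max_scale([[2, 4, 6], [10, 10, 10]]))
```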

2.3 Classifier Training and Link Prediction

For the link prediction task, we train a traditional supervised machine learning classifier to classify a link into either the existent or the non-existent class. Many classifiers exist in the literature, each performing better than the others on some particular datasets. In this paper, we study three traditional classifiers: Support Vector Machine (SVM) with RBF kernel, Decision Tree (DT), and Logistic Regression (LR). To evaluate the link prediction performance, we extract the features of the test links and classify them into existent or non-existent classes using the trained classifier.
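The training step can be sketched with scikit-learn as follows. The feature vectors and labels are toy values (1 = existent, 0 = non-existent), not our experimental data:

```python
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Toy scaled link features and binary labels.
X_train = [[0.9, 0.8], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1]]
y_train = [1, 1, 0, 0]

# The three traditional classifiers studied in this paper.
classifiers = {
    "SVM": SVC(kernel="rbf"),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)

# Classify unseen test links as existent (1) or non-existent (0).
X_test = [[0.8, 0.85], [0.15, 0.15]]
preds = {name: list(clf.predict(X_test)) for name, clf in classifiers.items()}
print(preds)
```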

3 Experiments

3.1 The Baselines

To evaluate the prediction performance of supervised learning methods, we consider two categories of link prediction methods: similarity-based and embedding-based methods.

For the similarity-based category, we consider all the heuristics in Table 1 and Table 2. For the embedding-based category, we choose two popular methods: Node2Vec [40] and SEAL [41]. We briefly describe them here; for more details, we refer to the original papers. Node2Vec [40] is a classical skip-gram-based graph embedding method that learns node embeddings by optimizing a neighbourhood-preserving objective function. It interpolates between BFS (Breadth First Search) and DFS (Depth First Search) to define a 2\(^{nd}\)-order random walk. A fixed-size neighbourhood is sampled using this random walk and fed into the well-known skip-gram model [42] to learn the node embeddings. The link embedding is then computed as the Hadamard product of the end-node embeddings, and a logistic regression-based classifier is trained for the link prediction task. SEAL, the second embedding-based approach, is based on neural networks (NN). Learning from Sub-graphs, Embeddings and Attributes (SEAL) utilizes the latent and explicit features of the end nodes and the structural information of the graph to learn the link embedding. SEAL starts by extracting an h-hop neighbouring sub-graph and labelling its nodes with the double-radius node labelling (DRNL) algorithm. In the second step, the labelled sub-graph is used to generate a structural encoding. The link embedding is the concatenation of the structural encoding, the pre-computed latent encoding, and the explicit feature encoding. In the final step, a neural network (NN) is trained for the link prediction task.
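The Node2Vec link-embedding step mentioned above, the Hadamard (element-wise) product of the two end-node embeddings, is simple to sketch; the vectors here are illustrative, not learned:

```python
def hadamard(u_emb, v_emb):
    """Element-wise product of two node embedding vectors."""
    return [a * b for a, b in zip(u_emb, v_emb)]

emb_u = [0.5, -1.0, 2.0]
emb_v = [2.0, 3.0, 0.5]
print(hadamard(emb_u, emb_v))  # [1.0, -3.0, 1.0]
```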

3.2 Experimental Datasets

In this study, we focus only on biological graphs. For evaluating performance, we collect six biological graphs from the Network Repository (see footnote 1). Table 3 summarizes the topological statistics and descriptions of the graph datasets.

Table 3. The graph datasets: number of nodes (\(\mathbf {|V|}\)), links (\(\mathbf {|E|}\)), average node degree (NDeg), average clustering coefficient (CC), and description.

The link prediction performance is evaluated using a random sampling validation protocol [7, 8, 41]. For each graph dataset, train and test sets are prepared by splitting the existent links. The train set consists of 90% of the existent links and an equal number of non-existent links. The test set contains the remaining 10% of the existent links and an equal number of non-existent links. We repeat the link splitting operation five times independently, producing five train and five test sets for each graph. The datasets are available in a GitLab repository (see footnote 2).
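The sampling protocol above can be sketched as follows. This is a simplified illustration on a toy cycle graph; a full implementation would also keep the train and test negative sets disjoint:

```python
import random

def split_links(nodes, edges, seed=0):
    """90/10 split of existent links, each set padded with equally many negatives."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    cut = int(0.9 * len(edges))
    train_pos, test_pos = edges[:cut], edges[cut:]

    def sample_negatives(k):
        present = set(edges)
        negs = set()
        while len(negs) < k:
            x, y = rng.sample(nodes, 2)
            if (x, y) not in present and (y, x) not in present:
                negs.add((min(x, y), max(x, y)))
        return list(negs)

    return (train_pos, sample_negatives(len(train_pos)),
            test_pos, sample_negatives(len(test_pos)))

# Toy cycle graph on 10 nodes: 9 train / 1 test existent links.
nodes = list(range(10))
edges = [(i, (i + 1) % 10) for i in range(10)]
tr_pos, tr_neg, te_pos, te_neg = split_links(nodes, edges)
print(len(tr_pos), len(tr_neg), len(te_pos), len(te_neg))
```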

3.3 Evaluation Metrics

The link prediction problem is considered a binary classification problem [46]. A traditional classifier, in general, learns a threshold to classify links as existent or non-existent. For similarity-based link classification methods, however, we find no standard approach for computing the threshold, so we calculate it in an optimistic manner. We first normalize the link scores to the range 0–1 and then use the normalized scores to compute a ROC curve. The curve gives the true positive rate (TPR) and false positive rate (FPR) for different score threshold settings. We choose the threshold point with the highest [TPR + (1 − FPR)], as we want to maximize TPR while minimizing FPR. A link with \(score \ge threshold\) is classified as existent, and as non-existent otherwise. Based on the true and predicted classes of links, we define four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). TP is the number of existent links predicted to be existent, TN is the number of non-existent links predicted to be non-existent, FP is the number of non-existent links predicted to be existent, and FN is the number of existent links predicted to be non-existent. From these four counts, we compute the following three well-known metrics.

$$\begin{aligned} Recall=\frac{TP}{TP+FN}\end{aligned}$$
(1)
$$\begin{aligned} Precision=\frac{TP}{TP+FP}\end{aligned}$$
(2)
$$\begin{aligned} F1=2\times \frac{Precision\times Recall}{Precision+Recall} \end{aligned}$$
(3)
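The optimistic threshold selection and metric computation described in this section can be sketched as follows, on toy scores and labels:

```python
def best_threshold(scores, labels):
    """Scan candidate thresholds and keep the one maximizing TPR + (1 - FPR)."""
    best_t, best_val = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and l == 0)
        tpr = tp / (tp + fn) if tp + fn else 0.0
        fpr = fp / (fp + tn) if fp + tn else 0.0
        if tpr + (1 - fpr) > best_val:
            best_val, best_t = tpr + (1 - fpr), t
    return best_t

scores = [0.9, 0.8, 0.4, 0.1]   # normalized link scores
labels = [1, 1, 0, 0]           # 1 = existent, 0 = non-existent
t = best_threshold(scores, labels)
preds = [1 if s >= t else 0 for s in scores]

tp = sum(p == l == 1 for p, l in zip(preds, labels))
fp = sum(p == 1 and l == 0 for p, l in zip(preds, labels))
fn = sum(p == 0 and l == 1 for p, l in zip(preds, labels))
precision = tp / (tp + fp)                       # Eq. (2)
recall = tp / (tp + fn)                          # Eq. (1)
f1 = 2 * precision * recall / (precision + recall)  # Eq. (3)
print(t, precision, recall, f1)  # 0.8 1.0 1.0 1.0
```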

3.4 Results and Discussion

In this section, we describe the prediction performance of supervised learning-based methods on six biological graphs. We also illustrate the importance of the features in graphs.

Table 4. Performance metrics: the dataset-wise best and second best precision, recall and F1 scores are indicated in bold and underline. The best and second best similarity-based methods are denoted by \(Sim^1\) and \(Sim^2\) respectively. For the \(Sim^1\) and \(Sim^2\) methods, the methods are specified and the performance scores are given in parentheses.
Fig. 2.
figure 2

Feature importance in HS-HT graph by logistic regression classifier

Fig. 3.
figure 3

Feature importance in different datasets by different supervised methods: (a)–(c) in Celegans, (d)–(f) in Diseasome, (g)–(i) in DM-HT

Prediction Performance: The prediction performance is computed for all methods over all five sets for each graph, and the average scores are recorded. We do not report standard deviations, as the values are very low in all experiments. The precision, recall and F1 scores are tabulated in Table 4, where the two best similarity-based methods are denoted by \(Sim^1\) and \(Sim^2\). The precision scores of the similarity-based methods are computed in an optimistic way. As the table shows, the precision scores of the best and second best similarity-based methods are very high, the highest among all methods on all graphs. This demonstrates the ability of similarity-based methods to predict high-quality links. However, their recall scores are low, implying that these methods classify the majority of the existent test links as non-existent; as a result, their F1 scores are very low. We also see that, as expected, the two best-performing similarity-based methods differ across datasets. Among the supervised learning methods (SVM, DT, LR), DT shows the worst prediction results, though still much better than the similarity-based methods. The other two classifiers have similar, and impressive, prediction scores. In many graphs, the supervised learning-based classifiers even show superior prediction performance to the embedding-based methods. Relating performance to graph properties, we see that traditional classifiers outperform embedding-based methods on dense graphs. This is intuitive, as the majority of the studied similarity-based heuristics are based on common neighbours (see Table 1). The performance of traditional classifiers is worse on the sparse graphs (CE-HT, CE-LC, Yeast), where embedding-based methods perform better.

Feature Importance: In this section, we investigate the influence of each feature on a classifier in the link prediction task. To compute the feature importance coefficients, we use the permutation importance module of the scikit-learn Python library (see footnote 3). The coefficient of a feature is computed as the drop in the score (accuracy) when the values of that feature are randomly permuted [47]. The higher the coefficient, the more important the feature. In Fig. 2, we show the feature importance for the HS-HT biological graph in the logistic regression (LR) classifier, to investigate how the importance of features differs between different sets of the same biological graph. In the LR classifier for the HS-HT graph, four features dominate. The dominance of multiple heuristics in one graph indicates that heuristics working collaboratively perform better than heuristics working alone. We also find that the feature importance coefficients across all five sets of the HS-HT graph are substantially identical.
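Permutation importance can be sketched in a few lines. The trivial "model" below is illustrative: it only looks at feature 0, so feature 1 must receive zero importance:

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Importance of feature j = mean drop in accuracy after shuffling column j."""
    rng = random.Random(seed)
    base = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(base - accuracy(model, Xp, y))
        importances.append(sum(drops) / n_repeats)
    return importances

model = lambda row: 1 if row[0] > 0.5 else 0   # ignores feature 1
X = [[0.9, 0.1], [0.8, 0.9], [0.2, 0.8], [0.1, 0.3]]
y = [1, 1, 0, 0]
imps = permutation_importance(model, X, y)
print(imps)  # feature 1's importance is exactly 0.0 (the model ignores it)
```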

We further investigate the feature importance scores in the three classifiers (SVM, DT, LR) on three different datasets, using only one set per graph. Different classifiers assign different importance coefficients to different features in different datasets. In the DM-HT dataset, all classifiers compute a high coefficient for the LPI feature, and they show close prediction performance (Table 4). In the Celegans dataset, the HPI feature dominates in the SVM and LR classifiers, whereas LPI dominates in the DT classifier. In the Celegans dataset, SVM and LR outperform DT in prediction (Table 4), suggesting that LR and SVM compute feature importance scores more correctly. Interestingly, DT tends to give more importance to the LPI feature in all three datasets (Fig. 3).

4 Conclusion

Do similarity-based heuristics compete or collaborate in the link prediction task in graphs? In this article, we study this question. We study fourteen similarity-based heuristics on six biological graphs from three different organisms. As expected, we observe that each heuristic performs well only on some particular biological graphs, and no single one wins on all graphs. Rather than using them as standalone link prediction methods, we use them as features for supervised learning methods. In addition, we derive six link features based on the nodes’ topological information. Based on these twenty features, we train three traditional supervised learning methods: SVM, DT, and LR-based classifiers. We see that the similarity-based heuristics collaboratively improve link prediction performance remarkably, even outperforming embedding-based methods on some graphs.

We propose three future directions for this study. Firstly, as the graphs in the current study are small or medium in size, studying the collaboration of similarity-based heuristics in large-scale biological as well as social graphs could be a potential future work. Secondly, exploring further heuristics might improve prediction performance on sparse graphs. Finally, other classifiers such as Random Forest, AdaBoost, and k-Nearest Neighbours could be studied for the link prediction task in graphs.