1 Introduction

Link prediction is a fundamental research problem in mining of graphs and networks. It has made wide and important impacts in social network analysis [9], bioinformatics [2], and e-commerce [8]. A major question that link prediction answers is how to estimate the structure of a network that is partially known: given the presence or absence of known links, how to predict the presence/absence of unknown ones? Solutions to this question can bring forward practical applications such as predicting following-behaviour in Twitter, suggesting friendship in Facebook, and inferring unknown interactions among proteins in protein-to-protein interaction networks [10].

Link prediction has been mostly viewed in the literature as an unsupervised learning problem and has been studied via different unsupervised methods including neighborhood-based approaches [10], matrix factorization [14], coupled tensor factorization [5], etc. But these unsupervised learning approaches are closely based on graph structure measures and adjacency matrices, while their performance have been shown to be limited to that of similarity based approaches [20]. There also has been research work that views link prediction as a supervised learning method [4, 12]. This type of methods particularly focuses on deriving and extracting features from the graph data, and then use the extracted features for link prediction.

The existing supervised and unsupervised models are both challenged when dealing with very big networks. When the network becomes very large, the size of the adjacency matrix of the network increases exponentially to the increase of the number of nodes. This makes the adjacency matrix extremely sparse which non-trivially decreases the performance of unsupervised learning models [10]. Besides, the huge sparsity also means that there is significantly more known absent links than present ones. In other words, if we represent a network graph by \(G = \{V,E\}\) and \(E \in \{0,1,?\}\), where V denotes all vertices (nodes), E denotes all edges (links), 0 denotes a known absent link, 1 denotes a known present link, and ? denotes an unknown link whose status is to be predicted, then the huge sparsity means there are significantly more unknown ? entries than known 1 and 0 entries. Moreover, since the 0 and 1 entries are taken as class labels in supervised link prediction approaches, the frequently observed scenarios of having significantly more 0 entries than 1 ones result in a significant class imbalance issue which downgrades the performance of existing supervised learning approaches [23].

To address the above challenges, in this research we proposed an effective link prediction model called Balanced Factorization Machines (BFM). The BFM takes all pair-wise interactions among nodes, links, and all their properties into the training processes. It is a supervised learning model that is not reliant on the network’s adjacency matrix and can handle network data with huge sparsity. Moreover, we design a balanced loss function for the BFM which is robust to the distribution of known link labels. Specifically, our main contributions in this paper are the following:

  • We propose a factorization machine based model, named Balanced Factorization Machines (BFM), to address the sparsity issue in link prediction problems in a supervised learning approach. The BFM method models all interactions among nodes, edges, and their associated properties using both linear and second-order expressions among features, which ensures that the BFM can well handle the sparsity of the network.

  • We design a quadratic-mean-based loss function for the BFM model to address the link label imbalance issue. The loss function of BFM is designed insensitive to distributions of labels in the training data and is thus robust to the imbalanced distribution of known links.

  • We perform evaluations on several real-world network data sets. Results from our experiments demonstrate that the BFM model is significantly more accurate on link prediction tasks in comparisons to existing other models.

To the best of our knowledge, this paper is the first research that uses factorization machines to solve the link prediction problem and to address its inherent class label imbalance problem.

The structure of the rest of the paper is as follows. We provide a review of related work in Sect. 2. Section 3 introduces the formulation and design of the proposed BFM model. We empirically evaluate our model and analyze the experiment outcomes in Sect. 4. In Sect. 5, we provide conclusions of this research with our plan for future work.

2 Literature Review

In this section, we provide a review of related work from the perspectives of link prediction and factorization machines respectively.

2.1 Link Prediction

Link prediction has become an attractive field in graph and network mining [13]. Since many real-world data can be formulated as networks, the estimation of unknown connections among nodes in a network can provide a comprehensive understanding of the studied problem, and thus provide significant values to real-world applications. One type of solutions in link predictions is the neighbourhood-based approach [1, 10]. In such approaches, similarity-based methods are used where it is assumed the nodes tend to have connections with other nodes with the highest similarity scores. More advanced approaches, such as factorization-based method [10, 14], used the principles of collaborative filtering to make link predictions, which is a similar approach as building recommendation systems. In these methods, latent factors are learned from the adjacency matrix of a network, and the predictions of links are made by a product of the latent factors. Since these factorization based methods heavily rely on the adjacency matrix, their performance is impacted when the network becomes very large and the adjacency matrix becomes highly sparse. Another type of approach for link prediction is classification-based methods where the classification is for the two labels: presence and absence of the link. In the work of [1], different classification algorithms are empirically studied with respect to their effectiveness on link prediction. Besides, directed networks are particularly studied in [4] where random forests are found to be the most effective among other alternative ensemble methods. However, these classification-based methods all face a common challenge of class imbalance, as in real-world settings there is always extremely more absence than presence of known links [23].

2.2 Factorization Machines

Factorization machines (FM) [17] are a type of supervised learning model that improves linear classification models by incorporating second-order feature interactions into the modelling processes. After the initial publication of FMs, many researchers have improved the model theoretically from several perspectives. To integrate context information into modelling processes, [18] proposed Context-aware FMs which can integrate contextual data and user it to improve the training outcome. Besides, although FM was initially introduced as a method for building recommendation systems, it has also been applied to many other domains such as Click-Through Rate (CTR) prediction [7]. Specifically, the CTR prediction is made via an extended FM model named Field-aware Factorization Machines (FFMs).

Despite of the improvements and applications of FMs, the class imbalance problems for FM has not been addressed. Since the link prediction problem usually faces huge imbalance of class labels, the imbalance problem will non-trivially affect the performance of FM. In the next section, we explain how we design our new model to address the class imbalance problem.

3 Balanced Factorization Machines

In this section, we introduce how we model link prediction as a supervised learning problem that can be addressed by factorization machines. We will also introduce our Balanced Factorization Machine (BFM) model that is capable of handling class imbalance problems.

3.1 Feature Modelling for Link Prediction

To make link prediction a supervised learning problem, we extract several types of properties from the network and use them as feature vectors. In our modelling, each data sample represents a known absent or present link from the network where the label y is 0 or 1 respectively. The feature vector of a data sample is comprised of the properties associated with the corresponding link, such as which nodes the link connects, the degrees of the nodes, and what other nodes that the link’s two ends connect to. An illustration of our feature modelling is presented in Fig. 1. In this figure, the first two blocks of features indicate the two nodes (denoted by \(N_1\) and \(N_2\)) of each link, the third block stores the distance between the two nodes (such as random walk distance) or the weight of the link, the next two blocks are the degrees of \(N_1\) and \(N_2\), and the last two blocks have the other nodes that \(N_1\) and \(N_2\) are respectively connected to.

Our feature modelling is capable of handling many different types of link properties. When the network is directed, \(N_1\) is modelled as the source node and \(N_2\) is the target node. Our model also integrates in-degrees and out-degrees of nodes on directed network by adding extra in-degree and out-degree columns to the formulation.

Fig. 1.
figure 1

Network properties extracted as feature vectors. Each row represents a known presence (i.e., label y = 1) or absence (i.e., y = 0) of links. Node 1 and node 2 present the two ends of each link. When the network is directed, node 1 is the source and node 2 is the target (from node 1 pointing to node 2). When the network is undirected, the nodes 1 and 2 are arbitrarily two ends of each link.

3.2 Our Proposed BFM Model

The creation of factorization machines can be linked back to the fundamental model of linear supervised learners. Using bold font \(\mathbf{x }\) to denote a data sample, \(x_i (i\in \{1,...,n\})\) to denote the ith feature value of \(\mathbf{x }\), y to denote labels, and D to denote the training data, the formulation of a typical linear learning model is:

$$\begin{aligned} \min _\mathbf{w }~ Sqrt \big \{\frac{1}{|D|}{\sum _{\forall x \in D}(y - \mathbf{w }^T\mathbf{x })^2}\big \} + R(\mathbf{w }) \end{aligned}$$
(1)

where |D| represents the size of training data, \(\mathbf{w }\) is the weight vector to be learned, \(R(\mathbf{w })\) is the regularizer for \(\mathbf{w }\) (such as \(L_2\) norm), and Sqrt represents the square root operation. Using the same notation, the optimization problem for a degree-2 polynomial learning model can be formulated as:

$$\begin{aligned} \min _\mathbf{v }~ Sqrt \big \{\frac{1}{|D|}{\sum _{\forall x \in D}(y - \sum _{i,j}^n{v_{i,j} x_i x_j})^2}\big \} + R'(\mathbf{v }) \end{aligned}$$
(2)

where \(\mathbf{v }\) is a weight matrix of size \(n\times {n}\), \(v_{i,j}\) is an element of \(\mathbf{v }\) at its ith row and jth column, and \(R'(\mathbf{v })\) is the regularizer for \(\mathbf{v }\) (such as matrix Frobenius norm). Factorization machines further improved the above two models by taking both linear and second-order feature interactions into its modelling process. Using the same notation above, factorization machines can be modelled as follows:

$$\begin{aligned} \min _\mathbf{w,v }~ Sqrt\big \{ \frac{1}{|D|}{\sum _{\forall x \in D}(y - ( w_0 + \sum _{i=1}^n{w_ix_i} + \sum _{i,j}^n<v_i,v_j> x_i x_j}))^2 \big \} + R(\mathbf{w }) + R'(\mathbf{v }) \end{aligned}$$
(3)

where \(\mathbf{w }\) is the weight vector, \(\mathbf{v }\) is the weight matrix of size \(n\times {k}\) (where k is the length of a column vector of \(\mathbf{v }\)), and \(<v_i,v_j>\) is the inner product (i.e., dot product) of the ith and jth column vectors of \(\mathbf{v }\).

Since the factorization machine’s loss function uses arithmetic mean (average of all errors), when the label is very imbalanced (e.g., when there is significantly more \(y = 0\) than \(y = 1\)), Eq. 3 can be almost optimized by predicting all samples to be 0 since in this way it will generate very small error rates measured by the first term of Eq. 3. Consequently, the optimization of factorization machines will be biased towards the majority label in the training processes which will lead to low accuracy particularly on the minority label. The minority label is the presence of links in link prediction problems, the accuracy of which is vitally important and cannot be compromised. To address this label imbalance problem, we proposed a Balanced Factorization Machines (BFM) which uses the quadratic-mean of errors with respect to labels as its loss function. The overall formulation of our BFM model is as follows:

$$\begin{aligned} \begin{aligned} \min _\mathbf{w,v }~&Sqrt \Big \{ \frac{\sum _{\forall x \in D+} (y - ( w_0 + \sum _{i=1}^n{w_ix_i} + \sum _{i,j}^n<v_i,v_j> x_i x_j))^2}{|D^+|} \\+&\frac{\sum _{\forall x \in D-} (y - ( w_0 + \sum _{i=1}^n{w_ix_i} + \sum _{i,j}^n<v_i,v_j> x_i x_j))^2}{|D^-|} \Big \} \!+\! R(\mathbf{w }) + R'(\mathbf{v }) \end{aligned} \end{aligned}$$
(4)

where \(D^+\) and \(D^-\) represent data samples in positive and negative labels (i.e., presence and absence of known links), respectively. The first term in Eq. 4 uses quadratic mean to measure the error loss of BFM. Now we introduce a lemma to show why our quadratic-mean-based loss function is theoretical more advantageous in learning from imbalanced data:

Lemma 1

\(\forall ~err_1 \ge 0\), and \(\forall ~err_2 \ge 0\), denote by \(AM(err_1, err_2)\) and QM(\(err_1\), \(err_2\)) the arithmetic mean and quadratic mean of \(err_1\) and \(err_2\) respectively, then \( QM(err_1, err_2) = \sqrt{AM(err_1, err_2) + (\frac{err1 - err2}{2})^2} \).

Proof

The lemma can be proved by the derivations below:

$${ \begin{aligned} QM(err_1, err_2)&= \sqrt{\frac{err_1^2 + err_2^2}{2}} \\&= \sqrt{\frac{err_1^2 + err_2^2 + 2err_1err_2}{4} + \frac{err_1^2 + err_2^2 - 2err_1err_2}{4}} \\&= \sqrt{(\frac{err_1 + err_2}{2})^2 + (\frac{err_1 - err_2}{2})^2 } \\&= \sqrt{AM(err_1, err_2) + (\frac{err_1 - err_2}{2})^2 } ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\square \end{aligned} } $$

The above lemma shows that conventional factorization machines (which use the AM) are optimized when the sum of errors from all samples are minimized, while our BFM model (which uses the QM) is optimized if only the sum and the difference between errors of positive and negative samples are both minimized. To further illustrate the significance of this difference between AM and QM, we provide the following example on error calculations:

Example 1

Consider a learning algorithm that has a classification performance shown in Table 1. Apparently the classification is biased towards the negative class (i.e., predicting almost all data to be negative). However, the AM-based error is as extremely low as 0.8% (i.e., \(\frac{8+0}{10+990}\)). On the contrary, the QM-based error is as high as 56.6% (i.e., \(\sqrt{\frac{(8/10)^2+(0/990)^2}{2}}\)).

Table 1. Classification performance on an imbalanced data set for Example 1.

This example illustrates that QM can detect the bias of the learning algorithm which AM cannot. This unique property of QM can effectively guide the optimization of our BFM model and make BFM robust to learning on imbalanced data sets.

4 Experiments and Analysis

We analyze and compare the performance of our BFM model against existing link prediction methods on real-world data sets. We compare our method with classical factorization machines (FM) [17], field-aware factorization machines (FFM) [7], graph-based link prediction method (GLP) [4], and matrix factorization based method (MF) [14]. We also include unsupervised learning based methods in the comparisons, such as non-negative matrix factorization (NNMF) [3] and Link Propagation (LP) [11]. We use the \(L_2\) norm for \(R(\mathbf{w })\) and use the Frobenius norm for \(R'(\mathbf{v })\), and use graph random walk [6] as the distance between every pair of two nodes. In all our experiments, we use 5-fold cross validation to separate training and test data, conduct 10 repeated runs with different random seeds, and report the average results of the repeated runs. We use the AUC (i.e., the area under ROC curve) as our evaluation metric, which is more suitable than using the overall accuracy as the metric for imbalanced data.

4.1 Data Sets

We use six publicly available real-world network data sets in our experiments: protein-to-protein interaction networks (PPI) [21], NIPS co-authorship networks (NIPS) [19], email Eu-core network (Email) [24], UC Irvine messaging network (SMS) [16], adolescent health network (Health) [15], and US electric power grid network (PowerGrid) [22]. The details of all data sets are shown in Table 2.

4.2 Experimental Results

We compare BFM against the two most popular factorization machines, FM and FFM, by evaluating their responses to changes of parameter k (i.e., the length of the column vectors of \(\mathbf{v }\)). The experimental results are shown in Fig. 2. From the subfigures, we can clearly see that the performance of BFM is almost persistently better than that of FM and FFM by a large magnitude. Although the performances of FM and FFM are sometime indistinguishable (such as in the Email data set), the performance of our BFM model is always better than them. This confirms that our QM-based loss function is significantly more effective on handling imbalanced data for link predictions problems.

Fig. 2.
figure 2

Performance comparisons among factorization machines based methods.

Table 2. Details of data sets used in experiments.
Table 3. Performance comparisons between BFM and other link prediction methods.

We also compare our BFM model with other existing supervised and unsupervised link prediction methods. The detailed experimental results of these comparisons are presented in Table 3. In the table, we report the mean and variance values of 10 repeated runs for each algorithm on each data set. From the table, we can observed that the performance of our BFM model is significantly better than all other alternative models.

5 Conclusions and Future Work

In this paper we propose to address the link prediction problem by using factorization machines. We extract graphical features from networks and use the features to training our model. Our modelling of the features can utilize information from both directed and undirected networks. More importantly, we designed a novel and effective loss function, which uses the quadratic mean of errors with respect to labels, to handle the label imbalance problems. Our experiment results on real-world data sets demonstrate that our BFM mode significantly outperforms other link prediction methods. In future, we plan to extend our model to address link prediction problems on temporal networks and graph streams.