1 Introduction

The evolution of real-world networks is complex. Due to experimental errors and technical limitations, network data obtained from the real world are often incomplete [1, 2]. Missing links may greatly affect a network's topological structure and properties, so it is necessary to detect them based on the available information. The study of link prediction mainly focuses on the detection of missing links and the prediction of unknown links [3]. It is of great theoretical significance, since it makes network data more accurate and thereby helps researchers study network properties more precisely [4, 5]. In addition, link prediction has important practical value in many research fields: for example, it can infer potential protein–protein interactions in biological networks [6], saving the monetary and time costs of biological experiments; reveal the type and coauthors of papers in academic networks [7, 8]; and contribute to recommendation systems in various social networks [9,10,11].

Various methods have been proposed for link prediction. They can be roughly divided into similarity-based methods and learning-based methods [12,13,14]. Based on the assumption that node pairs with higher similarity scores are more likely to be connected, similarity-based methods calculate the linkage likelihood of non-observed links by extracting features of the network's topological structure [12]. Local similarity indices mainly consider the role of the common neighbors of a node pair, such as the common neighbor (CN) index [15], the Adamic/Adar (AA) index [16], preferential attachment [17], resource allocation (RA) [18], and local naive Bayes-based common neighbors (LNBCN) [19]. It is also worth noting that relevant studies have demonstrated that clustering information plays a significant role in link prediction; for instance, the node clustering coefficient for link prediction (CCLP) index was applied to characterize the similarity of node pairs and yielded good prediction performance in [20]. Global similarity indices are generally calculated from the topological structure of the entire network, such as the Katz index [21], random walk with restart [22], the Leicht–Holme–Newman Global index [23], and the matrix forest index [24]. Quasi-local similarity metrics attempt to balance accuracy and computational complexity; some of them exploit the entire topological structure of the network at lower complexity than global approaches, such as the local path (LP) index [25], the local random walk index [26], and the significant path index [27].

Learning-based methods, including classification-based methods, probabilistic and statistical methods, and matrix factorization methods, usually achieve good prediction accuracy. First, classification-based methods regard link prediction as a binary classification problem, marking existing links as positive and non-existing links as negative; their key steps are feature extraction and machine learning. Second, several probabilistic and statistical methods have been proposed, such as probabilistic relational models for link prediction [28,29,30], the hierarchical structure model [31], and the stochastic block model [32]. These methods can capture useful correlation information and hence improve prediction accuracy, but their time cost and complexity make them unsuitable for large-scale real-world networks. Third, matrix factorization methods exploit extracted features to represent each vertex in a latent space, and these representations are applied within different learning frameworks for link prediction [33,34,35]. Matrix factorization has been widely used for dimensionality reduction, and several network embedding-based methods built on this idea have been adopted to address the link prediction problem, such as DeepWalk [36], large-scale information network embedding (LINE) [37], Node2vec [38], and structural deep network embedding (SDNE) [39]. These methods avoid the scalability issues faced by traditional graph embedding methods, including Laplacian eigenmaps (LE) [40] and locally linear embedding [41]. The disadvantage of learning-based methods is that preparing training data and learning from it are time-consuming. In contrast, topological similarity-based methods have low computational complexity, and various topological structures can easily be extracted and employed to improve prediction accuracy. Therefore, topological structure-based methods remain a good choice for link prediction.

Considering the important role of clustering information in link prediction, this paper proposes a local similarity index to improve prediction accuracy within the topological structure framework. Node clustering information is widely involved in the construction of similarity indices and can be quite effective. However, the node clustering coefficient provides the same clustering information for all node pairs, which may be ineffective when a node plays different roles in different communities. Therefore, instead of directly using the node clustering coefficient [42] of the common neighbors of a node pair, this work mainly considers the link clustering coefficient. Another important topological metric, the degree-related clustering coefficient [43], is introduced to obtain local similarity information and can be regarded as the initial resource of a common neighbor. The degree-related and link clustering coefficient is then defined for the links between a common neighbor and the node pair. This paper further analyzes the influence of the end nodes and uses node degree to describe it. The similarity score is defined by integrating the above topological information. Finally, under different evaluation metrics, the performance of the proposed approach is assessed by comparing it with classical local similarity methods, a quasi-local similarity method, and several graph embedding-based methods on small-scale, medium-scale, and large-scale networks. Simulation results show the effectiveness and superiority of our method.

2 Methods

2.1 Definition

Consider an undirected, unweighted network represented by a graph \(G=(V,E)\), where V is the set of nodes and E is the set of links. |M| denotes the number of elements in a set M, and U denotes the set of all possible links. The set of nonexistent links is then \(U\backslash E\), which contains the potential missing links to be detected. For a given node pair \((x,y)\in {V\times V}\), \(s(x,y)\) represents the similarity score of \((x,y)\). The topological measures used in this paper are defined as follows.

Definition 1

[CC (Clustering coefficient)] The clustering coefficient [42] is defined as the ratio of the number of triangles passing through a node to the maximum possible number of such triangles, that is:

$$\begin{aligned} CC_x=\frac{2l_x}{k_x(k_x-1)}, \end{aligned}$$
(1)

where \(l_x\) represents the number of triangles passing through node x, and \(k_x\) represents the degree of node x.
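For concreteness, Eq. (1) can be computed directly from neighbor sets. The sketch below is an illustrative Python implementation (our own, not from the original work), assuming an undirected, unweighted networkx graph; it agrees with networkx's built-in `nx.clustering`.

```python
from itertools import combinations

import networkx as nx

def clustering_coefficient(G: nx.Graph, x) -> float:
    """CC_x = 2 * l_x / (k_x * (k_x - 1)), as in Eq. (1)."""
    k_x = G.degree(x)
    if k_x < 2:
        return 0.0  # no triangle can pass through a node of degree 0 or 1
    neighbors = set(G[x])
    # l_x: number of triangles through x, i.e., number of links among x's neighbors
    l_x = sum(1 for u, v in combinations(neighbors, 2) if G.has_edge(u, v))
    return 2.0 * l_x / (k_x * (k_x - 1))
```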

CC reflects the tightness of connections among local nodes in the network. However, for a given node pair to be predicted, the clustering coefficient of a node in their common neighborhood is not a discriminative metric, since it takes the same value regardless of the pair under consideration. Researchers have therefore turned to the edge clustering coefficient: the edges between a common neighbor and the two nodes of a pair generally have different edge clustering coefficients, so it can serve as a pair-specific metric for determining whether a link exists between the node pair. Wang et al. introduced the concept of the edge clustering coefficient [44], which is analogous to the node clustering coefficient.

Definition 2

[ECC (Edge clustering coefficient)] The edge clustering coefficient [44, 45] is the ratio of the number of triangles passing through a link to the maximum possible number of such triangles, that is:

$$\begin{aligned} \text {ECC}_{x, z}=\frac{|{N}(x)\cap {N}(z)|}{\text {min}(k_x-1,k_z-1)}, \end{aligned}$$
(2)

where \({N}(x)\) is the set of neighbor nodes of node x, \({N}(z)\) is the set of neighbor nodes of node z, and \(\text {min}(k_x-1,k_z-1)\) serves as the normalization constraint.

Note that the mathematical form of ECC is similar to that of the hub promoted index (HPI) [46]; that is, their numerators are the same and their denominators differ by one. However, the two indices are designed for different purposes: HPI aims to promote connections between sparsely connected nodes and hub nodes while inhibiting connections between hub nodes, whereas ECC measures the clustering ability of an edge. Furthermore, because of the difference in the objects they examine, ECC is not applicable to node pairs without links, while HPI applies to any node pair.

Definition 3

[ALC (Asymmetric link clustering coefficient)] In this paper, we mainly consider the clustering influence of the end node z of a link (x, z) or (y, z), where z is a common neighbor of a specified node pair (x, y). The greater the degree of node z, the smaller \(\frac{1}{k_z(k_z-1)}\) becomes; that is, the difference introduced by taking \(k_z\) as the denominator also shrinks. Hence, to broaden applicability and simplify subsequent operations, \(\text {min}(k_x-1,k_z-1)\) in ECC is replaced by \(k_z\), and the resulting quantity is called the asymmetric link clustering coefficient, that is:

$$\begin{aligned} \text {ALC}_{x,z}=\frac{|{N}(x)\cap {N}(z)|}{k_z}. \end{aligned}$$
(3)

The asymmetry of this coefficient stems from the node degrees: clearly, \(\text {ALC}_{x,z}\ne \text {ALC}_{z,x}\) when \(k_x\ne k_z\). Moreover, \(\text {ALC}_{x,z}\ne \text {ALC}_{y,z}\) when \(|{N}(x)\cap {N}(z)|\ne |{N}(y)\cap {N}(z)|\). ALC can thus capture distinct clustering information for different common neighbors.
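A minimal sketch of Eqs. (2) and (3) under the same assumptions as before (Python with networkx; the function names are ours):

```python
import networkx as nx

def ecc(G: nx.Graph, x, z) -> float:
    """Edge clustering coefficient ECC_{x,z} of an existing link (x, z), Eq. (2)."""
    common = len(set(G[x]) & set(G[z]))
    denom = min(G.degree(x) - 1, G.degree(z) - 1)
    return common / denom if denom > 0 else 0.0

def alc(G: nx.Graph, x, z) -> float:
    """Asymmetric link clustering coefficient ALC_{x,z}, Eq. (3).

    Asymmetric: alc(G, x, z) != alc(G, z, x) in general.
    """
    return len(set(G[x]) & set(G[z])) / G.degree(z)
```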

Definition 4

(Assortative coefficient) The assortative coefficient [47] is defined to characterize the assortative preference of the network, that is:

$$\begin{aligned} r=\frac{L^{-1}\sum _{l_{ij}\in E}{k_ik_j}-{(L^{-1}\sum _{l_{ij}\in E}{\frac{1}{2}(k_i+k_j)})}^2}{L^{-1}\sum _{l_{ij}\in E}{\frac{1}{2}({k_i}^2+{k_j}^2)}-{(L^{-1}\sum _{l_{ij}\in E}{\frac{1}{2}(k_i+k_j)})}^2}, \end{aligned}$$
(4)

where L is the number of links in the network, \(l_{ij}\) denotes a link between node i and node j, \(i\in V\), \(j\in V\), and \(k_i\) and \(k_j\) represent the degrees of i and j, respectively.

A network is assortative if its nodes tend to connect to similar nodes. The link predictability of a network is often closely related to its topological features. To fully account for the clustering capabilities of the endpoints, a network topology measure called the degree-related clustering coefficient [43] is introduced.

Definition 5

[DCC (Degree-related clustering coefficient)] The degree-related clustering coefficient [43] is defined based on the assumption that the observed networks are incomplete, that is:

$$\begin{aligned} \text {DCC}_z={\overline{\text {CC}}}_{k_z}\cdot {k_z}^r, \end{aligned}$$
(5)

where \({{\overline{\text {CC}}}}_{k_z}\) is the average clustering coefficient over all nodes whose degree equals the degree \(k_z\) of node z, and r is the assortative coefficient of the network.

A previous study demonstrated a significant correlation between the topological score DCC and link predictability [43]; a similarity index constructed from DCC achieved excellent prediction performance on the link prediction problem [43].
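The DCC score of Eq. (5) can be computed for every node in one pass. The following is an illustrative sketch (our own naming) that relies on networkx's assortativity and clustering routines:

```python
from collections import defaultdict

import networkx as nx

def dcc_scores(G: nx.Graph) -> dict:
    """DCC_z = avg-CC(k_z) * k_z ** r for every node z, Eq. (5)."""
    r = nx.degree_assortativity_coefficient(G)
    cc = nx.clustering(G)
    # average clustering coefficient within each degree class
    by_degree = defaultdict(list)
    for node, k in G.degree():
        by_degree[k].append(cc[node])
    mean_cc = {k: sum(vals) / len(vals) for k, vals in by_degree.items()}
    # isolated nodes (k = 0) are skipped: 0 ** r is undefined for r < 0
    return {z: mean_cc[k] * (k ** r) for z, k in G.degree() if k > 0}
```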

Definition 6

[DLC (Degree-related and link clustering coefficient)] In this paper, a new degree-dependent link clustering coefficient, named the degree-related and link clustering coefficient, is defined by combining the asymmetric link clustering coefficient with the degree-related clustering coefficient, that is:

$$\begin{aligned} \text {DLC}_{x,z}=\frac{|{N}(x)\cap {N}(z)|}{k_z}\cdot \text {DCC}_z; \end{aligned}$$
(6)

or equivalently:

$$\begin{aligned} \text {DLC}_{x,z}={|{N}(x)\cap {N}(z)|}\cdot {{\overline{\text {CC}}}_{k_z}}\cdot {k_z}^{(r-1)}. \end{aligned}$$
(7)

The DLC index not only makes full use of the clustering-ability information of the link (x, z), but also takes the influence of the endpoint z into consideration. By treating the degree-related clustering coefficient of node z as the initial resource owned by node z, \(\text {DLC}_{x,z}\) can be interpreted as the share of resource that node z allocates to node x. As shown in Fig. 1a, for a specified node pair, the degree-related and link clustering coefficients between a node in their common neighborhood and the end nodes provide pair-specific clustering information.

Fig. 1

The DLC index and the similarity score based on the DLC index. a A simple instance explaining the DLC index: \(\left| {N}\left( x\right) \cap {N}\left( z\right) \right| ={3}\), \(\left| {N}\left( y\right) \cap {N}\left( z\right) \right| ={1}\). Clearly, \(\text {DLC}_{x,z}\ne \text {DLC}_{y,z}\), which implies that the DLC index can serve as a pair-specific measurement for predicted node pairs. b The similarity score between node x and node y is calculated via the DLC index over their common neighbors; each red link corresponds to a DLC clustering-information value

2.2 Similarity score based on DLC

Given a pair of unconnected nodes (x, y) and \(z\in {N}(x)\cap {N}(y)\), \(\text {DLC}_{x,z}\) differs from \(\text {DLC}_{y,z}\) whenever \({N}(x)\cap {N}(z)\) and \({N}(y)\cap {N}(z)\) differ. If node x and node z share no common neighbor, the clustering ability of node z cannot contribute to a connection between x and y. Conversely, if nodes z and x share many common neighbors and z is also adjacent to y, the possibility that a link exists between x and y increases. \(\text {DLC}_{x,z}\) and \(\text {DLC}_{y,z}\) can thus be regarded as the clustering information of the common neighbor z of the node pair (x, y). The new similarity score of the node pair (x, y) is defined by combining these two distinct parts, that is:

$$\begin{aligned} {s(x,y)}^{\mathrm{DLC}}= & {} w_x\sum _{z\in {N}(x)\cap {N}(y)}{\mathrm{DLC}}_{x,z}\nonumber \\&+w_y\sum _{z\in {N}(x)\cap {N}(y)}{\mathrm{DLC}}_{y,z}, \end{aligned}$$
(8)

where:

$$\begin{aligned} w_x=\frac{k_x}{\max (k_x,k_y)},\quad w_y=\frac{k_y}{\max (k_x,k_y)}. \end{aligned}$$
(9)

Considering the influence of the end nodes x and y, \(w_x\) and \(w_y\) are introduced as regulation weights; obviously, \(w_x\in (0,1]\) and \(w_y\in (0,1]\). The information of the common neighbors serves as the main measurement, and when combining the information of the node pair, we simply take the degree of each endpoint, normalized to (0, 1], as its weight. In this way, the clustering information of the common neighborhood is fully exploited for the endpoint with the larger degree and partially exploited for the endpoint with the smaller degree.

It is worth mentioning that another weight allocation, namely \(w_x = k_x/(k_x + k_y)\) and \(w_y = k_y/(k_x + k_y)\), has also been considered. It was tested on several networks, and the results show that the proposed DLC framework achieves better performance with the weight assignment of Eq. (9), further demonstrating the advantages of the proposed method.
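Putting Eqs. (6), (8), and (9) together, a compact sketch of the full DLC score might look as follows. This is our own illustrative rendering (assuming the `dcc_scores` helper sketched after Definition 5), not the authors' reference implementation:

```python
import networkx as nx

def dlc_similarity(G: nx.Graph, x, y, dcc: dict) -> float:
    """s(x, y)^DLC of Eq. (8) with the weights w_x, w_y of Eq. (9).

    Assumes x and y are not isolated; `dcc` maps each node z to DCC_z.
    """
    k_x, k_y = G.degree(x), G.degree(y)
    k_max = max(k_x, k_y)
    w_x, w_y = k_x / k_max, k_y / k_max
    score = 0.0
    for z in nx.common_neighbors(G, x, y):
        base = dcc[z] / G.degree(z)           # DCC_z / k_z
        n_z = set(G[z])
        dlc_xz = len(set(G[x]) & n_z) * base  # Eq. (6)
        dlc_yz = len(set(G[y]) & n_z) * base
        score += w_x * dlc_xz + w_y * dlc_yz
    return score
```

Scores computed this way for all non-observed pairs can then be ranked in descending order to produce the predicted link list.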

2.3 Baselines

To verify the prediction performance of the proposed method, we compare it with eight classical local similarity-based algorithms and a quasi-local similarity index. Their similarity scores are defined as follows; an illustrative implementation of several of these indices is sketched after the list.

(1) Common Neighbors (CN) index [15]:

$$\begin{aligned} {s(x,y)}^{\mathrm{CN}}=|{N}(x)\cap {N}(y)|. \end{aligned}$$
(10)

This similarity index counts the number of common neighbors of a node pair, based on the assumption that more common neighbors imply more similar endpoints.

(2) Adamic/Adar (AA) index [16]:

$$\begin{aligned} {s(x,y)}^{\mathrm{AA}}=\sum _{z\in {N}(x)\cap {N}(y)}\frac{1}{\mathrm{log}(k_z)}. \end{aligned}$$
(11)

This index assigns each common neighbor a weight that decreases with its degree.

(3) Resource Allocation (RA) index [18]:

$$\begin{aligned} {s(x,y)}^{\mathrm{RA}}=\sum _{z\in {N}(x)\cap {N}(y)}\frac{1}{k_z}. \end{aligned}$$
(12)

This index is inspired by the resource allocation mechanism of networks. Compared with AA, it penalizes large-degree nodes more heavily.

(4) CAR-based Common Neighbor (CAR) index [48]:

$$\begin{aligned} {s(x,y)}^{\mathrm{CAR}}=|{N}(x)\cap {N}(y)|\sum _{z\in {N}(x)\cap {N}(y)}\frac{|\gamma (z)|}{2}, \end{aligned}$$
(13)

where \(\gamma (z)\) is the subset of neighbors of node z that are also common neighbors of the endpoints x and y. This index is built on the hypothesis that a node pair tends to be more similar as the number of local community links increases.

(5) CAR-based Adamic/Adar (CAA) index [48]:

$$\begin{aligned} {s(x,y)}^{\mathrm{CAA}}=\sum _{z\in {N}(x)\cap {N}(y)}\frac{|\gamma (z)|}{\mathrm{log}(k_z)}. \end{aligned}$$
(14)

This index introduces the local-community-links concept into the AA index.

(6) CAR-based Resource Allocation (CRA) index [48]:

$$\begin{aligned} {s(x,y)}^{\mathrm{CRA}}=\sum _{z\in {N}(x)\cap {N}(y)}\frac{|\gamma (z)|}{k_z}. \end{aligned}$$
(15)

This index incorporates the local-community-links concept into the RA index.

(7) Local Naive Bayes-based Common Neighbors (LNBCN) index [19]:

$$\begin{aligned} {s(x,y)}^{\mathrm{LNBCN}}= \sum _{z\in {N}\left( x\right) \cap {N}\left( y\right) }\left[ \mathrm{log}\left( \frac{{\mathrm{CC}}_z}{1-{\mathrm{CC}}_z}\right) +\mathrm{log}\left( \frac{1-\rho }{\rho }\right) \right] , \end{aligned}$$
(16)

where \(\rho \) is the network density defined by \(\left| E\right| /\left| U\right| \). This index is based on naive Bayes theory and accounts for the different roles of different common neighbors.

(8) Node Clustering Coefficient (CCLP) index [20]:

$$\begin{aligned} {s(x,y)}^{\mathrm{CCLP}}=\sum _{z\in {N}(x)\cap {N}(y)}\text {CC}_z. \end{aligned}$$
(17)

This index sums the clustering coefficients of all common neighbors of the node pair.

(9) Local Path (LP) index [25]:

$$\begin{aligned} {s(x,y)}^{\mathrm{LP}}=A^2+\varepsilon A^3, \end{aligned}$$
(18)

where A is the adjacency matrix of the network and \(\varepsilon \) is the penalization parameter. This index counts the paths of lengths 2 and 3 between an unconnected node pair.

Recently, various network embedding methods have been proposed that perform well in link prediction. Considering the scalability of networks, and to show the effectiveness of our method, the following five graph embedding methods are also introduced for comparison with the DLC index.

(10) Laplacian Eigenmaps (LE) [40]: Laplacian eigenmaps is a graph-based dimensionality reduction algorithm that constructs a graph from the data to be reduced and preserves the original local structural relationships after dimensionality reduction. The limitation of this traditional graph embedding method is its complexity on large-scale networks. To overcome this limitation, several graph embedding methods exploit the sparsity of real-world networks; they are described below.

(11) DeepWalk [36]: DeepWalk is a network representation approach that combines truncated random walks with a neural language model. It learns latent representations of vertices by treating walks as the equivalent of sentences, and it is adaptive, community aware, low-dimensional, and continuous.

(12) Large-scale Information Network Embedding (LINE) [37]: LINE preserves both local and global network structure by optimizing a carefully designed objective function, and it can learn representations for various types of information networks.

(13) Node2vec [38]: Node2vec is a feature learning approach that maps nodes into a low-dimensional feature space while maximizing the likelihood of preserving their network neighborhoods. This framework can capture the diversity of connectivity patterns observed in networks.

(14) Structural Deep Network Embedding (SDNE) [39]: SDNE uses an auto-encoder structure to optimize first-order and second-order similarities simultaneously; the learned vector representations retain both local and global structure and are robust to sparse networks.
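As referenced above, a brief illustrative implementation of a few of the local indices (CN, AA, RA, and CCLP), under the same Python/networkx assumptions as the earlier sketches:

```python
import math

import networkx as nx

def cn_score(G: nx.Graph, x, y) -> float:
    """Eq. (10): number of common neighbors."""
    return len(set(nx.common_neighbors(G, x, y)))

def aa_score(G: nx.Graph, x, y) -> float:
    """Eq. (11): common neighbors weighted by 1 / log(degree)."""
    # a common neighbor has degree >= 2, so log(k_z) > 0
    return sum(1.0 / math.log(G.degree(z))
               for z in nx.common_neighbors(G, x, y))

def ra_score(G: nx.Graph, x, y) -> float:
    """Eq. (12): common neighbors weighted by 1 / degree."""
    return sum(1.0 / G.degree(z) for z in nx.common_neighbors(G, x, y))

def cclp_score(G: nx.Graph, x, y, cc: dict) -> float:
    """Eq. (17): sum of node clustering coefficients; pass cc = nx.clustering(G)."""
    return sum(cc[z] for z in nx.common_neighbors(G, x, y))
```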

It is worth noting that the node features extracted by the above graph embedding methods must be converted into link features to be usable for link prediction, so the Hadamard operator, a binary operator defined in [38], is used to construct link features. In particular, given a node pair (u, v) with feature vectors f(u) and f(v), the Hadamard operator is defined as:

$$\begin{aligned} {\vert f(u)\boxdot f(v)\vert }_i={f_i (u)}*{f_i (v)}, \end{aligned}$$
(19)

and the corresponding link feature for the node pair (u, v) can thus be obtained even if no actual link exists between nodes u and v.
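In code, the Hadamard operator of Eq. (19) is simply an element-wise product of the two embedding vectors; a minimal sketch:

```python
import numpy as np

def hadamard_link_feature(f_u: np.ndarray, f_v: np.ndarray) -> np.ndarray:
    """Link feature for node pair (u, v), Eq. (19): element-wise product."""
    return f_u * f_v  # the result is fed to a binary classifier on link labels
```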

In addition, for outlier nodes, the similarity to another node depends on the degree of the given node; for this case, we follow Ding et al. [49] and adopt the same calculation.

Table 1 Topological features of the 16 real-world network datasets. \(\left| V\right| \) and \(\left| E\right| \) denote the numbers of nodes and links, respectively. \(\langle k \rangle \) and \(\langle d \rangle \) represent the average degree and average distance, respectively. H is the degree heterogeneity defined by \(H=\frac{\langle k^2 \rangle }{{\langle k \rangle }^2}\). r is the assortative coefficient. \(\langle c \rangle \) and \(\langle c_d \rangle \) represent the average clustering coefficient and the average degree-related clustering coefficient, respectively

3 Experiments and analysis

In this section, computational simulations on real-world networks from different fields are performed to show the effectiveness and advantages of the proposed approach. The DLC similarity index is compared with the classical link prediction methods introduced in Sect. 2.3. In particular, we further compare the prediction performance of the CCLP index and the DLC index in inferring top-ranked links by examining how the links ranked highly by one index are hit by the other. In addition, to demonstrate the applicability of our method to large-scale networks, we implement the DLC index and the representative methods on four large-scale networks, namely Arxiv [50], Deezer [51], Enron [52], and Brightkite [53].

3.1 Datasets

The 16 tested networks are listed as follows.

(1) Dolphins [54]: A dolphin social network.

(2) Football [55]: A network of American football games.

(3) Jazz [56]: A collaboration network of jazz musicians.

(4) Celegans [57]: The neural network of the nematode worm C. elegans.

(5) USair [15]: The US air transportation network.

(6) Metabolic [58]: The metabolic network of the roundworm C. elegans.

(7) Email [59]: An email contact network of the University Rovira i Virgili.

(8) PB [60]: A network of political blogs.

(9) King [61]: A network of places and names in the King James Bible.

(10) Yeast [62]: A protein–protein interaction network of yeast.

(11) Erdos [63]: A collaboration network.

(12) Router [64]: A router-level Internet network.

(13) Arxiv [50]: A collaboration network from the arXiv Astro Physics category.

(14) Deezer [51]: A social network of Deezer users from European countries.

(15) Enron [52]: The Enron email communication network.

(16) Brightkite [53]: A friendship network of Brightkite users.

According to the numbers of nodes and links, these networks are divided into three types: small-scale networks with fewer than 1000 nodes and 5000 links (datasets (1)–(6)), middle-scale networks with more than 1000 nodes (datasets (7)–(12)), and large-scale networks with more than 10,000 nodes and 10,000 links (datasets (13)–(16)). Topological features of the 16 real-world network datasets are summarized in Table 1.

3.2 Evaluation metrics

The performance of the proposed method and the other typical methods is evaluated by two commonly used metrics: AUC (area under the receiver-operating characteristic curve) [65] and precision [66]. First, a fraction of the links of the network is randomly selected as the test set; for example, 10% or 20% of the links can be selected, with the remaining links used as the training set. The test set is denoted by \(E^p\) and the training set by \(E^t\), where \(E^p\cup E^t=E\) and \(E^p\cap E^t=\emptyset \).
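A typical split might be implemented as below; this sketch (our own, with an arbitrary seed parameter) removes the probe links from a copy of the graph so that only \(E^t\) is visible to the similarity indices. Note that it does not enforce connectivity of the training graph.

```python
import random

import networkx as nx

def split_edges(G: nx.Graph, probe_frac: float = 0.1, seed: int = 0):
    """Randomly partition E into a probe (test) set E^p and a training graph on E^t."""
    rng = random.Random(seed)
    edges = list(G.edges())
    rng.shuffle(edges)
    n_probe = int(probe_frac * len(edges))
    probe = edges[:n_probe]           # E^p
    G_train = G.copy()
    G_train.remove_edges_from(probe)  # G_train keeps only E^t
    return G_train, probe
```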

Table 2 AUC and precision metrics of the DLC index and nine classical indices on the six small-scale networks. Each value is obtained by averaging 20 independent runs with 10% of links as the probe set. The best performance for each network is shown in bold

(i) AUC

The AUC metric is the probability that a link randomly selected from the test set receives a higher score than a randomly selected nonexistent link. Over n independent comparisons, let \(n^\prime \) denote the number of times the test-set link scores higher and \(n^{\prime \prime }\) the number of times the two scores are equal. The AUC metric is then defined as:

$$\begin{aligned} \text {AUC}=\frac{n^\prime +0.5n^{\prime \prime }}{n}. \end{aligned}$$
(20)

Obviously, the AUC value equals 0.5 for a random prediction, since the scores of test-set links and nonexistent links then follow the same distribution. Therefore, the amount by which the AUC exceeds 0.5 indicates how much better than chance the corresponding method performs.
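The sampled AUC of Eq. (20) can be estimated as follows; `score` is any similarity function s(x, y), and the sampling loop is our own illustrative choice:

```python
import random

def auc_metric(score, probe, non_links, n: int = 10000, seed: int = 0) -> float:
    """Eq. (20): AUC = (n' + 0.5 * n'') / n over n sampled comparisons."""
    rng = random.Random(seed)
    higher = equal = 0
    for _ in range(n):
        s_pos = score(*rng.choice(probe))      # a missing (probe-set) link
        s_neg = score(*rng.choice(non_links))  # a nonexistent link
        if s_pos > s_neg:
            higher += 1
        elif s_pos == s_neg:
            equal += 1
    return (higher + 0.5 * equal) / n
```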

(ii) Precision

The precision metric focuses on the top-ranked links. Let \(L_r\) denote the number of missing links among the top-L predicted links. The precision metric is then defined as:

$$\begin{aligned} \text {Precision}=\frac{L_r}{L}, \end{aligned}$$
(21)

where L denotes the number of top predicted links.

Obviously, higher precision means higher prediction accuracy of the method.
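Correspondingly, a small sketch of Eq. (21), assuming the candidate pairs are already sorted by descending score:

```python
def precision_at_L(ranked_pairs, probe, L: int = 100) -> float:
    """Eq. (21): fraction of the top-L predicted links that are true missing links."""
    probe_set = {frozenset(e) for e in probe}
    hits = sum(1 for pair in ranked_pairs[:L] if frozenset(pair) in probe_set)
    return hits / L
```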

3.3 Results

Based on the aforementioned evaluation metrics, the performance of the DLC index is compared with the representative baselines introduced in Sect. 2.3. Since small-scale networks do not require network embedding to capture topological information, we implement only the local similarity-based approaches on the six small-scale networks in Table 1. Table 2 lists the AUC and precision values of each method on the six small-scale networks. Based on the AUC metric, the DLC index performs best on four of the six networks and ranks second on the remaining two, so it nearly achieves the best performance on all six small networks. For precision, the DLC index ranks in the top two on all six small networks, further indicating the suitability of the proposed method for small-scale networks. Tables 3 and 4 list the AUC and precision values, respectively, obtained by the local similarity methods and the graph embedding-based methods on the middle-scale and large-scale networks. The DLC index performs best on four of the ten networks by AUC and on five of the ten by precision, indicating its effectiveness on both middle-scale and large-scale networks. On the six middle-scale networks, the quasi-local LP index and the graph embedding method SDNE perform better according to AUC, while the other local similarity-based methods perform worse; the main reason may be that local topological features become less informative due to the sparsity of the tested networks. Although the quasi-local index and SDNE achieve higher AUC, they are more time-consuming than the local similarity-based methods and obtain lower precision. On the large-scale networks, graph embedding methods show their advantages, but their performance is often sensitive to parameters, and their computational and time costs are very high. On the tested networks, classical local similarity-based methods achieve good performance with lower computational complexity and time cost. The prediction performance of the traditional graph embedding method LE is clearly unstable, and it does not perform well on the large-scale networks; a further disadvantage of LE is the heavy data-storage burden of computing eigenvalues of high-dimensional matrices.

Table 3 AUC metric of the DLC method and other classical methods on the six middle-scale and four large-scale networks. Each value is obtained by averaging 20 independent runs with 10% of links as the probe set. The best performance for each network is shown in bold
Table 4 Precision metric of the DLC method and other classical methods on the six middle-scale and four large-scale networks. Each value is obtained by averaging 20 independent runs with 10% of links as the probe set. The best performance for each network is shown in bold

Note that although both the LNBCN and CCLP indices use node clustering information, the DLC index is superior to both in most cases, which demonstrates the value of the degree-related and link clustering coefficient. In summary, regardless of network size, the DLC index achieves the highest prediction accuracy or outperforms the other classical local similarity methods in most cases, indicating that DLC is only weakly affected by network size and delivers stable prediction performance. These results support the validity and robustness of our method. Furthermore, as shown in Fig. 2, the precision curves of the different methods are plotted against the number of top-ranked predicted links on 12 networks; the DLC index yields higher curves in most cases, for instance on the Dolphins, Celegans, King, Yeast, and Router networks.

Fig. 2

Prediction performance measured by precision metric with various L of nine similarity indices in six small-scale and six middle-scale real-world networks

Fig. 3

Comparison of the ranks of the top-100 hitting links for DLC–CCLP and CCLP–DLC in six networks of different sizes. The blue and purple columns represent the numbers of hitting links with an average rank L2 of less than 100 for DLC–CCLP and CCLP–DLC, respectively. DLC–CCLP: L1 denotes the rank of a missing link under DLC and L2 its rank under CCLP. CCLP–DLC: L1 denotes the rank of a missing link under CCLP and L2 its rank under DLC

In addition, since both the DLC and CCLP indices use the clustering coefficients of nodes or links to define the similarity score of node pairs, we further compare their prediction ability by examining the ranks of the top-ranked hitting links, with the results shown in Fig. 3. For a blue point (DLC–CCLP), the abscissa L1 is the rank of a link predicted by DLC, i.e., among the top-100 missing links under DLC, and the ordinate L2 is the average rank of that link under CCLP over 100 independent experiments. Similarly, for a purple point (CCLP–DLC), the abscissa L1 is the rank of a link predicted by CCLP, i.e., among the top-100 missing links under CCLP, and the ordinate L2 is the average rank of that link under DLC over 100 independent experiments. For example, a blue point at (60, 80) means that the hitting link ranked 60 by DLC has an average rank of 80 under CCLP over 100 independent tests; a purple point at (40, 90) means that the hitting link ranked 40 by CCLP has an average rank of 90 under DLC over 100 independent tests. A hitting link of one method is not hit by the other when L2 exceeds 100. Of the two methods under comparison, the one that hits the other's hitting links at a higher rate is considered to have better predictive ability: a larger number of purple points below the dashed line \(L2=100\) indicates better predictive ability of DLC, and a larger number of blue points below that line indicates better predictive ability of CCLP. We randomly select two small-scale, two middle-scale, and two large-scale networks for this experiment. As Fig. 3 shows, the top-ranked links predicted by CCLP are hit by DLC with lower ranks in most networks, and the number of purple points below the dashed line is greater than or equal to the number of blue points in all tested networks. We can therefore conclude that DLC performs better than CCLP on networks of different scales. Overall, from all these perspectives, the proposed method performs well in most cases.

4 Conclusions

In this paper, a new similarity index, DLC, has been proposed; it incorporates two important topological features, the link clustering coefficient and the degree-related clustering coefficient. Besides, the index reflects the influence of the endpoints, which is measured here by node degree. To investigate the prediction performance of DLC, we have introduced local similarity, quasi-local similarity, and graph embedding methods for comparison experiments. Compared with these representative methods, our proposed method achieves better performance in most cases under both metrics, AUC and precision. Considering network scalability, it is noteworthy that our method and the compared methods have been implemented on several large-scale networks, which is rarely done in previous studies. The experimental results demonstrate the robustness and effectiveness of our proposed method. In particular, the DLC index outperforms the CCLP index, which relies mainly on the node clustering coefficient; this further indicates the value of using degree-related and link clustering information. Therefore, the accuracy of previous local similarity-based methods can be improved by taking full advantage of link clustering and local structure information.

Although the proposed method performs well and has stable prediction ability, there are still some limitations. Above all, its application to middle-scale and large-scale networks needs further improvement, yet a difficult trade-off arises: on the one hand, the local topological information of a network is limited; on the other hand, extracting global or broader local topological information increases computational complexity and time cost. Therefore, it may be beneficial to develop practicable quasi-local similarity methods based on these useful topological features. In addition, graph embedding methods can generate beneficial results but may be unstable, so they deserve fuller exploration and further application to the link prediction problem. Finally, with the development of community detection research, more reliable and effective community detection algorithms may be utilized to improve prediction performance.