Keywords

1 Introduction

With the increase of scientific researchers, the number of published papers is also increasing. Reading papers is one of the most important and time-consuming parts of scientific research. When researchers want to get the papers about the field for study, the traditional search engine can only search by keywords and phrases, then get the search results. But the results have a wide range and lack of pertinence. Therefore, the paper recommendation system is produced. This paper is the research about the paper recommendation system based on Citation Network.

In the first part, we conduct a certain amount of research on the recommendation system and the recommendation algorithm. Understanding the development status of researches. In the second part, we constructed a three layers citation network model. We integrate the citation relationship, paper’s feature information, co-authorship relationship and research field information into this model. And adopts the DBLP-Citation-network data set of AMiner to build the three layers citation network model. In the third part, we put forward the algorithm of a paper recommendation system based on the three layers citation network model. We combined the three layers citation network model and RWR algorithm to form the paper recommendation algorithm (PAFRWR). Finally the experimental results show that, the evaluation index value of PAFRWR is better than the other three methods.

1.1 Research Status of Paper Recommendation System

In 1997, Resnick and Varian gave the definition of recommender system for the first time [1]. Then, Bollacker, Lawrence, Giles build the first paper recommendation system in 1998 [2]. This system uses web search engine and heuristic search to find the papers, and through the reference relationships between papers to find the recommendation of related papers.

In 2012, Wang et al. Proposed a method of paper recommendation based on the historical behavior of users [3]. Choochaiwattana [4] studies the application of tags in paper recommendation and proposes a paper recommendation mechanism based tag. Li Ran et al. [5] solved the cold start problem in the paper recommendation system. According to the preferences for research of users, a collaborative topic regression model based on frequent topic set preferences was proposed.

1.2 Research Status of Citation Network

At first, citation network was only used to Library and Information Science [6]. Through the continuous research of citation network, its application scope is more comprehensive. The citation network of the papers is a kind of network information body, which is formed by the reference relationships between the papers [7].

Strohman et al. [8] put forward a global citation recommendation for the first time, and they get the papers of citation recommendation by searching the whole papers. Tang et al. [9] studies a new problem in his paper, that is, citation recommendation based topic, and explores the topic distribution and citation relationship of the papers. In 2011, Shi Jie et al. [10] proposed to form a relationship collection with multiple attribute information in the citations, and then cluster more related citations according to the clustering algorithm, and finally recommend the results to users. Xiao Shibo et al. [11] use the graph model to analyze the relationship between the papers in the citation network and recommend for users. Chen Zhitao et al. [12] proposed a citation recommendation algorithm based on multi feature factor fusion. Based on the traditional citation recommendation model, integrated to the author related factors, overall influence factors and query related factors.

2 Related Work

2.1 Citation Recommendation

A paper needs a large number of references to support its point of view, the citation recommendation can provide appropriate references for researchers. According to the different citation method, citation recommendation can be divided into local citation recommendation and global citation recommendation. The purpose of local citation recommendation is to recommend relevant papers in the process of writing papers where the citation needs to be added. However the global citation recommendation refers to the recommendation for the whole paper, and provides a reference list for the target paper.

2.2 Random Walk with Restart

Random Walk with Restart (RWR) is a model proposed by Grady [13] in 2006. It mainly measures the similarity between network nodes through the relationship between the topology structure. The main ideas of RWR are as follows: (1) Starting from any vertex or any set of vertices in the graph, walk randomly along the edge of the graph to the next vertex. (2) In the random walk process, any vertex randomly selects the next adjacent node with a certain probability to move or selects the starting point to return to for random walk again. (3) After repeating the random walk process for a finite times and iterating for many times, the probability value of vertices in each graph tends to be stable, and the iteration ends. (4) Finally, the probability value of each vertex can be regarded as the similarity between the current vertex and the selected starting vertex.

The formula for RWR is as follows:

$$ \overrightarrow {{r_{i} }} = c\mathop {\tilde{W}}\limits^{{}} \overrightarrow {{r_{i} }} + \left( {1 - c} \right)\overrightarrow {{e_{i} }} $$
(1)

Here, \( \overrightarrow {{r_{i} }} = \left[ {r_{i,j} } \right] \) is the relevance score, \( c \) is the restart probability, \( W = \left[ {w_{i,j} } \right] \) is the weighted graph adjacency matrix, \( \mathop {\tilde{W}}\limits^{{}} \) is matrix of W by standardizing, \( \overrightarrow {{e_{i} }} \) is the identity matrix.

3 Analysis and Construction of Three-Layer Citation Network Graph Model

3.1 Analysis of Three Layers Citation Network Model

This part will analyze and build a three layers citation network model.

  1. 1.

    Citation network of papers

    A paper contains multiple references, and each reference, as an independent paper, also has its own references, which constitutes a citation network. The Fig. 1 is the citation network structure of the papers. As shown in the figure, paper P1 quoted P3, P4 and P5. If a paper Pi quoted a paper Pj, there are directed edges to connect them. So PiPj = 1 when we building the paper citation network.

    Fig. 1.
    figure 1

    Citation network of the papers

    In this paper, we get the word vector of feature information of the paper by word2vec. Then we calculate the mean value of the word vector, and the mean reflects the characteristic information of a paper. The formula for calculation is as follow: N is the total vocabulary of title and abstract of paper i, wj is the word vector of a word j in paper i.

    $$ R_{i} = \frac{1}{N}\mathop \sum \limits_{j = 1}^{n} w_{j} $$
    (2)
  2. 2.

    Co-authorship network

    For an academic paper, it can have many authors, so there is a co-authorship relationship in the paper. The Fig. 2 is the co-authorship network diagram. As shown in the figure, author A1 and authors A2, A4, A5 write the same paper. If there is a co-authorship between author Ai and author Aj, there is an undirected edge between them. So AiAj = 1 when we building the co-authorship network.

    Fig. 2.
    figure 2

    Co-authorship network diagram

  3. 3.

    Relationship between papers and research fields

    Research fields can help researchers directly locate the topic and research direction of the paper when they study the papers. A paper can correspond to one or more research fields. The Fig. 3 is the diagram of relationship between papers and research fields. As shown in the figure, if the paper Pi corresponds to a research field Fj, there is an edge to connect them. So PiFj = 1 when building relationships between papers and research fields.

    Fig. 3.
    figure 3

    Research field diagram

  4. 4.

    Three layers citation network model

    In this paper, we selects the feature information, the author and the research field of the paper, and builds a three layers citation network model with the relationship among them. The Fig. 4 is the three citation network model diagram.

    Fig. 4.
    figure 4

    Three layers citation network model

3.2 Construction of Three Layers Citation Network Model

Before building the model, explain the symbols used. As follows in this Table 1, it is the definition of related symbols.

Table 1. Definition of related symbols

When building this model, the matrix is used to express the relationship among the three layers networks. The following matrix is the matrix of three layers citation network graph model.

$$ {\text{M}} = \begin{array}{*{20}c} {} & {\begin{array}{*{20}c} {P\,\,\,\,\,{\text{~}}} & {{\text{~}}\begin{array}{*{20}c} {A\,\,\,\,{\text{~}}} & {F\,\,\,} \\ \end{array} } \\ \end{array} } \\ {\begin{array}{*{20}c} {\text{P}} \\ {\begin{array}{*{20}c} {\text{A}} \\ {\text{F}} \\ \end{array} } \\ \end{array} } & {\left| {\begin{array}{*{20}c} {\begin{array}{*{20}c} {\begin{array}{*{20}c} {M_{{PP}} } & {M_{{PA}} } \\ \end{array} } \\ {\begin{array}{*{20}c} {M_{{AP}} } & {M_{{AA}} } \\ \end{array} } \\ \end{array} } & {\begin{array}{*{20}c} {M_{{PF}} } \\ {M_{{AF}} } \\ \end{array} } \\ {\begin{array}{*{20}c} {M_{{FP}} } & {M_{{FA}} } \\ \end{array} } & {M_{{FF}} } \\ \end{array} } \right|} \\ \end{array} $$
(3)

4 Recommendation Algorithm Based on Citation Network Model

4.1 Algorithm Design

In this paper, we combines the RWR algorithm with the proposed three layers citation network graph model, it formed the recommendation algorithm based on Citation Network Model, and we call it PAFRWR. Compared with the random walk algorithm, RWR can make the node walk around the initial node, not aimless walk in the random walk model. So RWR algorithm is more accurate and efficient for the determination of similar nodes.

In order to implement the paper recommendation algorithm based on the three layers citation network model, search and recommendation are combined. In this paper, S is used to represent the set of search vectors. Training the search information by word2vec and structure search vector. We normalize the transition probability matrix for assign a reasonable weight to each node.

The recommended algorithm in this paper is as follows:

$$ C^{{\left( {t + 1} \right)}} = \left( {1 - \beta } \right)MC^{\left( t \right)} + \beta s $$
(4)

Here C is stationary distribution probability of each node. It represent the relevance between user search and nodes in the three layers citation network model. β is the restart probability. M is a transition probability matrix. s is the initial search vector.

The pseudo code of the algorithm is shown below:

figure a

4.2 Experiment and Analysis

  1. 1.

    Data

    In this paper, The data set used in this paper is DBLP-Citation-network downloaded from AMiner. Then, data of 63469 non repetitive papers from 2013 to 2019 are selected, and it include 152586 authors.

  2. 2.

    Evaluation index

    We use Recall and NDCG as evaluation indexes.

    Recall is an important index to evaluate recommendation results. In the field of information retrieval. The following is the calculation formula of recall. Here, N is the total number of recommended results. Q is the total number of search. \( R\left( p \right) \) is the set of recommended results produced. \( T\left( p \right) \) is the set of references of the current tested papers. \( {\text{R}}\left( p \right) \cap T\left( p \right) \) is the set of papers recommended correctly.

    $$ {\text{Recall}}@{\text{N}} = \frac{1}{Q}\sum\nolimits_{i = 1}^{Q} {\frac{R\left( p \right) \cap T\left( p \right)}{T\left( p \right)}} $$
    (5)

    In the paper recommendation system, it is considered that the recommendation results are in the current references of the tested papers, and the higher the recommendation results are in the list, indicating that the performance of the recommendation algorithm is better. NDCG is the value normalized by IDCG.

    $$ {\text{NDCG}}@{\text{N}} = \frac{1}{Q}\sum\nolimits_{j = 1}^{Q} {\frac{DCG@N}{IDCG@N}} $$
    $$ {\text{DCG}}@{\text{N}} = \sum\nolimits_{i = 1}^{N} {\frac{{2^{{r_{i} }} - 1}}{{log_{2} \left( {i + 1} \right)}}} $$
    (6)
    $$ {\text{IDCG}}@{\text{N}} = \sum\nolimits_{i = 1}^{Rel} {\frac{{2^{{rel_{i} }} - 1}}{{log_{2} \left( {i + 1} \right)}}} $$
  3. 3.

    Restart probability experiment

    In this section, the restart probability parameter β is determined by algorithm experiment. The following figure shows the change trend of Recall@N and NDCG@N under different restart probability β. In this experiment, N is selected as 50, 75 and 100. It can be seen that Recall@N and NDCG@N are the highest when β = 0.3. And the value of Recall@100 and NDCG@100 are the highest under different . So we choose the β = 0.3 as the restart probability (Fig. 5).

    Fig. 5.
    figure 5

    Recall@N and NDCG@N under different β values

  4. 4.

    Search vector contrast experiment

    In PAFRWR algorithm, there are author search vectors, paper search vectors and field search vectors. Different search vectors have different effects on the recommendation results of this paper. Four search vectors are defined as follows:

  5. (1)

    \( s_{1} = \left[ {s_{p} ,s_{a} ,s_{f} } \right] \) include paper search vectors, author search vectors and field search vectors.

  6. (2)

    \( s_{2} = \left[ {0,s_{a} ,s_{f} } \right] \) include author search vectors and field search vectors.

  7. (3)

    \( s_{3} = \left[ {s_{p} ,s_{a} ,0} \right] \) include paper search vectors, author search vectors.

  8. (4)

    \( s_{4} = \left[ {0,s_{a} ,0} \right] \) include only author search vectors.

    The following table shows the results of the search vector comparison experiment (Table 2):

    Table 2. Recall@N and NDCG@N under different search vectors

    In this experiment, the N is 25, 50, 75, 100. Among them, the Recall@N and NDCG@N of search vector S1 reach the maximum, which containing all three kinds of information. The results of S2 and S3 are very similar, but S3 has higher Recall@N and NDCG@N than S2. This shows that the search vector containing the content information of the paper can give the user better recommendations. So the search vector S1 will be selected as the search vector of the PAFRWR in this paper.

  9. (5)

    PAFRWR algorithm comparison

    We compare PAFRWR with PageRank, LDA and Link-PLSA-LDA. The following figure is a comparison of four algorithms. The Recall@N and NDCG@N of PAFRWR are higher than PageRank, LDA and Link-PLSA-LDA.

    PageRank only considers the citation relationship between papers, but does not consider the author, content subject, research field and other specific information of the paper. Therefore, the Recall@N and NDCG@N of PageRank are the lowest among the four algorithms. LDA and Link-PLSA-LDA both build the theme model of the paper. The Link-PLSA-LDA combines the reference relationship between papers based on the topic model. So the result of Link-PLSA-LDA is slightly higher than that of LDA. Overall, the algorithm of this paper (PAFRWR) has better recommendation results (Fig. 6).

    Fig. 6.
    figure 6

    Comparison of Recall@N and NDCG@N of different algorithms

5 Conclusions

In this paper, we constructs a three layers citation network graph model based on the DBLP citation network data set of AMiner. Which combines the content information, author information and research field information of the papers. Secondly, we proposed the algorithm PAFRWR. This algorithm combine three layers citation network graph mode with RWR. The restart probability β = 0.3 is determined by experiments, and the most effective search vector is determined. Finally, the Recall@N and NDCG@N of PAFRWR are higher than PageRank, LDA and Link-PLSA-LDA through the experiment.

For the paper recommendation method, it can also be improved from the following aspects. First, we can refine the structure and content of the network model, for example, you can add the same journal relationship of the papers, the same organization relationship of authors, and the relationship between research fields. Secondly, it can combine the user’s historical behavior in the paper recommendation. According to the shortcomings, we can build more effective paper recommendation model and algorithm in the future work.