
1 Introduction

Literature reviews are in great demand, driven by the rapidly increasing volume of scientific publications and the need to stay aware of developments in scientific fields. Finding and evaluating relevant material to synthesize information from various sources is a crucial and challenging task.

Table 1. The number of references of surveys hit in the top 100 retrieval results.

One main method of literature searching is keyword search. However, keyword search alone cannot meet the needs of literature searching, partly because of synonymy and polysemy. Moreover, literature reviews cite correlated articles that use varied vocabulary, which further degrades the performance of citation recommendation.

The situation has improved little despite the development of search techniques in recent years. Take the literature search for surveys as an example. Three surveys are randomly selected from a bibliography sharing service, CiteULike, and retrieval performance on the search engine Google Scholar is evaluated by the number of each survey's references found among the search results, as illustrated in Table 1. The number of references hit in the top 100 retrieval results indicates poor performance for citation recommendation.

In this paper, we propose a novel query expansion framework for academic citation recommendation. By incorporating multiple domain knowledge graphs for scientific publications, we use domain-specific concepts to expand an original query with an appropriate spectrum of knowledge structure. Meanwhile, text features are extracted to capture context-aware concepts of the query. The candidate concepts, namely the domain-specific and context-aware concepts, are then filtered via distributed representations to derive the expanded query for citation recommendation.

2 Related Work

Traditional methods of query expansion choose terms from relevant/irrelevant documents. Terms are usually weighted according to their frequency in a single document and in the collection, and the top-weighted terms are added to the initial query. To provide structured knowledge of a topic, Ref. [1] addresses the problem of citation recommendation by expanding the semantic features of the abstract using DBpedia Spotlight [14], a general-purpose knowledge graph tool.

A content-based method for recommending citations [2] embeds a query and documents into a vector space and then reranks the nearest neighbors as candidates using a discriminative model. A context-aware citation recommendation model [8] is proposed with BERT and graph convolutional networks. In the context of heterogeneous bibliographic networks, ClusCite [12], a cluster-based citation recommendation framework, is proposed.

Text feature extraction is fundamental to citation recommendation. TF-IDF [9] (Term Frequency-Inverse Document Frequency) and its variants are essential for representing documents as vectors of terms, weighted according to a term's frequency in one document as well as the number of documents in the corpus that contain it.

To capture the latent semantic associations of terms, Latent Semantic Analysis (LSA)  [5] is proposed to analyze the relationships of documents based on common patterns of terms. To uncover semantic structures of documents, Latent Dirichlet Allocation (LDA)  [3] is a generative probabilistic model to represent one document as a mixture of topics, and the words in each document imply a probabilistic weighting of a set of topics that the document embodies.

In the above representations, each term in a document is represented by a one-hot vector, in which only the component corresponding to the term is 1 and all others are zero. To provide a better estimate of term semantics, distributed representations of terms are proposed. Instead of sparse vectors, embedding techniques such as Word2Vec [10] and BERT [6] learn a dense, low-dimensional representation of a term from its neighbors.

3 The Proposed Framework

The proposed query expansion framework for citation recommendation is composed of three major steps: (1) from the perspective of enriching knowledge structure, expanding an original query with domain-specific concepts based on multiple domain knowledge graphs; (2) from the perspective of providing query scenarios, extending the query with context-aware concepts derived from text feature extraction; (3) filtering the above candidate concepts via distributed representations to derive the expanded query for citation recommendation.

3.1 Model Overview

To capture diverse information needs underlying the query, we propose a query expansion framework combining domain knowledge with text features, as illustrated in Fig. 1.

Fig. 1. The query expansion framework based on domain knowledge and text features.

Starting from an initial query \(q_0\) in Fig. 1, domain knowledge and text features are used to expand the query, and then candidate concepts are filtered to derive the enriched query for citation recommendation.

More specifically, domain knowledge is used to provide knowledge structure in the form of domain-specific concepts, with the aid of multiple domain knowledge graphs such as the ACM Computing Classification System (CCS for short) and the IEEE thesaurus. Meanwhile, text feature extraction is utilized to derive context-aware concepts from document collections to provide query scenarios.

Among the above domain-specific and context-aware candidate concepts, filtering techniques are applied to choose a set of closely-related concepts to formulate a new query q, and citations are recommended with respect to the expanded query q.
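The flow from \(q_0\) to the expanded query q can be summarized as a minimal Python sketch. The function below is an illustrative placeholder for the three steps in Fig. 1, not an existing implementation; the step functions are passed in as callables so the sketch stays self-contained.

```python
from typing import Callable, Iterable, Set

def expand_query(
    q0: str,
    domain_expand: Callable[[str], Set[str]],        # Sect. 3.2: domain-specific concepts
    context_expand: Callable[[str], Set[str]],       # Sect. 3.3: context-aware concepts
    filter_top_k: Callable[[str, Set[str]], Iterable[str]],  # Sect. 3.4: concept filtering
) -> str:
    """Compose the three expansion steps of Fig. 1 into an expanded query q."""
    candidates = domain_expand(q0) | context_expand(q0)
    top_concepts = filter_top_k(q0, candidates)
    return q0 + " " + " ".join(top_concepts)
```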

3.2 Domain-Specific Expansion Using Multiple Knowledge Graphs

With the characteristics of citation recommendation in mind, we propose query expansion based on knowledge graphs. However, general-purpose knowledge graphs are poorly suited to specific academic disciplines.

Take the citation recommendation for ‘A Survey of Mobility Models for Ad Hoc Network Research’ in Table 1 as an example (hereinafter called the survey). Using Microsoft Concept Graph, the related concepts for ad hoc network consist of research domain, wireless network, network, system and so on. As for the concept mobility model, the suggested concepts include parameter, component, user criterion and so on. Comparing these against the references cited in the survey, the concepts suggested by general-purpose knowledge graphs contribute little to recommending appropriate citations. Knowledge graphs such as WordNet [11] are similarly of limited help, as they contain no concepts for ad hoc network or mobility model.

We utilize domain-specific knowledge graphs, such as ACM CCS and the IEEE thesaurus, to expand academic concepts. Take the concept ad hoc networks as an example. The concept hierarchies of ACM CCS and the IEEE thesaurus are illustrated in Fig. 2, which suggest structures of concepts and their relations with ad hoc networks. Examining the references of the survey, the highly relevant concepts that appear in the references are marked with red rectangles in Fig. 2.

Fig. 2. The diverse concept hierarchies for ad hoc networks in multiple domain knowledge graphs. (Color figure online)

According to the IEEE thesaurus, related terms such as wireless sensor networks are also very likely to be mentioned in the references. For ACM CCS, we notice that the siblings of network types, a hypernym of ad hoc networks, are also highly related to the topics of the references, more so than the direct siblings of ad hoc networks such as networks on chip or home networks. Here a concept's siblings refer to those concepts that share the same parent with the given concept.

Therefore, to expand the original query with domain-specific concepts of a term, we design adaptive policies for multiple domain knowledge graphs. For the IEEE thesaurus, the broader terms, related terms and narrower terms are added to the set of candidate concepts. For ACM CCS, a two-level policy is suggested that adds the hyponyms of the term and the siblings of its hypernyms to the original query.

Using these policies, 22 domain-specific concepts for ad hoc networks are derived from ACM CCS and the IEEE thesaurus, as marked with blue rectangles in Fig. 2. The concept mobile ad hoc networks is counted only once.
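A minimal sketch of these adaptive expansion policies is given below, assuming the thesauri have already been loaded into simple parent/child/related-term maps. The ConceptGraph class and its method names are our own illustration rather than an existing API.

```python
from typing import Dict, List, Optional, Set

class ConceptGraph:
    """Toy in-memory view of a concept hierarchy (illustrative only)."""

    def __init__(self, parents: Dict[str, List[str]], children: Dict[str, List[str]],
                 related: Optional[Dict[str, List[str]]] = None):
        self.parents = parents      # concept -> broader concepts (hypernyms)
        self.children = children    # concept -> narrower concepts (hyponyms)
        self.related = related or {}

    def broader(self, c: str) -> List[str]:
        return self.parents.get(c, [])

    def narrower(self, c: str) -> List[str]:
        return self.children.get(c, [])

    def related_terms(self, c: str) -> List[str]:
        return self.related.get(c, [])

    def siblings_of_broader(self, c: str) -> Set[str]:
        # Concepts that share a parent with c's hypernyms (the hypernyms' siblings).
        hypernyms = set(self.broader(c))
        out: Set[str] = set()
        for parent in hypernyms:
            for grandparent in self.broader(parent):
                out.update(self.children.get(grandparent, []))
        return out - hypernyms

def expand_ieee(thesaurus: ConceptGraph, concept: str) -> Set[str]:
    # IEEE thesaurus policy: broader, related and narrower terms.
    return set(thesaurus.broader(concept)
               + thesaurus.related_terms(concept)
               + thesaurus.narrower(concept))

def expand_acm_ccs(ccs: ConceptGraph, concept: str) -> Set[str]:
    # ACM CCS two-level policy: hyponyms plus the siblings of the hypernyms.
    return set(ccs.narrower(concept)) | ccs.siblings_of_broader(concept)
```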

3.3 Context-Aware Expansion Based on Text Feature Extraction

To expand a query with context awareness, the basic idea is to derive context-aware terms from document collections via text feature extraction, as such terms enrich the original query with characteristics of usage scenarios for better matching of relevant documents.

Key phrases are first extracted using the PositionRank algorithm [7] as features of documents. The document collections are then processed to find key phrases semantically related to the terms in the original query, such as key phrases with high co-occurrence. As they provide the co-occurrence scenarios of the original query, these semantically related key phrases are appended to the candidate set for query expansion.
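As a rough illustration of this step, the sketch below scores key phrases (assumed to have been extracted per document beforehand, e.g. with PositionRank) by how often they co-occur with the query terms; this simple co-occurrence count is a stand-in for the semantic-relatedness computation, not the exact scoring used in the paper.

```python
from collections import Counter
from typing import Iterable, List

def context_aware_candidates(query_terms: Iterable[str],
                             doc_keyphrases: List[List[str]],
                             top_n: int = 10) -> List[str]:
    """Rank key phrases by how often they co-occur with the query terms.

    doc_keyphrases holds, per document, the key phrases extracted beforehand.
    """
    query_terms = {t.lower() for t in query_terms}
    cooccur = Counter()
    for phrases in doc_keyphrases:
        phrases = [p.lower() for p in phrases]
        if query_terms & set(phrases):          # the document mentions a query term
            for p in phrases:
                if p not in query_terms:
                    cooccur[p] += 1
    return [p for p, _ in cooccur.most_common(top_n)]

# Example: context_aware_candidates(["ad hoc networks", "mobility model"],
#                                   keyphrases_per_document)
```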

Figure 3 lists ten context-aware concepts each for ad hoc networks and mobility model based on text feature extraction. Here, the context-aware concepts are computed from the CiteULike corpus, which is described in detail in Sect. 4.

Fig. 3. The context-aware concepts of ad hoc networks and mobility model.

3.4 Candidate Concept Filtering via Distributed Representation

Rather than directly expanding the original query with all candidate concepts from Sect. 3.2 and Sect. 3.3, the aim is to filter the candidate set down to a subset of closely-related concepts for query expansion, thereby reducing noise.

To this end, we propose a candidate concept filtering method based on distributed representations. The input consists of a candidate concept for expansion and the original query. Each input is converted into a 1 × 1024-dimensional vector via BERT distributed representations. For detailed calculation principles, please refer to the BERT-as-service tool.

Cosine similarity is then used to measure the closeness between the vector representations of the candidate concept and the original query, and the normalized result (between 0 and 1) is output as a matching score. The candidate concepts are sorted by their matching scores with the query, and the top-k closely-related concepts are chosen to expand the original query.
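The filtering step can be sketched as follows, assuming a bert-as-service server is running and serving a model with 1024-dimensional pooled outputs (e.g. BERT-large); mapping cosine similarity from [-1, 1] to [0, 1] via (sims + 1) / 2 is one simple normalization choice assumed here.

```python
import numpy as np
from bert_serving.client import BertClient  # requires a running bert-as-service server

def filter_candidates(query: str, candidates: list, k: int = 10) -> list:
    """Keep the top-k candidate concepts closest to the query in BERT space (sketch)."""
    bc = BertClient()
    vecs = bc.encode([query] + candidates)          # one vector per input text
    q_vec, c_vecs = vecs[0], vecs[1:]
    # Cosine similarity between each candidate and the query, normalized to [0, 1].
    sims = c_vecs @ q_vec / (np.linalg.norm(c_vecs, axis=1) * np.linalg.norm(q_vec))
    scores = (sims + 1.0) / 2.0
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [concept for concept, _ in ranked[:k]]
```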

4 Experiments

4.1 Data Sets and Evaluation Metrics

In this section, we test our model on citation recommendation tasks over two public data sets. The DBLP data set contains citation information extracted by Tang et al. [13]. The CiteULike data set consists of scientific articles from the CiteULike database. Statistics of the two data sets are summarized in Table 2.

Table 2. Data sets overview.

Performance is mainly evaluated in terms of precision and recall. The precision at the top N results (P@N for short) and the recall at the top N results (R@N for short) are reported. Additionally, mean average precision (MAP) [4] is used to measure performance averaged over all queries,

$$\begin{aligned} MAP(Q)=\frac{1}{|Q|}\sum _{j=1}^{|Q|} \frac{1}{m_j} \sum _{k=1}^{m_j} Precision(R_{jk}) \end{aligned}$$
(1)

where Q is the set of queries. For the query \(q_j\), \(\{d_1,...,d_{m_j}\}\) is the set of cited articles and \(R_{jk}\) is the set of ranked results from the top result to the article \(d_k\).
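A straightforward implementation of P@N, R@N and the MAP of Eq. (1) might look like the following; treating cited articles that never appear in the ranking as contributing zero precision is one common convention and is assumed here.

```python
from typing import List, Set

def precision_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    hits = sum(1 for d in ranked[:n] if d in relevant)
    return hits / n

def recall_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    hits = sum(1 for d in ranked[:n] if d in relevant)
    return hits / len(relevant) if relevant else 0.0

def mean_average_precision(all_ranked: List[List[str]],
                           all_relevant: List[Set[str]]) -> float:
    """MAP over queries as in Eq. (1): precision at each cited article's rank, averaged."""
    ap_values = []
    for ranked, relevant in zip(all_ranked, all_relevant):
        precisions = []
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                # Precision of the ranked list truncated at this cited article.
                precisions.append(precision_at_n(ranked, relevant, rank))
        if relevant:
            # Cited articles missing from the ranking contribute zero precision.
            ap_values.append(sum(precisions) / len(relevant))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0
```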

4.2 Experimental Results

The experimental results on the CiteULike data set are presented in Table 3. We vary our model across text analysis and filtering methods. The domain KG + TF-IDF method expands the query without filtering, i.e., it implements domain-specific expansion based on multiple domain knowledge graphs and context-aware expansion with TF-IDF, and all candidate concepts are appended to the original query. Similarly, the domain KG + LSA and domain KG + LDA methods combine domain knowledge with LSA and LDA, respectively. In contrast to the first three methods, the domain KG + TF-IDF + BERT filtering method uses BERT embeddings of candidate concepts for filtering.

Table 3. Experimental results of our model on the CiteULike data set

The results in Table 3 show that, among the first three methods without filtering, the domain KG + TF-IDF method performs better than the ones using LSA and LDA. Thus, the impact of the filtering technique on query expansion is further evaluated based on TF-IDF.

Furthermore, the method with filtering outperforms the ones without filtering, and the domain KG + TF-IDF + BERT filtering method achieves the best performance among all the methods on the CiteULike data set. This indicates that expanding with domain-specific and context-aware concepts plus BERT filtering improves the performance of citation recommendation.

The experimental results on the DBLP data set, shown in Table 4, also prove the effectiveness of query expansion using domain-specific and context-aware concepts plus BERT filtering.

Table 4. Experimental results of our model on the DBLP data set

The above results suggest that combining domain knowledge with text features offers clear advantages for citation recommendation, and that filtering the candidate concepts is key to improving performance.

5 Conclusions

In this paper, we address the problem of citation recommendation for literature reviews. Fusing domain knowledge and text features for query expansion is verified to improve the performance of locating citations for scientific articles. Domain-specific concepts are extracted from multiple domain knowledge graphs to enrich the knowledge structure of the query. Context-aware concepts are derived from text feature extraction to provide query scenarios. Candidate concepts are then filtered via distributed representations such as BERT to expand the query with closely-related concepts. Experiments on citation recommendation over bibliographic databases show that our proposed model effectively improves the performance of citation recommendation.

Future research will consider using large-scale scientific literature corpora to fine-tune the pre-trained BERT representations. In addition, combining named entity recognition techniques for term feature extraction is also a focus of our future work.