1 Introduction

Keyphrase extraction is an important task in natural language processing. Keyphrases are helpful in many natural language processing applications, such as text classification, text clustering, recommendation systems, and search engines [1].

The task of keyphrase extraction is to select the important and topical phrases that best describe a document [2, 3]. By this definition, the core problem of keyphrase extraction is to determine which candidate phrases best describe a document; in other words, keyphrase extraction must determine the importance of each candidate phrase.

Many approaches have been proposed for keyphrase extraction. They can be roughly divided into two categories: supervised methods and unsupervised methods.

Supervised methods cast keyphrase extraction as a binary classification problem in which candidate phrases are classified as keyphrases or non-keyphrases [4]. In principle, any classification algorithm can be used. However, supervised approaches require large amounts of labeled data, which may be impractical in some domains. Unsupervised approaches have therefore attracted increasing attention [1]. Many unsupervised methods have been applied to keyphrase extraction, such as language modeling, clustering, and graph-based ranking [4]. Among them, graph-based ranking yields impressive results [5].

Graph-based ranking is inspired by Google's PageRank, which was proposed to analyze the link structure of the World Wide Web [3]. In graph-based ranking, a document is modeled as a graph whose vertices are phrases and whose edges are the relationships between phrases. The main idea is that a vertex is important if it is related to a large number of vertices or to important vertices [4]. Research on graph-based ranking focuses on measuring the semantic relation between vertices or on designing vertex ranking functions [5]. For example, Wan and Xiao [6] apply a graph-based ranking algorithm to a graph composed of a document and several similar adjacent documents. Bougouin et al. [7] present a graph-based keyphrase extraction approach named TopicRank, in which candidate keyphrases are first clustered into topics, a graph-based ranking algorithm is then applied, and the most representative keyphrases from each topic are chosen [7]. Boudin [5] proposes an unsupervised keyphrase extraction model that encodes topical information within a multipartite graph structure, improving on TopicRank. By ranking candidate keyphrases in a single operation, it implicitly enforces topical diversity and avoids the accumulation of errors [5].

Building on TopicRank and its improvements, we propose a new model that incorporates both the semantic and the structural information of a document into keyphrase extraction. In [5], the weight between two vertices is first computed from the distance between two candidates belonging to different topics and then adjusted by the positions of the candidates. Although the position of a candidate provides important information, the absolute word position is overly sensitive and often leads to unexpected results [1]. To address this shortcoming, we adjust the weight between two vertices using the position of the sentence in which a candidate is located rather than the absolute position of the candidate. In addition, we incorporate the semantic information of candidates into keyphrase extraction. Since keyphrases are single- or multi-word phrases that best describe the document [3], the semantic relation between keyphrases and sentences carries valuable information, so incorporating the semantic information of candidates is helpful. Our contributions are as follows:

  • We propose an unsupervised model that incorporates the structural and semantic information of a document to score and rank candidate keyphrases.

  • We aggregate the positions of all sentences in which a candidate phrase occurs to adjust the weights between candidate phrases, which improves the graph representation of a document.

  • We propose a new method to extract semantic information from a document and show that this information is helpful for scoring and ranking candidate keyphrases.

We experimentally evaluate our model on three datasets that are widely used in keyphrase extraction; its superiority is shown by comparisons with an unsupervised graph-based model that takes neither sentence positions nor semantic information into account, as well as with several unsupervised keyphrase extraction baselines. The rest of the paper is organized as follows: related work is summarized in Sect. 2; the details of our proposed model are provided in Sect. 3; experiments are reported in Sect. 4; conclusions and future work are presented in Sect. 5.

2 Related work

Many supervised and unsupervised approaches have been proposed for keyphrase extraction [4]. Supervised methods treat keyphrase extraction as a binary classification task. Unsupervised methods include language modeling, clustering, and graph-based ranking, among others. Since the pioneering work of Mihalcea and Tarau [3], graph-based methods have been the most widely used unsupervised approaches for keyphrase extraction [7]. Many researchers have devoted considerable effort to developing better ways of modeling documents as graphs or of ranking vertices in graphs [5]. Bougouin et al. [7] present a graph-based keyphrase extraction approach named TopicRank. In TopicRank, similar keyphrases in a document are first clustered into topics [7]. The document is then represented as a graph in which vertices are topics and the edge between two topics is weighted according to the positional relationship of the phrases in the two topics [7]. Improving on TopicRank, Boudin [5] proposes an unsupervised keyphrase extraction model that encodes topical information within a multipartite graph structure. Unlike TopicRank, the vertices of the graph in [5] are phrases, and an edge between two phrases is weighted according to their positional relationship if they belong to different topics. Once the graph is created, the graph-based ranking model TextRank proposed by Mihalcea and Tarau [3] is used in both [5] and [7] to rank vertices. In TopicRank, the phrase of each topic that appears first in the document is selected as a keyphrase [7].

To capture the semantic relatedness of words in a document, word sense disambiguation is an essential prerequisite. Word sense disambiguation is a core task of human language understanding [8], and researchers have proposed many approaches for it.

In this paper, we adopt the state-of-the-art approach proposed by Luo et al. [9], which integrates the context and glosses of the target word into a neural network in order to make full use of labeled data and lexical knowledge. The gloss of a word is provided by WordNet, a well-known lexical database in which English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms [10]. Synonymy, antonymy, hyponymy, meronymy, troponymy, and entailment are the semantic relations between words in WordNet [10]. Next, we briefly describe the model proposed by Luo et al. [9].

The model proposed by Luo et al. [9] consists of a gloss module, a context module, a memory module, and a scoring module. The context module encodes a sequence of words surrounding the target word into a distributed vector representation [9]. The gloss module encodes the glosses of three items, namely the target word, its hypernyms, and its hyponyms, into distributed vector representations [9]. The memory module performs two steps, attention calculation and memory update: the attention calculation models the inner relationship between the context and the glosses, and the memory states are updated afterwards [9]. Finally, the outputs of the context module and the memory module are used by the scoring module to calculate scores for all the WordNet senses of the target word [9].

3 Proposed model

In this section, we describe our approach in detail. It consists of two steps. We first build a multipartite graph to represent the document, in which the edge weights are adjusted using sentence positions, and obtain a structural score for each candidate from the structural information of the document. A semantic score for each candidate is then computed from the semantic similarity between the candidate and all the sentences. The structural score and the semantic score are combined to form the final score. To facilitate comparison with other approaches, our candidates are noun phrases that match the regular expression '(adjective)*(noun)+', and the phrases are stemmed to reduce the number of mismatches. The overall framework is illustrated in Algorithm 1, and a sketch of the candidate selection step follows it.

Algorithm 1: Overall framework of the proposed approach
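To make the candidate selection step concrete, the following is a minimal sketch (not the authors' code) that selects noun phrases matching the POS pattern (adjective)*(noun)+ with NLTK's chunker and stems them with the Porter stemmer; the NLTK tokenizer and tagger models are assumed to be installed.

```python
import nltk
from nltk.stem import PorterStemmer

# POS grammar mirroring the pattern (adjective)*(noun)+
chunker = nltk.RegexpParser("NP: {<JJ.*>*<NN.*>+}")
stemmer = PorterStemmer()

def extract_candidates(text):
    """Return the set of stemmed noun-phrase candidates of a document."""
    candidates = set()
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        for subtree in chunker.parse(tagged).subtrees():
            if subtree.label() == "NP":
                words = [w.lower() for w, _ in subtree.leaves()]
                candidates.add(" ".join(stemmer.stem(w) for w in words))
    return candidates
```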

3.1 Graph-based ranking model

The graph-based ranking model can be subdivided into three stages: (1) multipartite graph construction at the phrase level; (2) weight adjustment; (3) computation of the structural scores of candidates.

3.1.1 Graph construction

We follow the approach proposed by Bougouin et al. [7] to group the candidates of a document into topics. A hierarchical agglomerative clustering (HAC) algorithm automatically groups similar candidates into topics [7]. Two candidates are considered similar if they share at least 25% of their words, the candidates being stemmed to eliminate the influence of word forms [7]. A directed graph is then built, as illustrated in Fig. 1, in which vertices are connected if they belong to different topics. The weight from vertex i to vertex j is defined as follows:

$$\begin{aligned} w_{ij}=\sum _{p_i\in P(c_i)}\sum _{p_j\in P(c_j)}\frac{1}{|p_i-p_j|} \end{aligned}$$
(1)

where \(c_i\) and \(c_j\) are the candidates corresponding to vertices i and j, and \(P(c_i)\) is the set of word positions of candidate \(c_i\) in the document. Algorithm 2 describes the process of graph construction; a code sketch of this step follows it.

Fig. 1: Multipartite graph representation of a document, where the nodes are candidates and the edges represent positional relationships between candidates. Nodes of the same color belong to the same topic

Algorithm 2: Graph construction
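The following is a minimal sketch of the topic grouping and graph construction, assuming candidates have already been extracted together with their word offsets (the dictionary `positions`, a hypothetical input mapping each candidate to its set P(c)). For brevity, the HAC step is approximated by a greedy single-pass grouping under the same 25%-overlap criterion.

```python
import itertools
import networkx as nx

def overlap(c1, c2):
    """Fraction of shared (stemmed) words between two candidates."""
    w1, w2 = set(c1.split()), set(c2.split())
    return len(w1 & w2) / min(len(w1), len(w2))

def group_topics(candidates, threshold=0.25):
    """Greedy stand-in for HAC: put a candidate into the first topic
    containing a sufficiently overlapping member, else open a new topic."""
    topics = []
    for c in candidates:
        for topic in topics:
            if any(overlap(c, other) >= threshold for other in topic):
                topic.append(c)
                break
        else:
            topics.append([c])
    return topics

def build_graph(positions, topics):
    """positions: candidate -> iterable of word offsets, i.e. P(c)."""
    topic_of = {c: t for t, topic in enumerate(topics) for c in topic}
    g = nx.DiGraph()
    g.add_nodes_from(positions)
    for ci, cj in itertools.permutations(positions, 2):
        if topic_of[ci] == topic_of[cj]:
            continue  # only candidates from different topics are connected
        # Eq. (1): w_ij = sum of 1 / |p_i - p_j| over all position pairs
        w = sum(1.0 / abs(pi - pj)
                for pi in positions[ci] for pj in positions[cj] if pi != pj)
        g.add_edge(ci, cj, weight=w)
    return g
```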

3.1.2 Graph weight adjustment

The weight in (1) only captures the distance relations between candidates and is not sufficient to determine their importance. The positions of the candidates can provide further information about their importance [5].

In [5], the positions of the candidates are used for weight adjustment: adjusting the incoming edge weights of the first occurring candidate of each topic is helpful for keyphrase extraction [5]. However, the adjustment strategy in [5] assigns different importance to candidates that occur at different positions within the same sentence, which is unreasonable. For example, an active sentence and its passive counterpart express the same meaning, yet the order of the words differs, so the same words would receive different importance. It is therefore more reasonable, and more robust, to treat all candidates within a sentence as equally important. Following [5], we adjust the incoming edge weights of the first occurring candidate of each topic, so that candidates occurring at the beginning of the document gather information from the other candidates belonging to the same topic. Unlike [5], however, our method adjusts the incoming edge weights using the positions of the sentences in which the candidates are located rather than the absolute positions of the candidates. The incoming edge weights of the first occurring candidate of each topic are adjusted as follows:

$$\begin{aligned} w_{ij}=w_{ij}+\alpha \cdot e^{\sum _{p\in P_s(c_j)}\frac{1}{p}}\cdot \sum _{c_k\in T(c_j)\setminus \{c_j\}}w_{ki} \end{aligned}$$
(2)

where \(P_s(c_j)\) is the set of word positions of the sentences in which candidate \(c_j\) is located, \(T(c_j)\) is the set of candidates belonging to the same topic as \(c_j\), and \(\alpha \) is a hyperparameter that controls the strength of the weight adjustment. Following the experiments of Boudin [5], we empirically set \(\alpha =1.1\). The procedure of graph weight adjustment is illustrated in Algorithm 3 and sketched in code below.

Algorithm 3: Graph weight adjustment
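A minimal sketch of the adjustment of Eq. (2), under the same assumptions as the previous sketch; `sent_positions` is a hypothetical helper returning P_s(c), the word positions of the sentences containing a candidate (positions are assumed to start at 1).

```python
import math

def adjust_weights(g, topics, positions, sent_positions, alpha=1.1):
    """Boost the incoming edge weights of each topic's first candidate
    according to Eq. (2)."""
    for topic in topics:
        # first occurring candidate of the topic (c_j in Eq. (2))
        cj = min(topic, key=lambda c: min(positions[c]))
        boost = math.exp(sum(1.0 / p for p in sent_positions(cj)))
        mates = [c for c in topic if c != cj]
        for ci in list(g.predecessors(cj)):
            # sum of the weights w_ki from c_j's topic mates into c_i
            w_in = sum(g[ck][ci]["weight"]
                       for ck in mates if g.has_edge(ck, ci))
            g[ci][cj]["weight"] += alpha * boost * w_in
```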

3.1.3 Graph ranking

After the graph is built, we compute the scores of the candidates with a graph-based ranking algorithm. Here, we use the widely used TextRank algorithm proposed by Mihalcea and Tarau [3], which is based on the PageRank algorithm [3]. Such an algorithm uses global information, recursively computed over the entire graph, to obtain the importance of each vertex [3]. The structural score of candidate \(c_i\) is computed as follows:

$$\begin{aligned} S_{base}(c_i)=(1-\lambda )+\lambda \cdot \sum _{c_j\in I(c_i)}\frac{w_{ji}S_{base}(c_j)}{\sum _{c_k\in O(c_j)}w_{jk}} \end{aligned}$$
(3)

where \(I(c_i)\) is the set of predecessors of \(c_i\), \(O(c_j)\) is the set of successors of \(c_j\), \(w_{ji}\) is the weight of the edge from \(c_j\) to \(c_i\), and \(\lambda \) is a damping factor that integrates the probability of jumping from a given vertex to a random vertex in the graph [3]. Following the experiments of Mihalcea and Tarau [3], we empirically set \(\lambda =0.85\). The computation of the structural score is illustrated in Algorithm 4 and sketched in code below.

Algorithm 4: Structural score computation
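A compact power-iteration sketch of Eq. (3) over the adjusted graph; networkx's built-in `pagerank` could be used instead, but the explicit loop mirrors the formula.

```python
def structural_scores(g, lam=0.85, iters=100, tol=1.0e-6):
    """Iterate Eq. (3) to a fixed point over a weighted DiGraph g."""
    scores = {c: 1.0 for c in g}
    out_sum = {c: sum(d["weight"] for _, _, d in g.out_edges(c, data=True))
               for c in g}
    for _ in range(iters):
        new = {ci: (1.0 - lam) + lam * sum(
                   g[cj][ci]["weight"] * scores[cj] / out_sum[cj]
                   for cj in g.predecessors(ci) if out_sum[cj] > 0)
               for ci in g}
        converged = max(abs(new[c] - scores[c]) for c in g) < tol
        scores = new
        if converged:
            break
    return scores
```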

3.2 Semantic relatedness model

The semantic relatedness model can be subdivided into two stages: (1) the correct sense of each word in the document is assigned by a word sense disambiguation algorithm; (2) the candidates are scored by their semantic similarity to the sentences of the document.

3.2.1 Word sense disambiguation

Our similarity measure is based on the senses assigned by a word sense disambiguation algorithm. In this paper, we adopt the state-of-the-art approach proposed by Luo et al. [9] to assign senses to the words in a document.
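The neural model of Luo et al. [9] is too involved to sketch here; purely to illustrate the interface the next stage expects (a word-to-synset mapping), a classic Lesk baseline from NLTK could act as a stand-in:

```python
from nltk import word_tokenize
from nltk.wsd import lesk

def disambiguate(sentence):
    """Map each word of a sentence to a WordNet synset (or None)."""
    tokens = word_tokenize(sentence)
    return {w: lesk(tokens, w) for w in tokens}
```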

3.2.2 Semantic ranking

Keyphrases can be mentioned, directly or indirectly, in many places in a document, and the semantic relatedness between candidates and sentences provides effective assistance for keyphrase extraction, especially for short documents with little structural information. To obtain the semantic relationship between candidates and sentences, we calculate their similarity based on the senses assigned by the word sense disambiguation algorithm.

When a phrase is mentioned in a sentence, not every word in the sentence is similar to the words of the phrase, but at least one word is. Hence, we define the similarity between a phrase and a sentence as the maximum of the similarities between the words of the phrase and those of the sentence. The semantic score of candidate \(c_i\) is computed by:

$$\begin{aligned} S_{smtc}(c_i)=\sum _{s_j\in S}\max _{u_k\in s_j, u_n\in c_i}[Sim(u_k, u_n)] \end{aligned}$$
(4)

where S is the set of sentences, \(u_k\) and \(u_n\) are words in \(s_j\) and \(c_i\), respectively, and Sim is the WordNet similarity function based on the shortest path connecting the two senses. The computation of the semantic score is illustrated in Algorithm 5 and sketched in code below.

Algorithm 5: Semantic score computation
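A minimal sketch of Eq. (4), assuming `sense_of` is the word-to-synset mapping produced by the disambiguation step; `path_similarity` is NLTK's shortest-path similarity between WordNet synsets (it returns None when no path exists, which we treat as zero).

```python
def semantic_score(candidate_words, sentences, sense_of):
    """Eq. (4): sum over sentences of the best word-pair similarity."""
    total = 0.0
    for sentence in sentences:  # each sentence is a list of words
        best = 0.0
        for w_k in sentence:
            for w_n in candidate_words:
                s_k, s_n = sense_of.get(w_k), sense_of.get(w_n)
                if s_k is not None and s_n is not None:
                    best = max(best, s_k.path_similarity(s_n) or 0.0)
        total += best
    return total
```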

3.3 Final score calculation

The two steps above produce two score sequences for the candidates. The final score is obtained by combining them:

$$\begin{aligned} S_{final}(c_i)=Norm(S_{base}(c_i))+Norm(S_{smtc}(c_i)) \end{aligned}$$
(5)

where the two scores are weighted equally and Norm is a max-min normalization computed by:

$$\begin{aligned} Norm(Sc_i) = \frac{Sc_i}{\max (Sc) - \min (Sc)} \end{aligned}$$
(6)

where Sc is the set of scores.
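Putting the two scores together, a short sketch of Eqs. (5) and (6); note that Eq. (6), as stated, divides by the score range without subtracting the minimum, and the sketch follows the formula as written (it assumes the scores are not all equal).

```python
def normalize(scores):
    """Eq. (6): divide each score by the range of the score set."""
    span = max(scores.values()) - min(scores.values())
    return {c: s / span for c, s in scores.items()}

def final_scores(base, smtc):
    """Eq. (5): equally weighted sum of the two normalized scores."""
    nb, ns = normalize(base), normalize(smtc)
    return {c: nb[c] + ns[c] for c in base}
```

The top-N keyphrases are then the candidates with the highest final scores.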

4 Experiments and results

4.1 Datasets and evaluation measures

We carry out our experiments on three widely used datasets. The first is Krapivin-2009, a high-quality dataset of 2304 scientific papers from the computer science domain published by ACM [11]; each paper has keyphrases assigned by the authors and verified by the reviewers [11]. The second is SemEval-2010, which consists of 244 scientific articles from the ACM Digital Library whose keyphrases were carefully chosen by both authors and readers [12]. The last is Hulth-2003, which consists of 2000 abstracts from the Inspec database, each with keyphrases assigned by a professional indexer [13]. Statistics of these datasets are given in Table 1.

Table 1: Average size of each article in the datasets
Table 2: F1-scores computed at the top 5 and top 10 extracted keyphrases

We evaluate the performance of our approach in terms of F1-score at the top N extracted keyphrases. The F1-score is the harmonic mean of precision and recall:

$$\begin{aligned} F1 = 2\times \frac{precision \times recall}{precision + recall} \end{aligned}$$
(7)

For convenience of display, we multiply the F1-scores by 100; a sketch of the evaluation is given below.
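A minimal sketch of F1-score at the top N, assuming predicted and gold keyphrases are compared as stemmed strings (a common evaluation protocol on these datasets):

```python
def f1_at_n(predicted, gold, n=10):
    """F1-score of the top-n predicted keyphrases against the gold set."""
    pred, gold = set(predicted[:n]), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)  # scaled by 100
```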

4.2 Baselines and parameter settings

We compare the performance of our approach against six baselines. The first is TextRank, the first graph-based keyphrase extraction method [3]. The second is SingleRank, an improvement of TextRank that adds weighted edges between words co-occurring in a window of variable size \(w \ge 2\) [14]. The third is TopicRank, which relies on a topical representation of the document [7]. The fourth is PositionRank, which incorporates information from all positions of a word's occurrences into a biased PageRank algorithm [1]. Unlike the other baselines, the fifth, YAKE, is a feature-based method that uses statistical features such as word frequency to directly compute a score for each word [15]. The last baseline, MultipartiteRank, encodes topical information within a multipartite graph structure [5].

Our experiments are based on the pke toolkit [16], with parameters set to the values suggested by the respective authors.
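For illustration, a typical pke invocation for the MultipartiteRank baseline might look as follows (API as in the pke documentation; exact signatures may differ across versions):

```python
import pke

extractor = pke.unsupervised.MultipartiteRank()
extractor.load_document(input='paper.txt', language='en')
# noun-phrase candidates, as in Sect. 3
extractor.candidate_selection(pos={'NOUN', 'PROPN', 'ADJ'})
# alpha and threshold follow the values suggested in [5]
extractor.candidate_weighting(alpha=1.1, threshold=0.74, method='average')
keyphrases = extractor.get_n_best(n=10)  # list of (phrase, score) pairs
```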

4.3 Results

The results for the baselines and the proposed approach are detailed in Table 2. Overall, our approach achieves the best results on the datasets of long documents and competitive results on the dataset of short documents. Among the baselines, MultipartiteRank performs best on the long-document datasets, which indicates that the multipartite graph structure is effective for keyphrase extraction, especially in long documents. Because TextRank and SingleRank construct graphs over words rather than phrases, they perform well on short documents but poorly on long ones: the structural information of long documents is more complex than that of short documents, and a word-level graph cannot capture it. The results of our approach on the first two datasets are close to those of MultipartiteRank, since our approach builds on MultipartiteRank and these two datasets consist of long documents, on which MultipartiteRank performs much better than the other baselines; moreover, the scores computed from semantic relationships are less discriminative in long documents than in short ones. The latter also explains why our approach performs much better than MultipartiteRank on the third dataset, which consists of short documents.

5 Conclusion

We introduced an unsupervised keyphrase extraction algorithm that combines semantic and structural information and demonstrated its effectiveness on three public datasets. In future work, we would like to apply more suitable methods than clustering to partition documents, especially short documents, into topics.