1 Introduction

   Unlocking the most important information hidden within a document can be a challenging task, but with the power of keyphrase extraction, it becomes an effortless and efficient process. Within the domain of natural language processing (NLP), a fundamental concept is the identification of keyphrases and keywords. Keyphrases are comprised of a series of words that encapsulate the central theme or primary idea conveyed within a given text document. In contrast, a keyword is a solitary word that holds significant meaning or conveys an important concept within a text document.

The process of identifying and extracting relevant keyphrases or keywords from a given text is a critical task in the field of NLP that involves identifying and extracting the most relevant and meaningful phrases or concepts from a given text [1]. This process can help in document summarization, information retrieval [2], and indexing. Keyphrase extraction is also a critical step in many downstream NLP applications, including document classification [3] and sentiment analysis. Additionally, keyphrase extraction is a technique that facilitates rapid comprehension of the content of a document by identifying and extracting significant phrases from it. This process enables users to obtain a gist of the document’s essence without the need to read it in its entirety, making it an essential tool for information management in various fields.

Given the importance of keyphrase extraction, numerous techniques have been developed over the years [4]. However, these methods often have limitations that impede their effectiveness [5]. For example, some techniques rely on frequency-based methods, which may result in identifying and removing irrelevant or misleading words or phrases from a document. Other techniques rely on shallow syntactic patterns, which can miss important semantic relationships between words. These limitations highlight the need for more sophisticated techniques that can capture the complex and nuanced information present in natural language text.

Despite these limitations, keyphrase extraction remains an active area of research, with many researchers exploring new approaches to improve its accuracy and efficiency [6]. One promising direction is to leverage graph-based methods to identify keyphrases. These approaches can capture the complex relationships between words in a document and can be fine-tuned to improve their performance. Additionally, unsupervised methods such as statistical [7], co-occurrence and semantic similarity-based techniques can also be utilized for keyphrase extraction. These techniques exhibit zero dependence on training data and possess the potential to be widely implemented across various domains, thus displaying high scalability and adaptability.

The proposed methodology employs a graph-based model of the document and employs diverse range of measures such as degree centrality, TextRank, closeness, and betweenness to recognize the crucial nodes within the graph. NLP patterns are also incorporated to further refine the extracted keyphrases. In a benchmark dataset, this approach demonstrates favorable outcomes in contrast to state-of-the-art methods for extracting keyphrases. This approach offers a robust and scalable solution for identifying keyphrases that capture the essential information of a document. By leveraging the structural information of a document, this approach provides a more accurate and relevant extraction of keyphrases.

This study is structured into multiple sections. The Sect. 2 summarizes keyphrase extraction techniques and highlights their strengths and limitations. The proposed methodology is presented in Sect. 3, which employs graph-based methods based on linguistic patterns. The proposed methodology is explained in detail, including the preprocessing steps, feature extraction techniques, and the model architecture. Sect. 4 of this research paper details the results and discussions derived from a comprehensive evaluation of the experimental setup and performance parameters employed. A detailed dataset analysis is also presented alongside the results and comparisons of the experiment. Finally, Sect. 5 concludes the paper by summarizing the contributions and limitations of the proposed method and provides directions for future research in the field of keyphrase extraction.

2 Related work

   With multiple research suggesting various methods for locating and extracting keyphrases from textual data, the area of keyphrase extraction has attracted a lot of attention recently. In this section, we examine the body of knowledge on keyphrase extraction, concentrating on the various procedures and approaches previously employed in research. NLP’s difficult problem of keyphrase extraction has drawn a lot of attention from researchers who have studied both classification-based and ranking-based, or unsupervised methods [8].

Researchers have suggested utilizing meeting-specific characteristics, such as sentences related to decision-making and summaries generated by the system, to enhance the accuracy of keyword extraction in meetings [9]. In addition, certain investigations have explored the utilization of web resources and confidence scores to expand bigrams. Despite these efforts, there remains a necessity for more efficient and reliable techniques, particularly in regards to ASR output, and appropriate assessment criteria for low levels of human agreement.

A recent scholarly publication presented a supervised framework designed for the automatic extraction of keywords that exploits the graph-theoretic attributes of words present in a given text [10]. The methodology employs a complex network model to represent the text and extracts a set of node properties to create a feature set. To create a training set, each candidate keyword is assigned a label based on its appearance in a gold-standard keyword list or not. A binary classification model is developed to forecast specific keywords in a given dataset. The research demonstrates that this technique surpasses various keyword and keyphrase extraction methods on different datasets, and it is not restricted to any specific domain, collection or language.

In a recent academic paper, an innovative DAKE framework for extracting keyphrases is presented. This framework leverages supplemental contextual information sourced from other sentences within the same document [11]. DAKE incorporates a BiLSTM-CRF network, document-level attention mechanism, and gating mechanisms to balance between global and local contexts. As demonstrated on a research paper dataset, this model surpasses current keyphrase extraction methods. There is potential for this approach to be utilized in downstream tasks that necessitate a compact representation of the topical content of a document.

Unsupervised techniques, such as TF-IDF [12] and graph-based methods, have exhibited exceptional potential for the ranking of keyphrases. The TF-IDF method employs word frequency in the document and inverse document frequency [13]. An investigation proposes an unsupervised algorithm that employs the average TF-IDF of candidate words in different languages, including the same language, to select extracted keywords [14]. The proposed algorithm outperformed other keyword extraction techniques, achieving a total accuracy rate of 91.3% on a dataset of 200 news articles in various languages from the BBC website. On the other hand, the graph-based approach involves building a word graph and utilizing centrality measures to identify significant words [15]. A research paper introduces an unsupervised graph-based approach called PositionRank for keyphrase extraction from online textual documents [16]. It incorporates the relative position and frequency of a word into a biased PageRank algorithm to extract keyphrases. This method outperforms strong baselines with up to 26.4% improvement in performance on two datasets. Future research could investigate the effectiveness of PositionRank on other types of documents.

In a recent research, a comparison was made between three machine learning algorithms, namely Decision Trees, Naïve Bayes, and Artificial Neural Networks, for the purpose of extracting keyphrases from text documents [17]. According to the experimental results, the keyphrase extraction method based on Neural Networks outperformed the other two algorithms, as well as a publicly available system known as KEA. However, the Naïve Bayes algorithm can deliver comparable results to the Neural Network method when used with a suitable discretization algorithm. The authors suggest that future research could focus on improving the candidate phrase extraction module and incorporating new features such as structural and lexical features. In another study, a novel deep recurrent neural network (RNN) model was proposed to extract keyphrases from tweets [18]. This model combines keywords and context information and has two hidden layers that discriminate keywords and classify keyphrases. The proposed method outperforms existing methods on a large-scale dataset collected from Twitter, demonstrating its effectiveness for keyphrase extraction on single tweet.

Although these approaches have achieved success, they still have some limitations. One such limitation is that the TF-IDF method only considers lexical features, which leads to inaccurate ranking of keyphrases as it fails to consider text semantics. The graph-based approach may also suffer from sparsity in the word graph, resulting in inadequate coverage of keyphrases. Additionally, supervised methods require labeled data, which may not generalize well to new domains or may require extensive labeling effort. Furthermore, deep learning models necessitate significant training data, which may not always be available. To overcome these limitations, recent studies have focused on developing hybrid approaches that combine the strengths of multiple methods. For example, researchers have investigated incorporating domain knowledge and word embeddings to enhance the performance of keyphrase extraction [19]. These hybrid approaches have displayed promising results and are predicted to advance the state-of-the-art in keyphrase extraction.

The study presents a novel unsupervised technique for extracting keyphrases, known as TeKET, which addresses the limitations of current domain-specific approaches. These approaches necessitate significant domain expertise and training data. TeKET, on the other hand, is language and domain independent, relies on basic statistical knowledge, and doesn’t require any training data [20]. It employs a modified version of a binary tree, KePhEx, to extract the ultimate keyphrases from the candidate ones. The effectiveness of the method is evaluated using benchmark datasets, and the results reveal that it outperforms other unsupervised approaches in terms of precision, recall and F1 scores.

3 Proposed methodology

   This section describes the proposed methodology for graph-based keyphrase extraction using different centrality measures and the Textrank algorithm. The figure 1 shown below represents overview of the proposed architecture. The methodology is divided into the following subsections:

Figure 1
figure 1

Overview of the proposed system architecture.

3.1 Preprocessing

Preprocessing is an essential step in graph preparation that involves converting words into their root forms, also known as lemmatized tokens. The output of this process helps prevent the formation of multiple nodes for different forms of the same word. Preprocessing can increase the graph’s density. This technique optimizes the graph’s structure, making it more efficient and streamlined. The input document is tokenized at the word level, and a lemmatization technique is applied to derive the root form of each token.

3.2 Graph preparation

A directed graph is prepared on the entire document, where the preprocessed words represent the nodes of the graph [21]. The weights of edges are determined using their co-occurrence in the document. For example, if the co-occurrence of two words is three, then the edge connecting the nodes for both words is assigned a weight of three. A directed word graph is an effective way to represent a document in a graphical format that facilitates the calculation of centrality measures. This technique is preferred due to its ability to effectively capture the relationships between words and their importance in the document

3.3 Centrality measures and textrank algorithm

Several centrality measures and algorithms are utilized to assign scores to the nodes of the graph, such as degree centrality, betweenness centrality, closeness centrality, and Textrank [22]. The utilization of centrality measures is a crucial technique in graph theory for determining the most significant nodes in a graph. These measures play a vital role in comprehending the relative importance of various nodes in a graph and can help in identifying critical nodes that could considerably influence the graph’s overall structure.


3.3a Degree centrality: The calculation of the degree centrality of a node relies on the count of direct connections among the nodes present in the graph. This metric reflects the sum of the weights of both the incoming and outgoing edges associated with the node in question within the graph [23].

$$\begin{aligned} Cd(Ni)= \sum _{i=1}^{n} X_{ij} \scriptstyle {(i \ne j)} \end{aligned}$$
(1)

where, Ni is i-th node in the graph, Cd is Degree Centrality measure for node Ni, Xij is the weight of the edge connecting nodes Ni and Nj.


3.3b Betweenness centrality: The calculation of betweenness centrality of a node is based on the count of the shortest routes that encompass the node and the count of the shortest routes that include other vertices as well [24]. The betweenness centrality value for a node is determined through the following equation:

$$\begin{aligned} Cb(v) = \sum _{s \ne v \ne t}\frac{\sigma _{st}}{\sigma _{st}(v)} \end{aligned}$$
(2)

where \(\sigma {\scriptstyle st}\) is the number of shortest paths that include vertices s and t as their end vertices, and \(\sigma {\scriptstyle st}\)(v) is the number of shortest pathways that also contain vertex


3.3c Closeness centrality: The measure of the closeness centrality is determined by adding up the distances between a single node and all the other nodes present in the graph [25].

$$\begin{aligned} Cc(v) = \frac{1}{\sum _{j=a}^{n}d(Ni,Nj)}{\scriptstyle (i \ne j)} \end{aligned}$$
(3)

where Ni is i-th edge, Nj is jth edge, Cc is the closeness centrality measure for node Ni and d(Ni, Nj) is distance between nodes Ni and Nj.


3.3d Textrank: The TextRank method is a ranking algorithm that operates on graphs and is applied for the extraction of keywords and sentences in natural language processing. The algorithm constructs a relationship between the nodes by using the damping factor and a group of vertices that indicate the node’s directionality [26].

$$\begin{aligned} Score(S_i) = (1 - d) + d * \sum _{j \in In(S_i)} \frac{w_{i,j}}{\sum _{k \in Out(S_j)} w_{j,k}} Score(S_j) \end{aligned}$$
(4)

where \(S_i\) is a sentence in the text document, The damping factor d is often established at 0.85, \(In(S_i)\) represents the set of sentences that point to \(S_i\), \(Out(S_j)\) represents the set of sentences that \(S_j\) points to, and \(w_{i,j}\) is the similarity score between \(S_i\) and \(S_j\) (measured using a semantic similarity metric). The \(Score(S_i)\) is the importance score assigned to \(S_i\) based on the iterative calculation.

3.4 Score calculation

The scores assigned by different centrality measures and the Textrank algorithm are combined using a formula that generates scores for functional words more than that of stop words. In textual analysis, stopwords are frequently encountered as they are used to connect meaningful words within a sentence. However, they possess minimal semantic significance and do not contribute to the overall meaning of the text. Therefore, functional words such as nouns, verbs, and adjectives are more relevant in capturing the essence of the text and can be effectively classified as keywords. To determine the significance of functional words and stopwords in a given text, different centrality measures such as closeness, betweenness, and degree centrality can be employed. Empirical observations indicate that closeness centrality and betweenness centrality tend to rank functional words higher than stopwords. On the other hand, degree centrality and Textrank tend to rank stopwords higher and functional words lower. An effective combination of these centrality measures and the Textrank algorithm can lead to a scoring system that assigns a higher weightage to functional words and a lower weightage to stopwords. This can significantly improve the accuracy and precision of keyword extraction in textual analysis. The centrality measures and textrank algorithm is combined using following formula:

$$\begin{aligned} Score(Ni) = ( Cc(Ni) * Cb(Ni) ) / ( Cd(Ni) * S(Ni) ) \end{aligned}$$
(5)

Here Cc, is closeness centrality, Cb is betweenness centrality, Cd is degree centrality and S is page rank for node Ni.

3.5 Keyphrase extraction

After scoring the nodes using the centrality measures and the Textrank algorithm, keyphrases are extracted from the document using specific language patterns [27]. The goal of this step is to identify the most relevant and meaningful keyphrases in the document. To achieve this, a set of language patterns are employed that capture common syntactic structures of keyphrases in natural language. These patterns are included in the table 1.

Table 1 Keyphrase language pattern table.

The given examples illustrate various patterns of noun phrases commonly used. The A N pattern comprises an adjective followed by a noun, such as “linear function”. The N N pattern involves two nouns used together, as in “regression coefficients”. The A N N pattern consists of an adjective followed by two nouns, such as “Gaussian random variables”. The A A N pattern comprises two adjectives followed by a noun, as in “cumulative distribution function”. The N A N pattern involves a noun followed by an adjective and another noun, as in “mean squared error”. The N N N pattern comprises three consecutive nouns, such as “class probability function”. Lastly, the N P N pattern consists of a noun, followed by a preposition, and another noun, as in “degrees of coefficient”.

The table 1 shows the different language patterns used for keyphrase extraction, along with their corresponding meaning and example. For each language pattern, candidate keyphrases are extracted from the document and assigned a phrase-score based on the sum of the scores of its constituent nodes.

Next, keyphrases are ranked based on their phrase-score to identify the most important ones. This allows to identify the most relevant and salient keyphrases in the document that can effectively summarize its content. Overall, the proposed graph-based keyphrase extraction method, with its novel combination of different centrality measures and the Textrank algorithm, provides a robust and effective approach to extract keyphrases from the document. By leveraging linguistic patterns and scores assigned to the nodes, the method identifies keyphrases that effectively capture the essence of the document, making it a valuable tool for various natural language processing applications.

4 Results and discussions

The outcome of the experiment revealed that the suggested technique for extracting keyphrases is efficient in generating top-notch keyphrases from diverse text datasets. This was established by means of evaluation metrics and a comparison with contemporary techniques. This segment entails a meticulous examination of the acquired results, along with an extensive deliberation of their implications for future research.

4.1 Experimental setup

The experimental configuration involved the utilization of Python programming language version 3.8 along with numerous libraries, specifically NLTK, spaCy, and WordNet, to augment the model’s performance. These libraries were employed for different natural language processing and preprocessing tasks, such as tokenization, part-of-speech tagging, and lemmatization. Furthermore, the NetworkX library was utilized to establish a graph that represents the document, with each node representing a preprocessed word, and edge weights being determined by co-occurrence frequency. To demonstrate the model’s effectiveness in generating high-quality keyphrases, it was evaluated on various textual datasets. To operate the model, the system must have a minimum of 4GB memory and a Python installation.

4.2 Performance parameters

Precision, recall, and F1-score are widely used evaluation metrics to measure the effectiveness of a keyphrase extraction system.

Precision calculates the proportion of accurately extracted keyphrases to the total number of keyphrases extracted by the system. The mathematical formula for precision is:

$$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
(6)

where TP represents true positive and FP represents false positive.

Recall is a metric that evaluates the proportion of accurately extracted keyphrases in relation to the overall number of keyphrases that were expected to be extracted. Recall can be expressed mathematically as follows:

$$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
(7)

where TP represents true positive and FN represents false negative.

The F-score is a mathematical representation of the system’s effectiveness, which is obtained by calculating the harmonic mean of precision and recall. It provides a comprehensive and unified evaluation of both precision and recall, allowing for a more detailed analysis of the system’s performance. Mathematically, F-measure is defined as:

$$\begin{aligned} F{-}measure = 2 * \frac{Precision * Recall}{Precision + Recall} \end{aligned}$$
(8)

4.3 Dataset

In order to assess the efficiency of the suggested model for keyphrase extraction, a comparative analysis was conducted using the SemEval2017 dataset. This particular dataset is a widely accepted benchmark for evaluating keyphrase extraction systems, consisting of 300 academic articles spanning diverse domains such as computer science, economics, and linguistics, and comprising a total of 7,018 keyphrases that have been manually annotated.

4.4 Results and comparisons

To measure the effectiveness of the proposed keyphrase extraction model, the accuracy of the extracted keyphrases was evaluated by comparing them with the reference keyphrases in the SemEval2017 dataset. The evaluation was performed using precision, recall and F1 score as the chosen metrics. Five transcripts (D1-D5) were randomly chosen from the dataset and the model’s performance was evaluated on each of them, as presented in table 2. To obtain a general evaluation of the model’s performance, the average of these measures was computed.

Table 2 Model evaluation on SemEval2017.

   The graph presented in figure 2 illustrates the comparative evaluation of the F1 scores for all five transcript variants, along with the mean score across all transcripts. This visualization provides a clear and concise representation of the performance of the proposed keyphrase extraction model and highlights its effectiveness in extracting keyphrases from different transcripts.

Figure 2
figure 2

Keyphrase extraction F1 scores across SemEval2017 transcripts.

A comparative evaluation of the performance of the proposed keyphrase extraction model and conventional models [28] is illustrated in table 3. The assessment is carried out using various evaluation metrics, such as precision, recall, and F-measure. Through this analysis, a comprehensive understanding of the comparative advantages and disadvantages of the proposed model with other state-of-the-art models in the field can be attained.

Table 3 Comparison of keyphrase extraction methods on SemEval2017 dataset.

Overall, the results of this study suggests that the proposed keyphrase extraction model is a promising approach for extracting keyphrases from text. Its performance is superior to conventional models, and it has the potential to be further developed and improved in future studies.

5 Conclusion and future work

In conclusion, the proposed graph-based approach to keyphrase extraction leverages the strengths of the TextRank algorithm and different centrality measures to assign scores to words in the document. The method utilizes linguistic patterns and phrase-scores to extract keyphrases from the text. The approach presented in this study has shown to be highly effective in identifying relevant keyphrases, achieving excellent precision and recall rates on benchmark datasets and real-world applications. Specifically, the proposed approach outperformed existing models, achieving a precision score of 0.5263, a recall score of 0.5498, and an F-measure score of 0.5323.

The approach proposed in this study presents numerous advantages over existing methods. Firstly, it demonstrates exceptional flexibility, enabling customization to meet the requirements of different document types and languages. Secondly, it facilitates the identification of multi-word phrases as keyphrases, which offer richer and more meaningful insights than single-word keyphrases. Finally, the use of various centrality measures in combination with the Textrank algorithm results in a comprehensive and precise representation of word importance in the document, leading to more pertinent and insightful keyphrase extraction.

Further research can explore new graph-based algorithms and techniques, in addition to integrating the proposed approach with existing methods to improve keyphrase extraction performance. The method can also be extended to support related natural language processing tasks, such as document summarization and text classification. The proposed approach is a valuable contribution to the field of natural language processing, with practical applications in various domains.