1 Introduction

Keywords are described as a series of one or more words that provide a compact representation of a document's content (Berry and Kogan 2010; Lahiri et al. 2014; Boudin 2013; Grineva et al. 2009). However, assigning keywords to documents manually is a costly, time-consuming, and tedious task. Moreover, the web is a very rich source of information that has been expanding steadily, and with it the number of available digital documents. Consequently, automatic keyword extraction from the text of social networking sites has attracted the interest of researchers over the last few years. Automatic keyword extraction is an important task that finds applicability in many research directions, such as text mining (TM), information retrieval (IR), and natural language processing (NLP), as it enables text documents to be represented in a condensed way. This compact representation of documents can be helpful in several applications, such as automatic indexing, summarization, classification, clustering, topic detection and tracking, and filtering.

With the growth of social networking, people share ever more information through different kinds of social media. Social networks are established through social interactions such as co-authoring, counselling, supervising, serving on academic committees, and sharing views. Micro-blogs have recently attracted people who wish to express their opinions and socialize with others. One of the most popular micro-blogging sites is Twitter, where people post their opinions about various topics such as politics, brands, products, and celebrities (Savita and Gore 2016). This growth calls for the study and analysis of tweet contents in the hope of summarizing the huge collection of posts, a task known as sentiment analysis. Hence, sentiment analysis has become an important and significant topic in data mining (Hemalatha and Saradhi Varma 2013). Sentiment classification is widely useful in business intelligence applications, recommender systems, and political and administrative decisions. One of the most important tasks in sentiment analysis is keyword extraction: if the keywords of a text are extracted properly, the subject of the text can be studied and analyzed comprehensively and sound decisions can be made about it.

Texts are commonly represented using the well-known vector space model (VSM); however, it results in sparse matrices that are computationally expensive to handle. When the target application involves Twitter content, this problem becomes even worse than for traditional text collections. Because of factors such as the short length of texts, the diversity of Twitter content, casualness, grammatical errors, catchwords, slang, and the speed with which real-time content is produced, an effective technique is needed to extract useful keywords (Ediger et al. 2010). A graph-based technique for extracting important keywords from a set of tweets is appropriate in such a situation and has gained popularity in recent times.

Bellaachia and Al-Dhelaan (2012) proposed a graph-based method to extract keywords from Twitter data, which combines node weights with TextRank, resulting in a node-and-edge-weighting approach called NE rank (node and edge rank). Term frequency–inverse document frequency (TF–IDF) is used as the node weight; however, keywords in tweets do not depend solely on TF–IDF. Abilhoa and Castro (2014) proposed a graph-based technique to extract keywords from Twitter data which uses closeness and eccentricity centralities to determine node weights and degree centrality as the tie breaker. Closeness and eccentricity centralities do not work well for disconnected graphs; yet, in most cases, the graph built from tweets is disconnected because of the diversity of tweet contents. Therefore, an effective graph-based keyword extraction method is required that overcomes most of these drawbacks. This paper proposes such a graph-based keyword extraction method, called keywords from collective weights (KCW). KCW analyzes the whole corpus of tweets and uses different features of a node, such as frequency, centrality, position, and strength of neighbours, to determine the node weights, and then extracts the important set of keywords based on ranks obtained using NE rank and degree centrality. Here, the NE rank depends on the node weight obtained from these features.

The remaining part of the article is organized as follows. Section 2 presents a literature survey of related previous work. Section 3 discusses the proposed model in detail. An illustrative example is presented in Sect. 4 to clarify the proposed model. Results and discussion are presented in Sect. 5, and Sect. 6 draws conclusions about the research work.

2 Literature survey

Keyword extraction techniques can be divided into four categories, namely, linguistic, machine learning, statistical, and other approaches (Zahang et al. 2008). Linguistic approaches use the linguistic properties of words, sentences, and documents, the most commonly examined being lexical, syntactic, semantic, and discourse properties (Hulth 2003; Nguyen and Kan 2007; Cohen-Kerner 2003). Machine learning approaches use supervised or unsupervised learning for keyword extraction; in the supervised setting, a model is trained on a set of known keywords and then used to find the keywords of unseen documents (Witten et al. 1999; Zhang et al. 2006; Medelyan and Witten 2006). Statistical approaches comprise simple, language- and domain-independent methods that do not require training data but use word statistics from the document, such as n-gram statistics, word frequency, TF–IDF, word co-occurrences, and the PAT tree, to identify keywords (Chen and Lin 2010). Other approaches for keyword extraction are, in general, combinations of the approaches mentioned above.

Graph-based keyword extraction is a statistical approach for identifying keywords, and the literature provides many recent graph-based methods. Litvak et al. (2011) used a graph-based syntactic representation of text and web documents to propose an unsupervised cross-lingual key phrase extractor, known as DegExt; the absence of any constraint on the number of nodes in their model leads to exponentially larger graphs for larger data sets. Bellaachia and Al-Dhelaan (2012) proposed a novel graph-based keyword ranking method, called NE rank, which considers word weights in addition to edge weights when calculating the ranking; NE rank forms a major part of the keyword extraction process in the proposed model. Bougouin et al. (2013) proposed an unsupervised method, called TopicRank, that aims to extract key phrases from the most important topics of a document: noun phrases belonging to a particular topic are clustered to form the vertices of the graph, each topic is ranked using the TextRank model, and keyphrases are then selected from the ranked clusters. However, the method of Bougouin et al. (2013) does not provide the best solution for keyphrase selection. Zhao et al. (2011) proposed a three-step algorithm consisting of keyword ranking, candidate keyphrase generation, and keyphrase ranking for extracting important keyphrases for a particular topic; the edge-weighting scheme used for ranking the keywords simply uses the frequency of co-occurrence of two words in a tweet assigned to a topic. Abilhoa and Castro (2014) proposed a keyword extraction method for tweet collections that represents texts as graphs and applies the centrality measures degree, closeness, and eccentricity to find the relevant vertices (keywords); the performance evaluation of the proposed model is mainly inspired by this work, and it is therefore used as one of the baseline models for comparison. Lahiri et al. (2014) extracted keywords using different centrality measures, such as degree, strength, neighbourhood size (order 1), coreness, and PageRank, on word and noun-phrase collocation networks and analyzed their performance on four benchmark data sets; they observed that degree centrality is much simpler to use and performs better than most of the existing methods for extracting keywords and keyphrases. Beliga et al. (2015) proposed a keyword extraction method using node selectivity together with different centrality measures. Kwon et al. (2015) proposed a graph-based model for term weighting and representative keyword extraction. Wang et al. (23) introduced average term frequency (ATF) and document frequency (DF) as an improvement over TextRank for calculating node weights when extracting domain-specific keyphrases. Khan et al. (2016) proposed a novel graph-based re-ranking approach, called term ranker, which extracts single-word and multi-word terms using a statistical approach, identifies groups of semantically similar terms, estimates term similarity based on term embeddings, and uses graph refinement and node centrality ranking to extract the top k terms. Ravinuthala et al. (2016) proposed a directed graph representation technique in which weighted edges are drawn between words based on the theme of the document; they use both system-generated and manually generated keywords for performance evaluation. Nagarajan et al. (2016) presented a keyword extraction algorithm in which the words of the documents are represented as nodes, the relations between words are represented as edges, documents are represented as graphs, and keywords are extracted using degree and closeness centrality measures. In the proposed model, degree centrality is a major contributor to the determination of the top k keywords. Song et al. (2017) proposed a method which considers three major factors, namely, the temporal history of the preceding utterances, topic relevance, and the participants, for keyword extraction using TextRank and some graph operations; in their method, utterances spoken by the current speaker are considered more important than those spoken by other participants.

3 Proposed KCW model

The KCW model considers the frequency, centrality, position, and strength of neighbours of a node to calculate the importance of the node. The implementation of the model is divided into four phases: pre-processing, textual graph representation, node weight assignment, and keyword extraction. The flowchart of the proposed model is shown in Fig. 1. All the phases are discussed in detail below.

Fig. 1
figure 1

Flowchart of the proposed model

3.1 Phase 1: pre-processing

Twitter is a micro-blog where people from all over the world interact through messages, called tweets, which are restricted to 140 characters. Tweets are generally very noisy for any text mining task, as they contain many symbols that do not carry useful information and make further processing ineffective. Therefore, this model performs effective pre-processing to remove meaningless symbols, characters, and words from the tweets so that keywords can be extracted more effectively. The steps for pre-processing are as follows (a code sketch of the complete pipeline is given after the list):

  1. (i)

    Remove username and retweet symbol: Most tweets begin with a username preceded by the symbol ‘@’. When another user agrees with, supports, or shares a tweet, the original tweet is said to be retweeted and contains the symbol RT. Usernames and the retweet symbol do not contribute to keyword extraction and act as noise, and thus are removed.

  2. (ii)

    Remove hashtags: The Hashtag, i.e., # before a word such as #KarnatakaWithCongress, is removed to get ‘KarnatakaWithCongress’.

  3. (iii)

    Remove URLs: The model focuses only on the textual part of the original tweet, which is treated as a sufficient source of information for evaluating its content and context and for extracting the important keywords. In addition, URL links mainly point to image or video files, which are not the focus of the proposed model. Therefore, URLs are considered noise and are removed.

  4. (iv)

    Stop word removal: A standard list of stop words, such as ‘about’, ‘be’, etc., which do not contribute much to the overall meaning, is created, and these stop words are removed from the set.

  5. (v)

    Tokenization: Tokens are the basic constituents of a tweet/text, and each term in a tweet is treated as a token. Let T be the set of tweets, represented as T = {T1, T2, T3, ..., Ti}, where i is the number of tweets. Each Ti in T is pre-processed and its terms are treated as tokens. Let t be the set of tokens, represented as t = {t1, t2, t3, ..., tk}, where t includes the tokens from all the tweets of T and k is the number of tokens.

  6. (vi)

    Removal of unimportant tokens: There are many tokens which are comparatively less important and cannot be keywords. To identify and remove these tokens, a mechanism is established, so that these tokens do not compete in the keyword extraction phase, making it more efficient. The tokens which occur less than the average occurrence frequency (AOF) are removed. AOF is determined by Eq. (1) as given below:

    $${\text{AOF=~}}\frac{{\sum {\text{Frequency}}\;{\text{of}}\;{\text{each}}\;{\text{token}}}}{{{\text{Number}}\;{\text{of}}\;{\text{tokens}}}}.$$
    (1)

    For a given token i, if frequency (i) < AOF, delete token i.

  7. (vii)

    Additional white spaces left after the removal of the stop words and unimportant tokens are removed.
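The following is a minimal Python sketch of the pre-processing pipeline described above. The stop-word list, the regular expressions, and the interpretation of the AOF denominator as the number of distinct tokens are illustrative assumptions, not part of the original specification.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a full list (e.g., NLTK's) would be used in practice.
STOP_WORDS = {"about", "be", "for", "to", "the", "a", "an", "is", "who", "what",
              "will", "should", "have", "has", "been", "in", "he", "because", "via", "very"}

def preprocess(tweets):
    """Steps (i)-(vii): strip usernames, retweet markers, hashtag symbols and URLs,
    remove stop words, tokenize, and drop tokens rarer than the AOF (Eq. 1)."""
    token_lists = []
    for tweet in tweets:
        text = re.sub(r"https?://\S+", " ", tweet)      # (iii) remove URLs
        text = re.sub(r"\bRT\b|@\w+", " ", text)        # (i) usernames and retweet symbol
        text = text.replace("#", " ")                   # (ii) keep the hashtag word, drop '#'
        tokens = [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)
                  if w.lower() not in STOP_WORDS]       # (iv)-(v) stop words, tokenization
        token_lists.append(tokens)

    # (vi) average occurrence frequency (AOF) filter; (vii) whitespace handling is implicit here
    freq = Counter(tok for toks in token_lists for tok in toks)
    aof = sum(freq.values()) / len(freq)                # assumes distinct tokens in the denominator
    return [[tok for tok in toks if freq[tok] >= aof] for toks in token_lists]
```

Applied to the five sample tweets of Sect. 4, this should yield token lists similar to the set t shown there; the exact outcome depends on the stop-word list used.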

3.2 Phase 2: textual graph representation

Let G = (V, E) be a graph, where V is the set of vertices and E is the set of edges. The textual graph is then represented as described below.

  1. (i)

    Vertex assignment: Each token is used to create one vertex. The set of vertices (V) is thus created from the set of tokens during vertex assignment.

  2. (ii)

    Edging: If two tokens co-occur within the same window (i.e., a tweet), there is an edge between them. A directed edge Ei,j is generated from each token “i” to its immediate successor “j”, following the sequence in which they appear in the original tweets/texts. The generated graph revolves around a single topic and thus contains many common tokens that are associated with more than one tweet.

    An adjacency matrix representing the edge weights of the textual graph is created. The weights represent the strength of the relationship between two tokens. Wu et al. (2011) presented a weighted semantic graph in which directed edges representing the co-occurrence relationship between two terms gave outstanding results, and edge generation based on co-occurrence relationships between terms has motivated many other studies (Bordag et al. 2003; Jin and Srihari 2007; Rousseau and Vazigiannis 2013). Thus, to find the weight of each edge, the occurrence frequencies of the nodes (i.e., vertices) and their co-occurrence frequency over the whole data set are used. The weight of an edge between vertices/nodes/terms ti and tj is determined by Eq. (2) (Sonawane et al. 2014):

    $${W_c}(i,j)=\frac{{{\text{freq}}(i,j)}}{{{\text{freq}}(i)+{\text{freq}}(j)-{\text{freq}}(i,j)}},$$
    (2)

    where freq(i, j) is the number of times node i and j co-occur, and freq(i) and freq(j) are the occurrence frequencies of nodes i and j, respectively.

The algorithm for constructing the graph is as follows, and a code sketch is given after it.

figure a
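To make Phase 2 concrete, the sketch below builds the directed textual graph with the edge weights of Eq. (2). The use of networkx is an implementation convenience, not part of the original algorithm, and the sketch resolves the window/immediate-successor ambiguity by drawing edges only between immediately successive tokens.

```python
from collections import Counter
import networkx as nx

def build_textual_graph(token_lists):
    """One vertex per token; a directed edge from each token to its immediate
    successor within a tweet, weighted by Eq. (2)."""
    g = nx.DiGraph()
    freq = Counter(tok for toks in token_lists for tok in toks)   # freq(i)
    co_occ = Counter()                                            # freq(i, j)

    for toks in token_lists:
        g.add_nodes_from(toks)
        for i, j in zip(toks, toks[1:]):                          # successive tokens in a tweet
            co_occ[(i, j)] += 1

    for (i, j), fij in co_occ.items():
        w = fij / (freq[i] + freq[j] - fij)                       # Eq. (2)
        g.add_edge(i, j, weight=w)
    return g
```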

3.3 Phase 3: node weight assignment

In keyword extraction using a graph-based model, the weight of a node plays a vital role: proper node weight evaluation leads to effective and representative keyword determination for tweets/text. Many factors affect the importance of a node. The KCW model considers five different important features to calculate the node weights, as discussed below (a code sketch combining all five features is given after the list):

  1. (i)

    Position of a node: Hotho et al. (2005) suggested that the position of a term is an important criterion when extracting keywords. Twitter data sets contain short texts, and thus the probability that the first or last word of a tweet is a keyword is higher. Therefore, an added weight is given to such words, calculated as follows.

figure b
  1. (ii)

    Term frequency: Important keywords tend to occur more frequently in a document. Thus, the frequency of a keyword forms an essential parameter in the node weight determination. Term frequency is defined as the number of times a given term occurs in a document; here, it is the number of times the term occurs in the whole set of tweets.

  2. (iii)

    Selectivity centrality: Selectivity centrality is defined as the average weight on the links of a single node. A centrality measure is an indication of the importance of a node within a graph, and selectivity centrality works well even for disconnected graphs. The strength of a vertex v, s(v), is the sum of the weights of all edges incident on v. The selectivity centrality of vertex v is calculated as the ratio of the vertex strength to the vertex degree d(v) and is computed by Eq. (3):

    $${\text{SC}}(v)=\frac{{s(v)}}{{d(v)}}$$
    (3)
    $$s(v)=\mathop \sum \limits_{u} {W_{vu}}.$$
    (4)
  3. (iv)

    Distance from central node: The closer a node is to the most central node, the more likely it is to be an important keyword. Therefore, the importance of a node is calculated as the inverse of its distance from the central node. The central node is found using the degree of the nodes: the node with the highest degree is considered the most central, and if there is a tie in degree, the frequency of the nodes is used to break it. If the central node is c, then the importance of a node i is determined by Eq. (5):

    $${D_{C\left( i \right)}}=\frac{1}{{d(c,i)}},$$
    (5)

    where d(c, i) is distance of node i from the central node c. This value is normalized to be within the range of 0 and 1. For the central node c, the DC(c) value is set to 1.

  4. (v)

    Importance of neighbouring nodes: A node is considered more important if its neighbours are also important. The importance of the neighbouring nodes can be calculated as the average strength of all the neighbours, which is calculated by Eq. (6):

    $${\text{Neig}}{{\text{h}}_{{\text{Imp}}}}(i)=~\frac{{\mathop \sum \nolimits_{j} {\text{Strength}}(j)}}{N},$$
    (6)

    where j is any neighbour of i, N is the number of neighbours of i, and NeighImp(i) is the importance of the neighbouring nodes of i.

    Then, the weight of any node i is calculated by Eq. (7) and is normalized by Eq. (8) to obtain a value between 0 and 1. The normalized value is the final weight of i:

    $${\text{Nod}}{{\text{e}}_{{\text{weight}}(i)}}=F{\text{(}}i{\text{)}}+L{\text{(}}i{\text{)}}+{\text{TF(}}i{\text{)}}+{\text{SC(}}i{\text{)}}+{D_{C(i)}}+{\text{Neig}}{{\text{h}}_{{\text{Imp}}}}{\text{(}}i{\text{),}}$$
    (7)

    where TF(i) and SC(i) are the term frequency and selectivity centrality of i, respectively:

    $${\text{Fina}}{{\text{l}}_{{\text{weight}}(i)}}{\text{=}}\frac{{{\text{Node\_weight}}(i) - {\text{min\_weight}}}}{{{\text{max\_weight}} - {\text{min\_weight}}}},$$
    (8)

    where Node_weight(i) is the weight of node i, min_weight is the minimum weight among all the nodes, and max_weight is the maximum weight among all the nodes.
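A sketch combining the five features of Eq. (7) and the normalization of Eq. (8) is given below, continuing the networkx-based representation above. The definitions of F(i) and L(i) used here (fractions of tweets in which a token occurs first or last) are assumptions standing in for the procedure of figure b, and distances and neighbourhoods are taken on the undirected view of the graph; these are illustrative choices, not the paper's exact specification.

```python
import networkx as nx

def node_weights(g, token_lists, freq):
    """Eq. (7): F + L + TF + SC + D_C + Neigh_Imp, followed by Eq. (8) normalization."""
    ug = g.to_undirected()
    n_tweets = max(len(token_lists), 1)

    # (i) position of a node (assumed definition; the paper's figure b gives the exact one)
    F = {v: sum(1 for toks in token_lists if toks and toks[0] == v) / n_tweets for v in g}
    L = {v: sum(1 for toks in token_lists if toks and toks[-1] == v) / n_tweets for v in g}

    # (ii) term frequency over the whole tweet set
    TF = {v: freq[v] for v in g}

    # (iii) selectivity centrality, Eqs. (3)-(4): vertex strength divided by degree
    strength = {v: sum(d["weight"] for _, _, d in ug.edges(v, data=True)) for v in g}
    SC = {v: strength[v] / ug.degree(v) if ug.degree(v) else 0.0 for v in g}

    # (iv) inverse distance from the most central node (highest degree, ties broken by frequency)
    c = max(g, key=lambda v: (ug.degree(v), freq[v]))
    dist = nx.single_source_shortest_path_length(ug, c)
    DC = {v: 1.0 if v == c else (1.0 / dist[v] if v in dist and dist[v] > 0 else 0.0) for v in g}

    # (v) importance of neighbouring nodes: average strength of the neighbours, Eq. (6)
    NI = {v: (sum(strength[u] for u in ug.neighbors(v)) / ug.degree(v))
          if ug.degree(v) else 0.0 for v in g}

    raw = {v: F[v] + L[v] + TF[v] + SC[v] + DC[v] + NI[v] for v in g}      # Eq. (7)
    lo, hi = min(raw.values()), max(raw.values())
    return {v: (raw[v] - lo) / (hi - lo) if hi > lo else 1.0 for v in g}   # Eq. (8)
```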

3.4 Phase 4: keyword extraction

The process of identifying keywords from a tweet/document that can appropriately represent the subject of the tweet/document is termed keyword extraction. To extract the important keywords, the proposed model uses NE rank and degree centrality.

  1. (i)

    Calculate the NE rank: The NE rank method combines node weights with the TextRank method, resulting in a node-and-edge-weighting approach called NE rank (node and edge rank) (Bellaachia and Al-Dhelaan 2012). For a node vi, the rank/relevance using NE rank is calculated using Eq. (9) given below:

    $$R({v_i})=(1 - d) \cdot W({v_i})+d \cdot W({v_i}) \cdot \mathop \sum \limits_{{j:{v_j} \to {v_i}}} \frac{{{w_{ji}}}}{{\mathop \sum \nolimits_{{k:{v_j} \to {v_k}}} {w_{jk}}}}R({v_j}).$$
    (9)

    In Eq. (9), d is the damping factor, which denotes the probability of jumping from a node to the next node and is usually set to 0.85; (1 − d) denotes the probability of jumping to a new node. W(vi) is the weight of the current node vi. wji is the weight of the edge from the previous vertex vj to the current vertex vi, and \(\sum\nolimits_{{k:{v_j} \to {v_k}}} {{w_{jk}}}\) is the summation of the weights of all outgoing edges of the previous vertex vj. The rank/relevance of node vj is denoted by R(vj).

  2. (ii)

    Calculate the degree centrality: Degree centrality of a node is defined as the number of edges incident on the node.

  3. (iii)

    Sort the keywords: Do the following for all the terms/nodes.

figure c

Finally, the n best ranked terms/nodes are selected as keywords; a code sketch of this phase is given below.
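Phase 4 can be sketched as a fixed number of iterations of Eq. (9), followed by sorting with degree centrality as the tie breaker. The iteration count and the convergence behaviour are implementation choices; the exact sorting procedure is given in figure c.

```python
def extract_keywords(g, weights, n=10, d=0.85, iterations=50):
    """NE rank (Eq. 9) over the weighted directed graph, then the n best nodes,
    breaking rank ties by degree centrality."""
    rank = dict(weights)                                  # initialise ranks with the node weights
    out_sum = {v: sum(dd["weight"] for _, _, dd in g.out_edges(v, data=True)) for v in g}

    for _ in range(iterations):
        rank = {
            vi: (1 - d) * weights[vi] + d * weights[vi] * sum(
                (g[vj][vi]["weight"] / out_sum[vj]) * rank[vj]
                for vj in g.predecessors(vi) if out_sum[vj] > 0
            )                                             # Eq. (9)
            for vi in g
        }

    ug = g.to_undirected()
    ordered = sorted(g, key=lambda v: (rank[v], ug.degree(v)), reverse=True)
    return ordered[:n]
```

For instance, extract_keywords(g, node_weights(g, token_lists, freq), n=10) would return the ten highest-ranked terms of the graph built in Phase 2.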

4 An illustrative example

Let us use the IPL data set to illustrate the proposed model in detail. Let us consider five tweets as a sample from the data set as shown below.

  1. (i)

    ‘Very excited for todays IPL contest RPS vs KKR, @msdhoni vs @GautamGambhir fight! #IPL’.

  2. (ii)

    ‘#poll who score 50 + score today #smithy #dhoni #stokes #Rahane #KKRvRPS #rpsvskkr #cricketlovers #ipl #IPL2017’.

  3. (iii)

    ‘RPS should be happy team today, because KKR have decided to rest NCN. He has been in prime form. #KKRvRPS #IPL @RPSupergiants @KKRiders’.

  4. (iv)

    ‘KKR seek to extend unbeaten run against Pune https://t.co/NdEuZIdxL5 via @cricbuzz @RPSupergiants @KKRiders #IPL’.

  5. (v)

    ‘#RPSvKKR Predict What will be the outcome? #ipl #KKRvRPS #ipl #Smithy #Gambhir 21’.

Pre-processing is performed to remove the noise from the tweets. After removing RT symbol, @ symbol, and URLs, the tweets are as follows:

  1. (i)

    ‘Very excited for todays IPL contest RPS vs KKR, vs fight! IPL’.

  2. (ii)

    ‘poll who score 50 + score today smithy dhoni stokes Rahane KKRvRPS rpsvskkr cricketlovers ipl IPL2017’.

  3. (iii)

    ‘RPS should be happy team today, because KKR have decided to rest NCN. He has been in prime form. KKRvRPS IPL’.

  4. (iv)

    ‘KKR seek to extend unbeaten run against Pune via IPL’.

  5. (v)

    ‘RPSvKKR Predict What will be the outcome? ipl KKRvRPS ipl Smithy Gambhir 21’.

After removing stop words and tokenization, the set (t) of tokens is represented as

t = {excited, todays, ipl, contest, rps, vs, kkr, vs, fight, ipl, poll, score, 50, score, today, smithy, dhoni, stokes, rahane, kkrvrps, rpsvskkr, cricketlovers, ipl, ipl2017, rps, happy, team, today, kkr, decided, rest, ncn, prime, form, kkrvrps, ipl, kkr, seek, extend, unbeaten, run, pune, ipl, rpsvkkr, predict, outcome, ipl, kkrvrps, ipl, smithy, gambhir, 21}.

Using the frequencies of all the tokens, the average occurrence frequency (AOF) is calculated and is obtained as 1.3659. The tokens whose frequency is less than 1.3659 are removed. Therefore, the set t becomes

t = {ipl, rps, vs, kkr, score, today, smithy, kkrvrps}.

The textual graph is then represented by Fig. 2 and weights of the nodes calculated by Eq. (7) are shown in normalized form in Table 1.

Fig. 2
figure 2

Textual graph of the illustrative example

Table 1 Weights of nodes of the example

For each of the nodes, the NE rank centrality and degree centrality are calculated as given in Table 2.

Table 2 NE centrality and degree centrality for the example

Using NE rank centrality with degree as tie breaker, the top keywords are obtained, as shown in Table 3.

Table 3 Top keywords

5 Results and discussion

The data sets used in this paper are mainly tweets collected from Twitter using NodeXL and Google spreadsheets with Twitter Achiever. Four data sets, namely, Donald Trump, Harry Potter, IPL, and Uri Attack, are collected from Twitter, each containing 1000 tweets, and experiments are performed on each of them. For a better assessment of the relevance of the results, the study is also conducted on an additional data set, IPhone5, which is collected from Amazon and Flipkart and contains 1500 IPhone5 reviews.

To evaluate the performance of the proposed model and of the other existing methods used for comparison, a set of keywords is defined manually, because there is no standard or correct set of keywords for a given data set; even humans may not agree fully on the keywords they would extract. Therefore, three persons are invited to suggest an unspecified number of keywords for each data set. The human extractors are guided by two norms. First and most importantly, after summarizing the topic, keywords must be extracted based on their relevance and their importance for the particular topic as a whole. Second, keywords with higher frequencies are also to be considered important. The human extractors are instructed to select words or terms exactly as they appear in the data sets: no alteration is allowed for abbreviations, plural forms, joined words, words with the same stem, or upper/lower case. There is no limit on the number of terms that can be extracted as useful keywords, since any word may seem important to different individuals for a particular topic. The intersection and union of the sets identified by the three persons are determined for each data set, which ensures zero variation while determining the performances. Table 4 contains the keywords extracted by the three persons for the five data sets; the intersection sets are shown in bold face.

Table 4 Keywords extracted by three persons for five data sets

Three different performance measures, namely, precision (Pr), recall (Re), and F measure, are used as evaluation metrics for keyword extraction as given in Eqs. (10)–(12):

$$\Pr =\frac{{|\{ {\text{Relevant}}\} \cap \{ {\text{Retrieved}}\} |}}{{|\{ {\text{Retrieved}}\} |}}$$
(10)
$${\text{Re}}=~\frac{{|\{ {\text{Inter}}\_{\text{Relevant}}\} \cap \{ {\text{Retrieved}}\} |}}{{|\{ {\text{Inter}}\_{\text{Relevant}}\} |}}$$
(11)
$$F\;{\text{measure}}={\text{2}} \times \frac{{{\text{Pr}} \times {\text{Re}}}}{{({\text{Pr}}+{\text{Re}})}}.$$
(12)

To compute Pr, Relevant denotes the set of keywords that appear in at least one of the human lists/sets. To compute Re, Inter_Relevant denotes the set of keywords that appear in the intersection of the three human lists. Precision may be defined as the probability that a keyword is relevant given that it is returned by the system, and recall as the probability that a relevant keyword is returned by the system (Goutte et al. 2005). The intersection set consists of the keywords that all three persons agree are relevant for a particular topic. Therefore, to maintain zero variation in what counts as an actually relevant keyword, and to maintain the effectiveness of the system, the intersection is considered only for the recall. However, keywords found to be relevant by at least one of the three persons cannot be neglected and hence are considered when calculating the precision.
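Equations (10)–(12) amount to simple set operations; the sketch below assumes the retrieved keywords and the three human lists are given as Python collections of strings.

```python
def evaluate(retrieved, human_lists):
    """Precision against the union of the human lists (Eq. 10), recall against
    their intersection (Eq. 11), and the harmonic F measure (Eq. 12)."""
    retrieved = set(retrieved)
    relevant = set.union(*map(set, human_lists))               # in at least one human list
    inter_relevant = set.intersection(*map(set, human_lists))  # approved by all extractors

    pr = len(relevant & retrieved) / len(retrieved) if retrieved else 0.0
    re = len(inter_relevant & retrieved) / len(inter_relevant) if inter_relevant else 0.0
    f = 2 * pr * re / (pr + re) if (pr + re) else 0.0
    return pr, re, f
```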

To show how sensitive the overall accuracy of the proposed model is to the different features, an incremental approach is adopted that yields five different combinations: the features are added to the KCW model one by one for experimentation. The performances of the five combinations obtained by adding the features one by one are shown (in percentage) in Table 5 when the number of retrieved keywords is 10.

Table 5 Performance of the incremental combinations of parameters in KCW model

Following observations are studied from the experimental results, as shown in Table 5:

  1. (i)

    Using only the distance from the central node, \({D_{C(i)}}\), for the node weight assignment, KCW produces the same accuracy for three data sets, namely, Uri Attack, Harry Potter, and IPL, and slightly lower accuracy for the Donald Trump and IPhone5 data sets as compared to when all parameters are used.

  2. (ii)

    For the Harry Potter data set, KCW produces slightly higher accuracy when the summation of \({D_{C(i)}}\;{\text{and}}\;{\text{SC}}(i)\) is used for assigning node weights than when all the parameters are used. For the IPL, Donald Trump, and IPhone5 data sets, KCW produces the same results when all the parameters are used and when the summation of \({D_{C(i)}}\;{\text{and}}\;{\text{SC}}(i)\) is considered. However, a slight decrease in accuracy is encountered for the Uri Attack data set when only \({D_{C(i)}}\;{\text{and}}\;{\text{SC}}(i)\) are used for node weight assignment.

  3. (iii)

    For three data sets, the results using \({D_{C(i)}},\;{\text{SC}}(i)\;{\text{and}}\;{\text{Neig}}{{\text{h}}_{{\text{Imp}}}}(i)\) and using all the parameters in the KCW model are the same. However, the KCW model using all the parameters produces higher accuracy than when using \({D_{C(i)}},~\;{\text{SC}}(i)\;{\text{and}}\;{\text{Neig}}{{\text{h}}_{{\text{Imp}}}}(i)\) on the Donald Trump and IPhone5 data sets.

  4. (iv)

    For three data sets, the results using \({D_{C(i)}},\;{\text{SC}}(i),\;{\text{Neig}}{{\text{h}}_{{\text{Imp}}}}(i),\;F(i)\;{\text{and}}\;L(i)\) and using all the parameters in the KCW model are the same. However, the KCW model using all the parameters produces higher accuracy than when using \({D_{C(i)}},\;{\text{SC}}(i),\;{\text{Neig}}{{\text{h}}_{{\text{Imp}}}}(i),\;F(i)\;{\text{and}}\;L(i)\) on the IPL and IPhone5 data sets.

From these observations, it can be concluded that the distance from the central node (\({D_{C(i)}}\)) is the most important feature in the node weight assignment, and selectivity centrality (\({\text{SC}}(i)\)) is the second most significant feature in calculating the node weights. Term frequency appears to be more important than the importance of neighbouring nodes (\({\text{Neig}}{{\text{h}}_{{\text{Imp}}}}(i)\)) and the position of nodes (\(F(i),~L(i)\)); however, the latter two seem to be equally significant in the weight assignment. From the observations of Table 5, it can be concluded that, although all the features are significant for the node weight assignment, the distance from the central node (\({D_{C(i)}}\)) and selectivity centrality (\({\text{SC}}(i)\)) are the most important features in the proposed KCW model.

For the comparison of our proposed model, we have considered three baseline models as follows:

  1. (i)

    Twitter keyword graph (TKG) (Abilhoa and Castro 2014): all the variants are considered.

  2. (ii)

    Keyword extraction using graph-based approach (KEGBA) (Nagarajan et al. 2016).

  3. (iii)

    NE rank using TF–IDF as node weighting scheme (Bellaachia and Al-Dhelaan 2012).

Three edge-weighting possibilities are considered for existing model (Abilhoa and Castro 2014): same weight assignment (W1), weight as co-occurrence frequency (Wf) and weight as inverse co-occurrence frequency (W1/f). Degree centrality (Cd), closeness centrality (Cc), and eccentricity centrality (Ce) are used to measure the relevance of nodes/terms. The existing model (Bellaachia and Al-Dhelaan 2012) uses the co-occurrence value between any two words within a specific window size as the weight of the edges (Wf2) between two nodes. The window size used is equal to 2. The proposed model uses Eq. (2) to find edge weight and Eq. (7) to find node weight.

The performances of TKG, KEGBA, NE rank using TF–IDF, and the proposed KCW for the Uri Attack, IPL, Harry Potter, Donald Trump, and IPhone5 data sets are shown in Tables 6, 7, 8, 9, and 10, respectively. The pre-processing described in Sect. 3.1 is done in exactly the same way for all four methods to ensure a fair evaluation. The following observations are made from the experimental results when the number of retrieved keywords is 10. The proposed model provides the highest precision, recall, and F measure on the Uri Attack data set; however, its precision is equal to the precision produced by KEGBA with degree centrality. The model also provides the highest precision, recall, and F measure on the IPL data set. The proposed model provides the highest precision, recall, and F measure on the Harry Potter data set; however, its precision is equal to the precision produced by TKG with TextRank centrality and by NE rank using TF–IDF. For the Donald Trump data set, the proposed KCW model provides better precision and F measure than the TKG and KEGBA models; however, its recall is equal to the recall produced by TKG with TextRank and closeness centralities, by KEGBA with degree centrality, and by NE rank using TF–IDF. For the IPhone5 data set, the model provides the highest precision, recall, and F measure; however, its precision is equal to the precision produced by TKG with eccentricity centrality and its recall is equal to the recall produced by TKG with Eigen centrality.

Table 6 Performances in Uri Attack data set
Table 7 Performances in IPL data set
Table 8 Performances in Harry Potter data set
Table 9 Performances in Donald Trump data set
Table 10 Performances in IPhone5 data set

To show the relevance of the proposed work, the study is also conducted using a crowd-sourced data set, IPhone5. A survey is conducted using Google spreadsheets within the institution, where scholars and faculty are instructed to extract keywords for the data set following the same norms and instructions as the three human extractors; around 250 individuals responded. Using the keywords selected by these roughly 250 individuals, instead of those selected by the three human extractors shown in Table 4, precision, recall, and F measure are calculated for the IPhone5 data set using Eqs. (10)–(12), and the results are shown in Table 11. For the crowd-sourced IPhone5 data set, the number of Inter_Relevant keywords is 8. From Tables 10 and 11, it can be seen that the results obtained using the keywords selected by the three human extractors are mostly better than those obtained using the crowd-sourced keywords. It is also observed that, of the 17 different methods, 12 extracted the same number of Relevant keywords when calculating the precision, and 6 extracted the same number of Inter_Relevant keywords when calculating the recall, for the two different sets of keywords, i.e., the one selected by the 3 human extractors and the one selected by almost 250 individuals. This shows that the study of the proposed work can also be conducted using crowd-sourced user bases, which establishes the relevance of the proposed work.

Table 12 shows the top ten extracted keywords for IPL data set according to their ranks, for the three existing methods used for comparison and the proposed model. The keywords highlighted in bold are the ones which are in at least one of the human lists. The keywords which are both bold and italic are the ones which are present in the intersection set of the three human lists. The keywords without any highlight are the ones which are misclassified. It is observed from Table 12 that keywords misclassified by the different methods are mostly irrelevant for the topic. It is also observed that KCW model succeeds to extract significantly important keywords for a particular topic in comparison with other existing methods.

Table 11 Performances in IPhone5 data set using the crowd-sourced keywords set
Table 12 Keywords extracted in IPL data set

From Tables 6, 7, 8, 9, 10, 11 and 12, it can be observed that the proposed model shows a significant improvement over the other existing methods. This improvement results from several factors. The KCW model uses an effective pre-processing phase together with the AOF, which eliminates irrelevant tokens/terms; TKG originally performs only simple stop-word removal and tokenization as pre-processing, and KEGBA originally does not use any pre-processing step. The KCW model considers five significantly important features of a node for determining the node weight, whereas TKG does not combine different measures to find node weights. KCW considers both the occurrence frequencies of the nodes and their co-occurrence frequency to calculate edge weights in a balanced and effective manner, whereas TKG assigns edge weights using equal weights, co-occurrence frequency, or inverse co-occurrence frequency. TKG uses closeness and eccentricity centralities to determine node weights and degree centrality as the tie breaker; however, closeness and eccentricity centralities do not work well for disconnected graphs. Finally, two well-established methods are used by KCW for determining the ranks of the keywords, i.e., degree centrality and NE rank, and KCW uses an improved version of NE rank in which the node weight is determined by the summation of several significantly important node features.

Even though TKG is implemented with different centrality measures, it does not perform better than KCW, because it originally uses simple pre-processing and edge-weighting techniques and does not combine different measures to find node weights. KEGBA represents the graph using simple co-occurrence relations between words to find edge weights, and degree and closeness centralities are used separately to find node weights; thus, noise is not removed, edge weights are not properly captured, and centrality and other measures are not combined to overcome each other's disadvantages. NE rank using TF–IDF depends solely on word frequency and disregards the relationships between words when determining node weights; sometimes an important word appears only a few times, which TF–IDF does not capture. TF–IDF does not perform keyword extraction well on Twitter data because of the diversity and informality of Twitter content.

6 Conclusions

Keyword extraction is one of the most important tasks in analyzing textual data from micro-blogging sites such as Twitter. This paper proposes an efficient keyword extraction model called KCW, which consists of four phases: pre-processing, textual graph representation, node weight assignment, and keyword extraction. The pre-processing phase removes unwanted noise and stop words and performs tokenization. Textual graph representation involves node assignment and the establishment of edges between these nodes. A new node weight assignment scheme is introduced for the NE-rank approach: the node weight assignment phase determines a node's weight based on its position, frequency, centrality, distance from the central node, and strength of its neighbours. Finally, the most important keywords are extracted using NE-rank centrality with degree as the tie breaker. Experimental results show that each of the five features used in the node weighting scheme contributes significantly to the overall accuracy of the proposed model. To show the superiority of the proposed model, it is compared with variants of two existing graph-based models and one existing non-graph model. It is observed from the experimental results that the performance of the KCW model is far better than that of the others, and that the keywords extracted by the KCW model are more clearly important for a particular topic than those extracted by the other existing methods.

Being a generalized model, the proposed KCW model can be used for keyword extraction in sentiment analysis, a domain that holds great potential for quality research in the near future. Different aspects of this domain can be explored and augmented with other significant algorithms. New edge-weighting and node-weighting methods can be proposed and used with NE-rank centrality or other centrality measures to rank keywords. Different cascaded or individual approaches based on other centrality measures can also be adopted in the future to obtain better results. Semantic methods for keyword extraction can also be explored along with the proposed model to obtain better results.