1 Introduction

Social media analysis is one of the most widely studied areas of research; the field of topic detection and tracking was introduced in [1]. Keyword and keyphrase extraction are central to social media analysis. Words that are important with respect to a text are called keywords. Similarly, a set of words that occur together as a phrase and represent the text is called a keyphrase. A keyphrase captures the topic of discussion and summarizes the textual information.

  • Keyphrase Extraction

Keyphrase extraction is the process of identifying important phrases in given textual data. It is used here to summarize short text, that is, to identify the topic of discussion from a set of Twitter posts related to the same topic or domain. Whether the extracted keyphrase should describe the main topic or a sub-topic under it is a subjective decision. For instance, ‘floods in Kerala 2018’ is a main topic, whereas ‘having nothing to eat in XYZ village’, ‘save us! Water all around’ and ‘our house drowned in flood’ are sub-events. These sub-events help determine emergency conditions. Such topics and emergency conditions can be deduced by identifying keyphrases from many Twitter posts, and tackling this kind of information provides useful insights into social media data. In this study, short text is used for keyphrase extraction, and the terms keyphrase extraction and text summarization are used interchangeably.

Recent work on word co-occurrence networks has added statistical and computational significance to information processing. Graph-based keyphrase extraction measures are usually based on global network metrics that have also been used for keyword extraction, and these metrics have been studied both for well-formed data and for social media data. Typically, keywords are ranked using random walk-based measures, and the keyphrase is then extracted from this ranking. Application domains for keyword and keyphrase extraction include sentiment analysis, topic and trend detection, event detection, disaster management, outbreak detection and many others. The major challenge in identifying keywords in social media data is that the data is unstructured. Owing to its combined graphical and statistical approach, random walk-based keyword extraction from word co-occurrence networks has proved useful for well-formed data; for ill-formed data, however, traditional network metrics on word co-occurrence networks have not been explored.

Word co-occurrence networks are graphical networks generated from contextual information; they are also referred to as the word adjacency model [2]. Different types of textual networks can be framed from given data. For each graph G, a node is a word and an edge is a link connecting two words that co-occur in the given document; in this work, the document is a Twitter feed. Based on co-occurrence and associativity, construction of the graph is decided by three parameters: edge weighting, edge direction and the neighbouring model (adjacent pair or all pair). The remainder of this paper is organized as follows. Section 2 reviews existing approaches. In Sect. 3, existing metrics are implemented on standard Twitter feeds and the results are evaluated. Section 4 analyses and discusses the results, and Sect. 5 presents the conclusion and future work.

2 Evolution of Different Approaches

Graph-based keyword extraction techniques can be supervised or unsupervised, and context dependent or context independent. In this work, several context-independent, unsupervised, graph-based keyword extraction techniques have been explored. KeyWorld, an automatic indexing system proposed by Matsuo et al. [3], extracts candidate keywords by measuring their influence on small-world properties, captured by the characteristic path length and an extended characteristic path length; the algorithm was inspired by the small-world phenomenon and by the KeyGraph algorithm of Ohsawa et al. [4]. Thereafter, Erkan et al. [5] proposed LexRank, which is insensitive to noise in text and calculates the importance of a sentence (or word) using eigenvector centrality. Mihalcea et al. [6] proposed the graph-based TextRank model, which originated from the concept of PageRank, and later improved TextRank for text summarization. In 2007, Palshikar [7] proposed a hybrid, statistics-based approach for keyword extraction using a co-occurrence frequency measure, describing eccentricity-based keyword identification, other centrality measure-based keyword extraction and proximity-based keyword identification. Litvak et al. [8] proposed a HITS-based algorithm for keyword extraction. In 2009, for event detection and tracking in social streams, Sayyadi [9] used the KeyGraph algorithm proposed earlier by Ohsawa et al. [4]. Later, in 2011, Litvak et al. introduced DegExt, a graph-based, language-independent keyphrase extractor that uses degree centrality for keyword extraction. In 2013, Boudin et al. [10] compared various centrality measures for graph-based keyphrase extraction from short documents. Abilhoa et al. [11] proposed the Twitter Keyword Graph (TKG) algorithm to extract keywords from Twitter data, introducing all neighbour edging and nearest neighbour edging for graph construction, frequency-based and inverse-frequency-based edge weights, and different centrality measures. Besides this, a selectivity-based keyword extraction (SBKE) algorithm was proposed by Beliga et al. [12], which uses the degree and strength of each node. Many further algorithms improve on semantic and linguistic features of textual networks, including SingleRank, ExpandRank, word co-occurrence statistical information [13], noun phrase-based keyword extraction, semantic relationships using Wikipedia texts [14], weighted lexical complex network-based keyword extraction by Bollen et al. [15], a keyword and keyphrase extraction algorithm for word co-occurrence network structures [16], the word topic network model [17] and many more. However, such algorithms have not been considered for keyword extraction from Twitter.

3 Experiments and Evaluation

Traditional techniques for keyword and keyphrase extraction based on random walk-based measures and other network metrics have been implemented for Twitter data. The data set used in this study is the First Story Detection (FSD) data set of Petrovic et al. [18], in which 27 topics are given as ground truth. For every tweet id in the data set, the ground truth specifies a topic id from among the 27 topics; in total, 3034 tweets are marked with a corresponding topic id. The data set is used by extracting all tweets for one topic, and these tweets are summarized using keyword and keyphrase extraction techniques. Python and the NetworkX module have been used to implement the existing keyphrase extraction techniques on Twitter data. For every experimental evaluation, Twitter data is given as input and the highest-scoring and lowest-scoring keywords are obtained as output.
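As a minimal sketch of this setup (our illustration, not the authors' released code), the word co-occurrence network for one topic can be built with Python and NetworkX as follows, assuming adjacent-pair edging with frequency-based edge weights and a simple regular-expression tokenizer:

```python
# Illustrative sketch: build a word co-occurrence network from tweets.
# Assumptions (not from the paper): adjacent-pair edging, frequency weights,
# lower-cased regex tokenization.
import re
import networkx as nx

def build_cooccurrence_graph(tweets, directed=False):
    """Nodes are words; an edge links two adjacent words, weighted by
    how often that pair co-occurs across the tweets."""
    G = nx.DiGraph() if directed else nx.Graph()
    for tweet in tweets:
        tokens = re.findall(r"[a-z#@']+", tweet.lower())
        for w1, w2 in zip(tokens, tokens[1:]):
            if G.has_edge(w1, w2):
                G[w1][w2]["weight"] += 1
            else:
                G.add_edge(w1, w2, weight=1)
    return G

tweets = ["Amy Winehouse found dead at her London home",
          "singer Amy Winehouse dies aged 27"]
G = build_cooccurrence_graph(tweets)
```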

Word co-occurrence networks have been generated using the word co-occurrence architecture. Different basic and derived keyword extraction metrics have been studied for textual networks and evaluated after implementation on Twitter feeds of the standard FSD data set. Based on the scores of the different metrics, the top ten keywords (words with the highest values) and the bottom ten keywords (words with the lowest values) have been collected. The significance of a word's value varies from one metric to another. Out of the 18 metrics considered, one metric is non-beneficiary (the lower the value, the higher the significance), fourteen metrics are beneficiary (the higher the value, the higher the significance), and three metrics have been observed to be non-significant; the metrics may have overlapping significance. For each topic, the corresponding results have been obtained, as shown in Table 1 for the first topic, ‘Death of Amy Winehouse’. Italic font indicates that the results obtained are meaningful. The topic is given as ground truth in the FSD data set, and each word of the topic is treated as a keyword.

Table 1 Data set considered for keyword extraction
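The sketch below, again illustrative and using NetworkX function names rather than the authors' exact metric set, shows how several of these node-level metrics can be computed on the graph G built above and how the top ten and bottom ten keywords under each metric can be collected:

```python
# Illustrative sketch: score words under several node-level metrics and
# collect the top-k and bottom-k keywords per metric. Metric names follow
# NetworkX conventions, not the paper's metric list.
import networkx as nx

def rank_keywords(G, k=10):
    hubs, authorities = nx.hits(G)                      # HITS algorithm
    # eccentricity needs a connected graph; use the largest component
    giant = G.subgraph(max(nx.connected_components(G), key=len))
    metrics = {
        "degree": dict(G.degree()),
        "strength": dict(G.degree(weight="weight")),    # weighted degree
        "closeness": nx.closeness_centrality(G),
        "betweenness": nx.betweenness_centrality(G),
        "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        "clustering": nx.clustering(G),
        "eccentricity": nx.eccentricity(giant),
        "hits_authority": authorities,
    }
    ranked = {}
    for name, scores in metrics.items():
        ordered = sorted(scores, key=scores.get, reverse=True)
        ranked[name] = {"top": ordered[:k], "bottom": ordered[-k:]}
    return ranked
```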

The reported recall values have been obtained by automatic word-to-word matching. Due to the presence of ill-formed words, the precision obtained this way is low, whereas manual inspection of the results gives meaningful matches and better precision. For instance, for the topic ‘Death of Amy Winehouse’, the degree measure may match only two of the three topic words, but ‘dead’ and ‘died’ clearly indicate ‘death’, so precision could be recorded as one; nevertheless, to keep the results unbiased, automatic evaluation has been preferred. Topic 1 has only three words as keywords, which may lead to a poor analysis of precision and recall, so five topics were selected for the experiments on the basis of the number of keywords obtained from each topic; for a better examination of precision and recall, topics having about ten keywords were selected. Precision, recall and F-measure have been obtained for each experiment and averaged, as shown in Table 2.

Table 2 Performance measures obtained for topics using different keyword extraction metrics
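A hedged sketch of the automatic word-to-word evaluation described above (our reading of the procedure, not the authors' script) is given below; because matching is exact, ‘dead’ or ‘died’ is not credited for ‘death’, which is precisely the effect discussed for topic 1:

```python
# Illustrative sketch: precision, recall and F-measure by exact word matching
# between extracted keywords and the words of the ground-truth topic.
def precision_recall_f1(extracted, topic):
    extracted = {w.lower() for w in extracted}
    truth = {w.lower() for w in topic.split()}
    matched = extracted & truth
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 'dead' does not match 'death' under exact matching.
p, r, f = precision_recall_f1(["amy", "winehouse", "dead", "london"],
                              "Death of Amy Winehouse")
```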

4 Discussion

Different performance measures for extracting keywords have been analysed on the basis of the experimental evaluation. It has been observed that the strength measure and the HITS algorithm outperform all the existing techniques on short text, as shown in Fig. 1. DegExt, TKG, LexRank, degree, closeness centrality, eigenvector centrality, betweenness centrality, clustering coefficient and the influence measure also give comparably significant results. However, eccentricity and SBKE are the least significant measures for keyword extraction from uncertain user-generated text. Among all the basic and derived metrics, strength and HITS have proved most useful in terms of F-measure, as observed in Table 2. When the F-measure for low-valued elements is considered, it is observed that for most metrics the higher the value, the better the word is as a keyword, and vice versa. However, small values of the clustering coefficient for each node give better keyword extraction, which shows that the clustering coefficient is a non-beneficiary attribute. Moreover, Tf-Idf, eccentricity and selectivity-based keyword extraction have proved to be not very significant for keyword extraction from Twitter. Inferences for the different keyword extraction metrics are shown in Table 3.

Table 3 Inference for different keyword extraction metrics for textual networks
Fig. 1 Graphical representation of performance measures for bottom ten keywords using different keyword extraction metrics

The semantics of textual networks has also been observed. Different features can be used as important statistical measures for identifying relevant and influential terms. As per observation, the degree measure and the DegExt experiment signify similar semantics and thus overlap. Metrics with high precision and recall for top-ranked keywords and low precision and recall for low-ranked keywords have been found to be better than other metrics; the clustering coefficient, on the contrary, shows the opposite behaviour. Although the KeyWorld metric marks significantly irrelevant terms, it is computationally expensive and therefore not suitable for large-scale data analysis.

As observed in Table 4, metrics on non-weighted textual graphs give meaningful results. However, strength is a significant measure for textual networks, and thus edge-based metrics (co-occurrence frequency-based metrics) on weighted textual networks may mark influential words. Depending on the need, adjacent pair and all pair neighbouring models have been used: the majority of value- and centrality-based metrics perform better with the all pair neighbouring model, whereas neighbourhood- and vote-based metrics perform better with the adjacent neighbour model. The inference for this parameter is that relating every word to every other word in a document signifies how important the word is for the network, whereas an adjacent word pair signifies what impact an influential (important) word has on its neighbour. Similarly, an undirected graph is used for neighbourhood-based metrics, while a directed graph is used to measure incoming and outgoing links to other words; a directed textual network may also give better lexical output, since it preserves the sequence of words. Based on this inference, a metric can be developed that identifies the dominant phrase in a textual network, which may provide more meaningful results than keywords alone.

Table 4 Inference for different keyword extraction metrics for textual networks
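To make the neighbouring models concrete, the sketch below contrasts adjacent-pair (nearest neighbour) and all-pair (all neighbour) edging on a single tweet; the function names and the directed/undirected choices are ours, for illustration only:

```python
# Illustrative sketch of the two edging models discussed above.
import itertools
import networkx as nx

def adjacent_pair_graph(tokens, directed=True):
    """Edges follow the lexical sequence, so word order is preserved."""
    G = nx.DiGraph() if directed else nx.Graph()
    G.add_edges_from(zip(tokens, tokens[1:]))
    return G

def all_pair_graph(tokens):
    """Every pair of distinct words in the tweet is connected."""
    G = nx.Graph()
    G.add_edges_from(itertools.combinations(set(tokens), 2))
    return G

tokens = "richard bowes dies in london riots".split()
G_adj = adjacent_pair_graph(tokens)   # suits neighbourhood/vote-based metrics
G_all = all_pair_graph(tokens)        # suits value/centrality-based metrics
```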

For each topic, a set of different keywords has been obtained, and the unique words from this set are taken as ground truth for analysing the relevant Twitter feeds. For selected metrics, the values have been obtained for each keyword of topic 4, and a normalized plot of these values is shown in Fig. 2. It can be clearly observed that the word ‘victim’ does not appear in the text. The words ‘Richard’ and ‘Bowes’ co-occur and have the same values for most of the metrics, and thus ‘Richard Bowes’ may represent a candidate keyword or named entity. The last word, ‘Hospital’, has a high clustering coefficient and strength, although its other values are low.

Fig. 2 Normalized metric values for differently selected keyword extraction metrics
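The normalization behind such a plot could be a simple per-metric min-max scaling, as in the sketch below; the exact normalization used for Fig. 2 is not specified, so this is an assumption:

```python
# Illustrative sketch: min-max scale a {word: value} mapping so that values
# from different metrics can be compared on one 0-1 plot.
def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0                 # avoid division by zero
    return {word: (value - lo) / span for word, value in scores.items()}
```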

A zero value of betweenness centrality indicates that the word has either no in-degree or no out-degree. It can also be observed that the values along the links from one word to the next, for instance ‘London’ to ‘Riots’ and ‘Riots’ to ‘dies’, drop and rise over a wide range; thus edge weights, as captured by strength, play a pivotal role in identifying co-occurring important words that represent influential nodes in textual networks. As observed from Fig. 2, all measures except strength and clustering coefficient give the highest values to ‘Richard Bowes’. The lower value of strength here reflects that strength counts the number of occurrences at each node, so the most frequently occurring words, for instance ‘very’, are given higher values even though they are not significant.

5 Conclusion and Future Work

User-generated data on social media has been analysed using textual networks. To evaluate the performance of basic and derived metrics for textual networks, we have used Twitter data. In this analysis, unsupervised, context-independent, graph-based keyword extraction techniques have been implemented and discussed. It is observed that out of 17 identified network-based metrics, 14 proved to be beneficiary and one to be a non-beneficiary attribute, while two metrics fluctuate with a change in data set. Selected metrics have been analysed, and the semantics of textual networks has been discussed. It is observed that, in order to maintain the lexical sequence of occurrence of words, a directed, adjacent-pair, weighted graph should be constructed. The majority of the network metrics outperform the traditional Tf-Idf statistical method. Using this analysis, many useful insights can be obtained for different real-world applications, including text mining, topic tracking and detection, event detection and opinion mining. In future, edge-based network metrics can be studied for word co-occurrence directed graphs to identify co-occurring keyphrases.