
1 Introduction

Our need to quickly sift through large amounts of textual information is growing daily at an overwhelming rate. This task becomes easier if we have a subset of words (keywords) that conveys the main features, concepts, and themes of a document. Keyword extraction is also an important task in the field of text mining. Keywords can be extracted either manually or automatically, but the former approach is very time-consuming and expensive. Automatic approaches to keyword extraction can be language-dependent or statistical, the latter being language-independent. Most of the work in this area has been done on English. Other languages are underrepresented owing to the scarcity of domain-specific resources and the non-availability of data, especially labeled data. With the growing availability of large amounts of textual data in different languages, there is now a need to develop keyword extraction techniques for these languages as well.

Keyword extraction can be carried out with many approaches, such as supervised and unsupervised machine learning, statistical methods, and linguistic ones. Statistical methods for extracting keywords from documents have certain advantages over linguistic approaches; in particular, the same method can be applied to many different languages without developing a new set of rules for each language. However, most statistical approaches rely on corpus statistics. An obvious disadvantage of such methods is that they cannot be applied when no corpus is available. Corpus-oriented keyword extraction methods have other disadvantages as well: keywords that occur in many documents of the corpus are less likely to be statistically discriminating, and most corpus-oriented approaches work on single words, which may be used in different contexts with different meanings. As the availability of a corpus is a severe limitation in resource-poor languages, it is important to focus on document-based approaches that can extract keywords from a single document. Such methods are especially suitable for live corpora such as news and technical abstracts. There is therefore a need for approaches that are unsupervised, domain-independent, and corpus-independent.

Some work has been done on extracting keywords using statistical measures. One of the most frequently used measures of the importance of a particular word in a document is its frequency in the document [1]. The intuition behind this measure is that the more important a word is to a document, the more often it should occur in it. However, most of the highest-frequency words in a document are stopwords (the, and, of, it, etc.). It was proposed in [2] to select intermediate-frequency words as keywords and to discard the high-frequency (stopword) and low-frequency words. The tf-idf measure [3] gives much better results than a raw frequency score, but it requires a corpus and cannot be used on a single document. Shannon's entropy [4] has been used to extract keywords from literary text by comparison with randomly shuffled text. The spatial distribution [5] of words in a document has also been used to estimate the importance of a word.

In this paper, we present a statistical approach for keyword extraction from single documents based on frequency and next nearest neighbor analysis of the words in the document, together with an algorithm implementing it. The algorithm has been tested on the well-known Hindi novel “Godan” by Munshi Premchand.

2 Framework

This paper is based on the observation that both the frequency and the spatial distribution of a word play an important role in evaluating its importance. We have therefore developed an integrated approach that combines the information contained in the frequency and the spatial distribution of a word in order to extract keywords from a document. The approach has a sound theoretical basis as well as empirical support, and it works better than an approach based on frequency alone or on spatial distribution alone.

It has been observed that in a long text, words in the middle frequency range are the most informative: very high-frequency words are stopwords, whereas very low-frequency words are not important. We therefore start by removing very low-frequency words.

Our spatial distribution-based approach rests on the observation that the occurrence pattern of important words (keywords) should differ from that of non-important words. For a non-important word, the distribution of occurrences should be random across the text, with no significant clustering, whereas for a keyword the distribution should show some degree of clustering, because a keyword is expected to be repeated more often in specific contexts or portions of the text.

To estimate the degree of clustering or randomness of different words in the text, we can use the standard deviation, which measures the amount of dispersion in a data series. One such data series is the sequence of successive differences between the positions at which a word occurs in the text. In other words, if a word W occurs N times in the document at positions \( X_1, X_2, X_3, \ldots, X_N \), then the successive differences are \( (X_2 - X_1), (X_3 - X_2), \ldots, (X_N - X_{N-1}) \). Representing this difference series for word W as \( S_W \), we have

$$ S_W = \left\{ (X_2 - X_1),\; (X_3 - X_2),\; (X_4 - X_3),\; \ldots,\; (X_N - X_{N-1}) \right\} $$

For a word W with frequency N in the text, the series \( S_W \) has N − 1 elements.

To eliminate the effect of frequency on the standard deviation analysis of different words, it is convenient to normalize the standard deviation \( \sigma_W \) of the series \( S_W \) by its mean \( \mu_W \), giving the normalized standard deviation \( \hat{\sigma}_W = \sigma_W / \mu_W \). Higher values of \( \hat{\sigma}_W \) generally indicate more pronounced clustering, or less random behavior, which is what we expect for important words.
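As a small illustration with hypothetical positions (and using the population standard deviation, since the exact convention is not fixed here): suppose a word W occurs at positions 3, 5, 21, 24, and 40. Then

$$ S_W = \{2,\ 16,\ 3,\ 16\}, \qquad \mu_W = 9.25, \qquad \sigma_W \approx 6.76, \qquad \hat{\sigma}_W = \frac{\sigma_W}{\mu_W} \approx 0.73 $$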

Figures 1 and 2 show two words from the document (Godan) with similar frequencies; the first is a keyword and the second is a stopword. Clustering is evident in Fig. 1, whereas in Fig. 2 the distribution is random throughout and no significant clustering is observed. Thus, spatial distribution can be used to differentiate between a keyword and a non-keyword.

Fig. 1

Spatial distribution of the keyword “होरी” with frequency 623 in the document

Fig. 2

Spatial distribution of the stopword “गया” with frequency 677 in the document

2.1 Algorithm

1. Generate the list of unique words in the document, which forms the vocabulary.

2. Calculate the frequency (f) of each unique word in the document.

3. Eliminate low-frequency words.

4. Generate the next nearest neighbor distance series \( S_W \) for each remaining unique word.

5. Calculate the mean \( \mu_W \) and standard deviation \( \sigma_W \) of the series \( S_W \) for each word.

6. Normalize \( \sigma_W \) by \( \mu_W \) to obtain \( \hat{\sigma}_W \).

7. Rank the resulting word list in order of decreasing normalized standard deviation \( \hat{\sigma}_W \).

8. Select the words with the highest normalized standard deviation as keywords.
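The following minimal Python sketch illustrates these steps under some assumptions not fixed by the paper: tokens are obtained by simple whitespace splitting, the frequency cutoff is taken as 10 (as in the experiment below), and the population standard deviation is used.

```python
import re
from statistics import mean, pstdev

def rank_keywords(text, min_freq=10):
    """Rank the words of a document by the normalized standard deviation
    of their next nearest neighbor (gap) series."""
    # Step 1: tokenize (simple whitespace splitting, an assumption) and
    # record the position of every occurrence of every unique word.
    words = re.findall(r"\S+", text)
    positions = {}
    for i, w in enumerate(words):
        positions.setdefault(w, []).append(i)

    scores = {}
    for w, pos in positions.items():
        # Steps 2-3: drop low-frequency words (cutoff of 10 as in the experiment).
        if len(pos) <= min_freq:
            continue
        # Step 4: successive differences between occurrence positions (series S_W).
        gaps = [b - a for a, b in zip(pos, pos[1:])]
        # Steps 5-6: normalized standard deviation sigma_W / mu_W.
        scores[w] = pstdev(gaps) / mean(gaps)

    # Steps 7-8: rank by decreasing normalized standard deviation;
    # the top-ranked words are taken as keywords.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `rank_keywords(open("godan.txt", encoding="utf-8").read())[:10]` would return the ten highest-scoring words; the file name here is, of course, hypothetical.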

3 Experiment

We performed our experiment on the Hindi novel “Godan” by Premchand. Some statistics of the document are as follows:

Total number of words in document = 167,707.

Total number of unique words in the document = 11,160.

Number of words with frequency greater than 10 = 1,565.

Words with frequency less than or equal to 10 were removed from consideration.

The proposed algorithm was implemented, and the result is shown as a precision curve for the top 40 ranked words in Fig. 3. It can easily be seen that the top 10 ranked words are all keywords of “Godan.”
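A sketch of how such a curve can be computed, assuming a hypothetical set gold_keywords of manually judged keywords and the ranked list produced by rank_keywords above:

```python
def keywords_in_top_n(ranked_words, gold_keywords, top_n=40):
    """Number of true keywords found among the top n ranked words, for n = 1..top_n."""
    curve, hits = [], 0
    for n, (word, _score) in enumerate(ranked_words[:top_n], start=1):
        if word in gold_keywords:
            hits += 1
        curve.append((n, hits))  # one point (n, keywords found in top n) of the curve
    return curve
```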

Fig. 3

Precision curve for top 40 words ranked on decreasing standard deviation. X-axis represents the ranks of words. The curve shows the number of keywords found in top N words ranked on \( {\mathbf{\hat{\sigma }}} \)

The result can be better understood with the help of Tables 1 and 2 and Fig. 4. Table 1 lists the top ranked words by \( \hat{\sigma} \). Although many of these words differ widely in frequency, their \( \hat{\sigma} \) values are much closer. Table 2 lists some stopwords from the document. Although these words have considerably higher frequencies than the important words, their \( \hat{\sigma} \) values are lower. Thus, \( \hat{\sigma} \) is better at differentiating between important and non-important words. Figure 4 shows the distribution of the words of “Godan” over frequency and \( \hat{\sigma} \), with important and non-important words of frequency greater than 10 labeled separately.

Fig. 4

Plot of frequency versus normalized standard deviation of the words in the document. Frequency is plotted on a logarithmic scale, whereas the normalized standard deviation is plotted on a linear scale. It can be seen that for \( \hat{\sigma} > 3 \), most of the words are important words

Table 1 Top ranked words from the document and their standard deviation scores
Table 2 Some stopwords from the document and their standard deviation scores

4 Conclusion

In this paper, we have proposed a novel document-based statistical approach to extract important words from a Hindi literary document. Our approach combines information from both the frequency and the spatial distribution of words in the document. The standard deviation of the next nearest neighbor distances of a word, normalized by their mean, was used as the discriminating factor, and it was found that higher values of this normalized standard deviation generally correspond to important words. Considering that our approach is unsupervised, domain-independent, and corpus-independent, the results are quite encouraging. It was also observed that most of the extracted important words were named entities (NEs). A further research direction is therefore the application of statistical methods to extract important named entities from literary texts.