1 Introduction

The tremendous growth of text data, driven by the heavy use of electronic devices and internet technologies, necessitates efficient techniques and tools (such as text mining) that automatically arrange text documents into known classes (Joachims 1996). In a multi-class text classification setting, the classifier predicts the class label of new documents based on a trained model. Real-world applications include sentiment classification, spam filtering, classification of PubMed articles, organization of web contents into topical hierarchies, and news filtering. In this paper, the text documents are classified by two standard classifiers, Linear Support Vector Machines (LSVM) (Joachims 1996) and Multinomial Naive Bayes (MNB) (Manning et al. 2008; Sebastiani 2002; Joachims 1998). MNB is known as the fastest probabilistic generative learning model; however, its accuracy is relatively modest. LSVM is based on the elegant foundations of statistical learning theory. The training time of SVM is higher than that of MNB, but its classification results are more accurate. The strength of a classifier depends upon the contents of the documents. The contents are grammatical sequences of words organized in the form of sentences. A word (also called a term) is the smallest constituent of the text contents and plays a vital role in text classification.

The text classification process uses the following steps to select the most relevant words as features: (1) feature extraction from the corpus (i.e. generation of tokens from text contents), (2) elimination of less informative words (i.e. stop words, punctuation marks, numbers, links, white spaces, etc.), and (3) lemmatization or stemming of the words (Agnihotri et al. 2017). The resultant words build a vocabulary of the corpus that helps in the construction of a vector space. The frequency of each word present in the documents is represented as a vector (Yang & Pedersen 1997). The Term Frequency-Inverse Document Frequency (TF-IDF) scaling technique is used to normalize the frequencies of the words (Mladenic & Grobelnik 1999). The collection of word vectors in matrix form is called a vector space, in which each individual word constitutes one dimension. A typical document collection may contain millions of words; thus, the text classification process requires a very large dimension to fit in a limited memory space, which makes the process cumbersome (Kevin & Moshe 2013). Feature selection techniques eliminate the less informative features and select a reduced subset of features for training. This increases the performance as well as the speed of the classifier, and is considered an important step in the classification process (Lamirel et al. 2015; Joachims 1998).
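The three steps above can be sketched as a small pipeline. The snippet below is a minimal illustration, not the paper's implementation: the stop-word list is a tiny hypothetical sample (the paper removes the full NLTK list), and the plural-stripping rule is a crude stand-in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical miniature stop-word list; the paper uses the full NLTK list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "by", "in", "to"}

def preprocess(text):
    # (1) Feature extraction: tokenize the raw text into lowercase words
    #     (numbers, links, and punctuation are dropped by the regex).
    tokens = re.findall(r"[a-z]+", text.lower())
    # (2) Eliminate less informative words (stop words).
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # (3) Crude plural-stripping as a stand-in for stemming/lemmatization.
    return [t[:-1] if t.endswith("s") else t for t in tokens]

print(preprocess("Viral diseases are caused by viruses."))
# ['viral', 'disease', 'caused', 'viruse']
```

The surviving tokens then form the vocabulary from which the vector space is built.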

The representative words are of either a positive or a negative nature due to their correlative association with many classes. This correlative association creates uncertainty in determining the most representative words. The uneven distribution of terms across the documents of different classes gives rise to the concept of correlative association. A word that is present in the jth class (say \(C_j\)) while absent in all other classes (say \(\bar{C}_j\)) is of a positive nature for class \(C_j\). Conversely, a word that is absent in class \(C_j\) but present in all other classes \(\bar{C}_j\) is of a negative nature for class \(C_j\). Words of a negative nature are also important for determining the class labels of documents: the presence of a negative-nature word in a document assures that the document does not belong to the class for which the word is negative (Uysal & Kursat 2016). Common words are distributed equally among all the classes, whereas rare words occur in most of the documents of a specific class. Sparse words occur infrequently in the documents of a class, and their presence or absence is not important for deciding the class label of a document (Agnihotri et al. 2016).

The standard methods do not consider the positive and negative nature of the words while determining their importance, which degrades the performance of the classifiers. To address this issue, this paper presents an information-theory-based feature selection method named Correlative Association Score (CAS). It combines the strength, mutual information, and strong association of the words to determine their positive and negative nature. CAS selects a few (k) informative words from the set of all words (m). These informative words generate a set of N-grams of length 1–3. Finally, the standard Apriori algorithm ensembles the power of CAS and CHI to select the top b informative N-grams, where b is a number set by empirical evaluation. To evaluate the performance of the selected top b N-grams, the MNB and LSVM classifier models classify four standard text data sets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. A significant improvement in the performance of the classifiers, measured by the two standard measures Macro_F1 and Micro_F1, proves the effectiveness of the proposed CAS method. The processing flow of the proposed work is shown in Fig. 1 (steps (1)–(13)); the main contribution of the paper concerns steps (2)–(8).

Fig. 1
figure 1

Processing flow of the proposed work

The rest of the paper comprises six sections: the preliminary concepts and related works are discussed in Sections 2 and 3 respectively; Section 4 describes the proposed work; Section 5 explains the experimental setup, datasets, and performance evaluation metrics; Section 6 illustrates the results and discussion; and Section 7 concludes the paper.

2 Preliminary concept

The preliminary concepts related to this paper, i.e. word representation, normalization, feature selection, and text document classification are described in this section.

2.1 Word representation

The representation of words in the form of vectors is the basis for determining the computational informativeness of the words and plays a vital role in the automatic classification of text documents. The most common models to represent words as vectors are the Bag Of Words (BOW) and the N-grams Language (NGL) models. In the BOW model, the frequency of each word in the documents of the corpus represents a vector; the order of word occurrence is not important. N-grams are combinations of 2–4 words that co-occur in the documents. In the NGL model, the set of N-grams represents a vector space. Consider two documents D1 and D2, where D1 represents the “viral disease” category and D2 the “bacterial disease” category. The contents of D1 and D2 are as follows:

  1.

    D1: “Viral diseases are extremely widespread infections caused by viruses, a type of microorganism. There are many types of viruses that cause a wide variety of viral diseases. The most common type of viral disease is the common cold, which is caused by a viral infection of the upper respiratory tract (nose and throat)”.

  2.

    D2: “Bacterial diseases include any type of illness caused by bacteria. Bacterial diseases occur when pathogenic bacteria get into the body and begin to reproduce and crowd out healthy bacteria, or to grow in tissues that are normally sterile. Bacterial diseases are contagious and can result in many serious or life-threatening complications, such as blood poisoning (bacteremia), kidney failure, and toxic shock syndrome”.

In documents D1 and D2, the frequency of the word “disease” is higher than that of other words (e.g. “bacterial” or “viral”), so it looks more informative under BOW. In the NGL model, the combinations “bacterial disease” and “viral disease” look more informative than the individual words “disease”, “bacterial”, or “viral”. Thus, the order of word occurrence is maintained in the NGL model, which improves the quality of the word representation. In this paper, BOW is used at the initial level to represent the words using CAS, while NGL is used at the second level of filtration using CHI.
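The difference between the two representations can be seen on a toy version of D1 and D2. The token lists below are made up for illustration; the sketch shows how bigram counting preserves the word order that BOW discards.

```python
from collections import Counter

def ngrams(tokens, n):
    # Contiguous n-grams, preserving word order.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

doc1 = "viral disease common cold viral infection".split()
doc2 = "bacterial disease pathogenic bacteria bacterial disease".split()

# BOW: order-free unigram counts -- "disease" dominates in both documents.
bow = Counter(doc1) + Counter(doc2)

# NGL: bigrams keep the order, separating "viral disease" from "bacterial disease".
ngl = Counter(ngrams(doc1, 2)) + Counter(ngrams(doc2, 2))

print(bow["disease"])            # 3
print(ngl["viral disease"])      # 1
print(ngl["bacterial disease"])  # 2
```

Under BOW the word "disease" looks most informative, while the bigram counts separate the two categories cleanly.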

2.2 Normalization

Normalization is a technique of scaling the data into a fixed range. The authors Dewang & Singh (2017), Agnihotri et al. (2014), and Sebastiani (2002) addressed problems like keyword spamming, scaling up of frequent words, and scaling down of rare words. Keyword spamming occurs when a word appears repeatedly in a document with the purpose of improving its ranking in an information retrieval system; unnormalized frequencies also create a bias towards longer documents. The word frequencies of a vector space are usually normalized using the Term Frequency-Inverse Document Frequency (TF-IDF) method to overcome these problems (see (1)) (Agnihotri et al. 2014; Manning et al. 2008; Sebastiani 2002).

$$ W_{i,j}= tf_{i,j}*\log\frac{N_{d}}{df_{i}} $$
(1)

Where \(W_{i,j}\) = weight of word \(t_i\) in document \(d_j\), \(tf_{i,j}\) = frequency of word \(t_i\) in document \(d_j\), \(df_i\) = document frequency of the \(i\)th word in the corpus, and \(N_d\) = total number of documents in the corpus.
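Equation (1) can be computed directly. In the sketch below the logarithm base is an assumption (the natural log, as in scikit-learn's default); the equation itself does not fix it.

```python
import math

def tfidf_weight(tf_ij, df_i, n_docs):
    # W_{i,j} = tf_{i,j} * log(N_d / df_i), following Eq. (1).
    # The base of the logarithm is assumed to be e.
    return tf_ij * math.log(n_docs / df_i)

# A word present in every document gets weight 0 no matter how frequent,
# while a rarer word is scaled up.
print(tfidf_weight(5, 100, 100))           # 0.0
print(round(tfidf_weight(3, 10, 100), 2))  # 6.91
```

This is the scaling-down of common words and scaling-up of rare words described above.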

2.3 Feature selection

Feature selection improves the performance and accelerates the training speed of the classifiers by reducing a huge feature space into a smaller subset. Let p be the total number of words in the corpus and n the total number of documents. The text contents of the entire corpus D (as discussed in Section 2.4) are extracted as p tokens and kept in a set \(t = [t_1, t_2, \ldots, t_p]\), where p > 0 and each word contains some information to discriminate the class label of the documents. The selection of the words \(t_q \in t\) that contain the maximum information to discriminate a class label, and thereby help in the correct classification of the documents, is known as feature selection (Agnihotri et al. 2016).

2.4 Text documents classification

In text document classification, the document set \(D = [d_1, d_2, \ldots, d_n]\), belonging to r classes \(C = [C_1, C_2, \ldots, C_r]\), is divided into two subsets: a training set (\(D_{train}\)) and a test set (\(D_{test}\)). The objective of classification is to build a model, based on the known class labels of the training set documents, that can predict the class labels of the test documents with maximum accuracy (Manning et al. 2008; Sebastiani 2002; Joachims 1998).

3 Related works

In the literature, many researchers have contributed significantly to the area of feature selection. The core contribution of this paper is compared with four state-of-the-art methods, viz. MI, IG, DFS, and CHI. A brief description of these methods is given in this section.

The Mutual Information (MI) concept (Manning et al. 2008; Joachims 1998; Yang & Pedersen 1997) is derived from information theory, where it measures the dependence between random variables; here it is used to measure the information contained by a word \(t_i \in t\). If the word \(t_i\) possesses higher mutual information with the class \(C_j\), it contains more information about \(C_j\). MI computes the dependence between the word \(t_i\) and the class \(C_j\) using (2), and the MI weight of the word \(t_i\) is computed using (3). The preliminary notations used in this study are defined in Table 1.

$$ MI(t_{i},C_{j}) = \log (\frac{p(t_{i} , C_{j})}{p(t_{i}) \times p(C_{j})}) \approx \log \frac{a \times N}{(a+c) \times (a+b)} $$
(2)
$$ MI(t_{i}) = \max_{j=1}^{j=r} MI(t_{i},C_{j}) $$
(3)
Table 1 Preliminary notations
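Equations (2) and (3) can be sketched directly from the contingency counts a, b, c of Table 1. The epsilon smoothing and the example counts below are assumptions for illustration; the paper does not specify how zero counts are handled.

```python
import math

def mi(a, b, c, n_docs):
    # MI(t_i, C_j) ~= log( a*N / ((a+c)*(a+b)) ), following Eq. (2);
    # a small epsilon (an assumption) guards against zero counts.
    eps = 1e-9
    return math.log((a * n_docs + eps) / ((a + c) * (a + b) + eps))

def mi_score(contingency, n_docs):
    # MI(t_i) = max over classes of MI(t_i, C_j), following Eq. (3).
    return max(mi(a, b, c, n_docs) for (a, b, c) in contingency)

# Hypothetical counts (a, b, c) of one word in each of three classes:
# the word concentrates in the first class, so that class gives the max.
word_counts = [(8, 1, 2), (1, 8, 9), (0, 9, 10)]
print(round(mi_score(word_counts, 30), 3))  # 0.981
```

The maximum over classes rewards words concentrated in one class, which is exactly why MI favours rare positive words.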

The Information Gain (IG) measures the reduction in entropy of the class variable when the words are separated into different classes. The IG score of a word \(t_i\), given in (4), aggregates the contribution of \(t_i\) over the classes \(C_j\) (Uysal & Gunal 2012; Forman 2003; Yang & Pedersen 1997; Lewis & Ringuette 1994).

$$\begin{array}{@{}rcl@{}} IG(t_{i}) &=& p(t_{i})\times \sum\limits_{j=1}^{j=r} p(C_{j}|t_{i}) \times \log p(C_{j}|t_{i}) \\ &&+\ p(\bar{t}_{i})\times \sum\limits_{j=1}^{j=r} p(C_{j}|\bar{t_{i}}) \times \log p(C_{j}|\bar{t}_{i}) - \sum\limits_{j=1}^{j=r} p(C_{j})\times \log{p(C_{j})} \end{array} $$
(4)

Uysal & Gunal (2012) defined the Distinguishing Feature Selector (DFS) method to compute the weight of a word t i for a class C j shown in (5).

$$ DFS(t_{i}) = \sum\limits_{j=1}^{j=r} \frac{p(C_{j} |t_{i})}{p(\bar{t_{i}}|C_{j})+p(t_{i}|\bar{C_{j}})+1} $$
(5)

Mathematically, Chi-square testing is used to determine the independence of the word \(t_i\) and the class \(C_j\) during feature selection. If CHI\((t_i, C_j) = 0\), the word \(t_i\) and the class \(C_j\) are independent, and thus \(t_i\) contains no category information. Conversely, the higher the value of CHI\((t_i, C_j)\), the more information \(t_i\) contains to represent the class \(C_j\). The per-class contributions are combined in (6) to compute the overall CHI score of the word \(t_i\) (Manning et al. 2008; Yang & Pedersen 1997).

$$ CHI(t_{i})= {\sum}_{j=1}^{j=r} p(C_{j})\times \frac{N\times(a\times d - b \times c)^{2}}{(a+c)\times(a+b)\times(c+d)\times(b+d)} $$
(6)
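Equation (6) is straightforward to compute from the counts a, b, c, d of Table 1. The counts and class priors below are hypothetical, for a two-class corpus of 20 documents.

```python
def chi_term(a, b, c, d, n_docs):
    # Per-class chi-square statistic from Eq. (6):
    # N * (a*d - b*c)^2 / ((a+c)(a+b)(c+d)(b+d))
    den = (a + c) * (a + b) * (c + d) * (b + d)
    return n_docs * (a * d - b * c) ** 2 / den if den else 0.0

def chi_score(contingency, priors, n_docs):
    # CHI(t_i) = sum_j p(C_j) * chi_term(t_i, C_j), following Eq. (6).
    return sum(p * chi_term(a, b, c, d, n_docs)
               for p, (a, b, c, d) in zip(priors, contingency))

# Hypothetical counts (a, b, c, d) for one word over two balanced classes.
score = chi_score([(8, 2, 2, 8), (2, 8, 8, 2)], [0.5, 0.5], 20)
print(score)  # 7.2
```

A score well above zero, as here, indicates that the word and the class labels are far from independent.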

4 Proposed work

Substantial work has been carried out in the area of feature selection to improve the prediction capability of classifiers. The standard methods, viz. Mutual Information (MI), Information Gain (IG), Distinguishing Feature Selector (DFS), and Chi Square (CHI), do not consider the positive and negative nature of the words, which affects the performance of the classifiers. To address this issue, a new feature selection method named Correlative Association Score (CAS) is proposed. CAS combines the strength, mutual information, and strong association of the words to determine their positive and negative nature for a class. The weight assignment process of CAS is shown in Algorithm 1 and is summarized as follows:

  1.

    In this study, the NGL model is used to represent the words as a set \(NG[b]\) of b N-grams of length 1–3, generated using the join and prune steps of the Apriori algorithm.

    (a)

      The join step: Suppose \(L_{k-1} = \{t_1, t_2, \ldots, t_m\}\) is the set of uni-grams and \(L_k = \{t_1 t_2, \ldots, t_{m-1} t_m\}\) is the set of bi-grams, i.e. pairs \((t_{m-1}, t_m)\) where \(t_{m-1}, t_m \in L_{k-1}\). Similarly, \(L_{k+1} = \{t_1 t_2 t_3, \ldots, t_{m-2} t_{m-1} t_m\}\) is the set of tri-grams. Finally, the set \(NG[g]\) is generated by taking the union \(L_{k+1} \bigcup L_{k} \bigcup L_{k-1}\).

    (b)

      The prune step: This step eliminates less informative words, initially from the set \(L_k\) and subsequently from the set \(NG[g]\), using an empirically determined threshold value.

  2.

    The CAS extracts the set \(L_{k-1}\) of the m most discriminating words from the set of all words using a threshold value. It computes the weight \(W_{CAS}\) of each word \(t_i\) as follows:

    (a)

      The CAS computes a unique weight for each word on the basis of three criteria: the first computes the weight \(W_{1j}\) that measures the strength of the ith word \(t_i\) for the jth class \(C_j\) (i.e. \(W_1(t_i, C_j)\)); the second computes the weight \(W_{2j}\) that measures the likelihood of class \(C_j\) when the word \(t_i\) is present (i.e. \(W_2(C_j|t_i)\)); and the third computes the weight \(W_{3j}\) that measures the association of \(t_i\) with class \(C_j\) (i.e. \(W_3(t_i, C_j)\)). The resultant weight \(W_{CAS}\) of the word \(t_i\) is computed as,

      $$ \textbf W_{CAS} =\log \left( \sum\limits_{j=1}^{j=r}(W_{1j}+W_{2j})^{3}\right)+\sum\limits_{j=1}^{j=r}\left( W_{3j}^{4}\right) $$
      (7)

      Where,

      $$ W_{1}(t_{i},C_{j}) = \frac{maxf(t_{i},C_{j})}{\epsilon+avgf(t_{i},C_{j})} + \log_{2}\left[\epsilon+\frac{\epsilon+p(t_{i} | C_{j}) \times (1- p(t_{i} | \bar{C_{j}}))}{\epsilon+p(t_{i} | \bar{C_{j}}) \times (1- p(t_{i} |C_{j}))}\right] $$
      (8)
      $$ W_{2}(C_{j}|t_{i}) = a \times \log{\frac{p(t_{i},C_{j})}{p(t_{i})\times p(C_{j})}} + b \times \log{\frac{p(t_{i},\bar{C}_{j})}{p(t_{i})\times p(\bar{C}_{j})}} $$
      (9)
      $$ W_{3}(t_{i},C_{j})= \left\lvert\frac{a}{a+c+df_{[t_{i},j]}}-\frac{c}{a + c+df_{[t_{i},j]}}\right\rvert $$
      (10)
    (b)

      Sort the words in descending order based on the CAS Weight (W C A S ).

    (c)

      Select the top k CAS-weighted words from the set \(L_{k-1}\) based on a threshold value.

    (d)

      Generate the set \(NG[g]\) of N-grams of length 1–3 from the top k CAS-weighted words, and normalize the frequencies of the N-grams using the TF-IDF weight (Agnihotri et al. 2014; Sebastiani 2002; Mladenic & Grobelnik 1999) (see (1)).

  3.

    Finally, the CHI method is applied to select the top b discriminating N-grams \(NG[b] \subset NG[g]\).

The time complexity of Algorithm 1 is \(\mathcal{O}(p \times n \times r)\), where n is the total number of documents, r the total number of classes, and p the total number of words; m words remain after the removal of stop words, punctuation marks, and white spaces, and k CAS-weighted words are selected as the most informative words based on a threshold value.
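Equations (7)–(10) can be sketched as follows. This is an illustrative reading of the formulas, not the authors' code: the value of epsilon, the smoothing inside \(W_2\), and the absolute-value guard before the logarithm in (7) are assumptions (the text states that the inner sum is positive in practice).

```python
import math

EPS = 0.01  # smoothing constant epsilon; its exact value is an assumption

def w1(maxf, avgf, p_t_c, p_t_notc):
    # Eq. (8): strength of word t_i for class C_j.
    odds = (EPS + p_t_c * (1 - p_t_notc)) / (EPS + p_t_notc * (1 - p_t_c))
    return maxf / (EPS + avgf) + math.log(EPS + odds, 2)

def w2(a, b, p_t, p_c, p_notc, p_tc, p_tnotc):
    # Eq. (9): MI-like likelihood, weighted by the document counts a and b.
    return (a * math.log((p_tc + EPS) / (p_t * p_c + EPS))
            + b * math.log((p_tnotc + EPS) / (p_t * p_notc + EPS)))

def w3(a, c, df):
    # Eq. (10): association of the word with the class.
    return abs(a - c) / (a + c + df)

def w_cas(per_class):
    # Eq. (7): W_CAS = log(sum_j (W1j + W2j)^3) + sum_j W3j^4.
    # abs() is a guard assumption for the rare negative inner sum.
    s12 = sum((w1j + w2j) ** 3 for w1j, w2j, _ in per_class)
    s3 = sum(w3j ** 4 for _, _, w3j in per_class)
    return math.log(abs(s12) + EPS) + s3

# Hypothetical per-class (W1j, W2j, W3j) triples for one word.
print(round(w_cas([(2.0, 3.0, 1.0), (1.0, 1.0, 0.6)]), 2))
```

The cubing of \(W_{1j} + W_{2j}\) and the 4th power of \(W_{3j}\) correspond to the empirically chosen exponents discussed in Section 4.1.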

figure a

4.1 Explanation of CAS

The example data shown in Table 2 explains the weight assignment process of CAS. In this dataset, there are three categories of documents (i.e. C1, C2, and C3) with eight unique words. The confusion matrix shown in Table 3 represents the values of a, b, c, and d (see Table 1) for the words. An analysis of the words, i.e. their document, maximum, and average frequencies, is shown in Table 4. The properties of the eight unique words of Table 2 are as follows:

  1.

    The word “turtle” is of a positive nature for class C1, whereas “emu” is negative.

  2.

    The word “toad” is of a positive nature for class C2, whereas “ostrich” is negative.

  3.

    The words “shark” and “rays” are of a positive nature for class C3, whereas “mouse” is negative.

  4.

    The words “cat” and “cow” are present in all three classes, i.e. C1, C2, and C3, and are called common words.

Table 2 Example dataset
Table 3 Confusion matrix for words as per preliminary notations shown in Table 1
Table 4 Analysis of word’s: document, maximum, and average frequencies

Table 5 describes the mathematical computation of each ensemble part of the CAS method, and Table 6 explains the use of these ensemble parts. The weights assigned by the CAS method are compared with those of the MI, IG, DFS, and CHI methods (see Table 7). CAS assigns a much higher weight to the words of a positive nature (i.e. “rays”, “turtle”, “toad”, and “shark”), regardless of whether the words occur frequently or not, a medium weight to the negative words (i.e. “mouse”, “ostrich”, and “emu”), and a lower weight to the common words (i.e. “cat” and “cow”) and sparse words. Each ensemble part of CAS plays its own role in deciding the importance of the words. The ensemble parts of CAS are as follows:

Table 5 Analysis of ensemble parts of CAS
Table 6 Explanation of CAS method
Table 7 Comparison of scores on Example Dataset

Strength of a word to represent a class (W 1)

The strength of the word \(t_i\) depends on its occurrence in a class \(C_j\) compared to the other classes \(\bar{C}_j\). It is the sum of two ratios. The first is the ratio of the maximum occurrence \(maxf(t_i, C_j)\) of \(t_i\) to its average occurrence \(avgf(t_i, C_j)\) in class \(C_j\) (see (8)). The second is a ratio similar to the odds ratio method (Rehman et al. 2015; Forman 2003): it assigns the highest positive values to the rare words of a positive nature, and negative values to the common words of a negative nature for a class. Its sum with the first ratio balances the negative values with positive ones.

Likelihood of a word for a class (W 2)

This is an improvement of the standard MI method (Forman 2003), where each logarithmic quantity is multiplied by \(p(t_i, C_j)\) and \(p(t_i, \bar{C}_j)\) (see Table 1). In \(W_2\), each logarithmic quantity is instead multiplied by the total occurrence of the word \(t_i\) in the documents of the class \(C_j\) and of the other classes \(\bar{C}_j\) (see (9)). The likelihood weight \(W_{2j}\) assigns a very high weight to the rare words of a positive nature and a medium weight to the common words of a negative nature for the class \(C_j\); e.g. “toad” (12.77) and “emu” (6.12) (see Table 5).

Ensemble of W 1j with W 2j as \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3}\)

The ensemble of \(W_1\) with \(W_2\) increases the weight of rare words of a positive nature and of common words of a negative nature. Further, it decreases the weight of the most common words (i.e. those present in all the classes) and of sparse words (see Table 6). Table 6 presents an analysis of the ensemble parts of CAS. In this table we can observe that if \(W_1\) is multiplied with \(W_2\) instead of being added, the resultant weight of a word is zero or negative for the classes in which its occurrence is lower than in the other classes, whereas the positive values increase very sharply for the classes in which the word is most frequent. Therefore, the product \(W_1 \times W_2\) does not fit the current scenario, though it may be a viable choice with other approaches; hence \(W_1 + W_2\) is the best choice for the proposed method. The 3rd power of \(W_{1j} + W_{2j}\) is chosen empirically (see Fig. 2). Figure 2 shows that the characteristics of the 1st, 2nd, 3rd, and 4th powers are similar, but the goodness of fit of the classification model is better for the 3rd power than for the 1st or 2nd, whereas for the 4th power the classification model over-fits the words of the training documents and consequently degrades the performance. One way to understand over-fitting intuitively is that a model may use some relevant words of the data (signal) and some irrelevant words (noise). An over-fitted model picks up the noise or random fluctuations in the training data, which increases its performance on the known data (training data) and decreases its performance on novel data (test data). Thus, over-fitting negatively impacts the performance of the model on new data. The aim is to preserve the sign, positive or negative, of the contribution of \(W_{1j} + W_{2j}\) for the jth class; thus the odd powers 1, 3, and 5 are the better choices. The resultant weight is almost always positive due to the ensemble of \(W_{1j}\) and \(W_{2j}\) (see Table 6). The logarithm of the resultant sum is taken to normalize the weight; in the few cases where the normalized weight is negative, it becomes positive when summed with \({\sum}_{j=1}^{j=r} W_{3j}^{4}\).

Fig. 2
figure 2

Analysis of weight W 1 + W 2 for various powers

Association of a word to specific class (W 3)

The main motive behind computing the association value \(W_{3j}\) of a word \(t_i\) (see (10)) is to discriminate the common words more effectively and to increase the weight of words of a positive and negative nature optimally. An association value of the word \(t_i\) is computed for each class and depends upon the frequency of the word in that class: if a word is absent from a class, its association value is maximum (i.e. 1) for that class, and the higher the frequency of a word in a class, the lower its association value for that class. E.g. the word “toad” is present in class C2 while absent in C1 and C3 (see Table 5); as a result, its association value is 1 for C1 and C3, and 0.6 for C2. The resultant value of \(W_{3j}\) is always in the range 0 to 1. Thus, the 4th power of \(W_{3j}\) is small for the class in which the word is most frequent, whereas it is at most 1 for a class in which the word is absent. Therefore, the resultant association value of “toad” is \(1^4 + 0.6^4 + 1^4 = 2.13\). As the word “cat” is present in all three classes, its resultant association value is \(0.25^4 + 0.73^4 + 0.74^4 = 0.59\). Similarly, the word “emu”, present in classes C2 and C3 while absent in C1 and therefore considered a negative word for class C1, has a resultant association value of \(1^4 + 0.42^4 + 0.26^4 = 1.04\). It can be observed from the above examples that \({\sum}_{j=1}^{j=r} W_{3j}^{4}\) assigns the highest weight to the rare positive words, a high weight to the rare negative words, and the least weight to the common and sparse words. Figure 3 shows the characteristics of various powers of \(W_{3j}\): the behaviour of the 4th and 5th powers is similar, differs slightly from the 3rd power, and changes noticeably for the 2nd and 1st powers. The goodness of fit of the classification model is better for the 4th power than for the 1st, 2nd, 3rd, or 5th.

Fig. 3
figure 3

Analysis of weight W 3 for various powers
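The worked association values above can be checked directly. The per-class values below are taken from the example in the text (Table 5).

```python
# Per-class association values W_3j (1.0 where the word is absent),
# taken from the worked example in the text (Table 5).
toad = [1.0, 0.6, 1.0]
cat = [0.25, 0.73, 0.74]
emu = [1.0, 0.42, 0.26]

def assoc(values):
    # sum_j W_3j^4 -- raising to the 4th power keeps the absence value
    # (1.0) dominant and shrinks the values of frequent classes toward 0.
    return sum(v ** 4 for v in values)

print(round(assoc(toad), 2))  # 2.13
print(round(assoc(cat), 2))   # 0.59
print(round(assoc(emu), 2))   # 1.04
```

The ordering matches the text: the rare positive word scores highest, the negative word is in the middle, and the common word lowest.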

Ensemble of \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3}\) with \({\sum }_{j=1}^{j=r}W_{3j}^{4}\)

As shown in Table 6, using \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3}\) alone suffers in the discrimination of common words. E.g. the frequency of the common word “cat” is higher than that of “cow”; therefore “cat” is more common than “cow” and should get a lower rank, but \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3}\) assigns the 8th rank to “cat” and the 9th rank to “cow”. To overcome this issue, \({\sum }_{j=1}^{j=r}W_{3j}^{4}\) is summed with \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3}\), which assigns the 8th rank to “cow” and the 9th rank to “cat”. This result shows that \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3} + {\sum }_{j=1}^{j=r}W_{3j}^{4}\) discriminates common words more appropriately and increases the proportional weight of rare words of a positive and negative nature better than \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3}\) alone. Further, if multiplication is used in place of addition, i.e. \(\log {\sum }_{j=1}^{j=r}(W_{1j} + W_{2j})^{3} \times {\sum }_{j=1}^{j=r}W_{3j}^{4}\), the resultant weight becomes much higher, which causes the classification model to over-fit less informative words and degrades the performance. Thus, the proposed CAS method shown in (7) has been empirically evaluated and found most suitable for the selection of the most informative words.

5 Experimental setup and performance evaluation

All the experiments were carried out on a machine with an Intel Core i7 1.8 GHz processor and 8 GB RAM, running the 64-bit Ubuntu 14.04 OS. The steps of text document classification, i.e. tokenization, preprocessing of the words of the corpus (D), feature extraction (\(t[m] \subset t[p]\)), feature selection (\(t[k] \subset t[m]\) and \(NG[b] \subset NG[g]\)), classification, and performance analysis, are performed in Python 2.7. The nltk, scipy, numpy, IPython notebook, scikit-learn, and matplotlib packages of Python 2.7 are used in the experimental analysis. The corpus is sliced into multiple arrays, one per class, instead of being loaded into a single array; this speeds up the computing process and resolves memory-related issues.

5.1 Data set

In this paper, four standard text datasets, viz. Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23, are used to evaluate the performance of the proposed CAS method. A detailed summary of the datasets is given in Table 8. For the experimental studies, the corpus D is divided into two subsets using the holdout method, in which 75% of the data is placed in the training set (\(D_{train}\)) and the remaining 25% in the test set (\(D_{test}\)). The “Beautifulsoup” package of Python 2.7 is used to extract the text contents of the documents by removing tags, html links, punctuation marks, and white spaces. Subsequently, all the stop words (defined in the Python natural language toolkit) are removed, and as a result \(t[m] \subset t[p]\) words remain for further processing. Initially, the CAS scores of all m words are computed and arranged in descending order to select the top k informative words (\(t[k] \subset t[m]\), where k < m). For the experiments, k is set to 100, 200, 500, 1000, 2000, 3000, 5000, and 10000. The set \(NG[g]\) of g (where g > k) N-grams of length 1–3 is generated from the k CAS-filtered words. The TF-IDF vectorizer of the Python “scikitlearn” package is used to normalize the weights of the \(NG[g]\) N-grams. The Apriori algorithm is used to select the top b (where b < g) discriminating N-grams \(NG[b]\) using the CHI method. The vocabulary V of the b unique N-grams (as discussed in Section 4) is used to train the model.

Table 8 Details of the datasets
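The 75/25 holdout split can be sketched as follows; in practice a utility such as scikit-learn's train_test_split offers the same behaviour. The corpus below is a made-up placeholder.

```python
import random

def holdout_split(docs, labels, train_frac=0.75, seed=0):
    # Shuffle the indices reproducibly, then slice 75% / 25%.
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_frac)
    train = [(docs[i], labels[i]) for i in idx[:cut]]
    test = [(docs[i], labels[i]) for i in idx[cut:]]
    return train, test

# Hypothetical corpus of 100 documents over 4 classes.
docs = ["d%d" % i for i in range(100)]
labels = [i % 4 for i in range(100)]
train, test = holdout_split(docs, labels)
print(len(train), len(test))  # 75 25
```

A fixed seed makes the split reproducible across runs, which matters when comparing feature selection methods on the same partition.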

5.2 Performance evaluation

In this paper, the benchmark Macro- and Micro-averaged \(F_1\) measures are used to evaluate the performance of MNB and LSVM (Uysal & Kursat 2016). The F measure (\(F_\beta\) and \(F_1\)) can be interpreted as a weighted harmonic mean of the precision and recall. The \(F_\beta\) score weights recall more than precision by a factor of \(\beta\); it reaches its best value at 1 and its worst at 0. If \(\beta = 1\), then \(F_\beta\) and \(F_1\) are equivalent, and recall and precision are equally important. Accuracy gives the same weight to all the classes, but it is not suitable, especially for imbalanced datasets. The Macro \(F_1\) measure computes the metrics for each label and finds their unweighted mean; it does not consider label imbalance. Micro \(F_1\), in contrast, calculates the metrics globally by counting the total true positives, false negatives, and false positives. Equations (11)–(16) show the precision (Macro in (11), Micro in (12)), recall (Macro in (13), Micro in (14)), accuracy (15), and F measure (16).

$$ Precision_{Macro}=\frac{1}{n(C)}\sum\limits_{C=1}^{C=r}\frac{TP_{C}}{TP_{C}+FP_{C}} $$
(11)
$$ Precision_{Micro}=\frac{{\sum}_{C=1}^{C=r} TP_{C}}{{\sum}_{C=1}^{C=r}TP_{C}+{\sum}_{C=1}^{C=r}FP_{C}} $$
(12)
$$ Recall_{Macro}=\frac{1}{n(C)}\sum\limits_{C=1}^{C=r}\frac{TP_{C}}{TP_{C}+FN_{C}} $$
(13)
$$ Recall_{Micro}=\frac{{\sum}_{C=1}^{C=r} TP_{C}}{{\sum}_{C=1}^{C=r}TP_{C}+{\sum}_{C=1}^{C=r}FN_{C}} $$
(14)
$$ Accuracy=\frac{TP+TN}{(TP+FP+TN+FN)} $$
(15)
$$ F_{\beta} =(1+\beta^{2})\times \frac{Precision \times Recall}{(\beta^{2} \times Precision) + Recall} $$
(16)

Where C = 1 to r represents the r class labels and n(C) is the total number of classes. TP, FP, FN, and TN are the counts of true positives, false positives, false negatives, and true negatives respectively.
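Equations (11)–(16) with \(\beta = 1\) can be sketched from per-class counts. The counts below are hypothetical; Macro_F1 is taken as the unweighted mean of the per-class F1 scores, as described above, and Micro_F1 pools the counts first.

```python
def macro_micro_f1(stats):
    # stats: per-class (TP, FP, FN) counts.
    # Macro: per-class precision (Eq. 11), recall (Eq. 13), F1 (Eq. 16),
    # then the unweighted mean over classes.
    f1s = []
    for tp, fp, fn in stats:
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    macro_f1 = sum(f1s) / len(f1s)

    # Micro: pool the counts over all classes first (Eqs. 12 and 14).
    tp = sum(s[0] for s in stats)
    fp = sum(s[1] for s in stats)
    fn = sum(s[2] for s in stats)
    p_micro = tp / (tp + fp)
    r_micro = tp / (tp + fn)
    micro_f1 = 2 * p_micro * r_micro / (p_micro + r_micro)
    return macro_f1, micro_f1

# One well-predicted and one poorly-predicted class (hypothetical counts).
macro, micro = macro_micro_f1([(90, 10, 10), (10, 30, 30)])
print(round(macro, 3), round(micro, 3))  # 0.575 0.714
```

The gap between the two values shows why Macro_F1 is preferred for imbalanced datasets: the poorly-predicted class drags the macro average down, while micro averaging lets the large class dominate.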

6 Results and discussions

The proposed CAS method obtains a significant improvement in the results for both the LSVM and MNB classifiers (see Table 9). The experimental results are better than the earlier works of Guo et al. (2009) and Rehman et al. (2015). Table 9 shows that on all the datasets, the proposed CAS method gives the highest Macro_F1 measure for both classifiers, whereas CHI gives the highest Micro_F1 measure on the Ohsumed10 (using LSVM) and Webkb (using MNB) datasets.

Table 9 Peak Performance of methods in four Datasets

Figures 4–19 show the Macro_F1 and Micro_F1 obtained for different numbers of features on all four datasets, classified by LSVM and MNB. The average ranks of the CAS, MI, IG, DFS, and CHI methods are shown in Table 10, where a lower value indicates a higher rank. In most of the cases, the average rank of CAS is the highest. The performance of MI is always at the 5th position, whereas there is close competition among the DFS, IG, and CHI methods for the 2nd, 3rd, and 4th positions.

Fig. 4
figure 4

Macro F1 for LSVM in Webkb Dataset

Fig. 5
figure 5

Micro_F1 for LSVM in Webkb Dataset

Fig. 6
figure 6

Macro_F1 for MNB in Webkb Dataset

Fig. 7
figure 7

Micro_F1 for MNB in Webkb Dataset

Fig. 8
figure 8

Macro_F1 for LSVM in 20Newsgroup dataset

Fig. 9
figure 9

Micro_F1 for LSVM in 20Newsgroup

Fig. 10
figure 10

Macro_F1 for MNB in 20Newsgroup dataset

Fig. 11
figure 11

Micro_F1 for MNB in 20Newsgroup dataset

Fig. 12
figure 12

Macro_F1 for LSVM in Ohsumed10 dataset

Fig. 13
figure 13

Micro_F1 for LSVM in Ohsumed10 dataset

Fig. 14
figure 14

Macro_F1 for MNB in Ohsumed10 dataset

Fig. 15
figure 15

Micro_F1 for MNB in Ohsumed10 dataset

Fig. 16
figure 16

Macro_F1 for LSVM in Ohsumed23 dataset

Fig. 17
figure 17

Micro_F1 for LSVM in Ohsumed23 dataset

Fig. 18
figure 18

Macro_F1 for MNB in Ohsumed23 dataset

Fig. 19
figure 19

Micro_F1 for MNB in Ohsumed23 dataset

Table 10 Average rank of methods

Table 7 presents the word scores assigned to the example dataset (see Table 2) by the various methods, viz. MI, IG, DFS, CHI, and CAS. The ranges of the weights assigned by these methods are summarized in Table 11. The key points observed by analyzing the results on the example dataset (see Table 2) are as follows:

  1.

    Strength of MI: Whether a word is frequent or infrequent, MI assigns a high weight to rare positive words.

    Weakness of MI: MI assigns a low weight to negative, common, and sparse words. The separation between the weights of rare positive and negative words is not adequate, i.e. the variance in the weights is low; e.g. the positive words “toad”, “shark”, and “rays” get weights 0.25, 0.35, and 0.24, while the negative words “ostrich”, “emu”, and “mouse” get 0.17, 0.21, and 0.20, respectively. The variance between the weights of common and sparse words and those of positive and negative words is also very low, e.g. “cow” and “cat” get 0.1 and 0.14, respectively.

  2.

    Strength of IG: IG assigns a very low weight to common and sparse terms; e.g. “cow” and “cat” get 0.03 and 0.06, respectively.

    Weakness of IG: IG assigns a high weight to the most frequent words, whether positive or negative, but only a medium weight to less frequent positive and negative words.

  3.

    Strength of DFS: The weight assignment of DFS is similar to IG, with a few improvements. It assigns the highest weights, in the range 0.8 to 1, to the most frequent words; e.g. the positive words “toad” and “turtle” both get weight 1, whereas IG ranks “turtle” only 4th. DFS also differentiates common words better than MI, IG, and CHI; e.g. the word “cow” gets a higher rank than “cat”.

    Weakness of DFS: Like IG, DFS assigns the highest weight to the most frequent words, whether positive or negative, but only a medium weight to less frequent positive and negative words; e.g. the negative word “mouse” gets a higher weight (0.94) than the positive word “rays” (0.9).

  4.

    Strength of CHI: CHI assigns much higher weights to the most frequent words than to negative, common, and sparse words.

    Weakness of CHI: CHI does not discriminate between the positive and negative nature of words proportionally; e.g. the negative word “mouse” gets a higher weight (17.7) than the positive word “shark” (7.67).

  5.

    Strength of CAS: CAS assigns the highest weight to words of a positive nature, a medium weight to words of a negative nature, and a lower weight to common/sparse words. The weights assigned by CAS are more appropriate than those of the other methods.

    Weakness of CAS: A large set of informative words with high weights is present in a group of classes, whereas a subset of informative words with much smaller weights is present in only a few classes. The CAS, MI, IG, DFS, and CHI methods all select the top f features by sorting the words in descending order of weight; as a result, substantial words of a few classes are either partially or completely eliminated.
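The selection step shared by all five methods, score every word, sort by weight in descending order, and keep the top f, can be sketched as below. Since the CAS formula itself is defined elsewhere in the paper, the standard chi-square (CHI) statistic over per-class document counts is used here as an illustrative scorer; the word counts are made up for the example.

```python
# Hypothetical sketch of weight-based top-f feature selection, using CHI
# as the example scorer. For a (word, class) pair the standard counts are:
#   A = docs of the class containing the word, B = other docs containing it,
#   C = docs of the class without it,          D = other docs without it.

def chi_square(A, B, C, D):
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def select_top_f(word_counts, f):
    """word_counts: {word: (A, B, C, D)}; returns the f highest-weighted words."""
    weighted = {w: chi_square(*c) for w, c in word_counts.items()}
    # sort in descending order of weight and keep the top f features
    ranked = sorted(weighted, key=weighted.get, reverse=True)
    return ranked[:f]

# Made-up counts: "toad" is concentrated in one class, "cow" is spread evenly.
counts = {
    "toad": (9, 1, 1, 9),   # strongly class-indicative -> high weight
    "cow":  (5, 5, 5, 5),   # evenly distributed -> weight 0
    "cat":  (6, 5, 4, 5),   # weakly indicative -> small weight
}
print(select_top_f(counts, 2))  # ['toad', 'cat']
```

The global cut-off at f is exactly what causes the weakness noted above: words that are highly informative for only a few small classes may fall below the threshold and be eliminated.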

Table 11 Analysis of assigned weight by the methods in various ranges

7 Conclusion

This paper introduced a new text feature selection method named Correlative Association Score (CAS). The main objective of CAS is to identify the nature of the words, i.e. positive, negative, common, or sparse. Words of a negative nature for a class are also important in identifying the class label of a document: the presence of such words assures that the document does not belong to the class for which the word is negative. In this context, CAS combines the strength, likelihood, and association of words, which helps to identify words mutually associated with many classes. CAS constructs a better final feature set than the state-of-the-art methods, viz. Mutual Information (MI), Information Gain (IG), Distinguishing Feature Selector (DFS), and Chi-square (CHI), assigning much higher weights to words of a positive and negative nature than to common and sparse words. The feature selection process was carried out under different conditions, i.e. feature sets of varying sizes, datasets of diverse characteristics, different classification algorithms, and different success measures. The promising results on the Macro_F1 and Micro_F1 success measures demonstrate the effectiveness of the proposed CAS method.