1 Introduction

Accurate and timely information is a basic need for effective decision making. The tremendous growth of the e-corpus in various fields (e.g. Business, Biomedical, Engineering, News) [30, 47] demands an intelligent Decision Support System (DSS) that supports an Automated Text Document Classification (ATDC) process [10, 11]. In this context, a model is built by observing the occurrence of words in training documents with known class labels. The trained model can then predict the class labels of test documents with maximum accuracy [16, 42].

The prediction relies completely on the contents of the documents, and substantial portions of these contents are stored as text [44, 45]. The word (term) is the smallest constituent of text and plays a vital role in the ATDC process [27, 33, 39]. The processing steps of the ATDC process are as follows: the first step extracts features from the entire corpus (i.e. generates tokens from the text contents); in the second step, less informative items (e.g. stop words, punctuation marks, white spaces) are eliminated; and in the third step, lemmatization/stemming is performed on the remaining terms of the corpus. Finally, the resultant terms are used to build the vocabulary of the entire corpus [4]. The terms of this vocabulary are represented by vectors, where the frequencies of a term in the documents form its vector [34]. The collection of term vectors in matrix form is called a vector space, in which each individual term constitutes one dimension. For a typical document collection there may be millions of terms, hence ATDC has to cater for a large number of dimensions, which makes the classification process cumbersome [19, 30]. In the literature, feature selection techniques are used to select the most relevant features by eliminating the less important ones. Feature selection increases the performance as well as the speed of the classifier and is considered an important step in the classification process.

The Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer [34] is applied to normalize the weights of the terms, but the TF-IDF vectors tend to be high dimensional since they have one component for every term in the vocabulary. Terms which represent similar concepts or meanings are treated as individual words in the corpus and thus enormously increase the size of the TF-IDF vectors.
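For concreteness, the following minimal sketch (not the authors' exact pipeline) shows how such a TF-IDF vector space is typically built with scikit-learn; the toy documents and the resulting dimensionality are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer

# three toy documents standing in for a corpus
docs = [
    "carcinoma detected in the tissue sample",
    "sarcoma and melanoma are forms of cancer",
    "quarterly business earnings exceeded the forecasts",
]

vectorizer = TfidfVectorizer(stop_words="english")  # one dimension per vocabulary term
X = vectorizer.fit_transform(docs)                  # sparse matrix: documents x terms

# even for three short documents the vector space already has one axis per distinct term
print(X.shape[1], "dimensions for", X.shape[0], "documents")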

Latent Semantic Analysis (LSA) identifies the relationships among terms through linear combinations of terms.Footnote 1 The LSA creates a vector representation of a document that helps to compare documents based on similarity. It assigns an equal weight to terms representing similar concepts or meanings.Footnote 2 E.g., the four words “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent the same concept, i.e. “cancer”. Thus, an equal weight is assigned by LSA to these words, they contribute equally to the resulting LSA component, and whichever of them is selected by the algorithms does not affect the classifier performance [12]. Further, the LSA-processed vector space still contains a considerable number of features that are not relevant to the text of a specific class. In order to improve the scalability of ATDC, an effective feature selection technique is needed to reduce the feature set.
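As an illustration of this step, the sketch below (an assumption about tooling, not the authors' code) applies LSA via truncated SVD on top of TF-IDF vectors, so that terms expressing the same concept load onto shared latent components; the number of components is kept small only for the example.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "carcinoma found in the biopsy",
    "the melanoma and the sarcoma are cancers",
    "stock markets rallied after the earnings report",
    "the cancer treatment reduced the tumor",
]

# TF-IDF followed by LSA (truncated SVD); related cancer terms load onto the same component
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(docs)   # dense matrix: documents x 2 latent components
print(X_lsa.shape)                # (4, 2)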

Many feature subset selection methods have been proposed and studied in the machine learning paradigm. They can be broadly divided into three categories: filter, wrapper, and embedded [6, 32, 44]. The filter methods compute the score of a feature using an evaluation function; this score is independent of any classification algorithm and is determined using the mutual correlation of the data. In contrast, the wrapper and embedded methods require frequent classifier interaction in their flow to estimate the value of a given subset. The requirement of classifier interaction may increase running time and force the feature selection method to work according to a specific learning model. Thus, filter-based methods are preferred over wrapper and embedded methods [32, 44]. A Global Filter-based Feature Selection Scheme (GFSS) assigns a score to each feature and the top-N features are selected using this score, where N is an empirically determined number [23, 44]. The filter-based methods are further subdivided into the One-sided Local Filter-based Feature Selection Scheme (OLFSS) and the Global Filter-based Feature Selection Scheme (GFSS).

In the OLFSS, the local class-based score of each feature is computed and used as the final score. The GFSS follows a global policy and converts the multiple local scores into a global score to compute the final score of the features. The local and global scores can be used directly for feature ranking: the features are sorted in descending order and the top-N features are included in the Final Feature Set (FFS). Information Gain (IG), Gini Index (GI), the Distinguishing Feature Selector (DFS), and Gain Ratio (GR), etc. are known as methods of GFSS, whereas Mutual Information (MI), Odds Ratio (OR), GSS coefficients (GSS), and the Correlation Coefficient (CC) are methods of OLFSS [44]. The selected discriminating features of the FFS are used by the classifiers in the final step of the ATDC process. A meta-heuristic technique can also be used to search for a configuration that produces a highly effective text classifier; this model selection procedure is commonly named hyper-parameter optimization in the literature [43].
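The GFSS flow described above reduces to a simple ranking step once the global scores exist; the sketch below illustrates only that step, with made-up scores rather than any of the cited scoring methods.

import numpy as np

def select_top_n(terms, global_scores, n):
    """Generic GFSS selection: sort terms by their global filter score and keep the top-n."""
    order = np.argsort(global_scores)[::-1]          # descending by score
    return [terms[i] for i in order[:n]]

terms = ["price", "cancer", "tumor", "market", "gene"]
scores = np.array([0.12, 0.80, 0.65, 0.20, 0.55])    # illustrative global scores
print(select_top_n(terms, scores, n=3))              # ['cancer', 'tumor', 'gene']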

Although the GFSS method improves the performance of the classifiers, it has some limitations. The GFSS is suitable for balanced datasets, where each class contains an equal number of documents along with a sufficient number of terms. In the case of an unbalanced dataset, having a large number of classes with a variable distribution of terms affects the performance of GFSS: the GFSS eliminates the informative features of a class either partially or completely from the top-N features. Moreover, most of the studies in the literature focus on providing improvements to specific feature selection methods rather than providing a new generic scheme.

Uysal [44] extended the work of [19] and proposed a solution named the Improved Global Feature Selection Scheme (IGFSS). IGFSS selects an equal number of representative features from each class for the final feature set. There are mainly two issues with the IGFSS; the first is that it assigns the class label of each feature, with positive or negative membership, by considering an individual vote of the Odds Ratio (OR) method [34].

An individual method may have weaknesses. For example, the OR method assigns a positive score to a term for a class if the term occurs frequently in that class; otherwise, a negative score is assigned. However, the numerical difference between the OR scores of terms with positive and negative membership is very small, which affects the process of class label and membership assignment. In this paper, the membership of terms is referred to as the nature of terms, i.e. positive or negative membership of a term means positive or negative nature of the term. The second issue is that the ratio of negative nature features is determined empirically by the IGFSS and a common negative features ratio is applied to all the classes to select the positive and negative nature features. This affects those classes of a dataset which have more positive features than negative ones, or vice-versa [4].

To address these two issues, a new technique named the Soft Voting Technique (SVT) is proposed. It is based on the presumption that the ensemble votes of several methods give better results than an individual vote, and it determines the most appropriate class label of the features. This technique can be useful for a set of equally well-performing methods to balance out their individual weaknesses. The flow of SVT is similar to that of IGFSS, but improved, and building on the key points of IGFSS we give a generic solution for filter-based feature selection methods. There are two main contributions in this paper:

  1. The SVT uses the weighted average score (Soft Vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficients (GSS), to predict the class labels of terms. This weighted average yields a more balanced score for a word than OR alone, which improves the numerical discrimination between positive and negative nature words.

  2. A mathematical expression is incorporated in IGFSS that computes a varying ratio of positive and negative nature terms for each class, based on the occurrence of the terms in the classes.

The proposed SVT is evaluated using four standard classifiers, viz. Linear Support Vector Machine (LSVM), Softmax Regression (SOFT MAX), Stochastic Gradient Descent Classifier (SGDC), and RIDGE. The classifiers are applied to five benchmark text data sets, viz. Webkb, Classic4, Reuters10, Trec2004, and Ohsumed10. The experimental results of SVT, based on Macro_F1 and Micro_F1, are compared with the classical information science methods and IGFSS.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the state-of-the-art methods and related works. Section 3 presents the details of the proposed SVT. The experimental setup and performance evaluation measures are discussed in Section 4. Section 5 presents the experimental results and discussions. Finally, the paper concludes in Section 6.

2 Related works

Substantial work has been carried out in the area of filter-based feature selection. The most common methods, viz. Mutual Information (MI), Information Gain (IG), Distinguishing Feature Selector (DFS), Gini Index (GI), Gain Ratio (GR), and Odds Ratio (OR), are briefly described as follows:

The Mutual Information (MI) concept [18, 48, 49] comes from information theory; it measures the dependency between random variables and is used to measure the information contained in a term ti. It is strongly influenced by the marginal probabilities of the terms: it assigns higher weights to rare terms than to common terms, so the weights are not comparable for terms with widely differing frequencies. The final score MI(ti) of term ti is the maximum class-based score, as shown in (1). The preliminary notations are summarized in Table 1.

$$ MI(t_{i}) = \max_{j = 1}^{j=r}\left[\log \left( \frac{p(t_{i} , C_{j})}{p(t_{i}) \times p(C_{j})}\right)\right] $$
(1)

Information Gain (IG) [18, 44, 45, 48, 49] assigns a higher weight to common terms distributed across many categories than to rare terms. IG is also known as average Mutual Information (see (2)).

$$\begin{array}{@{}rcl@{}} IG(t_{i}) &=& p(t_{i})\times \sum\limits_{j = 1}^{j=r} p(C_{j}|t_{i}) \times \log p(C_{j}|t_{i})\\ &&+\ p(\bar{t}_{i})\times \sum\limits_{j = 1}^{j=r} p(C_{j}|\bar{t_{i}}) \times \log p(C_{j}|\bar{t}_{i})\\ &&- \sum\limits_{j = 1}^{j=r} p(C_{j})\times \log{p(C_{j})} \end{array} $$
(2)
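Assuming, as is common, that the probabilities in (1) and (2) are estimated from document counts, the following sketch computes the MI and IG scores of a single term; the counts are illustrative, not taken from the paper.

import numpy as np

df = np.array([40.0, 5.0, 2.0])            # documents containing the term, per class (illustrative)
n_docs = np.array([100.0, 100.0, 100.0])   # documents per class

N = n_docs.sum()
p_c = n_docs / N                           # p(Cj)
p_t = df.sum() / N                         # p(ti)
p_tc = df / N                              # p(ti, Cj)
p_c_given_t = df / df.sum()                # p(Cj | ti)
p_c_given_nt = (n_docs - df) / (N - df.sum())   # p(Cj | ti absent)

mi = np.max(np.log(p_tc / (p_t * p_c)))    # equation (1): maximum class-based score

eps = 1e-12                                # guard against log(0)
ig = (p_t * np.sum(p_c_given_t * np.log(p_c_given_t + eps))
      + (1 - p_t) * np.sum(p_c_given_nt * np.log(p_c_given_nt + eps))
      - np.sum(p_c * np.log(p_c)))         # equation (2)

print(mi, ig)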

Gini Index (GI) is a global feature selection method for text classification which can be defined as an improved version of an attribute selection algorithm used in decision tree construction (see (3)) [44].

$$ GI\left( t_{i}\right)=\sum\limits^{r}_{j = 1}{{p\left( t_{i}|C_{j}\right)}^{2}\ {p\left( C_{j}|t_{i}\right)}^{2}} $$
(3)

The Distinguishing Feature Selector (DFS) [44, 45] improves on Mutual Information by reducing the effect of the marginal probabilities of the terms through normalization of the term weights. It gives the weight of a term in the range [0, 1], as defined by (4).

$$ DFS\left(t_{i}\right)=\sum\limits_{j = 1}^{r}\frac{p\left(C_{j}|t_{i}\right)}{p\left(\overline{t}_{i}|C_{j}\right)+p\left(t_{i}|\overline{C}_{j}\right)+1} $$
(4)

Gain Ratio (GR) was proposed in information science to reduce the effect of the most common terms and of the marginal probabilities of the terms by normalizing their weights obtained using IG [24] (see (5)).

$$ GR\left( t_{i}\right)=\sum\limits_{j = 1}^{j=r}\frac{IG(t_{i})}{-p\left( C_{j}\right)\times log \left( p\left( C_{j}\right)\right)} $$
(5)

Odds Ratio (OR) reflects the odds of the word occurring in the positive class normalized by that of the negative class. It has been used for relevance ranking in information retrieval [18, 28, 34, 44, 46] (see (6)).

$$ OR\left(t_{i},C_{j}\right)=\frac{p(t_{i}|C_{j})\left(1-p(t_{i}|\overline{C}_{j})\right)}{\left(1-p(t_{i}|C_{j})\right)\times p(t_{i}|\overline{C}_{j})} $$
(6)

The Correlation Coefficient CC(ti,Cj) of a word ti with a category Cj is a variant of the χ2 metric, where CC2 = χ2; CC can be viewed as a “one-sided” chi-square metric. Positive values correspond to features indicative of membership, while negative values indicate non-membership. The greater (smaller) the positive (negative) values are, the more strongly the terms indicate membership (non-membership) [36, 39, 50].

$$ CC\left(t_{i},C_{j}\right) = \frac{\sqrt{N}\times \left[p(t_{i},C_{j})\times p(\overline{t}_{i},\overline{C}_{j}) - p(t_{i},\overline{C}_{j})\times p(\overline{t}_{i},C_{j})\right]}{\sqrt{p(t_{i})\times p(\overline{t}_{i})\times p(C_{j}) \times p(\overline{C}_{j})}} $$
(7)

GSS Coefficient (GSS) is another simplified variant of the χ2 statistics proposed by [20]. Similar to CC, the positive values correspond to features indicative of membership, while negative values indicate non-membership [50].

$$ GSS\left( t_{i},C_{j}\right)=p(t_{i},C_{j})\times p(\overline{t}_{i},{\overline{C}}_{j})-p(t_{i},\overline{C}_{j})\times p(\overline{t}_{i},C_{j}) $$
(8)
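The three one-sided scores (6)-(8) can be computed from a 2x2 term/class document contingency table; the sketch below does this for one term and one class under the usual maximum-likelihood probability estimates, with illustrative counts.

import math

a = 40.0   # documents of class Cj that contain ti
b = 10.0   # documents outside Cj that contain ti
c = 60.0   # documents of class Cj without ti
d = 190.0  # documents outside Cj without ti
N = a + b + c + d

p_t_c = a / (a + c)        # p(ti | Cj)
p_t_nc = b / (b + d)       # p(ti | not Cj)

# Odds Ratio, equation (6)
odds_ratio = (p_t_c * (1 - p_t_nc)) / ((1 - p_t_c) * p_t_nc)

# joint and marginal probabilities for CC and GSS
p_tc, p_ntnc = a / N, d / N       # p(ti, Cj), p(not ti, not Cj)
p_tnc, p_ntc = b / N, c / N       # p(ti, not Cj), p(not ti, Cj)
p_t, p_nt = (a + b) / N, (c + d) / N
p_c, p_nc = (a + c) / N, (b + d) / N

# Correlation Coefficient, equation (7)
cc = math.sqrt(N) * (p_tc * p_ntnc - p_tnc * p_ntc) / math.sqrt(p_t * p_nt * p_c * p_nc)

# GSS coefficient, equation (8)
gss = p_tc * p_ntnc - p_tnc * p_ntc

print(odds_ratio, cc, gss)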

Uysal [44] proposed an ensemble method named the Improved Global Feature Selection Scheme (IGFSS), which provides a generic solution for the GFSS. The IGFSS merges the power of local and global feature selection methods: it is an ensemble of OR with any one method of GFSS at a time. The OR is used to assign the class label as well as the membership value to the features. It computes a negative value of a feature for a class if the feature rarely or never occurs in that class, and a positive value if the feature occurs frequently in that class. Further, the IGFSS uses the maximum absolute score of the feature over the classes to assign the class label, and the sign of this maximum value is used to determine the membership of the feature.

Table 1 Preliminary Notations [5, 7]

The summarized steps of IGFSS, presented in Algorithm 1, are as follows: (i) Step 5 computes a global score of each term using a method of GFSS (i.e. MI, IG, GI, DFS, or GR), (ii) Step 6 sorts the terms based on their computed global scores, (iii) Step 7 determines the class labels of the features, (iv) Step 8 computes the positive and negative membership of the features, (v) Step 9 computes the positive and negative feature counts for each class, (vi) Steps 10-14 determine the selection criterion based on the positive and negative features ratio, and (vii) Steps 14-16 select an equal number of the most informative features from each class by applying all the above steps.

In the IGFSS algorithm, in the worst case, all features need to be traversed once and some of them may be traversed twice while constructing the candidate feature set. Let n be the total number of documents, r the total number of classes, p the total number of terms, and m the number of terms obtained after removal of less informative terms, viz. stop words, punctuation marks, and white spaces. Let N be the number of IGFSS-weighted terms that are selected as the most informative terms based on the length of the final feature set.

The values of n, r, m, and N are much smaller than p, because the total number of terms p is in the millions, while the others are in the hundreds or thousands. Thus, the overall time complexity of Algorithm 1 is Θ(p). In the special case where the number of documents is in the millions and the numbers of terms and classes are small in comparison, the resulting number of terms is still in the millions because they are extracted from the combination of these documents into a corpus; in this case the time complexity of Algorithm 1 is Θ(n).

3 Proposed Soft Voting Technique (SVT)

Having reviewed the related studies on the ATDC process, it is found that in most of the reported techniques the class label of features is determined using a single criterion [6, 44]. However, ensemble techniques are less explored in text mining for deciding the class labels of features. Using a single criterion for feature selection has shown limited capability in knowledge discovery and decision-making systems [41]. Therefore, we introduce a new Soft Voting Technique (SVT) to determine the most appropriate class labels of the features. The IGFSS improved the GFSS by using an individual vote of the Odds Ratio (OR) method to define the class label of each feature. We extend this state-of-the-art approach by providing a more generic solution for all filter-based feature selection methods. The SVT is based on the presumption that the ensemble votes of several methods can yield better results than an individual vote. This technique can be useful for a set of equally well-performing models in order to balance out their individual weaknesses. The SVT uses the weighted average score (Soft Vote) of three methods, viz. Odds Ratio (OR), Correlation Coefficient (CC), and GSS Coefficients (GSS), to predict the class label of features.

The central idea behind using the weighted average score (i.e. Soft Vote) of these three methods is as follows: (i) all three methods, i.e. OR, CC, and GSS, assign a positive score to a term for a class if the term occurs frequently in that class, otherwise a negative score is assigned, (ii) the numerical difference between the OR scores of positive and negative nature terms is very small, (iii) however, the numerical range of scores assigned to the terms by OR is larger than that of the CC and GSS methods, (iv) thus, the plain sum of OR, CC, and GSS is not much different from the OR score alone, but their weighted average score is more balanced than OR, and (v) this helps in discriminating the positive and negative nature of terms for a class, due to the more balanced numerical differences computed by the SVT.
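A minimal sketch of this Soft Vote is given below, assuming equal weights for the three methods (the actual weights are a design choice of the technique): the per-class OR, CC, and GSS scores of a term are averaged, the class with the largest absolute combined score supplies the term's label, and the sign of that score supplies its nature.

import numpy as np

def soft_vote(or_scores, cc_scores, gss_scores, weights=(1/3.0, 1/3.0, 1/3.0)):
    """Return (class_index, nature, score) of one term from its per-class scores."""
    w_or, w_cc, w_gss = weights
    combined = (w_or * np.asarray(or_scores)
                + w_cc * np.asarray(cc_scores)
                + w_gss * np.asarray(gss_scores))
    j = int(np.argmax(np.abs(combined)))            # class with maximum absolute soft vote
    nature = "positive" if combined[j] > 0 else "negative"
    return j, nature, float(combined[j])

# illustrative per-class scores of a single term under the three methods
print(soft_vote(or_scores=[2.1, -0.4, -0.3],
                cc_scores=[0.9, -0.5, -0.2],
                gss_scores=[0.05, -0.02, -0.01]))
# -> (0, 'positive', ...): the term is a positive-nature feature of the first class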

To address the second issue of IGFSS, instead of using one common negative features ratio for every class, a mathematical expression is incorporated in IGFSS which computes a varying ratio of positive and negative features for each class. The resulting ratios resolve the imbalance between positive and negative nature terms in each class and select the most appropriate positive and negative nature terms from each class.

3.1 Explanation by synthetic data

The concept of SVT is now explained using the synthetic dataset shown in Table 2. Assume that we have to select the top 6 words from this data. The OLFSS and weighted average scores of the features of this dataset are shown in Table 3. Further, the class labels as well as the membership assigned to these features using SVT are shown in Table 4. The SVT and IGFSS both select two features from each class, as shown in Table 5.

Table 2 Synthetic data
Table 3 OLFSS Score on Synthetic data
Table 4 Class Label assignment on Synthetic data using SVT
Table 5 Selected words from Synthetic data

It can be observed from Table 3 that all three methods, i.e. OR, CC, and GSS, assign a positive score to a term for a class if the term occurs frequently in that class; otherwise a negative score is assigned. However, the numerical difference between the OR scores of positive and negative nature terms for the classes is small. Also, the numerical range of scores assigned to the terms by OR is larger than that of the CC and GSS methods. Thus, the plain sum of OR, CC, and GSS does not differ much from the OR score, but their weighted average is a more balanced score than OR. It helps in discriminating the positive and negative nature of terms for a class due to the more balanced numerical differences among the scores of terms. The selection processes of SVT, IGFSS, and GFSS based on IG are explained as follows:

The SVT follows the structure of IGFSS, as shown in Algorithm 2, to select the final feature set; the entire process of assigning class labels to the features and the flow of SVT are also shown in Algorithm 2. In SVT, (25)–(26) are used to compute the negative features ratio (nfr) and the positive features ratio (pfr) for all three classes as follows:

  1. nfr[C1] = 2/(3 + 2) = 2/5 = 0.4, pfr[C1] = 1 − nfr[C1] = 1 − 0.4 = 0.6

  2. nfr[C2] = 2/(3 + 2) = 2/5 = 0.4, pfr[C2] = 0.6

  3. nfr[C3] = 2/(3 + 2) = 2/5 = 0.4, pfr[C3] = 0.6

Therefore, using (27), if the length of the Final Feature Set (FFS) is 6, the equal split criterion gives EqualSplit = 6/3 = 2. Further, using (28), the selected positive words count in class C1 = EqualSplit × pfr[C1] = 2 × 0.6 = 1.2 ≈ 1, in class C2 = 2 × 0.6 = 1.2 ≈ 1, and in class C3 = 2 × 0.6 = 1.2 ≈ 1. Similarly, using (29), the selected negative words count in class C1 = 2 × 0.4 = 0.8 ≈ 1, in class C2 = 2 × 0.4 = 0.8 ≈ 1, and in class C3 = 2 × 0.4 = 0.8 ≈ 1.

In contrast, the IGFSS chooses a common nfr from a set of nfrs in the range 0 to 1 based on experimental evaluation; e.g., if we choose nfr = 0.8, then pfr = 1 − 0.8 = 0.2. The selected positive words count in class C1 = 2 × 0.2 = 0.4 ≈ 0, in class C2 = 2 × 0.2 = 0.4 ≈ 0, and in class C3 = 2 × 0.2 = 0.4 ≈ 0. Similarly, the selected negative words count in each of the classes C1, C2, and C3 = 2 × 0.8 = 1.6 ≈ 2. Thus, it can be observed from the above discussion that SVT uses varying nfrs and pfrs instead of the common nfr and pfr chosen by the IGFSS; the SVT follows the distribution of positive and negative nature terms in the classes while computing the nfr and pfr values. The distribution of words in the three classes C1, C2, and C3 using SVT and IGFSS is as follows:

  1. C1 - pos: ‘shark’, ‘hagfish’, ‘ray’; neg: ‘leopard’, ‘tiger’. C2 - pos: ‘turtle’, ‘lizard’, ‘snack’; neg: ‘cow’, ‘deer’.

  2. C3 - pos: ‘penguin’, ‘ostrich’, ‘emu’, ‘bird’; neg: ‘goat’, ‘toad’.

  3. IG selected top 6 features → ‘shark’, ‘turtle’, ‘emu’, ‘penguin’, ‘ostrich’, ‘lizard’ (C1 = 1, C2 = 2, C3 = 3).

  4. IGFSS selected top 6 features → C1: ‘leopard’, ‘tiger’; C2: ‘cow’, ‘deer’; C3: ‘goat’, ‘toad’.

  5. SVT selected top 6 features → C1: ‘shark’, ‘tiger’; C2: ‘turtle’, ‘deer’; C3: ‘emu’, ‘goat’.

It can be observed from the results on the synthetic data that the IGFSS uses one negative features ratio for all the classes, whereas the SVT selects negative and positive features from each class based on the distribution of negative and positive features in the classes.
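The varying ratios worked out above can be reproduced with a few lines of code; since (25)-(29) appear only in Algorithm 2, the expressions below are reconstructed from the worked example and should be read as an assumption about their exact form.

def class_ratios(pos_count, neg_count):
    """Negative and positive features ratio of one class, from its term-nature counts."""
    nfr = neg_count / float(pos_count + neg_count)
    return nfr, 1.0 - nfr                      # (nfr, pfr)

def per_class_selection(ffs_length, num_classes, pos_count, neg_count):
    """How many positive and negative nature terms SVT keeps for one class."""
    equal_split = ffs_length // num_classes    # equal share of the final feature set
    nfr, pfr = class_ratios(pos_count, neg_count)
    return round(equal_split * pfr), round(equal_split * nfr)

# class C1 of the synthetic data: 3 positive-nature and 2 negative-nature terms
print(class_ratios(3, 2))               # (0.4, 0.6)
print(per_class_selection(6, 3, 3, 2))  # (1, 1): one positive and one negative term for C1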

Negative nature features are as important as positive nature features for discriminating the class label of a document. E.g., the two features ‘leopard’ and ‘tiger’ have been selected as negative features for class C1 by the IGFSS method; their presence in a document ensures that this document cannot be classified into class C1. These two features are present in the documents of classes C2 and C3, but absent in the documents of class C1; therefore, they become negative features for class C1. A similar situation occurs for the other negative features of this dataset. The GFSS based on IG selected the top 6 features with class distribution C1 = 1, C2 = 2, and C3 = 3.

As shown in Table 5, fewer features are selected from class C1 than from C2 and C3. This issue has been resolved by choosing an equal number of features from each class. The assignment of the class labels of the features using IGFSS depends upon an individual vote of the OR method, whereas the SVT uses the soft voting technique by computing the weighted average score of the features. Thus, SVT reduces the bias of feature selection towards a single criterion through an ensemble of three methods, viz. OR, GSS, and CC.

The stop words (e.g. “this”, “that”, “those”, etc.), punctuation marks, white spaces, links, email addresses, numbers, etc. are less informative for deciding the class label of the documents in ATDC. They are removed from the corpus in the pre-processing steps [1,2,3, 22, 44] using the function Preprocessing(D) as follows:

function Preprocessing(D) {
    T = [t1, t2, ..., tp] ← Tokenizer(D)    // tokenization
    T = stopWordsRemoval(T)                 // stop-word removal
    T = punctuationMarksRemoval(T)          // punctuation-mark removal
    T = whiteSpaceRemoval(T)                // white-space removal
    T = lemmatize(T)                        // convert each word to its root form (e.g. went, gone -> go)
    T ← [t1, t2, ..., tm]                   // where m < p
    return (T)
}
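A hedged Python counterpart of Preprocessing(D) is sketched below using NLTK; the specific tokenizer, stop-word list, and lemmatizer are assumptions rather than the authors' exact tooling, and the punkt, stopwords, and wordnet corpora must be downloaded beforehand.

import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocessing(document):
    tokens = word_tokenize(document.lower())                       # tokenization
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]                  # stop-word removal
    tokens = [t for t in tokens if t not in string.punctuation]    # punctuation-mark removal
    tokens = [t.strip() for t in tokens if t.strip()]              # white-space removal
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]    # e.g. went, gone -> go
    return tokens

print(preprocessing("The patients went home after the tumors were removed."))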


The steps of SVT can be summarized as follows:

  1. The SVT computes the weighted average score (Soft Vote) of the three methods, i.e. OR, CC, and GSS, to find the final score of each feature, which is then used to determine the class label of the feature. In this regard, it computes a negative value of a feature for a class if the feature rarely or never occurs in that class, and a positive value if the feature occurs frequently in that class. Further, the SVT uses the maximum absolute score of the feature over the classes to assign the class label, and the sign of that maximum value to determine the membership (nature) of the feature.

  2. The features are sorted in descending order using the scores obtained by any one method of GFSS at a time. Further, the negative and positive features ratios are derived using a mathematical model, as shown in (26)–(30).

4 Experimental setup and performance evaluation

In order to evaluate the performance of SVT over IGFSS and the various methods of GFSS (MI, IG, GI, DFS, and GR), all the experiments have been carried out on a machine with an Intel Core i7 1.8 GHz processor and 8 GB RAM, running the UBUNTU 16.04 64-bit OS. The steps of document classification, i.e. tokenization, preprocessing of the words of the corpus (D), feature extraction (t[m] ⊂ t[p]), feature selection (t[k] ⊂ t[m]), classification, and performance analysis, are performed in Python 2.7 with the nltk, scipy, numpy, ipython notebook, scikit-learn, matplotlib, etc. packages.Footnote 3 To speed up the computation and resolve memory related issues, the corpus is sliced into one array per class instead of loading the entire corpus into a single array. The numbers of features selected for analysis are 300, 400, and 500. The statistical tests have been performed using Java and the KEEL software tool to compare the performance of the proposed SVT method with the other methods using the LSVM, SOFT MAX, SGDC, and RIDGE classifiers. The average rankings of the compared methods are also computed using the Java and KEEL software tools.Footnote 4

4.1 Data set

In this study, five distinct standard text datasets (viz. Reuters10, Ohsumed10, Webkb, Classic4, and Trec2004) with varying characteristics were used to assess the proposed technique (see Table 6). A brief description of these datasets is as follows. The Reuters10 dataset consists of the top-10 classes of the Reuters-21578 dataset.Footnote 5,Footnote 6 The Ohsumed10 dataset [35, 37] is a subset containing the 10 most frequent categories of the original Ohsumed23 dataset; Ohsumed10 is a highly dense, unbalanced, and challenging dataset. The Webkb datasetFootnote 7 consists of four classes [14]; in the Webkb dataset, the “student” class has the most samples, whereas the “project” class has the least. The class distribution of the Classic4 datasetFootnote 8 is nearly homogeneous among its four classes [29]; most of its samples are from the class “cacm”, whereas the class “med” has the least number of samples. The Trec2004 dataset [13] is the original subset of MEDLINE for the TREC 2004 Genomics Track. It consists of 10 years of completed citations from the database, from 1994 to 2003. The full-text articles are extracted from the Pubmed databaseFootnote 9 in the form of XML files. These articles are based on four categories: mouse tumor biology (tumor), embryologic mouse gene expression (expression), mouse gene ontology (GO), and alleles of mutant mouse phenotypes (allele). The documents of these four categories are searched and saved as XML files. Subsequently, the PubMed id, title, and abstract are parsed from the relevant XML files using the R XML parser.Footnote 10

Table 6 Details of the datasets

The training and test documents are already defined for the Reuters10 dataset, whereas for the other datasets, viz. Ohsumed10, Webkb, Classic4, and Trec2004, a stratified nested 5-fold cross-validation scheme is used to split the data into training and test sets. This cross-validation object is a variation of k-fold that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete set. Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole.
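A minimal sketch of such a stratified split with scikit-learn is shown below; X and y are placeholders for the document vectors and class labels, and the fold count mirrors the 5-fold setting used here.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(20, 4)                    # placeholder document vectors
y = np.array([0] * 10 + [1] * 5 + [2] * 5)   # imbalanced class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # each fold keeps roughly the same class proportions as the complete set
    print(fold, np.bincount(y[train_idx]), np.bincount(y[test_idx]))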

4.2 Classification algorithms

In order to prove the efficacy of the proposed technique, four state-of-the-art classifiers, viz. Linear Support Vector Machine (LSVM) [27, 33], the SOFT MAX classifier [8], the Stochastic Gradient Descent Classifier (SGDC) [9], and RIDGE [38], are employed on the text datasets. Text classification problems tend to be quite high dimensional (many features), and high dimensional problems are likely to be linearly separable. Therefore, linear classifiers such as SOFT MAX, SGDC, RIDGE, or LSVM with a linear kernel are likely to perform well. However, to obtain good performance the regularization parameters need to be properly tuned. In the experiments, python scikit-learnFootnote 11 is used to classify the documents [17]. A brief description of the methods is as follows:

A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression, or other tasks. The SVM is one of the most successful classifiers for text classification. It searches for a decision boundary that is maximally far away from any data point; the distance from the decision surface to the closest data point determines the margin of the classifier. The SVM classifier is based on the margin maximization concept [27, 33].

Softmax Regression (synonyms: Multinomial Logistic Regression, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that can be used for multi-class classification (under the assumption that the classes are mutually exclusive), whereas the (standard) Logistic Regression model is used in binary classification tasks.Footnote 12 SVM methods require fewer variables than Logistic Regression to achieve a better (or equivalent) performance. The sigmoid logistic function is replaced by the softmax function in this classifier [8].

Stochastic Gradient Descent (SGD) is a simple and very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines. The SGD has been successfully applied to the large-scale and sparse machine learning problems often encountered in the text classification and natural language processing [9].

The ridge classifier uses ridge regression with the L2 regularization technique to classify the data. The stochastic average gradient descent solver is used in the ridge regression classifier to speed up the iterative procedure of the classification. Ridge regression is simply a type of linear regression that controls the magnitude of the coefficients to mitigate the effect of over-fitting. The major strength of this classifier is that there is no need for feature selection if the model is properly tuned using the regularization parameter [38].
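The four classifiers named above can be instantiated in scikit-learn roughly as follows; the hyper-parameter values are defaults given for illustration, not the tuned settings used in the experiments.

from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import LinearSVC

classifiers = {
    "LSVM": LinearSVC(C=1.0),                                        # linear support vector machine
    "SOFT MAX": LogisticRegression(multi_class="multinomial",
                                   solver="lbfgs", max_iter=1000),   # softmax (multinomial logistic) regression
    "SGDC": SGDClassifier(loss="hinge", alpha=1e-4),                 # linear model trained with SGD
    "RIDGE": RidgeClassifier(alpha=1.0, solver="sag"),               # ridge classifier with the SAG solver
}

# typical usage on the TF-IDF/LSA feature matrices:
#   clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
for name, clf in classifiers.items():
    print(name, type(clf).__name__)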

4.3 Performance evaluation measures

In this paper, the benchmarked macro and micro averaged F1 measures [44] are used to evaluate the performance of the classifiers. The F-measure (Fβ and F1) can be interpreted as a weighted harmonic mean of precision and recall. The Fβ score weights recall more than precision by a factor of β; it reaches its best value at 1 and its worst at 0. With β = 1, Fβ and F1 are equivalent, and recall and precision are equally important.Footnote 13 Accuracy gives the same weight to all classes and is not suitable for imbalanced datasets. The Macro_F1 measure computes the metric for each label and finds their unweighted mean, and thus does not consider label imbalance, whereas Micro_F1 calculates the metric globally by counting the total true positives, false negatives, and false positives. In this context, the notions of precision (macro (31) and micro (32)), recall (macro (33) and micro (34)), accuracy (35), Fβ (36), Macro_F1 (37), and Micro_F1 (38) are defined as follows:

$$ Precision_{macro}=\frac{1}{n(C)}\sum\limits_{C = 1}^{C=r}\frac{TP_{C}}{TP_{C}+FP_{C}} $$
(31)
$$ Precision_{micro}=\frac{\sum\limits_{C = 1}^{C=r} TP_{C}}{\sum\limits_{C = 1}^{C=r}TP_{C}+\sum\limits_{C = 1}^{C=r}FP_{C}} $$
(32)
$$ Recall_{macro}=\frac{1}{n(C)}\sum\limits_{C = 1}^{C=r}\frac{TP_{C}}{TP_{C}+FN_{C}} $$
(33)
$$ Recall_{micro}=\frac{\sum\limits_{C = 1}^{C=r} TP_{C}}{\sum\limits_{C = 1}^{C=r}TP_{C}+\sum\limits_{C = 1}^{C=r}FN_{C}} $$
(34)
$$ accuracy=\frac{TP+TN}{(TP+FP+TN+FN)} $$
(35)
$$ F_{\beta} =(1+\beta^{2})\times \frac{Precision \times Recall}{(\beta^{2} \times Precision) + Recall} $$
(36)
$$ Macro\_F1 = 2 \times \frac{Precision_{macro}\times Recall_{macro}}{Precision_{macro} + Recall_{macro}} $$
(37)
$$ Micro\_F1 = 2 \times \frac{Precision_{micro}\times Recall_{micro}}{Precision_{micro} + Recall_{micro}} $$
(38)

where C = 1 to C = r represents the r class labels and n(C) is the total number of classes. TP is the count of true positives, FP the count of false positives, FN the count of false negatives, and TN the count of true negatives. In multi-class and multi-label classification tasks, the notions of precision, recall, and F-measures can be applied to each class label independently.
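In practice, the Macro_F1 and Micro_F1 values of (37)-(38) can be obtained directly from scikit-learn, as in the small sketch below with illustrative label vectors.

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]   # illustrative gold labels
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]   # illustrative predictions

print("Macro_F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
print("Micro_F1:", f1_score(y_true, y_pred, average="micro"))  # from global TP/FP/FN counts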

The Z-test statistic [21] has been used to evaluate the performance of SVT over IGFSS and the methods of GFSS. A set of pairwise comparisons can be associated with a set or family of hypotheses. As [15] explained, the test statistic for comparing the ith and jth classifier is

$$ z=\frac{R_{i}-R_{j}}{\sqrt{\frac{k(k+1)}{6N}}} $$
(39)

where Ri and Rj are the average ranks computed through the Friedman test for the ith and jth classifiers, k is the number of classifiers to be compared, and N is the number of data sets used in the comparison. The z value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is then compared with an appropriate level of significance α. There are two basic procedures for doing this [31]:

  1. Holm’s procedure [25]: The value of α is adjusted using a step-down method. Let p1,...,pm be the p-values arranged in ascending order and H1,...,Hm the corresponding hypotheses. Holm’s procedure rejects the hypotheses H1 to H(i−1), where i is the smallest integer such that pi > α/(m − i + 1).

     The set of all pairwise comparisons forms a group of logically interrelated hypotheses. Suppose there are three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers Ci, i = 1, 2, 3. Not every combination of true and false hypotheses is then possible: if any one of these hypotheses is false, at least one other must also be false. For example, if C1 is better or worse than C2, it is not possible that C1 performs the same as C3 and C2 performs the same as C3; there cannot be one false and two true hypotheses among these three relations at the same time. Shaffer [40] proposed two procedures that address this issue by using the logical relations within the family of hypotheses to adjust the value of α.

  2. Shaffer’s static procedure: Similar to Holm’s step-down method, at stage i, instead of rejecting Hi if pi ≤ α/(m − i + 1), reject Hi if pi ≤ α/ti, where ti is the maximum number of hypotheses which can be true given that any (i − 1) hypotheses are false. Shaffer’s dynamic procedure: It improves on the static procedure by using the value α/ti* in place of α/ti at stage i, where ti* is the maximum number of hypotheses that could be true given that the previous hypotheses are false. It is a dynamic procedure since ti* depends not only on the logical structure of the hypotheses but also on the hypotheses already rejected at step i.

The results presented in this study are based on the static procedure. However, the hypotheses can also be examined using more advanced dynamic procedures [26] on the presented experimental results.
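The sketch below illustrates the pairwise z statistic of (39) and Holm's step-down adjustment; the average ranks and p-values are illustrative, not the values reported in Tables 17 and 18.

import math
from scipy import stats

def pairwise_z(rank_i, rank_j, k, n_datasets):
    """z statistic of (39) for two classifiers with Friedman average ranks rank_i and rank_j."""
    return (rank_i - rank_j) / math.sqrt(k * (k + 1) / (6.0 * n_datasets))

def holm_reject(p_values, alpha=0.05):
    """Apply Holm's step-down procedure and mark which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    rejected = [False] * m
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):            # threshold alpha/(m - i + 1), with i = step + 1
            rejected[i] = True
        else:
            break                                        # stop at the first non-rejected hypothesis
    return rejected

z = pairwise_z(rank_i=2.9, rank_j=1.0, k=3, n_datasets=5)
p = 2.0 * (1.0 - stats.norm.cdf(abs(z)))                 # two-sided p-value from the normal distribution
print(round(z, 3), round(p, 4), holm_reject([p, 0.040, 0.200]))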

5 Results and discussions

In this section, the experimental results of the proposed SVT algorithm and its comparison with classical information science methods and IGFSS are presented. The results are analyzed based on the selected features (viz. 300, 400, and 500) and the performance of the four classifiers. The dispersed distribution of the features in the various classes of all five datasets can be observed from Figs. 1, 2, 3, 4 and 5. The distribution of features in the classes is based on the class labels assigned by the IGFSS and SVT. The classifier results based on Macro_F1 and Micro_F1 measures are shown in Tables 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16. The maximum performance measures achieved by the algorithms are shown in bold letters in these tables.

Fig. 1 Reuters10 dataset

Fig. 2 Ohsumed10 dataset

Fig. 3 Webkb dataset

Fig. 4 Classic4 dataset

Fig. 5 Trec2004 dataset

Table 7 Macro_F1 measure for Ohsumed10 dataset
Table 8 Macro_F1 measure for Reuters10 dataset
Table 9 Macro_F1 measure for Webkb dataset
Table 10 Macro_F1 measure for Classic4 dataset
Table 11 Macro_F1 measure for Trec2004 dataset
Table 12 Micro_F1 measure for Ohsumed10 dataset
Table 13 Micro_F1 measure of Reuters10 dataset
Table 14 Micro_F1 measure of Webkb dataset
Table 15 Micro_F1 measure of Classic4 dataset
Table 16 Micro_F1 measure of Trec2004 dataset

The average rankings of the algorithms GFSS (MI, IG, GI, DFS, and GR), GFSS + IGFSS, and GFSS + SVT are shown in Table 18. In this table, the Micro_F1 based average ranks of MI, MI+IGFSS, and MI+SVT are 2.942, 2.025, and 1.033, respectively. Here, the highest average rank value of an algorithm indicates the worst performer, while the minimum average rank value indicates the top performer. The GFSS+SVT is found to be the top performer with all four classifiers (as shown in bold letters).

5.1 Data and statistical analysis

As can be observed from Figs. 1–5, due to the selection of two principal components using LSA, the total number of features selected using SVT is reduced (by up to ≈ 1000 features) in comparison to IGFSS for each dataset. The LSA assigns an equal weight to words (e.g. “shark”, “hagfish”, and “rays”) representing similar concepts or meanings (the category “fish”), so it does not matter which of these words is selected by the algorithm for the final feature set. The IGFSS selects one common negative features ratio (nfr) for every class, and the positive and negative features are selected based on that nfr value. This affects those classes of a dataset which have more positive features than negative ones, or vice-versa. The SVT solves this issue by selecting a variable negative and positive features ratio for each class based on the counts of negative and positive features in the classes. Thus, SVT selects an equal number of features from each class, but a variable number of positive and negative features (see Figs. 1–5). For example, in the Reuters10 dataset, a low percentage of positive features is selected for the “acq” class by the IGFSS, whereas the SVT selects a balanced percentage of positive and negative features not only for the “acq” class but for all classes.

Similar behaviour can be observed for all the other datasets, as shown in Figs. 2–5. However, the changes are less pronounced for the balanced datasets (e.g. Webkb, Classic4, and Trec2004) than for the unbalanced datasets (e.g. Reuters10, Ohsumed10). This is due to the almost similar distribution of samples as well as terms in the classes of a balanced dataset. The distribution of samples and terms is variable across the classes of an unbalanced dataset; therefore the single negative features ratio, which is determined empirically by the IGFSS, fails to select the most appropriate negative and positive features from all classes. This issue is solved by SVT because it selects positive and negative features using a set of negative features ratios (i.e. nfrs) derived using an improved mathematical model; it selects the positive and negative features based on their distribution in the class.

The statistical tests based on the Z-test statistic [21] are shown in Table 17. They illustrate the Holm/Shaffer values of the compared methods for the four classifiers. The compared methods are shown as algorithms in these tables. In total, 3 hypotheses are formed for each of the five methods (MI, IG, GI, DFS, and GR), and these hypotheses are denoted as i in the tables. The value of α has been selected as 0.05. The Holm's and Shaffer's procedures reject those hypotheses that have a p-value ≤ 0.033. The average ranking of the algorithms, prepared from these values, is presented in Table 18.

Table 17 Holm / Shaffer values
Table 18 Average rankings of the algorithms

5.2 Discussions

In order to compare the performances, two null hypotheses are assumed. First: “The performance of the MI, IG, GI, DFS, and GR methods is equal to that of IGFSS and SVT”, and second: “The performance of IGFSS and SVT is equal”. In most of the cases, the performance of the MI, IG, GI, DFS, and GR methods is lower than that of IGFSS and SVT. The performance of the classifiers is significantly improved in comparison to IGFSS when the GFSS based methods are ensembled with the SVT algorithm. Thus, both the first and the second null hypothesis are rejected, since the corresponding p-values fall below the Holm- and Shaffer-adjusted α thresholds.

The standard MI method assigns higher scores to low-frequency terms that appear in only one class. This shows that it has the capability to discriminate terms which are present in a specific class, i.e. terms with a positive nature. However, MI suffers in the case of overlapping terms, which are identified as negative nature terms in this paper. Thus, the experimental results obtained by MI show good scores for those datasets that have more positive nature terms than negative ones (e.g. Ohsumed10, Trec2004). The other methods, viz. IG, DFS, GI, etc., which assign higher weights to the most frequent negative nature terms, perform better on those datasets which have more negative nature terms than positive ones (e.g. Webkb, Classic4).

From the extensive experimental study and statistical analysis, it is found that the performance of all the feature selection methods is improved by embedding the SVT algorithm. The key points which are the main causes of the success of SVT over IGFSS and the other classical methods of GFSS are as follows:

  1. Strength of GFSS: (a) It selects the top-N scored, most representative features from all the classes. Weakness of GFSS: (a) It discards low-scored features of some classes either partially or completely, due to the dispersed distribution of the features in the classes.

  2. Strength of IGFSS: (a) It selects an equal number of the most representative features from all the classes. Weakness of IGFSS: (a) The OR method has its own weakness in assigning an adequate class label to the features. If a feature (say, to) occurs frequently in all the classes but is absent in one specific class (say, Cl), then the presence of this feature in a test sample (say, Dtest[i]) assures that the class label of Dtest[i] is not Cl. In this case, since the feature to is present in almost all the classes except Cl, the OR method assigns a very high positive value to it for all the classes but a smaller negative value for the class Cl. The IGFSS fails in this situation to assign the most appropriate class label and membership value to to. These types of terms are defined as common negative terms. (b) The negative features ratio is determined empirically; therefore, the selected positive and negative features are not adequate for all the classes.

  3. Strength of SVT: (a) It uses the ensemble votes of three methods, which gives better results than an individual vote and determines the most appropriate class label of the features. (b) The negative features ratio is determined using a mathematical model; therefore, the selected positive and negative features are the most appropriate for all the classes. Weakness of SVT: Although the weighted average score of the three methods (i.e. OR, CC, and GSS) balances the weaknesses of OR, it is not very effective in deciding the class label and membership of common negative features (e.g. the term to), due to their similar numeric scoring nature.

6 Conclusions

The main contribution of this study is a new Soft Voting Technique (SVT) for determining the most appropriate class labels of the features. SVT provides a generic solution for all filter-based global feature selection methods to select the most informative features based on the assigned class labels. The proposed SVT technique takes advantage of the ensemble result obtained from the weighted average score (Soft Vote) of three methods, i.e. Odds Ratio (OR), Correlation Coefficient (CC), and GSS Coefficients (GSS), to predict the class labels of the features. This technique can be useful for a set of equally well-performing methods in order to balance out their individual weaknesses. Although the SVT selects an equal number of features from each class, similar to IGFSS, the counts of positive and negative features are determined by an improved approach derived using a mathematical model. The use of Latent Semantic Analysis (LSA) at the initial level reduces the high-dimensional feature space into a smaller one. The constructed final feature set improves the scalability, efficiency, and accuracy of the classifiers on all five datasets used in this study, which proves the efficacy of the proposed Soft Voting Technique. In the future, more appropriate methods for the selection of positive and negative features need to be investigated.