Abstract
In text classification, the Global Filter-based Feature Selection Scheme (GFSS) selects the top-N ranked words as features. It discards the low ranked features from some classes either partially or completely. The low rank is usually due to varying occurrence of the words (terms) in the classes. The Latent Semantic Analysis (LSA) can be used to address this issue as it eliminates the redundant terms. It assigns an equal rank to the terms that represent similar concepts or meanings, e.g. four terms “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent a similar concept, i.e. “cancer”. Thus, any selected term by the algorithms from these four terms doesn’t affect the classifier performance. However, it does not guarantee that the selection of top-N LSA ranked terms by GFSS are the representative terms of each class. An Improved Global Feature Selection Scheme (IGFSS) solves this issue by selecting an equal number of representative terms from all the classes. However, it has two issues, first, it assigns the class label and membership of each term on the basis of an individual vote of the Odds Ratio (OR) method thereby limiting the decision making capability. Second, the ratio of selected terms is determined empirically by the IGFSS and a common ratio is applied to all the classes to assign the positive and negative membership of the terms. However, the ratio of positive and negative nature terms varies from one class to another and it may be very less for one class, whereas high for other classes. Thus, one common negative features ratio used by the IGFSS affects those classes of a dataset in which there is an imbalance between positive and negative nature words. To address these issues of IGFSS, a new Soft Voting Technique (SVT) is proposed to improve the performance of GFSS. There are two main contributions in this paper: (i) The weighted average score (Soft Vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficients (GSS) improves the numerical discrimination of words to identify there positive and negative membership to a class. (ii) A mathematical expression is incorporated in the IGFSS that computes a varying ratio of positive and negative memberships of the terms for each class. The membership is based on the occurrence of the terms in the classes. The proposed SVT is evaluated using four standard classifiers applied on five bench-marked datasets. The experimental results based on Macro_F1 and Micro_F1 measures show that SVT achieves a significant improvement in the performance of classifiers in comparison of standard methods.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Accurate and timely information is the basic need for effective decision making. Wondrous growth in the e-corpus of various fields (e.g. Business, Biomedical, Engineering, News, etc.) [30, 47] demands for an intelligent Decision Support System (DSS) that helps in an Automated Text Document Classification (ATDC) process [10, 11]. In this context, a model is built by observing the occurrence of words in the training set documents with known class labels. The trained model has the capability to predict the class labels of test documents with maximum accuracy [16, 42].
The prediction completely relies on the contents of the documents. Substantial contents of these documents are stored as text [44, 45]. The word (term) is the smallest constituent of text and play a vital role in the ATDC process [27, 33, 39]. The processing steps followed by the ATDC process are as follows: the first step extracts features from the entire corpus (i.e. generation of tokens from the text contents), after this some less informative words (e.g. stop words, punctuation marks, white spaces) are eliminated in the second step, and in third step lemmatization/stemming is performed in the remaining terms of the corpus. Finally, the resultant terms are used to build a vocabulary of the entire corpus [4]. The resultant terms of this vocabulary are represented by vectors, where the frequency of each term in the documents represents a vector [34]. The collection of term vectors in a matrix form is called a vector space. In this vector space, each individual term constitutes one dimension. For a typical document collection, there may be millions of terms, hence ATDC requires to cater a large number of dimensions which makes the classification process cumbersome [19, 30]. In literature, feature selection techniques are used to select the most relevant features by eliminating the less important features. Feature selection increases the performance as well as the speed of the classifier and is considered as an important step in the classification process.
The Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer [34] is applied to normalize the weight of the terms, but the TF-IDF vectors tend to be of high dimension since they have one component for every term in the vocabulary. The terms which represent similar concepts or with similar meanings are treated as individual words in the corpus and enormously increase the size of TF-IDF vectors.
A linear combination of terms defined using Latent Semantic Analysis (LSA), identifies the relationships among terms.Footnote 1 The LSA creates a vector representation of a document that helps to compare documents based on similarity. It assigns an equal weight to the terms representing similar concepts or meanings.Footnote 2 E.g., the four words “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent a similar concept, i.e. “cancer”. Thus, an equal weight is assigned by LSA to these words and they contribute equally to the resulting LSA component and any selected word by the algorithms doesn’t affect the classifier performance [12]. Further, the LSA processed vector space has a considerable number of features that are not relevant to the text of the specific class. In order to improve the scalability of ATDC, an effective feature selection technique is needed to reduce the feature set.
Many feature subset selection methods have been proposed and studied in machine learning paradigm. It can be broadly divided into three categories: Filter, Wrapper, and Embedded [6, 32, 44]. The filter methods compute the score of a feature using an evaluation function. It is independent of any classification algorithm and determined using the mutual correlation of the data. In contrast, the wrappers and embedded methods require a frequent classifier interaction in their flow to estimate the value of a given subset. The requirement of a classifier interaction may increase running time and force the feature selection method to work according to a specific learning model. Thus, filter-based methods are preferred more in comparison to wrappers and embedded methods [32, 44]. The Global Filter-based Feature Selection Scheme (GFSS) assigns a score to each feature and the topmost, N features are selected using this score, where N is an empirically determined number [23, 44]. The Filter-based methods are further subdivided into One Sided Local Filter-based Feature Selection Scheme (OLFSS) and Global filter-based feature selection scheme (GFSS).
In the OLFSS, the local class-based score for each feature is computed and used as a final score. The GFSS follows a global policy and converts the multiple local scores into a global score to compute the final score of the features. The local and global scores can be directly used in the feature ranking. The features are sorted in descending order and the top-N features are included in the Final Feature Set (FSS). The Information Gain (IG), Gini Index (GI), Distinguishing Feature Selector (DFS) and Gain Ratio (GR) etc. are known as the methods of GFSS, whereas Mutual Information (MI), Odds Ratio (OR), GSS coefficients (GSS), and Correlation Coefficient (CC) as a method of OLFSS [44]. The selected discriminating features of FFS are used by the classifiers in the final step of the ATDC process. A meta-heuristic technique can also be used to search for a configuration that produces a highly effective text classifier. This model selection procedure is commonly named in the literature as hyper-parameter optimization [43].
Although, the GFSS method improves the performance of the classifiers, but it has some limitations. The GFSS is suitable for the balanced dataset where each class contains an equal number of documents along with a sufficient number of terms. In the case of an unbalanced dataset, having a large number of classes with variable distribution of terms, affects the performance of GFSS. The GFSS eliminates the informative features of the class either partially or completely from the topmost, N features. Most of the studies in the literature are focused on providing some improvements to specific feature selection methods rather than providing a new generic scheme.
Uysal [44] extended the work of [19] and proposed a solution named an Improved Global Feature Selection Scheme (IGFSS). IGFSS selects an equal number of representative features from each class in the final feature set. There are mainly two issues with the IGFSS, first, it assigns the class label of each feature with positive or negative membership by considering an individual vote of the Odds Ratio (OR) method [34].
An individual method may have some weakness such as the OR method assigns a positive score to a term for a class if the occurrence of the term is more in that class, otherwise, a negative score is assigned. However, the numerical difference between the OR score of the terms for positive and negative membership is very less. Thus, it affects the process of the class label and membership assignment. In this paper, the membership of terms is referred to as nature of terms, i.e. positive or negative membership of a term means positive or negative nature of the terms. Second, the ratio of negative nature features is determined empirically by the IGFSS and a common negative features ratio is applied to all the classes to select the positive and negative nature features. It affects those classes of a dataset which have more positive features than negative or vice-versa [4].
To address these two issues, a new technique named Soft Voting Technique (SVT) is proposed. It is based on the presumption that the ensemble votes of several methods give better results than an individual vote and determines the most appropriate class label of the features. This technique can be useful for a set of equally well-performing methods to balance out their individual weaknesses. The flow of SVT is similar to IGFSS, but in an improved way and based on the key points of IGFSS we have given a generic solution for the filter based feature selection methods. There are two main contributions in this paper:
-
1.
The SVT uses the weighted average score (Soft Vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficients (GSS) to predict the class labels of terms and it computes a more balanced score of a word than OR which improves numerical discrimination of positive and negative nature words.
-
2.
A mathematical expression is incorporated in IGFSS that computes a varying ratio of positive and negative nature terms for each class which is based on the occurrence of the terms in the classes.
The proposed SVT is evaluated using four standard classifiers, viz. Linear Support Vector Machine (LSVM), Softmax regression (SOFT MAX), Stochastic Gradient Descent Classifier (SGDC), and RIDGE. The classifiers are applied to five benchmarked text data sets, viz. Webkb, Classic4, Reuters10, Trec2004 and Ohsumed10. The experimental results of SVT, which is based on Macro_F1 and Micro_F1 are compared with classical information science methods and IGFSS.
The rest of the paper is organized as follows. In Section 2, a brief overview of the state-of-the-art methods and related works are discussed. Section 3 presents the details of the proposed SVT. The experimental setup and performance evaluation measures are discussed in Section 4. Section 5 present the experimental results and discussions. Finally, the paper concludes in Section 6.
2 Related works
Substantial works have been carried out in the area of filter-based feature selection. The most common methods, viz. Mutual Information (MI), Information Gain (IG), Distinguishing Feature Selector (DFS), Gini Index (GI), Gain Ratio (GR), and Odds Ratio (OR) are briefly described as follows:
Mutual information (MI) concept [18, 48, 49] is carried out from information theory to measure the dependencies between random variables and used to measure the information contained by a term ti. It is strongly influenced by the marginal probabilities of the terms. It assigns higher weight to rare terms than common and sparse terms. Therefore, the weights of the terms are not comparable for the terms with widely differing frequencies. The final score (i.e. MI) of term ti is the maximum class-based score as shown in (1). The brief preliminary notations are shown in the Table 1.
Information Gain (IG) [18, 44, 45, 48, 49] assigns higher weight to common terms distributed in many categories than rare terms. The IG is also known as average Mutual Information. (see (2)).
Gini Index (GI) is a global feature selection method for text classification which can be defined as an improved version of an attribute selection algorithm used in decision tree construction (see (3)) [44].
Distinguishing Feature Selector (DFS) [44, 45] is an improvement of Mutual Information by reducing the effect of marginal probabilities of the terms by normalizing the weight of the terms. It gives weight of the term in a range of [0,1] defined by (4).
Gain Ratio (GR) is proposed in information science to reduce the effect of the most common terms and marginal probabilities of the terms by normalizing their weights, obtained using IG [24] (see (5)).
Odds Ratio (OR) reflects the odds of the word occurring in the positive class normalized by that of the negative class. It has been used for relevance ranking in information retrieval [18, 28, 34, 44, 46] (see (6)).
Correlation Coefficient CC(ti,Cj) of a word ti with a category Cj is a variant of the χ2 metric, where CC2 = χ2 × CC can be viewed as a “one-sided” chi-square metric. The positive values correspond to features indicative of membership, while negative values indicate non-membership. The greater (smaller) the positive (negative) values are, the stronger the terms will be to indicate the membership (non-membership) [36, 39, 50].
GSS Coefficient (GSS) is another simplified variant of the χ2 statistics proposed by [20]. Similar to CC, the positive values correspond to features indicative of membership, while negative values indicate non-membership [50].
Uysal [44] proposed an ensemble method named as Improved Global Feature Selection Scheme (IGFSS) which provides a generic solution for the GFSS. The IGFSS has merged the power of local and global feature selection methods. It is an ensemble of OR with any one method of GFSS at a time. The OR is used to assign the class label as well as membership value to the features. It computes the negative value of a feature for the class, if the presence of that feature is very less or none in that class. Similarly, a positive value of a feature for the class, if it occurs most frequently in the class. Further, the IGFSS uses the maximum absolute score of the feature for a class to assign the class label and the sign of the maximum value is used to find out the membership of the feature.
The summarized steps of IGFSS, presented in Algorithm 1 is as follows: (i) Step 5 computes a global score of each term using methods of GFSS (i.e. MI, IG, GI, DFS, and GR), (ii) Step 6 sort the terms based on their computed global score, (iii) Step 7 determines the class label of features, (iv) Step 8 computes positive and negative membership of features, (v) Step 9 computes positive and negative features count for each class, (vi) Steps 10-14 determine the selection criterion based on positive and negative features ratio, and (vii) Steps 14-16 select an equal number of, the most informative features from each class by applying all the above steps.
In IGFSS algorithm, in the worst case, all features need to be traversed once and some of them may be traversed two times while constructing the candidate feature set. Let L be the total number of documents, r as the total number of classes, p as the total number of terms, and m as the number of terms obtained after removal of less informative terms, viz. stop words, punctuation marks, and white spaces. Let N be the number of IGFSS weighted terms that are selected as the most informative terms based on the length of the final feature set.
The values of n, r, m and N are much smaller compared to p, because the total number of terms p is in millions, and others are in the hundreds or thousands. Thus, the overall time complexity of the Algorithm 1 is computed as Θ(p). In the special cases, if the number of documents is in millions and there are fewer terms and classes in comparison to the documents. The resultant number of terms is also in millions because they are extracted by the combination of these documents into a corpus. Thus, the time complexity of the Algorithm 1 will be Θ(n).
3 Proposed Soft Voting Technique (SVT)
Having reviewed the related studies which helped in the ATDC process, it is found that in most of the reported techniques the class label of features is determined using a single criteria [6, 44]. However, the ensemble techniques are less explored in Text Mining to decide the class label of the features. Using single criteria for feature selection has shown the limited capability in knowledge discovery and decision making systems [41]. Therefore, we introduce a new Soft Voting Technique (SVT) to determine the most appropriate class labels of the features. The IGFSS improved the GFSS by using an individual vote of Odds Ratio (OR) method to define the class label of each feature. We extend this state-of-the-art approach by providing a more generic solution for all filter-based feature selection method. The SVT is based on the presumption that the ensemble votes of several methods can yield better results than an individual vote. This technique can be useful for a set of equally well-performing model in order to balance out their individual weaknesses. The SVT uses the weighted average score (Soft Vote) of three methods, viz. Odds Ratio (OR), Correlation Coefficient (CC), and GSS Coefficients (GSS) to predict the class label of features.
The central idea behind use of weighted average score (i.e. Soft Vote) of these three methods is as follows: (i) all three methods, i.e. OR, CC, and GSS assign a positive score to a term for a class if the occurrence of a term is more in that class, otherwise a negative score is assigned, (ii) the numerical difference between the OR score of positive and negative nature terms is very less, (iii) however, the numerical range of scores assigned to the terms by OR is higher than CC and GSS methods, (iv) thus, the resultant sum of OR, CC, and GSS is not much different from OR score, but their weighted average score is a more balanced score than OR, and (v) it helps in discrimination of positive and negative nature of terms for a class due to the computation of balanced numerical difference by the SVT.
To address the second issue of IGFSS, i.e instead of using one common negative features ratio for each class a mathematical expression is incorporated in IGFSS which computes a varying ratio of positive and negative features for each class. The resulting ratio solves the problem of imbalance between positive and negative nature of terms in each class and select the most appropriate positive and negative nature terms from each class.
3.1 Explanation by synthetic data
The concept of SVT is now explained using a synthetic dataset shown in the Table 2. Assume that, we have to select the top 6 words from this data. The OLFSS and weighted average scores of the features of this dataset is shown in Table 3. Further, the class labels as well as membership assigned to these features using SVT are shown in Table 4. The SVT and IGFSS both select two features from each class as shown in Table 5.
It can be observed from Table 3 that all three methods, i.e. OR, CC, and GSS assign a positive score to a term for a class if the occurrence of term is more in that class, otherwise a negative score is assigned. However, there is less numerical difference among OR scores of positive and negative nature terms for the classes. Also, the numerical range of scores assigned to the terms by the OR is higher than CC and GSS methods. Thus, the resultant sum of OR, CC, and GSS does not have much difference from the OR score, but their weighted average score is a more balanced score than OR. It helps in discrimination of positive and negative nature of terms for a class due to balanced numerical difference among scores of terms. The selection process of SVT, IGFSS, and GFSS based on IG is explained as follows:
The SVT follows the structure of IGFSS as shown in the Algorithm 2 to select the final feature set. The entire process for assignment of class labels to the features and the flow of SVT is shown in the Algorithm 2. In SVT, (25)–(26) are used to compute the negative features ratio (nfr) and positive features ratio (pfr) for all three classes as follows:
-
1.
nfr[C1] = 2/(3 + 2) = 2/5 = 0.4, C1 = 1 − nfr = 1 − 0.4 = 0.6
-
2.
nfr[C2] = 2/(3 + 2) = 2/5 = 0.4, C2 = 0.6
-
3.
nfr[C3] = 2/(3 + 2) = 2/5 = 0.4, C3 = 0.6
Therefore, using (27) the equal split criteria if the length of Final Feature Set (FFS) is 6 EqualSplit = 6/3 = 2. Further, using (28) the selected positive words count in class C1 = EqualSplit × nfr[C1] = 2 × 0.6 = 1.2 ≈ 1, in class C2 = 2 × 0.6 = 1.2 ≈ 1, and in class C3 = 2 × 0.6 = 1.2 ≈ 1. Similarly, using (29) the selected negative words count in class C1 = 2 × 0.4 = 0.8 = 1, in class C2 = 2 × 0.4 = 0.8 = 1, and in class C3 = 2 × 0.4 = 0.8 = 1.
Whereas, the IGFSS chooses a common nfr from the set of nfrs in the range of 0 to 1 that is based on the experimental evaluation, e. g. if we choose nfr = 0.8, then pfr = 1 − 0.8 = 0.2. The selected positive words count in class C1 = 2 × 0.2 = 0.4 ≈ 0, in class C2 = 2 × 0.2 = 0.4 ≈ 0, and in class C3 = 2 × 0.2 = 0.4 ≈ 0. Similarly, the selected negative words count in class C1 = 2 × 0.8 = 1.6 = 2, selected negative words count in class C2 = 2 × 0.8 = 1.6 = 2, and selected negative words count in class C3 = 2 × 0.8 = 1.6 = 2. Thus, it can be observed from above discussions that SVT has used varying nfrs and pfrs instead of a common nfr and pfr which is chosen by the IGFSS. The SVT follows distribution of positive and negative nature terms in the classes while computing the nfr and pfr values. The distribution of words in the three classes C1, C2, and C3 using SVT and IGFSS are as follows:
-
1.
C1- pos:‘shark’, ‘hagfish’, ‘ray’, neg: ‘leopard’, ‘tiger’, C2-pos:‘turtle’, ‘lizard’, ‘snack’, neg: ‘cow’, ‘deer’.
-
2.
C3- pos: ‘penguin’, ‘ostrich’, ‘emu’, ‘bird’, neg: ‘goat’, ‘toad’.
-
3.
IG selected top 6 features→ ‘shark’, ‘turtle’, ‘emu’, ‘penguin’, ‘ostrich’, ‘lizard’ (C1 = 1, C2 = 2, C3 = 3).
-
4.
IGFSS selected top 6 features→C1: ‘leopard’, ‘tiger’, C2: ‘cow’, ‘deer’, and C3: ‘goat’, ‘toad’.
-
5.
SVT selected top 6 features→ C1: ‘shark’, ‘tiger’, C2: ‘turtle’, ‘deer’, C3: ‘emu’, ‘goat’.
It has been observed from the results on synthetic data that the IGFSS has used one negative features ratio for all the classes. Whereas, the SVT has selected negative and positive features from each class based on distribution of features in the classes as negative or positive.
The negative nature of features has similar importance as the positive nature of features to discriminate the class label of a document. E.g., the two features ‘leopard’ and ‘tiger’ has been selected as negative features for class C1 by the IGFSS method, their presence in a document ensures that this document cannot be classified as class C1. These two features are present in the documents of class C2 and C3, but absent in the documents of class C1. Therefore, the two features ‘leopard’ and ‘tiger’ became negative features for class C1. The similar situation occurs for other negative features of this dataset. The GFSS based on IG has selected top 6 features, i.e. from class C1 = 1, C2 = 2, and C3 = 3.
As shown in Table 5, there are a less number of features from class C1 than C3 and C2. This issue has been resolved by choosing an equal number of features from each class. The assignment of the class labels of the features using IGFSS depends upon an individual vote of the OR method, whereas, the SVT has used the soft voting technique by computing the weighted average score of the features. Thus, SVT reduces the bias in feature selection towards single criteria by an ensemble of three methods method, viz. OR, GSS, and CC.
The stop words (e.g. “this”, “that”, “those”, etc.), punctuation marks, white spaces, links, email addresses, numbers, etc. are less informative to decide the class label of the documents in ATDC. They are removed from the corpus in the pre-processing steps [1,2,3, 22, 44] with function PREPROCESSING(D) as follows:
function Preprocessing(D){
-
1.
T = [t1,t2,...,tp] ← Tokenizer(D) // Tokenization
-
2.
T = stopWordsRemoval(T)
-
3.
T = punctuationMarksRemoval(T)
-
4.
T = whiteSpaceRemoval(T)
-
5.
T = lemmatize(T) // convert the word to its root form (e.g. went, gone–> go)
-
6.
T ← [t1,t2,...,tm] // Where m < p
-
7.
return (T) }
The steps of SVT can be summarized as follows,
-
1.
The SVT computes the weighted average score (Soft Vote) of the three methods, i.e. OR, CC, and GSS to find out the final score of the features. Further, it is used to determine the class label of the features. In this regard, it computes the negative value of a feature for the class, if the presence of that feature is very less or none in that class. Similarly, a positive value of a feature for the class, if it occurs most frequently in the class. Further, the SVT uses the maximum absolute score of the feature for a class to assign the class label and the sign of the maximum value to find out the membership of the feature.
-
2.
The features are sorted in descending order using the scores obtained using any one method of GFSS at a time. Further, the negative and positive features ratios are derived using a mathematical model as shown in (26)–(30).
4 Experimental setup and performance evaluation
In order to evaluate the performance of SVT over IGFSS and various methods of GFSS (MI, IG, GI, DFS, and GR), all the experiments have been carried out on a machine with Intel core i7, 8GB RAM, 1.8 GHz Processor in UBUNTU 16.04 64-bit OS. The process of document classification- tokenization, preprocessing of the words of the corpus (D), feature extraction (t[m] ⊂ t[p]), feature selection (t[k] ⊂ t[m]), classification, and performance analysis are performed in Python 2.7 with nltk, scipy, numpy, ipython notebook, scikitlearn, matplotlib etc. packages.Footnote 3 To speed up the computing process and resolve the memory related issues the entire corpus is sliced into multiple arrays of each class, in spite of loading entire corpus into a single array. The number of features selected for analysis is in the range of 300, 400, and 500. The statistical tests have been performed using Java and KEEL software tool to evaluate the performance of the proposed SVT method with other compared methods using LSVM, SOFT MAX, SGDC, and RIDGE classifiers. The average rankings of the compared methods are also computed using the Java and KEEL software tool.Footnote 4
4.1 Data set
In this study, five distinct standard text datasets (viz. Reuters10, Ohsumed10, Webkb, Classic4, and Trec2004) with varying characteristics were used for the assessment of the proposed technique (see Table 6). The brief description of these datasets is as follows: The Reuters10 dataset consists of top-10 classes of the Reuters-21578 dataset.Footnote 5,Footnote 6 The Ohsumed10 dataset [35, 37] is a subset containing most frequent 10 categories of original Ohsumed23 dataset. The Ohsumed10 is highly dense, unbalanced and challenging dataset. The Webkb datasetFootnote 7 consists of four classes [14]. In the Webkb dataset, the “student” class has the most samples, whereas the “project” class has the least samples. The Classic4 datasetFootnote 8 and its class distribution is nearly homogeneous among four classes [29]. In Classic4 dataset the most of the samples are from class “cacm”, whereas, class “med” has least number of samples. The Trec2004 dataset [13] is the original subset of MEDLINE for the TREC 2004 Genomics Track9. The Trec2004 dataset consists of 10 years of completed citations from the database includes from 1994 to 2003. The full text articles are extracted from the Pubmed databaseFootnote 9 in the form of XML file. These articles are based on four categories: mouse tumor biology (tumor), embryologic mouse gene expression (expression), mouse gene ontology (GO), and alleles of mutant mouse phenotypes (allele). The above four categories of documents are searched and saved in xml files. Subsequently, pubmed id, title, and abstract are parsed from relevant xml files using the R xml parser.Footnote 10
The training and test documents are already defined in the Reuters10 dataset, whereas for other datasets, viz. Ohsumed10, Webkb, Classic4, and Trec2004 the stratified nested 5-fold cross-validation scheme is used to split the dataset in training and test sets. This cross-validation object is a variation of k-fold that returns stratified folds; each fold contains approximately the same percentage of samples of each target class as the complete set. Stratification is the process of rearranging the data as to ensure each fold is a good representative of the whole.
4.2 Classification algorithms
In order to prove the efficacy of the proposed technique four state-of-the-art classifiers viz. Linear Support Vector Machine (LSVM) [27, 33], SOFT MAX classifiers [8], Stochastic Gradient Descent Classifier (SGDC) [9], and RIDGE [38] are employed on text datasets. As the text classification problems tend to be quite high dimensional (many features) and the high dimensional problems are likely to be linearly separable. Therefore, the performance of the linear classifiers is likely, well if SOFT MAX, SGDC, RIDGE or LSVM is used with a linear kernel. However, to get good performance the regularization parameters need to be properly tuned. In the experiments, python scikit-learnFootnote 11 are used to classify the documents [17]. A brief description of the methods are as follows:
A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space which can be used for classification, regression or other tasks. The SVM is one of the most successful classifiers for text classification. SVM search for a decision boundary that is maximally far away from any data point. The distance from the decision surface to the closest data point determines the margin of the classifier. The SVM classifier is based on the margin maximization concept [27, 33].
Softmax Regression (synonyms: Multinomial Logistic, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that we can use for multi-class classification (under the assumption that the classes are mutually exclusive). In contrast, we use the (standard) Logistic Regression model in binary classification tasks.Footnote 12 SVM methods required less variables than Logistic Regression to achieve a better (or equivalent) performance. The sigmoid logistic function is replaced by the softmax function in this classifier [8].
Stochastic Gradient Descent (SGD) is a simple and very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines. The SGD has been successfully applied to the large-scale and sparse machine learning problems often encountered in the text classification and natural language processing [9].
The ridge classifier uses ridge regression with L2 Regularization technique to classify the data. The stochastic average gradient descent solver used in ridge regression classifier to speed up the iterative procedure of the classification. The Ridge regression is simply a type of linear regression that controls the magnitude of the coefficients to mitigate the effect of over-fitting. The major strength of this classifier is that there is no need for feature selection if the model is properly tuned using the regularization parameter [38].
4.3 Performance evaluation measures
In this paper, the benchmarked macro and micro averaged F1 measures [44] are used to evaluate the performance of classifiers. The F-measure (Fβ and F1) can be interpreted as a weighted harmonic mean of the precision and recall. The Fβ score weights recall more than precision by a factor of beta. A Fβ measure reaches its best value at 1 and its worst score at 0. With β = 1, Fβ and F1 are equivalent, and the recall and the precision are equally important.Footnote 13 The accuracy gives the same weight to all classes and it is not suitable for imbalanced datasets. The Macro_F1 measure computes metrics for each label, and find their unweighted mean and does not consider label imbalance. Whereas Micro_F1 calculate metrics globally by counting the total true positives, false negatives and false positives. In this context, the notions of precision (macro (31) and micro (32)), recall (macro (33) and micro (34)), accuracy (35), F_β (36), Macro_F1 (37), and Micro_F1 (38) measures are as follows:
Where C = 1 to C = r represent r class labels and n(C) is the count of the total number of classes. The TP is the count of true positives, FP is the count of false positives, FN is the count of false negatives, TN is the count of true negatives. In the multi class and multi label classification task, the notions of precision, recall, and F-measures can be applied to each class label independently.
The Z-test statistics [21] has been used to evaluate the performance of VGFSS over IGFSS and the methods of GFSS. A set of pairwise comparisons can be associated with a set or family of hypotheses. As [15] explained, the test statistics for comparing the ith and jth classifier is,
where Ri and Rj is the average rank computed through the Friedman test for the ith and jth classifier, k is the number of classifiers to be compared and N is the number of data sets used in comparison. The z value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is further compared with an appropriate level of significance α. There are two basic procedures for doing that [31]:
-
1.
Holm’s Procedure [25]: The value of α is adjusted using a step-down method. Let p1,...,pm be the ordered p-values arranged in ascending order and H1,...,Hm be the comparable hypotheses. The Holm’s procedure rejects the hypothesis from H1 to H(i− 1), if pi > α/(m − i + 1), where i is the smallest integer.
The set of all pairwise comparisons builds a group of a logically interrelated hypothesis. If there are three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers Ci, where i = 1,2,3. Therefore, the true and false hypothesis of all the comparisons are not possible and if any one of these hypotheses is false, then at least one other must be false. For example, if C1 is better or worse than C2, then it is not possible that C1 has the same performance as C3 and there cannot be one false and two true hypotheses among these three relations at the same time. Shaffer [40] proposed two procedures to address this issue and used a logical relation between the family of hypotheses to adjust the value of α.
-
2.
Shaffer’s static procedure: Similar to Holm’s step down method, at stage j, instead of rejecting Hi if pi ≤ α/(m − i + 1), reject Hi if pi ≤ α/ti, where ti is the maximum number of hypotheses which can be true given that any (i − 1) hypotheses are false. Shaffer’s dynamic procedure: It is an improvement of static procedure and uses the value α/ti* in place of α/ti at stage i, where ti* is the maximum number of hypotheses that could be true, given that the previous hypotheses are false. It is a dynamic procedure since ti* depends not only on the logical structure of the hypotheses but also on the hypotheses already rejected at step i.
The presented results of this study are based on a static procedure. However, the hypothesis can be examined using other advanced [26] dynamic procedure on the presented experimental results.
5 Results and discussions
In this section, the experimental results of the proposed SVT algorithm and its comparison with classical information science methods and IGFSS are presented. The results are analyzed based on the selected features (viz. 300, 400, and 500) and the performance of the four classifiers. The dispersed distribution of the features in the various classes of all five datasets can be observed from Figs. 1, 2, 3, 4 and 5. The distribution of features in the classes is based on the class labels assigned by the IGFSS and SVT. The classifier results based on Macro_F1 and Micro_F1 measures are shown in the Tables 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16. The maximum performance measures achieved by the algorithms are shown in bold letters in these tables.
The average rankings of the algorithms GFSS (MI, IG, GI, DFS, and GR), GFSS + IGFSS, and GFSS + SVT is shown in the Table 18. In this table, the Micro_F1 based average rank of MI, MI+IGFSS, and MI+VGFSS are 2.942, 2.025, and 1.033 respectively. Here, the highest value of average rank of the algorithm means the last performer, while a minimum average rank value means top performer. The GFSS+SVT is found to be the top performer in all four classifiers (as shown in bold letters).
5.1 Data and statistical analysis
As can be observed from Figs. 1–5 that due to the selection of two principal components using LSA the total number of features selected using SVT has been reduced (≈ up to 1000 features) in comparison to IGFSS for each dataset. The LSA assigns an equal weight to the words (e.g. “shark”, “hagfish”, and “rays”) representing similar concepts or meanings (category of “fish”) and it doesn’t matter which word is selected by the algorithm in the final feature set. The IGFSS selects one common negative features ratio (nfr) for each class while the positive and negative features have been selected based on that nfr value. It affects those classes of a dataset which have more positive features than negative or vice-versa. The SVT solves this issue by selecting a variable negative and positive features ratio for each class based on the count of negative and positive features in the classes. Thus, SVT has selected an equal number of features from each class, but a variable number of positive and negative features (see the Figs. 1–5). For example, in the Reuters10 dataset, there is less percentage of positive features for “acq” class selected by the IGFSS, whereas the SVT selects a balanced percentage of positive and negative features not only for the “acq” class but also for all classes.
Similar cases can be observed for all other datasets, as shown in the Figs. 2–5. However, the changes are not as effective in case of a balanced dataset (e.g. Webkb, Classic4, and Trec2004) in comparison to an unbalanced dataset (e.g. Reuters10, Ohsumed10). It is due to the almost similar distribution of samples as well as terms in the classes of the balanced dataset. The distribution of samples and terms are variable in the classes of an unbalanced dataset, therefore one negative feature ratio, which is determined empirically by the IGFSS, lacks in selecting the most appropriate negative and positive features from all classes. This issue is solved using SVT because it selects positive and negative features using a set of negative features ratio (i.e. nfrs) derived using an improved mathematical model. It selects the positive and negative features based on their distribution in the class.
The statistical tests based on Z-test statistics [21] is shown in the Table 17. They illustrate the Holm/Shaffer values of compared methods for four classifiers. The compared methods are shown as algorithms in these tables. There are total 3 hypotheses formed for every five methods (MI, IG, GI, DFS, and GR) and these hypotheses are denoted as i in the tables. The value of α has been selected as 0.05. The Holm’s and Shaffer’s procedure rejects those hypotheses that have a p-value ≤ 0.033. The average ranking of the algorithms has been presented in Table 18 which is prepared by these values.
5.2 Discussions
In order to compare the performances, two null hypothesis is assumed, First: “The performance of MI, IG, GI, DFS, and GR methods is equal to IGFSS, and SVT”, and Second: “The performance of IGFSS and SVT is equal”. In most of the cases, the performance of the MI, IG, GI, DFS, and GR methods are lower than IGFSS and SVT. The performance of the classifiers is significantly improved in comparison to IGFSS when the GFSS based methods are an ensemble with the SVT algorithm. Thus, both the first and second null hypothesis are rejected due to the lower values of α, Holm, and Shaffer from the standard.
The standard MI method assigns higher scores to the low-frequency terms that appear in only one class. It proves that it has the capability to discriminate terms which are present in a specific class, i.e. the terms with a positive nature. However, MI suffers in case of overlapping terms which are identified as negative nature terms in this paper. Thus, the experimental results obtained by MI shows good scores for those datasets that have more positive nature of terms in comparison of negative (e.g. Ohsumed10, trec2005). Other methods, viz. IG, DFS, GI, etc. which assign higher weights to most frequent negative nature terms have performed better in those datasets which have more negative nature terms than positive (e.g. Webkb, Classic4).
From extensive experimental study and statistical analysis, it is found that the performance of all the feature selection methods has been improved by embedding the SVT algorithm. The key points which are the main cause of the success of SVT over IGFSS and other classical methods of GFSS are as follows:
-
1.
Strength of GFSS: (a). It selects top-N scored most representative features from all the classes. Weakness of GFSS: (a). It discards low scored features from some classes either partially or completely, due to dispersed distribution of the features in the classes.
-
2.
Strength of IGFSS: (a). It selects an equal number of most representative features from all the classes. Weakness of IGFSS: (a). The OR has its own weakness to assign the adequate class label to the features. If a feature (say, to) is present more frequently in all the classes, but absent in any specific class (say Cl) then the presence of this feature in a test sample (say, Dtest[i]) assures that the class label of Dtest[i] is not class Cl. In this case, as the feature to is present in all most all the classes except Cl, the OR method assigns a very high positive value of it to all the classes but a lesser negative value for the class Cl. The IGFSS fails in this situation to assign the most appropriate class label and the membership value of to. These types of terms are defined as common negative terms. (b). The negative features ratio is determined empirically, therefore, the selected positive and negative features are not adequate for all the classes.
-
3.
Strength of SVT: (a). It uses the ensemble votes of three methods which gives better results than an individual vote and determines the most appropriate class label of the features. (b). The negative features ratio is determined using a mathematical model, therefore, the selected positive and negative features are the most appropriate for all the classes. Weakness of SVT: Although, the weighted average score of three methods (i.e. OR, CC, and GSS) balances the weaknesses of OR but not much effective to decide the class label and membership of the common negative features due to their similar numeric scoring nature, e.g. in case of term to.
6 Conclusions
The main contribution of this study is to introduce a new Soft Voting Technique (SVT) for determination of most appropriate class labels of the features. SVT has provided a generic solution for all filter-based global feature selection methods to select the most informative features based on assigned class labels. The proposed SVT technique has used the advantage of ensemble results obtained from the weighted average score (Soft Vote) of three methods, i.e. Odds Ratio (OR), Correlation Coefficient (CC), and GSS Coefficients (GSS) to predict the class labels of the features. This technique can be useful for a set of equally well-performing methods in order to balance out their individual weaknesses. Although, the SVT has selected an equal number of features from each class similar to IGFSS, but the process for selection of positive and negative features count has followed an improved approach derived using a mathematical model. The use of Latent Semantic Analysis (LSA) at the initial level has reduced the high-dimensional feature space into a smaller one. The constructed final feature set has improved the scalability, efficiency, and accuracy of classifiers in all the five datasets used in this study and proved the efficacy of the proposed Soft Voting Technique. In the future, there is a need to find out some more appropriate methods for the selection of positive and negative features.
Notes
References
Agnihotri D, Verma K, Tripathi P (2014) Pattern and cluster mining on text data. In: Fourth international conference on communication systems and network technologies. IEEE Computer Society, CSNT, Bhopal, pp 428-432. https://doi.org/10.1109/CSNT.2014.92
Agnihotri D, Verma K, Tripathi P (2016) Computing correlative association of terms for automatic classification of text documents. In: Proceedings of the third international symposium on computer vision and the internet, ACM, pp 71–80
Agnihotri D, Verma K, Tripathi P (2016b) Computing symmetrical strength of n-grams: a two pass filtering approach in automatic classification of text documents. SPRINGERPLUS 5(942):1–29
Agnihotri D, Verma K, Tripathi P (2017) An automatic classification of text documents based on correlative association of words. J Intell Inform Syst. https://doi.org/10.1007/s10844-017-0482-3
Agnihotri D, Verma K, Tripathi P (2017) Mutual information using sample variance for text feature selection. In: Proceedings of the 3rd international conference on communication and information processing, ACM, New York, NY, USA, ICCIP ’17, pp 39–44. https://doi.org/10.1145/3162957.3163054
Agnihotri D, Verma K, Tripathi P (2017) Variable global feature selection scheme for automatic classification of text documents. Expert Syst Appl 81:268–281. https://doi.org/10.1016/j.eswa.2017.03.057. http://www.sciencedirect.com/science/article/pii/S0957417417302208
Agnihotri D, Verma K, Tripathi P, Choudhary N (2018) A review of techniques to determine the optimal word score in text classification. In: Perez GM, Tiwari S, Trivedi MC, Mishra KK (eds) Ambient communications and computer systems. Springer, Singapore, pp 497–507
Alejandro SD, VAJIA N, Carlos SJ (2012) Comparison between svm and logistic regression: which one is better to discriminate?. Revista Colombiana de EstadÃstica 35:223–237. http://www.scielo.org.co/scielo.php?script=sci_arttext&pid=S0120-17512012000200003&nrm=iso
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, Physica-Verlag HD, pp 177–186
Caulkins JP, Ding W, Duncan G, Krishnan R, Nyberg E (2006) A method for managing access to web pages: Filtering by statistical classification (fsc) applied to text. Decision Support Syst 42(1):144–161. https://doi.org/10.1016/j.dss.2004.11.015. http://www.sciencedirect.com/science/article/pii/S0167923604002635
Chan SW, Chong MW (2004) Unsupervised clustering for nontextual web document classification. Decision Support Syst 37(3):377– 396. https://doi.org/10.1016/S0167-9236(03)00035-6. http://www.sciencedirect.com/science/article/pii/S0167923603000356
Chen Y, Zhang H, Liu R, Ye Z, Lin J (2018) Experimental explorations on short text topic mining between lda and nmf based schemes. Knowledge-Based Syst. https://doi.org/10.1016/j.knosys.2018.08.011. http://www.sciencedirect.com/science/article/pii/S0950705118304076
Cohen AM, Hersh WR (2006) The trec 2004 genomics track categorization task: classifying full text biomedical documents. J Biomed Discov Collab 1(1):4. https://doi.org/10.1186/1747-5333-1-4
Craven M, McCallum A, PiPasquo D, Mitchell T, Freitag D (1998) Learning to extract symbolic knowledge from the world wide web. Tech. Rep. No. CMU-CS-98-122, Carnegie-Mellon University Pittsburgh pa School of Computer Science
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30. http://dl.acm.org/citation.cfm?id=1248547.1248548
Du M, Chen XS (2013) Accelerated k-nearest neighbors algorithm based on principal component analysis for text categorization. J Zhejiang University Sci C 14(6):407–416. https://doi.org/10.1631/jzus.C1200303
Fabian P, Gaël V, Alexandre G, Vincent M, Bertrand T, Olivier G, Mathieu B, Peter P, Ron W, Vincent D, Jake V, Alexandre P, David C, Matthieu B, Matthieu P, Édouard D (2011) Scikit-learn: Machine learning in python. J Mach Learn Res 12:2825–2830. http://dl.acm.org/citation.cfm?id=1953048.2078195
Forman G (2003) An extensive empirical study of feature selection metrics for text classification. J Mach Learn Res 3:1289–1305
Forman G (2004) A pitfall and solution in multi-class feature selection for text classification. In: Proceedings of the twenty-first international conference on machine learning, ACM, New York, NY, USA, ICML ’04, pp 38–. https://doi.org/10.1145/1015330.1015356
Galavotti L, Sebastiani F, Simi M (2000) Experiments on the use of feature selection and negative evidence in automated text categorization, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol 1923. Springer, Berlin, pp 59–68
García S, Herrera F (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. J Mach Learn Res 9:2677–2694
García S, Luengo J, Herrera F (2015) Data preprocessing in data mining. Springer, Berlin
Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182. http://dl.acm.org/citation.cfm?id=944919.944968
Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6(2):65–70
Hommel G, Bernhard G (1999) Bonferroni procedures for logically related hypotheses. J Statist Plann Inference 82:119–128
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In European conference on machine learning (pp. 137–142). Springer, Berlin, Heidelberg
Kamal N, Kachites MA, Sebastian T, Tom M (2000) Text classification from labeled and unlabeled documents using em. Mach Learn 39(2):103–134. https://doi.org/10.1023/A:1007692713085
Kevin B, Moshe L (2013) Uci machine learning repository. http://archiveicsuciedu/ml901
Li XM, Ouyang JH, Lu Y (2015) Topic modeling for large-scale text data. Frontiers Inform Technol Electron Eng 16(6):457–465. https://doi.org/10.1631/FITEE.1400352
Luengoán J, García S, Francisco H (2009) A study on the use of statistical tests for experimentation with neural networks: analysis of parametric test conditions and non-parametric tests. Expert Syst Appl 36(4):7798–7808
Luis T (2005) An Evaluation of Filter and Wrapper Methods for Feature Selection in Categorical Clustering. Springer, Berlin, pp 440–451. https://doi.org/10.1007/11552253_40
Manning C D, Raghavan P, Schutze H (2008) Introduction to information retrieval. Cambridge University Press, New York
Mladenic D, Grobelnik M (1999) Feature selection for unbalanced class distribution and naive bayes. In: Proceeding of the 16th international conference on machine learning, San Francisco, SF, pp 258–267
Moschitti A, Basili R (2004) Ohsumed medical corpus dataset. http://disi.unitn.it/moschitti/corpora.htm
Ng HT, Goh WB, Low KL (1997) Feature selection, perceptron learning, and a usability case study for text categorization. SIGIR Forum 31(SI):67–73. https://doi.org/10.1145/278459.258537
Nist T (2001) Ohsumed medical corpus dataset. http://trec.nist.gov/data/t9_filtering.html
Rohit P, Devansh A, Shuang W, Premkumar N, Pradeep N (2013) Ridge regression based classifiers for large scale class imbalanced datasets. In: Proceedings of the 2013 IEEE workshop on applications of computer vision (WACV), IEEE Computer Society, Washington, DC, USA, WACV ’13, pp 267–274. https://doi.org/10.1109/WACV.2013.6475028
Sebastiani F (2002) Machine learning in automated text classification. ACM Comput Surv 34(1):1–47
Shaffer JP (1986) Modified sequentially rejective multiple test procedures. J Am Stat Assoc 81(395):826–831
Singh BK, Verma K, Thoke A, Suri JS (2017) Risk stratification of 2d ultrasound-based breast lesions using hybrid feature selection in machine learning paradigm. Measurement 105:146–157. https://doi.org/10.1016/j.measurement.2017.01.016. http://www.sciencedirect.com/science/article/pii/S026322411730026X
Song D, Lau RY, Bruza PD, Wong KF, Chen DY (2007) An intelligent information agent for document title classification and filtering in document-intensive domains. Decision Support Syst 44(1):251–265. https://doi.org/10.1016/j.dss.2007.04.001. http://www.sciencedirect.com/science/article/pii/S0167923607000681
Tellez ES, Moctezuma D, Miranda-Jiménez S, Graff M (2018) An automated text categorization framework based on hyperparameter optimization. Knowledge-Based Syst 149:110–123. https://doi.org/10.1016/j.knosys.2018.03.003. http://www.sciencedirect.com/science/article/pii/S0950705118301217
Uysal AK (2016) An improved global feature selection scheme for text classification. Expert Syst Appl 43:82–92. https://doi.org/10.1016/j.eswa.2015.08.050. http://www.sciencedirect.com/science/article/pii/S0957417415006077
Uysal AK, Gunal S (2012) A novel probabilistic feature selection method for text classification. Knowledge-based Syst, Elsevier 36:226–235
Van RCJ (1979) Information retrieval, 2nd edn. Butterworth-Heinemann, Newton
Bk W, Yf H, Wx Y, Li X (2012) Short text classification based on strong feature thesaurus. J Zhejiang University Sci C 13(9):649–659. https://doi.org/10.1631/jzus.C1100373
Wang D, Zhang H, Liu R, Lv W, Wang D (2014) t-test feature selection approach based on term frequency for text categorization. Pattern Recogn Lett 45:1–10
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: Proceedings of the fourteenth international conference on machine learning, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, ICML ’97, pp 412–420. http://dl.acm.org/citation.cfm?id=645526.657137
Zheng Z, Srihari R (2003) Optimally combining positive and negative features for text categorization. In: ICML 2003 Workshop, Washington DC, USA
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Agnihotri, D., Verma, K., Tripathi, P. et al. Soft voting technique to improve the performance of global filter based feature selection in text corpus. Appl Intell 49, 1597–1619 (2019). https://doi.org/10.1007/s10489-018-1349-1
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489-018-1349-1