1 Introduction

Accurate and timely information is a basic need for effective decision making. The tremendous growth of the e-corpus in various fields (e.g. Business, Biomedical, Engineering, News) [30, 47] demands an intelligent Decision Support System (DSS) that supports an Automated Text Document Classification (ATDC) process [10, 11]. In this context, a model is built by observing the occurrence of words in training documents with known class labels. The trained model can then predict the class labels of test documents with maximum accuracy [16, 42].

The prediction relies completely on the contents of the documents, and substantial portions of these contents are stored as text [44, 45]. The word (term) is the smallest constituent of text and plays a vital role in the ATDC process [27, 33, 39]. The processing steps of the ATDC process are as follows: the first step extracts features from the entire corpus (i.e. generates tokens from the text contents); in the second step, less informative items (e.g. stop words, punctuation marks, white spaces) are eliminated; and in the third step, lemmatization/stemming is performed on the remaining terms of the corpus. Finally, the resultant terms are used to build the vocabulary of the entire corpus [4]. The terms of this vocabulary are represented by vectors, where the frequencies of a term in the documents form its vector [34]. The collection of term vectors in matrix form is called a vector space, in which each individual term constitutes one dimension. For a typical document collection there may be millions of terms, hence ATDC has to cater for a large number of dimensions, which makes the classification process cumbersome [19, 30]. In the literature, feature selection techniques are used to select the most relevant features by eliminating the less important ones. Feature selection increases the performance as well as the speed of the classifier and is considered an important step in the classification process.

The Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer [34] is applied to normalize the weights of the terms, but the TF-IDF vectors tend to be high dimensional since they have one component for every term in the vocabulary. Terms which represent similar concepts or meanings are treated as individual words in the corpus and thus enormously increase the size of the TF-IDF vectors.
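For concreteness, the following minimal sketch (not the authors' exact pipeline) shows how such a TF-IDF vector space is typically built with scikit-learn; the toy documents and the resulting dimensionality are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer

# three toy documents standing in for a corpus
docs = [
    "carcinoma detected in the tissue sample",
    "sarcoma and melanoma are forms of cancer",
    "quarterly business earnings exceeded the forecasts",
]

vectorizer = TfidfVectorizer(stop_words="english")  # one dimension per vocabulary term
X = vectorizer.fit_transform(docs)                  # sparse matrix: documents x terms

# even for three short documents the vector space already has one axis per distinct term
print(X.shape[1], "dimensions for", X.shape[0], "documents")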

Latent Semantic Analysis (LSA) identifies the relationships among terms through linear combinations of terms.Footnote 1 The LSA creates a vector representation of a document that helps to compare documents based on similarity. It assigns an equal weight to terms representing similar concepts or meanings.Footnote 2 E.g., the four words “carcinoma”, “sarcoma”, “melanoma”, and “cancer” represent the same concept, i.e. “cancer”. Thus, an equal weight is assigned by LSA to these words, they contribute equally to the resulting LSA component, and whichever of them is selected by the algorithms does not affect the classifier performance [12]. Further, the LSA-processed vector space still contains a considerable number of features that are not relevant to the text of a specific class. In order to improve the scalability of ATDC, an effective feature selection technique is needed to reduce the feature set.
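As an illustration of this step, the sketch below (an assumption about tooling, not the authors' code) applies LSA via truncated SVD on top of TF-IDF vectors, so that terms expressing the same concept load onto shared latent components; the number of components is kept small only for the example.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

docs = [
    "carcinoma found in the biopsy",
    "the melanoma and the sarcoma are cancers",
    "stock markets rallied after the earnings report",
    "the cancer treatment reduced the tumor",
]

# TF-IDF followed by LSA (truncated SVD); related cancer terms load onto the same component
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(docs)   # dense matrix: documents x 2 latent components
print(X_lsa.shape)                # (4, 2)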

Many feature subset selection methods have been proposed and studied in the machine learning paradigm. They can be broadly divided into three categories: filter, wrapper, and embedded [6, 32, 44]. The filter methods compute the score of a feature using an evaluation function; this score is independent of any classification algorithm and is determined using the mutual correlation of the data. In contrast, the wrapper and embedded methods require frequent classifier interaction in their flow to estimate the value of a given subset. The requirement of classifier interaction may increase running time and force the feature selection method to work according to a specific learning model. Thus, filter-based methods are preferred over wrapper and embedded methods [32, 44]. A Global Filter-based Feature Selection Scheme (GFSS) assigns a score to each feature and the top-N features are selected using this score, where N is an empirically determined number [23, 44]. The filter-based methods are further subdivided into the One-sided Local Filter-based Feature Selection Scheme (OLFSS) and the Global Filter-based Feature Selection Scheme (GFSS).

In the OLFSS, the local class-based score of each feature is computed and used as the final score. The GFSS follows a global policy and converts the multiple local scores into a global score to compute the final score of the features. The local and global scores can be used directly for feature ranking: the features are sorted in descending order and the top-N features are included in the Final Feature Set (FFS). Information Gain (IG), Gini Index (GI), the Distinguishing Feature Selector (DFS), and Gain Ratio (GR), etc. are known as methods of GFSS, whereas Mutual Information (MI), Odds Ratio (OR), GSS coefficients (GSS), and the Correlation Coefficient (CC) are methods of OLFSS [44]. The selected discriminating features of the FFS are used by the classifiers in the final step of the ATDC process. A meta-heuristic technique can also be used to search for a configuration that produces a highly effective text classifier; this model selection procedure is commonly named hyper-parameter optimization in the literature [43].
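The GFSS flow described above reduces to a simple ranking step once the global scores exist; the sketch below illustrates only that step, with made-up scores rather than any of the cited scoring methods.

import numpy as np

def select_top_n(terms, global_scores, n):
    """Generic GFSS selection: sort terms by their global filter score and keep the top-n."""
    order = np.argsort(global_scores)[::-1]          # descending by score
    return [terms[i] for i in order[:n]]

terms = ["price", "cancer", "tumor", "market", "gene"]
scores = np.array([0.12, 0.80, 0.65, 0.20, 0.55])    # illustrative global scores
print(select_top_n(terms, scores, n=3))              # ['cancer', 'tumor', 'gene']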

Although the GFSS method improves the performance of the classifiers, it has some limitations. The GFSS is suitable for balanced datasets, where each class contains an equal number of documents along with a sufficient number of terms. In the case of an unbalanced dataset, having a large number of classes with a variable distribution of terms affects the performance of GFSS: the GFSS eliminates the informative features of a class either partially or completely from the top-N features. Moreover, most of the studies in the literature focus on providing improvements to specific feature selection methods rather than providing a new generic scheme.

Uysal [44] extended the work of [19] and proposed a solution named the Improved Global Feature Selection Scheme (IGFSS). IGFSS selects an equal number of representative features from each class for the final feature set. There are mainly two issues with the IGFSS; the first is that it assigns the class label of each feature, with positive or negative membership, by considering an individual vote of the Odds Ratio (OR) method [34].

An individual method may have weaknesses. For example, the OR method assigns a positive score to a term for a class if the term occurs frequently in that class; otherwise, a negative score is assigned. However, the numerical difference between the OR scores of terms with positive and negative membership is very small, which affects the process of class label and membership assignment. In this paper, the membership of terms is referred to as the nature of terms, i.e. positive or negative membership of a term means positive or negative nature of the term. The second issue is that the ratio of negative nature features is determined empirically by the IGFSS and a common negative features ratio is applied to all the classes to select the positive and negative nature features. This affects those classes of a dataset which have more positive features than negative ones, or vice-versa [4].

To address these two issues, a new technique named the Soft Voting Technique (SVT) is proposed. It is based on the presumption that the ensemble votes of several methods give better results than an individual vote, and it determines the most appropriate class label of the features. This technique can be useful for a set of equally well-performing methods to balance out their individual weaknesses. The flow of SVT is similar to that of IGFSS, but improved, and building on the key points of IGFSS we give a generic solution for filter-based feature selection methods. There are two main contributions in this paper:

  1. The SVT uses the weighted average score (Soft Vote) of three methods, viz. OR, Correlation Coefficient (CC), and GSS Coefficients (GSS), to predict the class labels of terms. This weighted average yields a more balanced score for a word than OR alone, which improves the numerical discrimination between positive and negative nature words.

  2. A mathematical expression is incorporated in IGFSS that computes a varying ratio of positive and negative nature terms for each class, based on the occurrence of the terms in the classes.

The proposed SVT is evaluated using four standard classifiers, viz. Linear Support Vector Machine (LSVM), Softmax Regression (SOFT MAX), Stochastic Gradient Descent Classifier (SGDC), and RIDGE. The classifiers are applied to five benchmark text data sets, viz. Webkb, Classic4, Reuters10, Trec2004, and Ohsumed10. The experimental results of SVT, based on Macro_F1 and Micro_F1, are compared with the classical information science methods and IGFSS.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the state-of-the-art methods and related works. Section 3 presents the details of the proposed SVT. The experimental setup and performance evaluation measures are discussed in Section 4. Section 5 presents the experimental results and discussions. Finally, the paper concludes in Section 6.

2 Related works

Substantial work has been carried out in the area of filter-based feature selection. The most common methods, viz. Mutual Information (MI), Information Gain (IG), Distinguishing Feature Selector (DFS), Gini Index (GI), Gain Ratio (GR), and Odds Ratio (OR), are briefly described as follows:

The Mutual Information (MI) concept [18, 48, 49] comes from information theory; it measures the dependency between random variables and is used to measure the information contained in a term ti. It is strongly influenced by the marginal probabilities of the terms: it assigns higher weights to rare terms than to common terms, so the weights are not comparable for terms with widely differing frequencies. The final score MI(ti) of term ti is the maximum class-based score, as shown in (1). The preliminary notations are summarized in Table 1.

$$ MI(t_{i}) = \max_{j = 1}^{j=r}\left[\log \left( \frac{p(t_{i} , C_{j})}{p(t_{i}) \times p(C_{j})}\right)\right] $$
(1)

Information Gain (IG) [18, 44, 45, 48, 49] assigns a higher weight to common terms distributed across many categories than to rare terms. IG is also known as average Mutual Information (see (2)).

$$\begin{array}{@{}rcl@{}} IG(t_{i}) &=& p(t_{i})\times \sum\limits_{j = 1}^{j=r} p(C_{j}|t_{i}) \times \log p(C_{j}|t_{i})\\ &&+\ p(\bar{t}_{i})\times \sum\limits_{j = 1}^{j=r} p(C_{j}|\bar{t_{i}}) \times \log p(C_{j}|\bar{t}_{i})\\ &&- \sum\limits_{j = 1}^{j=r} p(C_{j})\times \log{p(C_{j})} \end{array} $$
(2)
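Assuming, as is common, that the probabilities in (1) and (2) are estimated from document counts, the following sketch computes the MI and IG scores of a single term; the counts are illustrative, not taken from the paper.

import numpy as np

df = np.array([40.0, 5.0, 2.0])            # documents containing the term, per class (illustrative)
n_docs = np.array([100.0, 100.0, 100.0])   # documents per class

N = n_docs.sum()
p_c = n_docs / N                           # p(Cj)
p_t = df.sum() / N                         # p(ti)
p_tc = df / N                              # p(ti, Cj)
p_c_given_t = df / df.sum()                # p(Cj | ti)
p_c_given_nt = (n_docs - df) / (N - df.sum())   # p(Cj | ti absent)

mi = np.max(np.log(p_tc / (p_t * p_c)))    # equation (1): maximum class-based score

eps = 1e-12                                # guard against log(0)
ig = (p_t * np.sum(p_c_given_t * np.log(p_c_given_t + eps))
      + (1 - p_t) * np.sum(p_c_given_nt * np.log(p_c_given_nt + eps))
      - np.sum(p_c * np.log(p_c)))         # equation (2)

print(mi, ig)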

Gini Index (GI) is a global feature selection method for text classification which can be defined as an improved version of an attribute selection algorithm used in decision tree construction (see (3)) [44].

$$ GI\left( t_{i}\right)=\sum\limits^{r}_{j = 1}{{p\left( t_{i}|C_{j}\right)}^{2}\ {p\left( C_{j}|t_{i}\right)}^{2}} $$
(3)

The Distinguishing Feature Selector (DFS) [44, 45] improves on Mutual Information by reducing the effect of the marginal probabilities of the terms through normalization of the term weights. It gives the weight of a term in the range [0, 1], as defined by (4).

$$ DFS\left(t_{i}\right)=\sum\limits_{j = 1}^{r}\frac{p\left(C_{j}|t_{i}\right)}{p\left(\overline{t}_{i}|C_{j}\right)+p\left(t_{i}|\overline{C}_{j}\right)+1} $$
(4)

Gain Ratio (GR) was proposed in information science to reduce the effect of the most common terms and of the marginal probabilities of the terms by normalizing their weights obtained using IG [24] (see (5)).

$$ GR\left( t_{i}\right)=\sum\limits_{j = 1}^{j=r}\frac{IG(t_{i})}{-p\left( C_{j}\right)\times log \left( p\left( C_{j}\right)\right)} $$
(5)

Odds Ratio (OR) reflects the odds of the word occurring in the positive class normalized by that of the negative class. It has been used for relevance ranking in information retrieval [18, 28, 34, 44, 46] (see (6)).

$$ OR\left(t_{i},C_{j}\right)=\frac{p(t_{i}|C_{j})\left(1-p(t_{i}|\overline{C}_{j})\right)}{\left(1-p(t_{i}|C_{j})\right)\times p(t_{i}|\overline{C}_{j})} $$
(6)

The Correlation Coefficient CC(ti,Cj) of a word ti with a category Cj is a variant of the χ2 metric, where CC2 = χ2; CC can be viewed as a “one-sided” chi-square metric. Positive values correspond to features indicative of membership, while negative values indicate non-membership. The greater (smaller) the positive (negative) values are, the more strongly the terms indicate membership (non-membership) [36, 39, 50].

$$ CC\left(t_{i},C_{j}\right) = \frac{\sqrt{N}\times \left[p(t_{i},C_{j})\times p(\overline{t}_{i},\overline{C}_{j}) - p(t_{i},\overline{C}_{j})\times p(\overline{t}_{i},C_{j})\right]}{\sqrt{p(t_{i})\times p(\overline{t}_{i})\times p(C_{j}) \times p(\overline{C}_{j})}} $$
(7)

GSS Coefficient (GSS) is another simplified variant of the χ2 statistics proposed by [20]. Similar to CC, the positive values correspond to features indicative of membership, while negative values indicate non-membership [50].

$$ GSS\left( t_{i},C_{j}\right)=p(t_{i},C_{j})\times p(\overline{t}_{i},{\overline{C}}_{j})-p(t_{i},\overline{C}_{j})\times p(\overline{t}_{i},C_{j}) $$
(8)
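The three one-sided scores (6)-(8) can be computed from a 2x2 term/class document contingency table; the sketch below does this for one term and one class under the usual maximum-likelihood probability estimates, with illustrative counts.

import math

a = 40.0   # documents of class Cj that contain ti
b = 10.0   # documents outside Cj that contain ti
c = 60.0   # documents of class Cj without ti
d = 190.0  # documents outside Cj without ti
N = a + b + c + d

p_t_c = a / (a + c)        # p(ti | Cj)
p_t_nc = b / (b + d)       # p(ti | not Cj)

# Odds Ratio, equation (6)
odds_ratio = (p_t_c * (1 - p_t_nc)) / ((1 - p_t_c) * p_t_nc)

# joint and marginal probabilities for CC and GSS
p_tc, p_ntnc = a / N, d / N       # p(ti, Cj), p(not ti, not Cj)
p_tnc, p_ntc = b / N, c / N       # p(ti, not Cj), p(not ti, Cj)
p_t, p_nt = (a + b) / N, (c + d) / N
p_c, p_nc = (a + c) / N, (b + d) / N

# Correlation Coefficient, equation (7)
cc = math.sqrt(N) * (p_tc * p_ntnc - p_tnc * p_ntc) / math.sqrt(p_t * p_nt * p_c * p_nc)

# GSS coefficient, equation (8)
gss = p_tc * p_ntnc - p_tnc * p_ntc

print(odds_ratio, cc, gss)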

Uysal [44] proposed an ensemble method named the Improved Global Feature Selection Scheme (IGFSS), which provides a generic solution for the GFSS. The IGFSS merges the power of local and global feature selection methods: it is an ensemble of OR with any one method of GFSS at a time. The OR is used to assign the class label as well as the membership value to the features. It computes a negative value of a feature for a class if the feature rarely or never occurs in that class, and a positive value if the feature occurs frequently in that class. Further, the IGFSS uses the maximum absolute score of the feature over the classes to assign the class label, and the sign of this maximum value is used to determine the membership of the feature.

Table 1 Preliminary Notations [5, 7]

The summarized steps of IGFSS, presented in Algorithm 1, are as follows: (i) Step 5 computes a global score of each term using a method of GFSS (i.e. MI, IG, GI, DFS, or GR), (ii) Step 6 sorts the terms based on their computed global scores, (iii) Step 7 determines the class labels of the features, (iv) Step 8 computes the positive and negative membership of the features, (v) Step 9 computes the positive and negative feature counts for each class, (vi) Steps 10-14 determine the selection criterion based on the positive and negative features ratio, and (vii) Steps 14-16 select an equal number of the most informative features from each class by applying all the above steps.

In the IGFSS algorithm, in the worst case, all features need to be traversed once and some of them may be traversed twice while constructing the candidate feature set. Let n be the total number of documents, r the total number of classes, p the total number of terms, and m the number of terms obtained after removal of less informative terms, viz. stop words, punctuation marks, and white spaces. Let N be the number of IGFSS-weighted terms that are selected as the most informative terms based on the length of the final feature set.

The values of n, r, m, and N are much smaller than p, because the total number of terms p is in the millions, while the others are in the hundreds or thousands. Thus, the overall time complexity of Algorithm 1 is Θ(p). In the special case where the number of documents is in the millions and the numbers of terms and classes are small in comparison, the resulting number of terms is still in the millions because they are extracted from the combination of these documents into a corpus; in this case the time complexity of Algorithm 1 is Θ(n).

3 Proposed Soft Voting Technique (SVT)

Having reviewed the related studies on the ATDC process, it is found that in most of the reported techniques the class label of features is determined using a single criterion [6, 44]. However, ensemble techniques are less explored in text mining for deciding the class labels of features. Using a single criterion for feature selection has shown limited capability in knowledge discovery and decision-making systems [41]. Therefore, we introduce a new Soft Voting Technique (SVT) to determine the most appropriate class labels of the features. The IGFSS improved the GFSS by using an individual vote of the Odds Ratio (OR) method to define the class label of each feature. We extend this state-of-the-art approach by providing a more generic solution for all filter-based feature selection methods. The SVT is based on the presumption that the ensemble votes of several methods can yield better results than an individual vote. This technique can be useful for a set of equally well-performing models in order to balance out their individual weaknesses. The SVT uses the weighted average score (Soft Vote) of three methods, viz. Odds Ratio (OR), Correlation Coefficient (CC), and GSS Coefficients (GSS), to predict the class label of features.

The central idea behind using the weighted average score (i.e. Soft Vote) of these three methods is as follows: (i) all three methods, i.e. OR, CC, and GSS, assign a positive score to a term for a class if the term occurs frequently in that class, otherwise a negative score is assigned, (ii) the numerical difference between the OR scores of positive and negative nature terms is very small, (iii) however, the numerical range of scores assigned to the terms by OR is larger than that of the CC and GSS methods, (iv) thus, the plain sum of OR, CC, and GSS is not much different from the OR score alone, but their weighted average score is more balanced than OR, and (v) this helps in discriminating the positive and negative nature of terms for a class, due to the more balanced numerical differences computed by the SVT.
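A minimal sketch of this Soft Vote is given below, assuming equal weights for the three methods (the actual weights are a design choice of the technique): the per-class OR, CC, and GSS scores of a term are averaged, the class with the largest absolute combined score supplies the term's label, and the sign of that score supplies its nature.

import numpy as np

def soft_vote(or_scores, cc_scores, gss_scores, weights=(1/3.0, 1/3.0, 1/3.0)):
    """Return (class_index, nature, score) of one term from its per-class scores."""
    w_or, w_cc, w_gss = weights
    combined = (w_or * np.asarray(or_scores)
                + w_cc * np.asarray(cc_scores)
                + w_gss * np.asarray(gss_scores))
    j = int(np.argmax(np.abs(combined)))            # class with maximum absolute soft vote
    nature = "positive" if combined[j] > 0 else "negative"
    return j, nature, float(combined[j])

# illustrative per-class scores of a single term under the three methods
print(soft_vote(or_scores=[2.1, -0.4, -0.3],
                cc_scores=[0.9, -0.5, -0.2],
                gss_scores=[0.05, -0.02, -0.01]))
# -> (0, 'positive', ...): the term is a positive-nature feature of the first class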

To address the second issue of IGFSS, instead of using one common negative features ratio for every class, a mathematical expression is incorporated in IGFSS which computes a varying ratio of positive and negative features for each class. The resulting ratios resolve the imbalance between positive and negative nature terms in each class and select the most appropriate positive and negative nature terms from each class.

3.1 Explanation by synthetic data

The concept of SVT is now explained using the synthetic dataset shown in Table 2. Assume that we have to select the top 6 words from this data. The OLFSS and weighted average scores of the features of this dataset are shown in Table 3. Further, the class labels as well as the membership assigned to these features using SVT are shown in Table 4. The SVT and IGFSS both select two features from each class, as shown in Table 5.

Table 2 Synthetic data
Table 3 OLFSS Score on Synthetic data
Table 4 Class Label assignment on Synthetic data using SVT
Table 5 Selected words from Synthetic data

It can be observed from Table 3 that all three methods, i.e. OR, CC, and GSS, assign a positive score to a term for a class if the term occurs frequently in that class; otherwise a negative score is assigned. However, the numerical difference between the OR scores of positive and negative nature terms for the classes is small. Also, the numerical range of scores assigned to the terms by OR is larger than that of the CC and GSS methods. Thus, the plain sum of OR, CC, and GSS does not differ much from the OR score, but their weighted average is a more balanced score than OR. It helps in discriminating the positive and negative nature of terms for a class due to the more balanced numerical differences among the scores of terms. The selection processes of SVT, IGFSS, and GFSS based on IG are explained as follows:

The SVT follows the structure of IGFSS, as shown in Algorithm 2, to select the final feature set; the entire process of assigning class labels to the features and the flow of SVT are also shown in Algorithm 2. In SVT, (25)–(26) are used to compute the negative features ratio (nfr) and the positive features ratio (pfr) for all three classes as follows:

  1. nfr[C1] = 2/(3 + 2) = 2/5 = 0.4, pfr[C1] = 1 − nfr[C1] = 1 − 0.4 = 0.6

  2. nfr[C2] = 2/(3 + 2) = 2/5 = 0.4, pfr[C2] = 0.6

  3. nfr[C3] = 2/(3 + 2) = 2/5 = 0.4, pfr[C3] = 0.6

Therefore, using (27), if the length of the Final Feature Set (FFS) is 6, the equal split criterion gives EqualSplit = 6/3 = 2. Further, using (28), the selected positive words count in class C1 = EqualSplit × pfr[C1] = 2 × 0.6 = 1.2 ≈ 1, in class C2 = 2 × 0.6 = 1.2 ≈ 1, and in class C3 = 2 × 0.6 = 1.2 ≈ 1. Similarly, using (29), the selected negative words count in class C1 = 2 × 0.4 = 0.8 ≈ 1, in class C2 = 2 × 0.4 = 0.8 ≈ 1, and in class C3 = 2 × 0.4 = 0.8 ≈ 1.

In contrast, the IGFSS chooses a common nfr from a set of nfrs in the range 0 to 1 based on experimental evaluation; e.g., if we choose nfr = 0.8, then pfr = 1 − 0.8 = 0.2. The selected positive words count in class C1 = 2 × 0.2 = 0.4 ≈ 0, in class C2 = 2 × 0.2 = 0.4 ≈ 0, and in class C3 = 2 × 0.2 = 0.4 ≈ 0. Similarly, the selected negative words count in each of the classes C1, C2, and C3 = 2 × 0.8 = 1.6 ≈ 2. Thus, it can be observed from the above discussion that SVT uses varying nfrs and pfrs instead of the common nfr and pfr chosen by the IGFSS; the SVT follows the distribution of positive and negative nature terms in the classes while computing the nfr and pfr values. The distribution of words in the three classes C1, C2, and C3 using SVT and IGFSS is as follows:

  1. C1 - pos: ‘shark’, ‘hagfish’, ‘ray’; neg: ‘leopard’, ‘tiger’. C2 - pos: ‘turtle’, ‘lizard’, ‘snack’; neg: ‘cow’, ‘deer’.

  2. C3 - pos: ‘penguin’, ‘ostrich’, ‘emu’, ‘bird’; neg: ‘goat’, ‘toad’.

  3. IG selected top 6 features → ‘shark’, ‘turtle’, ‘emu’, ‘penguin’, ‘ostrich’, ‘lizard’ (C1 = 1, C2 = 2, C3 = 3).

  4. IGFSS selected top 6 features → C1: ‘leopard’, ‘tiger’; C2: ‘cow’, ‘deer’; C3: ‘goat’, ‘toad’.

  5. SVT selected top 6 features → C1: ‘shark’, ‘tiger’; C2: ‘turtle’, ‘deer’; C3: ‘emu’, ‘goat’.

It can be observed from the results on the synthetic data that the IGFSS uses one negative features ratio for all the classes, whereas the SVT selects negative and positive features from each class based on the distribution of negative and positive features in the classes.
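The varying ratios worked out above can be reproduced with a few lines of code; since (25)-(29) appear only in Algorithm 2, the expressions below are reconstructed from the worked example and should be read as an assumption about their exact form.

def class_ratios(pos_count, neg_count):
    """Negative and positive features ratio of one class, from its term-nature counts."""
    nfr = neg_count / float(pos_count + neg_count)
    return nfr, 1.0 - nfr                      # (nfr, pfr)

def per_class_selection(ffs_length, num_classes, pos_count, neg_count):
    """How many positive and negative nature terms SVT keeps for one class."""
    equal_split = ffs_length // num_classes    # equal share of the final feature set
    nfr, pfr = class_ratios(pos_count, neg_count)
    return round(equal_split * pfr), round(equal_split * nfr)

# class C1 of the synthetic data: 3 positive-nature and 2 negative-nature terms
print(class_ratios(3, 2))               # (0.4, 0.6)
print(per_class_selection(6, 3, 3, 2))  # (1, 1): one positive and one negative term for C1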

Negative nature features are as important as positive nature features for discriminating the class label of a document. E.g., the two features ‘leopard’ and ‘tiger’ have been selected as negative features for class C1 by the IGFSS method; their presence in a document ensures that this document cannot be classified into class C1. These two features are present in the documents of classes C2 and C3, but absent in the documents of class C1; therefore, they become negative features for class C1. A similar situation occurs for the other negative features of this dataset. The GFSS based on IG selected the top 6 features with class distribution C1 = 1, C2 = 2, and C3 = 3.

As shown in Table 5, fewer features are selected from class C1 than from C2 and C3. This issue has been resolved by choosing an equal number of features from each class. The assignment of the class labels of the features using IGFSS depends upon an individual vote of the OR method, whereas the SVT uses the soft voting technique by computing the weighted average score of the features. Thus, SVT reduces the bias of feature selection towards a single criterion through an ensemble of three methods, viz. OR, GSS, and CC.

The stop words (e.g. “this”, “that”, “those”, etc.), punctuation marks, white spaces, links, email addresses, numbers, etc. are less informative for deciding the class label of the documents in ATDC. They are removed from the corpus in the pre-processing steps [1,2,3, 22, 44] using the function Preprocessing(D) as follows:

function Preprocessing(D) {
    T = [t1, t2, ..., tp] ← Tokenizer(D)    // tokenization
    T = stopWordsRemoval(T)                 // stop-word removal
    T = punctuationMarksRemoval(T)          // punctuation-mark removal
    T = whiteSpaceRemoval(T)                // white-space removal
    T = lemmatize(T)                        // convert each word to its root form (e.g. went, gone -> go)
    T ← [t1, t2, ..., tm]                   // where m < p
    return (T)
}
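A hedged Python counterpart of Preprocessing(D) is sketched below using NLTK; the specific tokenizer, stop-word list, and lemmatizer are assumptions rather than the authors' exact tooling, and the punkt, stopwords, and wordnet corpora must be downloaded beforehand.

import string
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocessing(document):
    tokens = word_tokenize(document.lower())                       # tokenization
    stop = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop]                  # stop-word removal
    tokens = [t for t in tokens if t not in string.punctuation]    # punctuation-mark removal
    tokens = [t.strip() for t in tokens if t.strip()]              # white-space removal
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t, pos="v") for t in tokens]    # e.g. went, gone -> go
    return tokens

print(preprocessing("The patients went home after the tumors were removed."))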


The steps of SVT can be summarized as follows:

  1. The SVT computes the weighted average score (Soft Vote) of the three methods, i.e. OR, CC, and GSS, to find the final score of each feature, which is then used to determine the class label of the feature. In this regard, it computes a negative value of a feature for a class if the feature rarely or never occurs in that class, and a positive value if the feature occurs frequently in that class. Further, the SVT uses the maximum absolute score of the feature over the classes to assign the class label, and the sign of that maximum value to determine the membership (nature) of the feature.

  2. The features are sorted in descending order using the scores obtained by any one method of GFSS at a time. Further, the negative and positive features ratios are derived using a mathematical model, as shown in (26)–(30).

4 Experimental setup and performance evaluation

In order to evaluate the performance of SVT over IGFSS and the various methods of GFSS (MI, IG, GI, DFS, and GR), all the experiments have been carried out on a machine with an Intel Core i7 1.8 GHz processor and 8 GB RAM, running the UBUNTU 16.04 64-bit OS. The steps of document classification, i.e. tokenization, preprocessing of the words of the corpus (D), feature extraction (t[m] ⊂ t[p]), feature selection (t[k] ⊂ t[m]), classification, and performance analysis, are performed in Python 2.7 with the nltk, scipy, numpy, ipython notebook, scikit-learn, matplotlib, etc. packages.Footnote 3 To speed up the computation and resolve memory related issues, the corpus is sliced into one array per class instead of loading the entire corpus into a single array. The numbers of features selected for analysis are 300, 400, and 500. The statistical tests have been performed using Java and the KEEL software tool to compare the performance of the proposed SVT method with the other methods using the LSVM, SOFT MAX, SGDC, and RIDGE classifiers. The average rankings of the compared methods are also computed using the Java and KEEL software tools.Footnote 4

4.1 Data set

In this study, five distinct standard text datasets (viz. Reuters10, Ohsumed10, Webkb, Classic4, and Trec2004) with varying characteristics were used to assess the proposed technique (see Table 6). A brief description of these datasets is as follows. The Reuters10 dataset consists of the top-10 classes of the Reuters-21578 dataset.Footnote 5,Footnote 6 The Ohsumed10 dataset [35, 37] is a subset containing the 10 most frequent categories of the original Ohsumed23 dataset; Ohsumed10 is a highly dense, unbalanced, and challenging dataset. The Webkb datasetFootnote 7 consists of four classes [14]; in the Webkb dataset, the “student” class has the most samples, whereas the “project” class has the least. The class distribution of the Classic4 datasetFootnote 8 is nearly homogeneous among its four classes [29]; most of its samples are from the class “cacm”, whereas the class “med” has the least number of samples. The Trec2004 dataset [13] is the original subset of MEDLINE for the TREC 2004 Genomics Track. It consists of 10 years of completed citations from the database, from 1994 to 2003. The full-text articles are extracted from the Pubmed databaseFootnote 9 in the form of XML files. These articles are based on four categories: mouse tumor biology (tumor), embryologic mouse gene expression (expression), mouse gene ontology (GO), and alleles of mutant mouse phenotypes (allele). The documents of these four categories are searched and saved as XML files. Subsequently, the PubMed id, title, and abstract are parsed from the relevant XML files using the R XML parser.Footnote 10

Table 6 Details of the datasets

The training and test documents are already defined for the Reuters10 dataset, whereas for the other datasets, viz. Ohsumed10, Webkb, Classic4, and Trec2004, a stratified nested 5-fold cross-validation scheme is used to split the data into training and test sets. This cross-validation object is a variation of k-fold that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete set. Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole.
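A minimal sketch of such a stratified split with scikit-learn is shown below; X and y are placeholders for the document vectors and class labels, and the fold count mirrors the 5-fold setting used here.

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(20, 4)                    # placeholder document vectors
y = np.array([0] * 10 + [1] * 5 + [2] * 5)   # imbalanced class labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # each fold keeps roughly the same class proportions as the complete set
    print(fold, np.bincount(y[train_idx]), np.bincount(y[test_idx]))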

4.2 Classification algorithms

In order to prove the efficacy of the proposed technique, four state-of-the-art classifiers, viz. Linear Support Vector Machine (LSVM) [27, 33], the SOFT MAX classifier [8], the Stochastic Gradient Descent Classifier (SGDC) [9], and RIDGE [38], are employed on the text datasets. Text classification problems tend to be quite high dimensional (many features), and high dimensional problems are likely to be linearly separable. Therefore, linear classifiers such as SOFT MAX, SGDC, RIDGE, or LSVM with a linear kernel are likely to perform well. However, to obtain good performance the regularization parameters need to be properly tuned. In the experiments, python scikit-learnFootnote 11 is used to classify the documents [17]. A brief description of the methods is as follows:

A support vector machine constructs a hyperplane or set of hyperplanes in a high or infinite dimensional space, which can be used for classification, regression, or other tasks. The SVM is one of the most successful classifiers for text classification. It searches for a decision boundary that is maximally far away from any data point; the distance from the decision surface to the closest data point determines the margin of the classifier. The SVM classifier is based on the margin maximization concept [27, 33].

Softmax Regression (synonyms: Multinomial Logistic Regression, Maximum Entropy Classifier, or just Multi-class Logistic Regression) is a generalization of logistic regression that can be used for multi-class classification (under the assumption that the classes are mutually exclusive), whereas the (standard) Logistic Regression model is used in binary classification tasks.Footnote 12 SVM methods require fewer variables than Logistic Regression to achieve a better (or equivalent) performance. The sigmoid logistic function is replaced by the softmax function in this classifier [8].

Stochastic Gradient Descent (SGD) is a simple and very efficient approach to discriminative learning of linear classifiers under convex loss functions such as (linear) Support Vector Machines. The SGD has been successfully applied to the large-scale and sparse machine learning problems often encountered in the text classification and natural language processing [9].

The ridge classifier uses ridge regression with the L2 regularization technique to classify the data. The stochastic average gradient descent solver is used in the ridge regression classifier to speed up the iterative procedure of the classification. Ridge regression is simply a type of linear regression that controls the magnitude of the coefficients to mitigate the effect of over-fitting. The major strength of this classifier is that there is no need for feature selection if the model is properly tuned using the regularization parameter [38].
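The four classifiers named above can be instantiated in scikit-learn roughly as follows; the hyper-parameter values are defaults given for illustration, not the tuned settings used in the experiments.

from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier
from sklearn.svm import LinearSVC

classifiers = {
    "LSVM": LinearSVC(C=1.0),                                        # linear support vector machine
    "SOFT MAX": LogisticRegression(multi_class="multinomial",
                                   solver="lbfgs", max_iter=1000),   # softmax (multinomial logistic) regression
    "SGDC": SGDClassifier(loss="hinge", alpha=1e-4),                 # linear model trained with SGD
    "RIDGE": RidgeClassifier(alpha=1.0, solver="sag"),               # ridge classifier with the SAG solver
}

# typical usage on the TF-IDF/LSA feature matrices:
#   clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
for name, clf in classifiers.items():
    print(name, type(clf).__name__)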

4.3 Performance evaluation measures

In this paper, the benchmarked macro and micro averaged F1 measures [44] are used to evaluate the performance of the classifiers. The F-measure (Fβ and F1) can be interpreted as a weighted harmonic mean of precision and recall. The Fβ score weights recall more than precision by a factor of β; it reaches its best value at 1 and its worst at 0. With β = 1, Fβ and F1 are equivalent, and recall and precision are equally important.Footnote 13 Accuracy gives the same weight to all classes and is not suitable for imbalanced datasets. The Macro_F1 measure computes the metric for each label and finds their unweighted mean, and thus does not consider label imbalance, whereas Micro_F1 calculates the metric globally by counting the total true positives, false negatives, and false positives. In this context, the notions of precision (macro (31) and micro (32)), recall (macro (33) and micro (34)), accuracy (35), Fβ (36), Macro_F1 (37), and Micro_F1 (38) are defined as follows:

$$ Precision_{macro}=\frac{1}{n(C)}\sum\limits_{C = 1}^{C=r}\frac{TP_{C}}{TP_{C}+FP_{C}} $$
(31)
$$ Precision_{micro}=\frac{\sum\limits_{C = 1}^{C=r} TP_{C}}{\sum\limits_{C = 1}^{C=r}TP_{C}+\sum\limits_{C = 1}^{C=r}FP_{C}} $$
(32)
$$ Recall_{macro}=\frac{1}{n(C)}\sum\limits_{C = 1}^{C=r}\frac{TP_{C}}{TP_{C}+FN_{C}} $$
(33)
$$ Recall_{micro}=\frac{\sum\limits_{C = 1}^{C=r} TP_{C}}{\sum\limits_{C = 1}^{C=r}TP_{C}+\sum\limits_{C = 1}^{C=r}FN_{C}} $$
(34)
$$ accuracy=\frac{TP+TN}{(TP+FP+TN+FN)} $$
(35)
$$ F_{\beta} =(1+\beta^{2})\times \frac{Precision \times Recall}{(\beta^{2} \times Precision) + Recall} $$
(36)
$$ Macro\_F1 = 2 \times \frac{Precision_{macro}\times Recall_{macro}}{Precision_{macro} + Recall_{macro}} $$
(37)
$$ Micro\_F1 = 2 \times \frac{Precision_{micro}\times Recall_{micro}}{Precision_{micro} + Recall_{micro}} $$
(38)

where C = 1 to C = r represents the r class labels and n(C) is the total number of classes. TP is the count of true positives, FP the count of false positives, FN the count of false negatives, and TN the count of true negatives. In multi-class and multi-label classification tasks, the notions of precision, recall, and F-measures can be applied to each class label independently.
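In practice, the Macro_F1 and Micro_F1 values of (37)-(38) can be obtained directly from scikit-learn, as in the small sketch below with illustrative label vectors.

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]   # illustrative gold labels
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]   # illustrative predictions

print("Macro_F1:", f1_score(y_true, y_pred, average="macro"))  # unweighted mean over classes
print("Micro_F1:", f1_score(y_true, y_pred, average="micro"))  # from global TP/FP/FN counts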

The Z-test statistic [21] has been used to evaluate the performance of SVT over IGFSS and the methods of GFSS. A set of pairwise comparisons can be associated with a set or family of hypotheses. As [15] explained, the test statistic for comparing the ith and jth classifier is

$$ z=\frac{R_{i}-R_{j}}{\sqrt{\frac{k(k+1)}{6N}}} $$
(39)

where Ri and Rj are the average ranks computed through the Friedman test for the ith and jth classifiers, k is the number of classifiers to be compared, and N is the number of data sets used in the comparison. The z value is used to find the corresponding probability (p-value) from the table of the normal distribution, which is then compared with an appropriate level of significance α. There are two basic procedures for doing this [31]:

  1. Holm’s procedure [25]: The value of α is adjusted using a step-down method. Let p1,...,pm be the p-values arranged in ascending order and H1,...,Hm the corresponding hypotheses. Holm’s procedure rejects the hypotheses H1 to H(i−1), where i is the smallest integer such that pi > α/(m − i + 1).

     The set of all pairwise comparisons forms a group of logically interrelated hypotheses. Suppose there are three hypotheses of pairwise equality associated with the pairwise comparisons of three classifiers Ci, i = 1, 2, 3. Not every combination of true and false hypotheses is then possible: if any one of these hypotheses is false, at least one other must also be false. For example, if C1 is better or worse than C2, it is not possible that C1 performs the same as C3 and C2 performs the same as C3; there cannot be one false and two true hypotheses among these three relations at the same time. Shaffer [40] proposed two procedures that address this issue by using the logical relations within the family of hypotheses to adjust the value of α.

  2. Shaffer’s static procedure: Similar to Holm’s step-down method, at stage i, instead of rejecting Hi if pi ≤ α/(m − i + 1), reject Hi if pi ≤ α/ti, where ti is the maximum number of hypotheses which can be true given that any (i − 1) hypotheses are false. Shaffer’s dynamic procedure: It improves on the static procedure by using the value α/ti* in place of α/ti at stage i, where ti* is the maximum number of hypotheses that could be true given that the previous hypotheses are false. It is a dynamic procedure since ti* depends not only on the logical structure of the hypotheses but also on the hypotheses already rejected at step i.

The results presented in this study are based on the static procedure. However, the hypotheses can also be examined using more advanced dynamic procedures [26] on the presented experimental results.
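The sketch below illustrates the pairwise z statistic of (39) and Holm's step-down adjustment; the average ranks and p-values are illustrative, not the values reported in Tables 17 and 18.

import math
from scipy import stats

def pairwise_z(rank_i, rank_j, k, n_datasets):
    """z statistic of (39) for two classifiers with Friedman average ranks rank_i and rank_j."""
    return (rank_i - rank_j) / math.sqrt(k * (k + 1) / (6.0 * n_datasets))

def holm_reject(p_values, alpha=0.05):
    """Apply Holm's step-down procedure and mark which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    rejected = [False] * m
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (m - step):            # threshold alpha/(m - i + 1), with i = step + 1
            rejected[i] = True
        else:
            break                                        # stop at the first non-rejected hypothesis
    return rejected

z = pairwise_z(rank_i=2.9, rank_j=1.0, k=3, n_datasets=5)
p = 2.0 * (1.0 - stats.norm.cdf(abs(z)))                 # two-sided p-value from the normal distribution
print(round(z, 3), round(p, 4), holm_reject([p, 0.040, 0.200]))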

5 Results and discussions

In this section, the experimental results of the proposed SVT algorithm and its comparison with classical information science methods and IGFSS are presented. The results are analyzed based on the selected features (viz. 300, 400, and 500) and the performance of the four classifiers. The dispersed distribution of the features in the various classes of all five datasets can be observed from Figs. 1, 2, 3, 4 and 5. The distribution of features in the classes is based on the class labels assigned by the IGFSS and SVT. The classifier results based on Macro_F1 and Micro_F1 measures are shown in Tables 7, 8, 9, 10, 11, 12, 13, 14, 15 and 16. The maximum performance measures achieved by the algorithms are shown in bold letters in these tables.

Fig. 1 Reuters10 dataset

Fig. 2 Ohsumed10 dataset

Fig. 3 Webkb dataset

Fig. 4 Classic4 dataset

Fig. 5 Trec2004 dataset

Table 7 Macro_F1 measure for Ohsumed10 dataset
Table 8 Macro_F1 measure for Reuters10 dataset
Table 9 Macro_F1 measure for Webkb dataset
Table 10 Macro_F1 measure for Classic4 dataset
Table 11 Macro_F1 measure for Trec2004 dataset
Table 12 Micro_F1 measure for Ohsumed10 dataset
Table 13 Micro_F1 measure of Reuters10 dataset
Table 14 Micro_F1 measure of Webkb dataset
Table 15 Micro_F1 measure of Classic4 dataset
Table 16 Micro_F1 measure of Trec2004 dataset

The average rankings of the algorithms GFSS (MI, IG, GI, DFS, and GR), GFSS + IGFSS, and GFSS + SVT are shown in Table 18. In this table, the Micro_F1 based average ranks of MI, MI+IGFSS, and MI+SVT are 2.942, 2.025, and 1.033, respectively. Here, the highest average rank value of an algorithm indicates the worst performer, while the minimum average rank value indicates the top performer. The GFSS+SVT is found to be the top performer with all four classifiers (as shown in bold letters).

5.1 Data and statistical analysis

As can be observed from Figs. 1–5, due to the selection of two principal components using LSA, the total number of features selected using SVT is reduced (by up to ≈ 1000 features) in comparison to IGFSS for each dataset. The LSA assigns an equal weight to words (e.g. “shark”, “hagfish”, and “rays”) representing similar concepts or meanings (the category “fish”), so it does not matter which of these words is selected by the algorithm for the final feature set. The IGFSS selects one common negative features ratio (nfr) for every class, and the positive and negative features are selected based on that nfr value. This affects those classes of a dataset which have more positive features than negative ones, or vice-versa. The SVT solves this issue by selecting a variable negative and positive features ratio for each class based on the counts of negative and positive features in the classes. Thus, SVT selects an equal number of features from each class, but a variable number of positive and negative features (see Figs. 1–5). For example, in the Reuters10 dataset, a low percentage of positive features is selected for the “acq” class by the IGFSS, whereas the SVT selects a balanced percentage of positive and negative features not only for the “acq” class but for all classes.

Similar behaviour can be observed for all the other datasets, as shown in Figs. 2–5. However, the changes are less pronounced for the balanced datasets (e.g. Webkb, Classic4, and Trec2004) than for the unbalanced datasets (e.g. Reuters10, Ohsumed10). This is due to the almost similar distribution of samples as well as terms in the classes of a balanced dataset. The distribution of samples and terms is variable across the classes of an unbalanced dataset; therefore the single negative features ratio, which is determined empirically by the IGFSS, fails to select the most appropriate negative and positive features from all classes. This issue is solved by SVT because it selects positive and negative features using a set of negative features ratios (i.e. nfrs) derived using an improved mathematical model; it selects the positive and negative features based on their distribution in the class.

The statistical tests based on the Z-test statistic [21] are shown in Table 17. They illustrate the Holm/Shaffer values of the compared methods for the four classifiers. The compared methods are shown as algorithms in these tables. In total, 3 hypotheses are formed for each of the five methods (MI, IG, GI, DFS, and GR), and these hypotheses are denoted as i in the tables. The value of α has been selected as 0.05. The Holm's and Shaffer's procedures reject those hypotheses that have a p-value ≤ 0.033. The average ranking of the algorithms, prepared from these values, is presented in Table 18.

Table 17 Holm / Shaffer values
Table 18 Average rankings of the algorithms

5.2 Discussions

In order to compare the performances, two null hypotheses are assumed. First: “The performance of the MI, IG, GI, DFS, and GR methods is equal to that of IGFSS and SVT”, and second: “The performance of IGFSS and SVT is equal”. In most of the cases, the performance of the MI, IG, GI, DFS, and GR methods is lower than that of IGFSS and SVT. The performance of the classifiers is significantly improved in comparison to IGFSS when the GFSS based methods are ensembled with the SVT algorithm. Thus, both the first and the second null hypothesis are rejected, since the corresponding p-values fall below the Holm- and Shaffer-adjusted α thresholds.

The standard MI method assigns higher scores to low-frequency terms that appear in only one class. This shows that it has the capability to discriminate terms which are present in a specific class, i.e. terms with a positive nature. However, MI suffers in the case of overlapping terms, which are identified as negative nature terms in this paper. Thus, the experimental results obtained by MI show good scores for those datasets that have more positive nature terms than negative ones (e.g. Ohsumed10, Trec2004). The other methods, viz. IG, DFS, GI, etc., which assign higher weights to the most frequent negative nature terms, perform better on those datasets which have more negative nature terms than positive ones (e.g. Webkb, Classic4).

From the extensive experimental study and statistical analysis, it is found that the performance of all the feature selection methods is improved by embedding the SVT algorithm. The key points which are the main causes of the success of SVT over IGFSS and the other classical methods of GFSS are as follows:

  1. Strength of GFSS: (a) It selects the top-N scored, most representative features from all the classes. Weakness of GFSS: (a) It discards low-scored features of some classes either partially or completely, due to the dispersed distribution of the features in the classes.

  2. Strength of IGFSS: (a) It selects an equal number of the most representative features from all the classes. Weakness of IGFSS: (a) The OR method has its own weakness in assigning an adequate class label to the features. If a feature (say, to) occurs frequently in all the classes but is absent in one specific class (say, Cl), then the presence of this feature in a test sample (say, Dtest[i]) assures that the class label of Dtest[i] is not Cl. In this case, since the feature to is present in almost all the classes except Cl, the OR method assigns a very high positive value to it for all the classes but a smaller negative value for the class Cl. The IGFSS fails in this situation to assign the most appropriate class label and membership value to to. These types of terms are defined as common negative terms. (b) The negative features ratio is determined empirically; therefore, the selected positive and negative features are not adequate for all the classes.

  3. Strength of SVT: (a) It uses the ensemble votes of three methods, which gives better results than an individual vote and determines the most appropriate class label of the features. (b) The negative features ratio is determined using a mathematical model; therefore, the selected positive and negative features are the most appropriate for all the classes. Weakness of SVT: Although the weighted average score of the three methods (i.e. OR, CC, and GSS) balances the weaknesses of OR, it is not very effective in deciding the class label and membership of common negative features (e.g. the term to), due to their similar numeric scoring nature.

6 Conclusions

The main contribution of this study is a new Soft Voting Technique (SVT) for determining the most appropriate class labels of the features. SVT provides a generic solution for all filter-based global feature selection methods to select the most informative features based on the assigned class labels. The proposed SVT technique takes advantage of the ensemble result obtained from the weighted average score (Soft Vote) of three methods, i.e. Odds Ratio (OR), Correlation Coefficient (CC), and GSS Coefficients (GSS), to predict the class labels of the features. This technique can be useful for a set of equally well-performing methods in order to balance out their individual weaknesses. Although the SVT selects an equal number of features from each class, similar to IGFSS, the counts of positive and negative features are determined by an improved approach derived using a mathematical model. The use of Latent Semantic Analysis (LSA) at the initial level reduces the high-dimensional feature space into a smaller one. The constructed final feature set improves the scalability, efficiency, and accuracy of the classifiers on all five datasets used in this study, which proves the efficacy of the proposed Soft Voting Technique. In the future, more appropriate methods for the selection of positive and negative features need to be investigated.