Feature selection methods for text classification: a systematic literature review

Pintas, Julliano Trindade; Fernandes, Leandro A. F.; Garcia, Ana Cristina Bicharra

doi:10.1007/s10462-021-09970-6

Published: 24 February 2021

Volume 54, pages 6149–6200, (2021)

Artificial Intelligence Review

Abstract

Feature Selection (FS) methods alleviate key problems in classification procedures as they are used to improve classification accuracy, reduce data dimensionality, and remove irrelevant data. FS methods have received a great deal of attention from the text classification community. However, only a few literature surveys cover them with a focus on text classification, and the ones available either offer a superficial analysis or present a very small set of works on the subject. For this reason, we conducted a Systematic Literature Review (SLR) that assesses 1376 unique papers from journals and conferences published in the past eight years (2013–2020). After abstract screening and full-text eligibility analysis, 175 studies were included in our SLR. Our contribution is twofold. We have considered several aspects of each proposed method and mapped them into a new categorization schema. Additionally, we mapped the main characteristics of the experiments, identifying which datasets, languages, machine learning algorithms, and validation methods have been used to evaluate new and existing techniques. By following the SLR protocol, we allow the replication of our review process and minimize the chances of bias while classifying the included studies. By mapping issues and experiment settings, our SLR helps researchers to develop and position new studies with respect to the existing literature.

1 Introduction

Automated text classifiers can be used to handle several real-world problems, such as spam filtering, sentiment analysis, and news classification. Texts are usually represented by a high-dimensional, sparse document-term matrix of word frequency counts, whose dimensionality equals the size of the vocabulary. The high dimensionality can cause some problems, such as the curse of dimensionality and model overfitting. Feature Selection (FS) can be used to reduce dimensionality, remove irrelevant data, and increase the learning accuracy. FS is the process of automatically or manually selecting the features that contribute most to the classification of a given text. In text classification problems, a feature is usually some representation of a subset of words. A significant subset of the features extracted from text corpora may not be relevant for the text classification task. These non-relevant features can deteriorate both the efficiency and the accuracy of the classification models (Kumbhar and Mali 2013). For this reason, FS for text classification became a popular research topic in artificial intelligence and data mining conferences and journals.

Some general reviews about FS are available. Chandrashekar and Sahin (2014) and Kumar (2014) provide a general introduction to FS methods and classify them into the filter, wrapper, and embedded categories. Pereira et al. (2018) give a comprehensive survey and novel categorization of the FS techniques focusing on multi-label classification. However, these surveys did not consider in their analyses the different methods to handle the high dimensionality of the feature space, the different text representation formats such as bag of words and word embedding, and the power of the features’ semantics for choosing the most efficient set of features.

FS methods have received a great deal of attention from the text classification community due to their strength in improving retrieval recall and computational efficiency (Kumbhar and Mali 2013). Despite their importance, only a few literature surveys (Kumbhar and Mali 2013; Shah and Patel 2016; Deng et al. 2019) cover them with a focus on text classification. The ones available either offer a superficial analysis or present a very small set of works on the subject. Kumbhar and Mali (2013) and Shah and Patel (2016) are more introductory studies, and neither survey focuses only on FS methods. Besides FS, Kumbhar and Mali (2013) address feature extraction methods and Shah and Patel (2016) address algorithms for text classification. To the best of our knowledge, there is only one review work focused exclusively on FS for text classification (Deng et al. 2019). Although Deng et al. (2019) provide a good overview of the subject, only a limited proportion of the published papers about FS for text classification has been included (28 studies). Among these, only fourteen were published in the last ten years, and six were published in the last five years. Besides, no clear criteria for inclusion or exclusion of the selected articles were defined. The study selection was made from other FS reviews that are not specific to text classification.

Our literature review expands existing surveys on FS methods by including up-to-date research and providing a thorough analysis of FS methods for the text classification task. The contribution of our literature survey lies in:

Covering a larger number of papers (175 studies), resulting from a more comprehensive review of the theme;
Bringing more up-to-date research, including studies from 2013–2020;
Providing a reproducible review according to an established literature review protocol;
Providing a new research categorization for understanding the FS methods area;
Providing a description of the experimental settings used in the 175 reviewed studies; and
Classifying all 175 papers retrieved in our study according to our categorization scheme.

This paper is organized as follows: Section 2 provides background information about the main elements of text classification, including FS. The protocol of our SLR, which includes the research questions and inclusion/exclusion criteria for selecting the studies from the literature, is detailed in Sect. 3. Section 4 summarizes the issues addressed in the included studies. In Sect. 5, we cover all of the included studies by organizing them into a new categorization schema specific to FS methods for text classification. The categorization schema proposed in this paper provides a simplified way to organize the existing methods and to position new studies about FS for text classification. The mapping of the included studies into this categorization schema allows us to identify which issues/topics already have a significant number of studies and which ones have been less explored (possibly research gaps). In Sect. 6, we survey the experiment settings used to evaluate the proposed methods. We believe that the mapping of existing studies and their experiment settings will help researchers to position and develop new studies about FS for text classification.

2 Background

Text classification is the problem of determining which class(es) a given document belongs to (Manning et al. 2008). The classification problem can be divided into three main sub-types: binary, multiclass and multilabel. If only two classes are predefined, the problem is called a binary classification problem. If three or more classes are defined, and each document can only be associated with one of these classes, it is known as a multiclass classification problem. Finally, if each document can be simultaneously associated with two or more classes (or labels), it is defined as a multilabel classification problem.
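The three sub-types above can be told apart from the set of predefined classes and the labels attached to each document. The following is a minimal sketch of that rule, not a method from the reviewed studies; the function name `problem_type` and the toy labels are hypothetical and chosen only for illustration.

```python
def problem_type(classes, doc_labels):
    """Name the classification sub-type.

    classes:    set of predefined class names (assumed to have >= 2 members)
    doc_labels: list with one set of assigned classes per document
    """
    if len(classes) < 2:
        raise ValueError("at least two classes are required")
    # Any document carrying two or more labels makes the problem multilabel.
    if any(len(labels) > 1 for labels in doc_labels):
        return "multilabel"
    # Otherwise each document has a single class: two classes is binary,
    # three or more is multiclass.
    return "binary" if len(classes) == 2 else "multiclass"


# Hypothetical toy data for the three cases.
spam_task = problem_type({"spam", "ham"}, [{"spam"}, {"ham"}])
news_task = problem_type({"sports", "politics", "tech"}, [{"tech"}, {"sports"}])
tag_task = problem_type({"sports", "politics", "tech"}, [{"tech", "politics"}])
```

Here the multilabel check comes first because a single document with several labels is enough to rule out the other two sub-types, regardless of how many classes are defined.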

Currently, developing models for text classification is a sophisticated process involving not only the training of models, but also numerous additional procedures, e.g., data pre-processing, transformation, and dimensionality reduction (Mirończuk and Protasiewicz 2018). This background section presents the main concepts directly related to this review’s theme. Section 2.1 discusses distinct text representation models, highlighting their advantages and disadvantages. Section 2.2 presents learning algorithms/architectures for text classification. Finally, Sect. 2.3 introduces the main concepts of FS specifically for text classification.

2.1 Representation models for textual data

Once labeled documents are available, the first step in constructing a classification model is to extract features from the text corpus. Different models of feature representation and weighting can be used for text classification, and each representation model has advantages and disadvantages that must be considered. Below, we present two groups of representation models that are widely used in text classification architectures: N-gram based Models and Word Embedding Models.

An N-gram is a sequence of N words that occur “in that order” in a text (Kowsari et al. 2019). The simplest and most widely used N-gram model is the bag-of-words (BoW) model, in which \(N = 1\) (also called the 1-gram or unigram model). In this model, each feature corresponds to a unique word in the text. However, the N-gram model can also be applied with N values greater than 1. For example, in the 2-gram model each feature corresponds to two consecutive words. N-gram models with \(N > 1\) can capture more information than the 1-gram model (Kowsari et al. 2019) because with \(N = 1\) the word order information is disregarded, while in 2-gram or higher models part of the word order information is captured.
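As an illustration of the difference between 1-gram and 2-gram features, the sketch below extracts both from a toy sentence. It assumes whitespace tokenization and uses a hypothetical helper named `ngrams`; real pipelines typically add pre-processing steps that are not shown here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, in order, as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy sentence, split on whitespace (an assumption made for simplicity).
tokens = "the cat sat on the mat".split()

# 1-gram (bag-of-words) features: word order is discarded.
unigram_counts = Counter(ngrams(tokens, 1))

# 2-gram features: each feature keeps a pair of consecutive words.
bigram_counts = Counter(ngrams(tokens, 2))
```

In this toy sentence the unigram counts record that "the" occurs twice, whereas the bigram counts distinguish "the cat" from "the mat", which is the partial word-order information mentioned above.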

In the N-gram model, each feature (a word or set of words) receives a value/weight for each document in the corpus. This value is usually calculated based on the frequency of that word (or set of words) in each document. The simplest weight is the raw frequency of the word (or set of words) in the document, known as Term Frequency (TF). However, other weighting methods may be used. The most well-known and widely used method is the Term Frequency-Inverse Document Frequency (TF-IDF). In this method, the Inverse Document Frequency (IDF) is used in conjunction with TF in order to reduce the effect of implicitly common words in the corpus (Kowsari et al. 2019).
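The weighting described above can be sketched in a few lines. The text does not fix a particular IDF formula, so this example assumes one common variant, \(\mathrm{idf}(t) = \log(N / \mathrm{df}(t))\), where \(N\) is the number of documents and \(\mathrm{df}(t)\) is the number of documents containing term \(t\). The function name `tf_idf` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists. The weight is tf * log(N / df), which is
    one common TF-IDF variant (an assumption, not the only definition).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus of three tokenized documents.
docs = [["spam", "offer", "offer"], ["meeting", "offer"], ["meeting", "agenda"]]
w = tf_idf(docs)
```

With this variant, a term that appears in every document receives weight zero, which is how IDF damps words that are common across the corpus.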

The N-gram model is usually chosen to represent text in machine learning activities due to its simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data (Mikolov et al. 2013a). However, N-gram models do not measure the semantic similarity of words, which is a limiting factor for some types of machine learning tasks (Mikolov et al. 2013a). Thus, many researchers have been looking for representation models that capture the syntactic or semantic similarity of words (Mikolov et al. 2013a, b; Kowsari et al. 2019).

Unlike N-gram models, which represent each word (or set of words) by a single value/weight per document, word embedding models represent each word (or set of words) by an N-dimensional vector of real numbers (Kowsari et al. 2019). The idea behind word embedding models is that similar words have vectors with close values. In this way, the level of syntactic or semantic similarity between words can be measured based on the distance between their vectors. Different techniques for estimating word vectors have been proposed, such as Word2Vec (Mikolov et al. 2013a), GloVe (Pennington et al. 2014) and FastText (Bojanowski et al. 2017).
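To make the idea of measuring similarity through vectors concrete, the sketch below compares toy word vectors with cosine similarity. Cosine similarity is one common choice and is assumed here; the text does not prescribe a specific measure. The three-dimensional vectors are invented for illustration and do not come from Word2Vec, GloVe, or FastText.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors; trained embeddings usually have
# far more dimensions.
toy_vectors = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

sim_related = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_unrelated = cosine_similarity(toy_vectors["good"], toy_vectors["table"])
```

With these toy values, the pair "good"/"great" scores higher than the pair "good"/"table", mirroring the intuition that similar words have nearby vectors.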

2.2 Text classification architectures

Over the years, different types of algorithms have been developed for the task of text classification (Kowsari et al. 2019). These algorithms can be divided into two main groups: traditional machine learning and deep learning. Some traditional algorithms, like Support Vector Machines (SVM), Naive Bayes (NB) and k-Nearest Neighbors (KNN), are widely studied for the text classification problem and are still commonly used by the scientific community (Kowsari et al. 2019). However, architectures based on deep learning, like Convolutional Neural Network (CNN), Deep Belief Network (DBN), and Hierarchical Attention Network (HAN), are increasingly being researched for text classification (Kowsari et al. 2019). Despite having the potential to achieve excellent results in some situations, deep learning architectures have some limitations and disadvantages. Table 1 compares deep learning and traditional architectures for text classification.

Table 1 Summary of advantages and disadvantages of text classification architectures. Adapted from Kowsari et al. (2019)
