1 Introduction

With the advancement of information technology over the last decade, many new applications, such as social networks, location-based services, social media and e-Commerce, have emerged as hubs for information gathering and dissemination. The majority of the information generated and shared in the Internet is in the form of text data, for example, news reports on CNN, articles listed on Wall Street Journal website, tweets sent by Twitter users, product reviews published on Amazon, etc. In a typical scenario, people may want to see webpages they are interested in, ignoring those irrelevant ones in a possibly very large corpus. This means that webpages (or documents) that contain text content must be classified by topics or some predefined categories, so as to facilitate efficient indexing and search. It is impossible to process this kind of text data manually, however, due to the volume of text data is simply enormous. To deal with this challenge, many techniques and methods have been proposed for classifying documents automatically, and this process is referred to as text categorization.

Text classification has a broad applications in real-world scenarios, such as automatically classifying webpages or documents according to a set of pre-specified labels [113], filing new patents into patent categories, user sentiment analysis for social network multimedia [5], spam email filtering, disseminating information to subscribers, document genre identification, video tagging [7], multimedia recommendation [69], etc. In a typical scenario, a webpage or a text document can contain hundreds or thousands of unique terms. If we use all the terms for text classification, we may get poor result because some terms are not helpful for classification and some terms may mislead the classifiers [60, 82]. In this survey, we aim to provide researchers and practitioners with a comprehensive understanding of feature selection theories, models, and techniques, especially the state-of-the-art feature selection techniques designed for text categorization.

2 Text classification

Text classification, also known as text categorization, is to assign one ore more class labels or categories from a predefined set of labels or categories to a document, according to its content [23, 78, 106, 109]. Formally, given a collection of N documents \(\mathcal {D}=\{d_{1}, d_{2}, ..., d_{N}\}\) and a set \(\mathcal {C}=\{c_{1},c_{2},...,c_{k}\}\) of k predefined categories, the problem of text categorization can be modeled as finding a mapping \(\mathcal {F}\) from the Cartesian product \(\mathcal {D}\times \mathcal {C}\) to a set {True,False}, i.e., \(\mathcal {F}: \mathcal {D} \times \mathcal {C} \rightarrow \{True, False\}\). The mapping \(\mathcal {F}\) is called the classifier. Based on this mapping, for a document \(d_{i}\in \mathcal {D}\) and a category \(c_{j}\in \mathcal {C}\), if \(\mathcal {F}(d_{i},c_{j})=True\), then di belongs to category cj, otherwise di does not belong to cj.

In real applications, text classification task is subjective, meaning that assignment of a category to a document depends on the judgment of human experts. For a same document, two human experts may have different opinion as to which category the document should be assigned to. On the other hand, in library science a book is classified to a class/category if at least 20% of the content of the book is about that class/category. Hence, a document may be affiliated with more than one categories. For instance, the news ‘In 2013 Tiger Woods sold his house for 2.2 million US dollars, where cheating scandal broke out’, may be classified to sports, economics, and properties as well.

A typical text categorization process is illustrated in Fig. 1. Generally, to classify a set \(\mathcal {D}\) of documents, we need to build a classifier \(\mathcal {T}\) at first. Since \(\mathcal {T}\) knows nothing about the relation between content of a document and category of the document, we wish to train \(\mathcal {T}\) by feeding to it a set \(\mathcal {D}^{\prime }\) of documents, where each document in \(\mathcal {D}^{\prime }\) has already been assigned a category by human experts in advance. The trained classifier \(\mathcal {T}^{\prime }\) is then applied to classify \(\mathcal {D}\), after which each document in \(\mathcal {D}\) is assigned to a category by \(\mathcal {T}^{\prime }\). Some important questions arise naturally in the text categorization process, i.e., how do we represent document so that classifiers can access document content efficiently; which document is the most similar one to a given document di; what are the available classifiers we can choose from; and how can we know which classifier performs the best. To answer these critical questions, in this section we discuss document representation and indexing, similarity between documents, classifiers for documents, and performance of classifiers, respectively.

Fig. 1
figure 1

A Typical Text Classification Process

2.1 Document representation

Although the topic of document representation is very much related to the information retrieval (IR) community, one may find that document representation can greatly affect performance of text classification algorithms, in terms of computational time, storage overhead, and accuracy.

The most popular document representation scheme is the bag-of-words model, which is widely used in natural language processing and information retrieval communities. In this model, each document is regarded as a bag that contains its words, keeping word multiplicity while ignoring grammar and the word ordering. Note that generally in a document there may be many stop words like a, the, and are, but they are neither descriptive nor meaningful for the subject of a document. Meanwhile, some words are sharing a same stem and should be treated as a single word, because they are similar in meaning. By removing stop words, applying Porter’s stemming algorithm to get word stems, combining synonyms and so on in a preprocessing step, the remaining words (or terms) in the document are both descriptive and thematically unique. In the sequel, when referring to a document we mean that the document has already been preprocessed, and we use word and term interchangeably.

With the bag-of-words model, each document di can be represented by a list of pairs (sj,fj), where sj is a term and fj the frequency of sj appearing in di. If we restrict fj to take on either 0 or 1, denoting absence and presence of sj in di respectively, then this representation is called the boolean model. To represent a document collection, a natural scheme is to use a document-term matrix [21], where each column of the matrix corresponds to a unique word and each row represents a document. An entry eij of the matrix reveals how many times word sj appearing in document di. Alternatively, eij can be the weight of term sj in document di, computed by using the TF-IDF weighting scheme.

Some researcher proposed to use alternative ways to represent documents, in the hope of improving the performance of text categorization algorithms that reply on the bag-of-words representation model. Scott et al. [81] proposes two representations for text, the first one is based on phrase whereas the second one is based on synonyms and hypernyms. These two alternative representations are tested on a rule-based classifier RIPPER [15]. Experimental results show that these two alternative representations on their own do not add any classification power to the classifiers, as compared to the bag-of-words representation model. However, combining classifiers based on different representations do bring some benefit in terms of accuracy [81].

To represent a document by a list a words/features, there are two popular schemes, i.e., the universal dictionary method and the local dictionary method [2]. The former constructs for all the class labels a universal dictionary that stores features in each document, whereas the latter builds a local dictionary for each class label respectively. A word can be put into the dictionary as a feature if it appears more than five times, according to empirical study on this cutoff value [2]. The local dictionary method can yield better results, as observed by some researchers [2, 67].

2.2 Similarity between documents

Similarity computation plays a critical role in various real applications such as document clustering and classification, data mining, and information retrieval [120]. In the context of document processing, a similarity function Sim() computes a similarity score of two documents, whose value is real and within the range [0,1]. In general, if we have Sim(di,dj) > Sim(di,dk), then we say document dj is more similar to di, as compared to dk. Currently, various similarity measures have been proposed and we focus on some popular measures for text processing, for example, Euclidean distance, Jaccard coefficient [54], Pearson correlation [40], Cosine Similarity [6], Hamming distance [62], Dice coefficient [85], IT-Sim [3, 56], SMTP [58], Earth mover’s distance [29, 96], Kullback-Leibler divergence [50], and BM25 [74]. A detailed survey on similarity and distance measures can be found in [13].

Strehl et al. have conducted experimental comparison between several similarity measures for text categorization and showed that Euclidean distance (L2 metric) performs the poorest, while Cosine and Jaccard are the best to capture human categorization behavior [84]. It is not surprising that Euclidean distance fails to measure similarity between high-dimensional data objects, since given a data object di, the Euclidean distance between di and any other distinct object tends to be equal in high-dimensional space [1, 100]. Similar behavior can be observed for Lp distance metrics such as Manhattan distance (L1), L3 distance, and general Lk metrics [37].

Cosine similarity is a commonly used measure for text categorization. Since a document can be represented by a high-dimensional vector, where each unique term is a dimension and the value of that dimension corresponds to the number of that term appearing in the document, the Cosine similarity between two documents is defined as \(\cos (\theta )= \frac {\overrightarrow {d_{i}} \cdot \overrightarrow {d_{j}}}{\lVert \overrightarrow {d_{i}} \rVert _{2} \cdot \lVert \overrightarrow {d_{j}} \rVert _{2}}\), where \(\overrightarrow {d_{i}}\) and \(\overrightarrow {d_{j}}\) are documents in vector form, and 𝜃 is the angle between the two vectors. One important property of Cosine similarity is that it depends on the angle between two vectors, instead of on the magnitude of them, meaning that two documents will be treated identically if they have the same term composition but with different total term counts [84]. In practice, document vectors are normalized to a unit length before computation.

The Jaccard coefficient was proposed for evaluation of similarity between ecological species, and is has been widely adopted to measure how similar two sets will be. Mathematically, the Jaccard coefficient of two documents di and dj is defined as \(J(d_{i},d_{j})=\frac {|d_{i} \cap d_{j}|}{|d_{i} \cup d_{j}|}\), where the numerator is the number of terms appearing in both di and dj, and the denominator is the total number of unique terms in the two documents. A closely related measure to Jaccard coefficient is the Dice coefficient [13, 85], which can be computed by \(s(d_{i},d_{j})=\frac {2\overrightarrow {d_{i}} \cdot \overrightarrow {d_{j}}} {\|\overrightarrow {d_{i}}\|_{2}^{2}+\|\overrightarrow {d_{j}}\|_{2}^{2}}\).

Some researchers propose to measure object similarity from information theory point of view, and designed IT-Sim, i.e., the information-theoretic measure [3, 56]. The main idea of IT-Sim is that similarity between objects may be considered as a question of how much information two objects have in common and how much they have in difference [56]. According to [58], IT-Sim achieves very good performance in measuring document similarities compared to existing popular similarity measures, although it is a bit computationally expensive. SMTP [58] is another information theory-based measure, which takes into account three cases when computing similarity between two documents: (1) the feature considered appears in both documents, (2) the feature considered appears in only one document, and (3) the feature considered appears in none of the two documents. Compared to IT-Sim, SMTP performs better and runs faster [58] .

The document similarity measures we have introduced so far do not consider document structure information, neglecting distribution information of words in documents. In fact, each document has a distribution of words different from that of another document, and this discrepancy can reveal some important information about how similar these two documents will be. Recently, some measures have been proposed, for example EMD-based measure [29, 96] and K-L divergence-based measure [40], for evaluating the similarity between two documents according to the difference between their distributions of words. The EMD-based measure works in two steps: (1) decompose documents into sets of subtopics, and (2) based on the sets of subtopics, calculate document similarity by using the Earth Mover’s Distance [77]. A detailed comparison between EMD-based measure and non-distribution-based measures confirms that EMD-based scheme outperforms all the other measures, in terms of MAP, P@5 and P@10 metrics [29].

K-L divergence-based scheme is another type of distribution-based measure, which computes the divergence between two distributions of words as \(D_{KL}(d_{i}||d_{j})={\sum }_{t = 1}^{m} w_{t,i}\times \log {\frac {w_{t,i}}{w_{t,j}}}\), where m is the total number of unique words in the document collection, wt,i and wt,j the tf-idf weights of word t in document di and dj respectively. A major drawback of the K-L divergence-based measure is that vanilla K-L divergence is not symmetric, i.e., DKL(di||dj)≠DKL(dj||di), which means that traditional K-L divergence cannot be used directly for measuring document similarity. To solve this problem, Huang et al. proposed an averaged K-L divergence measure, which can produce a symmetric similarity score for any two documents [40]. Similar to the EMD-based measure, experiments show that in most cases the averaged K-L divergence outperforms those non-distribution-based measures [40]. We summarize commonly used similarity measures in Table 1.

Table 1 Similarity measures for documents

2.3 Classifiers for text classification

To automatically classify documents, many statistical and machine learning techniques have been designed in the last decade, such as kNN method [17, 111], Naïve Bayes [22], Rocchio [75], multivariate regression models [80, 107], decision trees [72, 73], Support Vector Machines (SVMs) [94], neural networks [45, 78, 102], graph partitioning-based approach [31], and genetic algorithm-based methods [70, 93]. In this section, we review some of the most frequently used classifiers in text categorization.

kNN classification is a non-parametric method widely used in various fields, including data mining, machine learning, information retrieval, and statistics [106]. Given a document di with unknown category, a user defined parameter k, and a set \(\mathcal {D}\) of documents where each is associated with a category, kNN method computes k nearest documents for di according to some similarity measure, such as those in Section 2.2, then kNN method assigns to di the category that is the most commonly seemed in the k nearest documents. When deciding which documents is the nearest neighbor to di, one needs to evaluate each of the documents in \(\mathcal {D}\) according to some distance or similarity measure listed in Section 2.2. Note that when k is set to 1, kNN method degenerates to Nearest Neighbor (NN) method. kNN method is easy to implement and its performance is reasonably good, thus it has been employed in many real-applications since 1970s. Despite its prevalence in many real applications, kNN method has a major drawback, i.e., high computational cost, since it is a lazy learning method and each time an object is given, kNN method needs to examine the whole dataset so as to find the k nearest neighbors for the object.

Naïve Bayes (NB) is another classification method that has been studied extensively in text categorization [22]. Normally, NB classifiers adopt the assumption that the value of a particular feature is independent of the value of any other feature. In the context of text classification, the naïve Bayes assumption is that the probability of each word appearing in a document is independent of the occurrence of other word in the same document. There are two kinds of NB based text classifiers. The first one is called multivariate Bernoulli NB, which uses a binary vector to represent a document, where each component of the vector represent whether a term is present or absent in the document [63, 64]. The second one is multinomial NB, which also takes into account term frequencies in the document [64, 109]. In real applications, multinomial NB classifiers usually performs better than multivariate NB classifiers, especially true on large document collections [64]. Recently, two drawbacks of the multinomial NB classifiers have been identified by some researchers, i.e., its rough parameter estimation and bias against rare classes that contain only a few training documents. Some effective techniques are proposed to further improve prediction accuracy of the multinomial NB classifiers [47].

The Linear Least Squares Fit (LLSF) is a multivariate regression model for text categorization, which can automatically learn the correlation between a set of training documents and their categories [107]. Specifically, the training documents are represented in the form of (input,output) vector pairs, where input vector is a document represented by using the vector space model and output vector consists of categories of the corresponding document. A linear least-square fit problem is solved on these training pairs of vectors, resulting in a matrix of word-category regression coefficients. This matrix gives a mapping from an arbitrary document to a vector of weighted categories, through which a ranked list of categories sorted on weights can be obtained for an input document [106, 107].

Decision tree (DT) is a well-known machine learning algorithm that has been used extensively in automatic classification tasks [72, 73]. When applying to text categorization tasks, DT learning algorithms are employed to select informative words according to information gain criterion. Given a document to classify, the constructed decision tree is used to predict which category the document should belong to, according to the occurrence of word combinations in the document. Some researchers showed that in terms of prediction accuracy, decision trees usually outperform Naïve Bayes classifiers and Rocchio’s algorithm, but are slightly worse than kNN methods [23, 55, 106].

Support Vector Machines (SVMs) were introduced by Vapnik et al. [94] for classification tasks, which adheres to structural risk minimization principle to construct an optimal hyperplane with the widest possible margin to separate a set of data points that consist of positive and negative data examples. SVMs have been widely and successfully used as a powerful classification tool in various applications, including object recognition [79], image classification [57], text categorization [43, 87], etc. Joachims [43] was the first to apply SVMs to text categorization tasks, due to the fact that SVMs are well suited for several critical properties of text data. First, text data normally contains tens of thousands of terms, meaning that text data is very high-dimensional, whereas the ability of SVMs to learn can be independent of the dimensionality of the feature space. Second, although text data intrinsically contains many features (i.e., unique terms), there are few irrelevant features in a document in general. SVMs can take into account all the features, unlike conventional classification methods that must resort to feature selection techniques to reduce the number of features to a manageable level. Third, document vectors are sparse, meaning that each document only contains few non-zero entries. SVMs are well suited for this kind of classification problems with dense concepts and sparse instances. Through extensive experiments, Joachims showed that SVMs consistently outperform conventional classifiers such as Naïve Bayes, Rocchio, decision trees, and kNN [43].

Neural network was first used by Wiener et al. for text classification [102], where in their model a three-layered neural network is used for each category to learn a non-linear mapping from input document (represented by a vector of term weights) to a category. Recently, Johnson et al. proposes to use Convolutional Neural Networks (CNN) for text categorization, by taking into account 1D internal structure of document, i.e., the order of words [45]. In their work, CNN is applied directly to high-dimensional text data, instead of using low-dimensional word vectors as input by most conventional classifiers. Experiments on real datasets show that CNN outperforms SVM and normal neural networks in terms of error rate [45].

3 Feature selection for text classification

Despite many existing classifiers for text categorization, a major challenge of text categorization is high dimensionality of the feature space [108]. A document usually contains hundreds or thousands of distinct words that are regarded as features, however many of them may be noisy, less informative, or redundant with respect to class label. This may mislead the classifiers and degrade their performance in general [60, 82]. Therefore, feature selection must be applied to eliminate noisy, less informative, and redundant features, so as to reduce the feature space to a manageable level, thus improving efficiency and accuracy of the classifiers used.

Generally, a feature selection method involves four basic steps, i.e., feature subset generation, subset evaluation, stopping condition, and classification result validation [89]. In the first step, we use some search strategy to find a candidate feature subset, which is then evaluated by certain goodness criterion in the second step. In the third step, subset generation and evaluation terminate when stopping conditions are met, after which the best feature subset with be chosen from all the candidates. In the last step, the feature subset will be validated using a validation set. Depending on how to generate feature subsets, feature selection methods can be divided into four categories, namely, filter model [10, 83, 110], wrapper model [24, 71], embedded model, and hybrid model [18, 104]. Majority of feature selection methods for text categorization belong to filter-based method, due to its simplicity and efficiency. Detailed investigation and comparison between different feature selection algorithms for generic data can be found in [20, 60, 66, 82]. In the sequel, we focus on feature selection methods that are explicitly designed for text categorization.

3.1 Filter model

Given a set S = {s1,s2,...,sm} of m features (terms), the filter approach evaluates, by employing some scoring function 𝜃, each feature siS and assigns a real number 𝜃(si) to si according to the contribution of si to solving the classification task [11, 60, 89]. Among all the features in S, only k (alternatively, a predefined percentage of |S|) features with the highest score are retained, where k is pre-specified by the user. The rest ones in S are discarded without consideration, resulting in a reduced feature space. An illustration of how the filter methods work is given in Fig. 2. Some filter methods evaluate goodness of a term based on how frequently it appears in text corpus, such as document frequency (DF), TF-IDF and term strength (TS), while other filter methods reply on information theory, such as mutual information (MI), information gain (IG), χ2 (CHI), ECCD, PCA, correlation coefficient (CC), t-test, etc. We summarize existing filter methods in Table 2, and discuss some of them in this section.

Fig. 2
figure 2

General framework of filter methods for text Classification

Table 2 Feature selection methods for text categorization

Document frequency method (DF) : document frequency of a term is defined as the total number of documents in the document collection that contain the term. The basic idea of DF is that rare terms are considered non-informative for classification and they should be removed during feature selection [51, 88]. Specifically, a ranking procedure is performed to evaluate the goodness or importance of each term in the vocabulary of the document collection, and here document frequency is regarded as the goodness measure for terms. The k most important terms are selected as features for classification, and the rest are filtered out. DF is simple and effective, since its time complexity is approximately linear in the number of training documents. However, the drawback of DF is that some selected terms that appear frequently in many documents may not be discriminative, whereas some discarded terms with low document frequency may be relatively informative.

TF-IDF method: this method stems from information retrieval, which takes into account both term frequency (TF) and inverse document frequency (IDF) when measuring the importance of a term [51, 88]. Here, TF is the number of times a term appearing in a document and IDF measures whether the term is common or rare across documents. The importance of a term is then jointly determined by the product of its TF and IDF values.

Information Gain (IG): information gain is one of the most popular metrics used in machine learning for measuring the goodness of attributes, for example, in ID3 [72] and C4.5 [73]. It is used to measure the dependence between features and class labels, as follows

$$ IG(s_{i}, c_{j}) = H(s_{i}) - H(s_{i}|c_{j}) $$
(1)

where si and cj are the j-th term and the class label respectively, H(si) the entropy of term si, and H(si|cj) the entropy of si after observing class label cj. Here entropy H(si) is defined as

$$ H(s_{i}) = -\sum\limits_{j} p(x_{j})\log_{2} p(x_{j}) $$
(2)

and entropy H(si|cj) can be computed by

$$ H(s_{i}|c_{j}) = - \sum\limits_{k} p(c_{k})\sum\limits_{j} p(x_{j}|c_{k})\log_{2} p(x_{j}|c_{k}) $$
(3)

Basically, if a feature has a larger IG value with respect to a class label, then this feature is more relevant for classification. Zheng et al. [114] proposed iSIG, an improved IG measure, by taking into account of both positive and negative features for imbalanced text data. Some researchers employed IG as a preprocessing component to rank features, which is then followed by a feature reduction technique, such as PCA and genetic algorithm (GA) for feature selection [90].

Another measure called Term ReLatedness (TRL) also takes into account occurrences of terms, however it focuses on term absence and presence with respect to categories. Empirical study shows that TRL achieves performance improvement even after removal of 90% unique terms [8]. Some researchers proposes to simultaneously consider the significance of a term both inter-category and intra-category, and based on this idea they designed a comprehensive measure called CMFS for feature selection [110]. CMFS outperforms IG, DF, CHI, when naive Bayes and SVM classifiers are used. Uysal et al. [92] proposed another measure named Distinguishing Feature Selector (DFS), which relies on the same idea of selecting distinctive terms by taking into consideration several rules of how distinctive or irrelevant a term could be. DFS shows competitive performance with respect to some well-known metrics such as CHI and IG.

χ2 (CHI): CHI can measure the degree of independence between a term and a category and has been widely used for text categorization [76, 108]. Given a term si and a category cj, a contingency table can be constructed, based on which we can evaluate whether si and cj are independent. The major drawback of CHI is that it is not reliable for low-frequency terms [108]. Similar measures that rely on χ2-test and contingency table include GSS [30] and ECCD [52], hence they all suffer from the same drawback as CHI. To overcome the drawback of χ2-based methods, Wang et al. [98] proposed to combine averaged term frequency and student t-test for feature selection. Here, the averaged term frequency of term si is the average of the term frequencies of si across the document corpus. Obviously, the averaged term frequency can capture the low-frequency terms and experiments confirm that their technique is superior to CHI in terms of macro-F1 and micro-F1.

Mutual Information (MI): MI is a frequently used metric in information theory to measure the mutual dependency between two variables [87]. Specifically, the MI value between a term si and a class label cj is defined as

$$ MI(s_{i}, c_{j}) = \log \frac{p(s_{i}, c_{j})}{p(s_{i})p(c_{j})} $$
(4)

MI performs poorly, as compared to other measures such as IG and CHI, due to its bias toward favoring rare terms and its sensitivity to errors in probability estimation [108]. A detailed review on feature selection methods based on mutual information can be found in [95].

Correlation Coefficient (CC): CC is a variant of the CHI measure, and it can be viewed as a one-sided χ2 metric [67]. The rationale of CC is that it looks for terms that only come from the relevant documents of a category cj and are indicative of membership in cj. And terms that come from irrelevant documents or are highly indicative of non-membership in cj are not regarded as useful. This is the major difference between CC and CHI, which makes CC superior to CHI [67].

Maximum Discrimination (MD): MD is an information theory based method for feature selection, and its basic assumption is that the goodness of a feature can be measured by the discriminative capacity of the feature [88]. Specifically, MD uses a Jeffreys-Multi-Hypothesis divergence (JMH-divergence) to compute discriminative capacity of each feature, and it is designed for naive Bayes classifiers in text categorization.

Linear Measure (LM): LM is a family of linear measures for feature selection in text categorization [16]. A measure is called linear filtering measure if it is in the form of LMk(w) = kaw,cbw,c, where aw,c represents the number of documents with category label c in which term w appears, bw,c the number of documents containing term w but do not belong to category c, and k is a parameter. The value of LM can reveal the quality of the rule wc, that is, if term w appears in a document, then that document belongs to category c. LM shows superiority to some entropy-based and TF-IDF based measures when a SVM classifier is adopted for text classification [16].

Some researchers propose to use a round-robin strategy called SpreadFx to rank features with respect to the class, and experiments show that SpreadFx achieves substantial improvements compared to IG and CHI [27].

Bi-Normal Separation (BNS)is a distribution-based metric that relies on normal distribution. For intuition, suppose the occurrence of a given feature in each document is modeled by the event of a random normal variable exceeding a hypothetical threshold. The prevalence rate of the feature corresponds to the area under the curve past the threshold. If the feture is more prevalent in the positive class, then its threshold is further from the tail of the distribution than that of the negative class. The BNS measures the separation between these two thresholds. Eyheramendy et al. proposed to use Bayesian posterior probability based on Bernoulli distribution for feature ranking, and designed a Posterior Inclusion Probability (PIP) method for feature selection [86].

IGFSS [91] is an ensemble feature selection method for text categorization, which aims to improve the performance of classification by modifying the last step of filter-based feature selection algorithms. Conventional filter-based feature selection algorithms rank the features and choose the top-N features for classification, where N is an empirical parameter specified by the user, whereas IGFSS can create a set of features that represent all classes nearly equally, hence these features can improve the performance of classifiers.

Different from existing filter-based algorithms, Dasgupta et al. [19] proposed three sampling-based methods, namely Subspace Sample (SS), Weight-based Sampling (WS), and Uniform Sampling (US), for feature selection. These sampling-based methods randomly sample a small proportion of features, where these sampled features are independent of the total number of features, but dependent on the number of documents and an error parameter. Both theoretical and experimental result show that the proposed sampling methods perform well compared with other popular filter-based feature selection methods [19].

Fragoudis et al. proposed a filter method called Best Terms (BT) for text categorization [28]. Specifically, BT uses a two-step procedure to select the best features. In the first step, BT collects all documents that belong to a given category cj and selects a set of features that yield the highest prediction accuracy for these documents with respect to cj. In the second step, BT chooses documents that do not belong to cj and contain at least one of the features obtained in the first step. Then another set of features are computed, which best classify these documents with respect to \(\bar {c_{j}}\) where \(\bar {c_{j}}\) is the compliment of cj. The union of the two sets of features obtained during the two steps is the final result. BT improved the accuracy of NB and SVM, as compared to filter methods such as DF, IG, MI, CHI and GSS [28].

Filter-based methods are prevalent in feature selection for text categorization, in that documents usually contain tens of thousands of features and filter methods are generally very efficient to pick single feature among others. However, some researchers also explored the possibility of applying more sophisticated and accurate techniques for text categorization, as described below.

3.2 Wrapper model

The wrapper approach utilizes some search strategy to evaluate each possible subset SS, by feeding S to the chosen classifier and then evaluating the performance of the classifier. These two steps are repeated until the desired quality of feature subset is reached. The wrapper approach achieves better classification accuracy than filter methods, however the computational complexity of wrapper approaches is very high [66, 89]. The number of subset to consider is exponential when the cardinality of S is very large, meaning that the wrapper approach is inadequate for text categorization task, due to the fact that the feature space is usually in the order of hundreds even thousands. Hence, the wrapper approach is only feasible when the number of features is relatively small [24, 71, 89]. In practice, heuristics can be used to restrict the search space, so as to speed up the evaluation process.

There are some search strategies for generating feature subsets, such as hill-climbing, best-first search, branch-and-bound, and genetic algorithms [35, 48]. The hill-climbing expands a feature subset and turns to the subset with the highest accuracy. Hill-climbing terminates when there is no subset improved over current subset. The best-first search is to select the most promising subset that has not been explored before. In general, best-first search is more robust than hill-climbing. Greedy search is a computationally efficient strategy to find the optimal feature subset, which contains forward selection and backward elimination methods. The forward selection method starts with an empty set, and features are added into this set progressively according to some goodness measure. In contrast, backward elimination begins with the whole set of features and less promising features are removed from this set progressively.

Although there are quite a lot wrapper methods for generic data classification, such as LVM [59], FSSEM [24], SFFS [41], backward elimination and forward selection [44], very few are explicitly designed for text categorization purpose. The reason might be that wrapper methods are not suitable for text classification scenario due to high dimensionality, as claimed in [14]. Gutlein et al. focused on forward selection wrapper methods and proposed linear forward selection (LFS) method to reduce the number of feature expansions in each forward selection step [34]. Specifically, two strategies, namely fixed set and fixed width, are designed to limit the number of features considered during forward selection. The main drawback of LFS, however, is that it only takes into account the top k features (obtained by using some filter method or ranking function) during forward selection, failing to utilize the remaining features. Some researchers focused on SVM classifiers and proposed to increase efficiency of SVMs by desiging wrapper-based feature method for text categorization [101].

3.3 Embedded model

Different from the above two approaches, the embedded approach does not perform the feature selection phase explicitly before learning task begins. Instead, it embeds feature selection operation into the learning process [66]. Some argues that decision trees (DT) such as ID3 and C4.5 are examples of embedded method for feature selection, since while constructing the classifier, DT selects the best features (attributes) that may give the best discriminative power. However, to the best of our knowledge, there is no embedded feature selection method dedicated to text categorization.

3.4 Hybrid model

Hybrid methods are different from the embedded ones, in that the former combine a filter method with a wrapper method during the feature selection process, whereas the latter embed feature selection operation into the learning process of a classifier [9, 66]. Most hybrid methods employ some sort of filter methods to select promising features at first, then apply wrapper methods to the obtained features, for example those methods in [104] and [68], which are designed for generic data though instead of for text corpus.

Günal proposed a simple hybrid method, named HYBRID, that combines filter method and wrapper feature selection steps to select promising combination of features for text categorization [33]. HYBRID consists of two stages. In the first stage each of the four filter methods DF, MI, CHI and IG are employed to select a subset of the features with the top k highest scores, where k is a parameter specified by the user. Then those subsets of selected features are merged together, by eliminating duplicate features. In the second stage, based on this subset of features a generic algorithm is utilized to find the final solution. The major finding of G unal’s work is that a combination of features selected by various filter methods is more effective than the features selected by a single filter-based method. A similar idea is proposed by Chou et al., who employ filter method first and then apply wrapper method to the selected features [14].

The above hybrid methods for text categorization suffer from several problems: (1) even though lots of irrelevant features are pruned by the filter methods, the number of wrapper evaluations can still be very large, and (2) the hybrid methods ignore interactions between the selected features and the pruned ones. Aiming to solve these problems, Bermejio et al. proposed a novel iterative hybrid strategy by combing re-ranking method and wrapper evaluation [9]. Specifically, a filter method is used to rank all the features, and then a wrapper method is performed on the first k features of the ranked list, resulting in a subset of features selected. Then, this subset of features and those remaining ones are re-ranked again, producing another ranked list of features, which are fed into the wrapper method in the next run. This process repeats until there is no change in the selected features.

4 Discussion and conclusion

In this paper, we have given a detailed review of the state-of-the-art feature selection methods for text classification. Although there is overwhelmingly large number of feature selection techniques, a relatively small portion of them are dedicated to text classification purpose. Feature selection methods are generally divided into four categories, namely the filter model, wrapper model, embedded model, and hybrid model. Filter model is the most efficient one and has also been investigated extensively in text categorization. However, there are very few wrapper and embedded methods for text categorization at present, due to the fact that these two models are very computationally expensive when facing thousands of features contained in a normal text document. To overcome this challenge, researchers have proposed hybrid model that employ filter methods to eliminate redundant and irrelevant features, and the selected features are fed to wrapper methods for further refinement.

We also notice that there are some work targeted at novel applications of feature selection in recently years, for example, multi-label feature selection [42, 61, 82, 117, 118, 122], feature selection with streaming features [103], online feature selection [99], filter-based locality preserving feature selection [36], similarity preserving feature selection [112], feature selection with optimization techniques [105], regularization based feature selection methods [121], feature selection with machine learning techniques [25, 38, 39, 53, 115, 119, 123], stability measures for feature selection algorithms [46]. However, these work mainly focus on generic data, and it is not clear whether they can be applied to text data. With the proliferation of text applications, we may see a trend that these feature selection techniques will be applied to text categorization, and interesting problems may arise, for example, feature selection for text categorization when there are missing values in documents [116].