1 Introduction

In machine learning, a sample often contains a large number of features, many of which are highly correlated or even logically redundant, leading to high computational complexity and low interpretability [1]. As a consequence, many dimension reduction methods, such as feature selection, feature reduction and feature extraction, have been studied extensively in recent years. Feature selection is concerned with identifying a small subset of relevant features that is sufficient for learning the target concept, while feature reduction aims to eliminate logically redundant features from the original feature space without sacrificing classification accuracy [2]. Different from feature reduction and feature selection, feature extraction maps a high dimensional feature space into a distinct low dimensional feature space through a transformation of the original features [3]. Typical feature extraction methods include AutoEncoder (AE) [4], latent semantic indexing (LSI) [5], partial least squares (PLS) [6], multidimensional scaling [7], principal component analysis (PCA) [8] and latent Dirichlet allocation (LDA) [9]. However, compared with feature selection and feature reduction, feature extraction is hard to interpret because the physical meaning of the original features cannot be retrieved, which limits its application in dimension reduction.

Feature selection methods can be divided into two types: filters and wrappers [10]. Filters evaluate the importance of features individually according to predefined metrics instead of using classifiers; features are measured and ranked by simple criteria such as distance, dependency and information. The most widely used filters include: Chi-square (CHI) [11], improved Gini index (IMGI) [12], distinguishing feature selector (DFS) [13], odds ratio (OR) [14], document frequency (DF) [11], Darmstadt indexing approach (DIA) [15], information gain (IG) [16], mutual information (MI) [17], F-score [18], orthogonal centroid feature selection (OCFS) [19], feature selection considering the imbalance problem in text categorization (CMFSX) [20], supervised meaning rank (SMR) [21], unsupervised meaning average (UMA) [21], balanced accuracy measure (ACC2) [22], normalized difference measure (NDM) [23], clustering based feature selection (CBFS) [24], improved mutual information feature selection (NDMI) [25], novel feature selection based on normalized mutual information (NNMI) [26], and multi-label feature selection based on max-dependency and min-redundancy (MDMR) [27]. Among these methods, NDMI, NNMI and MDMR are MI based filters that use a greedy searching policy to select the best features by calculating feature-feature and feature-class mutual information. A wrapper estimates the accuracy of a classifier and keeps adding unselected features to the feature subset until the accuracy falls below that of the feature set already selected [28]. Typical wrapper-based feature selection methods include forward search (FS), backward search (BS), sequential floating search (SFS) and simplified silhouette filter (SSF) [29]. Compared with wrappers, filters may provide less precise results, but they are particularly efficient when dealing with large datasets.

To combine the advantages of filters and wrappers, many hybrid methods have been proposed in recent years. Typical hybrid feature selection methods include: cluster based hybrid feature selection approach (CBHFS) [30], improved global feature selection scheme (IGFSS) [31], variable global feature selection scheme (VGFSS) [32], hybrid dimension reduction integrating feature selection with feature extraction (TDPFS) [3], novel feature selection based on harmony search (HFS) [33], hybrid approach of differential evolution and artificial bee colony for feature selection (HDAFS) [34], particle swarm optimization for feature selection (PSOFS) [35], hybrid feature selection based on enhanced genetic algorithm (EGAFS) [36], hybrid feature selection using component co-occurrence based feature relevance measurement (HFSCC) [37], multi-measure multi-weight ranking approach (MMMW) [38], multi-filter based feature selection (EMFFS) [39], and two-step based feature selection (TSFS) [40]. These methods achieve good classification accuracy and have therefore been widely used in data classification.

In this paper, a new hybrid feature selection method (called MFHFS) based on multi-filter weights and multi-feature weights is proposed. In our experiments, we apply the proposed method with SVM and RF classifiers on six benchmark datasets and show its effectiveness by demonstrating that it significantly outperforms typical existing feature selection methods in terms of classification accuracy or running speed. The remainder of this paper is organized as follows. Section 2 reviews related work on feature selection. Section 3 gives the details of the proposed method. Section 4 presents the experimental results. Section 5 concludes the paper.

2 Related work

2.1 The filters

2.1.1 The IMGI method

In order to apply feature selection to data classification tasks with multiple class labels, Shang [14] improved the traditional Gini index method [41] and proposed the IMGI method. Given a feature ti, its IMGI value is defined as follows:

$$ \mathrm{IMGI}\left({t}_i\right)=\sum \limits_{c_k}p{\left({t}_i|{c}_k\right)}^2\times p{\left({c}_k|{t}_i\right)}^2 $$
(1)

where p(ti|ck) represents the conditional probability that feature ti occurs in category ck, p(ck|ti) represents the conditional probability that a sample belongs to ck when it contains ti.
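For illustration, a minimal Python sketch of Eq. (1) is given below (not the authors' code); it assumes a binary occurrence matrix X of shape samples × features and a label vector y, and all variable and function names are ours.

```python
import numpy as np

def imgi_scores(X, y):
    """IMGI score of each feature per Eq. (1)."""
    X = (np.asarray(X) > 0).astype(float)   # binary occurrence matrix (samples x features)
    y = np.asarray(y)
    feat_count = X.sum(axis=0)              # number of samples containing each t_i
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        p_t_given_c = Xc.sum(axis=0) / max(len(Xc), 1)          # p(t_i | c_k)
        p_c_given_t = np.divide(Xc.sum(axis=0), feat_count,
                                out=np.zeros_like(feat_count),
                                where=feat_count > 0)            # p(c_k | t_i)
        scores += p_t_given_c ** 2 * p_c_given_t ** 2
    return scores
```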

2.1.2 The CHI method

Given a category ck and a feature ti, ti has strong category discriminative ability if it has a high CHI value. The CHI method calculates the score of ti as follows [42]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{CHI}\left({t}_i\right)=\underset{c_k}{\max}\left\{\mathrm{CHI}\left({t}_i,{c}_k\right)\right\}\\ {}\mathrm{CHI}\left({t}_i,{c}_k\right)=\frac{N{\left({a}_{ik}{d}_{ik}-{b}_{ik}{e}_{ik}\right)}^2}{\left({a}_{ik}+{b}_{ik}\right)\left({a}_{ik}+{e}_{ik}\right)\left({b}_{ik}+{d}_{ik}\right)\left({e}_{ik}+{d}_{ik}\right)}\end{array}} $$
(2)

where N is the total number of samples, aik is the number of samples of class ck that contain feature ti, bik is the number of samples not belonging to ck that contain ti, eik is the number of samples of ck that do not contain ti, and dik is the number of samples that neither belong to ck nor contain ti.
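A hedged sketch of Eq. (2) follows, assuming the same binary occurrence matrix X and label vector y as in the IMGI sketch above; the per-class counts a, b, e and d are formed as described and the maximum over classes is returned.

```python
import numpy as np

def chi_scores(X, y):
    """CHI score of each feature per Eq. (2): the maximum over classes of the chi-square value."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    N = float(len(y))
    best = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)
        a = X[in_c].sum(axis=0)        # samples of c_k that contain t_i
        b = X[~in_c].sum(axis=0)       # samples outside c_k that contain t_i
        e = in_c.sum() - a             # samples of c_k without t_i
        d = (~in_c).sum() - b          # samples outside c_k without t_i
        denom = np.maximum((a + b) * (a + e) * (b + d) * (e + d), 1e-12)
        best = np.maximum(best, N * (a * d - b * e) ** 2 / denom)
    return best
```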

2.1.3 The DFS method

DFS is a successful feature selection method and a global feature selection metric [13]. The idea of DFS is to select a set of distinctive features while eliminating the uninformative ones. The DFS score is defined as follows [31]:

$$ \mathrm{DFS}\left({t}_i\right)=\sum \limits_{c_k}\frac{p\left({c}_k|{t}_i\right)}{p\left(\overline{t_i}|{c}_k\right)+p\left({t}_i|\overline{c_k}\right)+1} $$
(3)

where \( p\left(\overline{t_i}|{c}_k\right) \) is the conditional probability of absence of term ti given class ck, and \( p\left({t}_i|\overline{c_k}\right) \) is the conditional probability of term ti given all the classes except ck.
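The sketch below mirrors Eq. (3) under the same assumptions (binary occurrence matrix X, labels y); it is an illustration rather than the original implementation.

```python
import numpy as np

def dfs_scores(X, y):
    """DFS score of each feature per Eq. (3)."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    feat_count = X.sum(axis=0)
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)
        p_c_given_t = np.divide(X[in_c].sum(axis=0), feat_count,
                                out=np.zeros_like(feat_count), where=feat_count > 0)
        p_not_t_given_c = 1.0 - X[in_c].sum(axis=0) / max(in_c.sum(), 1)      # p(absence of t_i | c_k)
        p_t_given_not_c = X[~in_c].sum(axis=0) / max((~in_c).sum(), 1)        # p(t_i | not c_k)
        scores += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1.0)
    return scores
```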

2.1.4 The CMFSX method

Yang proposed a new feature selection method (called CMFSX) that weakens the adverse effect caused by class imbalance in the dataset. CMFSX calculates the score of a feature ti as follows [20]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{CMFSX}\left({t}_i\right)=\underset{c_k}{\max}\left\{\mathrm{CMFSX}\left({t}_i,{c}_k\right)\right\}\\ {}\mathrm{CMFSX}\left({t}_i,{c}_k\right)=\frac{p\left({t}_i|{c}_k\right)\times p\left({c}_k|{t}_i\right)}{p\left({c}_k\right)}\end{array}} $$
(4)

2.1.5 The SMR method

SMR uses a class of documents as the basic unit or context to calculate the meaning scores of words. Assume that a feature ti appears k times in the dataset S and m times in the documents of class ck; SMR then calculates the score of ti as follows [21]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{SMR}\left({t}_i\right)=\underset{c_k}{\max}\left\{\mathrm{SMR}\left({t}_i,{c}_k\right)\right\}\\ {}\begin{array}{l}\mathrm{SMR}\left({t}_i,{c}_k\right)=-\frac{1}{m}\log \left(\begin{array}{c}k\\ {}m\end{array}\right)-\left[\left(m-1\right)\log N\right]\\ {}N=L/B\end{array}\end{array}} $$
(5)

where L and B are the feature frequency of ti in the dataset and class ck, respectively.

2.1.6 The NDM method

NDM is a modified form of the ACC2 measure [22]; it addresses class imbalance by normalizing the true positive and false positive counts by the respective class sizes and by dividing the absolute difference of the two rates by the smaller of them. Mathematically, NDM is defined as follows [23]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{NDM}\left({t}_i\right)=\mid \frac{tpr\left({t}_i\right)- fpr\left({t}_i\right)}{\min \left( tpr\left({t}_i\right), fpr\left({t}_i\right)\right)}\mid \\ {} tpr\left({t}_i\right)=\frac{t{p}_i}{t{p}_i+f{n}_i}\\ {} fpr\left({t}_i\right)=\frac{f{p}_i}{f{p}_i+t{n}_i}\end{array}} $$
(6)

where tpi is the number of positive-class samples that contain the term ti, fni is the number of positive-class samples that do not contain ti, fpi is the number of negative-class samples that contain ti, and tni is the number of negative-class samples that do not contain ti.
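As an illustration of Eq. (6), the following sketch computes the NDM score of a single term from its four counts; the eps guard for min(tpr, fpr) = 0 is our choice, not necessarily the handling used in the original NDM paper.

```python
def ndm_score(tp, fn, fp, tn, eps=1e-12):
    """NDM score of a term per Eq. (6), from its document counts."""
    tpr = tp / max(tp + fn, 1)    # fraction of positive-class samples containing the term
    fpr = fp / max(fp + tn, 1)    # fraction of negative-class samples containing the term
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)
```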

2.1.7 The NDMI method

NDMI uses the mutual information of feature-feature and feature-class to determine an optimal set of features [25]. It uses a greedy searching policy to select the most discriminative features and filters the redundant ones. Given the feature ti, its NDMI score is defined as follows:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{NDMI}\left({t}_i\right)=\max \left\{\sum \limits_{c_k}\mathrm{MI}\left({t}_i,{c}_k\right)-\frac{1}{\mid S\mid}\sum \limits_{t_j\in S}\mathrm{MI}\left({t}_i,{t}_j\right)\right\}\\ {}\mathrm{MI}\left({t}_i,{c}_k\right)=p\left({c}_k,{t}_i\right){\log}_2\frac{p\left({c}_k,{t}_i\right)}{p\left({c}_k\right)p\left({t}_i\right)}\\ {}\mathrm{MI}\left({t}_i,{t}_j\right)=p\left({t}_i,{t}_j\right){\log}_2\frac{p\left({t}_i,{t}_j\right)}{p\left({t}_i\right)p\left({t}_j\right)}\end{array}} $$
(7)

where S is the set of selected features, |S| is the number of features in S, p(ti) denotes the probability that a sample contains feature ti, p(ck) denotes the probability of the samples in category ck, p(ti, ck) denotes the probability that ti occurs in ck, and p(ti, tj) denotes the probability that ti and tj both occur in a sample.
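A sketch of the greedy selection behind Eq. (7) is given below. It is only in the spirit of NDMI: sklearn's mutual_info_score is used as a stand-in for the MI terms, X_disc is assumed to hold discretized feature values, and the function name is ours.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def ndmi_select(X_disc, y, n_select):
    """Greedy NDMI-style selection: at each step, add the feature with the largest
    class relevance minus average redundancy with the already selected features."""
    n_features = X_disc.shape[1]
    relevance = np.array([mutual_info_score(y, X_disc[:, i]) for i in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        def ndmi(i):
            if not selected:
                return relevance[i]
            redundancy = np.mean([mutual_info_score(X_disc[:, i], X_disc[:, j]) for j in selected])
            return relevance[i] - redundancy
        best = max(remaining, key=ndmi)
        selected.append(best)
        remaining.remove(best)
    return selected
```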

2.1.8 The MDMR method

Different from traditional single-label feature selection, MDMR considers not only the redundancy between individual features and the redundancy between the class labels and the candidate features, but also the conditional dependency between the features and each class label. The objective function is described as follows [27]:

$$ \max \left\{\sum \limits_{c_j\in C}\mathrm{MI}\left({t}_i,{c}_j\right)-\frac{1}{\mid S\mid}\sum \limits_{t_l\in S}\left(\mathrm{MI}\left({t}_i,{t}_l\right)-\sum \limits_{c_j\in C}\mathrm{MI}\left({t}_i,{c}_j|{t}_l\right)\right)\right\} $$
(8)

where MI(ti, cj| tl) is the mutual information between the candidate feature ti and all categories when given the selected feature tl.

2.2 The hybrid methods

As different filters use different metrics to evaluate feature importance, they often output different feature subsets for a particular dataset. In recent years, hybrid methods that combine different feature evaluation metrics have received considerable attention.

2.2.1 The EGAFS method

This approach combines the advantages of filters with an enhanced genetic algorithm (EGA) in a wrapper approach to handle the high dimensionality of the feature space [36]. EGA improves the crossover and mutation operators of traditional genetic algorithms: crossover is performed based on feature subset partitioning, while mutation is performed based on the classifier performance of the original parents and the feature importance. Moreover, this method combines six well-known filters with the EGA to obtain the final feature subset. Though EGAFS achieves high classification accuracy, its use of wrappers leads to high computational complexity.

2.2.2 The TDPFS method

Based on two feature selection methods and a feature extraction method, Bharti and Singh proposed the TDPFS method for text clustering [3]. First, a typical term frequency based feature selection and a typical document frequency based feature selection are used to obtain two feature subsets. Then, a modified union operation is proposed to merge the features of these two subsets. Finally, the PCA algorithm is applied to the merged features to further reduce the dimensionality without losing much information.

2.2.3 The IGFSS method

Although the features selected by a traditional feature selection scheme may represent some of the classes successfully, other classes may not be represented at all. Uysal proposed the improved global feature selection scheme (IGFSS), in which a traditional feature selection method is modified to obtain a more representative feature set [31]. In other words, IGFSS aims to improve the performance of traditional feature selection by selecting features that represent all classes almost equally.

2.2.4 The VGFSS method

IGFSS solves the problem that some classes may not be represented by selecting an equal number of representative features from every class. However, this strategy is not suitable for an unbalanced dataset with many classes, because important features of a class that contains a higher number of features may be ignored. On this basis, Agnihotri and Verma proposed the VGFSS method, which selects a variable number of features from each class based on the distribution of features [32]. The number of features selected from each class is defined as follows:

$$ NV\left({C}_j\right)= count\left({C}_j\right)\times \frac{N}{TFC} $$
(9)

where count(Cj) is the number of features of class Cj, N is the number of finally selected features, and TFC is the total number of features in the dataset.
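As a small worked illustration of Eq. (9) (names and numbers are ours):

```python
def vgfss_quota(count_per_class, n_final):
    """Per-class feature quota under Eq. (9): proportional to each class's share of all features."""
    tfc = sum(count_per_class.values())
    return {c: round(cnt * n_final / tfc) for c, cnt in count_per_class.items()}

# Example: 1000 features distributed as {'A': 600, 'B': 300, 'C': 100} with N = 50 final features
# gives quotas {'A': 30, 'B': 15, 'C': 5}.
```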

2.2.5 The CBHFS method

Jaskowiak and Campello presented a hybrid feature selection (CBHFS) tailored for data classification problems [30]. This method consists of two main stages: in the first stage, it uses a clustering algorithm to identify the best features and remove the redundant ones [18]; in the second stage, a wrapper is used to evaluate different feature subsets produced by the clustering processes, determining the one that produces the best classification performance in terms of accuracy. This method has high classification accuracy but faces the problems of parameter dependency and high computational complexity.

2.2.6 The HFSCC method

Wang and Feng proposed a hybrid feature selection which can achieve the best features effectively and efficiently [37]. HFSCC consists of three steps: firstly, the samples are preprocessed and two feature subsets are obtained by using two different optimal filters; secondly, a feature weight based union operation is proposed to merge the obtained feature subsets; finally, the hierarchical agglomerative clustering algorithm is applied to obtain the final feature subset by combining a component co-occurrence based feature relevance measurement and a predetermined threshold. Experimental results show that this method achieves high classification accuracy and execution speed.

2.2.7 The MMMW method

As the features selected by different feature selections are always different, it is not enough to evaluate the importance of a feature by using only one particular feature selection. Bhattacharya and Selvakumar proposed the MMMW method which combines a filter, a wrapper and a clustering based feature selection to select the best features [38]. Though MMMW uses novel metrics to assign weights to the features, it ignores the weights of different methods on different datasets.

2.2.8 The EMFFS method

Osanaiye and Cai proposed a multi-filter based feature selection that combines the results of four filters to achieve the best features [39]. This method uses a simple majority voting policy to merge the features selected by different filters. Moreover, a threshold is predetermined to select the frequently occurring features among the four filters to construct the set of final features. However, this method ignores the fact that different filters have different performances on different datasets, and the optimal value of the predetermined threshold is hard to tune.

As shown in Table 1, the main advantages and limitations of the above feature selection methods can be summarized as follows: (1) traditional filters have high running speed, but they cannot filter redundant features and their results depend heavily on the chosen feature importance measurement and the dataset; (2) MI based filters and wrappers can filter redundant features, but they generally have high computational complexity when dealing with datasets with large numbers of features or samples [40]; (3) hybrid methods obtain higher classification accuracy than the other methods, but some are time-consuming because they use wrappers to obtain the best feature subset. Moreover, some existing hybrid methods combine the results of different filters with a simple majority voting or set union mechanism [37], ignoring the effects of feature weights and filter weights.

Table 1 Advantages and limitations of different feature selections

To solve these problems, we propose a new hybrid feature selection method (called MFHFS) based on multi-filter weights and multi-feature weights. The main contributions of this paper are as follows: (1) a new hybrid feature selection framework is proposed, which combines the running speed of traditional filters with the ability of the greedy searching policy to filter redundant features; (2) to avoid the problem that the performance of a filter is unstable when the dataset changes, a vector of multi-filter weights is introduced, and a matrix of multi-feature weights is proposed to distinguish the importance of different features across datasets and obtain the most discriminative features; (3) a Q-range based feature relevance calculation method is proposed to speed up the computation of the feature relevance matrix when filtering redundant features.

3 The proposed method - MFHFS

3.1 Description of MFHFS

Though traditional filters have high execution speed, they fail to filter redundant features and their performance depends heavily on the chosen feature importance measurement and the dataset. Moreover, though the MI based filters and hybrid methods can filter redundant features, they often face high computational complexity, which decreases the execution speed when the number of features or samples is large [37]. On this basis, we combine the advantages of both types of methods and propose a new multi-filter weights and multi-feature weights based feature selection method (called MFHFS), whose flowchart is shown in Fig. 1. MFHFS is executed in three stages: (1) data preprocessing: the samples are normalized and discretized by the equal width interval binning (EWIB) algorithm, and noises and outliers are filtered out in combination with 10-fold cross-validation; (2) feature combination: several feature subsets are obtained by a set of optimal filters and merged into a temporary feature subset FSt by considering the multi-filter weights and multi-feature weights; (3) feature refinement: a Q-range based feature relevance calculation method is used to obtain the feature relevance matrix of FSt, and the final features are selected from FSt using the greedy searching policy. The details of MFHFS are described as follows:

Fig. 1 Flowchart of MFHFS

  • Stage 1: Data preprocessing

In the preprocessing stage, three operations are executed: normalization, discretization, and removal of noises and outliers. Normalization adjusts values measured on different scales to a notionally common scale; it reduces the computational cost and avoids the instability in classification accuracy that can arise when features have very different value ranges. Discretization transfers continuous data into discrete counterparts; it is used to reduce the computational complexity and improve the stability of feature selection in the preprocessing stage of the proposed method. Moreover, as noises and outliers may distort the distribution of the entire dataset, we remove them to improve the robustness of the proposed method. The details of data preprocessing are given in Algorithm 1.

Algorithm 1 Data preprocessing

In the normalization step (step 1.3), fmin and fmax are set to fmin = 0 and fmax = 6; in the discretization step (step 1.4), NQ is set to NQ = fmax + 1, and the interval width is set to \( \delta =\frac{f_{\mathrm{max}}-{f}_{\mathrm{min}}}{N_{\mathrm{Q}}} \) [37].
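Since Algorithm 1 is only shown as a figure, the sketch below illustrates how the normalization and EWIB discretization steps could look with the stated parameters (fmin = 0, fmax = 6, NQ = fmax + 1); it is our reconstruction under these assumptions, not the authors' code.

```python
import numpy as np

def normalize_and_discretize(X, f_min=0.0, f_max=6.0):
    """Min-max scale every feature to [f_min, f_max], then apply equal width interval
    binning (EWIB) with N_Q = f_max + 1 bins of width delta = (f_max - f_min) / N_Q."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)   # avoid dividing by zero
    X_norm = f_min + (X - col_min) / span * (f_max - f_min)
    n_q = int(f_max) + 1
    delta = (f_max - f_min) / n_q
    bins = np.floor((X_norm - f_min) / delta).astype(int)
    return np.clip(bins, 0, n_q - 1)
```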

  • Stage 2: Feature combination

Different traditional filters use different metrics to evaluate feature importance, so the performance of a filter may be unstable when the dataset changes. Osanaiye et al. proposed an ensemble-based multi-filter feature selection method that combines the output of four filters to achieve an optimal selection [40]. However, this method uses a simple majority vote with a predetermined threshold to determine the final selected features, which leads to the following problems: (1) it treats different filters equally and therefore ignores the filter weights, decreasing the classification accuracy when some filters perform badly on particular datasets; (2) the features selected by a filter are treated equally, ignoring the fact that their category discriminative abilities usually differ; (3) the predetermined threshold is related to the number of selected features and is hard to tune. On this basis, we propose a novel feature combination method that introduces a vector of multi-filter weights and a matrix of multi-feature weights; the flowchart of this stage is given in Fig. 2. Given the preprocessed sample set ES′, we first obtain the vector of multi-filter weights, denoted A = (α1, α2, …, αL). As shown in Algorithm 2, the details are as follows:

Fig. 2 Flowchart of feature combination stage

Algorithm 2 Obtaining the vector of multi-filter weights A

Further, we obtain the matrix of multi-feature weights and denote it as B = {Bi} = {(βi1, βi2, …, βij, …, βiM)}. As is shown in Algorithm 3, the details are given as follows:

Algorithm 3 Obtaining the matrix of multi-feature weights B

On this basis, the 1 × M vector of final feature weights is obtained as the product of the 1 × L vector of multi-filter weights and the L × M matrix of multi-feature weights, and the temporary feature subset FSt is obtained by the steps described in Algorithm 4:

Algorithm 4 Obtaining the temporary feature subset FSt

From Algorithms 2–4 we know that the effects of the multi-filter weights and the multi-feature weights are considered through the vector W, which avoids treating the optimal filters or the selected features equally when searching for the best features. For ease of computation, we set the number of optimal filters to L = 4 in this paper. Thus, the time complexity of obtaining the temporary feature subset FSt is O(4 × (M|C| + Mlog2M + 3N) + Nlog2N).
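The following sketch illustrates the combination step in outline form only: the filter-weight vector A of Algorithm 2 and the feature-weight matrix B of Algorithm 3 are multiplied to obtain the final weight vector W, from which the top-N features form FSt (Algorithm 4). All names are ours and the details of how the weights themselves are computed are omitted.

```python
import numpy as np

def combine_filters(A, B, n_keep):
    """Feature combination stage (Algorithms 2-4 in outline form).

    A: (L,) multi-filter weight of each optimal filter.
    B: (L, M) multi-feature weight of every feature under every filter.
    Returns the indices of the n_keep features with the largest combined weight
    (the temporary subset FS_t) together with the final weight vector W.
    """
    W = np.asarray(A) @ np.asarray(B)          # (1 x L) vector times (L x M) matrix -> (M,) weights
    fs_t = np.argsort(W)[::-1][:n_keep]
    return fs_t, W
```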

  • Stage 3: Feature refinement

To filter the redundant features in FSt, TDPFS [3] uses the PCA algorithm to transform the high dimensional feature space into an uninterpretable low dimensional space, while EGAFS [36] uses an enhanced GA based wrapper to select the optimal features and therefore suffers from high computational cost. On this basis, we use the greedy searching policy [25,26,27] to obtain the optimal final feature subset from FSt, whose number of features is much lower than that of the entire dataset. The details of this stage are given as follows:

(1) Obtain the feature relevance matrix

HFSCC [37] obtains the feature relevance matrix by calculating the feature relevance of every pair of features in FSt. This ignores the fact that a feature can only be redundant with features that have equal or similar global feature weight values, and results in a high computational complexity of O(N2). In this paper, we suppose that the proportion of redundant features in FSt is no higher than r (0 < r < 1), and propose a new feature relevance calculation method that only calculates the relevance between a feature fi and the features whose normalized global feature weight values are similar to that of fi. As shown in Fig. 3, given the temporary feature subset FSt obtained in stage 2, we first normalize the final feature weight values of FSt, and then calculate the feature relevance matrix Φ = {ϕij} (1 ≤ i, j ≤ N) using formula (15):

$$ \Big\{{\displaystyle \begin{array}{l}{\phi}_{ij}={\phi}_{ji}=\left\{\begin{array}{c}{c}_{ij},\mathrm{if}\mid {\mathbf{W}}_{\mathrm{N}}\left({f}_i\right)-{\mathbf{W}}_{\mathrm{N}}\left({f}_j\right)\mid <Q\\ {}0,\kern4.199998em \mathrm{else}\end{array}\right.\\ {}{c}_{ij}=\sum \limits_{f_{jl}\in \varOmega}\sum \limits_{f_{ik}\in \varOmega}\left(p\left({f}_{ik}|{f}_{jl}\right)\times p\left({f}_{jl}|{f}_{ik}\right)\times p\left({f}_{jl},{f}_{ik}\right)\right)\end{array}} $$
(15)

where fi and fj are the ith and jth features in the feature subset FSt; Q is a factor which ranges in the interval (0, 1); WN(fi) and WN(fj) are the normalized final feature weight values of fi and fj, respectively; cij denotes the feature relevance of fi and fj using the CCFR algorithm of HFSCC method; Ω denotes the corresponding component values (possible values) for each feature; fik denotes the kth component value of fi and fjl denotes the lth component value of fj; p(fik| fjl) denotes the conditional probability that fik occurs when fjl occurs; p(fjl| fik) denotes the conditional probability that fjl occurs when fik occurs; p(fik, fjl) is the feature component based normalization coefficient which denotes the probability that fik and fjl occur together in the dataset.
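A sketch of the Q-range based relevance computation of formula (15) is given below (our reconstruction, not the authors' code); it only evaluates cij for pairs whose normalized weights differ by less than Q, leaving the remaining entries of Φ at zero.

```python
import numpy as np

def q_range_relevance(X_disc, W_norm, Q=0.1):
    """Feature relevance matrix of Eq. (15); X_disc holds the discretized values
    of the features in FS_t (samples x features), W_norm their normalized weights."""
    n = X_disc.shape[1]
    phi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if abs(W_norm[i] - W_norm[j]) >= Q:
                continue                                   # outside the Q range: keep 0
            vi, vj = X_disc[:, i], X_disc[:, j]
            c = 0.0
            for a in np.unique(vi):
                for b in np.unique(vj):
                    p_ab = np.mean((vi == a) & (vj == b))  # p(f_ik, f_jl)
                    if p_ab == 0:
                        continue
                    p_a_given_b = p_ab / np.mean(vj == b)  # p(f_ik | f_jl)
                    p_b_given_a = p_ab / np.mean(vi == a)  # p(f_jl | f_ik)
                    c += p_a_given_b * p_b_given_a * p_ab
            phi[i, j] = phi[j, i] = c
    return phi
```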

(2) Select the final feature subset using the greedy searching policy

Fig. 3 Graphical representation of different feature relevance calculation methods

Though HFSCC [37] and TSFS [40] can filter the redundant features of FSt, they both require a predetermined threshold th that has a great influence on the number of selected features. Therefore, we apply the greedy searching policy of traditional MI filters to the sorted feature subset FSt to obtain the final feature subset FS. A greedy algorithm solves an optimization problem by selecting, at each step, the choice that maximizes the measured fitness [44]; it picks the locally optimal solution at every step in the hope that the series of local optima yields a globally optimal solution. As this policy measures the importance of features by maximizing the relevance between the features and the classes while minimizing the redundancy among the selected features [26, 27], the objective function for selecting a feature f from FSt is defined as follows:

$$ \Big\{{\displaystyle \begin{array}{l}f=\arg \underset{f_i\in F{S}_{\mathrm{t}}}{\max } Score\left({f}_i\right)\\ {} Score\left({f}_i\right)=\theta \times {\mathbf{W}}_{\mathrm{N}}\left({f}_i\right)-\frac{\varphi }{\mid S\mid}\sum \limits_{f_j\in S}{\phi}_{ij}\end{array}} $$
(16)

where S is the set of selected features, and θ (θ > 0) and φ (φ > 0) are two parameters balancing the feature category discriminative ability and the feature redundancy. In order to emphasize the importance of the category discriminative ability and weaken the data loss that may be introduced by setting some elements of Φ to zero directly, θ and φ are set to θ = 1 and φ = 0.5 in this paper. On this basis, the details of stage 3 are shown in Algorithm 5:

Algorithm 5 Feature refinement
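Since Algorithm 5 is only shown as a figure, the following sketch illustrates the greedy refinement driven by Eq. (16); the function name and loop structure are our assumptions.

```python
import numpy as np

def refine_features(W_norm, phi, n_select, theta=1.0, varphi=0.5):
    """Greedy refinement per Eq. (16): repeatedly add the feature of FS_t that maximizes
    theta * W_N(f_i) - (varphi / |S|) * sum of its relevance to the already selected set."""
    remaining = list(range(len(W_norm)))
    selected = []
    while remaining and len(selected) < n_select:
        def score(i):
            if not selected:
                return theta * W_norm[i]
            return theta * W_norm[i] - varphi * np.mean([phi[i, j] for j in selected])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```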

From Algorithm 5 we know that the time complexity of steps 2–3 is O(N + pN2), where p (0 < p < 1) is the probability that the condition ∣WN(fi) − WN(fj) ∣  < Q holds. As there are few redundant features in FSt, p is much lower than 1 in practical situations, which means that the running speed of the proposed Q-range based method is 1/p times that of the HFSCC method. Moreover, we notice that the time complexity of step 4 is \( O\left(\sum \limits_{i=0}^{N_{\mathrm{s}}-1}\left(N-i\right)\times i\right)=\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6} \), thus we conclude that the overall time complexity of Algorithm 5 is \( O\left(N+p{N}^2+\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right) \).

For ease of understanding, we give an example of the execution of MFHFS. As shown in Table 2, the preprocessed samples are denoted ES = {s1, s2, s3, s4, s5, s6, s7, s8} and the set of class labels is denoted La = {1,2,3}. Moreover, we suppose that the set of optimal filters used in Algorithms 2–4 is OFS = {IG, IMGI, CHI, DFS}, the corresponding vector of multi-filter weights is A = (1, 1, 1, 1), the feature number of FSt is N = 8, the factor Q is Q = 0.1, and the feature number of FS is Ns = 6. On this basis, Table 3 shows the feature scores of the different methods, with the corresponding feature ranks given in brackets. Tables 4 and 5 give the feature relevance matrices of the features in FSt when the HFSCC method and the proposed Q-range based method are used, respectively. From Table 4, the features f1 and f2 are redundant with each other as their feature relevance equals 1. Further, Table 5 shows that the relevance between f1 and f2 is also captured accurately by the proposed Q-range based method, and the computational cost of our method is 25% (16 of 64 pairs) of that of the HFSCC method. Moreover, Table 6 gives the Score values of the features in FSt and the changes of the final feature subset FS as the number of selected features increases. From Table 6, the redundant feature f2 is filtered out and the final selected feature subset is FS = {f4, f5, f1, f8, f9, f7}.

Table 2 Example of the preprocessed sample set ES
Table 3 Scores of the features when different methods are used
Table 4 Feature relevance matrix of FSt using HFSCC method
Table 5 Feature relevance matrix of FSt using the proposed Q-range based method
Table 6 Changes of the final feature subset FS and Score values of the features in FSt

3.2 Time complexity analysis of MFHFS

The time complexity of the proposed feature selection consists of three stages:

(1) The stage of data preprocessing: we know from Algorithm 1 that the complexity of this stage is T1 = O(MD) + L × (O(flt) + O(clat) + |ES| × O(clap)), where O(flt), O(clat) and O(clap) are the time complexities of the filter flt and of the training and prediction processes of the classifier cla, respectively. However, as this stage can be applied before any feature selection method, we do not include it when calculating the time complexity of MFHFS.

(2) The stage of feature combination: we know from Algorithms 2–4 that the complexity of this stage is T2 = O(4 × (M| C| +Mlog2M + 3N) + Nlog2N), where M is the number of all features and N is the number of features selected in Algorithm 2.

(3) The stage of feature refinement: we know from Algorithm 5 that the time complexity of this stage is \( {T}_3=\mathrm{O}\left(N+p{N}^2+\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right) \), where Ns is the number of features in FS. Generally, Ns ≈ N ≫ 1, thus we have:

$$ {\displaystyle \begin{array}{l}{T}_3\approx \mathrm{O}\left(N+p{N}^2+\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right)\\ {}\kern0.36em \approx \mathrm{O}\left({N}_{\mathrm{s}}+p{N_{\mathrm{s}}}^2+\frac{{N_{\mathrm{s}}}^3}{2}-\frac{{N_{\mathrm{s}}}^3}{3}\right)=\mathrm{O}\left({N}_s+p{N_{\mathrm{s}}}^2+\frac{{N_{\mathrm{s}}}^3}{6}\right)\end{array}} $$
(19)

On this basis, we combine the results of T2 and T3 and remove the constant coefficients, obtaining the overall time complexity TMFHFS as follows:

$$ {T}_{\mathrm{MFHFS}}\approx \mathrm{O}\left(M|C|+M{\log}_2M+{N_{\mathrm{s}}}^3\right) $$
(20)

When considering the traditional MI based filters (such as NDMI and NNMI), according to [35], their computational complexity is:

$$ {T}_{\mathrm{MI}}=\mathrm{O}\left(\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}M}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right)\approx \mathrm{O}\left({N_{\mathrm{s}}}^2\left(\frac{M}{2}-\frac{N_{\mathrm{s}}}{3}\right)\right) $$
(21)

As M is much larger than Ns, there generally exists M/2 ≫ Ns/3; thus we remove the constant coefficients of formula (21) and simplify TMI to the following form:

$$ {T}_{\mathrm{MI}}=O\left({N_{\mathrm{s}}}^2M\right) $$
(22)

Then, we have:

$$ \frac{T_{\mathrm{MFHFS}}}{T_{\mathrm{MI}}}=\frac{M\mid C\mid +M{\log}_2M+{N_{\mathrm{s}}}^3}{{N_{\mathrm{s}}}^2M}=\frac{\mid C\mid }{{N_{\mathrm{s}}}^2}+\frac{\log_2M}{{N_{\mathrm{s}}}^2}+\frac{N_{\mathrm{s}}}{M} $$
(23)

As there exists |C| ≪ Ns2, log2M ≪ Ns2 and Ns ≪ M, we have \( \frac{T_{\mathrm{MFHFS}}}{T_{\mathrm{MI}}}\ll 1 \), which means that the computational complexity of MFHFS is obviously lower than that of the traditional MI based filters.

4 Experimental results and analysis

The experiments were conducted on a computer with a Windows 10 Home operating system, 8 GB of RAM and an Intel Core i7-7500 processor. In this section, to verify the classification performance of the selected feature subsets, a series of experiments is conducted to compare the proposed method with typical feature selection methods. All experiments are implemented in Matlab 2015a, which is popular in machine learning and data mining. Moreover, 10-fold cross-validation is used to test the performance of the different methods.

4.1 Datasets

In this section, we select six datasets that contain more than 100 features from the UCI machine learning repository [45, 46]. Table 7 lists the characteristics of these datasets, including the number of features, the number of instances and the number of classes. The datasets are APS, Madelon, CNAE9, Gisette, DrivFace and Amazon. As Amazon contains too many classes, for ease of computation we only consider the top six classes, each of which consists of 30 samples.

Table 7 Details of the datasets used in this paper

4.2 Classifiers

In order to investigate the performance of the proposed algorithm, two well-known classifiers are used: support vector machine (SVM) [47] and random forest (RF) [48]. SVM is a discriminative classifier formally defined by a separating hyper-plane; a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data point of any class. RF is a classification method that combines multiple tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [49]. The parameters of these classifiers are as follows: the number of trees in the RF classifier is set to nt = 5; the libSVM library [50] is used for the SVM classifier with the parameters c = 15,000 and gamma = 0.07; the classifier used in Algorithms 1–2 is SVM.

4.3 Evaluation measurements

The macro-averaged F1 measurement and the micro-averaged F1 measurement are used to evaluate the performance of the different feature selection methods. Given a class c, the F1 measurement (F1c) defined in formula (24) considers both the precision pc and the recall rc [21]:

$$ \Big\{{\displaystyle \begin{array}{l}{F}_1^c=\frac{2\times {p}_c\times {r}_c}{p_c+{r}_c}\\ {}{p}_c=\frac{t{p}_c}{t{p}_c+f{p}_c}\\ {}{r}_c=\frac{t{p}_c}{t{p}_c+f{n}_c}\end{array}} $$
(24)

where tpc is the number of samples correctly classified into c, fpc is the number of samples wrongly classified into c, and fnc is the number of samples that belong to c but are classified into other classes. The F1c measurement is the harmonic mean of precision and recall, whose best value is 1 and worst value is 0 [15]. On this basis, the macro-averaged F1 measurement (F1macro) is defined as follows:

$$ {F}_1^{\mathrm{macro}}=\frac{\sum \limits_{c\in C}{F}_1^c}{\mid C\mid } $$
(25)

where |C| is the number of classes in the dataset.

Different from the macro-averaged F1 measurement, the micro-averaged F1 measurement sums up the classification decisions over all classes of a dataset to calculate a global precision and recall. The micro-averaged F1 measurement is calculated as follows:

$$ {F}_1^{\mathrm{micro}}=\frac{2\times p\times r}{p+r} $$
(26)

where p is the global precision over all classes and r is the global recall over all classes. The definitions of p and r are given in formula (27) [51, 52]:

$$ p=\frac{\sum \limits_{c\in \mid C\mid }t{p}_{\mathrm{c}}}{\sum \limits_{c\in \mid C\mid}\left(t{p}_{\mathrm{c}}+f{p}_{\mathrm{c}}\right)},r=\frac{\sum \limits_{c\in \mid C\mid }t{p}_{\mathrm{c}}}{\sum \limits_{c\in \mid C\mid}\left(t{p}_{\mathrm{c}}+f{n}_{\mathrm{c}}\right)} $$
(27)
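For reference, a small sketch computing the macro- and micro-averaged F1 values of Eqs. (24)-(27) from per-class counts is given below; the function name and zero-division guards are ours.

```python
import numpy as np

def macro_micro_f1(tp, fp, fn):
    """Macro- and micro-averaged F1 per Eqs. (24)-(27); tp, fp, fn are per-class count arrays."""
    tp, fp, fn = (np.asarray(v, dtype=float) for v in (tp, fp, fn))
    prec = tp / np.maximum(tp + fp, 1)
    rec = tp / np.maximum(tp + fn, 1)
    f1_per_class = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    f1_macro = f1_per_class.mean()                       # Eq. (25)
    p = tp.sum() / max((tp + fp).sum(), 1)               # global precision, Eq. (27)
    r = tp.sum() / max((tp + fn).sum(), 1)               # global recall, Eq. (27)
    f1_micro = 2 * p * r / max(p + r, 1e-12)             # Eq. (26)
    return f1_macro, f1_micro
```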

4.4 Selection of the optimal filters in OFS

From Algorithms 2–4 we know that a good selection of the optimal filters may yield a high-quality feature subset FSt. On this basis, eight traditional filters (IG, IMGI, CHI, CMFSX, DIA, DFS, CDM [51] and OR) are compared experimentally to select the set of four optimal filters OFS = {of1, of2, of3, of4}. When the SVM and RF classifiers are used on the different datasets, we calculate the average F1macro values (Fmac_a) and average F1micro values (Fmic_a) of each filter as the ratio of selected features ranges from 2% to 10% with a step of 2%. Further, the averages (called Fa) of the Fmac_a and Fmic_a values of each method are calculated, and the results are shown in Tables 8 and 9. For ease of reading, the highest Fa values for each dataset are denoted in bold. We can see from Tables 8 and 9 that CHI, CMFSX and DFS generally perform better than the other methods, obtaining the highest Fa values 6, 4 and 3 times, respectively. Moreover, Fig. 4 gives the average Fa values (Faa) of each method over all datasets when SVM and RF are used, respectively. As the performances of CHI, CMFSX, DFS and IMGI are clearly better than those of the other filters, they are selected to form the set of optimal filters OFS = {CHI, CMFSX, DFS, IMGI}. Moreover, according to Algorithm 2, we obtain the vectors of multi-filter weights of the different datasets; the results are shown in Tables 10 and 11.

Table 8 Fa values of different filters when SVM is used on different datasets
Table 9 Fa values of different filters when RF is used on different datasets
Fig. 4 Faa values of different filters when SVM and RF are used, respectively

Table 10 Vectors of multi-filter weights of different datasets when SVM is used
Table 11 Vectors of multi-filter weights of different datasets when RF is used

4.5 Sensitivity analysis of the parameter Q

From Algorithm 5 we know that the parameter Q affects the running speed of calculating the feature relevance matrix. In this section, we evaluate the performance of the Q-range based method as Q ranges from 0.1 to 1.0 with a step of 0.1. For each value of Q, we calculate the average running time (denoted ta, in seconds) of step 3 of Algorithm 5 as the ratio of selected features ranges from 1% to 10% on each dataset; the results are shown in Fig. 5. Further, given a predetermined threshold th close to 1 (th = 0.9 in this paper), we calculate the average probability (denoted pa) that the feature relevance is higher than th as the ratio of selected features ranges from 1% to 10%, in order to test the ability of the Q-range based method to detect redundant features; these results are also shown in Fig. 5. For the datasets APS, Madelon, Gisette, DrivFace and Amazon, the pa values remain unchanged while the ta values increase gradually, with the largest increase of about 120 s on Gisette. For CNAE9, the pa values increase slightly but the ta values increase significantly as Q grows. Therefore, we conclude that increasing Q beyond 0.1 has a great effect on the running time of calculating the feature relevance matrix but little effect on the ability to filter redundant features. On this basis, to improve the efficiency of the proposed method while guaranteeing the classification accuracy, Q is set to Q = 0.1 in this paper.

Fig. 5 ta and pa values of MFHFS on each dataset when Q ranges from 0.1 to 1

4.6 Comparisons of MFHFS with typical filters

In order to validate the performance of the different methods, we compare MFHFS with typical filters (SMR [21], NDM [23], NDMI [25], MDMR [27], NFS [52], OCFSX [53] and FECS_FR [54]) in terms of classification accuracy and running speed. Figures 6, 7, 8, 9, 10 and 11 give the average values (called F1m) of the F1macro and F1micro values of the SVM and RF classifiers as the ratio of selected features ranges from 2% to 10% with a step of 2%. When SVM is used, the F1m values of MFHFS are greater than those of the other filters in most cases when the ratio of selected features is no less than 6%. As the ratio of selected features increases, the performance of MFHFS is much more stable than that of the other filters, showing the robustness of the proposed method on different datasets. When RF is used, the performances of SMR and NFS are less stable than those of the other filters as the ratio of selected features increases. MFHFS obtains the highest F1m values in all cases on the APS, DrivFace and Amazon datasets, with the highest F1m value (0.994) reached when 4% of the features are selected from the APS dataset.

Fig. 6 APS dataset
Fig. 7 Madelon dataset
Fig. 8 CNAE9 dataset
Fig. 9 Gisette dataset
Fig. 10 DrivFace dataset
Fig. 11 Amazon dataset

Further, Tables 12, 13, 14 and 15 give the average F1macro values (Fmac_a) and average F1micro values (Fmic_a) as the ratio of selected features ranges from 2% to 10%, when the SVM and RF classifiers are used, respectively. In these tables, the highest values for each dataset are denoted in bold. We can see from Tables 12 and 13 that when SVM is used, MFHFS obtains the highest Fmac_a values except on the DrivFace dataset, and the highest Fmic_a values except on the Madelon and DrivFace datasets. Moreover, Tables 14 and 15 show that when RF is used, MFHFS outperforms the other methods six and five times on the Fmac_a and Fmic_a measurements, respectively, illustrating the effectiveness of the proposed method in selecting the best features. Therefore, we conclude that the performance of MFHFS is generally better than that of the typical filters in terms of classification accuracy. This may be due to two reasons: (1) MFHFS filters the noises and outliers in the preprocessing stage, avoiding overfitting and improving the robustness of the proposed method; (2) MFHFS merges the best features of different optimal filters by combining the multi-filter weights and the multi-feature weights, differentiating the importance of different features and solving the problem that the classification performance is unstable when the dataset changes.

Table 12 Fmac_a values of typical filters and MFHFS when SVM is used
Table 13 Fmic_a values of typical filters and MFHFS when SVM is used
Table 14 Fmac_a values of typical filters and MFHFS when RF is used
Table 15 Fmic_a values of typical filters and MFHFS when RF is used

Furthermore, Table 16 shows the average running time (called rta, in seconds) of the different methods as the ratio of selected features ranges from 2% to 10% with a step of 2%. It can be seen from Table 16 that the filters OCFSX, SMR, NFS, FECS_FR and NDM run obviously faster than MFHFS. This is because MFHFS combines multiple filters and executes a redundant feature filtering process that spends considerable time on calculating the feature relevance matrix. However, comparing MFHFS with NDMI and MDMR, the rta values of the former are obviously lower than those of the latter. For example, on the Gisette dataset, the rta values of NDMI and MDMR are 1260.371 and 4387.189, respectively, while the corresponding rta value of MFHFS is 87.162. Therefore, combining Tables 12, 13, 14 and 15, we conclude that: (1) in terms of classification accuracy, MFHFS generally performs better than the typical filters; (2) in terms of running speed, MFHFS shows an acceptable slowdown compared with the traditional filters (OCFSX, SMR, NFS, FECS_FR and NDM) and a large speedup compared with the MI based filters (NDMI and MDMR).

Table 16 rta values of typical filters and MFHFS on each dataset (unit: second)

4.7 Comparisons of MFHFS with typical feature extractions and hybrid feature selections

In this section, we conduct a series of experiments to compare MFHFS with several typical feature extraction and hybrid feature selection methods: AE [4], PCA [8], CBHFS [30], VGFSS [32], EGAFS [36], EMFFS [39], MFHFS1 and MFHFS2. Among these methods, MFHFS1 is a variant of MFHFS that calculates the feature relevance matrix using the HFSCC method [37], while MFHFS2 is a variant of MFHFS that does not filter the noises and outliers in the data preprocessing stage. The parameters of each method are given as follows: (1) AE: maximum number of epochs Me = 100; (2) EMFFS: the optimal filters are CMFSX, CHI, IMGI and DFS; (3) EGAFS: number of particles Np = 20, number of iterations Nit = 10; (4) CBHFS: number of K values Nk = 10, number of repetitions for each K Nr = 10. Moreover, the classifiers used in the wrappers of these methods are SVM and RF. On this basis, when the ratio of selected features ranges from 2% to 10% with a step of 2%, the average F1macro values (Fmac_a) and average F1micro values (Fmic_a) are shown in Tables 17, 18, 19 and 20, and the average Fmac_a (Fmic_a) values over all datasets for each method when SVM and RF are used are given in Fig. 12.

Table 17 Fmac_a values of typical hybrid methods and MFHFS when SVM is used
Table 18 Fmic_a values of typical hybrid methods and MFHFS when SVM is used
Table 19 Fmac_a values of typical hybrid methods and MFHFS when RF is used
Table 20 Fmic_a values of typical hybrid methods and MFHFS when RF is used
Fig. 12 Average Fmac_a values and average Fmic_a values of different methods when SVM and RF are used, respectively

We can see from Tables 17 and 18 that when SVM is used, MFHFS obtains the highest values five times, which is clearly more often than the other methods. The performances of MFHFS and MFHFS1 are similar and generally better than that of MFHFS2, showing the benefit of filtering noises and outliers in the preprocessing stage for classification accuracy. Moreover, Fig. 12a and b show that the average Fmac_a and Fmic_a values of the proposed method are generally higher than those of the other methods, with the largest improvement of about 0.04 over PCA when SVM is used. Tables 19 and 20 show that when RF is used, MFHFS obtains the highest Fmac_a value on the APS, Gisette and DrivFace datasets, illustrating the efficacy of MFHFS in dealing with datasets that have large numbers of features or samples. Moreover, Fig. 12c and d show that MFHFS and MFHFS1 outperform the other methods significantly, with average improvements of about 0.05 and 0.04 over EGAFS and EMFFS on the average Fmac_a and Fmic_a measurements, respectively. This may be due to the following reasons: (1) feature extraction methods such as PCA and AE ignore the category information of the samples, losing information that may be helpful for classification; (2) EMFFS and VGFSS evaluate the importance of features individually, ignoring the redundant information contained in the selected features; (3) the results of EGAFS and CBHFS depend on some important parameters (such as the numbers of iterations or particles) of the genetic algorithm and the clustering algorithm, leading to unstable results on high dimensional datasets; (4) the noise and outlier samples are removed in MFHFS, improving its robustness and generalization.

Table 21 gives the average running time (called rta, in seconds) of these methods on each dataset when SVM and RF are used, respectively. We observe that CBHFS is the slowest of these methods, with the longest running time (335,926.365 s) on the Gisette dataset. Moreover, EGAFS runs much slower than all the other methods except CBHFS. The reason may be that CBHFS and EGAFS both use wrapper methods, which spend a lot of time on training and classification to obtain the best features. In addition, the increments of MFHFS over PCA, EMFFS and VGFSS on rta are all lower than 70 s, which is acceptable in practical situations. Moreover, compared with MFHFS1, MFHFS runs much faster, illustrating the efficiency of the Q-range based feature relevance calculation method in improving the running speed while guaranteeing the quality of the selected features.

Table 21 rta values of typical hybrid methods and MFHFS (unit: second)

4.8 Statistical comparisons

Nonparametric tests are widely used in statistical learning, and the Wilcoxon signed rank test is usually powerful in detecting the difference between two populations [55]. In this section, the Wilcoxon signed rank test is used, with the null hypothesis that the two compared methods are equivalent. If the null hypothesis is rejected (the p value is less than or equal to the significance level α), the difference between the methods is significant. Given the significance level α = 0.05, the average F1macro and F1micro values (called Fa) of the different typical methods are compared using the Wilcoxon signed rank test as the ratio of selected features ranges from 0.01 to 0.1 with a step of 0.01. The experiments are carried out 10 times and the average p values are shown in Tables 22 and 23 when SVM and RF are used, respectively. We can see from Table 22 that the p values are lower than α in 52 of 60 cases when SVM is used, showing that the proposed method significantly outperforms the other methods in 86.7% of the cases in terms of classification accuracy. Moreover, the proposed method significantly outperforms the other methods on the CNAE9, Gisette and DrivFace datasets, though its superiority on the Amazon dataset is not remarkable compared with IMGI, OCFSX, EMFFS and VGFSS. Further, Table 23 shows that the proposed method significantly outperforms the other methods in 91.7% of the cases (55 of 60) when RF is used, illustrating the effectiveness of the proposed method in selecting the best features and ensuring the classification accuracy.

Table 22 p values of different methods when SVM is used (the cases of p > α are denoted in bold)
Table 23 p values of different methods when RF is used (the cases of p > α are denoted in bold)
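As a usage illustration of this test (with made-up scores, not the values behind Tables 22 and 23), scipy's wilcoxon can be applied to the paired Fa values of two methods:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired Fa scores of MFHFS and one competing method over the ten feature ratios
# (illustrative numbers only).
fa_mfhfs = np.array([0.91, 0.92, 0.93, 0.94, 0.94, 0.95, 0.95, 0.95, 0.96, 0.96])
fa_other = np.array([0.88, 0.90, 0.91, 0.92, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95])

stat, p_value = wilcoxon(fa_mfhfs, fa_other)   # H0: the two methods are equivalent
alpha = 0.05
print(f"p = {p_value:.4f};", "difference is significant" if p_value <= alpha else "not significant")
```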

5 Conclusions

Many traditional filter-based feature selection methods depend heavily on the dataset and cannot handle redundant information effectively. Moreover, many MI based filters and hybrid methods have high time complexity when the number of features or samples is large. In this paper, a multi-filter weights and multi-feature weights based hybrid feature selection method (called MFHFS) is proposed. In the data preprocessing stage, the samples are normalized and discretized by the equal width interval binning (EWIB) algorithm, and 10-fold cross-validation is combined to remove noise and outlier samples. In the feature combination stage, several optimal filters are chosen and used to obtain feature subsets, which are merged into a temporary feature subset by considering the multi-filter weights and multi-feature weights. In the feature refinement stage, the redundant information of the temporary feature subset is filtered to obtain the final feature subset, and a Q-range based feature relevance calculation method is proposed to improve the running speed of the redundant information filtering. The efficiency of MFHFS is examined through classification experiments with SVM and RF classifiers on six datasets: APS, Madelon, CNAE9, Gisette, DrivFace and Amazon. The experimental results show that: (1) compared with the traditional filters, MFHFS greatly improves classification accuracy at the cost of an acceptable decrease in running speed; (2) MFHFS achieves obvious improvements in classification accuracy over typical feature extraction and hybrid feature selection methods, and significant improvements in running speed over typical methods such as AE, EGAFS and CBHFS, illustrating its efficacy in obtaining the best features for data classification.

In the future, we will investigate the following two directions: (1) learning from datasets of multiple fields and studying multi-task based feature selection methods; (2) investigating parallel computing methods to improve the running speed and extending the proposed method to real applications.