1 Introduction

In machine learning, a sample often contains a large number of features, many of which are highly correlated or even logically redundant, leading to high computational complexity and low interpretability [1]. As a consequence, many dimension reduction methods, such as feature selection, feature reduction and feature extraction, have been studied extensively in recent years. Feature selection is concerned with identifying a small subset of relevant features that is sufficient for learning the target concept, while feature reduction aims to eliminate logically redundant features from the original feature space without sacrificing classification accuracy [2]. Different from feature reduction and feature selection, feature extraction maps a high dimensional feature space into a distinct low dimensional feature space through a transformation of the original features [3]. Typical feature extraction methods include AutoEncoder (AE) [4], latent semantic indexing (LSI) [5], partial least squares (PLS) [6], multidimensional scaling [7], principal component analysis (PCA) [8] and latent Dirichlet allocation (LDA) [9]. However, compared with feature selection and feature reduction, feature extraction is hard to interpret because the physical meaning of the original features cannot be retrieved, which limits its application in dimension reduction.

Feature selection methods can be divided into two types: filters and wrappers [10]. Filters evaluate the importance of features individually according to predefined metrics instead of using classifiers; features are measured and ranked by simple criteria such as distance, dependency and information. The most widely used filters include: Chi-square (CHI) [11], improved Gini index (IMGI) [12], distinguishing feature selector (DFS) [13], odds ratio (OR) [14], document frequency (DF) [11], Darmstadt indexing approach (DIA) [15], information gain (IG) [16], mutual information (MI) [17], F-score [18], orthogonal centroid feature selection (OCFS) [19], feature selection considering the imbalance problem in text categorization (CMFSX) [20], supervised meaning rank (SMR) [21], unsupervised meaning average (UMA) [21], balanced accuracy measure (ACC2) [22], normalized difference measure (NDM) [23], clustering based feature selection (CBFS) [24], improved mutual information feature selection (NDMI) [25], novel feature selection based on normalized mutual information (NNMI) [26], and multi-label feature selection based on max-dependency and min-redundancy (MDMR) [27]. Among these methods, NDMI, NNMI and MDMR are MI based filters that use a greedy searching policy to select the best features by calculating feature-feature and feature-class mutual information. A wrapper estimates the accuracy of a classifier and keeps adding unselected features to the feature subset until the accuracy falls below that of the feature set already selected [28]. Typical wrapper-based feature selection methods include forward search (FS), backward search (BS), sequential floating search (SFS) and simplified silhouette filter (SSF) [29]. Compared with wrappers, filters may provide less precise results, but they are particularly efficient when dealing with large datasets.

To combine the advantages of filters and wrappers, many hybrid methods have been proposed in recent years. Typical hybrid feature selection methods include: cluster based hybrid feature selection approach (CBHFS) [30], improved global feature selection scheme (IGFSS) [31], variable global feature selection scheme (VGFSS) [32], hybrid dimension reduction integrating feature selection with feature extraction (TDPFS) [3], novel feature selection based on harmony search (HFS) [33], hybrid approach of differential evolution and artificial bee colony for feature selection (HDAFS) [34], particle swarm optimization for feature selection (PSOFS) [35], hybrid feature selection based on enhanced genetic algorithm (EGAFS) [36], hybrid feature selection using component co-occurrence based feature relevance measurement (HFSCC) [37], multi-measure multi-weight ranking approach (MMMW) [38], multi-filter based feature selection (EMFFS) [39], and two-step based feature selection (TSFS) [40]. These methods achieve good classification accuracy and have therefore been widely used in data classification.

In this paper, a new hybrid feature selection method (called MFHFS) based on multi-filter weights and multi-feature weights is proposed. In our experiments, we apply the proposed method with SVM and RF classifiers on six benchmark datasets and show its effectiveness by demonstrating that it significantly outperforms typical existing feature selection methods in terms of classification accuracy or running speed. The remainder of this paper is organized as follows. Section 2 reviews related work on feature selection. Section 3 gives the details of the proposed method. Section 4 presents the experimental results. Section 5 concludes the paper.

2 Related work

2.1 The filters

2.1.1 The IMGI method

In order to apply feature selection to data classification tasks with multiple class labels, Shang [14] improved the traditional Gini index method [41] and proposed the IMGI method. Given a feature ti, its IMGI value is defined as follows:

$$ \mathrm{IMGI}\left({t}_i\right)=\sum \limits_{c_k}p{\left({t}_i|{c}_k\right)}^2\times p{\left({c}_k|{t}_i\right)}^2 $$
(1)

where p(ti|ck) represents the conditional probability that feature ti occurs in category ck, p(ck|ti) represents the conditional probability that a sample belongs to ck when it contains ti.
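For illustration, a minimal Python sketch of Eq. (1) is given below (not the authors' code); it assumes a binary occurrence matrix X of shape samples × features and a label vector y, and all variable and function names are ours.

```python
import numpy as np

def imgi_scores(X, y):
    """IMGI score of each feature per Eq. (1)."""
    X = (np.asarray(X) > 0).astype(float)   # binary occurrence matrix (samples x features)
    y = np.asarray(y)
    feat_count = X.sum(axis=0)              # number of samples containing each t_i
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        p_t_given_c = Xc.sum(axis=0) / max(len(Xc), 1)          # p(t_i | c_k)
        p_c_given_t = np.divide(Xc.sum(axis=0), feat_count,
                                out=np.zeros_like(feat_count),
                                where=feat_count > 0)            # p(c_k | t_i)
        scores += p_t_given_c ** 2 * p_c_given_t ** 2
    return scores
```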

2.1.2 The CHI method

Given a category ck and a feature ti, ti has strong category discriminative ability if it has a high CHI value. The CHI method calculates the score of ti as follows [42]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{CHI}\left({t}_i\right)=\underset{c_k}{\max}\left\{\mathrm{CHI}\left({t}_i,{c}_k\right)\right\}\\ {}\mathrm{CHI}\left({t}_i,{c}_k\right)=\frac{N{\left({a}_{ik}{d}_{ik}-{b}_{ik}{e}_{ik}\right)}^2}{\left({a}_{ik}+{b}_{ik}\right)\left({a}_{ik}+{e}_{ik}\right)\left({b}_{ik}+{d}_{ik}\right)\left({e}_{ik}+{d}_{ik}\right)}\end{array}} $$
(2)

where N is the total number of samples, aik is the number of samples of class ck that contain feature ti, bik is the number of samples not belonging to ck that contain ti, eik is the number of samples of ck that do not contain ti, and dik is the number of samples that neither belong to ck nor contain ti.
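A hedged sketch of Eq. (2) follows, assuming the same binary occurrence matrix X and label vector y as in the IMGI sketch above; the per-class counts a, b, e and d are formed as described and the maximum over classes is returned.

```python
import numpy as np

def chi_scores(X, y):
    """CHI score of each feature per Eq. (2): the maximum over classes of the chi-square value."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    N = float(len(y))
    best = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)
        a = X[in_c].sum(axis=0)        # samples of c_k that contain t_i
        b = X[~in_c].sum(axis=0)       # samples outside c_k that contain t_i
        e = in_c.sum() - a             # samples of c_k without t_i
        d = (~in_c).sum() - b          # samples outside c_k without t_i
        denom = np.maximum((a + b) * (a + e) * (b + d) * (e + d), 1e-12)
        best = np.maximum(best, N * (a * d - b * e) ** 2 / denom)
    return best
```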

2.1.3 The DFS method

DFS is a successful feature selection method and a global feature selection metric [13]. The idea of DFS is to select a set of distinctive features while eliminating the uninformative ones. The DFS score is defined as follows [31]:

$$ \mathrm{DFS}\left({t}_i\right)=\sum \limits_{c_k}\frac{p\left({c}_k|{t}_i\right)}{p\left(\overline{t_i}|{c}_k\right)+p\left({t}_i|\overline{c_k}\right)+1} $$
(3)

where \( p\left(\overline{t_i}|{c}_k\right) \) is the conditional probability of absence of term ti given class ck, and \( p\left({t}_i|\overline{c_k}\right) \) is the conditional probability of term ti given all the classes except ck.
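The sketch below mirrors Eq. (3) under the same assumptions (binary occurrence matrix X, labels y); it is an illustration rather than the original implementation.

```python
import numpy as np

def dfs_scores(X, y):
    """DFS score of each feature per Eq. (3)."""
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    feat_count = X.sum(axis=0)
    scores = np.zeros(X.shape[1])
    for c in np.unique(y):
        in_c = (y == c)
        p_c_given_t = np.divide(X[in_c].sum(axis=0), feat_count,
                                out=np.zeros_like(feat_count), where=feat_count > 0)
        p_not_t_given_c = 1.0 - X[in_c].sum(axis=0) / max(in_c.sum(), 1)      # p(absence of t_i | c_k)
        p_t_given_not_c = X[~in_c].sum(axis=0) / max((~in_c).sum(), 1)        # p(t_i | not c_k)
        scores += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1.0)
    return scores
```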

2.1.4 The CMFSX method

Yang proposed a new feature selection method (called CMFSX) that weakens the adverse effect caused by class imbalance in the dataset. CMFSX calculates the score of a feature ti as follows [20]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{CMFSX}\left({t}_i\right)=\underset{c_k}{\max}\left\{\mathrm{CMFSX}\left({t}_i,{c}_k\right)\right\}\\ {}\mathrm{CMFSX}\left({t}_i,{c}_k\right)=\frac{p\left({t}_i|{c}_k\right)\times p\left({c}_k|{t}_i\right)}{p\left({c}_k\right)}\end{array}} $$
(4)

2.1.5 The SMR method

SMR uses a class of documents as the basic unit or context to calculate the meaning scores of words. Assume that a feature ti appears k times in the dataset S and m times in the documents of class ck; SMR then calculates the score of ti as follows [21]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{SMR}\left({t}_i\right)=\underset{c_k}{\max}\left\{\mathrm{SMR}\left({t}_i,{c}_k\right)\right\}\\ {}\begin{array}{l}\mathrm{SMR}\left({t}_i,{c}_k\right)=-\frac{1}{m}\log \left(\begin{array}{c}k\\ {}m\end{array}\right)-\left[\left(m-1\right)\log N\right]\\ {}N=L/B\end{array}\end{array}} $$
(5)

where L and B are the feature frequency of ti in the dataset and class ck, respectively.

2.1.6 The NDM method

NDM is a modified form of the ACC2 measure [22]; it addresses class imbalance by normalizing the true positive and false positive counts by the respective class sizes and by dividing the absolute difference of the two rates by the smaller of them. Mathematically, NDM is defined as follows [23]:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{NDM}\left({t}_i\right)=\mid \frac{tpr\left({t}_i\right)- fpr\left({t}_i\right)}{\min \left( tpr\left({t}_i\right), fpr\left({t}_i\right)\right)}\mid \\ {} tpr\left({t}_i\right)=\frac{t{p}_i}{t{p}_i+f{n}_i}\\ {} fpr\left({t}_i\right)=\frac{f{p}_i}{f{p}_i+t{n}_i}\end{array}} $$
(6)

where tpi is the number of positive-class samples that contain the term ti, fni is the number of positive-class samples that do not contain ti, fpi is the number of negative-class samples that contain ti, and tni is the number of negative-class samples that do not contain ti.
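As an illustration of Eq. (6), the following sketch computes the NDM score of a single term from its four counts; the eps guard for min(tpr, fpr) = 0 is our choice, not necessarily the handling used in the original NDM paper.

```python
def ndm_score(tp, fn, fp, tn, eps=1e-12):
    """NDM score of a term per Eq. (6), from its document counts."""
    tpr = tp / max(tp + fn, 1)    # fraction of positive-class samples containing the term
    fpr = fp / max(fp + tn, 1)    # fraction of negative-class samples containing the term
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)
```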

2.1.7 The NDMI method

NDMI uses the mutual information of feature-feature and feature-class to determine an optimal set of features [25]. It uses a greedy searching policy to select the most discriminative features and filters the redundant ones. Given the feature ti, its NDMI score is defined as follows:

$$ \Big\{{\displaystyle \begin{array}{l}\mathrm{NDMI}\left({t}_i\right)=\max \left\{\sum \limits_{c_k}\mathrm{MI}\left({t}_i,{c}_k\right)-\frac{1}{\mid S\mid}\sum \limits_{t_j\in S}\mathrm{MI}\left({t}_i,{t}_j\right)\right\}\\ {}\mathrm{MI}\left({t}_i,{c}_k\right)=p\left({c}_k,{t}_i\right){\log}_2\frac{p\left({c}_k,{t}_i\right)}{p\left({c}_k\right)p\left({t}_i\right)}\\ {}\mathrm{MI}\left({t}_i,{t}_j\right)=p\left({t}_i,{t}_j\right){\log}_2\frac{p\left({t}_i,{t}_j\right)}{p\left({t}_i\right)p\left({t}_j\right)}\end{array}} $$
(7)

where S is the set of selected features, |S| is the number of features in S, p(ti) denotes the probability that a sample contains feature ti, p(ck) denotes the probability of the samples in category ck, p(ti, ck) denotes the probability that ti occurs in ck, and p(ti, tj) denotes the probability that ti and tj both occur in a sample.
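A sketch of the greedy selection behind Eq. (7) is given below. It is only in the spirit of NDMI: sklearn's mutual_info_score is used as a stand-in for the MI terms, X_disc is assumed to hold discretized feature values, and the function name is ours.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def ndmi_select(X_disc, y, n_select):
    """Greedy NDMI-style selection: at each step, add the feature with the largest
    class relevance minus average redundancy with the already selected features."""
    n_features = X_disc.shape[1]
    relevance = np.array([mutual_info_score(y, X_disc[:, i]) for i in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < n_select:
        def ndmi(i):
            if not selected:
                return relevance[i]
            redundancy = np.mean([mutual_info_score(X_disc[:, i], X_disc[:, j]) for j in selected])
            return relevance[i] - redundancy
        best = max(remaining, key=ndmi)
        selected.append(best)
        remaining.remove(best)
    return selected
```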

2.1.8 The MDMR method

Different from traditional single-label feature selection, MDMR considers not only the redundancy between individual features and the redundancy between the class labels and the candidate features, but also the conditional dependency between the features and each class label. The objective function is described as follows [27]:

$$ \max \left\{\sum \limits_{c_j\in C}\mathrm{MI}\left({t}_i,{c}_j\right)-\frac{1}{\mid S\mid}\sum \limits_{t_l\in S}\left(\mathrm{MI}\left({t}_i,{t}_l\right)-\sum \limits_{c_j\in C}\mathrm{MI}\left({t}_i,{c}_j|{t}_l\right)\right)\right\} $$
(8)

where MI(ti, cj| tl) is the mutual information between the candidate feature ti and all categories when given the selected feature tl.

2.2 The hybrid methods

As different filters use different metrics to evaluate feature importance, they often output different feature subsets for a particular dataset. In recent years, hybrid methods that combine different feature evaluation metrics have received considerable attention.

2.2.1 The EGAFS method

This approach combines the advantages of filters with an enhanced genetic algorithm (EGA) in a wrapper approach to handle the high dimensionality of the feature space [36]. EGA improves the crossover and mutation operators of traditional genetic algorithms: crossover is performed based on feature subset partitioning, while mutation is performed based on the classifier performance of the original parents and the feature importance. Moreover, this method combines six well-known filters with the EGA to obtain the final feature subset. Though EGAFS achieves high classification accuracy, its use of wrappers leads to high computational complexity.

2.2.2 The TDPFS method

Based on two feature selection methods and a feature extraction method, Bharti and Singh proposed the TDPFS method for text clustering [3]. First, a typical term frequency based feature selection and a typical document frequency based feature selection are used to obtain two feature subsets. Then, a modified union operation is proposed to merge the features of these two subsets. Finally, the PCA algorithm is applied to the merged features to further reduce the dimensionality without losing much information.

2.2.3 The IGFSS method

Although the features selected by a traditional feature selection scheme may represent some of the classes successfully, other classes may not be represented at all. Uysal proposed the improved global feature selection scheme (IGFSS), in which a traditional feature selection method is modified to obtain a more representative feature set [31]. In other words, IGFSS aims to improve the performance of traditional feature selection by selecting features that represent all classes almost equally.

2.2.4 The VGFSS method

IGFSS solves the problem that some classes may not be represented by selecting an equal number of representative features from every class. However, this strategy is not suitable for an unbalanced dataset with many classes, because important features of a class that contains a higher number of features may be ignored. On this basis, Agnihotri and Verma proposed the VGFSS method, which selects a variable number of features from each class based on the distribution of features [32]. The number of features selected from each class is defined as follows:

$$ NV\left({C}_j\right)= count\left({C}_j\right)\times \frac{N}{TFC} $$
(9)

where count(Cj) is the number of features of class Cj, N is the number of finally selected features, and TFC is the total number of features in the dataset.
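As a small worked illustration of Eq. (9) (names and numbers are ours):

```python
def vgfss_quota(count_per_class, n_final):
    """Per-class feature quota under Eq. (9): proportional to each class's share of all features."""
    tfc = sum(count_per_class.values())
    return {c: round(cnt * n_final / tfc) for c, cnt in count_per_class.items()}

# Example: 1000 features distributed as {'A': 600, 'B': 300, 'C': 100} with N = 50 final features
# gives quotas {'A': 30, 'B': 15, 'C': 5}.
```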

2.2.5 The CBHFS method

Jaskowiak and Campello presented a hybrid feature selection (CBHFS) tailored for data classification problems [30]. This method consists of two main stages: in the first stage, it uses a clustering algorithm to identify the best features and remove the redundant ones [18]; in the second stage, a wrapper is used to evaluate different feature subsets produced by the clustering processes, determining the one that produces the best classification performance in terms of accuracy. This method has high classification accuracy but faces the problems of parameter dependency and high computational complexity.

2.2.6 The HFSCC method

Wang and Feng proposed a hybrid feature selection which can achieve the best features effectively and efficiently [37]. HFSCC consists of three steps: firstly, the samples are preprocessed and two feature subsets are obtained by using two different optimal filters; secondly, a feature weight based union operation is proposed to merge the obtained feature subsets; finally, the hierarchical agglomerative clustering algorithm is applied to obtain the final feature subset by combining a component co-occurrence based feature relevance measurement and a predetermined threshold. Experimental results show that this method achieves high classification accuracy and execution speed.

2.2.7 The MMMW method

As the features selected by different feature selections are always different, it is not enough to evaluate the importance of a feature by using only one particular feature selection. Bhattacharya and Selvakumar proposed the MMMW method which combines a filter, a wrapper and a clustering based feature selection to select the best features [38]. Though MMMW uses novel metrics to assign weights to the features, it ignores the weights of different methods on different datasets.

2.2.8 The EMFFS method

Osanaiye and Cai proposed a multi-filter based feature selection that combines the results of four filters to achieve the best features [39]. This method uses a simple majority voting policy to merge the features selected by different filters. Moreover, a threshold is predetermined to select the frequently occurring features among the four filters to construct the set of final features. However, this method ignores the fact that different filters have different performances on different datasets, and the optimal value of the predetermined threshold is hard to tune.

As shown in Table 1, the main advantages and limitations of the above feature selection methods can be summarized as follows: (1) traditional filters have high running speed, but they cannot filter redundant features and their results depend heavily on the chosen feature importance measurement and the dataset; (2) MI based filters and wrappers can filter redundant features, but they generally have high computational complexity when dealing with datasets with large numbers of features or samples [40]; (3) hybrid methods obtain higher classification accuracy than the other methods, but some are time-consuming because they use wrappers to obtain the best feature subset. Moreover, some existing hybrid methods combine the results of different filters with a simple majority voting or set union mechanism [37], ignoring the effects of feature weights and filter weights.

Table 1 Advantages and limitations of different feature selections

To solve these problems, we propose a new hybrid feature selection method (called MFHFS) based on multi-filter weights and multi-feature weights. The main contributions of this paper are as follows: (1) a new hybrid feature selection framework is proposed, which combines the running speed of traditional filters with the ability of the greedy searching policy to filter redundant features; (2) to avoid the problem that the performance of a filter is unstable when the dataset changes, a vector of multi-filter weights is introduced, and a matrix of multi-feature weights is proposed to distinguish the importance of different features across datasets and obtain the most discriminative features; (3) a Q-range based feature relevance calculation method is proposed to speed up the computation of the feature relevance matrix when filtering redundant features.

3 The proposed method - MFHFS

3.1 Description of MFHFS

Though traditional filters have high execution speed, they fail to filter redundant features and their performance depends heavily on the chosen feature importance measurement and the dataset. Moreover, though the MI based filters and hybrid methods can filter redundant features, they often face high computational complexity, which decreases the execution speed when the number of features or samples is large [37]. On this basis, we combine the advantages of both types of methods and propose a new multi-filter weights and multi-feature weights based feature selection method (called MFHFS), whose flowchart is shown in Fig. 1. MFHFS is executed in three stages: (1) data preprocessing: the samples are normalized and discretized by the equal width interval binning (EWIB) algorithm, and noises and outliers are filtered out in combination with 10-fold cross-validation; (2) feature combination: several feature subsets are obtained by a set of optimal filters and merged into a temporary feature subset FSt by considering the multi-filter weights and multi-feature weights; (3) feature refinement: a Q-range based feature relevance calculation method is used to obtain the feature relevance matrix of FSt, and the final features are selected from FSt using the greedy searching policy. The details of MFHFS are described as follows:

Fig. 1 Flowchart of MFHFS

  • Stage 1: Data preprocessing

In the preprocessing stage, three operations are executed: normalization, discretization, and removal of noises and outliers. Normalization adjusts values measured on different scales to a notionally common scale; it reduces the computational cost and avoids the instability in classification accuracy that can arise when features have very different value ranges. Discretization transfers continuous data into discrete counterparts; it is used to reduce the computational complexity and improve the stability of feature selection in the preprocessing stage of the proposed method. Moreover, as noises and outliers may distort the distribution of the entire dataset, we remove them to improve the robustness of the proposed method. The details of data preprocessing are given in Algorithm 1.

Algorithm 1 Data preprocessing

In the normalization step (step 1.3), fmin and fmax are set to fmin = 0 and fmax = 6; in the discretization step (step 1.4), NQ is set to NQ = fmax + 1, and the interval width is set to \( \delta =\frac{f_{\mathrm{max}}-{f}_{\mathrm{min}}}{N_{\mathrm{Q}}} \) [37].
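Since Algorithm 1 is only shown as a figure, the sketch below illustrates how the normalization and EWIB discretization steps could look with the stated parameters (fmin = 0, fmax = 6, NQ = fmax + 1); it is our reconstruction under these assumptions, not the authors' code.

```python
import numpy as np

def normalize_and_discretize(X, f_min=0.0, f_max=6.0):
    """Min-max scale every feature to [f_min, f_max], then apply equal width interval
    binning (EWIB) with N_Q = f_max + 1 bins of width delta = (f_max - f_min) / N_Q."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)   # avoid dividing by zero
    X_norm = f_min + (X - col_min) / span * (f_max - f_min)
    n_q = int(f_max) + 1
    delta = (f_max - f_min) / n_q
    bins = np.floor((X_norm - f_min) / delta).astype(int)
    return np.clip(bins, 0, n_q - 1)
```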

  • Stage 2: Feature combination

Different traditional filters use different metrics to evaluate feature importance, so the performance of a filter may be unstable when the dataset changes. Osanaiye et al. proposed an ensemble-based multi-filter feature selection method that combines the output of four filters to achieve an optimal selection [40]. However, this method uses a simple majority vote with a predetermined threshold to determine the final selected features, which leads to the following problems: (1) it treats different filters equally and therefore ignores the filter weights, decreasing the classification accuracy when some filters perform badly on particular datasets; (2) the features selected by a filter are treated equally, ignoring the fact that their category discriminative abilities usually differ; (3) the predetermined threshold is related to the number of selected features and is hard to tune. On this basis, we propose a novel feature combination method that introduces a vector of multi-filter weights and a matrix of multi-feature weights; the flowchart of this stage is given in Fig. 2. Given the preprocessed sample set ES′, we first obtain the vector of multi-filter weights, denoted A = (α1, α2, …, αL). As shown in Algorithm 2, the details are as follows:

Fig. 2 Flowchart of feature combination stage

Algorithm 2 Obtaining the vector of multi-filter weights A

Further, we obtain the matrix of multi-feature weights and denote it as B = {Bi} = {(βi1, βi2, …, βij, …, βiM)}. As is shown in Algorithm 3, the details are given as follows:

Algorithm 3 Obtaining the matrix of multi-feature weights B

On this basis, the 1 × M vector of final feature weights is obtained as the product of the 1 × L vector of multi-filter weights and the L × M matrix of multi-feature weights, and the temporary feature subset FSt is obtained by the steps described in Algorithm 4:

Algorithm 4 Obtaining the temporary feature subset FSt

From Algorithms 2–4 we know that the effects of the multi-filter weights and the multi-feature weights are considered through the vector W, which avoids treating the optimal filters or the selected features equally when searching for the best features. For ease of computation, we set the number of optimal filters to L = 4 in this paper. Thus, the time complexity of obtaining the temporary feature subset FSt is O(4 × (M|C| + Mlog2M + 3N) + Nlog2N).
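The following sketch illustrates the combination step in outline form only: the filter-weight vector A of Algorithm 2 and the feature-weight matrix B of Algorithm 3 are multiplied to obtain the final weight vector W, from which the top-N features form FSt (Algorithm 4). All names are ours and the details of how the weights themselves are computed are omitted.

```python
import numpy as np

def combine_filters(A, B, n_keep):
    """Feature combination stage (Algorithms 2-4 in outline form).

    A: (L,) multi-filter weight of each optimal filter.
    B: (L, M) multi-feature weight of every feature under every filter.
    Returns the indices of the n_keep features with the largest combined weight
    (the temporary subset FS_t) together with the final weight vector W.
    """
    W = np.asarray(A) @ np.asarray(B)          # (1 x L) vector times (L x M) matrix -> (M,) weights
    fs_t = np.argsort(W)[::-1][:n_keep]
    return fs_t, W
```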

  • Stage 3: Feature refinement

To filter the redundant features in FSt, TDPFS [3] uses the PCA algorithm to transform the high dimensional feature space into an uninterpretable low dimensional space, while EGAFS [36] uses an enhanced GA based wrapper to select the optimal features and therefore suffers from high computational cost. On this basis, we use the greedy searching policy [25,26,27] to obtain the optimal final feature subset from FSt, whose number of features is much lower than that of the entire dataset. The details of this stage are given as follows:

(1) Obtain the feature relevance matrix

HFSCC [37] obtains the feature relevance matrix by calculating the feature relevance of every pair of features in FSt. This ignores the fact that a feature can only be redundant with features that have equal or similar global feature weight values, and results in a high computational complexity of O(N2). In this paper, we suppose that the proportion of redundant features in FSt is no higher than r (0 < r < 1), and propose a new feature relevance calculation method that only calculates the relevance between a feature fi and the features whose normalized global feature weight values are similar to that of fi. As shown in Fig. 3, given the temporary feature subset FSt obtained in stage 2, we first normalize the final feature weight values of FSt, and then calculate the feature relevance matrix Φ = {ϕij} (1 ≤ i, j ≤ N) using formula (15):

$$ \Big\{{\displaystyle \begin{array}{l}{\phi}_{ij}={\phi}_{ji}=\left\{\begin{array}{c}{c}_{ij},\mathrm{if}\mid {\mathbf{W}}_{\mathrm{N}}\left({f}_i\right)-{\mathbf{W}}_{\mathrm{N}}\left({f}_j\right)\mid <Q\\ {}0,\kern4.199998em \mathrm{else}\end{array}\right.\\ {}{c}_{ij}=\sum \limits_{f_{jl}\in \varOmega}\sum \limits_{f_{ik}\in \varOmega}\left(p\left({f}_{ik}|{f}_{jl}\right)\times p\left({f}_{jl}|{f}_{ik}\right)\times p\left({f}_{jl},{f}_{ik}\right)\right)\end{array}} $$
(15)

where fi and fj are the ith and jth features in the feature subset FSt; Q is a factor which ranges in the interval (0, 1); WN(fi) and WN(fj) are the normalized final feature weight values of fi and fj, respectively; cij denotes the feature relevance of fi and fj using the CCFR algorithm of HFSCC method; Ω denotes the corresponding component values (possible values) for each feature; fik denotes the kth component value of fi and fjl denotes the lth component value of fj; p(fik| fjl) denotes the conditional probability that fik occurs when fjl occurs; p(fjl| fik) denotes the conditional probability that fjl occurs when fik occurs; p(fik, fjl) is the feature component based normalization coefficient which denotes the probability that fik and fjl occur together in the dataset.
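A sketch of the Q-range based relevance computation of formula (15) is given below (our reconstruction, not the authors' code); it only evaluates cij for pairs whose normalized weights differ by less than Q, leaving the remaining entries of Φ at zero.

```python
import numpy as np

def q_range_relevance(X_disc, W_norm, Q=0.1):
    """Feature relevance matrix of Eq. (15); X_disc holds the discretized values
    of the features in FS_t (samples x features), W_norm their normalized weights."""
    n = X_disc.shape[1]
    phi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if abs(W_norm[i] - W_norm[j]) >= Q:
                continue                                   # outside the Q range: keep 0
            vi, vj = X_disc[:, i], X_disc[:, j]
            c = 0.0
            for a in np.unique(vi):
                for b in np.unique(vj):
                    p_ab = np.mean((vi == a) & (vj == b))  # p(f_ik, f_jl)
                    if p_ab == 0:
                        continue
                    p_a_given_b = p_ab / np.mean(vj == b)  # p(f_ik | f_jl)
                    p_b_given_a = p_ab / np.mean(vi == a)  # p(f_jl | f_ik)
                    c += p_a_given_b * p_b_given_a * p_ab
            phi[i, j] = phi[j, i] = c
    return phi
```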

(2) Select the final feature subset using the greedy searching policy

Fig. 3 Graphical representation of different feature relevance calculation methods

Though HFSCC [37] and TSFS [40] can filter the redundant features of FSt, they both require a predetermined threshold th that has a great influence on the number of selected features. Therefore, we apply the greedy searching policy of traditional MI filters to the sorted feature subset FSt to obtain the final feature subset FS. A greedy algorithm solves an optimization problem by selecting, at each step, the choice that maximizes the measured fitness [44]; it picks the locally optimal solution at every step in the hope that the series of local optima yields a globally optimal solution. As this policy measures the importance of features by maximizing the relevance between the features and the classes while minimizing the redundancy among the selected features [26, 27], the objective function for selecting a feature f from FSt is defined as follows:

$$ \Big\{{\displaystyle \begin{array}{l}f=\arg \underset{f_i\in F{S}_{\mathrm{t}}}{\max } Score\left({f}_i\right)\\ {} Score\left({f}_i\right)=\theta \times {\mathbf{W}}_{\mathrm{N}}\left({f}_i\right)-\frac{\varphi }{\mid S\mid}\sum \limits_{f_j\in S}{\phi}_{ij}\end{array}} $$
(16)

where S is the set of selected features, and θ (θ > 0) and φ (φ > 0) are two parameters balancing the feature category discriminative ability and the feature redundancy. In order to emphasize the importance of the category discriminative ability and weaken the data loss that may be introduced by setting some elements of Φ to zero directly, θ and φ are set to θ = 1 and φ = 0.5 in this paper. On this basis, the details of stage 3 are shown in Algorithm 5:

Algorithm 5 Feature refinement
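Since Algorithm 5 is only shown as a figure, the following sketch illustrates the greedy refinement driven by Eq. (16); the function name and loop structure are our assumptions.

```python
import numpy as np

def refine_features(W_norm, phi, n_select, theta=1.0, varphi=0.5):
    """Greedy refinement per Eq. (16): repeatedly add the feature of FS_t that maximizes
    theta * W_N(f_i) - (varphi / |S|) * sum of its relevance to the already selected set."""
    remaining = list(range(len(W_norm)))
    selected = []
    while remaining and len(selected) < n_select:
        def score(i):
            if not selected:
                return theta * W_norm[i]
            return theta * W_norm[i] - varphi * np.mean([phi[i, j] for j in selected])
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```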

From Algorithm 5 we know that the time complexity of steps 2–3 is O(N + pN2), where p (0 < p < 1) is the probability that the condition ∣WN(fi) − WN(fj) ∣  < Q holds. As there are few redundant features in FSt, p is much lower than 1 in practical situations, which means that the running speed of the proposed Q-range based method is 1/p times that of the HFSCC method. Moreover, we notice that the time complexity of step 4 is \( O\left(\sum \limits_{i=0}^{N_{\mathrm{s}}-1}\left(N-i\right)\times i\right)=\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6} \), thus we conclude that the overall time complexity of Algorithm 5 is \( O\left(N+p{N}^2+\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right) \).

For ease of understanding, we give an example of the execution of MFHFS. As shown in Table 2, the preprocessed samples are denoted ES = {s1, s2, s3, s4, s5, s6, s7, s8} and the set of class labels is denoted La = {1,2,3}. Moreover, we suppose that the set of optimal filters used in Algorithms 2–4 is OFS = {IG, IMGI, CHI, DFS}, the corresponding vector of multi-filter weights is A = (1, 1, 1, 1), the feature number of FSt is N = 8, the factor Q is Q = 0.1, and the feature number of FS is Ns = 6. On this basis, Table 3 shows the feature scores of the different methods, with the corresponding feature ranks given in brackets. Tables 4 and 5 give the feature relevance matrices of the features in FSt when the HFSCC method and the proposed Q-range based method are used, respectively. From Table 4, the features f1 and f2 are redundant with each other as their feature relevance equals 1. Further, Table 5 shows that the relevance between f1 and f2 is also captured accurately by the proposed Q-range based method, and the computational cost of our method is 25% (16 of 64 pairs) of that of the HFSCC method. Moreover, Table 6 gives the Score values of the features in FSt and the changes of the final feature subset FS as the number of selected features increases. From Table 6, the redundant feature f2 is filtered out and the final selected feature subset is FS = {f4, f5, f1, f8, f9, f7}.

Table 2 Example of the preprocessed sample set ES
Table 3 Scores of the features when different methods are used
Table 4 Feature relevance matrix of FSt using HFSCC method
Table 5 Feature relevance matrix of FSt using the proposed Q-range based method
Table 6 Changes of the final feature subset FS and Score values of the features in FSt

3.2 Time complexity analysis of MFHFS

The time complexity of the proposed feature selection consists of three stages:

(1) The stage of data preprocessing: we know from Algorithm 1 that the complexity of this stage is T1 = O(MD) + L × (O(flt) + O(clat) + |ES| × O(clap)), where O(flt), O(clat) and O(clap) are the time complexities of the filter flt and of the training and prediction processes of the classifier cla, respectively. However, as this stage can be applied before any feature selection method, we do not include it when calculating the time complexity of MFHFS.

(2) The stage of feature combination: we know from Algorithms 2–4 that the complexity of this stage is T2 = O(4 × (M| C| +Mlog2M + 3N) + Nlog2N), where M is the number of all features and N is the number of features selected in Algorithm 2.

(3) The stage of feature refinement: we know from Algorithm 5 that the time complexity of this stage is \( {T}_3=\mathrm{O}\left(N+p{N}^2+\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right) \), where Ns is the number of features in FS. Generally, Ns ≈ N ≫ 1, thus we have:

$$ {\displaystyle \begin{array}{l}{T}_3\approx \mathrm{O}\left(N+p{N}^2+\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}N}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right)\\ {}\kern0.36em \approx \mathrm{O}\left({N}_{\mathrm{s}}+p{N_{\mathrm{s}}}^2+\frac{{N_{\mathrm{s}}}^3}{2}-\frac{{N_{\mathrm{s}}}^3}{3}\right)=\mathrm{O}\left({N}_s+p{N_{\mathrm{s}}}^2+\frac{{N_{\mathrm{s}}}^3}{6}\right)\end{array}} $$
(19)

On this basis, we combine the results of T2 and T3 and remove the constant coefficients, obtaining the overall time complexity TMFHFS as follows:

$$ {T}_{\mathrm{MFHFS}}\approx \mathrm{O}\left(M|C|+M{\log}_2M+{N_{\mathrm{s}}}^3\right) $$
(20)

When considering the traditional MI based filters (such as NDMI and NNMI), according to [35], their computational complexity is:

$$ {T}_{\mathrm{MI}}=\mathrm{O}\left(\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}M}{2}-\frac{\left({N}_{\mathrm{s}}-1\right){N}_{\mathrm{s}}\left(2{N}_{\mathrm{s}}-1\right)}{6}\right)\approx \mathrm{O}\left({N_{\mathrm{s}}}^2\left(\frac{M}{2}-\frac{N_{\mathrm{s}}}{3}\right)\right) $$
(21)

As M is much larger than Ns, there generally exists M/2 ≫ Ns/3; thus we remove the constant coefficients of formula (21) and simplify TMI to the following form:

$$ {T}_{\mathrm{MI}}=O\left({N_{\mathrm{s}}}^2M\right) $$
(22)

Then, we have:

$$ \frac{T_{\mathrm{MFHFS}}}{T_{\mathrm{MI}}}=\frac{M\mid C\mid +M{\log}_2M+{N_{\mathrm{s}}}^3}{{N_{\mathrm{s}}}^2M}=\frac{\mid C\mid }{{N_{\mathrm{s}}}^2}+\frac{\log_2M}{{N_{\mathrm{s}}}^2}+\frac{N_{\mathrm{s}}}{M} $$
(23)

As there exists |C| ≪ Ns2, log2M ≪ Ns2 and Ns ≪ M, we have \( \frac{T_{\mathrm{MFHFS}}}{T_{\mathrm{MI}}}\ll 1 \), which means that the computational complexity of MFHFS is obviously lower than that of the traditional MI based filters.

4 Experimental results and analysis

The experiments were conducted on a computer with a Windows 10 Home operating system, 8 GB of RAM and an Intel Core i7-7500 processor. In this section, to verify the classification performance of the selected feature subsets, a series of experiments is conducted to compare the proposed method with typical feature selection methods. All experiments are implemented in Matlab 2015a, which is popular in machine learning and data mining. Moreover, 10-fold cross-validation is used to test the performance of the different methods.

4.1 Datasets

In this section, we select six datasets that contain more than 100 features from the UCI machine learning repository [45, 46]. Table 7 lists the characteristics of these datasets, including the number of features, the number of instances and the number of classes. The datasets are APS, Madelon, CNAE9, Gisette, DrivFace and Amazon. As Amazon contains too many classes, for ease of computation we only consider the top six classes, each of which consists of 30 samples.

Table 7 Details of the datasets used in this paper

4.2 Classifiers

In order to investigate the performance of the proposed algorithm, two well-known classifiers are used: support vector machine (SVM) [47] and random forest (RF) [48]. SVM is a discriminative classifier formally defined by a separating hyper-plane; a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data point of any class. RF is a classification method that combines multiple tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [49]. The parameters of these classifiers are as follows: the number of trees in the RF classifier is set to nt = 5; the libSVM library [50] is used for the SVM classifier with the parameters c = 15,000 and gamma = 0.07; the classifier used in Algorithms 1–2 is SVM.

4.3 Evaluation measurements

The macro-averaged F1 measurement and the micro-averaged F1 measurement are used to evaluate the performance of the different feature selection methods. Given a class c, the F1 measurement (F1c) defined in formula (24) considers both the precision pc and the recall rc [21]:

$$ \Big\{{\displaystyle \begin{array}{l}{F}_1^c=\frac{2\times {p}_c\times {r}_c}{p_c+{r}_c}\\ {}{p}_c=\frac{t{p}_c}{t{p}_c+f{p}_c}\\ {}{r}_c=\frac{t{p}_c}{t{p}_c+f{n}_c}\end{array}} $$
(24)

where tpc is the number of samples correctly classified into c, fpc is the number of samples wrongly classified into c, and fnc is the number of samples that belong to c but are classified into other classes. The F1c measurement is the harmonic mean of precision and recall, whose best value is 1 and worst value is 0 [15]. On this basis, the macro-averaged F1 measurement (F1macro) is defined as follows:

$$ {F}_1^{\mathrm{macro}}=\frac{\sum \limits_{c\in C}{F}_1^c}{\mid C\mid } $$
(25)

where |C| is the number of classes in the dataset.

Different from the macro-averaged F1 measurement, the micro-averaged F1 measurement sums up the classification decisions over all classes of a dataset to calculate a global precision and recall. The micro-averaged F1 measurement is calculated as follows:

$$ {F}_1^{\mathrm{micro}}=\frac{2\times p\times r}{p+r} $$
(26)

where p is the global precision over all classes and r is the global recall over all classes. The definitions of p and r are given in formula (27) [51, 52]:

$$ p=\frac{\sum \limits_{c\in \mid C\mid }t{p}_{\mathrm{c}}}{\sum \limits_{c\in \mid C\mid}\left(t{p}_{\mathrm{c}}+f{p}_{\mathrm{c}}\right)},r=\frac{\sum \limits_{c\in \mid C\mid }t{p}_{\mathrm{c}}}{\sum \limits_{c\in \mid C\mid}\left(t{p}_{\mathrm{c}}+f{n}_{\mathrm{c}}\right)} $$
(27)
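For reference, a small sketch computing the macro- and micro-averaged F1 values of Eqs. (24)-(27) from per-class counts is given below; the function name and zero-division guards are ours.

```python
import numpy as np

def macro_micro_f1(tp, fp, fn):
    """Macro- and micro-averaged F1 per Eqs. (24)-(27); tp, fp, fn are per-class count arrays."""
    tp, fp, fn = (np.asarray(v, dtype=float) for v in (tp, fp, fn))
    prec = tp / np.maximum(tp + fp, 1)
    rec = tp / np.maximum(tp + fn, 1)
    f1_per_class = 2 * prec * rec / np.maximum(prec + rec, 1e-12)
    f1_macro = f1_per_class.mean()                       # Eq. (25)
    p = tp.sum() / max((tp + fp).sum(), 1)               # global precision, Eq. (27)
    r = tp.sum() / max((tp + fn).sum(), 1)               # global recall, Eq. (27)
    f1_micro = 2 * p * r / max(p + r, 1e-12)             # Eq. (26)
    return f1_macro, f1_micro
```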

4.4 Selection of the optimal filters in OFS

From Algorithms 2–4 we know that a good selection of the optimal filters may yield a high-quality feature subset FSt. On this basis, eight traditional filters (IG, IMGI, CHI, CMFSX, DIA, DFS, CDM [51] and OR) are compared experimentally to select the set of four optimal filters OFS = {of1, of2, of3, of4}. When the SVM and RF classifiers are used on the different datasets, we calculate the average F1macro values (Fmac_a) and average F1micro values (Fmic_a) of each filter as the ratio of selected features ranges from 2% to 10% with a step of 2%. Further, the averages (called Fa) of the Fmac_a and Fmic_a values of each method are calculated, and the results are shown in Tables 8 and 9. For ease of reading, the highest Fa values for each dataset are denoted in bold. We can see from Tables 8 and 9 that CHI, CMFSX and DFS generally perform better than the other methods, obtaining the highest Fa values 6, 4 and 3 times, respectively. Moreover, Fig. 4 gives the average Fa values (Faa) of each method over all datasets when SVM and RF are used, respectively. As the performances of CHI, CMFSX, DFS and IMGI are clearly better than those of the other filters, they are selected to form the set of optimal filters OFS = {CHI, CMFSX, DFS, IMGI}. Moreover, according to Algorithm 2, we obtain the vectors of multi-filter weights of the different datasets; the results are shown in Tables 10 and 11.

Table 8 Fa values of different filters when SVM is used on different datasets
Table 9 Fa values of different filters when RF is used on different datasets
Fig. 4 Faa values of different filters when SVM and RF are used, respectively

Table 10 Vectors of multi-filter weights of different datasets when SVM is used
Table 11 Vectors of multi-filter weights of different datasets when RF is used

4.5 Sensitivity analysis of the parameter Q

From Algorithm 5 we know that the parameter Q affects the running speed of calculating the feature relevance matrix. In this section, we evaluate the performance of the Q-range based method as Q ranges from 0.1 to 1.0 with a step of 0.1. For each value of Q, we calculate the average running time (denoted ta, in seconds) of step 3 of Algorithm 5 as the ratio of selected features ranges from 1% to 10% on each dataset; the results are shown in Fig. 5. Further, given a predetermined threshold th close to 1 (th = 0.9 in this paper), we calculate the average probability (denoted pa) that the feature relevance is higher than th as the ratio of selected features ranges from 1% to 10%, in order to test the ability of the Q-range based method to detect redundant features; these results are also shown in Fig. 5. For the datasets APS, Madelon, Gisette, DrivFace and Amazon, the pa values remain unchanged while the ta values increase gradually, with the largest increase of about 120 s on Gisette. For CNAE9, the pa values increase slightly but the ta values increase significantly as Q grows. Therefore, we conclude that increasing Q beyond 0.1 has a great effect on the running time of calculating the feature relevance matrix but little effect on the ability to filter redundant features. On this basis, to improve the efficiency of the proposed method while guaranteeing the classification accuracy, Q is set to Q = 0.1 in this paper.

Fig. 5 ta and pa values of MFHFS on each dataset when Q ranges from 0.1 to 1

4.6 Comparisons of MFHFS with typical filters

In order to validate the performance of the different methods, we compare MFHFS with typical filters (SMR [21], NDM [23], NDMI [25], MDMR [27], NFS [52], OCFSX [53] and FECS_FR [54]) in terms of classification accuracy and running speed. Figures 6, 7, 8, 9, 10 and 11 give the average values (called F1m) of the F1macro and F1micro values of the SVM and RF classifiers as the ratio of selected features ranges from 2% to 10% with a step of 2%. When SVM is used, the F1m values of MFHFS are greater than those of the other filters in most cases when the ratio of selected features is no less than 6%. As the ratio of selected features increases, the performance of MFHFS is much more stable than that of the other filters, showing the robustness of the proposed method on different datasets. When RF is used, the performances of SMR and NFS are less stable than those of the other filters as the ratio of selected features increases. MFHFS obtains the highest F1m values in all cases on the APS, DrivFace and Amazon datasets, with the highest F1m value (0.994) reached when 4% of the features are selected from the APS dataset.

Fig. 6 APS dataset
Fig. 7 Madelon dataset
Fig. 8 CNAE9 dataset
Fig. 9 Gisette dataset
Fig. 10 DrivFace dataset
Fig. 11 Amazon dataset

Further, Tables 12, 13, 14 and 15 give the average F1macro values (Fmac_a) and average F1micro values (Fmic_a) as the ratio of selected features ranges from 2% to 10%, when the SVM and RF classifiers are used, respectively. In these tables, the highest values for each dataset are denoted in bold. We can see from Tables 12 and 13 that when SVM is used, MFHFS obtains the highest Fmac_a values except on the DrivFace dataset, and the highest Fmic_a values except on the Madelon and DrivFace datasets. Moreover, Tables 14 and 15 show that when RF is used, MFHFS outperforms the other methods six and five times on the Fmac_a and Fmic_a measurements, respectively, illustrating the effectiveness of the proposed method in selecting the best features. Therefore, we conclude that the performance of MFHFS is generally better than that of the typical filters in terms of classification accuracy. This may be due to two reasons: (1) MFHFS filters the noises and outliers in the preprocessing stage, avoiding overfitting and improving the robustness of the proposed method; (2) MFHFS merges the best features of different optimal filters by combining the multi-filter weights and the multi-feature weights, differentiating the importance of different features and solving the problem that the classification performance is unstable when the dataset changes.

Table 12 Fmac_a values of typical filters and MFHFS when SVM is used
Table 13 Fmic_a values of typical filters and MFHFS when SVM is used
Table 14 Fmac_a values of typical filters and MFHFS when RF is used
Table 15 Fmic_a values of typical filters and MFHFS when RF is used

Furthermore, Table 16 shows the average running time (called rta, in seconds) of the different methods as the ratio of selected features ranges from 2% to 10% with a step of 2%. It can be seen from Table 16 that the filters OCFSX, SMR, NFS, FECS_FR and NDM run obviously faster than MFHFS. This is because MFHFS combines multiple filters and executes a redundant feature filtering process that spends considerable time on calculating the feature relevance matrix. However, comparing MFHFS with NDMI and MDMR, the rta values of the former are obviously lower than those of the latter. For example, on the Gisette dataset, the rta values of NDMI and MDMR are 1260.371 and 4387.189, respectively, while the corresponding rta value of MFHFS is 87.162. Therefore, combining Tables 12, 13, 14 and 15, we conclude that: (1) in terms of classification accuracy, MFHFS generally performs better than the typical filters; (2) in terms of running speed, MFHFS shows an acceptable slowdown compared with the traditional filters (OCFSX, SMR, NFS, FECS_FR and NDM) and a large speedup compared with the MI based filters (NDMI and MDMR).

Table 16 rta values of typical filters and MFHFS on each dataset (unit: second)

4.7 Comparisons of MFHFS with typical feature extractions and hybrid feature selections

In this section, we conduct a series of experiments to compare MFHFS with several typical feature extraction and hybrid feature selection methods: AE [4], PCA [8], CBHFS [30], VGFSS [32], EGAFS [36], EMFFS [39], MFHFS1 and MFHFS2. Among these methods, MFHFS1 is a variant of MFHFS that calculates the feature relevance matrix using the HFSCC method [37], while MFHFS2 is a variant of MFHFS that does not filter the noises and outliers in the data preprocessing stage. The parameters of each method are given as follows: (1) AE: maximum number of epochs Me = 100; (2) EMFFS: the optimal filters are CMFSX, CHI, IMGI and DFS; (3) EGAFS: number of particles Np = 20, number of iterations Nit = 10; (4) CBHFS: number of K values Nk = 10, number of repetitions for each K Nr = 10. Moreover, the classifiers used in the wrappers of these methods are SVM and RF. On this basis, when the ratio of selected features ranges from 2% to 10% with a step of 2%, the average F1macro values (Fmac_a) and average F1micro values (Fmic_a) are shown in Tables 17, 18, 19 and 20, and the average Fmac_a (Fmic_a) values over all datasets for each method when SVM and RF are used are given in Fig. 12.

Table 17 Fmac_a values of typical hybrid methods and MFHFS when SVM is used
Table 18 Fmic_a values of typical hybrid methods and MFHFS when SVM is used
Table 19 Fmac_a values of typical hybrid methods and MFHFS when RF is used
Table 20 Fmic_a values of typical hybrid methods and MFHFS when RF is used
Fig. 12 Average Fmac_a values and average Fmic_a values of different methods when SVM and RF are used, respectively

We can see from Tables 17 and 18 that when SVM is used, MFHFS obtains the highest values five times, which is clearly more often than the other methods. The performances of MFHFS and MFHFS1 are similar and generally better than that of MFHFS2, showing the benefit of filtering noises and outliers in the preprocessing stage for classification accuracy. Moreover, Fig. 12a and b show that the average Fmac_a and Fmic_a values of the proposed method are generally higher than those of the other methods, with the largest improvement of about 0.04 over PCA when SVM is used. Tables 19 and 20 show that when RF is used, MFHFS obtains the highest Fmac_a value on the APS, Gisette and DrivFace datasets, illustrating the efficacy of MFHFS in dealing with datasets that have large numbers of features or samples. Moreover, Fig. 12c and d show that MFHFS and MFHFS1 outperform the other methods significantly, with average improvements of about 0.05 and 0.04 over EGAFS and EMFFS on the average Fmac_a and Fmic_a measurements, respectively. This may be due to the following reasons: (1) feature extraction methods such as PCA and AE ignore the category information of the samples, losing information that may be helpful for classification; (2) EMFFS and VGFSS evaluate the importance of features individually, ignoring the redundant information contained in the selected features; (3) the results of EGAFS and CBHFS depend on some important parameters (such as the numbers of iterations or particles) of the genetic algorithm and the clustering algorithm, leading to unstable results on high dimensional datasets; (4) the noise and outlier samples are removed in MFHFS, improving its robustness and generalization.

Table 21 gives the average running time (called rta, in seconds) of these methods on each dataset when SVM and RF are used, respectively. We observe that CBHFS is the slowest of these methods, with the longest running time (335,926.365 s) on the Gisette dataset. Moreover, EGAFS runs much slower than all the other methods except CBHFS. The reason may be that CBHFS and EGAFS both use wrapper methods, which spend a lot of time on training and classification to obtain the best features. In addition, the increments of MFHFS over PCA, EMFFS and VGFSS on rta are all lower than 70 s, which is acceptable in practical situations. Moreover, compared with MFHFS1, MFHFS runs much faster, illustrating the efficiency of the Q-range based feature relevance calculation method in improving the running speed while guaranteeing the quality of the selected features.

Table 21 rta values of typical hybrid methods and MFHFS (unit: second)

4.8 Statistical comparisons

Nonparametric tests are widely used in statistical learning, and the Wilcoxon signed rank test is usually powerful in detecting the difference between two populations [55]. In this section, the Wilcoxon signed rank test is used, with the null hypothesis that the two compared methods are equivalent. If the null hypothesis is rejected (the p value is less than or equal to the significance level α), the difference between the methods is significant. Given the significance level α = 0.05, the average F1macro and F1micro values (called Fa) of the different typical methods are compared using the Wilcoxon signed rank test as the ratio of selected features ranges from 0.01 to 0.1 with a step of 0.01. The experiments are carried out 10 times and the average p values are shown in Tables 22 and 23 when SVM and RF are used, respectively. We can see from Table 22 that the p values are lower than α in 52 of 60 cases when SVM is used, showing that the proposed method significantly outperforms the other methods in 86.7% of the cases in terms of classification accuracy. Moreover, the proposed method significantly outperforms the other methods on the CNAE9, Gisette and DrivFace datasets, though its superiority on the Amazon dataset is not remarkable compared with IMGI, OCFSX, EMFFS and VGFSS. Further, Table 23 shows that the proposed method significantly outperforms the other methods in 91.7% of the cases (55 of 60) when RF is used, illustrating the effectiveness of the proposed method in selecting the best features and ensuring the classification accuracy.

Table 22 p values of different methods when SVM is used (the cases of p > α are denoted in bold)
Table 23 p values of different methods when RF is used (the cases of p > α are denoted in bold)
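As a usage illustration of this test (with made-up scores, not the values behind Tables 22 and 23), scipy's wilcoxon can be applied to the paired Fa values of two methods:

```python
import numpy as np
from scipy.stats import wilcoxon

# Paired Fa scores of MFHFS and one competing method over the ten feature ratios
# (illustrative numbers only).
fa_mfhfs = np.array([0.91, 0.92, 0.93, 0.94, 0.94, 0.95, 0.95, 0.95, 0.96, 0.96])
fa_other = np.array([0.88, 0.90, 0.91, 0.92, 0.92, 0.93, 0.93, 0.94, 0.94, 0.95])

stat, p_value = wilcoxon(fa_mfhfs, fa_other)   # H0: the two methods are equivalent
alpha = 0.05
print(f"p = {p_value:.4f};", "difference is significant" if p_value <= alpha else "not significant")
```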

5 Conclusions

Many traditional filter-based feature selection methods depend heavily on the dataset and cannot handle redundant information effectively. Moreover, many MI based filters and hybrid methods have high time complexity when the number of features or samples is large. In this paper, a multi-filter weights and multi-feature weights based hybrid feature selection method (called MFHFS) is proposed. In the data preprocessing stage, the samples are normalized and discretized by the equal width interval binning (EWIB) algorithm, and 10-fold cross-validation is combined to remove noise and outlier samples. In the feature combination stage, several optimal filters are chosen and used to obtain feature subsets, which are merged into a temporary feature subset by considering the multi-filter weights and multi-feature weights. In the feature refinement stage, the redundant information of the temporary feature subset is filtered to obtain the final feature subset, and a Q-range based feature relevance calculation method is proposed to improve the running speed of the redundant information filtering. The efficiency of MFHFS is examined through classification experiments with SVM and RF classifiers on six datasets: APS, Madelon, CNAE9, Gisette, DrivFace and Amazon. The experimental results show that: (1) compared with the traditional filters, MFHFS greatly improves classification accuracy at the cost of an acceptable decrease in running speed; (2) MFHFS achieves obvious improvements in classification accuracy over typical feature extraction and hybrid feature selection methods, and significant improvements in running speed over typical methods such as AE, EGAFS and CBHFS, illustrating its efficacy in obtaining the best features for data classification.

In the future, we will investigate the following two directions: (1) learning from datasets of multiple fields and studying multi-task based feature selection methods; (2) investigating parallel computing methods to improve the running speed and extending the proposed method to real applications.