Abstract
The minimal-redundancy–maximal-relevance (mRMR) algorithm is a typical feature selection algorithm. To select the feature that has minimal redundancy with the already selected features and maximal relevance with the class label, the objective function of mRMR subtracts the average mutual information between a candidate feature and the selected features from the mutual information between that feature and the class label, and selects the feature with the maximum difference. However, the feature with the maximum difference is not always the feature with minimal redundancy and maximal relevance. To address this problem, the objective function of mRMR is first analyzed, and a constraint condition is derived that determines whether the objective function can guarantee the effectiveness of the selected features. Then, for the case where the objective function is not reliable, an idea of equal interval division is proposed and combined with ranking to process the interval of mutual information between features and the class label, and that of the average mutual information between features. Finally, a feature selection algorithm based on equal interval division and minimal-redundancy–maximal-relevance (EID–mRMR) is proposed. To validate the performance of EID–mRMR, we compare it with several incremental feature selection algorithms based on mutual information and with other feature selection algorithms. Experimental results demonstrate that EID–mRMR achieves better feature selection performance.
1 Introduction
With the explosive growth of information, the dimension of feature sets increases, which can cause the curse of dimensionality. Therefore, it is necessary to reduce the dimension of the feature set [1,2,3]. Dimensionality reduction methods comprise feature extraction and feature selection [4]. Feature extraction transforms the original features into a new space and takes the transformed features as the final features, while feature selection selects a subset of the original features. Compared with feature extraction, feature selection has advantages in the interpretability of data [5]. Therefore, feature selection has a wide range of applications, such as text processing [6, 7], underwater object recognition and classification [8, 9], network anomaly detection [10], information retrieval [11], image classification [12, 13] and microarray data classification [14].
The metrics adopted in feature selection include distance, mutual information and consistency. Compared with other metrics, mutual information can measure the relationship between variables and is invariant under space transformations [15]. Hence, many feature selection algorithms based on mutual information have been proposed, such as [16, 17]. Among these algorithms, the mutual information maximisation (MIM) algorithm [18] is a basic one. However, it does not perform well because it only considers mutual information between features and the class label.
To overcome this shortcoming of MIM, some algorithms have been proposed that employ mutual information between features and the class label to describe relevance and mutual information between features to describe redundancy. Among them, the minimal-redundancy–maximal-relevance (mRMR) algorithm [19] is typical. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between each candidate feature and all the selected features from the mutual information between that candidate feature and the class label, and selects the feature with the maximum difference. Since the feature with the maximum difference does not necessarily have minimal redundancy and maximal relevance, the objective function of mRMR has a limitation.
Several feature selection algorithms have been proposed to address problems of mRMR. Since mRMR inherits the bias of mutual information toward multivalued features, a normalization operation was applied, resulting in the NMIFS algorithm [15]. In [20], mutual information between each candidate feature and the class label, and the average mutual information between each candidate feature and all the selected features, were processed by the optimization algorithm NSGA-II, yielding the MIFS-ND algorithm. Combining mRMR with the idea of optimization, feature selection was investigated in [21]. mRMR was combined with the ReliefF algorithm into a two-stage feature selection algorithm in [22]. Combining mRMR with particle swarm optimization, a maximum-relevance minimum-redundancy PSO algorithm was presented in [23]. However, none of [15, 20,21,22,23] properly handles the aforementioned limitation of the objective function of mRMR.
In view of this limitation, this paper first analyzes the objective function of mRMR and derives a condition under which the objective function can guarantee the effectiveness of the selected features. Then, for the case where the objective function cannot guarantee this effectiveness, the interval of mutual information between each candidate feature and the class label, and that of the average mutual information between each candidate feature and all the selected features, are divided equally, and the resulting subintervals are ranked. Finally, a feature selection algorithm based on equal interval division and minimal-redundancy–maximal-relevance (EID–mRMR) is proposed.
The rest of this paper is organized as follows. Section 2 analyzes some feature selection algorithms based on mutual information. The EID–mRMR algorithm is proposed in Sect. 3. Section 4 presents and discusses experimental results. Conclusions and future work are presented in Sect. 5.
2 Related Work
In this paper, we only analyze mutual information of discrete random variables. Assume Y and Z are two discrete random variables, p(y) is the probability density function of Y, p(z) is the probability density function of Z, and p(y, z) is the joint probability density function of Y and Z. Mutual information quantifies the information that two random variables share. Mutual information I(Y; Z) is defined as

\(I\left( {Y;Z} \right) = \sum \nolimits _y {\sum \nolimits _z {p\left( {y,z} \right) \log \frac{{p\left( {y,z} \right) }}{{p\left( y \right) p\left( z \right) }}} } \)  (1)
A higher mutual information value means that the two random variables share more information.
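As a concrete illustration, the definition above can be estimated from paired samples of two discrete variables with a simple plug-in estimator. The following sketch is our own (not code from the paper) and uses base-2 logarithms, so values are in bits:

```python
import math
from collections import Counter

def mutual_information(y, z):
    """Plug-in estimate of I(Y; Z) in bits from paired samples y, z."""
    n = len(y)
    count_y = Counter(y)           # marginal counts of Y
    count_z = Counter(z)           # marginal counts of Z
    count_yz = Counter(zip(y, z))  # joint counts of (Y, Z)
    mi = 0.0
    for (yv, zv), c in count_yz.items():
        # p(y,z) * log2( p(y,z) / (p(y) p(z)) ), with p(y,z) = c/n etc.
        mi += (c / n) * math.log2(c * n / (count_y[yv] * count_z[zv]))
    return mi
```

For perfectly dependent binary variables the estimate is 1 bit, and for independent ones it is 0, matching the intuition above.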
MIM is a feature selection algorithm based on mutual information, and its objective function is expressed as

\(\mathop {\max }\nolimits _{{f_i} \in X} I\left( {c;{f_i}} \right) \)  (2)
where X is the candidate feature set, \({{f_i}}\) is a candidate feature and c is the class label. MIM calculates mutual information between each candidate feature and the class label, ranks the features in descending order of these values, and selects the features with the largest values. The algorithm does not yield good results because it ignores feature interactions.
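MIM's ranking step amounts to sorting features by \({I\left( {c;{f_i}} \right) }\) and keeping the top k. A minimal sketch (our illustration, assuming the relevance values have already been computed):

```python
def mim_select(relevance, k):
    """MIM: rank features by I(c; f) in descending order and keep the top k.
    relevance maps feature id -> I(c; f)."""
    return sorted(relevance, key=relevance.get, reverse=True)[:k]
```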
To overcome the shortcoming of MIM, some feature selection algorithms based on relevance and redundancy have been proposed [19, 24]. Their objective functions differ, but their feature selection processes are the same: first, mutual information between each feature and the class label is calculated, and the feature with the maximum value is selected; then the algorithm loops, selecting the feature that satisfies the objective function in a forward search manner, until a specified number of features has been selected. Obviously, the objective functions are the key of these algorithms, and the algorithms are analyzed below in combination with their objective functions.
The objective function of the mutual information based feature selection (MIFS) algorithm [24] is

\(\mathop {\max }\nolimits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \beta \sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } } \right] \)  (3)

where S is the selected feature set and \({{f_s}}\) is a selected feature. MIFS uses a parameter \(\beta \) to balance the mutual information \({I\left( {c;{f_i}} \right) }\) against the mutual information between \({{f_i}}\) and all the selected features. When \(\beta \) is set to zero, the algorithm reduces to MIM.
mRMR [19] replaces the parameter \(\beta \) with the reciprocal of the number of selected features, avoiding the problem of choosing the parameter. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between all the selected features and \({{f_i}}\) from \({I\left( {c;{f_i}} \right) }\), and selects the feature with the maximum difference. The objective function of mRMR is expressed as

\(\mathop {\max }\nolimits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } } \right] \)  (4)
where \({\left| S \right| }\) is the number of selected features. However, the feature satisfying Eq. (4) is not always the feature with minimal redundancy and maximal relevance; therefore, the objective function of mRMR has a limitation.
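The mRMR forward search can be sketched as follows. This is our illustration, assuming the mutual information values are precomputed and supplied as dictionaries (`relevance[f]` = \(I\left( {c;f} \right) \), `pair_mi[frozenset((f, g))]` = \(I\left( {f;g} \right) \)):

```python
def mrmr_select(relevance, pair_mi, k):
    """Greedy mRMR (Eq. (4)): pick the most relevant feature first, then
    repeatedly pick the candidate maximizing relevance minus average
    redundancy with the already selected features."""
    candidates = list(relevance)
    first = max(candidates, key=lambda f: relevance[f])
    selected = [first]
    candidates.remove(first)
    while candidates and len(selected) < k:
        def score(f):
            redundancy = sum(pair_mi[frozenset((s, f))] for s in selected)
            return relevance[f] - redundancy / len(selected)
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Note that the second pick need not be the second most relevant feature: a highly relevant candidate with large redundancy can lose to a less relevant but nearly independent one.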
Combining mRMR with the optimization algorithm NSGA-II, the MIFS-ND algorithm [20] was proposed. MIFS-ND first selects the feature that has the maximum mutual information with the class label. Then, it calculates \({I\left( {c;{f_i}} \right) }\) and the average mutual information between all the selected features and each candidate feature, processes these values with NSGA-II, and obtains the domination count \({C_d}\) and the dominated count \({F_d}\) for each feature. As stated in [20], the domination count of a candidate feature is the number of features it dominates with respect to mutual information between the candidate feature and the class label, and the dominated count is the number of features it dominates with respect to the average mutual information between the candidate feature and all the selected features. Finally, the feature satisfying Eq. (5) is selected. These steps are repeated until a specified number of features has been selected. Compared with the range of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\), the ranges of \({C_d}\) and \({F_d}\) are greater. However, since \({C_d}\) and \({F_d}\) do not reflect the magnitude of the differences in mutual information between the class label and different candidate features, or in the average mutual information between the selected features and different candidate features, MIFS-ND cannot effectively resolve the limitation of the objective function of mRMR.
3 The Proposed Feature Selection Algorithm
This section first derives a condition that tests whether the objective function can guarantee the performance of the selected features. Then, equal interval division combined with ranking is proposed to deal with the case where the objective function cannot guarantee this performance. Following that, the proposed algorithm EID–mRMR is presented. Finally, an example is used to analyze the first two features selected by mRMR, MIFS-ND and EID–mRMR.
3.1 A Validation Condition
mRMR is a feature selection algorithm based on relevance and redundancy. The flow chart of mRMR is presented in Fig. 1. First, the candidate feature set X and the selected feature set S are initialized. Then, mutual information between each feature and the class label is calculated, and the feature with the maximum value is selected; the algorithm then loops, selecting the feature that satisfies Eq. (4) in a forward search manner, until a specified number of features N has been selected.
The objective function in Eq. (4) is the key of mRMR. mRMR utilizes \({I\left( {c;{f_i}} \right) }\) to describe relevance and \({I\left( {{f_s};{f_i}} \right) }\) to describe redundancy. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between all the selected features and each candidate feature from \({I\left( {c;{f_i}} \right) }\), and selects the feature with the maximum difference. The feature that has minimal redundancy with the selected features and maximal relevance with the class label always satisfies Eq. (4), but not vice versa; therefore, Eq. (4) has a limitation. To analyze the feature selected by Eq. (4), Eq. (6) is needed first.
Equation (4) corresponds to the special case where Eq. (6) attains its maximum value. If, after computing Eq. (6) for every candidate, the maximum value is far greater than the second-largest value, the feature with the maximum value is selected; otherwise, the advantage of using Eq. (4) to select features is not obvious. To simplify the computation, the condition that the difference between the maximum and the second-largest value of Eq. (6) is greater than a fixed value P is adopted to decide whether Eq. (4) should be used: if the difference is greater than P, Eq. (4) is used to select features; otherwise, an idea of equal interval division and ranking is adopted. We give a detailed description of this idea in the next section.
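The validation condition can be checked in a few lines. The sketch below is our reading of the text, with `scores` holding the Eq. (6) value of every candidate and `P` the fixed threshold:

```python
def mrmr_is_reliable(scores, P):
    """True when the gap between the largest and second-largest Eq. (6)
    value exceeds P, i.e. Eq. (4) may be trusted to pick the feature."""
    if len(scores) < 2:
        return True
    top = sorted(scores, reverse=True)
    return top[0] - top[1] > P
```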
3.2 Equal Interval Division and Ranking
When Eq. (4) cannot guarantee the effectiveness of the selected features, to ensure that the feature with minimal redundancy and maximal relevance is selected, the interval of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\) are divided equally, and the resulting subintervals are ranked. The number of the dataset's features is taken as the number of subintervals. Concretely: the maximum and minimum of \({I\left( {c;{f_i}} \right) }\), and of the average mutual information between all the selected features and \({{f_i}}\), are taken as the interval bounds, and each interval is divided equally. The subintervals are then numbered from 1 to the number of the dataset's features, and these numbers are taken as the ordinal values of the values falling in the subintervals.
FFMI \({\left( {{f_s};{f_i}} \right) }\) is the value obtained when the interval of \(\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } /\left| S \right| \) is processed by equal division and ranking, and CFMI \({\left( {c;{f_i}} \right) }\) is the value obtained when the interval of \({I\left( {c;{f_i}} \right) }\) is processed by equal division and ranking. The computation of FFMI \({\left( {{f_s};{f_i}} \right) }\) is shown in Algorithm 1 and that of CFMI \({\left( {c;{f_i}} \right) }\) in Algorithm 2.
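Algorithms 1 and 2 share the same equal-division-and-ranking step, which can be sketched as follows. This is our reading of the procedure: bin the range of the values into as many equal subintervals as the dataset has features, and return the 1-based bin index of each value; the small epsilon guards against floating-point effects at subinterval boundaries:

```python
def ordinal_values(values, n_bins):
    """Divide [min(values), max(values)] into n_bins equal subintervals and
    return the 1-based subinterval index (ordinal value) of each value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1] * len(values)
    width = (hi - lo) / n_bins
    # the maximum value is assigned to the last subinterval
    return [min(int((v - lo) / width + 1e-9) + 1, n_bins) for v in values]
```

Computing CFMI corresponds to applying this to the \({I\left( {c;{f_i}} \right) }\) values of the candidates, and FFMI to their average-redundancy values.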
For the situation where Eq. (4) cannot guarantee the effectiveness of selected features, equal interval division and ranking are adopted to process \({I\left( {c;{f_i}} \right) }\) and \(\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) }/\left| S \right| \), and CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are attained. Then, Eq. (7) is presented.
Equation (7) is the objective function of the proposed algorithm EID–mRMR. CFMI \({\left( {c;{f_i}} \right) }\) is utilized to describe relevance and FFMI \({\left( {{f_s};{f_i}} \right) }\) is adopted to describe redundancy. If several features satisfy Eq. (7), the one among them that maximizes Eq. (6) is selected.
Equations (4), (5) and (7) are all proposed to select the features with minimal redundancy and maximal relevance. Therefore, it is necessary to compare Eq. (7) with Eqs. (4) and (5); before the comparison, Eqs. (8) and (9) are presented.
Equation (5) is the special case where Eq. (8) attains its maximum value, and Eq. (7) is the special case where Eq. (9) attains its maximum value. We now compare Eq. (9) with Eqs. (6) and (8). In each of Eqs. (6), (8) and (9), the first part describes relevance and the second part describes redundancy. Across candidate features, the ranges of \({C_d}\) and \({F_d}\) depend on the number of candidate features, and the ranges of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) depend on the number of the dataset's features. In contrast, the range of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\) depend on neither, and are smaller than the ranges of \({C_d}\), \({F_d}\), CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Further, since the number of the dataset's features is greater than the number of candidate features, the ranges of \({C_d}\) and \({F_d}\) are smaller than those of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Therefore, Eq. (9) has greater ranges of relevance and redundancy than Eqs. (6) and (8), and Eq. (8) has greater ranges than Eq. (6).
Further, unlike Eq. (8), Eq. (9) adopts equal interval division and ranking to process \({I\left( {c;{f_i}} \right) }\) and the average mutual information between all the selected features and \({{f_i}}\), so CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) guarantee that values in the same subinterval have the same priority. Therefore, compared with Eqs. (4) and (5), Eq. (7) is more advantageous for selecting the feature with minimal redundancy and maximal relevance.
3.3 Algorithmic Implementation
With the above validation condition and Eq. (7), EID–mRMR is shown in Algorithm 3.
EID–mRMR consists of two parts. In the first part (lines 1–7), the candidate feature set X and the selected feature set S are initialized; then \({I\left( {c;{f_i}} \right) }\) is calculated, and the feature with the maximum value is selected. In the second part (lines 8–34), the average mutual information between all the selected features and \({{f_i}}\) is calculated, and the difference between the maximum and the second-largest value of Eq. (6) is computed. If the difference is greater than a fixed value P, the feature satisfying Eq. (4) is selected; otherwise, CFMI \({\left( {c;{f_i}} \right) }\), FFMI \({\left( {{f_s};{f_i}} \right) }\) and their difference are calculated. If more than one feature attains the maximum difference between CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\), the feature among them that maximizes Eq. (6) is selected; otherwise, the feature with the maximum difference is selected. These steps are repeated until a specified number of features has been selected.
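Putting the pieces together, the two-part loop described above can be sketched end to end. This is our condensed reading of Algorithm 3, assuming the mutual information values are precomputed dictionaries (`relevance[f]` = \(I\left( {c;f} \right) \), `pair_mi[frozenset((f, g))]` = \(I\left( {f;g} \right) \)) and the number of subintervals equals the number of the dataset's features:

```python
def eid_mrmr_select(relevance, pair_mi, k, P=0.05):
    """Condensed sketch of EID-mRMR: use Eq. (4) when its top score is
    clearly separated (gap > P); otherwise fall back to ordinal values
    from equal interval division and ranking (Eq. (7))."""
    feats = list(relevance)
    n_bins = len(feats)  # number of subintervals = number of dataset features

    def ordinals(vals):  # 1-based equal-interval bin index of each value
        lo, hi = min(vals), max(vals)
        if hi == lo:
            return [1] * len(vals)
        w = (hi - lo) / n_bins
        return [min(int((v - lo) / w + 1e-9) + 1, n_bins) for v in vals]

    first = max(feats, key=lambda f: relevance[f])  # lines 1-7: max relevance
    selected, cand = [first], [f for f in feats if f != first]
    while cand and len(selected) < k:               # lines 8-34
        red = {f: sum(pair_mi[frozenset((s, f))] for s in selected) / len(selected)
               for f in cand}
        diff = {f: relevance[f] - red[f] for f in cand}       # Eq. (6)
        top = sorted(diff.values(), reverse=True)
        if len(top) < 2 or top[0] - top[1] > P:
            best = max(cand, key=lambda f: diff[f])           # Eq. (4)
        else:
            cfmi = dict(zip(cand, ordinals([relevance[f] for f in cand])))
            ffmi = dict(zip(cand, ordinals([red[f] for f in cand])))
            odiff = {f: cfmi[f] - ffmi[f] for f in cand}      # Eq. (7)
            m = max(odiff.values())
            ties = [f for f in cand if odiff[f] == m]
            best = max(ties, key=lambda f: diff[f])  # break ties by Eq. (6)
        selected.append(best)
        cand.remove(best)
    return selected
```

The default `P=0.05` mirrors the value used in the experiments; the helper names are ours, not the paper's.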
3.4 An Example
To aid understanding of EID–mRMR, an example is presented. The first selected feature is the one with the maximum mutual information with the class label, and it is selected differently from the rest: the second and all subsequent features are selected according to the objective function of the algorithm. Therefore, the example analyzes the first two features selected by mRMR, MIFS-ND and EID–mRMR. A dataset with 7 features is used, in which feature \({{f_7}}\) has the maximum mutual information with c. The dataset is described in Table 1.
Since feature \({{f_7}}\) has the maximum mutual information with c, mRMR, EID–mRMR and MIFS-ND all select \({{f_7}}\) first. With \({{f_7}}\) selected and the number of selected features equal to 1, the average mutual information between all the selected features and a candidate feature \({{f_i}}\) is simply \({I\left( {{f_7};{f_i}} \right) }\). The range of \({I\left( {c;{f_i}} \right) }\) is [0.60, 0.90] and that of \({I\left( {{f_7};{f_i}} \right) }\) is [0.03, 0.09]. Since feature \({{f_2}}\) has the maximum difference between the two mutual information values, mRMR selects \({{f_2}}\).
MIFS-ND adopts NSGA-II to process \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\), obtaining the domination count \({C_d}\) and the dominated count \({F_d}\) for each feature; \({F_d}\) is then subtracted from \({C_d}\). The domination counts are shown in Table 2.
As shown in Table 2, because \({{f_5}}\) has the maximum difference, MIFS-ND selects \({{f_5}}\). Since the number of candidate features is 6, the ranges of \({C_d}\) and \({F_d}\) are both [0, 5], greater than the ranges of \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\). Further, although \({I\left( {c;{f_2}} \right) }\) is not far greater than \({I\left( {c;{f_1}} \right) }\), \({{f_1}}\) and \({{f_2}}\) have different \({C_d}\) values. This shows that \({C_d}\) and \({F_d}\) do not reflect the magnitude of the differences in mutual information between the class label and different candidate features, or in the average mutual information between all the selected features and different candidate features.
\({{f_2}}\) has the maximum difference between \({I\left( {c;{f_2}} \right) }\) and \({I\left( {{f_7};{f_2}} \right) }\), and \({{f_1}}\) has the second-largest difference between \({I\left( {c;{f_1}} \right) }\) and \({I\left( {{f_7};{f_1}} \right) }\). Since the maximum difference is not far greater than the second-largest difference, the objective function of mRMR cannot guarantee the effectiveness of the selected features, so equal interval division and ranking are applied to \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\). Since the dataset has 7 features, the interval of \({I\left( {c;{f_i}} \right) }\) is divided into [0.60, 0.65), [0.65, 0.70), [0.70, 0.75), [0.75, 0.80), [0.80, 0.85), [0.85, 0.90) and [0.90, 0.95), and that of \({I\left( {{f_7};{f_i}} \right) }\) is divided into [0.03, 0.04), [0.04, 0.05), [0.05, 0.06), [0.06, 0.07), [0.07, 0.08), [0.08, 0.09) and [0.09, 0.10). The ordinal values CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are then calculated; they are shown in Table 3.
To illustrate, \({I\left( {c;{f_1}} \right) }\) and \({I\left( {{f_7};{f_1}} \right) }\) are used to trace the ordinal values. Since \({I\left( {c;{f_1}} \right) }\) is 0.90, it falls in the subinterval [0.90, 0.95), which is numbered 7, so its ordinal value is 7. Likewise, since \({I\left( {{f_7};{f_1}} \right) }\) is 0.09, it falls in the subinterval [0.09, 0.10), which is numbered 7, so its ordinal value is 7.
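The two lookups in this example reduce to a single bin-index computation (our illustration; the epsilon absorbs floating-point error at subinterval boundaries):

```python
def bin_index(v, lo, width):
    """1-based index of the subinterval [lo+(i-1)*width, lo+i*width) containing v."""
    return int((v - lo) / width + 1e-6) + 1

# I(c; f1) = 0.90 lies in [0.90, 0.95), the 7th subinterval starting at 0.60
# I(f7; f1) = 0.09 lies in [0.09, 0.10), the 7th subinterval starting at 0.03
```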
As shown in Table 3, since \({{f_3}}\) has the maximum difference between the ordinal values, EID–mRMR selects \({{f_3}}\) as the second feature. Since the dataset has 7 features, the ranges of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are both [1, 7], greater than the ranges of \({I\left( {c;{f_i}} \right) }\), \({I\left( {{f_7};{f_i}} \right) }\), \({C_d}\) and \({F_d}\). Therefore, EID–mRMR has greater ranges of relevance and redundancy than mRMR and MIFS-ND. Further, CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) guarantee that features in the same subinterval have the same priority. Consequently, EID–mRMR is more advantageous for selecting the feature with minimal redundancy and maximal relevance, and selecting \({{f_3}}\) is more appropriate than selecting \({{f_2}}\) or \({{f_5}}\).
4 Experimental Results
To validate the effectiveness of EID–mRMR, it is compared with mRMR, MIFS-ND, five other incremental MI-based feature selection algorithms and four other feature selection algorithms.
4.1 The Datasets and Experimental Settings
The datasets presented in Table 4 are from the UCI machine learning repository [25] and the ASU feature selection datasets [26]. The number of selected features is 50 for all datasets. The minimum description length discretization method [27] is adopted to transform numerical features into discrete ones, and is used only for feature selection. Three popular classifiers, J48, IB1 and Naive Bayes, are utilized, with parameters set to WEKA's [28] default values. The ASU feature selection software package [29] is adopted.
To validate the effectiveness of EID–mRMR, MIFS, mRMR, MIFS-ND, NMIFS [15], MIFS-U [30], MIFS-CR [31] and CMI [32] are adopted for performance comparisons. NMIFS, MIFS-U, MIFS-CR and CMI are four feature selection algorithms based on relevance and redundancy, and their objective functions are presented as follows:
where \({H({f_s})}\) is the entropy of \({{f_s}}\), \({H({f_i})}\) is the entropy of \({{f_i}}\), H(c) is the entropy of c, and \({H({f_i}/c)}\) is the conditional entropy of \({{f_i}}\) given c.
In the experiment, the \(\beta \) value of MIFS is set to 0.5 and that of MIFS-U is set to 1.
In addition to the seven incremental MI-based algorithms, Relief-F [33], Fisher [34], QPFS [35] and SPEC-CMI [36] are compared with EID–mRMR. For QPFS and SPEC-CMI, before feature selection, all features are normalized to the range [− 1, 1] and five equal-size bins are used to transform numerical features into discrete ones.
As in [37], to reduce the influence of randomness on the final results, tenfold cross-validation is repeated ten times, and the mean and standard deviation of the ten results are taken as the final results. To determine whether differences in the experimental results are significant, a one-sided paired t-test at the 5% significance level is carried out.
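The significance test can be reproduced with a hand-rolled one-sided paired t-statistic. The sketch and its accuracy numbers below are illustrative only (not taken from the paper's tables); with 10 paired results, the null hypothesis is rejected at the 5% level when t exceeds roughly 1.833, the one-sided critical value for 9 degrees of freedom:

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """t statistic of a one-sided paired t-test for H1: mean(a - b) > 0."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))

# Illustrative accuracies over ten repetitions of tenfold cross-validation
eid  = [0.91, 0.90, 0.92, 0.91, 0.93, 0.90, 0.92, 0.91, 0.93, 0.90]
base = [0.89, 0.90, 0.90, 0.88, 0.91, 0.89, 0.90, 0.90, 0.90, 0.88]
t = paired_t_statistic(eid, base)
significant = t > 1.833  # one-sided critical value, df = 9, alpha = 0.05
```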
4.2 Experimental Results and Analysis
4.2.1 With Different P Values
In this section, the influence of the P value on EID–mRMR is analyzed, with P set to 0.02, 0.03, 0.04, 0.05 and 0.06. The average performance of EID–mRMR with different P values when using J48, IB1 and Naive Bayes is presented in Table 5.
As shown in Table 5, the greatest Avg. value is achieved when P is 0.05 or 0.06. At the same time, the differences between the Avg. values for different P values are very small, so P has a relatively small impact on the performance of EID–mRMR. In the experiments, we find that feature selection takes more time as P increases. Therefore, in the comparisons with the incremental MI-based algorithms and the other feature selection algorithms, P is set to 0.05.
4.2.2 Comparison with Incremental MI-Based Algorithms
This section compares EID–mRMR with the other incremental MI-based algorithms. Margins between the performance of EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes are presented in Tables 6, 7 and 8. The values in the Avg. row are the mean and standard deviation of the twelve values above. The values in the Win/Tie/Loss (W/T/L) row are the results of the one-sided paired t-test: the first value is the number of datasets on which EID–mRMR is significantly superior to the other algorithm, the second is the number of datasets on which they perform equally, and the third is the number of datasets on which EID–mRMR is significantly inferior. Classification accuracy of the optimal features selected by EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 9, 10 and 11. The running time of EID–mRMR and the other algorithms when selecting 50 features is shown in Table 12. Average performance comparisons of the algorithms with the three classifiers are shown in Fig. 2.
As shown in Table 6, on Movement_libras, Mfeat_pix, Semeion, COIL20 and orlraws10P, EID–mRMR is significantly better than mRMR. From the values in the Avg. and W/T/L rows, EID–mRMR performs better than mRMR when using J48. Compared with the other algorithms, MIFS-U, MIFS-ND and EID–mRMR achieve better results.
In Table 7, the Avg. values indicate that MIFS-U, MIFS-CR and EID–mRMR yield better results, and the W/T/L values suggest that mRMR, NMIFS, MIFS-U and EID–mRMR obtain better feature selection performance. Unlike in Table 6, the margins between EID–mRMR and the other algorithms increase, except for MIFS-U.
In Table 8, in terms of the Avg. values, MIFS-U, MIFS-CR and EID–mRMR achieve better results than the other algorithms; in terms of the W/T/L values, MIFS-U, CMI and EID–mRMR perform better. Moreover, the advantage of EID–mRMR over the other algorithms increases in both the Avg. and W/T/L values.
As shown in Tables 9, 10 and 11, EID–mRMR achieves better results on the majority of datasets. Taking Table 11 as an example, the number of datasets on which EID–mRMR performs better than all seven other algorithms is 7, while the number of datasets on which any other algorithm is superior to EID–mRMR is no more than 2. Furthermore, EID–mRMR obtains the greatest Avg. values with all three classifiers.
In Table 12, EID–mRMR takes time comparable to the other seven algorithms on the majority of datasets, except gisette and orlraws10p. Since gisette and orlraws10p are high dimensional and require the equal interval division and ranking procedure more often, EID–mRMR is more time-consuming on these two datasets, but it also obtains greater Avg. values than the other algorithms.
As shown in Fig. 2, on some datasets, such as Spambase and Mfeat_fou, the selected features account for a large proportion of all features; after the classification accuracy of the features selected by most algorithms reaches its maximum, it decreases to some extent. On most datasets, however, the selected features occupy only a small proportion, so the average performance of the features selected by most algorithms increases significantly at first, then increases slowly and levels off. Since the average over the 50 selected features is taken as the final result, the proportion of selected features has little impact on the final result, and selecting 50 features is therefore suitable. In the average performance comparisons, EID–mRMR achieves better effectiveness on the majority of datasets, while the other algorithms perform poorly on some datasets: for example, mRMR does not handle Synthetic_control, Movement_libras, Mfeat_pix, Semeion, ORL and gisette well, and MIFS-ND does not yield good results on Spambase, Mfeat_fou and gisette. On the whole, compared with the other seven feature selection algorithms, EID–mRMR achieves better results.
4.2.3 Comparison with Other Feature Selection Algorithms
This section compares EID–mRMR with Relief-F, Fisher, QPFS and SPEC-CMI. Margins between the performance of EID–mRMR and these algorithms when using J48, IB1 and Naive Bayes are presented in Tables 13, 14 and 15, and the classification accuracy of the optimal features selected by each algorithm is presented in Tables 16, 17 and 18.
As shown in Table 13, EID–mRMR and SPEC-CMI are superior to Relief-F, Fisher and QPFS on Spambase and Mfeat_fou, and EID–mRMR and QPFS are superior to Relief-F, Fisher and SPEC-CMI on Mfeat_fou and Movement_libras. On most datasets, EID–mRMR performs better than Relief-F, Fisher, QPFS and SPEC-CMI. From the values in the Avg. and W/T/L rows, EID–mRMR achieves better results than the other four algorithms.
In Table 14, SPEC-CMI performs better than EID–mRMR on Spambase and Mfeat_fou, but in the majority of datasets EID–mRMR yields better results than the other four algorithms. The values in the Avg. and W/T/L rows suggest that EID–mRMR obtains better feature selection effectiveness. Unlike in Table 13, EID–mRMR's advantage over the other algorithms increases in both the Avg. and W/T/L values.
As shown in Table 15, Fisher and SPEC-CMI achieve better results than EID–mRMR on Spambase, but in most datasets EID–mRMR achieves better feature selection performance than the other four algorithms. Compared with Tables 13 and 14, the margins between EID–mRMR and the other four algorithms widen in both the Avg. and W/T/L values.
Tables 16, 17 and 18 show that EID–mRMR performs better than the other four algorithms in the majority of datasets. In Table 16, for example, EID–mRMR obtains better results than all four algorithms on 7 datasets, while the number of datasets on which it is inferior to any one of the other algorithms is no more than 5. Furthermore, EID–mRMR achieves the highest Avg. value with all three classifiers.
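The W/T/L (win/tie/loss) rows referred to in these tables tally, per competing algorithm, the datasets on which EID–mRMR is better, equal, or worse. A minimal sketch of that tally, with made-up per-dataset accuracies for illustration:

```python
def win_tie_loss(ours, theirs, tol=1e-9):
    """Count datasets where `ours` beats, ties with, or loses to `theirs`."""
    wins = sum(a > b + tol for a, b in zip(ours, theirs))
    losses = sum(a < b - tol for a, b in zip(ours, theirs))
    ties = len(ours) - wins - losses
    return wins, ties, losses

# Hypothetical per-dataset accuracies for two algorithms:
eid_mrmr = [0.91, 0.88, 0.95, 0.80]
baseline = [0.90, 0.88, 0.93, 0.82]
print(win_tie_loss(eid_mrmr, baseline))  # -> (2, 1, 1)
```

The tolerance guards against counting floating-point noise as a win or loss.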
5 Conclusions and Future Work
This paper first analyzes the objective function of mRMR. Then, for the case where the objective function cannot guarantee the effectiveness of the selected features, an idea of equal interval division and ranking is proposed to process the mutual information between features and the class label, and the average value of mutual information between features. On this basis, EID–mRMR is proposed. To verify its performance, we apply EID–mRMR to three classifiers, eight UCI datasets and four ASU datasets, and compare the results with those of seven incremental MI-based algorithms and four other feature selection algorithms. Experimental results validate that EID–mRMR achieves better feature selection effectiveness in the majority of datasets.
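The mRMR objective that this paper analyzes and builds on, selecting at each step the unselected feature that maximizes its mutual information with the class label minus its average mutual information with the already-selected features, can be sketched as follows for discrete data. This is a minimal illustration of plain mRMR under those assumptions, not the paper's EID–mRMR implementation.

```python
import numpy as np

def mutual_info(x, y):
    """Mutual information I(X; Y) between two discrete variables, in nats."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                px, py = np.mean(x == xv), np.mean(y == yv)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def mrmr_select(X, y, k):
    """Greedy mRMR: start with the most relevant feature, then repeatedly
    pick the feature f maximizing I(f; y) - mean_{s in S} I(f; s)."""
    n_feat = X.shape[1]
    relevance = [mutual_info(X[:, j], y) for j in range(n_feat)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, j], X[:, s]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

On a toy dataset where one column duplicates the label, that column is selected first; a redundant copy of it then scores poorly because its high relevance is cancelled by its high redundancy, which is exactly the trade-off whose failure cases motivate EID–mRMR.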
Considering that EID–mRMR currently adopts only mutual information and does not utilize three-way interaction information or higher-dimensional mutual information [38,39,40,41,42,43,44,45,46], its performance may degrade because some useful information is missed. In the next stage, we will investigate how to combine three-way interaction information with the idea of equal interval division and ranking.
References
Tan MK, Tsang IW, Wang L (2014) Towards ultrahigh dimensional feature selection for big data. J Mach Learn Res 15:1371–1429
Cataron A, Andonie R (2010) Energy supervised relevance neural gas for feature ranking. Neural Process Lett 32(1):59–73
Borja SP, Veronica BC, Amparo AB (2017) Testing different ensemble configurations for feature selection. Neural Process Lett 46:1–24
Tiwari S, Singh B, Kaur M (2017) An approach for feature selection using local searching and global optimization techniques. Neural Comput Appl 28(10):2915–2930
Wang JZ, Wu LS, Kong J, Li YX, Zhang BX (2013) Maximum weight and minimum redundancy: a novel framework for feature subset selection. Pattern Recognit 46(6):1616–1627
Shang CX, Li M, Feng SZ, Jiang QS, Fan JP (2013) Feature selection via maximizing global information gain for text classification. Knowl-Based Syst 54:298–309
Tang B, Kay S, He HB (2016) Toward optimal feature selection in naive Bayes for text categorization. IEEE Trans Knowl Data Eng 28(9):2508–2521
Fei T, Kraus D, Zoubir AM (2012) A hybrid relevance measure for feature selection and its application to underwater objects recognition. In: International conference on image processing, pp 97–100
Fei T, Kraus D, Zoubir AM (2015) Contributions to automatic target recognition systems for underwater mine classification. IEEE Trans Geosci Remote Sens 53(1):505–518
Zhang F, Chan PPK, Biggio B, Yeung DS, Roli F (2016) Adversarial feature selection against evasion attacks. IEEE Trans Cybern 46(3):766–777
Veronica BC, Noelia SM, Amparo AB (2013) A review of feature selection methods on synthetic data. Knowl Inf Syst 34(3):483–519
Jia XP, Kuo BC, Crawford MM (2013) Feature mining for hyperspectral image classification. Proc IEEE 101(3):676–697
Lin CH, Chen HY, Wu YS (2014) Study of image retrieval and classification based on adaptive features using genetic algorithm feature selection. Expert Syst Appl 41(15):6611–6621
Zhao YH, Wang GR, Yin Y, Li Y, Wang ZH (2016) Improving ELM-based microarray data classification by diversified sequence features selection. Neural Comput Appl 27(1):155–166
Estevez PA, Tesmer M, Perez CA, Zurada JM (2009) Normalized mutual information feature selection. IEEE Trans Neural Netw 20(2):189–201
Brown G, Pocock A, Zhao MJ, Lujan M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66
Vergara JR, Estevez PA (2014) A review of feature selection methods based on mutual information. Neural Comput Appl 24(1):175–186
Lewis DD (1992) Feature selection and feature extraction for text categorization. In: Proceedings of the workshop on speech and natural language, pp 212–217
Peng HC, Long FH, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Hoque N, Bhattacharyya DK, Kalita JK (2014) MIFS-ND: a mutual information-based feature selection method. Expert Syst Appl 41(14):6371–6385
Han M, Ren WJ (2015) Global mutual information-based feature selection approach using single-objective and multi-objective optimization. Neurocomputing 168:47–54
Zhang Y, Ding C, Li T (2008) Gene selection algorithm by combining reliefF and mRMR. BMC Genom 9(2):1–10
Unler A, Murat A, Chinnam RB (2011) mr²PSO: a maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Inf Sci 181(20):4625–4641
Battiti R (1994) Using mutual information for selecting features in supervised neural net learning. IEEE Trans Neural Netw 5(4):537–550
UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 6 Mar 2018
ASU feature selection datasets. http://featureselection.asu.edu/datasets/. Accessed 6 Mar 2018
Fayyad U, Irani K (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of joint conference on artificial intelligence, pp 1022–1027
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. ACM SIGKDD Explor Newsl 11(1):10–18
Zhao Z, Morstatter F, Sharma S, Alelyani S, Anand A, Liu H (2010) Advancing feature selection research. In: ASU feature selection repository, pp 1–28
Kwak N, Choi CH (2002) Input feature selection for classification problems. IEEE Trans Neural Netw 13(1):143–159
Wang ZC, Li MQ, Li JZ (2015) A multi-objective evolutionary algorithm for feature selection based on mutual information with a new redundancy measure. Inf Sci 307:73–88
Foithong S, Pinngern O, Attachoo B (2012) Feature subset selection wrapper based on mutual information and rough sets. Expert Syst Appl 39(1):574–584
Kononenko I (1994) Estimating attributes: analysis and extension of RELIEF. In: Proceedings of European conference on machine learning, pp 171–182
Duda RO, Hart PE, Stork DG (2001) Pattern classification. Wiley-Interscience Publication, New York
Rodriguez-Lujan I, Huerta R, Elkan C, Cruz CS (2010) Quadratic programming feature selection. J Mach Learn Res 11:1491–1516
Nguyen XV, Chan J, Romano S, Bailey J (2014) Effective global approaches for mutual information based feature selection. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, pp 512–521
Herman G, Zhang B, Wang Y, Ye GT, Chen F (2013) Mutual information-based method for selecting informative feature sets. Pattern Recognit 46(12):3315–3327
Fleuret F (2004) Fast binary feature selection with conditional mutual information. J Mach Learn Res 5:1531–1555
Sun X, Liu YH, Xu MT, Chen HL, Han JW, Wang KH (2013) Feature selection using dynamic weights for classification. Knowl-Based Syst 37:541–549
Bennasar M, Hicks Y, Setchi R (2015) Feature selection using joint mutual information maximisation. Expert Syst Appl 42(22):8520–8532
Zeng ZL, Zhang HJ, Zhang R, Yin CX (2015) A novel feature selection method considering feature interaction. Pattern Recognit 48(8):2656–2666
Shishkin A, Bezzubtseva AA, Drutsa A, Shishkov I, Gladkikh E, Gusev G, Serdyukov P (2016) Efficient high order interaction aware feature selection based on conditional mutual information. In: Proceedings of the conference on advances in neural information processing systems, pp 4637–4645
Vinh NX, Zhou S, Chan J, Bailey J (2016) Can high-order dependencies improve mutual information based feature selection. Pattern Recognit 53:46–58
Wang J, Wei JM, Yang ZL, Wang SQ (2017) Feature selection by maximizing independent classification information. IEEE Trans Knowl Data Eng 29(4):828–841
Gao WF, Hu L, Zhang P (2018) Class-specific mutual information variation for feature selection. Pattern Recognit 79:328–339
Gao WF, Hu L, Zhang P, He JL (2018) Feature selection considering the composition of feature relevancy. Pattern Recognit Lett 112:70–74
Acknowledgements
This work was supported by the National Natural Science Foundation of China (61771334).
Cite this article
Gu, X., Guo, J., Xiao, L. et al. A Feature Selection Algorithm Based on Equal Interval Division and Minimal-Redundancy–Maximal-Relevance. Neural Process Lett 51, 1237–1263 (2020). https://doi.org/10.1007/s11063-019-10144-3