1 Introduction

With the explosive growth of information, the dimension of feature sets increases, which can cause the curse of dimensionality. Therefore, it is necessary to reduce the dimension of the feature set [1,2,3]. Dimensionality reduction methods involve feature extraction and feature selection [4]. Feature extraction transforms the original features into a new space and takes the transformed features as the final features, while feature selection selects a subset of the original features. Compared with feature extraction, feature selection has the advantage of preserving the interpretability of the data [5]. Therefore, feature selection has a wide range of applications, such as text processing [6, 7], underwater object recognition and classification [8, 9], network anomaly detection [10], information retrieval [11], image classification [12, 13] and microarray data classification [14].

The metrics adopted in feature selection include distance, mutual information and consistency. Compared with other metrics, mutual information can measure the relationship between variables and is invariant under space transformations [15]. Hence, many feature selection algorithms based on mutual information have been proposed, such as [16, 17]. Among these, the mutual information maximisation (MIM) algorithm [18] is a basic algorithm. However, it does not perform well because it only considers mutual information between individual features and the class label.

To overcome this shortcoming of MIM, some algorithms employ mutual information between features and the class label to describe relevance, and mutual information between features to describe redundancy. Among them, the minimal-redundancy–maximal-relevance (mRMR) algorithm [19] is typical. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between each candidate feature and all the selected features from the mutual information between that candidate feature and the class label, and selects the feature with the maximum difference. Since the feature with the maximum difference is not necessarily the feature with minimal redundancy and maximal relevance, the objective function of mRMR has a limitation.

Aiming to solve the existing problems of mRMR, several feature selection algorithms have been proposed. Since mRMR suffers from mutual information being biased toward multivalued features, a normalization operation was introduced and the NMIFS algorithm was proposed in [15]. In the MIFS-ND algorithm presented in [20], mutual information between each candidate feature and the class label, and the average mutual information between each candidate feature and all the selected features, are processed by the optimization algorithm NSGA-II. Combining mRMR with the idea of optimization, feature selection was investigated in [21]. Combining mRMR with the ReliefF algorithm, a two-stage feature selection algorithm was proposed in [22]. Combining mRMR with particle swarm optimization, a maximum relevance minimum redundancy PSO algorithm was presented in [23]. However, none of [15, 20,21,22,23] properly handles the aforementioned limitation of the objective function of mRMR.

In view of this limitation of the objective function of mRMR, this paper first analyzes the objective function and derives a condition under which it can guarantee the effectiveness of the selected features. Then, for the case where it cannot, the interval of mutual information between each candidate feature and the class label, and that of the average mutual information between each candidate feature and all the selected features, are divided equally, and the resulting subintervals are ranked. Finally, a feature selection algorithm based on equal interval division and minimal-redundancy–maximal-relevance (EID–mRMR) is proposed.

The rest of this paper is organized as follows. Section 2 analyzes some feature selection algorithms based on mutual information. The EID–mRMR algorithm is proposed in Sect. 3. Section 4 presents and discusses experimental results. Conclusions and future work are presented in Sect. 5.

2 Related Work

In this paper, we only consider mutual information of discrete random variables. Let Y and Z be two discrete random variables, p(y) the probability density function of Y, p(z) the probability density function of Z, and p(y, z) the joint probability density function of Y and Z. Mutual information quantifies the information that two random variables share. The mutual information I(Y;Z) is defined as

$$\begin{aligned} I(Y;Z) = \sum \limits _{y \in Y} {\sum \limits _{z \in Z} {p(y,z)\log \frac{{p(y,z)}}{{p(y)p(z)}}}}. \end{aligned}$$
(1)

Higher mutual information values mean that the two random variables share more information.
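As an illustration, Eq. (1) can be estimated from two discrete sequences using empirical probabilities. The following minimal Python sketch is our own; the function name and the choice of base-2 logarithms are assumptions, not from the paper.

```python
from collections import Counter
from math import log2

def mutual_information(y, z):
    """Estimate I(Y;Z) of Eq. (1) for two discrete sequences,
    using empirical probabilities and base-2 logarithms."""
    n = len(y)
    py = Counter(y)            # marginal counts of Y
    pz = Counter(z)            # marginal counts of Z
    pyz = Counter(zip(y, z))   # joint counts of (Y, Z)
    mi = 0.0
    for (yv, zv), c in pyz.items():
        p_joint = c / n
        # p_joint / (p(y) * p(z)) simplifies to c * n / (count_y * count_z)
        mi += p_joint * log2(p_joint * n * n / (py[yv] * pz[zv]))
    return mi
```

For identical binary sequences this returns 1 bit, and for independent ones it returns 0, matching the intuition that higher values mean more shared information.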

MIM is a feature selection algorithm based on mutual information, and its objective function is expressed as

$$\begin{aligned} { MIM} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) } \right] \end{aligned}$$
(2)

where X is the candidate feature set, \({{f_i}}\) is a candidate feature and c is the class label. MIM calculates mutual information between each candidate feature and the class label, ranks the features in descending order of these values, and selects the features with the largest values. The algorithm does not yield good results because it ignores feature interactions.
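The ranking step of MIM can be sketched in a few lines. In this hypothetical sketch, `relevance` is assumed to be a precomputed mapping from feature name to \(I(c; f_i)\).

```python
def mim_select(relevance, k):
    """MIM (Eq. 2): rank candidate features by I(c; f_i) in
    descending order and keep the top k.
    `relevance` maps feature name -> I(c; f_i)."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:k]
```

For example, with relevance values {f1: 0.2, f2: 0.9, f3: 0.5} and k = 2, the sketch returns [f2, f3]; note that no interaction between f2 and f3 is taken into account, which is exactly the weakness discussed above.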

To overcome the shortcoming of MIM, some feature selection algorithms based on relevance and redundancy have been proposed [19, 24]. Their objective functions differ, but their feature selection processes are the same: first, mutual information between each feature and the class label is calculated, and the feature with the maximum value is selected; then, the feature that complies with the objective function is selected iteratively in a forward search until a specified number of features is reached. Obviously, the objective functions are the key of these algorithms, so we analyze the algorithms through their objective functions.

$$\begin{aligned} { MIFS} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \beta \sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } } \right] . \end{aligned}$$
(3)

Equation (3) is the objective function of the mutual information based feature selection (MIFS) algorithm [24], where S is the selected feature set and \({{f_s}}\) is a selected feature. MIFS uses a parameter \(\beta \) to balance the relevance term \({I\left( {c;{f_i}} \right) }\) against the mutual information between \({{f_i}}\) and all the selected features. When \(\beta \) is set to zero, MIFS reduces to MIM.

mRMR [19] replaces the parameter \(\beta \) with the reciprocal of the number of selected features, eliminating the uncertain parameter. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between \({{f_i}}\) and all the selected features from \({I\left( {c;{f_i}} \right) }\), and selects the feature with the maximum difference. The objective function of mRMR is expressed as

$$\begin{aligned} { mRMR} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } } \right] \end{aligned}$$
(4)

where \({\left| S \right| }\) is the number of selected features. However, the feature satisfying Eq. (4) is not always the feature with minimal redundancy and maximal relevance. Therefore, the objective function of mRMR has a limitation.
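The greedy forward search of mRMR described above can be sketched as follows. The dictionary-based inputs and feature names are illustrative assumptions: `relevance[f]` stands for \(I(c; f)\) and `redundancy[f][g]` for \(I(f; g)\), both assumed precomputed.

```python
def mrmr_select(candidates, relevance, redundancy, k):
    """Greedy forward mRMR per Eq. (4): start with the most relevant
    feature, then repeatedly add the feature maximizing relevance
    minus mean redundancy with the already-selected set S."""
    S = [max(candidates, key=lambda f: relevance[f])]  # max I(c; f) first
    X = [f for f in candidates if f not in S]
    while X and len(S) < k:
        def score(f):  # Eq. (4): I(c; f) - mean of I(f; s) over s in S
            return relevance[f] - sum(redundancy[f][s] for s in S) / len(S)
        best = max(X, key=score)
        S.append(best)
        X.remove(best)
    return S
```

With three features where f1 is most relevant but highly redundant with f2, the sketch picks f1 first and then prefers the less redundant f3 over the slightly more relevant f2, which is the intended trade-off of Eq. (4).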

$$\begin{aligned} { MIFS}\text {-}{} { ND} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {{C_d} - {F_d}} \right] . \end{aligned}$$
(5)

Combining mRMR with the optimization algorithm NSGA-II, the MIFS-ND algorithm [20] was proposed. MIFS-ND first selects the feature that has the maximum mutual information with the class label. Then, it calculates \({I\left( {c;{f_i}} \right) }\) and the average mutual information between each candidate feature and all the selected features, processes them with NSGA-II, and obtains the domination count \({C_d}\) and the dominated count \({F_d}\) for each feature. As defined in [20], the domination count of a candidate feature is the number of features it dominates with respect to mutual information with the class label, and the dominated count is the number of features it dominates with respect to the average mutual information with the selected features. Finally, the feature satisfying Eq. (5) is selected, and these steps are repeated until a specified number of features is selected. The ranges of \({C_d}\) and \({F_d}\) are greater than the range of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\). However, since \({C_d}\) and \({F_d}\) do not reflect the magnitude of the differences in mutual information between the class label and different candidate features, or the differences in average mutual information between the selected features and different candidate features, MIFS-ND cannot effectively handle the limitation of the objective function of mRMR.

3 The Proposed Feature Selection Algorithm

This section first derives a condition that tests whether the objective function can guarantee the performance of the selected features. Then, equal interval division combined with ranking is proposed to deal with the case where the objective function cannot. Following that, the proposed algorithm EID–mRMR is presented. Finally, an example is used to analyze the first two features selected by mRMR, MIFS-ND and EID–mRMR.

3.1 A Validation Condition

mRMR is a feature selection algorithm based on relevance and redundancy. Its flow chart is presented in Fig. 1. First, the candidate feature set X and the selected feature set S are initialized. Then, mutual information between each feature and the class label is calculated, and the feature with the maximum value is selected. The algorithm then iteratively selects the feature that complies with Eq. (4) in a forward search until a specified number of features N is selected.

Fig. 1
figure 1

Flow chart of mRMR

The objective function in Eq. (4) is the key of mRMR. mRMR uses \({I\left( {c;{f_i}} \right) }\) to describe relevance and \({I\left( {{f_s};{f_i}} \right) }\) to describe redundancy. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between all the selected features and each candidate feature from \({I\left( {c;{f_i}} \right) }\), and selects the feature with the maximum difference. The feature that truly has minimal redundancy with the selected features and maximal relevance with the class label always satisfies Eq. (4), but not vice versa. Therefore, Eq. (4) has a limitation. To analyze the feature selected by Eq. (4), we first present Eq. (6).

$$\begin{aligned} J\left( {{f_i}} \right) = I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } \end{aligned}$$
(6)

Equation (4) selects the feature at which Eq. (6) attains its maximum value. If the maximum value of Eq. (6) is far greater than the second maximum value, the feature with the maximum value can be selected with confidence; otherwise, the advantage of using Eq. (4) to select features is not obvious. To simplify the calculation, we adopt the following condition to decide whether Eq. (4) should be used: the difference between the maximum value and the second maximum value of Eq. (6) must be greater than a fixed value P. If the difference is greater than P, Eq. (4) is used to select the feature; otherwise, equal interval division and ranking are adopted, as described in detail in the next section.
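The validation condition amounts to a single comparison. A minimal sketch, assuming `scores` holds the values of Eq. (6) for all candidate features (the function name is ours; P = 0.05 follows the setting later adopted in Sect. 4.2.1):

```python
def mrmr_is_decisive(scores, P=0.05):
    """Validation condition of Sect. 3.1: Eq. (4) is trusted only
    when the largest J(f_i) value of Eq. (6) exceeds the second
    largest by more than the fixed threshold P."""
    top, second = sorted(scores, reverse=True)[:2]
    return top - second > P
```

When this returns False, the selection falls back to equal interval division and ranking.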

3.2 Equal Interval Division and Ranking

For the situation where Eq. (4) cannot guarantee the effectiveness of the selected features, and to guarantee that the feature with minimal redundancy and maximal relevance is selected, the interval of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\) are divided equally, and the resulting subintervals are ranked. The number of the dataset's features is taken as the number of subintervals. Concretely, the maximum and minimum of \({I\left( {c;{f_i}} \right) }\) and of the average mutual information between all the selected features and \({{f_i}}\) determine the interval bounds, and each interval is divided into equal subintervals. The subintervals are then numbered from 1 to the number of the dataset's features, and each value is assigned the number of the subinterval it falls in as its ordinal value.

FFMI \({\left( {{f_s};{f_i}} \right) }\) is the ordinal value obtained by applying equal interval division and ranking to \(\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } /\left| S \right| \), and CFMI \({\left( {c;{f_i}} \right) }\) is the ordinal value obtained by applying equal interval division and ranking to \({I\left( {c;{f_i}} \right) }\). The process of computing FFMI \({\left( {{f_s};{f_i}} \right) }\) is shown in Algorithm 1 and that of computing CFMI \({\left( {c;{f_i}} \right) }\) is shown in Algorithm 2.

figure a
figure b
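Algorithms 1 and 2 both amount to the same binning-and-ranking step applied to different quantities. A plausible sketch, under the assumption that values are binned into equal-width subintervals over [min, max], with the top endpoint clamped into the last bin (the paper's exact boundary handling is not fully specified):

```python
def ordinal_values(values, n_bins):
    """Equal interval division and ranking: split [min, max] of
    `values` into n_bins equal subintervals, numbered 1..n_bins,
    and replace each value by its subinterval number."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    ords = []
    for v in values:
        b = int((v - lo) / width) + 1 if width > 0 else 1
        ords.append(min(b, n_bins))  # clamp the maximum into the top bin
    return ords
```

Applying this to the \({I\left( {c;{f_i}} \right) }\) values of the candidate features yields CFMI, and applying it to the averaged \({I\left( {{f_s};{f_i}} \right) }\) values yields FFMI.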

For the situation where Eq. (4) cannot guarantee the effectiveness of the selected features, equal interval division and ranking are applied to \({I\left( {c;{f_i}} \right) }\) and \(\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) }/\left| S \right| \), yielding CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Then, Eq. (7) is presented.

$$\begin{aligned} { EID}{-}{} { mRMR} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {\mathrm{{CFMI}}\left( {c;{f_i}} \right) - \mathrm{{FFMI}}\left( {{f_s};{f_i}} \right) } \right] \end{aligned}$$
(7)

Equation (7) is the objective function of the proposed algorithm EID–mRMR. CFMI \({\left( {c;{f_i}} \right) }\) describes relevance and FFMI \({\left( {{f_s};{f_i}} \right) }\) describes redundancy. If several features satisfy Eq. (7), the one that maximizes Eq. (6) is selected from among them.

Equations (4), (5) and (7) are all designed to select the features with minimal redundancy and maximal relevance. Therefore, it is necessary to compare Eq. (7) with Eqs. (4) and (5). Before the comparison, Eqs. (8) and (9) are presented.

$$\begin{aligned} J\left( {{f_i}} \right)= & {} {C_d} - {F_d} \end{aligned}$$
(8)
$$\begin{aligned} J\left( {{f_i}} \right)= & {} \mathrm{{CFMI}}\left( {c;{f_i}} \right) - \mathrm{{FFMI}}\left( {{f_s};{f_i}} \right) \end{aligned}$$
(9)

Equation (5) selects the feature at which Eq. (8) attains its maximum value, and Eq. (7) selects the feature at which Eq. (9) attains its maximum value, so we compare Eq. (9) with Eqs. (6) and (8). In each of Eqs. (6), (8) and (9), the first term describes relevance and the second term describes redundancy. Over the candidate features, the ranges of \({C_d}\) and \({F_d}\) are determined by the number of candidate features, and the ranges of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are determined by the number of the dataset's features. In contrast, the range of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\) are related to neither quantity, and are smaller than the ranges of \({C_d}\), \({F_d}\), CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Further, since the number of the dataset's features is greater than the number of candidate features, the ranges of \({C_d}\) and \({F_d}\) are smaller than those of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Therefore, Eq. (9) has greater ranges of relevance and redundancy than Eqs. (6) and (8), and Eq. (8) has greater ranges than Eq. (6).

Further, unlike Eq. (8), Eq. (9) applies equal interval division and ranking to \({I\left( {c;{f_i}} \right) }\) and to the average mutual information between all the selected features and \({{f_i}}\), so CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) guarantee that values in the same subinterval have the same priority. Therefore, compared with Eqs. (4) and (5), Eq. (7) is more suitable for selecting the feature with minimal redundancy and maximal relevance.

3.3 Algorithmic Implementation

With the above validation condition and Eq. (7), EID–mRMR is shown in Algorithm 3.

figure c

EID–mRMR consists of two parts. In the first part (lines 1–7), the candidate feature set X and the selected feature set S are initialized; then \({I\left( {c;{f_i}} \right) }\) is calculated, and the feature with the maximum value is selected. In the second part (lines 8–34), the average mutual information between all the selected features and \({{f_i}}\) is calculated, and the difference between the maximum and the second maximum of Eq. (6) is computed. If the difference is greater than a fixed value P, the feature satisfying Eq. (4) is selected; otherwise, CFMI \({\left( {c;{f_i}} \right) }\), FFMI \({\left( {{f_s};{f_i}} \right) }\) and their difference are calculated. If two or more features attain the maximum difference between CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\), the feature that maximizes Eq. (6) is selected from among them; otherwise, the single feature with the maximum difference is selected. These steps are repeated until a specified number of features is selected.
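One iteration of the second part can be sketched as follows. This is a sketch of the description above, not the paper's exact Algorithm 3: all names are illustrative, `relevance[f]` stands for \(I(c; f)\), `redundancy[f][s]` for \(I(f; s)\), both assumed precomputed, and binning uses equal-width subintervals over [min, max] with the top endpoint clamped.

```python
def eid_mrmr_step(X, S, relevance, redundancy, n_features, P=0.05):
    """One selection step of EID-mRMR (second part of Algorithm 3).
    X: candidate features; S: already-selected features (non-empty)."""
    def J(f):  # Eq. (6): relevance minus mean redundancy with S
        return relevance[f] - sum(redundancy[f][s] for s in S) / len(S)
    feats = list(X)
    scores = sorted((J(f) for f in feats), reverse=True)
    if len(feats) == 1 or scores[0] - scores[1] > P:
        return max(feats, key=J)          # Eq. (4) is trusted
    def ordinals(vals):                   # equal interval division and ranking
        lo, hi = min(vals), max(vals)
        w = (hi - lo) / n_features or 1.0
        return [min(int((v - lo) / w) + 1, n_features) for v in vals]
    cfmi = ordinals([relevance[f] for f in feats])
    ffmi = ordinals([sum(redundancy[f][s] for s in S) / len(S) for f in feats])
    diffs = [c - r for c, r in zip(cfmi, ffmi)]
    best = max(diffs)
    tied = [f for f, d in zip(feats, diffs) if d == best]
    return max(tied, key=J)               # break ties with Eq. (6)
```

When the top two values of Eq. (6) are close, the ordinal difference of Eq. (7) takes over, and any remaining tie is broken by Eq. (6), mirroring lines 8–34 as described.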

3.4 An Example

To better illustrate the idea of EID–mRMR, an example is presented. The first selected feature is the one with the maximum mutual information with the class label, so it is selected differently from the remaining features, which must satisfy the objective function of the algorithm. Therefore, the example analyzes the first two features selected by mRMR, MIFS-ND and EID–mRMR. A dataset with 7 features is used, in which feature \({{f_7}}\) has the maximum mutual information with c. The dataset description is presented in Table 1.

Table 1 Dataset description

Since feature \({{f_7}}\) has the maximum mutual information with c, mRMR, EID–mRMR and MIFS-ND all select \({{f_7}}\) first. With \({{f_7}}\) selected and the number of selected features equal to 1, the average mutual information between all the selected features and a candidate feature \({{f_i}}\) is simply \({I\left( {{f_7};{f_i}} \right) }\). The range of \({I\left( {c;{f_i}} \right) }\) is [0.60, 0.90] and that of \({I\left( {{f_7};{f_i}} \right) }\) is [0.03, 0.09]. Since feature \({{f_2}}\) has the maximum difference between the two mutual information values, mRMR selects \({{f_2}}\).

MIFS-ND uses NSGA-II to process \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\), obtaining the domination count \({C_d}\) and the dominated count \({F_d}\), and computes the difference \({C_d} - {F_d}\) for each feature. The domination counts are shown in Table 2.

Table 2 Domination count of a feature

As shown in Table 2, \({{f_5}}\) has the maximum difference, so MIFS-ND selects \({{f_5}}\). Since the number of candidate features is 6, the ranges of \({C_d}\) and \({F_d}\) are both [0, 5], which is greater than the range of \({I\left( {c;{f_i}} \right) }\) and that of \({I\left( {{f_7};{f_i}} \right) }\). Further, although \({I\left( {c;{f_2}} \right) }\) is not far greater than \({I\left( {c;{f_1}} \right) }\), \({{f_1}}\) and \({{f_2}}\) have different \({C_d}\) values. This shows that \({C_d}\) and \({F_d}\) do not reflect the magnitude of the differences in mutual information between the class label and different candidate features, or the differences in average mutual information between all the selected features and different candidate features.

\({{f_2}}\) has the maximum difference \({I\left( {c;{f_2}} \right) } - {I\left( {{f_7};{f_2}} \right) }\), and \({{f_1}}\) has the second maximum difference \({I\left( {c;{f_1}} \right) } - {I\left( {{f_7};{f_1}} \right) }\). Since the maximum difference is not far greater than the second maximum difference, the objective function of mRMR cannot guarantee the effectiveness of the selected features, so we apply equal interval division and ranking to \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\). Since the dataset has 7 features, the interval of \({I\left( {c;{f_i}} \right) }\) is divided into [0.60, 0.65), [0.65, 0.70), [0.70, 0.75), [0.75, 0.80), [0.80, 0.85), [0.85, 0.90), [0.90, 0.95), and that of \({I\left( {{f_7};{f_i}} \right) }\) is divided into [0.03, 0.04), [0.04, 0.05), [0.05, 0.06), [0.06, 0.07), [0.07, 0.08), [0.08, 0.09), [0.09, 0.10). The ordinal values CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are then calculated and shown in Table 3.

Table 3 Ordinal value of a feature

To illustrate the computation, consider \({I\left( {c;{f_1}} \right) }\) and \({I\left( {{f_7};{f_1}} \right) }\). Since \({I\left( {c;{f_1}} \right) }\) is 0.90, it falls in the subinterval [0.90, 0.95), which is numbered 7, so the ordinal value of \({I\left( {c;{f_1}} \right) }\) is 7. Since \({I\left( {{f_7};{f_1}} \right) }\) is 0.09, it falls in the subinterval [0.09, 0.10), which is numbered 7, so the ordinal value of \({I\left( {{f_7};{f_1}} \right) }\) is 7.

As shown in Table 3, \({{f_3}}\) has the maximum difference between the ordinal values, so EID–mRMR selects \({{f_3}}\) as the second feature. Since the dataset has 7 features, the ranges of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are both [1, 7], which is greater than the ranges of \({I\left( {c;{f_i}} \right) }\), \({I\left( {{f_7};{f_i}} \right) }\), \({C_d}\) and \({F_d}\). Therefore, EID–mRMR has greater ranges of relevance and redundancy than mRMR and MIFS-ND. Further, CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) guarantee that features in the same subinterval have the same priority. Therefore, EID–mRMR is better suited to selecting the feature with minimal redundancy and maximal relevance, and selecting \({{f_3}}\) is more appropriate than selecting \({{f_2}}\) or \({{f_5}}\).

4 Experimental Results

To validate the effectiveness of EID–mRMR, it is compared with mRMR, MIFS-ND, five other incremental MI-based feature selection algorithms and four other feature selection algorithms.

4.1 The Datasets and Experimental Settings

The datasets presented in Table 4 are from the UCI machine learning repository [25] and the ASU feature selection datasets [26]. The number of selected features is 50 for all the datasets. The minimum description length discretization method [27] is adopted to transform numerical features into discrete ones, and it is used only for feature selection. Three popular classifiers, J48, IB1 and Naive Bayes, are utilized, with parameters set to WEKA's [28] default values. The ASU feature selection software package [29] is adopted.

Table 4 Summary of datasets in the experiment

To validate the effectiveness of EID–mRMR, MIFS, mRMR, MIFS-ND, NMIFS [15], MIFS-U [30], MIFS-CR [31] and CMI [32] are adopted for performance comparisons. NMIFS, MIFS-U, MIFS-CR and CMI are four feature selection algorithms based on relevance and redundancy, and their objective functions are presented as follows:

$$\begin{aligned} { NMIFS}= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {\frac{{I\left( {{f_s};{f_i}} \right) }}{{\min \left\{ {H\left( {{f_i}} \right) ,H\left( {{f_s}} \right) } \right\} }}} } \right] \end{aligned}$$
(10)
$$\begin{aligned} { MIFS}\text {-}U= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \beta \sum \limits _{{f_s} \in S} {\frac{{I(c;{f_s})}}{{H({f_s})}}I\left( {{f_s};{f_i}} \right) } } \right] \end{aligned}$$
(11)
$$\begin{aligned} { MIFS}\text {-}{} { CR}= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left\{ {I\left( {c;{f_i}} \right) - \frac{1}{2}\sum \limits _{{f_s} \in S} {\left[ {\frac{{I(c;{f_s})}}{{H({f_s})}} + \frac{{I(c;{f_i})}}{{H({f_i})}}} \right] I\left( {{f_s};{f_i}} \right) } } \right\} \end{aligned}$$
(12)
$$\begin{aligned} { CMI}= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{{H({f_i}/c)}}{{H({f_i})}}\sum \limits _{{f_s} \in S} {\frac{{I(c;{f_s})I\left( {{f_s};{f_i}} \right) }}{{H({f_s})H(c)}}} } \right] \end{aligned}$$
(13)

where \({H({f_s})}\) is the entropy of \({{f_s}}\), \({H({f_i})}\) is the entropy of \({{f_i}}\), H(c) is the entropy of c, and \({H({f_i}/c)}\) is the conditional entropy of \({{f_i}}\) given c.
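The normalization terms in Eqs. (10)–(13) are built from these entropies. A minimal sketch of the empirical entropy and of the normalized redundancy term inside Eq. (10); the function names are our own, and `nmifs_penalty` assumes the mutual information and entropies are already computed.

```python
from collections import Counter
from math import log2

def entropy(x):
    """Empirical entropy H(X) of a discrete sequence, in bits."""
    n = len(x)
    return -sum(c / n * log2(c / n) for c in Counter(x).values())

def nmifs_penalty(i_fs_fi, h_fs, h_fi):
    """Normalized redundancy term inside Eq. (10): I(f_s; f_i)
    divided by the smaller of the two feature entropies."""
    return i_fs_fi / min(h_fs, h_fi)
```

Dividing by min{H(f_i), H(f_s)} bounds the redundancy term, which is what counteracts the bias toward multivalued features mentioned in Sect. 1.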

In the experiment, the \(\beta \) value of MIFS is set to 0.5 and that of MIFS-U is set to 1.

In addition to the seven incremental MI-based algorithms, Relief-F [33], Fisher [34], QPFS [35] and SPEC-CMI [36] are compared with EID–mRMR. For QPFS and SPEC-CMI, before feature selection, all the features are normalized to the range [− 1, 1] and five equal-size bins are adopted to transform the numerical features into discrete ones.

As in [37], to reduce the influence of randomness on the final results, tenfold cross-validation is repeated ten times, and the mean value and standard deviation of the ten results are taken as the final results. To determine whether differences in the experimental results are significant, a one-sided paired t-test at the 5% significance level is carried out.

4.2 Experimental Results and Analysis

4.2.1 With Different P Values

In this section, the influence of P value on EID–mRMR is analyzed. The P value is set to 0.02, 0.03, 0.04, 0.05 and 0.06 respectively. The average performance of EID–mRMR with different P values when using J48, IB1 and Naive Bayes is presented in Table 5.

Table 5 Average performance of EID–mRMR with different P values

As shown in Table 5, the greatest Avg. value is achieved when P is 0.05 or 0.06. At the same time, the differences between the Avg. values for different P values are very small, so the P value has a relatively small impact on the performance of EID–mRMR. In the experiment, we find that feature selection takes more time as the P value increases. Therefore, in the comparisons with the incremental MI-based algorithms and the other feature selection algorithms, the P value of EID–mRMR is set to 0.05.

4.2.2 Comparison with Incremental MI-Based Algorithms

This section compares EID–mRMR with other incremental MI-based algorithms. Margins between the performance of EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes are presented in Tables 6, 7 and 8. The values in the Avg. row are the mean value and standard deviation of the twelve values above. The values in the Win/Tie/Loss (W/T/L) row are the results of the one-sided paired t-test: the first value is the number of datasets on which EID–mRMR is significantly superior to the other algorithm, the second is the number of datasets on which they perform equally, and the third is the number of datasets on which EID–mRMR is significantly inferior. Classification accuracy of the optimal features selected by EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 9, 10 and 11. The running time of EID–mRMR and the other algorithms when selecting 50 features is shown in Table 12. Average performance comparisons of the algorithms with the three classifiers are shown in Fig. 2.

Table 6 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using J48

As shown in Table 6, on Movement_libras, Mfeat_pix, Semeion, COIL20 and orlraws10P, EID–mRMR is significantly better than mRMR. From the values in the Avg. and W/T/L rows, EID–mRMR performs better than mRMR when using J48. Compared with the other algorithms, MIFS-U, MIFS-ND and EID–mRMR achieve better results.

Table 7 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using IB1

In Table 7, the Avg. values indicate that MIFS-U, MIFS-CR and EID–mRMR yield better results, while the W/T/L values suggest that mRMR, NMIFS, MIFS-U and EID–mRMR obtain better feature selection performance. Unlike in Table 6, the margins between EID–mRMR and the other algorithms increase, except for MIFS-U.

Table 8 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using Naive Bayes

In Table 8, in terms of the Avg. values, MIFS-U, MIFS-CR and EID–mRMR achieve better results than the other algorithms. In terms of the W/T/L values, MIFS-U, CMI and EID–mRMR perform better. Moreover, the advantage of EID–mRMR over the other algorithms increases in both the Avg. and W/T/L values.

Table 9 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using J48
Table 10 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using IB1
Table 11 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using Naive Bayes

As shown in Tables 9, 10 and 11, EID–mRMR achieves better results on the majority of datasets. Taking Table 11 as an example, EID–mRMR performs better than the other seven algorithms on 7 datasets, while no other algorithm is superior to EID–mRMR on more than 2 datasets. Furthermore, EID–mRMR obtains the greatest Avg. values with all three classifiers.

Table 12 Running time (s) of EID–mRMR and other algorithms

In Table 12, the running time of EID–mRMR is comparable to that of the other seven algorithms on the majority of datasets, except for gisette and orlraws10p. Since the features of gisette and orlraws10p are high dimensional and must be processed by equal interval division and ranking more times, EID–mRMR is more time-consuming on these two datasets and thus obtains a greater Avg. value than the other algorithms.

Fig. 2 Average performance comparisons of algorithms with the three classifiers

As shown in Fig. 2, in some datasets, such as Spambase and Mfeat_fou, the selected features account for a large proportion of the feature set; after the classification accuracy of the features selected by most algorithms reaches its maximum, it decreases to some extent. In most datasets, however, the selected features occupy only a limited proportion of the feature set, so the average performance of the features selected by most algorithms increases significantly at first, then increases slowly and levels off. Since the average value over the 50 selected features is taken as the final result, the proportion of the feature set that the selected features account for has little impact on this result; selecting 50 features is therefore suitable. In the average performance comparisons, EID–mRMR achieves better effectiveness on the majority of datasets, whereas the other algorithms perform poorly on some: mRMR does not handle Synthetic_control, Movement_libras, Mfeat_pix, Semeion, ORL and gisette well, and MIFS-ND does not yield good results on Spambase, Mfeat_fou and gisette. Overall, compared with the other seven feature selection algorithms, EID–mRMR achieves better results.

4.2.3 Comparison with Other Feature Selection Algorithms

This section compares EID–mRMR with Relief-F, Fisher, QPFS and SPEC-CMI. The margins between the performance of EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes are presented in Tables 13, 14 and 15. The classification accuracy of the optimal features selected by EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 16, 17 and 18.

Table 13 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using J48
Table 14 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using IB1
Table 15 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using Naive Bayes
Table 16 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using J48
Table 17 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using IB1
Table 18 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using Naive Bayes

As shown in Table 13, EID–mRMR and SPEC-CMI are superior to Relief-F, Fisher and QPFS in Spambase and Mfeat_fou, while EID–mRMR and QPFS are superior to Relief-F, Fisher and SPEC-CMI in Mfeat_fou and Movement_libras. In most datasets, EID–mRMR performs better than Relief-F, Fisher, QPFS and SPEC-CMI. The values in the Avg. and W/T/L rows show that EID–mRMR achieves better results than the other four algorithms.

In Table 14, SPEC-CMI performs better than EID–mRMR in Spambase and Mfeat_fou, but in the majority of datasets EID–mRMR yields better results than the other four algorithms. The values in the Avg. and W/T/L rows suggest that EID–mRMR obtains better feature selection effectiveness. Unlike in Table 13, EID–mRMR's advantage over the other algorithms increases in both the Avg. and W/T/L values.

As shown in Table 15, Fisher and SPEC-CMI achieve better results than EID–mRMR in Spambase, but in most datasets EID–mRMR achieves better feature selection performance than the other four algorithms. Compared with Tables 13 and 14, the margins between EID–mRMR and the other four algorithms increase in both the Avg. and W/T/L values.

In Tables 16, 17 and 18, EID–mRMR performs better than the other four algorithms on the majority of datasets. Take Table 16 as an example: EID–mRMR obtains better results than the other four algorithms on 7 datasets, while it is inferior to any one of the other algorithms on no more than 5. Furthermore, EID–mRMR achieves the greatest Avg. values with all three classifiers.

5 Conclusions and Future Work

This paper first analyzes the objective function of mRMR. Then, for the case where the objective function cannot guarantee the effectiveness of the selected features, the idea of equal interval division and ranking is proposed to process both the mutual information between features and the class label and the average mutual information between features; on this basis, EID–mRMR is proposed. To verify its performance, we apply EID–mRMR with three classifiers to eight UCI datasets and four ASU datasets, and compare the results with those of seven incremental MI-based algorithms and four other feature selection algorithms. The experimental results validate that EID–mRMR achieves better feature selection effectiveness on the majority of datasets.
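The mRMR objective that this analysis starts from can be sketched as a greedy selection loop. The following is a minimal illustration assuming precomputed mutual-information values; the function name and inputs are ours, and the equal interval division and ranking step that distinguishes EID–mRMR is deliberately omitted:

```python
import numpy as np

def mrmr_select(mi_fc, mi_ff, k):
    """Greedy mRMR: pick k features maximizing relevance minus mean redundancy.
    mi_fc[i]    : I(f_i; class)  -- relevance of feature i
    mi_ff[i][j] : I(f_i; f_j)    -- pairwise mutual information between features
    """
    n = len(mi_fc)
    selected = [int(np.argmax(mi_fc))]       # first feature: pure relevance (MIM)
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # average MI between candidate i and the already-selected features
            redundancy = np.mean([mi_ff[i][j] for j in selected])
            score = mi_fc[i] - redundancy    # the mRMR "difference" objective
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

As the paper argues, the feature maximizing this difference need not simultaneously have minimal redundancy and maximal relevance, which is the limitation EID–mRMR targets.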

Considering that EID–mRMR currently adopts only pairwise mutual information and does not utilize three-way interaction information or higher-dimensional mutual information [38,39,40,41,42,43,44,45,46], it may lose some useful information and thus degrade performance. In future work, we will investigate how to combine three-way interaction information with the idea of equal interval division and ranking.
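As a concrete illustration of the three-way interaction information mentioned above, the following minimal sketch computes it for discrete variables from entropy estimates (base-2 logarithms; we use the sign convention I(X;Y;Z) = I(X;Y|Z) − I(X;Y), and the function names are ours):

```python
import numpy as np
from collections import Counter

def entropy(*vars_):
    """Joint Shannon entropy (bits) of one or more discrete sequences."""
    joint = list(zip(*vars_))
    counts = Counter(joint)
    probs = np.array([c / len(joint) for c in counts.values()])
    return float(-np.sum(probs * np.log2(probs)))

def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def interaction_info(x, y, z):
    """Three-way interaction information I(X;Y;Z) = I(X;Y|Z) - I(X;Y)."""
    return cond_mutual_info(x, y, z) - mutual_info(x, y)
```

For the classic XOR example (z = x XOR y with x, y independent), the pairwise term I(X;Y) is zero while the interaction term is one bit: exactly the kind of joint dependency that pairwise mutual information, and hence the current EID–mRMR, cannot capture.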