1 Introduction

With the explosive growth of information, the dimension of feature sets increases, which can cause the curse of dimensionality. Therefore, it is necessary to reduce the dimension of the feature set [1,2,3]. Dimensionality reduction methods involve feature extraction and feature selection [4]. Feature extraction transforms the original features into a new space and takes the transformed features as the final features, while feature selection selects a subset of the original features. Compared with feature extraction, feature selection has the advantage of preserving the interpretability of the data [5]. Therefore, feature selection has a wide range of applications, such as text processing [6, 7], underwater object recognition and classification [8, 9], network anomaly detection [10], information retrieval [11], image classification [12, 13] and microarray data classification [14].

The metrics adopted in feature selection include distance, mutual information and consistency. Compared with other metrics, mutual information can measure the relationship between variables and is invariant under space transformations [15]. Hence, many feature selection algorithms based on mutual information have been proposed, such as [16, 17]. Among these, the mutual information maximisation (MIM) algorithm [18] is a basic algorithm. However, it does not perform well because it only considers mutual information between individual features and the class label.

To overcome this shortcoming of MIM, some algorithms employ mutual information between features and the class label to describe relevance, and mutual information between features to describe redundancy. Among them, the minimal-redundancy–maximal-relevance (mRMR) algorithm [19] is typical. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between each candidate feature and all the selected features from the mutual information between that candidate feature and the class label, and selects the feature with the maximum difference. Since the feature with the maximum difference is not necessarily the feature with minimal redundancy and maximal relevance, the objective function of mRMR has a limitation.

Aiming to solve the existing problems of mRMR, several feature selection algorithms have been proposed. Since mRMR suffers from mutual information being biased toward multivalued features, a normalization operation was introduced and the NMIFS algorithm was proposed in [15]. In the MIFS-ND algorithm presented in [20], mutual information between each candidate feature and the class label, and the average mutual information between each candidate feature and all the selected features, are processed by the optimization algorithm NSGA-II. Combining mRMR with the idea of optimization, feature selection was investigated in [21]. Combining mRMR with the ReliefF algorithm, a two-stage feature selection algorithm was proposed in [22]. Combining mRMR with particle swarm optimization, a maximum relevance minimum redundancy PSO algorithm was presented in [23]. However, none of [15, 20,21,22,23] properly handles the aforementioned limitation of the objective function of mRMR.

In view of this limitation of the objective function of mRMR, this paper first analyzes the objective function and derives a condition under which it can guarantee the effectiveness of the selected features. Then, for the case where it cannot, the interval of mutual information between each candidate feature and the class label, and that of the average mutual information between each candidate feature and all the selected features, are divided equally, and the resulting subintervals are ranked. Finally, a feature selection algorithm based on equal interval division and minimal-redundancy–maximal-relevance (EID–mRMR) is proposed.

The rest of this paper is organized as follows. Section 2 analyzes some feature selection algorithms based on mutual information. The EID–mRMR algorithm is proposed in Sect. 3. Section 4 presents and discusses experimental results. Conclusions and future work are presented in Sect. 5.

2 Related Work

In this paper, we only consider mutual information of discrete random variables. Let Y and Z be two discrete random variables, p(y) the probability density function of Y, p(z) the probability density function of Z, and p(y, z) the joint probability density function of Y and Z. Mutual information quantifies the information that two random variables share. The mutual information I(Y;Z) is defined as

$$\begin{aligned} I(Y;Z) = \sum \limits _{y \in Y} {\sum \limits _{z \in Z} {p(y,z)\log \frac{{p(y,z)}}{{p(y)p(z)}}}}. \end{aligned}$$
(1)

Higher mutual information values mean that the two random variables share more information.
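As an illustration, Eq. (1) can be estimated from two discrete sequences using empirical probabilities. The following minimal Python sketch is our own; the function name and the choice of base-2 logarithms are assumptions, not from the paper.

```python
from collections import Counter
from math import log2

def mutual_information(y, z):
    """Estimate I(Y;Z) of Eq. (1) for two discrete sequences,
    using empirical probabilities and base-2 logarithms."""
    n = len(y)
    py = Counter(y)            # marginal counts of Y
    pz = Counter(z)            # marginal counts of Z
    pyz = Counter(zip(y, z))   # joint counts of (Y, Z)
    mi = 0.0
    for (yv, zv), c in pyz.items():
        p_joint = c / n
        # p_joint / (p(y) * p(z)) simplifies to c * n / (count_y * count_z)
        mi += p_joint * log2(p_joint * n * n / (py[yv] * pz[zv]))
    return mi
```

For identical binary sequences this returns 1 bit, and for independent ones it returns 0, matching the intuition that higher values mean more shared information.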

MIM is a feature selection algorithm based on mutual information, and its objective function is expressed as

$$\begin{aligned} { MIM} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) } \right] \end{aligned}$$
(2)

where X is the candidate feature set, \({{f_i}}\) is a candidate feature and c is the class label. MIM calculates mutual information between each candidate feature and the class label, ranks the features in descending order of these values, and selects the features with the largest values. The algorithm does not yield good results because it ignores feature interactions.
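The ranking step of MIM can be sketched in a few lines. In this hypothetical sketch, `relevance` is assumed to be a precomputed mapping from feature name to \(I(c; f_i)\).

```python
def mim_select(relevance, k):
    """MIM (Eq. 2): rank candidate features by I(c; f_i) in
    descending order and keep the top k.
    `relevance` maps feature name -> I(c; f_i)."""
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    return ranked[:k]
```

For example, with relevance values {f1: 0.2, f2: 0.9, f3: 0.5} and k = 2, the sketch returns [f2, f3]; note that no interaction between f2 and f3 is taken into account, which is exactly the weakness discussed above.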

To overcome the shortcoming of MIM, some feature selection algorithms based on relevance and redundancy have been proposed [19, 24]. Their objective functions differ, but their feature selection processes are the same: first, mutual information between each feature and the class label is calculated, and the feature with the maximum value is selected; then, the feature that complies with the objective function is selected iteratively in a forward search until a specified number of features is reached. Obviously, the objective functions are the key of these algorithms, so we analyze the algorithms through their objective functions.

$$\begin{aligned} { MIFS} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \beta \sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } } \right] . \end{aligned}$$
(3)

Equation (3) is the objective function of the mutual information based feature selection (MIFS) algorithm [24], where S is the selected feature set and \({{f_s}}\) is a selected feature. MIFS uses a parameter \(\beta \) to balance the relevance term \({I\left( {c;{f_i}} \right) }\) against the mutual information between \({{f_i}}\) and all the selected features. When \(\beta \) is set to zero, MIFS reduces to MIM.

mRMR [19] replaces the parameter \(\beta \) with the reciprocal of the number of selected features, eliminating the uncertain parameter. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between \({{f_i}}\) and all the selected features from \({I\left( {c;{f_i}} \right) }\), and selects the feature with the maximum difference. The objective function of mRMR is expressed as

$$\begin{aligned} { mRMR} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } } \right] \end{aligned}$$
(4)

where \({\left| S \right| }\) is the number of selected features. However, the feature satisfying Eq. (4) is not always the feature with minimal redundancy and maximal relevance. Therefore, the objective function of mRMR has a limitation.
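The greedy forward search of mRMR described above can be sketched as follows. The dictionary-based inputs and feature names are illustrative assumptions: `relevance[f]` stands for \(I(c; f)\) and `redundancy[f][g]` for \(I(f; g)\), both assumed precomputed.

```python
def mrmr_select(candidates, relevance, redundancy, k):
    """Greedy forward mRMR per Eq. (4): start with the most relevant
    feature, then repeatedly add the feature maximizing relevance
    minus mean redundancy with the already-selected set S."""
    S = [max(candidates, key=lambda f: relevance[f])]  # max I(c; f) first
    X = [f for f in candidates if f not in S]
    while X and len(S) < k:
        def score(f):  # Eq. (4): I(c; f) - mean of I(f; s) over s in S
            return relevance[f] - sum(redundancy[f][s] for s in S) / len(S)
        best = max(X, key=score)
        S.append(best)
        X.remove(best)
    return S
```

With three features where f1 is most relevant but highly redundant with f2, the sketch picks f1 first and then prefers the less redundant f3 over the slightly more relevant f2, which is the intended trade-off of Eq. (4).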

$$\begin{aligned} { MIFS}\text {-}{} { ND} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {{C_d} - {F_d}} \right] . \end{aligned}$$
(5)

Combining mRMR with the optimization algorithm NSGA-II, the MIFS-ND algorithm [20] was proposed. MIFS-ND first selects the feature that has the maximum mutual information with the class label. Then, it calculates \({I\left( {c;{f_i}} \right) }\) and the average mutual information between each candidate feature and all the selected features, processes them with NSGA-II, and obtains the domination count \({C_d}\) and the dominated count \({F_d}\) for each feature. As defined in [20], the domination count of a candidate feature is the number of features it dominates with respect to mutual information with the class label, and the dominated count is the number of features it dominates with respect to the average mutual information with the selected features. Finally, the feature satisfying Eq. (5) is selected, and these steps are repeated until a specified number of features is selected. The ranges of \({C_d}\) and \({F_d}\) are greater than the range of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\). However, since \({C_d}\) and \({F_d}\) do not reflect the magnitude of the differences in mutual information between the class label and different candidate features, or the differences in average mutual information between the selected features and different candidate features, MIFS-ND cannot effectively handle the limitation of the objective function of mRMR.

3 The Proposed Feature Selection Algorithm

This section first derives a condition that tests whether the objective function can guarantee the performance of the selected features. Then, equal interval division combined with ranking is proposed to deal with the case where the objective function cannot. Following that, the proposed algorithm EID–mRMR is presented. Finally, an example is used to analyze the first two features selected by mRMR, MIFS-ND and EID–mRMR.

3.1 A Validation Condition

mRMR is a feature selection algorithm based on relevance and redundancy. Its flow chart is presented in Fig. 1. First, the candidate feature set X and the selected feature set S are initialized. Then, mutual information between each feature and the class label is calculated, and the feature with the maximum value is selected. The algorithm then iteratively selects the feature that complies with Eq. (4) in a forward search until a specified number of features N is selected.

Fig. 1
figure 1

Flow chart of mRMR

The objective function in Eq. (4) is the key of mRMR. mRMR uses \({I\left( {c;{f_i}} \right) }\) to describe relevance and \({I\left( {{f_s};{f_i}} \right) }\) to describe redundancy. To select the feature that has minimal redundancy with the selected features and maximal relevance with the class label, mRMR subtracts the average mutual information between all the selected features and each candidate feature from \({I\left( {c;{f_i}} \right) }\), and selects the feature with the maximum difference. The feature that truly has minimal redundancy with the selected features and maximal relevance with the class label always satisfies Eq. (4), but not vice versa. Therefore, Eq. (4) has a limitation. To analyze the feature selected by Eq. (4), we first present Eq. (6).

$$\begin{aligned} J\left( {{f_i}} \right) = I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } \end{aligned}$$
(6)

Equation (4) selects the feature at which Eq. (6) attains its maximum value. If the maximum value of Eq. (6) is far greater than the second maximum value, the feature with the maximum value can be selected with confidence; otherwise, the advantage of using Eq. (4) to select features is not obvious. To simplify the calculation, we adopt the following condition to decide whether Eq. (4) should be used: the difference between the maximum value and the second maximum value of Eq. (6) must be greater than a fixed value P. If the difference is greater than P, Eq. (4) is used to select the feature; otherwise, equal interval division and ranking are adopted, as described in detail in the next section.
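The validation condition amounts to a single comparison. A minimal sketch, assuming `scores` holds the values of Eq. (6) for all candidate features (the function name is ours; P = 0.05 follows the setting later adopted in Sect. 4.2.1):

```python
def mrmr_is_decisive(scores, P=0.05):
    """Validation condition of Sect. 3.1: Eq. (4) is trusted only
    when the largest J(f_i) value of Eq. (6) exceeds the second
    largest by more than the fixed threshold P."""
    top, second = sorted(scores, reverse=True)[:2]
    return top - second > P
```

When this returns False, the selection falls back to equal interval division and ranking.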

3.2 Equal Interval Division and Ranking

For the situation where Eq. (4) cannot guarantee the effectiveness of the selected features, and to guarantee that the feature with minimal redundancy and maximal relevance is selected, the interval of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\) are divided equally, and the resulting subintervals are ranked. The number of the dataset's features is taken as the number of subintervals. Concretely, the maximum and minimum of \({I\left( {c;{f_i}} \right) }\) and of the average mutual information between all the selected features and \({{f_i}}\) determine the interval bounds, and each interval is divided into equal subintervals. The subintervals are then numbered from 1 to the number of the dataset's features, and each value is assigned the number of the subinterval it falls in as its ordinal value.

FFMI \({\left( {{f_s};{f_i}} \right) }\) is the ordinal value obtained by applying equal interval division and ranking to \(\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } /\left| S \right| \), and CFMI \({\left( {c;{f_i}} \right) }\) is the ordinal value obtained by applying equal interval division and ranking to \({I\left( {c;{f_i}} \right) }\). The process of computing FFMI \({\left( {{f_s};{f_i}} \right) }\) is shown in Algorithm 1 and that of computing CFMI \({\left( {c;{f_i}} \right) }\) is shown in Algorithm 2.

figure a
figure b
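Algorithms 1 and 2 both amount to the same binning-and-ranking step applied to different quantities. A plausible sketch, under the assumption that values are binned into equal-width subintervals over [min, max], with the top endpoint clamped into the last bin (the paper's exact boundary handling is not fully specified):

```python
def ordinal_values(values, n_bins):
    """Equal interval division and ranking: split [min, max] of
    `values` into n_bins equal subintervals, numbered 1..n_bins,
    and replace each value by its subinterval number."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    ords = []
    for v in values:
        b = int((v - lo) / width) + 1 if width > 0 else 1
        ords.append(min(b, n_bins))  # clamp the maximum into the top bin
    return ords
```

Applying this to the \({I\left( {c;{f_i}} \right) }\) values of the candidate features yields CFMI, and applying it to the averaged \({I\left( {{f_s};{f_i}} \right) }\) values yields FFMI.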

For the situation where Eq. (4) cannot guarantee the effectiveness of the selected features, equal interval division and ranking are applied to \({I\left( {c;{f_i}} \right) }\) and \(\sum \nolimits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) }/\left| S \right| \), yielding CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Then, Eq. (7) is presented.

$$\begin{aligned} { EID}{-}{} { mRMR} = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {\mathrm{{CFMI}}\left( {c;{f_i}} \right) - \mathrm{{FFMI}}\left( {{f_s};{f_i}} \right) } \right] \end{aligned}$$
(7)

Equation (7) is the objective function of the proposed algorithm EID–mRMR. CFMI \({\left( {c;{f_i}} \right) }\) describes relevance and FFMI \({\left( {{f_s};{f_i}} \right) }\) describes redundancy. If several features satisfy Eq. (7), the one that maximizes Eq. (6) is selected from among them.

Equations (4), (5) and (7) are all designed to select the features with minimal redundancy and maximal relevance. Therefore, it is necessary to compare Eq. (7) with Eqs. (4) and (5). Before the comparison, Eqs. (8) and (9) are presented.

$$\begin{aligned} J\left( {{f_i}} \right)= & {} {C_d} - {F_d} \end{aligned}$$
(8)
$$\begin{aligned} J\left( {{f_i}} \right)= & {} \mathrm{{CFMI}}\left( {c;{f_i}} \right) - \mathrm{{FFMI}}\left( {{f_s};{f_i}} \right) \end{aligned}$$
(9)

Equation (5) selects the feature at which Eq. (8) attains its maximum value, and Eq. (7) selects the feature at which Eq. (9) attains its maximum value, so we compare Eq. (9) with Eqs. (6) and (8). In each of Eqs. (6), (8) and (9), the first term describes relevance and the second term describes redundancy. Over the candidate features, the ranges of \({C_d}\) and \({F_d}\) are determined by the number of candidate features, and the ranges of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are determined by the number of the dataset's features. In contrast, the range of \({I\left( {c;{f_i}} \right) }\) and that of the average mutual information between all the selected features and \({{f_i}}\) are related to neither quantity, and are smaller than the ranges of \({C_d}\), \({F_d}\), CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Further, since the number of the dataset's features is greater than the number of candidate features, the ranges of \({C_d}\) and \({F_d}\) are smaller than those of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\). Therefore, Eq. (9) has greater ranges of relevance and redundancy than Eqs. (6) and (8), and Eq. (8) has greater ranges than Eq. (6).

Further, unlike Eq. (8), Eq. (9) applies equal interval division and ranking to \({I\left( {c;{f_i}} \right) }\) and to the average mutual information between all the selected features and \({{f_i}}\), so CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) guarantee that values in the same subinterval have the same priority. Therefore, compared with Eqs. (4) and (5), Eq. (7) is more suitable for selecting the feature with minimal redundancy and maximal relevance.

3.3 Algorithmic Implementation

With the above validation condition and Eq. (7), EID–mRMR is shown in Algorithm 3.

figure c

EID–mRMR consists of two parts. In the first part (lines 1–7), the candidate feature set X and the selected feature set S are initialized; then \({I\left( {c;{f_i}} \right) }\) is calculated, and the feature with the maximum value is selected. In the second part (lines 8–34), the average mutual information between all the selected features and \({{f_i}}\) is calculated, and the difference between the maximum and the second maximum of Eq. (6) is computed. If the difference is greater than a fixed value P, the feature satisfying Eq. (4) is selected; otherwise, CFMI \({\left( {c;{f_i}} \right) }\), FFMI \({\left( {{f_s};{f_i}} \right) }\) and their difference are calculated. If two or more features attain the maximum difference between CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\), the feature that maximizes Eq. (6) is selected from among them; otherwise, the single feature with the maximum difference is selected. These steps are repeated until a specified number of features is selected.
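One iteration of the second part can be sketched as follows. This is a sketch of the description above, not the paper's exact Algorithm 3: all names are illustrative, `relevance[f]` stands for \(I(c; f)\), `redundancy[f][s]` for \(I(f; s)\), both assumed precomputed, and binning uses equal-width subintervals over [min, max] with the top endpoint clamped.

```python
def eid_mrmr_step(X, S, relevance, redundancy, n_features, P=0.05):
    """One selection step of EID-mRMR (second part of Algorithm 3).
    X: candidate features; S: already-selected features (non-empty)."""
    def J(f):  # Eq. (6): relevance minus mean redundancy with S
        return relevance[f] - sum(redundancy[f][s] for s in S) / len(S)
    feats = list(X)
    scores = sorted((J(f) for f in feats), reverse=True)
    if len(feats) == 1 or scores[0] - scores[1] > P:
        return max(feats, key=J)          # Eq. (4) is trusted
    def ordinals(vals):                   # equal interval division and ranking
        lo, hi = min(vals), max(vals)
        w = (hi - lo) / n_features or 1.0
        return [min(int((v - lo) / w) + 1, n_features) for v in vals]
    cfmi = ordinals([relevance[f] for f in feats])
    ffmi = ordinals([sum(redundancy[f][s] for s in S) / len(S) for f in feats])
    diffs = [c - r for c, r in zip(cfmi, ffmi)]
    best = max(diffs)
    tied = [f for f, d in zip(feats, diffs) if d == best]
    return max(tied, key=J)               # break ties with Eq. (6)
```

When the top two values of Eq. (6) are close, the ordinal difference of Eq. (7) takes over, and any remaining tie is broken by Eq. (6), mirroring lines 8–34 as described.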

3.4 An Example

To better illustrate the idea of EID–mRMR, an example is presented. The first selected feature is the one with the maximum mutual information with the class label, so it is selected differently from the remaining features, which must satisfy the objective function of the algorithm. Therefore, the example analyzes the first two features selected by mRMR, MIFS-ND and EID–mRMR. A dataset with 7 features is used, in which feature \({{f_7}}\) has the maximum mutual information with c. The dataset description is presented in Table 1.

Table 1 Dataset description

Since feature \({{f_7}}\) has the maximum mutual information with c, mRMR, EID–mRMR and MIFS-ND all select \({{f_7}}\) first. With \({{f_7}}\) selected and the number of selected features equal to 1, the average mutual information between all the selected features and a candidate feature \({{f_i}}\) is simply \({I\left( {{f_7};{f_i}} \right) }\). The range of \({I\left( {c;{f_i}} \right) }\) is [0.60, 0.90] and that of \({I\left( {{f_7};{f_i}} \right) }\) is [0.03, 0.09]. Since feature \({{f_2}}\) has the maximum difference between the two mutual information values, mRMR selects \({{f_2}}\).

MIFS-ND uses NSGA-II to process \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\), obtaining the domination count \({C_d}\) and the dominated count \({F_d}\), and computes the difference \({C_d} - {F_d}\) for each feature. The domination counts are shown in Table 2.

Table 2 Domination count of a feature

As shown in Table 2, \({{f_5}}\) has the maximum difference, so MIFS-ND selects \({{f_5}}\). Since the number of candidate features is 6, the ranges of \({C_d}\) and \({F_d}\) are both [0, 5], which is greater than the range of \({I\left( {c;{f_i}} \right) }\) and that of \({I\left( {{f_7};{f_i}} \right) }\). Further, although \({I\left( {c;{f_2}} \right) }\) is not far greater than \({I\left( {c;{f_1}} \right) }\), \({{f_1}}\) and \({{f_2}}\) have different \({C_d}\) values. This shows that \({C_d}\) and \({F_d}\) do not reflect the magnitude of the differences in mutual information between the class label and different candidate features, or the differences in average mutual information between all the selected features and different candidate features.

\({{f_2}}\) has the maximum difference \({I\left( {c;{f_2}} \right) } - {I\left( {{f_7};{f_2}} \right) }\), and \({{f_1}}\) has the second maximum difference \({I\left( {c;{f_1}} \right) } - {I\left( {{f_7};{f_1}} \right) }\). Since the maximum difference is not far greater than the second maximum difference, the objective function of mRMR cannot guarantee the effectiveness of the selected features, so we apply equal interval division and ranking to \({I\left( {c;{f_i}} \right) }\) and \({I\left( {{f_7};{f_i}} \right) }\). Since the dataset has 7 features, the interval of \({I\left( {c;{f_i}} \right) }\) is divided into [0.60, 0.65), [0.65, 0.70), [0.70, 0.75), [0.75, 0.80), [0.80, 0.85), [0.85, 0.90), [0.90, 0.95), and that of \({I\left( {{f_7};{f_i}} \right) }\) is divided into [0.03, 0.04), [0.04, 0.05), [0.05, 0.06), [0.06, 0.07), [0.07, 0.08), [0.08, 0.09), [0.09, 0.10). The ordinal values CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are then calculated and shown in Table 3.

Table 3 Ordinal value of a feature

To illustrate the computation, consider \({I\left( {c;{f_1}} \right) }\) and \({I\left( {{f_7};{f_1}} \right) }\). Since \({I\left( {c;{f_1}} \right) }\) is 0.90, it falls in the subinterval [0.90, 0.95), which is numbered 7, so the ordinal value of \({I\left( {c;{f_1}} \right) }\) is 7. Since \({I\left( {{f_7};{f_1}} \right) }\) is 0.09, it falls in the subinterval [0.09, 0.10), which is numbered 7, so the ordinal value of \({I\left( {{f_7};{f_1}} \right) }\) is 7.

As shown in Table 3, \({{f_3}}\) has the maximum difference between the ordinal values, so EID–mRMR selects \({{f_3}}\) as the second feature. Since the dataset has 7 features, the ranges of CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) are both [1, 7], which is greater than the ranges of \({I\left( {c;{f_i}} \right) }\), \({I\left( {{f_7};{f_i}} \right) }\), \({C_d}\) and \({F_d}\). Therefore, EID–mRMR has greater ranges of relevance and redundancy than mRMR and MIFS-ND. Further, CFMI \({\left( {c;{f_i}} \right) }\) and FFMI \({\left( {{f_s};{f_i}} \right) }\) guarantee that features in the same subinterval have the same priority. Therefore, EID–mRMR is better suited to selecting the feature with minimal redundancy and maximal relevance, and selecting \({{f_3}}\) is more appropriate than selecting \({{f_2}}\) or \({{f_5}}\).

4 Experimental Results

To validate the effectiveness of EID–mRMR, it is compared with mRMR, MIFS-ND, five other incremental MI-based feature selection algorithms and four other feature selection algorithms.

4.1 The Datasets and Experimental Settings

The datasets presented in Table 4 are from the UCI machine learning repository [25] and the ASU feature selection datasets [26]. The number of selected features is 50 for all the datasets. The minimum description length discretization method [27] is adopted to transform numerical features into discrete ones, and it is used only for feature selection. Three popular classifiers, J48, IB1 and Naive Bayes, are utilized, with parameters set to WEKA's [28] default values. The ASU feature selection software package [29] is adopted.

Table 4 Summary of datasets in the experiment

To validate the effectiveness of EID–mRMR, MIFS, mRMR, MIFS-ND, NMIFS [15], MIFS-U [30], MIFS-CR [31] and CMI [32] are adopted for performance comparisons. NMIFS, MIFS-U, MIFS-CR and CMI are four feature selection algorithms based on relevance and redundancy, and their objective functions are presented as follows:

$$\begin{aligned} { NMIFS}= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {\frac{{I\left( {{f_s};{f_i}} \right) }}{{\min \left\{ {H\left( {{f_i}} \right) ,H\left( {{f_s}} \right) } \right\} }}} } \right] \end{aligned}$$
(10)
$$\begin{aligned} { MIFS}\text {-}U= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \beta \sum \limits _{{f_s} \in S} {\frac{{I(c;{f_s})}}{{H({f_s})}}I\left( {{f_s};{f_i}} \right) } } \right] \end{aligned}$$
(11)
$$\begin{aligned} { MIFS}\text {-}{} { CR}= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left\{ {I\left( {c;{f_i}} \right) - \frac{1}{2}\sum \limits _{{f_s} \in S} {\left[ {\frac{{I(c;{f_s})}}{{H({f_s})}} + \frac{{I(c;{f_i})}}{{H({f_i})}}} \right] I\left( {{f_s};{f_i}} \right) } } \right\} \end{aligned}$$
(12)
$$\begin{aligned} { CMI}= & {} \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c;{f_i}} \right) - \frac{{H({f_i}/c)}}{{H({f_i})}}\sum \limits _{{f_s} \in S} {\frac{{I(c;{f_s})I\left( {{f_s};{f_i}} \right) }}{{H({f_s})H(c)}}} } \right] \end{aligned}$$
(13)

where \({H({f_s})}\) is the entropy of \({{f_s}}\), \({H({f_i})}\) is the entropy of \({{f_i}}\), H(c) is the entropy of c, and \({H({f_i}/c)}\) is the conditional entropy of \({{f_i}}\) given c.
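The normalization terms in Eqs. (10)–(13) are built from these entropies. A minimal sketch of the empirical entropy and of the normalized redundancy term inside Eq. (10); the function names are our own, and `nmifs_penalty` assumes the mutual information and entropies are already computed.

```python
from collections import Counter
from math import log2

def entropy(x):
    """Empirical entropy H(X) of a discrete sequence, in bits."""
    n = len(x)
    return -sum(c / n * log2(c / n) for c in Counter(x).values())

def nmifs_penalty(i_fs_fi, h_fs, h_fi):
    """Normalized redundancy term inside Eq. (10): I(f_s; f_i)
    divided by the smaller of the two feature entropies."""
    return i_fs_fi / min(h_fs, h_fi)
```

Dividing by min{H(f_i), H(f_s)} bounds the redundancy term, which is what counteracts the bias toward multivalued features mentioned in Sect. 1.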

In the experiment, the \(\beta \) value of MIFS is set to 0.5 and that of MIFS-U is set to 1.

In addition to the seven incremental MI-based algorithms, Relief-F [33], Fisher [34], QPFS [35] and SPEC-CMI [36] are compared with EID–mRMR. For QPFS and SPEC-CMI, before feature selection, all the features are normalized to the range [− 1, 1] and five equal-size bins are adopted to transform the numerical features into discrete ones.

As in [37], to reduce the influence of randomness on the final results, tenfold cross-validation is repeated ten times, and the mean value and standard deviation of the ten results are taken as the final results. To determine whether differences in the experimental results are significant, a one-sided paired t-test at the 5% significance level is carried out.

4.2 Experimental Results and Analysis

4.2.1 With Different P Values

In this section, the influence of P value on EID–mRMR is analyzed. The P value is set to 0.02, 0.03, 0.04, 0.05 and 0.06 respectively. The average performance of EID–mRMR with different P values when using J48, IB1 and Naive Bayes is presented in Table 5.

Table 5 Average performance of EID–mRMR with different P values

As shown in Table 5, the greatest Avg. value is achieved when P is 0.05 or 0.06. At the same time, the differences between the Avg. values for different P values are very small, so the P value has a relatively small impact on the performance of EID–mRMR. In the experiment, we find that feature selection takes more time as the P value increases. Therefore, in the comparisons with the incremental MI-based algorithms and the other feature selection algorithms, the P value of EID–mRMR is set to 0.05.

4.2.2 Comparison with Incremental MI-Based Algorithms

This section compares EID–mRMR with other incremental MI-based algorithms. Margins between the performance of EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes are presented in Tables 6, 7 and 8. The values in the Avg. row are the mean value and standard deviation of the twelve values above. The values in the Win/Tie/Loss (W/T/L) row are the results of the one-sided paired t-test: the first value is the number of datasets on which EID–mRMR is significantly superior to the other algorithm, the second is the number of datasets on which they perform equally, and the third is the number of datasets on which EID–mRMR is significantly inferior. Classification accuracy of the optimal features selected by EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 9, 10 and 11. The running time of EID–mRMR and the other algorithms when selecting 50 features is shown in Table 12. Average performance comparisons of the algorithms with the three classifiers are shown in Fig. 2.

Table 6 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using J48

As shown in Table 6, on Movement_libras, Mfeat_pix, Semeion, COIL20 and orlraws10P, EID–mRMR is significantly better than mRMR. From the values in the Avg. and W/T/L rows, EID–mRMR performs better than mRMR when using J48. Compared with the other algorithms, MIFS-U, MIFS-ND and EID–mRMR achieve better results.

Table 7 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using IB1

In Table 7, the Avg. values indicate that MIFS-U, MIFS-CR and EID–mRMR yield better results, while the W/T/L values suggest that mRMR, NMIFS, MIFS-U and EID–mRMR obtain better feature selection performance. Unlike in Table 6, the margins between EID–mRMR and the other algorithms increase, except for MIFS-U.

Table 8 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using Naive Bayes

In Table 8, in terms of the Avg. values, MIFS-U, MIFS-CR and EID–mRMR achieve better results than the other algorithms. In terms of the W/T/L values, MIFS-U, CMI and EID–mRMR perform better. Moreover, the advantage of EID–mRMR over the other algorithms increases in both the Avg. and W/T/L values.

Table 9 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using J48
Table 10 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using IB1
Table 11 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using Naive Bayes

As shown in Tables 9, 10 and 11, EID–mRMR achieves better results on the majority of datasets. Taking Table 11 as an example, EID–mRMR performs better than the other seven algorithms on 7 datasets, while no other algorithm is superior to EID–mRMR on more than 2 datasets. Furthermore, EID–mRMR obtains the greatest Avg. values with all three classifiers.

Table 12 Running time (s) of EID–mRMR and other algorithms

In Table 12, the running time of EID–mRMR is comparable to that of the other seven algorithms on the majority of datasets, except for gisette and orlraws10p. Since the features of gisette and orlraws10p are high dimensional and must be processed by equal interval division and ranking more times, EID–mRMR is more time-consuming on these two datasets and thus obtains a greater Avg. value than the other algorithms.

Fig. 2 Average performance comparisons of algorithms with the three classifiers

As shown in Fig. 2, in some datasets, such as Spambase and Mfeat_fou, the selected features account for a large proportion of the feature set; after the classification accuracy of the features selected by most algorithms reaches its maximum, it decreases to some extent. In most datasets, however, the selected features occupy only a limited proportion of the feature set, so the average performance of the features selected by most algorithms increases significantly at first, then increases slowly and levels off. Since the average value over the 50 selected features is taken as the final result, the proportion of the feature set that the selected features account for has little impact on this result; selecting 50 features is therefore suitable. In the average performance comparisons, EID–mRMR achieves better effectiveness on the majority of datasets, whereas the other algorithms perform poorly on some: mRMR does not handle Synthetic_control, Movement_libras, Mfeat_pix, Semeion, ORL and gisette well, and MIFS-ND does not yield good results on Spambase, Mfeat_fou and gisette. Overall, compared with the other seven feature selection algorithms, EID–mRMR achieves better results.

4.2.3 Comparison with Other Feature Selection Algorithms

This section compares EID–mRMR with Relief-F, Fisher, QPFS and SPEC-CMI. The margins between the performance of EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes are presented in Tables 13, 14 and 15. The classification accuracy of the optimal features selected by EID–mRMR and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 16, 17 and 18.

Table 13 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using J48
Table 14 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using IB1
Table 15 Margins (mean ± SD) (%) between the performance of EID–mRMR and other algorithms when using Naive Bayes
Table 16 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using J48
Table 17 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using IB1
Table 18 Classification accuracy (%) of the optimal features selected by EID–mRMR and other algorithms when using Naive Bayes

As shown in Table 13, EID–mRMR and SPEC-CMI are superior to Relief-F, Fisher and QPFS in Spambase and Mfeat_fou, while EID–mRMR and QPFS are superior to Relief-F, Fisher and SPEC-CMI in Mfeat_fou and Movement_libras. In most datasets, EID–mRMR performs better than Relief-F, Fisher, QPFS and SPEC-CMI. The values in the Avg. and W/T/L rows show that EID–mRMR achieves better results than the other four algorithms.

In Table 14, SPEC-CMI performs better than EID–mRMR in Spambase and Mfeat_fou, but in the majority of datasets EID–mRMR yields better results than the other four algorithms. The values in the Avg. and W/T/L rows suggest that EID–mRMR obtains better feature selection effectiveness. Unlike in Table 13, EID–mRMR's advantage over the other algorithms increases in both the Avg. and W/T/L values.

As shown in Table 15, Fisher and SPEC-CMI achieve better results than EID–mRMR in Spambase, but in most datasets EID–mRMR achieves better feature selection performance than the other four algorithms. Compared with Tables 13 and 14, the margins between EID–mRMR and the other four algorithms increase in both the Avg. and W/T/L values.

In Tables 16, 17 and 18, EID–mRMR performs better than the other four algorithms on the majority of datasets. Take Table 16 as an example: EID–mRMR obtains better results than the other four algorithms on 7 datasets, while it is inferior to any one of the other algorithms on no more than 5. Furthermore, EID–mRMR achieves the greatest Avg. values with all three classifiers.

5 Conclusions and Future Work

This paper first analyzes the objective function of mRMR. Then, for the case where the objective function cannot guarantee the effectiveness of the selected features, the idea of equal interval division and ranking is proposed to process both the mutual information between features and the class label and the average mutual information between features; on this basis, EID–mRMR is proposed. To verify its performance, we apply EID–mRMR with three classifiers to eight UCI datasets and four ASU datasets, and compare the results with those of seven incremental MI-based algorithms and four other feature selection algorithms. The experimental results validate that EID–mRMR achieves better feature selection effectiveness on the majority of datasets.
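The mRMR objective that this analysis starts from can be sketched as a greedy selection loop. The following is a minimal illustration assuming precomputed mutual-information values; the function name and inputs are ours, and the equal interval division and ranking step that distinguishes EID–mRMR is deliberately omitted:

```python
import numpy as np

def mrmr_select(mi_fc, mi_ff, k):
    """Greedy mRMR: pick k features maximizing relevance minus mean redundancy.
    mi_fc[i]    : I(f_i; class)  -- relevance of feature i
    mi_ff[i][j] : I(f_i; f_j)    -- pairwise mutual information between features
    """
    n = len(mi_fc)
    selected = [int(np.argmax(mi_fc))]       # first feature: pure relevance (MIM)
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # average MI between candidate i and the already-selected features
            redundancy = np.mean([mi_ff[i][j] for j in selected])
            score = mi_fc[i] - redundancy    # the mRMR "difference" objective
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

As the paper argues, the feature maximizing this difference need not simultaneously have minimal redundancy and maximal relevance, which is the limitation EID–mRMR targets.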

Considering that EID–mRMR currently adopts only pairwise mutual information and does not utilize three-way interaction information or higher-dimensional mutual information [38,39,40,41,42,43,44,45,46], it may lose some useful information and thus degrade performance. In future work, we will investigate how to combine three-way interaction information with the idea of equal interval division and ranking.
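As a concrete illustration of the three-way interaction information mentioned above, the following minimal sketch computes it for discrete variables from entropy estimates (base-2 logarithms; we use the sign convention I(X;Y;Z) = I(X;Y|Z) − I(X;Y), and the function names are ours):

```python
import numpy as np
from collections import Counter

def entropy(*vars_):
    """Joint Shannon entropy (bits) of one or more discrete sequences."""
    joint = list(zip(*vars_))
    counts = Counter(joint)
    probs = np.array([c / len(joint) for c in counts.values()])
    return float(-np.sum(probs * np.log2(probs)))

def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)

def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def interaction_info(x, y, z):
    """Three-way interaction information I(X;Y;Z) = I(X;Y|Z) - I(X;Y)."""
    return cond_mutual_info(x, y, z) - mutual_info(x, y)
```

For the classic XOR example (z = x XOR y with x, y independent), the pairwise term I(X;Y) is zero while the interaction term is one bit: exactly the kind of joint dependency that pairwise mutual information, and hence the current EID–mRMR, cannot capture.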