1 Introduction

Feature selection is an important approach to dimension reduction; it adopts metrics to measure features and selects informative ones [1, 2]. Feature selection algorithms can be classified into three categories: wrapper, embedded and filter. Since the accuracy of a classifier is used as the metric, wrapper and embedded algorithms are time-consuming. Filter algorithms are time-saving at the cost of some performance. The most commonly used metrics in filter algorithms include the Pearson correlation coefficient, Euclidean distance and mutual information. As mutual information can measure both linear and nonlinear correlation and is invariant under space transformations [3], it has an advantage over the other metrics. Therefore, many feature selection algorithms based on mutual information have been proposed [4, 5].

MIM [6] is a basic mutual information-based algorithm; it calculates the mutual information between each feature and the class label. On the basis of MIM, several feature selection algorithms based on relevance and redundancy have been proposed, such as MIFS [7], MIFS-U [8], mRMR [9], CMI [10] and MIFS-CR [11]. These algorithms introduce mutual information between features and use it to describe redundancy. Furthermore, in our previous work [12], we investigated feature selection based on mutual information and proposed an algorithm named EID-mRMR, which combines mutual information with a method of equal interval division and ranking to select features. Since three-dimensional mutual information, such as conditional mutual information, joint mutual information or three-way interaction information, is not exploited, the performance of these algorithms can be limited.

There are several feature selection algorithms based on three-dimensional mutual information, such as DWFS [13], IWFS [14], JMI [15], CMIM [16], CIFE [17], JMIM [18] and RelaxMRMR [19]. All of these algorithms consider three-dimensional mutual information among features and the class label, but only RelaxMRMR also considers three-dimensional mutual information among features.

Since our previous work [12] only considers mutual information and not three-dimensional mutual information, the objective function of EID-mRMR misses some useful information and its performance is affected. To further improve performance, this paper investigates feature selection based on three-dimensional mutual information. It extends EID-mRMR by combining three-dimensional mutual information among features with the method of equal interval division and ranking to select features. Specifically, an objective function based on conditional mutual information is first presented. Then, the objective function is analyzed and a criterion is derived that validates whether the objective function can guarantee the performance of the selected features. When it cannot, conditional mutual information among features and the class label as well as conditional mutual information among features is processed by the method of equal interval division and ranking. Finally, a feature selection algorithm named EID-CMI is proposed.

The main contributions of this paper are as follows: (1) Feature selection based on three-dimensional mutual information is investigated. (2) The method of equal interval division and ranking is adopted. (3) A feature selection algorithm based on conditional mutual information is proposed. (4) The proposed algorithm can achieve better feature selection performance.

The remainder of this paper is organized as follows: Sect. 2 presents some basic concepts of information theory. Related works of feature selection algorithms based on three-dimensional mutual information are introduced in Sect. 3. Section 4 presents the proposed algorithm. Experimental results are given and analyzed in Sect. 5. Section 6 gives conclusions and future work.

2 Some Basic Concepts of Information Theory

Mutual information is exploited to quantify the information shared by two variables, and the mutual information I(X;Y) between two variables X and Y is defined as

$$\begin{aligned} I(X;Y) = \sum \limits _{x \in X} {\sum \limits _{y \in Y} {p(x,y)\log \frac{{p(x,y)}}{{p(x)p(y)}}} } \end{aligned}$$
(1)

A greater mutual information value indicates that two variables share more information.

Conditional mutual information is adopted to quantify the information shared by two variables when a third variable is known, and I(X;Y|Z) is defined as

$$\begin{aligned} I(X;Y|Z) = \sum \limits _{x \in X} {\sum \limits _{y \in Y} {\sum \limits _{z \in Z} {p(x,y,z)\log \frac{{p(x,y|z)}}{{p(x|z)p(y|z)}}} } } \end{aligned}$$
(2)

Three-way interaction information is an extension of mutual information, and I(X;Y;Z) has the following relationship with the conditional mutual information I(X;Y|Z) [20]:

$$\begin{aligned} I(X;Y;Z) = I(X;Y|Z) - I(X;Y) \end{aligned}$$
(3)

where I(X;Y;Z) can be positive, negative or zero.
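To make these definitions concrete, the following minimal Python sketch (not part of the original algorithms; function and variable names are ours) estimates the quantities in Eqs. (1)–(3) from already discretized data using plug-in probability estimates.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in Eq. (1) for discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in Eq. (2): I(X;Y) within each stratum of Z, weighted by p(z)."""
    n = len(z)
    cmi = 0.0
    for zv, cnt in Counter(z).items():
        idx = [i for i in range(n) if z[i] == zv]
        cmi += (cnt / n) * mutual_information([x[i] for i in idx],
                                              [y[i] for i in idx])
    return cmi

def interaction_information(x, y, z):
    """Three-way interaction information I(X;Y;Z) via Eq. (3); may be negative."""
    return conditional_mutual_information(x, y, z) - mutual_information(x, y)
```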

3 Related Works

Feature selection algorithms based on three-dimensional mutual information include DWFS, IWFS, JMI, CMIM, CIFE, JMIM and RelaxMRMR. DWFS and IWFS belong to the same class: they utilize symmetric uncertainty [21] between features and the class label to describe relevance, and adopt three-way interaction information among features and the class label to describe interaction. Owing to the lack of evidence for how these two parts should be determined, their feature selection performance cannot be guaranteed.

JMI, CMIM, CIFE, JMIM and RelaxMRMR belong to the same class. Objective functions are the key to these algorithms, and their objective functions are presented as follows:

$$\begin{aligned}&JMI = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c\mathrm{{;}}{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}} \right) } + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}|c} \right) } } \right] \end{aligned}$$
(4)
$$\begin{aligned}&CMIM = \arg \mathop {\max }\limits _{{f_i} \in X} \left\{ {\mathop {\min }\limits _{{f_s} \in S} \left[ {I\left( {c;{f_i}|{f_s}} \right) } \right] } \right\} \end{aligned}$$
(5)
$$\begin{aligned}&CIFE = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c\mathrm{{;}}{f_i}} \right) - \sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}} \right) } + \sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}|c} \right) } } \right] \end{aligned}$$
(6)
$$\begin{aligned}&JMIM = \arg \mathop {\max }\limits _{{f_i} \in X} \left\{ {\mathop {\min }\limits _{{f_s} \in S} \left[ {I\left( {{f_i},{f_s};c} \right) } \right] } \right\} \end{aligned}$$
(7)
$$\begin{aligned}&RelaxMRMR = \arg \mathop {\max }\limits _{{f_i} \in X} I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) }\nonumber \\&\quad + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}|c} \right) } - \frac{1}{{\left| S \right| \left( {\left| S \right| - 1} \right) }}\sum \limits _{{f_s} \in S} {\sum \limits _{{f_k} \in S;k \ne s} {I\left( {{f_i};{f_k}|{f_s}} \right) } } \end{aligned}$$
(8)

where \(I\left( {c;{f_i}} \right) \) is the mutual information between a candidate feature \({f_i}\) and the class label c, \(I\left( {{f_s};{f_i}} \right) \) is the mutual information between a selected feature \({f_s}\) and \({f_i}\), \(I\left( {{f_s};{f_i}|c} \right) \) is the conditional mutual information between \({f_s}\) and \({f_i}\) given c, \(I\left( {c;{f_i}|{f_s}} \right) \) is the conditional mutual information between c and \({f_i}\) given \({f_s}\), \(I\left( {{f_i},{f_s};c} \right) \) is the joint mutual information between \(\left( {f_i},{f_s}\right) \) and c, \(I\left( {{f_i};{f_k}|{f_s}} \right) \) is the conditional mutual information between \({f_i}\) and another selected feature \({f_k}\) given \({f_s}\), \(\left| S \right| \) is the number of selected features, S is the selected feature set and X is the candidate feature set.

Among these five algorithms, only RelaxMRMR considers three-dimensional mutual information among features; JMI, CMIM, CIFE and JMIM do not. Since three-dimensional mutual information among features is considered, the objective function of RelaxMRMR contains more useful information.
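For illustration, the sketch below shows how the JMI (Eq. (4)) and CMIM (Eq. (5)) scores of a single candidate feature could be evaluated from precomputed information-theoretic quantities; the lookup-table names `mi_c`, `mi_ff`, `cmi_ff_c` and `cmi_cf_f` are assumptions of this sketch, not part of the cited algorithms, and could be filled with the routines of Sect. 2.

```python
def jmi_score(fi, S, mi_c, mi_ff, cmi_ff_c):
    """Eq. (4): I(c;f_i) - (1/|S|) sum I(f_s;f_i) + (1/|S|) sum I(f_s;f_i|c)."""
    if not S:
        return mi_c[fi]
    red = sum(mi_ff[(fs, fi)] for fs in S) / len(S)
    comp = sum(cmi_ff_c[(fs, fi)] for fs in S) / len(S)
    return mi_c[fi] - red + comp

def cmim_score(fi, S, mi_c, cmi_cf_f):
    """Eq. (5): min over selected f_s of I(c;f_i|f_s); I(c;f_i) when S is empty."""
    if not S:
        return mi_c[fi]
    return min(cmi_cf_f[(fi, fs)] for fs in S)
```

The feature added at each greedy step is simply the candidate maximizing the chosen score.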

4 The Proposed Feature Selection Algorithm

This section first presents our proposed objective function and a criterion that validates whether the objective function can guarantee the effectiveness of selected features. Then, a method of equal interval division and ranking is given. Finally, our proposed feature selection algorithm, EID-CMI, is presented.

4.1 The Proposed Objective Function and Criterion

First, we analyze the first three selected features. Since it provides the most information, the feature that has the maximum mutual information with the class label c is selected first; it is denoted \({f_s}\). When selecting the second feature, Eq. (9) is satisfied.

$$\begin{aligned} I\left( {c;T} \right) = I\left( {c;{f_i},{f_s}} \right) = I\left( {c;{f_s}} \right) + I\left( {c;{f_i}|{f_s}} \right) \end{aligned}$$
(9)

where \({f_i}\) is a candidate feature. Since \(I\left( {c;{f_s}} \right) \) is already maximal, \(I\left( {c;{f_i}|{f_s}} \right) \) should be maximized so that \(I\left( {c;T} \right) \) is maximal. Therefore, the feature that has the maximum conditional mutual information with c given \({f_s}\) is selected.

When selecting the third feature, Eq. (10) is satisfied.

$$\begin{aligned} I\left( {c;T} \right) = I\left( {c;{f_i},{f_s},{f_k}} \right) \end{aligned}$$
(10)

Since Eq. (10) is difficult to calculate directly, we introduce Eq. (11); selecting the feature that maximizes it corresponds to the RelaxMRMR criterion in Eq. (8).

$$\begin{aligned} J({f_i})= & {} I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}|c} \right) } \nonumber \\&- \frac{1}{{\left| S \right| \left( {\left| S \right| - 1} \right) }}\sum \limits _{{f_s} \in S} {\sum \limits _{{f_k} \in S;k \ne s} {I\left( {{f_i};{f_k}|{f_s}} \right) } } \end{aligned}$$
(11)

where \({{f_k}}\) is a selected feature and S is the selected feature set.

Using Eq. (3), the first three terms of Eq. (11) can be rewritten as Eq. (12).

$$\begin{aligned}&I\left( {c\mathrm{{;}}{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}} \right) } + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}|c} \right) } \nonumber \\&\quad = I\left( {c\mathrm{{;}}{f_i}} \right) + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i};c} \right) } = \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {c;{f_i}|{f_s}} \right) } \end{aligned}$$
(12)

To simplify computation, Eq. (13) is therefore introduced for selecting the third and subsequent features; it can be understood as the effective information brought in by selecting a feature. The first part is the average amount of relevant information, while the second part can be considered the average amount of redundant information.

$$\begin{aligned} J({f_i}) = \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {c;{f_i}|{f_s}} \right) } - \frac{1}{{\left| S \right| \left( {\left| S \right| - 1} \right) }}\sum \limits _{{f_s} \in S} {\sum \limits _{{f_k} \in S;k \ne s} {I\left( {{f_k};{f_i}|{f_s}} \right) } } \end{aligned}$$
(13)

When the maximum of Eq. (13) is far greater than the second largest value, noticeably more information is gained by selecting the maximizing feature, which guarantees the effectiveness of the selected features; otherwise, the advantage is not obvious. To guarantee the performance of the selected features, a criterion is used that checks whether the difference between the maximum and the second largest value is greater than a fixed value P, as in our previous work [12]. If the difference is greater than P, Eq. (13) is exploited to select the feature; otherwise, the method of equal interval division and ranking, introduced in the next section, is adopted. A minimal sketch of this check is given below.
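The snippet is only an illustration of the criterion as we describe it; the function name and default threshold are ours.

```python
import numpy as np

def objective_is_decisive(scores, P=0.04):
    """True when the best Eq. (13) score exceeds the runner-up by more than P."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]
    return ordered.size < 2 or (ordered[0] - ordered[1]) > P
```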

4.2 Equal Interval Division and Ranking

When the performance of the selected features cannot be guaranteed, the method of equal interval division and ranking is applied to the two parts of Eq. (13), yielding CFF-CMI\(\left( {c;{f_i}|{f_s}} \right) \) and FFF-CMI\(\left( {{f_k};{f_i}|{f_s}} \right) \). Their computing processes are shown in Algorithms 1 and 2.

[Algorithm 1: pseudocode figure]
[Algorithm 2: pseudocode figure]

The number of the dataset's features is taken as the number of subintervals. Taking Algorithm 1 as an example, the method of equal interval division and ranking works as follows: the maximum and minimum of the average values of \(I\left( {c;{f_i}|{f_s}} \right) \) define the interval, and this interval is divided into equal-width subintervals. The subintervals are then indexed from 1 to the number of the dataset's features, and the index of the subinterval into which a value falls is taken as that value's ordinal rank.

Processing these two parts with the method of equal interval division and ranking increases the difference among the values of each part and guarantees that features falling in the same subinterval have the same priority. Therefore, Eq. (14) is adopted when the effectiveness of the selected features cannot be guaranteed. When only one feature satisfies the condition, that feature is selected; otherwise, the feature that maximizes Eq. (13) is selected from among these features. A sketch of the ranking step follows Eq. (14).

$$\begin{aligned} J({f_i}) = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {\text {CFF-CMI}\left( {c;{f_i}|{f_s}} \right) - \text {FFF-CMI}\left( {{f_k};{f_i}|{f_s}} \right) } \right] \end{aligned}$$
(14)
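Under our reading of Algorithms 1 and 2, the equal interval division and ranking step can be sketched as follows; the function name and the handling of the degenerate case where all values coincide are our assumptions.

```python
import numpy as np

def equal_interval_rank(values, n_features):
    """Map each value to the ordinal index (1..n_features) of its equal-width subinterval."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:                        # degenerate case: all values identical
        return np.ones(len(values), dtype=int)
    width = (hi - lo) / n_features
    idx = np.floor((values - lo) / width).astype(int) + 1
    return np.clip(idx, 1, n_features)  # the maximum value would otherwise land in bin n+1

# CFF-CMI and FFF-CMI in Eq. (14) would then be the ranks of the two averaged
# terms of Eq. (13), e.g.:
#   cff = equal_interval_rank(avg_cmi_with_class, n_features)
#   fff = equal_interval_rank(avg_cmi_between_features, n_features)
```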

4.3 Algorithmic Implementation

The proposed objective function and criterion are combined with the method of equal interval division and ranking to obtain a feature selection algorithm named EID-CMI, which is shown in Algorithm 3.

[Algorithm 3: EID-CMI pseudocode figure]

EID-CMI consists of two parts. In the first part (lines 1–10), the candidate feature set X and the selected feature set S are initialized; \(I\left( {c;{f_i}} \right) \) is calculated and the feature with the maximum value is selected; then \(I\left( {c;{f_i}|{f_s}} \right) \) is calculated and the feature with the maximum value is taken from X. In the second part (lines 11–31), \(I\left( {{f_k};{f_i}|{f_s}} \right) \) is calculated together with the difference between the maximum and the second largest value of Eq. (13). If the difference is greater than a fixed value P, the feature maximizing Eq. (13) is selected; if not, CFF-CMI\(\left( {c;{f_i}|{f_s}} \right) \), FFF-CMI\(\left( {{f_k};{f_i}|{f_s}} \right) \) and their difference are calculated, and the number of features attaining the maximum difference is examined: if it is one, that feature is selected, otherwise the feature with the maximum value of Eq. (13) among them is selected. This loop continues until a specified number N of features has been selected. A compact sketch of the loop is given below.
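The following sketch reflects our reading of Algorithm 3 and is not the reference implementation. It assumes the lookup tables `mi_c` (I(c;f)), `cmi_cf_f` (I(c;f|f_s)) and `cmi_ff_f` (I(f;f_k|f_s)) have been precomputed with the routines of Sect. 2, and it takes the `equal_interval_rank` function sketched in Sect. 4.2 as an argument.

```python
def eq13_score(fi, S, cmi_cf_f, cmi_ff_f):
    """Eq. (13): average I(c;f_i|f_s) minus average I(f_k;f_i|f_s) over ordered pairs in S."""
    rel = sum(cmi_cf_f[(fi, fs)] for fs in S) / len(S)
    if len(S) < 2:
        return rel
    red = sum(cmi_ff_f[(fi, fk, fs)] for fs in S for fk in S if fk != fs)
    return rel - red / (len(S) * (len(S) - 1))

def eid_cmi(features, mi_c, cmi_cf_f, cmi_ff_f, equal_interval_rank,
            n_select=50, P=0.04):
    X = list(features)
    first = max(X, key=lambda f: mi_c[f])                     # first feature: max I(c;f)
    S = [first]; X.remove(first)
    if X:
        second = max(X, key=lambda f: cmi_cf_f[(f, first)])   # second: max I(c;f|f_s)
        S.append(second); X.remove(second)
    while X and len(S) < n_select:
        scores = {f: eq13_score(f, S, cmi_cf_f, cmi_ff_f) for f in X}
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) < 2 or ranked[0] - ranked[1] > P:      # Eq. (13) is decisive
            best = max(scores, key=scores.get)
        else:                                                 # fall back to Eq. (14)
            feats = list(scores)
            rel = [sum(cmi_cf_f[(f, fs)] for fs in S) / len(S) for f in feats]
            red = [sum(cmi_ff_f[(f, fk, fs)] for fs in S for fk in S if fk != fs)
                   / (len(S) * (len(S) - 1)) for f in feats]
            diff = (equal_interval_rank(rel, len(features))
                    - equal_interval_rank(red, len(features)))
            top = diff.max()
            tied = [f for f, d in zip(feats, diff) if d == top]
            best = tied[0] if len(tied) == 1 else max(tied, key=scores.get)
        S.append(best); X.remove(best)
    return S
```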

5 Experimental Results

This section describes the datasets and experimental settings, and compares EID-CMI with several algorithms.

5.1 The Datasets and Experimental Settings

The datasets in Table 1 are commonly used in feature selection; they come from the UCI machine learning repository [22] and the ASU feature selection datasets [23] and span different fields, such as face images and digit recognition. Their numbers of instances vary from 100 to 7000, and their numbers of features vary from 57 to 10304. For datasets with continuous features, the minimum description length discretization method [24] is employed. Three classifiers, J48, IB1 and Naive Bayes, are adopted, with their parameters set to WEKA's [25] default values. The ASU feature selection software package [26] is utilized, and N is set to 50.

Table 1 Summary of datasets in the experiment

To reduce the influence of randomness on the final results, 10-fold cross-validation is conducted ten times, and the mean values and standard deviations of the ten results are taken as the final results. To determine whether differences in the experimental results are significant, a one-sided paired t-test at the 5% significance level is carried out, and the numbers of datasets on which EID-CMI performs better than, equal to and worse than the other algorithms are reported as Win/Tie/Loss (W/T/L).
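As an illustration of this protocol, the sketch below runs ten repetitions of 10-fold cross-validation and a one-sided paired t-test. The scikit-learn 1-nearest-neighbour classifier is only a stand-in for WEKA's IB1, and all names are ours.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier   # stand-in for IB1

def repeated_cv_accuracy(X_selected, y, n_repeats=10, n_folds=10):
    """Mean accuracy of each of ten 10-fold cross-validation runs."""
    accs = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        accs.append(cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                    X_selected, y, cv=cv).mean())
    return np.array(accs)

def wins_over(acc_a, acc_b, alpha=0.05):
    """One-sided paired t-test: does algorithm A significantly beat B?"""
    t, p_two_sided = stats.ttest_rel(acc_a, acc_b)
    return t > 0 and p_two_sided / 2 < alpha
```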

5.2 Experimental Results and Analysis

5.2.1 With Different P Values

To analyze the influence of the value of P on EID-CMI, we set P to 0.02, 0.03, 0.04, 0.05 and 0.06, respectively, and conduct experiments with J48, IB1 and Naive Bayes. The accuracy is calculated for 1 to 50 selected features with each of the three classifiers, and the mean value is taken as the final result. The average performance of EID-CMI with different P values is presented in Table 2; the value in the Avg. row is the mean of these twenty values.

Table 2 Average performance with different P values

In Table 2, there is little difference among the Avg. values. When P is 0.04, EID-CMI achieves the best Avg. value. Therefore, in the comparisons with other feature selection algorithms, P is set to 0.04.

5.2.2 Comparison with Feature Selection Algorithms

To validate the feature selection performance of EID-CMI, several three-dimensional mutual information-based algorithms, JMI [15], CMIM [16], CIFE [17], JMIM [18], RelaxMRMR [19], DWFS [13] and IWFS [14], are compared. The average accuracies of the selected features with J48, IB1 and Naive Bayes are presented in Tables 3, 4 and 5, respectively. The average performance on the twenty datasets with the three classifiers is shown in Fig. 1. The accuracy is calculated for 1 to 50 selected features and the optimal value is recorded; the optimal accuracy of EID-CMI and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 6, 7 and 8. The running time of EID-CMI and the other algorithms when selecting 50 features is shown in Table 9.

Table 3 Margins (mean ± std. dev.) between the performance of algorithms when using J48

As shown in Table 3, CMIM, RelaxMRMR and EID-CMI obtain better results than the other algorithms in terms of the Avg. and W/T/L values. The Avg. value of EID-CMI is 0.64% greater than that of RelaxMRMR. In the W/T/L values, EID-CMI is superior to RelaxMRMR on 7 datasets, while RelaxMRMR is superior to EID-CMI on 2; thus, EID-CMI has the advantage over RelaxMRMR.

Table 4 Margins (mean ± std. dev.) between the performance of algorithms when using IB1

In Table 4, CMIM, RelaxMRMR and EID-CMI obtain greater Avg. values, and the W/T/L values show that RelaxMRMR and EID-CMI achieve better feature selection performance. Compared with Table 3, in terms of the Avg. values, EID-CMI gains a larger advantage over the other algorithms; in terms of the W/T/L values, EID-CMI outperforms all the other algorithms except IWFS by a larger margin.

Table 5 Margins (mean ± std. dev.) between the performance of algorithms when using Naive Bayes

The Avg. and W/T/L values in Table 5 show that CMIM, RelaxMRMR and EID-CMI achieve better results. Different from Tables 3 and 4, in terms of the Avg. values, EID-CMI gains a larger advantage over the other algorithms, and the W/T/L values show that EID-CMI is superior to all the other algorithms except CMIM by a larger margin.

Tables 6, 7 and 8 show that EID-CMI generally performs better than the other algorithms. Taking Table 7 as an example, EID-CMI obtains better results than the other algorithms on 9 of the twenty datasets. Furthermore, EID-CMI achieves greater Avg. values than the other algorithms with all three classifiers.

Table 6 Classification accuracy (%) of the selected optimal features when using J48
Table 7 Classification accuracy (%) of the selected optimal features when using IB1
Table 8 Classification accuracy (%) of the selected optimal features when using Naive Bayes
Table 9 Running time (s) of EID-CMI and other algorithms

In Table 9, owing to calculating both conditional mutual information among features and the class label and conditional mutual information among features, EID-CMI and RelaxMRMR are more time-consuming than the other algorithms, as the Avg. values show. Since it only calculates conditional mutual information among features and the class label, CMIM takes the least time for feature selection.

Fig. 1 Average performance comparisons of the algorithms with the three classifiers

As shown in Fig. 1, EID-CMI achieves better feature selection performance overall, while the other algorithms do not perform well on some datasets. For example, JMI does not obtain the desired performance on Movement_libras, Musk, Mfeat_fac, Semeion and ORL, and RelaxMRMR does not achieve the desired results on Movement_libras, Semeion and ORL.

In addition to the algorithms above, EID-CMI is compared with the algorithm in [27], MCMI [28], MRI [29], MIM [6] and EID-mRMR [12]. For the algorithm in [27], the parameter c is set to 0.4 and \(\eta \) to 0.2; for EID-mRMR, the parameter P is set to 0.5. The accuracies of the features selected by EID-CMI and these algorithms with J48, IB1 and Naive Bayes are presented in Tables 10, 11 and 12, and the accuracies of the optimal selected features when using J48, IB1 and Naive Bayes are presented in Tables 13, 14 and 15. The running time of EID-CMI and these algorithms when selecting 50 features is given in Table 16.

Table 10 Margins (mean ± std. dev.) between the performance of algorithms when using J48

In Table 10, EID-mRMR and EID-CMI achieve better Avg. and W/T/L values. EID-CMI is superior to EID-mRMR on 8 datasets and inferior on 2; thus, EID-CMI achieves better feature selection performance than EID-mRMR.

Table 11 Margins (mean ± std. dev.) between the performance of algorithms when using IB1

As shown in Table 11, EID-mRMR and EID-CMI obtain better Avg. and W/T/L values. Compared with Table 10, in terms of the Avg. values, EID-CMI outperforms all the algorithms except EID-mRMR by a larger margin, and in the W/T/L values, EID-CMI gains more of an advantage over the algorithm in [27], MRI and EID-mRMR.

Table 12 Margins (mean ± std. dev.) between the performance of algorithms when using Naive Bayes

The Avg. and W/T/L values in Table 12 show that EID-mRMR and EID-CMI achieve better feature selection performance. Different from Tables 10 and 11, in the Avg. values, EID-CMI gains a larger advantage over all the algorithms except EID-mRMR, and in terms of the W/T/L values, EID-CMI performs better than MCMI, MRI and MIM.

Table 13 Classification accuracy (%) of the selected optimal features when using J48
Table 14 Classification accuracy (%) of the selected optimal features when using IB1
Table 15 Classification accuracy (%) of the selected optimal features when using Naive Bayes

Tables 13, 14 and 15 show that EID-CMI and EID-mRMR generally perform better than the other algorithms. Taking Table 13 as an example, EID-CMI obtains better results than the other algorithms on 7 of the twenty datasets. Furthermore, EID-CMI and EID-mRMR obtain greater Avg. values with all three classifiers.

Table 16 Running time (s) of EID-CMI and other five algorithms

In Table 16, since both conditional mutual information among features and the class label and conditional mutual information among features are calculated, EID-CMI takes more time than the other algorithms for feature selection. Since it only calculates mutual information between features and the class label, MIM takes the least time.

Compared with the other algorithms, EID-CMI and RelaxMRMR consider three-dimensional mutual information among features; their objective functions therefore contain more useful information, and they achieve better results than the majority of algorithms at the cost of more running time. EID-CMI adopts the objective function of RelaxMRMR when it can guarantee the performance of the selected features, and combines it with equal interval division and ranking when it cannot; as a result, EID-CMI obtains better feature selection performance than RelaxMRMR.

6 Conclusions and Future Work

This paper introduces a method of equal interval division and ranking into feature selection based on three-dimensional mutual information and proposes a feature selection algorithm named EID-CMI. To verify its performance, EID-CMI is applied to eight UCI datasets, twelve ASU datasets and three classifiers, and compared with several feature selection algorithms. Experimental results demonstrate that EID-CMI achieves better feature selection effectiveness.

Given its better feature selection performance, EID-CMI can be applied to fields such as text processing [30], network anomaly detection [31], underwater object recognition and classification [32, 33], steganalysis [34, 35] and information retrieval [36]. Considering that EID-CMI is more time-consuming, we will take both performance and time consumption into account when investigating feature selection in the next stage.