1 Introduction

Feature selection [1,2,3,4,5] aims to select informative features from the original feature set. It is an important dimensionality reduction technique with widespread applications, such as text processing [6, 7], steganalysis [8, 9], underwater object classification [10], network anomaly detection [11], information retrieval [12], and image classification [13, 14]. Feature selection algorithms are divided into three categories: filter, embedded, and wrapper methods. Because embedded and wrapper methods take the classification accuracy of a specific classifier as their metric, they are time-consuming and not robust. Filter methods take less time, at the cost of a decline in classification accuracy. It is therefore advisable to handle datasets with high-dimensional features using filter methods [15].

The metrics commonly used in filter methods include consistency, distance, and MI. Since MI can measure both linear and non-linear correlation and is invariant under space transformations [16], feature selection based on MI and TDMI has been widely investigated. Mutual information maximization (MIM) [17] is a basic feature selection algorithm based on MI: it calculates MI between the class label and each feature, and selects the features with the greatest values. On the basis of MIM, feature selection algorithms based on relevance and redundancy have been proposed, such as minimum-redundancy maximum-relevance (mRMR) [18], conditional mutual information (CMI) [19], and MIFS-CR [20]. These algorithms adopt MI between features and the class label to describe relevance and exploit MI between features to describe redundancy. Because relevance and redundancy are considered simultaneously, the feature selection performance of these algorithms is improved.

Some feature selection algorithms further exploit TDMI to improve performance, such as interaction weight based feature selection (IWFS) [21], joint mutual information maximization (JMIM) [22], and maximizing independent classification information (MRI) [23]. However, these algorithms only consider TDMI among features and the class label and ignore TDMI among features, so their objective functions may miss useful information and their performance is affected.

Considering the above problem, this paper investigates feature selection based on TDMI among features. Firstly, to select the features that provide more useful information, based on the maximal relevance minimal redundancy (MRMR) criterion, joint mutual information (JMI) among the class label, the selected feature set, and the candidate feature is employed to describe relevance, and MI between the selected feature set and the candidate feature is employed to describe redundancy. Then, these two quantities are decomposed and TDMI among features is introduced. Furthermore, both performance and computation are taken into account, and an objective function is obtained. Finally, a feature selection algorithm based on CMI is proposed.

The main contributions of this paper are as follows. (1) The maximal relevance minimal redundancy criterion is adopted in selecting features. (2) Our algorithm takes special consideration of three-dimensional mutual information among features. (3) The proposed algorithm takes both performance and computation into account. (4) Our algorithm achieves better feature selection performance at the expense of more computation time.

The rest of this paper is organized as follows. Section 2 reviews the necessary background on mutual information. Related works are analyzed in Section 3. Section 4 presents the proposed algorithm. Experimental results and analysis are given in Section 5. Section 6 concludes the paper and outlines future work.

2 The knowledge of mutual information

Assume X and Y are two discrete random variables, and let p(x) and p(y) denote the probabilities of X = x and Y = y, respectively. Information entropy measures the uncertainty of a variable, and H(X) is defined by (1):

$$ H(X) = - \underset{x \in X}{\sum} p(x)\log p(x) $$
(1)

Conditional entropy H(Y|X) is the entropy of Y when X is given, and it is expressed as (2):

$$ H(Y|X) = - \underset{x \in X}{\sum} \underset{y \in Y}{\sum} p(x,y)\log p(y|x) $$
(2)

where p(y|x) is the conditional probability and p(x,y) is the joint probability.

MI quantifies the information that two variables share, and I(X;Y) is defined as (3):

$$ I(X;Y) = \underset{x \in X}{\sum} \underset{y \in Y}{\sum} p(x,y)\log \frac{p(x,y)}{p(x)p(y)} $$
(3)

A greater MI value indicates that the two variables share more information. MI is related to information entropy and conditional entropy as in (4).

$$ I(X;Y) = I(Y;X) = H(Y) - H(Y|X) = H(X) - H(X|Y) $$
(4)
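To make the definitions above concrete, the following Python sketch estimates entropy, conditional entropy, and MI from discrete samples using plug-in (empirical frequency) estimates; the function names, the base-2 logarithm, and the numpy/Counter tooling are our own illustrative choices, not part of the original paper.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X) of a sequence of discrete values, estimated from empirical frequencies (base 2)."""
    n = len(x)
    return -sum((c / n) * np.log2(c / n) for c in Counter(x).values())

def joint_entropy(x, y):
    """H(X,Y) from paired samples."""
    return entropy(list(zip(x, y)))

def conditional_entropy(y, x):
    """H(Y|X) = H(X,Y) - H(X), consistent with (2)."""
    return joint_entropy(x, y) - entropy(x)

def mutual_information(x, y):
    """I(X;Y) = H(Y) - H(Y|X), consistent with (3) and (4)."""
    return entropy(y) - conditional_entropy(y, x)
```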

TDMI is a supplement to MI and includes CMI, JMI, and three-way interaction information. CMI measures the reduction in the uncertainty of one variable due to knowledge of a second variable when a third variable is given. CMI I(X;Y|Z) is defined by (5):

$$ I(X;Y|Z) = \underset{x \in X}{\sum} \underset{y \in Y}{\sum} \underset{z \in Z}{\sum} p(x,y,z)\log \frac{p(x,y|z)}{p(x|z)p(y|z)} $$
(5)

JMI measures the information that two variables jointly share with a third variable. JMI I(X,Y;Z) is related to I(Y;Z) and I(X;Z|Y) as in (6).

$$ I(X,Y;Z) = I(Y;Z) + I(X;Z|Y) $$
(6)

Three-way interaction information I(X;Y;Z) is related to I(X;Z|Y) and I(X;Z) as in (7) [24].

$$ I(X;Y;Z) = I(X;Z|Y) - I(X;Z) $$
(7)
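Building on the helpers above, the three-dimensional quantities in (5)-(7) can be sketched by treating variable pairs and triples as single discrete variables; the function names are again illustrative assumptions.

```python
def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), equivalent to (5)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))

def joint_mutual_information(x, y, z):
    """I(X,Y;Z) = I(Y;Z) + I(X;Z|Y), as in (6)."""
    return mutual_information(y, z) + conditional_mutual_information(x, z, y)

def interaction_information(x, y, z):
    """I(X;Y;Z) = I(X;Z|Y) - I(X;Z), as in (7)."""
    return conditional_mutual_information(x, z, y) - mutual_information(x, z)
```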

3 Related works

MIM is a basic feature selection algorithm based on MI and the objective function is presented in (8).

$$ MIM = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) \right] $$
(8)

MIM calculates the MI between the class label c and each candidate feature fi, and selects the features with the greatest values from the candidate feature set X.
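As a brief illustration, MIM reduces to ranking candidates by I(c; f_i) and keeping the top k. The sketch below reuses the mutual_information helper defined earlier and assumes features is a two-dimensional array of discretized columns; these names are our own assumptions.

```python
def mim_select(features, c, k):
    """MIM as in (8): keep the k features with the largest I(c; f_i)."""
    scores = [mutual_information(features[:, i], c) for i in range(features.shape[1])]
    return list(np.argsort(scores)[::-1][:k])   # indices of the k highest-MI features
```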

Some feature selection algorithms additionally consider redundancy between features, and algorithms based on relevance and redundancy have been proposed, such as mutual information feature selection (MIFS) [25], MIFS-U [26], mRMR, normalized mutual information feature selection (NMIFS) [16], CMI, and MIFS-CR. These algorithms exploit MI between candidate features and the class label to describe relevance, and employ MI between selected features and candidate features to describe redundancy. The objective functions are the key to these algorithms; those of MIFS and MIFS-U are given below.

$$ MIFS = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \beta \underset{f_{s} \in S}{\sum} I\left( f_{s};f_{i} \right) \right] $$
(9)
$$ MIFS\text{-}U = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \beta \underset{f_{s} \in S}{\sum} \frac{I(c;f_{s})}{H(f_{s})} I\left( f_{s};f_{i} \right) \right] $$
(10)

where β is a parameter, fs is a selected feature and S is the selected feature set.

Since MIFS and MIFS-U suffer from the problem that β is difficult to determine, mRMR replaces β with the reciprocal of the number of selected features \(\left| S \right|\), and its objective function is shown in (11).

$$ mRMR = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};f_{i} \right) \right] $$
(11)
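The greedy forward-selection loop implied by (11) might look as follows; this is a sketch under the same data-layout assumptions as before and reuses the earlier MI helper, not a reference mRMR implementation.

```python
def mrmr_select(features, c, n_select):
    """Greedy forward selection with the mRMR criterion (11)."""
    n_features = features.shape[1]
    relevance = [mutual_information(features[:, i], c) for i in range(n_features)]
    selected = [int(np.argmax(relevance))]            # start from the most relevant feature
    while len(selected) < min(n_select, n_features):
        best_score, best_i = -np.inf, None
        for i in range(n_features):
            if i in selected:
                continue
            redundancy = np.mean([mutual_information(features[:, i], features[:, s])
                                  for s in selected])
            score = relevance[i] - redundancy         # I(c;f_i) - (1/|S|) sum I(f_s;f_i)
            if score > best_score:
                best_score, best_i = score, i
        selected.append(best_i)
    return selected
```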

Since mRMR is considered to be biased toward features with greater MI values, the MI between a candidate feature and a selected feature is normalized and NMIFS is proposed. Its objective function is given in (12).

$$ NMIFS = \arg \underset{{f_{i}} \in X}{\max} \left[ {I\left( {c;{f_{i}}} \right) - \frac{1}{{\left| S \right|}}\underset{{f_{s}} \in S}{\sum} {\frac{{I\left( {{f_{i}};{f_{s}}} \right)}}{{\min \left( {H\left( {{f_{i}}} \right),H\left( {{f_{s}}} \right)} \right)}}} } \right] $$
(12)

CMI and MIFS-CR are proposed, and their objective functions are shown below.

$$ CMI = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \frac{H(f_{i}|c)}{H(f_{i})}\underset{f_{s} \in S}{\sum} \frac{I(c;f_{s})I\left( f_{s};f_{i} \right)}{H(f_{s})H(c)} \right] $$
(13)
$$ MIFS\text{-}CR = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \frac{1}{2}\underset{f_{s} \in S}{\sum} \left( \frac{I(c;f_{s})}{H(f_{s})} + \frac{I(c;f_{i})}{H(f_{i})} \right) I\left( f_{s};f_{i} \right) \right] $$
(14)

Furthermore, considering that mRMR has the problem that the feature with the maximum difference is not always the feature with minimal redundancy and maximal relevance, a method of equal interval division is adopted to handle the cases where the objective function of mRMR is inaccurate, and an algorithm based on equal interval division and minimal-redundancy-maximal-relevance (EID-mRMR) [27] is proposed. Since relevance and redundancy are both considered, these algorithms achieve better feature selection performance.

There are also feature selection algorithms based on TDMI, such as dynamic weighting-based feature selection (DWFS) [28], IWFS, conditional mutual information maximization (CMIM) [29], JMI [30], maximal conditional mutual information (MCMI) [31], and MRI. DWFS and IWFS belong to the same category: they utilize symmetrical uncertainty, a normalized form of MI, to describe relevance.

JMI, CMIM, MCMI, and MRI fall into a different category and have different objective functions. Besides these four algorithms, this category also includes conditional informative feature extraction (CIFE) [32], JMIM, CFR [33], and dynamic change of selected feature with the class (DCSF) [34]; their objective functions are presented below.

$$ JMI = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};f_{i} \right) + \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};f_{i}|c \right) \right] $$
(15)
$$ CMIM = \arg \underset{{f_{i}} \in X}{\max} \left[ {\mathop {\min }\limits_{{f_{s}} \in S} \left( {I\left( {c;{f_{i}}|{f_{s}}} \right)} \right)} \right] $$
(16)
$$ MCMI = \arg \underset{{f_{i}} \in X}{\max} \left[ {\mathop {\max }\limits_{{f_{s}} \in S} \left( {I\left( {c;{f_{i}}|{f_{s}}} \right)} \right)} \right] $$
(17)
$$ MRI = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) + \underset{f_{s} \in S}{\sum} I\left( c;f_{i}|f_{s} \right) + \underset{f_{s} \in S}{\sum} I\left( c;f_{s}|f_{i} \right) \right] $$
(18)
$$ CIFE = \arg \underset{f_{i} \in X}{\max} \left[ I\left( c;f_{i} \right) - \underset{f_{s} \in S}{\sum} I\left( f_{s};f_{i} \right) + \underset{f_{s} \in S}{\sum} I\left( f_{s};f_{i}|c \right) \right] $$
(19)
$$ JMIM = \arg \underset{{f_{i}} \in X}{\max} \left[ {\mathop {\min }\limits_{{f_{s}} \in S} \left( {I\left( {{f_{i}},{f_{s}};c} \right)} \right)} \right] $$
(20)
$$ CFR = \arg \underset{f_{i} \in X}{\max} \left[ \underset{f_{s} \in S}{\sum} I\left( c;f_{i}|f_{s} \right) + \underset{f_{s} \in S}{\sum} I\left( c;f_{s};f_{i} \right) \right] $$
(21)

Since TDMI among features is not employed in (15)–(21), the feature selection effectiveness of these algorithms can be affected.
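For illustration, the per-candidate scores of two representative criteria, JMI in (15) and CMIM in (16), can be written with the MI/CMI helpers sketched earlier; f_i denotes a candidate column and S a non-empty list of selected columns, and these names are our own assumptions.

```python
def jmi_score(f_i, S, c):
    """JMI criterion (15): relevance minus redundancy plus class-conditional redundancy."""
    rel = mutual_information(c, f_i)
    red = np.mean([mutual_information(f_s, f_i) for f_s in S])
    cond = np.mean([conditional_mutual_information(f_s, f_i, c) for f_s in S])
    return rel - red + cond

def cmim_score(f_i, S, c):
    """CMIM criterion (16): the worst-case relevance conditioned on a selected feature."""
    return min(conditional_mutual_information(c, f_i, f_s) for f_s in S)
```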

4 The proposed algorithm

This section decomposes the JMI among the feature set and the class label as well as the MI between the feature set and the candidate feature, and derives an objective function. Then, based on this objective function, the proposed algorithm is presented.

4.1 The proposed objective function

The aim of some existing feature selection algorithms based on MI and TDMI is to select the feature set that has maximum MI with the class label, and these algorithms only consider the relevant information required to reach that maximum. However, selecting a candidate feature fi introduces not only relevant information but also redundant information. To introduce maximal relevant and minimal redundant information, we formulate the problem as (22).

$$ \arg \underset{f_{i} \in X}{\max} \left[ I(S,f_{i};c) - I(S;f_{i}) \right] $$
(22)

where S is the selected feature set and X is the candidate feature set. The total relevant information introduced is I(S,fi;c) and the total redundant information is I(S;fi). The greater the difference between the relevant and redundant information, the more informative the candidate feature is. Adopting (22) ensures that maximal relevant and minimal redundant information is introduced, thereby guaranteeing the effectiveness of the selected features. I(S,fi;c) satisfies (23).

$$ I(S,f_{i};c) = I(S;c) + I(f_{i};c|S) $$
(23)

I(fi;c|S) satisfies (24).

$$ I(f_{i};c|S) = I(c;f_{i}) - I(S;c) + I(S;c|f_{i}) $$
(24)

Equation (25) is derived by substituting (24) into (23).

$$ I(S,f_{i};c) = I(c;f_{i}) + I(S;c|f_{i}) $$
(25)

By combining (25) with (22), (26) is obtained.

$$ \arg \underset{f_{i} \in X}{\max} \left[ I(c;f_{i}) + I(S;c|f_{i}) - I(S;f_{i}) \right] $$
(26)

I(S;c|fi) satisfies (27).

$$ I(S;c|f_{i}) = \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};c|f_{i} \right) $$
(27)

By combining (25) with (27), (28) is derived.

$$ I(S,f_{i};c) = I(c;f_{i}) + \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};c|f_{i} \right) $$
(28)

I(S;fi) satisfies (29).

$$ I(S;f_{i}) = \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{i};f_{s} \right) + \frac{1}{\left| S \right|\left( \left| S \right| - 1 \right)}\underset{f_{s} \in S}{\sum} \underset{f_{j} \in S,f_{j} \ne f_{s}}{\sum} I\left( f_{j};f_{i}|f_{s} \right) $$
(29)

Since it is quite time-consuming to calculate the second part of (29), we replace (29) with (30).

$$ \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{i};f_{s} \right) + \frac{1}{\left| S \right|\left( \left| S \right| - 1 \right)}\underset{f_{s} \in S}{\sum} \underset{f_{j} \in S,f_{j} \ne f_{s}}{\sum} I\left( f_{j};f_{s}|f_{i} \right) $$
(30)

By substituting (28) and (30) into (22), (31) is obtained.

$$ \begin{array}{@{}rcl@{}} \arg \underset{f_{i} \in X}{\max} && \left[ I(c;f_{i}) - \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{i};f_{s} \right) + \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};c|f_{i} \right) \right.\\ && \left. - \frac{1}{\left| S \right|\left( \left| S \right| - 1 \right)}\underset{f_{s} \in S}{\sum} \underset{f_{j} \in S,f_{j} \ne f_{s}}{\sum} I\left( f_{j};f_{s}|f_{i} \right) \right] \end{array} $$
(31)

Figure 1 gives a brief description of the determination of (31). Equation (31) considers not only MI between features and the class label, CMI among features and the class label, and MI between features, but also CMI among features, whereas the objective functions of other feature selection algorithms do not consider TDMI among features. Compared with those objective functions, (31) therefore contains more useful information, and selecting the feature that satisfies (31) guarantees that maximal relevant and minimal redundant information is introduced. Accordingly, (31) is taken as the objective function.

Fig. 1 Determination of (31)
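For concreteness, the score inside (31) for a single candidate might be sketched as follows, reusing the earlier MI/CMI helpers; the data layout and names are our own assumptions, and the handling of the case |S| = 1, where the fourth term has no valid normalizer, corresponds to criterion (32) in Section 4.2.

```python
def cmi_mrmr_score(f_i, S, c):
    """Score inside (31) for candidate column f_i given selected columns S and labels c."""
    rel = mutual_information(c, f_i)                                    # I(c;f_i)
    red = np.mean([mutual_information(f_i, f_s) for f_s in S])          # (1/|S|) sum I(f_i;f_s)
    cond_rel = np.mean([conditional_mutual_information(f_s, c, f_i)     # (1/|S|) sum I(f_s;c|f_i)
                        for f_s in S])
    if len(S) > 1:                                                      # fourth term needs |S| >= 2
        pair_red = np.mean([conditional_mutual_information(f_j, f_s, f_i)
                            for f_s in S for f_j in S if f_j is not f_s])
    else:
        pair_red = 0.0                                                  # reduces to criterion (32)
    return rel - red + cond_rel - pair_red
```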

4.2 Algorithmic implementation

Based on (31), a feature selection algorithm based on CMI for MRMR (CMI-MRMR) is proposed and the flow chart is presented in Fig. 2.

Fig. 2 Flow chart of CMI-MRMR

In Fig. 2, we first initialize X and S. Then, we calculate MI between each feature and the class label and select the feature with the maximum value. Next, we evaluate (32), which omits the fourth term of (31), and select the feature that satisfies it. After that, we evaluate (31) and select the feature that meets the requirement, repeating until the specified number of features has been selected.

$$ \arg \underset{f_{i} \in X}{\max} \left[ I(c;f_{i}) - \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{i};f_{s} \right) + \frac{1}{\left| S \right|}\underset{f_{s} \in S}{\sum} I\left( f_{s};c|f_{i} \right) \right] $$
(32)

The pseudo-code of CMI-MRMR is shown in Algorithm 1.

Algorithm 1 CMI-MRMR

CMI-MRMR consists of three parts. In the first part (lines 1-6), S and X are initialized, MI between the class label and each feature is calculated, and the feature fk with the maximum value is selected. In the second part (lines 7-12), \(I(f_i;f_s)\) and \(I(f_s;c|f_i)\) are calculated; then (32) is evaluated and the feature fl that satisfies it is selected from X. In the third part (lines 13-25), \(I(f_j;f_s|f_i)\), \(I(f_s;c|f_i)\), and \(I(f_i;f_s)\) are calculated; then (31) is evaluated and the feature fm meeting the requirement is selected. Following these steps, the process ends when the number of selected features reaches N.
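A compact sketch of the whole selection loop, mirroring the three parts described above, is given below; the data layout and helper names follow the earlier sketches and are our own assumptions, with Algorithm 1 remaining the authoritative description. When only one feature has been selected, cmi_mrmr_score drops the fourth term of (31) and therefore reduces to criterion (32), which matches the second part.

```python
def cmi_mrmr(features, c, n_select):
    """Greedy CMI-MRMR selection mirroring the three parts of Algorithm 1."""
    n_features = features.shape[1]
    candidates = set(range(n_features))
    # Part 1: select the feature with maximum I(c; f_i).
    mi_c = {i: mutual_information(features[:, i], c) for i in candidates}
    first = max(candidates, key=mi_c.get)
    selected, candidates = [first], candidates - {first}
    # Parts 2 and 3: greedily add the candidate maximizing (32), then (31).
    while len(selected) < min(n_select, n_features) and candidates:
        S_cols = [features[:, s] for s in selected]
        best = max(candidates,
                   key=lambda i: cmi_mrmr_score(features[:, i], S_cols, c))
        selected.append(best)
        candidates.remove(best)
    return selected
```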

5 Experimental results

To validate the performance of CMI-MRMR, it is compared with mRMR, IWFS, JMIM, MCMI, MRI, CFR, and DCSF.

5.1 The datasets and experimental settings

The datasets in Table 1 are from the UCI machine learning repository [35] and the Arizona State University (ASU) feature selection datasets [36]. For all datasets, N is set to 50. The minimum description length discretization method [37] is used to transform numerical features into discrete ones. Three popular classifiers, J48, IB1, and Naive Bayes, are employed, with their parameters set to the default values of the Waikato Environment for Knowledge Analysis (WEKA) [38]. The ASU feature selection software package [39] is utilized.

Table 1 Description of the datasets

5.2 Experimental results and analysis

To reduce the influence of randomness on the final results, 10-fold cross-validation is repeated ten times, and the mean value and standard deviation of the ten results are taken as the final results. The classification accuracy of the features selected by these algorithms with J48, IB1, and Naive Bayes is presented in Tables 2–4. To determine whether the differences in the experimental results are significant, a one-sided paired t-test at the 5% significance level is performed, and the number of datasets on which CMI-MRMR performs better than, equal to, or worse than each compared algorithm is reported as Win/Tie/Loss (W/T/L). The average performance of the algorithms with the three classifiers is given in Fig. 3. Furthermore, the best-performing top features (from 1 to 50) selected by CMI-MRMR are compared with the full feature set, and the comparison results with the three classifiers are shown in Tables 5 and 6.
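A hedged sketch of this protocol is given below: ten repetitions of 10-fold cross-validation per algorithm and dataset, followed by a one-sided paired t-test at the 5% level on the per-repetition accuracies to fill the W/T/L counts. scikit-learn and scipy (version 1.6 or later for the alternative argument of ttest_rel) are our own tooling choices; the results reported in this paper were obtained with WEKA.

```python
from scipy.stats import ttest_rel
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def repeated_cv_accuracy(clf, X, y, feature_idx, seed=0):
    """Per-repetition mean accuracy over ten runs of 10-fold cross-validation."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=seed)
    scores = cross_val_score(clf, X[:, feature_idx], y, cv=cv, scoring="accuracy")
    return scores.reshape(10, 10).mean(axis=1)        # one value per repetition

def win_tie_loss(acc_ours, acc_other, alpha=0.05):
    """One-sided paired t-test on the two accuracy vectors: 'win', 'tie', or 'loss'."""
    _, p_win = ttest_rel(acc_ours, acc_other, alternative="greater")
    if p_win < alpha:
        return "win"
    _, p_loss = ttest_rel(acc_other, acc_ours, alternative="greater")
    return "loss" if p_loss < alpha else "tie"
```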

Fig. 3 Average performance comparisons of algorithms with the three classifiers

Table 2 Classification accuracy (%) of selected features with J48

As shown in Table 2, the Avg. values show that mRMR, JMIM, and CMI-MRMR achieve better results. In terms of the W/T/L values, mRMR, DCSF, and CMI-MRMR obtain better feature selection performance on a larger number of datasets.

In Table 3, mRMR, DCSF, and CMI-MRMR obtain greater Avg. and W/T/L values with IB1. In comparison with Table 2, CMI-MRMR outperforms mRMR, IWFS, and MCMI by a larger margin in the Avg. values and shows a larger advantage over IWFS in the W/T/L values.

Table 3 Classification accuracy (%) of selected features with IB1

The Avg. and W/T/L values in Table 4 show that mRMR, MRI, DCSF, and CMI-MRMR obtain better feature selection effectiveness. Compared with Tables 2 and 3, in terms of the Avg. values, CMI-MRMR has a larger advantage over mRMR, IWFS, JMIM, and MCMI. For the W/T/L values, CMI-MRMR obtains a larger performance gain than all other algorithms except DCSF.

Table 4 Classification accuracy (%) of selected features with Naive Bayes

As shown in Fig. 3, CMI-MRMR achieves better feature selection effectiveness on the majority of datasets, while the other algorithms fail to obtain the desired results on some datasets. Taking mRMR and CFR as examples, mRMR does not perform well on lung_discrete and orlraws10P, and CFR does not achieve the desired performance on lung_discrete and arcene.

As shown in Tables 5 and 6, the Avg. values show that the optimal features selected by CMI-MRMR obtain higher accuracy than the full feature set. Across these datasets, the number of datasets on which the features selected by CMI-MRMR outperform all features is 8 with J48, 4 with IB1, and 5 with Naive Bayes. Overall, although CMI-MRMR selects at most the top 50 features, its performance is comparable to that of the full feature set.

Table 5 Classification accuracy (%) of the optimal features selected by CMI-MRMR and all features
Table 6 The number of the optimal features selected by CMI-MRMR and all features

6 Conclusions and future work

This paper investigates feature selection based on TDMI among features and proposes a feature selection algorithm named CMI-MRMR. To verify its performance, we apply it to three classifiers, four UCI datasets, and twelve ASU datasets, and compare the results with those of several algorithms based on MI and TDMI. Experimental results validate that CMI-MRMR achieves better feature selection effectiveness. Furthermore, the optimal feature set selected by CMI-MRMR is compared with the full feature set; the comparison results show that CMI-MRMR achieves fairly good performance relative to all features in the majority of datasets, and even outperforms all features on some datasets.

Considering that CMI-MRMR achieves better feature selection performance, it can be applied in many fields, such as text processing, underwater object classification, network anomaly detection, gene expression analysis, and image classification. The classification accuracy of the best-performing top features (from 1 to 50) selected by CMI-MRMR is compared with that of all features; since the classification results of the selected features are worse than those of all features on some datasets, the determination of the number of selected features will be investigated in future work.