1 Introduction

Feature selection is an important approach to dimension reduction; it adopts metrics to measure features and selects informative ones [1, 2]. Feature selection algorithms can be classified into three categories: wrapper, embedded and filter. Since the accuracy of a classifier is used as the metric, wrapper and embedded algorithms are time-consuming. Filter algorithms are time-saving at the cost of some performance. The most commonly used metrics in filter algorithms include the Pearson correlation coefficient, Euclidean distance and mutual information. As mutual information can measure both linear and nonlinear correlation and is invariant under space transformations [3], it has an advantage over the other metrics. Therefore, many feature selection algorithms based on mutual information have been proposed [4, 5].

MIM [6] is a basic mutual information-based algorithm; it calculates the mutual information between each feature and the class label. On the basis of MIM, several feature selection algorithms based on relevance and redundancy have been proposed, such as MIFS [7], MIFS-U [8], mRMR [9], CMI [10] and MIFS-CR [11]. These algorithms introduce mutual information between features and use it to describe redundancy. Furthermore, in our previous work [12], we investigated feature selection based on mutual information and proposed an algorithm named EID-mRMR, which combines mutual information with a method of equal interval division and ranking to select features. Since three-dimensional mutual information, such as conditional mutual information, joint mutual information or three-way interaction information, is not exploited, the performance of these algorithms can be limited.

There are several feature selection algorithms based on three-dimensional mutual information, such as DWFS [13], IWFS [14], JMI [15], CMIM [16], CIFE [17], JMIM [18] and RelaxMRMR [19]. All of these algorithms consider three-dimensional mutual information among features and the class label, but only RelaxMRMR also considers three-dimensional mutual information among features.

Since our previous work [12] only considers mutual information and not three-dimensional mutual information, the objective function of EID-mRMR misses some useful information and its performance is affected. To further improve performance, this paper investigates feature selection based on three-dimensional mutual information. It extends EID-mRMR by combining three-dimensional mutual information among features with the method of equal interval division and ranking to select features. Specifically, an objective function based on conditional mutual information is first presented. Then, the objective function is analyzed and a criterion is derived that validates whether the objective function can guarantee the performance of the selected features. When it cannot, conditional mutual information among features and the class label as well as conditional mutual information among features is processed by the method of equal interval division and ranking. Finally, a feature selection algorithm named EID-CMI is proposed.

The main contributions of this paper are as follows: (1) Feature selection based on three-dimensional mutual information is investigated. (2) The method of equal interval division and ranking is adopted. (3) A feature selection algorithm based on conditional mutual information is proposed. (4) The proposed algorithm can achieve better feature selection performance.

The remainder of this paper is organized as follows: Sect. 2 presents some basic concepts of information theory. Related works of feature selection algorithms based on three-dimensional mutual information are introduced in Sect. 3. Section 4 presents the proposed algorithm. Experimental results are given and analyzed in Sect. 5. Section 6 gives conclusions and future work.

2 Some Basic Concepts of Information Theory

Mutual information is exploited to quantify the information shared by two variables, and the mutual information I(X;Y) between two variables X and Y is defined as

$$\begin{aligned} I(X;Y) = \sum \limits _{x \in X} {\sum \limits _{y \in Y} {p(x,y)\log \frac{{p(x,y)}}{{p(x)p(y)}}} } \end{aligned}$$
(1)

A greater mutual information value indicates that two variables share more information.

Conditional mutual information is adopted to quantify the information shared by two variables when a third variable is known, and I(X;Y|Z) is defined as

$$\begin{aligned} I(X;Y|Z) = \sum \limits _{x \in X} {\sum \limits _{y \in Y} {\sum \limits _{z \in Z} {p(x,y,z)\log \frac{{p(x,y|z)}}{{p(x|z)p(y|z)}}} } } \end{aligned}$$
(2)

Three-way interaction information is an extension of mutual information, and I(X;Y;Z) has the following relationship with the conditional mutual information I(X;Y|Z) [20]:

$$\begin{aligned} I(X;Y;Z) = I(X;Y|Z) - I(X;Y) \end{aligned}$$
(3)

where I(X;Y;Z) can be positive, negative or zero.
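To make these definitions concrete, the following minimal Python sketch (not part of the original algorithms; function and variable names are ours) estimates the quantities in Eqs. (1)–(3) from already discretized data using plug-in probability estimates.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in Eq. (1) for discrete sequences."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X;Y|Z) in Eq. (2): I(X;Y) within each stratum of Z, weighted by p(z)."""
    n = len(z)
    cmi = 0.0
    for zv, cnt in Counter(z).items():
        idx = [i for i in range(n) if z[i] == zv]
        cmi += (cnt / n) * mutual_information([x[i] for i in idx],
                                              [y[i] for i in idx])
    return cmi

def interaction_information(x, y, z):
    """Three-way interaction information I(X;Y;Z) via Eq. (3); may be negative."""
    return conditional_mutual_information(x, y, z) - mutual_information(x, y)
```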

3 Related Works

Feature selection algorithms based on three-dimensional mutual information include DWFS, IWFS, JMI, CMIM, CIFE, JMIM and RelaxMRMR. DWFS and IWFS belong to the same class: they utilize symmetric uncertainty [21] between features and the class label to describe relevance, and adopt three-way interaction information among features and the class label to describe interaction. Owing to the lack of evidence for how these two parts should be determined, their feature selection performance cannot be guaranteed.

JMI, CMIM, CIFE, JMIM and RelaxMRMR belong to the same class. Objective functions are the key to these algorithms, and their objective functions are presented as follows:

$$\begin{aligned}&JMI = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c\mathrm{{;}}{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}} \right) } + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}|c} \right) } } \right] \end{aligned}$$
(4)
$$\begin{aligned}&CMIM = \arg \mathop {\max }\limits _{{f_i} \in X} \left\{ {\mathop {\min }\limits _{{f_s} \in S} \left[ {I\left( {c;{f_i}|{f_s}} \right) } \right] } \right\} \end{aligned}$$
(5)
$$\begin{aligned}&CIFE = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {I\left( {c\mathrm{{;}}{f_i}} \right) - \sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}} \right) } + \sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}|c} \right) } } \right] \end{aligned}$$
(6)
$$\begin{aligned}&JMIM = \arg \mathop {\max }\limits _{{f_i} \in X} \left\{ {\mathop {\min }\limits _{{f_s} \in S} \left[ {I\left( {{f_i},{f_s};c} \right) } \right] } \right\} \end{aligned}$$
(7)
$$\begin{aligned}&RelaxMRMR = \arg \mathop {\max }\limits _{{f_i} \in X} I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) }\nonumber \\&\quad + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}|c} \right) } - \frac{1}{{\left| S \right| \left( {\left| S \right| - 1} \right) }}\sum \limits _{{f_s} \in S} {\sum \limits _{{f_k} \in S;k \ne s} {I\left( {{f_i};{f_k}|{f_s}} \right) } } \end{aligned}$$
(8)

where \(I\left( {c;{f_i}} \right) \) is the mutual information between a candidate feature \({f_i}\) and the class label c, \(I\left( {{f_s};{f_i}} \right) \) is the mutual information between a selected feature \({f_s}\) and \({f_i}\), \(I\left( {{f_s};{f_i}|c} \right) \) is the conditional mutual information between \({f_s}\) and \({f_i}\) given c, \(I\left( {c;{f_i}|{f_s}} \right) \) is the conditional mutual information between c and \({f_i}\) given \({f_s}\), \(I\left( {{f_i},{f_s};c} \right) \) is the joint mutual information between \(\left( {f_i},{f_s}\right) \) and c, \(I\left( {{f_i};{f_k}|{f_s}} \right) \) is the conditional mutual information between \({f_i}\) and another selected feature \({f_k}\) given \({f_s}\), \(\left| S \right| \) is the number of selected features, S is the selected feature set and X is the candidate feature set.

Among these five algorithms, only RelaxMRMR considers three-dimensional mutual information among features; JMI, CMIM, CIFE and JMIM do not. Since three-dimensional mutual information among features is considered, the objective function of RelaxMRMR contains more useful information.
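For illustration, the sketch below shows how the JMI (Eq. (4)) and CMIM (Eq. (5)) scores of a single candidate feature could be evaluated from precomputed information-theoretic quantities; the lookup-table names `mi_c`, `mi_ff`, `cmi_ff_c` and `cmi_cf_f` are assumptions of this sketch, not part of the cited algorithms, and could be filled with the routines of Sect. 2.

```python
def jmi_score(fi, S, mi_c, mi_ff, cmi_ff_c):
    """Eq. (4): I(c;f_i) - (1/|S|) sum I(f_s;f_i) + (1/|S|) sum I(f_s;f_i|c)."""
    if not S:
        return mi_c[fi]
    red = sum(mi_ff[(fs, fi)] for fs in S) / len(S)
    comp = sum(cmi_ff_c[(fs, fi)] for fs in S) / len(S)
    return mi_c[fi] - red + comp

def cmim_score(fi, S, mi_c, cmi_cf_f):
    """Eq. (5): min over selected f_s of I(c;f_i|f_s); I(c;f_i) when S is empty."""
    if not S:
        return mi_c[fi]
    return min(cmi_cf_f[(fi, fs)] for fs in S)
```

The feature added at each greedy step is simply the candidate maximizing the chosen score.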

4 The Proposed Feature Selection Algorithm

This section first presents our proposed objective function and a criterion that validates whether the objective function can guarantee the effectiveness of selected features. Then, a method of equal interval division and ranking is given. Finally, our proposed feature selection algorithm, EID-CMI, is presented.

4.1 The Proposed Objective Function and Criterion

First, we analyze the first three selected features. Since it provides the most information, the feature that has the maximum mutual information with the class label c is selected first; it is denoted \({f_s}\). When selecting the second feature, Eq. (9) is satisfied.

$$\begin{aligned} I\left( {c;T} \right) = I\left( {c;{f_i},{f_s}} \right) = I\left( {c;{f_s}} \right) + I\left( {c;{f_i}|{f_s}} \right) \end{aligned}$$
(9)

where \({f_i}\) is a candidate feature. Since \(I\left( {c;{f_s}} \right) \) is already maximal, \(I\left( {c;{f_i}|{f_s}} \right) \) should be maximized so that \(I\left( {c;T} \right) \) is maximal. Therefore, the feature that has the maximum conditional mutual information with c given \({f_s}\) is selected.

When selecting the third feature, Eq. (10) is satisfied.

$$\begin{aligned} I\left( {c;T} \right) = I\left( {c;{f_i},{f_s},{f_k}} \right) \end{aligned}$$
(10)

Since Eq. (10) is difficult to calculate directly, we introduce Eq. (11); selecting the feature that maximizes it corresponds to the RelaxMRMR criterion in Eq. (8).

$$\begin{aligned} J({f_i})= & {} I\left( {c;{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}} \right) } + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s};{f_i}|c} \right) } \nonumber \\&- \frac{1}{{\left| S \right| \left( {\left| S \right| - 1} \right) }}\sum \limits _{{f_s} \in S} {\sum \limits _{{f_k} \in S;k \ne s} {I\left( {{f_i};{f_k}|{f_s}} \right) } } \end{aligned}$$
(11)

where \({{f_k}}\) is a selected feature and S is the selected feature set.

Using Eq. (3), the first three terms of Eq. (11) can be rewritten as Eq. (12).

$$\begin{aligned}&I\left( {c\mathrm{{;}}{f_i}} \right) - \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}} \right) } + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i}|c} \right) } \nonumber \\&\quad = I\left( {c\mathrm{{;}}{f_i}} \right) + \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {{f_s}\mathrm{{;}}{f_i};c} \right) } = \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {c;{f_i}|{f_s}} \right) } \end{aligned}$$
(12)

To simplify computation, Eq. (13) is therefore introduced for selecting the third and subsequent features; it can be understood as the effective information brought in by selecting a feature. The first part is the average amount of relevant information, while the second part can be considered the average amount of redundant information.

$$\begin{aligned} J({f_i}) = \frac{1}{{\left| S \right| }}\sum \limits _{{f_s} \in S} {I\left( {c;{f_i}|{f_s}} \right) } - \frac{1}{{\left| S \right| \left( {\left| S \right| - 1} \right) }}\sum \limits _{{f_s} \in S} {\sum \limits _{{f_k} \in S;k \ne s} {I\left( {{f_k};{f_i}|{f_s}} \right) } } \end{aligned}$$
(13)

When the maximum of Eq. (13) is far greater than the second largest value, noticeably more information is gained by selecting the maximizing feature, which guarantees the effectiveness of the selected features; otherwise, the advantage is not obvious. To guarantee the performance of the selected features, a criterion is used that checks whether the difference between the maximum and the second largest value is greater than a fixed value P, as in our previous work [12]. If the difference is greater than P, Eq. (13) is exploited to select the feature; otherwise, the method of equal interval division and ranking, introduced in the next section, is adopted. A minimal sketch of this check is given below.
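The snippet is only an illustration of the criterion as we describe it; the function name and default threshold are ours.

```python
import numpy as np

def objective_is_decisive(scores, P=0.04):
    """True when the best Eq. (13) score exceeds the runner-up by more than P."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]
    return ordered.size < 2 or (ordered[0] - ordered[1]) > P
```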

4.2 Equal Interval Division and Ranking

When the performance of the selected features cannot be guaranteed, the method of equal interval division and ranking is applied to the two parts of Eq. (13), yielding CFF-CMI\(\left( {c;{f_i}|{f_s}} \right) \) and FFF-CMI\(\left( {{f_k};{f_i}|{f_s}} \right) \). Their computing processes are shown in Algorithms 1 and 2.

[Algorithm 1: pseudocode figure]
[Algorithm 2: pseudocode figure]

The number of the dataset's features is taken as the number of subintervals. Taking Algorithm 1 as an example, the method of equal interval division and ranking works as follows: the maximum and minimum of the average values of \(I\left( {c;{f_i}|{f_s}} \right) \) define the interval, and this interval is divided into equal-width subintervals. The subintervals are then indexed from 1 to the number of the dataset's features, and the index of the subinterval into which a value falls is taken as that value's ordinal rank.

Processing these two parts with the method of equal interval division and ranking increases the difference among the values of each part and guarantees that features falling in the same subinterval have the same priority. Therefore, Eq. (14) is adopted when the effectiveness of the selected features cannot be guaranteed. When only one feature satisfies the condition, that feature is selected; otherwise, the feature that maximizes Eq. (13) is selected from among these features. A sketch of the ranking step follows Eq. (14).

$$\begin{aligned} J({f_i}) = \arg \mathop {\max }\limits _{{f_i} \in X} \left[ {\text {CFF-CMI}\left( {c;{f_i}|{f_s}} \right) - \text {FFF-CMI}\left( {{f_k};{f_i}|{f_s}} \right) } \right] \end{aligned}$$
(14)
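Under our reading of Algorithms 1 and 2, the equal interval division and ranking step can be sketched as follows; the function name and the handling of the degenerate case where all values coincide are our assumptions.

```python
import numpy as np

def equal_interval_rank(values, n_features):
    """Map each value to the ordinal index (1..n_features) of its equal-width subinterval."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:                        # degenerate case: all values identical
        return np.ones(len(values), dtype=int)
    width = (hi - lo) / n_features
    idx = np.floor((values - lo) / width).astype(int) + 1
    return np.clip(idx, 1, n_features)  # the maximum value would otherwise land in bin n+1

# CFF-CMI and FFF-CMI in Eq. (14) would then be the ranks of the two averaged
# terms of Eq. (13), e.g.:
#   cff = equal_interval_rank(avg_cmi_with_class, n_features)
#   fff = equal_interval_rank(avg_cmi_between_features, n_features)
```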

4.3 Algorithmic Implementation

The proposed objective function and criterion are combined with the method of equal interval division and ranking to obtain a feature selection algorithm named EID-CMI, which is shown in Algorithm 3.

[Algorithm 3: EID-CMI pseudocode figure]

EID-CMI consists of two parts. In the first part (lines 1–10), the candidate feature set X and the selected feature set S are initialized; \(I\left( {c;{f_i}} \right) \) is calculated and the feature with the maximum value is selected; then \(I\left( {c;{f_i}|{f_s}} \right) \) is calculated and the feature with the maximum value is taken from X. In the second part (lines 11–31), \(I\left( {{f_k};{f_i}|{f_s}} \right) \) is calculated together with the difference between the maximum and the second largest value of Eq. (13). If the difference is greater than a fixed value P, the feature maximizing Eq. (13) is selected; if not, CFF-CMI\(\left( {c;{f_i}|{f_s}} \right) \), FFF-CMI\(\left( {{f_k};{f_i}|{f_s}} \right) \) and their difference are calculated, and the number of features attaining the maximum difference is examined: if it is one, that feature is selected, otherwise the feature with the maximum value of Eq. (13) among them is selected. This loop continues until a specified number N of features has been selected. A compact sketch of the loop is given below.
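The following sketch reflects our reading of Algorithm 3 and is not the reference implementation. It assumes the lookup tables `mi_c` (I(c;f)), `cmi_cf_f` (I(c;f|f_s)) and `cmi_ff_f` (I(f;f_k|f_s)) have been precomputed with the routines of Sect. 2, and it takes the `equal_interval_rank` function sketched in Sect. 4.2 as an argument.

```python
def eq13_score(fi, S, cmi_cf_f, cmi_ff_f):
    """Eq. (13): average I(c;f_i|f_s) minus average I(f_k;f_i|f_s) over ordered pairs in S."""
    rel = sum(cmi_cf_f[(fi, fs)] for fs in S) / len(S)
    if len(S) < 2:
        return rel
    red = sum(cmi_ff_f[(fi, fk, fs)] for fs in S for fk in S if fk != fs)
    return rel - red / (len(S) * (len(S) - 1))

def eid_cmi(features, mi_c, cmi_cf_f, cmi_ff_f, equal_interval_rank,
            n_select=50, P=0.04):
    X = list(features)
    first = max(X, key=lambda f: mi_c[f])                     # first feature: max I(c;f)
    S = [first]; X.remove(first)
    if X:
        second = max(X, key=lambda f: cmi_cf_f[(f, first)])   # second: max I(c;f|f_s)
        S.append(second); X.remove(second)
    while X and len(S) < n_select:
        scores = {f: eq13_score(f, S, cmi_cf_f, cmi_ff_f) for f in X}
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) < 2 or ranked[0] - ranked[1] > P:      # Eq. (13) is decisive
            best = max(scores, key=scores.get)
        else:                                                 # fall back to Eq. (14)
            feats = list(scores)
            rel = [sum(cmi_cf_f[(f, fs)] for fs in S) / len(S) for f in feats]
            red = [sum(cmi_ff_f[(f, fk, fs)] for fs in S for fk in S if fk != fs)
                   / (len(S) * (len(S) - 1)) for f in feats]
            diff = (equal_interval_rank(rel, len(features))
                    - equal_interval_rank(red, len(features)))
            top = diff.max()
            tied = [f for f, d in zip(feats, diff) if d == top]
            best = tied[0] if len(tied) == 1 else max(tied, key=scores.get)
        S.append(best); X.remove(best)
    return S
```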

5 Experimental Results

This section describes the datasets and experimental settings, and compares EID-CMI with several algorithms.

5.1 The Datasets and Experimental Settings

The datasets in Table 1 are commonly used in feature selection; they come from the UCI machine learning repository [22] and the ASU feature selection datasets [23] and span different fields, such as face images and digit recognition. Their numbers of instances vary from 100 to 7000, and their numbers of features vary from 57 to 10304. For datasets with continuous features, the minimum description length discretization method [24] is employed. Three classifiers, J48, IB1 and Naive Bayes, are adopted, with their parameters set to WEKA's [25] default values. The ASU feature selection software package [26] is utilized, and N is set to 50.

Table 1 Summary of datasets in the experiment

To reduce the influence of randomness on the final results, 10-fold cross-validation is conducted ten times, and the mean values and standard deviations of the ten results are taken as the final results. To determine whether differences in the experimental results are significant, a one-sided paired t-test at the 5% significance level is carried out, and the numbers of datasets on which EID-CMI performs better than, equal to and worse than the other algorithms are reported as Win/Tie/Loss (W/T/L).
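As an illustration of this protocol, the sketch below runs ten repetitions of 10-fold cross-validation and a one-sided paired t-test. The scikit-learn 1-nearest-neighbour classifier is only a stand-in for WEKA's IB1, and all names are ours.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier   # stand-in for IB1

def repeated_cv_accuracy(X_selected, y, n_repeats=10, n_folds=10):
    """Mean accuracy of each of ten 10-fold cross-validation runs."""
    accs = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
        accs.append(cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                    X_selected, y, cv=cv).mean())
    return np.array(accs)

def wins_over(acc_a, acc_b, alpha=0.05):
    """One-sided paired t-test: does algorithm A significantly beat B?"""
    t, p_two_sided = stats.ttest_rel(acc_a, acc_b)
    return t > 0 and p_two_sided / 2 < alpha
```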

5.2 Experimental Results and Analysis

5.2.1 With Different P Values

To analyze the influence of the value of P on EID-CMI, we set P to 0.02, 0.03, 0.04, 0.05 and 0.06, respectively, and conduct experiments with J48, IB1 and Naive Bayes. The accuracy is calculated for 1 to 50 selected features with each of the three classifiers, and the mean value is taken as the final result. The average performance of EID-CMI with different P values is presented in Table 2; the value in the Avg. row is the mean of these twenty values.

Table 2 Average performance with different P values

In Table 2, there is little difference among the Avg. values. When P is 0.04, EID-CMI achieves the best Avg. value. Therefore, in the comparisons with other feature selection algorithms, P is set to 0.04.

5.2.2 Comparison with Feature Selection Algorithms

To validate the feature selection performance of EID-CMI, several three-dimensional mutual information-based algorithms, JMI [15], CMIM [16], CIFE [17], JMIM [18], RelaxMRMR [19], DWFS [13] and IWFS [14], are compared. The average accuracies of the selected features with J48, IB1 and Naive Bayes are presented in Tables 3, 4 and 5, respectively. The average performance on the twenty datasets with the three classifiers is shown in Fig. 1. The accuracy is calculated for 1 to 50 selected features and the optimal value is recorded; the optimal accuracy of EID-CMI and the other algorithms when using J48, IB1 and Naive Bayes is presented in Tables 6, 7 and 8. The running time of EID-CMI and the other algorithms when selecting 50 features is shown in Table 9.

Table 3 Margins (mean ± std. dev.) between the performance of algorithms when using J48

As shown in Table 3, CMIM, RelaxMRMR and EID-CMI obtain better results than the other algorithms in terms of the Avg. and W/T/L values. The Avg. value of EID-CMI is 0.64% greater than that of RelaxMRMR. In the W/T/L values, EID-CMI is superior to RelaxMRMR on 7 datasets, while RelaxMRMR is superior to EID-CMI on 2; thus, EID-CMI has the advantage over RelaxMRMR.

Table 4 Margins (mean ± std. dev.) between the performance of algorithms when using IB1

In Table 4, CMIM, RelaxMRMR and EID-CMI obtain greater Avg. values, and the W/T/L values show that RelaxMRMR and EID-CMI achieve better feature selection performance. Compared with Table 3, in terms of the Avg. values, EID-CMI gains a larger advantage over the other algorithms; in terms of the W/T/L values, EID-CMI outperforms all the other algorithms except IWFS by a larger margin.

Table 5 Margins (mean ± std. dev.) between the performance of algorithms when using Naive Bayes

The Avg. and W/T/L values in Table 5 show that CMIM, RelaxMRMR and EID-CMI achieve better results. Different from Tables 3 and 4, in terms of the Avg. values, EID-CMI gains a larger advantage over the other algorithms, and the W/T/L values show that EID-CMI is superior to all the other algorithms except CMIM by a larger margin.

Tables 6, 7 and 8 show that EID-CMI generally performs better than the other algorithms. Taking Table 7 as an example, EID-CMI obtains better results than the other algorithms on 9 of the twenty datasets. Furthermore, EID-CMI achieves greater Avg. values than the other algorithms with all three classifiers.

Table 6 Classification accuracy (%) of the selected optimal features when using J48
Table 7 Classification accuracy (%) of the selected optimal features when using IB1
Table 8 Classification accuracy (%) of the selected optimal features when using Naive Bayes
Table 9 Running time (s) of EID-CMI and other algorithms

In Table 9, owing to calculating both conditional mutual information among features and the class label and conditional mutual information among features, EID-CMI and RelaxMRMR are more time-consuming than the other algorithms, as the Avg. values show. Since it only calculates conditional mutual information among features and the class label, CMIM takes the least time for feature selection.

Fig. 1 Average performance comparisons of the algorithms with the three classifiers

As shown in Fig. 1, EID-CMI achieves better feature selection performance overall, while the other algorithms do not perform well on some datasets. For example, JMI does not obtain the desired performance on Movement_libras, Musk, Mfeat_fac, Semeion and ORL, and RelaxMRMR does not achieve the desired results on Movement_libras, Semeion and ORL.

In addition to the algorithms above, EID-CMI is compared with the algorithm in [27], MCMI [28], MRI [29], MIM [6] and EID-mRMR [12]. For the algorithm in [27], the parameter c is set to 0.4 and \(\eta \) to 0.2; for EID-mRMR, the parameter P is set to 0.5. The accuracies of the features selected by EID-CMI and these algorithms with J48, IB1 and Naive Bayes are presented in Tables 10, 11 and 12, and the accuracies of the optimal selected features when using J48, IB1 and Naive Bayes are presented in Tables 13, 14 and 15. The running time of EID-CMI and these algorithms when selecting 50 features is given in Table 16.

Table 10 Margins (mean ± std. dev.) between the performance of algorithms when using J48

In Table 10, EID-mRMR and EID-CMI achieve better Avg. and W/T/L values. EID-CMI is superior to EID-mRMR on 8 datasets and inferior on 2; thus, EID-CMI achieves better feature selection performance than EID-mRMR.

Table 11 Margins (mean ± std. dev.) between the performance of algorithms when using IB1

As shown in Table 11, EID-mRMR and EID-CMI obtain better Avg. and W/T/L values. Compared with Table 10, in terms of the Avg. values, EID-CMI outperforms all the algorithms except EID-mRMR by a larger margin, and in the W/T/L values, EID-CMI gains more of an advantage over the algorithm in [27], MRI and EID-mRMR.

Table 12 Margins (mean ± std. dev.) between the performance of algorithms when using Naive Bayes

The Avg. and W/T/L values in Table 12 show that EID-mRMR and EID-CMI achieve better feature selection performance. Different from Tables 10 and 11, in the Avg. values, EID-CMI gains a larger advantage over all the algorithms except EID-mRMR, and in terms of the W/T/L values, EID-CMI performs better than MCMI, MRI and MIM.

Table 13 Classification accuracy (%) of the selected optimal features when using J48
Table 14 Classification accuracy (%) of the selected optimal features when using IB1
Table 15 Classification accuracy (%) of the selected optimal features when using Naive Bayes

Tables 13, 14 and 15 show that EID-CMI and EID-mRMR generally perform better than the other algorithms. Taking Table 13 as an example, EID-CMI obtains better results than the other algorithms on 7 of the twenty datasets. Furthermore, EID-CMI and EID-mRMR obtain greater Avg. values with all three classifiers.

Table 16 Running time (s) of EID-CMI and other five algorithms

In Table 16, since both conditional mutual information among features and the class label and conditional mutual information among features are calculated, EID-CMI takes more time than the other algorithms for feature selection. Since it only calculates mutual information between features and the class label, MIM takes the least time.

Compared with the other algorithms, EID-CMI and RelaxMRMR consider three-dimensional mutual information among features; their objective functions therefore contain more useful information, and they achieve better results than the majority of algorithms at the cost of more running time. EID-CMI adopts the objective function of RelaxMRMR when it can guarantee the performance of the selected features, and combines it with equal interval division and ranking when it cannot; as a result, EID-CMI obtains better feature selection performance than RelaxMRMR.

6 Conclusions and Future Work

This paper introduces a method of equal interval division and ranking into feature selection based on three-dimensional mutual information and proposes a feature selection algorithm named EID-CMI. To verify its performance, EID-CMI is applied to eight UCI datasets, twelve ASU datasets and three classifiers, and compared with several feature selection algorithms. Experimental results demonstrate that EID-CMI achieves better feature selection effectiveness.

Given its better feature selection performance, EID-CMI can be applied to fields such as text processing [30], network anomaly detection [31], underwater object recognition and classification [32, 33], steganalysis [34, 35] and information retrieval [36]. Considering that EID-CMI is more time-consuming, we will take both performance and time consumption into account when investigating feature selection in the next stage.