
1 Introduction

Feature selection is of crucial importance for high-dimensional data classification. It can reduce the size of the feature space, thereby reducing the computational cost and time complexity of the learning model. Moreover, it aims to identify the smallest subset of features that can explain the target, thus enhancing the model interpretability.

Information entropy based feature selection methods are widely used. They are easy to interpret and implement, thanks to their information-theoretical framework. These approaches are generally classified as filter methods. They are independent of the subsequent learning algorithm, which helps reduce the computational cost. Information entropy based methods generally use information criteria to measure the relevance between features and target classes, and the interaction within features. Mutual information is widely used to measure the relevance between two variables in information entropy based feature selection. For example, the Relevance criterion (REL) [1] considers the relevance between features and the targets, while Maximum relevance and minimum redundancy (mRMR) [2] and CIFE [3] utilize mutual information to measure both the relevance between features and the target class and the redundancy between features. However, most existing approaches measure the feature interactions only through the correlation between features, without considering their joint dependencies on the target. That is to say, they ignore the classification complementarity among the features. In contrast, wrapper and embedded methods, such as the variable importance measures in random forests [4, 5], consider the feature interactions conditioned on the target classes. They often obtain more satisfactory classification results along with the learning algorithm.

In this paper, we propose a new feature selection method, Maximum-Relevance and Maximum-Complementarity (MRMC). The main contributions are as follows.

  1. We give the formal definition of information complementarity in the information-theoretical framework. Besides the relevance with the target, MRMC takes into consideration the complementary information that a candidate feature provides to a selected feature or a selected feature subset for predicting the target.

  2. To measure the information complementarity conditioned on the target classes, we present an efficient approach to approximate the complementarity between pairs of features with random forests. Thus MRMC can be viewed as a hybrid feature selection approach that takes the advantages of both entropy based methods and random forests.

  3. Experimental results on 18 datasets, in comparison with classical and state-of-the-art approaches, i.e. CMIM [15], mRMR [2], DISR [16], JMI [17] and RF-RFE [11], demonstrate the effectiveness and efficiency of our method.

The paper is structured as follows: Sect. 2 states the related work; Sect. 3 describes the proposed feature selection algorithm; extensive experimental results are provided in Sect. 4; and Sect. 5 summarizes our work.

2 The Information-Theoretical Framework for Feature Selection

In this section we review the related work on information entropy based feature selection. To understand it, we first introduce some basic concepts of relevant and redundant features from the perspective of information theory.

Definition 1

Given a discrete random variable X and its probability distribution \(p(x)=P(X=x)\) with domain X, the entropy of X is defined as:

$$\begin{aligned} H(X) = -\sum _{x\in X}{p(x)\log _{2}{p(x)}} \end{aligned}$$
(1)

H(X) indicates the amount of information needed to eliminate its uncertainty, that is, the amount of information that X may contain.
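For illustration, a minimal plug-in estimate of Eq. (1) from a sample of discrete values can be written as follows (the helper name `entropy` is ours, not part of any cited library):

```python
import numpy as np

def entropy(x):
    """Plug-in estimate of H(X) in bits from a 1-D array of discrete samples."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

# A fair coin carries one bit of entropy.
print(entropy(np.array([0, 1, 0, 1])))  # 1.0
```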

Definition 2

Given discrete random variables X and Y with domains X and Y and their joint probability distribution \(p(x,y)=P(X=x,Y=y)\), the conditional entropy of Y given X is defined as:

$$\begin{aligned} H(Y|X) = -\sum _{y\in Y}\sum _{x\in X}{p(x,y)\log _{2}{p(y|x)}} \end{aligned}$$
(2)

It holds that \(0\le H(Y|X)\le H(Y)\). \(H(Y|X)=0\) if \(p(y|x)=1\) or \(p(y|x)=0\) for every pair (x, y); in other words, the value of Y is determined once the value of X is given. \(H(Y|X)=H(Y)\) if X and Y are independent.

Definition 3

Given discrete random variables X and Y, their mutual information is defined as:

$$\begin{aligned} I(X;Y)=\sum _{y\in Y}\sum _{x\in X}{p(x,y)\log {\frac{p(x,y)}{p(x)p(y)}}} \end{aligned}$$
(3)

I(X;Y) can be interpreted as the amount by which the uncertainty of Y is reduced due to X. Therefore, I(X;Y) represents the relevance between X and Y.

Definition 4

Given random variables X, Y and Z, the conditional mutual information of X and Y given Z is defined as:

$$\begin{aligned} \begin{aligned} I(X;Y|Z)&= H(Y|Z)-H(Y|Z,X) \\&=H(X|Z)-H(X|Z,Y)=I(Y;X|Z) \end{aligned} \end{aligned}$$
(4)

The conditional mutual information quantifies the reduction in the uncertainty of Y (resp. X) owing to the variable X (resp. Y) when Z is known.
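As a sketch, Eqs. (3) and (4) can be estimated with plug-in joint entropies, assuming discrete (integer-coded) samples; the function names below are our own:

```python
import numpy as np

def joint_entropy(*columns):
    """Plug-in joint entropy (in bits) of one or more discrete sample arrays."""
    joint = np.stack(columns, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return joint_entropy(x) + joint_entropy(y) - joint_entropy(x, y)

def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (joint_entropy(x, z) + joint_entropy(y, z)
            - joint_entropy(x, y, z) - joint_entropy(z))
```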

Definition 5

(Chain rule for mutual information)

Given a set of random variables \(X_S = \{X_1,X_2,\cdots ,X_n\}\) and a random variable Y, the mutual information of \(X_S\) and Y satisfies:

$$\begin{aligned} \begin{aligned} I(X_S;Y)&= I(X_1,X_2,\cdots ,X_n;Y) \\&= \sum _{i=1}^{n}{I(X_i;Y|X_{i-1},X_{i-2},\cdots ,X_1)} \end{aligned} \end{aligned}$$
(5)

The chain rule for mutual information indicates that the amount of information the random variable set \(X_S\) provides about Y equals the sum of pairwise conditional mutual information terms between Y and each variable. This is an important tool that has been used in many feature selection methods when considering the influence of a feature subset on the target class.

2.1 The Relevant Information of Features

In feature selection, the mutual information \(I(X_S;Y)\) can be used to measure the dependency between an input feature subset \(X_S\) and the output target Y. Let \(X_k\) be a candidate feature and \(X_S\subset \{X_1,\cdots ,X_{k-1},X_{k+1},\cdots ,X_n\}\) be the selected feature subset; the relevance of \(X_k\) is calculated as:

$$\begin{aligned} I(X_k;Y|X_S) = I(\{X_S,X_k\};Y)-I(X_S;Y) \end{aligned}$$
(6)

It measures how much the candidate feature \(X_k\) is relevant to Y when \(X_S\) is given. From it, we can observe that relevance in feature selection is conditional, as pointed out by [18, 19]. A candidate feature may be strongly relevant, weakly relevant or irrelevant when conditioned on different contexts \(X_S\).

The Relevance criterion (REL) uses \(I(X_k;Y|X_S)\) combined with forward feature selection directly, known as maximal dependency [1]. That is, at each step the selected feature \(X_k\) satisfies:

$$\begin{aligned} J_{REL}(X_k) = \max _{X_k\in X_{-S}}{I(X_k;Y|X_S)} \end{aligned}$$
(7)

where \(X_{-S}=X\backslash X_S\). A major drawback is that we need to estimate a multivariate probability density to measure the subset-conditional mutual information, which is almost impossible with limited data if the features are all correlated with each other. To address this problem, bi-variate or tri-variate probability densities are employed in existing methods [2, 20]. For example, according to the Conditional Mutual Information Maximization criterion (CMIM) [15], at each step the candidate feature is selected as follows:

$$\begin{aligned} J_{CMIM}(X_k) = \max _{X_k\in X_{-S}}\min _{X_j\in X_{S}}{I(X_k;Y|X_j)} \end{aligned}$$
(8)

It can be seen that CMIM uses \(\min _{X_j\in X_{S}}{I(X_k;Y|X_j)}\) in place of \(I(X_k;Y|X_S)\), because the latter is hard to compute as the selected feature subset grows.
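A minimal sketch of the greedy CMIM selection of Eq. (8) is given below, written so that any estimators of \(I(X_k;Y)\) and \(I(X_k;Y|X_j)\) (e.g. the plug-in ones sketched above) can be passed in; this is an illustration, not the original implementation of [15]:

```python
import numpy as np

def cmim_select(X, y, num_features, mi, cmi):
    """Greedy CMIM (Eq. 8). mi(a, b) estimates I(a;b); cmi(a, b, c) estimates I(a;b|c)."""
    n = X.shape[1]
    selected = []
    # Running value of min_{X_j in S} I(X_k;Y|X_j); initialised with I(X_k;Y)
    # since the minimum over an empty selected set is taken as the relevance.
    score = np.array([mi(X[:, k], y) for k in range(n)])
    for _ in range(num_features):
        k = int(np.argmax(score))
        selected.append(k)
        score[k] = -np.inf                       # never reselect this feature
        for j in range(n):
            if np.isfinite(score[j]):
                score[j] = min(score[j], cmi(X[:, j], y, X[:, k]))
    return selected
```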

2.2 The Redundant Information Within Features

Neglecting the feature interactions may cause redundancy in the selected feature subset. Some methods focus on removing feature redundancy while taking into account feature relevance with the predictive targets [19, 21, 22]. For example, redundancy can be computed through the mutual information between the candidate feature and the selected feature subset as follows,

$$\begin{aligned} R(X_S;X_k) = \alpha \sum _{X_j\in X_S}{I(X_j;X_k)} \end{aligned}$$
(9)

where \(X_S\) is the selected feature subset and \(X_k\) is a candidate feature. This measurement of redundancy has been applied in mRMR [2], whose objective is as follows:

$$\begin{aligned} J_{MRMR}(X_k) = I(X_k;Y)-\frac{1}{|X_S|}\sum _{X_j\in X_S}{I(X_k;X_j)} \end{aligned}$$
(10)

It treats the mutual information between features as redundant information. However, the target class Y is neglected in measuring the redundancy. Figure 1 gives an intuitive example. The two shaded parts in Fig. 1 represent \(I(X_1;X_2)\), but only the red shaded part contributes to classifying the target class Y. Therefore, some methods, e.g. CIFE [23], MIFS-U [24], mIMR [25] and IGFS [26], all use the joint mutual information \(I(X_i;X_k;Y)\) to measure the feature redundancy, which can be defined as follows:

$$\begin{aligned} I(X_i;X_k;Y) = I(X_i;Y)+I(X_k;Y)-I(X_i,X_k;Y) \end{aligned}$$
(11)

Similarly, given a selected feature subset \(X_S\), the joint mutual information \(I(X_S;X_k;Y)\) measures the feature redundancy of the candidate feature \(X_k\) when \(X_S\) and Y are given. In this sense, the joint mutual information can be regarded as the shared discriminative information of \(\{X_S,X_k\}\) about Y. However, \(I(X_S;X_k;Y)\) is likewise hard to compute, so the sum of pairwise redundancies between features is usually calculated to approximate it. For example, the criterion of CIFE is defined as:

$$\begin{aligned} J_{CIFE}(X_k)=I(X_k;Y)-\sum _{X_j\in X_S}I(X_k;X_j;Y) \end{aligned}$$
(12)

In [19, 27], the Markov blanket was also used to evaluate the redundancy between features.
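For concreteness, the mRMR and CIFE scores of Eqs. (10)-(12) can be sketched as follows, with `mi` and `joint_mi` standing in for any estimators of \(I(\cdot ;\cdot )\) and \(I((\cdot ,\cdot );\cdot )\) (hypothetical callables, not a specific library API):

```python
import numpy as np

def mrmr_score(k, selected, X, y, mi):
    """J_mRMR (Eq. 10): relevance minus mean pairwise redundancy."""
    relevance = mi(X[:, k], y)
    if not selected:
        return relevance
    return relevance - np.mean([mi(X[:, k], X[:, j]) for j in selected])

def cife_score(k, selected, X, y, mi, joint_mi):
    """J_CIFE (Eq. 12), using the interaction term of Eq. (11):
    I(X_j;X_k;Y) = I(X_j;Y) + I(X_k;Y) - I(X_j,X_k;Y)."""
    relevance = mi(X[:, k], y)
    penalty = sum(mi(X[:, j], y) + relevance - joint_mi(X[:, j], X[:, k], y)
                  for j in selected)
    return relevance - penalty
```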

3 Maximum-Relevance and Maximum-Complementarity Feature Selection

In this section, we first give the definition of feature complementarity; two features with more complementarity exhibit less redundancy. Then we give the objective of MRMC and propose to use random forests to approximate the complementarity score between any pair of features. Finally, for simplicity, a sequential forward search strategy is employed to maximize the objective function.

3.1 The Complementary Information of Features

Besides relevancy and redundancy, several works have proposed definitions of complementarity from different aspects. Paper [16] proposes that complementarity is the beneficial effect of feature interaction. From this perspective, the complementary information is the negative interaction information. The definition is as follows,

Definition 6

Suppose \(X_1,X_2,\cdots ,X_n\) are random variables; they are complementary if

$$(-1)^nI(X_1;X_2;\cdots ;X_n)>0$$

That is, for two variables, the classification information about the target provided by the two features together is greater than the sum of the classification information provided by the two features individually. If \(I(X_i,X_j;Y)>I(X_i;Y)+I(X_j;Y)\), the features \(X_i\) and \(X_j\) are said to be complementary.

However, it is hard to compute the complementarity of a large feature set. In [28], the authors proposed JMI, which defines the complementary information as the information shared by two features \(X_i\) and \(X_j\) about the target, i.e. \(I(X_i,X_j;Y)\). Thus they defined the complementary information that \(X_k\) provides to the selected subset \(X_S\) as follows,

$$J_{JMI}(X_k)=\sum _{X_j\in X_S}{I(X_k,X_j;Y)}$$

In [29], the relevance and complementarity scores are estimated using a neural network, which is highly time-consuming.

Unlike the negative interaction information and the shared information, we formally define feature complementarity from the perspective of information entropy as the sum of the unique information of two features about the target.

Definition 7

Suppose \(X_1\) and \(X_2\) are two random variables used for predicting the target variable Y. The complementary classification information (CCI) between the two variables can be defined as follows,

$$\begin{aligned} CCI(X_1,X_2;Y) = \frac{I(X_1;Y|X_2)+I(X_2;Y|X_1)}{2} \end{aligned}$$
(13)

Definition 8

Let \(X_k\) be a random variable and \(X_S=\{X_1,X_2,\cdots ,X_{k-1}\}\) be a random variable subset. Their complementary classification information (CCI) for predicting the target variable Y can be defined as follows,

$$\begin{aligned} CCI(X_k,X_S;Y) = \frac{I(X_k;Y|X_S)+I(X_S;Y|X_k)}{|X_S|+1} \end{aligned}$$
(14)

CCI quantifies the new classification information provided by a feature when another feature or subset is given. Suppose \(X_k\) is a candidate feature and \(X_S\) is the selected feature subset. \(I(X_k;Y|X_S)\) measures the amount of classification information newly provided by the candidate feature \(X_k\), while \(I(X_S;Y|X_k)\) indicates the amount of classification information preserved by the selected feature set when the candidate feature is added. Hence CCI measures the classification information of the subset \(\{X_k,X_S\}\).

It is hard to compute the complementary information between a feature and a feature subset, so we use the sum of pairwise CCI between features to approximate it. Assume that a dataset \(X=\{X_1,X_2,\cdots ,X_n\}\) is characterized as an n-dimensional vector and X is labeled with L classes \(Y=\{y_j\},j=1,2,\cdots ,L\). According to the definition of complementarity between features, we can obtain the complementarity matrix between any pair of features as follows:

$$\begin{aligned} C=\{c_{i,j}\}_{1\le i,j\le n}= \left[ \begin{array}{cccc} 0 & c_{12} & \cdots & c_{1n}\\ c_{21} & 0 & \cdots & c_{2n}\\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & 0 \end{array} \right] \end{aligned}$$
(15)

where \(c_{i,j} = CCI(X_i,X_j;Y)\). C carries all the complementary information between any pair of features, and by the definition of CCI, C is a real symmetric matrix.
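A direct, entropy based sketch of Eqs. (13) and (15) is shown below, reusing a conditional mutual information estimator such as the one sketched in Sect. 2; note that computing all pairs is quadratic in the number of features:

```python
import numpy as np

def cci(x1, x2, y, cmi):
    """Pairwise CCI (Eq. 13): mean of the two conditional relevances."""
    return 0.5 * (cmi(x1, y, x2) + cmi(x2, y, x1))

def complementarity_matrix(X, y, cmi):
    """Symmetric matrix C (Eq. 15) with c_ij = CCI(X_i, X_j; Y) and c_ii = 0."""
    n = X.shape[1]
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = C[j, i] = cci(X[:, i], X[:, j], y, cmi)
    return C
```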

3.2 Maximum-Relevance and Maximum-Complementarity Based Feature Selection

For high-dimensional data, it is infeasible to compute the joint mutual information of many variables. Hence we use the pairwise complementarity of features to help find the optimal feature subset. Suppose \(X_k\) is a candidate feature and \(X_S=\{X_1,X_2,\cdots ,X_{k-1}\}\) is the selected subset. The criterion of MRMC is defined as follows,

$$\begin{aligned} J_{MRMC}(X_k) = I(X_k;Y)+CCI(X_k,X_S;Y) \end{aligned}$$
(16)

For simplicity, the criterion can be rewritten as follows,

$$\begin{aligned} J_{MRMC}(X_k) = I(X_k;Y)+\frac{1}{|X_S|}\sum _{X_j\in X_S}{CCI(X_k,X_j;Y)} \end{aligned}$$
(17)

where \(|X_S|\) is the size of the current selected feature set.
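The score of Eq. (17) is a simple combination of a relevance vector and a pairwise complementarity matrix; a minimal sketch, with `relevance[k]` holding an estimate of \(I(X_k;Y)\), is:

```python
import numpy as np

def mrmc_score(k, selected, relevance, C):
    """J_MRMC (Eq. 17): relevance of X_k plus its mean pairwise complementarity
    with the already selected features (C from Eq. 15 or Sect. 3.3)."""
    if not selected:
        return relevance[k]
    return relevance[k] + np.mean([C[k, j] for j in selected])
```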

3.3 Mining the Feature Complementarity with Random Forest

To efficiently estimate the feature complementarity, we propose to calculate the complementarity scores between any pair of features using random forests. Inspired by the sample proximity matrix of random forests, we can also derive a pairwise complementarity matrix from a random forest. The main assumption is that when two features are used in the same tree, they can be viewed as complementary to each other, since a tree model naturally performs feature selection while it is being built. The frequency with which two features are used together in the forest can therefore serve as an estimate of their complementarity.

Given a trained random forest \(\{h_1,h_2,\cdots ,h_{n_{tree}}\}\), the complementarity score of any pair of features is defined as:

$$\begin{aligned} \tilde{c}_{i,j}=av_{k}Idt(X_i,X_j\in h_k),k=1,2,\cdots ,n_{tree} \end{aligned}$$
(18)

where \(n_{tree}\) is the number of trees in the forest, \(Idt(\cdot )\) is the indicator function, and \(av(\cdot )\) is the mean operator. Its value is normalized to the range [0, 1]. A value of 0 for \(\tilde{c}_{i,j}\) indicates that \(\{X_i,X_j\}\) provides no additional knowledge for predicting Y compared with using \(X_i\) or \(X_j\) alone, while a value of 1 indicates that \(X_i\) and \(X_j\) have the largest complementarity, i.e. combining \(X_i\) and \(X_j\) performs much better than employing \(X_i\) or \(X_j\) alone.
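A sketch of this estimate with scikit-learn: for every tree we collect the features that appear in at least one split and count how often each pair co-occurs (the function name and defaults are ours; the paper's own implementation may differ):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def forest_complementarity(X, y, n_trees=1000, random_state=0):
    """Approximate Eq. (18): c~_ij = fraction of trees using both X_i and X_j."""
    rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                random_state=random_state).fit(X, y)
    n = X.shape[1]
    C = np.zeros((n, n))
    for tree in rf.estimators_:
        # tree_.feature lists the split feature of each node; negative values mark leaves.
        used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        C[np.ix_(used, used)] += 1.0
    C /= len(rf.estimators_)
    np.fill_diagonal(C, 0.0)   # by convention c_ii = 0, as in Eq. (15)
    return C
```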

An efficient greedy search method, Sequential Forward Search (SFS), is employed to obtain a feature subset for MRMC. SFS-MRMC consists of the following steps: 1) start from an empty feature subset, calculate the relevance scores of all features, and select the most relevant feature first; 2) at each selection step, expand the feature subset with the feature having the largest MRMC score; 3) repeat step 2) until the stopping condition is reached.
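A minimal sketch of SFS-MRMC, combining a relevance vector (e.g. estimates of \(I(X_k;Y)\)) with a pairwise complementarity matrix such as the forest based one above:

```python
import numpy as np

def sfs_mrmc(relevance, C, num_features):
    """Sequential forward search maximising J_MRMC (Eq. 17)."""
    n = len(relevance)
    remaining = set(range(n))
    # Step 1: start with the most relevant feature.
    first = int(np.argmax(relevance))
    selected = [first]
    remaining.remove(first)
    # Steps 2-3: greedily add the feature with the largest MRMC score.
    while remaining and len(selected) < num_features:
        scores = {k: relevance[k] + np.mean([C[k, j] for j in selected])
                  for k in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```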

4 Experiment and Discussion

4.1 Datasets and Algorithms for Comparison

To evaluate the performance of SFS-MRMC, 18 datasets from different domains are selected from http://featureselection.asu.edu/datasets.php, https://archive.ics.uci.edu/ml/index.php and http://www.gems-system.org/. The description of the datasets is summarized in Table 1. The number of features ranges from 44 to 10000 with categories varying from 2 to 15.

Table 1. Description of datasets

SFS-MRMC is compared with one Relevance and Redundancy based method, mRMR; three Complementarity based methods, CMIM, DISR and JMI; and an embedded method, RF-RFE. Note that RF-RFE often achieves state-of-the-art performance on high-dimensional data.

For a fair comparison, in RF-RFE and SFS-MRMC, the number of trees in the forest is set to 1000 and the number of splitting features per node is set to the default value \(m_{try}=\sqrt{p}\), where p is the total number of features of the dataset.

RF-RFE starts from the full set of features, prunes the least important feature from the current feature subset and then retrains a random forest to update the feature ranking at every iteration, so it is very time-consuming for high-dimensional data. To accelerate the experimental process, when the size of the current feature subset is larger than 200, the least important 20 percent of features are removed at each iteration; once the size of the current feature subset reaches 200, the least important feature is eliminated at each iteration.
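The accelerated elimination schedule described above can be sketched as follows with scikit-learn random forests (an illustration of the schedule, not the exact RF-RFE implementation used in the experiments):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_rfe_ranking(X, y, n_trees=1000, random_state=0):
    """RF-RFE: drop the least important 20% while more than 200 features remain,
    then one feature per iteration. Returns features ranked most important first."""
    remaining = list(range(X.shape[1]))
    eliminated = []                                    # least important first
    while remaining:
        rf = RandomForestClassifier(n_estimators=n_trees, max_features="sqrt",
                                    random_state=random_state)
        rf.fit(X[:, remaining], y)
        order = np.argsort(rf.feature_importances_)    # ascending importance
        n_drop = max(1, int(0.2 * len(remaining))) if len(remaining) > 200 else 1
        eliminated.extend(remaining[i] for i in order[:n_drop])
        remaining = [remaining[i] for i in order[n_drop:]]
    return eliminated[::-1]
```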

4.2 Evaluation Metrics

To evaluate the performances of different algorithms, ten-fold cross-validation (10-CV) is conducted five times on each dataset. In each fold, every algorithm is run on the training set to obtain a feature ranking. Then the performance of the feature subsets selected according to the ranking is evaluated on the test set using the following metrics.

Note that our goal is to select a small feature subset for classification, so we only record the results on the top 200 features selected by different algorithms. For datasets with fewer than 200 features, we record the results for all sizes of selected feature sets, i.e. \(1 \sim m \).

Average Test Accuracy. The classical KNN classifier is used to test the average performance of the features selected by the different feature selection algorithms over the five runs of 10-CV. For a fair comparison, we use the same parameter settings for the classifier: the parameter k of KNN is set to 3.
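A sketch of this evaluation protocol with scikit-learn; `rank_fn` is a placeholder for any of the selectors above, and the function name and defaults are ours:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

def evaluate_ranking(X, y, rank_fn, max_features=200, n_splits=10, n_repeats=5):
    """Mean 3-NN test accuracy for the top-m features, m = 1..max_features,
    over n_repeats runs of n_splits-fold cross-validation.
    rank_fn(X_train, y_train) must return a feature ranking (best first)."""
    m = min(max_features, X.shape[1])
    acc = np.zeros(m)
    for r in range(n_repeats):
        cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=r)
        for train, test in cv.split(X, y):
            ranking = rank_fn(X[train], y[train])
            for i in range(m):
                cols = list(ranking[:i + 1])
                knn = KNeighborsClassifier(n_neighbors=3)
                knn.fit(X[train][:, cols], y[train])
                acc[i] += accuracy_score(y[test], knn.predict(X[test][:, cols]))
    return acc / (n_repeats * n_splits)
```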

Average Size of Optimal Feature Subset. We aim to obtain a small feature subset. For each classifier, we find the optimal feature subset with the highest accuracy on each dataset. Then the proportion of this subset to the total number of features is recorded and averaged over the five runs of 10-CV.

All the experiments are conducted on a PC with an Intel CPU and 8 GB RAM. We use Python 2.7 for coding, together with the Scikit-Feature feature selection repository.

4.3 Results and Discussion

In this section, we compare SFS-MRMC against the other feature selection algorithms in terms of the following aspects.

Fig. 1. Accuracy comparison among different feature selection methods

Table 2. The performances of KNN using different feature selection algorithms
Table 3. Average size of optimal feature subsets selected with different methods (%)

Average Test Accuracy. Limited by length, Fig. 1 shows only the average test accuracy of KNN on six datasets for different sizes of feature sets selected by the different feature selection algorithms. The number of selected features ranges from 1 to 200. Obviously, different datasets exhibit different behaviours in terms of test accuracy. For some datasets, such as Prostate_GE, ALLAML and Brain_Tumor1, all feature selection algorithms reach their best performance with only a small number of features. In contrast, for some datasets, such as sonar, the test accuracy exhibits a rising trend with the growth of the number of selected features. And for other datasets, such as hearts and p_gene, more features may deteriorate the performance. We can observe that in most cases SFS-MRMC and RF-RFE are significantly better than the other algorithms, while SFS-MRMC performs slightly better than RF-RFE.

Table 2 shows the means and standard deviations of the best test accuracy of the KNN classifier over the 5 runs of 10-CV with the different feature selection algorithms. The numbers in red indicate the largest value in each row, i.e. the best result on each dataset, while the numbers in blue indicate the second best accuracy. At the bottom, we also display the average performance in terms of accuracy and the rank of each algorithm over the datasets, which show that SFS-MRMC and RF-RFE outperform the other algorithms, with SFS-MRMC achieving the highest average rank.

Average Size of Optimal Feature Subset. Table 3 records the average proportion of features selected by the six feature selection algorithms with the KNN classifier over the 18 datasets. Generally, all six feature selection algorithms achieve a remarkable reduction of feature dimension by selecting only a small proportion of the original feature sets.

5 Conclusions

MRMC takes into consideration both the complementarity and the relevance within features. We approximate the complementarity between pairs of features using random forests, so MRMC can be viewed as a hybrid feature selection approach combining entropy based methods and random forests. Experimental results show that SFS-MRMC outperforms four classical entropy based methods and the state-of-the-art RF-RFE in terms of classification accuracy, while remaining efficient in terms of time cost.