1 Introduction

Feature selection is an important part of machine learning and data mining. Modern data sets are becoming wider and larger, which is not always beneficial for learning tasks such as classification; this problem is described in detail by [12]. Feature selection methods are roughly divided into three categories: filter methods such as the mRMR algorithm [25]; wrapper methods such as the RFE algorithm [27] and the Boruta algorithm [20]; and embedded methods such as L1 and L2 regularization [23] and various decision trees and their ensembles, such as Random Forest [2]. The process of feature selection can be described in terms of generation, evaluation, stopping criterion and validation [4]. In addition, there are other ways to classify feature selection methods. According to how the relevance between features and the class variable is evaluated, they can be divided into consistency measures [4, 21, 38], dependence measures [13, 35], distance measures [19, 24] and information theory measures [7, 18].

Furthermore, current feature selection methods can be classified according to whether they can cope with different types of features, which can be grouped into irrelevant features, redundant features and interactive features. Irrelevant features do not help with learning tasks [16]. Finding and deleting irrelevant features has always been a major topic of feature selection. Such methods often produce feature weights and an ordering by measuring the degree of association between each feature and the class variable [28]. However, most of them cannot remove redundant features, for example Relief [17] and its extension ReliefF [19]. Redundant features have a negligible effect on learning and prediction because most of the information they provide is already present in other features [36]. Some algorithms take redundancy into consideration, such as CFS [13], FCBF [35], CMIM [8] and ACE [32]. However, they cannot recognize feature interactions.

Feature interaction is a ubiquitous issue that cannot be ignored and has attracted increasing attention in recent years [15]. A simple example of feature interaction is the exclusive OR operation: C = XOR(X1,X2), where C is a Boolean label and X1 and X2 are Boolean features. X1 and X2 are each irrelevant to C on their own, yet together they determine C completely; thus, removing either of them is harmful. There is little research on feature interaction at present. Jakulin and Bratko [14] introduced the concept of interaction gain and used interaction information to quantify feature interactions. Wang et al. [34] provided a flexible framework to extract different types of information, which can deal with feature interactions. Wang and Song [33] proposed the FEAST algorithm, which is based on association rule mining and takes feature interactions into account. Zeng et al. [37] proposed an interaction weight factor that reflects whether a feature is redundant or interactive, and then put forward an Interaction Weight based Feature Selection (IWFS) algorithm. Gao et al. [10] took the composition of feature relevancy into account and proposed the Composition of Feature Relevancy (CFR) feature selection method. Tang et al. [31] proposed the Five-dimensional Joint Mutual Information (FJMI) feature selection algorithm, which takes higher-order feature interactions into account and adopts the ‘maximum of the minimum’ criterion. However, there is no widely accepted method at present, and in particular no well-established way to measure feature interactions.

In this paper, we propose a new method for quantifying feature interactions, namely the Symmetrical Complementary Coefficient. It is based on the Enhanced Complementary Coefficient, which is an improved version of the Complementary Coefficient [30]. Based on the Symmetrical Complementary Coefficient, we improve the Sequential Forward Selection (SFS) algorithm and propose a new feature subset search algorithm called SCom-SFS. In addition, we propose an improved ReliefF algorithm that selects representative instances. Combining the two modified algorithms yields an effective and complete feature selection algorithm, RRSS. It is compared with five other classic feature selection algorithms and two recent algorithms in terms of four evaluation metrics, i.e., the size of the resulting feature subset, Accuracy, Kappa coefficient, and adjusted Mean-Square Error (MSE).

The rest of the paper is organized as follows. Section 2 describes the preliminaries. Section 3 introduces the RF-efficient-ReliefF algorithm. Section 4 gives the definition of the Symmetrical Complementary Coefficient and proposes the SCom-SFS algorithm. Section 5 presents the RRSS algorithm and analyzes its time complexity and parameter selection. Sections 6 and 7 present the experiments, results and analyses. Section 8 provides conclusions.

2 Preliminaries

This section introduces the background of this work. We first present the well-known ReliefF algorithm, then the SFS algorithm for feature subset generation, and then the concept of feature interaction. Finally, we introduce Random Forest and its feature selection algorithm.

2.1 ReliefF algorithm

Relief is a classical filter method for feature selection [17]. It calculates feature weights for binary classification problems. Kononenko [19] extended it to the ReliefF algorithm, which can be used for multi-class problems. The basic idea is to assign weights to features through distance measures: a good feature should keep samples of the same class close together while keeping samples of different classes far apart. W(A) denotes the weight of feature A updated in each sampling round, m is the number of sampling rounds, S is the original data set, and n is the number of features. The details of the algorithm are given in Algorithm 1.

According to the obtained feature weights, a threshold on the ordering or on the weights can be set, and the features with lower weights are deleted. The ReliefF algorithm only needs to compute feature weights on the randomly sampled instances rather than on the entire data set. Its time cost depends on the number of features and the number of sampling rounds.
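For concreteness, the weight update described above can be sketched as follows. This is a minimal illustration of multi-class ReliefF, not Algorithm 1 itself; it assumes numeric features scaled to [0,1], and the function name, default m, k and seed are our own choices.

```python
import numpy as np

def relieff(X, y, m=100, k=5, seed=0):
    """Minimal ReliefF sketch: X is (N, n) with features in [0, 1], y holds class labels."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    W = np.zeros(n)
    classes, counts = np.unique(y, return_counts=True)
    prior = dict(zip(classes, counts / N))
    for _ in range(m):
        i = rng.integers(N)                      # randomly sampled instance R
        diff = np.abs(X - X[i])                  # per-feature distance of every sample to R
        dist = diff.sum(axis=1)
        dist[i] = np.inf                         # exclude R itself from the neighbour search
        for c in classes:
            idx = np.where(y == c)[0]
            idx = idx[np.argsort(dist[idx])][:k] # k nearest neighbours within class c
            contrib = diff[idx].mean(axis=0) / m
            if c == y[i]:                        # nearest hits pull the weight down
                W -= contrib
            else:                                # nearest misses push it up, weighted by class prior
                W += prior[c] / (1 - prior[y[i]]) * contrib
    return W
```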

2.2 Sequential Forward Selection

As a feature subset search method, Sequential Forward Selection (SFS) is essentially a greedy algorithm. The feature subset starts from an empty set, and features are added one by one whenever the addition improves the evaluation function, so that the evaluation function stays optimal after each step. The algorithm terminates when no candidate feature remains. The evaluation function can be a classifier-dependent metric such as accuracy, or a classifier-independent metric. Classifier-dependent metrics usually yield higher accuracy but cost much more time; classifier-independent metrics are faster but typically less accurate.
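The greedy loop just described can be written down in a few lines. This is only a sketch with an arbitrary, user-supplied evaluation function; the interface is ours, not part of the original algorithm description.

```python
def sfs(features, evaluate):
    """Greedy Sequential Forward Selection.

    features: list of candidate feature names or indices.
    evaluate: callable mapping a feature subset (list) to a score (higher is better).
    """
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # try every remaining feature and keep the one giving the best score
        scores = [(evaluate(selected + [f]), f) for f in remaining]
        score, f = max(scores, key=lambda t: t[0])
        if score <= best_score:        # no single feature improves the evaluation: stop
            break
        selected.append(f)
        remaining.remove(f)
        best_score = score
    return selected, best_score
```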

[Algorithm 1: the ReliefF algorithm]

2.3 Definition of feature interactions

Consider a supervised learning problem with one class variable and n features. Feature interaction means that several features taken together can be more effective than the sum of the effects of the individual features. Based on this idea, an informal definition of feature interaction has been given [15].

A k-way feature interaction involves k features X1,X2,…,Xk. Define a function e(C;X1,X2,…,Xi) (i = 1,2,…,k) that evaluates the contribution of the given feature(s) to the class variable C. If

$$ e(C;X_1, X_2,\ldots, X_k)\geq \sum\limits_{i=1}^{k}e(C;X_i) $$

we can deem that there is a k-way feature interaction between these k features, which are then called interactive features. The difference between the two sides,

$$ e(C;X_1, X_2,\ldots, X_k)- \sum\limits_{i=1}^{k}e(C;X_i) $$

can be seen as the strength of the existing feature interaction.
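As a concrete illustration, take the XOR example from the introduction and choose e to be the mutual information with C (one common choice; the definition above leaves e unspecified). With X1 and X2 independent and uniform on {0,1} and C = XOR(X1,X2),

$$ e(C;X_{1})=I(C;X_{1})=0, \qquad e(C;X_{2})=I(C;X_{2})=0, $$

$$ e(C;X_{1},X_{2})=I(C;X_{1},X_{2})=H(C)=1 \text{ bit} > 0 = \sum\limits_{i=1}^{2}e(C;X_{i}), $$

so the two features exhibit a 2-way interaction of strength 1 bit.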

Based on this concept, methods can be designed to describe feature interaction, but there is no particularly good way to do so yet: interactive features are hard to search for, and multi-way interaction is complex [14, 33, 34, 37]. Nevertheless, quantifying feature interactions remains very profitable and worth doing.

2.4 Random Forest

Random Forest (RF) is an ensemble machine learning algorithm proposed by [2] and one of the most popular algorithms in machine learning and data mining. It is data-driven and requires no assumption about the distribution of the data set. Compared with other models, it runs fast and performs well on imbalanced, multi-variate, and large data sets; it is robust to outliers and noise and rarely over-fits. K sample sets \(\{S_{k}\}_{k=1}^{K}\) are obtained from the initial data set S = {(x1,y1),(x2,y2),…,(xN,yN)} by bootstrap sampling. A decision tree hk is built on each data set Sk, and at each node of each tree the same number of randomly selected features is considered for splitting. The combination of these decision trees is the Random Forest. The final result is a vote (for classification) or an average (for regression) over the decision trees. For classification problems, the result can be expressed by the following formula.

$$ H(x)=\text{arg max}_{y}\sum\limits_{k=1}^{K}I(h_{k}(x)=y) $$

where hk(x) is the result given by the k-th decision tree and I(⋅) is the indicator function.

In the process of constructing a Random Forest, an ordering and weights of the features can be obtained. There are two ways to calculate the weight of each feature:

  1. Mean Decrease Gini (MDG): the weight of feature A is measured by the decrease in the Gini index caused by A; the larger the decrease, the greater the weight of A [29].

  2. Mean Decrease Accuracy (MDA): add noise to a certain feature in the OOB data [1]; the more the OOB accuracy decreases (i.e., the more the OOB error increases), the more important this feature is.

After the feature weights are obtained, RF has its own feature selection algorithm: Random Forest-based Wrapper Feature Selection (RFWFS). The details of this algorithm are given in Algorithm 2. Through the combination of MDG or MDA with RFWFS, a reliable feature subset can be screened out. A sketch of MDA-style permutation importance is given below.
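The following sketch illustrates the MDA idea with scikit-learn's random forest. For simplicity it permutes each feature on a held-out validation set rather than on the OOB data used by the original MDA, so it is only an approximation; the function name and parameters are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def permutation_importance_mda(X_train, y_train, X_val, y_val, seed=0):
    """Illustrative MDA-style importance: shuffle one feature at a time on a held-out
    set and record the drop in accuracy (the original MDA uses OOB data instead)."""
    rng = np.random.default_rng(seed)
    rf = RandomForestClassifier(n_estimators=500, random_state=seed).fit(X_train, y_train)
    base = rf.score(X_val, y_val)
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        X_noisy = X_val.copy()                            # X_val is a NumPy array
        X_noisy[:, j] = rng.permutation(X_noisy[:, j])    # "add noise" by shuffling feature j
        importances[j] = base - rf.score(X_noisy, y_val)  # accuracy drop = importance
    return importances
```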

[Algorithm 2: the RFWFS algorithm]

3 RF-efficient-ReliefF algorithm

Although the ReliefF algorithm is quick and effective, it still has several deficiencies:

  1. The resulting feature weights do not help remove redundant features.

  2. The number of sampling rounds m affects the final weights and ordering.

  3. Although random sampling reduces the time cost, the weights and ordering change with the sampling results.

  4. Because sampling is done with replacement, repeated samples contribute nothing to the update of feature weights and reduce the effectiveness and reliability of the results.

  5. Most importantly, in our view, data sets usually contain noisy and low-quality samples, which seriously affect the results.

In response to the above problems, this paper proposes a reliable and efficient algorithm: RF-efficient-ReliefF algorithm.

RF has high accuracy, especially on data far from the classification boundary, and it runs fast. Based on these characteristics, we first build an RF model on the original data set. For each sample there are decision trees that did not use it during training, so we can compute, from the votes of those trees, the proportion of votes the sample receives for each class. From these per-sample results, we first calculate the average proportion of votes for each class, and then keep the samples whose proportion of votes for their true class is higher than the average proportion of votes for that class. The ReliefF algorithm is then run on this data subset, which can be seen as a high-quality data collection. Unlike the original algorithm, we use all samples in the subset instead of sampling. In addition, we adopt the same treatment as [26] to weight the k nearest neighbors, i.e., the k nearest instances receive weights that decrease exponentially with increasing rank. The original data set is S. Suppose the classes are c = 1,2,…,length(C), where length(C) is the number of values of the class variable C, and n is the number of features. The details of the algorithm are given in Algorithm 3.
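The sample-filtering step can be sketched with scikit-learn, whose oob_decision_function_ exposes each training sample's OOB vote proportions. This is only our reading of the averaging step (the class average is taken over the samples of that class) and it approximates, rather than reproduces, Algorithm 3; all names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_high_quality_samples(X, y):
    """Keep samples whose OOB vote share for their true class exceeds the average
    vote share of that class (a sketch of the filtering step of RF-efficient-ReliefF)."""
    y = np.asarray(y)
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(X, y)
    proba = rf.oob_decision_function_              # (N, n_classes) OOB vote proportions
    classes = list(rf.classes_)
    true_idx = np.array([classes.index(c) for c in y])
    p_true = proba[np.arange(len(y)), true_idx]    # each sample's vote share for its true class
    keep = np.zeros(len(y), dtype=bool)
    for k, c in enumerate(classes):
        mask = (y == c)
        # compare with the average vote share of class c over its own samples
        keep[mask] = p_true[mask] > proba[mask, k].mean()
    return X[keep], y[keep]
```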

RF-efficient-ReliefF has these advantages:

  1. The selected data subset excludes noisy and low-quality samples, so the samples in it are more helpful for distinguishing the classes, and the feature weights derived from them are more reliable.

  2. The selected subset contains roughly half of the original data set, which shortens the computation compared with using all the data.

  3. No sampling is required, which removes the uncertainty caused by choosing the number of sampling rounds, and the results no longer vary with it.

In addition, we propose two alternatives:

  • RF-e-R1: also uses the new data subset. For each sample d in the subset, its normalized voting proportion for its true class is taken as its weight, and during the ReliefF updates the feature-weight increments are multiplied by the sample weights in each iteration. All samples in the subset are used without sampling.

  • RF-e-R2: uses the original samples for the feature weight calculation. Each sample's voting proportion for its true class is taken as its weight, and during the ReliefF updates the feature-weight increments are multiplied by the sample weights in each iteration. All samples are used without sampling.

In our research, we found that the effects of these three algorithms are similar, but the RF-efficient-ReliefF algorithm is clearly the most convenient. It is therefore chosen for further study in the subsequent sections.

4 Improved SFS algorithm based on Symmetrical Complementary Coefficient

In this section, we first analyze some defects of the SFS algorithm that we have discovered. Then we give the definition of the Symmetrical Complementary Coefficient and its calculation formulas. Finally, the SFS algorithm is improved based on the Symmetrical Complementary Coefficient. Two examples are given to help explain.

[Algorithm 3: the RF-efficient-ReliefF algorithm]

4.1 The defects of current SFS algorithm

The SFS algorithm starts from an empty feature subset and continuously adds features to it, so that the evaluation function remains optimal after every addition. Features can only be added to the subset; those already inside can never be deleted. The result obtained in this way is not reliable, and it easily falls into a local optimum. These defects can be alleviated by a pre-specified feature ordering; for example, the algorithm proposed in the previous section can provide the feature ordering, and SFS is then implemented according to it.

Here another defect of the SFS algorithm is presented: feature interactions exist, and SFS cannot recognize them. When an added feature fails to improve the evaluation function, it is discarded. In fact, it may interact with features that have not been added yet, so deleting it is hasty. The classic German credit data set provides the following example to illustrate this problem.

Example 1

The German credit data set has 20 features and a two-class class variable. Table 1 shows the feature weights and ordering computed by the RF-efficient-ReliefF algorithm, sorted by weight from largest to smallest.

Table 1 Feature weights and ordering. Xi represents the i-th feature of the original data set (i = 1,2,…,20)

We then use the accuracy of RF as the evaluation function and implement the SFS algorithm to find the optimal feature subset and its corresponding accuracy. As a comparison, we select the first eight features in Table 1 as a feature subset and compute its accuracy. The two cases are shown in Table 2.

Table 2 The resulting feature subsets and their accuracies

Table 2 shows that, although the first 8 features give the better result, the SFS algorithm cannot find all of them and instead adds many lower-ranked features. This confirms that SFS cannot recognize feature interactions.

4.2 Symmetrical Complementary Coefficient

As mentioned in the introduction, there is currently no widely accepted method; multi-way interaction is complex and it is hard to search out all interactive features. Here we present a new method for quantifying feature interactions between small batches of features. Its definition and calculation are as follows.

Firstly, let us introduce the Complementary Coefficient, proposed by [30]. It has some good properties: it takes the contribution of feature interactions to the classification task into consideration, and its calculation is fast. However, it did not attract enough attention at first; it was not used to describe feature interactions but merely to search for important features. Moreover, it is not symmetrical, and the case of multiple features was not considered. We therefore made some changes to it.

Given a data set S, each of the two features Xi, Xj is used on its own to characterize the entire data set. A classifier is trained on each of the two one-feature data sets, yielding two models Mi, Mj and their corresponding sets of misclassified samples Di, Dj. Note that a reserved test set can be used here, just as in cross-validation; in this paper we take RF as the classifier and use the OOB data to find Di and Dj. Mark Di's subset whose samples are classified correctly by Mj as Dij, and Dj's subset whose samples are classified correctly by Mi as Dji. Note that Dij and Dji are in general different.

The Complementary Coefficient from Xj to Xi is Com(i,j). The Complementary Coefficient from Xi to Xj is Com(j,i).

$$ \begin{array}{@{}rcl@{}} Com(i,j)=\frac{|D_{ij}|}{|S|}\\ Com(j,i)=\frac{|D_{ji}|}{|S|} \end{array} $$

where |⋅| denotes the number of samples in the corresponding set.

In this paper, we put forward the Enhanced Complementary Coefficient. The Enhanced Complementary Coefficient from Xj to Xi is ECom(i,j), and from Xi to Xj it is ECom(j,i).

$$ \begin{array}{@{}rcl@{}} ECom(i,j)=\frac{|D_{ij}|}{|D_{i}|} \end{array} $$
(1)
$$ \begin{array}{@{}rcl@{}} ECom(j,i)=\frac{|D_{ji}|}{|D_{j}|} \end{array} $$
(2)

The Complementary Coefficient and the Enhanced Complementary Coefficient differ in the denominator, which changes from |S| to |Di| and |Dj|. Although using |S| as the denominator does account for the contribution of feature interactions to the whole classification task, it is unfair. Suppose Xi is a very good feature, Xj is a very bad feature, and Xk is another ordinary feature. Clearly |Di| ≤ |Dj|, so under normal circumstances |Dik|/|S| ≤ |Djk|/|S|, which is unfair to Xi.

Let’s suppose that

$$ \begin{array}{rcl} & & |S|=1000,\quad |D_{i}|=10,\quad |D_{j}|=100,\quad |D_{ik}|=9,\quad |D_{jk}|=9,\\ &\text{so}& |D_{ik}|/|S|=|D_{jk}|/|S|=0.009,\\ &\text{but}& |D_{ik}|/|D_{i}|=0.9, \quad |D_{jk}|/|D_{j}|=0.09 \end{array} $$

Obviously, the latter reflects more real and reliable feature interactions.

We are the first to use the Enhanced Complementary Coefficient to quantify feature interactions, and we extend it slightly to obtain the Symmetrical Complementary Coefficient. First, consider the case of two features. The Symmetrical Complementary Coefficient between Xi and Xj is SCom(i,j).

$$ \begin{array}{@{}rcl@{}} SCom(i,j)&=&SCom(j,i)=\frac{ECom(i,j)+ECom(j,i)}{2}\\ &=&\frac{1}{2}\cdot \left( \frac{|D_{ij}|}{|D_{i}|}+\frac{|D_{ji}|}{|D_{j}|}\right) \end{array} $$
(3)

Symmetrical Complementary Coefficient can quantify the strength of the interaction between the two features. It has the following properties.

  • SCom(i,j) ∈ [0,1]

  • SCom(i,j) = 0 indicates that there is no interaction between the two features.

  • SCom(i,j) > 0 indicates that there is a positive interactive relationship between two features, that is, a synergistic relationship.

  • The bigger the SCom(i,j), the stronger the interaction between the two features.

The one-to-many and many-to-many cases can be defined similarly. Let U, V be sets of feature subscripts; each may contain many features or only one. DUV and DVU are defined analogously, so the interaction between two sets of features can be obtained as

$$ \begin{array}{@{}rcl@{}} SCom(U,V)&=&SCom(V,U)\\ &=&\frac{ECom(U,V)+ECom(V,U)}{2}\\ &=&\frac{1}{2}\cdot \left( \frac{|D_{UV}|}{|D_{U}|}+\frac{|D_{VU}|}{|D_{V}|}\right) \end{array} $$

However, this takes more time. Computing the Symmetrical Complementary Coefficient between two features only requires using each feature alone to complete the classification task, so it does not take much time. In this paper, we only consider the case of two features.
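A sketch of computing SCom(i,j) from (1)–(3) with single-feature random forests follows. Di and Dj are obtained from OOB predictions as in the text, while Dij and Dji are checked with the other model's ordinary predictions for simplicity; function names and parameters are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def oob_misclassified(x_col, y):
    """Fit an RF on a single feature; return OOB-misclassified sample indices and the model
    (a sketch of obtaining D_i and M_i)."""
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(x_col.reshape(-1, 1), y)
    oob_pred = rf.classes_[np.nanargmax(rf.oob_decision_function_, axis=1)]
    return set(np.where(oob_pred != y)[0]), rf

def scom(X, y, i, j):
    """Symmetrical Complementary Coefficient between features i and j, equation (3)."""
    y = np.asarray(y)
    D_i, M_i = oob_misclassified(X[:, i], y)
    D_j, M_j = oob_misclassified(X[:, j], y)
    # D_ij: samples misclassified by M_i but classified correctly by M_j (and vice versa)
    D_ij = {s for s in D_i if M_j.predict(np.array([[X[s, j]]]))[0] == y[s]}
    D_ji = {s for s in D_j if M_i.predict(np.array([[X[s, i]]]))[0] == y[s]}
    ecom_ij = len(D_ij) / len(D_i) if D_i else 0.0
    ecom_ji = len(D_ji) / len(D_j) if D_j else 0.0
    return 0.5 * (ecom_ij + ecom_ji)
```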

Let us observe the actual effect of the Symmetrical Complementary Coefficient in the following example, which again uses the German credit data set, and compare it with Interaction Information, a classic quantification of feature interaction proposed by [14].

The Interaction Information between features Xi and Xj is I(Xi;Xj;C), defined by the following formula, where C is the class variable.

$$ I(X_{i};X_{j};C)=I(X_{i};X_{j}|C)-I(X_{i};X_{j}) $$

where I(Xi;Xj|C) is the conditional mutual information between Xi and Xj given C, and I(Xi;Xj) is the mutual information between Xi and Xj. I(Xi;Xj;C) = 0 indicates that there is no interaction between the two features; I(Xi;Xj;C) > 0 indicates a positive interaction; I(Xi;Xj;C) < 0 indicates a negative interaction, i.e., redundancy between the two features.
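For reference, the interaction information of two discrete features can be estimated from data as below; this is our own illustrative estimator (using sklearn's mutual_info_score, which computes empirical mutual information in nats), not the authors' code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def interaction_information(xi, xj, c):
    """Empirical I(Xi; Xj; C) = I(Xi; Xj | C) - I(Xi; Xj) for discrete variables."""
    xi, xj, c = map(np.asarray, (xi, xj, c))
    mi = mutual_info_score(xi, xj)                # I(Xi; Xj)
    cond_mi = 0.0
    for label in np.unique(c):
        mask = (c == label)
        # sum_c p(c) * I(Xi; Xj | C = c)
        cond_mi += mask.mean() * mutual_info_score(xi[mask], xj[mask])
    return cond_mi - mi                           # > 0: synergy, < 0: redundancy
```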

Example 2

The feature ordering is given by the RF-efficient-ReliefF algorithm. Compute the Symmetrical Complementary Coefficient and the Interaction Information between the k-th feature and the (k+1)-st feature. The results of SCom(i,j) and I(Xi;Xj;C) are shown in Table 3.

Table 3 Results of SCom(i,j) and I(Xi;Xj;C)

The table shows that, under the feature ordering obtained by the RF-efficient-ReliefF algorithm, the Interaction Information between each feature and its successor is always negative and close to 0, indicating that Interaction Information only detects redundancy, or no interaction at all. In contrast, the Symmetrical Complementary Coefficient indicates feature interactions among six pairs of features, four of which are pronounced.

As can be seen in Table 3, the Symmetrical Complementary Coefficient between X5 and X12 is 0.71500. From Table 2 in Example 1, the original SFS cannot find X5 and X12, precisely because, by adding features one at a time, it cannot recognize the interaction between them: neither X5 nor X12 improves the evaluation function on its own, so neither is added to the optimal feature subset.

The results of Example 2 show that the Symmetrical Complementary Coefficient proposed in this paper is effective. Moreover, by comparison, it is more informative than Interaction Information, at least on the German credit data set.

Note that a higher Symmetrical Complementary Coefficient between two features does not mean that the two features themselves are more powerful for classification; it only shows that their combination is more powerful than using them separately. To judge correctly whether one or both features are good, besides the Symmetrical Complementary Coefficient between them, we must also look at their performance in the classifier, because feature interaction is only an auxiliary signal and should be combined with the features' performance in the classifier. To save the time cost of feature selection, classifier-independent indicators can also be used.

Note also that feature interaction and feature redundancy are two opposite things. Therefore, a redundant relationship between two features can be ignored when the Symmetrical Complementary Coefficient between them is large; even if the two features are strongly redundant, they still deserve to be preserved in that case. In the next subsection, we propose an improved SFS algorithm based on the Symmetrical Complementary Coefficient, called SCom-SFS.

4.3 SCom-SFS algorithm

Because multi-way interaction is complex and it is hard to search out all interactive features, we choose to search for feature interactions sequentially. This idea is very similar to the principle of the SFS algorithm, and discovering feature interactions can also speed up the feature search, so we combine the two naturally. The new algorithm, called SCom-SFS, screens out beneficial features at a faster speed while identifying and quantifying feature interactions.

Its main idea is as follows. Under a given feature ordering, for each feature Xi (the i-th feature in the ordering), calculate the Symmetrical Complementary Coefficient between it and the next feature in the ordering and associate this value with Xi, as in Example 2. If this coefficient is greater than or equal to a threshold β, the two features are regarded as partners and are packaged together; in the subsequent SFS procedure they are added or discarded together. If the coefficient between the 2nd and 3rd features is also greater than or equal to β, then the 1st, 2nd and 3rd features are all partners and are packaged together. This process continues until the coefficient between the k-th and (k+1)-st features falls below β, so the 1st, 2nd, …, k-th features form one package that advances or retreats as a whole; the procedure then restarts from the (k+1)-st feature, and so on. In this way the original SFS algorithm's inability to recognize feature interactions is resolved, and the discovered interactions speed up the SFS search. The details of the SCom-SFS algorithm are given in Algorithm 4, and a sketch of the packaging step follows.
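A minimal sketch of the packaging idea and the package-wise SFS loop, under our own naming and interface assumptions (it is not Algorithm 4 itself): consecutive features in the ordering are merged into one package as long as each adjacent Symmetrical Complementary Coefficient reaches β, and the packages are then fed to the forward-selection loop as indivisible units.

```python
def package_features(ordering, scom_next, beta):
    """Group consecutive features of `ordering` into packages.

    scom_next[i] is SCom between the i-th and (i+1)-st features in the ordering
    (length n-1); features stay in the same package while scom_next[i] >= beta.
    """
    packages, current = [], [ordering[0]]
    for i in range(len(ordering) - 1):
        if scom_next[i] >= beta:
            current.append(ordering[i + 1])   # partner features advance and retreat together
        else:
            packages.append(current)
            current = [ordering[i + 1]]
    packages.append(current)
    return packages

def scom_sfs(ordering, scom_next, beta, evaluate):
    """SFS over packages instead of single features, following the given ordering."""
    selected, best = [], float("-inf")
    for pkg in package_features(ordering, scom_next, beta):
        score = evaluate(selected + pkg)      # e.g. RF accuracy on the candidate subset
        if score > best:
            selected += pkg
            best = score
    return selected, best
```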

[Algorithm 4: the SCom-SFS algorithm]

5 Combination of RF-efficient-ReliefF algorithm and SCom-SFS algorithm

We first introduce this new algorithm, and then explain how to determine the threshold β and analyze the time complexity of the algorithm.

5.1 RRSS algorithm

As shown in the previous two sections, the RF-efficient-ReliefF algorithm fixes several defects of the original ReliefF algorithm and obtains more reliable feature weights and ordering; however, it determines neither how many nor which features should be deleted. Meanwhile, given feature weights and an ordering, the SCom-SFS algorithm can recognize feature interactions and search out the optimal feature subset. We therefore combine the two algorithms into a new and complete feature selection algorithm, which we name the RRSS algorithm. In this combination, RF-efficient-ReliefF provides SCom-SFS with a more reliable search sequence so that it is less likely to fall into a local optimum, while SCom-SFS can be seen as a correction to the feature ordering provided by RF-efficient-ReliefF and yields the optimal feature subset. The flow chart of the algorithm is shown in Fig. 1.

[Fig. 1: The RRSS algorithm]

5.2 Threshold’s determination and time complexity of the algorithm

Let N be the number of samples and n the number of features in the data set. The time complexity of building a Random Forest is O(knN log2 N), where k is the number of decision trees. The time complexity of the original ReliefF algorithm is O(Nnm), where m is the number of sampled instances. In the RF-efficient-ReliefF algorithm, however, we use roughly half of the original data set instead of sampling, so its time complexity is \(O(knNlog_{2}N+\frac{N^2}{4}n)\).

When calculating the Symmetrical Complementary Coefficients, a Random Forest must be built for each feature, and the misclassified samples of each such model must be tested on one or two other RF models. Since running an already-built RF model costs very little time and the number of misclassified samples is usually small, the time of this testing part can be ignored. The time complexity of calculating the Symmetrical Complementary Coefficients and packaging features is therefore O(nkNlog2N + (n − 1)) = O(knNlog2N), the same order as building a single Random Forest.

The time complexity of searching the optimal feature subset depends on the number of packages and on the evaluation function. In this paper, we use the accuracy of Random Forest as the evaluation function so that the resulting feature subsets are good enough. Suppose the number of features in the candidate subset after each addition is \(\{n_i\}_{i=1}^{M}\), where n1 ≤ n2 ≤ … ≤ nM ≤ n. Then the time complexity of searching the optimal feature subset is O(n1kNlog2N + n2kNlog2N + … + nMkNlog2N).

This depends strongly on M, and M depends on the threshold β. Consider two extremes. When β is small enough, M = 1, which means all features are packed together; the time complexity is greatly reduced, but this is pointless for feature selection. When β is large enough, M = n, which means no feature is packed with any other; this leads to a high time complexity and also fails to recognize feature interactions. The determination of the threshold β is therefore very important: a good threshold should pick out the important feature interactions quantified by the Symmetrical Complementary Coefficient. The question is how large a Symmetrical Complementary Coefficient is worth paying attention to. At first, we considered a fixed threshold, but it is difficult to determine because it depends on the size of the data set, the characteristics of the features, and the problem studied; a better solution is left for future work. In this paper, we give a method, actually a piecewise function, for determining the threshold that yields good results.

Because the classification results contain small random deviations, we first define a Symmetrical Complementary Coefficient to be valid if it is greater than 0.05. The identification of feature interactions should be treated with caution because it is the opposite of removing features. Taking the mean of the valid Symmetrical Complementary Coefficients as the threshold is a good choice. At the same time, we note that 0.5 is an important demarcation point: it distinguishes the overall level of the coefficients in different data sets, and the threshold needs to be chosen more carefully when the coefficients are generally large. We therefore subdivide the value of the mean further, and the threshold β is given by the following equation.

$$ \beta=\left\{\begin{array}{ll} mean\!(SCom_{\textit{valid}})& \text{If}~mean\!(SCom_{\textit{valid}})\!\in [0,0.4)\\ 0.5 & \text{If}~mean\!(SCom_{\textit{valid}})\!\in[0.4,0.5)\\ mean\!(SCom_{\textit{valid}})& \text{If}~mean\!(SCom_{\textit{valid}})\!\in[0.5,0.8)\\ 1 & \text{If}~mean\!(SCom_{\textit{valid}})\!\in [0.8,1] \end{array} \right. $$
(4)

where mean(SComvalid) is the mean of the valid Symmetrical Complementary Coefficients. This piecewise function is an empirical rule that makes the results on this paper's experimental data sets generally good. For a specific data set, other search strategies may yield better thresholds, but they bring more time cost.
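Equation (4) translates directly into code; the sketch below is a straightforward transcription, and the fallback when no coefficient is valid is our own convention.

```python
def threshold_beta(scom_values, valid_floor=0.05):
    """Threshold beta from equation (4): average the valid coefficients (> 0.05)
    and adjust the mean piecewise."""
    valid = [s for s in scom_values if s > valid_floor]
    if not valid:          # no valid interaction found: never package (our convention)
        return 1.0
    m = sum(valid) / len(valid)
    if m < 0.4:
        return m
    if m < 0.5:
        return 0.5
    if m < 0.8:
        return m
    return 1.0
```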

6 Experiment

6.1 Data set

In this section, 10 real-world data sets from the UCI repository [6] are used. They cover a wide range of situations, i.e., different numbers of instances, features and classes, and different proportions of discrete and continuous features, and thus represent a large proportion of real-world applications. Their features are normalized by Min-Max scaling before the experiment. Details are shown in Table 4.

Table 4 Data set description

6.2 Evaluation metric

The number of features in the resulting feature subset, accuracy, Kappa coefficient and mean-square error (MSE) were selected as the evaluation metrics.

The Kappa coefficient is calculated from the confusion matrix. The overall accuracy only reflects the samples that are correctly classified on the diagonal, whereas the Kappa coefficient also considers the misclassified samples off the diagonal. Its value usually lies between 0 and 1, and the larger, the better.

Table 5 gives a typical confusion matrix for a problem whose class variable has C states, where ni,j denotes the number of observations whose actual class is j and that are classified as class i (i = 1,⋯ ,C, j = 1,⋯ ,C).

$$ \begin{array}{@{}rcl@{}} accuracy=\frac{{\sum}_{i=1}^{C}n_{i,i}}{N}\\ p_{0}=accuracy \\ p_{e}=\frac{{\sum}_{i=1}^{C}{n_{i}\cdot\hat{n_{i}}}}{N^{2}}\\ kappa \ coefficient=\frac{p_{0}-p_{e}}{1-p_{e}} \end{array} $$

MSE has some good properties: it can be decomposed into the variance plus the squared bias, and the smaller the MSE, the better. Since MSE is defined for regression, we make an adjustment: the voting proportion p of sample xi for its true class is taken as the predicted value of the RF model, and 1 is taken as the true value of xi, as follows.

$$ \begin{array}{@{}rcl@{}} &&\hat{f}(x_{i})=p \qquad f(x_{i})=1\\ &&MSE=\frac{1}{N}\sum\limits_{i=1}^{N}(\hat{f}(x_{i})-f(x_{i}))^{2} \end{array} $$
Table 5 A typical resulting confusion matrix
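The classifier-based metrics above can be computed from the confusion matrix and the RF vote proportions as sketched below; this is an illustrative transcription of the formulas, not the authors' code, and the function names are ours.

```python
import numpy as np

def accuracy_and_kappa(conf):
    """conf[i, j] = number of samples of actual class j predicted as class i (Table 5)."""
    conf = np.asarray(conf, dtype=float)
    N = conf.sum()
    p0 = np.trace(conf) / N                              # accuracy
    pe = (conf.sum(axis=1) * conf.sum(axis=0)).sum() / N**2
    return p0, (p0 - pe) / (1 - pe)                      # accuracy, kappa coefficient

def adjusted_mse(vote_share_true_class):
    """Adjusted MSE: predicted value is the vote share p for the true class, target is 1."""
    p = np.asarray(vote_share_true_class, dtype=float)
    return np.mean((p - 1.0) ** 2)
```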

6.3 Comparison algorithms

In Section 4, we illustrated the defects of the original SFS algorithm and how the SCom-SFS algorithm solves them; therefore, it is not necessary to compare SCom-SFS with SFS again.

We compare RRSS and RF-efficient-ReliefF with five classic feature selection algorithms and two recent feature selection algorithms that consider feature interactions. The algorithms used in this experiment are as follows.

  1. MDG [2], an embedded method, included in the process of building the RF.

  2. ReliefF [19], one of the most famous feature selection algorithms.

  3. mRMR [25], one of the most famous feature selection algorithms, which handles irrelevant and redundant features at the same time.

  4. RFE [27], a wrapper method.

  5. Boruta [20], a wrapper method.

  6. CFR [10], one of the latest feature selection algorithms.

  7. FJMI [31], one of the latest feature selection algorithms, which takes high-dimensional feature interactions into account.

  8. RF-efficient-ReliefF, the improved version of ReliefF proposed in this paper.

  9. RRSS, the combination of RF-efficient-ReliefF and SCom-SFS proposed in this paper.

Among them, RF-efficient-ReliefF, ReliefF, MDG, mRMR, CFR, and FJMI require the number of selected features to be specified manually, so we apply RFWFS to obtain their optimal feature subsets.

6.4 Experimental settings

These nine algorithms are tested on the 10 data sets, using the Random Forest classifier provided by R. For ease of comparison, the RF models use the default parameters: the number of trees is 500 and the number of randomly chosen attributes at each node equals the first integer less than log2(number of features) + 1. To reduce randomness, each evaluation metric is computed 30 times and the results are averaged. The results are calculated on OOB data; Breiman [1] proved that the OOB estimate is unbiased, and compared with cross-validation it is simpler and faster to compute while giving similar results.

We use the ‘CORElearn’ package in R to implement the ReliefF algorithm, with m = 1000 randomly selected samples and k = 5 nearest neighbors.

We use the ‘mRMRe’ package in R to implement the mRMR algorithm with its default parameters.

We use the ‘caret’ package in R to implement the RFE algorithm, with functions = rfFuncs, method = cv, number = 5.

We use the ‘Boruta’ package in R to implement the Boruta algorithm, with pValue = 0.01, mcAdj = TRUE, maxRuns = 100.

In the process of implementing CFR and FJMI, continuous features are discretized by the Ameva algorithm [11]. The discretized features are only used for feature selection; the classification process still uses the original features.

Table 6 The numbers of selected samples to implement the new ReliefF algorithm

In the RF-efficient-ReliefF algorithm, no sampling is required, and the number of nearest neighbors is k = 5. Table 6 shows how many samples were selected for the new ReliefF step. Since building the RF classifier involves randomness, these numbers are derived from a single RF.

In the RRSS algorithm, the Symmetrical Complementary Coefficients are calculated. Let SCom(i,i + 1) (i = 1,2,⋯ ,n − 1) be the Symmetrical Complementary Coefficient between feature Xi and the next feature in the ordering, as in Example 2; since the last feature has no successor, its value defaults to 0. From the points \(\{(X_i,SCom(i,i+1))\}_{i=1}^{n-1}\) we draw a figure for each data set. Figures 2, 3 and 4 show the Symmetrical Complementary Coefficients of the first three data sets and their thresholds; the threshold is drawn as a horizontal line. The figures for the remaining seven data sets, Figs. 6, 7, 8, 9, 10, 11 and 12, are given in the Appendix.

[Fig. 2: Breast-cancer data set]

[Fig. 3: German-credit data set]

[Fig. 4: Chess-kr-vs-kp data set]

In Fig. 2 the mean of the valid Symmetrical Complementary Coefficients is 0.38579, so the threshold is β = 0.38579 according to (4), and features X6 and X4 are packaged.

In Fig. 3 the mean of the valid Symmetrical Complementary Coefficients is 0.45128, so the threshold is β = 0.5 according to (4), and features X2, X5 and X12 are packaged.

In Fig. 4 the mean of the valid Symmetrical Complementary Coefficients is 0.41456, so the threshold is β = 0.5 according to (4). Features X10, X33 and X21 are packaged; X15, X6 and X34 are packaged; X9, X18 and X35 are packaged; X1, X22 and X7 are packaged; X27 and X11 are packaged; X31 and X2 are packaged.

7 Results and analyses

In the experiments we selected four evaluation metrics, i.e., the number of features in the resulting feature subset, Accuracy, Kappa coefficient and MSE. Their results on the 10 data sets and the averages are shown in Tables 7, 8, 9, and 10, respectively. In these tables, the best algorithm for each data set is emphasized in bold and the second best in italics.

Table 7 The number of features in the resulting feature subsets
Table 8 Accuracy results(%)
Table 9 Kappa coefficient results(%)
Table 10 MSE results(%)

RRSS achieves a smaller number of features while maintaining the best ranking, especially on the ‘Chess-kr-vs-kp’, ‘Spambase’, and ‘Wine’ data sets. The number of features obtained by these algorithms on the ten data sets is shown in Fig. 5.

[Fig. 5: Number of features obtained by these 9 algorithms]

Averaging the results directly seems unfair because the average is strongly affected by large values, so we also report Friedman's ranks [9] of these results. For each algorithm, its rank on each data set is obtained first and these ranks are then averaged; the best performing algorithm is assigned rank 1, the second best rank 2, and so on. Table 11 shows the Friedman's ranks of the nine algorithms on the four evaluation metrics, together with their overall average. The best algorithm over the ten data sets is emphasized in bold and the second best in italics.
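The ranking step and the paired test used later can be reproduced with scipy as sketched below; this is our illustrative helper (it assumes higher metric values are better, so metrics such as subset size or MSE would need their sign flipped), not the authors' scripts.

```python
import numpy as np
from scipy import stats

# results[d, a]: metric value of algorithm a on data set d (higher is better)
def friedman_ranks(results):
    """Average Friedman ranks: rank algorithms within each data set (best = 1), then average."""
    ranks = np.vstack([stats.rankdata(-row) for row in results])  # negate so best gets rank 1
    return ranks.mean(axis=0)

def compare_two(results, a, b):
    """Wilcoxon signed-rank test between two algorithms across the data sets."""
    return stats.wilcoxon(results[:, a], results[:, b]).pvalue
```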

Table 11 Friedman’s ranks of the four evaluation metrics of different algorithms

In addition, all 9 algorithms were compared pairwise with the Nemenyi test [22], a non-parametric test, at the significance level α = 0.05. We pooled the ranks of the four evaluation metrics to carry out the tests. The Nemenyi test rejects the hypothesis that two algorithms are equivalent if the corresponding p-value is ≤ 0.05. When there is a significant difference, the better algorithm is marked in bold. Table 12 shows the p-values of the Nemenyi test for the pairs of algorithms with significant differences.

Table 12 p-values of the Nemenyi test on the pairs of algorithms which have significant differences

We then performed the Wilcoxon signed rank test [5] on RRSS versus RF-efficient-ReliefF, and on RF-efficient-ReliefF versus ReliefF. Table 13 shows the p-values; when there is a significant difference, the better algorithm is marked in bold. The corresponding p-values are ≤ 0.05, so we can reject the hypotheses that the two pairs of algorithms are equivalent and conclude that RRSS is better than RF-efficient-ReliefF, and RF-efficient-ReliefF is better than ReliefF.

Table 13 p-values of the Wilcoxon signed rank test

From these results, several points can be obtained.

  • The RRSS algorithm, the combination of RF-efficient-ReliefF and SCom-SFS, ranks first among the nine algorithms. Overall, RRSS has much higher accuracy and Kappa coefficient and smaller MSE than the other algorithms while producing the smallest feature subsets. Although it performs slightly worse on the Chess-kr-vs-kp data set, it significantly reduces the number of features compared with the other algorithms.

  • The RF-efficient-ReliefF algorithm proposed in this paper is better than the original ReliefF algorithm, and its combination with RFWFS shows a statistically significant advantage over the Boruta algorithm. The remaining pairs of algorithms show no significant differences.

  • The SCom-SFS algorithm proposed in this paper is very successful. Its combination with RF-efficient-ReliefF is much better than the combination of RF-efficient-ReliefF and RFWFS. The feature subsets selected by SCom-SFS are much smaller than those of the other algorithms, because it minimizes the subset size by adding features instead of deleting them. Moreover, SCom-SFS considers feature interactions: based on the feature ordering given by RF-efficient-ReliefF, it can find the truly effective features and is unlikely to fall into a local optimum. At the same time, SCom-SFS can be seen as a correction to the results of the RF-efficient-ReliefF algorithm.

8 Conclusion

After analyzing the shortcomings of the ReliefF algorithm, this paper combined it with Random Forest and proposed the RF-efficient-ReliefF algorithm. It can screen out representative samples that are more useful for identifying the class labels, the feature weights it obtains are more reliable, and it requires no sampling.

This paper proposed a new method for quantifying feature interactions, namely the Symmetrical Complementary Coefficient, and illustrated its advantages with an example.

We proposed SCom-SFS, which is based on the Symmetrical Complementary Coefficient. It recognizes feature interactions and, through the given threshold, takes strong feature interactions into account in the subset search, packaging the strongly interacting features so as to find the truly effective feature subset at a faster speed.

We combined the RF-efficient-ReliefF algorithm with the SCom-SFS algorithm to obtain a complete feature selection algorithm, the RRSS algorithm. It was tested on 10 data sets and compared with 8 other algorithms, including 5 classic and 2 recent feature selection algorithms; the feature subsets it obtained have the best performance while containing the fewest features.

In the future, we will look for a better method to determine the threshold more accurately and combine the Symmetrical Complementary Coefficient with other algorithms so that feature interaction can play a greater role.