1 Introduction

Feature selection plays an important role in many aspects of machine learning such as multivariate classification (including the binary classification) where each instance has just one class, and the multi-label classification [18, 19] where there is more than one class variable or each instance can belong to multiple classes at the same time, and sometimes there are dependencies between these classes. These classification tasks all need to learn the input data, and the features are used to characterize the data from different perspectives. Whereas, for a data set, sometimes many features of it are not helpful for learning and mining tasks, or even harmful [20]. Thus, correct features are essential. Nowadays, there is a growing requirement of feature selection, as data sets are getting bigger and wider.

1.1 Literature review

Features which need to be coped with can be divided into three types, i.e., irrelevant features, redundant features and interactive features. Irrelevant feature refers to the one which does not help with learning and mining tasks, as it has no connection or weak connection with the target [26]. Redundant feature refers to the one whose information or knowledge can be provided by other features [48]. Irrelevant features and redundant features need to be removed. Whereas, the interactive features need to be retained for further observation. Feature interaction refers to that multiple features contribute more than the sum of these individual features’ contribution, and these features are interactive features [25]. A typical example of it is the exclusive OR operation, i.e., Y = X1X2, and Y, X1, X2 are Boolean. A single feature X1 or X2 is irrelevant to Y, but when they are combined together, they have a strong correlation with Y. Therefore, removing any one of them will lead to a serious consequence. More and more scholars are beginning to pay attention to feature interactions [41, 43, 44, 49]. Recently, [50] proposed a new well-performed metric of feature interaction named Symmetrical Complementary Coefficient (SCom) which measures feature interaction based on classifiers at high speed. Whereas, they have to determine SCom’s threshold artificially, which is troublesome.

Feature selection can be divided into four types. Firstly, filter method such as ReliefF [28] which determines the weight of features by calculating the distance between randomly selected instances and their nearest neighbors of the same and different classes, mRMR [34] which quantify candidate feature’s relevancy and the redundancy between each candidate feature and the selected features based on the information theory, and CRF [14] which can find the composition of feature relevancy and maximize feature relevancy while minimizing feature redundancy based on the information theory. Secondly, wrapper method such as RFE [39] which builds the models iteratively, and then selects the best (or worst) features based on the fitness function, and Boruta [29] which iteratively creates new feature set composed of shadow features and original features, builds model, and then determines important or unimportant features. Thirdly, embedded method which evaluates features during the model building process such as L1, L2 regularization [33] and Random Forest [3]. Finally, hybrid method such as BBHFS algorithm [7] which incorporates some advantages of filter and wrapper methods through a natural stopping criterion and the concept of boosting, [22] combines filter and wrapper methods which takes advantage of both the filters and wrappers, [45] combines two different optimal filters and a component co-occurrence based feature relevance measure is proposed, [46] combines several filters and calculate the multi-filter wights and the multi-feature weights, then using a new Q-range based feature relevance calculation method and the greedy searching policy to get the final feature subset. Among them, the hybrid method is getting more and more popular, as it combines several different feature selection algorithms which belong to different kinds of methods or the same kind of method, and normally performs better.

Besides, these traditional methods are often combined with the evolutionary algorithms which showed promising results, [23]. Combined Genetic Algorithm (GA) with filters and wrappers, and both the parsimonious feature selection and excellent classification accuracy of the new algorithm are presented. [51] proposed PSO-based multi-objective feature selection algorithm which consists of a probability-based encoding technology and an effective hybrid operator can evolve a set of non-dominated solutions automatically. [15], combined six well-known filters and an enhanced GA in a wrapper approach. Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) algorithm [4, 12, 31, 37] could brilliantly complete the problem space exploration through the lens of chosen features, and generate a set of solutions which is both diverse and high-performing. [52], proposed two-archive multi-objective artificial bee colony algorithm (TMABC-FS) which contains two new operators, i.e., convergence-guiding search for employed bees and diversity-guiding search for onlooker bees, and two archives, i.e., the leader archive and the external archive. [16], proposed a wrapper-filter combination of Ant Colony Optimization (ACO) which can reduce computational complexity and performs well. [53], improved the binary differential evolution algorithm based on a new binary mutation operator, a new one-bit purifying search operator (OPS), and a novel efficient non-dominated sorting operator to solve the multi-objective feature selection problems. [40], proposed the variable-size cooperative coevolutionary particle swarm optimization algorithm (VS-CCPSO) which is competitive in solving the problem of high-dimensional feature selection problems. [24], proposed a novel two-step method called Information Gain Directed Feature Selection (IGDFS) algorithm, in which Information Gain reduces the feature space of possible configuration, and GA is combined with a wrapper to get the optimal solution. [21], proposed a tri-objective GA-based feature selection algorithm which considers the number of features, mutual information, and accuracy at the same time. [6], proposed a bi-objective GA-based feature selection algorithm whose two fitness functions are based on rough set theory and Kullback-Leibler divergence.

Although many scholars have combined hybrid methods with evolutionary algorithms, there are still some defects. Firstly, most of them [6, 15, 16, 21, 23, 24] use wrappers which means their methods are classifier-dependent in each generation. As evolutionary algorithms have many generations and there are many individuals in each generation, the time cost has increased significantly. Secondly, when generating the initial population, a large proportion of studies use random [15, 16, 21, 23, 24] or some pseudo-random pattern generator [6], which does not take full advantage of knowledge or information of data set used. Finally, at present, no research has combined feature interaction with GA-based feature selection algorithm.

1.2 Contribution

In this paper, a novel feature selection algorithm is proposed, which can solve these problems and performs better, i.e., H ybrid filter and S com based M ulti-O bjective G enetic A lgorithm feature selection (HSMOGA). Main contributions of this paper are as follows.

  • HSMOGA contains three new fitness functions including a hybrid filter which performs more balanced than these single filters, a new usage of the feature interaction metric SCom which solves the problem of [50] that needs to determine the threshold of SCom artificially, and a novel way to limit the feature subset’s size. In this way, no wrapper method is contained in HSMOGA, and HSMOGA is classifier-independent in each generation. Which also makes it faster than other GA-based algorithms, and we are the first to combine GA-based feature selection algorithm with the feature interaction metric.

  • A new Pareto-based ranking function which is more suitable than the traditional one in NSGA-ii for these three fitness functions is proposed.

  • We make great improvements in the structure, whose superiority is obvious. HSMOGA starts with a new step called knowledge reserve, which precalculate the knowledge required for the calculation of fitness functions and the initial population generation. Compared with other GA-based feature selection methods, HSMOGA has a much lower time complexity and its parameter tuning can save a lot of time.

  • Its initial population generation makes full use of the knowledge of data set, which makes solutions converge faster. Moreover, it is related to the third fitness function that controls the size of the feature subset, and jointly promotes HSMOGA to obtain better results.

We design three experiments to objectively evaluate the effectiveness of the two of three fitness functions and the Pareto-based ranking function proposed in this paper, and two experiments to objectively compare and assess the relative performance of HSMOGA in terms of classification performance and time cost. According to the experimental results, the fitness functions and ranking function proposed in this paper are satisfactory. For the data sets tested, HSMOGA outperforms other nine state-of-art feature selection algorithms including five classic and four more recent algorithms. Moreover, for the data sets tested, HSMOGA is much faster than other GA-based feature selection algorithm which is classifier-dependent in each generation, and even if the max number of generations increases a lot, the time cost will not increase much.

The structure of this paper is as follows. Section 2 introduces the preliminaries. In Section 3, we introduce HSMOGA step by step. Section 4 presents experiments, results, and analyses. Section 5 concludes our work.

2 Preliminaries

We introduce some methods related to this article. Firstly, the three feature filters. Secondly, the Symmetrical Complementary Coefficient. Finally, the Multi-objective genetic algorithm.

2.1 Three filters

2.1.1 Information gain

Considering the data set with N instances, n features, and 1 class variable, S = (X, Y ) = {(xi1, xi2, ⋯ , xin, yi)}(i = 1, ⋯ , N), where X = (X1, ⋯ , Xn), yi ∈ {1,2, ⋯ , C}, and C is the number of class.

Entropy is a measure of variable’s uncertainty [5]. Entropy of the class variable Y is in (1).

$$ H(Y)=-\sum\limits_{k=1}^{C}p_{k}\log_{2}p_{k} $$
(1)

where pk is the proportion of instances whose class is k.

According to values of a specific feature A ∈ {Xj}, j = 1, ⋯ , n, instances are divided into V subsets. Therefore, class variable Y is divided into {Y1, Y2, ⋯ , YV}. Information Gain (IG) of feature A is in (2).

$$ IG(A)=H(Y)-\sum\limits_{v=1}^{V}\frac{|Y^{v}|}{|Y|}H(Y^{v}) $$
(2)

IG quantifies the information shared by feature A and class variable Y, which is often used as a filter to measure whether a feature is good or not for the classification problem. One of its most famous applications is the ID3 decision tree algorithm [35]. Although it has a defect that favors the features with more values.

2.1.2 Information gain ratio

Information Gain Ratio (IGR) is an improvement of IG to overcome the above defect through adding a penalty term, as in (3).

$$ \mathit{IGR}(\mathit{A}) = \frac{IG(A)}{-{\sum}_{v=1}^{V}\frac{|Y^{v}|}{|Y|}\log_{2}\frac{|Y^{v}|}{|Y|}} = \frac{\mathit{H}(\mathit{Y}) - {\sum}_{v=1}^{V}\frac{|Y^{v}|}{|Y|}H(Y^{v})}{-{\sum}_{v=1}^{V}\frac{|Y^{v}|}{|Y|}\log_{2}\frac{|Y^{v}|}{|Y|}} $$
(3)

IGR is also known as a feature filter and is applied to the C4.5 decision tree [36]. Whereas, IGR favors the features with fewer values. So, at each split point of C4.5 decision tree, firstly, select those features whose IG are greater than the average, and then select the feature with the largest IGR.

2.1.3 ReliefF

ReliefF is a distance-based filter method [28]. It assigns higher weights to the features which make distance between instances with the same class shorter, while distance between instances with different classes longer, and vice versa. It calculates feature’s weight iteratively through select m instances randomly, as in (4).

$$ \begin{aligned} W(A)\!&=W(A)-\sum\limits_{i=1}^{k}diff(A,R,NH_{j})/(mk)+\\ &\sum\limits_{c\neq class(R)}\left[\!\frac{p(c)}{1 - p(\mathit{class}(\mathit{R}))}\cdot\sum\limits_{j=1}^{k}\mathit{diff}(\mathit{A,R,NM}_{j}(c))\!\right]\/(mk) \end{aligned} $$
(4)
$$ diff(A,R_{1},R_{2})= \left\{\begin{array}{ll} \frac{|R_{1}[A]-R_{2}[A]|}{\max(A)-\min(A)}& \text{If } A \text{ is continuous;}\\ 0&\text{If } A \text{ is discrete and}\\ &\quad R_{1}[A]=R_{2}[A];\\ 1&\text{If } A \text{ is discrete and}\\ &\quad R_{1}[A]\neq R_{2}[A]; \end{array}\right. $$
(5)

where W(A) is the weight of feature A. R is the current selected instance. NHj(j = 1, 2, … , k) are R’s k nearest neighbor instances which have the same class with R. NMj(c)(j = 1, 2, … , k) are R’s k nearest neighbor instances whose class are c and c is different from the class of R. class(R) is the class of R. p(c) is the proportion of class c. R[A] is the value of R on A.

2.2 Symmetrical complementary coefficient

Symmetrical Complementary Coefficient (SCom) [50] is an effective metric of feature interactions. Feature interaction refers to that multiple features contribute more than the sum of these individual features’ contribution. k-way feature interaction acts on k features A1, A2,…, Ak. Let e(Y ;A1, A2,…, Ai) (i = 1, 2, … , k) be the contribution of feature(s) to the class variable Y. So, k-way feature interaction exists if (6) exists [25].

$$ e(Y;A_{1}, A_{2},\ldots, A_{k})\geq \sum\limits_{i=1}^{k}e(Y;A_{i}) $$
(6)

SCom measures feature interaction based on classifiers at a high speed. Considering two features Xi, Xj, firstly, they are used individually to represent the characteristics of the data set, namely, new data set Si = (Xi, Y ), Sj = (Xj, Y ). Then, two classification models Mi, Mj are established through learning Si, Sj. Then, get two models’ corresponding misclassified instances Di, Dj. Note that, in this step, reserved validation data set can be used, or use the OOB data [2] if the classifier are Bagging-based ensemble method such as Random Forest just like [50] did. Next, Di, Dj are classified by Mj, Mi respectively. Define Di’s subset which are correctly classified by Mj is Dij. Define Dj’s subset which are correctly classified by Mi is Dji.

Therefore, define ECom(i, j) is the Enhanced Complementary Coefficient from Xj to Xi, as in (7). ECom(j, i) is the Enhanced Complementary Coefficient from Xi to Xj, as in (8). SCom(i, j) is the SCom between Xi and Xj, as in (9).

$$ ECom(i,j)=\frac{|D_{ij}|}{|D_{i}|} $$
(7)
$$ ECom(j,i)=\frac{|D_{ji}|}{|D_{j}|} $$
(8)
$$ \begin{array}{@{}rcl@{}} SCom(i,j)&=&SCom(j,i) = \frac{ECom(i,j) + ECom(j,i)}{2}\\ &=&\frac{1}{2}\left( \frac{|D_{ij}|}{|D_{i}|}+\frac{|D_{ji}|}{|D_{j}|}\right) \end{array} $$
(9)

The situation of two groups of features can be given similarly. However, it will cost much more time, while calculating SCom between two features takes very little time, as we only need to use one feature to create the classifier. This paper only considers the circumstances of two features. Note that, a large SCom(i, j) does not represent Xi, Xj should be contained, but only represents Xi, Xj should be considered simultaneously, and we should combine SCom with two features’ contribution to class variable to decide whether to keep them or not [50]. Thus, SCom can be combined with some feature evaluation methods and feature subset searching algorithms to obtain a better feature subset.

2.3 Multi-objective genetic algorithm

Genetic algorithm (GA) [8] is a kind of heuristic evolutionary algorithms based on language of natural genetics and biological evolution. In GA, each variable corresponds to a gene and one solution of these variables corresponds to a chromosome. There is a fitness function to evaluate chromosomes’ performance by calculating their fitness values. According to these fitness values, selection, crossover, and mutation operation are implemented to these chromosomes to generate new individuals. These steps are cycled until the stopping criteria is reached. In this paper, ‘chromosome’, ‘individual’, and ‘solution’ are the same, and feature subset corresponds to them.

GA deals with just one fitness function. Whereas, a large proportion of real-world problems are multi-objective inherently, in which more than one objective needs to be satisfied simultaneously. Moreover, sometimes these objectives are conflicting. Apparently, using a single fitness function will cause that the promising resolution cannot be found. Multi-objective genetic algorithm (MOGA) [9, 27] was proposed to solve these problems. MOGA contains more than one fitness function, i.e., F1(⋅), F2(⋅), ⋯ , Fm(⋅). In each generation, these fitness functions’ values of all solutions need to be calculated.

A solution ch1 dominate another solution ch2 if the following two condition are met:

  1. 1.

    i ∈ {1, 2, ⋯ , m}, Fi(ch1) ≥ Fi(ch2)

  2. 2.

    j ∈ {1, 2, ⋯ , m}, Fj(ch1) > Fj(ch2)

The dominating relation is not reflexive and symmetric but transitive. Considering one generation’s solution set R, its non-dominated solution set \(R^{\prime }\), which is also called the Pareto front, refers to a subset whose members are not dominated by any member of R. Some of MOGA’s operations are implemented based on it.

3 HSMOGA

In this section, a novel feature selection algorithm is proposed, i.e., H ybrid filter and S ymmetrical Complementary Coefficient based M ulti-O bjective G enetic A lgorithm feature selection (HSMOGA). Let’s introduce HSMOGA step by step.

3.1 Knowledge reserve

Unlike the traditional GA and MOGA, we are not starting with generating the initial population, but starting with knowledge reserve. Knowledge reserve refers to calculation and storage of knowledge or information that will be used in subsequent steps including the initial population generation and calculation of fitness functions. Reserved knowledge is divided into two parts, features’ hybrid weights, i.e., HW = (hw1, hw2, ⋯ , hwn), and their SComs, i.e., SCn×n. The hybrid filter is made up with three classic filter, i.e., IG, IGR, and ReliefF. HW is the normalization of the average ranking of IG, IGR and ReliefF weights of these features, as in (10) and (11). SCn×n is a symmetric matrix of SCom of these features, and each element of it is the SCom of the two corresponding features, as in (12).

$$ HW_{temp}=\frac{\mathit{rank}(\mathit{IG}(X))+\mathit{rank}(\mathit{IGR}(X))+\mathit{rank}(W(X))}{3} $$
(10)
$$ HW=\frac{HW_{temp}-\min(HW_{temp})}{\max(HW_{temp})-\min(HW_{temp})} $$
(11)
$$ \mathit{SC} = \begin{pmatrix} 0&\mathit{SCom}(1,2)&\mathit{SCom}(1,3)&\cdots&\mathit{SCom}(1,n)\\ \mathit{SCom}(2,1)&0&\mathit{SCom}(2,3)&\cdots&\mathit{SCom}(2,n)\\ \mathit{SCom}(3,1)&\mathit{SCom}(3,2)&0&\cdots&\mathit{SCom}(3,n)\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \mathit{SCom}(n,1)&\mathit{SCom}(n,2)&\mathit{SCom}(n,3)&\cdots&0\\ \end{pmatrix} $$
(12)

where X = (X1, ⋯ , Xn). IG(⋅), IGR(⋅), W(⋅), SCom (i, j) are calculated according to (2), (3), (4) and (9), respectively. IG(X), IGR(X), W(X) are all vectors, and each of their elements represents weight of the corresponding feature. rank(⋅) means calculating the rank of every element in a vector. The lowest is assigned 1, the second lowest is assigned 2, etc. min(⋅), max(⋅) represent the minimum and maximum values in a vector, respectively. Denote that HW[k] = hwk, SC[i, j] = SCom(i, j).

The reason for choosing IG, IGR and ReliefF is that they are commonly used filters which have good effects. Meanwhile, using both IG and IGR can solve the problem of their respective bias, which is somewhat similar to C4.5 [36].

Given that a single filter may performs better on some data sets and worse on some other data sets [46], in this paper, we just want to get a hybrid filter that performs well on different kinds of data sets, that is, a more balanced hybrid filter. Of course, it can be replaced with other different kinds of filters or hybrid filters. Therefore, we choose to average the results of these three commonly used filters. Moreover, rank(⋅) can avoid the deviation caused by metrics of different filters. For example, there are a data set with two features A and B,

$$ \begin{array}{@{}rcl@{}} IG(A)=0.19, IGR(A)=0.1, W(A)=0.1\\ IG(B)=0.1, IGR(A)=0.12, W(A)=0.12\\ \frac{IG(A)+IGR(A)+W(A)}{3}=\frac{0.39}{3}\\ \frac{IG(B)+IGR(B)+W(B)}{3}=\frac{0.34}{3}\\ \frac{rank(IG(A))+rank(IGR(A))+rank(W(A))}{3}\\=\frac{2+1+1}{3}=\frac{4}{3}\\ \frac{rank(IG(B))+rank(IGR(B))+rank(W(B))}{3}\\=\frac{1+2+2}{3}=\frac{5}{3} \end{array} $$

By inspection, the latter is more reasonable. In the experiment section, we will design an experiment to demonstrate that this hybrid filter performs more balanced on different data sets, and always has a good performance.

3.2 Population generation

Generating the initial population is a crucial thing in GA, MOGA or some other evolutionary algorithms. Whereas, a large proportion of studies use random [15, 16, 21, 23] or some pseudo-random pattern generator [6]. Knowledge of the used data set should be fully utilized. Although, this bring in more time cost. Therefore, we introduced the knowledge reserve to share time cost with other steps.

Every generation contains M chromosomes. A chromosome chs(s = 1,2, ⋯ , M) is a binary vector whose length is n. Each chromosome is a feature subset, in which ‘1’ represents its corresponding feature is contained, and vice versa. When generating the first generation of individuals, each element of chs, i.e., chs[k](k = 1,2, ⋯ , n) is ‘1’ with a probability pk and is ‘0’ with a probability 1 − pk, as in (13). The formula of pk(k = 1,2, ⋯ , n) is in (14) and (15).

$$ ch_{s}[k]=\left\{\begin{array}{ll} 1 \quad p_{k}\\ 0 \quad 1-p_{k} \end{array}\right. $$
(13)
$$ p_{k,temp}=HW[k]+bias $$
(14)
$$ p_{k}=\left\{\begin{array}{lll} 0 \quad &\text{if } p_{k,temp} \in [-1,0]\\ p_{k,temp} \quad &\text{if } p_{k,temp} \in (0,1)\\ 1 \quad &\text{if } p_{k,temp} \in [1,2] \end{array} \right. $$
(15)

where bias is a control term that controls the size of feature subsets when facing a specific data set. Range of bias is [− 1,1]. Thus, we can control the size of feature subset by controlling the probability that each element of chromosome takes ‘1’. Meanwhile, it is guaranteed that the better the feature (the larger the HW[k]), the greater the probability that it appears in the feature subset.

3.3 Fitness functions

In this paper, three new fitness functions are combined with MOGA. They are proposed from three perspectives, i.e., average hybrid weight, average strength of feature interaction, and size of feature subset, as in (16), (17) and (18), respectively. Values of these three fitness functions are the larger the better.

$$ F_{1}(ch_{s})=\frac{1}{|K|} \sum\limits_{k\in K} HW[k] $$
(16)
$$ F_{2}(ch_{s})=\frac{2}{|K|(|K|-1)} \sum\limits_{k_{1}\in K}\sum\limits_{k_{2}\in K \wedge k_{2}>k_{1}} SC[k_{1},k_{2}] $$
(17)

where K is the collection of sequence numbers of elements that are not ‘0’ in chs. |K| is K’s length, and |K| is also the size of the feature subset which corresponds to chs. For example, if chs = (1,0,1,1,0), then K = {1,3,4},|K| = 3.

$$ F_{3}(ch_{s})=-\left||K|-en \right| $$
(18)
$$ en=\sum\limits_{k=1}^{n}p_{k} $$
(19)

where en represents the expected size of feature subset, as the probability that each feature appears in the initial feature subset is pk. By searching for the optimal bias, we can get the optimal en and make the feature subsets of every generation closer to en.

F1(⋅) is the average hybrid weight of a feature subset. F2(⋅) is the average feature interaction strength of every two features. At present, no research has combined feature interaction with GA-based feature selection algorithm. So, we proposed the F2(⋅) function to quantify the feature interaction degree of each feature subset generated by MOGA. F2(⋅) solves the problem that [50] needs to artificially determine the threshold of SCom. Here we only need to calculate the average value and compare it with the values of other feature subsets. F3(⋅) is the limitation of feature subset’s size, and it can also be seen as the limitation of F1(⋅) and F2(⋅). The feature subset whose size is closer to en has a larger value of F3(⋅).

These three functions are conflicting to some extent, which is suitable for MOGA. If there is no limitation, the best solution which maximize F1(⋅) will be the feature subset with just one feature, and this feature has the largest hybrid weight. Similarly, the best solution which maximize F2(⋅) will be the feature subset with only two features, and the SCom between these two features are the largest. We can be find that F1(⋅) and F2(⋅) conflict with F3(⋅) respectively. As the F1(⋅)-optimized solution conflicts with F2(⋅)-optimized solution (One of two contains one feature and the other contains two features), F1(⋅) conflicts with F2(⋅). As F3(⋅) exists, feature subset’s size will be closer to en.

When dealing with multi-objective problems, the most commonly used approaches are weighted sum, altering objective functions, and Pareto-based ranking approaches [27]. In this paper, Pareto-based ranking approach is used, and we propose a new ranking function to assign solution chs (s = 1,2, ⋯ , M) a rank within every generation, as in (20).

$$ RANK(ch_{s})=(M-n_{bedom}(ch_{s})+n_{dom}(ch_{s}))/2 $$
(20)

where nbedom(chs) is the number of solutions which dominate chs, and ndom(chs) is the number of solutions which are dominated by chs within the current generation.

In the multi-objective optimization problem, the relationships between two individuals are more complicated, i.e. in addition to being completely superior to and completely inferior to, it may be unable to determine which one is better. In fact, Mnbedom(chs) and ndom(chs) can be seen as the upper and lower approximation limitation of chs’s rank, where the former indicates the number of individuals that are not better than chs the latter indicates the number of individuals that are worse than chs, and the average of them can be seen as the rank of chs within the whole generation.

Figure 1 gives an example of RANK(⋅). There are two fitness functions. The dot represents chs, and triangles represent other solutions. The small ellipse represents the solutions dominated by chs, and the large ellipse represents the solutions which dominate chs. Therefore,

$$ \begin{array}{@{}rcl@{}} n_{dom}(ch_{s})\!&=&\!6, n_{bedom}(ch_{s}) = 6, M - n_{bedom}(ch_{s}) = 11\\ RANK(ch_{s})\!&=&\!(6+ 11)/2=8.5 \end{array} $$
Fig. 1
figure 1

An example of ranking function

The reason for constructing such a new ranking function is that the commonly used fast non-dominated sorting and the crowding-distance estimation [27] are not suitable for the three fitness functions proposed in this paper. We want those solutions who perform well on all three functions are used to generate the next generation, and those edge points on different non-dominated fronts should be discarded. The fast non-dominated sorting and the crowding-distance estimation which are the methods used by NSGA-ii to solve multi-objective problems, cannot solve these above problems, as edge points will have higher ranks and will be used to form the next generation.

In the experiment section, we will design several experiments to demonstrate the superiority and effectiveness of these fitness functions and the new ranking function.

The rank of each solution, instead of its fitness function values, will be used to perform the evolution operation, i.e., selection, crossover, and mutation.

3.4 Evolution operation

The flow chart of evolution operation we used is in Fig. 2. The solution with a larger rank has a greater chance of being selected. Chromosomes are selected through the roulette wheel strategy. Moreover, the chromosomes whose ranks are smaller than the average will not be selected.

Fig. 2
figure 2

Evolution operation flow chart

When performing crossover operation, two chromosomes also called parents are selected at a time, and two children are generated. One-point crossover is used. Firstly, randomly select one point indicating the position of gene in chromosome. Then, parents exchange the gene sequences after the selected point mutually, and get two children.

When performing mutation operation, one chromosome is selected at a time. In this paper, each gene of the parent chromosome will mutate with a mutation rate rm. The gene to be mutated will change from ‘0’ to ‘1’ or from ‘1’ to ‘0’.

Besides, we make some minor changes to improve our work. In the crossover and mutation operation, child(ren) identical to parent(s) is prohibited. If such situation happens, we will repeat the crossover and mutation operation.

Moreover, we add the replace operation to guarantee that the next generation will not be worse than the current generation. If parents dominate their children, then children will be replaced by their parents. In other words, in this case, parents will take the place of children into the next generation.

Individuals generated by crossover and mutation operation make up the next generation, and each generation has the same number of individuals.

3.5 HSMOGA algorithm

The proposed HSMOGA algorithm is made up with these previous parts. It contains three new fitness functions and a new ranking function. The three new fitness function are proposed from three conflicting perspectives, i.e., average hybrid feature weights, average feature interaction strength, and the limitation of feature subset’s size. The new ranking function is Pareto-based and it considers the upper and lower approximation limitation of every solution’s rank at the same time. HSMOGA’s initial population generation makes full use of the data set’s knowledge. Moreover, the knowledge or information needed for the fitness functions calculation and the initial population generation is pre-calculated, which indicates these parts will run at a very fast speed. Besides, we add some minor steps to improve the algorithm, as follows. Chromosomes whose ranks are smaller than the average will not be selected. In the crossover and mutation operation, child(ren) identical to parent(s) is prohibited. The replace operation is added to guarantee the next generation will not be worse than the current generation.

Flow chart of HSMOGA is in Fig. 3, where the components contained in the dotted box is the main part where HSMOGA differs from other GA-based algorithms. The pseudo-code of HSMOGA is in Algorithm 1. HSMOGA outputs individuals of the last generation. If you want to reduce time cost, you can choose the highest ranked individual as the final feature subset. In this paper, in order to get better results, the top 50% of individuals will be selected and compared by the classifier to get the best feature subset.

figure a
Fig. 3
figure 3

HSMOGA flow chart

3.6 Time complexity analysis

As the knowledge needed for the fitness function is pre-calculated, the calculation of fitness functions takes very little time. Even though the maximum number of generations is large, it won’t take too much time. Therefore, the two parts that take the most time are the knowledge reserve and the final selection of the best feature subset using classifier.

As filter method of feature selection is very fast, main part of the former’s time cost is the calculation of SCn×n. It can be separated into two parts, i.e., the establishment of n models and classification of the misclassified instances between each two classifiers. In this paper, Random Forest is selected as the classifier just like [50] did, as Random Forest has fast speed, high accuracy, and the OOB data as a validation set [2] brings great convenience in calculating the SCom and parameters’ tuning.

Time complexity of establishing n model where each one contains just one feature is the same as that of establishing a Random Forest model which contains n features, namely, \(O(nN\log _{2}N)\). Time complexity of Di(i = 1,2, ⋯ , n) being classified by other n − 1 models is O(n|Di|). So, the time complexity of classification of the misclassified instances between each two classifiers is O(n(|D1| + |D2| + ⋯ + |Dn|)). Given that the number of misclassified instances is often not too much and normally n < N exists. Therefore, time complexity of knowledge reserve is the same as that of establishing a Random Forest model with n features, i.e., \(O(nN\log _{2}N)\).

Final selection of the optimal feature subset indicates establishing M/2 models using M/2 feature subsets of the last generation, where M is the generation size. Let \(n^{\prime }\) be the average size of these M/2 feature subsets. As F3(⋅) limits the size of feature subset, the size of these last generation feature subset is close to the expected size en. So, time complexity of final selection is \(O(\frac {1}{2}Mn^{\prime }N\log _{2}N)\).

So, time complexity of whole method is \(O(\max \limits \{nN\) \(\log _{2}N,\frac {1}{2}Mn^{\prime }N\log _{2}N\})\). Normally, M is not very large. By inspection, the HSMOGA algorithm proposed in this paper has a fast speed. As HSMOGA is classifier-independent in each generation except the last generation, its time complexity is much lower than the time complexity of GA-based feature selection algorithm which is classifier-dependent in each generation, i.e., normally \(O(MGn^{\prime }N\log _{2}N)\), which indicates that HSMOGA can use a larger maximum number of generations G without increasing much time cost. Moreover, as the existence of knowledge reserve, parameters’ tuning does not have to go through all the steps, which significantly reduces time cost. In the experiment section, we will demonstrate that HSMOGA has a significant advantage in terms of time cost over those GA-based feature selection algorithms that are classifier-dependent in each generation.

4 Experiment

4.1 Data sets

In this paper, 17 real-world data sets in the UCI Machine Learning Repository [10] are selected, which are all available to evaluate the performance of machine learning methods. They come from different fields such as clinical research, purchasing intention research, spam email classification, sonar target classification, molecular research, etc. They are of different number of features, classes, instances, and ratios of discrete and continuous features, which can on behalf of a great part of real-world problems, so that they can evaluate the performance of a method accurately. Features of these data sets were normalized by Min-Max scaling before experiments. Details of these data sets are in Table 1. Their names are the first words or abbreviations of the items in UCI Repository and more detailed description of these data sets can be found there.

Table 1 Date set description. N is the number of instances. n is the number of features. Con is the number of continuous features. Dis is the number of discrete features. Con + Dis = n. C is the number of classes

4.2 Experiment settings

We design three experiments to objectively evaluate the effectiveness of the two of three fitness functions and the Pareto-based ranking function proposed in this paper, and other two experiments to objectively compare and assess the relative performance of HSMOGA in terms of classification performance and time cost.

The first experiment is to evaluate the effectiveness of the hybrid filter through demonstrating that the three single filters, IG, IGR, and ReliefF, perform better on some data sets and worse on some other data sets, while the hybrid filter performs more balanced on different data sets, that is, performs well on different data sets. In this experiment, all 17 data set will be used.

Details of evaluating the effectiveness of SCom and its superiority in combination with filters and feature subset generation algorithms can be found in [50]. So, the second experiment is to evaluate the effectiveness of F3(⋅) function through demonstrating that the size of the feature subsets will be close to en which is the expected size of optimal feature subset and is controlled by bias. In this experiment, the ‘Musk2’ data set will be used, as it has a relatively large number of instances and features.

The third experiment is to evaluate the effectiveness of RANK(⋅) through comparison with other two methods, as follows. Note that, initial population generation method, three fitness functions, and the final optimal feature subset generation method are the same in all three cases. In this experiment, all 17 data set will be used.

  • Case 1: HSMOGA.

  • Case 2: RANK(⋅) is replaced by fast non-dominated sorting and the crowding-distance estimation. Binary tournament selection is used, and the replace operation is the same as HSMOGA, which is based on the three fitness values.

  • Case 3: RANK(⋅) is replaced by fast non-dominated sorting and the crowding-distance estimation. Binary tournament selection is used, and the replace operation is replaced by the one in NSGA-ii, which is based on the non-dominated ranks and crowding distances. Actually, this kind of method is NSGA-ii [27].

The fourth experiment is to compare and assess the classification performance of HSMOGA through comparing it with nine state-of-art feature selection algorithms including five classic and four more recent algorithms. Algorithms for comparison are as follows. Among them, mRMR, ReliefF, MDG, CFR need to determine the size of feature subset artificially. So, searching algorithm are implemented in order to get their feature subsets containing the first \(n^{\prime \prime }\) features. \(n^{\prime \prime }\) is optimized by validation set. In this experiment, all 17 data set will be used.

  1. 1.

    mRMR [34], a commonly used filter method.

  2. 2.

    ReliefF [28], a commonly used filter method.

  3. 3.

    MDG [3], a commonly used embedded method, included in the establishment of Random Forest.

  4. 4.

    RFE [39], a commonly used wrapper method.

  5. 5.

    Boruta [29], a commonly used wrapper method.

  6. 6.

    CRF [14], a more recent filter method, which considers feature’s relevance and redundancy at the same time.

  7. 7.

    RRSS [50], a more recent wrapper method, which takes the feature interactions into account.

  8. 8.

    IGDFS [24], a more recent GA-base method, which combines GA, a filter, and a wrapper.

  9. 9.

    FSBOGA [6], a more recent bi-objective GA-based feature selection algorithm whose two fitness functions are defined using rough set theory and the Kullback-Leibler divergence.

  10. 10.

    HSMOGA, the algorithm proposed in this paper.

The fifth experiment is to demonstrate that HSMOGA has a significant advantage over IGDFS and FSBOGA in terms of time cost, and the latter two stands for state-of-art GA-based feature selection algorithms which are classifier-dependent in each generation. In this experiment, data sets ‘Online’, ‘Default’, ‘Musk2’, and ‘Internet’ will be used, as they have a relatively large number of instances or features. The hardware environment used to run this experiment is equipped with 2.50-GHz i5-7300HQ CPU and 8 GB of RAM.

Package of R language is used to perform Random Forest. For easier comparison, the parameters of Random Forest are set to default values where the number of selected features at every splitting node is \(mtry=\lfloor \log _{2}(\text {the number}\) of features)⌋ + 1, and number of trees is set to ntree = 500. On each data set, five times of five-fold cross-validation are performed, and the average values are taken as the final results.

Some of parameters used in HSMOGA are in Table 2. These parameters are not the optimal, as the time cost is reduced. bias is the most important parameter in our view, because it not only controls the probability of occurrence of each feature in the initial population, but also limits the size of feature subset in each generation through the F3(⋅) function. So, we search for its optimal value within [− 1,1] to maximize the kappa coefficient. Note that, the test data does not appear in any part of the model establishment including the parameter tuning. The OOB data [2] of training set is used as the validation set.

Table 2 Parameters of HSMOGA

Main parameters of comparison algorithms are as follows.

  • The ‘mRMRe’ package of R language is used to implement mRMR algorithm. All parameters are their default values.

  • The ‘CORElearn’ package of R language is used to implement ReliefF algorithm. The number of iterations is equal to the number of instances N. The number of selected neighbor is k = 5.

  • The number of parameter values of MDG are the same of those in RF, i.e., \(mtry=\lfloor \log _{2}(\text {the number}\) of features)⌋ + 1, ntree = 500.

  • The ‘caret’ package of R language is used to implement RFE algorithm. functions = rfFuncs, method = cv, number = 5.

  • The ‘Boruta’ package of R language is used to implement Boruta algorithm. pV alue = 0.01, mcAdj = TRUE, maxRuns = 100.

  • When implementing CFR, continuous features are discretized by Ameva algorithm [17].

  • When implementing RRSS, the parameter values of RF-efficient-ReliefF are the same as those of ReliefF, and the thresholds of SCom-SFS are determined according to their piecewise function.

  • For better comparison, population size and Maximum number of generations of IGDFS are equal to those of HSMOGA, i.e., 50 and 20. Other parameters of IGDFS are consistent with [24], that is, tournament size is 2, mutation rate is 0.1, and mutation type is uniform mutation.

  • Similarly, population size and Maximum number of generations of FSBOGA are equal to those of HSMOGA, i.e., 50 and 20. Other parameters of HSMOGA are consistent with [6], that is, the probability of crossover is 0.9, and the probability of mutation is 0.15.

4.3 Evaluation metrics

When measuring the quality of a filter, i.e., rationality of feature ranking, one commonly used method is to use the first \(n^{\prime }\) features of its resulting order to build the model, and \(n^{\prime }\) changes from 1 to n − 1, that is n − 1 models. Then, we can compare two or more filters by calculating the rank of their models when facing different \(n^{\prime }\). In experiment one, the Average Rank (AR) is selected. One filter’s AR on each data set is in (21).

$$ AR_{i}=\frac{1}{4(n-1)}\sum\limits_{n^{\prime}=1}^{n-1}rank_{in^{\prime}} $$
(21)

where i = 1, 2, 3, 4 represents IG, IGR, ReliefF, Hybrid Filter (HF), respectively. \(rank_{in^{\prime }}\) represents the i-th filter’s rank when using the first \(n^{\prime }\) features of its resulting order to build the model, and the best is ranked 1 and the worst is 4. The range of ARi is [0,1], and the lower, the better.

In this paper, kappa coefficient is the main evaluation metric of models’ results. When facing a confusion matrix, accuracy only considers the instances in the diagonal direction which are classified correctly. Whereas, kappa coefficient also takes the misclassified and unidentified instances outside the diagonal into account, which indicates that kappa coefficient is more reasonable especially when facing the imbalanced data set. Its range is [− 1,1], and the larger, the better. In this paper, parameter tuning of every algorithm is also based on the kappa coefficient.

In addition, accuracy and G-mean are selected as the secondary evaluation metrics, where G-mean is the geometric mean of the recall rate of each class. The range of them is [0,1], and the larger, the better. Table 3 is a typical confusion matrix of C classes. Formulae of accuracy, G-mean and kappa coefficient are in (22), (23) and (24), respectively. In experiment three, kappa coefficient is selected, as we’re just going to demonstrate the superiority of the rank function. In experiment four, kappa coefficient, accuracy and G-mean are selected.

$$ accuracy=\frac{{\sum}_{i=1}^{C}N_{i,i}}{N} $$
(22)
$$ G-mean=\sqrt[C]{\prod\limits_{i=1}^{C}{ \frac{N_{i,i}}{N_{i}} } } $$
(23)
$$ kappa \ coefficient=\frac{p_{0}-p_{e}}{1-p_{e}} $$
(24)

where

$$ \begin{array}{@{}rcl@{}} p_{0}&=&accuracy \end{array} $$
(25)
$$ \begin{array}{@{}rcl@{}} p_{e}&=&\frac{{\sum}_{i=1}^{C}{N_{i}\cdot\hat{N_{i}}}}{N^{2}} \end{array} $$
(26)
Table 3 A typical confusion matrix

4.4 Results and analyses

Experiments one to three are designed to objectively evaluate the effectiveness of the two of three fitness functions and the Pareto-based ranking function proposed in this paper, and experiments four and five are designed to objectively compare and assess the relative performance of HSMOGA in terms of classification performance and time cost. So, they are separated.

4.4.1 Experiments one to three

In experiment one, AR results of 4 filters on 17 data sets are in Table 4. The last line of it is the average results of all data sets, and the best filter for each data set is highlighted in bold, the second best is highlighted in italic.

Table 4 AR results of 4 filters on 17 data sets(%)

By inspection, we can find that single filters performed better on some data sets and worse on some other data sets, such as, IG performed well on Spambase data set but poorly on CMC data set, IGR performed well on CMC data set but poorly on SPECTF data set, ReliefF performed well on SPECTF data set but poorly on Spambase data set. Whereas, the hybrid filter is almost in the top two on all the 17 data sets, and its average AR is the best. It is apparently that the rank operation and average operation on these three single filters are effective. The performance of the hybrid filter is more balanced, and it is good on almost every data set.

In experiment two, in order to prove F3(⋅) function is effective, we plot the average value of F3(⋅) function of each generation, as in Fig. 4, where

Fig. 4
figure 4

Experiment two on Musk2 data set

Mean of F3 function = 0 means all the feature subsets in that generation achieve the size of en which is our expected size and it can be controlled by bias.

By inspection, we can find that as the number of generations increases, the average value of F3(⋅) function increases. Apparently, F3(⋅) function is effective. We can adjust the bias term so that the size of the feature subsets is close to the expected size, and F1(⋅) and F2(⋅) functions are also optimized at the same time. Moreover, we can easily find that the average value of F3(⋅) function of the initial population is almost in [en − 5, en + 5]. It indicates that our initial population generation method is very good, and the population will converge faster.

In experiment three, in order to demonstrate the effectiveness of RANK(⋅) function, HSMOGA was compared with other two methods. Kappa coefficient results of these three cases on 17 data sets are in Table 5, and the best method on each data set is emphasized in bold.

Table 5 Kappa coefficient results of three cases (%)

According to the results, we can find that HSMOGA is better than the other two methods including the commonly used NSGA-ii method on 17 data sets. It indicates that the proposed RANK(⋅) function is more suitable for the three fitness functions proposed in this paper. The reason is that only those individuals who perform well on all the three fitness functions are what we want, and actually they represent better feature subsets. It seems like we focus on those individuals that are more central in the Pareto front, while NSGA-ii focuses more on finding the wider Pareto front.

4.4.2 Experiments four and five

In experiment four, Kappa coefficient is the main evaluation metric. Accuracy and G-mean are the secondary evaluation metrics. Results of them of these ten algorithms on 17 data sets are in Tables 67, and 8, respectively. The best result for every data set is emphasized in bold.

Table 6 Kappa coefficient results (%)
Table 7 Accuracy results (%)
Table 8 G-mean results(%)

For more intuitive comparison, we calculate the Friedman’s ranks [13] of these three metrics of ten algorithms. Firstly, each algorithm’s ranks on these 17 data sets are calculated. The best algorithm is ranked 1, the second-best algorithm is ranked 2, etc. Then average these ranks of every algorithm. Table 9 gives Friedman’s ranks of these ten algorithms, and the best one is emphasized in bold.

Table 9 Friedman’s ranks

We perform Analysis of Variance (ANOVA) to prove that there are significant differences between these ten algorithms. Then, Tukey HSD test [1] and Nemenyi test [32] are performed between every two algorithms. The significance level of all tests is set to α = 0.05. If pvalue ≤ 0.05 exists, then we can hold that these ten algorithms or the corresponding two algorithms are not equivalent. According to Table 10, we can easily find that there are significant differences between these ten algorithms. Tables 11 and 12 are the two statistical tests’ pvalue results of the pairs of algorithms with significant differences, respectively, and the best one of each pair is emphasized in bold.

Table 10 Results of ANOVA
Table 11 p-values of Tukey HSD test on the pairs of algorithms which have significant differences
Table 12 p-values of Nemenyi test on the pairs of algorithms which have significant differences

From these tables, we can get several points:

  • HSMOGA performs well in terms of kappa coefficient, accuracy, and G-mean. HSMOGA is significantly better than other nine algorithms for the data sets tested. It achieved the highest kappa coefficient on 12 of 17 data sets and it is almost the second or the third highest on the other 5 data sets. Although we did not optimize the model based on accuracy and G-mean, HSMOGA’s accuracy and G-mean is still the highest overall. According to Friedman’s ranks, HSMOGA is the best one in terms of all three metrics, and far exceeds the second place.

  • The remaining nine algorithms have no significant difference with each other for the data sets tested. Among them, both IGDFS and FSBOGA performed moderately well, and IGDFS ranked second, which indicates the advantages of GA-based feature selection algorithms. Possible reasons why their results are worse than those of HSMOGA are as follows. The maximum number of generations or the population size we used are much smaller than those in [24] and [6]. IGDFS and FSBOGA did not take feature interaction into account. They have no means to control the size of the feature subset.

  • For the experiments performed in this paper, HSMOGA is better than IGDFS and FSBOGA when using the same maximum number of generations and the population size. It indicates that HSMOGA works better when using small G and M. This is precisely because when generating the initial population, HSMOGA makes full use of the knowledge of data set through controls the occurrence of each feature according to the hybrid weights and the control term bias, which makes the results converge faster and performs better.

In experiment five, we compared the time cost of IGDFS, FSBOGA and HSMOGA. Table 13 shows the time cost of the two algorithms when maximum number of generations G = 10, 20, ⋯ , 100. Note that, HSMOGA includes the knowledge reserve part, MOGA part, and the final feature subset selection part.

Table 13 Time cost of IGDFS, FSBOGA, and HSMOGA on ‘Online’, ‘Default’, ‘Musk2’, and ‘Internet’ data sets (minute)

By inspection, we can easily find that HSMOGA has a significant advantage over IGDFS and FSBOGA in terms of time cost for the data sets tested. Moreover, as G increases, time cost of HSMOGA does not increase much. For example, when dealing with ‘Musk2’ data set, among the whole HSMOGA, the knowledge reserve part costs nearly 16.8 minutes, the final feature subset selection part costs nearly 4.5 minutes, and time cost of MOGA part changes from 6.9 seconds to 10.8 seconds as G changed from 10 to 100. This is because the three fitness functions we proposed are all independent of the classifier, and we precalculated the knowledge or information required through the knowledge reserve step. It indicates that HSMOGA not only has the better results, but also run faster. Most parameters of HSMOGA, such as G and bias, can be tuned at high speed.

5 Conclusion

We proposed a novel MOGA-based feature selection algorithm, i.e., HSMOGA. It contains three new fitness functions F1(⋅), F2(⋅), F3(⋅) and a new Pareto-based ranking function RANK(⋅). F1(⋅) is the average hybrid weight according to a new hybrid filter which performs more balanced on different data sets. F2(⋅) is the average SCom of every two features which is a measure of feature interaction strength. F2(⋅) solves the problem that [50] needs to determine the threshold of SCom artificially. F3(⋅) is the limitation of F1(⋅) and F2(⋅) through ensuring that the size of feature subset is close to an expected size. Compared to traditional fast non-dominated sorting and the crowding-distance estimation, the new ranking function RANK(⋅) is more suitable for the three fitness functions proposed in this paper. When generate the initial population, HSMOGA makes full use of the knowledge of data set instead of some random or pseudo-random pattern generator. The probability of the occurrence of each feature is derived from its hybrid weight and a control term. We proposed a new step called knowledge reserve. It calculates features’ hybrid weights and their SComs before any other steps of HSMOGA, which makes other steps including initial population generation, the calculation of fitness functions, and the parameter’s tuning run faster.

We designed three experiments to objectively evaluate the effectiveness of the two of three fitness functions and the Pareto-based ranking function proposed in this paper, and other two experiments to objectively compare and assess the relative performance of HSMOGA in terms of classification performance and time cost. According to the experimental results, the fitness functions and ranking function proposed in this paper are efficient, and HSMOGA outperforms other nine state-of-art feature selection algorithms in terms of kappa coefficient, accuracy, and G-mean for the data sets tested. Moreover, HSMOGA has a significant time-saving advantage over other GA-based feature selection algorithms, as its three fitness functions are classifier-independent and their knowledge needed is calculated before, and even if the maximum number of generations increases a lot, the time cost will not increase much.

In the future, we will try to construct a better hybrid filter and combine HSMOGA with ensemble methods and Neural Networks.