1 Introduction

With the development of information technology, the main challenge of data mining is no longer how to collect a large amount of data but how to extract useful feature information from the enormous data already available. Feature selection eliminates irrelevant or redundant features, thereby reducing the number of features, improving model accuracy, and shortening running time. Moreover, selecting the truly relevant features simplifies the model, making it easier for researchers to understand how the data are generated. Many researchers have proposed methods to select the most suitable features. Previous studies have mostly viewed feature selection as a single-objective problem of minimizing the classification error rate. However, feature selection can also be regarded as selecting, from the original set, a feature subset of minimum size (Hamdani et al. 2007). The problem can therefore be defined as a multi-objective problem.

Traditional feature selection algorithms fall mainly into two types: filter methods and wrapper methods (Jović and Bogunović 2015). The main principle of filter methods is to use evaluation criteria to strengthen the correlation between individual features and classes while simultaneously reducing the correlation among features. Filter methods skip the training step of a classifier; as a result, their computation time is generally shorter and their computational complexity lower, so they can quickly eliminate irrelevant features. However, they tend to choose redundant features because the correlation among features is not always considered. Different from filter methods, wrapper methods use a classifier to evaluate the classification performance of the selected feature subset. In general, wrapper methods obtain more effective feature subsets than filter methods, with a lower classification error rate and fewer features.

Many learning algorithms are available for feature evaluation in wrapper methods, such as decision trees, Bayesian classifiers, nearest-neighbor algorithms, and support vector machines. Traditional wrapper methods include sequential forward selection (SFS) and sequential backward selection (SBS) (Xue et al. 2016). However, in these sequential search processes, once a feature is added or removed it remains unchanged for the rest of the search. Consequently, such methods easily become trapped in local optima and can only obtain approximately optimal solutions (Xue et al. 2014). To address these deficiencies, research communities began to use evolutionary algorithms with random search strategies to tackle the feature selection problem. These evolutionary algorithms include simulated annealing (SA) (Lin et al. 2008), genetic algorithms (GA) (Zhu et al. 2007), particle swarm optimization (PSO) (Wang and Yan 2015), and bacterial foraging optimization (BFO) (Wang et al. 2017; Chen et al. 2017; Wang and Niu 2017), to name but a few. However, most existing feature selection methods mainly aim at improving the classification accuracy or reducing the classification error rate, treating feature selection as a single-objective optimization problem. Feature selection can also be considered a multi-objective problem that aims to reduce the classification error rate with the minimum number of features, which is more general and more widely applicable.

In this paper, a novel feature selection approach based on a multi-objective bacterial foraging optimization algorithm (MOBFO) is proposed to address the drawbacks of existing feature selection techniques. Furthermore, in order to tackle high-dimensional feature selection problems, four different information exchange mechanisms are introduced into the bacteria-inspired feature selection method to help it escape from local optima. The key idea of the proposed method is to select candidate feature subsets with the MOBFO algorithm and then evaluate them with the KNN classification algorithm.

The main contributions of our approach are listed as follows:

  (1) Unlike previous feature selection methods, this paper does not fix the number of selected features in advance.

  (2) With the size of the feature subset left open, this paper extends the single-objective feature selection problem to a multi-objective integer optimization problem with two conflicting objectives: minimizing the classification error rate and minimizing the size of the feature subset (a formal statement is sketched after this list).

  (3) To ensure that no dominance relationship exists among the solutions in the final set, the algorithm incorporates a non-dominated sorting and crowding distance calculation technique.
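For concreteness, the bi-objective formulation referred to in item (2) can be stated as follows; the notation below is introduced by us purely for illustration and does not appear in the paper's own equations:

$$\min_{S \subseteq \{1, \ldots, n\},\; S \ne \emptyset } \; \bigl( f_{1}(S),\, f_{2}(S) \bigr), \qquad f_{1}(S) = \mathrm{ErrorRate}_{\mathrm{KNN}}(S), \quad f_{2}(S) = \left| S \right|$$

where n is the total number of features. A subset \(S_{a}\) dominates \(S_{b}\) if \(f_{i}(S_{a}) \le f_{i}(S_{b})\) for \(i = 1, 2\) with at least one strict inequality, and the algorithm returns the set of mutually non-dominated subsets rather than a single solution.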

This paper is organized as follows. Section 2 reviews the literature on feature selection methods and bacteria-inspired algorithms. Section 3 describes the new bacteria-inspired feature selection approach. Section 4 presents the experimental design and results. Finally, conclusions are drawn in Section 5.

2 Related work

Feature selection problems have been extensively studied by research communities. Compared with filter methods, wrapper methods have received more attention because of their higher classification accuracy. From an optimization perspective, the development of intelligent algorithms provides new ideas for solving the feature selection problem more efficiently; these approaches can be categorized as wrapper methods. This section presents related work on different feature selection methods and discusses the advantages and disadvantages of each approach.

2.1 Traditional feature selection algorithms

As a classical problem in machine learning, traditional feature selection has been investigated extensively in previous studies. Prior work mainly follows two streams (Jović and Bogunović 2015). One stream treats feature selection as an independent process without classifiers, which we refer to as filter methods. Filter approaches evaluate the data set directly, using evaluation criteria to strengthen the correlation between individual features and classes while reducing the correlation among features. Researchers mainly adopt four evaluation criteria: distance measurement, information measurement, correlation measurement, and consistency measurement.

Distance measurement uses distance as a measure of similarity between samples: the smaller the distance, the more similar the samples. Filter methods based on distance measurement include the Relief algorithm (Jia et al. 2013), the branch and bound method (Dai and Yao 2017), the Mahalanobis distance algorithm (Jin et al. 2012), and the Bhattacharyya distance algorithm (Choi and Lee 2003), among others. Distance-based filter methods are simple to compute; however, they tend to select redundant features. To reduce the correlation within feature subsets, information measurement based filter methods use information gain or mutual information to select the key features and eliminate irrelevant ones. Such methods include information gain (IG) (Wang et al. 2011), minimum Redundancy Maximum Relevance (mRMR) (Peng et al. 2005), interact feature selection (Zhao and Liu 2009), redundancy-complementariness dispersion (Chen et al. 2015), and mutual information (MI) (Bennasar et al. 2015). In recent years, developing effective information measurement based filter methods has become a popular research direction. However, as the dimensionality of the data increases, such algorithms still face the challenge of computational complexity (Xue et al. 2016).
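As a minimal illustration of this family of methods, the following sketch ranks features by mutual information with the class label and keeps the top-ranked ones; it is a generic example using scikit-learn, not an implementation of the specific methods cited above.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_filter(X, y, top_k=10):
    """Rank features by mutual information with the class label (higher = more
    relevant) and return the indices of the top_k features."""
    mi = mutual_info_classif(X, y, random_state=0)
    return np.argsort(mi)[::-1][:top_k]
```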

As for the correlation and consistency measurements, the former uses the correlation coefficient to judge the correlation between features and classes to obtain the feature subset (Ozturk et al. 2013), while the latter is dedicated to finding the minimum-size feature subset that achieves the same effect as the whole feature set (Dash et al. 2000). In summary, the major advantage of the filter methods mentioned above is that they skip the training step of a classifier, so their computation time is generally shorter and their computational complexity lower, allowing them to quickly eliminate irrelevant features. Unfortunately, this kind of method tends to choose redundant features because the correlation among features is not considered.

Another stream of research treats feature selection as a process that uses classifiers to evaluate the performance of the selected feature subset; these are called wrapper methods. Conventional wrapper methods can be categorized into sequential forward selection (SFS) and sequential backward selection (SBS) (Xue et al. 2016). In these methods, once a feature is added or removed it remains fixed for the rest of the search. Therefore, it is easy to get stuck in a local optimum and obtain only an approximately optimal solution (Xue et al. 2014). Generally speaking, wrapper methods obtain more effective feature subsets than filter methods, with a lower classification error rate and fewer features. However, the efficiency of subset generation is low because the classification algorithm needs to be run frequently, so it is necessary to design algorithms that improve the efficiency of wrapper approaches. How to develop accurate and efficient wrapper approaches for feature selection has thus become a popular research topic.
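To illustrate the wrapper idea, a minimal sketch of sequential forward selection wrapped around a KNN classifier is given below; the function name, the choice of 5-fold cross-validation, and k = 5 neighbors are our own illustrative assumptions rather than details of the cited works.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def sequential_forward_selection(X, y, max_features=None):
    """Greedy SFS: at each step add the feature that most improves CV accuracy."""
    n_features = X.shape[1]
    max_features = max_features or n_features
    selected, best_score = [], -np.inf
    while len(selected) < max_features:
        candidates = [f for f in range(n_features) if f not in selected]
        scores = {f: cross_val_score(KNeighborsClassifier(n_neighbors=5),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in candidates}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:   # stop when no candidate improves accuracy
            break
        selected.append(f_best)
        best_score = scores[f_best]
    return selected, best_score
```

Once a feature enters `selected` it is never removed again, which is exactly the rigidity that makes SFS prone to local optima.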

2.2 Evolutionary based feature selection algorithms

In recent years, there has been an increasing amount of literature on using evolutionary algorithms to address feature selection problems. This type of algorithm formulates feature selection as an optimization problem with the classification accuracy or classification error rate as the objective function; the corresponding optimal solution is the selected feature subset. Hsu (2004) adopted a decision tree method to select features and further used a genetic algorithm to seek the feature subsets that minimize the classification error rate. Similarly, Chiang and Pell (2004) also introduced a genetic algorithm into the feature selection process. Researchers have also used ant colony algorithms to select the most suitable feature subsets. As an illustration, Kashef and Nezamabadi-Pour (2015) modified the original ant colony algorithm and applied the improved algorithm to feature selection. In that method, features are regarded as graph nodes and the ant colony algorithm is employed to choose nodes. Several simulation experiments on UCI datasets demonstrated the method's effectiveness.

However, for high-dimensional feature selection problems, such evolutionary algorithms require a great deal of computational time to evaluate the combined effect of all possible feature subsets. Therefore, feature selection methods that depend on evaluating the combined effect of all feature subsets are not feasible for high-dimensional problems. To resolve this problem, Wang and Yan (2015) put forward a PSO-based method that treats features as optimization variables, weights the feature subsets according to their classification performance, and selects the subset with the best classification performance. Furthermore, Xue et al. (2014) discussed and compared the influence of different initialization strategies on feature selection, and concluded that adopting SFS and SBS simultaneously in PSO can reduce the computational complexity and yield better results.

In fact, methods based on random search strategies, including the random generation sequence selection algorithm (RGSS) (Park and Kim 2015), simulated annealing (SA) (Lin et al. 2008), genetic algorithms (GA) (Zhu et al. 2007), and many others, are the most commonly used techniques. Most evolutionary feature selection methods belong to the wrapper family: the principal idea is to use an optimization algorithm to select candidate feature subsets and then use a classification algorithm such as KNN as the evaluation function. The key advantage of evolutionary methods is that they usually require few control parameters and exhibit strong robustness.

2.3 Bacteria-inspired algorithms

2.3.1 Original bacteria-inspired algorithm

Most swarm intelligence methods are based on higher organisms with relatively complex behaviours. Particle swarm optimization (PSO) draws inspiration from the social behaviours of birds and fish (Eberhart and Kennedy 1995). Ant colony optimization (ACO) mimics the foraging behaviour of ants (Dorigo et al. 1996). The artificial bee colony (ABC) algorithm simulates the intelligent foraging behaviour of a honey bee swarm (Karaboga 2005). However, the biological behaviours mentioned above are relatively complicated and many of them are difficult to describe qualitatively. Because of this, simplifying assumptions have to be added to the resulting optimization models. Consequently, although such a model reflects some characteristics of the biological system, it cannot completely describe the system's actual behaviour, which affects the quality of the optimization results to some extent.

Under these circumstances, several studies have focused on the behaviours of microorganisms, which are easier to describe qualitatively. Bacteria, the simplest unicellular organisms, exhibit simple behavioural patterns that can be described easily. Moreover, as one of the oldest creatures on earth, bacteria demonstrate through their strong vitality and flexible adaptation to complicated environments an optimization instinct in their survival activities. For these two reasons, research communities have developed bacteria-inspired (BI) methods from a new perspective. BI-based techniques are inspired by the social behaviours of these low-grade microorganisms and regard the foraging process of bacteria as an optimization process. More specifically, the bacteria-inspired algorithm primarily simulates three typical bacterial foraging behaviours: chemotaxis, reproduction, and elimination/dispersal.

As the original BI-based algorithm (Passino 2002), bacterial foraging optimization (BFO) can be summarized as follows (a minimal code sketch is given after the list):

  (i) Initialization BFO generates the initial population randomly and then calculates the fitness value of every individual with the fitness function. In the first iteration, the individual with the best fitness value is regarded as the best individual.

  (ii) Chemotaxis operator Bacterial individuals tend to avoid harmful environments and move toward favorable ones; this behavior is called chemotaxis. During the chemotaxis process, two movement behaviors are simulated: tumbling and swimming. Tumbling refers to moving a unit step in a random direction. After a tumble, if the fitness value is better than before, the bacterium keeps swimming in the same direction.

  (iii) Reproduction operator Biological evolution follows the law of survival of the fittest. After the chemotaxis process, individuals with poor foraging ability are eliminated, while bacteria with strong foraging ability survive and reproduce by splitting into two identical cells. In this way, the size of the population is kept stable and its quality is improved.

  (iv) Elimination/dispersal operator During the chemotaxis and reproduction processes, the living environment of the population may be damaged, so some individuals die or migrate to other regions. The elimination/dispersal operator controls this process and enables the population to escape from local optima with a certain probability.
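The following is a minimal sketch of these three operators for a real-valued, single-objective minimization problem; the parameter names and default values (population size, step length, loop counts, dispersal probability) are illustrative assumptions and the cell-to-cell attraction term of the full BFO algorithm is deliberately omitted.

```python
import numpy as np

def bfo_minimize(fitness, dim, bounds=(-5.0, 5.0), pop=20,
                 n_ed=2, n_re=4, n_ch=10, n_swim=4, step=0.1, p_ed=0.25, rng=None):
    """Minimal bacterial foraging optimization (chemotaxis, reproduction,
    elimination/dispersal) for minimizing `fitness` over a box."""
    rng = rng or np.random.default_rng(0)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(pop, dim))          # initial population
    for _ in range(n_ed):                             # elimination/dispersal loops
        for _ in range(n_re):                         # reproduction loops
            health = np.zeros(pop)                    # accumulated cost per bacterium
            for _ in range(n_ch):                     # chemotaxis loop
                for i in range(pop):
                    J = fitness(X[i])
                    direction = rng.uniform(-1, 1, dim)
                    direction /= np.linalg.norm(direction)
                    for _ in range(n_swim):           # tumble once, then keep swimming
                        X_new = np.clip(X[i] + step * direction, lo, hi)
                        J_new = fitness(X_new)
                        if J_new < J:                 # better position: keep moving
                            X[i], J = X_new, J_new
                        else:
                            break
                    health[i] += J
            # reproduction: the healthier half splits, the other half is discarded
            order = np.argsort(health)
            X = np.concatenate([X[order[:pop // 2]]] * 2)
        # elimination/dispersal: relocate some bacteria at random
        relocate = rng.random(pop) < p_ed
        X[relocate] = rng.uniform(lo, hi, size=(relocate.sum(), dim))
    best = min(X, key=fitness)
    return best, fitness(best)

# example: minimize the sphere function in 5 dimensions
print(bfo_minimize(lambda x: float(np.sum(x ** 2)), dim=5))
```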

2.3.2 Multi-objective bacteria-inspired algorithm

The original bacterial foraging optimization algorithm was developed for single-objective problems. For problems with more than one objective it cannot be applied directly, because the solution is no longer a single global optimum but a set of non-dominated solutions. To devise a suitable method for multi-objective feature selection, the most important point is to select a good leader for the bacterial population from a set of potential non-dominated solutions. The non-dominated sorting genetic algorithm (NSGA-II) is one of the most widely known evolutionary multi-objective techniques. Niu et al. (2013) modified the original BFO into a multi-objective algorithm by adding the non-dominated sorting mechanism and crowding distance calculation of NSGA-II (Deb et al. 2002) to construct the fronts of the non-dominated hierarchy. This multi-objective bacteria-inspired algorithm was then extended to improve the accuracy and diversity of the non-dominated front by introducing two neighborhood search strategies based on the ring topology and the star topology, respectively.
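A compact sketch of the two NSGA-II components mentioned above, fast non-dominated sorting and crowding distance, is given below for the minimization case; it is a generic implementation written for illustration, not the code of the cited works.

```python
import numpy as np

def non_dominated_sort(F):
    """Partition objective vectors F (shape [n, m], all objectives minimized)
    into Pareto fronts; returns a list of index lists, best front first."""
    F = np.asarray(F, dtype=float)
    n = len(F)
    dominated_by = [[] for _ in range(n)]    # solutions that i dominates
    n_dominators = np.zeros(n, dtype=int)    # how many solutions dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if np.all(F[i] <= F[j]) and np.any(F[i] < F[j]):
                dominated_by[i].append(j)
            elif np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                n_dominators[i] += 1
    fronts, current = [], [i for i in range(n) if n_dominators[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                n_dominators[j] -= 1
                if n_dominators[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts

def crowding_distance(F):
    """Crowding distance of each solution within one front (larger = less crowded)."""
    F = np.asarray(F, dtype=float)
    n, m = F.shape
    dist = np.zeros(n)
    for k in range(m):
        order = np.argsort(F[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf     # boundary solutions are kept
        span = F[order[-1], k] - F[order[0], k]
        if span > 0:
            dist[order[1:-1]] += (F[order[2:], k] - F[order[:-2], k]) / span
    return dist
```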

3 Multi-objective approach

In this paper, according to the characteristics of the feature selection problem, we further improve the performance of the multi-objective bacteria-inspired algorithm and put forward a new variant of MOBFO for the feature selection problem (MOBIFS for short).

3.1 Mapping scheme

When MOBIFS is used to seek the most appropriate feature subset, every bacterium encodes a potential solution satisfying the problem constraints. More specifically, each bacterium carries three attributes: the features being selected, the corresponding classification error rate, and the size of the feature subset. In the following equations, n and m stand for the number of features and the bacterial population size, respectively. Equation (1) shows the coding of a bacterium's first attribute, the selected features: when an element of the vector \(f_{i}\) is zero, the corresponding feature is not selected, whereas a non-zero value means the feature is selected. Equation (2) shows the coding of the first attribute for the whole population. After the fitness evaluation, the classification error rate of every bacterium is obtained; the matrix \(fit_{1}^{i}\) stores these classification error rates, and the number of features selected by each bacterium is counted and stored in another matrix, \(fit_{2}^{i}\).

$$f_{i} = [x_{i1}, x_{i2}, x_{i3}, \ldots, x_{in}], \quad i = 1, \ldots, m \tag{1}$$

$$P = [f^{\prime}_{1}, f^{\prime}_{2}, \ldots, f^{\prime}_{m}] \tag{2}$$

$$Pop = \begin{bmatrix} P \\ fit_{1}^{i} \\ fit_{2}^{i} \end{bmatrix}, \quad i = 1, \ldots, m \tag{3}$$
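To make the mapping scheme concrete, the sketch below decodes one bacterium's position vector according to Equation (1) and computes the two fitness values of Equation (3); the use of scikit-learn's KNN classifier and the value k = 5 are illustrative assumptions on our part.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_bacterium(position, X_train, y_train, X_test, y_test, k=5):
    """Decode a bacterium (Eq. 1): non-zero entries mark selected features; then
    return fit1 (classification error rate of KNN) and fit2 (subset size)."""
    mask = np.asarray(position) != 0          # zero -> feature not selected
    if not mask.any():                        # guard: at least one feature is required
        return 1.0, 0
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train[:, mask], y_train)
    error_rate = 1.0 - knn.score(X_test[:, mask], y_test)    # fit_1^i
    return error_rate, int(mask.sum())                        # fit_2^i
```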

3.2 Two important mechanisms of MOBIFS

Before describing the computational steps used to deal with the feature selection problem, we briefly introduce two important mechanisms of the proposed algorithm.

3.2.1 Wheel roulette mechanism

In the process of constructing a feature subset, an individual may inevitably choose a feature more than once. For instance, if an individual's position is [55.13 20.54 85.54 54.86], rounding gives [55 21 86 55]: feature 55 is selected twice, which is not allowed, so another feature must be found to replace the second 55. The replacement method should substitute repeated features within a reasonable range while helping the feature subset converge to the optimal subset more quickly.

Following Khushaba et al. (2011), a distribution factor is used to replace repeated features. As depicted in Table 1, the wheel roulette mechanism uses a weighted scheme to compute the probability of each feature being selected: the greater the weight, the higher the probability of being chosen. The concrete calculation of the positive factor PDj, the negative factor NDj, and the distribution factor FDj follows the literature (Deb et al. 2002).

Table 1 The principle of wheel roulette mechanism
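The repair step can be sketched as follows; the distribution factors FDj are taken as given inputs (their computation follows the cited literature), so the weights argument and the function name are placeholders of our own.

```python
import numpy as np

def repair_duplicates(indices, n_features, weights, rng=None):
    """Replace duplicated feature indices via roulette-wheel selection, where
    `weights[f]` holds the distribution factor FD_j of feature f."""
    rng = rng or np.random.default_rng()
    repaired, used = [], set()
    for idx in indices:
        idx = int(idx)
        if idx in used:                                  # duplicate: draw a substitute
            candidates = [f for f in range(n_features) if f not in used]
            p = np.array([weights[f] for f in candidates], dtype=float)
            p = p / p.sum()                              # higher FD_j, higher probability
            idx = int(rng.choice(candidates, p=p))
        repaired.append(idx)
        used.add(idx)
    return repaired

# e.g. repair_duplicates([55, 21, 86, 55], n_features=100, weights=np.ones(100))
# might return [55, 21, 86, 12]: the second 55 is replaced by a new feature index.
```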

3.2.2 External archive management mechanism

3.2.2.1 Non-dominance sorting

In MOBIFS, the non-dominated solutions obtained during the optimization process are saved to an external archive. However, an ever-growing solution set slows down the convergence speed, so the size of the external archive must be limited. How to maintain and manage the external archive is therefore an important component of the algorithm. The flowchart of non-dominance sorting is given in Fig. 1.

Fig. 1 Non-dominance sorting

3.2.2.2 External archive updating mechanism

As Fig. 2 shows, once all the bacteria in the current population have completed the position update, non-dominated sorting is first carried out within the population. Every individual is assigned a rank, the bacteria with the highest rank are selected, and bacteria with duplicate values are removed at the same time. Next, the individuals selected in these steps are compared with the elite individuals in the previous external archive.

Fig. 2 External archive updating

The purpose of this operation is to avoid dominance relationships among the solutions retained from the population. The non-dominated mechanism then performs pairwise comparisons of fitness values between the bacteria in the current population and those in the external archive. After the comparison, the external archive is updated by eliminating the bacteria with worse fitness values.

If every individual were compared directly with the bacteria in the external archive, we could only ensure that each individual in the population has no dominance relationship with the archive's bacteria; we could not guarantee that there is no dominance relationship among the individuals within the population itself. In that case, the final solution set might still contain dominance relationships, which is why the within-population sorting is performed first.
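Our reading of this two-step update is sketched below; the pairwise dominance test and the simple size truncation are illustrative simplifications (the crowding distance governs removal when the archive overflows), and the (position, objectives) data layout is an assumption of ours.

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a dominates b (both objectives minimized)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return bool(np.all(a <= b) and np.any(a < b))

def update_archive(archive, population, max_size):
    """`archive` and `population` are lists of (position, objectives) pairs.
    Merge them, drop duplicate objective vectors, keep only mutually
    non-dominated members, and truncate to `max_size` if necessary."""
    merged, seen = [], set()
    for sol, obj in list(population) + list(archive):
        key = tuple(np.asarray(obj, dtype=float))
        if key in seen:                       # duplicates are removed
            continue
        seen.add(key)
        merged.append((sol, obj))
    non_dominated = [
        (sol, obj) for i, (sol, obj) in enumerate(merged)
        if not any(dominates(o2, obj) for j, (_, o2) in enumerate(merged) if j != i)
    ]
    return non_dominated[:max_size]           # truncation (crowding distance in the paper)
```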

3.3 Information exchange mechanisms of MOBIFS

To alleviate stagnation in local optima, four different information exchange strategies are incorporated into MOBIFS to compensate for the lack of information communication. Researchers have investigated different topology structures in previous work (McNabb et al. 2009). A neighborhood topology is considered an effective mechanism because it helps the bacteria converge to the global optimum. With such topology structures, every individual in the bacterial group learns from its neighbors to obtain useful information that guides its foraging behavior throughout the process.

In this paper, four information communication mechanisms are chosen and integrated into MOBIFS: elite learning, ring topology, star topology, and Von Neumann topology. The corresponding algorithms are named MOBIFS-EL, MOBIFS-RI, MOBIFS-ST, and MOBIFS-VN, respectively. Table 2 defines the variables used, while Table 3 gives the bacterial position updating formulas under the different information exchange mechanisms.

Table 2 Parameters and definitions
Table 3 Equations for the bacteria position updating
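The position-updating equations themselves are given in Table 3; the sketch below only shows how each topology could determine the neighborhood from which a guiding individual is drawn, following common conventions for these structures (our interpretation, not the paper's definitions).

```python
def neighbors(topology, i, pop_size, grid_cols=None):
    """Indices of the bacteria that individual i may learn from under each topology."""
    if topology == "elite":          # elite learning: guided by external-archive elites
        return []                    # (the guide is drawn from the archive, not the swarm)
    if topology == "star":           # star: every individual sees the whole swarm
        return [j for j in range(pop_size) if j != i]
    if topology == "ring":           # ring: immediate left and right neighbors only
        return [(i - 1) % pop_size, (i + 1) % pop_size]
    if topology == "von_neumann":    # Von Neumann: 4-neighborhood on a 2-D grid
        cols = grid_cols or int(pop_size ** 0.5)
        rows = pop_size // cols
        assert rows * cols == pop_size, "swarm must fill the rows x cols grid"
        r, c = divmod(i, cols)
        return [((r - 1) % rows) * cols + c, ((r + 1) % rows) * cols + c,
                r * cols + (c - 1) % cols, r * cols + (c + 1) % cols]
    raise ValueError(f"unknown topology: {topology}")
```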

3.4 Computational steps of MOBIFS algorithm

Based on the mechanisms above, we propose a multi-objective bacteria-inspired algorithm for the feature selection problem, into which the four information exchange mechanisms are integrated to enhance its performance. Table 4 provides the pseudo-code of the MOBIFS algorithm.

Table 4 The pseudo-code of MOBIFS

4 Experimental design

4.1 Datasets and comparison techniques

To test the performance of the proposed MOBIFS method on low-dimensional datasets, six small datasets with fewer than one hundred features each, namely Wine, Australian, Zoo, German, Ionosphere, and Lung cancer, are used as benchmarks. Moreover, to examine the proposed method's performance on higher-dimensional data, another ten datasets are chosen; they differ in the number of features (from 2309 to 15,009), classes (from 2 to 26), and instances (from 50 to 308). All datasets are collected from the UCI machine learning repository (Frank and Asuncion 2010). Tables 5 and 6 summarize the characteristics of the low-dimensional and high-dimensional datasets, respectively. The datasets were chosen to match those used in previous works (Khushaba et al. 2011; Xue et al. 2013; Yang et al. 2010; Chuang et al. 2008), so that the experimental results can be compared directly with the figures reported in the literature.

Table 5 Description of the small datasets employed
Table 6 Description of the high-dimensional datasets employed

For the experiments on the small datasets, two single-objective algorithms and two conventional wrapper methods are used for comparison. The two single-objective algorithms are existing PSO-based feature selection methods: a commonly used PSO algorithm (ERFS) (Kennedy and Eberhard 1997; Lin et al. 2008; Chuang et al. 2011) and PSO with a two-stage fitness function (2SFS) (Xue et al. 2012). The main difference between these two algorithms is the fitness function: ERFS employs a fitness function that only takes the classification error rate into account, whereas 2SFS divides the evolutionary process into two stages, in which the fitness function considers the classification performance in the first stage and additionally includes the number of features in the second stage.

The other two traditional wrapper methods are linear forward selection (LFS) (Gutlein et al. 2009) and greedy stepwise backward selection (GSBS) (Caruana and Freitag 1994), derived from SFS and SBS, respectively. LFS limits the number of features considered at every step of the forward selection, reducing the number of evaluations; as a result, it lowers the computational cost and obtains better results than SFS. Unlike LFS, which searches forward, GSBS uses a backward search: it starts with all features and stops as soon as removing any remaining feature would degrade the evaluation. The experimental results of these four comparison methods are taken directly from Xue et al. (2013).

For the experiments on the high-dimensional datasets, three evolutionary algorithms are adopted for comparison: a Differential Evolution based feature selection method (DEFS) (Khushaba et al. 2011), Information Gain-Genetic Algorithm (IG-GA) (Yang et al. 2010), and Improved Binary Particle Swarm Optimization (IBPSO) (Chuang et al. 2008). Besides, to further improve performance on high-dimensional problems, four different swarm search strategies are applied to the foraging process of the bacteria, so that the MOBFO algorithm can improve its parallel processing ability and considerably enhance its search efficiency. Specifically, the four information exchange mechanisms, i.e., the elite learning strategy, ring topology, star topology, and Von Neumann topology, are integrated into MOBIFS, and the resulting MOBIFS variants are employed to solve the problems at the same time.

Experiments are conducted in the MATLAB environment. The instances of each dataset are randomly divided into two sets (Wang et al. 2017): a 70% training set and a 30% testing set. Each bacterium is regarded as a feature subset and is evaluated with the KNN classification algorithm during both the training process and the testing process. What needs to be emphasized is that the classification error rate used as an optimization objective is the one measured during the testing process rather than the training process.

When a bacteria-inspired algorithm is used to solve practical problems, the population size is usually set between 100 and 200, and the size of the external archive is usually set equal to the population size. Therefore, in the MOBIFS algorithm, both the population size and the external archive size are set to 200. The upper limit of the number of features depends on the number of features in each dataset, and the lower limit is uniformly set to 1. In addition, the maximum number of iterations is set to 100.
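A sketch of this evaluation pipeline, assuming scikit-learn for the 70/30 split and the KNN classifier (the value of k is not stated in the paper, so k = 5 here is a placeholder), is shown below.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_error_rate(X, y, feature_mask, k=5, seed=0):
    """Randomly split the instances into 70% training / 30% testing, train KNN
    on the selected features, and return the testing classification error rate."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, feature_mask], y, test_size=0.3, random_state=seed)
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    return 1.0 - knn.score(X_test, y_test)
```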

4.2 Results and analysis

Figure 3 shows the experimental results obtained by MOBIFS on the six small benchmark datasets, with KNN as the evaluation algorithm. As shown in the curves, MOBIFS obtains four or more solutions for each dataset, most of which achieve better performance than using all features. For instance, on the Wine dataset MOBIFS uses only 5 features to achieve a classification error rate of 3.77%, whereas using all 13 features yields a classification error rate of 23.46%. This means that MOBIFS achieves the purpose of feature reduction without increasing the classification error rate.

Fig. 3 Experimental results on six small datasets

In fact, feature selection is inherently a discrete problem, so it is understandable that there are relatively few non-dominated solutions, since few individuals satisfy the discrete conditions. Moreover, a poorly designed algorithm may return no feasible solutions at all.

From the data in Table 7, it is apparent that MOBIFS achieves a lower classification error rate with fewer features in most cases. It is worth mentioning that the final output of MOBIFS is a non-dominated solution set, which differs from the single result of the other comparison methods. We therefore use the solution with the minimum classification error rate and minimum feature subset size to compare with the other methods' average subset size and average classification error rate.

Table 7 Experimental results on small datasets

For the Australian dataset, ERFS and 2SFS outperform MOBIFS in terms of both the number of features and the classification error rate, and LFS uses fewer features than MOBIFS, although its classification accuracy is slightly lower. For the German dataset, with a slightly larger number of features, ERFS and 2SFS obtain a lower classification error rate than MOBIFS. For the Lung cancer dataset, the performance of MOBIFS is only slightly worse than that of LFS but better than the other three comparison methods.

For the experimental results on the high-dimensional datasets, as shown in Fig. 4, the four MOBIFS variants with different information exchange mechanisms each obtain more than three solutions. All of the solutions reach higher classification accuracy with far fewer features than using all the features in the dataset without any selection. Taking the 9_Tumors dataset as an example, MOBIFS-EL is capable of selecting the 48 most significant features, which capture the key information in the dataset, with a classification error rate of only 11.11%; by contrast, if all 5726 features are used in the classification process, the classification error rate reaches 57.59%, almost five times that of MOBIFS-EL. In other words, the MOBIFS methods perform well on high-dimensional feature selection problems. As Fig. 4 also shows, no single MOBIFS variant always obtains the best solution for every dataset; the best variant depends on the characteristics of the benchmark dataset. The four search strategies are designed to obtain high-quality solutions for different kinds of datasets, so that the decision maker can choose a suitable feature selection result from the final solutions produced by the various MOBIFS variants.

Fig. 4 Experimental results on ten high-dimensional datasets

The comparisons between the MOBIFS techniques and the other three evolutionary algorithms are displayed in Table 8. In general, MOBIFS shows better feature selection performance on all ten high-dimensional datasets; that is, it either reaches the highest classification accuracy with fewer features or achieves a similar classification error rate while using far fewer features.

Table 8 Comparisons between evolutionary algorithms (KNN) on high-dimensional datasets

As illustrated in Table 8, IG-GA and IBPSO are capable of reaching low classification error rates; however, for these two evolutionary algorithms, the size of the selected feature subset is still rather large. The DEFS method performs better in terms of both the classification error rate and the feature subset size. Even so, the multi-objective bacteria-inspired method achieves a lower classification error rate than DEFS in most cases while using a similar number of features. On top of that, as a multi-objective method, MOBIFS offers the decision maker several feature selection solutions with different subset sizes simultaneously, whereas the three comparison methods can only provide solutions with a fixed number of features.

5 Conclusion

This paper formulates feature selection as a multi-objective problem that minimizes the classification error rate and the number of selected features simultaneously, and investigates a novel method to support the task of selecting the best feature subset. Four information communication strategies are incorporated into the bacteria-inspired algorithm so that individuals can exchange information with each other and use this information to guide their search. The principle of the proposed method is to use the MOBFO algorithm to select candidate feature subsets and the KNN classification algorithm to evaluate them.

Compared with two single-objective algorithms (ERFS, 2SFS) and two conventional wrapper methods (LFS, GSBS), simulation results on six small datasets demonstrate that the proposed method is effective and efficient in most cases. For high-dimensional feature selection problems, on ten benchmark datasets, the proposed MOBIFS algorithms with different information exchange mechanisms (elite learning, ring topology, star topology and Von Neumann topology) are capable of finding representative feature subsets that use fewer features and reach a lower classification error rate. The related experimental results suggest that different strategies suit different kinds of datasets. In addition, three evolutionary algorithms (DEFS, IG-GA, and IBPSO) are selected as comparison algorithms. Overall, the simulation results support that the proposed MOBIFS methods outperform the other evolutionary algorithms in terms of the classification error rate and the size of the selected feature subsets.

In the proposed method, the number of features is not fixed, as prior studies have assumed; the approach therefore does not depend on prior knowledge of the datasets. By applying this method to feature selection problems, decision makers can obtain a series of solutions, each of which indicates clearly which features are selected and the corresponding classification error. Decision makers can then choose the solution they consider most suitable. While this study contributes novel insights into optimization-based feature selection, future research should investigate how the proposed MOBIFS method performs on a diverse range of real-world feature selection problems.