Abstract
Feature selection (FS) has been studied as a multi-objective problem in recent years to improve the classification effect. In this paper, to expand the ability of Bacterial Foraging Optimization (BFO) in feature selection problems for classification, a Multi-objective Adapting Chemotaxis Bacterial Foraging Optimization (abbreviated as MOACBFO) is proposed. In MOACBFO, a structural variation strategic model is proposed to reduce the computational complexity for multi-objective problems and applied with a dynamically updated external matrix to record the performance of bacteria on two objectives. In addition, an adaptive chemotaxis step mechanism is designed to help the bacteria jump out of the local optimality. To further enhance the diversity of bacteria searching capability, a feature subset updating strategy is developed. The optimal feature subset and fitness value are stored in the external matrix continuously by comparing them with the historical value records. The performance of the proposed algorithm is demonstrated by comparing it with four other advanced swarm intelligent algorithms on 11 high-dimensional microarray datasets. The results indicate that the MOACBFO performs better in achieving a lower classification error rate using a small number of features.
This paper is submitted to Special Session on Intelligent Data Mining: Techniques and Applications (Session Chair: Ben Niu, Hong Wang)
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
Iterative updates in information technology increase the amount of data generated and retrieved, resulting in most of the data becoming high-dimensional. To deeply uncover the effective information hidden in the data, researchers are trying to design various advanced data mining methods. In this context, feature selection has become a widely studied method as it can reduce the dimensionality of data, using a small combination of features to express the most information of the primal high-dimensional data.
Most of the features in the high-dimensional data are not always valid and necessary. It may contain a large number of redundant, and invalid information [1], which will significantly affect the effect of data analysis. For example, redundant features may increase the computation cost while invalid information may decrease the classification accuracy. Therefore, an appropriate method of feature selection can improve the reliability and validity of data analysis. At present, the FS for high-dimensional data has been applied in many fields, such as customer relationship management in enterprises [2], disease prediction [3] in modern medical diagnosis technologies, and customer recommendation [4] in electronic commerce.
FS is a preprocessing process, using certain evaluation criteria to select subsets from the original feature space [5]. The process of basic FS is followed in Fig. 1.
In the process of FS, the feature subsets are generated according to some regulations or strategies [1]. Then, the fitness criteria are used to evaluate the effectiveness of the subsets, and the better performance features are stored in the final feature subset. The purpose of selecting feature subsets is to filter out invalid, incomplete, and redundant features as much as possible [2]. In other words, a good feature subset can use fewer features to present most of the attributes of the original data. Therefore, the information conveys by the original data can be better retained. From the processing, it can be seen that FS is also a typical optimization problem. In consequence, FS is now being studied extensively.
Aiming at reducing feature redundancy, J. Lee [3] designed a multi-variate feature ranking mechanism for FS in high-dimensional microarray data. This mechanism not only considers the relevance of features and targets but also considers the redundancy of repeating features. The ranking mechanism filters out the high-quality feature to generate new subsets. E. S. Hosseini and M. H. Moattar [4] also noticed the imbalance of the data features and instances. The candidate feature subsets were formed and evaluated by a multi-interactive information strategy to reduce the analysis bias caused by the imbalanced data.
In addition to data redundancy and imbalance, FS should also consider the computational complexity. In this regard, some studies used the differential evolution method to filter features, enhancing the ability of FS to select feature combination [5]. F. Aghaeipoor and M.M. Javidi [6] adopted a hybrid FS method of filter and wrapper methods, and selected feature subsets by mRMR and fuzzy rules to improve estimation accuracy and decrease the computing time.
Many researchers regard FS as a single objective task in the past. However, there are increasing amounts of researchers who beginning to pay attention to the interpretability and rationality of the feature subsets generated by FS. The inner connection of each feature in a subset needs to be considered. Especially, when classifying high-dimensional data, the algorithm may generate some large feature subsets without removing the redundant features, which is antithetical to the original intention of FS to reduce the dimension of data. Nevertheless, if the feature subsets are too small, it may filter out a lot of information. To make the subsets more concise with low dimensions, it is wise to consider FS as a multi-objective problem. This paper takes error rates of classification and size of feature subsets as the objective of FS.
The swarm intelligence (SI) algorithms are widely used in multi-objective FS due to their characteristics of simple structure and fewer mathematical rules. SI algorithms imitate the evolution processes of the biological population to construct systems with functions of search optimal solutions. It means to use the population to search a large number of solutions to a given problem and find out the optimal one. Common SI algorithms are artificial bee colony (ABC) [7], particle swarm optimization (PSO) [8], ant colony optimization (ACO) [8], and so on.
Focusing on the problems of multi-objective FS, researchers have made various improvements based on the original SI methods. In [9], the authors take the classification accuracy and the number of selected features as the objectives of FS. To reduce the computational cost of multi-objectives, it embeds the updating mechanism of PSO into the ABC to control the displacement of search individuals. Besides, a ladder-like sample combinations strategy is developed to increase search efficiency. Considering reducing the size of the subsets, Y. Zhou, et al. [10] chose to use the discrete method to compress the search space of the particle swarm, and set the size of particle population by a flexible cut-point method to better search optimal solutions.
Only reducing the size of the subsets can easily make the algorithms fall into the situation of local optimum. To escape the local optimum, W. Wei, et al. [11] design a vector-based elite learning rule and applied it to the immune algorithm to remove redundant features in the data being analyzed. However, elite learning often brings about the problem of premature convergence. To control the algorithms’ speed of convergence, A. Santiago, et al. [12] propose a fuzzy selection of operators. All in all, SI methods are applied to multi-objective FS and they can be modified variously.
Except for the SI algorithms mentioned above, bacterial foraging optimization (BFO) is also a widely studied SI algorithm. BFO has been used to deal with a variety of complex optimization problems because of its outstanding searchability and versatility. Inspired by the foraging activities of E. coli, BFO simulates the life process of bacteria, Sect. 2 will give more details.
The common researches of BFO mainly focus on the improvement strategies of the algorithm, the problems of combinatorial optimization, and optimization scheduling. The improvement strategies mainly include parameters optimization, bacteria chemotaxis process optimization, and population structure optimization. For example, H. Chen et al. [13] designed a chaotic local search strategy to better control the convergence speed of BFO and the defect of the constant step length of bacteria chemotaxis, which increased the search diversity by dynamically changing the chemotaxis step length. In addition, the introduction of chaotic strategies has also expanded the search area of the bacteria population.
Except for the algorithm strategies’ improvements, BFO also has multiple applications. B. Turanoğlu and G. Akkaya [14], aiming at the problem of dynamic facility layout, design a hybrid algorithm combining BFO and simulated annealing (SA) to solve such complex optimization problems. In addition to the problem of the dynamic facility, M. Kaur and S. Kadam [15], in the optimal scheduling problem, propose a multi-objective flora optimization algorithm to find the scheme closest to the optimal scheduling solution. However, the size of the population will significantly influence the search effect of BFO. For modulating the search population of the algorithm, H. Wang, et al. [16] design a multi-dimensional population mechanism of BFO based on the FS method. The algorithm can generate populations in different sizes, thereby generating different feature subsets in classification tasks to increase the variety of solutions. At present, there is very little research that focuses on BFO-based multi-objective FS, but it can be found from those existing studies that this area is very useful and meaningful because BFO performs well in multi-objective problems [17,18,19].
1.1 Goals
This research aims to reduce the dimensionality of high-dimensional data when doing classification tasks. So we design a modified BFO with the improved strategies of adaptive foraging process, structural variation, and update search bacteria population based on multi-objective FS. The specific contributions are as follows:
-
Design a multi-objective BFO with structural variation to improve the problem of slow convergence of the original algorithm.
-
Adaptively adjust the process of chemotaxis, replication, and elimination-dispersal of the improved BFO to increase the diversity of the search process.
-
Design a feature subsets updating mechanism to identify features that can retain better classification effects and eliminate features that will weaken the classification accuracy.
-
Combine the proposed algorithm with the KNN classifier to form a wrapper FS method to achieve efficient classification effects.
1.2 Organization
The rest of this article is structured as follows: Sect. 2 describes the principle of the basic BFO. The proposed algorithm and its improvement strategies are described in Sect. 3. Section 4 shows the experimental results and analysis, and the conclusion is placed in Sect. 5.
2 Basic Bacterial Foraging Optimization
The basic Bacterial Foraging Optimization (BFO) was proposed by K.M. Passino, which mimics the biological habits of E. coli. He designed chemotaxis, replication, and elimination-dispersion three main processes for the BFO algorithm. In addition, each bacterium involved activities of swimming, turning, attraction, and repulsion. It can be seen that BFO is a biological heuristic algorithm, and the researches show that this algorithm has strong searchability [20].
The process of bacterial foraging is called chemotaxis. In the chemotaxis, bacteria swarm towards the place with a high concentration of nutrients. One chemotaxis contains two steps: tumbling and swimming. Bacteria select a random direction by tumbling to find a place with a high nutrient concentration. By swimming, bacteria continuously exploit the direction with high nutrient concentration. In addition, the bacteria colony also has a cell-to-cell signaling communication mechanism via special pheromone. Once a unit finds a good place, it will release an attraction pheromone to inform other units. On the contrary, if a bacterium is exposed to a lower nutrient concentration place or a noxious one, it will release a repulsive pheromone to warn other units to avoid approaching. In the BFO algorithm, this mechanism speeds up the convergence and finds the global optimum more efficiently. The position updating of bacterial foraging optimization during the chemotaxis stage is [20]:
The \(\theta^i \left( {j,k,l} \right)\) presents the position of \(i_{th}\) bacterium in \(j_{th}\) chemotaxis, \(k_{th}\) and \(l_{th}\) stand for the process of reproduction elimination-dispersal, respectively. \(C\left( i \right)\) means the step of chemotaxis and \(\Delta \left( i \right)\) is a manually setting parameter that controls the actives of \(i_{th}\) bacterium. As there are attraction and repulsion actives between bacteria, the formula is followed:
Where \(J_{cc} \left( {\theta^i \left( {j,k,l} \right),P\left( {j,k,l} \right)} \right)\) denotes the cell-to-cell signaling of \(i_{th}\) bacterium among other bacteria. \(J_{cc} \left( {\theta^i \left( {j,k,l} \right),\theta^j \left( {j,k,l} \right)} \right)\) controls the communication level of attractant or repellant between \(i_{th}\) and \(j_{th}\) bacterium. Therefore, the \(i_{th}\) the bacterium will be updating with swarming effect not just its fitness \(J\left( {{ }i{ },j{ },{ }k{ },l{ }} \right)\):
Reproduction simulates the most basic criteria in the natural world that is survival of the fittest. It improves the superiority of bacterial colonies and thus accelerates the convergence of BFO. Firstly, bacteria colonies are ranked according to the health level of each bacterium. Secondly, half of the bacteria with lower health levels are replaced by the half with higher. As a result, the overall health level in next-generation is improved. The health of bacteria are as follow:
Where the \(J_{health}^i\) is the health level of \(i_{th}\) the bacterium, it is calculated by adding up each fitness value in \({ }N_c\) chemotaxis processes.
Elimination-dispersal emulates uncertainty in nature. In the life cycle of a bacteria colony, each bacterium could die abruptly or be transported to somewhere else randomly since the uncertain natural disasters. In BFO, the position of bacteria will be reset in a custom probability \(P_{ed}\), which helps the optimization algorithm reduce premature convergence.
3 The Proposed Multi-objective Adapting Chemotaxis BFO Method
Although the basic BFO can realize solving optimization problems. However, the algorithm’s computational time cost of BFO will be increased when faced with high-dimensional data. In addition, the original algorithm is more suitable for single-objective problems. In this paper, we regard FS as a multi-objective problem with two objectives, which are “minimize the size of feature subset” and “maximize classification effect (minimum error rate)”.
This paper proposes an effective algorithm, multi-objective adapting chemotaxis bacteria foraging optimization (MOACBFO), the general flow of MOACBFO is shown in Fig. 2. ‘h’ represents the running time of the program, ‘iter’ means the iteration times of each running. ‘m’ is the swimming number of times of each bacteria.
Based on the original BFO, MOACBFO has four improved strategies, including structural variation (structure mutation of the basic BFO), external matrix mechanisms, adaptive foraging processes, and feature subsets updating mechanism.
3.1 Structure Variation
Traditional BFO is a multi-layer nested structure algorithm, when the bacterial population faces a high-dimensional search space, many calculation processes will be generated. However, as the scale of data analysis increases, many repetitive calculation results will be generated, which does not conform to the original intention of creating a concise and efficient algorithm. Therefore, this paper disassembled the nested structure of BFO, setting the three processes (chemotaxis, replication, and elimination-dispersal) to be carried out sequentially instead of being nested. We canceled the original nested loop structure of BFO and carried out serial and parallel transformations.
As Fig. 2 shows, in the search process of bacteria, the chemotaxis operation is performed first, and the classification effect of the intermediate process is used to decide whether the algorithm goes to replication operation or elimination-dispersal operation in the next stage of the algorithm. It is worth noting that replication or elimination-dispersal is not performed in every iteration, only if the conditions are met will one of the activities take place.
3.2 External Matrix
To make BFO more suitable for solving multi-objective problems, we introduce an external matrix (EMatrix) [21]. In each iteration, the matrix is used to store the location information of the population generated by the improved bacteria optimization algorithm, the classification error rate of the classifier, and the size of the feature subset generated each time. The classifier used in this paper is the K-nearest neighbor classifier, which is lightweight and efficient.
Formula 5 represents the set of all features, ‘n’ represents the number of features, and ‘z’ represents the number of samples. ‘Pos’ records the feature information of the jth generation of bacterial colony search, and ‘k’ represents the number of selected features, see Formula 6. ‘ErrR’ represents the classification error rate obtained by training the feature subset generated by the jth generation of flora into the classifier, see Formula 7 and ‘iter’ represents the number of iterations. ‘FitN’ represents the size of the feature subset generated by the jth generation of flora, see Formula 8.
3.3 Adapting Foraging Process
The adaptive bacterial foraging process is mainly the adaptive change of the step length of bacterial chemotaxis. In addition, there are adaptive options for replication and elimination-dispersal. The basic step length of chemotaxis in BFO is invariable, but the movement of bacteria is not rigid in the real world. The fixed-step makes the bacteria easy to be caught in the same local place which is not beneficial for the diversified development of the population. In other words, the lack of diversity of the bacterium will lead to the results becoming incomplete. In this paper, a simple adaptive chemotaxis step-changing method is adopted [22]. In the MOACBFO, the initial step control mechanism for each bacterium is followed formula (10–11):
‘γ’ is an adaptive parameter following formula (10), it controls the step length of chemotaxis (‘Chest’). \(J_{new} \left( { i,j} \right)\) is the variation of formula (3). It removes ‘k’ and ‘l’ because the structure of BFO has been changed and the new strategy has followed the formula (14).
One bacterium may do reduplicative work if there is no communication between bacteria and the population. To reduce the computation cost, this paper introduces a learning strategy from PSO, see formula (12).
‘PBi’ stands for the best solution of ith bacterium and ‘B’ is the best location searched by the entire bacteria groups. Table 1 shows the pseudo-code of MOACBFO.
When the error rate is over 50%, the bacteria population will do the elimination-dispersal operation. A high error rate means the combination of features in the subset cannot reflect most information of the data. To save the computation resources, we set the MOACBFO only do elimination-dispersal instead of replication. On the contrary, if the error rate is small, the MOACBFO will take the replication activity.
3.4 Feature Subsets Updating Strategy
High dimensional data always bring thousands or more instances with many features. If the search population has been the same all the time, the diversity of feature subsets will lack change. This paper proposes a feature subsets updating strategy, see Table 2.
Before updating the features in a subset, the position of each bacterium ‘P’ needs to be found. The basic position updating formula is followed:
After recording the information of position into EMatrix, the fitness (error rate) must be calculated and saved into EMatrix. The fitness value \(J_{new} \left( {i,j} \right)\) is acquired by formula (14). It uses the K-nearest neighbor to be the classifier, which takes the ‘P’ as the input value.
The pseudo-code of feature subsets updating strategy is below.
4 Experiments and Analysis
In this section, the proposed algorithm is compared with four derived versions of traditional swarm intelligent algorithms. The experiment is carried on 11 microarray data.
4.1 Experiment Design and Parameter Initialization
The comparing algorithms are BPSO [23], BFO [20], BFOLIW [24] and BWFS [21]. Table 3 records the characteristics of the comparing algorithms. To compare the MOACBFO to the multi-objective algorithm, the comparing algorithms in this paper are all added to the external matrix mechanisms used in this paper. As for this, the final comparison algorithms are BPSO, BFO, BFOLIW, and BWFS in a multi-objective version.
To make the comparison experiment more fair and reliable, the parameters set was referred to most of the literature. As the popular size S is usually 30 to 50, this paper takes 45 as the popular size and dimension. The maximum iterations are 200. The running time of the program is 10.
4.2 Data Sources
The testing datasets are 11 microarray data from the UCI machine learning repository [24]. The microarray datasets are high-dimension data about cancer, which have two or more categories and more than 5,000 features. The experiment compares the error rate and the size of the subset of the algorithm.
4.3 Experiment Results and Analysis
Figure 3 and Fig. 4 show the experiment results of the comparisons. The abscissa axis represents the size of the subsets generated by the algorithms, and the axis of ordinates records the error rate of classification. From the results, MOACBFO always gets a minimum error rate. It reflects that the MOACBFO has a good ability in searching the optimization result.
Besides, MOACBFO can get a small number of features in a subset with a lower error rate compared with other methods. The results in Fig. 3 and Fig. 4 prove that MOACBFO has a better comprehensive performance. Especially in the datasets DLBCL, Lung-Cancer, Prostate-Tumor, Brain-Tumor2, Leukemia1, and SRBCT, MOACBFO gets better diversity of results.
To compare the effect of the algorithms in more detail, Table 5 shows the numerical results of average error rates and subsets’ size of all methods. As shown in the Table 5, the MOACBFO can generally reach an error rate of less than 0.005. Although the number of features in each dataset is not the smallest one, the error rate of MOACBFO is low. Since the results need to fit the two objectives (minimize error rates and feature subsets).
5 Conclusions
In general, this paper proposes a modified FS method MOACBFO and receives good performance in comparing with four swarm intelligence algorithms on 11 high-dimensional microarray datasets. MOACBFO effectively improves the classification results of the modified BFO method in a multi-objective optimization perspective. In addition, it achieves the four goals proposed in this paper. MOACBFO has a good convergence effect, better diversity of searching results, higher classification accuracy, and better effect during the multi-objective FS process in the target of classification. However, MOACBFO needs more comparison to become more stable. In the future, we should pay more attention to the application of MOACBFO and its modified version on practical problems.
References
Guan, B., Zhao, Y., Yin, Y., Li, Y.: A differential evolution based feature combination selection algorithm for high-dimensional data. Inf. Sci. (Ny) 547, 870–886 (2021)
Gangavarapu, T., Patil, N.: A novel filter–wrapper hybrid greedy ensemble approach optimized using the genetic algorithm to reduce the dimensionality of high-dimensional biomedical datasets. Appl. Soft Comput. J. 81, 105538 (2019)
Joloudari, J.H., Saadatfar, H., Dehzangi, A., Shamshirband, S.: Computer-aided decision-making for predicting liver disease using PSO-based optimized SVM with feature selection. Inf. Med. Unlocked 17, 100255 (2019)
Wang, H., Niu, B., Tan, L.: Bacterial colony algorithm with adaptive attribute learning strategy for feature selection in classification of customers for personalized recommendation. Neurocomputing 452, 747–755 (2021)
Jain, D., Singh, V.: Feature selection and classification systems for chronic disease prediction: a review. Egypt. Informat. J. 19(3), 179–189 (2018)
Kalimuthan, C., Arokia Renjit, J.: Review on intrusion detection using feature selection with machine learning techniques. Mater. Today Proc. 33, 3794–3802 (2020)
Khaire, U.M., Dhanalakshmi, R.: Stability of feature selection algorithm: a review. J. King Saud Univ. Comput. Inform. Sci. (2019). https://doi.org/10.1016/j.jksuci.2019.06.012
Lee, J., Choi, I.Y., Jun, C.H.: An efficient multivariate feature ranking method for gene selection in high-dimensional microarray data. Expert Syst. Appl. 166, 113971 (2021)
Hosseini, E.S., Moattar, M.H.: Evolutionary feature subsets selection based on interaction information for high dimensional imbalanced data classification. Appl. Soft Comput. J. 82, 105581 (2019)
Aghaeipoor, F., Javidi, M.M.: A hybrid fuzzy feature selection algorithm for high-dimensional regression problems: an mRMR-based framework. Expert Syst. Appl. 162, 113859 (2020)
Karaboga, D., Basturk, B.: On the performance of artificial bee colony (ABC) algorithm. Appl. Soft Comput. J. 8(1), 687–697 (2008)
Okwu, M.O., Tartibu, L.K.: Particle swarm optimisation. In: Metaheuristic Optimization: Nature-Inspired Algorithms Swarm and Computational Intelligence, Theory and Applications. SCI, vol. 927, pp. 5–13. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-61111-8_2
Wang, X., Zhang, Y., Sun, X., Wang, Y., Du, C.: Multi-objective feature selection based on artificial bee colony: an acceleration approach with variable sample size. Appl. Soft Comput. J. 88, 106041 (2020)
Zhou, Y., Kang, J., Sam Kwong, X., Wang, Q.Z.: An evolutionary multi-objective optimization framework of discretization-based feature selection for classification. Swarm Evol. Comput. 60, 100770 (2021)
Wei, W., Chen, S., Lin, Q., Ji, J., Chen, J.: A multi-objective immune algorithm for intrusion feature selection. Appl. Soft Comput. J. 95, 106522 (2020)
Santiago, A., Dorronsoro, B., Fraire, H.J., Ruiz, P.: Micro-genetic algorithm with fuzzy selection of operators for multi-Objective optimization: μFAME. Swarm Evol. Comput. 61, 100818 (2021)
Chen, H., Zhang, Q., Luo, J., Xu, Y., Zhang, X.: An enhanced bacterial foraging optimization and its application for training kernel extreme learning machine. Appl. Soft Comput. J. 86, 105884 (2020)
Turanoğlu, B., Akkaya, G.: A new hybrid heuristic algorithm based on bacterial foraging optimization for the dynamic facility layout problem. Expert Syst. Appl. 98, 93–104 (2018)
Kaur, M., Kadam, S.: A novel multi-objective bacteria foraging optimization algorithm (MOBFOA) for multi-objective scheduling. Appl. Soft Comput. J. 66, 183–195 (2018)
Wang, H., Tan, L., Niu, B.: Feature selection for classification of microarray gene expression cancers using bacterial colony optimization with multi-dimensional population. Swarm Evol. Comput. 48, 172–181 (2019)
Passino, K.M.: Biomimicry of bacterial foraging for distributed optimization and control. IEEE Control Syst. 22(3), 52–67 (2002)
Niu, B., Yi, W., Tan, L., Geng, S., Wang, H.: A multi-objective feature selection method based on bacterial foraging optimization. Nat. Comput. 20(1), 63–76 (2019)
Majhi, R., Panda, G., Majhi, B., Sahoo, G.: Efficient prediction of stock market indices using adaptive bacterial foraging optimization (ABFO) and BFO based techniques. Expert Syst. Appl. 36(6), 10097–10104 (2009)
Too, J., Abdullah, A.R., Saad, N.M.: A new co-evolution binary particle swarm optimization with multiple inertia weight strategy for feature selection. Informatics 6(2), 21 (2019)
Acknowledgment
This work is partially supported by The National Natural Science Foundation of China (Grants Nos. 71901152), Natural Science Foundation of Guangdong Province (2020A1515010752), and Natural Science Foundation of Shenzhen University (85303/00000155).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, H., Ou, Y., Wang, Y. (2021). A Multi-objective Structure Variant Bacterial Heuristic Feature Selection Method in High-dimensional Data Classification. In: Tan, Y., Shi, Y., Zomaya, A., Yan, H., Cai, J. (eds) Data Mining and Big Data. DMBD 2021. Communications in Computer and Information Science, vol 1454. Springer, Singapore. https://doi.org/10.1007/978-981-16-7502-7_34
Download citation
DOI: https://doi.org/10.1007/978-981-16-7502-7_34
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-7501-0
Online ISBN: 978-981-16-7502-7
eBook Packages: Computer ScienceComputer Science (R0)