
1 Introduction

This chapter presents the application of the flower pollination algorithm (FPA) to feature selection for classification and regression data and to the knapsack problem. Current applications of machine learning and pattern recognition techniques often involve thousands of features. The vast amounts of data generated today in biology offer more detailed and useful information on the one hand; on the other hand, they make the data analysis process more difficult because not all of the information is relevant. Selecting the important features of a given dataset is a complex problem. Feature selection is a technique for solving classification and regression problems: it identifies the significant feature subset and removes the unnecessary features. This mechanism is particularly useful when the feature set is large and not all features are required for describing the data in experiments [1]. Hence, feature selection is crucial to reduce the enormous number of features. Feature selection helps in understanding data, decreasing computation time, reducing the effect of the curse of dimensionality, and enhancing the performance of the prediction model [2]. Furthermore, the feature selection process enhances the visualization and comprehensibility of the selected feature subset [3].

In real-world applications, for various reasons not discussed here, many features introduce noise, while others can be totally irrelevant or even misleading, affecting prediction performance. In these cases, feature selection is a must [4]. Two main criteria are used to differentiate between feature selection algorithms:

  1.

    Search strategy: the method employed to generate feature subsets or feature combinations.

  2.

    Subset quality (fitness): the criteria used to judge the quality of a feature subset.

There are two major approaches to feature selection: the wrapper-based approach (applying machine learning algorithms) and the filter-based approach (using statistical methods) [5]. The wrapper-based approach employs a machine learning technique as part of the evaluation step, which helps it to obtain better results than the filter-based approach [6], but it risks over-fitting the model and can be computationally costly; hence, an efficient search method is required to minimize the computational time [7]. In contrast, the filter-based approach searches for a feature subset that optimizes a given data-dependent criterion rather than the classification-dependent criteria used in wrapper methods [8].

In general, feature selection is expressed as a multi-objective problem with two goals: (1) minimize the size of the selected feature subset and (2) maximize the classification accuracy (or minimize the prediction error in regression problems). Commonly, these two goals are contradictory, and the optimal solution is a trade-off between them. Several search methods have been employed, based mainly on greedy search; however, these techniques have at least two drawbacks: stagnation in local optima and high computational cost [9]. Evolutionary computing (EC) and population-based algorithms adaptively search the feature space by using a set of search agents that interact in a social manner to reach the optimal solution [10]. EC methods are inspired by the social and biological behavior of animals in nature, such as wolves, antlions, dragonflies, and spiders, acting in a group [11].

Most recent optimization techniques are nature-inspired, i.e., they draw their inspiration from natural phenomena [12].

2 Related Work

Feature selection methods are composed of two elements: the search strategy and the evaluation technique (subset goodness). In the wrapper-based approach (the alternative to the filter-based approach), the term wrapper refers to the assessment method. A Boolean-learning-based filter feature selection method exhaustively explores all potential feature combinations and chooses the minimum feature subset [6].

Various heuristic techniques mimic biological and physical behaviors in nature and have been introduced as robust techniques for global optimization. The genetic algorithm (GA) was the earliest evolutionary technique proposed in the literature, and it was later enhanced by refining the evolution operators used during reproduction [13]. A GA feature selection method using a fuzzy set as the fitness function was introduced in [14]. Wrapper-filter feature selection methods combine GA with local search methods [15].

In particle swarm optimization (PSO) methods, a solution is represented by a particle with specific properties such as position, fitness, and speed [16]. A binary version of PSO (BPSO) modifies the native PSO algorithm to deal with binary optimization problems [17], and an extended version of BPSO has been implemented for feature selection [18]. The binary variant of the bat algorithm (BBA) has also been employed for feature selection, where the search space is described as an n-cube [19].

Ant colony optimization (ACO) uses the Fisher discrimination rate as the heuristic information, combined with a rough set approach, for feature selection [20]. The artificial fish swarm (AFS) algorithm mimics fish reacting to stimuli by controlling their tails and fins [21]. Artificial bee colony (ABC) relies on the natural behavior of honeybees, in which randomly generated employer bees move in the direction of the elite bee [22]; the elite bee represents the optimal (or near-optimal) solution [23]. The antlion optimization algorithm (ALO) is a comparatively recent EC method that simulates the hunting behavior of antlions in nature [24].

Artificial neural networks (ANN), particularly single hidden layer feed-forward neural networks (SLFN), are viewed as one of the most conventional machine learning models used in regression and classification domains [25]. The learning algorithm is considered the cornerstone of any neural network. Classical gradient-based learning algorithms suffer from over-fitting and local minima, and they take a long time to learn [26]. The back-propagation artificial neural network (BP-ANN) has a slow learning speed and is likely to get caught in local minima, leading to poor performance and efficiency. The revised back-propagation artificial neural network (RBP-ANN) was proposed to overcome the limitations of BP-ANN [27].

In extreme learning machine (ELM) techniques, the output connections are tuned by solving an optimization problem, i.e., finding the minimum of a cost function by linearization [28]. Huang [29] introduced ELM in order to avoid some of the difficulties observed in gradient-based learning methods. ELM is used as a supervised learning method for SLFN networks [30, 31]. ELM chooses the weights between the input and hidden layers randomly rather than fully adapting all the internal parameters; the output layer weights can then be determined analytically [32].
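The following minimal sketch illustrates this training scheme under stated assumptions (sigmoid hidden activation, output weights obtained with the Moore-Penrose pseudoinverse); the function names and defaults are illustrative and not taken from any specific ELM library:

```python
import numpy as np

def elm_train(X, y, n_hidden=100, seed=None):
    """Fit a single-hidden-layer ELM: random input weights, analytic output weights."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # random input-to-hidden weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                   # sigmoid hidden-layer output matrix
    beta = np.linalg.pinv(H) @ y                             # output weights via pseudoinverse
    return W, b, beta

def elm_predict(model, X):
    """Predict by propagating through the fixed random hidden layer."""
    W, b, beta = model
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```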

3 Flower Pollination Algorithm (FPA) with Selected Applications

FPA is a metaheuristic optimization technique inspired by the pollination process of flowering plants and introduced by Yang in 2012 [33]. Pollination is carried out in two modes: self-pollination (local search) and cross-pollination (global search). The two modes of pollination are detailed as follows [34]:

  1.

    Cross-pollination occurs when pollen from a flower of a different plant is transferred over a long distance by pollinators that can fly far (global pollination) [34]. In cross-pollination, the pollinators carry the flower pollen over long distances, which ensures the pollination and proliferation of the optimal solution \(g_{*}\). The first rule may be formulated as in Eq. (1):

    $$\begin{aligned} X_{i}^{t+1} = X_{i}^{t}+L(X_{i}^{t}-g_*), \end{aligned}$$
    (1)

    where \(X_{i}^{t}\) represents solution i at iteration t, \(g_{*}\) denotes the current best solution, and L is the pollination strength, drawn randomly from a Lévy distribution.

  2.

    Self-pollination is the fertilization of a flower by pollen from the same flower or from different flowers of the same plant; it usually occurs when no pollinator is available. Local pollination and flower constancy are expressed as in Eq. (2):

    $$\begin{aligned} X_{i}^{t+1} = X_{i}^{t} +\varepsilon (X_{j}^{t} - X_{k}^{t}), \end{aligned}$$
    (2)

    where \(X_{j}^{t}\) and \(X_{k}^{t}\) denote two randomly chosen solutions, and \(\varepsilon \) is drawn from a uniform distribution.

Local pollination may account for a substantial fraction of the overall pollination activity, so a switching probability \(p \in [0, 1]\) controls the choice between local and global pollination (in our experiments, we used p = 0.5). The FPA search procedure is outlined in Algorithm 1.

Algorithm 1: Flower pollination algorithm (pseudocode).
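As a minimal, hedged sketch of Algorithm 1, the loop below implements Eqs. (1) and (2) for a generic fitness function to be minimized; the Lévy step uses the common Mantegna approximation with exponent 1.5, and all parameter names and defaults are illustrative assumptions rather than the chapter's exact settings:

```python
import math
import numpy as np

def levy_step(dim, lam=1.5, rng=None):
    """Levy-distributed step via the Mantegna approximation (assumed exponent lam)."""
    rng = np.random.default_rng(rng)
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / lam)

def fpa_minimize(fitness, dim, n_agents=10, n_iter=100, p=0.5, bounds=(0.0, 1.0), seed=None):
    """Generic FPA loop: global (cross) pollination with probability p, else local pollination."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(n_agents, dim))        # initial flower population
    fit = np.array([fitness(x) for x in X])
    best_i = int(fit.argmin())
    g_best, g_fit = X[best_i].copy(), fit[best_i]
    for _ in range(n_iter):
        for i in range(n_agents):
            if rng.random() < p:                          # global pollination, Eq. (1)
                cand = X[i] + levy_step(dim, rng=rng) * (X[i] - g_best)
            else:                                         # local pollination, Eq. (2)
                j, k = rng.choice(n_agents, size=2, replace=False)
                cand = X[i] + rng.random() * (X[j] - X[k])
            cand = np.clip(cand, lo, hi)
            f_cand = fitness(cand)
            if f_cand < fit[i]:                           # greedy replacement of the flower
                X[i], fit[i] = cand, f_cand
                if f_cand < g_fit:                        # update the global best solution
                    g_best, g_fit = cand.copy(), f_cand
    return g_best, g_fit
```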

3.1 FPA Applied for Feature Selection

FPA is adopted here to exploit the capabilities of the filter and wrapper approaches for feature selection. The filter approach can be described as a data-oriented method that is not directly related to classification performance. The wrapper-based approach is more closely tied to prediction performance, but it does not address redundancy and dependency among the selected features.

We seek to find similarities and differences based on evaluation criteria that may help to reveal the strengths and weaknesses of each approach. All swarm intelligence methods regularly share information between their multiple agents; therefore, at every iteration, all or some agents update their positions based on their own position and the positions of the others.

FPA is applied to feature selection in both classification and regression problems. For a vector with N features, the number of possible feature subsets is \(2^N\), which is a vast space to search exhaustively. Therefore, intelligent optimization is applied to explore the search space adaptively for the best feature subset. The optimal feature subset is the one with the lowest prediction error and the smallest number of selected features, which is a common objective in the literature. In classification problems, the general fitness function of the proposed optimization algorithms maximizes the classification accuracy over the validation set given the training set, as shown in Eq. (3), while keeping the number of selected features to a minimum:

$$\begin{aligned} \downarrow Fitness = \alpha (1-P) + \beta {\frac{\mid R \mid }{\mid C \mid }}, \end{aligned}$$
(3)

where \(\mid R \mid \) indicates the size of the chosen feature subset, \(\mid C \mid \) is the total number of features in the dataset, \({\alpha }\) and \({\beta }\) denote the relative importance of the classification performance and of the subset size, with \({\alpha }\, {\in }\, [0, 1]\) and \({\beta = 1- {\alpha }}\), and P is the classification performance measured as in Eq. (4):

$$\begin{aligned} P = \frac{N_c}{N}, \end{aligned}$$
(4)

where \(N_c\) indicates the number of correctly classified instances, and N is the total number of instances.
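As an illustration, the sketch below evaluates Eqs. (3) and (4) for a binary feature mask with a KNN wrapper (K = 5, as used later in the experiments); the weight \(\alpha = 0.99\) and all variable names are assumptions made for the example, not values taken from Table 1:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classification_fitness(mask, X_train, y_train, X_val, y_val, alpha=0.99):
    """Eq. (3): alpha * (1 - accuracy) + beta * |R| / |C|, evaluated with a KNN wrapper."""
    beta = 1.0 - alpha
    mask = np.asarray(mask)
    selected = np.flatnonzero(mask)                # indices of the selected features (R)
    if selected.size == 0:                         # an empty subset cannot be evaluated
        return np.inf
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train[:, selected], y_train)
    P = knn.score(X_val[:, selected], y_val)       # Eq. (4): N_c / N on the validation set
    return alpha * (1.0 - P) + beta * selected.size / mask.size
```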

In the case of regression problems, the general fitness function of the proposed optimization algorithms minimizes the prediction error over the validation set given the training set, as in Eq. (5), while keeping the number of selected features to a minimum.

$$\begin{aligned} \downarrow Fitness = \alpha * E + \beta {\frac{\mid R \mid }{\mid C \mid }}, \end{aligned}$$
(5)

where E indicates the prediction error, and \({\alpha }\) and \({\beta }\) weight the importance of the prediction error and of the selected feature subset size, respectively. E is defined as:

$$\begin{aligned} E=\sum _{i=1}^M|a_i-t_i|, \end{aligned}$$
(6)

where \(a_i\) and \(t_i\) are the model prediction and the target value for point i in the validation set, and M is the number of points in the validation set.

Each search agent has the same dimensionality as the number of features in a given dataset. Each dimension is bounded in the range [0, 1]: as a dimension's value approaches 1, the corresponding feature becomes a stronger candidate for selection. During individual fitness calculation, the continuous value is thresholded to decide whether the feature is selected at the evaluation stage; a static threshold of 0.5 is used, as in Eq. (7):

$$\begin{aligned} y_{ij} = {\left\{ \begin{array}{ll} 0 \text{ if }\, x_{ij} < 0.5 \\ 1 \text{ Otherwise }, \end{array}\right. } \end{aligned}$$
(7)

where \(x_{ij}\) is element j of the D-dimensional position of solution i in the feature search space, and \(y_{ij} \in \{0,1\}\) is the corresponding binary value indicating whether feature j is selected in solution i.
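A one-line realization of Eq. (7), applied to a continuous position vector before the wrapper fitness is computed (the function name is illustrative):

```python
import numpy as np

def to_binary(position, threshold=0.5):
    """Eq. (7): dimensions >= 0.5 map to 1 (feature selected), others to 0."""
    return (np.asarray(position) >= threshold).astype(int)

# Example: to_binary([0.83, 0.12, 0.55]) -> array([1, 0, 1])
```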

3.2 FPA Applied for Knapsack Problem

Given a set of n elements, where each element j has a profit \(p_j\) and a weight \(w_j\), and a knapsack of capacity C, the objective is to find the most profitable selection of elements without violating the knapsack weight capacity [35]. A solution describing whether each element is selected can be represented as an n-dimensional binary vector with elements \(x_j \in \{0,1\}\). The problem can therefore be formulated mathematically as:

$$\begin{aligned} Maximize \sum _{j=1}^{n} p_j x_j, \end{aligned}$$
(8)

subject to

$$\begin{aligned} \sum _{j=1}^{n} w_j x_j \leqslant C. \end{aligned}$$
(9)

The knapsack problem is NP-hard and requires an intelligent optimizer to search the huge space of possibilities. FPA is adopted in this work to solve a set of knapsack problems of varying dimensions to demonstrate its search capability. A death penalty [36] is adopted to handle the knapsack capacity constraint, while the total fitness is calculated as in Eq. (8) with a negative sign so that the maximization is standardized into a minimization; a minimal sketch of this penalized objective is given below.
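The sketch assumes infeasible solutions are simply rejected with an infinite penalty; the exact penalty handling used in the experiments may differ:

```python
import numpy as np

def knapsack_fitness(x, profits, weights, capacity):
    """Negated Eq. (8) with a death penalty for violating the capacity constraint of Eq. (9)."""
    x = np.asarray(x)
    if np.dot(weights, x) > capacity:     # infeasible: constraint (9) violated
        return np.inf                     # death penalty rejects the solution outright
    return -float(np.dot(profits, x))     # maximize profit == minimize its negative
```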

4 Experimental Results and Discussion

The global and optimizer-specific parameter settings are outlined in Table 1. The parameters are set either according to domain-specific knowledge, as for the \(\alpha \) and \(\beta \) parameters of the fitness function, or based on trial and error on small simulations and common practice in the literature, as for the remaining parameters.

Table 1 The parameter setting for experiments

In this study, the wrapper approach is used to find a feature subset guided by the prediction performance; hence, an intelligent search method is necessary for exploring the feature space. For the classification datasets, the classifier used in the fitness function of Eq. (3) is KNN [37]. KNN is configured on a trial-and-error basis, and \(K=5\) is selected as the best-performing choice on all the datasets.

4.1 Assessment Indicators

Each algorithm is applied \(K*M\) times with random positioning of the search agents, except that the solution with all features selected is forced to be the position of one of the search agents. Forcing the full-feature solution into the population ensures that any subsequent feature subset selected as the global best solution is fitter than it. Repeated runs of the optimization algorithms were performed to test their convergence capability. We apply two types of indicators (measures) to compare the various algorithms.

  1.

    The first group of indicators is applied directly to the fitness values obtained on the validation set and is used to characterize algorithm performance as follows:

    • Mean fitness: the average fitness of the final solutions obtained by an optimizer over a number of individual runs [38].

    • Median fitness: used to assess the average performance of the optimizer over all M runs while tolerating noisy (outlier) runs [38].

    • Best fitness: the minimum fitness value acquired by the optimizer over M independent runs [38].

    • Worst fitness: the maximum (worst) fitness value acquired by an optimization method over M independent runs [38].

    • Statistical standard deviation (std): a representation of the variation of the best solutions obtained when running a stochastic optimizer for M different runs. Std is used as an indicator of the optimizer's ability to converge to the same or a similar optimal solution [38].

  2.

    The second group of indicators is applied to assess the performance of the entire prediction model as follows:

    • Average classification error: depicts how accurately the classifier performs using the chosen feature subset, as shown in Eq. (10):

      $$\begin{aligned} Perf = \frac{1}{M} \sum _{j=1}^M\frac{1}{N}\sum _{i=1}^N Unmatch(C_i,L_i), \end{aligned}$$
      (10)

      where M represents the total number of runs of the optimization method, N is the total number of instances in the test set, \(C_i\) is the classifier output label for data instance i, \(L_i\) is the reference class label for data instance i, and Unmatch is the function that yields 0 when the two labels are identical and 1 otherwise.

    • Mean square error (MSE): measures the mean of the squared differences between the actual output and the predicted one, as given in Eq. (11):

      $$\begin{aligned} MSE = \frac{\sum _{i=1}^n(pred_i-obs_i)^2}{n}, \end{aligned}$$
      (11)
    • Root mean square error (RMSE): measures the difference between the actual outputs and the predicted ones, as given in Eq. (12):

      $$\begin{aligned} RMSE=\sqrt{\frac{\sum _{i=1}^n(obs_i-pred_i)^2}{n}}, \end{aligned}$$
      (12)

      where \(obs_i\) and \(pred_i\) are the observed and predicted values for example i, respectively, and n is the total number of examples in a given dataset.

    • Average selection size: the average ratio of the size of the chosen feature subset to the total number of features, as in Eq. (13):

      $$\begin{aligned} Selection\_Size= \frac{1}{M}\sum _{i=1}^M \frac{size(g_*^i)}{N_t}, \end{aligned}$$
      (13)

      where \(N_t\) represents the total number of features in a given dataset.

    • Average feature reduction: the average ratio of the number of removed features to the total number of features, as in Eq. (14):

      $$\begin{aligned} Reduction = 1-\frac{1}{M}\sum _{i=1}^M \frac{size(g_*^i)}{N_t}, \end{aligned}$$
      (14)
    • Average Fisher score (F-score): rewards feature subsets for which the distances between data samples of different classes are large while the distances between data instances of the same class are as small as possible [39]. The F-score is computed for individual features given the class labels and averaged over M independent runs of an algorithm, as shown in Eq. (15) (a minimal code sketch appears after this list):

      $$\begin{aligned} F_j=\frac{\sum _{k=1}^cn_k(\mu _k^j-\mu ^j)^2}{(\sigma ^j)^2}, \end{aligned}$$
      (15)

      where \(F_j\) is the Fisher score of feature j, \(\mu ^j\) is the mean of feature j over the entire dataset, \((\sigma ^j)^2\) is the variance of feature j over the whole dataset, \(n_k\) denotes the size of class k, and \(\mu _k^j\) indicates the mean of feature j within class k.

    • Wilcoxon rank-sum test: introduced by Wilcoxon [40] as a non-parametric test. The test assigns ranks to all the scores considered as one group and then sums the ranks of each group. The null hypothesis is that both groups come from the same population, so any difference between the two rank sums arises only from sampling error. The rank-sum test is often described as the non-parametric version of the T-test for two independent groups.

    • T-test: a statistical significance test that decides whether or not the difference between the means of two groups most likely reflects a real difference in the populations from which the groups were sampled, as in Eq. (16) [41]:

      $$\begin{aligned} t=\frac{\bar{x}-\mu _0}{\frac{S}{\sqrt{n}}} \end{aligned}$$
      (16)

      where \(\bar{x}\) is the sample mean, \(\mu _0\) is the hypothesized population mean, S is the sample standard deviation, and n is the sample size, so that \(\frac{S}{\sqrt{n}}\) is the standard error of the mean.

    • Average computational time: the run time of a given optimization algorithm in milliseconds, averaged over the different runs as given in Eq. (17):

      $$\begin{aligned} T_{o}= \frac{1}{M}\sum _{i=1}^M RunTime_{o,i}, \end{aligned}$$
      (17)

      where M is the total number of runs of optimizer o, and \(RunTime_{o,i}\) is the computational time in milliseconds of optimizer o at run number i.
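For completeness, the following minimal sketch computes the per-feature Fisher score exactly as Eq. (15) is written (class-weighted between-class scatter divided by the whole-dataset variance); the function and variable names are illustrative:

```python
import numpy as np

def fisher_scores(X, y):
    """Eq. (15): per-feature Fisher score using the whole-dataset variance in the denominator."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    mu = X.mean(axis=0)                       # mean of each feature over the whole dataset
    var = X.var(axis=0) + 1e-12               # whole-dataset variance (small term avoids /0)
    num = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]                        # samples belonging to class c
        num += Xc.shape[0] * (Xc.mean(axis=0) - mu) ** 2
    return num / var
```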

4.2 Datasets

All datasets were collected to provide a variety of features and instances representative of the different problem types on which the introduced methods are examined. In addition, we selected a set of relatively high-dimensional datasets to verify the performance of the optimization algorithms in huge search spaces. Each dataset is split in cross-validation [42] mode for evaluation, in which K-1 folds are employed to form the training, validation, and testing sets. The procedure is repeated M times; hence, each optimizer is evaluated \(K*M\) times per dataset. Each dataset is partitioned into equally sized training, validation, and testing parts. The training part is used to train the classifier during optimization and at the final evaluation. The validation part is used to assess the performance of the classifier during optimization. The testing part is used to evaluate the finally selected features given the trained classifier. The classification and regression models ensure the quality of the selected features and are assessed on the validation set inside the fitness function during the optimization process [6]. For the regression datasets, the regression model used in the fitness function of Eq. (5) is the extreme learning machine (ELM) with a varying number of hidden neurons and a sigmoid basis function. The ELM used for regression and for evaluating the fitness function has seven input nodes and one hundred hidden nodes (chosen on a trial-and-error basis), because ELM needs more hidden nodes than classical gradient-based training algorithms [28].

Table 2 lists the twenty-one datasets used in the classification problems; they were acquired from the UCI machine learning repository [43, 44]. Table 3 lists the ten datasets used in the regression experiments, also drawn from the UCI machine learning repository [43].

Table 2 List of datasets used in classification data
Table 3 List of datasets used in regression data

4.3 FPA for Feature Selection Using Classification Data

In the classification category, the classifier used in the fitness function of Eq. (3) is KNN [37], applied with \(K=5\), selected on a trial-and-error basis as the best-performing value on all the datasets. The overall purpose of this part is to show that bio-inspired optimization methods for feature selection can reduce the selected feature subset and improve the classification performance compared with using the whole feature set and with conventional feature selection methods.

Table 4 outlines the statistical mean fitness of the FPA [45], BA [46], GA, and PSO optimization algorithms on all 21 classification datasets, calculated over the 20 runs. We can observe that all the optimization methods outperform the full-feature solution, which confirms the capability of the wrapper-based method for the feature selection problem. We can also note that CS in general performs better than the other optimizers, which demonstrates its ability to adaptively explore the search space for the optimal feature combination. To evaluate the stability of the stochastic algorithms in the study and their convergence to the same optimal solution, we measure the standard deviation; the results are shown in Table 5. We can see that, although FPA relies on a Lévy distribution with infinite variance, it still maintains a comparable std measure.

Table 6 outlines the average classification error on the test set of the feature subsets selected by the optimization methods, averaged over the 20 runs. From the table, FPA obtains the best results on average, demonstrating its capability to find optimal feature combinations that ensure good test performance. Regarding the size of the selected features relative to the original size, Table 7 outlines the ratio of kept features to the total number of features; FFA obtains the best selection-size results in general. The performance on the test data is to some extent consistent with the F-scores calculated over the features selected by the different optimizers, shown in Table 8, where GA obtains the best F-score values overall. Table 9 outlines the average computational time of the different optimization algorithms; FPA has the best computational time compared with all other algorithms.

Table 4 Mean fitness of 20 runs
Table 5 Std of fitness values for 20 runs
Table 6 Average classification error of 20 runs
Table 7 Average selection size of 20 runs
Table 8 Average F-score of 20 runs
Table 9 Average computational time (milliseconds) of 20 runs for the different optimizers

4.4 FPA for Feature Selection Using Regression Data

For the regression data, the regression model used in the fitness function of Eq. (5) is the extreme learning machine (ELM). The overall purpose of this section is to introduce bio-inspired optimization algorithms for feature selection that reduce the number of selected features and reduce the prediction error compared with using the whole feature set and conventional feature selection techniques in regression problems.

Table 10 outlines the statistical mean fitness of the BA, CS, DA, FFA, FPA, MAKHA, GA, and PSO optimization algorithms on all ten regression datasets, calculated over the 20 runs. We can note that FPA in general performs better than the other optimizers, which proves its capability to adaptively explore the search space for the best feature subset. To evaluate the stability of the stochastic algorithms in the study and their convergence to the same optimal solution, the standard deviation results are shown in Table 11. We can see that, although FPA relies on a Lévy distribution with infinite variance, it still maintains a comparable std measure.

Table 12 reports the mean RMSE on the test data of the feature subsets selected by the optimization algorithms, averaged over the 20 runs. From the table, FFA obtains the best results on average, demonstrating its capability to find optimal feature combinations that ensure good test performance. Regarding the size of the selected features relative to the original size, Table 13 outlines the ratio of kept features to the total number of features; GA obtains the best selection-size results overall. Table 14 outlines the average computational time of the different optimization algorithms; DA has the best computational time compared with all other algorithms.

Table 10 Mean fitness of 20 runs
Table 11 Std of fitness values for 20 runs
Table 12 Average RMSE of 20 runs
Table 13 Average selection size of 20 runs
Table 14 Average computational time (in milliseconds) of 20 runs

4.5 FPA for Knapsack Problem

In this section, FPA is used and benchmarked against BA, GA, and PSO on the binary knapsack problem. A set of 20 benchmark problems with different dimensionalities and capacities, listed in Table 15, was used in the study.

Table 15 Used problem sets and the corresponding dimension of each problem
Table 16 Mean fitness for the different used optimizers on the different problems

Problems F1–F20 are expected to evaluate the exploitation capability of a given algorithm. We can see in Table 16 that the performance of the FPA optimization algorithm on average outperforms the other methods, which demonstrates the exploitation capability of the FPA algorithm. The same conclusion can be drawn from the median performance presented in Table 17, where FPA still outperforms the BA, GA, and PSO algorithms.

Table 18 reports the best-performance indicator obtained by running the individual optimizers over 20 runs; this indicator targets optimistic users. We can see from the table that FPA outperforms GA and PSO. Table 19 reports the worst-fitness indicator for the different problems; it is intended to assess the worst performance of a given optimizer and hence targets pessimistic users. We can see from the table that the worst performance of FPA still outperforms the other algorithms, which supports the use of FPA in pessimistic applications.

Table 20 reports the standard deviation of each optimizer's best output solution across the independent runs. This indicator is expected to assess the repeatability of the obtained solutions and the convergence to the same or similar optima. We can see from Table 20 that the standard deviation of FPA outperforms those of the other optimizers, which shows that although FPA has a strong exploration capability, it can still converge to the same or similar optima and can therefore be considered a candidate optimizer for repeatable results.

Tables 21 and 22 report the p-values of two common significance tests, used to assess the significance of the performance improvement obtained with the proposed approach. The tests used are the two-sided Wilcoxon rank-sum test and the T-test. We can see that the p-values for both tests are close to 0; the null hypothesis is therefore rejected, which confirms that the improvement obtained with FPA over the BA, GA, and PSO algorithms is statistically significant.

Table 17 Median fitness for the different used optimizers on the different problems
Table 18 Best fitness for the different used optimizers on the different problems
Table 19 Worst fitness for the different used optimizers on the different problems
Table 20 Standard deviation of fitness for the different used optimizers on the different problems
Table 21 P-value for T-test of FPA compared to other optimizers
Table 22 P-value for Wilcoxon of FPA compared to other optimizers

5 Conclusions

This work assesses the performance of FPA in two application domains, namely feature selection and the knapsack problem. For feature selection, FPA can outperform BA, GA, and PSO thanks to its capability to adaptively search spaces with many local optima while avoiding premature convergence. In the knapsack domain, FPA is also found to be very competitive with PSO, GA, and BA, with a tolerable difference in run time and better optimization performance.

As future work, we propose five directions that can be investigated in addition to the work presented here:

  1.

    The proposed FPA method will be assessed using complex datasets that have a huge number (thousands) of input features.

  2.

    Add more statistical evaluation measures, such as sensitivity, specificity, and F-measure.

  3.

    Employ bio-inspired optimization methods for solving challenging problems in different applications such as big data, bioinformatics, and biomedicine.

  4.

    Use more machine learning techniques for wrapper-based fitness evaluation such as support vector machine (SVM), random forest (RF), and support vector regression (SVR).

  5.

    Propose a multi-objective fitness function that uses bio-inspired algorithms to find the optimal feature subset.