1 Introduction

In classification, feature selection (FS) is mainly used to reduce the number of features in a data set while maintaining its quality [23, 68]. The aim is to select the smallest subset of relevant features that is sufficient to describe the target class [62]. FS can be supervised [51], unsupervised [12] or semi-supervised [61]. In the supervised setting the class labels of the data set are already known, whereas in the unsupervised setting they are not. Semi-supervised FS combines both, working with labelled and unlabelled data [32].

Supervised FS is further classified into filter, wrapper and hybrid approaches, depending on the evaluation criterion [64]. In the filter-based approach, features are evaluated without involving any classification algorithm, which makes it computationally fast. However, the filter ignores dependencies and relationships among the selected or ranked features, which subsequently affects classification performance (i.e. error rate or accuracy) [15].

In the wrapper-based method, a classification algorithm is used to evaluate each selected subset of features, and hence it achieves a better classification accuracy or error rate [26, 32, 64]. Its major shortcomings are that it is computationally expensive and ill-suited to high-dimensional data sets. The most common examples of the wrapper-based approach are sequential forward selection (SFS) [60], sequential backward selection (SBS) [36], plus q take away r [16] and the genetic algorithm (GA) [37], among others.

In FS, the classification performance of either approach (filter or wrapper) is measured according to feature size (the number of selected features), error rate (or accuracy) and computational time. Machine learning algorithms are commonly used to evaluate the goodness of the selected subset of features in terms of accuracy or error rate [64]. The most widely used include the support vector machine (SVM) [57], K-nearest neighbour (KNN) [24] and Gaussian Naïve Bayes (GNB) [34], among others.

The wrapper-based approach requires one to choose a classification algorithm and uses its performance as the evaluation criterion. It searches for features that suit the chosen machine learning algorithm and increase its accuracy [39, 40]. Classification accuracy, together with the selected subset of features, is used to determine the prediction performance of the wrapper model [54, 66]. Higher prediction accuracy and a lower tendency to get stuck in local optima are therefore the critical advantages of the wrapper over the filter, and its outcomes are usually more encouraging than those of filter models.

However, the shortcomings of the wrapper-based approach include a high risk of overfitting, dependence on the classifier, heavy computational cost and unsuitability for very high-dimensional data [41]. Examples of wrapper techniques are sequential forward selection (SFS), sequential backward elimination (SBE), plus q take away r and beam search. Others include simulated annealing, randomised hill-climbing, the genetic algorithm and the estimation of distribution algorithm, among others [40, 46]. The steps of the wrapper-based approach are illustrated in Fig. 1.

Fig. 1
Wrapper-based feature selection procedure
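As a rough complement to Fig. 1, the sketch below implements the generic wrapper loop: some search strategy proposes candidate feature subsets, a learning algorithm scores each one, and the best-scoring subset is retained. The random subset generator and the 5-NN/cross-validation scoring used here are placeholder choices for illustration, not the exact procedure of any method discussed in this paper.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def wrapper_search(X, y, n_candidates=50, seed=1):
        # Generic wrapper loop: generate a candidate subset, score it with a
        # classifier, keep the best. Any search strategy could replace the
        # random generator used here.
        rng = np.random.default_rng(seed)
        n_features = X.shape[1]
        best_mask, best_score = None, -np.inf
        for _ in range(n_candidates):
            mask = rng.random(n_features) > 0.5          # candidate subset (placeholder)
            if not mask.any():
                continue
            clf = KNeighborsClassifier(n_neighbors=5)    # learner judging the subset
            score = cross_val_score(clf, X[:, mask], y, cv=3).mean()
            if score > best_score:
                best_mask, best_score = mask, score
        return best_mask, best_score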

Finding the optimal subset of features at low computational cost is a demanding task because the search space, i.e. the number of candidate feature subsets to examine, is very large. Hence, FS is considered an NP-hard problem [33, 39, 52, 64]. For searching for the best subsets of features, [26, 39] identified three search strategies: complete search, sequential or heuristic search, and random search.

The complete search generates all possible feature subsets and evaluates them one after the other to select the subset with the highest classification performance. It guarantees the optimal result with respect to the chosen criterion. However, for a data set with N features, \(2^{N}\) subsets have to be generated and evaluated, which is impractical for any sizeable value of N. Moreover, [41] argued that 'search is complete does not mean that it must be exhaustive'.
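To make the \(2^{N}\) figure concrete, the short sketch below enumerates every non-empty feature subset; this is only feasible for very small N, since the count doubles with each added feature.

    from itertools import combinations

    def all_subsets(n_features):
        # Complete search space: every non-empty subset of the feature indices.
        for size in range(1, n_features + 1):
            yield from combinations(range(n_features), size)

    print(sum(1 for _ in all_subsets(10)))   # 1023 subsets for N = 10
    print(2 ** 60 - 1)                       # ~1.15e18 subsets already for N = 60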

The heuristic or sequential search adds or removes features in sequential order; the features not yet selected are considered for selection later in the same manner. By doing so, the selection may end up following the same pattern as the complete search; however, there is no guarantee of finding the target solution [52].

Random search is the most popular of all the search strategies [42]. It aims to strike a balance between the heuristic and complete searches by combining the advantages of both. It begins with a randomly selected subset of features and proceeds in one of two ways: either it follows the heuristic search and injects some randomness, or it generates the next subset in a completely random manner [39, 42].

Of all these search strategies, random search is the only one that can escape from local optima in a vast search space, thanks to the randomness involved, and it mostly finishes within the shortest time [26, 40, 64].

Several works on wrapper-based FS apply different search strategies with different meta-heuristic algorithms. For example, [9, 10] used an artificial immune system for FS, and particle swarm optimisation (PSO) is reported in [31, 44, 52,53,54, 63, 65].

More recently, differential evolution (DE) [27], cuckoo search [13, 18], the grasshopper optimisation algorithm [43] and the genetic algorithm (GA) [3, 19, 30, 59, 66] have been reported. In addition, genetic programming has been used for wrapper-based FS [48], and a flower pollination algorithm for FS has also been proposed [56]. This clearly shows that meta-heuristic algorithms are suitable for addressing these problems. Despite these attempts to solve the lingering issues of wrapper-based FS, the existing works still cannot evolve the best subset of features with improved accuracy on some of the data sets [23].

The cuckoo optimisation algorithm (COA) presented in [49] is among the evolutionary algorithms that show promising results in handling different combinatorial optimisation problems, including NP-hard ones. Despite this proven record, and especially its use in filter-based FS [55], its application to wrapper-based FS has not been fully investigated.

This study aims to find the best subsets of features that have a smaller feature size yet maintain the same or better classification accuracy than the full set of features, within a short period of time. It also investigates the difference between COA and BCOA in wrapper-based FS.

To accomplish this goal, two FS frameworks are developed based on BCOA and COA. The proposed algorithms are studied and compared with FS algorithms presented in other works on benchmark problems of varying difficulty.

Specifically, this study will examine:

  1.

    Whether the adopted COA wrapper-based FS algorithms would choose the best subsets of features, with the smallest feature size and lowest computational cost, and achieve the best error rate compared with using full-length features, and whether they would outperform the adopted BCOA wrapper-based single-objective algorithms;

  2.

    Whether the adopted BCOA wrapper-based FS algorithms would choose the best subsets of features and attain better performance than the adopted COA wrapper-based algorithms above;

  3.

    Whether the COA wrapper-based algorithm with two-step evaluation would choose the best feature subsets and outperform the two-step BCOA wrapper-based algorithm as well as other existing works; and

  4.

    Whether the BCOA wrapper-based algorithm with two-step evaluation would choose the best feature subsets and outperform all the other approaches stated directly above.

The rest of the paper is organised as follows: Section 2 provides the background, with details about the adopted COA and BCOA along with related works. The proposed wrapper-based feature selection approaches are presented in Section 3, while Section 4 describes the experimental design, the data sets used and the benchmark approaches. Section 5 presents the results, and Section 6 concludes the work and suggests areas for future work.

2 Background

2.1 Cuckoo Optimisation Algorithm

The original Cuckoo Optimisation Algorithm (COA) is designed strictly for continuous optimisation problems, while the binary version (BCOA) can be applied to problems in binary or discrete form. Uses of COA for FS are very scarce in the literature. The size or dimension of the search space (i.e. the full-length feature set of a data set) is n. Every habitat in COA is represented by a vector of n decimal numbers. The position of habitat i in dimension d is \(x_{id}\), normally in the range [0, 1]. To decide whether a feature is selected or not, a threshold \(0< \theta <1\) is required and compared with the decimal numbers in the habitat position: if \(x_{id} > \theta \), feature d is selected; otherwise it is not.
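A minimal sketch of the threshold decoding just described is given below; the threshold value \(\theta = 0.5\) is an assumed setting, not one prescribed by the paper.

    import numpy as np

    def decode_habitat(habitat, theta=0.5):
        # Feature d is selected whenever x_id > theta (theta = 0.5 is assumed here).
        habitat = np.asarray(habitat, dtype=float)
        return np.flatnonzero(habitat > theta)        # indices of the selected features

    # Example: a 6-dimensional habitat selecting features 1, 3 and 5.
    print(decode_habitat([0.2, 0.9, 0.4, 0.7, 0.1, 0.8]))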

The COA developed by [49] is adopted; it works as follows:

  1.

    An array called a “habitat” is used to represent a candidate solution of the optimisation problem, as shown in Eq. 1.

    $$\begin{aligned} habitat = [x_{1}, x_{2}, \ldots , x_{Nvar}] \end{aligned}$$
    (1)
  2.

    Each cuckoo lays between five and twenty eggs per iteration; these values are the lower and upper limits.

  3.

    Each cuckoo lays its eggs within a maximum distance from its habitat, the egg-laying radius (ELR), given in Eq. 2.

    $$\begin{aligned} ELR = \alpha \times \frac{number\ of \ current \ cuckoos}{total \ number \ of \ eggs}\times e_{new} \end{aligned}$$
    (2)

    where \(\alpha \) is an integer.

  4.

    P% of the laid eggs (those without any profit value) are killed.

  5.

    k-means clustering (with k = 3 or 5) is used to group the cuckoos into societies.

  6.

    When flying toward the goal habitat, each cuckoo travels only \(\lambda \%\) of the distance and deviates by \(\varphi \) radians, as shown in Eq. 3 (a small sketch of Eqs. 2 and 3 follows this list).

    $$\begin{aligned} \lambda \sim U(0, 1)\qquad \varphi \sim (-\omega , \omega ) \end{aligned}$$
    (3)

    where \(\lambda \sim U(0,\ 1)\) means that \(\lambda \) is a uniformly distributed random number between 0 and 1, and \(\omega \) limits the deviation from the goal habitat.
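A minimal sketch of Eqs. 2 and 3 follows, assuming the quantities named in the text; the way the angular deviation \(\varphi \) is applied to the step, and the choice \(\omega = \pi/6\), are illustrative assumptions rather than the exact update of [49].

    import numpy as np

    rng = np.random.default_rng(0)

    def egg_laying_radius(alpha, n_current_cuckoos, total_eggs, e_new):
        # Literal reading of Eq. 2; e_new is the range-like factor appearing there.
        return alpha * (n_current_cuckoos / total_eggs) * e_new

    def migrate(current, goal, omega=np.pi / 6):
        # Eq. 3: fly lambda ~ U(0, 1) of the distance toward the goal habitat,
        # deviating by phi ~ U(-omega, omega); applying phi as a cosine factor
        # on the step is one simple interpretation, not the authors' exact rule.
        lam = rng.uniform(0.0, 1.0)
        phi = rng.uniform(-omega, omega)
        current, goal = np.asarray(current, float), np.asarray(goal, float)
        return current + lam * (goal - current) * np.cos(phi)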

2.2 Binary Cuckoo Optimisation Algorithm

The Binary Cuckoo Optimisation Algorithm (BCOA) is well suited to the FS problem because the habitat is represented as a binary string: a 1 at a given position signifies that the corresponding feature is selected, and a 0 that it is not. Let \(X_{G}\) and \(X_{C}\) denote the goal and current habitats, respectively. Then Eq. 4 computes the next habitat \(X_{NH}\) as follows:

$$\begin{aligned} \begin{aligned} X_{NH} =X_{C} + rand (X_{G}-X_{C}) \end{aligned} \end{aligned}$$
(4)

A sigmoid function is applied in Eq. 5 to map \(X_{NH}\) into the range [0, 1], and Eq. 6 then converts the value to either 0 or 1.

$$\begin{aligned} S=\frac{1}{(1+{e}^{-X_{NH}} )} \end{aligned}$$
(5)
$$\begin{aligned} \begin{aligned} IF \ (S > rand) \ {THEN}\ X_{NH}=1 \ {AND}\ IF \ (S < rand) \ {THEN}\ X_{NH}=0 \end{aligned} \end{aligned}$$
(6)
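A small sketch of this update, assuming a NumPy habitat vector, is shown below; the tie case S = rand is resolved arbitrarily here.

    import numpy as np

    rng = np.random.default_rng(42)

    def bcoa_next_habitat(x_current, x_goal):
        # One BCOA position update following Eqs. 4-6 (illustrative sketch).
        x_current = np.asarray(x_current, dtype=float)
        x_goal = np.asarray(x_goal, dtype=float)
        x_nh = x_current + rng.random(x_current.shape) * (x_goal - x_current)   # Eq. 4
        s = 1.0 / (1.0 + np.exp(-x_nh))                                         # Eq. 5
        return (s > rng.random(x_current.shape)).astype(int)                    # Eq. 6

    # Example: move a 6-bit habitat toward the goal habitat.
    print(bcoa_next_habitat([0, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]))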

2.3 Related Works

This section reviews some related works on wrapper-based FS, covering both traditional and meta-heuristic approaches in the subsequent parts. However, this study focuses mostly on evolutionary algorithms; for more details on swarm intelligence based approaches, refer to [7].

2.3.1 Classical Wrapper-Based Feature Selection

As mentioned earlier, wrapper-based FS algorithms are computationally costly compared with filter-based FS algorithms [15, 48], largely because of the lengthy evaluation process involved in training and testing the classifier. Furthermore, since the search space of the FS problem grows exponentially with the number of features, searching the entire space is impractical. For that reason, existing wrapper-based techniques use stochastic or greedy search [21, 42].

The most common FS techniques that practise greedy hill-climbing are sequential forward selection (SFS) [1] and sequential backward selection (SBS) [41].

SFS begins with an empty set of features and keeps adding one feature at a time, iteratively, until adding another feature no longer enhances the classification performance, at which point it stops. In contrast, SBS starts with the full set of features and keeps removing one feature at a time until removing a feature no longer improves the classification accuracy (error rate). Apart from the computational cost incurred on large data sets, another major drawback of both SFS and SBS is the nesting effect: once a feature is added or removed, the decision cannot be undone. Thus, both are easily trapped in local optima [40, 41].
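For illustration, a compact SFS sketch is given below; the 3-fold cross-validated 5-NN used as the wrapper is an assumed choice, and the greedy loop makes the nesting effect visible: a feature, once added, is never reconsidered.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def sfs(X, y):
        # Sequential forward selection: greedily add the single feature that most
        # improves the wrapper accuracy; stop when no addition helps.
        remaining = list(range(X.shape[1]))
        selected, best_score = [], 0.0
        while remaining:
            scores = []
            for f in remaining:
                cols = selected + [f]
                clf = KNeighborsClassifier(n_neighbors=5)
                scores.append((cross_val_score(clf, X[:, cols], y, cv=3).mean(), f))
            score, best_f = max(scores)
            if score <= best_score:     # no improvement: stop (added features stay added)
                break
            selected.append(best_f)
            remaining.remove(best_f)
            best_score = score
        return selected, best_score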

To escape the nesting effect, [38] developed the "plus q take away r" technique, in which SFS is applied q times in a forward step and SBS is applied r times in a back-tracking step. However, it requires suitable values of q and r to be determined. To remove the need for fixed values of q and r, [47] enhanced it by introducing a floating search in both SFS (sequential forward floating selection, SFFS) and SBS (sequential backward floating selection, SBFS) that determines q and r automatically. Although both SFFS and SBFS proved useful in some cases, [67] argued that they can still be trapped in local optima even when the criterion function is monotonic and the problem is small-scale.

In linear forward selection (LFS), the number of features used for evaluation in each step is limited. In this way, LFS improves the computational efficiency of the sequential forward methods while sustaining comparable accuracy of the selected subset of features. However, LFS ranks all features without taking into consideration whether some features are present or not, and this restricts its performance, particularly with respect to interactions between features.

2.3.2 Wrapper-Based Feature Selection with Meta-Heuristic Algorithms

As mentioned earlier, meta-heuristic algorithms have become more robust in handling NP-hard problems, including FS. Huang and Wang [30] employed a GA for both FS and SVM parameter optimisation on a real-world data set. The results favoured the GA in terms of classification accuracy and number of selected features compared with the grid algorithm reported in that work. Also, [59] proposed another GA for FS with SVM parameter selection for the detection of diabetic retinopathy, obtaining promising results on a data set of 60 images. An enhanced GA (EGA) was proposed in [19] to reduce text dimensionality; it was combined with six filter FS methods to create a hybrid approach, and the experimental results showed that the hybrid outperformed both the single approaches and the traditional GA. Recently, in [3], the highest accuracy of 99.48% was attained on two different Wisconsin breast cancer data sets; GA was used for FS before applying five different classifiers, and the results were better than the others reported.

Unler and Murat [53] present a discrete PSO for FS in binary classification problems. The proposed approach incorporates an adaptive FS technique that dynamically takes into consideration the relevance and dependence of the features included in the feature subset. The experimental results indicated that the proposed discrete PSO algorithm is competitive in terms of both classification accuracy and computational performance compared with the scatter search and tabu search algorithms on publicly available data sets.

Vieira et al. [58] proposed a modified binary PSO (MBPSO) for FS with simultaneous optimisation of SVM parameters to predict the outcome of patients with septic shock. The results indicated that MBPSO performed very well compared with the standard PSO in terms of both accuracy and the features selected. When compared with GA, the same accuracy was recorded, but MBPSO selected fewer features.

Similarly, [31] developed a supervised PSO-based rough set FS method for medical data diagnosis. Two algorithms, PSO-based relative reduct (PSO-RR) and PSO-based quick reduct (PSO-QR), are presented. The results showed that the proposed algorithms performed better in terms of the number of selected features, classification accuracy and computational time compared with the standard PSO and other reported methods.

Recently, the PSO initialisation and updating mechanisms were changed to better suit FS problems in [52]. Discretisation is applied before FS, since discretisation is considered an essential part of FS. A potential particle swarm optimisation (PPSO) is proposed which employs a new representation that can reduce the search space of the problem and an advanced fitness function to better assess candidate solutions and direct the search process. Experiments on ten high-dimensional data sets showed that PPSO selects fewer than 5% of the features for all data sets. Compared with the two-stage method which uses bare-bone PSO (BBPSO) for FS on the discretised data, PPSO attains better accuracy on seven data sets. Furthermore, PPSO achieves better classification accuracy than evolved PSO (EPSO) on eight data sets, with a reduced feature size on six data sets. Moreover, PPSO also performs better than three of the compared approaches and similarly to one approach on the majority of data sets, in terms of both learning capacity and generalisation ability.

To predict heart disease among patients, [18] used cuckoo search and rough sets for FS, with the disease prediction made using a fuzzy system. Better results were achieved on four different benchmark data sets.

Recently, [13] used a modified cuckoo search along with rough sets to build a fitness function that takes both the number of features in the reduct and the classification quality into consideration. SVM and KNN are used to evaluate the performance of the proposed approach. The results obtained indicate the superiority of the method and show that it can significantly improve performance.

Despite the attempts to solve the lingering issues of wrapper-based FS, the existing works still cannot evolve the best subset of features with improved accuracy on some of the data sets [23]. The COA presented in [49] is among the evolutionary algorithms that show promising results in handling different combinatorial optimisation problems, including NP-hard ones, and it has a proven record especially in filter-based FS [55].

COA has been applied to solve different kinds of problems: recently, it was used with harmony search for optimum tuning of a fuzzy PID controller for LFC of interconnected power systems in [20] and for energy-aware clustering in wireless sensor networks in [35], and an accelerated COA was proposed in [22], where a simulated annealing algorithm was used in place of the k-means clustering of the standard COA for vehicle routing problems.

Compared with GA, the imperialist competitive algorithm (ICA), CSA and PSO, COA is simpler to implement and can converge rapidly [6, 29]. However, its application specifically to wrapper-based FS has not been fully investigated.

3 Proposed Wrapper-Based Feature Selection Approaches

In this section, both BCOA and COA are first adopted and used for wrapper-based FS. The details of how each of the experiments was carried out can be seen in the subsequent parts.

3.1 BCOA and COA for Feature Selection

Two wrapper-based FS methods are proposed, namely BCOA-FS and COA-FS. Throughout the evolutionary training process, Eq. 7 is applied as the fitness function to evaluate the habitat of cuckoo i, where the position \(x_{i}\) represents the selected subset of features.

$$\begin{aligned} Fitness(x_{(i)}) = Error Rate \end{aligned}$$
(7)

where ErrorRate is calculated based on Eq. 8:

$$\begin{aligned} Error Rate= \frac{(FP+FN)}{(TP+TN+FP+FN)} \end{aligned}$$
(8)

where FP, FN, TP and TN are the numbers of false positives, false negatives, true positives and true negatives, respectively.
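A minimal sketch of this fitness, assuming the 5-NN wrapper used later in the experiments, is shown below; mask is a hypothetical boolean vector marking the selected features.

    from sklearn.neighbors import KNeighborsClassifier

    def error_rate_fitness(mask, X_train, y_train, X_test, y_test):
        # Eqs. 7-8: train on the selected features only and return the error rate,
        # i.e. 1 - accuracy = (FP + FN) / (TP + TN + FP + FN) in the binary case.
        clf = KNeighborsClassifier(n_neighbors=5)
        clf.fit(X_train[:, mask], y_train)
        return 1.0 - clf.score(X_test[:, mask], y_test)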

3.2 A Combined Fitness Function for BCOA and COA Feature Selection

The subsets of features selected by BCOA-FS and COA-FS may still contain some redundancy, since the fitness function in Eq. 7 does not reduce the number of features. It is hypothesised, however, that the same or a lower error rate might be achieved with a smaller subset of features. To further reduce the feature size without degrading the classification error rate, a two-step FS method (BCOA-2S and COA-2S) is introduced, in which the evolutionary procedure is separated into two steps.

In step 1, the algorithms emphasise improving the accuracy, as BCOA-FS and COA-FS do, whereas in step 2 the number of features is included in the fitness function. Furthermore, step 2 starts from the solutions obtained in step 1, which ensures that the feature reduction is carried out on the subsets of features with the best accuracy.

The proposed two-step fitness function employed in both BCOA-2S and COA-2S is shown in Eq. 9:

$$\begin{aligned} Fitness_2(x_i) = {\left\{ \begin{array}{ll} Step\ 1: &{} ErrorRate\\ Step\ 2: &{} \beta * \dfrac{M}{n}+(1- \beta )* \dfrac{MErrorRate}{nErrorRate} \end{array}\right. } \end{aligned}$$
(9)

where ErrorRate is the classification error rate attained by the selected subset of features, \(\beta \in [0,1]\) is a constant [63], M denotes the number of selected features and n is the total number of features. nErrorRate is the error rate obtained by using all n features for classification on the training set, and MErrorRate is the error rate obtained with the M selected features. In step 2, the fitness function considers both the feature size and the error rate. It ensures that these two components are in the same range, i.e. [0, 1]; the feature size is normalised and represented by M/n.

The classification performance is represented by MErrorRate/nErrorRate rather than ErrorRate alone to avoid situations where ErrorRate is very small (for instance, < 0.005) and M/n would dominate the fitness function. In such a case, the feature size would count for more than the error rate, which might favour a subset of features with a higher error rate than using the full-length feature set. Since MErrorRate will be smaller than nErrorRate at the end of step 1, MErrorRate/nErrorRate lies in the same range as M/n, i.e. [0, 1].

When they are combined into a single fitness function, \(\beta \) is used to express the relative importance of the number of selected features and \((1 - \beta )\) the remaining importance of the error rate. The error rate is expected to be more important than the feature size, so \(\beta \) is set to be smaller than \((1 - \beta )\) (i.e. \(\beta < 0.5\)). The pseudocode of BCOA-FS and BCOA-2S, and of COA-FS and COA-2S, can be seen in Algorithms 1 and 2, respectively. The main differences between BCOA-FS, COA-FS, BCOA-2S and COA-2S lie in the fitness evaluation function, which is highlighted mostly in the grey lines of the algorithms.
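A small numeric sketch of Eq. 9 is given below; \(\beta = 0.2\) is an assumed value satisfying the stated constraint \(\beta < 0.5\).

    def fitness_two_step(step, m_error_rate, m, n, n_error_rate, beta=0.2):
        # Eq. 9: step 1 uses the error rate alone; step 2 trades off the normalised
        # feature size M/n against the normalised error rate MErrorRate/nErrorRate.
        if step == 1:
            return m_error_rate
        return beta * (m / n) + (1 - beta) * (m_error_rate / n_error_rate)

    # Example: 15 of 60 features selected, error 0.08 versus 0.10 with all features.
    print(fitness_two_step(2, m_error_rate=0.08, m=15, n=60, n_error_rate=0.10))   # 0.69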

The details of the proposed wrapper-based BCOA are depicted in Algorithm 1, where Eqs. 7–9 are used as the fitness functions. The grey colour indicates the places where the equations and the feature-selection-specific initialisation are used in the proposed algorithms.

Algorithm 1, Algorithm 2 (pseudocode)

Fig. 2
Flowchart of the (BCOA-FS and BCOA-2S) and (COA-FS and COA-2S)

Figure 2 shows that each cuckoo is initialised according to Eq. 1 for each of the data sets. The number of features in each habitat is then collected, while eggs detected in the habitat are killed. At this point, the fitness of BCOA-FS and COA-FS is evaluated using Eqs. 7 and 8, whereas the fitness of the combined methods (BCOA-2S and COA-2S) is evaluated using Eq. 9. The population is compared with the maximum value; if the population is less than the maximum, the cuckoos in the worst areas are killed, otherwise profit values are obtained (the survival of the eggs inside the nest is checked). The stopping condition is then evaluated; if the search continues, the eggs grow, and the nest with the best survival rate among the cuckoo societies becomes the best society toward which cuckoos move, according to Eqs. 5 and 6 for BCOA and Eqs. 2 and 3 for the standard COA. Based on Eq. 2, the best ELR is found and all the steps are repeated; otherwise, the optimum solution, i.e. the highest-ranked features, is returned. Finally, the best and the average results are obtained using a classifier.

The time complexity of both BCOA-FS and COA-FS based on the fitness function in Eq. 7 is \(O(\frac{1}{m})+ O(\frac{1}{n})\), where m and n denote the number of selected features and the population size, respectively. The binary search used in BCOA runs in O(n) time, while the COA search runs in \(O(log_{2}n)\) time. Thus, the computational complexity of BCOA-FS is \(O(\frac{1}{m})+ O(\frac{1}{n}) + O(m)\) and that of COA-FS is \(O(\frac{1}{m})+ O(\frac{1}{n}) + O(log_{2}n)\).

Based on the fitness function in Eq. 9, the complexity is \(O(\frac{1}{m^{2}})+ O(\frac{1}{n^{2}})\). Therefore, the total complexity of BCOA-2S is \(O(\frac{1}{m^{2}})+ O(\frac{1}{n^{2}}) + O(n)\) and that of COA-2S is \(O(\frac{1}{m^{2}})+ O(\frac{1}{n^{2}}) + O(log_{2}n)\). Hence, BCOA-FS and COA-FS can complete their process within a shorter time in most cases than their BCOA-2S and COA-2S counterparts.

4 Experimental Design

This section describes the data sets used in conducting the experiments, the parameter settings, and the benchmark approaches used to test the performance of the proposed methods.

4.1 Experimental Datasets

The data sets used in this study are 26 well-known University of California Irvine (UCI) machine learning data sets with distinct characteristics. They contain different numbers of features, ranging from 9 to 500, 14 categorical and 12 continuous data types, between 72 and 5000 instances, and 13 binary-class and 13 multi-class problems. These different characteristics, especially the coverage of small, medium and large feature sizes, are the motives behind the selection of the data sets shown in Table 1. The data sets can be found in [17] or downloaded freely at https://www.ics.uci.edu.ml/~earn.

Furthermore, most of the data sets have been used recently in the works of [2, 15, 43, 44], which clearly shows that they are good benchmarks for FS problems. The data sets contain both categorical and continuous data, which is useful for comparing categorical (discrete) and continuous data. Continuous data take infinitely many values in the form of decimal numbers, while categorical discrete values are mostly finite values in groups.

Table 1 List of data sets

4.2 Experimental Parameter Settings

The parameters employed for the experiments were set as follows: the initial and maximum populations were set to 5 and 20, respectively, and the maximum number of iterations was set to 100. The proposed algorithms were run for 40 independent runs on each data set. The parameter settings used for the proposed COA-FS, COA-2S, BCOA-FS and BCOA-2S algorithms were chosen based on the work of [45, 49].

Similar to the work of [63] and [65], in the experiments all the instances in each data set were partitioned into two groups: a training set and a test set. The most common partitioning approach is to place 2/3 (about 66%) of the instances in the training set and 1/3 (about 33%) in the test set [11]. To simplify the process, 70% of the instances in each data set were used as the training set and the remaining 30% as the test set. The instances were chosen so that the percentage of instances from each class is the same in the training and test sets. The proposed wrapper-based methods need a classifier to estimate the suitability of the selected subsets of features; a KNN (with \(\mathrm{K} = 5\)) was used in the experiments to reduce the wrapper-based computational time [4].
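A sketch of this evaluation setup, under the assumptions described above, might look as follows; the function and variable names are illustrative.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    def evaluate_subset(X, y, mask, seed=0):
        # Stratified 70/30 split and a 5-NN classifier scoring the selected subset.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        clf = KNeighborsClassifier(n_neighbors=5)
        clf.fit(X_tr[:, mask], y_tr)
        return clf.score(X_te[:, mask], y_te)        # classification accuracy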

The experiments with GSBS and LFS were carried out using the well-known Waikato Environment for Knowledge Analysis (WEKA) [28]. All settings in LFS and GSBS were kept at their defaults, since these obtain better results. A 5NN classifier was also used in both LFS and GSBS, each of which generates a single solution (feature subset) for each data set.

4.3 Benchmark Approaches

To scrutinise the performance of the proposed wrapper-based approaches, the results obtained are compared with previous works, as shown in Table 2. From Table 2, two traditionally known wrapper-based FS methods, namely linear forward selection (LFS) [25] and greedy stepwise backward selection (GSBS) [8], are used as benchmark methods. LFS and GSBS are derived from SFS and SBS, respectively. LFS [25] limits the number of features that are selected in each step of the forward selection, which reduces the number of evaluations. As such, LFS is computationally less expensive than SFS and obtains better results. More details about LFS can be found in [25].

Table 2 Existing wrapper-based approaches

On the other hand, the greedy stepwise based FS algorithm can move either forward or backward in the search space [8]. Whereas LFS performs a forward selection, a backward search is selected in the greedy stepwise search to create greedy stepwise backward selection (GSBS). GSBS begins with the full set of features and halts when the removal of any remaining feature results in a reduction of the evaluation measure, i.e. the error rate of the classifier. Also, the work in [63] was used as a benchmark for both the single- and multi-objective wrapper-based approaches, owing to the similarity of the data sets. The detailed explanation and analysis of the results obtained are presented in the subsequent sections.


5 Results and Discussions

This section discusses the results of the proposed methods, the comparison between them, and comparisons with other existing works whose experiments coincide with the data sets used in this study.

5.1 Results of the Proposed BCOA-FS and COA-FS

The results for both the categorical and continuous data sets for BCOA-FS and COA-FS are displayed in Tables 4 and 3, respectively, together with all the other proposed wrapper-based methods. In the tables, "BCOA-FS" and "COA-FS" denote the proposed wrapper-based methods that adopt BCOA and COA, respectively, and "All" stands for the use of all features of each data set. "Ave Size", "Ave Acc" and "Best Acc" denote the average feature size, average accuracy and best accuracy attained on each data set over the 40 independent runs, respectively.

Table 3 Results of the BCOA-FS, COA-FS, BCOA-2S and COA-2S for continuous data sets

The results show that the proposed BCOA-FS outperformed its COA-FS counterpart on the continuous data sets. Of the 12 data sets in Table 3, the two methods record similar feature size, best accuracy and average accuracy on WineEW, Australian, Zoo and, to some extent, Vehicle. However, as the number of features increases, BCOA-FS outperforms COA-FS on the remaining eight data sets. In addition, a slightly similar performance was noticed between BCOA-FS and COA-FS on the HillValley data set. On average, it is clear that BCOA-FS outperformed its COA-FS counterpart on the majority of the data sets and is hence considered the best wrapper-based feature selection method.

A comparison between BCOA-FS and COA-FS was also made on the categorical data sets, as shown in Table 4. Similar to the continuous data sets, the categorical data sets also show similar mean numbers of selected features, best accuracy and average accuracy on the data sets with fewer features, such as Lymph, Mushroom, Spect and Leddisplay. However, from Dermatology, which has 34 available features, onwards there is a change in performance between BCOA-FS and COA-FS. The results also imply that as the feature size increases BCOA-FS performs better than COA-FS on all the data sets except Coil2000, perhaps due to the large number of instances in the Coil2000 data set.

5.2 Results of the Proposed BCOA-2S and COA-2S

The results for both the categorical and continuous data sets for BCOA-2S and COA-2S are also displayed in Tables 4 and 3, respectively. The terms "BCOA-2S" and "COA-2S" denote the proposed methods that combine accuracy and the number of selected features into a single fitness function for BCOA and COA, respectively. All other headings in the tables are as explained in the previous subsection.

Table 4 Results of the BCOA-FS, COA-FS, BCOA-2S and COA-2S for categorical data sets

There are 14 categorical data sets and 12 continuous data sets, making a total of 26 data sets used in this research. On almost all of the 14 categorical data sets, BCOA-2S accomplished better results than COA-2S in terms of the average number of selected features, best accuracy and average accuracy, although the two show similar performance on the Leddisplay data set and the same best accuracy on the Mushroom data set.

On the other hand, the results on the continuous data sets are also in favour of BCOA-2S compared with COA-2S on the majority of the data sets. Similar performance was obtained on a few data sets such as WineEW, Australian, Zoo and Vehicle; however, as the feature size increases, BCOA-2S performs better than COA-2S. This is in contrast to the categorical data sets, where BCOA-2S performed better than its COA-2S counterpart on almost all the data sets regardless of the number of features.

5.3 Comparison Between Proposed Methods and Classical Methods

The results of LFS and GSBS are reported to further compare with the proposed methods. They clearly show that LFS selects a smaller number of features than GSBS on the majority of the data sets, whereas GSBS achieves the best classification results on most of the data sets. On some data sets with a small number of features they record similar performance, but as the number of features increases, LFS selects the smallest feature subsets and GSBS obtains the best accuracy.

Comparing LFS and GSBS with the proposed wrapper-based FS methods, one can observe that our proposed approaches outperformed both LFS and GSBS in terms of the number of selected features, best accuracy and average accuracy on almost all the data sets, both continuous (Table 3) and categorical (Table 4).

5.4 Comparison Between Proposed Methods and Other Existing Methods

To further evaluate the proposed methods and to be fair in assessing them, some related works using similar data sets were used for comparison, as shown in Table 2. The details of the comparison are enumerated below:

  1.

    Comparison with ErFS and 2SFS The results of the proposed wrapper-based feature selection methods are compared with those in [63], where ErFS and 2SFS correspond to the BCOA-FS and BCOA-2S used in this research. The significant difference between the two is the EC algorithm used: an outstanding EA, COA in particular, is used in this research, whereas the existing work used the most common SI-based algorithm (PSO). The results indicate that our proposed COA and BCOA methods outperformed the existing PSO-based approaches of [63]. The result is not surprising, because COA is reported to be more robust and able to attain better results, as claimed in the works of [5, 49]. From Table 5 it is clear that all the comparisons were made on the continuous data sets. On almost all of the 10 data sets used, our proposed methods performed better than the existing ones. However, on the Zoo and Ionosphere data sets, for example, the existing methods performed better in terms of average accuracy. Nevertheless, the best accuracy and the number of selected features clearly show that our proposed methods performed well.

  2.

    Comparison with ABC-ER and ABC-Fit2C The comparison between the proposed methods and ABC-ER and ABC-Fit2C [27] is shown in Table 6. It shows that our proposed methods performed better on all seven data sets in terms of both accuracy and the number of selected features. Even though ABC-Fit2C chooses slightly fewer features on the German and Vehicle data sets than the proposed methods, the proposed methods still attain improved classification accuracy compared with ABC-Fit2C and ABC-ER. Therefore, the results in Table 6 indicate that the proposed methods can effectively evolve a smaller number of features and yet achieve better classification performance.

  3.

    Comparison with BAIS The results obtained by the proposed methods and by the Bayesian artificial immune system (BAIS) of [10] are displayed in Table 7. The results show the superiority of the proposed methods on all five data sets. The proposed methods outperformed BAIS by nearly 10% in classification accuracy on the Ionosphere and Sonar data sets, while improvements of around 2–3% were realised on the Mushroom, WineEW and WBCD data sets. Moreover, fewer features were selected by the proposed methods than by BAIS. Therefore, in terms of both the selected features and the classification accuracy, the proposed methods outperformed BAIS in all respects.

Table 5 Comparison of (BCOA-FS, COA-FS, BCOA-2S and COA-2S) with ErFS and 2SFS
Table 6 Comparison of (BCOA-FS, COA-FS, BCOA-2S and COA-2S) with ABC-ER and ABC-Fit2C
Table 7 Comparison of (BCOA-FS, COA-FS, BCOA-2S and COA-2S) with BAIS

5.5 Comparisons Between BCOA and COA

Comparing the performance of COA and BCOA for the adopted and combined objectives, as shown in Tables 3 and 4 for the continuous and categorical data sets, one can observe that BCOA outperformed COA in terms of the number of selected features, average accuracy and best accuracy for all the proposed methods.

Even though BCOA is a discrete binary version of COA, it can be seen that it outperformed COA not only on the categorical (discrete) data sets but also on the continuous data sets. Here, continuous or discrete data sets refer to data sets whose class labels are continuous or categorical, respectively.

Analysis of the computational time also shows that BCOA completes its evolutionary process in a shorter time than COA on the majority of the data sets; BCOA is faster than COA by around 5–10% on most of the data sets, regardless of whether they are continuous or categorical. This motivates the use of BCOA alone in the multi-objective wrapper-based feature selection, and it avoids repeating a similar explanation of BCOA being the best compared with its COA counterpart.

5.6 Further Discussions

The results show that both BCOA-FS and COA-FS can successfully evolve a set of features with better classification performance within a short period of time. However, as the number of features increases, BCOA-FS performs better than COA-FS, especially on the categorical data sets, whereas COA-FS performs better mostly on data sets with continuous class labels. This demonstrates that the continuous version works well on continuous-label data sets, while the binary version works well on the majority of the data sets and mostly performs better on the categorical ones. Correspondingly, both BCOA-2S and COA-2S can successfully select the best features, with better classification performance than COA-FS and BCOA-FS on the majority of the data sets. Also, BCOA-2S outperformed COA-2S in most cases, owing to the use of the two-step evaluation process.

The proposed approaches used a \(\beta \) value in [0, 1] in the evolutionary process. However, choosing the most appropriate value is quite a challenge, because the number of selected features and the classification performance are combined into a single fitness function. Nowadays, FS is considered a multi-objective optimisation problem, and treating it as such would solve the task much better and yield a set of nondominated solutions.

6 Conclusions and Future Work

This paper presented the first study on wrapper-based feature selection using COA and BCOA. Four wrapper-based feature selection methods were presented: both BCOA and COA were adopted and used as wrappers in the evolutionary process, and then a two-step fitness function was proposed, whereby the classification performance obtained in the first step is combined with the number of selected features in the second step. The results obtained showed that the proposed methods performed well compared with previous work. However, combining the two aims of feature selection into a single fitness function cannot solve the problem completely, and some redundancy may remain among the selected features. Hence, there is a need for multi-objective feature selection that treats the number of selected features and the classification performance simultaneously.

On the other hand, COA, especially its binary version, has performed well for FS because: (1) the COA representation is suitable for FS problems; the habitats in COA are \(N_{var}\)-dimensional arrays representing the current living positions of the cuckoos, which resembles the way candidate solutions are represented in the FS problem. In this case the dimensionality is the number of features, and the value in each dimension of a habitat indicates whether a feature is selected or not. (2) The search space of FS problems is too large, and most of the existing methods get stuck in local optima; as such, there is a need for a global search technique. EC techniques are well known for solving problems that lack exact solutions, are robust to dynamic changes and have broad applicability [14, 49, 50]. COA is an EC technique, precisely an evolutionary algorithm, with effective and efficient search operators that can explore a large space to discover the optimum or a near-optimum solution [14].