1 Introduction

Feature selection in high dimensional data is a preprocessing step that has been widely used to improve the performance of learning algorithms in many fields (Mafarja et al. 2019). This preprocessing technique selects relevant features, removes other features, decreases the time required to learn the predictors and finally leads to an easier interpretation by simplifying the models (Thaher et al. 2020; Yan et al. 2019).

Nowadays, feature selection is an important task in machine learning, since it strongly affects the performance of high-dimensional data classification. Data from many different sources can be used to solve present-day problems, yet selecting features from these large and complex data sources is one of the most challenging tasks. Real-world data have a continuous form; therefore, useful knowledge has to be extracted from these data to solve problems. Large quantities of high-dimensional data create great opportunities as well as new challenges for conventional data mining and knowledge discovery research (Kriegel et al. 2009). The most important challenges in high-dimensional data analysis are limitations in time, memory, and computational cost (Rao et al. 2019).

The data comprise samples described by several features, so selecting a subset of samples or features is advisable to reduce the data volume and ease computation (Hamidzadeh et al. 2015). This reduction can be achieved through feature extraction or feature selection (Huang et al. 2018). Feature selection is particularly advantageous because it simplifies the interpretation of the classification model (Ramírez-Gallego et al. 2017). In other words, the complexity of classifiers decreases when a proper set of selected features is used. Furthermore, it leads to faster convergence toward optimal results, which improves classifier performance (Li et al. 2017; da Silva et al. 2017).

Many features of a problem may be useless or carry little information. Some may even mislead the classifier and reduce its accuracy because of noise. Although failing to remove these features does not cause any loss of information, it raises the computational cost and increases the response time of the system; in terms of storage, it also means keeping a lot of useless information alongside the useful data (Ramírez-Gallego et al. 2017; Lee and Kim 2016). Moreover, a feature selection method may be tied to a specific problem, so it cannot necessarily be applied to other problems (Fong et al. 2013).

Exact mathematical algorithms have become less attractive for this task because they are too complex and time consuming. Instead, metaheuristic algorithms, which imitate the life and evolution of living creatures, have become more widely applicable (Arora and Anand 2019). Metaheuristic methods can be used to search for globally optimal solutions (Nayar et al. 2019), and they solve complex, nonlinear, and uncertain problems more quickly than classical algorithms (Arora and Anand 2019). Unlike traditional algorithms, metaheuristics do not need to restart when new data arrive or the environment changes. Owing to their nature, selecting features with evolutionary algorithms reduces the required time while keeping an acceptable classification accuracy rate (Kumar and Bharti 2019). They can be applied to any problem that can be formulated, and they can be integrated with other optimization techniques. Their biggest advantage is that they solve problems quickly and with a reasonable accuracy rate, without resorting to conventional mathematical solutions (Mafarja et al. 2019).

Metaheuristic algorithms rely on two concepts: exploration and exploitation. Exploration is the capability of searching the problem space without paying attention to the results obtained so far, while exploitation concentrates on those results. To reach the best performance in problem solving, these capabilities should be balanced. Most evolutionary algorithms suffer from getting trapped in local optima and from premature convergence, so several studies have addressed these issues (Arora and Anand 2019). When stringent conditions and random numbers are used in the implementation to increase the convergence rate, the algorithms may become locked into local optima, and their accuracy and efficiency may decrease (Fong et al. 2013; Fong et al. 2016; Gandomi and Yang 2014).

One of the newest and most powerful evolutionary optimization methods is the cuckoo search algorithm, which is particularly capable of finding globally optimal points. This algorithm simulates the nesting and egg-laying behavior of the cuckoo bird to solve optimization problems (Gandomi et al. 2013). The skill of laying eggs that resemble the host bird's eggs improves over generations and prevents the host birds from detecting them; as a result, the chicks hatched from these eggs grow up in the host nest. In this algorithm, the cuckoos look for the best place to lay their eggs, i.e., the globally best solution for survival (Gandomi et al. 2013; El Aziz and Hassanien 2016).

Similar to other evolutionary algorithms, the cuckoo optimization algorithm suffers from excessive randomness, so this study uses chaotic theory, levy flight and the disruption operator to limit its random behavior. These enhancements improve the exploration phase and feature selection in complex data. By improving the cuckoo optimization algorithm for feature selection, a feature subset with maximum feature elimination and the highest classification accuracy rate is obtained.

This paper discusses data preprocessing and its most important task: selecting the best feature set for further processing. The CCOALFDO algorithm is a wrapper-based feature selection method built on an integrated framework that combines chaos theory and the cuckoo optimization algorithm with levy flight, opposition-based learning and the disruption operator to select the most relevant features from 20 high-dimensional datasets. The K-NN classifier (with K = 5) is used to evaluate the produced feature subsets. Experimental results on these datasets show the efficiency of the proposed method over the comparative algorithms.

In summary, the contributions of this study include:

  1. Using chaotic theory, levy flight and the disruption operator to reduce the random selection of features and avoid getting stuck in local optima.

  2. Opposition-based learning searches the opposite points of the current solutions, so the answers cover more of the problem space and better results are obtained.

  3. The disruption operator prevents the solutions from accumulating too much in one place, which enhances population diversity and leads to a better search of the problem space.

  4. Selecting informative features efficiently and reaching convergence in a reasonable time.

  5. The CCOALFDO method is defined as a supervised feature selection method that maximizes the classification accuracy rate and minimizes the feature set size.

  6. In comparison with five state-of-the-art evolutionary algorithms, the CCOALFDO algorithm ranked first in accuracy rate and second in reducing the number of features.

In Sect. 2, the related literature on feature selection is reviewed. Section 3 presents the background concepts, and the proposed method is explained thoroughly in Sect. 4. Section 5 reports the experimental results and analyses the performance of the different feature selection algorithms. Finally, Sect. 6 contains the conclusions and future work.

2 Related works

Feature selection is a very important preprocessing and knowledge discovery tool for data analysis, and it is applied to many domains. In addition, it is used as an important technique for high-dimensional data analysis, since it has shown its promising ability to enhance a variety of other learning tasks, such as data clustering, document analysis, image processing and video processing, to name a few (Huda et al. 2016; Pes et al. 2017; Hossain et al. 2016; Yadav et al. 2018; Zhang et al. 2019; Luo et al. 2018; Bannigidad and Gudada 2019).

Feature selection problems face a tradeoff between the feature set size and the classification quality. In previous studies, the size of the selected feature subspace usually remained fixed. For a dataset with k features, there are \( k!/\left( {\left( {k - s} \right)!s!} \right) \) different ways to select a subspace of s features (Nayar et al. 2019; Mafarja et al. 2019). Choosing s, the size of the feature subspace, is a great challenge: when s is too large, the subset may contain redundant features, and when it is too small, some useful information will be lost (Bostani and Sheikhan 2017).
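To give a concrete sense of scale, with k = 100 features and a subspace of s = 10 features the count above is already astronomical:

$$ \frac{100!}{\left( {100 - 10} \right)!\;10!} \approx 1.73 \times 10^{13} $$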

Depending on the label information usage in various datasets, feature selection methods can be classified into three categories (Solorio-Fernández et al. 2020): supervised, semi-supervised and unsupervised.

The supervised learning methods use the label information of the data to guide the process of dimensionality reduction. According to some criteria, these methods measure the importance and relevance of the features by utilizing the labeled data to train the feature selection models (Zare et al. 2019; Zhong 2020; Ang et al. 2015). Since unlabeled data consist of samples and features without any information about the natural grouping of the data, unsupervised methods select features without paying attention to labels (Solorio-Fernández et al. 2020).

Semi-supervised feature selection, the second category, combines a small amount of labeled data with unlabeled data, whereas the unsupervised category lacks this additional information (Ang et al. 2015; Song et al. 2016; Xu et al. 2018; Yang et al. 2018).

Feature selection algorithms are also divided into three groups according to how the feature subsets are evaluated: filters, wrappers, and hybrid or embedded methods (Ramírez-Gallego et al. 2017; Fong et al. 2013).

In filter algorithms, the evaluation does not depend on any learner model, and features and patterns are treated independently, so there is no need to re-evaluate the patterns as the features evolve; these algorithms rely on simple and clear criteria for feature evaluation (Li et al. 2017). The results of wrapper-based algorithms, which use machine learning techniques, are more reliable, but their computational cost is so high that special methods are needed to reduce it (Kumar and Bharti 2019). Compared with a filter algorithm, a wrapper-based feature selection method has a more complex structure and a higher accuracy rate, since it works in conjunction with the classifier (Ramírez-Gallego et al. 2017; Fong et al. 2013; Peng and Fan 2017). To remain flexible in solving problems, hybrid algorithms combine the filter and wrapper approaches, which enables them to provide more favorable results (Thaher et al. 2020).

Features are divided into four groups: irrelevant features, redundant features, weakly relevant but non-redundant features and strongly relevant features (Li et al. 2017). In order to reach an optimal feature set, we need algorithms that can choose some strongly relevant and non-redundant features.

Some redundant features may have significant statistical relations with other features; conversely, a seemingly irrelevant feature may become highly relevant when combined with other features. That is why, in some studies, the feature set is determined by computing minimal redundancy and maximal relevance (mRMR), i.e., based on mutual dissimilarity with other features (Ding and Peng 2005; Masud 2010).

In past decades, researchers have introduced many evolutionary algorithms, such as the PSO algorithm (Fong et al. 2016; Mistry et al. 2017; Dara et al. 2017; Qi et al. 2017), the ACO algorithm (Shunmugapriya and Kanmani 2017; Sivagaminathan and Ramakrishnan 2007; Sweetlin et al. 2017; Harde and Sahare 2016), the bee colony algorithm (da Silva et al. 2017; Shunmugapriya and Kanmani 2017), the bat algorithm (Jayabarathi et al. 2018; Cheng et al. 2016), the frog leaping algorithm (Hu et al. 2016; Dai 2015) and the krill herd optimization algorithm (Rodrigues 2014; Wang et al. 2019).

Recently, metaheuristic algorithms like ant colony optimization (ACO) (Beltramo et al. 2019), simulated annealing (SA) (Yan et al. 2019), and Henry gas solubility optimization (HGSO) (Neggaz et al. 2020) have become more popular due to their good performance in solving feature selection problems. Fong, for example, used a particle-swarm-based metaheuristic search to select features; this method is suitable for feature spaces with high correlation, and it can be applied with other classification methods as a fitness function (Fong et al. 2013).

Another metaheuristic algorithm is the crow search algorithm (CSA), for which a V-shaped binary variant has been proposed. However, like other evolutionary algorithms, it suffers from a low convergence rate and entrapment in local optima (De Souza 2018). Anter introduced a hybrid crow search algorithm for feature selection that combines a chaos function with the fuzzy c-means objective function as the cost function (Anter and Ali 2020).

Thaher proposed an algorithm, called HHO, to select a suitable feature set based on the hunting process performed by hawks chasing rabbits. This algorithm is a greedy search, improved with levy flight, that operates in continuous space and is then transferred to binary space (Thaher et al. 2020).

Huang proposed a distributed PSO–SVM approach, which uses binary PSO for feature selection and standard PSO for parameter optimization of the SVM (Huang and Dun 2008). A genetic algorithm (GA) was also proposed to reduce the number of features by optimizing the feature subset without degrading the classification accuracy (Himabindu et al. 2019).

The whale optimization algorithm (WOA) is another recent algorithm used in feature selection problems, although it was not successful in reducing the number of features (Hussien et al. 2019). Mafarja used eight different transfer functions to improve the WOA; the results showed that most V-shaped transfer functions performed better than S-shaped ones in minimizing the number of features with less information loss (Mafarja et al. 2019).

A multi-strategy ensemble algorithm tried to enhance the feature selection process by tuning the gray wolf optimization (GWO) parameters (Hussien et al. 2019). Anter applied the GWO algorithm only to medical datasets (Anter et al. 2019).

Another study, expecting further improvement in the accuracy rate, combined GA-based selection with SA into the so-called GA–SA algorithm to avoid getting stuck in local optima (Yan et al. 2019).

The previously mentioned studies did not succeed in balancing the classification accuracy rate against the feature reduction rate. This paper introduces a new wrapper feature selection method, in which irrelevant features are discarded by a cuckoo optimization algorithm (COA) enhanced with chaotic theory, levy flight and the disruption operator. The aim of this paper is thus to provide a method that improves the classification accuracy rate and optimally reduces the number of features simultaneously.

3 Background

In this section, cuckoo optimization algorithm is described in Sect. 3.1, the definition of chaotic function is given in Sect. 3.2, levy flight algorithm is described in Sect. 3.3, the definition of opposition-based learning is given in Sect. 3.4 and finally in Sect. 3.5, disruption operator is briefly reviewed.

3.1 Cuckoo optimization algorithm

Choosing an optimal subspace of the dataset is critical to speed up and improve the classification, especially for high-volume data. In this study, the cuckoo algorithm, introduced by Yang in 2009 (Gandomi et al. 2013), is used for selecting features. Cuckoos are interesting birds that have a special strategy for aggressive reproduction: they lay their eggs in the nests of other birds, called hosts. The cuckoo eggs have no chance to grow if the host birds find them in the nest and throw them out. However, if a host bird does not find the cuckoo egg, it hatches into a cuckoo chick, which grows faster than the host's chicks and throws them out of the nest. Thus, finding the best place with similar eggs for egg laying is essential for the cuckoo generation to survive.

Cuckoo search algorithm rules include:

  • The number of nests (habitats) is fixed.

  • A cuckoo lays between 5 and 10 eggs at a time and then chooses a nest.

  • The best nests are carried over to the next generation.

  • For an Nvar-dimensional problem, an array with 1 row and Nvar columns is used.

The position of a habitat (nest) in the search space is represented as follows:

$$ {\text{Habitat}} = \left[ {x_{1} , x_{2} , \ldots ,x_{\text{Nvar}} } \right] $$
(1)

Habitats that are more suitable for egg laying are desirable; therefore, a fitness function is applied to evaluate every habitat, and the profit function is described as follows:

$$ {\text{fitness}} = f_{p} \left( {\text{habitat}} \right) = f_{p} \left( {x_{1} ,x_{2} , \ldots ,x_{\text{Nvar}} } \right) $$
(2)

In the optimization algorithm, the search space must be bounded; \( {\text{var}}_{\text{hi}} \) is the upper bound and \( {\text{var}}_{\text{low}} \) the lower bound. Every cuckoo has an egg laying radius (ELR), which is determined by the proportion between the number of eggs of that cuckoo and the total number of eggs, and α tunes the maximum value of the ELR. The ELR is formulated in Eq. 3:

$$ {\text{ELR}} = \alpha \times \frac{\text{Number of current cuckoo's eggs}}{\text{Total number of eggs}} \times \left( {{\text{var}}_{\text{hi}} - {\text{var}}_{\text{low}} } \right) $$
(3)
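As a rough illustration (not the authors' implementation), Eq. 3 can be coded directly; the argument names below are placeholders:

```python
def egg_laying_radius(alpha, n_cuckoo_eggs, n_total_eggs, var_hi, var_low):
    """Eq. 3: ELR scales the search bounds by the cuckoo's share of all eggs."""
    return alpha * (n_cuckoo_eggs / n_total_eggs) * (var_hi - var_low)

# e.g. a cuckoo holding 7 of 60 eggs in a [0, 1] search range, with alpha = 2
print(egg_laying_radius(alpha=2.0, n_cuckoo_eggs=7, n_total_eggs=60,
                        var_hi=1.0, var_low=0.0))   # -> 0.2333...
```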

In this algorithm, the first egg-laying place is selected randomly, but the next generations definitely use the best solutions found so far.

3.2 Chaotic function

Chaos is a deterministic dynamic function that is sensitive to its initial conditions and parameters. The nature of chaos is apparently random and unpredictable; however, it also possesses an element of regularity (Alatas et al. 2009).

Chaos is a nonlinear function that can change the behavior of an algorithm. Chaotic maps represent a pseudo-random deterministic process that is non-converging and bounded (dos Santos and Mariani 2008).

Chaos can remedy the premature convergence of evolutionary algorithms by helping them avoid local minima. Many chaotic maps exist; ten one-dimensional maps that can be used as the initial position pattern in cuckoo optimization algorithms are introduced in Table 1 (Sayed et al. 2019; Saremi et al. 2014; Hamidzadeh and Namaei 2018; Hamidzadeh et al. 2017):

Table 1 Chaotic maps

3.3 Levy flight

A levy flight random walk is suggested as a perturbation to perform a local search (Emary and Zawbaa 2018). The levy flight process represents an optimal random search pattern and is frequently observed in nature (Viswanathan et al. 2008). The distribution has infinite mean and infinite variance and follows a simple power law (Eq. 4). S is the step length in the levy flight algorithm, given by Eq. 5. The parameters u and ν are drawn from the normal distributions in Eq. 6, the scale \( \sigma_{u} \) is calculated by Eq. 7 with \( 1 \le \beta \le 2 \), and \( \varGamma \left( \cdot \right) \) is the gamma function (Yang and Deb 2009; Syberfeldt 2014).

$$ L\left( S \right)\sim\left| S \right|^{ - 1 - \beta } \quad 0 < \beta \le 2 ,\,S \to \infty $$
(4)
$$ S = \frac{u}{{\left| v \right|^{{\frac{1}{\beta }}} }} $$
(5)
$$ u \sim N(0,\sigma_{u}^{2} ) \, ,\nu \sim N(0,\sigma_{v}^{2} ) $$
(6)
$$ \sigma_{u} = \left( {\frac{{\varGamma \left( {1 + \beta } \right)*\sin \left( {\pi \beta /2} \right)}}{{\varGamma \left[ {\left( {1 + \beta } \right)/2} \right]*\beta *2^{{\left( {\beta - 1} \right)/2}} }}} \right)^{{{\raise0.7ex\hbox{$1$} \!\mathord{\left/ {\vphantom {1 \beta }}\right.\kern-0pt} \!\lower0.7ex\hbox{$\beta $}}}} ,\sigma_{v} = 1 $$
(7)
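The step-length recipe in Eqs. 5–7 is Mantegna's algorithm; a minimal Python sketch is given below, where beta = 1.5 is only a common illustrative choice, not a value prescribed here:

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(dim, beta=1.5, rng=None):
    """Draw a levy-distributed step S = u / |v|^(1/beta) per Eqs. 5-7."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_u = ((gamma(1 + beta) * sin(pi * beta / 2)) /
               (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma_u, size=dim)   # u ~ N(0, sigma_u^2), Eq. 6
    v = rng.normal(0.0, 1.0, size=dim)       # v ~ N(0, 1), i.e. sigma_v = 1 (Eq. 7)
    return u / np.abs(v) ** (1 / beta)       # Eq. 5

print(levy_step(dim=5))
```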

3.4 Opposition-based learning

Opposition-based learning (OBL) is a useful way to explore the problem space better. It can increase efficiency by searching for a solution and its opposite at the same time. Although this strategy raises the computational load of the algorithm, it helps considerably in achieving faster convergence (Oliva and Abd Elaziz 2020; Tizhoosh 2005; Aladeemy et al. 2020).

The OBL strategy in D-dimensional space is given by Eq. 8.

$$ \bar{x}_{j} = u_{j} + l_{j} - x_{j} \quad j = 1,2 \, , \ldots ,D $$
(8)

In Eq. 8, x is a habitat in the search space and \( \bar{x} \) is its opposite, generated within the interval [l, u]. The opposite number is calculated for each dimension, indexed by j, so that \( \bar{x} \) represents the opposite habitat in the D-dimensional space.
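Eq. 8 amounts to reflecting every coordinate across the centre of its bounds; a small sketch (with illustrative bounds) is:

```python
import numpy as np

def opposite_habitat(x, lower, upper):
    """Eq. 8: the opposite habitat x_bar_j = u_j + l_j - x_j for every dimension j."""
    return upper + lower - x

x = np.array([0.2, 0.7, 0.9])
print(opposite_habitat(x, lower=np.zeros(3), upper=np.ones(3)))   # -> [0.8 0.3 0.1]
```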

3.5 Disruption operator

The disruption operator (Dop) is implemented to enhance the population diversity. This phenomenon was inspired by astrophysics. Harwit observed that when a set of gravitationally bound particles (with total mass m) comes very close to a much more massive object (with mass M), the set is torn apart; the same happens to a solid body held together by gravitational forces when it approaches a much more massive object.

This operator is used so that the CCOALFDO algorithm searches the problem space more thoroughly and maintains the balance between the exploration and exploitation processes. Dop is formulated in Eq. 9:

$$ D_{\text{op}} = \left\{ {\begin{array}{*{20}c} {D_{{is_{i,j} }} \times \delta \left( { - 2,2} \right)} & { {\text{if}} D_{{is_{{i,{\text{best}}}} }} \ge 1} \\ {1 + D_{{is_{{i,{\text{best}}}} }} \times \delta \left( {\frac{{ - 10^{ - 4} }}{2},\frac{{10^{ - 4} }}{2}} \right) } & {\text{otherwise}} \\ \end{array} } \right. . $$
(9)

In Eq. 9, \( D_{{is_{i,j} }} \) represents the Euclidean distance between the ith solution and its nearest neighboring jth solution, and \( D_{{is_{{i,{\text{best}}}} }} \) is the Euclidean distance between the ith solution and the best solution. Also, \( \delta \left( {a,b} \right) \) is a random number generated within the interval [a, b] (Neggaz et al. 2020).
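A hedged sketch of Eq. 9, where dist_nearest and dist_best are the two Euclidean distances defined above and the uniform draws play the role of δ(a, b):

```python
import numpy as np

def disruption_factor(dist_nearest, dist_best, rng=None):
    """Eq. 9: strong scatter when the best solution is far, tiny perturbation when near."""
    rng = np.random.default_rng() if rng is None else rng
    if dist_best >= 1:
        return dist_nearest * rng.uniform(-2, 2)
    return 1 + dist_best * rng.uniform(-1e-4 / 2, 1e-4 / 2)
```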

4 Proposed chaotic cuckoo optimization algorithm with levy flight and disruption operator for feature selection

In this section, the proposed wrapper-based chaotic cuckoo optimization algorithm with levy flight and disruption operator (CCOALFDO) is introduced as a feature selection method. In the basic cuckoo optimization algorithm, the cuckoos move in the search space and can modify their positions to any point in the space, i.e., the continuous space. Given the nature of feature selection problems, the solutions are limited to the binary values {0, 1}.

With Npop cuckoos, one can build an Npop × Nvar matrix. In the basic algorithm, the cuckoos select their first egg-laying habitats randomly; in the proposed algorithm, these habitats are built with a chaotic function, which speeds up the convergence of the algorithm.

Searching for the optimal solution in a chaotic way among the local minima, using ergodicity, regularity and semi-stochastic properties, leads with high probability to the global optimal solution. The chaotic variables used in the CCOALFDO algorithm are generated by the Gauss/mouse map described in Table 1. In general, the chaotic variables are assumed to be generated by an arbitrary one-dimensional map whose values either already lie in the range (0, 1) or are scaled into this range.

The initial value of the Gauss/mouse chaotic function is a random number, and the subsequent values of the sequence are calculated by Eq. 10:

$$ \begin{aligned} & x_{0} = {\text{rand}} \\ & x_{i + 1} = \left\{ {\begin{array}{*{20}c} 1 & {x_{i} = 0} \\ {\frac{1}{{\bmod \left( {x_{i} ,1} \right)}}} & {\text{otherwise}} \\ \end{array} } \right. \\ \end{aligned} $$
(10)
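A sketch of the chaotic initialisation is shown below. Eq. 10 is implemented here as the fractional part of 1/x (the conventional Gauss/mouse form, which keeps the sequence inside (0, 1)); this reading, and the Npop × Nvar matrix shape, are interpretations of the text rather than the authors' exact code:

```python
import numpy as np

def gauss_mouse_sequence(length, rng=None):
    """x_0 = rand; x_{i+1} = 1 if x_i = 0, else frac(1/x_i)  (interpretation of Eq. 10)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.random()
    seq = np.empty(length)
    for i in range(length):
        x = 1.0 if x == 0 else (1.0 / x) % 1.0
        seq[i] = x
    return seq

# chaotic starting positions for Npop cuckoos over Nvar features (illustrative sizes)
n_pop, n_var = 50, 100
init_positions = gauss_mouse_sequence(n_pop * n_var).reshape(n_pop, n_var)
```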

In order to find a habitat for the next generation, cuckoos search the surrounding area for the best one.

Although many different evolutionary variants exist in the literature, the cuckoo optimization algorithm still suffers from premature convergence and from generating inefficient results. The levy flight method is used to solve these problems and enable the CCOALFDO algorithm to generate more efficient results, while the disruption operator preserves the population diversity of the habitats. In this way, the COA, which by itself cannot perform a global search well, searches more effectively and is not trapped in local minima. Thus, random walks drawn from the levy stable distribution are used in the proposed method to define the steps. The step size is calculated by Eq. 11, where the factor 0.01 is a typical length scale used to avoid exaggerating the step size and making the solutions jump out of the design domain.

$$ {\text{Step}}\;{\text{size}} = 0.01 \times S $$
(11)

Eq. 12 presents the global random search performed with the concept of levy flights, in which \( x_{i}^{\left( t \right)} \) is the current state of cuckoo i in its search for a habitat after t iterations. A levy flight is applied to generate the new state \( x_{i}^{{\left( {t + 1} \right)}} \) from \( x_{i}^{\left( t \right)} \), as given in Eq. 12, where the ELR is described in Eq. 3.

$$ x_{\text{i}}^{{\left( {{\text{t}} + 1} \right)}} = x_{\text{i}}^{{\left( {\text{t}} \right)}} + {\text{ELR}} \times {\text{levy}}\left( {\text{s}} \right) $$
(12)

For every habitat, the OBL strategy in D-dimensional space is calculated as in Eq. 8, which doubles the number of habitats. The disruption operator is then applied to each habitat and its opposite; as mentioned earlier, Dop is calculated by Eq. 9. This ensures that the answers are not too close together and keeps the balance between the exploration and exploitation processes. The inclusion of OBL and Dop increases the diversity of the habitats.

Since the continuous feature space must be transformed into the corresponding binary space, the new position of a habitat is obtained from the following formula, in which \( y^{j,t+1} \) refers to the cuckoo with index j after t iterations:

$$ y^{{{\text{j}},{\text{t}} + 1}} = {\text{f}}\left( x \right) = \left\{ {\begin{array}{*{20}c} 1 & {{\text{if}}\left( {w\left( {x^{{{\text{j}},{\text{t}} + 1}} } \right)} \right) \ge {\text{rand}}} \\ 0 & { {\text{otherwise}}} \\ \end{array} } \right. $$
(13)

\( w\left( {x^{j,t + 1} } \right) \) is calculated by Eq. 14.

$$ w\left( a \right) = \frac{1}{{1 + {\text{e}}^{{10\left( {{\text{a}} - 0.5} \right)}} }} $$
(14)

To solve feature selection problems, the continuous (free) positions must be transformed into their corresponding binary solutions.
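A compact sketch of this binarisation (Eqs. 13–14): each continuous coordinate is squashed by w(a) = 1/(1 + e^{10(a − 0.5)}) and compared with a uniform random number:

```python
import numpy as np

def binarize(position, rng=None):
    """Eqs. 13-14: map a continuous habitat to a 0/1 feature-selection vector."""
    rng = np.random.default_rng() if rng is None else rng
    w = 1.0 / (1.0 + np.exp(10.0 * (position - 0.5)))     # Eq. 14
    return (w >= rng.random(position.shape)).astype(int)  # Eq. 13

print(binarize(np.array([0.1, 0.9, 0.5, 0.3])))
```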

For example, if the value of a bit equals 1, its corresponding feature is selected in the feature subset, while 0 indicates that it is not selected (Fig. 1); in the example of Fig. 1, the second and fourth features are selected for the feature subset. The fitness function is described in Eq. 15:

$$ {\text{fitness}}_{t} = { \hbox{max} }\left( {\left( {1 - b} \right) \times {\text{ACC}} + b \times \left( {1 - \frac{m}{n}} \right)} \right) $$
(15)

In Eq. 15, n is the total number of features, m is the length of the selected feature subset, and b is a parameter that balances the classification accuracy against the feature reduction; it can take values in \( \left[ {0,1} \right] \) and is set to b = 0.2 in the experiments. The ACC parameter is the classification accuracy rate obtained by calling a fast classifier such as KNN to evaluate the feature subsets; the accuracy is based on the Euclidean distance measure and the K parameter equals 5. ACC is described in Eq. 16:

Fig. 1
figure 1

Example of selected features in each sample

$$ {\text{ACC}} = {\text{correct}}\;{\text{classification}}\;{\text{rate}}\;{\text{with}}\;K\;{\text{Neighbors}}\;{\text{Classifier}}\;\left( {N\_{\text{neighbors}} = 5} \right) $$
(16)
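A hedged sketch of the fitness evaluation in Eqs. 15–16 using scikit-learn's 5-nearest-neighbour classifier with the Euclidean metric; b = 0.2 follows the text, while the train/validation arrays and all names are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X_train, y_train, X_val, y_val, b=0.2):
    """Eq. 15: trade KNN accuracy (Eq. 16) against the selected-feature ratio m/n."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:                      # an empty subset cannot be evaluated
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
    knn.fit(X_train[:, selected], y_train)
    acc = knn.score(X_val[:, selected], y_val)  # ACC of Eq. 16
    m, n = selected.size, mask.size
    return (1 - b) * acc + b * (1 - m / n)
```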

During every iteration, fitness values are assigned to the cuckoos' habitats in the search space. These positions are evaluated in each iteration, and the best position is selected as the best solution.

Figure 2 shows the framework of the CCOALFDO algorithm for problem optimization. In general, the proposed algorithm improves the behavior of the cuckoo optimization algorithm by using the chaotic map, levy flight, opposition-based learning and the disruption operator. Each of these operators has its own task. The chaotic map improves the starting positions in the first step. The levy flight operation drives the updating mechanism, since the cuckoo moves have the largest effect on the convergence of solutions toward the global solution. Opposition-based learning is an intelligent way of searching the opposite side of the space and enhancing the exploration power, whereas the disruption operator enhances the diversification of the whole population, which improves the exploration ability of CCOALFDO and the convergence to the optimal solution. Finally, a fast classifier such as KNN is called to evaluate each candidate solution and generate its fitness value based on the reduced dataset.

Fig. 2
figure 2

The framework of the proposed CCOALFDO method

In the CCOALFDO algorithm, two criteria are used to terminate the search:

  1. the maximum iteration (or generation) number;

  2. no significant improvement in the accuracy rate during the last 20 iterations.

figure a

The pseudo-code of CCOALFDO is shown in Algorithm 1. The algorithm starts by generating a random population of solutions with the chaotic formula; then the iterations start. The basic task in each iteration is the determination of the best solution and of the coefficients for the next step. In the following phase, the solutions in the population are updated using levy flight, and the opposite position of each nest is calculated as a new nest. The Dop formula is then applied to all habitats and their opposites to place them in more suitable positions. This process is repeated until a satisfying stopping condition is reached. Finally, the algorithm returns the best solution.
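To make the flow of Algorithm 1 concrete, the fragment below sketches one possible main loop; it assumes the helper sketches from the previous sections (gauss_mouse_sequence, levy_step, egg_laying_radius, opposite_habitat, disruption_factor, binarize, fitness) are in scope, and the control parameters, equal egg shares and elitist survivor selection are illustrative simplifications rather than the authors' exact settings:

```python
import numpy as np

def ccoalfdo(X_train, y_train, X_val, y_val,
             n_pop=50, max_iter=100, alpha=2.0, patience=20, rng=None):
    """Illustrative CCOALFDO loop built from the hedged helper sketches above."""
    rng = np.random.default_rng() if rng is None else rng
    n_var = X_train.shape[1]

    # 1. chaotic initialisation of habitats in [0, 1]^n_var (Eq. 10)
    pop = gauss_mouse_sequence(n_pop * n_var, rng).reshape(n_pop, n_var)

    def evaluate(h):
        return fitness(binarize(h, rng), X_train, y_train, X_val, y_val)

    fit = np.array([evaluate(h) for h in pop])
    best = pop[int(np.argmax(fit))].copy()
    best_fit, stall = fit.max(), 0

    for _ in range(max_iter):
        # 2. levy-flight move (Eqs. 3, 11, 12); equal egg shares assumed
        elr = egg_laying_radius(alpha, 1, n_pop, 1.0, 0.0)
        steps = np.array([levy_step(n_var, rng=rng) for _ in range(n_pop)])
        pop = np.clip(pop + elr * 0.01 * steps, 0.0, 1.0)

        # 3. opposition-based learning doubles the population (Eq. 8)
        pop = np.vstack([pop, opposite_habitat(pop, 0.0, 1.0)])

        # 4. disruption operator perturbs every habitat (Eq. 9)
        for i in range(len(pop)):
            d_best = np.linalg.norm(pop[i] - best)
            d_near = np.linalg.norm(np.delete(pop, i, axis=0) - pop[i], axis=1).min()
            pop[i] = np.clip(pop[i] * disruption_factor(d_near, d_best, rng), 0.0, 1.0)

        # 5. keep the n_pop fittest habitats and track the best solution
        fit = np.array([evaluate(h) for h in pop])
        keep = np.argsort(fit)[-n_pop:]
        pop, fit = pop[keep], fit[keep]
        if fit.max() > best_fit:
            best, best_fit, stall = pop[int(np.argmax(fit))].copy(), fit.max(), 0
        else:
            stall += 1
            if stall >= patience:        # second stopping criterion
                break
    return binarize(best, rng), best_fit
```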

The computational complexity of the proposed algorithm depends on several components, namely the cuckoo optimization algorithm (COA), the chaotic map (CM), the levy flight (LF) and the disruption operator (DO). Therefore, the complexity of the CCOALFDO algorithm is calculated by Eq. 17.

$$ \theta ({\text{CCOALFDO}}) = K_{s} \theta \left( {\text{COA}} \right) + N \times \theta \left( {\text{CM}} \right) + \left( {N - k_{s} } \right)\theta \left( {\text{LF}} \right) + \theta \left( {\text{DO}} \right) $$
(17)

where \( \theta \left( {\text{COA}} \right) \) is calculated by Eq. 18.

$$ \theta \left( {\text{COA}} \right) = \theta \left( {t\left( {{\text{Dim}} \times N + C \times N + N{ \log }N} \right)} \right) $$
(18)

and \( \theta \left( {\text{CM}} \right) \) is calculated by Eq. 19.

$$ \theta \left( {\text{CM}} \right) = \theta \left( {t \times N} \right) $$
(19)

and \( \theta \left( {\text{LF}} \right) \) is described as Eq. 20.

$$ \theta \left( {\text{LF}} \right) = \theta \left( {t\left( {{\text{Dim}} \times N + C \times N} \right)} \right) $$
(20)

And finally, \( \theta \left( {\text{DO}} \right) \) is calculated by Eq. 21.

$$ \theta \left( {\text{DO}} \right) = \theta \left( {t \times N} \right) $$
(21)

In Eqs. 17 to 21, t represents the number of iterations, Dim is the number of features, N represents the number of feature subsets, Ks represents the number of habitats updated in every iteration of the CCOALFDO algorithm, and C is the cost of the objective function.

5 Experimental results

In this section, we perform experiments on various high-dimensional datasets to compare the proposed CCOALFDO method with five other feature selection methods. First, in Sect. 5.1, the applied datasets are introduced concisely. Then, in Sect. 5.2, the experimental setup and parameters are discussed. Finally, in Sect. 5.3, the performance of the proposed method is evaluated against other relevant methods.

5.1 Datasets

The results presented in the experiments are based on 15 real-world datasets from the UCI machine learning repository (Lichman 2013) and 5 medical datasets (Statnikov et al. 2005), which are listed in Table 2 and cover different sizes and dimensions. Table 2 gives the number of samples, features and classes for each dataset. In all datasets, missing values are replaced with the median of the corresponding feature.

Table 2 Experimental datasets (Lichman 2013; Statnikov et al. 2005)

5.2 Experimental setup and parameter settings

The experiments were conducted on an Intel Core i3 platform with 4 GB RAM and a 2.13 GHz CPU. The programming environment is MATLAB R2014a under the Windows 10 operating system.

To compare against the CCOALFDO algorithm, we used several other evolutionary algorithms: the whale optimization algorithm (WOA) (Mafarja and Mirjalili 2018), the gray wolf optimizer (GWO) (Emary et al. 2016), particle swarm optimization (PSO) (Eberchart and Kennedy 1995), the crow search algorithm (CSA) (Sayed et al. 2019) and the cuckoo optimization algorithm (COA) (Rajabioun 2011). Table 3 shows the parameter settings of the compared algorithms.

Table 3 State-of-the-art metaheuristic algorithm parameter settings

In k-fold cross-validation, K − 1 folds are used for training and validation, while the last fold is used for testing. The performance of the proposed method is confirmed by applying fivefold cross-validation, which splits the dataset into five equal portions.

The training part is used to train the applied classifier during the optimization and at the final evaluation, the validation part assesses the performance of the classifier during training, and the testing part evaluates the finally selected features with the trained classifier. Based on fivefold cross-validation, 80% of the training partition is used for training, while 20% is used for fitness-function validation.
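A small sketch of this protocol with scikit-learn: stratified fivefold cross-validation holds one fold out for testing, and the remaining training partition is split 80/20 into training and fitness-validation parts (the data arrays here are placeholders, not a paper dataset):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(200, 50)                 # placeholder data
y = np.random.randint(0, 2, 200)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_trval, X_test = X[train_idx], X[test_idx]
    y_trval, y_test = y[train_idx], y[test_idx]
    # 80% of the training partition trains the classifier, 20% validates the fitness
    X_train, X_val, y_train, y_val = train_test_split(
        X_trval, y_trval, test_size=0.2, stratify=y_trval, random_state=0)
```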

5.3 Result interpretation

In this section, we report and analyze the experimental results of the proposed CCOALFDO method and the compared feature selection methods. To provide a fair comparison, each method produces 10 solutions and its average percentages are reported against the other methods. All experiments use the same parameters for initialization and evolution. For the CCOALFDO algorithm, the population size is 50 and the number of iterations is 100. It is noteworthy that all experiments were repeated for 30 independent runs to obtain meaningful results.

The accuracy rate (ACC) is obtained by using the KNN algorithm for classification and dividing the number of correctly classified samples by the overall number of samples. In the KNN algorithm, the K parameter equals 5. The optimal result is the one with the maximum accuracy rate and the minimum number of features.

Table 4 presents the results of the proposed algorithm on each dataset. SF is the percentage of selected features, and ACC is the accuracy rate in percent obtained with threefold cross-validation.

Table 4 Experiments with chaotic maps on different datasets

According to Table 4, the Gauss/mouse map obtains similar or even higher accuracy rates than the other maps on all datasets, except for insignificantly lower accuracy rates on the D3, D5, D7, D8, D13 and D19 datasets. The results are also shown graphically in Fig. 3, where the red bar is usually higher than the other colors.

Fig. 3
figure 3

Comparison of the classification accuracy rate of 10 chaotic maps in 20 datasets

The Gauss/mouse map results of CCOALFDO are not much higher than those of the other maps; nevertheless, this map is used when comparing the CCOALFDO algorithm with the other five algorithms on each dataset. Figure 4 depicts the convergence of the proposed method with the Gauss/mouse map. As seen in Fig. 4, larger datasets such as D2, D6, D12 and D19 need more than 100 iterations to increase the accuracy rate, and with more iterations they might reach a better accuracy rate. For the smallest datasets, such as D7, D9 and D10, more iterations are not needed, since the curves become flat after about 50 iterations.

Fig. 4
figure 4

Convergence curve for CCOALFDO with Gauss/mouse map on 20 datasets

Table 5 presents the accuracy and selected-feature rate results. Because the random behavior of the cuckoos is limited, the proposed algorithm converges faster on most datasets.

Table 5 Accuracy and selected feature rate comparing result

According to Table 6, considering the average scores over the datasets, the CCOALFDO algorithm shows the best overall performance in accuracy rate. The results are shown graphically in Fig. 5. Comparing the accuracy rates of the different algorithms, the proposed algorithm is more accurate than the other algorithms on most datasets. In terms of ACC, the proposed CCOALFDO method achieves the best performance on the D2, D3, D4, D5, D6, D8, D11, D12, D13, D15, D16, D17, D18, D19 and D20 datasets. On the remaining datasets it does not achieve the best accuracy rate, which may be due to their large number of classes. Figure 5 shows the boxplots of the classification accuracy rate for CCOALFDO compared with the 5 other evolutionary algorithms, namely WOA, GWO, PSO, CSA and COA, which were implemented and tested on the twenty datasets in the same environment.

Table 6 Accuracy rate and standard deviation comparing result
Fig. 5
figure 5

Accuracy rate classification comparison between the CCOALFDO algorithm and five other evolutionary algorithms

The Wilcoxon statistical test is also used to examine the proposed method more precisely. It is a non-parametric statistical test applied to determine the statistical significance of differences between algorithms (Statnikov et al. 2005). The Wilcoxon test is conducted at the 5% significance level to verify whether there is a statistical difference between the obtained accuracy rates; in other words, a p value < 0.05 indicates that the proposed approach differs significantly.
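For instance, two vectors of per-run accuracy rates can be compared with SciPy's Wilcoxon rank-sum test (the arrays below are placeholders, not the reported results):

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
acc_ccoalfdo = rng.normal(0.90, 0.01, 30)   # 30 independent runs (placeholder values)
acc_baseline = rng.normal(0.87, 0.02, 30)

stat, p_value = ranksums(acc_ccoalfdo, acc_baseline)
print(f"p = {p_value:.4f}")                 # p < 0.05 -> statistically significant difference
```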

According to the Wilcoxon test results, the outcome of the proposed method is meaningfully distinguishable from the others: considering accuracy, the CCOALFDO method performs significantly better than the competing methods.

Table 7 compares the results of CCOALFDO with WOA, GWO, PSO, CSA and COA, where bold values represent p values ≥ 0.05. According to the results, the proposed algorithm shows significant improvements over the other methods.

Table 7 p-value of the Wilcoxon rank sum test for the classification accuracy rate based on CCOALFDO and five other evolutionary algorithms

The comparison between the accuracy rate of CCOALFDO and the five other evolutionary algorithms is presented in Figs. 5 and 6. These boxplots report the accuracy rates of 30 independent runs for CCOALFDO and the five other evolutionary algorithms. The boxplots demonstrate the superiority of the proposed algorithm over the other algorithms: it produces more compact boxes with a higher median than all of them.

Fig. 6
figure 6

Boxplots of CCOALFDO compared to 5 other evolutionary algorithms based on accuracy rate classification

As Table 8 and Fig. 7 show, CCOALFDO is the second-best method with respect to the number of selected features. On the D2, D6, D12, D15, D17, D19 and D20 datasets, it obtains a significantly better SF rank while also obtaining the best ACC rank in Table 6. The results show that the proposed method reduces features better on larger datasets. To evaluate the quality of each algorithm, the standard deviation is also measured. CCOALFDO achieves the smallest standard deviation on most datasets, which demonstrates the robustness of the algorithm and establishes it as the most stable method in terms of feature selection quality.

Table 8 Selected features rate comparing results
Fig. 7
figure 7

Comparison selected feature rates of the CCOALFDO algorithm with five other evolutionary algorithms

Considering the p-values in Table 9, the proposed CCOALFDO significantly outperforms the others in most cases. The Wilcoxon test shows the superiority of the proposed method over the WOA, PSO, CSA and COA algorithms. The results also indicate that the GWO algorithm obtained higher p-values than the others and achieved a better rank than the CCOALFDO algorithm in feature reduction; however, its accuracy rate was lower.

Table 9 p-value of the Wilcoxon rank sum test for the selected features rate based on CCOALFDO and five other evolutionary algorithms

According to Tables 6 and 8, the superiority of the proposed method over the others in terms of classification accuracy rate and feature reduction rate is clear. On all 20 real-world datasets, the CCOALFDO algorithm achieved the best rank in accuracy rate. Several factors create this advantage. The first is the chaotic function, which enhances the initial positions of the cuckoos, helps maintain the population, and improves the search ability of the levy flight for finding the best solution. The second is the DO operator, which avoids getting trapped in local optima. Finally, OBL improves the exploration ability, which guarantees the diversity of the population. As a result, the convergence performance improves, the chance of being trapped in local minima decreases, and the quality of the selected features is guaranteed.

Table 10 lists the average CPU time, measured in seconds per run, obtained with the original feature set and with the selected feature set. In all cases, the CCOALFDO algorithm deletes the redundant and noisy features, so the classifier achieves better time performance while keeping better classification performance.

Table 10 Comparing runtime and accuracy rates between selected feature set and original feature set

Table 11 indicates the average CPU time, measured in seconds per run, taken by the CCOALFDO algorithm and the five other evolutionary algorithms. This measure indicates the speed of the algorithms in selecting features from a given dataset. Here, GWO is the fastest algorithm, while the proposed algorithm ranks fourth. Because of the additional search steps in the CCOALFDO algorithm, it needs more computation time, but this is still reasonable given its high accuracy rate and good feature selection rate. Although the CCOALFDO algorithm needs further improvement regarding time complexity, this can be addressed by applying opposition-based learning and the disruption operator to only a small part of the cuckoos or under a specific condition. Figure 8 presents these results graphically.

Table 11 Comparison between CCOALFDO and other metaheuristics based on time (s)
Fig. 8
figure 8

Comparison of the computation time of the CCOALFDO algorithm with five other evolutionary algorithms

6 Conclusion and future works

Feature selection is an ever-evolving frontier in data mining and machine learning: the more machine learning develops, the broader the scope of feature selection research becomes. In this study, a new feature selection algorithm based on the cuckoo optimization algorithm is presented. The proposed CCOALFDO algorithm turns the random variables of the optimizer into chaotic behavior. Moreover, levy flight is used to handle uncertainty and update the cuckoos' steps in high-dimensional space. In addition, the opposition-based learning technique and the disruption operator are used to improve the exploration ability and ensure the diversity of the population, thereby speeding up the convergence rate. The proposed method was evaluated on 20 datasets, and the results were compared with the WOA, GWO, PSO, CSA and COA algorithms. The results reveal that the presented method is able to exploit the information structure of a large number of features and accordingly enhance the accuracy rate of classification tasks. By controlling the random behavior of the algorithm, a better feature set is selected for each dataset, and the classification accuracy rate improves because redundant features are eliminated. In other words, the CCOALFDO algorithm seeks a compromise between two objectives: maximizing the classification accuracy rate and reducing the number of features. The classification time is also reduced after decreasing the number of features and choosing a suitable feature set, while providing a more accurate classification. Unlike other evolutionary algorithms, applying levy flight, opposition-based learning and the disruption operator to the whole population increases the solution diversity; however, it also increases the computational time, an issue that can be handled in future work. Future studies can also focus on other measures, such as precision or G-mean, on the use of CCOALFDO for feature selection in online applications, and on evaluating CCOALFDO on further real-world problems to confirm its stability.