1 Introduction

Datasets in pattern recognition and classification applications regularly contain an enormous number of features. The primary goal of feature selection is to extract a subset of relevant features from the huge number of available features and thereby better understand the underlying process that generated the data [1]. A well-selected feature set improves classifier performance and provides faster, more cost-effective classification, often achieving comparable or even better classification performance than using the full feature set [2].

Feature selection is recognized as a multi-objective problem: minimize the size of the selected feature set and maximize the classification performance. The two objectives contradict each other, so an optimal solution must be found as a tradeoff between them. Feature selection methods split into two primary approaches: filter-based and wrapper-based [3]. Filter-based methods do not depend on a machine learning technique and are considered computationally inexpensive. Furthermore, filter-based methods examine the search space for a feature set that optimizes data-dependent criteria rather than classification-dependent criteria [4].

Wrapper-based methods include a machine learning technique as part of the evaluation function, which allows them to obtain better results than filter-based methods [5]. The size of the search space grows exponentially with the number of features in a given dataset [1]. Hence, in practice, exhaustive search methods cannot reach the optimal solution in most cases. A variety of search methods has been employed to solve feature selection problems, such as greedy search based on sequential forward selection (SFS) [6] and sequential backward selection (SBS) [7]. However, these classical optimization methods have restrictions in solving optimization problems, so evolutionary computation (EC) algorithms, with their global search ability, are an option for addressing these shortcomings [8, 9].

Various heuristic methods simulate the behavior of biological and physical systems in nature and have been proposed as powerful techniques for global optimization [10]. EC algorithms are motivated by nature and by the social and biological behavior of animals (birds, fish, bats, fireflies, wolves, spiders, antlions, etc.) in a pack [11]. Many researchers have suggested computational methods that imitate the behavior of these species to search for optimal solutions [12,13,14,15]. EC algorithms adaptively search the space for an optimal subset using a set of search agents that communicate in a social manner to reach a global solution in a minimal number of evaluations [16].

In EC algorithms, a proper balance between exploration and exploitation is necessary [17]. Genetic algorithms (GA) and particle swarm optimization (PSO) are the most common evolutionary computation methods [18]. GA was the first evolutionary algorithm introduced in the literature, developed based on the natural process of evolution through reproduction [19]. A feature selection algorithm based on GA used a fuzzy set as its fitness function [20]. A hybrid wrapper and filter feature selection method with a memetic framework integrated filter ranking with GA to maximize classification accuracy while selecting feature subsets [21]. However, PSO with the same fitness function achieved better performance than GA in [22]. A feature selection strategy combining rough sets and PSO was introduced in [23]: the rough set is applied as the feature selection criterion, while PSO discovers the optimal feature subset.

An ant colony optimization (ACO) wrapper-based feature selection algorithm was applied to network intrusion detection, using the Fisher discrimination rate as heuristic information [24]. In the artificial bee colony (ABC) algorithm, the several behaviors of the bees provide the opportunity to build a robust balance between exploration and exploitation [25]. In ABC, the employed bees try to find a food source and advertise it to the other bees [26]. The onlooker bees follow the employed bees they find promising, and the scout bees fly spontaneously to find the best food source [27]. A virtual bee algorithm (VBA) was applied to optimize 2-D numerical functions using a swarm of virtual bees that move randomly in the search space and interact to find food sources; the interactions between these virtual bees yield the possible solutions of the optimization problem [28]. Another approach based on the natural behavior of honeybees moves randomly generated worker bees in the direction of an elite bee, where the elite bee represents the optimal (or near-optimal) solution [29].

The artificial fish swarm (AFS) algorithm mimics fish behavior such as the reaction to stimuli by controlling the tail and fins. AFS is a robust stochastic technique based on fish movement and fish intelligence during the search for food [30]. The antlion optimization (ALO) algorithm [31, 32] is a relatively new EC method that is computationally less costly than other EC techniques. Simulated annealing (SA) has been employed to optimize feature subsets in marketing applications for designing large-scale linear regression models [33].

Lately, studies of the flight behavior of many animals and insects have shown that it exhibits the general characteristics of Lèvy flights [34]. Fruit flies explore their landscape using a series of straight flight paths punctuated by 90\({^{\circ }}\) turns, leading to a Lèvy-flight-style intermittent, scale-free search pattern. Accordingly, Lèvy flight behavior has been applied to optimization and optimal search, and preliminary results show its promising capability [35]. The Lèvy flight process represents an optimal random search pattern and is frequently observed in nature [36]. Broadly speaking, a Lèvy flight is a random walk whose step lengths are drawn from the Lèvy distribution [37].

The overall aim of this paper is to propose a variant of ALO that exploits Lèvy flight for performing local and global search. The proposed Lèvy flight version of ALO is applied to feature selection, selecting a minimal number of features while obtaining comparable or even better classification performance than using all features or conventional feature selection techniques. The rest of this paper is organized as follows: Sect. 2 provides background on the antlion optimization algorithm (ALO). Section 3 describes the proposed Lèvy antlion optimization (LALO) algorithm. Section 4 presents the application of LALO to multi-objective feature selection. The experimental results and discussion are presented in Sect. 5. Finally, conclusions and future work are provided in Sect. 6.

2 Antlion optimization (ALO)

Antlion optimization (ALO) is a recently proposed bio-inspired optimization method suggested by Mirjalili [32]. The ALO algorithm mimics the hunting behavior of antlions in nature. The following two subsections discuss the inspiration for and the operators of the artificial algorithm [38].

Antlions (doodlebugs) belong to the Myrmeleontidae family and the Neuroptera order [32]. An antlion larva digs a cone-shaped pit in the sand. Once caught, the prey tries to escape from the trap while the antlion tries to catch it; the antlion intelligently throws sand toward the edge of the pit to slide the victim down to its bottom. After consuming the prey, the antlion prepares the pit for the next hunt. The trap size of an antlion is affected by its level of hunger and by the phase of the moon: antlions tend to dig larger traps as they become hungrier and when the moon is full. They have evolved and adapted this behavior to increase their chance of survival [39].

Based on the above description, ALO is composed of two sets of agents, namely ants and antlions. Antlions are the hunters, selected as the fittest agents, and never change their location unless they replace a given ant. Ants perform random walks in the search space and may be consumed by an antlion if that antlion traps them. The ant's position is updated according to Eq. (1):

$$\begin{aligned} \mathrm{Ant}_i^t=\frac{R_A^t+R_E^t}{2}, \end{aligned}$$
(1)

where \(R_A^t\) is the position of the randomly walking ant indexed i, called \(\mathrm{Ant}_i\), around a roulette-wheel-selected antlion indexed A, and \(R_E^t\) is the position of the same ant walking randomly around the elite antlion indexed E. The antlions' fitness is used by the roulette wheel operator to select the antlion with index A around which the ant performs its random walk, while the elite antlion is established as the fittest antlion, with index E.

The random walk of an \(\mathrm{Ant}_i^t\) around a given \(\mathrm{Antlion}_j^t\) is formulated as in Eq. (2):

$$\begin{aligned} R_j^t=\frac{(X_i-a_i)\times (d_i-c_i^t)}{(b_i^t-a_i)}+c_i, \end{aligned}$$
(2)

where \(R_j^t\) is the position of ant i after performing a random walk around antlion j at iteration t, \(a_i\) and \(b_i^t\) are the minimum and maximum values of the random walk \(X_i^t\) in the i-th dimension, \(X_i\) is defined in Eq. (3), and \(c_i^t\), \(d_i\) are the lower and upper bounds of the random walk. Equation (2) is used directly to define \(R_A^t\) and \(R_E^t\) of Eq. (1).

$$\begin{aligned} X(t) = [0,\; \mathrm{cumsum}(2r(t_1)-1);\; \mathrm{cumsum}(2r(t_2)-1);\; \ldots ;\; \mathrm{cumsum}(2r(t_T)-1)], \end{aligned}$$
(3)

where cumsum calculates the cumulative sum, modeling the accumulation of successive random steps that form a random walk up to time t, and r(t) is a stochastic function defined as in Eq. (4):

$$\begin{aligned} r(t)={\left\{ \begin{array}{ll} 1 &{} \text{ if } \mathrm{rand}>0.5\\ 0 &{} \text{ if } \mathrm{rand}\le 0.5, \end{array}\right. } \end{aligned}$$
(4)

where rand is a random number drawn from a uniform distribution. The parameters c and d are adapted according to Eqs. (5) and (6) to limit the range of the random walk around the given antlion.

$$\begin{aligned} c_i^t= {\left\{ \begin{array}{ll} \frac{lb}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{if}\; \mathrm{rand}<0.5\\ \frac{-\,lb}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{otherwise},\\ \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} d_i^t= {\left\{ \begin{array}{ll} \frac{ub}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{if}\; \mathrm{rand}>0.5\\ \frac{-\,ub}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{otherwise},\\ \end{array}\right. } \end{aligned}$$
(6)

where lb and ub are the lower and upper limits for dimension i, and I is a parameter that controls the exploration/exploitation rate, defined as in Eq. (7):

$$\begin{aligned} I=10^w\frac{t}{T}, \end{aligned}$$
(7)

where T is the maximum number of iterations and w is a constant that depends on the current iteration (w = 2 when \(t > 0.1T\), w = 3 when \(t > 0.5T\), w = 4 when \(t > 0.75T\), w = 5 when \(t > 0.9T\), and w = 6 when \(t > 0.95T\)). The constant w adjusts the level of exploitation accuracy.

Finally, a selection operation is applied in which an antlion is replaced by an ant if the ant becomes fitter. According to the above formulation, antlion optimization can be described as in Algorithm 1 [40].

Algorithm 1 Antlion optimization (ALO)
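For concreteness, the following is a minimal sketch of the ALO position update (Eqs. 1-7) in Python. It is written for this text rather than taken from the published implementation; the helper names and the value of w for \(t \le 0.1T\) (not specified above) are assumptions.

```python
import numpy as np

def random_walk(T):
    # Eqs. (3)-(4): cumulative sum of +/-1 steps from a fair coin flip
    steps = np.where(np.random.rand(T) > 0.5, 1.0, -1.0)
    return np.concatenate(([0.0], np.cumsum(steps)))

def walk_around(antlion, t, T, lb, ub):
    # Eq. (7): w grows with the iteration count; t runs from 1 to T,
    # and w = 1 for t <= 0.1T is assumed here
    w = 1
    for frac, wv in ((0.1, 2), (0.5, 3), (0.75, 4), (0.9, 5), (0.95, 6)):
        if t > frac * T:
            w = wv
    I = 10 ** w * t / T
    # Eqs. (5)-(6): shrink the walk bounds around the chosen antlion
    c = antlion + (lb / I if np.random.rand() < 0.5 else -lb / I)
    d = antlion + (ub / I if np.random.rand() > 0.5 else -ub / I)
    pos = np.empty_like(antlion)
    for i in range(antlion.size):
        X = random_walk(T)                   # one walk per dimension
        a, b = X.min(), X.max()
        # Eq. (2): min-max normalize step t of the walk into [c_i, d_i]
        pos[i] = (X[t] - a) * (d[i] - c[i]) / (b - a + 1e-12) + c[i]
    return pos

def update_ant(roulette_antlion, elite_antlion, t, T, lb, ub):
    # Eq. (1): the new ant position is the mean of a walk around a
    # roulette-wheel-selected antlion and a walk around the elite antlion
    R_A = walk_around(roulette_antlion, t, T, lb, ub)
    R_E = walk_around(elite_antlion, t, T, lb, ub)
    return (R_A + R_E) / 2.0
```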

ALO provides very competitive results in terms of improved exploration, local optima avoidance, exploitation, and convergence on diverse shapes of optimization functions and has hence been used to solve different problems in different domains [32].

3 The proposed Lèvy antlion optimization (LALO)

Randomization plays a significant role in both exploration and exploitation (diversification and intensification), and the essence of such randomization is the random walk [41]. A random walk is a random process consisting of a series of consecutive random steps, and may be written as the sum of those steps as in Eq. (8):

$$\begin{aligned} S_N=\sum _{i=1}^NX_i, \end{aligned}$$
(8)

where \(X_i\) is a random step drawn from a random distribution.

If the step length obeys the Gaussian distribution, the random walk becomes Brownian motion, as shown in Fig. 1 [41]. Far-field randomization should generate a fraction of solutions far enough from the current best solution; such far solutions help an optimizer escape from local optima and avoid stagnation [42]. A variant of Brownian motion whose steps are drawn from a Lèvy distribution is expected to achieve this end, since the Lèvy distribution is a heavy-tailed distribution with infinite variance.

Mathematically speaking, a simple version of Lèvy distribution can be defined as in Eq. (9):

$$\begin{aligned} L(s,\gamma , \mu )={\left\{ \begin{array}{ll} \sqrt{\frac{\gamma }{2\pi }}\exp (-\frac{\gamma }{2(s-\mu )})\frac{1}{(s-\mu )^{3/2}}, &{}\quad 0<\mu<s<\infty \\ 0 &{}\quad \mathrm{otherwise}\\ \end{array}\right. } \end{aligned}$$
(9)

where \(\mu\) is the minimum step size and \(\gamma\) is a scale parameter. As \(s\rightarrow \infty ,\) we obtain the simple form of the Lèvy distribution given in Eq. (10). Figure 2 shows a sample random walk using the Lèvy distribution.

$$\begin{aligned} L(s,\gamma , \mu )=\sqrt{\frac{\gamma }{2\pi }}\frac{1}{(s-\mu )^{3/2}} \end{aligned}$$
(10)

When the step length obeys the Lèvy distribution, such a random walk is called a Lèvy flight or Lèvy walk.

Fig. 1
Brownian motion with a Gaussian step-size distribution; the path of 50 steps starting at the origin; after [41]

Fig. 2
Lèvy flights over 50 consecutive steps starting at the origin; after [41]

Lèvy flights are more efficient than Brownian random walks in exploring unknown, large-scale search spaces, which can be observed from their large abrupt jumps. Mathematically, this can be interpreted by the fact that the variance of a Lèvy flight increases much faster than the linear growth of the Brownian random walk. These properties of the Lèvy flight random walk motivate exploiting it for the random walking of ants.

The updated version of ALO random walk based on Lèvy flights’ random walk is performed as in Eq. (11).

$$\begin{aligned} \mathrm{RW}_i=\mathrm{ant}_i^{t}+\omega \times \frac{u}{|v|^{\frac{1}{\beta }}}\times r\times \left[ \mathrm{antLion}_j^t-\mathrm{ant}_i^{t}\right] \end{aligned}$$
(11)

where \(\mathrm{RW}_i\) is the random walk of \(\mathrm{ant}_i\) around \(\mathrm{antLion}_j\) at time t, \(\omega\) is a scale parameter controlling the scale of the random walk, defined according to Eq. (13), v and r are random numbers drawn from the normal distribution N(0, 1), \(\beta\) is a constant controlling the skewness of the distribution, and u is set to \(r_2 \varphi\), with \(r_2\) a random number drawn from the normal distribution N(0, 1) and \(\varphi\) defined in Eq. (12). The above equation repositions an ant by a random walk with Lèvy steps.

$$\begin{aligned} \varphi =\left( \frac{\varGamma (1+\beta )\sin (\pi \beta /2)}{\varGamma \left( \frac{1+\beta }{2}\right) \beta \, 2^{\frac{\beta -1}{2}}}\right) ^{\frac{1}{\beta }} \end{aligned}$$
(12)

This implementation of the Lèvy distribution is based on the Mantegna algorithm for a symmetric Lèvy stable distribution [43], which is regarded as one of the simplest ways to generate Lèvy-stable random numbers.
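As an illustration, a minimal sketch of the Mantegna step generator follows (Eq. 12 together with the factor \(u/|v|^{1/\beta }\) of Eq. 11); the function name and the example value \(\beta = 1.5\) are illustrative assumptions.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(beta=1.5, size=1):
    # Eq. (12): scale factor phi applied to the Gaussian numerator sample
    phi = (gamma(1 + beta) * sin(pi * beta / 2)
           / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, 1.0, size) * phi   # u = r2 * phi, r2 ~ N(0, 1)
    v = np.random.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)           # heavy-tailed Lèvy-stable steps
```

Most of the generated steps are small, but occasional very large jumps occur; this heavy-tail behavior is what allows an ant to leave the neighborhood of its antlion.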

$$\begin{aligned} \omega =1-\frac{t}{T} \end{aligned}$$
(13)

where t and T are the current iteration and the maximum number of allowed iterations, respectively. This equation makes the optimizer more diverse at the beginning of the optimization and then narrows the scale to allow for exploitation in the final stages.

Based on the updated random walk formulation above, the position update for \(\mathrm{ant}_i^t\) is formulated as in Eq. (14).

$$\begin{aligned} \mathrm{ant}_i^t=\frac{\mathrm{RW}_A^t+\mathrm{RW}_E^t}{2}, \end{aligned}$$
(14)

where \(\mathrm{RW}_A^t\) is the random walk of \(\mathrm{ant}_i\) around a roulette-wheel-selected \(\mathrm{antLion}_j\), obtained by applying Eq. (11), and \(\mathrm{RW}_E^t\) is the random walk of \(\mathrm{ant}_i\) around the elite antlion \(\mathrm{antLion}_k\), also calculated using Eq. (11).
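A minimal sketch of the resulting LALO position update (Eqs. 11, 13, and 14) is given below; `levy_step` is the Mantegna sampler sketched earlier, repeated so the block is self-contained.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(beta, size):
    # Mantegna sampler, as sketched above (Eq. 12)
    phi = (gamma(1 + beta) * sin(pi * beta / 2)
           / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, 1.0, size) * phi
    v = np.random.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def levy_walk(ant, antlion, t, T, beta=1.5):
    omega = 1.0 - t / T                        # Eq. (13): shrinking scale
    r = np.random.normal(0.0, 1.0, ant.shape)  # r ~ N(0, 1) in Eq. (11)
    # Eq. (11): Lèvy-distributed jump toward (or beyond) the antlion
    return ant + omega * levy_step(beta, ant.shape) * r * (antlion - ant)

def update_ant(ant, roulette_antlion, elite_antlion, t, T):
    # Eq. (14): mean of the walks around the selected and elite antlions
    return (levy_walk(ant, roulette_antlion, t, T)
            + levy_walk(ant, elite_antlion, t, T)) / 2.0
```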

The Lèvy antlion optimization (LALO) algorithm comprises the fundamental building phases described in Fig. 3.

The population variance \(\sigma ^2\) defined in [44] is employed as an indicator of whether an algorithm converges prematurely or toward the global optimum; see Eq. (15).

$$\begin{aligned} \sigma ^2=\sum _{i=1}^N \left( \frac{f_i-f_{\mathrm{avg}}}{f}\right) ^2 \end{aligned}$$
(15)

where \(\sigma ^2\) is the population variance indicator, N is the number of search agents, \(f_i\) is the fitness of search agent number i, \(f_{\mathrm{avg}}\) is the current average fitness of the swarm, and f is the normalized calibration factor to confine \(\sigma ^2\) and is defined as in Eq. (16).

$$\begin{aligned} f={\left\{ \begin{array}{ll} \max {|f_i-f_{\mathrm{avg}}|}, &{}\quad \max {|f_i-f_{\mathrm{avg}}|}>1 \\ 1, &{}\quad \mathrm{otherwise} \end{array}\right. } \end{aligned}$$
(16)
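A direct sketch of this indicator, assuming the fitness values of all N search agents are held in a NumPy array:

```python
import numpy as np

def variance_indicator(fitness):
    # fitness: 1-D array holding the current fitness f_i of all N agents
    f_avg = fitness.mean()
    dev = np.abs(fitness - f_avg)
    f = max(dev.max(), 1.0)                            # Eq. (16): calibration factor
    return float(np.sum(((fitness - f_avg) / f) ** 2)) # Eq. (15)
```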

In this study, this function is used to examine the convergence of ALO and of its proposed variant LALO. Figure 4 depicts \(\sigma ^2\) for both ALO and LALO averaged over 30 different runs on sample data. We can notice from the figure that ALO approaches a \(\sigma ^2\) value of 0 earlier, but there is no guarantee of having reached the global optimum; at this stage the algorithm stagnates and cannot further enhance its search agents. On the other hand, LALO keeps a relatively higher \(\sigma ^2\) for a long time, maintaining population diversity and hence allowing additional exploration of the search space even in later optimization steps.

Fig. 3
The proposed Lèvy antlion optimization (LALO) algorithm

Figure 5 depicts the convergence of the proposed LALO versus that of ALO, averaged over 20 runs with random initialization. We can see that LALO keeps its exploration capability even in later optimization steps, thanks to the random walks drawn from the Lèvy distribution with its infinite variance.

4 LALO applied for feature selection

In this study, a wrapper approach to feature selection is used, meaning that the search space is explored for a feature subset guided by classification performance. This approach may be slow, since the classifier must be retrained and evaluated on every candidate solution; an intelligent exploration of the search space is therefore necessary. The goals are the maximization of classification performance and the minimization of the number of selected features. Of course, the fitness function can be extended to incorporate other measures, such as entropy, which is implicitly included in the initialization of search agents in some of our experiments, as explained in the results section. The fitness function, to be minimized, is shown in Eq. (17) [45]:

$$\begin{aligned} \alpha (1-P)+(1-\alpha )\left( \frac{N_f}{N_t}\right) , \end{aligned}$$
(17)

where P is the classifier performance measure, and \(N_f\) and \(N_t\) are the size of the selected feature subset and the total number of features in the dataset, respectively.

\(\alpha \in [0, 1]\) defines the relative weights of the two sub-goals. P is commonly measured as in Eq. (18):

$$\begin{aligned} P=\frac{N_c}{N_s}, \end{aligned}$$
(18)

where \(N_c\) and \(N_s\) are the number of correctly classified data points and the total number of data points in the dataset, respectively.

The number of dimensions (variables) in the optimization problem equals the number of features, and each variable is limited to the range [0, 1]. To decide whether a feature is selected at the evaluation stages, its corresponding variable is thresholded with a static threshold of 0.5, as shown in Eq. (19):

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{ij}>0.5&{} \quad \text{ Selected } \text{ feature } \\ x_{ij}\le 0.5&{} \quad \text{ Unselected } \text{ feature } \end{array}\right. } \end{aligned}$$
(19)

where \(x_{ij}\) is the continuous position of search agent i in dimension j. An ant/antlion position therefore represents a selected feature set, with higher position values in the dimensions/features that are candidates to be selected for classification.
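A minimal sketch of this wrapper fitness (Eqs. 17-19) is shown below, using the KNN classifier adopted later in Sect. 5.2; scikit-learn and the weight value \(\alpha = 0.99\) are illustrative assumptions, as the text does not fix them here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(position, X_train, y_train, X_val, y_val, alpha=0.99):
    mask = position > 0.5                      # Eq. (19): threshold at 0.5
    if not mask.any():                         # guard: keep at least one feature
        return 1.0
    clf = KNeighborsClassifier(n_neighbors=5)  # K = 5, as used in Sect. 5.2
    clf.fit(X_train[:, mask], y_train)
    P = clf.score(X_val[:, mask], y_val)       # Eq. (18): validation accuracy
    # Eq. (17): weighted sum of error rate and selected-feature ratio
    return alpha * (1 - P) + (1 - alpha) * mask.sum() / position.size
```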

The proposed LALO feature selection algorithm is outlined in Algorithm 2.

Algorithm 2 LALO applied for feature selection

5 Experimental results

5.1 Dataset and parameters setting

Table 1 summarizes the 21 datasets used in the experiments. The datasets are drawn from the UCI data repository [46]. As the table shows, we selected datasets with various dimensions and numbers of instances so as to assess performance on problems of different complexity. Each dataset is divided in a cross-validation [47] manner for evaluation: in K-fold cross-validation, \(K-1\) folds are used for training and validation and the last fold is used for testing, and this process is repeated M times. Hence, an individual optimizer is evaluated \(K*M\) times per dataset. In this study, we used \(K=3\), dividing the data into three equal portions. The training portion is used to train the classifier during optimization and at the final evaluation; the validation portion is used to assess the performance of the classifier during optimization; and the testing portion is used to evaluate the finally selected features given the trained classifier.
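The following sketch illustrates this protocol for the stated \(K=3\) setting; the generator name and the NumPy index handling are illustrative assumptions.

```python
import numpy as np

def three_way_splits(n_samples, M, seed=0):
    # Each repetition shuffles the data and yields K = 3 rotations of
    # (training, validation, testing) thirds, giving K * M evaluations.
    rng = np.random.default_rng(seed)
    for _ in range(M):
        idx = rng.permutation(n_samples)
        folds = np.array_split(idx, 3)
        for k in range(3):
            yield folds[k], folds[(k + 1) % 3], folds[(k + 2) % 3]
```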

Table 1 List of datasets used in the study
Fig. 4
The evolution curve of the variance of fitness during the global convergence of LALO and ALO

Fig. 5
LALO versus ALO convergence on average

Since the primary goal is to assess the performance of the proposed variant of ALO against the native ALO, we applied both optimizers to the data in Table 1 in a cross-validation manner with the same parameter settings. The parameter settings for both optimizers are listed in Table 2. As can be noticed from the table, each stochastic algorithm is run for 60 runs to assess its mean performance. Both algorithms use the same number of search agents (ants and antlions), run for the same number of iterations, operate in the same search space, and use the same fitness function. The implementation of the Lèvy distribution is based on the Mantegna algorithm [41], a simple implementation with a single parameter \(\beta\) controlling the skewness of the distribution. A scaling factor stepSize is used to control the exploitation rate; this parameter decreases linearly as the optimization progresses, as given in Eq. (13).

Table 2 Parameter settings for different optimization methods

In multi-agent optimizers of stochastic nature, the initial agents' positions affect the performance of the optimizer. Therefore, in this study we evaluated the two algorithms using different settings for the initial agents so that we can assess:

  • The capability of the algorithm to generate diverse solutions.

  • The ability of the algorithm to escape from local optima.

  • The intensification capability of the algorithm.

To this end, we made use of five initializers that force the optimizers to start with search agents that:

  • Are far from the expected global optimum.

  • Have very small diversity.

  • Are close to the expected global optimum.

  • Contain a near-optimal solution.

Table 3 Mean fitness using different initializers for all multi-agent optimizers over various datasets and the fitness value for the deterministic single-solution methods

The five initialization methods, illustrated by example in Fig. 6, can be detailed as follows (a consolidated code sketch of all five is given after the list):

  1. Uniform initialization: All the search agents are placed in the search space at random using a uniform distribution. This is the most common initialization technique, in which every point in the search space is a candidate position for a search agent with the same probability as any other point. This initialization method can be formulated as:

    $$\begin{aligned} X_i^d=lb^d+\mathrm{rand}(ub^d-lb^d) \end{aligned}$$
    (20)

    where \(X_i^d\) is the position of search agent i in dimension d, lb and ub are the lower and upper limits of dimension d, and rand is a random number drawn from the uniform distribution.

  2. Small initialization: This method initializes the search agents on the corners of the search space hypercube. In the context of feature selection, search agents are initialized with a minimum number of randomly selected features, so if the number of agents is less than the number of features, each search agent starts with a single dimension set to 1. The optimizer then searches for the feature(s) to set to 1 so as to improve the fitness value, as in standard sequential forward selection. This initializer is expected to place the search agents far from the optimal solution.

  3. Large initialization: This method also initializes the search agents randomly on the corners of the search space hypercube, but it is heavily biased toward the selection of features rather than their deselection. This initializer is expected to place the search agents closer to the optimal solution.

  4. Mixed initialization: This initializer is a compromise between the small and large initializers: \(50\%\) of the search agents are initialized with small initialization and the remaining \(50\%\) with large initialization. This method provides more population diversity than either the small or the large initializer.

  5. MRMR initialization: The minimum redundancy maximum relevance (MRMR) feature criterion combines two optimization criteria [48] in a single formula, outlined in Eq. (21):

    $$\begin{aligned} \max _{g_i\in G-S_{m-1}}(I(g_i;c)-\frac{1}{m-1}\sum _{g_j\in S_{m-1}}I(g_i;g_j)) \end{aligned}$$
    (21)

    where \(I(x;y)\) is the mutual information between two random variables x and y, measuring their mutual dependence, c is the class label variable, \(g_i\) is variable (feature) number i in the data, G is the full feature set, and \(S_{m-1}\) is the set of the \(m-1\) features already selected.

    Using the mutual information concept, the MRMR method selects variables that have the highest relevance to the target class and are minimally redundant, i.e., variables that are maximally dissimilar to each other. The value of each dimension (feature) is set proportional to how well it predicts the output class label and how distinct it is from the other features. The assessed MRMR measure for individual features is normalized along the feature set as shown in Eq. (22):

    $$\begin{aligned} P_i=\frac{\mathrm{MRMR}_i}{\max _{j=1}^d \mathrm{MRMR}_j} \end{aligned}$$
    (22)

    where \(P_i\) is the search agent position in dimension (feature) i, MRMR\(_i\) is the MRMR value for feature i, and d is the problem dimension.

    The position obtained from the MRMR method is used to initialize one search agent, while the remaining search agents are placed at random positions in the search space. This initializer forces one search agent to be very close to the optimum, though generally not optimal, since the criteria used are not directly related to classification performance.
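A consolidated sketch of the five initializers follows, assuming agent positions in [0, 1] as in the feature selection formulation; the helper names are illustrative, and the per-feature MRMR scores are assumed to be precomputed.

```python
import numpy as np

def uniform_init(n_agents, dim, rng):
    return rng.random((n_agents, dim))            # Eq. (20) with lb = 0, ub = 1

def small_init(n_agents, dim, rng):
    pos = np.zeros((n_agents, dim))               # start with no features...
    for row in pos:
        row[rng.integers(dim)] = 1.0              # ...then select one at random
    return pos

def large_init(n_agents, dim, rng):
    return 1.0 - small_init(n_agents, dim, rng)   # nearly all features selected

def mixed_init(n_agents, dim, rng):
    half = n_agents // 2
    return np.vstack((small_init(half, dim, rng),
                      large_init(n_agents - half, dim, rng)))

def mrmr_init(n_agents, dim, rng, mrmr_scores):
    pos = rng.random((n_agents, dim))
    pos[0] = mrmr_scores / mrmr_scores.max()      # Eq. (22): one near-optimal agent
    return pos
```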

Fig. 6
Sample initial search agent positions using the different initializers, with 9 search agents and 6 dimensions

Table 4 The p value for the t test on the fitness measures calculated for the stochastic multi-agent optimizers versus the proposed LALO

5.2 Evaluation criteria

The well-known K-nearest neighbor (KNN) classifier is used to evaluate the final classification performance of each algorithm in wrapper-based feature selection. KNN is employed in the experiments on a trial-and-error basis, where the best-performing choice of K over all the datasets \((K=5)\) is selected [49].

A set of evaluation indicators is used to assess different aspects of performance.

Mean fitness is used to evaluate the average performance of the optimizer over all the runs [50].

Standard deviation (Std) represents the variation of the best solutions obtained by running a stochastic optimizer for M different runs. Std is used as an indicator of optimization stability and robustness: a smaller Std means that the optimizer consistently converges to the same solution, while larger values indicate a more random performance [51].

Average feature size is the average ratio of the number of selected features to the total number of features, which is the secondary objective of the fitness function. This measure can be formulated as in Eq. (23):

$$\begin{aligned} \mathrm{Avg}SZ= \frac{1}{M}\sum _{i=1}^M \frac{\mathrm{size}(g_*^i)}{D}, \end{aligned}$$
(23)

where \(\mathrm{size}(x)\) is the number of selected (on) values in the vector x, and D is the number of features in the original dataset.

The t test is a statistical significance measure that indicates whether the difference between two groups' means most likely reflects a real difference in the populations from which the groups were sampled [52].
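A sketch of how these four measures could be computed from the M best-run results of each optimizer, assuming SciPy for the t test; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

def summarize(best_fitness, best_masks, D):
    mean_fit = float(np.mean(best_fitness))   # mean fitness over M runs
    std_fit = float(np.std(best_fitness))     # repeatability indicator
    avg_size = float(np.mean([m.sum() / D for m in best_masks]))  # Eq. (23)
    return mean_fit, std_fit, avg_size

def significance(fitness_a, fitness_b):
    # two-sample t test on the best-fitness samples of two optimizers
    return stats.ttest_ind(fitness_a, fitness_b).pvalue
```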

Table 5 Standard deviation of fitness calculated for multi-agent optimizers over various datasets using uniform initialization
Table 6 Standard deviation of fitness calculated for multi-agent optimizers over various datasets

The above four measures are applied directly to the fitness values obtained on the validation set.

To assess the future performance of the whole feature selection model, the following two indicators are used:

Mean test error is used to evaluate the performance of the feature selection on data that the classifier has never seen.

Average Fisher score assesses a feature subset such that the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible [53]. The Fisher score in this work is calculated for individual features given the class labels, as in Eq. (24):

$$\begin{aligned} F_j=\frac{\sum _{k=1}^cn_k(\mu _k^j-\mu ^j)^2}{(\sigma ^j)^2}, \end{aligned}$$
(24)

where \(F_j\) is the Fisher index of feature j, \(\mu ^j\) and \((\sigma ^j)^2\) are the mean and variance of feature j over the whole dataset, \(n_k\) is the size of class k, and \(\mu _k^j\) is the mean of feature j within class k.
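A sketch of this per-feature score, following Eq. (24) as printed (with the whole-dataset variance in the denominator):

```python
import numpy as np

def fisher_scores(X, y):
    # X: (n_samples, n_features) data matrix; y: class labels
    mu = X.mean(axis=0)                      # per-feature dataset mean mu^j
    var = X.var(axis=0) + 1e-12              # (sigma^j)^2, guarded against zero
    num = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        num += Xk.shape[0] * (Xk.mean(axis=0) - mu) ** 2   # n_k (mu_k^j - mu^j)^2
    return num / var                         # Eq. (24), one score per feature
```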

Table 7 Average feature size using different initializers for all multi-agent optimizers over various datasets and feature size ratio for the deterministic single-solution methods
Table 8 Mean test error using different initializers for all multi-agent optimizers over various datasets and test error for deterministic single-solution methods

The average F-score is calculated as the average Fisher score sum of the optimal solutions found by running each optimizer M times.

Average runtime is used in this study to assess the average processing time of the different methods.

5.3 Results and discussion

The reference methods adopted in this work are the deterministic single-solution methods, namely sequential forward selection (SFS), sequential backward selection (SBS), and their floating versions SFFS and SFBS, based on the versions proposed in [54]. In addition, two common multi-agent optimizers are adopted, namely genetic algorithms (GA) [55] and particle swarm optimization (PSO) [56].

Table 3 outlines the mean fitness acquired over the runs of the various multi-agent optimizers, as well as the same fitness calculated for the features selected by the deterministic single-solution methods. We can remark from the table that the optimal solution obtained by the proposed LALO is much better than those of the other optimization methods given the same number of iterations. We can also remark that, regardless of the initializer used, LALO reaches much better optima than ALO. The enhanced performance can be explained by the fact that LALO generates, in each iteration, a set of random solutions far from the current optimum, allowing the optimizer to escape from local optima; such diversity of solutions even in the last iterations is a consequence of the thick tail of the Lèvy distribution. Besides, LALO can generate diverse locations for the search agents even when initialized with nearby solutions, as in the case of the large initialization method, while ALO cannot reach this goal. One can also remark the impact of the initialization on the performance of all multi-agent optimizers. On the other hand, we can observe that the forward and backward iterative methods are easily stuck at the first optimal solution found in the search space. They also suffer from the nesting problem, which makes it difficult to abandon a feature once it has been selected (SFS) or removed (SBS).

Table 9 Average F-score using different initializers for all multi-agent optimizers over various datasets and the Fisher score calculated for the deterministic single-solution methods

The MRMR initializer places one search agent very close to the global optimum, as it initializes that agent using entropy and correlation constraints. Therefore, we can remark the enhancement in the performance of the MRMR method over the other methods. We can also remark that, although MRMR initializes a search agent with information related to the data (mutual information), the result is still not optimal, since this information is not directly related to the classifier performance used in wrapper-based feature selection. Moreover, we may remark that initializers producing a diverse population (the uniform and mixed initializers) perform better than initializers with less diversity (the small and large initializers). To confirm the significance of the performance advance, the p value of the t test is calculated for the two optimizers and outlined in Table 4. We can remark a significant enhancement in the performance of LALO over ALO at a significance level of \(5\%\) using the small, large, mixed, and uniform initializers. Lower significance is observed with the MRMR initializer, which can be explained by the fact that MRMR always initializes the optimizers with a solution close to the optimum and hence eases the optimization task.

The standard deviation of the optimal fitness value obtained over all runs of an optimizer can be used as an indicator of result repeatability and performance robustness. This measure is depicted in Tables 5 and 6 for all optimizers over all datasets. We can notice from the tables that both antlion optimizers have comparable standard deviations, which proves that even with a random distribution of infinite variance, repeatable results can still be obtained thanks to the selection operator used in LALO. Furthermore, LALO and ALO generally have standard deviations lower than those obtained for GA and PSO, which proves the capability of ALO and LALO to converge to optimal or near-optimal solutions regardless of the random exploration and exploitation used, thanks to the adaptive control of the step size/exploration rate adopted in both methods.

The secondary goal in the fitness function is the selected feature size. This indicator is calculated and depicted in Table 7. We can remark that this objective is also minimized and that LALO keeps its superiority in performance in comparison with ALO, PSO, and GA. The addition of this secondary objective complicates the shape of the fitness function, which is still successfully minimized by the proposed LALO while allowing the optimizer to carefully add extra features that maximize classification performance. We can recognize that the deterministic single-solution methods generally perform better on this indicator, which is natural given the very cautious moves of such algorithms, but of course at the price of classification accuracy.

Table 8 depicts the performance of the optimizers on test data never seen by the classifier. We can remark that the features selected by LALO are well suited for future performance, thanks to the design of the fitness function and the capability of LALO to optimize it. This performance proves the ability of the proposed LALO to adaptively search the feature space for an optimal combination of features maximizing classification performance.

Table 10 Average runtime using uniform initialization for all multi-agent optimizers over various datasets and for the deterministic single-solution methods

Another indicator for assessing the future performance of the selected features is depicted in Table 9. The Fisher index measures data compactness relative to feature span, which eases the design of a classifier based on the selected features. We can observe that the feature sets selected by LALO are much better, thanks to its capability to search the designed fitness landscape for the global optimum.

A final indicator, related to runtime, is depicted in Table 10, where the runtime of each optimizer is measured and summed over all the initializers used. We can remark from the table that the runtime of LALO is much better than that of ALO, thanks to the simple implementation of the Lèvy random number generator as opposed to the complete random walk provided in the published implementation of ALO [32]. Moreover, in contrast to sequentially adding or removing features as in SFS, SBS, SFFS, and SFBS, multi-agent optimization methods can add features in batches. Hence, LALO provides enhanced time performance while keeping the better classification performance shown above.

6 Conclusions and future work

A novel variant of ALO based on Lèvy flights is proposed. This study applies the Lèvy antlion optimization (LALO) algorithm to feature selection, tests it on 21 different datasets, and compares its results with the original ALO. Five initialization methods are used to assess the local and global search capability of each algorithm. The results are evaluated using different indicators assessing convergence, repeatability, feature reduction size, performance on test data, and statistical significance, as well as runtime.

We can conclude the following:

  • Lèvy flights help the ALO to search the space efficiently for the optimal solution, avoiding premature convergence and stagnation.

  • The proposed LALO can provide repeatable results with efficient performance, even for complex fitness function shapes.

  • The initial positions of the search agents affect the performance of the different optimizers; generally, initializers that provide an initially diverse population are better. Moreover, initializers such as MRMR, which seed the search agents with solutions that have many desirable properties, achieve better performance.