1 Introduction

Datasets in pattern recognition and classification applications regularly contain an enormous number of features. The primary goal of feature selection is to extract a subset of relevant features from the huge number of available features and thereby better understand the underlying process that generated the data [1]. A well-selected feature set improves classifier performance and provides faster, more cost-effective classification, often achieving comparable or even better classification performance than using the full feature set [2].

Feature selection is recognized as a multi-objective problem: minimize the size of the selected feature set and maximize the classification performance. The two objectives contradict each other, so an optimal solution must be found as a tradeoff between them. Feature selection methods split into two primary approaches: filter-based and wrapper-based [3]. Filter-based methods do not depend on a machine learning technique and are considered computationally inexpensive. Furthermore, filter-based methods examine the search space for a feature set that optimizes data-dependent criteria rather than classification-dependent criteria [4].

Wrapper-based methods include a machine learning technique as part of the evaluation function, which allows them to obtain better results than filter-based methods [5]. The size of the search space grows exponentially with the number of features in a given dataset [1]. Hence, in practice, exhaustive search methods cannot reach the optimal solution in most cases. A variety of search methods has been employed to solve feature selection problems, such as greedy search based on sequential forward selection (SFS) [6] and sequential backward selection (SBS) [7]. However, these classical optimization methods have restrictions in solving optimization problems, so evolutionary computation (EC) algorithms, with their global search ability, are an option for addressing these shortcomings [8, 9].

Various heuristic methods simulate the behavior of biological and physical systems in nature and have been proposed as powerful techniques for global optimization [10]. EC algorithms are motivated by nature and by the social and biological behavior of animals (birds, fish, bats, fireflies, wolves, spiders, antlions, etc.) in a pack [11]. Many researchers have suggested computational methods that imitate the behavior of these species to search for optimal solutions [12,13,14,15]. EC algorithms adaptively search the space for an optimal subset using a set of search agents that communicate in a social manner to reach a global solution in a minimal number of evaluations [16].

In EC algorithms, a proper balance between exploration and exploitation is necessary [17]. Genetic algorithms (GA) and particle swarm optimization (PSO) are the most common evolutionary computation methods [18]. GA was the first evolutionary algorithm introduced in the literature, developed based on the natural process of evolution through reproduction [19]. A feature selection algorithm based on GA used a fuzzy set as its fitness function [20]. A hybrid wrapper and filter feature selection method with a memetic framework integrated filter ranking with GA to maximize classification accuracy while selecting feature subsets [21]. However, PSO with the same fitness function achieved better performance than GA in [22]. A feature selection strategy combining rough sets and PSO was introduced in [23]: the rough set is applied as the feature selection criterion, while PSO discovers the optimal feature subset.

An ant colony optimization (ACO) wrapper-based feature selection algorithm was applied to network intrusion detection, using the Fisher discrimination rate as heuristic information [24]. In the artificial bee colony (ABC) algorithm, the several behaviors of the bees provide the opportunity to build a robust balance between exploration and exploitation [25]. In ABC, the employed bees try to find a food source and advertise it to the other bees [26]. The onlooker bees follow the employed bees they find promising, and the scout bees fly spontaneously to find the best food source [27]. A virtual bee algorithm (VBA) was applied to optimize 2-D numerical functions using a swarm of virtual bees that move randomly in the search space and interact to find food sources; the interactions between these virtual bees yield the possible solutions of the optimization problem [28]. Another approach based on the natural behavior of honeybees moves randomly generated worker bees in the direction of an elite bee, where the elite bee represents the optimal (or near-optimal) solution [29].

The artificial fish swarm (AFS) algorithm mimics fish behavior such as the reaction to stimuli by controlling the tail and fins. AFS is a robust stochastic technique based on fish movement and fish intelligence during the search for food [30]. The antlion optimization (ALO) algorithm [31, 32] is a relatively new EC method that is computationally less costly than other EC techniques. Simulated annealing (SA) has been employed to optimize feature subsets in marketing applications for designing large-scale linear regression models [33].

Lately, studies of the flight behavior of many animals and insects have shown that it exhibits the general characteristics of Lèvy flights [34]. Fruit flies explore their landscape using a series of straight flight paths punctuated by 90\({^{\circ }}\) turns, leading to a Lèvy-flight-style intermittent, scale-free search pattern. Accordingly, Lèvy flight behavior has been applied to optimization and optimal search, and preliminary results show its promising capability [35]. The Lèvy flight process represents an optimal random search pattern and is frequently observed in nature [36]. Broadly speaking, a Lèvy flight is a random walk whose step lengths are drawn from the Lèvy distribution [37].

The overall aim of this paper is to propose a variant of ALO that exploits Lèvy flight for performing local and global search. The proposed Lèvy flight version of ALO is applied to feature selection, selecting a minimal number of features while obtaining comparable or even better classification performance than using all features or conventional feature selection techniques. The rest of this paper is organized as follows: Sect. 2 provides background on the antlion optimization algorithm (ALO). Section 3 describes the proposed Lèvy antlion optimization (LALO) algorithm. Section 4 presents the application of LALO to multi-objective feature selection. The experimental results and discussion are presented in Sect. 5. Finally, conclusions and future work are provided in Sect. 6.

2 Antlion optimization (ALO)

Antlion optimization (ALO) is a recently proposed bio-inspired optimization method suggested by Mirjalili [32]. The ALO algorithm mimics the hunting behavior of antlions in nature. The following two subsections discuss the inspiration for and the operators of the artificial algorithm [38].

Antlions (doodlebugs) belong to the Myrmeleontidae family and the Neuroptera order [32]. An antlion larva digs a cone-shaped pit in the sand. Once caught, the prey tries to escape from the trap while the antlion tries to catch it; the antlion intelligently throws sand toward the edge of the pit to slide the victim down to its bottom. After consuming the prey, the antlion prepares the pit for the next hunt. The trap size of an antlion is affected by its level of hunger and by the phase of the moon: antlions tend to dig larger traps as they become hungrier and when the moon is full. They have evolved and adapted this behavior to increase their chance of survival [39].

Based on the above description, ALO is composed of two sets of agents, namely ants and antlions. Antlions are the hunters, selected as the fittest agents, and never change their location unless they replace a given ant. Ants perform random walks in the search space and may be consumed by an antlion if that antlion traps them. The ant's position is updated according to Eq. (1):

$$\begin{aligned} \mathrm{Ant}_i^t=\frac{R_A^t+R_E^t}{2}, \end{aligned}$$
(1)

where \(R_A^t\) is the position of the randomly walking ant indexed i, called \(\mathrm{Ant}_i\), around a roulette-wheel-selected antlion indexed A, and \(R_E^t\) is the position of the same ant walking randomly around the elite antlion indexed E. The antlions' fitness is used by the roulette wheel operator to select the antlion with index A around which the ant performs its random walk, while the elite antlion is established as the fittest antlion, with index E.

The random walk of an \(\mathrm{Ant}_i^t\) around a given \(\mathrm{Antlion}_j^t\) is formulated as in Eq. (2):

$$\begin{aligned} R_j^t=\frac{(X_i-a_i)\times (d_i-c_i^t)}{(b_i^t-a_i)}+c_i, \end{aligned}$$
(2)

where \(R_j^t\) is the position of ant i after performing a random walk around antlion j at iteration t, \(a_i\) and \(b_i^t\) are the minimum and maximum values of the random walk \(X_i^t\) in the i-th dimension, \(X_i\) is defined in Eq. (3), and \(c_i^t\), \(d_i\) are the lower and upper bounds of the random walk. Equation (2) is used directly to define \(R_A^t\) and \(R_E^t\) of Eq. (1).

$$\begin{aligned} X(t) = [0,\; \mathrm{cumsum}(2r(t_1)-1);\; \mathrm{cumsum}(2r(t_2)-1);\; \ldots ;\; \mathrm{cumsum}(2r(t_T)-1)], \end{aligned}$$
(3)

where cumsum calculates the cumulative sum, modeling the accumulation of successive random steps that form a random walk up to time t, and r(t) is a stochastic function defined as in Eq. (4):

$$\begin{aligned} r(t)={\left\{ \begin{array}{ll} 1 &{} \text{ if } \mathrm{rand}>0.5\\ 0 &{} \text{ if } \mathrm{rand}\le 0.5, \end{array}\right. } \end{aligned}$$
(4)

where rand is a random number drawn from a uniform distribution. The parameters c and d are adapted according to Eqs. (5) and (6) to limit the range of the random walk around the given antlion.

$$\begin{aligned} c_i^t= {\left\{ \begin{array}{ll} \frac{lb}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{if}\; \mathrm{rand}<0.5\\ \frac{-\,lb}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{otherwise},\\ \end{array}\right. } \end{aligned}$$
(5)
$$\begin{aligned} d_i^t= {\left\{ \begin{array}{ll} \frac{ub}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{if}\; \mathrm{rand}>0.5\\ \frac{-\,ub}{I}+X_{\mathrm{Antlion}_j}^t &{}\quad \mathrm{otherwise},\\ \end{array}\right. } \end{aligned}$$
(6)

where lb and ub are the lower and upper limits for dimension i, and I is a parameter that controls the exploration/exploitation rate, defined as in Eq. (7):

$$\begin{aligned} I=10^w\frac{t}{T}, \end{aligned}$$
(7)

where T is the maximum number of iterations and w is a constant that depends on the current iteration (w = 2 when \(t > 0.1T\), w = 3 when \(t > 0.5T\), w = 4 when \(t > 0.75T\), w = 5 when \(t > 0.9T\), and w = 6 when \(t > 0.95T\)). The constant w adjusts the level of exploitation accuracy.

Finally, a selection operation is applied in which an antlion is replaced by an ant if the ant becomes fitter. According to the above formulation, antlion optimization can be described as in Algorithm 1 [40].

Algorithm 1 Antlion optimization (ALO)
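For concreteness, the following is a minimal sketch of the ALO position update (Eqs. 1-7) in Python. It is written for this text rather than taken from the published implementation; the helper names and the value of w for \(t \le 0.1T\) (not specified above) are assumptions.

```python
import numpy as np

def random_walk(T):
    # Eqs. (3)-(4): cumulative sum of +/-1 steps from a fair coin flip
    steps = np.where(np.random.rand(T) > 0.5, 1.0, -1.0)
    return np.concatenate(([0.0], np.cumsum(steps)))

def walk_around(antlion, t, T, lb, ub):
    # Eq. (7): w grows with the iteration count; t runs from 1 to T,
    # and w = 1 for t <= 0.1T is assumed here
    w = 1
    for frac, wv in ((0.1, 2), (0.5, 3), (0.75, 4), (0.9, 5), (0.95, 6)):
        if t > frac * T:
            w = wv
    I = 10 ** w * t / T
    # Eqs. (5)-(6): shrink the walk bounds around the chosen antlion
    c = antlion + (lb / I if np.random.rand() < 0.5 else -lb / I)
    d = antlion + (ub / I if np.random.rand() > 0.5 else -ub / I)
    pos = np.empty_like(antlion)
    for i in range(antlion.size):
        X = random_walk(T)                   # one walk per dimension
        a, b = X.min(), X.max()
        # Eq. (2): min-max normalize step t of the walk into [c_i, d_i]
        pos[i] = (X[t] - a) * (d[i] - c[i]) / (b - a + 1e-12) + c[i]
    return pos

def update_ant(roulette_antlion, elite_antlion, t, T, lb, ub):
    # Eq. (1): the new ant position is the mean of a walk around a
    # roulette-wheel-selected antlion and a walk around the elite antlion
    R_A = walk_around(roulette_antlion, t, T, lb, ub)
    R_E = walk_around(elite_antlion, t, T, lb, ub)
    return (R_A + R_E) / 2.0
```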

ALO provides very competitive results in terms of improved exploration, local optima avoidance, exploitation, and convergence on diverse shapes of optimization functions and has hence been used to solve different problems in different domains [32].

3 The proposed Lèvy antlion optimization (LALO)

Randomization plays a significant role in both exploration and exploitation (diversification and intensification), and the essence of such randomization is the random walk [41]. A random walk is a random process consisting of a series of consecutive random steps, and may be written as the sum of those steps as in Eq. (8):

$$\begin{aligned} S_N=\sum _{i=1}^NX_i, \end{aligned}$$
(8)

where \(X_i\) is a random step drawn from a random distribution.

If the step length obeys the Gaussian distribution, the random walk becomes Brownian motion, as shown in Fig. 1 [41]. Far-field randomization should generate a fraction of solutions far enough from the current best solution; such far solutions help an optimizer escape from local optima and avoid stagnation [42]. A variant of Brownian motion whose steps are drawn from a Lèvy distribution is expected to achieve this end, since the Lèvy distribution is a heavy-tailed distribution with infinite variance.

Mathematically speaking, a simple version of Lèvy distribution can be defined as in Eq. (9):

$$\begin{aligned} L(s,\gamma , \mu )={\left\{ \begin{array}{ll} \sqrt{\frac{\gamma }{2\pi }}\exp (-\frac{\gamma }{2(s-\mu )})\frac{1}{(s-\mu )^{3/2}}, &{}\quad 0<\mu<s<\infty \\ 0 &{}\quad \mathrm{otherwise}\\ \end{array}\right. } \end{aligned}$$
(9)

where \(\mu\) is the minimum step size and \(\gamma\) is a scale parameter. As \(s\rightarrow \infty ,\) we obtain the simple form of the Lèvy distribution given in Eq. (10). Figure 2 shows a sample random walk using the Lèvy distribution.

$$\begin{aligned} L(s,\gamma , \mu )=\sqrt{\frac{\gamma }{2\pi }}\frac{1}{(s-\mu )^{3/2}} \end{aligned}$$
(10)

When the step length obeys the Lèvy distribution, such a random walk is called a Lèvy flight or Lèvy walk.

Fig. 1
Brownian motion with a Gaussian step-size distribution; the path of 50 steps starting at the origin; after [41]

Fig. 2
Lèvy flights over 50 consecutive steps starting at the origin; after [41]

Lèvy flights are more efficient than Brownian random walks in exploring unknown, large-scale search spaces, which can be observed from their large abrupt jumps. Mathematically, this can be interpreted by the fact that the variance of a Lèvy flight increases much faster than the linear growth of the Brownian random walk. These properties of the Lèvy flight random walk motivate exploiting it for the random walking of ants.

The updated version of ALO random walk based on Lèvy flights’ random walk is performed as in Eq. (11).

$$\begin{aligned} \mathrm{RW}_i=\mathrm{ant}_i^{t}+\omega \times \frac{u}{|v|^{\frac{1}{\beta }}}\times r\times \left[ \mathrm{antLion}_j^t-\mathrm{ant}_i^{t}\right] \end{aligned}$$
(11)

where \(\mathrm{RW}_i\) is the random walk of \(\mathrm{ant}_i\) around \(\mathrm{antLion}_j\) at time t, \(\omega\) is a scale parameter controlling the scale of the random walk, defined according to Eq. (13), v and r are random numbers drawn from the normal distribution N(0, 1), \(\beta\) is a constant controlling the skewness of the distribution, and u is set to \(r_2 \varphi\), with \(r_2\) a random number drawn from the normal distribution N(0, 1) and \(\varphi\) defined in Eq. (12). The above equation repositions an ant by a random walk with Lèvy steps.

$$\begin{aligned} \varphi =\left( \frac{\varGamma (1+\beta )\sin (\pi \beta /2)}{\varGamma \left( \frac{1+\beta }{2}\right) \beta \, 2^{\frac{\beta -1}{2}}}\right) ^{\frac{1}{\beta }} \end{aligned}$$
(12)

This implementation of the Lèvy distribution is based on the Mantegna algorithm for a symmetric Lèvy stable distribution [43], which is regarded as one of the simplest ways to generate Lèvy-stable random numbers.
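As an illustration, a minimal sketch of the Mantegna step generator follows (Eq. 12 together with the factor \(u/|v|^{1/\beta }\) of Eq. 11); the function name and the example value \(\beta = 1.5\) are illustrative assumptions.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(beta=1.5, size=1):
    # Eq. (12): scale factor phi applied to the Gaussian numerator sample
    phi = (gamma(1 + beta) * sin(pi * beta / 2)
           / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, 1.0, size) * phi   # u = r2 * phi, r2 ~ N(0, 1)
    v = np.random.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)           # heavy-tailed Lèvy-stable steps
```

Most of the generated steps are small, but occasional very large jumps occur; this heavy-tail behavior is what allows an ant to leave the neighborhood of its antlion.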

$$\begin{aligned} \omega =1-\frac{t}{T} \end{aligned}$$
(13)

where t and T are the current iteration and the maximum number of allowed iterations, respectively. This equation makes the optimizer more diverse at the beginning of the optimization and then narrows the scale to allow for exploitation in the final stages.

Based on the updated random walk formulation above, the position update for \(\mathrm{ant}_i^t\) is formulated as in Eq. (14).

$$\begin{aligned} \mathrm{ant}_i^t=\frac{\mathrm{RW}_A^t+\mathrm{RW}_E^t}{2}, \end{aligned}$$
(14)

where \(\mathrm{RW}_A^t\) is the random walk of \(\mathrm{ant}_i\) around a roulette-wheel-selected \(\mathrm{antLion}_j\), obtained by applying Eq. (11), and \(\mathrm{RW}_E^t\) is the random walk of \(\mathrm{ant}_i\) around the elite antlion \(\mathrm{antLion}_k\), also calculated using Eq. (11).
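A minimal sketch of the resulting LALO position update (Eqs. 11, 13, and 14) is given below; `levy_step` is the Mantegna sampler sketched earlier, repeated so the block is self-contained.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(beta, size):
    # Mantegna sampler, as sketched above (Eq. 12)
    phi = (gamma(1 + beta) * sin(pi * beta / 2)
           / (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = np.random.normal(0.0, 1.0, size) * phi
    v = np.random.normal(0.0, 1.0, size)
    return u / np.abs(v) ** (1 / beta)

def levy_walk(ant, antlion, t, T, beta=1.5):
    omega = 1.0 - t / T                        # Eq. (13): shrinking scale
    r = np.random.normal(0.0, 1.0, ant.shape)  # r ~ N(0, 1) in Eq. (11)
    # Eq. (11): Lèvy-distributed jump toward (or beyond) the antlion
    return ant + omega * levy_step(beta, ant.shape) * r * (antlion - ant)

def update_ant(ant, roulette_antlion, elite_antlion, t, T):
    # Eq. (14): mean of the walks around the selected and elite antlions
    return (levy_walk(ant, roulette_antlion, t, T)
            + levy_walk(ant, elite_antlion, t, T)) / 2.0
```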

The Lèvy antlion optimization (LALO) algorithm comprises the fundamental building phases described in Fig. 3.

The population variance \(\sigma ^2\) defined in [44] is employed as an indicator of whether an algorithm converges prematurely or toward the global optimum; see Eq. (15).

$$\begin{aligned} \sigma ^2=\sum _{i=1}^N \left( \frac{f_i-f_{\mathrm{avg}}}{f}\right) ^2 \end{aligned}$$
(15)

where \(\sigma ^2\) is the population variance indicator, N is the number of search agents, \(f_i\) is the fitness of search agent number i, \(f_{\mathrm{avg}}\) is the current average fitness of the swarm, and f is the normalized calibration factor to confine \(\sigma ^2\) and is defined as in Eq. (16).

$$\begin{aligned} f={\left\{ \begin{array}{ll} \max {|f_i-f_{\mathrm{avg}}|}, &{}\quad \max {|f_i-f_{\mathrm{avg}}|}>1 \\ 1, &{}\quad \mathrm{otherwise} \end{array}\right. } \end{aligned}$$
(16)
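A direct sketch of this indicator, assuming the fitness values of all N search agents are held in a NumPy array:

```python
import numpy as np

def variance_indicator(fitness):
    # fitness: 1-D array holding the current fitness f_i of all N agents
    f_avg = fitness.mean()
    dev = np.abs(fitness - f_avg)
    f = max(dev.max(), 1.0)                            # Eq. (16): calibration factor
    return float(np.sum(((fitness - f_avg) / f) ** 2)) # Eq. (15)
```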

In this study, this function is used to examine the convergence of ALO and of its proposed variant LALO. Figure 4 depicts \(\sigma ^2\) for both ALO and LALO averaged over 30 different runs on sample data. We can notice from the figure that ALO approaches a \(\sigma ^2\) value of 0 earlier, but there is no guarantee of having reached the global optimum; at this stage the algorithm stagnates and cannot further enhance its search agents. On the other hand, LALO keeps a relatively higher \(\sigma ^2\) for a long time, maintaining population diversity and hence allowing additional exploration of the search space even in later optimization steps.

Fig. 3
The proposed Lèvy antlion optimization (LALO) algorithm

Figure 5 depicts the convergence of the proposed LALO versus that of ALO, averaged over 20 runs with random initialization. We can see that LALO keeps its exploration capability even in later optimization steps, thanks to the random walks drawn from the Lèvy distribution with its infinite variance.

4 LALO applied for feature selection

In this study, a wrapper approach to feature selection is used, meaning that the search space is explored for a feature subset guided by classification performance. This approach may be slow, since the classifier must be retrained and evaluated on every candidate solution; an intelligent exploration of the search space is therefore necessary. The goals are the maximization of classification performance and the minimization of the number of selected features. Of course, the fitness function can be extended to incorporate other measures, such as entropy, which is implicitly included in the initialization of search agents in some of our experiments, as explained in the results section. The fitness function, to be minimized, is shown in Eq. (17) [45]:

$$\begin{aligned} \alpha (1-P)+(1-\alpha )\left( \frac{N_f}{N_t}\right) , \end{aligned}$$
(17)

where P is the classifier performance measure, and \(N_f\) and \(N_t\) are the size of the selected feature subset and the total number of features in the dataset, respectively.

\(\alpha \in [0, 1]\) defines the relative weights of the two sub-goals. P is commonly measured as in Eq. (18):

$$\begin{aligned} P=\frac{N_c}{N_s}, \end{aligned}$$
(18)

where \(N_c\) and \(N_s\) are the number of correctly classified data points and the total number of data points in the dataset, respectively.

The number of dimensions (variables) in the optimization problem equals the number of features, and each variable is limited to the range [0, 1]. To decide whether a feature is selected at the evaluation stages, its corresponding variable is thresholded with a static threshold of 0.5, as shown in Eq. (19):

$$\begin{aligned} {\left\{ \begin{array}{ll} x_{ij}>0.5&{} \quad \text{ Selected } \text{ feature } \\ x_{ij}\le 0.5&{} \quad \text{ Unselected } \text{ feature } \end{array}\right. } \end{aligned}$$
(19)

where \(x_{ij}\) is the continuous position of search agent i in dimension j. An ant/antlion position therefore represents a selected feature set, with higher position values in the dimensions/features that are candidates to be selected for classification.
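A minimal sketch of this wrapper fitness (Eqs. 17-19) is shown below, using the KNN classifier adopted later in Sect. 5.2; scikit-learn and the weight value \(\alpha = 0.99\) are illustrative assumptions, as the text does not fix them here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(position, X_train, y_train, X_val, y_val, alpha=0.99):
    mask = position > 0.5                      # Eq. (19): threshold at 0.5
    if not mask.any():                         # guard: keep at least one feature
        return 1.0
    clf = KNeighborsClassifier(n_neighbors=5)  # K = 5, as used in Sect. 5.2
    clf.fit(X_train[:, mask], y_train)
    P = clf.score(X_val[:, mask], y_val)       # Eq. (18): validation accuracy
    # Eq. (17): weighted sum of error rate and selected-feature ratio
    return alpha * (1 - P) + (1 - alpha) * mask.sum() / position.size
```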

The proposed LALO feature selection algorithm is outlined in Algorithm 2.

Algorithm 2 LALO applied for feature selection

5 Experimental results

5.1 Dataset and parameters setting

Table 1 summarizes the 21 datasets used in the experiments. The datasets are drawn from the UCI data repository [46]. As the table shows, we selected datasets with various dimensions and numbers of instances so as to assess performance on problems of different complexity. Each dataset is divided in a cross-validation [47] manner for evaluation: in K-fold cross-validation, \(K-1\) folds are used for training and validation and the last fold is used for testing, and this process is repeated M times. Hence, an individual optimizer is evaluated \(K*M\) times per dataset. In this study, we used \(K=3\), dividing the data into three equal portions. The training portion is used to train the classifier during optimization and at the final evaluation; the validation portion is used to assess the performance of the classifier during optimization; and the testing portion is used to evaluate the finally selected features given the trained classifier.
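The following sketch illustrates this protocol for the stated \(K=3\) setting; the generator name and the NumPy index handling are illustrative assumptions.

```python
import numpy as np

def three_way_splits(n_samples, M, seed=0):
    # Each repetition shuffles the data and yields K = 3 rotations of
    # (training, validation, testing) thirds, giving K * M evaluations.
    rng = np.random.default_rng(seed)
    for _ in range(M):
        idx = rng.permutation(n_samples)
        folds = np.array_split(idx, 3)
        for k in range(3):
            yield folds[k], folds[(k + 1) % 3], folds[(k + 2) % 3]
```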

Table 1 List of datasets used in the study
Fig. 4
The evolution curve of the variance of fitness during the global convergence of LALO and ALO

Fig. 5
LALO versus ALO convergence on average

Since the primary goal is to assess the performance of the proposed variant of ALO against the native ALO, we applied both optimizers to the data in Table 1 in a cross-validation manner with the same parameter settings. The parameter settings for both optimizers are listed in Table 2. As can be noticed from the table, each stochastic algorithm is run for 60 runs to assess its mean performance. Both algorithms use the same number of search agents (ants and antlions), run for the same number of iterations, operate in the same search space, and use the same fitness function. The implementation of the Lèvy distribution is based on the Mantegna algorithm [41], a simple implementation with a single parameter \(\beta\) controlling the skewness of the distribution. A scaling factor stepSize is used to control the exploitation rate; this parameter decreases linearly as the optimization progresses, as given in Eq. (13).

Table 2 Parameter settings for different optimization methods

In multi-agent optimizers of stochastic nature, the initial agents' positions affect the performance of the optimizer. Therefore, in this study we evaluated the two algorithms using different settings for the initial agents so that we can assess:

  • The capability of the algorithm to generate diverse solutions.

  • The ability of the algorithm to escape from local optima.

  • The intensification capability of the algorithm.

To this end, we made use of five initializers that force the optimizers to start with search agents that:

  • Are far from the expected global optimum.

  • Have very small diversity.

  • Are close to the expected global optimum.

  • Contain a near-optimal solution.

Table 3 Mean fitness using different initializers for all multi-agent optimizers over various datasets and the fitness value for the deterministic single-solution methods

The five initialization methods, illustrated by example in Fig. 6, can be detailed as follows (a consolidated code sketch of all five is given after the list):

  1. Uniform initialization: All the search agents are placed in the search space at random using a uniform distribution. This is the most common initialization technique, in which every point in the search space is a candidate position for a search agent with the same probability as any other point. This initialization method can be formulated as:

    $$\begin{aligned} X_i^d=lb^d+\mathrm{rand}(ub^d-lb^d) \end{aligned}$$
    (20)

    where \(X_i^d\) is the position of search agent i in dimension d, lb and ub are the lower and upper limits of dimension d, and rand is a random number drawn from the uniform distribution.

  2. Small initialization: This method initializes the search agents on the corners of the search space hypercube. In the context of feature selection, search agents are initialized with a minimum number of randomly selected features, so if the number of agents is less than the number of features, each search agent starts with a single dimension set to 1. The optimizer then searches for the feature(s) to set to 1 so as to improve the fitness value, as in standard sequential forward selection. This initializer is expected to place the search agents far from the optimal solution.

  3. Large initialization: This method also initializes the search agents randomly on the corners of the search space hypercube, but it is heavily biased toward the selection of features rather than their deselection. This initializer is expected to place the search agents closer to the optimal solution.

  4. Mixed initialization: This initializer is a compromise between the small and large initializers: \(50\%\) of the search agents are initialized with small initialization and the remaining \(50\%\) with large initialization. This method provides more population diversity than either the small or the large initializer.

  5. MRMR initialization: The minimum redundancy maximum relevance (MRMR) feature criterion combines two optimization criteria [48] in a single formula, outlined in Eq. (21):

    $$\begin{aligned} \max _{g_i\in G-S_{m-1}}(I(g_i;c)-\frac{1}{m-1}\sum _{g_j\in S_{m-1}}I(g_i;g_j)) \end{aligned}$$
    (21)

    where \(I(x;y)\) is the mutual information between two random variables x and y, measuring their mutual dependence, c is the class label variable, \(g_i\) is variable (feature) number i in the data, G is the full feature set, and \(S_{m-1}\) is the set of the \(m-1\) features already selected.

    Using the mutual information concept, the MRMR method selects variables that have the highest relevance to the target class and are minimally redundant, i.e., variables that are maximally dissimilar to each other. The value of each dimension (feature) is set proportional to how well it predicts the output class label and how distinct it is from the other features. The assessed MRMR measure for individual features is normalized along the feature set as shown in Eq. (22):

    $$\begin{aligned} P_i=\frac{\mathrm{MRMR}_i}{\max _{j=1}^d \mathrm{MRMR}_j} \end{aligned}$$
    (22)

    where \(P_i\) is the search agent position in dimension (feature) i, MRMR\(_i\) is the MRMR value for feature i, and d is the problem dimension.

    The position obtained from the MRMR method is used to initialize one search agent, while the remaining search agents are placed at random positions in the search space. This initializer forces one search agent to be very close to the optimum, though generally not optimal, since the criteria used are not directly related to classification performance.
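A consolidated sketch of the five initializers follows, assuming agent positions in [0, 1] as in the feature selection formulation; the helper names are illustrative, and the per-feature MRMR scores are assumed to be precomputed.

```python
import numpy as np

def uniform_init(n_agents, dim, rng):
    return rng.random((n_agents, dim))            # Eq. (20) with lb = 0, ub = 1

def small_init(n_agents, dim, rng):
    pos = np.zeros((n_agents, dim))               # start with no features...
    for row in pos:
        row[rng.integers(dim)] = 1.0              # ...then select one at random
    return pos

def large_init(n_agents, dim, rng):
    return 1.0 - small_init(n_agents, dim, rng)   # nearly all features selected

def mixed_init(n_agents, dim, rng):
    half = n_agents // 2
    return np.vstack((small_init(half, dim, rng),
                      large_init(n_agents - half, dim, rng)))

def mrmr_init(n_agents, dim, rng, mrmr_scores):
    pos = rng.random((n_agents, dim))
    pos[0] = mrmr_scores / mrmr_scores.max()      # Eq. (22): one near-optimal agent
    return pos
```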

Fig. 6
Sample initial search agent positions using the different initializers, with 9 search agents and 6 dimensions

Table 4 The p value for the t test on the fitness measures calculated for the stochastic multi-agent optimizers versus the proposed LALO

5.2 Evaluation criteria

The well-known K-nearest neighbor (KNN) classifier is used to evaluate the final classification performance of each algorithm in wrapper-based feature selection. KNN is employed in the experiments on a trial-and-error basis, where the best-performing choice of K over all the datasets \((K=5)\) is selected [49].

A set of evaluation indicators is used to assess different aspects of performance.

Mean fitness is used to evaluate the average performance of the optimizer over all the runs [50].

Standard deviation (Std) represents the variation of the best solutions obtained by running a stochastic optimizer for M different runs. Std is used as an indicator of optimization stability and robustness: a smaller Std means that the optimizer consistently converges to the same solution, while larger values indicate a more random performance [51].

Average feature size is the average ratio of the number of selected features to the total number of features, which is the secondary objective of the fitness function. This measure can be formulated as in Eq. (23):

$$\begin{aligned} \mathrm{Avg}SZ= \frac{1}{M}\sum _{i=1}^M \frac{\mathrm{size}(g_*^i)}{D}, \end{aligned}$$
(23)

where \(\mathrm{size}(x)\) is the number of selected (on) values in the vector x, and D is the number of features in the original dataset.

The t test is a statistical significance measure that indicates whether the difference between two groups' means most likely reflects a real difference in the populations from which the groups were sampled [52].
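A sketch of how these four measures could be computed from the M best-run results of each optimizer, assuming SciPy for the t test; all variable names are illustrative.

```python
import numpy as np
from scipy import stats

def summarize(best_fitness, best_masks, D):
    mean_fit = float(np.mean(best_fitness))   # mean fitness over M runs
    std_fit = float(np.std(best_fitness))     # repeatability indicator
    avg_size = float(np.mean([m.sum() / D for m in best_masks]))  # Eq. (23)
    return mean_fit, std_fit, avg_size

def significance(fitness_a, fitness_b):
    # two-sample t test on the best-fitness samples of two optimizers
    return stats.ttest_ind(fitness_a, fitness_b).pvalue
```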

Table 5 Standard deviation of fitness calculated for multi-agent optimizers over various datasets using uniform initialization
Table 6 Standard deviation of fitness calculated for multi-agent optimizers over various datasets

The above four measures are applied directly to the fitness values obtained on the validation set.

To assess the future performance of the whole feature selection model, the following two indicators are used:

Mean test error is used to evaluate the performance of the feature selection on data that the classifier has never seen.

Average Fisher score assesses a feature subset such that the distances between data points in different classes are as large as possible, while the distances between data points in the same class are as small as possible [53]. The Fisher score in this work is calculated for individual features given the class labels, as in Eq. (24):

$$\begin{aligned} F_j=\frac{\sum _{k=1}^cn_k(\mu _k^j-\mu ^j)^2}{(\sigma ^j)^2}, \end{aligned}$$
(24)

where \(F_j\) is the Fisher index of feature j, \(\mu ^j\) and \((\sigma ^j)^2\) are the mean and variance of feature j over the whole dataset, \(n_k\) is the size of class k, and \(\mu _k^j\) is the mean of feature j within class k.
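A sketch of this per-feature score, following Eq. (24) as printed (with the whole-dataset variance in the denominator):

```python
import numpy as np

def fisher_scores(X, y):
    # X: (n_samples, n_features) data matrix; y: class labels
    mu = X.mean(axis=0)                      # per-feature dataset mean mu^j
    var = X.var(axis=0) + 1e-12              # (sigma^j)^2, guarded against zero
    num = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        num += Xk.shape[0] * (Xk.mean(axis=0) - mu) ** 2   # n_k (mu_k^j - mu^j)^2
    return num / var                         # Eq. (24), one score per feature
```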

Table 7 Average feature size using different initializers for all multi-agent optimizers over various datasets and feature size ratio for the deterministic single-solution methods
Table 8 Mean test error using different initializers for all multi-agent optimizers over various datasets and test error for deterministic single-solution methods

The average F-score is calculated as the average Fisher score sum of the optimal solutions found by running each optimizer M times.

Average runtime is used in this study to assess the average processing time of the different methods.

5.3 Results and discussion

The reference methods adopted in this work are the deterministic single-solution methods, namely sequential forward selection (SFS), sequential backward selection (SBS), and their floating versions SFFS and SFBS, based on the versions proposed in [54]. In addition, two common multi-agent optimizers are adopted, namely genetic algorithms (GA) [55] and particle swarm optimization (PSO) [56].

Table 3 outlines the mean fitness acquired over the runs of the various multi-agent optimizers, as well as the same fitness calculated for the features selected by the deterministic single-solution methods. We can remark from the table that the optimal solution obtained by the proposed LALO is much better than those of the other optimization methods given the same number of iterations. We can also remark that, regardless of the initializer used, LALO reaches much better optima than ALO. The enhanced performance can be explained by the fact that LALO generates, in each iteration, a set of random solutions far from the current optimum, allowing the optimizer to escape from local optima; such diversity of solutions even in the last iterations is a consequence of the thick tail of the Lèvy distribution. Besides, LALO can generate diverse locations for the search agents even when initialized with nearby solutions, as in the case of the large initialization method, while ALO cannot reach this goal. One can also remark the impact of the initialization on the performance of all multi-agent optimizers. On the other hand, we can observe that the forward and backward iterative methods are easily stuck at the first optimal solution found in the search space. They also suffer from the nesting problem, which makes it difficult to abandon a feature once it has been selected (SFS) or removed (SBS).

Table 9 Average F-score using different initializers for all multi-agent optimizers over various datasets and the Fisher score calculated for the deterministic single-solution methods

The MRMR initializer places one search agent very close to the global optimum, as it initializes that agent using entropy and correlation constraints. Therefore, we can remark the enhancement in the performance of the MRMR method over the other methods. We can also remark that, although MRMR initializes a search agent with information related to the data (mutual information), the result is still not optimal, since this information is not directly related to the classifier performance used in wrapper-based feature selection. Moreover, we may remark that initializers producing a diverse population (the uniform and mixed initializers) perform better than initializers with less diversity (the small and large initializers). To confirm the significance of the performance advance, the p value of the t test is calculated for the two optimizers and outlined in Table 4. We can remark a significant enhancement in the performance of LALO over ALO at a significance level of \(5\%\) using the small, large, mixed, and uniform initializers. Lower significance is observed with the MRMR initializer, which can be explained by the fact that MRMR always initializes the optimizers with a solution close to the optimum and hence eases the optimization task.

The standard deviation of the optimal fitness value obtained over all runs of an optimizer can be used as an indicator of result repeatability and performance robustness. This measure is depicted in Tables 5 and 6 for all optimizers over all datasets. We can notice from the tables that both antlion optimizers have comparable standard deviations, which proves that even with a random distribution of infinite variance, repeatable results can still be obtained thanks to the selection operator used in LALO. Furthermore, LALO and ALO generally have standard deviations lower than those obtained for GA and PSO, which proves the capability of ALO and LALO to converge to optimal or near-optimal solutions regardless of the random exploration and exploitation used, thanks to the adaptive control of the step size/exploration rate adopted in both methods.

The secondary goal in the fitness function is the selected feature size. This indicator is calculated and depicted in Table 7. We can remark that this objective is also minimized and that LALO keeps its superiority in performance in comparison with ALO, PSO, and GA. The addition of this secondary objective complicates the shape of the fitness function, which is still successfully minimized by the proposed LALO while allowing the optimizer to carefully add extra features that maximize classification performance. We can recognize that the deterministic single-solution methods generally perform better on this indicator, which is natural given the very cautious moves of such algorithms, but of course at the price of classification accuracy.

Table 8 depicts the performance of the optimizers on test data never seen by the classifier. We can remark that the features selected by LALO are well suited for future performance, thanks to the design of the fitness function and the capability of LALO to optimize it. This performance proves the ability of the proposed LALO to adaptively search the feature space for an optimal combination of features maximizing classification performance.

Table 10 Average runtime using uniform initialization for all multi-agent optimizers over various datasets and for the deterministic single-solution methods

Another indicator for assessing the future performance of the selected features is depicted in Table 9. The Fisher index measures data compactness relative to feature span, which eases the design of a classifier based on the selected features. We can observe that the feature sets selected by LALO are much better, thanks to its capability to search the designed fitness landscape for the global optimum.

A final indicator, related to runtime, is depicted in Table 10, where the runtime of each optimizer is measured and summed over all the initializers used. We can remark from the table that the runtime of LALO is much better than that of ALO, thanks to the simple implementation of the Lèvy random number generator as opposed to the complete random walk provided in the published implementation of ALO [32]. Moreover, in contrast to sequentially adding or removing features as in SFS, SBS, SFFS, and SFBS, multi-agent optimization methods can add features in batches. Hence, LALO provides enhanced time performance while keeping the better classification performance shown above.

6 Conclusions and future work

A novel variant of ALO based on Lèvy flights is proposed. This study applies the Lèvy antlion optimization (LALO) algorithm to feature selection, tests it on 21 different datasets, and compares its results with the original ALO. Five initialization methods are used to assess the local and global search capability of each algorithm. The results are evaluated using different indicators assessing convergence, repeatability, feature reduction size, performance on test data, and statistical significance, as well as runtime.

We can conclude the following:

  • Lèvy flights help the ALO to search the space efficiently for the optimal solution, avoiding premature convergence and stagnation.

  • The proposed LALO can provide repeatable results with efficient performance, even for complex fitness function shapes.

  • The initial positions of the search agents affect the performance of the different optimizers; generally, initializers that provide an initially diverse population are better. Moreover, initializers such as MRMR, which seed the search agents with solutions that have many desirable properties, achieve better performance.