1 Introduction

Nowadays, with advances in technology and the consequent proliferation of data collection tools, many fields, including business and industry [1, 39, 40], biomedicine [3, 31], and government transactions [12], have been computerized. In addition, the widespread use of a global information system, the World Wide Web, has drowned us in a huge amount of data and information [12]. No one has time to inspect such large volumes of data manually. Therefore, we are forced to use new techniques and automatic tools to extract useful information and knowledge from this tremendous amount of data.

Data mining is known as a promising frontier in data and information systems. It can automatically extract useful knowledge that is tacitly stored and hidden in large datasets. Classification, one of the basic techniques in data mining, is often confronted with datasets containing a large number of features. Reducing the number of features can simplify the classifier and decrease the computational cost [27].

Feature Extraction (FE) and Feature Selection (FS) are two pre-processing methods that are usually applied to large datasets in order to reduce their high dimensionality. FE reduces the number of features by combining the original features to construct new ones [5]. FS, in contrast, decreases the dimension of the dataset by selecting a subset of the original features. FS approaches are more interpretable than FE approaches because they preserve the physical meaning of the original features [13]. Therefore, FS approaches can improve classification accuracy by removing irrelevant/redundant features and choosing an informative feature subset. The following two issues of FS approaches have attracted particular attention from researchers in recent years:

  1. Defining a suitable evaluation criterion: Based on the subset evaluation method, FS approaches can be divided into two main classes: filter approaches and wrapper approaches. In filter approaches, each selected subset is evaluated independently of any classifier, using only intrinsic properties of the data; in wrapper approaches, the performance of a specific classifier, used as a learning algorithm, evaluates each subset.

  2. Introducing a search strategy: Choosing a search strategy from heuristic/random searches is another important challenge in FS. An exhaustive search is the only strategy guaranteed to find the globally optimal subset; however, given a dataset with n features, there are \(2^{n}\) candidate subsets, so the number of candidates grows exponentially with the number of features. For this reason, a variety of heuristic search strategies, including greedy search, local search, and evolutionary/meta-heuristic algorithms (random search) [33], have been adopted for FS. Each of these strategies suffers from problems such as the nesting effect, stagnation in local optima, and premature convergence.

However, evolutionary algorithms, with their global search ability and high computational efficiency, are very appropriate for solving different optimization problems (OPs) such as FS. Many algorithms of this kind have been designed to solve OPs, and some of them produce better feasible solutions for certain OPs than others. Tsai et al. [29] proposed a combined feature and instance selection approach based on the Genetic Algorithm (GA) and showed that performing feature and instance selection together usually makes the classifiers perform slightly poorer than either approach individually. The feature selection concept proposed by Sikora and Piramuthu [28] addresses two issues: self-adaptation and the use of non-coding material (called introns), inspired by the existence of non-coding DNA in biological systems, in the chromosome structure of the GA. Vieira et al. [31] suggested a modified BPSO method for feature selection to cope with the premature convergence of the original Binary Particle Swarm Optimization (BPSO) [32]; in that article, the SVM kernel parameters are optimized simultaneously. Designing a new fitness function [36] is another way to address the disadvantages of the original BPSO. A novel FS approach based on continuous PSO was introduced by Xue et al. [37], which includes a new initialization strategy and new personal best (pbest) and global best (gbest) updating mechanisms. The approach proposed by Papa et al. [21] uses a recent optimization algorithm, namely the Gravitational Search Algorithm (GSA), together with the Optimum-Path Forest (OPF) classifier to obtain a fast and accurate FS method. Also, optimizing the SVM parameters, such as the penalty parameter and the parameter of the Radial Basis Function (RBF) kernel, using GSA, as proposed by Li et al. [18], can improve the performance of FS. Using a fuzzy controller to adapt the gravitational coefficient, an important parameter in the search process of GSA, and applying it in the field of data mining is a typical scheme suggested by Zahiri [38]. Improving the original binary GSA using a piecewise linear chaotic map and sequential quadratic programming is a novel FS approach that Xiang et al. [35] applied in order to find a subset with the minimum number of features and relatively high accuracy. Combining the advantages of both versions of GSA (i.e., real and binary) and improving BGSA for solving the FS problem has produced a suitable content-based image retrieval system [25]. The automatic feature selection method based on GSA proposed by Kumar et al. [16] is an unsupervised approach that uses statistical properties of the dataset to define the thresholds for selecting relevant features. Barani et al. [4] combined the Binary Quantum-Inspired Gravitational Search Algorithm (BQIGSA) proposed by Nezamabadi-pour [20] with the K-NN classifier to solve FS as a binary problem.

Therefore, GSA is a suitable choice for solving the FS problem [4, 16, 18, 21, 25, 35, 38]. However, the standard GSA is plagued by two major problems: premature convergence due to a rapid reduction of diversity, and convergence that is fast at the beginning of the search process but slows down considerably as the solutions approach the global optimum. Consequently, obtaining a global or near-global solution with the standard GSA requires an impractically large number of iterations [35]. Researchers have suggested a large number of ideas for overcoming these disadvantages and improving performance. These ideas fall into two strategies:

  • In the first strategy, each idea adds a new operator to the set of standard GSA operators or modifies an existing operator. Sarafrazi et al. [26] added a new operator called “Disruption”, based on astrophysics, to the GSA operators to improve the exploration and exploitation abilities of the standard GSA. Mirjalili and Hashim [19] and Li and Zhou [17] modified the velocity updating operator of the standard GSA by incorporating the velocity updating strategy of PSO, and named the result GSAPSO. Also, Han and Chang [11], in their proposed CHGSA method, tried to improve the performance of GSA by modifying the velocity updating operator with a chaotic operator. Moreover, Nezamabadi-pour [20], by employing the stochastic characteristics of quantum systems, provided a quantum-inspired GSA with greater exploration capability for solving binary problems.

  • The second strategy tries to solve the premature convergence problem of the standard GSA by adding the advantages of a local search or another global search algorithm to it. Chen et al. [6] and Chengyi [7] combined the standard GSA with Simulated Annealing (SA) to combine the advantages and overcome the disadvantages of both algorithms. Also, Khadanga and Satapathy [14] proposed the HGA-GSA method, which escapes local optima and has good convergence characteristics for finding the global optimum.

In this article, we propose a method for solving FS, as an optimization problem (OP), using BGSA. The main goal of the proposed method is to select a feature subset of the smallest possible size. To achieve this goal, we apply the second strategy for modifying BGSA and suggest a two-stage method based on the combination of BGSA with SA as a local search algorithm.

The rest of this paper is organized as follows: A brief summary of background information and necessary preliminaries are described in Section 2. Feature selection based on the proposed two-stage method is discussed in Section 3. In Section 4, the experimental setups and results are presented. Finally, the main conclusions of this article are summarized in Section 5.

2 Preliminaries

In this section, we describe the algorithms required by the proposed method. Section 2.1 presents the original binary GSA. Section 2.2 introduces two approaches (i.e., GSAPSO and CHGSA) that improve the original real GSA, which we then utilize in the binary version of GSA. Section 2.3 describes a local search algorithm, namely simulated annealing (SA).

2.1 Binary gravitational search algorithm (BGSA)

The gravitational search algorithm (GSA) is a random search algorithm based on the law of gravity and the interaction of masses. GSA, a recent population-based heuristic algorithm, has been proposed in two versions: real [23] and binary [24]. According to Newtonian gravity, “in the universe, each particle attracts the others with a force which is obtained based on their masses and the square of the distance between them” [21].

Suppose there are N particles/agents, where the position of the i-th agent is defined as follows:

$$ \left( {x_{i}^{1}},\mathellipsis,{x_{i}^{d}},\mathellipsis,{x_{i}^{m}} \right) \qquad i = 1,2,...,N $$
(1)

where \({x_{i}^{d}}\) is the position of the i-th agent in the d-th dimension and m is the dimension of the search space. Note that in BGSA, each dimension takes the value 0 or 1; therefore, movement in a dimension means flipping its value from 0 to 1 or vice versa. The mass of each agent is computed from the current population’s fitness as in (2) and then normalized as in (3):

$$ q_{i}\left( t \right)=\frac{{fit}_{i}\left( t \right)-worst(t)}{best\left( t \right)-worst(t)} $$
(2)
$$ M_{i}\left( t \right)=\frac{q_{i}(t)}{\sum_{j=1}^{N} q_{j}(t)} $$
(3)

where Mi(t) and \({fit}_{i}\left (t \right )\) represent the mass and the fitness value of agent i at iteration t, respectively, and \( worst\left (t \right ) \) and best(t) are the worst and best fitness values in the population of agents at iteration t. The total force exerted on an agent by a set of heavier agents is computed based on the law of gravity, i.e., (4):

$$ {F_{i}^{d}}\left( t \right)=\sum\limits_{j\in kbest,\, j\ne i} {{rand}_{j}^{d}\, G(t)\,\frac{M_{j}(t)M_{i}(t)}{R_{ij}\left( t \right)+\varepsilon} } \left( {x_{j}^{d}}\left( t \right)-{x_{i}^{d}}\left( t \right) \right) $$
(4)

where rand is a random number drawn uniformly from the interval [0, 1], ε is a small value, and Rij(t) is the Hamming distance between agents i and j. kbest is a function of time, initialized to K0 (here, the total number of agents) and decreased linearly over time, so that kbest is the set of the first K agents with the best fitness values and the largest masses. The gravitational constant (G) takes an initial value and is reduced linearly with time as follows:

$$ G\left( t \right)=G_{0}\left( 1-\frac{t}{\text{IT}} \right) $$
(5)

where G0 is its initial value and IT is the total number of iterations. Afterward, the next velocity of an agent is computed as a fraction of its current velocity added to its acceleration, which is obtained according to (6):

$$ {a_{i}^{d}}\left( t \right)=\frac{{F_{i}^{d}}(t)}{M_{i}(t)} $$
(6)
$$ {v_{i}^{d}}\left( t + 1 \right)={rand}_{i}^{d}{v_{i}^{d}}\left( t \right)+{a_{i}^{d}}(t) $$
(7)

For good convergence of the algorithm, \({v_{i}^{d}}\) must be limited within a suitable bound \(\left( \left| {v_{i}^{d}} \right|\le v_{max} \right)\). The agents are moved by the rule in (9), i.e., each position bit is flipped with probability \(S({v_{i}^{d}})\) as defined in (8):

$$ S\left( {v_{i}^{d}}\left( t + 1 \right) \right)=\left| tanh\left( {v_{i}^{d}}(t + 1) \right) \right| $$
(8)
$$ {x_{i}^{d}}\left( t + 1 \right)=\left\{ {\begin{array}{cc} complement \left( {x_{i}^{d}}\left( t \right) \right) &\quad if~rand<S\left( {v_{i}^{d}}\left( t + 1 \right) \right) \\ {x_{i}^{d}}\left( t \right)& o.w \end{array}} \right\} $$
(9)
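
Although the experiments in this paper were implemented in MATLAB, the update rules (2)–(9) are compact enough to sketch in a few lines of Python. The sketch below is illustrative only, not the authors' code; the fitness values are assumed to be given for a maximization problem, and G0, v_max, and the linear shrinking of kbest are placeholder choices.

```python
import numpy as np

def bgsa_step(X, fitness, V, t, IT, G0=100.0, v_max=6.0, eps=1e-10, rng=None):
    """One BGSA iteration, Eqs. (2)-(9). X: (N, m) binary (0/1 int) positions,
    fitness: (N,) values to maximize, V: (N, m) velocities."""
    rng = np.random.default_rng() if rng is None else rng
    N, m = X.shape

    # Eqs. (2)-(3): masses from fitness, then normalization
    best, worst = fitness.max(), fitness.min()
    q = (fitness - worst) / (best - worst + eps)
    M = q / (q.sum() + eps)

    # Eq. (5): gravitational constant decreases linearly with time
    G = G0 * (1.0 - t / IT)

    # kbest: indices of the heaviest agents, shrinking linearly from N to 1
    k = int(round(N - (N - 1) * t / IT))
    kbest = np.argsort(M)[::-1][:k]

    # Eq. (4): total force on each agent from the kbest heavier agents
    F = np.zeros((N, m))
    for i in range(N):
        for j in kbest:
            if j != i:
                R = np.sum(X[i] != X[j])          # Hamming distance R_ij
                F[i] += (rng.random(m) * G * M[i] * M[j] / (R + eps)
                         * (X[j] - X[i]))

    # Eqs. (6)-(7): acceleration and velocity update, clipped to [-v_max, v_max]
    a = F / (M[:, None] + eps)
    V = np.clip(rng.random((N, m)) * V + a, -v_max, v_max)

    # Eqs. (8)-(9): flip each bit with probability |tanh(v)|
    flip = rng.random((N, m)) < np.abs(np.tanh(V))
    X = np.where(flip, 1 - X, X)
    return X, V
```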

2.2 Improved GSA

In this section, we introduce two improved versions of the real GSA that have been proposed by Mirjalili and Hashim [19], Li and Zhou [17], and Han and Chang [11]. We then apply them to the FS problem.

2.2.1 Improved GSA by PSO (GSAPSO)

As stated in (7), the direction of an agent in GSA is determined only by the overall force exerted by the other agents, which implies that each agent is memory-less. Mirjalili and Hashim [19] and Li and Zhou [17] claimed that this limitation affects the performance of the real version of GSA (RGSA). They demonstrated the effectiveness of their idea not only on benchmark optimization functions but also on specific OPs such as parameter identification of a hydraulic turbine governing system. To this end, they added two characteristics of PSO to the velocity updating operator: the memory of each individual and the social information of the population. They thus proposed the new velocity updating operator in (10), which obeys the law of gravity while also taking into account the agent’s memory and the social information:

$$\begin{array}{@{}rcl@{}} {v_{i}^{d}}\left( t + 1 \right)&=&{rand}_{i}^{d}{v_{i}^{d}}\left( t \right)+{a_{i}^{d}}\left( t \right)\\ &&+c_{1}{rand}_{i}^{d}\left( {pbest}_{i}^{d}\left( t \right)-{x_{i}^{d}}(t) \right)\\ &&+c_{2}{rand}_{i}^{d}\left( {gbest}^{d}\left( t \right)-{x_{i}^{d}}(t) \right) \end{array} $$
(10)

where c1 and c2 are positive constants, \({pbest}_{i}^{d} \) is the best position found so far by the i-th agent in the d-th dimension up to iteration t, and gbestd is the best position found so far by the whole population in the d-th dimension.
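
Relative to the plain velocity update in (7), the change is confined to the two PSO terms. A minimal sketch of (10) follows, where c1, c2 and the pbest/gbest arrays are illustrative inputs rather than the parameter values used in the experiments.

```python
import numpy as np

def gsapso_velocity(V, a, X, pbest, gbest, c1=0.5, c2=1.5, rng=None):
    """Eq. (10): GSA velocity update extended with PSO memory/social terms.
    V, a, X, pbest are (N, m) arrays; gbest is an (m,) array."""
    rng = np.random.default_rng() if rng is None else rng
    N, m = V.shape
    return (rng.random((N, m)) * V + a
            + c1 * rng.random((N, m)) * (pbest - X)
            + c2 * rng.random((N, m)) * (gbest - X))
```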

2.2.2 Improved GSA by chaotic (CHGSA)

The random number used in (7) is intended to counteract premature convergence and to avoid local optima. Han and Chang [11] claimed that adding a chaotic operator further enhances the ability to escape these two problems. For this reason, they selected the logistic map to generate chaotic variables and used them to construct a new velocity updating operator as follows:

$$ {v_{i}^{d}}\left( t + 1 \right)={rand}_{i}^{d}{v_{i}^{d}}\left( t \right)+{a_{i}^{d}}\left( t \right)+\xi \left( {c_{i}^{d}}-0.5 \right) $$
(11)

where ξ is a factor controlling the scope of the chaos and ci is a chaotic vector produced by (12) below:

$$ {c_{i}^{d}}=rc_{i-1}^{d}\left( 1-c_{i-1}^{d} \right) \qquad i = 2,\mathellipsis, N $$
(12)

where \({c_{i}^{d}}\in [0,1]\) is a component of the d-dimensional chaotic vector and r ∈ [0, 4] is the control parameter of the logistic map.
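
A small sketch of (11)–(12), with illustrative values of ξ and r rather than the authors' settings, is:

```python
import numpy as np

def chaotic_sequence(n, dim, r=4.0, rng=None):
    """Eq. (12): logistic-map sequence of n chaotic vectors of length dim."""
    rng = np.random.default_rng() if rng is None else rng
    c = np.empty((n, dim))
    c[0] = rng.random(dim)                  # initial chaotic vector c_1
    for i in range(1, n):
        c[i] = r * c[i - 1] * (1.0 - c[i - 1])
    return c

def chgsa_velocity(V, a, c, xi=1.0, rng=None):
    """Eq. (11): velocity update perturbed by the chaotic term xi * (c - 0.5)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.random(V.shape) * V + a + xi * (c - 0.5)
```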

2.2.3 Utilizing two above improvements for binary version of GSA

As reported by Li and Zhou [17] and Han and Chang [11], these two improvements have a beneficial effect on solving particular problems.

In the experimental results section of this article, we substitute (10) and (11), respectively, for (7) in the binary GSA (BGSA) and discuss their influence on solving the FS problem.

2.3 Simulated annealing

The simulated annealing algorithm is a common search algorithm derived from the annealing of solids, in which a solid is heated to a temperature above its melting point and then cooled gradually. At a sufficiently high temperature, the particles of the solid move in a disordered state and the internal energy increases. During the cooling process, the temperature of the melted solid is reduced in such a way that the particles return to an ordered arrangement. In other words, the goal is to obtain the best crystal structure at normal temperature with the least internal energy, at which the solid is in thermodynamic equilibrium [15].

The solutions proposed at high temperature are not necessarily good; the algorithm therefore tries to reach better solutions at low temperature by decreasing the temperature over time. In other words, the algorithm starts with an initial solution and an initial temperature, and the next solution is generated according to the following procedure:

Assume that the OP is a maximization problem, S is the current solution, and \(S^{\prime }\) is a new solution in its neighborhood. The fitness function is computed for both solutions and their difference is considered as follows:

$$ {\Delta} =f\left( S^{\prime} \right)-f(S) $$
(13)

Therefore, the move in the search space from the current solution to the new solution is accepted if Δ > 0; otherwise, it is accepted with probability \(e^{{\Delta}/T}\) (equivalently, \(e^{-\left| {\Delta} \right|/T}\)), where T is the temperature. Observe that as T decreases, this acceptance probability decreases, so unsuitable solutions are gradually excluded from the search.

Initializing and reducing the temperature are two important challenges in this algorithm. If a very large initial value is chosen, the solving time increases and unnecessary searches are performed; nevertheless, the initial temperature should be large enough that almost all possible solutions in the search space can be reached. Moreover, there are several schemes for reducing the temperature, among which the geometric scheme is the most commonly used. In this scheme, the temperature parameter T is decreased as follows:

$$ T_{t + 1}=\lambda T_{t} $$
(14)

where t is the current iteration number and 0 < λ < 1 is the rate of temperature reduction. The closer λ is to 1, the longer the algorithm runs, but the better the quality of the final solution.

Finally, reaching a target temperature (usually close to 0) or reaching the maximum number of iterations are the two stopping conditions of this algorithm.
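
A minimal Python sketch of this acceptance rule and of the geometric cooling schedule for a maximization problem is given below; the fitness and neighbor functions, the initial temperature, λ, and the iteration limit are generic placeholders rather than the settings used later in the experiments.

```python
import math
import random

def simulated_annealing(init_solution, fitness, neighbor,
                        T0=1.0, lam=0.95, max_iter=100, rng=None):
    """Generic SA for maximization: acceptance rule of Eq. (13) and
    geometric cooling of Eq. (14)."""
    rng = rng or random.Random()
    current, f_cur = init_solution, fitness(init_solution)
    best, f_best = current, f_cur
    T = T0
    for _ in range(max_iter):
        candidate = neighbor(current, rng)
        delta = fitness(candidate) - f_cur           # Eq. (13)
        # always accept improvements; accept worse moves with prob. exp(delta/T)
        if delta > 0 or rng.random() < math.exp(delta / T):
            current, f_cur = candidate, f_cur + delta
        if f_cur > f_best:
            best, f_best = current, f_cur
        T *= lam                                      # geometric cooling, Eq. (14)
    return best, f_best
```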

3 Proposed method

In this section, we present a two-stage method, named BGSA-SA, for the FS problem. Its main goal is the rapid selection of the smallest feature subset with suitable, acceptable accuracy. We first formulate the FS problem in Section 3.1 and then describe the classification algorithms used in Section 3.2. Finally, in Section 3.3 the details of the proposed FS method are given in the form of both a flowchart and pseudo-code.

3.1 Formulation of the FS problem

The feature selection problem means preserving some of the features as relevant and removing the rest as redundant/irrelevant. In other words, FS can be expressed as the problem of deciding the presence or absence of each feature in the dataset produced by the pre-processing step. Thus, it is common to define FS as a binary problem and to represent points in the search space as binary vectors. Each individual in the search space is therefore encoded as the binary vector shown in Fig. 1. In this formulation, the i-th feature is included if the value of the i-th bit is 1; otherwise, the feature is excluded [35].

Fig. 1  Binary encoding for FS
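
In practice, applying such a binary mask to a dataset amounts to keeping the columns whose bit is 1. The following tiny example (with made-up feature values) illustrates this; it is not part of the proposed method itself.

```python
import numpy as np

# A dataset with 5 features and a binary agent (mask): the 1st, 3rd and 5th features are kept.
X = np.array([[5.1, 3.5, 1.4, 0.2, 7.0],
              [4.9, 3.0, 1.3, 0.2, 6.4]])
mask = np.array([1, 0, 1, 0, 1], dtype=bool)

X_selected = X[:, mask]      # keep only the columns of the selected features
print(X_selected.shape)      # (2, 3)
```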

3.2 The used classification algorithms

Since the proposed method follows the wrapper approach, a classification algorithm is needed to implement it. For this reason, we use both the K-Nearest Neighbor (K-NN) and Support Vector Machine (SVM) algorithms as common classifiers.

3.2.1 K-Nearest Neighbor (K-NN)

The K-Nearest Neighbor algorithm is one of the most popular lazy learners and is easy to implement. It was described in the early 1950s and has been widely applied in pattern recognition and data mining [8, 10, 12]. Learning in this classifier is based on the similarity of the test sample to its K nearest training samples. Since each sample is represented as a point in the feature space, the similarity between two samples is defined in terms of a distance metric. There are many metrics for calculating the distance between a pair of points; the Euclidean distance, the metric most commonly used with K-NN, computes the distance of two n-dimensional points \(P_{1}=\left (p_{11},p_{12}, \cdots , p_{1n} \right )\) and \(P_{2}=\left (p_{21},p_{22}, \cdots , p_{2n} \right )\) as follows:

$$ dist\left( P_{1},P_{2} \right)=\sqrt[2]{{\sum}_{i = 1}^{n} \left( p_{1i}-p_{2i} \right)^{2}} $$
(15)
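
The following compact sketch of a K-NN classifier with the Euclidean distance of (15) is sufficient to serve as the wrapper's learner; it is an illustrative implementation, not the exact code used in the experiments.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=5):
    """Predict labels by majority vote among the k nearest training samples,
    using the Euclidean distance of Eq. (15)."""
    preds = []
    for x in X_test:
        d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # distances to all training points
        nearest = np.argsort(d)[:k]                     # indices of the k closest samples
        preds.append(Counter(y_train[nearest]).most_common(1)[0][0])
    return np.array(preds)
```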

3.2.2 Support vector machine (SVM)

The support vector machine is known as a strong learning algorithm in different fields such as data mining. It was first proposed in 1995 by Vapnik [30] and has been widely used for classifying non-linear datasets.

Consider a two-class problem with a training set D of pairs \(\left (x_{i},y_{i} \right ), i = 1, 2, \cdots , \left | D \right |\), where xi ∈ Rn is an n-dimensional input vector with associated class label \( y_{i}\in \left \{ + 1,-1 \right \}\). Any linear separating hyperplane can be written as follows:

$$ W^{T}.X+b = 0 $$
(16)

where W is a weight vector (in fact, the vector orthogonal to the decision surface) and b is the bias. The optimal separating hyperplane is the hyperplane with the maximum margin; in other words, the sum of the distances from the hyperplane to the closest positive and negative training samples is maximized. The following optimization problem must be solved to achieve the maximum margin:

$$ {\min}_{W,b}{\frac{1}{2}\left\| W \right\|^{2}} $$
(17)

Subject to:

$$y_{i}.\left( W^{T}.x_{i}+b \right)-1\ge 0 \qquad i = 1, 2, \cdots, \left| D \right| $$

To solve the above optimization problem, its dual formulation is constructed using the technique of Lagrange multipliers as follows:

$$ L=\frac{1}{2}\left\| W \right\|^{2}-{\sum}_{i = 1}^{\left| D \right|} {a_{i}\left[ y_{i}.\left( W^{T}.x_{i}+b \right)-1 \right]} $$
(18)

where a is the vector of Lagrange multipliers. By applying the KKT conditions and eliminating W and b from (18), the dual formulation of (17) can be written as follows:

$$ {\max}_{a}{\sum}_{i = 1}^{\left| D \right|} {a_{i}-\frac{1}{2}{\sum}_{i = 1}^{\left| D \right|} {\sum}_{j = 1}^{\left| D \right|} {a_{i}a_{j}y_{i}y_{j}\, x_{i}\cdot x_{j}}} $$
(19)

Subject to:

$${\sum}_{i = 1}^{\left| D \right|} {a_{i}y_{i}= 0} \qquad a_{i}\ge 0 \qquad i = 1, 2, \cdots, \left| D \right| $$

Note that \(W={\sum }_{i = 1}^{\left | D \right |} {a_{i}y_{i}x_{i}} \); therefore, the decision function is defined as follows:

$$ f\left( x \right)=sign\left( {\sum}_{i = 1}^{\left| D \right|} {a_{i}y_{i}\, x\cdot x_{i}+b} \right) $$
(20)

Suppose now that the training data are not linearly separable. In this case, SVM finds a non-linear decision boundary in the input space by searching for a linear hyperplane in a new high-dimensional space. Finding the optimal hyperplane in the new space is done by solving the quadratic optimization problem defined in (21):

$$ {\min}_{W,b}{\frac{1}{2}\left\| W \right\|^{2}}+C{\sum}_{i = 1}^{\left| D \right|} \zeta_{i} $$
(21)

Subject to:

$$y_{i}.\left( W^{T}.\phi (x_{i})+b \right)\ge 1-\zeta_{i}, \quad \zeta_{i}\ge 0 \quad i = 1, 2, \cdots, \left| D \right| $$

where ϕ(x) is a nonlinear mapping function used to transform each training sample into the new high-dimensional space and C is a penalty parameter on the training error.

With this mapping, the inner product of two training samples xi and xj in the decision function (20) is replaced by \( \phi \left (x_{i} \right ).\phi (x_{j})\). Mathematically, computing the inner product of the mapped training data is equivalent to applying a kernel function \(K\left (x_{i},x_{j} \right )= \phi \left (x_{i} \right ).\phi (x_{j})\) to the original training data. So, (20) can be rewritten as follows:

$$ f\left( x \right)=sign\left( {\sum}_{i = 1}^{\left| D \right|} {a_{i}y_{i}K\left( x,x_{i} \right)+b} \right) $$
(22)

Among the various functions admissible as kernel functions for the nonlinear SVM, the Radial Basis Function (RBF) is the most common and is applied in many applications. The RBF kernel function is defined as follows:

$$ K\left( x_{i},x_{j} \right)=\exp\left( -\frac{\left\| x_{i}-x_{j} \right\|^{2}}{2\sigma^{2}} \right) $$
(23)

where σ > 0 is the kernel parameter that controls the width of the kernel.
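
As a quick numerical illustration of (23) (a sketch only, not the SVM training code used in the paper), the RBF kernel value between two feature vectors can be computed as follows:

```python
import numpy as np

def rbf_kernel(x_i, x_j, sigma=1.0):
    """Eq. (23): RBF kernel value between two feature vectors."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2.0 * sigma ** 2))

print(rbf_kernel(np.array([1.0, 2.0]), np.array([1.5, 1.0])))   # ~0.535
```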

3.2.3 Classifier parameter setting

In the K-NN algorithm, there is only one key parameter, K, that must be adjusted, and an appropriate value for K is usually determined experimentally. In this article, we consider three values for K (i.e., 4, 5, and 6) and select the value with the maximum reduction rate as the final value.

In the nonlinear SVM classifier, the penalty parameter C and the parameter of the chosen kernel function play an effective role in SVM learning. In this article, we apply the RBF as the kernel function and set the parameters of the nonlinear SVM to the default values of the MATLAB software (i.e., C = 1 and σ = 1).

3.3 Feature selection based on two-stage method

To construct the two-stage method for solving FS as our particular optimization problem, we apply the original BGSA and SA in two stages. In the first stage, BGSA explores the search space with a population of N agents. After each iteration, the best agent of that iteration is compared with the best agent of the previous iterations and labeled as the best solution found so far. At the end of the last iteration, this best solution becomes the final solution of the BGSA stage. The second stage then starts, with the output of BGSA selected as the starting point of SA, and the solution found by SA is returned as the final solution of the FS problem. In the rest of this subsection, we first describe the representation of each agent in the two stages; second, we define a suitable fitness function for evaluating each candidate solution in both algorithms; third, we discuss some essential changes in the SA algorithm; and finally, we present the details of the proposed method in the form of a flowchart and pseudo-code.

3.3.1 Solution representation

In the first stage of the proposed two-stage method, the length of each agent in BGSA must equal the dimension of the given dataset. Each dimension of an agent takes the value 0 or 1, as described in the FS formulation of Section 3.1. Note that in the first iteration, all agents of BGSA are initialized randomly. The mass of each agent in BGSA is computed from the accuracy, which is defined as the fitness function in Section 3.3.2. The initial velocity of each dimension of an agent is a random value in the interval \(\left [ v_{min},v_{max} \right ]\).

In the second stage, however, the length of the initial solution, which serves as the starting point, must equal the number of features selected by BGSA in the previous stage. In the first iteration, every dimension of the initial solution has the value 1; in other words, all selected features are included at the start of the second stage. In the following iterations, new solutions are constructed according to the modified neighbor-finding procedure described in Section 3.3.3.

3.3.2 Fitness function

Defining a fitness function for calculating the mass of each agent in BGSA, for obtaining the Δ criterion in SA, and for evaluating each candidate solution in both algorithms is an important issue in designing a meta-heuristic method for an optimization problem. In this article, accuracy is used as the criterion to evaluate classifier performance. This measure is defined as the number of correctly classified examples over the total number of examples [31]. Because most classification problems have two classes, positive and negative, the classified examples can be divided into the four categories of the confusion matrix represented in Table 1 [12].

Table 1 Confusion matrix

For an instance β, TP increases by 1 when β is predicted as Positive and its actual class is Positive; if the true class label of β is Negative, FP increases by 1 instead. Similarly, TN increases by 1 when β is predicted as Negative and its actual class is Negative; if the true class label of β is Positive, FN increases by 1 instead [37]. Therefore, the accuracy measure is defined in (24) using these four categories:

$$ {accu}=\frac{TN+TP}{TP+FP+TN+FN}\ast 100 $$
(24)

Thus, an agent in BGSA with high accuracy as its fitness value attracts the other agents of the next iteration with the greatest force [35], and a neighboring solution in SA with a higher fitness value can replace the current solution.
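
For completeness, a short sketch of this wrapper fitness, which counts the confusion-matrix entries and returns the accuracy of (24), is shown below; the +1/−1 label convention is assumed for illustration.

```python
import numpy as np

def accuracy_fitness(y_true, y_pred, positive=1):
    """Eq. (24): accuracy (%) computed from the confusion-matrix counts."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    return (tp + tn) / (tp + tn + fp + fn) * 100.0
```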

3.3.3 Changing the procedure of finding neighbor in SA

Finding a neighbor in SA is a function that constructs a new solution in the search space by changing some dimensions of the current solution; this function depends on the structure of the problem under study. Because FS is encoded as a binary problem, changing a dimension means flipping its value from 1 to 0 or vice versa. In other words, each dimension of the current solution is flipped with probability p, where p is a number in [0, 1], and the fitness function is then used to compare the current and new solutions.
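
A minimal sketch of this modified neighbor function is given below; the flip probability p = 0.1 is an illustrative placeholder, not the value used in the experiments.

```python
import numpy as np

def flip_neighbor(solution, p=0.1, rng=None):
    """Flip each bit of a binary solution independently with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    flips = rng.random(solution.shape) < p
    return np.where(flips, 1 - solution, solution)
```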

3.3.4 Pseudo-code of the proposed two-stage method for FS

As mentioned earlier, binary encoding is a suitable way of describing the FS problem. For this reason, the binary version of the original GSA (i.e., BGSA) and the original SA with the new neighbor-finding function are selected as the first and second stages of the proposed two-stage method, respectively. Figures 2 and 3 show the flowchart and the pseudo-code of the BGSA-SA two-stage method, respectively.

Fig. 2  Flowchart of the proposed method

Fig. 3  Pseudo-code of the proposed method

4 Experimental setup

In this article, 50 independent runs of several benchmark techniques on various datasets, using two different classification algorithms, are compared. The major goal of this comparison is to identify a method with acceptable performance not only in accuracy but also in reduction rate. Section 4.1 introduces the groups of methods considered as benchmark techniques. The characteristics of the datasets are described in Section 4.2. The parameter settings of the benchmark techniques and of the proposed method are given in Section 4.3. Three evaluation measures used for comparing the performance of the FS approaches are presented in Section 4.4. Section 4.5 presents and discusses the obtained results. The statistical analysis of the BGSA-SA method and the other mentioned methods is reported in Section 4.6. Finally, Section 4.7 analyzes the proposed method and the other methods in terms of time complexity and running time.

4.1 Benchmark techniques

To assess the improvements and validity of the proposed BGSA-SA method, we compare the obtained results with three groups of methods:

  • The traditional meta-heuristic algorithms (i.e., BGSA [24], BGA [29], BPSO [32], and SA [15]) together with the combinational version of them (i.e., BGA-SA and BPSO-SA) are known as the first group of the comparison methods.

  • Two modified versions of the original GSA (i.e., GSAPSO [17] and CHGSA [11]), originally proposed for optimization problems other than FS, form the second group of comparison methods.

  • The state-of-the-art methods in the field of solving FS problems using BGSA are considered as the third group of comparison methods.

4.2 Dataset

We utilize five benchmark datasets chosen from the well-known UCI Machine Learning Repository of the University of California [2]. The characteristics of these datasets are shown in Table 2. For each dataset, the samples are randomly divided into two sets: 70% for the training set and 30% for the testing set.

Table 2 Datasets

In the training process, k-fold cross-validation is used to examine the classification performance of a selected subset of features. The data in the training set are divided randomly into k approximately equal subsets, and a classifier is trained k times; in each run, one subset serves as the validation set and the remaining k − 1 subsets as the training set. The estimated accuracy is the average of the validation accuracies over the k runs [22]. Note that in this article, k-fold cross-validation is applied as an inner loop of the training process with the common value k = 10. In other words, during the training process, 10 different splits into training and validation sets are used to evaluate the accuracy of the selected feature subset. In the testing process, the finally selected features are then applied to the testing set in order to measure their accuracy on unseen samples.
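
As an illustration of this inner loop (a sketch that assumes a generic train-and-evaluate routine, not the exact experimental harness), 10-fold cross-validation over a training set can be written as:

```python
import numpy as np

def cross_val_accuracy(X, y, evaluate, k=10, rng=None):
    """Average validation accuracy over k folds. `evaluate(Xtr, ytr, Xva, yva)`
    must train a classifier on (Xtr, ytr) and return its accuracy on (Xva, yva)."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate(X[train], y[train], X[val], y[val]))
    return float(np.mean(scores))
```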

4.3 Parameter setting

The performance of the proposed BGSA-SA method is compared with that of the benchmark methods mentioned above. For a fair comparison, the parameter settings used for all of them are listed in Table 3. The experiments were performed on a system with 4 GB of RAM and an Intel Core i5 processor running at 2.5 GHz, and all methods were implemented in MATLAB.

Table 3 Parameter setting for all mentioned methods

4.4 Evaluation measure

Accuracy and reduction rate are the two main comparison measures that are widely used for evaluating FS methods. Accuracy is the percentage of samples classified correctly, and the reduction rate is the percentage of features eliminated during the evolution. The accuracy measure, which is also utilized as the fitness function of the mentioned methods, has been defined in (24); the reduction rate measure is defined in (25):

$$ R_{rate}=\left( 1-\frac{number~of~selected~features}{ total~features} \right)\ast 100 $$
(25)

Several additional measures can be defined from the confusion matrix in Table 1. The Area Under the ROC Curve (AUC) is one of them; it is a good overall measure for comparing algorithms and for indicating which algorithm is better on average. AUC is computed as follows:

$$ AUC=\frac{1+{TP}_{rate}-{FP}_{rate}}{2} $$
(26)

where TPrate and FPrate are the true-positive rate and false-positive rate, respectively. The true-positive rate, defined in (27), is the proportion of actual positives that are predicted to be positive, and the false-positive rate, defined in (28), is the proportion of actual negatives that are predicted to be positive [12].

$$ {TP}_{rate}=\frac{TP}{TP+FN} $$
(27)
$$ {FP}_{rate}=\frac{FP}{FP+TN} $$
(28)
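
Using the same confusion-matrix counts as in (24), Eqs. (26)–(28) reduce to a few lines; the counts in the example below are made up for illustration only.

```python
def auc_from_counts(tp, fp, tn, fn):
    """Eqs. (26)-(28): AUC computed from the true- and false-positive rates."""
    tp_rate = tp / (tp + fn)          # Eq. (27)
    fp_rate = fp / (fp + tn)          # Eq. (28)
    return (1.0 + tp_rate - fp_rate) / 2.0

print(auc_from_counts(tp=40, fp=5, tn=45, fn=10))   # 0.85
```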

Therefore, in this article, the performance of all methods on all datasets is reported in terms of three evaluation measures: accuracy, reduction rate, and AUC.

4.5 Experimental results

Here, we present the performance of the two classification algorithms (i.e., K-NN and SVM) with and without FS. Overall, we show that the group of two-stage FS approaches, and especially the BGSA-SA method, can find a feature subset for every dataset with the maximum reduction rate and similar or even better accuracy.

Tables 4, 5, 6, 7 and 8 present the experimental results of all methods based on three evaluation measures, namely accuracy (accu), AUC, and reduction rate (Rrate), using the two classifiers K-NN and SVM. In all of the tables, both the average and the best value of every evaluation measure on the testing and training sets are compared. Moreover, the performance of the two classifiers on all datasets with the total number of features is reported in terms of accuracy and AUC; in all tables, “ALL” denotes the use of the total number of features.

Table 4 Average and best accuracy of both training and testing sets for two datasets
Table 5 Average and best accuracy of both train and test sets for three datasets
Table 6 AUC and reduction rate for two datasets (the bold value in the last column of 5-NN shows the best FS method based on the reduction rate)
Table 7 AUC and reduction rate for two datasets (the bold value in the last column of 5-NN shows the best FS method based on the reduction rate)
Table 8 Three criteria for all datasets using SVM and 5-NN (the bold value in each column shows the best two-stage method in either SVM or 5-NN learning algorithms)

Since there is no established rule for determining a suitable value of the parameter K in the K-NN classifier, we examine this classifier with three different values (i.e., K = 4, K = 5, and K = 6). Tables 4, 5, 6 and 7 show the experimental results of all methods using the K-NN classifier. All tables demonstrate that the average values of both criteria, accuracy and AUC, on the testing set are better for all FS methods than for “ALL”. Furthermore, the BGSA method and its modified versions, GSAPSO and CHGSA, generally give better results than the BGA, BPSO, and SA methods. It is also observed that the group of two-stage methods outperforms the other, single-stage methods for all K-NN classifiers. Finally, we observe that the proposed BGSA-SA method, with an appropriate elimination of redundant/irrelevant features, achieves similar or, in some cases, better performance than the other existing methods.

According to these tables, the values of accuracy and AUC obtained with the K-NN classifiers on all datasets are approximately similar; Fig. 4 shows that in the proposed BGSA-SA method these values are very close. Hence, another criterion must be used to choose a suitable value for the parameter K. According to the results reported in Table 5, the reduction rate is an appropriate criterion for setting K, as observed in Fig. 5.

Fig. 4  The values of both criteria for all datasets based on our proposed method BGSA-SA. a accuracy b AUC

Fig. 5  The values of reduction rate for all datasets based on our proposed method BGSA-SA

Based on the results in Figs. 4 and 5, we set the parameter K to 5. Therefore, in the remainder of this section, we focus on the results obtained by all methods with the 5-NN classifier.

The “Accuracy” entries in Tables 4 and 5 report the average and best classification accuracy obtained by each method on both the test set and the training set. In Tables 6 and 7, “AUC” reports the average AUC value obtained by each method on both sets (training and test), and the Rrate variable represents the reduction rate of each method.

Based on Table 4, all FS methods applied to the German and WBCD datasets achieve better classification accuracy than ALL. On the German dataset, the BGSA method gives the best result on both the testing and training sets, except for the best value on the testing set. Similarly, for the WBCD dataset, the CHGSA and BGSA methods rank first in average value on the testing set and training set, respectively, while for the best value on both sets the highest rank is shared by the BGA and BPSO methods, with only a small difference in values.

As shown in Table 5, the classification accuracy of all FS methods applied to the Ionosphere (Iono), Sonar, and Musk1 datasets is better than ALL, except for the best value on the test set of the Musk1 dataset. For the Ionosphere dataset, the BGSA-SA method ranks first in both the average and the best value on the test set; furthermore, the GSAPSO and BPSO methods attain the highest average value and best value, respectively, on the training set. For the Sonar dataset, the best results in both average and best value on both sets are obtained by the GSAPSO and BPSO methods simultaneously, and the other two-stage methods, BGA-SA and BPSO-SA, rank similarly to GSAPSO and BPSO. Finally, for the Musk1 dataset, the BGSA method has the best average accuracy on the test set, while the GSAPSO method achieves the best results on the training set.

Based on Table 6, the BGA-SA method ranks first in reduction rate for the German dataset, whereas the BGSA method gives the best percentage of correct predictions on both the testing and training sets. For the WBCD dataset, by contrast, the BGSA-SA method ranks first in reduction rate, and the CHGSA and BPSO methods have the highest accuracy rank on the testing set and training set, respectively.

According to Table 7, the best percentages of correct prediction on the Iono, Sonar, and Musk1 datasets are obtained by the SA, CHGSA, and BGSA methods, respectively, on both sets. Meanwhile, the reduction rate of the BGSA-SA method ranks first on all three datasets compared with the other methods.

From Tables 4–7, it can be seen that the reduction rate of the group of two-stage methods ranks higher than that of the single-stage methods, even though their performance in terms of accuracy and AUC is similar or, in some cases, better. Also, as summarized from these tables in Figs. 6 and 7, the reduction rate of the suggested BGSA-SA method ranks first for all datasets except the German dataset, whereas its accuracy and AUC are very similar to those of the other two-stage methods, BGA-SA and BPSO-SA. We therefore infer that the proposed BGSA-SA method has the best performance in eliminating redundant/irrelevant features.

Fig. 6  The values of both criteria for all datasets based on the group of two-stage methods using 5-NN. a accuracy b AUC

Fig. 7  The values of reduction rate for all datasets based on the group of two-stage methods using 5-NN

As shown in Tables 6 and 7, increasing the reduction rate in the group of two-stage methods, especially in the suggested BGSA-SA method, does not decrease the AUC value on any dataset. This implies that the AUC value is approximately the same for all methods while their reduction rates differ. Thus, in order to compare the ability of the nine FS methods to eliminate redundant/irrelevant features, a comparison in terms of AUC and reduction rate for all datasets is shown in Fig. 8.

Fig. 8  Comparison of reduction rate and AUC value in all datasets

According to Fig. 8, the BGSA-SA method ranks highly in its ability to eliminate redundant features from all datasets (except the German dataset). Nevertheless, its classification performance is similar to, or in some cases such as Sonar and Ionosphere better than, that of the other methods.

Based on Tables 4–7, we see that the reduction rate of the group of two-stage methods is better than that of the single-stage methods. Moreover, the rank order of the two-stage methods for all datasets except the German dataset, based on these criteria, is as follows: the proposed BGSA-SA method ranks first, the BGA-SA method ranks second, and BPSO-SA ranks third.

As a final experiment, we change the learning algorithm from K-NN to SVM and compare the performance of the group of two-stage methods under this exchange. The obtained results are shown in Table 8. According to this table, the accuracy and AUC of all methods using SVM are better than with 5-NN, especially on the training set. Using SVM, the reduction rate of all methods on some datasets, such as WBCD, Sonar, and Musk1, is better than with 5-NN. Furthermore, the reduction rate of the BGA-SA and BPSO-SA methods is better than that of the BGSA-SA method for all datasets except Sonar, where BGSA-SA ranks first in reduction rate.

Finally, to test the performance of the proposed method further, we compare it with seven recent methods published in the field of using GSA for solving the feature selection problem [4, 16, 18, 21, 25, 35, 38]. These seven methods were evaluated by their authors not only on different datasets but also on systems with different software and hardware. For this reason, we have implemented all of them on our system according to the parameter settings in their papers and used the datasets mentioned in Section 4.2 to compare their performance with that of the proposed method. The results of this comparison are presented in Table 9.

Table 9 The comparison with state-of-the-art methods (the bold values in each dataset show the best method with the best performance in either the average accuracy of the testing set or reduction rate)

In Table 9, we compare the results of the proposed method with both classifiers (i.e., K-NN and SVM) against the results of the state-of-the-art methods in terms of average accuracy and reduction rate. Based on this comparison, the proposed method ranks first in the reduction rate criterion on all datasets except the German dataset. Moreover, on four datasets (i.e., German, WBCD, Iono, and Musk1), setting the SVM parameters with the RBF kernel yields the best average accuracy relative to the other methods. In addition, the average accuracy of the suggested method differs only slightly from the best average accuracy, whereas the difference in average reduction rate between the proposed method and the other methods on most of the tested datasets is quite tangible.

4.6 Statistical analysis

The Wilcoxon test [34] is a non-parametric statistical test for comparing pairs of algorithms. It does not impose limitations such as the assumption of a normal distribution or the requirement of vectors of the same length [22]. Moreover, it is safer than parametric tests, and the experimental results in Demsar [9] show that it is stronger than other tests. For these reasons, we use this test as the main statistical test for accuracy and reduction rate in this article.

If the Wilcoxon p-value is lower than the significance level α (α = 0.05 in our experiments), the null hypothesis is rejected, and it is confirmed that there are statistical differences between the distributions associated with the two approaches. Otherwise, there is no statistical difference between the distributions and the pair of algorithms is statistically equivalent.
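
As an illustration of how such a pairwise test can be run (a sketch; the per-run accuracies below are made-up placeholders, not values from the tables), SciPy's implementation of the Wilcoxon test can be used as follows:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-run accuracies of two methods on the same splits (placeholders).
acc_bgsa_sa = np.array([91.2, 90.8, 92.0, 91.5, 90.9, 91.7])
acc_bpso    = np.array([89.7, 90.1, 90.6, 89.9, 90.3, 90.9])

stat, p_value = wilcoxon(acc_bgsa_sa, acc_bpso)
print(p_value < 0.05)    # True -> reject the null hypothesis at alpha = 0.05
```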

The pairwise comparison results between the BGSA-SA method and the other methods using the Wilcoxon test are described in Table 10. In this table, the average accuracy and the number of selected features of the BGSA-SA method and each of the other methods on all datasets are compared via the p-value.

Table 10 Pairwise Wilcoxon test between BGSA-SA method and other used methods for all datasets (5-NN)

The bold values in Table 10 show that the BGSA-SA method differs significantly from the other eight methods. Regarding the reduction of the number of features, the BGSA-SA method ranks first in both Tables 6 and 7 for all datasets, and this conclusion is confirmed statistically by the bold values in the last column of the table. In some cases, the accuracy of the BGSA-SA method is also better than that of the other methods, as recognized by the ranking of the average and best accuracies; the corresponding p-values in Table 10 are shown in bold.

Moreover, in order to support the main goal of the proposed two-stage method, we applied the Wilcoxon test to both the average accuracy and the number of selected features, comparing the BGSA-SA method with the other state-of-the-art methods. The results of this pairwise comparison are shown in Table 11. Considering Table 9, the proposed method gives the best result for selecting the minimum number of features in comparison with the other published methods; accordingly, Table 11 shows that the p-values related to the number of selected features are lower than the significance level α and are presented in bold.

Table 11 Pairwise Wilcoxon test between the BGSA-SA method and other state-of-the-art methods for all datasets (5-NN)

4.7 Time analysis

4.7.1 Time complexity

Here, we analyze the time complexity of the BGSA-SA method. Because the computational complexity of the fitness function is much larger than that of the operators of both the BGSA and SA evolutionary algorithms, we ignore the latter when computing the time complexity of the suggested method.

The time complexity of the suggested two-stage method (i.e., BGSA-SA) in the worst case is \(O\left( N\ast IT\ast \gamma \right)\), since the time complexity of BGSA is \(O\left( N\ast IT\ast \gamma \right)\), where N is the population size (number of agents), IT is the total number of iterations, and γ is the cost of evaluating the fitness function for one agent in one iteration. The time complexity of the SA stage is \(O\left( IT\ast 2\gamma \right)\), since in each iteration the fitness function is computed for both the current solution and its neighbor.

4.7.2 Running time

Using a pre-processing step such as FS before the main task consumes part of the available time; therefore, finding a method that yields good results within a reasonable time is one of the important goals in data mining. Since all of the methods used in this article are wrapper-based FS approaches, most of their running time is spent computing the classification performance of the chosen classifier.

Table 12 reports the average computational time of each method over 50 independent runs. In addition, Fig. 9 compares the computational performance of the different methods on all datasets.

Table 12 Computational time (min)
Fig. 9  Computational performance of all methods for all datasets

According to Table 12, the SA method has the minimum computational time on all datasets. Nevertheless, comparing the running time of the proposed BGSA-SA method with the others shows that, at only a small additional cost, BGSA-SA yields better results in eliminating redundant features and, in some cases, even in accuracy.

Figure 9 shows that all methods except SA have almost equal computational times. Furthermore, increasing the size of the dataset, in either the number of features or the number of samples, increases the running time of all methods. Consequently, the computational time for the German dataset, with 1000 samples, and the Musk1 dataset, with 166 features, is very large compared with that for the Sonar dataset, with 208 samples; in other words, the dataset with the fewest samples has the minimum computational time. The proposed BGSA-SA method therefore offers, at a computational time comparable to the others, a high ability to reduce the number of features, especially on large datasets.

5 Conclusion

The goal of this article was to propose a wrapper-based feature selection method that selects a feature subset of minimum size with similar or even better classification accuracy. This goal was reached by combining a global and a local search method, namely the Gravitational Search Algorithm (GSA) and Simulated Annealing (SA). Using the 5-NN classifier and 10-fold cross-validation in the training process of all the meta-heuristic methods showed that the proposed two-stage method, with approximately the same computational time and similar or, in some cases, better accuracy than the other methods, achieves higher performance in dimensionality reduction, particularly on large datasets.