1 Introduction

Emerging DNA microarray-based technologies have been applied in the fields of bioinformatics and biotechnology, and they exponentially increase data size in terms of both features and number of instances [1, 2]. Generally, genes that are relevant to a specific target annotation are known as biomarkers. Biomarker discovery is a vital task for researchers, as well as for medical or pharmaceutical companies, with the goal of identifying genes that can be targeted by drugs. A typical microarray experiment involves the hybridization of mRNA molecules to a DNA template. Microarray datasets are categorized as unlabeled, partially labeled and adequately labeled, which has led to the growth of both supervised and unsupervised gene selection techniques that discover organic patterns (such as tissues, cell lines, etc.) from the instances [3]. DNA microarray technologies have been applied to both classification [4] and clustering [5] problems.

In the data mining and bioinformatics domains, feature selection (FS) is a remarkable approach for reducing the dimensionality of a dataset (e.g., leukemia and colon) [6]. The basic principle of feature selection is to choose a subset of m relevant features from the n original attributes of a DNA microarray dataset according to some criterion [7]. One of the major issues investigated and addressed by researchers in DNA microarray analysis is the “curse of dimensionality”: not all attributes are important, because many of them are redundant or even meaningless, which can decrease the classification accuracy of any learning algorithm. The construction of features [or Feature Extraction (FE)] is closely related to the selection of attributes and can also reduce the high dimensionality [8, 9]. The main difference between FE and FS is that FS picks a subset of the candidate features, whereas feature extraction creates new features from the candidate features.

In general, existing feature selection approaches overlook the fact that, for a given cardinality, there are numerous subsets with similar information quality. Feature selection addresses critical issues by removing irrelevant and noisy data [10]. Generally, FS methods are categorized into four groups: filter, wrapper, hybrid and embedded methods [11]. In most situations, an exhaustive search for the optimal subset of features in a given dataset is practically impossible. In the recent literature, various search methods have been employed for feature selection, such as iterative search, heuristic search and random search [12, 13]. However, most existing feature selection methods still suffer from stagnation in local optima and from high computational cost [14]. Traditional methods cannot cope with the enormous number of attributes. Therefore, researchers have employed effective search methods with specific fitness functions to reduce the dimensionality of microarrays efficiently. In this article, we focus on metaheuristic-based feature selection, because it is superior to other well-known filter or hybrid methods with respect to readability and interpretability.

The major problem of non-metaheuristic algorithms for feature selection is that they become trapped in local (or weak) solutions. To resolve this problem, a wide range of algorithms has been applied since the mid-sixties, and research interest in this domain continues [15]. The most popular metaheuristic algorithms, drawn from swarm intelligence (SI) [16] and evolutionary algorithms (EA) [17, 18], include the genetic algorithm (GA) [19], particle swarm optimization (PSO) [20], artificial bee colony (ABC), ant colony optimization (ACO), bacterial foraging optimization (BFO), the gravitational search algorithm (GSA) and teaching learning based optimization (TLBO) [21]; in addition, Sequential Forward Search (SFS) [22] and Sequential Backward Search (SBS) [23] can be applied to a considerable number of features, but only to small supervised gene expression data sizes (tens to hundreds). TLBO searches for near-optimal solutions; GA uses the concept of Darwinian evolution based on survival of the fittest [24]; ACO imitates the foraging behavior of ants [25]; BFO mimics the foraging strategy of Escherichia coli bacteria [26]; and GSA works on the principle of gravitational interaction among masses [27]. The most informative set of genes can be picked according to the fitness value of the features belonging to the datasets. Still, there are no comprehensive guidelines on the merits and demerits of the different metaheuristic methods and their most appropriate areas of application. The approximate distribution of journal and conference publications over the last eight years with respect to metaheuristic algorithms is depicted in Fig. 1. The articles used in this survey were obtained from all the central databases, such as Web of Science, Scopus and Google Scholar.

Fig. 1

Number of publications on metaheuristics algorithms (January 2010–June 2019)

The main motivation for investigating metaheuristic algorithms is to tackle complex optimization problems where classical optimization methods fail. These methods are now accepted as some of the most practical approaches for solving many real-world problems, such as gene selection [28]. The main advantages of using metaheuristic algorithms for optimization are given below:

  • Broad applicability: It can be applied to any problem that can be formulated as a function optimization problem.

  • Hybridization: It can be combined with other stochastic or classical optimization techniques.

  • Ease of implementation: It is easy to implement, with a simple programming structure and uncomplicated operations.

  • Efficiency and flexibility: It is able to solve large problems rapidly.

  • It can easily handle multi-objective problems of a stochastic nature [29].

However, some disadvantages of metaheuristic methods should also be noted:

  • In general, the optimization performance is highly dependent on control parameter tuning.

  • It lacks a rigorous mathematical basis, in contrast to more traditional techniques [30].

  • It cannot prove optimality.

  • It cannot provably reduce the search space.

  • Repeatability of optimization results obtained with the same initial condition settings is not guaranteed.

To the best of our awareness, there has been no broad discussion of metaheuristic methods for feature selection. Ang et al. [2] offered a comprehensive survey, provided a straightforward organization of gene selection, and reviewed filter, wrapper and hybrid methods for high-dimensional datasets; however, the authors did not systematically review the wrapper methods for gene selection and contributed little to the description of data characteristics [31]. This paper presents a comprehensive review of metaheuristic approaches for feature selection in order to provide researchers involved in cutting-edge research with the applicability, merits and demerits of the algorithms.

The organization of the article is as follows. Section 2 summarizes the broad concept of gene selection, which is used to improve solution diversity in FS problems. Section 3 defines the taxonomy of metaheuristic techniques for gene selection in high-dimensional datasets. Section 4 describes the classification methods. Section 5 presents the experimental results obtained with metaheuristic approaches for feature selection. Section 6 presents a range of open challenges and recommendations for future directions, followed by conclusions in Sect. 7.

2 Outline of gene relevancy

Gene selection aims to choose an optimal feature subset, from the m candidate features, that is relevant to the target and minimally redundant. Yu and Liu [32] categorized candidate features into four classes: (a) fully irrelevant and noisy variables, (b) weakly relevant and redundant features, (c) weakly relevant and non-redundant variables, and (d) strongly relevant features, as depicted in Fig. 2. An optimal subset contains all the features in classes (c) and (d).

Fig. 2

Feature taxonomies based on their redundancy and relevancy

Relevant features play a crucial role in gene selection for better accuracy and discriminative power. The minimal Redundancy and Maximum Relevancy (mRMR) approach to feature selection was presented by Ding and Peng [33] and has since been employed successfully in various real-life applications [34,35,36,37]. As shown in Fig. 2, feature taxonomies can be analysed in terms of redundancy and relevancy.
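A minimal sketch of this greedy, mRMR-style selection is given below; it assumes discretised expression values and uses scikit-learn's mutual-information estimators, and the function and parameter names are illustrative rather than those of [33].

```python
# Minimal mRMR-style greedy selection: at each step pick the feature with the
# highest (relevance - mean redundancy) score. A sketch, not the reference
# implementation of Ding and Peng.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def mrmr(X, y, k):
    """Select k columns of X (discretised gene expressions) for target y."""
    n_features = X.shape[1]
    relevance = mutual_info_classif(X, y, discrete_features=True)  # I(f; class)
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        best, best_score = None, -np.inf
        for f in remaining:
            # redundancy: average MI between candidate f and the already selected genes
            red = (np.mean([mutual_info_score(X[:, f], X[:, s]) for s in selected])
                   if selected else 0.0)
            score = relevance[f] - red
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
        remaining.remove(best)
    return selected
```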

Let the set of variables be \(F = \{ f_1,f_2,f_3, \ldots ,f_n \} \) and the instance space be \( S = s_1 \times s_2 \times s_3 \times \cdots \times s_m \). The objective function is represented as \(f{{:}}\,S \Rightarrow L\) according to the meaningful features, where L denotes the space of labels.

Definition 1

Gene selection: let F be the original set of variables and \(L(\cdot )\) an assessment criterion to be maximized, defined as \(L{:}\,F^{\prime } \subseteq F \Rightarrow R\). A subset of features can be selected under the following settings [38]:

  1. 1.

    Let \(|F| = m\) and \(|F^{\prime }| = n\); then \(L(F^{\prime })\) is maximized, where \(m \gg n\) and \(F^{\prime } \subset F\).

  2. 2.

    Set a threshold \(\delta \), i.e., \(L(F^{\prime }) > \delta \), and find the feature subset with the smallest number of features \((m \gg n)\).

  3. 3.

    Find the subset of features \(F^{\prime }\) that optimizes the criterion \(L(F^{\prime })\) together with the subset size \(|F^{\prime }|\).

In the continuous variant of the variable selection problem, each feature \(f_k \in F\) is assigned a weight \(w_{k}\) that preserves the theoretical significance of the variable; in the binary variant, binary weights are assigned [39]. An optimal feature subset is only one of possibly several equally good subsets, so the above definition does not guarantee that the optimal feature subset is unique. The best feature subset is therefore defined in terms of the accuracy of the induced classifier, as in Definition 2 below.

Definition 2

Let the dataset D be defined by features \( \{f_1,f_2,f_3, \ldots ,f_n \}\) drawn from a distribution \(\rho \) over the labeled instance space, and let \(\aleph \) be an inducer. An optimal feature subset, \(f_{opt}\), is a subset of the features such that the accuracy of the induced classifier \(C= \aleph (D)\) is “maximal” [40]. A diversity of approaches has been applied to solve feature selection problems, and filter-based gene selection approaches have recently received extended attention and shown efficient results. A common procedure for gene selection is shown in Algorithm 1.

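A minimal sketch of such a generic wrapper-style selection loop is given below; random subset generation stands in for the search operator and a scikit-learn k-NN classifier for the evaluation criterion, both of which are illustrative assumptions rather than the specific components of Algorithm 1.

```python
# A generic gene selection loop: generate candidate subsets, evaluate each with a
# classifier, keep the best one, and stop after a fixed number of iterations.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def gene_selection(X, y, n_iterations=100, subset_size=20, rng=None):
    rng = rng or np.random.default_rng(0)
    n_genes = X.shape[1]
    best_subset, best_score = None, -np.inf
    for _ in range(n_iterations):
        # candidate generation step (here: random sampling; a real search
        # operator such as GA crossover/mutation would go here)
        candidate = rng.choice(n_genes, size=subset_size, replace=False)
        # evaluation step: cross-validated accuracy of the induced classifier
        score = cross_val_score(KNeighborsClassifier(), X[:, candidate], y, cv=5).mean()
        if score > best_score:
            best_subset, best_score = candidate, score
    return best_subset, best_score
```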

Definition 3

Relevance to Object [41]: “A feature \(x_i \in X\) is relevant to an object concept C if there exists a pair of examples A and B in the instance space such that A and B differ only in their assignment to \(x_i\) and \(C(A) \ne C (B)\)”.

Definition 4

Strongly Relevant to Instances [41]: “A feature \(x_i \in X\) is strongly relevant to the instance S if there exists a pair of examples \(A, B \in S\) that only differ in their assignment to \(x_i\) and \(C(A) \ne C(B)\) or a feature \(x_i \in X\) is strongly relevant to an objective C in distribution of P, if there exists a pair of examples \(A, B \in I\) with \(P(A) \ne 0\) and \(P(B) \ne 0\) that only differ in their assignment to \( x_i\) and \(C (A) \ne C(B)\)”.

Definition 5

Weakly Relevant to Instances [41]: “A feature \(x_i \in X\) is weakly relevant to instance S if there exists at least a proper \(X' \subset X (x_i \in X') \) where \(x_i\) is strongly relevant with respect to S. Or, variable \(x_i \in X\) is weakly relevant to objective \( C \in distribution of P\) if there exists at least a proper \(X' \subset X (x_i \in X')\), where \(x_i\) is strongly relevant with respect to P”. The above definitions concentrate on which features are meaningful. In other words, relevance is used as a complexity measure to indicate how “complicated” a function is.

Definition 6

Relevance as a Complexity Measure [41]: “Given an instance of data S and a set of concepts C, let r(S, C) be the number of variables relevant (in the sense of Definition 3) to a concept in C that, out of all those whose error over S is the least, has the fewest relevant features”. In other words, we seek optimal performance on S with a concept in C using the least number of features. The notions of relevance mentioned above are independent of the specific learning algorithm, which means that a relevant feature is not necessarily useful to a particular learning algorithm.

Definition 7

Incremental Usefulness [42]: “Given an instance of data S, a learning algorithm L, and a subset of variables X’, variable \(x_i\) is incrementally useful to L with respect to X’, if the accuracy of the hypothesis that L produces using the variable set \({x_i} \cup X'\) is better than the accuracy achieved using just the features subset \(X'\)”.

Definition 8

Entropy Relevance [43]: “Denoting mutual information \(I(X;c) = H(c)-H(c \vert X)\) with Shannon entropy H, the entropy relevance of X to c is defined as \(r(X{;}\,c)=I(X{;}\,c)/H(c)\)”. Let c be the objective seen as a feature and X represent the original set of features; a subset \(X' \subset X (x_i \in X' )\) is sufficient if \(I(X'{;}\,c)= I(X{;}\,c)\). A sufficient subset must therefore satisfy \(r(X'{;}\,c)=r(X{;}\,c)\).
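A small sketch of this entropy-relevance computation for a single discrete feature is given below; it relies on scikit-learn's mutual information between two label vectors and a direct Shannon-entropy computation, and the function names are illustrative.

```python
# Entropy relevance of a single discrete feature X to class labels c:
# r(X; c) = I(X; c) / H(c).
import numpy as np
from sklearn.metrics import mutual_info_score

def shannon_entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))      # natural log, consistent with mutual_info_score

def entropy_relevance(x, c):
    return mutual_info_score(c, x) / shannon_entropy(c)

# A feature that perfectly separates the classes has relevance 1; an
# independent feature has relevance 0.
c = np.array([0, 0, 1, 1])
print(entropy_relevance(np.array([5, 5, 9, 9]), c))   # -> 1.0
print(entropy_relevance(np.array([5, 9, 5, 9]), c))   # -> 0.0
```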

2.1 Basic progressive flowchart of feature selection

The process of picking a subset of relevant genes from the original dataset is partitioned into five main stages, as illustrated in Fig. 3. At each stage, a decision is made that affects the gene selection performance [44].

Fig. 3

The overall progressive flowchart of gene selection

Stage 1: Define the search direction. This stage defines the starting point and the direction of the search. In the forward search process, the search starts with an empty set and new features are added sequentially in each successive iteration [45]. On the contrary, when the search process starts with the original set of features and features are then removed successively in each iteration, it is known as backward elimination search [46].
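A compact sketch of the forward search direction is given below, assuming cross-validated k-NN accuracy as the evaluation criterion; backward elimination is the mirror image, starting from the full feature set and removing one feature per iteration.

```python
# Sequential forward selection: start from the empty set and, at each step, add
# the feature whose inclusion most improves cross-validated accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, max_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        scores = {f: cross_val_score(KNeighborsClassifier(),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # feature giving the highest accuracy
        selected.append(best)
        remaining.remove(best)
    return selected
```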

Stage 2: Define a search strategy. Many measures that evaluate features individually do not work well, so feature subsets must be evaluated as a group. To handle this issue, Gheyas and Smith [47] outlined the requirements of a search strategy: a high-quality search approach should present exceptional global search capability, a high convergence rate towards the global optimum, acceptable local search ability, and good computational efficiency.

Stage 3: Define an evaluation criterion. The evaluation processes of gene selection are categorized into five different varieties: filter, wrapper, ensemble, hybrid and embedded [48]. The filter method is also recognized as an open-loop method [49]. It is a simple, effective method that selects feature subsets according to the underlying characteristics of the features, without additional information from the learning task. This approach mainly estimates feature characteristics with four different types of evaluation criteria, namely information theory, dependency, consistency and distance [40].

The wrapper method is known as a closed-loop method; it wraps the gene selection around the learning algorithm and uses the classification accuracy as the fitness function for feature subset evaluation [19]. It picks a relevant or discriminative feature subset by using a specific classifier, with the aim of minimizing the prediction error [40].

The embedded technique builds feature selection into the machine learning method itself and exploits the model's properties for variable evaluation. It is more proficient, has a lower computational cost, and is more stable than wrapper-based methods in terms of solution quality [22].

The hybrid method is created by merging two dissimilar feature selection methods, e.g., filter and wrapper. It inherits the advantages of the individual methods to gain computational strength [50, 51], employing different evaluation criteria in different search stages to obtain better proficiency and enhanced computational performance.

The ensemble method is an important feature selection process that intends to create a group of optimal variable subsets and then produce a collective outcome from the group [52, 53]. A comprehensive discussion of ensemble-based gene selection can be found in [54]. It is deliberately designed to deal with the instability and perturbation issues of many feature selection algorithms, whose performance depends on the particular subset selected. Thus, it should be relatively flexible and vigorous when dealing with high-dimensional datasets.

Stage 4: Define the stopping criteria. When an FS method achieves the optimal number of features, the gene subset selection procedure should stop. An appropriate stopping criterion prevents over-fitting and yields an optimal feature subset with a lower computational load. The decisions made in the prior stages may influence the choice of stopping criterion. The general stopping criteria are as follows:

  • Reach a predetermined, fixed number of features.

  • Reach a predetermined, fixed number of iterations.

  • Detect a stall, i.e., the percentage improvement over two consecutive generations falls below a given level.

  • Obtain the best variable subset according to the chosen evaluation method.

Stage 5: Validate the optimal output. To measure the effectiveness of the selected feature subsets for classification, a large number of judgment or validation methods have been presented in the previous literature [55].

From the literature, it is observed that selecting the best subsets of features involves two key aspects: maximizing the classification accuracy and minimizing the number of attributes retained from the dataset. These are often contradictory goals. Therefore, feature selection can be treated as a multi-objective problem (MOP) in order to find a set of compromise solutions between the two conflicting objectives. In recent years, research in this direction has gained great attention, where metaheuristic techniques, and in particular evolutionary computation (EC) techniques with their population-based approach, are particularly suitable for the optimization of multiple objectives.

2.2 Background information

2.2.1 Current literature on feature selection

In this subsection, we describe the metaheuristic techniques along three characteristics: search techniques (exploration methods), assessment criteria, and the number of conflicting objectives.

  1. 1.1

    Exploration methods: In the literature, various FS methods make use of complete/exhaustive search [56]. Such techniques are feasible only when the number of features is moderately small and, as a rule, are excessively costly from a computational point of view. Therefore, various methods have been employed for subset selection, such as heuristic search algorithms, of which typical cases are Sequential-Forward-Selection (SFS) [34] and Sequential-Backward-Selection (SBS) [57]. However, these approaches suffer from the nesting effect. To avoid the “nesting” effect, two techniques, Sequential Floating Backward Selection (SBFS) [14] and Sequential Floating Forward Selection (SFFS) [58], have been proposed; compared with static methods, they give better performance in terms of computational cost and of the optimality of the selected feature subsets. Han et al. [59] proposed an approach to investigate a subset of relevant characteristics using a BPSO coding scheme with the help of an ELM classifier. Zhang et al. [60] presented a heuristic search and regression method to select features in high-dimensional data series; their experimental results show that the heuristic method achieved performance comparable to a backtracking algorithm with less computational time. Furthermore, metaheuristic methods, including GA, PSO and ACO, are treated as active methods and are useful for solving FS problems.

  2. 1.2

    Assessment criteria: Classification performance is the assessment criterion used by metaheuristic approaches to evaluate candidate feature subsets. For this purpose, learning algorithms such as the support vector machine (SVM) [61], Naive Bayes (NB) [62], k-nearest neighbor (k-NN) [63], Decision Tree (DT) [64], LASSO [65], artificial neural network (ANN) [66] and linear discriminant analysis (LDA) [67] have been employed within metaheuristics for better classification of tumors and cancers [68] from microarray datasets.

    In the case of filter methods, criteria such as information theory, correlation, distance and consistency are utilized to measure feature importance/weight on the datasets [22]. Commonly used filter-based gene selection techniques include Joint Mutual Information (JMI) [69], Information Gain [70], Relief-F [71], Chi-Square [72], F-statistic [73] and Mutual Information (MI) [74], which reduce a considerable number of features for small supervised gene expression datasets. One of the most popular filter methods is mRMR [75], where MI is used to quantify the relevance of each attribute with respect to the object class, and the essential attributes are selected for better classification. Regarding evaluation performance, much of the literature confirms that filter-based methods do not perform well on problems with a very large number of features [76].

  3. 1.3

    Number of conflicting objectives: Most FS strategies intend to maximize classification performance only during the search process, or combine the classification performance and the number of selected features into a single fitness function. To the extent of our information, all multi-objective FS algorithms depend on metaheuristic methods, since their population-based nature produces various solutions in a single run, which is particularly appropriate for Multi-Objective Optimization (MOO) [77].

2.2.2 Taxonomy of features selection approaches

This paper focuses on metaheuristic approaches for feature selection, which are characterized into various classifications, as shown in Fig. 4, according to three distinct criteria: search technique, assessment and objectives/problems. These criteria are the key components of an FS strategy. The most popular metaheuristic method is the genetic algorithm (GA); the optimal feature sets are then used by the respective learning categories for classification/regression with various approaches such as SVM, k-NN and Lasso.

Fig. 4

Overall categories of metaheuristic for feature selection

In the literature, a wide range of metaheuristic-based feature selection algorithms has been proposed, namely GA, swarm-based algorithms, Artificial Bee Colony (ABC), Ant Colony Optimization (ACO) and Harmony Search (HS). GA is based on Darwinian evolutionary theory and seeks the fittest or best feature set [78]; PSO mimics the behavior of a flock of birds or a school of fish looking for food [79]; ABC imitates the foraging practice of bees [80]; and ACO is based on the behavior of ants looking for a path from a source to a destination [81].

Based on the assessment standard, filter and wrapper methods, along with combinations of both, are evaluated. According to the objective, FS methods are characterized into single-objective (SO), multi-objective (MO) and many-objective (MOB) methods, where multi-objective methodologies correspond to techniques that aim to discover a Pareto front of trade-off solutions. These objectives are expressed through the fitness function.

3 Metaheuristic method for feature selection

3.1 Genetic algorithm for feature selection

GA [82] is inspired by the process of natural selection; it performs a parallel heuristic search and solves optimization problems based on natural genetic schemes. GA plays an energetic part in FS by finding the best attributes, with classification measures acting as the fitness evaluator through classifiers such as SVM, k-NN and Lasso. Feature selection is a binary optimization problem, since a feature is either selected or not, represented by the bits 1 and 0, respectively. However, many optimization approaches are designed to optimize criteria in continuous search spaces, so to optimize feature subsets these approaches need to be converted to work in a binary search space.

To estimate the goodness of feature subsets, various researchers have integrated classification methods such as SVM, k-NN, ANN, DT, NB, multiple linear regression [83] and the extreme learning machine (ELM) [84] as wrappers for metaheuristic approaches. The most popular classification techniques are SVM and k-NN, owing to their better classification performance and simplicity. Filter criteria such as information theory [85], consistency measures [86] and fuzzy set theory [87] have also been employed with GA for feature selection.
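As an illustration, a minimal sketch of a binary-encoded GA wrapper is given below: the chromosome is a 0/1 mask over genes and the fitness is the cross-validated accuracy of a scikit-learn SVM on the selected genes. The operator choices (tournament selection, one-point crossover, bit-flip mutation) and parameter values are illustrative assumptions.

```python
# Binary-encoded GA wrapper for gene selection (sketch).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(mask, X, y):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(SVC(), X[:, mask.astype(bool)], y, cv=5).mean()

def ga_feature_selection(X, y, pop_size=30, generations=50, p_mut=0.01, rng=None):
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))          # random 0/1 chromosomes
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        new_pop = []
        for _ in range(pop_size):
            # tournament selection of two parents
            a = rng.choice(pop_size, 2, replace=False)
            b = rng.choice(pop_size, 2, replace=False)
            p1 = pop[a[np.argmax(scores[a])]]
            p2 = pop[b[np.argmax(scores[b])]]
            cut = rng.integers(1, n)                       # one-point crossover
            child = np.concatenate([p1[:cut], p2[cut:]])
            flip = rng.random(n) < p_mut                   # bit-flip mutation
            child[flip] = 1 - child[flip]
            new_pop.append(child)
        pop = np.array(new_pop)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(scores)]                          # best 0/1 mask found
```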

Various enhancements of GA have been introduced to obtain optimal subsets through crossover and mutation variations. Srinivas and Patnaik [88] proposed an approach that adapts both the crossover and mutation rates to escape local minima in the search space. Dugan and Erkoç [89] introduced an extended GA concept, the self-adaptive genetic algorithm (SAGA), to search the level of adaptation iteratively. Similarly, the GA has been used in a two-stage filtering method: in the first stage, feature ranking is evaluated and the top-most features are selected and passed to the GA for optimal feature selection [90]. In contrast, Ghamisi and Benediktsson [91] introduced a different feature selection scheme based on the amalgamation of GA and PSO. Similarly, Cho et al. [50] suggested a Quantum GA combined with an improved self-adaptive (SA) scheme for solving electromagnetic optimization problems.

The authors of [92] proposed a novel Markov blanket-embedded genetic algorithm (MBEGA) for the gene selection problem. In particular, embedded Markov blanket-based memetic operators add or delete features (genes) from the GA solution so as to quickly improve the solution and fine-tune the search. Empirical results on synthetic and microarray benchmark datasets suggested that MBEGA was effective and efficient at eliminating irrelevant and redundant features based on both the Markov blanket and the predictive power in the classifier model. Similarly, a new approach for predicting drug effectiveness was presented in [93], based on machine learning and genetic algorithms. A global search mechanism, a weighted decision tree, a decision-tree-based wrapper, a correlation-based heuristic and the identification of intersecting feature sets were employed for selecting significant genes. This feature selection approach resulted in an 85% reduction in the number of features, and the relative increases in accuracy and specificity for the significant gene/SNP set were 10% and 3.2%, respectively.

Feature selection with GA using multi-objective methodologies has gained much consideration in contrast with single-objective feature selection techniques. The vast majority of the multi-objective methods depend on the non-dominated sorting genetic algorithm (NSGA-II) or its variations [94, 95]. The aim of [96] was to better preserve global diversity by implementing a Localized IMGA (LIMGA) and a Dual Dynamic Migration Policy (DDMP). LIMGA creates unique evolution trends by using a different kind of GA for each island. DDMP is a new migration policy that governs individual migration: it determines the state of an island according to its diversity and attractivity level and thereby ensures that individuals dynamically migrate to the correct island.

Even though there are more reports on multi-objective feature selection using GA than using other wrapper approaches, the capability of GA for multi-objective feature selection has still not been thoroughly researched, since feature selection is an intricate task that necessitates a carefully composed MOGA to search for the non-dominated solutions. The traditional genetic algorithm uses two crucial tuning operators, crossover and mutation, which provide the chance to identify good features or discover the most exquisite feature sets, but this is a challenging task. In GA, the crucial problem is when and how to apply these adjustment operators and the parameter settings that influence their performance in feature selection.

3.2 PSO for feature selection

PSO with both continuous and binary coding schemes has been used for single-objective and multi-objective feature selection in filter as well as wrapper methods [97]. Each particle in PSO for FS is generally represented as a string of bits, so the dimension equals the total number of attributes present in the dataset. The bit string is represented by binary numbers in binary PSO and by real numbers in continuous PSO. When the binary characterization is utilized, a 1 indicates that the corresponding feature is chosen and a 0 that it is not. When the continuous representation is used, a threshold \(\theta \) generally decides the feature subset: if the value is higher than \(\theta \), the corresponding feature is selected; otherwise, it is discarded.
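A minimal sketch of continuous-encoded PSO with this thresholded decoding is given below; the inertia and acceleration coefficients, the threshold \(\theta \) and the use of a k-NN wrapper fitness are illustrative assumptions.

```python
# Continuous-encoded PSO for feature selection: a position value above the
# threshold theta means "feature selected"; fitness is cross-validated k-NN accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def decode(position, theta=0.6):
    return position > theta                          # boolean feature mask

def pso_fitness(position, X, y, theta=0.6):
    mask = decode(position, theta)
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=5).mean()

def pso_feature_selection(X, y, n_particles=20, iterations=50,
                          w=0.7, c1=1.5, c2=1.5, theta=0.6, rng=None):
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    pos = rng.random((n_particles, n))
    vel = np.zeros((n_particles, n))
    pbest = pos.copy()
    pbest_fit = np.array([pso_fitness(p, X, y, theta) for p in pos])
    gbest = pbest[np.argmax(pbest_fit)].copy()
    for _ in range(iterations):
        r1, r2 = rng.random((n_particles, n)), rng.random((n_particles, n))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        fit = np.array([pso_fitness(p, X, y, theta) for p in pos])
        improved = fit > pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[np.argmax(pbest_fit)].copy()
    return decode(gbest, theta)                      # boolean mask of selected genes
```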

As the literature shows [98], far more research has been done on PSO for single-objective than for multi-objective feature selection, and there are more articles on wrapper than on filter approaches. Distinctive learning methods have been utilized with PSO to assess the goodness of the chosen features, e.g., SVM, k-NN, DT, RF [99] and ensemble approaches [100]. To improve the performance on FS problems, researchers have introduced a large number of new PSO algorithms, covering initialization methods, representations, fitness functions and search mechanisms. For example, Cervante et al. [101] established an inventive initialization method for FS with PSO, which significantly increased the performance of PSO. Three different variants of encoding schemes for PSO are continuous encoding [102], binary encoding [103] and a mixture of both encodings [104].

In PSO, the best subset is evaluated with the assistance of a fitness function. In PSO wrapper approaches, numerous current works utilize only the classification performance as the fitness function [105], which generally leads to expansive feature subsets. The problem can instead be solved by simultaneously optimizing two conflicting objectives, the feature subset size and the fitness of the solutions [106]. Research on multi-objective PSO for FS [14] has focused on optimizing the performance (for example, accuracy) and the number of features as two separate objectives. Usually, PSO is easy to implement and its solution-update structure is simpler compared with the GA algorithm. Nowadays, an open issue for emerging PSO algorithms is mainly new search methods and parameter control strategies for large-scale feature selection. In [107], the authors combined a modified discrete particle swarm optimization (PSO) and support vector machines (SVM) for tumor classification: the modified discrete PSO was applied to select genes, while SVM was used as the classifier. The approach was applied to the microarray data of 22 normal and 40 colon tumor tissues and showed good prediction performance.

3.3 Ant colony for feature selection

Previous work has introduced ACO for a variety of optimization applications and shows more activity on ensemble than on filter and hybrid mechanisms [108]. In [109], a new undersampling method called ACOSampling, based on the idea of ant colony optimization (ACO), was proposed to address the class-imbalance problem; the algorithm starts with feature selection technology to eliminate noisy genes in the data, and the original training set is then randomly and repeatedly divided into two groups, a training set and a validation set. So far, most of the attention has been on single-objective feature selection, and only a few multi-objective methods have been investigated. Shunmugapriya and Kanmani [110] proposed an ACO method for feature selection that uses the pheromone values in ACO to express feature preferences, and also recommended updating the pheromone traces of the edges that link every two different features of the best solution found so far; the experimental results show a better solution regarding classification accuracy with the proposed method than with the GA and PSO methods. Similarly, in [111] the authors proposed a new model using ACO to concurrently select features and optimize the SVM parameters; in addition, a weight optimizer method is introduced for determining the probabilities of a specific feature through ACO. Similarly, for feature selection, [112] integrated two optimizers (ACO and DE), where the DE optimization method was applied to find the subset of optimal features based on the solutions achieved by ACO.

The structure used by ACO for node selection is usually a graph, where the attributes are represented as nodes to create a graph model. Each ant signifies a subset of features, in which the selected features are the visited nodes. In most ACO-based algorithms the features are fully connected to each other in the graph, but in [113] each feature was connected to only two other features. Similarly, for feature selection, Aghdam et al. [114] suggested a new representation pattern to decrease the search space, in which each feature is connected by only two edges, corresponding to being selected or not selected. In most ACO-based methodologies, the classification performance has been employed as the fitness evaluation. In [111, 112], the suitability of the ants (feature subsets) was assessed by utilizing the average classification accuracy, although the performance of individual features was also considered to further enhance the performance. The procedure in [114] included both the classification accuracy and the number of features. Afterwards, extending the effort on single-objective ACO and a fuzzy classifier for feature selection [113], Vieira et al. [115] offered a multi-objective approach aimed at minimizing both the classification error and the number of features. In general, filter approaches are much less developed in ACO for feature selection, comparable to the situation for the GA, PSO and DE methods.
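A simplified sketch of this idea is given below, with pheromone kept per feature (rather than on graph edges) and subsets scored by cross-validated accuracy; the evaporation rate, subset size and use of k-NN are illustrative assumptions.

```python
# Simplified ACO for feature selection: ants sample feature subsets in proportion
# to pheromone, subsets are scored by a wrapper classifier, and pheromone on the
# visited features is reinforced.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def aco_feature_selection(X, y, n_ants=20, iterations=30, subset_size=20,
                          evaporation=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    pheromone = np.ones(n)
    best_subset, best_score = None, -np.inf
    for _ in range(iterations):
        trails = []
        for _ in range(n_ants):
            prob = pheromone / pheromone.sum()
            subset = rng.choice(n, size=subset_size, replace=False, p=prob)
            score = cross_val_score(KNeighborsClassifier(),
                                    X[:, subset], y, cv=5).mean()
            trails.append((subset, score))
            if score > best_score:
                best_subset, best_score = subset, score
        pheromone *= (1.0 - evaporation)                 # evaporation
        for subset, score in trails:
            pheromone[subset] += score                   # reinforcement by accuracy
    return best_subset, best_score
```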

3.4 Hybrid techniques for feature selection

A large number of hybrid metaheuristic methods is available for feature selection, including GA-PSO, PSO-GSA, TLBOSA, HS-GA, the Hybrid Gravitational Search Algorithm (HGSA), the Hybrid Genetic Algorithm (HGA), WOA-SA, CSPSO and TLBOGSA [116,117,118], which have been used for multi-objective problems. DE has also been introduced to solve feature selection problems [119]; most of the work focuses on refining the DE search mechanisms, and new representation patterns have also been presented. Several relevant papers using hybrid techniques based on conventional DE methods for feature selection have appeared. For example, Hancer et al. [120] investigated DE with a filter method, where the DE algorithm found the optimal feature subsets using a filter technique; the experimental results show that it achieved good results compared with existing wrapper methods. Another technique, the memetic algorithm, integrates feature selection with a single-solution-based optimization algorithm such as a local search strategy, and provides a good combination of the wrapper and filter methods [121]. Furthermore, a large variety of optimization methods has also been employed to solve complex problems (i.e., feature selection), including MS-PSO, FSS-EBNA, GAPSO and MMSDE. A new hybrid wrapper approach based on cellular learning automata (CLA) with the ant colony method (ACO) was used to find the set of features that improves the classification accuracy [122]; CLA was applied due to its capability to learn and model complicated relationships, and the features selected in the last phase were evaluated using the ROC curve to determine the most effective yet smallest feature subset.

In the bioinformatics, data mining and machine learning domains, GA is another effective feature selection algorithm that extracts useful information from datasets, and multiple extensions of the conventional GA have been proposed in recent decades [123]. Several significant articles address the feature selection problem with hybrid techniques built on conventional GA methods. Recently, a diversity of metaheuristic methods has been applied to solve feature selection problems. Since all metaheuristic algorithms have their strengths and weaknesses, these concerns are beneficial for further investigation to address new challenges in the domain of feature selection.

3.5 Other feature selection techniques

For unlabeled datasets, unsupervised learning plays a vital role in finding hidden patterns. A primary example of unsupervised learning is clustering [124], which tries to discover natural groupings in a set of objects without knowledge of class labels. Feature selection using unsupervised learning techniques is beyond the scope of this document and will not be analysed in detail, but in this section we refer to some articles that perform unsupervised feature selection. Feature selection using unsupervised learning can provide a better description and higher reliability of the data [125]. Several papers attempting to solve feature selection using unsupervised learning can be found in [126].

To address the relevant genes, Kalousis et al. [127] proposed a gene (feature) selection method using GA for feature clustering, in which a GA is applied to find the best cluster-centre values of a grouping method that groups entities into different clusters; the features of each group are then ranked according to their distance from the group centre. Similarly, Khatami et al. [128] applied PSO, where sampled pixels of an image are used to obtain the conversion-matrix weights for color differentiation, while k-medoids provides the fitness measure for the PSO procedure.

4 Measures in filter methods

To deal with the curse of dimensionality, we need to perform a dimensionality reduction task, i.e., feature selection, to assess the goodness of m features from the n-cardinality dataset. Generally, FS methods consist of two basic steps, subset generation and subset evaluation [129]. Subset generation is a search procedure that produces candidate feature subsets based on specific search strategies, namely sequential search and random search. Subset evaluation methods then estimate the quality of a feature subset based on some criterion. The criteria used for feature selection are categorized by their dependence on learning algorithms into dependent criteria and independent criteria. The independent criteria are correlation measures, distance measures, information measures, precision measures and consistency measures. The types of filter measures used in metaheuristics for feature selection are as follows.

In filter-based feature selection, mutual-information-based methods are more popular than all others. Information measures are used mainly in four traditions.

  1. 1.

    Before running a metaheuristic technique, apply an information criterion to rank the individual attributes. For example, Symmetrical Uncertainty (SU) or Mutual Information (MI) is evaluated to obtain a filter ranking, and the top-most features are then passed to ACO- or GA-based feature selection [130] (see the sketch after this list).

  2. 2.

    A memetic algorithm is one of the best examples of a local-search optimization algorithm; here an information criterion such as mutual information [131] or symmetrical uncertainty [111] is applied as a filter to refine the poor-quality solutions obtained by a GA or PSO for feature selection.

  3. 3.

    Another approach incorporates an information criterion into the update or search operator; for example, MI has been used to adjust the position vector in the PSO optimization algorithm together with an SVM wrapper scheme [132].

  4. 4.

    Lastly, MI is used as an objective function in a metaheuristic algorithm, which is the most proficient approach to feature selection. Based on the idea of “max-relevance and min-redundancy”, the MI method quantifies the redundancy within a subset of attributes and the relevance between the features and the labels, and the metaheuristic objectives maximize the relevance while minimizing the redundancy.
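A minimal sketch of the first tradition above is given below: genes are ranked by mutual information and only the top-ranked ones are handed to a metaheuristic wrapper. The cut-off of 200 genes and the scikit-learn estimator are illustrative assumptions.

```python
# Filter pre-screening: rank genes by mutual information with the class and keep
# only the top_k before running a metaheuristic (e.g. ACO or GA) on the reduced data.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def filter_prescreen(X, y, top_k=200):
    mi = mutual_info_classif(X, y)                  # relevance of each gene to the class
    top = np.argsort(mi)[::-1][:top_k]              # indices of the top_k ranked genes
    return X[:, top], top

# X_reduced would then be searched by ACO- or GA-based feature selection:
# X_reduced, kept_indices = filter_prescreen(X, y, top_k=200)
```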

Correlation is a measurement of how strongly two variables are linearly related. In [117], two correlation methods were proposed to assess the relevancy and redundancy within an EA [133] and NSGA-II [134], and in [11] for the selection of features in two credit-approval datasets. Similarly, Hu et al. [135] proposed a multi-population GA for variable selection in which the correlation between variables and labels was used as a filter measure to assess the GA performance.

Distance measures (DM) are also recognized as measures of separability, divergence and discrimination. The Signal-to-Noise (S2N) ratio has been used for the selection of top-ranked features, and a GA then used the top-ranked features for good classification [136]. The S2N ratio has also been applied to assess the goodness of each agent in PSO for feature selection.

Consistency measures are based on the principle that two samples with identical feature values should carry the same label. Arauzo-Azofra et al. [137] introduced the first filter-based feature selection method based on the consistency measure. Regarding evolutionary computation, GA was the first EA to use the consistency measure for feature selection [138].

Fuzzy logic assigns a degree of membership to a feature. It is also able to quantify imprecision and uncertainty using a membership function, which can be used to assess the quality of features. Using a fuzzy fitness function, two optimization algorithms, PSO and GA, have been applied for feature selection in SO [139] and MO approaches [140].

In recent years, researchers have investigated the use of more than one evaluation measure simultaneously in a single FS algorithm, which has become popular because every measure has its advantages and disadvantages. Emmanouilidis et al. [141] studied five different filtering measures in NSGA-II for feature selection, including pairs of inconsistent examples as a consistency measure, attribute-class correlation as a dependence/correlation measure, the inter-class distance, and an entropy representation as an information measure.

In summary, various filter methods have been implemented in metaheuristics for feature selection. The most popular measures, such as information, correlation and distance measures, are relatively inexpensive from a computational point of view, while consistency measures and measures based on rough and fuzzy set theories can cope better with noisy data. Compared with EA-based wrapper methods, filter methods generally give poorer results concerning classification accuracy, but they can be less costly than wrapper approaches [142] regarding computational time on large datasets. Therefore, developing filter-specific measures based on the characteristics of a metaheuristic technique could increase efficiency and effectiveness, which offers an important direction for future research.

4.1 Classification and regression method

This subsection provides a brief overview of the two classifiers and the regression method whose accuracy is utilized as the fitness function for feature selection. Classification is a basic data mining task that involves the construction of a classifier. We present the support vector machine (SVM) [143], the k-nearest neighbor (k-NN) [144] and the LASSO method.

4.1.1 Support vector machine

Separating the feature vectors and predicting the correct class label is the major challenge for a data mining classification algorithm. The supervised machine-learning SVM model separates sets of feature vectors that have different class memberships by analysing the data and constructing an optimal decision-plane classifier [143]. It has gained popularity amongst the other machine learning classifiers. The main objective is to draw a hyperplane that splits the dense feature-vector dataset while making the margins between the sets of feature vectors maximal. To construct an optimal decision plane, an iterative inductive learning model is used to minimize an error function \(\wedge (w)\) defined in Eqs. 1 and 2.

$$\begin{aligned} \wedge (w)= \frac{1}{2}(ww^T)+ C \sum _{i=1}^{m} \phi _i. \end{aligned}$$
(1)

Subject to be constraints

$$\begin{aligned} y_i[w^Tx_i + \rho ] \ge 1-\phi _i \quad and \quad i=1, 2, 3,\ldots , m \end{aligned}$$
(2)

where C represents the capacity constant, w the vector of coefficients, \(\phi _i\) the non-negative slack variables that represent the deviations from the margin and allow non-separable (noisy) inputs to be handled, and \(\rho \) a constant, for \(i = 1,\ldots , m\). For each training sample i, \(x_i\) denotes the vector of independent variables and \(y_i\) the actual class label. SVM solves non-linear problems through a kernel function k that transforms the data into a higher-dimensional space.

4.1.2 k-nearest neighbour

In recent classification problems, the nearest-neighbor (NN) rule is applied to classify an unknown data point according to its closest neighbor whose class is already known. It can solve real-world problems with the availability of inexpensive computing platforms. Cover and Hart [145] investigated the k-nearest neighbor (k-NN) rule, which finds the group of k objects in the training samples that are nearest to the test object and assigns the label of the majority in this neighborhood [146].

4.1.3 LASSO

Lasso is a regularization technique that is useful for feature selection and for preventing over-fitting to the training data [65]. It works by penalizing the sum of the absolute values of the weights found by the regression. Lasso is effective at reducing the feature space because, when \(\lambda \) is sufficiently large, many of the weights \(w_i\) are driven to zero; for this reason it has been widely considered for high-dimensional data analysis. Given a data set that consists of n observations \(\{({\varvec{x_{i}}}, l_{i})\,|\,1 \le i \le n\} \), where \({\varvec{x_{i}}} = (x_{i1}, \ldots , x_{ip})\) is a p-dimensional vector of predictors and \(l_{i}\) is a response variable, the regression model is written as:

$$\begin{aligned} {l_{i}} = {\varvec{\beta }} {\varvec{x_{i}}} + \epsilon _{i}, \quad i=1,\ldots , n, \end{aligned}$$

where \( {\varvec{\beta }} = (\beta _1,\ldots ,\beta _p)\) is a p-dimensional vector of regression coefficients and \(\epsilon _{i}\) is a random error term that is assumed to be independently and identically normally distributed with mean zero and variance \(\sigma ^{2}\). It is assumed that the response is mean-corrected and the predictors are standardized, so the intercept term is not included in the model. LASSO is an FS process based on a regression model with L1-norm regularization:

$$\begin{aligned} \min _{\varvec{\beta }} \sum _{i=1}^{n} \left( {l_{i} - \sum _{j=1}^{p} x_{ij} \beta _{j} }\right) ^{2} + \lambda \sum _{j=1}^{p} \left| { \beta _{j} }\right| , \end{aligned}$$

where \(\lambda \) is a non-negative hyper-parameter. LASSO has been used successfully for high-dimensional data.
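A minimal sketch of Lasso-based gene selection with scikit-learn is given below; standardizing the predictors follows the assumption stated above, while the regularization strength (alpha, corresponding to \(\lambda \)) of 0.05 is purely illustrative and would normally be tuned.

```python
# Lasso-based gene selection: with a sufficiently large regularization strength,
# many coefficients are driven to zero; the remaining nonzero ones are the selected genes.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_select(X, y, alpha=0.05):
    Xs = StandardScaler().fit_transform(X)          # predictors standardized, as assumed above
    model = Lasso(alpha=alpha).fit(Xs, y)
    return np.flatnonzero(model.coef_)              # indices of genes with nonzero weights
```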

4.2 Microarray datasets description

In this section, we demonstrate a comparative experimental study of the four feature selection methods on five commonly used biomedical gene expression datasets, namely Lung Cancer [147], Colon Cancer [148], Diffuse Large B-cell Lymphoma (DLBCL) [149], Leukemia [150] and Small-Blue-Round-Cell Tumor (SBRCT) [151], which were downloaded from http://www.gems-system.org. Each dataset contains a different set of features and cancer classes. The dataset descriptions in terms of number of samples, number of genes and labeled classes are summarized in Table 1.

Table 1 Dataset description

4.3 Validation methods

To measure the acceptability of a feature subset for classification, different error estimation strategies have been suggested. It is also vital to choose a validation method (classifier accuracy) for the selected classifier. Most studies perform the validation using either cross-validation (CV) or bootstrap techniques [152]. In this paper, we use tenfold cross-validation, which is performed with all classifiers to measure classification performance. It randomly splits the dataset into training and testing samples: the training subset consists of 90% of the data samples and the testing subset of the remaining 10%, and the performance is estimated from the confusion matrix.
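A minimal sketch of this validation scheme using scikit-learn is shown below; the stratified splitting and the SVM default parameters are illustrative choices.

```python
# Tenfold cross-validation: each fold trains on ~90% of the samples and tests on
# the remaining ~10%; the ten accuracies are averaged.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def tenfold_accuracy(X, y, clf=None):
    clf = clf or SVC()
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```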

4.4 Performance measures

We measure the classification performance of the two classifiers, SVM and k-NN, with four performance measures: accuracy, sensitivity, precision and F-measure. These performance measures are defined as follows.

  1. 1.

    Accuracy: the percentage of correctly classified samples, formulated as in Eq. 3.

    $$\begin{aligned} Accuracy = \frac{TN + TP}{TN + TP+FN+FP} *100 \end{aligned}$$
    (3)
  2. 2.

    Sensitivity: Percentage of positive instances that are predicted as positive. It is also called True Positive Rate (TPR) or Recall. It is formulated as in Eq. 4.

    $$\begin{aligned} Sensitivity (Recall (Re))= \frac{TP}{TP+FN} *100 \end{aligned}$$
    (4)
  3. 3.

    Precision: It is the percentage of positive predictions that are correct. This is also called positive predicted value. It is formulated as in Eq. 5.

    $$\begin{aligned} Precision (Pr) = \frac{TP}{TP+FP} *100 \end{aligned}$$
    (5)
  4. 4.

    F-measure: It is a composite measure which favors algorithms with higher sensitivity and challenges those with higher specificity as in Eq. 6.

    $$\begin{aligned} F{\text {-}}measure = \frac{2*Pr*Re}{Pr+Re} *100 \end{aligned}$$
    (6)

Here, TP, TN, FP, and FN are true positive, true negative, false positive and false negative in the independent datasets.
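As a compact reference, the four measures can be computed directly from these confusion-matrix counts; in the sketch below they are expressed as percentages, and the F-measure is computed from the percentage-valued precision and recall.

```python
# Performance measures computed from confusion-matrix counts (TP, TN, FP, FN).
def accuracy(tp, tn, fp, fn):
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):          # recall / true positive rate
    return 100.0 * tp / (tp + fn)

def precision(tp, fp):            # positive predictive value
    return 100.0 * tp / (tp + fp)

def f_measure(tp, fp, fn):        # harmonic mean of precision and recall
    pr, re = precision(tp, fp), sensitivity(tp, fn)
    return 2.0 * pr * re / (pr + re)
```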

5 Experimental results

This section demonstrates the experimental results obtained by four metaheuristic feature selection approaches on five gene datasets with two classifiers, SVM and k-NN, and one regression approach, Lasso. We have implemented four metaheuristic methods, ACO, PSO, DE and GA, with the performance of the two classifiers and of Lasso regression as the objective function. All metaheuristic methods were run on the identical machine, in a MATLAB environment on a 2.4 GHz Pentium Core i7 with 8 GB RAM running the Windows 8 operating system. For a fair comparison in the same computing environment, the ACO, PSO, DE and GA nature-inspired algorithms were run with suitable parameter settings, which can be seen in Table 2. Furthermore, the top-ranked subset of 200 genes was selected from each gene dataset. The procedure is repeated over ten iterations to allow each part of the data to serve as test data. Tables 3 and 4 report the classifier accuracy and the corresponding mean number of features selected by each approach on each gene dataset. Here, the mean over the tenfold cross-validation is used for the comparative analysis, with the measures accuracy (acc), mean number of selected features (#feat) and execution time (in s).

Table 2 Parameter setting

Table 3 reports the wrapper results when the SVM accuracy is used as the fitness function; it shows the performance of the metaheuristic methods on the small gene datasets. From the obtained results, the GA method gives the best classification accuracy achieved by the SVM classifier, 93.56% on the DLBCL dataset with 24 genes, while the lowest classification accuracy, 76.91%, is obtained on the Colon Cancer dataset using the ACO wrapper with 48 genes. The best results among all metaheuristics for feature selection have been highlighted (Table 5).

Table 3 Performance of the four feature selection methods on the datasets using the SVM classification algorithm

Table 4 shows the performance of the wrapper methods using k-NN. The GA method gives the best classification accuracy achieved by the classifier, 92.86% on the DLBCL dataset with 27 features, while the lowest classification accuracy, 74.75%, is obtained on the Leukemia dataset using the ACO wrapper with 41 features. The best results among all metaheuristics for feature selection have been highlighted.

Regression methods such as ridge, Lasso and Elastic-Net [153] are among the most attractive and are popularly employed in both machine learning and biomedicine. The comparison of classification accuracies and selected genes of the four methods on the five biological datasets over 10 runs is summarized in Table 5. On the colon cancer data, the average classification accuracy of GA is 81.53%, which is 1.3, 3.3 and 5.6 percentage points higher than that of PSO, DE and ACO, respectively.

Table 4 Performance of the four feature selection methods on the datasets using the k-NN classification algorithm
Table 5 Performance of the four feature selection methods on the datasets using the LASSO algorithm

In addition to the above, convergence towards an optimal solution is an essential issue in feature selection problems, and a comparison between the metaheuristic algorithms is shown in Fig. 5a, b for the respective classifiers. Figures 5, 6, 7, 8 and 9 show the convergence curves of the metaheuristic methods, where the x-axis presents the number of iterations and the y-axis the performance of the subset obtained at that iteration.

Fig. 5

Convergence curve for feature selection on Leukemia dataset with a KNN, b SVM

Fig. 6

Convergence curve for feature selection on SBRCT dataset with a KNN, b SVM

Fig. 7

Convergence curve for feature selection on DLBCL dataset with a KNN, b SVM

Fig. 8

Convergence curve for feature selection on lung cancer dataset with a KNN, b SVM

Fig. 9

Convergence curve for feature selection on Colon dataset with a KNN, b SVM

Table 6 Solution quality of each metaheuristic on selected gene subsets using SVM classifier as the fitness function

5.1 Applications

In this section, we summarize applications of metaheuristics for feature selection in Table 7; it can be seen that metaheuristic approaches have been applied to a variety of areas. As the results in Tables 3 and 4 show, in terms of classification performance and number of features, SVM achieves better accuracy than k-NN. Therefore, for further analysis of SVM-based feature selection, we evaluate the classification performance of the subsets selected by the four metaheuristic algorithms, i.e., GA, PSO, ACO and DE, on the five microarray datasets. The classification sensitivity, specificity and F-measure are presented in Table 6. According to Table 6, the support vector machine as a fitness function gives the highest performance with the GA method on almost all datasets. This result shows that the genetic algorithm (GA) is a robust metaheuristic compared with the other metaheuristics used in this experiment.

Table 7 Application of metaheuristic algorithm used in different applications

6 Open challenges

There is no single feature selection algorithm suitable for all classification problems. The feature selection problem depends on what exactly the task is, and so does the classification problem [157]; what appears to be a useful feature for one problem can be of little use for another. Despite the suitability, success and promise of metaheuristics for feature selection, there are still some difficulties and challenges, which are analysed here.

6.1 Scalability

Due to the trend towards large data [106, 151], obtaining significant features is an extremely difficult and risky problem in the FS process. A dataset with more than 300 features is called a large-scale dataset for feature selection [108]. However, today the number of features in numerous domains, for example gene analysis, easily reaches hundreds or more. This expands the computational cost and requires innovative search methods, and the issue cannot be settled simply by expanding the computational power. To overcome these issues, a large number of metaheuristic algorithms have been investigated by researchers to solve high-dimensional feature selection problems [110]. In addition, hybrid subset selection algorithms have been introduced that investigate the trade-off between the original feature set and the classification performance of the model with the selected feature subsets [135]. The first phase of a hybrid model is, in general, a filter part: with the help of the filtering approach, the importance of each feature is evaluated and irrelevant features are screened out of the dataset, giving a much more informative subset compared with the original data set. Features are then picked from this candidate set by making a trade-off between the predictive power of a candidate feature (relevance to the class vector) and its independence from all features previously selected, while the subsequent (wrapper) method accelerates the search with the filtered genes and optimizes the classification performance.

6.2 Computational cost

The major issue in feature selection methods is the computational expense of solving high-dimensional problems. In metaheuristic-based feature selection, the computational cost is a serious issue, as these methods often involve a large number of evaluations [158]. In general, filter methods such as mutual information [159, 160] have significantly improved results with respect to classification accuracy, but they still need improvement, although they can be less complex than wrapper approaches regarding computational time on large datasets. To overcome the computational cost challenge, two significant and effective factors must be considered: an efficient search technique and a rapid assessment measure [161]. It is emphasized that the parallel nature of metaheuristics is well suited to grid computing, graphics processing units and cloud computing, which can be used to accelerate the process.

6.3 Search mechanisms

As we know, the feature selection (FS) problem is computationally complex and NP-hard, with a large composite solution space [162]. Solving it requires a robust optimization search method, but current algorithms still need great improvement in finding potential solutions. For improved results, a newly developed metaheuristic should have the capability to explore the complete search space and also be able to exploit local regions when needed. New search schemes may include a local search (to form new memetic algorithms), the hybridization of different search schemes, or the hybridization of metaheuristics with conventional algorithms [121]. By nature, metaheuristic algorithms are stochastic, approximate optimizers, and they can generate diverse optimal fitness values when different local solutions are considered; when the optimal fitness values are the same, the solution with the smallest number of selected bits is preferred. In addition, proposing new search algorithms with high stability is also an important task.

6.4 Measures

The fitness evaluation function used for suitability analysis is a significant aspect of metaheuristics for feature selection. It strongly influences the computational time, the prediction performance and the search-space landscape. In general, most of the computational time is spent in the assessment procedure, both for metaheuristic optimization algorithms and for many filter-based approaches [69, 71]. There are some fast assessment measures, such as information gain [163], but they evaluate the features individually rather than as a group; ignoring interactions between features produces subsets with redundancy and a lack of complementary features [164].

6.5 Dataset structure

The number of variables and instances in a dataset significantly influences the work and the design of experiments in feature selection problems [165]. A considerable portion of the current feature selection methods is intended for flat data and depends on the assumption that features have no explicit correlations with each other; as such, they overlook the inherent structures between the attributes. For instance, these FS methods would choose the same subset of features even if the features were reordered. In some data mining applications, features exhibit different kinds of structures, such as spatial or temporal smoothness, disjoint groups, overlapping groups, trees and graphs. When applying FS algorithms to datasets with structured features, it is useful to explicitly incorporate this prior knowledge, which could improve post-learning activities such as classification and clustering.

7 Conclusion

Although metaheuristic algorithms for feature selection have accomplished various successes, they still face challenges, and their potential has not been fully realized. One of the main difficulties for metaheuristic algorithms is scalability, since both the number of features and the number of instances keep increasing in microarray datasets. In the literature, a variety of conventional metaheuristic algorithms has been applied to microarray datasets with attention to the following aspects: simplicity, stability, robustness and computational requirements. Metaheuristic methods provide benefits such as insight into the data, better classifier models, enhanced generalization, and identification of irrelevant variables for feature selection. This survey presented a series of metaheuristic algorithms for addressing feature selection tasks and focused on key factors such as representation, search mechanisms, performance measures and structure. In addition, experimental work was conducted using metaheuristic approaches on a number of datasets, for example Colon and Leukemia. Furthermore, recent examples of metaheuristic algorithms for feature selection from the literature were presented together with a summary of some notable applications, and several open issues were examined. Finally, a few proposals were prescribed that will help to develop novel and effective metaheuristic approaches and solve different kinds of problems.