1 Introduction

Detection and classification of cancers are key issues in microarray technology [1]. The advent of DNA microarray technology helps biologists gain new insight into biological phenomena by monitoring the activity of thousands of genes (features) in a single experiment. DNA microarray technology typically generates gene expression datasets of very high dimensionality. Such a dataset consists of a huge number of genes and a limited number of samples, where the columns represent genes and the rows represent clinical patient samples [2, 3]. These data carry precious information with great potential for cancer diagnosis and classification. However, the existence of a huge number of genes is a challenging issue for the development of an efficient classifier (i.e., machine learning algorithm) [3]. To address this challenge and to improve the predictive accuracy of disease diagnosis, gene selection, also known as feature selection, a data preprocessing step in data mining, can be used to find the subset of most informative genes [4].

Conventionally, methods for gene selection are divided into three categories: the wrapper approach, the filter approach, and the embedded approach [4]. In the wrapper approach, a classifier is employed to assess the reliability of genes or gene subsets. In the filter approach, no machine learning algorithm is used to remove irrelevant and redundant features; instead, the principal characteristics of the training data are applied to rank the significance of genes or gene subsets [5], as in Minimum Redundancy Maximum Relevancy (MRMR) [6], ReliefF [7], Chi-square [8], and Kullback-Leibler [9]. Wrapper methods generally provide better results, whereas filter methods are computationally less expensive but yield lower classification accuracy [10]. The embedded method, the last category of gene selection approaches, is a hybrid of the filter and wrapper methods. Integrating the features of both approaches makes it possible to detect informative genes with high classification accuracy [11]. However, the embedded method is still in its infancy. Therefore, further investigations of both filter and wrapper methods are required to develop efficient hybrid/embedded methods for gene selection [3].

In filtering-based approaches, which are one of the core concerns of this paper, each filter relies on a different metric with different characteristics (distance, probability distribution, information theory, etc.). For a specific dataset, selecting a suitable filter is crucial for classification, since different filtering approaches lead to diverse subsets of genes [12]. Restated, the filter approach that obtains the best results for a specific dataset may not do so for another, which implies a high variability in the classification outcomes. Therefore, several filtering approaches are combined together in order to find a robust subset of genes. Examples of combined filtering-based approaches are: i) Seijo-Pardo et al. [13] studied ensembles of homogeneous and heterogeneous filter-based gene subset selection approaches; ii) Ebrahimpour and Eftekhari [14] utilized hesitant fuzzy sets to represent an ensemble of ranking algorithms and an ensemble of similarity measures for gene subset selection; iii) Bolón-Canedo et al. [12] presented an ensemble of filters and classifiers for microarray data classification.

In wrapper-based approaches, the selection of a gene subset is accomplished through two stages: searching and evaluation. In the search stage, a search-based method is utilized to generate a discriminative gene subset, which is evaluated by an efficient classifier [4]. Finding the optimal subset of genes has been shown to be an NP-hard problem [15]. Therefore, meta-heuristic approaches (i.e., nature-inspired evolutionary algorithms) have been adopted as search methods in wrapper-based approaches [16]. Several meta-heuristic approaches have been applied to solve the gene selection problem, such as Harmony Search with a Markov blanket (HSA-MB) [17], Binary Particle Swarm Optimization with a Combat Genetic Algorithm (BPSO-CGA) [18], and the Genetic Algorithm with Artificial Bee Colony [16]. However, most of these approaches still suffer from stagnation in local optima [17, 19]. Therefore, a strong search method for identifying a near-optimal subset of informative genes for accurate cancer classification is still required. One of the efficient meta-heuristic approaches recently proposed is the bat-inspired algorithm (BA) [20].

BA is a meta-heuristic swarm-based method introduced by Xin-She Yang [20] that imitates the echolocation behavior of swarming bats. BA combines a population-based algorithm with local search to empower both globally diverse exploration and locally intensive exploitation, which are the keys to the success of meta-heuristic methods. BA has further advantages over other meta-heuristic approaches, such as simplicity, flexibility, adaptability, and scalability. Therefore, BA has been successfully adapted to a wide variety of optimization problems, such as scheduling [21, 22], inverse problems and parameter estimation [23, 24], classification [25], gene selection [26], clustering [27], global optimization [28], and image processing [29]. To keep pace with the complexity of optimization problems, the theory of BA has also been tweaked, and several hybridized versions [30, 31] and modified versions [32, 33] have been produced. However, BA has not yet been properly investigated for the kind of gene selection problem addressed in the present work.

In this paper, a hybrid filter/wrapper gene selection method is proposed for cancer classification, based on an improved MRMR method and a BA algorithm [20] hybridized with \(\beta \)-hill climbing. In the proposed hybrid method, the filter approach (i.e., improved MRMR) initially works to find the highly discriminative genes. Thereafter, these genes are used by the hybrid BA (HBA), as a strong initial stage, to find the most informative subset of genes. To elaborate, the performance of the MRMR method is improved via an ensemble mechanism in which several filtering-based approaches (i.e., ReliefF, Chi-square, and Kullback-Leibler) are combined to obtain a robust gene list. This list is used by the MRMR method to improve the outcomes of the filtering stage. The hybrid version in the wrapper stage is suggested in order to improve the performance of the BA algorithm: \(\beta \)-hill climbing is hybridized with BA to empower its intensification capabilities and make it more suitable for the complexity of the gene selection search space. The proposed method is called robust MRMR with Hybrid Bat-inspired Algorithm (rMRMR-HBA). Note that an SVM classifier is employed to evaluate each candidate gene subset in HBA. Specifically, the main contributions of the proposed method are as follows:

  • A modified version of MRMR obtained through a novel hybridization strategy that combines the filtering mechanism of MRMR with an aggregate gene list obtained by an ensemble of filter methods, to select a more robust and stable gene list for high classification accuracy.

  • A new wrapper approach that embeds the hybridization of the BA algorithm with the \(\beta \)-hill climbing local search algorithm to guide the search toward more informative gene subsets.

  • An investigation of whether the proposed method (rMRMR-HBA) can obtain a gene subset with a small number of genes and better classification accuracy when compared with state-of-the-art gene selection approaches.

Extensive experiments were conducted on ten well-known microarray datasets to test the rMRMR-HBA method for the gene selection problem. These datasets differ in their numbers of genes, samples, and classes. For performance evaluation, the proposed rMRMR is initially tested against the standard MRMR and other well-regarded filtering approaches. Thereafter, HBA is evaluated by studying the convergence behavior of BA with and without \(\beta \)-hill climbing. For comparative evaluation, the results of the proposed rMRMR-HBA were compared with state-of-the-art methods using the same microarray datasets. The comparative results show that the proposed method achieved the best results on two out of ten datasets and competitive results on the others.

The remainder of this paper is organized as follows: related background is provided in Section 2. The proposed method is discussed in Section 3. The experimental results and comparative evaluation are analyzed in Section 4. Finally, Section 5 concludes the paper and provides possible future enhancements.

2 Research background

In order to make the paper self-contained, the research background is described here to provide the necessary knowledge about the research methodology. The Chi-Square, ReliefF, and Kullback-Leibler filtering methods are fully overviewed. The fundamental elements of the bat-inspired algorithm, together with its binary version, are provided. Finally, the \(\beta \)-hill climbing algorithm and the Support Vector Machine (SVM) are thoroughly illustrated.

2.1 Chi-Square filtering-based method

Chi-square is a statistical measure widely used in feature selection as a filtering-based method. In general, Chi-square evaluates each feature individually with respect to its related classes [8]. Note that the Chi-square measurement works with features of a discrete nature; therefore, continuous features should be discretized into discrete intervals. The Chi-square measurement can then determine whether the relative frequencies of the classes in adjacent intervals are similar enough to merge them. Out of the N examples, let \(N_{ij}\) be the number of samples of class \(C_{i}\) within the \(j^{th}\) interval and \(M_{j}\) be the number of samples in the \(j^{th}\) interval. The expected frequency of \(N_{ij}\) is \(E_{ij} = (M_{j} \times |C_{i}|)/N\). The Chi-square statistic of a feature is then calculated as:

$$ X^{2}=\sum\limits_{i = 1}^{c}\sum\limits_{j = 1}^{I}{\frac{(N_{ij}-E_{ij})^{2}}{E_{ij}}} $$
(1)

where I is the number of intervals and c is the number of classes. To put it clearly, this method tries to determine the type of relationship between a gene and the class label. If the gene and the class are dependent, the frequency of the gene can be used to predict the frequency of the class; the task is to select genes whose frequencies are highly dependent on the frequency of the class. If the gene and the class are independent, the observed frequency count is close to the expected frequency count, giving a small Chi-square \({X}^{2}\) score. A high \({X}^{2}\) score therefore indicates that the hypothesis of independence is incorrect. In other words, the larger the \({X}^{2}\) value, the more informative the corresponding feature is. By calculating \(X^{2}\) for every feature, the features can be further processed and ranked by their \(X^{2}\) scores.
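For illustration, a minimal Python sketch of the Chi-square score in (1) is given below; the function name and interface are ours, and the feature is assumed to be already discretized into intervals.

```python
import numpy as np

def chi_square_score(feature, labels):
    """Chi-square score of one discretized feature against class labels,
    following Eq. (1): sum over classes i and intervals j of
    (N_ij - E_ij)^2 / E_ij."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    n = len(labels)
    score = 0.0
    for j in np.unique(feature):
        in_interval = feature == j
        m_j = in_interval.sum()                         # samples in interval j
        for c in np.unique(labels):
            n_ij = np.sum(in_interval & (labels == c))  # observed frequency
            e_ij = m_j * np.sum(labels == c) / n        # expected frequency
            if e_ij > 0:
                score += (n_ij - e_ij) ** 2 / e_ij
    return score
```

Genes can then be ranked in descending order of this score, with larger values indicating more informative features.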

2.2 RelifF filtering-based method

ReliefF [7] is an extension of the original Relief filtering-based algorithm [34] and is considered a distance-based measure [35]. The ReliefF algorithm works as follows: first, an instance is randomly sampled from the dataset. Second, the nearest neighbors of the sampled instance are located in both the same and the opposite classes. Finally, the relevance score of each feature is updated by comparing the feature values of these nearest neighbors with those of the sampled instance. The rationale is that a meaningful feature should distinguish between instances of different classes while having similar values for instances of the same class. ReliefF has advantages over the standard Relief: it is more robust, handles incomplete and noisy data better, can deal with multiclass problems, can be adapted to all situations, has low bias, accounts for interactions among features, and may be able to capture local dependencies that other methods cannot.
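The following is a minimal sketch of the original two-class Relief update described above (a simplification of ReliefF, which additionally uses k nearest neighbors and handles multiple classes); the function name and the Manhattan distance are our own choices.

```python
import numpy as np

def relief_scores(X, y, n_iter=100, seed=0):
    """Simplified two-class Relief: reward features that differ between a
    sampled instance and its nearest miss, penalize differences to its
    nearest hit."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        i = rng.integers(n)
        hits = np.where(y == y[i])[0]
        hits = hits[hits != i]                      # exclude the instance itself
        misses = np.where(y != y[i])[0]
        dist = lambda idx: np.abs(X[idx] - X[i]).sum(axis=1)
        hit = hits[np.argmin(dist(hits))]           # nearest same-class neighbor
        miss = misses[np.argmin(dist(misses))]      # nearest other-class neighbor
        w += np.abs(X[miss] - X[i]) - np.abs(X[hit] - X[i])
    return w / n_iter
```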

2.3 Kullback-Liebler filtering-based method

Kullback-Leibler (KL) [9] is a probability-based measure that attempts to find features for which the distance between the class-conditional probability distributions is large. The KL divergence measures how different two probability distributions (over the same event space) are. The mutual information between X and Y can be written as the KL divergence between the joint distribution and the product of the marginals, and the KL divergence (or relative entropy) of probability distributions P and Q on a finite set X is defined as:

$$ MI(X,Y)= KL(P(X,Y)\,||\,P(X)P(Y)) $$
(2)
$$ D(P||Q)=\sum\limits_{x\in X} P(x) \log \frac{P(x)}{Q(x)} $$
(3)

If the random variables X and Y are independent, the joint probability equals the product of the marginal probabilities:

$$ P(X,Y)=P(X)P(Y) $$
(4)

Consequently, the KL divergence between the distributions will be zero. This implies that X and Y are independent: Y carries no information about X, i.e., X is not relevant to Y. Conversely, if X and Y are highly dependent and X carries information about Y, then their relevance is high.
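A short sketch of (2)-(4) on discrete distributions is given below; the small epsilon guarding zero probabilities is our own numerical safeguard.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D(P||Q) = sum_x P(x) log(P(x)/Q(x)), Eq. (3)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def mutual_information(joint):
    """MI(X,Y) = KL(P(X,Y) || P(X)P(Y)), Eq. (2), from a joint table."""
    joint = np.asarray(joint, float)
    px = joint.sum(axis=1, keepdims=True)        # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)        # marginal P(Y)
    return kl_divergence(joint.ravel(), (px * py).ravel())

# Independent X and Y (Eq. (4)): the joint equals the product of the
# marginals, so the divergence, and hence the mutual information, is ~0.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))
```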

2.4 Bat-inspired algorithm

Bat-inspired algorithm (BA), introduced by Xin-She Yang in 2010 [20], emulates the echolocation behavior of bats. There are many kinds of bats; all have similar behavior when navigating and hunting, but they differ in size and weight. Microbats use echolocation to seek prey and/or avoid obstacles in complete darkness. In BA, the echolocation features of microbats are idealized according to the following rules:

  1. All bats use echolocation to sense distance and to distinguish between food/prey and background barriers in some magical way;

  2. Bats fly randomly with velocity \(V_{i}\) at position \(X_{i}\) with a fixed frequency \(f_{min}\), varying wavelength \(\lambda\) and loudness \(A_{0}\) to seek prey. They can automatically regulate the wavelength (or frequency) of their emitted pulses and adjust the rate of pulse emission \(r \in [0,1]\), depending on the closeness of their target;

  3. Although the loudness can vary in many ways, it is assumed to vary from a large (positive) \(A_{0}\) to a minimum constant value \(A_{min}\).

  • Bat Motion:

The frequency of each bat will be a positive integer or real value, depending on the selected upper and lower bounds of the frequency. The frequency value is calculated through (5). The upper and lower frequency bounds are determined based on the domain of interest.

$$\begin{array}{@{}rcl@{}} F_{i} = F_{min} + (F_{max} - F_{min}) \times \beta \end{array} $$
(5)

where \(\beta \) is a uniformly distributed random number in \([0,1]\), \(F_{max}\) is the upper bound of the frequency, and \(F_{min}\) is the lower bound of the frequency. Each bat updates its velocity according to the following equation.

$$\begin{array}{@{}rcl@{}} V_{i}(t + 1) = V_{i}(t) + (X_{i}(t)-Gbest) \times F_{i} \end{array} $$
(6)

where \(Gbest\) is the best solution found so far, \(F_{i}\) is the frequency of the \(i\)th bat, and \(X_{i}\) is the position of the \(i\)th bat. Each bat's position is updated as shown in (7).

$$\begin{array}{@{}rcl@{}} X_{i}(t + 1) = X_{i}(t) + V_{i}(t + 1) \end{array} $$
(7)

BA employs a random walk to improve its exploitation capability, as given below.

$$\begin{array}{@{}rcl@{}} x_{new} = x_{old} + \epsilon {A}_{t} \end{array} $$
(8)

where \(x_{new}\) is the new solution, \(x_{old}\) is the current solution, \(\epsilon \) is a random number in \([-1,1]\), and \(A_{t}\) is the average loudness of all bats at time step \(t\).

  • Variations of loudness and pulse rates:

Once a bat finds its prey, its loudness usually decreases and its rate of pulse emission increases. The loudness can be chosen as any value of convenience. Loudness \(A\) and pulse emission rate \(r\) are updated according to (9) and (10).

$$\begin{array}{@{}rcl@{}} A_{i}(t + 1) = \alpha A_{i}(t) \end{array} $$
(9)
$$\begin{array}{@{}rcl@{}} r_{i}(t + 1) = r_{i}{(0)} (1 - e^{(-\gamma \times t)}) \end{array} $$
(10)

where \(\alpha \) and \(\gamma \) are constant parameters that lie between 0 and 1 and are used to update the loudness \(A_{i}\) and the pulse rate \(r_{i}\), respectively, and \(r_{i}(0)\) is the initial pulse rate. The pseudo-code of the algorithm is presented below. Note that \(f(X_{i})\) is the fitness function value of \(X_{i}\).

[Algorithm: pseudo-code of the bat-inspired algorithm (BA)]
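To make the motion equations concrete, a minimal continuous-domain sketch of Eqs. (5)-(10) is shown below (minimization; the bounds, parameter values, and sphere test function are illustrative assumptions, not the settings used in this paper).

```python
import numpy as np

def bat_algorithm(f, dim=5, n_bats=20, n_iter=100,
                  fmin=0.0, fmax=2.0, alpha=0.9, gamma=0.9, seed=0):
    """Minimal continuous BA sketch following Eqs. (5)-(10); minimizes f."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-5, 5, (n_bats, dim))   # positions
    V = np.zeros((n_bats, dim))             # velocities
    A = np.ones(n_bats)                     # loudness A_i
    r0 = rng.uniform(0, 1, n_bats)          # initial pulse rates r_i(0)
    r = np.zeros(n_bats)                    # r_i(t) starts near 0, per Eq. (10)
    fit = np.array([f(x) for x in X])
    b = fit.argmin()
    best, best_val = X[b].copy(), fit[b]
    for t in range(1, n_iter + 1):
        for i in range(n_bats):
            Fi = fmin + (fmax - fmin) * rng.uniform()          # Eq. (5)
            V[i] = V[i] + (X[i] - best) * Fi                   # Eq. (6)
            x_new = X[i] + V[i]                                # Eq. (7)
            if rng.uniform() > r[i]:                           # random walk
                x_new = best + rng.uniform(-1, 1, dim) * A.mean()  # Eq. (8)
            f_new = f(x_new)
            if rng.uniform() < A[i] and f_new <= fit[i]:       # accept move
                X[i], fit[i] = x_new, f_new
                A[i] *= alpha                                  # Eq. (9)
                r[i] = r0[i] * (1 - np.exp(-gamma * t))        # Eq. (10)
            if fit[i] < best_val:                              # track global best
                best, best_val = X[i].copy(), fit[i]
    return best, best_val

best, val = bat_algorithm(lambda x: float(np.sum(x ** 2)))  # sphere test function
```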

2.5 Binary bat-inspired algorithm

In the continuous version of BA, an artificial bat moves around the search space using position and velocity vectors within a continuous domain. In a binary search space, however, the position (solution) is a string of 0/1 bits. The position update cannot be performed using (7), because the binary search space deals only with the two values 0 and 1. Therefore, a transfer strategy must be defined that maps the velocity vector into changes of the position vector elements from 0 to 1 or vice versa. In the gene selection problem, the solution is represented as a binary vector, and the binary bat algorithm is effective in dealing with it.

Nakamura et al. [36] introduced a binary version of the bat algorithm that restricts the new bat positions to binary values using a sigmoid function:

$$\begin{array}{@{}rcl@{}} S\left( {V_{i}^{j}}\right)= \frac{1}{1+e^{-{v_{i}^{j}}}} \end{array} $$
(11)

With this transfer function, the position update in (7) can be replaced by:

$$ {x}_{i}^{j} \leftarrow \left\{\begin{array}{ll} 1, & \sigma < S\left({V_{i}^{j}}\right) \\ 0, & \text{otherwise.} \end{array}\right. $$
(12)

in which \(\sigma \sim U(0,1)\). Therefore, (12) provides only binary values for each bat's coordinates in the Boolean lattice, which represent the presence or absence of the features.
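A one-function sketch of this transfer strategy, under our own naming, is:

```python
import numpy as np

def binarize_velocity(v, rng=None):
    """Map a real-valued velocity vector to a 0/1 position using the
    sigmoid transfer of Eqs. (11)-(12): bit j is set with probability S(v_j)."""
    rng = np.random.default_rng(rng)
    s = 1.0 / (1.0 + np.exp(-np.asarray(v, float)))  # Eq. (11)
    sigma = rng.uniform(size=s.shape)                # sigma ~ U(0,1)
    return (sigma < s).astype(int)                   # Eq. (12)
```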

2.6 β-hill climbing algorithm

Hill climbing is the base algorithm of the local search-based techniques [37]. It does not involve an explorative strategy during its search process and is thus easily trapped in local optima. As a consequence, several extensions of hill climbing have been developed to handle this problem. Some extended algorithms compensate for the weakness of explorative strategies by embedding neighborhood functions that systematically define different search space regions. Other methods, such as simulated annealing (SA) [38] and great deluge (GD) [39], utilize a relaxed acceptance criterion to deal with uphill moves. Unlike these methods, \(\beta \)-hill climbing [40, 41] is a recent extension of basic hill climbing, proposed by Al-Betar in 2016, in which a new explorative operator (\(\beta \)) is utilized, based on an idea inspired by the uniform mutation operator of GA [42]. At each iteration, the current solution is moved around the search space using a neighborhood search function \(\mathcal {N}\), and an unbounded search space is explored using the \(\beta \) operator. Figure 1 shows the flowchart of the \(\beta \)-hill climbing algorithm.

Fig. 1 Flowchart of the β-hill climbing algorithm

The \(\beta \)-hill climbing starts with a random initial solution \(\textit {{x}}=(x_{1},x_{2},\ldots ,x_{N})\). At each iteration, a new solution \({x}^{\prime }=({x}^{\prime }_{1},{x}^{\prime }_{2},\ldots ,{x}^{\prime }_{N} )\) is generated using two operators: neighborhood navigation (the \(\mathcal {N}\)-operator) and the \(\beta \)-operator. In the \(\mathcal {N}\)-operator stage, a random neighboring solution of \(\textit {{x}}\) is adopted by means of the function \(improve (\mathcal {N})\) with a 'random walk' acceptance rule, which takes a single step without checking the change in the objective function, as follows:

$${x}^{\prime}_{i}=x_{i} \pm U(0,1) \times bw $$

where \(bw\) is the bandwidth of the neighborhood and \(i\) is a randomly selected variable index.

In the \(\beta \)-operator stage, with probability \(\beta \), each variable of the new solution is assigned either a value from the current solution or a random value from the available range, as follows:

$$ {x}^{\prime}_{i}\leftarrow \left\{\begin{array}{ll} x_{r}, & rnd\leq \beta \\ x_{i}, & \text{otherwise.} \end{array}\right. $$
(13)

where \(\beta \in [0,1]\), \(x_{r} \in X_{i}\) is a value drawn from the possible range of decision variable \({x}^{\prime }_{i}\), and \(rnd\) is a uniform random number between 0 and 1. Through these two operators, the convergence of \(\beta \)-hill climbing toward the optimal solution is boosted. Systematically, the \(\mathcal {N}\)-operator navigates the neighboring solutions of the current one and returns a solution with a better objective function value. With the \(\beta \)-operator, convergence is achieved by reconstructing a controlled portion of the current solution, and therefore the convergence rate can be boosted. The \(\beta \)-operator can be considered a source of exploration, while the \(\mathcal {N}\)-operator is a source of exploitation. In terms of the search space, the \(\beta \) operator is very useful in that it empowers the \(\beta \)-hill climbing algorithm to jump from one search space region to another at the same level or below. With this ability, the \(\beta \)-hill climbing algorithm can escape the trap of local minima by trying stochastic values for some decision variables.

[Algorithm: pseudo-code of the β-hill climbing algorithm]
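For illustration, a compact continuous-domain sketch of the two operators is given below (the bounds, bandwidth, and greedy acceptance rule are our assumptions; the paper later uses a binary variant).

```python
import numpy as np

def beta_hill_climbing(f, x0, bw=0.1, beta=0.3, n_iter=1000,
                       lower=-5.0, upper=5.0, seed=0):
    """Sketch of beta-hill climbing (minimization): N-operator random walk
    of bandwidth bw on one variable, then beta-operator resampling."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, float)
    fx = f(x)
    for _ in range(n_iter):
        xn = x.copy()
        i = rng.integers(len(x))
        xn[i] += rng.choice([-1.0, 1.0]) * rng.uniform() * bw   # N-operator
        mask = rng.uniform(size=len(x)) <= beta                 # beta-operator
        xn[mask] = rng.uniform(lower, upper, mask.sum())
        xn = np.clip(xn, lower, upper)
        fn = f(xn)
        if fn <= fx:                    # greedy acceptance of equal/better moves
            x, fx = xn, fn
    return x, fx
```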

2.7 Support vector machine

SVM is used as the evaluator of candidate gene subset solutions, serving as the objective function in the proposed method, as will be seen in Section 3. SVM is a supervised learning algorithm used for classification and regression [43]. It is a powerful classification algorithm due to its efficient performance in the pattern recognition domain. For example, the SVM classifier has been successfully applied to classify high-dimensional data, such as microarray gene expression data [2].

SVM constructs a hyperplane or a set of hyperplanes in a high-dimensional space, which can be utilized for classification, regression, or other tasks. It can deal with both linear and nonlinear datasets. For linear data, SVM tries to find an optimal separating hyperplane that maximizes the margin between the training examples and the class boundary. For nonlinear data, such as the microarray gene expression data used in this paper, a feature mapping function \(X\longmapsto \phi (x)\) is required. The mechanism that defines this feature mapping is called a kernel function. Common kernels include:

  1. Polynomial kernel function

    $$ k(x_{i},x_{j})=(x_{i} \cdot x_{j}+a)^{b} $$
    (14)

  2. Radial basis kernel function (RBF)

    $$ k(x_{i},x_{j})=e^{-||x_{i} - x_{j}||^{2}/2\sigma^{2}} $$
    (15)

  3. Sigmoidal kernel function

    $$ k(x_{i},x_{j})= \tanh(a x_{i} \cdot x_{j}-b) $$
    (16)

    where \(x_{i}\) and \(x_{j}\) are vectors, and a and b are parameters that define the kernel's behavior. The polynomial, RBF, and sigmoidal kernels are nonlinear. The generation of a predictive model for cancer classification using microarray datasets is considered a nonlinear classification task [16]; in this paper, the RBF kernel is therefore adopted for cancer classification.
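As an illustration of this setup, the following scikit-learn sketch trains an RBF-kernel SVM with a grid search over C and gamma, as described later in Section 4; the data here are random stand-ins for a real expression matrix, and the grid values are assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Hypothetical expression matrix: 60 samples x 50 pre-selected genes.
rng = np.random.default_rng(0)
X = rng.random((60, 50))
y = rng.integers(0, 2, 60)

# Grid-search C and gamma for the RBF kernel, then estimate accuracy
# by 10-fold cross-validation.
grid = GridSearchCV(SVC(kernel='rbf'),
                    {'C': [1, 10, 100], 'gamma': [1e-3, 1e-2, 1e-1]},
                    cv=5)
grid.fit(X, y)
acc = cross_val_score(grid.best_estimator_, X, y, cv=10).mean()
```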

3 Proposed method for gene selection

The proposed method for gene selection has two stages, namely the filter approach stage and the wrapper approach stage, as shown in Fig. 2. Each stage is thoroughly discussed in this section.

Fig. 2 Flowchart of the proposed approach for gene selection

3.1 Stage I: Filter approach

In this stage, the Minimum Redundancy Maximum Relevancy (MRMR) filter approach [44] is modified by hybridizing its filtering process with an ensemble of different filter methods. The original MRMR is defined in Section 3.1.1, and the modified version of MRMR is proposed in Section 3.1.2.

3.1.1 Minimum redundancy maximum relevancy (MRMR)

The minimum redundancy maximum relevancy (MRMR) filter approach was developed by Hanchuan Peng [6]. MRMR tries to find the features most relevant to the class label while minimizing the redundancy among the features themselves [44]. This filtering process reveals the features that have maximum relevancy and minimum redundancy. To quantify both relevancy and redundancy, mutual information is used to estimate the mutual dependency of two variables. Mutual information is defined in (17).

$$ I(x_{i};c)=\iint P(x_{i},c) \, \log\frac{P(x_{i},c)}{P(x_{i})P(c)} \, dx_{i} \, dc $$
(17)

where \(I(x_{i};c)\) is the mutual information between feature \(x_{i}\) and class label c, \(P(x_{i})\) and \(P(c)\) are the marginal probability functions, and \(P(x_{i}, c)\) is the joint probability distribution. The mutual information of two completely independent random variables is 0 [45].

Given \(x_{i}\), representing feature i, and the class label c, the Maximum-Relevance criterion selects the top m features in descending order of \(I(x_{i};c)\), i.e., the m individual features most relevant to the class labels:

$$ \max_{S} \frac{1}{|S|} \sum\limits_{x_{i}\in S} I(x_{i};c) $$
(18)

To eliminate the redundancy among features, a Minimum-Redundancy criterion is defined as:

$$ \min_{S} \frac{1}{|S|^{2}} \sum\limits_{x_{i},x_{j}\in S} I(x_{i};x_{j}) $$
(19)

A sequential, incremental search algorithm can be used to solve the simultaneous optimization of criteria (18) and (19). Suppose we already have \(S_{m-1}\), the feature set with m-1 features; the task is to select the \(m\)-th feature from the set \(G-S_{m-1}\). This feature is chosen by maximizing the single-variable relevance minus the redundancy function, as follows:

$$ \max_{x_{i}\in G-S_{m-1}}\left( I(x_{i};c)-\frac{1}{m-1} \sum\limits_{x_{j}\in S_{m-1}}I(x_{i};x_{j})\right) $$
(20)

The method utilizes a series of intuitive measures of relevance and redundancy to pick out promising features of both continuous and discrete datasets, as shown in (17), (18) and (19).
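A compact sketch of this incremental selection, assuming the mutual-information terms of (17) have been precomputed, is:

```python
import numpy as np

def mrmr_select(mi_fc, mi_ff, m):
    """Greedy MRMR per Eq. (20).

    mi_fc: (n_genes,) mutual information I(x_i; c) with the class.
    mi_ff: (n_genes, n_genes) pairwise mutual information I(x_i; x_j).
    m:     number of genes to select.
    """
    selected = [int(np.argmax(mi_fc))]        # max-relevance seed
    candidates = set(range(len(mi_fc))) - set(selected)
    while len(selected) < m:
        def score(i):
            redundancy = mi_ff[i, selected].mean()
            return mi_fc[i] - redundancy      # relevance minus redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```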

3.1.2 Modified MRMR

The main work in this stage is to modify MRMR to overcome the weakness induced by single filter-based approaches (as mentioned in Section 1) and thus improve the robustness and stability of MRMR. MRMR is modified via a novel hybridization strategy that combines the filtering mechanism of MRMR with an aggregate gene list obtained by an ensemble of filter methods. The benefit of ensembling filters is to introduce diversity and reduce the variability associated with single filter methods [46]. In this manner, the hybridization of the filtering mechanism of MRMR with an ensemble of filters may improve its robustness and stability. The modified version of MRMR is called rMRMR; its pseudo-code is given in Algorithm 3.

Among the wide range of filters available in the literature, three filters were ensembled according to previous studies [14, 35], each of which uses a different metric. Thus, the ensemble method is a combination of the following filters: ReliefF, Chi-Square, and Kullback-Leibler. Each individual filter method is executed, and the ranks/scores of the genes are aggregated by taking the mean of the scores over the ranking lists obtained from the individual filter methods. The resulting ranking list and gene scores \(R(G_{i})\) are incorporated into the filtering process of MRMR as follows. Firstly, the ranking list is used to formulate the dimensionality of the search space used by MRMR; the main merit of this incorporation is to empower the outcome of MRMR by initiating its process with the highly informative genes produced by the ensemble method. Secondly, while MRMR computes its gene scores, the ensemble scores of the genes are used in the relevancy computation, as shown in (21). This bases the calculation of each individual gene score on several different metrics, which avoids the bias of a single metric. In doing so, the robustness and stability of MRMR are enhanced.

$$ Relevancy= I(G_{i},c) \ast R(G_{i}) $$
(21)

Lastly, in this stage, the SVM classifier is used to calculate the classification accuracy of the gene list resulting from rMRMR. To further reduce the gene subset and identify a subset of informative genes, the hybrid of BA with \(\beta \)-hill climbing, combined with the SVM classifier, searches for a better gene subset in the wrapper stage.
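The two incorporation steps can be sketched as follows; the mean aggregation follows the description above, while the min-max normalization that makes the three filter scores comparable is our own assumption.

```python
import numpy as np

def ensemble_scores(relieff, chi2, kl):
    """Aggregate per-gene scores from the three filters by their mean,
    after min-max normalization so the different metrics are comparable
    (the normalization step is an assumption of this sketch)."""
    norm = lambda s: (s - s.min()) / (s.max() - s.min() + 1e-12)
    return (norm(np.asarray(relieff)) + norm(np.asarray(chi2))
            + norm(np.asarray(kl))) / 3.0

def rmrmr_relevancy(mi_fc, R):
    """Eq. (21): weight the MRMR relevancy I(G_i, c) by the ensemble
    score R(G_i) of each gene."""
    return np.asarray(mi_fc) * np.asarray(R)
```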

3.2 Stage II: Wrapper approach

In this stage, we propose a hybridized version of the bat-inspired algorithm as a global optimizer with \(\beta \)-hill climbing as a local optimizer. Since \(\beta \)-hill climbing is a local search algorithm that deals with a single solution at a time, it can easily stagnate in local optima. On the other hand, BA, as a population-based algorithm, deals with multiple solutions at the same time; however, it performs a wide search without concentrating on a single area of the search space. Therefore, complementing their capabilities in a single hybrid optimization framework can produce an efficient search method. The Hybrid Bat-inspired Algorithm (HBA) is proposed to select the most informative subset of genes from the top-ranked genes obtained by rMRMR. As the evaluator of candidate gene subsets, the SVM classifier is iteratively used inside the fitness function. The goal of our proposed method is to maximize classification accuracy and minimize the size of the gene subset. The pseudo-code of the proposed method is shown in Algorithm 4. The components and processes of the proposed method in Stage II are illustrated in the following sections:

[Algorithm 3: pseudo-code of the proposed rMRMR]

3.2.1 Solution representation

Gene subset selection can be expressed as a combinatorial optimization problem, in which the search space involves the set of all possible subsets [47, 48]. The problem is known to be NP-hard [49], with a highly combinatorial search space. The number of solutions in the search space increases exponentially with the number of genes: there are \(2^{N}\) possible subsets of genes, where N is the number of genes.

The complete set of genes (i.e., solution \(\textit {{x}}\)) is represented by a binary string of length N, \(\textit {{x}} = (x_{1}, x_{2},{\ldots } , x_{N} )\), where a bit \(x_{i}\) in the solution is set to 1 if the corresponding gene is preserved and set to 0 if it is to be discarded. N is the original number of all genes.

3.2.2 Fitness function

HBA initially starts with a set of generated bats (solutions). Each bat is a string of 0/1 bits, where a bit value of 1 indicates that the gene is selected and 0 that it is discarded. Each bat is evaluated during the search using (22), where the evaluation function combines classification accuracy and gene subset length [50].

$$ fitness = \alpha \times \gamma_{R}(D) + \beta \times \frac{|C|-|R|}{|C|} $$
(22)

where \(\gamma_{R}(D)\) is the average classification accuracy obtained by performing ten-fold cross-validation with the SVM classifier on the training dataset, using the gene set R chosen from each bat in the population to predict decision D. \(|R|\) is the length of the selected gene subset, and \(|C|\) is the total number of genes. \(\alpha \) and \(\beta \) are two parameters weighting the significance of classification quality and subset length, with \(\alpha \in [0,1]\) and \(\beta = 1-\alpha \). Classification accuracy is considered more important than subset length. The quality of each bat is gauged by this objective function. In this paper, \(\alpha \) is set to 0.8 [3, 50, 51].
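A sketch of the fitness evaluation in (22), using scikit-learn's SVM and cross-validation in place of the paper's LIBSVM setup, is:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def fitness(mask, X, y, alpha=0.8):
    """Fitness of a binary gene mask per Eq. (22):
    alpha * CV accuracy + (1 - alpha) * fraction of genes discarded."""
    mask = np.asarray(mask)
    if mask.sum() == 0:
        return 0.0                       # empty subsets are worthless
    acc = cross_val_score(SVC(kernel='rbf'),
                          X[:, mask == 1], y, cv=10).mean()
    reduction = (len(mask) - mask.sum()) / len(mask)
    return alpha * acc + (1 - alpha) * reduction
```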

3.2.3 Hybrid bat-inspired and β-hill climbing algorithm (HBA)

In this stage, the BA algorithm is hybridized with \(\beta \)-hill climbing (\(\beta \)HC) to empower its exploitation capabilities; we name the resulting method the Hybrid Bat-inspired Algorithm (HBA). HBA is employed as a gene subset selection method to find a near-optimal gene subset from the reduced set generated by the filter approach stage. Since the gene selection problem is binary in nature, the binary version of BA introduced by Nakamura et al. [36], illustrated in Section 2.5, is used. The pseudo-code of the proposed HBA algorithm is presented in Algorithm 4.

After the bat population is initialized and each candidate solution is evaluated, BA iterates toward the optimal solution by generating new solutions according to the BA operators, as shown in (5), (6), and (7) and discussed in Sections 2.4 and 2.5. With a probability governed by the pulse rate \(r_{i}\) (Line 7 in Algorithm 4), instead of the local search strategy of the basic BA algorithm, we propose to use \(\beta \)HC to provide efficient exploitation around the current best solution (Lines 10 to 21 in Algorithm 4). In other words, \(\beta \)HC iteratively generates a new solution around the current best solution using two operators: neighborhood navigation (the \(\mathcal {N}\)-operator) and the \(\beta \)-operator. In the \(\mathcal {N}\)-operator, a flip-flop process is used, while in the \(\beta \)-operator a random digit (0 or 1) replaces the original one with a probability given by the \(\beta \) parameter, where \(\beta \in [0,1]\).

During the search process, the function \(improve(\mathcal {N}(X))\) (Line 11 in Algorithm 4) explores the neighborhood solutions \(X^{\prime }\) of the current solution X using the flip-flop process: a random variable is selected from X and its value is flipped from 0 to 1 or vice versa. In the \(\beta \)-operator (Line 13 in Algorithm 4), a new solution \(X^{\prime \prime }\) is generated from the current solution \({X}^{\prime }\) as follows: with probability \(\beta \), any variable in the new solution \({X}^{\prime \prime }\) is assigned a value of 0 or 1, where the function rnd generates a uniform random number in the range \([0,1]\). The new solution \(X^{\prime \prime }\) is accepted if it has an equal or higher fitness value, i.e., \(f({X}^{\prime \prime }) \geq f(X)\). The search process is repeated until the maximum number of generations is reached.
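One such \(\beta \)HC move on a binary gene mask can be sketched as follows (maximization, since the fitness of (22) is to be maximized; the function name is ours).

```python
import numpy as np

def beta_hc_step(x, f, beta=0.05, rng=None):
    """One binary beta-hill-climbing move used inside HBA:
    flip-flop N-operator, then beta-operator, with greedy acceptance."""
    rng = np.random.default_rng(rng)
    x1 = x.copy()
    i = rng.integers(len(x))
    x1[i] = 1 - x1[i]                        # N-operator: flip one bit
    x2 = x1.copy()
    mask = rng.uniform(size=len(x)) <= beta  # beta-operator: random bits
    x2[mask] = rng.integers(0, 2, mask.sum())
    return x2 if f(x2) >= f(x) else x        # keep equal-or-better moves
```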

Algorithm 4, Line 23 shows the loudness process. If the loudness \(A_{i}\) is higher than a random number generated by rnd, where rnd \(\in [0,1]\), a second condition is checked: if the fitness of the new solution is better than that of the best solution \(X^{*}\), then the new solution \({X}^{\prime \prime }\) is included in the bat population and the current solution X is excluded from it. In addition, the loudness is decreased and the rate of pulse emission is increased. Meanwhile, the best bat location is iteratively updated, as pseudo-coded in Algorithm 4, Line 27. Finally, when the termination condition is satisfied (Lines 10 to 28 in Algorithm 4), the best predictive gene subset, which has the highest fitness value, is returned.

[Algorithm 4: pseudo-code of the proposed HBA]

4 Experimental setup & results

In this section, the performance of the proposed method is evaluated using benchmark microarray datasets from the Kent Ridge Biomedical Dataset Repository. These datasets are detailed in Section 4.1. The filter and wrapper approaches were implemented in two languages (Java and Matlab). In the filter stage, ReliefF, Chi-Square, and Kullback-Leibler were implemented in Java using the Weka tool [52], while rMRMR was implemented in Matlab. In the wrapper stage, BA, \(\beta \)-hill climbing, and SVM were implemented in Java. The SVM used in this method is based on LIBSVM [53], with the RBF kernel adopted for the SVM classifier. As is conventional, a grid search algorithm was used for tuning the parameters of the SVM classifier [54]. The experiments were performed on an Intel Core Quad 2.66 GHz CPU with 4 GB of RAM under Microsoft Windows 7.

To recap, the proposed method has two stages: filter and wrapper. In Stage I (the filter approach), rMRMR is proposed to improve the performance of classical MRMR by combining the capabilities of three other filtering approaches (ReliefF, Chi-Square, and Kullback-Leibler) with the framework of MRMR. The effect of rMRMR on classification accuracy is studied in depth in Section 4.2, where the proposed rMRMR is compared against MRMR and other well-regarded filtering approaches. In Stage II (the wrapper approach), \(\beta \)HC is hybridized with BA to improve its exploitation. The effect of \(\beta \)HC on the convergence behavior of HBA is studied in Section 4.3, where BA is compared with and without \(\beta \)HC. Moreover, rMRMR with a Genetic Algorithm [55] (rMRMR-GA) and rMRMR with Particle Swarm Optimization [56] (rMRMR-PSO) were re-implemented for the gene selection problem, and their performance is compared with that of the proposed method (rMRMR-HBA). Finally, the comparative evaluation between the proposed rMRMR-HBA and other methods using the same datasets is fully discussed in Section 4.4.

For Stage I (the filtering approach), SVM builds a classification model using the top-ranked 50 genes generated by rMRMR, MRMR, ReliefF, Chi-Square, Kullback-Leibler, and the ensemble method (i.e., ReliefF, Chi-Square, and Kullback-Leibler). Note that using the top-ranked 50 genes for SVM follows previous studies [14, 17, 55, 57]. Subsequently, the highly ranked genes resulting from the filtering stage are passed to the wrapper stage. It is worth recalling that in the wrapper approach, SVM is used to validate and assess each candidate gene subset generated by the HBA algorithm. In the tables below (Tables 5 and 6), the results are evaluated based on three measurements: classification accuracy, predictive gene subset length, and fitness value; in the comparative table (Table 6), only classification accuracy and predictive gene subset length are used, as done previously in the literature [2, 17]. Classification accuracy is calculated by (23), where gene subset length is the number of selected genes.

$$ Classification\ Accuracy= \frac{\#\ of\ correctly\ predicted\ instances}{\#\ of\ instances} $$
(23)

Table 1 shows the parameter values of the proposed method used in the experiments. These parameter values were carefully selected based on preliminary experiments and on the parameter settings concluded by other studies using BA and \(\beta \)HC [36, 40]. Note that the number of iterations is 100 for both algorithms.

Table 1 Parameter setting of the proposed method

4.1 Dataset used

To evaluate the proposed method, we carried out our experiments using ten de facto standard gene expression datasets [58, 59]. The datasets and their characteristics are summarized in Table 2. In the table, each row corresponds to one dataset. For each row, four values are presented: #Genes, the number of genes in the dataset; #Samples, the number of patient samples selected for testing; #Classes, the corresponding disease statuses; and, in the last column, a description of each cancer disease.

Table 2 Datasets characteristic

4.2 Effect of rMRMR in the proposed method

In order to study the effect of the proposed rMRMR on the performance of the classification process, the classification behavior of rMRMR is compared with the basic MRMR, ReliefF, Chi-Square, and Kullback-Leibler, each working individually, and with an ensemble of filters (i.e., ReliefF, Chi-Square, and Kullback-Leibler). For comparison, we performed 10-fold cross-validation to estimate the classification accuracy of all compared methods.

Table 3 summarizes the results of all compared methods in terms of average classification accuracy along with standard deviation, represented as (average \(\pm \) standard deviation). The best results are highlighted in bold font. It can be noted that rMRMR obtains the best recorded results on four of the 10 datasets. Furthermore, it matches the best scores obtained by the other methods on four datasets (i.e., ALL_AML, Lymphoma, Ovarian, and SRBCT), and obtains the second-best and fourth-best results on the MLL and Colon datasets, respectively.

Table 3 The classification accuracy of the proposed rMRMR as compared to the other gene selection methods by 10-fold cross validation on 10 microarray datasets

In terms of the average over all datasets (AVG), shown in the last row of Table 3, the proposed method clearly achieves the best overall result. Therefore, rMRMR is effective in addressing problems with large numbers of genes.

In addition, rMRMR and the other methods are compared using the win/tie/loss (WTL) counts shown in the last row of Table 3. The WTL row denotes the number of times each method wins against, ties with, or loses to rMRMR. For example, the entry "2/4/4" in Table 3 indicates that MRMR wins on only two datasets, ties on four, and loses on four in comparison with rMRMR. Notably, the proposed method outperforms the other filter-based gene selection methods on almost all tested datasets. In a nutshell, the proposed filter-based method achieves a promising improvement in classifier performance. These outstanding results are owed to the ensemble filter component, which empowers the robustness and stability of MRMR.

4.3 Effect of βHC on the performance of HBA

In this section, the effect of \(\beta \)HC on the convergence behavior of the proposed method is studied. Note that the top-ranked 50 genes obtained in Stage I (the filtering approach) are used in HBA in order to cut the search space down to the most informative genes. BA with \(\beta \)HC (i.e., HBA), BA without \(\beta \)HC (i.e., BA), GA, and PSO are compared. Each method was executed for 30 independent runs. Recall that Table 1 shows the parameter settings of HBA. The experimental results are summarized in Table 4 in terms of \(|\#G|\), ACC, \(|F|\), T, and 'T-sig', which represent, respectively, the average number of selected genes, the average classification accuracy, the average fitness value introduced in (22), the average execution time (minutes), and the Wilcoxon signed-rank statistical test showing whether there exists a statistically significant difference between HBA and the other algorithms. The best results for \(|\#G|\), ACC, T, and \(|F|\) are marked in bold font.

Table 4 Comparison of the gene selection methods on ten selected datasets

In the 'T-sig' row of Table 4, at the significance level \(\alpha \leq 0.05\), '∗' denotes that the results of HBA are significantly better than those of the corresponding algorithm, while '−' denotes that they are not. The Wilcoxon signed-rank test is performed on the fitness value, since it is composed of the classification accuracy and the number of selected genes. As Table 4 shows, HBA obtains higher classification accuracy than BA and the other algorithms on four datasets (Breast, Colon, ALL_AML_4c, and CNS) and the same classification accuracy as BA and the other algorithms on six datasets (MLL, ALL_AML, ALL_AML_3c, Lymphoma, Ovarian, and SRBCT). In terms of the trade-off between classification accuracy and number of selected genes, HBA achieves a better balance than BA and the other algorithms on all datasets except 'Colon', where HBA achieves higher classification accuracy but with a slightly larger number of selected genes than BA.

In terms of fitness, HBA exhibits a higher fitness value (|F|) than the other algorithms on all datasets. Although execution time (T) is not a major factor in the gene selection domain, since the task is not real-time, the average execution time over the 30 runs of each version of the proposed method, together with those recorded by GA and PSO, is summarized in Table 4. The times recorded by HBA and BA are reasonably acceptable. GA has a lower T than PSO, BA, and HBA. Notably, HBA records the highest T, since it is a hybrid version and therefore takes longer to execute. Further, a significant difference exists between the results of HBA and those of the other algorithms: HBA achieves significantly better results than BA and the other algorithms on nine datasets, and is not significantly better than BA on only one dataset (Ovarian). In sum, the experimental results show that HBA yields a smaller set of reliable genes with higher or equivalent classification accuracy compared with BA and the other algorithms on almost all datasets. These fruitful results are owed to the integration of the \(\beta \)-hill climbing algorithm with BA, which significantly improves the local search capability of BA and enhances its exploitation. Thereby, HBA is capable of eliminating irrelevant and redundant genes, which empowers it to identify a smaller set of reliable genes and to obtain competitive or higher classification accuracy.

The convergence behavior of HBA against BA is drawn in Fig. 3 for all experimental datasets. Notably, in all plots, the convergence rate of HBA is superior to that of BA.

Fig. 3 The convergence behavior of BA and HBA for 10 datasets

4.4 Comparative evaluation

To further assess the effectiveness of the proposed method, rMRMR-HBA was compared with state-of-the-art methods, listed in Table 5. Table 6 shows the average criteria over multiple independent runs for each method, including the average classification accuracy (ACC) and the number of selected genes (#G). The best results are highlighted in bold.

Table 5 Key to comparative methods
Table 6 Results of comparison between rMRMR-HBA and the state-of-the-art methods

5 Conclusion and future work

In this paper, an efficient method combining filter and wrapper approaches, namely rMRMR, HBA, and the SVM classifier, was proposed to solve the gene selection problem. In the filter stage, the rMRMR method is proposed, combining an ensemble of filter methods (i.e., ReliefF, Chi-Square, and Kullback-Leibler) with MRMR to improve its robustness and stability. This combination works as follows: firstly, the ensemble methods generate a single gene ranking list by aggregating the results of each individual filter, and this ranked list is subsequently used by MRMR; secondly, the ensemble scores of the genes are coupled with the calculation of the relevancy score in MRMR. In the wrapper stage, we propose a hybrid of the BA algorithm with the β-hill climbing algorithm, integrated with SVM, to seek the optimal gene subset from the top informative genes obtained by the rMRMR method. This hybridization (HBA) is proposed to strike the right balance between exploration and exploitation capabilities and to improve the local exploitation aspects of BA.

Extensive experiments were conducted on ten microarray benchmark datasets to test the performance of the proposed method. These datasets are high-dimensional and highly complex. The evaluation process involved three stages: (i) evaluating the filtering-based approach, (ii) evaluating the wrapper-based approach, and (iii) evaluating the proposed rMRMR-HBA method. In the first stage, the proposed rMRMR was evaluated against the original MRMR and other classical filtering-based approaches; the results show that rMRMR outperforms the other filtering-based approaches on seven out of ten datasets. In the second stage, the wrapper-based approach (HBA) was evaluated by comparing the bat-inspired algorithm with and without the β-hill climbing algorithm, as well as against other well-known methods (GA and PSO). The evaluation was conducted on four main criteria: number of selected genes, classification accuracy, execution time, and fitness value. The proposed HBA exhibits a better compromise between the number of selected genes and classification accuracy than BA on all datasets except 'Colon', where HBA achieved higher classification accuracy but with a slightly larger number of selected genes than BA. In terms of fitness, HBA obtained a higher fitness value (|F|) than BA on all datasets.

In the third evaluation stage, a comparison with state-of-the-art methods shows that our proposed method obtained equivalent or higher classification accuracy than its competitors on nine out of ten datasets. Furthermore, it obtained the best overall results on two out of ten datasets (ALL_AML_3c and Ovarian) and competitive results on the remaining datasets in terms of number of genes and classification accuracy. This finding shows that the rMRMR-HBA approach is capable of effectively eliminating irrelevant and redundant genes, thanks to the integration of the rMRMR filter and the HBA wrapper, resulting in the identification of a small set of informative genes. We hope these achievements inspire other researchers to build on rMRMR-HBA, which offers many capabilities for the gene selection domain.

As the proposed rMRMR-HBA revealed very successful outcomes for gene selection, future work should improve the performance of individual filter-based gene selection methods (e.g., MRMR) by means of homogeneous/heterogeneous ensembles of filtering-based gene selection methods. In addition, rMRMR-HBA can be extended to other high-dimensional datasets, including image and text mining data.