1 Introduction

A gene is a functional unit of a cell, i.e., each gene provides instructions that contribute to the functionality of the cell. A gene expression profile records the expression levels of thousands of genes of a cell, which determine the functional characteristics of that particular cell in the form of protein products, also known as polypeptides. Gene expression helps us understand the genetic behavior of a cell or tissue. The rapid growth of research in DNA microarray technology has made it possible for researchers to assess the expression levels of a large number of genes. Studying gene expression can contribute significantly to cancer diagnosis [9]. Cancer tissues can be distinguished from normal cells by studying the differences in their gene expression. Identifying cancers from just the morphological appearance of tumors has limitations; therefore, researchers make use of gene expression information to identify cancer [19], and gene expression profiling also identifies the cancer more accurately. Machine learning allows us to build classification models that learn from data or experience and make decisions [15, 22]. Therefore, using DNA microarray data, cancer cells can be classified. As the number of genes in a DNA microarray can be in the tens of thousands, the problem of the “curse of dimensionality” occurs. Therefore, to address the high-dimensionality problem, feature selection is employed [4, 17].

One of the challenges involved in the classification of cancer from DNA microarray data is to select relevant genes which distinguish normal cells from cancer cells. In [22], feature selection methods are systematically studied for the classification of cancer data: feature selectors based on correlation and classification using naive Bayes, decision trees, and support vector machines have been studied. Feature selection consists of choosing a subset of significant features that correctly express a given problem by removing redundant and correlated features, as redundant features can act as noise. Feature selection methods are mainly of three types—wrapper methods, filter-based methods, and embedded approaches. In filter-based feature selection [12], a subset of features that gives maximum predictive power is selected based on some statistical measure, without the requirement of any learning algorithm. In filter methods, a feature subset f of cardinality m is selected from the full feature set F by maximizing a criterion J:

$$ f^{*} = \operatorname*{arg\,max}_{f \subseteq F} J(f), \quad \text{s.t. } |f| = m $$
(1)

Wrapper approaches train a model using a subset of features and, based on the performance, add or remove features from the subset. These methods depend on the classifier being used and are time consuming, but they give better performance. Embedded methods of feature selection incorporate aspects of both wrapper and filter methods. Obtaining a global optimum for Eq. 1 is an NP-hard problem. Nonetheless, many heuristic approaches are found in the literature that are known to provide suboptimal results. Metaheuristics, especially nature-inspired computation approaches, are used extensively for the feature subset selection task. They are implemented by randomly initializing a population in the first generation. For every individual in the population, a fitness function or an objective function is evaluated, and the solutions are improved iteratively according to the fitness measure. The best solution is obtained at the end of all iterations. These optimization algorithms, when employed for feature selection, yield efficient feature subsets and are used extensively in gene selection for cancer classification [1, 23]. Such methods include genetic algorithms [10, 20], particle swarm optimization [5, 6], ant colony optimization [8], bacterial foraging optimization [21], and bee colony optimization [2]. In these approaches, candidate solutions evolve by iteratively evaluating and optimizing an objective function. Since feature selection is an optimization problem with the objectives of maximizing classification accuracy and minimizing the dimension of the feature set, bio-inspired algorithms are employed for feature selection.

The rest of the paper is organized as follows. We discuss the work related to our proposed methodology in Section 2. Our proposed work is explained in Section 3. Results obtained from the proposed work are presented in Section 4. We conclude in Section 5.

2 Related work

2.1 Bat algorithm

The bat algorithm is a metaheuristic, bio-inspired algorithm for global optimization problems, inspired by the echolocation of microbats and developed by Yang et al. [24]. Echolocation is the way in which microbats hunt for prey in the dark; it allows them to tell apart obstacles from prey. Bats emit a series of loud, short pulses and wait for them to come back. When the pulses that hit an object return, the bats estimate how far away the object is from the time taken by the pulse to travel to and fro. Some of the preliminaries for the bat algorithm are as follows:

  • 1 It is assumed that the bats “know” the difference between an obstacle and prey.

  • 2 Each bat bi flies with velocity vi at position pi, with a fixed frequency frmin, varying wavelength λ, and loudness L0. Based on the closeness of the target object, the bats adjust the wavelength and the rate of pulse emission r ∈ [0,1].

  • 3 The loudness is assumed to vary between L0 and \(L_{\min}\).

The updates of position \(p_{i} = (p_{i}^{1},\ldots,p_{i}^{n})\) and velocity \(v_{i}\) for each bat \(b_{i}\) (i = 1,…, m) at time step s are given as follows:

$$ \text{fr}_{i} = \text{fr}_{\min} + (\text{fr}_{\max} - \text{fr}_{\min})\,\beta $$
(2)
$$ {v_{i}^{j}}(s) = {v_{i}^{j}}(s-1) + \left[\hat{p}^{j} - {p_{i}^{j}}(s-1)\right]\text{fr}_{i} $$
(3)
$$ {p_{i}^{j}}(s) = {p_{i}^{j}}(s-1) + {v_{i}^{j}}(s) $$
(4)

where β ∈ [0, 1] is a uniformly distributed random number and \(\hat{p}^{j}\) denotes the current global best for the decision variable j. Random walks are performed in order to introduce variability into the solutions. For this, one solution among the current best solutions is selected and a random walk is applied to it if the condition rand > ri is satisfied:

$$ p_{\text{new}} = p_{\text{old}} + \epsilon \bar{L}(s) $$
(5)

where \(\bar{L}(s)\) is the loudness averaged over the whole bat population and 𝜖 ∈ [− 1, 1] is a random number that controls the direction and magnitude of the random walk. The loudness and the pulse rate at each iteration are updated as follows:

$$ L_{i}(s+1) = \alpha L_{i}(s) $$
(6)
$$ r_{i}(s+1) = r_{i}(0)[1-\exp(-\gamma s)] $$
(7)

where α and γ are constants. The binary version of the bat algorithm can be obtained by using a transfer function:

$$ S({v_{i}^{j}}) = \frac{1}{1+ e^{-{v_{i}^{j}}}} $$
(8)

Then Eq. 4 can be replaced by

$$ {p_{i}^{j}}=\left\{\begin{array}{ll} 1, & \text{if}\ S({v_{i}^{j}}) > \text{rand}.\\ 0, & \text{otherwise}. \end{array}\right. $$
(9)
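To make the update rules concrete, the following is a minimal sketch of one binary bat update step in Python/NumPy; the population size, frequency range, and constants are illustrative assumptions rather than values prescribed by the algorithm, and the random-walk step of Eq. 5 is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and parameters (assumed, not from the paper)
n_bats, n_dims = 20, 50
fr_min, fr_max = 0.0, 2.0
alpha, gamma = 0.9, 0.9

positions = rng.integers(0, 2, size=(n_bats, n_dims))   # binary positions p_i
velocities = np.zeros((n_bats, n_dims))
loudness = np.ones(n_bats)                               # L_i(0)
pulse_rate_0 = rng.uniform(0.0, 1.0, n_bats)             # r_i(0)
best = positions[0].copy()                               # global best \hat{p} (placeholder)

s = 1  # current time step
beta = rng.uniform(0.0, 1.0, n_bats)
fr = fr_min + (fr_max - fr_min) * beta                          # Eq. 2
velocities = velocities + (best - positions) * fr[:, None]      # Eq. 3
sigmoid = 1.0 / (1.0 + np.exp(-velocities))                     # Eq. 8 (transfer function)
positions = (sigmoid > rng.uniform(size=(n_bats, n_dims))).astype(int)  # Eq. 9

# For bats whose new solutions are accepted, loudness and pulse rate
# are updated according to Eqs. 6 and 7:
loudness = alpha * loudness
pulse_rate = pulse_rate_0 * (1.0 - np.exp(-gamma * s))
```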

2.2 Extreme learning machine

Traditionally, all the parameters of feedforward networks needed tuning, which creates dependencies between the parameters (weights and biases) of different layers. Over the past decades, gradient descent–based methods have been used for learning in various feedforward neural networks. However, gradient descent–based learning strategies are generally slow, can take improper learning steps, and can easily converge to local minima. Moreover, to reach acceptable learning performance, these gradient methods require many iterative learning steps.

Extreme learning machine (ELM) is a learning algorithm for fully connected feedforward neural networks that can be used for tasks like classification, feature learning, regression, compression, clustering, and sparse approximation. In ELM, the weights and biases between the input nodes and hidden nodes are not tuned; they are randomly initialized once, and the output parameters of the hidden nodes are learned in a single pass. ELMs have good generalization performance and are much faster than backpropagation [11]. In Fig. 1, an ELM with a single hidden layer is shown.

Fig. 1 An ELM with a single hidden layer

Let X be the input, and T be the target. Let W and b be the weights and bias, respectively, between the input layer and the hidden layer. The output of the i th hidden node is given by

$$ h_{i}(x) = A(W,b,x) $$
(10)

where A is an activation function, such as sigmoid, Gaussian, hard-limit, Fourier, and so on. The output of the ELM is given by

$$ O = \sum\limits_{i=1}^{L} \beta_{i}h_{i}(x) $$
(11)

The number of neurons present in the hidden layer is denoted by L. If there are N instances, the hidden layer output matrix is given by

$$ H = \left[\begin{array}{ccc} h_{1}(x_{1}) & {\ldots} & h_{L}(x_{1})\\ {\vdots} & {\ddots} & \vdots\\ h_{1}(x_{N}) &{\ldots} & h_{L}(x_{N}) \end{array}\right], T = \left[\begin{array}{ll} t_{1}\\ \vdots\\ t_{N} \end{array}\right] $$

The algorithm proceeds as

  • 1 Assign W and b random values and compute the hidden layer output matrix H.

  • 2 Estimate β by computing the pseudo-inverse H† of the matrix H:

    $$ \beta = H^{\dagger}T $$
    (12)

  • 3 Compute the output using Eq. 11.
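As an illustration of the single-pass training described above, the following is a minimal sketch of an ELM with a sigmoid hidden layer in Python/NumPy; the hidden layer size, data shapes, and helper names (`elm_fit`, `elm_predict`) are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def elm_fit(X, T, n_hidden=100, rng=np.random.default_rng(0)):
    """Train an ELM: random input weights, output weights via pseudo-inverse (Eq. 12)."""
    n_features = X.shape[1]
    W = rng.normal(size=(n_features, n_hidden))   # random input weights (step 1)
    b = rng.normal(size=n_hidden)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))        # sigmoid hidden layer outputs (Eq. 10)
    beta = np.linalg.pinv(H) @ T                  # beta = H^dagger T (step 2)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Compute the ELM output O = H beta (Eq. 11, step 3)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Usage on toy data: targets are one-hot encoded class labels
X = np.random.rand(60, 2000)                      # 60 samples, 2000 genes (illustrative)
labels = np.random.randint(0, 2, 60)
T = np.eye(2)[labels]                             # one-hot targets
W, b, beta = elm_fit(X, T)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```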

3 Proposed work

One of the challenging areas of research in machine learning is feature selection. Existing methods for feature selection result in suboptimal solutions; without an exhaustive search over the features, optimal solutions cannot be guaranteed. For datasets with high dimensionality, such as DNA microarrays, an exhaustive search is infeasible. In such cases, a fairly efficient solution to the optimization problem can be obtained by metaheuristic approaches. In our proposed work, we employ the binary bat algorithm to choose the best combination of features. The number of features selected in every iteration for each dataset is shown in Fig. 2. The feature subsets obtained from the binary bat algorithm are evaluated at each iteration using the novel fitness function proposed in this work.

Fig. 2 Graphs showing the number of features selected by the binary bat algorithm through the iterations for each dataset

3.1 Feature selection by BBA using proposed fitness function

Feature selection using the binary bat algorithm (BBA) works as follows. Each bat in the population is initialized randomly with a binary array of length equal to the cardinality of the feature set of the input data. In the array, zero and one represent the absence and presence of a feature, respectively. For each bat, the input data corresponding to its binary array is constructed, its fitness is evaluated in each iteration, and the solution is updated if its fitness value is greater than that of the previous iteration. The loudness and pulse rate of a bat are modified according to Eqs. 6 and 7 if its solution is accepted. This process continues until the user-defined maximum number of iterations is reached. The best solution obtained is then evaluated for performance. The proposed fitness function is given as

$$ \text{Fitness} = \text{Accuracy} + \left(1-\frac{f}{F}\right) + \sum\limits_{j \in V} \frac{{\sum}_{k=1}^{C}n_{k}({\mu_{k}^{j}} - \mu^{j})^{2}}{(\sigma^{j})^{2}} $$
(13)

where Accuracy is the classification accuracy obtained from the classification algorithm for the feature subset produced by the binary bat algorithm; F is the size of the original feature set and V is the feature subset obtained from the bat algorithm, with f = |V|; the total number of classes is denoted by C; nk denotes the size of class k; \({\mu _{k}^{j}}\) is the mean of the j th feature over the k th class, \(\mu^{j}\) is the mean of the j th feature over all samples, and \((\sigma ^{j})^{2} = {\sum }_{k=1}^{C}n_{k}({\sigma _{k}^{j}})^{2}\), where \(\sigma _{k}^{j}\) is the standard deviation of the j th feature in the k th class. The term \(\frac {{\sum }_{k=1}^{C}n_{k}({\mu _{k}^{j}} - \mu ^{j})^{2}}{(\sigma ^{j})^{2}}\) computes a score for each feature such that the interclass distance of the data points is maximized and the intraclass distance of the data points is minimized. The scores are computed for all the features in the subset and added together. The binary bat algorithm maximizes the fitness function given in Eq. 13 throughout the iterations: the first term ensures that the accuracy is maximized, the second term ensures that a minimum number of features is selected out of the given number of features, and the third term quantifies the relevance of the feature subset.
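To make Eq. 13 concrete, the following is a minimal sketch in Python/NumPy of how the fitness of one candidate feature subset could be computed; the `accuracy` argument is assumed to come from the classifier, and the function and variable names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fitness(accuracy, X, y, mask):
    """Eq. 13: accuracy + (1 - f/F) + sum of class-separation scores over selected features.

    accuracy : classification accuracy obtained for the subset (e.g., from the ELM)
    X        : (n_samples, F) data matrix
    y        : (n_samples,) class labels
    mask     : (F,) binary array, 1 where a feature is selected
    """
    F = X.shape[1]
    selected = np.flatnonzero(mask)
    f = len(selected)

    score = 0.0
    classes = np.unique(y)
    for j in selected:
        col = X[:, j]
        mu = col.mean()                                   # overall mean of feature j
        num, den = 0.0, 0.0
        for k in classes:
            ck = col[y == k]
            n_k = len(ck)
            num += n_k * (ck.mean() - mu) ** 2            # between-class scatter
            den += n_k * ck.std() ** 2                    # within-class scatter, (sigma^j)^2
        score += num / den if den > 0 else 0.0

    return accuracy + (1.0 - f / F) + score
```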

3.2 Classification

For the classification task, we employ an ELM classifier. We use the K-fold protocol (K = 10) as the cross-validation technique in the learning and testing procedure. K-fold means the dataset is divided into K parts; with K = 10 the dataset is divided into ten parts, so that 90% of the data is used for training and 10% for testing. In 10-fold cross-validation, one part is considered the test set and the remaining nine parts form the learning set, and this procedure is repeated for all ten parts. The average (or some other combination) of the classification results on the ten test sets represents a classification measure (CM) of the entire classifier. This constitutes one complete iteration; it can then be repeated many times by shuffling the dataset and performing a new complete iteration. The advantages of using an ELM are its good generalization power and fast learning speed. The fast learning speed results from the fact that the weights between the hidden and output layers are learned in one pass, as opposed to gradient descent with backpropagation. In this work, the ELM is used within the binary bat algorithm to compute the accuracy term of the fitness function.
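A minimal sketch of this 10-fold protocol is given below, using scikit-learn's KFold for the splits and the hypothetical `elm_fit`/`elm_predict` helpers sketched in Section 2.2; the shuffling and seed are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_accuracy(X, y, n_splits=10, seed=0):
    """One complete 10-fold iteration: average test accuracy over the 10 folds."""
    X, y = np.asarray(X), np.asarray(y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accuracies = []
    n_classes = len(np.unique(y))
    for train_idx, test_idx in kf.split(X):
        T_train = np.eye(n_classes)[y[train_idx]]            # one-hot targets
        W, b, beta = elm_fit(X[train_idx], T_train)           # 90% of the data for training
        pred = elm_predict(X[test_idx], W, b, beta).argmax(axis=1)
        accuracies.append((pred == y[test_idx]).mean())       # 10% for testing
    return float(np.mean(accuracies))
```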


4 Results

In this section, the effectiveness of the proposed methodology is analyzed using various performance measures such as classification accuracy, recall, precision, specificity, and F score on various datasets. The effect of feature selection on the classification accuracy is also analyzed. The nature of the datasets, which plays an important role in this work, is also discussed.

The gene expression datasets have the following characteristics:

  • 1 Fewer data items [7, 14]: The data samples correspond to the expression levels of genes in tissues of different patients, and the classes are different subcategories of cancer. With their typical characteristics of small sample size and high dimensionality, these can be called degenerate datasets. The number of samples generally available is fewer than 100, which hinders the generalizing capability of the models.

  • 2 Class imbalance [14]: The proportions of the classes are not equal; one class can dominate the other(s), and the minor classes can get misclassified.

The datasets used for the experiments are publicly available and were obtained from the gene expression model analyzer (GEMS system) and the KanGAL portal of IIT Kanpur (https://www.iitk.ac.in/kangal/bioinfo.shtml). For the experiments, we used an Intel Core i3-3240, 3.40 GHz processor, a 64-bit OS, and 3.8 GB of memory. Experiments were performed in MATLAB.

Table 1 presents the description of the datasets used for the evaluation of our methodology. We have used eight gene expression datasets, and the table gives the number of features, instances, and classes, along with the number of instances in each individual class. For example, the Leukemia-1 dataset has 50 genes and 72 samples, the Colon dataset has 2000 genes and 62 samples, and so on. The number of instances in each class of the eight datasets is given by a class–instance (C–I) pair; e.g., in the table, the C–I pair (0–38) means that class 0 has 38 instances. It can be observed that the number of attributes is far greater than the number of instances, which indicates the need for feature selection.

Table 1 Description of the datasets used for the experiments

The next table, i.e., Table 2, presents the names of the datasets, the number of attributes before and after feature selection, and the percentage of features selected. The lowest percentage of features selected is 15.89 for the 9_Tumors dataset and the highest is 39.07 for the Leukemia-2 dataset. The average percentage of features selected is 25.80.

Table 2 Number and percentage of features selected by BBA

The comparative performance evaluation of the proposed method with the novel fitness function and of the existing fitness function is shown in Table 3. An ELM classifier with a sigmoid activation function has been used; the sigmoid activation function was found to perform better than other activation functions such as sine, hard limit, and radial basis. It can be observed that the proposed method's results are better than those of the existing method with regard to evaluation measures such as classification accuracy, precision, recall, specificity, and F score. Here, the existing method refers to the fitness function found in the literature, which comprises just the classification accuracy and the dimensionality of the selected feature subset. The proposed methodology obtained the highest accuracies for all the datasets when compared with the method using the existing fitness function. The classification accuracies for the Leukemia-1, Leukemia-2, 9_Tumors, Brain_Tumor1, Brain_Tumor2, and DLBCL datasets are 100%, and those for Colon and Lymphoma are 99.29% and 89.50%, respectively.

Table 3 Performance evaluation of the proposed and the existing method

Accuracy, precision, recall, specificity, and F score are computed using the following formulas:

$$ \mathrm{Accuracy = \frac{tp+tn}{tp+tn+fp+fn}} $$
(14)
$$ \mathrm{Precision = \frac{tp}{tp+fp}} $$
(15)
$$ \mathrm{Recall = \frac{tp}{tp+fn}} $$
(16)
$$ \mathrm{Specificity = \frac{tn}{tn+fp}} $$
(17)
$$ F\ \mathrm{score} = 2 \cdot \mathrm{\frac{Precision \cdot Recall}{Precision+Recall}} $$
(18)

where, tp denotes true positive, i.e., number of observations that are actually positive and also predicted positive.

tn denotes true negative, i.e., number of observations that are actually negative and also predicted negative.

fp denotes false positive, i.e., number of observations that are actually negative but are predicted positive.

fn denotes false negative, i.e., number of observations that are actually positive but are predicted negative.
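For reference, the following is a small sketch of how these measures can be computed from binary predictions in Python; the function and variable names are illustrative.

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, specificity, and F score (Eqs. 14-18)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, f_score
```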

Table 4 shows the effect of feature selection on the classification accuracies for all the datasets used. From the table, we can notice a significant change for four datasets: Colon, 9_Tumors, Brain_Tumor1, and Brain_Tumor2. For the remaining datasets, the difference in accuracies is less than 6%. It can be seen that feature selection is necessary for high-dimensional datasets such as microarray gene expression datasets. Some authors have highlighted critical issues with cross-validation error estimates for small-sample microarray classification [25, 26]. Small samples of one class may also be useful and sufficient for a classifier to learn.

Table 4 Effect of feature selection on classification accuracies

Table 5 presents the comparative performance of different methods for gene expression data classification and of our proposed method. The performance of the proposed methodology is slightly lower in terms of accuracy for the Colon and Lymphoma datasets. For the rest of the datasets, the classification accuracies of the proposed method are greater than or equal to those of the methods being compared.

Table 5 Comparative testing accuracies of BBA-ELM with other methods

5 Conclusions

This paper presents a methodology to classify cancer using gene expression data. To overcome the high-dimensionality problem, we first perform a feature selection task, for which we make use of a bio-inspired algorithm called the binary bat algorithm with a novel fitness function proposed in this paper. The proposed fitness function involves minimizing the intraclass and maximizing the interclass distances of the data points, along with maximizing the accuracy and minimizing the dimension of the data. For the classification task, we make use of the extreme learning machine, which is found to be fast and to have good generalization capability. We have conducted experiments on eight gene expression datasets. We have compared our methodology using the proposed fitness function against the existing fitness function mostly used in the literature for feature selection. It has been observed that our proposed method performs better than the original method with regard to classification accuracy, precision, recall, specificity, and F score.