1 Introduction

Identification of disease critical genes by analyzing gene expression datasets is an important task in bioinformatics. The number of genes in a gene expression dataset may exceed 40,000, whereas the number of genes that may be called disease critical is expected to be small; otherwise, close monitoring of the genes and further study for therapeutic intervention could not be conducted easily. Because of its profound impact on maternal and neonatal health all over the world, this paper focuses on preeclampsia [1], a disease affecting pregnant women. As the disease has hereditary components [2,3,4], identification of the disease-causing genes has gained importance in recent times.

Gene expression datasets are required to solve many important problems in bioinformatics. These datasets are obtained through microarray technology [5, 6]. The data present protein production information of thousands of genes, either at different time steps or under different conditions (patients, disease types, tissues) in organisms. Gene expression data are obtained from public repositories (Gene Expression Omnibus, ArrayExpress) and undergo pre-processing stages such as normalization [7, 8]. The final data are in the form of a matrix. The rows of the data matrix correspond to genes and the columns (often termed samples) correspond to either time steps or biological conditions. Depending on the type of column, the dataset is known as time series or steady state gene expression data, respectively. Each matrix element is a floating point number, mapped into a certain range depending on the type of normalization. Analysis of gene expression data provides insight into normal as well as disease-related cell functioning in organisms.

In preeclampsia, some genes are likely to be differentially expressed in the tissues of the placenta. So, for gene expression analysis of preeclampsia, data are collected from the placentas of different subjects (normal/diseased). Using such data, meta-analyses have been performed [9,10,11] to obtain the differentially expressed genes of preeclampsia [12].

In recent times, meta-heuristic algorithms have found wide application in the domain of bioinformatics [13]. Many meta-heuristics have been successfully applied in the areas of reconstruction of gene regulatory networks [14], clustering or bi-clustering [15] and genome-wide association studies to detect effective epistatic interactions among genes [16,17,18].

Gene selection is usually performed using sample classification. On multi-class samples of gene expression data, various classification techniques have been employed to determine the disease class [19, 20]. Different paradigms of data science, mathematical modeling, statistical schemes and computational theories have been applied to this field. Widely used data classifiers such as k nearest neighbor (kNN) [21] and support vector machine (SVM) [22] have been used in sample/disease classification [23]. Meta-heuristic algorithms such as genetic algorithms (GA) [24] and particle swarm optimization (PSO) [25] have been used to optimize gene selection. Li et al. (2001) [26] proposed a well-structured combination of GA and kNN (referred to as GA-kNN) to perform gene selection by sample classification [27]. Li et al. (2001) [26] also showed how, using statistical analysis, genes may be ranked according to their impact in causing a disease. Thereafter, numerous algorithms [28,29,30,31,32,33,34,35,36,37,38] have been proposed to enrich the field of gene selection by sample classification. For prioritization of disease critical genes of preeclampsia, Tejera et al. (2013) performed co-expression network analysis [39] and compared their results with two widely used state-of-the-art classification methods, namely, GANN (combination of GA and a nearest neighbor classifier) and GADA (combination of GA and a discriminant analysis classifier). Very recently, meta-heuristic algorithms such as variable neighborhood search (VNS) [40], differential evolution (DE) [41], simulated annealing (SA) [42] and PSO have been combined with kNN to form VNS-kNN, DE-kNN, SA-kNN and PSO-kNN, respectively. These algorithms have been used to identify preeclampsia genes [43, 44].

In this paper, methodologies employing meta-heuristics, sample classification and statistical analysis have been developed for the selection of disease critical genes. Four meta-heuristic algorithms, namely, honey bee mating optimization (HBMO) [45], harmony search (HS) [46], DE and GA have been used. The classifier kNN is embedded in each algorithm for sample classification. Two hybrid algorithms, HBMO-kNN and HS-kNN, are newly proposed here. Two other algorithms, namely, DE-kNN [43] and SGA-kNN (a modified version of GA-kNN [26]), have also been implemented to present a comparative study of the four hybrid algorithms. Many ideas used here have been taken from the GA-kNN algorithm proposed by Li et al. (2001) [26]. A simple traditional version of GA [47] has been incorporated in SGA-kNN. In this research, three preeclampsia datasets were collected for experiments. In each experiment, all four hybrid algorithms have been executed and their performances compared. During the execution of each algorithm, candidate solutions having classification accuracy greater than 90% have been saved. Later on, for each dataset, the genes commonly returned by all algorithms have been termed critical genes. Across the experiments, the classification accuracies of the algorithms varied between 92.46 and 100%. Among all algorithms, HBMO-kNN was found best in classification accuracy (99.64–100%); DE-kNN secured second place (99.42–100%). The numbers of genes common to the outputs of all algorithms on the three datasets are 87, 73 and 74, respectively. Widely known preeclampsia genes, STS, EPHX1, LEP, LRP8, ADD1 and INSR, have been detected here. The algorithms HBMO-kNN and HS-kNN are newly proposed in this paper, and the statistical approach involving different meta-heuristics is also newly planned and found to be reliable.

The rest of the paper is organized as follows: Sect. 2 discusses the formulation of gene selection as an optimization problem. The algorithm kNN-fit (which computes the fitness of a candidate solution using the classifier kNN) and the four meta-heuristic based algorithms for gene selection are discussed in Sect. 3. Data and experimental results are presented in Sect. 4. Section 5 concludes the paper and gives some future scopes of research.

2 Gene Selection by Sample Classification

The input is a microarray gene expression dataset (say matrix X). Generally, the total number of genes present in X is above 40,000. The objective of this work is to identify a very small subset (containing 30 genes) of genes in X, such that the subset collectively discriminates between normal and preeclampsia affected samples. The search involves \({}^{n}C_{d}\) possibilities, where n is the number of genes in the dataset and d is the number of genes in the subset. So, optimization techniques are applied to identify the subset of critical genes in this huge and diversified search space.
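To get a sense of the scale of this search space, a quick back-of-the-envelope computation (an illustrative Python sketch; the paper's own experiments were run in Matlab):

```python
from math import comb, log10

n, d = 40000, 30        # genes in the dataset, genes per candidate subset
subsets = comb(n, d)    # number of possible 30-gene subsets
print(f"about 10^{log10(subsets):.0f} candidate subsets")  # roughly 10^106
```

No exhaustive enumeration is feasible at this scale, which motivates the meta-heuristic search described in Sect. 3.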

2.1 Input Dataset X

The structure of gene expression dataset X, on which the study is based, is provided in Table 1. Here, as in standard form, gene Gi is the ith row identifier. The columns contain normal samples (\({N_l}\) stands for the lth sample) and diseased samples (\({P_o}\) stands for the oth sample). Here, \(i \in \left\{ {1,~2,~...,~n} \right\}\), \(l \in \left\{ {1,~2,~...,~maxN} \right\}\) and \(o \in \left\{ {1,~2,~...,~maxP} \right\}\), where n is the number of genes, maxN is the number of normal samples and maxP is the number of preeclampsia affected samples in dataset X. Each entry is the normalized expression level \(g_{{i,j}}^{C}\) of gene Gi at sample j (the class \(C\) of sample j may be either normal or diseased, \(j \in \left\{ {1,~2,~...,~m} \right\}\), where \(m=maxN+maxP\)).

Table 1 Structure of microarray gene expression dataset

2.2 Expression Sub-matrix xS

A candidate solution S contains d non-repetitive gene indexes chosen randomly from \({X_{index}}\). Here, \({X_{index}}=\left( {1,~2,~ \ldots ,n} \right)\) and n is the number of genes in the dataset. A matrix \({x_S}\), \({x_S} \subset X\), has been obtained from X in the following way: for each gene index in S, the corresponding row of X has been copied and appended to \({x_S}\). The relationship among candidate solution S, gene expression matrix X and expression sub-matrix xS is shown in Fig. 1. A small example is shown there, in which the number of rows (genes) in gene expression matrix X is 13. The rows are denoted G1, G2, …, G13. The number of elements (d) in candidate solution S is 4. Each randomly chosen index i (i ∊ {1, 2, 3, …, 13}) in S points to the row Gi in X, and each such row is appended to xS.

Fig. 1 An example of deriving xS from S and X
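A minimal sketch of this sub-matrix extraction (illustrative Python/NumPy, mirroring the toy dimensions of Fig. 1; the number of samples m is arbitrary here):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 13, 6, 4                # genes, samples, solution length (as in Fig. 1)
X = rng.random((n, m))            # toy normalized expression matrix

# Candidate solution S: d non-repetitive gene indexes drawn from X_index
S = rng.choice(np.arange(n), size=d, replace=False)

# Expression sub-matrix xS: the rows of X pointed to by the indexes in S
xS = X[S, :]
print(S + 1, xS.shape)            # 1-based indexes as in the paper; shape (4, 6)
```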

2.3 Sample Classification of xS

Each sample in an expression sub-matrix xS is classified using kNN. If the computed class of a sample in xS conforms to its original class in X, the sample is termed a properly classified sample. Gene selection seeks those genes in S which maximize the number of properly classified samples in xS. The number of samples in xS is the same as the number of samples in X, and the number of rows equals the number of elements in candidate solution S. Here, classification accuracy (CA) is used to specify the quality of candidate solution S and is calculated by Eq. 1,

$$\text{CA} = \frac{\text{Number of properly classified samples in } x_S}{\text{Total number of samples in } X} \times 100\%.$$
(1)

3 Methods

Meta-heuristic algorithms are used here for gene selection. A candidate solution of length d represents a collection of differentially expressed disease critical genes. Each element is a gene index selected from the gene index vector \({X_{index}}\). To calculate the fitness of candidate solutions, the classifier kNN [48, 49] is used. The use of kNN in sample classification is explained in Sect. 3.1, and Sect. 3.2 presents the meta-heuristic-based algorithms for gene selection. The uses of the meta-heuristic algorithms HBMO and HS are discussed in detail; brief descriptions of the other two algorithms, DE and GA, are also provided.

3.1 k Nearest Neighbor (kNN)

Cross-validation [50] is a kind of model validation used to assess how well the samples in a dataset can be classified. Leave one out (LOO), a type of cross-validation, is applied in this research to classify the samples in xS. Here each sample belongs to one of two classes, normal or diseased. Generally, in cross-validation, the dataset is partitioned into two disjoint sets, a test-set and a train-set. In the LOO scheme, m validation tests are performed on a dataset having m samples; in other words, a validation test is made to determine the class (normal/diseased) of each sample. In each validation test, the sample to be classified is kept in the test-set and the remaining (m − 1) samples are kept in the train-set.

For each of the m validation tests, the classifier kNN is used to determine the class of the sample in the test-set. In kNN, the distances (here, Euclidean distances) between the sample in the test-set and each of the (m − 1) samples in the train-set are computed. The samples in the train-set are ranked in ascending order of their distance from the test sample. The top k (here k = 3) samples are taken and termed the nearest neighbors of the test sample. If the majority (two out of three) of the nearest neighbors have the same class as the test sample, the test sample is considered properly classified.

If the test sample is properly classified, a score of 1 is assigned to it. A score of 0 is assigned otherwise. The summation of scores in these m validation attempts, each on one sample, gives the total number of properly classified samples in xS. As the objective function for the search is to maximize the number of properly classified samples in xS, the summation of scores gives the fitness value of candidate solution. Description of kNN used for fitness calculation of candidate solution S in gene selection [kNN-fit (xS)] is given in Algorithm 1.

The expression sub-matrix xS is the input to kNN-fit (xS). Initially, all samples in xS are placed into a set All. Each sample contains the expression levels of all genes present in a candidate solution S. For each sample in All, a cross-validation test is performed: the current sample i is placed in the test set (Test), and the training set (Train) contains all other samples in All excluding i. For each sample j ∊ Train, the distance of j from i is calculated. Here, samples i and j and the distance Dist_j_i are given by Eqs. (2), (3) and (4), respectively,

$$i = \left[ g_{1,i\_ind}^{C_i},\; g_{2,i\_ind}^{C_i},\; g_{3,i\_ind}^{C_i},\; \ldots,\; g_{d,i\_ind}^{C_i} \right],$$
(2)
$$j = \left[ g_{1,j\_ind}^{C_j},\; g_{2,j\_ind}^{C_j},\; g_{3,j\_ind}^{C_j},\; \ldots,\; g_{d,j\_ind}^{C_j} \right],$$
(3)
$$Dist\_j\_i = \sqrt{ \left( g_{1,j\_ind}^{C_j} - g_{1,i\_ind}^{C_i} \right)^2 + \left( g_{2,j\_ind}^{C_j} - g_{2,i\_ind}^{C_i} \right)^2 + \ldots + \left( g_{d,j\_ind}^{C_j} - g_{d,i\_ind}^{C_i} \right)^2 }.$$
(4)
Algorithm 1 kNN-fit (xS)

Here, Ci and Cj are the classes of samples i and j, respectively, and i_ind and j_ind are the indexes of samples i and j in xS. On the basis of the distances calculated using Eq. (4), all j ∊ Train are sorted in ascending order. The k (= 3) samples with the least distance from i are kept in a set, Neighbor_i. The class of sample i (Ci) is obtained as follows: if i_ind (the index of i in xS) is less than or equal to maxN, Ci is 'normal'; otherwise, Ci is 'diseased'. If the class Cj of most of the samples (j's) in Neighbor_i matches Ci, i is said to be properly classified. If a sample i is properly classified, a variable (score_i) is set to 1; otherwise, 0 is assigned to it. Fit_S is obtained by summing the individual scores (score_i) and is returned as the fitness of the current solution S.
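A compact sketch of kNN-fit along these lines (illustrative Python/NumPy, not the paper's Matlab code; class labels are derived from column position via maxN, as described above):

```python
import numpy as np

def knn_fit(xS, maxN, k=3):
    """Fitness of candidate solution S: the number of properly classified
    samples in xS under leave-one-out kNN (k = 3, Euclidean distance)."""
    m = xS.shape[1]                                   # number of samples
    labels = np.array([0] * maxN + [1] * (m - maxN))  # 0: normal, 1: diseased
    fit_S = 0
    for i in range(m):                                # LOO: sample i is the test set
        train = np.delete(np.arange(m), i)
        dists = np.linalg.norm(xS[:, train] - xS[:, [i]], axis=0)
        neighbor_i = train[np.argsort(dists)[:k]]     # k nearest training samples
        predicted = 1 if 2 * labels[neighbor_i].sum() > k else 0  # majority vote
        fit_S += int(predicted == labels[i])          # score_i
    return fit_S
```

The classification accuracy of Eq. (1) is then fit_S divided by the total number of samples, multiplied by 100.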

3.2 Meta-heuristics in Gene Selection

In gene selection, a state space search is conducted to obtain the solution. In a microarray gene expression dataset, the number of genes may be of the order of 40,000, and only 30 genes are to be chosen, so the search space contains \({}^{40000}C_{30}\) possibilities. When an optimization problem involves a large state space or high complexity (NP-hard or NP-complete), an advanced class of algorithms, called meta-heuristic algorithms, is applied to get a near-optimal solution in reasonable time. To optimize gene selection, four meta-heuristic algorithms, namely, HBMO, HS, DE and GA (traditional version), have been used. Two new hybrid algorithms, HBMO-kNN and HS-kNN, are proposed here. The other two algorithms, DE-kNN and SGA-kNN (SGA being the traditional version of GA [47]), have been implemented to match the current framework. Short descriptions of the meta-heuristic algorithms and how they have been used in gene selection follow:

3.2.1 Honey Bee Mating Optimization (HBMO)

As in other population-based meta-heuristics, HBMO [51] uses a population of candidate solutions to explore the search space. In HBMO, the behavior of honey bees during mating is imitated to perform the state space search. A population of queens (candidate solutions), each named Qi, is initialized randomly. Qi is associated with a repository Spi where she can store D drones (vectors similar in structure to Qi). W worker bees are initialized; each worker bee wi is a single-element variable, initialized randomly. Workers improve queens using local heuristics as follows: a queen Qi and a worker wi are chosen randomly, the queen is copied into a variable named temp, a random element of temp is selected and replaced by the value stored in wi, and if the fitness of temp is greater than that of Qi, Qi is replaced by temp. The generation-based loop then starts and continues until its termination condition is met; the termination criterion in HBMO is either that the iteration count reaches maxIter or that the highest fitness of a queen reaches a certain value. In each generation, the improved queens are evaluated and ranked. Each ranked queen undergoes three main steps: (1) choosing its mates through a process called the maiden flight, (2) crossover and (3) mutation.

In a maiden flight, each queen Qi undergoes the following process: at energy state t = 0, Qi is assigned an initial energy ei(0) and an initial speed sdi(0). The energy of Qi is then reduced in steps. At each energy state ei(t), queen Qi aims to fill its repository Spi with a drone \(dr_{i}^{t}\) [\(dr_{i}^{t}\): drone of queen Qi produced at her energy state ei(t)]. \(dr_{i}^{t}\) is a vector having the same dimension as Qi and takes values in the same range specified for Qi. The fitness of \(dr_{i}^{t}\) is calculated. If the fitness is greater than that of Qi and the number of drones in Spi is less than D, \(dr_{i}^{t}\) is added to Spi. If \(dr_{i}^{t}\) is less fit than Qi, it is still given a chance to be added to Spi with a small probability \({e^{ - del/s{d_i}\left( t \right)}}\), where del is the difference between the fitness of Qi and \(dr_{i}^{t}\), and sdi(t) is the speed of the ith queen at the tth energy state. The energy and speed of Qi are reduced by Eq. (5) and Eq. (6), respectively,

$$e_i(t+1) = e_i(t) - \gamma,$$
(5)
$$sd_i(t+1) = sd_i(t) \times \alpha.$$
(6)

Here, \({e_i}\left( t \right)\) is the energy of Qi at step t. \(\alpha\) and \(\gamma\) are parameters of HBMO.

After all the queens fill their repositories, recombination (crossover) starts. Each queen is recombined with each of the D drones present in its repository Spi, producing 2*D offspring; here, the crossover is not probability dependent. After crossover, each offspring is mutated with a low mutation probability Pm.

The offspring produced from all queens are evaluated and ranked, and the top B offspring are accepted as broods. These broods are improved by workers using local heuristics, almost in the same way as was done for the queens; the only difference is that in queen improvement a queen was chosen at random, whereas in brood improvement the broods are chosen sequentially. While there is a queen worse than some brood in the brood list, the worst of the queens is replaced by the best brood; when such a replacement occurs, the brood is deleted from the brood list.

Algorithm 2 HBMO-kNN

Based on HBMO, a hybrid algorithm called HBMO-kNN is proposed here for gene selection; its description is provided in Algorithm 2. Each queen Qi and each drone \(dr_{i}^{t}\) is a vector of d elements, where each element is a gene index chosen randomly from the gene index vector \({X_{index}}\). Each worker is a single-element variable whose value is a gene index, randomly chosen from \({X_{index}}\). The energy ei(0) and speed sdi(0) of queen Qi at step t = 0 are initialized randomly within the ranges (0.5–1) and (14–15), respectively. Evaluation of a solution S (queen/drone/offspring/brood) is a two-step process. In the first step, a sub-matrix of X, named xS, is prepared, as described with an example in Fig. 1. In the second step, kNN-fit (xS) is called with xS as an argument; an attempt is made to classify each sample i in xS as normal or diseased, and the number of properly classified samples gives the fitness of the solution. Across iterations, the global best solution q* is updated and returned as output at the end.

The efficiency of a meta-heuristic algorithm mainly depends on a judicious mixture of two strategies: exploration (diversification) and exploitation (intensification). Though HBMO is based on the same theory of genetic evolution (using crossover, mutation and selection) as the genetic algorithm, it incorporates some additional measures of exploration and exploitation. For more exploration, an annealing function \(\left( {{e^{ - del/s{d_i}\left( t \right)}}} \right)\) allows a queen to choose a worse drone for mating. Improvement of queens and broods by workers and replacement of worse queens by better broods are measures that support exploitation. Some problem-specific changes to HBMO have also been implemented, especially in the initialization of the queen's speed (originally in the range 0.5–1, here 14–15). Together, these measures help HBMO find disease-causing genes more efficiently than the state-of-the-art algorithm GA-kNN.
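The maiden-flight step at the heart of HBMO-kNN can be sketched as follows (illustrative Python; the fitness function, the drone generator new_drone and the decay parameters gamma and alpha are assumptions standing in for the paper's Matlab implementation):

```python
import math
import random

def maiden_flight(queen, fitness, new_drone, D, gamma=0.05, alpha=0.9):
    """Fill the queen's repository Sp with up to D drones. A worse drone is
    accepted with probability exp(-del/speed); energy and speed decay per
    Eqs. (5) and (6)."""
    Sp = []
    energy = random.uniform(0.5, 1.0)    # e_i(0), as initialized in HBMO-kNN
    speed = random.uniform(14.0, 15.0)   # sd_i(0), widened from the usual (0.5-1)
    while energy > 0 and len(Sp) < D:
        drone = new_drone()              # random vector, same structure as queen
        delta = fitness(queen) - fitness(drone)
        if delta < 0 or random.random() < math.exp(-delta / speed):
            Sp.append(drone)             # fitter drone, or worse one accepted by chance
        energy -= gamma                  # Eq. (5)
        speed *= alpha                   # Eq. (6)
    return Sp
```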

3.2.2 Harmony Search (HS)

Harmony search [52] is a recent nature-inspired meta-heuristic algorithm. The natural phenomenon behind it is the harmony improvisation process of a musician. Initially, some harmony memory exists in the musician's mind for a certain piece of music, which he wants to make more accurate through extensive practice. In every practice he improvises a new harmony, estimates its quality and, according to that quality, accepts or discards the harmony; if it is accepted, the musician's memory is updated. Searching for the state of highest esthetic estimation leads the musician to improve. In practice, instead of one musician, there may be a number of musicians with different instruments performing together, so they must reach a state in which the combination of harmonies improvised on these instruments provides a good esthetic estimation to the listeners.

The HS algorithm [53] first initializes a harmony memory (HM), the initial population, with random values \(h{m_{i,j}}\), \(1 \leqslant i \leqslant HMS\) and \(1 \leqslant j \leqslant d\), where \(h{m_{i,j}}\) is an element of HM, \(HMS\) is the size of the population and d is the number of decision variables. A new harmony, i.e., a new candidate solution S, is then improvised. An element \({s_j}\) of S is chosen in the following way: if a value rand(0,1) is less than the harmony memory consideration rate (HMCR), \({s_j}\) is chosen randomly from \({\overline {{hm}} _j}\), the domain of the jth decision variable \(\left( {h{m_{1,j}}, \ldots ,h{m_{HMS,j}}} \right)\), i.e., the jth column of HM; otherwise, the value of \({s_j}\) is chosen randomly from \({X_{index}}\). The parameter HMCR lies between 0 and 1. Selecting \({s_j}\) from \({\overline {{hm}} _j}\) constitutes the intensification of this meta-heuristic. \({s_j}\) is then further modified as follows: if a value rand(0,1) is less than the pitch adjustment rate (PAR), a parameter of HS, \({s_j}\) is modified by Eq. (7). Though Eq. (7) returns a float value depending on the variance of the previous population, that float value is converted here to an integer to reflect the proper characteristics of a decision variable (gene index) of the gene selection problem. Moreover, the equation makes only a small change to the current gene index,

$$s_j = \text{int}\left( s_j \pm \text{rand}(0,1) \times \text{BW} \right),$$
(7)

where \({\text{BW}}\) is called the band width. This step implies the diversification of this meta-heuristic. The value of BW is calculated by Eq. (8),

$$\text{BW} = k1 \times \sqrt{\text{Var}\left( \overline{hm}_j \right)},$$
(8)

where \(k1\) is a proportionality constant. BW is proportional to the standard deviation of the current population, which increases the exploratory power of HS [53]. If the fitness value of the new candidate solution S is better than that of the worst candidate solution in HM, the new solution replaces the worst. This continues for a given number of iterations (NI); the NI parameter may be described as the number of generations, i.e., the terminating criterion of the outer loop.

Algorithm 3 HS-kNN

Based on HS, an algorithm for gene selection is proposed here, namely, HS-kNN (Algorithm 3). One harmony in harmony memory HM represents one candidate solution S, containing d genes randomly selected from the gene index vector \({X_{index}}\). To evaluate a candidate solution S, an expression sub-matrix \({x_S}\) is created; \({x_S}\) is then evaluated by kNN-fit (xS), and the number of properly classified samples of \({x_S}\) is taken as the fitness value of S. Replacing only the worst candidate solution by the new one slows convergence, so we have modified this step as follows: instead of replacing only the worst candidate solution, the whole population is replaced by a new HM, say newHM. Each candidate solution of newHM is formed by the process of creating a new harmony. The new population is accepted if either its average fitness or its best fitness is better than or equal to the corresponding value of the current population. The global best candidate solution is stored in hm* in every iteration.

The effectiveness of HS as an evolutionary meta-heuristic search is based on the natural phenomenon of how musicians use their short-term memory (here, the harmony memory) to create notes resulting in a pleasant harmony. The process of selecting a new harmony from the previous harmony memory (HM) corresponds to exploitation of the search process, as the candidates in HM are better in fitness. On the other hand, changing any note of the new harmony randomly, or depending on some function of the past memory, is the exploration part. In HS-kNN, the exploration capability of the search is further enhanced by incorporating the standard deviation of the previous harmony memory into the expression for bandwidth \(\left( {BW=k1 \times \sqrt {Var\left( {{{\overline {{hm}} }_j}} \right)} } \right)\), which is subsequently used in modifying candidate solutions. Moreover, a whole new harmony memory (a set of new harmonies) is generated instead of a single new harmony for replacement. The average and maximum fitness of the new HM are compared with the average and maximum fitness of the current HM, respectively; if the new HM is better in either of the two measures, it replaces the current HM. This modification enhances the exploitation of the search process, resulting in faster convergence without degrading the exploration capability of the search. The balance between exploration and exploitation is thus achieved more effectively. This way of finding the set of disease critical genes is found to be more accurate than the state-of-the-art algorithm GA-kNN.
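The improvisation step of HS-kNN can be sketched as follows (illustrative Python/NumPy; the clipping of adjusted indexes to the valid range and the omission of duplicate-index repair are simplifying assumptions, and X_index may be, e.g., range(1, n + 1)):

```python
import random
import numpy as np

def improvise(HM, X_index, HMCR=0.80, PAR=0.60, k1=1.0):
    """Improvise one new harmony S from harmony memory HM (an HMS x d matrix
    of gene indexes), using the HMCR/PAR scheme with BW from Eq. (8)."""
    HMS, d = HM.shape
    S = np.empty(d, dtype=int)
    for j in range(d):
        if random.random() < HMCR:                 # memory consideration
            S[j] = HM[random.randrange(HMS), j]
            if random.random() < PAR:              # pitch adjustment, Eq. (7)
                BW = k1 * np.sqrt(HM[:, j].var())  # Eq. (8)
                S[j] = int(S[j] + random.choice([-1, 1]) * random.random() * BW)
                S[j] = min(max(S[j], X_index[0]), X_index[-1])
        else:
            S[j] = random.choice(X_index)          # random selection from X_index
    return S
```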

3.2.3 Differential Evolution (DE)

DE [54] is a population-based meta-heuristic used for optimizing real-valued stochastic functions. Recently, in a hybrid algorithm named DE-kNN [43], DE and the classifier kNN were combined and used for gene selection. Here, DE-kNN is implemented with changed parameter values and used in a comparative study for gene selection. In this algorithm, each candidate solution, an agent Ai, is a vector having d elements. Each element \({a_{ij}}\left( {j \in \left\{ {1,~2,~ \ldots ,~d} \right\}} \right)\) of Ai is a gene index, initially chosen randomly from the gene index vector \({X_{index}}\). In each generation, for each candidate solution Ai, a mutation vector Mi is obtained using a formula involving three other candidate solutions of the current population and a parameter called the differential weight, F. A trial vector Ti is obtained using crossover (with probability Cr) between Mi and Ai. Each element of Ti is a non-repeating gene index from \({X_{index}}\). To calculate the fitness of a candidate solution S (Ai or Ti), the expression sub-matrix xS is created based on S and X, and the function kNN-fit (xS) is used to calculate the fitness. The best solution (by fitness) in the final population is returned.
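A sketch of one DE generation as used here (illustrative Python/NumPy; the F and Cr values, the rounding/clipping of mutant indexes and the omitted duplicate-index repair are assumptions):

```python
import numpy as np

def de_generation(pop, fitness, X_index, F=0.5, Cr=0.9, rng=np.random.default_rng()):
    """For each agent A_i, build mutant M_i from three distinct other agents,
    cross over into trial T_i, and keep the fitter of A_i and T_i."""
    Num, d = pop.shape
    new_pop = pop.copy()
    for i in range(Num):
        others = [x for x in range(Num) if x != i]
        r1, r2, r3 = rng.choice(others, size=3, replace=False)
        M = np.rint(pop[r1] + F * (pop[r2] - pop[r3])).astype(int)
        M = np.clip(M, X_index[0], X_index[-1])      # keep indexes in valid range
        T = np.where(rng.random(d) < Cr, M, pop[i])  # binomial crossover
        # (a repair step would deduplicate gene indexes in T here)
        if fitness(T) >= fitness(pop[i]):
            new_pop[i] = T
    return new_pop
```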

3.2.4 Genetic Algorithm (GA)

GA [47, 55] is a population-based meta-heuristic which imitates genetic evolution in organisms. In SGA-kNN, the simple/traditional version of GA is combined with kNN. A number (Num) of candidate solutions, each called a chromosome Ci, is initialized with gene indexes randomly chosen from \({X_{index}}\), such that each gene index in Ci is unique. Based on each Ci, the gene expression sub-matrix xS is created, and the fitness of Ci is calculated by calling kNN-fit (xS) with xS as the input argument. In each generation, containing Num chromosomes, Num/2 selection operations are performed. In each selection operation, two parents are chosen by roulette-wheel selection. A one-point crossover is performed on the selected parents with crossover probability Pc, producing two offspring. Mutations are performed with a low probability Pm on each offspring. Between the parent pair and the offspring pair, the fitter pair is passed to the next generation. During crossover and mutation, measures are taken so that the gene indexes in Ci remain unique. The best solution in the final population is returned.
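The selection and variation operators of SGA-kNN can be sketched as follows (illustrative Python/NumPy; X_index is assumed to be a NumPy array of valid gene indexes, and the uniqueness-repair step is only indicated by a comment):

```python
import numpy as np

def roulette_select(pop, fits, rng):
    """Roulette-wheel selection: pick a chromosome with probability
    proportional to its fitness."""
    p = fits / fits.sum()
    return pop[rng.choice(len(pop), p=p)]

def crossover_mutate(p1, p2, X_index, Pc, Pm, rng):
    """One-point crossover (probability Pc), then per-element mutation
    (probability Pm) to a random gene index."""
    c1, c2 = p1.copy(), p2.copy()
    if rng.random() < Pc:
        cut = rng.integers(1, len(p1))   # crossover point
        c1[cut:], c2[cut:] = p2[cut:], p1[cut:]
    for c in (c1, c2):
        mask = rng.random(len(c)) < Pm
        c[mask] = rng.choice(X_index, size=int(mask.sum()))
        # (a repair step would restore the uniqueness of gene indexes here)
    return c1, c2
```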

The final solution produced by a single algorithm has not been considered here as the output, because the input gene expression data are noisy and the number of samples is insufficient. Instead, a collective framework has been proposed which takes a normalized gene expression dataset as input and encompasses all four meta-heuristic algorithms (HBMO-kNN, HS-kNN, DE-kNN and SGA-kNN). Figure 2 shows how it works in a realistic environment. Multiple executions of each algorithm have been performed separately on a dataset. During each execution of an algorithm, candidate solutions (termed here good solutions) having classification accuracy (fitness) greater than or equal to 90% have been stored in an algorithm-specific repository. After execution of all algorithms, each algorithm-specific repository is processed to contain only unique genes. The overlap among the genes in the different repositories is then obtained and returned as the set of disease-causing genes, as sketched below.
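This final overlap step amounts to a set intersection (illustrative Python; the repository contents shown are hypothetical):

```python
# Each repository holds the unique gene indexes appearing in any good
# solution (classification accuracy >= 90%) saved during an algorithm's runs.
repos = {
    "HBMO-kNN": {12, 57, 303, 1181, 7740},   # hypothetical contents
    "HS-kNN":   {12, 57, 944, 1181, 7740},
    "DE-kNN":   {12, 57, 1181, 2050, 7740},
    "SGA-kNN":  {12, 57, 630, 1181, 7740},
}

# Disease-causing genes: those common to all four repositories
critical = set.intersection(*repos.values())
print(sorted(critical))                      # -> [12, 57, 1181, 7740]
```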

Fig. 2 Proposed framework

The proposed framework inherits some features from state-of-the-art algorithms such as GA-kNN [26], GANN [39] and GADA [39]. First, it contains methods (HBMO-kNN, HS-kNN, DE-kNN and SGA-kNN) based on the combination of a meta-heuristic algorithm and a classifier (kNN). Second, the proposed framework does not consider the final solution of a meta-heuristic algorithm as the final disease-causing subset. Finally, during the execution of a meta-heuristic algorithm, good candidate solutions are stored for further statistical processing.

The state-of-the-art algorithms have the following disadvantages. First, the stored candidate solutions obtained from the executions of a single meta-heuristic algorithm are inappropriate for statistical processing, because decision variables change infrequently between successive candidate solutions; as a result, redundant information may be gathered. Second, the prioritization of genes contained in intermediate candidate solutions (even good ones) is questionable, as they are obtained from meta-heuristic algorithms that employ inherent randomization. The proposed framework eliminates the first difficulty because it collects candidate solutions from four meta-heuristic algorithms instead of one; the second disadvantage has been overcome by using overlap instead of prioritization.

4 Data and Experimental Results

Three experiments, each based on a preeclampsia dataset, have been conducted. The hardware and software used in the experiments, the preeclampsia data and the parameter settings are described in Sect. 4.1. Section 4.2 contains the results and discussion.

4.1 Data and Experimental Setting

All the experiments have been performed on the same computer, having an Intel Pentium dual-core CPU with a processing speed of 3.0 GHz and 2 GB of RAM. All programs were written in Matlab 2011b (Windows, 32-bit). Three gene expression datasets with accession numbers E-MEXP-1050, GSE60438 and GSE10588 have been employed in this research. The first dataset has been downloaded from ArrayExpress, whereas the other two have been collected from Gene Expression Omnibus. Each of these datasets was produced through an underlying microarray experiment in which total RNA was extracted from normal as well as preeclampsia affected decidual tissues of placentas. The expression profile obtained through a microarray experiment is a matrix (see Table 1) in which each sample (column) contains the expression levels of all relevant genes of a decidual tissue. In the case of E-MEXP-1050, the expression profile contains 18 diseased and 17 normal samples. Each diseased sample corresponds to a decidual tissue of a critical pregnancy (affected by preeclampsia and/or fetal growth restriction); each normal sample corresponds to a normal placental tissue. In either case, a sample contains expression levels of 8,937 identical genes. In GSE60438, the expression profile contains 35 preeclamptic samples and 42 normotensive samples; each sample contains expression levels of 47,323 identical genes. In GSE10588, 17 severely preeclamptic samples and 26 normal samples are present; each sample has expression levels of 32,878 identical genes. Different aspects of the datasets are summarized in Table 2.

Table 2 Microarray gene expression datasets collected for experiments

Each matrix element of a gene expression dataset is a floating point number. Each dataset undergoes a pre-processing stage called normalization, which here aims to map each element (\(g_{{i,j}}^{N}\)/\(g_{{i,j}}^{P}\), see Table 1) of a data matrix into [0, 1]. To apply normalization, the maximum and minimum values (\(G_{i}^{{MAX}}\) and \(G_{i}^{{MIN}}\), respectively) of each gene (\({G_i}\)) over all samples have been calculated first. Then each element \({g_{i,~j}}\) (\(g_{{i,j}}^{N}\)/\(g_{{i,j}}^{P}\)) of the matrix has been normalized using Eq. (9)

$$\text{normalize}(g_{i,j}) = \frac{g_{i,j} - G_i^{MIN}}{G_i^{MAX} - G_i^{MIN}}.$$
(9)
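A one-line sketch of this gene-wise min-max normalization (illustrative Python/NumPy; a guard against constant rows, where \(G_i^{MAX} = G_i^{MIN}\), is omitted):

```python
import numpy as np

def normalize(X):
    """Min-max normalize each gene (row) of X into [0, 1], per Eq. (9)."""
    g_min = X.min(axis=1, keepdims=True)   # G_i^MIN over all samples
    g_max = X.max(axis=1, keepdims=True)   # G_i^MAX over all samples
    return (X - g_min) / (g_max - g_min)
```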

With the normalized datasets, three experiments have been conducted, each based on a dataset specified in Table 2. In each experiment, all four hybrid algorithms, namely, HBMO-kNN, HS-kNN, DE-kNN and SGA-kNN, have been employed. In all population-based meta-heuristics except HBMO-kNN, the outer loop controlling the terminating condition is iterated maxIter (= 200) times; the population size, Num (= 30), is also identical, so the total number of steps in each population-based algorithm is 6000 (200*30). In HBMO-kNN, whenever the fitness value changes, the point in time is recorded; later on, each recorded time instance (corresponding to a fitness value) is scaled into the range (1–6000). In all four algorithms, the number of elements d in candidate solution S is kept at 30. To set values for the important parameters of an algorithm, candidate values of each parameter are kept in a vector; the algorithm is then executed for all possible combinations of the parameter values, and the combination for which the algorithm achieves the highest fitness is saved. These saved values have been assigned to the corresponding parameters during the experiments with the algorithm. The dataset used for parameter setting of each algorithm is E-MEXP-1050. Table 3 shows the parameter values of the different algorithms used during the experiments. An example of parameter setting is given in Fig. 3, where the parameters of HS-kNN are estimated; the candidate values were HMCR ∊ {0.90, 0.85, 0.80, 0.75, 0.70} and PAR ∊ {0.70, 0.65, 0.60, 0.55, 0.50}. Though fitness reaches its highest value at HMCR = 0.70 and PAR = 0.65, high fitness values are stable around HMCR = 0.80 and PAR = 0.60, so we have set HMCR and PAR to 0.80 and 0.60, respectively. The grid search is sketched below.
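This parameter tuning amounts to a grid search (illustrative Python; run_hs_knn is a hypothetical stand-in for executing HS-kNN on E-MEXP-1050 and returning the best fitness achieved):

```python
import itertools
import random

def run_hs_knn(hmcr, par):
    # Hypothetical stand-in: a real run would execute HS-kNN on E-MEXP-1050
    # with the given parameters and return the best fitness achieved.
    return random.uniform(30, 35)

HMCR_values = [0.90, 0.85, 0.80, 0.75, 0.70]
PAR_values = [0.70, 0.65, 0.60, 0.55, 0.50]

# Evaluate every (HMCR, PAR) combination and keep the best-scoring one
best = max(((run_hs_knn(h, p), h, p)
            for h, p in itertools.product(HMCR_values, PAR_values)),
           key=lambda t: t[0])
print("selected parameters: HMCR=%.2f, PAR=%.2f" % (best[1], best[2]))
```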

Table 3 Parameters of different algorithms

Fig. 3 Parameter setting of HS-kNN

4.2 Results and Discussion

With the algorithm parameters set, each algorithm has been executed on 25 randomly generated problem instances. During the execution of an algorithm, the average fitness, classification accuracy (%), best fitness and standard deviation of fitness have been recorded; the average has been taken over the 25 solved instances. For example (see Table 4), in the experiment based on dataset E-MEXP-1050, after execution of HS-kNN, the average over the 25 fitness values is 33.2800, the classification accuracy calculated by Eq. (1) is 95.0857, the best fitness obtained is 34 and the standard deviation of the fitness values is 0.5416.

Table 4 Performance of algorithms in different experiments; each experiment is based on a dataset

As fitness (the number of properly classified samples) depends on the number of samples in a dataset, classification accuracy is a better attribute for comparison. On this attribute, HBMO-kNN performed best in all experiments. In the first and third experiments, based on the E-MEXP-1050 and GSE10588 datasets, respectively (Table 4), HBMO-kNN obtained 100% classification accuracy; in the second experiment, based on the GSE60438 dataset (Table 4), it obtained a classification accuracy of 99.6364%.

In all experiments, DE-kNN secured second place with respect to classification accuracy. Moreover, DE-kNN is a stable performer (Table 4), as it has the lowest standard deviation of fitness compared to the other algorithms in all the experiments. In the second experiment, where HBMO-kNN has a classification accuracy below 100%, the standard deviation of DE-kNN is lower than that of HBMO-kNN.

Figures 4, 5 and 6 show the fitness convergence graphs of all algorithms on datasets E-MEXP-1050, GSE60438 and GSE10588, respectively. Here, the x-axis is the logarithm of the step count (up to 6000, i.e., maxIter of 200 × Num of 30) and the y-axis is fitness. In Fig. 4 (E-MEXP-1050), the maximum fitness (35, a classification accuracy of 100%) has been obtained by HBMO-kNN and DE-kNN; the other two algorithms obtained a fitness of 34. In Fig. 5 (GSE60438), HBMO-kNN, HS-kNN and DE-kNN have obtained the highest fitness of 77 (100% classification accuracy), while SGA-kNN obtained a fitness of 76. In Fig. 6 (GSE10588), HBMO-kNN and DE-kNN have obtained the highest fitness of 43 (100% classification accuracy), while HS-kNN and SGA-kNN obtained fitnesses of 42 and 41, respectively.

Fig. 4 Fitness convergence of the experiment (Table 4) based on E-MEXP-1050

Fig. 5 Fitness convergence of the experiment (Table 4) based on GSE60438

Fig. 6 Fitness convergence of the experiment (Table 4) based on GSE10588

Figure 7 compares the relative performance of the algorithms in the different experiments. Together with the four meta-heuristic based algorithms, the performance of simple kNN is also shown separately. Simple kNN, applied to random initial candidate solutions, attains classification accuracies in the range of 40–80% in the different experiments. This explains why the graphs in Figs. 4, 5 and 6 start at high initial fitness: the graphs plot the global best solutions of the algorithms, and as the best candidate solution of the initial population is classified through kNN, the initial classification accuracy may be as high as 80%. The rest of the accuracy has been achieved through the application of the meta-heuristics.

Fig. 7 Relative performance of kNN, HBMO-kNN, HS-kNN, DE-kNN and SGA-kNN in the different datasets

Figure 8 shows the relative running times of the individual algorithms in the different experiments (based on datasets E-MEXP-1050, GSE60438 and GSE10588). SGA-kNN took the least time in all experiments, and DE-kNN took the second lowest. The time of an experiment generally depends on the number of genes in the corresponding gene expression dataset.

Fig. 8 Time analysis of HBMO-kNN, HS-kNN, DE-kNN and SGA-kNN in different datasets

In each experiment, during the 25 runs of each algorithm, the candidate solutions that reach a classification accuracy of 90% or above are saved. Among the saved solutions of the different algorithms, the set of genes commonly obtained by all algorithms is stored; these genes have been termed disease critical genes. The numbers of disease critical genes obtained from the datasets E-MEXP-1050, GSE60438 and GSE10588 are 87, 73 and 74, respectively. Many widely known preeclampsia genes [56, 57], including STS, EPHX1, LEP, LRP8, ADD1, YWHAQ, INSR and CCL8, have been found here.

Each of the preeclampsia genes obtained in this research has been found to be differentially expressed (between preeclamptic and normal samples) in separate previous studies. The appearance of so many important genes together in this collective algorithmic study suggests that they may be part of some biological pathway; this needs further clinical investigation. Among the important genes, a few, such as INSR and CCL8, were previously shown to be associated with adult vascular and metabolic diseases, and an association of these genes with preeclampsia has later been established; their appearance in this research is a further step toward establishing that association. Some new genes, such as MSX2, LOC440737 and UROS, have also appeared in this result; no direct association of these genes with preeclampsia has been established so far. MSX2 has been observed to induce trophoblast invasion in the placenta [58], whereas LOC440737 is considered to be involved in the endometrial–trophoblast interaction; both of these processes have a complex relationship with preeclampsia [59]. On the other hand, UROS has been found responsible for hypertension, and gestational hypertension (hypertension in the second or third trimester) may lead to preeclampsia [3]. So, these new genes may have an indirect, complex relationship with preeclampsia, subject to further biological investigation.

One point should be noted: though the rows of the gene expression datasets have been termed genes so far, they are actually 'probe ids'. A probe id is an entity associated with a gene in the context of microarray technology. We have used the indexes of probe ids during processing. Finally, from an index, the probe id is obtained from the dataset, and from the probe id, the gene symbol is obtained from a special type of data file (the platform file) supplied by NCBI (from which the dataset was collected).

The performance of the algorithms proposed in this research, HBMO-kNN and HS-kNN, has been compared with three state-of-the-art methods: GANN and GADA, both implemented by Tejera et al. (2013) [39], and GA-kNN, proposed by Li et al. (2001) [26]. Tejera et al. (2013) executed GANN and GADA on two (out of the five used in [39]) datasets (E-MEXP-1050 and GSE10588) which have also been used in this research; the average and best classification accuracies published by Tejera et al. (2013) [39] have been used in the comparative study (see Fig. 9) of this research. The other state-of-the-art algorithm, GA-kNN, has been downloaded (as an executable) from the website [60] of its originators, Li et al. (2001), and executed 25 times on E-MEXP-1050 under the conditions provided. The parameters—number of niches, number of generations, population size, chromosome length, termination fitness cut-off and number of solutions—have been set as arguments to 3, 50, 50, 30, 34 and 200, respectively, during executions of the algorithm. Though in a run of GA-kNN the maximum number of generations is 50, the program terminates earlier if the given cut-off fitness (34 out of the 35 samples in E-MEXP-1050) is achieved in fewer generations. The output of an execution of GA-kNN is the number of properly classified samples; the output of each run is stored and the classification accuracy (%) is calculated using Eq. (1).

Fig. 9 Comparison of proposed algorithms (HBMO-kNN and HS-kNN) with three state-of-the-art methods (GADA and GANN [39], and GA-kNN [26]) in terms of average and best classification accuracy (%)

As Fig. 9 shows, HBMO-kNN outperforms all other algorithms in both average (100%) and best (100%) classification accuracy. HS-kNN is similar to GADA in average classification accuracy (95.2%) but wins in best classification accuracy (99.4%).

The change of the fitness value of each algorithm (HBMO-kNN, HS-kNN and GA-kNN) over successive generations (iterations) is plotted in Fig. 10; for each algorithm, the run (among its 25 runs) in which the highest fitness was achieved has been considered. It is observed that HBMO-kNN is better than GA-kNN, and HS-kNN is as good as GA-kNN, in achieving the highest fitness. In addition, both HBMO-kNN and HS-kNN show a higher rate of improvement than GA-kNN. These two observations signify that the newly proposed algorithms outperform the existing state-of-the-art algorithm (GA-kNN) in obtaining the subset of disease-causing genes (i.e., maximizing the number of properly classified samples).

Fig. 10 Comparison of fitness convergence among HBMO-kNN, HS-kNN and GA-kNN [26]

Some recent gene selection algorithms, not executed on the same datasets as the proposed algorithms, have also been compared with the proposed algorithms on the basis of their overall average classification accuracies over different datasets (see Table 5). The average classification accuracies of two recent gene selection algorithms, χ2 DC and IVPSO, have been obtained from the results reported by Zang et al. (2014) [36] and Ramyachitra et al. (2015) [37], respectively. The average classification accuracy (CA) of a proposed algorithm, e.g., HBMO-kNN, has been calculated as the arithmetic mean of its classification accuracies over the three datasets (E-MEXP-1050: 100%, GSE60438: 99.6364%, GSE10588: 100%; see Tables 4, 5). Here also HBMO-kNN is the best, with an average classification accuracy of 99.87%, followed by HS-kNN (97.03%).

Table 5 Comparison between the proposed and recent gene selection algorithms based on overall classification accuracy in different datasets

5 Conclusion

Identification of a small subset of genes responsible for a disease by comparing characteristics of normal and diseased gene expression data is an important task in bioinformatics. In this paper, three preeclampsia datasets have been collected and identification of disease critical genes from these datasets has been formulated as an optimization problem. Meta-heuristic algorithms such as HBMO, HS, DE and GA have been used for this purpose, with the classifier kNN used to obtain the fitness of a candidate solution. Two algorithms, HBMO-kNN and HS-kNN, have been proposed here and implemented on three different preeclampsia datasets. In each experiment, apart from the proposed algorithms HBMO-kNN and HS-kNN, two other algorithms, DE-kNN and SGA-kNN, have also been implemented. The performance of these four algorithms has been compared with respect to standard metrics such as average fitness, classification accuracy, best fitness, standard deviation of fitness and execution time. The experimental study shows that HBMO-kNN outperformed all other algorithms with a classification accuracy of 99.64–100%, followed by DE-kNN (99.42–100%). The proposed algorithms (HBMO-kNN and HS-kNN) have also been compared with the state-of-the-art algorithms, giving better results. During the execution of each algorithm, candidate solutions having classification accuracy above 90% have been saved separately. From the good solutions obtained in an experiment, the genes commonly found by all algorithms are considered disease critical. The numbers of disease critical genes in the three datasets E-MEXP-1050, GSE60438 and GSE10588 are 87, 73 and 74, respectively. Genes such as STS, LEP and LRP8, which are well known as preeclampsia genes, have appeared in our findings.

Many algorithms have been proposed so far for gene selection using sample classification. This research is nevertheless important because it implements a large number of meta-heuristics and compiles their results together, providing a comparative as well as a collective analysis. The comparative study shows that one of our proposed algorithms, HBMO-kNN, is the best in performance among all algorithms implemented here. The collective study shows that grouping the solutions of all algorithms and searching for common genes achieved significant overlap (87, 73 and 74 genes in E-MEXP-1050, GSE60438 and GSE10588, respectively).

There are several directions for future work. The kNN-based fitness calculation algorithm, kNN-fit, can easily be used in any other meta-heuristic to serve a similar purpose. The collective strategy using a number of meta-heuristics may be helpful in solving similar problems, and new meta-heuristic algorithms may prove even more appropriate in this context. Identification of disease critical genes of any other disease can be performed using the aforesaid methods. Incorporation of other classifiers, such as SVM or the multilayer perceptron, as a substitute for kNN may further improve the accuracy of gene selection. As datasets usable for positive control may involve more than two classes, kNN-fit may be modified to facilitate multi-class (more than two) classification; application of this modified version to such a dataset would be useful to reveal which other diseases play a role in increasing the risk of preeclampsia or how preeclampsia may trigger some other disease.