Keywords

1 Introduction

Uncontrollable growth of cells in an unusual fashion is the primary basis of cancer, which is often classified as a group of diseases [1]. Thousands of several genes are responsible for a particular cancer to occur. But all those genes may not be equally significant in this case. Due to this reason, selection of genes based on their ranks for a specific cancer is highly important [2]. In this work, we try to find the significant mediating genes of six various human cancer datasets and to prioritize them as well.

The algorithm used in this paper to rank the genes is primarily constructed by analyzing the expression change of the genes between a diseased and a normal sample. The Genetic Algorithm (GA) based implementations, namely MicroarryGA is applied on microarray gene expression datasets of different cancers to rank genes. Genetic Algorithm can be considered as one of the popular evolutionary algorithms [3, 4]. Later, these ranks are validated by classification step using SVM classifiers [5, 6]. The workflow of the proposed MicroarrayGA method is given in Fig. 1.

Fig. 1.
figure 1

Workflow of the proposed MicroarrayGA method

2 Related Earlier Works

Ranking genes based on the information from several DNA microarray gene expression data is of fundamental importance [7]. Since data sources contain thousands of several genes, so to determine the set of prioritized genes according to their significance order is an important task. Some of methods from current literature are mentioned briefly as follows:

A soft clustering based microarray expression analysis algorithm is proposed in [8] to analyze genes responsible for acute leukemia. Another method mentioned in [9] uses non-parametric statistical tests - Mann-Whitney U test and k-sample Kruskal-Wallis H test to rank genes. A statistical gene subset selection algorithm using Wilcoxon rank sum test is presented in [10].

Huerta et al. [13] analyzed Leukemia [12] and Colon cancer [14] datasets using their proposed genetic algorithm based gene selection method along with SVM classifier. Mondal et al. presented a multi-objective GA based gene ranking approach in [15]. Another method proposed in [16], selects significant features using genetic algorithms (GAs) and then uses Constructive Neural Networks (CNN) [17], C-Mantec [18] based classification methods to validate the results. A neuro-fuzzy approach is proposed in [19] to select those genes responsible for a particular type of cancers. Another method presented in [20] uses graph theory based approach to identify most significant as well as most non-redundant genes from a DNA gene expression dataset. Dragon Wrapper Feature Selection (DWFS), a web based tool, proposed in [21], is used to select significant features for a various types of problems efficiently.

Another new approach was proposed in [22] to rank the genes of microarray gene expression data using squared Pearson correlation coefficient. Another work presented in [23] developed a fuzzy set theory based index, named as Gaussian Fuzzy Index (GFI) to select a set of cancer mediating genes. GeneRank developed by Morrison et al. [24], is a straightforward generalization of PageRank [25] algorithm used by Google.

3 Methodology

We have developed following GA based approach named as MicroarrayGA. This method initially begins with pre-selection based on filtering, and then generates gene subset \( G_{P} \) of \( P \) genes, where \( P \) is problem dependent. A small set of the most important genes are selected from this reduced subset of genes, which give the highest accuracy when used with classification task. Detailed steps of this approach are given below:

  1. 1.

    In the first step, genes/features form the microarray are sorted using Fisher’s Discriminant Criteria [26] in decreasing order. Then from the sorted genes, top \( P \) genes are selected for our calculation.

  2. 2.

    Then \( N \) different gene subsets are generated randomly from selected \( P \) genes and they are used as initial GA population. To encode each of the gene subsets of the population, binary chromosome representation is used, given in Fig. 2. If a particular gene is present in the gene subset, then the corresponding bit is set to 1, otherwise set to 0 in that chromosome representation.

    Fig. 2.
    figure 2

    Chromosome representation

  3. 3.

    After obtaining the initial population, a minimizing fitness function is applied on each of the chromosomes to find the individual fitness value. For each gene present in chromosome, p-value is calculated from Student’s two-tailed t test between the two types of sample corresponding to that gene (normal and cancerous). After calculating the p-values of each of the genes present in the chromosome, the average p-value is calculated, which is the fitness value of the chromosome.

  4. 4.

    In this step, a temporary population \( N^{\prime } \) is created which will be used as a population \( N \) in the next generation. The new population \( N^{\prime } \), is filled up in three steps. In the first step, the top 40% individuals (based on fitness value; in this method from lower to higher) of \( N \) are copied to \( N^{\prime } \) (elitism). In the next two steps, genetic operators: uniform crossover and binary mutation operators are applied respectively to obtain new chromosomes (child solutions).

  5. 5.

    A random pair of solutions are selected from \( N \) and uniform crossover operator is applied on these two individuals. The resulting new child solution is inserted in \( N^{\prime } \). This step is repeated for \( 0.4* \left| N \right| \) times, where 0.4 is crossover probability (\( P_{c} \)).

  6. 6.

    In this step, mutation operator is applied on individual solution selected randomly from \( N \). This step is repeated for \( 0.2* \left| N \right| \) times, where 0.2 is mutation probability (\( P_{m} \)).

  7. 7.

    Replace \( N \) with \( N^{\prime } \) and repeat the steps from 3 to 7 until fixed number of generations are completed.

  8. 8.

    After the completion of all the generations, each gene is assigned an importance value depending on the frequency of selection among all the \( N \) gene subsets. A gene having higher frequency of appearance gets high importance value and vice versa. Their significance is tested by the classification performance using SVM classifier.

4 Experimental Results

MicroarrayGA based approach is applied on microarray gene expression dataset of six various types of cancers, like Breast Cancer, Prostate Cancer, Pediatric Immune Thrombocytopenia (ITP), Small Cell Lung Cancer, Diffuse Large B-Cell Lymphoma (DLBCL) and Colorectal Cancer available at NCBI online repository.

4.1 Dataset Used

Colorectal Cancer tumor (GDS4382) [27, 32] dataset contains 17 samples are of type normal and 17 are of type diseased. Small Cell Lung Cancer (GDS4794) [28, 32] dataset is divided into diseased and normal sample subsets of size 23 and 42 respectively. For DLBCL-FL (GDS4236) [29, 32] dataset, 8 samples are rapamycin sensitive and 6 are rapamycin resistant. In Prostate Cancer (GDS4824) [30, 32] Dataset, 13 diseased samples and 8 normal samples are present. Breast Cancer (GDS4056) [11, 32] contains 32 diseased samples and 29 normal samples. Pediatric Immune Thrombocytopenia (ITP) [31, 32] dataset has total 13 samples out of which 7 are of type newly diagnosed immune thrombocytopenia samples and 6 are of type chronic immune thrombocytopenia samples.

4.2 Parameter Settings

The parameters of the GA applied on microarray gene expression dataset are represented in Table 1 below:

Table 1. GA parameters for MicroarrayGA

4.3 Results and Comparison with Other Methods

Classification algorithm is used to validate the proposed GA based algorithm’s effectiveness in gene ranking. The proposed method’s efficiency is compared with some existing ranking methods on the basis of percentage of classification accuracy along with some other metrics.

We take top 20 genes from the ranked set of 1000 genes. These ranking is validated again by the classification process using SVM method with three kernels. Before the classification process starts, we divide the whole sample space into 60:40 ratio. Following Fig. 3 shows how the MicroarrayGA’s performance varies on the basis of percentage of accuracy using only top most 20 genes for six cancer datasets. From that figure it is clear that for which dataset what is the percentage of classification accuracy proposed algorithm achieves by using how many genes. Although we have done our experiment on six publicly available cancer datasets, but to compare our GA based method’s performance with three state-of-the art methods, like Particle Swarm Optimization (PSO) based Graph Theoretic Approach [20], DWFS using K-Nearest Neighbor (KNN) classifier and DWFS using Naïve Bayes (NBC) classifier [21] only two datasets, Prostate Cancer and DLBCL-FL are used. Table 2 shows this comparative study on the basis of percentage of classification accuracy, F1 Score, G-Mean, Recall, Precision [33].

Fig. 3.
figure 3

Performance of MicroarrayGA method for six cancer dataset on two metrics

Table 2. Performance comparison of MicroarrayGA with existing works

Following Table 3 shows top 15 genes returned by MicroarrayGA algorithm.

Table 3. Top-most 15 genes from all six cancer datasets

5 Conclusion and Future Scope

In the proposed approach, a genetic algorithm based implementation, MicroarrayGA is used to rank the genes causing various cancers. Ranking is validated by SVM classifier method with three kernels. From Fig. 3 it is clear that for which dataset what is the percentage of classification accuracy proposed algorithm achieves by using how many genes. Three different existing methods from current literature are used in this paper along with MicroarrayGA method for the performance comparison purpose. From the performance scores given in Table 2, it can be concluded that our GA based approach surpasses the other methods. Since MicroarrayGA uses gene expression dataset, the results shows that this approach uses top most 3 genes to top most 9 genes for various cancer datasets to achieve percentage of classification accuracy close to 100%.

We have used only microarray dataset to rank genes. But it is quite possible that gene dataset is obtained in terms of gene interaction network. In that case we need to think about new ways to rank the genes involved in the network. Another modification that can be done is the generation of multiple gene subsets (non-dominated solutions) rather than a single gene subset, which can be used for further biological analysis. For that we can think about multi-objective genetic algorithm. That may improve our results further.