1 Introduction

Gene expression data can be obtained with high-throughput technologies such as microarrays and oligonucleotide chips under various experimental conditions and at different developmental stages. These technologies promise to allow the detection of networks of correlated genes that are characteristic of phenomena such as diseases. However, classification of samples according to phenotypes or other criteria is not necessarily precise; it is therefore desirable to use unsupervised methods to classify samples according to gene expression similarity and to detect networks of correlated genes that discriminate between those sample classes [1].

The gene expression data are usually organized in a matrix of n rows and m columns, which is known as a gene expression profile. Because a large amount of gene expression data is available for various cancerous samples, it is important to construct classifiers that classify cancerous samples with high predictive accuracy based on their gene expression profiles [2]. Microarrays contain precisely positioned DNA probes that are designed to monitor the expression of many genes in parallel. Data mining often utilizes mathematical techniques that are traditionally used to identify patterns in complex data. Here, the unsupervised benchmark K-means (KM) clustering method is adopted from [7] for comparative analysis.

The rest of the paper is organized as follows: Sect. 2 describes the fuzzy C-means (FCM) clustering method, Sect. 3 deals with modified K-means (MKM) clustering, Sect. 4 presents the proposed approach to gene selection, Sect. 5 describes the experimental environment, Sect. 6 reports the experimental results, and Sect. 7 concludes the paper with directions for further research.

2 Fuzzy C-Means Clustering

In the FCM clustering method, an object can simultaneously be a member of multiple clusters. The objective function, which is minimized iteratively, is a weighted within-group sum of distances, in which each squared distance is multiplied by the corresponding membership value. Once the membership values have been computed for all calibration objects, the cluster centers are described by prototypes, which are fuzzy weighted means. This fuzzy clustering method allows intermediate assignments: genes are placed into multiple groups by assigning, for each group, a membership value that lies between 0 (not in the group) and 1 (completely in the group). The use of membership values has the advantage of allowing a gene or sample to belong to multiple clusters, which may better reflect the underlying biology [4].
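The alternating update described above can be sketched as follows. This is a minimal illustration of the standard FCM iteration and not necessarily the exact variant used in this paper; the function name fuzzy_c_means, the fuzziness exponent m = 2, the stopping tolerance, and the toy data are all assumptions made for the example.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Cluster the rows of X into c fuzzy clusters (illustrative sketch).

    Returns (centers, U), where U[i, k] is the membership of object i
    in cluster k and every row of U sums to 1.
    """
    rng = np.random.default_rng(seed)
    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Prototypes: fuzzy weighted means of the objects.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared Euclidean distance from every object to every prototype.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: proportional to d2 ** (-1 / (m - 1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U

# Toy data (assumed for illustration): two well-separated groups of 20 points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, U = fuzzy_c_means(X, c=2)
hard_labels = U.argmax(axis=1)  # optional hard assignment from the memberships
```

Taking the largest membership per object, as in the last line, is one simple way to turn the fuzzy partition into hard clusters when a single cluster per gene is needed.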

3 Modified K-Means Clustering Algorithm

The modified K-means (MKM) algorithm computes cluster centers that are quite close to the desired cluster centers. It first divides the dataset into K subsets according to a rule associated with the patterns in the data space and then chooses a cluster center for each subset [5].
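The text does not state the partitioning rule, so the following sketch only illustrates the general idea of deriving one center per subset. The rule used here (ordering the objects by Euclidean norm and cutting the ordering into K nearly equal parts) and the function name partition_based_centers are assumptions for illustration, not the rule of [5].

```python
import numpy as np

def partition_based_centers(X, k):
    """Pick k centers by splitting the data into k subsets (sketch).

    Assumed rule (hypothetical, not taken from the paper): order the
    objects by their Euclidean norm and cut the ordering into k nearly
    equal chunks. Requires k <= number of rows of X.
    """
    order = np.argsort(np.linalg.norm(X, axis=1))
    subsets = np.array_split(order, k)
    # One center per subset: the mean of the objects in that subset.
    return np.vstack([X[idx].mean(axis=0) for idx in subsets])

# Toy data (assumed): three points near the origin, three far from it.
X = np.array([[0, 0], [1, 0], [0, 1], [10, 10], [11, 10], [10, 11]], dtype=float)
centers = partition_based_centers(X, k=2)
```

Centers obtained this way could then be passed to an ordinary K-means iteration in place of randomly chosen starting points.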

4 Proposed Approach

In this paper, the gene datasets are clustered using K-means, modified K-means, and fuzzy C-means with K = 5, K = 10, and K = 15. The clusters obtained with these methods are then clustered further using K-means with K = 2. Applying the clustering methods in sequence in this way constitutes a novel approach to gene selection.
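One possible reading of this two-stage procedure is sketched below; the text does not confirm the details. The sketch assumes that the first stage groups the genes, that the second stage splits the samples into two groups using only the genes of each first-stage cluster, and that each split is scored against the known class labels. Plain K-means is used for both stages, and the names kmeans, two_stage, S, and y, as well as the toy data, are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's K-means on the rows of X; returns one label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def two_stage(S, y, k_genes):
    """S: samples x genes matrix, y: 0/1 class labels (assumed layout).

    Returns a list of (gene cluster id, number of genes, accuracy).
    """
    gene_labels = kmeans(S.T, k_genes)          # stage 1: cluster the genes
    results = []
    for c in range(k_genes):
        genes = np.flatnonzero(gene_labels == c)
        if genes.size == 0:
            continue
        split = kmeans(S[:, genes], 2)          # stage 2: K = 2 on the samples
        agree = np.mean(split == y)
        # Cluster ids are arbitrary, so take the better of the two matchings.
        results.append((c, int(genes.size), max(agree, 1.0 - agree)))
    return results

# Toy data (assumed): 10 samples, 3 class-dependent genes, 3 uninformative genes.
rng = np.random.default_rng(0)
y = np.array([0] * 5 + [1] * 5)
informative = np.outer(y, np.ones(3)) * 10.0 + rng.normal(0, 0.1, (10, 3))
uninformative = 5.0 + rng.normal(0, 0.1, (10, 3))
S = np.hstack([informative, uninformative])
results = two_stage(S, y, k_genes=2)
```

Under this reading, a gene cluster whose K = 2 split reproduces the class labels well is a candidate set of selected genes, and its size is the number of genes selected.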

5 Experimental Environment

The leukemia cancer dataset is described as follows [6]: It has 7,129 genes and 34 samples and consists of 2 classes: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). These are distinct types of leukemia, each with different characteristics. Each patient is represented as one row. The first column is the patient number in the dataset, the following 7,129 columns hold the gene expression values of that patient, and the last column indicates the type of cancer (ALL or AML) assigned to the patient. To ease the algebraic manipulation of the data, the dataset can also be represented as a real two-dimensional matrix S of size 34 × 7,129; the entry s_ij of S measures the expression of the jth gene of the ith patient. Each patient is thus described by a sequence of 7,129 real numbers, each measuring the relative expression of the corresponding gene [3].
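The matrix view of the dataset can be illustrated with a short sketch. It assumes that rows index patients and columns index genes, so that S has shape 34 × 7,129; the values and class labels below are random placeholders and not the real leukemia measurements.

```python
import numpy as np

n_patients, n_genes = 34, 7129
rng = np.random.default_rng(0)

# Placeholder values standing in for the real expression measurements.
S = rng.normal(size=(n_patients, n_genes))
# Placeholder class labels; the real ALL/AML assignment is not given in the text.
labels = np.array(["ALL", "AML"])[rng.integers(0, 2, size=n_patients)]

i, j = 0, 4
s_ij = S[i, j]            # expression of gene j for patient i (0-based indices)
patient_profile = S[i]    # all 7,129 expression values of one patient
gene_profile = S[:, j]    # one gene measured across all 34 patients
```

Clustering the genes then amounts to clustering the columns of S (the rows of its transpose), whereas clustering the patients operates on the rows of S directly.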

6 Computational Results

The value of K is fixed arbitrarily at 5, 10, and 15. FCM clustering is performed for each value, and the results are provided in Tables 1, 2, and 3, respectively. Likewise, MKM clustering is performed for K = 5, 10, and 15, and the results are provided in Tables 4, 5, and 6, respectively. In each table, the best results are indicated in bold.

Table 1 Experimental results of FCM clustering for K = 5
Table 2 Experimental results of FCM clustering for K = 10
Table 3 Experimental results of FCM clustering for K = 15
Table 4 Experimental results of MKM clustering for K = 5
Table 5 Experimental results of MKM clustering for K = 10
Table 6 Experimental results of MKM clustering for K = 15

With K = 5, FCM clustering achieves an accuracy of 91 % with 203 genes selected in Run 4, and the same accuracy with 75, 75, and 42 genes in Runs 7, 8, and 10, respectively. With K = 10, an accuracy of 94 % is achieved with 219 genes selected in Run 5 and an accuracy of 85 % with 34 genes in Run 10; FCM clustering also achieves an accuracy of 82 % with 23, 37, and 37 genes in Runs 2, 6, and 8, respectively. With K = 15, FCM clustering achieves an accuracy of 91 % with 104 genes selected in Run 1 and with 20 genes in Run 5. Also with K = 15, an accuracy of 94 % is achieved with 19 genes selected in Run 6 and with 189 genes in Run 7, and an accuracy of 88 % is achieved with 29 genes in Run 10. The results with the best accuracy obtained for K = 5, K = 10, and K = 15 using FCM clusters are given in Table 7.
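The runs quoted in this paragraph can be compared programmatically, as in the sketch below. The figures are copied from the text; listing the 82 % runs under K = 10 and breaking accuracy ties by the smaller number of genes are assumptions, since the selection criterion behind Table 7 is not stated.

```python
# (K, run, genes selected, accuracy in %) as quoted in the paragraph above.
runs = [
    (5, 4, 203, 91), (5, 7, 75, 91), (5, 8, 75, 91), (5, 10, 42, 91),
    (10, 5, 219, 94), (10, 10, 34, 85),
    (10, 2, 23, 82), (10, 6, 37, 82), (10, 8, 37, 82),  # K = 10 assumed here
    (15, 1, 104, 91), (15, 5, 20, 91), (15, 6, 19, 94),
    (15, 7, 189, 94), (15, 10, 29, 88),
]

def best_run(rows):
    """Highest accuracy first; ties broken by the smaller gene count (assumed rule)."""
    return min(rows, key=lambda r: (-r[3], r[2]))

best_per_k = {k: best_run([r for r in runs if r[0] == k]) for k in (5, 10, 15)}
overall = best_run(runs)
```

Under this tie-breaking rule the overall best entry is the 19-gene run with K = 15, which matches the number of genes reported in the conclusion.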

Table 7 Relative performance measure of experimental analysis

The significant genes selected by the FCM algorithm are given in Table 8.

Table 8 Significant genes selected

7 Conclusion

In this paper, clustering-based gene selection methods have been proposed and analyzed. The approach is novel in that genes are selected by applying clustering methods in sequence. The FCM and MKM clustering algorithms were applied for different values of K, and the KM clustering algorithm was then applied to all the clusters produced by the FCM and MKM methods. The highly correlated genes were selected from the clusters that gave the most accurate classification results. Out of 7,129 genes, 19 genes were selected by the proposed gene selection method, and it is enough to consider only these 19 genes to predict leukemia. The FCM clustering method was observed to outperform the other clustering methods considered.

A direction for further research is to identify a single gene for diagnosing leukemia.