A Novel Approach to Gene Selection of Leukemia Dataset Using Different Clustering Methods

Prasath, P.; Perumal, K.; Thangavel, K.; Manavalan, R.

doi:10.1007/978-81-322-1680-3_7

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 246)

Abstract

Gene expression datasets from microarrays comprise a large number of genes. Clustering is a widely used approach for grouping similar genes. The main objective of this paper is to identify an optimal subset of genes from the leukemia dataset in order to classify leukemia. Three clustering approaches, K-means (KM) clustering, fuzzy C-means (FCM) clustering, and modified K-means (MKM) clustering, have been adopted in this research. The gene clusters obtained from these methods are further clustered sample-wise using K-means (omitting the class values), and the results are compared with the ground truth to evaluate the performance of the different clustering methods. The highly correlated genes are selected from the cluster that produces the most accurate classification results. It is observed that FCM (gene-wise clustering) followed by K-means (sample-wise clustering) produces the best accuracy, and the resulting genes have been identified.

1 Introduction

Gene expression data can be obtained by high-throughput technologies such as microarrays and oligonucleotide chips under various experimental conditions and at different developmental stages. These technologies promise to allow the detection of networks of correlated genes that are characteristic of phenomena such as diseases. However, classification of samples according to phenotypes or other criteria is not necessarily precise; therefore, it is desirable to use unsupervised methods to classify samples according to gene expression similarity and to detect networks of correlated genes that discriminate those sample classes [1].

The gene expression data are usually organized in a matrix of n rows and m columns, which is known as a gene expression profile. Because of the large amount of gene expression data available on various cancerous samples, it is important to construct classifiers that have high predictive accuracy in classifying cancerous samples based on their gene expression profiles [2]. Microarrays contain precisely positioned DNA probes that are designed to monitor the expression of many genes in parallel. Data mining often uses mathematical techniques that are traditionally applied to identify patterns in complex data. Here, the unsupervised benchmark K-means clustering method is adopted from [7] for comparative analysis.

The rest of the paper is organized as follows: Sect. 2 describes the FCM clustering method, Sect. 3 deals with MKM clustering, Sect. 4 presents the proposed approach to gene selection, Sect. 5 describes the experimental environment, Sect. 6 reports the experimental results, and Sect. 7 concludes the paper with directions for further research.

2 Fuzzy C-Means Clustering

In the FCM clustering method, an object can simultaneously be a member of multiple clusters. The objective function, which is minimized iteratively, is a weighted within-group sum of squared distances, in which each squared distance is weighted by the corresponding membership value. After the membership values have been computed for all objects, the cluster centers are described by prototypes, which are fuzzy weighted means. This fuzzy clustering method allows intermediate assignments whereby genes are placed into multiple groups by assigning, for each group, a membership value that ranges between 0 (not in the group) and 1 (completely in the group). The use of membership values has the advantage of allowing a gene or sample to belong to multiple clusters, which may better reflect the underlying biology [4].
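
The iteration described above can be illustrated with a minimal sketch. It uses the standard FCM update rules with a fuzzifier exponent m; the function name `fuzzy_c_means`, the default m = 2, the random initialization, and the stopping tolerance are assumptions made for illustration and are not taken from the paper.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric vectors.

    Illustrative sketch only: m, iters, tol and seed are assumed defaults.
    """
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(iters):
        # Prototypes: fuzzy weighted means with weights u_ij ** m.
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            wsum = sum(w)
            centers.append([sum(w[i] * points[i][d] for i in range(n)) / wsum
                            for d in range(dim)])
        # Membership update from the squared distances to every prototype.
        shift = 0.0
        for i in range(n):
            d2 = [sum((points[i][d] - centers[j][d]) ** 2 for d in range(dim))
                  for j in range(c)]
            if min(d2) == 0.0:
                # The point coincides with a prototype: full membership there.
                zeros = [j for j in range(c) if d2[j] == 0.0]
                new = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(c)]
            else:
                inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
                s = sum(inv)
                new = [v / s for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, u[i])))
            u[i] = new
        if shift < tol:
            break
    return centers, u
```

Each row of the returned membership matrix sums to 1, so a gene lying between two prototypes receives intermediate values for both, whereas a hard assignment can be recovered by taking the cluster with the largest membership.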

3 Modified K-Means Clustering Algorithm

This algorithm computes initial cluster centers that are quite close to the desired cluster centers. It first divides the dataset into K subsets according to a rule associated with the patterns in the data space and then chooses a cluster center for each subset [5].
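
The partitioning rule is not stated here, so the following sketch only illustrates the general idea of deriving one center per subset. It assumes, purely for illustration, that the points are ordered by their squared distance from the origin and split into K contiguous subsets of nearly equal size, with the mean of each subset taken as its center; the function name `partition_initial_centers` is hypothetical, and the rule used in [5] may differ.

```python
def partition_initial_centers(points, k):
    """Pick k centers as the means of k contiguous subsets of the points.

    The ordering rule (squared distance from the origin) is an assumption
    made for illustration; it is not taken from the paper or from [5].
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points, key=lambda p: sum(v * v for v in p))
    n, dim = len(ordered), len(ordered[0])
    centers = []
    for j in range(k):
        # Near-equal contiguous chunks: chunk j covers [j*n//k, (j+1)*n//k).
        chunk = ordered[j * n // k:(j + 1) * n // k]
        centers.append([sum(p[d] for p in chunk) / len(chunk) for d in range(dim)])
    return centers
```

Centers obtained this way are deterministic for a given dataset, in contrast to the random starting centers commonly used with plain K-means.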

4 Proposed Approach

In this paper, the gene dataset is clustered gene-wise using K-means, modified K-means, and fuzzy C-means with K = 5, K = 10, and K = 15. Each of the clusters obtained by these methods is then clustered sample-wise using K-means with K = 2. Applying the clustering methods in sequence in this way is a novel approach to gene selection.
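
The two-stage procedure can be sketched as follows, under several stated assumptions: plain K-means is used for both stages (the KM variant; FCM or MKM would replace the first stage), the expression matrix `S` holds one row per gene and one column per sample, accuracy is the fraction of samples grouped consistently with the known classes, and the whole best-scoring gene cluster is returned without the further selection of highly correlated genes. All function names are hypothetical.

```python
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
                  for p in points]
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def two_class_accuracy(labels, truth):
    """Fraction of samples matching truth under the better of the two label namings."""
    hits = sum(1 for a, b in zip(labels, truth) if a == b)
    return max(hits, len(truth) - hits) / len(truth)


def select_gene_cluster(S, truth, k_genes):
    """Return (best_accuracy, gene_indices) over the gene clusters of S."""
    gene_labels = kmeans(S, k_genes)  # stage 1: gene-wise clustering
    best = (0.0, [])
    for c in range(k_genes):
        genes = [g for g, lab in enumerate(gene_labels) if lab == c]
        if not genes:
            continue
        # Stage 2: describe each sample by the genes of this cluster only,
        # then cluster the samples into two groups without using the classes.
        samples = [[S[g][s] for g in genes] for s in range(len(S[0]))]
        acc = two_class_accuracy(kmeans(samples, 2), truth)
        if acc > best[0]:
            best = (acc, genes)
    return best
```

The class labels enter only in the final comparison, which mirrors the idea of omitting the class values during clustering and using them solely to score each gene cluster.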

5 Experimental Environment

The leukemia dataset is described as follows [6]: it has 7,129 genes and 34 samples and consists of two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). These are distinct types of leukemia, each with different characteristics. Each patient is represented as one row. The first column is the patient number in the dataset, columns 2 to 34 denote the gene expression values corresponding to each patient, and the 7,130th column indicates the type of cancer (ALL or AML) into which each patient is classified. To ease the algebraic manipulation of the data, the dataset can also be represented as a real two-dimensional matrix S of size 7,129 × 34; the entry s_ij of S measures the expression of the ith gene in the jth patient. Each patient is thus described by a sequence of 7,129 real numbers, each measuring the relative expression of the corresponding gene [3].

6 Computational Results

The K value is arbitrarily fixed at 5, 10, and 15 in turn; FCM clustering is performed, and the results are provided in Tables 1, 2, and 3, respectively. Similarly, MKM clustering is performed for K = 5, 10, and 15, and the results are provided in Tables 4, 5, and 6, respectively. In each table, the best results are indicated in bold.

Table 1 Experimental results for K = 5 (FCM clustering)

Table 2 Experimental results for K = 10 (FCM clustering)

Table 3 Experimental results for K = 15 (FCM clustering)

Table 4 Experimental results for K = 5 (MKM clustering)

Table 5 Experimental results for K = 10 (MKM clustering)

Table 6 Experimental results for K = 15 (MKM clustering)

When K = 5, an accuracy of 91 % is achieved with 203 genes selected in Run 4, and the same accuracy is achieved with 75, 75, and 42 genes in Runs 7, 8, and 10, respectively, in FCM clustering. When K = 10, an accuracy of 94 % is achieved with 219 genes selected in Run 5, and an accuracy of 85 % is achieved with 34 genes in Run 10. Also, an accuracy of 82 % is achieved with 23, 37, and 37 genes in Runs 2, 6, and 8, respectively, in FCM clustering. When K = 15, an accuracy of 91 % is achieved with 104 genes selected in Run 1 and with 20 genes in Run 5 in FCM clustering. An accuracy of 94 % is achieved when K = 15 with 19 genes selected in Run 6 and with 189 genes in Run 7. An accuracy of 88 % is achieved for the same value of K with 29 genes selected in Run 10. The results with the best accuracy obtained for K = 5, K = 10, and K = 15 using FCM clusters are given in Table 7.
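
As a reading aid, the reported percentages are consistent with whole numbers of correctly grouped samples out of the 34 in the dataset. The short sketch below assumes that accuracy is the fraction of samples placed in the correct group, rounded to the nearest percent; the underlying counts are an inference from the percentages and are not stated in the paper, and the function name `accuracy_percent` is hypothetical.

```python
def accuracy_percent(correct, total=34):
    """Accuracy as a whole-number percentage of correctly grouped samples."""
    return round(100 * correct / total)


# Sample counts whose rounded percentages match the values reported in the text.
for correct in (28, 29, 30, 31, 32):
    print(correct, "of 34 ->", accuracy_percent(correct), "%")
```

Under this assumption, 82, 85, 88, 91, and 94 % correspond to 28, 29, 30, 31, and 32 correctly grouped samples, so the best runs would misplace two of the 34 samples.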

Table 7 Relative performance measure of experimental analysis

The significant genes selected by the FCM algorithm are given in Table 8.

Table 8 Significant genes selected

7 Conclusion

In this paper, clustering-based gene selection methods have been proposed and analyzed. The approach is novel in that genes are selected by applying clustering methods in sequence. The FCM and MKM clustering algorithms were applied for different values of K, and the KM clustering algorithm was then applied to each of the clusters produced by the FCM and MKM methods. The highly correlated genes were selected from the clusters that gave the most accurate classification results. Out of 7,129 genes, 19 genes were selected by the proposed gene selection method, and these 19 genes are sufficient to predict leukemia on this dataset. It was observed that the FCM clustering method outperformed the other clustering methods.

A direction for further research is to identify a single gene for diagnosing leukemia.

References

[1] Stanislav Busygin, Gerrit Jacobsen, and Ewald Kramer. Double conjugated clustering applied to leukemia microarray data. In: Proceedings of the 2nd SIAM International Conference on Data Mining, Workshop on Clustering High Dimensional Data, 2002.
[2] Aik Choon Tan and David Gilbert. Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics, 2003, Vol. 2 (3 Suppl), pp. S75–S83.
[3] Cherie H. Dunphy. Gene expression profiling data in lymphoma and leukemia: review of the literature and extrapolation of pertinent clinical applications. Archives of Pathology & Laboratory Medicine, 2006, Vol. 130, No. 4, pp. 483–520.
[4] C. K. Yoo and P. A. Vanrolleghem. Interpreting patterns and analysis of acute leukemia gene expression data by multivariate statistical analysis. In: A. Barbosa Povoa and H. Matos (eds.), Computer-Aided Chemical Engineering. Elsevier Science, 2004, pp. 1165–1170.
[5] Wei Li. Modified K-means clustering algorithm. Congress on Image & Signal Processing, IEEE, 2008, pp. 618–621.
[6] T. R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999, Vol. 286, pp. 531–537.
[7] Palanisamy, P.; Perumal; Thangavel, K.; Manavalan, R. A novel approach to select significant genes of leukemia cancer data using K-Means clustering. In: 2013 International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME), 21–22 Feb. 2013, pp. 104–108.

Acknowledgments

The third author gratefully acknowledges the UGC, New Delhi, for partial financial assistance under UGC-SAP (DRS) Grant No. F3-50/2011.

Author information

Authors and Affiliations

Department of Biotechnology, Periyar University, Salem, 636 011, India
P. Prasath & K. Perumal
Department of Computer Science, Periyar University, Salem, 636 011, India
K. Thangavel
Department of Computer Science, K. S. Rangasamy College of Arts and Science, Thiruchengode, India
R. Manavalan

Corresponding author

Correspondence to P. Prasath.

Editor information

Editors and Affiliations

Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
G. Sai Sundara Krishnan
Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
R. Anitha
Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
R. S. Lekshmi
Applied Mathematics and Computational Sciences, PSG College of Technology, Coimbatore, Tamil Nadu, India
M. Senthil Kumar
Department of Mathematics, Ryerson University, Toronto, Ontario, Canada
Anthony Bonato
University of Basque Country, Paseo Manuel De Lardizalbal 1, San Sebastian, Spain
Manuel Graña

Cite this paper

Prasath, P., Perumal, K., Thangavel, K., Manavalan, R. (2014). A Novel Approach to Gene Selection of Leukemia Dataset Using Different Clustering Methods. In: Krishnan, G., Anitha, R., Lekshmi, R., Kumar, M., Bonato, A., Graña, M. (eds) Computational Intelligence, Cyber Security and Computational Models. Advances in Intelligent Systems and Computing, vol 246. Springer, New Delhi. https://doi.org/10.1007/978-81-322-1680-3_7

DOI: https://doi.org/10.1007/978-81-322-1680-3_7
Published: 27 November 2013
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-322-1679-7
Online ISBN: 978-81-322-1680-3
eBook Packages: Engineering, Engineering (R0)
