Abstract
Clustering is widely used to search for groups of similarly expressed genes in mRNA expression data. Many clustering algorithms exist, and applying each one to the same data will usually produce different results. Without additional evaluation, it is difficult to determine which solutions are better.
In this chapter we discuss methods to assess algorithms for clustering of gene expression data. In particular, we present a new method that uses two elements: an internal index of validity based on the MDL principle and an external index of validity that measures the consistency with experimental data. Each one is used to suggest an effective set of models, but it is only the combination of both that is capable of pinpointing the best model overall. Our method can be used to compare different clustering algorithms and pick the one that maximizes the correlation with functional links in gene networks while minimizing the error rate. We test our methods on several popular clustering algorithms as well as on clustering algorithms that are specially tailored to deal with noisy data. Finally, we propose methods for assessing the significance of individual clusters and study the correspondence between gene clusters and biochemical pathways.
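As a rough illustration of how an internal and an external index can be computed side by side, the sketch below scores a hard clustering in two ways. It is not the chapter's method, whose exact indices are not given in this abstract. The function names, the spherical Gaussian model with one shared variance, the uniform code for cluster assignments, and the "fraction of linked pairs placed together" measure are all assumptions chosen for illustration.

```python
import math

def description_length(points, labels):
    """Illustrative two-part description length (in nats) of a hard clustering.

    Assumed model (not taken from the chapter): each cluster is a spherical
    Gaussian around its centroid, with a single variance shared by all clusters.
    """
    n, d = len(points), len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)

    # Centroid of each cluster.
    sums = {c: [0.0] * d for c in clusters}
    counts = {c: 0 for c in clusters}
    for p, c in zip(points, labels):
        counts[c] += 1
        for j in range(d):
            sums[c][j] += p[j]
    centroids = {c: [s / counts[c] for s in sums[c]] for c in clusters}

    # Residual sum of squares around the centroids, and the shared variance.
    rss = sum((p[j] - centroids[c][j]) ** 2
              for p, c in zip(points, labels) for j in range(d))
    var = max(rss / (n * d), 1e-12)  # floor avoids log(0) for perfect fits

    # Cost of the data given the model: Gaussian negative log-likelihood.
    data_cost = 0.5 * n * d * math.log(2 * math.pi * var) + rss / (2 * var)
    # Cost of the parameters: k centroids of d values, plus one variance.
    param_cost = 0.5 * (k * d + 1) * math.log(n)
    # Cost of stating each point's cluster with a uniform code over k labels.
    assignment_cost = n * math.log(k)
    return data_cost + param_cost + assignment_cost

def linked_pair_agreement(labels, linked_pairs):
    """Fraction of externally known linked pairs (i, j) placed in one cluster."""
    together = sum(1 for i, j in linked_pairs if labels[i] == labels[j])
    return together / len(linked_pairs)

# Toy data: two well-separated groups of four points each (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
candidates = {
    "one cluster": [0] * 8,
    "two clusters": [0, 0, 0, 0, 1, 1, 1, 1],
    "four clusters": [0, 0, 1, 1, 2, 2, 3, 3],
}
# Hypothetical "functional links" between point indices.
linked_pairs = [(0, 1), (2, 3), (4, 5), (0, 4)]

for name, labels in candidates.items():
    print(name, description_length(points, labels),
          linked_pair_agreement(labels, linked_pairs))
```

A lower description length indicates a better trade-off between fit and model complexity, while a higher agreement indicates better consistency with the external links. The external measure used here is trivially maximized by placing everything in a single cluster, which is one reason neither score is sufficient by itself. Note also that the variance floor makes the internal score degenerate when every point is its own cluster, so this sketch should only be used to compare clusterings with far fewer clusters than points.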
Acknowledgments
This work is supported by the National Science Foundation under Grant No. 0218521, as part of the NSF/NIH Collaborative Research in Computational Neuroscience Program.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Yona, G., Dirks, W., Rahman, S. (2009). Comparing Algorithms for Clustering of Expression Data: How to Assess Gene Clusters. In: Ireton, R., Montgomery, K., Bumgarner, R., Samudrala, R., McDermott, J. (eds) Computational Systems Biology. Methods in Molecular Biology, vol 541. Humana Press. https://doi.org/10.1007/978-1-59745-243-4_21
DOI: https://doi.org/10.1007/978-1-59745-243-4_21
Publisher Name: Humana Press
Print ISBN: 978-1-58829-905-5
Online ISBN: 978-1-59745-243-4
eBook Packages: Springer Protocols