Abstract
Enzymes play central roles in metabolic pathways, and the prediction of metabolic pathways in newly sequenced genomes usually starts with the assignment of genes to enzymatic reactions. However, enzymes with similar catalytic activity are not necessarily similar in sequence, so the traditional sequence similarity-based approach often fails to identify the relevant enzymes, hindering efforts to map the metabolome of an organism.
Here we study the direct relationship between basic protein properties and their function. Our goal is to develop a new tool for functional prediction (e.g., prediction of Enzyme Commission number), which can be used to complement and support other techniques based on sequence or structure information. In order to define this mapping we collected a set of 453 features and properties that characterize proteins and are believed to be related to structural and functional aspects of proteins. We introduce a mixture model of stochastic decision trees to learn the set of potentially complex relationships between features and function. To study these correlations, trees are created and tested on the Pfam classification of proteins, which is based on sequence, and the EC classification, which is based on enzymatic function. The model is very effective in learning highly diverged protein families or families that are not defined on the basis of sequence. The resulting tree structures highlight the properties that are strongly correlated with structural and functional aspects of protein families, and can be used to suggest a concise definition of a protein family.
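The idea of a mixture of stochastic decision trees can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that each node samples its split with probability proportional to information gain (rather than always taking the best split), that candidate thresholds are feature medians, and that the mixture weights all trees equally. The function names (`grow`, `tree_proba`, `mixture_proba`), the two toy features, and the class labels are hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, feat, thr):
    """Entropy reduction from splitting on rows[i][feat] <= thr."""
    left = [y for x, y in zip(rows, labels) if x[feat] <= thr]
    right = [y for x, y in zip(rows, labels) if x[feat] > thr]
    if not left or not right:
        return 0.0
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

def grow(rows, labels, rng, depth=0, max_depth=4):
    """Grow one stochastic tree. A leaf is a Counter of class frequencies;
    an internal node is a tuple (feature, threshold, left, right)."""
    if depth == max_depth or len(set(labels)) == 1:
        return Counter(labels)
    cands = []
    for f in range(len(rows[0])):
        vals = sorted(x[f] for x in rows)
        thr = vals[(len(vals) - 1) // 2]  # assumption: median threshold
        g = info_gain(rows, labels, f, thr)
        if g > 1e-12:  # ignore floating-point noise
            cands.append((f, thr, g))
    if not cands:
        return Counter(labels)
    # Stochastic step (assumption): sample a split in proportion to its gain.
    f, thr, _ = rng.choices(cands, weights=[c[2] for c in cands])[0]
    li = [i for i, x in enumerate(rows) if x[f] <= thr]
    ri = [i for i, x in enumerate(rows) if x[f] > thr]
    return (f, thr,
            grow([rows[i] for i in li], [labels[i] for i in li],
                 rng, depth + 1, max_depth),
            grow([rows[i] for i in ri], [labels[i] for i in ri],
                 rng, depth + 1, max_depth))

def tree_proba(node, x):
    """Class distribution at the leaf that sample x reaches."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    total = sum(node.values())
    return {c: k / total for c, k in node.items()}

def mixture_proba(trees, x):
    """Uniformly weighted mixture of the per-tree class distributions."""
    mix = Counter()
    for t in trees:
        for c, p in tree_proba(t, x).items():
            mix[c] += p / len(trees)
    return dict(mix)

# Toy data: two made-up numeric features per protein, two made-up classes.
rows = [(0.1, 5), (0.2, 3), (0.3, 4), (0.9, 5), (0.8, 3), (0.7, 4)]
labels = ["hydrolase"] * 3 + ["transferase"] * 3
rng = random.Random(0)
trees = [grow(rows, labels, rng) for _ in range(10)]
p = mixture_proba(trees, (0.15, 4))
prediction = max(p, key=p.get)
```

Because each internal node records the feature and threshold it tests, a grown tree can be read as a set of rules, which is the sense in which such a model can suggest a concise definition of a protein family.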
Acknowledgments
This work is supported by the National Science Foundation under Grant No. 0133311 to Golan Yona.
Copyright information
© 2009 Humana Press, a part of Springer Science+Business Media, LLC
About this protocol
Cite this protocol
Syed, U., Yona, G. (2009). Enzyme Function Prediction with Interpretable Models. In: Ireton, R., Montgomery, K., Bumgarner, R., Samudrala, R., McDermott, J. (eds) Computational Systems Biology. Methods in Molecular Biology, vol 541. Humana Press. https://doi.org/10.1007/978-1-59745-243-4_17
DOI: https://doi.org/10.1007/978-1-59745-243-4_17
Publisher Name: Humana Press
Print ISBN: 978-1-58829-905-5
Online ISBN: 978-1-59745-243-4
eBook Packages: Springer Protocols