Abstract
Improving a functional property of an enzyme via mutagenesis remains challenging because the search space is vast and the effects of mutations are difficult to predict. Machine learning has proven proficient at solving similar problems with unprecedented speed, owing to recent advances in computing power and analytical algorithms. In this study, we investigate the performance of machine learning methods in predicting the H2 production activity and O2 tolerance of hydrogenase variants. Experimentally measured activities and tolerances of 377 variants with single or double amino acid replacements were used to train and test seven types of machine learning models. Two feature representations were employed for each variant: a binary representation of the amino acid sequence, and VHSE, a series of vectors quantifying the physicochemical properties of amino acids. The results show that VHSE features enable higher performance, especially with respect to the correlation coefficient and the coefficient of determination, in addition to the root mean square error. Next, model performance was analyzed with respect to changes in data size and heterogeneity to provide insights into designing an effective mutagenesis library for machine learning. The best performance was obtained when a support vector machine or ridge regression was trained on a large, homogeneous data set. In this manner, our study reveals the factors affecting the performance of machine learning in identifying enzyme variants with enhanced function.
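To make the terms in the abstract concrete, the following is a minimal sketch of a binary (one-hot) sequence representation and of the three evaluation metrics named above: root mean square error, Pearson correlation coefficient, and coefficient of determination. It is an illustration under stated assumptions, not the authors' implementation. The function names, the residue ordering, and the toy sequence are hypothetical, and the VHSE descriptor values are not reproduced here.

```python
import math

# Assumed ordering of the 20 standard amino acids (any fixed order works).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encode(sequence):
    """One-hot encode a sequence: 20 bits per position, exactly one bit set."""
    features = []
    for residue in sequence:
        bits = [0] * len(AMINO_ACIDS)
        bits[AMINO_ACIDS.index(residue)] = 1
        features.extend(bits)
    return features


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical three-residue fragment, used only to show the feature layout.
toy_features = binary_encode("MKT")
```

A VHSE-style representation would replace each 20-bit block with a short vector of real-valued physicochemical descriptors per residue, so the feature vector becomes shorter and dense rather than long and sparse.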
Acknowledgements
This work was supported by the 2023 Hongik University Research Fund and KISTI R&D Innovation Support Program (KSC-2020-INO-0051).
Ethics declarations
The authors declare no conflict of interest.
Neither ethical approval nor informed consent was required for this study.
Cite this article
Choi, G., Kim, W. & Koo, J. Investigating the Performance of Machine Learning Methods in Predicting Functional Properties of the Hydrogenase Variants. Biotechnol Bioproc E 28, 143–151 (2023). https://doi.org/10.1007/s12257-022-0330-3
DOI: https://doi.org/10.1007/s12257-022-0330-3