Machine Learning for Protein Engineering

Chapter
First Online: 24 June 2022

pp 19–29
Handbook of Machine Learning Applications for Genomics

Andrew D. Marques

Part of the book series: Studies in Big Data (SBD, volume 103)

Abstract

This chapter outlines methods and applications of machine learning (ML) in protein engineering. It is written from the biologist’s perspective, with an emphasis on designing experiments that collect data well suited to ML. After a brief introduction, the remaining sections follow a schema for designing and implementing ML for protein engineering problems. The steps below summarize this schema and the organization of the chapter.

I. Formulate a question that ML tools can answer
II. Design an experiment to collect data
III. Curate the dataset
IV. Choose and train a model
V. Interpret results

Author information

Authors and Affiliations

Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
Andrew D. Marques

Corresponding author

Correspondence to Andrew D. Marques.

Editor information

Editors and Affiliations

Vellore Institute of Technology Universi, Vellore, India
Sanjiban Sekhar Roy
Department of Physics, Chuo University, Tokyo, Tokyo, Japan
Y.-H. Taguchi

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Marques, A.D. (2022). Machine Learning for Protein Engineering. In: Roy, S.S., Taguchi, YH. (eds) Handbook of Machine Learning Applications for Genomics. Studies in Big Data, vol 103. Springer, Singapore. https://doi.org/10.1007/978-981-16-9158-4_2

DOI: https://doi.org/10.1007/978-981-16-9158-4_2
Published: 24 June 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-9157-7
Online ISBN: 978-981-16-9158-4
eBook Packages: Intelligent Technologies and Robotics (R0)
