Machine Learning for Protein Engineering

Handbook of Machine Learning Applications for Genomics

Part of the book series: Studies in Big Data (SBD, volume 103)

Abstract

This chapter outlines methods and applications of machine learning (ML) in protein engineering. It is written from the biologist’s perspective of ML, with an emphasis on experimental design for optimized data collection. After a brief introduction, the remaining sections follow a schema for designing and implementing ML for protein engineering problems. The steps below summarize this schema and the organization of the chapter.

  I. Formulate a question that ML tools can answer
  II. Design an experiment to collect data
  III. Curate the dataset
  IV. Choose and train a model
  V. Interpret results
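
The five steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: the toy three-residue variants and their labels are invented, the curation rules (drop duplicates and non-standard residues) are one plausible choice, and logistic regression on one-hot features is simply a small, interpretable stand-in for whatever model a real project would select.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Step I (question): can a variant's sequence predict whether it is functional?
# Step II (data collection): hypothetical toy data standing in for assay results.
# Each entry is (3-residue variant, 1 = functional, 0 = non-functional).
RAW_DATA = [
    ("AKL", 1), ("AKV", 1), ("GKL", 1), ("AKI", 1),
    ("ADL", 0), ("ADV", 0), ("GDL", 0), ("ADI", 0),
    ("AKL", 1),  # duplicate observation
    ("AXL", 1),  # contains a non-standard residue
]

def curate(records):
    """Step III: drop duplicates and sequences with non-standard residues."""
    seen, clean = set(), []
    for seq, label in records:
        if seq in seen or any(aa not in AMINO_ACIDS for aa in seq):
            continue
        seen.add(seq)
        clean.append((seq, label))
    return clean

def one_hot(seq):
    """Encode a sequence as a flat one-hot vector (20 entries per position)."""
    vec = [0.0] * (len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return vec

def predict(weights, bias, x):
    """Return the predicted probability that a variant is functional."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(data, epochs=200, lr=0.5):
    """Step IV: fit logistic regression by stochastic gradient descent."""
    n_features = len(one_hot(data[0][0]))
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for seq, label in data:
            x = one_hot(seq)
            error = predict(weights, bias, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def top_features(weights, k=2):
    """Step V: list the (position, residue) pairs with the largest |weight|."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return [(i // len(AMINO_ACIDS) + 1, AMINO_ACIDS[i % len(AMINO_ACIDS)])
            for i in ranked[:k]]

dataset = curate(RAW_DATA)
weights, bias = train(dataset)
important = top_features(weights)
print(important)
```

In this toy data the label depends only on the residue at position 2, so the highest-weighted features point to that position. A real study would also hold out data for validation, which this sketch omits for brevity.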

Author information

Correspondence to Andrew D. Marques.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Marques, A.D. (2022). Machine Learning for Protein Engineering. In: Roy, S.S., Taguchi, YH. (eds) Handbook of Machine Learning Applications for Genomics. Studies in Big Data, vol 103. Springer, Singapore. https://doi.org/10.1007/978-981-16-9158-4_2
