Abstract
This chapter outlines methods and applications of machine learning (ML) in protein engineering. It is written from the biologist’s perspective of ML, with an emphasis on experimental design for optimized data collection. After a brief introduction, the remaining sections follow a five-step schema for designing and implementing ML for protein engineering problems. The steps below summarize both the schema and the organization of the chapter.
I. Formulate a question that ML tools can answer
II. Design an experiment to collect data
III. Curate the dataset
IV. Choose and train a model
V. Interpret results
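As a rough illustration of how the five steps fit together, the following minimal Python sketch walks through them on a toy problem. Everything in it is an assumption made for illustration and is not taken from the chapter: the question (does a four-residue variant retain function?), the invented sequences and labels, the helper names `curate`, `train`, and `predict`, and the very simple count-based scoring model used in place of a real ML method.

```python
from collections import Counter

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

# I. Question (hypothetical): does a variant retain function (1) or not (0)?
# II. Data: invented toy measurements of (sequence, label). Note the
#     duplicate entry and the entry containing a non-amino-acid character.
raw = [
    ("ACDK", 1), ("ACDR", 1), ("ACEK", 1), ("ACDK", 1),
    ("GCDK", 0), ("GCER", 0), ("GCDR", 0), ("AXDK", 1),
]

def curate(records, length=4):
    """III. Drop duplicates and sequences with invalid residues or length."""
    seen, clean = set(), []
    for seq, label in records:
        if seq in seen or len(seq) != length or not set(seq) <= AMINO_ACIDS:
            continue
        seen.add(seq)
        clean.append((seq, label))
    return clean

def train(records):
    """IV. Weight each (position, residue) feature: +1 per functional
    variant that carries it, -1 per non-functional variant."""
    weights = Counter()
    for seq, label in records:
        for pos, aa in enumerate(seq):
            weights[(pos, aa)] += 1 if label == 1 else -1
    return weights

def predict(weights, seq):
    """Predict 1 (functional) if the summed feature weights are positive."""
    score = sum(weights[(pos, aa)] for pos, aa in enumerate(seq))
    return 1 if score > 0 else 0

data = curate(raw)
model = train(data)

# V. Interpret: which (position, residue) features carry the largest weights?
top_features = sorted(model.items(), key=lambda kv: abs(kv[1]), reverse=True)[:2]
print(top_features)
print(predict(model, "ACER"), predict(model, "GCEK"))
```

In this toy dataset the first position dominates the learned weights, which is the kind of observation the interpretation step is meant to surface. A real study would replace the count-based scorer with a properly validated model and evaluate it on held-out data.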
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Marques, A.D. (2022). Machine Learning for Protein Engineering. In: Roy, S.S., Taguchi, YH. (eds) Handbook of Machine Learning Applications for Genomics. Studies in Big Data, vol 103. Springer, Singapore. https://doi.org/10.1007/978-981-16-9158-4_2
DOI: https://doi.org/10.1007/978-981-16-9158-4_2
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-9157-7
Online ISBN: 978-981-16-9158-4
eBook Packages: Intelligent Technologies and Robotics (R0)