
Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices

  • Protocol
Prediction of Protein Secondary Structure

Part of the book series: Methods in Molecular Biology ((MIMB,volume 1484))

Abstract

Here, we present two perspectives on the task of predicting post-translational modifications (PTMs) from local sequence fragments using machine learning algorithms. The first describes the fundamental steps required to construct a PTM predictor from scratch. These steps include data gathering, feature extraction, and machine-learning classifier selection. The second discusses in detail the more advanced problems encountered in the PTM prediction task. The most challenging issues covered here are probably: (1) how to address class imbalance in the training data (we also present statistics describing the problem); (2) how to set up cross-validation folds so that they take into account the homology of protein data records, for which we present our folds-over-clusters algorithm; and (3) how to efficiently find new sources of learning features. The techniques and notes presented result from intensive studies in the field, performed by our group and others, and can be useful both for researchers beginning in the field of PTM prediction and for those who want to extend their repertoire of research techniques.
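
The idea behind homology-aware cross-validation can be illustrated with a short sketch. The abstract does not spell out the folds-over-clusters algorithm itself, so the code below is not the authors' method: it is a generic greedy assignment, written under the assumption that every sequence fragment already carries a cluster label from some sequence clustering step. The function names and the toy cluster labels are hypothetical.

```python
from collections import defaultdict


def cluster_folds(cluster_ids, n_folds):
    """Assign each sample a fold so that no cluster is split across folds.

    cluster_ids holds one cluster label per sample. The labels are assumed
    to come from a prior homology clustering step (hypothetical here).
    """
    # Group sample indices by cluster label.
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    # Greedy balancing: place the largest clusters first, each into the
    # fold that currently holds the fewest samples.
    fold_sizes = [0] * n_folds
    fold_of_sample = [None] * len(cluster_ids)
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = fold_sizes.index(min(fold_sizes))
        for index in members[cluster]:
            fold_of_sample[index] = target
        fold_sizes[target] += len(members[cluster])
    return fold_of_sample


def train_test_splits(fold_of_sample, n_folds):
    """Yield (train_indices, test_indices) once per fold."""
    for fold in range(n_folds):
        test = [i for i, f in enumerate(fold_of_sample) if f == fold]
        train = [i for i, f in enumerate(fold_of_sample) if f != fold]
        yield train, test


# Toy data: eight fragments in four hypothetical homology clusters.
clusters = ["c1", "c1", "c1", "c2", "c2", "c3", "c4", "c4"]
folds = cluster_folds(clusters, n_folds=2)

for train, test in train_test_splits(folds, n_folds=2):
    print("train:", train, "test:", test)
```

Because whole clusters move together, no test fold contains a fragment whose close homologs were seen during training, which is the leakage that a plain random split would allow. Fold sizes are only approximately balanced, and this sketch does not balance the positive and negative classes within folds.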


Notes

  1. A less accurate name, since the jackknife is a resampling method rather than a type of cross-validation.

Acknowledgements

Marcin Tatjewski was supported by the European Union from resources of the European Social Fund, under the PO KL project “Information technologies: Research and their interdisciplinary applications”, Agreement UDA-POKL.04.01.01-00-051/10-00. Marcin Tatjewski and Dariusz Plewczynski were supported by the Polish National Science Centre (grant numbers 2015/16/T/ST6/00493, 2014/15/B/ST6/05082, and 2013/09/B/NZ2/00121) and by EU COST actions BM1405 and BM1408. Marcin Kierczak was supported by the Swedish Foundation for Strategic Research and the Swedish Research Council.

Author information

Corresponding author

Correspondence to Dariusz Plewczynski.

Copyright information

© 2017 Springer Science+Business Media New York

About this protocol

Cite this protocol

Tatjewski, M., Kierczak, M., Plewczynski, D. (2017). Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices. In: Zhou, Y., Kloczkowski, A., Faraggi, E., Yang, Y. (eds) Prediction of Protein Secondary Structure. Methods in Molecular Biology, vol 1484. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6406-2_19

  • DOI: https://doi.org/10.1007/978-1-4939-6406-2_19

  • Publisher Name: Humana Press, New York, NY

  • Print ISBN: 978-1-4939-6404-8

  • Online ISBN: 978-1-4939-6406-2

  • eBook Packages: Springer Protocols
