Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices

Tatjewski, Marcin; Kierczak, Marcin; Plewczynski, Dariusz

doi:10.1007/978-1-4939-6406-2_19

Part of the book series: Methods in Molecular Biology (MIMB, volume 1484)

Abstract

Here, we present two perspectives on the task of predicting post-translational modifications (PTMs) from local sequence fragments using machine learning algorithms. The first is a description of the fundamental steps required to construct a PTM predictor from scratch. These steps include data gathering, feature extraction, and machine-learning classifier selection. The second is a detailed discussion of more advanced problems encountered in the PTM prediction task. Probably the most challenging issues covered here are: (1) how to address the class imbalance problem in training data (we also present statistics describing the problem); (2) how to properly set up cross-validation folds so that they take into account the homology of protein data records, for which we present our folds-over-clusters algorithm; and (3) how to efficiently reach for new sources of learning features. The presented techniques and notes result from intense studies in the field, performed by our group and others, and can be useful both for researchers beginning in the field of PTM prediction and for those who want to extend their repertoire of research techniques.
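To make the idea of homology-aware cross-validation concrete, the following is a minimal sketch in Python. It is not the folds-over-clusters algorithm itself, which this abstract does not describe in detail. It only illustrates the general principle that all records belonging to one homology cluster should land in the same fold. The function name `assign_clusters_to_folds`, the greedy balancing rule, and the toy cluster labels are assumptions made for illustration. The cluster labels are assumed to have been computed beforehand, for example with a sequence-clustering tool.

```python
from collections import defaultdict


def assign_clusters_to_folds(cluster_ids, n_folds=5):
    """Return one fold index per record so that no cluster spans two folds.

    cluster_ids: list with one (hypothetical) homology-cluster label per record.
    """
    # Group record indices by their precomputed homology cluster.
    clusters = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        clusters[cid].append(idx)

    fold_sizes = [0] * n_folds
    fold_of_record = [None] * len(cluster_ids)

    # Greedy balancing (an illustrative choice): place the largest clusters
    # first, each into the fold that currently holds the fewest records.
    ordered = sorted(clusters.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    for cid, members in ordered:
        target = fold_sizes.index(min(fold_sizes))
        for idx in members:
            fold_of_record[idx] = target
        fold_sizes[target] += len(members)
    return fold_of_record


# Toy example: eight sequence fragments grouped into five clusters.
cluster_ids = ["c1", "c1", "c2", "c3", "c3", "c3", "c4", "c5"]
folds = assign_clusters_to_folds(cluster_ids, n_folds=3)
```

Because whole clusters are assigned at once, homologous records never appear on both the training and the test side of a split, which is the property that ordinary random fold assignment does not guarantee.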

Notes

1. This name is less correct, since the jackknife is a resampling method rather than a type of cross-validation.

Acknowledgements

Marcin Tatjewski was supported by the European Union from the resources of the European Social Fund (Project PO KL “Information technologies: Research and their interdisciplinary applications”, Agreement UDA-POKL.04.01.01-00-051/10-00). Marcin Tatjewski and Dariusz Plewczynski were supported by the Polish National Science Centre (grant numbers 2015/16/T/ST6/00493, 2014/15/B/ST6/05082, and 2013/09/B/NZ2/00121) and by the EU COST actions BM1405 and BM1408. Marcin Kierczak was supported by the Swedish Foundation for Strategic Research and the Swedish Research Council.

Author information

Authors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Marcin Tatjewski
Centre of New Technologies, University of Warsaw, S. Banacha 2c, 02-097, Warsaw, Poland
Marcin Tatjewski
Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
Marcin Kierczak
Centre of New Technologies, University of Warsaw, S. Banacha 2c, Warsaw, 02-097, Poland
Dariusz Plewczynski

Corresponding author

Correspondence to Dariusz Plewczynski.

Editor information

Editors and Affiliations

Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Queensland, Australia
Yaoqi Zhou
Battelle Center for Mathematical Medicine, Nationwide Children’s Hospital, Columbus, Ohio, USA
Andrzej Kloczkowski
Department of Biochemistry and Molecular Biology, Indiana University School of Medicine, Indianapolis, Indiana, USA
Eshel Faraggi
Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, Queensland, Australia
Yuedong Yang

Cite this protocol

Tatjewski, M., Kierczak, M., Plewczynski, D. (2017). Predicting Post-Translational Modifications from Local Sequence Fragments Using Machine Learning Algorithms: Overview and Best Practices. In: Zhou, Y., Kloczkowski, A., Faraggi, E., Yang, Y. (eds) Prediction of Protein Secondary Structure. Methods in Molecular Biology, vol 1484. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-6406-2_19

DOI: https://doi.org/10.1007/978-1-4939-6406-2_19
Published: 28 October 2016
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-6404-8
Online ISBN: 978-1-4939-6406-2
eBook Packages: Springer Protocols
