Abstract
Motivation
Protein–protein interactions (PPIs) govern many cellular activities. The amino acid residues located at the interface are known as binding sites, and information about them helps in understanding the binding affinities and functions of protein–protein complexes.
Results
We have developed DeepBSRPred, a deep neural network-based method for predicting binding sites from protein sequence information and structures predicted by AlphaFold2. The sequence- and structure-based features include the position-specific scoring matrix (PSSM), conservation score, amino acid properties, solvent accessible surface area, and residue depth. The method predicted binding sites with an average F1 score of 0.73 on a dataset of 1236 proteins. We further compared its performance with existing methods in the literature on four benchmark datasets, and our method outperformed them.
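To make the reported metric concrete, the following is a minimal sketch of how F1 and the Matthews correlation coefficient (MCC) can be computed for per-residue binary predictions. It is not the authors' evaluation code: the function names and the toy labels are hypothetical, and how scores are averaged over the 1236 proteins is not specified here.

```python
from math import sqrt


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = binding site, 0 = non-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy labels for eight residues of one protein.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f1_score(tp, fp, fn))  # 0.75
print(mcc(tp, fp, tn, fn))   # 0.5
```

Because binding-site residues are typically a minority class, F1 and MCC are usually more informative than plain accuracy for this kind of task.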
Availability and implementation
The DeepBSRPred web server can be found at https://web.iitm.ac.in/bioinfo2/deepbsrpred/index.html, along with all datasets used in this study. The trained models, the DeepBSRPred standalone source code, and the feature computation pipeline are freely available at https://web.iitm.ac.in/bioinfo2/deepbsrpred/download.html.
Data availability
The data used in this work are available at https://web.iitm.ac.in/bioinfo2/deepbsrpred/download.html.
Abbreviations
- An-Ab: Antigen–antibody
- EC: Enzyme containing
- GP: G-protein containing
- IN: Inhibitor containing
- RC: Receptor containing
- MS: Miscellaneous
- ASA: Accessible surface area
- AUROC: Area under the receiver operating characteristic curve
- AUPRC: Area under the precision-recall curve
- PSSM: Position-specific scoring matrix
- F1: F1-score
- MCC: Matthews correlation coefficient
- Polar real: ASA of polar residues
- BIOV880102: Information value for accessibility (Biou et al. 1988)
- NADH010102: Hydropathy scale based on self-information values in the two-state model (Naderi-Manesh et al. 2001)
- VALDAR: Protein conservation metrics (Valdar and Thornton 2001)
- dASA: Solvent accessible surface area for protein unfolding
- PONJ960101: Average volumes of residues (Pontius et al. 1996)
- FASG760101: Molecular weight (Fasman 1976)
- GRAR740103: Volume (Grantham 1974)
- HB acceptor: Hydrogen bond acceptor
- ASAD: Solvent accessible surface area for denatured protein (Gromiha et al. 1999)
- ASAN: Solvent accessible surface area for native protein (Gromiha et al. 1999)
- TAYLOR_GAPS: Conservation score (Taylor 1986)
- PSSM sum: Summation of PSSM values
- Residue depth: Residue depth, computed using Python
- SMERFS: Conservation from AAcon tool (Manning et al. 2008)
- Contact count: Number of contacts of the residue
References
Abadi M, Agarwal A et al (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467
Agnieszka G, Peter V et al (2018) AACon: a fast amino acid conservation calculation service. https://www.compbio.dundee.ac.uk/aacon/
Al-Rfou R, Alain G et al (2016) Theano: a Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17):3389–3402. https://doi.org/10.1093/nar/25.17.3389
Amos-Binks A, Patulea C et al (2011) Binding site prediction for protein-protein interactions and novel motif discovery using re-occurring polypeptide sequences. BMC Bioinform 12:225
Asadabadi EB, Abdolmaleki P (2013) Predictions of protein-protein interfaces within membrane protein complexes. Avicenna J Med Biotechnol 5:148–157
Asgari E, Mofrad MR (2015) Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10:e0141287
Asgari E, McHardy AC et al (2019) Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci Rep 9:3577
Biou V, Gibrat JF et al (1988) Secondary structure prediction: combination of three different methods. Protein Eng Des Sel 2(3):185–191
Branco P, Torgo L (2016) A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR) 49(2):1–50
Cao B, Porollo A et al (2006) Enhanced recognition of protein transmembrane domains with prediction-based structural profiles. Bioinformatics 22:303–309
Chakravarty S, Varadarajan R (1999) Residue depth: a novel parameter for the analysis of protein structure and stability. Structure 7:723–732
Chen X, Jeong JC (2009) Sequence-based prediction of protein interaction sites with an integrative method. Bioinformatics 25:585–591
Chen P, Li J (2010) Sequence-based identification of interface residues by an integrative profile combining hydrophobic and evolutionary information. BMC Bioinformatics 11:402
Chollet F (2015) Keras: deep learning library for Theano and TensorFlow. https://keras.io
Clark JJ, Orban ZJ et al (2020) Predicting binding sites from unbound versus bound protein structures. Sci Rep 10(1):15856
Dhole K, Singh G et al (2014) Sequence-based prediction of protein-protein interaction sites with L1-logreg classifier. J Theor Biol 348:47–54
Du X, Cheng J, Song J (2009) Improved prediction of protein binding sites from sequences using genetic algorithm. Protein J 28(6):273–280. https://doi.org/10.1007/s10930-009-9192-1
Fasman GD (1976) Handbook of Biochemistry and Molecular Biology. Proteins. CRC Press, Cleveland
Geng H, Lu T et al (2015) Prediction of protein-protein interaction sites based on naive Bayes classifier. Biochem Res Int 2015:1–7
Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185(4154):862–864
Gromiha MM, Oobatake M et al (1999) Important amino acid properties for enhanced thermostability from mesophilic to thermophilic proteins. Biophys Chem 82(1):51–67
Gromiha MM, Yokota K et al (2009) Identification and analysis of binding site residues in protein-protein complexes. Int J Biol Biomed 3(9):415–420
Gromiha MM, Saranya N et al (2011) Sequence and structural features of binding site residues in protein-protein complexes: comparison with protein-nucleic acid complexes. Proteome Science 9(Suppl 1):S13
Heinzinger M, Elnaggar A et al (2019) Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20:723
Hubbard SJ, Thornton JM (1993) ‘NACCESS’, computer program. Department of Biochemistry and Molecular Biology, University College, London
Hwang H, Petrey D et al (2016) A hybrid method for protein–protein interface prediction. Protein Sci 25:159–165
Jia J, Liu Z et al (2016) iPPBS-Opt: a sequence-based ensemble classifier for identifying protein-protein binding sites by optimizing imbalanced training datasets. Molecules 21:95
Jones DT, Buchan DW et al (2012) PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28:184–190
Jumper J, Evans R et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583–589
Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12):2577–2637
Kawashima S, Pokarowski P et al (2008) AAindex: amino acid index database progress report. Nucleic Acids Res 36(Database issue):D202–D205
Konc J, Janezic D (2007) Protein-protein binding-sites prediction by protein surface structure conservation. J Chem Inf Model 47(3):940–944
Laine E, Carbone A (2015) Local geometry and evolutionary conservation of protein surfaces reveal the multiple recognition patches in protein-protein interactions. PLoS Comput Biol 11:e1004580
Li Y, Golding GB et al (2021) DELPHI: accurate deep ensemble model for protein interaction sites prediction. Bioinformatics 37(7):896–904
Liang S, Zhang J et al (2004) Prediction of the interaction site on the surface of an isolated protein structure by analysis of side chain energy scores. Proteins 57(3):548–557
Lijnzaad P, Berendsen HJ, Argos P (1996) Hydrophobic patches on the surfaces of protein structures. Proteins 25(3):389–397
Lise S, Archambeau C et al (2009) Prediction of hot spot residues at protein-protein interfaces by combining machine learning and energy-based methods. BMC Bioinform 10:365
Liu GH, Shen HB et al (2016) Prediction of protein–protein interaction sites with machine-learning-based data-cleaning and post-filtering procedures. J Membr Biol 249:141–153
London N, Movshovitz-Attias D et al (2010) The structural basis of peptide-protein binding strategies. Structure 18:188–199
Ma B, Elkayam T et al (2003) Protein-protein interactions: structurally conserved residues distinguish between binding sites and exposed protein surfaces. Proc Natl Acad Sci USA 100(10):5772–5777
Maheshwari S, Brylinski M (2015) Prediction of protein–protein interaction sites from weakly homologous template structures using meta-threading and machine learning. J Mol Recognit 28:35–48
Maheshwari S, Brylinski M (2016) Template-based identification of protein–protein interfaces using eFindSitePPI. Methods 93:64–71
Manning JR, Jefferson ER et al (2008) The contrasting properties of conservation and correlated phylogeny in protein functional residue prediction. BMC Bioinform 9:51
McDonald IK, Thornton JM (1994) Satisfying hydrogen bonding potential in proteins. J Mol Biol 238(5):777–793
Murakami Y, Mizuguchi K (2010) Applying the Naïve Bayes classifier with kernel density estimation to the prediction of protein-protein interaction sites. Bioinformatics 26:1841–1848
Naderi-Manesh H, Sadeghi M et al (2001) Prediction of protein surface accessibility with information theory. Proteins 42(4):452–459
Neuvirth H, Raz R et al (2004) ProMate: a structure-based prediction program to identify the location of protein-protein binding sites. J Mol Biol 338(1):181–199
Ofran Y, Rost B (2007) ISIS: interaction sites identified from sequence. Bioinformatics 23:e13–e16
Pedregosa F, Varoquaux G et al (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Pontius J, Richelle J et al (1996) Deviations from standard atomic volumes as a quality measure for protein crystal structures. J Mol Biol 264(1):121–136
Porollo A, Meller J (2007) Prediction-based fingerprints of protein-protein interactions. Proteins: Struct Funct Bioinform 66:630–645
Saito T, Rehmsmeier M (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 10(3):e0118432
Singh G, Dhole K et al (2014) SPRINGS: prediction of protein-protein interaction sites using artificial neural networks. Technical report. PeerJ PrePrints, PPR39858
Sun Y, Wong AKC, Kamel MS (2009) Classification of imbalanced data: A review. Int J Pattern Recognit Artif Intell 23(4):687–719. https://doi.org/10.1142/S0218001409007326
Taherzadeh G, Yang Y, Zhang T, Liew AW, Zhou Y (2016) Sequence-based prediction of protein-peptide binding sites using support vector machine. J Comput Chem 37(13):1223–1229. https://doi.org/10.1002/jcc.24314
Taylor WR (1986) The classification of amino acid conservation. J Theor Biol 119(2):205–218
Thomas CN, Anja B et al (2018) IntPred: a structure-based predictor of protein–protein interaction sites. Bioinformatics 34:223–229
Valdar WS, Thornton JM (2001) Conservation helps to identify biologically relevant crystal contacts. J Mol Biol 313(2):399–416. https://doi.org/10.1006/jmbi.2001.5034
Valdar WS (2002) Scoring residue conservation. Proteins: Struct Funct Bioinform 48:227–241
Varadi M, Anyango S et al (2022) AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50(D1):D439–D444
Viloria SJ, Allega MF, Lambrughi M, Papaleo E (2017) An optimal distance cutoff for contact-based protein structure networks using side-chain centers of mass. Sci Rep 7:1–11
Wang G, Dunbrack RL (2003) PISCES: a protein sequence culling server. Bioinformatics 19(12):1589–1591
Wang X, Yu B (2019) Protein–protein interaction sites prediction by ensemble random forests with synthetic minority oversampling technique. Bioinformatics 35:2395–2402
Wang DD, Wang R et al (2014) Fast prediction of protein–protein interaction sites based on extreme learning machines. Neurocomputing 128:258–266
Wei Z, Han K et al (2016) Protein–protein interaction sites prediction by ensembling SVM and sample-weighted random forests. Neurocomputing 193:201–212
Wei ZS, Yang JY, Shen HB, Yu DJ (2015) A cascade random forests algorithm for predicting protein-protein interaction sites. IEEE Trans Nanobiosci 14(7):746–760. https://doi.org/10.1109/TNB.2015.2475359
Xie Z, Deng X et al (2020) Prediction of protein–protein interaction sites using convolutional neural network and improved data sets. Int J Mol Sci 21:467
Xingyu G, Zhenyu C et al (2016) Adaptive weighted imbalance learning with application to abnormal activity recognition. Neurocomputing 173:1927–1935
Xue LC, Dobbs D et al (2011) HomPPI: a class of sequence homology-based protein-protein interface prediction methods. BMC Bioinformatics 12:244
Zardecki C, Dutta S et al (2022) PDB-101: Educational resources supporting molecular explorations through biology and medicine. Protein Sci 31(1):129–140
Zeng M, Zhang F et al (2019) Protein–protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 36:1114–1120
Zhang J, Kurgan L (2019) Scriber: accurate and partner type-specific prediction of protein-binding residues from proteins sequences. Bioinformatics 35:i343–i353
Zhang B, Li J et al (2019) Sequence-based prediction of protein-protein interaction sites by simplified long short-term memory network. Neurocomputing 357:86–100
Acknowledgements
We thank the Indian Institute of Technology Madras and the High-Performance Computing Environment (HPCE) for computational facilities. This work was partially supported by the Department of Science and Technology, Government of India (No. DST/INT/SWD/P-05/2016).
Author information
Contributions
Conceptualization: MMG; methodology: MMG; software/code: RN; investigation: RN, KY; discussion: RN, KY, MMG; writing original draft: RN; review & editing: MMG, KY; supervision: MMG. All authors read and approved the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
Additional information
Handling editor: F. Eisenhaber.
About this article
Cite this article
Nikam, R., Yugandhar, K. & Gromiha, M.M. DeepBSRPred: deep learning-based binding site residue prediction for proteins. Amino Acids 55, 1305–1316 (2023). https://doi.org/10.1007/s00726-022-03228-3
DOI: https://doi.org/10.1007/s00726-022-03228-3