Abstract
Sequence-based approaches are fundamental in guiding experimental investigations toward structural and/or functional insights into uncharacterized protein families. Powerful profile-based sequence search methods rely on a sequence space continuum to identify non-trivial relationships through homology detection. The computational design of protein-like sequences that serve as “artificial linkers” is useful in identifying relationships between distant members of a structural fold. Such sequences act as intermediates and guide homology searches between distantly related proteins. Here, we describe an approach that represents natural intermediate sequences and designed protein-like sequences as hidden Markov model (HMM) profiles, to improve the sensitivity of existing search methods. Searches made within the “Profile database” were shown to recognize the parent structural fold for 90% of the search queries at a query coverage greater than 60%. For 1040 protein families with no available structure, fold associations were made through searches in the database of natural and designed sequence profiles. Most of the associations were made with the Alpha-alpha superhelix, Transmembrane beta-barrels, TIM barrel, and Immunoglobulin-like beta-sandwich folds. For 11 domain families of unknown function (DUFs), we provide confident fold associations using the profiles of designed sequences and a consensus from other fold recognition methods. For two DUFs, we performed detailed functional annotation through comparisons with characterized templates of families of known function.
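The query coverage criterion mentioned above can be illustrated with a minimal sketch. It assumes that coverage is defined as the aligned query region divided by the full query length, which the abstract does not spell out. The `Hit` record, its field names, and the example values are hypothetical and do not correspond to the output format of any particular search tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One query-to-profile match (hypothetical record for illustration)."""
    query_id: str
    query_length: int  # number of residues in the full query
    query_start: int   # 1-based, inclusive start of the aligned region
    query_end: int     # 1-based, inclusive end of the aligned region
    fold: str          # fold label attached to the matched profile


def query_coverage(hit: Hit) -> float:
    """Fraction of the query spanned by the aligned region (assumed definition)."""
    aligned = hit.query_end - hit.query_start + 1
    return aligned / hit.query_length


def hits_above_coverage(hits, min_coverage=0.60):
    """Keep only hits whose query coverage exceeds the threshold."""
    return [h for h in hits if query_coverage(h) > min_coverage]


# Made-up example hits, not results from the chapter.
hits = [
    Hit("q1", 200, 11, 170, "TIM barrel"),
    Hit("q2", 150, 1, 60, "Immunoglobulin-like beta-sandwich"),
]
kept = hits_above_coverage(hits)
```

Here the first hit spans 160 of 200 query residues and passes the 60% threshold, while the second spans 60 of 150 and is filtered out.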
Acknowledgments
This research is supported by the Mathematical Biology program and the FIST program, sponsored by the Department of Science and Technology, and also by the Department of Biotechnology, Government of India, in the form of the IISc-DBT partnership program. We also gratefully acknowledge support from the Bioinformatics and Computational Biology Centre, funded by DBT, and support from UGC, India - Centre for Advanced Studies and the Ministry of Human Resource Development, India. NS is a J. C. Bose National Fellow. SS was supported as a post-doctoral fellow by the DBT-IISc partnership program and is currently affiliated with M.S. Ramaiah University of Applied Sciences, Bangalore. This paper is dedicated to one of its authors, Prof. N. Srinivasan, who passed away on September 03, 2021.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Kumar, G., Srinivasan, N., Sandhya, S. (2022). Profiles of Natural and Designed Protein-Like Sequences Effectively Bridge Protein Sequence Gaps: Implications in Distant Homology Detection. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 2449. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2095-3_5
DOI: https://doi.org/10.1007/978-1-0716-2095-3_5
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2094-6
Online ISBN: 978-1-0716-2095-3
eBook Packages: Springer Protocols