Abstract
Improvements in nucleotide sequencing technology have resulted in an ever increasing number of nucleotide and protein sequences being deposited in databases. Unfortunately, the ability to manually classify and annotate these sequences cannot keep pace with their rapid generation, resulting in an increased bias toward unannotated sequence. Automatic annotation tools can help redress the balance. There are a number of different groups working to produce protein signatures that describe protein families, functional domains or conserved sites within related groups of proteins. Protein signature databases include CATH-Gene3D, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY, and TIGRFAMs. Their approaches range from characterising small conserved motifs that can identify members of a family or subfamily, to the use of hidden Markov models that describe the conservation of residues over entire domains or whole proteins. To increase their value as protein classification tools, protein signatures from these 11 databases have been combined into one, powerful annotation tool: the InterPro database (http://www.ebi.ac.uk/interpro/) (Hunter et al., Nucleic Acids Res 37:D211–D215, 2009). InterPro is an open-source protein resource used for the automatic annotation of proteins, and is scalable to the analysis of entire new genomes through the use of a downloadable version of InterProScan, which can be incorporated into an existing local pipeline. InterPro provides structural information from PDB (Kouranov et al., Nucleic Acids Res 34:D302–D305, 2006), its classification in CATH (Cuff et al., Nucleic Acids Res 37:D310–D314, 2009) and SCOP (Andreeva et al., Nucleic Acids Res 36:D419–D425, 2008), as well as homology models from ModBase (Pieper et al., Nucleic Acids Res 37:D347–D354, 2009) and SwissModel (Kiefer et al., Nucleic Acids Res 37:D387–D392, 2009), allowing a direct comparison of the protein signatures with the available structural information. This chapter reviews the signature methods found in the InterPro database, and provides an overview of the InterPro resource itself.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJ, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res. 37, D211–D215.
Kouranov A, Xie L, de la Cruz J, Chen L, Westbrook J, Bourne PE, Berman HM. (2006) The RCSB PDB information portal for structural genomics. Nucleic Acids Res. 34, D302–D305.
Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA. (2009) The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 37, D310–D314.
Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C, Murzin AG. (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36, D419–D425.
Pieper U, Eswar N, Webb BM, Eramian D, Kelly L, Barkan DT, Carter H, Mankoo P, Karchin R, Marti-Renom MA, Davis FP, Sali A. (2009) MODBASE, a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. 37, D347–D354.
Kiefer F, Arnold K, Künzli M, Bordoli L, Schwede T. (2009) The SWISS-MODEL Repository and associated resources. Nucleic Acids Res. 37, D387–D392.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. (1990) Basic local alignment search tool. J Mol Biol. 215, 403–410.
Pearson WR. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63–98.
UniProt Consortium. (2009) The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res. 37, D169–D174.
Servant F, Bru C, Carrère S, Courcelle E, Gouzy J, Peyruc D, Kahn D. (2002) ProDom: automated clustering of homologous domains. Brief Bioinform. 3(3), 246–251.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402.
Sigrist CJA, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P. (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform. 3, 265–274.
Gribskov M, Lüthy R, Eisenberg D. (1990) Profile analysis. Methods Enzymol. 183, 146–159.
Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, Rivoire C, Bulliard V, de Castro E, Lachaize C, Baratin D, Phan I, Bougueleret L, Bairoch A. (2009) HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Res. 37, D471–D478.
Attwood TK. (2002) The PRINTS database: a resource for identification of protein families. Brief Bioinform. 3(3), 252–263.
Krogh A, Brown M, Mian IS, Sjölander K, Haussler D. (1994) Hidden Markov models in computational biology. Applications to protein modeling. J Mol Biol. 235(5), 1501–1531.
Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. (2008) The Pfam protein families database. Nucleic Acids Res. 36, D281–D288.
Heger A, Wilton CA, Sivakumar A, Holm L. (2005) ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Res. 33, D188–D191.
Letunic I, Doerks T, Bork P. (2009) SMART 6: recent updates and new developments. Nucleic Acids Res. 37, D229–D232.
Haft DH, Selengut JD, White O. (2003) The TIGRFAMs database of protein families. Nucleic Acids Res. 31(1), 371–373.
Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P, Ledley RS, Suzek BE, Arminski L, Chen Y, Zhang J, Cardenas JL, Chung S, Castro-Alvear J, Dinkov G, Barker WC. (2004) PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res. 32, D112–D114.
Mi H, Lazareva-Ulitsky B, Loo R, Kejariwal A, Vandergriff J, Rabkin S, Guo N, Muruganujan A, Doremieux O, Campbell MJ, Kitano H, Thomas PD. (2005) The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Res. 33, D284–D288.
Wilson D, Pethica R, Zhou Y, Talbot C, Vogel C, Madera M, Chothia C, Gough J. (2009) SUPERFAMILY – sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386.
Yeats C, Lees J, Reid A, Kellam P, Martin N, Liu X, Orengo C. (2008) Gene3D: comprehensive structural and functional annotation of genomes. Nucleic Acids Res. 36, D414–D418.
Quevillon E, Silventoinen V, Pillai S, Harte N, Mulder N, Apweiler R, Lopez R. (2005) InterProScan: protein domains identifier. Nucleic Acids Res. 33, W116–W120.
Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A. (2009) BioMart Central Portal – unified access to biological data. Nucleic Acids Res. 37, W23–W27.
Jones P, Côté RG, Cho SY, Klie S, Martens L, Quinn AF, Thorneycroft D, Hermjakob H. (2008) PRIDE: new developments and new datasets. Nucleic Acids Res. 36, D878–D883.
Joshi-Tope G, Gillespie M, Vastrik I, D’Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath GR, Wu GR, Matthews L, Lewis S, Birney E, Stein L. (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 33, D428–D432.
Reference Genome Group of the Gene Ontology Consortium. (2009) The Gene Ontology’s Reference Genome Project: a unified framework for functional annotation across species. PLoS Comput Biol. 5(7), e1000431.
Kerrien S, Alam-Faruque Y, Aranda B, Bancarz I, Bridge A, Derow C, Dimmer E, Feuermann M, Friedrichsen A, Huntley R, Kohler C, Khadake J, Leroy C, Liban A, Lieftink C, Montecchi-Palazzi L, Orchard S, Risse J, Robbe K, Roechert B, Thorneycroft D, Zhang Y, Apweiler R, Hermjakob H. (2007) IntAct – open source resource for molecular interaction data. Nucleic Acids Res. 35, D561–D565.
Fleischmann A, Darsow M, Degtyarenko K, Fleischmann W, Boyce S, Axelsen KB, Bairoch A, Schomburg D, Tipton KF, Apweiler R. (2004) IntEnz, the integrated relational enzyme database. Nucleic Acids Res. 32, D434–D437.
Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B. (2009) The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Res. 37, D233–D238.
Harmar AJ, Hills RA, Rosser EM, Jones M, Buneman OP, Dunbar DR, Greenhill SD, Hale VA, Sharman JL, Bonner TI, Catterall WA, Davenport AP, Delagrange P, Dollery CT, Foord SM, Gutman GA, Laudet V, Neubig RR, Ohlstein EH, Olsen RW, Peters J, Pin JP, Ruffolo RR, Searls DB, Wright MW, Spedding M. (2009) IUPHAR-DB: the IUPHAR database of G protein-coupled receptors and ion channels. Nucleic Acids Res. 37, D680–D685.
Degtyarenko K, Contrino S. (2004) COMe: the ontology of bioinorganic proteins. BMC Struct Biol. 4, 3.
Rawlings ND, Morton FR, Kok CY, Kong J, Barrett AJ. (2008) MEROPS: the peptidase database. Nucleic Acids Res. 36, D320–D325.
Whelan S, de Bakker PI, Quevillon E, Rodriguez N, Goldman N. (2006) PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees. Nucleic Acids Res. 34, D327–D331.
Golovin A, Henrick K. (2008) MSDmotif: exploring protein sites and motifs. BMC Bioinformatics. 9, 312.
Petryszak R, Kretschmann E, Wieser D, Apweiler R. (2005) The predictive power of the CluSTr database. Bioinformatics. 21(18), 3604–3609.
Haft DH, Selengut JD, Brinkac LM, Zafar N, White O. (2005) Genome Properties: a system for the investigation of prokaryotic genetic content for microbiology, genome annotation and comparative genomics. Bioinformatics. 21(3), 293–306.
Jimenez RC, Quinn AF, Garcia A, Labarga A, O’Neill K, Martinez F, Salazar GA, Hermjakob H. (2008) Dasty2, an Ajax protein DAS client. Bioinformatics. 21(14), 3198–3199.
Prlić A, Down TA, Hubbard TJ. (2005) Adding some SPICE to DAS. Bioinformatics. 21(Suppl 2), ii40–ii41.
Hartshorn MJ. (2002) AstexViewer: a visualisation aid for structure-based drug design. J Comput Aided Mol Des. 16(12), 871–881.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2011 Springer Science+Business Media, LLC
About this protocol
Cite this protocol
McDowall, J., Hunter, S. (2011). InterPro Protein Classification. In: Wu, C., Chen, C. (eds) Bioinformatics for Comparative Proteomics. Methods in Molecular Biology, vol 694. Humana Press. https://doi.org/10.1007/978-1-60761-977-2_3
Download citation
DOI: https://doi.org/10.1007/978-1-60761-977-2_3
Published:
Publisher Name: Humana Press
Print ISBN: 978-1-60761-976-5
Online ISBN: 978-1-60761-977-2
eBook Packages: Springer Protocols