Abstract
The analysis of uncharacterized biomolecular sequences obtained from genetic screens, expression profile studies, and similar experiments is a standard task in a life science research environment. Understanding protein function is typically the main difficulty. This chapter gives practical advice to students and researchers who have only introductory knowledge of protein sequence analysis.
Applicable theoretical approaches range from (1) textual analyses and interpretation in terms of patterns of physical properties of amino acid side chains and (2) the extrapolation of empirically established relationships between local sequence motifs and known structural and functional properties to the collection of sequence segment families using sequence distance metrics and the derivation of protein function by annotation transfer (the concept of homologous families). Here, the impact of different techniques on the biological interpretation of targets is discussed from the practitioner's point of view and illustrated with examples from recent research reports. Although sequence similarity searching techniques are the most powerful instruments for the analysis of high-complexity regions, other techniques can supply important additional evaluations, including an assessment of whether the sequence homology concept is applicable to the given target segment.
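To make the distinction between high-complexity and low-complexity regions concrete, the following is a minimal sketch that scores each sliding window of a protein sequence by the Shannon entropy of its residue composition and merges low-scoring windows into segments. It is not the method used in the chapter. The function names, the window length of 12, and the threshold of 2.2 bits are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def window_entropy(seq, window=12):
    """Shannon entropy (in bits) of the residue composition of each sliding window."""
    seq = seq.upper()
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = []
    for start in range(len(seq) - window + 1):
        counts = Counter(seq[start:start + window])
        # H = sum over residue types of p * log2(1/p), with p = count / window
        h = sum((n / window) * math.log2(window / n) for n in counts.values())
        scores.append(h)
    return scores


def low_complexity_segments(seq, window=12, threshold=2.2):
    """Merge windows with entropy <= threshold into half-open (start, end) ranges.

    The threshold is an illustrative assumption, not a calibrated value.
    """
    segments = []
    for start, h in enumerate(window_entropy(seq, window)):
        if h <= threshold:
            end = start + window
            if segments and start <= segments[-1][1]:
                # Window overlaps or touches the previous segment: extend it
                segments[-1] = (segments[-1][0], max(segments[-1][1], end))
            else:
                segments.append((start, end))
    return segments


if __name__ == "__main__":
    # Hypothetical toy sequence: a varied stretch followed by a glutamine run
    toy = "ACDEFGHIKLMNPQRSTVWY" + "Q" * 20
    print(low_complexity_segments(toy))
```

Segments flagged in this way are candidates for separate treatment: a similarity search restricted to the remaining high-complexity parts of the sequence is less likely to report hits that reflect compositional bias rather than homology.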
Copyright information
© 2006 Landes Bioscience and Springer Science+Business Media
About this chapter
Cite this chapter
Eisenhaber, F. (2006). Prediction of Protein Function. In: Discovering Biomolecular Mechanisms with Computational Biology. Molecular Biology Intelligence Unit. Springer, Boston, MA. https://doi.org/10.1007/0-387-36747-0_4
DOI: https://doi.org/10.1007/0-387-36747-0_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-34527-7
Online ISBN: 978-0-387-36747-7
eBook Packages: Biomedical and Life Sciences (R0)