Abstract
As biomolecular sequencing is becoming the main technique in life sciences, functional interpretation of sequences in terms of biomolecular mechanisms with in silico approaches is getting increasingly significant. Function prediction tools are most powerful for protein-coding sequences; yet, the concepts and technologies used for this purpose are not well reflected in bioinformatics textbooks. Notably, protein sequences typically consist of globular domains and non-globular segments. The two types of regions require cardinally different approaches for function prediction. Whereas the former are classic targets for homology-inspired function transfer based on remnant, yet statistically significant sequence similarity to other, characterized sequences, the latter type of regions are characterized by compositional bias or simple, repetitive patterns and require lexical analysis and/or empirical sequence pattern–function correlations. The recipe for function prediction recommends first to find all types of non-globular segments and, then, to subject the remaining query sequence to sequence similarity searches. We provide an updated description of the ANNOTATOR software environment as an advanced example of a software platform that facilitates protein sequence-based function prediction.
Key words
- Protein sequence analysis
- Protein function prediction
- Globular domain
- Non-globular segment
- Genome annotation
- ANNOTATOR
1 Introduction
Advances in sequencing technology have driven costs to such low levels that DNA, genome, and RNA sequencing have become the main research technologies in the life sciences. They are applied in various contexts, not necessarily because they are the most appropriate methods for the task but because they have become the most accurate and affordable ones and are increasingly widely available; so, people just do it [1–3]. The result is heaps of sequence data of which only a minor fraction is functionally understood and interpreted.
The issue is best illustrated by the number of genes that remain without function despite having been sequenced more than a decade ago. For example, among the almost 7000 genes of the yeast Saccharomyces cerevisiae, more than 1000 still awaited their functional characterization in 2007 [4] and little has changed since then. To note, the yeast genome has been available since 1997 and yeast is one of the best studied organisms. In humans, just 1.5% of the genome is protein coding with 20000–25000 genes and about half of them lack a function description at the molecular and/or cellular level. The remaining genome is also known to be functionally significant; yet, the molecular mechanisms involving the various non-coding transcripts are largely unknown. The classical route to functional characterization involving experimental methods from the genetic and biochemical toolbox like specific knock-outs, targeted mutations, and a battery of biochemical assays is laborious, time-consuming, and expensive. Thus, concepts, approaches, and tools for sequence-based function prediction are very much needed to guide experimental biological and biomedical discovery-oriented research along promising hypotheses.
As proteins are known to be responsible for a large variety of biological functions and mechanisms, hints about their function are especially valuable. Notably, protein function is described within a hierarchical concept [5]. The protein’s molecular function is the set of functional opportunities that a protein provides for interactions with other molecular players: its binding capacities and enzymatic activities, the range of conformational changes and posttranslational modifications. A subset of these molecular functions becomes actually relevant in the biological context at the cellular level, in biomolecular mechanisms such as metabolic pathways, signaling cascades, or supramolecular complexes together with other biomacromolecules (cellular function). Finally, a protein’s phenotypic function is the result of its cooperation with various biomolecular mechanisms under certain environmental conditions.
As experimental characterization of an uncharacterized protein’s function is time-consuming, costly, and risky and as researchers follow the pressure toward short-term publishable results, experimentalists tend to concentrate on very few widely studied gene examples which apparently show the greatest promise for the development of drugs, while ignoring a treasure trove of uncharacterized ones that might hold the key to completely new pathways. In silico sequence analysis aimed at structure/function prediction can become extremely helpful in generating trusted functional hypotheses. In principle, it is fast (up to a few months of effort) and, with the exception of some compute-intensive homology search heuristics [6], it has become affordable even for small-scale research operations, either independently or, the easiest way, in collaboration with an internationally well-known sequence-analytic research group.
This is not to say that in silico analysis generates a function discovery for any query sequence or assesses the effect of any mutation in a functionally characterized gene. Nothing is farther from the truth. Yet, if properly applied, the set of sequence-analytic methods provides options and insights that are orthogonal to those provided by other, especially experimental methods and, with some luck, they can deliver the critical information on the path to success [7]. The field of function prediction from protein sequence is still evolving. Predictions that provide useful hints can be made only for some fraction of the uncharacterized sequence targets; yet, with a growing body of biological sequences and other life science knowledge, the number of such targets increases. For example, more sequences imply a denser sequence space and greater chances of success for homology-based function prediction, as the recent breakthrough for Gaa1/Gpaa1, a subunit of the transamidase complex with predicted metallo-peptide-synthetase activity, has demonstrated [8, 9]. As a matter of fact, function prediction from sequence has moved bioinformatics to center stage in life science, where it exercises its influence in all research fields. Further examples are provided in these references [10–13].
It should be noted that certain prediction algorithms, especially many among those for predicting functional features in non-globular segments, are plagued by high false-positive rates. Nevertheless, they are not necessarily useless. This is especially true if they are applied in conjunction with experimental screening methods that produce large lists of genes relevant for certain physiological situations as output. Gene expression studies at the RNA or protein level are typical examples. Function prediction tools can serve as filters for dramatically reducing the list, thus helping to select gene targets for further experimental follow-up studies.
Taken together, the number and the order of structural and functional segments in a protein sequence are called the sequence architecture (historically, it was just the order of globular domains in the sequence). The sequence architecture is computed by applying a variety of sequence-analytic tools to the query sequence. One of the practical problems is that, for each query sequence, it is desirable to apply all known good prediction tools (those with good prediction accuracy) with the hope that at least some of them generate useful information for the query. About 40 such tools are currently available and many of them need to be run with several parameter sets. Historically, bioinformatics researchers provide their individual prediction algorithms as downloadable programs or web-based services. While generally useful for very specific questions, the input and output formats of these programs tend to be incompatible. It is a considerable workload to feed all the programs and web services with suitable input and to collect the output. Further, the total output for a single protein with ~1000 amino acids can run into gigabytes and just reading and extracting the useful annotation correctly can become difficult.
These problems multiply with the number of queries to study. Large sequencing projects require the annotation of thousands of proteins. The usual answer to this challenge is the implementation of script-based annotation pipelines that chain together several prediction tools and perform the necessary reformatting of inputs and outputs, with web-accessible visualization of final results. While adequate for a particular project, these pipelines lack the flexibility to apply modified sets of algorithms when the task changes. An alternative is provided by workflow tools that allow for the integration of a large number of individual prediction algorithms while presenting the results through a unified visual interface and keeping them persisted as well as traceable to the original raw output of sequence-analytic programs. The ANNOTATOR [13, 14] and its derivatives ANNIE [15], a fast tool for generating sequence architectures, and HPMV [16], a tool for mapping and evaluating sequence mutations with regard to their effect on sequence architecture, are representatives of this advanced class of sequence analysis frameworks.
2 Concepts in Protein Sequence Analysis and Function Prediction
The most basic concept in protein sequence studies is centered on the idea of segment-based analysis. Proteins are known to consist of structural and functional modules [17], that is, of segments that have structural properties relatively independent from the rest of the protein and that carry their own molecular function. The final interpretation of protein function arises as a synthesis of the individual segments’ functions.
Notably, there are two types of segments. Protein sequences typically consist of globular domains and non-globular segments [18–21]. The two types of regions require cardinally different approaches for function prediction. Sequence segments for globular domains typically have a mixed, lexically complex protein sequence with a balanced composition of hydrophobic and hydrophilic residues where the former tend to compose the tightly packed core and the latter form the surface of the globule [17, 21]. Functionally, globular domains with their unique 3D structure offer enzymatic and docking sites. Since the hydrophobic sequence pattern is characteristic for the fold, even a remnant sequence similarity without any sequence identity, just with coincidence of the polar/non-polar succession, is strongly indicative of fold similarity, common evolutionary origin, and similarity of function. Therefore, function annotation transfer justified by the sequence homology concept is possible within families of such protein segments that have statistically significant sequence similarity [22].
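The statement that two segments can share the polar/non-polar succession without sharing a single identical residue can be illustrated with a minimal Python sketch. The two-class residue assignment (`HYDROPHOBIC`), the function names, and the toy peptides are assumptions made for illustration only; the sketch compares pre-aligned segments of equal length and does not assess statistical significance, which real homology searches must do.

```python
# Assumption: one of several possible hydrophobic/polar classifications.
HYDROPHOBIC = set("AVLIMFWC")


def hp_pattern(seq):
    """Reduce a protein sequence to its hydrophobic (H) / polar (P) succession."""
    return "".join("H" if aa in HYDROPHOBIC else "P" for aa in seq.upper())


def _fraction_equal(a, b):
    """Fraction of positions at which two equal-length strings agree."""
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def identity(a, b):
    """Fraction of identical residues between two pre-aligned segments."""
    return _fraction_equal(a.upper(), b.upper())


def pattern_agreement(a, b):
    """Fraction of positions with the same hydrophobic/polar class."""
    return _fraction_equal(hp_pattern(a), hp_pattern(b))


if __name__ == "__main__":
    first, second = "LKVEAIKR", "IRLDVMEK"   # toy peptides, not real proteins
    print(identity(first, second), pattern_agreement(first, second))
```

The two toy peptides have no residue in common at any position, yet their hydrophobic pattern coincides completely.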
In contrast, non-globular regions typically have a biased amino acid composition or a simple, repetitive pattern (e.g., [GXP]n in the case of collagen) due to physical constraints as a result of conformational flexibility in an aqueous environment, membrane embedding, or fibrillar structure [22–24]. As a consequence, sequence similarity is not necessarily a sign of common evolutionary origin and common function. Non-globular regions carry important functions: they host sequence signals for intracellular translocation (targeting peptides) and posttranslational modifications [24] and serve as linkers or fitting sites for interactions. For their functional study, lexical analysis is required and the application of certain types of pattern–function correlation schemes is recommended. Thus, non-globular features require many dozens of tools to locate them in the sequence whereas globular domains are functionally annotated uniformly with a battery of sequence similarity search programs.
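As a simple illustration of lexical analysis, compositionally biased stretches can be flagged with a sliding-window Shannon entropy. The following Python sketch is a deliberately simplified stand-in and not the algorithm of SEG, CAST, or any other tool integrated in the ANNOTATOR; the window length, the entropy threshold, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (in bits) of the residue composition of one window."""
    counts = Counter(window)
    total = len(window)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_regions(seq, window=12, threshold=2.2):
    """Return merged (start, end) intervals (0-based, end-exclusive) covered by
    windows whose composition entropy is at or below the threshold.

    Window length and threshold are illustrative assumptions only.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= threshold:
            if regions and i <= regions[-1][1]:
                # Overlaps or touches the previous region: extend it.
                regions[-1] = (regions[-1][0], i + window)
            else:
                regions.append((i, i + window))
    return regions


if __name__ == "__main__":
    # Toy query: a collagen-like GPP repeat embedded between mixed flanks.
    query = "MKTAYIAKQRQISFVKSHFSRQ" + "GPP" * 6 + "LEERLGLIEVQAPILSRVGDGT"
    print(low_complexity_regions(query))
```

Because windows that only partly overlap the repeat can still fall below the threshold, the reported region may extend somewhat into the flanking sequence; production tools refine such boundaries.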
Correspondingly, the recipe for function prediction recommends first finding all types of non-globular segments with all available tools for that purpose (step one) and then subtracting these non-globular regions from the query sequence [21]. The remaining sequence is then considered to consist of globular domains. Since most sequence similarity programs have an upper limit on the number of similar protein sequences in the output, it might happen that sequences corresponding to domains very frequent in the sequence databases overwhelm the output and certain sections of the query are not covered by hits of sequence-similarity search programs at all, even if similar sequences exist in the database. Therefore, it is recommended to check for the occurrence of well-studied domains in the remainder of the query sequence (step two). A variety of protein domain libraries is available for this purpose.
After subtracting the sequence segments that represent known domains from the query, the final remainder is believed to consist of new domains not represented in the domain libraries. At this point, the actual sequence similarity search tools have to kick in to collect the family of statistically similar sequence segments (step three). The hope is that at least one of the sequences found was previously functionally characterized so that it becomes possible to speculate about the function of this domain as, for example, in [25–29].
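The subtraction logic of the recipe amounts to simple interval arithmetic. The Python sketch below, written under stated assumptions, computes the stretches of a query that remain after the segments found in steps one and two have been removed; the function names, the end-exclusive coordinate convention, and the minimum length cutoff `min_len` are hypothetical choices and not taken from the ANNOTATOR.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals (end-exclusive)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def remaining_segments(seq_len, annotated, min_len=30):
    """Return the stretches of a query not covered by annotated intervals.

    min_len is a hypothetical cutoff below which a leftover stretch is
    treated as too short to harbor a globular domain.
    """
    remainder, cursor = [], 0
    for start, end in merge_intervals(annotated):
        if start - cursor >= min_len:
            remainder.append((cursor, start))
        cursor = max(cursor, end)
    if seq_len - cursor >= min_len:
        remainder.append((cursor, seq_len))
    return remainder


if __name__ == "__main__":
    non_globular = [(0, 25), (300, 340)]   # step one: toy coordinates
    known_domains = [(40, 180)]            # step two: toy coordinates
    # Candidates for the similarity searches of step three.
    print(remaining_segments(500, non_globular + known_domains))
```

In this toy case, the short linker between positions 25 and 40 is discarded, and the two remaining stretches would be submitted to family collection.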
The existence of homologous sequences with experimentally determined three-dimensional structures opens the possibility to use them as templates for computationally modeling the 3D structure of the query sequence. Determining the evolutionary conservation of individual residues and then projecting these values onto the modeled 3D structure can give valuable hints as to interaction interfaces or catalytic sites. This approach was useful to provide crucial insights into mechanisms for the development of drug resistance, as the example of the H1N1 neuraminidase shows [30], but also in other contexts [31]. 3D structure modeling within the homology concept is a complex task with many parameters of its own that is best executed outside of the ANNOTATOR, for example with the MODELLER tool [32–35].
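One common way to quantify the evolutionary conservation of individual residues is a per-column entropy score computed from a multiple alignment of the family. The Python sketch below is one possible scoring scheme under stated assumptions (gaps are ignored, no sequence weighting, normalization by the 20-letter alphabet); it is not the specific measure used in the cited studies, and the function name is hypothetical.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column as 1 - H / log2(20), where H is the Shannon
    entropy of the non-gap residues; 1.0 means fully conserved.

    Assumptions: gaps ('-') are ignored, all-gap columns score 0.0, and
    sequences are not weighted for redundancy.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(1.0 - entropy / math.log2(20))
    return scores


if __name__ == "__main__":
    toy_alignment = ["ACD-", "ACE-", "AGF-"]   # toy data, not a real family
    print([round(s, 2) for s in column_conservation(toy_alignment)])
```

The resulting per-residue values can then be written into, for example, the temperature-factor column of a model file for display in a molecular viewer.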
3 ANNOTATOR: The Integration of Protein Sequence-Analytic Tools
The ANNOTATOR software environment is actively being developed at the Bioinformatics Institute, A*STAR (http://www.annotator.org). This software environment implements many of the features discussed above. Biological objects are represented in a unified data model and long-term persistence in a relational database is supplied by an object-relational mapping layer. Data to be analyzed can be provided in different formats ranging from web-based forms, FASTA formatted flat files to remote import over a SOAP interface. This interface provides also an opportunity for other programs to use the ANNOTATOR as a compute engine and process the prediction results in their own unique way (e.g., ANNIE [15] and HPMV [16]).
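For readers preparing input, the FASTA flat-file format mentioned above is simple enough to be read with a few lines of code. The following Python sketch is a generic parser and not part of the ANNOTATOR; the function name and the decision to upper-case residues are assumptions.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples.

    Lines starting with '>' open a new record; all following non-empty lines
    up to the next '>' are concatenated into the sequence.
    """
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">query1 toy protein\nMKTAYIAK\nQRQISFVK\n>query2\nmstnpkpq\n"
    for name, seq in parse_fasta(example):
        print(name, len(seq))
```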
At the moment, about 40 external sequence-analytic algorithms, developed in-house or by the academic community, are integrated using a plugin-style mechanism and can be applied to uploaded sets of sequences (see the large Table 1 for details). The display of applicable algorithms follows the three-step recipe described above. Integrated algorithms that execute complex tasks such as ortholog or sequence family searches constitute a further group of algorithms. Finally, the ANNOTATOR provides tools to manage sequence sets (alignments and sequence clustering).
1. Searching for non-globular domains.
   (a) Tests for segments with amino acid compositional bias and disordered regions.
   (b) Tests for sequence complexity.
   (c) Prediction of posttranslational modifications.
   (d) Prediction of targeting signals.
   (e) Prediction of membrane-embedded regions.
   (f) Prediction of fibrillar structures and secondary structure.
2. Searching for well-studied globular domains.
   (a) Searches in protein domain libraries.
   (b) Tests for small motifs.
   (c) Searches for repeated sequence segments.
3. Searching for families of sequence segments corresponding to new domains.
4. Integrated algorithms.
5. Sequence sets: Clustering algorithms.
6. Sequence sets: Multiple alignment algorithms.
7. Sequence sets: Miscellaneous algorithms.
Integrated algorithms offer complex operations either over individual sequences or over sequence sets. The ANNOTATOR provides an integrated algorithm (“Prim-Seq-An”) that automatically executes the first two steps of the protein sequence analysis recipe. It tests the query sequence for the occurrence of any non-globular feature as well as for hits by any globular domain or motif database. For this purpose, the complete query sequence is subjected to the full set of respective prediction tools. The results can be viewed in an aggregated interactive cartoon.
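The general idea of a plugin-style registry, in which every predictor reports its hits in one unified feature model and an aggregating routine runs all of them over the query, can be sketched in Python as follows. This is a minimal illustration under stated assumptions and does not reproduce the ANNOTATOR's actual data model or plugin interface; the `Feature` class, the `register` decorator, and the two toy predictors are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    tool: str    # name of the predictor that produced the hit
    kind: str    # feature type, e.g. "compositional_bias"
    start: int   # 0-based, inclusive
    end: int     # end-exclusive


Predictor = Callable[[str], List[Feature]]
REGISTRY: Dict[str, Predictor] = {}


def register(name: str):
    """Decorator that adds a predictor function to the registry."""
    def wrap(func: Predictor) -> Predictor:
        REGISTRY[name] = func
        return func
    return wrap


@register("toy_polyq")
def toy_polyq(seq: str) -> List[Feature]:
    """Toy predictor: report runs of five or more glutamines."""
    return [Feature("toy_polyq", "compositional_bias", m.start(), m.end())
            for m in re.finditer(r"Q{5,}", seq)]


@register("toy_hydrophobic")
def toy_hydrophobic(seq: str) -> List[Feature]:
    """Toy predictor: report runs of 15 or more hydrophobic residues."""
    return [Feature("toy_hydrophobic", "hydrophobic_stretch", m.start(), m.end())
            for m in re.finditer(r"[AILMFVW]{15,}", seq)]


def analyze(seq: str) -> List[Feature]:
    """Run every registered predictor and return all features sorted by position."""
    features: List[Feature] = []
    for predictor in REGISTRY.values():
        features.extend(predictor(seq))
    return sorted(features, key=lambda f: (f.start, f.end, f.tool))


if __name__ == "__main__":
    for feature in analyze("MKT" + "Q" * 8 + "DE" + "L" * 18 + "RK"):
        print(feature)
```

Keeping the producing tool in each feature record preserves the traceability of every annotation to its source, which is the property that makes a later drill-down to raw output possible.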
The matching of domain models with query sequence segments is, similar to many other sequence-analytic problems, a continuing area of research and, consequently, the ANNOTATOR is subject to continuous change in adopted external algorithms. Domain model matching is mostly performed with HMMER-style [36, 37], other profile-based [38, 39], or profile–profile searches [40–42]. There are issues with the P-value statistics applied that have significance for hit selection and that can be improved compared with the original implementation [43]. The sensitivity for remote similarities increases in searches where domain models are reduced to the fold-critical contributions; profile sections corresponding to non-globular parts are advised to be suppressed as in the dissectHMMER concept [44, 45].
Within the third step of the segment-based analysis approach, the identification of distantly related homologs to query sequence segments that remain without match in the preceding two analysis steps is the key task. While tools like PSI-BLAST [46] exist that provide a standard form of iterative family collection, it is often necessary to implement a more sophisticated heuristic to detect weaker links throughout the sequence space. The implementation of such a heuristic might require, among other tasks, the combination of numerous external algorithms such as PSI-BLAST or other similarity search tools with masking of low complexity segments, coiled coils, simple transmembrane regions [23] and other types of non-globular regions, the manipulation of alignments as well as the persistence of intermediate results (e.g., spawning of new similarity searches with sequence hits from previous steps).
Obviously, the mechanism of wrapping an external algorithm would not be sufficient in this case. While the logic of the heuristic could be implemented externally, it would still need access to internal data objects, as well as the ability to submit jobs to a compute cluster. For this reason, an extension mechanism for the ANNOTATOR was devised which allows for the integration of algorithms that need access to internal mechanisms and data. A typical example for using this extension mechanism to implement a sophisticated search heuristic is the “Family-Searcher”, an integrated algorithm that is used to uncover homology relationships within large superfamilies of protein sequences. Applying this algorithm, the evolutionary relationship between classical mammalian lipases and the human adipose triglyceride lipase (ATGL) was established [6]. For such large sequence families, the amount of data produced when starting with one particular sequence as a seed can easily cross the terabyte barrier. At the same time, the iterative procedure will spawn the execution of tens of thousands of individual homology searches. To complete the task in a reasonable timeframe, it is clearly necessary to have access to a cluster of compute nodes for the heuristic and to have sophisticated software tools for the analysis of the vast output.
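The core of an iterative collection procedure, in which hits from one search become the queries of the next round, can be sketched as a breadth-first traversal. The Python code below is a schematic illustration only and not the Family-Searcher implementation [6]; the `search` callback stands for any caller-supplied wrapper (for example, around a similarity search on a masked query), and `max_rounds` is a hypothetical stopping parameter.

```python
from collections import deque


def collect_family(seed, search, max_rounds=3):
    """Breadth-first collection of a sequence family.

    `search` maps a sequence identifier to an iterable of identifiers of its
    significant hits. Returns a dict mapping every collected identifier to the
    identifier of the query that first found it (None for the seed), so that
    each member stays traceable to the search that introduced it.
    """
    found_by = {seed: None}
    queue = deque([(seed, 0)])
    while queue:
        query, depth = queue.popleft()
        if depth >= max_rounds:
            continue  # do not spawn further searches beyond the round limit
        for hit in search(query):
            if hit not in found_by:
                found_by[hit] = query
                queue.append((hit, depth + 1))
    return found_by


if __name__ == "__main__":
    # Toy similarity graph standing in for real search results.
    links = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": ["E"], "E": []}
    print(collect_family("A", lambda q: links.get(q, []), max_rounds=2))
```

In a real setting, each call to `search` would be a job submitted to the compute cluster, and the significance criteria applied inside it determine whether unrelated sequences leak into the family.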
3.1 Visualization
The visualization of results is an important aspect of a sequence analysis system because it allows an expert to gain an immediate condensed overview of possible functional assignments. The ANNOTATOR offers specific visualizers both at the individual sequence as well as at the set level.
The visualizer for an individual sequence projects all regions that have been found to be functionally relevant onto the original sequence. The regions are grouped into panes and are color-coded, which makes it easy to spot consensus among a number of predictors for the same kind of feature (e.g., simple, twilight, and complex transmembrane regions are shown in blue, yellow-orange, and red, respectively [23, 47]). Zooming capabilities as well as rulers facilitate the exact localization of relevant amino acids.
The ability to analyze potentially large sets of sequences marks a qualitative step up from the focus on individual proteins. Alternative views of sets of proteins make it possible to find features that are conspicuously frequent, pointing to some interesting property of the sequence set in question. The histogram view in the ANNOTATOR is an example of such a view. It displays a diagram where individual features (e.g., domains) are ordered by their abundance within a set of sequences.
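The data behind such a histogram are simply feature counts over the sequence set. A minimal Python sketch follows; the input layout, the function name, and the choice to count a feature once per sequence are assumptions and do not describe the ANNOTATOR's internal implementation.

```python
from collections import Counter


def feature_histogram(annotations):
    """Count in how many sequences each feature occurs, most frequent first.

    `annotations` maps a sequence identifier to the feature names found in it;
    a feature is counted once per sequence even if it occurs repeatedly.
    Ties are ordered alphabetically by feature name.
    """
    counts = Counter()
    for features in annotations.values():
        counts.update(set(features))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    toy_set = {
        "p1": ["SH3", "SH3", "kinase"],
        "p2": ["kinase"],
        "p3": ["kinase", "TM"],
    }
    for name, n in feature_histogram(toy_set):
        print(f"{name:8s} {'#' * n} ({n})")
```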
Another example is the taxonomy view. It shows the taxonomic distribution of sequences within a particular sequence set. It is then possible to apply certain operators that will extract a portion of the set that corresponds to a branch of the taxonomic tree which can then be further analyzed. One has to keep in mind that a set of sequences is not only created when a user uploads one but also when a particular result returns more than one sequence. Alignments from homology searches are treated in a similar manner and the same operators can be applied to them.
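The two operations described here, summarizing the taxonomic distribution and extracting the subset that belongs to one branch, can be sketched in Python as follows. The representation of a lineage as a root-to-leaf list of taxon names and both function names are assumptions made for this illustration.

```python
from collections import Counter


def taxon_distribution(lineages, depth=0):
    """Count sequences per taxon at a given depth of their lineage.

    `lineages` maps a sequence identifier to its lineage, given as a list of
    taxon names ordered from root to leaf; shorter lineages are skipped.
    """
    counts = Counter(lineage[depth] for lineage in lineages.values()
                     if len(lineage) > depth)
    return dict(counts)


def select_branch(lineages, branch):
    """Return identifiers of the sequences whose lineage passes through `branch`."""
    return [seq_id for seq_id, lineage in lineages.items() if branch in lineage]


if __name__ == "__main__":
    toy_lineages = {
        "s1": ["Eukaryota", "Metazoa", "Homo sapiens"],
        "s2": ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
        "s3": ["Bacteria", "Proteobacteria", "Escherichia coli"],
    }
    print(taxon_distribution(toy_lineages))
    print(select_branch(toy_lineages, "Eukaryota"))
```

The extracted subset is again a set of sequences, so the same operators and views can be applied to it in turn.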
4 Conclusions
The large amount of sequence data generated with modern sequencing methods makes applications that can relate sequences to complex function patterns an absolute necessity. At the same time, many algorithms for predicting a particular function or uncovering distant evolutionary relationships (which, in the end, allow the transfer of functional annotations) have become more demanding on compute resources. The output as well as intermediate results can no longer be manually assessed and require sophisticated integrated frameworks. The ANNOTATOR software provides critical support for many protein sequence-analytic tasks by supplying an appropriate infrastructure capable of supporting a large array of sequence-analytic methods, presenting the user with a condensed view of possible functional assignments and, at the same time, allowing the user to drill down to raw data from the original prediction tool for validation purposes.
References
Eisenhaber F (2012) A decade after the first full human genome sequencing: when will we understand our own genome? J Bioinform Comput Biol 10:1271001
Kuznetsov V, Lee HK, Maurer-Stroh S, Molnar MJ, Pongor S, Eisenhaber B, Eisenhaber F (2013) How bioinformatics influences health informatics: usage of biomolecular sequences, expression profiles and automated microscopic image analyses for clinical needs and public health. Health Inf Sci Syst 1:2
Eisenhaber F, Sung WK, Wong L (2013) The 24th International Conference on Genome Informatics, GIW2013, in Singapore. J Bioinform Comput Biol 11:1302003
Pena-Castillo L, Hughes TR (2007) Why are there still over 1000 uncharacterized yeast genes? Genetics 176:7–14
Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y (1998) Predicting function: from genes to genomes and back. J Mol Biol 283:707–725
Schneider G, Neuberger G, Wildpaner M, Tian S, Berezovsky I, Eisenhaber F (2006) Application of a sensitive collection heuristic for very large protein families: evolutionary relationship between adipose triglyceride lipase (ATGL) and classic mammalian lipases. BMC Bioinformatics 7:164
Eisenhaber F (2006) Bioinformatics: mystery, astrology or service technology. In: Eisenhaber F (ed) Preface for “Discovering Biomolecular Mechanisms with Computational Biology”, 1st edn. Landes Biosciences and Eurekah.com, Georgetown, pp 1–10
Eisenhaber B, Eisenhaber S, Kwang TY, Gruber G, Eisenhaber F (2014) Transamidase subunit GAA1/GPAA1 is a M28 family metallo-peptide-synthetase that catalyzes the peptide bond formation between the substrate protein’s omega-site and the GPI lipid anchor’s phosphoethanolamine. Cell Cycle 13:1912–1917
Kinoshita T (2014) Enzymatic mechanism of GPI anchor attachment clarified. Cell Cycle 13:1838–1839
Novatchkova M, Bachmair A, Eisenhaber B, Eisenhaber F (2005) Proteins with two SUMO-like domains in chromatin-associated complexes: the RENi (Rad60-Esc2-NIP45) family. BMC Bioinformatics 6:22
Panizza S, Tanaka T, Hochwagen A, Eisenhaber F, Nasmyth K (2000) Pds5 cooperates with cohesin in maintaining sister chromatid cohesion. Curr Biol 10:1557–1564
Prokesch A, Bogner-Strauss JG, Hackl H, Rieder D, Neuhold C, Walenta E, Krogsdam A, Scheideler M, Papak C, Wong WC et al (2011) Arxes: retrotransposed genes required for adipogenesis. Nucleic Acids Res 39:3224–3239
Schneider G, Sherman W, Kuchibhatla D, Ooi HS, Sirota FL, Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2012) Protein sequence-structure-function-network links discovered with the ANNOTATOR software suite: application to Elys/Mel-28. In: Trajanoski Z (ed) Computational medicine. Springer, Vienna, pp 111–143
Schneider G, Wildpaner M, Sirota FL, Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2010) Integrated tools for biomolecular sequence-based function prediction as exemplified by the ANNOTATOR software environment. Methods Mol Biol 609:257–267
Ooi HS, Kwo CY, Wildpaner M, Sirota FL, Eisenhaber B, Maurer-Stroh S, Wong WC, Schleiffer A, Eisenhaber F, Schneider G (2009) ANNIE: integrated de novo protein sequence annotation. Nucleic Acids Res 37:W435–W440
Sherman W, Kuchibhatla D, Limviphuvadh V, Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2015) HPMV: Human protein mutation viewer—relating sequence mutations to protein sequence architecture and function changes. J Bioinform Comput Biol 13 (in press)
Eisenhaber F, Bork P (1998) Sequence and structure of proteins. In: Schomburg D (ed) Recombinant proteins, monoclonal antibodies and therapeutic genes. Wiley-VCH, Weinheim, pp 43–86
Eisenhaber B, Eisenhaber F, Maurer-Stroh S, Neuberger G (2004) Prediction of sequence signals for lipid post-translational modifications: insights from case studies. Proteomics 4:1614–1625
Eisenhaber B, Eisenhaber F (2005) Sequence complexity of proteins and its significance in annotation. In: Subramaniam S (ed) “Bioinformatics” in the encyclopedia of genetics, genomics, proteomics and bioinformatics. Wiley Interscience, New York. doi:10.1002/047001153X.g403313
Eisenhaber B, Eisenhaber F (2007) Posttranslational modifications and subcellular localization signals: indicators of sequence regions without inherent 3D structure? Curr Protein Pept Sci 8:197–203
Eisenhaber F (2006) Prediction of protein function: two basic concepts and one practical recipe (Chapter 3). In: Eisenhaber F (ed) Discovering biomolecular mechanisms with computational biology, 1st edn. Landes Biosciences and Eurekah.com, Georgetown, pp 39–54
Wong WC, Maurer-Stroh S, Eisenhaber F (2010) More than 1,001 problems with protein domain databases: transmembrane regions, signal peptides and the issue of sequence homology. PLoS Comput Biol 6:e1000867
Wong WC, Maurer-Stroh S, Eisenhaber F (2011) Not all transmembrane helices are born equal: towards the extension of the sequence homology concept to membrane proteins. Biol Direct 6:57
Sirota FL, Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2015) Single-residue posttranslational modification sites at the N-terminus, C-terminus or in-between: to be or not to be exposed for enzyme access. Proteomics 15:2525–2546
Eisenhaber F, Wechselberger C, Kreil G (2001) The Brix domain protein family -- a key to the ribosomal biogenesis pathway? Trends Biochem Sci 26:345–347
Maurer-Stroh S, Dickens NJ, Hughes-Davies L, Kouzarides T, Eisenhaber F, Ponting CP (2003) The Tudor domain ‘Royal Family’: Tudor, plant Agenet, Chromo PWWP and MBT domains. Trends Biochem Sci 28:69–74
Novatchkova M, Leibbrandt A, Werzowa J, Neubuser A, Eisenhaber F (2003) The STIR-domain superfamily in signal transduction, development and immunity. Trends Biochem Sci 28:226–229
Novatchkova M, Eisenhaber F (2004) Linking transcriptional mediators via the GACKIX domain super family. Curr Biol 14:R54–R55
Bogner-Strauss JG, Prokesch A, Sanchez-Cabo F, Rieder D, Hackl H, Duszka K, Krogsdam A, Di CB, Walenta E, Klatzer A et al (2010) Reconstruction of gene association network reveals a transmembrane protein required for adipogenesis and targeted by PPARgamma. Cell Mol Life Sci 67:4049–4064
Maurer-Stroh S, Ma J, Lee RT, Sirota FL, Eisenhaber F (2009) Mapping the sequence mutations of the 2009 H1N1 influenza A virus neuraminidase relative to drug and antibody binding sites. Biol Direct 4:18
Vodermaier HC, Gieffers C, Maurer-Stroh S, Eisenhaber F, Peters JM (2003) TPR subunits of the anaphase-promoting complex mediate binding to the activator protein CDH1. Curr Biol 13:1459–1468
Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen MY, Pieper U, Sali A (2006) Comparative protein structure modeling using Modeller. Curr Protoc Bioinformatics Chapter 5, Unit 5.6
Eswar N, Webb B, Marti-Renom MA, Madhusudhan MS, Eramian D, Shen MY, Pieper U, Sali A (2007) Comparative protein structure modeling using MODELLER. Curr Protoc Protein Sci Chapter 2, Unit 2.9
Fiser A, Do RK, Sali A (2000) Modeling of loops in protein structures. Protein Sci 9:1753–1773
Sali A, Blundell TL (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234:779–815
Eddy SR (1998) Profile hidden Markov models. Bioinformatics 14:755–763
Eddy SR (2011) Accelerated profile HMM searches. PLoS Comput Biol 7:e1002195
Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, Weese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR et al (2011) CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 39:D225–D229
Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15:1000–1011
Remmert M, Biegert A, Hauser A, Soding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175
Soding J, Biegert A, Lupas AN (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Res 33:W244–W248
Soding J (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics 21:951–960
Wong WC, Maurer-Stroh S, Eisenhaber F (2011) The Janus-faced E-values of HMMER2: extreme value distribution or logistic function? J Bioinform Comput Biol 9:179–206
Wong WC, Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2014) On the necessity of dissecting sequence similarity scores into segment-specific contributions for inferring protein homology, function prediction and annotation. BMC Bioinformatics 15:166
Wong WC, Yap CK, Eisenhaber B, Eisenhaber F (2015) dissectHMMER: a HMMER-based score dissection framework that statistically evaluates fold-critical sequence segments for domain fold similarity. Biol Direct 10:39
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Wong WC, Maurer-Stroh S, Schneider G, Eisenhaber F (2012) Transmembrane helix: simple or complex. Nucleic Acids Res 40:W370–W375
Kreil DP, Ouzounis CA (2003) Comparison of sequence masking algorithms and the detection of biased protein sequence regions. Bioinformatics 19:1672–1681
Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics 16:915–922
Iakoucheva LM, Dunker AK (2003) Order, disorder, and flexibility: prediction from protein sequence. Structure 11:1316–1317
Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB (2003) Protein disorder prediction: implications for structural proteomics. Structure 11:1453–1459
Linding R, Russell RB, Neduva V, Gibson TJ (2003) GlobPlot: exploring protein sequences for globularity and disorder. Nucleic Acids Res 31:3701–3708
Dosztanyi Z, Csizmok V, Tompa P, Simon I (2005) IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433–3434
Dosztanyi Z, Csizmok V, Tompa P, Simon I (2005) The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J Mol Biol 347:827–839
Brendel V, Bucher P, Nourbakhsh IR, Blaisdell BE, Karlin S (1992) Methods and algorithms for statistical analysis of protein sequences. Proc Natl Acad Sci U S A 89:2002–2006
Claverie JM (1994) Large scale sequence analysis. In: Adams MD, Fields C, Venter JC (eds) Automated DNA sequencing and analysis. Academic Press, San Diego, pp 267–279
Claverie JM, States DJ (1993) Information enhancement methods for large scale sequence analysis. Comput Chem 17:191–201
Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT (2004) Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J Mol Biol 337:635–645
Wootton JC, Federhen S (1993) Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem 17:149–163
Wootton JC (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem 18:269–285
Wootton JC (1994) Sequences with “unusual” amino acid compositions. Curr Opin Struct Biol 4:413–421
Wootton JC, Federhen S (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol 266:554–571
Eisenhaber B, Bork P, Eisenhaber F (1999) Prediction of potential GPI-modification sites in proprotein sequences. J Mol Biol 292:741–758
Eisenhaber B, Wildpaner M, Schultz CJ, Borner GH, Dupree P, Eisenhaber F (2003) Glycosylphosphatidylinositol lipid anchoring of plant proteins. Sensitive prediction from sequence- and genome-wide studies for Arabidopsis and rice. Plant Physiol 133:1691–1701
Eisenhaber B, Maurer-Stroh S, Novatchkova M, Schneider G, Eisenhaber F (2003) Enzymes and auxiliary factors for GPI lipid anchor biosynthesis and post-translational transfer to proteins. Bioessays 25:367–385
Eisenhaber B, Schneider G, Wildpaner M, Eisenhaber F (2004) A sensitive predictor for potential GPI lipid modification sites in fungal protein sequences and its application to genome-wide studies for Aspergillus nidulans, Candida albicans, Neurospora crassa, Saccharomyces cerevisiae and Schizosaccharomyces pombe. J Mol Biol 337:243–253
Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2002) N-terminal N-myristoylation of proteins: prediction of substrate proteins from amino acid sequence. J Mol Biol 317:541–557
Maurer-Stroh S, Eisenhaber B, Eisenhaber F (2002) N-terminal N-myristoylation of proteins: refinement of the sequence motif and its taxon-specific differences. J Mol Biol 317:523–540
Maurer-Stroh S, Gouda M, Novatchkova M, Schleiffer A, Schneider G, Sirota FL, Wildpaner M, Hayashi N, Eisenhaber F (2004) MYRbase: analysis of genome-wide glycine myristoylation enlarges the functional spectrum of eukaryotic myristoylated proteins. Genome Biol 5:R21
Maurer-Stroh S, Eisenhaber F (2004) Myristoylation of viral and bacterial proteins. Trends Microbiol 12:178–185
Maurer-Stroh S, Washietl S, Eisenhaber F (2003) Protein prenyltransferases. Genome Biol 4:212
Maurer-Stroh S, Eisenhaber F (2005) Refinement and prediction of protein prenylation motifs. Genome Biol 6:R55
Maurer-Stroh S, Koranda M, Benetka W, Schneider G, Sirota FL, Eisenhaber F (2007) Towards complete sets of farnesylated and geranylgeranylated proteins. PLoS Comput Biol 3:e66
Neuberger G, Maurer-Stroh S, Eisenhaber B, Hartig A, Eisenhaber F (2003) Prediction of peroxisomal targeting signal 1 containing proteins from amino acid sequence. J Mol Biol 328:581–592
Neuberger G, Maurer-Stroh S, Eisenhaber B, Hartig A, Eisenhaber F (2003) Motif refinement of the peroxisomal targeting signal 1 and evaluation of taxon-specific differences. J Mol Biol 328:567–579
von Heijne G (1986) A new method for predicting signal sequence cleavage sites. Nucleic Acids Res 14:4683–4690
von Heijne G (1987) Sequence analysis in molecular biology: treasure trove or trivial pursuit. Academic Press, San Diego
Bendtsen JD, Nielsen H, von Heijne G, Brunak S (2004) Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340:783–795
Nielsen H, Engelbrecht J, Brunak S, von Heijne G (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10:1–6
Nielsen H, Krogh A (1998) Prediction of signal peptides and signal anchors by a hidden Markov model. Proc Int Conf Intell Syst Mol Biol 6:122–130
Cserzo M, Eisenhaber F, Eisenhaber B, Simon I (2002) On filtering false positive transmembrane protein predictions. Protein Eng 15:745–752
Cserzo M, Eisenhaber F, Eisenhaber B, Simon I (2004) TM or not TM: transmembrane protein prediction with low false positive rate using DAS-TMfilter. Bioinformatics 20:136–137
Tusnady GE, Simon I (1998) Principles governing amino acid composition of integral membrane proteins: application to topology prediction. J Mol Biol 283:489–506
Kall L, Krogh A, Sonnhammer EL (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338:1027–1036
Kall L, Krogh A, Sonnhammer EL (2007) Advantages of combined transmembrane topology and signal peptide prediction--the Phobius web server. Nucleic Acids Res 35:W429–W432
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
Sonnhammer EL, von Heijne G, Krogh A (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol 6:175–182
Claros MG, von Heijne G (1994) TopPred II: an improved software for membrane protein structure predictions. Comput Appl Biosci 10:685–686
von Heijne G (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J Mol Biol 225:487–494
Lupas A, Van Dyke M, Stock J (1991) Predicting coiled coils from protein sequences. Science 252:1162–1164
Lupas A (1996) Prediction and analysis of coiled-coil structures. Methods Enzymol 266:513–525
Frishman D, Argos P (1996) Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng 9:133–142
Frishman D, Argos P (1997) Seventy-five percent accuracy in protein secondary structure prediction. Proteins 27:329–335
Eisenhaber F, Imperiale F, Argos P, Frommel C (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. I. New analytic vector decomposition methods. Proteins 25:157–168
Eisenhaber F, Frommel C, Argos P (1996) Prediction of secondary structural content of proteins from their amino acid composition alone. II. The paradox with secondary structural class. Proteins 25:169–179
Maurer-Stroh S, Gao H, Han H, Baeten L, Schymkowitz J, Rousseau F, Zhang L, Eisenhaber F (2013) Motif discovery with data mining in 3D protein structure databases: discovery, validation and prediction of the U-shape zinc binding (“Huf-Zinc”) motif. J Bioinform Comput Biol 11:1340008
Andrade MA, Ponting CP, Gibson TJ, Bork P (2000) Homology-based method for identification of protein repeats using statistical significance estimates. J Mol Biol 298:521–537
Andrade MA, Petosa C, O’Donoghue SI, Muller CW, Bork P (2001) Comparison of ARM and HEAT protein repeats. J Mol Biol 309:1–18
Medema MH, Blin K, Cimermancic P, de Jager V, Zakrzewski P, Fischbach MA, Weber T, Takano E, Breitling R (2011) antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res 39:W339–W346
Blin K, Medema MH, Kazempour D, Fischbach MA, Breitling R, Takano E, Weber T (2013) antiSMASH 2.0--a versatile platform for genome mining of secondary metabolite producers. Nucleic Acids Res 41:W204–W212
Weber T, Blin K, Duddela S, Krug D, Kim HU, Bruccoleri R, Lee SY, Fischbach MA, Muller R, Wohlleben W et al (2015) antiSMASH 3.0-a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res 43:W237–W243
Yin Y, Mao X, Yang J, Chen X, Mao F, Xu Y (2012) dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res 40:W445–W451
Desai DK, Nandi S, Srivastava PK, Lynn AM (2011) ModEnzA: accurate identification of metabolic enzymes using function specific profile HMMs with optimised discrimination threshold and modified emission probabilities. Adv Bioinformatics 2011:743782
Wolf YI, Brenner SE, Bash PA, Koonin EV (1999) Distribution of protein folds in the three superkingdoms of life. Genome Res 9:17–26
Sigrist CJ, Cerutti L, Hulo N, Gattiker A, Falquet L, Pagni M, Bairoch A, Bucher P (2002) PROSITE: a documented database using patterns and profiles as motif descriptors. Brief Bioinform 3:265–274
Sigrist CJ, de Castro E, Cerutti L, Cuche BA, Hulo N, Bridge A, Bougueleret L, Xenarios I (2013) New and continuing developments at PROSITE. Nucleic Acids Res 41:D344–D347
Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DM, Ausiello G, Brannetti B, Costantini A et al (2003) ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res 31:3625–3630
Berezovsky IN, Grosberg AY, Trifonov EN (2000) Closed loops of nearly standard size: common basic element of protein structure. FEBS Lett 466:283–286
Goncearenco A, Berezovsky IN (2010) Prototypes of elementary functional loops unravel evolutionary connections between protein functions. Bioinformatics 26:i497–i503
Goncearenco A, Berezovsky IN (2015) Protein function from its emergence to diversity in contemporary proteins. Phys Biol 12:045002
Mott R (2000) Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol 300:649–659
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
Dayhoff M (1979) Atlas of protein sequence and structure. National Biomedical Research Foundation, Washington, DC
Altenhoff AM, Schneider A, Gonnet GH, Dessimoz C (2011) OMA 2011: orthology inference among 1000 complete genomes. Nucleic Acids Res 39:D289–D294
Roth AC, Gonnet GH, Dessimoz C (2008) Algorithm of OMA for large-scale orthology inference. BMC Bioinformatics 9:518
Biegert A, Soding J (2009) Sequence context-specific profiles for homology searching. Proc Natl Acad Sci U S A 106:3770–3775
Pearson WR (1998) Empirical statistical estimates for sequence similarity searches. J Mol Biol 276:71–84
Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132:185–219
Sirota FL, Ooi HS, Gattermayer T, Schneider G, Eisenhaber F, Maurer-Stroh S (2010) Parameterization of disorder predictors for large-scale applications requiring high specificity by using an extended benchmark dataset. BMC Genomics 11(Suppl 1):S15
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584
van Dongen S (2008) Graph clustering via a discrete uncoupling process. SIAM J Matrix Anal Appl 30:121–141
Li W, Jaroszewski L, Godzik A (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17:282–283
Li W, Jaroszewski L, Godzik A (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics 18:77–82
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659
Notredame C, Higgins DG, Heringa J (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
Do CB, Mahabhashyam MS, Brudno M, Batzoglou S (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15:330–340
Katoh K, Misawa K, Kuma K, Miyata T (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
Katoh K, Kuma K, Toh H, Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518
Katoh K, Toh H (2007) PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23:372–374
Katoh K, Toh H (2008) Recent developments in the MAFFT multiple sequence alignment program. Brief Bioinform 9:286–298
© 2016 Springer Science+Business Media New York
About this protocol
Cite this protocol
Eisenhaber, B. et al. (2016). The Recipe for Protein Sequence-Based Function Prediction and Its Implementation in the ANNOTATOR Software Environment. In: Carugo, O., Eisenhaber, F. (eds) Data Mining Techniques for the Life Sciences. Methods in Molecular Biology, vol 1415. Humana Press, New York, NY. https://doi.org/10.1007/978-1-4939-3572-7_25
DOI: https://doi.org/10.1007/978-1-4939-3572-7_25
Publisher Name: Humana Press, New York, NY
Print ISBN: 978-1-4939-3570-3
Online ISBN: 978-1-4939-3572-7
eBook Packages: Springer Protocols