1 Introduction

Advances in sequencing technology have driven costs so low that DNA, genome, and RNA sequencing have become the main research technologies in the life sciences. They are applied in various contexts, not necessarily because they are the most appropriate methods for the task but because they have become the most accurate and affordable methods and are increasingly widely available; so, people just do it [1–3]. The result is heaps of sequence data of which only a minor fraction is functionally understood and interpreted.

The issue is best illustrated by the number of genes that remain without function despite having been sequenced more than a decade ago. For example, among the almost 7000 genes of the yeast Saccharomyces cerevisiae, more than 1000 still awaited their functional characterization in 2007 [4] and little has changed since then. Note that the yeast genome has been available since 1997 and that yeast is one of the best studied organisms. In humans, just 1.5% of the genome is protein coding, with 20,000–25,000 genes, and about half of them lack a functional description at the molecular and/or cellular level. The remaining genome is also known to be functionally significant; yet, the molecular mechanisms involving the various non-coding transcripts are largely unknown. The classical route to functional characterization, which relies on experimental methods from the genetic and biochemical toolbox such as specific knock-outs, targeted mutations, and a battery of biochemical assays, is laborious, time-consuming, and expensive. Thus, concepts, approaches, and tools for sequence-based function prediction are very much needed to guide experimental biological and biomedical discovery-oriented research along promising hypotheses.

As proteins are known to be responsible for a large variety of biological functions and mechanisms, hints about their function are especially valuable. Notably, protein function is described within a hierarchical concept [5]. The protein's molecular function is the set of functional opportunities that a protein provides for interactions with other molecular players: its binding capacities and enzymatic activities, and the range of its conformational changes and posttranslational modifications. A subset of these molecular functions becomes actually relevant in the biological context at the cellular level, in biomolecular mechanisms such as metabolic pathways, signaling cascades, or supramolecular complexes together with other biomacromolecules (cellular function). Finally, a protein's phenotypic function is the result of its cooperation with various biomolecular mechanisms under certain environmental conditions.

As experimental characterization of an uncharacterized protein's function is time-consuming, costly, and risky, and as researchers are under pressure to deliver short-term publishable results, experimentalists tend to concentrate on very few widely studied genes that apparently show the greatest promise for the development of drugs, while ignoring a treasure trove of uncharacterized ones that might hold the key to completely new pathways. In-silico sequence analysis aimed at structure/function prediction can become extremely helpful in generating trusted functional hypotheses. In principle, it is fast (up to a few months of effort) and, with the exception of some compute-intensive homology search heuristics [6], it has become affordable even for small-scale research operations, either independently or, the easiest way, in collaboration with an internationally well-known sequence-analytic research group.

This is not to say that in-silico analysis generates a function discovery for any query sequence or assesses the effect of any mutation in a functionally characterized gene. Nothing could be further from the truth. Yet, if properly applied, the set of sequence-analytic methods provides options and insights that are orthogonal to those provided by other, especially experimental, methods and, with some luck, they can deliver the critical information on the path to success [7]. The field of function prediction from protein sequence is still evolving. Predictions that provide useful hints can be made only for a fraction of the uncharacterized sequence targets; yet, with a growing body of biological sequences and other life science knowledge, the number of such targets increases. For example, more sequences imply a denser sequence space and greater chances of success for homology-based function prediction, as the recent breakthrough for Gaa1/Gpaa1, a subunit of the transamidase complex with predicted metallo-peptide-synthetase activity, has demonstrated [8, 9]. As a matter of fact, function prediction from sequence has moved bioinformatics to center stage in the life sciences, where it exerts its influence in all research fields. Further examples are provided in references [10–13].

It should be noted that certain prediction algorithms, especially many of those for predicting functional features in non-globular segments, are plagued by high false-positive rates. Nevertheless, they are not necessarily useless. This is especially true if they are applied in conjunction with experimental screening methods that produce large lists of genes relevant for certain physiological situations. Gene expression studies at the RNA or protein level are typical examples. Function prediction tools can serve as filters that dramatically reduce such a list, thus helping to select gene targets for experimental follow-up studies.

Taken together, the number and the order of structural and functional segments in a protein sequence are called the sequence architecture (historically, it was just the order of globular domains in the sequence). The sequence architecture is computed by applying a variety of sequence-analytic tools to the query sequence. One of the practical problems is that, for each query sequence, it is desirable to apply all known good prediction tools (those with good prediction accuracy) in the hope that at least some of them generate useful information for the query. About 40 such tools are currently available and many of them need to be run with several parameter sets. Historically, bioinformatics researchers have provided their individual prediction algorithms as downloadable programs or web-based services. While these are generally useful for very specific questions, their input and output formats tend to be incompatible. Feeding all the programs and web services with suitable input and collecting the output is a considerable workload. Further, the total output for a single protein with ~1000 amino acids can run into gigabytes, and just reading it and extracting the useful annotation correctly can become difficult.

These problems multiply with the number of queries to study. Large sequencing projects require the annotation of thousands of proteins. One answer to this challenge is the implementation of script-based annotation pipelines that chain together several prediction tools and perform the necessary reformatting of inputs and outputs, with web-accessible visualization of the final results. While adequate for a particular project, these pipelines lack the flexibility to apply modified sets of algorithms when the task changes. An alternative is workflow tools that allow for the integration of a large number of individual prediction algorithms while presenting the results through a unified visual interface and keeping them persisted as well as traceable to the original raw output of the sequence-analytic programs. The ANNOTATOR [13, 14] and its derivatives ANNIE [15], a fast tool for generating sequence architectures, and HPMV [16], a tool for mapping and evaluating sequence mutations with regard to their effect on sequence architecture, are representatives of this advanced class of sequence analysis frameworks.

2 Concepts in Protein Sequence Analysis and Function Prediction

The most basic concept in protein sequence studies is centered on the idea of segment-based analysis. Proteins are known to consist of structural and functional modules [17], that is, of segments that have structural properties relatively independent of the rest of the protein and that carry their own molecular function. The final interpretation of protein function arises as a synthesis of the individual segments' functions.

Notably, there are two types of segments. Protein sequences typically consist of globular domains and non-globular segments [18–21]. The two types of regions require fundamentally different approaches for function prediction. Sequence segments for globular domains typically have a mixed, lexically complex protein sequence with a balanced composition of hydrophobic and hydrophilic residues, where the former tend to compose the tightly packed core and the latter form the surface of the globule [17, 21]. Functionally, globular domains with their unique 3D structure offer enzymatic and docking sites. Since the hydrophobic sequence pattern is characteristic of the fold, even a remnant sequence similarity without any sequence identity, consisting only of a coinciding polar/non-polar succession, is strongly indicative of fold similarity, common evolutionary origin, and similarity of function. Therefore, function annotation transfer justified by the sequence homology concept is possible within families of such protein segments that have statistically significant sequence similarity [22].
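The point that two segments can share the polar/non-polar succession without sharing a single identical residue can be illustrated with a small sketch. The hydrophobic residue set used here is an assumption chosen for illustration (classifications differ between hydrophobicity scales), the function names are hypothetical, and the comparison expects two already aligned segments of equal length; it is not the scoring scheme of any particular similarity search tool.

```python
# Assumption: a simple binary classification; real scales are more nuanced.
HYDROPHOBIC = set("AVLIMFWC")


def hp_pattern(segment: str) -> str:
    """Reduce a sequence to its hydrophobic (H) / polar (P) succession."""
    return "".join("H" if res in HYDROPHOBIC else "P" for res in segment.upper())


def sequence_identity(a: str, b: str) -> float:
    """Fraction of aligned positions with identical residues."""
    if len(a) != len(b):
        raise ValueError("segments must be aligned to equal length")
    return sum(x == y for x, y in zip(a.upper(), b.upper())) / len(a)


def pattern_agreement(a: str, b: str) -> float:
    """Fraction of aligned positions with the same H/P class."""
    return sequence_identity(hp_pattern(a), hp_pattern(b))


# Two toy segments without any identical residue but with the same H/P pattern.
seg_a, seg_b = "LKVEA", "IRFDM"
identity = sequence_identity(seg_a, seg_b)
agreement = pattern_agreement(seg_a, seg_b)
```

In this toy case the identity is zero while the H/P agreement is complete, which is the situation the paragraph above describes for remote globular homologs.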

In contrast, non-globular regions typically have a biased amino acid composition or a simple, repetitive pattern (e.g., [GXP]n in the case of collagen) due to physical constraints that result from conformational flexibility in an aqueous environment, membrane embedding, or fibrillar structure [22–24]. As a consequence, sequence similarity is not necessarily a sign of common evolutionary origin and common function. Non-globular regions carry important functions: they host sequence signals for intracellular translocation (targeting peptides) and posttranslational modifications [24], and they serve as linkers or fitting sites for interactions. Their functional study requires lexical analysis, and the application of certain types of pattern–function correlation schemes is recommended. Thus, non-globular features require many dozens of tools to locate them in the sequence, whereas globular domains are functionally annotated uniformly with a battery of sequence similarity search programs.
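As a rough illustration of what a lexical test for compositional bias can look like, the following sketch slides a window along the sequence and reports windows with low Shannon entropy. The window length, the entropy threshold, and the function names are assumptions made for this example; this is not the algorithm of any specific tool integrated in the ANNOTATOR.

```python
import math
from collections import Counter
from typing import List


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq: str, window: int = 12,
                           max_entropy: float = 2.2) -> List[int]:
    """Return 0-based start positions of windows at or below the threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= max_entropy:
            hits.append(start)
    return hits


# A collagen-like repeat followed by a compositionally mixed stretch.
toy = "GPPGPPGPPGPP" + "ACDEFGHIKLMN"
biased_starts = low_complexity_windows(toy, window=12, max_entropy=1.0)
```

A repeat such as GPPGPP... uses only two residue types and therefore scores well below a window of twelve distinct residues, whose entropy is log2(12).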

Correspondingly, the recipe for function prediction recommends first finding all types of non-globular segments with all available tools for that purpose (step one) and then subtracting these non-globular regions from the query sequence [21]. The remaining sequence is then considered to consist of globular domains. Since most sequence similarity programs have an upper limit on the number of similar protein sequences in the output, it might happen that sequences corresponding to domains that are very frequent in the sequence databases overwhelm the output, so that certain sections of the query are not covered by any hits of sequence-similarity search programs even though similar sequences exist in the database. Therefore, it is recommended to check for the occurrence of well-studied domains in the remainder of the query sequence (step two). A variety of protein domain libraries is available for this purpose.

After subtracting the sequence segments that represent known domains from the query, the final remainder is believed to consist of new domains not represented in the domain libraries. At this point, the actual sequence similarity search tools come into play to collect the family of statistically similar sequence segments (step three). The hope is that at least one of the sequences found has previously been functionally characterized so that it becomes possible to speculate about the function of this domain as, for example, in [25–29].
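The bookkeeping behind the three-step recipe is essentially interval subtraction on the query sequence. The sketch below is a minimal illustration under stated assumptions: segments are 1-based inclusive coordinate pairs, the function names and the minimum-length cutoff are hypothetical, and the predicted segments are supplied by hand rather than by real prediction tools.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # 1-based, inclusive (start, end)


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or directly adjacent segments."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract_segments(regions: List[Segment], segments: List[Segment],
                      min_len: int = 1) -> List[Segment]:
    """Remove `segments` from `regions`; keep leftovers of at least min_len."""
    annotated = merge_segments(segments)
    remainder: List[Segment] = []
    for cursor, region_end in regions:
        for start, end in annotated:
            if end < cursor or start > region_end:
                continue  # no overlap with the still-uncovered part
            if start > cursor:
                remainder.append((cursor, start - 1))
            cursor = max(cursor, end + 1)
        if cursor <= region_end:
            remainder.append((cursor, region_end))
    return [(s, e) for s, e in remainder if e - s + 1 >= min_len]


# Toy query of 500 residues with hand-supplied predictions (assumptions).
query = [(1, 500)]
non_globular = [(1, 25), (200, 230)]   # step one, e.g. signal peptide, linker
known_domains = [(40, 150)]            # step two, hits from a domain library

after_step_one = subtract_segments(query, non_globular)
# Assumed cutoff: leftovers shorter than 30 residues are not searched further.
step_three_targets = subtract_segments(after_step_one, known_domains, min_len=30)
```

The segments in `step_three_targets` are the candidates that would be passed to the similarity searches of step three; the 30-residue cutoff is only an illustrative choice for discarding fragments too short to form a domain.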

The existence of homologous sequences with experimentally determined three-dimensional structures opens the possibility of using them as templates for computationally modeling the 3D structure of the query sequence. Determining the evolutionary conservation of individual residues and then projecting these values onto the modeled 3D structure can give valuable hints as to interaction interfaces or catalytic sites. This approach provided crucial insights into mechanisms for the development of drug resistance, as the example of the H1N1 neuraminidase shows [30], and it has also been useful in other contexts [31]. 3D structure modeling within the homology concept is a complex task with many parameters of its own that is best executed outside of the ANNOTATOR, for example with the MODELLER tool [32–35].
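A per-residue conservation value of the kind mentioned above can be approximated in many ways. The sketch below uses a deliberately simple measure, the fraction of sequences that carry the most common residue in each alignment column, with gaps counting against conservation. The scoring choice and the function name are assumptions for illustration; projecting the values onto a 3D model is left to a structure viewer or modeling package.

```python
from collections import Counter
from typing import List


def column_conservation(alignment: List[str]) -> List[float]:
    """Per-column fraction of rows carrying the most common residue.

    Gap characters ('-' and '.') are ignored when picking the most common
    residue but still count in the denominator, so gapped columns score lower.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c not in "-."]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment of four homologous segments (invented for illustration).
toy_alignment = ["HDSG", "HESG", "HD-G", "HNSA"]
conservation = column_conservation(toy_alignment)
```

Columns with a score close to 1.0, such as the first column of the toy alignment, are the positions one would highlight on the modeled structure as candidates for catalytic or interface residues.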

3 ANNOTATOR: The Integration of Protein Sequence-Analytic Tools

The ANNOTATOR software environment is actively being developed at the Bioinformatics Institute, A*STAR (http://www.annotator.org). This software environment implements many of the features discussed above. Biological objects are represented in a unified data model, and long-term persistence in a relational database is supplied by an object-relational mapping layer. Data to be analyzed can be provided in different ways, ranging from web-based forms and FASTA-formatted flat files to remote import over a SOAP interface. This interface also gives other programs the opportunity to use the ANNOTATOR as a compute engine and to process the prediction results in their own way (e.g., ANNIE [15] and HPMV [16]).

At the moment, about 40 external sequence-analytic algorithms, either developed in-house or contributed by the academic community, are integrated using a plugin-style mechanism and can be applied to uploaded sets of sequences (see Table 1 for details). The display of applicable algorithms follows the three-step recipe described above. Integrated algorithms that execute complex tasks such as ortholog or sequence family searches constitute a further group of algorithms. Finally, the ANNOTATOR provides tools to manage sequence sets (alignments and sequence clustering).

Table 1 Algorithms and sequence-analytic tools integrated in the ANNOTATOR

  1. Searching for non-globular domains.
    (a) Tests for segments with amino acid compositional bias and disordered regions.
    (b) Tests for sequence complexity.
    (c) Prediction of posttranslational modifications.
    (d) Prediction of targeting signals.
    (e) Prediction of membrane-embedded regions.
    (f) Prediction of fibrillar structures and secondary structure.
  2. Searching for well-studied globular domains.
    (a) Searches in protein domain libraries.
    (b) Tests for small motifs.
    (c) Searches for repeated sequence segments.
  3. Searching for families of sequence segments corresponding to new domains.
  4. Integrated algorithms.
  5. Sequence sets: Clustering algorithms.
  6. Sequence sets: Multiple alignment algorithms.
  7. Sequence sets: Miscellaneous algorithms.

Integrated algorithms offer complex operations over individual sequences or over sequence sets. The ANNOTATOR provides an integrated algorithm (“Prim-Seq-An”) that automatically executes the first two steps of the protein sequence analysis recipe. It tests the query sequence for the occurrence of any non-globular feature as well as for hits by any globular domain or motif database. For this purpose, the complete query sequence is subjected to the full set of respective prediction tools. The results can be viewed in an aggregated interactive cartoon.

The matching of domain models with query sequence segments is, like many other sequence-analytic problems, a continuing area of research and, consequently, the set of external algorithms adopted by the ANNOTATOR is subject to continuous change. Domain model matching is mostly performed with HMMER-style [36, 37], other profile-based [38, 39], or profile–profile searches [40–42]. The P-value statistics applied are significant for hit selection, and there are issues with them that can be improved compared with the original implementation [43]. The sensitivity for remote similarities increases in searches where domain models are reduced to the fold-critical contributions; it is advisable to suppress profile sections corresponding to non-globular parts, as in the dissectHMMER concept [44, 45].

Within the third step of the segment-based analysis approach, the identification of distantly related homologs to query sequence segments that remain without match in the preceding two analysis steps is the key task. While tools like PSI-BLAST [46] exist that provide a standard form of iterative family collection, it is often necessary to implement a more sophisticated heuristic to detect weaker links throughout the sequence space. The implementation of such a heuristic might require, among other tasks, the combination of numerous external algorithms such as PSI-BLAST or other similarity search tools with masking of low complexity segments, coiled coils, simple transmembrane regions [23] and other types of non-globular regions, the manipulation of alignments as well as the persistence of intermediate results (e.g., spawning of new similarity searches with sequence hits from previous steps).
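The control flow of such an iterative family collection, in which hits from one search seed the next round of searches, can be sketched as a breadth-first expansion. This is only an illustration of the idea: `search` is a caller-supplied placeholder standing in for a masked similarity search (for example a PSI-BLAST run), and the function name, the round limit, and the toy data are assumptions, not the heuristic implemented in the ANNOTATOR.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def collect_family(seed: str,
                   search: Callable[[str], Iterable[str]],
                   max_rounds: int = 3) -> Dict[str, int]:
    """Breadth-first family collection.

    Returns a mapping from sequence identifier to the round in which it was
    first found (the seed is round 0). Each identifier is searched only once.
    """
    found: Dict[str, int] = {seed: 0}
    queue = deque([seed])
    while queue:
        query = queue.popleft()
        depth = found[query]
        if depth >= max_rounds:
            continue  # do not spawn searches beyond the round limit
        for hit in search(query):
            if hit not in found:
                found[hit] = depth + 1
                queue.append(hit)
    return found


# Toy stand-in for a similarity search: a fixed table of significant hits.
toy_hits = {"A": ["B", "C"], "B": ["A", "D"], "C": [], "D": ["E"], "E": []}
family = collect_family("A", lambda q: toy_hits.get(q, []), max_rounds=2)
```

In a real setting each call to `search` would be a cluster job whose raw output is persisted, and hits would be filtered by statistical significance and masked for non-globular regions before being queued; the round limit bounds how far the search can drift from the seed.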

Obviously, the mechanism of wrapping an external algorithm would not be sufficient in this case. While the logic of the heuristic could be implemented externally, it would still need access to internal data objects as well as the ability to submit jobs to a compute cluster. For this reason, an extension mechanism for the ANNOTATOR was devised that allows for the integration of algorithms that need access to internal mechanisms and data. A typical example of using this extension mechanism to implement a sophisticated search heuristic is the “Family-Searcher”, an integrated algorithm that is used to uncover homology relationships within large superfamilies of protein sequences. By applying this algorithm, the evolutionary relationship between classical mammalian lipases and the human adipose triglyceride lipase (ATGL) was established [6]. For such large sequence families, the amount of data produced when starting with one particular sequence as a seed can easily cross the terabyte barrier. At the same time, the iterative procedure will spawn the execution of tens of thousands of individual homology searches. To complete the task in a reasonable timeframe, it is clearly necessary to have access to a cluster of compute nodes for the heuristic and to have sophisticated software tools for the analysis of the vast output.

3.1 Visualization

The visualization of results is an important aspect of a sequence analysis system because it allows an expert to gain an immediate, condensed overview of possible functional assignments. The ANNOTATOR offers specific visualizers both at the level of the individual sequence and at the level of sequence sets.

The visualizer for an individual sequence projects all regions that have been found to be functionally relevant onto the original sequence. The regions are grouped into panes and are color-coded, which makes it easy to spot consensus among a number of predictors for the same kind of feature (e.g., transmembrane regions that are simple (blue), twilight (yellow-orange), and complex (red) are differently color-coded [23, 47]). Zooming capabilities as well as rulers facilitate the exact localization of relevant amino acids.

The ability to analyze potentially large sets of sequences marks a qualitative step up from the focus on individual proteins. Alternative views of sets of proteins make it possible to find features that are conspicuously frequent and thus point to some interesting property of the sequence set in question. The histogram view in the ANNOTATOR is an example of such a view. It displays a diagram in which individual features (e.g., domains) are ordered by their abundance within a set of sequences.
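The data behind such a histogram view amounts to counting, for each feature, the number of sequences in the set that carry it. The following sketch assumes annotations are available as a mapping from sequence identifier to a list of feature names; the function name, the data layout, and the domain names in the toy data are illustrative assumptions, not the ANNOTATOR's internal model.

```python
from collections import Counter
from typing import Dict, List, Tuple


def feature_histogram(annotations: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Count sequences per feature, ordered by abundance (ties by name).

    A feature occurring several times in one sequence is counted once for
    that sequence.
    """
    counts: Counter = Counter()
    for features in annotations.values():
        counts.update(set(features))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy sequence set with invented annotations.
toy_annotations = {
    "p1": ["SH2", "SH3", "SH3"],
    "p2": ["SH3"],
    "p3": ["Kinase", "SH2"],
}
histogram = feature_histogram(toy_annotations)
```

Counting each feature once per sequence is a design choice made here so that a single protein with many repeats of a domain does not dominate the ordering.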

Another example is the taxonomy view. It shows the taxonomic distribution of sequences within a particular sequence set. It is then possible to apply certain operators that will extract a portion of the set that corresponds to a branch of the taxonomic tree which can then be further analyzed. One has to keep in mind that a set of sequences is not only created when a user uploads one but also when a particular result returns more than one sequence. Alignments from homology searches are treated in a similar manner and the same operators can be applied to them.
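An operator that extracts the part of a sequence set belonging to one branch of the taxonomic tree can be sketched as a filter on lineages. The sketch assumes that each sequence comes with its lineage as a list of taxon names from root to species; this data layout and the function name are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def select_branch(lineages: Dict[str, List[str]], branch: str) -> List[str]:
    """Return identifiers of sequences whose lineage passes through `branch`."""
    return sorted(seq_id for seq_id, lineage in lineages.items()
                  if branch in lineage)


# Toy set, for instance the hits returned by a homology search.
toy_lineages = {
    "seq1": ["Eukaryota", "Metazoa", "Homo sapiens"],
    "seq2": ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"],
    "seq3": ["Bacteria", "Escherichia coli"],
}
eukaryotic_subset = select_branch(toy_lineages, "Eukaryota")
```

The resulting subset is itself a sequence set, so the same views and operators can be applied to it again.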

4 Conclusions

The large amount of sequence data generated with modern sequencing methods makes applications that can relate sequences to complex function patterns an absolute necessity. At the same time, many algorithms for predicting a particular function or uncovering distant evolutionary relationships (which, in the end, allow the transfer of functional annotations) have become more demanding on compute resources. The output as well as the intermediate results can no longer be assessed manually and require sophisticated integrated frameworks. The ANNOTATOR software provides critical support for many protein sequence-analytic tasks by supplying an infrastructure capable of supporting a large array of sequence-analytic methods, presenting the user with a condensed view of possible functional assignments and, at the same time, allowing the user to drill down to the raw data from the original prediction tool for validation purposes.