Synonyms

Statistical or intrinsic methods of gene prediction

Definition

Computational inference of how a metagenomic sequence is divided into protein-coding and noncoding regions based on presence or absence of characteristic oligonucleotide frequency patterns.

Introduction

As of April 2013 sequences of 370 metagenomes were available in databases. On the other hand, Genomes Online Database (www.genomesonline.org) lists 186 complete archaeal and 3,956 complete bacterial genomes; also there are about 15,000 incomplete (draft) prokaryotic genomes. With the average size of a metagenome being 100 times larger than an average prokaryotic genome, the current volume of metagenomic sequences is twice as large as the total sequence in “genomic” data. Therefore, current metagenomes carry a larger wealth of genes than all the prokaryotic genomes, and this gap is growing.

Notably, gene prediction and annotation of gene and protein function is more challenging in metagenomes than in draft or complete genomes. To give a historic perspective, one can compare gene annotation of a metagenome with annotation of the first completely sequenced archaeal genome, Methanococcus jannaschii (Bult et al. 1996). All the M. jannaschii genes were predicted by the ab initio statistical method (Borodovsky and McIninch 1993) while function of 2/3 of them was a mystery since the translated protein sequences did not show sequence similarity to proteins in databases.

The history repeats itself in metagenomes, since majority of protein-coding regions in a new metagenome may code for proteins that do not show similarity to already known proteins. “Evidence-based” or “similarity-based” methods of gene finding (Kunin et al. 2008) provide gene prediction along with valuable information about function of encoded proteins. Similarity-based gene finders possess high specificity, close to 100 % (Altschul et al. 1997; Badger and Olsen 1999; Frishman et al. 1998; Gish and States 1993). Still, the drawback of similarity-based methods is low sensitivity; they cannot predict novel genes.

The similarity-based methods are less useful for gene prediction in metagenomes that carry many novel genes, while the ab initio gene prediction methods, not depending on presence of homologs in protein databases, are both effective and efficient for annotating genes in metagenomic sequences (Kunin et al. 2008).

Ab Initio Gene Finding

Ab initio gene prediction tools have high sensitivity (above 90 % for the best tools) and high specificity (above 90 % as well). Ab initio gene finders use statistical pattern recognition methods (Wooley et al. 2010). Statistical models such as Markov models, hidden Markov models (HMM), and hidden semi-Markov models (HSMM, also called hidden Markov model with duration) proved to be very useful to model statistical patterns of nucleotide ordering in protein-coding and noncoding regions. Accurate ab initio gene finding in isolated genomes requires ample sequence data for estimation of algorithm parameters (model training).

Contrary to isolated (complete and draft) genomes metagenomic sequences are derived from numerous genomes of heterogeneous microbial communities (microbiomes). A typical metagenomic sequence is short; its genomic context and the phylogenetic origin are rarely known. Gene identification is also affected by sequencing and assembly errors; for example, errors that lead to frameshifts (change of coding frame).

The major challenge for ab initio gene prediction in metagenomic sequences is that the metagenomic sequences are often too short for reliable estimation of parameters of species-specific models of coding and noncoding regions. Special training techniques have to be developed to address the challenging task of parameter estimation (see below). Similarly to gene prediction in isolated genomes, newly predicted genes are immediately translated into proteins and the similarity search is used in an attempt of function annotation.

Gene Finders Currently Available for Metagenomes

Current metagenomic gene-finding tools include FragGeneScan (Rho et al. 2010), Glimmer-MG (Kelley et al. 2012), MetaGene Annotator (Noguchi et al. 2008), MetaGeneMark (Zhu et al. 2010), and Orphelia (Hoff et al. 2009, 2008). Glimmer-MG and MetaGeneMark are extensions of gene finders for complete or draft genomes Glimmer3 (Delcher et al. 2007) and GeneMarkS (Besemer et al. 2001), respectively.

The MetaGeneMark algorithm uses HSMM architecture, originally developed in GeneMarkS (Besemer et al. 2001). The HSMM parameter derivation approach used in MetaGeneMark is to arrive to a large set of parameters (thousands of parameters related to oligonucleotide frequencies) from a small set (nucleotide frequencies determined in a short fragment) using the dependencies between oligonucleotide and nucleotide frequencies that have been formed in evolution. The original idea of this approach (Besemer and Borodovsky 1999) has been developed for small viral genomes before the start of “metagenomic era” (see below for more details).

Glimmer-MG is based on interpolated Markov models or IMMs (Salzberg et al. 1998). Glimmer-MG scores metagenomic sequences and assigns them into clusters; then, the algorithm iteratively estimates the IMM parameters and reassigns sequences to clusters.

FragGeneScan (Rho et al. 2010), an HMM-based gene finder, has an additional ability to predict frameshifts caused by sequencing errors. Transition probabilities between coding frames are determined with respect to the error models of sequencing technologies used to derive the input sequence.

MetaGene Annotator (Noguchi et al. 2008) works in two steps: in the first step the program scores open reading frames (ORFs) with respect to base composition and lengths; in the second step, it connects high-scoring ORFs using dynamic programming.

Machine learning classification algorithms such as support vector machines and neural networks are also used for ab initio gene finding. In order to classify coding or noncoding ORFs, Orphelia (Hoff et al. 2009, 2008) uses an artificial neural network combining multiple features to get ORF’s scores.

Parameter Estimation for Metagenomic Gene-Finding Algorithms

Patterns of oligonucleotide frequencies differ in coding and noncoding regions; these patterns are more pronounced when frequencies of longer oligomers are considered. Sequences with specific oligomer frequencies can be modeled by Markov chain models and in the important case of protein-coding sequences by three-periodic Markov chain models (Borodovsky et al. 1986). The number of parameters of a three-periodic Markov chain model increases exponentially with the model order; estimation of parameters of the practically useful fifth order model requires at least several hundred thousand nucleotide long sequence. Use of a shorter training sequence leads to over-fitting and will corrupt gene prediction. If the origin of the metagenomic sequence is known, sequences from the whole parent genome could be used for training. Alternatively, if novel metagenomic sequences from a single species are assembled in sufficiently long contig the model parameters can be estimated by self-training on the contig sequence (Besemer et al. 2001; Kelley et al. 2012). Most frequently, however, metagenomic sequences are short and novel (of the order of a few hundred nucleotides). Therefore, new approach to the model parameter derivation is needed.

A novel approach for constructing parameters and making efficient models for gene prediction in short genomic sequences was proposed back in 1999 (Besemer and Borodovsky 1999). The idea was to use observed trends in the nucleotide frequencies in the three codon positions in genomes with various GC content. Use of these dependencies allows for reconstructing the species-specific codon usage pattern in the whole genome starting from a short fragment of this genome whose length is sufficient to estimate the genome GC content. This approach is based on the assumption of genome compositional uniformity that is largely valid for prokaryotic genomes. It was shown that parameters provided by this approach allow sufficiently accurate gene prediction in short metagenomic sequences. Later on, with more genomes becoming available, this idea was extended (Zhu et al. 2010) to longer oligonucleotides (e.g., hexamers). With GC content of a genome being an independent variable X, it could be shown that frequency of phased K-mers in any of three frames, variable Y, can be approximated by a polynomial of order K. Particularly, the mononucleotide frequencies in three codon positions can be approximated by linear functions. These dependencies indicate that GC content is a major driving factor that determines evolution of genome-wide codon usage pattern (Chen et al. 2004). In MetaGeneMark, the value of GC content determined for a short metagenomic sequence is used as an estimate of GC content of the whole genome the sequence originated from. This value allows immediate reconstruction of frequencies of phased oligonucleotides and, at the next step, parameters of three-periodic Markov chain models of the heuristic model (Zhu et al. 2010).

Interestingly, the heuristic models can also be used for gene prediction in complete genomes or draft genomes. In comparison with the “native” models (models trained on a genome of interest), heuristic models are more sensitive to so-called “atypical” genes. Many atypical genes appear to be horizontally transferred genes with codon frequencies deviating from dominant codon usage pattern of the “host” genome.

Another approach to model parameter estimation is attempting to make a sufficiently large set of training sequences by linking anonymous sequences that appear to be taxonomically close. For example, Glimmer-MG assigns a taxon for a metagenomic sequence by a classification method called Phymm (Brady and Salzberg 2009) and then searches databases for genomes that belong to this taxon. Since such type of training is executed in real time, the running time of gene-finding algorithm may increase significantly in comparison with the algorithm selecting a heuristic model from a set of models precomputed for possible values of GC contents.

Additional Sequence Features Used by Metagenomic Gene Finders

Besides function-specific patterns in oligonucleotide composition, gene identification algorithms can use additional features that help discriminate protein-coding and noncoding regions. Such features include empirical length distributions of coding and noncoding regions, mutual orientation of neighboring coding regions, and sequence patterns related to functional sites such as ribosomal binding sites (RBS). The two-component model of RBS, containing positional frequency matrix as a model of the RBS motif and the length distribution of a “spacer,” the sequence between RBS and gene start, carries important additional information for improving accuracy of gene start prediction. In prokaryotic genomes an average spacer length is 5–7 nt. The RBS positional frequency matrix can be derived by algorithms such as MCMC (Markov chain Monte Carlo)-based Gibbs sampler (Lawrence et al. 1993) or EM (Expectation Maximization)-based MEME (Bailey and Elkan 1994); detection of the RBS motif is done by finding the most conserved set of ungapped sequence fragments within the multiple alignment window. The structure of two-component RBS model is convenient for incorporation into HMM-based framework of several algorithms such as MetaGeneMark and FragGeneScan

Another feature, the prokaryotic gene length distribution, is approximated for complete or draft genomes by the gamma distribution with mean value about 900 nt; yet another one, the distribution of length of noncoding region is approximated by exponential distribution. These two distributions, as well as the RBS spacer length distribution, are used as in the HSMM-based algorithms (Besemer et al. 2001). Since short metagenomic sequences are more likely to contain partial genes than complete genes, length distributions of partial genes are used in HSMM-based metagenomic gene finders (Rho et al. 2010; Zhu et al. 2010).

About 70 % of neighboring genes in prokaryotic genomes have the same orientation (Noguchi et al. 2006), and many of them make co-transcribed “chains” or operons. Genes in an operon are located on a close distance or even overlap. Four base-pair overlap ATGA is very common in adjacent genes as an overlap of stop and start codons ATG and TGA. Average distance between adjacent genes having the same orientation is shorter than that between neighbor genes residing in complementary strands, especially in gene start-to-gene start configuration where additional space has to be available for promoters.

All these features are incorporated in metagenomic gene finders, e.g., MetaGeneMark. Tests of ab initio gene finders on simulated metagenomic sequences have shown that these algorithms are quite accurate, with average values of sensitivity and specificity above 90 %; see Table 1. However, the sensitivity drops if the sequence length goes below 200 nt (Yok and Rosen 2011; Zhu et al. 2010).

Ab Initio Gene Identification in Metagenomic Sequences, Table 1 Gene prediction accuracy for five ab initio gene finders. Sn stands for sensitivity and Sp stands for specificity

An Initio Gene Finding in Metagenomic Sequences with Errors

Real metagenomic sequences contain errors: substitutions, insertion, and deletions (indels), as well as chimerisms, when two reads from different species are joined due to assembly error. Indels can cause frameshifts in coding regions; thus gene prediction accuracy is affected by sequencing errors. The overall effect on accuracy depends on error rates specific to sequencing and finishing technologies; for example, the error rates reported for Sanger sequencing may be as low as 0.001 % while sequencing errors in NGS technologies can go above 1 %. In both simulated Sanger reads and simulated 454 reads significant decrease of gene prediction sensitivity is observed when error rate exceeds 1 % (Hoff 2009). Still, in assembled sequences, the per-nucleotide error rate of 0.5 % in raw reads can be reduced to as low as 0.005 %. This error rate is still large enough to affect ~3–4.5 % of genes in assembled sequences (Luo et al. 2012).

To identify frameshift errors in metagenomic sequences, gene-finding algorithms have to model frame transitions that occur due to sequencing errors. In HSMM-based gene finders, e.g., FragGeneScan, new hidden states designating transitions between coding frames in the same strand were incorporated into the HSMM architecture. Another recent tool able to detect frameshift in metagenomic coding regions is MetaGeneTack (Tang et al. 2013). It combines the original HSMM-based MetaGeneMark with an ab initio frameshift finding program GeneTack (Antonov and Borodovsky 2010). Several filters of false-positive predictions were employed in MetaGeneTack to achieve higher accuracy. MetaGeneTack is reported to have higher frameshift prediction specificity than FragGeneScan (Table 2) in reads with error rate typical for metagenomic projects (Tang et al. 2013).

Ab Initio Gene Identification in Metagenomic Sequences, Table 2 Frameshift prediction accuracy

Yet another approach was used in Glimmer-MG, which, to trace possible indel errors, splits an ORF into three branches (frames), starting from the position of a nucleotide called with low confidence (Kelley et al. 2012). This approach was reported to have higher gene prediction accuracy on error-contained reads than FragGeneScan. Methods that account for sequencing errors generally perform better in real error-prone metagenomic sequences than “idealistic” approaches. The accuracy of sequencing error detection, however, depends on how accurate is the modeling of sequencing errors is.

Summary

Accurate ab initio gene prediction in metagenomic sequences is necessary for reliable functional annotation. Ab initio algorithms identify genes in metagenomic sequences by detecting intrinsic statistical patterns of coding and noncoding regions. Being independent of data stored in databases, these methods are especially useful for discovering novel genes. Special techniques have been developed for derivation of parameters of the ab initio algorithms working with short anonymous metagenomic sequences. We have reviewed several ab initio gene finders developed for metagenomic sequences including the latest tools that take into account possible sequencing errors (frameshifts).

Cross-References