Ab Initio Gene Identification in Metagenomic Sequences

Tang, Shiyuyun; Borodovsky, Mark

doi:10.1007/978-1-4899-7478-5_440

Shiyuyun Tang² &
Mark Borodovsky³

469 Accesses
1 Citations

Access provided by Autonomous University of Puebla. Download reference work entry PDF

Synonyms

Statistical or intrinsic methods of gene prediction

Definition

Computational inference of how a metagenomic sequence is divided into protein-coding and noncoding regions based on presence or absence of characteristic oligonucleotide frequency patterns.

Introduction

As of April 2013 sequences of 370 metagenomes were available in databases. On the other hand, Genomes Online Database (www.genomesonline.org) lists 186 complete archaeal and 3,956 complete bacterial genomes; also there are about 15,000 incomplete (draft) prokaryotic genomes. With the average size of a metagenome being 100 times larger than an average prokaryotic genome, the current volume of metagenomic sequences is twice as large as the total sequence in “genomic” data. Therefore, current metagenomes carry a larger wealth of genes than all the prokaryotic genomes, and this gap is growing.

Notably, gene prediction and annotation of gene and protein function is more challenging in metagenomes than in draft or complete genomes. To give a historic perspective, one can compare gene annotation of a metagenome with annotation of the first completely sequenced archaeal genome, Methanococcus jannaschii (Bult et al. 1996). All the M. jannaschii genes were predicted by the ab initio statistical method (Borodovsky and McIninch 1993) while function of 2/3 of them was a mystery since the translated protein sequences did not show sequence similarity to proteins in databases.

The history repeats itself in metagenomes, since majority of protein-coding regions in a new metagenome may code for proteins that do not show similarity to already known proteins. “Evidence-based” or “similarity-based” methods of gene finding (Kunin et al. 2008) provide gene prediction along with valuable information about function of encoded proteins. Similarity-based gene finders possess high specificity, close to 100 % (Altschul et al. 1997; Badger and Olsen 1999; Frishman et al. 1998; Gish and States 1993). Still, the drawback of similarity-based methods is low sensitivity; they cannot predict novel genes.

The similarity-based methods are less useful for gene prediction in metagenomes that carry many novel genes, while the ab initio gene prediction methods, not depending on presence of homologs in protein databases, are both effective and efficient for annotating genes in metagenomic sequences (Kunin et al. 2008).

Ab Initio Gene Finding

Ab initio gene prediction tools have high sensitivity (above 90 % for the best tools) and high specificity (above 90 % as well). Ab initio gene finders use statistical pattern recognition methods (Wooley et al. 2010). Statistical models such as Markov models, hidden Markov models (HMM), and hidden semi-Markov models (HSMM, also called hidden Markov model with duration) proved to be very useful to model statistical patterns of nucleotide ordering in protein-coding and noncoding regions. Accurate ab initio gene finding in isolated genomes requires ample sequence data for estimation of algorithm parameters (model training).

Contrary to isolated (complete and draft) genomes metagenomic sequences are derived from numerous genomes of heterogeneous microbial communities (microbiomes). A typical metagenomic sequence is short; its genomic context and the phylogenetic origin are rarely known. Gene identification is also affected by sequencing and assembly errors; for example, errors that lead to frameshifts (change of coding frame).

The major challenge for ab initio gene prediction in metagenomic sequences is that the metagenomic sequences are often too short for reliable estimation of parameters of species-specific models of coding and noncoding regions. Special training techniques have to be developed to address the challenging task of parameter estimation (see below). Similarly to gene prediction in isolated genomes, newly predicted genes are immediately translated into proteins and the similarity search is used in an attempt of function annotation.

Gene Finders Currently Available for Metagenomes

Current metagenomic gene-finding tools include FragGeneScan (Rho et al. 2010), Glimmer-MG (Kelley et al. 2012), MetaGene Annotator (Noguchi et al. 2008), MetaGeneMark (Zhu et al. 2010), and Orphelia (Hoff et al. 2009, 2008). Glimmer-MG and MetaGeneMark are extensions of gene finders for complete or draft genomes Glimmer3 (Delcher et al. 2007) and GeneMarkS (Besemer et al. 2001), respectively.

The MetaGeneMark algorithm uses HSMM architecture, originally developed in GeneMarkS (Besemer et al. 2001). The HSMM parameter derivation approach used in MetaGeneMark is to arrive to a large set of parameters (thousands of parameters related to oligonucleotide frequencies) from a small set (nucleotide frequencies determined in a short fragment) using the dependencies between oligonucleotide and nucleotide frequencies that have been formed in evolution. The original idea of this approach (Besemer and Borodovsky 1999) has been developed for small viral genomes before the start of “metagenomic era” (see below for more details).

Glimmer-MG is based on interpolated Markov models or IMMs (Salzberg et al. 1998). Glimmer-MG scores metagenomic sequences and assigns them into clusters; then, the algorithm iteratively estimates the IMM parameters and reassigns sequences to clusters.

FragGeneScan (Rho et al. 2010), an HMM-based gene finder, has an additional ability to predict frameshifts caused by sequencing errors. Transition probabilities between coding frames are determined with respect to the error models of sequencing technologies used to derive the input sequence.

MetaGene Annotator (Noguchi et al. 2008) works in two steps: in the first step the program scores open reading frames (ORFs) with respect to base composition and lengths; in the second step, it connects high-scoring ORFs using dynamic programming.

Machine learning classification algorithms such as support vector machines and neural networks are also used for ab initio gene finding. In order to classify coding or noncoding ORFs, Orphelia (Hoff et al. 2009, 2008) uses an artificial neural network combining multiple features to get ORF’s scores.

Parameter Estimation for Metagenomic Gene-Finding Algorithms

Patterns of oligonucleotide frequencies differ in coding and noncoding regions; these patterns are more pronounced when frequencies of longer oligomers are considered. Sequences with specific oligomer frequencies can be modeled by Markov chain models and in the important case of protein-coding sequences by three-periodic Markov chain models (Borodovsky et al. 1986). The number of parameters of a three-periodic Markov chain model increases exponentially with the model order; estimation of parameters of the practically useful fifth order model requires at least several hundred thousand nucleotide long sequence. Use of a shorter training sequence leads to over-fitting and will corrupt gene prediction. If the origin of the metagenomic sequence is known, sequences from the whole parent genome could be used for training. Alternatively, if novel metagenomic sequences from a single species are assembled in sufficiently long contig the model parameters can be estimated by self-training on the contig sequence (Besemer et al. 2001; Kelley et al. 2012). Most frequently, however, metagenomic sequences are short and novel (of the order of a few hundred nucleotides). Therefore, new approach to the model parameter derivation is needed.

A novel approach for constructing parameters and making efficient models for gene prediction in short genomic sequences was proposed back in 1999 (Besemer and Borodovsky 1999). The idea was to use observed trends in the nucleotide frequencies in the three codon positions in genomes with various GC content. Use of these dependencies allows for reconstructing the species-specific codon usage pattern in the whole genome starting from a short fragment of this genome whose length is sufficient to estimate the genome GC content. This approach is based on the assumption of genome compositional uniformity that is largely valid for prokaryotic genomes. It was shown that parameters provided by this approach allow sufficiently accurate gene prediction in short metagenomic sequences. Later on, with more genomes becoming available, this idea was extended (Zhu et al. 2010) to longer oligonucleotides (e.g., hexamers). With GC content of a genome being an independent variable X, it could be shown that frequency of phased K-mers in any of three frames, variable Y, can be approximated by a polynomial of order K. Particularly, the mononucleotide frequencies in three codon positions can be approximated by linear functions. These dependencies indicate that GC content is a major driving factor that determines evolution of genome-wide codon usage pattern (Chen et al. 2004). In MetaGeneMark, the value of GC content determined for a short metagenomic sequence is used as an estimate of GC content of the whole genome the sequence originated from. This value allows immediate reconstruction of frequencies of phased oligonucleotides and, at the next step, parameters of three-periodic Markov chain models of the heuristic model (Zhu et al. 2010).

Interestingly, the heuristic models can also be used for gene prediction in complete genomes or draft genomes. In comparison with the “native” models (models trained on a genome of interest), heuristic models are more sensitive to so-called “atypical” genes. Many atypical genes appear to be horizontally transferred genes with codon frequencies deviating from dominant codon usage pattern of the “host” genome.

Another approach to model parameter estimation is attempting to make a sufficiently large set of training sequences by linking anonymous sequences that appear to be taxonomically close. For example, Glimmer-MG assigns a taxon for a metagenomic sequence by a classification method called Phymm (Brady and Salzberg 2009) and then searches databases for genomes that belong to this taxon. Since such type of training is executed in real time, the running time of gene-finding algorithm may increase significantly in comparison with the algorithm selecting a heuristic model from a set of models precomputed for possible values of GC contents.

Additional Sequence Features Used by Metagenomic Gene Finders

Besides function-specific patterns in oligonucleotide composition, gene identification algorithms can use additional features that help discriminate protein-coding and noncoding regions. Such features include empirical length distributions of coding and noncoding regions, mutual orientation of neighboring coding regions, and sequence patterns related to functional sites such as ribosomal binding sites (RBS). The two-component model of RBS, containing positional frequency matrix as a model of the RBS motif and the length distribution of a “spacer,” the sequence between RBS and gene start, carries important additional information for improving accuracy of gene start prediction. In prokaryotic genomes an average spacer length is 5–7 nt. The RBS positional frequency matrix can be derived by algorithms such as MCMC (Markov chain Monte Carlo)-based Gibbs sampler (Lawrence et al. 1993) or EM (Expectation Maximization)-based MEME (Bailey and Elkan 1994); detection of the RBS motif is done by finding the most conserved set of ungapped sequence fragments within the multiple alignment window. The structure of two-component RBS model is convenient for incorporation into HMM-based framework of several algorithms such as MetaGeneMark and FragGeneScan

Another feature, the prokaryotic gene length distribution, is approximated for complete or draft genomes by the gamma distribution with mean value about 900 nt; yet another one, the distribution of length of noncoding region is approximated by exponential distribution. These two distributions, as well as the RBS spacer length distribution, are used as in the HSMM-based algorithms (Besemer et al. 2001). Since short metagenomic sequences are more likely to contain partial genes than complete genes, length distributions of partial genes are used in HSMM-based metagenomic gene finders (Rho et al. 2010; Zhu et al. 2010).

About 70 % of neighboring genes in prokaryotic genomes have the same orientation (Noguchi et al. 2006), and many of them make co-transcribed “chains” or operons. Genes in an operon are located on a close distance or even overlap. Four base-pair overlap ATGA is very common in adjacent genes as an overlap of stop and start codons ATG and TGA. Average distance between adjacent genes having the same orientation is shorter than that between neighbor genes residing in complementary strands, especially in gene start-to-gene start configuration where additional space has to be available for promoters.

All these features are incorporated in metagenomic gene finders, e.g., MetaGeneMark. Tests of ab initio gene finders on simulated metagenomic sequences have shown that these algorithms are quite accurate, with average values of sensitivity and specificity above 90 %; see Table 1. However, the sensitivity drops if the sequence length goes below 200 nt (Yok and Rosen 2011; Zhu et al. 2010).

Ab Initio Gene Identification in Metagenomic Sequences, Table 1 Gene prediction accuracy for five ab initio gene finders. Sn stands for sensitivity and Sp stands for specificity

Full size table

An Initio Gene Finding in Metagenomic Sequences with Errors

Real metagenomic sequences contain errors: substitutions, insertion, and deletions (indels), as well as chimerisms, when two reads from different species are joined due to assembly error. Indels can cause frameshifts in coding regions; thus gene prediction accuracy is affected by sequencing errors. The overall effect on accuracy depends on error rates specific to sequencing and finishing technologies; for example, the error rates reported for Sanger sequencing may be as low as 0.001 % while sequencing errors in NGS technologies can go above 1 %. In both simulated Sanger reads and simulated 454 reads significant decrease of gene prediction sensitivity is observed when error rate exceeds 1 % (Hoff 2009). Still, in assembled sequences, the per-nucleotide error rate of 0.5 % in raw reads can be reduced to as low as 0.005 %. This error rate is still large enough to affect ~3–4.5 % of genes in assembled sequences (Luo et al. 2012).

To identify frameshift errors in metagenomic sequences, gene-finding algorithms have to model frame transitions that occur due to sequencing errors. In HSMM-based gene finders, e.g., FragGeneScan, new hidden states designating transitions between coding frames in the same strand were incorporated into the HSMM architecture. Another recent tool able to detect frameshift in metagenomic coding regions is MetaGeneTack (Tang et al. 2013). It combines the original HSMM-based MetaGeneMark with an ab initio frameshift finding program GeneTack (Antonov and Borodovsky 2010). Several filters of false-positive predictions were employed in MetaGeneTack to achieve higher accuracy. MetaGeneTack is reported to have higher frameshift prediction specificity than FragGeneScan (Table 2) in reads with error rate typical for metagenomic projects (Tang et al. 2013).

Ab Initio Gene Identification in Metagenomic Sequences, Table 2 Frameshift prediction accuracy

Full size table

Yet another approach was used in Glimmer-MG, which, to trace possible indel errors, splits an ORF into three branches (frames), starting from the position of a nucleotide called with low confidence (Kelley et al. 2012). This approach was reported to have higher gene prediction accuracy on error-contained reads than FragGeneScan. Methods that account for sequencing errors generally perform better in real error-prone metagenomic sequences than “idealistic” approaches. The accuracy of sequencing error detection, however, depends on how accurate is the modeling of sequencing errors is.

Summary

Accurate ab initio gene prediction in metagenomic sequences is necessary for reliable functional annotation. Ab initio algorithms identify genes in metagenomic sequences by detecting intrinsic statistical patterns of coding and noncoding regions. Being independent of data stored in databases, these methods are especially useful for discovering novel genes. Special techniques have been developed for derivation of parameters of the ab initio algorithms working with short anonymous metagenomic sequences. We have reviewed several ab initio gene finders developed for metagenomic sequences including the latest tools that take into account possible sequencing errors (frameshifts).

Cross-References

References

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
CAS PubMed Central PubMed Google Scholar
Antonov I, Borodovsky M. Genetack: frameshift identification in protein-coding sequences by the viterbi algorithm. J Bioinforma Comput Biol. 2010;8(3):535–51. PubMed PMID: 20556861.
CAS Google Scholar
Badger JH, Olsen GJ. CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol. 1999;16(4):512–24.
CAS PubMed Google Scholar
Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings/International Conference on Intelligent Systems for Molecular Biology; ISMB International Conference on Intelligent Systems for Molecular Biology, Vol. 2; 1994; p. 28–36. PubMed PMID: 7584402.
Google Scholar
Besemer J, Borodovsky M. Heuristic approach to deriving models for gene finding. Nucleic Acids Res. 1999;27(19):3911–20. PubMed PMID: 10481031. Pubmed Central PMCID: 148655.
CAS PubMed Central PubMed Google Scholar
Besemer J, Lomsadze A, Borodovsky M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 2001;29(12):2607–18. PubMed PMID: 11410670. Pubmed Central PMCID: 55746.
CAS PubMed Central PubMed Google Scholar
Borodovsky M, McIninch J. GENMARK: parallel gene recognition for both DNA strands. Comp Chem. 1993;17(2):123–33.
CAS Google Scholar
Borodovsky MY, Sprizhitskii Y, Golovanov E, Aleksandrov A. Statistical patterns in primary structures of functional regions in the E. coli genome. III. Computer recognition of coding regions. Mol Biol. 1986;20:1145–50.
Google Scholar
Brady A, Salzberg SL. Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated Markov models. Nat Methods. 2009;6(9):673–6. PubMed PMID: 19648916. Pubmed Central PMCID: 2762791.
CAS PubMed Central PubMed Google Scholar
Bult CJ, White O, Olsen GJ, Zhou L, Fleischmann RD, Sutton GG, et al. Complete genome sequence of the methanogenic archaeon. Methanococcus jannaschii. Science. 1996;273(5278):1058–73. PubMed PMID: 8688087.
CAS PubMed Google Scholar
Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH. Codon usage between genomes is constrained by genome-wide mutational processes. Proc Natl Acad Sci U S A. 2004;101(10):3480–5. PubMed PMID: 14990797. Pubmed Central PMCID: 373487.
CAS PubMed Central PubMed Google Scholar
Delcher AL, Bratke KA, Powers EC, Salzberg SL. Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics. 2007;23(6):673–9. PubMed PMID: 17237039. Pubmed Central PMCID: 2387122.
CAS PubMed Central PubMed Google Scholar
Frishman D, Mironov A, Mewes H-W, Gelfand M. Combining diverse evidence for gene recognition in completely sequenced bacterial genomes. Nucleic Acids Res. 1998;26(12):2941–7.
CAS PubMed Central PubMed Google Scholar
Gish W, States DJ. Identification of protein coding regions by database similarity search. Nat Genet. 1993;3(3):266–72.
CAS PubMed Google Scholar
Hoff KJ. The effect of sequencing errors on metagenomic gene prediction. BMC Genomics. 2009;10:520. PubMed PMID: 19909532. Pubmed Central PMCID: 2781827.
PubMed Central PubMed Google Scholar
Hoff KJ, Tech M, Lingner T, Daniel R, Morgenstern B, Meinicke P. Gene prediction in metagenomic fragments: a large scale machine learning approach. BMC Bioinforma. 2008;9:217. PubMed PMID: 18442389. Pubmed Central PMCID: 2409338.
Google Scholar
Hoff KJ, Lingner T, Meinicke P, Tech M. Orphelia: predicting genes in metagenomic sequencing reads. Nucleic Acids Res. 2009 Jul 37(Web Server issue):W101-5. PubMed PMID: 19429689. Pubmed Central PMCID: 2703946.
Google Scholar
Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL. Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering. Nucleic Acids Res. 2012;40(1):e9. PubMed PMID: 22102569. Pubmed Central PMCID: 3245904.
CAS PubMed Central PubMed Google Scholar
Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician’s guide to metagenomics. Microbiol Mol Biol Rev. 2008;72(4):557–78. Table of Contents. PubMed PMID: 19052320. Pubmed Central PMCID: 2593568.
CAS PubMed Central PubMed Google Scholar
Lawrence CE, Altschul SF, Boguski MS, Liu JS, Neuwald AF, Wootton JC. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science. 1993;262(5131):208–14. PubMed PMID: 8211139.
CAS PubMed Google Scholar
Luo C, Tsementzi D, Kyrpides N, Read T, Konstantinidis KT. Direct comparisons of Illumina vs. Roche 454 sequencing technologies on the same microbial community DNA sample. PloS ONE. 2012;7(2):e30087.
CAS PubMed Central PubMed Google Scholar
Noguchi H, Park J, Takagi T. MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Res. 2006;34(19):5623–30. PubMed PMID: 17028096. Pubmed Central PMCID: 1636498.
CAS PubMed Central PubMed Google Scholar
Noguchi H, Taniguchi T, Itoh T. MetaGeneAnnotator: detecting species-specific patterns of ribosomal binding site for precise gene prediction in anonymous prokaryotic and phage genomes. DNA Res Int J Rapid Publ Rep Genes Genomes. 2008;15(6):387–96. PubMed PMID: 18940874. Pubmed Central PMCID: 2608843.
CAS Google Scholar
Rho M, Tang H, Ye Y. FragGeneScan: predicting genes in short and error-prone reads. Nucleic Acids Res. 2010;38(20):e191. PubMed PMID: 20805240. Pubmed Central PMCID: 2978382.
PubMed Central PubMed Google Scholar
Salzberg SL, Delcher AL, Kasif S, White O. Microbial gene identification using interpolated Markov models. Nucleic Acids Res. 1998;26(2):544–8. PubMed PMID: 9421513. Pubmed Central PMCID: 147303.
CAS PubMed Central PubMed Google Scholar
Tang S, Antonov I, Borodovsky M. MetaGeneTack: ab initio detection of frameshifts in metagenomic sequences. Bioinformatics. 2013;29(1):114–6. PubMed PMID: 23129300. Pubmed Central PMCID: 3530910.
CAS PubMed Central PubMed Google Scholar
Wooley JC, Godzik A, Friedberg I. A primer on metagenomics. PLoS Comput Biol. 2010;6(2):e1000667. PubMed PMID: 20195499. Pubmed Central PMCID: 2829047.
PubMed Central PubMed Google Scholar
Yok NG, Rosen GL. Combining gene prediction methods to improve metagenomic gene annotation. BMC Bioinforma. 2011;12:20. PubMed PMID: 21232129. Pubmed Central PMCID: 3042383.
Google Scholar
Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010;38(12):e132. PubMed PMID: 20403810. Pubmed Central PMCID: 2896542.
PubMed Central PubMed Google Scholar

Download references

Author information

Authors and Affiliations

School of Biology, Biodiversity Research Center, Georgia Institute of Technology, Atlanta, GA, 30332, USA
Shiyuyun Tang
Joint Georgia Tech and Emory Wallace H Coulter Department of Biomedical Engineering, Center for Bioinformatics and Computational Genomics, Atlanta, GA, 30332, USA
Mark Borodovsky

Authors

Shiyuyun Tang
View author publications
You can also search for this author in PubMed Google Scholar
Mark Borodovsky
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mark Borodovsky .

Editor information

Editors and Affiliations

J. Craig Venter Institute, Rockville, MD, USA
Karen E. Nelson

Rights and permissions

Reprints and permissions

Copyright information

About this entry

Cite this entry

Tang, S., Borodovsky, M. (2015). Ab Initio Gene Identification in Metagenomic Sequences. In: Nelson, K.E. (eds) Encyclopedia of Metagenomics. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7478-5_440

Download citation

DOI: https://doi.org/10.1007/978-1-4899-7478-5_440
Published: 04 January 2015
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4899-7477-8
Online ISBN: 978-1-4899-7478-5
eBook Packages: Biomedical and Life SciencesReference Module Biomedical and Life Sciences

Publish with us

Policies and ethics