Abstract
The critical part of genome assembly is the resolution of repeats and the scaffolding of shorter contigs. Modern assemblers usually perform this step using heuristics, often tailored to a particular technology for producing paired reads or long reads. We propose a new framework that allows systematic combination of diverse sequencing datasets into a single assembly. We achieve this by searching for an assembly with maximum likelihood in a probabilistic model that captures the error rate, insert lengths, and other characteristics of each sequencing technology.
We have implemented a prototype genome assembler, GAML, that can use any combination of insert sizes with Illumina or 454 reads, as well as PacBio reads. Our experiments show that we can assemble short genomes with N50 sizes and error rates comparable to ALLPATHS-LG or Cerulean. While ALLPATHS-LG and Cerulean each require a specific combination of datasets, GAML works on any combination.
Data and software are available at http://compbio.fmph.uniba.sk/gaml
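As a rough illustration of the maximum-likelihood idea summarized above, the sketch below scores a candidate assembly by the log-probability of a set of reads. It is a toy model and not the GAML implementation: it assumes substitution-only errors, a single strand, a uniformly chosen read start, and unpaired reads, and the function names `read_log_likelihood` and `assembly_log_likelihood` are hypothetical.

```python
import math


def read_log_likelihood(read, assembly, error_rate):
    """Log-probability of one read under a toy substitution-only model.

    Assumption: the read starts at a uniformly chosen forward-strand
    position, and each base is miscalled independently with probability
    error_rate (split evenly among the three other bases).
    """
    n, m = len(assembly), len(read)
    if m > n:
        return float("-inf")  # the read cannot come from this assembly
    positions = n - m + 1
    total = 0.0
    for start in range(positions):
        window = assembly[start:start + m]
        mismatches = sum(a != b for a, b in zip(read, window))
        total += (error_rate / 3) ** mismatches * (1 - error_rate) ** (m - mismatches)
    return math.log(total / positions)


def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of per-read log-likelihoods (reads assumed independent)."""
    return sum(read_log_likelihood(r, assembly, error_rate) for r in reads)


# Toy usage: the candidate consistent with all reads should score higher.
reads = ["ACGT", "CGTA", "GTAC"]
good = assembly_log_likelihood(reads, "ACGTAC")
bad = assembly_log_likelihood(reads, "ACGAAC")
print(good > bad)
```

An assembler built on this principle would search over candidate assemblies for the one with the highest score. A realistic model would also need terms for insert lengths, both strands, and indel errors, which this sketch leaves out.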
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boža, V., Brejová, B., Vinař, T. (2014). GAML: Genome Assembly by Maximum Likelihood. In: Brown, D., Morgenstern, B. (eds) Algorithms in Bioinformatics. WABI 2014. Lecture Notes in Computer Science, vol 8701. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44753-6_10
DOI: https://doi.org/10.1007/978-3-662-44753-6_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44752-9
Online ISBN: 978-3-662-44753-6
eBook Packages: Computer Science, Computer Science (R0)