Abstract
Background
De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs.
Results
In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computational techniques, such as the Bloom filter, to compactly store graphs while keeping all functionalities intact.
Conclusions
We complete our analysis with a discussion on the algorithmic issues of assembling from long reads (e.g., PacBio and Oxford Nanopore). Finally, we present some of the most relevant open problems in this field.
Article PDF
Similar content being viewed by others
Avoid common mistakes on your manuscript.
References
Tomescu, A.I., Medvedev, P. (2016) Safe and Complete Contig Assembly Via Omnitigs. In: Research in Computational Molecular Biology, Singh, M. (eds.), RECOMB 2016. Lecture Notes in Computer Science. vol 9649. Springer, Cham
Burrows, M. and Wheeler, D. J. (1994) A block-sorting lossless data compression algorithm. CiteSeer
Ferragina, P. and Manzini, G. (2005) Indexing compressed text. J. Assoc. Comput. Mach., 52, 552–581
Baker, M. (2012) De novo genome assembly: what every biologist should know. Nat. Methods, 9, 333–337
Miller, J. R., Koren, S. and Sutton, G. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327
Paszkiewicz, K. and Studholme, D. J. (2010) De novo assembly of short sequence reads. Brief. Bioinform., 11, 457–472
Narzisi, G. and Mishra, B. (2011) Comparing de novo genome assembly: the long and short of it. PLoS One, 6, e19175
Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X., Liu, B., et al. (2012) Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and debrujn-graph. Brief. Funct. Genomics, 11, 25–37
Jayakumar, V. and Sakakibara, Y. (2019) Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief. Bioinform., 20, 866–876
Ghurye, J. and Pop, M. (2019) Modern technologies and algorithms for scaffolding assembled genomes. PLOS Comput. Biol., 15, e1006994
Shi, F. (1996) Suffix arrays for multiple strings: A method for online multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, Jaffar, J., Yap, R.H.C. (eds.), ASIAN 1996. Lecture Notes in Computer Science, vol 1179. Springer, Berlin, Heidelberg
Manber, U. and Myers, G. (1993) Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22, 935–948
Lam, T. W., Li, R., Tam, A., Wong, S., Wu, E. and Yiu, S. M. (2009) High throughput short read alignment via bi-directional BWT. In Bioinformatics and Biomedicine (BIBM’ 09), pp. 31–36, Washington, DC
Simpson, J. T. and Durbin, R. (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26, i367–i373
Simpson, J. T. and Durbin, R. (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res., 22, 549–556
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25
Cox, A. J., Garofalo, F., Rosone, G. and Sciortino, M. (2016) Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37, 17–33
Li, H. (2014) Fast construction of FM-index for long sequence reads. Bioinformatics, 30, 3274–3275
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2019) Multithread multistring Burrows-Wheeler transform and longest common prefix array. J. Comput. Biol., 26
Bloom, B. H. (1970) Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13, 422–426
Fan, L., Cao, P., Almeida, J. and Broder, A. Z. (2000) Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8, 281–293
Broder, A. Z. (1997) On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES 1997, pp.21–29. IEEE
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. and Yorke, J. A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369
Salmela, L. and Rivals, E. (2014) LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30, 3506–3514
Nordberg, H., Cantor, M., Dusheyko, S., Hua, S., Poliakov, A., Shabalov, I., Smirnova, T., Grigoriev, I. V. and Dubchak, I. (2013) The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nucleic Acids Research, 42, D26–D31
Myers, E. W. (2005) The fragment assembly string graph. Bioinformatics, 21, i79–i85
Gonnella, G. and Kurtz, S. (2012) Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13, 82
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2016) LSG: An external-memory tool to compute string graphs for next-generation sequencing data assembly. J. Comput. Biol., 23, 137–149
Peng, Y. and Leung, H. C. M., Yiu, S. M. and Chin, F. Y. L. (2010) IDBA—a practical iterative de Bruijn graph de novo assembler. In Research in Computational Molecular Biology, Bonnie B., (ed.), pp. 426–440, Springer Berlin Heidelberg
Medvedev, P., Pham, S., Chaisson, M., Tesler, G. and Pevzner, P. (2011) Paired de Bruijn graphs: A novel approach for incorporating mate pair information into genome assemblers. In Proceedings of the 15th Annual International Conference on Research in Computational Molecular Biology, RECOMB’11, pp. 238–251, Springer-Verlag
Pham, S. K., Antipov, D., Sirotkin, A., Tesler, G., Pevzner, P. A. and Alekseyev, M. A. (2012) Pathset graphs: A novel approach for comprehensive utilization of paired reads in genome assembly. In Research in Computational Molecular Biology, Chor, B., (ed.), pp. 200–212, Springer Berlin Heidelberg
Ronen, R., Boucher, C., Chtsaz, H. and Pevzner, P. (2012) SEQuel: improving the accuracy ofgenome assemblies. Bioinformatics, 28, i188–i196
Bao, E., Jiang, T., and Girke, T. (2014) AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics, 30, i319–i328
Li, H. (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28, 1838–1844
Nong, G., Zhang, S. and Chan, W. H. (2011) Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput., 60, 1471–1484
Bauer, M., Cox, A. and Rosone, G. (2013) Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483, 134–148
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2017) FSG: Fast String Graph construction for de novo assembly. J. Comput. Biol., 24, 953–968
Garey, M. R. and Johnson, D. S. (1979) Computer and Intractability: A Guide to the Theory of NP-completeness. Freeman, W. H., (ed.), San Francisco
Conway, T. C. and Bromage, A. J. (2011) Succinct data structures for assembling large genomes. Bioinformatics, 27, 479–486
Chikhi, R. and Rizk, G. (2013) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol., 8, 22
Salikhov, K., Sacomoto, G. and Kucherov, G. (2014) Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms Mol. Biol., 9, 2
Jackman, S. D., Vandervalk, B. P., Mohamadi, H., Chu, J., Yeo, S., Hammond, S. A., Jahesh, G., Khan, H., Coombe, L., Warren, R. L., et al. (2017) ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res., 27, 768–777
Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T. and Medvedev, P. (2015) On the representation of de Bruijn graphs. J. Comput. Biol., 22, 336–352
Bowe, A., Onodera, T., Sadakane, K. and Shibuya, T. (2012) Succinct de Bruijn graphs. In: Algorithms in Bioinformatics, volume 7534 of Lecture Notes in Computer Science, Raphael, B. and Tang, J. (eds.), pp. 225–235. Springer Berlin Heidelberg
Boucher, C., Bowe, A., Gagie, T., Puglisi, S. J. and Sadakane, K. (2015) Variable-order de Bruijn graphs. In Data Compression Conference (DCC), pp. 383–392
Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M. and Puglisi, S. J. (2018) Bidirectional variable-order de Bruijn graphs. Int. J. Found. Comput. Sci., 29, 1279–1295
Muggli, M. D., Bowe, A., Noyes, N. R., Morley, P.S., Belk, K. E., Raymond, R., Gagie, T., Puglisi, S.J., Boucher, C. (2017) Succinct colored de Bruijn graphs. Bioinformatics, 33, 3181–3187
Almodaresi, F., Pandey, P. and Patro, R. (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017).
Almodaresi, F., Sarkar, H., Srivastava, A. and Patro, R. (2018) A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34, i169–i177
Muggli, M. D., Alipanahi, B., Boucher, C. (2019) Building large updatable colored de Bruijn graphs via merging. bioRxiv, 229641
Neil, C. Jones, Pavel A Pevzner, and Pavel Pevzner. (2004) An introduction to bioinformatics algorithms. MIT Press
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H. and Phillippy, A. M. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res., 27, 722–736
Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M. and Phillippy, A. M. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630
Chu, J., Mohamadi, H., Warren, R. L., Yang, C. and Birol, I. (2017) Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics, 33, 1261–1270
Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. bioRxiv, 530972
Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. and Tse, D. N. (2017) HINGE: long-read assembly achieves optimal repeat resolution. Genome Res., 27, 747–756
Myers, G. (2014) Efficient local alignment discovery amongst noisy long reads. In International Workshop on Algorithms in Bioinformatics, pp. 52–67. Springer
Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly ofpyrosequencing reads with mates. Bioinformatics, 24, 2818–2824
Lin, Y., Yuan, J., Kolmogorov, M., Shen, M. W., Chaisson, M. and Pevzner, P. A. (2016) Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA, 113, E8396–E8405
Prjibelski, A. D., Vasilinetc, I., Bankevich, A., Gurevich, A., Krivosheeva, T., Nurk, S., Pham, S., Korobeynikov, A., Lapidus, A., and Pevzner, P. A. (2014) Exspander: a universal repeat resolver for DNA fragment assembly. Bioinformatics, 30, i293–i301
Chin, C.-S., Peluso, P., Sedlazeck, F. J., Nattestad, M., Concepcion, G. T., Clum, A., Dunn, C., O’Malley, R., Figueroa-Balderas, R., Morales-Cruz, A., et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods, 13, 1050–1054
Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E., et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods, 10, 563–569
Fasulo, D., Halpern, A., Dew, I., and Mobarry, C. (2002) Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18, S294S302
Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz, S. A., Mobarry, C. M., Reinert, K. H., Remington, K. A., et al. (2000) A whole-genome assembly of Drosophila. Science, 287, 2196–2204
Berger, A. and Lafferty, J. (2017) Information retrieval as statistical translation. ACM SIGIR Forum, 51, 219–226
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110
Välimäki, N., Ladra, S. and Mäkinen, V. (2012) Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213, 49–58
Egidi, L., Louza, F. A., Manzini, G. and Telles, G. (2018) External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, Helsinki, Finland, volume 113, pp. 1–14
Liu, B., Liu, C.-M., Li, D., Li, Y., Ting, H.-F., Yiu, S. M., Luo, R. and Lam, T. W. (2016) BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics, 17, 499
Sohn, J. and Nam, J.-W. (2016) The present and future of de novo whole-genome assembly. Brief. Bioinform., bbw096
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235
Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., et al. (2018) Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol., 36, 875–879
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. and McVean, G. (2012) De novo assembly and genotyping ofvariants using colored de Bruijn graphs. Nat. Genet., 44, 226–232
Pandey, P., Almodaresi, F., Bender, M. A., Ferdman, M., Johnson, R. and Patro, R. (2018) Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst., 7, 201–207.e4
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and Birol, I. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res., 19, 1117–1123
Holt, J. and McMillan, L. (2014) Merging of multi-string BWTs with applications. Bioinformatics, 30, 3524–3531
Minkin, I., Pham, S. and Medvedev, P. (2017) TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33, 4024–4032
Marcus, S., Lee, H. and Schatz, M. C. (2014) SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30, 3476–3483
Cazaux, B., Lecroq, T. and Rivals, E. (2014) From indexing data structures to de Bruijn graphs. In Combinatorial Pattern Matching, pp. 89–99. Springer
Baier, U., Beller, T. and Ohlebusch, E. (2016) Graphical pangenome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32, 497–504
Marschall, T., Marz, M., Abeel, T., Dijkstra, L., Dutilh, B. E., Ghaffaari, A., Kersey, P., Kloosterman, W., Makinen, V., Novak, A., et al. (2016) Computational pan-genomics: status, promises and challenges. Brief. Bioinform., 19, 118–135
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
The authors Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova and Paola Bonizzoni declare that they have no conflict of interests.
This article is a review article and does not contain any studies with human or animal subjects performed by any of the authors.
Additional information
Author summary: All available assemblers are built upon the notions of de Bruijn graphs or overlap graphs. This review discusses the most recent advances in the problem of constructing, representing and navigating those graphs, focusing on very large datasets. Moreover, we present some of the most relevant open problems.
Rights and permissions
About this article
Cite this article
Rizzi, R., Beretta, S., Patterson, M. et al. Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. Quant Biol 7, 278–292 (2019). https://doi.org/10.1007/s40484-019-0181-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s40484-019-0181-x