Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Rizzi, Raffaella; Beretta, Stefano; Patterson, Murray; Pirola, Yuri; Previtali, Marco; Della Vedova, Gianluca; Bonizzoni, Paola

doi:10.1007/s40484-019-0181-x

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Review
Published: 16 December 2019

Volume 7, pages 278–292, (2019)
Cite this article

Download PDF

Quantitative Biology

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era

Download PDF

Raffaella Rizzi¹,
Stefano Beretta¹,
Murray Patterson¹,
Yuri Pirola¹,
Marco Previtali¹,
Gianluca Della Vedova¹ &
…
Paola Bonizzoni¹

5774 Accesses
26 Citations
6 Altmetric
Explore all metrics

Abstract

Background

De novo genome assembly relies on two kinds of graphs: de Bruijn graphs and overlap graphs. Overlap graphs are the basis for the Celera assembler, while de Bruijn graphs have become the dominant technical device in the last decade. Those two kinds of graphs are collectively called assembly graphs.

Results

In this review, we discuss the most recent advances in the problem of constructing, representing and navigating assembly graphs, focusing on very large datasets. We will also explore some computational techniques, such as the Bloom filter, to compactly store graphs while keeping all functionalities intact.

Conclusions

We complete our analysis with a discussion on the algorithmic issues of assembling from long reads (e.g., PacBio and Oxford Nanopore). Finally, we present some of the most relevant open problems in this field.

Article PDF

A Classification of de Bruijn Graph Approaches for De Novo Fragment Assembly

Read mapping on de Bruijn graphs

Article Open access 16 June 2016

LazyB: fast and cheap genome assembly

Article Open access 01 June 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Tomescu, A.I., Medvedev, P. (2016) Safe and Complete Contig Assembly Via Omnitigs. In: Research in Computational Molecular Biology, Singh, M. (eds.), RECOMB 2016. Lecture Notes in Computer Science. vol 9649. Springer, Cham
Google Scholar
Burrows, M. and Wheeler, D. J. (1994) A block-sorting lossless data compression algorithm. CiteSeer
Ferragina, P. and Manzini, G. (2005) Indexing compressed text. J. Assoc. Comput. Mach., 52, 552–581
Article Google Scholar
Baker, M. (2012) De novo genome assembly: what every biologist should know. Nat. Methods, 9, 333–337
Article CAS Google Scholar
Miller, J. R., Koren, S. and Sutton, G. (2010) Assembly algorithms for next-generation sequencing data. Genomics, 95, 315–327
Article CAS PubMed Google Scholar
Paszkiewicz, K. and Studholme, D. J. (2010) De novo assembly of short sequence reads. Brief. Bioinform., 11, 457–472
Article CAS PubMed Google Scholar
Narzisi, G. and Mishra, B. (2011) Comparing de novo genome assembly: the long and short of it. PLoS One, 6, e19175
Article CAS PubMed PubMed Central Google Scholar
Li, Z., Chen, Y., Mu, D., Yuan, J., Shi, Y., Zhang, H., Gan, J., Li, N., Hu, X., Liu, B., et al. (2012) Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and debrujn-graph. Brief. Funct. Genomics, 11, 25–37
Article PubMed CAS Google Scholar
Jayakumar, V. and Sakakibara, Y. (2019) Comprehensive evaluation of non-hybrid genome assembly tools for third-generation PacBio long-read sequence data. Brief. Bioinform., 20, 866–876
Article CAS PubMed Google Scholar
Ghurye, J. and Pop, M. (2019) Modern technologies and algorithms for scaffolding assembled genomes. PLOS Comput. Biol., 15, e1006994
Article CAS PubMed PubMed Central Google Scholar
Shi, F. (1996) Suffix arrays for multiple strings: A method for online multiple string searches. In: Concurrency and Parallelism, Programming, Networking, and Security, Jaffar, J., Yap, R.H.C. (eds.), ASIAN 1996. Lecture Notes in Computer Science, vol 1179. Springer, Berlin, Heidelberg
Google Scholar
Manber, U. and Myers, G. (1993) Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22, 935–948
Article Google Scholar
Lam, T. W., Li, R., Tam, A., Wong, S., Wu, E. and Yiu, S. M. (2009) High throughput short read alignment via bi-directional BWT. In Bioinformatics and Biomedicine (BIBM’ 09), pp. 31–36, Washington, DC
Simpson, J. T. and Durbin, R. (2010) Efficient construction of an assembly string graph using the FM-index. Bioinformatics, 26, i367–i373
Article CAS PubMed PubMed Central Google Scholar
Simpson, J. T. and Durbin, R. (2012) Efficient de novo assembly of large genomes using compressed data structures. Genome Res., 22, 549–556
Article CAS PubMed PubMed Central Google Scholar
Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics, 25, 1754–1760
Article CAS PubMed PubMed Central Google Scholar
Langmead, B., Trapnell, C., Pop, M. and Salzberg, S. L. (2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol., 10, R25
Article PubMed PubMed Central CAS Google Scholar
Cox, A. J., Garofalo, F., Rosone, G. and Sciortino, M. (2016) Lightweight LCP construction for very large collections of strings. J. Discrete Algorithms, 37, 17–33
Article Google Scholar
Li, H. (2014) Fast construction of FM-index for long sequence reads. Bioinformatics, 30, 3274–3275
Article CAS PubMed PubMed Central Google Scholar
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2019) Multithread multistring Burrows-Wheeler transform and longest common prefix array. J. Comput. Biol., 26
Bloom, B. H. (1970) Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13, 422–426
Article Google Scholar
Fan, L., Cao, P., Almeida, J. and Broder, A. Z. (2000) Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Trans. Netw., 8, 281–293
Article Google Scholar
Broder, A. Z. (1997) On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES 1997, pp.21–29. IEEE
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. and Yorke, J. A. (2004) Reducing storage requirements for biological sequence comparison. Bioinformatics, 20, 3363–3369
Article CAS PubMed Google Scholar
Salmela, L. and Rivals, E. (2014) LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30, 3506–3514
Article CAS PubMed PubMed Central Google Scholar
Nordberg, H., Cantor, M., Dusheyko, S., Hua, S., Poliakov, A., Shabalov, I., Smirnova, T., Grigoriev, I. V. and Dubchak, I. (2013) The genome portal of the Department of Energy Joint Genome Institute: 2014 updates. Nucleic Acids Research, 42, D26–D31
Article PubMed PubMed Central CAS Google Scholar
Myers, E. W. (2005) The fragment assembly string graph. Bioinformatics, 21, i79–i85
Google Scholar
Gonnella, G. and Kurtz, S. (2012) Readjoiner: a fast and memory efficient string graph-based sequence assembler. BMC Bioinformatics, 13, 82
Article PubMed PubMed Central Google Scholar
Pevzner, P. A., Tang, H. and Waterman, M. S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. USA, 98, 9748–9753
Article CAS PubMed PubMed Central Google Scholar
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2016) LSG: An external-memory tool to compute string graphs for next-generation sequencing data assembly. J. Comput. Biol., 23, 137–149
Article CAS PubMed Google Scholar
Peng, Y. and Leung, H. C. M., Yiu, S. M. and Chin, F. Y. L. (2010) IDBA—a practical iterative de Bruijn graph de novo assembler. In Research in Computational Molecular Biology, Bonnie B., (ed.), pp. 426–440, Springer Berlin Heidelberg
Chapter Google Scholar
Medvedev, P., Pham, S., Chaisson, M., Tesler, G. and Pevzner, P. (2011) Paired de Bruijn graphs: A novel approach for incorporating mate pair information into genome assemblers. In Proceedings of the 15th Annual International Conference on Research in Computational Molecular Biology, RECOMB’11, pp. 238–251, Springer-Verlag
Pham, S. K., Antipov, D., Sirotkin, A., Tesler, G., Pevzner, P. A. and Alekseyev, M. A. (2012) Pathset graphs: A novel approach for comprehensive utilization of paired reads in genome assembly. In Research in Computational Molecular Biology, Chor, B., (ed.), pp. 200–212, Springer Berlin Heidelberg
Chapter Google Scholar
Ronen, R., Boucher, C., Chtsaz, H. and Pevzner, P. (2012) SEQuel: improving the accuracy ofgenome assemblies. Bioinformatics, 28, i188–i196
Article CAS PubMed PubMed Central Google Scholar
Bao, E., Jiang, T., and Girke, T. (2014) AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references. Bioinformatics, 30, i319–i328
Article CAS PubMed PubMed Central Google Scholar
Li, H. (2012) Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly. Bioinformatics, 28, 1838–1844
Article CAS PubMed PubMed Central Google Scholar
Nong, G., Zhang, S. and Chan, W. H. (2011) Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput., 60, 1471–1484
Article Google Scholar
Bauer, M., Cox, A. and Rosone, G. (2013) Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci., 483, 134–148
Article Google Scholar
Bonizzoni, P., Della Vedova, G., Pirola, Y., Previtali, M. and Rizzi, R. (2017) FSG: Fast String Graph construction for de novo assembly. J. Comput. Biol., 24, 953–968
Article CAS PubMed Google Scholar
Garey, M. R. and Johnson, D. S. (1979) Computer and Intractability: A Guide to the Theory of NP-completeness. Freeman, W. H., (ed.), San Francisco
Conway, T. C. and Bromage, A. J. (2011) Succinct data structures for assembling large genomes. Bioinformatics, 27, 479–486
Article CAS PubMed Google Scholar
Chikhi, R. and Rizk, G. (2013) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms Mol. Biol., 8, 22
Article PubMed PubMed Central CAS Google Scholar
Salikhov, K., Sacomoto, G. and Kucherov, G. (2014) Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms Mol. Biol., 9, 2
Article PubMed PubMed Central Google Scholar
Jackman, S. D., Vandervalk, B. P., Mohamadi, H., Chu, J., Yeo, S., Hammond, S. A., Jahesh, G., Khan, H., Coombe, L., Warren, R. L., et al. (2017) ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. Genome Res., 27, 768–777
Article CAS PubMed PubMed Central Google Scholar
Chikhi, R., Limasset, A., Jackman, S., Simpson, J. T. and Medvedev, P. (2015) On the representation of de Bruijn graphs. J. Comput. Biol., 22, 336–352
Article CAS PubMed Google Scholar
Bowe, A., Onodera, T., Sadakane, K. and Shibuya, T. (2012) Succinct de Bruijn graphs. In: Algorithms in Bioinformatics, volume 7534 of Lecture Notes in Computer Science, Raphael, B. and Tang, J. (eds.), pp. 225–235. Springer Berlin Heidelberg
Chapter Google Scholar
Boucher, C., Bowe, A., Gagie, T., Puglisi, S. J. and Sadakane, K. (2015) Variable-order de Bruijn graphs. In Data Compression Conference (DCC), pp. 383–392
Belazzougui, D., Gagie, T., Mäkinen, V., Previtali, M. and Puglisi, S. J. (2018) Bidirectional variable-order de Bruijn graphs. Int. J. Found. Comput. Sci., 29, 1279–1295
Article Google Scholar
Muggli, M. D., Bowe, A., Noyes, N. R., Morley, P.S., Belk, K. E., Raymond, R., Gagie, T., Puglisi, S.J., Boucher, C. (2017) Succinct colored de Bruijn graphs. Bioinformatics, 33, 3181–3187
Article CAS PubMed PubMed Central Google Scholar
Almodaresi, F., Pandey, P. and Patro, R. (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In 17th International Workshop on Algorithms in Bioinformatics (WABI 2017).
Almodaresi, F., Sarkar, H., Srivastava, A. and Patro, R. (2018) A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics, 34, i169–i177
Article CAS PubMed PubMed Central Google Scholar
Muggli, M. D., Alipanahi, B., Boucher, C. (2019) Building large updatable colored de Bruijn graphs via merging. bioRxiv, 229641
Neil, C. Jones, Pavel A Pevzner, and Pavel Pevzner. (2004) An introduction to bioinformatics algorithms. MIT Press
Koren, S., Walenz, B. P., Berlin, K., Miller, J. R., Bergman, N. H. and Phillippy, A. M. (2017) Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res., 27, 722–736
Article CAS PubMed PubMed Central Google Scholar
Berlin, K., Koren, S., Chin, C.-S., Drake, J. P., Landolin, J. M. and Phillippy, A. M. (2015) Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol., 33, 623–630
Article CAS PubMed Google Scholar
Chu, J., Mohamadi, H., Warren, R. L., Yang, C. and Birol, I. (2017) Innovations and challenges in detecting long read overlaps: an evaluation of the state-of-the-art. Bioinformatics, 33, 1261–1270
CAS PubMed Google Scholar
Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. bioRxiv, 530972
Kamath, G. M., Shomorony, I., Xia, F., Courtade, T. A. and Tse, D. N. (2017) HINGE: long-read assembly achieves optimal repeat resolution. Genome Res., 27, 747–756
Article CAS PubMed PubMed Central Google Scholar
Myers, G. (2014) Efficient local alignment discovery amongst noisy long reads. In International Workshop on Algorithms in Bioinformatics, pp. 52–67. Springer
Miller, J. R., Delcher, A. L., Koren, S., Venter, E., Walenz, B. P., Brownley, A., Johnson, J., Li, K., Mobarry, C. and Sutton, G. (2008) Aggressive assembly ofpyrosequencing reads with mates. Bioinformatics, 24, 2818–2824
Article CAS PubMed PubMed Central Google Scholar
Lin, Y., Yuan, J., Kolmogorov, M., Shen, M. W., Chaisson, M. and Pevzner, P. A. (2016) Assembly of long error-prone reads using de Bruijn graphs. Proc. Natl. Acad. Sci. USA, 113, E8396–E8405
Article CAS PubMed PubMed Central Google Scholar
Prjibelski, A. D., Vasilinetc, I., Bankevich, A., Gurevich, A., Krivosheeva, T., Nurk, S., Pham, S., Korobeynikov, A., Lapidus, A., and Pevzner, P. A. (2014) Exspander: a universal repeat resolver for DNA fragment assembly. Bioinformatics, 30, i293–i301
Article CAS PubMed PubMed Central Google Scholar
Chin, C.-S., Peluso, P., Sedlazeck, F. J., Nattestad, M., Concepcion, G. T., Clum, A., Dunn, C., O’Malley, R., Figueroa-Balderas, R., Morales-Cruz, A., et al. (2016) Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods, 13, 1050–1054
Article CAS PubMed PubMed Central Google Scholar
Chin, C. S., Alexander, D. H., Marks, P., Klammer, A. A., Drake, J., Heiner, C., Clum, A., Copeland, A., Huddleston, J., Eichler, E. E., et al. (2013) Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods, 10, 563–569
Article CAS PubMed Google Scholar
Fasulo, D., Halpern, A., Dew, I., and Mobarry, C. (2002) Efficiently detecting polymorphisms during the fragment assembly process. Bioinformatics, 18, S294S302
Article Google Scholar
Myers, E. W., Sutton, G. G., Delcher, A. L., Dew, I. M., Fasulo, D. P., Flanigan, M. J., Kravitz, S. A., Mobarry, C. M., Reinert, K. H., Remington, K. A., et al. (2000) A whole-genome assembly of Drosophila. Science, 287, 2196–2204
Article CAS PubMed Google Scholar
Berger, A. and Lafferty, J. (2017) Information retrieval as statistical translation. ACM SIGIR Forum, 51, 219–226
Article Google Scholar
Li, H. (2016) Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110
Article CAS PubMed PubMed Central Google Scholar
Välimäki, N., Ladra, S. and Mäkinen, V. (2012) Approximate all-pairs suffix/prefix overlaps. Inf. Comput., 213, 49–58
Article Google Scholar
Egidi, L., Louza, F. A., Manzini, G. and Telles, G. (2018) External memory BWT and LCP computation for sequence collections with applications. In 18th International Workshop on Algorithms in Bioinformatics, WABI 2018, Helsinki, Finland, volume 113, pp. 1–14
Liu, B., Liu, C.-M., Li, D., Li, Y., Ting, H.-F., Yiu, S. M., Luo, R. and Lam, T. W. (2016) BASE: a practical de novo assembler for large genomes using long NGS reads. BMC Genomics, 17, 499
Article PubMed PubMed Central Google Scholar
Sohn, J. and Nam, J.-W. (2016) The present and future of de novo whole-genome assembly. Brief. Bioinform., bbw096
Simpson, J. T. (2014) Exploring genome characteristics and sequence quality without a reference. Bioinformatics, 30, 1228–1235
Article CAS PubMed PubMed Central Google Scholar
Garrison, E., Sirén, J., Novak, A. M., Hickey, G., Eizenga, J. M., Dawson, E. T., Jones, W., Garg, S., Markello, C., Lin, M. F., et al. (2018) Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol., 36, 875–879
Article CAS PubMed PubMed Central Google Scholar
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. and McVean, G. (2012) De novo assembly and genotyping ofvariants using colored de Bruijn graphs. Nat. Genet., 44, 226–232
Article CAS PubMed PubMed Central Google Scholar
Pandey, P., Almodaresi, F., Bender, M. A., Ferdman, M., Johnson, R. and Patro, R. (2018) Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst., 7, 201–207.e4
Article CAS PubMed Google Scholar
Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. and Birol, I. (2009) ABySS: a parallel assembler for short read sequence data. Genome Res., 19, 1117–1123
Article CAS PubMed PubMed Central Google Scholar
Holt, J. and McMillan, L. (2014) Merging of multi-string BWTs with applications. Bioinformatics, 30, 3524–3531
Article CAS PubMed PubMed Central Google Scholar
Minkin, I., Pham, S. and Medvedev, P. (2017) TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes. Bioinformatics, 33, 4024–4032
CAS PubMed Google Scholar
Marcus, S., Lee, H. and Schatz, M. C. (2014) SplitMEM: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30, 3476–3483
Article CAS PubMed PubMed Central Google Scholar
Cazaux, B., Lecroq, T. and Rivals, E. (2014) From indexing data structures to de Bruijn graphs. In Combinatorial Pattern Matching, pp. 89–99. Springer
Baier, U., Beller, T. and Ohlebusch, E. (2016) Graphical pangenome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics, 32, 497–504
Article CAS PubMed Google Scholar
Marschall, T., Marz, M., Abeel, T., Dijkstra, L., Dutilh, B. E., Ghaffaari, A., Kersey, P., Kloosterman, W., Makinen, V., Novak, A., et al. (2016) Computational pan-genomics: status, promises and challenges. Brief. Bioinform., 19, 118–135
Google Scholar

Download references

Author information

Authors and Affiliations

Department of Informatics, Systems and Communications, University of Milan-Bicocca, Milan, Viale Sarca 336, 20126, Italy
Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova & Paola Bonizzoni

Authors

Raffaella Rizzi
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Beretta
View author publications
You can also search for this author in PubMed Google Scholar
Murray Patterson
View author publications
You can also search for this author in PubMed Google Scholar
Yuri Pirola
View author publications
You can also search for this author in PubMed Google Scholar
Marco Previtali
View author publications
You can also search for this author in PubMed Google Scholar
Gianluca Della Vedova
View author publications
You can also search for this author in PubMed Google Scholar
Paola Bonizzoni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Paola Bonizzoni.

Ethics declarations

The authors Raffaella Rizzi, Stefano Beretta, Murray Patterson, Yuri Pirola, Marco Previtali, Gianluca Della Vedova and Paola Bonizzoni declare that they have no conflict of interests.

This article is a review article and does not contain any studies with human or animal subjects performed by any of the authors.

Additional information

Author summary: All available assemblers are built upon the notions of de Bruijn graphs or overlap graphs. This review discusses the most recent advances in the problem of constructing, representing and navigating those graphs, focusing on very large datasets. Moreover, we present some of the most relevant open problems.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Rizzi, R., Beretta, S., Patterson, M. et al. Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. Quant Biol 7, 278–292 (2019). https://doi.org/10.1007/s40484-019-0181-x

Download citation

Received: 04 June 2019
Revised: 03 July 2019
Accepted: 26 July 2019
Published: 16 December 2019
Issue Date: December 2019
DOI: https://doi.org/10.1007/s40484-019-0181-x

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era