Abstract
Long-read sequencing is now routinely used at scale for genomics and transcriptomics applications. Mapping long reads or a draft genome assembly to a reference sequence is often one of the most time-consuming steps in these applications. Here we present techniques to accelerate minimap2, a widely used software for this task. We present multiple optimizations using single-instruction multiple-data parallelization, efficient cache utilization and a learned index data structure to accelerate the three main computational modules of minimap2: seeding, chaining and pairwise sequence alignment. These optimizations result in an up to 1.8-fold reduction of end-to-end mapping time of minimap2 while maintaining identical output.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Data availability
Datasets used for benchmarking are publicly available (Supplementary Table 2). Human reference genome is available at https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/references/GRCh38/GCA_000001405.15_GRCh38_no_alt_analysis_set.fasta.gz. All ONT and PacBio HiFi datasets (HG002, HG003, HG004) used are available at https://precision.fda.gov/challenges/10/view. Datasets for PacBio CLR (HG002, HG003, HG004) are available at https://github.com/genome-in-a-bottle/giab_data_indexes. Genome assemblies are available at: CHM13: NCBI (GCA009914755.3), HG002 (hap1) and HG002 (hap2) are publicly available at ref. 33. The speedup shown in the paper can also be realized with a smaller subset of the above datasets. Source Data are provided with this paper.
Code availability
The mm2-fast source code is available under the open source MIT license at https://github.com/bwa-mem2/mm2-fast. The particular version of mm2-fast used in this manuscript is publicly available at ref. 34. The scripts used for the experiments in the manuscript are available at ref. 35.
Optimization Notice: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. Intel, Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.
References
Chaisson, M. J. et al. Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1–16 (2019).
Conesa, A. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 17, 1–19 (2016).
Beyter, D. et al. Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat. Genet. 53, 779–886 (2021).
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
De Coster, W., Weissensteiner, M. H. & Sedlazeck, F. J. Towards population-scale long-read sequencing. Nat. Rev. Genet. 22, 572–587 (2021).
PromethION Brochure (Nanophore Technologies, 2021); https://nanoporetech.com/sites/default/files/s3/literature/PromethION-brochure.pdf
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Guo, L., Lau, J., Ruan, Z., Wei, P. & Cong, J. Hardware acceleration of long read pairwise overlapping in genome sequencing: a race between FPGA and GPU. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines 127–135 (IEEE, 2019).
Zeni, A. et al. LOGAN: high-performance GPU-based X-drop long-read alignment. In 2020 IEEE International Parallel and Distributed Processing Symposium 462–471 (IEEE, 2020).
Feng, Z., Qiu, S., Wang, L. & Luo, Q. Accelerating long read alignment on three processors. In Proc. 48th International Conference on Parallel Processing 1–10 (ACM, 2019).
Roberts, M., Hayes, W., Hunt, B. R., Mount, S. M. & Yorke, J. A. Reducing storage requirements for biological sequence comparison. Bioinformatics 20, 3363–3369 (2004).
Abouelhoda, M. I. & Ohlebusch, E. Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3, 321–341 (2005).
Jain, C., Gibney, D. & Thankachan, S. V. Co-linear chaining with overlaps and gap costs. Preprint at https://www.biorxiv.org/content/10.1101/2021.02.03.429492v2 (2021).
Ho, D. et al. LISA: learned indexes for DNA sequence analysis. Preprint at https://arxiv.org/abs/1910.04728 (2020).
Schneider, V. A. et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 27, 849–864 (2017).
Nurk, S., Koren, S., Rhie, A., Rautiainen, M. et al. The complete sequence of a human genome. Preprint at https://doi.org/10.1101/2021.05.26.445798 (2021).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Payne, A. et al. Readfish enables targeted nanopore sequencing of gigabase-sized genomes. Nat. Biotechnol. 39, 442–450 (2021).
Kovaka, S., Fan, Y., Ni, B., Timp, W. & Schatz, M. C. Targeted nanopore sequencing by real-time mapping of raw electrical signal with uncalled. Nat. Biotechnol. 39, 431–441 (2021).
Zhang, H. et al. Real-time mapping of nanopore raw signals. Bioinformatics https://doi.org/10.1093/bioinformatics/btab264 (2021).
Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A.M. A long read mapping method for highly repetitive reference sequences. Preprint at https://www.biorxiv.org/content/10.1101/2020.11.01.363887v1.full (2020).
Sedlazeck, F. J. et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat. Methods 15, 461–468 (2018).
Ren, J. & Chaisson, M. lRA: the long read aligner for sequences and contigs. Preprint at https://doi.org/10.1371/journal.pcbi.1009078 (2020).
Kraska, T., Beutel, A., Chi, E.H., Dean, J. & Polyzotis, N. The case for learned index structures. In ACM International Conference on Management of Data 489–504 (ACM, 2018).
Galakatos, A., Markovitch, M., Binnig, C., Fonseca, R. & Kraska, T. FITing-Tree: a data-aware index structure. In SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data 1189–1206 (ACM, 2019); https://doi.org/10.1145/3299869.3319860
Ferragina, P. & Vinciguerra, G. The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB 13, 1162–1175 (2020).
Ding, J. et al. ALEX: An Updatable Adaptive Learned Index. In SIGMOD ‘20: Proceedings of the 2020 International Conference on Management of Data 969-984 (ACM, 2020). https://doi.org/10.1145/3318464.3389711
Wu, Y., Yu, J., Tian, Y., Sidle, R. & Barber, R. Designing succinct secondary indexing mechanism by exploiting column correlations. In SIGMOD ’19: Proceedings of the 2019 International Conference on Management of Data 1223–1240 (ACM, 2019). https://doi.org/10.1145/3299869.3319861
Kirsche, M., Das, A. & Schatz, M. C. Sapling: accelerating suffix array queries with learned data models. Bioinformatics 37, 744–749 (2021).
Marcus, R. et al. Benchmarking learned indexes. In PVLDB Vol. 14, 1–13 (2021).
Marcus, R., Zhang, E. & Kraska, T. CDFShop: exploring and optimizing learned index structures. In SIGMOD ’20: Proc. 2020 ACM SIGMOD International Conference on Management of Data 2789–2792 (ACM, 2020); https://doi.org/10.1145/3318464.3384706
Suzuki, H. & Kasahara, M. Introducing difference recurrence relations for faster semi-global alignment of long sequences. BMC Bioinformatics 19, 33–47 (2018).
Cheng, H., Concepcion, G., Feng, X., Zhang, H. & Li, H. Human Assemblies Evaluated in the Hifiasm Paper (Zenodo, 2020); https://doi.org/10.5281/zenodo.4393631
Kalikar, S., Jain, C., Md, V. & Misra, S. mm2-fast Source Code Used in the Manuscript—Accelerating Minimap2 for Long-Read Sequencing Applications on Modern CPUs (Zenodo, 2022); https://doi.org/10.5281/zenodo.5888171
Kalikar, S., Jain, C., Md, V. & Misra, S. Scripts Used for the Experiments in the Manuscript—Accelerating Minimap2 for Long-Read Sequencing Applications on Modern CPUs (Zenodo, 2022); https://doi.org/10.5281/zenodo.5884451
Acknowledgements
This work is supported in part by the National Supercomputing Mission (NSM) India under DST/NSM/R&D_HPC_Applications to C.J. The authors are grateful to H. Li for guidance and technical discussions on minimap2 and working with us to get our improvements integrated in a branch of minimap2 github repo.
Author information
Authors and Affiliations
Contributions
S.K. led the software implementation of mm2-fast. All authors contributed to algorithm design, experiments and manuscript preparation, and read and approved the final manuscript.
Corresponding authors
Ethics declarations
Competing interests
S.K., V.M. and S.M. are employees of Intel Corporation.
Peer review
Peer review information
Nature Computational Science thanks Aydin Buluc, Zemin Ning and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Handling editor: Fernando Chirigati, in collaboration with the Nature Computational Science team.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Minimap2 workflow depicting its three key modules – (i) seeding, (ii) chaining, and (iii) alignment – and mm2-fast optimizations.
The seeding stage identifies short fixed-length exact matches between a read and a reference sequence. Chaining stage selects an ordered subset of these exact matches (anchors) to form a chain. The final alignment stage computes base-level alignments for filling the gaps between adjacent anchors in these chains. Our optimizations to each of the modules are shown in the blue dotted rectangle.
Extended Data Fig. 2 Cross-platform performance of our optimizations for Rome, Skylake, Cascade Lake and Ice Lake architectures using single socket.
X-axis shows various query datasets and y-axis indicates the speedup achieved by mm2-fast over minimap2 – both running on the same CPU.
Extended Data Fig. 3 Data structures used for hash table.
Minimizers extracted from the reference sequence are stored in a sorted list as key-value pairs. Position list maintains a separate list of the positions of minimizers on the reference sequence.
Extended Data Fig. 4 Two-layer RMI.
An example minimizer lookup is illustrated - get_mm_hits(mm5) calls a lookup for a minimizer mm5. The RMI root predicts the leaf layer model which in turn predicts the location of mm4 in the sorted list. Finally, the last mile search from mm4 walks to the location of mm5 and returns its value to the caller.
Extended Data Fig. 5 Chaining of two co-linear anchors A and B.
Here two anchors overlap on the query sequence. Gap cost function in minimap2 is calculated using the reference gap, query gap, and the average length of all anchors avg_qlen.
Supplementary information
Supplementary Information
Supplementary Tables 1–4, Figs. 1 and 2, Algorithms 1 and 2, and Sections 1 and 2.
Supplementary Data 1
Source data showing the single-threaded and multithreaded runtime of mm2-fast.
Supplementary Data 2
Source data showing the time spent by mm2-fast and minimap2 in the chaining module.
Source data
Source Data Fig. 1
Source Data showing the time spent by mm2-fast and minimap2 in various modules.
Source Data Fig. 2
Source Data showing the end-to-end mapping time of mm2-fast and minimap2 on the full datasets.
Source Data Extended Data Fig. 2
Source data showing the speedups of mm2-fast on various architectures.
Rights and permissions
About this article
Cite this article
Kalikar, S., Jain, C., Vasimuddin, M. et al. Accelerating minimap2 for long-read sequencing applications on modern CPUs. Nat Comput Sci 2, 78–83 (2022). https://doi.org/10.1038/s43588-022-00201-8
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s43588-022-00201-8
- Springer Nature America, Inc.
This article is cited by
-
Efficient end-to-end long-read sequence mapping using minimap2-fpga integrated with hardware accelerated chaining
Scientific Reports (2023)
-
MetaCC allows scalable and integrative analyses of both long-read and short-read metagenomic Hi-C data
Nature Communications (2023)