Main

Mosaic mutations are ubiquitous in the body and accumulate throughout life in every cell1,2. Most mosaic mutations begin as nucleotide mismatches or damage in only one of the two strands of the DNA double helix3,4. When these single-strand DNA (ssDNA) events are misrepaired, or when they are replicated during the cell cycle before repair, they then become permanent double-strand DNA (dsDNA) mosaic mutations3. Although current methods for profiling mosaic changes to DNA achieve high fidelity for dsDNA mutations, they cannot accurately resolve these precursor ssDNA events. This is because current methods—single-cell genome sequencing5, in vitro cloning of single cells6, microdissection or biopsy of clonal populations7, and duplex sequencing8,9—amplify the original DNA molecules before sequencing, either prior to or on the sequencer itself. This masks true ssDNA events by either transforming existing ssDNA mismatches and damage to dsDNA mutations, or by introducing artefactual ssDNA mismatches and damage8.

Mosaic dsDNA mutations are the result of the interaction between ssDNA mismatch and damage events, DNA repair, and DNA replication3. Consequently, dsDNA mutational signatures (that is, the sequence contexts of mutations) may not reflect the patterns of the originating ssDNA events4. dsDNA mutation profiling also does not resolve on which strands the initiating ssDNA events occur. Therefore, a complete understanding of mutational processes requires profiling of ssDNA mismatches and damage3,10. Here, to study the ssDNA origins of mosaic mutations, we developed an approach for direct sequencing of single DNA molecules without any previous amplification that achieves, for single-base substitutions, single-molecule fidelity detection of dsDNA mutations simultaneously with ssDNA mismatches and damage.

HiDEF-seq

Profiling dsDNA mosaic mutations in human tissues requires single-molecule fidelity of less than 1 error per 1 billion bases (10⁻⁹), and profiling ssDNA mismatch and damage events would probably require similar or greater fidelity8,10. However, to our knowledge, no technology to date has achieved this fidelity when directly sequencing unamplified single DNA molecules. To achieve this, we developed HiDEF-seq. HiDEF-seq substantially increases the fidelity of single-molecule sequencing by (1) increasing the number of independent sequencing passes per strand (median of 32 passes with a median of 1.7 kilobase (kb) molecules) relative to standard single-molecule sequencing11 to create a high-quality consensus sequence for each strand; (2) eliminating in vitro artefacts during library preparation by ssDNA nick ligation and by using either the NanoSeq A-tailing approach8 or a protocol without A-tailing for post-mortem samples with degraded DNA; and (3) using a computational pipeline that avoids analytic artefacts (Fig. 1a,b, Methods, Extended Data Figs. 1–5 and Supplementary Note 1). HiDEF-seq libraries are sequenced on Pacific Biosciences (PacBio) single-molecule, long-read sequencers. The computational pipeline analyses single-base substitutions, as these have an orthogonal error profile to the prevalent insertion and deletion sequencing errors of single-molecule sequencing12, and it analyses each strand separately to distinguish between dsDNA and ssDNA events (Methods).

Fig. 1: Overview of HiDEF-seq.

a, HiDEF-seq schematic. A-tailing uses dATP and non-A dideoxynucleotides8, except for lower-quality post-mortem samples that use only non-A dideoxynucleotides to avoid dATP misincorporation at residual nicks (Methods and Extended Data Fig. 5). b, The average fraction of molecules across representative HiDEF-seq samples (n = 51) and standard PacBio sequencing (HiFi) samples (n = 10) in different bins of number of passes per strand (Methods). The average percentage of molecules with ≥5 and ≥20 passes per strand is 99.8% and 70% for HiDEF-seq, respectively, and 78.7% and 0.1% for HiFi, respectively. The plot shows HiDEF-seq molecules output by the pipeline’s primary data-processing step. x-axis square brackets and parentheses signify inclusion and exclusion of bin end points, respectively. c, HiDEF-seq and NanoSeq dsDNA mutation burdens in sperm samples (left to right, SPM-1013, SPM-1002, SPM-1004, SPM-1020, SPM-1060) were compared for each age to paternally phased de novo mutations from a previous study14. d, HiDEF-seq dsDNA mutation burdens in human tissues (Supplementary Table 1). The dashed lines show weighted least-squares linear regressions. e,f, HiDEF-seq versus NanoSeq dsDNA mutation burdens (e) and ssDNA call burdens (f). Samples are (top to bottom in legend): SPM-1013, SPM-1002, SPM-1004, SPM-1020, SPM-1060, 1443, 1105, 6501, 63143. Only sample 63143 (POLE p.M444K) is from an individual with a cancer predisposition syndrome. The dashed line shows y = x (expectation for concordance). g, HiDEF-seq versus NanoSeq ssDNA call burdens separated by call type. For each call type, each bar represents a different sperm sample (left to right, the same samples as in c). b, Error bars show standard deviations. c–f, Dots and error bars show point estimates and their Poisson 95% confidence intervals. c, Box plots show the median (centre line), the first and third quartiles (box limits), and the 5% and 95% quantiles (whiskers). c,e,f, For each sample, HiDEF-seq and NanoSeq confidence intervals were normalized to reflect an equivalent number of interrogated base pairs (c and e) or bases (f) (Methods). mo, months old; yo, years old.

We profiled purified human sperm with HiDEF-seq as the most rigorous test of fidelity for detecting dsDNA mosaic mutations, as sperm have the lowest dsDNA mutation burden of any readily accessible human cell type13. Sperm dsDNA mutation burdens measured by HiDEF-seq were concordant with a previous study of de novo mutations14 and with NanoSeq profiling8 (a method for duplex sequencing of mosaic dsDNA mutations) that we performed for the same samples (Fig. 1c). HiDEF-seq also measured the expected dsDNA mutational signatures and linear increase in dsDNA mutation burdens with age in other human tissues (liver, kidney, blood and cerebral cortex neurons)8,15, with one outlier blood sample of an individual with a kidney transplant (Fig. 1d, Extended Data Fig. 5i and Supplementary Note 2).

Notably, relaxing from a threshold of ≥20 to ≥5 sequencing passes per strand, while keeping our optimized computational filters, produced concordant dsDNA mutation burdens (Extended Data Fig. 3e). This suggests that PacBio sequencing can achieve a higher per-pass fidelity for substitutions than estimated by previous studies11. Using the probability of complementary single-strand calls occurring at the same position (Methods), we estimate HiDEF-seq’s fidelity for dsDNA mutations as less than 1 error per 3 × 10¹³ base pairs (bp) with ≥5 passes per strand and less than 1 error per 1 × 10¹⁴ bp with ≥20 passes per strand. Accordingly, for analysis of dsDNA mutations, we used the lower threshold of ≥5 passes per strand as this increases the percentage of analysed molecules from 70% to 99.8% (of molecules passing primary data processing), and it increases the percentage of interrogated bases by 11%. HiDEF-seq uses restriction enzyme fragmentation that captures approximately 40% of the human genome (Extended Data Fig. 1a), which is sufficient for obtaining accurate mosaic mutation burdens and mutational patterns8. It can also use random fragmentation to enable profiling of any genomic region, although this requires more input DNA (Methods). We also successfully quantified dsDNA mutation burdens in sperm using HiDEF-seq with larger DNA fragments (median, 4.2 kb), which have correspondingly fewer (median, 15) passes per strand (Supplementary Note 3). However, for this study, we proceeded with HiDEF-seq with the smaller median 1.7 kb fragments, as a higher threshold of ≥20 passes per strand was required for ssDNA analysis.
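As a rough illustration of why complementary single-strand calls at the same position are so unlikely to arise by chance, the following Python sketch multiplies two per-strand call rates. It is not the calculation described in the Methods: the function name, the independence assumption, the assumption that an erroneous base is equally likely to be any of the three alternatives, and the example rate of 10⁻⁷ calls per base are all hypothetical and chosen only for illustration.

```python
def duplex_error_rate(strand_rate_a, strand_rate_b, n_alternatives=3):
    """Approximate rate of false dsDNA calls per base pair.

    Simplifying assumptions (not taken from the paper's Methods):
    - erroneous calls on the two strands occur independently;
    - an erroneous call on the second strand is equally likely to be any
      of the n_alternatives other bases, so only 1 in n_alternatives is
      the exact complement of the call on the first strand.
    """
    return strand_rate_a * strand_rate_b / n_alternatives


# Hypothetical per-strand call rate, used only to show the scale of the effect.
example_rate = duplex_error_rate(1e-7, 1e-7)
print(f"about 1 false dsDNA call per {1 / example_rate:.1e} bp")
```

Under these assumptions the false dsDNA call rate scales with the product of the two per-strand rates, which is why requiring agreement between strands improves fidelity by many orders of magnitude over either strand alone.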

We next analysed ssDNA calls. Importantly, these may include not only ssDNA mismatches, but also damaged bases that alter base pairing and lead to misincorporation of nucleotides by the sequencer polymerase. The latter may be advantageous as it would enable high-fidelity detection of ssDNA damage. In contrast to dsDNA mutation analysis, duplex error correction is not possible for ssDNA calls, and true ssDNA call burdens (calls per base) are unknown. Thus, for ssDNA calling, we optimized key analytic parameters by identifying filter thresholds above which ssDNA burden estimates are stable (Methods and Extended Data Fig. 3i,j). To compare ssDNA calls between HiDEF-seq and NanoSeq, we profiled 9 samples using both methods. Although HiDEF-seq and NanoSeq dsDNA mutation burdens and patterns were concordant, HiDEF-seq measured on average 18-fold lower ssDNA call burdens, with distinct patterns, and 5-fold lower when considering only C>T calls (Fig. 1e–g and Extended Data Fig. 6a–c). This suggests that, while NanoSeq achieves high fidelity for dsDNA mutations, its ssDNA calls are largely artefactual as suggested by its developers8. HiDEF-seq ssDNA burdens in cerebral cortex neurons were also around 13-fold lower than estimated by Meta-CS single-cell duplex sequencing16, with a distinct pattern, and about 4-fold lower when considering only C>T calls (Supplementary Tables 2 and 3). Overall, by direct interrogation of unamplified single molecules, HiDEF-seq achieves, to our knowledge, the highest fidelity for single-base changes of any DNA-sequencing method to date.

Cancer predisposition syndromes

As there is no previous method for sequencing ssDNA mismatches with single-molecule fidelity, we sought to confirm the veracity of HiDEF-seq’s ssDNA calls by profiling samples from individuals with inherited cancer predisposition syndromes that may have elevated ssDNA call burdens. We profiled 17 blood, primary fibroblast, and lymphoblastoid cell line samples from 8 different cancer predisposition syndromes, including defects in nucleotide excision repair, mismatch repair, polymerase proofreading, and base excision repair (Supplementary Tables 1 and 2). In these samples, we first confirmed HiDEF-seq’s single-molecule fidelity for dsDNA mutations by measuring the expected dsDNA mutation burdens and signatures based on previous studies17,18,19,20,21 (Extended Data Fig. 7a–d and Supplementary Tables 2 and 4).

Notably, compared to non-cancer predisposition samples, we detected higher ssDNA call burdens in two cancer predisposition syndromes: a 2.6-fold increase (95% confidence interval: 2.3–3.0) in POLE polymerase proofreading-associated polyposis syndrome samples (PPAP; germline heterozygous exonuclease domain mutations in POLE, which encodes the catalytic subunit of polymerase epsilon that performs leading strand genome replication22), and a 1.6-fold increase (95% confidence interval: 1.4–1.9) in congenital mismatch repair deficiency syndrome samples (CMMRD; MSH2, MSH6, and PMS2 germline biallelic loss of function) (Fig. 2a). Moreover, the percentage of purine ssDNA calls (G>T/C/A and A>T/G/C) was elevated in PPAP samples (average, 61%; range, 52–73%) and CMMRD samples (average, 33%; range, 23–57%) compared to non-cancer predisposition samples (average, 20%; range, 12–29%) (Fig. 2b). In PPAP samples, this was largely due to increased G>T, G>A, and A>C ssDNA calls, while CMMRD samples exhibited smaller alterations in sequence contexts of ssDNA calls (Fig. 2b). These data indicate that most ssDNA calls in PPAP samples, and at least some calls in CMMRD samples, are bona fide ssDNA mismatches.

Fig. 2: ssDNA call burdens and patterns in cancer predisposition syndromes.

a, ssDNA call burdens in blood (B), fibroblasts (F) and lymphoblastoid cell lines (L) from individuals without and with cancer predisposition syndromes. Burdens are corrected for trinucleotide context opportunities and detection sensitivity (Methods). Statistical analysis was performed using two-sided Poisson rates ratio tests, combining calls and interrogated bases from each group, with Holm multiple-comparison adjustment; ***P = 2 × 10⁻¹⁰ for mismatch repair and P < 10⁻¹⁵ for polymerase proofreading, versus non-cancer predisposition samples. Results were also significant when including only blood samples. Samples (left to right) are: 5203, 1105, 1301, 6501, 1901, GM12812, GM02036, GM03348, GM16381, GM01629, GM28257, 55838, 58801, 57627, 1400, 1324, 1325, 60603, 59637, 57615, 63143 (L), 63143 (B), CC-346-253, CC-388-290, and CC-713-555. Cancer predisposition samples are ordered as in b, which lists the affected genes. b, ssDNA call burdens by context, corrected for trinucleotide context opportunities. Statistical analysis was performed using heteroscedastic two-tailed t-tests, adjusted for multiple comparisons; *P = 0.03, ***P = 0.0008. Only non-cancer predisposition samples with >30 ssDNA calls were included (1105, 1301, 1901, GM12812, GM03348), as patterns are not reliably ascertained with fewer calls. However, GM16381 (XPC) with <30 calls was included for completeness in showing all cancer predisposition samples. c,d, Spectra of ssDNA calls (c) and dsDNA mutations (d) for representative POLE PPAP sample 57615, corrected for trinucleotide context opportunities. e, Top, the ssDNA mismatch signature SBS10ss extracted from all PPAP samples while simultaneously fitting SBS30ss* (Fig. 4d). Middle, SBS10ss projected to central pyrimidine contexts by summing central pyrimidine values and their reverse-complement central purine values to enable comparison to dsDNA signatures. Bottom, the dsDNA mutational signature (sum of SBSE and SBSF) extracted from PPAP samples. f, The fraction of ssDNA calls attributed to ssDNA signatures in PPAP samples (same PPAP sample order as in a). Cosine similarities of original spectra to spectra reconstructed from signatures (left to right) were: 0.94, 0.97, 0.97, 0.85. Sample details for a and b are provided in Supplementary Tables 1 and 2. a, Error bars show Poisson 95% confidence intervals.

To further characterize the patterns of ssDNA mismatches in POLE PPAP samples, we plotted their 192-trinucleotide context spectra (standard 96-trinucleotide context spectra, separated by central pyrimidine versus central purine). This revealed a distinct pattern, with two large peaks for AGA>ATA and AAA>ACA accounting for around 15–20% and about 5–10% of ssDNA mismatches, respectively, in addition to smaller peaks with G>T, G>A, A>C, and C>T contexts (Fig. 2c and Supplementary Table 3). The ssDNA mismatch spectra were highly concordant with the dsDNA mutation spectra of these same samples (Fig. 2d and Supplementary Table 4), confirming that these are true ssDNA mismatches—arising from polymerase epsilon nucleotide misincorporation—that lead to the subsequent pattern of accumulated dsDNA mutations. De novo extraction of ssDNA mismatch signatures from PPAP samples produced a signature that we name SBS10ss (SBS, single-base-substitution; ss, single-strand) (Fig. 2e). Note that we propose a nomenclature with the suffix ‘ss’ to distinguish between ssDNA and dsDNA signatures. Projecting SBS10ss to central pyrimidine contexts, by summing central purine and central pyrimidine spectra, produced a spectrum remarkably similar (cosine similarity = 0.97) to the dsDNA signatures extracted de novo (SBSE + SBSF) from these same samples (Fig. 2e), again indicating that the ssDNA mismatches are the inciting events leading to the dsDNA mutations. SBS10ss also had strong similarity (cosine similarity = 0.90) to COSMIC23 SBS10c that was previously associated with POLE PPAP17. SBS10ss accounted for an average of 79% (range, 70–91%) of ssDNA calls in PPAP samples, with the remaining attributed to SBS30ss*, a ssDNA cytosine deamination damage signature (asterisk (*) indicates damage) that is described in a subsequent section (Fig. 2f). For CMMRD samples, the number of ssDNA calls was too low to extract a signature.
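The projection of a 192-context ssDNA spectrum to the 96 central pyrimidine contexts, and the cosine similarity used to compare spectra, can be sketched in Python as follows. This is a minimal illustration and not the study's pipeline: the dictionary representation keyed by (trinucleotide, substituted base) and the function names are assumptions made here, and the toy counts are the four PPAP mismatch counts reported in the text rather than a full spectrum.

```python
import math

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def project_to_pyrimidine(spectrum):
    """Collapse a 192-context spectrum to 96 central-pyrimidine contexts.

    spectrum maps (trinucleotide, alt_base) -> value. Central-purine
    entries are added to their reverse-complement pyrimidine entry.
    """
    projected = {}
    for (tri, alt), value in spectrum.items():
        if tri[1] in "AG":  # central purine: flip to the other strand
            tri, alt = reverse_complement(tri), alt.translate(COMPLEMENT)
        projected[(tri, alt)] = projected.get((tri, alt), 0.0) + value
    return projected


def cosine_similarity(a, b):
    """Cosine similarity of two spectra stored as dicts (missing keys = 0)."""
    keys = sorted(set(a) | set(b))
    va = [a.get(k, 0.0) for k in keys]
    vb = [b.get(k, 0.0) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0


# Toy spectrum: AGA>ATA, TCT>TAT, AAA>ACA and TTT>TGT mismatch counts.
toy_ss = {("AGA", "T"): 73, ("TCT", "A"): 10, ("AAA", "C"): 26, ("TTT", "G"): 2}
toy_projected = project_to_pyrimidine(toy_ss)
```

In this toy case the AGA>ATA and TCT>TAT counts collapse into a single TCT>TAT entry, which is why the projected spectrum can be compared with dsDNA signatures but no longer carries strand information.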

The two most frequent ssDNA mismatch contexts in PPAP samples are also notable for the asymmetry of their prevalence relative to their reverse complements: AGA>ATA versus TCT>TAT (73 versus 10 mismatches across all PPAP samples; χ² test, P < 0.0001) and AAA>ACA versus TTT>TGT (26 versus 2 mismatches; χ² test, P < 0.0001). These data provide a direct observation that the dsDNA mutational context AGA>ATA / TCT>TAT prevalent in POLE PPAP arises in vivo significantly more frequently from C:dT (template base:polymerase incorporated base) misincorporations than G:dA misincorporations, and that the dsDNA mutational context AAA>ACA / TTT>TGT arises in vivo more frequently from T:dC than A:dG misincorporations. These results are consistent with previous studies that indirectly inferred this asymmetry in yeast24 and human tumours25,26,27 harboring mutations in the polymerase epsilon exonuclease domain by identifying asymmetries in the prevalence of dsDNA mutation contexts relative to their reverse complement contexts depending on whether the mutation locus is preferentially replicated through leading-strand versus lagging-strand synthesis. However, while these studies rely on replication timing data that imperfectly estimates the probability of leading- versus lagging-strand replication to measure this asymmetry, our single-molecule detection of nucleotides that were misincorporated by polymerases in vivo enables us to measure this asymmetry directly. Our results are also consistent with in vitro polymerase gap-filling assays25,28, but, in contrast to our detection of in vivo misincorporation events, these assays lack the full context of DNA replication and repair. We also applied the above studies’ indirect replication timing analysis and similarly found in our POLE PPAP samples a higher frequency of AGA>ATA dsDNA mutations and AGA>ATA ssDNA mismatches on the strand that is preferentially replicated in the leading direction (Extended Data Fig. 7e,f). Together, our results demonstrate direct measurements of in vivo ssDNA mismatch burdens and patterns.
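The strand asymmetry counts reported above can be checked with a short Python sketch. The text does not specify which form of the χ² test was used, so this assumes a one-degree-of-freedom goodness-of-fit test against an equal split between a context and its reverse complement; the function name is hypothetical.

```python
import math


def chi2_equal_split(count_a, count_b):
    """Chi-square goodness-of-fit test of two counts against a 50:50 split.

    Returns (statistic, p_value). For 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(statistic / 2)).
    """
    expected = (count_a + count_b) / 2
    statistic = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Counts of ssDNA mismatches across all PPAP samples, as reported in the text.
aga_stat, aga_p = chi2_equal_split(73, 10)  # AGA>ATA versus TCT>TAT
aaa_stat, aaa_p = chi2_equal_split(26, 2)   # AAA>ACA versus TTT>TGT
print(f"AGA>ATA vs TCT>TAT: chi2 = {aga_stat:.1f}, P = {aga_p:.1e}")
print(f"AAA>ACA vs TTT>TGT: chi2 = {aaa_stat:.1f}, P = {aaa_p:.1e}")
```

Under this assumed test, both comparisons give P values below 0.0001, in line with the reported significance.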

Hypermutating tumours

To study the interaction between ssDNA mismatches introduced during replication and mismatch repair, we profiled three hypermutating brain tumours from individuals with CMMRD whose tumours also contained somatic mutations affecting polymerase proofreading. We excluded one tumour (tumour 3) from further analysis due to a very high ssDNA C>T burden attributed to SBS30ss* (a ssDNA cytosine deamination damage signature described in the next section) that probably arose ex vivo (Supplementary Tables 2 and 3). The other two tumours, a medulloblastoma and a glioblastoma (both with biallelic germline PMS2 mutations and somatic POLE exonuclease domain mutations), had higher burdens and distinct patterns of dsDNA mutations and ssDNA calls compared with samples deficient in only mismatch repair or only polymerase proofreading (Figs. 2a–d and 3, Extended Data Figs. 7a,b and 8a–c and Supplementary Tables 2–4). Additionally, the dsDNA mutation spectra of these tumours resembled those found in previous studies of tumours and cell lines deficient in both mismatch repair and polymerase proofreading29,30,31,32 (Fig. 3). Most dsDNA mutations were attributed to a signature with moderate similarity to COSMIC SBS14 (cosine similarity = 0.85)31 (Extended Data Fig. 8e). Moreover, the dsDNA mutation spectra of the tumours resembled their ssDNA call spectra (Fig. 3 and Extended Data Fig. 8b,c), except for ssDNA C>T calls related to SBS30ss* (Fig. 3 and Extended Data Fig. 8f).

Fig. 3: Hypermutating tumours deficient in both mismatch repair and polymerase proofreading.

Spectra of ssDNA calls (top) and dsDNA mutations (bottom) in tumour samples corrected for trinucleotide context opportunities. The parentheses show the total number of raw calls and the percentage of calls that are C>T after correction for trinucleotide context opportunities. The blue annotation on the top right of each ssDNA spectrum is the cosine similarity of only the ssDNA C>T calls to SBS30ss* (details of SBS30ss* are shown in Fig. 4d). Also annotated are the cosine similarities of each sample’s full ssDNA call spectrum (projected to central pyrimidine context) to its dsDNA mutation spectrum, for all ssDNA calls and after excluding ssDNA C>T calls (most of which are due to SBS30ss* cytosine deamination). Medulloblastoma ID: tumour 8; glioblastoma ID: tumour 10. Sample details are provided in Supplementary Table 1.

Importantly, the ssDNA call spectra of the tumours had notable differences relative to ssDNA call spectra of samples deficient in only polymerase proofreading, including increases in ssDNA AG>AT calls flanked by 3′ C/G/T, and increases in ssDNA G>A, A>G, and T>C calls (Figs. 2c and 3 and Supplementary Table 3). These differences in ssDNA call spectra of polymerase proofreading-deficient samples with and without mismatch repair deficiency are consistent with previous studies suggesting that mismatch repair is more efficient for certain mismatches caused by deficient polymerase proofreading32,33. The tumours’ relative increase in ssDNA C>T calls largely arose from cytosine deamination damage rather than polymerase misincorporation (Figs. 3 and 4d and Extended Data Fig. 8f). The ssDNA call spectra further resolve the identity of the nucleotides misincorporated by proofreading-deficient polymerase epsilon—for example, C>T / G>A dsDNA mutations largely arise from C:dA rather than G:dT misincorporations (Fig. 3). We extracted a ssDNA mismatch signature from tumour samples that we name SBS14ss, as after projecting it to central pyrimidine contexts, its most similar COSMIC dsDNA signature is SBS14 (cosine similarity = 0.73 for all ssDNA calls and 0.96 for only C>A ssDNA calls) (Extended Data Fig. 8d). SBS14ss accounted for most ssDNA calls in both tumours (Extended Data Fig. 8f). We also profiled post-mortem brain and spinal cord of individuals with MSH2 and MSH6 CMMRD who died of brain tumours harboring somatic POLE mutations. This revealed not only an elevated burden of SBS1 dsDNA mutations as seen in a previous study19, but also an elevated burden of ssDNA C>T calls at CG dinucleotides (Supplementary Note 4). This demonstrates that HiDEF-seq can also detect the ssDNA precursor lesions of SBS1 when this mutational process is elevated.

Fig. 4: ssDNA damage signatures of sperm and heat-treated DNA.

a, Spectrum of all ssDNA calls of non-cancer predisposition blood samples (one sample each from individuals 1105, 1301, 5203 and 6501, and five samples from individual 1901). The cosine similarity to COSMIC SBS30 was calculated after projecting the ssDNA spectrum to central pyrimidine contexts. b, dsDNA mutation and ssDNA call burdens of heat-treated DNA. c, ssDNA call spectra of representative sperm and heat-treated blood DNA samples, and SBS30 for comparison. d, SBS30ss* obtained by de novo signature extraction from central pyrimidine ssDNA calls of sperm and heat-treated samples. The cosine similarity to SBS30 was calculated after projecting to central pyrimidine contexts. e, Schematic of PW and IPD measured for incorporated bases during sequencing. f, Average PW ratios for positions −1 to +6 (relative to C>T calls), which is the polymerase footprint that has a kinetic signal that differs from the flanking baseline. Unbiased hierarchical clustering (dendrogram) separates ssDNA C>T calls from dsDNA C>T mutations and from kinetic profiles with randomized molecule labels. Positions +1 and +3 (stars) best discriminate ssDNA C>T damage from dsDNA C>T mutations. dsDNA ‘Blood, heat’ samples were heat treated at 56 °C and 72 °C (both 3 hours and 6 hours for each). dsDNA ‘Blood’: n = 4 samples; dsDNA ‘Kidney and liver’: n = 10 samples. a,c,d, HiDEF-seq spectra were corrected for trinucleotide context opportunities. b, Bars and error bars show point estimates and their Poisson 95% confidence intervals, and statistical analysis was performed using two-sided Poisson rates ratio tests; from left to right, *P = 0.001, 0.35 (not significant (NS)), *P < 10⁻¹⁵, *P < 10⁻¹⁵.

Patterns of cytosine deamination damage

A common form of DNA damage is deamination of cytosine (with or without preceding oxidation) to uracil, uracil glycol, 5-hydroxyuracil, or 5-hydroxyhydantoin (uracil-species)34,35. When unrepaired, these lesions result in dsDNA C>T mutations34. We reasoned that HiDEF-seq may detect these ssDNA cytosine to uracil-species events with single-molecule fidelity despite their low levels (estimated by mass spectrometry at less than 1 per 1 million bases36), as damaged cytosines would be mis-sequenced as thymines due to nucleotide misincorporation by the sequencer polymerase.

We began by investigating the burden and pattern of ssDNA C>T calls in the blood DNA of individuals without cancer predisposition, as blood can be processed rapidly without potential post-mortem DNA damage. We also extracted the DNA with room temperature incubations to avoid heat-induced deamination37. Blood DNA had 2.0 × 10⁻⁸ ssDNA C>T calls per base (mean of n = 9 samples from n = 5 individuals; range 9.8 × 10⁻⁹ to 3.1 × 10⁻⁸), comprising on average 71% of these samples’ ssDNA calls (Extended Data Fig. 9a and Supplementary Tables 2 and 3). This burden, which may have either been present in vivo or partly arisen during laboratory processing, suggests that there are less than 250 cytosine to uracil-species deaminated bases per cell in blood leukocytes. Our detection level of 1 event per 50 million bases is on par with the most sensitive mass spectrometry methods36,38 (which cannot determine the sequence context of damaged bases) and provides a low background for studying cytosine deamination processes. Notably, the spectrum of the combined ssDNA calls of these blood samples, projected to central pyrimidine contexts, most closely resembled COSMIC23 SBS30 (cosine similarity = 0.83) (Fig. 4a,c), a signature associated with cytosine oxidative deamination damage repaired by DNA glycosylases18,39,40. Surprisingly, G>T ssDNA calls, which would be expected due to the commonly oxidized base 8-oxoguanine, were very infrequent in these blood samples (average of 6% of ssDNA calls, 1.5 × 10⁻⁹ ssDNA calls per base; range 0 to 2.9 × 10⁻⁹), possibly due to the sequencer polymerase correctly incorporating dC across from 8-oxoguanine.
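The per-cell and per-base figures quoted above follow from simple arithmetic, sketched here in Python. The measured burden of 2.0 × 10⁻⁸ C>T calls per base is taken from the text; the diploid genome size of about 6.2 × 10⁹ bp is an assumption made for this illustration and is not stated in the text.

```python
# Measured ssDNA C>T burden in blood DNA, from the text (calls per base).
C_TO_T_CALLS_PER_BASE = 2.0e-8

# Assumption for illustration: approximate diploid human genome size in bp.
DIPLOID_GENOME_BP = 6.2e9
STRANDS_PER_DUPLEX = 2  # each base pair contributes one base on each strand

# Total bases per cell that could carry an ssDNA event.
bases_per_cell = DIPLOID_GENOME_BP * STRANDS_PER_DUPLEX

# Expected number of deaminated cytosines per cell at the measured burden.
events_per_cell = C_TO_T_CALLS_PER_BASE * bases_per_cell

# Reciprocal of the burden: how many bases are interrogated per event.
bases_per_event = 1 / C_TO_T_CALLS_PER_BASE

print(f"events per cell: {events_per_cell:.0f}")
print(f"bases per event: {bases_per_event:.1e}")
```

With the assumed genome size, the estimate comes to roughly 248 events per cell, consistent with the stated bound of less than 250, and the reciprocal of the burden gives the quoted detection level of 1 event per 50 million bases.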

Given the high sensitivity of HiDEF-seq’s ssDNA C>T detection, we investigated the effect of heat, an important source of laboratory-based cytosine deamination artefacts (as most DNA extraction methods use heat)37. We profiled purified blood DNA after heat incubation at 56 °C and 72 °C, each for 3 hours (h) and 6 h. While heat did not affect dsDNA mutation burdens, HiDEF-seq measured a significant increase in ssDNA calls (29-fold for 72 °C, 6 h treatment), specifically C>T calls (97% of calls), with increasing temperature and time (Fig. 4b and Supplementary Tables 2 and 3). This observation led us to profile all of the samples in this study except four (neurons of individual 5344 and 3 tumour samples) at least once with a room temperature DNA extraction (Methods and Supplementary Table 1). Notably, HiDEF-seq library preparation temperatures do not exceed 37 °C (Methods).

Across all of the healthy tissues and cell lines that we profiled, only sperm had a similarly high percentage of ssDNA calls that were C>T (average, 94%; Extended Data Fig. 9a). Sperm also had a higher ssDNA C>T burden than the other sample types (average, 1.4 × 10⁻⁷ C>T calls per base; Extended Data Fig. 9a). This suggests that these are also cytosine deamination events and that sperm DNA either undergoes more in vivo cytosine deamination than DNA of other tissues, or that it incurs this damage ex vivo before sperm purification from semen, during sperm purification or freezing, and/or during DNA extraction. To distinguish between these possibilities, we profiled non-sperm samples with the same processes used to freeze sperm and extract DNA from sperm, and we profiled additional sperm samples purified using filter chips that mimic physiological separation of motile sperm (Methods). The former did not produce an increase in ssDNA C>T burden, and the latter measured similar C>T burdens to the previous sperm samples that were purified by standard density gradient centrifugation (Supplementary Table 2 and Supplementary Note 5). These results suggest that sperm incur an elevated cytosine deamination burden either in vivo or ex vivo during the time (<1 h) that semen liquefies in the laboratory before sperm purification. In both cases, the elevated cytosine deamination burden would likely be present in sperm fertilizing the egg, and the egg’s DNA repair machinery would then repair the damage41. Moreover, sperm ssDNA C>T calls did not exhibit transcription level or transcription strand biases (Supplementary Note 6).

Notably, all sperm and heat-treated blood DNA samples exhibited similar ssDNA C>T spectra, and the projection of these ssDNA spectra to dsDNA spectra again closely matched COSMIC dsDNA signature SBS30 (average cosine similarities of 0.92 and 0.95 for sperm and 72 °C heat samples, respectively) (Fig. 4c and Extended Data Fig. 9b). Using all of the above sperm and heat-damage samples, we next extracted this ssDNA signature, which we named SBS30ss* (cosine similarity = 0.94 to SBS30) (Fig. 4d). COSMIC signature SBS30 is associated with NTHL1 and UNG biallelic loss-of-function mutations18,39 and with formalin fixation42. NTHL1 and UNG encode DNA glycosylases that initiate base excision repair of oxidized pyrimidines, including uracil-species resulting from cytosine oxidation40. Our finding that in vitro heating of purified DNA leads to a ssDNA damage signature, SBS30ss*, that matches the in vivo dsDNA SBS30 signature indicates that the SBS30ss* process is active in vivo, and that its pattern reflects the nucleotide context bias of the primary biochemical process of cytosine deamination, probably through an oxidized intermediate.

To further characterize the ssDNA C>T calls in heat-treated DNA and sperm, we took advantage of the single-molecule sequencer’s polymerase kinetic data that record the duration of each nucleotide incorporation (pulse width (PW)) and the time between nucleotide incorporations (interpulse duration (IPD)) (Fig. 4e). PW and IPD encode unique kinetic signatures for different canonical and damaged bases43. ssDNA C>T calls in heat-treated DNA and sperm exhibited a distinct PW and IPD kinetic signature compared to dsDNA C>T mutations (for the mutation strand containing thymine) (Fig. 4f, Methods and Extended Data Fig. 9c,d,g). These results provide further evidence that the ssDNA C>T calls are uracil-species arising from cytosine deamination damage and exclude the possibility that they are cytosine to thymine changes. We further validated that nearly all ssDNA C>T calls in heat-treated DNA and sperm are uracil-species by incubating three of these HiDEF-seq libraries with uracil DNA glycosylase and endonuclease VIII. This eliminated the SBS30ss* pattern and nearly all ssDNA C>T calls (Supplementary Note 7 and Supplementary Tables 2 and 3).

We also evaluated heating of DNA in five different buffers and in water. Heating in water or Tris buffer without additional salt increased cytosine damage 66-fold relative to heating in higher-salt buffers, with slight differences in ssDNA C>T patterns (Extended Data Fig. 9e,f and Supplementary Table 2). As low salt decreases DNA duplex stability at elevated temperatures, these results suggest that the in vivo mechanism of SBS30ss*/SBS30 is cytosine deamination while DNA is transiently single-stranded.

Patterns of APOBEC3A-induced damage

HiDEF-seq’s detection of cytosine deamination damage with single-molecule fidelity motivated us to define a ssDNA damage signature for APOBEC3A that was recently distinguished as the key contributor to cytosine deamination caused by APOBEC3 family proteins44. We expressed human APOBEC3A in primary human fibroblasts and extracted a ssDNA signature, which we named SBS2ss*, with strong similarity to APOBEC3A’s associated COSMIC dsDNA signature SBS2 (cosine similarity = 0.92) (Extended Data Fig. 10a–f). Notably, SBS2ss* contained additional low-level peaks of ssDNA C>T calls outside the TCN contexts characteristic of SBS2 (Extended Data Fig. 10f and Supplementary Note 8). Moreover, the absence of any appreciable ssDNA C>A or C>G calls (Extended Data Fig. 10e,f) provides further strong evidence that the COSMIC SBS13 signature associated with APOBEC3A arises by base excision followed by error-prone translesion synthesis across the resulting abasic sites44 (Supplementary Note 8).

Profiling the mitochondrial genome

Previous studies measured an approximately 20–40-fold higher dsDNA mutation rate with age in the mitochondrial genome than in the nuclear genome15. However, the mechanism by which the mitochondrial genome mutates remains unclear45,46,47,48. While it was long assumed to be primarily due to oxidative damage47, recent studies instead support a mechanism linked to replication45,46,47,48,49. Specifically, A>G and C>T dsDNA mutations are highly enriched on the mitochondrial heavy (G+T-rich) strand, with a frequency that decreases with distance from the heavy strand origin of replication in the direction of heavy strand synthesis45,46,48,49. Several potentially overlapping hypotheses have been proposed for these findings: (1) strand-displacement replication leaves the heavy strand exposed longer as ssDNA, making it vulnerable to deamination of adenine and cytosine that are then mispaired during replication with cytosine and adenine, respectively45,46,48; (2) strand asymmetries in polymerase misincorporation of canonical nucleotides46,47; and (3) strand asymmetries in DNA repair46. Importantly, assuming that DNA repair is not substantially more efficient in mitochondria than in nuclei50 and that most mutagenic mitochondrial ssDNA lesions can be detected by HiDEF-seq, then possibilities (2) and (3) should exhibit significantly higher HiDEF-seq ssDNA burdens in the mitochondrial genome than in the nuclear genome—since HiDEF-seq detects an increased ssDNA burden in CMMRD and POLE PPAP samples that have even lower dsDNA mutation rates than mitochondria (8.1-fold and 5.4-fold lower, respectively) (Fig. 5a and Extended Data Fig. 7d). However, possibility (1) would not yield a substantial difference in HiDEF-seq ssDNA burdens between the mitochondrial and nuclear genomes because HiDEF-seq would not capture denatured mitochondrial ssDNA in which the ssDNA damage events occur, and these ssDNA damage events would be rapidly transformed into dsDNA changes by replication. 
We investigated HiDEF-seq’s mitochondrial dsDNA and ssDNA calls to assess these hypotheses.

Fig. 5: Mitochondrial genome dsDNA and ssDNA call burdens and patterns.

a, Nuclear versus mitochondrial genome dsDNA mutation rates. Mitochondrial rates are from the regressions in Extended Data Fig. 11a, which were performed similarly for the nuclear genome and for liver and kidney samples combined. P values were calculated using analysis of variance (ANOVA) comparing two weighted least-squares linear regression models of mutation burden versus age and genome type covariates: one with and one without an ‘age × genome type’ interaction term (an estimate of the difference in dsDNA mutation rate depending on whether it is the nuclear or mitochondrial genome). b, ssDNA call burdens in the nuclear versus mitochondrial genomes after combining the calls of liver and kidney samples shown in Extended Data Fig. 11a, excluding from the nuclear genome burden the liver samples from which mitochondria were enriched as, due to low DNA inputs, these samples were profiled with HiDEF-seq with A-tailing, which induces ssDNA T>A artefacts in the nuclear genome of post-mortem liver. P value was calculated using a two-sided Poisson rates ratio test. c, dsDNA mutation spectrum, corrected for trinucleotide context opportunities, of the liver and kidney samples shown in Extended Data Fig. 11a for the mitochondrial genome heavy strand, separated by pyrimidine (top) and purine (bottom) contexts. d, Spectrum of mitochondrial ssDNA calls combined from the liver and kidney samples shown in Extended Data Fig. 11a plus all bulk (that is, non-mitochondria enriched) liver and kidney samples profiled by HiDEF-seq with A-tailing, as the ssDNA T>A artefact that A-tailing can incur in these post-mortem tissues (Supplementary Note 1) is orthogonal to the contexts of mitochondrial mutagenesis. Spectra are corrected for trinucleotide context opportunities, separately for each strand. Excluding bulk samples profiled by HiDEF-seq with A-tailing yields a similar spectrum (Extended Data Fig. 11c). a, Error bars show the 95% confidence intervals from regressions. 
b, Bars and error bars show point estimates and their Poisson 95% confidence intervals.

We focused on liver and kidney samples, which yield more mitochondrial DNA (average 1% of sequenced molecules) than other tissues, and we also purified mitochondria from five liver samples to further increase mitochondrial DNA yield (average of 13% of molecules; Supplementary Table 1). Mitochondrial dsDNA mutation rates measured by HiDEF-seq were 38.9- and 60.1-fold higher in liver and kidney, respectively, than the dsDNA mutation rates of the nuclear genomes of these tissues (Fig. 5a and Extended Data Fig. 11a). Combining liver and kidney samples, the difference was 45.4-fold (Fig. 5a). HiDEF-seq also detected the expected highly asymmetric pattern of A>G and C>T dsDNA mutations on the heavy strand, and the heavy strand’s C>T mutation spectrum had strong similarity to SBS30ss* and SBS30 (both cosine similarities = 0.91) (Fig. 5c, Extended Data Fig. 11b and Supplementary Note 9).

Notably, despite the mitochondrial genome’s significantly higher dsDNA mutation rate, its ssDNA call burden in liver and kidney was only 1.5-fold higher (95% confidence interval: 1.1–2.1) than the ssDNA call burden of the nuclear genome (Fig. 5b). While the number of mitochondrial ssDNA calls was low, these were concentrated in sequence contexts consistent with the dsDNA mutation spectrum (Fig. 5d, Extended Data Fig. 11c and Supplementary Note 9). Together, these data strengthen the evidence that the mitochondrial genome mutates primarily during replication, possibly through DNA damage on the heavy strand while it is single-stranded and, to a lesser extent, through cytosine deamination on the light strand (Supplementary Note 9).
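
The burden comparison above is a ratio of two Poisson rates. A minimal sketch of such a ratio with an approximate 95% confidence interval is given below. It uses a Wald interval on the log scale, which is a simpler approximation than the two-sided Poisson rates ratio test used for Fig. 5b, and the call counts and interrogated-base totals are hypothetical placeholders rather than values from this study.

```python
import math


def poisson_rate_ratio(calls_a, bases_a, calls_b, bases_b, z=1.96):
    """Return (ratio, lower, upper) for rate_a / rate_b.

    Rates are calls per interrogated base. The interval is a Wald
    interval on log(ratio), so both counts must be non-zero.
    """
    if calls_a <= 0 or calls_b <= 0:
        raise ValueError("Wald interval needs non-zero counts in both groups")
    ratio = (calls_a / bases_a) / (calls_b / bases_b)
    se_log = math.sqrt(1 / calls_a + 1 / calls_b)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper


# Hypothetical counts: mitochondrial (a) versus nuclear (b) ssDNA calls.
ratio, lo, hi = poisson_rate_ratio(40, 2.0e8, 400, 3.0e9)
print(f"{ratio:.2f}-fold (95% CI {lo:.2f} to {hi:.2f})")
```

With few calls in one group the interval widens quickly, which is why a modest fold difference can still have a confidence interval whose lower bound sits close to 1.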

Discussion

Profiling dsDNA mutations provides information on past mutational events, while profiling ssDNA mismatches and damage provides a real-time view of DNA lesions that reflects the current equilibrium between DNA damage, repair, and replication. Once ssDNA mismatches and damage transform into dsDNA mutations, information is lost about the originating lesions. This gap in studying mutagenesis motivated us to develop HiDEF-seq—a single-molecule sequencing approach that achieves single-molecule fidelity. Our approach opens new avenues for studying DNA damage and mutation processes.

Mutational signatures have transformed the study of cancer and mosaic mutations4, but current signatures reflect only dsDNA mutations. Here we have begun to define ssDNA signatures, specifically: SBS10ss, SBS14ss, SBS30ss* and SBS2ss* (Supplementary Table 7). SBS10ss and SBS14ss arise from misincorporation of canonical (that is, non-damaged) nucleotides during replication. ssDNA mismatches of canonical nucleotides probably also occur outside the setting of replication. For example, signature SBS5 is ubiquitous in all cells, including post-mitotic neurons8,51, and a recent study indicates that SBS5 may be caused by translesion polymerases44. This implies a mechanism of canonical nucleotide misincorporation that may become detectable by HiDEF-seq with higher-throughput instruments. We anticipate that HiDEF-seq will spur efforts to create a comprehensive catalogue of ssDNA signatures that complements the existing catalogue of dsDNA signatures. It will then be important to relate specific ssDNA and dsDNA signatures to each other, as these relationships will encode information about DNA damage, repair, and replication dynamics. Furthermore, as we have shown here, HiDEF-seq may be used to systematically assess potential damage caused by laboratory tissue and DNA processing.

The prevailing view that single-molecule sequencers have relatively high cost may have deterred their use in studying mosaic mutations and rare events, with the exception of in vitro polymerase and bacterial mutagenesis studies52,53. Since HiDEF-seq captures data from both DNA strands more efficiently than short-read duplex sequencing, it is only around 4.6-fold more expensive for dsDNA mosaic mutation detection than short-read duplex sequencing, and new sequencing instruments will reduce this to an approximately 2.8-fold difference (and about 1.6-fold for large-fragment HiDEF-seq) (Supplementary Note 10). One limitation of HiDEF-seq is that it does not achieve single-molecule fidelity for insertions and deletions (indels) due to high sequencing error rates for these events in single-molecule sequencing12. This may become feasible with improved sequencing fidelity and indel-tuned consensus sequence calling12. Moreover, HiDEF-seq does not currently detect types of ssDNA damage that do not affect base pairing or that cannot be replicated by the sequencing polymerase. Since diverse types of ssDNA damage alter sequencing polymerase kinetics43, other types of damage may be feasible to detect in the future with single-molecule fidelity.

The high mutation rates of CMMRD and PPAP syndromes put their abnormal ssDNA call burdens and patterns within range of currently feasible single-molecule sequencing depth. However, we did not detect altered ssDNA burdens or patterns in cancer predisposition syndromes involving nucleotide excision repair or base excision repair, probably due to current limitations of sequencing depth and/or their mutational mechanisms involving types of ssDNA damage that we do not currently detect. We anticipate that future higher-throughput single-molecule sequencing combined with kinetics analyses will reveal additional ssDNA signatures in other cancer predisposition syndromes and in individuals with normal mutation rates.

Diverse methods profile DNA damage by enzymatic alteration at damage sites or by affinity enrichment, but their lack of single-molecule fidelity yields low-resolution damage patterns10. HiDEF-seq’s single-molecule fidelity for cytosine deamination damage revealed SBS30ss*. In healthy tissues, we detect SBS30ss* but not an SBS1ss* signature corresponding to SBS1, suggesting that SBS30ss* in healthy tissues reflects primarily ex vivo cytosine deamination that obscures in vivo SBS1ss* (Supplementary Note 11). However, in sperm, the higher burden of SBS30ss* may reflect in vivo cytosine deamination that accumulates in the absence of effective DNA repair and is later repaired after fertilization41. Nevertheless, when SBS1 is elevated, HiDEF-seq can detect its ssDNA precursors (Supplementary Note 4).

HiDEF-seq may also find utility in experimental systems to dissect the kinetics of the DNA damage, repair, and replication equilibrium—for example, combined with in vitro genetic and other manipulations, with synchronization of the cell cycle, and in reconstituted enzyme systems. Sequencing single-strand changes in DNA with single-molecule fidelity will greatly advance our understanding of the origins of mutations.

Methods

Sample sources

Post-mortem tissues obtained by the NIH NeuroBioBank (University of Maryland site) were frozen in isopentane-liquid nitrogen baths and stored at −80 °C until use. Post-mortem tissues obtained by the International Replication Repair Deficiency Consortium (IRRDC) biobank were frozen and stored at −80 °C until use. Blood was obtained from individuals enrolled in human subjects research of the New York University Grossman School of Medicine, the IRRDC, the University of Pittsburgh and the Cryos International Sperm Bank. All blood samples were collected in EDTA tubes and frozen immediately after collection until use. Tumour samples were obtained from the IRRDC and were frozen and stored at −80 °C until use. Semen samples (processing details described in the ‘Sperm purification’ section) were obtained at Cryos International Sperm Bank from individuals enrolled in human subjects research approved by the New York University Grossman School of Medicine Institutional Review Board, except for participants D1 and D2, who were enrolled in human subjects research conducted by Cryos International Sperm Bank. Lymphoblastoid cell lines were obtained from Coriell Institute and the IRRDC. Primary fibroblasts were obtained from Coriell Institute and the IRRDC. All of the samples were collected under human subjects research protocols approved by either the New York University Grossman School of Medicine Institutional Review Board, the Hospital for Sick Children (SickKids) Research Ethics Board as part of the IRRDC, the Cryos International Sperm Bank scientific advisory committee or the University of Pittsburgh Institutional Review Board.

The source, sex, age at collection, and post-mortem interval of each sample are provided in Supplementary Table 1.

Sperm purification

After collection at the Cryos International Sperm Bank, semen underwent liquefaction at room temperature for 30 to 60 min. Semen then immediately underwent initial purification for sperm using density gradient centrifugation followed by a wash with HEPES-buffered medium54. For semen from individuals D1 and D2, sperm were purified from half of each semen sample using this method, and sperm were purified from the other half using the ZyMot Multi (850 µl) Sperm Separation Device (ZyMot) according to the manufacturer’s instructions. After addition of cryopreservation media, sperm were stored in liquid nitrogen until further use.

Cryopreserved sperm that previously underwent initial purification by density gradient centrifugation were further purified in the laboratory with a second density gradient centrifugation and two additional washes, as follows. First, the following reagents were equilibrated to room temperature: ORIGIO gradient 40/80 buffer (Cooper Surgical, 84022010), Origio sperm wash buffer (Cooper Surgical, 84050060) and Quinn’s Advantage sperm freezing medium (Cooper Surgical, ART-8022). In a 15 ml tube, 1 ml of Origio 80 buffer was placed at the bottom, and 1 ml of Origio 40 buffer was gently layered on top. Sperm were thawed at room temperature for 15 min, gently mixed with a pipette, and carefully layered on top of the Origio 40 buffer. The tube was then centrifuged in a swinging-bucket centrifuge at 400g for 20 min at room temperature with low acceleration and deceleration speeds. The supernatant was aspirated, leaving 500 μl of sperm/buffer at the bottom. The sperm was transferred to a new 15 ml tube and diluted with 5 ml sperm wash buffer. The tube was mixed by inverting ten times and centrifuged in a swinging-bucket centrifuge at 300g for 10 min at room temperature with maximum acceleration and deceleration. The supernatant was removed, leaving about 350 μl of sperm/buffer at the bottom. The sperm was then washed again in the same way with 5 ml of sperm wash buffer, and the supernatant was removed, leaving about 250 μl of sperm/buffer at the bottom of the tube. After pipette mixing, an aliquot of this sperm was transferred to a 2 ml DNA LoBind microtube (Eppendorf) for immediate DNA extraction and general evaluation using a haemocytometer. The remaining sperm was diluted dropwise with a 1:1 volumetric ratio of sperm freezing medium, incubated at room temperature for 3 min, frozen in a Mr. Frosty freezing container (Thermo Fisher Scientific) in a −80 °C freezer for 24 h and then transferred to a liquid nitrogen freezer.

Cerebral cortex neuronal nuclei purification

Cerebral cortex neuronal nuclei were isolated as previously described5 from post-mortem frontal cortex (Brodmann area 9, left hemisphere) of individuals who did not have any known neurological or psychiatric disease. Specifically, approximately 1 g of frozen tissue from each individual was cut into 5 mm³ pieces and added to 9 ml of chilled lysis buffer (0.32 M sucrose, 10 mM Tris HCl pH 8, 5 mM CaCl2, 3 mM magnesium acetate, 0.1 mM EDTA, 1 mM DTT, 0.1% Triton X-100) in a large dounce homogenizer (Sigma-Aldrich, D9938). While on ice, the tissue was dounced 20 times each with pestle size A and then B. The homogenate was layered on a 7.4 ml sucrose cushion (1.8 M sucrose, 10 mM Tris HCl pH 8, 3 mM magnesium acetate, 1 mM DTT) in an ultracentrifuge tube on ice. The tubes were centrifuged (Thermo Fisher Scientific, Sorvall LYNX 6000) at 10,000 rpm for 1 h at 4 °C. The resulting supernatant was removed, and 500 μl of nuclei resuspension buffer (3 mM MgCl2 in 1× phosphate-buffered saline) was added on top of the pellet and then incubated on ice for 10 min. The pellet was then gently resuspended. Antibody staining buffer was prepared by adding 1.2 μg of NeuN-Alexa-647 (Abcam, ab190565) to 400 μl of antibody staining buffer (3% BSA in nuclei resuspension buffer) and inverted gently to mix. Then, 400 μl of antibody staining buffer was added to 1 ml of nuclei and the sample was rotated at 4 °C for 30 min. NeuN-positive nuclei were gated as shown in Supplementary Note 12. NeuN-positive nuclei were collected in 30 μl of nuclei buffer in 1.5 ml LoBind tubes (Eppendorf) by fluorescence-activated nuclei sorting on a SONY LE-SH800 sorter. After sorting, a 1:1 volumetric ratio of 80% glycerol was added to sorted nuclei for a final concentration of 40% glycerol to stabilize nuclei during centrifugation. Nuclei were centrifuged at 4 °C, 500g for 10 min. The supernatant was removed and nuclei pellets were immediately frozen at −80 °C.

Extraction and isolation of mitochondria

Mitochondria were extracted and isolated from 300–500 mg of tissue using the Mitochondria Extraction Kit (Miltenyi Biotec) and Mitochondria Isolation Kit (Miltenyi Biotec), according to the manufacturer’s Extraction Kit protocol, with the following modifications: (1) protease inhibition buffer was prepared with 100× HALT protease inhibitor cocktail (Thermo Fisher Scientific); (2) minced tissue was resuspended with a larger 2 × 2.5 ml volume of protease inhibitor buffer instead of 2 × 1 ml; (3) after homogenization, the homogenate was passed through a 30 µm SmartStrainer (Miltenyi Biotec); (4) the SmartStrainer was washed with 2 × 2.5 ml of solution 3 instead of 2 × 1 ml; (5) before adding TOM22 antibody, the homogenate was diluted with Separation Buffer to a volume of 25 ml instead of 10 ml; and (6) 125 µl of TOM22 antibody was used per sample instead of 50 µl. Final mitochondria pellets were frozen at −20 °C for subsequent DNA extraction.

Cell culture for direct profiling

Lymphoblastoid cell lines were cultured at 37 °C, 5% CO2, and ambient oxygen in T25 flasks with RPMI 1640 medium (Thermo Fisher Scientific, 61870036) supplemented with 15% fetal bovine serum and penicillin–streptomycin. Cells were passaged to new medium approximately every 2–3 days.

Fibroblasts were cultured at 37 °C, 5% CO2 and ambient oxygen in T25 flasks with DMEM medium (Thermo Fisher Scientific, 10569010) supplemented with 10% fetal bovine serum and penicillin–streptomycin. Cells were passaged to new medium every 3–5 days before reaching full confluency. Cells were collected for DNA extraction at 80–90% confluency using trypsin-EDTA.

For the experiment testing the potential effect of sperm freezing medium on cytosine deamination, we resuspended the collected pellet of fibroblasts in Origio sperm wash buffer, mixed with a 1:1 volume ratio of Freezing Medium TYB with Glycerol & Gentamicin (Irvine Scientific), and froze the cells in a Mr. Frosty container (Thermo Fisher Scientific) at −80 °C followed by transfer to a liquid nitrogen freezer. After thawing, cells were either washed once with PBS followed by Puregene DNA extraction or they were processed using the same method of DNA extraction that was used for sperm (the details of each method are described in the ‘DNA extraction’ section).

Lentivirus experiments

Lentivirus plasmid design and synthesis

The lentivirus transfer plasmid design and sequences are listed in Supplementary Table 8. APOBEC3A constructs included a human gamma globin intron 2 sequence to prevent expression of the mutagenic protein during bacterial cloning55. Gene inserts were synthesized and cloned by GenScript into a pLVX-TetOne lentiviral vector (Takara). The pLVX-TetOne vector was used to enable temporal control of gene expression using doxycycline. This prevents expression of encoded mutagenic proteins during lentiviral packaging, which could mutate the lentiviral transfer plasmid and lentiviral RNA to create non-functional lentiviruses. GenScript verified gene inserts by sequencing and prepared quality-controlled quantities of transfer plasmid sufficient for lentiviral packaging.

Lentivirus packaging

Lenti-X 293T cells (Takara) were cultured at 37 °C, 5% CO2 and ambient oxygen in T75 collagen-coated flasks (ZenBio) with DMEM medium (Thermo Fisher Scientific, 11995065) supplemented with 10% tetracycline-free fetal bovine serum (Takara). Cells were transfected at about 80% confluency. The lentiviral packaging transfection mix was prepared by combining 0.8 ml DMEM (Thermo Fisher Scientific; 11995065), 20 µl pC-Pack2 second-generation lentiviral packaging plasmid mix (Cellecta, CPCP-K2A), lentiviral transfer plasmid (10 µg for eGFP plasmid; 12.5 µg for APOBEC3A plasmids), and 36 µl PureFection transfection reagent (System Biosciences). Note that a second-generation packaging system was necessary because fourth-generation packaging systems contain a Tet-Off gene that would cause the pLVX-TetOne gene insert to be expressed during packaging, and third-generation packaging systems do not contain the tat gene required for efficient packaging of the second-generation pLVX-TetOne transfer plasmid. Cells were transfected by adding this transfection mix to cells in fresh 10 ml of the above medium. The next day, an additional 8 ml of the above medium was added to the cells. Then, 72 h after transfection, the cell medium was collected and centrifuged at 500g for 10 min to pellet the cell debris. The ~18 ml supernatant was mixed with 6 ml of Lenti-X Concentrator (Takara), incubated for at least 3 h at 4 °C and centrifuged at 1,500g for 45 min at 4 °C. The lentivirus pellet was resuspended in DMEM medium (Thermo Fisher Scientific, 10569010) supplemented with 10% standard fetal bovine serum and penicillin–streptomycin. Aliquots of lentivirus were flash-frozen in liquid nitrogen and stored at −80 °C.

Lentiviral particles were quantified after thawing using Lenti-X GoStix Plus (Takara). The resulting GoStix values were multiplied by 1.25 × 10⁷ to obtain the concentration of lentiviral particles per ml.

Lentivirus transduction

Fibroblasts were cultured at 37 °C, 5% CO2 and ambient oxygen in T75 flasks with DMEM medium (Thermo Fisher Scientific, 10569010) supplemented with 10% fetal bovine serum and penicillin–streptomycin. Cells were transduced with lentivirus at about 60% confluency in 15 ml of the above medium supplemented with 8 µg ml−1 polybrene (Sigma-Aldrich, H9268). The amount of lentivirus added was calculated as follows: ([estimated 900,000 cells in a 60% confluent T75 flask] × [250 infectious units per cell])/([previously measured concentration of lentiviral particles per ml]/[estimated 100 viral particles per infectious unit]). The factor of 250 infectious units per cell was optimized to obtain > 80% GFP-positive cells using the eGFP lentivirus. Then, 16 h after transduction, the medium was replaced with a new 15 ml of the above medium (without polybrene) supplemented with 250 ng ml−1 doxycycline (Takara, 631311). After an additional 48 h, the medium was replaced with a new 15 ml of the above doxycycline medium. After an additional 24 h, cells were collected for DNA extraction using trypsin-EDTA.
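
The lentivirus volume calculation described above can be sketched as a short function. The defaults mirror the values stated in the text, taking the GoStix conversion factor to be 1.25 × 10^7 particles per ml per GoStix unit. The function name, parameter names and the example GoStix reading of 1,800 are hypothetical.

```python
def lentivirus_volume_ml(gostix_value,
                         cells=900_000,
                         infectious_units_per_cell=250,
                         particles_per_gostix_unit=1.25e7,
                         particles_per_infectious_unit=100):
    """Volume of lentivirus stock (ml) to add to one T75 flask.

    volume = (cells x IU per cell) / (particles per ml / particles per IU)
    """
    particles_per_ml = gostix_value * particles_per_gostix_unit
    infectious_units_per_ml = particles_per_ml / particles_per_infectious_unit
    return cells * infectious_units_per_cell / infectious_units_per_ml


# Hypothetical GoStix reading of 1,800 for a thawed aliquot.
print(f"{lentivirus_volume_ml(1800):.2f} ml")
```

Keeping the two estimated quantities (cells per flask and particles per infectious unit) as named parameters makes it clear which inputs are measurements and which are assumptions.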

DNA extraction

The DNA-extraction method used for each sample is listed in Supplementary Table 1. Details of each DNA extraction method are provided below.

DNA extraction from sperm for HiDEF-seq

An aliquot of washed sperm (that is, after the washes that are performed after density gradient centrifugation) was centrifuged at 300g for 5 min at room temperature. The supernatant was removed, leaving approximately 50 μl of sperm/buffer at the bottom of the microtube. The tube was tapped gently five times to break up the sperm pellet before adding lysis buffer.

If starting with frozen sperm instead of an aliquot of washed sperm, the frozen sperm vial was rapidly thawed in a 37 °C water bath, gently mixed with a pipette, and an aliquot was transferred to a 2 ml DNA LoBind microtube for DNA extraction. The remaining sperm was frozen again. The DNA extraction aliquot was diluted with 600 μl of Origio sperm wash buffer, centrifuged at 300g for 5 min at room temperature, and the supernatant was removed to leave approximately 100 μl of sperm/buffer at the bottom. The sperm was diluted again with 600 μl of Origio sperm wash buffer, centrifuged at 300g for 5 min at room temperature, and the supernatant was removed to leave approximately 50 μl of sperm/buffer at the bottom. The tube was tapped gently five times to break up the sperm pellet before adding lysis buffer.

Sperm DNA extraction was based on a previous study56, with some modifications, including optimizations we performed that showed that tris(2-carboxyethyl)phosphine (TCEP) can be reduced from 50 mM to 2.5 mM in the lysis buffer. Specifically, sperm lysis buffer was prepared by combining (for each sample) 497.5 μl of Qiagen Buffer RLT (Qiagen) without β-mercaptoethanol and 2.5 μl of 0.5 M Bond-Breaker TCEP Solution (Thermo Fisher Scientific) for a lysis buffer with 2.5 mM TCEP final concentration. Then, 500 μl of sperm lysis buffer and 100 mg of 0.2 mm stainless-steel beads (Next Advance, SSB02-RNA) were added without mixing to each sample. Homogenization was then performed using a TissueLyser II instrument (Qiagen) at 20 Hz for 4 min (samples profiled by HiDEF-seq without nick ligation: SPM-1004, SPM-1020; samples profiled by HiDEF-seq with nick ligation and A-tailing, and samples profiled by NanoSeq: SPM-1002, SPM-1004, SPM-1013, SPM-1020; samples profiled by HiDEF-seq with nick ligation in large fragments: SPM-1002, SPM-1020) or 30 s (samples profiled by HiDEF-seq with nick ligation and A-tailing: SPM-1060, D1, D2; sample profiled by HiDEF-seq with nick ligation without A-tailing: SPM-1013; sample profiled by NanoSeq: SPM-1060; and samples profiled by HiDEF-seq with nick ligation and with uracil DNA glycosylase/endonuclease VIII treatment: SPM-1002 and SPM-1004). DNA was then extracted from the lysate using the QIAamp DNA Mini Kit (Qiagen) with a modified protocol as follows. A 500 μl volume of buffer AL was added to each lysate and vortexed well. Then, 500 μl of 100% ethanol was added and vortexed well. The mixture was then applied to a QIAamp DNA Mini spin column and the remaining standard QIAamp protocol was followed. DNA was eluted with 100 μl of 10 mM Tris pH 8. RNase treatment was then performed by adding 12 μl of 10× PBS pH 7.4 (Gibco), 2 μl of Monarch RNase A (New England Biolabs (NEB)) and 6 μl nuclease-free water (NFW). 
The reaction was incubated at room temperature for 5 min and immediately purified using a 0.8× beads to sample volume ratio of SPRI beads (solid-phase reversible immobilization; made by washing 1 ml Sera-Mag carboxylate-modified SpeedBead (Cytiva, 65152105050250) and resuspending the beads in 50 ml of 18% PEG-8000, 1.75 M NaCl, 10 mM Tris pH 8, 1 mM EDTA, 0.044% Tween-20). DNA was eluted from beads with 35 μl of 10 mM Tris/0.1 mM EDTA pH 8. For the experiments in which we processed previously extracted blood DNA and primary fibroblast DNA with the same process used for sperm DNA extraction, we used the previously extracted DNA as input and followed the same process above beginning with addition of lysis buffer, with a homogenization time of either 30 s or 4 min with concordant results (Supplementary Table 2).
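
The final concentrations implied by the volumes above follow from simple dilution arithmetic (stock concentration × stock volume / total volume). A minimal sketch follows, with a hypothetical helper name, that checks the stated 2.5 mM TCEP in the lysis buffer and the approximately 1× PBS in the RNase reaction, assuming the full 100 μl eluate enters that reaction.

```python
def final_concentration(stock_conc, stock_volume_ul, total_volume_ul):
    """Concentration after dilution, in the same units as stock_conc."""
    return stock_conc * stock_volume_ul / total_volume_ul


# Sperm lysis buffer: 2.5 ul of 0.5 M (500 mM) TCEP in 497.5 ul Buffer RLT.
tcep_mm = final_concentration(500.0, 2.5, 497.5 + 2.5)

# RNase reaction: 12 ul of 10x PBS in 100 ul eluate + 12 + 2 + 6 ul.
pbs_x = final_concentration(10.0, 12.0, 100 + 12 + 2 + 6)

print(f"TCEP: {tcep_mm} mM, PBS: {pbs_x}x")
```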

A somatic cell contamination assay was adapted from a previous study57 and performed on all extracted sperm DNA samples to further confirm sperm purity. This assay amplifies four loci from bisulfite-treated DNA: three loci that are methylated in sperm but not in somatic cells (PCR7, PCR11, PCR31) and one locus that is methylated in somatic cells but not in sperm (PCR12). After bisulfite treatment and PCR amplification of each locus, the PCR amplicon is cut by a restriction enzyme only if the original DNA was methylated. Thus, this assay can detect somatic cell contamination. In total, 350 ng of each extracted sperm DNA and 350 ng of control human NA12878 lymphoblastoid cell line genomic DNA (Coriell Institute) were bisulfite-converted using the Zymo EZ DNA Methylation Kit (Zymo Research). The loci were amplified by PCR using the following primer sets: PCR7 (GGGTTATATGATAGTTTATAGGGTTATT and TCTATTACTACCACTTCCTAAATCAA), PCR11 (TGAGATGTTTGTTAGTTTATTATTTTGG and TCATCTTCTCCCACCAAATTTC), PCR12 (TAGAGGGTAGTTTTTAAGAGGG and ATTAACCAACCTCTTCCATATTCTT) and PCR31 (TTTTAGTTTTGGGAGGGGTTGTTT and CTACCAAAATTAAAAACCAACCCAC). The PCR reaction contained 1.5 μl of bisulfite-converted DNA, 10 μl of 2× ZymoTaq PCR Mix (Zymo Research), PCR primers, and NFW to a final volume of 20 μl. The PCR reactions were optimized to contain the following final concentrations of each forward and reverse primer: 0.6 μM for PCR7 primers, 0.6 μM for PCR11 primers, 0.3 μM for PCR12 primers, and 0.45 μM for PCR31 primers. The PCR reactions were cycled as follows: 95 °C for 10 min; 40 cycles of 94 °C for 30 s, X °C for 30 s and 72 °C for 30 s; 72 °C for 7 min; and hold at 4 °C, where X (annealing temperature) was 49 °C for PCR7 and PCR11, 51 °C for PCR12 and 55 °C for PCR31. PCR reactions were purified by 2× volumetric ratio SPRI beads cleanup and eluted in 22 μl of 10 mM Tris pH 8. 
Restriction digests were performed by combining 5 μl of purified PCR product, restriction enzyme (10 units of HpyCH4IV (NEB) for PCR7 and PCR31, and 20 units of TaqI-v2 (NEB) for PCR11 and PCR12), 1 μl of 10× CutSmart buffer (NEB), and NFW for a total reaction volume of 10 μl. Restriction digestions were performed at 37 °C (HpyCH4IV) or 65 °C (TaqI-v2) for 60 min. Control reactions without restriction enzyme were performed for each sample/locus combination. A total of 5 μl of each restriction digest reaction was combined with 1 μl 6× TriTrack DNA loading dye (Thermo Fisher Scientific) and run on a 2% agarose gel prestained with ethidium bromide, followed by imaging of the gel.
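
The logic of the contamination assay can be summarized in a small lookup table and a check against the pattern expected for pure sperm. This is an illustrative sketch only: the per-locus settings are transcribed from the text above, but the data structure, the function names and the simplification of each gel lane to a single cut/uncut call are assumptions. Real gels can show partial digestion, which a boolean does not capture.

```python
# Per-locus settings transcribed from the assay description.
LOCI = {
    "PCR7":  {"methylated_in": "sperm",   "anneal_c": 49,
              "enzyme": "HpyCH4IV", "primer_um": 0.6},
    "PCR11": {"methylated_in": "sperm",   "anneal_c": 49,
              "enzyme": "TaqI-v2",  "primer_um": 0.6},
    "PCR12": {"methylated_in": "somatic", "anneal_c": 51,
              "enzyme": "TaqI-v2",  "primer_um": 0.3},
    "PCR31": {"methylated_in": "sperm",   "anneal_c": 55,
              "enzyme": "HpyCH4IV", "primer_um": 0.45},
}


def expected_cut(locus, cell_type):
    """True if the amplicon should be cut in a pure sample of cell_type.

    The enzyme cuts only when the original DNA was methylated.
    """
    return LOCI[locus]["methylated_in"] == cell_type


def contamination_flags(cut_observed):
    """Loci whose digest departs from the pure-sperm expectation.

    cut_observed maps locus name -> True if the amplicon was cut.
    """
    return [locus for locus in LOCI
            if cut_observed[locus] != expected_cut(locus, "sperm")]


pure = {"PCR7": True, "PCR11": True, "PCR12": False, "PCR31": True}
print(contamination_flags(pure))
```

An empty list corresponds to the pattern expected for sperm DNA without detectable somatic signal; any listed locus would prompt a closer look at the gel and the no-enzyme control.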

DNA extraction from solid tissues for HiDEF-seq

Approximately 50–300 mg of tissue was cut in a Petri dish on dry ice and minced with a scalpel, followed by one of the following DNA-extraction methods, as specified for each sample in Supplementary Table 1.

NucleoBond HMW, MagAttract HMW, QIAamp. In this method, DNA was extracted and purified with three serial kits to maximize DNA purity. DNA was extracted using the NucleoBond HMW DNA Kit (Takara) according to the manufacturer’s instructions with a 50 °C proteinase K incubation for 4.5 h. The eluted DNA was then further purified with the MagAttract HMW DNA Kit (Qiagen) according to the manufacturer’s whole-blood purification protocol, except with proteinase K/RNase A incubation occurring at 56 °C for 20 min. The eluted DNA was then further purified using the QIAamp DNA Mini Kit (Qiagen) by diluting the DNA to a final volume of 200 μl and final 1× PBS concentration, adding 20 μl of proteinase K (Qiagen) and continuing according to the manufacturer’s body fluids DNA purification protocol with a 56 °C proteinase K incubation for 10 min without RNase A treatment.

MagAttract HMW. We used the MagAttract HMW DNA Kit (Qiagen) according to the manufacturer’s protocol for tissue, with a 2 h proteinase K digestion at 56 °C. DNA was eluted with 10 mM Tris pH 8.

Puregene. Tissue was pulverized inside a microtube while in a liquid-nitrogen cooled mini mortar and pestle (Bel-Art). DNA was then extracted using the Puregene DNA Kit (Qiagen) according to the manufacturer’s protocol for tissues, except (1) the lysis step with proteinase K was performed at room temperature on a ThermoMixer C instrument (Eppendorf) at 1,400 rpm for 1 h; (2) the RNase A treatment was performed at room temperature for 20 min; and (3) the final DNA pellet was resuspended in 10 mM Tris pH 8 at room temperature for 1 h.

DNA extraction from cerebral cortex neuronal nuclei for HiDEF-seq

DNA was extracted from nuclei pellets using one of two methods, as specified for each sample in Supplementary Table 1.

QIAamp: we used the QIAamp DNA Mini Kit (Qiagen) according to the manufacturer’s protocol, with lysis performed by adding 180 μl of buffer ATL and 20 μl of proteinase K to the nuclei pellet, followed by a 56 °C incubation for 1 h, and including RNase A treatment.

MagAttract: we used the MagAttract HMW DNA Kit according to the manufacturer’s protocol for blood, after resuspending nuclei with 200 µl of 1× PBS, with a 30 min proteinase K digestion at room temperature.

DNA extraction from mitochondria for HiDEF-seq

DNA was extracted from mitochondria pellets using the Puregene DNA Kit (Qiagen) according to the manufacturer’s protocol for tissues, except (1) the lysis step used 200 µl Cell Lysis Solution and 1.5 µl proteinase K and was performed at room temperature for 30 min; (2) the RNase A treatment was performed at room temperature for 20 min; and (3) the final DNA pellet was resuspended in 10 mM Tris pH 8 at room temperature without an extended incubation.

Note that, due to the relatively low yields of mitochondrial DNA preparations, these samples were profiled using HiDEF-seq with A-tailing (see the ‘HiDEF-seq library preparation’ section).

DNA extraction from blood, lymphoblastoid cells, and fibroblasts for HiDEF-seq and germline sequencing

DNA from blood, lymphoblastoid cells, and fibroblasts (the latter two after resuspending cell pellets in 1× PBS)—except for blood from individuals whose tumours were profiled, fibroblasts testing the effect of sperm freezing medium, and fibroblasts from lentivirus experiments—was extracted using the MagAttract HMW DNA Kit according to the manufacturer’s whole-blood purification protocol, with proteinase K incubation at room temperature for 30 min.

DNA from fibroblasts frozen in sperm-freezing medium and fibroblasts in lentivirus experiments was extracted using the Puregene DNA Kit according to the manufacturer’s protocol for cultured cells, except (1) the protocol volumes were scaled 2.8-fold; (2) the lysis step used 840 µl cell lysis solution and 4.2 µl proteinase K and was performed at room temperature for 30 min; (3) the RNase A treatment was performed at room temperature for 20 min; and (4) the final DNA pellet was resuspended in 10 mM Tris pH 8 at 4 °C for 1 h.

We also performed an experiment that excluded a measurable cytosine deamination effect by possible leached iron from MagAttract magnetic beads (Extended Data Fig. 9e) by extracting an additional aliquot of DNA from the blood of individual 1901 using the Puregene DNA Kit according to the manufacturer’s protocol for ‘whole blood or bone marrow’, except (1) 200 µl blood was first diluted with 100 µl of 1× PBS; (2) the cell lysis step was performed at room temperature; (3) the RNase A treatment was performed at room temperature for 20 min; and (4) the final DNA pellet was resuspended in 10 mM Tris pH 8 at 4 °C for 1 h.

DNA extraction from tumours and those individuals’ corresponding blood for Illumina tumour and germline sequencing

DNA was extracted from tumours by first homogenizing the tumour using the Precellys 24 Tissue Homogenizer followed by the DNeasy Blood & Tissue Kit (Qiagen), according to the manufacturer’s protocol for animal tissues with a 56 °C incubation for 10 min. For individuals whose tumours were profiled, DNA was extracted from blood of those individuals using the PAXgene Blood DNA Kit (Qiagen) according to the manufacturer’s protocol.

DNA extraction from saliva for Illumina germline sequencing

DNA was extracted using the QIAamp DNA Mini Kit according to the manufacturer’s ‘DNA purification from blood or body fluids’ protocol and including RNase A treatment.

DNA extraction from the liver and spleen for Illumina germline sequencing

DNA of all of the samples was extracted using the QIAamp DNA Mini Kit according to the manufacturer’s ‘DNA purification from tissues’ protocol with a 2 h proteinase K digestion at 56 °C and including RNase A treatment, except for liver of individual 5309, from which DNA was extracted using the MagAttract HMW DNA Kit according to the manufacturer’s ‘Fresh or Frozen Tissue’ protocol with a 2 h proteinase K digestion at 56 °C.

DNA extraction from blood for Pacific Biosciences germline sequencing

DNA was extracted using the Chemagic DNA Blood 2k Kit (Perkin Elmer, CMG-1097) on the Chemagic 360 automated nucleic extraction instrument (Perkin Elmer) according to the manufacturer’s protocols for DNA isolation from whole blood.

DNA quantity and quality measurements and storage

The concentration and quality of all DNA samples were measured using a NanoDrop instrument (Thermo Fisher Scientific), a Qubit 1× dsDNA HS Assay Kit (Thermo Fisher Scientific) and a Genomic DNA ScreenTape TapeStation Assay (Agilent). DNA was stored at −20 °C.

Illumina germline and tumour library preparation and sequencing

Illumina germline and tumour sequencing libraries were prepared using the TruSeq DNA PCR-Free Kit (Illumina) for all samples. At least 110 Gb (~36× genome coverage) of 150 bp paired-end sequencing per sample was obtained using a NovaSeq 6000 instrument (Illumina) by Psomagen, except for tumour sequencing and those individuals’ corresponding germline sequencing, for which HiSeqX and NovaSeq 6000 instruments were used at the Centre for Applied Genomics at the Hospital for Sick Children.

Pacific Biosciences germline library preparation and sequencing

A total of 15 μg of DNA was cleaned up with 1× AMPure PB beads (Pacific Biosciences) and sheared to a target size of 14 kb using the Megaruptor 3 instrument (Diagenode) using the following settings: speed, 36; volume, 300 µl; concentration, 33 ng µl−1. Library preparation was performed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) according to the manufacturer’s instructions. Library fragments longer than 10 kb were selected using a PippinHT instrument (Sage Science). Size-selected libraries were sequenced on the Pacific Biosciences Sequel IIe system using the Sequel II Binding Kit 2.0 and Sequel II Sequencing Kit 2.0 (Pacific Biosciences), Sequencing primer v4 (Pacific Biosciences), 1 h binding time, 2 h pre-extension, adaptive loading, 2 h immobilization time, and 30 h movies.

Heat damage of DNA

DNA was heated in a volume of 62 µl at the temperature, for the time, and in the buffer listed for each sample in Supplementary Table 1, followed by incubation on ice up to a total of 6 h if the heating time was less than 6 h. Untreated samples in these experiments were incubated on ice for 6 h. The DNA was then input into HiDEF-seq library preparation.

NanoSeq library preparation and sequencing

NanoSeq libraries were prepared as previously described8 with 50 ng DNA input from the same DNA aliquots used for HiDEF-seq.

HiDEF-seq library preparation and sequencing

Choice of restriction enzymes for DNA fragmentation

We performed in silico digests of the CHM13 v.1.0 human reference genome sequence58 to identify restriction enzymes that (1) maximize the percentage of the genome fragmented to sizes between 1 and 4 kb; (2) are active at 37 °C; and (3) fragment DNA with blunt ends, since blunt fragmentation avoids single-strand overhangs that can lead to artefactual double-strand mutations during end repair8. This in silico screen identified Hpy166II (recognition sequence: 5′-GTN/NAC-3′) as the optimal restriction enzyme, with a prediction of 37% of the genome mass fragmenting between 1 and 4 kb. The percentage of the genome fragmented to sizes between 1 and 4 kb was then empirically measured by fragmenting 1 µg of genomic DNA followed by quantification on a Genomic DNA ScreenTape assay (Agilent). Empirically, Hpy166II fragmented 41% of the genome to within the target size range. Note that, although Hpy166II is blocked by methylated CpG when present on both sides of the recognition sequence (New England BioLabs), this will occur only with the larger recognition sequence 5′-C*GTN/NAC*G-3′ (the asterisks signify methylation of the preceding cytosine); excluding all of these potential bimethylated sites increases the in silico predicted percentage of the genome fragmented by Hpy166II to within the target size range by 0.2%, and 99.97% of genomic bases within the original target size range remain when excluding these as potential fragmentation sites.
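As an illustration of the in silico digest described above, the following minimal Python sketch cuts a linear sequence at every Hpy166II site (5′-GTN/NAC-3′, blunt cut after the third base) and reports the fraction of sequence mass falling in the 1–4 kb window. The function names and the toy sequence are hypothetical and are not taken from the published pipeline; a genome-scale screen would also need to handle ambiguous bases and multiple contigs.

```python
import re

# Hpy166II recognition sequence 5'-GTN/NAC-3'; the lookahead also finds overlapping sites.
SITE = re.compile(r"(?=GT[ACGT][ACGT]AC)")
CUT_OFFSET = 3  # blunt cut falls after the third base of the site


def digest(seq):
    """Return fragment lengths from a complete digest of a linear sequence."""
    seq = seq.upper()
    cuts = [m.start() + CUT_OFFSET for m in SITE.finditer(seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def mass_fraction_in_range(lengths, lo=1000, hi=4000):
    """Fraction of total bases contained in fragments of lo..hi bp (inclusive)."""
    total = sum(lengths)
    in_range = sum(n for n in lengths if lo <= n <= hi)
    return in_range / total if total else 0.0


# Toy sequence (hypothetical) containing two sites.
toy = "A" * 1500 + "GTTAAC" + "C" * 2000 + "GTCGAC" + "T" * 500
fragments = digest(toy)
fraction = mass_fraction_in_range(fragments)
```

Weighting by fragment length rather than by fragment count matches the text's description of the result as a percentage of genome mass.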

For the mitochondrial genome, Hpy166II captures 3 fragments in our target 1–4 kb size range, at the following coordinates (CHM13 v.1.0): (1) 3068–5116 (2,048 bp); (2) 7581–9439 (1,858 bp); and (3) 10441–11831 (1,390 bp). These fragments encompass 32% of the mitochondrial genome.
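The fragment sizes and the covered fraction quoted above can be rechecked with a few lines of arithmetic. The mitochondrial genome length used here (16,569 bp) is an assumption that is not stated in the text, and fragment length is taken as end minus start, which reproduces the sizes given.

```python
# Coordinates (CHM13 v.1.0) of the three Hpy166II fragments listed in the text.
fragments = [(3068, 5116), (7581, 9439), (10441, 11831)]

# Assumed mitochondrial genome length in bp; not stated in the text.
MITO_LENGTH_BP = 16569

lengths = [end - start for start, end in fragments]
covered_fraction = sum(lengths) / MITO_LENGTH_BP
print(lengths, f"{covered_fraction:.1%}")
```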

HiDEF-seq library preparation

Input DNA amounts of 500–3,000 ng (as measured using the Qubit 1× dsDNA HS Assay (Qubit)) were used per library, depending on available DNA. With high-quality DNA, input amounts of 500 ng provide sufficient HiDEF-seq library yield for approximately one full (non-multiplexed) Pacific Biosciences (PacBio) Sequel II instrument sequencing run, and lower input amounts are feasible for filling a fraction of a sequencing run. We have successfully made HiDEF-seq libraries with as little as 200 ng of input DNA, producing sufficient yield for 40% of a sequencing run. For fragmented DNA samples, more than 1,500 ng of input DNA is generally required. Generally, for samples other than sperm and tissues from young children, which have low mutation burdens, one quarter of a sequencing run is sufficient for mutation burden and pattern analysis. Input DNA A260/A280 > 1.8 and A260/A230 > 2.0 absorption ratios were confirmed on the NanoDrop before library preparation according to the Pacific Biosciences DNA preparation guidelines; we found that this quality control is important for sequencing yield for post-mortem tissues, but is not strictly necessary for other sample types.

As some DNA fragments are <1 kb after restriction enzyme fragmentation, these small fragments need to be removed during library preparation. We found that effective removal of <1 kb DNA fragments with high-yield recovery of larger DNA fragments requires three size selections with a 75% dilution of AMPure PB beads (Pacific Biosciences) during library preparation. We also found that efficient size selection critically depends on a DNA concentration of <10 ng µl−1 in the input sample. Accordingly, before beginning library preparation, a sufficient volume of AMPure PB Beads was diluted with Elution Buffer (Pacific Biosciences) to a final 75% AMPure PB bead volume/total volume solution to be used for all subsequent bead purifications and size selections. Below, ‘diluted AMPure beads’ refers to these diluted beads.

Input DNA was fragmented with 0.14 U µl−1 of Hpy166II restriction enzyme (NEB) in a 70 µl reaction with 1× CutSmart buffer (NEB) for 20 min at 37 °C. The fragmentation reaction was scaled to a higher volume if the input DNA was too dilute to accommodate a 70 µl reaction.

Fibroblast samples from lentivirus transduction experiments and fibroblasts frozen in sperm-freezing medium that were extracted using the Puregene method comprised around 30–70% residual RNA (based on a comparison of Qubit and NanoDrop quantification), which was not fully removed by the DNA extraction’s RNase A digestion. For these samples, we added 0.5 µl of 100 mg ml−1 RNase A (Qiagen) after completion of the fragmentation reaction and incubated at room temperature for 1 min.

Next, the fragmentation reaction was diluted with NFW to a DNA concentration of 10 ng µl−1 (or not diluted if DNA is already <10 ng µl−1) based on the Qubit concentration of the DNA measured before the fragmentation reaction. For the first bead purification/size selection, a ratio of 0.8× diluted AMPure beads volume to sample volume was used, with two 80% ethanol washes, and the DNA was eluted with 22 µl of 10 mM Tris pH 8. The DNA concentration was measured again with Qubit.
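The bead and dilution volumes used in the size selections follow from simple arithmetic, sketched below. The helper names are hypothetical, and the example assumes that the concentration in the reaction is estimated as the Qubit-measured input mass divided by the reaction volume.

```python
TARGET_NG_PER_UL = 10.0  # size selection is performed at or below this concentration


def diluted_bead_stock(total_ul, bead_fraction=0.75):
    """Volumes (beads, elution buffer) for a 75% bead volume/total volume mix."""
    beads = total_ul * bead_fraction
    return beads, total_ul - beads


def diluent_to_add(conc_ng_per_ul, volume_ul, target=TARGET_NG_PER_UL):
    """Volume of diluent (µl) needed to reach the target; 0 if already at or below it."""
    if conc_ng_per_ul <= target:
        return 0.0
    mass_ng = conc_ng_per_ul * volume_ul
    return mass_ng / target - volume_ul


def bead_volume(sample_volume_ul, ratio):
    """Volume of diluted beads for a given bead:sample ratio (for example 0.8 or 0.75)."""
    return ratio * sample_volume_ul


# Example: 1,000 ng of DNA in a 70 µl fragmentation reaction, first size selection (0.8x).
added = diluent_to_add(1000 / 70, 70)
beads = bead_volume(70 + added, 0.8)
```

In this example the reaction is brought to 100 µl and mixed with 80 µl of diluted beads.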

Nick ligation was then performed (except for the initial version of HiDEF-seq without nick ligation; Supplementary Note 1) in a 30 µl reaction with 3 µl of 10× rCutSmart Buffer (NEB), 1.56 µl of 500 µM β-nicotinamide adenine dinucleotide (NAD+) (NEB) and 15 U of Escherichia coli DNA ligase (NEB). The nick ligation reaction was incubated at 16 °C for 30 min with the heated lid turned off.

The DNA was then diluted with 10 mM Tris pH 8 to 10 ng µl−1 (or not diluted if DNA is already <10 ng µl−1) based on the Qubit concentration measured after the post-fragmentation reaction bead purification. For the second bead purification/size selection, a ratio of 0.75× diluted AMPure bead volume to sample volume was used, with two 80% ethanol washes, and the DNA was eluted with 22 µl of 10 mM Tris pH 8. The DNA concentration was measured with Qubit.

The DNA was then treated as described in ref. 8 in 30 µl volume reactions with 21 µl of input DNA, 1.5 µl NFW, 3 µl 10× NEBuffer 4 (NEB), 3 µl of 1 mM each dATP/ddCTP/ddGTP/ddTTP (dATP/ddBTP) (dATP, Thermo Fisher Scientific, R0141; ddCTP/ddGTP/ddTTP, Jena Bioscience NU-1019S) and 7.5 U of Klenow fragment 3′→5′ exo- (NEB), except for input DNA that is degraded (such as post-mortem kidney and liver) for which the reaction was performed without dATP. The reaction was incubated at 37 °C for 30 min. Next, a third bead-purification/size-selection step was performed: the reaction volume was diluted with 10 mM Tris pH 8 to 10 ng µl−1 DNA (or not diluted if DNA is already <10 ng µl−1) based on the Qubit concentration measured before the Klenow reaction, followed by a ratio of 0.75× diluted AMPure bead volume to sample volume, with two 80% ethanol washes, and elution of DNA with 22 µl of 10 mM Tris pH 8. The eluted DNA was then adjusted to a total of 30 µl with 3 µl of 10× NEBuffer 4 and NFW before proceeding to adapter ligation.

Ligation of hairpin adapters was performed using reagents from the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences) by combining 30 μl of Klenow-treated DNA, 2.5 µl Barcoded Overhang Adapter (Pacific Biosciences), 15 µl Ligation Mix, 0.5 µl Ligation Additive and 0.5 µl of Ligation Enhancer. For samples for which the preceding Klenow reaction was performed without dATP (that is, non-A tailed libraries), 2.5 µl of 17 µM annealed blunt adapters were used instead (their sequences and preparation are described below). The adapter ligation reaction was incubated at 20 °C for 60 min with the heated lid turned off. Immediately after the adapter ligation, nuclease treatment was performed using the SMRTbell Enzyme Clean Up Kit 1.0 (Pacific Biosciences) to remove any non-circularized DNA containing nicks and/or without hairpin adapters: 2 µl Enzyme A, 0.5 µl Enzyme B, 0.5 µl Enzyme C and 1 µl Enzyme D were combined, and this 4 µl enzyme mix was added to the ligation reaction and incubated at 37 °C for 60 min. After the nuclease treatment, the samples were purified with a ratio of 1.2× diluted AMPure bead volume to sample volume and eluted with 24 µl of 10 mM Tris pH 8.

After the post-nuclease treatment AMPure bead purification, non-A tailed HiDEF-seq libraries underwent an additional 1.1× diluted AMPure bead purification to remove residual adapter dimers.

Final library concentration and size distribution were measured using the Qubit and High Sensitivity D5000 ScreenTape (Agilent). The final library fragment size distribution should contain <5% of DNA mass <1 kb and <5% of DNA mass >4 kb (percentages calculated using the ScreenTape analysis software’s manual region analysis ‘% of Total’ field). The final mass yield of A-tailed libraries should be around 6–10% of the input genomic DNA mass, and approximately half of that for non-A tailed libraries. The libraries were stored in 0.5 ml DNA LoBind tubes at −20 °C.
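The acceptance criteria above can be expressed as a small check. This is only a sketch: the function name and return structure are hypothetical, the yield window for non-A-tailed libraries is taken as exactly half of the A-tailed window, and the text presents the yield figures as approximate expectations rather than hard cut-offs.

```python
def library_qc(frac_mass_below_1kb, frac_mass_above_4kb,
               library_ng, input_ng, a_tailed=True):
    """Compare a final library against the size and yield expectations in the text."""
    # Size distribution: less than 5% of DNA mass below 1 kb and above 4 kb.
    size_ok = frac_mass_below_1kb < 0.05 and frac_mass_above_4kb < 0.05
    # Expected yield: about 6-10% of input mass (A-tailed), about half that otherwise.
    lo, hi = (0.06, 0.10) if a_tailed else (0.03, 0.05)
    yield_fraction = library_ng / input_ng
    return {
        "size_ok": size_ok,
        "yield_fraction": yield_fraction,
        "yield_in_expected_range": lo <= yield_fraction <= hi,
    }


result = library_qc(0.03, 0.02, library_ng=80.0, input_ng=1000.0)
```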

On ScreenTape, some non-A tailed HiDEF-seq libraries had a low level of residual adapter dimers, which was removed with a final 1.3× diluted AMPure bead purification after multiplexing the libraries from the same run (see multiplexing details in the ‘HiDEF-seq library sequencing’ section).

Sequences and preparation of blunt adapters used for HiDEF-seq without A-tailing

Adapters (sequences below) were ordered as HPLC-purified oligonucleotides from Integrated DNA Technologies. Each adapter was reconstituted to 100 µM concentration with NFW. Annealing was then performed for each adapter at a concentration of 17 µM in a 30 µl volume containing 10 mM Tris pH 8 and 50 mM NaCl, by incubating at 95 °C for 3 min and cooling at room temperature for 30 min. Additional barcoded adapters can be designed by replacing the below barcodes with alternative sequences. bcAd1001: /5′Phos/ACGCACTCTGATATGTGATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATCACATATCAGAGTGCGT (barcode = CACATATCAGAGTGCG); bcAd1002: /5′Phos/ACTCACAGTCTGTGTGTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATACACACAGACTGTGAGT (barcode = ACACACAGACTGTGAG); bcAd1003: /5′Phos/ACTCTCACGAGATGTGTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATACACATCTCGTGAGAGT (barcode = ACACATCTCGTGAGAG); bcAd1008: /5′Phos/ACGCAGCGCTCGACTGTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATACAGTCGAGCGCTGCGT (barcode = ACAGTCGAGCGCTGCG).
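When designing additional barcoded adapters, it can help to confirm that a candidate has the same layout as the four listed above. In those sequences, the barcode sits one base from the 3′ end, its reverse complement sits one base from the 5′ end, and the first and last 26 nt are reverse complements of each other, forming the hairpin stem. The sketch below checks these properties; the 26 nt stem length is inferred from the listed sequences rather than stated in the text, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STEM_LEN = 26  # inferred from the listed adapter sequences (assumption)

ADAPTERS = {
    "bcAd1001": ("ACGCACTCTGATATGTGATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATCACATATCAGAGTGCGT",
                 "CACATATCAGAGTGCG"),
    "bcAd1002": ("ACTCACAGTCTGTGTGTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATACACACAGACTGTGAGT",
                 "ACACACAGACTGTGAG"),
    "bcAd1003": ("ACTCTCACGAGATGTGTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATACACATCTCGTGAGAGT",
                 "ACACATCTCGTGAGAG"),
    "bcAd1008": ("ACGCAGCGCTCGACTGTATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGATACAGTCGAGCGCTGCGT",
                 "ACAGTCGAGCGCTGCG"),
}


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def adapter_is_consistent(seq, barcode):
    """Check barcode placement and stem self-complementarity of a hairpin adapter."""
    barcode_ok = seq.endswith(barcode + "T") and seq.startswith("A" + revcomp(barcode))
    stem_ok = seq[:STEM_LEN] == revcomp(seq[-STEM_LEN:])
    return barcode_ok and stem_ok


checks = {name: adapter_is_consistent(seq, bc) for name, (seq, bc) in ADAPTERS.items()}
```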

Modified HiDEF-seq library preparation trials to remove ssDNA T>A artefacts

Below are details of trials to remove ssDNA T>A artefacts arising from ssDNA nicks that remain after nick ligation. The final protocol that completely removes these artefacts (HiDEF-seq without A-tailing) is described in the main ‘HiDEF-seq library preparation’ section.

Polynucleotide kinase. The standard HiDEF-seq protocol was followed with the exception that, before nick ligation, the DNA was treated in a 30 µl reaction containing 12 U T4 polynucleotide kinase (NEB), 1 mM ATP (NEB), 4 mM DTT (Promega) and 1× CutSmart buffer (NEB) at 37 °C for 1 h. The sample then proceeded into the nick ligation reaction in a higher reaction volume of 35 µl, with reaction components scaled proportionally to the higher volume and a final 1× CutSmart buffer concentration.

Alternative A-tailing polymerases. The standard HiDEF-seq protocol was followed with the exception of replacing Klenow fragment 3′→5′ exo- polymerase with one of the following: 9.6 U Bst large fragment (NEB), 9.6 U Bst 2.0 (NEB), 9.6 U Bst 3.0 (NEB) or 9 U Isopol SD+ (ArcticZymes). The reaction temperatures and times for these polymerases were as follows: (1) Bst large fragment and Bst 2.0: 30 min at 45 °C; (2) Bst 3.0: 30 min or 150 min at 45 °C, or 210 min at 37 °C; and (3) Isopol SD+: either 30 min or 210 min at 37 °C.

Pyrophosphatase. The standard HiDEF-seq protocol was followed with the exception of adding 0.15 U of E. coli inorganic pyrophosphatase (NEB).

Klenow reaction without dATP or without dATP/ddBTP. The standard HiDEF-seq protocol was followed with the exception that the Klenow reaction was performed without dATP or without dATP/ddBTP.

No Klenow reaction. The standard HiDEF-seq protocol was followed, except that, after the post-nick ligation bead purification, the DNA was diluted to 30 µl in a final 1× NEBuffer 4 concentration and taken directly to adapter ligation using blunt adapters. After the post-nuclease treatment bead purification, an additional size-selection step was performed with 0.75× diluted AMPure beads as this would ordinarily have occurred after the Klenow reaction. Note that this protocol produces a CCT>CGT ssDNA artefact that does not occur when the Klenow reaction is performed without dATP or ddBTP, indicating that Klenow polymerase removes this artefact likely through a pyrophosphorolysis mechanism (Extended Data Fig. 5d and Supplementary Table 3).

HiDEF-seq library preparation with uracil DNA glycosylase/endonuclease VIII treatment

Libraries were prepared according to the above HiDEF-seq library protocol with A-tailing, except that 3 µl of a mixture of uracil DNA glycosylase/endonuclease VIII (NEB USER enzyme, M5505) was added to the nuclease treatment step.

HiDEF-seq library preparation with multi-glycosylase/endonuclease treatment

Libraries were prepared according to the above HiDEF-seq library protocol without A-tailing, except that, after the bead purification/size selection that occurs after the Klenow ddBTP reaction, an additional enzyme treatment step was performed. This enzyme treatment occurred in a total volume of 60 µl in a final 1× ThermoPol Buffer (NEB) at 37 °C for 30 min, with the following enzymes: (1) 10 U endonuclease IV (NEB); (2) 8 U formamidopyrimidine DNA glycosylase (Fpg) (NEB); (3) 10 U T4 pyrimidine dimer glycosylase (NEB); (4) 2 µl of a mixture of uracil DNA glycosylase/endonuclease VIII (NEB USER enzyme); (5) 10 U endonuclease VIII (NEB); (6) 10 U human alkyl adenine DNA glycosylase (hAAG) (NEB); and (7) 5 U human single-stranded selective monofunctional uracil DNA glycosylase (hSMUG1) (NEB). This reaction was cleaned up with a ratio of 1.2× diluted AMPure bead volume to sample volume, with two 80% ethanol washes, and elution of DNA with 22 µl of 10 mM Tris pH 8. The eluted DNA was then adjusted to a total of 30 µl with 3 µl of 10× NEBuffer 4 and NFW before proceeding to adapter ligation according to the standard HiDEF-seq protocol.

HiDEF-seq large fragment library preparation

Large-fragment-size libraries (range, 1–10 kb; median, 4.1 kb) were prepared according to the above HiDEF-seq library protocol, except (1) fragmentation was performed with 30 U PvuII-HF enzyme (NEB) instead of Hpy166II; (2) post-nick ligation and post-A-tailing cleanups were performed with 1.8× undiluted AMPure PB beads, and DNA was not diluted to <10 ng µl−1 (since size selection is not being performed); and (3) final post-nuclease treatment bead purification was performed with 1× undiluted AMPure PB beads.

HiDEF-seq library preparation with random fragmentation

Libraries were prepared according to the above HiDEF-seq library protocol without A-tailing (that is, Klenow reaction without dATP and using blunt adapters), except that (1) a higher amount of input DNA was used (4 µg per sample); (2) instead of restriction enzyme fragmentation, DNA was acoustically fragmented in miniTUBE Clear tubes (2 µg per tube, that is, 2 × 2 µg aliquots per sample), with each 2 µg DNA aliquot diluted to 200 µl in a final buffer of 10 mM Tris pH 8 and 50 mM NaCl, on an ME220 instrument (Covaris) using the following settings: temperature, 7 °C; treatment time, 900 s; peak incident power, 8 W; duty factor, 20%; and cycles/burst, 1,000; (3) each 2 µg fragmented DNA aliquot was blunted in a 200 µl reaction containing 0.5 U µl−1 nuclease P1 (NEB) and 1× NEBuffer r1.1 (NEB) at 37 °C for 30 min, after which the reaction was stopped by adding 8 µl of 0.5 M EDTA and 2 µl of 1% SDS; (4) after the Nuclease P1 reaction, the protocol continued with the 0.8× diluted AMPure bead purification as is usually performed after restriction enzyme fragmentation, and the two aliquots of each sample were combined at the elution stage for a final elution volume of 22 µl; (5) before nick ligation, the DNA was treated with 0.4 U µl−1 T4 polynucleotide kinase (NEB), 1 mM ATP and 4 mM DTT in a 30 µl volume of 1× rCutSmart buffer (NEB) at 37 °C for 1 h; (6) nick ligation was performed immediately after by adding the required reagents to the T4 polynucleotide kinase reaction to a final volume of 35 µl; (7) the bead-purification step after the Klenow reaction was performed with a 1.2× ratio of diluted AMPure bead volume to sample volume, instead of a ratio of 0.75×; (8) after nuclease treatment, libraries underwent a 1.2× diluted AMPure bead purification, then libraries for the same sequencing run were pooled, and a final 1.0× diluted AMPure bead purification was performed to remove residual adapter dimers.

HiDEF-seq library sequencing

Libraries sequenced on the same sequencing run were multiplexed together based on the final library Qubit quantification to achieve at least 50 ng of total library in no more than 15 µl volume. When necessary, the concentration of individual or pooled libraries can be increased either by room temperature centrifugal vacuum concentration (Eppendorf Vacufuge), pausing periodically (approximately every 2 min) to avoid increases in temperature, or by AMPure PB bead purification.

Sequencing was performed on Pacific Biosciences Sequel II or Sequel IIe systems with 8M SMRT Cells by the Icahn School of Medicine at Mount Sinai Genomics Core Facility and the New York University Grossman School of Medicine Genome Technology Center. Sequencing parameters were as follows: Sequel II Binding Kit 2.0 (Pacific Biosciences), Sequel II Sequencing Kit 2.0 (Pacific Biosciences), Sequencing primer v4 (Pacific Biosciences), 1 h binding time, diffusion loading, loading concentrations between 125 and 160 pM (lower concentrations were used for blood than for tissues) for standard size libraries (Hpy166II libraries) or 80 pM for large-fragment libraries (PvuII libraries), 2 h pre-extension, and 30 h movies.

Germline and tumour sequencing data processing

The HiDEF-seq computational pipeline can filter germline variants using either standard short-read or long-read genome sequencing of the same individual (Extended Data Fig. 3k,l).

Illumina germline and tumour sequencing data processing

Reads were aligned to the CHM13 v.1.0 reference genome58 using BWA-MEM (v.0.7.17)59 with the standard settings, followed by marking of optical duplicates and sorting using the Picard Toolkit v.3.1.0 (Broad Institute). Variants were called from the aligned reads with two different variant callers: (1) Genome Analysis Toolkit (GATK)60 v.4.1.9.0 using the HaplotypeCaller tool with the parameters ‘-ERC GVCF -G StandardAnnotation -G StandardHCAnnotation -G AS_StandardAnnotation’ followed by the GenotypeGVCFs tool with the default parameters; (2) DeepVariant61 v.1.2.0 with the parameter: ‘--model_type=WGS’. Both GATK and DeepVariant variant calls were used during the subsequent analysis.

Pacific Biosciences germline sequencing data processing

Circular consensus sequences were derived from raw subreads (a subread is one sequencing pass of a single strand of a DNA molecule) using pbccs v.5.0.0 (ccs, Pacific Biosciences) with the default parameters, and consensus sequences were filtered to retain only high-quality ‘HiFi’ reads, that is, reads with predicted consensus sequence accuracy ‘rq’ tag ≥ 0.99 (rq is calculated by ccs as the average of the per base consensus qualities of the read). These reads were then aligned to the CHM13 v.1.0 reference genome with pbmm2 v.1.4.0 (Pacific Biosciences) with the parameters ‘--preset CCS --sort’. Variants were called from the aligned reads with DeepVariant61 v.1.2.0 with the parameter ‘--model_type=PACBIO’.
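For readers less familiar with the relationship between predicted accuracy and Phred quality, the sketch below converts between the two and applies the rq ≥ 0.99 (Q20) threshold. Averaging per-base accuracies in probability space is an assumption made here for illustration; the exact computation is internal to ccs, and the function names are hypothetical.

```python
import math


def accuracy_to_phred(accuracy):
    """Phred quality corresponding to a predicted accuracy (0.99 corresponds to Q20)."""
    return -10.0 * math.log10(1.0 - accuracy)


def phred_to_accuracy(q):
    """Predicted accuracy corresponding to a Phred quality."""
    return 1.0 - 10.0 ** (-q / 10.0)


def mean_read_accuracy(per_base_phred):
    """Average per-base accuracy of a read (assumed probability-space averaging)."""
    return sum(phred_to_accuracy(q) for q in per_base_phred) / len(per_base_phred)


def is_hifi(per_base_phred, min_rq=0.99):
    """Apply the rq >= 0.99 filter to a read's per-base consensus qualities."""
    return mean_read_accuracy(per_base_phred) >= min_rq
```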

HiDEF-seq primary data processing

HiDEF-seq data first undergoes a two-part primary data processing pipeline that transforms the raw data into a format suitable for subsequent analysis. Primary data processing also produces quality-control plots generated by custom scripts and by SMRT Link (Pacific Biosciences) software (for example, distributions of polymerase read lengths and number of passes). Note that, for simplicity, we use the term ‘call’ to refer to both dsDNA mutations and ssDNA mismatch and damage events. The pipeline analyses calls in sequencing reads that are single-base mismatches relative to the reference genome (that is, not insertions and deletions).

The first part of the primary data-processing pipeline uses a combination of bash and awk scripts to process raw subread sequencing data (a subread is one sequencing pass of a single strand of a DNA molecule) into a strand-specific aligned BAM format62 with additional tags needed for call analysis. The steps of this first part of data processing are as follows:

(1) Subreads for which an adapter was not detected on both ends of the molecule (‘cx’ tag not equal to 3) are removed.

(2) Consensus sequences are created separately for each strand of the DNA molecule (that is, forward and reverse strand separately) using pbccs v.6.2.0 (Pacific Biosciences) with the parameters: --by-strand, --min-rq 0.99 (minimum predicted consensus sequence accuracy > Q20 (Phred quality score) to remove low-quality consensus sequences) and --top-passes 0 (unlimited number of full-length subreads used per consensus).

(3) Demultiplexing of samples according to adapter barcodes using lima v.2.5.0 (Pacific Biosciences) with the parameters: --ccs --same --split-named --min-score 80 --min-end-score 50 --min-ref-span 0.75 --min-scoring-regions 2.

(4) Filter to remove any DNA molecules (also known as zero-mode waveguides, which are sequencing wells containing a single DNA molecule) that did not successfully produce both one forward- and one reverse-strand consensus sequence.

(5) Align forward- and reverse-strand consensus sequences to the CHM13 v.1.0 reference genome58 using pbmm2 v.1.7.0 (Pacific Biosciences), an aligner based on minimap263, with the parameters: --preset CCS. We use the telomere-to-telomere CHM13 human reference genome, which was itself constructed using long reads, to reduce genome alignment artefacts58. Note that the CHM13 v.1.0 reference genome contains only nuclear chromosomes 1–22, chromosome X and the mitochondrial genome (but not chromosome Y, which is therefore not part of the analyses).

(6) Filter to retain only DNA molecules that produce only one forward strand primary not-supplementary alignment and one reverse-strand primary not-supplementary alignment, where the forward and reverse alignments overlap (reciprocally) in the genome by at least 90%.

(7) Sort alignments by reference position.

(8) Add five tags, detailed below, to each alignment in the final BAM file, with each tag containing a comma-separated array with a length corresponding to the number of single-base mismatches in the alignment (relative to the reference genome) per the alignment CIGAR string:

    • qp: positions of bases in the read sequence (query) that are mismatches relative to the reference genome; 1-based coordinates with the left-most base in the alignment record’s SEQ column = 1;

    • qn: sequences of bases in the read (query) that are mismatches relative to the reference genome (base sequences are according to the forward genomic strand, that is, they are taken from the SEQ column of the SAM alignment record);

    • qq: qualities of bases in the read that are mismatches relative to the reference genome (taken from the QUAL column of the SAM alignment record);

    • rp: positions in reference genome coordinates of read bases that are mismatches relative to the reference genome;

    • rn: sequences of the reference genome at positions of read bases that are mismatches relative to the reference genome.
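Two of the steps above, the reciprocal-overlap filter in step 6 and the mismatch tags in step 8, can be illustrated with a short Python sketch. This is not the published bash/awk implementation. It assumes that alignments carry extended CIGAR strings in which mismatches are encoded as X operations, that alignment intervals are half-open, and that the reference is available as an in-memory string; all function names are hypothetical.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reciprocal_overlap_ok(a_start, a_end, b_start, b_end, min_frac=0.9):
    """True if two half-open intervals each overlap the other by at least min_frac."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return False
    return (overlap / (a_end - a_start) >= min_frac
            and overlap / (b_end - b_start) >= min_frac)


def mismatch_tags(pos, cigar, seq, qual, ref):
    """Collect qp/qn/qq/rp/rn arrays for single-base mismatches (CIGAR 'X').

    pos  : 1-based leftmost reference position of the alignment
    seq  : SEQ column (forward genomic strand), including soft-clipped bases
    qual : per-base qualities as integers, same length as seq
    ref  : reference sequence; ref[p - 1] is the base at 1-based position p
    """
    tags = {"qp": [], "qn": [], "qq": [], "rp": [], "rn": []}
    q = 0    # 0-based offset into seq
    r = pos  # 1-based reference coordinate
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "X":
            for i in range(n):
                tags["qp"].append(q + i + 1)
                tags["qn"].append(seq[q + i])
                tags["qq"].append(qual[q + i])
                tags["rp"].append(r + i)
                tags["rn"].append(ref[r + i - 1])
        if op in "M=XIS":   # operations that consume the query
            q += n
        if op in "M=XDN":   # operations that consume the reference
            r += n
    return tags


# Toy example (hypothetical): 2 soft-clipped bases, 1 mismatch and 1 deletion.
example = mismatch_tags(3, "2S2=1X1D2=", "NNGTCGT",
                        [30, 30, 40, 40, 25, 40, 40], "ACGTACGTAC")
```

In the toy example the single mismatch is the fifth base of SEQ (C, quality 25) against reference base A at position 5.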

The second part of the primary data processing pipeline is an R64 script (R v.4.1.2, requiring the packages Rsamtools65, GenomicAlignments66, GenomicRanges66, vcfR67, plyr68, configr69, qs70) that further processes and annotates the aligned BAM file into an R data file as follows:

(1) Load the aligned BAM file into R, including the custom tags that annotated the positions of base mismatches relative to the reference genome.

(2) Annotate calls (bases mismatched relative to the reference genome) for which the reference genome base is ‘N’, to exclude these from subsequent analysis.

(3) Annotate the positions of indels in each alignment, based on the alignment CIGAR string.

(4) Annotate each call if it was present in any of the VCF variant call files of the corresponding individual’s germline sequencing, along with details of the VCF variant annotation.

(5) Save positions of indels from the VCF variant call files of the corresponding individual’s germline sequencing.

(6) Transform the dataset so that forward- and reverse-strand consensus reads and ssDNA and dsDNA calls (and tag information) from the same DNA molecule are linked to each other as dsDNA molecules.

(7) Save the final R dataset to a file.

HiDEF-seq call filtering

The call-filtering pipeline implements a series of filters that were optimized to maximize the number of true calls while minimizing the number of sequenced bases and regions of the genome that are filtered out. During the development of the pipeline, filters and filter parameters were iteratively optimized using low-mutation-rate samples (that is, tissues from infants and sperm) to identify patterns that are common to false positives. These false positives were apparent as clusters of mutations in low-quality regions of the genome and as regions with low-quality alignment of sequencing reads. For example, when a metric of low-quality genome regions was found to correlate with clusters of low-quality calls, this metric was added as a filter, and its threshold was iteratively tuned to maximally remove false positives while minimizing the number of sequenced bases and genomic regions that are filtered.

Additional optimization of filter thresholds was performed using sperm samples that have a known low mutation burden. Specifically, we plotted the dsDNA and ssDNA burdens with a range of thresholds for three key filters: (1) minimum predicted consensus accuracy (0.99 to 0.999); (2) minimum number of passes per strand (5 to 20); and (3) minimum fraction of subreads (passes) detecting the mutation (0.5 to 0.8) (Extended Data Fig. 3c–j). We examined these plots for threshold settings above which burden estimates are stable. Since burdens were corrected for sensitivity (based on total interrogated bases and detection of known germline variants; see the ‘HiDEF-seq calculation of call burdens’ section), a decrease in burden estimates with increasing threshold settings indicates removal of sequencing artefacts. These plots showed that sperm dsDNA mutation burden estimates were stable even down to the lowest (most lenient) thresholds (Extended Data Fig. 3d,e,g). By contrast, ssDNA burdens required higher threshold settings before burden estimates stabilized (Extended Data Fig. 3i,j). Individually increasing the thresholds of each of the above three filters stabilized ssDNA burden estimates at approximately 20%, 15% and 10% lower levels, respectively, compared to the least stringent settings, and applying all three filters together with these higher thresholds reduced the ssDNA burden estimate by approximately 25% (that is, the three filters are not independent). The specific thresholds used for dsDNA and ssDNA mismatch filtering are given below in the sections describing each filter.

The call-filtering pipeline uses the following R packages: GenomicAlignments (v.1.30.0)66, GenomicRanges (v.1.46.1)66, vcfR (v.1.12.0)67, Rsamtools (v.2.10.0)65, plyr (v.1.8.6)68, configr (v.0.3.5)69, MutationalPatterns (v.3.4.1)71, magrittr (v.2.0.2)72, readr (v.2.1.2)73, dplyr (v.1.0.8)74, plyranges (v.1.14.0)75, stringr (v.1.4.0)76, digest (v.0.6.29)77, rtracklayer (v.1.54.0)78, qs (v.0.25.2)70; and the following software tools: bcftools (v.1.14)79, samtools79, wigToBigWig (v.2.8)80, wiggletools (v.1.2.11)81, pbmm2 (v.1.7.0; Pacific Biosciences), zmwfilter (v.1.2.0; Pacific Biosciences), SeqKit (v.2.1.0)82 and KMC (v.3.1.1)83.

Additional filters used in the pipeline were created using REAPR (v.1.0.18)84. REAPR was originally designed to identify regions with errors in reference genome assemblies, but we found that it calculates metrics that are useful for identifying regions of the genome prone to generating false-positive and false-negative variant calls in Illumina (short-read) sequencing data. First, Illumina whole-genome sequencing reads from a sperm sample were aligned to CHM13 v.1.0 using SMALT (v.0.7.6)85 with the parameters ‘-r 0 -x -y 0.5’ and a CHM13 v.1.0 index created with SMALT using parameters ‘-k 13 -s 2’. Next, reads were sorted and duplicates were marked. The REAPR perfectfrombam command was then run on the resulting BAM file using the parameters ‘min insert=266, max insert=998, repetitive max qual=3, perfect min qual=4, and perfect min alignment score=151’ (min and max insert size are the 1 and 99 percentiles of insert sizes calculated from the sequencing data using the Picard Toolkit CollectInsertSizeMetrics tool). REAPR metrics for each base of the genome were obtained from the output stats.per_base file and a bigwig86 annotation file was created for each metric.

The mutation analysis filters were applied serially as described below. Unless otherwise specified, the filters were applied to both ssDNA and dsDNA calls. Note that the computational pipeline has the capability to implement additional filters not listed here, as specified in the pipeline configuration documentation available online.

Filters based on DNA molecule quality and alignment metrics

Retain only DNA molecules that meet all of the below criteria:

  (1) ccs predicted consensus accuracy ≥ 0.99 in both forward and reverse strand (that is, rq tag of ccs ≥ 0.99) for dsDNA calls, and ≥Q30 (that is, rq ≥ 0.999) for ssDNA calls.

  (2) Minimum of 5 (for dsDNA calls) and 20 (for ssDNA calls) sequencing passes for each of the forward and reverse strands (using the ‘ec’ BAM file tag, which is computed by ccs as the average subread coverage across all consensus calling windows).

  (3) Both forward and reverse strands have mapping quality ≥ 60.

  (4) Maximum difference in number of ssDNA calls between the forward and reverse strands of 5, before germline variant filtering. This removes artefacts from rare chimeric molecules and residual low-quality molecules.

  (5) Average of the number of indels relative to the human reference genome in the forward and reverse strands of ≤20, before germline variant filtering. This removes low-quality molecules with many indels.

  (6) Average of the number of soft-clipped bases in the forward and reverse strands of ≤30. This removes low-quality molecules and molecules that align to complex regions of the genome with long stretches of mismatched bases.
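The six molecule-level criteria above can be expressed as a single predicate. The following Python sketch is illustrative only and is not the pipeline's R implementation; the `Strand` record, its field names and the function `molecule_passes` are hypothetical, and we assume here that the ssDNA accuracy and pass thresholds are required of both strands.

```python
from dataclasses import dataclass

@dataclass
class Strand:
    rq: float            # ccs predicted consensus accuracy (rq tag)
    ec: float            # average subread coverage, i.e. passes (ec tag)
    mapq: int            # mapping quality
    n_ssdna_calls: int   # ssDNA calls before germline variant filtering
    n_indels: int        # indels relative to the reference, before germline filtering
    n_softclipped: int   # soft-clipped bases

def molecule_passes(fwd: Strand, rev: Strand, call_type: str) -> bool:
    """Return True if the molecule is retained for 'dsDNA' or 'ssDNA' call analysis."""
    if call_type == "dsDNA":
        min_rq, min_passes = 0.99, 5
    elif call_type == "ssDNA":
        min_rq, min_passes = 0.999, 20   # assumption: applied to both strands
    else:
        raise ValueError("call_type must be 'dsDNA' or 'ssDNA'")
    strands = (fwd, rev)
    return (
        all(s.rq >= min_rq for s in strands)
        and all(s.ec >= min_passes for s in strands)
        and all(s.mapq >= 60 for s in strands)
        and abs(fwd.n_ssdna_calls - rev.n_ssdna_calls) <= 5
        and (fwd.n_indels + rev.n_indels) / 2 <= 20
        and (fwd.n_softclipped + rev.n_softclipped) / 2 <= 30
    )
```

Evaluating the same molecule once per call type reflects that a molecule can contribute to dsDNA analysis while being excluded from the more stringent ssDNA analysis.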

Filters based on germline sequencing variant calls

  (1) Filter out calls that were also identified in any of the individual’s germline sequencing VCF files with read depth ≥3, allele quality (QUAL column in VCF) ≥ 3, genotype quality (GQ tag in VCF) ≥ 3, and variant allele fraction ≥ 0.05.

  (2) Filter out DNA molecules with >8 dsDNA calls remaining after VCF germline filtering. This removes molecules with misalignment to complex regions of the genome leading to many clustered calls and regions of the genome for which Illumina short reads are not effective in identifying and filtering out germline variants.

For tumour analysis, variant calls were used in this step from both germline blood sequencing and standard fidelity (Illumina) tumour sequencing to focus the analysis on low-level mosaic calls.

Filters based on genomic regions

Filters that remove the entire DNA molecule if it meets any of the following criteria:

  (1) For analyses using either Illumina or PacBio germline sequencing data: (i) segmental duplication regions: any overlap with the DNA molecule’s forward or reverse consensus sequence alignments. This annotation was obtained from the file chm13.draft_v1.0_plus38Y.SDs.bed created by the Telomere-to-Telomere consortium87. However, for analysis of mitochondrial mutations, this region filter is not used because it contains the region chrM:10000–14910 due to a similar nuclear genome sequence on chromosome 5, which would cause unnecessary filtering of reads aligning to this region of the mitochondrial genome. There is negligible risk of nuclear genome sequences falsely aligning to this mitochondrial region since we obtain long reads, we require high mapping quality, and we exclude reads with many mismatches, and these mitochondrial and nuclear genome regions have only 94% identity. (ii) Satellite sequence regions: ≥20% of the DNA molecule’s forward- and reverse-strand consensus alignments (average for the two strands) overlaps the region. The satellite sequence region annotation was created for CHM13 v.1.0 using RepeatMasker (v.4.1.1)88 with the parameters ‘-pa 4 -e rmblast -species human -html -gff -nolow’, followed by extraction of ‘Satellite’ regions.

  (2) Only for analyses that use Illumina germline sequencing data, because short-read data are more prone to missing true germline variants in these regions: (i) telomere regions: any overlap with the DNA molecule’s forward or reverse consensus sequence alignments. This annotation was obtained from the file chm13.draft_v1.0.telomere created by the Telomere-to-Telomere consortium58. (ii) 50-mer mappability score: ≥30% of the DNA molecule’s forward- and reverse-strand consensus alignments (average for the two strands) has a mappability score < 0.4. This annotation was created for CHM13 v.1.0 using Umap (v.1.2.0)89. This annotation calculates the mappability for every base in the genome. (iii) The fraction of Illumina short reads aligning to the region that are orphaned reads (that is, the read’s mate is either unmapped or mapped to a different chromosome), averaged across the genome in 20 bp non-overlapping bins, is ≥0.15 for ≥20% of the DNA molecule’s forward- and reverse-strand consensus alignments (average for the two strands). The fraction of orphaned reads metric used in this filter is the average of the orphan_cov and orphan_cov_r REAPR metrics, which are the fraction of forward- and reverse-strand reads that are orphaned, respectively.

Filters that remove only the portions of the DNA molecule that overlap any of the following regions, while the remaining bases of the DNA molecule are still included in analysis:

  (1) Regions of the reference genome whose sequence is ‘N’.

  (2) For analyses using either Illumina or PacBio germline sequencing data: (i) satellite sequence regions: any base that overlaps one of these regions. (ii) Bases with gnomAD (v.3.1.2)90 single-nucleotide variants with ‘PASS’ flag and population allele frequency > 0.1%, lifted over from the hg38 to the CHM13 v.1.0 reference genome using the liftOver tool80. This filter removes 27,476,828 genome bases from the analysis. It is required to remove residual germline variants that were not detected in the germline sequencing of the individual, and it reduces the risk of false-positive mosaic event calls due to very low level contamination that may occur between samples of different individuals8.

  (3) Only for analyses that use Illumina germline sequencing data, because short-read data are more prone to missing true germline variants in these regions: (i) 100-mer mappability score: any base with a mappability score < 0.95, with mappability scores averaged across the genome in 20 bp non-overlapping bins (binning smoothes the mappability score signal). The primary mappability scores were calculated as described for the above 50-mer mappability score. (ii) The fraction of Illumina short reads aligning to the region that are properly paired (that is, aligned in the correct orientation and within the expected distance based on insert size distribution), averaged across the genome in 20 bp non-overlapping bins, is <0.7. The fraction of properly paired reads metric used in this filter is the average of the prop_cov and prop_cov_r REAPR metrics, which are the fraction of forward-strand and reverse-strand reads that are properly paired, respectively. (iii) The fraction of Illumina short reads aligning to the region that are orphaned reads (that is, the read’s mate is either unmapped or mapped to a different chromosome), averaged across the genome in 20 bp non-overlapping bins, is ≥0.2. The fraction of orphaned reads metric used in this filter is the average of the orphan_cov and orphan_cov_r REAPR metrics, which are the fraction of forward- and reverse-strand reads that are orphaned, respectively. (iv) The number of Illumina short reads aligning to the region to either the forward or the reverse strand and that are soft-clipped at the left end or the right end (that is, the sum of the REAPR clip_fl, clip_fr, clip_rl, clip_rr metrics), divided by [4 × number of mapped reads/100,000,000], averaged across the genome in 200 bp non-overlapping bins, is ≥0.09. (v) The number of Illumina short reads with mapping quality 0 aligning to the region, divided by [4 × number of mapped reads/100,000,000], averaged across the genome in 20 bp non-overlapping bins, is ≥0.1.
Note that this general filtering annotation was calculated using Illumina whole-genome sequencing data of one representative sample.

Base quality filter

Filter out dsDNA calls whose consensus sequence base quality is <93 (from QUAL column in BAM file) in either the forward- or reverse-strand consensus, and filter ssDNA calls whose base quality is <93 in the strand containing the call.

Filter based on location within the read

Filter out calls that are ≤10 bases from the ends of the consensus sequence alignment (alignment span excludes soft-clipped bases). For ssDNA calls, this filter is applied to the strand containing the call and, for dsDNA calls, this filter is applied to both the forward- and reverse-strand consensus sequence alignments. Although this only negligibly alters call burdens (Extended Data Fig. 3h), it removes rare alignment artefacts.

Filter based on location near germline indels

Regions near germline indels are prone to alignment artefacts that can lead to false-positive calls. This filter removes calls located near an indel within a distance less than or equal to twice the length of the indel or less than or equal to 15 bases of the indel (whichever range is larger), using indels called in any of the germline sequencing data of the individual (that is, both GATK and DeepVariant indel calls when using Illumina germline sequencing data, and only DeepVariant indel calls when using PacBio germline sequencing data). For GATK indel calls, only indels with read depth ≥ 5, QUAL ≥ 10, genotype quality ≥ 5 and variant allele fraction ≥ 0.2 were used in this filtering. For DeepVariant indel calls, only indels with read depth ≥ 3, QUAL ≥ 3, genotype quality ≥ 3 and variant allele fraction ≥ 0.1 were used in this filtering.

Filter based on location near consensus sequence indels

Regions near HiDEF-seq consensus sequence indels are prone to alignment artefacts that can lead to false-positive calls. This filter removes calls located near a consensus sequence indel within a distance less than or equal to twice the length of the indel or less than or equal to 15 bases of the indel (whichever range is larger). For dsDNA calls, the call must pass this filter on both forward and reverse consensus strands. For ssDNA calls, this filter applies only to the strand containing the call.
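The distance rule shared by the two indel-proximity filters can be sketched as follows. This Python example is illustrative rather than the pipeline's code; the function name `is_near_indel`, the representation of an indel as a reference interval plus a length, and the measurement of distance from the nearest edge of that interval are assumptions.

```python
def is_near_indel(call_pos, indel_start, indel_end, indel_len, min_flank=15):
    """Return True if a call lies within the exclusion window of an indel.

    The window extends max(2 * indel_len, min_flank) bases from the indel,
    that is, whichever of the two ranges is larger.
    Assumption: distance is measured from the nearest edge of [indel_start, indel_end].
    """
    flank = max(2 * indel_len, min_flank)
    if call_pos < indel_start:
        distance = indel_start - call_pos
    elif call_pos > indel_end:
        distance = call_pos - indel_end
    else:
        distance = 0   # call lies inside the indel interval
    return distance <= flank
```

For a 1 bp indel the 15-base minimum applies, whereas for a 10 bp indel the window grows to 20 bases on each side.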

Filters based on germline sequencing read depth and variant allele fraction

  (1) Filter out calls in locations where the germline sequencing data has <15 total reads coverage, as these low-coverage germline sequencing regions will be prone to false-negative germline variant calls that would then lead to false-positive HiDEF-seq calls.

  (2) Filter out calls detected with variant allele fraction > 0.05 or read depth > 3 in the germline sequencing data to remove variants that were not called by the previous germline variant callers (due to low variant allele fraction or due to different local haplotype assembly in GATK/DeepVariant that calls variants in a different nearby location than the bwa alignment of the consensus molecule sequence). This filter is less stringent than a recent somatic mutation analysis method8, but may still remove a small number of very early developmental mosaic variants shared between HiDEF-seq data and the individual’s germline sequencing.

The above two filters use the bcftools mpileup command to determine total read depth and variant allele fraction, using the parameters ‘-I -A -B -Q 11 --ff 1024 -d 10000 -a “INFO/AD”’ for Illumina germline sequencing data and the parameters ‘-I -B -Q5 --ff 2048 --max-BQ 50 -F0.1 -o25 -e1 --delta-BQ 10 -M399999 -d 10000 -a “INFO/AD”’ for PacBio germline sequencing data.

For tumour analysis, this filter step used both germline blood sequencing and standard fidelity (Illumina) tumour sequencing to focus the analysis on low-level mosaic calls.

Filters based on fraction of subreads (passes) detecting the call and fraction of subreads overlapping the call

We filter out calls detected in <50% (for dsDNA calls) and <60% (for ssDNA calls) of the subreads of the DNA molecule that detected the call. For dsDNA calls, this filter is applied to forward and reverse subreads separately, and the call must pass the filter in both strands. For ssDNA calls, this filter is applied only to subreads of the strand in which the call was detected.

This removes false-positive calls in the consensus sequence that are not well-supported by the subreads. The filter is implemented by first extracting the subreads of all of the DNA molecules containing calls from the raw subreads BAM file using the zmwfilter tool (Pacific Biosciences) and aligning them to the CHM13 v.1.0 reference genome with pbmm2 v.1.7.0 with the parameters ‘--preset SUBREAD --sort’. The bcftools mpileup command is then used with the parameters ‘-I -A -B -Q 0 -d 10000 -a “INFO/AD”’ to calculate the fraction of subreads detecting the call (excluding subreads with the supplementary alignment SAM flag).

In rare DNA molecules, a large fraction of subreads is soft-clipped, leading to false-positive calls in the small fraction of remaining subreads aligned to the soft-clipped region. We therefore also filter out calls for which the percentage of subreads overlapping the call (regardless of whether they contain the call) out of the total subreads aligned to the genome is <50%, calculated separately for subreads of each strand for the molecule in which the call was made. This filter is applied to the strand containing the call for ssDNA calls, and to both strands for dsDNA calls (that is, a dsDNA call must pass this filter in both strands).
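The two subread-support filters can be combined into one per-call check, sketched below in Python. This is an illustration under stated assumptions, not the pipeline's implementation: the names `MIN_ALT_FRACTION` and `call_passes_subread_filters` are hypothetical, and we assume that the fraction of subreads detecting the call is computed relative to the subreads overlapping the call position.

```python
MIN_ALT_FRACTION = {"dsDNA": 0.5, "ssDNA": 0.6}   # minimum fraction of subreads detecting the call
MIN_OVERLAP_FRACTION = 0.5                        # minimum fraction of aligned subreads overlapping the call

def call_passes_subread_filters(call_type, strand_counts):
    """Apply both subread filters to one call.

    strand_counts: list of (n_detecting, n_overlapping, n_aligned) tuples,
    one tuple for the strand containing an ssDNA call, or two tuples
    (forward and reverse) for a dsDNA call. Every listed strand must pass.
    """
    min_alt = MIN_ALT_FRACTION[call_type]
    for n_detecting, n_overlapping, n_aligned in strand_counts:
        if n_overlapping == 0 or n_aligned == 0:
            return False
        if n_detecting / n_overlapping < min_alt:
            return False
        if n_overlapping / n_aligned < MIN_OVERLAP_FRACTION:
            return False
    return True
```

Passing only the call-containing strand for ssDNA calls, and both strands for dsDNA calls, mirrors the strand-specific application of the filters described above.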

HiDEF-seq calculation of call burdens

After application of all of the above filters, DNA molecules are further filtered to retain only those with a maximum of one dsDNA call for dsDNA call burden calculations, and a maximum of one ssDNA call per strand for ssDNA call burden calculations. This removes a small number of the remaining DNA molecules that contain multiple post-filtering calls that, after manual inspection, are due to residual regions of the genome prone to false positives.

The raw dsDNA mutation burden (that is, mutations per bp) of a sample is then calculated as the [number of dsDNA calls]/[number of interrogated dsDNA base pairs], and the raw ssDNA call burden (that is, calls per base) is calculated as the [number of forward strand calls + number of reverse strand calls]/[number of interrogated forward strand read bases + number of interrogated reverse strand read bases]. Note that we subsequently use the term ‘interrogated bases’ for simplicity, even though, for dsDNA mutation analysis, it refers to interrogated base pairs. The number of interrogated bases takes into account all of the relevant filters that were applied, both filters that remove entire DNA molecules and filters that remove only portions of DNA molecules. Specifically, the number of interrogated bases is the total number of bases of DNA molecules that passed all of the filters that remove full DNA molecules (described in the ‘Filters based on DNA molecule quality and alignment metrics’ and ‘Filters based on genomic regions’ (first part) sections), excluding the bases of those remaining DNA molecules removed by the following filters (described above) that remove only portions of DNA molecules: (1) ‘Filters based on genomic regions’ (second part); (2) ‘Base quality filter’; (3) ‘Filter based on location within the read’; (4) ‘Filter based on location near germline indels’; (5) ‘Filter based on location near consensus sequence indels’; and (6) the minimum germline sequencing total read coverage filter described in the ‘Filters based on germline sequencing read depth and variant allele fraction’ section.
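The two raw burden definitions reduce to simple ratios, shown here as a short Python sketch. The function names are hypothetical and the inputs are assumed to be the post-filtering call counts and interrogated base (or base pair) totals defined above.

```python
def raw_dsdna_burden(n_dsdna_calls, interrogated_base_pairs):
    """dsDNA mutations per interrogated base pair."""
    return n_dsdna_calls / interrogated_base_pairs

def raw_ssdna_burden(fwd_calls, rev_calls, fwd_bases, rev_bases):
    """ssDNA calls per interrogated base, pooling forward and reverse strands."""
    return (fwd_calls + rev_calls) / (fwd_bases + rev_bases)
```

Note that the ssDNA burden pools counts and bases across strands before dividing, rather than averaging two per-strand burdens.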

We also calculated ‘corrected’ call burdens that correct both for: (1) differences in trinucleotide sequence context of the genome relative to interrogated bases; and (2) sensitivity of detection. These corrections were applied as follows:

First, we corrected raw call counts for the trinucleotide frequency distribution of the genome (specifically, the CHM13 v.1.0 sequences of chromosomes being analysed; that is, chromosomes 1–22 and X for nuclear genome analysis, and the mitochondrial sequence for mitochondrial genome analysis) relative to the trinucleotide frequency distribution of interrogated bases in sequencing reads. This correction for ‘trinucleotide context opportunities’ is necessary because interrogated bases may have a different distribution of trinucleotides compared to the genome due to restriction enzyme fragmentation and computational filters, and this may affect burden estimates8. Specifically, we first calculate the distribution of trinucleotides (the fraction of each trinucleotide out of all trinucleotides) across the genome. We then calculate the distribution of trinucleotides across interrogated bases of sequencing reads in the sample. Next, for each trinucleotide, we calculate the ratio of its fractional distribution in the full genome to its fractional distribution in the interrogated bases. The trinucleotide-corrected count of HiDEF-seq calls is then obtained by multiplying the raw call count for each trinucleotide context by that context’s genome/interrogated bases trinucleotide ratio. For dsDNA calls, trinucleotide context corrections are performed using all possible 32 trinucleotide contexts where the middle base is a pyrimidine. For ssDNA calls, trinucleotide context corrections are performed using all 64 possible trinucleotides and using strand-specific trinucleotide sequences of calls, interrogated bases and the genome. 
The trinucleotide contexts of ssDNA calls reflect the original DNA molecule’s ssDNA change—that is, for calls in strands aligning to the forward strand of the reference genome, the reverse complements of the call, interrogated read sequences and genome are used for trinucleotide context corrections, and vice versa for calls in strands aligning to the reverse strand. This is because the sequence data produced by the sequencer has the directionality of the sequencer-synthesized strand rather than the original (template) DNA molecule.
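The trinucleotide context correction described above can be sketched in Python as follows. This is an illustrative example, not the pipeline's code; the function name and the dictionary-based inputs keyed by trinucleotide are assumptions, and strand handling (reverse complementing for ssDNA calls) is assumed to have been done before the counts are supplied.

```python
def trinucleotide_corrected_counts(raw_call_counts, genome_tri_counts, interrogated_tri_counts):
    """Scale raw call counts by the genome/interrogated trinucleotide ratio.

    raw_call_counts: calls per trinucleotide context.
    genome_tri_counts: trinucleotide occurrences in the analysed genome.
    interrogated_tri_counts: trinucleotide occurrences in interrogated bases.
    """
    genome_total = sum(genome_tri_counts.values())
    interrogated_total = sum(interrogated_tri_counts.values())
    corrected = {}
    for context, n_calls in raw_call_counts.items():
        genome_fraction = genome_tri_counts[context] / genome_total
        interrogated_fraction = interrogated_tri_counts[context] / interrogated_total
        corrected[context] = n_calls * genome_fraction / interrogated_fraction
    return corrected
```

A context that is over-represented in the interrogated bases relative to the genome is scaled down, and an under-represented context is scaled up.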

Second, we corrected call counts for sensitivity of detection separately for each sample using a set of high-quality, true-positive heterozygous germline (dsDNA) variants detected in the HiDEF-seq data of the sample. This specifically accounts for single-molecule sensitivity loss due to the ‘Filters based on fraction of subreads (passes) detecting the call and fraction of subreads overlapping the call’ that are applied to calls detected in the final interrogated bases (they are applied to each strand separately, and dsDNA calls must pass the filters in both strands). All of the other filters remove DNA molecules and bases from the final set of interrogated bases and therefore do not require sensitivity correction. To generate the true-positive set of heterozygous germline variants for each sample, we extracted all of the autosomal dsDNA calls detected in the final interrogated HiDEF-seq bases of the sample that were also called in all of the germline variant call sets of the individual with ≥50th percentile VCF QUAL score, ≥50th percentile VCF genotype quality, ≥50th percentile total read depth, and variant allele fraction between 30% and 70%. We retain only calls that meet these criteria across every one of the variant call sets of the individual and that are present in gnomAD v.3.1.2 with ‘PASS’ flag and population allele frequency > 0.1%. If more than 10,000 such true-positive germline calls are identified, a random subset of 10,000 calls is selected for the sensitivity calculation. We then extract subreads corresponding to the DNA molecules that detected these calls in the sample, realign them to the genome with pbmm2 v.1.7.0 with the ‘--preset SUBREAD --sort’ settings and annotate the variants using the same process described in the ‘Filters based on fraction of subreads (passes) detecting the call and fraction of subreads overlapping the call’ step of the call-filtering pipeline. 
We next calculate germline variant sensitivity for the sample as the number of true-positive germline variant calls that pass the same filtering thresholds used in the ‘Filters based on fraction of subreads (passes) detecting the call and fraction of subreads overlapping the call’ step of the call-filtering pipeline, divided by the total number of true-positive germline variant calls. Each sample’s dsDNA call counts are then corrected for sensitivity by dividing by that sample’s calculated germline variant sensitivity. Each sample’s ssDNA call counts are corrected by dividing by the square root of that sample’s germline variant sensitivity, because the above dsDNA germline variant sensitivity estimate corrects for filters applied to both strands separately.
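The sensitivity correction, including the square-root adjustment for ssDNA calls, can be sketched in Python as below. The function names are hypothetical; the logic follows the description above, in which the germline (dsDNA) sensitivity reflects filters passed independently on two strands.

```python
import math

def germline_sensitivity(n_passing, n_true_positive):
    """Fraction of true-positive germline variants that pass the subread filters."""
    return n_passing / n_true_positive

def sensitivity_corrected_count(call_count, sensitivity, call_type):
    """Correct a call count for detection sensitivity.

    dsDNA calls must pass the filters on both strands, so the full sensitivity applies;
    ssDNA calls pass on one strand only, so the square root is used.
    """
    if call_type == "dsDNA":
        return call_count / sensitivity
    if call_type == "ssDNA":
        return call_count / math.sqrt(sensitivity)
    raise ValueError("call_type must be 'dsDNA' or 'ssDNA'")
```

For example, a germline sensitivity of 0.81 implies a per-strand sensitivity of 0.9, which is the divisor applied to ssDNA call counts.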

Finally, ssDNA and dsDNA burdens corrected for both trinucleotide context and sensitivity are calculated as the sum of the trinucleotide context- and sensitivity-corrected call counts divided by the number of interrogated bases (ssDNA burdens) or base pairs (dsDNA burdens). For all analyses and figures, unless otherwise specified, we use burden estimates corrected for both the full genome trinucleotide distribution and sensitivity.

The Poisson 95% confidence intervals of a sample’s corrected burden were calculated as the corrected burden × [Poisson 95% confidence interval of raw call counts, calculated using the poisson.test function in R]/[raw call counts]. Weighted least-squares linear regressions of call burdens versus age were performed using the ‘lm’ function in R (via the ggplot2 package (ref. 91)), with weights equal to 1/[raw call counts].
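The scaling of the Poisson interval onto the corrected burden can be illustrated with the following Python sketch. It is not the R code used in the analysis: it re-implements the exact (Garwood) interval by bisection using only the standard library, which we assume corresponds to the two-sided interval reported by poisson.test, and all function names are hypothetical.

```python
import math

def poisson_cdf(x, lam):
    """P(X <= x) for X ~ Poisson(lam), with lam > 0."""
    if x < 0:
        return 0.0
    terms = (math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in range(x + 1))
    return min(1.0, sum(terms))

def _bisect_decreasing(f, lo, hi, iterations=200):
    """Find the root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_poisson_ci(x, conf=0.95):
    """Exact two-sided confidence interval for a Poisson mean given x observed events."""
    alpha = 1 - conf
    search_max = x + 10 * math.sqrt(x + 1) + 10
    upper = _bisect_decreasing(lambda lam: poisson_cdf(x, lam) - alpha / 2, 0.0, search_max)
    if x == 0:
        return 0.0, upper
    lower = _bisect_decreasing(lambda lam: poisson_cdf(x - 1, lam) - (1 - alpha / 2), 0.0, search_max)
    return lower, upper

def corrected_burden_ci(corrected_burden, raw_call_count):
    """Scale the interval of the raw count onto the corrected burden (raw_call_count > 0)."""
    lower, upper = exact_poisson_ci(raw_call_count)
    return (corrected_burden * lower / raw_call_count,
            corrected_burden * upper / raw_call_count)
```

The interval width is therefore governed by the raw number of calls, while its location follows the corrected burden.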

HiDEF-seq estimate of fidelity for dsDNA mutations

The fidelity for dsDNA mutations was estimated for each sample as follows: (1) for each of the 192 possible trinucleotide contexts (that is, both central pyrimidine and central purine contexts), the number of single-strand calls at that context was divided by the total number of interrogated bases with that trinucleotide context to obtain a ssDNA call burden for that context; (2) for each central pyrimidine trinucleotide context, a dsDNA mutation error probability was calculated by multiplying the single-strand call burdens of the corresponding central pyrimidine and reverse-complement central purine trinucleotide contexts; and (3) all of the resulting central pyrimidine trinucleotide context dsDNA mutation error probabilities were summed. The main text reports the average fidelity across samples from healthy individuals, excluding sperm samples (as these have an outlier high ssDNA C>T burden) and post-mortem samples processed with HiDEF-seq with A-tailing.
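The three-step fidelity estimate can be sketched in Python as follows. This is an illustrative example rather than the analysis code; the function names and the input layout (ssDNA call counts keyed by trinucleotide and substituted base, interrogated base counts keyed by strand-specific trinucleotide) are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    return sequence.translate(_COMPLEMENT)[::-1]

def dsdna_error_probability(ssdna_calls, interrogated):
    """Estimate the probability of a false dsDNA mutation call per base pair.

    ssdna_calls: {(trinucleotide, alt_base): count} over the 192 substitution contexts.
    interrogated: {trinucleotide: number of interrogated bases with that context}.
    For each central-pyrimidine context, the ssDNA burden is multiplied by the burden
    of the complementary change on the opposite strand, and the products are summed.
    """
    total = 0.0
    for (tri, alt), n_calls in ssdna_calls.items():
        if tri[1] not in "CT":
            continue   # purine contexts are used only as partners
        partner_tri = reverse_complement(tri)
        n_partner = ssdna_calls.get((partner_tri, reverse_complement(alt)), 0)
        if n_partner == 0:
            continue
        burden = n_calls / interrogated[tri]
        partner_burden = n_partner / interrogated[partner_tri]
        total += burden * partner_burden
    return total
```

The product reflects that a false dsDNA call requires complementary errors to arise independently on both strands at the same position.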

Comparison of HiDEF-seq and standard PacBio HiFi molecule characteristics

Standard PacBio HiFi raw subread data for comparison to HiDEF-seq (Fig. 1b and Extended Data Fig. 1d,f) were obtained from the Human Pangenome Reference Consortium (HPRC) public data repository92 (samples HG02080, HG03098, HG02055, HG03492, HG02109, HG01442, HG02145, HG02004, HG01496, HG02083). Circular consensus sequences were derived from raw subreads using the same ccs version and ccs parameters used to analyse HiDEF-seq data.

Comparison of HiDEF-seq mutation burdens in sperm to paternally phased de novo mutation burdens

Paternally phased de novo mutation (DNM) burdens were calculated for each paternal age (in one-year intervals) from data published in a previous study of 2,976 trios14 (supplementary files aau1043_datas5_revision1.tsv and aau1043_datas7.tsv), and using additional methodological details obtained from its associated study93. Paternally phased DNM burdens were first calculated for each child as [total number of paternally phased DNMs]/[fraction of the child’s DNMs that were either paternally or maternally phased (which corrects for each child’s phasing rate)] × [the Jónsson et al.93 correction factor of 1.009 (which accounts for its false-positive and false-negative rate)]/[the Jónsson et al.93 interrogated genome size of 2,682,890,000] (refs. 14,93). We then compare the dsDNA mutation burden of each HiDEF-seq sperm sample to the DNM burdens of children whose fathers’ age at their birth is one year higher than the age at which the sperm sample was collected (to account for around 9 months difference between the father’s age at conception and the child’s birth).
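The per-child burden formula above can be written out as a short Python sketch. The function name and argument names are hypothetical; the correction factor and interrogated genome size are the values quoted in the text.

```python
def paternal_dnm_burden(n_paternal, n_maternal, n_total_dnms,
                        correction_factor=1.009, genome_size=2_682_890_000):
    """Paternally phased de novo mutation burden per base pair for one child.

    The paternally phased count is divided by the child's phasing rate
    (fraction of DNMs phased to either parent), multiplied by the correction
    factor, and divided by the interrogated genome size.
    """
    phased_fraction = (n_paternal + n_maternal) / n_total_dnms
    return n_paternal / phased_fraction * correction_factor / genome_size
```

For a child with 80 DNMs of which 30 are paternally and 10 maternally phased, the phasing rate is 0.5, so the paternal count is scaled to 60 before the remaining factors are applied.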

Comparison of HiDEF-seq and NanoSeq call burdens and patterns

NanoSeq data were processed using the NanoSeq analysis pipeline v.3.2.1 (https://github.com/cancerit/NanoSeq) for chromosomes 1–22 and X (hs37d5 reference genome). NanoSeq dsDNA burdens corrected for trinucleotide context opportunities were obtained from the ‘results.mut_burden.tsv’ output file of the NanoSeq pipeline. NanoSeq ssDNA call burdens were calculated as the sum of the values in the ‘results.mismatches.subst_asym.tsv’ output file, divided by 2 × [the sum of the values in the ‘results.mismatches.subst_asym.tsv’ output file + the number of interrogated dsDNA base pairs obtained from the ‘results.mut_burden.tsv’ output file]. NanoSeq ssDNA call counts for each context were obtained from the ‘results.SSC-mismatches-Pyrimidine.triprofiles.tsv’ and ‘results.SSC-mismatches-Purine.triprofiles.tsv’ output files. Because the NanoSeq pipeline does not correct ssDNA calls for trinucleotide context opportunities, we compared the burdens of NanoSeq ssDNA calls in each context to the burdens of HiDEF-seq ssDNA calls that are also not corrected for trinucleotide context opportunities (that is, to HiDEF-seq burdens corrected only for sensitivity) (Fig. 1f,g and Extended Data Fig. 6b,c).

For more informative comparison of Poisson 95% confidence intervals of HiDEF-seq and NanoSeq (Fig. 1c,e,f and Extended Data Fig. 4a) despite a different number of interrogated bases (for ssDNA calls, or base pairs for dsDNA calls) measured by each method, for each sample, the number of calls of the method with the higher number of interrogated bases (or base pairs) was downsampled proportionally to the ratio of the number of interrogated bases of the two methods. The downsampled method’s burden was then recalculated as the downsampled call count divided by the number of interrogated bases of the method with fewer interrogated bases, and the downsampled method’s Poisson 95% confidence interval was recalculated using the downsampled number of raw call counts. This downsampling does not affect burden estimates, and it normalizes the confidence intervals of the two methods to reflect an equivalent number of interrogated bases (or base pairs). Confidence intervals before downsampling are provided in Supplementary Table 2.
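The downsampling used for the confidence interval comparison can be sketched in Python as follows. This example is illustrative only; the function name is hypothetical, and rounding the scaled count to the nearest integer is an assumption, since the text specifies only that calls are downsampled proportionally.

```python
def downsample_calls(calls_large, bases_large, bases_small):
    """Downsample the call count of the method with more interrogated bases.

    Returns the downsampled call count and the burden recalculated using the
    interrogated bases of the method with fewer interrogated bases.
    Assumption: the scaled count is rounded to the nearest integer.
    """
    if bases_small > bases_large:
        raise ValueError("bases_small must not exceed bases_large")
    downsampled_calls = round(calls_large * bases_small / bases_large)
    burden = downsampled_calls / bases_small
    return downsampled_calls, burden
```

The recalculated burden matches the original burden up to rounding, while a Poisson interval computed from the smaller downsampled count is correspondingly wider.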

Transcription level and transcription strand analysis of sperm HiDEF-seq ssDNA C>T calls

We obtained RNA-seq data of purified human spermatozoa from supplementary table 2 of ref. 94 (‘Expression’ sheet, average of the Control 1, Control 2, and Control 3 samples’ fragments per kilobase of transcript per million mapped reads (FPKM) values) and annotated each gene that had non-zero expression with its expression quartile. We joined these data to the UCSC CHM13 v.1.0 genome browser ‘CAT Gene + LiftOff Annotations V4’ transcript annotation track using Ensembl gene IDs. We then annotated each ssDNA C>T call in HiDEF-seq sperm samples with the transcript expression data, and further annotated for each call if it was present on the transcribed or non-transcribed strand. We excluded from analysis the small number of calls overlapping transcripts expressed on both strands. We next calculated the sum of the lengths of transcripts in each expression quartile, excluding regions with transcripts expressed on both strands. We then normalized the number of ssDNA C>T calls in each quartile and each transcribed/non-transcribed strand category by the sum of the lengths of transcripts in that quartile. We then normalized these values for each transcribed and non-transcribed strand category by the sum of that category’s values.

Signature analysis

Signature analysis for dsDNA mutations was performed using the ‘sigfit’ package95 v.2.2, with input of raw mutation counts for each trinucleotide context, and the ‘opportunities’ parameter set to the ratio of the fractional abundance of each trinucleotide context in interrogated bases of that sample versus the fractional abundance of that trinucleotide context in the human reference genome. The correction for trinucleotide context opportunities performed above for burden analyses used the fractional abundance of trinucleotides in CHM13 v.1.0, but the correction for trinucleotide context opportunities performed here for signature analysis and figures used the fractional abundance of trinucleotides in the full GRCh37 genome (for both nuclear and mitochondrial genome analyses and figures) so that the obtained spectra and signatures can be compared to standard COSMIC signatures. The ‘plot_gof’ function was used to determine the optimal number of signatures to extract. As COSMIC SBS1 was not well separated from other signatures during de novo extraction96, we used the ‘fit_extract_signatures’ function to fit SBS1 while simultaneously extracting additional signatures de novo. De novo extracted signatures were compared to the COSMIC SBS v.3.2 catalogue23 to identify the most similar known signature by cosine similarity. To obtain more accurate estimates of signature exposures, the fitted COSMIC SBS signature and the extracted signatures were then re-fit back to the mutation counts using the ‘fit_signatures’ function, along with correction for trinucleotide context opportunities. SBS5 is a ubiquitous clock-like signature23, and often de novo extraction produced more than one signature with weak or moderate similarity to SBS5, for example, both SBS5 and SBS40 (cosine similarity = 0.83) or both SBS3 and SBS40 (cosine similarity = 0.88). 
In these cases, we either reduced the number of de novo extracted signatures so that only one of these similar signatures was extracted, or we instructed ‘fit_extract_signatures’ to fit both COSMIC SBS1 and COSMIC SBS5.
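
The ‘opportunities’ ratio described above can be written out as a short Python sketch. It is not sigfit code (sigfit is an R package), and the function and argument names are hypothetical; it only shows the arithmetic of the ratio for each trinucleotide context.

```python
def opportunity_ratios(sample_counts, genome_counts):
    """Ratio of the fractional abundance of each trinucleotide context in the
    interrogated bases of a sample to its fractional abundance in the
    reference genome. Both arguments map context to an occurrence count."""
    sample_total = sum(sample_counts.values())
    genome_total = sum(genome_counts.values())
    ratios = {}
    for context, genome_n in genome_counts.items():
        sample_fraction = sample_counts.get(context, 0) / sample_total
        genome_fraction = genome_n / genome_total
        ratios[context] = sample_fraction / genome_fraction
    return ratios
```

A ratio above 1 indicates a context that is over-represented in the interrogated bases relative to the reference genome, and a ratio below 1 indicates an under-represented context.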

ssDNA signatures were extracted by taking advantage of sigfit’s ability to analyse 192-trinucleotide context mutational spectra, a feature intended to distinguish transcribed versus untranscribed strands. Rather than using this feature for transcription strand, we use it to distinguish central pyrimidine versus central purine contexts. We do this by arbitrarily assigning central pyrimidine and central purine ssDNA calls to the transcribed and untranscribed strands, respectively (by setting the strand column to ‘−1’ for all calls that are input into sigfit’s ‘build_catalogues’ function, without collapsing central pyrimidine and central purine contexts). We then extract ssDNA signatures as described above for dsDNA signatures, with correction for trinucleotide context opportunities. Cosine similarities of ssDNA and dsDNA signatures are calculated after projecting ssDNA signatures to 96-central pyrimidine contexts, which is performed by summing values of central pyrimidine contexts with values of their reverse-complement central purine contexts.
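
The projection from 192 to 96 contexts and the cosine similarity can be sketched in Python as follows. The representation of a spectrum as a dictionary keyed by (trinucleotide, substituted base) is an assumption made for this illustration, and the function names are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string of A, C, G and T."""
    return seq.translate(COMPLEMENT)[::-1]

def project_to_96(spectrum_192):
    """Collapse a 192-context spectrum onto 96 central pyrimidine contexts.
    Keys are (trinucleotide, alt_base), for example ("ACA", "T") for a C>T
    call in an ACA context. Central purine entries are added to the entry of
    their reverse complement."""
    projected = {}
    for (context, alt), value in spectrum_192.items():
        if context[1] in "AG":  # central purine
            context = reverse_complement(context)
            alt = reverse_complement(alt)
        key = (context, alt)
        projected[key] = projected.get(key, 0.0) + value
    return projected

def cosine_similarity(a, b):
    """Cosine similarity of two spectra; missing contexts count as zero."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)
```

For example, a G>A call in a TGT context is the reverse complement of a C>T call in an ACA context, so both contribute to the same projected entry.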

To help to qualify the significance of cosine similarities, we performed simulations of random 96-element and 192-element number vectors (n = 10,000 random vector pairs each), which showed that 5.9%, 0.06% and 0% of random 96-context cosine similarities are above cut-offs of 0.8, 0.85 and 0.9, respectively, and 1.2%, 0% and 0% of random 192-context cosine similarities are above cut-offs of 0.8, 0.85 and 0.9, respectively. Thus, for 96-context comparisons (that is, dsDNA and projected ssDNA to dsDNA comparisons), we use the qualitative terms ‘weak similarity’ for 0.8 ≤ cosine similarity < 0.85, ‘moderate similarity’ for 0.85 ≤ cosine similarity < 0.9, and ‘strong similarity’ for cosine similarity ≥ 0.9. For 192-context comparisons (that is, ssDNA to ssDNA comparisons), we use the terms ‘moderate similarity’ for 0.8 ≤ cosine similarity < 0.85 and ‘strong similarity’ for cosine similarity ≥ 0.85.
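
A simulation of this kind, together with the qualitative labels defined above, can be sketched in Python. The distribution of the random vectors is not specified here, so the sketch assumes independent values drawn uniformly from [0, 1); the resulting percentages depend on that choice and need not match the values reported above. Function names are hypothetical.

```python
import math
import random

def random_cosine_fractions(n_elements, n_pairs=10_000,
                            cutoffs=(0.8, 0.85, 0.9), seed=0):
    """Fraction of random vector pairs whose cosine similarity exceeds each
    cut-off. Vector elements are uniform on [0, 1), which is an assumption."""
    rng = random.Random(seed)
    similarities = []
    for _ in range(n_pairs):
        a = [rng.random() for _ in range(n_elements)]
        b = [rng.random() for _ in range(n_elements)]
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        similarities.append(dot / norm)
    return {c: sum(s > c for s in similarities) / n_pairs for c in cutoffs}

def similarity_label(cosine, n_contexts):
    """Qualitative term for a cosine similarity, using the thresholds given in
    the text for 96-context and 192-context comparisons."""
    if n_contexts == 96:
        if cosine >= 0.9:
            return "strong"
        if cosine >= 0.85:
            return "moderate"
        if cosine >= 0.8:
            return "weak"
    elif n_contexts == 192:
        if cosine >= 0.85:
            return "strong"
        if cosine >= 0.8:
            return "moderate"
    return None  # below 0.8, or an unsupported number of contexts
```

Calling `random_cosine_fractions(96)` and `random_cosine_fractions(192)` gives the fractions for the two vector lengths under the assumed distribution.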

Replication strand asymmetry (fork polarity) analysis

ENCODE replication timing (Repli-seq) data97 (wavelet-smoothed signal) were obtained from the UCSC Genome Browser80 (hg19) for the lymphoblastoid cell lines GM12878, GM06990, GM12801, GM12812 and GM12813. We calculated the average of the Repli-seq signal (higher values indicate earlier replication) across these samples at each position, and then lifted over the data to CHM13 v.1.0. For each analysed HiDEF-seq call, we calculated the fork polarity98 as the slope versus position of the Repli-seq data points spanning −5 to +5 kb from the call using the ‘lm’ function in R. Positive and negative fork polarities indicate the genome non-reference (−) strand is synthesized more frequently in the leading- and lagging-strand direction, respectively. This was also performed for a set of 50 iterations of 1,000 randomly selected genomic positions with either the sequence or the reverse complement of the sequence corresponding to the trinucleotide context being analysed (that is, AGA or TCT for POLE samples). We next calculated the fork polarity quantile values at quantiles ranging from 0 to 1.0 in 0.1 increments, and then for each of these quantile bins (combining 0.4–0.5 and 0.5–0.6 quantile bins into one bin, as these span fork polarity 0), we counted the number of loci whose sequence is AGA in the genome non-reference (−) strand and the number of loci whose sequence is AGA in the reference genome (+) strand. Loci without annotated Repli-seq data were excluded. Next, for each genome strand, we calculated normalized call counts by dividing the quantile bin call counts by the total number of calls in that strand. For each of the nine quantile bins, we then calculated the ‘strand ratio’ as the ratio of non-reference to reference strand normalized call counts. We also calculated this strand ratio for positive and negative fork polarities (that is, two bins rather than nine quantile bins), as there were not enough ssDNA calls in individual quantile bins for analysis. 
Analyses were also repeated after excluding loci within genic regions annotated in the CHM13 v.1.0 LiftOff Genes V2 annotation obtained from the UCSC Genome Browser.
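
The fork polarity and strand ratio calculations can be sketched in Python as follows. The analysis itself used the ‘lm’ function in R; this sketch computes the same ordinary least-squares slope directly. Function names and the input layout are hypothetical, and quantile binning is assumed to have been done beforehand.

```python
def fork_polarity(positions, signal, call_position, window=5_000):
    """Least-squares slope of Repli-seq signal versus genomic position for the
    data points within +/- window bp of a call. Returns None when fewer than
    two points, or no spread in position, are available."""
    points = [(p, s) for p, s in zip(positions, signal)
              if abs(p - call_position) <= window]
    if len(points) < 2:
        return None
    mean_x = sum(p for p, _ in points) / len(points)
    mean_y = sum(s for _, s in points) / len(points)
    sxx = sum((p - mean_x) ** 2 for p, _ in points)
    if sxx == 0:
        return None
    sxy = sum((p - mean_x) * (s - mean_y) for p, s in points)
    return sxy / sxx

def strand_ratios(minus_counts, plus_counts):
    """Per-bin ratio of non-reference (-) to reference (+) strand call counts,
    each first divided by the total number of calls on that strand."""
    minus_total = sum(minus_counts)
    plus_total = sum(plus_counts)
    ratios = []
    for minus_n, plus_n in zip(minus_counts, plus_counts):
        plus_norm = plus_n / plus_total
        minus_norm = minus_n / minus_total
        ratios.append(minus_norm / plus_norm if plus_norm > 0 else float("nan"))
    return ratios
```

The same `strand_ratios` function applies to the nine quantile bins and to the two-bin comparison of positive versus negative fork polarity.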

Kinetics analysis

Signatures of sequencing polymerase kinetics have been previously identified for diverse base modifications in synthetic oligonucleotides, and they have been used to detect a small number of base modifications in genomic DNA such as cytosine methylation43,99. However, this approach has not yet been used to detect uracil species in genomic DNA with single-molecule fidelity. We performed the kinetics analysis as follows.

For each sample, consensus sequences for each strand were created using pbccs v.6.4.0 (Pacific Biosciences) with the parameters: --by-strand --hifi-kinetics --min-rq 0.99 --top-passes 0. pbccs v.6.4.0 was used because, with these parameters, it outputs consensus kinetics values for each strand separately, which previous versions of pbccs do not. Consensus sequence reads were then aligned to the CHM13 v.1.0 reference genome with pbmm2 with the parameters ‘--preset CCS --sort’.

Next, we extracted the list of ssDNA C>T sequence calls in the 72 °C heat-treated blood DNA and the sperm samples (profiled by HiDEF-seq with nick ligation). Owing to the very high number of ssDNA C>T calls in blood DNA samples that were heat treated in water-only or Tris-only buffer, for these samples, we selected a random subset of 800 calls. We then extracted from these samples and from 85 other HiDEF-seq samples all of the consensus reads that overlapped the C>T call positions, from the strand synthesized by the sequencing polymerase opposite to the strand on which the call is present in the molecule. As kinetics is affected by sequence context43, this enables calculation of differences in kinetics between molecules with and without the event within the same sequence context. We next performed kinetic analyses of IPD and PW. Kinetics values (IPD or PW, reported by the sequencing instrument at a 10 ms frame rate) for each consensus read were transformed into units of time (seconds) and normalized by the average kinetics values of all bases in the consensus read to correct for baseline sequencing kinetics differences between molecules. For each C>T call, we extracted the kinetics values of all overlapping reads for ±30 bp flanking the event position relative to the reference genome coordinates using each read’s CIGAR value to account for insertions or deletions in the read relative to the reference genome. Next, for each C>T call, we calculated the ratio of kinetics values for each base position by dividing the kinetics values (IPD or PW) of the molecule with the call by the weighted average kinetics values of molecules without the call (the weighted average weights by each molecule’s number of passes; that is, its ‘ec’ tag value). Finally, for each flanking and mutant base position, we calculated the average and s.e.m. of the kinetics value ratios across all C>T calls of each sample or sample set of interest. 
The same kinetic analysis was performed for dsDNA C>T mutation calls (that is, bona fide cytosine to thymine double-strand mutations) in non-heat-treated blood DNA, 56 °C and 72 °C heat-treated blood DNA, sperm, kidney, and liver samples (all profiled by HiDEF-seq with nick ligation), for the strands synthesized by the sequencing polymerase opposite the strand containing the C>T mutation; this shows the kinetic profile of true C>T changes, as a comparator for C>T calls arising from cytosine damage. Note that the dsDNA C>T mutations used for this kinetics analysis were called with the same thresholds used for ssDNA C>T calls. Both these ssDNA and dsDNA analyses were additionally conducted after randomization of labels among molecules with and without the C>T call to confirm that the kinetic signal was specific to molecules with the C>T call. The kinetic profile heat map and clustering were performed using the ComplexHeatmap R package100.
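
The core arithmetic of the kinetics analysis can be sketched in Python as follows. The sketch assumes that kinetics values have already been extracted for each read and aligned to the same reference positions (that is, CIGAR handling is done elsewhere and there are no missing positions). Function names are hypothetical.

```python
import math
import statistics

FRAME_SECONDS = 0.01  # 10 ms per frame, as stated in the text

def normalize_read(frames):
    """Convert per-base IPD or PW values from frames to seconds, then divide
    by the mean over all bases of the consensus read."""
    seconds = [f * FRAME_SECONDS for f in frames]
    mean = sum(seconds) / len(seconds)
    return [s / mean for s in seconds]

def kinetics_ratio(call_window, control_windows, control_passes):
    """Per-position ratio of the molecule with the call to the pass-weighted
    average of molecules without the call. control_passes holds the number of
    passes of each control molecule (its 'ec' tag value)."""
    total_weight = sum(control_passes)
    ratios = []
    for i, value in enumerate(call_window):
        weighted_mean = sum(
            window[i] * passes
            for window, passes in zip(control_windows, control_passes)
        ) / total_weight
        ratios.append(value / weighted_mean)
    return ratios

def mean_and_sem(values):
    """Mean and standard error of the mean of the ratios at one position."""
    mean = sum(values) / len(values)
    if len(values) < 2:
        return mean, float("nan")
    return mean, statistics.stdev(values) / math.sqrt(len(values))
```

In this sketch, `kinetics_ratio` would be applied once per C>T call to the window flanking the event, and `mean_and_sem` once per window position across all calls of a sample or sample set.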

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.