Main

Clonal expansions of blood cells with genomic alterations commonly occur in older individuals and confer an increased risk of haematological malignancies and overall mortality1,2,3,4,5,6,7,8,9,10. Clones can contain diverse mutations—ranging from point mutations to the gains or losses of whole chromosomes—on every chromosome.

Although populations can differ greatly in their rates of various cancers, the genomic landscape of mosaicism in the absence of known cancer remains, to our knowledge, unexplored outside of European-ancestry cohorts10,11,12,13.

Mosaic chromosomal alterations in Japan

We searched for mosaic chromosomal alterations (mCAs) in blood-derived DNA microarray data from 179,417 participants of the BioBank Japan (BBJ) cohort, which recruited patients with 47 diseases14 (including 13 cancers found in 16.7% of participants) (Methods). We found mCAs by analysing allele-specific hybridization intensities for 515,355 genotyped autosomal single-nucleotide polymorphisms (SNPs) (Supplementary Table 1). We analysed these data using a recently developed approach that detects an imbalance in the abundance of two inherited haplotypes of an individual by using the long-range haplotype phase information that can be inferred from large population samples10 (Methods and Supplementary Note 1).

This analysis detected 33,250 autosomal mCAs (at a false-discovery rate (FDR) of 0.05) in 27,910 unique individuals (Fig. 1 and Supplementary Note 2). The high rate of events, relative to a contemporaneous analysis of 482,789 participants in the UK Biobank (UKB), reflects three factors: BBJ participants were older (mean age at enrolment, 62.8 years in the BBJ compared with 57 years in the UKB; s.d., 14.5 years; range, 0–113 years), a larger fraction of participants was male (54.1% in the BBJ compared with 45.8% in the UKB) and different genotyping arrays were used14. Of these mutations, 5,323 were confidently classified as mosaic deletions, 10,431 as copy-neutral loss of heterozygosity (CN-LOH) events and 4,044 as duplications (Fig. 2a and Supplementary Table 2); the remaining 13,452 events were present at cell fractions that were too low or spanned too few genotyping probes to confidently determine the copy number (Supplementary Note 1). A total of 4,156 individuals had two or more non-overlapping mCAs (Supplementary Table 3); analysis of mosaic cell fractions suggested that these events were usually present in distinct clones (Supplementary Note 3). The mCA-detection rate was broadly consistent across genotyping arrays (Supplementary Table 4) and across cases of the 47 diseases that were systematically surveyed by the BBJ (Supplementary Table 5); mCAs were strongly enriched (odds ratio (95% confidence interval), 1.93 (1.66–2.25); P = 2.1 × 10−17) among individuals with haematological cancers at registry (that is, at the time of DNA sampling) as expected from previous studies1,2,10 (Supplementary Note 4), but not among individuals with other illnesses.

Fig. 1: Genomic locations of 33,250 autosomal mCAs detected in 27,910 unique BBJ participants.

Loss, CN-LOH and gain events are plotted as blue, orange and red horizontal lines, respectively. Events with undetermined copy numbers are plotted in grey. Commonly deleted regions are labelled in blue; loci associated with CN-LOH mutations in cis are labelled in orange.

Fig. 2: Classification of mCAs, frequency as a function of age and comparison of genomic distributions between BBJ and UKB.

a, Classification of mCAs as loss, CN-LOH or gain events using log-transformed R ratio (LRR, measuring total DNA abundance) and B allele frequency deviation from 0.5 (|ΔBAF|, measuring allelic imbalance) (Methods). Unclassified events are indicated in grey. b, Frequency of detectable mosaicism stratified by age and sex. Frequencies (means) and error bars for 95% confidence intervals are indicated for the 179,417 participants analysed. c, d, Distribution of mCAs by chromosome (c) and copy number (d) in BBJ and UKB. e, Chromosomal coverage of loss and CN-LOH events in BBJ and UKB. Curves indicate the frequencies at which each chromosomal position is contained in loss or CN-LOH events, normalized to 1 on each chromosome. Numeric data are provided in Supplementary Tables 6, 13.

Inevitability of mCAs in elderly individuals

The long-lived Japanese population revealed that clonal haematopoiesis with mCAs becomes extremely common in very old individuals: detectable mosaicism reached 40.7% (s.e.m., 2.3%) in men and 31.5% (s.e.m., 1.7%) in women over the age of 90 (Fig. 2b and Supplementary Table 6), which suggests that mCAs are inevitable in elderly individuals (Supplementary Note 4). mCAs on different chromosomes and with different copy-number changes exhibited various degrees of enrichment in men and in elderly individuals (Extended Data Fig. 1, Supplementary Tables 7, 8 and Supplementary Note 4) and in individuals with anomalous blood counts (Supplementary Table 9); this suggests that a spectrum of biological processes is involved in the development of different clones.

Population differences in mCA distributions

To compare the genomic distributions of mCAs in the Japanese and British populations, we co-analysed BBJ mCAs together with 19,632 autosomal mCAs detected in a parallel study15 in the UKB cohort16,17 (Fig. 2c–e, Supplementary Notes 3, 5 and Supplementary Table 10).

Japanese individuals have a tenfold higher incidence of adult T cell leukaemias18 and a fivefold lower incidence of chronic lymphocytic leukaemia (CLL, a B cell malignancy) compared to European individuals19,20. Our analysis indicated that, even among people without cancer, Japanese and British populations have markedly different rates of haematopoietic clones that arise from the B and T cell lineages. Such clones can be identified by the deletions produced during V(D)J recombination in developing T and B lymphocytes, which mark clonal expansions in the respective lineage. Mosaic deletions at the TRA locus on chromosome 14q (indicating clonal expansion in the T cell lineage) (Supplementary Note 5) were common in the BBJ but rare in the UKB dataset (82% versus 11% of loss events on chromosome 14 in the BBJ and UKB datasets, respectively); by contrast, deletions at the IGH and IGL immunoglobulin loci (indicating clonal expansion in the B cell lineage) were common in the UKB but rare in the BBJ dataset (5% versus 39% of loss events on chromosome 14 and 2% versus 58% of loss events on chromosome 22 in the BBJ and UKB datasets, respectively) (Fig. 2e and Supplementary Note 5). We verified that these differences did not arise from differences in genomic coverage by the genotyping arrays used by the BBJ and UKB projects (Extended Data Fig. 2). Clones that arose from the T cell lineage (as shown by deletions at TRA) were also associated with increased lymphocyte counts (Supplementary Tables 11, 12). Therefore, the differences in rates of B and T cell malignancies between Japanese and British populations seem to be preceded by distinct relative rates of subclinical clonal expansions in these lineages.

mCAs affect the various human chromosomes at different frequencies. The frequency of CN-LOH varied across chromosome arms in a way that strongly correlated between BBJ and UKB data (R = 0.73, P = 0.00013), with the exception of chromosomes 14q (more common in BBJ) and 13q (more common in UKB) (Fig. 2c, d, Extended Data Fig. 3 and Supplementary Table 13). By contrast, the most common loss and gain events in each population (including loss events on 20q, 13q and 10q and gains on chromosomes 21, 15 and 12) tended to be much more common in one population than the other (Fig. 2c, d and Supplementary Table 13).

A clear pattern among the most strongly population-differentiated mutations involved the two- to sixfold lower frequency in the BBJ dataset of chromosome 12 gain, 13q loss and 13q CN-LOH events (Fig. 2d): all three mutations are commonly observed in CLL21,22 and in individuals who later develop CLL10. Considering the 4–5-times lower incidence of CLL in East Asian individuals, the observation that all three of these precursor mutations are also less common in Japanese haematopoietic clones that have expanded to detectable cell fractions suggests that this population difference in CLL risk originates in a reduced selective advantage for clones with (diverse) CLL precursor mutations. Consistent with this hypothesis, we observed that clonal sizes for these events tended to be lower in BBJ than in UKB participants (Supplementary Note 3).

The subchromosomal distributions of mosaic deletion events were broadly similar between BBJ and UKB participants but exhibited a few notable differences (Fig. 2e, Supplementary Table 14 and Supplementary Note 5). Focal deletions frequently targeted DNMT3A, TET2, ETV6, NF1 and CHEK2, as shown in UKB and previous studies1,2,5,6,10 (Figs. 1, 2e). Notably, the CLL-related deletion region at 13q14 was less focal in BBJ than in UKB data (Fig. 2e), involving longer deletions in a pattern more similar to the chromosome 20q, 5q and 11q deletion regions. We also observed previously undescribed focal deletion regions in the BBJ dataset: at FHIT on chromosome 3p, TNFAIP3 on chromosome 6q, ABCA1 on chromosome 9q and PTEN on chromosome 10q (Fig. 2e and Supplementary Tables 15, 16); FHIT, TNFAIP3 and PTEN are known tumour-suppressor genes associated with blood cancers23,24,25.

Inherited risk variants for mCAs in cis

Recent studies have established an inherited component of clonal haematopoiesis that involves both common variants that slightly increase risk (of clones with any mutation)11,12,13,26 and rare variants that strongly predispose to developing clones with specific mCAs in cis10. The large number of mCAs detected among the Japanese population, together with the presence of distinct low-frequency alleles in Japan, could enable the detection of additional risk loci. To identify inherited variants associated with mCAs, we first performed association tests aimed at detecting CN-LOH events in cis that promoted clonal expansion by making risk alleles homozygous or removing them from the genome10 (the two-hit model27). We tested variants imputed into the BBJ dataset using the 1000 Genomes phase 3 reference panel28 together with 1,037 sequenced Japanese samples29, setting a significance threshold of P < 5 × 10−9 (Methods). We further performed binomial tests to determine whether each risk allele was consistently duplicated or removed by CN-LOH events (in individuals heterozygous for the risk allele) (Methods).

We identified five new loci at which inherited variants associated with mosaic CN-LOH events in cis, and also replicated previously reported associations at JAK2 (refs. 30–32) and MPL (ref. 10) (Table 1, Extended Data Figs. 4–6 and Supplementary Note 6). Three of the new loci (NBN, MRE11 and CTU2) involved rare variants with large effects. At NBN, the rare stop-gained variant rs756831345 on chromosome 8q associated strongly (odds ratio, 91 (52–159); P = 9.8 × 10−23) with chromosome 8q CN-LOH events, which consistently made the NBN risk allele homozygous (P = 0.00012) (Table 1 and Extended Data Figs. 5, 6). At MRE11, a very rare intronic variant (probably tagging a different causal variant) (Supplementary Note 7 and Supplementary Table 17) on chromosome 11q associated strongly (odds ratio, 37 (17–84); P = 2.6 × 10−9) with chromosome 11q CN-LOH events, which always made the MRE11 risk allele homozygous (P = 0.016) (Table 1 and Extended Data Figs. 5, 6). Consistent with the strong proliferative advantage of these clones, we observed that these rare risk alleles further associated with the detection of multiple CN-LOH clones on the same chromosome arm (with different proximal breakpoints) (Extended Data Fig. 7, Extended Data Table 1 and Supplementary Notes 3, 6). NBN, MRE11 and RAD50 (which did not exhibit a similar association) (Supplementary Table 18) encode the components of the MRN double-strand break-repair complex, which recruits ATM in response to DNA damage, leading to the phosphorylation of p53 and CHK2 and the initiation of cell-cycle arrest, apoptosis or DNA repair33. Together with the observations of focal deletions at ATM, TP53 and CHEK2 (Fig. 1) and rare ATM risk alleles for CN-LOH events in cis10, these results indicate a key role of DNA damage-response dysfunction in clonal selection.

Table 1 Genome-wide significant associations between inherited variants and mosaic chromosomal alterations

At CTU2, the rare missense variant rs200779411 associated strongly (P = 7.3 × 10−20; odds ratio, 28 (17–45)) with chromosome 16q CN-LOH events, which consistently made the CTU2 risk allele homozygous (P = 0.022) (Table 1 and Extended Data Figs. 5, 6). CTU2 encodes a component of the cytosolic thiouridylase complex, which is required for maintenance of genome integrity34. The missense variant rs200779411 was predicted to be probably damaging by PolyPhen-2 (ref. 35) and deleterious by SIFT (ref. 36), suggesting that impaired CTU2 function may promote clonal expansion by reducing genome stability.

Inherited risk variants for mCAs in trans

To additionally detect inherited variants associated with mCAs in trans, we performed genome-wide association tests on each mCA type (classifying events by chromosome and copy number), setting a genome-wide significance threshold of P < 5.7 × 10−11 to account for multiple hypotheses tested (Methods). Two trans associations reached significance: common variants in MAD1L1 associated with gains on chromosome 15 and common variants in TERT (previously associated with mosaic JAK2V617F mutation12) associated with chromosome 14q CN-LOH (Table 1, Extended Data Figs. 4, 5 and Supplementary Note 6). At MAD1L1, a cluster of five SNPs in near-perfect linkage disequilibrium (including the missense variant rs1801368) associated (P = 6.9 × 10−23; odds ratio, 1.61 (1.46–1.77)) (Table 1 and Extended Data Fig. 5d) with chromosome 15 gain events (mostly full trisomies) (Fig. 1). We replicated this association in the UKB cohort with a slightly reduced effect size (P = 5.1 × 10−4; odds ratio, 1.40 (1.16–1.69) for rs1801368). MAD1L1 encodes a component of the mitotic-spindle assembly checkpoint that ensures proper chromosome segregation37. The MAD1L1 risk allele was also previously observed to increase risk of mosaic Y chromosome loss13, which is consistent with a mechanism that involves the mis-segregation of chromosomes during mitosis owing to the impaired function of the spindle assembly checkpoint. Lending further support to this hypothesis, the risk haplotype was estimated to also increase risk for large (arm-level or whole-chromosome) gain events in 9 out of 10 chromosomes with at least 50 such events (binomial P = 0.02) (Supplementary Table 19).
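The binomial P value quoted for the direction of effect across chromosomes can be checked with a short calculation. The sketch below assumes a two-sided exact binomial test with a null probability of 0.5 (the text does not state the sidedness); the function name `binomial_two_sided` is illustrative and not taken from the study's code.

```python
from math import comb


def binomial_two_sided(k: int, n: int) -> float:
    """Exact two-sided binomial P value under a null success probability of 0.5.

    Assumption: the P value sums the probabilities of all outcomes that are
    no more likely than the observed count k (the usual exact definition).
    """
    # Integer weights C(n, i) are proportional to the null probabilities,
    # so comparisons between outcomes are exact.
    weights = [comb(n, i) for i in range(n + 1)]
    observed = weights[k]
    extreme = sum(w for w in weights if w <= observed)
    return extreme / 2 ** n


# 9 of 10 chromosomes showed increased risk for large gain events.
p_value = binomial_two_sided(9, 10)
print(f"{p_value:.4f}")  # 0.0215, consistent with the reported P = 0.02
```

Under the same assumption, a one-sided test would give 11/1,024 (about 0.011), so the reported value is more consistent with a two-sided test.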

Population-specific mCA risk alleles

A comparison of the mCA risk loci detected in the BBJ dataset with previously reported loci from the UKB dataset10 revealed ways in which genetic background can differentially shape clonal haematopoiesis in different populations (Supplementary Note 6 and Supplementary Tables 20–23). Four of the risk variants that we found in BBJ participants (at NBN, MRE11, NEDD8–TINF2 and CTU2) were present at much lower allele frequencies in European individuals38 (Supplementary Table 20). Conversely, all rare variants that were previously associated with mCAs in UKB participants (at MPL, FRA10B, ATM and TM2D3–TARSL2)10 were absent from Japanese individuals in the whole-genome sequencing imputation panels, with the absence of FRA10B fragile alleles explaining the lack of 10q25.2-qter deletions in BBJ participants (Fig. 1). Notably, MPL variants were associated with chromosome 1p CN-LOH events in both the BBJ and UKB datasets despite the fact that most risk alleles in each cohort were population-specific, which indicated that a shared path to mosaicism was initiated by different variants in different populations.

mCAs and mortality in Japan

Clonal haematopoiesis has previously been linked to poorer health outcomes, with various types of mosaic events observed to increase the risk of future blood cancers, mortality and cardiovascular disease1,2,3,4,10,39. To investigate the link between mCAs and mortality, we analysed mortality outcomes (including cause of death), which were available for around 72% of the cohort40 (Methods).

Among individuals with detectable mCAs, we observed a nearly fivefold increase in the risk of death due to leukaemia (hazard ratio, 4.70 (3.26–6.78)) (Extended Data Fig. 8, Extended Data Table 2, Supplementary Table 24 and Supplementary Note 8). The increased risk of mortality caused by leukaemia did not appear to extend to other haematological malignancies (malignant lymphoma and multiple myeloma) (Extended Data Fig. 8 and Supplementary Table 24). We also did not observe a significantly increased risk of mortality attributable to cardiovascular disease, suggesting that previous associations of clonal haematopoiesis (which primarily involved point mutations in DNMT3A, TET2, JAK2 and ASXL1) with cardiovascular outcomes4,39 may be limited to specific mosaic events (Extended Data Fig. 8 and Supplementary Table 24). To refine the association between mosaic status and leukaemia mortality, we partitioned mosaic events by chromosome and copy-number change (Methods) and identified six mCAs with significant (P < 0.05/88, Cochran–Mantel–Haenszel test), large effects on leukaemia mortality risk (Extended Data Fig. 8b and Supplementary Table 25). Mosaic cell fraction and the number of mosaic events carried by an individual each associated with further increases in leukaemia mortality risk (Extended Data Fig. 8c, d and Supplementary Tables 26, 27). Mosaic status was also associated with an increased risk of overall mortality (hazard ratio, 1.10 (1.05–1.16); P = 2.7 × 10−5) (Supplementary Table 24), an association that was driven by mCAs in chromosomes 9 and 14 (hazard ratio > 1.4) (Supplementary Table 28), underscoring the heterogeneity in the clinical effects of different mCAs.

Discussion

Our study of haematopoietic clones with mCAs in Japan provides a detailed comparison of the genomic landscape of clonal haematopoiesis between populations, revealing broad overall similarities as well as important population differences. A clear pattern among these results showed that population differences in blood cancer rates are preceded by population differences in subclinical clonal expansions, at multiple levels: both in specific cell lineages (including B and T cell lineages) and with specific cancer-associated mutations (for example, gain of chromosome 12, loss of chromosome 13q and chromosome 13q CN-LOH, which are hallmarks of CLL). These results point towards population-specific differences in the clonal advantages that are gained by the same chromosomal mutations in different genetic and environmental contexts.

The interplay between acquired and inherited genetic variation in Japan enabled further insights into the influences of inherited variation on clonal haematopoiesis: population-specific variants at several risk loci pointed to a key role of the maintenance of genomic integrity, with corroborating evidence from loci targeted by focal deletions. These results point to the need for larger and more diverse cohorts in genomic studies of cancer and subclinical clonal expansions as well as inherited variation.

Methods

Data reporting

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

The BBJ cohort

All of the individuals analysed in this study were participants of the BBJ project. The BBJ is a multi-hospital-based registry that collected clinical information, DNA and serum samples from approximately 200,000 patients with one or more of 47 target diseases (including 13 cancers) at a total of 66 hospitals between fiscal years 2003 and 2007 (ref. 14). The case proportions correlated well with prevalence in the Japanese population and all of the study participants were diagnosed by medical doctors as described elsewhere14. We complied with all relevant ethical regulations. This project was approved by the ethics committees of RIKEN Center for Integrative Medical Sciences and the Institute of Medical Sciences, the University of Tokyo. Written informed consent was obtained from all of the participants.

Genotyping individuals in the BBJ

Participants were genotyped in three batches using different arrays or sets of arrays, namely: (1) a combination of Illumina Infinium Omni Express and Human Exome; (2) Infinium Omni Express Exome v.1.0; and (3) Infinium Omni Express Exome v.1.2 (Supplementary Table 1). The SNP content of the three array sets was very similar. DNA was obtained from blood samples for all but one individual (for which DNA was obtained from oral mucosa; this sample was negative for mosaic events).

We excluded outliers from East Asian clusters in a plot in which we projected BBJ participants in combination with 1000 Genomes Project41 samples in the principal component (PC)1 and PC2 space. We also excluded samples genetically identical to another sample, samples with call rates less than 0.98, and samples for which the reported sex information was not supported by genotypes in the X chromosome. We further excluded three samples with evidence of potential contamination (as suggested by low cell-fraction mosaic events called on many chromosomes6,10), leaving 179,417 samples for analysis. We used plink v.1.9 software42 to handle the genotyping data.

Genotyping intensity data used for calling mosaic events

To call mosaic events, we analysed genotyping intensity data for variants in the intersection of the three primary arrays used for BBJ genotyping (namely, Illumina Infinium Omni Express and Infinium Omni Express Exome v.1.0 and v.1.2) to enable the analysis of the same set of variants in all individuals (to avoid the possibility of differing detection sensitivity across batches due to different numbers of genotyping probes analysed). When calling mosaic events, we did not include variants typed on the Human Exome array in some samples (see above) to minimize the potential for batch effects arising from different arrays. We did use variants from the Human Exome array in genetic association analyses (see below) as association tests are robust to genotyping heterogeneity when potential confounders are appropriately controlled by correcting for batch covariates and principal components.

Calculation of BAF and LRR from genotype intensity

We computed BAF and LRR values with the use of the BBJ genotyping intensity data43. We modified previously published methods1,10 to fit the current dataset. We computed LRR and BAF values on a per-array basis in which all of the participants genotyped in the same arrays were clustered together. Details are provided in Supplementary Note 1.

Phasing of genotype data for calling mosaic events

We phased the filtered genotypes mentioned above with the use of Eagle2 software44, which enabled us to conduct accurate long-range phasing. This phasing information was used for calling mosaic events (Supplementary Note 1).

Filtering possible non-mosaic trisomy or monosomy events

We excluded chromosomes with mean LRR > 0.2 or mean LRR < −0.5 (possible trisomy and monosomy, respectively) (Supplementary Note 1).
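The filter above amounts to a simple threshold on the per-chromosome mean LRR. The following is a minimal sketch, assuming the LRR values for one chromosome of one sample are available as a list of floats; the function and constant names are illustrative, and the actual pipeline is described in Supplementary Note 1.

```python
from statistics import mean
from typing import Optional, Sequence

# Thresholds taken from the text; the names are hypothetical.
TRISOMY_MEAN_LRR = 0.2
MONOSOMY_MEAN_LRR = -0.5


def flag_constitutional(lrr_values: Sequence[float]) -> Optional[str]:
    """Return a label if the chromosome-wide mean LRR suggests a
    non-mosaic trisomy or monosomy, otherwise None."""
    chromosome_mean = mean(lrr_values)
    if chromosome_mean > TRISOMY_MEAN_LRR:
        return "possible trisomy"
    if chromosome_mean < MONOSOMY_MEAN_LRR:
        return "possible monosomy"
    return None


# A chromosome with LRR values centred near zero is retained.
print(flag_constitutional([0.01, -0.02, 0.00, 0.03]))  # None
```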

Calling mosaic events with the use of BAF and LRR

We used the same method to call mosaic events as previously described10. This calling method is composed of the following steps: (1) filtering constitutional duplications; (2) evaluating the phased BAF for variants on each chromosome using a parameterized hidden Markov model; (3) calling the existence of events using a likelihood ratio test; (4) calling the event boundaries; (5) calling the copy number; (6) filtering remaining possible constitutional duplications; (7) estimating the cell fraction of mosaic events. Details of each step are provided in Supplementary Note 1.
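To illustrate step (7), the relationship between allelic imbalance and cell fraction can be derived from a simple mixture model. The sketch below assumes a single clone at cell fraction f in an otherwise diploid sample, a heterozygous site, and no measurement bias; it is not necessarily the estimator used in Supplementary Note 1. Under these assumptions, |ΔBAF| equals f/2 for CN-LOH, f/(2(2 − f)) for a loss and f/(2(2 + f)) for a single-copy gain, and inverting these expressions gives the function below. The function name `cell_fraction` is illustrative.

```python
def cell_fraction(delta_baf: float, event: str) -> float:
    """Estimate clonal cell fraction from |deltaBAF| at heterozygous sites.

    Assumptions (not from the source): one clone, diploid background,
    single-copy loss or gain, unbiased BAF measurements.
    """
    if not 0.0 <= delta_baf < 0.5:
        raise ValueError("|deltaBAF| must lie in [0, 0.5)")
    if event == "cn-loh":
        # BAF = (1 + f) / 2, so deltaBAF = f / 2
        return 2.0 * delta_baf
    if event == "loss":
        # BAF = 1 / (2 - f), so deltaBAF = f / (2 * (2 - f))
        return 4.0 * delta_baf / (1.0 + 2.0 * delta_baf)
    if event == "gain":
        # BAF = (1 + f) / (2 + f), so deltaBAF = f / (2 * (2 + f))
        return 4.0 * delta_baf / (1.0 - 2.0 * delta_baf)
    raise ValueError(f"unknown event type: {event}")


for kind in ("cn-loh", "loss", "gain"):
    print(kind, round(cell_fraction(0.05, kind), 4))
```

For a gain, values of |ΔBAF| above 1/6 would imply a cell fraction greater than 1 under this model, which would indicate that one of the assumptions does not hold.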

Associations between array batches or disease status at registry and detectable mosaicism

We conducted logistic regression analyses to evaluate associations between detectable mosaicism and either array batches or disease status at the time of participant recruitment (47 diseases, a binary trait for each of the diseases). For array batches (Supplementary Table 4), we put mosaic detection status as a dependent binary variable and age, sex, smoking, genotyping arrays and 10 principal components as independent variables. For disease status at registry (Supplementary Table 5), we put disease presence as a dependent variable and presence of mosaic events, age, sex, genotyping arrays and 10 principal components as independent variables.

Associations between haematological traits and mosaic events

We extracted data from the BBJ for 13 haematological traits, namely, red blood cell count, haemoglobin, haematocrit, mean corpuscular volume, mean corpuscular haemoglobin, mean corpuscular haemoglobin concentration, white blood cell count, neutrophil count, lymphocyte count, monocyte count, eosinophil count, basophil count and platelet count. Associations between the 13 haematological quantitative traits and the presence of 88 types of mosaic events (see below) were analysed in logistic regression models with event presence as the outcome. Before analysis in logistic models, the 13 traits were regressed on the covariates specified in the previous BBJ study45, separately for men and women. Residuals were normalized and used as independent variables one by one (a total of 13 models for each mCA). In each logistic model, disease status at registry (for each of the 47 diseases in the BBJ study design), age, sex, smoking, genotyping arrays and 10 principal components were used as covariates. We took this approach to control for effects of covariates associated with both mCAs and haematological traits.

We subdivided mosaic events by copy-number state (loss, CN-LOH or gain) and by p versus q arm for loss and CN-LOH events. To reduce multiple testing burden, we restricted analyses to mosaic events with more than 20 carriers (Supplementary Table 2). As a result, 88 mosaic events were analysed in association with 13 haematological traits. The statistical significance threshold was set to P < 0.05/88/13 (4.4 × 10−5); results are reported in Supplementary Table 9.
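The Bonferroni threshold stated above follows directly from the number of tests. The short sketch below reproduces the arithmetic; the helper name `bonferroni_threshold` is illustrative.

```python
def bonferroni_threshold(alpha: float, *test_counts: int) -> float:
    """Divide a family-wise error rate by each of the given test counts."""
    threshold = alpha
    for count in test_counts:
        threshold /= count
    return threshold


# 88 mosaic event types tested against 13 haematological traits.
haematology_threshold = bonferroni_threshold(0.05, 88, 13)
print(f"{haematology_threshold:.1e}")  # 4.4e-05
```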

Comparison of mosaic frequency between BBJ and UKB

We co-analysed BBJ mosaic calls with mosaic calls in UKB data from 482,789 individuals15. We calculated the frequencies of mosaic events subdivided by chromosome arm and copy number among all mosaic events in both datasets. We assessed the correlation of event frequencies in the two datasets using Spearman’s correlation coefficients.

Relative coverage of the genome by mosaic events in the BBJ and UKB

We determined mosaic coverage as follows. We divided chromosomes into 0.1-Mb bins and calculated the fraction of loss or CN-LOH events that covered each bin to compute mosaic coverage. We scaled the coverage in each mosaic type in each chromosome (to set maximum coverage as 1). We compared mosaic coverage in the BBJ and UKB datasets using Pearson’s correlation coefficients.
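The binning and scaling described above can be sketched as follows. This example assumes that events are given as half-open base-pair intervals on one chromosome and that an event "covers" a bin if it overlaps it at all (the text does not specify how partial overlap is handled); all names are illustrative.

```python
from typing import List, Sequence, Tuple

BIN_SIZE = 100_000  # 0.1 Mb, as stated in the text


def scaled_coverage(events: Sequence[Tuple[int, int]], chrom_length: int) -> List[float]:
    """Fraction of events covering each 0.1-Mb bin, scaled so the maximum is 1.

    Assumption: events are half-open intervals [start, end) in base pairs,
    and any overlap with a bin counts as covering that bin.
    """
    n_bins = -(-chrom_length // BIN_SIZE)  # ceiling division
    counts = [0] * n_bins
    for start, end in events:
        first_bin = start // BIN_SIZE
        last_bin = min((end - 1) // BIN_SIZE, n_bins - 1)
        for index in range(first_bin, last_bin + 1):
            counts[index] += 1
    if not events:
        return [0.0] * n_bins
    fractions = [count / len(events) for count in counts]
    peak = max(fractions)
    return [value / peak for value in fractions] if peak else fractions


# Two hypothetical loss events on a 0.5-Mb toy chromosome.
toy_events = [(0, 250_000), (100_000, 300_000)]
print(scaled_coverage(toy_events, 500_000))  # [0.5, 1.0, 1.0, 0.0, 0.0]
```

The resulting per-bin vectors for the two cohorts could then be compared with a Pearson correlation, as described in the text.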

Genomic coverage by genotyping arrays in BBJ and UKB

We computed the mean numbers of heterozygous genotyped sites across individuals in each 1-Mb region of the genome for the BBJ and UKB genotyping arrays to confirm that the difference in mosaic frequency between the two populations was not driven by different coverage of the genome by DNA microarrays.

Association between mosaic events indicating T cell expansions and lymphocyte counts

We used a Wilcoxon rank-sum test to compare lymphocyte counts between individuals who carried TRA deletions (indicating clonal expansions of T cells) and individuals without TRA deletion. We also evaluated Spearman correlations between the cell fraction of TRA deletions and lymphocyte counts.

Distribution of breakpoints of CN-LOH in BBJ and UKB

We computed relative frequencies of estimated CN-LOH breakpoint locations in each chromosome in BBJ and UKB. We smoothed breakpoints over ±2 Mb and rescaled to 1.

Genes affected by focal deletion

We evaluated the importance of genes by taking the numbers of genes involved in loss events into account. We counted the number of genes involved in each loss event and defined a score of each loss event as one divided by the number of genes (that is, when a loss event contained only one gene, the gene received a score of 1). We summed scores of all loss events containing each gene. To pick up genes that were frequently involved with focal deletions only in the Japanese population, we identified genes covered by at least 5% of loss events in a chromosome, having a tenfold larger score in BBJ than in UKB, and scoring more than 0.5.
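The scoring scheme above can be sketched in a few lines. The example assumes that each loss event has already been annotated with the set of genes it contains; the gene lists, function names and the handling of a zero UKB score are illustrative assumptions rather than details from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence


def focal_scores(loss_events: Iterable[Sequence[str]]) -> Dict[str, float]:
    """Sum, for each gene, 1 / (number of genes in the event) over all
    loss events that contain the gene."""
    scores: Dict[str, float] = defaultdict(float)
    for genes in loss_events:
        unique_genes = set(genes)
        if not unique_genes:
            continue  # events containing no genes contribute nothing
        for gene in unique_genes:
            scores[gene] += 1.0 / len(unique_genes)
    return dict(scores)


def japan_specific(score_bbj: float, score_ukb: float, fraction_of_losses: float) -> bool:
    """Apply the three criteria from the text: coverage by at least 5% of
    loss events on the chromosome, a tenfold larger BBJ score, and a score
    above 0.5. Assumption: a UKB score of 0 satisfies the tenfold criterion."""
    return (
        fraction_of_losses >= 0.05
        and score_bbj > 0.5
        and score_bbj >= 10.0 * score_ukb
    )


# Hypothetical events: one single-gene deletion and two larger deletions.
example_events = [["PTEN"], ["PTEN", "GENE_A"], ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]]
scores = focal_scores(example_events)
print(scores["PTEN"], scores["GENE_A"])  # 1.5 0.75
```

Weighting each event by the inverse of its gene count means that a gene hit repeatedly by short deletions accumulates a high score, whereas genes that are only swept up in arm-level losses contribute little.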

Genetic association studies

We excluded participants who showed a high degree of kinship (first degree or closer as detected by plink42) with other individuals, leaving 173,599 participants for genetic association studies. Among related pairs, we retained individuals who had mosaic events. We also integrated the genotyping data used for calling mosaic events with genotyping data from additional variants typed on the Human Exome Array in some samples when also available on the Omni Express Exome Arrays in other samples (Supplementary Table 1) to maximize the number of variants used for imputation. We did not integrate these data at the stage of calling mosaic events to minimize the potential for batch effects. We phased the integrated data using Eagle2 software46. The phased genotypes were imputed using a reference panel containing 2,504 1000 Genomes phase 3 samples and 1,037 Japanese high-depth (30×) whole-genome sequencing samples (dataset 1 of a previous study29) using Minimac3 software47. Variants imputed with R² > 0.3 were used for the association studies. We filtered variants with minor allele count less than 5. Best-guess data were used to conduct Fisher’s exact tests using plink software (plink --fisher --ci 0.95). We used Fisher’s exact tests to prevent inflated type I errors when testing associations between rare variants and rare mosaic events48. To confirm that significant associations were not driven by confounding factors, we reanalysed significant associations (detected by Fisher’s exact test) using logistic regression with and without covariates (10 principal components, disease status at registry, age, sex, smoking and genotype batches) and verified that the associations were robust. We used genotyping data from DNA microarrays if available to rescue rare variants that were not included in the reference panel, that were not well-imputed or that had low allele frequency. As a result, 26.6 million variants were used for association studies.
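For readers unfamiliar with Fisher's exact test, the following sketch computes a two-sided P value for a 2 × 2 table from the hypergeometric distribution. It illustrates the statistic only and is not plink's implementation; the function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    The P value sums the hypergeometric probabilities of all tables with
    the same margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lowest = max(0, col1 - row2)
    highest = min(col1, row1)

    def weight(x: int) -> int:
        # Integer weight proportional to the probability of a table
        # whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    extreme = sum(weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed)
    return extreme / comb(row1 + row2, col1)


# Hypothetical counts, for example rows = event carriers / non-carriers
# and columns = risk-allele carriers / non-carriers.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"{p_value:.5f}")  # 0.03497
```

Because the weights are exact integers, the comparison against the observed table does not depend on floating-point tolerance, which matters for the sparse tables that arise with rare variants and rare events.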

We analysed mosaic events in each chromosome as distinct phenotypes, treating loss, CN-LOH and gain separately. To maximize the power to identify significant associations with CN-LOH, we included unclassified ‘likely CN-LOH’ events (that is, events that extended to one telomere with |LRR| < 0.02) when testing variants for association with CN-LOH events. We subdivided loss and CN-LOH events in each chromosome into p-arm and q-arm events. We set a threshold of at least 20 event carriers to consider an event in genetic association studies. This led to a total of 88 copy number–chromosome pairs analysed (Supplementary Table 2). We tested each of these phenotypes for association with variants in cis (that is, on the same chromosome and contained within a mosaic event) or in trans (that is, on any chromosome). For cis associations, we also conducted allelic imbalance analyses to assess whether one of the alleles at each variant was preferentially duplicated by mosaic CN-LOH events. Details of each test and corresponding significance thresholds are described in Supplementary Note 6.

At significantly associated loci, we additionally performed stepwise conditional analyses (by iteratively removing carriers of high-risk rare alleles) to test for additional independently associated variants.

Associations between risk variants and presence of multiple CN-LOH clones with different breakpoints

In a small fraction of individuals, we detected evidence of multiple clonal expansions of CN-LOH events that affected the same chromosome arm but with different breakpoints. To detect such events, we applied a modified hidden Markov model as described previously10 (Supplementary Note 3). In brief, this analysis searched for evidence of CN-LOH with increasing BAF deviation towards the telomere. We evaluated associations between the presence of risk variants found above and the presence of single or multiple breakpoints among individuals with CN-LOH spanning the variants using Fisher’s exact test. Details are described in Supplementary Note 6.4.6.

Associations between mosaic events and mortality

The BBJ project has follow-up data to survey mortality and cause of death40. A total of 141,612 BBJ participants who had one of 32 of the 47 diseases were prospectively followed up after DNA collection. For participants who died, further detailed surveillance was carried out to identify causes of death (coded with codes from the tenth revision of the International Statistical Classification of Diseases and Related Health Problems (ICD10)) by accessing the national vital registration system used for the input survey of medical and social welfare at the Ministry of Health, Labour and Welfare of the Japanese Government.

We restricted participants to those who were followed for at least 1 year after registry and free from malignancy at blood collection. We found 86,546 participants in the current study who were included in the follow-up data for mortality. Among them, 16,812 deaths were recorded during the follow-up period. The average follow-up period was 7.6 years (median, 8.3 years; s.d., 2.8 years).

As an initial evaluation, we analysed associations between mortality (overall or from specific causes) and the presence of mosaic events (regardless of mosaic type). We analysed overall mortality, haematopoietic malignancy mortality and non-haematopoietic malignancy mortality. We compared individuals with mosaic events (loss, CN-LOH or gain) at cell fraction ≥ 1% to individuals without mosaic events on any chromosome. We used Cox regression analysis, conditioning on age, age², sex, disease status, genotyping array and smoking. We used follow-up period as a censoring factor. When we analysed specific causes of death (for example, non-haematopoietic malignancy), we only used participants whose deaths were not reported during follow-up as controls, to use consistent control samples across analyses. We used significance thresholds based on Bonferroni’s correction (based on the number of tested mortality phenotypes).

After evaluating associations between mortality phenotypes and the presence of any mosaic event in any chromosome, we searched for associations between specific mosaic event types and mortality. We analysed the same set of 88 mosaic event types (defined by copy-number state and chromosomal location) that we used when testing associations with haematological traits and inherited variants. In these analyses of associations between mortality phenotypes and specific mosaic types, we stratified participants by age, sex and smoking status and computed associations using Cochran–Mantel–Haenszel tests to avoid inflation of statistics arising from the small number of individuals who carried mosaic events. We set a significance level based on Bonferroni’s correction (P < 0.00057, 0.05/88).
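The stratified test and the Bonferroni threshold can be sketched as follows. This is a minimal illustration of the Cochran–Mantel–Haenszel statistic over a set of 2 × 2 tables, one per stratum of age, sex and smoking status. The function name and the counts are hypothetical, and the sketch applies no continuity correction, which is an assumption: the text does not state which form of the test was used.

```python
from math import erfc, sqrt

N_EVENT_TYPES = 88
BONFERRONI_THRESHOLD = 0.05 / N_EVENT_TYPES  # about 0.00057, as in the text


def cmh_test(tables):
    """Cochran-Mantel-Haenszel test without continuity correction.

    `tables` is an iterable of (a, b, c, d) counts, one tuple per stratum:
        a = event carriers who died,     b = event carriers who survived,
        c = non-carriers who died,       d = non-carriers who survived.
    Returns (chi2, p_value) for a chi-square statistic with 1 degree of freedom.
    """
    deviation = 0.0
    variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum with fewer than two people carries no information
        deviation += a - (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    if variance == 0:
        return 0.0, 1.0
    chi2 = deviation * deviation / variance
    # Upper tail of a 1-d.f. chi-square distribution: P(X > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


# Hypothetical strata (for example, younger and older male smokers).
chi2, p = cmh_test([(2, 1, 1, 2), (2, 1, 1, 2)])
print(chi2, p, p < BONFERRONI_THRESHOLD)
```

Pooling the observed-minus-expected counts across strata lets sparse strata contribute without each needing enough carriers for a separate test.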

We also analysed cardiovascular mortality (defined as ischaemic heart diseases and ischaemic stroke) as previous studies have reported associations between mosaic point mutations and cardiovascular outcomes.

Definition of cancers based on ICD10 codes for causes of death

We categorized causes of death to decrease the multiple testing burden. Haematopoietic malignancy was defined by ICD10 codes C81–C96, D45, D46 and D47. Leukaemic diseases were defined by ICD10 codes C91–C96, D45 and D46. Malignant lymphoma was defined by ICD10 codes C81–C88. Multiple myeloma was defined by C90. Cancers were defined as ICD10 codes starting with ‘C’ together with the haematopoietic malignancy codes that do not start with ‘C’ (D45, D46 and D47). We did not regard other ICD10 codes starting with ‘D’ as cancer, as most of those are benign tumours.
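These definitions can be expressed as a small lookup. The sketch below encodes only the code ranges listed above; the function names, the label strings and the handling of code formatting (for example, ‘C91.0’ being reduced to ‘C91’) are assumptions made for illustration.

```python
def _three_character_category(code):
    """Reduce an ICD10 code such as 'c91.0' to its category, 'C91'."""
    return code.strip().upper().replace(".", "")[:3]


def _in_c_range(category, low, high):
    """True if `category` is a 'C' code whose number lies in [low, high]."""
    return (
        category.startswith("C")
        and category[1:].isdigit()
        and low <= int(category[1:]) <= high
    )


def classify_cause_of_death(code):
    """Return the set of categories that an ICD10 cause-of-death code falls into."""
    category = _three_character_category(code)
    labels = set()
    if len(category) < 3:
        return labels  # malformed or missing code
    non_c_haematopoietic = {"D45", "D46", "D47"}
    if _in_c_range(category, 81, 96) or category in non_c_haematopoietic:
        labels.add("haematopoietic malignancy")
    if _in_c_range(category, 91, 96) or category in {"D45", "D46"}:
        labels.add("leukaemic disease")
    if _in_c_range(category, 81, 88):
        labels.add("malignant lymphoma")
    if category == "C90":
        labels.add("multiple myeloma")
    # Cancer: any 'C' code, plus the haematopoietic 'D' codes listed above.
    if category.startswith("C") or category in non_c_haematopoietic:
        labels.add("cancer")
    return labels


print(classify_cause_of_death("C91.0"))
print(classify_cause_of_death("D12"))  # other 'D' codes are not counted as cancer
```

Returning a set reflects the fact that the categories are nested: a leukaemia death is also a haematopoietic malignancy death and a cancer death.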

Associations between multiple mosaic events and mortality

We extended the mortality analyses to investigate the effect of multiple mosaic events within a single individual. We limited analyses to individuals with at most three mosaic events. We divided participants into three groups: (1) individuals without mosaic events; (2) individuals with a single mosaic event; (3) individuals with multiple mosaic events (two or three mosaic events in different chromosomes). We tested whether the presence of multiple mosaic events was associated with leukaemia mortality in comparison with the presence of a single mosaic event. The analyses were conditioned on age, sex, disease status, genotyping array and smoking status.
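The grouping can be sketched as follows, assuming that each individual is represented by the list of chromosomes on which mosaic events were called. The function name and group labels are hypothetical. The text does not say how individuals with two events on the same chromosome were handled; the sketch excludes them, which is an assumption.

```python
def mosaic_event_group(event_chromosomes):
    """Assign an individual to a group based on the chromosomes carrying events.

    Returns "none", "single" or "multiple", or None when the individual falls
    outside the three groups and is left out of the analysis.
    """
    n_events = len(event_chromosomes)
    if n_events == 0:
        return "none"
    if n_events == 1:
        return "single"
    on_distinct_chromosomes = len(set(event_chromosomes)) == n_events
    if n_events <= 3 and on_distinct_chromosomes:
        return "multiple"
    # More than three events, or (by assumption) events sharing a chromosome.
    return None


print(mosaic_event_group([]))             # none
print(mosaic_event_group(["14"]))         # single
print(mosaic_event_group(["9", "20"]))    # multiple
print(mosaic_event_group(["1", "2", "3", "4"]))  # None (excluded)
```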

Associations between cell fraction of mosaic events and mortality

We also extended the mortality analyses to investigate the effect of mosaic cell fraction. For individuals with multiple mosaic events, we took the highest cell fraction. We divided participants into categories according to the cell fraction of mosaic events and analysed associations between cell fractions and the outcomes with which the presence of mosaic events was significantly associated.
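The per-individual summary can be sketched as follows. The rule of taking the highest cell fraction comes from the text; the category boundaries are not given in this section, so the bin edges below are placeholders chosen only for illustration, as are the function and variable names.

```python
from bisect import bisect_right

# Hypothetical bin edges (cell fractions); not taken from the paper.
ILLUSTRATIVE_EDGES = (0.01, 0.05, 0.10)


def cell_fraction_category(cell_fractions, edges=ILLUSTRATIVE_EDGES):
    """Return the index of the cell-fraction category for one individual.

    `cell_fractions` lists the cell fraction of each mosaic event that the
    individual carries; the highest value represents the individual.
    Category k covers edges[k-1] <= fraction < edges[k], with category 0
    below the first edge and the last category at or above the final edge.
    Returns None for individuals without mosaic events.
    """
    if not cell_fractions:
        return None
    highest = max(cell_fractions)
    return bisect_right(edges, highest)


print(cell_fraction_category([0.02, 0.20]))  # uses 0.20, the highest fraction
print(cell_fraction_category([0.03]))
```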

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.