Introduction

Tobacco use remains the leading cause of preventable death and disease in North America [1]. Nicotine (the primary addictive agent in tobacco) [2] is metabolized to cotinine primarily by the liver enzyme CYP2A6, and then to 3’hydroxycotinine exclusively by hepatic CYP2A6 [3, 4]. The nicotine metabolite ratio (NMR; 3’hydroxycotinine/cotinine) is a stable biomarker for nicotine metabolism by CYP2A6 in smokers [5, 6]. Individual differences in NMR predict total nicotine clearance, and thus smoking behaviours (including cessation) as well as health outcomes. In particular, higher NMR (i.e. faster nicotine inactivation and CYP2A6 activity) is associated with greater nicotine dependence, cigarette consumption, and lung cancer risk along with lower cessation [7, 8]. Furthermore, NMR has translational potential in personalizing cessation treatment given that smokers with higher NMR show greater benefit from treatment with varenicline (compared to nicotine replacement therapy) [9, 10].

The NMR can only be reliably measured in current, regular smokers. This limits its use as a biomarker in longitudinal studies of smoking initiation or smoking-related disease risk in occasional/non-smokers, and limits the potential clinical utility of using NMR to guide personalized counselling on smoking-related risks to promote prevention efforts and behavioural change. However, because NMR is highly heritable (h2 = 60–80% [11, 12]), an individual’s NMR could potentially be estimated using their genetic information regardless of their current smoking status (i.e. using a genetic risk score that predicts NMR). To achieve this, large-scale genetic studies of NMR are required to robustly identify the underlying genetic risk variants.

To date, most genetic studies of NMR have been undertaken in European ancestry smokers, and the genetic architecture of NMR in non-European smokers remains only partially understood, contributing to potential health disparities [13]. In European smokers, the largest GWAS of NMR conducted (n = 5,185) identified a strong genome-wide association near CYP2A6 on chromosome 19, and a second association near TMPRSS11E on chromosome 4 [14]. The CYP2A6 association pattern in European smokers was complex, with six independent variants identified in conditional analysis and a top causal configuration including 13 variants identified in Bayesian fine-mapping [14]. To our knowledge, we have conducted the largest GWAS of NMR in African American smokers to date (n = 954), finding a single genome-wide association near CYP2A6. The association pattern in African American smokers was distinct from that observed in Europeans [15], with 58 of the 96 genome-wide significant hits not reaching genome-wide significance in Europeans and a different lead variant (rs12459249) that was not in high linkage disequilibrium (LD) with the top variant in Europeans (r2 < 0.6) [16].

While GWAS provide comprehensive coverage of single nucleotide polymorphisms (SNPs), there are several well characterized CYP2A6 * alleles with known functional effects on CYP2A6 activity that are not well captured using standard GWAS approaches [17]. Incorporating both CYP2A6 * alleles and common genetic variants identified by GWAS, we previously developed ancestry-specific genetic risk scores (GRSs) to estimate an individual’s NMR from their genetic information [18, 19]. These GRSs explained 33.8% and 32.4% of variance in NMR in European [18] and African [19] ancestry populations, respectively, and showed reasonable prediction of slow vs. normal nicotine metabolizer status in these populations (AUC = 0.78 and 0.73, respectively) [18, 19]. As has been previously described for GRS more broadly [13], given differences in LD structure across ancestral populations these ancestry-specific GRSs showed poor portability across populations, with the European and African ancestry GRSs explaining only 18–20% of variance in NMR in the alternate population [19]. Additionally, Bloom et al. developed an ancestry-specific GRS for a different nicotine metabolism measure (D2-cotinine:[D2-nicotine+D2-cotinine]) in Europeans using * alleles and other variants from the literature [20]. Development of a universal GRS using multi-ancestry cohorts is another promising approach, with Baurley et al. reporting similar predictive performance across African, Asian, and European ancestry smokers using machine learning algorithms to predict NMR based on age, sex, ancestry, BMI, and a set of 263 SNPs prioritized from GWAS (of which 198 were located in the CYP2A6 region) [21].

In summary, previous large-scale efforts have been undertaken to fine-map the CYP2A6 regional association with NMR in European ancestry smokers [14]. However, to our knowledge there has been no previous study fine-mapping the genome-wide CYP2A6 association in African ancestry smokers. Given growing interest in developing genetic tools to assist with smoking counseling and cessation, in the current study we address this knowledge gap and the potential health disparities it creates. Building on our previous studies in a group of African Americans participating in two large smoking cessation trials (Fig. S1), here we investigated the CYP2A6 association with NMR in more detail using an updated conditional analysis and new Bayesian fine-mapping approach to analyze both SNPs and * alleles (including structural variants) in the region. We also evaluated whether incorporating the putative causal variants identified by fine-mapping improved an existing ancestry-specific GRS to genetically predict NMR in African American populations, and the portability of this GRS to predict NMR in those of European ancestry.

Materials and methods

Participants

Our study sample comprised African and European ancestry smokers from two clinical trials of cessation: Pharmacogenetics of Nicotine Addiction Treatment 2 (PNAT-2; NCT01314001) [10] and Kick-it-at-Swope 3 (KIS-3; NCT00666978) [22]. The clinical trial protocols were approved by institutional review boards at all participating sites and the University of Toronto.

The study designs of PNAT-2 and KIS-3 have been described in detail elsewhere [10, 22]. Briefly, PNAT-2 randomized eligible adult smokers (aged 18–65 years, smoking ≥10 cigarettes/day) by NMR group (normal metabolizers vs. slow metabolizers) to treatment with placebo, nicotine patch, or varenicline for smoking cessation; all three treatment arms received behavioural counselling [10]. Approximately 37% of the total PNAT-2 sample were of African ancestry (genetically determined based on comparison of genome-wide data to population reference panels as previously described [19], see Quality Control below for further details), and were included in the primary analyses here (n = 506, Table 1). We conducted additional analyses evaluating the portability of GRSs developed to predict NMR in African populations to the subset of PNAT-2 participants who were of European ancestry (genetically determined as previously described [18], n = 933).

Table 1 Sociodemographic and clinical characteristics of the final study sample.

KIS-3 randomized eligible adult light smokers (aged ≥18 years, smoking ≤10 cigarettes/day) who self-identified as African American to treatment with bupropion or placebo for smoking cessation; both treatment arms received health education counselling [22]. Recruitment for KIS-3 was from a community-based clinic in Kansas City, MO [22]. Participants who were of African ancestry (genetically determined, as previously described [19], n = 458) were included in the primary analyses (Table 1).

Outcome measure

Nicotine metabolite ratio (NMR, 3’hydroxycotinine/cotinine ratio)

We measured NMR as a continuous variable by determining the ratio of 3’hydroxycotinine/cotinine concentrations in blood samples collected at the time of clinical trial enrollment, when participants were smoking regularly. Cotinine and 3’hydroxycotinine concentrations were determined using liquid chromatography-tandem mass spectrometry, as previously described [23].

Genetic data collection

Genotyping

To capture common SNPs, we conducted genome-wide genotyping using the Illumina HumanOmniExpressExome-8 v1.2 array (Illumina, San Diego, CA, USA) at the Centre for Applied Genomics, Hospital for Sick Children (Toronto, ON, Canada). We also included a previously described custom iSelect® add-on, capturing an additional 2,688 variants associated with nicotine metabolism and/or smoking behaviours for richer coverage of regions of interest including CYP2ABFGST (chromosome 19), CHRNA5-A3-B4 (chromosome 15), OCT2 (chromosome 6), and UGT2B (chromosome 4) [15].

We directly genotyped the following 12 CYP2A6 * alleles: CYP2A6*46 (formerly CYP2A6*1B), CYP2A6*1×2, CYP2A6*4, CYP2A6*9, CYP2A6*12, CYP2A6*17, CYP2A6*20, CYP2A6*23, CYP2A6*25/*26/*27 (all tagged by rs28399440), CYP2A6*28, CYP2A6*31, CYP2A6*35 as previously described [18, 19]. These CYP2A6 * alleles have demonstrated functional effects on CYP2A6 activity, and include structural variants (CYP2A6 gene deletions and duplications) as well as amino acid changes (see Table S2 for details). Individuals with structural variants (CYP2A6*1×2, CYP2A6*4, CYP2A6*12, CYP2A6*34, and CYP2A6*53) were re-genotyped using an approach with improved accuracy, as previously described [24].

Quality control

We performed quality control for samples and raw genotype data using PLINK [25], following standard protocols as previously described [15]. Individuals with discrepant sex, genotype call rate < 0.98, heterozygosity rate > 3 SDs from sample mean, substantial cryptic relatedness (PI_HAT > 0.185), or substantial non-African admixture (determined by visual inspection of multidimensional scaling (MDS) plots) were excluded. Self-reported African American ancestry was highly concordant with genetically determined ancestry in our sample (>95% concordance rate) [15]. Variants with call rate < 0.98, minor allele frequency (MAF) < 0.01, or Hardy-Weinberg equilibrium (HWE) p-value < 1 × 10−6 were excluded.

Imputation

We imputed chromosome 19 using the Michigan Imputation Server, which utilizes Minimac4 [26]. Accurately sequencing the CYP2A6 region is challenging due to extensive variability, regions of high homology (i.e. including the pseudogene CYP2A7), and complex structural variation [17]; poor sequencing quality in this region reduces the quality of imputed genotype calls made using standard reference panels. Therefore, we compared the results of imputation using two different cosmopolitan reference panels: the TOPMED Version R2 reference panel (N = 97,256 with ~30% African ancestry from African, African Caribbean, or African American populations) [27], and the 1000 Genomes Phase 3 reference panel (N = 2504 with ~25% African ancestry from the following populations: Esan in Nigeria (ESN), Gambian in Western Division, Mandinka (GWD), Luhya in Webuye, Kenya (LWK), Mende in Sierra Leone (MSL), Yoruba in Ibadan, Nigeria (YRI), African Caribbean in Barbados (ACB), people with African ancestry in Southwest USA (ASW)) [28]. The TOPMED imputation was performed with pre-phasing of haplotypes using Eagle v2.4 and human genome build hg38 [29]. The 1000 Genomes Phase 3 imputation was performed with pre-phasing of haplotypes using ShapeIT v2.r79034 [30] and human genome build hg37, as previously described [31].

Post-imputation quality control was performed using PLINK [25] to exclude duplicate and multi-allelic variants, as well as variants with poor imputation quality (INFO < 0.6) or HWE p-value < 1 × 10−6. We then compared the density of coverage and imputation quality across the two imputation methods.

Statistical analyses

Association testing

All statistical analyses were done using R Statistical Software unless otherwise specified [32]. We used a mega-analytic approach, pooling data from both clinical trials (PNAT-2 and KIS-3) for all analyses unless otherwise specified.

Based on LD patterns in our sample, and in keeping with prior CYP2A6 fine-mapping efforts in European ancestry smokers [14], we included variants in a 5 Mb window spanning CYP2A6 in our analyses (chromosome 19:38,000,000–43,000,000 bp; Genome Reference Consortium Human Build 38, hg38). We evaluated the association of these variants in the CYP2A6 region with NMR. Given the non-normal distribution of NMR in our sample, we applied rank-based inverse normal transformation using the R package RNOmni [33] and used these transformed NMR values for all analyses unless otherwise specified (Fig. S2).
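As an illustration of the rank-based inverse normal transformation described above, the following is a minimal Python sketch rather than the RNOmni implementation used in the study. The function name rank_inverse_normal is hypothetical, and the Blom-type offset k = 3/8 and the averaging of tied ranks are assumptions about sensible defaults.

```python
from statistics import NormalDist


def rank_inverse_normal(values, k=0.375):
    """Map values to normal scores via their ranks (ties get the average rank).

    Assumption: the offset k = 3/8 (Blom) is used, i.e. each rank r is mapped
    to the standard normal quantile of (r - k) / (n - 2k + 1).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for t in range(i, j + 1):
            ranks[order[t]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - k) / (n - 2 * k + 1)) for r in ranks]
```

The transformed values depend only on the ordering of the input, so the same scores are obtained whether raw or log-transformed NMR is supplied.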

Association testing was done in SNPTEST v2.5.2 [34] using linear regression to test the association of imputed genotype dosages with normalized NMR using an additive genotypic model with adjustment for age, sex, body mass index (BMI), and two ancestry-informative dimensions to account for population substructure as covariates.

Stepwise conditional analysis

To identify the number of independent associations in the CYP2A6 region, we completed stepwise conditional analysis in SNPTEST v2.5.2 [34] by including genotype dosages for the top variant as an additional covariate in the base model described above (effectively conditioning on additive effects of the top variant), and repeating this procedure until no further association signals reached genome-wide significance (p < 5 × 10−8). Regional association plots were constructed using LocusZoom, with LD information from the 1000 Genomes Phase 3 African populations reference panel [35].
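The stepwise procedure described above can be sketched in Python as follows. This is a simplified illustration and not the SNPTEST implementation: it fits ordinary least squares by solving the normal equations, approximates the two-sided p-value with a normal (Wald) test instead of a t-test, and the function names (wald_p, stepwise_conditional) are hypothetical.

```python
import math


def _invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix (collinear predictors)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def wald_p(y, predictors):
    """Fit y ~ intercept + predictors; return (beta, p-value) for the last predictor."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    inv = _invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    rss = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    se = math.sqrt(rss / (n - p) * inv[-1][-1])
    z = beta[-1] / se
    # Assumption: normal approximation to the t distribution (adequate for large n).
    return beta[-1], math.erfc(abs(z) / math.sqrt(2))


def stepwise_conditional(y, covariates, dosages, alpha=5e-8):
    """Add the top variant as a covariate until no variant reaches p < alpha."""
    selected = []
    while True:
        best = None
        for name, dose in dosages.items():
            if name in selected:
                continue
            design = covariates + [dosages[s] for s in selected] + [dose]
            try:
                _, p_value = wald_p(y, design)
            except ValueError:
                continue  # variant is collinear with those already selected
            if best is None or p_value < best[1]:
                best = (name, p_value)
        if best is None or best[1] >= alpha:
            return selected
        selected.append(best[0])
```

Here covariates is a list of per-individual covariate columns (for example age, sex, BMI, and ancestry dimensions) and dosages maps variant identifiers to per-individual genotype dosages; both layouts are assumptions made for this sketch.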

Bayesian fine-mapping

To identify potentially causal variants in the CYP2A6 region, we used FINEMAP v1.4 specifying a maximum of 20 potential causal variants [36]. FINEMAP performs Bayesian fine-mapping using a shotgun stochastic search method to identify the most likely causal configuration of variants, given association summary statistics and local LD patterns [36]. We also performed exploratory functionally informed fine-mapping in FINEMAP [36] by assigning a higher prior probability to CYP2A6 * alleles (prior probability = 0.70 for these variants being causal) compared to non-* allele variants (prior probability = 0.50). Input summary statistics for FINEMAP were obtained as described above using SNPTEST v2.5.2 [34], and the input SNP correlation matrix was computed from genotype dosages in our sample using LDstore v2.0 [37]. Regional association plots were constructed using R [32].

Variant annotation

To annotate variants identified in our analyses we used RegulomeDB [38], a publicly available database that estimates a variant’s likelihood of having a regulatory function using a probability score that ranges from 0 to 1 (with 1 being most likely to be a regulatory variant). The probability score is constructed based on a machine learning model integrating functional genomic data including ChIP-seq signal, DNase-seq signal, information content change, and DeepSEA scores [38].

We also evaluated whether variants were known to influence expression of genes encoding functional proteins using publicly available expression quantitative trait loci (eQTL) data from the Genotype-Tissue Expression (GTEx) Project [39]. The GTEx Project eQTL analysis was based on whole genome sequencing and RNA-seq data collected from 838 donors (~13% African ancestry) across 49 tissues. Given the potential misidentification of CYP2A6 transcripts as pseudogene CYP2A7 due to high sequence homology, we considered eQTL data for pseudogene CYP2A7 along with all other protein-coding genes. The data used for the analyses described in this manuscript were obtained from the GTEx Portal on 12/04/2024.

Incorporation of putative causal variants into an existing genetic risk score (GRS) for NMR

To investigate whether Bayesian fine-mapping improved the predictive power of genetically determined NMR in African American smokers, we compared our previously described GRS for this ancestral population [19] (referred to here as the original GRS) to GRSs including putative causal variants identified by fine-mapping in the current study. The original GRS included eight CYP2A6 * alleles (*1×2, *4, *9, *12, *17, *20, *25/*26/*27, *35) and three LD-independent genome-wide significant SNPs (rs12459249, rs111645190, rs185430475) identified in an earlier conditional analysis of the CYP2A6 region [15]. The initial GRS estimation was constructed using mentholated cigarette use as an additional covariate, and explained 32.4% of the variance in log-NMR [19]. We elected to not adjust for menthol in the current study in order to maximize sample size (10% of participants were missing menthol data) and because menthol adjustment did not appreciably alter SNP effects on NMR [31]. For harmonization with data used in the current study, we therefore recalculated the weights for all variants in the original GRS using the analytic approach described below (without adjustment for mentholated cigarette use), and with CYP2A6 * allele genotypes obtained using a more recent genotyping approach with improved accuracy [24].

The updated GRS included all eight CYP2A6 * alleles from the original GRS and the six LD-independent putative causal variants identified by FINEMAP as the lead variant in their respective credible set. We did not include the three GWAS conditional hits in the CYP2A6 region from the original GRS [19] in our updated GRS given that two of these SNPs (rs12459249 and rs111645190) were in high LD (r2 > 0.80) with putative causal variants identified by fine-mapping (rs10853742 and rs28399451, respectively) and the remaining SNP (rs185430475) did not show robust association with NMR in our updated analysis (p > 1 × 10−4). To construct the updated GRS, the effect size of each putative causal variant was estimated separately in KIS-3 and PNAT-2 by association testing in SNPTEST v2.5.2 [34], using linear regression of square-root transformed NMR on imputed genotype dosages under an additive genotypic model with adjustment for age, sex, BMI, and two ancestry-informative dimensions to account for population substructure. Given that the overall variance in log-NMR explained was comparable for GRSs with variant weights derived from linear regression against square-root or rank-transformed NMR, square-root transformed NMR was used for comparability of weights with the original GRS [20]. The overall effect size for each variant was then estimated in the total sample (KIS-3 and PNAT-2) by fixed-effects meta-analysis using the meta v1.7 R package [40], followed by multiplication of the resultant β coefficient by the standard deviation of the sqrt-NMR to unstandardize the scores [19]. The GRS was then computed for each individual in the total sample as follows, where n is the number of variants included in the GRS, d_i is the number of risk alleles carried at variant i, and β_i is the effect size for variant i:

$$\mathrm{wGRS}=\sum_{i=1}^{n}\beta_{i}\,d_{i}$$
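The weighting and summation steps can be illustrated with a short Python sketch. It is a generic inverse-variance fixed-effect pooling followed by the weighted sum above, not the meta R package workflow used in the study; the function names are hypothetical, and the rescaling of β by the standard deviation of sqrt-NMR is omitted for brevity.

```python
import math


def fixed_effect_meta(estimates):
    """Pool per-trial (beta, standard error) pairs by inverse-variance weighting.

    Returns the pooled beta and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled = sum(w * beta for w, (beta, _) in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


def weighted_grs(betas, dosages):
    """Compute sum_i beta_i * d_i for one individual.

    betas:   variant identifier -> effect size (weight)
    dosages: variant identifier -> number of risk alleles (0 to 2, may be fractional)
    A KeyError is raised if the individual lacks a dosage for any weighted variant.
    """
    return sum(beta * dosages[variant] for variant, beta in betas.items())
```

For example, with made-up per-trial estimates (−0.5, SE 0.1) and (−0.3, SE 0.1), fixed_effect_meta returns a pooled β of −0.4, and an individual carrying two copies of that variant contributes −0.8 to the score.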

To evaluate the performance of the updated and original GRSs [19], we first calculated the variance in log-transformed NMR (log-NMR, which best represents the nicotine clearance rate [41]) explained by each GRS in linear regression models of log-NMR ~ GRS using the R function lm [32]. We also evaluated the variance in log-NMR explained by a GRS that included only the five variants identified by conditional analysis, and by a GRS that included only the six putative causal variants identified by FINEMAP.
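For a model with a single predictor and an intercept, such as log-NMR ~ GRS, the variance explained equals the squared Pearson correlation between the two variables. The Python sketch below illustrates this calculation; it is not the R code used in the study, the function name variance_explained is hypothetical, and the natural logarithm is assumed.

```python
import math


def variance_explained(grs, nmr):
    """Return R^2 from a simple linear regression of log(NMR) on the GRS.

    Assumption: NMR values are strictly positive and the natural log is used.
    """
    log_nmr = [math.log(value) for value in nmr]
    n = len(grs)
    mean_x = sum(grs) / n
    mean_y = sum(log_nmr) / n
    sxx = sum((x - mean_x) ** 2 for x in grs)
    syy = sum((y - mean_y) ** 2 for y in log_nmr)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(grs, log_nmr))
    # With one predictor plus intercept, R^2 is the squared correlation.
    return sxy ** 2 / (sxx * syy)
```

Applying the same function to the African and European ancestry subsets, with the weights held fixed, gives the within-sample and portability estimates respectively.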

Next, we compared the transferability of the updated and original GRSs [19] from African to European populations by calculating the variance explained in log-NMR by each GRS in the European ancestry subset of PNAT-2 (N = 933).

Results

Clinical characteristics of the final discovery sample are presented in Table 1. From PNAT-2, two samples were excluded due to missing or outlying normalized NMR values. From KIS-3, eight samples were excluded due to cotinine concentrations <10 ng/mL (which suggest non-daily smoking [42]), and one sample was excluded due to missing BMI. After quality control, our final sample therefore comprised 953 African American smokers (n = 504 from PNAT-2, and n = 449 from KIS-3).

Following imputation using the TOPMED reference panel, 104,131 variants in the CYP2A6 region (chromosome 19:38,000,000-43,000,000 bp; Genome Reference Consortium Human Build 38, hg38) were available for analysis. The median INFO score for variants in the CYP2A6 region was 0.97 (mean = 0.92, SD = 0.096), suggesting high imputation quality. After imputation using the 1000 Genomes reference panel, 46,154 variants in the CYP2A6 region were available for analysis with median INFO score 0.91 (mean = 0.88, SD = 0.110). Given the denser coverage and higher quality genotypes obtained from imputation using the TOPMED reference panel (Fig. S3), we used imputed genotype dosages from these data for our analyses along with 12 directly genotyped CYP2A6 * alleles.

Within the CYP2A6 region a total of 113 variants showed robust association (p < 5 × 10−8) with NMR, including four of the 12 * alleles genotyped in our sample (CYP2A6*17, CYP2A6*9, CYP2A6*4, and CYP2A6*25/*26/*27, Table S2). Overall, these CYP2A6 * alleles were less strongly associated with NMR than other variants in the region (p-values ranging from p = 2.06 × 10−26 for CYP2A6*17 to p = 4.40 × 10−8 for CYP2A6*25/*26/*27, Table S2). The strongest association was observed for rs11878604 (beta = −0.689, p = 4.75 × 10−44), a SNP located ~16 kb 3’ of CYP2A6 (Fig. 1). This lead variant had a RegulomeDB probability score of 0.69 (scores range from 0 to 1, with 1 most likely to represent a variant with regulatory function) [38]; rs11878604 was also identified as an adrenal eQTL for CYP2A6 in the GTEx Project, with the allele associated with lower NMR (i.e. reduced CYP2A6 activity) showing association with decreased CYP2A6 expression in adrenal gland tissue (Table S1, Fig. S4).

Fig. 1: Conditional analysis of the CYP2A6 regional association with NMR in African ancestry smokers.

Five independent associations were identified by conditional analysis (a–e), including CYP2A6 deletion variant CYP2A6*4 (b); after conditioning on these five variants (a–e), there were no genome-wide significant associations remaining in the region (f). Genomic positions based on Genome Reference Consortium build 38, hg38.

Stepwise conditional analysis with SNPTEST [34] identified five independent associations with NMR in the CYP2A6 region (Fig. 1, Table S1). Only the lead variant (rs11878604) was identified as an eQTL for CYP2A6 in GTEx. After conditioning on imputed rs11878604 genotype dosage, a second independent association was identified with the directly genotyped CYP2A6*4 allele (beta = −1.033, p = 8.54 × 10−13). The CYP2A6*4 allele confers a whole gene deletion of CYP2A6, and individuals with this allele have correspondingly decreased CYP2A6 activity [43, 44]. Notably, in our sample CYP2A6*4 was not in LD with any other individual variant in the region (all r2 < 0.15), consistent with previous literature indicating that CYP2A6*4 cannot be tagged by nearby SNPs [45]. CYP2A6*4 was not genotyped in the 1000 Genomes Phase 3 African populations used as an LD reference for construction of regional association plots by LocusZoom, and as such there is no LD information displayed on the CYP2A6*4 regional association plot (Fig. 1b). Conditioning on rs11878604 and CYP2A6*4 revealed a third independent association with rs10853742 located ~9 kb 3’ of CYP2A6 (beta = 0.405, p = 5.65 × 10−12), a SNP with a RegulomeDB probability score of 0.61 that was identified as a skin eQTL for CYP2A7 in the GTEx Project (Table S2, Fig. S4). Conditioning on rs11878604, CYP2A6*4, and rs10853742 identified a fourth independent association with rs28399451 (beta = −0.340, p = 5.59 × 10−10). Located within intron 6 of CYP2A6, rs28399451 had a RegulomeDB probability score of 0.135 and was identified as a skin and peripheral nerve eQTL for CYP2A7 in the GTEx Project (Table S1, Fig. S4). 
Conditioning on genotype dosages of these four variants (rs11878604, CYP2A6*4, rs10853742, rs28399451) identified a fifth independent association with rs116670633 (beta = −0.676, p = 6.27 × 10−10); this SNP was located ~85 kb 5’ of CYP2A6, had a RegulomeDB probability score of 0.135, and was not identified as an eQTL in the GTEx Project. After conditioning on these five variants, there were no remaining genome-wide associations with NMR (Fig. 1). These findings were consistent when association testing was run independently in PNAT-2 and KIS-3 and then meta-analyzed using an inverse-variance weighting approach (Table S1).

Bayesian fine-mapping with FINEMAP [36] identified six causal variants contributing to the CYP2A6 region association with NMR (posterior probability of six causal variants in the region, PP = 0.67). The top causal configuration included CYP2A6*4, rs116670633, CYP2A6*9, rs28399451, rs8192720, and rs10853742; the posterior probability of these six variants representing the true causal configuration was 0.090, and together they explained 31% of the heritability of NMR (Fig. 2). In addition to the top causal configuration, Bayesian fine-mapping identified six “credible sets” (Fig. 2, Table 2); each credible set can be interpreted as containing a causal variant with 95% coverage probability. The lead variants in credible sets 1–5 were highly likely to be causal (CYP2A6*4, rs116670633, CYP2A6*9, rs28399451, rs8192720; posterior inclusion probability, PIP, of each variant being truly causal >0.50). Four of the putative causal variants identified by FINEMAP were also identified by conditional analysis (CYP2A6*4, rs116670633, rs28399451, rs10853742). Exploratory functionally informed FINEMAP analyses specifying a maximum of six causal variants and upweighting the 12 CYP2A6 * alleles, which have well characterized functional effects on CYP2A6 activity (summarized in Table S2), provided consistent results and did not identify any alternative putative causal variants.
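To make the notion of a 95% credible set concrete, the following Python sketch shows a generic construction: variants are ranked by their posterior probability of being the causal variant for a given signal and added until the cumulative probability reaches the target coverage. This is an illustration under the assumption that the per-signal probabilities sum to one; it is not FINEMAP's own procedure, and the function name and the variant labels in the example are hypothetical.

```python
def credible_set(posteriors, coverage=0.95):
    """Return the smallest top-ranked set of variants reaching the target coverage.

    posteriors: variant identifier -> posterior probability of being the causal
                variant for one signal (assumed to sum to 1 across variants).
    """
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    chosen = []
    cumulative = 0.0
    for variant, probability in ranked:
        chosen.append(variant)
        cumulative += probability
        if cumulative >= coverage:
            break
    return chosen
```

A signal dominated by one variant yields a single-variant set, whereas a lead variant with probability 0.89 needs additional low-probability variants in LD with it to reach 95% coverage, which is why credible sets can differ widely in size.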

Fig. 2: Bayesian fine-mapping of CYP2A6 association with NMR.

Top causal configuration included CYP2A6*4, rs116670633, CYP2A6*9, rs28399451, rs8192720, and rs10853742; posterior probability of this top configuration being truly causal = 0.090; NMR heritability explained by top configuration (h2) = 0.31.

Table 2 Association with NMR and functional annotations for CYP2A6 region variants identified by fine-mapping.

The six credible sets were made up of differing numbers of putatively causal variants, typically in high LD with each other (Fig. S5). Credible set 1 included only CYP2A6*4 (PIP = 1), which was not in significant LD with any other variant in the region. As described above, CYP2A6*4 is a whole-gene deletion variant conferring absent CYP2A6 activity [44]; because it is a structural variant, CYP2A6*4 eQTL data is not available in existing eQTL datasets which use array-based technology for genotyping. Credible set 2 included only rs116670633, which as described above, is a SNP located ~85 kb upstream of CYP2A6 with limited evidence of regulatory function (PIP = 0.985); this variant was not in LD with any of the variants in other credible sets, but was in low LD with CYP2A6*35 (r2 = 0.46). Credible set 3 included CYP2A6*9 (PIP = 0.890), a functional promoter region variant that decreases CYP2A6 activity, along with 22 other SNPs in LD with CYP2A6*9 that each had very low PIPs (PIP range = 0.001–0.02, Table S3). Credible set 4 included three variants in high LD with each other (Fig. S5), with lead variant rs28399451 (PIP = 0.603). The variants in credible set 4 were also in moderate LD with CYP2A6*17 (r2 = 0.67–0.70). One variant in credible set 4 (rs28399439) was an adipose eQTL for CYP2A6 in GTEx, although unexpectedly the allele associated with lower NMR (i.e. slower CYP2A6 activity) was associated with increased CYP2A6 expression (Table 2, Fig. S4). The remaining two variants in credible set 4 (lead variant rs28399451 and rs4803380) were skin and peripheral nerve eQTLs for CYP2A7. Credible set 5 included three variants in high LD with each other (Fig. S5), with the top variant being rs8192720 (PIP = 0.574). The variants in credible set 5 were in moderate LD with CYP2A6*25/*26/*27 (r2 = 0.50–0.53) and low LD with CYP2A6*20 (r2 = 0.37–0.39); these three variants were not identified as eQTLs in GTEx (Table 2). 
Credible set 6 included four variants, with lead variant rs10853742 (PIP = 0.448). The variants in credible set 6 were in low LD with the lead variant from conditional analysis (rs11878604, r2 = 0.46). All four variants in credible set 6 were skin eQTLs for CYP2A7 in GTEx (Table 2, Fig. S4).

Incorporating the putative causal variants identified through fine-mapping into our existing ancestry-specific GRS [19] resulted in a new “updated GRS.” As a benchmark, the “original GRS” comprising eight CYP2A6 * alleles and three SNPs (rs12459249, rs111645190, rs185430475) identified in an earlier conditional analysis [15] explained 33.2% of the variance in log-NMR in our sample of African American smokers (Fig. 3a, Table 3). The updated GRS included the same eight CYP2A6 * alleles, excluded rs185430475, and included four new SNPs identified by fine-mapping (rs116670633, rs8192720, rs10853742, rs28399451). Two of these new putative causal variants (rs10853742, rs28399451) were represented by tag SNPs in the original GRS in the African ancestry sample (Fig. S5), while in the European ancestry sample only rs10853742 was represented by a proxy variant in the original GRS (r2 = 0.95 with rs12459249). The updated GRS showed similar prediction of NMR as the original GRS within the African ancestry training sample (variance in log-NMR R2 = 0.345 vs. 0.332 for the original GRS; Fig. 3a, c, Table 3), and improved prediction of NMR in an independent European ancestry sample (R2 = 0.282 vs. 0.228 for the original GRS; Fig. 3b, d). In comparison, a GRS including the six FINEMAP putative causal variants alone improved prediction of NMR to a lesser degree (R2 = 0.334 vs. 0.332 for the original GRS in African and R2 = 0.251 vs. 0.228 for the original GRS in European ancestry; Table 3), suggesting the SNPs identified by fine-mapping provide independent predictive information from CYP2A6 * alleles.

Fig. 3: Comparison of an existing African ancestry-specific genetic risk score (“Original GRS”) for NMR with a genetic risk score incorporating newly identified putative causal variants (“Updated GRS”).

Variance in log-NMR explained by the original GRS in African American smokers (a) and its portability to European ancestry smokers (b), as well as the updated GRS in African American smokers (c) and its portability to European ancestry smokers (d). The original GRS comprised * alleles and SNPs identified in a previous conditional analysis, whereas the updated GRS replaced these SNPs with putative causal SNPs identified by fine-mapping (for details of the variants included in the original and updated GRS, see Table 3). R2 represents the variance in log-NMR explained.

Table 3 Effects of incorporating top putative causal variants identified by fine-mapping into an existing genetic risk score (“Original GRS”) to predict NMR in African American smokers.

Discussion

In this study we evaluated the strong regional association of CYP2A6 with NMR among African Americans participating in two large clinical trials of smoking cessation, performing an updated conditional analysis and novel fine-mapping analyses which improved an existing tool to genetically predict NMR. Importantly, our analyses focused on treatment-seeking individuals participating in clinical trials of smoking cessation, which excluded individuals with serious medical or psychiatric comorbidities (including comorbid substance use) and those who were pregnant or breastfeeding. As such, an important future direction will be to expand these analyses in community samples of smokers to evaluate external validity in the general population.

Previous conditional analysis of the CYP2A6 regional association in this sample described by Chenoweth et al. identified three independent associations (rs12459249, rs111645190, rs185430475) [15]; this earlier work did not include CYP2A6 * alleles, and used an older reference panel for genotype imputation resulting in low-density SNP coverage. The conditional analyses and fine-mapping presented here included denser SNP genotyping coverage and 12 directly genotyped CYP2A6 * alleles (several of which are structural variants with robust functional effects on CYP2A6 activity) [46–55], providing a more comprehensive view of variation in the CYP2A6 region than any previous study in this population. In addition to confirming two previously reported CYP2A6 associations with NMR in African American smokers, our conditional analysis identified three novel associations: rs11878604, CYP2A6*4 (full CYP2A6 gene deletion), and rs116670633.

In this first fine-mapping effort of the CYP2A6 regional association with NMR in African ancestry populations, we identified six putative causal variants in the region (posterior probability, PP = 0.67). Prior fine-mapping using a similar analytic approach in European populations identified 13 causal variants in the region [14]. The variants comprising the top causal configuration in our African ancestry sample were distinct from those in Europeans (CYP2A6*4, rs116670633, CYP2A6*9, rs28399451, rs8192720, rs1085374; PP = 0.090), and explained 31% of the heritability of NMR. Interestingly, CYP2A6*9 is a known functional allele conferring reduced CYP2A6 activity [49], while the remaining four lead SNPs identified by FINEMAP were not associated with altered CYP2A6 expression in GTEx (recognizing that regulatory information in publicly available databases is limited by methodological challenges inherent in measuring CYP2A6 gene expression levels due to structural and copy number variation in this region, as well as high sequence homology with pseudogene CYP2A7). Importantly, the top putative causal variant identified was CYP2A6*4 (PIP = 1), a loss-of-function mutation conferring whole gene deletion of CYP2A6. CYP2A6*4 is not included in the vast majority of genomic studies because it cannot be genotyped accurately using array-based technologies, and is not tagged by any individual SNP in the region [45]. The strong evidence we observed for a causal association between CYP2A6*4 and NMR highlights the importance of including CYP2A6 structural variants in future genetic studies of tobacco-related phenotypes. To help facilitate their inclusion, we recently developed a method to impute CYP2A6 structural variants from SNP haplotypes obtained using standard genotyping array data (sensitivity >60%, false positive rate <1% in both African and European ancestry populations) [24].
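
To illustrate how a variant-level posterior inclusion probability (PIP) relates to configuration-level posteriors, the following minimal Python sketch assumes the usual definition in Bayesian fine-mapping, under which a variant's PIP is the sum of the posterior probabilities of all causal configurations that contain it. The configurations and probabilities are invented toy values, not FINEMAP output from this study, and the function name is hypothetical.

```python
def posterior_inclusion_probabilities(config_posteriors):
    """Sum configuration posteriors per variant to obtain marginal PIPs.

    config_posteriors maps a frozenset of variant names (one candidate
    causal configuration) to that configuration's posterior probability.
    """
    pips = {}
    for variants, prob in config_posteriors.items():
        for variant in variants:
            pips[variant] = pips.get(variant, 0.0) + prob
    return pips


# Toy example (hypothetical values): three candidate configurations whose
# posteriors sum to 1. "var1" appears in every configuration.
toy_configs = {
    frozenset({"var1", "var2"}): 0.5,
    frozenset({"var1", "var3"}): 0.3,
    frozenset({"var1"}): 0.2,
}

pips = posterior_inclusion_probabilities(toy_configs)
for name in sorted(pips):
    print(name, round(pips[name], 3))
```

Under this definition a variant can have a PIP close to 1 even when the single most probable configuration has a modest posterior, provided the variant appears in most of the plausible configurations.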

Finally, we demonstrated that an updated GRS including the putative causal variants identified in African American smokers (versus those identified by conditional analysis in an earlier GRS) captured similar amounts of variation in log-NMR in African ancestry individuals, and improved the portability of the GRS to European ancestry individuals. Future work evaluating the performance of our updated GRS in independent validation samples including diverse ancestry smokers is needed to evaluate whether this improved portability extends across other ancestries. One potential explanation for the improved performance of our African ancestry-specific updated GRS within European smokers is that fine-mapping identified novel variants influencing NMR that were not represented in the original GRS (i.e. rs116670633, rs8192720). Additionally, prior work has demonstrated that including putative causal variants identified by fine-mapping improves the transferability of GRS across diverse populations, because differences in LD structure can result in tag SNPs from one ancestral population no longer being good proxies for the underlying true causal variants in other ancestral populations [56, 57]. Consistent with this, the LD patterns between tag SNPs included in our original GRS and the four putatively causal SNPs included in the updated GRS were different in our African and European samples.
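
As a rough illustration of the quantities discussed above, the following Python sketch computes a weighted GRS as the sum of per-allele weights multiplied by allele dosages, and then the variance in log-NMR explained (R2) as the squared Pearson correlation between the GRS and log-NMR. This is a sketch under stated assumptions: the weights, genotypes, log-NMR values, the variant label `snp_example`, and the function names are all hypothetical, and the study's own GRS construction (see Table 3) may differ in detail.

```python
from statistics import mean


def genetic_risk_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies) for one individual."""
    return sum(w * dosages.get(variant, 0) for variant, w in weights.items())


def variance_explained(x, y):
    """Squared Pearson correlation, equal to R2 from a simple linear regression."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


# Hypothetical per-allele weights on the log-NMR scale (NOT the study's estimates).
weights = {"CYP2A6*4": -0.5, "CYP2A6*9": -0.3, "snp_example": 0.1}

# Hypothetical allele dosages and log-NMR values for four individuals.
cohort = [
    {"CYP2A6*4": 0, "CYP2A6*9": 0, "snp_example": 1},
    {"CYP2A6*4": 1, "CYP2A6*9": 0, "snp_example": 0},
    {"CYP2A6*4": 0, "CYP2A6*9": 1, "snp_example": 2},
    {"CYP2A6*4": 0, "CYP2A6*9": 2, "snp_example": 0},
]
log_nmr = [-0.6, -1.3, -0.9, -1.4]

scores = [genetic_risk_score(person, weights) for person in cohort]
r2 = variance_explained(scores, log_nmr)
print([round(s, 2) for s in scores], round(r2, 3))
```

Portability can then be assessed by applying the same weights to genotypes from a second ancestry group and comparing the resulting R2 values.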

Overall, our results further elucidate the genetic architecture of the CYP2A6 regional association with NMR among African American smokers and provide a shortlist of variants that may causally influence nicotine clearance in this population, which could be prioritized for investigation in future functional studies of CYP2A6 activity. In particular, the strong evidence for a causal association observed between CYP2A6*4 and NMR highlights the importance of including CYP2A6 structural variants in future genetic studies of tobacco-related phenotypes. Finally, the potential utility of genomic data, including genetic risk scores (GRS), in medical decision-making is growing and complements the utility of other biomarkers such as NMR, particularly in situations where NMR measurements are not available or feasible (e.g. in non-smokers). Given that incorporating putative causal variants improved trans-ancestry portability of an existing GRS for NMR in this study, our results demonstrate the broader value of fine-mapping efforts as a tool to refine and improve the potential clinical utility of GRS across diverse populations, which may ultimately help address potential health disparities exacerbated by existing Euro-centric GWAS data [13].