Introduction

Drug development programs most often fail because of lack of efficacy (Paul et al. 2010; Seyhan 2019). This often occurs when the medicine targets a protein that is not involved in the etiology of the disease (Butcher 2003; Lindsay 2003; Seyhan 2019). One way to identify causal mechanisms of human disease is to use human genetics, since drug development programs supported by human genetic evidence have a two- to four-fold higher probability of success (Nelson et al. 2015; King et al. 2019; Ochoa et al. 2022). Genome-wide association studies (GWAS) have been fruitful in identifying the regions of the genome associated with susceptibility to disease. However, due to linkage disequilibrium (LD), loci that contain significant variants can span multiple genes, and GWAS results therefore often do not clearly identify a specific causal gene at associated loci. Consequently, a major challenge in current human genetics studies is mapping GWAS loci to their causal genes.

This is a non-trivial task for three primary reasons. First, associated variants are often in LD with nearby variants, which makes it difficult to narrow a signal down to a few causal variants. Second, approximately 80–90% of all phenotype-associated variants are non-coding, and most have small effect sizes (Edwards et al. 2013; Mirza et al. 2014; Hrdlickova et al. 2014; Boyle et al. 2017). Thus, attributing their effects to specific genes is challenging, even with experimental methods like CRISPR (Xu and Li 2020). Last, given the hundreds of loci associated with many common diseases, functional experimentation on a locus-by-locus basis is laborious and has rarely been achieved systematically.

One method to identify causal genes with less ambiguity is the exome-wide association study (ExWAS), which relies upon coding genetic variants that may have high pathogenicity. Because this technique focuses on coding variants, it provides direct evidence for causal gene inference. It can also provide information regarding the directionality and magnitude of the effects of genes on the phenotype of interest, which can be difficult to infer based on GWAS and gene prioritization metrics alone. However, a major limitation of ExWAS is that its statistical power is low due to the rarity of pathogenic coding mutations in general, and consequently the false negative rate of ExWAS is high (Auer and Lettre 2015). This means that a much larger sample size is required to have sufficient variants across the exome to detect gene–trait associations. For example, an ExWAS for estimated bone mineral density (eBMD), a known polygenic trait, with close to 300,000 samples identified only 19 genes, and many known causal genes were not among them (unpublished data, accepted in principle). In comparison, GWAS with similar sample sizes have identified more than 500 independent loci (Kemp et al. 2017; Morris et al. 2019). This suggests that the ExWAS missed many known gene–trait associations, likely due to the lack of statistical power and the lack of variants at some of the causal genes. Further, due to increased costs, ExWAS data are often available for a much smaller sample size than GWAS data.

In comparison to ExWAS, GWAS results are much easier to obtain at large scale. For this reason, there is ongoing research into methods that improve the use of this abundant data type for drug target discovery. These methods aim to prioritize causal genes by combining GWAS results with LD information and easily obtainable genomic annotations (Weeks et al. 2020; Mountjoy et al. 2021; Forgetta et al. 2022). To date, the performance of these metrics relative to the more direct (though more expensive) ExWAS results has not been evaluated systematically.

Here we evaluate the performance of four recently published gene prioritization metrics (Effector index [Ei] (Forgetta et al. 2022), Polygenic priority scores [PoPs] (Weeks et al. 2020), Locus-2-gene [L2G] (Ghoussaini et al. 2021), and Activity-by-contact [ABC] (Fulco et al. 2019)) in predicting ExWAS results in nine traits (calcium level, direct bilirubin level [Bilirubin], estimated bone mineral density [eBMD], standing height [Height], hypothyroidism, low density lipoprotein cholesterol level [LDL-C], red blood cell count [RBC], triglyceride level, and type 2 diabetes [T2D]). If ExWAS results can be predicted from GWAS data using these metrics, this could allow for the interrogation of causal genes at hundreds of GWAS loci and accelerate the delivery of medicines to the clinic.

Materials and methods

Trait selections

We chose nine traits that had large sample sizes in UK Biobank and that together spanned a broad spectrum of disease pathology (Table S1). These included seven continuous traits (calcium level, direct bilirubin measurements, estimated bone mineral density, standing height, LDL-cholesterol level, red blood cell count, and triglyceride level) and two common disease traits (type 2 diabetes and hypothyroidism).

Sources of ExWAS results

Masks are defined as different variant inclusion criteria for burden testing (a list of all burden tests and masks is in Table S2). ExWAS results for all traits except eBMD were obtained from AZphewas (Wang et al. 2021) or GeneBass (Karczewski et al. 2022). For eBMD, ExWAS results were obtained from a disease-specific ExWAS study (unpublished data, accepted in principle) (Table S1). The lowest ExWAS burden test p value across all ExWAS sources, burden tests and masks (or, when specified, across a previously defined set of ‘stringent masks’ (Table S3)) was retained for each trait. Gene information was merged across datasets by matching gene names.
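
The retention rule described above can be illustrated with a minimal sketch. It assumes the minimum is taken per gene within a trait, and the gene names, source names and mask names below are hypothetical placeholders rather than the entries of Tables S2 or S3; the study's own pipeline is not described at this level of detail.

```python
# Hypothetical burden-test records for one trait: (gene, source, mask, p value).
# Mask names are placeholders, not the masks listed in Table S2 or Table S3.
records = [
    ("GENE_A", "source_1", "ptv", 2.0e-9),
    ("GENE_A", "source_2", "ptv_plus_missense", 4.0e-5),
    ("GENE_B", "source_1", "ptv", 0.12),
    ("GENE_B", "source_2", "ptv_plus_missense", 1.0e-8),
]

EXOME_WIDE_THRESHOLD = 3.6e-7  # threshold stated in the text


def lowest_p_per_gene(rows, allowed_masks=None):
    """Return {gene: minimum p value}, optionally restricted to a set of masks."""
    best = {}
    for gene, _source, mask, p in rows:
        if allowed_masks is not None and mask not in allowed_masks:
            continue  # skip masks outside the requested (e.g. stringent) set
        if gene not in best or p < best[gene]:
            best[gene] = p
    return best


# Definition using any mask versus a (hypothetical) stringent subset.
all_masks = lowest_p_per_gene(records)
stringent = lowest_p_per_gene(records, allowed_masks={"ptv"})

significant_all = {g for g, p in all_masks.items() if p < EXOME_WIDE_THRESHOLD}
significant_stringent = {g for g, p in stringent.items() if p < EXOME_WIDE_THRESHOLD}
```

In this toy input, GENE_B is significant only when all masks are allowed, which mirrors how the two significance definitions used later can disagree for the same gene.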

GWAS loci definition

GWAS results were obtained from publicly available resources or large-scale trait-specific GWAS publications (Mahajan et al. 2018; Morris et al. 2019; Forgetta et al. 2022) (Table S4). GWAS loci were defined as previously described (Forgetta et al. 2022). Briefly, a set of independent SNPs was obtained through clumping with plink 1.9 (Purcell et al. 2007), using a window size of 250 kb, a linkage disequilibrium threshold of 0.01 and a p value threshold of 5 × 10⁻⁸. Neighbouring signals within 50 kb were merged. The resulting loci were then padded with 250 kb on each side, resulting in loci at least 500 kb in size. A gene was assigned to one of these loci if it overlapped the locus by more than 50% of its length. These genes and GWAS loci formed the basis of all locus-based analyses (i.e., AUPRC, AUROC, odds ratio, score distribution) throughout the study.
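
The merging, padding and gene-assignment steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it handles a single chromosome, takes already-clumped lead SNP positions as input, interprets "50% of its length" as 50% of the gene's length, and does not resolve padded loci that overlap one another. The function names and coordinates are hypothetical.

```python
def build_loci(lead_positions, merge_distance=50_000, padding=250_000):
    """Merge lead SNP positions (one chromosome) into padded loci.

    A signal within `merge_distance` of the previous signal joins its cluster;
    each cluster is then padded by `padding` base pairs on both sides.
    """
    clusters = []
    for pos in sorted(lead_positions):
        if clusters and pos - clusters[-1][1] <= merge_distance:
            clusters[-1][1] = pos  # extend the current cluster
        else:
            clusters.append([pos, pos])  # start a new cluster
    return [(max(0, start - padding), end + padding) for start, end in clusters]


def assign_gene(gene_start, gene_end, loci, min_fraction=0.5):
    """Return the first locus covering more than `min_fraction` of the gene.

    Assumption: the overlap fraction is measured relative to the gene length.
    """
    gene_length = gene_end - gene_start
    for locus_start, locus_end in loci:
        overlap = min(gene_end, locus_end) - max(gene_start, locus_start)
        if gene_length > 0 and overlap / gene_length > min_fraction:
            return (locus_start, locus_end)
    return None


# Two signals 40 kb apart merge into one locus; the third stays separate.
loci = build_loci([1_000_000, 1_040_000, 2_000_000])
inside = assign_gene(1_200_000, 1_300_000, loci)   # 90% of the gene overlaps
outside = assign_gene(1_280_000, 1_380_000, loci)  # only 10% overlaps
```

A single isolated signal yields a locus of exactly 500 kb, which matches the minimum locus size stated above.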

Sources of Ei scores

Ei scores for all traits were obtained from a recent publication (Forgetta et al. 2022). The goal of the Ei is to generate a score that quantifies the probability of causality for each gene at each significant GWAS locus (Forgetta et al. 2022). Briefly, a total of twelve phenotypes were used to develop the Ei model (T2D, LDL-C, height, calcium level, hypothyroidism, triglyceride level, eBMD, glucose level, red blood cell count, systolic blood pressure, diastolic blood pressure and direct bilirubin level). GWAS for each trait were fine-mapped and then annotated to identify locus- and gene-level features for use in the model. The Ei models were generated using a gradient boosting algorithm (XGBoost) with over 150 locus- and variant-level annotations. Two measures were taken in the development of Ei to avoid potential overfitting (Forgetta et al. 2022). First, the Ei model for each trait was generated separately using a leave-one-out approach (e.g., the Ei model for T2D was generated by training the model on all traits except T2D). Second, the creators of Ei ensured that each gene contributed to the training of the models at most once.

Sources of L2G scores

L2G scores were obtained from the Open Targets Genetics public online platform (https://genetics.opentargets.org/) (Table S5). L2G scores have been calculated for a number of GWASs published in the GWAS Catalog (https://www.ebi.ac.uk/gwas/). For this study, we chose L2G scores calculated from GWASs that used UK Biobank data (Table S5). For each study, a list of significant variants and the L2G scores for each associated gene were obtained. In cases where the same gene was implicated by multiple lead variants, the largest L2G score was retained. L2G is similar to Ei in that it also uses a gradient boosting algorithm (XGBoost) and assigns a value between zero and one; however, it differs from Ei in the features used to train the model (Carvalho-Silva et al. 2019; Ghoussaini et al. 2021; Mountjoy et al. 2021).

Sources of PoPs and ABC scores

PoPs and ABC scores were obtained from a recent publication (Weeks et al. 2020) (Table S5). PoPs works under the assumption that causal genes share similar functional characteristics. As such, it aims to identify the most probable causal genes by analyzing a number of biological annotations, ranging from gene expression data to biological pathways and protein–protein interaction information (Weeks et al. 2020). ABC was developed to identify gene-enhancer relationships. As many GWAS significant variants are non-coding variants that influence the trait through interactions with gene regulatory elements, ABC aims to prioritize the most likely causal gene by identifying genes that are most likely to be regulated by enhancer elements linked to the lead fine-mapped variant (Fulco et al. 2019; Nasser et al. 2021). For each trait, the largest PoPs and ABC scores for each gene were retained.

Evaluation of classification accuracy

Two metrics commonly used to evaluate the performance of binary classifiers are the area under the receiver operator curve (AUROC) and the area under the precision-recall curve (AUPRC). For each score, the PRC and ROC were based on GWAS loci that contain at least one ExWAS significant gene. ExWAS significance status was defined in one of the following two ways:

  1. Whether the lowest p value using variants from one of the previously defined set of ‘stringent masks’ is below the exome-wide significance threshold of 3.6 × 10⁻⁷

  2. Whether the lowest p value using variants from any mask is below the exome-wide significance threshold of 3.6 × 10⁻⁷

Since it is known that a large number of causal genes will be missed in ExWAS due to reduced power, we focused on genes located at GWAS loci with known ExWAS significant genes so as to avoid the appearance of an elevated false-positive rate. We computed the AUROC and AUPRC for each of the nine traits separately and combined. This resulted in 34 tests when ExWAS significance was defined using variants from the set of stringent masks and 35 tests when ExWAS significance was defined using variants from all masks. We report 99.9% confidence intervals, which maintain a family-wise error rate of 0.05 (Table S6, S7). AUROC and AUPRC confidence intervals were determined using 2000 bootstrap replicates. The baseline for AUROC is that of an uninformative classifier (i.e., 0.50). The baseline for AUPRC is that of a random classifier, which corresponds to the proportion of ExWAS significant genes among all genes that have both an ExWAS burden test p value and the relevant gene prioritization score. As the baseline for AUPRC is based on the proportion of target genes among all relevant genes, it is specific to each trait and score.
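
As a rough illustration of this evaluation, the sketch below computes AUROC, a step-wise AUPRC (average precision), the random-classifier baseline and a percentile bootstrap interval using only the Python standard library. The software used in the study is not stated here, so the estimator details (tie handling, percentile intervals, redrawing resamples that lack one class) are assumptions, and the labels and scores are made-up toy values.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (ties broken by input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / hits


def bootstrap_ci(metric, labels, scores, level=0.999, n_boot=2000, seed=0):
    """Percentile bootstrap interval; resamples lacking either class are redrawn."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # need at least one positive and one negative
            stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((1 - level) / 2 * (n_boot - 1))]
    upper = stats[int((1 + level) / 2 * (n_boot - 1))]
    return lower, upper


# Toy data: 1 = ExWAS significant gene, 0 = other gene at the same loci.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.6, 0.1]

baseline_auprc = sum(labels) / len(labels)  # proportion of positives
roc_area = auroc(labels, scores)
pr_area = average_precision(labels, scores)
roc_ci = bootstrap_ci(auroc, labels, scores)
```

Because the AUPRC baseline equals the proportion of positives, it changes with each trait and score, whereas the AUROC baseline stays at 0.50.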

Causal and drug target gene definitions

Causal and drug target genes (Table S8) were obtained from a prior study (Forgetta et al. 2022). Briefly, causal genes were identified by clinician scientists through manual inspection of the Human Disease Ontology database (Schriml et al. 2019) for relevant ontological terms. The relevant OMIM linkage information was then used to obtain a list of genes associated with the diseases. Drug target genes were identified by clinician scientists, who first compiled a list of guideline-approved medications from UpToDate (https://www.uptodate.com/contents/search). This information was then linked to DrugBank (Wishart et al. 2018) to identify a list of drug targets.

Results

High scoring genes across all metrics are enriched in ExWAS significant genes

To quantify the performance of the four metrics, we chose nine phenotypes (T2D, Bilirubin, eBMD, height, hypothyroidism, LDL-C, RBC, calcium level, and triglyceride level) for which there were large sample sizes in UK Biobank, since the largest sets of ExWAS data are generally derived from this resource. Further, these diseases and traits represent metabolism, ageing, autoimmune disease, anthropometric, hematologic and lipid outcomes, providing a broad spectrum of pathogenic mechanisms for study. GWAS and ExWAS results were obtained from publicly available resources or recent publications (see Methods).

We first evaluated the difference in the distributions of the four prioritization scores (Ei, PoPs, L2G and ABC) between ExWAS significant and non-significant genes. ExWAS significant genes were defined as those having a p value below the exome-wide significance threshold of 3.6 × 10⁻⁷ for any of the predefined stringent set of variants using burden testing (Table S3 and S9). A burden test assesses whether a specific category of variants in a particular gene is more common among the cases than the controls (Curtis 2019). Different categories or combinations of categories (e.g., predicted loss of function (pLOF), pLOF + missense, synonymous, etc.), commonly referred to as “masks”, are tested. It is also common to calculate different variations of rare variant association tests (i.e., SKAT, SKAT-O, standard gene-based burden tests) that differ in their assumptions about the effect of the variants on gene function (Lee et al. 2012). This gene-based collapsing approach, in which the effects of multiple variants are aggregated, is frequently used in ExWAS because sample sizes at the current biobank scale are often not large enough to reliably describe the effect of individual rare coding variants (Curtis 2019). For all four metrics, the distribution of the untransformed values was higher in the ExWAS significant gene category than in the non-significant category (Figs. 1, S1). The range of possible values, and therefore the distribution, differs between the metrics. For example, Ei and L2G are bounded between 0 and 1, whereas the others are not. Despite the difference in value distributions, for each metric, ExWAS significant genes tended to have higher scores than ExWAS non-significant genes.

Fig. 1

Distributions of the untransformed prioritization metric between ExWAS significant and non-significant genes for Ei, ABC, L2G, and PoPs, together with the p values for the differences between these groups. The differences in score distribution between the two categories were determined using the Mann-Whitney test with all data points; however, outliers (defined as values outside 1.5 × the interquartile range) are removed for clarity of presentation. Only genes residing at GWAS loci which harboured ExWAS significant genes were considered. Only the lowest p value from burden tests based on one of the previously defined ‘stringent masks’ (Table S3) was considered for determining ExWAS significance status.

We then evaluated the difference in ExWAS p value distribution between different gene prioritization metric percentiles. In all cases, the p values were significantly lower in the top percentiles than in the bottom percentiles (Fig. 2). We also observed that the ExWAS p values became lower across increasing quintiles of the gene prioritization metrics (Fig. S2). Taken together, these results suggest that the four gene prioritization metrics tend to assign higher scores to genes that have lower ExWAS p values.
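
For readers unfamiliar with the Mann-Whitney test used for these comparisons, a minimal sketch is given below. It uses mid-ranks for ties, a tie-corrected variance and a normal approximation without continuity correction; these are assumptions made for brevity, since the exact software and options used in the study are not specified, and statistical packages may return slightly different p values for small samples. The two input groups are toy values.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate two-sided p value). Tied values
    receive mid-ranks and the variance includes the standard tie correction.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1  # advance past a block of tied values
        mid_rank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, p_value


# Toy comparison: scores of "significant" genes versus the remaining genes.
group_a = [0.9, 0.7, 0.6]
group_b = [0.8, 0.4, 0.3, 0.2, 0.1]
u_stat, p_value = mann_whitney_u(group_a, group_b)
```

Dividing the U statistic by the product of the two group sizes gives the probability that a value from the first group exceeds a value from the second, which is the same quantity as the AUROC.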

Fig. 2

Distribution of ExWAS burden test p values between the prioritization metric percentiles. The top 1% vs 99% (blue), 5% vs 95% (orange), and 10% vs 90% (purple) are shown. The difference in ExWAS p value distribution between the percentiles was determined using the Mann-Whitney test with all data points, and the resulting p values are shown. Outliers (defined as values outside 1.5 × the interquartile range) are omitted for clarity of presentation. Only p values from burden tests based on one of the previously defined ‘stringent masks’ (Table S3) are considered.

Gene prioritization metrics scores are positively correlated with ExWAS significance

Next, we estimated the odds that a gene will reach the ExWAS significance threshold per unit increase in each gene prioritization metric. We assessed only genes that resided at GWAS loci that also harboured at least one ExWAS significant gene. We normalized each prioritization score (i.e., scaled it to have zero mean and unit variance) and fitted a logistic regression model using each of the normalized prioritization scores as the predictor and ExWAS status as the outcome. Overall, we tested the associations of each score in each phenotype separately and in all phenotypes combined, resulting in 29 associations. Significance status was determined based on the 99.8% confidence intervals, which correspond to the Bonferroni-corrected type I error of 0.002 (i.e., 0.05/29 associations (Table S10)). Using this Bonferroni-corrected confidence interval maintains the family-wise error rate at 0.05 (Dunn 1961). We observed that all scores had odds ratios significantly higher than the null (Fig. 3). The largest odds ratio was observed for Ei, where a one-unit increase in the scaled Ei score corresponded to a 4.6-fold (99.8% confidence interval (99.8% CI) 3.39–6.34) increase in the odds of a gene reaching exome-wide significance. The lowest odds ratio (though still significantly greater than one) was for ABC, where we observed an OR of 1.3 (95% CI 1.09–1.6).
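
The analysis described above can be sketched with a single-predictor logistic regression fitted by Newton-Raphson, followed by a Wald interval at the Bonferroni-corrected level. This is a minimal illustration, not the authors' code: the fitting routine, the use of a Wald interval and the toy scores and outcomes are all assumptions.

```python
import math
from statistics import NormalDist


def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]


def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return (b0, b1, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # gradient with respect to b0
            g1 += (yi - p) * xi    # gradient with respect to b1
            h00 += w               # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Standard error from the inverse information matrix at (near) convergence.
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Toy data: raw prioritization scores and ExWAS significance status (1/0).
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
status = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]

n_tests = 29
alpha = 0.05 / n_tests                   # Bonferroni-corrected type I error
z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the ~99.8% CI

_, slope, se = fit_logistic(standardize(raw_scores), status)
odds_ratio = math.exp(slope)             # odds ratio per 1 SD increase in score
ci = (math.exp(slope - z * se), math.exp(slope + z * se))
```

Because the predictor is standardized, the odds ratio is expressed per standard deviation of the score, which makes metrics with different ranges comparable on one scale.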

Fig. 3

Odds ratios obtained from logistic regression using ExWAS significance status as the outcome and each normalized prioritization score as a predictor. This was assessed only for genes found at GWAS loci which harboured ExWAS significant genes. Data were combined across all traits. The dotted line represents no effect (i.e., an odds ratio of 1). The 99.8% confidence interval (corresponding to a Bonferroni-corrected type I error of 0.002 = 0.05/29 tests) is shown. Only p values from burden tests based on one of the previously defined ‘stringent masks’ (Table S3) are considered for determining ExWAS significance.

The same trend was obtained when ExWAS significance status was determined using the lowest burden test p value across all masks rather than only the tests with the most stringent masks as defined above (Table S3, S11). Ei and L2G had the largest odds ratios per unit increase in the scaled score, followed by PoPs and ABC; however, the odds ratios for all scores decreased (Table S11).

Gene prioritization metrics can accurately classify ExWAS significant genes

Next, we evaluated the ability of each metric to classify ExWAS significance status for each gene based on the same definition as above. We focused only on genes that were found at GWAS loci which also harboured an ExWAS significant gene. The ExWAS significant gene was denoted as the positive control, and we calculated the area under the receiver operator curve (AUROC) and the area under the precision recall curve (AUPRC) for each gene prioritization metric. Significance of AUROC and AUPRC was evaluated using the 99.9% confidence interval based on 2000 bootstrap replicates. The AUPRC was highest for Ei, followed by L2G, PoPs and ABC (Fig. 4A); with the exception of ABC, all were significantly higher than baseline based on the 99.9% confidence interval (Table S6). A similar trend was seen for AUROC: the 99.9% CIs for all metrics but ABC were above the baseline of 0.50, which corresponds to the AUC of an uninformative classifier (Fig. 4B). The best AUROCs were achieved using Ei and L2G (AUROC: 0.88 (99.9% CI 0.83–0.94), 0.83 (0.76–0.88), respectively). We note that the AUPRC is more informative than the AUROC here because there are many fewer positive control genes than other genes at the GWAS loci. The performance of all metrics decreased when ExWAS significance status was determined based on the lowest p value across all masks, much like what was observed above; however, the relative performance of each metric remained similar (Fig. S3, Table S7).

Fig. 4

Performance curves for all four gene prioritization metrics. The precision–recall curve (PRC) (a) and receiver operator curve (ROC) (b) are shown for all four metrics. The area under the curve (AUC) for each metric is shown, with the largest AUC in bold. 99.9% confidence intervals were evaluated using 2000 bootstrap replicates. Target genes were defined as genes with an ExWAS burden test p value less than the exome-wide significance threshold of 3.6 × 10⁻⁷. Only p values from burden tests based on one of the previously defined ‘stringent masks’ (Table S3) are considered for determining ExWAS significance.

We also noted that the ExWAS significant genes that are also known drug targets or are causal for the disease were almost always among the 10 highest scoring Ei and L2G genes within their corresponding GWAS locus (24/26 and 24/26, respectively) (Table S12). This suggests that it is possible to further rank ExWAS significant genes using Ei and L2G to obtain the most probable causal genes.
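
The within-locus ranking check can be illustrated as below. The locus names, gene names and scores are hypothetical, and the cutoff is lowered to 2 in the toy example only because each toy locus has three genes; the text above uses the 10 highest scoring genes per locus.

```python
def rank_within_locus(scores_by_gene):
    """Return {gene: rank}, with rank 1 for the highest score in the locus."""
    ordered = sorted(scores_by_gene, key=scores_by_gene.get, reverse=True)
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def count_in_top_k(loci, reference_genes, k=10):
    """Count reference genes ranked within the top k of their own locus.

    Returns (number in the top k, number of reference genes found at any locus).
    """
    hits = total = 0
    for scores_by_gene in loci.values():
        ranks = rank_within_locus(scores_by_gene)
        for gene, rank in ranks.items():
            if gene in reference_genes:
                total += 1
                hits += rank <= k
    return hits, total


# Hypothetical prioritization scores per locus and a hypothetical reference set
# (e.g. known causal or drug target genes that are also ExWAS significant).
loci = {
    "locus_1": {"G1": 0.9, "G2": 0.5, "G3": 0.1},
    "locus_2": {"G4": 0.2, "G5": 0.8, "G6": 0.6},
}
reference = {"G1", "G4"}
hits, total = count_in_top_k(loci, reference, k=2)
```

Here G1 ranks first at its locus while G4 ranks third at its locus, so one of the two reference genes falls within the top 2.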

Discussion

Identifying drug targets with human genetic evidence will likely accelerate drug development programs. While ExWAS is able to precisely identify genes and sometimes provide direction of effect, it suffers from a high false negative rate due to low statistical power. GWAS, on the other hand, has high sensitivity for detecting loci that harbour causal genes, but resolving associated loci to such genes is non-trivial. Here we have tested whether algorithms designed to prioritize genes at GWAS loci can predict ExWAS findings. Overall, these metrics perform reasonably well, though with a high degree of variability, and can therefore be used to prioritize genes at GWAS loci that do not harbour an ExWAS significant gene, increasing the number of targets available to accelerate genetics-based drug development.

All metrics have good performance, with the area under the receiver operator curve (AUROC) and the area under the precision recall curve (AUPRC) reaching upwards of 0.88 and 0.52, respectively, for Ei. Across all traits, we found that Ei, PoPs, and L2G perform better than ABC in their ability to identify likely ExWAS hits. We found that all gene prioritization metrics can significantly enrich for ExWAS significant genes and are positively correlated with ExWAS significance, where a one-unit increase in the normalized score was associated with a 1.3–4.6-fold increase in the odds of the gene reaching exome-wide significance.

ABC was originally developed for identifying enhancer-gene relationships (Fulco et al. 2019) and thus may not be optimized to identify causal genes at GWAS loci. However, as it is well established that many non-coding variants can influence phenotypes through interactions with regulatory elements (Stranger et al. 2007; Nicolae et al. 2010; Zhang and Lupski 2015), it is expected that ABC is capable of identifying causal genes for which a gene-enhancer relationship exists, consistent with prior studies (Weeks et al. 2020; Nasser et al. 2021). PoPs, Ei and L2G were all developed with the goal of identifying likely causal genes, and they do so by aggregating information from different sources. As a result, the differences in performance between these metrics are influenced, at least in part, by the features used in their models. For example, Ei used 154 features whereas L2G used 57. Ei focused more on locus and variant annotations, such as GWAS p values and variant functional annotations, whereas L2G has a stronger focus on functional data such as pQTL and eQTL colocalization results. Both metrics also shared several features, such as distance to transcription start sites, number of genes within loci, and DNase hypersensitivity sites. The difference between Ei and L2G scores is then likely due to the inclusion of data such as colocalization results involving pQTL and eQTL information. PoPs differs from both Ei and L2G in that it uses other omic measurements, such as protein and transcript levels, to predict gene–trait associations defined by MAGMA (de Leeuw et al. 2015) results.

It is also worth noting that performance decreased across all gene prioritization metrics when the most relaxed ExWAS masks were used. This may be because relaxed masks include more variants, which can result in noisier signals. Genes that reach the significance threshold only under these relaxed masks may be less likely to be causal. This trade-off between signal and noise has previously been shown for ExWAS results, where burden tests based on masks that additionally filtered missense variants using a missense intolerance score had increased power (Wang et al. 2021).

We found that Ei and L2G consistently rank known causal genes and/or drug targets among the 10 highest scoring genes within their respective loci. This suggests that it is possible to apply these GWAS-based metrics to ExWAS results and further rank ExWAS significant genes to prioritize the most likely causal gene. This is consistent with a recent study which showed that around 75% of rare coding variants identified via ExWAS are located within 1 megabase of GWAS lead variants and that these two types of variants act independently on the trait (Backman et al. 2021). This underscores the benefit of a multifaceted approach to drug target identification, where multiple lines of evidence can be used to further reduce the search space for causal genes and identify the most probable candidate gene for downstream analysis.

This study has limitations. The evaluation of the metrics is limited to the nine traits that were assessed. PoPs had one trait missing (Bilirubin) while ABC missed two (Bilirubin and T2D). As such, the evaluation of these two scores was limited to the remaining traits. However, the trend observed in performance remains the same when looking at individual trait analyses (Figs. S4–S6). In addition, the metrics evaluated in this study were not originally trained using ExWAS results as target genes; rather, they were trained based on either heritability-based metrics, known drug targets, or known gene prioritization metrics such as distance-to-closest gene. As such, the performance of all metrics will likely improve by re-training the models using ExWAS significant genes as target genes. In contrast to L2G, PoPs, and ABC, the Ei was developed using the traits evaluated here; however, the Ei models were trained using a leave-one-out approach to avoid overfitting. Although this approach limits the scalability of Ei, as an individual model is required for each trait, it is an effective method for removing overlap between training and testing data and should allow for an accurate assessment of performance. Due to the limited number of traits used in training the Ei models, another potential source of overfitting is genetic correlation between traits. However, LD score regression (Bulik-Sullivan et al. 2015) showed that most genetic correlations between traits are either non-significant or low (genetic correlation < 0.3) (Table S13). Furthermore, the performance of Ei for eBMD, which has low genetic correlations with all traits, is among the highest (Table S6, S7). Taken together, this suggests that the bias in the Ei models due to genetic correlations is low, if present at all. Recently, another disease gene prioritization metric, named combined SNP-to-gene (cS2G), was proposed (Gazal et al. 2022). cS2G prioritizes causal genes through the linear combination of individual SNP-to-gene strategies (e.g., eQTL p values, promoter information, etc.). However, as this method was evaluated using a different definition of precision and recall, and the data are restricted to the subset of genes that satisfied a specific cS2G threshold (cS2G > 0.5) (Gazal et al. 2022), it cannot be easily compared to the other methods in this study (i.e., Ei, L2G, PoPs, and ABC) using the same evaluation metrics and is, therefore, not included.

It has been shown that combining gene prioritization metrics that use different features in their models, through a consensus or linear combination approach, allows for improved performance (Weeks et al. 2020; Gazal et al. 2022). Whether more sophisticated machine learning approaches can more effectively aggregate the information in all four metrics remains to be investigated once a larger set of traits with ExWAS, GWAS, Ei, PoPs, L2G and ABC results is available.

In conclusion, this study demonstrates that gene prioritization metrics based on GWAS results can identify ExWAS significant genes with reasonable accuracy, highlighting another potential use of the abundance of GWAS data currently available. This suggests that for traits where well-powered ExWAS is not currently feasible, these gene prioritization metrics can serve as promising alternatives to identify genes that would have been found through a well-powered ExWAS. It is also possible to further rank ExWAS significant genes using these prioritization metrics to identify the most likely causal genes among all ExWAS significant genes. Furthermore, such methods could also be applied to diseases with existing ExWAS to help identify causal genes at GWAS loci where power was insufficient in the ExWAS. Because these methods and ExWAS aim to address the same issue (i.e., identifying causal genes) using orthogonal data types, they can serve as independent sources of evidence, allowing for more confident identification of causal genes to accelerate the delivery of medicines to the clinic.