Introduction

Genome-wide association scans (GWAS) and meta-analyses combining information from multiple GWAS datasets have successfully identified common DNA sequence variants (single nucleotide polymorphisms, SNPs) associated with diseases, quantitative traits, and complex phenotypes [1,2,3,4,5]. The number of participants represented in meta-analyses has increased at an exponential rate since their introduction [6], with recent datasets in atrial fibrillation including >1,000,000 participants [7], although the largest meta-analysis in lipids includes 300,000 multiethnic participants [8] and the largest single cohort study exceeds 390,000 participants [9]. Meta-analyses have expanded our knowledge of specific genes and pathways influencing lipid levels, whose highly polygenic heritability pattern is shown by the hundreds of associated loci identified to date [10]. While most of the variation in lipid levels within the general population is due to polygenic variation, single protein-altering variants in known lipid genes can confer extreme lipid levels, generally referred to as dyslipidemias when observed in patients [11].

Technological advances in DNA sequencing have made the interrogation of protein-coding regions of the genome (the exome) more broadly utilized. Exome sequencing has been useful for identifying Mendelian forms of disease [12, 13], although its utility is limited for complex human phenotypes, including lipids, because high impact mutations are expected to be rare in the general population and because higher costs lead to smaller sample sizes in sequencing studies than in studies using other technologies. Several previous reviews of lipid genomics have been published, including in 2015 [14] and 2018 [15]; these highlight lipid gene identification using GWAS and exome sequencing approaches, and emphasize a data-driven approach to therapeutic drug target development. This review focuses on the foundational evidence from the last two decades of lipid genetics, while also illustrating the current status of recent computational approaches for transethnic GWAS, fine-mapping, transcriptome informed fine-mapping, and disease prediction (Fig. 1). Novel genetic insights derived from these methods may provide new plausible candidate genes for drug development, empower disease prediction for earlier identification of high-risk individuals, inform clinical practice for preventative health care, and suggest directions for future research of population level lipid variation in diverse populations.

Fig. 1

A workflow for drug discovery. This diagram demonstrates a general workflow for progressing from variant-trait associations to drugs and therapies. Under (1) target discovery, GWAS loci are refined through multiomics approaches, genetic fine-mapping, and WES rare variant analysis. The resulting loci represent potential targets for (2) drug development. Once drugs are developed, identifying at-risk individuals through polygenic risk scores for (3) disease prevention and treatment helps ensure that drugs and therapies are administered to the individuals at highest disease risk

Single variant association discovery

Because of their enhanced statistical power, GWAS meta-analyses of large samples capture common variants with small to moderate genetic effects [16]. In 2010, the Global Lipids Genetics Consortium (GLGC) [2] meta-analyzed 46 lipid GWASs comprising over 100,000 participants of European ancestry. Findings revealed 59 previously unreported genome-wide significant loci across four lipid traits: total cholesterol (TC), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), and triglycerides (TG). A follow-up association test of the meta-analysis lead SNPs in Europeans with coronary artery disease (CAD, n = 24,607) and without CAD (n = 66,197) identified four loci (IRS1, C6orf106, KLF14, and NAT2) significantly associated with CAD and with decreased HDL-C and increased TG levels, indicating potential targets for preventative therapies. Quantitatively combining multiple GWAS cohorts can uncover associated signals not detected by single cohort GWASs, including variants with potential causal effects and candidate loci for disease preventative drugs and therapies.

GWAS meta-analyses within cohorts of the same ancestry have proven to be beneficial in the discovery of common mutations with modest contributions to trait heritability. However, GWAS restricted to single or closely related ancestries identifies only a subset of causative variants. GWAS meta-analyses that target a single ancestry often fail to capture low frequency variants specific to other ancestry groups, which may make significant contributions to complex trait heritability [17] or may point to new therapeutic targets. A recent trans-ethnic analysis of blood lipids and associated traits was conducted on participants across three distinct ancestries: non-Hispanic whites (n ~ 216,000), non-Hispanic blacks (n ~ 57,000), and Hispanics (n ~ 24,000) from the Million Veteran Program (MVP) [8, 18]. GWAS was conducted within each ancestry cohort, and the results were then combined through an inverse variance-weighted fixed effects meta-analysis. The inverse variance-weighted fixed effects approach assumes that all studies in the meta-analysis share the same true effect size, and it minimizes the variance of the combined estimate by calculating the mean effect size across the GWASs, with each single cohort GWAS weighted by the inverse of its variance. The meta-analysis identified over 46,000 genome-wide significant variants across 188 loci [19]. Subsequent replication in GLGC, followed by conditional analysis combining both MVP and GLGC summary statistics for each lipid trait, revealed 118 novel loci meeting genome-wide significance. These results demonstrate the benefits of performing a transethnic meta-analysis to isolate trait-specific loci [8].
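
As a minimal sketch of the inverse variance-weighted fixed effects calculation described above, the following Python function combines per-cohort effect estimates for a single variant. The function name and the example inputs are illustrative assumptions, not values from the MVP analysis, and the two-sided p value relies on a normal approximation.

```python
import math


def ivw_fixed_effect(betas, ses):
    """Combine per-cohort effect sizes for one variant.

    betas: effect size estimates, one per cohort
    ses:   matching standard errors (all > 0)
    Returns (combined beta, combined SE, z score, two-sided p value).
    """
    if not betas or len(betas) != len(ses):
        raise ValueError("betas and ses must be non-empty and the same length")
    if any(se <= 0 for se in ses):
        raise ValueError("standard errors must be positive")

    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)

    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p value under a standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, z, p


# Hypothetical example: two cohorts with equal precision.
beta, se, z, p = ivw_fixed_effect([0.10, 0.20], [0.10, 0.10])
print(f"beta={beta:.3f} se={se:.4f} z={z:.2f} p={p:.4f}")
```

Cohorts with smaller standard errors (typically larger cohorts) dominate the combined estimate, which is why the combined standard error is always smaller than that of any single contributing cohort.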

Gene-based association discovery

Single cohort GWAS and GWAS meta-analyses are successful at identifying common variants with small to moderate effects on disease risk or quantitative traits. However, these approaches fail to capture the rare coding mutations of moderate to large effect that are integral to explaining the heritability of both Mendelian and complex traits [20]. Whole genome arrays designed to capture common variation are unlikely to include these rare mutations, and reference population samples rarely contain enough occurrences of these mutations to allow them to be imputed confidently into GWAS datasets.

Whole exome sequencing is a critical tool for discovering rare trait-associated variants within coding regions of the genome that are otherwise missed by GWAS arrays and variant imputation. However, traditional single variant tests are generally underpowered to capture rare variant-trait associations. Either large sample sizes or large effect sizes are required for a single variant test to detect rare associations with sufficient power [21]. The former can be difficult to achieve for understudied traits and populations. The latter is typically not observed in polygenic complex traits, where rare variants carry a moderate burden of trait heritability. Unlike familial studies of Mendelian traits, where a single mutation is typically implicated in disease inheritance, whole exome sequencing studies of complex traits may find clusters of mutations within a gene that are each associated with a phenotype. To circumvent the issue of insufficient statistical power for rare coding variants, gene-based burden tests collapse the rare variant counts within a gene into a single value per sample. Combining allele counts of low frequency mutations within a gene provides greater power to associate the gene with a trait than single variant testing of the same rare variants.
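
The collapsing step can be sketched as follows. This is a simplified illustration, assuming an unweighted burden (a plain sum of rare minor-allele counts per sample) and a simple linear regression of a quantitative trait on that burden; the function names, the 1% frequency threshold, and the toy data are assumptions made for the example, and real analyses would also adjust for covariates.

```python
import math


def burden_scores(genotypes, mafs, maf_threshold=0.01):
    """Collapse rare variants in one gene into a single count per sample.

    genotypes: one list per sample of minor-allele counts (0, 1, or 2),
               ordered by variant
    mafs:      minor allele frequency of each variant, same order
    """
    rare = [j for j, maf in enumerate(mafs) if maf <= maf_threshold]
    return [sum(sample[j] for j in rare) for sample in genotypes]


def linear_burden_test(burden, phenotype):
    """Regress a quantitative trait on the burden score.

    Returns (slope, standard error of the slope). The ratio slope / SE
    can be compared against a t distribution with n - 2 degrees of freedom.
    """
    n = len(burden)
    if n < 3 or n != len(phenotype):
        raise ValueError("need at least 3 samples with matching phenotypes")
    mean_x = sum(burden) / n
    mean_y = sum(phenotype) / n
    sxx = sum((x - mean_x) ** 2 for x in burden)
    if sxx == 0:
        raise ValueError("no variation in burden; no rare allele carriers differ")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(burden, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(burden, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


# Toy data: 4 samples, 3 variants in one hypothetical gene; the third
# variant is common (MAF 20%) and is excluded from the burden.
genotypes = [[0, 0, 1], [1, 0, 0], [0, 1, 2], [0, 0, 0]]
mafs = [0.005, 0.002, 0.20]
burden = burden_scores(genotypes, mafs)
slope, se = linear_burden_test(burden, [1.0, 2.0, 2.5, 1.5])
print(burden, slope, se)
```

Because the two rare variants are each carried by only one sample, neither could be tested on its own with any power, whereas the combined burden has two carriers.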

Burden tests are particularly powerful for genes with allelic heterogeneity, where a greater proportion of alleles in a gene are causative for the trait or disease [22]. The NHLBI Exome Sequencing Project leveraged whole exome sequencing samples from >2000 participants, including cohorts with extremely high (n = 307) and extremely low (n = 247) LDL-C levels, to identify rare variants within LDL-C-associated genes [23]. Gene-based burden tests were run across various allele frequency thresholds, from ultra-rare (MAF ≤ 0.1%) to more common (MAF ≤ 5%), and across different groupings of mutations based on variant effect classifications, including known and predicted missense and loss of function mutations. These tests revealed significant associations between LDL-C levels and three well-known lipid genes: PCSK9, LDLR, and APOB. In comparison, single-variant tests of common mutations in the same study revealed a significant association only with APOE. Collapsing rare and potentially damaging variants into one gene-level signal enables discovery of associated genes otherwise missed by traditional single variant tests. The identification of PCSK9, LDLR, and APOB by gene-based burden tests validates the findings of dyslipidemia family studies, and these genes serve as prime drug targets for lowering LDL-C levels in participants with dyslipidemia traits [24]. PCSK9 is a target for heterozygous familial hypercholesterolemia treatments [25], whereas LDLR and APOB are targets for homozygous familial hypercholesterolemia therapies [26,27,28]. PCSK9 also serves as a target for lowering atherosclerotic cardiovascular disease risk. This exemplifies how fine-tuned discovery of gene-trait associations can result in actionable drug targets for treatment and prevention (Table 1).

Table 1 Summary of known lipid drug targets validated by rare variant burden tests

Refining single variant associations

GWAS meta-analyses have identified 167 [33, 34] lipid loci, which nevertheless account for only approximately 20% [34, 35] of trait variation in the populations studied (less than 50% of the estimated trait heritability). One hypothesis is that the missing heritability is due to the polygenic heritability pattern of lipid traits, which suggests that many genetic loci with small effect sizes remain to be found by increasing the sample size of GWAS meta-analyses. Another hypothesis proposes that some of the lipid loci have multiple independent associations in close proximity, which are not considered by the standard approach of defining an associated locus or genetic signal based on physical distance [35]. Finally, some have hypothesized that trait heritability has been overestimated [36].

Fine mapping is one way to find additional associations and to pinpoint the causal genes in the established lipid loci. Fine mapping involves taking the local linkage disequilibrium into account and statistically estimating which variants are the most probable causal variants for the studied trait [37]. In a single cohort GWAS, secondary signals can also be found using formal conditional analysis, in which the association test in a locus is adjusted for the lead SNP association. Multiple studies testing different lipid loci have shown secondary associations [34, 38, 39] in the coding region of the genome, providing a direct link to the biological mechanism of the observed association and therefore illuminating potential drug targets.
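
One common way to carry out the statistical estimation step is to convert summary statistics into approximate Bayes factors and normalize them into posterior probabilities of causality. The sketch below assumes exactly one causal variant in the locus and a normal prior on the effect size with standard deviation 0.2; both are simplifying assumptions chosen for illustration and are not taken from the studies cited above. All function names and variant labels are hypothetical.

```python
import math


def log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor for association versus no association.

    Uses the form sqrt(V / (V + W)) * exp(z^2 / 2 * W / (V + W)),
    where V is the sampling variance and W the prior variance.
    """
    v = se ** 2
    w = prior_sd ** 2
    shrink = w / (v + w)
    z = beta / se
    return 0.5 * math.log(1.0 - shrink) + 0.5 * z * z * shrink


def posterior_probabilities(betas, ses, prior_sd=0.2):
    """Posterior probability that each variant is the single causal variant."""
    logs = [log_abf(b, s, prior_sd) for b, s in zip(betas, ses)]
    top = max(logs)
    # Subtract the maximum before exponentiating to avoid overflow.
    unnormalized = [math.exp(value - top) for value in logs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def credible_set(variant_ids, probabilities, coverage=0.95):
    """Smallest set of top-ranked variants whose probabilities reach coverage."""
    ranked = sorted(zip(variant_ids, probabilities), key=lambda t: t[1], reverse=True)
    chosen, cumulative = [], 0.0
    for variant_id, prob in ranked:
        chosen.append(variant_id)
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical locus with three variants in linkage disequilibrium.
ids = ["variant_A", "variant_B", "variant_C"]
probs = posterior_probabilities([0.12, 0.11, 0.03], [0.02, 0.02, 0.02])
print(dict(zip(ids, probs)), credible_set(ids, probs))
```

Variants in tight linkage disequilibrium with the causal variant have similar test statistics and therefore share posterior probability, which is why credible sets shrink when linkage disequilibrium is weaker or sample sizes are larger.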

Identifying relevant genes and pathways associated with noncoding mutations can be achieved with tools such as DEPICT [40] or the Polygenic Priority Score [41]. DEPICT assigns likely causal genes and enriched biological pathways to associated loci and highlights tissues and cell types where causal genes are highly expressed. The Polygenic Priority Score identifies causal genes by integrating GWAS summary statistics with gene expression, biological pathway, and protein-protein interaction prediction data. The latter was successful in prioritizing over 8400 gene-trait associations across 113 complex traits with greater than 75% precision, including correctly identifying a previously discovered association between SORT1 and LDL-C.

It has also been suggested that using samples from different ancestries could identify the true causal variants underlying the association. Trans-ethnic meta-analysis of GWASs accounts for differences in linkage disequilibrium and heterogeneity of allelic effects and frequencies across diverse populations. Assuming a shared causal variant between ancestry groups, the surrounding variants in linkage disequilibrium with the causal variant may differ slightly between ancestries. By taking these slight differences into account in the meta-analysis with proper modeling, rather than excluding the minority ancestries in the analysis to avoid biases, scientists gain statistical power to identify the underlying putative causal variant. Utilization of diverse populations increases fine mapping resolution of the complex trait loci and further isolates the true genetic architecture of the underlying trait [42].
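
Heterogeneity of allelic effects across ancestry groups can be quantified before choosing a meta-analysis model. The sketch below computes Cochran's Q and the I² statistic for one variant from per-ancestry effect estimates. It is offered as one standard heterogeneity summary and is an assumption of this example, not necessarily the modeling used in the studies cited above; the function name and inputs are hypothetical.

```python
def cochran_q(betas, ses):
    """Cochran's Q and I^2 for one variant across ancestry groups.

    betas: per-ancestry effect size estimates
    ses:   matching standard errors (all > 0)
    Returns (Q, degrees of freedom, I^2 as a fraction between 0 and 1).
    """
    if len(betas) < 2 or len(betas) != len(ses):
        raise ValueError("need at least two groups with matching standard errors")
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)

    # Q sums the weighted squared deviations from the pooled estimate.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2 is the share of variation beyond what chance alone would produce.
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, df, i_squared


# Hypothetical estimates from two ancestry groups.
print(cochran_q([0.10, 0.30], [0.10, 0.10]))
```

Under the hypothesis of a shared effect, Q follows approximately a chi-square distribution with the returned degrees of freedom, so a large Q relative to that distribution suggests that effect sizes differ between groups.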

Epigenetic features play an important role in lipid genomics and in understanding the tissue-specific expression of lipid-associated genes. Recent efforts have been made to elucidate the function of long noncoding RNAs (lncRNAs). LncRNAs are transcribed RNA molecules greater than 200 nucleotides in length that do not encode protein. They play a role in regulating gene transcription and posttranscriptional modifications and are largely tissue-specific in nature. In the case of lipid metabolism, lncRNA-mediated regulation occurs primarily within liver and adipose tissues [43]. Previous reviews characterize the role of lncRNAs in cholesterol synthesis and metabolism [44] and diseases associated with lncRNA-mediated cholesterol dysregulation [45], including atherosclerosis, hypoalphalipoproteinemia (low HDL-C), myocardial infarction, and nonalcoholic fatty liver disease [46]. LncRNAs are prime drug and therapy targets because of their role in tissue-specific gene regulation. Other well-studied epigenetic features and their role in lipid-associated gene expression, such as DNA methylation, histone modification, and chromatin accessibility, are highlighted in other published reviews [47,48,49].

Machine learning and deep learning methods have been applied both to predicting deleterious coding mutations and to prioritizing likely functional non-coding mutations. Predictive models such as CADD [50], PolyPhen [51], and SIFT [52] indicate a given mutation's impact on protein function. These models compile ancestral conservation data, epigenetic information, functional predictions (e.g., amino acid changes), and genetic content to predict the likelihood that a mutation is deleterious. Other models, including RegulomeDB [53] and DeepSEA [54], highlight functional mutations in noncoding regions of the genome. These models compile data from chromatin profiles, transcription factor binding sites, and DNase hypersensitivity sites to predict the likelihood that a mutation has a functional impact on the expression of target genes.

Refining associations with transcriptomics

Single cohort GWAS or GWAS meta-analysis identifies trait-associated loci; however, it is difficult to elucidate biological pathway effects from GWAS associations alone. Transcriptome-wide association studies (TWASs) provide insight into variant effects on gene expression and uncover gene-trait interactions within GWASs [55]. TWASs model the effect of variant alleles on nearby gene expression using a reference panel of genotypes and associated expression levels (sourced from public repositories; e.g., the GTEx project) [56]. This model is then used to infer gene expression for participants within the GWAS cohort. The inferred expression levels can then be tested statistically for association with the target GWAS trait. The resulting associations identify genes potentially relevant to the trait under investigation (Fig. 2). A common method for conducting TWASs is summary data-based Mendelian randomization (SMR), which integrates GWAS summary statistics with eQTL data to identify differentially expressed loci associated with complex disease [57]. This circumvents the common problem that the full genotype data needed to develop well-powered association tests are unavailable. GWAS2Genes serves as a public database compiling SMR gene-phenotype associations for multiple traits, tissues, and genes. However, a TWAS association does not prove or disprove the causality of a given gene; rather, TWAS methods highlight sets of candidate causal genes that warrant further examination.
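
The core SMR calculation can be sketched from summary statistics alone. The example below assumes the single-instrument form of the method, in which the top cis-eQTL SNP (z) is used to estimate the effect of expression (x) on the trait (y) as a ratio of the two SNP effects, with an approximate chi-square test statistic on one degree of freedom. The function name, argument names, and numbers are illustrative assumptions, and the sketch omits the follow-up heterogeneity test that is used to separate a shared causal variant from linkage.

```python
import math


def smr_estimate(b_zx, se_zx, b_zy, se_zy):
    """Single-instrument SMR from eQTL and GWAS summary statistics.

    b_zx, se_zx: effect of the SNP on gene expression (eQTL study)
    b_zy, se_zy: effect of the same SNP on the trait (GWAS)
    Returns (effect of expression on trait, test statistic, p value).
    """
    if b_zx == 0 or se_zx <= 0 or se_zy <= 0:
        raise ValueError("need a non-zero eQTL effect and positive standard errors")
    z_zx = b_zx / se_zx
    z_zy = b_zy / se_zy

    # Ratio estimate of the expression-on-trait effect.
    b_xy = b_zy / b_zx
    # Approximate chi-square (1 df) statistic combining both z scores.
    t_smr = (z_zx ** 2 * z_zy ** 2) / (z_zx ** 2 + z_zy ** 2)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(t_smr / 2.0))
    return b_xy, t_smr, p


# Hypothetical SNP: strong eQTL (z = 10) and moderate GWAS signal (z = 5).
print(smr_estimate(b_zx=0.5, se_zx=0.05, b_zy=0.1, se_zy=0.02))
```

The test statistic can never exceed the smaller of the two squared z scores, so a gene is only highlighted when the instrument SNP is convincingly associated with both expression and the trait.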

Fig. 2

Applying fine mapping and transcriptomics towards gene prioritization. Significantly associated GWAS loci are identified visually from Manhattan plots. Linkage disequilibrium (LD) information is integrated to identify lead causal SNPs. Measurement of gene expression change (eQTL analysis) for each SNP genotype indicates potential trait causative role of a given allele. Combining multiple candidate SNPs in a transcriptome-wide association study (TWAS) implicates sets of causal SNPs and genes for the target phenotype

TWASs can both confirm known gene-trait associations and reveal novel ones. We revisit the multiethnic MVP cohort blood lipids investigation to demonstrate the utility of TWAS in the discovery of novel associations. Four different gene expression reference panels were employed across relevant cell and tissue types: peripheral blood, adipose tissue, liver, and tibial artery tissues. The gene expression profiles derived from these panels were then imputed to predict associated expression within the combined GLGC and MVP GWAS meta-analysis for each lipid trait. In total, 665 gene-lipid associations achieved genome-wide significance within 333 genes. Of note, the 333 genes were contained within 122 genomic loci, of which 5 loci were previously unreported. These novel loci represent genomic regions with potential causal impact on lipid traits that were otherwise missed by traditional variant-trait associations [8]. Further investigation of mutations in noncoding regions of the genome, combining different omics data, could reveal effects of gene regulation on phenotypic variance where known coding mutations fail to adequately explain the variation between individuals and ancestries.

Translating whole genome information into disease prediction

There are several ways in which lipid genetics could impact clinical practice, but most have not yet been realized. Individuals carrying Mendelian dyslipidemia mutations can be identified based on elevated lipids at a young age or a family history of premature coronary artery disease, which would allow for earlier and stronger intervention to lower blood lipids. Testing for Mendelian dyslipidemias is not typically used to screen the population. Challenges with Mendelian testing at scale include the cost of sequencing Mendelian dyslipidemia genes and the difficulty of distinguishing between protein-altering variants that cause disease (pathogenic variants) and those that do not (benign variants).

From a genome-wide perspective, initial discovery efforts were aimed at identifying a large catalogue of lipid genes to enable prioritization of lipid genes for development of new drug therapies. The challenge with this approach is the time and enormous cost required to turn a gene target into a new therapy available to patients. As such, the information produced by genome-wide association studies is not yet applied to clinical practice. This is mainly due to the still ongoing fine mapping efforts (we do not yet know which variants and genes are truly causal), complex biology behind the association (whole gene pathway versus a single gene), and polygenicity (lipids are driven by hundreds or even thousands of genetic loci).

However, an area of intense investigation in the last few years has been the use of an individual's genetic profile to predict their risk of disease, again with the aim of identifying individuals who would benefit from lipid-lowering medications or lifestyle modifications. Genome-wide information on susceptibility to a disease or trait level can be summarized using polygenic risk scores (PRS), in which the estimated effects of the disease or trait risk alleles carried by each patient are summed over the whole genome. There are several widely used methods [58,59,60,61] for selecting variants for the score, and the scores created by different research groups worldwide are publicly available (https://www.pgscatalog.org).
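
The summation that defines a PRS can be sketched in a few lines. The example below assumes an additive score, in which each variant's effect estimate is multiplied by the number of risk alleles a person carries; the variant identifiers, weights, and the helper for flagging the top of the score distribution are hypothetical and are included only to illustrate the calculation.

```python
def polygenic_risk_score(dosages, weights):
    """Additive PRS for one person.

    dosages: dict of variant id -> number of risk alleles carried (0 to 2)
    weights: dict of variant id -> estimated effect of the risk allele
    Variants missing from the person's genotype data are skipped.
    """
    return sum(weights[v] * dosages[v] for v in weights if v in dosages)


def top_fraction(scores, fraction=0.05):
    """Identifiers of the people in the highest fraction of scores.

    scores: dict of person id -> PRS
    """
    if not scores:
        return []
    n_top = max(1, round(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top]


# Hypothetical weights for three variants and genotypes for four people.
weights = {"variant_1": 0.10, "variant_2": -0.05, "variant_3": 0.20}
cohort = {
    "person_1": {"variant_1": 2, "variant_2": 1, "variant_3": 0},
    "person_2": {"variant_1": 0, "variant_2": 0, "variant_3": 2},
    "person_3": {"variant_1": 1, "variant_2": 2, "variant_3": 1},
    "person_4": {"variant_1": 0, "variant_2": 2, "variant_3": 0},
}
scores = {pid: polygenic_risk_score(d, weights) for pid, d in cohort.items()}
print(scores, top_fraction(scores, fraction=0.25))
```

In practice the raw score is usually interpreted relative to a reference distribution, since its absolute value depends on which variants and weights were included.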

The predictive power of these scores is promising, especially for CAD [62], as individuals in the highest 5% of the CAD PRS distribution appear to have as high a risk of heart disease as people who carry a Mendelian mutation that causes familial hypercholesterolemia. Moreover, having a high PRS (5% of the population) is more common than carrying a monogenic mutation (less than 1% of the population), which is promising for the advancement of preventive options. Well-developed PRSs for quantitative risk factors (such as LDL-C) could improve the prediction of disease endpoints. For example, prediction of myocardial infarction was improved by combining the disease PRS with the risk factor or biomarker PRSs [63]. With the increased predictive power, participants in the high-risk group can be identified more accurately. It should be noted that CAD is a complex disease with genetic and environmental factors (and possibly interactions between the two) contributing to the overall risk. Hence, a thorough evaluation of genetic, environmental, and epigenetic factors, together with possible interactions between them, will be needed to account for all possible risk factors in the prediction models. This could potentially be accomplished by using machine learning methods that allow for more complex interplay between risk factors [64]. However, machine learning models of this complexity have not yet been applied to CAD risk prediction.

As genetic information is constant throughout a lifetime, utilizing genetic information would allow earlier prediction of disease susceptibility, paving the way for prevention rather than treatment after the disease has manifested. With regard to CAD, including genetic information in the prediction model increases the predictive power to detect early onset cases [65] that would benefit from targeted prevention in early adulthood (Fig. 3). However, this approach is still in its early stages, as extensive DNA testing would need to be deployed in the clinic to identify at-risk patients for preventive interventions.

Fig. 3

The aim of complex disease risk prediction. This figure demonstrates how to apply genetic risk to clinical practice. Patients complete a routine DNA test during an annual health exam as well as other laboratory-based tests and basic health questionnaires which are linked to the electronic health record. The risk for multiple complex diseases is calculated and reported through an interface using a secured computing environment. Physicians and/or other health care providers communicate results and recommend tailored preventive actions for the patient

There are ongoing efforts to apply genetic risk information to clinical practice. In a Finnish study by Widen et al. [66], CAD risk estimates based on both traditional risk factors and whole genome genetic risk information were returned to study participants, and their lifestyle changes were evaluated after 6 months. Overall, the results showed positive changes, especially in the high-risk group, suggesting that early prevention with lifestyle changes could be possible with the right tools and easy-to-read risk reports for patients. However, there are well-known challenges in implementing these scores in routine clinical practice, in addition to limitations in patient uptake of recommended behavioral or medication changes.

Currently, PRSs are mainly derived from meta-analysis summary statistics that typically come from cohorts in which a substantial fraction of subjects are of European ancestry, which limits their utility for predicting disease in individuals of other ancestries [67]. Multiple reference datasets of haplotype structure and/or genetic variation in diverse ancestries are already available (e.g., HapMap Project [68, 69], 1000 Genomes Project [70], and Haplotype Reference Consortium [71]), but diversity remains limited in datasets that have phenotypic data available for association testing or disease prediction. Additional efforts to build genetic study cohorts with better representation of global ancestries, and methodology to better translate results to other ancestry groups, may decrease the inequity of disease prediction in the coming years.

Another area of active work is developing best practices for, and evaluating the overall impact of, communicating genetic risks to patients. Communicating genetic risk requires health care specialists to be able to explain the implications and preventative possibilities to patients in a comprehensible manner. Additionally, the overall predictive power of these scores is still limited. Currently, the biggest challenge in genome analysis and genome-based prediction is the lack of ancestral diversity in the existing study cohort datasets. A proportion of the missing heritability, and therefore of the lack of predictive power, will most likely be explained by population specific variants in the non-European ancestries currently underrepresented in GWAS datasets. In addition, we need to be able to create and test disease prediction models for diverse ancestries in order to apply genomic information equally to participants across the globe.

Future possibilities with large biobanks

Phenome-wide GWASs have produced some suggestive results for identifying drug targets that are less likely to have unanticipated adverse effects. By extensively testing associations between the identified drug target gene or variant and multiple diseases and traits, these studies aim to predict side effects that may otherwise go undiscovered until expensive clinical trials are conducted and human lives may be impacted [72]. In a phenome-wide GWAS, thousands of traits and diseases are tested for association, instead of only one trait of interest, giving clues about the full range of biological impacts of a single genetic change. These analyses are made possible by the large biobank datasets currently being collected across the world (Biobank Japan, Million Veteran Program, FinnGen, UK Biobank) that combine hospital registries, laboratory measurements, and whole genome information from hundreds of thousands of participants. Integration of transcriptome data and other multiomics approaches can help identify relevant biological pathways and potential targets for new disease therapeutics, as well as avoid creating new medicines that may have unintended negative effects [73]. Large biobank datasets will also allow for testing associations for rarer and population specific genome variations, which may quickly highlight genes as new drug targets. These large datasets with hospital registry data available, some with diverse populations, will also be a powerful tool for creating and testing optimal disease prediction models combining environmental and genetic information.
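
A phenome-wide scan for a single variant can be sketched as a loop over per-trait association results with a multiple-testing correction. The example below assumes that each trait has already been tested and summarized as an effect estimate and standard error, uses a normal approximation for the p value, and applies a Bonferroni threshold; the trait names, numbers, and function names are hypothetical.

```python
import math


def two_sided_p(beta, se):
    """Two-sided p value under a standard normal approximation."""
    return math.erfc(abs(beta / se) / math.sqrt(2))


def phewas_scan(trait_results, alpha=0.05):
    """Flag traits associated with one variant after Bonferroni correction.

    trait_results: dict of trait name -> (beta, standard error)
    Returns a list of (trait, beta, p value), most significant first.
    """
    if not trait_results:
        raise ValueError("no traits to test")
    # Divide the significance level by the number of traits tested.
    threshold = alpha / len(trait_results)
    hits = []
    for trait, (beta, se) in trait_results.items():
        p = two_sided_p(beta, se)
        if p < threshold:
            hits.append((trait, beta, p))
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical results for a variant in a candidate drug target gene.
results = {
    "LDL-C": (-0.30, 0.03),
    "type 2 diabetes": (0.08, 0.02),
    "height": (0.01, 0.02),
}
for trait, beta, p in phewas_scan(results):
    print(f"{trait}: beta={beta:+.2f}, p={p:.2e}")
```

In this toy example the variant lowers LDL-C but is also associated with a second trait in the opposite direction, which is the kind of signal that would prompt closer examination of a potential side effect before drug development.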

Discussion

We summarize the methods currently used for moving from GWAS summary statistics towards likely drug targets and the identification of high-risk individuals for early prevention. While the past two decades have shown an incredible amount of progress, from identifying the first lipid-associated loci using GWAS to uncovering biological mechanisms and drug targets, and more recently translating these discoveries into clinically meaningful predictions, the gap between observing an association and developing safe drugs for preventing atherosclerotic disease is vast. This gap can be partly narrowed by using the available methods for fine mapping, by aggregating high impact rare variant associations into gene level associations, or by layering transcriptomic data on top of the genomic information. Because the number of exome sequenced samples is still somewhat limited for burden tests [74], and the datasets therefore contain few rare variant carriers, geneticists are currently capturing genes that have already been found to be plausible candidate genes for drug development in dyslipidemia family-based studies. However, capturing these genes with the current methodology indicates that the methods are working, and the lack of novel candidate genes may be due to limited statistical power. In the meantime, prediction may be the key to identifying high risk individuals in the general population and implementing preventive approaches to reduce risk, which will hopefully lead to lower health care costs and, more importantly, fewer cardiovascular disease related deaths.