Key Points
  • Genome-wide association studies in large sample sizes have identified, with high confidence, about 20 susceptibility loci for PCOS.

  • Robust susceptibility variants for PCOS have been used in Mendelian randomization studies to identify causes and consequences of PCOS.

  • Identification of PCOS susceptibility genes will expand our understanding of pathways and processes implicated in the syndrome’s etiology, allowing development of new diagnostic and treatment modalities.

The Heritable Basis of PCOS

In recent years the complex genetic architecture of polycystic ovary syndrome (PCOS) has begun to come into focus. Early family aggregation studies focused on the prevalence of PCOS-related traits in the siblings of PCOS cases and provided the first evidence for a genetic basis to the disorder [1,2,3]. These studies suggested an autosomal-dominant mode of inheritance, based on a 51–66% incidence of PCOS-related traits in the first-degree relatives of probands [4, 5]. Larger studies provided further evidence for an autosomal-dominant model of inheritance, with as many as 50% of mothers or sisters, 25% of aunts, and 20% of grandmothers of 250 PCOS probands having either hirsutism alone or hirsutism with oligomenorrhea [6]. Following the initial reports, however, systematic genetic investigations failed to support an autosomal-dominant mode of inheritance; rather, PCOS appears to be inherited as a common complex disorder, with multiple susceptibility loci. A twin study that used a large cohort of more than 3000 Danish twins identified a small number of self-reported PCOS cases (n = 92), with an estimated monozygotic twin correlation for PCOS of 0.72 and a dizygotic correlation of 0.39 [7]. The markedly higher correlation in monozygotic than in dizygotic twins provided strong evidence that there is a significant genetic component to the disease.
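As a rough illustration of how twin correlations translate into a heritability estimate, Falconer's classical approximation, h2 ≈ 2 × (rMZ − rDZ), can be applied to the correlations quoted above. This is a back-of-the-envelope sketch only: it is not the variance-component model fitted in the cited twin study [7], and the function name is illustrative.

```python
def falconer_estimates(r_mz: float, r_dz: float) -> dict:
    """Falconer's approximation from twin correlations (illustrative only)."""
    h2 = 2.0 * (r_mz - r_dz)  # share of variance attributed to additive genetics
    c2 = 2.0 * r_dz - r_mz    # share attributed to shared environment
    e2 = 1.0 - r_mz           # share attributed to unique environment and error
    return {"h2": h2, "c2": c2, "e2": e2}


# Correlations reported in the text: MZ = 0.72, DZ = 0.39
est = falconer_estimates(0.72, 0.39)
print({k: round(v, 2) for k, v in est.items()})
# -> {'h2': 0.66, 'c2': 0.06, 'e2': 0.28}
```

Under this approximation roughly two-thirds of the variance in liability would be attributed to additive genetic effects, which is consistent with the qualitative conclusion drawn in the text.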

Candidate Gene Approaches Revealed an Incomplete Understanding of PCOS Biology

More than 100 candidate genes were studied as potential causal risk genes for PCOS; however, only the region surrounding the gene encoding the insulin receptor, INSR, was replicated in subsequent large, well-powered genome-wide association studies (GWAS) [8]. The initial studies of the region combined linkage and association analyses to identify the microsatellite marker D19S884, located in intron 55 of the fibrillin-3 gene (FBN3), which is 1.3 cM distal to INSR, the candidate gene targeted with this variant [9]. It remains unclear whether the causal gene at this locus is FBN3 or in fact INSR. FBN3 is known to be expressed in the pituitary, but its role there is unknown. Contemporary epigenomic datasets from the ENCODE project [10] provide strong evidence that this microsatellite lies within an active gene regulatory element, but its target remains unknown. Histone modification data indicate likely promoter and/or enhancer activity across the region spanning the microsatellite, with clear cell type-specific histone marks (H3K4me3, H3K27ac, and H3K4me1) in conjunction with open chromatin identified using DNase hypersensitivity site analysis. There is currently no known transcriptional isoform of the FBN3 gene with a promoter position overlapping this microsatellite and active regulatory region, but it is plausible that an isoform with a corresponding promoter and transcriptional start site exists in a cell type not yet comprehensively assayed as part of the ENCODE project. The close proximity of this marker to the INSR gene made it a popular target in candidate gene studies. Seven individual studies identified an association between single nucleotide polymorphisms (SNPs) across the INSR locus and PCOS risk [11,12,13,14,15,16,17,18,19]. Many of these studies included a small number of polymorphisms and modest sample sizes, as did three additional studies that were not able to replicate a significant association between PCOS risk and variants at the INSR locus [11, 14, 20].

Additional candidate gene studies focused on genes with known roles in obesity [21,22,23,24], type 2 diabetes [25,26,27,28,29,30], hormone metabolism and synthesis, and ovarian biology [31,32,33,34,35] did not yield any robust loci for PCOS. These studies were largely hampered by small sample sizes and by small numbers of variants that provided incomplete tagging across each locus and focused on coding regions, which we now know are unlikely to harbor causal variants for complex traits [36].

GWAS Studies in PCOS

High-throughput genotyping platforms have enabled GWAS and facilitated rapid advancement in the understanding of the complex genetic architecture of many common traits. The first GWAS in PCOS reported in 2011 identified three risk loci: at 2p16.3 (LHCGR), 2p21 (THADA), and 9q33.3 (DENND1A) in Chinese PCOS cases and healthy controls [37]. This three-stage study used a modestly sized discovery cohort of 744 PCOS cases and 895 controls in the GWAS, with replication of suggestive risk loci in a two-stage approach in two cohorts: cohort I, 2840 PCOS cases and 5012 controls; cohort II, 498 PCOS cases and 780 controls [37]. A second study, also performed in Chinese PCOS cases and controls, identified an additional eight risk loci: 2p16.3, 9q22.32, 11q22.1, 12q13.2, 12q14.3, 16q12.1, 19p13.3, and 20q13.2 [8]. This study identified a second, independent risk signal at the 2p16.3 locus, implicating both LHCGR and FSHR as potential causal genes in the region. LHCGR and FSHR encode the luteinizing hormone/choriogonadotropin receptor and the follicle-stimulating hormone receptor, which play important roles in hormone signaling in the gonads, making them very plausible susceptibility genes for PCOS. The 2p16.3 region had been the focus of candidate gene studies that profiled only coding variants, without success [20, 38, 39], highlighting the importance of haplotype tagging approaches that include extensive coverage of non-coding variants at gene regions to enable risk locus discovery. The INSR locus at 19p13.3 was discussed above. Additional signals identified in the two Chinese GWAS (THADA and HMGA2 associated with type 2 diabetes [40], RAB5B/SUOX associated with type 1 diabetes [41]) are near genes from insulin and glucose metabolism pathways, supporting the importance of insulin resistance and metabolic disturbance in PCOS [42]. 
Two subsequent GWAS performed in Korean cases and controls did not identify any genome-wide significant loci, likely due to small sample size [43, 44].
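To make concrete why a modestly sized discovery cohort needs large replication stages, the following sketch runs a single-SNP allelic association test and compares the result with the conventional genome-wide significance threshold of 5 × 10^-8. The allele counts are hypothetical: they were chosen only to match the allele totals implied by 744 cases and 895 controls, and they are not data from any cited study. The function name is illustrative.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def allelic_test(case_risk, case_other, ctrl_risk, ctrl_other):
    """Return (odds ratio, 1-df chi-square, p-value) for a 2x2 allele-count table."""
    a, b, c, d = case_risk, case_other, ctrl_risk, ctrl_other
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    # Pearson chi-square for a 2x2 table, without continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical counts: 744 cases -> 1488 alleles, 895 controls -> 1790 alleles
odds_ratio, chi2, p_value = allelic_test(446, 1042, 448, 1342)
print(f"OR={odds_ratio:.2f}  chi2={chi2:.2f}  p={p_value:.2e}")
print("genome-wide significant:", p_value < GENOME_WIDE_ALPHA)
```

With these made-up counts, a variant with an odds ratio near 1.3 reaches only nominal significance, which illustrates why suggestive signals from a discovery cohort of this size are carried forward into larger replication cohorts.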

The first two GWAS for PCOS performed in European-origin populations were published in 2015 [45, 46]. These analyses provided replication of loci reported by Chen and Shi [8, 37] and identified novel loci not previously identified as risk loci for PCOS (Table 4.1). In an initial study that used discovery and replication cohorts of European descent from North America that included a total of 3000 PCOS cases and more than 5000 controls, two novel risk loci were identified: 8p23.1 (GATA4/NEIL2) and 11p14.1 (FSHB) [45]. The potential causal gene at 8p23.1 is not immediately apparent. Due to linkage disequilibrium (LD) across the region, the association interval spans almost 30 kb. The lead SNP resides between GATA4 and NEIL2, and SNPs in LD with this variant intersect known regulatory regions that connect to the promoters of C8orf49, NEIL2, and FDFT1. NEIL2 is a transcription factor that is ubiquitously expressed [47, 48] and targets the promoter of more than 240 genes [47, 48], many of which are themselves transcription factors and are important in pathways that include the regulation of development that are dysfunctional in cancer (e.g., HOX family of genes) [49] and in hormone signaling (e.g., FST, which inhibits FSH release). Both C8orf49 and GATA4 are highly expressed in the ovary [47] and present possible causal genes at this locus. The association signal identified by Hayes et al. [45] at 11p14.1 intersects with the coding region for FSHB, the gene encoding follicle-stimulating hormone beta subunit, which is a strong candidate as the causal gene at this locus. Genome-wide significant association signals were reported across a 300 kb interval at this locus, and the lead SNP is located >20 kb upstream of the FSHB gene within a highly conserved 450 bp region upstream of the coding region for FSHB. 
In vitro studies have since shown this region binds the transcription factor steroidogenic factor 1 (SF1) and enhances the transcription of FSHB in an allele-specific manner, supporting the hypothesis that the risk allele at rs11031006 upregulates FSHB expression [50]. In this GWAS of European cohorts, more than half of the loci discovered in GWAS of Chinese cohorts exhibited nominal (P < 0.05) association with PCOS.

Table 4.1 Loci associated with PCOS in genome-wide association studies

A second GWAS performed in PCOS cases and controls of European descent was published in 2015 by Day et al. [46]. In this study the discovery analysis was performed in a cohort of more than 5000 self-reported PCOS cases and 82,000 healthy controls from the 23andMe research resource, with replication performed in 2000 clinically identified cases and nearly 100,000 controls. This analysis successfully replicated genome-wide significant signals at 2p21 (THADA) and 11q22.1 (YAP1), initially reported as PCOS risk loci in Chinese populations [8, 37], and at 11p14.1 (FSHB), previously reported as a risk locus in European PCOS cases [45]. In this analysis there was directional consistency in effect on PCOS risk at 10 of the initially reported 11 signals identified in Chinese PCOS cohorts; however, only 6 were nominally (P < 0.05) associated, and due to consistently smaller effect sizes, none were genome-wide significant in the discovery GWAS. The effects of different LD structures between Han Chinese and European populations resulted in three of these loci (2p21 (THADA), 9q33.3 (DENND1A), and 11q22.1 (YAP1)) having different lead SNPs, only one of which (rs11225154; YAP1) is in LD with the lead SNP reported in Chinese PCOS cases [46]. Three novel loci were identified in this GWAS at 2q34 (ERBB4), 5q31 (IRF1/RAD50), and 12q21.2 (KRR1) as PCOS risk regions at genome-wide significance. Three members of the EGFR gene family (ERBB4, ERBB3, and ERBB2) were identified as risk loci at, or close to, genome-wide significance in this analysis. Recent studies identified a role for Erbb4 in the ovary, where it regulates anti-Müllerian hormone (AMH) level and folliculogenesis [51]. The risk association signal detected at 5q31 is within a complex, gene-dense region. The index SNP lies within intron 3 of C5orf55 and intron 4 of IRF1 as well as within the reading frame for an uncharacterized protein-coding transcript AC116366.3. Nearby genes also include the transporter SLC22A5; IRF1-AS1, an antisense RNA to the nearby gene IRF1; the B cell growth factor IL5; and the double-strand break repair gene RAD50. It is difficult to identify a candidate causal transcript at this locus given its complexity and what is known about the function of the genes in this region. To further identify potential biological mechanisms by which the identified risk variants may impact PCOS biology, Day et al. [46] performed a quantitative analysis of the six genome-wide significant loci, which revealed an association between these six PCOS risk alleles and AMH levels in girls, suggesting that PCOS risk alleles from across the genome act through endocrine and reproductive pathways.

An international collaborative consortium assembled the largest GWAS of PCOS to date in order to identify risk loci in PCOS cases of European descent [52]. This analysis included more than 10,000 cases and 100,000 controls from seven cohorts (effective sample size 18,000), including a large proportion of previously analyzed cases [45, 46]. Imputation was conducted using the 1000 Genomes database, yielding over ten million SNPs for the GWAS. Fourteen risk loci were identified in this consortium effort. Three loci initially reported in GWAS of Chinese PCOS cases were replicated at genome-wide significance: 2p21 (THADA), 9q33.3 (DENND1A), and 16q12.1 (TOX3). The two risk loci, located at 8p23.1 (GATA4/NEIL2) and 11p14.1 (FSHB), reported by Hayes et al. [45] were confirmed in this large meta-analysis, as were the three risk loci at 2q34 (ERBB4), 5q31.1 (IRF1/RAD50), and 12q21.2 (KRR1) reported by Day et al. [46]. Three novel loci were identified in this collaborative meta-analysis at 9p24.1 (PLGRKT), 11q23.2 (ZBTB16), and 20q11.21 (MAPRE1). An additional novel genome-wide significant locus was identified on the X chromosome at the ARSD locus but was excluded from the formal results of the analysis due to low imputation quality, low minor allele frequency, and heterogeneity of effect across the three cohorts that had SNP data available for the X chromosome [52]. Additional analyses of this region in a larger sample size are needed to resolve the potential role of this locus in PCOS risk. Given that this GWAS included PCOS cases identified by self-report and two different clinical diagnostic criteria, heterogeneity analysis was performed to identify loci that demonstrated a difference in effect by these strata. The analysis identified heterogeneity at a single locus, 8p23.1 (GATA4/NEIL2), where the effect size associated with the risk allele was significantly less in self-reported PCOS cases and significantly greater in PCOS cases diagnosed using the NIH criteria [52]. For the remaining 13 loci, the magnitude of association with PCOS was similar regardless of mode of diagnosis. This lack of heterogeneity across PCOS cases identified using these different criteria, along with the consistent replication of PCOS risk loci across individual studies, underscores a shared genetic architecture for this phenotype.
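One common way to test whether a risk allele has the same effect across diagnostic strata is Cochran's Q statistic computed from per-stratum effect estimates. The sketch below assumes three strata (self-report and two clinical definitions) with hypothetical log odds ratios and standard errors; it is not necessarily the procedure used in [52], and the numbers are invented for illustration.

```python
import math


def cochran_q(betas, ses):
    """Return (fixed-effect pooled estimate, Cochran's Q) for per-stratum effects."""
    weights = [1.0 / se ** 2 for se in ses]  # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    return pooled, q


def chi2_sf_df2(x):
    """Survival function of a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Hypothetical log odds ratios for one SNP in three strata:
# self-report, first clinical definition, second clinical definition
betas = [0.05, 0.20, 0.12]
ses = [0.02, 0.03, 0.03]

pooled, q = cochran_q(betas, ses)
p_het = chi2_sf_df2(q)  # three strata -> 2 degrees of freedom
print(f"pooled beta={pooled:.3f}  Q={q:.2f}  heterogeneity p={p_het:.2e}")
```

A small heterogeneity p-value, as in this invented example, would flag a locus whose effect differs by mode of diagnosis; a large one is what the text describes for 13 of the 14 loci.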

Day et al. 2018 combined the PCOS GWAS data with results from GWAS for other traits to carry out genetic correlation analyses [52]. Such analyses suggest shared etiology but do not indicate directionality or causality. This investigation found genetic correlation between PCOS and body mass index (the most correlated trait), childhood obesity, fasting insulin, type 2 diabetes, high-density lipoprotein cholesterol, triglyceride levels, age of menarche, coronary artery disease, and depression. No genetic correlation was observed between PCOS and age of menopause or male pattern balding.

As the use of research biobanks has grown over recent years, the ability for case identification via electronic medical records has facilitated the analysis of population-based cohorts recruited through large medical care systems. Two such systems are the Geisinger MyCode Community Health Initiative, which has recruited more than 250,000 research participants throughout the care system in Pennsylvania [53], and the collaborative eMERGE (electronic MEdical Records and GEnomics) network, which combines biobanks or studies with clinical data derived from medical records from across many sites [54]. These two programs performed a GWAS in close to 3000 PCOS cases that met two of the following criteria: (a) diagnosis of PCOS or polycystic ovaries; (b) hyperandrogenism or its related signs, or hyperandrogenemia; and (c) oligomenorrhea, amenorrhea, or infertility (i.e., Rotterdam diagnostic criteria), and 53,000 controls that did not meet any of the three criteria [55]. A small validation cohort of 253 cases and 2161 controls was available from the Vanderbilt BioVU study. This analysis reported three signals (at 6q25.3, 2q34, and 3q25.1), two of which reached genome-wide significance. The locus at 6q25.3 had not been detected in prior studies. The index SNP at this locus is more than 200 kb from the nearest genes (FNDC1 and SOD2) and does not overlap known regulatory elements from ENCODE or 3D chromatin interactions reported by GeneHancer. It is not immediately apparent what the causal gene is at this locus. The previously reported risk signal at 2q34 (ERBB4) was identified in this study at a suggestive level of significance, and additionally a novel independent risk variant was identified at this locus at genome-wide significance. A third locus at 3q25.1 (WWTR1) was reported as nearing genome-wide significance; this locus has not been previously reported as a risk locus for PCOS [55]. It should be noted that 17% of the total cohort in this study was listed as African American, although the numbers of cases and controls were not provided. A lookup of the three reported risk loci was performed in an analysis restricted to African American participants, and only the novel risk SNP identified at 2q34 (ERBB4) passed quality control metrics. Despite having a higher minor allele frequency in African American populations, this SNP was only nominally associated with PCOS risk (P > 0.01) [55]. Beyond the studies described above, genome-wide association studies in populations of other ancestries have not been performed. The shared or differing genetic architecture of PCOS in populations that are not of Chinese or European ancestry is therefore poorly understood, which represents a significant gap in knowledge. Ongoing research should prioritize the recruitment and profiling of PCOS cases and controls of other ancestries (e.g., Hispanic, African) to address this gap.

To better identify the biological pathways through which susceptibility loci act to increase risk of PCOS, several studies, including the recent meta-analysis, have tested these loci for association with phenotypic traits related to PCOS. Significant associations between known risk loci and polycystic ovarian morphology, ovulatory dysfunction, and hyperandrogenism were all identified [52]. GWAS analyses within PCOS cases also found that the allele associated with increased risk of PCOS at the FSHB locus was associated with increased circulating LH level, decreased FSH level, and increased ratio of LH to FSH [45, 46]. Taken together, these analyses further support the view that much of the genetic basis of PCOS acts through disruption of hormone pathways.

Polygenic Risk Scores for Disease Risk Prediction in PCOS

Polygenic risk scores (PRS) have been under active development in recent years, leveraging the increasing pace of discovery of the polygenic genetic architecture of many complex traits and the increasing sample sizes that are becoming available for testing and validation of such scores. The methods used to generate such scores are themselves an active area of research, with empirical and Bayesian methods currently being applied. The long-term goal of PRS application in the population is to allow the early detection of risk so that disease prevention strategies can be deployed [56]. This strategy is underway in cardiovascular traits, where the polygenic genetic risk estimated by GWAS is equivalent to the known monogenic risk and clinical risk factors [57]. A polygenic risk score for PCOS was developed based on the analysis of clinically diagnosed cases included in the collaborative meta-analysis [52] and applied to a cohort of more than 120,000 individuals for whom electronic health records were available through the eMERGE network [58]. The best performing PRS in this analysis demonstrated a prediction accuracy for PCOS cases of 0.55 with an area under the curve (AUC) of 0.715 in eMERGE participants of European ancestry. When combined with information available based on PCOS component phenotypes, the PRS plus phenotype model performed with an accuracy of 0.873 and an AUC of 0.87, indicating that the PRS model built from this analysis is able to predict PCOS phenotype in individuals of European ancestry [58]. This genetic PRS model was also used to perform a phenome-wide association study (PheWAS), in which the genetic risk score of an individual is used to identify anthropometric and clinical traits that are enriched in individuals at high genetic risk. This analysis can identify cross-phenotype associations that may be the result of pleiotropy, whereby risk alleles impact multiple traits or phenotypes [59]. A significant PheWAS relationship was identified between the PCOS PRS and endocrine and metabolic traits (obesity, lipid dysfunction, type 2 diabetes), neurological traits (sleep apnea), circulatory system traits (hypertension), and digestive traits (esophageal disease) [58]. Many of these associations remained significant after the analysis was repeated without any PCOS cases included in the cohort, suggesting that there are likely undiagnosed PCOS cases within the eMERGE network.
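In its simplest form, a PRS is a weighted sum of risk-allele dosages, and its discrimination can be summarized by the AUC, the probability that a randomly chosen case has a higher score than a randomly chosen control. The sketch below uses hypothetical per-SNP weights and genotypes to show both steps; it does not reproduce the scoring method or data of the cited study [58].

```python
def polygenic_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1, or 2 copies per SNP)."""
    return sum(d * w for d, w in zip(dosages, weights))


def auc(case_scores, control_scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties count half)."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical per-SNP weights (log odds ratios) for three risk variants
weights = [0.10, 0.08, 0.15]

# Hypothetical genotype dosages for three cases and three controls
cases = [[2, 1, 2], [1, 2, 1], [2, 2, 0]]
controls = [[0, 1, 1], [1, 0, 0], [2, 1, 1]]

case_scores = [polygenic_score(g, weights) for g in cases]
control_scores = [polygenic_score(g, weights) for g in controls]
toy_auc = auc(case_scores, control_scores)
print(f"AUC = {toy_auc:.3f}")
```

In this toy data set seven of the nine case/control pairs are ranked correctly. Published scores typically use many more variants and more elaborate weighting, so this is only the core arithmetic.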

Mendelian Randomization Using GWAS Signals

Even before causal genes are identified at risk loci, GWAS information can be used to dissect the biology of disease. A major example is that robust loci identified by GWAS can be used to interrogate causality between an exposure and an outcome using Mendelian randomization (MR). In this approach, SNPs associated with the exposure are used as instrumental variables to estimate the genetically driven effect of the exposure on the outcome, yielding causal effect estimates. Reports of PCOS GWAS included MR analyses that suggested that increased body mass index (BMI), age at menopause, decreased sex hormone-binding globulin (SHBG), fasting insulin, male pattern balding, and depression were causal factors for PCOS [46, 52]. The relationship between BMI and PCOS has been extensively investigated using MR, with results indicating that, while obesity appears to be causal for PCOS, PCOS does not cause obesity [60, 61]. MR studies found that testosterone levels, but not AMH levels, are causal for PCOS [62, 63].
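The instrumental-variable logic described above can be sketched with summary statistics. For each SNP, the Wald ratio divides its effect on the outcome by its effect on the exposure, and the fixed-effect inverse-variance weighted (IVW) estimator combines the ratios across SNPs. The effect sizes below are hypothetical, and the cited studies may have used other estimators and sensitivity analyses.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Per-SNP causal estimate: SNP-outcome effect over SNP-exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance weighted MR estimate and its standard error."""
    numerator = sum(
        bx * by / se ** 2 for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome)
    )
    denominator = sum(bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for three instruments
beta_exposure = [0.10, 0.08, 0.12]  # SNP effects on the exposure (e.g., BMI)
beta_outcome = [0.05, 0.03, 0.07]   # SNP effects on the outcome (log odds of PCOS)
se_outcome = [0.010, 0.012, 0.015]  # standard errors of the outcome effects

ratios = [wald_ratio(bx, by) for bx, by in zip(beta_exposure, beta_outcome)]
estimate, std_error = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
print("Wald ratios:", [round(r, 3) for r in ratios])
print(f"IVW estimate = {estimate:.3f} (SE {std_error:.3f})")
```

The estimate is only interpretable as causal if the instruments affect the outcome solely through the exposure, which is why MR reports usually add sensitivity analyses for pleiotropy.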

A series of MR studies examined PCOS as the exposure against various outcomes, using PCOS SNPs from the largest GWAS for PCOS [52] as instrumental variables. PCOS was found not to have a genetic causal effect on type 2 diabetes, coronary heart disease, or stroke [64]. Given that prior MR studies had demonstrated causal effects of BMI, higher testosterone, and lower SHBG on diabetes and/or cardiovascular disease, the authors concluded that these features commonly present in PCOS, rather than PCOS in and of itself, explain the association between PCOS and cardiometabolic disease. Genetically predicted PCOS was associated with increased risk of breast cancer overall and estrogen receptor-positive breast cancer; no effect on estrogen receptor-negative breast cancer was observed [65]. Consistent results were observed in a study that examined several subtypes of breast cancer [66]. MR studies found a protective effect of PCOS against invasive ovarian cancer and endometrioid ovarian cancer [67, 68]. These MR studies yielded key insights into causes and consequences of PCOS, avoiding confounding variables that affect epidemiological association studies.

Identifying Causal Genes at PCOS Risk Loci

Colocalization analysis of disease and intermediate cellular phenotypes (e.g., gene expression and protein level across different relevant tissues) is performed by measuring the probability that the two traits share a causal variant [69]. A recent analysis applied this approach and successfully identified seven proteins with strong evidence of colocalization [70]. The FSH protein was clearly implicated at the 11p14.1 locus, where the significant correlation between genotype at risk-associated SNPs and circulating FSH level presents a clear colocalization of the same causal SNPs acting on both PCOS and FSH level. This approach was unable to resolve a single likely causal transcript at the 12q13.2 locus but implicated SUOX, ERBB3, IKZF4, RPS26, and GDF11 as potential causal genes. A single likely causal gene, ZFP36L2, was identified at 2p21 (THADA locus), and C9orf3 was implicated at 9p24.1. Colocalization analysis at 8p23.1 identified both C8orf49 and NEIL2 as potential causal transcripts [70].
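The probability that two traits share a causal variant can be sketched with the widely used approximate Bayes factor formulation, which assumes at most one causal variant per trait in the region and compares five hypotheses: no association (H0), association with trait 1 only (H1), trait 2 only (H2), both traits with different causal variants (H3), and both traits with a shared causal variant (H4). The code below is a minimal illustration under those assumptions. The prior probabilities, the prior effect standard deviation of 0.2, and all effect sizes are assumptions chosen for the example, and the cited analysis [70] may differ in method and settings.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def wakefield_log_abf(beta, se, prior_sd=0.2):
    """Log approximate Bayes factor for one SNP (Wakefield's approximation)."""
    z = beta / se
    r = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)  # shrinkage factor
    return 0.5 * (math.log(1.0 - r) + r * z ** 2)


def coloc_posteriors(labf1, labf2, p1=1e-4, p2=1e-4, p12=1e-5):
    """Posterior probabilities of H0..H4 for a region with at least two SNPs."""
    n = len(labf1)
    same = [labf1[i] + labf2[i] for i in range(n)]
    different = [labf1[i] + labf2[j] for i in range(n) for j in range(n) if i != j]
    log_h = [
        0.0,                                                    # H0
        math.log(p1) + log_sum_exp(labf1),                      # H1
        math.log(p2) + log_sum_exp(labf2),                      # H2
        math.log(p1) + math.log(p2) + log_sum_exp(different),   # H3
        math.log(p12) + log_sum_exp(same),                      # H4
    ]
    total = log_sum_exp(log_h)
    return [math.exp(v - total) for v in log_h]


se = 0.05  # hypothetical standard error shared by all SNPs
trait1 = [wakefield_log_abf(b, se) for b in (0.300, 0.025, 0.010)]
trait2_shared = [wakefield_log_abf(b, se) for b in (0.275, 0.015, 0.005)]
trait2_distinct = [wakefield_log_abf(b, se) for b in (0.015, 0.275, 0.005)]

shared = coloc_posteriors(trait1, trait2_shared)
distinct = coloc_posteriors(trait1, trait2_distinct)
print("shared signal,   PP(H4) =", round(shared[4], 3))
print("distinct signal, PP(H3) =", round(distinct[3], 3))
```

When both traits peak at the same SNP, nearly all posterior mass falls on H4; when the peaks sit on different SNPs, it shifts to H3, which is the distinction the colocalization results in the text rely on.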

Conclusion and Future Directions

Advances in genomic technology have led to rapid progress in our understanding of the genetic architecture of PCOS. Though PCOS is clinically heterogeneous, GWAS have found little genetic heterogeneity across PCOS diagnostic criteria. Twenty loci across the genome have been identified at genome-wide significance in Chinese and/or European cohorts (Table 4.1). The causal gene at many of these loci is unknown; however, genomic analysis and in vitro studies have provided some suggestion of the likely causal gene at specific loci. These results indicate that disruption of hormone signaling pathways, particularly those related to the synthesis and signaling of FSH and the signaling of the LH receptor, is key to the pathogenesis of PCOS. As with many complex traits, much of the heritability for PCOS has yet to be identified. Identifying additional risk alleles will contribute to improved PRS accuracy and sensitivity and may identify further biological pathways to be targeted for the treatment of PCOS symptoms. Increasing sample sizes will be required for the discovery of additional risk alleles, and the continued efforts of the International PCOS Consortium (iPCOS) are focused on including increasing numbers of PCOS cases and controls for ongoing meta-analysis for risk allele discovery. A second focus of the iPCOS consortium is to foster the inclusion of PCOS cases and controls of Hispanic and African ancestry, so that we may begin to understand the shared and differing genetic architecture of PCOS between these populations and those already studied. The current move of genomic technologies beyond array-based genotyping into population-level whole genome sequencing will provide opportunities to discover additional types of risk variants (e.g., structural variants) and variants with rare and very rare risk allele frequencies, allowing a deeper understanding of the complex genetic underpinnings of PCOS.