Epigenetics

The term epigenetics was coined by the Cambridge University embryologist Conrad Waddington in a series of monographs, and he used it to describe his original view of developmental biology that the morphological and functional properties of an organism arise sequentially under a program defined by the genome under the influence of the organism’s environment [1]. The modern definition of epigenetics is modifications of the DNA or associated proteins, other than DNA sequence variation itself, that carry information content during cell division [2], although a few scientists take a more relaxed or stricter view, either including RNA modification or limiting to vertical (generational) transmission. Remarkably, the modern definition and Waddington’s have converged. That is because the epigenetic state of an organism progresses from gamete to zygote to somatic tissue, all of which have profoundly different epigenomes, while the DNA is the same. Furthermore, given that the developmental state of a cell can be completely reprogrammed by somatic cell nuclear transfer, or by specific genes in combination, the information specifying cell state is not the DNA alone but the epigenetic program layered on top of this genetic code and is heritable during cell division but ultimately reprogrammable.

The focus of this review is the specific epigenetic modification involving DNA methylation (DNAm), a covalent addition of a methyl (CH3) group to the nucleotide cytosine. DNAm is the only epigenetic modification whose mechanism for propagation is well understood biochemically. CpG dinucleotides show heritable methylation during cell division because the complementary strand shows the same sequence, and both cytosines are normally methylated. During DNA replication, the two daughter strands contain hemimethylated DNA, i.e., the parent strand is methylated and the daughter strand is not. The enzyme DNA methyltransferase I (DNMT1) has high affinity for this hemimethylated strand and adds a methyl group to the newly synthesized daughter cytosine at that site, likely within the same DNA replication complex. Dietary methionine and a cofactor synthesized from folic acid are necessary for the success of methylation maintenance, providing a strong link between the environment and the epigenome. Indeed in animals, the epigenome and gene expression can be modified by dietary manipulation of methylation precursors, and dietary deprivation of methionine leads to liver cancer in animals [3].
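The propagation mechanism described above can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions: CpG sites are represented as integer positions, and the function names (`replicate`, `maintain`) and the `fidelity` parameter are hypothetical illustrations, not part of any published method or measured enzyme property.

```python
import random
from typing import Optional, Set, Tuple


def replicate(parent_marks: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Semi-conservative replication of one parental strand.

    Returns (template_marks, nascent_marks). The template keeps its
    methylated CpG positions; the newly synthesized strand starts with
    none, so the duplex is hemimethylated.
    """
    return set(parent_marks), set()


def maintain(
    template: Set[int],
    nascent: Set[int],
    fidelity: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Set[int]:
    """Toy stand-in for maintenance methylation at hemimethylated sites.

    Each site methylated on the template but not on the nascent strand is
    copied with probability `fidelity` (a hypothetical parameter).
    """
    rng = rng or random.Random(0)
    restored = set(nascent)
    for site in sorted(template):
        if site not in restored and rng.random() < fidelity:
            restored.add(site)
    return restored


if __name__ == "__main__":
    template, nascent = replicate({10, 42, 97})
    print("hemimethylated sites:", sorted(template - nascent))
    print("after maintenance:", sorted(maintain(template, nascent)))
```

With `fidelity` below 1.0, repeated rounds of `replicate` and `maintain` lose marks over successive divisions, which is one simple way to picture how a shortage of methyl donors could erode the pattern.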

“CpG islands” are regions rich in CpG dinucleotides (formally defined as G + C content ≥ 0.5 and CpGobs/CpGexp ≥ 0.6) [4], and they are often described as uniformly unmethylated in normal cells, with the exception of islands on the inactive X chromosome and near imprinted genes [5, 6]. However, the assumption that autosomal CpG islands (except for those at imprinted genes) are never methylated is clearly incorrect [7–10]. It is also important to note that functionally important DNAm information is often not within conventionally defined CpG islands, e.g., the H19 and insulin-like growth factor II gene (IGF2) differentially methylated regions (DMRs) that regulate imprinting of IGF2 [11, 12].
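The two thresholds in the formal definition can be checked for any sequence window. The sketch below assumes the commonly used expected-CpG formula (number of C × number of G, divided by window length), which the text above does not spell out. The function names are hypothetical, and no minimum window length is enforced.

```python
def cpg_island_stats(seq: str) -> dict:
    """Return the G+C fraction and observed/expected CpG ratio of a window."""
    s = seq.upper()
    n = len(s)
    if n == 0:
        raise ValueError("empty sequence")
    c = s.count("C")
    g = s.count("G")
    observed = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    # Assumed definition of expected CpG count: (#C * #G) / window length
    expected = (c * g) / n
    return {
        "gc_fraction": (c + g) / n,
        "obs_exp": observed / expected if expected > 0 else 0.0,
    }


def meets_island_criteria(seq: str, min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the two thresholds quoted in the text to one window."""
    stats = cpg_island_stats(seq)
    return stats["gc_fraction"] >= min_gc and stats["obs_exp"] >= min_obs_exp


if __name__ == "__main__":
    for window in ("CGCGCGCGCG", "CCCCCGGGGG", "ATATATATAT"):
        print(window, cpg_island_stats(window), meets_island_criteria(window))
```

The second example window is entirely G + C yet fails the observed/expected test, which shows why the definition needs both criteria.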

Epigenetics of human disease

How can one identify disease-specific epigenetic differences? One would like to know that the epigenome varies normally in the population, is associated in particular ways with disease, and does not always simply reflect normal tissue-specific differences in gene expression. Individual gene data in support of this epigenetic variation were first reported in the 1980s [13]. Other genomic regions showing epigenetic variation in the population include X inactivation [14] and both familial and environmental determinants of IGF2/H19 imprinting, or parent of origin-specific gene silencing [14].

A common theme of disease epigenetics is the role of defects in phenotypic plasticity, the ability of cells to change their behavior in response to internal or external environmental cues; this was reviewed recently in detail [15]. For example, hereditary disorders of the epigenetic apparatus lead to developmental defects, a dramatic example being Rett syndrome. This disorder involves loss of function of methyl-CpG-binding protein 2 (MeCP2), which recognizes DNAm. Children with Rett syndrome develop normally until 6–12 months and then gradually lose developmental milestones over years, due to a failure to maintain gene silencing in the brain. This process of delayed onset of disease is also a hallmark of bipolar disorder and schizophrenia.

The study of epigenetic changes in human cancers began with the discovery of widespread hypomethylation [16]. Cancer involves both hypomethylation and hypermethylation, attendant overexpression of oncogenes, silencing of tumor suppressor genes, and loss of imprinting. Here too, the mechanism by which epigenetic changes lead to cancer appears to involve disruption of normal phenotypic plasticity, in this case of the programming that leads a cell to differentiate normally within a given tissue compartment [2]. Moreover, epigenetic changes that arise constitutionally are associated with increased risk of common disease, such as loss of imprinting of the IGF2 gene in cancer, which has been shown in both human [17] and mouse [18, 19] studies. Prospective or nested case–control studies are needed to establish a cause and effect relationship in colorectal cancer.

Epigenetic alterations have long been linked to human disease, originally through disorders of genomic imprinting [20]. Defects in the epigenetic machinery also lead to developmental abnormalities, such as MeCP2 mutations in Rett syndrome [21] and DNMT3B mutations in immunodeficiency, centromeric region instability, and facial anomalies (ICF) syndrome [21].

Epigenetic alterations may also contribute to neuropsychiatric disease. Bipolar disorder shows several features consistent with an epigenetic contribution: lack of complete concordance in monozygotic twins, onset of illness in adolescence or adulthood rather than childhood, the often episodic nature of the illness, and the apparent relationship to environmental factors, such as stress [22, 23]. Stress has been shown to alter epigenetic marks including DNAm and histone modifications in the brain in animal models [24, 25]. Interestingly, three important bipolar disorder medications, the mood stabilizer valproate [24, 25], the antidepressant imipramine [25], and the antipsychotic haloperidol [26], have also been shown to induce epigenetic changes in the brain. More direct evidence in support of an epigenetic effect in bipolar disorder is based on the identification of an excess of maternal transmission in some pedigrees [27]. The mounting evidence for epigenetic involvement in autism includes relationships with related phenotypes as well as direct evidence. For example, imprinted genes on the X chromosome are thought to be involved in social skills in girls because defects in these skills are found in Turner syndrome and in children lacking the paternal X chromosome but not the maternal X chromosome [28]. Both fragile X, a disorder with known phenotype overlap with autism, and ICF syndrome arise from malfunctions in the establishment of normal DNAm patterns [29, 30]. Rett syndrome, also associated with autistic features, is caused by mutations in the gene encoding DNA methyl-binding protein MeCP2, a protein important for interpreting DNAm and controlling the repression of gene transcription [31]. Patients with these three disorders exhibit mental retardation, demonstrating the importance of proper DNAm in the regulation of cognitive function.
Furthermore, one mechanistic study has shown that abnormally hypomethylated CNS neurons were impaired functionally and were selected against in postnatal development [32]. Another suggests that neuronal activity can drive the transcription of genes important for controlling neurotransmitter release by regulating their DNAm status [33].

The potential for imprinting of autism-related genes could explain the lack of Mendelian inheritance in autism and the inconsistent results across linkage and association studies that do not account for these features. Direct evidence for this idea comes from a study of the gene for contactin-associated protein-like 2 (CNTNAP2), identified by multiple studies as associated with autism spectrum disorders (ASD) [34–38]. In one of these studies, risk for ASD associated with the identified single nucleotide polymorphism (SNP) showed parent-of-origin specificity, suggesting a role for imprinting [36].

Epigenetics of aging

Increasing evidence supports a role for epigenetics in the biology of aging. X-inactivated genes in the mouse show an increased frequency of reactivation with aging, consistent with age-related epigenetic change [39, 40]. The frequency of epigenetic changes in mice may be one to two orders of magnitude greater than the rate of somatic DNA mutation [41]. This fits with a role of epigenetics in late-onset disorders such as frailty, a syndrome of decreased resiliency and reserves, in which a mutually exacerbating cycle of declines across multiple systems results in negative energy balance, sarcopenia, and diminished strength and tolerance for exertion [42]. Accumulation of DNA sequence changes might not occur at a high enough rate during the lifespan to induce common disease, but epigenetic changes may occur at a frequency that could contribute to this effect. Very few studies have demonstrated epigenetic changes in humans with age due to technical and biosample limitations. A recent study has shown differences in local and global methylation by age by examining the similarity in methylation patterns between MZ twins aged 3 years old and MZ twins aged 50. Although these analyses were not in the same individuals (the same twins were not followed longitudinally), the similarity in methylation patterns between young twins compared to the dissimilar patterns among older twins argues strongly for age-related changes in the epigenome [43]. Direct evidence comes from a recent study showing changes in DNA methylation in the same individual over time, described in more detail below.

Epigenomics

Epigenomics refers to genome-scale analysis of epigenetic marks. The term “methylome,” or genome-wide state of DNAm, was first introduced by the author in 2001 [44]. Despite the availability of an essentially complete genome sequence for several years, understanding the methylome has progressed more slowly, largely due to limitations in technology affecting sensitivity, specificity, throughput, quantitation, and cost among the previously used detection methods. All of the available methods involve trade-offs among these variables. Furthermore, all of these variables are themselves moving targets, particularly cost. The rule in genomic science generally is that increased demand substantially reduces cost because of three factors: fierce competition in the biotechnology sector; production efficiencies as methods are automated; and continued technological advances. It is also important to define clearly what is meant by genome scale. The term is commonly applied to any method not limited to specific predefined genes, but no epigenomic method in common use examines the entire epigenome. For the sake of this article, the discussion will be limited to DNAm analysis because of its particular suitability for pathological and epidemiological studies due to its stability in biobanked specimens.

The human genome contains ∼3 × 10⁹ bp of DNA, including ∼3 × 10⁷ CpG dinucleotides, about half of which lie in nonrepetitive single- or low-copy sequence [45]. CpG dinucleotides are the sites that can be methylated and the methylation in turn replicated faithfully during cell division by DNA methyltransferase 1. While non-CpG methylation exists, it is not currently considered epigenetic information since no mechanism is known for its propagation during DNA replication.

What are the methods in common practice for measuring genome-scale DNAm? While this review naturally is written from the perspective of our own approaches, there are several other excellent reviews of epigenomics [46, 47]. Most investigators are drawn to commercially available methods, particularly those that can be performed as a service, with only DNA needing to be prepared by the investigator. However, these methods are not necessarily the most comprehensive or most accurate. A method similar to array-based SNP analysis is the Illumina GoldenGate methylation assay [48], or its more recent cousin, the Illumina Infinium methylation platform. Both methods involve bisulfite conversion of unmethylated cytosine to uracil, followed by polymerase chain reaction (PCR), which propagates a thymine residue at the converted base [49]. Methylated cytosine is unconverted and thus read as cytosine. Thus, the methylation state (C/mC) is transformed to a pseudopolymorphism (T/C, respectively). The readout is then as for any SNP and is semiquantitative, accurate within ∼17% for the GoldenGate assay [50]. The major limitation of this approach is the relatively poor coverage of the genome by both methods, only ∼1,500 CpG by GoldenGate and ∼27,000 CpG by Infinium, thus representing 0.01% to 0.18% of the single-copy methylome. A second limitation is the choices involved in selecting CpG sites for analysis. The chips are designed based in part on the idea that functional CpG methylation lies within canonical gene promoters and CpG islands. CpG islands are defined algorithmically, i.e., based on a formula given above. The major rationale for this choice is literature showing hypermethylation of CpG islands in cancer. Yet, those studies are largely self-referential in design, and recent studies described below suggest that most variable DNAm occurs outside of these islands.
Nevertheless, great advances have been made possible by these reagents and methods, and they show the promise of increasing efforts by many laboratories to improve the resolution of genome-scale technology.
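The bisulfite pseudopolymorphism and the coverage figures quoted above can be made concrete with a short sketch. It assumes a single strand with methylated cytosines given as 0-based positions, and it takes the single-copy methylome as ∼1.5 × 10⁷ CpGs (half of the ∼3 × 10⁷ total stated earlier). The function names are hypothetical.

```python
from typing import Set


def bisulfite_readout(seq: str, methylated_positions: Set[int]) -> str:
    """Sequence read from one strand after bisulfite treatment and PCR.

    Unmethylated C is converted to U and amplified as T; methylated C is
    protected and still read as C. Other bases are unchanged.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def coverage_percent(n_assayed: float, n_total: float) -> float:
    """Percentage of CpG sites interrogated by a platform."""
    return 100.0 * n_assayed / n_total


if __name__ == "__main__":
    # Cytosine at index 1 is methylated; cytosine at index 4 is not.
    print(bisulfite_readout("ACGTCG", {1}))

    single_copy_cpgs = 1.5e7  # assumed: half of ~3 x 10^7 CpGs
    for name, sites in (("GoldenGate", 1_500), ("Infinium", 27_000)):
        print(name, round(coverage_percent(sites, single_copy_cpgs), 2), "%")
```

Reading C versus T at each interrogated position is what allows the methylation state to be scored with SNP-style genotyping chemistry.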

Furthermore, there are comparatively few data supporting the choice specifically of promoters and CpG islands for studies of other diseases, or normal population variation. Indeed, a relatively small scale but very comprehensive study was performed by Stephan Beck at the Sanger Center on ∼1.8 Mb of DNA including ∼40,000 CpG sites across 12 tissues [10]. The study showed that most methylation variation was not at transcriptional start site-associated CpGs or at CpG islands [10]. One encouraging result from that study, for those who wish to use CpG chips as described above, was a high degree of correlation between CpG site methylation within a few hundred base pairs. However, the choice of one or two CpGs per candidate region seems precariously underrepresented.

A second approach in common practice is hybridization of antibody-purified methylated DNA to high-density genome arrays [51]. For example, NimbleGen offers hybridization of methylated DNA immunoprecipitation (MeDIP) samples to a ∼2-Mb array tiled through gene promoters and CpG islands. The coverage of this array is much greater than the SNP-based arrays described above. However, choice of selection is still a significant issue given that complete tiling of the genome would currently require ten arrays, which is cost-prohibitive for large-scale epidemiological studies. Furthermore, MeDIP shows significant limitations in discriminating methylation differences in regions of medium- to low-density CpG content [52], and our recent study shows that this is exactly where much or most of the significant variation in DNAm occurs [53]. Another method focused on CpG islands is restriction landmark genome scanning [54]. There are emerging alternatives for methylation fractionation, including affinity purification of methylated DNA on methyl-CpG-binding protein [55], or affinity purification of unmethylated DNA [56].

Two promising methods for genome-scale analysis use methylated DNA fractionation based on restriction endonuclease digestion. One of these, developed by John Greally and colleagues at Albert Einstein College of Medicine in New York, is termed HELP for HpaII-tiny fragment Enrichment by Ligation-mediated PCR [57]. It takes advantage of the difference in sizes between HpaII fragments, which are generated from unmethylated DNA, and fragments from MspI, which recognizes the same cleavage site but is methylation-independent. While initial specificity was relatively limited, recent improvements involve additional methylcytosine-sensitive endonucleases and allow representation of >98% of CpG islands and >90% of refSeq promoters, and it can also be combined with next generation sequencing for readout [58]. A second involves fractionation of the unmethylated component with McrBC, which recognizes methylated DNA if there are two methylcytosines preceded by purines and separated by ∼40–100 bp, an easy condition to meet for methylated DNA except at very low CpG density. This approach was first applied to specific chromosome analysis [59] but was subsequently extended to study of human cancer [60].
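The HpaII/MspI comparison underlying HELP can be sketched as an in silico digest. This is only an illustration, assuming that both enzymes cut within CCGG after the first cytosine and that HpaII is blocked when the internal cytosine (the CpG) is methylated. The function names are hypothetical, and the sketch does not reproduce the ligation-mediated PCR or size-selection steps.

```python
from typing import List, Set

SITE = "CCGG"  # recognition site shared by HpaII and MspI


def cut_positions(seq: str, methylated: Set[int],
                  methylation_sensitive: bool) -> List[int]:
    """Return cut coordinates for a CCGG-cutting enzyme.

    `methylated` holds 0-based positions of methylated cytosines. A
    methylation-sensitive enzyme (HpaII-like) skips sites whose internal
    cytosine is methylated; an insensitive one (MspI-like) cuts them all.
    """
    s = seq.upper()
    cuts = []
    start = s.find(SITE)
    while start != -1:
        internal_c = start + 1  # cytosine of the CpG inside CCGG
        if not (methylation_sensitive and internal_c in methylated):
            cuts.append(start + 1)  # cut between the two cytosines
        start = s.find(SITE, start + 1)
    return cuts


def fragment_lengths(seq: str, cuts: List[int]) -> List[int]:
    """Lengths of the fragments produced by cutting at the given positions."""
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    seq = "AACCGGTTTTCCGGAA"
    methylated = {11}  # second CCGG site is methylated at its CpG
    msp = fragment_lengths(seq, cut_positions(seq, methylated, False))
    hpa = fragment_lengths(seq, cut_positions(seq, methylated, True))
    print("MspI-like fragments:", msp)
    print("HpaII-like fragments:", hpa)
```

Fragments present in the methylation-insensitive digest but missing from the sensitive one point to methylated sites, which is the contrast the assay exploits.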

Rafael Irizarry and I with our colleagues developed an array-based readout method that is independent of methylation fractionation method and can be applied equally to McrBC, HELP, or antibody-based methods. This approach, termed CHARM for comprehensive high-throughput array-based relative methylation analysis, involves two essential components. First, the array is agnostic to presuppositions about the location of differential methylation and tiles through regions based only on the relative CpG content in decreasing abundance [52]. It, therefore, includes all CpG islands, but that represents only 38% of the CpG “real estate,” or available oligonucleotide probe positions for analysis on the array. One could use additional arrays or soon to be released higher density arrays to increase coverage, which is now about one fourth of the entire nonrepetitive methylome. The second component is genome-weighted smoothing, or correction for the hybridization properties of the target (i.e., sample) genome at each location, which is in turn calculated from empirical measurements of hybridization efficiency with regard to GC content, CpG density, and length of fragments [52]. A statistical suite of postprocessing algorithms, written in R, is termed CharmR and is continually revised. The arrays and CharmR are open access and open source (http://www.biostat.jhsph.edu/∼maryee/charmR/). Thus, while not commercially available, this technology is readily transportable to core laboratories that have statistical and programming support.

Although my colleagues and I have developed one of the current approaches to epigenomic analysis, we gladly welcome the advent of second generation sequencing technology for DNAm analysis. There are multiple competing commercial platforms for massively parallel sequencing on slides, with throughput per machine >300 Gb per run at <1% of the cost of conventional automated sequencing [61, 62]. A particular advantage of sequencing-based methylation analysis is the ability to ascertain allele-specific methylation by virtue of DNA polymorphisms within the same sequencing read. This is particularly true as longer reads become cost-effective.

Nevertheless, whole genome bisulfite sequencing applied to humans is not presently cost-effective for epidemiological studies. Costs are in the many tens of thousands of dollars, compared to hundreds of dollars per sample for alternative less comprehensive methods. Therefore, sequencing-based methods all involve significant trade-offs. One method involves “reduced representation,” using restriction enzymes to limit the sequenced target to regions within CpG islands [63], which may miss significant normal variation in patient populations or across tissues [53]. Single molecule sequencing detection, such as that in development by Pacific Biosciences, or other methods not widely discussed publicly, might reduce costs to the point of making whole genome shotgun sequencing inexpensive compared to other methods for epigenomic profiling. Until that day comes, however, a great deal can be learned about the methylome of human normal and disease populations using array or chip-based approaches.

The first genome-scale epigenetic analysis of human cancer: CpG island shores

We recently exploited CHARM methodology to perform the first genome-scale analysis of the human cancer methylome and to compare it to the normal tissue-varying methylome. A comparison was first made of DNA from five autopsy specimens, three matched tissues each representing the three embryonic lineages, brain, liver, and spleen. Surprisingly, most tissue-specific DNAm was not at CpG islands but at regions of intermediate CpG density located up to 2 kb from the islands, and which we termed “CpG island shores” [53]. Even though CpG islands accounted for 33% of the CpG real estate on the arrays, they only accounted for 6% of these tissue-varying differentially methylated regions, or T-DMRs. In contrast 76% of T-DMRs were in CpG island shores. Furthermore, the T-DMRs were located for the most part outside of promoters (96%), and more than half were >2 kb from the nearest annotated gene [53].

Next, comparing 13 colorectal cancers to matched normal mucosa from the same patients, 2,707 regions were identified showing cancer-specific differentially methylated regions, or C-DMRs, under a false discovery rate of 5%. These, too, were highly enriched at CpG island shores (67%), and islands were comparatively underrepresented (8% compared to 38% on the arrays). The data were highly reproducible, being validated by independent quantitative bisulfite pyrosequencing on a replicate set of 50 colon cancers and matched normal mucosa [53].

Remarkably, there was a comparable amount of hypermethylation as hypomethylation, even though the cancer literature is heavily biased toward the former. That may be because the CpG islands, even though underrepresented for C-DMRs overall, show hypermethylation when methylation is altered in cancer, while the shores away from the islands tend toward relative hypomethylation. In retrospect, this is not surprising, since CpG islands are protected against normal DNA methylation, and thus, the only direction they can commonly change in disease is toward relative hypermethylation. In any case, the common dictum that cancer shows repetitive DNA hypomethylation and gene-specific hypermethylation appears to be false. While the former is true, single genes are numerically comparably altered by hypomethylation and hypermethylation in cancer [53]. These results are illustrated in Fig. 1.

Fig. 1

Altered DNA methylation of CpG island shores in human colon cancer. Shown are an example of hypomethylation (Gene A, top) and hypermethylation (Gene B, bottom) in cancer revealed by a genome-scale analysis of the cancer methylome. Gene A is normally methylated at the shore and not at the island, and it acquires a hypomethylated pattern at the shore in colon cancer, resembling that of the normal liver. Aberrant expression at an alternate promoter, or for an untranslated RNA, is activated at the shore. Gene B is normally unmethylated at both shore and island, and it acquires a hypermethylated pattern at the shore, resembling the normal liver, and potentially at the island as well. Aberrant silencing ensues at the shore and potentially at the canonical promoter.

This comparative methylome analysis also showed that C-DMRs and T-DMRs largely overlap (65% using an F statistic). Indeed, if one performs supervised clustering to identify the C-DMRs that distinguish colorectal cancer from normal mucosa, those same DMRs in unsupervised clustering completely discriminate spleen from liver from brain [53]. Thus, the DMRs that regulate normal differentiation are involved in aberrant methylation in cancer, and this may occur generally across cancer, since they even distinguish tissues not of the type from which the cancer derives.

What do CpG island shores do? This is of great relevance to any investigator interested in the disease epigenome, since they were previously unapparent yet obviously at the heart of normal and abnormal variation. One clue comes from their localization, as they appear to be enriched in alternative transcriptional start sites for annotated genes, as well as unannotated RNAs. This localization was functionally supported by rapid amplification of complementary DNA ends experiments showing that hypomethylated CpG island shores in cancer show activation of alternative transcriptional start sites within them in the same tumor showing hypomethylation at these sites [53].

The new field of epigenetic epidemiology

Epidemiology is the study of disease in populations, and genetic epidemiology, or the relationship of genetic variation to disease, has exploded by taking advantage of the data generated by the HapMap project, a consortium effort to identify six million common polymorphisms across the genome and in the major human population groupings [64]. Genome-wide association studies have become the mainstay of human genetics research and have identified nearly 100 loci for over 40 diseases [65]. Concomitant copy number variant (CNV) analysis which can be performed on the same chip platform has also identified insertions or deletions that contribute to common disease [66]. However, epigenomics has not yet been integrated into the routine search for variation contributing to human disease susceptibility.

The new field of epigenetic epidemiology will measure and catalog such epigenetic variation within and across populations and characterize the correlation properties of methylation, similar to the catalog of SNP/CNV variation and linkage disequilibrium. Epigenetic epidemiology can also provide a unique perspective on the environmental factors contributing to common disease. Several examples of environmentally mediated epigenetic effects have been documented, including the influence of methyl donors and folate from diet on methylation levels, the influence of smoking on methylation, and the effects of metallotoxins [67, 68]. DNA methylation occurs by homocysteine conversion to methionine, which is then converted to S-adenosylmethionine, the common methyl donor for DNAm. Animal work has shown that folate-deprived rats become hypomethylated locally, at particular genes [69, 70], and globally [71]. Human cell work has shown that specific genes may be hyper- or hypomethylated with reduced folate [72]. Also, clinical studies have shown a correlation between serum folate levels and hypomethylation [73], and epidemiologic intervention studies show that older women put on folate-depleted diets have increased plasma homocysteine and decreased methylation [74, 75]. A hypomethylation defect was associated with assisted reproductive technology in the conception of children with Beckwith–Wiedemann syndrome, a disorder of prenatal overgrowth, birth defects, and cancer [76], which has been borne out by several other groups [77, 78]. Thus, prenatal exposure can act through an epigenetic mechanism.

The prior focus of epigenomics on the simple interface between epigenetics and human disease phenotype variation has prepared us now to address the more complex task of including genetic variation in genome-scale analysis. Going forward, it is critical to develop genome-wide tools to determine the relationship between genetic variation, epigenetic variation, and disease simultaneously. This area of overlap, the hatched area in Fig. 2, is deliberately drawn as the larger fraction of the overlap between genetics and phenotype to emphasize that most genetic findings must be considered in an epigenetic context and to highlight that the full value of typical genetic epidemiology studies cannot be realized until the complementary epigenetic measures and statistical tools are developed and performed on these samples.

Fig. 2

Epigenetic epidemiology. New genome-scale tools for epigenetic analysis will allow us to determine the relationship between genetic variation, epigenetic variation, and disease simultaneously. The area of overlap is deliberately drawn as the larger fraction of the overlap between genetics and phenotype to emphasize that most genetic findings must be considered in an epigenetic context and to highlight that the full value of typical genetic epidemiology studies cannot be realized until the complementary epigenetic measures and statistical tools are developed and performed on these samples

Fallin, Bjornsson, and I have proposed a common disease genetic and epigenetic (CDGE) model for human disease, which states that DNA sequence variation (traditional genetics), environment, and epigenetic mechanisms interact to cause or accelerate common disease, especially those of later onset [23, 79]. CDGE provides a model for understanding how one might integrate epigenetics into traditional studies of genetics and the environment. For example, epigenetic marks such as DNAm may influence disease risk, either directly (such as aberrant DNAm turning a gene on/off inappropriately) or indirectly (through masking/unmasking DNA sequence variation that has disease consequences). This type of DNA variation, whose penetrance is dependent on epigenetic context, is denoted “g dep,” since it is a risk factor only in the context of certain epigenetic patterns. This type of sequence variation would be difficult to discover in traditional genetic association approaches, without knowledge of the epigenetic background. Parent-of-origin analyses in genetic epidemiology are aimed at accommodating this problem in the context of imprinting, but this is often low-powered and only addresses one kind of epigenetic model. Whether epigenotype (e.g., DNAm) acts directly or indirectly on disease risk, factors that control epigenotype are themselves critical risk factors. DNA variation that controls epigenotype such as DNAm may be located, for example, in genes that encode proteins in the one-carbon transfer pathway and may affect the cell’s ability to maintain DNAm. This type of risk-related DNA variation is denoted “g epg” to indicate that the multiple effects manifest through an epigenetic mechanism.
With these thoughts in mind, at least three models can be considered for how DNA variation may contribute to risk: (1) independently of epigenetic mechanisms (g ind), (2) as genetic mediators of epigenetic modifications of other genes (g epg), or (3) where the effect of the genetic variant depends on its epigenetic context (g dep). Only the first of these would be easily detected in current genetic association studies. Current genetic studies might have reduced power to detect g epg without epigenotype measures or knowledge of the factors that contribute to epigenotype. Current genetic studies would have virtually no power to detect g dep without epigenetic measurement [23, 79].

A critical clue to CDGE comes from a global genome-scale measurement of DNAm termed luminometric methylation assay (LUMA), a precise quantitative measure of HpaII site methylation. Using LUMA, intraindividual change in DNA methylation was found over time with familial clustering. DNA from 111 participants in the AGES Reykjavik Study [80] was first analyzed. In this cohort, 8.1% of individuals showed changes greater than 20% in HpaII methylation over time, and these were approximately equally divided between gains and losses of DNAm. Permutation analysis showed that the change observed was much greater than expected by chance (P < 0.0001) [81]. Next examined was a second cohort of 126 individuals from a collection of Utah pedigrees that had been sampled twice over an average of 16 years. In this group, 11% of individuals showed changes greater than 20% in HpaII methylation over time. The Utah pedigrees showed high heritability, with a heritability estimate of 0.99 (P < 0.0001) [81]. The familial clustering of methylation changes raises the possibility that methylation changes could be directly related to genetic variation, as suggested by CDGE. Another recent study shows associations of single nucleotide polymorphisms with nearby differences in DNA methylation [82], consistent with CDGE.

Future needs for epigenetic epidemiology will require advances in three areas. First, we must develop scalable, cost-effective approaches for population-level epigenetic profiling. This includes technical advances in measurement and quantification of DNAm. We must also develop even more comprehensive coverage of the epigenome. This can be done by increasing real estate on the arrays, in part perhaps by reducing coverage of highly comparable adjacent sequences on the tiled regions, or by increasing array density as will occur this year. A substantial advance will come from combining array-based advances with second generation sequencing technology. For example, one could capture the relevant epigenome target (identified by studies on arrays) using arrays or molecular inversion probe or other solution-based technologies and then perform bisulfite-based shotgun sequencing for single nucleotide resolution.

Second, we must further develop the statistical tools and concepts that are necessary to analyze, interpret, and compare population-level epigenetic data. A critical requirement for epidemiological analyses is the transformation of granular individual epigenotypes into the higher-level epidemiological data types without significant information loss. This will include developing new statistical tools for identifying the subset of variable DNAm regions relevant to human disease and developing methods to simplify granular methylation patterns into epigenetic “barcodes,” similar to the work by Irizarry on gene expression [83].

Third, we must integrate conventional genetic epidemiology with these epigenetic data to fully develop epigenetic epidemiology. We must answer fundamental questions about type, frequency, and properties of epigenetic variation within and across individuals, families, and populations. This can be done by relating genetic variation to epigenetic variation in normal populations, or by investigating epigenetic differences among monozygotic twins. A critical question is whether epigenetic marks are transmitted intact from parent to offspring and whether DNAm is allele-specific and covaries with allele-specific gene expression. For example, can we develop an epigenetic transmission test comparable to the transmission disequilibrium test used in genetic epidemiology? Finally, and most excitingly, we must begin to examine the epigenome comprehensively in large population-based epidemiological studies of disease. Such studies will greatly enhance cancer risk assessment and prevention and are already showing promise in better understanding common neuropsychiatric disease.