Introduction

Heritability analyses for most complex disorders show that at least some portion of disease liability is due to environmental factors [1], often a large component of risk. The specific health consequences of environmental exposures have been well established for many toxicants and outcomes [2, 3]. Yet, many environmental risk factors have not yet been discovered, despite evidence that they play a role in disease. Environmental epidemiology’s goal of identification and characterization of non-heritable risk factors is critical, as these factors provide actionable insights about modifiable causes of disease that can lead to better prediction, prevention, treatment, and policy.

A major limitation to further discovery in environmental epidemiology has been the need for timing-specific exposure information and prospective outcome data. This is a great challenge, particularly for exposures that influence the risk of outcomes years to decades later, and for exposures that are difficult to measure or occur prior to feasible study enrollment, such as prenatal or preconception exposures. Some prospective cohort studies do begin prior to pregnancy, or early in pregnancy, and follow new babies through life (e.g., [4,5,6,7,8,9]). However, these study designs take years to accumulate outcomes, often with attrition or low enrollment numbers given the timing of enrollment and the length of commitment. Retrospective measurement of exposure is notoriously difficult, given the potential for recall bias in self-report, the lack of information in administrative data such as electronic health records, particularly for toxicants, and the short half-lives of many toxicants, such that biomarker measurement weeks or years later is uninformative about the amount of exposure at the time of vulnerability.

Thus, there is a critical need in environmental epidemiology for measurement tools that can accurately capture past exposure, particularly prenatal and early life exposures. One emerging area of promise is the ability to measure toxicant content of shed baby teeth, available at middle childhood, but able to inform exposures that occurred in utero [10, 11]. While this is a promising avenue, it does require availability of baby teeth and is, to date, relatively expensive, with few labs able to perform detailed measurement. Among the other emerging options is the potential for blood, or other readily available tissue samples, to provide past exposure proxy information. This could be transformational for environmental epidemiology and genetic epidemiology. If one can use biosamples already in biobanks, such as UK Biobank [12] or the vast genetic consortia banks (e.g., [13]), to estimate prior exposure with accuracy, there would be ample power to ask environmental exposure questions not previously possible and to truly integrate genetic and environmental information in these large sample sets.

One promising possibility for a blood (or convenience tissue)-based biomarker of past exposure that could enable environmental and gene-environmental work in existing biosamples is the potential for DNA methylation patterns to mark prior exposure. As we show in this review, there is now a substantial body of evidence that DNA methylation measured in blood, and other tissues, is associated with prior exposure, and that this association may be strong enough to enable an accurate predictor of exposure that is timing and toxicant specific. More work must be done to establish such biomarkers for specific exposure, but here, we show evidence from discovery epigenome-wide association studies (EWAS) for several exposures and timing, paving the way for such biomarker development. Such discoveries must be further evaluated in prediction models to establish their biomarker utility. As an example, we elaborate on the work done with the association between prenatal smoking exposure and DNA methylation patterns, which has moved from EWAS discovery to biomarker development. The results show promising accuracy, reproducibility, specificity to exposure, and persistence over many years. We also discuss DNA methylation patterns as a cumulative exposure biomarker, or biomarker of aging, through what has been termed “DNA methylation clocks”. Through this review, we hope to present these findings as examples of the opportunities that exist for environmental and genetic-environmental epidemiology through DNA methylation-based biomarkers and call for more work to be done in the field to realize this potential.

Suitability of DNA Methylation as a Biomarker of Past Exposure

DNA methylation is a type of epigenetic mark with several inherent properties that make it well suited for exposure biomarker purposes. DNA methylation involves the covalent addition of a methyl or hydroxymethyl group to cytosine nucleotides in human DNA, and thus, it is relatively stable and not easily degraded with long-term storage. It also does not require any burdensome upfront sample collection or processing methods. These properties are particularly important when considering new methods to extract past exposure information from existing biobanks and repositories. While chemically stable, DNA methylation is a dynamic process that can be modified by environmental context and over time, a critical feature of any exposure biomarker. It provides a mechanism for cells and organisms to respond to their environment without changing the DNA sequence. Finally, because DNA methylation is quantitative in nature, it may capture “biological dose” and/or effects of exposure mixtures.

There are several advantages to using DNA methylation as a biomarker of exposure relative to prospectively or retrospectively collected exposure data, metabolites, gene expression, or objective wearable devices. More traditional exposure ascertainment methods can pose several problems. Prospective collection of exposure data is ideal but is costly and can be inefficient for diseases with lower prevalence rates or those with long lag times between exposure and development of disease. Retrospective collection of exposure data is subject to recall bias or misclassification and is impossible for certain exposures (e.g., metal toxicants). Objective wearable devices can overcome many of these issues but have only recently come online, and thus, do not enable utilization of existing large-scale biobanks. Use of molecular biomarkers of exposure has been mainstream for decades. For some exposures, metabolites have been the gold standard measurement tool to collect accurate, highly reliable information about exposure. For example, cotinine, a major metabolite of nicotine, is widely recognized as the optimal collection metric to obtain smoking status [14, 15]. Untargeted metabolomic assays also have the potential to capture exposure mixtures and quantities. However, one of the major limitations to using metabolites as biomarkers of past exposure is their short half-life. The half-life of most metabolites, including cotinine, is on the order of hours to days [16,17,18]. Metabolites collected from untargeted assays can also be sensitive to dietary intake differences and sample collection protocols that may vary within and across large biobanks. Laboratory and analytic methods to best address these issues are still under development. Exposure-related transcriptome changes have also been observed. However, isolating high-quality RNA suitable for gene expression profiling can be challenging in an epidemiologic and biobank resource setting because RNA is less stable than DNA and more subject to degradation with longer-term storage or suboptimal collection protocols. New molecular biomarkers that are long-lived, specific, stable, and that can be reliably measured in existing banked samples are needed; as evidenced in detail below, DNA methylation meets these criteria.

DNA Methylation Is Associated with Past Exposure, Across Multiple Domains

With the emergence of affordable genome-scale epigenetic technologies, it is now feasible to measure DNA methylation in a large number of samples and perform epigenome-wide association studies (EWAS) to discover methylation differences, at specific CpG sites in the genome, associated with particular exposures or outcomes [19]. This technological advance, coupled with a strong interest in identifying molecular changes related to environmental exposures has led to a rapid increase in environmental epigenomic studies. A wide range of exposures have now been linked to epigenetic changes in studies where both types of data were measured at the same time; these have been extensively reviewed elsewhere [20,21,22]. In this review, we focus on EWAS showing DNA methylation patterns, measured across the lifespan, reflect past exposures. As summarized in Table 1, methylation changes have been linked to past exposure, across a wide range of environmental domains.

Table 1 Summary of discovery EWAS and look-up replication studies showing exposure-related DNA methylation patterns are present and can be detected long after an exposure occurred

Prenatal Exposure to Smoking and Alcohol

Several EWAS have identified site-specific changes in DNA methylation levels at birth related to prenatal exposure to maternal smoking [24•, 25, 26•, 32•, 33•] and alcohol use [37] (Table 1). Several genomic regions have shown suggestive differences in cord blood DNA methylation levels related to maternal drinking habits during early pregnancy [37]. However, studies of prenatal alcohol exposure and DNA methylation are limited by sample size and window of pregnancy timing. Additional genome-wide significant findings may emerge with increased sample sizes and/or more resolved alcohol exposure metrics in the future. For prenatal smoking exposure, site-specific changes in DNA methylation have been detected in peripheral blood obtained from infants [27], older children [26•, 30•, 31••, 32•, 33•], and adolescents [32•]. Associations between later life blood DNA methylation and prenatal smoking exposure persist even after adjusting for postnatal and personal smoking exposures [32•, 33•]. Smoking and drinking are thought to have similar social determinants and correlated patterns of use; however, the associated DNA methylation findings published to date have not been consistent across these exposures, indicating that DNA methylation signatures may be exposure-specific and not merely capturing a social determinant construct [26•, 31••, 37].

Nutrition and Supplementation

As shown in Table 1, a number of studies have observed DNA methylation changes in samples collected from birth through adulthood that relate to differences in peri- and prenatal exposure to nutrient intake and nutritional supplements [39,40,41,42,43,44,45,46]. Differences in maternal nutrient intake during peri-conception and pregnancy through diet and food availability have been linked to DNA methylation changes, at specific genes, in blood and buccal samples obtained from their offspring at birth, infancy, and childhood [41,42,43,44,45]. A number of studies have leveraged data from cohorts dating back to the 1960s, when the first randomized controlled trials were carried out to assess the impact of folic acid and/or docosahexaenoic acid (DHA) supplementation on birth and child outcomes. Saliva DNA methylation profiles collected in 47-year-old adult offspring of the Aberdeen Folic Acid Supplementation Trial (AFAST) participants showed differences related to whether their mothers received folic acid supplementation during pregnancy or were in the placebo group [39]. A randomized controlled trial of DHA, an omega-3 fatty acid, observed differentially methylated genomic regions among infants whose mothers received DHA relative to those whose mothers did not receive the supplement. Furthermore, the methylation differences were also shown to be present in peripheral blood samples collected at 5 years of age [46].

Prenatal Toxicant Exposures

In the past year, DNA methylation changes have been linked to air pollutant exposure in the prenatal time period (Table 1). More specifically, a multi-study EWAS meta-analysis identified CpG loci showing significant methylation changes in cord blood, at birth, related to prenatal nitrogen dioxide (NO2) exposure levels. Interestingly, prenatal NO2-associated methylation changes were also observed in peripheral blood obtained from older children. The NO2 exposure levels at the time of blood sample collection in the older children were substantially lower than those the children experienced during pregnancy, arguing that their presence in childhood samples was not likely due to continued postnatal exposure or current NO2 exposure status [52]. More evidence in this area is likely to accumulate as additional studies with unified prenatal air pollutant and DNA methylation data emerge. In addition to site-specific changes in DNA methylation, a significant global decrease in the total genomic amount of 5-hydroxymethylcytosine, a specific type of DNA methylation, was observed in birth and early childhood blood samples among children with elevated prenatal exposure to mercury [54].

Prenatal Exposure to Adversity

Several social adversity exposures have been associated with long-term changes in DNA methylation (Table 1), although they have mainly focused on candidate genes. For example, candidate gene-based work, from the historic Dutch Hunger Winter study, revealed that DNA methylation levels at the IGF2 gene locus differ significantly between individuals with prenatal exposure to the 1944–1945 famine relative to their unexposed same-sex siblings [48]. These changes were detected in blood samples provided 60 years after their prenatal exposure to famine. Exposure to severe maltreatment during early childhood has also been linked to methylation changes in saliva. Significant decreases in DNA methylation at the NR3C1 gene locus were observed among preschool age children exposed to stress/maltreatment in the 6 months prior to biospecimen collection compared to unexposed children with similar economic status [51].

Maternal Conditions in Pregnancy

There is also evidence that exposure to adverse maternal health conditions during pregnancy is related to methylation changes at birth through adolescence (Table 1). A meta-analysis of 19 cohorts reported 86 site-specific changes in DNA methylation, in cord blood, related to maternal body mass index (BMI) at the start of pregnancy [55]. Of those, 72 sites showed a similar association, direction, and magnitude of effect in peripheral blood samples obtained in adolescence [55]. DNA methylation levels among infants born to women with an active eating disorder during pregnancy differed from those whose mothers had an active eating disorder (ED) prior to conception and non-ED controls [56].

Adult Exposures and Later Measurement

Several studies have reported long-lasting DNA methylation patterns in later adulthood biospecimens related to past earlier adulthood exposures. Similar to prenatal exposures, most findings to date are for behavioral and lifestyle types of exposures including smoking and alcohol use (Table 1). This is likely due to lack of unified exposure and methylation data in the same samples for other exposures that are more difficult to obtain. In world-wide population samples, meta-EWAS have identified thousands of loci where peripheral blood methylation levels differ by current, former, and never smoker status [34,35,36]. Joehanes et al. found that methylation values among former smokers who quit smoking 30 years prior to collection of methylation measurements in blood samples still had not reached levels comparable to individuals who never smoked [35]; thus, DNA methylation changes associated with past exposures can be long-lived. Further, smoking-related methylation values appear to capture additional valuable information about past exposures: time since quitting and number of pack-years smoked [34,35,36]. This has important implications for the potential to use DNA methylation signatures to serve not only as a simple dichotomous exposure biomarker but also as a biomarker that can be used to determine specific windows and doses of exposure. Similar differences in methylation related to smoking status, time since quitting, and pack-years have also been documented in buccal samples [34], another highly accessible and available tissue source. However, a comparison of DNA methylation patterns among hundreds of former drinkers compared to never drinkers, approximately 4 years after alcohol cessation, showed only marginal differences between the two exposure groups [38]. Epigenetic changes related to nutrition in adults have also been observed (Table 1). Males exposed to a short-term high-fat overfeeding diet showed epigenetic changes that persisted for 6–8 weeks after the men resumed their normal diets [47].

Longitudinal DNAm Data

To date, three studies have reported repeated measures of DNA methylation and associations with exposure information; two were focused on DNA methylation signatures of prenatal smoking exposure and the third examined the effects of maltreatment. Longitudinal analysis of methylation profiles at prenatal smoking-associated CpG sites showed similar differences in DNAm related to prenatal smoking status at 18 months [28] and at 7 and 17 years of age [32•], even after accounting for any postnatal smoking exposures in the older children [32•]. However, in adolescence, there were three CpG sites that showed reversion back to methylation levels observed among adolescents with no prenatal exposure to maternal smoking [57••]. This suggests that signatures of prenatal exposure developed solely in cord blood samples may fail to account for important differences in methylation stability in the postnatal period. Thus, the development of a robust epigenetic biomarker of past exposure will need to take this into account and evaluate methylation patterns at multiple post-exposure time points. The third study examined baseline and longitudinal changes in saliva methylation levels over a period of 6 months, among preschool age children, to assess the effects of maltreatment (at baseline) on methylation at NR3C1 [51]. Children with no history of maltreatment showed little variation in methylation across the two time points. However, children with a history of maltreatment had significantly higher levels of methylation at baseline and significantly decreased methylation 6 months later. This suggests looking for differences in methylation variation among exposed and unexposed individuals, as opposed to mean methylation shifts, may be a fruitful and important avenue for future studies.
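One simple way to examine change over time, rather than a mean at a single time point, is to compute the within-person difference between two visits and summarize it by exposure group. The sketch below is illustrative only and is not the analysis of any cited study: the paired methylation values and group names are invented, and no significance test is performed.

```python
from statistics import mean, variance
from typing import Dict, List, Tuple

Pair = Tuple[float, float]  # (baseline methylation, follow-up methylation)


def summarize_change(pairs: List[Pair]) -> Dict[str, float]:
    """Summarize follow-up minus baseline methylation within each person."""
    changes = [follow_up - baseline for baseline, follow_up in pairs]
    return {
        "mean_baseline": mean([baseline for baseline, _ in pairs]),
        "mean_change": mean(changes),
        # Sample variance of the within-person change (needs >= 2 people).
        "variance_of_change": variance(changes),
    }


# Hypothetical paired measurements at one CpG (proportion methylated, 0 to 1).
UNEXPOSED: List[Pair] = [(0.50, 0.51), (0.48, 0.48), (0.52, 0.51), (0.49, 0.50)]
EXPOSED: List[Pair] = [(0.60, 0.50), (0.62, 0.55), (0.58, 0.52), (0.61, 0.49)]

if __name__ == "__main__":
    for name, group in (("unexposed", UNEXPOSED), ("exposed", EXPOSED)):
        print(name, summarize_change(group))
```

In a real analysis the group summaries would be followed by a formal test of the difference in change or in variability, with adjustment for covariates such as age and cell-type composition.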

Cumulative Exposures, Aging, and Epigenetic “Clocks”

In addition to serving as a biomarker for discrete intervals of exposure, DNA methylation signatures have also been reported to capture continuous cumulative levels of exposures, including toxicant and behavioral exposures. For example, measures of global DNA methylation levels in LINE-1 elements were significantly decreased among men with increased cumulative exposure to lead, as assayed via patella bone K-x-ray fluorescence, which is a well-established traditional biomarker of long-term lead exposure [58]. In addition, several studies of adult smokers have consistently demonstrated that DNA methylation patterns at specific sites accurately reflect the cumulative amount and duration of current and prior smoking [34,35,36]. A number of DNA methylation “clocks” have been developed to reflect gestational [59,60,61], pediatric [62], and adult [63,64,65,66,67,68] chronologic ages, a type of demographic exposure that can also be thought of as a cumulative exposure. These methylation clocks have been widely used to predict a number of adverse health outcomes, demonstrating the utility of DNA methylation exposure biomarkers in epidemiology studies more broadly [69,70,71,72]. For example, the adult-derived epigenetic clock has been shown to better predict all-cause mortality than examination of traditional risk factors or chronological age [73].

Biomarkers Require Predictive Modeling Beyond EWAS Discovery Analyses

EWAS findings continue to emerge and provide valuable insights into the biologic targets of environmental exposures. However, the main output from EWAS is not directly informative or useful as a predictive biomarker. Results are typically per-CpG, rather than a collective “signature”. Further, discovery analyses typically rely on general associations between exposed versus unexposed samples. A predictive modeling approach is needed to develop a useful biomarker. Accuracy parameters such as sensitivity, specificity, and area under the ROC curve (AUC) are more relevant for biomarker development [74, 75]. Further, a collection of CpGs associated with the particular exposure will generally have better predictive properties than a single CpG. Selection of this collective list, modeling of the prediction algorithm, and evaluation of prediction performance are necessary. This approach has been taken in the development of epigenetic clocks described above. Choices for CpG selection include simply taking all CpGs meeting a particular statistical threshold in EWAS, or building machine-learning models using techniques such as support vector machines or elastic net [76••]. Prediction algorithms can then include all CpGs equally, or weighted by their association with the exposure, or other characteristics. The output may be a probabilistic exposure membership (dichotomous, with associated probability), or a methylation-based exposure “score” [57••, 77].
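The scoring and evaluation steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any cited study: the CpG names (`cpg_a`, `cpg_b`, `cpg_c`), the weights, and the beta values are hypothetical, the score is a plain weighted sum, and AUC is computed directly as the probability that a randomly chosen exposed sample scores higher than a randomly chosen unexposed one.

```python
from typing import Dict, List, Sequence, Tuple


def methylation_score(betas: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of methylation beta values over the CpGs listed in `weights`."""
    return sum(weight * betas[cpg] for cpg, weight in weights.items())


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    exposed = [s for s, y in zip(scores, labels) if y == 1]
    unexposed = [s for s, y in zip(scores, labels) if y == 0]
    if not exposed or not unexposed:
        raise ValueError("need at least one exposed and one unexposed sample")
    wins = 0.0
    for e in exposed:
        for u in unexposed:
            if e > u:
                wins += 1.0
            elif e == u:
                wins += 0.5
    return wins / (len(exposed) * len(unexposed))


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[float, float]:
    """Call a sample exposed when its score is at or above `threshold`."""
    calls = [1 if s >= threshold else 0 for s in scores]
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one exposed and one unexposed sample")
    true_pos = sum(1 for c, y in zip(calls, labels) if c == 1 and y == 1)
    true_neg = sum(1 for c, y in zip(calls, labels) if c == 0 and y == 0)
    return true_pos / n_pos, true_neg / n_neg


# Hypothetical per-CpG weights (e.g., EWAS effect estimates) and toy samples.
WEIGHTS: Dict[str, float] = {"cpg_a": -2.0, "cpg_b": 1.5, "cpg_c": 1.0}
SAMPLES: List[Tuple[Dict[str, float], int]] = [
    ({"cpg_a": 0.60, "cpg_b": 0.40, "cpg_c": 0.30}, 1),  # exposed
    ({"cpg_a": 0.55, "cpg_b": 0.45, "cpg_c": 0.35}, 1),  # exposed
    ({"cpg_a": 0.80, "cpg_b": 0.35, "cpg_c": 0.25}, 0),  # unexposed
    ({"cpg_a": 0.70, "cpg_b": 0.42, "cpg_c": 0.33}, 0),  # unexposed
    ({"cpg_a": 0.58, "cpg_b": 0.42, "cpg_c": 0.32}, 0),  # unexposed
]

if __name__ == "__main__":
    scores = [methylation_score(betas, WEIGHTS) for betas, _ in SAMPLES]
    labels = [label for _, label in SAMPLES]
    print("AUC:", auc(scores, labels))
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes discrimination across all thresholds, which is one reason both kinds of metrics are reported in biomarker work.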

Prenatal Smoking as an Example

For the most well-studied and replicated exposure, prenatal smoking, work in this area has already begun and can serve as an exemplary model to be extended to other types of exposures. The first site-specific differences in DNA methylation related to prenatal exposure to smoking were reported in 2012 by Joubert et al. [24], where EWAS revealed 26 CpG sites with exposure-associated DNA methylation differences achieving genome-wide significance. Not long after, studies emerged replicating the findings in additional birth samples and adding a handful of new loci [25, 32•, 33•]. Many also showed similar DNA methylation patterns associated with prenatal smoking exposure when measured in blood samples from older children, ranging in age from 5 to 17 years [30•, 31••, 32•, 33•], even after accounting for parental and personal postnatal smoking exposures [32•, 33•].

Ladd-Acosta et al. [31••] were the first to use predictive modeling to evaluate how well DNA methylation levels, measured in blood samples from 5-year-old children, at the originally reported 26 CpG sites associated with prenatal smoking exposure, could predict prenatal exposure to smoking from childhood, rather than cord blood. Their support vector machine classifier, with 10-fold cross validation, predicted the children’s exposure to sustained active maternal smoking in pregnancy with 87% accuracy when compared to maternal report of smoking during pregnancy (Table 2). Receiver operating characteristic (ROC) curves also showed that the specificity of the model was high; prediction of prenatal smoking exposure using permuted random sets of 26 loci never achieved greater than 60% accuracy and the prenatal smoking classifier was not able to predict exposure to maternal alcohol or medication use with higher than 56% accuracy [31••]. The following year, Reese et al. [77] developed a single numeric methylation score, based on DNA methylation measured in blood, and showed good correspondence to prenatal cotinine levels consistent with sustained exposure to active maternal smoking. In an independent test set of cord blood samples, the methylation score was able to predict prenatal exposure to sustained smoking with 91% overall accuracy [77] (Table 2). A recent cord blood methylation meta-analysis, spanning 13 world-wide studies and 6685 samples, showed consistency with previous findings and expanded the set of loci significantly associated with prenatal smoking from dozens to 2965 CpG sites [26•]. Nominally significant differences in methylation were also observed in older children (n = 3187) for every CpG site identified at birth [26•]. More recently, Richmond et al. [57••] developed a methylation-based smoking score using meta-EWAS findings and evaluated its ability to predict prenatal smoking exposure in an independent set of blood samples collected 30 years after pregnancy (Fig. 1; Table 2). The first score they derived was based on 568 loci that reached genome-wide significance in cord blood at birth (associated with prenatal smoking exposure) and a second score was based on 19 sites detected in blood from older children at genome-wide significance (associated with prenatal smoking) [26•]. Given the age of the participants at time of blood collection and methylation measurements, it is possible that the offspring themselves smoked; therefore, the authors also computed a methylation score for personal (postnatal) smoking exposure using 2623 sites identified as significantly associated with current smoking status in a large adult smoking meta-analysis [35]. As shown in Fig. 1 and Table 2, the classification accuracy of the prenatal exposure methylation score, based on 30-year-old adult blood specimens, was highest when using the 19-locus methylation score method that had been derived using middle childhood methylation data (AUC = 0.72). Somewhat unexpectedly, the cord blood-derived score had a lower overall prediction accuracy (AUC = 0.69). This highlights the importance of including childhood samples in discovery EWAS and for including loci identified in childhood samples in prenatal biomarker development, if later life biosamples are the intended use. Importantly, they also showed that current smoking exposure scores cannot predict prenatal smoking exposure with high accuracy (AUC = 0.57). Thus, these classifiers appear specific to prenatal exposure. This is consistent with previous observations that there is some, but not complete, overlap of loci associated with prenatal smoking exposure and personal adolescent or adult smoking exposures [33•, 35].
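Cross-validated accuracy, and the comparison against random locus sets used to judge specificity, can be illustrated with a short self-contained sketch. It rests on several assumptions that do not come from the cited studies: a nearest-centroid classifier stands in for the support vector machine, the cohort is simulated, and all function names, sample sizes, and methylation values are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (beta value per CpG, exposure label 0/1)


def simulate_cohort(n_per_group: int, n_cpgs: int, n_signal: int, seed: int) -> List[Sample]:
    """Toy data: the first `n_signal` CpGs are shifted upward in exposed samples."""
    rng = random.Random(seed)
    cohort: List[Sample] = []
    for label in (0, 1):
        center = 0.7 if label == 1 else 0.5
        for _ in range(n_per_group):
            signal = [center + rng.uniform(-0.05, 0.05) for _ in range(n_signal)]
            noise = [rng.uniform(0.3, 0.7) for _ in range(n_cpgs - n_signal)]
            cohort.append((signal + noise, label))
    return cohort


def predict(train: Sequence[Sample], x: Sequence[float], cpgs: Sequence[int]) -> int:
    """Nearest-centroid call using only the CpG columns listed in `cpgs`."""
    distance = {}
    for label in (0, 1):
        rows = [[values[i] for i in cpgs] for values, y in train if y == label]
        if not rows:
            raise ValueError("training fold is missing a class")
        centroid = [mean(column) for column in zip(*rows)]
        distance[label] = sum((x[i] - c) ** 2 for i, c in zip(cpgs, centroid))
    return 1 if distance[1] < distance[0] else 0


def cv_accuracy(samples: Sequence[Sample], cpgs: Sequence[int], k: int = 10, seed: int = 0) -> float:
    """k-fold cross-validated accuracy: each sample is predicted once, held out."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    correct = 0
    for fold in range(k):
        held_out = set(order[fold::k])
        train = [samples[i] for i in order if i not in held_out]
        for i in held_out:
            values, label = samples[i]
            correct += int(predict(train, values, cpgs) == label)
    return correct / len(samples)


def random_set_accuracies(
    samples: Sequence[Sample], candidates: Sequence[int], set_size: int,
    n_sets: int, k: int = 10, seed: int = 1,
) -> List[float]:
    """Cross-validated accuracy for randomly drawn CpG sets of the same size."""
    rng = random.Random(seed)
    pool = list(candidates)
    return [cv_accuracy(samples, rng.sample(pool, set_size), k) for _ in range(n_sets)]


if __name__ == "__main__":
    cohort = simulate_cohort(n_per_group=20, n_cpgs=30, n_signal=3, seed=42)
    print("signature accuracy:", cv_accuracy(cohort, [0, 1, 2]))
    null = random_set_accuracies(cohort, range(3, 30), set_size=3, n_sets=20)
    print("best random-set accuracy:", max(null))
```

If the signature CpGs perform no better than random sets of the same size, the apparent accuracy is unlikely to reflect an exposure-specific signal; the same logic applies when the classifier is asked to predict a different exposure.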

Table 2 DNA methylation-based biomarkers of exposure to smoking
Fig. 1

DNA methylation biomarkers, regardless of timing of sample collection, can be used to predict prenatal smoking. As reported in Richmond et al. [57••], adult biosamples can accurately predict prenatal smoking, even after accounting for post-natal (own) smoking. Pre-defined sets of CpG DNA methylation loci can be used for prediction. Derived reference sets from infant cord blood and from middle childhood blood are available (top). The CpG set derived from childhood samples achieves slightly better prediction parameters (bottom)

Finally, separate DNA methylation patterns have been shown to predict prior adult personal smoking exposure. A 4-CpG model using predictive generalized linear models has been shown to predict prior personal smoking status among adults [78•]. The 4-locus model was highly accurate in an independent test sample with an AUC = 0.83 [78•] (Table 2). Furthermore, the authors showed that DNA methylation is a better long-term biomarker of exposure than cotinine. The prediction model using cotinine levels was able to accurately predict former adulthood smoking in only 47% of the samples compared to 83% when DNAm was used as a biomarker of personal smoking history [78•] (Table 2). While associations between DNAm levels and specific dose, duration, and time since quitting have been observed in adults [34,35,36], these more detailed exposure classes have not been pursued in published predictive analyses to date.

Need for Additional Evidence

The smoking exposure examples demonstrate the potential for DNA methylation-based biomarkers of prior exposure. Multiple studies show the ability to accurately predict prenatal exposure based on DNA methylation measured at birth, in childhood, and even adulthood. Separate sets of DNA methylation loci can be used to accurately predict past personal adult exposure as well. Further, it appears that these two types of exposures, prenatal and previous personal exposure, can be isolated from each other. There is also a suggestion that quantitative methylation scores may be useful for estimating dose. If fully developed, such biomarkers, across multiple exposures and DNA measurement windows, can dramatically shift our ability to carry out environmental and genetic-environmental epidemiology using existing biobanks. However, much more work must be done. First, studies must move from site-by-site discovery EWAS approaches to classification approaches. The field must establish best practices for selecting CpGs that create accurate and generalizable classifiers. Multiple feature selection algorithms are available, and multiple metrics of predictive accuracy exist. The influence of QC pipelines on accuracy must also be considered, as has been done in other omics classifier work [79]. Perhaps most importantly, the accuracy and utility of DNA methylation biomarkers of exposure must be explored across ancestries and tissue matrices. Because DNA methylation at many CpG sites is, in part, genetically controlled [80, 81], it is likely that DNA methylation signatures of exposure may vary by ancestry. Additionally, the effects of environmental exposures on the epigenome can be influenced by underlying genotypes [82,83,84,85,86]. Genetic heterogeneity is likely to be particularly important among genes that establish, maintain, and regulate DNA methylation as well as for genes involved in exposure metabolism and detoxification. Thus, studies that assess potential genetic modification of epigenetic signatures of exposure are also needed. Tissue type will also play a critical role. While it is not necessary that a biomarker be on the causal path of an exposure to the ultimate health outcome of interest, it may still be true that different DNA methylation sites show predictive accuracy in different cell types. This is because the base level and variability of DNA methylation vary by cell type, and thus, the opportunity for additional variation that captures exposure is likely to be heterogeneous across tissue types. This has already been established for epigenetic clocks, where patterns from single tissue types do not fully overlap in their age prediction accuracy [65]. These caveats do not diminish enthusiasm for this potentially influential area for epidemiology, but do call attention to the rigorous work ahead.

Conclusions

The ability to obtain measures of environmental exposures in existing samples and biobanks will enable new large-scale analyses to investigate modifiable environmental risk factors for disease as well as their interaction with genes. Both inherent properties and empiric evidence support the potential for DNA methylation to serve as a stable, long-term biomarker of past exposures across a range of environmental domains. Predictive models and methylation-based exposure scores are emerging and have shown high accuracy in their ability to predict former prenatal and adulthood personal smoking exposures. To fully realize the potential of DNA methylation as an exposure biomarker, continued large-scale EWAS and development of predictive models across time points, tissue types, and ancestries are needed.