1 Introduction

Over the last two decades, we have seen a rapid increase in the number of studies reporting associations of epigenetic marks, in particular DNA methylation (DNAm), with epidemiological and disease risk factors, as well as with disease itself [1,2,3,4,5,6,7,8,9,10]. Many of these associations have been derived by measuring DNA methylation in the tissue of origin, for example, in precursor cancer lesions [8, 11, 12], cancer tissue [13], or post-mortem brain [14, 15]. However, by far the largest number of associations have been derived from easily accessible “surrogate” tissues like blood [1, 16], often under the assumption that DNAm changes in such tissues can be informative of disease or disease risk [17,18,19,20]. An ever-increasing number of epigenetic biomarker studies are also measuring DNAm of cell-free DNA fragments (cfDNAm) in serum, which offers great potential for noninvasive early detection of a wide range of diseases, including type-1 diabetes, multiple sclerosis, ischemic brain damage, pancreatitis, and cancer [21,22,23,24,25,26]. In most cases, however, the biological interpretation of the measured epigenetic alterations remains challenging [27]. One obvious reason for this is that, unlike gene or protein expression, an epigenetic change may not necessarily be functional. As a concrete example, age-associated DNAm alterations have been widely reported [6, 7, 28,29,30,31], yet a significant proportion of these changes do not appear to be functional or may only act to stabilize gene expression [32]. Another reason, which is still often overlooked, is cell-type heterogeneity [33]. This refers to the fact that most epigenome studies deriving biomarkers or molecular classifications of disease do so by measuring DNAm in a complex tissue composed of many different cell-types, each with its own characteristic DNAm profile.
This is true not only for tissues like lung, liver, or blood, but also when measuring cfDNAm in serum, as most of the cfDNA derives from lymphocytes [21, 26]. Thus, by only measuring an average DNAm profile over many underlying cell-types, it is difficult to ascertain whether DNAm changes associated with an exposure or outcome of interest are the result of DNAm changes in individual cell-types, and if so, in which cell-types, or whether they are merely the result of underlying changes in cell-type proportions (Fig. 4.1).

Fig. 4.1

Broad classes of DMCs. Epigenome studies performed in complex tissues (by definition these are composed of many different cell-types) only measure an average DNAm profile, and thus can only detect differentially methylated cytosines (DMCs) without knowledge of the underlying sources driving these DNAm changes. DMCs may result from shifts in cell-type proportions between cases and controls, or may result from DNAm changes in one or more cell-types (CT). All types of DMCs may be informative of say diagnosis or prognosis, but detection of cell-type specific DMCs is important to improve our understanding of the molecular pathways involved in disease development

Is it important, from a biological, clinical, or epidemiological perspective, to determine the source or nature of a DNAm change? In general, the answer to this question is yes: knowing in which cell-type (or cell-types) a DNAm change is occurring is critically important for understanding how putative functional consequences of the DNAm change may affect cell function and disease development. For instance, in the context of asthma, an EWAS for immunoglobulin-E concentrations in blood revealed associations that were later validated in isolated eosinophils, highlighting molecular pathways in a relevant cell-type that mediate allergic inflammation [34]. Another example is that of HAND2, a transcription factor that mediates the tumor-suppressive effects of progesterone in the endometrium: here promoter hypermethylation and silencing of HAND2 are observed in endometrial fibroblasts from precursor cancer lesions, which results in increased FGF paracrine signaling with endometrial epithelial cells, sensitizing these to oncogenic estrogen [35]. Another scenario where cell-type specificity may play an important role is in the context of mQTLs, i.e., SNP-CpG pairs where the methylation of the CpG is associated with genotype, with recent studies indicating that up to 20–30% of mQTLs derived in blood may be cell-type specific [36, 37]. Such cell-type-specific mQTLs could be informative of molecular pathways that contribute to disease predisposition in a cell-type-specific manner. A further example is aging, where the promoter of a given cell-specific transcription factor (TF) may become gradually hypermethylated with age, which could lead to irreversible silencing of the TF and to a skewed or impaired differentiation [38]. By contrast, the same DNAm change occurring in a cell-type where the TF is already switched off, say by a repressive H3K27me3 mark, would not have a functional consequence.

One solution to the cell-type heterogeneity problem would be to perform the epigenome study in purified cell-types, but this remains labor intensive and costly, and has mostly been carried out only in immune cells [36, 39,40,41] or to validate findings at a few selected loci [34]. Another potential solution is to generate DNAm data at the resolution of single cells [42,43,44]. However, this also remains costly, is unscalable to large numbers of individuals, and only generates very sparse data that cannot be used for building DNAm reference profiles for individual cell-types [44, 45]. Thus, there has been an ongoing attempt to address the cell-type heterogeneity problem computationally, by devising statistical algorithms (i) that can infer cell-type fractions and DMCs [46,47,48], (ii) that can infer latent components of variation and DMCs [49,50,51], and (iii) that can infer cell-type specific differential DNAm, i.e., differentially methylated cell-types (DMCTs) [52,53,54,55], all in the context of epigenome studies performed in complex tissues.

The computational complexity posed by the cell-type heterogeneity problem can be quite substantial: assuming a study is performed in whole blood (a tissue with seven main blood cell subtypes: neutrophils, eosinophils, monocytes, natural-killer, B-cells, CD4+, and CD8+ T-cells), and that one finds a DMC associated with a factor of interest, there could be at least \( 2^7 - 1 = 127 \) different combinatorial differential methylation (DM) patterns among the seven cell-types that could give rise to the observed DM at the whole tissue level. For instance, at one extreme, a DMC could be present in all cell-types of the tissue, while at another, it may only be present in one of the cell-types (Fig. 4.1). The above estimate does not even consider the potential combinatorial possibilities in terms of the directionality of DNAm change, which could involve increased (i.e., hyper) or decreased (i.e., hypo) methylation in different cell-types. Nor does the above estimate include scenarios where the DMC is driven purely by a change in cell-type composition. A well-known example of the latter is the increased myeloid to lymphocyte ratio that is observed in blood as a function of age [7, 56], cancer [57] or rheumatoid arthritis (RA) [58]. Shifts in cell-type composition are of course critically important in the context of cfDNAm in serum, since it is these shifts (e.g., increased circulating tumor DNA burden) that are informative of disease [26]. In other easily accessible tissues such as saliva or buccal swabs, the number of cell-types will be even larger than in blood, because these tissues contain squamous epithelial cells besides immune cells [59,60,61]. The complexity increases even further when we start to consider solid tissues, which in addition to immune cells, may contain different types of epithelial, endothelial, and fibroblast cells.
The full repertoire of cell-types within human tissues and organs is only now being elucidated thanks to major international efforts such as the Human Cell Atlas [62, 63]. For instance, some studies have estimated over 50 different cell-types in a tissue like lung [64, 65]. It should be noted, though, that the observed magnitude of a DNAm change may impose substantial constraints on the allowed cell-type-specific DM patterns. For instance, if we observe a close to 100% change in DNAm between cases and controls, then this can only be realized if the change is happening unidirectionally in all the major cell-types of the tissue. In general, it should be clear that the complexity of calling cell-type-specific DM can be at least 100-fold higher compared to calling DM [66].
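The counting argument for whole blood can be checked by brute-force enumeration. The following is a minimal Python sketch, not part of any published method: the function name `count_dm_patterns` and the cell-type labels are our own. As an additional illustration, it also counts the patterns obtained if each cell-type is allowed to be hypermethylated, hypomethylated, or unchanged, which gives \( 3^7 - 1 \) non-null patterns.

```python
from itertools import product

# Labels for the seven main blood cell subtypes (names are illustrative only).
CELL_TYPES = ["Neu", "Eos", "Mono", "NK", "B", "CD4T", "CD8T"]


def count_dm_patterns(n_cell_types, with_direction=False):
    """Count DM patterns in which at least one cell-type is altered.

    Without direction, each cell-type is either altered (1) or not (0).
    With direction, each cell-type is hypo (-1), unchanged (0) or hyper (+1).
    """
    states = (-1, 0, 1) if with_direction else (0, 1)
    return sum(1 for pattern in product(states, repeat=n_cell_types) if any(pattern))


k = len(CELL_TYPES)
print(count_dm_patterns(k))                       # 127 = 2**7 - 1
print(count_dm_patterns(k, with_direction=True))  # 2186 = 3**7 - 1
```

Neither count includes DMCs that arise purely from shifts in cell-type composition.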

Given this complexity, and given the inevitable limitations on the sample size of epigenome studies, it is understandable that statistical algorithms alone may not be able to fully address the above challenge. Nevertheless, as we shall see, statistical and computational methodology can help towards partial solutions or to solving the challenge in simpler scenarios, which can still be very informative and useful for disease diagnosis and early detection, for hypothesis generation or for devising validation experiments in purified cell-types. For instance, one way to simplify the problem is to consider a small number of “coarse” or “representative” cell-types, or only restrict to the main dominant cell-types within a tissue.

In the next sections, we shall describe some of the computational and statistical methods that have been proposed to estimate cell-type fractions and to detect cell-type-specific DNAm changes, as well as a number of applications where they have led to important novel insights.

2 General Considerations

In what follows, we shall refer to a cell-type specific differentially methylated cytosine as a differentially methylated cell-type, abbreviating this with “DMCT.” Given a DNAm matrix defined over cytosines and samples, and given a factor of interest (e.g., an exposure or disease status), the inference of DMCTs generally proceeds in two steps:

  1. First, we need to estimate the fractions for all cell-types in the tissue. Without knowledge of the proportions of each cell-type in a given tissue, it is not possible to infer in which cell-types putative DNAm changes are happening. Some cell-types in a tissue may be present in such low numbers that their estimation is not possible, and therefore in this step one aims to infer fractions for enough cell-types to account for at least 95% of the tissue composition. This is because the error rate of estimating cell-type fractions is generally about 5% [48].

  2. Second, the estimated cell-type fractions are used as covariates in a linear model relating the observed DNAm profile of a cytosine to a factor/exposure of interest and other covariates representing potential confounding factors (e.g., batch effects). Importantly, the cell-type fractions enter the equations not only as ordinary covariates, but also as part of interaction terms with the factor/exposure of interest. Later we shall explain why interaction terms capture DMCTs.
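As a preview of the second step, the following minimal Python sketch fits a linear model with interaction terms to simulated, noise-free data for a single CpG. Several choices here are our own assumptions rather than a description of any published implementation: the model is written without an intercept (because the fractions sum to one), the baseline levels `mu` and the case-control shifts `delta` are made-up values, and the helper functions `solve` and `ols` are written out only to keep the example self-contained. In a real analysis there would be noise and further covariates, and each interaction coefficient would be tested for significance.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


rng = random.Random(0)
K, n_samples = 3, 40
mu = [0.2, 0.5, 0.8]     # hypothetical baseline DNAm level of each cell-type
delta = [0.3, 0.0, 0.0]  # hypothetical shift in cases: a DMCT in cell-type 1 only

X, beta = [], []
for i in range(n_samples):
    raw = [rng.random() + 0.1 for _ in range(K)]
    f = [r / sum(raw) for r in raw]          # cell-type fractions, summing to 1
    y = i % 2                                # phenotype: 0 = control, 1 = case
    X.append(f + [fk * y for fk in f])       # fractions, then fraction x phenotype
    beta.append(sum(fk * (mu[k] + delta[k] * y) for k, fk in enumerate(f)))

coef = ols(X, beta)
mu_hat, delta_hat = coef[:K], coef[K:]
print([round(d, 3) for d in delta_hat])      # interaction coefficient per cell-type
```

In this toy setting, the coefficient of each interaction term recovers the DNAm change in the corresponding cell-type, which is why a nonzero interaction coefficient points to a DMCT.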

While in general only these two steps are required, in practice it is strongly advised to include two intermediate analyses in order to better interpret the overall results. The first, once we have estimated cell-type fractions, is to establish whether these fractions vary significantly in relation to our factor/exposure of interest. Indeed, if tissue composition changes, then the underlying shifts in cellular proportions could play an important causal role in disease development. A concrete example is the infiltration of CD8+ T-cells in triple-negative breast cancer, which is a well-known predictor of good outcome in this usually aggressive type of cancer [67]. Cell-type compositional changes could potentially also be used for disease risk prediction or diagnosis [21, 22, 68,69,70]. For instance, in the context of cfDNAm in serum, an increased proportion of DNA fragments reflecting tissue-specific cell death, as happens, for example, with pancreatic beta-cell death in type-1 diabetes or with exocrine cell death in pancreatic cancer, could be used for early detection or monitoring of therapy response [26]. A shift in tissue composition can also reflect a systemic effect of disease, as, for instance, with the observed increased myeloid to lymphoid ratio in the blood of cancer patients [57]. If such shifts occur before the conventional diagnosis of disease, they could be used for early detection or for quantifying disease risk.

The other intermediate step one is advised to perform is the inference of differentially methylated cytosines (DMCs). This is accomplished using the same linear model as for DMCT inference, but without the inclusion of interaction terms. In this model, cell-type fractions only enter as ordinary covariates alongside other potential confounders, the aim being to identify DMCs that are not driven by changes in cell-type composition or by any of the other potential confounders. It is important to stress that identifying such a DMC does not tell us anything about which cell-types the DNAm change is happening in, although there could be constraints on this depending on the observed effect size, as mentioned earlier. We shall see one concrete example of this later. Now we turn to the specific task of estimating cell-type fractions in complex tissues.
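As a brief illustration of the DMC model just described, the following Python sketch simulates a CpG whose DNAm level does not change in any cell-type, while the cases carry a larger fraction of one cell-type. All values (`mu`, the size of the composition shift) are made up, the no-intercept parametrization is our own assumption (the fractions sum to one), and `solve` and `ols` are small helper functions included only to keep the example self-contained.

```python
import random


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= factor * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(p)] for a in range(p)]
    Xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return solve(XtX, Xty)


rng = random.Random(1)
mu = [0.2, 0.5, 0.8]  # hypothetical cell-type DNAm levels, identical in cases and controls

X, beta, pheno = [], [], []
for i in range(40):
    y = i % 2                      # 0 = control, 1 = case
    raw = [rng.random() + 0.1 for _ in mu]
    if y == 1:
        raw[0] += 1.0              # cases carry a larger fraction of cell-type 1
    f = [r / sum(raw) for r in raw]
    X.append([y] + f)              # phenotype plus fractions as ordinary covariates
    beta.append(sum(fk * m for fk, m in zip(f, mu)))
    pheno.append(y)

cases = [b for b, y in zip(beta, pheno) if y == 1]
controls = [b for b, y in zip(beta, pheno) if y == 0]
naive_diff = sum(cases) / len(cases) - sum(controls) / len(controls)
adjusted_effect = ols(X, beta)[0]  # phenotype coefficient after adjusting for fractions
print(round(naive_diff, 3), round(adjusted_effect, 6))
```

The unadjusted case-control difference is clearly nonzero, whereas the phenotype coefficient vanishes once the fractions are included as covariates, so this CpG would not be called a DMC under the adjusted model.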

3 Estimating Cell-Type Fractions in Complex Tissues

As mentioned earlier, this task is required before we can infer DMCTs. And the reliability of the inference of DMCTs in a given study hinges on our ability to accurately estimate the underlying cell-type fractions. Currently, the best way to estimate cell-type fractions in a given sample for which a genome-wide DNAm profile is available, is through the construction of a DNA methylation reference matrix (DMRM) [71]: the columns of this matrix represent the cell-types within the tissue, with the rows representing marker CpGs (or marker genes) whose DNAm levels vary substantially between cell-types. Here we shall discuss three different strategies that have been applied to build a DMRM (Fig. 4.2). The choice of strategy largely depends on the tissue-type being considered, which is why the ensuing discussion is structured based on tissue-type.

Fig. 4.2

Strategies for building a DNAm reference matrix. Depicted are three strategies for building a DNAm reference matrix (DMRM). One approach (left branch) is based on FACS sorting cells and subsequently generating genome-wide DNAm profiles for these purified cell populations. Subsequently, one identifies DMCs between these purified cell populations to build the DMRM. Another strategy (middle) is to use existing DNAm profiles of relevant purified cell samples in the public domain to build the DMRM, once again by identifying DMCs between these purified samples. This strategy can be applied more broadly to solid tissues, but at the cost of lower cellular resolution. Another strategy (right) is to leverage the high-resolution nature of a tissue-specific scRNA-Seq atlas to build an mRNA expression reference matrix and to subsequently impute a corresponding tissue-specific DMRM. The cellular resolution of this strategy is high (only limited by the cell-types the scRNA-Seq assay can measure), but the quality of the imputation may not suffice to distinguish similar cell-types

3.1 Estimation of Cell-Type Fractions in Blood

We first consider the case of whole blood (WB), peripheral blood mononuclear cells (PBMCs) and cord blood (CB), because for these tissues, it is possible to generate genome-wide DNAm reference profiles for all major underlying cell-types. This can be accomplished by the use of well-known markers for blood cell subtypes (e.g., CD19 for B-cells), whose specificity is high enough to generate purified samples through, e.g., FACS sorting. Thus, in the case of WB, it has been possible to generate genome-wide DNAm profiles for all seven major cell subtypes, namely neutrophils, eosinophils, monocytes, B-cells, CD4+ T-cells, CD8+ T-cells, and natural-killer cells [72]. PBMCs are generally devoid of granulocytes, which include neutrophils, eosinophils, and basophils. For CB, a tissue of increasing importance in epigenetic epidemiology [73,74,75,76,77], it has also been possible to build tailored DNAm reference matrices [78, 79], as indeed it is well recognized that both the composition and cell-type-specific DNAm profiles are different to those of whole blood. In all these cases, once genome-wide DNAm profiles for purified samples have been generated, it is then possible to build corresponding DMRMs.

It is important to briefly describe how the DMRM should be constructed and indeed why it is necessary to use this DMRM for estimating cell-type fractions. If we have a genome-wide DNAm profile for a WB/PBMC/CB sample, represented by a vector \( \overrightarrow{\beta} \) defined over a large number of CpGs (around 450k or 850k, depending on the Illumina beadarray version), then it is sensible to express it as a linear mixture of corresponding genome-wide DNAm profiles for the constituent cell-types, as in:

$$ \overrightarrow{\beta}={\sum}_{k=1}^K{f}_k{\overrightarrow{\beta}}_k, $$

where \( k \) denotes the cell-type, \( K \) is the total number of different cell-types and \( f_k \) is the proportion/fraction of cells of type \( k \) in the mixture. These fractions are unknown, and we would like to infer them given the measured DNAm profile of the mixture and that of the constituent cell-types. Mathematically, the inference of these fractions is indeed possible, by formulating the above equation as a linear least squares multivariate regression problem, with one key difference, however, which is that the fractions must all be bounded between 0 and 1, and that their sum must add to 1 (or to a number less than 1 since in practice we cannot possibly know or have DNAm profiles for all underlying cell-types in a tissue). These additional constraints on the regression problem turn the ordinary least squares regression (LSR) into a constrained LSR or constrained projection (CP) problem, which can be solved using techniques in quadratic programming (QP) [46]. While it might be tempting to solve this CP problem using all available CpGs, this is not advisable because a large part of the genome will not differ between blood cell subtypes. In other words, uninformative CpGs that show little variability between blood cell subtypes generally do not contribute to the inference, and it is therefore sensible to exclude them. Computationally, solving the CP problem over a small number of informative CpGs will also speed up the estimation of cell-type fractions very substantially, which is an important consideration. Thus, once the DNAm profiles for the purified cell-types have been generated, the next step is to identify the informative CpGs. There are several strategies to accomplish this [46, 48, 78, 79], but in general, they involve a process of identifying cell-type specific DMCs, i.e., CpGs that are highly methylated (or unmethylated) in one cell-type, with correspondingly low (or high) DNAm values in all other cell-types.
Ideally, one desires a reasonable number of cell-type specific DMCs for each cell-type, which ensures robustness to potential “dropouts” in independent datasets where the fractions are to be estimated. By dropout we here mean a probe that does not pass QC in the independent dataset and which therefore cannot be used in the inference procedure. Ideally, one would also want to ensure equal or approximately equal numbers of cell-type-specific markers for each cell-type, because this avoids biasing the inference toward those cell-types with more markers. And thirdly, one would ideally only include cell-type-specific DMCs where the difference in DNAm between the marker cell-type and all others is as large as possible, typically over 0.7 on the beta value [0,1] scale, i.e., at least a 70% methylation change. This number is motivated by the heavy use of Illumina DNAm beadarray technologies, where the two main modes (unmethylated and methylated peaks) are generally about 0.7 methylation units apart, with a value of 1 being the theoretical maximum. In practice, it may not be possible to meet all three requirements. For instance, for similar cell-types (e.g., CD4+ and CD8+ T-cells) it might be difficult to find many DMCs with over 70% DNAm difference between them, and indeed accurately estimating the relative fractions of highly similar cell-types still remains a challenge. Mathematically, the quality of a DNAm reference matrix can be expressed in terms of its condition number [79], which is an indicator of how robust the inference would be. Based on extensive experience estimating cell-type fractions, we advise the following guidelines for selecting DMCs for a reference DNAm matrix. We note that these guidelines are not unique to DNAm but apply equally well to other data types (e.g., RNA-Seq):

  1. Perform DM analysis between one cell-type (the “marker” cell-type) against all others (as one group) to identify DMCs using some sensible significance threshold (e.g., FDR < 0.05).

  2. For these DMCs, compare the mean DNAm level in the marker cell-type to the highest (or lowest) mean DNAm level from among all other cell-types, to then subselect DMCs that have an effect size (i.e., difference in mean DNAm) larger than 0.7 in absolute value.

  3. For each cell-type, rank these DMCs according to the absolute effect size and select a given top number of these. We recommend about 50 for each cell-type, if possible.

  4. If in step-2 or step-3 there are not enough DMCs for a given cell-type, then the corresponding effect size threshold could be relaxed (for instance, one could go as low as 0.3 or 0.4 in absolute value), or alternatively, one may select a smaller number of top-ranked DMCs in step-3. However, as mentioned, due to QC-issues in independent data, we do not recommend that the final number of cell-type specific DMCs per cell-type should be less than 10.
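Steps 2 to 4 of these guidelines can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the procedure of any particular package: the function name `select_markers`, its parameters, and the grid of relaxed thresholds are our own choices; the significance filter of step 1 is assumed to have been applied already; and the toy mean beta values are made up. In the toy example, `top_n` and `min_markers` are set far below the recommended 50 and 10 only because there are six CpGs.

```python
def select_markers(means, top_n=50, thresholds=(0.7, 0.6, 0.5, 0.4, 0.3), min_markers=10):
    """Select cell-type-specific marker CpGs from mean DNAm levels.

    means: dict mapping cell-type -> list of mean beta values, one per candidate
           CpG (candidates are assumed to have passed the step-1 significance test).
    Returns a dict mapping cell-type -> list of CpG indices ranked by effect size.
    """
    cell_types = list(means)
    n_cpg = len(next(iter(means.values())))
    markers = {}
    for ct in cell_types:
        effects = []
        for i in range(n_cpg):
            others = [means[o][i] for o in cell_types if o != ct]
            m = means[ct][i]
            if m > max(others):        # hypermethylated marker: compare to highest other
                eff = m - max(others)
            elif m < min(others):      # hypomethylated marker: compare to lowest other
                eff = m - min(others)
            else:                      # not extreme in this cell-type: not a marker
                eff = 0.0
            effects.append((abs(eff), i))
        effects.sort(key=lambda t: (-t[0], t[1]))  # largest effect first, ties by index
        # Step 4: relax the threshold stepwise until enough markers are found.
        # If no threshold yields enough, the result for the last threshold is kept.
        for thr in thresholds:
            picked = [i for e, i in effects if e >= thr][:top_n]
            if len(picked) >= min_markers:
                break
        markers[ct] = picked
    return markers


# Toy example: three cell-types, six candidate CpGs (hypothetical mean beta values).
means = {
    "B":    [0.95, 0.90, 0.10, 0.12, 0.50, 0.15],
    "NK":   [0.05, 0.15, 0.92, 0.11, 0.50, 0.10],
    "Mono": [0.08, 0.10, 0.09, 0.10, 0.52, 0.60],
}
markers = select_markers(means, top_n=2, min_markers=1)
print(markers)  # {'B': [0, 1], 'NK': [2], 'Mono': [5]}
```

Here the marker for the third cell-type is only found after relaxing the threshold to 0.4, which mirrors the situation described in step 4.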

Once the cell-type-specific DMCs have been identified, the DMRM is then built by taking the average or median DNAm of these CpGs in each cell-type separately. We note that although theoretically, one should take the average, in practice, taking the median is also justified as it may provide a more robust estimator. For K cell-types, the DMRM would then ideally be a matrix with K*50 rows labeling the cell-type-specific DMCs, and K columns labeling the cell-types. With this DMRM in place, we would then estimate cell-type fractions by solving the following CP problem:

$$ \overrightarrow{\beta}={\sum}_{k=1}^K{f}_k{\overrightarrow{\beta}}_k^{(R)}+\overrightarrow{\epsilon}\kern0.5em \mathrm{with} $$
$$ 0\le {f}_k\le 1\ \mathrm{and}\ {\sum}_{k=1}^K{f}_k\le 1 $$
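To make the CP problem concrete, the following Python sketch estimates fractions for a noise-free toy mixture. Note that it uses projected gradient descent rather than the quadratic programming solvers used in the cited methods; this substitution is ours, made only to keep the example dependency-free. The function names, the toy DMRM `R` (two hypothetical marker CpGs per cell-type), and the true fractions are all assumptions for illustration.

```python
def project(f):
    """Euclidean projection onto the set {f : f_k >= 0, sum_k f_k <= 1}."""
    clipped = [max(v, 0.0) for v in f]
    if sum(clipped) <= 1.0:
        return clipped
    # Sum constraint is active: project onto the probability simplex (sort-based method).
    u = sorted(f, reverse=True)
    cumsum, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumsum += uj
        t = (cumsum - 1.0) / j
        if uj - t > 0:
            theta = t
    return [max(v - theta, 0.0) for v in f]


def estimate_fractions(beta, R, n_iter=5000):
    """Minimize ||beta - R f||^2 subject to f_k >= 0 and sum_k f_k <= 1.

    beta: DNAm values of the mixture at the marker CpGs.
    R:    DMRM as a list of rows (marker CpGs) by columns (cell-types).
    """
    K = len(R[0])
    # Step size 1/L, where the squared Frobenius norm of R bounds the
    # largest eigenvalue of R'R (the Lipschitz constant of the gradient).
    step = 1.0 / sum(v * v for row in R for v in row)
    f = [1.0 / K] * K
    for _ in range(n_iter):
        resid = [sum(r * fk for r, fk in zip(row, f)) - b for row, b in zip(R, beta)]
        grad = [sum(row[k] * e for row, e in zip(R, resid)) for k in range(K)]
        f = project([fk - step * g for fk, g in zip(f, grad)])
    return f


# Toy DMRM: two hypothetical marker CpGs per cell-type (0.9 in the marker cell-type).
R = [
    [0.9, 0.1, 0.1], [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1], [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9], [0.1, 0.1, 0.9],
]
true_f = [0.6, 0.3, 0.1]
beta = [sum(r * t for r, t in zip(row, true_f)) for row in R]  # noise-free mixture

f_hat = estimate_fractions(beta, R)
print([round(v, 3) for v in f_hat])
```

With real data the mixture profile is noisy and the DMRM is far larger, so a dedicated QP or robust regression routine would normally be preferred.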

DMRMs for blood and cord blood are available from various Bioconductor R-packages, including minfi [80], EpiDISH [48] and FlowSorted.CordBlood.450k [81].

In the case of tissues like blood, it is also possible to obtain cell-type fractions using magnetic flow cytometric techniques (e.g., MACS), which thus allows objective testing of the DNAm-based estimates. This assumes of course that the sample taken for the DNAm-assay and the sample used to obtain MACS-based estimates are taken at the same timepoint, as blood cell fractions are known to vary with time [82,83,84]. In general, the agreement of estimated cell-type fractions obtained using a DMRM with those derived with MACS is very good (Fig. 4.3).
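Agreement of this kind is typically summarized by the squared correlation between the two sets of estimates for each cell-type. A minimal Python sketch follows; the function name `r_squared` is our own, and the five paired values are made up purely to show the calculation (they are not taken from any dataset).

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


# Hypothetical neutrophil fractions for five samples (illustrative values only).
dnam_based = [0.55, 0.62, 0.48, 0.70, 0.58]
flow_based = [0.53, 0.65, 0.50, 0.68, 0.60]
print(round(r_squared(dnam_based, flow_based), 2))
```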

Fig. 4.3

Agreement between DNAm and flow cytometry-based cell fraction estimates in whole blood. For each of the six major blood cell subtypes, a scatterplot of the flow cytometric cell-type fraction estimate (y-axis) vs. the DNAm-based estimate (x-axis). The dashed lines represent the best fit regression line. \( R^2 \) and P-values are given. Dataset consists of 162 whole blood samples from healthy individuals

Fig. 4.4

Identification of DMCTs. (a) Overall strategy to infer DMCTs involves a step where we infer cell-type fractions for each sample, followed by fitting a linear model with interaction terms between phenotype and cell-type fractions to infer a map over CpGs and cell-types indicating which CpGs are altered in which cell-types. (b–d) Three examples of CpGs that define DMCTs in one cell-type (CT1) (b), all cell-types (CT1–3) (c) and two cell-types where the direction of DNAm change is different (CT1–2) (d). The scatterplots display the adjusted DNAm beta value against the corresponding cell-type fraction, with red datapoints indicating “cases,” blue datapoints indicating “controls.” Hyper = hypermethylated in cases, Hypo = hypomethylated in cases

3.2 Estimation of Cell-Type Fractions in Saliva and Buccal Swabs

Like blood, saliva and buccal tissue can be obtained fairly easily and cheaply, allowing noninvasive measurement of DNA methylation in a tissue that contains squamous epithelial cells in addition to immune cells [59,60,61]. The reason for treating saliva and buccal swabs separately from blood and solid tissue-types is that it is relatively straightforward to build DMRMs for these two tissue-types. Since they contain immune cells, the same DNAm profiles of immune cell subtypes in blood can be used for building the DMRM for saliva/buccal swabs. The only additional requirement is to generate a DNAm reference profile for squamous epithelial cells. However, since effectively most epithelial cells in saliva/buccal swabs are squamous in origin, and there is therefore no need to distinguish different epithelial cell-types from each other, a DNAm reference profile for any epithelial cell-type may suffice. For instance, ENCODE [85] and the NIH Epigenomics roadmap [86, 87] have generated genome-wide DNAm profiles for many different normal epithelial cell lines and primary epithelial cells. By using different types of epithelial samples one can thus hone in on the generic DNAm patterns of any epithelial sample, including the squamous epithelial cells from the oral cavity. Thus, for these tissues and given genome-wide DNAm profiles for epithelial and immune cell subtypes, one can apply the same DMRM-construction strategy outlined above, i.e., by first deriving cell-type-specific DMCs, from which the DMRM is then built. Estimation of cell-type fractions then proceeds exactly as described before for the case of blood.

A slight variation on the above procedure is obtained by recognizing that cell-types within a tissue exhibit markedly different levels of similarity, depending on their shared developmental trajectories. Thus, one expects substantially more DMCs between epithelial and immune cells than between CD4+ and CD8+ T-cells. In recognition of this, an algorithm called HEpiDISH (Hierarchical Epigenetic Dissection of Intra-Sample Heterogeneity) [59] uses a hierarchical 2-step strategy with two DNAm reference matrices, one aimed at obtaining a total epithelial and total immune cell fraction, and a second one to obtain fractions for all immune cell subtypes. Because we will revisit HEpiDISH in the context of solid tissue-types, we defer the details to the next subsection.

3.3 Estimation of Cell-Type Fractions in Solid Tissues

The most challenging scenario is to estimate cell-type fractions in a solid tissue-type. This is because the number of cell-types is obviously much higher than in blood (there will be different kinds of fibroblasts, endothelial, and epithelial cell subtypes, besides immune cells), but also because this number is still generally unknown and generating DNAm profiles for even just a few of these cell-types is technically challenging. This means that for most cell-types in a solid tissue, there are no available DNAm reference profiles. This is despite efforts from the Epigenomics Roadmap [86] to generate a number of these (e.g., kidney podocytes, breast myoepithelial, breast luminal cells). Given these major challenges, how does one then aim to estimate cell-type fractions in a solid tissue like breast or lung where the number of cell-types will be substantially higher than 10 and for which reference DNAm profiles may not be available for specific cell-types?

In principle, one may be tempted to apply reference-free cell-type deconvolution algorithms [49, 50, 88], which aim to infer latent (i.e., hidden/unknown) sources of variation in the data, some of which may correspond to variations in cell-type fractions. However, reference-free algorithms cannot yield direct cell-type fraction estimates and are therefore limited as far as DMCT inference is concerned, their application being primarily to infer DMCs not driven by changes in cell-type composition [49]. Given that biotechnology will eventually advance to a level that makes single-cell DNA methylomics more reliable, affordable, and scalable, thus allowing routine generation of DNAm reference matrices, it is sensible to focus on reference-based approaches as we are doing here.

So far, two different reference-based strategies have been developed. One approach is based on the HEpiDISH algorithm [59], mentioned earlier. This uses a two DMRM strategy, whereby in the first step, the algorithm uses one DMRM to estimate a total epithelial, a total fibroblast and a total immune cell fraction. While solid tissues like lung or breast contain other cell-types (e.g., endothelial cells), the lack of sufficient reference DNAm profiles for purified endothelial cells (to allow for training and validation) means that these may be challenging to include. In the case of breast, adipocytes form a major component in addition to epithelial, fibroblast, and immune cells, and for adipocytes, reference DNAm profiles are available in the public domain to allow construction and independent validation of a 4 cell-type DNAm reference matrix, defined over a generic epithelial, fibroblast, fat, and immune cell [59]. In the second optional step, HEpiDISH then estimates fractions for the immune cell subtypes. This is accomplished with a second DMRM defined over CpGs that discriminate the different immune cell subtypes from each other. Importantly, for each of these CpGs, it is also required that their baseline DNAm level, i.e., the level of DNAm seen across most of the immune cell subtypes, is similar to that of the epithelial, fibroblast, and fat cells, to ensure that variations in these other cell-type fractions would not bias the relative fractions of immune cell subtypes [59]. Thus, this second DMRM is distinct from the one used when inferring cell-type fractions in WB.

Another strategy is based on the EpiSCORE algorithm [89, 90]. EpiSCORE leverages the high-resolution nature of a tissue-specific scRNA-Seq atlas to first construct an mRNA expression reference matrix for all major cell-types in the tissue. This assumes that the scRNA-Seq assay has captured the most important cell-types in the tissue, which is not always the case: for instance, in the case of breast, fat cells are abundant but they are large cells that are often missed by current scRNA-Seq protocols [91]. Given the mRNA expression reference matrix, EpiSCORE then imputes a corresponding DMRM, defined over the promoters of a subset of the expression marker genes and the same number of cell-types. Of note, this imputation only works for 20–30% of the marker genes in the expression reference matrix, because for most marker genes there is no strong anti-correlative pattern between promoter DNAm and gene expression. The smaller number of “imputable” marker genes, for which there is such a strong anti-correlation, are identified using matched RNA-Seq and whole-genome bisulfite sequencing (WGBS) data from resources like the NIH Epigenomics Roadmap [86] and ENCODE [92, 93]. Unlike HEpiDISH, EpiSCORE can then yield cell-type fractions for all cell-types in the tissue in one step, by applying a multivariate linear model analogous to the one described earlier for blood.
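The idea of screening for “imputable” marker genes can be illustrated with a simple correlation filter. The Python sketch below is only a caricature of that idea and should not be read as the EpiSCORE procedure itself: the function names, the cutoff of −0.8, the gene labels, and the matched expression and promoter DNAm values are all hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def imputable_genes(matched, max_r=-0.8):
    """Keep marker genes whose promoter DNAm is strongly anti-correlated with expression.

    matched: dict gene -> (expression values, promoter DNAm values), measured
             in the same samples. max_r is an illustrative cutoff, not a
             published threshold.
    """
    return [g for g, (expr, dnam) in matched.items() if pearson(expr, dnam) <= max_r]


# Hypothetical matched data for three marker genes across four samples.
matched = {
    "GENE_A": ([0.1, 2.0, 5.0, 8.0], [0.9, 0.7, 0.3, 0.1]),      # strong anti-correlation
    "GENE_B": ([1.0, 3.0, 2.0, 4.0], [0.50, 0.55, 0.45, 0.50]),  # no anti-correlation
    "GENE_C": ([4.0, 1.0, 3.0, 2.0], [0.2, 0.8, 0.6, 0.3]),      # moderate anti-correlation
}
print(imputable_genes(matched))  # ['GENE_A']
```

Only genes passing such a filter would contribute promoters to the imputed DMRM, which is consistent with the observation that a minority of expression markers are retained.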

3.4 Estimation of Cell-Type Fractions from cfDNAm in Serum

For completeness, we also describe briefly methods for estimating cell-type fractions from cfDNAm in serum, framing this in the context of cancer where such methods offer particular promise. Broadly speaking, the underlying strategy has been to build analogs of DMRMs by identifying informative CpGs (i.e., DMCs) from comparisons between DNAm measured in serum or blood from healthy individuals and DNAm measured in solid cancer-types. Since the cfDNA in plasma from healthy individuals stems predominantly from lymphocytes, such differential DNAm analyses naturally adjust for the immune cell contamination that is inevitably present in solid cancers [79, 94], thus identifying DNAm changes that are unique to the non-immune cells (e.g., epithelial, fibroblast cells) present in cancer-tissue. An algorithm called Cancer Locator [95] then builds a DMRM defined over a healthy and multiple cancer-type states, subsequently applying this DMRM in a beta-mixture modeling maximum likelihood framework to infer both tumor fraction and tumor-type. An important element of this approach has been the explicit modeling of variance in addition to the mean when building the DMRM [95]. Another strategy has been to perform targeted bisulfite sequencing of informative regions in case/control cohorts, using Machine Learning (ML) methods on a training set to build predictors of cancer-type, which are subsequently validated in independent plasma samples [21, 22]. Here, the ML method returns a probability measure that an individual has a particular cancer-type, which can be interpreted roughly as a relative measure of the corresponding tumor cell burden in the serum sample. Overall, these strategies have shown that sensitive detection, i.e., with sensitivities and specificities close to 90% or higher, is possible for a wide range of common tumor types (e.g., lung, liver, esophageal cancer) [21, 95], in some cases even up to 4 years before conventional diagnosis [22]. 
In this regard, it is worth noting that the cell-type specificity and meta-stability of DNAm offer significant advantages over other molecular data types in distinguishing the tissue-of-origin, as demonstrated by many studies (see, e.g., [96,97,98,99]). However, challenges remain in that informative features are not yet selected optimally. For instance, given that DNAm alterations are widespread and shared between cancer-types [100, 101], or that they could reflect alterations in other cellular compartments (e.g., fibroblasts or endothelial cells), it is plausible that these confounders could limit performance when validating in independent cohorts. Thus, cell-type deconvolution methods for solid tissues as described earlier should be an ideal starting point from which to identify an optimal set of informative DMCs that are unique to the cells of origin of each cancer-type.
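The mixture idea behind inferring tumor fraction and tumor-type from cfDNAm can be sketched as follows. A plasma profile is modeled as a weighted average of a healthy profile and a cancer-type profile, and the cancer-type and tumor fraction that maximize the likelihood are reported. This is a deliberately simplified stand-in, not the Cancer Locator method: it uses a Gaussian likelihood with a fixed standard deviation rather than a beta model with CpG-specific variance, a grid search rather than a dedicated optimizer, and invented reference profiles. The function `infer_tumor` and all numerical values are hypothetical.

```python
import numpy as np

def infer_tumor(beta, healthy, tumor_refs, sd=0.05, n_grid=101):
    """Grid-search maximum likelihood over cancer-type and tumor fraction theta.

    Model (simplified): beta_c ~ Normal((1 - theta) * healthy_c + theta * tumor_c, sd).
    Returns the best (cancer_type, theta) pair.
    """
    best_name, best_theta, best_ll = None, 0.0, -np.inf
    for name, tumor in tumor_refs.items():
        for theta in np.linspace(0.0, 1.0, n_grid):
            mu = (1 - theta) * healthy + theta * tumor
            loglik = -0.5 * np.sum(((beta - mu) / sd) ** 2)
            if loglik > best_ll:
                best_name, best_theta, best_ll = name, theta, loglik
    return best_name, best_theta

# Invented reference profiles over 5 informative CpGs
healthy = np.array([0.1, 0.1, 0.9, 0.9, 0.5])
tumor_refs = {
    "lung": np.array([0.8, 0.1, 0.9, 0.2, 0.5]),
    "liver": np.array([0.1, 0.9, 0.1, 0.9, 0.5]),
}

# Simulated plasma sample in which 20% of the cfDNA derives from a lung tumor
sample = 0.8 * healthy + 0.2 * tumor_refs["lung"]

cancer_type, tumor_fraction = infer_tumor(sample, healthy, tumor_refs)
print(cancer_type, tumor_fraction)
```

In this toy setting a tumor fraction of zero corresponds to the healthy state for every cancer-type, so a sample would only be assigned a cancer-type when a non-zero fraction improves the fit.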

4 Inferring Cell-Type-Specific Differential DNA Methylation (DMCTs)

Once we have inferred the cell-type fractions, we are now in a position to infer DMCTs. Here we shall describe one proposed DMCT-calling algorithm, the CellDMC algorithm [52]. Another very similar algorithm that subsumes CellDMC is TOAST [54]. Assuming for the time being that a CpG is altered in only one cell-type of the tissue, CellDMC/TOAST is based on the intuitive notion that the difference in DNAm between case/control status (for convenience we here assume a binary factor of interest, but similar arguments apply to a factor of interest that is continuously-valued, e.g., age or smoking exposure), would be bigger in those samples where the altered cell-type is more abundant. At the other extreme, i.e., in samples where the altered cell-type is only present in small numbers or not present at all, the difference in DNAm will correspondingly be much smaller. Extending this to more general scenarios, it is therefore plausible that by studying the patterns of DNAm change as a function of cell-type fractions, one can identify DMCTs (Fig. 4.4).

Statistically, the dependence of DNAm change on cell-type fractions can be captured using linear interaction terms, in this case, interactions between the factor of interest and cell-type fractions. From a modeling perspective, these interaction terms also emerge naturally: for a given cytosine c in a sample s, the methylation value \( {\beta}_{cs} \) will be given by the formula:

$$ {\beta}_{cs}={\sum}_{k=1}^K{\hat{f}}_{ks}{\beta}_{cks}, $$

where \( {\hat{f}}_{ks} \) are the estimated cell-type fractions and where \( {\beta}_{cks} \) denotes the DNAm value in cell-type k in sample s. It is natural to assume that this value follows a linear model in relation to the factor of interest (denoted here by y),

$$ {\beta}_{ck s}={\alpha}_{ck}+{\gamma}_{ck}{y}_s+{\varepsilon}_{cs}, $$

which means that the expression above becomes

$$ {\beta}_{cs}={\sum}_{k=1}^K{\alpha}_{ck}{\hat{f}}_{ks}+{\sum}_{k=1}^K{\gamma}_{ck}\left({\hat{f}}_{ks}{y}_s\right)+{\varepsilon}_{cs}, $$

where \( {\varepsilon}_{cs} \) is a Gaussian error term. The second term in brackets is the linear interaction term between cell-type fraction and our factor of interest. If there are covariates or confounders that affect DNAm independently of cell-type (e.g., batch effects), and which we represent as \( {W}_q \), it is easy to show that the above formula becomes

$$ {\beta}_{cs}={\sum}_{k=1}^K{\alpha}_{ck}{\hat{f}}_{ks}+{\sum}_{k=1}^K{\gamma}_{ck}\left({\hat{f}}_{ks}{y}_s\right)+{\sum}_{q=1}^Q{\varphi}_{cq}{W}_{qs}+{\varepsilon}_{cs} $$

We note that here the cell-type fractions have already been estimated and that they must sum to 1, which means that implicit in the equation above, there is an intercept term, and a linear term in y (without interactions). The regression coefficients can be estimated by linear LSR, which returns t-statistics and P-values for the interaction terms, i.e., for assessing if there are DMCTs or not. It is important to note that the above model is a marginal conditional model, i.e., one aims to determine if a CpG is altered in a given cell-type conditioned on all other cell-types. This is to be contrasted with the marginal (unconditional) model for cell-type k

$$ {\beta}_{cs}={\sum}_{k=1}^K{\alpha}_{ck}{\hat{f}}_{ks}+{\gamma}_{ck}\left({\hat{f}}_{ks}{y}_s\right)+{\sum}_{q=1}^Q{\varphi}_{cq}{W}_{qs}+{\varepsilon}_{cs}, $$

where we estimate an interaction between the factor and the given cell-type fraction, ignoring all other potential interaction terms. Incidentally, the model for inferring ordinary DMCs would be

$$ {\beta}_{cs}={\sum}_{k=1}^K{\alpha}_{ck}{\hat{f}}_{ks}+{\gamma}_c{y}_s+{\sum}_{q=1}^Q{\varphi}_{cq}{W}_{qs}+{\varepsilon}_{cs} $$
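To make the marginal conditional model concrete, the following Python sketch fits it to one simulated CpG by ordinary least squares and returns the estimates and t-statistics of the interaction coefficients. It is a minimal illustration under stated assumptions, not the CellDMC or TOAST implementation: the function `fit_interaction_model`, the two-cell-type simulation, and all parameter values are invented, and covariates are omitted for brevity.

```python
import numpy as np

def fit_interaction_model(beta, fractions, y):
    """OLS fit of beta_s = sum_k alpha_k f_ks + sum_k gamma_k (f_ks * y_s) + error.

    beta: (n,) DNAm values of one CpG; fractions: (n, K) estimated fractions;
    y: (n,) factor of interest. Returns gamma estimates and their t-statistics.
    """
    n, K = fractions.shape
    # No explicit intercept or main effect of y: since the fractions sum to 1,
    # both are implicit in the fraction and interaction columns.
    X = np.hstack([fractions, fractions * y[:, None]])
    coef, *_ = np.linalg.lstsq(X, beta, rcond=None)
    resid = beta - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    t = coef / se
    return coef[K:], t[K:]

# Simulated example: two cell-types, only the first is altered in cases
rng = np.random.default_rng(0)
n = 400
f1 = rng.uniform(0.2, 0.8, n)
fractions = np.column_stack([f1, 1 - f1])
y = np.repeat([0.0, 1.0], n // 2)          # controls, then cases
alpha = np.array([0.2, 0.7])               # baseline DNAm per cell-type
gamma_true = np.array([0.3, 0.0])          # DNAm change per cell-type
beta = (fractions @ alpha
        + (fractions * y[:, None]) @ gamma_true
        + rng.normal(0.0, 0.03, n))

gamma_hat, t_gamma = fit_interaction_model(beta, fractions, y)
print(gamma_hat, t_gamma)
```

The marginal (unconditional) model for a given cell-type would keep only that cell-type's interaction column, and the ordinary DMC model would replace all interaction columns by the single column y.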

Later we present a power calculation to indicate that the marginal conditional model can display in certain scenarios limited sensitivity to detect DMCTs that occur in all cell-types. Thus, our guideline is the following 2-step strategy:

  1. First, we infer ordinary DMCs adjusting for cell-type fractions, i.e., by adding cell-type fractions as covariates in the linear model. This ensures that DMCs are not driven by changes in cell-type proportion, while also keeping high sensitivity to detect most alterations, albeit without knowledge of which specific cell-types are altered.

  2. Second, we run the marginal conditional model to identify DMCTs. Any DMCs called in step 1 that are not found to be DMCTs in step 2 most likely indicate CpGs that are altered in all cell-types.
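The two-step strategy above can be sketched in Python as follows. The function `two_step_call` and the helper `ols_t` are hypothetical, the simulated data are invented, and the stringent cutoff of |t| ≥ 5 is merely an illustrative stand-in for a proper multiple-testing correction; covariates are again omitted.

```python
import numpy as np

def ols_t(X, beta):
    """t-statistics of all coefficients in an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(X, beta, rcond=None)
    resid = beta - X @ coef
    sigma2 = resid @ resid / (len(beta) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef / se

def two_step_call(beta, fractions, y, t_crit=5.0):
    """Return (is_dmc, altered_cell_types) for one CpG."""
    K = fractions.shape[1]
    # Step 1: ordinary DMC model, with cell-type fractions as covariates
    t_dmc = ols_t(np.column_stack([fractions, y]), beta)[-1]
    if abs(t_dmc) < t_crit:
        return False, []
    # Step 2: marginal conditional model with fraction-by-factor interactions
    t_int = ols_t(np.hstack([fractions, fractions * y[:, None]]), beta)[K:]
    return True, [k for k in range(K) if abs(t_int[k]) >= t_crit]

# Simulated two-cell-type tissue, half controls and half cases
rng = np.random.default_rng(1)
n = 400
f1 = rng.uniform(0.2, 0.8, n)
fractions = np.column_stack([f1, 1 - f1])
y = np.repeat([0.0, 1.0], n // 2)
base = fractions @ np.array([0.2, 0.7])

def simulate(gamma):
    """One CpG whose DNAm change in each cell-type is given by gamma."""
    return base + (fractions * y[:, None]) @ np.array(gamma) + rng.normal(0.0, 0.03, n)

specific = two_step_call(simulate([0.3, 0.0]), fractions, y)  # altered in cell-type 0 only
null = two_step_call(simulate([0.0, 0.0]), fractions, y)      # not altered
shared = two_step_call(simulate([0.3, 0.3]), fractions, y)    # altered in both cell-types
print(specific, null, shared)
```

A result of the form `(True, [])`, i.e., a DMC from step 1 with no DMCT from step 2, would be the case that the guideline interprets as most likely reflecting a change shared by all cell-types.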

5 Applications

We next describe a number of concrete examples where the CellDMC algorithm has been applied to real epigenome datasets, not only validating the algorithm, but also demonstrating how novel insights can be gained by inferring DMCTs as opposed to just DMCs.

5.1 Identification of Smoking-Associated DMCTs in Buccal Swabs

A compelling way to validate the CellDMC algorithm is by performing an EWAS in relation to smoking exposure in buccal swabs. To understand why, we first note that buccal swabs contain on average about 50% immune cells and 50% squamous epithelial cells [59], i.e., two main cell-types. Second, many smoking-EWAS performed in blood have shown that there is a highly reproducible signature of smoking-associated DNAm changes. For instance, a recent meta-analysis identified a gold-standard set of 62 smoking-associated CpGs, the great majority of which undergo hypomethylation in the blood of smokers [2]. This includes a well-known CpG mapping to the repressor of the aryl-hydrocarbon receptor (AHRR). A more recent study encompassing a much larger number of samples [102] further validated these 62 smoking-DMCs, while also identifying a larger number of alterations. Thus, it is reasonable to posit that in a smoking-EWAS performed in buccal swabs, we should be able to retrieve this 62-CpG smoking hypomethylation signature in the immune cell compartment of the tissue. The result of applying CellDMC to an EWAS of 790 buccal swabs [18] is shown in Fig. 4.5a, which demonstrates that the algorithm indeed predicts most of the 62 CpGs to be hypomethylated in the immune cells from smokers. Interestingly, the algorithm also predicts that these alterations are not happening in the squamous epithelial compartment of buccal swabs, suggesting that these DNAm alterations do not occur in the cells of origin of smoking-related diseases like squamous cell carcinoma of the lung or head and neck. Instead, it has been shown that the DMCTs occurring in the epithelial compartment may be more relevant in mediating the risk of smoking in these cancer-types [18, 59, 103, 104].

Fig. 4.5

Examples of DMCTs. (a) Scatterplot of t-statistics of association between smoking and DNA methylation (~480,000 CpGs) in immune cells (y-axis, IC) vs. the corresponding statistics in the epithelial cells (x-axis, Epi), as predicted by CellDMC in an EWAS of buccal swabs (n = 790 samples). CpGs belonging to a known 62 CpG hypomethylation smoking signature are highlighted in red. (b) Patterns of cell-type-specific differential DNAm for HAND2 (promoter & first exon region) in endometrial cancer, as predicted by CellDMC. Boxplot on the lower right corner is the differential DNAm pattern for whole tissue, i.e., averaged over all cell-types. (c) Distribution of DMCTs in an application of CellDMC to lung squamous cell carcinoma (LSCC) (TCGA dataset). Scatterplots at the bottom illustrate two DMCT examples: an epithelial and an endothelial-specific hypermethylated CpG. N = normal, C = cancer. (d) Scatterplots summarizing results of the GSEA of epithelial and endothelial-specific DMCTs. y-axis labels the statistical significance (−log10[P-value]), x-axis labels the log2[OddsRatio]. P-value and OR computed from a Fisher-test.

5.2 HAND2 Hypermethylation in Endometrial Cancer

Another insightful application is to endometrial cancer. Endometrial tissue is composed mainly of epithelial, fibroblast, and immune cells. As mentioned in an earlier section, the promoter of HAND2, a transcription factor that mediates the tumor-suppressive effects of progesterone, is hypermethylated in the endometrial fibroblasts of precursor lesions, as well as in endometrial cancer itself, events that are associated with silencing of HAND2 [35]. HAND2’s promoter also undergoes hypermethylation in blood as a function of age [32]. Thus, application of CellDMC to, say, the TCGA endometrial cancer study (403 cancers + 46 normal-adjacent) [105] should reveal that HAND2 is hypermethylated in fibroblasts and possibly also in the immune cells present within the endometrial tissue, because the average age of the 403 endometrial cancer patients is higher than that of the 46 individuals contributing normal-adjacent samples. Moreover, the average DNAm change between endometrial cancer and normal-adjacent samples is relatively large at around 0.6, a strong indication that HAND2 undergoes promoter hypermethylation in all main cell-types, including the epithelial compartment. Thus, it is not surprising that when applied to the TCGA dataset, at the resolution of three cell-types (epithelial, fibroblast, and immune cell) and using the corresponding DNAm reference matrix from HEpiDISH (see Fig. 4.2), CellDMC predicts HAND2 promoter hypermethylation in all three cell-types (Fig. 4.5b). We note that in this instance, the marginal conditional model did have the sensitivity to detect HAND2 as changing in all three cell-types, despite being a scenario where an ordinary DMC-model is more appropriate. That the algorithm correctly predicts a DNAm change in the fibroblasts is critical, as it is the silencing of HAND2 in the fibroblasts that leads to an altered paracrine signaling with the epithelial cells, sensitizing these to oncogenic estrogen [35].

5.3 An Endothelial-to-Mesenchymal (EndoMT) Transformation Signature in Lung Cancer

An example of how novel insight can be gained through application of a tool like CellDMC is illustrated in the case of lung squamous cell carcinoma (LSCC). Here, the EpiSCORE algorithm (Fig. 4.2) was used to build a lung-specific DNAm reference matrix defined over epithelial, fibroblast, endothelial, and immune cells. While one could also consider different immune cell subtypes, we can refrain from doing so if our focus is on alterations occurring in the other cell-types. Simplifying the problem to four main cell-types also yields more power to detect changes in specific cellular compartments. The distribution of inferred DMCTs is displayed in Fig. 4.5c, together with a few examples, and a Gene Set Enrichment Analysis (GSEA) on the DMCTs appearing in the specific cellular compartments, which reveals patterns that are strongly consistent. For instance, the enrichment of bivalent and PRC2 marked domains among epithelial-specific hypermethylated DMCTs is a well-known universal cancer signature (Fig. 4.5d) [89]. Likewise, the concomitant enrichment of SMAD2/SMAD3 binding targets among endothelial-specific hypermethylated DMCTs, and of terms related to mesenchyme, migration and invasion among endothelial-specific hypomethylated DMCTs, is noteworthy given the role of SMAD2/SMAD3 in maintaining vascular integrity (Fig. 4.5d). Therefore, CellDMC makes the prediction that the observed hypermethylation in lung cancer endothelial cells is associated with disruption of SMAD2/SMAD3 binding, compromising vascular integrity and associated with an endothelial-to-mesenchymal (EndoMT) transformation. Such an EndoMT has been reported previously [106,107,108], but as revealed by CellDMC, epigenetic alterations could be critical in promoting such a transformation.

5.4 Myeloid and Lymphoid Specific Smoking-Associated DMCTs

As a final example, we revisit the case of smoking in blood. As mentioned earlier, smoking is associated with a highly reproducible DNAm signature in whole blood [2, 102]. However, to date, no large-scale EWAS in specific blood cell subtypes has been performed to detect smoking-associated DMCTs. Hence it is unknown whether the observed smoking signature is present in myeloid cells, in lymphoid cells, or in both lineages. Only smaller scale studies have converged on identifying a small number of CpGs that appear to be altered specifically in myeloid or lymphoid cells [109]. Specifically, Su et al. reported 7 CpGs that exhibit lineage-specific smoking-associated DNAm changes, with 5 of these being specific to myeloid cells, and 2 being specific to lymphocytes [109]. A recent study applied CellDMC to a number of independent EWAS with smoking information (2 in whole blood and 1 in buccal swabs), and at the resolution of 2 cell-types (myeloid and lymphoid), revealing good consistency with Su et al. [110]. A meta-analysis over a large number of EWAS in whole blood further revealed myeloid-specific hypermethylation and hypomethylation signatures, with the latter being strongly enriched for DNase Hypersensitive Sites (DHS) as defined in inflammatory macrophages, and with the former being enriched only for DHSs as defined in acute myeloid leukemia (AML) [110]. In contrast, the same meta-analysis did not reveal an extended lymphocyte-specific smoking signature, suggesting that smoking imparts its effect on DNAm patterns in blood mostly via alterations in myeloid cells. This is a significant observation given that smoking is a moderate risk factor for AML but not for lymphocytic leukemias [111].

In summary, all the above examples illustrate successful applications of an algorithm such as CellDMC in identifying cell-type-specific DMCs, in some cases validating known DNAm signatures, and in other cases generating novel concrete hypotheses for further exploration and testing.

6 Limitations

Despite the successful applications described above, it is important to also emphasize the limitations associated with identifying DMCTs. We illustrate some of these limitations in the context of a power simulation on realistic data, focusing on the last example considered in the previous section, namely identifying smoking-associated differential methylation in two cell lineages of blood (myeloid vs. lymphoid) [110]. This is a scenario where one cell-lineage (myeloid cells) accounts for most of the cells in the tissue (i.e., approximately 60–70% of cells in whole blood derive from the myeloid lineage), with variations in the myeloid fraction between individuals also accounting for most of the variation in DNAm. Details of the actual simulation model can be found in You et al. [110]. Very briefly, in this in-silico model, one simulates three classes of DMCTs, and subsequently assesses the sensitivity or power of the algorithm (in this case CellDMC) to detect these alterations. One class of DMCTs is specific to the lymphoid lineage, another class is specific to the myeloid lineage, and the third class of DMCTs is present in both lineages, and therefore this class is non-specific. The sensitivity to detect each category of DMCT in a whole blood EWAS is shown in Fig. 4.6, for a range of different effect sizes and two different sample sizes.

Fig. 4.6

Power calculation. Plots of the sensitivity (y-axis) to detect each class of DMCT vs. effect size (x-axis), and for two different sample sizes (n = 200 and n = 600). We note that n is the total sample size and that the numbers of cases and controls are assumed similar. The effect sizes are realistic and correspond to average DNAm changes in individual cell-types that range from 0.05 to just over 0.6. In the last plot, we display the sensitivity to detect the non-specific DMCTs in each lineage separately.

Thus, for appropriately powered studies, detecting lymphoid- or myeloid-specific DMCTs is in general not a problem. For instance, for an epigenome study with 200 samples (100 cases and 100 controls), the sensitivity to detect myeloid-specific DMCTs is around 80% for an effect size of 2, which roughly corresponds to about a 10–20% average absolute DNAm change in myeloid cells. The corresponding sensitivity for lymphoid-specific DMCTs is just over 60%, i.e., lower than for myeloid-specific DMCTs, owing to the lower proportion and variance of lymphoid fractions in blood, but still reasonably high. For larger effect sizes, i.e., for average DNAm differences larger than say +/− 0.3, sensitivities are higher than 80 or 90% for both lineage-specific DMCTs. However, for non-specific DMCTs, the power calculation indicates markedly reduced sensitivity to detect the change in the lymphoid compartment. Thus, larger sample sizes would be needed to detect non-specific DMCTs as being non-specific. As far as the FPR and FDR are concerned, these are generally quite low, with the corresponding specificity and precision being close to 100% for all three classes of DMCTs (see You et al. [110]).

The above simple power calculation clearly illustrates the potential limitations associated with inferring DMCTs in real epigenome studies. As one wishes to infer DMCTs in the context of ever-increasing numbers of cell-types, one main limitation is on the sensitivity to detect non-specific DMCTs, and thus great caution needs to be exercised when interpreting the specificity of inferred DMCTs. Another major limitation is the potentially low variance in the fraction displayed by a given cell-type, as noted recently [112].
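The flavor of such a power calculation can be conveyed with a small Monte Carlo sketch in Python. It is not the simulation model of You et al. [110]: the distribution of myeloid fractions (mean 0.65, standard deviation 0.1), the noise level, the DNAm change of 0.1, the normal-approximation cutoff of 1.97 on the t-statistic, and the function names are all assumptions, so the resulting numbers are not comparable to Fig. 4.6 and only illustrate the qualitative behavior.

```python
import numpy as np

def gamma_t(beta, fractions, y):
    """t-statistics of the interaction terms in the marginal conditional model."""
    K = fractions.shape[1]
    X = np.hstack([fractions, fractions * y[:, None]])
    coef, *_ = np.linalg.lstsq(X, beta, rcond=None)
    resid = beta - X @ coef
    sigma2 = resid @ resid / (len(beta) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return (coef / se)[K:]

def sensitivity(gamma_true, cell, n=200, n_sim=300, noise_sd=0.05,
                t_crit=1.97, seed=0):
    """Fraction of simulated EWAS in which the change in `cell` is detected.

    gamma_true: DNAm change in (myeloid, lymphoid) cells; cell: 0 or 1.
    """
    rng = np.random.default_rng(seed)
    y = np.repeat([0.0, 1.0], n // 2)            # equal numbers of controls and cases
    gamma_true = np.asarray(gamma_true)
    hits = 0
    for _ in range(n_sim):
        myeloid = np.clip(rng.normal(0.65, 0.1, n), 0.05, 0.95)
        fractions = np.column_stack([myeloid, 1 - myeloid])
        beta = (0.5 + (fractions * y[:, None]) @ gamma_true
                + rng.normal(0.0, noise_sd, n))
        hits += int(abs(gamma_t(beta, fractions, y)[cell]) > t_crit)
    return hits / n_sim

sens_myeloid = sensitivity([0.1, 0.0], cell=0)   # myeloid-specific DMCT
sens_lymphoid = sensitivity([0.0, 0.1], cell=1)  # lymphoid-specific DMCT
print(sens_myeloid, sens_lymphoid)
```

In this toy setting the lymphoid-specific change is harder to detect because lymphoid cells are the minority component, so estimating their DNAm change requires extrapolating further from the observed range of fractions. Running the same function with a change in both lineages and `cell=1` gives an analogous picture for the lymphoid component of a non-specific DMCT.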

7 Useful Software

Here we give a brief overview of some of the important software that has been developed in the cell-type heterogeneity field. Most of this software is freely available as R-code from the websites as specified in the publications. For estimating cell-type fractions in whole blood or peripheral blood mononuclear cells, we can recommend the minfi Bioconductor package [80] which uses Houseman’s constrained projection algorithm [46] as well as the original DMRM for blood used when validating the Houseman algorithm. Another option is the EpiDISH Bioconductor package [48], which offers robust partial correlations (RPCs) and Support Vector Regression as an alternative to constrained projection for estimating cell-type fractions. For those unfamiliar with R, the EpiDISH package has an associated webserver [113] and also incorporates its own DMRM for blood. Another important option for blood is the IDOL algorithm [78] and associated DMRM [114]. For cord blood, there are dedicated DMRMs available from the FlowSorted.CordBlood.450k Bioconductor R-package [81] and from [115, 129]. For saliva, there is also a dedicated DMRM, which has been specially designed for epidemiological studies in children [116]. For saliva and buccal swabs, one can also use a general three cell-type DNAm reference matrix defined for generic epithelial, fibroblast, and immune cells [59, 117], which is also available within the EpiDISH package. For complex solid tissues, one can use the HEpiDISH algorithm [59] and associated 2-layer DMRMs for estimating cell-type fractions for epithelial, stromal, and all immune cell subtypes, which is also part of the EpiDISH package. Another option for inferring epithelial and stromal fractions in complex tissues is MethylCIBERSORT [118]. The EpiSCORE algorithm can be used to build DMRMs for tissue-types for which there is a matching scRNA-Seq atlas, and DMRMs for lung and breast tissue have been provided [89]. 
An alternative statistical strategy for estimating cell-type fractions called dtangle, which improves upon constrained projection and the other frameworks, has also been proposed [119]. A method called MethylResolver has recently been proposed to evaluate the quality of an inferred cell-type decomposition [120], which is an important step missing from all other software. Algorithms for identifying cell-type-specific DNA methylation changes include TOAST [54], CellDMC [52], HIRE [55], Omicwas [121] and TCA [122]. Omicwas, which implements a non-linear ridge regression framework, is noteworthy for also dealing with the co-linearity problem in DMRMs of high cellular resolution. Reference-free and semi-reference-free methods remain useful for the purpose of identifying DMCs not confounded by cell-type heterogeneity [49,50,51, 123,124,125], but are generally speaking limited for the tasks of cell-type fraction estimation and inference of cell-type-specific DNAm changes.

8 Outlook

It is likely that over the next years, technologies for single-cell methylomics [126] will improve to allow the construction of DNAm reference profiles for most cell-types within a tissue. This would provide an alternative means to build a DMRM, which is more tailored to the tissue of interest, thus overcoming some of the limitations of existing strategies. However, in the foreseeable future, generating single-cell methylomics for large numbers of tissues (and individuals) will remain impractical. This means that for epigenome studies aiming to identify cell-type-specific biomarkers, and which will require genome-wide profiling of many 100s to 1000s of individuals, it will still be necessary to use cell-type deconvolution methods. Thus, single-cell methylomics may help towards the construction of tissue-specific DMRMs, and this in turn may improve the inference of cell-type fractions and DMCTs from large-scale epigenome studies.

Another important area of future development will be applications in personalized epigenetic epidemiology and medicine. One key area that is likely to grow even further is noninvasive detection of disease via measurement of cfDNAm in plasma. Cell-type deconvolution methods as well as more conventional Machine Learning methods need to be improved further, especially in relation to how informative features are selected. Currently, algorithms are also aimed at identifying biomarkers (cell-type specific or not) that display average differences in DNAm associated with an exposure or outcome. However, averaging over individuals also obscures in which particular individuals a DNAm change is happening. In the context of DMCTs, one would thus wish to infer in which cell-types and in which individuals a particular DNAm change is happening. Thus, given a DNAm data matrix and estimates of cell-type fractions, one would ideally want to infer an array of DNAm values, defined over CpGs, cell-types, and individuals. Mathematically, this is known as a tensor-object [37] and attempts to infer such a tensor in the context of EWAS have been made [122], albeit not yet successfully so [127]. Of note, such tensor-inference methods are also likely to overfit and to be computationally very intensive if applied in a genome-wide context, as the number of parameters to estimate is substantially higher.

An important biological question for the future is the relative fractions of non-specific vs. specific DMCTs. This is likely to depend on the tissue as well as phenotype. For instance, in the case of SNPs, the proportion of mQTLs that are shared between blood cell-types is likely to be large (approx. 70–80%) [36, 37]. Thus, in the case of DNAm variation associated with genetic variants, this variation appears to be largely cell-type independent, which however still allows for downstream functional effects to display cell-type specificity. In aging too, a recent study has shown that most age-associated DNAm changes appear to be independent of tissue and cell-type [128]. EWAS for body-mass index have also revealed shared DNAm changes between blood and fat cells, but also differences [3]. In the case of smoking in blood, there is evidence for both non-specific as well as specific DMCTs, whereas in buccal swabs the effects appear to be much more cell-type specific [18]. Other conditions such as type-1 diabetes [39], asthma [34], rheumatoid arthritis [41] or cancer [89] appear to be associated with larger numbers of cell-type-specific DMCTs. Understanding why specific factors may affect DNAm in a cell-type independent manner and why others do not will be an important question for the epigenetics community that is also highly relevant for epigenetic epidemiological studies.

To conclude, this chapter provides a brief overview of the key concepts and tools needed to tackle the challenge posed by cell-type heterogeneity in the context of DNA methylation studies. We recommend that future epigenetic epidemiological studies make an attempt to tackle this challenge, using the software tools described herein, mainly as hypothesis generation tools, to be followed up with functional studies or validations in purified cell populations.