Introduction

DNA methylation (DNAme) is a chemical modification of DNA that entails the addition of a methyl group to a cytosine (C) residue resulting in 5-methylcytosine, and most commonly occurs in the context of CpG dinucleotides in humans [1]. The study of epigenetics and DNAme has become one of the most topical areas of genomic research in recent years, both from a functional point of view and a clinical perspective, owing to its potential application in cancer risk prediction and early detection strategies [2, 3].

The two most widely used techniques to study epigenome-wide DNAme are whole-genome bisulphite sequencing (WGBS) and Illumina methylation arrays. Both technologies require bisulfite pre-treatment of DNA to enable distinction of methylated from unmethylated cytosine residues in the context of CpG dinucleotides. Whereas WGBS provides information of the DNAme status of a series of neighbouringCpGs, Illumina methylation arrays allow for more affordable and high-throughput assessment of the methylation status of a subset of single CpG dinucleotides throughout the genome.

The Illumina bead array technology has undergone substantial re-development over the years and the total number of CpGs that can be simultaneously analysed has increased substantially from ~ 25,000 in 2008 (HumanMethylation27 BeadChip), to ~ 485,000 in 2011 (HumanMethylation450 BeadChip), to over ~ 850,000 CpG sites in 2016 (MethylationEPIC BeadChip v1.0), and finally to over ~ 935,000 CpG sites in 2022 when the MethylationEPIC BeadChip v2.0 was released. Illumina methylation microarrays include two different types of bead chemistry, Infinium type I and II probes [4, 5]. Type I probes have two separate probe sequences per CpG dinucleotide (one each for methylated and unmethylated CpGs), whereas type II probes have only one probe sequence per CpG dinucleotide. Consequently, type II probes take up less physical space on the arrays than type I probes and are the most abundant type on the latest MethylationEPIC arrays, constituting ~ 85% of all probes. For type II probes, discrimination of methylated (M) versus unmethylated (U) alleles is made possible by single nucleotide primer extension which results in the incorporation of Cy3 or Cy5-labelled nucleotides into the target sequence, emitting green or red fluorescence, respectively. For type I probes, discrimination of methylated versus unmethylated alleles is made by constructing corresponding probes sequences M (methylated) and U (unmethylated) which are measured in the same channel, either Red or Green. Thereafter, we will distinguish between them as type I-Red and I-Green probes, respectively. The level of methylation at specific CpG sites is expressed as beta (β value), which represents a constant between 0 (unmethylated) and 1 (fully methylated) and can be written as:

$$\beta = \frac{M}{M+U+ \alpha };$$
(1)

with α representing a small positive constant (typically 100) added to the equation to avoid dividing by zero when both M and U signals are equal to 0. If β = 0, then the interrogated CpG is unmethylated (there is no M signal), if β = 1, then the interrogated CpG is methylated (there is no U signal).

When assessing Illumina methylation array data, basic pre-processing steps would typically include identifying probes and/or samples with a low signal to noise intensity which should be excluded [6], correcting for background intensity and dye bias [7], performing within-array normalization to reduce differences in β distributions obtained from Infinium I and II probes [8], and imputing missing data [9]. For this, several established methods have been benchmarked and implemented into pre-processing pipelines available as R packages, such as minfi [10], ChAMP [11], SeSAMe [12] and the latest ENmix [13]. Additionally, previous studies have identified the necessity to exclude low-specificity probes that can bind to multiple sequences within the genome, as well as probes that contain genetic variants in their underlying sequence [5, 14]. Lastly, recent studies by Sugden et al. [15] and Ross et al. [16] identified a large set of ‘unreliable’ probes that poorly reproduced methylation values when samples from the same DNA source were measured either on the HumanMethylation450 or MethylationEPIC BeadChip and ‘batch-susceptible’ probes that performed excessively variable in different datasets, respectively. However, to date a comprehensive understanding of the factors that influence the reliability, i.e., the consistency of repeated measurements, of Illumina methylation array probes is lacking, and existing solutions of removing a fixed list of unreliable [15] or batch-susceptible [16] probes may not translate well to new datasets that were not included into the respective reliability investigations. This has substantial implications for the accurate interpretation of methylation array data, especially since a typical experimental design for Illumina methylation arrays does not include technical or biological replication.

In this study, we present a series of comprehensive analyses to investigate yet unexplored factors affecting the validity of Illumina methylation array data. To facilitate this exploration, we generate paired longitudinal MethylationEPIC BeadChip v1.0 data from 142 paired blood samples from 71 volunteers collected 6 weeks apart that enabled us to distinguish inter-individual DNAme variability with intra-individual DNAme data over time, as well as an additional technical replicate set spanning multiple sample types from the same group of individuals. Our results reveal new insights into factors affecting the variability of DNAme data derived from Illumina methylation arrays and we thus propose a novel, data-driven method for the assessment of probe reliability. We expect that these findings will further improve existing pre-processing pipelines and the subsequent interpretation and reliability of next-generation Illumina methylation array results.

Materials and methods

Sample collection and DNAme profiling in the clinical intervention study

93 individuals were recruited to the TACT (Turmeric-Anti-Inflammatory and Cell Damage Trial-clinical trial number NCT02815475) for a 6-week intervention study. There were three study arms: one group (‘Turmeric Capsule’ group; 25 patients: 17 females, 8 males) received a 400 mg Turmeric capsule providing 0.27 g curcuminoids/day, a second group (‘Placebo’ Group; 24 patients: 16 females, 8 males) a sugar placebo (xylitol), and a third group were asked to regularly cook with Turmeric powder (‘Turmeric Powder’ group; 22 patients: 20 females, 2 males) providing 0.24 g curcuminoids/day in their food every alternate day, all for a period of 6 weeks. Ethical approval number 16-WAT-23 was granted by Newcastle University’s SAgE ethics committee. 71 participants (53 females and 18 males) completed the study and provided full sets of usable 12-h fasting whole blood samples, which were collected at the start and end of the 6-week intervention into PAXgene blood DNA tubes (BD Biosciences, 761165). Full blood counts were complemented with measurements of lymphocyte subsets (T, B, and NK cells) using the fluorescently labelled antibodies: anti-CD3 (T cells), -CD4 (T helper cells), -CD8 (cytotoxic T cells), -CD19 (B lymphocytes), and -CD56 (NK cells; CD3-), respectively. Briefly, 50 µL blood were added to the antibody mix and incubated for 20’ at room temperature in the dark. To lyse red blood cells, the mixture was incubated 10’ with 3 mL lysis buffer, washed (PBS-1%FBS) and cells were resuspended prior to flow cytometry analysis using a FACSCanto II (Becton Dickinson). DNA from blood samples collected in PAXgene types was extracted using the Machery Nagel NucleoMag® Blood 200 µL extraction kit (cat, 744501) and 500 ng total DNA were bisulfite modified using the EZ-96 DNA Methylation-Lightning kit (Zymo Research Corp, cat, D5047). 8 µL of modified DNA were subjected to methylation analysis on the Illumina Infinium MethylationEPIC BeadChip (Illumina, CA, USA) at UCL Genomics, according to the manufacturer’s standard protocol.

Normalization of Illumina MethylationEPIC data and immune cell subtype inference

Downstream analyses of the TACT study utilized raw β values, obtained by formula (1) with raw intensities, as well as normalized β values obtained from three distinct pipelines: minfi [10], using the preprocessFunnorm function with default parameters, including background (NOOB) and dye bias correction; ChAMP [11], using the champ.norm function with default parameters for beta-mixture quantile normalization (BMIQ), and ENmix [13]. For ENmix, no samples and probes were removed (parameter qc = FALSE). β values were regressed against the FACS-measured neutrophil and lymphocyte cell fractions for the first and second visits separately, and probes were considered cell type-dependent when FDR < 0.05 for both p-values associated with the slope of two linear regressions.

Methylation changes linked to the clinical intervention

Two approaches were used to investigate differential methylation between two visits across the three treatment groups in the TACT study, either considering absolute differences in the original β-values between visits or considering differences in residuals from linear regression models (β values versus real neutrophil cell fraction, the largest blood cell fraction) fitted on samples from the first visit only and then applied to all samples. Pairwise comparisons of the treatment groups were done using the Wilcoxon test, as well as a common comparison of all three groups using a Kruskal–Wallis test. Since the proportion of males in the ‘Turmeric Powder’ group was lower than in the other two groups, all tests were repeated on the female samples only.

Single nucleotide polymorphism (SNP) analysis

SNPs were identified from probes with underlying genetic sequence variation at target CpG sites listed by Pidsley et al. (Supplementary Table S4 in [5]). SNPs affect methylation profiles in specific ways depending on the position of the SNP relative to the target site. We defined the ‘SNP-II-0-effect’ associated with a 0-position (C base of target CG pair) of a type II probe, which can cause false M (if the SNP is a G base) or false U signals (if the SNP is a T or A base), and the ‘SNP-II-1-effect’ associated with a 1-position (G base of target CG pair) of a type II probe, which may cause degradation of the total signal intensity (see Supplementary Fig. S1). The ‘SNP-II-0-effect’ results in a tri-modal distribution of β values of the type II probe where each mode is represented by carriers of one of three variants: CC–C on both chromosomes, C(SNP) or (SNP)C–C only on one chromosome, or (SNP)(SNP) for NON C on both chromosomes. Conversely, the ‘SNP-II-1-effect’ results in a tri-modal distribution of intensity levels of the type II probe, where each mode is represented by carriers of one of three variants: highest level for GG–G on both chromosomes, middle level for G(SNP) or (SNP)G–C only on one chromosome, and lowest level for (SNP)(SNP) for NON G on both chromosomes. On the level of β values, the ‘SNP-II-1-effect’ results in a bi-modal distribution, with one mode corresponding to the (SNP) variant and the second mode to the other two variants. Notably, SNPs in the other closest positions towards the end of the probe (likely 2–5 bp away) or large inserts and deletions in more distant positions can have the same ‘SNP-II-1-effect’. For type I-Red and I-Green probes, more SNP cases are possible that ultimately result in the same effects as described for type II probes, either a tri-modal β value distribution (~ ‘SNP-II-0-effect’), a tri-modal intensity distribution (~ ‘SNP-II-1-effect’), or a combination of both (see Supplementary Fig. S2).

Estimation probe sequence melting temperature

The melting temperature (Tm) of individual probe sequences was estimated with empirical formulas based on GC content using the Tm_GC() function implemented in the R package TmCalculator (version 1.0.3) [17]. Empirical constants for the calculation were set with the parameter variant = Primer3Plus, and the salt correction method described by Owczarzy et al. [18] was used. Remaining parameters were kept to default, except for the allowing ambiguous sequences and mismatches (ambiguous = TRUE, mismatch = TRUE) and adjusting buffer concentrations to the Illumina MethylationEPIC BeadChip v1.0 hybridization protocol (Na = 50, Tris = 10). Probe sequences were retrieved from the official Infinium MethylationEPIC BeadChip v1.0 B5 Manifest File and Tm for unmethylated (AlleleA) and methylated (AlleleB) type I sequences were calculated separately.

Mean Intensity (MI) score calculation

We calculate the Mean normalized Intensity of each Infinium type probe (MI score) as follows:

  • For n samples, calculated across all type II or type I-Red or type I-Green probes separately:

    $$\underline{{M_{i} }} \;{\text{and}}\;\underline{{U_{i} }} ;\quad \quad {\text{average raw methylated and unmethylated signal of}}\;i^{\text{th}}\,{\text{sample}}$$
    (2)
    $${\text{AIS}}_{i} = \underline{{M_{i} }} + \underline{{U_{i} }} ;\quad \quad {\text{Average Intensity of}}\;i^{\text{th}}\,{\text{Sample}}$$
    (3)
  • For each ith sample and each jth type II or type I-Red or type I-Green probe:

    $$N_{ij} = \frac{{M_{ij} + U_{ij} }}{{AIS_{i} }};\quad \quad {\text{Normalized intensity of}}\;j^{\text{th}}\,{\text{Probe on}}\;i^{\text{th}}\,{\text{Sample}}$$
    (4)
  • For each jth type II or type I-Red or type I-Green probe:

    $$MI_{j} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} NI_{ij} ;\quad \quad {\text{MI Score}}$$
    (5)

Unreliability score calculation

First, intensities recorded in the Green and Red channels of the negative control probes on each array are collected (Green and Red noise, respectively) to create a Reliability Map (RM) for each probe type/colour separately, i.e., RM-II, RM-I-Green, RM-I-Red (see also Fig. 2d). Each RM is a grid of pairs of fixed methylated and unmethylated values \({M}_{k}\), \({U}_{l}\) with boundaries and resolution for all analyses arbitrarily set to k, l = (min = 0, max = 5000, step = 100). For each pair of \({M}_{k}\) and \({U}_{l}\) in a RM, noise values are randomly selected 1000× and methylated noise \({M}_{\text{err}(k,l)}\) and unmethylated noise \({U}_{\text{err}(k,l)}\) defined as follows, for m = 1:1000:

  • for type II probes:

    $$\begin{aligned} & M_{{{\text{err}}(k,l)}} :M_{{{\text{err}}(k,l)}}^{m} = G_{{{\text{err}}(k,l)}}^{m} (G_{{{\text{err}}}} = {\text{Green noise}}){\text{ and}} \\ & U_{{{\text{err}}(k,l)}} :U_{{{\text{err}}(k,l)}}^{m} = R_{{{\text{err}}(k,l)}}^{m} (R_{{{\text{err}}}} = {\text{Red noise}}); \\ \end{aligned}$$
  • for type I-Green probes:

    $$\begin{aligned} & M_{{{\text{err}}(k,l)}} :M_{{{\text{err}}(k,l)}}^{m} = G_{{{\text{err}}(k,l)}}^{m} (G_{{{\text{err}}}} = M{\text{ Green noise}}){\text{ and}} \\ & U_{{{\text{err}}(k,l)}} :U_{{{\text{err}}(k,l)}}^{m} = G_{{{\text{err}}(k,l)}}^{\prime m} (G_{{{\text{err}}}}^{\prime } = U{\text{ Green noise}}); \\ \end{aligned}$$
  • for type I-Red probes:

    $$\begin{aligned} & M_{{{\text{err}}(k,l)}} :M_{{{\text{err}}(k,l)}}^{m} = R_{{{\text{err}}(k,l)}}^{m} (R_{err} = M{\text{ Red noise}}){\text{ and}} \\ & U_{{{\text{err}}(k,l)}} :U_{{{\text{err}}(k,l)}}^{m} = R_{{{\text{err}}(k,l)}}^{\prime m} (R_{{{\text{err}}}}^{\prime } = U{\text{ Red noise}}); \\ \end{aligned}$$

For each pair k and l we generate the artificial distribution of β values by repeatedly adding \({U}_{err (k,l)}^{m}\) and \({M}_{err (k,l)}^{m}\) noise values to \({M}_{k}\) and \({U}_{l}\) respectively:

$${\beta }_{(k,l)}^{m}=\frac{({M}_{k}+{M}_{err \left(k,l\right)}^{m})}{{(M}_{k}+{M}_{err (k,l)}^{m})+ ({U}_{k}+{U}_{err \left(k,l\right)}^{m})}, m=1:1000;$$

Then we calculate a Q score for a given distribution:

$$Q_{k,l} = {\text{MAD}}({\text{mean absolute deviation}})\;{\text{distribution of}}\;\beta_{(k,l)}^{(1 \ldots 1000)} ;\quad Q\;{\text{score}}{.}$$
(6)

Thus, for each probe type the Reliability Map is a two-dimensional grid (\({G}_{k}\), \({R}_{l}\)), of M and U signal intensities, where each cell is assigned the \({Q}_{k,l}\) value, which is associated with the unreliability of the β value obtained at the corresponding intensities. For each real (i.e., not modelled) pair of Mij and Uij signals from the ith sample and each jth probe average noise values are subtracted in the corresponding channels and the closest point (G*,R*) on RM and associated its Q*-value is retrieved to finally calculate the unreliability score for each jth probe across all n samples in the data set:

$${\text{Unreliability}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {Q^{*} } ;\quad \quad {\text{Unreliability score}}$$
(7)

For intensity values outside the grid, then \({Q}^{*}\) is assigned 0, that is very reliable.

Unreliability and MI score dynamic thresholding

The relationship between unreliability and MI scores was examined for each probe type/colour separately, constructing smoothed curves using a generalized additive model (GAM). Because the dependence of Unreliability on MI rapidly decreases and then stabilizes after a so-called “critical point”, we propose a dynamic threshold for determining which probes are deemed unreliable in a given dataset, by determining the maximum of the second derivative of the smoothed curve.

Unreliability and MI score validation

Two DNA methylation datasets comprised of technical replicates were used to validate the utility of the unreliability and MI scores for probe reliability estimation. The first dataset was GSE174422, accessed from NCBI Gene Expression Omnibus (GEO). This dataset comprised 128 technical replicates  of female blood samples collected within a Sister Study analyzed on the Illumina HumanMethylation450 Beadchip v1 [13]. To complement this first dataset with Illumina MethylationEPIC data, we generated a second dataset derived from four technical replicates from the same DNA (2x) and bisulfite converted DNA mixtures (2x) obtained from three different sample types, i.e., fresh blood, frozen blood and cervical smears, from four female subjects that participated in the TirolGESUND study [NCT05678426] (n = 3 × 4; see Supplementary Fig. S3; "Repeatability dataset"). Blood samples (2.5 mL) were stored in PAXgene blood DNA tubes (BD Biosciences) and DNA was isolated from fresh blood within a week after sample collection. The remaining blood was kept frozen at − 20°C. DNA was additionally isolated from the frozen left-over samples and treated as a separate sample type. Cervical smears were collected and stored at room temperature with ThinPrep® Collection Kit (Hologic). Within a week after sample collection, cells sediments were transferred, washed (PBS) and pelleted (2500 RPM, 10 min). Cell pellets were kept frozen at − 80°C. DNA was isolated according to the tissue protocol of the Mag-Bind® Blood and Tissue DNA 96 Kit (#M6399-01, Omega Bio-tek) and quantified using the Quantifluor® dsDNA System (#E2670, Promega). 2 × 500 ng of DNA were bisulfite modified with the EZ DNA Methylation-Lightning kit (\#D5030, Zymo) and standardized to a concentration of 25 ng/µL BC-DNA. From each BC-DNA mixture, 2 × 100 ng were processed on the Illumina HumanMethylationEPIC v1.0 (#20042130) according to the manufacturer’s instructions. To minimize batch effects, modified DNA from each sample type was processed randomly on array positions across two beadchips.

Results

Unexpected DNAme variability in repeated blood samples

We initially studied whole blood DNA methylation profiles of 71 volunteers in the TACT study at two time points separated by a six-week interval (nsamples = 142). Although participants were allocated to one of three treatment groups (‘Placebo’, ‘Turmeric Capsule’ or ‘Turmeric Powder’), no significantly differentially methylated CpG sites (FDR < 0.05) were found with champ.dmp() for any of the four variants of β values analyzed (raw or normalized with distinct published preprocessing pipelines; results not shown). For each CpG locus, we then calculated SDβ, i.e., standard deviation of the β values within the population at visit 1, and Δβ, i.e., the average (over all individuals) of absolute differences in β values between visits (over time) for the same person. This revealed a spectrum of CpGs targeted by Infinium probes in terms of DNAme variability with tendencies towards two extreme scenarios (Fig. 1a): some sites exhibited a wide range of variability across the population of samples of a single visit (time point 1), while other sites showed a high degree of variability over time for the same individual. This pattern was observed for both the raw and normalized β values.

Fig. 1
figure 1

Variability associated with genetic factors (sex and genetic variants) in the TACT study. a Sex chromosome-associated probes demonstrate high variability both within population at visit 1 (SDβ, x-axis) and over time (∆β, y-axis). b Example of β value and intensity of an X-chromosome-associated probe in males and females. c Example of β value and intensity of a Y-chromosome-associated probe in males and females. d Two types of type II probes SNPs and their impact on signal intensity: SNPs in position 0 (SNP-II-0) result in false green or red signals, while SNPs in position 1 (SNP-II-1) result in a loss of signal (for type I probes see Supplementary Fig. S2a). e SNPs in position 0 and 1 both demonstrate high variability in the population at visit 1 (SDβ), but SNPs in position 1 furthermore demonstrate high variability over time (∆β; for type I probes Supplementary Fig. S2b). f Example of a probe with 0-position SNP (SNP-II-0) showing a tri-modal  β value distribution. g Example of a probe with 1-position SNP (SNP-II-1) showing a tri-modal intensity distribution

Consequently, we checked to what degree the observed patterns of variability were linked to the genetic background of the targeted sites, in particular with sex chromosomes and single nucleotide polymorphisms (SNPs), sites typically removed or removed during preprocessing [14]. When grouping probes by their chromosomal location (X, Y, and “other”), sex chromosome-associated probes within our dataset exhibited a high variability across the population, which is partially expected as our participant cohort included both men and women. However, strikingly, methylation values in probes targeting sex chromosomes also showed a high variability over time. This effect was less pronounced for minfi and ENmix normalized β values, likely due to the special normalization performed for sex chromosome-associated probes in these pipelines. As expected, M and U signal intensities at sex chromosomes were influenced by biological sex and sex chromosome copy number (Fig. 1b,c). The total signal intensity (M + U) of CpG probes mapping to the X chromosome is higher in females than males, since females have two X chromosomes. Conversely, the total signal intensity of CpG probes mapping to the Y chromosome is close to ‘0’, since females do not carry a Y chromosome, although some Y-CpGs might be in pseudoautosomal regions.

Infinium type I and II probes are based on inherently different designs, therefore we consider them separately for the remainder of our analyses. With respect to the Infinium type II probes, the two most relevant positional types of SNPs occur at position ‘0’ immediately after the 3’ end of the probe (SNP-II-0), where the SNP specifically affects the cytosine residue of the interrogated CpG, or at position ‘1’ at the very end of the 3’ end of the probe (SNP-II-1), where the SNP specifically affects the guanidine residue of the interrogated CpG (Fig. 1d). These two SNPs have distinct impacts on signal: a SNP-II-0 can result in false U or M signals, depending on the nature of the SNP replacing the ‘C’ residue, while a SNP-II-1 impairs hybridization and extension and results in a loss of signal (see Materials and Methods section). Other studies [5, 14] previously identified probes on the MethylationEPIC BeadChip v1.0 whose reliability is impacted by SNPs within the sequence they target, and we have highlighted these probes in our dataset (see Fig. 1e for type II probes). Interestingly, the subset of probes with the highest Δβ, were neither a SNP-II-0 nor a SNP-II-1. Thus, like the sex-chromosome associated probes, SNP-associated probes contribute to a high degree of variability within the population, but they do not fully explain the high variability in DNAme data over time within individuals. We further found that SNP-II-1 resulted in bimodal β distributions (corresponding to signal or loss of signal; Fig. 1g), where the “true” variant is represented by the upper layer of intensity. SNP-II-0 that gives rise to either false U or M signals yielded a trimodal β distribution (Fig. 1f), and the nature of the polymorphism, i.e., which base has replaced the cytosine residue, determined which particular mode corresponded to the values of the “true” (C/C) variant (see Supplementary Fig. S1).

Modelling the impact of signal intensity on β value reliability

Since the properties of both SNP-related and sex chromosome-associated probes are closely related to signal intensity (either through DNA quantity, false signals, or a loss of signal), but cannot fully explain the observed variability in repeated blood samples from the same individuals, we further scrutinized the impact of probe intensities on DNAme variability. For each probe on the array, we calculated a mean intensity (MI) value, which represents the corrected mean overall signal strength of the probe. Overall, probes with the highest level of time-dependent variability have a low MI value (Fig. 2a). Probes on the lower end of the MI scale in our dataset show a low reproducibility in paired blood samples (example cg21373150; Fig. 2b), whereas probes on the higher end of the MI scale exhibited a high reproducibility (cg17588455; Fig. 2c). We hypothesize that this high variability at low intensity levels is caused by a relatively higher impact of signal to noise. Therefore, the MI score may potentially allow for the identification of ‘noisy’ or ‘unreliable’ probes, i.e., probes which do not yield consistent β values between two time-points or replicates.

Fig. 2
figure 2

Impact of probe intensity on β value reliability in the TACT study. a Association of probe variability within participants at visit 1 (SDβ) and over time (∆β) with mean intensity (MI). Probes with low MI have high variability over time. Examples of a probe with low MI (b) or high MI (c). Lines are connected by points corresponding to the individual in two different visits. β values from the same individual are closer for a probe with a high MI than on a probe with a low MI. d Reliability map of β values. e Smoothed curve shows the dependence of type II probe unreliability on mean intensity (MI). The point highlighted on the graph is the “critical point” marking a sharp change in dependence decline, which we recommend as a dynamic threshold for determining which probes are deemed unreliable (for type I probes and other datasets see Supplementary Figs. S4S6). f Dependence of the type II probe unreliability on MI, highlighting probes which are detected using the p-value method at different threshold stringency (for type I probes and other datasets see Supplementary Fig. S12)

We created a simulation model to estimate the impact of noise on each probe’s “unreliability” by collecting the background intensities recorded by the negative control probes on the array and repeatedly adding methylated and unmethylated noise values, M-noise and U-noise, to a fixed grid of M and U signal pairs (Fig. 2d, see Material and Methods section for further details). The resulting reliability maps summarize for each probe type/colour the mean absolute deviation of the resulting β distributions for each point in the grid (Q score) and are then subsequently used to assign an unreliability score for each probe in the final dataset, by averaging the matching Q scores for the measured β values across all samples. Dynamic thresholding for defining unreliable probes in each dataset is then achieved by examining the dependence of the unreliability scores on MI and finding the critical point of a smoothed curve where the dependence of unreliability (mirroring Δβ estimates) on MI stabilizes (Fig. 2e; Supplementary Figs. S4S6). The β values in the unreliability Map are only generated for the purpose of estimating unreliability. Independently, different QC procedures may be applied to call β values, and we noted that, especially for type II probes, the choice for a specific pipeline influenced the obtained average β distributions substantially (Supplementary Figs. S4S6a).

Compared to the popular p-value detection method (detP) to remove outlier probes as implemented in the minfi package [10], we improve probe filtering by modelling the effect of noise on β values obtained at different intensity levels, rather than comparing total intensities (across all genomic position in every sample) with distribution of total intensities on negative control probes (which only allows to estimate the ‘distance’ of probe intensities from the noise intensity, but does not allow to estimate the influence of noise on the final β values). Furthermore, by allowing for data-driven thresholding we detect more unreliable probes than the statistical outliers alone, even compared to a very stringent threshold of p = 1.e − 40 [6] for detP (Fig. 2f,e).

Linking mean intensity with probe sequence composition, target sequence copy number and unreliability scores

Investigating probe composition to identify factors associated with reliability, we found that probes with a low MI score tend to have a lower C content and target sequences with a lower G content (Fig. 3a; Supplementary Fig. S7). Stronger physical binding between G–C base pairs than A–T base pairs could result in an increase of bound targets and thus fluorescent signal. We tested this for each probe type, i.e., I-Green, I-Red and II, and corresponding probe sequences, including unmethylated/methylated probes for Infinium type I, by checking the association of MI with estimated probe melting temperature (Tm; Supplementary Fig. S8). For all probe types, we find a significant positive correlation of MI with Tm (Pearson correlation > 0.21, p-value < 0.01). This was most pronounced for the Infinium type II probes (Pearson correlation = 0.39). Infinium type II probes predominantly target open sea regions. Furthermore, probes targeting island, shore, shelf, and open sea regions inherently differ in their CG content (Fig. 3b). Correspondingly, open sea region probes have a lower MI and higher unreliability scores than probes located in island regions (Fig. 3c, d). As previously shown, the raw signal intensity of the probe is also influenced by the degree to which 3’ end subsequences (3’ nested subsequences) non-uniquely bind to different regions of the human genomes [14]. Here we additionally show that mean intensities (MI scores) depend both on the copy numbers of 3’ nested subsequences of different length and the C content of the full length probe (Fig. 3e). When 3’ nested subsequences uniquely bind to the genome, MI scores are higher with higher C content. For short non-unique overlaps (3’ nested subsequences = 10–20 bp length), MI drops with copy number only for high C content probes. For longer non-unique overlaps, MI increases with copy number, independent of the C content of the full length probe.

Fig. 3
figure 3

Type II probe features impacting signal intensity and reliability. a Dependence of mean signal intensity (MI) on C content of the probe (for type I probes and other datasets see Supplementary Fig. S7). b Odds ratio for C enrichment in different DNA regions: Islands, due to their inherently high CG content, have a strong enrichment of probes with a large C content, which leads to the fact that on average MIs are higher. Odds ratios and associated p-values were obtained with a two-sided Fisher test (full results given in Additional file 2). Non-significant (FDR-adjusted p-value > 0.05) results are shown in grey. Significant results (FDR-adjusted p-value ≤ 0.05) with an odds ratio > 1 or ≤ 1 are shown in red and blue, respectively. c, d Differences in MI and unreliability scores depending on the CpG context of the probe target. e Dependence of MI (Y-axis) on the degree of non-unique overlap for 3’ nested subsequences (estimated copy number; X-axis) and C content of the probe sequence (color scale). Copy numbers for 10–50 bp long 3’ nested subsequences were calculated by Zhou et al. [14] by scanning 3΄-subsequence of the probe's source sequences (‘G's converted to ‘A's) and of the specified length in the bisulfite converted genome (GRCh38 assembly) for matches. f Reliability scores proposed by Sugden et al. [15] versus unreliability scores defined here. g Association of Reliability scores proposed by Sugden et al. [15] with SNPs and MI. h Methylation status (β) and intraclass correlation coefficient (ICC) for example SNP-associated CpGs and probes yielding low signal intensity of probes

Interestingly, our unreliability score was not associated with a reliability measure proposed by Sugden et al. [15], which was calculated using ICC (Intraclass Correlation Coefficient) on β values of repeat measurements of the same DNA samples (Fig. 3f). Furthermore, the reliability score from Sugden et al. does not correlate with MI (Fig. 3g, lower panel), and SNP-associated probes were deemed the most reliable using this measure (Fig. 3g, upper panel), which seems to be due to the high spread of β values in SNP 0 and SNP 1 (Fig. 3h, upper panels). Of note, some probes which have a similar Sugden reliability score had different ICC, MI and unreliability in our TACT data (Fig. 3h, lower panels). In contrast to the method proposed by Sugden et al., our method of assessing probe reliability is not based on cross-correlation of samples (which can be different in intensity, and therefore result in β bias), but instead offers insights into the reliability of probes based purely on intensity and noise distribution.

Consistency of unreliability scores and probe MI with respect to biological and technical variation

Our method to estimate unreliability of Infinium probes is based on the analytical modeling of the effect of noise on probe intensities and explains the high values of Δβ well. However, since the TACT dataset consists of paired biological replicate samples, we investigated the magnitude of changes in cell type composition between repeatedly drawn blood samples that may have occurred, despite the absence of a treatment effect, using flow cytometry. Fractions of cell subtypes in the blood can change rapidly, for example the proportion of lymphocytes in peripheral blood can rapidly increase due to acute illness, influencing the β values that integrate the methylation status from all cell subtypes in the samples, and further explaining the variation seen in time for the same individuals. Only for the two main immune blood cell subtypes, i.e., neutrophils and lymphocytes, fractions changed between the two visits (see example for type II probes in Supplementary Fig. S9a,b). Furthermore, large proportional changes (> 0.1) were only noted for 5 out of 53 individuals. Next, we selected probes whose β values are strongly influenced by these two immune subtypes (see Supplementary Fig. S9c) and evaluated the variability for these cell type-dependent probes between patients at visit 1 (SDβ) and over time (∆β; see Supplementary Fig. S9d). Large variation over time (∆β) was not restricted to longitudinal (biological) variation captured by these cell-type dependent probes (Supplementary Fig. S9d) and a significant portion of high ∆β/low MI probes were not amongst the cell type dependent probes (Supplementary Fig. S9e). Taken together, a significant portion of the observed variation over time could not be explained by longitudinal changes in cell type composition.

To further demonstrate the consistency of the unreliability scores on true technical replicates, we analyzed two additional datasets: a previously described IlluminaHumanMethylation450 Beadchip dataset (GSE174422) from 128 duplicate female blood samples, and a new HumanMethylationEPIC v1.0 “Repeatability” dataset generated for this study from 4 quadruplicate fresh blood, 4 quadruplicate frozen blood and 4 quadruplicate cervical smear samples. Using raw β values, we estimated variability between technical replicates by mean, absolute β differences (Δβ) for each probe, confirming that technical variability increased with increasing unreliability scores and decreased with MI in the datasets with technical replicates (Fig. 4a,b; Supplementary Figs. S10S11). Also, for the technical replicate datasets, we detect more unreliable probes than the statistical outliers alone detected by the detP method with different thresholds and remove a significant portion of the lower variable probes deemed unreliable (Fig. 4c; Supplementary Figs. S12S16). The raw β values from all three datasets showed a large skew towards probes on the array that displayed a low standard deviation across all samples, independent of the preprocessing pipeline used (Supplementary Figs. S17S20). Our more stringent unreliable probe detection method detects proportionally more of such probes with lower overall variability (SDβ across all samples) compared to the detP method. Among the unreliable probes, probes that target SNPs, the Y chromosome or Open Sea Cpgs, as well as probes with a low C content tended to be overrepresented in all three datasets (Supplementary Figs. S21S23). Furthermore, MI scores, which we use for dynamic thresholding in our unreliability method, correlate well across datasets (Fig. 4d), despite marked differences in total signal intensities for samples in the different datasets (Fig. 4e) and their noise distributions recorded from the negative control probes (Fig. 4f). Regardless of the preprocessing pipeline used to call β values, the dynamic thresholds were set close to the minimum Δβ values (Supplementary Figs. S24S27) and SDβ values (Supplementary Figs. S28S31) in each dataset. Finally, we noted a strong correlation between the unreliability scores calculated (Supplementary Figs. S32S34) and a considerable overlap in the unreliable probes detected for the three datasets (Supplementary Figs. S35S37). Nonetheless, depending on the probe type and study, between 381 and 13,807 unique, unreliable probes were detected with our method, demonstrating the importance of evaluating signal intensity in a dataset-specific manner for the detection of unreliable probes.

Fig. 4
figure 4

Validation of the unreliability method on datasets with technical replicates. Association of type II probe variability estimated by the averaged, absolute methylation differences between repeated samples (∆β) with a the unreliability score and b the mean intensity (MI) in the TACT study (n = 2 × 71 longitudinal paired blood samples), the Repeatability set (n = 4 × 4 × 4 technical replicates for fresh blood, frozen blood and cervical smear samples) and GSE174422 (n = 2 × 128 technical replicates for blood samples). For type I probes see Supplementary Figs. S10S11. c Distribution of Δβ for all type II probes (grey) or those removed by the p-value detection method at different threshold settings (red, orange, yellow) and the unreliability method (black) in the different datasets. d Correlation of MI measured for type II probes across datasets. e Mean red versus green total signal intensities of the samples in the different datasets. f Noise distributions obtained from the negative control probes in the different datasets

Implications for the newer MethylationEPIC BeadChip v2.0

A new Illumina HumanMethylationEPIC BeadChip v2.0 was released in November 2022. The new manifest reports that ‘underperforming’ probes were removed compared to the v1.0 manifest (approximately 140,000, i.e., 23% of all type I probes and 15% of all type II probes). However, we found no evidence of an enrichment for probes with high unreliability or low MI in those discarded for v2.0 (Supplementary Fig. S38). Therefore, we assume that the issues raised here will remain of high importance also for the newer version of the EPIC array. We also note that despite the announced large-scale removal of SNPs, some SNP 0 (~ 15%) and 1 (~ 25%) probes remained on v2.0. In addition, we observed that some probes are not marked as containing SNPs (neither by Pidsley [5] or Zhou lists [14], nor by the Illumina Manifest), but clearly demonstrate SNP-II-0 (Supplementary Fig. S39) or SNP-II-1 behaviour (Supplementary Fig. S40).

Discussion

Despite considerable investment in improving existing analysis pipelines for popular Illumina methylation arrays, room for improvement remains. To facilitate the generation of meaningful findings that will not only increase our understanding of the epigenome and its relationship with health and disease, but also translate into clinically useful tools, it is vital that we fully understand how robust these DNA methylation arrays perform. Here we show that DNAme data from paired whole blood samples taken from the same individuals display variability over time which cannot be attributed to underlying genetic or biological factors alone. Much of the ‘unexplained’ temporal variability in the current study can be attributed to probe quality, which is primarily dictated by the probe sequence complexity and genome location.

Noise affects methylation values differently at different intensity levels: it has a dramatic effect on β values at low intensity, while at high intensities, the signal cancels out the effect of noise. We therefore developed an approach for assessing the unreliability of β values in a data-driven manner, using the negative control probes on the arrays to model the contribution of noise to any of the final signal intensities in a specific data set. Our new unreliability score correlates well with increasing degrees of variability observed between repeated samples, both in longitudinal dataset as well as in two validation sets with technical replicates. By modeling the noise distribution for each dataset for type II and type I/color probes independently, we were able to detect more unreliable probes compared to an existing detP method for detecting outlier probes (standard detP in minfi with 0.01 p-value threshold [10, 19] with 1.e−16 p-value threshold, [6] with 1.e−40 p-value threshold).

There is a marked difference in signal intensity and quality based on the C-content of the probe sequence and CpG content. On the one hand, this observation can help Illumina to achieve leveling of such an effect within the technological process, on the other hand, it will allow scientists to provide more qualitative comparisons on different regions of DNA, for example separating Islands and the Open Sea CpGs. Excluding low-intensity or unreliable probes from the analysis could help increase the detection of differentially methylated CpGs for different phenotypes and improve both the accuracy and precision of existing and emerging predictive models on this type of DNAme data. Beyond the mere exclusion of unreliable probes, new correction or normalization methods may emerge in the future based on the results of this work that could instead salvage the data generated from these probes.

Factors contributing to laboratory or methodological bias, such as sample storage and hybridization procedures, are relatively underexplored. Samples from different studies tend to be of different quality, yielding different average intensities depending on the instruments used and the exact laboratory protocols, which in turn can also affect the reproducibility of β values on probes with different intensities and estimated probe reliability. Therefore, accounting for probe reliability and raw signal intensities during initial quality control may also improve the reproducibility of DNAme studies across laboratories.

In summary, we developed a new computational method to further refine existing preprocessing methods for Illumina methylation array data by excluding unreliable probes from downstream analyses. We implemented our methods to calculate MI and unreliability scores into an R package, epicMI, which is publicly available on GitHub [20].