
1 Introduction

Omic and sequencing technologies have revolutionized the biomedical field [40]. With these technologies, it is now possible, at a reasonable economic cost, to measure the levels of molecular entities, for instance gene expression, genome-wide, in cellular specimens from large numbers of patients [8]. Analysis of these large genomic, or more generally “omic”, datasets promises to deliver the advances and biomarkers that are urgently needed in the biomedical field, heralding the new age of personalized medicine [34]. However, a serious obstacle in translating these mammoth amounts of data into biomedical advances is the presence of confounding factors, both technical and biological [21]. Recent studies [21, 43] have shown that technical confounding factors, generally referred to as batch effects, for instance the date on which a sample was processed, are omnipresent in omic datasets, affecting even some of the highest-profile studies such as The Cancer Genome Atlas [46] or the 1,000 Genomes Project [7]. Some estimates indicate that in any given study up to 80 % of measured molecular features can correlate with unwanted technical factors [21]. Furthermore, not adjusting for confounding factors can adversely impact statistical inference, compromising sensitivity and specificity [20, 45].

There are many reasons why these batch effects arise. Especially in the case of large-scale studies profiling hundreds to thousands of samples, samples will inevitably have been processed on different dates, by different laboratories or personnel, or on different plates or chips. Laboratory conditions can vary between dates, affecting the biological measurements, and the quality of the profiling technology may also vary significantly from batch to batch. Moreover, profiled samples may come from patients treated at different medical centers, and therefore the way samples were handled (e.g., time from sampling to storage) may introduce further variation (see e.g., [25]). All of these factors have been shown to introduce unwanted variation in the data, and since “the more you measure the more can go wrong”, it is clear that large-scale studies are particularly vulnerable to such confounding factors. On the other hand, it is worth pointing out that large-scale studies are also much better placed than small sample-size studies to adjust for confounding factors. For instance, it is easier to detect and subsequently correct for a single chip/plate effect if there are many other chips/plates in the study that have performed well, since the latter can then serve as controls.

The statistical design of a study is of critical importance in trying to prevent the potentially adverse effects of confounding factors on downstream statistical inference. Clearly, the statistical design of a study must ensure that a number of specific research questions can be properly addressed. This typically requires that samples be distributed randomly across batches, ensuring balanced numbers of specific phenotypes across them. Thus, in comparing phenotypes A and B, one would randomize these across batches, ensuring balanced numbers of A and B in each batch. However, it is not unusual for unbalanced designs to arise as a result of samples dropping out, in turn caused by logistical or quality control issues. This is particularly true for large-scale studies, where logistical or quality control issues almost inevitably arise. These unbalanced designs can then have a dramatic negative impact on statistical inference if adjustment for the technical sources of variation is not performed. Thus, (large-scale) studies with an initially perfect study design may still be hampered by confounding factors.

There are a number of other key issues to mention in connection with confounding factors. First, it is clear that the potential impact of confounding factors will depend on the signal-to-noise ratio. This in turn depends on numerous study-specific factors, including the phenotype of interest, the nature of the confounding variation, and the tissue type being profiled. For instance, if one is measuring DNA methylation, a covalent modification of DNA that can affect the activity of nearby genes [9], and if the comparison is between normal and cancer tissue, then it is likely that batch effects can be ignored, since DNA methylation changes associated with cancer are generally of a large magnitude (high signal-to-noise ratio limit) [46]. On the other hand, if the Epigenome-wide Association Study (EWAS) [31] measuring DNA methylation is being conducted in whole blood tissue [24], then this is likely to involve small effect sizes in relation to technical sources of variation like chip effects, or biological factors such as age. For instance, in Rakyan et al. [31], the authors report a genomic site with a DNA methylation pattern in whole blood that correlates with smoking status, involving small 5–10 % shifts in average methylation between cases and controls. Such 5–10 % shifts could in principle also be caused by batch/chip effects. Similarly, such small shifts in average DNA methylation levels could be due to relatively small changes in blood cell type composition, which in turn could be caused by differences in the age of the sampled individuals [43]. Thus, techniques like Singular Value Decomposition (SVD) are especially useful for omic data, since they easily allow an approximate relative quantification of the variance associated with different sources of variation [43].

A second important issue is that the way in which statistical inference is affected strongly depends on how the confounders are correlated to the phenotype of interest (POI) [19]. Clearly, a confounding factor which is anti-correlated to a POI will dampen the statistical significance, while positive correlations will lead to overoptimistic results. An orthogonal confounder of large variability in relation to the POI signal will similarly compromise the statistical significance and lead to a large false negative rate (FNR). Thus, when analyzing omic data it is important to be aware of these different potential scenarios and generation of \(P\)-value histograms is strongly recommended as a means of detecting the strength and type of confounding [19].
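To make the recommendation of \(P\)-value histograms concrete, the sketch below generates such a histogram from feature-wise tests against a binary phenotype; it uses simulated toy data, not any of the datasets discussed in this chapter, and variable names are purely illustrative. A roughly uniform histogram is consistent with no signal and no confounding, whereas systematic deviations hint at true signal, inflation by positive confounding, or loss of power due to an orthogonal confounder.

```python
# A minimal sketch (toy data only): inspect the P-value histogram of a
# feature-wise association test to gauge the strength/type of confounding.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))          # features x samples (toy data)
y = np.repeat([0, 1], 25)                # binary phenotype of interest

# Feature-wise two-sample t-tests against the phenotype
pvals = np.array([stats.ttest_ind(x[y == 0], x[y == 1]).pvalue for x in X])

plt.hist(pvals, bins=20, edgecolor="black")
plt.xlabel("P-value"); plt.ylabel("Number of features")
plt.show()  # ~uniform under the null; shape deviations hint at signal or confounding
```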

Last but not least, confounding sources of variation can be of a very different nature, directly influencing the type of statistical adjustment procedure to be used. For instance, some confounders, like plate or date, are examples of known confounders in the sense that we know exactly on which date and on which plate a given sample was processed, as these are factors that are normally recorded in an experiment. In this case, adjustment with (Bayesian) regression models, which use the confounders as explicit covariates, is possible and indeed fairly popular [16]. However, surprisingly often confounders are only known with uncertainty or error. For instance, in DNA methylation studies conducted with the Illumina Infinium beadchips, samples need to be preprocessed using a bisulfite conversion step, which translates epigenetic changes into genetic ones, allowing these to be measured on the beadchip [4]. This conversion step is variable between samples, and although the conversion efficiency can be measured using control probes on the beadchip, this measurement is subject to error. As another example, we have observed components of variation in DNA methylation data associated with the season in which samples were collected. Season can be viewed as a surrogate for temperature, which is the more likely causal factor, yet the exact temperature to which the samples were exposed during transportation from medical centers to the central processing lab was not recorded. At the other extreme, we may have confounders which are completely unknown, for which there is no correlated known factor that could be used as a surrogate. All these considerations are important in the context of this chapter, because clearly in the latter two scenarios, explicit adjustment for confounders is neither advisable nor possible. Hence, blind source separation (BSS) techniques are needed to infer these confounders from the data itself. On the other hand, as we shall see, known confounders also become useful in the BSS context, since they can be used to objectively evaluate the quality of blind source separation.

It is paramount to stress again the importance of adjusting for confounding factors, as not doing so can seriously reduce the effective power of the studies, or lead to unacceptably large false discovery rates [21, 45]. Thus, there is an urgent need for powerful statistical methods to be applied in the biomedical field to help address these significant challenges. To further motivate a BSS-based approach to statistical inference, we emphasize that it is only natural to view any biological omic dataset as an interference pattern, with some sources of variation reflecting the biological phenotype of interest, and others reflecting the effects of technical factors. Therefore, BSS methods are optimally placed to infer such sources of variation.

Indeed, BSS methods have already been extensively applied to omic data, but mainly as a means of performing dimensionality reduction to identify biological sources of variation [12, 18, 22, 23, 28, 42, 49] and, secondly, as a means of performing feature selection and classification [14]. Specific popular BSS algorithms include Independent Component Analysis (ICA) [15] and non-negative matrix factorisation (NMF) [13], which have been applied to diverse data types, from gene expression [42] to DNA methylation data [51], including even mutational data [1] and multidimensional cancer genomic profiles [50]. The earliest studies already demonstrated that BSS methods like ICA and NMF lead to substantial improvements in modeling biological sources of variation, and that these improvements are mainly due to the sparse (supergaussian) nature of the underlying biological sources [18, 42].

In contrast, relatively few BSS applications have focused on the problem of artifact removal in biomedical data, which is surprising given that technical sources of variation are omnipresent in such data and that they can so negatively affect statistical inference. We would also argue that the application of BSS methods to identify and remove technical artifacts in real omic data provides a substantially better framework in which to objectively evaluate BSS algorithms. There are several reasons for this. First, biological sources of variation, such as the activity of a molecular signaling pathway, are “fuzzy” objects and can only rarely be used to define a ground truth. On the other hand, technical artifacts are sometimes well known to the experimentalist performing the study and hence, as explained above, these can be exploited to assess the quality of BSS separation. Indeed, we recently demonstrated the feasibility of this conceptual framework for assessing BSS methods in a proof-of-principle study, analyzing both DNA methylation and gene expression data [45]. In that work, we proposed an algorithm called Independent Surrogate Variable Analysis (ISVA), based on ICA, for performing supervised normalization in the presence of confounding factors [45], demonstrating its superiority over non-BSS based alternatives. The main purpose of this chapter is therefore to demonstrate that BSS methods can lead to substantial improvements in statistical inference in large omic datasets, thanks to a more efficient deconvolution of the confounding sources of variation. Our secondary aim is to increase the awareness among the BSS community of the importance of this fairly novel BSS application to artifact removal in biomedical omic data, and thus provide fertile ground for interdisciplinary cross-pollination.

This chapter is organized as follows. First, because most of the examples considered in this chapter are drawn from studies in DNA methylation, we provide the reader with a brief introduction to DNA methylation and the Illumina Infinium Beadarray technology, a technology that allows genome-wide measurements of this epigenetic mark. In the subsequent section, we provide a number of examples of confounding variation in omic data and describe their negative impact on downstream statistical inference, including examples where methods based on explicit adjustment of confounders cannot be applied. In Sect. 17.3, we describe the problem of performing supervised analysis in the background of confounding factors, introducing and reviewing the SVA framework of Leek et al. [19, 20]. We argue theoretically why SVA may break down and why a BSS method is needed to avoid the pitfalls associated with SVA. This motivates the ISVA algorithm [45], which we review in the next subsection. In Sect. 17.4, we validate ISVA on simulated data and demonstrate the need for adjustment of confounding factors. In Sect. 17.5, we compare ISVA to SVA in modeling beadchip effects in real omic data. Section 17.6 provides a rigorous evaluation of ISVA on eight real omic datasets, using the non-BSS SVA method as well as another method based on explicit adjustment as benchmarks. In the final section, we briefly explore the performance of a generalized BSS algorithm in modeling beadchip effects. We end with conclusions and suggestions for further research.

2 DNA Methylation and the Illumina Infinium Beadarray Technology

DNA methylation refers to the covalent attachment of a methyl (CH\(_3\)) group to DNA cytosines, normally, but not exclusively, in the context of a CG dinucleotide, referred to as a CpG [9]. There are about 30 million such CpG sites in the human genome, most of which are methylated. These 30 million CpG sites in fact represent an underenrichment of CpGs in the human genome relative to what would be expected by chance. In some genomic regions, however, the density of CpGs is much higher than normal, and these regions are referred to as CpG islands. Roughly 60 % of gene promoters fall within CpG islands, and most of these are normally unmethylated. Thus, whereas most of the genome is methylated, many of the promoter CpG islands are unmethylated in the normal state.

DNA methylation is important for a number of reasons. It is not only essential for embryonic development, but is also key to later developmental processes [9]. Very recently, it has been demonstrated that differentially methylated regions between diverse normal cell types are enriched for transcription factor binding sites, supporting the view that DNA methylation is associated with how accessible the DNA is to transcription factors. Thus, hypomethylation, i.e., loss of DNA methylation, allows transcription factor proteins to more easily bind to DNA in order to initiate developmental differentiation programs. The DNA methylation state at the gene promoter is also a key determinant of the gene’s activity, i.e., its gene expression level, with promoter hypermethylation normally associated with gene silencing [9]. DNA methylation is particularly important in diseases like cancer, where it is significantly altered [11, 17]. Indeed, a key cancer hallmark is the hypermethylation of CpG island promoters, whilst most of the cancer genome undergoes widespread hypomethylation. These deregulations in DNA methylation may lead, respectively, to underexpression/silencing of key tumor suppressor genes or to overexpression of oncogenes (tumor-promoting genes).

DNA methylation can be measured fairly accurately using a number of different technologies. In this chapter, we will be considering DNA methylation data generated using the Infinium beadarray technology from Illumina [4]. In particular, we will be considering a version of this technology, called Infinium 27k, that allows measurement of DNA methylation at over 27,000 CpG sites, mostly located within gene promoters of approximately 14,000 genes. The beadarray consists of a set of probes that interrogate the methylation state at each of these 27,000 sites. For each CpG site, there are two sets of probes, one designed to match the methylated version of the allele, while the other matches the unmethylated version. This is made possible by treating the DNA with bisulfite, prior to hybridisation to the beadarray. During bisulfite conversion, unmethylated cytosines are converted into uracil and then thymine upon DNA amplification (i.e., \(uC\rightarrow T\)), whereas methylated cytosines are protected and remain cytosines (i.e., \(mC\rightarrow C\)). Thus, an epigenetic difference can be translated into a genetic one, which is then easily measured using probes on the beadarray as described. While the methylation state of a given CpG site in a given diploid cell can take only three values (0 \(=\) both alleles unmethylated, 1 \(=\) only one of the alleles is methylated, 2 \(=\) both alleles are methylated), in practice, measurement is taken over many thousands of cells, with the methylation state also being potentially variable between cells. Hence, methylation at a single CpG site in a given sample taken from an individual is quantified in terms of a \(\beta \)-distributed quantity, \(\beta = M/(U\,\)+\(\,M)\), where \(M\) and \(U\) denote the intensities of the methylated and unmethylated versions of the allele, as estimated from the respective probes on the array. By construction, this \(\beta \)-value lies between 0 (unmethylated) and 1 (fully methylated).
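As an illustration of the \(\beta\)-value definition above, the following minimal sketch computes \(\beta = M/(U+M)\) from methylated and unmethylated probe intensities. The small offset added to the denominator is not part of the definition; it is an assumption mimicking the common practice of stabilizing estimates for low-intensity probes.

```python
# A minimal sketch: convert methylated (M) and unmethylated (U) probe
# intensities into beta-values, beta = M / (U + M). The offset (here 100,
# an arbitrary choice) is not part of the definition; it only stabilizes
# estimates for low-intensity probes.
import numpy as np

def beta_values(M, U, offset=100.0):
    """M, U: arrays of shape (n_CpGs, n_samples) of probe intensities."""
    M = np.asarray(M, dtype=float)
    U = np.asarray(U, dtype=float)
    return M / (M + U + offset)

# Example: one CpG measured in two samples
print(beta_values([[5000.0, 200.0]], [[1000.0, 6000.0]]))
# -> roughly [[0.82, 0.03]]: mostly methylated vs. largely unmethylated
```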

A number of important features of the Illumina methylation beadarrays are worth mentioning. First, a maximum of 12 samples can be measured on any given beadchip. As with any technology, the quality of beadchips can vary from batch to batch. Also, the DNA quality of a sample can vary significantly, which would subsequently affect \(\beta \)-value estimates. For these reasons, the beadchips are equipped with a number of control probes, each designed to measure the quality of a particular aspect of the assay. For instance, bisulfite conversion efficiency (BSC) could vary between samples, causing biases in the \(\beta \)-values, and this can be assessed using built-in control probes which measure the efficiency of bisulfite conversion.

3 Confounding Factors in Large-Scale Omic Studies

In order to illustrate the nature and impact of the problem posed by confounding factors, we consider two examples, taken from two separate DNA methylation studies generated with the Infinium 27k technology. Let us consider our first example. This is a DNA methylation dataset of whole blood samples from 187 individuals with type-1 diabetes, including both sexes, and with individuals drawn from two underlying cohorts. This particular dataset was used to test if DNA methylation changes correlate with the age of the individual at sample draw; thus, age is here the POI [44]. The 187 samples were distributed over 17 different beadchips, with at most 12 samples per beadchip. A SVD of the 27,578 \(\times \) 187 row-centered (rows label CpGs) data matrix was performed to assess the nature of the largest sources of variation. As can be seen in Fig. 17.1, it is only the fifth component of variation that correlates with the POI (i.e., age), with the top components correlating with other factors such as sex, BSC and (bead)chip. Furthermore, it can be seen that the fifth component also correlates with chip, indicating that chip could be a potential confounder. This example further illustrates that technical or other biological variation can be orders of magnitude larger than the effect size of interest.
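The type of analysis just described can be sketched as follows (inputs and variable names are illustrative, not tied to the actual dataset): row-center the CpG \(\times\) sample matrix, compute its SVD, report the relative variance carried by the top components, and test each right singular vector for association with a known categorical factor, such as beadchip, via a one-way ANOVA.

```python
# A minimal sketch (hypothetical inputs): SVD of a row-centered
# CpGs x samples matrix, relative variance per component, and association
# of each right singular vector with a known categorical factor (e.g. chip).
import numpy as np
from scipy import stats

def svd_components(X, factor, n_comp=7):
    """X: (n_CpGs, n_samples); factor: length-n_samples categorical labels."""
    factor = np.asarray(factor)
    Xc = X - X.mean(axis=1, keepdims=True)          # row-center (CpG-wise)
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    n_comp = min(n_comp, len(d))
    var_frac = d**2 / np.sum(d**2)                  # relative variance fractions
    pvals = []
    for k in range(n_comp):                         # one-way ANOVA of V_k vs factor
        groups = [Vt[k][factor == g] for g in np.unique(factor)]
        pvals.append(stats.f_oneway(*groups).pvalue)
    return var_frac[:n_comp], np.array(pvals)
```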

Fig. 17.1
figure 1

a Relative fraction of variation carried by each of the seven significant singular vectors of a SVD, as measured relative to the total variation in the data. Number of significant singular vectors was estimated using Random Matrix Theory (RMT) [45]. Some of the singular values are labeled according to which confounders the corresponding singular vectors are correlated to, as shown in panel b. b Heatmap of \(P\)-values of association between the seven significant singular vectors and the phenotype of interest (here age at sample draw) and confounding factors (Chip, cohort, sex, and bisulphite conversion (BSC) efficiency controls 1 and 2). \(P\)-values were estimated using linear ANOVA models in the case of chip, cohort and sex, while linear regressions were used for age and BSC efficiency. Color codes: \(P<1e-10\) (brown), \(P<1e-5\) (red), \(P<0.001\) (orange), \(P<0.05\) (pink), \(P>0.05\) (white)

As a second example, we consider a DNA methylation dataset of 48 samples, consisting of 30 normal samples from the cervix and 18 representing an intraepithelial cervical neoplasia of grade 2 or higher (CIN2\(+\)) (a preinvasive cancer condition). Here too, a SVD on the row-centered data matrix reveals that it is only the third, fourth, and fifth components that correlate with biological factors such as age or CIN2\(+\) status (Fig. 17.2a–b). Furthermore, unsupervised clustering of the samples does not lead to segregation of the samples according to CIN2\(+\) status, as one would have expected on biological grounds (Fig. 17.2c). This example also illustrates that the top component of variation correlates with an unknown factor, possibly spatial artifacts on the chips, which is nevertheless largely independent of chip. The key point to appreciate here is that there is no known surrogate factor that we can use to model this confounding source of variation, and hence explicit adjustment for this confounder using a multivariate regression model in which the confounder is included as a covariate is not possible [16].

Fig. 17.2
figure 2

Confounding variation in a DNA methylation dataset of 30 normal cervical samples and 18 cervical intraepithelial neoplasias of grade 2 or higher (CIN2\(+\)). a Relative fraction of variation carried by each of the six significant singular vectors of a SVD, as measured relative to the total variation in the data. Number of significant singular vectors was estimated using Random Matrix Theory (RMT) [45]. Some of the singular values are labeled according to which confounders the corresponding singular vectors are correlated to, as shown. b Heatmap of \(P\)-values of association between the six significant singular vectors and the phenotypes of interest (here CIN2\(+\) status and age at sample draw) and confounding factors (Chip and bisulphite conversion efficiency (BSCE)). \(P\)-values were estimated using linear ANOVA models in the case of chip and CIN2\(+\) status, while linear regressions were used for age and BSCE. Color codes: \(P<1e-10\) (brown), \(P<1e-5\) (red), \(P<0.001\) (orange), \(P>0.05\) (white). c Hierarchical clustering of the 48 samples over the 5,000 most variable probes

4 Supervised Normalization by SVA and ISVA

The previous examples illustrate some of the difficulties that confounding factors can pose in statistical analyses. One of the common tasks in omic data analysis is to perform a supervised analysis in which we seek to identify features associated with a phenotype of interest. Clearly, such a task may be compromised by the presence of confounding factors, especially if the confounder is unknown or only known subject to error, since in these cases we cannot adjust for it explicitly. Thus, one desires a statistical framework in which to perform supervised analysis (i.e., feature selection) in the presence of uncertain or unknown confounding factors. We refer to this supervised analysis problem as “supervised normalization”, in the sense that the normalization of the data is performed as part of the supervised analysis and is therefore dependent on the phenotype of interest. So far, only two algorithms, SVA [19, 20] and ISVA [45], have been proposed to address this problem in the context of omic data, where by definition the number of features is relatively large.

4.1 Surrogate Variable Analysis

Leek and Storey proposed an ingenious solution to the problem posed above, known as SVA [19, 20], which we now describe. Let us assume that we have a data matrix, \(X_{ij}\), with \(i\) (\(i=1,\ldots ,p\)) labeling the features (genes, CpGs,...) and \(j\) (\(j=1,\ldots ,n\)) labeling the samples, with \(p \gg n\). Furthermore, we assume that each row of \(X\) has been mean centered, and that we have a POI encoded by a vector \(\varvec{y} = \lbrace y_1,\ldots ,y_n\rbrace \). As in [20] we may allow for a general function of the phenotype vector, so that the starting model for SVA takes the form

$$\begin{aligned} X_{ij}=f_i(y_j)+\epsilon _{ij}. \end{aligned}$$
(17.1)

Typically, \(f_i(y)\) would be a function of the form \(f_i=b_iF(y)\) with \(b_i\) a feature specific regression parameter (to be estimated) and \(F\) representing a general link function. Thus, SVA starts by performing univariate regressions, leading to estimates \(\hat{b}_i\) as well as an estimate of the error matrix \(\epsilon \), which we shall call the residual variation matrix, \(R\equiv \hat{\epsilon }\). Componentwise, \(R_{ij}\equiv X_{ij}-\hat{f_i}(y_j)\). SVA then proceeds by performing a SVD of the residual variation matrix

$$\begin{aligned} R=UDV^T. \end{aligned}$$
(17.2)

Thus, the singular vectors of the SVD capture variation which is orthogonal to the variation associated with the POI. This residual variation is therefore likely to be associated with other biological factors, not of direct interest, or with experimental factors, all of which constitute potential confounders. SVA provides a prescription for the construction of surrogate variables, \(v_k\) (\(k=1,\ldots ,K\) with \(K < n\)), in terms of the singular vectors (i.e., the column vectors of \(V\)) of this SVD [20]. In the final step, feature selection is performed using the modified regression model

$$\begin{aligned} X_{ij}=f_i(y_j)+ \sum _{k=1}^K{\lambda _{ki}v_{kj}} + \epsilon ^\prime _{ij}, \end{aligned}$$
(17.3)

with the rows of \(\epsilon ^\prime \) now uncorrelated [19].
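For concreteness, a minimal sketch of these first SVA steps is given below, assuming a simple linear link \(F(y)=y\) and a row-centered data matrix; the subsequent construction of the surrogate variables and the permutation-based significance assessment of [20] are not reproduced here.

```python
# A minimal sketch of Eqs. 17.1-17.2, assuming a linear link f_i(y) = b_i * y:
# feature-wise regression of X on the phenotype y, the residual matrix R,
# and an SVD of R. (The full SVA construction of surrogate variables [20]
# is not reproduced here.)
import numpy as np

def residual_svd(X, y):
    """X: (p features, n samples), row-centered; y: length-n phenotype."""
    y = np.asarray(y, dtype=float)
    yc = y - y.mean()
    b_hat = X @ yc / (yc @ yc)            # univariate OLS slope per feature
    R = X - np.outer(b_hat, yc)           # residual variation matrix
    U, d, Vt = np.linalg.svd(R, full_matrices=False)
    return R, Vt                          # rows of Vt: candidate surrogate directions
```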

In the above framework, it is key to realize that SVA hinges on a big assumption, namely that we have a perfect, or at least sufficiently accurate, model \(F(y)\) describing the data, such that the residual variation encapsulated by the matrix \(R\) does not contain any biological variation of interest (see left part of Fig. 17.3). In this case, the only requirement on the surrogate variables describing the confounding variation is that they span the residual variation space. We note that there is in fact no requirement for the surrogate variables (SVs) to align with (i.e., precisely model) the confounding factors.

Fig. 17.3
figure 3

Surrogate Variable Analysis (SVA) begins by performing a regression of the data matrix, \(X\), against the phenotype of interest, \(Y\), specified through a possibly nonlinear function \(F(Y)\). In the equation above, \(B\) denotes regression parameters, whereas \(R\) denotes the residual variation, i.e., the variation in the data not explained by the phenotype of interest under the specified model \(F\). Under such a model, there are two possible scenarios. In the ideal scenario (left pointing arrow), \(F(Y)\) models the data perfectly in the sense that the residual variation space, depicted by the plane R, contains no residual biological variation of interest. In this case, the surrogate variables, which are estimated from a SVD of \(R\) and are indicated by blue arrows, do not need to align with the confounding factors (green arrows), as they are only required to span the same plane R. However, in the more realistic scenario, there could be imperfections in the model \(F(Y)\) (e.g., using a linear model when the relationship between \(X\) and \(Y\) is nonlinear), which in turn could lead to residual biological variation (red arrow) in the residual variation space R. In this case, we need to choose surrogate variables that align with the confounders and “avoid” the residual biological variation of interest, since otherwise using the whole space \(R\) in the subsequent adjustments will lead to loss of biological signal. Thus, in this scenario, we need to select an appropriate subspace of \(R\) and only use this subspace for the subsequent adjustments and supervised analysis. ISVA uses ICA instead of PCA/SVD in the decomposition of \(R\), thus allowing surrogate variables to be inferred that better model the confounding sources of variation. Geometrically, this means that the independent surrogate variables align significantly better with the confounders and the residual biological variation, thus allowing an appropriate subspace of \(R\) to be selected. This subspace should not contain any residual biological variation, and ICA is key to achieving this

However, now consider an alternative and, as we shall see later, more realistic scenario, where the model \(F(y)\) is imperfect. For instance, we may be using a linear function \(F\) when the relation between data and POI is highly nonlinear. In this case, residual biological variation of interest may be present in \(R\) (see right part of Fig. 17.3). In such a scenario, we would want our SVs to align with the confounding factors and not with the residual biological variation, since otherwise inclusion of the latter in the subsequent adjusted supervised analysis (Eq. 17.3) would lead to a reduced biological signal. Later we shall see examples of this happening. Hence, in this more realistic scenario, we need to choose SVs that span a subspace of \(R\), i.e., one that is also orthogonal to the residual biological variation. This in turn means that we need an algorithm that can more accurately deconvolve the confounding sources from the residual biological variation. As one might expect (and we shall see examples of this later), the SVD used in SVA cannot accurately deconvolve these different sources of variation. This motivates the introduction of BSS methods in the context of supervised normalization.

4.2 Independent Surrogate Variable Analysis

Motivated by the discussion above, we seek a BSS method that can more accurately infer the sources of variation in the estimated residual matrix \(R\). The generalization of SVA in which a BSS method is used to decompose \(R\) is called ISVA [45]. Although many BSS methods exist, in [45] we considered one of the simplest versions of ICA, the “fastICA” algorithm [15]. Thus, as with SVA, there are three parts to the ISVA algorithm: (i) detection of confounding/unmodeled factors (steps 1–4), (ii) construction of surrogate variables (SVs) (steps 5–10), and (iii) final feature selection using the SVs as covariates (step 11).

In detail, the steps in ISVA are:

  1. Construction of the residual variation matrix by removing the variation associated with the phenotype of interest: \(R_{ij}\equiv X_{ij}-\hat{f_i}(y_j)\).

  2. We estimate the intrinsic dimensionality, \(K\), of the residual variation matrix using RMT [29]. This gives the number of components used as input to the ICA algorithm.

  3. Perform ICA on \(R\): \(R=SA+\epsilon \), with \(S\) a \(p\times K\) source matrix and \(A\) a \(K\times n\) mixing matrix. We point out that in this formulation of ICA, the statistical independence requirement is imposed on the columns of \(S\). We denote the columns of \(S\) and rows of \(A\) by \(S_k\) and \(A_k\), respectively.

  4. We regress each \(A_k\) against each feature profile \(X_i\) (\(i=1,\ldots ,p\)) and calculate \(P\)-values of association \(p_i\).

  5. From this \(P\)-value distribution, we estimate the FDR using the \(q\)-value method [38] and select the features with \(q<0.05\). If the number of selected features is less than 500, we select the top 500 features (based on \(P\)-values). Let \(r_k\) denote the number of selected features.

  6. We construct the reduced \(r_k\times n\) data matrix \(X_r\) obtained by selecting the features from the previous step.

  7. Perform ICA on \(X_r\) using \(K\) independent components: \(X_r=S_rA_r + \epsilon _r\). Find the row \(k^*\) of \(A_r\) that best correlates (in absolute value) with \(A_k\).

  8. Set the SV \(v_k=(A_r)_{k^*}\). The purpose of steps 4–8 is to regularize the estimates and thus avoid overfitting, as explained in [20].

  9. Repeat steps 4–8 for each significant independent component, \(A_k\), obtained in step-3.

  10. Perform SV subspace selection using a SV selection criterion. Let \(K^*\) denote the set of selected SVs.

  11. Finally, we run the model

      $$\begin{aligned} X_{ij}=f_i(y_j)+\sum _{k\in K^*}{\lambda _{ki}v_{kj}} + \epsilon ^\prime _{ij}, \end{aligned}$$
      (17.4)

      and perform feature selection using an FDR (\(q\)-value) estimation procedure [38] and a nominal \(q\)-value threshold of, say, 0.05.
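A minimal sketch of steps 1–9 is given below. It uses scikit-learn's FastICA as a stand-in for the fastICA algorithm of [15], assumes \(K\) has already been estimated (via RMT in ISVA), and substitutes a Benjamini–Hochberg step for the \(q\)-value method of [38]; thresholds and helper names are purely illustrative, and the subspace selection of step 10 and the final regression of step 11 are omitted.

```python
# A minimal sketch of ISVA steps 1-9 (illustrative only): the residual matrix R
# is decomposed with scikit-learn's FastICA as a stand-in for fastICA [15];
# K is assumed to be known (in ISVA it is estimated via RMT), and a
# Benjamini-Hochberg step substitutes here for the q-value method of [38].
import numpy as np
from scipy import stats
from sklearn.decomposition import FastICA


def bh_count(pvals, q):
    """Benjamini-Hochberg: number of discoveries at FDR level q (illustrative)."""
    p = np.sort(np.asarray(pvals))
    m = len(p)
    below = np.nonzero(p <= q * np.arange(1, m + 1) / m)[0]
    return int(below[-1]) + 1 if below.size else 0


def isva_surrogates(R, K, q_thresh=0.05, min_feat=500):
    """R: (p features x n samples) residual matrix; returns a list of SVs (each length n)."""
    ica = FastICA(n_components=K, random_state=0)
    S = ica.fit_transform(R)                    # step 3: p x K source matrix
    A = ica.mixing_.T                           # K x n mixing matrix
    svs = []
    for k in range(K):                          # step 9: loop over components
        # step 4: feature-wise association of A_k with the rows of R
        p_i = np.array([stats.pearsonr(A[k], R[i])[1] for i in range(R.shape[0])])
        # steps 5-6: select features by FDR (at least 500) and build reduced matrix
        n_sel = max(bh_count(p_i, q_thresh), min_feat)
        X_r = R[np.argsort(p_i)[:n_sel]]
        # step 7: ICA on the reduced matrix, find the row of A_r best correlated with A_k
        ica_r = FastICA(n_components=K, random_state=0).fit(X_r)
        A_r = ica_r.mixing_.T
        corrs = [abs(np.corrcoef(A_r[j], A[k])[0, 1]) for j in range(K)]
        svs.append(A_r[int(np.argmax(corrs))])  # step 8: the surrogate variable
    return svs                                  # step 10 (subspace selection) follows
```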

As formulated above, there are three differences between ISVA [45] and SVA [19]. First, ISVA uses RMT to estimate the dimensionality, in contrast to SVA, which uses an explicit randomization procedure [20]. This difference is, however, not of major consequence [45]. Second, ISVA uses ICA in step-3 instead of SVD. Third, ISVA incorporates a SV subspace selection step (step-10) using a SV selection criterion that we shall discuss in detail in Sect. 17.7.4. This step is absolutely key to the improved inference that ISVA offers, and we point out here that the use of a BSS method in step-3 is also key to facilitating the choice of SV subspace in step-10. Finally, we remark that any BSS technique could be used to model the sources of variation in \(R\) (step-3), and thus the ISVA framework can easily be generalized to incorporate more sophisticated BSS algorithms.

5 Validation of SVA and ISVA on Simulated Data

Before exploring the SVA and ISVA algorithms in the context of real data, it is illuminating to first compare their performance on simulated data. The simulation model is exactly the one considered in [45], and for completeness we provide full details again in the Appendix. Briefly, we generated synthetic data matrices with 2,000 features and 50 samples and considered the case of two confounding factors (CFs) in addition to the primary POI. The primary phenotype is a binary variable \(y\), with 25 samples in one class (\(y=0\)) and the remaining 25 in the other (\(y=1\)). Similarly, each confounding factor is assumed to be a binary variable affecting one half of the samples (randomly selected). We further assume 10 % of features (200 features) to be true positives (TPs) discriminating the two phenotypic classes. We model the confounding factors as follows: each confounding factor is assumed to affect 10 % of features, with a 25 % overlap with the TPs (i.e., 50 of the 200 TPs are confounded by each factor). Without loss of generality, noise is modeled by a Gaussian of mean zero and unit variance, \(N(0,1)\). We further assume that the POI is associated with an effect size \(e_y(=\Delta \mu /\sigma )\) of 1, i.e., the difference in the means between the phenotypes, \(\Delta \mu \), equals the standard deviation, \(\sigma \), within each group. Effect sizes of the two confounders are assumed to be equal to \(e_{CF}\), and we define the relative effect size as \(e_R\equiv e_{CF}/e_y=e_{CF}\). We here consider the case \(e_R=2\), corresponding to a situation where the confounding factors are associated with a larger variance than the POI. The simulation model is run a total of 100 times, and for each run we record the following measures (using an estimated FDR threshold of 0.05): the sensitivity (SE), the positive predictive value (PPV), the sensitivity for TPs specifically affected by the confounding factors (SE-A), and the overall correlation (\(R^2\)-values) to the CFs. For the first three measures, we also compare SVA and ISVA to a simple linear regression method that does not perform any adjustment for the confounding factors (LR). Results are shown in Fig. 17.4.
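For reference, a sketch of a data generator following this simulation model is shown below. Whether the two confounders overlap the same or different true positives is not specified above, so disjoint subsets are used here, and all parameter and variable names are illustrative rather than those used in [45].

```python
# A minimal sketch of the simulation model described above (parameter names
# are illustrative): 2,000 features x 50 samples, a binary POI affecting 200
# true-positive features with effect size 1, and two binary confounders each
# affecting 200 features (50 of which overlap the true positives) with
# relative effect size e_R.
import numpy as np

def simulate(p=2000, n=50, n_tp=200, n_cf_feat=200, overlap=50, e_R=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(p, n))               # N(0,1) noise
    y = np.repeat([0, 1], n // 2)                       # POI: 25 vs 25 samples
    tp = np.arange(n_tp)                                # true-positive features
    X[np.ix_(tp, np.where(y == 1)[0])] += 1.0           # POI effect size e_y = 1
    confounders = []
    for c in range(2):
        cf = rng.permutation(np.repeat([0, 1], n // 2)) # binary CF, half the samples
        # 50 features overlap the TPs (disjoint per CF here); rest from null features
        feat = np.concatenate([tp[c * overlap:(c + 1) * overlap],
                               rng.choice(np.arange(n_tp, p), n_cf_feat - overlap,
                                          replace=False)])
        X[np.ix_(feat, np.where(cf == 1)[0])] += e_R    # CF effect size e_CF = e_R * e_y
        confounders.append(cf)
    return X, y, confounders
```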

Fig. 17.4
figure 4

Feature selection performance metrics of different algorithms over 100 runs of the synthetic data run with \(e_R=2\). The algorithms for feature selection are SVA, ISVA, and a simple linear regression without adjustment for confounders (SLR). For a given estimated FDR threshold of 0.05, we compare the sensitivity/power (SE), the positive predictive value (PPV), the sensitivity to detect true positives which are affected by confounders (SE-A), and the average \(R^2\)-value between confounders and the best correlated surrogate variable. See Appendix for further details of the simulation model

From this figure, we can make the following observations. First, the PPV is high for all methods, and is in line with the estimated FDR (\(=\)1-PPV) of 0.05 used in performing feature selection. Second, we can see that the power of the study is reduced if no adjustment is made for the confounding factors. Indeed, we can see that, focusing on those true positive features which are corrupted by confounding variation, the sensitivity to retrieve these features is improved approximately twofold by using SVA or ISVA. Third, ISVA and SVA perform similarly on simulated data, despite the fact that ISVA reconstructs the confounding factors at substantially higher \(R^2\) values. Thus, the simulated data nicely illustrates the “perfect model” scenario depicted in the left side of Fig. 17.3. Since the data are simulated with the same model that is subsequently used to run the univariate regression, the residual variation matrix \(R\) contains no residual biological variation, hence it does not matter if the SVs align with the confounders. The main requirement is for the SVs to span the space \(R\), and hence similar results are obtained using the SVs from SVA or ISVA, since in both cases, the SVs span the same space.

6 Improved Modeling of Confounding Factors in Omic Data by BSS Methods

In the previous section, we have seen how ISVA models the confounding factors much better than SVA. The aim of this section is to demonstrate that ISVA also leads to improved modeling of the confounding sources of variation in real data. Later, in the subsequent section, we shall see how this translates into improved feature selection. Once again, we consider DNA methylation data, and as confounding factor we consider the beadchip. Illumina Infinium beadchips can accommodate at most 12 samples per chip; hence, in studies with substantially more samples, several beadchips are required and beadchip effects can be assessed. Importantly, it is always known which samples were profiled on which beadchip, hence this is an example of a known confounder, and thus it can be used to objectively assess the quality of blind source separation. As a benchmark, we consider SVA, which uses SVD/PCA to decompose the residual variation matrix. As shown in Fig. 17.5, the surrogate variables inferred using ISVA model the beadchip effects substantially better than those inferred using SVA, as indicated by the significantly higher \(R^2\) values. For further examples, we refer the reader to [45].
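The evaluation underlying Fig. 17.5 can be sketched as follows (inputs are hypothetical): regress a surrogate variable on the categorical beadchip label via a one-way ANOVA and report the \(R^2\) and \(F\)-statistic; the higher the \(R^2\), the better the surrogate variable captures the beadchip effect.

```python
# A minimal sketch of the evaluation in Fig. 17.5: a one-way ANOVA of a
# surrogate variable against the (categorical) beadchip label, reporting
# R^2 and the F-statistic. Inputs are hypothetical.
import numpy as np

def chip_anova(sv, chip):
    """sv: length-n surrogate variable; chip: length-n beadchip labels."""
    sv, chip = np.asarray(sv, float), np.asarray(chip)
    groups = [sv[chip == c] for c in np.unique(chip)]
    grand = sv.mean()
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_total = ((sv - grand) ** 2).sum()
    r2 = ss_between / ss_total
    df_b, df_w = len(groups) - 1, len(sv) - len(groups)
    F = (ss_between / df_b) / ((ss_total - ss_between) / df_w)
    return r2, F
```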

Fig. 17.5
figure 5

Comparison of ISVA to SVA in identifying beadchip effects in the DNA methylation dataset from [3]. The weights (y-axis) of the two surrogate variables most significantly associated with beadchip effects are plotted against beadchip number (x-axis), for SVA and ISVA separately. To compare the identifiability of beadchip effects, we provide the \(R^2\) and F-statistics of a linear ANOVA model with beadchip number as the independent variable

7 Improved Feature Selection Using ISVA

We have seen that ISVA can model confounding sources of variation substantially better than SVA. This in turn should lead to improved statistical inference, e.g., feature selection, at least in those scenarios where it is necessary to select a surrogate variable subspace, as explained in Sect. 17.3. To demonstrate this, we first provide a number of real data examples where SVA breaks down. Subsequently, we show how ISVA circumvents the problem, leading to substantially improved statistical inference.

7.1 SVA Breakdown in mRNA Expression Data

In order to demonstrate that SVA can break down, we consider a real dataset with a known biological signature: it is well known that many genes implicated in cell proliferation and the cell-cycle are differentially expressed between high and low grade cancers [26, 32, 36, 41]. The grade of a cancer refers to the level of differentiation of the cancer cells, with high-grade cancers exhibiting a less differentiated state, whilst low-grade cancers are more differentiated in the sense that they are more similar to normal (healthy) tissue, which is a highly differentiated state compared to the undifferentiated stem cells from which it is derived. Thus, high-grade cancers are generally more aggressive and correspondingly are also characterized by a higher expression of cell proliferation and cell-cycle genes. This cell proliferation gene expression signature is a universal signature, able to distinguish high grade from low-grade cancers, irrespective of tissue type [26, 32, 36, 41]. Thus, given a gene expression dataset of high and low grade cancers, selecting features (genes) that best discriminate low and high grade cancers should lead to significant enrichment of genes implicated in the cell-cycle and cell proliferation. The enrichment of a top-ranked list of discriminatory genes for any gene ontology can be assessed using a Fisher’s exact test, as done, for instance, in [43]; this general procedure is known as Gene Set Enrichment Analysis (GSEA) [39]. If a feature selection method were not to yield significant enrichment for cell-cycle or cell proliferation genes, one would conclude that the feature selection procedure has failed to retrieve the known biological signature. Thus, in what follows we consider “grade” as the POI, and we aim to show that SVA breaks down, failing to retrieve the cell proliferation/cell-cycle enrichment due to the presence of confounding factors.
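A minimal sketch of such an enrichment test is given below: the selected genes, an ontology category (e.g., cell-cycle genes), and the background of all profiled genes define a \(2\times 2\) table, to which a one-sided Fisher's exact test is applied. Gene identifiers and set names here are placeholders, not the actual gene lists used in this chapter.

```python
# A minimal sketch of the enrichment analysis described above: a one-sided
# Fisher's exact test on a 2x2 table built from the selected genes, an ontology
# category (e.g. cell-cycle genes), and the background of all profiled genes.
from scipy.stats import fisher_exact

def enrichment_pvalue(selected, category, background):
    """selected, category, background: sets of gene identifiers."""
    background = set(background)
    selected = set(selected) & background
    category = set(category) & background
    a = len(selected & category)          # selected and in category
    b = len(selected - category)          # selected, not in category
    c = len(category - selected)          # in category, not selected
    d = len(background) - a - b - c       # neither
    _, p = fisher_exact([[a, b], [c, d]], alternative="greater")
    return p
```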

Specifically, we consider the case of breast cancer. There are two main subtypes of breast cancer: estrogen receptor positive (ER\(+\)) and estrogen receptor negative (ER\(-\)) breast cancer [48]. This stratification of breast cancers reflects the levels of expression of the estrogen receptor gene, ESR1, with ER\(-\) breast cancers showing absent expression of ESR1. Thus, in ER\(+\) breast cancer, ESR1 expression and activity are high, which results in the overexpression of genes within the ESR1 signaling pathway. We note that these ESR1 signaling genes are different from the cell-cycle/cell-proliferation ones. Now, it is well known that most ER\(-\) breast cancers are of high grade, whilst ER\(+\) breast cancers can be either high or low grade [41]. Thus, if the aim is to identify genes whose expression correlates with grade, ER status may be seen as a biological confounder, since the distribution of ER\(+\) and ER\(-\) tumors will differ between low and high-grade cancers. Furthermore, it is also well known that low and high grade ER\(+\) breast cancers do not differ in terms of the level of ESR1 expression and ER signaling [26, 36, 41]. Hence, in the task of identifying genes that are associated with grade, any gene set enrichment must be specific to the cell-cycle and should not include terms involved in ER signaling. In other words, if feature selection for grade-associated genes also leads to enrichment of ER signaling genes, then this indicates confounding by ER status. Although here the confounder is biological, this does not matter for the sake of comparing algorithms, and indeed the biological framework considered here provides a nice testing ground for the SVA and ISVA algorithms.

As expression data, we consider the data from four independent breast cancer studies [5, 26, 35, 36], as used in [45]. In these datasets, besides ER status, we also consider tumor size as a potential biological confounder. We note that in these datasets potential technical confounders, such as batch effects, are unknown. The \(P\)-values of the GSEA of the top-ranked grade-associated genes against cell-cycle and ER signaling terms are given in Table 17.1 for genes selected using SVA and for a feature selection method that uses ER status and tumor size as explicit covariates in the linear regression model (LR \(+\) CF).

Table 17.1 Grade associated expression differences: in each mRNA expression dataset and for each method (LR \(+\) CFs, SVA, ISVA) we give the number of confounding factors (CFs) or SVs used as covariates in the regression analysis, the number of genes differentially expressed with histological grade (nDEGs) at a false discovery rate threshold of 0.05 (FDR \(<0.05\)), and the \(P\)-value of enrichment (Hypergeometric/Fisher test) of cell-cycle and estrogen upregulated gene (ESR1-UP) categories among these differentially expressed genes

Based on this table, we can make two important observations. First, in three datasets, SVA predicts no differentially expressed genes between low and high grade breast cancer, a result which is in complete disagreement with extensive biological knowledge [26, 32, 41]. As a result, neither of the biological terms, cell-cycle or ER signaling, is enriched. Second, performing feature selection using a multivariate linear regression model with ER status and size as explicit covariates (LR \(+\) CF) leads to many differentially expressed genes (DEGs) in every dataset. Correspondingly, we observe strong enrichment of the cell-cycle term among these genes, consistent with biological knowledge. However, we also observe that ER signaling is significantly enriched in 2 out of 4 studies, hence the enrichment for cell-cycle genes is nonspecific. This means that explicit adjustment for the confounders has not fully eliminated the effect of one confounder (ER status), and hence we can conclude that the list of DEGs contains many false positives associated with ER signaling. This contamination by ER signaling genes is likely due to the fact that the immunohistochemically determined ER status of the samples is only approximate, i.e., the confounder is subject to error. Thus, neither method, SVA nor LR \(+\) CF, succeeds in yielding specific enrichment of cell-cycle genes among the genes associated with grade.

7.2 SVA Breakdown in DNA Methylation Data

As a second example, we consider DNA methylation data. A large number of studies have now unequivocally demonstrated that promoter DNA methylation of a specific class of genes, known generally as PolyComb Group Targets (PCGTs), increases with the age of the tissue (see e.g., [27, 30, 44]). Hence, feature selection for CpGs in gene promoters undergoing age-associated increases in DNA methylation should be enriched for PCGTs. Table 17.2 shows the results of applying SVA and a linear regression method that uses confounders as explicit covariates (LR \(+\) CF).

Table 17.2 Age-associated CpGs: in each dataset and for each method (LR \(+\) CFs, SVA, ISVA) we give the number of CFs or SVs used as covariates in the regression analysis, the number of CpGs differentially methylated with age (nDMCs) (FDR\(<0.05\) for Datasets T1D and UKOPS1, FDR\(<0.3\) for Datasets UKOPS2 and WBBC), the number of these that are hypermethylated with age and that map to polycomb group targets (nPCGTs), and the \(P\)-value of PCGT enrichment among age-hypermethylated CpGs (Hypergeometric test)

We can see that in only one of the four datasets (T1D set), does SVA convincingly retrieve the age-PCGT DNA methylation signature. In the other three datasets, the \(P\)-value of enrichment is either not significant or would fail significance after correction for multiple testing. In contrast, linear regression with explicit adjustment for confounders (see Appendix for the nature of the explicit confounders) convincingly captures the biological signature in 3 out of 4 datasets.

7.3 Residual Biological Variation

The results presented above clearly demonstrate a pitfall of the SVA algorithm: it can fail to retrieve a well-known and extensively validated association between a molecular signature and a phenotype of interest. The most plausible explanation for why this happens is that residual biological variation is being interpreted as confounding variation, leading to a “dampening” of the biological signal (see Fig. 17.3). To show that this is indeed what is happening, we can study the correlations between the surrogate variables and the biological as well as confounding factors. The statistical significance of these correlations is best shown as a heatmap, which is given for the four DNA methylation datasets considered in Table 17.2 in Fig. 17.6. From this figure and Table 17.2, we can see that in all three datasets where SVA fails to clearly capture the age-PCGT DNA methylation signature, there is residual variation correlating with age. Conversely, in the one dataset where there is no residual variation correlating with age (i.e., the T1D set), SVA retrieves the biological signature. Thus, this example clearly illustrates that the scenario of residual biological variation arising due to imperfections in the modeling, as depicted in Fig. 17.3, is indeed fairly common.
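A sketch of how such a heatmap of associations can be computed is shown below (inputs are hypothetical): for each SV and each factor, a one-way ANOVA is used for categorical factors and a Pearson correlation test, equivalent to a univariate linear regression, for continuous ones, in line with the description in the caption of Fig. 17.6.

```python
# A minimal sketch of the association heatmap in Fig. 17.6: one P-value per
# (surrogate variable, factor) pair, using one-way ANOVA for categorical
# factors and a Pearson correlation test for continuous ones.
import numpy as np
from scipy import stats

def sv_factor_pvals(svs, factors, categorical):
    """svs: list of length-n vectors; factors: dict name -> length-n vector;
    categorical: set of factor names to treat as categorical."""
    P = np.ones((len(svs), len(factors)))
    for i, sv in enumerate(svs):
        sv = np.asarray(sv, float)
        for j, (name, f) in enumerate(factors.items()):
            f = np.asarray(f)
            if name in categorical:
                groups = [sv[f == g] for g in np.unique(f)]
                P[i, j] = stats.f_oneway(*groups).pvalue
            else:
                P[i, j] = stats.pearsonr(sv, f.astype(float))[1]
    return P   # e.g., display -log10(P), or band by the cutoffs used in Fig. 17.6
```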

Fig. 17.6
figure 6

Heatmap of \(P\)-values of association between the surrogate variables (SVs) inferred using SVA and the confounders and phenotype of interest (age). \(P\)-values were estimated from linear ANOVA in the case of categorical confounders (e.g., Chip, Sex, Cohort) and from linear regressions in the case of continuous variables (age, BSC efficiency-BSCE and DNA concentration-DNAc). Color codes: \(P<1e-10\) (darkred), \(P<1e-5\) (red), \(P<0.001\) (orange), \(P<0.05\) (pink), \(P>0.05\) (white)

7.4 The Need for Surrogate Variable Subspace Selection

The above two examples in gene expression and DNA methylation data demonstrate the need to perform adjustment on a surrogate variable subspace, since otherwise one risks “peeling” away biological variation of interest. In the case where there is no residual biological variation, it should be clear that it does not matter what basis (i.e., surrogate variables) we use to span the surrogate variable subspace. In other words, it should not matter whether we use SVs constructed from principal components (SVA) or from independent components (ISVA). However, in the scenario where biological variation of interest is present in the residual variation matrix \(R\), we need to select surrogate variables that “align” with the true confounders and avoid as much as possible the directions defined by the residual biological variation. This requires a BSS method to better deconvolve the effects of the confounders and this residual biological variability. However, application of a BSS method to \(R\) only yields a decomposition of \(R\) into a number of independent “sources” and does not, on its own, provide a prescription for subspace selection. How, then, do we select this subspace?

The previous example, discussed in Table 17.2 and Fig. 17.6, suggests a possible prescription for how to perform the subspace selection, namely, to include only those SVs that do not correlate significantly with the phenotype of interest. But what if SVs correlate significantly with both the POI and a confounder? In this scenario, it is unclear whether to include these SVs in the final feature selection procedure (i.e., step-11). The surrogate variable selection step therefore remains an outstanding problem.

Here we propose a simple heuristic for the subspace selection problem, which we can only justify a posteriori, by showing that it leads to successful retrieval of the known biological signatures. For each of the SVs and for each factor (biological or technical), we first compute a model fit \(R^2\) value, using an appropriate linear or nonlinear model framework. Let \(R^2_{vf}\) denote the \(R^2\) value between surrogate variable \(v\) and factor \(f\). Further, let \(b\) denote the POI factor, and \(t\) denote a generic technical factor. Then, there are four possible cases to consider, as indicated in Table 17.3. In case-1, the surrogate variable correlates significantly only with the POI, and hence it ought to be excluded, as remarked earlier. Conversely, if the surrogate variable correlates significantly with a technical factor but not with the POI, then the corresponding SV should be included. In the third case, where the SV correlates significantly with both the POI and a technical CF, we use the model selection criterion

$$\begin{aligned} R^2_{vb}< R^2_{vt} \end{aligned}$$
(17.5)

to include only those SVs for which the correlation with the technical factor is stronger. The rationale for this criterion is that if the variation described by \(v\) correlates more strongly with the POI, then it is more likely that this variation is genuinely associated with the POI, and hence this component should be excluded. The final case corresponds to a scenario where the SV does not correlate with any known factor, in which case it is also unclear whether to include the SV or not. In principle, one must allow for the possibility of completely unknown (i.e., hidden) factors, in which case the SV should be included. On the other hand, exclusion could be argued for on the grounds of small variability and inaccuracies in dimensionality estimation.

Table 17.3 Surrogate Variable Selection: there are four possible case scenarios to consider depending on the \(R^2_{vf}\) values between surrogate variable \(v\) and factor \(f\), as shown
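A minimal sketch of this selection heuristic is given below. The significance flags and \(R^2\) values are assumed to have been computed beforehand (e.g., as in the association analyses above), and the treatment of case-4 SVs is left as an option, reflecting the discussion in the text.

```python
# A minimal sketch of the SV selection heuristic of Table 17.3 / Eq. 17.5.
# r2_poi[k], sig_poi[k]: R^2 and significance flag of SV k against the POI;
# r2_cf[k], sig_cf[k]: best R^2 and significance flag against any technical factor.
def select_svs(r2_poi, sig_poi, r2_cf, sig_cf, include_unknown=True):
    selected = []
    for k in range(len(r2_poi)):
        if sig_poi[k] and sig_cf[k]:
            if r2_poi[k] < r2_cf[k]:     # case 3: include only if Eq. 17.5 holds
                selected.append(k)
        elif sig_cf[k]:
            selected.append(k)           # case 2: confounder only -> include
        elif sig_poi[k]:
            pass                         # case 1: POI only -> exclude
        elif include_unknown:
            selected.append(k)           # case 4: unknown factor (optional)
    return selected
```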

Before demonstrating that the simple procedure presented in Table 17.3 works, we need to discuss further what may seem to be a serious drawback of the above heuristic, namely that it requires some knowledge of the technical confounding factors. Given that BSS methods are ideally suited to the scenario where sources of variation are unknown, does this pose an intrinsic limitation to the ISVA method? The answer is no. To understand this, we first note that BSS methods are useful also in circumstances where confounders are only known with error, since in such cases it is better to model the effects of the confounders from the data itself. In this case, the simple SV subspace selection step described above can be applied. Second, the scenario where confounders are known, or known only subject to error, constitutes the most common scenario. Last but not least, SVs not correlating with any factor (case-4) may still be included in the adjustment, as the main requirement is to avoid including SVs that correlate strongly with the POI.

7.5 The ISVA Solution

Let us now see how ISVA resolves the problematic issues that we encountered earlier with SVA. We first consider the four DNA methylation datasets considered in Table 17.2 and Fig. 17.6. In Fig. 17.7 we show the heatmap of associations between the SVs constructed by ISVA and the same confounders.

Fig. 17.7
figure 7

Heatmap of \(P\)-values of association between the surrogate variables (SVs) inferred using ISVA and the confounders and phenotype of interest (age). \(P\)-values were estimated from linear ANOVA in the case of categorical confounders (e.g., Chip, Sex, Cohort) and from linear regressions in the case of continuous variables (age, BSC efficiency-BSCE and DNA concentration-DNAc). Color codes: \(P<1e-10\) (darkred), \(P<1e-5\) (red), \(P<0.001\) (orange), \(P<0.05\) (pink), \(P>0.05\) (white)

Fig. 17.8
figure 8

Heatmap of \(P\)-values of association between the surrogate variables (SVs) inferred using SVA and ISVA and the confounders (ER status and tumor size) and the phenotype of interest (Grade). a Dataset Loi, b Dataset Schmidt. \(P\)-values were estimated from linear regressions. Color codes: \(P<1e-10\) (darkred), \(P<1e-5\) (red), \(P<0.001\) (orange), \(P<0.05\) (pink), \(P>0.05\) (white). “V” indicates SVs selected for adjustment in SVA or ISVA

Note how in two datasets (UKOPS1 and WBBC) there is no residual biological variability associated with age (the POI). In the UKOPS2 set, there are two SVs that correlate marginally with age and, importantly, they do not correlate with any other factor; hence these are not included in step-11 of ISVA. In the T1D set, there are three SVs that correlate with age, but only one of these (SV-3) is excluded, because the other two (SV-1 and SV-5) correlate more strongly with potential confounders such as Sex, Cohort, BSCE, and Chip. As seen in Table 17.2, ISVA with the above prescription for SV subspace selection leads to significant enrichment of PCGTs in all four DNA methylation datasets. Thus, using ISVA, the known biological signature is successfully retrieved in all sets.

It could be argued that the key step is the SV subspace selection, and not the BSS algorithm per se. To show how the use of ICA facilitates the SV subspace selection, we return to the example of mRNA expression data with grade as the POI and ER status playing the role of confounder. Table 17.1 shows the results obtained by ISVA. In comparison to SVA, we can see that ISVA leads to specific enrichment of cell-cycle genes (i.e., ER signaling genes are not enriched), clearly indicating that confounding by ER status has been successfully removed. As we can see from Fig. 17.8, this improved feature selection can be attributed to a more accurate deconvolution of residual variation associated with grade from that associated with ER status. As illustrated in Fig. 17.8a, SV-1 in SVA is equally strongly correlated with grade and ER status, indicating inaccurate deconvolution. In contrast, with ISVA, the SVs correlating most strongly with ER (SV-12) and grade (SV-7) are distinct, thus facilitating SV subspace selection and subsequently allowing improved feature selection. Similarly, in Fig. 17.8b, SV-3 in SVA is selected for adjustment yet it correlates very strongly with grade. In contrast, in ISVA the SV correlating most strongly with grade (SV-9) does so much more strongly than with ER status, and hence this SV is not included in the subsequent adjustment. The effect of ER in the residual variation space is captured by other SVs (SV-12, 20, 24, 27), which do not correlate as strongly with grade, and these are therefore included in the adjustment. Thus, in these two examples, the BSS method is key, since it allows more accurate deconvolution of the different sources of variation in the residual variation space. Even if a SV subspace selection step is incorporated into SVA (using the same heuristic criterion as for ISVA), we would still select problematic SVs, since PCA does not allow accurate deconvolution of the different sources of variation (see [45] for results of this modified SVA).

8 Modeling of Confounding Factors with Generalized BSS Algorithms

In the previous sections, we have seen how a simple BSS method (fastICA) can lead to substantial improvements in modeling confounding factors as well as to an improved deconvolution of the biological and confounding factors, both of which are important, and which subsequently lead to improved feature selection in supervised analysis problems. We have also provided an objective evaluation framework in which to assess and compare the different algorithms.

It is therefore of interest to consider more sophisticated BSS methods, since these might offer further improvements in statistical inference. In doing so, the first question to address is whether modeling of confounders is improved using these more advanced BSS methods. One particular generalization of ICA which is of interest to study concerns the statistical independence assumption, which so far has been applied to the columns of the source matrix \(S\). In other words, given the residual matrix \(R\) of dimension \(p\times n\), we applied ICA using the model

$$\begin{aligned} R=S_1A+\epsilon \end{aligned}$$
(17.6)

with the inference required to minimize a residual sum of squares subject to the constraint that the \(K\) \(p\)-dimensional column vectors of \(S_1\) be as statistically independent as possible. However, as shown in previous studies [37, 47], a dual interpretation/implementation is possible, whereby statistical independence is imposed on the rows of the mixing matrix \(A\). This dual problem can be expressed as:

$$\begin{aligned} R^T&= A^TS_1^T+\epsilon \nonumber \\&= \tilde{S_2}\tilde{A}+\epsilon \end{aligned}$$
(17.7)

where statistical independence is now imposed on the columns of \(\tilde{S_2}\), which is a matrix of dimension \(n\times K\). As shown in [2, 33, 37, 47], it is possible to formulate a “spatio-temporal” or bi-dimensional ICA,

$$\begin{aligned} R=S_1S_2^T+\epsilon \end{aligned}$$
(17.8)

in which statistical independence is favored across both features (“time”) and samples (“space”), by means of an overall cost function, \(C_f\), defined as a weighted linear combination of the cost functions used to solve Eqs. 17.6 and 17.7, i.e.,

$$\begin{aligned} C_f=(1-a)C_{f_1} + aC_{f_2} \end{aligned}$$
(17.9)

More formally, the specific bi-dimensional ICA algorithm we consider here [2, 33, 47] starts with an SVD of the row- and column-centered (residual) data matrix \(R\), so \(R=UDV^T\), with corresponding estimation of the dimensionality \(K\) (using, as before, RMT). One then constructs the reduced matrix \(R_K=U_KD_KV_K^T\), where the first \(K\) columns of \(U\) and \(V\) have been selected corresponding to the top \(K\) singular values of \(D\). This reduced matrix can then be rewritten as

$$\begin{aligned} R_K = \underbrace{U_KD_KW^{-1}}_{S_1}\underbrace{WV_K^T}_{S_2^T} \end{aligned}$$
(17.10)

with \(W\) an invertible matrix of size \(K\times K\). Finally, we seek to optimize the matrix \(W\) such that the fourth-order cumulants of \(S_1\) and \(S_2\) are as diagonal as possible, i.e., minimizing

$$\begin{aligned} C_f(W)= \left( a \sum _i \mathrm{{Off}}\left( C_i(S_2^T)\right) + (1-a)\sum _i \mathrm{{Off}}\left( C_i(S_1^T)\right) \right) \end{aligned}$$
(17.11)

where \(\mathrm{{Off}}(Y)\) returns the sum of squares of the off-diagonal elements of \(Y\), and the \(C_i\) are fourth-order cumulants. Imposing that \(W\) is orthogonal leads to a formulation which can be solved by means of the JADE algorithm [6]. We note, however, that this formulation of bi-dimensional ICA differs slightly from that of [33, 47], as the second term in the contrast function involves \(\left( C_i(S_1^T)\right) \) instead of \(\left( C_i(S_1^T)\right) ^{-1}\). Minimizing one or the other pursues the same goal, namely statistical independence of the columns of \(S_1\). This novel formulation, however, allows us to treat both extreme cases on an equal footing: \(a=1\) corresponds to JADE applied to \(R_K^T=S_2S_1^T\), whereas \(a=0\) corresponds to JADE applied to \(R_K=S_1S_2^T\). Thus, the cost function can be interpreted as a weighted linear combination of two JADE-like cost functions.
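As a rough sketch of this construction (under stated assumptions, not a full implementation), the code below performs the initial SVD and rank-\(K\) truncation of Eq. 17.10 and illustrates the two extreme cases \(a=0\) and \(a=1\), in which the problem reduces to an ordinary ICA of \(R_K\) or of its transpose. FastICA from scikit-learn is used purely as a stand-in for JADE, and the joint optimization of the orthogonal rotation \(W\) for intermediate values of \(a\) is not implemented.

```python
# Sketch of the bi-dimensional ICA setup (Eqs. 17.8-17.11): SVD truncation of the
# centered residual matrix followed by ICA on R_K (a=0) or on R_K^T (a=1).
# FastICA is a stand-in for JADE; intermediate 0 < a < 1 is not implemented here.
import numpy as np
from sklearn.decomposition import FastICA

def truncated_svd(R, K):
    """Rank-K SVD of the row- and column-centered residual matrix R (p x n)."""
    R = R - R.mean(axis=0, keepdims=True)
    R = R - R.mean(axis=1, keepdims=True)
    U, d, Vt = np.linalg.svd(R, full_matrices=False)
    return U[:, :K] * d[:K], Vt[:K]        # (U_K D_K, V_K^T), so R_K = (U_K D_K) V_K^T

def bidimensional_ica_extremes(R, K, a):
    UKDK, VKt = truncated_svd(R, K)
    RK = UKDK @ VKt
    ica = FastICA(n_components=K, whiten="unit-variance", random_state=0)
    if a == 0:
        S1 = ica.fit_transform(RK)         # independence across features (p-space)
        S2t = ica.mixing_.T                # so that R_K ~ S1 S2^T
        return S1, S2t
    if a == 1:
        S2 = ica.fit_transform(RK.T)       # independence across samples (n-space)
        S1 = ica.mixing_
        return S1, S2.T
    raise NotImplementedError("0 < a < 1 requires the joint JADE-type optimization of W")

# Toy usage: a p x n residual matrix with three planted non-Gaussian sources.
rng = np.random.default_rng(1)
S_true = rng.laplace(size=(500, 3))
A_true = rng.standard_normal((3, 40))
R = S_true @ A_true + 0.1 * rng.standard_normal((500, 40))
S1, S2t = bidimensional_ica_extremes(R, K=3, a=0)
print(S1.shape, S2t.shape)                 # (500, 3) and (3, 40)
```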

Given the above formulation of bi-dimensional ICA, it is of interest to study the effect of the parameter \(a\) on the quality of BSS. Since beadchip effects provide an objective framework in which to assess the quality of the BSS, we focus on how well these effects are modeled by the family of bi-dimensional ICA algorithms above. For simplicity, we consider the unsupervised problem in which the ICA decomposition is done on the data matrix \(X\) itself. Figure 17.9 shows the results, indicating that in terms of modeling beadchip effects, ICA is best run with values of \(a\) close to zero. This corresponds to imposing statistical independence of the sources across features, as implemented in the fastICA version of the ISVA algorithm.

Fig. 17.9

Modeling of beadchip effects by bi-dimensional ICA in two DNA methylation datasets. The y-axis gives the \(R^2\) value of the component correlating best with beadchip, as assessed using a linear ANOVA model; the x-axis gives the parameter \(a\) in Eq. 17.11
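A minimal sketch of this evaluation, under assumed variable names: for each inferred component, the \(R^2\) of a one-way ANOVA of the component on the beadchip labels is computed, and the best-explained component is reported.

```python
# Sketch of the beadchip-modeling criterion used in Fig. 17.9: R^2 of a linear
# (one-way) ANOVA of each sample-space component on the beadchip factor, keeping
# the component that is best explained by chip membership. Names are assumptions.
import numpy as np

def anova_r2(component, chip_labels):
    """R^2 of a one-way ANOVA of one component on a categorical beadchip factor."""
    y = np.asarray(component, dtype=float)
    labels = np.asarray(chip_labels)
    grand_mean = y.mean()
    ss_tot = np.sum((y - grand_mean) ** 2)
    ss_between = sum((labels == g).sum() * (y[labels == g].mean() - grand_mean) ** 2
                     for g in np.unique(labels))
    return ss_between / ss_tot

def best_beadchip_r2(components, chip_labels):
    """components: (n_samples, K) matrix of sample-space components (e.g., A^T from Eq. 17.6)."""
    return max(anova_r2(components[:, k], chip_labels) for k in range(components.shape[1]))

# Toy usage: a component strongly driven by chip membership yields R^2 close to 1.
rng = np.random.default_rng(2)
chips = np.repeat(np.arange(5), 12)                  # 5 beadchips x 12 samples
comp = chips * 1.0 + 0.1 * rng.standard_normal(60)   # chip-driven component plus noise
print(round(anova_r2(comp, chips), 3))
```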

9 Conclusions

In this chapter, we have presented and discussed the problem that confounding factors pose in large omic datasets. Since feature selection is a common task in the analysis of such large datasets, it is paramount to have statistical methods in place that can perform supervised analysis and feature selection in the presence of such confounding factors, especially when these are uncertain or unknown. We have seen how BSS methods are necessary in this context, since there is a requirement to accurately model confounding factors and to deconvolve these from variation associated with the phenotype of interest. We have presented an algorithm, ISVA, which uses a BSS technique (ICA) to perform a supervised normalization of the data, and have shown that it offers a sounder statistical framework in which to perform feature selection than a competing non-BSS tool based on PCA.

As mentioned earlier, it is possible to consider any BSS algorithm within the ISVA framework. One of the most straightforward generalizations of the fastICA algorithm used in our ISVA implementation is to relax the statistical independence assumption across features while simultaneously imposing partial statistical independence along the dual “sample” space, resulting in a bi-dimensional ICA. However, we have seen that, at least in terms of modeling beadchip effects, the original implementation (i.e., imposing statistical independence across features) is optimal. This could be because the sources across features are well described by sparse distributions, or because statistical independence is best assessed using the larger feature space.

Although the bi-dimensional ICA did not lead to improved modeling of beadchip effects, it is nevertheless of interest to investigate this and other BSS algorithms in the ISVA context. For instance, it could well be that other types of confounding factors are best modeled using bi-dimensional ICA or ICA algorithms that also allow for skewed sources of variation [37, 47]. Exactly known confounders (like beadchip effects) allow for objective assessment of BSS in real data, yet unfortunately, not many such factors exist. On the other hand, the number of beadchips in studies can vary substantially, thus allowing assessment of the BSS methods at least in relation to statistical properties such as kurtosis, which would vary for beadchip effects depending on the overall sample size of the study. Thus, a beadchip effect affecting 12 samples out of 120 samples (10 beadchips) will exhibit different statistical properties from one in a study of only 36 samples.
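As a simple illustration of this last point (a hedged aside, not part of the original analysis), a beadchip effect can be idealized as a binary chip-membership indicator, whose excess kurtosis changes markedly with the fraction of samples on the affected chip:

```python
# Excess kurtosis of an idealized beadchip-effect source (binary membership
# indicator) for different study sizes; the same 12-sample chip has very
# different fourth-order statistics in a 120-sample study than in a 36-sample one.
import numpy as np
from scipy.stats import kurtosis

def chip_indicator_kurtosis(n_total, n_on_chip):
    x = np.zeros(n_total)
    x[:n_on_chip] = 1.0                    # samples on the affected beadchip
    return kurtosis(x)                     # Fisher (excess) kurtosis

print(round(chip_indicator_kurtosis(120, 12), 2))   # 12 of 120 samples
print(round(chip_indicator_kurtosis(36, 12), 2))    # 12 of 36 samples
```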

Besides the detailed modeling of the sources, another key challenge faced in ISVA is the SV subspace selection step. Although we have presented a simple heuristic selection criterion which, as we have seen, successfully retrieves the known biological signatures in diverse real datasets, the criterion itself is not applicable to the case where confounders are complete unknowns (i.e., hidden). In fact, this remains an outstanding statistical challenge since (1) the presence of biological variation of interest in the matrix of residuals is almost inevitable and (2) it is entirely plausible that some of this variation is driven by hidden confounding factors and hence that the associated SVs should be included in the final regression model.

The results on the eight real datasets presented here, however, conclusively demonstrate that an SV selection step is absolutely necessary to arrive at the correct biological conclusion; yet in other datasets, where the biological truth is unknown, the SV selection criterion used here could falter due to hidden confounding factors. In other words, in the eight real datasets considered here we can be fairly certain that the data are not subject to substantial hidden (i.e., completely unknown) confounding variation, since otherwise our SV selection criterion would not have led to the retrieval of the known biological signatures.

With this chapter we hope to engage biologists, bioinformaticians, and signal processing experts alike. The problem that confounding factors pose in the statistical analysis of omic data is both challenging and critical to the ultimate success of large-scale genomic and epigenomic studies aiming to identify the much-needed disease biomarkers. Further research in this area is therefore urgently needed.