Promise and Complexity of Personalized Medicine

Simple and inexpensive genetic tests that reveal a person’s risk of developing certain diseases would help target clinical treatments to each individual patient in order to achieve the best possible results [1, 2]. Consequently, efficient technologies and software for uncovering treatment-related mutations in illness-inducing viruses, as well as disease-related variants in a patient’s DNA, will play important roles in the future of medicine. Improved disease prevention and diagnosis, as well as novel routes to therapies, are the main motivations for extensive studies aimed at finding disease-related genetic signatures.

Presently, disease risks estimated from known genetic risk factors provide only limited help in clinical applications [1, 3]. Even though substantial resources have recently been devoted to this effort, the genetic basis of common human diseases remains largely unidentified [4, 5]. The recent emergence of successful experimental and statistical strategies for genome-wide association studies was expected to provide the necessary tools for deciphering the genetic causes of complex human illnesses such as type 1 and type 2 diabetes [6], rheumatoid arthritis, and bipolar disorder [4, 7]. However, the presence of multi-locus interactions greatly complicates the task of discovering disease-related variants in a patient’s genome [8, 9]. Thus, biochemical and statistical understanding of genetic interactions will play a crucial role in future clinical applications.

Whole-Genome Association Studies

The examination of a large number of genetic markers across the whole genome in multiple individuals, with the goal of identifying variant–disease associations, is known as a genome-wide association study (GWAS). Scientific and technological advances in high-throughput biotechnologies such as microarrays and next-generation sequencing [10,11,12] have made GWAS a powerful tool for unlocking the genetic basis of complex diseases. In particular, the development of the International HapMap resource [13], which simplified the design and analysis of association studies, the emergence of dense genotyping chips [10, 14], and the assembly of large, well-characterized clinical samples [4] should be singled out as important factors in the recent progress of GWAS. Although many disease loci have been identified in such surveys [4, 15], the discovered variants explain only a small proportion of the observed familial aggregation [2, 16], which poses the well-known problem of missing disease heritability [17]. A few solutions to this challenge have been proposed [5], but the architecture of complex human traits remains an urgent open question. Because the “common variant” hypothesis has come under considerable criticism lately [1, 17], it is now necessary to devise experimental and computational methods to determine which of the proposed disease architectures describes reality, in order to help develop future clinical applications of bioinformatics technologies [1, 3, 17].

The most common type of DNA variation is the single-nucleotide polymorphism (SNP), which arises when a single base (A, T, C, or G) is replaced by another at a specific DNA position. Some SNPs can directly lead to disease formation; others increase the chance of disease statistically [18]. Analysis of SNP data is complicated both by the large number of possible interaction combinations and by the correlation among nearby SNPs.
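To give a sense of the scale of the "possible interaction combinations", the following minimal Python sketch counts the SNP sets of a given order. The marker count of 500,000 is a hypothetical value chosen only for illustration, and `n_combinations` is an illustrative helper name, not part of any cited software.

```python
import math

# Hypothetical number of genotyped markers; chosen only for illustration.
N_SNPS = 500_000


def n_combinations(n_snps, order):
    """Number of distinct unordered SNP sets of the given interaction order."""
    return math.comb(n_snps, order)


for order in (2, 3, 4):
    # Report the count in scientific notation; the exact integer is also available.
    print(f"{order}-SNP sets: {n_combinations(N_SNPS, order):.3e}")
```

Even at order two the number of candidate sets exceeds one hundred billion under this assumption, which is why exhaustive enumeration quickly becomes impractical.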

Beyond Single-Locus Analysis

Despite striking success in the twentieth century in pinpointing genes responsible for Mendelian diseases, the genetic origins of common complex diseases are non-Mendelian in nature [9, 19]. In particular, gene–gene interactions are involved in many complex biological processes such as metabolism, signal transduction, and gene regulation; thus, genetic variants at multiple loci may jointly contribute to disease formation [20, 21]. For example, breast cancer and type 2 diabetes have been linked to multi-SNP interactions [21,22,23]. While most current bioinformatics approaches focus on detecting single-SNP associations, advanced statistical methods are necessary for multi-SNP association mapping, because single-variant methods not only lose power when interactions exist but are also ineffective at detecting rare mutations [24]. Moreover, the number of possible interactions is so vast that it is computationally unrealistic to search through all of them across the genome in a large-scale case-control study [8, 25].
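The loss of power of single-variant methods under interaction can be illustrated with an idealized toy example. In the sketch below, disease status is assumed to be the exclusive-or of carrier status at two SNPs, with all four genotype combinations equally frequent; this model, the data, and the names `records` and `disease_rate` are hypothetical and not taken from any cited study.

```python
from itertools import product

# Hypothetical toy population: two SNPs coded as carrier (1) / non-carrier (0),
# with all four genotype combinations equally frequent (an idealized assumption).
# Disease occurs only when exactly one of the two SNPs is carried (XOR model).
records = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(250)]


def disease_rate(rows, **fixed):
    """Fraction of affected individuals among rows matching the fixed genotypes."""
    index = {"snp1": 0, "snp2": 1}
    selected = [r for r in rows if all(r[index[k]] == v for k, v in fixed.items())]
    return sum(r[2] for r in selected) / len(selected)


# Marginal view: the disease rate does not depend on SNP 1 alone.
marginal = {g: disease_rate(records, snp1=g) for g in (0, 1)}

# Joint view: the pair of genotypes determines the phenotype completely.
joint = {
    (a, b): disease_rate(records, snp1=a, snp2=b)
    for a, b in product((0, 1), repeat=2)
}

print("marginal:", marginal)
print("joint:", joint)
```

In this construction a test that examines one SNP at a time sees identical disease rates for carriers and non-carriers, while the joint table separates cases from controls perfectly. Real interactions are rarely this extreme, but the example shows why marginal tests alone can miss them.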

An additional challenge for the discovery of disease origins comes from the statistical correlation between nearby variants, known as linkage disequilibrium (LD) [25, 26]. LD patterns have many important applications in genetics and biology [27] and arise from the shared ancestry of contemporary chromosomes [13]. Because of LD, dense studies are likely to produce many redundant positive signals [24]. Later we discuss in detail how Bayesian strategies can address these problems while dealing with epistasis and linkage disequilibrium.
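One common way to quantify the correlation between two nearby biallelic SNPs is the r² statistic computed from haplotype frequencies. The sketch below is a minimal illustration that assumes phased haplotype counts are available; the function name `ld_r_squared` and the example counts are hypothetical.

```python
def ld_r_squared(n_ab, n_aB, n_Ab, n_AB):
    """Return (D, r^2) for two biallelic loci from the four haplotype counts.

    The arguments are counts of haplotypes carrying alleles a/A at the first
    locus and b/B at the second locus.
    """
    total = n_ab + n_aB + n_Ab + n_AB
    p_AB = n_AB / total
    p_A = (n_Ab + n_AB) / total   # frequency of allele A at locus 1
    p_B = (n_aB + n_AB) / total   # frequency of allele B at locus 2
    d = p_AB - p_A * p_B          # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d, d * d / denom


# Alleles always travel together: complete LD.
d_full, r2_full = ld_r_squared(50, 0, 0, 50)
# All four haplotypes equally common: no LD.
d_none, r2_none = ld_r_squared(25, 25, 25, 25)
print(r2_full, r2_none)
```

Two markers with r² close to one carry nearly the same information, so an association at one of them tends to be mirrored at the other, which is the source of the redundant signals mentioned above.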

Modern Bioinformatics Approaches

Currently, most approaches to disease association mapping take the standard “frequentist” approach to evaluating significance [2]. In particular, such algorithms use hypothesis testing procedures to deal with one variant at a time [24]. However, the failure of such “frequentist” methods to account for the power of a study and the number of likely true positives [2], combined with their increased likelihood of reporting a multitude of redundant associations [24], has sparked wide interest in Bayesian procedures. In this review, we survey the challenges facing statistical geneticists who analyze GWAS data and outline how recently developed Bayesian methods can help. In addition to outlining the main differences between the proposed approaches, we highlight the limitations and advantages of each method, describe future prospects in the field, and discuss how Bayesian approaches can aid in answering outstanding questions in biomedicine.
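For comparison with the Bayesian methods reviewed below, the one-variant-at-a-time testing described above can be sketched as a Pearson chi-square test on a 2×3 table of genotype counts followed by a Bonferroni-style threshold. This is only one possible single-SNP test, not necessarily the procedure used in the cited studies; the counts, the marker number, and the function name `chi_square_2x3` are hypothetical.

```python
import math


def chi_square_2x3(cases, controls):
    """Pearson chi-square statistic and p-value (2 d.f.) for a 2x3 genotype table.

    Assumes every genotype is observed at least once, so that no expected
    count is zero.
    """
    col_totals = [a + b for a, b in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for row, row_total in ((cases, n_cases), (controls, n_controls)):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom the chi-square tail probability is exp(-x / 2).
    return stat, math.exp(-stat / 2)


# Hypothetical genotype counts (three genotypes) for a single SNP.
stat, p_value = chi_square_2x3(cases=[30, 50, 20], controls=[50, 40, 10])

# Bonferroni-style threshold for a hypothetical panel of 500,000 markers.
threshold = 0.05 / 500_000
print(stat, p_value, p_value < threshold)
```

Each SNP is tested in isolation here, so neither interactions between markers nor the correlation among neighbouring markers enters the calculation.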

Bayesian Data Analysis Methods

Figure 1 shows the multiple complicated interactions that have to be considered when developing statistical models of the multi-locus interactions that result in disease development. The ultimate goal is to accurately characterize all of the connections shown in large-scale case-control studies while also understanding the biological processes that lead to disease development. Thus, while statistical understanding is important, the methods developed should also point toward the biological processes taking place.

Fig. 1

Genetic interaction graph showing possible paths to disease formation. SNPs are represented as circles with color and shading indicating disease connection: “red solid” ones are marginally associated with the phenotype under consideration, “red shaded” ones are leading to disease formation through epistasis or are in linkage disequilibrium (LD) with such variants, and finally “green circles” are not associated with the phenotype. LD blocks are shown as square brackets and interactions are depicted as double-headed arrows

Overview of Bayesian Data Analysis

In the Bayesian approach to parameter estimation, statistical conclusions about an unknown parameter θ (or unobserved data \( x_{\text{unobs}} \)) are expressed as probability statements that are conditional on the observed data x: \( p(\theta |x) \) and \( p(x_{\text{unobs}} |x) \). Additionally, implicit conditioning is performed on the values of any covariates [28]. Conditioning on the observed data is what separates Bayesian statistics from other inference approaches, which evaluate estimates of the unknown parameter over the distribution of possible data values while conditioning on the true, yet unknown, parameter values [28, 29].

At the heart of all Bayesian approaches for detecting gene–gene interactions lies the concept of Bayesian inference and model selection. The goal is to determine the posterior distribution of all parameters in the problem (disease association, epistatic interactions, gene–environment interactions, and others) given the common-variant data from the case-control study, while incorporating prior beliefs about the parameter values. The conditional probability of all parameters \( ({\text{Params}}) \) given the observed data \( ({\text{Data}}) \) is the product of the likelihood function of the data and the prior distribution on the parameters, divided by the normalization constant \( P({\text{Data}}) \):

$$ P\left( {\text{Params|Data}} \right) = \frac{{P\left( {\text{Data|Params}} \right)P\left( {\text{Params}} \right)}}{{P\left( {\text{Data}} \right)}} $$
(1)

For most high-dimensional data sets encountered in large-scale studies, \( P({\text{Data}}) \) cannot be calculated explicitly [9] and, therefore, \( P({\text{Params}}|{\text{Data}}) \) can be evaluated analytically only up to a proportionality constant. However, advanced computational techniques (iterative sampling methods) can be used to determine the posterior distribution of the parameters [29, 30]. The main task is to choose appropriate statistical models for the likelihood and appropriate prior distributions on the values of the parameters, \( P({\text{Params}}) \).
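As a low-dimensional illustration of Eq. 1, the following sketch evaluates the posterior of a single parameter on a grid, where the normalization constant can still be computed by direct summation. The setting (a risk-allele frequency with a binomial likelihood and a uniform prior), the data values, and the name `grid_posterior` are assumptions made for illustration only.

```python
import math


def grid_posterior(k, n, grid_size=101):
    """Posterior over an allele frequency on a grid, given k risk alleles out of n."""
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size                  # P(Params): uniform
    likelihood = [
        math.comb(n, k) * t**k * (1 - t) ** (n - k)        # P(Data | Params)
        for t in thetas
    ]
    unnormalized = [lik * pri for lik, pri in zip(likelihood, prior)]
    evidence = sum(unnormalized)                           # P(Data)
    return thetas, [u / evidence for u in unnormalized]


# Hypothetical data: 30 risk alleles observed among 100 sampled chromosomes.
thetas, posterior = grid_posterior(k=30, n=100)
mode = thetas[posterior.index(max(posterior))]
print(mode, sum(posterior))
```

With one parameter the sum over the grid is trivial. With many parameters the number of grid points grows exponentially, so the same sum becomes infeasible and sampling methods are used instead.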

Overview of Bayesian Variable Partition

Instead of testing each SNP set in a stepwise manner [31, 32], Bayesian approaches fit a single statistical model to all of the data simultaneously [9, 25, 33], which allows for increased robustness compared with hypothesis testing methods [2, 24]. Another advantage of the Bayesian approach is the ability to quantify all the uncertainties and information, and to incorporate previous knowledge about each specific SNP marker into the statistical model through priors [9, 29].

In the Bayesian model selection framework, we are interested in determining which model in the set \( \left\{ {M_{i} } \right\}_{i = 1}^{N} \) is the most likely given the observed \( {\text{Data}} \). The posterior probability for a particular model \( M_{i} \) given \( {\text{Data}} \) is described by:

$$ P\left( {M_{i} | {\text{Data}}} \right) \propto P\left( {{\text{Data|}}M_{i} } \right)P\left( {M_{i} } \right) $$
(2)

Thus, through comparison of the posterior odds ratio for \( P(M_{i} |{\text{Data}}) \) and \( P(M_{j} |{\text{Data}}) \) it can be determined whether model \( M_{i} \) or \( M_{j} \) is more likely [29]. It is important to note that the normalization constant in Eq. 2 involves summation over all possible models: \( P\left( {\text{Data}} \right) = \mathop \sum \nolimits_{i = 1}^{N} P\left( {{\text{Data|}}M_{i} } \right)P(M_{i} ) \). For example, consider the case of a genome-wide study containing 1500 SNPs each of which can take one of the three possible states; thus, \( N = 3^{1500} \approx 5 \times 10^{715} \) is the total number of feasible models to sum over. In such instances, it is necessary to use stochastic methods to sample from the posterior distribution. Now let us consider how this conceptual framework is applied in practice to the determination of multi-locus interactions in case-control studies.
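The size of the model space quoted above, and the fact that the normalization constant cancels in a posterior odds ratio, can be checked with a few lines of Python. The helper names `n_models`, `leading_exponent`, and `posterior_odds`, and the log-likelihood values passed to them, are hypothetical.

```python
import math

n_snps, n_states = 1500, 3

# Python integers have arbitrary precision, so the count is exact.
n_models = n_states**n_snps
leading_exponent = len(str(n_models)) - 1       # power of ten of the first digit
log10_models = n_snps * math.log10(n_states)    # same quantity on a log scale


def posterior_odds(log_lik_i, log_lik_j, log_prior_i=0.0, log_prior_j=0.0):
    """Posterior odds P(M_i | Data) / P(M_j | Data).

    P(Data) appears in both posteriors and cancels, so only the likelihoods
    and priors of the two models being compared are needed.
    """
    return math.exp((log_lik_i + log_prior_i) - (log_lik_j + log_prior_j))


print(leading_exponent, round(log10_models, 2))
print(posterior_odds(-10.0, -12.0))  # hypothetical log-likelihoods, equal priors
```

Comparing two models therefore never requires the sum over all models; the difficulty lies in deciding which of the many candidate models to compare, which is where stochastic search becomes necessary.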

Epistasis Analysis Methods

While statistical methods such as BGTA [34], MARS [35], and CPM [36] are capable of detecting epistatic associations, the Bayesian epistasis association mapping (BEAM) algorithm [9] was the first practical approach capable of handling genome-wide case-control data sets. Given the case-control SNP genotype data, the BEAM algorithm reports, for each SNP marker, the posterior probabilities of disease association and of epistatic interaction with other markers. Figure 2 shows the input file format required by the algorithm. The core of the Bayesian marker partition model can be briefly summarized as follows.

Fig. 2

Input data format for BEAM software [9] which uses MCMC to analyze case-control genetic studies. Label “1” denotes patients while “0” denotes controls. Note that it is not a requirement to provide the SNP ID and location

BEAM can detect both interacting and noninteracting disease loci among a large number of variants, and is an application of the Bayesian model selection procedure. All markers are split into three groups: (1) markers not associated with the disease, (2) variants marginally associated with the disease, and (3) variants whose association with the disease arises through interactions. Using priors on the marker memberships and Markov chain Monte Carlo (MCMC) methods, the algorithm determines posterior probabilities of group membership by interrogating each SNP marker conditionally on the current status of the others [9]. The genotype counts are modeled by a multinomial distribution with frequency parameters \( \theta = \left\{ {\theta_{1} ,\theta_{2} ,\theta_{3} } \right\} \), \( \mathop \sum \nolimits_{i = 1}^{3} \theta_{i} = 1 \), which are given a Dirichlet prior:

$$ P(\theta |\alpha ) \propto \mathop \prod \limits_{i = 1}^{3} \theta_{i}^{{\alpha_{i} - 1}} $$
(3)
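Because the Dirichlet prior of Eq. 3 is conjugate to the multinomial distribution, observing genotype counts simply adds them to the prior parameters. The sketch below illustrates this update for a single SNP; the counts, the symmetric choice \( \alpha_{i} = 0.5 \), and the name `dirichlet_posterior` are assumptions for illustration and are not taken from the BEAM software.

```python
def dirichlet_posterior(counts, alpha=(0.5, 0.5, 0.5)):
    """Posterior Dirichlet parameters and posterior mean genotype frequencies.

    counts: observed numbers of the three genotypes at one SNP.
    alpha:  Dirichlet prior parameters (hypothetical symmetric default).
    """
    post = [a + n for a, n in zip(alpha, counts)]   # conjugate update
    total = sum(post)
    return post, [p / total for p in post]


# Hypothetical genotype counts for one marker.
post_alpha, post_mean = dirichlet_posterior((60, 30, 10))
print(post_alpha)
print(post_mean)
```

The posterior mean lies between the observed genotype proportions and the prior mean, and the influence of the prior shrinks as the sample grows.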

In order to determine the posterior probability of each marker’s group membership (represented by I), the Metropolis–Hastings (MH) algorithm [30] is used to sample from \( P(I|D,H) \) as given in Eq. 4:

$$ P\left( {I |D,H} \right) \propto P(D_{1} |I)P(D_{2} |I)P(D_{0} ,H|I)P(I), $$
(4)

where D is the patient data set (with disease), H is the control data set (healthy), and \( D_{0} \), \( D_{1} \), and \( D_{2} \) are the partitions of the patient data set corresponding to the three marker groups described above. The assumption is that case genotypes at the disease-associated markers will have different distributions when compared to control genotypes. Furthermore, the likelihood model assumes independence among markers in the control group.
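The sampling idea behind Eq. 4 can be sketched with a deliberately simplified toy model. The code below is not the BEAM implementation: it keeps only two groups (unassociated and marginally associated, omitting the epistatic group), treats markers as independent, and uses hypothetical values for the Dirichlet parameter, the prior probability of association, and the genotype counts. All function names are illustrative.

```python
import math
import random


def log_dm(counts, alpha=0.5):
    """Log marginal likelihood of genotype counts under a symmetric Dirichlet prior."""
    k, n = len(counts), sum(counts)
    return (
        sum(math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts)
        + math.lgamma(k * alpha)
        - math.lgamma(n + k * alpha)
    )


def log_posterior(indicators, cases, controls, prior_assoc=0.05):
    """Unnormalized log P(I | D, H) for a two-group simplification of Eq. 4."""
    total = 0.0
    for flag, d, h in zip(indicators, cases, controls):
        if flag:   # associated: cases and controls have separate distributions
            total += log_dm(d) + log_dm(h) + math.log(prior_assoc)
        else:      # unassociated: case genotypes are pooled with controls
            total += log_dm([x + y for x, y in zip(d, h)]) + math.log(1 - prior_assoc)
    return total


def metropolis_hastings(cases, controls, n_iter=5000, seed=0):
    """Estimate P(I_j = 1 | D, H) by proposing one indicator flip per iteration."""
    rng = random.Random(seed)
    state = [0] * len(cases)
    current = log_posterior(state, cases, controls)
    visits = [0] * len(cases)
    for _ in range(n_iter):
        j = rng.randrange(len(state))
        proposal = state.copy()
        proposal[j] = 1 - proposal[j]
        candidate = log_posterior(proposal, cases, controls)
        # The proposal is symmetric, so the acceptance probability is
        # min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            state, current = proposal, candidate
        visits = [v + s for v, s in zip(visits, state)]
    return [v / n_iter for v in visits]


# Hypothetical genotype counts: SNP 0 differs between cases and controls,
# SNPs 1 and 2 do not.
cases = [[20, 50, 130], [100, 80, 20], [90, 85, 25]]
controls = [[130, 50, 20], [100, 80, 20], [92, 84, 24]]
probs = metropolis_hastings(cases, controls)
print(probs)
```

The fraction of iterations in which a marker is flagged as associated approximates its posterior probability of association. Only ratios of unnormalized posteriors are ever evaluated, so the normalization constant is never needed.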

While the BEAM algorithm was one of the first able to handle GWAS data, it suffered from the assumption that the dependence structure among SNPs could be described by a Markov chain [9, 25]. In fact, SNP markers are highly correlated within haplotype blocks, which are separated by recombination events [13, 37]. Therefore, despite its success, the BEAM model is unable to capture the block-like structure of the human genome.

Incorporating Block-Type Genome Structure

Given that nearby SNPs are strongly correlated due to linkage disequilibrium, a newer Bayesian model [25] that infers diplotype blocks and selects the disease-associated SNP markers within blocks is much more powerful than other similar approaches. Here, we review the Bayesian model for determining the LD-block structure [25, 26]. The main assumptions are that the diplotypes of individuals come from a multinomial distribution whose frequency parameters have a Dirichlet prior, and that genotype combinations of SNPs in different blocks are mutually independent. The marginal probability of the data for a specific block has the compact expression:

$$ P\left( {D_{{\left[ {s,b} \right)}} |\left[ {s,b} \right) = block} \right) = \left( {\mathop \prod \limits_{i = 1}^{{3^{b - s} }} \frac{{\Gamma \left( {n_{i} + a_{i} } \right)}}{{\Gamma \left( {a_{i} } \right)}}} \right)\frac{{\Gamma \left( {\mathop \sum \nolimits a_{i} } \right)}}{{\Gamma \left( {\mathop \sum \nolimits \left( {n_{i} + a_{i} } \right)} \right)}}, $$
(5)

where the block under consideration consists of the SNPs \( (s, \ldots ,b - 1) \), Γ is the gamma function, \( \vec{a} \) is the vector of Dirichlet parameters, and \( n_{i} \) is the number of counts for a specific diplotype. For joint inference of diplotype blocks and disease association status, we use the joint statistical model for the observed genotype data in cases and controls, the marker membership, and the block partition variable:

$$ P\left( {H,D,B,I} \right) = P\left( {H,D |B,I} \right)P\left( B \right)P(I) $$
(6)

Finally, in order to determine the posteriors \( P(B|D,H) \) and \( P(I|D,H) \) the model uses a combination of MH algorithm and Gibbs sampler [25].
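The block marginal likelihood of Eq. 5 is most conveniently evaluated in log space. The sketch below does this for a single candidate block under a symmetric Dirichlet prior and compares a model that treats two perfectly correlated SNPs as one block with a model that treats them as two independent blocks. The value \( a_{i} = 0.5 \), the genotype data, and the name `log_block_marginal` are hypothetical, and the prior used in [25] may differ.

```python
import math
from collections import Counter


def log_block_marginal(genotypes, alpha=0.5):
    """Log of Eq. 5 for one candidate block.

    genotypes: one tuple per individual, each entry in {0, 1, 2}, covering the
               SNPs s, ..., b-1 of the block.
    alpha:     symmetric Dirichlet parameter a_i (hypothetical default).
    """
    width = len(genotypes[0])
    n_cells = 3**width                  # number of possible diplotypes, 3^(b-s)
    counts = Counter(genotypes)         # n_i for the observed diplotypes
    n = len(genotypes)
    # Unobserved diplotypes have n_i = 0 and contribute Gamma(a)/Gamma(a) = 1.
    log_prod = sum(
        math.lgamma(c + alpha) - math.lgamma(alpha) for c in counts.values()
    )
    return log_prod + math.lgamma(n_cells * alpha) - math.lgamma(n + n_cells * alpha)


# Hypothetical data: two SNPs whose genotypes always coincide.
data = [(0, 0)] * 50 + [(1, 1)] * 30 + [(2, 2)] * 20

joint = log_block_marginal(data)
split = log_block_marginal([(g[0],) for g in data]) + log_block_marginal(
    [(g[1],) for g in data]
)
print(joint, split)
```

For these data the single-block model has the larger marginal likelihood, which is the kind of evidence that lets a sampler favour grouping strongly correlated SNPs into one block.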

Detailed Interaction Partition Structure Determination

While successful in inferring epistatic interactions in large-scale case-control studies, both BEAM and its newer version BEAM2 had the disadvantage of using saturated models, which limited their ability to accurately determine the structure of epistatic interactions. Recent studies [4, 33, 38] showed that such interaction details, which arise from the encoding of complicated regulatory mechanisms, might play an important role in disease formation. In order to carefully explore the etiopathogenesis and genetic mechanisms of diseases, a novel algorithm named Recursive Bayesian Partition (RBP) was proposed [33]. The RBP approach employs a Bayesian model to discover independence groups among interacting markers: first, it recursively infers all the marginally independent interaction groups, and then it determines the conditional independence within each group using a chain dependence model. RBP thus determines the dependence structure among interacting variants in GWAS. Figure 3 shows an example of the possible outcomes of the RBP algorithm applied to GWAS data when determining the independence structure of epistatic interactions.

Fig. 3

Inference of the detailed dependence structure using Recursive Bayesian Partition (RBP) method. The individual independence groups among five variants are pointed out using separate solid blocks. A group with conditional independence is denoted using a dotted shape

Bayesian Graph Models and Networks

In order to improve disease mapping sensitivity and specificity, the BEAM3 algorithm [24] uses a graph model to allow for flexible interaction structures for multi-SNP associations. Through the use of Bayesian networks, BEAM3 detects flexible interaction structures instead of using saturated models (as BEAM and BEAM2 do), thereby greatly reducing the complexity of the interaction model. Moreover, because only the disease association graphs are constructed, BEAM3 provides higher computational efficiency in whole-genome association settings [24].

In detail, BEAM3 allows for higher-order couplings through saturated interactions within cliques (a nonoverlapping partition of the SNPs) and pairwise interactions between cliques. It can be shown [24] that the joint probability of all SNPs X, the parameters, which include the disease graph and association status (G, I), and the disease status indicator (Y) is given by:

$$ P(X,Y,G,I) \propto P_{A} (X_{1} |Y,G)P(Y)P(G|I)P(I)/P_{0} (X_{1} ), $$
(7)

where \( G = (C,\Delta ) \) is an undirected disease graph constructed on the disease-associated SNPs \( (X_{1} ) \), comprising a partition of the SNPs into cliques (C) and the interactions between cliques (∆); \( P_{A} \) is the probability function of the set \( X_{1} \) under the phenotype association hypothesis. As can be seen from Eq. 7, only a few disease-associated SNPs are modeled (those in the set \( X_{1} \)), and hence a significant portion of computational time is saved by avoiding explicit modeling of the complicated dependence structure of all SNPs, which can number in the millions [19, 24, 39]. Additionally, through the choice of a proper baseline probability function \( P_{0} (X_{1} ) \), the model automatically accounts for the complex LD effects among dense SNPs by employing graphs. Thus, a significant number of repetitive false interactions are avoided, which reduces the computational burden [24]. Specifically, summing over all graphs \( G^{\prime} \), the expression for the baseline model becomes:

$$ P_{0} \left( {X_{1} } \right) = \mathop \sum \limits_{{G^{\prime}}} P_{0} \left( {X_{1} |G^{\prime}} \right)P(G^{\prime}) $$
(8)

An alternative approach to learning disease-inducing gene–gene interactions uses binary classification trees. Bayesian methodology has recently been applied [21] to the identification of multi-locus interactions in large-scale data sets using a Bayesian classification tree model. Specifically, this kind of machine learning approach produces tree-structured models in which each nonterminal node determines a splitting rule based on predictor variables such as SNPs, and the edges leaving a node correspond to the different possible values of the variable at that node. A path through such a tree to a terminal node represents a specific combination of the predictor variables along the path and therefore automatically accommodates epistasis [8, 21].

There are various ways to search the space of feasible trees in such recursive partitioning approaches, including greedy algorithms [40], the random forests approach [8, 41], and MCMC [42, 43]. Bayesian variable partition and Bayesian classification trees are conceptually similar in that a prior is assigned to all the tree models in order to control the tree size [21]. One main advantage of this approach is that the prior specification encourages variable splitting, which can improve the chance of finding multi-locus interactions with weak marginal effects. Moreover, owing to the adaptivity of the MCMC algorithm, such Bayesian tree models detect higher-order interactions by searching thoroughly near trees that contain the interacting variables identified in previous iterations [21]. It is important to point out that classification tree approaches do not test for epistatic interactions directly [8].

Clinical Applications of Bayesian Methodology

Even though practical Bayesian approaches for whole-genome analysis of multi-locus interactions have emerged relatively recently, such methods have already helped to make important advances in determining disease etiology. Table 1 summarizes and compares the statistical methods described above, noting the interaction model that each method uses, as well as their success in recovering previously known disease loci and, more importantly, in discovering new multi-locus interactions responsible for complex diseases. For example, a Bayesian analysis strategy combining the BEAM and BEAM2 software [44] allowed for the discovery of 319 high-order interactions across the genome that can potentially explain the missing genetic component of rheumatoid arthritis susceptibility. Moreover, the findings of that study indicate that the nervous system, in addition to the immune system, potentially plays a crucial role in disease development. Figure 4 shows a schematic diagram of the combined Bayesian strategy used for the analysis. This is an example of a statistical study in which the biological processes underlying a disease can be extracted from the statistical associations found. Many more studies applying Bayesian methods, either to existing GWAS data or to new large-scale studies, are likely to follow in the near future.

Table 1 Comparison of modern Bayesian approaches for whole-genome association analysis with possible clinical applications
Fig. 4

A schematic diagram of the Bayesian analysis strategy combining multiple software applications [44]. In order to account for linkage disequilibrium, BEAM2 algorithm [25] was used to discover chromosome-wise interactions. However, a more efficient BEAM algorithm [9] was used on the determined SNPs across all chromosomes

Conclusions and Future Prospects

Certain issues need to be considered when using the Bayesian approaches described above. For example, the combination of genotyping errors, disease heterogeneity, and population substructure could adversely affect the statistical results of the methods [9]. Currently, the major problem in the field is that the disease-associated genetic variants identified so far explain only a small part of disease heritability [3, 4]. However, it is conceivable that the software tools outlined above will help with a detailed understanding of the interactions involved. Additionally, the recent development of Bayesian models should allow for the elucidation of the detailed etiopathogenesis of disease formation and the underlying causal biology.

Improvements to the Bayesian approaches mentioned in this article can include incorporation of environmental factors and population structures as covariates in the statistical model [33, 45]. Another possible improvement is to impute untyped SNPs and missing genotypes [46]. Efficient incorporation of prior biological knowledge into the Bayesian model can increase the probability of making discoveries in association studies [47]. Finally, recent computational proposals attempt to apply Bayesian methodology specifically toward efficient identification of causal rare variants in GWAS [48, 49].

It is important to keep in mind that clinical applications of the statistical methods will arise from understanding the relationship between the mathematical couplings found and their biochemical underpinnings. The biological interpretation of single- and multi-variant effects is currently a crucial area of research in genetics [8]. Modern statistical approaches to the analysis of SNP data from whole-genome association studies have the potential to play an important role in the future of bioinformatics and genomics research. Specifically, such methods will contribute to a new understanding of disease pathogenesis and provide crucial information for drug discovery [50], thus leading to important clinical applications.