Clinical Assessment of Disease Risk Factors Using SNP Data and Bayesian Methods

Kozyryev, Ivan; Zhang, Jing

doi:10.1007/978-3-319-44981-4_6

Part of the book series: Health Information Science (HIS)

Abstract

Recent technological and scientific advances have propelled the field of personalized medicine (PM), which promises to start a new era in clinical disease treatment. However, the success of PM depends strongly on establishing a vast resource library of connections between common complex diseases and specific genetic signatures. These connections can be discovered by performing whole-genome association studies, which attempt to link diseases to their genetic origins. Such large-scale surveys, combined with modern statistical methods, have already identified many disease-related genetic variants. In this review, we describe in detail novel statistical methods based on Bayesian data analysis (Bayesian modeling, Bayesian variable partitioning, and Bayesian graphs and networks), which promise to help shed light on the complex biological processes involved in disease formation and development. In particular, we outline how to use Bayesian approaches in clinical applications to perform epistasis analysis while accounting for the block-type genome structure.

Promise and Complexity of Personalized Medicine

Simple and inexpensive genetic tests capable of showing a person’s risk of developing certain diseases would help to target clinical treatments to each individual patient in order to achieve the best possible results [1, 2]. Consequently, efficient technologies and software for uncovering treatment-related mutations in illness-inducing viruses, as well as disease-related variants in a patient’s DNA, will play important roles in the future of medicine. Improved disease prevention and diagnosis, together with novel routes to therapies, are the main motivations for extensive studies aimed at finding disease-related genetic signatures.

At present, disease risk estimates based on known genetic risk factors provide only limited help in clinical applications [1, 3]. Even though a large amount of resources has recently been devoted to this effort, the genetic basis of common human diseases remains, for the most part, unidentified [4, 5]. The recent emergence of successful experimental and statistical strategies for genome-wide association studies was expected to provide the necessary tools for deciphering the genetic causes of complex human illnesses such as type 1 and type 2 diabetes [6], rheumatoid arthritis, and bipolar disorder [4, 7]. However, the presence of multi-locus interactions immensely complicates the task of discovering disease-related variants in a patient’s genome [8, 9]. Thus, a biochemical and statistical understanding of genetic interactions will play a crucial role in future clinical applications.

Whole-Genome Association Studies

A genome-wide association study (GWAS) examines a large number of genetic markers across the whole genome in multiple individuals with the goal of identifying variant-disease associations. Advances in high-throughput biotechnologies such as microarrays and next-generation sequencing [10,11,12] have made GWAS a powerful tool for unlocking the genetic basis of complex diseases. In particular, the development of the International HapMap resource [13], which simplified the design and analysis of association studies, the emergence of dense genotyping chips [10, 14], and the assembly of large and characterized clinical samples [4] should be singled out as important factors in the recent progress of GWAS. While many disease loci have been identified in such surveys [4, 15], the discovered variants explain only a small proportion of the observed familial aggregation [2, 16], which poses the well-known problem of missing disease heritability [17]. A few solutions to this challenge have been proposed [5], but an urgent open question concerns the architecture of complex human traits. The “common variant” hypothesis has come under considerable criticism lately [1, 17], and it is now necessary to devise experimental and computational methods to determine which of the proposed disease architectures describes reality, in order to help develop future clinical applications of bioinformatics technologies [1, 3, 17].

The most common type of DNA change is the single-nucleotide polymorphism (SNP), which arises when a single base (A, T, C, or G) is replaced by another one at a specific DNA position. Some SNPs can directly lead to disease formation; others increase the chance of disease statistically [18]. Analysis of SNP data is complicated both by the large number of possible interaction combinations and by the correlation between nearby SNPs.

Beyond Single-Locus Analysis

Despite striking success in the twentieth century in pinpointing the genes responsible for Mendelian diseases, the genetic origins of common complex diseases are non-Mendelian in nature [9, 19]. In particular, gene–gene interactions are involved in many complex biological processes such as metabolism, signal transduction, and gene regulation; thus, genetic variants at multiple loci may jointly contribute to disease formation [20, 21]. For example, breast cancer and type 2 diabetes have been linked to multi-SNP interactions [21,22,23]. While most current bioinformatics approaches focus on detecting single-SNP associations, advanced statistical methods are necessary for multi-SNP association mapping because single-variant methods not only lose power when interactions exist but are also ineffective at detecting rare mutations [24]. Moreover, the number of possible interactions is so vast that it is computationally unrealistic to search through all of them across the genome in a large-scale case-control study [8, 25].
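To give a sense of the scale involved, the short Python sketch below counts the number of distinct SNP pairs and triples for a marker panel. The panel size of 500,000 SNPs is a hypothetical figure chosen only for illustration; it does not come from any study cited here.

```python
from math import comb


def n_interaction_sets(n_snps: int, order: int) -> int:
    """Number of distinct SNP subsets of size `order` (candidate interactions)."""
    return comb(n_snps, order)


# Hypothetical panel size, used only to illustrate the growth in search space.
n_snps = 500_000
pairs = n_interaction_sets(n_snps, 2)
triples = n_interaction_sets(n_snps, 3)

print(f"two-way interactions:   {pairs:.3e}")    # about 1.25e11
print(f"three-way interactions: {triples:.3e}")  # about 2.08e16
```

Even before considering the genotype states within each subset, exhaustively testing every three-way combination at this panel size would require on the order of 10^16 evaluations, which is why stochastic search strategies are attractive.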

An additional challenge for disease origin discovery comes from the statistical correlation between nearby variants, known as linkage disequilibrium (LD) [25, 26]. LD patterns have many important applications in genetics and biology [27] and arise from the shared ancestry of contemporary chromosomes [13]. Because of LD, dense studies are likely to produce many redundant positive signals [24]. Later in this review, we describe in detail how Bayesian strategies can tackle these pressing problems in genetics while dealing with epistasis and linkage disequilibrium.
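As a concrete illustration of what "correlation between nearby variants" means, the sketch below computes two standard pairwise LD summaries for a pair of biallelic SNPs: the coefficient D and the squared correlation r². The function name and the example frequencies are assumptions made for illustration and are not taken from the cited works.

```python
def ld_stats(p_ab: float, p_a: float, p_b: float):
    """Return (D, r^2) for two biallelic loci.

    p_ab : frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a  : frequency of allele A at locus 1
    p_b  : frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b  # deviation from independence
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, r2


# Hypothetical frequencies: the AB haplotype is more common than independence predicts.
d, r2 = ld_stats(p_ab=0.4, p_a=0.5, p_b=0.5)
print(d, r2)

# Under independence the haplotype frequency equals the product p_a * p_b.
d_indep, r2_indep = ld_stats(p_ab=0.25, p_a=0.5, p_b=0.5)
print(d_indep, r2_indep)
```

When r² is close to 1, the two SNPs carry nearly the same information, so a signal at one of them tends to be echoed at the other; this is the source of the redundant positives mentioned above.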

Modern Bioinformatics Approaches

Currently, most approaches to disease association mapping take the standard “frequentist” approach to evaluating significance [2]. In particular, such algorithms use hypothesis testing procedures to deal with one variant at a time [24]. However, the failure of such “frequentist” methods to account for the power of a study and the number of likely true positives [2], combined with their increased likelihood of reporting a multitude of redundant associations [24], has sparked wide interest in Bayesian procedures. In this review, we survey the challenges facing statistical geneticists when analyzing GWAS data and outline how recently developed Bayesian methods can help. In addition to outlining the main differences between the proposed approaches, we highlight the limitations and advantages of each method, describe future prospects in the field, and discuss how Bayesian approaches can aid in answering outstanding questions in biomedicine.

Bayesian Data Analysis Methods

Figure 1 shows the multiple complicated interactions that have to be considered when developing statistical models of the multi-locus interactions that result in disease development. The goal is to accurately resolve all of the connections shown in large-scale case-control studies while also comprehending the biological processes that lead to disease development. Thus, while statistical understanding is important, the ultimate goal is to develop methods that can point toward the underlying biological processes.

Overview of Bayesian Data Analysis

In the Bayesian approach to parameter estimation, statistical conclusions about an unknown parameter $ \theta $ (or unobserved data $ x_{\text{unobs}} $) are expressed as probability statements conditional on the observed data $ x $: $ p(\theta |x) $ and $ p(x_{\text{unobs}} |x) $. Additionally, implicit conditioning is performed on the values of any covariates [28]. Conditioning on the observed data is what separates Bayesian statistics from other inference approaches, which evaluate an estimate of the unknown parameter over the distribution of possible data values while conditioning on the true, yet unknown, parameter value [28, 29].

At the heart of all Bayesian approaches to the detection of gene–gene interactions lies the concept of Bayesian inference and model selection. The goal is to determine the posterior distribution of all parameters in the problem (disease association, epistatic interactions, gene–environment interactions, and others), given the common variant data for the case-control study, while incorporating prior beliefs about the parameter values. The conditional probability of all parameters $ ({\text{Params}}) $ given the observed data $ ({\text{Data}}) $ is given by the product of the likelihood function of the data and the prior distribution on the parameters, divided by a normalization constant:

$$ P\left( {\text{Params|Data}} \right) = \frac{{P\left( {\text{Data|Params}} \right)P\left( {\text{Params}} \right)}}{{P\left( {\text{Data}} \right)}} $$

(1)

For most high-dimensional data sets encountered in large-scale studies, $ P({\text{Data}}) $ cannot be explicitly calculated [9] and, therefore, $ P({\text{Params}}|{\text{Data}}) $ can be evaluated analytically only up to the proportionality constant. However, advanced computational techniques (iterative sampling methods) can be used to determine posterior distribution of parameters [29, 30]. The main task is to make appropriate choices of statistical models to describe the likelihood expression and also to choose appropriate prior distributions on the values of parameters, $ P({\text{Params}}) $.
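The idea that the posterior only needs to be known up to a constant can be sketched with a toy example. The code below runs a random-walk Metropolis sampler for a single allele frequency with a binomial likelihood and a Beta prior; the counts, prior parameters, step size, and function names are all assumptions made for illustration, and the model is far simpler than those used in genome-wide analyses. Because this toy posterior is available in closed form, the sampled mean can be compared with the analytic value.

```python
import math
import random


def log_unnormalized_posterior(theta, k, n, a=1.0, b=1.0):
    """log[P(Data|theta) * P(theta)] up to an additive constant.

    Binomial likelihood (k successes in n trials) with a Beta(a, b) prior.
    P(Data) is never computed.
    """
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (k + a - 1.0) * math.log(theta) + (n - k + b - 1.0) * math.log(1.0 - theta)


def metropolis(k, n, n_steps=20000, step=0.05, seed=0):
    """Random-walk Metropolis sampler; returns draws after discarding burn-in."""
    rng = random.Random(seed)
    theta = 0.5
    current = log_unnormalized_posterior(theta, k, n)
    samples = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_unnormalized_posterior(proposal, k, n)
        # Accept with probability min(1, posterior ratio); P(Data) cancels in the ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(theta)
    return samples[n_steps // 4:]  # drop the first quarter as burn-in


# Hypothetical data: 30 copies of the minor allele among 100 sampled chromosomes.
samples = metropolis(k=30, n=100)
mcmc_mean = sum(samples) / len(samples)
exact_mean = (30 + 1.0) / (100 + 2.0)  # mean of the exact Beta(31, 71) posterior
print(round(mcmc_mean, 3), round(exact_mean, 3))
```

The acceptance step uses only the ratio of unnormalized posteriors, which is what makes such samplers usable when the normalization constant is intractable.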

Overview of Bayesian Variable Partition

Instead of testing each SNP set in a stepwise manner [31, 32], Bayesian approaches fit a single statistical model to all of the data simultaneously [9, 25, 33], allowing for increased robustness compared to hypothesis testing methods [2, 24]. Another advantage of the Bayesian approach is the ability to quantify all uncertainties and information, and to incorporate previous knowledge about each specific SNP marker into the statistical model through priors [9, 29].

In the Bayesian model selection framework, we are interested in figuring out which of the set of models $ \left\{ {M_{i} } \right\}_{i = 1}^{N} $ is the most likely one given the observed $ {\text{Data}} $. The posterior probability for a particular model $ M_{i} $ given $ {\text{Data}} $ is described by:

$$ P\left( {M_{i} | {\text{Data}}} \right) \propto P\left( {{\text{Data|}}M_{i} } \right)P\left( {M_{i} } \right) $$

(2)

Thus, through comparison of the posterior odds ratio for $ P(M_{i} |{\text{Data}}) $ and $ P(M_{j} |{\text{Data}}) $ it can be determined whether model $ M_{i} $ or $ M_{j} $ is more likely [29]. It is important to note that the normalization constant in Eq. 2 involves summation over all possible models: $ P\left( {\text{Data}} \right) = \mathop \sum \nolimits_{i = 1}^{N} P\left( {{\text{Data|}}M_{i} } \right)P(M_{i} ) $. For example, consider the case of a genome-wide study containing 1500 SNPs each of which can take one of the three possible states; thus, $ N = 3^{1500} \approx 5 \times 10^{715} $ is the total number of feasible models to sum over. In such instances, it is necessary to use stochastic methods to sample from the posterior distribution. Now let us consider how this conceptual framework is applied in practice to the determination of multi-locus interactions in case-control studies.
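The size of the model space quoted above, and the fact that the normalization constant cancels when two models are compared, can both be checked with a few lines of Python. The log-likelihood values and equal prior weights in the second part are hypothetical numbers used only to show the arithmetic.

```python
import math

# Size of the model space: 1500 SNPs, three possible states each.
n_snps, n_states = 1500, 3
log10_n_models = n_snps * math.log10(n_states)
exponent = math.floor(log10_n_models)
mantissa = 10 ** (log10_n_models - exponent)
print(f"N is roughly {mantissa:.1f} x 10^{exponent}")


def posterior_odds(log_lik_i, log_lik_j, prior_i=0.5, prior_j=0.5):
    """P(M_i | Data) / P(M_j | Data); the shared factor P(Data) cancels."""
    return math.exp(log_lik_i - log_lik_j) * prior_i / prior_j


# Hypothetical log marginal likelihoods for two competing models.
odds = posterior_odds(log_lik_i=-1040.2, log_lik_j=-1043.7)
print(f"posterior odds of M_i over M_j: {odds:.1f}")
```

Working with log-likelihoods avoids numerical underflow, since the likelihoods themselves are far too small to represent directly for data sets of realistic size.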

Epistasis Analysis Methods

While statistical methods such as BGTA [34], MARS [35], and CPM [36] are capable of detecting epistatic associations, the Bayesian epistasis association mapping (BEAM) algorithm [9] was the first practical approach capable of handling genome-wide case-control data sets. Given the case-control genotype SNP data, the BEAM algorithm outputs, for each SNP marker, the posterior probabilities of disease association and of epistatic interaction with other markers. Figure 2 shows the input file format required by the algorithm. The core of the Bayesian marker partition model can be briefly summarized as follows.

BEAM can detect both interacting and noninteracting disease loci among a large number of variants. It is an application of the Bayesian model selection procedure. In particular, all markers are split into three groups: (1) markers not associated with the disease, (2) marginally disease-associated variants, and (3) variants whose disease effect arises through interaction. Using priors on the marker memberships and Markov chain Monte Carlo (MCMC) methods, posterior probabilities for the group memberships are then determined. Specifically, the algorithm produces these posterior probabilities by interrogating each SNP marker conditionally on the current status of the others via MCMC [9]. The genotype counts are modeled by a multinomial distribution with frequency parameters $ \theta = \left\{ {\theta_{1} ,\theta_{2} ,\theta_{3} } \right\} $, $ \mathop \sum \nolimits_{i = 1}^{3} \theta_{i} = 1 $, which are given a Dirichlet prior:

$$ P(\theta |\alpha ) \propto \mathop \prod \limits_{i = 1}^{3} \theta_{i}^{{\alpha_{i} - 1}} $$

(3)
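A convenient property of pairing a multinomial likelihood with the Dirichlet prior of Eq. 3 is that the posterior is again Dirichlet, with the observed counts added to the prior parameters. The sketch below shows this standard conjugate update for one marker; the genotype counts and the choice of prior parameters are hypothetical.

```python
def dirichlet_posterior(counts, alpha):
    """Posterior Dirichlet parameters after observing multinomial counts."""
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution with the given parameters."""
    total = sum(params)
    return [p / total for p in params]


# Hypothetical genotype counts at one SNP (e.g. AA, Aa, aa) and an assumed prior.
counts = [60, 30, 10]
alpha = [0.5, 0.5, 0.5]

posterior = dirichlet_posterior(counts, alpha)   # [60.5, 30.5, 10.5]
theta_hat = dirichlet_mean(posterior)            # posterior mean genotype frequencies
print(posterior, [round(t, 3) for t in theta_hat])
```

With many observations the posterior mean approaches the observed genotype proportions, while for sparse counts the prior parameters keep the estimates away from zero.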

In order to determine the posterior probability of each marker’s group membership (represented by $ I $), the Metropolis–Hastings (MH) algorithm [30] is used to sample from $ P(I|D,H) $ as given in Eq. 4:

$$ P\left( {I |D,H} \right) \propto P(D_{1} |I)P(D_{2} |I)P(D_{0} ,H|I)P(I), $$

(4)

where $ D $ is the patient data set (with disease), $ H $ is the control data set (healthy), and $ D_{0} $, $ D_{1} $, and $ D_{2} $ are the corresponding partitions of the patient data set into the three categories described above. The assumption is that case genotypes at the disease-associated markers will have different distributions when compared to control genotypes. Furthermore, the likelihood model assumes independence among markers in the control group.
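The central assumption, that case genotypes at an associated marker follow a different distribution from control genotypes, can be illustrated for a single marker in isolation. The sketch below is a deliberately reduced stand-in, not the BEAM algorithm: it compares only two hypotheses for one SNP (cases and controls share genotype frequencies, or cases have their own), integrates the frequencies out under a symmetric Dirichlet prior, and ignores interactions and MCMC entirely. The prior probability of association, the Dirichlet parameter, and the counts are assumptions.

```python
from math import exp, lgamma, log


def log_marginal(counts, alpha=0.5):
    """Log Dirichlet-multinomial marginal likelihood of genotype counts
    under a symmetric Dirichlet(alpha) prior (ordered observations)."""
    a_sum = alpha * len(counts)
    n_sum = sum(counts)
    return (sum(lgamma(n + alpha) - lgamma(alpha) for n in counts)
            + lgamma(a_sum) - lgamma(a_sum + n_sum))


def posterior_association(case_counts, control_counts, prior=0.01):
    """Posterior probability that cases have their own genotype distribution."""
    pooled = [c + h for c, h in zip(case_counts, control_counts)]
    log_null = log_marginal(pooled)                                    # shared frequencies
    log_alt = log_marginal(case_counts) + log_marginal(control_counts)  # separate frequencies
    log_odds = log_alt - log_null + log(prior / (1.0 - prior))
    if log_odds >= 0:  # numerically stable logistic transform
        return 1.0 / (1.0 + exp(-log_odds))
    z = exp(log_odds)
    return z / (1.0 + z)


# Hypothetical genotype counts for 100 cases and 100 controls.
p_diff = posterior_association([20, 50, 30], [60, 30, 10])  # distributions differ
p_same = posterior_association([40, 40, 20], [40, 40, 20])  # identical distributions
print(round(p_diff, 4), round(p_same, 4))
```

In the full model, such comparisons are made jointly over all markers and over interacting marker sets, which is why sampling over the membership indicator is needed instead of a marker-by-marker calculation.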

While the BEAM algorithm was one of the first able to handle GWAS data, it suffered from the assumption that the dependence structure among SNPs could be described by a Markov chain [9, 25]. In fact, SNP markers are highly correlated within haplotype blocks, which are separated by recombination events [13, 37]. Therefore, despite its success, the BEAM model is unable to capture the block-like structure of the human genome.

Incorporating Block-Type Genome Structure

Given that nearby SNPs are strongly correlated due to linkage disequilibrium, a new Bayesian model [25] that infers diplotype blocks and chooses SNP markers within blocks that are disease-associated becomes much more powerful when compared to other similar approaches. Here, we review the statistical Bayesian model for the LD-block structure determination [25, 26]. The main assumption is that diplotypes of individuals come from a multinomial distribution with frequency parameters described by the Dirichlet prior and that genotype combinations of SNPs in different blocks are mutually independent. The compact expression for the marginal probability of the data for a specific block is given by:

$$ P\left( {D_{{\left[ {s,b} \right)}} |\left[ {s,b} \right) = block} \right) = \left( {\mathop \prod \limits_{i = 1}^{{3^{b - s} }} \frac{{\Gamma \left( {n_{i} + a_{i} } \right)}}{{\Gamma \left( {a_{i} } \right)}}} \right)\frac{{\Gamma \left( {\mathop \sum \nolimits a_{i} } \right)}}{{\Gamma \left( {\mathop \sum \nolimits \left( {n_{i} + a_{i} } \right)} \right)}}, $$

(5)

where the block under consideration consists of the SNPs $ (s, \ldots ,b - 1) $; $ \Gamma $ is the gamma function, $ \vec{a} $ is the vector of Dirichlet parameters, and $ n_{i} $ is the number of counts for a specific diplotype. For joint inference of diplotype blocks and disease association status, we use the joint statistical model for the observed genotype data in cases and controls, the marker membership, and the block partition variable:

$$ P\left( {H,D,B,I} \right) = P\left( {H,D |B,I} \right)P\left( B \right)P(I) $$

(6)

Finally, in order to determine the posteriors $ P(B|D,H) $ and $ P(I|D,H) $ the model uses a combination of MH algorithm and Gibbs sampler [25].
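The block marginal likelihood in Eq. 5 is straightforward to evaluate on the log scale. The following sketch does so for genotype data coded 0/1/2 and uses it to compare treating two SNPs as one block against treating them as two independent single-SNP blocks. The symmetric Dirichlet parameter, the function name, and the toy data are assumptions for illustration; the sampling over block partitions used in the actual model is not reproduced here.

```python
from collections import Counter
from math import lgamma


def log_block_marginal(genotypes, start, stop, a=0.5):
    """Log of Eq. 5 for SNP columns start..stop-1 treated as a single block.

    genotypes : list of per-individual tuples of genotypes coded 0, 1, 2
    a         : symmetric Dirichlet parameter for each of the 3^(stop-start)
                genotype combinations (an assumption of this sketch)
    """
    n_cells = 3 ** (stop - start)
    counts = Counter(tuple(row[start:stop]) for row in genotypes)
    n_total = len(genotypes)
    # Unobserved combinations have n_i = 0 and contribute log(1) = 0 to the product.
    log_prod = sum(lgamma(n + a) - lgamma(a) for n in counts.values())
    return log_prod + lgamma(n_cells * a) - lgamma(n_cells * a + n_total)


# Toy data 1: two perfectly correlated SNPs (strong LD).
correlated = [(0, 0)] * 20 + [(1, 1)] * 20 + [(2, 2)] * 20
one_block = log_block_marginal(correlated, 0, 2)
two_blocks = log_block_marginal(correlated, 0, 1) + log_block_marginal(correlated, 1, 2)
print(one_block > two_blocks)  # the single-block model is favoured

# Toy data 2: all nine genotype combinations equally frequent (no LD).
independent = [(g1, g2) for g1 in range(3) for g2 in range(3)] * 10
one_block_ind = log_block_marginal(independent, 0, 2)
two_blocks_ind = (log_block_marginal(independent, 0, 1)
                  + log_block_marginal(independent, 1, 2))
print(two_blocks_ind > one_block_ind)  # separate blocks are favoured
```

Because blocks are assumed mutually independent, the log marginal likelihood of a whole partition is simply the sum of the per-block terms, which is what makes comparing alternative block boundaries tractable.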

Detailed Interaction Partition Structure Determination

While successful in inferring epistatic interactions in large-scale case-control studies, both BEAM and its newer version, BEAM2, use saturated models, which limits their ability to accurately determine the structure of the epistatic interactions. Recent studies have shown [4, 33, 38] that such interaction details, which arise from the encoding of complicated regulatory mechanisms, might play an important role in disease formation. In order to carefully explore the etiopathogenesis and genetic mechanisms of diseases, a novel algorithm named Recursive Bayesian Partition (RBP) was proposed [33]. The RBP approach employs a Bayesian model to discover independence groups among interacting markers: first, it recursively infers all the marginally independent interaction groups, and then it determines the conditional independence within each group using a chain dependence model. In this way, RBP recursively determines the dependence structure among interacting variants in GWAS. Figure 3 shows an example of the possible outcomes of the RBP algorithm applied to GWAS data when determining the independence structure of epistatic interactions.

Bayesian Graph Models and Networks

In order to improve disease mapping sensitivity and specificity, the BEAM3 algorithm [24] uses a graph model to allow for flexible interaction structures in multi-SNP associations. Through the use of Bayesian networks, BEAM3 detects flexible interaction structures instead of using saturated models (as BEAM and BEAM2 do), thereby greatly reducing the complexity of the interaction model. Moreover, because only the disease association graphs are constructed, BEAM3 provides higher computational efficiency in whole-genome association settings [24].

In detail, BEAM3 allows for higher order couplings via saturated interactions within cliques (nonoverlapping partition of SNPs) and pairwise interactions between them. It can be shown [24] that the joint probability of all SNPs X, parameters, including disease graph and association status (G, I), and disease status indicator (Y) is given by:

$$ P(X,Y,G,I) \propto P_{A} (X_{1} |Y,G)P(Y)P(G|I)P(I)/P_{0} (X_{1} ), $$

(7)

where $ G = (C,\Delta ) $ is an undirected disease graph constructed on the disease-associated SNPs ($ X_{1} $), comprising a partition of the SNPs into cliques ($ C $) and the interactions between cliques ($ \Delta $); $ P_{A} $ denotes the probability function of the set $ X_{1} $ under the phenotype association hypothesis. As can be seen from Eq. 7, only a few disease-associated SNPs are modeled (those in the set $ X_{1} $), and hence a significant portion of computational time is saved by avoiding explicit modeling of the complicated dependence structure of all SNPs, which could number in the millions [19, 24, 39]. Additionally, through the choice of a proper baseline probability function $ P_{0} (X_{1} ) $, the model automatically accounts for the complex LD effects among dense SNPs by employing graphs. Thus, a significant number of repetitive false interactions are avoided, reducing the computational burden [24]. Specifically, summing over all graphs $ G^{\prime} $, the expression for the baseline model becomes:

$$ P_{0} \left( {X_{1} } \right) = \mathop \sum \limits_{{G^{\prime}}} P_{0} \left( {X_{1} |G^{\prime}} \right)P(G^{\prime}) $$

(8)

An alternative approach toward learning disease inducing gene–gene interactions is using binary classification trees. Bayesian methodology has been recently applied [21] to identification of multi-locus interactions in the large-scale data sets using a Bayesian classification tree model. Specifically, this kind of machine learning approach produces tree structure models, where each nonterminal node determines the splitting rule based upon the predictor variables like SNPs, and edges between nodes correspond to different possible values for the variable in the top parent node. A path along such a tree till the terminal node represents a specific combination of predictor variables along the path, therefore, automatically accommodating for epistasis [8, 21].
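The way a root-to-leaf path encodes a combination of SNP values can be made concrete with a tiny hand-built tree. The structure below is entirely hypothetical (the SNP names, splits, and leaf labels are invented for illustration) and involves no model fitting or Bayesian tree search; it only shows how a path through the tree corresponds to a specific multi-SNP genotype combination.

```python
# Hypothetical tree: internal nodes split on a SNP genotype (0, 1, or 2),
# terminal nodes carry a predicted class label.
tree = {
    "snp": "snp_A",
    "children": {
        0: {"label": "control"},
        1: {
            "snp": "snp_B",
            "children": {
                0: {"label": "control"},
                1: {"label": "case"},
                2: {"label": "case"},
            },
        },
        2: {"label": "case"},
    },
}


def classify(node, genotype):
    """Follow the splitting rules from the root to a terminal node.

    Returns the leaf label together with the path of (SNP, value) pairs,
    i.e. the genotype combination that the path represents.
    """
    path = []
    while "label" not in node:
        value = genotype[node["snp"]]
        path.append((node["snp"], value))
        node = node["children"][value]
    return node["label"], path


label, path = classify(tree, {"snp_A": 1, "snp_B": 2})
print(label, path)  # case [('snp_A', 1), ('snp_B', 2)]
```

In this toy tree the effect of `snp_B` is only visible among individuals with `snp_A` equal to 1, which is the kind of conditional dependence that tree paths can represent.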

There are various ways of searching through the feasible tree space in such recursive partitioning approaches, including greedy algorithms [40], the random forests approach [8, 41], and MCMC [42, 43]. Bayesian variable partition and Bayesian classification trees are conceptually similar in that a prior is assigned to all the tree models with the purpose of controlling the tree size [21]. One main advantage of this approach is that the prior specification can encourage splitting on particular variables, which may increase the probability of finding multi-locus interactions with weak marginal effects. Moreover, owing to the adaptivity of the MCMC algorithm, such Bayesian tree models detect higher order interactions by performing thorough searches near trees that contain the interacting variables determined in previous iterations [21]. It is important to point out that classification tree approaches do not test for epistatic interactions directly [8].

Clinical Applications of Bayesian Methodology

Even though practical Bayesian approaches for whole-genome multi-locus interaction analysis have emerged relatively recently, such methods have already helped to make important advances in determining disease etiology. Table 1 summarizes and compares the statistical methods described above, noting the interaction model each method uses, its success in recovering previously known disease loci and, more importantly, its success in discovering new multi-locus interactions responsible for complex diseases. For example, a Bayesian analysis strategy combining the BEAM and BEAM2 software [44] allowed for the discovery of 319 high-order interactions across the genome that can potentially explain the missing genetic component of rheumatoid arthritis susceptibility. Moreover, the findings of that study indicate that the nervous system, in addition to the autoimmune one, potentially plays a crucial role in the disease development. Figure 4 shows a schematic diagram of the combined Bayesian strategy used for the analysis. This is an example of a statistical study in which the biological processes underlying a disease can be extracted from the determined statistical associations. Many more studies applying Bayesian methods, either to existing GWAS data or to new large-scale studies, are likely to follow in the near future.

Table 1 Comparison of modern Bayesian approaches for whole-genome association analysis with possible clinical applications

Conclusions and Future Prospects

Certain issues need to be considered when using Bayesian approaches described above. For example, the combination of genotyping errors, disease heterogeneities, and population substructures could have adverse effects on the statistical results of the methods [9]. Currently, the major problem in the field is that the determined disease-associated genetic variants explain only a small part of the disease heritability [3, 4]. However, it is conceivable that the usage of the software tools outlined above will help with the detailed understanding of the interactions involved. Additionally, recent development of Bayesian models should allow for the elucidation of the detailed etiopathogenesis of the disease formation and the underlying causal biology.

Improvements to the Bayesian approaches mentioned in this article can include incorporation of environmental factors and population structures as covariates in the statistical model [33, 45]. Another possible improvement is to impute untyped SNPs and missing genotypes [46]. Efficient incorporation of prior biological knowledge into the Bayesian model can increase the probability of making discoveries in association studies [47]. Finally, recent computational proposals attempt to apply Bayesian methodology specifically toward efficient identification of causal rare variants in GWAS [48, 49].

It is important to keep in mind that the clinical applications of the statistical methods will arise from the understanding of the relationship between determined mathematical couplings and their biochemical underpinnings. The biological interpretation of the determined single- and multi-variant effects is currently a crucial area of research in genetics [8]. Modern statistical approaches to the analysis of the SNP data from whole-genome association studies have potential to play an important role in the future of bioinformatics and genomics research. Specifically, such methods will contribute to novel understandings of disease pathogenesis and provide crucial information for drug discovery [50], thus leading to important clinical applications.

References

1. S.S. Hall, Revolution postponed. Sci. Am. 303, 60–67 (2010)
2. M.I. McCarthy et al., Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat. Rev. Genet. 9, 356–369 (2008)
3. P. Donnelly, Progress and challenges in genome-wide association studies in humans. Nature 456, 728–731 (2008)
4. WTCCC, Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007)
5. E.E. Eichler et al., Missing heritability and strategies for finding the underlying causes of complex disease. Nat. Rev. Genet. 11, 446–450 (2010)
6. J.A. Todd et al., Robust associations of four new chromosome regions from genome-wide analyses of type 1 diabetes. Nat. Genet. 39, 857–864 (2007)
7. J.N. Hirschhorn, M.J. Daly, Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108 (2005)
8. H.J. Cordell, Detecting gene-gene interactions that underlie human diseases. Nat. Rev. Genet. 10, 392–404 (2009)
9. Y. Zhang, J.S. Liu, Bayesian inference of epistatic interactions in case-control studies. Nat. Genet. 39, 1167–1173 (2007)
10. M.L. Metzker, Sequencing technologies – the next generation. Nat. Rev. Genet. 11, 31–46 (2010)
11. D. Branton et al., The potential and challenges of nanopore sequencing. Nat. Biotechnol. 26, 1146–1153 (2008)
12. A. Schaffer, Nanopore sequencing. Technol. Rev. (2012)
13. The International HapMap Consortium, A haplotype map of the human genome. Nature 437, 1299–1320 (2005)
14. E. Svoboda, The DNA transistor. Sci. Am. 303, 46 (2010)
15. A.D. Johnson, C.J. O’Donnell, An open access database of genome-wide association results. BMC Med. Genet. 10, 6 (2009)
16. D. Altshuler, M. Daly, Guilt beyond a reasonable doubt. Nat. Genet. 39, 813–815 (2007)
17. G. Gibson, Rare and common variants: twenty arguments. Nat. Rev. Genet. 13, 135–145 (2012)
18. M. Carmichael, One hundred tests. Sci. Am. 303, 50 (2010)
19. X. Jiang et al., Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinform. 12, 89 (2011)
20. J.H. Moore, The ubiquitous nature of epistasis in determining susceptibility to common human diseases. Hum. Hered. 56, 73–82 (2003)
21. M. Chen et al., Detecting epistatic SNPs associated with complex diseases via a Bayesian classification tree search method. Ann. Hum. Genet. 75, 112–121 (2011)
22. M.D. Ritchie et al., Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 69, 138–147 (2001)
23. S. Wiltshire et al., Epistasis between type 2 diabetes susceptibility loci on chromosomes 1q21-25 and 10q23-26 in Northern Europeans. Ann. Hum. Genet. 70, 726–737 (2006)
24. Y. Zhang, A novel graphical model for genome-wide multi-SNP association mapping. Genet. Epidemiol. 36, 36–47 (2012)
25. Y. Zhang et al., Block-based Bayesian epistasis association mapping with application to WTCCC type 1 diabetes data. Ann. Appl. Stat. 5, 2052–2077 (2011)
26. I. Kozyryev, J. Zhang, Bayesian determination of disease associated differences in haplotype blocks. Am. J. Bioinform. 1, 20–29 (2012)
27. J.D. Wall, J.K. Pritchard, Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 4, 587–597 (2003)
28. A. Gelman et al., Bayesian Data Analysis, 2nd edn. (2003)
29. J.A. Rice, Mathematical Statistics and Data Analysis, 3rd edn. (2006)
30. J.S. Liu, Monte Carlo Strategies in Scientific Computing, 1st edn. (2001)
31. J. Marchini et al., Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat. Genet. 37, 413–417 (2005)
32. Y. Liu et al., Genome-wide interaction-based association analysis identified multiple new susceptibility loci for common diseases. PLoS Genet. 7, 3 (2011)
33. J. Zhang et al., A Bayesian method for disentangling dependent structure of epistatic interaction. Am. J. Biostat. 2, 1–10 (2011)
34. T. Zheng et al., Backward genotype-trait association (BGTA)-based dissection of complex traits in case-control design. Hum. Hered. 62, 196–212 (2006)
35. N.R. Cook et al., Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat. Med. 23, 1439–1453 (2004)
36. M.R. Nelson et al., A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 11, 458–470 (2001)
37. D.E. Reich et al., Linkage disequilibrium in the human genome. Nature 411, 199–204 (2001)
38. Y. Yang et al., Testing association with interactions by partitioning chi-squares. Ann. Hum. Genet. 73, 109–117 (2009)
39. Y. Zhang, J.S. Liu, Fast and accurate approximation to significance tests in genome-wide association studies. J. Am. Stat. Assoc. 106, 846–857 (2011)
40. T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 5th edn. (2011)
41. L. Breiman, Random forests. Mach. Learn. 45, 5–32 (2001)
42. H.A. Chipman et al., Bayesian CART model search. J. Am. Stat. Assoc. 93, 935–948 (1998)
43. D.G.T. Denison et al., A Bayesian CART algorithm. Biometrika 85, 363–377 (1998)
44. J. Zhang et al., High-order interactions in rheumatoid arthritis detected by Bayesian method using genome-wide association studies data. Am. Med. J. 3, 56–66 (2012)
45. I. Lobach et al., Genotype-based association mapping of complex diseases: gene-environment interactions with multiple genetic markers and measurement errors in environmental exposures. Genet. Epidemiol. 34, 792–802 (2010)
46. Y. Zhang, Bayesian epistasis association mapping via SNP imputation. Biostat 12, 211–222 (2011)
47. M. Chen et al., Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet. 7(4), e1001353 (2011)
48. F. Liang, M. Xiong, Bayesian detection of causal rare variants under posterior consistency. PLoS ONE 8(7), e69633 (2013)
49. M.A. Quintana et al., Incorporating model uncertainty in detecting rare variants: the Bayesian Risk Index. Genet. Epidemiol. 35, 638–649 (2011)
50. Y. Okada et al., Genetics of rheumatoid arthritis contributes to biology and drug discovery. Nature 506, 376–381 (2013)

Acknowledgements

Zhang was supported by the start-up funding and Sesseel Award from Yale University.

Author information

Authors and Affiliations

Department of Physics, Harvard University, Cambridge, MA, USA
Ivan Kozyryev
Department of Mathematics and Statistics, Georgia State University, Atlanta, GA, USA
Jing Zhang

Corresponding author

Correspondence to Jing Zhang.

Editor information

Editors and Affiliations

Digital Biology Laboratory, Computer Science Department, University of Missouri-Columbia, Columbia, Missouri, USA
Dong Xu
Georgia Institute of Technology and Emory University, Atlanta, Georgia, USA
May D. Wang
College of Computer Science and Technology, Jilin University, Changchun, China
Fengfeng Zhou
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
Yunpeng Cai

About this chapter

Cite this chapter

Kozyryev, I., Zhang, J. (2017). Clinical Assessment of Disease Risk Factors Using SNP Data and Bayesian Methods. In: Xu, D., Wang, M., Zhou, F., Cai, Y. (eds) Health Informatics Data Analysis. Health Information Science. Springer, Cham. https://doi.org/10.1007/978-3-319-44981-4_6

DOI: https://doi.org/10.1007/978-3-319-44981-4_6
Published: 10 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44979-1
Online ISBN: 978-3-319-44981-4
eBook Packages: Computer Science, Computer Science (R0)
