Abstract
The single-nucleotide polymorphism (SNP) is the basic unit for understanding the heritability of complex traits. One attractive application of susceptibility SNPs is the construction of prediction models for assessing disease risk. Here, we introduce methods for predicting human traits from SNP data, including the polygenic risk score (PRS), linear mixed models (LMMs), and penalized regression, as well as methods for controlling population stratification.
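As a rough illustration of the simplest of these methods, a PRS is commonly computed as a weighted sum of an individual's effect-allele counts, with weights taken from per-SNP effect-size estimates. The sketch below assumes genotypes coded as 0, 1, or 2 copies of the effect allele; the function name, sample names, and weights are hypothetical toy values, not taken from the chapter.

```python
from typing import Dict, List, Sequence


def polygenic_risk_score(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Return the weighted sum of effect-allele counts for one individual.

    genotypes: effect-allele counts per SNP (assumed coded 0, 1, or 2).
    weights:   per-SNP effect-size estimates, in the same SNP order.
    """
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must cover the same SNPs")
    return sum(g * w for g, w in zip(genotypes, weights))


# Hypothetical toy data: three SNPs, two individuals.
weights: List[float] = [0.5, -0.25, 1.0]
individuals: Dict[str, List[int]] = {
    "sample_A": [0, 1, 2],
    "sample_B": [2, 2, 0],
}

# Score every individual with the same set of weights.
scores = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}
print(scores)
```

In practice the weights would come from an association analysis in an independent sample, and the choice of which SNPs to include has a large effect on the score; this sketch leaves both steps out.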
Copyright information
© 2023 The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature
About this protocol
Cite this protocol
Xia, X., Zhang, Y., Wei, Y., Wang, M.H. (2023). Statistical Methods for Disease Risk Prediction with Genotype Data. In: Fridley, B., Wang, X. (eds) Statistical Genomics. Methods in Molecular Biology, vol 2629. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2986-4_15
DOI: https://doi.org/10.1007/978-1-0716-2986-4_15
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-2985-7
Online ISBN: 978-1-0716-2986-4
eBook Packages: Springer Protocols