Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies

Zhou, Wei; Nielsen, Jonas B.; Fritsche, Lars G.; Dey, Rounak; Gabrielsen, Maiken E.; Wolford, Brooke N.; LeFaive, Jonathon; VandeHaar, Peter; Gagliano, Sarah A.; Gifford, Aliya; Bastarache, Lisa A.; Wei, Wei-Qi; Denny, Joshua C.; Lin, Maoxuan; Hveem, Kristian; Kang, Hyun Min; Abecasis, Goncalo R.; Willer, Cristen J.; Lee, Seunggeun

doi:10.1038/s41588-018-0184-y

Analysis
Published: 13 August 2018

Volume 50, pages 1335–1341, (2018)

Wei Zhou^1,2,
Jonas B. Nielsen ORCID: orcid.org/0000-0002-6654-2852^3,
Lars G. Fritsche ORCID: orcid.org/0000-0002-2110-1690^2,4,5,
Rounak Dey^2,5,
Maiken E. Gabrielsen^4,
Brooke N. Wolford ORCID: orcid.org/0000-0003-3153-1552^1,2,
Jonathon LeFaive^2,5,
Peter VandeHaar^2,5,
Sarah A. Gagliano^2,5,
Aliya Gifford^6,
Lisa A. Bastarache^6,
Wei-Qi Wei^6,
Joshua C. Denny^6,7,
Maoxuan Lin^3,
Kristian Hveem^4,8,
Hyun Min Kang^2,5,
Goncalo R. Abecasis^2,5,
Cristen J. Willer ORCID: orcid.org/0000-0001-5645-4966^1,3,9^na1 &
Seunggeun Lee ORCID: orcid.org/0000-0002-8097-3878^2,5^na1

Abstract

In genome-wide association studies (GWAS) for thousands of phenotypes in large biobanks, most binary traits have substantially fewer cases than controls. Both of the widely used approaches, the linear mixed model and the recently proposed logistic mixed model, perform poorly; they produce large type I error rates when used to analyze unbalanced case-control phenotypes. Here we propose a scalable and accurate generalized mixed model association test that uses the saddlepoint approximation to calibrate the distribution of score test statistics. This method, SAIGE (Scalable and Accurate Implementation of GEneralized mixed model), provides accurate P values even when case-control ratios are extremely unbalanced. SAIGE uses state-of-the-art optimization strategies to reduce computational costs; hence, it is applicable to GWAS for thousands of phenotypes by large biobanks. Through the analysis of UK Biobank data of 408,961 samples from white British participants with European ancestry for >1,400 binary phenotypes, we show that SAIGE can efficiently analyze large sample data, controlling for unbalanced case-control ratios and sample relatedness.
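
The calibration idea in the abstract can be illustrated with a minimal sketch. It is not the SAIGE implementation: it assumes independent samples (no relatedness, no mixed model), treats the fitted case probabilities `mu` as known, and tests only the upper tail of the score statistic S = sum_i g_i (y_i - mu_i). The function names are hypothetical, and the tail formula is the Barndorff-Nielsen form of the saddlepoint approximation, chosen here for illustration.

```python
import math


def cgf_derivatives(t, g, mu):
    """Return K(t), K'(t), K''(t) for S = sum_i g_i * (y_i - mu_i),
    where the y_i are independent Bernoulli(mu_i)."""
    k0 = k1 = k2 = 0.0
    for gi, mi in zip(g, mu):
        e = math.exp(gi * t)
        denom = 1.0 - mi + mi * e
        k0 += math.log(denom) - t * gi * mi
        k1 += gi * mi * e / denom - gi * mi
        k2 += gi * gi * mi * (1.0 - mi) * e / (denom * denom)
    return k0, k1, k2


def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def normal_pvalue(s, g, mu):
    """Upper-tail p-value from the usual normal approximation."""
    var = sum(gi * gi * mi * (1.0 - mi) for gi, mi in zip(g, mu))
    return normal_upper_tail(s / math.sqrt(var))


def spa_pvalue(s, g, mu, tol=1e-12):
    """Upper-tail p-value P(S > s) by saddlepoint approximation, for s > 0."""
    if s <= 0:
        raise ValueError("this sketch only handles the upper tail, s > 0")
    # K'(t) is increasing with K'(0) = 0, so bracket the root and bisect.
    lo, hi = 0.0, 1.0
    while cgf_derivatives(hi, g, mu)[1] < s:
        hi *= 2.0
        if hi > 64.0:
            raise ValueError("s is at or beyond the edge of the support of S")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cgf_derivatives(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    t_hat = 0.5 * (lo + hi)
    k0, _, k2 = cgf_derivatives(t_hat, g, mu)
    w = math.sqrt(2.0 * (t_hat * s - k0))
    v = t_hat * math.sqrt(k2)
    return normal_upper_tail(w + math.log(v / w) / w)


# Toy data (assumed): 1,000 samples with a 1% case rate,
# 20 of whom carry one copy of the tested allele.
mu = [0.01] * 1000
g = [1.0] * 20 + [0.0] * 980
s = 1.8  # two carrier cases observed versus 0.2 expected
print(normal_pvalue(s, g, mu), spa_pvalue(s, g, mu))
```

In this toy setting the normal approximation returns a much smaller tail probability than the saddlepoint approximation, which is the kind of miscalibration the abstract attributes to unbalanced case-control ratios. Because S is discrete here, the saddlepoint value is itself only an approximation to the exact tail probability.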

**Fig. 1: Manhattan plots of GWAS results for four binary phenotypes with various case-control ratios in the UK Biobank.**

**Fig. 2: Quantile–quantile plots of GWAS results for four binary phenotypes with various case-control ratios in the UK Biobank.**

Acknowledgements

This research has been conducted using the UK Biobank Resource under application number 24460. S.L. and R.D. were supported by NIH R01 HG008773. C.J.W. was supported by NIH R35 HL135824. W.Z. was supported by the University of Michigan Rackham Predoctoral Fellowship. J.B.N. was supported by the Danish Heart Foundation and the Lundbeck Foundation. J.C.D., A.G., L.A.B., and W.-Q.W. were supported by NIH R01 LM010685 and U2C OD023196.

Author information

These authors contributed equally: Cristen J. Willer and Seunggeun Lee.

Authors and Affiliations

1. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
Wei Zhou, Brooke N. Wolford & Cristen J. Willer
2. Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA
Wei Zhou, Lars G. Fritsche, Rounak Dey, Brooke N. Wolford, Jonathon LeFaive, Peter VandeHaar, Sarah A. Gagliano, Hyun Min Kang, Goncalo R. Abecasis & Seunggeun Lee
3. Department of Internal Medicine, Division of Cardiology, University of Michigan Medical School, Ann Arbor, MI, USA
Jonas B. Nielsen, Maoxuan Lin & Cristen J. Willer
4. K. G. Jebsen Center for Genetic Epidemiology, Department of Public Health and Nursing, Norwegian University of Science and Technology, Trondheim, Norway
Lars G. Fritsche, Maiken E. Gabrielsen & Kristian Hveem
5. Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, MI, USA
Lars G. Fritsche, Rounak Dey, Jonathon LeFaive, Peter VandeHaar, Sarah A. Gagliano, Hyun Min Kang, Goncalo R. Abecasis & Seunggeun Lee
6. Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
Aliya Gifford, Lisa A. Bastarache, Wei-Qi Wei & Joshua C. Denny
7. Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
Joshua C. Denny
8. HUNT Research Centre, Department of Public Health and General Practice, Norwegian University of Science and Technology, Levanger, Norway
Kristian Hveem
9. Department of Human Genetics, University of Michigan Medical School, Ann Arbor, MI, USA
Cristen J. Willer

Contributions

W.Z., C.J.W., and S.L. designed the experiments. W.Z. and S.L. performed the experiments. J.B.N., L.G.F., A.G., L.A.B., W.-Q.W., and J.C.D. constructed the phenotypes for the UK Biobank data. W.Z., J.L., S.A.G., B.N.W., M.L., H.M.K., C.J.W., S.L., and G.R.A. analyzed the UK Biobank data. P.V. created the PheWeb. M.E.G. and K.H. provided the data. W.Z., J.B.N., A.G., J.C.D., R.D., C.J.W., and S.L. wrote the manuscript.

Corresponding authors

Correspondence to Cristen J. Willer or Seunggeun Lee.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–18, Supplementary Tables 1–8 and Supplementary Note

Reporting Summary

About this article

Cite this article

Zhou, W., Nielsen, J.B., Fritsche, L.G. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat Genet 50, 1335–1341 (2018). https://doi.org/10.1038/s41588-018-0184-y

Received: 15 November 2017
Accepted: 21 June 2018
Published: 13 August 2018
Issue Date: September 2018
DOI: https://doi.org/10.1038/s41588-018-0184-y
Springer Nature America, Inc.
