Abstract
We present a new statistical test of association between a trait and genetic markers, which we prove theoretically and demonstrate in practice to be robust to arbitrarily complex population structure. The test involves a set of parameters that can be estimated directly from large-scale genotyping data, such as those measured in genome-wide association studies (GWAS). We also derive a new set of methodologies, called the 'genotype-conditional association test' (GCAT), which we show provides accurate association tests in populations with complex structure, manifested in both the genetic and non-genetic contributions to the trait. We demonstrate the proposed method in a large simulation study and on data from the Northern Finland Birth Cohort study. In the Finland study, we identify several new significant loci that other methods do not detect. Our framework takes a substantially different approach to the problem from existing methods, such as the linear mixed-model and principal-component approaches.
Acknowledgements
This research was supported in part by US National Institutes of Health grant R01 HG006448. The NFBC data were collected by the STAMPEED: Northern Finland Birth Cohort 1966 (NFBC1966) GWAS and made available through the database of Genotypes and Phenotypes (dbGaP) under study accession phs000276.v2.p1. A full list of contributors to the STAMPEED study can be found on its dbGaP web site.
Author information
Contributions
J.D.S. designed the study and wrote the manuscript. J.D.S. and M.S. developed statistical theory and methods. W.H., J.D.S. and M.S. designed the simulations. W.H. analyzed the data and developed the software.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Supplementary information
Supplementary Text and Figures
Supplementary Note, Supplementary Figures 1–18 and Supplementary Tables 1 and 2. (PDF 14034 kb)
About this article
Cite this article
Song, M., Hao, W. & Storey, J. Testing for genetic associations in arbitrarily structured populations. Nat Genet 47, 550–554 (2015). https://doi.org/10.1038/ng.3244
DOI: https://doi.org/10.1038/ng.3244
- Springer Nature America, Inc.