Bayesian estimation of gene constraint from an evolutionary model with gene features

From Nature Genetics

Abstract

Measures of selective constraint on genes have been used for many applications, including clinical interpretation of rare coding variants, disease gene discovery and studies of genome evolution. However, widely used metrics are severely underpowered at detecting constraints for the shortest ~25% of genes, potentially causing important pathogenic mutations to be overlooked. Here we developed a framework combining a population genetics model with machine learning on gene features to enable accurate inference of an interpretable constraint metric, s_het. Our estimates outperform existing metrics for prioritizing genes important for cell essentiality, human disease and other phenotypes, especially for short genes. Our estimates of selective constraint should have wide utility for characterizing genes relevant to human disease. Finally, our inference framework, GeneBayes, provides a flexible platform that can improve the estimation of many gene-level properties, such as rare variant burden or gene expression differences.
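The combination of a feature-informed prior with a population genetics likelihood can be illustrated with a minimal sketch of posterior inference on a grid. This is not the GeneBayes implementation: the Gaussian prior in log10(s_het), the Poisson stand-in likelihood and every number below are hypothetical choices made for illustration, whereas the study reports learning from gene features with a modified NGBoost and computing likelihoods with fastDTWF.

```python
import math


def grid_posterior(log_prior, log_likelihood):
    """Combine a prior and a likelihood evaluated on the same grid of s_het values."""
    log_post = [lp + ll for lp, ll in zip(log_prior, log_likelihood)]
    peak = max(log_post)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def summarize(grid, posterior, level=0.95):
    """Return the posterior mean and an equal-tailed credible interval on the grid."""
    mean = sum(s * p for s, p in zip(grid, posterior))
    tail = (1.0 - level) / 2.0
    lower, upper = None, grid[-1]
    cumulative = 0.0
    for s, p in zip(grid, posterior):
        cumulative += p
        if lower is None and cumulative >= tail:
            lower = s
        if cumulative >= 1.0 - tail:
            upper = s
            break
    return mean, lower, upper


# Toy inputs; every number below is hypothetical.
grid = [10 ** (-5 + 5 * i / 200) for i in range(201)]  # s_het from 1e-5 to 1, log-spaced
# Stand-in prior: Gaussian in log10(s_het), centered at s_het = 0.01.
log_prior = [-0.5 * (math.log10(s) + 2.0) ** 2 for s in grid]
# Stand-in likelihood: Poisson count of LOF alleles with mean n_chrom * mu / s
# (the log-factorial term is dropped because it does not depend on s).
n_chrom, mu, observed = 250_000, 1e-6, 10
log_lik = [observed * math.log(n_chrom * mu / s) - n_chrom * mu / s for s in grid]

posterior = grid_posterior(log_prior, log_lik)
mean, lower, upper = summarize(grid, posterior)
print(f"posterior mean = {mean:.4f}, 95% CrI = ({lower:.4f}, {upper:.4f})")
```

In this toy setting, a gene with few expected LOFs has a flat likelihood, so the posterior stays close to the prior; this is the regime in which a prior informed by gene features would be expected to matter most.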

Fig. 1: Limitations of LOEUF and schematic representation for inferring s_het using GeneBayes.
Fig. 2: Factors that contribute to our estimates of s_het.
Fig. 3: GeneBayes estimates of s_het perform well at identifying constrained and unconstrained genes.
Fig. 4: Breakdown of the gene features that are important for s_het prediction.
Fig. 5: Comparing selection on LOFs (s_het) between genes and s_het to selection on other variant types.
Fig. 6: GeneBayes is a flexible framework for estimating gene-level properties.

Data availability

Posterior means and 95% credible intervals for s_het are available in Supplementary Table 1. Data sources for pLOF annotations, CpG methylation levels, exome sequencing coverage, variant frequencies and mappability/segmental duplication annotations are available in Supplementary Table 5. A description of the gene features is available in Supplementary Table 8. Posterior densities for s_het, likelihoods for s_het, LOF variants with misannotation probabilities and gene feature tables are available in ref. 83. Additional publicly available datasets used in this study are described in Methods and Supplementary Information and are accessible at IMPC essential genes (https://www.ebi.ac.uk/mi/impc/essential-genes-search/); pLOF annotations (gs://gnomad-public/papers/2019-tx-annotation/pre_computed/all.possible.snvs.tx_annotated.GTEx.v7.021520.tsv); mean methylation for CpG sites (gs://gcp-public-data--gnomad/resources/methylation); exome sequencing coverage (gs://gcp-public-data--gnomad/release/2.1/coverage/exomes/gnomad.exomes.coverage.summary.tsv.bgz); variant frequencies (gs://gcp-public-data--gnomad/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz); low mappability and segmental duplications (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.1/GRCh37/Union/GRCh37_alllowmapandsegdupregions.bed.gz); ClinVar variants (https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/); DepMap 22Q2 release (https://depmap.org/portal/download/all/); DDD annotations (https://www.deciphergenomics.org/ddd/ddgenes); HPO phenotype-to-gene annotations (http://purl.obolibrary.org/obo/hp/hpoa/phenotype_to_genes.txt); DNMs from developmental disorder patients (ref. 5); UK Biobank summary statistics (https://nealelab.github.io/UKBB_ldsc); RNA-seq from chimpanzee/human cortical models (ref. 28); GTEx v8 release (ref. 29).

Code availability

GeneBayes and code for estimating s_het are available at https://github.com/tkzeng/GeneBayes and in ref. 84. Analysis code is available in ref. 85. All analyses were performed using Python v3.8, Python v3.9 or R v4.2. To train models, we used a modified version of NGBoost (v0.3.12; refs. 16,86; https://github.com/tkzeng/ngboost), XGBoost (v2.0.2; ref. 87) and PyTorch (v1.12.1; ref. 88). Likelihoods were computed with fastDTWF (v0.0.3; ref. 15; https://github.com/jeffspence/fastDTWF). For hyperparameter tuning, we used shap-hypetune (v0.2; https://github.com/cerlymarco/shap-hypetune). For heritability enrichment analyses, we used ldsc (v1.0.1; ref. 89). For additional analyses, we used NumPy (v1.26.0; ref. 90), SciPy (v1.8.1; ref. 91), Pandas (v2.1.3; ref. 92), Scikit-learn (v1.3.0; ref. 93) and Statsmodels (v0.14.0; ref. 94).
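When reproducing the analyses, it may help to compare a local environment against the versions reported above. The following is a minimal sketch using only the Python standard library; the distribution names (for example "torch" for PyTorch) are assumptions about how each package is published, and the modified NGBoost fork, fastDTWF and ldsc are omitted because they are obtained from their repositories rather than checked here.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as reported in the Code availability statement.
# The keys are assumed distribution names, not taken from the paper.
REPORTED = {
    "xgboost": "2.0.2",
    "torch": "1.12.1",
    "numpy": "1.26.0",
    "scipy": "1.8.1",
    "pandas": "2.1.3",
    "scikit-learn": "1.3.0",
    "statsmodels": "0.14.0",
}


def compare_versions(reported):
    """Map each package to (reported, installed, matches); installed is None if absent."""
    report = {}
    for name, wanted in reported.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu113" when comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[name] = (wanted, installed, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, matches) in compare_versions(REPORTED).items():
        status = "ok" if matches else "differs"
        print(f"{name}: reported {wanted}, installed {installed} ({status})")
```

A mismatch does not by itself imply different results; it only flags where the local environment departs from the reported one.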

References

  1. Cassa, C. A. et al. Estimating the selective effects of heterozygous protein-truncating variants from human exome data. Nat. Genet. 49, 806–810 (2017).

  2. Weghorn, D. et al. Applicability of the mutation–selection balance model to population genetics of heterozygous protein-truncating variants in humans. Mol. Biol. Evol. 36, 1701–1710 (2019).

  3. Fuller, Z. L., Berg, J. J., Mostafavi, H., Sella, G. & Przeworski, M. Measuring intolerance to mutation in human genetics. Nat. Genet. 51, 772–776 (2019).

  4. Agarwal, I., Fuller, Z. L., Myers, S. R. & Przeworski, M. Relating pathogenic loss-of-function mutations in humans to their evolutionary fitness costs. eLife 12, e83172 (2023).

  5. Kaplanis, J. et al. Evidence for 28 genetic disorders discovered by combining healthcare and research data. Nature 586, 757–762 (2020).

  6. Fu, J. M. et al. Rare coding variation provides insight into the genetic architecture and phenotypic context of autism. Nat. Genet. 54, 1320–1331 (2022).

  7. Whiffin, N. et al. The effect of LRRK2 loss-of-function variants in humans. Nat. Med. 26, 869–877 (2020).

  8. Gazal, S. et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nat. Genet. 54, 827–836 (2022).

  9. Wang, X. & Goldstein, D. B. Enhancer domains predict gene pathogenicity and inform gene discovery in complex disease. Am. J. Hum. Genet. 106, 215–233 (2020).

  10. Mostafavi, H., Spence, J. P., Naqvi, S. & Pritchard, J. K. Systematic differences in discovery of genetic effects on gene expression and complex traits. Nat. Genet. 55, 1866–1875 (2023).

  11. Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).

  12. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

  13. Gillespie, J. H. Population Genetics: A Concise Guide (JHU Press, 2004).

  14. LaPolice, T. M. & Huang, Y. F. An unsupervised deep learning framework for predicting human essential genes from population and functional genomic data. BMC Bioinformatics 24, 347 (2023).

  15. Spence, J. P., Zeng, T., Mostafavi, H. & Pritchard, J. K. Scaling the discrete-time Wright–Fisher model to biobank-scale datasets. Genetics 225, iyad168 (2023).

  16. Duan, T. et al. Ngboost: natural gradient boosting for probabilistic prediction. In Proc. International Conference on Machine Learning (eds Daumé, H. III & Singh, A.) 2690–2700 (PMLR, 2020).

  17. Ewens, W. J. Mathematical Population Genetics: Theoretical Introduction Vol. 27 (Springer, 2004).

  18. Agarwal, I. & Przeworski, M. Mutation saturation for fitness effects at human CpG sites. eLife 10, e71513 (2021).

  19. Huang, Y. F. Unified inference of missense variant effects and gene constraints in the human genome. PLoS Genet. 16, e1008922 (2020).

  20. Da Costa, L., Leblanc, T. & Mohandas, N. Diamond–Blackfan anemia. Blood 136, 1262–1273 (2020).

  21. Berger, W. et al. Mutations in the candidate gene for Norrie disease. Hum. Mol. Genet. 1, 461–465 (1992).

  22. Howard, T. D. et al. Mutations in TWIST, a basic helix–loop–helix transcription factor, in Saethre–Chotzen syndrome. Nat. Genet. 15, 36–41 (1997).

  23. Ghouzzi, V. E. et al. Mutations of the TWIST gene in the Saethre–Chotzen syndrome. Nat. Genet. 15, 42–46 (1997).

  24. Meyers, R. M. et al. Computational correction of copy number effect improves specificity of CRISPR–Cas9 essentiality screens in cancer cells. Nat. Genet. 49, 1779–1784 (2017).

  25. Ghandi, M. et al. Next-generation characterization of the cancer cell line encyclopedia. Nature 569, 503–508 (2019).

  26. Wright, C. F. et al. Genomic diagnosis of rare pediatric disease in the United Kingdom and Ireland. N. Engl. J. Med. 388, 1559–1571 (2023).

  27. Köhler, S. et al. The Human Phenotype Ontology in 2021. Nucleic Acids Res. 49, D1207–D1217 (2021).

  28. Agoglia, R. M. et al. Primate cell fusion disentangles gene regulatory divergence in neurodevelopment. Nature 592, 421–427 (2021).

  29. GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).

  30. Basha, O. et al. Differential network analysis of multiple human tissue interactomes highlights tissue-selective processes and genetic disorder genes. Bioinformatics 36, 2821–2828 (2020).

  31. Gao, S. et al. Tracing the temporal-spatial transcriptome landscapes of the human fetal digestive tract using single-cell RNA-sequencing. Nat. Cell Biol. 20, 721–734 (2018).

  32. Charlesworth, B. et al. Evolution in Age-Structured Populations Vol. 2 (Cambridge University Press, 1994).

  33. Barrio-Hernandez, I. et al. Network expansion of genetic associations defines a pleiotropy map of human cell biology. Nat. Genet. 55, 389–398 (2023).

  34. Van Dam, S., Vosa, U., van der Graaf, A., Franke, L. & de Magalhaes, J. P. Gene co-expression analysis for functional classification and gene–disease predictions. Brief. Bioinform. 19, 575–592 (2018).

  35. Nasser, J. et al. Genome-wide enhancer maps link risk variants to disease genes. Nature 593, 238–243 (2021).

  36. Wieder, N. et al. Differences in 5′ untranslated regions highlight the importance of translational regulation of dosage sensitive genes. Genome Biol. 25, 111 (2024).

  37. Sella, G. & Barton, N. H. Thinking about the evolution of complex traits in the era of genome-wide association studies. Annu. Rev. Genomics Hum. Genet. 20, 461–493 (2019).

  38. Charlesworth, B. Effective population size and patterns of molecular evolution and variation. Nat. Rev. Genet. 10, 195–205 (2009).

  39. Simons, Y. B., Mostafavi, H., Smith, C. J., Pritchard, J. K. & Sella, G. Simple scaling laws control the genetic architectures of human complex traits. Preprint at bioRxiv https://doi.org/10.1101/2022.10.04.509926 (2022).

  40. Mathieson, I. & Terhorst, J. Direct detection of natural selection in Bronze Age Britain. Genome Res. 32, 2057–2067 (2022).

  41. Emdin, C. A. et al. Phenotypic characterization of genetically lowered human lipoprotein(a) levels. J. Am. Coll. Cardiol. 68, 2761–2772 (2016).

  42. Langsted, A., Nordestgaard, B. G. & Kamstrup, P. R. Low lipoprotein(a) levels and risk of disease in a large, contemporary, general population study. Eur. Heart J. 42, 1147–1156 (2021).

  43. Rausell, A. et al. Common homozygosity for predicted loss-of-function variants reveals both redundant and advantageous effects of dispensable human genes. Proc. Natl Acad. Sci. USA 117, 13626–13636 (2020).

  44. Reyes-Soffer, G. et al. Lipoprotein(a): a genetically determined, causal, and prevalent risk factor for atherosclerotic cardiovascular disease: a scientific statement from the American Heart Association. Arterioscler. Thromb. Vasc. Biol. 42, e48–e60 (2022).

  45. Millar, D. S. et al. Molecular genetic analysis of severe protein C deficiency. Hum. Genet. 106, 646–653 (2000).

  46. Romeo, G. et al. Hereditary thrombophilia: identification of nonsense and missense mutations in the protein C gene. Proc. Natl Acad. Sci. USA 84, 2829–2832 (1987).

  47. O’Connor, L. J. et al. Extreme polygenicity of complex traits is explained by negative selection. Am. J. Hum. Genet. 105, 456–476 (2019).

  48. Benton, M. L. et al. The influence of evolutionary history on human health and disease. Nat. Rev. Genet. 22, 269–283 (2021).

  49. Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. A method for calculating probabilities of fitness consequences for point mutations across the human genome. Nat. Genet. 47, 276–283 (2015).

  50. Huang, Y. F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624 (2017).

  51. Huang, Y. F. & Siepel, A. Estimation of allele-specific fitness effects across human protein-coding sequences and implications for disease. Genome Res. 29, 1310–1321 (2019).

  52. Chen, S. et al. A genomic mutational constraint map using variation in 76,156 human genomes. Nature 625, 92–100 (2024).

  53. Satterstrom, F. K. et al. Large-scale exome sequencing study implicates both developmental and functional changes in the neurobiology of autism. Cell 180, 568–584 (2020).

  54. Gardner, E. J. et al. Reduced reproductive success is associated with selective constraint on human genes. Nature 603, 858–863 (2022).

  55. He, X. et al. Integrated model of de novo and inherited genetic variants yields greater power to identify risk genes. PLoS Genet. 9, e1003671 (2013).

  56. Zhu, X. & Stephens, M. Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Ann. Appl. Stat. 11, 1561–1592 (2017).

  57. Boyeau, P. et al. An empirical Bayes method for differential expression analysis of single cells with deep generative models. Proc. Natl Acad. Sci. USA 120, e2209124120 (2023).

  58. Des Portes, V. et al. A novel CNS gene required for neuronal migration and involved in X-linked subcortical laminar heterotopia and lissencephaly syndrome. Cell 92, 51–61 (1998).

  59. Nascimento, R. M., Otto, P. A., de Brouwer, A. P. & Vianna-Morgante, A. M. UBE2A, which encodes a ubiquitin-conjugating enzyme, is mutated in a novel X-linked mental retardation syndrome. Am. J. Hum. Genet. 79, 549–555 (2006).

  60. Stevenson, R. E. et al. Renpenning syndrome comes into focus. Am. J. Med. Genet. A 134, 415–421 (2005).

  61. Esmailpour, T. et al. A splice donor mutation in NAA10 results in the dysregulation of the retinoic acid signalling pathway and causes Lenz microphthalmia syndrome. J. Med. Genet. 51, 185–196 (2014).

  62. Laumonnier, F. et al. Transcription factor SOX3 is involved in X-linked mental retardation with growth hormone deficiency. Am. J. Hum. Genet. 71, 1450–1455 (2002).

  63. Faundes, V. et al. Impaired eIF5A function causes a Mendelian disorder that is partially rescued in model systems by spermidine. Nat. Commun. 12, 833 (2021).

  64. Hatada, I. et al. An imprinted gene p57 KIP2 is mutated in Beckwith–Wiedemann syndrome. Nat. Genet. 14, 171–173 (1996).

  65. Cacciagli, P. et al. Mutations in BCAP31 cause a severe X-linked phenotype with deafness, dystonia, and central hypomyelination and disorganize the Golgi apparatus. Am. J. Hum. Genet. 93, 579–586 (2013).

  66. Fantes, J. et al. Mutations in SOX2 cause anophthalmia. Nat. Genet. 33, 462–463 (2003).

  67. Nichols, K. E. et al. Inactivating mutations in an SH2 domain-encoding gene in X-linked lymphoproliferative syndrome. Proc. Natl Acad. Sci. USA 95, 13765–13770 (1998).

  68. Garg, V. et al. GATA4 mutations cause human congenital heart defects and reveal an interaction with TBX5. Nature 424, 443–447 (2003).

  69. Bione, S. et al. A novel X-linked gene, G4.5. is responsible for Barth syndrome. Nat. Genet. 12, 385–389 (1996).

  70. Amberger, J. S., Bocchini, C. A., Schiettecatte, F., Scott, A. F. & Hamosh, A. OMIM.org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 43, D789–D798 (2015).

  71. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat. Genet. 46, 919–925 (2014).

  72. Cummings, B. B. et al. Transcript expression-aware annotation improves rare variant interpretation. Nature 581, 452–458 (2020).

  73. McLaren, W. et al. The Ensembl Variant Effect Predictor. Genome Biol. 17, 122 (2016).

  74. Frankish, A. et al. GENCODE: reference annotation for the human and mouse genomes in 2023. Nucleic Acids Res. 51, D942–D949 (2023).

  75. Olson, N. D. et al. PrecisionFDA Truth Challenge V2: calling variants from short and long reads in difficult-to-map regions. Cell Genom. 2, 100129 (2022).

  76. Blake, J. A. et al. Mouse Genome Database (MGD): knowledgebase for mouse–human comparative biology. Nucleic Acids Res. 49, D981–D987 (2021).

  77. Groza, T. et al. The International Mouse Phenotyping Consortium: comprehensive knockout phenotyping underpinning the study of human disease. Nucleic Acids Res. 51, D1038–D1045 (2023).

  78. Gudmundsson, S. et al. Variant interpretation using population databases: lessons from gnomAD. Hum. Mutat. 43, 1012–1030 (2022).

  79. Hart, T., Brown, K. R., Sircoulomb, F., Rottapel, R. & Moffat, J. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Syst. Biol. 10, 733 (2014).

  80. Blomen, V. A. et al. Gene essentiality and synthetic lethality in haploid human cells. Science 350, 1092–1096 (2015).

  81. Samocha, K. E. et al. A framework for the interpretation of de novo mutation in human disease. Nat. Genet. 46, 944–950 (2014).

  82. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

  83. Zeng, T., Spence, J. P., Mostafavi, H. & Pritchard, J. K. s_het estimates from GeneBayes and other supplementary datasets. Zenodo https://doi.org/10.5281/zenodo.10403680 (2023).

  84. Zeng, T. tkzeng/GeneBayes: GeneBayes v1.0. Zenodo https://doi.org/10.5281/zenodo.10939506 (2024).

  85. Zeng, T. Code and data to reproduce GeneBayes figures. Zenodo https://doi.org/10.5281/zenodo.11141460 (2024).

  86. Schuler, A. et al. tkzeng/ngboost: NGBoost for GeneBayes v1.0. Zenodo https://doi.org/10.5281/zenodo.10944711 (2024).

  87. Chen, T. & Guestrin, C. Xgboost: a scalable tree boosting system. In Proc. 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (Association for Computing Machinery, 2016).

  88. Paszke, A. et al. Pytorch: an imperative style, high-performance deep learning library. In Proc. Advances in Neural Information Processing Systems (eds Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F. & Fox, E. B.) 32 (Curran Associates Inc., 2019).

  89. Finucane, H. K. et al. Heritability enrichment of specifically expressed genes identifies disease-relevant tissues and cell types. Nat. Genet. 50, 621–629 (2018).

  90. Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).

  91. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

  92. McKinney, W. Data structures for statistical computing in Python. In Proc. 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 56–61 (SciPy, 2010).

  93. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  94. Seabold, S. & Perktold, J. Statsmodels: econometric and statistical modeling with Python. In Proc. 9th Python in Science Conference (eds van der Walt, S. & Millman, J.) 92–96 (SciPy, 2010).

Acknowledgements

We would like to thank I. Agarwal, M. Przeworski, J. Engreitz and members of the Pritchard Lab for valuable feedback and discussions. This work was supported by the National Institutes of Health (NIH; grants R01HG011432, R01HG008140 and U01HG009431 to J.K.P. and R01AG066490 to S. Montgomery). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the paper.

Author information

Contributions

J.P.S., H.M. and J.K.P. conceived and designed the study. T.Z. and J.P.S. performed all data analyses and developed the model. H.M. provided intellectual contributions to all aspects of the study. T.Z., J.P.S., H.M. and J.K.P. wrote the paper. J.K.P. supervised the study and acquired funding.

Corresponding authors

Correspondence to Tony Zeng, Jeffrey P. Spence or Jonathan K. Pritchard.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks Zilin Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of s_het estimates from a model with some features removed.

a, Scatterplot of posterior mean s_het estimated from a model trained without missense constraint or cross-species conservation features (y axis) against s_het estimated from the full model (x axis). b, Precision–recall curves comparing the performance of s_het estimated from the full model (blue) and from the model without missense/conservation features (orange) in classifying essential genes. c, Precision–recall curves comparing the performance of s_het estimated from the two models in classifying developmental disorder genes.

Extended Data Fig. 2 Comparison of s_het estimates from models trained on subsets of gnomAD.

a, Scatterplot of posterior mean s_het estimated from a model trained with non-NFE individuals (y axis) against s_het estimated from the full model (x axis). NFE, Non-Finnish European. This subset consists of 56,000 individuals or 45% of the total dataset. b, Scatterplot of posterior mean s_het estimated from a model trained with NFE individuals (y axis) against s_het estimated from the full model (x axis). This subset consists of 67,000 individuals or 55% of the total dataset.

Extended Data Fig. 3 s_het distributions for additional example genes.

Left: posterior distributions and rescaled likelihoods for genes with few expected LOFs (genes in the bottom quartile). Right: posterior distributions and rescaled likelihoods for genes with many expected LOFs (genes in the top quartile).

Extended Data Fig. 4 Additional validation analyses.

a, Precision–recall curves comparing the performance of s_het estimates from GeneBayes against LOEUF from gnomAD v4.0.0 (731k exomes) or LOEUF from gnomAD v2.1.1 (125k exomes) in classifying essential genes. b, Precision–recall curves comparing the performance of s_het estimates from GeneBayes against other constraint metrics in classifying nonessential genes. c, Precision–recall curves comparing the performance of s_het against other constraint metrics in classifying developmental disorder genes. d, Enrichment of de novo mutations in patients with developmental disorders, calculated as the observed number of mutations over the expected number under a null mutational model (n = 31,058 parent–offspring trios). We plot the enrichment of synonymous, missense, splice and nonsense variants in the 10% of genes considered most constrained by s_het (blue) and the enrichment of these variants in all other genes (gray), including (left) and excluding (right) known developmental disorder genes. Bars represent 95% confidence intervals, centered around the mean. e, Scatterplot of the enrichment of common variant heritability in the 10% of genes considered most constrained by s_het (y axis) or LOEUF (x axis), normalized by the enrichment of heritability in all genes. Each point represents one trait.

Extended Data Fig. 5 Performance of s_het and LOEUF for genes with differing numbers of expected LOFs.

Left: precision–recall curves comparing the performance of s_het against LOEUF in classifying essential genes for groups of genes binned by their expected number of LOFs. Right: precision–recall curves comparing the performance of s_het against LOEUF in classifying developmental disorder genes for binned genes.
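For readers less familiar with the precision–recall comparisons used in these panels, the following minimal sketch shows how such a curve can be computed from a gene-level score and a binary label (for example, essential versus not). The scores and labels are hypothetical, ties are not handled specially, and this is an illustration rather than the analysis code used in the study. For a metric where lower values indicate stronger constraint, as is the case for LOEUF, the scores would be negated before ranking.

```python
def precision_recall_points(scores, labels):
    """Rank genes from highest to lowest score and return (precision, recall) at each cutoff."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_positive = sum(labels)
    true_positive = 0
    points = []
    for k, (_, is_positive) in enumerate(ranked, start=1):
        true_positive += is_positive
        precision = true_positive / k             # fraction of the top k genes that are positives
        recall = true_positive / total_positive   # fraction of all positives found in the top k
        points.append((precision, recall))
    return points


# Hypothetical constraint scores and labels (1 = gene in the positive set).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
for precision, recall in precision_recall_points(toy_scores, toy_labels):
    print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Binning genes by their expected number of LOFs, as in this figure, amounts to calling the same function separately on each bin.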

Extended Data Fig. 6 Correlation of gene features with gene length.

a, Histogram of the Spearman ρ between gene features and coding sequence (CDS) length. b, Histogram of the Spearman ρ between gene features and CDS length for gene expression features, colored by category. c, Spearman ρ between gene features and CDS length for additional features of interest. d, Scatterplot of the Spearman ρ between gene features and posterior mean s_het (y axis) against the partial Spearman ρ (x axis) after controlling for the effect of gene (CDS) length.
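One common way to obtain a partial Spearman ρ that controls for a single covariate is to apply the first-order partial correlation formula to rank-transformed values. The sketch below takes that route; whether the study used this exact formulation is an assumption, and the input vectors are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_spearman(x, y, z):
    """Spearman rho between x and y after controlling for z (first-order partial correlation)."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical values: a gene feature, posterior mean s_het and CDS length.
feature = [1, 2, 3, 4, 5]
s_het = [2, 1, 4, 3, 5]
cds_length = [5, 3, 1, 4, 2]
print(f"rho = {spearman(feature, s_het):.3f}")
print(f"partial rho = {partial_spearman(feature, s_het, cds_length):.3f}")
```

If a feature correlates with s_het only because both track CDS length, the partial ρ shrinks toward zero while the unadjusted ρ does not.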

Supplementary information

Supplementary Information

Supplementary Note and Figs. 1–4.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Table 1: Posterior means and 95% credible intervals for GeneBayes estimates of s_het. Supplementary Table 2: LOEUF and s_het for ribosomal proteins associated with Diamond–Blackfan anemia. Supplementary Table 3: Terms used to define tissues for expression features. Supplementary Table 4: Filtering criteria for LOF variant curation. Supplementary Table 5: Sources for the LOF data. Supplementary Table 6: Parameters for fitting the gradient-boosted trees. Supplementary Table 7: Parameters for fitting the gradient-boosted trees for models trained on feature subsets. Supplementary Table 8a–k: Gene feature descriptions.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zeng, T., Spence, J.P., Mostafavi, H. et al. Bayesian estimation of gene constraint from an evolutionary model with gene features. Nat Genet 56, 1632–1643 (2024). https://doi.org/10.1038/s41588-024-01820-9

  • DOI: https://doi.org/10.1038/s41588-024-01820-9

  • Springer Nature America, Inc.
