Extensive variation within the pan-genome of cultivated and wild sorghum

Sorghum is a drought-tolerant staple crop for half a billion people in Africa and Asia, an important source of animal feed throughout the world and a biofuel feedstock of growing importance. Cultivated sorghum and its inter-fertile wild relatives constitute the primary gene pool for sorghum. Understanding and characterizing the diversity within this valuable resource is fundamental for its effective utilization in crop improvement. Here, we report analysis of a sorghum pan-genome to explore genetic diversity within the sorghum primary gene pool. We assembled 13 genomes representing cultivated sorghum and its wild relatives, and integrated them with 3 other published genomes to generate a pan-genome of 44,079 gene families with 222.6 Mb of new sequence identified. The pan-genome displays substantial gene-content variation, with 64% of gene families showing presence/absence variation among genomes. Comparisons between core genes and dispensable genes suggest that dispensable genes are important for sorghum adaptation. Extensive genetic variation was uncovered within the pan-genome, and the distribution of these variations was influenced by variation of recombination rate and transposable element content across the genome. We identified presence/absence variants that were under selection during sorghum domestication and improvement, and demonstrated that such variation had important phenotypic outcomes that could contribute to crop improvement. The constructed sorghum pan-genome represents an important resource for sorghum improvement and gene discovery.
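The gene-content figures quoted above can be illustrated with a minimal sketch, which is not the authors' pipeline. It assumes a gene family counts as core when it is present in every genome and as dispensable otherwise. The toy presence/absence matrix and the function names are hypothetical. The final check reuses the core, shell and cloud counts reported in Extended Data Fig. 2, on the assumption that shell and cloud families together are the ones showing presence/absence variation.

```python
from typing import Dict, List


def classify_families(pav: Dict[str, List[bool]]) -> Dict[str, str]:
    """Label a family 'core' if present in every genome, else 'dispensable'."""
    return {
        family: ("core" if all(calls) else "dispensable")
        for family, calls in pav.items()
    }


def dispensable_fraction(labels: Dict[str, str]) -> float:
    """Fraction of families showing presence/absence variation."""
    if not labels:
        return 0.0
    n_dispensable = sum(1 for label in labels.values() if label == "dispensable")
    return n_dispensable / len(labels)


# Hypothetical toy matrix: one row per gene family, one column per genome.
toy_pav = {
    "fam1": [True, True, True, True],
    "fam2": [True, False, True, True],
    "fam3": [False, False, True, False],
    "fam4": [True, True, True, True],
    "fam5": [True, True, False, True],
}
labels = classify_families(toy_pav)
toy_fraction = dispensable_fraction(labels)  # 3 of 5 toy families vary

# Counts reported in Extended Data Fig. 2 (core, shell, cloud).
core, shell, cloud = 15_867, 28_026, 186
total_families = core + shell + cloud
reported_fraction = (shell + cloud) / total_families

print(labels)
print(f"toy dispensable fraction: {toy_fraction:.2f}")
print(f"families: {total_families}, shell + cloud share: {reported_fraction:.2%}")
```

Under these assumptions the three counts sum to the 44,079 gene families of the pan-genome, and the shell plus cloud share rounds to the 64% stated in the abstract.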

Fig. 1: Sorghum pan-genome.
Fig. 2: Phylogenetic relationships and distribution of genetic variation across the sorghum genome.
Fig. 3: Presence/absence variation underlying grain-colour variation in sorghum.

Data availability

The datasets generated and/or analysed during the current study have been deposited in the China National GeneBank database under project CNP0001440 and in the Genome Sequence Archive (ref. 64) of the National Genomics Data Center (ref. 65), Beijing Institute of Genomics (China National Center for Bioinformation), Chinese Academy of Sciences, under accession number CRA003806. Both deposits are publicly accessible.

Code availability

The code used in this manuscript is available at the GitHub repository


Acknowledgements

This work was undertaken as part of the initiative ‘Adapting Agriculture to Climate Change: Collecting, Protecting and Preparing Crop Wild Relatives’, which is supported by the Government of Norway. The project is managed by the Global Crop Diversity Trust with the Millennium Seed Bank of the Royal Botanic Gardens, Kew and implemented in partnership with national and international gene banks and plant breeding institutes around the world. For further information, see the project website. This work was also supported by funding from the Australian Research Council through the Centre of Excellence for Translational Photosynthesis (CE1401000015), the National Key R&D Program of China (2019YFD1002701 and 2018YFD1000701) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA26050101).

Author contributions

E.M., H.J., D.J. and Y.T. designed this study and coordinated the project. Y.T., X.Z., A.C. and A.H. selected samples and conducted field work. T.S., Y.L. and X.W. collected samples. J.X. and F.T. carried out the genome assembly and annotation. Y.T., H.L. and F.T. performed pan-genome analysis. Y.T. and J.X. conducted variation detection, phylogenetic analysis and selection analysis. Y.T., X.Z. and A.H. carried out GWAS analysis. Y.T. wrote the manuscript; E.M. and D.J. edited it. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to David Jordan, Haichun Jing or Emma Mace.

Competing interests

The authors declare no competing interests.

Peer review information

Nature Plants thanks Zhangjun Fei and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 A snapshot of the graph-based sorghum pan-genome.

This pan-genome graph shows variation within an LGS1 region on chromosome 5. The graph was visualised using Bandage. Yellow highlights the sequence segment containing LGS1, grey indicates sequence segments from the reference genome BTx623, and green indicates sequence segments from genomes other than BTx623.

Extended Data Fig. 2 Comparison between core, shell and cloud genes.

A, CDS length. Core genes are significantly longer than shell and cloud genes (p-value < 2.2e-16, Wilcoxon signed rank, two-sided). B, Number of exons. Core genes have significantly more exons than shell and cloud genes (p-value < 2.2e-16, Wilcoxon signed rank, two-sided). Sample sizes: core, n = 15,867; shell, n = 28,026; cloud, n = 186. In the box plots, centre lines represent the median, the bottom and top of the boxes represent the first and third quartiles, and whiskers show the data that lie within 1.5 times the interquartile range of the first and third quartiles.
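The box-plot conventions described in this caption (median, first and third quartiles, whiskers limited to 1.5 times the interquartile range) can be sketched as follows. This is an illustrative sketch only: the CDS lengths are hypothetical values, the function name is invented for the example, and the quartile interpolation method (`inclusive`) is an assumption, since plotting libraries differ on this detail and the caption does not say which was used.

```python
import statistics
from typing import Dict, List


def box_plot_summary(values: List[float]) -> Dict[str, float]:
    """Compute the statistics drawn in a standard box plot."""
    # Quartile interpolation is an assumption; other methods give slightly
    # different box edges on small samples.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers extend to the most extreme data points inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Hypothetical CDS lengths (bp) with one extreme value.
cds_lengths = [300, 450, 600, 750, 900, 1200, 1500, 6000]
summary = box_plot_summary(cds_lengths)
print(summary)
```

In this toy sample the value 6000 lies beyond the upper fence, so the upper whisker stops at 1500 and the extreme value would be drawn as an outlier.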

Extended Data Fig. 3 Comparison of expression level between core, shell and cloud genes.

Expression levels (FPKM, fragments per kilobase of transcript per million mapped reads) of core, shell and cloud genes were measured in six samples. Core genes consistently showed higher expression than shell and cloud genes across the six genomes (p-value < 2.2e-16, Wilcoxon signed rank, two-sided). Sample sizes in the six genomes: 353 (core, n = 22,522; shell, n = 13,873; cloud, n = 78); IS3614-3 (core, n = 20,786; shell, n = 12,648; cloud, n = 12); IS8525 (core, n = 21,223; shell, n = 13,365; cloud, n = 12); IS929 (core, n = 21,702; shell, n = 12,860; cloud, n = 35); Ji2731 (core, n = 22,251; shell, n = 13,874; cloud, n = 57); PI525695 (core, n = 20,445; shell, n = 11,372; cloud, n = 35). In the box plots, centre lines represent the median, the bottom and top of the boxes represent the first and third quartiles, and whiskers show the data that lie within 1.5 times the interquartile range of the first and third quartiles.

Supplementary information

Supplementary Information

Supplementary notes, Figs. 1–22 and Tables 1–16, 18–20 and 24–27.

Reporting Summary

Supplementary Tables

Supplementary Tables 17, 21, 22 and 23.

Tao, Y., Luo, H., Xu, J. et al. Extensive variation within the pan-genome of cultivated and wild sorghum. Nat. Plants 7, 766–773 (2021).

