Introduction

Sorghum [Sorghum bicolor (L.) Moench] is the fifth most important cereal grown worldwide in terms of both production and area planted (FAO 2004). Like millet and fonio, sorghum is genetically suited to hot and dry agroecologies where it is difficult to grow most food grains. Sorghum, therefore, is a pillar of food security in the semi-arid zones of Western and Central Africa. Three S. bicolor subspecies are recognized, including cultivated types (i.e., landraces) (ssp. bicolor), a wild complex that is widely distributed and ecologically diverse (ssp. verticilliflorum), and weedy types from hybridizations between domesticated and wild sorghums (ssp. drummondii) (de Wet 1978). Furthermore, five races (i.e., bicolor, caudatum, durra, guinea and kafir) and ten hybrid races that combine characteristics of two or more races have been described within ssp. bicolor (Harlan and de Wet 1972).

Because of its global socio-economic importance, there has been substantial interest in characterizing the levels of genetic diversity present within sorghum using both phenotypic and molecular markers (Dean et al. 1999; Dje et al. 2000; Ghebru et al. 2002; Grenier et al. 2000a, b; Menz et al. 2004; Uptmoor et al. 2003), and these analyses have provided a foundation of genetic data for making informed decisions regarding the management and utilization of genetic resources. More importantly, molecular diversity data can potentially bridge conservation and use when employed as a tool for mining germplasm collections for genomic regions associated with adaptive or agronomically-important traits (i.e., genes that have been important in adaptation to local environments or are associated with phenotypes selected by farmers or breeders).

Population genetics theory predicts that intense directional selection pressure should dramatically reduce allelic diversity, both at the target of selection and at linked neutral loci, due to genetic “hitchhiking” (Maynard Smith and Haigh 1974). Methods based on this prediction are now used routinely by human and Drosophila geneticists for identifying genomic regions that may have experienced selection or a “selective sweep” (Harr et al. 2002; Kayser et al. 2003; Payseur et al. 2002). Candidate regions are identified by surveying neutral markers throughout the genome for unusual patterns of allelic variation, such as skews in allele frequency distributions within populations or groups of individuals. Because crop species experienced strong selective pressures during domestication, they offer a unique opportunity to test the use of neutrally evolving markers for identifying genes controlling domestication-related traits (Vigouroux et al. 2002). In the work presented here, we used SSR loci to determine diversity and evaluate genetic relationships and population structure within and between cultivated and wild sorghum accessions. Concurrently, we applied statistical methods based on population genetics theory to identify regions of the S. bicolor genome that may have undergone human-driven selection during domestication.

Materials and methods

Plant material

Genome-wide diversity was assessed in 104 sorghum [Sorghum bicolor (L.) Moench] accessions, including both landraces (n=73) and wild sorghums (wilds) (n=31) [listed as electronic supplementary information (ESM) S1]. These accessions comprised all S. bicolor subspecies and racial types and represented a wide geographic sampling from the species’ center of diversity (Africa). Seeds of the cultivated material (landraces) were obtained either from the National Center for Genetic Resources Preservation (USDA/ARS, Ft. Collins, Colo.) or the Plant Genetic Resources Conservation Unit (USDA/ARS, Griffin, Ga.), and information on geographical origin and racial classification was gathered primarily from the System-wide Information Network for Genetic Resources database (http://singer.cgiar.org/Search/SINGER/search.htm). Wild accessions were provided by Mitchell R. Tuinstra (Agronomy Department, Kansas State University), and relevant information on these was obtained from either Dr. Tuinstra or the National Plant Germplasm System (http://www.ars-grin.gov/npgs/searchgrin.html).

Simple sequence repeat loci

A total of 98 simple sequence repeats (SSRs) were assayed, including 47 di-, 19 tri-, and 23 tetranucleotide or longer motifs, and nine imperfect/compound repeats. These loci were developed from restriction fragment length polymorphism (RFLP) probe sequences (Schloss et al. 2002), small-insert genomic libraries (Brown et al. 1996; Kong et al. 2000), and bacterial artificial chromosome (BAC) end sequences (Bhattramakki et al. 2000). Locus designation followed the naming convention of Bhattramakki et al. (2000). All SSR loci were mapped in an F2 population derived from an interspecific cross between S. bicolor inbred line BTx623 and S. propinquum (Chittenden et al. 1994). Loci that do not appear on the published genetic map (Bowers et al. 2003) were mapped in our laboratory (to be submitted elsewhere). A list of the SSRs assayed, including primer sequences, information on repeat motif and length, and genomic location, is provided as ESM (S2).

DNA extraction, PCR, and data collection

DNA was isolated from individual seedlings following a standard CTAB extraction protocol (Doyle and Doyle 1987). PCRs were performed as previously described (Matsuoka et al. 2002) using primers with fluorescent labels [6-FAM, HEX, TET, NED, TAMRA, VIC or PET; Applied Biosystems Inc. (ABI), Foster City, Calif.]. Samples containing 0.5 μl PCR product, 0.05 μl size standard (GeneScan-500 LIZ; ABI), 9.0 μl formamide, and 0.45 μl water were denatured (95°C/5 min) and assayed on a capillary DNA sequencer (Model 3730XL; ABI). Fragment sizes were determined based on migration relative to the internal lane size standard using genemapper ver. 3.0 software (ABI) and the “local Southern” sizing algorithm.

Analysis of genetic diversity

Number of alleles (A), allele frequencies (p_i), and expected heterozygosity (He) (also referred to as unbiased gene diversity, D) for each locus were calculated with powermarker ver. 3.0 (Liu and Muse 2002). Here, we use expected heterozygosity and gene diversity interchangeably. Allelic richness (R_s), a measure of the number of alleles corrected for sample size differences, was estimated using fstat (Goudet 2001). Relative loss of diversity in cultivated versus wild was calculated according to Vigouroux et al. (2002). Mann-Whitney (MW) tests were used to determine whether estimates were significantly different between groups. Genetic distance (Nei et al. 1983) matrices and cluster analyses based on neighbor-joining (Saitou and Nei 1987) were carried out with populations ver.1.2.28 (http://www.pge.cnrs-gif.fr/bioinfo/populations/index.php). Branch support was determined by bootstrap (1,000 replicates) (Efron 1979). A matrix containing SSR allele information for the germplasm tested is available from http://igd.tc.cornell.edu.
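As an illustration only, the per-locus summary statistics named above can be sketched in Python. The sketch assumes the widely used unbiased estimator He = n/(n − 1) × (1 − Σ p_i²), where n is the number of allele copies scored at the locus; the exact estimator implemented in powermarker may differ in detail. The function names and the example fragment sizes are hypothetical and are not taken from the study.

```python
from collections import Counter


def number_of_alleles(alleles):
    """Number of distinct alleles (A) observed at one locus."""
    return len(set(alleles))


def allele_frequencies(alleles):
    """Return {allele: frequency p_i} for a list of allele calls at one locus."""
    n = len(alleles)
    return {allele: count / n for allele, count in Counter(alleles).items()}


def gene_diversity(alleles):
    """Unbiased gene diversity, assumed here to be n/(n-1) * (1 - sum p_i^2)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two allele copies are required")
    sum_sq = sum(p * p for p in allele_frequencies(alleles).values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical SSR fragment sizes (bp) scored at a single locus.
example_locus = [180, 180, 182, 182, 182, 186, 190, 190]

print(number_of_alleles(example_locus))         # 4
print(round(gene_diversity(example_locus), 3))  # 0.821
```

Missing data would need to be dropped from the allele list before calling these functions, since every entry is counted as a scored allele copy.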

Statistics for detecting selection

To identify candidate loci linked to genomic regions that might have experienced selection, we used three statistics: the Ewens-Watterson test for allele excess (EW), lnRH, and Fst. Invariant loci were excluded from these analyses. EW was calculated using bottleneck ver.1.2 (Cornuet and Luikart 1996). Accessions were assigned to either cultivated or wild groups and evaluated separately under the infinite allele mutation model (IAM) of microsatellite evolution (see Results and discussion section). Probability values were obtained based on 1,000 replications. lnRH, a comparison of the relative amount of gene diversity between two populations or groups of individuals (see Kauer et al. 2003), was estimated as follows:

$$ \ln RH = \ln \left[ \frac{\left( \frac{1}{1 - He_{\text{Pop}1}} \right)^2 - 1}{\left( \frac{1}{1 - He_{\text{Pop}2}} \right)^2 - 1} \right] $$

where HePop1 is the gene diversity in population 1 (in our study landraces), and HePop2 is the gene diversity in population 2 (wilds). A Kolmogorov-Smirnov test was performed to ascertain whether the observed lnRH values followed a normal distribution. For detecting genomic regions under selection, we standardized lnRH by the mean and standard deviation across all loci (Kauer et al. 2003). The standardized distribution, therefore, had a mean of 0 and a standard deviation of 1, and 95% of loci were expected to have values between −1.96 and 1.96. Any locus with a value falling outside this interval was considered significant (P<0.05) (i.e., the locus might have experienced selection). Unbiased estimates of Fst (Weir and Cockerham 1984) between cultivated and wild sorghum accessions and among cultivated races were calculated from allele frequency data using powermarker ver. 3.0. Several methods, largely based on demography, have been developed to derive expected distributions of Fst values (Beaumont and Nichols 1996; Akey et al. 2002; Balloux and Goudet 2002). However, for most organisms, including sorghum, knowledge of the intrinsic factors that affect population demography is lacking. We, therefore, adopted a strictly empirical approach for determining Fst significance. That is, we assumed that outliers (loci with the highest Fst values) were most likely linked to genomic regions that had experienced selection (Kayser et al. 2003).
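The lnRH calculation and its standardization can be sketched as follows. This is a minimal illustration under stated assumptions: the sample standard deviation is used for standardization (the text does not say whether the sample or population form was applied), the function names are hypothetical, and the He values are invented for demonstration. Because the logarithm is undefined when either He is 0, the sketch rejects invariant loci, mirroring their exclusion from the analysis.

```python
import math
from statistics import mean, stdev


def ln_rh(he_pop1, he_pop2):
    """lnRH for one locus from the gene diversities of two groups."""
    for he in (he_pop1, he_pop2):
        if not 0.0 < he < 1.0:
            raise ValueError("He must lie strictly between 0 and 1")
    numerator = (1.0 / (1.0 - he_pop1)) ** 2 - 1.0
    denominator = (1.0 / (1.0 - he_pop2)) ** 2 - 1.0
    return math.log(numerator / denominator)


def standardize(values):
    """Center on the mean and scale by the sample standard deviation (assumed)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


def flag_outliers(z_scores, threshold=1.96):
    """True where a standardized value falls outside [-threshold, threshold]."""
    return [abs(z) > threshold for z in z_scores]


# Hypothetical per-locus gene diversities (landraces, wilds); not study data.
he_landraces = [0.51, 0.20, 0.62, 0.45, 0.70, 0.10]
he_wilds = [0.59, 0.66, 0.60, 0.50, 0.72, 0.75]

raw = [ln_rh(a, b) for a, b in zip(he_landraces, he_wilds)]
z = standardize(raw)
for value, score, flagged in zip(raw, z, flag_outliers(z)):
    print(f"lnRH={value:7.3f}  z={score:6.2f}  outlier={flagged}")
```

A negative raw lnRH indicates lower gene diversity in the first group (here landraces) than in the second.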

Results and discussion

Genome-wide diversity scan

The 98 SSRs evaluated in this study provided fair coverage of the genome, with between 8 and 16 loci represented in each of the ten sorghum linkage groups. We excluded 22 loci with ≥20% null alleles (primarily due to poor amplification in the wild sorghums). Our final dataset, therefore, was based on results from 76 SSRs.

As expected, the dinucleotide repeat-containing SSRs were, generally, more variable than those with longer motifs. Moreover, Xgap and Xtxp SSRs usually had more alleles than Xcup loci. This finding was probably due to differing SSR origins. The Xgap and Xtxp markers were isolated from either small-insert genomic libraries (Brown et al. 1996; Kong et al. 2000) or bacterial artificial chromosome (BAC)-end sequences (Bhattramakki et al. 2000). These loci, therefore, were more likely to include noncoding regions than the Xcup SSRs that were developed from low-copy RFLP probe sequences located primarily near or in genes (Schloss et al. 2002).

It is noteworthy that only 16 of 72 polymorphic loci (22%) followed either a stepwise (Kimura and Ohta 1978) or two-phase (Di Rienzo et al. 1994) mutation model of microsatellite evolution. Although the stepwise mutation model (SMM) has been widely accepted as the main mode through which SSRs evolve, our results indicated that most (approx. 78%) of the sorghum loci did not fit this model (loci that followed the SMM are indicated in ESM S3). This proportion is only slightly lower than what has been reported for maize, where 87% of SSR loci exhibited alleles that did not reflect the differences expected based on the variation in repeat size (Matsuoka et al. 2002). Because sorghum SSRs did not conform to the SMM, we assumed the infinite alleles model (IAM) (Wright 1949; Kimura and Crow 1964) in relevant statistical analyses. The IAM is not ideal for describing the mutational processes underlying SSR evolution in sorghum, and a greater understanding of this issue is needed for developing improved computational tools for population and evolutionary analyses in grasses.

Genetic diversity in wild and cultivated sorghums

Allelic richness (R_s) and gene diversity (He) values for sorghum landraces, cultivated races, and wild subspecies are shown in Table 1. In addition, a more detailed breakdown of these estimates by locus is provided in ESM S3. Four loci, Xcup18, -21, -65 and sugar transporter, were monomorphic, and a fifth locus, Xcup58, was invariant in landraces. As seen in previous studies of sorghum germplasm with diverse phenotypes and geographic origins (Grenier et al. 2000b), the majority of alleles detected in this study (64%) were “rare” [i.e., allele frequency (p_i)<0.05]. Because allele number is dependent on sample size and we assayed twice as many landraces as wild accessions, we estimated R_s, a measure that is independent of sample size. The results indicated that R_s was significantly higher in wild sorghums than in landraces [6.2 and 4.9, respectively; MW tests (P<0.05)] (Table 1). Although landraces exhibited 86% of the gene diversity observed in the wilds, they were significantly less diverse than the wilds (average He in landraces was 0.51 compared to 0.59 in wilds; MW tests P<0.05) (Table 1). As observed for other crops (Vigouroux et al. 2005), reduced variation in cultivated types typically results from population bottlenecks during domestication. We did observe four loci (Xcup11, -16, -19 and -53), however, that exhibited up to twofold more variation in landraces than in wilds (ESM S3). This result might reflect our heavy sampling of landraces, the occurrence of balancing selection in cultivated sorghums, and/or founder effects associated with domestication.
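The sample-size correction behind allelic richness can be illustrated with the standard rarefaction formula, R_s = Σ_i [1 − C(N − N_i, n) / C(N, n)], where N is the number of allele copies scored in a group, N_i is the count of allele i, and n is the common sample size to which all groups are rarefied. The sketch below assumes this formula; it is not a reproduction of the fstat implementation, and the function name and allele lists are hypothetical.

```python
from collections import Counter
from math import comb


def allelic_richness(alleles, n):
    """Expected number of distinct alleles in a random subsample of n copies."""
    total = len(alleles)
    if not 1 <= n <= total:
        raise ValueError("n must be between 1 and the number of allele copies")
    denominator = comb(total, n)
    # For each allele, 1 - P(allele is absent from the subsample).
    return sum(
        1.0 - comb(total - count, n) / denominator
        for count in Counter(alleles).values()
    )


# Hypothetical allele calls (fragment sizes in bp); not study data.
landraces = [180] * 14 + [182] * 6 + [184] * 2 + [186, 188]  # 24 copies
wilds = [178, 180, 180, 182, 184, 184, 186, 188, 190, 192, 192, 194]  # 12 copies

common_n = min(len(landraces), len(wilds))
print(round(allelic_richness(landraces, common_n), 2))
print(round(allelic_richness(wilds, common_n), 2))
```

Rarefying both groups to the size of the smaller one lets the larger landrace sample be compared with the wilds without the raw allele count being inflated by sampling effort.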

Table 1 Allelic richness (R_s) and gene diversity (He) in Sorghum bicolor landraces, cultivated races, and wild subspecies

For sorghum races bicolor, caudatum, durra, and guinea, there were no significant differences in diversity (Table 1). The kafir accessions, however, were significantly less diverse than the other races as measured both by allelic richness and gene diversity (MW tests P<0.001). Other studies based on SSRs (Dje et al. 2000) and, more recently, on DNA sequences (Hamblin et al. 2004) have shown similar results. This lower diversity could reflect the relative genetic isolation of race kafir in southern Africa. Because it is adapted to a more temperate environment, race kafir tends to be more photoperiod insensitive than the other sorghum races (Grenier et al. 2001). Both geographic isolation and temperate adaptation, therefore, may have contributed to limited opportunities for matings between kafir and other racial types.

Phenetic analysis

Neighbor-joining analysis indicated that wild sorghums belonging to ssp. verticilliflorum generally formed a coherent group that failed to cluster with the landraces (Fig. 1, group I). Moreover, wild accessions from the same geographic region tended to be genetically more similar to each other than to those from more distant locations, indicating the presence of population structure. S. bicolor ssp. drummondii, an annual weed thought to be a natural hybrid between ssp. bicolor and ssp. verticilliflorum (de Wet 1978), grouped mainly with a subset of the cultivated ssp. bicolor (namely, race bicolor) (Fig. 1, group II). Race bicolor resembles spontaneous weedy sorghums but lacks the ability to disperse seeds naturally (i.e., seeds do not shatter) (de Wet 1978). Because its long, clasping glumes, elongated seed, and open panicles are considered to be primitive characters, bicolor is thought to be the race most closely related to wild sorghums (Harlan and de Wet 1972). The remaining cultivated types were loosely partitioned into three groups: one consisting primarily of bicolor and durra accessions (Fig. 1, group II), another containing the kafir accessions and a mixture of several other racial types (group III), and a third group comprising mainly the guinea and caudatum accessions (group IV). We should emphasize that branch support for groupings of cultivated accessions was low (most bootstrap values ≤20), indicating a history of gene flow among the various races and/or recent common ancestry.

Fig. 1 Neighbor-joining phenogram depicting genetic relationships among Sorghum bicolor accessions. Wild accessions and the various cultivated races of sorghum are color-coded. Numbers along branches denote bootstrap support (shown only for values greater than 50)

Partitioning variation

Levels of variation can be attributed to within- and between-population components. Fst (Wright 1951) measures genetic differentiation and allows one to determine how much variation is due to differences between populations. There was a moderate degree of differentiation between the wilds and landraces (average Fst=0.13) and little differentiation among the cultivated races (average Fst=0.06). This relatively low Fst might reflect the diverse composition of the germplasm evaluated (i.e., accessions were deliberately chosen to include the racial and geographic diversity represented in sorghum), the occurrence of pollen or seed flow, and/or recent divergence.
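The partitioning idea can be made concrete with a simplified calculation, Fst = (Ht − Hs) / Ht, where Hs is the average within-group heterozygosity and Ht is the heterozygosity of the pooled allele frequencies. This is a sketch of the concept only: it weights the two groups equally and applies no sample-size correction, so it is not the unbiased Weir and Cockerham estimator used for the values reported here. Function names and frequencies are hypothetical.

```python
def heterozygosity(freqs):
    """1 - sum p_i^2 for a {allele: frequency} mapping (no sample-size correction)."""
    return 1.0 - sum(p * p for p in freqs.values())


def fst_two_groups(freqs_a, freqs_b):
    """Simplified Fst = (Ht - Hs) / Ht for two equally weighted groups."""
    alleles = set(freqs_a) | set(freqs_b)
    pooled = {
        allele: (freqs_a.get(allele, 0.0) + freqs_b.get(allele, 0.0)) / 2.0
        for allele in alleles
    }
    h_s = (heterozygosity(freqs_a) + heterozygosity(freqs_b)) / 2.0
    h_t = heterozygosity(pooled)
    if h_t == 0.0:
        raise ValueError("locus is invariant across both groups")
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies at one locus; not study data.
landrace_freqs = {"182": 0.8, "186": 0.2}
wild_freqs = {"182": 0.2, "186": 0.8}

print(round(fst_two_groups(landrace_freqs, wild_freqs), 2))  # 0.36
```

Values near 0 mean that almost all variation is found within groups, whereas values near 1 mean that the groups are fixed for different alleles.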

Using SSR diversity to detect the signals of selection

Results from statistical analyses for detecting patterns of variation that might be indicative of selection are shown in Fig. 2. Of the 72 polymorphic SSRs analyzed, 11 loci showed significant deviation from the population average or from neutrality in at least one statistical test (Xcup14, -40, -01, -33, -22, -60, -15, -07, -09, Xgap256 and -10) (Fig. 2). Most loci (8/11) were identified by the EW test for allele excess (Fig. 2). This statistic, equivalent to Watterson’s homozygosity test (Watterson 1978), compares gene diversity (He) to that expected, given the number of alleles observed at a locus. The premise underlying EW is that under neutrality, allele number and frequency distribution result from equilibrium between mutation and genetic drift. Following a selective sweep, however, a surplus of low-frequency alleles is generated as new mutations occur in the selected region. Therefore, the power of EW increases over time until equilibrium is re-established in the population.
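The logic of comparing observed He with its expectation given the number of alleles can be sketched by simulation. The code below is an illustrative assumption, not the procedure implemented in bottleneck: it draws neutral samples under the infinite alleles model with a Hoppe urn, keeps only those with the same number of alleles as observed, and reports the fraction whose He is at or below the observed value. Because the allele configuration conditional on allele number does not depend on the mutation parameter under this model, theta is chosen only to make the rejection step efficient. All names and the example counts are hypothetical.

```python
import random


def expected_num_alleles(theta, n):
    """Expected number of distinct alleles among n gene copies under the IAM."""
    return sum(theta / (theta + i) for i in range(n))


def solve_theta(k, n, lo=1e-6, hi=1e6, iterations=100):
    """Bisection (on a log scale) for theta giving an expected allele number of k."""
    for _ in range(iterations):
        mid = (lo * hi) ** 0.5
        if expected_num_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5


def hoppe_urn_sample(theta, n, rng):
    """Allele counts for one neutral sample of n copies under the IAM."""
    counts = []
    for i in range(n):
        if rng.random() < theta / (theta + i):
            counts.append(1)  # a new allele enters the sample
        else:
            pick = rng.randrange(i)  # copy an existing allele, weighted by count
            running = 0
            for j, c in enumerate(counts):
                running += c
                if pick < running:
                    counts[j] += 1
                    break
    return counts


def unbiased_he(counts):
    """Gene diversity n/(n-1) * (1 - sum p_i^2) from allele counts."""
    n = sum(counts)
    return n / (n - 1) * (1.0 - sum((c / n) ** 2 for c in counts))


def ew_deficiency_test(counts, replicates=1000, seed=0, max_draws=200000):
    """Return (observed He, mean simulated He, P(simulated He <= observed))."""
    n, k = sum(counts), len(counts)
    if not 1 < k < n:
        raise ValueError("need more than one allele and fewer alleles than copies")
    rng = random.Random(seed)
    theta = solve_theta(k, n)
    observed = unbiased_he(counts)
    simulated, draws = [], 0
    while len(simulated) < replicates and draws < max_draws:
        draws += 1
        sample = hoppe_urn_sample(theta, n, rng)
        if len(sample) == k:  # condition on the observed number of alleles
            simulated.append(unbiased_he(sample))
    if not simulated:
        raise RuntimeError("no simulated sample matched the observed allele number")
    p_deficit = sum(h <= observed for h in simulated) / len(simulated)
    return observed, sum(simulated) / len(simulated), p_deficit


# Hypothetical counts: one common allele plus ten singletons (60 copies).
example_counts = [50] + [1] * 10
print(ew_deficiency_test(example_counts, replicates=200, seed=1))
```

A small probability indicates a deficiency of heterozygosity relative to the neutral expectation for that allele number, which is the pattern described above for loci linked to a swept region.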

Fig. 2 Plot of Fst, lnRH, and EW (deficiency of heterozygosity) values for 72 SSR loci by linkage group. Right-hand side y-axis scales, Fst; left-hand side lnRH and EW. Loci designations are on the x-axis. Fst values are connected by the dotted lines, lnRH by solid lines, and EW by dashed lines. Stars indicate significant values (P<0.05), and corresponding loci are denoted by arrows. Map positions of loci (centiMorgans) are presented as ESM S2. Loci not mapped in the S. bicolor × S. propinquum population appear at the right end of each linkage group (A–J) and are not connected to mapped loci by lines

Seven of the eight loci identified by the EW test exhibited a deficiency of heterozygosity (i.e., EW values were negative) (Fig. 2), which is consistent with the expectation under a directional or positive selection scenario (i.e., trait selection by farmers or breeders). One locus (Xcup40), however, showed an excess of heterozygosity (EW value was positive), and the allele frequency distribution at Xcup40 (data not shown) was consistent with diversifying selection in cultivated sorghum (i.e., selection stemming from adaptation to environment, pests, or disease). Since the distribution of allele number versus gene diversity over all loci was skewed towards a reduction in allele number compared to the neutral steady-state distribution (data not shown), our significance levels for the EW test were probably conservative.

While EW compares observed and expected frequency distributions at each locus within groups (e.g., landraces), lnRH compares variability between groups (landraces and wilds). Because lnRH is a ratio of variances in gene diversity, this test has an identical expectation for all loci independent of the SSR mutation rate and the effective population size. A recent population-specific selective sweep, however, would be expected to change that ratio at loci linked to the selected site but not at genomic regions evolving neutrally (for a detailed discussion, see Schlotterer 2002; Kauer et al. 2003). The power of lnRH is highest immediately following the selective event (when the number of alleles in the population is dramatically reduced) but declines over time as new alleles are generated by mutation and population diversity increases (when the power of EW is highest). For most SSRs (53/72), non-standardized lnRH values were negative, indicating a lower variability in landraces than in wilds. Following standardization, lnRH statistics for two loci, Xgap256 and Xcup09, were marginally significant (P<0.05) (Fig. 2).

Besides providing insights into how genetic variation is partitioned (see above), Fst can also be used for identifying loci under selection. Unlike lnRH, the power of Fst does not decline over time as new variants arise as long as different alleles predominate in the groups being compared. Fst values for three loci (Xcup15, Xcup60, and Xgap256) were approximately fourfold or more higher (>0.50) than the genome-wide average (0.13) (Fig. 2) and were considered to be most likely linked to genomic regions under selection.

Approximately 15% of the SSRs evaluated had patterns of variation consistent with selection according to at least one of the tests. Only 1 of the 11 loci identified, Xgap256, was significant by all methods. We should emphasize that deviation from neutral expectation is only the first step towards the identification of target genes. The incorporation of both DNA sequence diversity and map location of candidate loci will be invaluable for associating candidates with quantitative trait loci (QTLs) for traits that were/are under selection. Seven loci identified in this study map in or near regions associated with domestication-related QTLs. For example, Xcup33 maps to the shattering region on linkage group (LG) C (Paterson et al. 1995a, b), Xcup14 is located near a QTL for leaf width and seed weight on LG A (Paterson et al. 1995a), and Xcup01, -22, -60, Xgap10, and -256 map to major QTLs for rhizomatousness on LGs C and F (Hu et al. 2003; Paterson et al. 1995b). Results from independent diversity studies can also lend support to the selection hypothesis. For example, the low variation we observed for cultivated sorghum at locus Xgap256 was consistent with findings by Grenier et al. (2000b), where this locus had both the lowest number of alleles (7 compared to an average of 19.3) and the lowest gene diversity (0.27 compared to an average of 0.80) of all loci evaluated in approximately 700 sorghum landraces from the world collection. We should also note that although significance at some loci was marginal, we have now obtained DNA sequence data from loci linked to two of the candidate SSRs, and these results were also consistent with a selective sweep (i.e., variation deviated significantly from neutral expectation).

Possibilities and caveats

SSRs have been extensively used in sorghum for germplasm characterization (Grenier et al. 2000b), evaluation of breeding material (Smith et al. 2000; Menz et al. 2004) and genetic diversity analyses (Dje et al. 2000; Ghebru et al. 2002; Uptmoor et al. 2003). Here, we applied population genetics principles to SSR diversity data in an attempt to identify sorghum genomic regions that might have experienced recent selection. One of the advantages of using this strategy is that no a priori information is required, either of traits or function, for identifying regions of the genome underlying functional or adaptive diversity. Indeed, about one-third of the candidate loci identified in this study had no association with previously characterized QTLs. The effectiveness of genome-wide scans, however, relies heavily on both knowledge of population structure and appropriate sampling. In our sorghum panel, we found only weak evidence of population structure between cultivated and wild sorghums as well as among the cultivated sorghum races. Therefore, loci showing significant deviation from mutation-drift equilibrium (as measured by the EW test) or high levels of genetic differentiation (Fst) are more likely to be associated with selection events than to be false signals produced by population structure.

SSRs are usually not the targets of selection, and the probability that variation will be reduced at a linked site during a selective sweep depends not only on the strength of selection, but also on SSR mutation mode and rate, and the rate of recombination between the selected site and the marker assayed. Because higher mutation rates translate into higher diversity, depending on when selection took place, SSR loci may quickly recover lost diversity and, therefore, give non-significant deviation from neutrality (Vigouroux et al. 2005). In a species like maize where linkage disequilibrium (LD) decays rapidly (r² falls to 0.24 by 200 bp) (Remington et al. 2001; Tenaillon et al. 2001), a signal detected by an SSR would indicate that the target of selection is nearby (possibly within a few hundred nucleotides). However, rapid LD decay also means that to detect selection a candidate region would need to be saturated with markers. Because LD decays much more slowly than in maize, S. bicolor might be more amenable to population genetics-based assays for selection and association analyses. Recent studies have shown that in sorghum, r² declines to 0.2, on average, by 15 kb (Hamblin et al. 2004; M.T. Hamblin, unpublished data). Compared to maize, therefore, a smaller number of markers would be required both to scan the sorghum genome for selection signals and to detect associations between genotype and phenotype.

Conclusions

While the area of functional genomics is still in its infancy, we anticipate that the use of population diversity-based approaches will allow the mining of germplasm collections and extant SSR diversity data for identifying interesting genomic regions. Certainly, molecular data from species with little population structure and intermediate levels of LD would be well suited for re-analysis. We must stress that while these approaches may be advantageous for identifying genomic regions that differ from the average observed in the genome, some of these departures may also result from non-equilibrium population history. Functional studies (e.g., mutant screening, genetic complementation, expression analysis, biochemical localization and characterization) are still required to establish causation.