Introduction

Plants demonstrate a wide range of diversity in their morphology, adaptation and ecology, the product of many years of evolutionary divergence and diversification. Characterizing the extent and partitioning of this diversity across populations and/or distribution ranges for target taxa, coupled with an understanding of the mechanisms through which it arises and is maintained, has been of interest to plant genetic resources conservation and crop improvement programs alike (Frankel and Hawkes 1975; Frankel et al. 1995). In particular, assessment of the level and patterns of genetic diversity is useful for several purposes, including (i) determining the level of genetic variability for facilitating identification of subsets of core or mini-core collections with possible utility for specific breeding purposes (Mohammadi and Prasanna 2003); (ii) estimating any possible loss of genetic diversity during conservation programs; (iii) assisting the selection of diverse parental combinations to create segregating progenies with maximum genetic variability for further selection (Barrett and Kidwell 1998); and (iv) estimating the relative strengths of the evolutionary forces (mutation, natural selection, migration or gene flow, genetic drift) and population properties such as population size, breeding system, population structure, and dispersal mechanisms. Moreover, information on the genetic relationships between crops and their wild/weedy relatives is useful in estimating the extent and dynamics of crop-wild gene flow. Although gene flow between crops and their relatives has been taking place since the dawn of agriculture (Ellstrand et al. 1999; Haygood et al. 2003), there are fears that transgenes will escape from genetically modified (GM) crops to sexually compatible wild and weedy relatives via gene flow. Depending on the relative fitness conferred by such transgenic traits in recipient wild/weedy relatives, potential harmful consequences include increased invasiveness, weediness, genetic erosion and in extreme cases extinction of populations (Ellstrand 1992; Snow and Moran-Palma 1997; Bhatia and Mitra 2003; Conner et al. 2003; Haygood et al. 2003; Cleveland and Soleri 2005; Thies and Devare 2007; Auer 2008; Chandler and Dunwell 2008).

Sorghum (Sorghum bicolor (L.) Moench) is one of the world’s most important cereals. Sorghum was domesticated and diversified in Africa before moving to other parts of the world (Dogget 1988) and continues to play an important food security role in Africa. In 2007, for example, over 40 million ha of land were dedicated to sorghum production globally, of which 60% was in Africa (FAO 2008). Besides its use as a cereal crop, sorghum is extensively used for fodder, construction material, brooms, syrup and beer. In Kenya, sorghum is grown in all but one administrative province. It is an important food crop and dietary staple in the country’s arid and semi-arid lands, which account for over 80% of the total land area. A wide diversity of sorghum landraces is cultivated under equally diverse agro-climatic conditions and practices by subsistence farmers in different communities of Kenya (Mutegi et al. 2010). Moreover, morphologically and geographically diverse wild relatives of domesticated sorghum in the primary and tertiary gene pools are known to occur in the country (Clayton and Renvoize 1982; Mutegi et al. 2010). Wild relatives of sorghum are recognized as broad genetic base reservoirs and potential sources for resistance and adaptation traits in breeding programs (Gurney et al. 2002; Kamala et al. 2002; Reed et al. 2002; Rao Kameswara et al. 2003; Rich et al. 2004) and deserve special conservation attention. Unfortunately, we are not aware of any documented studies on the extent and partitioning of diversity in cultivated sorghum or its wild relatives at national scale in Kenya. Such information is important for designing appropriate conservation and utilization programs.

Cultivated sorghum (S. bicolor ssp. bicolor) is taxonomically conspecific with its proposed wild progenitor (S. bicolor ssp. verticilliflorum) and the stabilized weedy derivative of their hybridisation (S. bicolor ssp. drummondii) (Harlan and De Wet 1972; Dogget 1988). All subspecies of S. bicolor are inter-fertile under sympatric conditions, leading to a continuum of wild-weedy-domesticate complex forms that have been documented to occur in many sorghum growing parts of Africa (Dogget and Majisu 1968; Dogget 1988; Tesso et al. 2008; Mutegi et al. 2010). Moreover, cultivated and wild sorghum occupy diverse ecological landscapes and have over the years been subjected to diverse biotic and abiotic selection pressures across their geographic range. Wide genetic diversity is therefore expected in the landraces of cultivated sorghum and their wild-weedy relatives in Africa.

Levels and patterns of diversity within and between cultivated and wild sorghum gene pools have been reported before (e.g. Morden et al. 1990; Aldrich and Doebley 1992; Aldrich et al. 1992; Cui et al. 1995; Deu et al. 1995; Casa et al. 2005). These previous studies have shown that (i) there is low to moderate genetic differentiation among cultivated and wild sorghum gene pools, (ii) portions of the wild gene pool most genetically similar to cultivars originated in central-northeast Africa and (iii) genetic diversity is greater in wild-weedy than in cultivated sorghums. Little is known about the extent, patterns and direction of introgression between cultivated and wild sorghum. Moreover, most of these previous results were obtained from ex situ collections from genebanks and need validation using exhaustive samples obtained in situ at different spatial scales. This is especially needed for Africa, the center of origin and primary diversification of sorghum. Attempts have been made to use in situ collected samples but such studies have been limited to separate investigations of genetic diversity and structure in either cultivated sorghum (Djè et al. 1998, 1999; Ayana et al. 2000b, 2001; Ghebru et al. 2002; Barnaud et al. 2007; Deu et al. 2008; Sagnard et al. 2008; Barro-Kondombo et al. 2010) or its closest wild relatives (Ayana et al. 2000a). Our study applied microsatellite markers to analyze cultivated sorghum and its closest wild relatives sampled from different growing regions in Kenya, in order to elucidate patterns of diversity within and among the two congeners, and to shed more light on their genetic and evolutionary relationships. We sought to address three questions: (i) What is the extent of diversity within cultivated and wild gene pools of sorghum in Kenya?; (ii) Are cultivated and wild sorghum gene pools genetically differentiated?; and (iii) How is genetic diversity in cultivated and wild sorghum gene pools structured?

Materials and methods

Plant materials

Cultivated and wild sorghum seed samples were collected in farmer fields in the crop’s four main growing areas of Kenya: (i) Turkana, in the northern parts of the Rift Valley bordering Sudan and Ethiopia; (ii) western/Nyanza region covering the Kisii Highlands and lowlands around Lake Victoria; (iii) eastern/central region covering the Highlands east of Mt. Kenya and the drier lowlands of Meru, Kitui and Machakos administrative districts; and (iv) coastal areas of the country including the Taita Hills and adjacent areas as well as the farming systems in the Indian Ocean hinterlands (Fig. 1). Three collection trips were undertaken between June 2006 and July 2007 in order to capture differences in cropping seasons amongst the four growing regions. Passport data associated with each collected sample and farmer knowledge of cultivated varieties as well as of wild and weedy sorghum distribution, ecology and dynamics were recorded. Geographic coordinates and elevation data of each collection point were recorded using a handheld global positioning system (GPS) (eTrex Summit HC, Garmin). In addition, 93 georeferenced samples of cultivated sorghum were obtained from the National Genebank of Kenya (see Fig. 1) to cover the northeastern region bordering Ethiopia (34 samples), central Rift Valley (55 samples) and parts of Kisii highlands in western/Nyanza (4 samples) that were not covered in the collection trips due to logistical constraints. In total, 439 samples comprising 110 wild and 329 cultivated sorghum varieties were assembled. The highest number of samples for cultivated sorghum originated from eastern/central (90), followed by western/Nyanza (72), Rift Valley (55), Turkana (42), Coast (36) and northeastern (34) regions. For wild sorghum, the highest number of samples originated from eastern/central (41), followed by Coast (39), Turkana (17) and western/Nyanza (13) regions. Overall, samples were representative of Kenya’s sorghum growing agro-climatic and ethno-linguistic diversity. A copy of each sample collected in this study was deposited at the National Genebank of Kenya for long-term conservation and future utilization. A detailed description of the collection, including the ecogeographical variation across the country, is published elsewhere (Mutegi et al. 2010).

Fig. 1

Agro-climatic map of Kenya with origin and distribution of cultivated and wild sorghum used in the study shown. Six sorghum growing regions were covered: Coast (C), Eastern/central (E), Northeastern (N), Rift valley (R), Turkana (T) and Western/Nyanza (W)

DNA isolation and genotyping

Seeds from each sample of cultivated and wild sorghum were grown at room temperature for two weeks in potted plastic trays in the laboratory. To break seed dormancy in wild sorghum, glumes were removed using a scalpel blade and seeds soaked overnight in water at 35°C before planting. Subsequently, only one seedling per sample was used for DNA extraction. Total genomic DNA was extracted from freshly harvested leaves (4–6 cm) using a modified version of the high throughput mini-prep 3% cetyl trimethyl ammonium bromide (CTAB) method described by Mace et al. (2003). The quality and quantity of the isolated DNA were determined by comparing the fluorescence of aliquots of DNA samples with a known concentration of λ-DNA after running them on a 0.8% agarose gel that contained 0.3 μg/ml ethidium bromide solution.

The strategy of sampling one individual per sample was guided by the need to maximize the number of landrace and wild/weedy relatives to be genotyped on country scale. This approach has proved sufficient to detect large-scale inter-sample evolutionary trends in crops and/or their wild relatives (e.g. Matsuoka et al. 2002; Fukunaga et al. 2005; Mariac et al. 2006; Deu et al. 2008; Barro-Kondombo et al. 2010) provided that the number of loci is sufficient.

PCR amplification and genotyping

Thirty SSR markers (Table 1) with good genome coverage were analyzed using the M13-tailed primer method (Schuelke 2000) to label amplicons for visualization on an ABI 3730 (Applied Biosystems) capillary sequencer. Forward primers were 5′-tailed with a 19-base pair (bp) M13 universal sequence, 5′-CACGAGCTTGTAAAACGACXXXXXXXXXXXXX-3′, where the X’s denote microsatellite-specific primer sequences (see Table 1 for details).

Table 1 List of the 24 microsatellite loci used in the analysis

Polymerase chain reaction (PCR) was performed in 10 μl reaction volumes, containing 2.5 ng of template DNA, 0.2 units of Amplitaq Gold Taq DNA polymerase (Applied Biosystems), 1X PCR buffer (10 mM Tris–HCl pH 8.3, 50 mM KCl, 1.5 mM MgSO4), 0.16 mM dNTPs, 2 μM sequence-specific reverse primer, 0.04 μM 5′-M13 tailed sequence-specific forward primer and 0.16 μM 5′-fluorescently labeled M13 universal sequence primer in a GeneAmp PCR system 9700 thermocycler (Applied Biosystems). The M13 universal sequence primer was 5′-tagged with VIC, NED, FAM or PET fluorescent dyes in order to facilitate post-PCR multiplexing. The PCR program was as described by Folkertsma et al. (2005).

After the PCR, a few samples from each primer pair were randomly selected and checked for proper amplification and product intensity on 2% agarose gels. For high throughput and low cost genotyping, PCR products were separated by pooling post-PCR products based on fluorescent dye and/or fragment size. Depending on band intensity on agarose gel, 1.5–3.0 μl of PCR products from each of the 6-FAM, VIC, NED and PET-labeled PCR products were pooled together and the final volume adjusted to 15 μl by adding the required volume from a mix of an injection solution (HiDi) and size standard (GS500 LIZ). PCR fragments were denatured and size-fractionated using an ABI 3730 Capillary DNA Sequencer (PE-Applied Biosystems) as described in the user’s manual. The peaks were sized and the alleles called using GeneMapper software version 3.7 and the internal size standard GS500 LIZ. Positive control samples (genotype BTx623, pool-A, pool-B, pool-C) were included in all PCR runs to verify the repeatability of each PCR and of the allele calls.

Data scoring and analyses

Although automated DNA sequencers and the corresponding software contribute substantially to increased throughput rates for large-scale genotyping projects, different factors (e.g., plus-A amplification, stuttering, incorrect allele sizing) cause ambiguity in allele binning. Thus, the software AlleloBin (Prasanth et al. 2006) was used to classify observed microsatellite allele sizes into representative discrete alleles using the least-square minimization algorithm of Idury and Cardon (1997). All statistical analyses were performed on the adjusted data. Twenty-four of the 30 SSR markers showed high reproducibility, with high consistency between the expected (based on sequence information) and observed allele sizes for the 4 positive control samples (BTx623, pool-A, pool-B, pool-C). Therefore, only these 24 SSRs were used in the analysis (Table 1).

Investigated levels of genetic diversity

The extent and partitioning of microsatellite diversity was investigated at three different levels: (i) sorghum type (cultivated or wild), (ii) geographic region of origin and (iii) agro-climatic zone of origin. Six geographic zones that corresponded to the six sorghum growing areas sampled in this study, namely, Turkana, western/Nyanza, eastern/central, Coast, Rift Valley and northeastern were recognized. Agro-climatic zones were defined according to the agro-climatic zone map of Kenya (Sombroek et al. 1982) which recognizes seven agro-climatic zones based on annual rainfall and potential for evaporation: I (humid with 1,100–2,700 mm of annual rainfall), II (sub-humid with 1,000–1,600 mm of annual rainfall), III (semi-humid with 800–1,400 mm of annual rainfall), IV (semi-humid to semi-arid with 600–1,100 mm of annual rainfall), V (semi-arid with 450–900 mm of annual rainfall), VI (arid with 300–550 mm of annual rainfall) and VII (very arid with 150–350 mm of annual rainfall).

Standard parameters of genetic diversity, including total number of alleles (A t), number of rare alleles (A r, alleles with a frequency <5% per group), private alleles (A p, alleles unique to a group), observed heterozygosity (H o) and unbiased expected heterozygosity or gene diversity (H e) were computed using GENETIX version 4.05 (Belkhir et al. 2004). Since the observed number of alleles is highly dependent on sample size, the program FSTAT (Goudet 2002) was employed to compute the mean allelic richness across all loci (R s) for each defined level of genetic structure. In addition, the software HP-RARE 1.2 (Kalinowski 2005) was used to compute and compare the private allelic richness (\( \Pi_{\text{taxon}}^{S} \)) between cultivated and wild sorghum. The two programs implement the rarefaction statistical method first used by Hulbert (1971) to estimate species diversity. The method allows for unbiased comparisons among populations of unequal sample sizes by calculating a standardized estimate of allelic richness for a fixed sample size. Overall differences in R s, \( \Pi_{\text{taxon}}^{S} \) and H e between cultivated and wild sorghum were assessed for significance using Wilcoxon’s signed-rank test as implemented in the software GenStat (VSN International Ltd. 2007). The Kruskal–Wallis test was used to test for differences among geographic and agro-climatic zones in their allelic richness and gene diversity using the software R (R Development Core Team 2007).
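As an illustration of the two sample-size-aware statistics described above, the following minimal sketch computes unbiased gene diversity and rarefied allelic richness for a single locus. It is not the code used by GENETIX, FSTAT or HP-RARE; the function names and the toy allele sizes are hypothetical, and the rarefaction formula is the standard hypergeometric expectation for the number of distinct alleles in a subsample of g gene copies.

```python
from collections import Counter
from math import comb


def unbiased_gene_diversity(alleles):
    """Unbiased expected heterozygosity for one locus.

    `alleles` is a flat list of allele calls (two entries per diploid plant).
    """
    n = len(alleles)  # number of gene copies sampled
    if n < 2:
        raise ValueError("need at least two gene copies")
    counts = Counter(alleles)
    sum_p2 = sum((c / n) ** 2 for c in counts.values())
    # Small-sample correction n / (n - 1) applied to 1 - sum(p_i^2)
    return n / (n - 1) * (1.0 - sum_p2)


def rarefied_allelic_richness(alleles, g):
    """Expected number of distinct alleles in a random subsample of g gene copies."""
    n = len(alleles)
    if not 0 < g <= n:
        raise ValueError("g must lie between 1 and the number of gene copies")
    counts = Counter(alleles)
    total = comb(n, g)
    # For each allele: 1 - P(allele is absent from the subsample)
    return sum(1.0 - comb(n - c, g) / total for c in counts.values())


# Hypothetical allele sizes (bp) for two groups of unequal size
large_group = [150] * 6 + [152] * 2 + [154] + [158]
small_group = [150] * 3 + [152]

# Rarefy both groups to the size of the smaller one before comparing
g = len(small_group)
print(rarefied_allelic_richness(large_group, g))
print(rarefied_allelic_richness(small_group, g))
print(unbiased_gene_diversity(large_group))
```

Rarefying both groups to the same number of gene copies is what makes the comparison between the 329 cultivated and 110 wild samples meaningful: raw allele counts would favor the larger group simply because more plants were sampled.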

To investigate the genetic relationships between all pairs of cultivated and wild sorghum plants, a genetic dissimilarity matrix was computed using the simple matching procedure and subsequently used as an input for principal coordinate analysis (PCoA) in the software DARwin 5.0 (Perrier and Jacquemoud-Collet 2006). The pairwise deletion option was chosen to ensure that dissimilarity calculations were done only for pairs of genotypes where allelic scores were obtained for at least 70% of all loci. At this threshold, eighteen individuals were eliminated from the final calculation because they had too much missing data.
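The dissimilarity step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not DARwin's implementation: it assumes diploid genotypes, takes the simple matching dissimilarity as one minus the mean proportion of alleles shared per locus, and applies the 70% threshold to the loci scored in both plants. All names and genotypes are hypothetical.

```python
from collections import Counter

PLOIDY = 2  # assumption: diploid genotypes, two alleles per locus


def simple_matching_dissimilarity(ind_a, ind_b, min_fraction=0.7):
    """Dissimilarity between two plants, or None if too few loci are shared.

    Each plant is a list with one entry per locus: a tuple of two allele
    sizes, or None where the genotype is missing.
    """
    shared = [(a, b) for a, b in zip(ind_a, ind_b)
              if a is not None and b is not None]
    # Pairwise deletion: skip pairs scored at fewer than min_fraction of loci
    if len(shared) < min_fraction * len(ind_a):
        return None
    matched = 0.0
    for a, b in shared:
        common = Counter(a) & Counter(b)  # multiset of alleles in both genotypes
        matched += sum(common.values()) / PLOIDY
    return 1.0 - matched / len(shared)


def dissimilarity_matrix(individuals, min_fraction=0.7):
    """Symmetric matrix of pairwise dissimilarities (None where undefined)."""
    n = len(individuals)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = simple_matching_dissimilarity(
                individuals[i], individuals[j], min_fraction)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical genotypes at four loci
plant_a = [(100, 102), (200, 200), None, (50, 52)]
plant_b = [(100, 100), (200, 200), (300, 302), (50, 54)]
plant_c = [None, None, (300, 300), (50, 50)]

for row in dissimilarity_matrix([plant_a, plant_b, plant_c]):
    print(row)
```

A matrix of this kind is what a PCoA routine would take as input; pairs returned as None would have to be resolved first, for example by dropping the individuals with the most missing data.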

Genetic structure

Three complementary approaches were used to explore genetic diversity structure: fixation index (F ST), analysis of molecular variance (AMOVA), and model-based clustering. For each approach, the observed diversity was partitioned between the sorghum types (cultivated or wild) and, separately for cultivated and wild sorghum, among geographic and agro-climatic zones.

The software GENETIX 4.05 was used to compute the Weir and Cockerham (1984) θ, an unbiased estimator of F ST. Pairwise F ST calculations were used to compare the level of genetic differentiation among geographic regions for cultivated and wild sorghum separately and between the two sorghum gene pools. The F ST values were tested for significance using a permutation procedure (10,000 permutations). Genetic differentiation in cultivated and wild sorghum was further analyzed using the analysis of molecular variance (AMOVA; Excoffier et al. 1992) procedure, implemented in the software ARLEQUIN 3.11 (Excoffier et al. 2005). The significance of partitioning of the genetic variance components among the various groups was tested using 10,000 permutations.

The Bayesian model-based clustering method implemented in the software STRUCTURE 2.2.3 (Pritchard et al. 2000) was also used to explore genetic structure, first by pooling cultivated and wild sorghum individuals and then separately for each sorghum type. The basic admixture model with unlinked loci, correlated allele frequencies and no a priori population information was used. STRUCTURE was run by varying the number of clusters (K) from 2 to 10 using the web resources of the Computational Biology Service Unit (CBSU) from Cornell University (http://cbsuapps.tc.cornell.edu/structure.aspx). Each K was run 10 times with a burn-in length of 500,000 iterations and a post-burn-in data collection length of 1 × 10^6 iterations. The most likely number of genetic clusters was estimated using the ad hoc statistic ∆K (Evanno et al. 2005), which is based on the second order rate of change of ln P(X|K), the log probability of the data given K. According to Evanno et al. (2005), the peak value of the distribution of ∆K is located at the most likely value of K. We illustrated the peak value by plotting ∆K values against successive K values. The proportion of each individual’s genome assigned to each cluster (Q) for the most likely number of clusters was summarized by way of bar plots.
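To make the ∆K statistic concrete, the sketch below computes it from replicate log-likelihood values. It is an illustration only: the log-likelihoods are invented, the function name is hypothetical, and the second-order difference is taken on the run means, which is one common reading of the Evanno et al. (2005) formula.

```python
from statistics import mean, stdev


def evanno_delta_k(log_likelihoods):
    """Return {K: delta K} from a dict mapping K to replicate ln P(X|K) values.

    Delta K is |L(K+1) - 2 L(K) + L(K-1)| divided by the standard deviation
    of L(K) across runs, with L taken here as the mean over runs.
    """
    avg = {k: mean(v) for k, v in log_likelihoods.items()}
    delta = {}
    for k in sorted(avg):
        if k - 1 not in avg or k + 1 not in avg:
            continue  # undefined at both ends of the tested range of K
        second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
        spread = stdev(log_likelihoods[k])
        delta[k] = second_diff / spread if spread > 0 else float("inf")
    return delta


# Invented ln P(X|K) values for three replicate runs per K
toy_runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-900.0, -901.0, -899.0],
    3: [-890.0, -895.0, -885.0],
    4: [-885.0, -880.0, -890.0],
}

delta_k = evanno_delta_k(toy_runs)
best_k = max(delta_k, key=delta_k.get)
print(delta_k, best_k)
```

Because the statistic needs both neighbors of K, it cannot be evaluated at the smallest or largest K tested, which is worth keeping in mind when choosing the range of K for the runs.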

Spatial structure of genetic variation

To investigate the spatial structure of genetic diversity in cultivated and wild sorghum at country scale, spatial autocorrelation analysis was performed as implemented in the software SPAGeDi (Hardy and Vekemans 2002). A geographic distance matrix was generated from the latitude/longitude coordinates associated with each sample using the software Geographic Distance Matrix Generator version 1.2.3 (http://biodiversityinformatics.amnh.org/open_source/gdmg). In each case, 20 distance classes were defined such that there were approximately equal numbers of pairwise comparisons in each class. Within each class, the relative kinship coefficient (r ij) was estimated using the method of Ritland (1996). This index represents the correlation in allelic states between homologous genes and weighs allele distribution by the inverse of allele frequency, thus giving more weight to rare alleles. This way the approach results in lower sampling variance, hence, is powerful for detecting genetic structure (Hardy and Vekemans 2002). The cultivated sorghum samples obtained from the genebank for northeastern and Rift Valley regions did not have corresponding wild sorghum samples and were therefore eliminated from the analysis to allow for comparison of patterns between cultivated and wild sorghum. The significance of the estimated values of kinship coefficient and regression slope was tested by permuting individuals among locations 1,000 times. The relationship between genetic relatedness and geographic distance was visualized in correlograms using the software R, with the 95% confidence interval envelope under the null hypothesis of no spatial structure indicated.
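The construction of distance classes with roughly equal numbers of pairwise comparisons can be sketched as below. This is a hedged illustration, not the procedure coded in SPAGeDi or in the Geographic Distance Matrix Generator: the great-circle (haversine) formula, the Earth radius, the function names and the coordinates are all assumptions made for the example.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(radians, p)
    lat2, lon2 = map(radians, q)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def equal_count_classes(coords, n_classes):
    """Split all pairs of samples into distance classes of near-equal size.

    Returns a list of classes; each class is a list of (distance_km, i, j).
    """
    pairs = sorted((haversine_km(coords[i], coords[j]), i, j)
                   for i, j in combinations(range(len(coords)), 2))
    base, extra = divmod(len(pairs), n_classes)
    classes, start = [], 0
    for c in range(n_classes):
        size = base + (1 if c < extra else 0)  # spread the remainder evenly
        classes.append(pairs[start:start + size])
        start += size
    return classes


# Hypothetical collection points (latitude, longitude in decimal degrees)
sites = [(3.1, 35.6), (-0.7, 34.8), (0.05, 37.65), (-3.4, 38.4), (-1.5, 37.3)]

for cls in equal_count_classes(sites, 3):
    print(len(cls), round(cls[0][0]), round(cls[-1][0]))
```

Within each class, a mean kinship coefficient would then be computed over the listed pairs and plotted against the class distance to give the correlogram.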

Further, the method of Rousset (2000) was used to indirectly infer the extent of gene dispersal between cultivated and wild sorghum individuals. This approach is based on the analytical model of isolation-by-distance, which predicts that the genetic distance between individuals (â) (Rousset 2000) increases approximately linearly with the logarithm of spatial distance. Rousset’s measure of genetic distance, â, was computed for each pair of individuals using the program SPAGeDi (Hardy and Vekemans 2002). Ten distance classes consisting of approximately equal numbers of individual pairwise genetic distance comparisons were defined. Subsequently, the pairwise genetic distance estimates were regressed on the logarithm of spatial distance, providing a regression slope (b log) and an estimate of the coefficient of determination (r 2). The significance of the regression slope was tested by a randomisation procedure whereby individuals were permuted among locations 1,000 times to assess the distribution of the slope values under the null hypothesis of no correlation between geographic and genetic distance. P values were estimated as the proportion of this distribution lying higher than the observed slope value. The program R was used to plot estimates of the pairwise genetic distance between individuals of cultivated and wild sorghum against logarithmic spatial distance, with the regression line shown.
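The regression and randomisation test described above can be sketched as follows. This is a simplified illustration rather than SPAGeDi's procedure: it takes precomputed pairwise genetic and spatial distance matrices (all values here are invented), assumes every pair of individuals is separated by a positive distance, and adds one to the numerator and denominator of the P value, a conservative convention that is an assumption of this sketch.

```python
import math
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def ibd_permutation_test(gen, geo, n_perm=1000, seed=0):
    """Slope of genetic distance on ln(spatial distance) and a one-sided P value.

    `gen` and `geo` are symmetric matrices indexed by individual. Individuals
    are permuted among locations by relabelling rows and columns of `geo`.
    """
    n = len(gen)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    y = [gen[i][j] for i, j in pairs]

    def slope_for(order):
        x = [math.log(geo[order[i]][order[j]]) for i, j in pairs]
        return ols_slope(x, y)

    observed = slope_for(list(range(n)))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        if slope_for(order) >= observed:
            hits += 1
    # Conservative P value: the observed arrangement counts as one permutation
    return observed, (hits + 1) / (n_perm + 1)


# Invented example: six plants along a transect (positions in km)
positions = [0.0, 1.0, 3.0, 7.0, 15.0, 31.0]
toy_geo = [[abs(a - b) for b in positions] for a in positions]
toy_gen = [[0.1 * math.log(d) if d > 0 else 0.0 for d in row] for row in toy_geo]

slope, p_value = ibd_permutation_test(toy_gen, toy_geo)
print(slope, p_value)
```

Permuting whole individuals, rather than shuffling pairwise distances independently, preserves the dependence among the pairs that share an individual, which is why the test is built on relabelling locations.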

Results

Extent of genetic diversity

The number of alleles per marker varied from 3 in Xtxp136 to 25 in SbAGB02 (Table 1). The 24 SSR markers amplified a total of 295 different alleles among all the 439 samples, an average of 12.3 alleles per marker (Table 2). The number of alleles detected among the 329 cultivated sorghum samples was 257, out of which 173 (67%) were rare and 15 (5%) private alleles. In comparison, 238 alleles were observed in the 110 wild sorghum samples, with 122 (51%) being rare and 13 (5%) private alleles. The cultivated gene pool was observed to harbor lower genetic diversity than the wild gene pool, based on significantly lower mean allelic richness (P ≤ 0.05), private allelic richness (P ≤ 0.05) and gene diversity (P < 0.001) values. We found significant differences among regions in the levels of allelic richness (3.45 ≤ R s ≤ 5.59; P ≤ 0.05) and gene diversity (0.3396 ≤ H e ≤ 0.5595; P ≤ 0.001) for both cultivated and wild sorghum. Estimates of allelic richness and gene diversity for cultivated and wild sorghum in each region are presented in form of box plots in Fig. 2. For cultivated sorghum, the mean allelic richness ranged from 3.45 in Turkana to 5.59 in the coast, whereas the mean gene diversity ranged from 0.3396 in Turkana to 0.5595 in eastern/central. For wild sorghum, the mean allelic richness ranged from 3.67 in Turkana to 6.00 in the coast, while mean gene diversity ranged from 0.4836 in western/Nyanza to 0.6806 in the coast. No significant differences (P ≥ 0.05) were evident for either allelic richness or gene diversity among the agro-climatic zones for cultivated sorghum as well as for its wild progenitor.

Table 2 Comparative genetic diversity estimates for Kenya’s sorghum gene pool
Fig. 2

Box plots showing inter-regional differences in allelic richness and gene diversity for cultivated (a, b) and wild (c, d) sorghum. Letters C, E, N, R, T and W represent coast, eastern/central, northeastern, Rift Valley, Turkana and western/Nyanza regions, respectively. The box represents the interquartile range (50% of values), while the line across the box indicates the median. The lines running vertically from the box (whiskers) extend to the highest and lowest values, excluding outliers, which are denoted by circles

Genetic relationships within and between cultivated and wild sorghum

Genetic relationships among individuals of cultivated and wild sorghum are presented as a biplot of the first plane of a PCoA (Fig. 3). In total the first plane accounted for 13.7% of the total variability, 6.1% of which is accounted for by axis 1 and 5.3% by axis 2. Generally, the separation between cultivated and wild sorghum gene pools was low, although cultivated sorghum from northeastern is clearly separated from both the cultivated and wild counterparts from the other regions. For the rest of the samples, there appeared to be high levels of overlap both among regions and among sorghum types.

Fig. 3

Biplot of axes 1 and 2 of the principal coordinate analysis based on the dissimilarity of 24 SSR markers for cultivated and wild sorghum. Letters C, E, N, R, T and W represent coast, eastern/central, northeastern, Rift Valley, Turkana and western/Nyanza regions, respectively

Patterns of genetic differentiation

Overall, the level of differentiation between cultivated and wild sorghum based on F ST was moderate but highly significant (F ST = 0.062; P < 0.001) (Table 3). Similarly, AMOVA and Bayesian cluster analysis showed close genetic proximity between cultivated and wild sorghum in Kenya. For AMOVA, only 6.5% of the total genetic diversity was partitioned to the variation between cultivated and wild sorghum, compared to 93.6% that was partitioned to the variation within the two sorghum types. Similarly, the Bayesian model-based cluster analysis at K = 2 (i.e. assuming only two genetic groups) failed to identify distinct differentiation among cultivated and wild sorghum individuals (Fig. 4). In this analysis, we assumed that an individual was only exclusively assigned to a particular genetic cluster if at least 85% of its genome (i.e. q i ≥ 0.85) is found in it; otherwise it was assumed to be jointly assigned to two clusters probably due to admixture. Using this arbitrary threshold, at least 26.8% of cultivated sorghum individuals and 36.4% of their wild sorghum counterparts were assigned to both cluster 1 and 2.
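The assignment rule used here can be written out as a short sketch. The membership coefficients below are invented and the function names are hypothetical; only the 0.85 threshold comes from the text.

```python
def assigned_cluster(q, threshold=0.85):
    """Return the 1-based cluster an individual is assigned to, or None.

    `q` holds the individual's membership coefficients across clusters.
    None means no single cluster reaches the threshold, so the individual
    is treated as jointly assigned (possibly admixed).
    """
    top = max(range(len(q)), key=lambda k: q[k])
    return top + 1 if q[top] >= threshold else None


def jointly_assigned_fraction(q_rows, threshold=0.85):
    """Proportion of individuals not assigned exclusively to one cluster."""
    joint = sum(1 for q in q_rows if assigned_cluster(q, threshold) is None)
    return joint / len(q_rows)


# Invented membership coefficients for four individuals at K = 2
toy_q = [
    [0.97, 0.03],
    [0.60, 0.40],
    [0.10, 0.90],
    [0.84, 0.16],
]

print([assigned_cluster(q) for q in toy_q])
print(jointly_assigned_fraction(toy_q))
```

Because the threshold is arbitrary, the reported proportions of jointly assigned individuals shift with it; a lower cut-off would classify fewer plants as admixed.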

Table 3 FST -based genetic differentiation of the sorghum gene pool at various levels
Fig. 4

Estimated population structure at K = 2 for the entire sorghum gene pool ordered by type and membership fraction. Each individual is represented by a vertical line, which is partitioned into colored segments that represent the individual’s membership fraction in K clusters

In cultivated sorghum, genetic differentiation was high among regions (F ST = 0.187; P < 0.001), but moderate among agro-climatic zones (F ST = 0.077; P < 0.001). Similar trends were observed for wild sorghum, with moderate genetic differentiation among regions (F ST = 0.097; P ≤ 0.001), and low genetic differentiation among agro-climatic zones (F ST = 0.054; P ≤ 0.001). The outcome of comparing the extent of genetic differentiation within and between cultivated and wild sorghum among the regions (pairwise F ST) is presented in Table 4. Overall, all F ST values were significantly greater than zero (P ≤ 0.05).

Table 4 Estimates of pairwise F ST among collections of cultivated and wild sorghum within and among different geographical regions

The level of divergence among regions was variable in both cultivated and wild sorghum, with the former generally exhibiting greater F ST values than the latter. In cultivated sorghum the lowest level of inter-regional genetic differentiation was observed between eastern/central and coastal regions (F ST = 0.03) and the highest between Turkana and northeastern regions (F ST = 0.44). Notably, there was a high level of inter-region genetic similarity between eastern/central and coastal regions (F ST = 0.03), western/Nyanza and coastal regions (F ST = 0.05) and eastern/central and western/Nyanza (F ST = 0.07). In contrast, Turkana and northeastern cultivated sorghum pools appeared to be clearly distinct, both from each other and from the rest of the cultivated sorghum pools. Similar trends were revealed in wild populations, with substantial inter-regional similarities among coastal, eastern/central and western regions (0.06 ≤ F ST ≤ 0.10), coupled with substantial distinctiveness of Turkana populations in relation to those from other regions (0.13 ≤ F ST ≤ 0.17, P < 0.001).

We detected lower levels of genetic divergence between cultivated and wild sorghum within (0.03 ≤ F ST ≤ 0.18; in italics) than among (0.05 ≤ F ST ≤ 0.33) regions (Table 4). Within regions, the highest level of crop-wild genetic divergence was recorded in Turkana, and the least in western/Nyanza (F ST = 0.03). The level of crop-wild genetic divergence was comparable and moderate in the coastal (F ST = 0.11) and eastern (F ST = 0.10) regions.

The genetic structure of the 329 cultivated sorghum individuals based on the Bayesian model-based algorithm implemented in STRUCTURE is shown for K = 7 (Fig. 5a, b). Evanno’s ad hoc ∆K method determined K = 7 to be the most likely number of genetic clusters for the entire cultivated sorghum pool (Supplementary Fig. S1). The mean proportion of genome assigned to each of the seven clusters (Q i) is presented for each geographic region in Fig. 5b. The identified genetic structure corresponded closely to geographic origin for Turkana (Q i = 0.94 in cluster 5), northeastern (Q i = 0.70 in cluster 1) and, to a considerable extent, Rift Valley (Q i = 0.65 in cluster 7). For each of these regions, the largest proportion of genome was assigned predominantly to a single and largely unique cluster. In contrast, cultivated sorghum individuals from the coastal, eastern/central and western/Nyanza regions tended to be jointly assigned to more than one cluster (Fig. 5b).

Fig. 5

Structure of the genetic diversity of the 329 cultivated sorghum plants at K = 7: a bar plot of overall partitioning and individual plant assignment (sorted by geographic origin), and b plots of the mean proportion of genome (Q i) assigned in each of the K = 7 clusters for the group of individuals in each region. Letters C, E, N, R, T and W represent coast, eastern/central, northeastern, Rift Valley, Turkana and western/Nyanza regions, respectively

For the 110 wild sorghum individuals, STRUCTURE identified two genetic clusters (Supplementary Fig. S2), which clearly did not correspond to the geographic origin (Fig. 6a, b). Of the four regions, the highest level of genetic uniformity was observed in western/Nyanza, where the wild sorghum individuals were assigned almost exclusively to one of the two genetic clusters (Q i = 0.95 in cluster 1).

Fig. 6

Structure of the genetic diversity of the 110 wild sorghum plants at K = 2: a bar plot of overall partitioning with assignment of individual plants into the identified genetic clusters (sorted by geographic origin), and b plots of the overall proportion of genome (Q i) assigned in each of the K = 2 clusters for the group of individuals in each region. Letters C, E, T and W represent coast, eastern/central, Turkana and western/Nyanza regions, respectively

When the cultivated and wild sorghum data were pooled and STRUCTURE was run from K = 1 to K = 10, Evanno’s ad hoc ΔK method identified K = 5 as the most likely number of genetic clusters (Supplementary Fig. S3). The wild forms were separated largely into four of the five genetic groups, the largest of which (cluster 3) was the least shared with their cultivated counterparts (Fig. 7). Cultivated sorghum, on the other hand, was generally represented in all five clusters, with the Turkana and northeastern collections each being restricted to a single and largely distinct genetic group. Notably, cultivated sorghum from northeastern appeared to share minimal ancestry with cultivated and wild sorghum individuals from other regions. For the rest of the regions, cultivated and wild sorghum individuals overlapped in most of the clusters (Fig. 7).

Fig. 7

Structure of the genetic diversity of the pooled cultivated and wild sorghum individuals at K = 5: a bar plot of overall partitioning with assignment of individual plants into the identified genetic clusters (sorted by geographic origin), and b plots of the overall proportion of genome (Q i) assigned in each of the K = 5 clusters for the group of individuals in each region. Letters C, E, N, R, T and W represent coast, eastern/central, northeastern, Rift Valley, Turkana and western/Nyanza regions, respectively

Spatial genetic structure

The outcomes of the spatial autocorrelation analyses in cultivated and wild sorghum are presented as correlograms (Fig. 8a, b). Relatedness among individuals decreased clearly with increasing geographical distance, reflecting strong spatial genetic structure in both cultivated and wild sorghum. Cultivated sorghum had a mean regression slope of kinship on the logarithm of distance (b log) of −0.015 (P < 0.001) and a coefficient of determination (r 2) of 0.045, while wild sorghum had a b log value of −0.017 (P < 0.001) and an r 2 value of 0.055. Furthermore, kinship coefficient values were positive and significant within a similar range of about 180 km for the two sorghum conspecifics. Negative and significant kinship coefficient values were clearly evident in wild sorghum beyond 600 km, while in cultivated sorghum significant negative values did not show a consistent pattern. The relationship between crop-wild genetic distance and geographic distance at country level is presented in Fig. 8c. Rousset’s genetic distance (â) between pairs of cultivated and wild sorghum individuals increased linearly with the logarithm of geographic distance (slope = 0.149, permutation test: P ≤ 0.001, r 2 = 0.028), a pattern typical of isolation by distance.

Fig. 8

Spatial patterns of genetic relatedness within and between cultivated and wild sorghum. Correlograms of pairwise relatedness (Ritland kinship coefficient) among individuals of cultivated and wild sorghum are presented in a and b, respectively. In each case, the dashed lines represent upper and lower 95% confidence limit envelopes around the null hypothesis of no spatial structure. A plot of the regression of pairwise genetic distance among cultivated and wild sorghum individuals on geographic distance is presented in c
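The regression slope of pairwise kinship on the logarithm of geographic distance reported above, and its permutation test, can be sketched as follows. This is an illustrative outline only, not the software used in the study. It assumes planar coordinates in kilometres, a precomputed table of pairwise kinship coefficients, and a one-sided test obtained by shuffling locations among individuals. All names and data are hypothetical.

```python
import math
import random
from itertools import combinations


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def kinship_distance_slope(coords, kinship, n_perm=999, seed=0):
    """Slope of kinship on ln(distance) and a one-sided permutation P value.

    coords  : list of (x_km, y_km) positions, one per individual
    kinship : dict mapping (i, j), i < j, to a pairwise kinship coefficient
    """
    def slope_for(positions):
        xs, ys = [], []
        for (i, j), f in kinship.items():
            d = math.dist(positions[i], positions[j])
            if d > 0:  # ln(0) is undefined, so co-located pairs are skipped
                xs.append(math.log(d))
                ys.append(f)
        return ols_slope(xs, ys)

    observed = slope_for(coords)
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(n_perm):
        shuffled = coords[:]
        rng.shuffle(shuffled)  # break the link between genotype and location
        if slope_for(shuffled) <= observed:
            as_extreme += 1
    p_value = (as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical individuals along a transect; kinship decays with ln(distance).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (50.0, 0.0), (100.0, 0.0), (400.0, 0.0)]
kinship = {
    (i, j): 0.2 - 0.03 * math.log(math.dist(coords[i], coords[j]))
    for i, j in combinations(range(len(coords)), 2)
}

b, p = kinship_distance_slope(coords, kinship)
print(b, p)
```

A negative slope with a small P value indicates that nearby individuals are more related than distant ones, which is the pattern described for both cultivated and wild sorghum.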

Discussion

Extent of genetic diversity in cultivated and wild sorghum

Mean gene diversity across the 24 SSR markers for cultivated sorghum in Kenya (H e = 0.59) is similar to values reported for microsatellites in Niger (H e = 0.61) by Deu et al. (2008) and in South Africa (H e = 0.60) by Uptmoor et al. (2003), but slightly lower than values estimated for Eritrea by Ghebru et al. (2002) and for Morocco by Djè et al. (1999). In the wild sorghum gene pool, the mean gene diversity estimated for Kenya across the 24 SSR markers (H e = 0.69) was higher than that estimated by Casa et al. (2005) for a set of accessions selected to represent a wide geographic sampling in Africa (H e = 0.59). As noted by Deu et al. (2008), however, comparing the magnitude of genetic diversity between studies is difficult, because it may be complicated by differences in, among other things, the underlying sampling schemes, the number of SSRs surveyed, the size of the SSR repeats and the location of the SSRs in the genome.
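For readers unfamiliar with the statistic, gene diversity (H e) at a locus can be computed from allele frequencies as in the sketch below. It assumes Nei's sample-size-corrected estimator, H e = n/(n − 1) × (1 − Σ p i²), where n is the number of gene copies sampled. The text above does not state which estimator was used, and the allele sizes are hypothetical.

```python
from collections import Counter


def gene_diversity(alleles):
    """Unbiased gene diversity at one locus.

    alleles: list of allele calls, one entry per sampled gene copy
             (two entries per diploid individual).
    """
    n = len(alleles)
    if n < 2:
        raise ValueError("at least two gene copies are needed")
    counts = Counter(alleles)
    sum_sq_freq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq_freq)


def mean_gene_diversity(loci):
    """Average the per-locus gene diversity over a dict {locus_name: alleles}."""
    return sum(gene_diversity(a) for a in loci.values()) / len(loci)


# Hypothetical SSR allele sizes (base pairs) at two loci, for illustration only.
hypothetical_loci = {
    "locus_A": [150, 150, 152, 152],  # two alleles at equal frequency
    "locus_B": [200, 200, 200, 200],  # monomorphic locus
}

print(mean_gene_diversity(hypothetical_loci))
```

Because the estimate depends on the allele frequencies at each locus, studies that survey different markers or sample different numbers of individuals are not strictly comparable, which is the caveat raised by Deu et al. (2008).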

During the process of domestication, evolutionary processes of founder effect, population bottleneck, and artificial selection are all expected to reduce genetic diversity of the crop in relation to its wild progenitor (Ladizinsky 1999; Gepts 2004). This view was supported in the present study by findings that cultivated sorghum harbored lower genetic diversity (in terms of overall allelic richness, private allelic richness, and gene diversity) than its proposed wild progenitor. Our findings were consistent with previous comparisons between cultivated and wild sorghum using various genetic markers (Morden et al. 1990; Aldrich and Doebley 1992; Cui et al. 1995; Casa et al. 2005). The significantly higher private allelic richness in the wild sorghum relative to its cultivated counterpart is of great importance to the conservation and utilization of sorghum genetic resources. These findings support the widely held view that crop wild relatives are potential sources of important and unique genes for crop improvement programs and therefore deserve special conservation and utilization attention.

Divergence between cultivated and wild sorghum

We found close genetic proximity between individuals of cultivated and wild sorghum at the national level, based on PCoA, F ST and Bayesian model-based cluster analyses. Furthermore, pairwise F ST showed the extent of crop-wild genetic divergence to be generally lower within than among regions. Considered together, these findings may reflect important historical gene flow between cultivated sorghum and its wild relatives in situ. Cultivated and wild sorghum are inter-fertile, and natural hybrids between ssp. bicolor and ssp. verticilliflorum are well documented within and around cultivation in Africa (Dogget and Majisu 1968; Dogget and Prasada Rao 1995; Tesso et al. 2008). In a study related to the present one, Mutegi et al. (2010) recorded putative crop-wild hybrid plants within and around cultivated sorghum and other fields across the crop’s growing regions of Kenya. Such crop-wild hybrids may act as conduits for the escape and persistence of transgenes in wild and/or weedy relative populations through introgressive hybridisation, as has been demonstrated between GM oilseed rape (Brassica napus L.) and its wild relatives (Halfhill et al. 2004).

Furthermore, we found the extent of divergence between cultivated and wild sorghum to vary substantially among regions. This finding may reflect inter-regional differences in the extent of introgression between cultivated and wild sorghum, probably due to differences in farmer practices. For example, during our sample collection surveys, two contrasting weedy sorghum management practices were observed in the western/Nyanza and Turkana regions. Farmers in western/Nyanza were observed to tolerate putative crop-wild hybrids in their sorghum fields following harvest, often resulting in populations of wild sorghum in recently harvested sorghum fields. Such remnant populations were also observed in abandoned (fallow) fields in close proximity to cropped sorghum fields. Such a scenario may enhance on-farm hybridisation between cultivated and wild sorghum. In Turkana, most of the sorghum is grown under irrigation, with farmers holding small (50 × 100 m) plots. Farmers therefore practice more intensive weeding of their sorghum fields. Consequently, we encountered very few putative crop-wild hybrids within and around sorghum fields during the surveys. Not surprisingly, relative to other growing regions, genetic divergence between cultivated and wild sorghum was lowest in western/Nyanza and highest in Turkana.

Patterns of genetic differentiation in cultivated and wild sorghum

Our study found the extent of genetic diversity to vary significantly among regions but not among agro-climatic zones for both cultivated and wild sorghum. Furthermore, we found greater levels of genetic differentiation among geographic regions than among agro-climatic zones for both cultivated and wild sorghum. Our results suggest that the diversity of cultivated sorghum and its wild counterpart in Kenya is structured more by geographical than by climatic factors. Deu et al. (2008) found the same for cultivated sorghum in Niger, with moderate differentiation among regions (F ST = 0.07) and lower differentiation among annual rainfall classes (F ST = 0.03). The main evolutionary forces responsible for producing genetic structure in plant populations are gene flow, selection associated with environmental heterogeneity and/or farmer preferences, and random genetic drift (Hartl and Clark 1997; Neal 2004). Because most SSR loci are presumed to be selectively neutral, environmental factors are expected to contribute minimally to the observed genetic structure. In contrast, geographic isolation limits the level of gene flow among populations and should therefore contribute significantly to contemporary genetic structure. The higher level of geographic structure observed in cultivated sorghum (F ST = 0.187) compared to its wild progenitor (F ST = 0.097) may point to greater intra-regional genetic proximity in the former than in the latter. Such intra-regional genetic proximity in cultivated sorghum would arise through seed exchanges among farmers. Mutegi et al. (2010) reported sorghum seed systems in Kenya to be largely traditional, with farmers playing a major role in the selection and exchange of seeds.
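The meaning of an among-region F ST can be illustrated with a simple heterozygosity-based calculation. The sketch below uses Nei's formulation, F ST = (H T − H S)/H T, for a single locus with equally weighted populations. This is a simplification chosen for illustration and is not necessarily the estimator behind the values quoted above. The allele frequencies are hypothetical.

```python
def fst_nei(pop_freqs):
    """Single-locus F_ST as (H_T - H_S) / H_T, weighting populations equally.

    pop_freqs: list of dicts mapping allele -> frequency, one dict per population.
    """
    k = len(pop_freqs)
    alleles = set().union(*pop_freqs)
    # H_S: mean expected heterozygosity within populations
    h_s = sum(1 - sum(p ** 2 for p in freqs.values()) for freqs in pop_freqs) / k
    # H_T: expected heterozygosity of the pooled (mean) allele frequencies
    mean_freq = {a: sum(f.get(a, 0.0) for f in pop_freqs) / k for a in alleles}
    h_t = 1 - sum(p ** 2 for p in mean_freq.values())
    return (h_t - h_s) / h_t if h_t > 0 else 0.0


# Hypothetical allele frequencies in two regions, for illustration only.
region_1 = {"A": 0.8, "B": 0.2}
region_2 = {"A": 0.2, "B": 0.8}

print(fst_nei([region_1, region_2]))
```

Values near 0 indicate that regions share similar allele frequencies, while values near 1 indicate that regions are fixed for different alleles. Under this reading, a higher among-region F ST in the crop than in its wild relative means that more of the crop's diversity is partitioned among regions.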

Bayesian model-based cluster analysis nevertheless showed poor correspondence between the observed genetic structure and geographic origin for both cultivated and wild sorghum. Similar results have been reported in Africa for cultivated sorghum (Ayana et al. 2000b; Ghebru et al. 2002; Nkongolo and Nsapato 2003; Deu et al. 2008) and for its wild relatives (Ayana et al. 2000a). In our study, the only exceptions were the Turkana and northeastern cultivated sorghum pools, each of which was clustered into a distinct and unique genetic group. Pairwise F ST analysis showed variable levels of genetic differentiation among regions for both cultivated and wild sorghum. Considered together, our results may reflect contemporary and/or historical seed-mediated gene flow among the geographic regions, with varying amounts of seed exchange among regions. Because Turkana and northeastern are geographically remote from the other growing regions, the two regions have probably experienced minimal, if any, cross-regional seed-mediated genetic exchange. Interestingly, the Turkana and northeastern cultivated sorghum gene pools appear to be genetically distinct from each other even though the regions are geographically proximal. The two regions are, however, physically separated by Lake Turkana and Mt. Kulal, both of which might have acted as barriers to cross-regional seed-mediated gene flow. Another plausible explanation for the clear genetic separation of northeastern and Turkana sorghum, both from each other and from the rest of the country, is a separate evolutionary history. For example, northeastern sorghum may be part of the Ethiopian sorghum gene pool, having originated from the Boran agro-pastoralist ethnic group, whose distribution spans the Kenya–Ethiopia border. This hypothesis, however, needs further testing with evolutionary genetic studies that incorporate materials from neighboring countries such as Ethiopia, Sudan and Uganda.

Spatial genetic structure

We observed strong spatial genetic structure in both cultivated and wild sorghum and, surprisingly, significant positive spatial autocorrelation over the same range of approximately 180 km in the two conspecifics. This suggests that similar evolutionary factors may underlie the observed pattern of spatial genetic structure in both. Among the factors that could explain the strong spatial structure in cultivated and wild sorghum are seed-mediated and/or pollen-mediated gene flow. In plants, the spatial distribution of genetic variation is primarily determined by seed and pollen dispersal, habitat distribution, micro-environmental selection and genetic drift (Levin and Kerster 1974; Epperson 1993). Seed and pollen dispersal (with or without interbreeding) causes similarity between neighboring populations, whereas distant populations differ, as reflected in the autocorrelation coefficient (Sokal and Oden 1978; Epperson 1993, 2004).

In the present study, a number of farmers originally from the western/Nyanza and eastern/central regions were noted to have migrated with their sorghum landraces into the coastal region. In a recent study, Mutegi et al. (2010) documented incidences of medium- to long-distance seed exchange in sorghum among growing regions, mostly through inter-ethnic marriage relationships, but also appreciably through formal distribution of improved varieties via government and non-governmental extension systems. Finally, there are two plausible explanations for the near-identical patterns of spatial genetic structure in cultivated and wild sorghum: inadvertent dispersal and establishment of wild sorghum seed via cultivated sorghum seed systems, and pollen-mediated crop-wild gene flow at sites of sympatric occurrence.

Implications for biosafety regulations

Important historical gene flow between cultivated and wild sorghum is strongly suggested by two findings in this study: the low level of divergence between cultivated and wild sorghum gene pools, and significant isolation-by-distance between pairs of cultivated and wild sorghum individuals. The level of divergence between cultivated and wild sorghum varied among geographic regions, probably reflecting inter-region differences in the level of crop-to-wild gene flow. Differences in farmer practices, such as weedy sorghum management and/or seed selection, are some of the factors that could explain this inter-region variation in the extent of crop-to-wild gene flow. Furthermore, the pattern of increased genetic similarity between geographically close pairs of cultivated and wild sorghum individuals relative to isolated ones (isolation-by-distance) revealed in this study is of further biosafety significance. It suggests that crop-to-wild gene flow in sorghum is spatially predictable, and that transgene escape into cultivated and/or wild-weedy relatives of the crop will be higher within and around cultivated fields than in natural habitats away from cultivation. Overall, this study suggests that deployment of GM sorghum in Kenya may lead to the escape of transgenes into, and their persistence in, wild-weedy sorghum relatives, with the rate of crop-to-wild gene flow varying among growing regions. However, the extent and direction of gene flow remain unknown, as do the consequences of transgene escape and persistence in wild sorghum populations. Biosafety regulators could benefit from further studies on the extent and direction of crop-wild gene flow on-farm, and from studies on the fitness effects of transgenic traits in wild-weedy relatives of crop sorghum.

Implications for utilization and conservation of germplasm

Significantly higher levels of genetic diversity were revealed in wild sorghum relative to its domesticated conspecific. These results, together with the finding of 2.6 unique alleles per locus in the wild sorghum gene pool, are of interest to sorghum breeding. The high genetic diversity could potentially be exploited to broaden the genetic base of sorghum breeding germplasm, while the unique diversity implies that wild sorghum is a potential source of novel genes, such as those conferring pest and disease resistance. The genetic potential of wild relatives of sorghum, particularly as sources of resistance to pests and diseases, is well documented, for example for sorghum shoot fly (Kamala et al. 2009), sorghum midge (Sharma and Franzmann 2001), greenbug (Duncan et al. 1991), downy mildew (Kamala et al. 2002) and ergot (Reed et al. 2002). Moreover, the substantial genetic variability and differentiation revealed in the sorghum landraces of Kenya should be incorporated into breeding programs by developing different populations with a broad genetic base. In addition to safeguarding landrace diversity, this will help to create new genetic recombinations that can be exploited in response to new breeding challenges.

Finally, levels of genetic diversity within the two S. bicolor conspecifics differed significantly among geographic regions. Adequate measures need to be put in place for systematic conservation of these important genetic resources using complementary ex situ and in situ approaches. Such approaches could also benefit from further studies on the extent and partitioning of diversity in the country, including wild sorghum samples from natural habitats away from cultivated lands and from the regions not sampled in the present study (Rift Valley and northeastern).