Introduction

Sorghum [Sorghum bicolor (L.) Moench, 2n = 2x = 20] is the emerging model crop species for the tropical grasses with C4 photosynthesis. Sorghum has a genome size of 740 Mb (Paterson et al. 2009). The complete genome sequence of sorghum was made available for public use in early 2008 (http://genome.jgi-psf.org/Sorbi1/Sorbi1.info.html, Paterson et al. 2009), which enabled the sorghum geneticist to understand the complex traits at sequence level. Sorghum is the fifth most important cereal crop and occupies the second position among the staple food grains in semi-arid tropics. It is well adapted to withstand harsh conditions, the characteristic feature of tropical regions.

During the initial stages of molecular markers’ evolution, RFLP markers played a significant role in identifying the genomic regions responsible for agronomically important traits, genetic diversity analyses and comparative genome mapping. In the later stages, SSR markers proved highly useful in applied breeding programs because of their reproducibility and co-dominant nature. In sorghum, SSR markers have been developed by different research groups (Brown et al. 1996; Taramino et al. 1997; Bhattramakki et al. 2000; Kong et al. 2000; Schloss et al. 2002; Ramu et al. 2009; Srinivas et al. 2008, 2009; Li et al. 2009). Genetic diversity studies in sorghum were initiated early in the 1990s using molecular markers, especially RFLP probes (Deu et al. 1994, 1995, 2006). During the initial stages, no separate clustering was observed according to race or with their geographic origin. This may be because of using less number of markers for their differentiation and also inclusion of less number of samples. Later, differentiation according to race and origin was first established by Deu et al. (1994) using RFLP probes. Later on, diversity analyses were carried out using different marker systems, AFLP (Menz et al. 2004), RAPD (Agrama and Tuinstra 2003), SSRs (Folkertsma et al. 2005; Barnaud et al. 2007; Caniato et al., 2007; Deu et al. 2008; Ng’uni et al. 2011; Bouchet et al. 2012) and next-generation sequencing technologies (NGS) between sweet and grain sorghums (Zheng et al. 2011). SSRs were used widely for linkage mapping and diversity analyses of selected germplasm subsets, ranging from a core collection (Deu et al. 2006) to more specific sets of locally adapted lines (Deu et al. 2008, 2010; Barro-Kondombo et al. 2010; Ng’uni et al. 2011; Ji et al. 2011; Ngugi and Onyango 2012) and trait-specific germplasm sets for aluminum tolerance (Caniato et al. 2007) and sweet sorghum lines (Ali et al. 2008; Murray et al. 2009; Wang et al. 2009).

The major disadvantage associated with the anonymous SSR is the high cost of development and the time-consuming process. Alternatively, molecular markers/tools in a particular crop can be developed rapidly and in an inexpensive way through data mining (Kantety et al. 2002; Varshney et al. 2002; Jayashree et al. 2006; Ellis and Burke 2007). Nowadays, because of NGS technologies and genome sequencing projects, a vast amount of nucleotide sequences are flooding the public databases. This includes both genomic and genic sequences. Genic sequences (expressed sequence tags or ESTs) are derived from either ends of cDNA of plant genome-expressed regions. ESTs or cDNA sequences have gained much attention in this functional genomics era. SSRs have their usefulness even in this functional genomics era because of their reproducibility, and multi-allelic and high-throughput nature. Identification of SSRs from ESTs (genic microsatellites) is the preferred alternative for rapid and inexpensive marker development (Gupta et al. 2003; Kantety et al. 2002; Varshney et al. 2002; Jayashree et al. 2006; Ellis and Burke 2007). ESTs, if found with SSRs, can be linked directly to the genes, which may have some agronomic importance.

Unlike anonymous SSRs, EST-SSRs are derived from the expressed portion of the genome and, thus, have better chances of transferability across the species/taxa. EST-SSRs were successfully developed in different crops through data mining, e.g., in rice (Cho et al. 2000), sorghum (Ramu et al. 2009; Srinivas et al. 2008, 2009), pearl millet (Senthilvel et al. 2008), barley (Eujayl et al. 2002; Thiel et al. 2003), wheat (Gupta et al. 2003; Yu et al. 2004), tall-fescue grass (Saha et al. 2004), sugarcane (da Silva and Solis-Gracia 2006), rye grass (Studer et al. 2008), etc., for different purposes including linkage mapping (Ramu et al. 2009; Srinivas et al. 2008, 2009 in sorghum, Yu et al. 2004 in wheat, Thiel et al. 2003; Varshney et al. 2006 in barley), diversity analysis (Cho et al. 2000 in rice, Leigh et al. 2003; Gupta et al. 2003 in wheat, Eujayl et al. 2002; Varshney et al. 2008 in barley) and transferability testing (Saha et al. 2004; Balyan et al. 2005; Mian et al. 2005; Zhang et al. 2006).

In sorghum, EST-SSR markers were limited to transferability study across the other cereals (Wang et al. 2005) and have not yet been used for assessing diversity of germplasm subsets. Therefore, the present study aimed to assess the pattern of genetic diversity using EST-SSR markers and test their discriminative power in comparison with the genomic SSRs in a reference set of sorghum.

Materials and methods

Plant material

A sorghum reference set was defined based on a genetic diversity analysis of global composite germplasm collection (GCGC, 3,367 accessions) using 41 genomic/neutral SSR markers in the frame of a project sponsored by the Generation Challenge Programme (GCP) of the CGIAR. The reference set captured 78 % of the sorghum GCGC genetic diversity (Billot et al. 2013). A complete list of 384 accessions along with passport information is provided in Supplemental table S1. This set was built to represent along with wild relatives, all five basic races and their intermediate races from sorghum worldwide distribution (47 different countries). Trait-specific and most commonly used genotypes in breeding programs were also included.

DNA extraction and quantification

Seeds for the reference set were collected from ICRISAT, Patancheru, India and CIRAD, Montpellier, France. These seeds were grown in pots under natural conditions. DNA was isolated from 10-day-old seedlings using the protocol developed by Mace et al. (2003). After extraction, 1 μl of DNA sample for all accessions was loaded in 0.8 % agarose gels. Uncut λ DNA was loaded as a control to assess the quality and the quantity of DNA. Based on uncut λ DNA standards, DNA samples were normalized to a uniform concentration (2.5 ng/μl) and used for SSR genotyping.

Genotyping with EST-SSR markers

Forty-five EST-SSR markers were selected from the set of 55 SSRs developed and mapped at ICRISAT (Ramu et al. 2009). These markers had reasonable coverage on sorghum genome representing all ten linkage groups of sorghum. EST-SSR markers were grouped into different multiplex sets based on their amplicon sizes in internal standard genotype, BTx623 (Table 1). In addition, three pooled control panel DNAs were used as standards (Billot et al. 2012; http://sorghum.cirad.fr/SSR_kit), to check the consistency of each run in ABI 3130xl Genetic Analyzer.

Table 1 Details of sorghum EST-SSR markers screened across the reference set of sorghum

The forward primers were synthesized adding the M13-forward primer sequence (5′CACGACGTTGTAAAACGAC3′) at the 5′end. A fluorescently labeled M13-forward primer sequence was also added in the PCR mix to amplify the fluorescently labeled PCR product, which can help further screening in the ABI 3130xl Genetic Analyzer. PCR was performed in a 5 μl reaction volume with final concentrations of 2.5 ng DNA, 2 mM MgCl2, 0.1 mM of dNTPs, 1× PCR buffer, 0.08 pM of M13-tailed forward primer, 0.16 pM of M13-labeled and reverse primers and 0.1 U of Taq DNA polymerase (AmpliTaq Gold®, Applied Biosystems, USA) in a GeneAmp® PCR System 9700 thermal cycler (Applied Biosystems, USA) with the following cyclic conditions: initial denaturation at 94 °C for 15 min (to activate Taq DNA polymerase), then ten cycles of denaturation at 94 °C for 15 s, annealing at 61 °C for 20 s (temperature reduced by 1 °C for each cycle) and extension at 72 °C for 30 s. This was followed by 40 cycles of denaturation at 94 °C for 10 s, annealing at 54 °C for 20 s and extension at 72 °C for 30 s with a final extension of 20 min at 72 °C. Amplified PCR products according to their multiplexes with internal ROX-400 size standard were run through the ABI 3130xl Genetic Analyzer.

Data analysis

Raw data produced from the ABI 3130xl Genetic Analyzer were first scanned through GeneScan 3.7 (Applied Biosystems, USA) and further fragment analysis was carried out using Genotyper 3.7 software (Applied Biosystems, USA). Based on relative migration with internal size standard (ROX-400, Applied Biosystems), PCR amplicon sizes were scored in base pairs (bp). These raw data were further processed through the AlleloBin program (http://www.icrisat.org/bt-software-downloads.htm) to get perfect allele calls.

Five EST-SSR markers (Xisep0209, Xisep0621, Xisep0728, Xisep1128 and Xisep1150) were deleted from the data analysis due to poor data quality (many missing data points). Allele numbers, gene diversity, heterozygosity and polymorphic information content (PIC) were calculated with PowerMarker v3.25 software (Liu and Muse 2005). Between samples, genetic distances were assessed through simple matching index as implemented in DARwin v5 software (Perrier et al. 2003; Perrier and Jacquemoud-Collet 2006). A dendrogram was constructed using the neighbor-joining (NJ) method as implemented in the same software.

Analysis of population structure and AMOVA

Bayesian cluster analysis, a model-based approach, as implemented in STRUCTURE 2.2 (Pritchard et al. 2000; Falush et al. 2003) was employed to detect the population structure and to assign the individuals in the reference set of sorghum into subpopulations (K) based on the 40 EST-SSR markers data. An admixture model with correlated allele frequencies was used with burn-in length of 100,000 and 200,000 Monte Carlo Markov chain (MCMC) replicates with five independent runs for each K ranging from 5 to 15. After obtaining the optimum number of subpopulations, an analysis of molecular variance (AMOVA) was performed using Arlequin ver. 3.1 (Excoffier et al. 2005) to estimate the genetic variance components within subpopulations and between subpopulations.

Results

All EST-SSR markers used in this study were polymorphic across accessions of the sorghum reference set. Availability of marker data ranged from 91.4 (Xisep0108) to 100 % (Xisep0101) with a total of 1.83 % missing data. These markers yielded a total of 360 alleles with an average of 9 alleles per marker (Table 1). Out of 360 alleles observed, cultivated sorghum accessions alone contributed 320 alleles (88.9 %), whereas the wild genotypes alone contributed 257 alleles (71.4 %). The number of alleles per locus ranged from 3 (Xisep1038) to 39 (Xisep1012) (Table 1). Complete marker data produced across 384 reference set accessions with 40 EST-SSR markers are included in Supplemental table S2.

Unique and rare alleles

Cultivated sorghum genotypes accounted for 103 (28.6 %) unique alleles, while wild genotypes showed 40 unique alleles (11.1 %). Among the 40 unique alleles in wild genotypes, 8 were contributed by only one accession, IS 18868, which belongs to the verticilliflorum subspecies that purportedly originates from the USA. A total of 217 alleles were detected in common between cultivated and wild genotypes. A large number of alleles (134 alleles, 37.2 %) were found as ‘rare alleles’ (frequency <1 %). Rare alleles were not observed for two markers viz., Xisep0523 and Xisep1029, whereas the maximum numbers of rare alleles were noted for Xisep1012 (13 alleles) followed by Xisep0829 (11 alleles).

The PIC values for these markers in the reference set of sorghum varied from 0.1379 (Xisep0120) to 0.9483 (Xisep1012) with an average of 0.523. Xisep1012 was the most informative, with the most alleles and highest PIC value. The observed heterozygosity varied from 0.0 (Xisep1038) to 0.1076 (Xisep1103) with a mean of 0.039. Gene diversity varied from 0.1429 (Xisep0120) to 0.9505 (Xisep1012) with an average of 0.5690 (Table 1).

Allelic distribution according to the race

Maximum numbers of alleles were present in the bicolor race (212 alleles, 58.89 %), followed by the caudatum race (208 alleles, 57.78 %). Durra and guinea (excluding guinea margaritiferum (Gma)) race accessions contributed 177 (49.17 %) and 180 alleles (50 %), respectively. Kafir race’s contribution was limited to only 107 alleles (29.72 %). Among the guinea races, guinea margaritiferum (Gma) contributed 91 alleles (25.28 %). All EST-SSR markers screened were found polymorphic in bicolor and guinea races. Five markers (Xisep0114, Xisep0422, Xisep0829, Xisep1038 and Xisep1231) were found monomorphic across the kafir race accessions. Xisep0120 and Xisep0422 markers were found monomorphic in the caudatum and durra races, respectively. Within the guinea race, gma race accessions had nine monomorphic markers, namely, Xisep0242, Xisep0422, Xisep0444, Xisep0607, Xisep0639, Xisep0824, Xisep0829, Xisep0948 and Xisep1035.

Allelic distribution according to geographic origin

All the markers were polymorphic across all the genotypes originating from the African continent. Genotypes originating from the African continent had 319 alleles (88.61 %), whereas those from other parts of world contributed 306 alleles (85.0 %). Among the African origin genotypes, those from Central Africa contributed 183 alleles (50.83 %), Eastern Africa 251 alleles (69.72 %), Southern Africa 212 alleles (58.89 %) and Western Africa 225 alleles (62.50 %).

Reference set accessions from the Asian continent were grouped into three, namely, Eastern Asia, Middle East and India. All the markers were polymorphic in the genotypes of Indian origin contributed 174 alleles (48.33 %). Genotypes from Eastern Asia contributed 119 alleles (33.05 %), whereas those from Middle East contributed 139 alleles (38.61 %). Five markers, Xisep0203, Xisep0422, Xisep0805, Xisep1038 and Xisep1231, were found monomorphic in Eastern Asia accessions, while Xisep0422, Xisep0449, Xisep0504 and Xisep1231 were monomorphic in Middle East accessions. In both cases, Xisep0422 and Xisep1231 markers were found monomorphic.

Cluster analysis

Reference set accessions were clustered according to geographic origin and by race within each origin. Factorial analysis (FA) (Fig. 1a, b) and NJ analysis (Fig. 2) have clearly identified the different clusters according to their race and geographic origin. Guinea margaritiferum race genotypes from Western Africa were clustered on top of the second quadrant (II) (Fig. 1b) of FA, when axes 2 and 3 were considered. Wild genotypes were clustered in the middle of the intersection point between quadrant I and II. Genotypes belonging to the kafir race were clustered in III quadrant, while genotypes belonging to races caudatum and guinea were grouped in IV and I quadrants. The durra race grouped together between quadrant I and IV when viewed on axis 1–2 of the FA. On these axes, caudatum accessions grouped in quadrant II, while bicolor race accessions were grouped between durra and caudatums.

Fig. 1
figure 1

Factorial analysis with 40 EST-SSR markers screened against a reference set of sorghum. Basic races have been represented with different colors [B-bicolor in blue, C-caudatum in violet, D-durra in pink, G-guinea (excluding Gma) in green, Gma-guinea margaritiferum in light green, K-kafir in orange, wild in red

Fig. 2
figure 2

NJ dendrogram for sorghum reference set based on the allelic data of 40 EST-SSRs. Cluster numbers and color codes are the same as those used in the STRUCTURE analysis presented in Fig. 3

For a better understanding, the dendrogram produced by NJ analysis was divided into eight major clusters. Accessions originating from Western Africa (AfricaW) were grouped into two separate clusters. One cluster was exclusively made up of accessions belonging to guinea margaritiferum (Gma) in the VIIa cluster which formed a group along with wild genotypes (Fig. 2). Another cluster consisted of mostly guinea (G) and guinea gambicum (Gga) and their intermediate races (cluster II). Cluster III was formed mainly with the accessions belonging to the kafir (K) race, most of which originated from Southern Africa (AfricaS). This cluster also included other accessions from Southern Africa belonging to race bicolor (B) and from Eastern Africa belonging to intermediate race durra-caudatum (DC). Accessions that originated from India were mainly grouped into two clusters. One group of accessions belonging to the guinea race was grouped in cluster IV along with some other intermediate races derived from Southern Africa. Another group of accessions was grouped in Cluster Va which belonged to the durra race. Durras from the Middle East were clustered in Cluster Vb. Cluster VI was divided into two subgroups, a and b. Cluster VIa mainly consisted of bicolor and intermediate races from Eastern Asia, whereas VIb consisted of durra and intermediate accessions with durra from Africa being grouped together. Accessions belonging to the caudatum race derived from Eastern Africa (AfricaE), Central Africa and Western Africa were largely grouped in cluster I. This cluster also hosted the accessions having the intermediate races between caudatum and guinea.

Most of the wild genotypes were grouped in cluster VIIb. There was no specific cluster differentiation according to their geographic origin (Fig. 2). But there was clear demarcation in grouping pattern according to their subspecies. Most of the arundinaceum and aethiopicuum subspecies accessions were grouped together, while verticilliflorum and drummodii subspecies accessions formed another group. Some of the landraces were also grouped into the same cluster along with wild genotypes in cluster VIIc. Cluster VIII consisted of accessions from different geographic origins. A dendrogram was also constructed without wild accessions (Supplemental figure S3), which does not affect much of the grouping pattern of other germplasm.

STRUCTURE and AMOVA

This reference set representing global sorghum germplasm collection representing all basic races and intermediate races from all geographic origins. Hence, cluster analysis (distance based) in combination with STRUCTURE (model-based) analysis was used to identify the possible number of subpopulations. Cluster analysis clearly showed that Gma accessions were separated from guinea race and clubbed with wild accessions. STRUCTURE analysis also clearly assigned the individuals into subpopulations according to their geographic origin followed by race. Gma race accessions were separated from guinea and wild accessions when K = 6. This K was taken as the prime source of a number of subpopulations for further investigation. Southern Africa-derived accessions (kafir and guinea races) were grouped as a single subpopulation. When K = 7, Southern Africa-derived accessions [kafir (cluster III) and guinea races (cluster IV)] were separated into two subgroups. Further at K = 8, clusters V and VI were grouped into three subpopulations, separating Middle East- and East Asia-derived accessions as a single population from Indian durras and intermediate races from the African continent. When K increased to 10, accessions from Middle East and East Asia were further separated into two subpopulations. Hence, K = 10 was taken as the possible number of subpopulations present in the reference set of sorghum (Fig. 3). AMOVA was performed with these ten subpopulations and more genetic variation within the subpopulations (71.7 %) than between the subpopulations (28.3 %) was observed.

Fig. 3
figure 3

STRUCTURE analysis identified ten subpopulations among reference set accessions of sorghum. Labels below refer to the cluster number in NJ analysis

Discussion

Any crop improvement program starts with identification of variability among the genotypes. Diversity analysis at the molecular level using PCR-based markers is the cheapest and rapid method of identifying the relationships among the genotypes. The user-friendly nature of SSR markers was successfully exploited in crop species for better understanding of the genetic diversity, domestication process and geographic divergence and distribution.

During the early stages, RFLP markers were used for diversity analysis (Aldrich et al. 1992; Tao et al. 1993). Because of very few numbers of genotypes included in those studies, the prevailing structure could not be deciphered well using RFLP and RAPD markers. However, an assessment of the population structure mainly depends on the type and number of markers used and representativeness of the samples analyzed. In the present study, samples were selected from a global genetic diversity analysis using SSR markers. A reference set of sorghum (Supplemental table S1) was selected for diversity analysis representing a significant portion of genetic variation with all five basic races, their intermediate races and wild genotypes. Markers selected for screening (Table 1) have complete genome coverage across the ten linkage groups of sorghum.

All of the EST-SSR markers (40) used in this study were polymorphic (100 %). Gupta et al. (2003) identified that only 55 % of the 20 EST-SSR markers used were polymorphic among 52 wheat accessions. Eujayl et al. (2002) reported a lower level of polymorphism (25 %) when 42 EST-SSR markers screened against 64 durum wheat germplasm lines. This clearly indicates that the percentage of polymorphism depends on the number and nature of the material used under analysis. In the present study, data analysis was carried out with 40 EST-SSR markers. Five markers (Xisep0209, Xisep0621, Xisep0728, Xisep1128 and Xisep1150) were removed before the analysis because of missing data. The AlleloBin program, used for converting raw allele calls to perfect allele calls, uses a step-wise mutation model (SMM; Kimura and Crow 1964). In the present study, 33 EST-SSR markers followed the SMM model, whereas 7 markers (Xisep0203, Xisep0523, Xisep0617, Xisep0630, Xisep1038, Xisep1140 and Xisep1202) did not follow the SMM model. However, these seven markers fit well in an alternative model called ‘infinite alleles model’ (IAM; Ohta and Kimura 1973), which assumes that each mutation (insertion and deletion) creates a new allele. A similar pattern of allele scoring was observed with genomic SSR markers in Folkertsma et al. (2005).

In total, 40 EST-SSR markers produced 360 alleles with an average of 9 alleles per marker (Table 1). This is the maximum number of alleles per marker reported using EST-SSR markers in any cereal to date. In case of tall-fescue grass, an average of 2.78 alleles/marker were reported (Saha et al. 2004), while 1.8 alleles/marker in bread wheat (Gupta et al. 2003) with 20 EST-SSRs, 4.5 alleles/markers in durum wheat with 42 EST-SSRs (Eujayl et al. 2002), 3 alleles/markers in 54 barley accessions using 38 EST-SSR markers (Thiel et al. 2003) and 4.6 alleles/marker in Crotalaria species (Wang et al. 2006) were reported. Compared to the results obtained with neutral genomic SSRs, the average number of alleles per marker detected in this study is comparable to that found in a limited size core collection (Caniato et al. 2007), or on geographically limited studies (e.g., Barnaud et al. 2007, Deu et al. 2008), but is lower than that found on the same material (Billot et al. 2013). This illustrates that although EST-SSR markers present a high discriminating power, they are under selective pressure.

The PIC values of markers can provide an estimate of the discrimination power in a set of accessions by taking not only the number of alleles, but also the relative frequencies of each allele (Smith et al. 2000). The average PIC value of EST-SSR markers (0.5230) was a bit higher in this reference set of sorghum in comparison with previous studies using EST-SSR markers for genetic diversity analysis in other crops, e.g., 0.443 in bread wheat (Gupta et al. 2003) and 0.45 in barley (Thiel et al. 2003). However, the average PIC value was lower compared to the PIC values of genomic SSR markers in sorghum [0.62 in both the studies of Agrama and Tuinstra (2003) and Caniato et al. (2007)]. However, this is higher than the PIC value reported by Folkertsma et al. (2005) using 100 guinea race accessions and 21 genomic SSR markers and by Ali et al. (2008) using 72 sorghum accessions with 41 SSR markers. This is on par with PIC (0.54) reported by Wang et al. (2009) in a study involving 96 sweet sorghum lines and 95 SSRs. High PIC values and large number of alleles per markers can also be attributed to the nature of the materials studied. As the reference set was derived from global germplasm collection covering worldwide origins and racial characterizations, and including wild samples, it is expected to produce a large number of alleles. SSR markers containing di-nucleotide repeats produced more alleles and hence greater PIC values (Table 1). These results were in harmony with previous studies by Smith et al. (2000), Agrama and Tuinstra (2003), Casa et al. (2005) and Deu et al. (2008).

Allelic distribution

The maximum number of alleles was reported in the bicolor race (58.89 %) followed by the caudatum race (57.78 %). Five markers were found to be monomorphic across the kafir accessions, while nine markers were found to be monomorphic among the Gma accessions. A large number of monomorphic markers and low percentages of allele contribution from Gma and kafir (25.28 and 29.72 %, respectively) to the total alleles are the major reasons for their separate clusters as compared to other races.

Genotypes from the African continent reported the maximum number of alleles (88.61 %) suggesting that the maximum diversity was found in that continent and also greatly supporting the idea that sorghum originated from Africa. In Africa, Eastern African accessions contributed the maximum number of alleles (69.72 %) supporting greatly the idea that sorghum originated from Eastern Africa. No marker was found to be monomorphic in African-originated accessions, whereas five markers were found to be monomorphic among the accessions derived from Eastern Asia and four markers were monomorphic among the accessions derived from the Middle East. In both these cases, Xisep0422 and Xisp1231 were found to be monomorphic.

Wild species are the most divergent (capturing 71.39 % of total alleles) in the present study as reported by Deu et al. (1994, 2006). Cultivated and wild genotypes shared 217 alleles in common. Among 40 unique alleles reported by wild accessions, 8 were contributed only from one wild accession, IS 18868, belonging to S. bicolor subspecies verticilliflorum. This accession has passport data indicating its origin as the USA, but this likely means that it entered the global germplasm collection via the USA rather than as a direct field collection from Africa. These rare alleles may be a target of plant breeders if these alleles have any significant role in trait-specific characters. A large number of rare alleles in the present study (37.2 %) were detected due to the presence of many unique alleles in the wild and cultivated germplasm. Two markers did not detect any rare alleles and they had only few alleles.

Structure of genetic diversity

Based on floral and grain morphology, sorghum cultivars were grouped into five basic races and ten intermediate races. Reference set accessions were grouped primarily according to race within their geographic origin (Fig. 2) and in agreement with previous studies using RFLP markers (Deu et al. 1994, 2006), SSRs and RFLPs (Menz et al. 2004), SSRs (Deu et al. 2008) and recently by using DArT markers (Mace et al. 2008; Bouchet et al. 2012). Cluster analysis of accessions based on EST-SSR allelic variation divided them into eight groups (Fig. 2). On a finer discrimination level, a racial pattern was found within eight clusters on the dendrogram. Racial discrimination by markers was first observed by Deu et al. (1994) in sorghum.

Average gene diversity (H e, also known as expected heterozygosity) among the reference set is 0.5690 with 40 EST-SSR markers (Table 1). As expected with EST-SSR markers, this is lower in comparison with previously published results in a small set of highly polymorphic SSRs in materials from Morocco (H e = 0.84) by Dje` et al. (1999), in Eritrea (H e = 0.78) by Ghebru et al. (2002), Niger accessions (H e = 0.61), in South Africa (H e = 0.60) by Uptmoor et al. (2003) and in Kenyan sorghums (0.6627) by Ngugi and Onyango (2012). The observed heterozygosity reported in the present study is much less (0.039), which confirms the use of single locus/single copy markers and is expected of autogamous accessions maintained by several generations of enforced self-pollination. Besides, the large differences between the expected and observed heterozygosity is reinforced by the population structure.

In previous studies, accessions belong to bicolor race were found scattered across all clusters and considered as the most heterogeneous and ancient race with wider geographical distribution and diverse uses (forage, broom-corn and sweet stalks) (Doggett 1988; Dje` et al. 2000; Deu et al. 2006; Mace et al. 2008). However, in the present study, EST-SSR markers showed their ability to differentiate the bicolor race mainly into two major groups (Fig. 2). African bicolors was grouped in cluster III along with the homogeneous kafir race derived from Southern Africa. Some of the bicolor accessions derived from Africa were grouped closely with wild accessions in cluster VIIc. A couple of bicolor accessions derived from India were grouped in clusters Va and VIa. Grouping of bicolor race with wild accessions confirms their resemblance to spontaneous weedy sorghum and the most primitive grain sorghum (Casa et al. 2005) and also their sharing rare alleles with wild sorghum. This suggests that EST-SSR markers have good discriminating power in relation to the bicolor race. Similarly, Mace et al. (2008) also found the grouping of one bicolor race (IS 12179C) with wild genotypes.

Caudatum is the race of greatest interest because it provides the genes responsible for higher grain yields with excellent seed quality (Mace et al. 2008), whereas the guinea race is of interest to breeders due to its great genetic diversity (Deu et al. 1994, 1995). The caudatum race was grouped mainly into three different clusters (Fig. 2), which were largely geographic origin specific. Most of the African caudatums were grouped in cluster I, whereas a group of accessions from Eastern Africa (AfricaE) were grouped in cluster IV. Eastern Asian and North American origin caudatums were grouped in cluster VIa. A couple of caudatum race accessions from Eastern Africa (AfricaE) were grouped with wild accessions (VIIb).

Durra race accessions were grouped primarily in cluster Va (Fig. 2), largely comprising durra genotypes originating from India. Most of the durras were considered as drought-tolerant genotypes based on their adaptations to very high temperatures and/or receding moisture conditions (Deu et al. 2006). Durra accessions and intermediate races with durra, durra–caudatum (DC) genotypes from Middle East were grouped in cluster Vb. A group of DC was grouped in cluster VIII. This is contrary to the conclusions drawn by Deu et al. (2008) where all intermediate races with durra were clustered in the same cluster of durra.

Kafir race is grouped in only cluster III primarily derived from Southern Africa (AfricaS) (Fig. 2). Interestingly, an intermediate race with caudatum (KC) also grouped in the same cluster. This suggests that the kafir race is the least divergent and expected to be the most homogeneous group among all the races. The same conclusions were drawn in previous studies (Deu et al. 1994, 1995, 2006; Cui et al. 1995; Menkir et al. 1997; Dje` et al. 2000). The homogenous nature in the kafir race was supported by the presence of a lower number of alleles (29.72 %) and five monomorphic markers. These results are in agreement with the recent origin and restricted geographic distribution of the kafir race (Doggett 1988; Deu et al. 2006).

Because of its diverse nature, the guinea race has become the choice of breeders (Deu et al. 1994, 1995; Folkertsma et al. 2005) in crop improvement programs. Guinea race exhibits a moderate level of genetic diversity and is grouped into three major clusters according to their origin and distribution (Fig. 2). Similar grouping patterns were observed with isozymes, RFLP probes and SSRs (Deu et al. 1994, 2006, 2008; Cui et al. 1995; Folkertsma et al. 2005; Barnaud et al. 2007). Guinea accessions derived mainly from Western Africa (AfricaW) were grouped into two separate clusters, margaritiferum and non-margaritiferum. Guinea guineense (Ggu) and guinea gambicum (Gga) were grouped with some other guinea lines in cluster II, whereas the most Gma accessions clustered in a small group (cluster VIIa) closely knit with wild accessions, which mainly originated from Western Africa (AfricaW) (Fig. 2). Hence, Gma may be considered as a recently evolved “‘primitive form’ of the guinea race”. Guinea accessions originating from India were grouped separately in cluster IV and closely with Southern African guineas (including guinea roxburghii, Gro) suggesting a recent introduction of Asian forms from Southern Africa. Deu et al. (2006) also found a separate cluster for guineas derived from Asia.

The singularity of Gma race accessions is in accordance with previous studies (Deu et al. 1994, 1995, 2006, 2008; Cui et al. 1995; de Oliveira et al. 1996; Folkertsma et al. 2005; Barro-Kondombo et al. 2010). The singular nature of Gma was due to the presence of only 25.27 % of alleles found across the sorghum reference set and a maximum of nine monomorphic markers. All Gma accessions in cluster VIIa originated from Western Africa (AfricaW). The distinct nature of Gma from other guinea race subgroups is remarkable, since both are infertile and cultivated in sympatry in the same season by the same farmers (Deu et al. 2006). The Gma subgroup was distinct from other guinea races and closer knit with wild genotypes than other cultivated sorghums suggesting that the Gma group represents an independent domestication event. A Gma accession of Southern African origin (IS 19455) was not grouped with Western African margaritiferum, but with other guineas originating from Southern Africa. Grouping of this accession is in agreement with previous studies by Deu et al. (1995, 2006). This suggests that Southern African margaritiferum shares a common ancestor with Western African margaritiferum and the change in its genetic background as compared to Western African guinea margaritiferum is due to their isolation and selection pressure (Deu et al. 2006). Grouping of accessions is not much affected by removal of wild accessions from cluster analysis (Supplemental figure S3).

STRUCTURE analysis clearly identified ten subpopulations according to their race within the geographic origin in the reference set germplasm lines. This in agreement with clustering observed in distance-based method, with AMOVA results indicating that a large amount of variation (71.7 %) is present within the subpopulations. This confirms that the reference set used in this study includes diverse germplasm lines and is suitable for allele mining and association mapping.

Geneflow in sorghum

Exchange of genes is one of the major factors in evolution of domesticated plant species (Harlan 1992). Gene flow from cultivated to wild, weedy and feral relatives disturbs the size and dynamics of wild and weedy populations, resulting in the disturbance of natural gene pools and endangering the wild relatives. This in turn leads to loss of natural genetic diversity (Akimoto et al. 1999; Snow et al. 2003). Molecular analysis clearly identified the existence of gene flow in crop plants (Mariac et al. 2006; Sagnard et al. 2011). In the present study, some of the landraces were grouped with wild genotypes and most of them were derived from Africa (Fig. 2). A wild accession (IS 21691, S. drummondii from Western Africa) was grouped in cluster III along with kafir and a couple of wild accessions were grouped with the guinea race in cluster II. This can be attributed to gene flow between landraces or cultivars and wild genotypes. This may be because of wild accessions growing around the cultivated sorghums in Africa as observed by Tesso et al. (2008) and Sagnard et al. (2011). Existing cultural practices in Africa (growing side by side or in close proximity) are the major source of gene flow from cultivated to wild genotypes and vice versa. Wild accessions were found in Africa in crop margins, barren lands, hill bottom areas and in the fields where crops were abandoned due to severe drought, pest and weed infestation or extreme nutrient deficiency (Tesso et al. 2008). This suggests that wild accessions have the desirable alleles to resist these factors. This would encourage sorghum breeders to use wild alleles for transfer into locally adapted improved varieties in their crop improvement programs.

In conclusion, functional diversity analysis with proper phenotyping in multiple locations will help the breeders to mine for trait-specific alleles and facilitate an effective method of identifying the gene/QTL for different agronomic traits through association mapping (Abdurakhmonov and Abdukarimov 2008; Zhu et al. 2008; Casa et al. 2008; Upadhyaya et al. 2012). Thus, the sorghum diversity panel representing global genetic diversity can be used as a community resource for allele mining and genome-wide association mapping using SNPs generated through NGS technologies. A combination of association mapping with physical mapping helps in identifying the gene sequences underlying a particular trait. This further helps in better understanding of the variation at the sequence level which in turn helps to identify the genomic location responsible for different traits in closely related crops through comparative genome mapping.