Sorghum [Sorghum bicolor (L.) Moench] is an important staple food and fodder crop in tropical/semi-tropical African and Asian regions (Doggett 1998). All cultivated sorghums belong to S. bicolor ssp. bicolor, which is further divided into five races (bicolor, caudatum, guinea, durra, kafir) based on morphological characters. There are also 10 intermediate races resulting from all possible inter-racial crosses (Harlan and de Wet 1972). Information on genetic diversity in the available germplasm is essential in deciding their utility in crop breeding programmes. Earlier, morphological characteristics (Dje et al. 1998) and isozymes (Danquah et al. 2000) were employed for the assessment of genetic diversity in sorghum. These approaches require more time and expertise to efficiently classify material and are based on limited traits. Moreover, the results are subjective, due to environmental influences. This situation demands the development of robust molecular marker technology which will help in a better estimation of genetic diversity.

Among the molecular markers, microsatellites are the most favored for a variety of applications in crop genetics and breeding because of their multi-allelic nature, high reproducibility, co-dominant inheritance, abundance and extensive genome coverage (Gupta and Varshney 2000). The usefulness of microsatellite markers in genetic diversity analysis (Shehzad et al. 2009; Wang et al. 2009) and quantitative trait locus (QTL) mapping (Parh et al. 2008; Wu and Huang 2008) has been amply demonstrated in sorghum. The availability of the complete sequence of the sorghum genome in the public domain offers the opportunity for the identification of motifs of interest through in silico approaches, thereby helping in the rapid development of PCR-based markers. Earlier studies have revealed that (GATA) n motifs are abundant in many eukaryotic genomes (Singh 1995) and several such motifs serve as transcription factors (Trembley and Viger 2003). Even though the (GATA) n motif-based microsatellite markers are widely used for paternity testing, genetic counseling and identification of individuals in humans (Pena et al. 1994), their utility in plant species is very limited. A small number of reports in crops such as pearl millet (Chowdari et al. 1998), rice (Davierwala et al. 2001), papaya (Parasnis et al. 1999), chickpea (Serret et al. 1997), sunflower (Mosges and Friedtu 1994) and tomato (Grandillo and Tanksley 1996; Rao et al. 2006) have highlighted the potential of (GATA) n motifs in analyzing the genetic variation. Recently, microsatellite markers targeting (GATA) n motifs across the genome were developed in rice and their utility in genetic diversity assessment was demonstrated (Rajendrakumar et al. 2009). Such markers have not been reported in sorghum. The objectives of this study were to analyze the abundance and distribution pattern of Class I (GATA) n motifs in sorghum, develop PCR-based markers and demonstrate their utility in genetic diversity analysis.

The whole genome sequence of sorghum (S. bicolor cv. BTx623) was downloaded from the Joint Genome Institute database (Paterson et al. 2009; ftp://ftp.jgi-psf.org/pub/JGI_data/phytozome/v7.0/Sbicolor/). Class I microsatellites (motifs with repeat length ≥ 20 nt; McCouch et al. 2001) with (GATA) n motifs were identified in the sorghum genome using a Perl script developed in-house. The identified (GATA) n motifs were designated as genic (exons and introns), upstream (within 2 kb from the transcription start site) and non-genic based on the results of BLASTN analysis of the motif as query sequence using the gene annotation information available at the Phytozome database (http://www.phytozome.net/sorghum). Primers were designed targeting all the Class I (GATA) n motifs (Supplementary Table 1) using the web resource Primer3 (http://frodo.wi.mit.edu/primer3/). Primer pairs targeting 50 motifs (five per chromosome) with maximum repeat lengths were used for further analysis and were physically mapped in the sorghum genome based on their positions using the software MapChart 2.2 (Voorrips 2002). Through a preliminary analysis, 48 primer pairs which showed good amplification were selected and then used for amplification of DNA isolated from 24 sorghum genotypes (Supplementary Table 2) representing different races. Total genomic DNA was extracted using the CTAB method (Saghai-Maroof et al. 1984) and used for PCR amplification according to the procedure described by Panaud et al. (1996) with modifications. PCR was carried out using 20 ng of template DNA, 2 pmol of each primer, 0.1 mM dNTPs, 1 × PCR buffer (Bangalore Genei, India), and 1 unit of Taq DNA polymerase (Bangalore Genei, India) in a total volume of 10 μl. The PCR cycling conditions were: a 5 min initial denaturation at 94 °C followed by 35 cycles of 30 s denaturation at 94 °C, 30 s annealing at 55 °C and 1 min extension at 72 °C, concluding with a final extension at 72 °C for 7 min.
PCR products were resolved on a Bio-Rad Sequi-Gen™ sequencing electrophoresis apparatus (Bio-Rad Laboratories, USA) in 6 % non-denaturing polyacrylamide gel. The DNA fragments were visualized by silver staining (Fritz et al. 1995). Amplicons were scored individually as ‘1’ and ‘0’ for the presence or absence, respectively, of an allele, and a binary data matrix was generated. The polymorphism information content (PIC) value for each microsatellite marker was calculated using the online resource ‘Polymorphism Information Content Calculator’ available at www.liv.ac.uk/~kempsj/pic.html. The binary data matrix was subjected to cluster analysis. Genetic diversity estimate-related analyses were done using NTSYSpc ver.2.02i (Rohlf 2000). Genetic similarities (GS) between pairs of accessions were measured by the DICE similarity coefficient based on the proportion of shared alleles (Dice 1945; Nei and Li 1979) with SIMQUAL module. The clustering of accessions was done based on a similarity matrix using an unweighted pair group method with arithmetic average (UPGMA) algorithm following SAHN module. The similarity result was then used to construct a dendrogram following TREE module. The reliability of the tree was tested by bootstrap analysis using the software Winboot (Yap and Nelson 1996).
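
As an illustration of the similarity measure described above, the Dice coefficient for two genotypes scored as binary allele profiles is twice the number of shared alleles divided by the total number of alleles present in the two profiles. The following Python sketch is not the NTSYSpc SIMQUAL implementation used in the study; the function names and the toy profiles are hypothetical and serve only to show the arithmetic.

```python
def dice_similarity(profile_a, profile_b):
    """Dice coefficient for two binary allele profiles (1 = present, 0 = absent).

    Computed as 2 * shared / (alleles in a + alleles in b). Returns 0.0 when
    neither profile carries any allele (a convention assumed for this sketch).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same set of alleles")
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a == 1 and b == 1)
    total = sum(profile_a) + sum(profile_b)
    return 2 * shared / total if total else 0.0


def similarity_matrix(profiles):
    """Pairwise Dice similarities for a dict of {genotype name: binary profile}."""
    names = list(profiles)
    return {
        (g1, g2): dice_similarity(profiles[g1], profiles[g2])
        for g1 in names
        for g2 in names
    }


# Hypothetical toy data: two genotypes scored for five alleles.
toy_profiles = {
    "genotype_1": [1, 1, 0, 1, 0],
    "genotype_2": [1, 0, 1, 1, 0],
}
toy_matrix = similarity_matrix(toy_profiles)
```

In this toy case the two genotypes share two alleles out of three each, giving a similarity of 2 × 2 / 6 ≈ 0.67. Distances derived from such a matrix (1 − similarity) could then be clustered with any average-linkage routine, which is the UPGMA criterion.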

The present study identified a total of 128 Class I microsatellites with (GATA) n motifs in the sorghum genome (Supplementary Table 3). Of these, 14 were in the genic region, 16 were present upstream of genes and the remaining 98 repeats were non-genic. Among the genic motifs, two were present in the 3′ untranslated region (UTR), while 12 were present in introns. Only 23.44 % (30 out of 128) of the (GATA) n motifs in sorghum may have functional significance due to their presence in genic and upstream regions, which is lower than the proportion in rice (51.03 %) reported by Rajendrakumar et al. (2009). The lower number of Class I (GATA) n motifs in the sorghum genome (128) than in the rice genome (243) may be attributed to the presence of gaps in the sorghum genome sequence and also to the more recent evolution of rice as compared to sorghum. The details of the distribution of these motifs are given in Supplementary Table 4. The frequency of (GATA) n motifs was high in relatively smaller chromosomes (chromosomes 5 and 7) while the frequency was low in a larger chromosome (chromosome 2). This observation indicated a lack of correspondence between the number of (GATA) n motifs and chromosome size, similar to that observed in humans and primates by Subramanian et al. (2003). It is interesting to note that the (GATA) n motifs show a skewed distribution and are mostly found in clusters, with a substantial proportion of them (37.50 %) towards the telomeric region of the chromosome; chromosome 9 had the lowest proportion of motifs in the telomeric region (12.5 %) while chromosomes 6 and 7 had the highest (55.55 %). This may be due to the conservation of sequences in the telomeric regions. This observation is in contrast to that of rice, where these motifs were found as clusters distributed across the length of chromosomes (Rajendrakumar et al. 2009), and of tomato, where these motifs were clustered around the centromere (Areshchenkova and Ganal 1999; Grandillo and Tanksley 1996).
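
The Class I criterion used to identify these motifs (repeat length ≥ 20 nt, i.e. at least five consecutive GATA units) can be sketched in a few lines of Python. This is not the in-house Perl script used in the study; it is a minimal illustration that assumes perfect (uninterrupted) repeats and scans only the forward strand, and the function name and test sequence are hypothetical.

```python
import re

# Assumption: a Class I (GATA)n microsatellite is a perfect tandem run of the
# GATA unit spanning at least 20 nt, i.e. five or more consecutive repeats.
MIN_REPEATS = 5
GATA_RUN = re.compile(r"(?:GATA){%d,}" % MIN_REPEATS)


def find_gata_motifs(sequence):
    """Return (start, end, repeat_count) for each perfect (GATA)n run >= 20 nt.

    Coordinates are 0-based and end-exclusive. Only the forward strand is
    scanned; the complementary strand would need a search for (TATC)n.
    """
    hits = []
    for match in GATA_RUN.finditer(sequence.upper()):
        length = match.end() - match.start()
        hits.append((match.start(), match.end(), length // 4))
    return hits


# Hypothetical sequence: one run of six repeats (24 nt, Class I) and one run
# of three repeats (12 nt, too short to qualify).
example = "CC" + "GATA" * 6 + "TT" + "GATA" * 3
motifs = find_gata_motifs(example)
```

Only the first run in the example qualifies; the run of three repeats falls below the 20 nt threshold and is ignored.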

Fourteen Class I (GATA) n motifs were identified in the genic region of the sorghum genome, of which two were present in the 3′ UTR and 12 were intronic. One of the motifs in the 3′ UTR was present in the gene encoding monosaccharide transport protein 1, while the other was present in a gene encoding an expressed protein (Supplementary Table 4). The presence of microsatellite motifs in genic regions could have implications for the molecular function of the encoded protein. The presence of (GATA) n motifs in genic regions of metazoan eukaryotic genomes and its implications for post-transcriptional signaling of mRNAs encoding membrane-associated proteins were reported by Riley and Krieger (2002). Interestingly, a clear association of (GATA) n motifs with the sex of papaya was reported by Parasnis et al. (1999). The presence of these motifs in genes encoding disease resistance in rice was reported by Davierwala et al. (2001).

Twelve (GATA) n motifs were present in the intronic region of genes. Nine of these motifs were present in genes encoding proteins with known functions while three motifs were present in genes with unknown function (Supplementary Table 4). The genes coding for proteins with known function include those for putative CLB1 (calcium-dependent lipid binding) protein, major facilitator superfamily protein, putative peroxidase 1, putative serine-type endopeptidase inhibitor, putative transcription activator RF2a, SBP-domain protein 1, auxin efflux carrier component 1, caffeic acid 3-O-methyltransferase and cytochrome P450 71E1. The caffeic acid O-methyltransferase gene is responsible for the brown mid-rib trait (bmr12) in sorghum and was cloned through a candidate gene approach by Bout and Vermerris (2003). The presence of (GATA) n motifs in the intronic regions of genes may be associated with differential splicing of mRNA and determination of splice junctions, similar to that reported in rice (Rajendrakumar et al. 2009). In humans, a GATA motif present in the conserved region within intron 1 of the α-synuclein gene (SNCA), which is associated with Parkinson’s disease, directly induces a 6.9-fold increase and is thereby involved in gene regulation (Scherzer et al. 2008).

A total of 16 (GATA) n motifs were identified in the upstream regions of genes. Six of these lie upstream of genes coding for putative uncharacterized proteins, while the remaining 10 lie upstream of genes coding for proteins with known as well as unknown functions. GATA sequences present upstream of genes have been associated with gene regulation (McNeil et al. 2008; Trembley and Viger 2003). The majority of the Class I (GATA) n motifs identified (98 motifs) were present in non-genic regions (Supplementary Table 4).

PCR-based markers were developed by designing primer pairs targeting all the Class I (GATA) n motifs present in the sorghum genome. If two or more (GATA) n motifs were found adjacent to one another, a single primer pair flanking these motifs was designed. Thus a total of 110 microsatellite markers were developed, which were distributed across 10 chromosomes (Supplementary Table 1). Primers could not be designed for eight motifs because of the presence of gaps in the flanking regions. The physical map location of the (GATA) n motif-based microsatellite markers developed in this study is shown in Supplementary Fig. 1. To assess the potential of (GATA) n motif-based microsatellite markers, 50 markers were selected as mentioned above and analyzed in a set of diverse sorghum genotypes. All the markers except SbGM1-5 and SbGM8-3 showed clear and robust amplification, of which 38 markers were polymorphic. A total of 233 alleles were generated by these 38 microsatellite markers among 24 genotypes. The number of alleles ranged from 3 to 10 with an average of 6.13 per marker. The marker SbGM4-2 amplified a maximum number of 10 alleles, while six markers (SbGM2-2, SbGM3-4, SbGM5-8, SbGM6-6, SbGM6-7 and SbGM8-2) amplified only three alleles among the genotypes studied. An average PIC value of 0.69 was recorded for these markers, with SbGM8-1 exhibiting a maximum PIC value of 0.86, while SbGM3-4 exhibited a minimum value of 0.26. The total number of alleles generated and the number of polymorphic alleles with their corresponding PIC values are listed in Table 1.
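
To make the PIC values reported above concrete, the sketch below computes PIC from the allele frequencies at a single marker. The online calculator used in the study is not described in detail here, so the formula is an assumption: the sketch uses the commonly cited formulation of Botstein et al. (1980), PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², and also shows the simpler gene diversity, 1 − Σ p_i². Function names and the example frequencies are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(allele_calls):
    """Relative frequency of each allele in a list of allele calls."""
    counts = Counter(allele_calls)
    total = sum(counts.values())
    return [n / total for n in counts.values()]


def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC (assumed Botstein et al. 1980 form):
    1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2.
    """
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * (p * q) ** 2 for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross_term


# Hypothetical marker with four equally frequent alleles.
equal_freqs = [0.25, 0.25, 0.25, 0.25]
example_pic = pic(equal_freqs)
example_diversity = gene_diversity(equal_freqs)
```

For four equally frequent alleles, the gene diversity is 0.75 and the PIC under this formula is 0.703125; PIC is always the lower of the two and approaches gene diversity as the number of alleles increases.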

Table 1 Details of polymorphic (GATA) n motif-based microsatellite markers used in the study

The average number of alleles per marker (6.13) detected in this study using 38 polymorphic microsatellite markers was higher than that detected by Schloss et al. (2002) on 25 sorghum lines (3.4), Ali et al. (2008) on 72 sorghum accessions (3.22), and Agrama and Tuinstra (2003) and Smith et al. (2000) with mean allele numbers per marker of 4.3 and 5.9, respectively. The gene diversity or PIC observed in this study (mean PIC value = 0.69) is higher than the gene diversity values (0.40, 0.46, 0.58, 0.62) reported by Ali et al. (2008), Schloss et al. (2002), Smith et al. (2000) and Agrama and Tuinstra (2003), respectively. It is interesting to note that (GATA) n motif-specific markers exhibited good allelic diversity. This observation is similar to the high resolution power and good allelic diversity of (GATA) n motif-based microsatellite markers reported in rice by Rajendrakumar et al. (2009). Most of these markers yielded sharp and robust amplification without multiple amplicons, which may be due to the limited number of (GATA) n motifs and their unique occurrence without multiple primer binding sites, similar to that reported in rice by Rajendrakumar et al. (2009). The (GATA) n motif-based microsatellite markers developed in this study can be considered informative and unique, and can supplement the microsatellite markers reported by many research groups and the genome-wide microsatellite markers reported by Yonemaru et al. (2009).

Based on the 233 polymorphic alleles, the Dice genetic similarity coefficient was estimated for each pair of the 24 sorghum genotypes, and ranged from 0.08 to 0.63 (Supplementary Table 5). The wide range of similarity index values indicated the presence of ample diversity among the genotypes analyzed. The greatest genetic diversity in this study was observed between ELG15–Selection3 and E77–EG45 (similarity value of 0.08) followed by Selection3–C43, BTx623–IS18551, ELG15–M35-1, ELG15–IS18551, E77–C43, E77–IS18551, E77–EG40, EG15–M35-1, EP117–E145 and EJ67–IS8525 with a similarity value of 0.11. A dendrogram constructed based on the molecular marker polymorphism grouped the 24 sorghum genotypes into two major clusters with an average similarity of 24.4 % (Fig. 1). Cluster 1 included 11 genotypes that were grouped into four sub-clusters (1A, 1B, 1C and 1D). Sub-cluster 1A consisted of two genotypes (296B and B35) while sub-cluster 1B comprised three genotypes (CSV216R, EG40 and EG45). Three genotypes each were grouped in sub-clusters 1C (Selection3, M35-1 and IS18551) and 1D (EA11, EP117 and PEC17). Cluster 2 consisted of 13 genotypes that were grouped into three sub-clusters (2A, 2B and 2C). Sub-cluster 2A consisted of four genotypes (27B, B58586, IS8525 and E77) while sub-cluster 2C comprised two genotypes (EJ67 and UPMC503). Sub-cluster 2B was the major sub-cluster and comprised seven genotypes (C43, E36-1, SSV84, BTx623, EG15, ELG15 and E145). The majority of the genotypes belonging to post-rainy sorghum were contained in Cluster 1. Cluster 2 consisted of genotypes belonging to different races and there was no distinct grouping based on race. Forage sorghum genotypes (EJ67 and UPMC503) were grouped together in sub-cluster 2C. It is well known that most of the post-rainy sorghum varieties belong to the durra type whereas kharif cultivars belong to the caudatum and kafir races (Reddy et al. 2003).
In addition, post-rainy sorghums are characterized by tolerance to shoot fly, stalk rot and terminal stress and possess large lustrous grain with semi-corneous endosperm (Reddy et al. 2006). The clusters identified in this study do not appear to correspond to sorghum races. Similarly, the study by Dje et al. (2000) pointed out that the races of sorghum are not substantially differentiated genetically. Moreover, Menz et al. (2004) and Gabriel (2005) reported that working groups and races in sorghum are not well defined. The main reason for not finding a clear clustering pattern among races could be that each race was not represented by a sufficient number of lines, as suggested earlier by Moncada (2006). Hence, studies including a larger number of lines per race and a greater number of marker loci should provide a better interpretation of genetic diversity in sorghum. Nevertheless, the (GATA) n motif-based microsatellite markers used in this study revealed the allelic richness among the diverse sorghum genotypes analyzed, making them suitable for application in crop genetics and marker-assisted breeding.

Fig. 1 Dendrogram constructed using (GATA) n motif-based microsatellite marker data revealing the clustering of sorghum genotypes

In conclusion, the present study was helpful in the identification of Class I (GATA) n motif-containing microsatellites and their distribution pattern in the sorghum genome. Some of these motifs that are present in the genic region and upstream of genes might have functional significance, which needs to be investigated in detail. To the best of our knowledge, this is the first report of the genome-wide development and validation of (GATA) n motif-based microsatellite markers in sorghum. The study also demonstrated the utility of these markers in genetic diversity analysis. Sharp and robust amplification combined with high polymorphism makes the (GATA) n motif-based microsatellite markers suitable for genetic diversity assessment, cultivar identification, hybrid purity testing, QTL mapping and marker-assisted selection.