Introduction

Horsegram [Macrotyloma uniflorum (Lam.) Verdc.] is an important pulse crop in many Southeast Asian countries, including India. It is also cultivated as a forage crop in some other semi-arid regions of the world and grown as green manure in many tropical countries, especially in Australia and Africa. It is an underutilized warm-season food legume for which little genetic and genomic information is available. Horsegram is a diploid plant species with reported chromosome numbers of 2n = 2x = 20, 22, or 24. The estimated genome size of horsegram is 400 Mbp (Bhardwaj et al. 2013). The name Macrotyloma is derived from the Greek words macros, meaning large, tylos, meaning knob, and loma, meaning margin, in reference to knobby structures on the pods (Blumenthal and Staples 1993). The genus Macrotyloma consists of nearly 25 species indigenous to Africa and Asia. Of these, Macrotyloma var. uniflorum is the only cultivated species (Allen and Allen 1981; Lackey 1981). In India, horsegram is cultivated as a summer crop in Karnataka, Tamil Nadu, and Andhra Pradesh and in the northwestern and central Himalayan regions of Himachal Pradesh, Jammu & Kashmir, and Uttarakhand. It is also grown in Punjab, Bihar, Uttar Pradesh, Madhya Pradesh, Rajasthan, Maharashtra, and Gujarat. The fodder is rich in protein and free from digestive inhibitors; it is widely used as a feed for milch animals and horses (Prakash et al. 2008). Horsegram is a nutritious food legume and provides many essential dietary components. The protein content in seeds of cultivated horsegram is reported to be 16.9–30.4% (Patel et al. 1995). It has a high content of lysine, an essential amino acid (Gopalan et al. 1989). Horsegram seeds are a rich source of phosphorus, iron, and vitamins such as carotene, thiamine, riboflavin, niacin, and vitamin C (Sodani et al. 2004). It is known to contain many medicinal and therapeutically active principles and has thus been suggested as an Ayurvedic medicine for a variety of ailments such as edema, piles, and renal stones.
It exhibits high antioxidant properties and contains molybdenum, which regulates calcium, and iron, which helps in transporting oxygen to cells and forms part of hemoglobin in blood (Ramesh et al. 2011; Murthy et al. 2012). Horsegram also helps in lowering blood cholesterol levels (Mehra and Upadhyaya 2013). Chaitanya et al. (2010) reported that the seeds of M. uniflorum have significant anti-urolithiatic activity. Lipids extracted from horsegram have been reported to heal peptic ulcers in rats (Jayaraj et al. 2000). Eco-physiologically, horsegram is a drought-hardy crop species and exhibits many desirable traits such as tolerance to drought, salinity, and heavy metal stresses. These attributes make it a potential future crop (Sosulski and Young 1979; Kadam et al. 1985; Oram 1990; Smartt 1990; Reddy et al. 2008). With all these features, horsegram holds great potential as a nutraceutical, forage, and food commodity in drought-prone areas of the world (Morris 2008).

Although it is a crop species of potentially high value, horsegram has received little attention from researchers for its genetic improvement. This may be due to the limited availability of genetic resources and molecular breeding tools in this crop. Therefore, it is important to develop marker resources in this crop that could be used for genetic diversity, linkage, QTL, and association mapping studies. In addition, marker-assisted selection (MAS) and genomic selection (GS) can also be performed if sufficient genomic resources become available for the purpose. So far, only 232 characterized SSRs along with 13 ILP markers have been developed in this crop (Sharma et al. 2015a). However, more marker resources are required in the species to initiate a genetic improvement programme. Simple sequence repeats (SSRs) are widely used because of their co-dominant inheritance, multi-allelic nature, high reproducibility and transferability, extensive genome coverage, and simple detection methods (Varshney et al. 2005; Agarwal et al. 2008). Thus, in the present study, genome-wide SSR markers were developed in horsegram using Illumina sequencing and were used for genetic diversity analysis in horsegram germplasm. The cross-transferability of these markers was also examined across nine related legume species.

Material and methods

Sequencing and assembly of reads

Illumina paired-end libraries were constructed from total genomic DNA according to the manufacturer's instructions. The libraries were subjected to genome sequencing with an Illumina HiSeq 1000. A total of 375 M paired-end reads of “HPK4” were obtained by an Illumina HiSeq 2000 and subjected to genome assembly with SOAPdenovo2 r223 (Li et al. 2010). After the assembly, gaps in the scaffolds were closed with GapCloser 1.10 (p = 31). The assembly comprised 237,662 scaffolds (kmer = 81) consisting of 313,833,559 bases (N50 length = 97,711 bases). This sequence information was used to identify genomic SSRs.
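The N50 length quoted for the assembly can be illustrated with a minimal Python sketch. It assumes the common definition of N50 (the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembled bases); the function name `n50` and the example lengths are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the assembly size; the
    length of the scaffold that crosses that threshold is the N50.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("at least one scaffold length is required")


# Hypothetical toy assembly of five scaffolds (20 bases in total).
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)  # 6 + 5 = 11 >= 10, so the N50 is 5
```

Applied to the real scaffold lengths, the same procedure would be expected to give the reported N50, provided the assembler used the same definition.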

SSR mining and primer design

The sequence database obtained as described above was used as a source of potential SSRs with different repeat types. After pre-processing, SSR-containing sequences were identified with the Perl script-based program MISA (MIcroSAtellite identification tool, http://pgrc.ipk-gatersleben.de/misa/). The criteria for identification of SSRs were a minimum of six repeat units for di-nucleotide repeats (DNR) and a minimum of five repeat units for all other repeat types, i.e. tri-nucleotide repeats (TNR), tetra-nucleotide repeats (TtNR), penta-nucleotide repeats (PNR), and hexa-nucleotide repeats (HNR). All SSR-containing sequences were used for primer design with the BatchPrimer3 software (http://probes.pw.usda.gov/cgibin/batchprimer3/batchprimer3.cgi) using the following criteria: (1) primer length of 18 to 22 bp with an optimum of 20 bp, (2) PCR product size of 100 to 400 bp with an optimum of 300 bp, (3) annealing temperature of 45 to 60 °C, and (4) GC content of 40 to 60% with an optimum of 50%. Newly designed SSRs were named M. uniflorum Genomic SSR (MUGSSR).
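The repeat-count thresholds described above can be sketched as a small regular-expression search in Python. This is an illustration of the thresholds only, not the MISA program itself: it finds perfect repeats, ignores compound SSRs, and the function name `find_ssrs` and the test sequence are hypothetical.

```python
import re

# Minimum repeat counts per motif length, taken from the criteria in the text:
# six for di-nucleotides, five for tri- to hexa-nucleotides.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # ([ACGT]{size}) captures one motif copy; the backreference \1 must
        # then occur at least (min_rep - 1) more times in tandem.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit
            # (for example ATAT or AAA), so each SSR is reported once.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Hypothetical sequence: (AT)7 and (AAG)5 pass the thresholds, (AG)5 does not.
example = "GGC" + "AT" * 7 + "CCT" + "AAG" * 5 + "TTC" + "AG" * 5
ssrs = find_ssrs(example)
```

Each reported SSR could then be passed, together with its flanking sequence, to a primer design tool using the length, product size, temperature, and GC criteria listed above.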

Plant materials and DNA isolation

A panel of 48 horsegram lines, collected from different horsegram-growing regions of the country and maintained at the Department of Agricultural Biotechnology, CSK HP Agricultural University Palampur, India, was selected for diversity analysis and characterization of the newly developed SSRs. For cross-transferability analysis, nine related legume species were used, namely M. sar-gharwalensis, Cicer arietinum, Vigna unguiculata, Lens culinaris, Vigna mungo, Pisum sativum, Trifolium pratense, Vigna umbelleta, and Phaseolus vulgaris. Genomic DNA was isolated from young leaves of each plant using the CTAB method (Doyle and Doyle 1990) with some modifications. The quantity and quality of DNA were estimated on a 0.8% agarose gel by comparison with uncut lambda DNA (Fermentas, Lithuania).

SSR genotyping

The polymorphism survey of 117 newly developed primer pairs in 48 selected horsegram lines was carried out following the method of Sharma et al. (2015a). The polymorphic primer pairs were also used for the cross-transferability study in related species. For amplification of genomic DNA, a reaction mixture of 10 μL was prepared using 4.8 μL of sterilized distilled water, 2.0 μL genomic DNA (13 ng/μL), 0.5 μL of forward and 0.5 μL of reverse primer (5 μM), 0.5 μL MgCl2 (25 mM), 1.0 μL 10X PCR buffer (10 mM Tris-HCl, 50 mM KCl, pH 8.3), 0.5 μL dNTP mix (0.2 mM each of dATP, dGTP, dCTP, and dTTP), and 0.2 μL Taq polymerase (5 U/μL). The PCR conditions were: one cycle of 5 min at 94 °C; 35 cycles of 1 min at 94 °C, 1 min at the respective annealing temperature for each primer (Supplementary Table 2), and 1 min at 72 °C; and a final extension of 7 min at 72 °C. All PCR reactions were conducted in a 96-well GeneAmp PCR System 9700® (Applied Biosystems, CA, USA) and a 2720 Thermal Cycler (Applied Biosystems, CA, USA). PCR products were resolved in 6% polyacrylamide gel at a constant power of 65 W at room temperature for 90 min. Gels were prepared and run in 1X TBE buffer, and fragments were visualized by silver staining. Allele sizes were estimated using a 50 bp DNA size standard (Fermentas, Lithuania).
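As a quick check of the reaction set-up, the component volumes and stock concentrations given above can be converted to per-reaction amounts with the dilution relation C1·V1 = C2·V2. The Python sketch below is an illustration only; the dictionary keys and function names are hypothetical, and it assumes that the stated concentrations for primers, MgCl2, and buffer are stock concentrations and that the buffer contributes no additional MgCl2.

```python
# Component volumes in microlitres, as listed in the text.
VOLUMES_UL = {
    "water": 4.8,
    "genomic_dna": 2.0,
    "forward_primer": 0.5,
    "reverse_primer": 0.5,
    "mgcl2": 0.5,
    "pcr_buffer_10x": 1.0,
    "dntp_mix": 0.5,
    "taq_polymerase": 0.2,
}


def total_volume(volumes):
    """Sum of all component volumes (microlitres)."""
    return sum(volumes.values())


def final_concentration(stock, volume_ul, total_ul):
    """Dilution relation C1 * V1 = C2 * V2, solved for C2."""
    return stock * volume_ul / total_ul


total_ul = total_volume(VOLUMES_UL)

# Assumed stock values: 5 uM primers, 25 mM MgCl2, 10X buffer.
primer_um = final_concentration(5.0, VOLUMES_UL["forward_primer"], total_ul)
mgcl2_mm = final_concentration(25.0, VOLUMES_UL["mgcl2"], total_ul)
buffer_x = final_concentration(10.0, VOLUMES_UL["pcr_buffer_10x"], total_ul)

# Absolute amounts per reaction.
dna_ng = 13.0 * VOLUMES_UL["genomic_dna"]      # 13 ng/uL template
taq_units = 5.0 * VOLUMES_UL["taq_polymerase"]  # 5 U/uL enzyme
```

Under these assumptions the listed volumes add up to the stated 10 μL, with each primer at 0.25 μM, added MgCl2 at 1.25 mM, buffer at 1X, 26 ng of template DNA, and 1 U of Taq polymerase per reaction.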

Data analysis

All unambiguously amplified fragments were scored manually for each SSR locus and included in further analysis. The polymorphism information content (PIC) was calculated according to the formula given by Botstein et al. (1980) as implemented in Cervus version 3.0 (Kalinowski et al. 2007). Other genetic diversity attributes such as expected heterozygosity (He), observed heterozygosity (Ho), and Shannon information index (I) were calculated with POPGENE version 1.32 (Yeh and Boyle 1997). A neighbor-joining (NJ) tree was constructed using Jaccard's coefficient with DARwin version 5.0.158 (Perrier and Jacquemoud-Collet 2006). Genetic relationships among the genotypes were also analyzed by principal component analysis (PCA) using DARwin. Genetic structure analysis was performed with the Bayesian clustering model implemented in STRUCTURE version 2.3 (Pritchard et al. 2000). An admixture model with correlated allele frequencies was used to infer the value of K with prior population information. All analyses were performed with a burn-in period of 100,000 and a Markov chain Monte Carlo (MCMC) replication number of 1,000,000. The value of K was estimated using the method described by Evanno et al. (2005) and was obtained using STRUCTURE HARVESTER (Earl and von Holdt 2012). Analysis of molecular variance (AMOVA) was done using GenAlEx (Peakall and Smouse 2012).
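The per-locus statistics named above can be illustrated with a short Python sketch. It uses the PIC formula of Botstein et al. (1980), PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², and assumes the simplest textbook forms of the other two indices, He = 1 − Σ p_i² and I = −Σ p_i ln p_i; Cervus and POPGENE may apply sample-size corrections that are not reproduced here. The function names and the example allele counts are hypothetical.

```python
import math
from itertools import combinations


def allele_frequencies(counts):
    """Convert allele counts at one locus into frequencies."""
    total = sum(counts)
    return [c / total for c in counts]


def pic(freqs):
    """Polymorphism information content (Botstein et al. 1980)."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    return 1 - homozygosity - cross_term


def expected_heterozygosity(freqs):
    """Expected heterozygosity without sample-size correction."""
    return 1 - sum(p * p for p in freqs)


def shannon_index(freqs):
    """Shannon information index using natural logarithms."""
    return -sum(p * math.log(p) for p in freqs if p > 0)


# Hypothetical locus with two alleles observed 48 times each.
freqs = allele_frequencies([48, 48])
locus_pic = pic(freqs)                     # 1 - 0.5 - 0.125
locus_he = expected_heterozygosity(freqs)  # 1 - 0.5
locus_i = shannon_index(freqs)             # ln 2
```

For a given locus PIC is always lower than He, because the cross term removes matings that are uninformative for linkage; this is consistent with the mean PIC reported in the Results being lower than the mean He.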

Results and discussion

Development and frequency distribution of different SSRs

Microsatellites are tandem repeat motifs of 1–6 bp that occur frequently in all prokaryotic and eukaryotic genomes (Hamada et al. 1982; Tautz 1989; Zane et al. 2002). They are found in both coding and non-coding regions and are distributed throughout the nuclear genome. They can also be found in the chloroplast and mitochondrial genomes (Soranzo et al. 1999; Provan et al. 2001; Chung et al. 2006; Rajendrakumar et al. 2007). These properties make it likely that SSRs can be found in unexplored species such as horsegram. Moreover, continued advances in next-generation sequencing (NGS) technologies have resulted in the generation of large amounts of sequence data that can be mined for marker development. Hence, in this work we applied NGS using Illumina sequencing to generate sequence data and to develop new SSR primers. A total of 86,498 sequences were found containing SSRs (Supplementary Table 1). From these, 2458 primer pairs were designed, and 117 were randomly selected for synthesis and validation (Supplementary Table 2). Detailed information about the additional 2341 primer pairs designed in the present study is given in Supplementary Table 3. The horsegram sequence data analyzed here comprised 186,445 scaffolds with a total length of 305,302.6 kb (305.30 Mbp). These were found to contain a total of 86,498 SSRs located in 20,966 scaffolds, of which 7267 scaffolds had more than one SSR and 9979 had SSRs of different motifs. DNR were the most abundant (48,202 or 55.73%), followed by TNR (35,635 or 41.22%) and TtNR (1947) (Supplementary Table 1). The different SSR unit sizes were not evenly distributed. Among DNRs, AT/AT (85.46%) was the most abundant, followed by AG/CT (11.05%), whereas among TNRs, AAG/CTT (63.35%) was the most abundant, followed by AAT/ATT (28.84%).
Similarly, among TtNRs, PNRs, and HNRs, the most abundant repeats were AAAG/ATTT (54.64%), AATAT/ATATT (27.76%), and AAAAAT/ATTTTT and AAATAT/ATATTT (6.71%), respectively. The number of repeat units per SSR ranged from five to 272. SSRs with five repeats were the most abundant (26,284 or 30.39%), followed by those with six repeats (13,483 or 15.59%) and seven repeats (7879 or 9.11%). SSRs with more than 10 repeats numbered 20,192 (23.34%).

Validation and polymorphic potential of SSRs

Of the 2458 designed primer pairs, 117 were used for validation and assessment of polymorphism among 48 diverse horsegram lines. Of these, 47 exhibited polymorphism and 70 were monomorphic. Of the 47 polymorphic markers, 31 were DNRs, 10 TNRs, four TtNRs, and one each PNR and HNR. Similarly, of the 70 monomorphic markers, 34 were DNRs, 18 TNRs, seven TtNRs, eight PNRs, and three HNRs. A total of 192 alleles were detected by the 47 polymorphic SSR markers. Two markers amplified two loci each: MUGSSR-536 amplified MUGSSR-536(A) and MUGSSR-536(B) with seven and three alleles, respectively, and MUGSSR-537 amplified MUGSSR-537(A) and MUGSSR-537(B) with four alleles each (Table 1). The total number of loci amplified was therefore 49, with an average of 3.90 alleles per locus and a range of two to 15 alleles, revealing high allelic diversity. Primer pair MUGSSR-549 amplified a maximum of 15 alleles, followed by MUGSSR-526 with 13 alleles. Allele sizes ranged from 50 bp to 350 bp. The PIC values ranged from 0.06 for MUGSSR-24 to 0.87 for MUGSSR-549, with an average of 0.46. Sharma et al. (2015a) found that all characterized SSR and ILP primers amplified one to eight alleles, with an average of 2.64 alleles per primer pair, and that PIC ranged from 0.04 to 0.75 with an average of 0.36. The mean observed number of alleles was 3.19 ± 2.65 across the genotypes used in the study. The highest (0.89) and lowest (0.07) expected heterozygosity (He) values were obtained with MUGSSR-549 and MUGSSR-24, respectively, with an average of 0.54. The observed heterozygosity ranged from 0 to 1 with a mean value of 0.64. Shannon’s information index (I) was highest (2.35) for primer pair MUGSSR-549 and lowest (0.15) for MUGSSR-24, with an average of 0.94. The genetic diversity revealed by the SSR loci was supported by the high observed values of mean allele number, heterozygosity, and PIC.
Overall, the mean expected heterozygosity and PIC values were 0.54 and 0.46, respectively, and the average Shannon’s information index (I) was 0.94 (Table 1). Amplification profiles of the 48 horsegram genotypes generated by the MUGSSR-530, MUGSSR-532, and MUGSSR-546 markers are shown in Supplementary Figure 1. The estimates of these diversity indices using SSR primers were similar to those obtained by Sharma et al. (2015b) in this species using RAPD and ISSR markers, where the average PIC was reported as 0.34. In another study, using 360 genetically diverse accessions of horsegram, an average of 5.73 alleles per primer pair was observed, with mean heterozygosity and PIC of 0.58 and 0.50, respectively, and an average Shannon’s information index (I) of 1.07 (Divya Shikha 2015).

Table 1 Diversity statistics for 47 SSR loci studied in 48 horsegram accessions

Cross-transferability studies

Cross-transferability offers potential for the low-cost generation of microsatellite markers for related or distant species. To assess the cross-transferability of the novel horsegram SSR markers, the SSR primers were checked for amplification in nine other legume species. The SSR markers showed a variable extent of cross-transferability in the analyzed species. Of the 47 primer pairs, 43 produced amplicons in at least one legume species, whereas four primer pairs did not produce any amplicon. The amplification profile of the nine related species with primer pair MUGSSR-542 is shown in Supplementary Figure 1. Of the 43 primer pairs that showed cross-amplification, five produced amplicons in all the legumes analyzed, indicating 100% transferability. Among the 43 genomic SSR markers, 16 each amplified in M. sar-gharwalensis and P. vulgaris, 15 in V. unguiculata, 17 each in C. arietinum, L. culinaris, and V. mungo, 43 in P. sativum, 25 in T. pratense, and 19 in V. umbelleta. Surprisingly, the highest species-wise transferability was observed in P. sativum (91.49%), followed by T. pratense (53.19%), V. umbelleta (40.43%), V. mungo, C. arietinum, and L. culinaris (36.17% each), M. sar-gharwalensis and P. vulgaris (34.04% each), and V. unguiculata, which had the lowest value (31.91%). The percentage of cross-transferability in the different species is depicted in Supplementary Figure 2. There are many examples of SSR markers being transferred among legume genera, for example from soybean to peanut (He et al. 2006); from Medicago truncatula to pea, fava bean, and C. arietinum (Gutierrez et al. 2005); and from Medicago truncatula, Pisum sativum, and Trifolium pratense to Lens culinaris (Reddy et al. 2010).
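The species-wise percentages reported above follow from dividing the number of amplifying primer pairs in each species by the 47 polymorphic primer pairs tested. A minimal Python sketch of this arithmetic, using the counts given in the text (the variable and function names are hypothetical), is:

```python
# Number of polymorphic primer pairs tested for cross-transferability.
PRIMER_PAIRS_TESTED = 47

# Primer pairs that amplified in each species, as reported in the text.
AMPLIFYING_PAIRS = {
    "M. sar-gharwalensis": 16,
    "P. vulgaris": 16,
    "V. unguiculata": 15,
    "C. arietinum": 17,
    "L. culinaris": 17,
    "V. mungo": 17,
    "P. sativum": 43,
    "T. pratense": 25,
    "V. umbelleta": 19,
}


def transferability_percent(amplifying, tested):
    """Percentage of tested primer pairs that amplified, to two decimals."""
    return round(100 * amplifying / tested, 2)


percentages = {
    species: transferability_percent(count, PRIMER_PAIRS_TESTED)
    for species, count in AMPLIFYING_PAIRS.items()
}

# Species ordered from highest to lowest transferability.
ranked = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
```

The resulting values reproduce the reported range, from 91.49% in P. sativum down to 31.91% in V. unguiculata.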

Distance based cluster analysis

A neighbor-joining (NJ) tree of the 48 accessions showed two major clusters with high (> 50.0%) bootstrap values (Fig. 1). The first cluster comprised accessions from Himachal Pradesh. Interestingly, the single accession of the wild species M. axillare and two accessions each from Madhya Pradesh and Tamil Nadu also clustered with this group. The accessions belonging to southern Indian states such as Tamil Nadu and Andhra Pradesh were grouped in the second cluster. Jaccard’s similarity matrix showed the highest genetic similarity (0.93) between HPKM-317 and HPKM-150. These two accessions are mutants developed from the same parent, so their close similarity is expected and supports the validity of the results obtained in the present study. Accessions TCR-654 and IC-120825 were genetically the most dissimilar, with a similarity value of 0.25. These four accessions belong to M. uniflorum. When the related species M. axillare was included in the similarity estimates, it showed maximum dissimilarity with TCR-654 (M. uniflorum), with a similarity value of 0.18. A two-dimensional view of genetic diversity in the 48 accessions was obtained by PCA (Fig. 2), which clearly showed the existence of two groups. The clustering pattern corresponded with the geographical distribution of the accessions. This grouping may be attributed to the frequent flow of alleles within a confined population of a geographical region rather than between populations of two geographical regions. Further, there are two main regions of horsegram diversity in the country: one includes Andhra Pradesh and other southern states, and the other comprises some hilly regions of north Indian states such as Himachal Pradesh, Uttarakhand, and Jammu & Kashmir (Neelam et al. 2014; Sharma et al. 2015a). In PCA, group-I consisted of accessions from Himachal Pradesh together with one M. axillare accession from Australia and two accessions each from Madhya Pradesh (IC-19432 and IC-23445) and Tamil Nadu (IC-469259 and IC-469271). Group-II had accessions primarily from the southern states of India, which confirmed the results obtained with the neighbor-joining tree. Similar results were also reported by Sharma et al. (2015b) while studying diversity in 51 horsegram accessions using RAPD and ISSR primers.
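The pairwise similarity values discussed above are based on Jaccard's coefficient, which for two accessions scored as presence (1) or absence (0) of each allele is the number of alleles shared by both divided by the number of alleles present in at least one of them. The Python sketch below illustrates the calculation; it is not the DARwin implementation, and the function name and allele profiles are hypothetical.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard similarity between two binary allele profiles.

    Alleles absent from both accessions are ignored, as required by
    Jaccard's coefficient.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must score the same set of alleles")
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    present = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    if present == 0:
        raise ValueError("no allele is present in either profile")
    return shared / present


# Hypothetical presence/absence scores for five alleles in two accessions.
accession_1 = [1, 1, 0, 1, 0]
accession_2 = [1, 0, 0, 1, 1]

similarity = jaccard_similarity(accession_1, accession_2)  # 2 shared / 4 present
dissimilarity = 1 - similarity  # distance form used for tree building
```

Computing this value for every pair of accessions gives the similarity matrix from which a neighbor-joining tree can be built.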

Fig. 1 Radial neighbor-joining tree based on 192 alleles from 47 SSR loci among 48 accessions. Bootstrap values ≥50% are shown

Fig. 2 Principal component analysis (PCA) of 48 horsegram genotypes based on 192 fragments amplified by 47 novel SSR primers

Bayesian genetic structure

The Bayesian clustering method is a powerful computational tool for estimating various features of populations. STRUCTURE assumes K (unknown) populations for the given data set, and the value of K can be estimated from the posterior probability of the data for a given K. Delta K, which is used to determine the best-fit value of K, was computed by STRUCTURE HARVESTER for the range K = 1–10 and was highest at K = 2, where the maximum Delta K was 308.06. STRUCTURE analysis thus indicated two genetic stocks for the horsegram accessions in the present study. One of the clusters (Cluster-I) represented accessions from Himachal Pradesh, while the other (Cluster-II) represented accessions from central and southern parts of India such as Tamil Nadu, Andhra Pradesh, and Madhya Pradesh. Genetic admixture was also recorded in a few accessions. A few accessions from Himachal Pradesh were placed in Cluster-II. Similarly, a few accessions from Tamil Nadu, Andhra Pradesh, and Madhya Pradesh were placed in Cluster-I. The very low admixture recorded in the accessions of both clusters indicated their pure ancestry (Fig. 3). A similar clustering pattern, with K = 2, was reported by Sharma et al. (2015b) and Divya Shikha (2015). A low level of admixture was recorded, and accessions in the two gene pools represented purity of 23.88 and 37.03% for Cluster-I and Cluster-II, respectively. This genetic structuring indicated two centers of diversity of horsegram, one in the region of the Indian Himalayas and the other in southern parts of the country. Although STRUCTURE results showed low admixture in accessions of both clusters, Cluster-I, corresponding to the Himalayan region, showed a greater extent of admixture in its accessions. Thus, the admixture shown by STRUCTURE may be due to the mixing of accessions across regions.
The structuring of accessions into two clusters with low admixture also points towards an isolated and conserved genetic background of the crop species. The results confirm the findings of Sharma et al. (2015b) and Divya Shikha (2015), who reported that the cluster from the hilly region showed greater admixture than the cluster comprising accessions from central and southern India. Analysis of molecular variance (AMOVA) revealed that 82% of the genetic variance resided within populations, whereas 18% resided among populations (Table 2), indicating more genetic variance and allelic richness within populations than among populations. This distribution of genetic variance can be attributed to the greater chance of mixing of genomic regions among individuals of the same population. Thus, both the southern and northern regions harbored more genetic variance within populations than among populations.
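The Delta K criterion of Evanno et al. (2005) used above to choose K can be sketched in Python as follows. The sketch assumes the form ΔK = |L(K+1) − 2L(K) + L(K−1)| / s[L(K)], where L(K) is the mean log-likelihood over replicate runs at a given K and s[L(K)] is its standard deviation. The function name and the log-likelihood values are hypothetical and are not the output of this study.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute Delta K from replicate log-likelihoods for each K.

    lnp_by_k maps each K to a list of at least two Ln P(D) values from
    replicate runs. Delta K is defined only for values of K that have
    both neighbours (K - 1 and K + 1) and a non-zero standard deviation.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    sds = {k: stdev(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means and sds[k] > 0:
            second_difference = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_difference / sds[k]
    return result


# Hypothetical replicate log-likelihoods for K = 1 to 4 (two runs each).
runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-890.0, -892.0],
    4: [-884.0, -886.0],
}

dk = delta_k(runs)
best_k = max(dk, key=dk.get)  # K with the largest Delta K
```

In this toy example the likelihood rises sharply from K = 1 to K = 2 and then levels off, so Delta K peaks at K = 2. Because the criterion needs both neighbours, it cannot be evaluated at the smallest or largest K tested.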

Fig. 3 Genetic structure of 48 horsegram accessions as inferred by STRUCTURE v2.3.3 with the data set of 47 novel SSR markers. Each vertical line represents an individual genotype, and the colored segments of each line show the extent of admixture in that individual

Table 2 Summary of analysis of molecular variance

The novel SSR markers developed in the present study can be highly useful for future genetic improvement of horsegram through large-scale diversity analysis, linkage mapping, QTL detection, and association mapping studies. In addition, they can be valuable for conducting similar genetic analyses in other related legume crops that lack SSR marker resources.