Introduction

Horsegram [Macrotyloma uniflorum (Lam.) Verdc.] is a protein-rich, underutilized, ancient Indian pulse crop of the family Fabaceae (Prasad and Singh 2015; Fuller and Murphy 2018). It is commonly known as kulath, kulthi, ulavalu, hurali, kollu or muthira in various parts of India. Its English name derives from its use as horse feed (Bhardwaj et al. 2013). It is a rich source of proteins, minerals and vitamins. Besides its nutritional importance, it possesses medicinal properties because of the presence of non-nutritive bioactive substances, and it is considered a pulse with medicinal value (Bolbhat and Dhumal 2014; Bhartiya et al. 2015; Vandarkuzhali and Narayanasamy 2015; Fuller and Murphy 2018). It possesses anti-hypercholesterolemic, anti-microbial, anti-obesity, anti-helminthic, analgesic, anti-inflammatory, anti-diabetic, anti-cholelithiatic, anti-histaminic, anti-peptic ulcer, anti-oxidant, anti-urolithiatic, diuretic, haemolytic, hepatoprotective, and anti-hypertensive properties (Ranasinghe and Ediriveera 2017).

In developing countries such as India, only a few conventional legumes dominate pulse production. Underutilized legumes such as horsegram therefore have great potential for combating protein malnutrition and improving the nutritional security of rural, tribal and underprivileged people (Tontisirin 2014). The crop accounts for about 5–10% of pulse production in India, and its annual production is about 0.65 million tonnes (Kiranmai et al. 2016). Horsegram has also been identified as a potential food source for the future, and it is used as cattle feed in both fresh and dried form. Because of its low water requirement, it is useful for water conservation and is widely grown in the semi-arid regions of India. It is generally considered a protein-rich poor man’s crop that grows well under dry conditions and marginal soil fertility (Kiranmai et al. 2016). It is now established that horsegram also performs better than many other pulse crops under saline (salt stress) conditions; the crop therefore has a higher production potential under such conditions (Reddy et al. 2008).

Despite its importance, the crop has long been neglected and has not received due recognition from the research fraternity, although this trend seems to be reversing recently. Studies on the characterization of horsegram germplasm have been carried out using morphological markers, but few studies have utilized molecular markers for this purpose. A few mutation studies have also been carried out to understand some characters in the crop. Since this crop comes up reasonably well in dryland areas with receding soil moisture and in poor soils where other crops fail to grow, there is a high probability that it contains a large number of drought resistance genes. However, over the years, the production of and area under this crop have decreased markedly owing to the non-availability of improved and well-adapted varieties. Studies on the diversity of morphological traits have been conducted in this crop by different workers, who have suggested great potential for its improvement (Geetha et al. 2011; Neelam et al. 2014; Singh et al. 2019).

Although a few molecular studies have recently been carried out in this crop, SSRs have not been utilized to any great extent in this important food and fodder crop (Ramya et al. 2013; Kiranmai et al. 2018). This may be due to the limited availability of genetic and genomic resources. Therefore, it is immensely important to develop SSR marker resources in this crop that can be used for genetic diversity, linkage analysis, QTL and association mapping studies, as demonstrated by other workers in various crops (Qiu et al. 2010; Wang et al. 2011, 2013; Xue et al. 2018; Chahota et al. 2020). In addition, marker-assisted selection (MAS) and genomic selection (GS) can be performed if sufficient genomic resources become available. So far, only a limited number of SSRs have been developed and characterized in this crop. Sharma et al. (2015a, b) identified and developed 245 SSR and 13 Intron Length Polymorphism (ILP) markers from public sequence data. Similarly, Divya (2015) and Kaldate et al. (2017) employed public sequence data to design and validate SSR markers in this crop. Chahota et al. (2017) used the next-generation Illumina sequencing platform to develop a large number of microsatellite markers in this species. From a total of 23,305 potential SSR motifs, their group designed 5755 primers, of which 30 SSRs were used in 360 accessions to study genetic diversity and population structure. However, more marker resources are required in this species to initiate genetic improvement programmes and to apply genomic tools in combination with conventional techniques, which can be a promising improvement strategy (Datir 2016). Simple sequence repeats (SSRs) are widely used because of their codominant inheritance, multi-allelic nature, high reproducibility and transferability, extensive genome coverage, and simple detection methods (Choudhary et al. 2009; Sharma et al. 2009; Kaur et al. 2016).
Therefore, in the present study we identified, developed and utilized SSR markers in this crop to enrich genomic resource data and to analyze genetic structure in a panel of horsegram genotypes. Furthermore, the SSRs identified and characterized in this study were derived from drought-resistant genotypes and hence can also be useful in the identification and screening of drought-resistant accessions in future.

Materials and methods

SSR markers development and gene ontology

A transcriptome database was used to develop new SSRs. A total of 110.2 MB of clean, filtered FASTQ sequences of horsegram were acquired from the NCBI SRA (Sequence Read Archive) under accession numbers SRX341972, SRX341973, SRX341974 and SRX341975. Reads were assembled into 124,147 contigs with an N50 value of 851 using the de novo assembly module of CLC Genomics Workbench v 6.5 (CLC Inc, Aarhus, Denmark) with an overlap length cut-off value of 150. The assembled contigs were used for transcriptomic SSR mining with the MISA tool, using the following criteria: direpeats with a minimum of six units; trirepeats, tetrarepeats, pentarepeats and hexarepeats with a minimum of five units; and a maximum distance of 100 bp between two SSRs.
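The repeat-unit thresholds above can be illustrated with a minimal sketch. This is not the MISA implementation: it is a simplified regular-expression search for perfect repeats only, the function name `find_ssrs` is hypothetical, and the 100 bp rule for joining neighbouring SSRs into compound repeats is not modelled.

```python
import re

# Minimum repeat counts per motif length, taken from the criteria in the text:
# direpeats >= 6 units; tri- to hexarepeats >= 5 units.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples for perfect SSRs.

    Coordinates are 0-based, end-exclusive. Illustrative only; compound SSRs
    and imperfect repeats are ignored.
    """
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in MIN_REPEATS.items():
        # ([ACGT]{k}) captures one motif; \1{n-1,} demands n-1 or more extra copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are a shorter unit repeated (e.g. "AA", "ATAT").
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            count = (m.end() - m.start()) // motif_len
            hits.append((m.start(), m.end(), motif, count))
    return sorted(hits)


# Made-up test sequence: (AT)6 and (GAA)5 embedded in flanking bases.
example = "GGC" + "AT" * 6 + "CCGT" + "GAA" * 5 + "TT"
example_hits = find_ssrs(example)
```

In this sketch a homopolymer run is deliberately not reported, because mononucleotide repeats are not among the criteria listed above.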

All SSR-containing sequences were used for primer design with BatchPrimer3 software (https://probes.pw.usda.gov/cgibin/batchprimer3/batchprimer3.cgi) using the following criteria: (1) primer length of 18–22 bp with an optimum of 20 bp, (2) PCR product size of 100–400 bp with an optimum of 300 bp, (3) annealing temperature of 45–60 °C, and (4) GC content of 40–60% with an optimum of 50%. The newly designed transcriptomic SSRs were named horsegram transcriptomic SSRs (HTSSR). Gene ontology classification of the SSR-containing sequences assigned 13,427 GO terms. Open reading frames (ORFs) were identified in the SSR-containing sequences using ORF finder; earlier SSR work (Sharma et al. 2015a, b) did not include this information. An in-house Perl script was used to find the location of SSRs within the SSR-containing sequences. Further, KEGG pathway analysis was performed on 6211 SSR-containing sequences.
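As a rough illustration of how the length, GC content and product size criteria constrain a candidate primer, the following sketch applies them as a simple filter. It does not reproduce BatchPrimer3; the function names and the example sequences are hypothetical, and the annealing temperature criterion is left out because the melting-temperature model used is not stated in the text.

```python
def gc_percent(primer):
    """Percentage of G and C bases in a primer sequence."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def passes_criteria(primer, product_size):
    """Check primer length (18-22 bp), GC content (40-60%) and
    expected PCR product size (100-400 bp), as listed in the text."""
    return (18 <= len(primer) <= 22
            and 40.0 <= gc_percent(primer) <= 60.0
            and 100 <= product_size <= 400)


# Made-up sequences for illustration only.
good_primer = "ATGCATGCATGCATGCATGC"   # 20 bp, 50% GC
at_rich_primer = "ATATATATATATATATATAT"  # 20 bp, 0% GC
```

A primer such as `good_primer` with a 300 bp product satisfies all three checks, whereas `at_rich_primer` fails on GC content alone.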

Plant materials and genomic DNA isolation

A panel of 58 diverse horsegram genotypes (Table 1), collected from different horsegram-growing geographical regions of the country and maintained at DAV University, Jalandhar, was selected on the basis of flowering period and maturation time in the Jalandhar region of Punjab. Genomic DNA was isolated from young leaves of each plant using a modified CTAB method (Doyle and Doyle 1990; Rana et al. 2017). The quantity and quality of the DNA were estimated by electrophoresis on a 1% agarose gel by comparison with lambda DNA (Fermentas, Lithuania).

Table 1 Panel of genotypes selected for molecular characterization

PCR amplification

For amplification of genomic DNA, a 10 µl reaction mixture was prepared using 4.8 µl of sterilized distilled water, 2.0 µl of genomic DNA (13 ng/µl), 0.5 µl each of forward and reverse primer (5 µM), 0.5 µl of MgCl2 (25 mM), 1.0 µl of 10× PCR buffer (10 mM Tris–HCl, 50 mM KCl, pH 8.3), 0.5 µl of dNTP mix (0.2 mM each of dATP, dGTP, dCTP and dTTP) and 0.2 µl of Taq polymerase (5 U/µl). The PCR conditions were: 1 cycle of 5 min at 94 °C; 35 cycles of 1 min at 94 °C, 1 min at the respective annealing temperature for each primer and 1 min at 72 °C; a final extension of 7 min at 72 °C; and a hold at 4 °C. All PCR reactions were conducted in a 96-well Veriti™ Thermal Cycler (Applied Biosystems, CA, USA). The PCR products were first checked on a 3% agarose gel and then resolved in a 6% polyacrylamide gel at a constant power of 65 W at room temperature for 90 min. Gels were prepared and run in 1× TBE buffer, and fragments were visualized by silver staining. The size of the alleles generated by the newly developed markers was estimated using a 50 bp DNA size standard.
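The reaction composition can be checked with simple dilution arithmetic. The sketch below uses only the volumes and stock concentrations stated above; the helper `master_mix` and its 10% pipetting overage are illustrative assumptions, not part of the described protocol. The dNTP mix is left out of the concentration calculation because the text does not make clear whether 0.2 mM refers to the stock or to the final reaction.

```python
# Component volumes (µl) for one 10 µl reaction, as listed in the text.
VOLUMES_UL = {
    "water": 4.8,
    "genomic_dna": 2.0,
    "forward_primer": 0.5,
    "reverse_primer": 0.5,
    "mgcl2": 0.5,
    "pcr_buffer_10x": 1.0,
    "dntp_mix": 0.5,
    "taq_polymerase": 0.2,
}
TOTAL_UL = sum(VOLUMES_UL.values())


def final_concentration(stock, volume_ul, total_ul=TOTAL_UL):
    """Dilution: C_final = C_stock * V_added / V_total (units follow the stock)."""
    return stock * volume_ul / total_ul


primer_final_uM = final_concentration(5.0, VOLUMES_UL["forward_primer"])  # per primer
mgcl2_added_mM = final_concentration(25.0, VOLUMES_UL["mgcl2"])           # added MgCl2
template_ng = 13.0 * VOLUMES_UL["genomic_dna"]      # ng of DNA per reaction
taq_units = 5.0 * VOLUMES_UL["taq_polymerase"]      # units of enzyme per reaction


def master_mix(n_reactions, overage=0.10):
    """Hypothetical helper: scale all components except template DNA
    for n reactions, with a fractional pipetting overage (assumed 10%)."""
    scale = n_reactions * (1.0 + overage)
    return {name: vol * scale
            for name, vol in VOLUMES_UL.items() if name != "genomic_dna"}
```

Under these figures the listed volumes add up to the stated 10 µl, each primer is present at 0.25 µM, the added MgCl2 contributes 1.25 mM, and each reaction contains 26 ng of template and 1 U of Taq polymerase.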

Data analysis

All fragments generated by the SSR primers were scored manually for each SSR locus; polymorphic bands were converted into binary data, with 1 for presence and 0 for absence of a band. Only unambiguously amplified alleles were scored and included in further analysis. The polymorphism information content (PIC) is a measure of the effectiveness of a given DNA marker for detecting polymorphism. The PIC for each primer pair was calculated according to the following standard formula given by Botstein et al. (1980) and implemented in Cervus version 3.0.

$${\text{PIC}}_{i} = 1 - \sum\limits_{j = 1}^{n} P_{ij}^{2}$$

where Pij is the frequency of the jth pattern for marker i and the summation extends over n patterns. Various genetic diversity estimates, such as expected heterozygosity (He), observed heterozygosity (Ho) and Shannon information index (I), were calculated using POPGENE version 1.32 (Yeh and Boyle 1997). Distance-based cluster analysis was performed by generating a dendrogram based on the unweighted pair-group method with arithmetic mean (UPGMA) using Jaccard's similarity coefficient in NTSYS pc-2.02e (Rohlf 1998). A Neighbour-Joining (NJ) tree was constructed using Jaccard's coefficient in DARwin version 6.0.20, accessed on 28 March 2019 (Perrier and Jacquemoud-Collet 2006). Genetic relationships among the genotypes were also analysed by principal component analysis (PCA). Genetic structure analysis was performed with the Bayesian clustering model implemented in STRUCTURE version 2.3 (Pritchard et al. 2000). An admixture model with correlated allele frequencies was used to infer the value of K with prior population information. All analyses were performed with a burn-in period of 100,000 and 1,000,000 Markov chain Monte Carlo (MCMC) replications. The value of K was estimated using the method described by Evanno et al. (2005), as implemented in STRUCTURE HARVESTER (Earl and VonHoldt 2012). Analysis of molecular variance (AMOVA) was done using GenAlEx (Peakall and Smouse 2012).
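The PIC expression given above can be evaluated directly from allele (pattern) frequencies, as in the minimal sketch below. The function names are hypothetical and this is not the Cervus implementation. For comparison, the second function evaluates the longer expression usually attributed to Botstein et al. (1980), which subtracts an additional cross term; which of the two a given program reports is an assumption to be checked against that program's documentation.

```python
from collections import Counter


def allele_frequencies(alleles):
    """Frequencies of each distinct allele (or banding pattern) in a list of calls."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def pic_simple(freqs):
    """PIC_i = 1 - sum_j P_ij^2, the form written in the text."""
    return 1.0 - sum(p * p for p in freqs)


def pic_botstein(freqs):
    """Longer form: 1 - sum_j p_j^2 - sum_{j<k} 2 * p_j^2 * p_k^2 (for comparison)."""
    homo = sum(p * p for p in freqs)
    cross = sum(2.0 * freqs[j] ** 2 * freqs[k] ** 2
                for j in range(len(freqs))
                for k in range(j + 1, len(freqs)))
    return 1.0 - homo - cross


# Made-up allele calls at one locus, for illustration only.
example_calls = ["a", "a", "b", "b"]
example_freqs = allele_frequencies(example_calls)
```

For two equally frequent alleles the simple form gives 0.5 and the longer form gives 0.375, which shows how far the two expressions can diverge at loci with few alleles.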

Results and discussion

Advancing technologies have made plant systems easier to access and explore, and the generation of SSR markers is no exception. Isolating SSRs by conventional methods was a laborious and time-consuming process. With the advent of next-generation sequencing (NGS) techniques, however, developing sequence-based molecular markers such as EST-SSRs has become easy and low-cost. NGS techniques can produce large numbers of sequences in a very short time, generating huge amounts of data. This sequence data can be used to develop SSR markers, as several workers have done in various crops (Zalapa et al. 2012; Ravishankar et al. 2017; Neophytou et al. 2018; Tibihika et al. 2019; Patil et al. 2020). We also used horsegram data from the public domain to develop SSR markers, specifically from drought-resistant transcripts.

SSR designing and gene ontology

In toto, 7352 SSR primers were designed from the mined sequences of the horsegram transcriptome. Of these, 1785 (24%) were direpeats, 4380 (59%) were trirepeats, 584 (8%) were tetrarepeats, 258 (3.5%) were pentarepeats and 345 (4.6%) were hexarepeats. Overall, the frequency of SSR occurrence was ~ 6% (Table 2). In earlier studies, SSR frequencies of between 2.65 and 16.8% were reported in 49 dicotyledonous species (Kumpatla and Mukhopadhyay 2005). Similarly, the frequency of EST-SSRs containing di-, tri- and tetra-nucleotide repeats in monocot species was reported as 1.5–4.7% (Kantety et al. 2002) and 7–10% (Varshney et al. 2002). The GO terms were classified into three categories: (1) biological process, 38.69% (5196); (2) molecular function, 34.87% (4682); and (3) cellular component, 26.43% (3549). In the biological process category, genes involved in cellular and metabolic processes were the most prevalent. In the molecular function category, DNA-binding proteins, catalytic proteins and transcription regulators were the most abundant. This indicates that molecular processes involving DNA, such as replication, transcription and DNA modifications (possibly methylation or acetylation), are prominent. In addition, catalytic proteins, which regulate and speed up many molecular processes, also predominate over other molecules. In the cellular component category, the majority of genes were involved in cell, cell part or organelle structure and function (Fig. 1). The ORF survey showed that the majority of SSRs (57%) were located in the functional coding sequence (CDS) region, followed by the 3' UTR (22%) and the 5' UTR (16%). SSRs in inter 5'CDS and inter 3'CDS regions were the least frequent (Fig. 2). This information will be helpful for future genetic manipulation work using new technologies to enhance the adaptability or quality of the crop. Further, KEGG pathway analysis performed on 6211 SSR-containing sequences generated 1800 KEGG IDs corresponding to 152 pathways.
These IDs also corresponded to 1483 enzymes in six categories as classified by the Enzyme Commission (EC). Thus, the SSRs analyzed in the present study have many implications for the different processes and pathways taking place in the plant. The details of the newly designed primers are given in supplementary file 1.
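The proportions reported in this section can be re-derived from the counts with a short calculation, sketched below. All counts are taken from the text; expressing the overall SSR frequency as designed SSR primers per assembled contig is an assumption, since Table 2 is not reproduced here. Values rounded to one decimal place may differ slightly from the figures quoted in the text.

```python
# Counts of designed SSR primers per repeat class, as reported in the text.
REPEAT_COUNTS = {"di": 1785, "tri": 4380, "tetra": 584, "penta": 258, "hexa": 345}
# GO term counts per category, as reported in the text.
GO_COUNTS = {"biological_process": 5196,
             "molecular_function": 4682,
             "cellular_component": 3549}
CONTIGS = 124_147  # assembled contigs reported in the methods


def percentages(counts):
    """Share of each class as a percentage of the class total, one decimal place."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


total_ssrs = sum(REPEAT_COUNTS.values())
total_go_terms = sum(GO_COUNTS.values())
repeat_share = percentages(REPEAT_COUNTS)
go_share = percentages(GO_COUNTS)

# Assumption: frequency expressed as SSR primers per assembled contig.
ssr_frequency_percent = 100.0 * total_ssrs / CONTIGS
```

The class counts sum to the 7352 primers and 13,427 GO terms stated in the text, and the assumed frequency definition gives a value close to the reported ~ 6%.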

Table 2 SSR marker mining details
Fig. 1 Classification and functional annotation of SSR markers

Fig. 2 Localization of SSR regions in ORFs and their vicinity and EC classification

Polymorphic extent of markers

Of the 150 randomly selected SSR markers that were synthesized, 33 were polymorphic and produced 40 loci (Table 3). In total, 130 alleles were produced, ranging from 2 to 9 per locus with an average of 3.25. The maximum number of alleles (9) was produced by primer HTSSR 155, while the minimum number (2) was produced by nineteen different primers. Observed heterozygosity (Ho) ranged from 0.03 to 1.00 with an average of 0.55, and expected heterozygosity (He) ranged from 0.13 to 0.81 with an average of 0.56. PIC values ranged from 0.065 to 0.78 with an average of 0.47 (Table 4). These values show that some of the primers were highly informative compared with previous reports (Divya 2015; Sharma et al. 2015a, b; Chahota et al. 2017; Kaldate et al. 2017). The selected primers can be useful for exploring the highly conserved germplasm of horsegram in some geographical regions.

Table 3 Details of polymorphic transcriptomic HTSSR primers used in diversity analysis for horsegram
Table 4 Details of observed heterozygosity (Ho), Expected heterozygosity (He) and polymorphism information content (PIC) of 40 transcriptomic SSR loci originated from 33 SSR

Diversity and cluster analysis

Cluster analysis of the 58 accessions showed four major clusters with high (> 50.0%) bootstrap values in the N-J tree based on Jaccard's coefficient, and these were supported by principal component analysis (Figs. 3, 4). The clustering methods supported one another. Group 1 was further sub-classified into two sub-groups, Group 1A and Group 1B. Group 1A contained genotypes procured from NBPGR, Hyderabad, whereas accessions procured from NBPGR, Uttarakhand clustered together in Group 1B. Based on our field evaluation and observations, Group 1A included early maturing accessions (average duration of 77 days of Monsoon season) while Group 1B included late maturing accessions (average duration of 63 days of Monsoon season). Group 2 was a small group containing 3 accessions collected from Himachal Pradesh that showed early maturation (average duration of 62 days of Monsoon season). Group 3 was further sub-classified into two sub-groups, Group 3A and Group 3B. Group 3A included accessions procured from NBPGR, Uttarakhand that showed early maturation (average duration of 49 days of Monsoon season), while Group 3B included accessions from NBPGR, Hyderabad that showed late maturation (average duration of 93 days of Monsoon season). Group 4 clustered genotypes from different geographical locations of Himachal Pradesh, primarily showing late maturation (average duration of 90 days of Monsoon season). In the present study, N-J tree analysis revealed apparent groupings based on the geographical source of the horsegram accessions, in addition to a clear demarcation between early and late maturation within their respective categories. The findings of the N-J tree analysis were consistent with the Bayesian model-based cluster analysis, which was used to check genome sharing among the studied accessions.
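The Jaccard coefficient underlying the N-J tree can be illustrated on binary band profiles with the minimal sketch below. It does not reproduce the DARwin or NTSYS implementations; the accession names and band scores are made up, and the function names are hypothetical.

```python
def jaccard_similarity(profile_a, profile_b):
    """Jaccard coefficient between two binary band profiles (1 = present, 0 = absent).

    Shared absences (0, 0) are ignored: J = shared bands / bands present in either.
    """
    shared = sum(1 for a, b in zip(profile_a, profile_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a == 1 or b == 1)
    return shared / either if either else 0.0


def jaccard_distance_matrix(profiles):
    """Pairwise dissimilarity (1 - J) for a dict of {accession: binary profile}."""
    names = list(profiles)
    return {(p, q): 1.0 - jaccard_similarity(profiles[p], profiles[q])
            for p in names for q in names}


# Hypothetical accessions scored for six bands, for illustration only.
example_profiles = {
    "acc_1": [1, 1, 0, 0, 1, 0],
    "acc_2": [1, 0, 1, 0, 1, 0],
    "acc_3": [0, 0, 1, 1, 0, 1],
}
example_distances = jaccard_distance_matrix(example_profiles)
```

A dissimilarity matrix of this kind is the usual input to distance-based tree building; because shared absences carry no weight, two accessions that both lack a band are not treated as similar on that account.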

Fig. 3 Principal component analysis (PCA) of 58 horsegram genotypes based on 130 fragments amplified by novel 33 SSRs

Fig. 4 Neighbour-Joining (N-J) tree of 58 horsegram genotypes constructed using Jaccard's coefficient using 130 fragments amplified by novel 33 SSRs

Bayesian model based cluster analysis

Bayesian clustering is a powerful computational tool for estimating various features of populations. STRUCTURE assumes K (unknown) populations for a given data set, and the value of K can be estimated from the posterior probability of the data for a given K. Delta K, which is used to determine the best-fit value of K, was computed by STRUCTURE HARVESTER for the range K = 1–10, and the highest value was observed at K = 3. The number of populations (K) was identified by performing five independent runs for each value of K from 1 to 10, with a burn-in period of 100,000 followed by 1,000,000 Markov chain Monte Carlo (MCMC) repeats. Based on the maximal value of Ln P(D), the posterior probability of the data, obtained using STRUCTURE HARVESTER, a clear delineation was found at K = 3 (Fig. 5, Supplementary Table 1). At K = 3, the 58 horsegram genotypes could be clearly classified into three clusters, irrespective of whether they had ‘pure’ or ‘mixed’ ancestry (Fig. 5). The percentage of accessions with pure ancestry (membership probabilities ≥ 80%) was 66, 59 and 50% in clusters one, two and three, respectively, indicating a strong genetic structure in the analyzed germplasm of the crop. This shows that mixing of the three genetic stocks in the past was not prominent, and the different germplasm stocks can be utilized to produce fruitful results in future breeding programmes. At least 23 (40%) of the tested horsegram accessions had ‘pure’ ancestry, spanning all three clusters. The rest of the accessions had shared/mixed ancestry. A comparison of the results from the Bayesian STRUCTURE analysis with the NJ tree revealed considerable congruence. The structure analysis showed three genetic stocks for the analyzed horsegram germplasm, which contrasts with earlier studies by different workers (Sharma et al. 2015a, b; Chahota et al. 2017; Kaldate et al. 2017).
This may be due to the diverse collection of genotypes or the drought-specific SSR markers used in this study. However, this can be further validated in future using larger numbers of genotypes from across the country. The results of this diversity analysis suggest that we have a diverse set of horsegram germplasm that can be used in future improvement programmes, and that the novel polymorphic SSRs developed in this study can be useful for various genetic studies in horsegram and related legumes.
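The Delta K criterion and the ≥ 80% membership threshold used in this section can be sketched as follows. This is a simplified illustration, not the STRUCTURE HARVESTER code: Delta K is taken here as the absolute second-order difference of the mean Ln P(D) across replicate runs, divided by the standard deviation of Ln P(D) at that K. The Ln P(D) values below are made up and the function names are hypothetical.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: Delta K} from replicate Ln P(D) values, {K: [run1, run2, ...]}.

    Delta K is undefined for the smallest and largest K tested, and for any K
    whose replicate runs show zero variance, so those K are omitted.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in means or k + 1 not in means:
            continue
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / sd
    return delta


def ancestry_label(q_values, threshold=0.80):
    """'pure' if the largest membership coefficient reaches the threshold, else 'admixed'."""
    return "pure" if max(q_values) >= threshold else "admixed"


# Made-up Ln P(D) values from three replicate runs per K, for illustration only.
example_lnp = {
    1: [-1000, -1002, -998],
    2: [-900, -905, -895],
    3: [-700, -701, -699],
    4: [-690, -700, -680],
    5: [-685, -695, -675],
}
delta_k = evanno_delta_k(example_lnp)
best_k = max(delta_k, key=delta_k.get)
```

In this made-up example the largest Delta K falls at K = 3, where the likelihood rises sharply and the replicate runs agree closely; an accession with membership coefficients such as (0.85, 0.10, 0.05) would be labelled as having pure ancestry under the 80% threshold.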

Fig. 5 a Posterior probability based on LnP (D) values detected by 33 novel microsatellites using STRUCTURE HARVESTER software showing a clear delineation of 3 gene pools (K = 3) in 58 accessions of horsegram. b Bar plot showing genetic structure of 58 horsegram accessions as inferred by STRUCTURE v2.3.3