Introduction

Sapota (Manilkara zapota (L.) P. Royen) is a delectable fruit of tropical and sub-tropical regions, and belongs to the family Sapotaceae. It has spread to many tropical countries though it is native to Central America. In Asia, it was first introduced in Philippines by the Spanish, and later spread to other Asian countries (Meghala et al. 2005). In India, it was introduced during 1800’s from Mexico via Sri Lanka (Rekha et al. 2011), and is commonly known as “chikku”. The hard nature and wide range of adaptability to different conditions have made this tree popular in India. It is mainly meant for its fruit value. It is commercially grown in some countries for a gum like substance chickle, extracted from the latex of unripe fruit, which is used in the preparation of chewing gum. The fruit has many health beneficial ingredients in sufficient quantities such as dietary fibres, sugars like fructose and sucrose, phenolics, viz., gallic acid, catechin, chlorogenic acid, leucodelphinidin, leucocyanidin and leucopelargonidin, carotenoids, ascorbic acid, minerals like potassium, calcium, phosphorous and iron, and antioxidant compounds (Siddiqui et al. 2014; Rastegar 2015).

Now-a-days, DNA markers are widely used in genetic studies, especially in diversity analysis, DNA fingerprinting, mapping, identification of genes, disease diagnostics, pedigree analysis, hybridity confirmation, identification of sex types and marker assisted breeding (Bhat et al. 2010). Markers have become part of crop improvement programme.

Microsatellites, also known as Simple Sequence Repeats (SSRs), are segments of genomic sequences that include tandem repeats of small nucleotide motifs (1–6 bp) (Ellegren 2004). Because of their co-dominant character, multi-allelic nature, excellent repeatability, and high polymorphism information content, SSRs are widely employed in plant genetic studies. Initially, microsatellites were detected and isolated by microsatellite enrichment library method through selective hybridization. Another method is to discover microsatellite repeats in DNA databases such as EST sequences or genomic sequences. At present, Next Generation Sequencing (NGS), a robust revolutionizing technology, is used for microsatellites development (Ravishankar et al. 2015a, b; Araya et al. 2017). Using microsatellite enrichment library method, a small number of microsatellites have been isolated from Sapota and tested on a few genotypes (Arias et al 2020).

Hence, next generation sequencing technology was employed in the study to partially sequence Sapota genome which is promising approach to generate high throughput genome data at reduced cost and with less time. This would further help in the detection of thousands of microsatellite sites in the genome of target species. The NGS generated data were used to identify and standardize microsatellite markers in Sapota. Furthermore, the data were analysed for its genetic diversity and population structure of Sapota germplasm collected from different parts of India, which is available at ICAR—Indian Institute of Horticultural Research (IIHR), Bengaluru, India. The results of the analysis helped to understand the nature of diversity and genetic variation present in this introduced Sapota crop.

Materials and methods

Plant material and DNA extraction

Fifty-three genotypes of Sapota (supplementary material 1) were utilized in the study. Young leaf samples were collected from the Sapota germplasm collection which is maintained at ICAR—IIHR(Indian Institute of Horticultural Research) Bengaluru, India. Total genomic DNA was isolated from the leaves using modified CTAB method (Ravishankar et al. 2000). The quality of extracted DNA was examined using agarose gel electrophoresis (0.8%), and the concentration of DNA was determined by UV- spectrophotometer (Gene Quanta, Amersham Biosciences) at 260/280 nm.

Next Generation sequencing and de novo assembly

The total genomic DNA of ‘Cricket ball’ variety was used to perform paired end HiSeq Illumina sequencing using 2500 platform following manufacturer’s protocol (QTLomics, Technologies Pvt Ltd. Bengaluru). The FastQ files containing raw data were submitted to sequence read archive at National Centre for Biotechnology Information (NCBI) with accession ID SRP127995. The quality of paired end data was checked using the FastQC tool (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) at the default parameters followed by quality trimming of the reads with trimmomatic programme (Bolger et al. 2014). High-quality data were used for analysis i.e. bases having Phred score ≥ 40. Filtered reads were used to get optimal k-mer for assembly using Velvet Optimizer (Zerbino and Birney 2008). Based on the optimization results, assembly was performed using velvet using k-mer 31. The assembled contigs with a length longer than 1 kb were taken for SSR mining.

Microsatellites mining and primer designing

MISA software, a Perl-based SSRs motifs scrutinizing tool, was used for the identification and localization of microsatellites (http://pgrc.ipk-gatersleben.de/misa/). The SSR contigs were screened for dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, hexanucleotide and complex repeats. The primers were designed for the flanking regions of SSR contigs using a web-based application Batchprimer3 (You et al. 2008).

From the generated primers sequence, 75 primer pairs were chosen and tailed with M13 sequence ‘GTAAAACGACGGCCAGT’ at the 5´ end of forward primers, and sequence ‘GTTTCTT’ at the 5´ end of reverse primers. The M13 sequences were labelled with four dyes FAM, VIC, NED and PET at their 5´ end and used as probes (Schuelke 2000).

PCR and SSRs validation

PCR conditions were standardized with a 15 μl volume of reaction containing 50–100 ng of DNA, 1.5 μl of 10X Taq Buffer, 0.8 μl of 10 mM dNTPs, 0.5 μl of 5 μM locus-specific forward primer, 1.0 μl of 5 μM reverse primer, 1.0 μl of 5 μM M13 probe and 0.5 unit of Taq DNA polymerase (Genei). PCR was carried out in BioRad Thermocycler with the following temperature profile: 94 °C for 4 min followed by 35 cycles at 94° C for 45 s, an annealing temperature of 55 °C for 45 s, and 72 °C for one minute. The final extension reaction was proceeded at 72 °C for 7 min. For confirmation, the amplified products were separated on 2% agarose gel. The PCR samples were then mixed by combining four PCR products, labelled with four different fluorophores (FAM, VIC, NED and PET) into a single tube. These samples were separated on the 96 capillary-automated DNA Sequencer (Applied Biosystem Company, USA) at M/s Bioserve facility, Hyderabad.

Data analysis

The raw data generated through genotyping were analysed using GeneMarker V2.6.3 software (Softgenetics, LLC) for detecting the fragment size of the alleles. The generated fragment size data were used for genetic analysis using CERVUS 3.0 software (Kalinowski et al. 2007) for calculating the number of alleles (k), observed heterozygosity (Hobs), expected heterozygosity (Hexp) and polymorphic information content (PIC). The Probability of Identity (PI) was also calculated using IDENTITY 1.0 software (Wagner and Sefc 1999). The genetic relationship among 53 Sapota genotypes was analysed and examined by dendrogram analysis using neighbour-joining method with DARwin 6 software (Perrier and Jacquemoud-Collet 2006). For further confirmation, the genetic structure analysis was inferred with STRUCTURE V2.3.4 software (Kaeuffer et al. 2007) with a data set of 30 SSR markers. An assignment test of individuals was carried out using admixture model with 200,000 burn-ins and 200,000 iterations, and a total of 10 independent simulations were run for each value of k tested ranging from K = 1 to K = 20. Further, delta k values were calculated using Evanno method, and graphical representations were deduced using the STRUCTURE HARVESTER programme (Earl and VonHoldt 2012; http://taylor0.biology.ucla.edu/structureHarvester/) and STRUCTURE PLOT (Ramasamy et al. 2014; http://omicsspeaks.com/strplot2/), respectively. AMOVA analysis was performed based on the population clusters obtained by STRUCTURE analysis using Arlequin V 3.5 software with 10,000 permutations (Excoffier et al. 2005). AMOVA derived genetic differentiation values (FST) between pairs of populations analogous to F-statistics were calculated.

Results

Genome sequencing and de novo assembly

The Next Generation Sequencing technology Illumina HiSeq 2500 platform was used to sequence genomic DNA from cultivar Cricket ball. The sequencing run yielded 3,326,257,143 bases from 22,028,193 reads. Low quality reads were filtered, and 21,654,953 (98.31%) paired end reads were obtained finally. Assembly optimization was done with k-mer 31 as it had optimal readings for N50 which generated 6,396,224 contigs in the assembly (Supplementary file 2). The total length of assembled contigs was 839,591,847 bp. The longest contig had a length of 11,136 bp, and minimum length of the contig was 61 bp. The average GC content of the genome was 40.91%.

Microsatellites mining

Out of 16,416 sequences examined, 2591 sequences contained SSRs. Among 2591 sequences, a total of 3591 SSRs were identified. There were 710 sequences containing more than one SSR, and 252 SSRs were found in compound formation. Among 3591 SSRs, there were 2122 monorepeats (59.1%), 1027 direpeats (28.6%), 294 trirepeats (8.2%), 63 tetrarepeats (1.75%), 10 pentarepeats (0.25%) and 3 hexarepeats (0.075%). Monorepeats were found to be more predominant class compared to di, tri, tetra, penta and hexa repeats (Fig. 1). However, design primers using Batchprimer3 web-based software was done only for 1285 microsatellite regions (supplementary material 2). From that set, randomly selected 75 primers were screened for PCR, from which 30 successfully amplified primers with clear one or two bands were used for genotyping. SSRs with 6 tandem repeats (25.5%) were more common, followed by 5 tandem repeats (15.5%), 3 tandem repeats (14.6%), 4 tandem repeats (12.7%), 7 tandem repeats (11.5%), 8 tandem repeats (8.6%), 9 tandem repeats (6.8%), 10 tandem repeats (3%) and 11 tandem repeats (1.7%). Among the direpeats, AT/AG repeat motifs (33.3%) were the most common motif. The most frequent trirepeat motifs were TCT/AAT (10.1%), and tetrarepeat motif was AAAT (12.5%).

Fig. 1
figure 1

Distribution of repeat motifs in Sapota

Validation of SSRs and genetic analysis

A set of 75 primers was chosen for experimental validation which included di, tri and tetra repeats. Out of these, 14 primers failed to amplify Sapota DNA. From the amplified set of primers, 30 were used for genotyping of 53 Sapota accessions.

In the present study, the data on 30 polymorphic SSR markers were reported. All these 30 markers show PIC values more than 0.8. The analysis of overall heterozygosity for 30 markers revealed that the number of alleles ranged from 11 to 38 with a total of 692 alleles, and a mean of 23.10 alleles per locus. The PIC values ranged from 0.85 to 0.96 with an average of 0.9118. The mean PIC of loci with trirepeats were high 0.924 compared to that of di and tetra repeats (Table 1). Among the 30 loci, 14 markers showed PIC between 0.85 and 0.89, 13 markers with 0.90–0.95 range and 3 markers with PIC greater than 0.96. The PI (Probability of Identity) values ranged from 0.002 to 0.043 with an average PI of 0.0145, and the total probability of identity was 2.215230e−59. The locus SapSSR_34 showed highest PIC value 0.962 with 38 alleles and lowest PI value 0.002 (Table 1). The twelve markers, SapSSR_34 (0.002), SapSSR_32 (0.003), SapSSR_4 (0.004), SapSSR_2 (0.005), SapSSR_16 (0.006), SapSSR_21 (0.006), SapSSR_23 (0.006), SapSSR_54 (0.006), SapSSR_1 (0.007), SapSSR_15 (0.008), SapSSR_3 (0.009) and SapSSR_8 (0.009) were found to be having very low PI values. These informative markers will be a useful tool in Sapota breeding programs, diversity analyses and DNA fingerprinting.

Table 1 Genetic analysis of thirty novel sapota microsatellite markers developed

Using the data generated from the 30 SSR markers, cluster analysis was done following Neighbour-joining tree method, which showed three clusters (Fig. 2). In general, the hybrid varieties were grouped with the maternal parent. This clustering pattern was further assessed using the STRUCTURE V2.3.4 on the basis of assignment tests carried out. The Evanno method illustrated an ideal K = 3 (Fig. 3a). This K = 3 authenticates the N-J analysis deduced with three clusters. The STRUCTURE analysis showed 30 genotypes shared ancestry among the population of 53 genotypes (56.6%; Fig. 3). Both the STRUCTURE analyses and N-J analysis showed admixed population.

Fig. 2
figure 2

Neighbour-joining analysis of 53 genotypes of Sapota using DARwin 6.0 software

Fig. 3
figure 3

(a) Graph of delta k values to determine ideal number of groups present in sapota using 30 SSR loci and the Evanno method implemented in STRUCTURE HARVESTER program, (b). Proportion of estimated ancestry coefficients of sapota populations using 30 SSR loci and STRUCTURE program with k = 3, depicted using STRUCTURE PLOT

Analysis of Molecular Variance (AMOVA) showed only 6.33% of the variation was found among the groups, 63.33% of the variation was found among populations within groups and 30.34% of the variation within populations. The Fst value estimated from AMOVA was 0.6965 which showed high genetic differentiation (Table 2).

Table 2 AMOVA analysis of genetic variances within and among populations of Manilkara zapota (level of significance is based on 10,000 iterations)

Discussion

Sapota was an introduced crop by the Spaniards to Asia, and it was found to be brought to India through Sri Lanka. Despite its cultivation throughout the world, very few studies on its genetic diversity using molecular markers. So far, molecular characterization in Sapota has been done using dominant DNA markers like RAPD (Meghala et al. 2005; Jalawadi et al. 2014; Kumar et al. 2015).

In present study, microsatellite markers for Sapota were developed to characterize at molecular level, and to have clear understanding of genetic diversity. SSR markers are widely used in species identification, genome mapping in crop breeding programmes, forensics, phylogeography and population genetics due to their abundance availability in the genome, high polymorphism, easy reiteration and cost effectiveness (Ravishankar et al. 2011, 2015c). In the absence of sequenced genomes in non-model species like Sapota, the enrichment of genomic libraries with microsatellite markers will be advantageous to develop molecular markers for genetic studies. However, so far there is only one report on development of SSR markers in Sapota using microsatellite library enrichment method (Arias et al 2020), which identified a 47 SSR markers and number of alleles reported in this study was low, as they used only 4 genotypes. The distribution of different types of repeat motifs cannot assessed using microsatellite enrichment library method. NGS Illumina HiSeq 2500 platform was utilized for developing this informative and versatile DNA based microsatellite markers. At present, NGS has transformed the development of microsatellite markers in to a quick, simple and cost effective method with a high throughput data, identifying a large number of loci in the genome (Ravishankar et al. 2015b, c; Unamba et al. 2015; Hodel et al. 2016).

The Sapota cultivar Cricket Ball genomic DNA was sequenced which generated 3,326,257,143 bases from 22,028,193 reads and after assembly 6,396,224 contigs were obtained. The GC content of contigs was 40.91% which is within the range commonly observed for plant genomes (Smarda and Bures 2012). A total of 2591 sequences containing 3591 SSRs were identified. It is found that mononucleotide repeats were more predominant, accounting for 59.1% of all the observed repeats followed by direpeats (28.6%). Others like tri, tetra, penta and hexarepeats accounted for less than 10% (Fig. 1). This finding was in accordance with other crops like mango (Ravishankar et al. 2015c), Garcinia gummi-gutta (Ravishankar et al. 2017), rice, sorghum, Brachypodium, Arabidopsis, Populous (Sonah et al. 2011) with predominant monorepeats. Dinucleotide repeats were also common in other crops like Pouteria sapota (Arias et al. 2015), Pomegranate (Ravishankar et al. 2015b), sour passion fruit (Araya et al. 2017), Manchurian walnut (Hu et al. 2016), American Cranberry (Zhu et al. 2012).

Among different motif types, AT and AG dinucleotide repeat motifs, TCT and AAT trinucleotide repeat motifs and AAAT tetranucleotide repeat motif are higher in frequency. Similar pattern was observed in many crops including Pouteria sapota (Arias et al. 2015). The higher frequency a particular repeat motif and its length in the plant genome might be due to selection pressure on that motif over the years during selection and evolution. The evolution of microsatellites in plant genome is neither well studied nor understood properly. The reason could be due to mutational mechanism through replication slippage. The other causes might be unequal crossing over, nucleotide substitution, and duplication events. However, they may not explain specific pattern of motif repeats in different species (Buschiazzo and Gemmell 2006; Sonah et al. 2011; Ravishankar et al. 2015c).

Present study reports successful development of thirty polymorphic microsatellite markers with high PIC values more than 0.8. The mean PIC value was 0.912. The high number of polymorphic SSRs isolation may be due to Illumina paired-end sequencing which provides an effective alternative to the expensive and time consuming conventional microsatellite enrichment library based method of genome wide SSRs isolation. According to Botstein et al. (1980), any locus with PIC more than 0.5 is highly polymorphic. The average number of alleles per locus was 23, with a range of 11 to 38. With a mean value of 0.0141, the PI values vary from 0.0026 to 0.0370. The higher PIC and relatively high alleles per locus indicate the presence of high heterozygosity in this species. It is observed 17 (32%) SSR markers showing alleles more than 20 per locus indicating high heterozygosity and diversity in the genotypes used. The markers with low PI can be used as universal primers for Sapota DNA fingerprinting.

The clustering pattern in N-J analysis showed three clusters (Fig. 2), cluster 1 has most of the genotypes from Gujarat state, Eastern region of India, Cluster 3 includes genotypes mainly from southern states Tamil Nadu, Telangana, and Andhra Pradesh., Cluster 2 has admixture of genotypes from different regions. Generally, there is an affiliation towards geographical region where they originated. However, there is no clear separation was observed. Our results deviate from the studies carried out in Sapota germplasm in India using RAPD markers by Jalawadi (2014), Jalawadi et al. (2014), Kumar et al. (2014) stated that the clustering in Sapota was based on size and shape of the fruit. Further analysis of the clustering pattern using the STRUCTURE programme figured out ideal value of k = 3 which also showed admixed population with ancestry shared among 56.6% of the population. In Pouteria sapota, a study carried out by Arias et al. (2015) microsatellite markers showed a clustering pattern based on the geographical locations and the STRUCTURE analyses showed admixed population(Fig. 3). The hindrance in the clustering pattern in N-J analysis is probably due to the admixed population shown by STRUCTURE analyses.

Analysis of molecular variance revealed a significant Fst value of 0.69659 indicating high genetic differentiation among the 53 genotypes and 3 populations studied. There was a high differentiation among populations within groups and low differentiation among groups. The observed diversity among populations within groups indicates likely coexistence of different genotypes in the same region (Ravishankar et al. 2015c). In this study the genetic differentiation within the populations showed by the STRUCTURE analysis is similar with the AMOVA results.

Sapota is a crop that has been introduced to India. The variability of such introduced crops is determined by the number of cultivars/genotypes introduced and the degree of heterozygosity, both of which are unknown in this case. As a result, it is projected to be narrow, with little genetic variation. However, the results show that high genetic differentiation and diversity in the Sapota population. This is in accordance with earlier studies by Jalawadi et al. (2014) and Kumar et al. (2014). Initially, Sapota might have been cultivated using seedlings, due to their high heterozygosity nature there are high variations in the off-springs. However, low observed heterozygosity in the open pollinated crop, might be due to small number of introduction, and Sapota is grown in small packets predominately as a single variety, which causes a type of inbreeding depression due to lack of diverse pollen sources. It was later selected and vegetatively propagated based on the preferences of the region for fruit quality and productivity. As a result of seedling segregation, considerable diversity and a high degree of genetic variability in Sapota could have emerged. It is also possible that a significant number of seedlings or grafts of Sapota were imported to India from the place of its origin. This is the first study to employ microsatellites to evaluate the genetic diversity of Indian Sapota collections. The SSR markers developed would aid in the development of a linkage map, the assessment of genetic diversity, and molecular characterization.