Introduction

Papaya is the sole representative of the genus Carica in the family Caricaceae, and its wild relatives are now classified under Vasconcellea (Badillo 2000). Papaya is believed to have originated in the Mesoamerican center (south of Mexico and Central America) (Fuentes and Santamaría, 2014) and it was introduced to India by Spaniards in the sixteenth century (Singh 1990). Currently, it is cultivated in tropical and subtropical regions worldwide. Globally, papaya is cultivated in an area of 0.48 million ha with a production of 14.1 million metric tonnes. India contributes 40% (5.54 million tonnes) of papaya production with 30% (0.14 million ha) of the global papaya cultivated area (FAOSTAT 2022).

Papaya, with its rich nutrient profile and wide adaptability, plays a vital role in food and nutritional security (Pinnamaneni 2017) in tropical and subtropical regions. Its cultivation is profitable globally owing to its demand and efficient productivity. In addition, dried milky latex from mature papaya, called papain, has significant applications in biotechnology and industrial sectors (Elsson et al. 2019). It is particularly used in industries such as pharmaceuticals, breweries, tanneries, cosmetics, detergents (Saran and Choudhary 2013), and the processing of cheese, meat, and fish (Mamboya and Amri 2012).

Genetic resources of crops are essential for food security (Toledo and Burlingame 2006). A wide gene pool aids in understanding evolutionary relationships and breeding better traits such as disease resistance and fruit quality. A larger population increases the chances of identifying individuals with the desired traits in various environments. Morphological and agronomic traits, such as plant height, juvenile period, flower initiation, leaf shape, fruit shape, flesh color, stamen abortion, carpelloidy, and fruit yield, can vary owing to genotype and environment interaction (Campostrini and Glenn 2007; Silva et al. 2007; Kumar et al. 2015; Kaluram et al. 2018). Field observations can help to estimate genetic diversity; however, environmental factors can affect the same gene differently (Weckwerth et al. 2020), making it difficult to draw conclusions. Genotyping is the most reliable method because it is unaffected by environmental factors and can identify variations at the genome level.

Genotyping using molecular markers has been used for germplasm characterization and conservation for many years. A large extent of genetic diversity has been reported within Caricaceae and the genus Carica in molecular marker studies. Different molecular markers have been used to analyze genetic diversity, including random amplified polymorphic DNA (RAPD) (Stiles et al. 1993; Jobin-Décor et al. 1997), restriction fragment length polymorphism (RFLP) and amplified fragment length polymorphism (AFLP) (Van Droogenbroeck et al. 2002; Kim et al. 2002; Ratchadaporn et al. 2007; Oliveira et al. 2011), inter-simple sequence repeats (ISSR) (Costa et al. 2011; Kanupriya et al. 2012), and simple sequence repeats (SSR) (Oliveira et al. 2010a, b; Matos et al. 2013; Sengupta et al. 2013; Pirovani et al. 2021). Among these, SSR markers are considered robust molecular tools for the analysis of genetic diversity because of their abundance in the genome and their high reproducibility (Eustice et al. 2008). In addition, SSR markers have been used for sex identification (Parasnis et al. 1999), analysis of segregating populations (Pinto et al. 2013), DNA fingerprinting (Vitoria et al. 2004), and genetic mapping (Blas et al. 2012).

In India, studies on the genetic diversity of papaya have been conducted based on morphological traits and conventional molecular markers (Singh et al. 1997; Singh and Kumar 2010; Sudha et al. 2013; Saran et al. 2015). However, the extent of genetic diversity within the active germplasm of papaya remains unexplored. The Tamil Nadu Agricultural University (TNAU) has a long history of collecting papaya genotypes (Ram 2005) and releasing five inbred cultivars and three hybrids. Here, we maintained a diverse population of papaya genotypes consisting of landraces, cultivars, improved cultivars, and exotic collections. Despite its potential significance, evaluation of genetic diversity among the germplasm collections available in TNAU using molecular markers has not been attempted. Fifteen SSR primers were used to analyze 55 papaya accessions of the TNAU papaya germplasm. The objective of this study was to assess the genetic variation within the germplasm and determine its population structure. The results provide insights into genetic diversity and population structure, aiding conservation management, targeted breeding, and collection expansion. In addition, this study can be instrumental in framing policies related to germplasm conservation and utilization.

Material and methods

Plant material

Seeds of 55 papaya accessions were procured from the germplasm repository of the Department of Fruit Crops, Horticultural College & Research Institute, TNAU, Coimbatore. The seeds were sown in polybags and, after 45 days, polybags containing five to six seedlings each were transplanted (spacing 1.8 m × 1.8 m) into the field at the College Orchard, TNAU, Coimbatore. The experiment was conducted using a randomized block design with 15 plants per accession in each experimental plot. A list of the 55 papaya accessions is presented in Table 1.

Table 1 List of 55 accessions used in genetic diversity study

Genomic DNA isolation

At the fruit maturation stage, the fourth leaf from the top of the tree was collected from selected female or hermaphrodite plants. Genomic DNA was extracted from these healthy leaves using the CTAB method (Doyle 1991). DNA quality was determined on a 0.8% agarose gel, and the quantity and purity were recorded using a spectrophotometer (NanoDrop1000c, Thermo Scientific). The extracted DNA was diluted to 50 ng/µL and stored at −20 °C until further analysis.

SSR analysis

A set of 16 SSR primers (Table 2) was selected from the microsatellite sequences developed by Perez et al. (2006). The PCR assay was carried out in a 10 µL reaction mixture containing 1.0 µL of reaction buffer (10X with 20 mM MgCl2), 0.2 µL of 10 mM dNTP, 0.5 µL each of the forward and reverse primers (10 µM), 1.0 µL of genomic DNA, and 0.5 U of Taq polymerase. The amplification reaction was performed as follows: initial denaturation at 94 °C for 4 min; 35 cycles of denaturation at 94 °C for 1 min, primer annealing (temperature adjusted according to the primers) for 1 min, and primer extension at 72 °C for 45 s; final extension at 72 °C for 4 min; and a final hold at 4 °C. The amplified PCR products were resolved by agarose gel electrophoresis (3%) and visualized using a gel documentation system (Alpha Imager, USA). The amplicon size was measured against a Takara 100 bp ladder.

Table 2 List of simple sequence repeat (SSR) primers used in genetic diversity study

SSR-based diversity analysis

The gel images of the SSR bands were scored by molecular weight using GelAnalyzer (Version 19.1 (www.gelanalyzer.com) by Istvan Lazar Jr., PhD and Istvan Lazar Sr., PhD, CSc), and the data were recorded. From the recorded molecular data, the number of alleles, number of effective alleles, Shannon’s information index, observed heterozygosity, and expected and unbiased expected heterozygosity were computed using the GenAlEx software (version 6.0.5) (Peakall and Smouse 2012). PowerMarker (Liu and Muse 2005) was used to calculate the allele frequency and polymorphism information content (PIC) of the markers and to generate an unweighted pair group method with arithmetic mean (UPGMA) dendrogram based on the shared allele distance.
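The per-locus statistics listed above can be illustrated with a minimal sketch. It assumes the standard frequency-based definitions (effective alleles Ne = 1/Σp², Shannon index I = −Σp ln p, expected heterozygosity He = 1 − Σp², and unbiased He = 2N/(2N − 1) × He); the function name and the example frequencies are hypothetical and are not taken from the study data.

```python
import math


def diversity_stats(allele_freqs, n_individuals):
    """Per-locus diversity statistics from allele frequencies.

    allele_freqs: frequencies of the alleles at one locus (must sum to 1).
    n_individuals: number of diploid individuals scored at the locus.
    The formulas are the standard frequency-based definitions and are an
    assumption about what the software reports, not its actual code.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    present = [p for p in allele_freqs if p > 0]
    homozygosity = sum(p * p for p in present)      # sum of squared frequencies
    two_n = 2 * n_individuals
    he = 1.0 - homozygosity                          # expected heterozygosity
    return {
        "Na": len(present),                          # number of alleles
        "Ne": 1.0 / homozygosity,                    # effective alleles
        "I": -sum(p * math.log(p) for p in present), # Shannon's index
        "He": he,
        "uHe": he * two_n / (two_n - 1),             # unbiased He
    }


# Hypothetical locus with four equally frequent alleles scored in 55 plants
stats = diversity_stats([0.25, 0.25, 0.25, 0.25], 55)
for name, value in stats.items():
    print(name, round(value, 3))
```

With equally frequent alleles the number of effective alleles equals the number of alleles; skewed frequencies pull Ne below Na, which is why both are usually reported together.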

Population structure analysis

Bayesian model-based clustering was performed using STRUCTURE V.2.3.4 (Pritchard et al. 2000) to assign the individuals to clusters (subpopulations). Without prior population information, the parameters were configured as an admixture model with correlated allele frequencies. Ten independent runs were performed for K values ranging from 1 to 10, with a burn-in period of 500,000 iterations followed by 500,000 Markov chain Monte Carlo (MCMC) iterations. The generated output was compressed and uploaded to STRUCTURE HARVESTER V.0.9.94 (http://taylor0.biology.ucla.edu/structureHarvester/) (Earl and Holdt 2012), which was used to determine the best K value as outlined by Evanno et al. (2005). Individuals were assigned to clusters using the membership coefficient (q), and samples showing q < 0.8 were termed “genetic admixture” within that particular cluster.
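The delta K criterion used to choose K can be sketched as follows. This is a simplified illustration that assumes delta K is the absolute second-order difference of the mean Ln P(D) across replicate runs divided by the standard deviation of Ln P(D) at that K; the function name and the log-likelihood values are hypothetical, not output from this study.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {K: delta K} from replicate Ln P(D) values per K.

    lnp_by_k maps each K to a list of Ln P(D) values, one per replicate run.
    Delta K is undefined for the smallest and largest K tested, because the
    second-order difference needs both neighbouring K values.
    """
    mean_l = {k: mean(values) for k, values in lnp_by_k.items()}
    delta = {}
    for k in sorted(lnp_by_k):
        if k - 1 not in mean_l or k + 1 not in mean_l:
            continue
        sd = stdev(lnp_by_k[k])
        if sd == 0:
            continue  # delta K cannot be computed without run-to-run variance
        second_diff = abs(mean_l[k + 1] - 2 * mean_l[k] + mean_l[k - 1])
        delta[k] = second_diff / sd
    return delta


# Hypothetical Ln P(D) values from three replicate runs per K
runs = {
    1: [-3000, -3002, -2998],
    2: [-2700, -2701, -2699],
    3: [-2650, -2655, -2645],
    4: [-2640, -2660, -2620],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
print(delta_k, best_k)
```

Because delta K cannot be evaluated at K = 1, a peak at K = 2 should be read alongside the raw Ln P(D) curve rather than in isolation.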

Discriminant analysis of principal components (DAPC) was performed on the SSR dataset in R (version 4.3.1) using the adegenet package (Jombart 2008). The SSR dataset was imported using the poppr package (Kamvar et al. 2014). The major advantage of DAPC is that it does not rely on population genetics models, such as Hardy–Weinberg equilibrium or linkage equilibrium (Jombart et al. 2010). The data were first transformed by principal component analysis (PCA), followed by a discriminant analysis of the retained principal components (PCs). First, clusters were identified using the find.clusters function, which is based on the K-means algorithm, with K values varying from 1 to 10. The number of clusters was chosen based on the Bayesian information criterion (BIC). Next, the number of PCs to retain was determined using the a-score optimization function of adegenet. The final clusters were generated using discriminant analysis.

Analysis of molecular variance (AMOVA)

Genetic differentiation among populations, among individuals, and within individuals was determined using AMOVA (Excoffier et al. 1992) implemented in the GenAlEx 6.503 software. To assess differentiation among the populations, pairwise Fst values and gene flow (Nm) were computed.

Results

Assessment of polymorphisms in SSR loci

A set of 15 SSR markers amplified DNA fragments efficiently in the 55 papaya accessions, and the results are given in Table 3. The SSR markers generated 95 alleles across all accessions, with an average of 6.3 alleles per marker. The number of alleles detected per primer varied from four to ten. The lowest number of alleles was recorded for the three primers S285, mcpCIR09, and mcpCIR16, whereas the highest number of alleles was recorded for mcpCIR28. The number of effective alleles ranged from 3.05 (mcpCIR16) to 7.70 (mcpCIR28), with an average of 4.37. The mean major allele frequency was 0.34, with a range of 0.23 to 0.55. The highest major allele frequency was recorded in S 285 and the lowest in mcpCIR28.

Table 3 Genetic diversity parameters of 55 papaya accessions from data of 15 simple sequence repeat (SSR) markers

The Shannon’s information index (I) was highest in mcpCIR28 (2.15) and lowest in S 285 (0.96). The observed heterozygosity ranged from 0.00 to 0.27 with an average value of 0.03. The expected heterozygosity or gene diversity detected by all SSR loci varied from 0.75 to 0.87, with an average value of 0.75. The polymorphism information content of the loci ranged from 0.48 (S 285) to 0.85 (mcpCIR28), with an average value of 0.72. The size of the alleles produced by the 15 SSR primers ranged from 67 to 780 bp.

Dendrogram

Based on the SSR marker data, a dendrogram of the 55 papaya accessions was constructed using the unweighted pair group method with arithmetic mean (UPGMA) algorithm (Fig. 1). The dendrogram grouped the 55 accessions into four groups. Group 1 (G1) contained 13 accessions and was divided into two subgroups: two accessions from IIHR-Bangalore, Sunrise Solo, CO.7, and Waimanalo in one subgroup and seven exotic collections in the other. Group 2 (G2) was composed of 11 accessions in two subgroups: four local collections in one subgroup, and four exotic collections, Washington, Tainung II, and PAU selection in the other. Group 3 (G3) included 13 accessions in two subgroups: TNAU cultivars (CO.1 (R & Y), CO.2 (Y), CO.4 (R & Y), and CO.5 (R)) in subgroup 1 and five nearby local collections in subgroup 2. Group 4 (G4) contained 18 accessions in two subgroups, consisting of three exotic collections (Singapore, Mexican, EC.611 100), one inter-varietal hybrid (CO.3 × Washington), one local collection (Perur), six Pusa varieties (Pusa Dwarf, Pusa Giant, Pusa Delicious, Pusa Majesty, Giant) and their derivatives (CO 5 and CO 6), and two cultivars from the Madhya Pradesh region (Barwani (R & Y)).

Fig. 1

Dendrogram based on the shared allele distance of 55 accessions. Letters in parentheses indicate pulp color: R, red pulp; Y, yellow pulp

Population structure

To understand the population structure of the 55 papaya accessions, Bayesian clustering analysis using STRUCTURE and discriminant analysis of principal components were performed. The optimal K value was obtained using the methods described by Pritchard et al. (2000) and Evanno et al. (2005). As shown in Fig. 2, delta K was highest at K = 2. The bar plot for K = 2 showed that, of the 55 accessions, 31 were grouped in one population and 24 in the other, of which one accession was genetically admixed (Fig. 3). Population I consisted mostly of dioecious accessions, and population II consisted of gynodioecious accessions.

Fig. 2

Graph of delta K values, with the best value at K = 2, derived from STRUCTURE HARVESTER using the STRUCTURE output

Fig. 3

Population structure of 55 accessions of papaya germplasm based on STRUCTURE. Red and green columns indicate the two populations

The DAPC method revealed four distinct clusters (Fig. 4), corresponding to the number of clusters (K = 4) indicated by the BIC values obtained using the find.clusters function (Supplementary Fig. 1). Principal components were retained using the a-score optimization method (Supplementary Fig. 2). Clusters were formed by retaining the first five principal components (cumulative variance = 50%) and three discriminant eigenvalues. A total of 12 accessions were assigned to Cluster I, 14 to Cluster II, 17 to Cluster III, and 12 to Cluster IV (Fig. 5). Varieties derived from Sunrise Solo and local accessions were grouped in Cluster I; six accessions were common to dendrogram Group 1 and DAPC Cluster I. Cluster II grouped the varieties released from TNAU, Coimbatore and accessions collected in nearby areas. Cluster III comprised Pusa varieties and varieties derived from Pusa cultivars. DAPC Clusters II and III were similar to dendrogram groups G3 and G4, respectively. Cluster IV grouped all the exotic collections in the germplasm; in contrast, dendrogram Group 2 shared only four accessions with DAPC Cluster IV.

Fig. 4

Discriminant analysis of principal components (DAPC) for 55 papaya accessions. Each circle represents a cluster and each point represents an individual

Fig. 5

Cluster plot of 55 accessions based on discriminant analysis of principal components (DAPC)

Genetic diversity of the identified populations by DAPC analysis

The gene diversity of the populations generated from the DAPC analysis was calculated, and the results are presented in Table 4. Among the populations, cluster III recorded the highest number of alleles (4.20), whereas the lowest number of alleles (2.60) was recorded in cluster IV. Similarly, the allele frequency and number of effective alleles were highest in cluster III (4.06 and 3.01, respectively) and lowest in cluster IV (2.60 and 2.03, respectively). Cluster III recorded the highest expected heterozygosity of 0.63, and Cluster IV recorded the lowest value (0.43).

Table 4 Gene diversity parameters of four clusters from discriminant analysis of principal components (DAPC)

Analysis of molecular variance within and among the population of papaya accessions based on results of DAPC analysis

The populations derived from DAPC analysis were tested for genetic differentiation using the SSR genotypic data. The extent of genetic variability among populations, among individuals, and within individuals in the germplasm was analyzed using AMOVA (Table 5). The analysis showed that 75% of the variation existed among individuals, which was considerably higher than the variation among populations (22%) and within individuals (3%). The Fst value was 0.216 (0.15 < Fst < 0.25), indicating a moderate level of genetic differentiation among the populations. The Nm value of 0.905 indicated low gene flow among the populations. The pairwise Fst value was highest (0.29) between Clusters II and IV and lowest between Clusters II and III. Accordingly, gene flow (Nm) was highest (1.26) between Clusters II and III (Supplementary Table 1).
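The relationship between the reported Fst and Nm values can be checked with a short sketch. It assumes the island-model approximation Nm = (1/Fst − 1)/4, which is the usual way gene flow is derived from Fst; the function names are hypothetical, and small differences from the reported figures are expected because the published Fst values are rounded.

```python
def gene_flow_from_fst(fst):
    """Estimate Nm from Fst, assuming Nm = (1/Fst - 1) / 4 (island model)."""
    if not 0.0 < fst < 1.0:
        raise ValueError("Fst must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


def fst_from_gene_flow(nm):
    """Inverse of the same approximation: Fst = 1 / (4 * Nm + 1)."""
    if nm < 0:
        raise ValueError("Nm cannot be negative")
    return 1.0 / (4.0 * nm + 1.0)


# Overall Fst reported in the text (rounded to three decimals)
print(round(gene_flow_from_fst(0.216), 3))
# Highest pairwise Fst reported (Clusters II and IV)
print(round(gene_flow_from_fst(0.29), 3))
# Pairwise Fst implied by the highest reported Nm (Clusters II and III)
print(round(fst_from_gene_flow(1.26), 3))
```

Under this approximation, Nm falls below 1 whenever Fst exceeds 0.2, so an overall Fst of 0.216 and an Nm just under 1 are two views of the same result.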

Table 5 Analysis of molecular variance of four populations from discriminant analysis of principal components (DAPC)

Discussion

Molecular markers provide a comprehensive understanding of the genetic diversity and population structure of germplasm without any environmental influence. SSR markers are well suited for germplasm diversity analysis because they are easy to use, highly polymorphic, and reliable (Powell et al. 1996; Varshney et al. 2007). Earlier studies have reported that the papaya genome contains abundant SSRs, which are useful for detailed genetic studies of population structure, hybrid testing, evolutionary studies, and QTL mapping (Santos et al. 2003; Perez et al. 2006; Eustice et al. 2008; Oliveira et al. 2011; Matos et al. 2013). In this study, SSR genotypic data were used to evaluate the genetic diversity among 55 selected papaya accessions and to understand the genetic variation and existing population structure among individuals.

In this study, the choice of markers was based on the previous study by Perez et al. (2006). The set of SSR markers used provided a distinct genetic structure of the individuals in the papaya germplasm. Fifteen polymorphic simple sequence repeat (SSR) markers revealed 95 alleles across 55 papaya germplasm accessions. The number of alleles per locus ranged from 4 to 10, with an average of 6.3. This is lower than the 7 alleles per locus reported by Sengupta et al. (2013) for 34 accessions, including Indian and non-Indian accessions. Our results are similar to those of Ocampo Perez et al. (2006), who found an average of 6.6 alleles per locus in 72 accessions using 15 SSR markers, and Hasibuzzaman et al. (2020), who reported six alleles per locus for 34 genotypes with 10 SSR markers. In contrast, De Oliveira et al. (2010a) found 4.02 alleles per marker in 48 papaya accessions with 59 SSR markers, whereas Matos et al. (2013) reported 4.08 alleles per marker in 96 accessions with 15 microsatellite markers. The high number of alleles in the papaya germplasm may be due to the collection and conservation of accessions from all papaya-growing regions in India together with exotic collections. In the roughly 500 years since its introduction to India, papaya has been naturalized and widely cultivated, leading to considerable genetic diversity. A wide range of cultivars exists in India, including primitive types, locally adapted cultivars, minor cultivars, and principal cultivars released from the Indian Agricultural Research Institute, Regional Station at Pusa; Tamil Nadu Agricultural University, Coimbatore; and the Indian Institute of Horticultural Research, Bengaluru (Ram 2005).

Gene diversity and polymorphism information content

Nei’s gene diversity (expected heterozygosity) and polymorphism information content are reliable measures for assessing genetic variation in a population. The average gene diversity in this study was 0.75, similar to that reported for 31 papaya genotypes (0.74) from various countries including Bangladesh (Hasibuzzaman et al. 2020). This is higher than the values reported for Caribbean populations (0.37–0.69) (Ocampo Perez et al. 2007), USDA germplasm (0.58) (Luciano-Rosario et al. 2018), and natural populations in Costa Rica (0.62) (Brown et al. 2012).

Botstein et al. (1980) classified a PIC value > 0.5 as high locus diversity, a PIC value < 0.25 as limited diversity, and values between 0.25 and 0.50 as intermediate diversity. The average PIC value of 0.72 in our germplasm therefore indicates high genetic diversity. Comparatively, Hasibuzzaman et al. (2020) reported a value of 0.70, showing a similar level of diversity in the Bangladeshi germplasm. Sengupta et al. (2013) observed a slightly higher PIC value of 0.74 in Caricaceae accessions, and Asudi et al. (2013) reported the highest value, 0.81, indicating diverse Kenyan germplasm. In the Embrapa papaya genebank, Oliveira et al. (2010a) found 0.52 in 30 selected accessions and Matos et al. (2013) reported 0.47 in 96 accessions, indicating lower diversity in those genebank accessions than in our study.
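The PIC thresholds above can be made concrete with a small sketch. It assumes the usual Botstein et al. (1980) formula, PIC = 1 − Σp_i² − Σ_{i<j} 2p_i²p_j², applied to the allele frequencies of a single locus; the function names and example frequencies are hypothetical and do not come from the study data.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content for one locus.

    Assumes PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2.
    """
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    sum_sq = sum(p * p for p in allele_freqs)
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(allele_freqs, 2))
    return 1.0 - sum_sq - cross


def classify_pic(value):
    """Apply the thresholds quoted in the text: >0.5 high, <0.25 limited."""
    if value > 0.5:
        return "high"
    if value < 0.25:
        return "limited"
    return "intermediate"


# Hypothetical loci: four equally frequent alleles vs. two equally frequent alleles
for freqs in ([0.25, 0.25, 0.25, 0.25], [0.5, 0.5]):
    value = pic(freqs)
    print(round(value, 3), classify_pic(value))
```

A biallelic locus can never exceed a PIC of 0.375 under this formula, so the high average reported here depends on the multi-allelic nature of the SSR loci.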

Genetic structure of the germplasm

Knowledge of the population structure of germplasm facilitates effective management and utilization of resources. The SSR data clearly revealed clusters of genetically similar accessions based on the shared allele distance, grouped using the UPGMA method. The 55 accessions in the germplasm were divided into four main groups, each with subgroups. Group 1 comprised Waimanalo, Sunrise Solo, IIHR-39, IIHR-57, CO 7, Malaysian Long, Singapore, and six other exotic collections. Some of these accessions are linked to the Hawaiian cultivar “Solo”, from which Waimanalo and Sunrise Solo were derived (Ram 2005). Accession IIHR-39 has Sunrise Solo as its main parent, and IIHR-57 is derived from Arka Surya and Tainung-I. Accessions IIHR-39 (Arka Surya) and IIHR-57 (Arka Prabhath) were released as cultivars from IIHR, Bengaluru, and are suitable for the region around the institute (Mitra and Dinesh 2019).

Group 2 had two subgroups. In subgroup 1, three local accessions (Sathyamangalam Dwarf, Valliyur collection, and Red flesh) and the PAU selection from PAU, Ludhiana were closely related. These collections were not related to a specific region. The second subgroup included an open-pollinated accession, Tainung II from Taiwan, four exotic collections (EC.100 211, EC.100 135, EC.100 012, and EC.100 064), and Washington. “Washington” papaya has a distinctive purple-colored petiole and has long been domesticated in the Maharashtra region of India (Ram 2005).

Group 3 comprised mostly accessions belonging to the Coimbatore region of Tamil Nadu. The cultivars CO.1 (red), CO.1 (yellow), CO.2 (yellow), CO.4 (yellow), CO.4 (red), and CO.5 (yellow) were closely connected in subgroup 1, and the second subgroup contained local collections (MD.13 (Veda Patti), M1 (OP), Local Acc (Y), and MD Telungu palayam), Carica pink petiole, and PKM-1 long from Periyakulam, Theni. Over the past five decades, TNAU, Coimbatore has made significant advancements in papaya crop improvement, resulting in the release of eight elite cultivars, CO.1 to CO.8 (Mitra and Dinesh 2019). The narrow genetic diversity among the TNAU-released cultivars, as revealed in the present study, could be due to the parental materials involved in their development. As the local accessions were collected within 20 km of Coimbatore, these genotypes have common alleles, probably because of the exchange of seeds among farmers (Matos et al. 2013); therefore, limited genetic differentiation existed between these groups.

Lastly, Group 4 was divided into two subgroups. One subgroup was dominated by Pusa varieties such as Pusa Dwarf, Pusa Giant, Pusa Majesty, and Pusa Delicious, together with CO.6 (selection from Pusa Majesty), CO.5 (selection from Washington), Barwani (locally adapted genotype from Madhya Pradesh), and Manila Pink from the Philippines. Interestingly, this subgroup included the hybrid Carica (wild) × CO-6 (CP-50), which was reported to be a PRSV-tolerant genotype by Balamohan et al. (2010). CO.5, derived from the Washington variety (Sharma and Mitra 2014), was distantly related to its parent. This can be attributed to several factors, such as the outcrossing nature of papaya increasing genetic distance (Kim et al. 2002) and the limited number of SSR markers influencing differentiation. The other subgroup contained accessions such as Mexican, Singapore, and Perur (local collection from Coimbatore), and thus comprised a mixture of accessions collected from various regions. Ocampo Perez et al. (2007) analyzed genotypes from various regions, including Costa Rica, Colombia, Venezuela, Guadeloupe, and the Antillean islands, and found geography-based clustering with few exceptions. In our study, although a small number of accessions are region specific, we did not observe a correlation between geographic region and cluster formation.

In addition to the dendrogram, we applied both STRUCTURE and DAPC approaches to infer the population structure of the 55 accessions. The model-based approach in STRUCTURE separated the germplasm accessions into two populations based on the delta K value (K = 2), whereas Hasibuzzaman et al. (2020) reported six populations from 31 papaya accessions collected around the world. The DAPC method, however, revealed a distinct clustering pattern that deviated considerably from the STRUCTURE results. DAPC classified the 55 selected papaya accessions into four distinct clusters, irrespective of their region of collection. The clustering pattern derived from the DAPC analysis aligned closely with the hierarchical structure depicted in the dendrogram, except for Clusters I and IV. The difference in clustering is attributed to the different methodologies and principles underlying the two analytical approaches. Using the DAPC method, Matos et al. (2013) classified a papaya germplasm of 96 selected accessions into six distinct clusters; in contrast to our study, their DAPC clusters were concordant with Bayesian clustering by the STRUCTURE algorithm. By contrast, Campoy et al. (2016) and Mariette et al. (2010) reported that DAPC analysis yielded a more comprehensive clustering pattern within the germplasm than STRUCTURE analysis.

Molecular variation in the populations

The analysis of molecular variance attributed 22% of the variation to differences among populations, whereas 75% of the variation was found among individuals within the populations, representing the bulk of the overall genetic diversity. This high variation among individuals is possibly due to the reproductive biology of papaya with its three sex forms (Matos et al. 2013), evolutionary forces such as the hybridization of highly divergent parents (Goulet et al. 2017), and the introduction of exotic collections into the germplasm (Scherlosky et al. 2018). Wright (1965) stated that an Fst (fixation index) value close to 0 signifies low genetic differentiation, a value close to 1 signifies high genetic differentiation, and values in between indicate moderate differentiation. An Nm (gene flow) value below 1 signifies limited gene flow among populations. The Fst (0.22) and Nm (0.905) values in this study therefore showed moderate genetic differentiation and limited gene flow.

Conclusion

In this study, 55 accessions selected from papaya germplasm collected worldwide were genotyped using 15 SSR markers. Allelic richness and extensive gene diversity indicated broad genetic variation in the germplasm. DAPC and UPGMA analyses each separated the accessions into four subpopulations, irrespective of their region of origin. These findings can potentially guide the expansion of the collection, effective management of resources, selection of parental lines for hybridization, and tailoring of breeding programs.