Introduction

The genus Gossypium includes 46 diploid (2n = 2x = 26) and five allotetraploid (2n = 4x = 52) species, which are distributed in Central, South, and North America (18 species), Africa and Asia (14 species), and Australia (17 species) (Wendel and Cronn 2003). Among these species, only four are cultivated: two diploids (2n = 2x = 26), G. arboreum L. (A2A2) and G. herbaceum L. (A1A1), which originated in Africa–Asia, and two tetraploids (2n = 4x = 52), G. hirsutum (AD1AD1) and G. barbadense (AD2AD2), which originated in America. Currently, G. hirsutum and G. barbadense account for 90 and 8 % of annual world cotton production, respectively. Because of its economic advantages, such as high yield and environmental suitability, G. hirsutum has attracted considerable interest from plant breeders and agricultural scientists and has been planted widely. However, the other cultivated species possess many traits favorable for cotton production that G. hirsutum cultivars lack. For example, G. herbaceum offers earliness and drought resistance (Stewart 1992), G. arboreum resistance to diseases and pests (Kantartzi et al. 2009), and G. barbadense good fiber quality (Wendel and Cronn 2003). The cultivated Gossypium species are therefore important germplasm for cotton improvement.

Although G. hirsutum is planted in more than 50 countries, most of these countries are not where cotton originated (Wendel and Cronn 2003). The domestication and breeding history of cultivated cotton after its introduction from Mesoamerica to these countries has led to a low level of genetic polymorphism (Rungis et al. 2005; Chen and Du 2006). Any crop with a narrow genetic base is more vulnerable to natural disasters such as disease outbreaks. The epidemics of cotton Verticillium wilt in China since 1993 are a typical example: none of the upland cotton varieties was found to be resistant (Zhai and Luo 1994). It is therefore crucial to explore novel germplasm resources for natural genetic diversity and to develop genomics tools that efficiently transfer this useful variation into breeding germplasm, which should help to overcome existing and potential problems of cotton production associated with the narrow genetic base of cultivar germplasm.

Several studies have shown that new genetic variation can be produced by interspecific hybridization and induced mutagenesis, thereby enriching the germplasm resources of cotton (Ahoton et al. 2003; Hussain et al. 2002). Recently, new elite cotton (G. hirsutum) germplasm with one or more of the characteristics of high yield, good fiber quality, earliness, and disease and pest resistance was created in China by interspecific hybridization combined with induced mutagenesis, polymerization backcrossing, marker-assisted selection, and introgression of exogenous genes (Sun et al. 2004). For cotton geneticists and breeders, a precise evaluation of the genetic diversity of excellent G. hirsutum germplasm provides a guide for choosing parents and predicting the degree of inheritance, variation, and level of heterosis, all of which are essential for achieving breeding goals. SSR loci are particularly useful for studying the genetic diversity and population structure of domesticated species because their high level of allelic diversity reveals the fine structure of diversity more efficiently than an equal number of RFLP, AFLP, or SNP loci (Akagi et al. 1997). Using SSRs, the cotton source germplasm in China was divided into five groups on the basis of the average similarity coefficient (0.610) among the source germplasm (Chen and Du 2006), and the genetic diversity and population structure of 334 G. hirsutum variety accessions from the Uzbek cotton germplasm collection were analyzed (Abdurakhmonov et al. 2009).

In this study, we selected a large number of elite G. hirsutum cultivar accessions from the China cotton germplasm collection. These accessions originated either from different cotton-growing areas in China or from outside China, and they included some innovative germplasm derived from interspecific hybridization or induced mutagenesis between G. hirsutum and either G. arboreum or G. barbadense. The accessions were genotyped with SSR markers to study the extent and distribution of diversity, population structure, and kinship. The specific goals of this study were to characterize genetic diversity and population structure within elite G. hirsutum cultivar accessions and to examine the differences between, and relationships among, genetically defined groups. The resulting framework will be used to pose questions about the origin and diversity of the gene pools that exist within cultivated G. hirsutum worldwide and to lay the foundation for characterizing the genes that distinguish them.

Materials and methods

Plant material

We sampled 157 cotton accessions (Supplementary Table 1) representing the geographic range of elite G. hirsutum cultivar accessions in the China cotton germplasm collection. The sampled accessions were collected from different cotton-growing areas in China (106) as well as from America (41), Africa (3), the former Soviet Union (4), France (1), Pakistan (1), and Australia (1).

Genomic DNA extraction and SSR genotyping

Genomic DNA of all materials was isolated from pooled young leaves of ten seedlings following Paterson et al. (1993). SSRs distributed on the 26 chromosomes of the AD-genome wide Reference Map (http://www.cottondb.org/cgi-bin/cmap/viewer) were screened for polymorphism, and 146 pairs of SSR markers (an average of five on each of the 26 chromosomes, Supplementary Table 2) that were polymorphic among the 157 cotton accessions were retained for genotyping. The chromosome locations of these SSR markers and the position of each locus were obtained from the same reference map. PCR amplification of SSRs was performed in 10 μl reactions containing 67 mM Tris–HCl (pH 8.8), 16 mM (NH4)2SO4, 2.5 mM MgCl2, 0.2 mM dNTPs, 0.6 μM primers, 0.5 units of Taq DNA polymerase (Sangon, Shanghai, China), and 25 ng of genomic DNA, using a Thermal Cycler 9600 (Perkin-Elmer). The PCR program consisted of 35 cycles of 94 °C for 45 s, the annealing temperature for 45 s, and 72 °C for 90 s, followed by a final extension step at 72 °C for 10 min. PCR products were run on 10 % polyacrylamide gels using a DYCZ-30 vertical electrophoresis apparatus (made in China). The gel was run for about 50 min after the samples were loaded. After electrophoresis, the gel was separated from the plates and treated for 10 min in fixation solution (10 % v/v ethanol and 0.5 % v/v acetic acid) with gentle shaking. After incubation for 12–15 min in staining solution (0.2 % w/v silver nitrate), the gel was washed twice with distilled water for 2 min and once with 0.0002 % w/v sodium thiosulfate for about 2 min, and was then transferred to developing solution (1.5 % w/v sodium hydroxide, 0.4 % formaldehyde) to develop the silver-stained DNA bands. The stop and storage solution was 0.75 % sodium carbonate.

Molecular genetic diversity and phylogenetic analyses

Since the cotton germplasm used in this study had been strictly self-pollinated over the past decades for germplasm renewal, we genotyped our cotton accessions according to the method of Abdurakhmonov et al. (2009), in which SSR data are scored as dominant markers. Genetic diversity was assessed with PowerMarker v3.25 (http://www.powermarker.net) and measured as the number of alleles per locus, gene diversity, and polymorphism information content (PIC). Pairwise genetic distances were calculated and phylogeny was analyzed with PowerMarker v3.25 under the Nei 1983 model (Liu and Muse 2005). Genetic variation within and among the predefined groups, the entire population, and the inferred groups, together with pairwise FST genetic distances, was measured by analysis of molecular variance (AMOVA) using ARLEQUIN 2.0 (Schneider et al. 2000).
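
As an illustration of the summary statistics named above, the following Python sketch computes gene diversity and PIC for a single locus. It uses the textbook definitions (gene diversity as 1 − Σp², PIC following Botstein et al. 1980) and assumes one allele call per inbred accession. The function names are hypothetical, and PowerMarker may apply sample-size corrections that are not reproduced here.

```python
from collections import Counter


def allele_frequencies(calls):
    """Return {allele: frequency} for one locus; None marks a missing call."""
    observed = [c for c in calls if c is not None]
    if not observed:
        raise ValueError("no allele calls at this locus")
    counts = Counter(observed)
    return {allele: n / len(observed) for allele, n in counts.items()}


def gene_diversity(freqs):
    """Expected heterozygosity, 1 - sum(p_i^2) (no sample-size correction)."""
    return 1.0 - sum(p * p for p in freqs.values())


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    p = list(freqs.values())
    homozygosity = sum(x * x for x in p)
    cross = sum(
        2.0 * p[i] ** 2 * p[j] ** 2
        for i in range(len(p))
        for j in range(i + 1, len(p))
    )
    return 1.0 - homozygosity - cross


# Hypothetical locus with two equally frequent alleles and one missing call.
freqs = allele_frequencies(["A", "A", "B", "B", None])
print(gene_diversity(freqs), pic(freqs))
```

Averaging these per-locus values over all loci gives panel-level figures of the kind reported in Table 1.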

Population structure and kinship analysis

The model-based (Bayesian) clustering software STRUCTURE 2.2 (Pritchard et al. 2000) was used to estimate the population structure of the predefined groups and of the entire population with the 146 unlinked markers distributed across all cotton chromosomes. We ran STRUCTURE under the ‘admixture model’ with a burn-in period of 10,000 iterations followed by 100,000 Markov chain Monte Carlo replications. Five independent runs were performed for each number of clusters (K) from 1 to 10. An ad hoc measure, ΔK, based on the relative rate of change in the likelihood of the data between successive K values, was used to determine the optimal number of clusters (Evanno et al. 2005). The run with the maximum likelihood was used to subdivide the cotton accessions into subgroups, with a membership probability threshold of 0.50 or, failing that, the maximum membership probability among subgroups. No a priori population information was used. A relative kinship matrix was constructed using the software SPAGeDi, and negative kinship values between two individuals were set to 0 (Hardy and Vekemans 2002).
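
The ΔK criterion can be sketched as follows. This is one common way to compute it, assuming replicate LnP(D) values for consecutive K: the absolute second difference of the mean LnP(D), divided by the standard deviation of LnP(D) across runs at that K. The function name and the input values are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Map each interior K to delta-K.

    lnp_by_k: {K: [LnP(D) of each replicate run]}, with consecutive K values
    and at least two runs per K. K values whose runs have zero spread are
    skipped because delta-K is undefined there.
    """
    ks = sorted(lnp_by_k)
    means = {k: mean(lnp_by_k[k]) for k in ks}
    result = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue
        second_diff = abs(means[next_k] - 2 * means[k] + means[prev_k])
        result[k] = second_diff / spread
    return result


# Made-up likelihoods for K = 1..4, two runs each (illustration only).
runs = {
    1: [-5000.0, -5010.0],
    2: [-4500.0, -4510.0],
    3: [-4400.0, -4420.0],
    4: [-4350.0, -4390.0],
}
scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because ΔK needs both neighbors of K, it is undefined at the smallest and largest K tested, so it cannot by itself support K = 1.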

Results

Inference of genetic structure of the predefined groups and the 157 cotton accessions

According to geographic origin, we divided the whole cotton panel into three predefined groups: the American-origin group (P1), the China-origin group (P2), and the other-origin group (P3). To explore the genetic structure of the predefined groups, we inferred population structure separately for each of them. At the same time, we treated the whole cotton panel as one population with potential substructure and inferred its population structure. The groups inferred for the whole cotton panel were then compared with the predefined groups to assess how well the cultivars' origins agreed with their genetic structure.

Population structure inference showed that the LnP(D) values of the predefined American-origin group, the China-origin group, and the whole cotton panel increased steadily with K from 1 to 10, and the highest ΔK value was observed at K = 2 for the American-origin group, K = 3 for the China-origin group, and K = 2 for the whole cotton panel (Figure S1). This suggested that the American-origin group could be divided into two subgroups (P1a, P1b), the China-origin group into three subgroups (P2a, P2b, P2c), and the whole cotton panel into two groups (Figs. 1, 2). For the other-origin group, which consisted of cultivars from Africa, the former Soviet Union, France, Pakistan, and Australia, the LnP(D) values indicated that the group could be divided into seven subgroups (Figure S1, Fig. 1). The two groups inferred from the whole cotton panel were named G1 and G2. Using a membership probability threshold of 0.50 or the maximum membership probability among subgroups, 61 lines were assigned to G1 and 96 lines to G2 (Supplementary Table 3). The G1 group consisted of 21 American, 34 Chinese, 3 former Soviet Union, 2 African, and 1 Australian cultivar accessions. The G2 group consisted of 21 American, 71 Chinese, 1 Pakistani, 1 former Soviet Union, 1 African, and 1 French cultivar accessions. The two groups inferred by STRUCTURE were not consistent with the three predefined groups, probably reflecting extensive exchange of parental lines among breeders worldwide.

Fig. 1 Unrooted neighbor-joining trees and population structure for the predefined American-origin group (A), China-origin group (B), and other-origin group (C). The ancestries of the accessions in inferred subgroups are represented by different colors

Fig. 2 Unrooted neighbor-joining trees and population structure for the whole cotton panel. The ancestries of the accessions in inferred subgroups are represented by different colors

For the whole cotton panel, because LnP(D) increased gradually from K = 1 to K = 10 and small peaks of ΔK appeared after K = 2 (Figure S1), we performed an independent STRUCTURE run for each of the two groups. The ΔK values indicated three subgroups in G1 and two subgroups in G2 (Figure S1, Fig. 2). The three subgroups of G1 were named G1a, G1b, and G1c. G1a contained 9 lines, represented by five lines collected from the northern early-maturity cotton area in China; G1b contained 19 lines (8 American cultivars, 9 Chinese cultivars, 1 African cultivar, and 1 former Soviet Union cultivar), represented by PD6186, a breeding line from America; G1c contained 33 lines derived from the Huanghe River valley area in China (8), the Yangtze River valley area in China (3), the northwest inland area in China (3), America (13), the former Soviet Union (2), Australia (1), and Africa (1).

The G2 group was classified into two subgroups. G2a contained 90 lines, which originated from the USA (20), Africa (1), Pakistan (1), the former Soviet Union (1), France (1), and China (66). G2b contained 6 lines, represented by Arcot-1, a breeding line from America, and four innovative lines derived from it by atomic energy-induced mutation (Supplementary Table 1, Supplementary Table 4).

The pairwise kinship estimates based on 146 informative SSR markers showed that the majority of the pairs of cotton accessions (53.75 %) had zero estimated kinship, while 17.55 % of the pairwise estimates ranged from 0 to 0.05, 13.43 % ranged from 0.05 to 0.1, and 11.70 % ranged from 0.1 to 0.20. The remaining estimates (3.57 %) had >0.25 kinship values, with a continuously decreasing number of pairs falling in the higher categories; these high kinship values imply that some common parental genotypes were used in the breeding history of these germplasm groups (Fig. 3). These results indicate that most lines in the panel have no or very weak kinship, which might be attributed to the broad collection of genotypes and the exclusion of similar genotypes before analysis.
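
A minimal sketch of how such a distribution can be tabulated is given below, assuming the pairwise kinship estimates are available as a flat list of numbers. Negative estimates are set to 0 as described in the methods. The function name, the bin edges, and the example values are hypothetical and are not taken from the SPAGeDi output of this study.

```python
from bisect import bisect_left


def kinship_shares(values, edges=(0.05, 0.10, 0.20)):
    """Return percentages of pairs per kinship class.

    Classes: exactly 0 (after clipping negatives), then (0, e1], (e1, e2],
    ..., and finally values above the last edge.
    """
    if not values:
        raise ValueError("no kinship estimates supplied")
    clipped = [max(0.0, v) for v in values]  # negative kinship -> 0
    counts = [0] * (len(edges) + 2)
    for v in clipped:
        if v == 0.0:
            counts[0] += 1
        else:
            # bisect_left places a value equal to an edge in the lower class
            counts[1 + bisect_left(edges, v)] += 1
    return [100.0 * c / len(clipped) for c in counts]


# Eight made-up pairwise estimates (illustration only).
example = [-0.10, 0.0, 0.03, 0.07, 0.15, 0.30, -0.02, 0.04]
print(kinship_shares(example))
```

For a panel of 157 accessions the input would hold 157 × 156 / 2 = 12,246 pairwise values.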

Fig. 3 Distribution of pairwise relative kinship estimates between 157 cotton accessions. Values are from SPAGeDi estimates using 146 SSRs. For simplicity, only percentages of relative kinship estimates ranging from 0 to 0.50 are shown

Genetic diversity

A total of 146 SSR loci, randomly distributed across the genome with an average of five on each of the 26 chromosomes, were used to evaluate the genetic diversity of the elite cotton germplasm. These SSR loci were polymorphic across the 157 variety accessions, and a total of 330 alleles were detected (Table 1). The number of alleles per locus ranged from 2 to 5 with an average of 2.26. The frequency of the major allele (the allele with the highest frequency at each locus) varied from 0.3774 to 0.9936. The average gene diversity of the predefined American-origin group (P1), China-origin group (P2), and other-origin group (P3) was 0.3527, 0.3434, and 0.3234, respectively, while that of the entire population, the inferred G1 group, and the inferred G2 group (described above) was 0.3502, 0.3695, and 0.3148, respectively. In addition, the average PIC value of P1, P2, and P3 was 0.28809, 0.2798, and 0.2639, respectively, while that of the entire population, the inferred G1 group, and the inferred G2 group was 0.2857, 0.3003, and 0.2572, respectively (Table 1).

Table 1 Summary of genetic diversity for overall group, predefined American-origin group (P1), China-origin group (P2), other-origin group (P3), and the inferred groups (G1, G2) and subgroups (G1a, G1b, G1c, G2a, G2b), which were classified using STRUCTURE analysis

Phylogenetic analyses revealed that the genetic distance (GD) among all elite G. hirsutum cultivar accessions ranged from 0.0182 to 0.5651 with an average of 0.3457, indicating a wide range of genetic distances. The average GD within the predefined groups (P1, P2, and P3) was very similar, ranging from 0.3398 to 0.3599. Among the predefined groups, the American-origin group had the highest average genetic distance and the China-origin group the lowest, but the latter had the widest range of genetic variation (0.0182–0.5556, Table 1). Among the inferred groups, the average GD within the G1 group (0.37) was higher than that within the G2 group (0.3122). The highest GD (0.5651) among all elite G. hirsutum cultivar accessions was observed between the cultivars Nashangqudahua (from China) and Mei8123 (from America) (Table 1).
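
For readers unfamiliar with the distance measure, the sketch below illustrates the D_A distance of Nei et al. (1983), D_A = 1 − (1/r) Σ_loci Σ_alleles √(x·y), which we assume corresponds to the "Nei 1983" option in PowerMarker. The function name and the toy allele frequencies are hypothetical, and the dominant-style scoring used in this study is not reproduced.

```python
from math import sqrt


def nei_da(freqs_x, freqs_y):
    """D_A distance between two samples.

    freqs_x, freqs_y: lists with one {allele: frequency} dict per locus,
    in the same locus order. For a single inbred accession every locus
    has one allele with frequency 1.0.
    """
    if not freqs_x or len(freqs_x) != len(freqs_y):
        raise ValueError("both samples need the same, non-zero number of loci")
    total = 0.0
    for fx, fy in zip(freqs_x, freqs_y):
        # Alleles absent from either sample contribute sqrt(0) = 0.
        total += sum(sqrt(fx[a] * fy[a]) for a in fx.keys() & fy.keys())
    return 1.0 - total / len(freqs_x)


# Two toy accessions that share the allele at locus 1 but not at locus 2.
acc_1 = [{"a": 1.0}, {"b": 1.0}]
acc_2 = [{"a": 1.0}, {"c": 1.0}]
print(nei_da(acc_1, acc_2))
```

Under this definition the distance is 0 for identical allele frequencies and 1 when no alleles are shared at any locus.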

NJ trees of the American-origin group, the China-origin group, the other-origin group, and the whole cotton panel were constructed to assess the consistency between clusters in the phylogenetic trees and groups from the STRUCTURE analysis (Figs. 1, 2). For the American-origin group, cultivars from the same STRUCTURE subgroup fell into the same cluster of the phylogenetic tree (Fig. 1a). For the China-origin group, cultivars from each of the three STRUCTURE subgroups tended to be dispersed across different clusters of the phylogenetic tree (Fig. 1b). For the whole cotton panel, cultivars from the same STRUCTURE group or subgroup also tended to be dispersed across different clusters (Fig. 2). The dispersed cultivars were often classified as intermediates when a membership probability threshold of 0.70 was used instead of 0.50 or the maximum membership probability among subgroups (Supplementary Table 4). This indicates that more exchange of parental lines occurred in the China-origin group and in the whole cotton panel than in the American-origin group.
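
The effect of the membership threshold can be made concrete with a short sketch. It assumes each accession has a dictionary of STRUCTURE membership probabilities (Q values). The function name, the "intermediate" label, and the example values are hypothetical.

```python
def assign_subgroup(q_values, threshold=None):
    """Assign an accession to the subgroup with the highest membership.

    q_values: {subgroup: membership probability} for one accession.
    threshold: if None, always take the subgroup with maximum membership;
    otherwise return "intermediate" when that maximum is below threshold.
    """
    if not q_values:
        raise ValueError("no membership probabilities supplied")
    best = max(q_values, key=q_values.get)
    if threshold is not None and q_values[best] < threshold:
        return "intermediate"
    return best


# A made-up admixed accession (illustration only).
q = {"G1": 0.62, "G2": 0.38}
print(assign_subgroup(q))                  # maximum-membership rule
print(assign_subgroup(q, threshold=0.70))  # stricter rule
```

With the maximum-membership rule every accession receives a label, whereas the stricter 0.70 rule sets admixed accessions aside.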

Analysis of molecular variance (AMOVA)

The genetic diversity within and among the predefined groups was estimated by an AMOVA test using the predefined P1 and P2 as groups (P3 was not included because it contained only a few samples) and their corresponding subgroups (P1a, P1b, P2a, P2b, and P2c) as populations. For the whole cotton panel, the same AMOVA test was conducted using the inferred groups (G1, G2) as groups and the inferred subgroups within them (G1a, G1b, G1c, G2a, and G2b) as populations.

The AMOVA results indicated that 14.56 % of the total molecular variation for the inferred groups and 8.97 % for the predefined groups could be attributed to differentiation among the corresponding subgroups (Tables 2, 4). However, the variation among the inferred groups and among the predefined groups was not significant, implying low differentiation between the American-origin and China-origin groups and between the G1 and G2 groups. About 86.49 % of the total genetic variance for the inferred groups and 92.18 % for the predefined groups was attributed to differences within subgroups (Tables 2, 4). Pairwise Fst showed that differentiation between subgroups of the American-origin group (P1) was higher (Fst = 0.12237, P < 0.0001) than that between subgroups of the China-origin group (P2), for which Fst ranged from 0.07593 to 0.11158 (Table 5). Similarly, the differentiation between the subgroups of G2 (G2a and G2b) was higher than that between the subgroups of G1 (Table 3). Interestingly, each Fst value in Table 3 was larger than those in Table 5, which suggests that the subgroups inferred by STRUCTURE from the whole panel were more strongly differentiated than those nested within the predefined groups.

Table 2 Analysis of molecular variance (AMOVA) among inferred populations
Table 3 Fst among five subgroups of inferred populations

Discussion

Genetic diversity of elite cotton germplasm

We genotyped a total of 157 elite cotton germplasm accessions using 146 genomic SSR markers covering all 26 chromosomes of cotton. Our results demonstrated that the level of detected diversity was relatively low, with an average of 2.26 alleles per locus (between 2 and 5 alleles/locus), a gene diversity of 0.3502, and a PIC of 0.2857 (Table 1). These values were very similar to those reported by Fang et al. (2013) (an average of 2.64 alleles per primer pair and an average PIC of 0.2869 in 193 upland cotton cultivars collected from 26 countries), and similar to those reported by Rungis et al. (2005) (an average of 2.4 alleles per primer pair and an average PIC of 0.37 in 9 cotton cultivars consisting of 8 G. hirsutum and 1 G. barbadense) and by Bertini et al. (2006) (an average of 2.13 alleles per microsatellite locus and an average PIC of 0.40 in 53 G. hirsutum cultivars), but lower than those reported by Lacape et al. (2007) (an average of 5.6 alleles per locus and an average PIC of 0.55 in 47 accessions including 38 G. hirsutum, 2 G. darwinii, 2 G. tomentosum, and 5 G. barbadense) and by Moiana et al. (2012) (an average of 6.9 alleles per locus and a mean PIC of 0.646 in 35 cultivars and eight inbred lines of G. hirsutum from Africa, the USA, and Brazil). Although the number of alleles in this study was lower than that reported by Liu et al. (2000) (an average of 5 alleles per locus in 97 cultivars and primitive species of G. hirsutum derived from various wild race stocks), the PIC values in the two studies were very similar (0.2857 in our study and 0.31 in Liu's study). These differences in genetic diversity values might be attributed to the types of germplasm used. As expected, the level of polymorphism among races and wild species of Gossypium was markedly higher than that within cultivated G. hirsutum.

Moreover, cultivars domesticated directly in a native cotton-growing area usually retain a higher level of polymorphism than those bred in a non-native cotton-growing area. It has been reported that the G. hirsutum cultivated around the world is derived from germplasm from the USA that was exported to other countries in the nineteenth and early twentieth centuries, and that most upland cotton used in early cotton breeding worldwide came from this source (Iqbal et al. 2001). American-origin cultivars are therefore expected to retain a higher level of polymorphism than those from other countries. Our study likewise showed that the predefined American-origin group had a higher average genetic diversity than the China-origin group. Chen and Du (2006) reported that most cotton varieties planted in China were derived from a few sources of germplasm, such as DPL, Stoneville, King, Uganda, Foster, and Trice, all of which were introduced from abroad. Therefore, in terms of the allelic richness observed in our study, the pool of elite cotton germplasm in China holds only a small share of the species' variability. However, the highest GD was detected between a variety from America and a variety from China, which suggests a direction for genetic improvement.

Population structure and differentiation of elite cotton germplasm

Elite cotton germplasm lines developed through the introduction and domestication of exotic germplasm often have a complex genetic background; understanding the population structure of, and relationships among, the elite germplasm lines is therefore important for cotton improvement and association analysis. In the present study, we first divided the whole cotton panel into an American-origin group (P1), a China-origin group (P2), and an other-origin group (P3) according to geographic origin. Because the other-origin group had only a few samples, we mainly analyzed the American-origin and China-origin groups. At the same time, we treated the whole cotton panel as one population with potential substructure. With a membership probability of ≥0.5 as the criterion for group subdivision, the ad hoc measure ΔK was highest at K = 2, indicating that dividing the whole cotton panel into two groups gave the most biologically meaningful population structure. Notably, each of the two groups contained germplasm lines from several origins (China, America, Africa, and the former Soviet Union), and the two groups inferred by STRUCTURE were not consistent with the three predefined groups, indicating exchange and domestication of germplasm among different origins.

Several studies have shown that the pedigree relationships among G. hirsutum cultivars depend closely on the areas where the cultivars originated. For example, cluster analysis of genetic distance data for nine cotton cultivars grouped the Australian cultivars separately from most American cultivars (Rungis et al. 2005), and, using different methods, 35 cultivars and eight inbred lines of G. hirsutum L. were divided into four groups consisting of American cultivars and inbred lines, African and Brazilian cultivars, BRS Brazilian cultivars, and FM Brazilian cultivars (Moiana et al. 2012). China is not a native cotton-growing area, and its cotton breeding and production have been based on introduced germplasm (Chen and Du 2006), which usually leads to close relationships between Chinese cultivars or improved accessions and the introduced germplasm. This view was further supported by the AMOVA results, in which the variation among the inferred groups and among the predefined groups was not significant (Tables 2, 4), implying low differentiation between the American-origin and China-origin groups and between the G1 and G2 groups. In genome-wide association studies (GWAS), the power of structure-based association analysis to detect the effects of single genes is reduced if a large fraction of the variation is explained by population structure (Flint-Garcia et al. 2005). In our population, only a small part of the variation between groups was explained by population structure, suggesting that the population is suitable for association mapping.

Table 4 Analysis of molecular variance (AMOVA) among predefined populations

From an evolutionary point of view, the low differentiation between groups in our population reflects the evolutionary and domestication history of cultivated upland cotton. Unlike rice, in which the domesticated species Oryza sativa was domesticated into japonica and indica varieties over thousands of years (Zhang et al. 2009), upland cotton has a domestication history of only several hundred years and has not undergone separate domestication events, so no distinct local varieties or ecotypes have formed. In fact, an analysis of the genetic diversity of upland cotton from different countries using more than one hundred and eighty thousand SNP markers showed that the average genetic similarity coefficient among countries was close to 0.9 (personal communication). We therefore suggest that all cultivated upland cotton in the world should be considered one large population, and that the different clusters of upland cotton obtained by different researchers should be regarded as subgroups of this population.

Population structure is an indicator of genetic differentiation among groups and subgroups. For the predefined American-origin group, the model-based subpopulations corresponded well with the distance-based clusters (Fig. 1a), consistent with the higher differentiation among subgroups in the American-origin group than in the China-origin group (Table 5). We therefore conclude that the American-origin group has stronger population substructure than the China-origin group.

Table 5 Fst among five subgroups of predefined populations

Since the G. hirsutum cultivated around the world is derived from the USA (Iqbal et al. 2001), the structure of this cotton panel might reflect the population structure of US upland cotton. By performing an independent structure inference for each group, we assigned the whole cotton panel to five subgroups. Among these, only one subgroup (G1a) consisted entirely of lines originating from China, whereas the other four subgroups consisted of accessions derived from both China and the USA (Supplementary Table 1). This result is comparable to that of Tyagi et al. (2014), who divided US upland cotton accessions into four differentiated subpopulations corresponding to the major cotton-growing regions: western, southwestern, midsouth, and eastern. We therefore deduce that the genetic structure of upland cotton accessions worldwide might be shaped by that of accessions from the USA.

Despite the apparent lack of diversity in cultivated G. hirsutum, Van Esbroeck and Bowman (1998) argued that there is enough allelic variation, mutation, or recombination in crosses between closely related individuals to allow improvement in agronomic performance, and/or that the coefficient of parentage may not reflect the real genetic distance. In our study, all the accessions in subgroup G2b had good fiber quality, with a fiber length of >30 mm and a fiber strength of >30 cN/tex, and most accessions in subgroup G1a had good Verticillium wilt resistance (unpublished). These observations justify crosses between these accessions and other related individuals in cotton breeding programs to improve fiber quality and disease resistance.