Introduction

Cotton (mainly upland cotton) is the world’s most important natural textile fiber and a significant oilseed crop (Zhang et al. 2012). China is the world’s largest producer, consumer and importer of cotton, and it is also the largest producer of textiles and garments. The major cotton growing regions in China are the Yangtze River Region (YtRR), Yellow River Region (YRR), Northwestern Inland Region (NIR) and Northern Specific Early Maturation Region (NSEMR). In recent years, with climate change and adjustments to the industrial development strategy in China, cotton production and growing areas have gradually declined in the YtRR, YRR and NSEMR, whereas the NIR, represented by the Xinjiang cotton growing region, has rapidly expanded, becoming the largest cotton planting area worldwide. In 2016, the total cotton growing area in Xinjiang was 1904.3 kha, accounting for 50.1% of the total growing area in China, and its total yield was 3.503 million tons, contributing 62.5% of the total yield in China. The cotton yield per unit area was 1839.6 kg/ha, which is 364.3 kg/ha higher than the national average (National Bureau of Statistics of the People’s Republic of China, http://www.stats.gov.cn, 2015). Therefore, Xinjiang has become the cotton growing area with the greatest yield worldwide.

The major yield indicators of cotton include seed cotton yield per plant (SY), lint percentage (LP), boll weight (BW), bolls per plant (BN), lint index (LI) and seed index (SI), which are controlled by complicated quantitative trait loci (QTLs). In cotton breeding, the mapping of QTLs conditioning the above-mentioned traits and the identification of stable QTLs and favorable alleles are of great theoretical and practical value for the improvement of cotton yield. The classical genetic method of studying quantitative traits is based on a linkage analysis of both parents, and it enabled the mapping of multiple QTLs responsible for cotton yield and fiber quality using segregating populations from different parents (Guo et al. 2013; Hu et al. 2008; Liu et al. 2014; Ma et al. 2008; Ning et al. 2013; Qin et al. 2008, 2009; Shao et al. 2014; Shen et al. 2005; Ulloa and Meredith 2000; Yin et al. 2002; Yu et al. 2013), providing the basis for the study of the genetic architecture of quantitative traits in cotton. However, because of the limited number of markers in upland cotton populations, the accuracy of QTLs is relatively low, which impedes further studies using map-based cloning and marker-assisted selection. An association analysis is an effective method for studying complex quantitative traits, and it facilitates the fine mapping of target genes, providing information on certain candidate genes and validating their functions. Compared with a conventional linkage analysis, an association analysis exhibits three significant characteristics: (1) the development of a mapping population is not required, and all natural populations and germplasm resources can serve as mapping materials; in addition, knowing the pedigrees of these materials is not mandatory, as long as genetic variation is present; (2) the diverse genetic materials assist the identification of greater numbers of gene alleles, which improves the resolution of the map; and (3) it can evaluate most of the QTL-linked loci and their allelic variations in multiple traits simultaneously, resulting in a greater mapping efficiency. This method has already been applied to crops, such as wheat (Maccaferri et al. 2006), rice (Iwata et al. 2010), maize (Yang et al. 2010) and soybean (Hou et al. 2011). In cotton populations, association mapping has been used to identify QTLs or determine genomic regions associated with plant type (Li et al. 2016), yield and its components (Mei et al. 2013), fiber quality traits (Nie et al. 2016; Zeng et al. 2009), seed oil and protein contents (Liu et al. 2015), and resistance traits (Mei et al. 2014; Zhao et al. 2014). These studies laid a sound foundation for the application of marker-assisted selection (MAS) and molecular design in cotton breeding.

In the present study, two experimental sites were employed, Korla in southern Xinjiang and Shihezi in northern Xinjiang. Using 403 upland cotton germplasms from different cotton growing regions (199 from NIR, 49.3% of the total number) and simple sequence repeat (SSR) markers evenly distributed across the cotton genome, we performed a whole genome scan and an association analysis to investigate the favorable genes, allelic variations and germplasms related to cotton yield and yield-related traits in the high-yield region of Xinjiang. This work will provide a theoretical basis for MAS in upland cotton breeding.

Materials and methods

Selection of upland cotton accessions

A total of 403 upland cotton accessions from China (378) and other countries (25) were selected from cotton germplasm collections in our laboratory for association mapping purposes (Table S1). These accessions were divided into five groups based on ecological area: 87 from the YRR, 63 from the YtRR, 199 from the NIR and 29 from the NSEMR in China, and 25 from other countries (18 from the United States, five from the Soviet Union, one from Israel and one from Angola).

Field experiments and trait phenotyping

All of the 403 upland cotton accessions were planted at Shihezi, North Xinjiang (E85.94°, N44.27°) in 2013, 2014 and 2015 (designated SHZ13, SHZ14 and SHZ15, respectively) and at Korla, South Xinjiang (E86.06°, N41.68°, in the NIR) in 2013, 2014 and 2015 (designated KRL13, KRL14 and KRL15, respectively). Each accession was grown in a plot having 40–45 plants in two rows, with 0.10 m between the plants in each row and 0.45 m between the rows. The field planting followed a randomized complete block design with three replications in each environment. Field management followed conventional standard field practices. The 10 plants in the middle of each row were tagged for scoring and harvesting seed cotton. The yield traits evaluated included SY (g/plant), LP (%), BW (g), BN, LI (g) and SI (g).

SSR marker genotyping

Total genomic DNA of the 403 accessions was extracted from the leaves as described by Paterson et al. (1993). SSR markers were selected at an average distance of 10 cM on each of the 26 chromosomes from the tetraploid cotton genetic map (Zhao et al. 2012). Additionally, previously reported markers linked to QTLs for breeding target traits of cotton (Guo et al. 2013; Ning et al. 2013; Qin et al. 2008; Shen et al. 2005) were also selected. In total, 560 SSR markers were used to screen for the genetic polymorphisms of the population. The SSR primer sequences were obtained from the Cotton Microsatellite Database (http://www.cottonmarker.org/). The polymerase chain reaction (PCR) amplification procedure was performed as described by Zhang et al. (2002). The polymorphisms detected by SSR markers were evaluated using the FA™-96 Automated High Throughput SSR/Tilling Analysis System (Advanced Analytical Technology Inc, Ankeny, IA, USA). SSR genotypes were coded as “1” for present, “0” for absent and “?” for missing data.

Phenotypic data analysis

The descriptive statistics, correlation analyses and frequency distributions of the phenotypic traits were obtained using SPSS 19.0 software (Li and Chen 2010). The best linear unbiased predictions (BLUPs) for yield and yield component traits across the six environments were obtained using R. The ANOVA for each trait under multiple environments was performed in SAS 8.1. The variance was partitioned into components for germplasm, environment, and the germplasm × environment interaction, and the broad-sense heritability \(h_B^2\) was calculated from these variance components.
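As a rough illustration of the variance partitioning described above, the following sketch computes broad-sense heritability on an entry-mean basis. The text does not state which expression was used, so the formula \(h_B^2 = \sigma_G^2 / (\sigma_G^2 + \sigma_{GE}^2/e + \sigma_\varepsilon^2/(er))\) is an assumption (a commonly used form), and the variance components in the example are made-up placeholders rather than values from this study.

```python
def broad_sense_heritability(var_g, var_ge, var_err, n_env, n_rep):
    """Entry-mean broad-sense heritability from variance components.

    Assumed formula (not stated in the text):
        h2 = var_g / (var_g + var_ge / n_env + var_err / (n_env * n_rep))
    """
    if n_env < 1 or n_rep < 1:
        raise ValueError("n_env and n_rep must be positive")
    phenotypic_var = var_g + var_ge / n_env + var_err / (n_env * n_rep)
    return var_g / phenotypic_var


# Hypothetical variance components for one trait. Six environments and
# three replications match the field design described in the text.
h2 = broad_sense_heritability(var_g=4.0, var_ge=3.0, var_err=9.0,
                              n_env=6, n_rep=3)
print(f"h2_B = {h2:.2%}")
```

With these placeholder inputs, the genotype-by-environment and error terms each contribute 0.5 to the phenotypic variance of entry means, so more environments or replications raise the estimate for the same genotypic variance.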

Genotypic data analysis

The analysis of the genotypic data of the 403 germplasms was performed using PowerMarker V3.25 software (Liu and Muse 2005). The analysis of the population structure based on the genotypes was performed using Structure 2.3.1 software (Evanno et al. 2005) with the following parameter settings: the number of subpopulations (K) was set from 1 to 10, with five replicate runs for each K; the burn-in period was set to 10,000 iterations, followed by 100,000 Markov chain Monte Carlo (MCMC) iterations; and all other parameters were left at their defaults. The ΔK was calculated based on LnP(D), with which an appropriate K value was selected to obtain the corresponding population structure (Q) matrix. The genetic similarity coefficient (Jaccard coefficient) among germplasms was determined by NTSYS-pc V2.10 software to obtain a phylogenetic tree (Adams and Rohlf 2000). The linkage disequilibrium (LD) values among the polymorphic loci were evaluated by Tassel 5.0 (Bradbury et al. 2007).
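The ΔK criterion mentioned above can be sketched as follows. This is a minimal illustration, assuming the common variant in which the second-order difference is taken on the mean LnP(D) across replicate runs and divided by the standard deviation of LnP(D) at that K. The function name and the LnP(D) values are hypothetical and do not come from this study.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Evanno-style delta K from replicate STRUCTURE runs.

    lnp_runs maps each K to a list of LnP(D) values, one per replicate run.
    Delta K is only defined for K values that have both neighbours
    (K - 1 and K + 1) in the input.
    """
    avg = {k: mean(values) for k, values in lnp_runs.items()}
    result = {}
    for k in sorted(lnp_runs):
        if k - 1 in avg and k + 1 in avg:
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            spread = stdev(lnp_runs[k])
            if spread > 0:  # skip K values whose replicate runs are identical
                result[k] = second_diff / spread
    return result


# Hypothetical LnP(D) values (three runs per K, K = 1..4), illustration only.
runs = {
    1: [-9000.0, -9001.0, -8999.0],
    2: [-8000.0, -8002.0, -7998.0],
    3: [-7800.0, -7810.0, -7790.0],
    4: [-7700.0, -7720.0, -7680.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Because ΔK needs both neighbouring K values, it cannot be evaluated at the smallest or largest K tested, which is why K = 1 can never be selected by this criterion.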

Association analysis and favorable allele exploration

Using Tassel 5.0 software (Bradbury et al. 2007), a mixed linear model (MLM) was employed to perform the association analysis between traits and markers, taking as inputs the above-mentioned marker data, the trait values under each environment, the BLUP values, the Q matrix and the kinship (K) matrix produced from the genotypic data. The phenotypic variation explained (\(R^2\)) by each locus significant at P < 0.05 (− lgP > 1.30) was calculated. After obtaining associated markers, the phenotypic effects of allelic variations at SSR loci were evaluated, and the allelic variations, phenotypic effects and typical varieties significantly associated with the traits were analyzed. The calculation method for the phenotypic effect of SSR alleles was as follows:

$$a_i = \frac{\sum_j x_{ij}}{n_i} - \frac{\sum_k N_k}{n_k}$$

where \(a_i\) represents the phenotypic effect of the ith allele, \(x_{ij}\) represents the phenotypic value of the jth material carrying the ith allele, \(n_i\) represents the number of materials carrying the ith allele, \(N_k\) represents the phenotypic value of the kth material among all materials, and \(n_k\) is the total number of materials. If \(a_i > 0\), then the allele is considered to have a positive effect; if \(a_i < 0\), then the allele is considered to have a negative effect.
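The allele effect defined above is the mean phenotype of the carriers minus the mean phenotype of all materials. A minimal sketch of that calculation follows; the function name, accession labels and lint index values are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean


def allele_effect(phenotypes, carriers):
    """Phenotypic effect a_i of one allele.

    phenotypes: dict mapping accession -> trait value (e.g. a BLUP)
    carriers:   set of accessions that carry the allele
    Returns mean(carriers) - mean(all accessions).
    """
    carrier_values = [phenotypes[a] for a in carriers if a in phenotypes]
    if not carrier_values:
        raise ValueError("no phenotyped accession carries this allele")
    return mean(carrier_values) - mean(list(phenotypes.values()))


# Hypothetical lint index values (g) for five made-up accessions.
li = {"acc1": 6.0, "acc2": 7.0, "acc3": 8.0, "acc4": 7.0, "acc5": 7.0}
effect = allele_effect(li, carriers={"acc2", "acc3"})
print(f"a_i = {effect:+.2f} g")
```

In this toy example the carriers average 7.5 g against a panel mean of 7.0 g, so the allele would be classed as having a positive effect.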

Results

Variations in phenotypic traits

The phenotypic characteristics of yield and yield components across the six environments were summarized using BLUPs. Both the BLUP values of each accession and the phenotypic data from each environment for the six yield and yield component traits were used for association mapping. The coefficients of variation for SY, LP, BW, BN, SI and LI were 9.04, 7.01, 7.30, 8.43, 9.37 and 9.26%, respectively (Table 1), indicating broad variation among the 403 upland cotton accessions under the six environmental conditions. The \(h_B^2\) for the six traits ranged from 47.77 to 91.99% among the accessions (Table 1). LP showed the greatest \(h_B^2\) value (91.99%), indicating that LP was less impacted by environmental factors than the other five traits. An analysis of the frequency distribution based on the results of BLUP processing indicated that each trait had a normal or approximately normal distribution and was suitable for genetic analysis (Fig. 1). There were significant positive phenotypic correlations between SY and its components, and there were significant negative correlations between SI and LP, SI and BN, and BW and BN (Table 2). The correlation coefficients for SY with LP, BW, BN, SI and LI were 0.109, 0.464, 0.799, 0.234 and 0.331, respectively.

Table 1 Phenotypic statistics based on the results of the BLUP processing of six environments

Fig. 1 Frequency distributions of the breeding values based on the BLUPs of six yield-related agronomic traits of 403 cotton accessions in six environments. a Seed cotton yield (SY); b lint percentage (LP); c boll weight (BW); d the number of bolls per plant (BN); e seed index (SI); and f lint index (LI)

Table 2 The correlations between SY and its components based on the results of the BLUP processing of six environments

Genetic diversity of the SSR markers

Of the 560 SSR markers screened in the 403 upland cotton accessions, 201 displayed polymorphisms (Table S2), accounting for 36.1% of the total markers, with an average of 7.73 polymorphic markers per chromosome. A total of 394 alleles were obtained, ranging from 1 to 4 per marker, with an average of 1.96. The number of genotypes per marker ranged from 2 to 14, with an average of 4. The gene diversity and polymorphism information content (PIC) of the 394 alleles averaged 0.556 and 0.483, respectively, with ranges of 0.142–0.669 and 0.132–0.677, respectively (Fig. 2). Thus, large differences in allele distribution frequencies existed among the upland cotton accessions.
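For readers unfamiliar with these two statistics, the sketch below shows how gene diversity and PIC are commonly computed from allele frequencies at one locus. The formulas used (gene diversity as \(1 - \sum p_i^2\), and PIC as gene diversity minus \(\sum_{i<j} 2p_i^2p_j^2\)) are the usual textbook definitions and are assumed here; the text does not spell out the exact estimators applied by PowerMarker. The allele calls are invented for illustration.

```python
from itertools import combinations


def gene_diversity(freqs):
    """Gene diversity (expected heterozygosity): 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """PIC: gene diversity minus the sum over allele pairs of 2*p_i^2*p_j^2."""
    cross = sum(2.0 * p * p * q * q for p, q in combinations(freqs, 2))
    return gene_diversity(freqs) - cross


def allele_frequencies(calls):
    """Relative frequency of each allele label, ignoring missing calls ("?")."""
    scored = [c for c in calls if c != "?"]
    if not scored:
        raise ValueError("no scored calls at this locus")
    return [scored.count(a) / len(scored) for a in sorted(set(scored))]


# Hypothetical allele calls at one locus for nine accessions (one missing).
calls = ["A", "A", "B", "B", "A", "B", "?", "A", "B"]
freqs = allele_frequencies(calls)
print(freqs, gene_diversity(freqs), pic(freqs))
```

For a locus with two equally frequent alleles, gene diversity is 0.5 and PIC is 0.375, which illustrates why PIC is never larger than gene diversity at the same locus under these definitions.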

Fig. 2 Distributions of genetic diversity of 201 SSR marker loci in 403 upland cotton accessions. a Gene diversity, b polymorphism information content (PIC)

Population structure and linkage disequilibrium

The population structure was determined using STRUCTURE software. For K = 1–10, LnP(D) increased as the K value increased, and no inflection point was found in this panel (Fig. 3a). Therefore, ΔK was used to determine a suitable K value. The ΔK reached its maximum value at K = 2 (Fig. 3b), suggesting that the total panel could be divided into two subpopulations (Fig. 3c), designated as Subgroups 1 and 2. Subgroup 1 contained 206 accessions, including 70 accessions (70/87, 80.5%) from the YRR, 58 accessions (58/63, 92.1%) from the YtRR, 10 accessions (10/29, 34.5%) from the NSEMR, 53 accessions (53/199, 26.6%) from the NIR and 15 accessions (15/25, 60.0%) from countries other than China. Subgroup 2 contained 197 accessions, including 17 accessions (17/87, 19.5%) from the YRR, 5 accessions (5/63, 7.9%) from the YtRR, 19 accessions (19/29, 65.5%) from the NSEMR, 146 accessions (146/199, 73.4%) from the NIR and 10 accessions (10/25, 40.0%) from countries other than China. The neighbor-joining method was used to construct a phylogenetic tree containing the 403 upland cotton accessions based on Nei’s genetic distances calculated by PowerMarker V3.25 software, and it showed that the majorities of Subgroups 1 and 2 were clustered together in the unrooted tree (Fig. S1). Based on these results, the corresponding Q matrix at K = 2 was used for the marker-trait association mapping. The LD of this population was analyzed using 201 SSR markers. In total, 18.94% of the marker loci showed significant LD values (P < 0.05). The LD distribution was nonuniform on each chromosome, with the loci having greater LD levels being mainly concentrated on chromosomes D5, D7, D8, D11, D6 and A8 (Fig. S2).
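To make the LD measure more concrete, the sketch below computes \(r^2\) between two loci scored in the present/absent/missing coding used for the SSR genotypes ("1", "0", "?"). It is an illustrative assumption about how pairwise LD can be quantified from such data, not a description of Tassel's exact procedure or of its significance test. The function name and the genotype strings are hypothetical.

```python
def ld_r2(locus_a, locus_b):
    """r^2 between two loci scored "1" (present), "0" (absent) or "?" (missing).

    Accessions with a missing call at either locus are dropped.
    """
    pairs = [(int(a), int(b)) for a, b in zip(locus_a, locus_b)
             if a != "?" and b != "?"]
    n = len(pairs)
    if n == 0:
        raise ValueError("no accession scored at both loci")
    p_a = sum(a for a, _ in pairs) / n       # frequency of "present" at locus A
    p_b = sum(b for _, b in pairs) / n       # frequency of "present" at locus B
    p_ab = sum(a * b for a, b in pairs) / n  # frequency of "present" at both
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    d = p_ab - p_a * p_b                     # disequilibrium coefficient D
    return d * d / denom


# Hypothetical scores for six accessions at two pairs of loci.
linked = ld_r2("1100?1", "11000?")    # identical wherever both are scored
unlinked = ld_r2("1100", "1010")      # statistically independent
print(linked, unlinked)
```

Values near 1 indicate that the two bands almost always occur together, whereas values near 0 indicate independence.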

Fig. 3 Estimated LnP(D) and ∆K values based on the population structural analysis of 403 upland cotton accessions. a The magnitude of LnP(D) as a function of K; b magnitude of ∆K as a function of K; c population structure of 403 upland cotton accessions based on SSR markers, divided into two subpopulations: red indicates Subgroup 1 and green indicates Subgroup 2. (Color figure online)

Markers associated with yield and yield components

Based on the genotypic data, the Q and K matrices and the phenotypic trait data, an MLM was used for association mapping to identify the associated SSR loci in the upland cotton accessions. In total, 43 loci were found to be related to yield and its component traits according to the BLUPs and in at least three of the six environments at the P < 0.05 (− lgP > 1.3) significance level. The explained phenotypic variation ranged from 0.97 to 4.01%, with an average of 1.62% (Table 3). The numbers of marker loci associated with SY, LP, BW, BN, SI and LI were 4, 9, 6, 3, 10 and 11, respectively. Moreover, four marker loci (BNL3089a, NAU1028a, NAU3031b and NAU3881c) were simultaneously associated with two different traits, and one marker locus (NAU2984b) was simultaneously associated with three different traits.
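The selection rule described above can be sketched as a simple filter. This reading treats the rule as a conjunction: a marker must be significant for the BLUP values and also in at least three individual environments. That interpretation, the function name, the marker names and all P values below are assumptions for illustration. Only the environment labels and the 0.05 threshold come from the text.

```python
import math

ALPHA = 0.05   # significance threshold used in the text
MIN_ENVS = 3   # minimum number of individual environments


def is_stable(p_blup, p_by_env, alpha=ALPHA, min_envs=MIN_ENVS):
    """True if the marker is significant for the BLUP values and in at
    least `min_envs` of the individual environments."""
    hits = sum(1 for p in p_by_env.values() if p < alpha)
    return p_blup < alpha and hits >= min_envs


# Hypothetical P values: (BLUP, {environment: P value}).
results = {
    "markerA": (0.01, {"SHZ13": 0.02, "SHZ14": 0.04, "SHZ15": 0.30,
                       "KRL13": 0.01, "KRL14": 0.20, "KRL15": 0.60}),
    "markerB": (0.03, {"SHZ13": 0.02, "SHZ14": 0.40, "SHZ15": 0.30,
                       "KRL13": 0.01, "KRL14": 0.20, "KRL15": 0.60}),
    "markerC": (0.20, {"SHZ13": 0.02, "SHZ14": 0.04, "SHZ15": 0.03,
                       "KRL13": 0.01, "KRL14": 0.20, "KRL15": 0.60}),
}
stable = [m for m, (p_blup, p_env) in results.items()
          if is_stable(p_blup, p_env)]
print(stable, round(-math.log10(ALPHA), 2))
```

Here markerB fails because it is significant in only two environments, and markerC fails because its BLUP-based test is not significant. The last printed value shows that P < 0.05 corresponds to − lgP > 1.30.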

Table 3 SSR marker loci significantly associated with yield and its component traits, and their explained proportion of phenotypic variation

Discovery of favorable alleles and typical accessions

The phenotypic effects of the alleles at the 21 associated marker loci were calculated using the formula for the phenotypic effect of an allele. Phenotypic effects and typical accessions for each favorable allele are shown in Table 4. Taking LI as an example, six marker loci having positive phenotypic effects on LI were obtained, of which NAU4057a showed the greatest positive effect on the phenotype, increasing LI by 0.14 g in the typical germplasm accessions ‘L4-13’, ‘Huiyuan14-19’ and ‘Chuang65’.

Table 4 The SSR marker loci associated with phenotypic effects (\(a_i\)) for six yield traits in typical accessions

Discussion

An association analysis is based on individual phenotypes and genotypes; thus, the accuracy of the phenotypic data has a great impact on its results. The BLUP method has desirable characteristics, such as unbiased estimation, minimum estimation variance, and the ability to compensate for the bias caused by selection and elimination, and thus it produces the best linear unbiased results when acquiring individual phenotypic data (Schenkel et al. 2002). In this study, a phenotypic trait analysis based on the BLUP results eliminated environmental effects and improved the accuracy of predicting complex quantitative traits. A genotypic analysis revealed an average gene diversity and polymorphism information content of 0.556 and 0.483, respectively, which were greater than those of previous studies (Nie et al. 2016; Qin et al. 2015). Thus, the panel of 403 upland cotton accessions explored here, one of the largest yet investigated, had broad geographic and genetic diversity that provided sufficient detection power for association mapping.

In association mapping, the population structure always affects the LD among loci, which further influences the accuracy of the association analysis and can result in false positives (Pritchard et al. 2000). The MLM (Q + K), which employs both the population structure (Q value) and the genetic relationship among varieties (K value), is better than the GLM (Q) and MLM (K), which are based solely on Q and K values, respectively (Zhao et al. 2007). In this study, we evaluated the genotypic data of polymorphic loci using STRUCTURE software to analyze and calibrate the population structure, to calculate the probability of classifying materials into subgroups, and to perform the MLM association analysis with it as a covariate. This effectively corrected the false associations caused by the existence of subgroups. In addition, based on the subgrouping from STRUCTURE, the clustering analysis was carried out using Nei’s genetic distance calculation method, and the results of both methods showed good consistency. Based on the pedigree, we demonstrated that the cotton from different growing regions in China originated from different sources. The genetic components of the cotton grown in YRR and YtRR are mainly from American cotton germplasms ‘Stoneville’, ‘Lone star’, ‘Foster’, ‘Acala’ and ‘Deltapine’, and they exhibit large BNs, high LPs, loose architecture and mid-to-late maturity. The cotton grown in NIR is mainly based on genetic components that originated from germplasms from the former Soviet Union, such as ‘108Fu’, ‘KK1543’ and ‘611Bo’, which show compact statures, short fruit branches, early maturity, and fast and concentrated boll opening. The cotton varieties of the NSEMR are based on the American cotton ‘King’ with the integration of genetic components of a former Soviet Union variety, and they show early maturity and disease resistance (Ai et al. 2005; Huang 2007). The tested germplasms in the present study were classified into two populations, and varieties originating from various cotton growing regions were present in each population, but the varieties in the YtRR and YRR were mainly classified into Subgroup 1, while most of the NIR and NSEMR varieties were classified into Subgroup 2, which was consistent with the pedigrees of these germplasms.

The association analysis was based on the LD values among markers. Thus, understanding the genomic LD of the targeted population allows the required density and number of markers to be estimated. As an often cross-pollinated crop, cotton exhibits a relatively high recombination rate among genomic loci, so its LD is relatively low. LD decay analyses showed that the average decay distances in upland cotton populations were 8–25 and 3–7 cM at LD coefficients of \(r^2 \ge 0.1\) and \(r^2 \ge 0.2\), respectively, and 100–400 SSR markers were used for statistical analysis in upland cotton (Cai et al. 2014; Mei et al. 2013; Nie et al. 2016; Qin et al. 2015). In the present study, among the 560 uniformly distributed SSR markers, 201 were polymorphic, which was sufficient to perform an association analysis of QTLs in upland cotton. However, to perform a genome-wide association analysis, the marker density needs to be increased to identify more markers linked to the targeted traits.

Identifying the favorable allelic variations and desired germplasms related to yield is an important requirement for the breeding of high-yield upland cotton. In the present study, we phenotyped six yield-related traits in 403 upland cotton germplasms over 3 years at two locations in the high-yield environment of Xinjiang. A total of 43 marker loci were identified as associated with the six traits (− lgP > 1.30, P < 0.05). Of these, five marker loci were associated with multiple traits. For example, BNL3089a was related to SY and LI, NAU1028a was related to SY and BW, and NAU2984b was linked to BW, SI and LI, indicating that these markers can be used for the simultaneous improvement of multiple traits. In addition, among the loci identified as associated with yield-related traits in the present study, nine were associated with the same traits as QTLs previously obtained by linkage analysis (Table 5), indicating the repeatability and stability of these markers. The remaining 34 markers may be novel loci related to yield and yield-related traits, which provide a theoretical basis for further understanding the genetic mechanisms of cotton yield. Moreover, in this study, the 21 stably detected associations provided 2, 5, 2, 2, 4 and 6 favorable alleles for SY, LP, BW, BN, SI and LI, respectively (Table 4). These favorable alleles can be used in MAS to improve cotton yield. Typical carrier accessions of the favorable alleles for yield traits could be selected as elite parents.

Table 5 Markers associated or linked with the same traits in the present study and previous studies