Introduction

Cotton is a major source of natural textile fiber and a significant cash crop worldwide (Chen et al. 2007). Upland cotton (Gossypium hirsutum L.) accounts for approximately 95% of global cotton production (Zhang et al. 2008). Lint yield, an important measure of cotton yield, depends on boll number (BN), lint percentage (LP), boll weight (BW) and other factors (Qin et al. 2015). Many studies have reported a significant positive correlation between LP and cotton yield, and LP is an important selection index in breeding high-yielding cotton (Immenkamp 2006). However, the genetic basis of LP is not fully understood. Identifying genetic variation in LP and the genes underlying this trait is therefore essential.

Most traits in plants are complex quantitative traits controlled by multiple genes of small effect (Huang et al. 2010), which makes the genes underlying a target trait difficult to identify. Quantitative trait locus (QTL) analysis and genome-wide association studies (GWASs) are currently the most commonly used methods for dissecting the genetic variation of a complex trait (Huang et al. 2018; Mitchell-Olds 2010). In the past few decades, QTL mapping has been widely used to dissect the genetic basis of complex traits in cotton (Jamshed et al. 2016; Liu et al. 2017; Reinisch et al. 1994; Rong et al. 2004). This work has been productive: a total of 4892 QTLs for yield, fiber quality, stress resistance and seed traits have been identified, including 327 LP QTLs distributed on different chromosomes (Said et al. 2015a, b). However, because constructing mapping populations is time-consuming and linkage analysis has low mapping resolution, fine mapping of QTLs for LP and map-based cloning of key genes are difficult to achieve (Cavanagh et al. 2008; Nie et al. 2016). GWAS is a more convenient and effective tool for discovering QTLs and candidate genes related to major traits in plants (Saidou et al. 2014; Zhu et al. 2008). Because of its higher resolution, greater efficiency and suitability for large natural populations, GWAS has been widely applied to detect relationships between genetic loci and complex phenotypes in crops such as rice (Dong et al. 2018; Huang et al. 2010; Zheng et al. 2018), maize (Li et al. 2013; Tian et al. 2011; Zhao et al. 2018), rapeseed (Wang et al. 2018a; Wei et al. 2016) and soybean (Wang et al. 2018b; Wen et al. 2018; Zhou et al. 2015). However, the application of GWAS in cotton has lagged behind because of the complex genome of this species.

The completion of the cotton genome sequence (Li et al. 2014; Paterson et al. 2012; Zhang et al. 2015) and the rapid development of genotyping array and high-throughput sequencing technologies (Cai et al. 2017; Hulse-Kemp et al. 2015) have led to the discovery of a large number of single-nucleotide polymorphism (SNP) markers and greatly promoted the use of genome-wide association analyses in cotton. Using a GWAS strategy with high-density SNP markers, researchers have recently detected many genetic loci associated with cotton yield components, fiber quality and disease resistance (Fang et al. 2017; Li et al. 2017; Ma et al. 2018; Wang et al. 2017). GWAS has also been used to investigate the LP trait. In one study, 355 upland cotton accessions were genotyped by specific-locus amplified fragment sequencing (SLAF-seq); combining these genotypes with phenotypes from multiple environments in a GWAS revealed a gene, Gh_A02G1268, that may determine LP (Su et al. 2016). In another, the population structure and linkage disequilibrium (LD) of 503 upland cotton accessions were dissected using a CottonSNP63K array (Hulse-Kemp et al. 2015), and one candidate gene for LP, Gh_D08G2376, was detected (Huang et al. 2017).

In the present study, a population comprising 276 upland cotton accessions was genotyped using a CottonSNP63K array and analyzed for structure, kinship and LD. Phenotype data were collected from seven environments and used for GWAS to determine the relationship between genetic loci and LP. The main objectives of this research were to: (1) determine the genetic structure and LD level of this population, (2) identify loci associated with LP and (3) explore the candidate genes that control LP. These results should provide useful information for the genetic improvement of LP in cotton breeding.

Materials and methods

Plant materials and field experiments

A diverse collection of 276 upland cotton accessions was used for the association study (Table S1). These accessions were classified into five groups according to their origin: YRR (Yellow River region of China), YtRR (Yangtze River region of China), NW (Northwest China), NSEMR (Northern special early maturing region of China) and other countries. All 276 accessions were grown in Anyang (Henan Province, China), Jingzhou (Hubei Province, China) and Jiujiang (Jiangxi Province, China) in 2016 and in Anyang, Jingzhou, Huanggang (Hubei Province, China) and Anqing (Anhui Province, China) in 2017; these environments were designated as 16AY, 16JZ, 16JJ, 17AY, 17JZ, 17HG and 17AQ, respectively. In each environment, every accession was planted in a single-row plot (6.0 m long, with 0.8 m between rows) with two replications (20–25 plants per replication). All field experiments were arranged in a randomized complete block design. Field management followed local agricultural practices throughout the growing period.

Phenotypic evaluation and statistical analysis

During the boll-opening period, 25 naturally open bolls were randomly harvested from the middle of each plot. The lint fiber was ginned with a roller gin, and LP was calculated as the ratio of lint weight to seed-cotton weight (Abdurakhmonov et al. 2007). Descriptive statistics, Pearson correlation coefficients of LP between environments and an analysis of variance (ANOVA) were computed using R software (R Core Team 2014). In addition, the broad-sense heritability (\(H^{2}\)) of LP was computed as \(H^{2} = \sigma^{2}_{G} / \left( \sigma^{2}_{G} + \sigma^{2}_{GE}/n + \sigma^{2}_{e}/(nr) \right)\), where \(\sigma^{2}_{G}\) is the genetic variance, \(\sigma^{2}_{GE}\) is the genotype–environment interaction (G × E) variance, \(\sigma^{2}_{e}\) is the error variance, n is the number of environments and r is the number of replications. \(\sigma^{2}_{G}\), \(\sigma^{2}_{GE}\) and \(\sigma^{2}_{e}\) were estimated using the lmer function in the lme4 package of R. The best linear unbiased prediction (BLUP) of LP for each accession across multiple environments was also calculated using the lme4 package (Bates et al. 2015).
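
As a rough illustration of the two formulas described above, the following Python sketch computes LP from lint and seed-cotton weights and plugs variance components into the heritability expression. The weights and variance components are made-up placeholder values, not estimates from this study; only n = 7 environments and r = 2 replications follow the design described above, and the function names are hypothetical.

```python
def lint_percentage(lint_weight, seed_cotton_weight):
    """LP (%) as the ratio of lint weight to seed-cotton weight."""
    return 100.0 * lint_weight / seed_cotton_weight


def broad_sense_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """H^2 = var_g / (var_g + var_ge / n + var_e / (n * r))."""
    denominator = var_g + var_ge / n_env + var_e / (n_env * n_rep)
    return var_g / denominator


# Placeholder inputs (assumptions for illustration only).
lp_example = lint_percentage(lint_weight=38.0, seed_cotton_weight=100.0)
h2_example = broad_sense_heritability(
    var_g=4.0, var_ge=1.4, var_e=2.8, n_env=7, n_rep=2
)
print(f"LP = {lp_example:.2f}%, H2 = {h2_example:.3f}")
```

In practice the variance components would come from a mixed model such as the lmer fit mentioned above; this sketch only shows how they combine.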

SNP genotyping

Total DNA was extracted from young leaf tissues of each accession using a modified CTAB method (Zhang and Stewart 2000). A CottonSNP63K array (Hulse-Kemp et al. 2015), which contains 63,058 SNPs, was used to genotype each accession, as in previous reports (Huang et al. 2017; Sun et al. 2017b). Genotyping was performed on an Illumina Infinium platform following the Illumina protocols. The SNP data were clustered and genotypes were called using Illumina GenomeStudio v2011.1. The SNP data were then filtered according to the following criteria: SNP call rate > 0.85 and minor allele frequency > 0.05. In addition, following a reported method (Sun et al. 2017b), the probe sequences of the SNP array were aligned to the G. hirsutum TM-1 reference genome (Zhang et al. 2015), and only SNPs with unique physical positions were retained for further analysis.
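
The two filtering criteria above (call rate > 0.85 and minor allele frequency > 0.05) can be sketched as follows. This is a minimal illustration, not the GenomeStudio workflow: it assumes genotypes are coded as counts of one allele (0, 1 or 2) with None for missing calls, and the SNP names and genotype vectors are hypothetical toy data.

```python
def call_rate(genotypes):
    """Fraction of accessions with a non-missing genotype call."""
    called = [g for g in genotypes if g is not None]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    allele_freq = sum(called) / (2 * len(called))
    return min(allele_freq, 1.0 - allele_freq)


def passes_filter(genotypes, min_call_rate=0.85, min_maf=0.05):
    """Apply both thresholds as strict inequalities, as stated in the text."""
    return (call_rate(genotypes) > min_call_rate
            and minor_allele_frequency(genotypes) > min_maf)


# Toy data: 10 accessions per SNP (hypothetical).
toy_snps = {
    "snp_a": [0, 0, 1, 2, 2, 0, 1, 0, 2, 0],           # kept
    "snp_b": [0, 0, 0, 0, 0, 0, 0, 0, 0, None],        # monomorphic
    "snp_c": [0, 2, None, None, 1, 0, 2, 0, 1, None],  # low call rate
}
kept = [name for name, g in toy_snps.items() if passes_filter(g)]
print(kept)
```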

Population structure assessment and GWAS

The population genetic structure of the 276 accessions was analyzed using the Bayesian model-based method implemented in STRUCTURE 2.3.4 (Evanno et al. 2005). The number of population clusters (K) was set from 1 to 10, with five independent runs for each K. Each run consisted of 100,000 Markov chain Monte Carlo iterations after a burn-in period of 100,000 iterations. STRUCTURE HARVESTER (Earl and Vonholdt 2012), a free web-based program, was used to calculate the natural logarithm of the probability of the data (Ln P(K)) and the ad hoc statistic ΔK. The optimal K was chosen based on ΔK (Mezmouk et al. 2011). Finally, the Q matrix was obtained with CLUMPP (Jakobsson and Rosenberg 2007) by integrating the results of the five replicate runs. In addition, principal component analysis (PCA) and calculation of a relative kinship matrix were performed using the GAPIT package (Lipka et al. 2012); the first three principal components constituted the PCA matrix, and the kinship matrix was constructed according to the method of VanRaden (2008). PowerMarker v3.25 (Liu and Muse 2005) was used to estimate the polymorphism information content (PIC) of the SNP markers, gene diversity and genetic distances among the 276 accessions. A neighbor-joining phylogenetic tree based on Nei’s genetic distances (Nei 1972) was generated using MEGA 6.0 (Tamura et al. 2013). The LD parameter r² between pairs of SNPs was calculated with the --r2 option in PLINK (Purcell et al. 2007) based on a window size of 1000, following a reported method (Wang et al. 2017).
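
To make the ΔK criterion concrete, the sketch below computes ΔK from replicate Ln P(K) values using one common formulation: the absolute second-order difference of the mean Ln P(K) divided by the standard deviation of Ln P(K) across runs. It is an illustrative assumption, not the code used by STRUCTURE HARVESTER, and the Ln P(K) values are invented toy numbers.

```python
from statistics import mean, stdev


def delta_k(lnp_runs):
    """Return {K: deltaK} for every K that has both neighbours K-1 and K+1.

    lnp_runs maps K to the list of Ln P(K) values from replicate runs.
    """
    means = {k: mean(values) for k, values in lnp_runs.items()}
    result = {}
    for k in sorted(lnp_runs):
        if k - 1 not in means or k + 1 not in means:
            continue  # deltaK is undefined at the ends of the K range
        spread = stdev(lnp_runs[k])
        if spread == 0:
            continue  # avoid division by zero when runs are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        result[k] = second_diff / spread
    return result


# Toy Ln P(K) values for K = 1..4, three runs each (hypothetical).
toy_lnp = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-800.0, -801.0, -799.0],
    3: [-780.0, -785.0, -775.0],
    4: [-770.0, -760.0, -780.0],
}
dk = delta_k(toy_lnp)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Note that ΔK cannot be evaluated at the smallest or largest K tested, which is why K = 1 never appears as a candidate under this criterion.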

The association analysis between phenotype and genotype was performed using the GAPIT package in R under a mixed linear model (MLM) (Yu et al. 2006), with the PCA matrix and the kinship matrix included as fixed and random effects, respectively. The significance threshold for trait–marker associations was calculated from the number of markers (p = 1/n, where n is the total number of SNPs used). After the GWAS results from the different environments were considered together, an adjusted suggestive genome-wide significance threshold of p = 1.0 × 10⁻³ was adopted in this study. Manhattan plots were generated using the R package qqman (Turner 2014). Heatmaps of LD on both sides of peak SNPs were produced using Haploview 4.2 (Barrett et al. 2005).
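
The relationship between the two thresholds mentioned above can be checked with a few lines of Python. This sketch assumes that n is the 10,660 filtered SNPs reported in the Results; the function names are hypothetical, and the example p value corresponds to the −log10(P) of 5.10 reported for one SNP in the Results.

```python
import math


def one_over_n_threshold(n_snps):
    """Marker-number-based threshold p = 1/n."""
    return 1.0 / n_snps


def is_significant(p_value, threshold):
    """A marker is declared associated when its p value is below the threshold."""
    return p_value < threshold


N_SNPS = 10_660              # assumption: the number of filtered SNPs
strict = one_over_n_threshold(N_SNPS)
suggestive = 1.0e-3          # adjusted threshold adopted in the text

print(f"1/n threshold: p = {strict:.2e}, -log10(p) = {-math.log10(strict):.2f}")
print(f"suggestive threshold: -log10(p) = {-math.log10(suggestive):.2f}")
print(is_significant(10 ** -5.10, suggestive))
```

On a Manhattan plot, the suggestive threshold corresponds to a horizontal line at −log10(p) = 3, which is less stringent than the 1/n line.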

RNA-seq and quantitative real-time PCR (qRT-PCR) analysis

The raw RNA-seq data of G. hirsutum TM-1 tissues (root, stem, leaf, and ovules and fibers at different developmental stages) were downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (accession no. PRJNA248163). Expression analysis of the RNA-seq data was carried out using TopHat and Cufflinks software (Trapnell et al. 2012), with normalized fragments per kilobase of transcript per million mapped reads (FPKM) values used as the gene expression levels.

Total RNA was extracted from G. hirsutum TM-1 tissues, including ovules at 0, 10, 20 and 30 days post-anthesis (DPA) and fibers at 10, 20 and 30 DPA, using TRIzol reagent (Tiangen, Beijing, China) and then reverse-transcribed using a PrimeScript RT Reagent Kit with gDNA Eraser (Takara, Tokyo, Japan). qRT-PCR amplifications were performed using SYBR Premix Ex Taq (2×) (Takara) on a LightCycler 480 96-well system (Roche, Mannheim, Germany). The G. hirsutum histone3 gene was used as an internal reference. Expression levels of target genes were calculated using the comparative Ct method (Schmittgen and Livak 2008). Gene-specific primers are listed in Table S6.
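
For readers unfamiliar with the comparative Ct method, the sketch below shows one common form of the calculation (relative expression = 2^(−ΔΔCt)), with a reference gene playing the role that histone3 plays above. The Ct values are invented, the choice of calibrator sample is an assumption (the text does not state which sample was used), and the function name is hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_calibrator, ct_ref_calibrator):
    """Relative expression by the comparative Ct method, 2 ** (-ddCt).

    dCt  = Ct(target) - Ct(reference) within one sample
    ddCt = dCt(sample) - dCt(calibrator)
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_calibrator = ct_target_calibrator - ct_ref_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Hypothetical Ct values: a later-stage sample versus an early-stage calibrator.
fold_change = relative_expression(
    ct_target_sample=24.0, ct_ref_sample=18.0,
    ct_target_calibrator=26.0, ct_ref_calibrator=18.0,
)
print(fold_change)
```

A lower Ct for the target gene in the sample than in the calibrator, at equal reference Ct, gives a fold change above 1.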

Results

Analysis of LP phenotypic variation

We evaluated LP of 276 accessions in seven environments during 2016 and 2017 (Table 1). Extensive phenotypic variation was observed in each individual environment. LP values ranged from 10.49 to 49.62%, with a mean value of 37.60% across the seven environments. The coefficient of variation (CV) ranged from 7.68 to 11.20%. As indicated by skewness and kurtosis values, LP exhibited an approximately normal distribution in all environments (Table 1, Figure S1). In addition, the ANOVA revealed significant effects (p < 0.001) of genotype (G), environment (E) and the genotype-by-environment interaction (G × E) (Table S2). The broad-sense heritability (H²) of LP was 90.7% (Table S2), and LP values were significantly and positively correlated between environments (Figure S1). These results indicated that LP is highly stable across environments and mainly under genetic control.

Table 1 Phenotypic data statistics of lint percentage observed in seven environments

Analysis of genetic diversity based on SNPs

Of the 63,058 SNPs used to genotype the 276 accessions, 10,660 high-quality SNPs met the filtering criteria and were used in subsequent analyses (Fig. 1, Table 2). These SNPs were unevenly distributed across the 26 chromosomes, with more SNPs found on the Dt subgenome (6480) than on the At subgenome (4180). The SNP density of chromosomes ranged from 86.43 kb/SNP (Dt07) to 731.71 kb/SNP (At06), with an average marker density of 237.32 kb/SNP. In addition, the polymorphism information content (PIC) values varied from 0.200 (Dt06) to 0.294 (At13) among the 26 chromosomes, with a mean value of 0.250. Gene diversity ranged from 0.24 (Dt06) to 0.37 (At01, At05 and At13), with a mean of 0.31 across chromosomes (Table 2).
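
As a reminder of how the two diversity statistics above relate to allele frequencies, the following sketch implements the textbook definitions: gene diversity as 1 − Σp² and PIC as gene diversity minus Σ 2p_i²p_j² over allele pairs. These are assumed forms for illustration; PowerMarker's estimators may apply additional sample-size corrections, and the allele frequencies below are toy values.

```python
def gene_diversity(freqs):
    """Expected heterozygosity: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content for a list of allele frequencies."""
    cross_term = 0.0
    for i in range(len(freqs)):
        for j in range(i + 1, len(freqs)):
            cross_term += 2.0 * freqs[i] ** 2 * freqs[j] ** 2
    return gene_diversity(freqs) - cross_term


# Biallelic SNPs with balanced and skewed allele frequencies (toy values).
for freqs in ([0.5, 0.5], [0.2, 0.8]):
    print(freqs, round(gene_diversity(freqs), 4), round(pic(freqs), 4))
```

For a biallelic SNP, both statistics peak when the two alleles are equally frequent, so PIC cannot exceed 0.375 under this definition.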

Fig. 1

Distribution of 10,660 polymorphic SNPs on the 26 chromosomes of an upland cotton association population. The horizontal axis indicates chromosome lengths, and the color legend depicts SNP density (the number of SNPs within a 1-Mb window)

Table 2 The summary of SNPs, PIC and gene diversity in 26 chromosomes of upland cotton

Population structure and kinship analyses and LD decay estimation

STRUCTURE analysis indicated that values of Ln P(K) increased continuously as K increased from 1 to 10, with no obvious inflexion point (Fig. 2a). However, ΔK reached its maximum value at K = 2 (Fig. 2b), indicating that the population could be separated into two subgroups (Fig. 2c). PCA gave a result similar to the STRUCTURE analysis, and some accessions were admixed between the two groups (Fig. 2d). The association population was also divided into two clades in a neighbor-joining phylogenetic tree based on Nei’s genetic distances (Fig. 2e). This classification was further supported by a kinship plot (Figure S2).

Fig. 2

The results of population structure, principal component and phylogenetic analyses of 276 upland cotton accessions. a Plot of mean Ln P(K) versus K for K = 1 to 10. b Plot of ΔK versus K for K = 1 to 10. c Population structure based on a STRUCTURE analysis at K = 2. The y-axis quantifies cluster membership, and the x-axis represents the different accessions. d Principal component plot of the test population. e Neighbor-joining phylogenetic tree based on Nei’s genetic distances. Group 1 and Group 2 are represented by blue and orange, respectively (color figure online)

Most of the kinship coefficients (88.71%) were less than 0.2, with 58.74% equal to 0. Only 2.37% of kinship values were larger than 0.5 (Figure S3). These results indicated that relatedness among the accessions was weak. Moreover, the LD decay distance, defined as the distance at which r² dropped to half of its maximum value, was approximately 530 kb (Fig. 3).

Fig. 3

Genome-wide average LD decay estimates of the association population. The black dashed line indicates the position where r² is at half of its maximum value
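
The half-maximum definition of LD decay used above can be sketched as follows. The example assumes that pairwise r² values have already been averaged within distance bins, and it uses linear interpolation between bins, which is an illustrative choice rather than the procedure used in this study. The binned values are invented and do not reproduce the 530-kb estimate.

```python
def ld_decay_distance(binned_r2):
    """Distance at which mean r2 first falls to half of its maximum.

    binned_r2 is a list of (distance_kb, mean_r2) pairs; linear
    interpolation is used between the two bins that bracket the half-maximum.
    """
    points = sorted(binned_r2)
    half_max = max(r2 for _, r2 in points) / 2.0
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        if r0 >= half_max > r1:
            fraction = (r0 - half_max) / (r0 - r1)
            return d0 + fraction * (d1 - d0)
    return None  # r2 never drops below half of its maximum


# Toy decay curve (hypothetical values).
toy_curve = [(0, 0.60), (200, 0.45), (400, 0.35), (600, 0.25), (800, 0.20)]
decay_kb = ld_decay_distance(toy_curve)
print(decay_kb)
```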

These results indicated that the accessions were not highly structured and exhibited weak relatedness and moderate LD decay. The association population was thus suitable for association mapping.

GWAS of the LP trait

A total of 23 SNP loci distributed on 13 chromosomes were identified as significantly associated with LP (Fig. 4a, Figure S4 and Table S3). The quantile–quantile (Q–Q) plot indicated that the MLM was appropriate for identifying association signals (Fig. 4b). Among these loci, seven were located on chromosome Dt05, four on Dt10 and two on Dt13. The remaining 10 loci were positioned on chromosomes At01, At03, At05, At07, At10, Dt01, Dt02, Dt04, Dt09 and Dt11 (Figure S4 and Table S3). The phenotypic variation explained by these SNPs ranged from 4.20 to 10.23%, with an average of 5.68% (Table S3). Eleven significant SNPs were consistently detected in at least two environments. Four SNPs (i56741Gb, i61131Gt, i08888Gh and i00252Gh), distributed on chromosomes At03 and Dt05, were each detected in five environments. Moreover, ten of these SNPs were also identified in BLUP. For example, the SNP locus i56741Gb on chromosome At03 had the highest −log10(P) value (5.10) and explained the largest amount of phenotypic variation (10.23%) in 17JZ, and its −log10(P) value and phenotypic variation explained in BLUP were 4.03 and 6.02%, respectively. Among the SNP loci on chromosome Dt05, i00252Gh recorded the highest −log10(P) value (5.06) and phenotypic contribution rate (8.05%) and also had the highest value in BLUP (Table S3). The SNPs that were detected both in more than two environments and in BLUP were therefore used for further analysis.

Fig. 4

Genome-wide association study (GWAS) for lint percentage (LP). a Manhattan plot of the best linear unbiased prediction (BLUP) across seven environments. The black dashed line represents the significance threshold. b A quantile–quantile (Q–Q) plot of the BLUP for LP

Following previous studies (Su et al. 2018; Sun et al. 2017b), the regions 200 kb upstream and downstream of each significant SNP were defined as a QTL, and QTLs with overlapping regions were considered to be the same locus. Under this definition, 15 QTLs were detected in total (Table S4). Like the significant SNP loci, these QTLs were scattered across different chromosomes. Most of these QTLs contained only one significant SNP; the exceptions were qLP-Dt05-1 (five significant SNPs), qLP-Dt05-2 (two significant SNPs), qLP-Dt10-2 (three significant SNPs) and qLP-Dt13 (two significant SNPs). Moreover, nine QTLs were co-localized with 11 previously reported QTLs (Table S4). Six of these co-localized QTLs shared overlapping regions with known QTLs (qLp-A-1, qLP-Chr10-1, qLP-Chr14-1, qLP-Chr21-2, TMB0206 and MGHES46), and the remaining QTLs were adjacent to qGhLP-c5, JESPR220, NAU3269, qLP-19 or qLP-D10_16.
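
The QTL definition above amounts to an interval-merging step, sketched below: each significant SNP is extended by 200 kb on both sides, and overlapping windows on the same chromosome are merged into one locus. The SNP identifiers and positions are hypothetical and are not taken from Table S3; the function name is also an assumption.

```python
def define_qtls(significant_snps, flank_bp=200_000):
    """Merge +/- flank_bp windows around SNPs into QTL intervals.

    significant_snps is a list of (chromosome, position_bp, snp_id) tuples.
    """
    qtls = []
    for chrom, pos, snp_id in sorted(significant_snps):
        start, end = max(0, pos - flank_bp), pos + flank_bp
        last = qtls[-1] if qtls else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Overlaps the previous window on this chromosome: same locus.
            last["end"] = max(last["end"], end)
            last["snps"].append(snp_id)
        else:
            qtls.append({"chrom": chrom, "start": start,
                         "end": end, "snps": [snp_id]})
    return qtls


# Hypothetical significant SNPs (names and positions are placeholders).
toy_snps = [
    ("Dt05", 9_610_000, "snp_1"),
    ("Dt05", 9_700_000, "snp_2"),
    ("Dt05", 2_680_000, "snp_3"),
    ("At03", 1_000_000, "snp_4"),
]
toy_qtls = define_qtls(toy_snps)
for qtl in toy_qtls:
    print(qtl)
```

Sorting by chromosome and position first means each window only needs to be compared with the most recently created interval.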

Candidate genes underlying associated loci

In total, 434 candidate genes were identified in the QTL regions (Table S5). Analysis of the TM-1 RNA-seq data revealed that 263 of these genes showed high, tissue-specific expression in different organs, including roots, stems, leaves, ovules (− 3, − 1, 0, 1, 3, 5, 10, 20, 25 and 35 DPA) and fibers (5, 10, 20 and 25 DPA) (Figure S5). Some of these specifically expressed genes, such as GhUPL7, GhTUB5 and GhCK1, have previously been shown to be involved in cotton fiber development (Table S5). To narrow the range of candidate genes associated with LP, we then conducted a local LD analysis of the peak SNPs and non-synonymous SNPs identified in the GWAS. This analysis highlighted two genomic loci associated with LP.

The most significant SNP on Dt05 (i00252Gh) was selected as the most promising variant site, as it was identified in five environments and exhibited the lowest p value (Fig. 5a and Figure S4). The candidate region was estimated to be 9.41–9.81 Mb (Fig. 5a, b). An LD block analysis indicated that the candidate SNP locus i00252Gh did not fall into any LD block (Fig. 5b). Interestingly, the peak SNP (i00252Gh) was located in the 10th exon of Gh_D05G1124, a gene of unknown function homologous to a gene encoding a protein phosphatase 2C family protein in Arabidopsis. In addition, i00252Gh was a non-synonymous SNP (A/G) (Table S3) responsible for an aspartic acid to glycine amino acid substitution (Fig. 5c). The GG genotype had a positive phenotypic effect on LP, as accessions carrying the GG genotype had significantly higher LP values than those with the AA genotype (p < 0.001) (Fig. 5d). Moreover, RNA-seq data for Gh_D05G1124 from 17 different upland cotton tissues revealed that Gh_D05G1124 was highly expressed during ovule and fiber development (Figure S5). qRT-PCR analysis indicated that the expression of this gene gradually increased during ovule and fiber development, with peak levels observed at 30 DPA in both ovules and fibers (Fig. 5e). These results suggest that Gh_D05G1124 participates in ovule and fiber development and is a candidate causative gene for LP in upland cotton.

Fig. 5

GWAS results for lint percentage and identification of the candidate causal gene for the peak on chromosome Dt05. a Local Manhattan plot for the candidate region on Dt05. The purple dot represents the peak SNP i00252Gh. Red dotted lines indicate the candidate region. b LD block analysis of SNPs in this region. The degree of linkage is represented by the r² coefficient. c Gene structure of Gh_D05G1124 and a non-synonymous SNP within it. Purple rectangles and black lines indicate exons and introns, respectively. Ref and Alt stand for reference and alternate, respectively. d Box plots for LP based on the allele of SNP i00252Gh. The significance of differences was analyzed by a two-sided Wilcoxon test. e Tissue-specific expression profiles of Gh_D05G1124. Expression of Gh_D05G1124 was investigated in ovule (0, 10, 20 and 30 DPA) and fiber (10, 20 and 30 DPA) developmental stages by qRT-PCR. GhHis3 was used as an internal control. Error bars indicate the standard deviation of three technical replicates (color figure online)

Another notable hotspot region was found in the interval 2.61–2.76 Mb on chromosome Dt05, where a novel non-synonymous SNP (i08888Gh) resulted in an amino acid change from asparagine to serine in the coding sequence (CDS) of the gene Gh_D05G0313 (Figure S6a–c). Accessions with the GG genotype had significantly higher LP values than those harboring the AA genotype (p < 0.001; Figure S6d). Furthermore, qRT-PCR analysis indicated that Gh_D05G0313 was relatively highly expressed in 20 and 30 DPA ovules and 30 DPA fibers (Figure S6e). The ortholog of Gh_D05G0313 in Arabidopsis, AtLUT2, plays an important role in photosynthesis, an important process in plant organs, including developing cotton ovules and fibers.

Analysis of favorable SNP alleles

To assess the cumulative effect of favorable SNP alleles on LP, we selected the two significant SNPs i00252Gh and i08888Gh, both of which had a positive effect on LP. The 276 accessions were classified into three types (AA–AA, AG–AG/AG–AA/AG–GG and GG–GG) based on the SNP alleles at the two loci. A total of 134 accessions were genotyped as AA–AA, 126 accessions were heterozygous, and only 16 possessed the GG–GG genotype. The average LP values of the three genotype groups were 36.58%, 38.24% and 39.48%, respectively, showing that average LP increased as more favorable alleles were pyramided in an accession (Fig. 6). These results suggest that LP is positively correlated with the number of favorable alleles and that these alleles have pyramiding effects on LP.

Fig. 6

Box plot of lint percentage versus the number of favorable alleles. The x-axis indicates LP, and the y-axis indicates the number of favorable SNP alleles
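
The grouping behind this comparison can be sketched in a few lines: count the favorable G alleles across the two loci for each accession, assign a group, and average LP within groups. The rule that every genotype combination other than AA–AA and GG–GG falls into the intermediate group is an assumption made for illustration, and the accessions and LP values are invented.

```python
from statistics import mean


def favorable_allele_count(genotypes, favorable="G"):
    """Number of favorable alleles across loci, e.g. ("AG", "GG") -> 3."""
    return sum(genotype.count(favorable) for genotype in genotypes)


def genotype_group(genotypes):
    """Three groups: no favorable alleles, all favorable, or in between."""
    count = favorable_allele_count(genotypes)
    if count == 0:
        return "AA-AA"
    if count == 2 * len(genotypes):
        return "GG-GG"
    return "intermediate"


def mean_lp_by_group(accessions):
    """accessions is a list of (genotypes_at_two_loci, lp_percent) pairs."""
    groups = {}
    for genotypes, lp in accessions:
        groups.setdefault(genotype_group(genotypes), []).append(lp)
    return {name: mean(values) for name, values in groups.items()}


# Hypothetical accessions: genotypes at two loci and LP (%).
toy_accessions = [
    (("AA", "AA"), 36.0),
    (("AA", "AA"), 37.0),
    (("AG", "AA"), 38.0),
    (("AG", "GG"), 38.5),
    (("GG", "GG"), 39.5),
]
group_means = mean_lp_by_group(toy_accessions)
print(group_means)
```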

Discussion

For GWAS, a wide range of genetic diversity among materials is especially critical (Li et al. 2018). In the present study, the 276 accessions originated from the main cotton-growing regions of China and from other countries and therefore harbored abundant genetic variation. Moreover, the LP trait of the association panel was evaluated in seven environments during 2016 and 2017. LP showed abundant phenotypic variation in each environment, and the use of phenotypic data from multiple environments should enhance the reliability of association mapping. In addition, the broad-sense heritability of LP was 90.7%, which is similar to previously reported values (Huang et al. 2017; Wang et al. 2015). This indicates that LP is highly stable, that markers associated with LP can be detected consistently and that such markers should be useful for breeding cotton adapted to different environments (Su et al. 2016).

Moreover, a high marker density is beneficial for the discovery of more elite loci and promising genes (Wang et al. 2018a). In our study, the average genome-wide density of polymorphic SNPs was one SNP per 273.32 kb. This marker density is similar to levels reported by Sun et al. (2017b) and Huang et al. (2017). The LD decay distance in the current study, 530 kb, was greater than the distance reported in cotton by Li et al. (2018) (400 kb) but smaller than that reported by Sun et al. (2017b) (820 kb). The average PIC value of the markers was 0.250, lower than the value of 0.332 obtained by Huang et al. (2017) and close to the 0.285 reported by Sun et al. (2017b). These differences may be mainly due to differences in population size and SNP-marker filtering criteria, as a similar phenomenon has been observed in soybean (Wen et al. 2018). Furthermore, population structure and relative kinship among individuals are two important factors in controlling false positives (Lu et al. 2015). In this study, the 276 accessions were divided into two subpopulations by comprehensive analysis, and these subpopulations were unrelated to geographic origin. The lack of any geographic correlation may be due to extensive exchange and introgression of germplasm from different geographic origins during cotton breeding. Overall, the association population was not highly structured and the LD level was moderate.

LP is a typical complex quantitative trait controlled by multiple genes (Sun et al. 2018). In cotton, more than 327 QTLs for LP have been detected based on linkage and association mapping (Said et al. 2015a, b). Some of them, especially the stably inherited QTLs, have also been identified by GWAS (Huang et al. 2017; Su et al. 2016; Sun et al. 2018). In the present study, a total of 23 SNPs were found to be significantly associated with LP, about half of which were identified in more than two environments and in BLUP. The high proportion of significant SNPs identified in multiple environments is consistent with the high heritability of LP. In addition, a total of 15 QTLs (as defined in this study) were detected. Among them, six were novel, six overlapped with confidence regions of previously reported QTLs or GWAS signals for LP and three were near these regions. For instance, qLP-At03, qLP-Dt02 and qLP-Dt04 identified in this study overlapped with the confidence intervals of qLP-A-1 (Wang et al. 2013), qLP-Chr14-1 (Li et al. 2016) and TMB0206 (Abdurakhmonov et al. 2007), respectively. These results support the reliability of the LP-related associations determined in the present study. In addition, these stably inherited QTLs, which were repeatedly identified across different genetic backgrounds, populations and environments, may have great potential for marker-assisted breeding for LP in cotton.

In cotton, several genes associated with LP, such as Gh_A02G1268 (Su et al. 2016), Gh_D08G2376 (Huang et al. 2017), AIL6 and EIL (Fang et al. 2017), Gh_D03G1064 and Gh_D12G2354 (Sun et al. 2018) and Gh_D02G0025 (Ma et al. 2018), have been detected via GWAS using different association populations. In the current study, 434 genes were found in the confidence intervals of identified QTLs. Among them, 263 genes were highly expressed in various organs including ovule and fiber developmental stages. We particularly focused on two of these genes, Gh_D05G1124 and Gh_D05G0313, because their exon regions harbored polymorphic SNPs that were responsible for protein-coding differences. Moreover, qRT-PCR analysis revealed that both genes were highly expressed at the ovule and fiber development stages. The closest homologs of Gh_D05G1124 and Gh_D05G0313 in Arabidopsis are, respectively, PP2C (Protein phosphatase 2C family protein) and AtLUT2; those homologs are involved in protein phosphorylation and photosynthesis, two processes related to fiber development. Our results thus point to Gh_D05G1124 and Gh_D05G0313 as candidate genes for LP.

Elite-allele loci are valuable resources for crop breeding programs, and the accumulation of superior alleles is an efficient way to improve target traits in crop plants (Su et al. 2016). In wheat, nine superior alleles contributing to a high thousand-kernel weight were uncovered in multiple environments in the cultivar Pindong34, and proper pyramiding of superior alleles was beneficial for increasing wheat yield (Sun et al. 2017a). In rapeseed, the aggregation of superior alleles significantly associated with earliness resulted in earlier flowering or maturity (Zhou et al. 2018). In cotton, three favorable SNP alleles were selected to examine the effects of allelic variation on Verticillium wilt resistance in upland cotton, and the resistance of accessions was found to increase when favorable SNP alleles were pyramided (Li et al. 2017). In the present study, we similarly found two SNPs significantly associated with LP, i00252Gh and i08888Gh, that had a positive effect on LP. Accessions carrying the GG genotype at i00252Gh and i08888Gh had higher LP than those harboring the AA genotype, and the phenotypic value of LP increased continuously with the number of favorable alleles. This result indicates that these favorable alleles can be pyramided in a target line by marker-assisted selection. Of the 276 upland cotton accessions, however, only 16 were homozygous for the favorable allele at both loci. This scarcity indicates that these elite loci are not presently well utilized. The future application of favorable alleles thus has great potential in cotton breeding programs.

Author contribution statement

DY, XM and WL conceived and designed the research. CS, ZR, KS and XZ performed the experiments. XP, YL, KH and FZ prepared the materials. CS and WL analyzed the data and wrote the paper. DY and XM revised the manuscript. All authors read and approved the final manuscript.