Introduction

Gossypium hirsutum L., commonly referred to as Upland cotton, is an essential cash crop that accounts for 95 % of the world’s cotton production (Zhang et al. 2008). Commercial seed cotton is composed of approximately 40 % lint and 60 % seed, and it provides the most important natural fiber for the textile industry as well as seed nutrition for both humans and livestock. The importance of fiber quality has long been recognized due to the changing requirements of spinning technology, and considerable efforts have been devoted to improving fiber quality traits (Chen et al. 2011; Ashokkumar et al. 2014). By contrast, cottonseed is still considered to be a by-product of lint, and little emphasis has been placed on breeding for seed nutritional components (Wu et al. 2009). Cottonseed oil, which consists of approximately 70 % unsaturated and 30 % saturated fatty acids, can be used directly for edible purposes once it has been refined to eliminate phenolic compounds (Lukonge et al. 2007), and it is also considered to be an important biofuel resource (Liu et al. 2009). Cottonseed protein is widely used to feed sheep, cattle and other ruminant livestock (Kohel et al. 1985; Coppock et al. 1987). If gossypol were eliminated from cottonseed protein, the protein would be fully edible, thereby providing a new, important source of nutrition that would increase food security worldwide.

Cottonseed oil and protein contents are quantitative traits that are simultaneously affected by genetic and environmental factors during seed development; these traits often vary among different growing seasons, locations and years (Dani and Kohel 1989; Wu et al. 2010). Large-scale, repeated chemical testing during breeding is labor-intensive, costly and time-consuming, and improving these two traits effectively through phenotypic selection alone has proven unfeasible (Wu et al. 2010; Ashokkumar and Ravikesavan 2011). Molecular markers tightly linked to target genes and/or QTLs can be used for marker-assisted selection (MAS), which markedly improves breeding efficiency (Xu and Crouch 2008; Ashokkumar and Ravikesavan 2011). In the past two decades, the availability of abundant molecular markers has made tagging QTLs harboring functional genes through family-based linkage mapping a routine process, and a large number of QTLs for agronomically important traits have been identified in cotton (Zhang et al. 2008; Said et al. 2013), including QTLs for cottonseed oil and protein (Song and Zhang 2007; An et al. 2010; Liu et al. 2012; Alfred et al. 2012; Yu et al. 2012; Liu et al. 2013). However, approximately 80 % of the QTLs identified by linkage mapping could not be confirmed in subsequent studies, and few have actually been applied in breeding programs (Lacape et al. 2010; Said et al. 2013). There are two likely reasons: most of the QTLs were population-specific, and the limited amount of recombination present in most populations used for linkage mapping makes it difficult to map QTLs at a high resolution. Both factors have severely limited the application of these QTLs in breeding programs.

Linkage disequilibrium (LD)-based association mapping (AM), which has the potential to exploit most recombination events that have occurred during the plant’s evolutionary history and to simultaneously evaluate the effects of many alleles of target loci, has become a powerful approach to dissecting complex traits in many crops (Zhu et al. 2008; Mackay et al. 2009). In cotton, AM has been used for QTL detection for fiber quality traits (Kantartzi and Stewart 2008; Abdurakhmonov et al. 2008, 2009; Zeng et al. 2009; Zhang et al. 2013; Cai et al. 2014), yield and its components (Zhang et al. 2013; Mei et al. 2013), disease resistance (Mei et al. 2014; Zhao et al. 2014) and salinity tolerance (Saeed et al. 2014). However, to the best of our knowledge, no association-mapping study of seed oil and protein traits has been reported in cotton to date. In the present study, 180 elite Upland cotton cultivars and breeding lines were assembled into an AM panel, evaluated in three locations across 2 years and genotyped using 228 polymorphic SSR markers to perform marker-trait association analysis. The results provide useful information for further understanding the genetic basis of cottonseed oil and protein contents, and they should facilitate future efforts to breed cotton with high seed oil or protein content by MAS.

Materials and methods

Plant materials

A total of 180 elite Upland cotton cultivars and breeding lines were selected from the AM population previously used in our laboratory (Mei et al. 2013) to construct a new panel. Among these accessions, 174 were developed in China (including 77, 64, 17 and 16 that were released in the Yellow River, Yangtze River, Northwest China and North China cotton growing regions, respectively) and six were introduced from the U.S., including the genetic standard line TM-1. All accessions had been self-pollinated for more than eight generations. Detailed information about the 180 accessions is summarized in Table S1.

Measurements of seed oil and protein contents

All 180 accessions were grown at the following three locations in 2011 and 2012: (1) Breeding Station at Nanyang Agricultural Research Institute, Nanyang, Henan, China (32°55′16″N, 112°34′07″E, designated NY); (2) Xinxiang Cotton and Wheat Research Institute, Xinxiang, Henan (35°09′34″N, 113°47′35″E, XX); and (3) Breeding Station at Biocentury Seed Company Limited, Korla, Xinjiang Uygur Autonomous Region (41°44′36″N, 86°07′40″E, XJ). The first location is in the Yangtze River cotton-growing region of China, the second is in the Yellow River region and the third is in the Northwestern inland region. A randomized complete block design with single-row plots and two replications was used in each field trial. Sowing dates ranged from late March to early April, depending on the year and location. In XJ, seeds of the 180 accessions were sown directly into the field, with 30 holes per row, a hole spacing of 10 cm and a row spacing of 40 cm, and a single plant per hole was retained after seedling emergence. In NY and XX, seedlings were transplanted: seedlings with 3–4 leaves were moved from seedbeds to the field, with 20 plants per row, a plant spacing of 30 cm and a row spacing of 80 cm. Field management followed local practices. At maturity, 25 fully opened bolls from each plot were hand-harvested and ginned. Approximately 20 g of cottonseed per sample was prepared with the shell intact, and seed oil and protein contents were measured by near-infrared reflectance (NIR) spectroscopy on a Foss NIRSystems 5000 (NIRSystems, Silver Spring, MD, USA) according to Huang et al. (2013).

SSR genotyping

Young leaves from each of the 180 accessions were collected and stored at −20 °C. Total genomic DNA was extracted from the leaf samples following the published method developed in our laboratory (Guo et al. 2007). In our previous study, a 356-accession panel was genotyped using 145 SSR markers. However, two of the 145 SSRs were not polymorphic in the new 180-accession panel. An additional 83 SSRs tightly linked to agronomically important traits (Qin et al. 2008; Mei et al. 2014) were then used to fingerprint the new panel. In total, the panel was genotyped using 228 markers (detailed information is summarized in Table S2) for marker-trait association analysis.

Data analysis

Summary statistics of genetic diversity were calculated using PowerMarker 3.25 software (Liu and Muse 2005). The Bayesian model-based program STRUCTURE 2.3 was used to infer the population structure with 66 unlinked and/or weakly linked markers (Pritchard et al. 2009; Mei et al. 2013). Both the burn-in period and the number of Markov Chain Monte Carlo replications were set at 100,000 under the admixture model with correlated allele frequencies. Five independent runs were performed for each hypothetical number of subpopulations (k), which ranged from 1 to 10. The optimal k was estimated by combining the log probability of the data [LnP(D)] with the ad hoc statistic ∆k (Evanno et al. 2005). Based on the optimal k, each accession was assigned to the subpopulation for which its membership value (Q value) was >0.5 (Pritchard et al. 2000), and the population structure matrix (Q) was generated for subsequent marker-trait association mapping. SPAGeDi software was used to calculate the pair-wise relatedness coefficients (K, kinship matrix), with negative kinship values set to zero (Hardy and Vekemans 2002).
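
The ∆k criterion described above can be sketched as follows. This is a minimal illustration rather than the procedure actually run in this study: it assumes the common formulation in which ∆k is the absolute second difference of the mean LnP(D) across runs, divided by the standard deviation of LnP(D) at that k. The function name, the input layout and the toy values are hypothetical.

```python
from statistics import mean, stdev


def evanno_delta_k(lnp_by_k):
    """Return {k: delta_k} from replicate LnP(D) values.

    lnp_by_k maps each tested k to the list of LnP(D) values from its
    replicate runs (hypothetical input layout, not STRUCTURE's file format).
    """
    delta_k = {}
    for k in sorted(lnp_by_k):
        # Delta k needs both neighbours, so the smallest and largest k are skipped.
        if k - 1 not in lnp_by_k or k + 1 not in lnp_by_k:
            continue
        spread = stdev(lnp_by_k[k])
        if spread == 0:
            continue  # undefined when all replicates agree exactly
        second_diff = abs(
            mean(lnp_by_k[k + 1]) - 2 * mean(lnp_by_k[k]) + mean(lnp_by_k[k - 1])
        )
        delta_k[k] = second_diff / spread
    return delta_k


# Toy values for illustration only; these are not the study's results.
toy_runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-890.0, -894.0],
    4: [-885.0, -880.0],
}
toy_delta = evanno_delta_k(toy_runs)
best_k = max(toy_delta, key=toy_delta.get)
```

Because ∆k requires values at k − 1 and k + 1, runs over k = 1 to 10 yield ∆k only for k = 2 to 9.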

Statistical analysis of the phenotypic data from three locations across 2 years was performed using SAS 8.0 software (SAS Institute 1999). Analysis of variance (ANOVA) was performed with PROC GLM based on the trait means for each line across the six environments. Variance components (genotype, location, year, replicate and the interactions among these factors) were estimated using PROC VARCOMP.

The widely used mixed linear model (MLM), which considers both Q and K and is implemented in the TASSEL software package, was used to perform marker-trait association analysis, and the P value and R² of each association were determined (Yu et al. 2006; Bradbury et al. 2007).
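
For orientation, the Q + K mixed model is usually written in the general form below. The notation follows the common presentation of this model and is an assumption here, not this study's own notation. The residual covariance is simplified to independent, homogeneous errors, and the marker-effect vector α is unrelated to the significance level α used elsewhere in the text. The block assumes the amsmath package.

```latex
\[
  \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{S}\boldsymbol{\alpha}
             + \mathbf{Q}\mathbf{v} + \mathbf{Z}\mathbf{u} + \mathbf{e},
  \qquad
  \operatorname{Var}(\mathbf{u}) = 2\mathbf{K}\sigma_g^{2},
  \quad
  \operatorname{Var}(\mathbf{e}) = \mathbf{I}\sigma_e^{2}
\]
```

Here y is the vector of phenotypes, β holds fixed effects other than marker or subpopulation effects, α holds the marker effects, v holds the subpopulation effects (related to y through the STRUCTURE Q matrix), u holds the polygenic background effects whose covariance is proportional to the kinship matrix K, and e is the residual; X, S and Z are incidence matrices.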

Results

Genetic diversity, population structure and genetic relatedness

A total of 601 alleles were detected at 228 SSR loci. The allele number averaged 2.64 (ranging from 2 to 11), and approximately 86 % of the loci (198 of 228) yielded only two or three alleles (Table S2). The gene diversity, polymorphism information content (PIC) and heterozygosity values of the 228 loci averaged 0.37, 0.31 and 0.03, respectively, with ranges of 0.04–0.75, 0.04–0.72 and 0–0.22, respectively (Fig. 1).
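
As a rough guide to how the per-locus statistics above are defined, the sketch below computes gene diversity and PIC from allele frequencies using the textbook formulas. It is only an illustration: PowerMarker's estimators may include sample-size or inbreeding corrections that are not reproduced here, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(alleles):
    """Relative frequency of each allele observed at one locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


def gene_diversity(freqs):
    """Expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content (uncorrected textbook form)."""
    homozygosity = sum(p * p for p in freqs)
    # Cross term: 2 * p_i^2 * p_j^2 summed over all unordered allele pairs.
    cross = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - cross


# Toy locus with two equally frequent alleles (illustration only).
toy_freqs = allele_frequencies(["a", "a", "b", "b"])
toy_diversity = gene_diversity(toy_freqs)
toy_pic = pic(toy_freqs)
```

For a biallelic locus, gene diversity cannot exceed 0.5 and PIC cannot exceed 0.375, which helps explain the modest averages in a panel where most loci carry only two or three alleles.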

Fig. 1 Distribution of (a) gene diversity, (b) PIC and (c) heterozygosity of 228 SSR loci in 180 Upland cotton accessions. Data were calculated using PowerMarker 3.25 software (Liu and Muse 2005)

Model-based population structure analysis of the panel revealed that the LnP(D) value corresponding to each hypothetical k increased with increasing k value and did not exhibit a peak. By contrast, the ∆k value revealed a much higher likelihood at k = 2 than at k = 3–10 (Fig. 2), suggesting that the total population could be divided into two major subpopulations (Evanno et al. 2005), designated P1 and P2, respectively. The P1 group contained 56 accessions including 23 cultivars from the Yellow River cotton growing region, 13 from Northwest China, 12 from North China, six from the Yangtze River region and two from the United States. The P2 group consisted of 124 accessions including 58 lines from the Yangtze River region, 54 from the Yellow River region, four from Northwest China, four from the North China region and four from the United States (Table S1). The corresponding Q matrix at k = 2 was used for further marker-trait association analysis.

Fig. 2 Model-based evaluation of population structure. (a) LnP(D) values for k from 1 to 10, (b) ∆k for k from 2 to 9. The LnP(D) of each hypothetical k continued to increase, and the ∆k values showed a much higher likelihood at k = 2 than at k = 3–10, suggesting that the total panel should be divided into two major subpopulations. LnP(D) values are mean values of five repeats estimated using STRUCTURE (Pritchard et al. 2009), and ∆k values were calculated according to Evanno et al. (2005)

A total of 86.98 % of the cotton accessions had kinship coefficient values less than 0.05, while 8.63 % had values ranging from 0.05 to 0.10, and the remaining 4.39 % showed varying degrees of genetic relatedness (Fig. 3). We constructed a K matrix for association analysis based on the results of relatedness analysis.

Fig. 3 Distribution of pair-wise kinship coefficients among 180 Upland cotton accessions. Data were calculated using SPAGeDi (Hardy and Vekemans 2002) with 228 SSR markers

Variations in seed oil and protein contents

We measured the seed oil and protein contents of 180 Upland cotton accessions grown in three different locations across 2 years. Each trait varied widely, and ANOVA revealed that the effects of genotype (G) and of the interactions between genotype and environmental factors (G × E) were both highly significant (α = 0.001; Table 1), indicating that these two seed quality traits are strongly affected by the environment. The mean oil contents of the 180 seed samples from XJ, XX and NY across 2 years were 31.76, 29.98 and 29.45 %, respectively, and the differences among the three locations were highly significant (LSD, α = 0.01). The mean protein contents of the 180 seed samples from NY and XX across 2 years were 49.19 and 49.08 %, respectively, and did not differ significantly from each other (LSD, α = 0.01); both, however, differed highly significantly from the mean value from XJ (45.61 %). The mean coefficients of variation (CV) for oil and protein contents were 8.25 and 5.31 %, respectively, demonstrating that there was a high degree of diversity in seed quality traits in the present panel. The absolute values of skewness and kurtosis for these two traits in most environments were less than 1, suggesting that the seed oil and protein contents in this panel approximately followed a normal distribution. Highly significant negative correlations (−0.95 to −0.98, α = 0.01, data not shown) were found between oil and protein contents in each of the six environments, which is consistent with the results of many other studies (Wu et al. 2009; Yu et al. 2012; Liu et al. 2013).
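
The descriptive statistics mentioned above (CV, skewness and kurtosis) can be illustrated with the sketch below. It uses simple moment-based definitions as an assumption; SAS reports bias-corrected estimators, so values from this sketch would differ slightly from those in Table 1. The function names and the toy data are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation relative to the mean."""
    return 100.0 * stdev(values) / mean(values)


def _central_moment(values, order):
    """Population central moment of the given order."""
    centre = mean(values)
    return sum((x - centre) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness (0 for a symmetric sample)."""
    return _central_moment(values, 3) / _central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return _central_moment(values, 4) / _central_moment(values, 2) ** 2 - 3.0


# Toy symmetric sample for illustration only; not data from this study.
toy_values = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_cv = coefficient_of_variation(toy_values)
toy_skew = skewness(toy_values)
toy_kurt = excess_kurtosis(toy_values)
```

Absolute skewness and excess kurtosis values below 1 are a common informal screen for approximate normality, which is how the criterion is applied in the paragraph above.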

Table 1 Descriptive statistics and ANOVA of seed oil and protein contents of 180 Upland cotton accessions in three locations across 2 years

SSR markers associated with oil and protein contents

At the α = 0.01 (−lgP ≥ 2) level, a total of 86 significant marker-trait associations were detected between 58 SSR markers and the two seed quality traits in six environments. Among these associations, more than half (59 of 86) could be detected in only one environment. The proportion of phenotypic variation explained by markers ranged from 4.31 to 24.18 %, with an average of 10.11 % (Table S3). Significant marker-trait associations simultaneously detected in more than one environment are shown in Table 2. Fifteen SSR markers distributed on 10 chromosomes (A3, A7, A9, A10, A12, A13, D2, D5, D6 and D9) were associated with seed oil content, including two, two and 11 that could be detected in five, three and two environments, respectively. Twelve SSR markers across nine chromosomes (A3, A7, A9, A10, A12, D2, D3, D5 and D9) were associated with seed protein content, including one, four and seven that could be detected in five, three and two environments, respectively. Among the 18 related SSR markers, nine loci were significantly associated with both seed traits (Table 2).

Table 2 SSR markers significantly (α = 0.01, −lgP ≥ 2.0) associated with seed oil and protein contents detected in more than one environment

Discussion

Genetic diversity and population structure of the AM panel

Genetic diversity in the panel is essential for marker-trait AM studies (Flint-Garcia et al. 2005). Most Upland cotton cultivars developed in China were derived from a few germplasm resources introduced from abroad (Huang 2007). Therefore, it is especially critical to select samples that encompass as much genetic diversity as possible. Theoretically, a panel with a large number of accessions should best meet this requirement. However, working with large populations significantly increases phenotyping costs and can easily lead to errors due to differences in field conditions and management practices, especially for large-plant crops such as cotton, and such errors could reduce the detection power of association analysis. The 180 accessions in the panel used in the current study were selected from our previously used panel (Mei et al. 2013) based on field performance and pair-wise genetic distances among the 356 accessions. The gene diversity and PIC values of the 228 loci averaged 0.37 and 0.31 with ranges of 0.04–0.75 and 0.04–0.72, respectively, and more than 86 % of the kinship coefficient values were less than 0.05. Although fewer accessions were used in the current study than in the previous study (180 vs. 356), the genetic diversities of the two panels were similar (Mei et al. 2013).

Many crops, including Upland cotton, have long, complex histories of domestication and breeding. Relatedness among accessions can lead to population stratification, which can confound the results of AM (Price et al. 2006; Yu et al. 2006). Our model-based evaluation of the population structure of the 180 Upland cotton cultivars revealed that the population could be divided into two major subpopulations (Fig. 2). Group P1 contains almost all cultivars with early maturity and some cultivars with moderate maturity, while group P2 contains almost all cultivars with late maturity and some cultivars with moderate maturity. These results indicate that population stratification is present in the current AM panel and should be accounted for in subsequent association analyses.

QTLs for cottonseed oil and protein contents detected by association mapping

In this study, marker-trait association analysis was performed with the MLM, which accounts for both population structure and relatedness, to detect SSR markers associated with seed oil and protein contents (Yu et al. 2006). A total of 86 highly significant (α = 0.01) associations were detected between 58 SSR markers and the two seed quality traits (Table S3). Nonetheless, it is not easy to determine which significance level should be accepted. Stringent probability thresholds reduce the danger of false positives, but setting the threshold too high risks rejecting true positives (Yan et al. 2011). If a more stringent Bonferroni-corrected threshold (P ≤ 0.05/228, −lgP ≥ 3.66) is adopted (Lander and Botstein 1989), only four associations remain significant (Table S3). For the purpose of MAS, marker-trait associations should be environmentally stable and consistent. SSR markers with significant associations simultaneously detected in more than one environment could therefore be considered candidate seed quality SSRs. Of these candidates, 15 SSR markers were associated with cottonseed oil content and 12 SSRs were associated with seed protein content. Among the 18 related SSR markers, nine loci were significantly associated with both seed traits (Table 2), which is consistent with the fact that seed oil and protein contents are usually negatively correlated (Wu et al. 2009; Yu et al. 2012; Liu et al. 2013). The resulting stably associated markers, such as NAU845 and NAU1048, which were detected in almost all environments, should be quite useful for developing new cultivars with broad adaptability to different environments. Moreover, the materials used in this study are all cultivars and breeding lines with elite field performance, which can be directly utilized as parents in breeding programs.
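
The two significance thresholds discussed above can be checked with a few lines of arithmetic. The sketch below simply converts a P-value cutoff to the −lgP scale; the function name is hypothetical.

```python
import math


def neg_log10_threshold(alpha, n_tests=1):
    """Return -lg(alpha / n_tests), the cutoff on the -lgP scale."""
    return -math.log10(alpha / n_tests)


# Nominal threshold used for the 86 reported associations.
nominal = neg_log10_threshold(0.01)

# Bonferroni correction for the 228 SSR markers tested.
bonferroni = neg_log10_threshold(0.05, n_tests=228)
```

The nominal cutoff corresponds to −lgP = 2, and the Bonferroni cutoff rounds to 3.66, matching the values quoted in the text.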

The genetic basis of seed oil and protein contents appears to be complicated; as mentioned in several reports, additive, dominance and cytoplasmic effects all play important roles in their inheritance (Wu et al. 2010; Yu et al. 2012; Liu et al. 2013). The QTL mapping results of previous studies (Song and Zhang 2007; An et al. 2010; Liu et al. 2012; Alfred et al. 2012; Yu et al. 2012; Liu et al. 2013), as well as the current results, are too divergent to be compared. More research should be performed to better dissect the genetic architecture of the seed oil and protein traits in the future. Recently, two preliminary maps of the whole-genome scaffolds of G. raimondii (the putative diploid donor for tetraploid species) were separately released by two different groups (Paterson et al. 2012; Wang et al. 2012), which will facilitate tetraploid genome assembly. Moreover, true genome-wide AM will be realized in the near future through resequencing or other high-throughput genotyping technologies.