Introduction

Genome-wide association studies (GWAS) have proven valuable at identifying genomic locations that affect traits in humans, livestock, and crops (Bouwman et al. 2018; Huang and Han 2014; MacArthur et al. 2016; Welter et al. 2013). Most frequently, these depend on collecting genotypic and phenotypic information from natural populations and must overcome issues of population structure to control for shared regions that do not harbor a causative genetic polymorphism (McCarthy et al. 2008). However, the algorithms developed for GWAS can also be applied to carefully designed multi-parent populations, including nested association mapping (NAM) and multi-parent advanced generation intercross (MAGIC) populations (Buckler et al. 2009; Huang et al. 2015; Rakshit et al. 2012; Yu et al. 2008). While these populations may not capture the full complement of diversity that is present in a natural population, they avoid many of the issues of kinship and population structure (Huang et al. 2015). In species with a narrow genetic base, such as cultivated upland cotton (Gossypium hirsutum L.), the selection of sufficiently diverse parents may allow the identification of important haplotypes and causative genes or alleles (Fang et al. 2013; Islam et al. 2016; Percy et al. 2006). The recombinations between haplotypes observed in a MAGIC population are likely to be more numerous and more evenly distributed than in natural populations, resulting in significantly lower linkage disequilibrium (LD) and thus finer mapping of quantitative trait loci (QTLs) (Dell’Acqua et al. 2015; Huang et al. 2015, 2011). Ongoing advances in genotyping technology allow ever more precise delineation of haplotype blocks, and whole genome sequencing (WGS) of members of a population is now feasible (Cao et al. 2011; The 1000 Genomes Project Consortium 2010).

Cotton is the major renewable source of fibers for textiles and is an important cash crop worldwide (Paterson et al. 2012). In the USA, every bale of cotton is graded to establish a premium or discounted price. This analysis is carried out with a high volume instrument (HVI) which measures cotton fiber quality characteristics including elongation (ELO), micronaire (MIC), short fiber index (SFI), fiber strength (STR), upper half mean length (UHML), and uniformity index (UI), although ELO and SFI are not frequently used in setting the price. These characteristics are influenced by genotype, environment, management practices, and their interactions (Dabbert et al. 2017; Gore et al. 2014; Paterson et al. 2003; Said et al. 2015). Recently, researchers have used GWAS to identify important loci that affect cotton fiber and agronomic traits, using various DNA marker technologies, including SNP arrays, genotyping-by-sequencing (GBS) and WGS (Fang et al. 2017; Huang et al. 2017; Islam et al. 2016; Li et al. 2018; Ma et al. 2018; Su et al. 2016, 2018; Sun et al. 2017; Yuan et al. 2018).

Here, we present GWAS of a cotton MAGIC population that consists of 550 recombinant inbred lines (RILs) derived from crosses between ten diverse cotton cultivars and one improved, but not cultivated, cotton line (M240). We collected phenotypic data from three locations, spanning 7 years, for a total of twelve environments, or location–years. We identified significant QTLs for the major cotton fiber quality characteristics. We found masking of a secondary STR QTL by a major, multi-trait locus and additive effects for ELO and the highly environment-dependent MIC. Importantly, we were able to directly identify candidate gene variants at these loci using WGS of all the parents and RILs.

Materials and methods

Plant materials

A set of eleven diverse G. hirsutum cotton lines (Table S1) from major breeding programs across the USA were used as parents to develop a MAGIC population. The details of the MAGIC population development were previously described (Fang et al. 2014; Islam et al. 2016; Jenkins et al. 2008). Briefly, the eleven parents were crossed in a half-diallel to establish 55 families. These were randomly mated by a bulked pollen approach for five generations (C5), followed by six generations of single seed descent (S6) to establish the 550 C5S6 RILs in Starkville, MS, USA.
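As a quick check on the crossing scheme, a half-diallel among eleven parents pairs each parent once with every other parent, without reciprocals or selfs. The following minimal Python sketch enumerates those pairings; the labels P1 to P11 are placeholders, not the cultivar names listed in Table S1.

```python
from itertools import combinations

# Placeholder labels; the actual parents are listed in Table S1.
parents = [f"P{i}" for i in range(1, 12)]

# Half-diallel: every unordered pair of distinct parents (no reciprocals, no selfs).
families = list(combinations(parents, 2))

print(len(families))  # number of families implied by the crossing design
```

The count equals 11 × 10 / 2, which agrees with the 55 families stated above.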

DNA isolation and whole genome sequencing

Five hundred fifty RILs along with their eleven parental lines were grown in a greenhouse in 2013 in New Orleans, LA, USA. Young leaves were collected from ten plants of each RIL or parent and stored at −80 °C. Genomic DNA was extracted from frozen leaves following the protocol previously described, with an additional RNase A digestion step before binding DNA to the column (Islam et al. 2014). The quality and quantity of DNA were measured using a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) as well as on a 1.5% agarose gel. DNA samples were sent to Novogene Corporation (Chula Vista, CA, USA) for library preparation and whole genome sequencing using Illumina HiSeq 2500 with paired-end 150 bp reads. Each of the eleven parental lines was sequenced at 20× coverage (about 50 Gb), and each of the 550 RILs was sequenced at 3× coverage (about 8 Gb).
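The stated depths and yields can be cross-checked with simple arithmetic, since yield divided by depth gives the genome size that each figure implies. The sketch below uses only the approximate numbers quoted above; the variable names are illustrative, and the implied genome sizes are derived estimates rather than values reported in this study.

```python
# Approximate figures quoted in the text.
parent_depth, parent_yield_gb = 20, 50   # 20x coverage, about 50 Gb per parent
ril_depth, ril_yield_gb = 3, 8           # 3x coverage, about 8 Gb per RIL
n_rils = 550

# Genome size implied by each yield/depth pair, in Gb.
implied_genome_parent_gb = parent_yield_gb / parent_depth
implied_genome_ril_gb = ril_yield_gb / ril_depth

# Total sequence generated across all RILs, in terabases.
total_ril_yield_tb = n_rils * ril_yield_gb / 1000

print(implied_genome_parent_gb, round(implied_genome_ril_gb, 2), total_ril_yield_tb)
```

Both yield/depth pairs imply a genome of roughly 2.5 to 2.7 Gb, and the RIL total of about 4.4 terabases is in line with the sequencing volume reported in the Results.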

Processing of sequence data and variant calling

Sequencing reads of the eleven parents were first aligned to the draft reference G. hirsutum cv. Texas Marker-1 genome (Zhang et al. 2015) with GSNAP software, using the “-Q -n 1” options, which require that mapped reads have a single, uniquely best match in the genome (Wu and Nacu 2010). Reads with multiple, equally good matches in the genome were excluded. Variants were identified with samtools mpileup (with “-Ego” flags) and bcftools (“call -vm”) software (Danecek et al. 2011; Li et al. 2009). We then required that at least one of the eleven parents be scored as homozygous for the reference allele and at least one as homozygous for the alternate allele. A SNP was also discarded if three or more parental lines had missing data or were heterozygous.
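The parental filtering rules described above can be sketched as a small Python predicate. This is an illustration under stated assumptions, not the script used in the study: the genotype coding ("ref", "alt", "het", None) and the function name are hypothetical, and missing and heterozygous calls are pooled into one count, which is one possible reading of the rule.

```python
def keep_parental_snp(genotypes):
    """Return True if a SNP passes the parental filters described in the text.

    genotypes: one call per parent, coded (by assumption) as "ref" (homozygous
    reference), "alt" (homozygous alternate), "het" (heterozygous) or None (missing).
    """
    has_hom_ref = any(g == "ref" for g in genotypes)
    has_hom_alt = any(g == "alt" for g in genotypes)
    # Assumption: missing and heterozygous calls are counted together.
    n_unusable = sum(1 for g in genotypes if g is None or g == "het")
    # Discard when three or more parents are missing or heterozygous.
    return has_hom_ref and has_hom_alt and n_unusable < 3


informative = ["ref"] * 6 + ["alt"] * 4 + [None]
monomorphic = ["ref"] * 11
noisy = ["ref"] * 4 + ["alt"] * 4 + ["het", None, "het"]
print(keep_parental_snp(informative), keep_parental_snp(monomorphic), keep_parental_snp(noisy))
```

Requiring both homozygous classes among the parents ensures that each retained SNP can segregate in the population.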

For the RILs, the resulting variant call format (VCF) files were first filtered for variants that segregated in the RIL population using a custom Python script. The filtering criteria were: reference allele ≥ 2.5%, alternative allele ≥ 2.5%, missing score ≤ 20%, and heterozygous rate ≤ 30%. Then, the minor allele frequency (MAF) among the parents and among the RILs was checked separately for each of the filtered SNPs. SNPs with a significant MAF difference (p < 0.05 based on a Chi-square test) between the parents and RILs were excluded from the final variant set. Remaining missing data were imputed with the k-nearest neighbors (k-NN) approach implemented in TASSEL 5.0 software with default parameters (Bradbury et al. 2007). Non-synonymous SNPs were identified as before (Thyssen et al. 2014).
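A minimal sketch of the RIL-level filters follows. It is not the custom script used in the study. The genotype coding (0, 1, 2, None), the choice of denominators, the function names, and the 2 × 2 form of the Chi-square test on allele counts are all assumptions made for illustration, and a p value below 0.05 is treated as a significant difference, following the usual convention.

```python
import math


def passes_ril_filters(calls):
    """calls: per-RIL genotypes, 0 = hom ref, 1 = het, 2 = hom alt, None = missing (assumed coding)."""
    n = len(calls)
    called = [c for c in calls if c is not None]
    if not called:
        return False
    missing_rate = (n - len(called)) / n
    het_rate = called.count(1) / len(called)      # assumption: rate among called genotypes
    alt_freq = sum(called) / (2 * len(called))    # allele frequency among called genotypes
    ref_freq = 1 - alt_freq
    return (ref_freq >= 0.025 and alt_freq >= 0.025
            and missing_rate <= 0.20 and het_rate <= 0.30)


def chi2_p_2x2(a, b, c, d):
    """Pearson Chi-square p value (1 d.f., no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 d.f., the upper-tail probability is erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical allele counts (ref, alt) among parents and RILs for a single SNP.
p_value = chi2_p_2x2(10, 12, 500, 600)
keep = passes_ril_filters([0] * 300 + [2] * 200 + [1] * 20 + [None] * 30) and p_value >= 0.05
print(keep)
```

With only 22 parental alleles per SNP the test has limited power, so in practice it mainly removes SNPs whose frequency in the RILs departs strongly from that of the parents.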

Field experiments and phenotyping

The RILs and eleven parents were planted in Florence, SC, in 2014–2016; in Starkville, MS, in 2009–2011 and 2014–2016; and in Stoneville, MS, in 2013–2015. The 550 RILs were divided into two subpopulations of 275, called Set A and Set B. Usually, only one set was grown at each location–year (Supplemental Table S2). Field plots were arranged in a randomized complete block design in Starkville in 2009 and in an alpha lattice design at all other location–years. Single row plots were 12 m long with approximately 120 plants per plot and two replicate plots per line at each location–year. Field practices were applied according to the prevailing conventions at each location–year. Twenty-five naturally opened bolls were manually harvested from each line and ginned using a 10-saw laboratory gin. The fiber quality attributes (ELO, MIC, SFI, STR, UHML, and UI) were measured using an HVI (USTER Technologies, Charlotte, NC, USA).

Phenotype variance analysis and normalization

Raw phenotypic data from 2015 across the three locations, and from Starkville, MS, USA across the years 2010, 2011, 2015, and 2016, were separately subjected to ANOVA using PROC MIXED in SAS software (SAS Institute, Cary, NC, USA). For GWAS of individual location–years, arithmetic means of phenotype values were computed between replicates. Raw phenotypic data were then also normalized across replicates, years, and locations using a best linear unbiased predictor (BLUP) implemented in R software using the lme4 package to fit the model: “model = lmer(phenotype ~ (1|line) + (1|location) + (1|year) + (1|(replicate:location):year) + (1|line:location) + (1|line:year))” (Bates et al. 2014). To investigate the location effect on micronaire while controlling for year, we normalized data from each location separately by BLUP to fit the model: “model = lmer(phenotype ~ (1|line) + (1|year) + (1|rep:year) + (1|line:year)).”

Association mapping analysis

The compressed mixed linear model (MLM) marker–trait association analysis was implemented with GAPIT software using the selected sequence variants, the input parameter “PCA.total = 3,” and phenotypic data that were normalized and subsampled as described above (Lipka et al. 2012; Zhang et al. 2010). GAPIT calculated a kinship matrix according to the VanRaden method and performed GWAS using the default average clustering algorithm and default mean group kinship type (Lipka et al. 2012). We applied GAPIT’s default correction for multiple testing to establish the significance threshold of − log(p) > 7.
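To give a sense of where a threshold near 7 comes from, the sketch below computes a Bonferroni-style cutoff for the number of SNPs reported in the Results. The family-wise error rate of 0.05 and the base-10 logarithm are assumptions made for illustration; the text does not spell out the exact procedure that GAPIT applied.

```python
import math

n_snps = 473_517   # SNPs retained for GWAS, as reported in the Results
alpha = 0.05       # assumed family-wise error rate (not stated in the text)

# Bonferroni: divide the error rate by the number of tests.
bonferroni_p = alpha / n_snps

# Express the cutoff on the -log10(p) scale used in Manhattan plots.
threshold = -math.log10(bonferroni_p)
print(round(threshold, 2))
```

Under these assumptions the cutoff comes out just under 7, consistent with the threshold used here.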

RNA expression analysis

To present tissue-specific expression of annotated genes near the candidate loci, we retrieved publicly available data from ccNET (http://structuralbiology.cau.edu.cn/gossypium) (You et al. 2016).

Results

Phenotypic variation

We collected phenotypes of six cotton fiber quality attributes: ELO, MIC, SFI, STR, UHML, and UI. We measured these phenotypes for cotton grown at three locations in the southeastern USA (Florence, SC; Starkville, MS; and Stoneville, MS) between 2009 and 2016, for a total of twelve environments (Fig. 1). The mean values of the RILs were similar to those of the parental lines for all traits; however, the standard deviations and the ranges between maximum and minimum values were broader, indicating transgressive segregation (Table S3). We observed significant correlations among STR, UI, SFI, and UHML in both the parents and the RILs, while ELO and MIC varied independently of the other traits and of each other (Table S4). ANOVA results revealed that the variance components for genotype and genotype-by-environment effects were significant for all the tested traits. Among the traits, location had the most significant effect on the variance of MIC values, and for MIC the location component was more significant than the year component (Tables S5, S6).

Fig. 1

Phenotypic distribution of six major cotton fiber quality phenotypes across twelve environments. Fiber elongation (ELO), micronaire (MIC), short fiber index (SFI), strength (STR), upper half mean length (UHML), and uniformity index (UI). Growing year is preceded by location at left. Florence, SC (FLO), Starkville, MS (MSU), and Stoneville, MS (STV)

Genotyping by whole genome sequencing

We generated 4.4 TB of sequences from 550 RILs, resulting in 3× coverage, each. We sequenced the eleven parents at 20× coverage. We selected 473,517 SNPs that did not exhibit significant segregation distortion in the MAGIC population and were distributed throughout the genome (Fig. 2). Of these, 7506 were non-synonymous mutations to annotated genes. We did not detect significant kinship among any of the lines (Fig. S1), which is congruent with our previous results based on the analysis of 1582 SSR markers (Fang et al. 2014) and of 6071 genotyping-by-sequencing (GBS)-SNP and 233 SSR markers (Islam et al. 2016).

Fig. 2

Distribution of 473,517 SNPs in the eleven parental G. hirsutum varieties. Each chromosome is labeled and is composed of eleven vertical columns which represent the varieties in the same order as they are listed in Table S1. Each horizontal row represents a 1-Mb bin. The number of SNPs per Mb is color-coded according to the scale

GWAS at each location–year

Since we detected environmental contributions to phenotypic variance by ANOVA (Tables S5, S6), we first subjected data from each location–year to GWAS separately (Fig. S2). The congruence among the most highly significant loci identified at each location–year was very high for most traits, though noticeably lower for MIC and UHML (Fig. S2). Next, we performed GWAS using the sequence variants and the full set of normalized phenotypes. We identified 460 SNPs at the − log(p) > 7 significance threshold for at least one trait. The full list of 1546 SNPs with − log(p) > 4 is presented in Table S7. The publicly available tissue expression data for annotated genes at these loci are presented in Table S8.

Multi-trait fiber quality locus on chromosome (Chr.) A07 masks STR locus on Chr. D13

Three of the traits, STR, UI, and SFI, identified the same major effect locus on Chr. A07 near position 72-Mb (A07:72-Mb) in the whole data set (Fig. 3). This locus was identified before based on GBS-SNP marker analysis (Islam et al. 2016). At this location, we further identified non-synonymous mutations in two annotated genes (Gh_A07G1744 and Gh_A07G1769) that had highly significant p values based on the GWAS (Table 1).

Fig. 3

Major multi-trait locus at chromosome A07:72-Mb masks a significant STR locus at Chr. D13:52,852,792. Three traits, STR, UI, and SFI, reveal a major locus, labeled A07 at left, when the full dataset is subjected to GWAS. When only the 465 RILs that contain the inferior haplotype at A07:72-Mb were analyzed, a significant SNP in the intron of Gh_D13G1792 (D13) is evident in the STR data. Vertical axis is labeled with − log(p) values, and the significance threshold is indicated on each plot with a horizontal line

Table 1 Non-synonymous (NonSyn) mutations in annotated genes with significant GWAS p values

To identify additional loci that may be masked in the analysis by this large effect locus, we excluded the 15% of the RILs that contained the superior minor haplotype and analyzed the 465 RILs with the inferior A07:72-Mb haplotypes. This subset revealed a single SNP located on Chr. D13 that passed the significance threshold (Fig. 3). This SNP results in the loss of a stop codon from an in-frame, short (63 bp), annotated intron in Gh_D13G1792 (Table 1). This stop codon is present in the TM-1 reference sequence and four parents (Coker 315, STV825, PSC355, and STV474), but is replaced with an arginine residue in the other seven MAGIC parents.

Significant and additive QTLs for ELO

Our GWAS analysis of ELO, a measure of how far a cotton fiber sample can be stretched before breaking, revealed three highly significant loci, at Chr. D01:58-Mb, D04:47-Mb, and D05:43-Mb (Fig. 4). We identified six non-synonymous SNPs in five genes at two of these loci (Table 1). One of these, Gh_D04G1519, is very highly expressed in fiber cells and ovules (Table S8). We did not observe non-synonymous variants of genes at D05:43-Mb, which is an especially broad peak, with highly significant SNPs extending 8-Mb in both directions (Fig. 4 and Table S7). We found that these QTLs had additive effects, and the 136 RILs with high-ELO haplotypes at all three loci had much greater ELO than the 19 RILs with three low-ELO haplotypes (Fig. 5 and Table S9). None of the eleven parents had all three low-ELO haplotypes, while five parents (M240, DP90, SG747, PSC355, and STV474) had all three high-ELO haplotypes.
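The additive pattern described above can be illustrated by grouping lines by the number of high-ELO haplotypes they carry and comparing group means. The sketch below is a simplification of the per-genotype comparison in Fig. 5, since it pools genotypes with the same count. The function name and the toy values are invented for illustration and are not data from this study.

```python
from collections import defaultdict
from statistics import mean


def mean_by_high_count(rils):
    """Average the trait over RILs grouped by how many high-ELO haplotypes they carry (0 to 3).

    rils: iterable of (haplotypes, elo), where haplotypes is a 3-tuple of booleans
    for the D01, D04, and D05 loci (True = high-ELO haplotype).
    """
    groups = defaultdict(list)
    for haplotypes, elo in rils:
        groups[sum(haplotypes)].append(elo)
    return {count: mean(values) for count, values in sorted(groups.items())}


# Made-up values for illustration only.
toy = [
    ((False, False, False), 5.0),
    ((True, False, False), 5.5),
    ((False, True, False), 5.7),
    ((True, True, False), 6.0),
    ((True, True, True), 6.6),
    ((True, True, True), 6.8),
]
print(mean_by_high_count(toy))
```

If the loci act additively, the group means should rise roughly in step with the haplotype count, as they do in this toy example.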

Fig. 4

GWAS of the full MAGIC population reveals significant loci that control ELO and UHML. The loci at D01:58-Mb, D04:47-Mb, D05:43-Mb, and D11:24-Mb are labeled with their respective chromosomes. Vertical axis is labeled with − log(p) values, and the significance threshold is indicated on each plot with a horizontal line

Fig. 5

Violin plot of ELO values for RILs based on genotypes at the three significant loci. Genotypes are presented along the horizontal axis, with the high-ELO haplotype indicated with a dark gray rectangle and the low-ELO haplotypes with a light gray rectangle. The number (N) of RILs in each group is indicated. See also Fig. 4. For pairwise t test p values see Table S9

Significant QTL for UHML on Chr. D11

We identified a single highly significant locus for UHML at Chr. D11:24-Mb, which harbored three candidate genes (Gh_D11G1928, Gh_D11G1929, and Gh_D11G1931) with non-synonymous SNPs that are associated with fiber length variation (Fig. 4 and Table 1). These genes are expressed at similar levels in most tissues, including developing fiber cells (Table S8). One parent, HS26, contributed the low-UHML alternative haplotype, and the other parents shared the same, reference-type haplotype.

Location-dependent QTLs for MIC

Our analysis of MIC using the full dataset did not reveal any highly significant QTLs by GWAS. However, when we divided the data by location, we identified one significant locus at Chr. D08:3-Mb in the Starkville, MS, data, but not in the data from the other two locations (Fig. 6 and Fig. S2). This locus harbors three genes (Gh_D08G0275, Gh_D08G0282, and Gh_D08G0305) with non-synonymous SNPs relative to the reference sequence (Table 1). We observed that this peak was prominent in the overall MIC analysis, though below the significance threshold. The most significant peak from each of the other two locations was also visible in the overall analysis: the Florence, SC, data revealed a locus at A13:68-Mb and the Stoneville, MS, data a locus at A08:63-Mb (Fig. 6 and Table S7). We found that these three loci exhibited an additive effect in the overall data and that the 23 RILs with low-MIC haplotypes at all three had significantly lower MIC than the 102 RILs with high-MIC haplotypes at each (Fig. S3 and Table S10). One parent, DP90, had all three low-MIC haplotypes, and two (HS26 and Pyramid) had all three high-MIC haplotypes.

Fig. 6

Micronaire (MIC) is highly influenced by growing location. Manhattan plots for each location and the full dataset (MIC_all) are shown, and the most prominent peaks at A08:63-Mb, A13:68-Mb, and D08:3-Mb are labeled with their chromosome. Vertical axis is labeled with − log(p) values, and the significance threshold is indicated on each plot with a horizontal line

Discussion

Independent identification of candidate causative mutations for STR and UHML

Perhaps the greatest advantage of genotyping by WGS for GWAS is the ability to observe variants of genes at the detected loci and to observe these variants directly in each of the lines. The large number of recombinations that accumulated during the eleven generations of MAGIC population development further ensured very high resolution in the GWAS mapping approach. We previously determined that 5000 markers would be sufficient for GWAS with this population, and here we report a WGS analysis with nearly 100× that many markers (Islam et al. 2016). Recently, the WGS for GWAS approach was used with a collection of 419 cotton varieties (Ma et al. 2018). Interestingly, although their study used primarily Chinese varieties and relied on natural rather than MAGIC populations, Ma et al. proposed candidate causative mutations that were also identified in our analysis, particularly at the STR locus at Chr. A07:72-Mb and the UHML locus at Chr. D11:24-Mb. At the Chr. A07:72-Mb locus, they proposed that the mutant allele of Gh_A07G1769 is responsible for the superior fiber STR, and indeed this was our most significant non-synonymous mutation at the locus. However, we also detected another gene with non-synonymous SNPs, Gh_A07G1744, very close to the locus and with a similarly significant p value (Fig. 3 and Table 1). Since expression of Gh_A07G1744 is low in all tissues, though, it may be a pseudogene (Table S8). Likewise, we identified the non-synonymous SNP at Chr. D11:24-Mb in Gh_D11G1929 that Ma et al. propose as a fiber length gene, GhFL2; however, we also identified non-synonymous SNPs in Gh_D11G1928 and Gh_D11G1931, each less than 40 kb away from GhFL2 and expressed at similar levels (Fig. 4, Tables 1 and S8). The same group, in an earlier report based on the Cotton 63K Illumina Infinium SNP array, proposed another candidate gene, which we can here independently corroborate and perhaps better explain (Hulse-Kemp et al. 2015; Sun et al. 2017). At the Chr. D13:53-Mb STR_465 locus, which we identified by examining only the 465 RILs that lack the high-STR A07:72-Mb haplotype, we found a SNP in an intron of Gh_D13G1792 (Fig. 3 and Table 1). Sun et al. (2017) demonstrated that expression of this gene is very low in two low-quality cultivars and high in two normal-quality cultivars. Our sequence analysis suggests that the 63-bp annotated intron is usually retained in full-length transcripts from cotton lines that lack the stop codon and that the intron is likely only annotated as such due to the presence of the stop codon in the TM-1 cultivar that was used to establish the reference genome. A potential epistatic interaction between Gh_D13G1792 and genes at the Chr. A07:72-Mb locus merits further study to explore the QTL masking we observed (Fig. 3).

Location-dependent QTLs for MIC

The molecular underpinnings of MIC are particularly elusive due to its complicated relationships with other fiber properties and developmental factors. MIC is a measure of the resistance to airflow of a sample of cotton fibers of a known weight that has been compressed to a known volume (Wakelyn and Chaudhry 2010). MIC is a complex trait: the fineness and maturity of fibers and the properties of the cell wall can all affect the MIC value (Paudel et al. 2013). Growers have long reported a significant environmental influence on MIC values, with fibers harvested relatively early in the growing season exhibiting lower MIC values (Bradow and Davidonis 2000; Verhalen et al. 1975). It is interesting to find a difference in identified QTLs between locations, particularly between Starkville, MS, and Stoneville, MS, which are at similar latitudes (33.4°N) and only 200 km apart. Florence, SC, is only slightly farther north at 34.2°N, but is 840 km from Starkville, MS. Further research into the contributions of soil type and management practices to MIC values may be warranted. Perhaps the candidate genes we present at Chr. D08:3-Mb will be useful for the development of niche cultivars that offer significant value to growers under specific environmental conditions or management practices (Fig. 6 and Table 1). Moreover, because MIC is a complex trait that reflects both maturity and fineness, identification of a major and stable MIC QTL may be difficult. Future research may need to measure fiber maturity and fineness separately, using specialized instruments such as CottonScope® (Rodgers et al. 2011) or cross-sectional analysis (Hequet et al. 2006), and then identify QTLs for fineness and/or maturity.

Transgressive segregation in MAGIC populations

Linkage drag is a major impediment to the efficient improvement in crops and livestock by traditional breeding, since beneficial alleles of genes may reside in chromosomal locations that are physically close to deleterious alleles of other genes. Our MAGIC population was originally developed as a breeding tool, to facilitate novel combinations of alleles from a number of diverse high-quality cotton lines (Jenkins et al. 2008). As we reported earlier, the five cycles of random mating and six generations of self-pollination via single seed descent efficiently shuffled the genomes of the eleven parents into a MAGIC population with no discernible structure or kinship and very low LD (Fig. S1) (Islam et al. 2016). This created opportunities for beneficial alleles that originate in different parents to exhibit novel additive or epistatic effects and to break the linkage between nearby genes. That this goal of transgressive segregation was achieved is readily apparent from the broader distribution of phenotypes observed in the RILs than in the parent lines, the masking of Gh_D13G1792 and the additive effects of our ELO and MIC loci (Tables S3, S9, S10, Figs. 3, 5, and S3). Gene pyramiding is a daunting task for breeders, since several small-effect QTLs are laborious to identify and require multiple introgressions to be combined (Servin et al. 2004). Among our RILs are lines where the three ELO and three MIC loci have already been combined with the STR and UHML loci and are thus valuable breeding and research materials. The concordance of some of our candidate genes with independent studies is encouraging for the utility of our full list of candidates in breeding and biotechnological applications.

Author contribution statement

D. D. F. conceived and coordinated the project. G. N. T. analyzed phenotypic and sequencing data and wrote the manuscript. J. N. J. and J. C. M. designed and developed the population. J. N. J., J. C. M., L. Z. and B. T. C. conducted field experiments. C. D. D. conducted fiber property measurements. M. S. I. conducted ANOVA analysis. M. S. I., P. L. and D. D. F. isolated DNA. D. C. J., B. D. C. and D. D. F. provided material support and edited the manuscript. All authors read, edited, and approved the final manuscript.