Introduction

Genetic mapping of a mutant phenotype generally progresses through three steps when a bi-parental mapping population is used. First, polymorphic genetic markers are identified between parents or between the DNA pools of mutant and wild-type individuals. Second, genome-wide analysis of linkage in a mapping population establishes a chromosome or chromosomal region for the mutant locus. Finally, fine mapping with a larger segregating population and additional genetic markers narrows down the interval until candidate genes are identified. Since the advent of next-generation sequencing technologies, many researchers have attempted to streamline this process; however, most of these efforts have been in model species. The existence of a quality reference genome enabled mapping-by-sequencing approaches in Arabidopsis thaliana, Saccharomyces cerevisiae and Drosophila melanogaster (Birkeland et al. 2010; Blumenstiel et al. 2009; Schneeberger et al. 2009; Zuryn et al. 2010). More recently, the use of related genomes with local synteny has been shown to be useful for identification of a diverse region encompassing clusters of mutations that distinguish mutant from wild-type organisms and contain the mutant locus (Galvão et al. 2012; Wurtzel et al. 2010).

Upland cotton (Gossypium hirsutum L.) is a tetraploid crop plant that is the foundation of global natural textile production (Paterson et al. 2012). Allotetraploid cotton descends from an inter-specific hybridization event that occurred 1–2 million years ago between an A-genome diploid native to Africa and a Mexican D-genome diploid (Wendel and Cronn 2003). Recently, the genome sequence of a D5 genome diploid species, G. raimondii Ulbr, was released (Paterson et al. 2012). Importantly, a bioinformatics tool, called PolyCat, enables the assignment of sequence reads from G. hirsutum to A and D subgenomes, after first aligning all reads to the G. raimondii genome (Page et al. 2013). Used together, PolyCat and the G. raimondii genome provide a workable reference for the chromosomes of the D-subgenome (chromosomes 14–26) of G. hirsutum and the gene content of the A subgenome (chromosomes 1–13). In addition to causing difficulties with sequence alignment, the presence of two similar subgenomes within G. hirsutum complicates the design of genetic markers such as PCR-based simple sequence repeat (SSR) and single nucleotide polymorphism (SNP) markers, which may amplify products from both subgenomes, making it impossible in some cases to determine marker zygosity.

Our objective was to identify candidate genes responsible for the short-fiber mutant Ligon-lintless 2 (Li 2) phenotype by locating the Li 2 genetic locus on a physical map and quantifying expression of genes in this locus. The Li 2 mutation affects cotton fiber elongation and its identification will advance the understanding of plant cell elongation and enable the development of cotton varieties with higher fiber quality. Previous work has mapped the Li 2 locus to G. hirsutum chromosome 18 by phenotype association with aneuploid stocks and linkage analysis using restriction fragment length polymorphism and SSR markers (Hinchliffe et al. 2011; Kohel et al. 2002; Rong et al. 2005). In this study, we used RNA and DNA deep-sequencing to identify a region of dense polymorphism between Li 2 and its wild-type near-isogenic line (NIL), G. hirsutum cv. DP5690. Filtering candidate SNPs for non-synonymous changes to annotated proteins enabled us to accurately identify the introgressed region between NILs using G. raimondii as a reference sequence. We were therefore able to streamline, in a non-model tetraploid crop species, the first two steps of mutant characterization, by direct RNA sequencing of NILs without a mapping population. We sequenced bulked F2 segregants to confirm the region and to identify more candidate SNP markers. We designed novel subgenome-specific SNP markers, based on our sequencing data, which included both allele-specific and subgenome-specific primers for fine mapping in a population of 1,733 F2 plants. The resulting genetic map defined an interval that contains only a single gene, an aquaporin. This gene, γTIP, does not contain any non-synonymous mutations, but is significantly under-expressed in elongating Li 2 fiber cells.

Materials and methods

Plant materials

The development of the two NILs of Upland cotton used in this study has been described previously (Hinchliffe et al. 2011). Briefly, the Li 2 gene was introgressed from a G. hirsutum cv. TM-1 background into G. hirsutum cv. DP5690 via five generations of backcrossing and single seed descent using DP5690 as the recurrent parent (Hinchliffe et al. 2011). Homozygous plants with the dominant (Li 2/Li 2) or recessive (li 2 /li 2) phenotype were grown in the field in New Orleans, LA in 2013 for mRNA isolation. Parental NILs, Li 2 mutant and DP5690, were crossed and the resulting F1 were self-pollinated to produce F2 seeds. The F2 population was grown in Stoneville, MS as described below.

RNA isolation, RNAseq and RT-qPCR

Cotton bolls were harvested at the following eight time-points during fiber development: 0, 3, 5, 8, 12, 16, 20, and 24 days post-anthesis (DPA). Cotton fibers were separated from ovules using a glass bead shearing technique (Taliercio and Boykin 2007). Total RNA was isolated from detached fibers as described previously (Naoumkina et al. 2014). RNA samples from three biological replicates of 8-DPA fiber were subjected to paired-end Illumina mRNA sequencing (RNAseq). Library preparation and sequencing were conducted by Data2Bio LLC (Ames, Iowa) following standard procedures. The libraries were sequenced using 101 cycles of chemistry and imaging, resulting in paired-end sequencing reads with length of 2 × 101 bp that flank 150-bp inserts. RNA from each of the above-mentioned time-points was used for reverse transcription quantitative polymerase chain reaction (RT-qPCR) analysis using primers listed in Table S1. A detailed description of reverse transcription, qPCR and expression analysis was previously reported (Naoumkina et al. 2014).

Super bulked segregant analysis sequencing (sBSAseq)

F2 plants from a population of 536 individuals were selected for sequencing by a bulked segregant approach (Michelmore et al. 1991; Takagi et al. 2013). Two DNA pools were constructed: a pool of DNA from 100 short-fiber (Li 2/Li 2 and Li 2/li 2) plants and a pool of DNA from 100 wild-type (li 2/li 2) plants. DNA was Illumina sequenced with paired 101-bp reads flanking 150-bp inserts by Data2Bio LLC (Ames, Iowa).
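The expected composition of the two pools at the Li 2 locus can be worked out from Mendelian ratios. The sketch below assumes an ideal 1:2:1 F2 segregation and equal DNA contribution from every plant; the function name and the 1:2 weighting of the short-fiber pool are illustrative assumptions, not measurements from this study.

```python
from fractions import Fraction

def pool_mutant_allele_freq(genotype_weights):
    """Expected mutant-allele frequency in a DNA pool.

    genotype_weights maps the number of mutant alleles per plant
    (0, 1 or 2) to the relative number of such plants in the pool.
    """
    total_alleles = 2 * sum(genotype_weights.values())
    mutant_alleles = sum(n * w for n, w in genotype_weights.items())
    return Fraction(mutant_alleles, total_alleles)

# Short-fiber pool: Li2/Li2 and Li2/li2 plants, expected in a 1:2 ratio
# among F2 plants showing the dominant phenotype (assumption: ideal ratios).
short_fiber_pool = pool_mutant_allele_freq({2: 1, 1: 2})

# Wild-type pool: li2/li2 plants only.
wild_type_pool = pool_mutant_allele_freq({0: 1})
```

Under these assumptions, a SNP completely linked to Li 2 is expected at a mutant-allele frequency of about two-thirds in the short-fiber pool and to be absent from the wild-type pool.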

Identification of diverse genomic regions

The general outline of data processing is shown in Fig. 1. Sequence reads were aligned to the JGI G. raimondii genome sequence (v2.1) with the GSNAP software program (Paterson et al. 2012; Wu and Nacu 2010). These reads were assigned to the A or D subgenome using the PolyCat software program (Page et al. 2013). The binary alignment/map (BAM) files from different biological replicates of RNAseq data, but matching subgenome assignment, were combined using the SAMtools merge command (Li et al. 2009). InterSNP, part of the BamBam package that includes PolyCat, was used to identify candidate SNPs between Li 2 and its wild-type DP5690, for each subgenome separately and for RNAseq and sBSAseq data separately (http://sourceforge.net/projects/bambam/). InterSNP was run with default parameters, but we filtered the output file for SNPs with only one non-ambiguous variant per sample. These SNPs were then compared between Li 2 and DP5690 to identify non-synonymous SNPs in annotated exons. Histograms were generated by counting the instances of non-synonymous SNPs in each 1-Mb or 100-kb interval.
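The final binning step can be illustrated with a minimal Python sketch. It assumes the filtered non-synonymous SNPs are available as (chromosome, position) pairs; the function name, the input layout and the example positions are hypothetical and are not part of the InterSNP output format or the data of this study.

```python
from collections import Counter

def snp_histogram(snps, bin_size=1_000_000):
    """Count SNPs per fixed-width interval on each chromosome.

    snps: iterable of (chromosome, 1-based position) tuples.
    Returns a dict mapping (chromosome, bin_index) to a count, where
    bin_index 0 covers positions 1..bin_size.
    """
    counts = Counter()
    for chrom, pos in snps:
        counts[(chrom, (pos - 1) // bin_size)] += 1
    return dict(counts)

# Hypothetical positions, for illustration only (not data from this study).
example = [
    ("Chr13", 57_200_000),
    ("Chr13", 57_900_001),
    ("Chr13", 55_700_000),
    ("Chr01", 10),
]
hist_1mb = snp_histogram(example)                      # 1-Mb intervals
hist_100kb = snp_histogram(example, bin_size=100_000)  # 100-kb intervals
```

Running the same function once per subgenome and per data type (RNAseq or sBSAseq) would give counts of the kind plotted as histograms.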

Fig. 1

Flowchart of data processing to identify the introgressed region by RNAseq and sBSAseq and to identify differentially expressed genes by RNAseq. Sequence reads from wild type or Li 2 mutant Gossypium hirsutum samples were aligned to the related G. raimondii genome with GSNAP, assigned to subgenomes with PolyCat, and compared for SNPs with InterSNP. SNPs were analyzed to find non-synonymous changes in annotated proteins. The RNAseq alignments were also used to compute differential gene expression with EdgeR

Differential gene expression

RNAseq expression analysis was conducted following the PolyCat pipeline (Page et al. 2013), with only two minor adjustments: (1) we only counted exonic reads; (2) we used the ratio of A-assigned to D-assigned reads to proportionally divide the total number of mapped reads for each gene. Differential gene expression was calculated by the negative binomial method of the EdgeR software using the tagwise estimation of dispersion (Robinson et al. 2010).
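Adjustment (2) amounts to a simple proportional split, sketched below in Python. The function and argument names are hypothetical, and the even split used when a gene has no subgenome-assigned reads is an assumption of this sketch rather than a documented behavior of the PolyCat pipeline.

```python
def split_by_subgenome(total_reads, a_assigned, d_assigned):
    """Divide a gene's total mapped reads between the A and D homeologs.

    total_reads: all exonic reads mapped to the gene, including reads
                 that could not be assigned to a subgenome.
    a_assigned, d_assigned: reads assigned to the A or D subgenome.
    Returns (estimated A reads, estimated D reads).
    """
    assigned = a_assigned + d_assigned
    if assigned == 0:
        # Assumption of this sketch: with no diagnostic reads, split evenly.
        return total_reads / 2, total_reads / 2
    a_estimate = total_reads * a_assigned / assigned
    d_estimate = total_reads * d_assigned / assigned
    return a_estimate, d_estimate

# Hypothetical gene: 100 mapped reads, 30 assigned to A and 10 to D.
a_reads, d_reads = split_by_subgenome(100, 30, 10)
```

In this hypothetical case the 60 unassigned reads are shared 3:1, following the ratio of the assigned reads.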

Subgenome-specific SNP primer design

Manual inspection of read alignments in RNAseq and sBSAseq data was used to identify true SNPs and nearby homeoSNPs. Primers were designed to end on the SNP nucleotide and to incorporate an additional mismatch, usually at the third base from the 3′ end, which has been shown to increase stringency (Drenkard et al. 2000). Forward primers were designed to be specific to the D-subgenome variant of the homeoSNP, while reverse primers were synthesized in two versions, for the mutant and wild-type variants of the true SNPs.
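The primer layout described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names and the default primer length are hypothetical, and the base substitution used for the deliberate mismatch is a placeholder, not the substitution rules of Drenkard et al. (2000). Alleles are given on the strand of the supplied template.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Placeholder substitution for the deliberate mismatch (assumption of this
# sketch; it does not reproduce the rules of Drenkard et al. 2000).
MISMATCH = {"A": "C", "C": "A", "G": "T", "T": "G"}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def three_prime_snp_primer(template, snp_index, allele, length=20, reverse=False):
    """Design a primer whose 3' base sits on a SNP.

    template:  sequence of the region, 5' to 3'.
    snp_index: 0-based index of the SNP in template.
    allele:    SNP base to match, given on the template strand.
    reverse:   if True, design the primer on the opposite strand.
    The third base from the 3' end is replaced to add a mismatch.
    """
    if length < 3:
        raise ValueError("primer length must be at least 3")
    if reverse:
        # Re-express the problem on the opposite strand.
        template = reverse_complement(template)
        snp_index = len(template) - 1 - snp_index
        allele = allele.translate(COMPLEMENT)
    start = snp_index - length + 1
    if start < 0:
        raise ValueError("not enough flanking sequence for this primer length")
    primer = list(template[start:snp_index] + allele)
    primer[-3] = MISMATCH[primer[-3]]
    return "".join(primer)

# Hypothetical template and SNP, for illustration only.
template = "ACGTACGTACGTACGTACGTACGT"
forward = three_prime_snp_primer(template, 19, "C", length=10)
```

A subgenome-specific forward primer would be obtained by passing the homeoSNP position with the D-subgenome base, and the two allele-specific reverse primers by calling the function with reverse=True once for the mutant base and once for the wild-type base of the true SNP.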

Mapping population

The 1,733 F2 plants used in this study were grown over 2 years in Stoneville, MS. During 2012, 516 F2 plants were grown, and 1,217 plants were grown in 2013. Standard conventional field practices were applied during the plant growing season. Young leaves were collected for DNA isolation as described previously (Fang et al. 2010). SSR markers were scored as usual (Fang et al. 2010). The new SNP markers were first tested by running allele-specific qPCR reactions on parental NILs and F1 plants. Unlabeled oligos, template DNA and a SYBR PCR mix were run on a C1000 thermal cycler with CFX96 Real Time System (Bio-Rad, Hercules, CA), which recorded fluorescent signal after each cycle. Samples were scored as positive or negative for each reaction by a Ct value threshold determined for each primer pair. After both allele-specific reactions were run for each individual F2 plant, a SNP marker was scored as homozygous or heterozygous. The marker genotypes were analyzed by the JoinMap software using default parameters and LOD score of 10 to construct the genetic map (Van Ooijen 2006).
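The scoring logic for a single plant and marker can be sketched in Python as shown below. The function name, the use of None for a reaction with no detectable amplification, the "no call" category and the threshold values are assumptions made for illustration; the actual thresholds were determined separately for each primer pair.

```python
def call_genotype(ct_mutant, ct_wildtype, thr_mutant, thr_wildtype):
    """Score one SNP marker from two allele-specific qPCR reactions.

    ct_mutant, ct_wildtype: Ct values of the mutant- and wild-type-specific
        reactions, or None if no amplification was detected.
    thr_mutant, thr_wildtype: Ct thresholds for each primer pair; a
        reaction is positive when its Ct is at or below the threshold.
    """
    mutant_positive = ct_mutant is not None and ct_mutant <= thr_mutant
    wildtype_positive = ct_wildtype is not None and ct_wildtype <= thr_wildtype
    if mutant_positive and wildtype_positive:
        return "heterozygous"
    if mutant_positive:
        return "homozygous mutant"
    if wildtype_positive:
        return "homozygous wild type"
    return "no call"  # neither reaction amplified; repeat the assay

# Hypothetical Ct values and thresholds, for illustration only.
calls = [
    call_genotype(24.1, 25.0, 30.0, 30.0),
    call_genotype(23.8, None, 30.0, 30.0),
    call_genotype(36.5, 24.9, 30.0, 30.0),
]
```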

Results

Identification of the introgressed region containing the Li 2 locus based on RNAseq

We isolated and sequenced mRNA from 8-DPA fibers from field grown NILs that were either homozygous for the Li 2 short-fiber gene or the wild-type allele from the recurrent parent, DP5690. We analyzed the data according to the pipeline shown in Fig. 1. We first aligned reads to the reference G. raimondii genome using GSNAP, and then assigned the reads to A and D subgenomes using PolyCat (Page et al. 2013; Wu and Nacu 2010). We compared the Li 2 and wild-type alignments to identify SNPs in each subgenome using InterSNP. Next, we identified those SNPs that would alter the amino acid sequence of annotated proteins. We identified 391 non-synonymous SNPs between the Li 2 and DP5690 NILs in the A-assigned reads and 120 in the D-assigned reads. Therefore, the average non-synonymous SNP density across 749 Mb of the 13 analyzed chromosome assemblies of the D5 genome for the A-assigned reads was 0.52/Mb and was 0.16/Mb for D-assigned reads. We observed a striking peak at 57 Mb of chromosome (Chr.) 13 in the histogram of D-assigned reads (Fig. 2). There are 24 putative non-synonymous SNPs in 17 genes located within the 57th Mb region, constituting the region of highest diversity between the NILs.
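The densities quoted above follow directly from the SNP counts and the assembly length, as the short Python calculation below shows. The enrichment figure is an illustrative comparison that assumes all 24 SNPs in the peak interval come from D-assigned reads; the function name is hypothetical.

```python
def snp_density_per_mb(snp_count, assembly_mb):
    """Average number of SNPs per Mb of reference assembly."""
    return snp_count / assembly_mb

ASSEMBLY_MB = 749  # total length of the 13 chromosome assemblies

density_a = snp_density_per_mb(391, ASSEMBLY_MB)  # A-assigned reads
density_d = snp_density_per_mb(120, ASSEMBLY_MB)  # D-assigned reads

# Fold enrichment of the peak 1-Mb interval (24 SNPs) over the
# D-subgenome average (assumption: all 24 SNPs are D-assigned).
peak_enrichment = 24 / density_d

print(f"A: {density_a:.2f}/Mb, D: {density_d:.2f}/Mb, "
      f"peak enrichment: {peak_enrichment:.0f}-fold")
```

Under that assumption, the peak interval carries roughly 150 times the genome-wide average density of non-synonymous SNPs in D-assigned reads.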

Fig. 2

Non-synonymous SNPs per Mb in RNAseq data. The 13 reference G. raimondii chromosomes are presented along the x-axis in each panel, with either A or D subgenome-assigned reads

Super bulked segregant analysis sequencing (sBSAseq)

To confirm the region identified by RNAseq and to identify additional SNPs, including intergenic SNPs and those in genes that lacked expression in the 8-DPA fiber tissue of either or both parental NILs, we sequenced pooled DNA of 100 short fiber and 100 wild-type F2 plants. Again, after aligning, assigning and filtering for non-synonymous SNPs, we observed a striking peak near the telomere of reference Chr. 13 (Fig. 3). This peak was centered at the 55.7 Mb region, which is 1.5 Mb away from the region identified by RNAseq at 57.2 Mb. The 3 Mb region between 55 Mb and the end of Chr. 13 contains 3,494 putative SNPs of which 82 are non-synonymous mutations in 48 genes.

Fig. 3

Non-synonymous SNPs per 100 kb in sBSAseq data. Chromosome 13 of G. raimondii, which corresponds to G. hirsutum chromosome 18, is shown, and the 3-Mb peak is expanded for detail. Solid blue indicates non-synonymous SNPs in sBSAseq data; hatched gray indicates RNAseq data in the inset. Li 2 indicates the locus according to the genetic map (see Fig. 5)

Subgenome-specific SNP markers

We designed primer sets to interrogate SNPs that were found in the RNAseq and sBSAseq data. We found that the wild-type allele primer pair would amplify from both wild type (li 2/li 2) and mutant (Li 2/Li 2) parental NIL templates, even though the mutant-specific primer pair would only amplify from a mutant parental NIL template. We deduced that the homeologous chromosome in our tetraploid G. hirsutum DNA was also acting as a template for our primers. Therefore, we designed subgenome-specific forward primers to pair with allele-specific reverse primers (Fig. 4; Table S2). We found homeoSNPs less than 600 bp from our true SNPs by manual inspection of alignment files and designed primers that would end on the homeoSNP. This resulted in a significant improvement in our conversion of putative SNPs to viable PCR markers.

Fig. 4

Subgenome-specific SNP marker design. Subgenome-specific primer is black, mutant allele-specific primer is red, and wild-type allele-specific primer is blue. Only templates with two annealed primers will successfully amplify

Segregation of markers in 1,733 F2 progeny

Since the RNAseq and sBSAseq implicated a 3 Mb introgressed region, we generated a large mapping population for fine genetic mapping of the Li 2 locus. We tested our SNP markers for linkage to Li 2 and used the alignment of reads to the reference chromosome as our guide to identify and develop markers that flanked Li 2 as closely as possible. We also tested two SSR markers that were previously identified as linked to Li 2 on G. hirsutum Chr. 18 by our group, NAU3391 and DPL0922, on the large segregating population (Hinchliffe et al. 2011). Ultimately, we constructed a genetic map (Fig. 5) which confines the Li 2 locus to the end of G. hirsutum Chr. 18, which is orthologous to G. raimondii Chr. 13 (Blenda et al. 2012; Wang et al. 2013). The Li 2 locus is flanked by two SNP markers, CFB5851 and CFB5852, which are 0.084 and 0.216 cM away, respectively. These two SNPs are located less than 6 kb apart on the reference sequence, flanking a single gene, Gorai.013G265400, a gamma tonoplast-intrinsic protein (γTIP) or aquaporin. However, there are no differences in the coding sequence of this gene between the Li 2 and DP5690 NILs.

Fig. 5

Li 2 genetic and physical map based on 1,733 F2 progeny. SNP and SSR markers associated with the Li 2 gene are shown on G. hirsutum chromosome 18 and on G. raimondii chromosome 13. Genetic map locations are shown in centiMorgans (cM) and physical locations are shown in base pairs (bp)

Differential expression of genes near the Li 2 locus based on RNAseq

To investigate the expression of genes near the Li 2 locus, we subjected the subgenome-assigned RNAseq reads to statistical analysis with EdgeR (Robinson et al. 2010). We analyzed the region bound by our second closest flanking markers, CFB5850 and CFB5853. We found nine genes on the interval that were significantly different (adjusted p value <0.05) between Li 2 and wild-type 8-DPA fiber cells (Gorai.013G265900, Gorai.013G265400, Gorai.013G265200, Gorai.013G265100, Gorai.013G264600, Gorai.013G264300, Gorai.013G264100, Gorai.013G263800, Gorai.013G263400). Of these, only four (Gorai.013G265900, Gorai.013G265400, Gorai.013G265200, Gorai.013G263800) are at least twofold different (Table 1). γTIP is nearly fourfold down regulated in Li 2 fibers and is also the most highly expressed gene on the interval, even more highly expressed than the ribosomal protein, Gorai.013G265000. Just outside the CFB5851/CFB5852 interval is a C2H2-type zinc finger family protein (ZnF), Gorai.013G265200, which is more than 50-fold over-expressed in Li 2 fiber, although its average expression is 175-fold less than γTIP at 8-DPA.

Table 1 RNAseq differential expression of annotated genes near the Li 2 locus in 8-DPA fiber cells. Position indicates base on Chr. 13 of G. raimondii

RT-qPCR of the selected genes during fiber development

To confirm the results of the RNAseq expression analysis and to investigate the activation of genes near the Li 2 locus throughout the development of cotton fiber cells, we chose six genes from the CFB5850/CFB5853 interval for RT-qPCR across eight time-points (Fig. 6). In addition to γTIP and ZnF, we chose the two closest transcription factors, Gorai.013G265700 (NOT), a NOT family transcription factor, and Gorai.013G265300 (MYB), an MYB family transcription factor. We also chose the most highly expressed of the significantly misregulated genes on each side of the flanking markers, Gorai.013G265900 (PK), a protein kinase, and Gorai.013G264100 (ACT), an actin (Fig. 6). Three of the genes, PK, NOT and MYB, showed significant differences between wild type and Li 2 expression during the late stage of fiber development, at 20-DPA, with PK and MYB remaining over-expressed in Li 2 at 24-DPA. ACT expression is reduced in Li 2 fibers during parts of the early and late elongation phases, at 5, 8, 16, 20 and 24-DPA, but is indistinguishable from wild type at 0, 3, and 12-DPA. γTIP is not expressed at 0-DPA in either sample and is expressed equally in Li 2 and wild type at the beginning of elongation, 3-DPA. Thereafter, expression of the wild-type γTIP gene increases dramatically during the peak of elongation, while expression in Li 2 is essentially flat and consistently below wild type. ZnF is expressed at similar levels at 0-DPA in both wild type and Li 2 fibers, before being essentially silenced in both at 3-DPA. At 5-DPA, ZnF rebounds to about one-tenth its expression at 0-DPA in Li 2 fibers only, while wild-type expression remains null. Thereafter, ZnF expression drops off slowly in Li 2 and only becomes detectable in wild-type fibers at 16-DPA.

Fig. 6

RT-qPCR expression of select genes near the Li 2 locus during fiber development of Li 2 and wild-type DP5690 G. hirsutum fiber cells. Error bars indicate standard deviation from three biological replicates. Along the x-axis, dpa indicates the number of days post-anthesis, a measure of developmental time for cotton fiber cells

Discussion

Mapping-by-sequencing mRNA and bulked DNA of NILs

Sequencing of bulked segregants has successfully identified causative mutations in other species (Schneeberger et al. 2009; Wenger et al. 2010). Since tetraploid G. hirsutum does not have a published genome sequence, we relied on the related diploid G. raimondii reference sequence and a bioinformatics tool to assign reads to A and D subgenomes (Page et al. 2013; Paterson et al. 2012). Large-scale synteny between G. raimondii and the D subgenome of G. hirsutum has been shown by the correspondence of genetic and physical maps, and our work supports the utility of local, megabase-scale, synteny for mapping-by-sequencing (Blenda et al. 2012; Wang et al. 2013).

The direct sequencing of RNA from fiber tissues of the Li 2 mutant and its wild-type NIL enabled us to map the Li 2 gene to a 1-Mb chromosomal region without the creation of a mapping population. We found putative SNPs between Li 2 and its wild-type NIL throughout the genome, but after filtering for non-synonymous SNPs we found a striking peak on G. raimondii Chr. 13 which corresponds to G. hirsutum Chr. 18 (Fig. 2) (Blenda et al. 2012; Wang et al. 2013). Confining our search for SNPs to coding sequences probably diminished the significance of the divergence between G. hirsutum and G. raimondii. After following the same SNP filtering pipeline we used for mapping by RNAseq (Fig. 1), we pooled and sequenced DNA from 100 wild-type and 100 short-fiber F2 plants, which confirmed the signature of a diverse introgressed region. However, this combined approach did not itself produce a clear candidate gene, mainly because the putative SNPs in the introgressed region were so numerous, with 82 non-synonymous SNPs among 3,494 SNPs in the 3-Mb region (Fig. 3).

Subgenome-specific SNP marker development

Our deep-sequencing data provided us with many candidate SNPs in the introgressed region, but due to amplification from the homeologous chromosome, we could not score the zygosity of many of our early SNP markers. Therefore, we developed subgenome-specific SNP markers, where a forward primer that terminates on a homeoSNP was designed to ensure that only the D-subgenome was interrogated by the allele-specific reverse primers (Fig. 4). Prior techniques for subgenome-specific markers relied either on two-step nested amplification (Blake et al. 2004; Chai et al. 2010; Long et al. 2011) or incorporated the homeoSNP into the same primer as the true SNP (Byers et al. 2012). Since we only made a small number of SNP markers for the purpose of our fine mapping, we cannot compare the relative efficiencies of the different approaches. However, we believe that placing the homeoSNP at the 5′ position of the subgenome-specific primer should increase specificity and that our one-step technique will be useful to future work in cotton and other polyploid species.

Traditional fine genetic mapping

To narrow down the list of candidate genes from a 3-Mb to a single-gene interval, we turned to traditional fine mapping. We tested our subgenome-specific SNP primers in a large segregating population of 1,733 F2 plants. As our flanking markers closed in on the Li 2 gene, we returned to our sequencing data to identify more putative SNPs and developed more markers. Ultimately this led to a locus, 2.1 Mb away from the region identified by sBSAseq and 0.6 Mb away from the region identified by RNAseq, which was not predicted by either dataset to contain non-synonymous mutations (Fig. 3). The two closest flanking markers, CFB5851 and CFB5852, are less than 6 kb apart and flank a single gene, γTIP (Gorai.013G265400), an aquaporin. The next two closest flanking markers define an interval that includes 32 annotated genes in G. raimondii. RNAseq and RT-qPCR confirmed that several of the genes on this interval, including γTIP, show altered expression in Li 2 short-fiber cells during fiber development (Table 1; Fig. 6).

Li 2 candidate genes γTIP and ZnF

Taken together, our data strongly implicate the aquaporin, γTIP (Gorai.013G265400), in the genetic mechanism of the Li 2 short-fiber trait. The closest G. hirsutum ortholog of the reference gene Gorai.013G265400 is GhγTIP (EF470294), which was previously identified as a highly expressed fiber transcript (Liu et al. 2008). The involvement of aquaporins in cell elongation has already been suggested both in cotton and Arabidopsis (Karlsson et al. 2000; Li et al. 2013; Liu et al. 2008; Ruan et al. 2001; Smart et al. 1998). By controlling the flow of water in and out of the vacuole, tonoplast-intrinsic aquaporins allow turgor pressure within the vacuole to drive cell expansion (Li et al. 2013; Liu et al. 2008). G. hirsutum has at least 71 aquaporins, only some of which are active in fiber cells (Park et al. 2010). Those aquaporins are highly expressed during the elongation phase of cotton fiber cells to facilitate the rapid influx of water into the central vacuole (Li et al. 2013; Liu et al. 2008). The peak expression of γTIP that we observe at 8-DPA corresponds well with the opening of fiber plasmodesmata, the import of solutes capable of increasing turgor pressure, and the expression of other aquaporins implicated in cotton fiber development (Li et al. 2013; Liu et al. 2008; Park et al. 2010; Ruan et al. 2001). The Li 2 short-fiber mutant fails to up-regulate γTIP during the peak of elongation. At 5 and 12-DPA expression of γTIP is about twofold less in Li 2 than DP5690 fiber cells and at 8-DPA the difference is even greater: threefold by RT-qPCR or nearly fourfold by RNAseq (Fig. 6; Table 1). This failure to up-regulate γTIP may prevent the accumulation of solutes and water in the central vacuole of the mutant fiber and thereby impede the turgor-driven mechanism of cell elongation.

Very close to our mapped location is ZnF, which, although expressed at a much lower level than γTIP, is a transcription factor and has a greater difference in expression between Li 2 and wild-type fiber. Interestingly, the Arabidopsis homolog of ZnF, Zat12, is involved in stress responses including response to reactive oxygen species (ROS) (Davletova et al. 2005; Kiełbowicz-Matuk 2012). Our prior work that characterized global changes to Li 2 fiber transcript levels identified altered levels of genes and metabolites involved in ROS homeostasis and stress response (Hinchliffe et al. 2011; Naoumkina et al. 2013, 2014). ROS has been shown to be involved in cell elongation through loosening of cell walls in growing tissue and selective breeding for enhanced fiber growth seems to have influenced regulation of ROS related genes (Chaudhary et al. 2009; Liszkay et al. 2004). Furthermore, we recently found that Li 2 mutant fiber cells have altered gene expression in mitochondria, the site of much ROS production (Thyssen et al. 2014).

Our future work will attempt to modify the fiber phenotypes in cotton plants by modulating the activity of these candidate genes. We believe that both γTIP and ZnF are plausible Li 2 candidates based on the current evidence. However, absent changes to coding sequences in these or the other genes in the vicinity of the Li 2 locus, the causative mutation is likely to affect a control sequence, which could potentially exert its influence on a distant gene (Clark et al. 2006; Guenther et al. 2014).

Author contributions

GNT, DDF and MN conceived and designed the experiment. GNT analyzed the sequencing data, designed the SNP markers and wrote the paper. DDF analyzed the SSR data and oversaw the project. RBT developed the NILs and grew the F2 populations. CF and PL conducted SNP and SSR marker analysis of the F2 populations. MN performed the RT-qPCR experiments and analyzed the SNP marker data. All authors read and approved the manuscript.