Introduction

Cotton (Gossypium spp.) is the world’s most important natural textile fiber, warranting increased exploration of fiber-related traits through various molecular genetic approaches. Currently, two types of molecular markers are primarily used in molecular mapping of the cotton genome. One is genomic markers, which primarily target non-coding regions, such as RFLP (Reinisch et al. 1994), RAPD (Kohel et al. 2001), AFLP (Mei et al. 2004), STS (Rong et al. 2004), and SSR (Zhang et al. 2002; Frelichowski et al. 2006). The other is candidate gene markers, represented by EST–SSR (Chee et al. 2004; Park et al. 2005; Guo et al. 2007), cDNA probe-based STS or RFLP markers (Rong et al. 2004), and single nucleotide polymorphism (SNP) markers (An et al. 2007). Development of candidate gene markers has received much attention in recent years because of the possible association of functional genes with complex traits. However, the low polymorphism level of cDNA probe-based STS or RFLP markers has hampered candidate gene mapping (Rong et al. 2004). SNPs have recently become the marker of choice for candidate genes in many plant species and are reported to be the most abundant molecular markers (Cho et al. 1999; Ching et al. 2002; Zhang et al. 2003; Zhu et al. 2003). However, SNP development in cotton is impeded by its allotetraploid nature, high repetitive DNA content, and inadequate genome sequence information.

The candidate gene approach is widely accepted as a strategy for identification of loci influencing complex and economically important traits (Faris et al. 1999; Giroux et al. 2000; Pflieger et al. 2001; Beecher et al. 2002). Candidate gene markers derived from resistance genes or defense response genes were placed on regions containing major resistance QTL in wheat (Faris et al. 1999), pepper (Pflieger et al. 1999), and rice (Wang et al. 2001). The storage protein genes for puroindoline in wheat (Giroux et al. 2000) and hordoindolines in barley (Beecher et al. 2002) were both implicated in grain hardness and texture by QTL analysis. Markers developed from genes related to carbohydrate and nitrogen metabolism were found to be associated with sugar content and yield in sugar beet (Schneider et al. 2002). Wilson et al. (2004) detected significant association between candidate genes involved in kernel starch biosynthesis and traits for maize kernel composition and starch quality. In cotton, Rong et al. (2007) also found evidence of a general association between concentrations of candidate genes and cotton fiber-related QTL.

R2R3-MYB transcription factors, characterized by two imperfect repeats (R2 and R3) in the DNA-binding domain, are one of the largest regulatory gene families in plants (Riechmann et al. 2000). Some of them were shown to control trichome initiation, expansion, branching, and maturation in Arabidopsis (Oppenheimer et al. 1991; Glover et al. 1998; Szymanski et al. 2000; Schiefelbein 2003). Cotton fibers are elongated trichomes derived from ovule epidermis. Previous reports suggested a similarity in genetic control of MYB transcription factors in Arabidopsis trichomes and cotton fibers (Suo et al. 2003; Wang et al. 2004; Humphries et al. 2005; Perez-Rodriguez et al. 2005; Wu et al. 2006). Expression analysis demonstrated that six R2R3-MYB transcription factors were expressed in fiber cells but regulated differentially during fiber initiation and expansion (Loguercio et al. 1999; Cedroni et al. 2003). In addition, several other MYB genes have been indicated to play an important role in cotton fiber initiation (Suo et al. 2003; Hsu et al. 2005; Lee et al. 2006; Yang et al. 2006).

Here we report the sequence phylogenomic characterization of the six MYB genes in selected tetraploid and diploid cotton species, their chromosomal locations and molecular linkage map using candidate gene derived SNP markers. The chromosomal locations and genetic linkage mapping of SNP markers with framework SSR markers will improve the resolution of the cotton comparative map. SNP markers derived from MYB genes in this study will be useful as diagnostic markers for exploration of the roles of these candidate genes in complex fiber traits.

Materials and methods

Plant materials

HS46 and MARCABUCAG8US-1-88 (MAR), two G. hirsutum (AD1) lines of diverse agronomic and fiber properties, and three lines of other tetraploid species including G. barbadense L. (AD2, accession 3-79), G. tomentosum Nuttall ex Seemann (AD3), and G. mustelinum Miers ex Watt (AD4) were used for PCR amplification, cloning, and sequencing of the six MYB genes. Chromosomal assignment of SNP markers was accomplished using three different sets of hypoaneuploid F1 stocks developed from an interspecific cross between TM-1 (genetic standard for G. hirsutum, AD1) and one of the three species, 3-79, G. tomentosum or G. mustelinum, together with one set of euploid interspecific backcrossed chromosome substitution lines (CS-B, BC5S1) of 3-79 in TM-1. Hypoaneuploid F1 cytogenetic stocks between TM-1 and 3-79 consisted of 10 primary monosomic and 28 monotelodisomic lines, whereas hypoaneuploid F1 lines between TM-1 and G. tomentosum included 11 primary monosomic and 27 monotelodisomic lines (Liu et al. 2000; Saha et al. 2006b). The new hypoaneuploid F1 chromosome substitution stocks between TM-1 and G. mustelinum (unpublished information) were also used for deletion analysis. The euploid CS-B set comprises lines substituted for 12 different chromosomes and 8 chromosome arms from 3-79 in the TM-1 background (Stelly et al. 2005). Fresh leaves were collected from individual plants, frozen in liquid nitrogen, and then used for genomic DNA extraction with a Qiagen DNeasy plant maxi kit (Qiagen Inc., Valencia, CA, USA). A set of 186 recombinant inbred lines (RILs) generated from an interspecific cross between TM-1 and 3-79 was used as a mapping population for constructing a molecular linkage map of SNP markers specific to the MYB genes and the selected framework SSR markers in cotton (Park et al. 2005; Frelichowski et al. 2006).

PCR amplification, cloning, and sequencing

Gene-specific PCR primers of MYB1 (COT105 and COT106), MYB2 (Myb2F and COT108), MYB3 (Myb3F and COT110), MYB4 D-genome locus (COT111 and COT112), and MYB6 (Myb6F and COT116) were adopted from Loguercio et al. (1999). Gene-specific PCR primers of MYB4 A-genome locus (Myb4A_F and Myb4A_R) and MYB5 (Myb5_F and Myb5_R) were designed based on GenBank deposited sequences generated from the previous works by Loguercio et al. (1999) and Cedroni et al. (2003) (Table 1). Pfu polymerase (Stratagene, La Jolla, CA, USA) was used for PCR amplification following the protocol described elsewhere (An et al. 2007). The PCR products were separated on a 1% (w/v) agarose gel and purified using QIAEX II gel extraction kit (Qiagen Inc, Valencia, CA, USA). The purified products were ligated into TOPO TA cloning vector and transformed into TOP10 competent E. coli cells (Invitrogen, Carlsbad, CA, USA). Both strands of the recombinant plasmid were sequenced using an ABI 3730XL automated sequencer with ABI Prism BigDye Terminator Cycle Sequencing Kit v3.1 (Applied Biosystems, Foster City, CA, USA). In order to avoid possible complications from PCR recombination (Cronn et al. 2002) and to identify the duplicated copies in the genome, we picked multiple clones (12 clones) of each amplicon for sequencing and accepted a sequence only when it was identical in at least three clones.
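The clone-support rule described above (12 clones per amplicon, a sequence retained only if at least three clones agree) can be illustrated with a minimal Python sketch. The function name `supported_sequences` and the toy clone reads are hypothetical and are not taken from the study; the sketch also assumes the reads have already been vector-trimmed and cover the same region.

```python
from collections import Counter

def supported_sequences(clone_seqs, min_clones=3):
    """Return {sequence: clone count} for sequences found in >= min_clones clones.

    Assumption: reads are vector-trimmed and span the same amplicon, so that
    identical inserts yield identical strings.
    """
    counts = Counter(seq.upper() for seq in clone_seqs)
    return {seq: n for seq, n in counts.items() if n >= min_clones}

# Hypothetical reads from 12 clones of one amplicon: two well-supported
# copies plus three singletons (e.g. PCR or sequencing artifacts).
clones = ["ACGTAC"] * 5 + ["ACGTTC"] * 4 + ["ACGAAC", "TCGTAC", "ACGTAG"]
accepted = supported_sequences(clones)
print(accepted)
```

In this toy case the two recurring sequences would be kept as candidate homoeologous copies, and the three singleton reads would be discarded.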

Table 1 SNP markers derived from six MYB genes for genotyping in four allotetraploid cotton species

SNP characterization and phylogenetic analysis

Six MYB gene sequences from the five allotetraploid cotton lines together with GenBank deposited sequences from TM-1 and living models of two allotetraploid ancestral genomes: G. herbaceum L. (A-genome; accession A1-73) and G. raimondii Ulbrich (D-genome; “Galaus”) were used for SNP characterization. The GenBank sequences of Gossypioides kirkii (Masters) J.B. Hutchinson were used as an outgroup to the cotton genus (Malvaceae) in phylogenetic analyses (Cedroni et al. 2003). DNASTAR (DNASTAR Inc., Madison, WI, USA) and ClustalX (Thompson et al. 1997) were used for vector-trimming and sequence alignment. Before SNP characterization, differentiation between paralogous and homoeologous loci was performed by phylogenetic grouping and comparison of sequences from the two diploid species (An et al. 2007). Phylogenetic analyses were performed by the maximum parsimony (MP) method using MEGA 3.1 (Kumar et al. 2004). To determine the confidence levels for each tree, an MP bootstrap analysis with 100 replicates was conducted. DnaSP 4.0 software was used to identify SNPs by comparative analysis of aligned sequences from different genotypes at a putative locus (Rozas et al. 2003). Nucleotide diversity (π), haplotype number (H) and diversity (Hd), and rates of silent (Ksil) and non-synonymous (Ka) substitutions in pairwise comparisons were also calculated by DnaSP 4.0 software (Tajima 1983; Nei 1987; Rozas et al. 2003).
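As an illustration of the nucleotide diversity statistic, π can be sketched as the mean number of differences per site over all pairs of aligned sequences (Nei 1987). The Python below is a simplified sketch, not the DnaSP implementation: it assumes pre-aligned sequences of equal length and skips gapped sites pair by pair, which may differ from how DnaSP handles alignment gaps. The function names and the toy alignment are hypothetical.

```python
from itertools import combinations

def pairwise_diff_per_site(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites with a gap ('-') in either sequence are skipped (an assumption).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)

def nucleotide_diversity(seqs):
    """Average pairwise difference per site over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_diff_per_site(a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy alignment of three sequences, eight sites each.
toy = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
pi = nucleotide_diversity(toy)
print(round(pi, 4))
```

For two sequences the value reduces to the simple proportion of differing sites, which is the quantity a pairwise matrix such as Table 3 would report for each pair of lines.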

Chromosomal assignment and linkage mapping

In order to minimize the potential problems associated with homoeologous sequences in SNP genotyping, genome-specific (or locus-specific) PCR primers were designed according to sequence differences between two subgenomes in tetraploid cotton when applicable (Table 1). Interspecies SNP primers were designed based on a single nucleotide difference among sequences at a putative locus (each clade or group in the phylogram of individual MYB gene) between the genotypes of TM-1 and 3-79, G. tomentosum or G. mustelinum. The primer was designed to anneal just upstream or downstream of the SNP site as the forward or reverse primer, respectively, so that the polymorphism could be detected by single-base extension technology with an ABI Prism SNaPshot™ multiplex kit (Applied Biosystems, Foster City, CA, USA). All the primers used for genotyping are summarized in Table 1. The deletion analysis method frequently used for molecular marker chromosomal assignment in cotton (Liu et al. 2000; An et al. 2007) was employed to assign chromosomal locations for six MYB genes using the four sets of cytogenetic stocks mentioned in the plant materials. A total of 90 SSR markers, which are polymorphic between TM-1 and 3-79 and span the cotton genome, were selected based on the information available in the cotton microsatellite database (CMD, http://www.cottonmarker.org/; Blenda et al. 2006), and used as anchor markers for linkage mapping of sections of selected chromosomes with SNP markers. Chromosomal assignment of the constructed linkage groups was achieved by deletion analysis, comparison of allele sizes with the CMD panel, published integrated molecular maps (Lacape et al. 2005; Park et al. 2005; Frelichowski et al. 2006; Guo et al. 2007), and the assignment of cotton linkage maps to chromosomes (Wang et al. 2006b). The SSR markers used in this study were fluorescently labeled by Sigma Genosys (The Woodlands, TX, USA) or Applied Biosystems (Foster City, CA, USA).
PCR reactions and thermal cycle protocols for genotyping the RILs population were conducted according to the method of Gutierrez et al. (2002). One polymorphic SNP marker between TM-1 and 3-79 was selected, if available, from each gene for linkage mapping. The procedures of SNP marker genotyping described in An et al. (2007) were employed for cytogenetic stock and RIL population genotyping. An automated capillary electrophoresis system ABI3100 Genetic Analyzer with GeneMapper software 4.0 (Applied Biosystems, Foster City, CA, USA) was used to analyze both PCR-amplified DNA fragments of SSR markers and the single nucleotide extension of SNP markers. The genotyping output data of both SNP and SSR markers were coded for linkage analysis using JoinMap® 4 (Van Ooijen 2006). The fit of marker segregation to the 1:1 ratio expected was evaluated according to Chi-square tests. Recombination frequencies were converted into map distances (centiMorgan, cM) using the Kosambi mapping function (Kosambi 1944) and linkage groups were determined at LOD scores ≥6.
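The conversion of recombination frequency into map distance with the Kosambi function (Kosambi 1944) can be written as d = 25 ln[(1 + 2r)/(1 − 2r)] cM, where r is the recombination fraction. The following Python sketch shows this arithmetic only; it is not the JoinMap implementation, and the function name `kosambi_cm` is an illustrative choice.

```python
import math

def kosambi_cm(r):
    """Kosambi map distance in centiMorgans for recombination fraction r.

    d (Morgans) = 0.25 * ln((1 + 2r) / (1 - 2r)), hence 25 * ln(...) in cM.
    """
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))

# For small r the distance is close to 100 * r; it grows faster as r nears 0.5.
for r in (0.01, 0.10, 0.30):
    print(r, round(kosambi_cm(r), 2))
```

Because the function allows for crossover interference, a recombination fraction of 0.10 maps to slightly more than 10 cM rather than exactly 10 cM.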

Results

SNP characterization and haplotype analyses of six MYB genes

In vitro SNP discovery through amplicon cloning and sequencing was accomplished by homoeologous differentiation and gene-specific fragment amplification in cotton (Supplementary Fig. 1; Table 2). In this study, no duplicated or heterogeneous loci were found within each subgenome. SNPs and indels were detected from 8,301 bp of aligned sequences (7,084 and 1,217 bp of coding and non-coding regions, respectively). From the eight cotton genotypes, 108 SNPs were detected from Gossypium species (Table 2), giving an average SNP frequency of one SNP every 77 bases. Results showed the presence of one SNP per 106 bp in the coding regions and one SNP per 30 bp in the non-coding regions (Table 2). The SNP distribution varied among the six examined genes. The highest rate of SNP occurrence was observed in MYB6 (one SNP every 34 bp) and the lowest rate was observed in MYB3 (one SNP every 260 bp). Transitions (“A/G” or “C/T”) were the most common cause of sequence variation in the selected cotton genotypes (49%) compared to transversions (“A/T”, “G/C”, “A/C” or “G/T”, 26%) and indels (25%). In MYB6, two nucleotide (“C” and “T”) substitutions were observed in three indel positions (A-genome sites 101 and 111, D-genome site 99). A significant bias to “T” insertion/deletion was detected in the overall sequences (59.30%). In coding regions of the six MYB genes, 41 out of 67 cSNP (SNP in coding region) sites were predicted to result in amino acid changes (Table 2). The number of haplotypes defined by sequence polymorphism ranged from two to seven among the seven selected cotton genotypes, and haplotype diversity varied from 0.286 ± 0.196 to 1.000 ± 0.076 among six MYB genes (Table 2 and Supplementary Tables 1–12).
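The haplotype diversity values can be related to haplotype counts through the estimator of Nei (1987), Hd = n/(n − 1) × (1 − Σ p_i²), where n is the number of sequences and p_i the frequency of haplotype i. The Python sketch below is illustrative only: the haplotype labels are hypothetical, and the two configurations shown are assumptions chosen because they give the extreme values for n = 7, not the observed data from Table 2.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """Nei's (1987) haplotype diversity: n/(n-1) * (1 - sum(p_i**2))."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two sequences")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)

# Hypothetical case 1: seven genotypes, each carrying a distinct haplotype.
all_distinct = ["h1", "h2", "h3", "h4", "h5", "h6", "h7"]
# Hypothetical case 2: seven genotypes, two haplotypes split six to one.
six_to_one = ["h1"] * 6 + ["h2"]

print(round(haplotype_diversity(all_distinct), 3))
print(round(haplotype_diversity(six_to_one), 3))
```

Under these assumed configurations the estimator gives 1.000 and 0.286, which is consistent with the range of haplotype diversity reported above for seven genotypes carrying between two and seven haplotypes.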

Table 2 SNP characterization of the six MYB genes in selected cotton genotypes

Phylogenomic sequence characterization

SNP-based multivariate relationships suggested independent evolution of the six MYB homoeologous loci in the four tetraploid species. Parsimony analyses revealed that sequences (Supplementary Figs. 1, 2) fell into two clades, each containing one of the two homoeologous loci from the allotetraploid cotton lines and the corresponding copy from the progenitor diploid genomes. Pairwise comparisons of the nucleotide diversity (π) of the six MYB genes in both A- and D-genomes are summarized in Table 3. The π value measures the average number of nucleotide differences per site between two sequences (Nei 1987). The lowest nucleotide diversities occurred among the three G. hirsutum lines in both A- and D-genomes. Results from both A- and D-genomes showed the highest nucleotide diversities were between G. mustelinum and the extant relatives of the ancestral genome donors. Nucleotide diversities of MYB genes were higher in the D–Dt comparisons than for the A–At comparison of the allotetraploid cotton species, indicating that G. herbaceum may be a closer ancestor of the At-genome donor than G. raimondii is of the Dt-genome donor.

Table 3 Pairwise comparison matrix of the eight cotton lines showing the nucleotide diversity (π value, ×10²) in A- and D-genomes

To further explore the nature of substitutions contributing to overall divergence in cotton, pairwise comparisons among orthologous copies for the six MYB genes of both A- and D-genomes are tabulated separately for non-synonymous substitution (Ka), silent substitution (Ksil), and the Ka:Ksil ratio (Table 4). Ka and Ksil values in the D–Dt comparisons were higher than the corresponding values in the A–At comparisons, except for the Ksil value in the comparison between MAR (G. hirsutum) and its two living genome models. Contributing to the relatively high level of D–Dt differentiation were greater amino acid substitutions, nucleotide changes in non-coding regions, and synonymous changes in the coding regions. Although these predictions were based on the genomic sequence, they may allow speculation about evolutionary constraints placed on amino acid substitutions without knowing the exact effect of the SNPs on predicted codons. Nucleotide diversities among the three G. hirsutum lines in the Dt-genome were higher than those in the At-genome, indicating that the six MYB gene loci in the Upland cotton Dt-genome exhibited a faster evolutionary rate than those in the At-genome (Table 3). Most of the substitution ratios (Ka:Ksil) of pairwise comparisons were less than 1, indicating the possibility of a high level of evolutionary constraint placed on amino acid substitution in the six MYB genes (Table 4).

Table 4 Pairwise comparison matrix of molecular evolutionary rates for the six MYB genes in cotton

Chromosome localization of six MYB genes

Hypoaneuploid stocks, developed from three interspecific crosses between TM-1 (G. hirsutum) and 3-79 (G. barbadense), G. tomentosum or G. mustelinum, and one set of euploid interspecific backcrossed chromosome substitution lines (CS-B, BC5S1) of 3-79 in TM-1 were used for chromosomal assignment of SNP markers by deletion analysis (An et al. 2007). Thirteen different SNP sites between the common parent TM-1 and 3-79, G. mustelinum or G. tomentosum, respectively, were selected for SNP primer design in six MYB genes (Table 1). We confirmed our identification of chromosomal locations using deletion lines from different sources. Due to the conserved character of the homoeologous sequences in the genes MYB3 and MYB5, no suitable genome-specific PCR primers could be designed. However, chromosomal assignment of genome-specific alleles was still possible using euploid CS-B or hypoaneuploid F1 stocks (Table 1). Moreover, no SNP marker could be designed from the Dt-genome of gene MYB2. Therefore, only the At-genome location was considered for chromosomal assignment by either deletion analysis or linkage mapping. SNP markers used for chromosomal assignment and the corresponding genotyping results are listed in Table 1. Deletion analyses of the six genes were performed using all the available cytogenetic stocks and the results are summarized in Table 5. We detected chromosomal locations of the gene MYB4 on the long arm of two homoeologous chromosomes: 7 and 16. Only one subgenomic location of genes MYB1, MYB2, MYB5, and MYB6 was found by deletion analysis using SNP markers, which was on the long arm of chromosome 18, short arm of chromosome 8, short arm of chromosome 11, and short arm of chromosome 11, respectively. We do not have complete coverage for all the chromosomes in the cytogenetic stocks.
The putative chromosome location of gene MYB3 in Dt-genome could not be determined due to incomplete coverage of Dt-genome; however, it is probably on one of the chromosomes for which we do not have aneuploid stock coverage (long arm of chromosomes 14, 15, or chromosomes 19, 21, 23, and 24).
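The logic of deletion analysis can be sketched as follows, under the assumption that each hypoaneuploid F1 stock lacks one TM-1 chromosome or chromosome arm, so that a marker located on the missing segment shows only the donor-species allele while all other stocks show both alleles. The Python below is a simplified illustration; the stock labels, allele calls, and the function `assign_by_deletion` are hypothetical and do not reproduce the genotyping data in Table 1.

```python
def assign_by_deletion(calls, tm1_allele, donor_allele):
    """Return the stocks in which the TM-1 allele is missing.

    calls: dict mapping a stock label to the set of alleles observed there.
    A stock is reported only if the donor allele was detected, so that a
    failed reaction (empty call) is not mistaken for a deletion.
    """
    return sorted(
        stock
        for stock, alleles in calls.items()
        if tm1_allele not in alleles and donor_allele in alleles
    )

# Hypothetical SNP with TM-1 allele "C" and donor allele "T".
calls = {
    "missing_chr16": {"C", "T"},           # heterozygous: locus not on chr 16
    "missing_chr18": {"T"},                # TM-1 allele lost
    "missing_chr18_long_arm": {"T"},       # TM-1 allele lost
    "missing_chr18_short_arm": {"C", "T"}, # heterozygous: not on short arm
    "failed_sample": set(),                # no call, ignored
}
print(assign_by_deletion(calls, tm1_allele="C", donor_allele="T"))
```

In this toy example the marker would be assigned to the long arm of chromosome 18. A locus on a chromosome without stock coverage would return an empty list, which corresponds to the situation described for the Dt-genome copy of MYB3.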

Table 5 Chromosomal locations of the six MYB genes with previously reported QTL

Linkage mapping of MYB genes by SNP markers

Framework SSR markers were utilized to construct linkage maps with SNP markers. We used 186 RILs, from the cross of TM-1 and 3-79, for genotyping by 90 SSR markers and five polymorphic SNP markers specific to genes MYB1, MYB2, MYB4, and MYB6. Genetic linkage mapping results confirmed the deletion analysis for the chromosomal locations of MYB1, MYB2, and MYB4. Linkage mapping also revealed chromosomal locations of two genes’ homoeologous loci (At-genome of gene MYB1 and Dt-genome of gene MYB6), which were on chromosome 13 and 21, respectively (Table 5). Moreover, it showed the linkage relationship between 15 SSR markers and 5 SNP markers (Fig. 1). Three SNP markers showed distorted segregation in the mapping population. The segregation of SNP markers Myb1Gbmt_238_R and Myb4Gbmt_105_R was skewed toward TM-1 and the segregation of SNP marker Myb2Gb_204_R was skewed toward 3-79.
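Segregation distortion of this kind is assessed with the Chi-square test against the expected 1:1 ratio. The Python sketch below shows the test for a single marker with one degree of freedom; the allele counts are hypothetical (chosen only to sum to 186 RILs), heterozygous and missing calls are assumed to have been excluded beforehand, and the P value uses the identity P(χ² > x) = erfc(√(x/2)) for one degree of freedom rather than a statistics library.

```python
import math

def chi_square_1to1(n_a, n_b):
    """Chi-square statistic and P value (1 df) for a 1:1 segregation ratio."""
    total = n_a + n_b
    if total == 0:
        raise ValueError("no observations")
    expected = total / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function for 1 df
    return chi2, p_value

# Hypothetical marker: 110 RILs with the TM-1 allele, 76 with the 3-79 allele.
chi2, p = chi_square_1to1(110, 76)
print(round(chi2, 3), round(p, 4))
```

With these assumed counts the marker would be flagged as skewed toward TM-1 at the 5% level, whereas an even 93:93 split gives a statistic of zero.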

Fig. 1 Linkage maps of selected SNP markers derived from MYB genes with SSR markers. The names of the linkage groups A01, A02, and D02 and chromosomal locations were as per Wang et al. (2006b)

Discussion

SNP in cotton

Efficient SNP discovery in polyploids, such as cotton, requires appropriate methods for distinguishing between genome-specific polymorphisms (GSPs) and locus-specific polymorphisms (LSPs). In this study, we reduced the possibility of identifying false SNPs by applying the following approaches: (1) designing PCR primers from well-characterized genes to generate an amplicon pool from each genotype; (2) sequencing multiple clones to avoid random sequencing errors and to ensure recovery of the duplicated loci of the gene; (3) identifying putative loci by phylogenetic clustering and comparison with the two progenitor diploid genome species of allotetraploid cottons; (4) designing locus-specific PCR and SNP primers for SNP marker genotyping to confirm the reliability of the procedures (An et al. 2007). Thus, a total of 108 putative SNPs were identified among selected genotypes at the same locus. The average frequency was one SNP per 77 bp (1.30%), with one SNP per 106 bp (0.94%) and one SNP per 30 bp (3.33%) in coding and non-coding regions, respectively. In Arabidopsis thaliana, the rate of variation per nucleotide was 1.09% in the GL1 gene (a member of the MYB gene family) of 26 accessions (Hauser et al. 2001) and 0.27% in the Atmyb2 gene of 20 ecotypes (Kamiya et al. 2002). In cotton, the average rate of SNP per nucleotide was observed as 2.35% in six EXPANSIN A genes (An et al. 2007). Another pilot SNP study revealed that the rate of variation per nucleotide was 0.35% between G. hirsutum and G. barbadense (one SNP every 286 bp), and the variations per nucleotide were 0.14 and 0.37%, respectively, within these two species (Rong et al. 2004).
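The SNP frequencies quoted above follow from the aligned lengths and SNP counts reported in the Results. The short Python sketch below reproduces the arithmetic; the non-coding SNP count of 41 is an assumption derived here as the difference between the 108 total SNPs and the 67 cSNPs, and the function name `snp_frequency` is illustrative.

```python
def snp_frequency(aligned_bp, snp_count):
    """Return (bp per SNP, SNPs per 100 bp) for one sequence class."""
    if snp_count <= 0 or aligned_bp <= 0:
        raise ValueError("counts must be positive")
    return aligned_bp / snp_count, 100 * snp_count / aligned_bp

# Lengths and counts as reported; the non-coding count is derived (108 - 67).
regions = {
    "all": (8301, 108),
    "coding": (7084, 67),
    "non-coding": (1217, 108 - 67),
}
for name, (bp, snps) in regions.items():
    interval, percent = snp_frequency(bp, snps)
    print(f"{name}: one SNP per {interval:.1f} bp ({percent:.2f}%)")
```

Rounded to whole bases, the intervals agree with the reported 77, 106, and 30 bp. The percentages computed directly from the counts can differ slightly in the second decimal from those in the text, which appear to correspond to the reciprocals of the rounded intervals.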

In other crops, Ching et al. (2002) reported the presence of one SNP per 31 bp in non-coding regions and one per 124 bp in coding regions when analyzing 18 maize genes in 36 inbred lines. One SNP in every 273 bp was present in soybean (Zhu et al. 2003). Genome-wide sequence alignment between rice subspecies Indica and Japonica revealed a polymorphism rate of 1.70 SNP/kb and 0.11 indel/kb (Feltus et al. 2004). In wheat, SNP frequency was one SNP per 540 bp (Somers et al. 2003). The incidence of SNP in barley was reported as one SNP per 27 bases in the intronless Isa gene (Bundock et al. 2003), and approximately one SNP per 131 bases in the exonic region of the P450 gene family members (Bundock and Henry 2004). Although varying frequencies of SNP per length of DNA sequence have been reported, they are highly dependent upon the kind of sequence data and genotypes used to generate SNPs in each species. As expected, we observed more SNPs at the interspecific level than at the intraspecific level in the six cotton MYB genes in this study.

MYB gene phylogenomic features

The cotton genus contains about 50 species with the basic chromosome number of 13. The five tetraploid cotton species (AADD, 2n = 4x = 52) are a monophyletic assemblage putatively derived from a single allopolyploidization event that occurred 1.5 million years ago (MYA) after divergence of the diploid progenitors about 6.7 MYA (Senchina et al. 2003). The two diploid species that gave rise to the allotetraploids were from the A- and D-genome groups which are best represented by the extant species G. herbaceum L. and G. raimondii Ulbr., respectively (Wendel and Cronn 2003). Our results showed that the tetraploid MYB genes could be broadly separated into two origins representing the putative A- and D-genomes based on their similarity with the sequences of the diploid ancestral species (Supplementary Figs. 1, 2). SNP-based multivariate relationships conformed to independent evolution of the six MYB homoeologous loci in the four tetraploid species (Cronn et al. 1999; Cedroni et al. 2003). We observed that the nucleotide diversity was higher in the Dt-genome compared to the At-genome of the three G. hirsutum lines. Previous studies with Adh (Small et al. 1998, 1999; Small and Wendel 2002) and FAD2-1 (Liu et al. 2001) showed a faster evolutionary rate in the Dt-genome than in the At-genome of cotton. Reinisch et al. (1994) reported that the RFLP marker polymorphism levels of the Dt-genome were 10% higher than the At-genome. The Dt-genome, from an ancestor that does not produce spinnable fiber, contributes substantially to fiber quality of tetraploid cottons (Jiang et al. 1998; Saranga et al. 2001; Paterson et al. 2003; Lacape et al. 2005; Rong et al. 2007). Many QTLs that positively affect fiber quality have been detected on the Dt-genome (Table 5). In addition, many EST loci associated with fiber development have also been mapped to the Dt-genome (Park et al. 2005).
However, some QTL influencing fiber quality and yield have been identified in the At-genome as well (Mei et al. 2004; Frelichowski et al. 2006). Whether the spreading of the At-genome repetitive DNA elements to the Dt-genome (Zhao et al. 1998) or different evolutionary pressures operating on the two genomes (Small and Wendel 2002) caused the different evolutionary dynamics is still obscure. But, all these facts collectively indicated the importance of further investigations of the Dt-genome for fiber improvement in the tetraploid cottons.

Chromosomal locations of MYB genes

The chromosomal locations of six MYB genes were identified via deletion analysis or linkage mapping (Table 5; Fig. 1). The low level of polymorphism in molecular markers derived from functional genes such as EST-SSR (Park et al. 2005; Guo et al. 2007) or cDNA probe-based STS or RFLP (Rong et al. 2004) among mapping parents has hindered their use in candidate gene mapping. Results presented here show the great potential for using SNP markers to tag functional genes and improve the comparative maps in cotton.

Previous studies have led to the discovery of important QTL on different chromosomes in cotton. Previously reported QTL for cotton fiber quality and yield component traits that lie on the same chromosomes as the six MYB genes are summarized in Table 5. Analyses of the effects of chromosome-specific introgression in Upland cotton indicated that substitutions for chromosomes 16 and 18 from 3-79 had additive effects related to reduced yield (Saha et al. 2006a). These chromosomes are the locations of the MYB1 and MYB4 genes. Further studies using topcrosses of 13 CS-B lines with five commercial cultivars showed that chromosomes 7 and 18 (locations of genes MYB4 and MYB1, respectively) had additive effects for fiber strength (Jenkins et al. 2007). Given the role of MYB transcription factors in fiber cell initiation and expansion, the agreement between the chromosomal locations of MYB genes and previously reported fiber yield and quality QTL suggests that these SNP markers may be useful in studying the association between important fiber development genes and economically important QTL in cotton.