Introduction

Most molecular studies of positive selection have focused on estimating the type of selection on a gene or gene family of known function (e.g., Wright et al. 2002; O’Connell and McInerney 2005; Strain and Muse 2005) or, more broadly, on genes associated with a particular tissue, such as male reproductive organs (Wyckoff et al. 2000; Swanson et al. 2001, 2003; Torgerson et al. 2002). However, large-scale sequencing of genomes, specific genes, and expressed sequence tags (ESTs) now makes it feasible to compare substitution rates between species for large numbers of genes from throughout the genome. Currently, only a small number of these whole-genome approaches have been undertaken (e.g., Endo et al. 1996; Swanson et al. 2001; Tiffin and Hahn 2002; Clark et al. 2003). Results from both of these approaches have identified only a small percentage of genes with a molecular signature of positive selection; the remainder exhibit patterns of molecular evolution most consistent with purifying selection or neutral evolution.

One limitation of these whole-genome comparisons is that most only involve comparisons of two sequences (Bishop et al. 2000). Typically, substitution rates are estimated with pairwise maximum likelihood (Goldman and Yang 1994) or approximate (e.g., Nei and Gojobori 1986) methods that average the substitution rates across sites, but averaging may not be appropriate for most genes (e.g., Hughes and Nei 1988; Yang et al. 2000; Bielawski and Yang 2001) and has little power (Anisimova et al. 2001). More powerful likelihood methods have been developed that allow substitution rates and selection pressures to vary among sites (variable rate models [Nielsen and Yang 1998; Yang et al. 2000, 2005; Wong et al. 2004]), making it feasible to identify positively selected genes even when most sites are not under selection. These methods have not previously been used for comparisons of sequence pairs because accurate estimates of the strength of selection and the identity of selected sites are sensitive to the number of sequences available for comparison (Anisimova et al. 2001, 2002). However, estimates of the likelihood of a given model of sequence evolution are more robust and less sensitive to sequence number (Anisimova et al. 2001). Thus, comparisons among likelihood estimates of various models provide a reliable means of identifying selection even for pairs of sequences. In this paper, we use both variable rate models that allow selection pressures to vary across sites and rate-averaging models to examine 523 sets of orthologous EST pairs from representatives of the two subfamilies of the Asteraceae: lettuce (Lactuca sativa L. and L. serriola L.) and sunflower (Helianthus annuus L.). Genes under selection within the sunflower or lettuce lineages are identified by comparing the likelihood values of various models, including neutral and selection models, for each gene. 
To verify the results of the selection analyses, we placed positively selected ESTs from sunflower on a QTL framework map to determine whether they map coincident with QTLs associated with domestication. When possible, sequences were also added from two wild species of sunflower and the selection analyses were repeated. The reliability and applicability of the site-specific selection models for pairs of sequences are discussed in relation to the results from sunflower. Furthermore, we discuss the relative utility of an EST sequencing approach for evolutionary analyses such as those reported here.

Materials and Methods

The EST sequence data were generated as part of the Compositae Genome Project (http://compgenomics.ucdavis.edu) and were deposited in our database, which contains ∼112,000 individual ESTs sequenced from both sunflower and lettuce (http://cgpdb.ucdavis.edu). For sunflower, ∼44,000 individual ESTs were sequenced from 10 tissue types (callus, roots, disk and ray flowers, flowers prefertilization, developing kernels, root/shoot chemical induction, roots—environmental stress, shoots—environmental stress, germinating seeds, flowers—environmental stress, hulls) of two Helianthus annuus cultivars. The first cultivar, RHA801, is an unbranched confectionary line (Roath et al. 1981), whereas the second, RHA280, is a branched oilseed producing line (Fick et al. 1974). Differences between the cultivars include seed size, seed storage oil production, branching patterns, and disease resistance (S. Knapp, personal observation). These cultivars have been used in a number of other genetic studies, including a high-density genetic map (Tang et al. 2002).

Individual EST sequences from the two sunflower cultivars were previously assembled into 5504 unique orthologous contiguous sequences (contigs) and 6597 singletons, each of which represents a unique gene (unigene [A. Kozik et al. unpublished]). Of the unigenes with more than one component EST sequence, 2799 are composed of sequences from both sunflower cultivars. A putative reading frame was determined for 2038 of these sunflower unigenes by alignment with the Arabidopsis thaliana genome sequence (Arabidopsis Genome Initiative 2000). All sequences, alignments, unigenes (contigs), and results of BLAST searches against Arabidopsis are available from the Compositae Genome Project database (http://cgpdb.ucdavis.edu/).

The remaining ∼68,000 sequences in the dataset are from lettuce. Ten different tissue types (callus, roots, flowers prefertilization, flowers postfertilization, root/shoot chemical induction, roots—environmental stress, shoots—environmental stress, germinating seeds, flowers—environmental stress, dark-grown leaves) from two lettuce species, Lactuca serriola 92G489 (a wild lettuce accession) and L. sativa cv. Salinas (USDA PI 536851; cultivated crisphead lettuce), were used to generate the ESTs. Lactuca serriola may be the wild progenitor of L. sativa, and the two are often considered conspecific (Kesseli et al. 1991; for review see Koopman et al. 1998). These species differ in several important characters including bolting, root morphology, leaf shape, and disease resistance (De Vries and Van Raamsdonk 1994). Based on the results of a CAP3 assembly, there are 13,956 singleton unigenes and 8179 unigenes with more than one EST sequence. Sequences from both lettuce species are present in 5341 of these unigenes. The 4641 lettuce unigenes for which putative reading frames have been determined comprise the initial dataset used in the current analyses.

Analyses were performed separately for the sunflower and lettuce genes. A Perl script (available from the authors by request) was written to automate analyses for each sequence set. For sunflower and lettuce unigenes with more than one EST sequence from a taxon, a majority consensus sequence was constructed for the lineage using modified versions of the bioperl (Stajich et al. 2002) modules “consensus_iupac” and “consensus_string” (available from the authors by request). The majority consensus rule was applied only where ESTs had sequence information for a position (missing data were ignored), allowing single EST sequences to extend the majority consensus sequence. At positions where EST reads overlapped but there was not a majority consensus across sequences, International Union of Pure and Applied Chemistry ambiguity codes were used. The translated region of each consensus sequence was then extracted using “extractseq” from the European Molecular Biology Open Software Suite (Rice et al. 2000).
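The consensus rule described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' modified bioperl modules: the function names, the set of characters treated as missing data, and the choice to encode all observed bases (rather than only the tied ones) when no majority exists are assumptions made for the example. Reads are assumed to be pre-aligned strings of equal length containing only A, C, G, T, and missing-data characters.

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases they represent.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

MISSING = "-N?"  # assumption: these characters carry no sequence information


def consensus_column(bases):
    """Return the consensus character for one alignment column."""
    # Missing data are ignored, so a single covering read can extend the consensus.
    observed = [b for b in (x.upper() for x in bases) if b not in MISSING]
    if not observed:
        return "N"  # no read covers this position
    counts = Counter(observed)
    top_base, top_count = counts.most_common(1)[0]
    if top_count * 2 > len(observed):
        return top_base  # strict majority among the reads covering the site
    # No majority: fall back to an ambiguity code for the observed bases (assumption).
    return IUPAC.get(frozenset(counts), "N")


def majority_consensus(reads):
    """Build a consensus string from equal-length, pre-aligned reads."""
    if len({len(r) for r in reads}) > 1:
        raise ValueError("reads must be aligned to the same length")
    return "".join(consensus_column(col) for col in zip(*reads))
```

For example, `majority_consensus(["ACGT--", "ACGA--", "--GTAC"])` keeps the majority base at the overlapping positions and lets the third read extend the consensus at its 3' end, while a two-read tie between C and T is written as `Y`.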

From the sunflower and lettuce datasets, sequence pairs were excluded from analysis based on poor alignment scores, stop codons within the putative reading frames, or very short consensus sequences for either or both of the lineages (9% removed from sunflower, 5% removed from lettuce). Unigenes with very short (<100 codons) or very similar (distance <0.11) sequences were excluded from analysis based on the simulation parameters of Anisimova et al. (2001).
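The exclusion criteria can be expressed as a simple predicate. The sketch below is hypothetical (the function and constant names are not from the authors' Perl script), and it assumes that codon counts, pairwise distances, and stop-codon checks have already been computed for each sequence pair. The upper distance bound of 2 is the one quoted in the Results (0.11 < distance < 2).

```python
MIN_CODONS = 100      # pairs with fewer codons were excluded
MIN_DISTANCE = 0.11   # pairs more similar than this were excluded
MAX_DISTANCE = 2.0    # upper bound quoted in the Results


def passes_filters(n_codons, distance, has_internal_stop=False):
    """Return True if a sequence pair is retained for the selection analyses."""
    if has_internal_stop:
        return False  # stop codon inside the putative reading frame
    if n_codons < MIN_CODONS:
        return False  # consensus too short
    return MIN_DISTANCE < distance < MAX_DISTANCE
```

A pair with 215 codons and a distance of 0.363 would be retained, whereas a pair with 99 codons, or one with a distance of 0.05, would be dropped.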

For the remaining sequences, the nucleotide composition of the translated regions was analyzed using the Phylogenetic Analysis Using Maximum Likelihood (PAML) package (version 3.14; Yang 1997). The CODEML module in PAML uses maximum likelihood methods to estimate the likelihood scores of various models for a given gene. Using a maximum likelihood approach can account for transition/transversion rate biases, codon usage bias, and multiple substitutions (Yang and Nielsen 1998).

The likelihood score of each of six analyses was computed for each pair of consensus sequences using CODEML. Four analyses implemented site-specific models (M0, M1a, M2a, M3 [Nielsen and Yang 1998; Yang et al. 2000]) and two analyses implemented the pairwise comparison method (runmode −2). Three models (M0 and two pairwise models, runmode −2) were implemented that average synonymous and nonsynonymous substitution rates across sites. This results in a single selection pressure across codons (dN/dS). In the first pairwise model (runmode −2), dN/dS was estimated from the data. The second pairwise model (runmode −2) assumed dN/dS to be fixed at 1, a neutral model. In model M0, dN/dS is estimated from the data. Three site models allowing selection pressure to vary across codons (M1a, M2a, M3) were also implemented (Nielsen and Yang 1998; Yang et al. 2000, 2005; Wong et al. 2004). These models allow the nonsynonymous substitution rate to vary over sites, whereas the synonymous substitution rate remains constant. For these models, the average dN/dS across the gene is a function of codon-specific substitution rates (ωn) and the proportion of codons with each substitution rate (pn). In the nearly neutral model, M1a, two site classes are allowed. The first class estimates the nonsynonymous rate, but it is constrained to be between 0 and 1 (ω0 at proportion p0 of sites). The second class of sites constrains the nonsynonymous substitution rate to be equal to the synonymous substitution rate (ω1 = 1 at proportion p1 of sites; dN/dS = ω0 * p0 + ω1 * p1). The selection model, M2a, incorporates a third class of sites (ω2) that estimates an unconstrained nonsynonymous substitution rate. The final model, M3, is a discrete model with three classes (K = 3) in our analyses. This model allows nonsynonymous substitution rates to vary across codons, estimating three categories of substitution rates (ω0, ω1, and ω2, in proportions p0, p1, and p2, respectively; dN/dS = ω0 * p0 + ω1 * p1 + ω2 * p2).
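The gene-wide average under the site models is simply a proportion-weighted sum of the class-specific ω values. A minimal Python sketch follows; the function name and the numerical values in the usage note are purely illustrative and are not estimates from this study.

```python
def mean_omega(omegas, proportions, tol=1e-9):
    """Average dN/dS across a gene: the sum of omega_n * p_n over site classes."""
    if len(omegas) != len(proportions):
        raise ValueError("need one proportion per site class")
    if abs(sum(proportions) - 1.0) > tol:
        raise ValueError("site-class proportions must sum to 1")
    return sum(w * p for w, p in zip(omegas, proportions))
```

With hypothetical M1a values ω0 = 0.1 (p0 = 0.8) and ω1 = 1 (p1 = 0.2), `mean_omega([0.1, 1.0], [0.8, 0.2])` gives an average of about 0.28. Adding a third class, as in M2a or M3, only lengthens the two lists.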

Several of these models are nested with one another, allowing their likelihood values to be directly compared using a likelihood ratio test (LRT). In this test, the likelihood value of a simpler (null) model (-lnL1) is compared to the likelihood value of a more general alternative model (-lnL2) by taking twice their difference and comparing this value with a chi-square distribution. Specifically, the likelihood values of the two pairwise analyses (dN/dS estimated vs. constrained; runmode −2) can be compared and have 1 degree of freedom. The selection model (M2a) can be compared to the neutral model (M1a) with 2 degrees of freedom (Yang et al. 2000) and the discrete model (M3; three nonsynonymous substitution rate classes) can be compared to model M0, which contains only one class of nonsynonymous substitution rates (df = 3). For the pairwise models (runmode −2), if the constrained model (dN/dS = 1) is rejected in favor of the unconstrained model that estimates dN/dS and dN/dS > 1, then this is a test for positive selection. Similarly, if the selection model (M2a) is a significantly better fit to the data than the neutral model (M1a) and contains a class of sites with ω > 1, this also constitutes a significant test for positive selection (Yang et al. 2000). A significant LRT between the M3 and the M0 models is support for variable substitution rates among codons.
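The LRT itself is a short calculation once the two log-likelihoods are in hand. The following Python sketch uses only the standard library; the function names are hypothetical, and the inputs are assumed to be log-likelihoods (lnL, typically negative numbers) for the null and alternative models. The chi-square tail probability uses the closed-form expressions available for integer degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: finite Poisson-type sum.
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: complementary error function plus a finite correction sum.
    terms = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) for nested null and alternative models."""
    # Small negative differences from optimizer noise are clamped to zero.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return statistic, chi2_sf(statistic, df)
```

As an illustration with made-up values, a neutral model (M1a) with lnL = −1000.0 and a selection model (M2a) with lnL = −996.0 give a test statistic of 8.0, which at 2 degrees of freedom corresponds to p ≈ 0.018. A significant result supports positive selection only if the alternative model also contains a class of sites with ω > 1.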

To test the reliability of the LRT between variable selection pressure models, we asked whether the ESTs that appeared to be under positive selection in sunflower were correlated with domestication traits. This was accomplished by PCR-amplifying the positively selected ESTs using the “touch-down” cycling program described by Burke et al. (2002). Loci that successfully amplified were genotyped on a WAVE dHPLC (denaturing high-performance liquid chromatograph) following the methods of Lexer et al. (2003). MAPMAKER 3.0/EXP (Lincoln et al. 1992) was used to place the ESTs on an existing QTL framework map for wild X domesticated sunflower (Burke et al. 2002; Lai et al. 2005). We also searched much smaller EST databases for two wild sunflower species, H. argophyllus (sister to H. annuus) and H. paradoxus (hybrid derivative of H. annuus and H. petiolaris), for orthologous sequences to the positively selected sunflower genes. Sequences were found for only one positively selected gene, and the analyses were repeated to determine if the results were consistent with those obtained from sequence pairs.

Results

Fifty-six genes were identified as having experienced positive selection in either the sunflower or the lettuce lineages (Tables 1 and 2). Although the functions of most of the identified genes under positive selection are unknown, several exhibit homology to known genes. Inferred functions of these genes in sunflower include the regulation of transcription, a dehydration induced protein, and an oxidation enzyme. In lettuce, the inferred functions include the regulation of cell division and gene expression, RNA unwinding and splicing, calcium and protein transport, anthocyanin biosynthesis, floral development, and construction of the cytoskeleton (see database). The genes under positive selection were found to occur in every tissue type included in the EST libraries except from prefertilized Helianthus flowers, for which very few ESTs were sequenced. Moreover, most of the selected genes were expressed in multiple tissues. There was thus no statistical difference among tissue categories in the occurrence of selected genes. Unpublished mapping studies in sunflower also show that there is no apparent clustering of these genes within the genome.

Table 1. Genes determined to be under positive selection in sunflower cultivars
Table 2. Genes determined to be under positive selection in lettuce species

Sunflower

Of the original 2038 unigenes with a predicted reading frame and at least one sequence from both the oilseed and the confectionary sunflower taxa, only 1479 had more than 100 codons. Fifteen percent (224) of these genes had sufficient numbers of substitutions (0.11 < distance < 2) for further analyses. For these 224 genes, the average number of codons is 215 (range, 100–694) and the average distance between sequences is 0.363. The current analyses identified 11 genes (4.9%) under positive selection based on significant LRTs and at least one class of substitutions with ω > 1 (Table 1). In 10 of these cases, both the selection model (M2a) and the discrete model (M3) fit the data significantly better than either the neutral model (M1a) or the rate-averaging model (M0). The LRT between the pairwise models (runmode −2) was significant for five genes; however, the estimated dN/dS value in the pairwise comparison was not > 1. The remaining 213 sunflower genes appear to be evolving under neutral conditions or purifying selection.

Amplification products were generated for all 11 positively selected genes in sunflower; however, dHPLC patterns were too complex for some of them to allow confident mapping. Five genes were successfully mapped and are positioned on the wild X domesticated QTL map of Burke and Rieseberg (2002) as follows: gene 502 (between marker 328 and marker 1043 on linkage group 8), gene 566 (between marker 258 and marker 343 on linkage group 16), gene 1548 (between marker 995 and marker 388 on linkage group 13), gene 2337 (between marker 727 and marker 561 on linkage group 17), and gene 2816 (above marker 388 on linkage group 13). Four of these positions are very close to domestication QTLs. Specifically, gene 566 (putative transcription factor, based on BLAST results) maps coincident with a QTL controlling the number of heads per branch; gene 1548 (putative protein) maps to the same position as a QTL affecting leaf shape; gene 2337 (putative protein) maps coincident with a QTL affecting days to flower, plant height, number of leaves, and peduncle length; and gene 2816 (putative b-zip transcription factor) maps to a QTL controlling disk diameter, ray number, and achene width. Sequence from gene 2816 was also found in the database for Helianthus paradoxus and used in an analysis combining the three genes with variable substitution rate models. The results of this analysis are consistent with the results presented here, with a single class of amino acids having an elevated substitution rate (ω = 5.22) and the LRT between the selection (M2a) and neutral (M1a) models being significant.

Lettuce

After removing genes with too few codons, 3755 lettuce genes were retained. Only 299 (8%) of these genes showed sufficient levels of divergence (0.11 < distance < 2) for further analyses. On average, these genes had estimated genetic distances of 0.36 and 259 codons (range, 101–733). Of these, 45 (17.4%) show evidence of positive selection (Table 2), while the remaining 254 appear to be either evolving neutrally or under selective constraint.

For all positively selected genes except one (see below), the selection model (M2a) fit the data significantly better than the neutral model (M1a) (Table 2). For 39 of the genes under positive selection, the discrete model (M3) was also significantly better than the rate-averaging model (M0). Only nine genes had a significant LRT between pairwise models (runmode −2), with one of these genes having an estimated dN/dS > 1. In only one case did the pairwise analyses identify a gene (lettuce 5649) under positive selection that was not identified by the more complex variable rate models. It is interesting to note that for this same gene, the variable rate model (M3) had the same likelihood as the rate-averaging model (M0), which had a dN/dS value of 5.284, suggesting that in this case the variable rate models were not necessary to identify positive selection.

Discussion

Evolutionary analyses of shotgun EST sequence data from closely related sunflower and lettuce taxa have identified 56 genes whose coding regions appear to be under positive selection. The majority of genes were identified by implementing site-specific variable substitution rate models. In sunflower, these results were strengthened by correlations with domestication QTLs and by sequence comparisons with closely related species. These results suggest that more complex models can successfully identify positively selected genes from pairs of sequences. Although we successfully identified a number of genes under positive selection, only a small fraction of the sequenced genes showed high enough levels of diversity to be used in such analyses. This indicates that although evolutionary analyses of ESTs are informative, a large number of sequences must be obtained to ensure a sufficient sample size for subsequent evolutionary comparisons.

Alternate Models of Selection

In our analyses, 55 genes under positive selection were identified based on a significant LRT between the site-specific variable substitution rate selection model (M2a) and the site-specific neutral model (M1a). Only a subset of these genes, 15 (27%), had a significant LRT between the pairwise models (runmode −2). Reliance on the latter model alone would have resulted in underestimates of the proportion of genes under positive selection, overlooking evolutionarily important genes.

However, it is important to consider the limits of the variable rate models as well, particularly when pairs of sequences are compared. Simulation studies have shown that increasing the number of sequences in a comparison bolsters the accuracy of the variable rate models, particularly in calculations of the strength of selection (ω), the proportion of sites under selection (p), and subsequent identification of specific sites under selection (Anisimova et al. 2001, 2002). It is likely that estimates of these parameters in the current study are not accurate and, therefore, were not reported. Fortunately, estimates of the likelihood of a given substitution rate model for a gene are accurate and the LRT is reliable, even for smaller datasets (Anisimova et al. 2001). Briefly, simulations indicate that the LRT is a conservative test even when sequences are short or divergence is low (Anisimova et al. 2001), such as in our study. With few sequences, the LRT loses power, but accuracy is not affected (Anisimova et al. 2001), making it more likely that our results misidentified positively selected genes as being neutral rather than the reverse. Even given these limitations, the LRT was able to identify a number of candidate positively selected genes. Furthermore, in sunflower, results from subsequent analyses and mapping studies were consistent with the results from the LRT. Our results thus demonstrate the value of using models that incorporate variable substitution rates across codons, even when only two sequences are available for comparison. The assumption of positive selection on individual genes should be verified using mapping or sequencing techniques such as those used with the sunflower dataset. However, the methods presented here make these other techniques much more feasible due to the much smaller pool of target genes.

Genes Under Selection

It is important to note that while our study examines agricultural lineages that have been subject to artificial selection, our estimates of the number of positively selected genes (5–17%) are similar to those estimates obtained from noncultivated species (0%–11% [e.g., Swanson et al. 2001; Tiffin and Hahn 2002; Clark et al. 2003]). Due to the short time since the domestication of these species, the allelic differences within positively selected genes may predate domestication and have been either sorted among domesticated lines (sunflower) or sampled from the wild progenitor (lettuce). Sampling of other wild and cultivar populations will help us to better pinpoint the location and timing of the positive selection.

Mapping Studies in Sunflower

Although four of the positively selected genes in sunflower are correlated with domestication QTLs, much additional work will be required to test whether they are causally related to these phenotypic differences. Full-length cDNA sequences are required to infer the function of these genes more precisely. Two of them are unknown proteins. The other two have inferred functions (b-zip transcription factor and DNA binding protein) that are consistent with the phenotypic changes correlated with them, but the functional categories are too broad to be informative. In addition to the sequencing, fine-mapping is needed to verify the correlation, and transgenic complementation and/or RNAi mediated by virus-induced silencing (Baulcombe 1999; Chuang and Meyerowitz 2000) will be required to demonstrate function.

Evolutionary Analyses of ESTs

The tremendous number of ESTs currently being generated for a variety of different plant and animal species presents the tantalizing potential to identify genes contributing to species differences. However, coding sequences often are not sufficiently divergent to test for positive selection in closely related taxa. In our analyses, for example, only 10% of the unigenes with more than 100 codons were sufficiently divergent for analyses of substitution rates and selection. The proportion of genes with sufficient divergence for analysis is likely to increase for comparisons of more distantly related taxa, but the fraction of analyzable genes may still remain small due to other limiting factors, such as lack of reliable BLAST hits and hence reliable translated regions and reading frames. Furthermore, evolutionary analyses of EST data rely on the identification of orthologous sequences from unique genes. In our analyses, we carefully examined unigene sets for any indication of paralogy such as multiple patterns of substitution at a given gene within taxa. However, for many genes, only a single EST sequence was available from each lineage. In this case, paralogues could be misidentified as orthologues and any signature of selection could be due to divergence among genes rather than among lineages. Thus, it is important to verify the orthology of these sequences with further molecular analyses such as direct sequencing.

Assuming that orthologous gene copies are correctly identified, another concern is that most EST sequences have untranslated regions that are not informative for tests of positive selection. In our analyses, the untranslated region averaged 100 bp. As a result, many unigenes were too short for rigorous analysis of positive selection. Given that most ESTs (or the unigenes derived from them) do not cover the entire gene, some genes under positive selection are likely to be missed. Moreover, most of the unigenes in both lettuce and sunflower are represented by single EST sequences and may contain uncorrected sequencing errors. These factors will reduce the number of sequences that can be tested for positive selection and will bias downward estimates of the proportion of positively selected genes. Cumulatively, these results indicate that although evolutionary analyses of ESTs are productive, a large number of sequenced genes are needed to ensure a sufficient sample size for evolutionary analyses.

Conclusions

To our knowledge, this is the first attempt to use variable substitution rate models to compare sequence pairs. The current analyses are also among the first to test for positive selection across a large number of genes isolated from throughout the genome as well as across several tissue types. The results have identified 56 genes that are under positive selection in cultivated taxa of sunflower and lettuce. This corresponds to ∼11% of the analyzed genes, which is in accordance with previous studies of positive selection. In sunflower, several of these positively selected genes map coincident with QTL involved in domestication. While we were able to identify a significant number of positively selected genes, we have also identified several limitations to the use of EST sequences for similar evolutionary analyses. In particular, the short lengths of many unigenes and low divergence levels between taxa excluded many genes from these analyses. As a result, less than 0.5% of the genes in the original unigene set could be confidently identified as experiencing positive selection.