The DNA sequence corresponding to the amino acid sequence of a polypeptide is subject to a variety of forces that influence its evolution. Purifying, or negative, selection against mutations that change the amino acid sequence (i.e., replacement mutations) constrains evolution, while directional, or positive, selection can increase the frequency of beneficial mutations. Many mutations, however, are neutral or nearly neutral, and their fate is determined to a greater or lesser extent by stochastic processes. In many organisms, synonymous mutations that do not change the amino acid sequence are thought to be subject to weak selection that can influence a mutation's fate only when the population size is sufficiently large (Ohta 1973).

It has been known for some time that synonymous codon usage varies among genes within organisms. Studies on prokaryotes (Ikemura 1981, 1985; Gouy and Gautier 1982), single-celled eukaryotes (Bennetzen and Hall 1982; Ikemura 1985; Sharp and Devine 1989; Sharp and Cowe 1991; Coghlan and Wolfe 2000), and complex multicellular eukaryotes (Stenico et al. 1994; Duret and Mouchiroud 1999; Hey and Kliman 2002) have indicated a correlation between codon bias and gene expression. The prevailing view is that high levels of protein synthesis associated with high levels of gene expression generate a selective pressure to minimize the energetic cost of translation. In those organisms for which data are available, the codons used in the most biased genes (i.e., those inferred to be the “preferred” codons) tend to correspond to the most abundant tRNAs (Ikemura 1985; Moriyama and Powell 1997; Coghlan and Wolfe 2000), and this does affect translation speed (Berg and Kurland 1997). Other forms of selection on codon usage may include selection for accuracy of translation (Akashi 1994) and selection against certain mRNA secondary structures (Antezana and Kreitman 1999; Bernardi 2000).

While some of the among-gene variance in codon usage can be attributed to variation in selection pressure, regional variation in compositional bias also appears to play an important role. This is especially apparent in warm-blooded vertebrates (Bernardi et al. 1985; Bernardi 2000), where expression-associated selection on codon usage has not been supported. Kliman and Hey (1993, 1994) found a correlation between intron GC content and codon third position GC content (GC3) in Drosophila melanogaster, and it is clear that among-gene variance in expression-associated selection and compositional bias both contribute to variance in codon usage in this species.

For the past decade or so, several research groups have focused on codon bias, not simply to understand its causes, but as a model system for studying limitations in the effectiveness of natural selection. Kliman and Hey (1993) found that codon bias in D. melanogaster was reduced in regions of the genome with limited recombination (see also Comeron et al. 1999); Hey and Kliman (2002) recently confirmed this relationship using ~13,000 genes from the complete genome (Adams et al. 2000). The findings are consistent with the expected reduction in the effectiveness of selection on one target due to stochastic effects of selection on linked targets (Hill and Robertson 1966; Felsenstein 1974; Gillespie 2000; Comeron and Kreitman 2002)—the so-called Hill–Robertson effect. A second prediction, that codon bias would be reduced in closely spaced genes (due to reduced opportunity for recombination among targets), was partially confirmed by Hey and Kliman (2002); that is, the expected pattern was observed when gene spacing fell below ~1500–2000 bp. Another way in which selection conflicts can arise, akin to the Hill–Robertson effect, is if individual genes have more potential targets for selection. Li (1987) showed by computer simulation that longer genes will have lower usage of preferred codons, assuming complete linkage. Comeron et al. (1999) showed by simulation that codon bias would decrease with increasing gene length and, also, with decreasing recombination rate. McVean and Charlesworth (2000) also demonstrated a reduction in codon bias with decreasing recombination rate. Most recently, Comeron and Kreitman (2002) showed that weak selection (i.e., “interference selection”) on multiple targets can reduce codon bias and may even explain some of the codon usage patterns observed within Drosophila genes (Kliman and Eyre-Walker 1998).

In the present study, we use factor analysis to identify the major trends in codon usage in the S. cerevisiae genome. Using genomewide data on gene expression levels (DeRisi et al. 1997), we confirm that variance in codon usage is due, in part, to variance in expression levels, consistent with the model that certain preferred codons allow for more efficient translation of mRNA (see also Coghlan and Wolfe 2000). We then show that preferred codon usage covaries with recombination rate (Gerton et al. 2000) and gene length as predicted by selection conflict models. Finally, we find that, as in the case of D. melanogaster, preferred codon usage in yeast decreases as gene density increases. We propose that the latter is due to two forms of selection conflict: conflict due to Hill–Robertson effects and conflict due to antagonistic pleiotropy.

Materials and Methods

Yeast DNA Sequences

The 28 April 2002 annotations of the 16 completely sequenced yeast chromosomes (Goffeau et al. 1996) were downloaded from NCBI. A locus was initially selected for analysis if it met the following three criteria: (i) its start and stop codons were located at the positions indicated in the annotation, and there were no premature stop codons; (ii) if the region between the start and the stop codons overlapped the corresponding region of another locus, a third locus did not overlap either of the first two; and (iii) the locus was not the first or last listed, by location, for a chromosome. The latter were excluded from analysis because we cannot measure the distance to flanking genes on both sides for terminal loci (see gene spacing). A total of 6149 loci met all three criteria.

Measures of Gene Length, Gene Spacing, and Local Base Composition

Gene length was defined as the number of codons. As a measure of gene density, we used SBG (space between genes), previously defined by Hey and Kliman (2002). SBG is the mean of the distance, in base pairs, between the terminal bases of the coding sequence of the locus in question and the terminal bases of the coding sequence of the nearest loci to either side. SBG was calculated for all loci, regardless of whether they would be used later in the study.
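
The SBG definition above can be sketched in a few lines. This is only an illustration: the function name `space_between_genes`, the coordinate convention, and the treatment of overlapping neighbours as zero spacing are assumptions, not details taken from the study. It also assumes genes are not nested, so that the neighbours in sorted order are the nearest loci.

```python
def space_between_genes(genes):
    """Return {index: SBG} for the interior genes of one chromosome.

    genes: list of (start, end) coding-sequence coordinates, 1-based and
    inclusive, sorted by start. The first and last genes are skipped
    because they lack a flanking gene on one side.
    """
    sbg = {}
    for i in range(1, len(genes) - 1):
        left_end = genes[i - 1][1]
        start, end = genes[i]
        right_start = genes[i + 1][0]
        # Assumption: the distance counts the bases strictly between the
        # two coding sequences, and overlapping neighbours count as zero.
        left_gap = max(0, start - left_end - 1)
        right_gap = max(0, right_start - end - 1)
        sbg[i] = (left_gap + right_gap) / 2
    return sbg


# Made-up coordinates for four genes on one chromosome.
example = [(100, 400), (901, 1500), (1701, 2300), (5000, 5600)]
print(space_between_genes(example))  # {1: 350.0, 2: 1449.5}
```

Skipping the terminal genes mirrors the exclusion of the first and last loci on each chromosome.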

In many organisms, intron GC content can be used as a measure of local base composition. However, only 3–4% of yeast genes have introns. Instead, we use the GC content of 1000-bp regions flanking the terminal codons (GC_fl) on both sides of the gene. If either segment of flanking region extended into the coding region of an adjacent gene, we truncated the analyzed region to include only noncoding bases. On average, we analyzed 479 upstream flanking bases and 373 downstream flanking bases (with median values of 416 and 296 bases, respectively). Since noncoding regions include positions subject to functional constraint (e.g., transcription factor binding sites), it was possible that truncation would bias GC_fl. However, we found no correlation between the number of noncoding bases analyzed and GC_fl (r = −0.0173, p = 0.185). Regardless, some analyses were repeated using only flanking regions of 200 bases or more. [All statistical analyses presented here were performed using Statistica 6.0 (StatSoft, Inc. 2001).]
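
A minimal sketch of the truncated flanking-region calculation follows. The helper names, the 0-based half-open coordinates, and the pooling of both flanks into one value are assumptions made for illustration; the study also reports upstream and downstream regions separately.

```python
def gc_content(seq):
    """Fraction of G or C among unambiguous bases; None if there are none."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return None
    return (seq.count("G") + seq.count("C")) / counted


def flanking_gc(chrom, start, end, left_end, right_start, window=1000):
    """GC content of up to `window` noncoding bases on each side of a gene.

    Coordinates are 0-based and half-open: the gene is chrom[start:end],
    the left neighbour's coding region stops at index left_end, and the
    right neighbour's coding region begins at index right_start.
    """
    # Truncate each flank so that it never reaches into a neighbouring gene.
    left_lo = max(start - window, left_end, 0)
    right_hi = min(end + window, right_start, len(chrom))
    left = chrom[left_lo:start] if left_lo < start else ""
    right = chrom[end:right_hi] if right_hi > end else ""
    return gc_content(left + right), len(left), len(right)


# Toy chromosome: left gene, spacer, focal gene, spacer, right gene.
chrom = "AAAA" + "GGCC" + "ATGTAA" + "ATAT" + "CCCC"
print(flanking_gc(chrom, 8, 14, left_end=4, right_start=18))  # (0.5, 4, 4)
```

Returning the number of bases analyzed on each side makes it easy to restrict later analyses to flanks of at least 200 bases.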

Measures of Gene Expression and Recombination Rate

The expression data for seven time points during the S. cerevisiae diauxic shift (DeRisi et al. 1997) were obtained from http://cmgm.stanford.edu/pbrown/explore/array.txt . In their study, Cy5-labeled cDNA from each time point was applied to an ORF microarray, along with Cy3-labeled cDNA from a reference time point. The level of Cy5 fluorescence, corrected for background fluorescence and variation in the amount of RNA spotted on the array, serves as a measure of transcription. We calculated expression for each time point as E t = Cy5t × Cy3avg/Cy3t, where Cy3avg is the mean across the seven time points of the fluorescence of the Cy3-labeled cDNA corrected for background, Cy3t is the Cy3 fluorescence for a given time point corrected for background, and Cy5t is the Cy5 fluorescence for a given time point. The assumption behind this approach is that variation in Cy5 fluorescence across time points due to inconsistent array loading will be revealed by variation in Cy3 fluorescence. Unfortunately, the approach cannot correct for systematic among-gene bias in the average amount spotted. However, as the results will show, the inferred measures of relative expression appear to be quite informative. We define a composite expression value, E avg, as the mean of the seven E t values for a given gene. We also define E max as the maximum value of E t calculated for the seven time points for a given gene. From the data provided by the authors, we were able to match confidently 5847 of the 6149 loci that met our earlier criteria for inclusion in the study to loci in their data set.
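
The normalization can be written out directly. In this sketch the function name `expression_summary` is hypothetical, the inputs are assumed to be background-corrected already, and three made-up time points stand in for the seven used in the study. It does not guard against a Cy3 value of zero.

```python
def expression_summary(cy5, cy3):
    """Return (E_t list, E_avg, E_max) for one gene.

    cy5, cy3: background-corrected fluorescence values, one per time point.
    """
    cy3_avg = sum(cy3) / len(cy3)
    # E_t = Cy5_t * Cy3_avg / Cy3_t rescales each time point by how far
    # its reference signal departs from the gene's mean reference signal.
    e_t = [c5 * cy3_avg / c3 for c5, c3 in zip(cy5, cy3)]
    e_avg = sum(e_t) / len(e_t)
    e_max = max(e_t)
    return e_t, e_avg, e_max


# Made-up fluorescence values for a single gene.
e_t, e_avg, e_max = expression_summary([100, 300, 300], [50, 100, 150])
print(e_t, e_max)  # [200.0, 300.0, 200.0] 300.0
```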

Relative recombination rates have been estimated by Gerton et al. (2000) for most of the S. cerevisiae loci. The authors estimated relative recombination rate up to seven times for each locus; we defined relative recombination rate for a given locus as the mean of all available measures for that locus. We were able to assign confidently relative recombination rates to 5836 of the 5847 loci that met all prior criteria for inclusion in the study.

Indices of Codon Usage

Unequal usage of synonymous codons is often measured using the codon bias indices Chi/L (Shields et al. 1988) and ENC (Wright 1990). These indices do not explicitly distinguish the systematic overuse or underuse of specific codons; however, because many organisms consistently overuse a specific subset of codons, both Chi/L and ENC usually reflect usage of that subset. Alternative indices that reflect the usage of specific preferred, or optimal, codons have been developed (e.g., Sharp and Li 1987), but these indices require a priori identification of the preferred codons. Regardless, the correlations among codon bias indices tend to be strong.

Here, we use factor analysis to identify major trends in codon usage. The frequencies of synonymous codons within amino acid class were calculated for the 5836 loci that met all criteria for inclusion. These frequencies were entered into the factor analysis, and all factors with eigenvalues exceeding 1.0 were identified. The primary factor (Factor1) had an eigenvalue of 11.08, corresponding to 18.79% of the overall variance. The secondary factor (Factor2) had an eigenvalue of 4.73 (8.01% of the overall variance). All other factors had eigenvalues below 2.0. We used each gene's position along the Factor1 axis as a measure of the primary trend in codon usage. We used each gene's position along Factor2 as a measure of the secondary trend. Factor1 is highly correlated with Chi/L (r = 0.8206, p < 0.001) and ENC (r = −0.7770, p < 0.001); note that ENC decreases as codon usage becomes increasingly biased.
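
The input to the factor analysis, the frequency of each synonymous codon within its amino acid class, can be sketched as follows. The function name and the use of `None` for amino acids absent from a gene are assumptions; how the original analysis treated such cases is not stated. The standard genetic code is assumed, and the factor extraction itself (done in Statistica) is not reproduced here.

```python
from collections import Counter

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Keep only amino acids encoded by more than one codon (59 codons in total).
_by_aa = {}
for _codon, _aa in CODON_TO_AA.items():
    if _aa != "*":
        _by_aa.setdefault(_aa, []).append(_codon)
SYNONYMS = {aa: codons for aa, codons in _by_aa.items() if len(codons) > 1}


def synonymous_frequencies(cds):
    """Frequency of each codon among the codons for the same amino acid."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    freqs = {}
    for aa, codons in SYNONYMS.items():
        total = sum(counts[c] for c in codons)
        for c in codons:
            # Assumption: amino acids absent from the gene are left undefined.
            freqs[c] = counts[c] / total if total else None
    return freqs


freqs = synonymous_frequencies("ATGGCTGCTGCCAAATAA")
print(len(freqs), freqs["GCT"], freqs["AAA"])
```

Each gene then contributes one row of 59 frequencies to the matrix that is factored.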

“Major” vs. “Preferred” Codons

We defined major codons as those that were present in the genome at a frequency that exceeded 1/degeneracy of the amino acid. That is, the major codons were the ones used disproportionately across the genome. All other codons were deemed minor.
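
The 1/degeneracy rule is simple to state in code. The function name `major_codons` and the counts below are illustrative only; the counts are made up and are not taken from Table 1.

```python
def major_codons(counts, synonyms):
    """Return codons used more often than 1/degeneracy within their class.

    counts: genome-wide codon counts, e.g. {"AAA": 58, ...}
    synonyms: amino acid -> list of its synonymous codons
    """
    major = set()
    for codons in synonyms.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue
        threshold = 1 / len(codons)
        for c in codons:
            if counts.get(c, 0) / total > threshold:
                major.add(c)
    return major


# Two amino acid classes with made-up counts.
synonyms = {"K": ["AAA", "AAG"], "A": ["GCT", "GCC", "GCA", "GCG"]}
counts = {"AAA": 58, "AAG": 42, "GCT": 38, "GCC": 22, "GCA": 29, "GCG": 11}
print(sorted(major_codons(counts, synonyms)))  # ['AAA', 'GCA', 'GCT']
```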

Initial identification of preferred codons was based on the factor analysis. Since Factor1 is strongly positively correlated with Chi/L, the polarity of the primary factor is such that increased values of Factor1 correspond to greater codon bias. Thus, the 24 codons whose frequencies positively correlated with Factor1 were initially inferred to be preferred, and all others were inferred to be unpreferred. This relationship held up when we turned our attention to gene expression.
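
As a sketch of this classification step, each codon's frequency across genes can be correlated with the genes' Factor1 scores, and the sign of the correlation used to sort codons into preferred and unpreferred. The names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def infer_preferred(codon_freqs, factor1):
    """Codons whose per-gene frequency correlates positively with Factor1.

    codon_freqs: codon -> list of frequencies, one per gene
    factor1: list of Factor1 scores in the same gene order
    """
    return {c for c, f in codon_freqs.items() if pearson(f, factor1) > 0}


# Toy data for four genes and one twofold-degenerate amino acid.
factor1 = [-1.0, 0.0, 1.0, 2.0]
codon_freqs = {"AAG": [0.2, 0.4, 0.5, 0.8], "AAA": [0.8, 0.6, 0.5, 0.2]}
print(infer_preferred(codon_freqs, factor1))  # {'AAG'}
```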

Statistical Analysis of Influences on Codon Usage

Correlation coefficients were calculated for several variables (gene length, GC_fl, recombination rate, gene expression, etc.) with Factor1 and Factor2. We then used multiple regression to calculate partial correlations and to calculate residual values of Factor1. The relationship between gene expression and Factor1 was evaluated using both linear and curvilinear models, and the relationship between Factor1 and SBG was evaluated using either Factor1 directly or the residuals of Factor1 following regression on other variables. Further details and elaboration are provided under Results and Discussion.
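
For intuition, a partial correlation with a single control variable can be computed from the three pairwise correlations. This is a simplified stand-in: the study controls for several variables at once through multiple regression, which this sketch does not do, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy example: x and y are identical, so controlling for z leaves r = 1.
x = [1, 2, 3, 4, 5]
z = [2, 1, 4, 3, 5]
print(round(partial_correlation(x, x, z), 6))  # 1.0
```

This value equals the correlation between the residuals of x and y after each is regressed on z, which is the sense in which the residuals of Factor1 are used.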

Results

Identification of Major and Preferred Codons

Overall codon frequencies are listed in Table 1. All but two of the major codons (AGG and TGG) end in either A or T. This is consistent with an overall AT compositional bias of noncoding regions in the yeast genome: The mean GC_fl is 0.358, while the mean intron GC is 0.341. In contrast, of the 24 codons whose frequencies correlate positively with Factor1, 12 end in G or C (see Table 1). Third codon position GC content (GC3) correlates positively with Factor1 (r = 0.2177, p < 0.001), indicating a slight preference for G- and C-ending codons. All of the preferred codons initially identified by Bennetzen and Hall (1982) are included in our list, along with the addition of CGT (arginine). All of the preferred codons identified by Sharp and Cowe (1991) are included in our list, with the addition of CGT as well as GCC (alanine). While we will continue to use Factor1 as our measure of codon bias, we note that when frequency of optimal codons (F op [Ikemura 1985]) is calculated using the 24 identified preferred codons, the correlation between F op and Factor1 is 0.9788.
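
The frequency of optimal codons can be sketched as the share of preferred codons among all codons that belong to a degenerate amino acid class. The function name is hypothetical, and the exact set of amino acids included in published definitions of F op may differ from this simplification. The example uses only alanine and arginine, with GCC and CGT as the preferred codons, as a toy subset of the 24.

```python
def frequency_of_optimal_codons(cds, preferred, degenerate):
    """Fraction of scored codons that are preferred.

    cds: coding sequence, in frame
    preferred: set of preferred codons
    degenerate: set of all codons from amino acids with synonymous codons
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # Met, Trp, and stop codons carry no information and are not scored.
    scored = [c for c in codons if c in degenerate]
    if not scored:
        return None
    return sum(c in preferred for c in scored) / len(scored)


degenerate = {"GCT", "GCC", "GCA", "GCG",
              "CGT", "CGC", "CGA", "CGG", "AGA", "AGG"}
preferred = {"GCC", "CGT"}
# Codons: ATG GCC GCA AGA CGT TAA -> 2 preferred out of 4 scored.
print(frequency_of_optimal_codons("ATGGCCGCAAGACGTTAA", preferred, degenerate))  # 0.5
```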

Table 1 Overall codon usage

As expected, there is a strong positive correlation between Factor1 and E avg (Table 2). This correlation holds across time points (though it is weaker for E 6 and E 7), as well as for E max. Because E avg is not normally distributed (there is a long tail of highly expressed genes), we also calculated the correlation for log-transformed expression levels (r = 0.5825, p < 0.001) as well as Spearman's rank correlation (r = 0.4867, p < 0.001). The frequencies of all 24 codons inferred as being preferred by their positive correlation with Factor1 were positively correlated with E avg; likewise, the frequencies of all 35 presumed unpreferred codons were negatively correlated with E avg. Figure 1 contrasts overall codon usage with codon usage of the top 5% of genes ranked by E avg. All codons with frequencies positively correlated with Factor1 were used more often in the highly expressed genes than in all genes, and the converse is true for codons with frequencies negatively correlated with Factor1.
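
The three correlation measures used here (product-moment on raw values, product-moment on log-transformed values, and Spearman's rank correlation) can be compared on toy data. The helper names and the data are illustrative, and the log transform assumes strictly positive expression values.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Skewed toy expression values with a long upper tail.
expression = [1, 10, 100, 1000]
factor1 = [0.1, 0.2, 0.3, 0.9]
r_raw = pearson(expression, factor1)
r_log = pearson([math.log10(e) for e in expression], factor1)
r_rank = spearman(expression, factor1)
print(r_raw, r_log, r_rank)
```

The rank-based measure depends only on ordering, so it is unaffected by the long tail of highly expressed genes.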

Table 2 Product–moment correlations between Factor1 and levels of gene expression
Figure 1

Synonymous codon usage in all genes vs. highly expressed genes. Overall codon usage was calculated for all 5836 genes and for the top 5% (N = 292) of genes ranked by E avg.

If codon usage reflects, in part, selection for efficient translation, we expect that preferred codons would correspond to the more abundant iso-accepting tRNAs (Ikemura 1985). This is relevant for threefold, fourfold, and sixfold redundant classes, where two or more iso-accepting tRNAs are used for a given amino acid. We find a strong association between codon preference and tRNA gene copy number (Percudani et al. 1997); the preferred codons always correspond to tRNAs with the highest gene copy number. However, a few unpreferred codons (arginine codon CGC, glycine codon GGC, and proline codon CCG) use the same tRNA as a preferred codon, indicating that correspondence to an abundant tRNA does not alone guarantee preference.

Factor1 and Factor2 vs. Other Variables

In addition to correlating strongly with gene expression, Factor1 is negatively correlated with gene length and positively correlated with relative recombination rate, as predicted by models of selection conflict. When all genes are included in the analysis, there is no correlation between Factor1 and GC_fl, consistent with the finding that the preferred codons are split evenly between AT- and GC-ending (see Table 3). This changes slightly if analyses are limited to genes with flanking regions of at least 200 bp; a weak positive correlation between flanking GC content and Factor1 emerges (r = 0.0386, N = 5639, p = 0.004). If analyses are limited only to upstream flanking regions of at least 200 bp, there is a somewhat stronger correlation between flanking GC content and Factor1 (r = 0.1360, N = 4914, p < 0.001), though a negative correlation of similar magnitude is seen if the analysis is limited to downstream regions of at least 200 bp (r = −0.1361, N = 4100, p < 0.001). Thus, there is no clear relationship between the primary trend in codon usage and local base composition.

Table 3 Correlations between factors and other variables

While Factor1 correlates most strongly with gene expression and somewhat with recombination rate, Factor2 appears to reflect mainly compositional biases. Factor2 is very strongly correlated with GC3 (see Fig. 2) and more strongly correlated with GC_fl and recombination rate than is Factor1, though the correlation with GC_fl is weak (see Table 3). The correlations between Factor2 and GC_fl are considerably stronger, however, if we limit analysis to only flanking regions of at least 200 bp (upstream, r = 0.3248, p < 0.001; downstream, r = 0.3370, p < 0.001; combined, r = 0.3930, p < 0.001).

Figure 2

Factor1 vs. gene expression, flanking GC content, and GC3, as well as Factor2 vs. GC3.

Given that GC_fl may reflect regional compositional bias and that GC-biased gene conversion can explain some of the correlation between recombination rate and base composition (Eyre-Walker 1993; Marais et al. 2001; Birdsell 2002; Marais and Piganeau 2002), the correlations between Factor2 and these two variables are not surprising. In fact, recombination rate is positively correlated with several measures of GC content in yeast (codon first positions, r = 0.1530, p < 0.001; codon second position, r = 0.1409, p < 0.001; GC3, r = 0.4732, p < 0.001; GC_fl, r = 0.0045, p = 0.733). Recombination rate is also positively correlated with GC content of upstream and downstream flanking regions of at least 200 bp (upstream, r = 0.2520, N = 4914, p < 0.001; downstream, r = 0.1746, N = 4100, p < 0.001; combined, r = 0.2775, N = 5836, p < 0.001), though the correlations involving upstream or downstream regions nearly disappear if shorter flanking regions are included.

Gene Expression and Codon Bias vs. Gene Spacing

Hey and Kliman (2002) found a curvilinear relationship between SBG and both gene expression and preferred codon usage in D. melanogaster. As SBG decreased, both gene expression and preferred codon usage increased; however, when genes were within ~1500–2000 bp of other genes, the relationships reversed.

In yeast, we observe a positive correlation between SBG and three variables: E avg (r = 0.0385, p = 0.003), Factor1 (r = 0.1090, p < 0.001), and the residual values of Factor1 after multiple regression on gene length, GC_fl, E avg, and recombination rate (r = 0.1123, p < 0.001). While there is a slight hint at a curvilinear relationship (see Fig. 3), the r² values for fitting the data to second-order polynomials are not much greater than those for the strict linear model. It should be noted, however, that most values of SBG in yeast fall below 2000 bp, and the clear curvilinear relationship observed in D. melanogaster requires the inclusion of more distantly spaced genes. Also, because Factor1 correlates with GC3, we plotted GC3 vs. SBG (see Fig. 3). The relationship between GC3 and SBG is clearly different and cannot account for the relationship between Factor1 and SBG. The relationship between codon bias and gene spacing is discussed later in more detail.

Figure 3

Gene expression, Factor1, and GC3 vs. SBG. Genes were ranked by SBG and placed into nine bins of 584 genes and a tenth bin of 580 genes. Error bars represent 95% confidence intervals for each bin. Factor1 (residuals) are the residual values of Factor1 following multiple regression on GC_fl, gene length, E avg, and recombination rate.
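
The binning described in this caption can be sketched as follows. The function name is hypothetical, and the normal-approximation interval (1.96 standard errors) is an assumption, since the method used to compute the 95% confidence intervals is not stated.

```python
import math
import statistics


def binned_means(sbg, values, bin_size):
    """Rank genes by SBG, cut into consecutive bins, summarize each bin.

    Returns a list of (mean SBG, mean value, 95% CI half-width) per bin.
    The last bin holds whatever genes remain, so it may be smaller.
    """
    order = sorted(range(len(sbg)), key=lambda i: sbg[i])
    summary = []
    for lo in range(0, len(order), bin_size):
        idx = order[lo:lo + bin_size]
        vals = [values[i] for i in idx]
        mean = statistics.fmean(vals)
        if len(vals) > 1:
            # Assumption: normal approximation to the confidence interval.
            half = 1.96 * statistics.stdev(vals) / math.sqrt(len(vals))
        else:
            half = float("nan")
        summary.append((statistics.fmean(sbg[i] for i in idx), mean, half))
    return summary


# Toy data: six genes in two bins of three.
sbg = [5, 1, 3, 2, 4, 6]
values = [50, 10, 30, 20, 40, 60]
bins = binned_means(sbg, values, bin_size=3)
print([(b[0], b[1]) for b in bins])  # [(2.0, 20.0), (5.0, 50.0)]
```

With 5836 genes and a bin size of 584, this yields nine full bins and a tenth bin of 580 genes.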

Discussion

One of the first things that stands out in a genomewide survey of yeast codon usage is that the most commonly used codons are not necessarily the preferred codons. Twenty-four codons are used at a frequency higher than 1/degeneracy, and all but two end in A or T. Twenty-four codons are also positively correlated with both Factor1 and gene expression and, therefore, inferred to be the preferred codons. However, half of the preferred codons are minor codons; 11 of 12 of these end in C or G. There appears to be a strong compositional bias toward A and T, such that many of the preferred codons are underrepresented. In fact, three of the preferred codons (GCC, alanine; GAC, aspartic acid; and CAC, histidine) remain at a frequency below 1/degeneracy even in the most highly expressed 5% of genes.

Factor1, which represents the primary trend in codon usage, is positively correlated with recombination rate and negatively correlated with gene length. Both of these findings are consistent with studies on other organisms, and both are consistent with a model whereby the effectiveness of natural selection on linked targets is reduced. This effect, first suggested by Hill and Robertson (1966), is due to the difficulty, in the absence of recombination, in assorting character states at multiple sites into haplotypes composed entirely of preferred character states. The correlation between recombination rate and preferred codon usage has been well documented in D. melanogaster (Kliman and Hey 1993; Comeron et al. 1999; Hey and Kliman 2002), but there is an important difference between fruit flies and yeast. The preferred codons in Drosophila predominantly end in C or G, such that GC3 and preferred codon usage are essentially synonyms. Among-gene variation in base composition will change GC3 and, consequently, preferred codon usage. Marais et al. (2001) argued that the correlation between recombination rate and preferred codon usage in Drosophila is due mainly to the influence of recombination on base composition, and not to Hill–Robertson effects. However, Hey and Kliman (2002) found that, even after correcting for local base composition, the correlation between recombination rate and preferred codon usage remained.

Regardless, the situation in yeast is clearly different. Because exactly half of the preferred codons end in G or C, variation in local base composition may have little effect on overall preferred codon usage. As GC content increases, the increased usage of preferred G- or C-ending codons will balance the decreased usage of preferred A- or T-ending codons. In fact, there is no clear correlation between flanking GC content and Factor1. Therefore, the correlation between recombination rate and Factor1 is consistent with Hill–Robertson effects. However, we recognize that there is a strong correlation between GC3 and recombination rate and that there is some correlation between GC3 and Factor1. Thus, some of the covariance between recombination rate and Factor1 may reflect a more direct effect of recombination rate (by GC-biased gene conversion) on GC3. It is interesting that the secondary trend in codon usage is very strongly correlated with GC3, and more strongly correlated than is Factor1 with recombination rate. Thus, we should be cautious when suggesting that reduced usage of preferred codons in regions of lower recombination reflects Hill–Robertson effects in yeast, though the possibility should not be dismissed.

The negative correlation between gene length and preferred codon usage is also consistent with among-gene variance in the effectiveness of natural selection. Using computer simulations, Li (1987) showed that preferred codon usage will be reduced in longer genes, even if the codons are independently assorting (though the effect is weaker). A negative correlation between gene length and preferred codon usage has been observed in Drosophila (Moriyama and Powell 1998; Comeron et al. 1999). Kliman and Eyre-Walker (1998) also found that preferred codon usage decreased toward the 3′ end of genes in Drosophila, and this would lead to decreased preferred codon usage in longer genes.

The strong correlation between gene length and the secondary trend in codon usage is intriguing, since Factor2 seems to reflect mainly local base composition. However, we cannot at this time offer a clear cause-and-effect explanation for the relationship between gene length and Factor2. Gene length and recombination rate are negatively correlated in yeast (r = −0.1775, p < 0.001), and gene length and upstream flanking GC are positively correlated (r = 0.0993, p < 0.001). (The correlation with overall flanking GC is similarly positive if regions of at least 200 bp are used.) Duret and colleagues (1995) have reported that longer genes tend to be found in low-GC isochores in vertebrates, suggesting a correlation between base composition and gene length. However, cause and effect are unclear, and the lower gene density of low GC isochores may contribute to this statistical association (Bernardi 2000). It may simply be the case that selection on codon usage is weaker on longer genes, such that the usage of A- and T-ending codons is more likely to increase due to the overall AT-biased mutation pressure in yeast.

The negative correlation we observe between gene length and codon bias contrasts with the positive correlation found by Coghlan and Wolfe (2000). Log-transformation of gene length does not appreciably change our value for the correlation coefficient; r changes from −0.0702 to −0.0758. Coghlan and Wolfe pointed out that a product–moment correlation may be inaccurate due to violation of the assumption of normality, and they chose to use Spearman's correlation. They also first removed the effect of gene expression, which they argued confounds the relationship between gene length and codon bias. To explore the discrepancy in our results, we regressed Factor1 on E avg and then calculated Spearman's correlation between the residuals and gene length. The correlation coefficient was now positive but indistinguishable from zero (Spearman's r = 0.0082, p = 0.5311). If we use residuals from a curvilinear regression of Factor1 on E avg (see below), Spearman's correlation is −0.0137. One assumption behind using residuals is, of course, that the independent variable is accurately measured, and there is certainly going to be error in estimating gene expression levels from microarrays. Still, our correlation between codon bias and gene expression is similar to that obtained by Coghlan and Wolfe (2000). Our gene data sets are, however, somewhat different. We used nearly 400 fewer genes, as our analyses were limited to those for which we had recombination rates, expression levels, and SBG values. This is admittedly an unsatisfying ad hoc explanation for the discrepancy, and further investigation may be warranted.

The association between gene density and preferred codon usage is of particular interest. Hey and Kliman (2002) predicted a negative correlation between gene density and preferred codon usage on the basis of Hill–Robertson effects. Specifically, by reducing the opportunity for recombination among targets of selection, decreased space between genes should interfere with purifying selection on synonymous codon usage. However, the overall trend observed was an increase in preferred codon usage as gene spacing decreased, though it was clear that the relationship was not linear. As gene spacing continued to decrease, the trend reversed, and preferred codon usage decreased. Hey and Kliman proposed that two factors were at play. Eukaryotic gene expression is influenced by chromatin structure (Gasser et al. 1998; Farkas et al. 2000; Bell et al. 2001), and the loosening of chromatin necessary for expression will affect a region that may include nearby genes. Thus, when high gene expression is selectively favored, chromosome rearrangements or deletions that place a highly expressed gene of interest near another highly expressed gene would be favored. However, at some point, placement of genes too near each other would interfere with expression. Here, there are two distinct models. Hill–Robertson effects could interfere with selection on codon usage and, therefore, the efficiency of translation. This, however, would not decrease the level of transcription. Alternatively, closely spaced genes will have a greater tendency to share control elements, so individual sites may have pleiotropic effects. Mutations that increase the expression of one gene will not necessarily do the same for the other, and such antagonistic pleiotropy will make it more difficult to optimize control regions for maximum transcription of both genes. Consequently, usage of preferred codons may decrease.

The gene expression data for yeast are of much better quality than the expression data available at the time for Drosophila. The very strong correlation between E avg and Factor1 suggests that the former is a fairly reliable estimate of expression levels. Just as Hey and Kliman saw for the more closely spaced Drosophila genes, both gene expression and Factor1 decrease as the space between genes decreases in yeast. It is striking, however, that the decrease in Factor1 is nearly the same when we correct for variation in gene expression (as well as other variables) by multiple regression. Thus, a decrease in preferred codon usage in closely spaced genes does not appear simply to reflect a more direct effect on expression (i.e., that fitness effects of unpreferred codon usage are simply smaller because efficiency of protein synthesis is less important). Hill–Robertson effects may, therefore, play a role in the small, but significant, decrease in preferred codon usage in closely spaced yeast genes.

To explore this possibility further, we predicted that antagonistic pleiotropy should be more obvious if we measure SBG only using upstream regions of genes, and it should be less obvious if we measure SBG only using downstream regions of genes. Our reasoning was that the most important control elements of the promoter tend to be located upstream of the transcription start site, so selection conflicts associated with pleiotropic effects should be more evident there. However, this reassessment of gene density does not affect the results in an obvious way (see Fig. 4). Decreases in either upstream or downstream SBG are associated with decreases in Factor1 (corrected for gene expression and other variables). We should note, however, that the correlation between upstream SBG and Factor1 (r = 0.1141) is significantly greater than the correlation between downstream SBG and Factor1 (r = 0.0497). These correlations are essentially unchanged if we substitute for Factor1 the residuals of Factor1 following the multiple regression described earlier.

Figure 4

Gene expression and Factor1 residuals vs. upstream and downstream SBG. Genes were ranked by upstream SBG (SBG_up) or downstream SBG (SBG_down) and placed into nine bins of 584 genes and a tenth bin of 580 genes. Error bars represent 95% confidence intervals for each bin.

Because the relationship between gene expression and Factor1 may not be linear, simple linear regression may not entirely remove the effect of gene expression. We computed a second-order polynomial relating Factor1 to E avg, and r² (0.4000) was slightly, but significantly, greater than that of the linear model (r² = 0.3534). [The second-order polynomial using a log transformation of gene expression had an intermediate value of r².] Thus, more of the effect of gene expression should be removed by saving the residuals from the fitting of the second-order polynomial. In line with our earlier analyses, we performed multiple regression of the residuals on gene length, GC_fl, and recombination rate. The residuals from this multiple regression were then saved. The correlations between these Factor1 residuals and SBG were essentially unchanged (SBG, r = 0.1104; upstream SBG, r = 0.1122; downstream SBG, r = 0.0539). Thus, it appears that the finding is robust and that codon bias decreases as genes become more closely spaced. The effect is somewhat stronger for upstream gene spacing than for downstream gene spacing, but the nonzero correlation between codon usage and the latter may indicate subtle Hill–Robertson interference.
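
The comparison of a linear fit with a second-order polynomial, and the saving of residuals, can be illustrated with a small least-squares routine. This is a sketch under stated assumptions: the function names are hypothetical, the fit uses plain normal equations (which can be numerically fragile for large, unscaled predictors), and the data are made up.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def poly_fit(x, y, degree):
    """Least-squares polynomial fit; returns (coefficients, r2, residuals)."""
    powers = range(degree + 1)
    a = [[sum(xi ** (p + q) for xi in x) for q in powers] for p in powers]
    b = [sum(yi * xi ** p for xi, yi in zip(x, y)) for p in powers]
    coef = solve(a, b)
    fitted = [sum(c * xi ** p for p, c in enumerate(coef)) for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    mean_y = sum(y) / len(y)
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return coef, 1 - ss_res / ss_tot, residuals


# Toy data with a purely quadratic relationship.
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]
_, r2_linear, _ = poly_fit(x, y, 1)
_, r2_quadratic, residuals = poly_fit(x, y, 2)
print(round(r2_linear, 4), round(r2_quadratic, 4))  # 0.9195 1.0
```

The residuals from the better-fitting model are what would be carried forward into the subsequent multiple regression.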

Our analyses are consistent with multiple forms of selection conflict associated with evolution of codon usage in yeast. Polymorphism data, as well as orthologous sequences from close relatives of S. cerevisiae, would allow us better to infer among-gene variation in the intensity and effectiveness of natural selection, as we could use the suite of analytical tools developed for analogous studies in Drosophila (Akashi 1995; Akashi and Schaeffer 1997; Kliman 1999).