Introduction

Synonymous codons are used with unequal frequency between organisms, within genomes, and even within genes (Hooper and Berg 2000; Ikemura 1981; Kawabe and Miyashita 2003; Merkl 2003; Qin et al. 2004). This phenomenon, termed codon usage bias, is now a widely reported occurrence, following initial reports by Grantham et al. (1980).

Mutation, drift, and natural selection all may influence synonymous codon usage. While selection on codon usage has been well supported by observations in prokaryotes (Fuglsang 2003a; Ikemura 1981; Singer and Hickey 2003) and insects (Dunn et al. 2001; Herbeck and Novembre 2003; Moriyama and Powell 1997; Morlais and Severson 2003), it has been detected in vertebrates (DeBry and Marzluff 1994; Lavner and Kotlar 2005; McWeeney and Valdes 1999; Musto et al. 2001; Resch et al. 2007; Romero et al. 2003; Yang and Nielsen 2008) amid controversy over the role of forces that influence nucleotide composition such as mutation bias and biased gene conversion (Bernardi 2000a, b; Chen et al. 2004; Duret and Galtier 2009; Duret et al. 2002; Eyre-Walker 1999; Lercher et al. 2002; Semon et al. 2005; Smith and Eyre-Walker 2001; Sueoka and Kawanishi 2000). This controversy remains particularly strong in mammals where, even as various methods have identified selection acting on codon usage, there is little agreement as to its mechanistic origins such as selection for correct mRNA folding, translational efficiency, or selection against nonsense errors (SANE) (see Lavner and Kotlar 2005; Lin et al. 2003; Resch et al. 2007; Yang and Nielsen 2008 for differing examples).

The myosin heavy-chain (MyHC) family of contractile proteins is among the most highly expressed proteins in vertebrates, where body mass may be up to 60% muscle, such as in fishes (Sänger and Stoiber 2001), and contractile proteins may comprise more than 20% of muscle cell mass. The MyHC protein is large at over 200 kDa, and has a relatively high turnover and expression, representing a large fraction of total metabolic expenditure. During our frequent work with these genes in studies of muscle plasticity in novel organisms, we observed a strong and varied codon usage bias in vertebrate MyHC mRNA coding sequences. This observation is notable because of its occurrence in a wide taxonomic array of vertebrates, including amphibians, fishes, birds, and mammals. Our search for underlying explanations of this pattern has led us to propose that SANE may profoundly influence codon usage in these genes.

Nonsense errors produce prematurely terminated, and potentially dysfunctional, proteins during translation making frequent nonsense errors costly for such highly expressed and large proteins as MyHC. This may be an important energetic incentive to drive SANE in this family of genes, although few studies have detected strong evidence for SANE in vertebrates. MyHC is a model family of genes for the study of functional evolution in muscle (McGuigan et al. 2004), and understanding additional forces shaping MyHC evolution contributes another perspective. A fascinating component of analysis is the variety of protein isoforms of MyHC tandemly arrayed on separate chromosomes (Mahdavi et al. 1984; Matsuoka et al. 1989), which are broadly categorized from cardiac or skeletal muscle, or as slow and fast myosin ATPase isoforms.

The detection of selection on silent sites, regardless of the driving factor, is difficult since forces that influence silent-site GC content (often measured by third-site GC composition, GC3), such as biased gene conversion, may confound traditional phylogenetic tests for selection (Duret and Galtier 2009). Detecting patterns expectant of a specific mechanism of selection on silent sites, independently of phylogenetic analysis, may be an informative way of detecting selection acting on codon usage, so long as the mechanism is hypothesized a priori.

We argue that if SANE is present in MyHC genes, then the following patterns should be detected: (1) Selection acting on highly expressed genes should promote the skewing of codon usage toward more favorable codons (or away from less favorable codons). If the phenomenon is robust, then we should observe codon usage bias in MyHC genes across a range of taxa, albeit with specific codon preferences for each organism. Furthermore, the bias should be independent of GC3 in so far as properties associated with SANE, such as tRNA abundances, are GC3-independent. (2) The pattern of bias should act to lower the expected cost of translation. The production of nonfunctional peptides due to nonsense errors exerts an energetic cost on an organism (Bulmer 1991; Eyre-Walker 1996; Kurland 1992), as well as additional costs to the cell for management of the incomplete and potentially toxic peptides (Menninger 1978). Therefore, excessive errors may impose a fitness disadvantage to an organism, tying up resources essential to cell function. Therefore, the pattern of bias should act to lower the expected cost of translation. (3) We should observe an intragenic spatial pattern of codon bias that is lower at the 5′ end and increase toward the 3′ end of the gene. SANE is thought to preferentially minimize nonsense errors that occur later in translation since these errors are more energetically costly (Bulmer 1988; Eyre-Walker 1996; Liljenstrom and von Heijne 1987). Therefore, unlike selection against missense errors or for proper mRNA folding, SANE should impose this specific pattern of increasing bias toward the 3′ end. This pattern has been observed previously among various organisms (Gilchrist et al. 2009; Qin et al. 2004).

Vertebrates, and mammals in particular, commonly show marked codon usage bias due to the influence of GC-rich isochores (Eyre-Walker and Hurst 2001). Isochores themselves may have arisen due to processes such as biased gene conversion, which are influenced by recombination rates thereby imparting silent-site nucleotides with the hallmarks of selection (Duret, and Galtier 2009). Thus, selection on codon usage is difficult to identify via phylogenetic analysis over a background of biased gene conversion (Duret and Galtier 2009; Chamary et al. 2006).

Therefore, we utilize an array of procedures to determine if the hypothesized patterns of codon usage are present in MyHC genes, while avoiding phylogenetic analyses that may provide spurious results for tests of selection. We examine the degree and spatial variation of codon bias in ≈100 MyHC genes and ≈40 species across five classes of vertebrates. We determine the influence of isoform type and taxa on codon bias, GC3, and elongation rate. In addition, using a mathematical model for the energetic cost of translation (Gilchrist 2007), we determine relationships between codon usage and expected translation cost. We demonstrate that patterns of bias in vertebrate MyHC genes are consistent with energetic arguments for SANE. These results collectively indicate that SANE may have played an important role in the evolution of vertebrate MyHC and other vertebrate genes. This data set may aid in the development of more rigorous procedures for detection of translational selection.

Methods

Novel Sequence Generation

Many of the major vertebrate taxa are represented to some degree by existing published sequence information. However, MyHC nucleotide information is still sparse in birds and reptiles; interestingly, elasmobranch fishes were not represented. We therefore obtained 10 novel MyHC sequences from three species of elasmobranchs, in addition to other sequences we have collected from related projects on muscle physiology of reptiles. Sequences were generated using a commonly employed 3′ RACE protocol to clone isoforms of MyHC (Andersen et al. 2005; Medler and Mykles 2003; Rourke et al. 2004, 2006). Elasmobranch tissue samples were kindly provided by Dr. Christopher Lowe and the California State University Long Beach Sharklab, and by Dr. Lance Adams of the Long Beach Aquarium of the Pacific. Heart tissue, and red and white skeletal muscle samples were obtained freshly and cloned as in Rourke et al. (2004).

Briefly, total RNA was isolated (Tri Reagent, Molecular Research Laboratories), and reverse-transcribed to cDNA (3′ RACE, Invitrogen). A pair of PCR primers was used to isolate myosin sequences, both representing a region 495 bp upstream of the stop codon (agaaggagcaggacaccag; agaaggagcaggacaccagt), and based on similarity to teleost sequences. Sequences thus isolated by PCR were generally a mix of at least two isoforms, and were ligated into pGEM vectors (Promega), and then used to transform DH5a cells (Invitrogen). Approximately 50 purified plasmids containing the MyHC insert were sequenced per species (CEQ Kit, Beckman, and more recently by a commercial sequencing service, Macrogen USA).

Retrieved Sequences

All additional sequences were retrieved from GenBank, or are previously published and unpublished sequences generated by the authors during investigations of novel vertebrate myosin genes. The sequence names and accession numbers used in the analyses are listed in Table 1. Full sequences (denoted by an asterisk) were characterized as complete coding sequences from start to stop codon (n = 54); short sequences were characterized as the last ≈500 bp from a region of nearly complete identity (agaaggarcargayaccag), to the stop codon (n = 94). The dataset represents the extent of currently available vertebrate adult MyHC genes that are either full-length coding sequences or partial sequences which contain the last 500 bp.

Table 1 Myosin heavy-chain isoform nucleotide sequences and Monte Carlo results

Measurements of Codon Bias

Four measures of codon bias were used to examine the codon usage in each gene.

  1. 1)

    The effective number of codons (ENC) is a broadly used measure of bias that ranges from 20 for the highest bias to 61 for uniform bias, representing an estimate of the number of codons used in a sequence to represent all 20 amino acids. The ENC was calculated using Wright’s (1990) formula. Wright’s method was shown to have less dependence on sequence length than other methods (Fuglsang 2003b, 2006).

  2. 2)

    Karlin et al.’s (1998) B statistic was one of the early attempts to measure codon bias while accounting for nucleotide content based on a reference sequence. The B statistic ranges from 0 for no bias, to 2 for the highest bias, although Karlin et al. (1998) stated that it rarely exceeds 0.5 among yeast genes. Reference frequencies for full-length and short sequences assumed uniform codon usage.

  3. 3)

    The relative codon usage bias score (RCBS) is a recently developed method of calculating codon bias that attempts to account for background nucleotide content (Das et al. 2009). The RCBS may be between 0 and 1 (lowest to highest bias).

  4. 4)

    Synonymous codon usage order (SCUO) (Wan et al. 2006) is a method that uses Shannon information theory to describe the orderliness of a gene. Admissible values of SCUO may be between 0 and 1 with larger values indicating stronger bias. The SCUO was shown to be strongly correlated with the codon adaptation index (CAI) and is positively correlated with mRNA copy number in yeast.

Codon bias measures are invariably influenced by amino acid sequence. For example, a sequence of codons that are all from a twofold degenerate (d2) class of amino acids cannot have the diversity of codon usage as a sequence constructed from codons that are all from a fourfold degenerate class (d4) of amino acids. Therefore, it is important to take into account the influence of the mean degeneracy of a sequence on codon bias. Furthermore, codon bias due to translational selection may be constrained, or even masked, by background nucleotide content. We therefore examined the mean degeneracy and mean GC3 for each sequence.

Relative Codon Usage Analysis

Correspondence analysis (CA) was used to examine codon usage patterns among short sequences (Fig. 1). Short sequences were used for this analysis due to the greater number, and breadth of taxa, of available sequences than their full-length counterparts. Principal axes were determined by singular value decomposition of the overall codon frequencies from a table of codon counts. Since all sequences had similar numbers of codons (≈500 bp), codon frequencies were not normalized to sequence length or amino acid usage (Perriére and Thioulouse 2002).

Fig. 1
figure 1

Correspondence analysis of MyHC sequences (short only). The two CA axes with the largest inertia are both indicators of codon usage bias, representing GC3-dependent (a) and GC3-independent factors (RCBS, b). The factors shaping codon usage are both species- and type-specific (slow or fast), evident by the clustering of sequences (c). Orthogonality of the correspondence axes predicates their statistical independence, indicating statistical independence between RCBS, and GC3

Model of Translation Cost

To examine the relationship between translation cost, nonsense errors, and codon usage, we used a mathematical model of translation proposed by Gilchrist and Wagner (2006) and further developed by Gilchrist (2007) which, using estimates for the translation rate of individual codons, calculates a estimate for the expected cost of translation of a single transcript, taking into account the expected cost of a nonsense error at each site.

To summarize the model, let \( \vec{c} = \{ c_{1} , \ldots ,c_{n} \} \) be the vector of elongation rates \( c_{i} \) corresponding to the codons of a given transcript of length n. If b is the constant nonsense error rate, then the probability of translating up to and including codon i is given by

$$ \sigma_{i} (\vec{c}) = P({\text{Translation up to }}i|\vec{c}) = \prod\limits_{j = 1}^{i} {\frac{{c_{i} }}{{c_{i} + b}}} . $$
(1)

We can then compute the expected energetic cost of nonsense errors by

$$ \xi (\vec{c}) = \frac{1}{{1 - \sigma_{n} (\vec{c})}}\sum\limits_{i = 2}^{n} {(a_{1} + a_{2} i)} \sigma_{i - 1} (\vec{c})\frac{b}{{c_{i} + b}} , $$
(2)

where \( a_{1} \) and \( a_{2} \) represent the energetic investment of 2 and 4 phosphate bonds for ribosome initiation and elongation, respectively. This estimate can be used to approximate the expected cost of translation for transcript \( \vec{c} \), given by

$$ \eta (\vec{c}) = \sigma_{n} (\vec{c})\sum\limits_{i = 0}^{\infty } {(1 - \sigma_{n} (\vec{c}))^{i} } (\xi (\vec{c})i + a_{1} + a_{2} n) = \left( {\frac{1}{{\sigma_{n} (\vec{c})}} - 1} \right)\xi (\vec{c}) + (a_{1} + a_{2} n). $$
(3)

Note that the above description differs slightly from Gilchrist (2007). The starting index of the summation in (2) was changed from 1 to 2 in order to account for the subscript i − 1 on the translational completion probability \( \sigma \), since there is no codon 0; and a 1 and a 2 are 2 and 4, respectively, as was first described in Gilchrist and Wagner (2006).

Using this model, we used estimated elongation rates from tRNA gene copy numbers to determine \( \vec{c} \) as input to compute values for Eqs. 13. η and ξ were calculated for each of the 38 sequences for which both full-length MyHC mRNA coding sequences and tRNA gene copy numbers were available. To our knowledge there are no reliable estimates of b, and the energetic requirements for translation may be greater than those characterized by a 1 and a 2. However, it is noteworthy that Gilchrist et al. (2009) obtained results using a related method that were relatively insensitive to a 1, a 2, and b. This strengthens the argument that the translation rate input parameter \( \vec{c} \), is the dominant factor determining η and ξ, and minimizes the role of uncertainty in a 1, a 2, and b in the current study.

Elongation Rates

Using tRNA gene copy number as a proxy for tRNA abundance we assumed that tRNA abundance is proportional to elongation rate (Lavner and Kotlar 2005). Following Gilchrist and Wagner (2006), we normalized the tRNA elongation rates to have an average of 10 codons per second for each species, and reduced the rate of G–U and I–C wobble codons by 39 and 36%, respectively, compared to the Watson–Crick base pairing (Curran and Yarus 1989). The harmonic mean of \( \vec{c} \), denoted by c H , was also calculated for each sequence (Gilchrist and Wagner 2006). Gene copy numbers of tRNA for Bos taurus, Equus caballus, Homo sapiens, Mus musculus, Rattus norvegicus, Canis familiaris, Felis catus, Gallus gallus, Danio rerio, Pan troglodytes, Macaca mulatta, and Takefugu rubripes were retrieved from the Genomic tRNA database (Chan and Lowe 2008). Differences in tRNA abundance between GC- and AT-ending codons were determined by the 2-sample Kolmogorov–Smirnov test.

Analysis and Scaling of η and ξ

Since the expected translation cost (η) and the expected cost of nonsense errors (ξ) are sensitive to the codon usage, the amino acid composition, and the specific sequence of codons, we wanted to fairly compare the observed η and ξ, which we will denote as η Obs and ξ Obs, with the distribution of possible ηs and ξs for the same set of amino acids. Since full-length MyHC mRNA coding sequences have ≈1800 codons with average degeneracy of ≈3.3, we can estimate that any given MyHC gene would have 3.31800 unique codon sequences for the same protein. This makes computation of a true distribution of possible ηs and ξs infeasible. We therefore used Monte Carlo (MC) simulation to construct approximate η and ξ distributions to compare with the observed values, η Obs and ξ Obs.

Two separate MC simulations per sequence were conducted using different sequence variation methods. For the first set of simulations, at each MC iteration, we randomized the usage of codons for each amino acid while keeping the amino acid sequence fixed. We retained the overall frequency of GC3 by setting the probability of choosing a G- or C-ending codon for each amino acid equal to the observed frequency of GC-ending codons of the sequence. For the second analysis, at each iteration, we randomly permuted the sequence of codon positions while retaining the original set of codons for the sequence. For each sequence variation method, one thousand MC runs per sequence were generated to estimate the distributions of η and ξ for statistical analysis.

Since gene lengths and elongation rates are fixed, there exist maximum and minimum possible values of η. We calculated η Max and η Min by computing a \( \vec{c} \) with only the slowest and fastest (respectively) codons for each amino acid sequence. We could then scale η as

$$ y = \frac{{\eta - \eta_{\text{Min}} }}{{\eta_{\text{Max}} - \eta_{\text{Min}} }} . $$
(4)

Gilchrist et al. (2009) showed that the best approximating distribution of η is a gamma distribution, when compared to the Weibull, normal, and log-normal distributions. However, since the domain of η is finite, the random variable Y has a distribution that can be best approximated by a Beta(θ1, θ2) distribution with parameters θ1 and θ2 such that the probability density function of Y is given by

$$ f(y|\theta_{1} ,\theta_{2} ) = \frac{1}{{B(\theta_{1} ,\theta_{2} )}}y^{{\theta_{1} - 1}} (1 - y)^{{\theta_{2} - 1}} , \, 0 \le y \le 1, \, \theta_{1} > 0, \, \theta_{2} > 0 , $$
(5)

where \( B(\theta_{1} ,\theta_{2} ) \) is the beta function. The MC samples were used to determine the maximum-likelihood estimates of θ 1 and θ 2 numerically using the mle() function in Matlab (Mathworks, Natick, MA). Thus, the observed value could be scaled as

$$ y_{\text{OBS}} = {{(\eta_{\text{Obs}} - \eta_{\text{Min}} )} \mathord{\left/ {\vphantom {{(\eta_{\text{Obs}} - \eta_{\text{Min}} )} {(\eta_{\text{Max}} - \eta_{\text{Min}} }}} \right. \kern-\nulldelimiterspace} {(\eta_{\text{Max}} - \eta_{\text{Min}} }}).$$

Maximum and minimum values of ξ cannot be generated by simply choosing the corresponding maximum and minimum values in \( \vec{c} \). We therefore elected to transform the MC sample of ξ from each sequence to a t-statistic, i.e.,

$$ t_{\text{Obs}} = \frac{{\xi_{\text{Obs}} - \bar{\xi }}}{{s_{\xi } }} , $$
(6)

where \( \bar{\xi } \) and \( s_{\xi } \) are the MC sample mean and standard deviation, respectively.

Sample histograms and observed values of y and t for a representative sequence is plotted in Fig. 2. Using this approach, we can test the hypothesis that the scaled values y Obs and t Obs are lower than can be explained by random variation by computing the probabilities P(y Obs < Y) and P(t Obs < T) where Y and T are the beta and t random variables, respectively.

Fig. 2
figure 2

Representative comparisons of the observed value for the expected cost of translation \( \eta_{\text{Obs}} \) (top) and expected cost of nonsense errors \( \xi_{\text{Obs}} \) (bottom) with their respective distributions from MC simulations. Black vertical lines represent the observed values. Bars represent the sample histograms from the MC simulations. Fitted beta- and t-distributions overlay the histograms for comparison. Normalized variables are on the abscissae

Verification of Monte Carlo Calculations

In order to verify the accuracy of the MC samples, we compared results from the MC simulations to results from analytical formulas in Gilchrist et al. (2009). They demonstrate that the expectation and variance of η can be determined exactly, given a pre-determined set of codon usage probabilities. If E[y] = μ and Var[y] = σ 2 are the expectation and variance of y in Eq. 4 , then the parameters of the beta distribution can be determined exactly by the method of moments formulas θ1 = μ(μ(1 − μ)/σ2 − 1) and θ2 = (1 − μ)(μ(1 − μ)/σ2 − 1).

It should be noted that there are slight differences in the formula for η. In particular, if we let \( \beta_{i} = a_{1} + a_{2} (i - 1), \) as in Gilchrist et al. (2009), then the Gilchrist et al. (2009) η is given by

$$ \eta = \frac{1}{{\sigma_{n} }}\mathop \sum \limits_{i = 1}^{n} \beta_{i} \sigma_{i - 1} (1 - p_{i} ) + \beta_{n + 1} $$

where σ is given by Eq. 1, and \( p_{i} = \frac{{c_{i} }}{{c_{i} + b}}. \) However, using the same notation, η given in Gilchrist (2007) is

$$ \eta = \frac{1}{{\sigma_{n} }}\mathop \sum \limits_{i = 1}^{n} \beta_{i + 1} \sigma_{i - 1} (1 - p_{i} ) + \beta_{n + 1} . $$

The values we utilize in this article are from Gilchrist (2007) and Gilchrist and Wagner (2006). Verification used formulas from those earlier papers as well.

Another difficulty with verification is that there is no easily determined analytical expression for the expectation and variance of ξ. Therefore, we compared a related quantity, E[Cost|p] from equation (3) of Gilchrist et al. (2009), which does have an analytical expression, to the results of a MC sample of the same quantity.

In examining the analytical formulas in Gilchrist et al. (2009), we also identified two minor typographical discrepancies in that paper. In equation (7), the superscript of the first summation symbol of the second term should be n − 1, not n. In equation (10) of that paper, the last factor is given by

$$ \mathop \prod \limits_{k = j + 1}^{n} E\left[ {\frac{1}{{p_{k} }}} \right]^{2} $$

but should be

$$ \mathop \prod \limits_{k = j + 1}^{n} E\left[ {\frac{1}{{p_{k}^{2} }}} \right]. $$

We implemented these corrections in our verification.

Samples from MC simulations were compared to analytical distributions by the 1-sample Kolmogorov–Smirnov test. None of the MC samples were found to be distributed significantly differently from their analytical counterparts.

Analysis of Intragenic Spatial Patterns of Codon Usage

Intragenic spatial patterns of codon usage were analyzed for full-length sequences. Since codon bias measurements may be unreliable when using short sequences (Supek and Vlahoviček 2005), we computed moving window averages of codon bias, average degeneracy, GC3, and c H for each sequence using a 300-codon window and a 50-codon step size (qualitatively similar results were achieved with step sizes 1–300 and window sizes 200–500, results not shown). Moving window averages were normalized and centered for comparison in the following way: For each sequence property x (GC3, degeneracy, codon bias), the property in sequence j, at window position i is given by x ij , i = 1, …, n j , j = 1, …, K, where sequence j is of length n j . Quantitative intragenic variations are dependent upon genomic factors specific to the evolutionary history of each organism (e.g., strength of compositional bias, differences in translation rates between competing codons, etc.) allowing for large variations in some species, but not in others. Therefore, in order to fairly examine the qualitative spatial properties in all sequences, each position was normalized to w ij  = (x ij  − x ·j )/x ·j where x ·j is the arithmetic mean of x ij over i, making w ij a relative deviation from the sequence mean. The mean w ij overall sequences, w i· was then computed at each position i, as well as the 95% confidence envelope.

This process was repeated separately for codons from each degeneracy group at each window position. Sixfold degenerate (d6) codons were analyzed by moving window analysis both as a single, pooled group, as well as separated into d2 and d4 parts. For codon usage measures requiring reference frequencies (Karlin’s B and RCBS) whole-sequence frequencies were used as a reference for each window.

Statistical Analyses

Statistical results for MC simulations, MC validation, CA, and intragenic analysis were generated in Matlab (Mathworks, Natick, MA). All other statistical procedures were carried out in JMP 7 (SAS, Cary, NC). The threshold of statistical significance for each MC analysis was determined to maintain a False Discovery Rate of q* = 0.05 (Benjamini and Hochberg 1995).

Results

Codon Usage and GC3

Means and standard deviations of codon bias and GC3 for the full-length sequences are listed in Table 2, and for short-length sequences in Table 3, by isoform type and taxa. Variations between slow and fast genes in GC3 may be attributed to chromosome-dependent differences in compositional bias. Both GC3 and codon bias tended to be strong for the genes studied, particularly for the slow mammalian isoforms, which showed an average GC3 of greater than 0.85.

Table 2 Summary statistics for full sequences
Table 3 Summary statistics of codon bias in short sequences, representing the final coding 500 bp

CA was used to examine the codon usage patterns between organisms. A plot of the first two CA axes is given in Fig. 1c. A strong correlation was found between the first principal CA axis (CA1) and GC3 (r = −0.99, P < 0.0001) (Fig. 1a). The second principal CA axis (CA2) was found to have a strong correlation with RCBS (r = 0.79, P < 0.0001) (Fig. 1b). The CA1 and CA2 axes account for 32.00 and 8.32 percent of the inertia, respectively.

Monte Carlo Simulations of Translation Costs and Costs of Nonsense Errors

Results for MC simulations, which include all sequences for which both tRNA gene copy numbers and full-length sequences are available, are listed in Table 1. For simulations in which codon usage was randomized while keeping GC3 and amino acid sequence fixed (Random Codon in Table 1), we found that all η Obs but one (alpha cardiac isoform of D. rerio) were significantly smaller than random sequences. Conversely, none of the ξ Obs were significantly smaller than random (P > 0.3 for all sequences) and, of the 38 sequences analyzed, 18 sequences had ξ Obs that were significantly larger than random. For simulations in which codon usage was fixed and positions were permuted (Random Position in Table 1), 20 sequences show both η Obs and ξ Obs significantly smaller than random. Interestingly, all but three of the remaining 18 sequences were fast isoforms, representing the majority of fast isoforms studied by MC simulation. We interpret these results as evidence in support of SANE and minimization of η Obs, but possibly not minimization of ξ Obs.

tRNA Gene Copy Number and GC3

In order to ensure that translation rates (given by c H) were independent of GC composition, we tested the null hypothesis that the distributions of AT-ending tRNA abundances were the same as the distributions of GC-ending tRNA abundances by the 2-sample Kolmogorov–Smirnov test. We found no evidence that tRNA abundances for GC- and AT-ending codons in all species studied were significantly different (P = 0.65).

Spatial Variations

The moving window plots for GC3, degeneracy, c H, RCBS, and SCUO are given in Figs. 3, 4, 5, 6, and 7, respectively, where “a” represents overall sequence means and “b–d” are results for d2, d4, and d6 codons. Figure 4b–d plots fractional composition of degeneracy classes d2, d4, and d6. For d6 codons, we found that for nearly all measured quantities, qualitative characteristics of twofold and fourfold degenerate parts of d6 codons appeared to be more similar to one another than to d2 or d4 codons, respectively (results not shown). The one exception to this observation was GC3. We therefore plotted only the pooled results for d6 codons and treated d6 codons as a single, cohesive group.

Fig. 3
figure 3

Intragenic variation of centered, and scaled GC3 in all codons (a), as well as twofold (b), fourfold (c), and sixfold (d) degenerate codons. Means (solid line) and 95% confidence envelope (dashed line) are plotted by codon position

Fig. 4
figure 4

Intragenic variation of centered and scaled mean degeneracy in all codons (a), as well as composition of twofold (b), fourfold (c), and sixfold (d) degenerate codons. Means (solid line) and 95% confidence envelope (dashed line) are plotted by codon position

Fig. 5
figure 5

Intragenic variation of centered and scaled mean c H in all codons (a), as well as c H of twofold (b), fourfold (c), and sixfold (d) degenerate codons. Means (solid line) and 95% confidence envelope (dashed line) are plotted by codon position. Note the large increase in mean c H in d2 and d4 codons relative to the width of the confidence envelope

Fig. 6
figure 6

Intragenic variation of centered and scaled mean RCBS in all codons (a), as well as RCBS of twofold (b), fourfold (c), and sixfold (d) degenerate codons. Means (solid line) and 95% confidence envelope (dashed line) are plotted by codon position. Note the large increase in overall RCBS, and RCBS in d4 codons, relative to the width of the confidence envelope

Fig. 7
figure 7

Intragenic variation of centered and scaled mean SCUO in all codons (a) as well as SCUO of twofold (b), fourfold (c), and sixfold (d) degenerate codons. Means (solid line) and 95% confidence envelope (dashed line) are plotted by codon position. Note similarity in a and b with a and b from GC3 in Fig. 1

Codons in d3 represented relatively small fractions of sequence composition, contributing little to the overall sequence patterns and were therefore not considered further. Intragenic variations in ENC followed nearly identical patterns to SCUO and were therefore not plotted. Karlin’s B statistic showed no apparent spatial patterns in bias and was therefore not plotted.

Although we use no formal hypothesis testing to determine the significance of spatial trends we argue that, heuristically, a spatial trend that is considerably larger in magnitude than the 95% confidence envelope may be evidence for a significant trend. Following this rationale, the intragenic analysis demonstrated an increasing trend in codon bias and c H toward the 3′ end of the gene. We found that for c H and RCBS among d4 codons, and c H among d2 codons, there was an increasing trend toward the 3′ end of the genes (Figs. 5, 6).

Intragenic patterns in GC3, c H, RCBS, and SCUO all appear to be influenced by mean degeneracy. Peaks and troughs in the intragenic degeneracy plot correspond to peaks and troughs (or vice versa) in the intragenic plots of all other measurements. Furthermore, each degeneracy group displays qualitatively different spatial variation for each measurement. For instance, Fig. 5 shows that among d2 and d4 codons, there is a clear increase in c H towards the 3′ end of the transcript, while there is almost no positional effect among d6 codons. Similarly, Fig. 6 shows that RCBS increases toward the 3′ end among d4 codons with no similar trend in d2 or d6 codons. Therefore, forces shaping codon usage may affect each degeneracy group differently.

We caution the reader that we may find the intragenetic relationships between variables are not the same as the population-level relationships found here for the short sequences. For example, degeneracy is negatively correlated with SCUO intragenically, which can be visually determined by Figs. 4 and 7, but is positively correlated with bias across species (P < 0.05). Thus, inter-genic comparisons may paint a different picture than intra-genic relationships between variables.

Discussion

Codon usage bias has been demonstrated in both prokaryotes and eukaryotes, and can be related to gene expression or gene length (Carbone et al. 2003; Chiapello et al. 1998; Coghlan and Wolfe 2000; Dos Reis et al. 2003; Duret and Mouchiroud 1999; Eyre-Walker 1996; Herbeck et al. 2003; Larsen et al. 1999; Moriyama and Powell 1997; Shukle 2000). However, arguments weighing the forces shaping codon usage in vertebrates are ongoing. The present analysis may illuminate this debate, for the case of sarcomeric MyHC genes. Myosin protein expression is important to locomotion, growth, and the plasticity of the contractile apparatus in vertebrate muscle and represents a significant contribution to the total metabolic demand of the organism. Skeletal and cardiac muscles are among the most plastic of animal tissues, capable of both rapid and profound alterations in mass and contractile protein isoform composition during either growth or atrophy (Adams et al. 2003; Baldwin and Haddad 2001; Caiozzo 2002; Haddad et al. 2003; Thomason et al. 1987a, b). The unique combination of features attributed to the MyHC family makes them ideal candidates for SANE across a range of vertebrate taxa.

We have accumulated observations of curiously high and varied codon usage bias in the sarcomeric MyHC family of motor proteins in a diverse array of vertebrates. We consider the influence of underlying chromosomal GC bias and identify GC-independent factors which implicate a role of selection on codon usage. Collectively, our results illustrate the plausibility of SANE as a major force shaping codon usage in MyHC genes. Owing to the number and breadth of available sequences in different taxa, these genes provide an excellent model for genes displaying patterns of SANE across species and isoform.

Codon Usage Bias in MyHC is Shaped by Both GC-Dependent and GC-Independent Forces

Nearly all of the MyHC sequences studied here, in particular the slow mammalian isoforms, display an elevated GC3 (Table 2; Fig. 1a), suggesting that the genes are located within GC-rich isochores. Factors influencing GC3 appear to represent the dominant forces shaping codon usage in vertebrate MyHC genes (GC3 represented nearly 1/3 of the inertia in CA, Fig. 1). These findings reflect those of other studies involving vertebrate genes (see reviews in Chamary et al. (2006) and Hershberg and Petrov (2008)). Both GC content and codon bias were similar among related species, as is demonstrated by the clustering patterns by type and taxa in the correspondence plot (Fig. 1c).

Codon usage bias among MyHC genes varies with respect to taxa, type, and the measurement method used (Table 2). The GC-independent measure of codon bias, RCBS, was between 0.38 (for fast mammalian genes) and 0.56 (for fast genes from teleost fishes). By comparison with results from Das et al. (2009) in yeast this is a considerably high range for genes of this length (≈6000 bp), which nearly all had a RCBS of less than 0.2. Interestingly, the largest RCBS scores in the current study belonged primarily to the fast isoforms of teleost fishes, which also have the lowest GC3 (Table 2; Fig. 1a), and have skeletal muscle MyHC compositions that are among the most plastic in all vertebrates.

It is important to note that the orthogonality of any two CA axes necessitates their statistical independence, and therefore the independence of GC3 and RCBS. This also validates the claims of Das et al. (2009) that their measure is independent of compositional bias. In contrast, ENC and SCUO showed strong correlations with GC3 (compare Figs. 4, 7). Thus, the emergence of GC3 and RCBS as proxies for the CA1 and CA2 axes indicates that the dominant forces acting on codon usage in MyHC coding sequences can be partitioned naturally into composition-dependent and composition-independent factors.

Since we found that both RCBS and tRNA abundances were independent from GC3, we reasoned that quantities derived from these measurements (intragenic variations in RCBS and c H, η Obs, and ξ Obs) were also independent of GC composition. While GC3 may influence mRNA stability or translational efficiency (Kudla et al. 2006), such effects were not considered in the present analysis, where we have focused on the influence of nonsense errors in MyHC evolution. Interestingly, in our first analysis of η and ξ using MC simulation, we committed an error in the calculation of GC3 which incorrectly biased GC in our simulated sequences. After correcting the error there was virtually no difference in the resulting probabilities, further supporting the notion that translation rates are independent from GC3. These results collectively verify that, while GC3 is elevated in many of these genes, there is ample evidence to infer that there have been composition-independent forces shaping the codon usage of MyHC coding regions.

MyHC Genes Preferentially Utilize Abundant tRNA to Avoid Nonsense Errors

Monte Carlo simulations using a mathematical model of translation (Gilchrist and Wagner 2006; Gilchrist 2007) indicated that MyHC genes utilize codons with greater tRNA abundance and this lowers both the expected cost of translation, η Obs, as expected for SANE. However, results for the expected cost of nonsense errors ξ Obs were less straightforward (Table 1). Simulations varying codon usage while keeping the amino acid sequence fixed showed that η Obs is unanimously smaller than random. Gilchrist (2007) obtained similar results using this model with genes studied in S. cerevisiae for η Obs and found strong evidence for translational SANE using a related method (Gilchrist et al. 2009).

Simulations varying codon position while keeping codon usage fixed showed that codon position was equally important to both the expected cost of translation η Obs and the expected cost of nonsense errors ξ Obs but primarily for slow sequences. Our finding that many sequences showed that ξ Obs was larger than random for simulations varying codon usage (Table 1), while η Obs was typically smaller than random, suggests that SANE may act primarily to lower the expected number of nonsense errors, given by 1/σ n  − 1 in Eq. 3, preferentially over the expected cost of a single nonsense error ξ, in order to lower overall translation cost η.

This possibility prompted us to examine the distribution of 1/σ n  − 1. Similarly to the Monte Carlo simulations performed for ξ and η, we performed a Monte Carlo simulation with randomized codon usage identical to that described in “Methods” section. Denoting δ = 1/σ n  − 1, we scaled δ as s = (δ − δMin)/(δMax − δMin), where δMin, and δMax were found by using the largest and smallest values, respectively, for the elongation rates and approximated the distribution of s by a beta distribution. After controlling for multiple testing by false discovery rate, we found all δObs to be significantly smaller than random for all tested sequences. This observation helps to explain some of the discrepancy between the results for ξ and η in these simulations, and suggests that δ is more important than ξ in translational selection. This may be due to additional, nominal energetic costs for processing prematurely terminated transcripts which are not directly related to elongation. The significance of the expected number of nonsense errors has not been extensively treated in literature on SANE and our results indicate that it deserves further scrutiny.

Differences Between Isoform Type

Fast and slow isoforms showed qualitatively different patterns in both MC simulations (Table 1) and codon bias measurements in general (Tables 2, 3). This finding suggests that chromosomal differences exist in the equilibrium frequencies of MyHC codons. In mammals, suites of MyHC genes are tandemly arrayed on separate chromosomes (Leinwand et al. 1983; Mahdavi et al. 1984; Matsuoka et al. 1989; Qin et al. 1990; Weiss et al. 1999). This seems so for chickens as well (Chen et al. 1997; Gulick et al. 1987) and some subsets of teleost MyHC genes may also be tandemly arrayed (McGuigan et al. 2004). Furthermore, it is well known that silent-site substitution rates can vary within a genome (Fox et al. 2008). Thus, it is natural to expect differences in the equilibrium distributions of codons between isoform types when types are separated by chromosomes.

Alternatively, a more physiological explanation for the difference in slow and fast genes may be related to the additional tissue expression of slow genes as β cardiac in the heart. Selection pressures on uninterrupted cardiac pumping function may be greater than for skeletal muscle, where some fast fibers may be recruited for only minutes to seconds during a day.

Intragenic Spatial Patterns

The intragenic spatial analysis of bias also revealed trends expected by SANE, but not consistently between the degeneracy groups. Increases in translation rate (c H) toward the 3′ end were only detected in d2 and d4 codons, while increases in GC-independent codon bias (RCBS) were detectable only in d4 codons. For us, this raises two questions; (1) if SANE is acting on the spatial distribution of codon bias, why does translation rate increase toward the 3′ end in d2 and d4 codons but not in d6 codons? (2) Why, if translation rates increase toward 3′ in d2 codons, does RCBS not increase toward 3′ in d2 codons?

For the first, consider the conditional probability of a conversion (after one or more mutations) from a less favorable codon at a given locus, to a more favorable codon (a codon with a greater elongation rate), given that a conversion occurs. For a codon in degeneracy group di, i = 3, 4, 6, with uniform conversion probabilities, the probability of converting to any other codon is 1/(i − 1). If a given codon belonged to an amino acid that had only one codon with higher elongation rate, the d6 codon would have a 40% smaller chance of converting to the optimal codon than a d4 codon. Thus, in general, supposing uniform conversion probabilities, the higher the degeneracy of the codon, the lower the conditional probability of converting to the optimal codon, limiting the strength of selection acting on codon bias in d6 codons in particular.

We continue that reasoning to the second question—why does not RCBS increase toward 3′ in d2 codons when elongation rate does? There often exists an “optimal” codon—a codon with a higher elongation rate than any other for a given amino acid. However, there are also intermediate elongation rates, making some codons “better” than others, but not necessarily optimal. Indeed, the signal for selection for optimal codons is likely to be masked by the absence of what could be called the optimal codon. For instance, suppose a d4 amino acid has elongation rates c 1 = 0.02, c 2 = 0.03, c 3 = 4.0, and c 4 = 5.0. While c 4 has the largest elongation rate, it has relatively little advantage over c 3, thus it is unlikely that sufficient selection pressure can drive fixation of c 4 over c 3. Alternatively, c 1 and c 2 have elongation rates that are more than two orders of magnitude smaller than c 3 and c 4. Thus, while some codon choices are better than others there is a selective invariance in elongation rates between c 1 and c 2 and between c 3 and c 4, reducing the effective number of phenotypically different alleles. That is, the selection component of the selection–mutation–drift equilibrium distribution of codon usage for a given amino acid is a function of both degeneracy, and variance in elongation rates of the codons.

Most measures of codon bias (including the RCBS) will overlook this invariance when weighing the contribution of a given codon to the signal for SANE. Indeed, codon bias itself seems unlikely to be the best indicator of translational selection. The present analysis may be improved through weighing codon bias by this important factor. To our knowledge no studies have explicitly addressed the notion of selective strength on codon usage as a function of degeneracy class, although qualitative differences in codon usage between different degeneracy classes have been noted in bacteria (Suzuki et al. 2005) and the role of phenotypic degeneracy in the robustness and flexibility of genotypes has been discussed more generally (Meyers et al. 2005; Whitacre 2010).

Isoaccepting tRNA abundances in our data may vary by as much as three orders of magnitude, or may be equal between more than two codons. Thus, the discrepancy we find between intragenic patterns of RCBS and c H may be due to a selective invariance in elongation rates between competing codons. While the RCBS and other measures can provide valuable information about the codon usage of a transcript, we view the methods used here, derived from Gilchrist (2007; Gilchrist et al. 2009) to estimate the distribution of expected translation cost and expected cost of nonsense errors, to be superior codon usage analysis techniques in this regard when SANE is suspected and estimates of tRNA abundance are available.

Statistical properties of codon bias measures themselves may be responsible for some observations in this and other studies. In Fig. 4, the proportion of d4 codons decreases toward the 3′ end of the gene in a way that mirrors the increase in RCBS in d4 codons (Fig. 6). A reduction in the proportion of d4 codons results in shorter sequences for the calculation of codon bias at each window position by our method. Therefore, it is possible that the increase in codon bias toward the 3′ end of the genes in d4 codons is an artifact of fewer codons in the window (i.e., shorter sequence length), a result that has been observed with other methods (Fuglsang 2004; Supek and Vlahoviček 2005) and has been verified by our lab for the RCBS measure by examining large variations in window lengths (results not shown). This artifact could also explain a negative correlation between gene length and codon bias in other studies, where the relationship was attributed to expression level (Das et al. 2009; Urrutia and Hurst 2003). While such an artifact could explain the lone increase in RCBS for d4 codons, which represented less than 20% of the codons (Fig. 4c), it is unlikely to explain the degree of increase in RCBS toward the 3′ end of the genes for all codons (Fig. 6a), for which window size was uniform across the gene, nor can it explain the increase in c H in both d2 and d4 codons, which is independent of window size.

Limitations

Other studies have identified selection on codon usage in vertebrates using phylogenetic methods (Lin et al. 2003; Resch et al. 2007; Yang and Nielsen 2008) and in the future this may aid recognition of the patterns and nature of codon bias, including mechanistic explanations. However, the results of this study suggest a role of SANE acting on codon usage in vertebrates (at least in the MyHC family) and should be considered when using the ratio of synonymous to nonsynonymous substitutions (K a/K s) to identify selection by phylogenetic methods.

No formal hypothesis testing was used in this study to determine the significance of the trends in intragenic measurements. Instead, a heuristic statistical approach was taken. Due to the nature and degree of spatial variations within and between genes, parametric procedures may be inappropriate for such an analysis. A nonparametric test is needed, but to our knowledge a sufficient one does not currently exist. We should also note that, if translation rates are available, the spatial variation in the mean and variance of η can be calculated using the methods of Gilchrist et al. (2009). However, it is unclear to us how to interpret their metric as an intragenic measure of selection and was not considered for the current study.

The intragenic spatial variations in RCBS and c H in our data could be explained by different evolutionary pressures acting on different domains of the MyHC amino acid sequence. Lin et al. (2003) have demonstrated the possibility that selection could act to maintain amino acid properties in different functional domains. The notion that selection at the amino acid level can impact codon usage has been suggested by Plotkin et al. (2006) as well, attributing it to asymmetries in the genetic code. Indeed, it is possible that there are different evolutionary rates for head and tail portions of the MyHC protein (McGuigan et al. 2004). The results of our intragenic analysis does not discount these possibilities, but the rise in intragenic c H, and the results of MC simulations suggest that SANE is one of several possible selective pressures acting on codon usage.

It should be noted that our intragenic results are dominated by the properties of the mammalian genes, which outnumbered genes from other taxa in the both short and full sequences (Tables 2, 3). While this limits the application of our results to primarily mammalian MyHC genes, it suggests that SANE in mammals is pervasive, as indicated by the results of Resch et al. (2007).

The detection of codon bias in MyHC genes is fortuitous in that it has been sequenced in such a broad range of taxa. By examining a single family of genes across a range of vertebrate taxa, we were able to examine the potential influence of SANE acting across these taxa while neutralizing (to some degree) interspecies variation in expression rate, gene product function, and gene length. Furthermore, we have improved the chance of detecting the influence of SANE, by choosing to examine a relatively large and highly expressed family of genes for which it is expected that SANE would act most strongly. The results of this study indicate that the signal of SANE is present across all of the vertebrate taxa studied, but there are interspecies, interisoform, and intragenic differences in its strength. The strength of this signal is likely to vary with amino acid degeneracy, and the degree to which it must compete with factors influencing GC composition. Similar analyses should be conducted for other vertebrate gene families.