Introduction

Many models for the evolution of the genetic code have been proposed. These models are supported by diverse lines of evidence, including analysis of biosynthetic pathways (coevolution theory), translation kinetics, amino acid physicochemical properties, codon designations, phylogeny of aminoacyl-tRNA synthetases, and composition of ancestral sequence reconstructions (Brooks et al. 2004; Cavalcanti et al. 2004; Davis 1999; Hartman 1975; Hartman 1978; Hartman 1984; Hartman 1995; Higgs and Purdritz 2006; Knight et al. 1999; Trifonov 2004; Wong 2005). Most attempt to provide a mechanistic explanation of the organization and evolution of the genetic code, typically via a process of “expansion,” whereby new amino acids are added to the translational schema; usually these models do not directly rely on empirical compositional evidence gathered from existing protein sequences.

There are many protein lineages universal to all extant life, presumably similar in structure and function to proteins present at the time of the most recent common ancestor (MRCA) (Delaye et al. 2005; Gogarten and Taiz 1992; Koonin 2003). Amino acid positions that came and remained under purifying selection early in the evolution of life will be conserved in all or most homologous modern versions of these proteins and, therefore, should reflect the state of the genetic code at the time of their fixation. Comparison with amino acid usage rates at more recently fixed positions allows for these trends to be identified, providing insight into the evolution of the set of amino acids used in the genetic code. This method intentionally focuses on code evolution from the perspective of the amino acid set available to synthesize polypeptides. While codon expansion and redesignation are also important aspects of genetic code evolution, these processes cannot be analyzed using only amino acid sequence alignments, and their effects are not considered here.

Ribosomal proteins are especially useful for investigating this effect, as the ribosome is one of the most ancient and well-conserved structures in the biological world. It is mostly comprised of RNA, which serves to provide the necessary catalytic activity for polypeptide synthesis (Noller et al. 2005). The ribosome also contains many proteins, of which 27 are universally conserved in all life, the postulated “ribosomal core” (Harris et al. 2003). This large number of subunits allows for many independent samples of amino acid usage at conserved positions to be measured, providing a basis for statistical analysis. Similarly well-conserved, ATP synthase proteins consist of complexes of catalytic and noncatalytic subunits which are present as V-type (vacuolar) and F-type homologues in Archaea and Bacteria, respectively. Interestingly, the catalytic and noncatalytic subunits are also homologous, resulting from an ancient gene duplication before the MRCA (Gogarten et al. 1989). For this reason, positions conserved between catalytic and noncatalytic subunits present an even more ancient view of amino acid composition. Since it is impossible to determine at what time before divergence each position came into fixation, we cannot determine the relative ages of these ATPase and ribosomal protein lineages.

We also investigate the alternative hypothesis that observed changes in amino acid usage at conserved positions are explained by factors independent of genetic code expansion. By definition, alignments of more divergent protein sequences produce fewer conserved positions, and some amino acids (such as glycine) may be more likely to be conserved: observed trends in amino acid usage may be a function of these effects. In order to test this possibility, we analyzed amino acid usage at conserved positions in the set of spliceosomal core proteins. Under our hypothesis, these more recently evolved proteins should not show the same trends, having never experienced purifying selection under a more primitive genetic code. Rather, the trends observed in more recently evolved protein families should be more similar to the domain-specific positions in ancient protein lineages.

The spliceosome is a much more recent biological invention than the ribosome, having no known prokaryotic homologue. Spliceosomal proteins resemble ribosomal proteins in their organization as a ribonucleoprotein complex with RNA-mediated catalytic activity and many protein-nucleic acid interactions. They may also have been subjected to the same type of neutral evolutionary pressure (Stoltzfus 1999). These similarities all suggest that this is a suitable dataset for comparison. Most importantly, spliceosomal core proteins show a similar level of sequence conservation. The 22 spliceosomal proteins analyzed have 6.77% of total positions universally conserved. Across all 27 ribosomal core proteins, 5.53% of total positions are universally conserved. Within the archaeal and bacterial domains, an average of 13.89% of total positions are conserved (not including universal positions).

Examining the conserved positions in ribosomal and ATPase proteins, we identify a subset of amino acids that are likely the most recent additions to the genetic code, as well as amino acids that are likely among the most ancient. Analysis of conserved positions within spliceosomal core protein groups shows that this trend is not an artifact of divergent protein sequence alignments. In addition, our results suggest that the genetic code was present as the “complete” 20-amino acid schema at the time of the MRCA. We discuss these results with regard to overall trends in the evolution of the genetic code, and propose a scenario in which genetic code expansion may have promoted the transition from RNA-based to protein-based life.

Methods

Sequences

For the analysis of ribosomal and ATPase proteins, a total of 18 genomes were used, from 9 archaeal and 9 bacterial species (Table 1). Species were selected to maximize phylogenetic and %GC distribution. For the analysis of spliceosomal proteins, 10 divergent eukaryotic lineages were used (Table 1).

Table 1 Genomes used in the analysis of ribosomal, ATPase, and spliceosomal proteins

The universally conserved set of ribosomal proteins in both large (50S) and small (30S) subunits (Harris et al. 2003) was collected for alignment and analysis (Table 2). Genomic BLAST (Cummings et al. 2002) was used to verify that these proteins were universally conserved in both Archaea and Bacteria. Archaeal homologues were not found for LSU protein L16 (RplP). Therefore, this protein was omitted from the universal ribosomal proteins used in this analysis.

Table 2 Proteins comprising the ribosomal core

For the analysis of spliceosomal proteins, ancestral spliceosomal proteins were defined as the subset of proteins “likely” or “very likely” to be present in the most recent eukaryotic ancestor (Collins and Penny 2005) for which a homologue could be identified in G. lamblia using either BLASTP or TBLASTN.

Counting of Conserved Positions

For each amino acid, we count positions that are universally conserved within ribosomal/spliceosomal/ATPase protein types in alignments generated by three different algorithms (ClustalW, MUSCLE, T-Coffee) (Edgar 2004; Poirot et al. 2004; Thompson et al. 1994). Counts of conserved amino acids for ribosomal, spliceosomal, and ATPase protein families are provided as supplementary material (Supplementary Tables 1–3). Default parameter settings were used for all alignment algorithms. Conserved Met residues in initiation positions were excluded, as their nearly obligatory presence and position in all sequences render them noninformative and a potential source of bias.

Table 3 Comparison of domain-specific conserved amino acid (AA) usage: values represent the sum usage across ribosomal protein positions for each domain (ΔB, ΔA)

Bacterial (B) and archaeal (A) conserved amino acid counts for the ribosomal protein sets were obtained using the bacterial and archaeal subsets of the universal alignments. All bacterial and archaeal sequences clustered based on sequence similarity within their respective domains, indicating that for these data no interdomain horizontal gene transfer has occurred. Universally conserved position counts (U ribo ) were then subtracted from these sets to produce the count of conserved positions exclusive to each domain:

$$ \Delta B = B - U_{ribo} $$
(1)
$$ \Delta A = A - U_{ribo} $$
(2)

The different alignment algorithms gave nearly identical counts at universal positions (95.93% of these positions were identified by all three different alignment algorithms). Bacterial- and archaeal-specific conserved positions showed similar levels of fidelity (ΔB, 87.59%; ΔA, 89.81%). In order to compare U ribo against all domain-specific values, ΔB and ΔA were then combined to produce the set ΔD ribo :

$$ \Delta D_{{ribo}} \, = \,\Delta B\, + \,\Delta A $$
(3)
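The counting scheme in Eqs. 1-3 can be illustrated with a minimal Python sketch. The toy alignments, the function name `conserved_counts`, and the treatment of gap characters are assumptions made for illustration only; they are not the in-house software or the data used in this study.

```python
from collections import Counter

def conserved_counts(alignment):
    """Count, per amino acid, the alignment columns in which every sequence
    carries the same residue. Columns containing a gap are never counted."""
    counts = Counter()
    for column in zip(*alignment):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            counts[column[0]] += 1
    return counts

# Hypothetical toy alignments (equal-length, already aligned sequences).
bacteria = ["GKAGL", "GKAGI", "GKAGV"]
archaea = ["GRAGF", "GRAGF", "GRAGF"]

B = conserved_counts(bacteria)            # conserved within Bacteria
A = conserved_counts(archaea)             # conserved within Archaea
U = conserved_counts(bacteria + archaea)  # universally conserved

# Eqs. 1-3: every universal column is also conserved within each domain,
# so the per-residue subtraction never goes negative.
delta_B = B - U
delta_A = A - U
delta_D = delta_B + delta_A

print(dict(U), dict(delta_B), dict(delta_A), dict(delta_D))
```

In this toy case the two Gly columns and the Ala column are universal, while Lys is conserved only in the bacterial sequences and Arg and Phe only in the archaeal ones.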

The same methodology was used for ATPase subunits (vA, vB, fα, fβ), expanded to include positions conserved within the pre-MRCA ancient gene duplication. Note that ΔP ATP is not equivalent to ΔD ribo ; rather, ΔP ATP corresponds to U ribo , as both map to the phylogenetic node of the MRCA (Fig. 1). Positions conserved within both catalytic (C) and noncatalytic (nC) paralogues are defined as paralogue-specific (P):

$$ {\Delta C = C - U_{{ATP}} } $$
(4)
$$ {\Delta nC = nC - U_{{ATP}} } $$
(5)
$$ {\Delta f\beta = f\beta - \Delta C} $$
(6)
$$ {\Delta f\alpha = f\alpha - \Delta nC} $$
(7)
$$ {\Delta vA = vA - \Delta C} $$
(8)
$$ {\Delta vB = vB - \Delta nC} $$
(9)
$$ {\Delta P_{{ATP}} = \Delta C + \Delta nC} $$
(10)
$$ {\Delta D_{{ATP}} = \Delta f\beta + \Delta f\alpha + \Delta vA + \Delta vB} $$
(11)

Fig. 1 Organization of conserved position definitions

The mean values per protein per amino acid of U ribo , U ATP , ΔP ATP , ΔB, ΔA, ΔD ribo , ΔD ATP , and S (spliceosomal) for the three different alignment algorithms were then normalized with respect to total conserved positions within each. In order to control for the small and varied sample size in universally conserved positions across ribosomal proteins, for the comparison of U ribo and ΔD ribo a weighted counting scheme was also used (\( W_{N} \propto N^{1/2} \)). However, these normalized weighted counts were nearly identical to the unweighted, suggesting that our analysis is not affected by artifacts due to sample size.
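One possible reading of the normalization and weighting step is sketched below in Python. It assumes that N is the number of conserved positions in a protein and that each protein's relative usage is averaged with weight proportional to the square root of N; the text does not spell out these details, so the function names and the toy counts are hypothetical.

```python
import math

def normalize(counts):
    """Convert raw counts into relative usage (fractions summing to 1)."""
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def weighted_usage(per_protein_counts):
    """Average per-protein relative usage with weights proportional to
    sqrt(N), where N is the protein's number of conserved positions
    (an assumed interpretation of W_N)."""
    # Proteins without conserved positions carry no information; skip them.
    informative = [c for c in per_protein_counts if sum(c.values()) > 0]
    weights = [math.sqrt(sum(c.values())) for c in informative]
    total_weight = sum(weights)
    usage = {}
    for weight, counts in zip(weights, informative):
        for aa, fraction in normalize(counts).items():
            usage[aa] = usage.get(aa, 0.0) + weight * fraction / total_weight
    return usage

# Hypothetical counts for two proteins with 4 and 16 conserved positions.
proteins = [{"G": 3, "K": 1}, {"G": 8, "A": 8}]
pooled = normalize({"G": 11, "K": 1, "A": 8})  # unweighted, pooled counts
weighted = weighted_usage(proteins)

print(pooled)
print(weighted)
```

Square-root weighting sits between treating every protein equally and pooling all positions, so a protein with very few conserved positions neither dominates the average nor vanishes from it.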

Statistical Analysis

The many ribosomal core proteins can be treated as individual samples, allowing for a more rigorous statistical analysis of this particular dataset. In order to determine significant differences between amino acid usages, paired and unpaired analyses were performed using parametric methods (paired two-sided two-sample t-test, single-factor ANOVA) and nonparametric methods (Wilcoxon signed-rank test, simple bootstrap). Simple bootstrap was performed for the ribosomal protein data by generating 1000 random samples from ΔD ribo and comparing against the observed amino acid counts of U ribo . Since pairwise statistics are impossible between ribosomal and spliceosomal (S) protein datasets (as proteins are nonhomologous), in this case a two-sample z-test was used. The VassarStats online web resource (http://www.faculty.vassar.edu/lowry/VassarStats.html), Microsoft Excel data analysis tools, and in-house software (available upon request) were used for performing these analyses.
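The simple bootstrap can be sketched as follows. This is only an illustration under stated assumptions: residues are resampled with replacement from a pool representing ΔD ribo, each resample has the same size as the set of universal positions, and the statistic is the fraction of resamples containing at least as many copies of a residue as observed in U ribo. The function name, the seed, and the toy pool are hypothetical and do not reproduce the in-house software.

```python
import random

def bootstrap_fraction_at_least(pool, aa, observed_count, sample_size,
                                n_boot=1000, seed=0):
    """Fraction of bootstrap resamples drawn from `pool` that contain at
    least `observed_count` copies of residue `aa`."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    hits = 0
    for _ in range(n_boot):
        sample = rng.choices(pool, k=sample_size)  # sampling with replacement
        if sample.count(aa) >= observed_count:
            hits += 1
    return hits / n_boot

# Hypothetical domain-specific pool: 10 Gly, 20 Lys, 20 Ile, 10 Tyr.
pool = list("G" * 10 + "K" * 20 + "I" * 20 + "Y" * 10)

# Suppose 12 of 20 universal positions were Gly (toy numbers).
p_gly = bootstrap_fraction_at_least(pool, "G", observed_count=12, sample_size=20)
print(p_gly)
```

A small fraction indicates that the observed universal count is rarely reached when residues are drawn at domain-specific frequencies, which is the sense in which Gly would be called enriched at universal positions.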

Results

Comparison of Domains

Relative amino acid usage at conserved positions was compared between bacterial and archaeal domains (ΔB vs. ΔA). A strikingly linear relationship was identified between the individual amino acid usages of both domains (R² = 0.8178) (Fig. 2), with only Lys and Pro showing a significant difference in usage (Table 3). Similar results have been reported with respect to the overall amino acid composition of organismal lineages (Sorimachi et al. 2001). Such similar usage rates between domains suggest that bacteria and archaea inherited the same genetic code from a common ancestor, with amino acids subsequently having equal opportunity for fixation in domain-specific conserved positions. This does not support any proposed polyphyletic origin of the genetic code (Di Giulio 2006; Syvanen 2002). A polyphyletic origin should predict amino acid preferences between domains representing their respective ancestral codes, especially at the highly conserved positions of ribosomal proteins.

Fig. 2 Comparison of domain-specific conserved amino acid usage. Values represent the sum usage across ribosomal proteins in each dataset (ΔB, ΔA). Trendline indicates linear relationship (y = x). Amino acids with significantly different values for ΔB and ΔA are labeled. Error bars represent the standard error (SE) of the sum usage across each dataset, given the overall variance of the dataset (\( SE = n\hat{\sigma}/\sqrt{n} \))

Although a significant difference was observed between ΔB (Pro) and ΔA (Pro) , neither value differed significantly from U ribo(Pro) or ΔD ribo(Pro) . This is in contrast to Lys, where a significant difference was observed between ΔB (Lys) and U ribo(Lys) (p = 0.001). However, both ΔB (Lys) and ΔA (Lys) showed higher levels than U ribo(Lys) .

With regard to the observed significant difference in the usage of Lys, it is interesting to note that Lys is also unique in that it has two distinct biosynthetic pathways (DAP and AAA) (Nishida et al. 1999; Vogel 1964) and aminoacyl-tRNA synthetases (Class I, Class II) (Ibba et al. 1997). In both cases, evidence exists that this duality reflects domain-specific origins, albeit with significant horizontal gene transfer between groups (Klipcan and Safro 2004; Tumbula et al. 1999). The much wider phylogenetic distribution of the DAP pathway and class II synthetase favored by the bacterial domain, however, may indicate that they are more ancient. This conclusion is further supported by the observed shared evolutionary history of the DAP and arginine biosynthesis pathways, as well as the similarity of two codon blocks used by these amino acids, AAR and AGR, respectively (Velasco et al. 2002). The higher usage of Lys in bacterial conserved positions may be a consequence of retaining the ancestral biosynthesis and incorporation machinery for this amino acid.

Comparison of Universal and Domain-Specific Ribosomal Core Positions

Relative amino acid usage was compared between universally conserved and domain-specific conserved positions in the set of 27 ribosomal core proteins (Fig. 3). Although many data fail the criteria for parametric analysis (lacking normally distributed residuals or equal variance), results of parametric and nonparametric analyses nevertheless strongly agreed for each amino acid. Several statistical methods confirmed that ΔD ribo shows a significant decrease in the usage of Gly and Asn and a significant increase in the usage of Cys, Glu, Phe, Ile, Lys, Trp, and Tyr (Table 4). Less significant were the observed decreases for Ala, Gln, His, Met, and Leu and increases for Arg, Ser, Thr, and Val. Asp and Pro showed almost no change in usage. Amino acids with statistically significant changes in usage explain 71.78% of the total change in amino acid usages between U ribo and ΔD ribo .

Fig. 3 Relative conserved positions usage among ribosomal core and ATP synthase proteins. Relative usage of conserved positions in ribosomal core (A) and ATP synthase (B) protein groups. Columns indicate relative usage rates within each phylogenetic node. U and ΔP are analogous nodes, both corresponding to the MRCA. Error bars represent the standard error (SE) of the sum usage across each dataset, given the overall variance of the dataset (\( SE = n\hat{\sigma}/\sqrt{n} \)). Error bars could not be calculated for the ATP synthase dataset, as the analysis only contains a single protein type. *Denotes amino acid trends identified as statistically significant in the ribosomal core protein dataset

Table 4 Comparison of ribosomal core, ATPase, and inferred Last Universal Ancestor (LUA) (Brooks et al. 2004) amino acid (AA) usage: values represent the sum usage across protein positions in each dataset

The most extreme trend was the decrease in Gly, accounting for 32.59% of the total change in amino acid usages between U ribo and ΔD ribo . Cys and Trp were completely absent in universally conserved positions. Relative Ile usage increased by almost a factor of 10, while Tyr usage increased by more than a factor of 20.

Comparison of Universal, Paralogue-Specific, and Domain-Specific ATPase Positions

Relative amino acid usage was also compared between paralogue-specific and domain-specific conserved positions in ATPase proteins (Fig. 3). While the ATPase dataset does not contain enough information for detailed statistical analysis, trends can nonetheless be compared to the results from the ribosomal core protein dataset (Table 4). Only 11 conserved positions were identified in U ATP , too few to recognize any meaningful trends besides that of an excess of Gly residues. Much more informative is the comparison of ΔP ATP and ΔD ATP , which identifies the amino acids with the largest relative increases in usage as Cys, Phe, Ile, Trp, Tyr, His, Met, Asn, Val, and Ala. The largest relative decreases were for Thr, Leu, Gly, Gln, Pro, and Arg. Asp, Ser, Glu, and Lys showed little change in usage.

Trend Consensus

Results from ribosomal and ATPase protein analyses were combined into a consensus of amino acid usage trends, classifying amino acids as “recent” or “ancient” additions (Fig. 4). A three-tiered consensus was constructed with decreasingly stringent criteria (Table 7). The strict consensus identifies Cys, Phe, Ile, Val, Trp, and Tyr as recent additions, and Gly, Gln, and Leu as ancient. The semi-strict consensus further identifies Glu, Lys, and Ser as recent, and Pro ancient. The most permissive consensus additionally includes Asp as ancient. Several amino acids showed strong disagreement between datasets and were not part of any consensus (Ala, Arg, Asn, His, Met, Thr). The semistrict consensus contains the greatest overlap with the amino acid trends in the ribosomal dataset found to be statistically significant. Overall, the ATPase dataset is in agreement with six of the nine statistically significant amino acid trends identified in ribosomal proteins (Cys, Gly, Phe, Ile, Trp, Tyr) and shows strong disagreement with only one (Asn). As the ribosomal dataset contains many samples, likely covering a much broader spectrum of evolutionary time, statistical significance in this set is additionally informative to the consensus model. For this reason, Asn may very well be an “ancient” addition. Additionally, the inclusion of Glu and Lys in the semistrict “recent” consensus carries more weight than the inclusion of Ser.

Fig. 4 Amino acid usage trends in ribosomal core and ATPase proteins. Trends in amino acid usage for ATPase (ΔP ATP → ΔD ATP ) and ribosomal core (U ribo → ΔD ribo ) proteins. Yellow, orange, and red points indicate agreement, weak disagreement, and strong disagreement among datasets, respectively. Agreement is defined as both datasets showing agreeing trends of magnitudes of at least 0.15; weak disagreement is defined as one dataset showing an increase or decrease of a magnitude <0.15; strong disagreement is defined as datasets showing opposing trends of magnitudes of at least 0.15. Trends for Trp and Cys are superimposed. *Denotes amino acid trends identified as statistically significant in the ribosomal core protein dataset
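The agreement categories used in Fig. 4 can be expressed as a small decision rule. The Python sketch below assumes each trend is already available as a signed number (positive for an increase, negative for a decrease) on whatever scale the figure uses; the function name is hypothetical, and treating the case where both magnitudes fall below 0.15 as weak disagreement is an assumption, since the caption does not address it.

```python
def classify_trends(trend_atpase, trend_ribo, threshold=0.15):
    """Classify a pair of signed usage trends following the Fig. 4 caption."""
    # Any trend smaller than the threshold counts as weak disagreement.
    if abs(trend_atpase) < threshold or abs(trend_ribo) < threshold:
        return "weak disagreement"
    # Both trends are large: compare their directions.
    if (trend_atpase > 0) == (trend_ribo > 0):
        return "agreement"
    return "strong disagreement"

# Hypothetical trend values, for illustration only.
print(classify_trends(0.40, 0.30))    # same direction, both large
print(classify_trends(0.40, 0.05))    # one trend below the threshold
print(classify_trends(0.40, -0.30))   # opposite directions, both large
```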

Comparison to Spliceosomal Core Proteins

For this analysis, we defined the spliceosomal core as the set of spliceosomal proteins determined to be “likely” or “very likely” present in the most recent eukaryotic ancestor (Collins and Penny 2005). Of these 51 proteins, we identified 22 for which a homologue could be identified in G. lamblia, presumably one of the deepest-branching extant eukaryotic lineages (Nixon et al. 2002) (Table 5). Amino acids at universally conserved positions were then counted and compared to the universally conserved positions of ribosomal core proteins.

Table 5 Proteins comprising the spliceosomal core

Amino acid usages at conserved positions in spliceosomal core proteins (S) are more similar to ribosomal amino acid usages at domain-specific positions (Table 6), supporting the hypothesis that universally conserved amino acid usages reflect a more primitive genetic code. In fact, the sum of the difference between S and U ribo across all amino acids is almost twice the difference between S and ΔD ribo (61.4% and 32.5%, respectively). This difference remains large even if Gly is omitted from the analysis. Statistical analysis shows that amino acid usages for S and ΔD ribo do not significantly differ for any amino acid, except Lys. Interestingly, S (Lys) is most similar to ΔA (Lys) , which was also shown to be statistically different from ΔB (Lys) . This may suggest that Lys usage in Archaea and Eukarya is more similar when compared with Bacteria.
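The summed difference between two usage profiles can be sketched as below, assuming it is the sum over amino acids of the absolute difference in relative usage (the text does not state the exact formula). The function name and the three-residue toy profiles are hypothetical and are not the values in Table 6.

```python
def total_difference(profile_a, profile_b):
    """Sum of absolute differences in relative usage across all residues
    present in either profile (a missing residue counts as zero usage)."""
    residues = set(profile_a) | set(profile_b)
    return sum(abs(profile_a.get(aa, 0.0) - profile_b.get(aa, 0.0))
               for aa in residues)

# Hypothetical relative usages over a reduced three-residue alphabet.
S = {"G": 0.20, "K": 0.30, "I": 0.50}
U_ribo = {"G": 0.60, "K": 0.30, "I": 0.10}
delta_D_ribo = {"G": 0.25, "K": 0.35, "I": 0.40}

diff_universal = total_difference(S, U_ribo)
diff_domain = total_difference(S, delta_D_ribo)
print(diff_universal, diff_domain)
```

In this toy case S lies much closer to the domain-specific profile than to the universal one, which is the pattern reported above for the real data.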

Table 6 Comparison of ribosomal core and spliceosomal conserved amino acid (AA) usage: values represent the sum relative usage across proteins in each dataset
Table 7 Consensus of genetic code evolution as indicated by amino acid usage trends in ribosomal core and ATPase proteins

Several amino acids were shown to have significantly different levels of usage between S and U ribo . In every case, this matched amino acids shown to be significantly different between ΔD ribo and U ribo . Specifically, Gly showed a significant decrease, while Glu, Phe, Ile, and Tyr showed significant increases. While a significant decrease was not observed for Asn, and a significant increase not observed for Cys or Trp, these usages were more similar to ΔD ribo and would have been determined statistically significant at the critical value of α = 0.10 (Asn, p = 0.088; Cys, p = 0.080; Trp, p = 0.100). While S (Cys) and S (Trp) may not provide enough information for a reliable statistical inference in comparison with U ribo , the direction and magnitude of change are similar to the results obtained for ribosomal proteins (Table 6).
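For readers who wish to reproduce this kind of unpaired comparison, one common form of the two-sample z-test is the pooled two-proportion test sketched below. Whether this exact variant was used here is an assumption; the function name and the counts are hypothetical.

```python
import math

def two_proportion_z(count1, n1, count2, n2):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value).
    Undefined when the pooled proportion is exactly 0 or 1."""
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: a residue at 30 of 100 conserved positions in one
# dataset versus 10 of 100 in the other.
z, p = two_proportion_z(30, 100, 10, 100)
print(round(z, 3), p)
```

With counts as small as those available for Cys and Trp, the normal approximation behind this test is weak, which is consistent with the caution expressed above about those two residues.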

Discussion

Our method identifies several amino acids as late additions to the genetic code, including Cys, Phe, Ile, Val, Trp, Tyr, and possibly Lys, Glu, and Ser. Conversely, Gln, Gly, Leu, and possibly Pro, Asp, and Asn were likely part of an earlier version of the coding schema. Based on our criteria, trends could not be identified for other amino acids. Similar usage rates in the archaeal and bacterial domains suggest that all currently used amino acids became fixed in the genetic code before the time of the MRCA.

Coevolutionary theory suggests that amino acid usage in the genetic code is closely tied to the evolution of the biosynthetic pathways required for their production, especially if there is no prebiotic source for the amino acid in question (Wong 2005). Our results are in agreement with this model, as many of the amino acids we identify as the most recent additions, especially Phe, Ile, Trp, and Tyr, require extensive biosynthetic pathways in which they are the end products. While Cys and Glu are also predicted to be more recent additions, they have much simpler biosynthetic pathways. However, they are also used as intermediates in sulfur and nitrogen metabolism, and could have existed in this role for some time before becoming co-opted into the protein synthesis machinery. One must be cautious in inferring the availability of ancient substrates by extrapolating from current biosynthetic pathways, though, as many plausible alternative pathways could have been present at earlier times (Keefe et al. 1995).

Presumably, strong selective pressure and/or pervasive neutral mechanistic forces are required to initially add and retain a new amino acid in protein sequence space. However, once an amino acid is incorporated into the genetic code, it can be fixed at additional positions by even weak selection, and for reasons unrelated to the selective advantage driving the initial recruitment. For this reason, the specific functional roles of amino acids in ancient conserved positions may tell us little about the circumstances of their initial recruitment. In fact, since the studied proteins have changed little in structure and function between the phylogenetic nodes of our analysis, the specific roles of the residues in question are almost certainly noninformative in this regard. Rather, to infer the reason an amino acid was added to the genetic code, one should examine its known general physicochemical properties, focusing on characteristics that are observed to be lacking in other amino acids (Weber and Miller 1981).

The most significant trend in code expansion appears to be the increase in the number and kind of hydrophobic amino acids (Phe, Ile, Val, Trp, Tyr), especially those containing aromatic rings in their structure (Phe, Trp, Tyr). An increase in the usage of hydrophobic residues has been reported to provide favorable energetics for protein folding and stability in thermophiles (Sadeghi et al. 2006). However, this trend is also consistent with an increasing dominance of proteins in the organization of living systems. Larger, more complex proteins require aliphatic and aromatic hydrophobic residues for core packing in protein folding, the creation of more stable protein-protein interactions and association with lipid membranes. The latter may have been most important of all, as it has been observed that a major shortcoming of the RNA world would have been the inability to form adequate transmembrane structures (Vlassov 2005). Thus, the increasing complexity and dominance of proteins would be driven by the addition of novel biological functions for which RNA is physicochemically unsuitable, as well as the takeover of existing functions previously mediated by RNA.

The specific expansion of protein space to include both γ-branching (Leu) and β-branching (Ile, Val) aliphatic amino acids may also be related to an expansion of protein function and localization. While aliphatic hydrophobic amino acids are very similar in structure and function, it has been shown that Ile and Val have a higher propensity to be present in transmembrane helices, while Leu has a higher propensity for intracellular helices (Bywater et al. 2001). It has also been shown that Ile is preferred in β-sheet structures (Betts and Russell 2003), which are more common in larger proteins, and often are present in transmembrane structures. Conversely, the identification of Leu as a more ancient amino acid suggests that α-helices may be one of the most ancient secondary structural motifs, even supposing their transmembrane role evolved later on.

Increases in protein size would also favor the addition of sulfur-containing amino acids capable of forming disulfide bond linkages (Cys), contributing to the stability of larger protein structures and complexes. Disulfide linkages would also contribute to protein thermostability. The selective advantage of a genetic code incorporating Cys would be especially strong in evolutionary models where early impact events create a “bottleneck” that only thermophilic organisms can survive (Gogarten-Boekels et al. 1995; Nisbet and Sleep 2001). Incorporation of Cys would also permit the formation of Fe-S clusters, critical for electron transfer and cationic binding activities that would provide a strong metabolic advantage (Imlay 2006). Cys may also have been important in the takeover of other binding roles, still retained in some functional RNA structures (Holbrook 2005).

Aside from their general hydrophobic character, aromatic amino acids also share some properties with nucleic bases, possessing ring structures with hydrophobic surfaces (Shih et al. 1998) and, in the case of Trp and Tyr, the ability to form hydrogen bonds at their planar edges. This resemblance, especially in the case of Trp, may suggest that aromatic amino acids were important in the takeover of functions previously mediated by the nucleic bases of RNA, such as substrate binding and other functions depending on stacking interactions. In particular, these properties would have greatly increased proteins’ capacities to interact with the many nucleotide-based cofactors and small molecules which are presumably products of the RNA world themselves (Jadhav and Yarus 2002; Saran et al. 2003).

These results also generally agree with patterns observed in the evolutionary history of aminoacyl-tRNA synthetases (Cavalcanti et al. 2004). Amino acids we identify as “recent” are all associated with class I synthetases, with the exception of Phe, which uses an unusual class II synthetase that performs a 2’OH aminoacylation reaction similar to class I enzymes, and Lys, which has synthetases from both classes. Furthermore, the molecular phylogeny of class I synthetases identifies IleRS/ValRS and TrpRS/TyrRS as diverging relatively recently, supporting that the specificity of their cognate amino acids is a recent event in the evolution of the genetic code (Nagel and Doolittle 1995). Conversely, the molecular phylogeny of Class II synthetases shows GlyRS as a deeply branching lineage, supporting our assertion that this amino acid is more ancient (Hartlein and Cusack 1995). In potential conflict with a direct interpretation of the accepted phylogenies of aminoacyl-tRNA synthetases are our results that Asn is a more ancient amino acid and that Glu is more recent. Molecular phylogenies of class II synthetases show AsnRS diverging from AspRS within the Archaea, subsequently being transferred to the Bacteria; a similar history is also apparent for GlnRS and GluRS (Brown and Doolittle 1999; Tumbula-Hansen et al. 2002). Therefore, one would expect Glu and Asp to be more ancient than their amine forms. However, these paralogues both diverged from synthetase lineages that were nondiscriminating and depended on tRNA-dependent modification for the conversion of Asp to Asn, and Glu to Gln. Therefore, it is entirely possible that the ancestral state of these synthetase families was, in each case, nondiscriminating, completely obfuscating which amino acid occurred first in the genetic code. 
In general, because of the difficulty in inferring the ancestral substrate of related synthetases, and the distinct possibility that existing synthetases merely replaced a more ancient (possibly RNA-dependent) system, one should be cautious in using the molecular phylogeny of these proteins to infer which amino acids were absent in an earlier genetic code (Nagel and Doolittle 1995).

Two methods of inferring genetic code evolution from amino acid usage rates have been previously proposed, using either asymmetries in substitution matrices among closely related organisms (Jordan et al. 2005; Zuckerkandl et al. 1971) or ancestral sequence reconstructions of ancient protein lineages (Brooks et al. 2004). However, each of these methods has significant shortcomings. It has been reported that observed asymmetries in substitution matrices can be explained by neutral processes and do not imply trends related to genetic code evolution (McDonald 2006). It is also notable that this method relies on the assumption that more recent additions to the genetic code are still in the process of expanding their usage, therefore permitting extrapolation. Yet it seems at least as likely that asymmetry due to code expansion would only exist as a transient phase following the expansion event, with rapid replacement quickly achieving a new equilibrium across existing amino acids. At any rate, given the relative rapidity of the evolution of the genetic code in comparison to the more than 3 billion years of protein evolution since (Jordan et al. 2005), these initial asymmetries should have long since ceased to exist.

Perhaps more relevant to the work discussed here is a proposed compositional method for determining genetic code evolution, utilizing probabilistic methods for the inference of amino acid usage in ancestral sequence reconstructions (Brooks et al. 2004). This method identified Tyr, Trp, Thr, Ser, Phe, Leu, Gln, Cys, and Asp as more recent additions to the genetic code, with Val, Ile, His, and Glu as more ancient. In comparing our results to the trends observed by Brooks et al. (Table 4), we find that they are in agreement for significant increases in the use of Cys, Trp, and Tyr. However, they strongly disagree with the significant increases we observe for Phe, Ile, Val, Glu, and Lys, as well as with the significant decreases we observe for Gly and Asn.

We assert that our more stringent approach of only counting fully conserved positions allows for a better measure of genetic code evolution. Probabilistic approaches rely on the implicit assumption that the ancestral state of each amino acid position was a specific residue defined by a conventional genetic code organization. However, it is possible that more primitive versions of the genetic code were less discriminating and, in some cases, could not respond to selective pressure to drive a specific amino acid position into fixation. Rather, selection may have been “degenerate,” acting on a subset of similar amino acids that were not yet discriminated by the coding machinery. Using probabilistic inference to identify the “ancestral” amino acid for such a position would produce a misleading result. Conversely, the presence of a fully conserved residue in all modern lineages strongly suggests that enough specificity existed to drive this residue into fixation in the ancestral lineage. Furthermore, ancestral state reconstruction, even if correct, only shows selection or “state” at the point of sequence divergence; conserved sequence positions indicate past selection up to and through the point of species divergence and, thus, probe deeper into evolutionary time. Finally, their phylogenetic analysis relies on midpoint rooting of a tree containing bacterial, archaeal, and eukaryotic protein sequences. This is inconsistent with the generally accepted notion that the root of the tree of life is between the Bacteria and the Archaea, with eukaryotes represented as a long-branched sister group to the Archaea (Brown and Doolittle 1995; Gribaldo and Cammarano 1998; Zhaxybayeva et al. 2005). While data do exist that support other rootings (Zhaxybayeva et al. 2005), a midpoint rooting scheme remains problematic, especially considering the uncertainty associated with molecular evolution rates during such ancient periods, lacking any fossil record for calibration (Douzery et al. 2006).

Our analysis reveals an important stage of genetic code evolution, and its role in the transition from an ancestral RNA-peptide world to the modern DNA-RNA-protein world. Furthermore, it does so in an entirely empirical fashion, without any a priori assumptions concerning environmental conditions, metabolic pathways, or codon organization. Such independence makes our approach uniquely useful in evaluating these other factors, in order to better identify the mechanisms and selective pressures responsible for producing the modern organization of the genetic code.