Abstract
The genetic code is the syntactic foundation underlying the structure and function of every protein in the history of the biological world. Its highly ordered degenerate complexity suggests an incremental evolution, the result of a combination of selective, mechanistic, and random processes. These evolutionary processes are still poorly understood and remain an open question in the study of early life on Earth. We perform a compositional analysis of ribosomal proteins and ATPase subunits in bacterial and archaeal lineages, using conserved positions that came and remained under purifying selection before and up to the most recent common ancestor. An observable shift in amino acid usage at these conserved positions likely provides an untapped window into the history of protein sequence space, allowing events of genetic code expansion to be identified. We identify Cys, Glu, Phe, Ile, Lys, Val, Trp, and Tyr as recent additions to the genetic code, with Asn, Gln, Gly, and Leu among the more ancient. Our observations are consistent with a scenario in which genetic code expansion primarily favored amino acids that promoted an increase in polypeptide size and functionality. We propose that this expansion would have been critical in the takeover of many RNA-mediated processes, as well as the addition of novel biological functions inaccessible to an RNA-based physiology, such as crossing lipid membranes. Thus, expansion of the genetic code likely set the stage for the transition from RNA-based to protein-based life.
Introduction
Many models for the evolution of the genetic code have been proposed. These models are supported by diverse lines of evidence, including analysis of biosynthetic pathways (coevolution theory), translation kinetics, amino acid physicochemical properties, codon designations, phylogeny of aminoacyl-tRNA synthetases, and composition of ancestral sequence reconstructions (Brooks et al. 2004; Cavalcanti et al. 2004; Davis 1999; Hartman 1975; Hartman 1978; Hartman 1984; Hartman 1995; Higgs and Pudritz 2006; Knight et al. 1999; Trifonov 2004; Wong 2005). Most attempt to provide a mechanistic explanation of the organization and evolution of the genetic code, typically via a process of “expansion,” whereby new amino acids are added to the translational schema; these models usually do not rely directly on empirical compositional evidence gathered from existing protein sequences.
There are many protein lineages universal to all extant life, presumably similar in structure and function to proteins present at the time of the most recent common ancestor (MRCA) (Delaye et al. 2005; Gogarten and Taiz 1992; Koonin 2003). Amino acid positions that came and remained under purifying selection early in the evolution of life will be conserved in all or most homologous modern versions of these proteins and, therefore, should reflect the state of the genetic code at the time of their fixation. Comparison with amino acid usage rates at more recently fixed positions allows for these trends to be identified, providing insight into the evolution of the set of amino acids used in the genetic code. This method intentionally focuses on code evolution from the perspective of the amino acid set available to synthesize polypeptides. While codon expansion and redesignation are also important aspects of genetic code evolution, these processes cannot be analyzed using only amino acid sequence alignments, and their effects are not considered here.
Ribosomal proteins are especially useful for investigating this effect, as the ribosome is one of the most ancient and well-conserved structures in the biological world. It is mostly comprised of RNA, which serves to provide the necessary catalytic activity for polypeptide synthesis (Noller et al. 2005). The ribosome also contains many proteins, of which 27 are universally conserved in all life, the postulated “ribosomal core” (Harris et al. 2003). This large number of subunits allows for many independent samples of amino acid usage at conserved positions to be measured, providing a basis for statistical analysis. Similarly well-conserved, ATP synthase proteins consist of complexes of catalytic and noncatalytic subunits which are present as V-type (vacuolar) and F-type homologues in Archaea and Bacteria, respectively. Interestingly, the catalytic and noncatalytic subunits are also homologous, resulting from an ancient gene duplication before the MRCA (Gogarten et al. 1989). For this reason, positions conserved between catalytic and noncatalytic subunits present an even more ancient view of amino acid composition. Since it is impossible to determine at what time before divergence each position came into fixation, we cannot determine the relative ages of these ATPase and ribosomal protein lineages.
We also investigate the alternative hypothesis that observed changes in amino acid usage at conserved positions are explained by factors independent of genetic code expansion. By definition, alignments of more divergent protein sequences produce fewer conserved positions, and some amino acids (such as glycine) may be more likely to be conserved; observed trends in amino acid usage may be a function of these effects. To test this possibility, we analyzed amino acid usage at conserved positions in the set of spliceosomal core proteins. Under our hypothesis, these more recently evolved proteins should not show the same trends, having never experienced purifying selection under a more primitive genetic code. Rather, the trends observed in more recently evolved protein families should be more similar to the domain-specific positions in ancient protein lineages.
The spliceosome is a much more recent biological invention than the ribosome, having no known prokaryotic homologue. Spliceosomal proteins resemble ribosomal proteins in their organization as a ribonucleoprotein complex with RNA-mediated catalytic activity and many protein-nucleic acid interactions. They may also have been subjected to the same type of neutral evolutionary pressure (Stoltzfus 1999). These similarities all suggest that this is a suitable dataset for comparison. Most importantly, spliceosomal core proteins show a similar level of sequence conservation. The 22 spliceosomal proteins analyzed have 6.77% of total positions universally conserved. Across all 27 ribosomal core proteins, 5.53% of total positions are universally conserved. Within the archaeal and bacterial domains, an average of 13.89% of total positions are conserved (not including universal positions).
Examining the conserved positions in ribosomal and ATPase proteins, we identify a subset of amino acids that are likely the most recent additions to the genetic code, as well as amino acids that are likely among the most ancient. Analysis of conserved positions within spliceosomal core protein groups shows that this trend is not an artifact of divergent protein sequence alignments. In addition, our results suggest that the genetic code was present as the “complete” 20-amino acid schema at the time of the MRCA. We discuss these results with regard to overall trends in the evolution of the genetic code, and propose a scenario in which genetic code expansion may have promoted the transition from RNA-based to protein-based life.
Methods
Sequences
For the analysis of ribosomal and ATPase proteins, a total of 18 genomes were used, from 9 archaeal and 9 bacterial species (Table 1). Species were selected to maximize phylogenetic and %GC distribution. For the analysis of spliceosomal proteins, 10 divergent eukaryotic lineages were used (Table 1).
The universally conserved set of ribosomal proteins in both large (50S) and small (30S) subunits (Harris et al. 2003) was collected for alignment and analysis (Table 2). Genomic BLAST (Cummings et al. 2002) was used to verify that these proteins were universally conserved in both Archaea and Bacteria. Archaeal homologues were not found for LSU protein L16 (RplP). Therefore, this protein was omitted from the universal ribosomal proteins used in this analysis.
For the analysis of spliceosomal proteins, ancestral spliceosomal proteins were defined as the subset of proteins “likely” or “very likely” to be present in the most recent eukaryotic ancestor (Collins and Penny 2005) for which a homologue could be identified in G. lamblia using either BLASTP or TBLASTN.
Counting of Conserved Positions
For each amino acid, we counted positions that are universally conserved within ribosomal/spliceosomal/ATPase protein types in alignments generated by three different algorithms (ClustalW, MUSCLE, T-Coffee) (Edgar 2004; Poirot et al. 2004; Thompson et al. 1994). Counts of conserved amino acids for ribosomal, spliceosomal, and ATPase protein families are provided as supplementary material (Supplementary Tables 1–3). Default parameter settings were used for all alignment algorithms. Conserved Met residues in initiation positions were excluded, as their nearly obligatory presence and position in all sequences render them noninformative and a potential source of bias.
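The counting step described above can be illustrated with a minimal sketch. It assumes the alignment is already available as a list of equal-length strings with "-" for gaps; the function name `count_conserved`, the input format, and the rule used to recognize an initiator Met (a conserved M that is the first non-gap residue of every sequence) are illustrative assumptions, not the in-house software used for the analysis.

```python
from collections import Counter

GAP = "-"

def count_conserved(alignment, skip_initiator_met=True):
    """Count residues at alignment columns that are identical in every sequence.

    `alignment` is a list of equal-length aligned sequences (one-letter codes,
    gaps as "-"). The name and input format are assumptions for illustration.
    """
    if not alignment:
        return Counter()
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must all have the same length")

    counts = Counter()
    for col in range(length):
        column = {seq[col] for seq in alignment}
        if len(column) != 1 or GAP in column:
            continue  # variable column, or a column made only of gaps
        residue = next(iter(column))
        # Assumed rule: an initiator Met is a conserved M that is the first
        # non-gap residue of every sequence in the alignment.
        is_first = all(seq[:col].replace(GAP, "") == "" for seq in alignment)
        if skip_initiator_met and residue == "M" and is_first:
            continue
        counts[residue] += 1
    return counts

# Toy alignment: only the Gly column is conserved once the initiator Met is skipped.
toy = ["MKGA-", "MRGAV", "MKGSV"]
print(count_conserved(toy))  # Counter({'G': 1})
```

In practice the same function would be run once per alignment algorithm, so that the three sets of counts can be compared.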
Bacterial (B) and archaeal (A) conserved amino acid counts for the ribosomal protein sets were obtained using the bacterial and archaeal subsets of the universal alignments. All bacterial and archaeal sequences clustered based on sequence similarity within their respective domains, indicating that for these data no interdomain horizontal gene transfer has occurred. Universally conserved position counts (U ribo) were then subtracted from these sets to produce the count of conserved positions exclusive to each domain: ΔB = B − U ribo and ΔA = A − U ribo.
The different alignment algorithms gave nearly identical counts at universal positions (95.93% of these positions were identified by all three different alignment algorithms). Bacterial- and archaeal-specific conserved positions showed similar levels of fidelity (ΔB, 87.59%; ΔA, 89.81%). In order to compare U ribo against all domain-specific values, ΔB and ΔA were then combined to produce the set ΔD ribo: ΔD ribo = ΔB + ΔA.
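The bookkeeping for domain-specific counts can be sketched as follows. The sketch assumes that per-residue counts are held in `collections.Counter` objects, that domain-specific counts are obtained by subtracting universal counts from domain counts, and that "combined" means a per-residue sum; the function names and the toy numbers are invented for illustration.

```python
from collections import Counter

def domain_specific(domain_counts, universal_counts):
    """Delta = domain-conserved counts minus universally conserved counts."""
    # Counter subtraction keeps only positive results, so amino acids whose
    # conserved positions are all universal drop out of the result.
    return Counter(domain_counts) - Counter(universal_counts)

def combine_domains(delta_b, delta_a):
    """Delta-D, taken here (an assumption) as the per-residue sum of both sets."""
    return Counter(delta_b) + Counter(delta_a)

# Invented counts, for illustration only.
universal = Counter(G=3, A=1)
bacterial = Counter(G=5, K=3, A=2)
archaeal = Counter(G=4, K=1, L=2, A=1)

delta_b = domain_specific(bacterial, universal)  # G:2, K:3, A:1
delta_a = domain_specific(archaeal, universal)   # G:1, K:1, L:2
delta_d = combine_domains(delta_b, delta_a)      # G:3, K:4, L:2, A:1
```

Because every universally conserved position is also conserved within each domain, the subtraction should never go negative on real data; the Counter semantics simply make that explicit.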
The same methodology was used for ATPase subunits (vA, vB, fα, fβ), expanded to include positions conserved within the pre-MRCA ancient gene duplication. Note that ΔD ATP is not equivalent to ΔD ribo ; rather, ΔD ATP ≈ U ribo , as both map to the phylogenetic node of the MRCA (Fig. 1). Positions conserved within both catalytic (C) and noncatalytic (nC) paralogues are defined as paralogue-specific (P):
The mean values per protein per amino acid of U ribo, U ATP, ΔP ATP, ΔB, ΔA, ΔD ribo, ΔD ATP, and S (spliceosomal) for the three different alignment algorithms were then normalized with respect to total conserved positions within each. In order to control for the small and varied sample size in universally conserved positions across ribosomal proteins, for the comparison of U ribo and ΔD ribo a weighted counting scheme was also used (\( W_N \propto N^{1/2} \)). However, these normalized weighted counts were nearly identical to the unweighted, suggesting that our analysis is not affected by artifacts due to sample size.
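One way to read the normalization and weighting step is sketched below. It assumes that N is the number of conserved positions contributed by a given protein and that each protein's normalized usage is averaged with a weight proportional to the square root of N; this interpretation, the function names, and the example counts are assumptions rather than a description of the original implementation.

```python
import math

def normalized_usage(counts):
    """Convert per-residue counts into fractions of all conserved positions."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}

def weighted_usage(per_protein_counts):
    """Average per-protein usage with weights proportional to sqrt(N).

    N is taken (an assumption) to be the number of conserved positions
    in each protein; proteins with no conserved positions are skipped.
    """
    weighted = {}
    weight_sum = 0.0
    for counts in per_protein_counts:
        n = sum(counts.values())
        if n == 0:
            continue
        w = math.sqrt(n)
        weight_sum += w
        for aa, frac in normalized_usage(counts).items():
            weighted[aa] = weighted.get(aa, 0.0) + w * frac
    if weight_sum == 0:
        return {}
    return {aa: v / weight_sum for aa, v in weighted.items()}

# Two hypothetical proteins with 4 and 16 conserved positions (weights 2 and 4).
proteins = [{"G": 2, "K": 2}, {"G": 4, "K": 12}]
usage = weighted_usage(proteins)  # G = 1/3, K = 2/3
```

A square-root weight lets proteins with more conserved positions count for more without allowing a single large protein to dominate the average.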
Statistical Analysis
The many ribosomal core proteins can be treated as individual samples, allowing for a more rigorous statistical analysis of this particular dataset. In order to determine significant differences between amino acid usages, paired and unpaired analyses were performed using parametric methods (paired two-sided two-sample t-test, single-factor ANOVA) and nonparametric methods (Wilcoxon signed-rank test, simple bootstrap). Simple bootstrap was performed for the ribosomal protein data by generating 1000 random samples from ΔD ribo and comparing against the observed amino acid counts of U ribo. Since pairwise statistics are impossible between ribosomal and spliceosomal (S) protein datasets (as proteins are nonhomologous), in this case a two-sample z-test was used. The VassarStats online web resource (http://www.faculty.vassar.edu/lowry/VassarStats.html), Microsoft Excel data analysis tools, and in-house software (available upon request) were used for performing these analyses.
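A simple bootstrap of this kind can be sketched as below. The sketch assumes that the domain-specific residues are pooled into one sequence, that each replicate draws as many residues (with replacement) as there are universally conserved positions, and that the observed universal count of one amino acid is compared with the replicate counts; the function name, arguments, and the one-sided summaries it returns are hypothetical choices.

```python
import random

def bootstrap_usage(pool, observed_count, residue, sample_size,
                    n_boot=1000, seed=0):
    """Empirical tail fractions for one amino acid.

    pool           : sequence of residues at domain-specific conserved positions
    observed_count : count of `residue` at universally conserved positions
    sample_size    : number of universally conserved positions
    Returns (fraction of replicates with count >= observed,
             fraction of replicates with count <= observed).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    at_least = at_most = 0
    for _ in range(n_boot):
        sample = rng.choices(pool, k=sample_size)  # sampling with replacement
        count = sample.count(residue)
        at_least += count >= observed_count
        at_most += count <= observed_count
    return at_least / n_boot, at_most / n_boot

# If Gly never occurs in the pool, no replicate can reach an observed count of 5.
high, low = bootstrap_usage("KKKKAAAA", observed_count=5, residue="G",
                            sample_size=10)
print(high, low)  # 0.0 1.0
```

A small first value suggests the residue is over-represented at universal positions relative to domain-specific ones, and a small second value suggests the opposite.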
Results
Comparison of Domains
Relative amino acid usage at conserved positions was compared between bacterial and archaeal domains (ΔB vs. ΔA). A strikingly linear relationship was identified between the individual amino acid usages of both domains (R 2 = 0.8178) (Fig. 2), with only Lys and Pro showing a significant difference in usage (Table 3). Similar results have been reported with respect to the overall amino acid composition of organismal lineages (Sorimachi et al. 2001). Such similar usage rates between domains suggest that bacteria and archaea inherited the same genetic code from a common ancestor, with amino acids subsequently having equal opportunity for fixation in domain-specific conserved positions. This does not support any proposed polyphyletic origin of the genetic code (Di Giulio 2006; Syvanen 2002). A polyphyletic origin should predict amino acid preferences between domains representing their respective ancestral codes, especially at the highly conserved positions of ribosomal proteins.
Although a significant difference was observed between ΔB(Pro) and ΔA(Pro), neither value differed significantly from U ribo(Pro) or ΔD ribo(Pro). This is in contrast to Lys, where a significant difference was observed between ΔB(Lys) and U ribo(Lys) (p = 0.001). However, both ΔB(Lys) and ΔA(Lys) showed higher levels than U ribo(Lys).
With regard to the observed significant difference in the usage of Lys, it is interesting to note that Lys is also unique in that it has two distinct biosynthetic pathways (DAP and AAA) (Nishida et al. 1999; Vogel 1964) and aminoacyl-tRNA synthetases (Class I, Class II) (Ibba et al. 1997). In both cases, evidence exists that this duality reflects domain-specific origins, albeit with significant horizontal gene transfer between groups (Klipcan and Safro 2004; Tumbula et al. 1999). The much wider phylogenetic distribution of the DAP pathway and class II synthetase favored by the bacterial domain, however, may indicate that they are more ancient. This conclusion is further supported by the observed shared evolutionary history of the DAP and arginine biosynthesis pathways, as well as the similarity of two codon blocks used by these amino acids, AAR and AGR, respectively (Velasco et al. 2002). The higher usage of Lys in bacterial conserved positions may be a consequence of retaining the ancestral biosynthesis and incorporation machinery for this amino acid.
Comparison of Universal and Domain-Specific Ribosomal Core Positions
Relative amino acid usage was compared between universally conserved and domain-specific conserved positions in the set of 27 ribosomal core proteins (Fig. 3). Although many data fail the criteria for parametric analysis (lacking normally distributed residuals or equal variance), results of parametric and nonparametric analyses nevertheless strongly agreed for each amino acid. Several statistical methods confirmed that ΔD ribo shows a significant decrease in the usage of Gly and Asn and a significant increase in the usage of Cys, Glu, Phe, Ile, Lys, Trp, and Tyr (Table 4). Less significant were the observed decreases for Ala, Gln, His, Met, and Leu and increases for Arg, Ser, Thr, and Val. Asp and Pro showed almost no change in usage. Amino acids with statistically significant changes in usage explain 71.78% of the total change in amino acid usages between U ribo and ΔD ribo .
The most extreme trend was the decrease in Gly, accounting for 32.59% of the total change in amino acid usages between U ribo and ΔD ribo . Cys and Trp were completely absent in universally conserved positions. Relative Ile usage increased by almost a factor of 10, while Tyr usage increased by more than a factor of 20.
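The share of the total change attributed to a single amino acid can be computed in several ways. The sketch below assumes that the total change is the sum of absolute differences in normalized usage across all amino acids, and that the share of each amino acid is its absolute difference divided by that total; this definition, the function name, and the example fractions are assumptions made for illustration.

```python
def change_shares(usage_old, usage_new):
    """Fraction of the total absolute usage change due to each amino acid.

    Both arguments map amino acids to normalized usage (fractions summing to 1).
    Residues missing from one mapping are treated as having zero usage there.
    """
    residues = set(usage_old) | set(usage_new)
    diffs = {aa: usage_new.get(aa, 0.0) - usage_old.get(aa, 0.0)
             for aa in residues}
    total = sum(abs(d) for d in diffs.values())
    if total == 0:
        return {aa: 0.0 for aa in residues}
    return {aa: abs(d) / total for aa, d in diffs.items()}

# Hypothetical usage at universal (old) and domain-specific (new) positions.
old = {"G": 0.5, "K": 0.3, "Y": 0.2}
new = {"G": 0.3, "K": 0.4, "Y": 0.3}
shares = change_shares(old, new)  # G accounts for half of the total change
```

Under this definition a residue that is absent from universal positions, as reported for Cys and Trp, still contributes through its usage at domain-specific positions.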
Comparison of Universal, Paralogue-Specific, and Domain-Specific ATPase Positions
Relative amino acid usage was also compared between paralogue-specific and domain-specific conserved positions in ATPase proteins (Fig. 3). While the ATPase dataset does not contain enough information for detailed statistical analysis, trends can nonetheless be compared to the results from the ribosomal core protein dataset (Table 4). Only 11 conserved positions were identified in U ATP , too few to recognize any meaningful trends besides that of an excess of Gly residues. Much more informative is the comparison of ΔP ATP and ΔD ATP , which identifies the amino acids with the largest relative increases in usage as Cys, Phe, Ile, Trp, Tyr, His, Met, Asn, Val, and Ala. The largest relative decreases were for Thr, Leu, Gly, Gln, Pro, and Arg. Asp, Ser, Glu, and Lys showed little change in usage.
Trend Consensus
Results from ribosomal and ATPase protein analyses were combined into a consensus of amino acid usage trends, classifying amino acids as “recent” or “ancient” additions (Fig. 4). A three-tiered consensus was constructed with decreasingly stringent criteria (Table 7). The strict consensus identifies Cys, Phe, Ile, Val, Trp, and Tyr as recent additions, and Gly, Gln, and Leu as ancient. The semistrict consensus further identifies Glu, Lys, and Ser as recent, and Pro as ancient. The most permissive consensus additionally includes Asp as ancient. Several amino acids showed strong disagreement between datasets and were not part of any consensus (Ala, Arg, Asn, His, Met, Thr). The semistrict consensus contains the greatest overlap with the amino acid trends in the ribosomal dataset found to be statistically significant. Overall, the ATPase dataset is in agreement with six of the nine statistically significant amino acid trends identified in ribosomal proteins (Cys, Gly, Phe, Ile, Trp, Tyr) and shows strong disagreement with only one (Asn). As the ribosomal dataset contains many samples, likely covering a much broader spectrum of evolutionary time, statistical significance in this set is additionally informative to the consensus model. For this reason, Asn may very well be an “ancient” addition. Additionally, the inclusion of Glu and Lys in the semistrict “recent” consensus carries more weight than the inclusion of Ser.
Comparison to Spliceosomal Core Proteins
For this analysis, we defined the spliceosomal core as the set of spliceosomal proteins determined to be “likely” or “very likely” present in the most recent eukaryotic ancestor (Collins and Penny 2005). Of these 51 proteins, we identified 22 for which a homologue could be identified in G. lamblia, presumably one of the deepest-branching extant eukaryotic lineages (Nixon et al. 2002) (Table 5). Amino acids at universally conserved positions were then counted and compared to the universally conserved positions of ribosomal core proteins.
Amino acid usages at conserved positions in spliceosomal core proteins (S) are more similar to ribosomal amino acid usages at domain-specific positions (Table 6), supporting the hypothesis that universally conserved amino acid usages reflect a more primitive genetic code. In fact, the sum of the difference between S and U ribo across all amino acids is almost twice the difference between S and ΔD ribo (61.4% and 32.5%, respectively). This difference remains large even if Gly is omitted from the analysis. Statistical analysis shows that amino acid usages for S and ΔD ribo do not significantly differ for any amino acid, except Lys. Interestingly, S(Lys) is most similar to ΔA(Lys), which was also shown to be statistically different from ΔB(Lys). This may suggest that Lys usage in Archaea and Eukarya is more similar when compared with Bacteria.
Several amino acids were shown to have significantly different levels of usage between S and U ribo . In every case, this matched amino acids shown to be significantly different between ΔD ribo and U ribo . Specifically, Gly showed a significant decrease, while Glu, Phe, Ile, and Tyr showed significant increases. While a significant decrease was not observed for Asn, and a significant increase not observed for Cys or Trp, these usages were more similar to ΔD ribo and would have been determined statistically significant at the critical value of α = 0.10 (Asn, p = 0.088; Cys, p = 0.080; Trp, p = 0.100). While S (Cys) and S (Trp) may not provide enough information for a reliable statistical inference in comparison with U ribo , the direction and magnitude of change are similar to the results obtained for ribosomal proteins (Table 6).
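The exact form of the two-sample z-test is not specified here. One common form for comparing the usage of an amino acid between two nonhomologous datasets is the pooled two-proportion z-test, sketched below as an assumption; the function name and the example counts are hypothetical.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value).

    x1, x2 : counts of one amino acid in each dataset
    n1, n2 : total conserved positions in each dataset
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("z is undefined when the pooled proportion is 0 or 1")
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 30 of 100 positions versus 10 of 100 positions.
z, p = two_proportion_z(30, 100, 10, 100)
significant_at_005 = p < 0.05
significant_at_010 = p < 0.10
```

Results that fall between the two thresholds, as reported above for Asn, Cys, and Trp, would be flagged by the second comparison but not by the first.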
Discussion
Our method identifies several amino acids as late additions to the genetic code, including Cys, Phe, Ile, Val, Trp, Tyr, and possibly Lys, Glu, and Ser. Conversely, Gln, Gly, Leu, and possibly Pro, Asp, and Asn were likely part of an earlier version of the coding schema. Based on our criteria, trends could not be identified for other amino acids. Similar usage rates in the archaeal and bacterial domains suggest that all currently used amino acids became fixed in the genetic code before the time of the MRCA.
Coevolutionary theory suggests that amino acid usage in the genetic code is closely tied to the evolution of the biosynthetic pathways required for their production, especially if there is no prebiotic source for the amino acid in question (Wong 2005). Our results are in agreement with this model, as many of the amino acids we identify as the most recent additions, especially Phe, Ile, Trp, and Tyr, require extensive biosynthetic pathways in which they are the end products. While Cys and Glu are also predicted to be more recent additions, they have much simpler biosynthetic pathways. However, they are also used as intermediates in sulfur and nitrogen metabolism, and could have existed in this role for some time before becoming co-opted into the protein synthesis machinery. One must be cautious in inferring the availability of ancient substrates by extrapolating from current biosynthetic pathways, though, as many plausible alternative pathways could have been present at earlier times (Keefe et al. 1995).
Presumably, strong selective pressure and/or pervasive neutral mechanistic forces are required to initially add and retain a new amino acid in protein sequence space. However, once an amino acid is incorporated into the genetic code, it can be fixed at additional positions by even weak selection, and for reasons unrelated to the selective advantage driving the initial recruitment. For this reason, the specific functional roles of amino acids in ancient conserved positions may tell us little about the circumstances of their initial recruitment. In fact, since the studied proteins have changed little in structure and function between the phylogenetic nodes of our analysis, the specific roles of the residues in question are almost certainly noninformative in this regard. Rather, to infer the reason an amino acid was added to the genetic code, one should examine its known general physicochemical properties, focusing on characteristics that are observed to be lacking in other amino acids (Weber and Miller 1981).
The most significant trend in code expansion appears to be the increase in the number and kind of hydrophobic amino acids (Phe, Ile, Val, Trp, Tyr), especially those containing aromatic rings in their structure (Phe, Trp, Tyr). An increase in the usage of hydrophobic residues has been reported to provide favorable energetics for protein folding and stability in thermophiles (Sadeghi et al. 2006). However, this trend is also consistent with an increasing dominance of proteins in the organization of living systems. Larger, more complex proteins require aliphatic and aromatic hydrophobic residues for core packing in protein folding, the creation of more stable protein-protein interactions, and association with lipid membranes. The latter may have been most important of all, as it has been observed that a major shortcoming of the RNA world would have been the inability to form adequate transmembrane structures (Vlassov 2005). Thus, the increasing complexity and dominance of proteins would be driven by the addition of novel biological functions for which RNA is physicochemically unsuitable, as well as the takeover of existing functions previously mediated by RNA.
The specific expansion of protein space to include both γ-branching (Leu) and β-branching (Ile, Val) aliphatic amino acids may also be related to an expansion of protein function and localization. While aliphatic hydrophobic amino acids are very similar in structure and function, it has been shown that Ile and Val have a higher propensity to be present in transmembrane helices, while Leu has a higher propensity for intracellular helices (Bywater et al. 2001). It has also been shown that Ile is preferred in β-sheet structures (Betts and Russell 2003), which are more common in larger proteins, and often are present in transmembrane structures. Conversely, the identification of Leu as a more ancient amino acid suggests that α-helices may be one of the most ancient secondary structural motifs, even supposing their transmembrane role evolved later on.
Increases in protein size would also favor the addition of sulfur-containing amino acids capable of forming disulfide bond linkages (Cys), contributing to the stability of larger protein structures and complexes. Disulfide linkages would also contribute to protein thermostability. The selective advantage of a genetic code incorporating Cys would be especially strong in evolutionary models where early impact events create a “bottleneck” that only thermophilic organisms can survive (Gogarten-Boekels et al. 1995; Nisbet and Sleep 2001). Incorporation of Cys would also permit the formation of Fe-S clusters, critical for electron transfer and cationic binding activities that would provide a strong metabolic advantage (Imlay 2006). Cys may also have been important in the takeover of other binding roles, still retained in some functional RNA structures (Holbrook 2005).
Aside from their general hydrophobic character, aromatic amino acids also share some properties with nucleic bases, possessing ring structures with hydrophobic surfaces (Shih et al. 1998) and, in the case of Trp and Tyr, the ability to form hydrogen bonds at their planar edges. This resemblance, especially in the case of Trp, may suggest that aromatic amino acids were important in the takeover of functions previously mediated by the nucleic bases of RNA, such as substrate binding and other functions depending on stacking interactions. In particular, these properties would have greatly increased proteins’ capacities to interact with the many nucleotide-based cofactors and small molecules which are presumably products of the RNA world themselves (Jadhav and Yarus 2002; Saran et al. 2003).
These results also generally agree with patterns observed in the evolutionary history of aminoacyl-tRNA synthetases (Cavalcanti et al. 2004). Amino acids we identify as “recent” are all associated with class I synthetases, with the exception of Phe, which uses an unusual class II synthetase that performs a 2′-OH aminoacylation reaction similar to class I enzymes, and Lys, which has synthetases from both classes. Furthermore, the molecular phylogeny of class I synthetases identifies IleRS/ValRS and TrpRS/TyrRS as diverging relatively recently, suggesting that the specificity of their cognate amino acids is a recent event in the evolution of the genetic code (Nagel and Doolittle 1995). Conversely, the molecular phylogeny of class II synthetases shows GlyRS as a deeply branching lineage, supporting our assertion that this amino acid is more ancient (Hartlein and Cusack 1995). In potential conflict with a direct interpretation of the accepted phylogenies of aminoacyl-tRNA synthetases are our results that Asn is a more ancient amino acid and that Glu is more recent. Molecular phylogenies of class II synthetases show AsnRS diverging from AspRS within the Archaea, subsequently being transferred to the Bacteria; a similar history is also apparent for GlnRS and GluRS (Brown and Doolittle 1999; Tumbula-Hansen et al. 2002). Therefore, one would expect Glu and Asp to be more ancient than their amide forms. However, these paralogues both diverged from synthetase lineages that were nondiscriminating and depended on tRNA-dependent modification for the conversion of Asp to Asn, and Glu to Gln. Therefore, it is entirely possible that the ancestral state of these synthetase families was, in each case, nondiscriminating, completely obfuscating which amino acid occurred first in the genetic code.
In general, because of the difficulty in inferring the ancestral substrate of related synthetases, and the distinct possibility that existing synthetases merely replaced a more ancient (possibly RNA-dependent) system, one should be cautious in using the molecular phylogeny of these proteins to infer which amino acids were absent in an earlier genetic code (Nagel and Doolittle 1995).
Two methods of inferring genetic code evolution from amino acid usage rates have been previously proposed, using either asymmetries in substitution matrices among closely related organisms (Jordan et al. 2005; Zuckerkandl et al. 1971) or ancestral sequence reconstructions of ancient protein lineages (Brooks et al. 2004). However, each of these methods has significant shortcomings. It has been reported that observed asymmetries in substitution matrices can be explained by neutral processes and do not imply trends related to genetic code evolution (McDonald 2006). It is also notable that this method relies on the assumption that more recent additions to the genetic code are still in the process of expanding their usage, therefore permitting extrapolation. Yet it seems at least as likely that asymmetry due to code expansion would only exist as a transient phase following the expansion event, with rapid replacement quickly achieving a new equilibrium across existing amino acids. At any rate, given the relative rapidity of the evolution of the genetic code in comparison to the more than 3 billion years of protein evolution since (Jordan et al. 2005), these initial asymmetries should have long since ceased to exist.
Perhaps more relevant to the work discussed here is a proposed compositional method for determining genetic code evolution, utilizing probabilistic methods for the inference of amino acid usage in ancestral sequence reconstructions (Brooks et al. 2004). This method identified Tyr, Trp, Thr, Ser, Phe, Leu, Gln, Cys, and Asp as more recent additions to the genetic code, with Val, Ile, His, and Glu as more ancient. In comparing our results to the trends observed by Brooks et al. (Table 4), we find that they are in agreement for significant increases in the use of Cys, Trp, and Tyr. However, they strongly disagree with the significant increases we observe for Phe, Ile, Val, Glu, and Lys, as well as with the significant decreases we observe for Gly and Asn.
We argue that our more stringent approach of counting only fully conserved positions provides a better measure of genetic code evolution. Probabilistic approaches rely on the implicit assumption that the ancestral state of each amino acid position was a specific residue defined by a conventional genetic code organization. However, it is possible that more primitive versions of the genetic code were less discriminating and, in some cases, could not respond to selective pressure to drive a specific amino acid position into fixation. Rather, selection may have been “degenerate,” acting on a subset of similar amino acids that were not yet discriminated by the coding machinery. Using probabilistic inference to identify the “ancestral” amino acid for such a position would produce a misleading result. Conversely, the presence of a fully conserved residue in all modern lineages strongly suggests that enough specificity existed to drive this residue into fixation in the ancestral lineage. Furthermore, ancestral state reconstruction, even if correct, only shows selection or “state” at the point of sequence divergence; conserved sequence positions indicate past selection up to and through the point of species divergence and, thus, probe deeper into evolutionary time.
Finally, the phylogenetic analysis of Brooks et al. (2004) relies on midpoint rooting of a tree containing bacterial, archaeal, and eukaryotic protein sequences. This is inconsistent with the generally accepted notion that the root of the tree of life lies between the Bacteria and the Archaea, with eukaryotes represented as a long-branched sister group to the Archaea (Brown and Doolittle 1995; Gribaldo and Cammarano 1998; Zhaxybayeva et al. 2005). While data do exist that support other rootings (Zhaxybayeva et al. 2005), a midpoint rooting scheme remains problematic, especially given the uncertainty associated with molecular evolution rates during such ancient periods, for which no fossil record is available for calibration (Douzery et al. 2006).
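The idea of counting only fully conserved positions can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the gap characters, and the toy alignment are all hypothetical, and the input is assumed to be a list of equal-length, pre-aligned sequences. A column is counted only if every sequence carries the same non-gap residue there.

```python
from collections import Counter

GAP_CHARS = "-."  # assumed gap symbols; adjust to the alignment format in use


def conserved_composition(alignment):
    """Count residues at columns where all sequences share one non-gap residue."""
    if not alignment:
        return Counter()
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    counts = Counter()
    for column in zip(*alignment):  # iterate over alignment columns
        if len(set(column)) == 1 and column[0] not in GAP_CHARS:
            counts[column[0]] += 1
    return counts


def overall_composition(alignment):
    """Count every non-gap residue in the alignment, conserved or not."""
    counts = Counter()
    for seq in alignment:
        counts.update(ch for ch in seq if ch not in GAP_CHARS)
    return counts


# Hypothetical toy alignment (three sequences, eight columns).
toy_alignment = [
    "MKG-LVNA",
    "MRGALVNA",
    "MKGSLINA",
]

conserved = conserved_composition(toy_alignment)
overall = overall_composition(toy_alignment)
print(sorted(conserved.items()))
print(sorted(overall.items()))
```

Comparing the two tallies for each amino acid gives a rough sense of how usage at conserved positions differs from usage across the alignment as a whole; any statistical test of that difference is outside the scope of this sketch.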
Our analysis reveals an important stage of genetic code evolution, and its role in the transition from an ancestral RNA-peptide world to the modern DNA-RNA-protein world. Furthermore, it does so in an entirely empirical fashion, without any a priori assumptions concerning environmental conditions, metabolic pathways, or codon organization. Such independence makes our approach uniquely useful in evaluating these other factors, in order to better identify the mechanisms and selective pressures responsible for producing the modern organization of the genetic code.
References
Betts M, Russell R (2003) Amino acid properties and consequences of substitutions. Wiley, West Sussex
Brooks D, Fresco J, Singh M (2004) A novel method for estimating ancestral amino acid composition and its application to proteins of the Last Universal Ancestor. Bioinformatics 20:2251–2257
Brown J, Doolittle W (1995) Root of the universal tree of life based on ancient aminoacyl-tRNA synthetase gene duplications. Proc Natl Acad Sci USA 92:2441–2445
Brown JR, Doolittle WF (1999) Gene descent, duplication, and horizontal transfer in the evolution of glutamyl- and glutaminyl-tRNA synthetases. J Mol Evol 49:485–495
Bywater R, Thomas D, Vriend G (2001) A sequence and structural study of transmembrane helices. J Comput Aided Mol Des 15:533–552
Cavalcanti A, Leite E, Neto B, Ferreira R (2004) On the classes of aminoacyl-tRNA synthetases, amino acids and the genetic code. Orig Life Evol Biosph 34:407–420
Collins L, Penny D (2005) Complex spliceosomal organization ancestral to extant eukaryotes. Mol Biol Evol 22:1053–1066
Cummings L, Riley L, Black L, Souvorov A, Resenchuk S, Dondoshansky I, Tatusova T (2002) Genomic BLAST: custom-defined virtual databases for complete and unfinished genomes. FEMS Microbiol Lett 216:133–138
Davis B (1999) Evolution of the genetic code. Prog Biophys Mol Biol 72:157–243
Delaye L, Becerra A, Lazcano A (2005) The last common ancestor: what’s in a name? Orig Life Evol Biosph 35:537–554
Di Giulio M (2006) The non-monophyletic origin of the tRNA molecule and the origin of genes only after the evolutionary stage of the last universal common ancestor (LUCA). J Theor Biol 240:343–352
Douzery E, Delsuc F, Philippe H (2006) Molecular dating in the genomic era. Med Sci (Paris) 22:374–380
Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
Gogarten JP, Taiz L (1992) Evolution of proton pumping ATPases: rooting the tree of life. Photosynth Res 33:137–146
Gogarten-Boekels M, Hilario E, Gogarten J (1995) The effects of heavy meteorite bombardment on the early evolution–the emergence of the three domains of life. Orig Life Evol Biosph 25:251–264
Gogarten J, Kibak H, Dittrich P, Taiz L, Bowman E, Bowman B, Manolson M, Poole R, Date T, Oshima T, Konishi J, Denda K, Yoshida M (1989) Evolution of the vacuolar H+-ATPase: implications for the origin of eukaryotes. Proc Natl Acad Sci USA 86:6661–6665
Gribaldo S, Cammarano P (1998) The root of the universal tree of life inferred from anciently duplicated genes encoding components of the protein-targeting machinery. J Mol Evol 47:508–516
Harris J, Kelley S, Spiegelman G, Pace N (2003) The genetic core of the universal ancestor. Genome Res 13:407–412
Hartlein M, Cusack S (1995) Structure, function and evolution of seryl-tRNA synthetases: implications for the evolution of aminoacyl-tRNA synthetases and the genetic code. J Mol Evol 40:519–530
Hartman H (1975) Speculations on the evolution of the genetic code. Orig Life 6:423–427
Hartman H (1978) Speculations on the evolution of the genetic code II. Orig Life 9:133–136
Hartman H (1984) Speculations on the evolution of the genetic code III: the evolution of t-RNA. Orig Life 14:407–412
Hartman H (1995) Speculations on the evolution of the genetic code IV: the evolution of the aminoacyl-tRNA synthetases. Orig Life Evol Biosph 25:265–269
Higgs P, Purdritz R (2006) From protoplanetary disks to prebiotic amino acids and the origin of the genetic code. Cambridge University Press
Holbrook S (2005) RNA structure: the long and the short of it. Curr Opin Struct Biol 15:302–308
Ibba M, Celic I, Curnow A, Kim H, Pelaschier J, Tumbula D, Vothknecht U, Woese C, Soll D (1997) Aminoacyl-tRNA synthesis in Archaea. Nucleic Acids Symp Ser 37:305–306
Imlay J (2006) Iron-sulfur clusters and the problem with oxygen. Mol Microbiol 59:1073–1082
Jadhav V, Yarus M (2002) Coenzymes as coribozymes. Biochimie 84:877–888
Jordan IK, Kondrashov FA, Adzhubei IA, Wolf YI, Koonin EV, Kondrashov AS, Sunyaev S (2005) A universal trend of amino acid gain and loss in protein evolution. Nature 433:633–638
Keefe AD, Lazcano A, Miller SL (1995) Evolution of the biosynthesis of the branched-chain amino acids. Orig Life Evol Biosph 25:99–110
Klipcan L, Safro M (2004) Amino acid biogenesis, evolution of the genetic code and aminoacyl-tRNA synthetases. J Theor Biol 228:389–396
Knight RD, Freeland SJ, Landweber LF (1999) Selection, history and chemistry: the three faces of the genetic code. Trends Biochem Sci 24:241–247
Koonin EV (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol 1:127–136
McDonald JH (2006) Apparent trends of amino acid gain and loss in protein evolution due to nearly neutral variation. Mol Biol Evol 23:240–244
Nagel GM, Doolittle RF (1995) Phylogenetic analysis of the aminoacyl-tRNA synthetases. J Mol Evol 40:487–498
Nisbet E, Sleep N (2001) The habitat and nature of early life. Nature 409:1083–1091
Nishida H, Nishiyama M, Kobashi N, Kosuge T, Hoshino T, Yamane H (1999) A prokaryotic gene cluster involved in synthesis of lysine through the amino adipate pathway: a key to the evolution of amino acid biosynthesis. Genome Res 9:1175–1183
Nixon J, Wang A, Morrison H, McArthur A, Sogin M, Loftus B, Samuelson J (2002) A spliceosomal intron in Giardia lamblia. Proc Natl Acad Sci USA 99:3701–3705
Noller HF, Hoang L, Fredrick K (2005) The 30S ribosomal P site: a function of 16S rRNA. FEBS Lett 579:855–858
Poirot O, Suhre K, Abergel C, O’Toole E, Notredame C (2004) 3DCoffee@igs: a web server for combining sequences and structures into a multiple sequence alignment. Nucleic Acids Res 32:W37–W40
Sadeghi M, Naderi-Manesh H, Zarrabi M, Ranjbar B (2006) Effective factors in thermostability of thermophilic proteins. Biophys Chem 119:256–270
Saran D, Frank J, Burke DH (2003) The tyranny of adenosine recognition among RNA aptamers to coenzyme A. BMC Evol Biol 3:26
Shih P, Pedersen LG, Gibbs PR, Wolfenden R (1998) Hydrophobicities of the nucleic acid bases: distribution coefficients from water to cyclohexane. J Mol Biol 280:421–430
Sorimachi K, Itoh T, Kawarabayasi Y, Okayasu T, Akimoto K, Niwa A (2001) Conservation of the basic pattern of cellular amino acid composition of archaeobacteria during biological evolution and the putative amino acid composition of primitive life forms. Amino Acids 21:393–399
Stoltzfus A (1999) On the possibility of constructive neutral evolution. J Mol Evol 49:169–181
Syvanen M (2002) Recent emergence of the modern genetic code: a proposal. Trends Genet 18:245–248
Thompson J, Higgins D, Gibson T (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
Trifonov EN (2004) The triplet code from first principles. J Biomol Struct Dyn 22:1–11
Tumbula D, Vothknecht UC, Kim HS, Ibba M, Min B, Li T, Pelaschier J, Stathopoulos C, Becker H, Soll D (1999) Archaeal aminoacyl-tRNA synthesis: diversity replaces dogma. Genetics 152:1269–1276
Tumbula-Hansen D, Feng L, Toogood H, Stetter KO, Soll D (2002) Evolutionary divergence of the archaeal aspartyl-tRNA synthetases into discriminating and nondiscriminating forms. J Biol Chem 277:37184–37190
Velasco AM, Leguina JI, Lazcano A (2002) Molecular evolution of the lysine biosynthetic pathways. J Mol Evol 55:445–459
Vlassov A (2005) How was membrane permeability produced in an RNA world? Orig Life Evol Biosph 35:135–149
Vogel H (1964) Distribution of lysine pathways among fungi: evolutionary implications. Am Nat 98:446–455
Weber AL, Miller SL (1981) Reasons for the occurrence of the twenty coded protein amino acids. J Mol Evol 17:273–284
Wong JT (2005) Coevolution theory of the genetic code at age thirty. Bioessays 27:416–425
Zhaxybayeva O, Lapierre P, Gogarten JP (2005) Ancient gene duplications and the root(s) of the tree of life. Protoplasma 227:53–64
Zuckerkandl E, Derancourt J, Vogel H (1971) Mutational trends and random processes in the evolution of informational macromolecules. J Mol Biol 59:473–490
Acknowledgments
This work was supported through the NASA Exobiology Program (NNX07AK15G).
Fournier, G.P., Gogarten, J.P. Signature of a Primitive Genetic Code in Ancient Protein Lineages. J Mol Evol 65, 425–436 (2007). https://doi.org/10.1007/s00239-007-9024-x
DOI: https://doi.org/10.1007/s00239-007-9024-x