Introduction

Evolution of the Genetic Code and Amino Acid Alphabet

The mechanism and history of the evolution of the genetic code are among the most important questions surrounding the origin of life on Earth. They have been the subject of numerous investigations in the last several decades, with a wide variety of approaches using different lines of biochemical, evolutionary, schematic, and mathematical evidence (Hartman and Smith 2014; Higgs and Pudritz 2009; Vetsigian et al. 2006; Wong 1988; Cavalcanti et al. 2004; reviewed in Koonin and Novozhilov 2009; Trifonov 2000). Along these lines, previous work has also shown that ancestral reconstructions of universally conserved ribosomal proteins contain a compositional bias that likely provides evidence of earlier stages in the evolution of the genetic code (Brooks and Fresco 2002; Brooks et al. 2002; Fournier and Gogarten 2007, 2010). Specifically, compositional analyses suggest that Gly, Ala, Asp, Asn, and Thr are among the most ancient amino acids in the code, while Glu, Gln, Phe, Tyr, Trp, Cys, and Ser are later additions. The results of this empirical sequence-based method are similar to those of consensus meta-studies (Trifonov 2000), classic prebiotic synthesis experiments (Miller 1953), and more recent selection-based models of the organization and evolution of the codon table schema (Higgs 2009).

While the ribosome is the RNA/protein complex responsible for mediating translation and elongation of the nascent peptide chain via codon/anticodon pairing and peptidyltransferase activity, many additional proteins are required for defining the genetic code and performing protein synthesis in living cells. Three major groups of these proteins could potentially have played a role in the development of the genetic code via their early evolution and diversification: (1) aminoacyl-tRNA synthetase (aaRS) proteins, which load specific tRNA with their cognate amino acids; (2) amino acid biosynthesis enzymes; and (3) tRNA modification proteins, essential for the partitioning of the genetic code via codon–anticodon interactions that allow for its current complex organization. Within each of these groups, many protein families with similar functions are paralogs, suggesting a common ancestor with a similar function. These relationships are especially apparent between several sets of aaRS proteins (TyrRS/TrpRS, IleRS/ValRS, GluRS/GlnRS, AspRS/AsnRS, CysRS/MetRS) (Brown and Doolittle 1995; Landes et al. 1995; Nagel and Doolittle 1995). A similar scenario is found in many amino acid biosynthetic pathways, such as the distinct, paralogous sets of enzymes involved in Lys, Arg, and Leu biosynthesis (Fondi et al. 2007). The situation is more complex for tRNA modification enzymes, which are often non-homologous across major domains of life (Grosjean et al. 2010).

To at least some extent, the evolution of the genetic code was likely dependent on the duplication and divergence of these groups of proteins, and the expansion of their functional roles. Evolution of novel aaRS and amino acid biosynthesis proteins could potentially permit new amino acids to be added to the code and used in polypeptide synthesis (Cavalcanti et al. 2004; Di Giulio 1992; Klipcan and Safro 2004; Wetzel 1978). Similarly, the partitioning of the genetic code into blocks of synonymous codons via tRNA codon–anticodon recognition is made possible by the evolution of tRNA modification proteins. However, even if the genetic code was shaped by protein evolution, its earliest origins must predate polypeptide synthesis as we know it, possibly arising within an RNA-based system consisting of ribozymes with diverse catalytic activities (Hartlein and Cusack 1995; Wetzel 1995).

Ancestral Reconstruction of aaRS Protein Family Paralog Ancestors

Among these protein families, aaRS proteins are unique in that their specific functions directly presuppose the specific encoding of certain amino acids, of which they, themselves, are composed. The ancestors of aaRS families could also only have been composed of amino acids specified within the code at the time. Therefore, the identities of the amino acids found within the reconstructed sequences of aaRS family paralog ancestors can be used to constrain their ancestral functions, extending the tools of molecular evolution to a time before LUCA. Three possible functional histories are associated with the duplication and divergence of paralogous proteins from a common ancestor: neofunctionalization, with the addition of a novel function in one or the other paralog; subfunctionalization, with specialization of each paralog from a non-discriminating ancestor; and parafunctionalization, with one or both descendants taking over pre-existing functions following their divergence. If function in aaRS families is defined as amino acid specificity, the genetic code itself will also have been changed in the cases of neofunctionalization or subfunctionalization, with amino acid sequence space co-evolving as enzymatic diversity and/or specificity increases.

Previous work has already applied this principle to the paralog ancestor of IleRS and ValRS (Fournier et al. 2011). IleRS and ValRS show a high level of sequence and structural similarity, recognize very similar amino acids that are frequently substituted for one another, and also use a similar set of codons. Furthermore, Val and Ile are relatively abundant amino acids, providing many sites for analysis within the reconstruction. Interestingly, the probabilistic reconstruction of the IleRS/ValRS ancestor shows many sites with a high probability of being specific for Ile and Val, respectively. This result supports parafunctionalization, with specific coding for Val and Ile predating their cognate synthetases, possibly indicating an alternative aminoacylation or proofreading system in operation at this early time. It has been previously proposed that, as appears to be the case for Ile and Val, the genetic code was fixed by the time of the evolution of cognate aaRS families, and was mediated by an RNA-based system (Hartlein and Cusack 1995).

In this work, we extend and further develop this compositional reconstruction analysis to investigate the paralog ancestor of TyrRS and TrpRS. These aromatic amino acids are generally considered to be more recent additions to the code, by virtue of their complex biosynthetic pathways, and being encoded within the “stop” codon block, among other criteria (Trifonov 2000). In particular, the temporal ordering of Trp as one of the most recent additions to the code is one of the most longstanding and robust observations across independent lines of evidence (Jukes 1973, 1981; Osawa et al. 1992). As such, reconstructing the TyrRS/TrpRS paralog ancestor permits the investigation of a time before LUCA, recent enough that the genetic code was largely established, but still early enough that Trp and/or Tyr may not have yet existed as part of the canonical genetic code.

The presence or absence of Trp and/or Tyr residues at sites within the reconstructed paralog ancestor permits the discrimination of the previously described three possible functional histories (Fig. 1). In the case of subfunctionalization, an earlier genetic code is probabilistic for some amino acids, incorporating similar amino acids at certain positions without discrimination. As selection would be unable to act to “fix” any individual AA within this set, proteins would simply evolve with this constraint, with functions being tolerant to the presence of Tyr or Trp at specific positions. One likely cause of such promiscuity could be non-specific aminoacylation of tRNA. However, following aaRS duplication and divergence, “symmetry breaking” could occur, with selection for subfunctionalization driven by increases in fitness gained by fixing either Trp or Tyr at previously non-discriminated positions. Similarly, codon space could also become subdivided via the partitioning of cognate tRNAs between each paralog. The fitness impact of such a transitional “partitioning phase” has been found to be favorable in specific instances (Higgs 2009). In such a case, it is expected that no sites with a high probability for being specifically Tyr or Trp would be observed within the reconstructed paralog ancestor, although sites with high non-specific probabilities of being either Tyr or Trp would be expected to be observed. For example, sites may be observed with a 45 % probability of Trp, a 45 % probability of Tyr, and only a 10 % probability of being another amino acid. This suggests a tolerance to interchangeability between the two amino acids across protein sequences. 
If a large number of such ambiguous sites are observed, significantly more than are expected given amino acid substitution models within protein sequences, this likely indicates an inherited tolerance arising from ancestral ambiguity, and evidence of a non-specific genetic code at the ancestral node of the reconstruction.

Fig. 1
Hypotheses for the ancestral amino acid specificity of TyrRS and TrpRS paralogs. Proteins (gray) and cognate amino acids (white) are shown in this schema. Tyr (Y) and Trp (W) represent the presence of these amino acids within the paralog ancestor and descendants under each hypothesis. W/Y represents an ambiguous specificity, in which Tyr and Trp are not discriminated by the genetic code during protein synthesis, or in aaRS binding. In the case of parafunctionalization, the cognate amino acid(s) cannot be inferred from composition analysis, as under this hypothesis specific Tyr and Trp usage in the genetic code both predate their cognate aaRS

In the case of neofunctionalization, an ancestral duplication and subsequent divergence of TyrRS and TrpRS lineage ancestors would permit addition of a new amino acid to the coding schema. This would be inferred by the absence of any sites reconstructing for either Tyr or Trp, respectively, within the paralog ancestor. An absence of Tyr and a presence of Trp would imply that TrpRS was the ancestral function, and the code lacked Tyr before the divergence occurred. Conversely, an absence of Trp and a presence of Tyr would suggest the code lacked Trp at this time.

As was the case with IleRS and ValRS, parafunctionalization would be inferred if specific sites individually reconstructing for Tyr and Trp are observed within the paralog ancestor, suggesting that specific encoding of Tyr and Trp predates their cognate aaRS, and must have been mediated by another, more ancient system. As Trp and Tyr are generally considered to be some of the most recent additions to the code, younger than Ile or Val, this would suggest that a different and currently unknown aminoacylation regime persisted through the entire development of the canonical genetic code, independent of the evolution of the aaRS families.
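The decision logic described above for distinguishing the three functional histories can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the procedure used in this work: the function names, the 0.8 probability cutoff, and the single-letter dictionary format for site probabilities are all hypothetical, and the sketch omits the comparison of ambiguous-site counts against substitution-model expectations that a real test of subfunctionalization would need.

```python
from typing import Dict, List


def classify_site(probs: Dict[str, float], cutoff: float = 0.8) -> str:
    """Label one ancestral site from its amino acid probabilities.

    `cutoff` is a hypothetical threshold chosen for illustration only.
    """
    p_trp = probs.get("W", 0.0)
    p_tyr = probs.get("Y", 0.0)
    if p_trp >= cutoff:
        return "Trp-specific"
    if p_tyr >= cutoff:
        return "Tyr-specific"
    # High combined probability, but neither residue is individually favored.
    if p_trp + p_tyr >= cutoff:
        return "Trp/Tyr-ambiguous"
    return "other"


def infer_history(labels: List[str]) -> str:
    """Map the set of site labels to the hypothesis it would favor."""
    has_trp = "Trp-specific" in labels
    has_tyr = "Tyr-specific" in labels
    if has_trp and has_tyr:
        return "parafunctionalization"
    if has_trp or has_tyr:
        # Only one of the two amino acids is specifically present.
        return "neofunctionalization"
    if "Trp/Tyr-ambiguous" in labels:
        return "subfunctionalization"
    return "undetermined"


# The example from the text: 45 % Trp, 45 % Tyr, 10 % another amino acid.
example_site = {"W": 0.45, "Y": 0.45, "L": 0.10}
print(classify_site(example_site))  # Trp/Tyr-ambiguous
```

In this simplified scheme, an ancestor with Tyr-specific sites but no Trp-specific sites would be labeled as neofunctionalization, with Trp as the later addition.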

Reliability of Ancient Protein Ancestral Reconstructions

To our knowledge, the most ancient protein for which resurrection via ancestral sequence reconstruction has been attempted is the bacterial ancestor of elongation factor Tu (EF-Tu), which is estimated to have existed >3 Gya (Gaucher et al. 2003). This resurrected protein ancestor was shown to have GDP-binding activity comparable to its contemporary homologs. Furthermore, this activity was shown to be optimal at a temperature of ~65 °C, consistent with the range of ocean temperatures predicted on the early Earth based on interpretations of δ18O isotopic ratios in Archaean rocks (Knauth 2005). These experimental results demonstrate the capacity for ancestral reconstruction methods to successfully and accurately recover the biological functionality of very ancient proteins, presumably via their fidelity in reconstructing correct ancestral sequences. While the pre-LUCA paralog ancestors of aaRS proteins may predate the bacterial ancestor by hundreds of millions of years, the similarly high sequence, structure, and functional conservation observed for EF and aaRS proteins suggests that the latter are similarly amenable to accurate reconstruction over these timescales.

Results

Phylogenetic Reconstruction

Maximum-likelihood phylogenetic reconstruction of TyrRS and TrpRS paralogs shows the monophyly of each group, and a clear division between bacterial and archaeal/eukaryal homologs in each case (Fig. 2). While some previous investigations have suggested a paraphyletic relationship between TyrRS and TrpRS (Dong et al. 2010; Ribas de Pouplana et al. 1996), our result is consistent with other analyses that include sequence data from all three Domains, supporting the monophyly and pre-LUCA origin of each family (Brown et al. 1997; Chandrasekaran et al. 2013). In further agreement with these analyses, we also find clear evidence of horizontal gene transfer (HGT) within each protein family, especially within archaeal groups. Within TyrRS, there is a complex pattern of transfer between Crenarchaeota and other members of the TACK superphylum, and euryarchaeal clades including Thermoplasmatales, Thermococcales, and Nanoarchaeum equitans, which apparently acquired TyrRS from its symbiotic host, Ignicoccus hospitalis (or vice versa) (Podar et al. 2008). The ancestor of TyrRS within I. hospitalis and N. equitans itself appears to have also been transferred from Thermoproteales, or possibly vertically inherited within N. equitans as a sister to Thermococcales (Brochier et al. 2005), with independent transfers to Thermoproteales and I. hospitalis. A subset of Halobacteriales also appears to have received TyrRS from a deep TACK-associated lineage. Most notably, Eukarya are also polyphyletic within the TyrRS tree, with Animalia and Fungi (Opisthokonta) grouping with the TACK-associated halobacterial subset, while other eukaryotes form a monophyletic group within the TACK clade. This is consistent with previous work showing an HGT of the gene encoding TyrRS from Halobacteria to the opisthokont ancestor (Huang et al. 2005). A deep division is also apparent within bacterial TyrRS forms, as has been attributed to complex patterns of biased HGT (Andam et al. 2010).
Several HGT events are also apparent within the TrpRS tree, including a transfer from within group II methanogens to Desulfurococcales, crenarchaeal transfers to N. equitans and Thaumarchaeota, and a deep TACK transfer to a subset of Halobacteriales, similar to that observed within TyrRS. For TrpRS, Eukarya form a monophyletic group, rooting deeply within the archaeal Domain closer to the TACK group. Additional HGT events within more recently diverging groups may also have occurred, but are less readily apparent and were not investigated.

Fig. 2
Maximum-likelihood tree of TyrRS and TrpRS paralogs. Colors indicate broad taxonomic groups, as indicated. TACK refers to the TACK superphylum, in this tree including Crenarchaeota, Thaumarchaeota, and Korarchaeota. Black lines indicate lineages existing either earlier than Domain ancestors, or for which no taxonomic group could be inferred due to patterns of HGT. Tree reconstruction is described in Methods. Bootstrap support values are provided for major nodes. Branch length scale bar indicates the number of substitutions/site

The deep division between bacterial and archaeal variants of each paralog supports, in both cases, the node ancestor of these domain lineages being congruent with LUCA. This is further supported by the reciprocal rooting of paralog sub-trees on congruent branches, showing a rooting of the tree of life (ToL) on the bacterial branch. As such, these aaRS paralog lineages diverged pre-LUCA, and a reconstruction of the likely sequence of the paralog ancestor provides information about a very early time in the history of life, possibly even during the later stages of the evolution of the genetic code itself.

Partial HGT of TyrRS Within Eukarya and Halobacteriales

Phylogenetic trees of the TyrRS protein family show eukaryal and halobacterial TyrRS proteins as polyphyletic, with opisthokont orthologs grouping deeply on the crenarchaeal branch together with a subset of Halobacteriales, and all other eukaryal orthologs grouping within Crenarchaeota. The remaining Halobacteriales group together with group II methanogens. However, this HGT to Opisthokonta does not seem to be consistently evident across the full TyrRS protein sequence. Three regions were identified containing conserved amino acids supporting different bipartitions for the placement of Eukarya and Halobacteriales. The first region (R1) consists of 20 amino acid sites. This region generates a phylogenetic tree that supports the monophyly and vertical placement of Halobacteriales within group II methanogens, as well as the monophyly of Eukarya, including Opisthokonta. The second region, located 100 AA sites downstream of the first, contains two sub-regions. The first sub-region (R2) contains 11 sites, again supporting the monophyly of Halobacteriales, but with Opisthokonta grouping deeply within Crenarchaeota, congruent with its position in the gene tree, and not monophyletic with other Eukarya. Immediately downstream is a second 14 AA sub-region (R3) with sites supporting another distinct topology, where Halobacteriales is once again monophyletic within group II methanogens. However, in this region, Opisthokonta groups together with all Halobacteriales within the methanogens.

While these regions are small and do not contain many sites for phylogenetic inference, it is conspicuous that, in both cases, R1 and R2/R3 are at sites of tRNA(Tyr) recognition. This suggests a complex narrative of partial HGT and selection, wherein recombinations between donor and recipient genes in both eukaryal and halobacterial lineages preserved the regions of the ancestral, vertically inherited gene that had co-evolved with its cognate tRNA, to maintain the specificity of that interaction. This is not without precedent, as a partial HGT within halobacterial LeuRS and GluRS has also been reported in previous analyses. In both cases, the recombined regions were also shown to be involved in tRNA recognition (Dasgupta and Basu 2014; Fang et al. 2014).

These recombinations appear to have occurred following the HGT of a crenarchaeal TyrRS to a subset of Halobacteriales and, subsequently, the HGT to the ancestor of Opisthokonta, so that both clades preserved their respective vertically inherited regions (Fig. 3). In the case of R1, the recombinations within Halobacteria and Opisthokonta could have occurred independently (if the secondary HGT to Opisthokonta occurred before the R1 recombination within Halobacteria) or in a stepwise fashion, with an R1 recombined copy transferred to Opisthokonta, then replaced via recombination with the vertically inherited eukaryal copy. Within these halobacteria, the entire R2/R3 region appears to have been recombined following HGT from the crenarchaeal donor, preserving the vertically inherited sequence. In the secondary HGT to Opisthokonta, however, the crenarchaeal R2 region from the initial HGT appears to be retained, while the recombined halobacterial R3 region was inherited. Therefore, for the R3 region, but not the R2 region, Opisthokonta groups with Halobacteriales within group II methanogens. This could result from HGT to Opisthokonta after R3 recombination, but before R2 recombination within halobacterial populations.

Fig. 3
Proposed schema for HGT and recombination of the gene encoding TyrRS within Halobacteriales and Opisthokonta. Given that many halobacterial genomes retain the vertically inherited TyrRS homolog, either the initial transfer was likely to a lineage within the halobacterial clade, or lineage sorting of the vertical and transferred copies occurred following HGT into the halobacterial ancestor. Subsequent HGT to the ancestor lineage of Opisthokonta supports a stepwise series of recombination events within halobacterial populations. Recombined regions R1–R3 are not shown to scale

These events require stepwise recombination of transferred and vertically inherited gene regions within halobacteria, which suggests that both copies of the gene persisted within halobacterial populations for some time. Interestingly, this seems to be the case for halobacterial TrpRS, in which several species contain two gene copies, with one version likely acquired via HGT (Fig. 2). This scenario also makes the prediction that, while R1 and R2/R3 regions may both be important in halobacterial tRNA recognition, R1 is most important within opisthokont tRNA recognition, since R1 underwent recombination to preserve this interaction, but did not undergo additional recombination to preserve the eukaryal R2/R3 regions.

While HGT generally is not problematic for ancestral sequence reconstruction of individual proteins, since inferences are based on gene trees rather than organismal trees, partial HGT can confound ancestral reconstruction studies. Site probabilities of ancestral amino acids may be incorrectly inferred if the tree topology for one part of the gene does not match another part. Furthermore, if regions of partial HGT are large enough, the overall gene tree may be impacted, resulting in an artifactual topology different from any of the component “true” evolutionary histories. For this reason, the ancestral sequences for these regions were calculated using gene trees edited to reflect the unique topologies of R1, R2, and R3. Conversely, the gene tree used for ancestral reconstruction of the remaining majority of sites (Fig. 2) was generated from an alignment omitting these regions.

Absence of Trp Within Paralog Ancestor Sequence Reconstructions

Homogeneous and non-homogeneous reconstruction models return similar sequences and site probability distributions for the paralog ancestor node (Fig. 4). Of 251 reconstructed aligned sites, 212 show the same amino acid maximum-likelihood identity. Both show a complete absence of sites with a maximum-likelihood identity of Trp (TrpML) within the paralog ancestor of TyrRS and TrpRS. Additionally, in both models, the total expectation count for Trp sites (TrpExp) within paralog ancestor reconstructions is well below 1 (Fig. 5). Importantly, this absence does not seem to arise from an asymmetrical usage of Trp across TyrRS and TrpRS ancestors; each aaRS family ancestor shows a similar absence of TrpML sites, and similarly low TrpExp counts. In both families, Trp residues only begin to be incorporated along the branches leading to the major Domains. Four TrpML sites are acquired on the branch leading to the TyrRS bacterial ancestor, two on the branch leading to the TyrRS crenarchaeal/eukaryal ancestor, two on the branch leading to the TrpRS bacterial ancestor, and one each on branches leading to the TyrRS euryarchaeal ancestor, and TrpRS euryarchaeal, crenarchaeal, and eukaryal ancestors.
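The two summary statistics used here can be made concrete with a short Python sketch. It assumes that TrpML counts the sites whose most probable residue is Trp, and that TrpExp is the sum of the per-site probabilities of Trp across the reconstructed sequence. The function names and the three toy sites are invented for illustration and are not taken from the reconstruction output.

```python
from typing import Dict, List

SiteProbs = Dict[str, float]  # one-letter amino acid code -> probability


def ml_count(sites: List[SiteProbs], aa: str) -> int:
    """Number of sites whose maximum-likelihood identity is `aa`."""
    return sum(1 for probs in sites if max(probs, key=probs.get) == aa)


def expectation_count(sites: List[SiteProbs], aa: str) -> float:
    """Expected number of `aa` residues: sum of per-site probabilities."""
    return sum(probs.get(aa, 0.0) for probs in sites)


# Toy ancestor of three sites (invented values, for illustration only).
toy_sites = [
    {"L": 0.771, "W": 0.099, "F": 0.130},
    {"G": 0.95, "A": 0.05},
    {"Y": 0.60, "F": 0.39, "W": 0.01},
]

print(ml_count(toy_sites, "W"))                     # 0
print(round(expectation_count(toy_sites, "W"), 3))  # 0.109
```

In this toy case no site is reconstructed as Trp by maximum likelihood, yet the expectation count is non-zero because low Trp probabilities accumulate across sites.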

Fig. 4
Site amino acid probabilities in reconstructed TyrRS/TrpRS paralog ancestor. Gaps indicate sites with protein family-specific indels that were not included in the final analysis. Partial HGT regions are labeled on the left. Sites corresponding to labeled TrpExp contributions in Fig. 6 are numbered. Site probabilities for Trp are in bold red. (*) indicates sites with ML amino acid identities that differ between homogeneous and non-homogeneous reconstructions. Site amino acid probabilities for all internal nodes of each ancestral reconstruction are provided (Online Resources 1–2)

Fig. 5
Expectation counts of Trp within reconstructed ancestors of the TyrRS/TrpRS protein families. Results from homogeneous (a) and non-homogeneous (b) reconstruction models consistently support an absence of Trp within the pre-LUCA paralog ancestor of TyrRS and TrpRS. TrpExp values for each branch are labeled and proportional to branch thickness. Additionally, bracketed values in bold indicate site-specific acquisition (+) and loss (−) of TrpML along their respective branches. [0] values indicate that no TrpML sites are observed along a branch. (*) an ambiguous gain of TrpML is observed at site 278, which may have preceded multiple gains and losses in descendent lineages. Terminal branches correspond to the stem lineages ancestral to each labeled clade

Avoidance of cognate amino acids within amino acid biosynthetic enzymes has been observed in many pathways, including Trp synthesis (Alves and Savageau 2005; Tivorsak 2001; Xie and Reeve 2005). It has been proposed that this avoids translation attenuation due to a lack of charged tRNA under amino acid starvation conditions, under which these genes must be more highly expressed. Importantly, the inferred absence of Trp within the paralog ancestor does not seem to be a consequence of broad selection against including Trp within TrpRS descendant lineages. Rather, Trp acquisition appears to be near-universal across major groups (Fig. 5).

Only groups of transferred halobacterial TrpRS and TyrRS, and associated opisthokont TyrRS appear to lack Trp in extant sequences, in each case likely due to loss after earlier archaeal acquisition, at sites 166 and 278, respectively. It may not be coincidental that two of these affected groups are halobacterial, if these sites are somehow functionally impacted by high salt concentrations. Along the stem branches to these groups, both homogeneous and non-homogeneous reconstructions show a substitution of Trp to Arg at site 166, and a substitution of Trp to Thr at site 278. These particular amino acid substitutions are not generally expected in response to halophily (Fukuchi et al. 2003). Nevertheless, if both are due to selection under halophilic conditions that favor the replacement of Trp, the substitution at site 278 informs the directionality of HGT of TyrRS between Opisthokonta and Halobacteriales. Since Opisthokonta also shows a Thr at this site, a halophily-induced substitution would polarize the direction of transfer, suggesting that TyrRS was secondarily transferred from Halobacteria to the opisthokont ancestor. This would be consistent with the schema inferred from analysis of partial HGT regions (Fig. 3).

Per-site Probability Densities

The estimation of TrpExp within the paralog ancestor sequence is cumulative across sites, so that as permitted by the substitution model, given a sufficiently long sequence, there will be a substantial expectation of at least one site containing Trp. However, it is clear that the probability density of Trp is non-uniform across sites within the reconstructed paralog ancestor, with the vast majority of sites (>96 %) having a very low probability of containing a Trp residue. Most probability density resides within a handful of sites that are consistent across both reconstruction models (Fig. 6). Under both models, over half of the total expectation of Trp within the paralog ancestor arises from only four sites. This suggests that the expected frequency of Trp within the paralog ancestor sequence does not arise from a lack of reconstruction information across most sites, or a consistent model bias excluding Trp from ancestral sites. Rather, as shown in Fig. 4, these few sites contain Trp within descendant lineages, which directly contribute to the probability of Trp at ancestral nodes.
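The concentration of TrpExp in a few sites can be quantified as sketched below in Python. This is a minimal illustration, assuming the input is a list of per-site Trp probabilities; the function names and the toy values are hypothetical and do not reproduce the reconstructed data.

```python
from typing import List


def trp_contributions(p_trp: List[float]) -> List[float]:
    """Fraction of the total Trp expectation contributed by each site."""
    total = sum(p_trp)
    if total == 0:
        return [0.0 for _ in p_trp]
    return [p / total for p in p_trp]


def sites_to_reach(p_trp: List[float], fraction: float = 0.5) -> int:
    """Smallest number of sites whose summed probability reaches
    `fraction` of the total Trp expectation."""
    total = sum(p_trp)
    if total == 0:
        return 0
    running = 0.0
    # Take sites in order of decreasing contribution.
    for n, p in enumerate(sorted(p_trp, reverse=True), start=1):
        running += p
        if running >= fraction * total:
            return n
    return len(p_trp)


# Toy per-site Trp probabilities (invented for illustration).
toy = [0.30, 0.25, 0.05, 0.01, 0.01]
print(sites_to_reach(toy, 0.5))  # 2
```

A small return value relative to the sequence length indicates that the expectation is driven by a handful of sites rather than by a diffuse background probability.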

Fig. 6
Contribution of individual sites to paralog ancestor expectation of Trp. The majority of TrpExp within reconstructed TyrRS/TrpRS paralog ancestor sequences is within only a few sites, consistently recovered across both reconstruction models. In each model, the vast majority of sites (>96 % of sites) contribute less than 33 % of the total expectation value for Trp at the root. Sites contributing >5 % TrpExp are labeled. Additional sites contributing remaining TrpExp are also consistent across homogeneous and non-homogeneous models. In order of decreasing contribution per model: a = 166, 265, 633, 277, 282; b = 164, 166, 277, 265, 633, 282

Within both homogeneous and non-homogeneous models, much of the probability density for Trp within the TyrRS/TrpRS paralog ancestor was found within two sites, 190 and 269, together accounting for 38 and 35 % of the expectation of Trp within the inferred paralog ancestor sequence, respectively (Fig. 6). For both cases, this arises from the derived substitution of Trp along specific branches at the Domain level, rather than a distributed low probability of Trp across the tree. For site 269, the substitution occurs on the branch leading to the TyrRS bacterial ancestor. For site 190, this substitution occurs on two separate lineages within the tree, leading to the bacterial TrpRS ancestor and crenarchaeal/eukaryal TyrRS ancestor.

Site 269 is the first site of the R1 partial HGT region. However, the proposed differing tree topology for this region does not affect the inference of Trp for this residue within the reconstruction, as neither the halobacterial nor eukaryal groups involved in the proposed HGT contain a Trp at this site.

The ancestral reconstruction for site 190 at both LUCA ancestors, as well as the paralog ancestor, is most likely a Leu residue, while the probability of Trp being present within the paralog ancestor at site 190 is relatively low in both homogeneous and non-homogeneous reconstructions (homogeneous: p(Trp) = 0.099, p(Leu) = 0.771; non-homogeneous: p(Trp) = 0.136, p(Leu) = 0.672). This suggests that incorporation of Trp was most likely convergent, and does not represent an ancestral Trp residue within the paralog ancestor. As this site is part of the dimerization interface for both TyrRS and TrpRS, these substitutions may have been independently advantageous, consistent with changes in amino acid hydrophobicity and aromatic character being under selection within these protein regions.

Presence of Other Rare Amino Acids Within the TyrRS/TrpRS Ancestor

The absence of Trp within reconstructed sequence ancestors appears to be unique among the twenty amino acids, with other rare amino acids (Cys, His) being present in both TyrRS and TrpRS sequences, and also predicted within the reconstructed paralog ancestor (Table 1). Similar to Trp, Cys and His are also typically conserved in a small number of positions, resisting frequent substitution. This is further evidence against the absence of Trp in the paralog ancestor being the result of model bias due to its low equilibrium frequency.

Table 1 Reconstructed usages of rare amino acids within TyrRS/TrpRS ancestors

While the equilibrium frequency of Cys within the TyrRS and TrpRS proteins sampled for this analysis is nearly identical to that of Trp (0.0096 vs. 0.0080, respectively), the paralog ancestor contains at least two and possibly three distinct CysML sites, and CysExp is at least 2.75 times the observed value of TrpExp. These differences suggest that the absence of Trp cannot be explained purely as model bias arising from the low frequency of Trp usage. While His is more abundant within these proteins (2.5 % of sites), it is still the third rarest amino acid within the dataset, and, like Cys, but unlike Trp, has a predicted usage in the paralog ancestor similar to its overall abundance within the dataset.

Simulated Reconstructions Correctly Predict the Absence of Trp in Paralog Ancestors

There is inherent uncertainty in ancestral sequence reconstruction, propagating from many sources, including phylogenetic uncertainty, non-polarized characters, and the fitting of incorrect and/or inadequate evolutionary models. Each branch within a phylogeny also loses information about its ancestral state as substitutions continue to occur. For these reasons, the absence of rare amino acids at sites within an inferred ancestor sequence may not be reflective of the true ancestral state. This is especially true of a non-homogeneous model, which may over-fit by reducing the equilibrium frequencies of very rare amino acids to near zero along some branches, effectively eliminating the possibility of observing them at the inferred ancestor. In order to test for the possibility of a false-negative observation of Trp at the inferred paralog ancestor, two sets of 100 simulations were performed for each set of homogeneous and non-homogeneous model parameters inferred from the original sequence alignment and phylogeny. One set of simulations evolved sequences under an evolutionary scenario where root frequencies for Trp were set to 0, so that all simulated sequences evolved from an ancestral state of a total absence of Trp. Another set of simulations evolved sequences with a root frequency of Trp equal to that observed within the actual leaf sequences (0.8 %), and a presence of at least one Trp residue in each root sequence. Ancestral reconstructions were then performed with both sets of sequences, using the exact model parameters inferred from the original data. In this way, the ability of the ancestral reconstruction model and method to accurately predict the presence or absence of Trp within the paralog ancestor could be tested.
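The two root conditions used for the simulations can be sketched in Python as follows. This covers only the sampling of root sequences, not their evolution along the tree or the subsequent reconstruction. Several details are assumptions made for illustration: the function name, the sequence length of 251 sites (borrowed from the number of reconstructed aligned sites), and the uniform frequencies assigned to the 19 non-Trp amino acids, which stand in for the model's fitted root frequencies.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sample_root(length: int, trp_freq: float, rng: random.Random,
                require_trp: bool = False) -> str:
    """Draw a root sequence with the given Trp frequency.

    Non-Trp residues are drawn uniformly (an illustrative assumption).
    If `require_trp` is set, sequences lacking Trp are rejected and redrawn.
    """
    if require_trp and trp_freq <= 0:
        raise ValueError("cannot require Trp when its frequency is zero")
    others = [a for a in AMINO_ACIDS if a != "W"]
    alphabet = ["W"] + others
    weights = [trp_freq] + [(1.0 - trp_freq) / len(others)] * len(others)
    while True:
        seq = "".join(rng.choices(alphabet, weights=weights, k=length))
        if not require_trp or "W" in seq:
            return seq


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Set 1: root frequency of Trp is 0, so no root sequence contains Trp.
zero_trp_roots = [sample_root(251, 0.0, rng) for _ in range(100)]
# Set 2: root frequency of 0.8 %, with at least one Trp in every root.
nonzero_trp_roots = [sample_root(251, 0.008, rng, require_trp=True)
                     for _ in range(100)]
```

Rejection sampling is used for the second set because it enforces the "at least one Trp" condition without otherwise altering the residue frequencies.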

Simulations under both homogeneous and non-homogeneous models support the hypothesis that the observed absence of Trp within the TyrRS/TrpRS paralog ancestor is conspicuous, and unlikely to be observed if there was a true presence of Trp at the root. Under the homogeneous model (Fig. 7a), 64 % of reconstructions from zero-Trp simulations correctly inferred zero TrpML positions within the ancestor. Of the remaining simulations, 27 % had a false positive of one TrpML, 8 % had a false positive of two TrpML, and a single simulation had a false positive of three TrpML. Only 15 % of non-zero-Trp simulations showed a false negative, that is, an absence of TrpML sites. Under the non-homogeneous model (Fig. 7b), 82 % of reconstructions from zero-Trp simulations correctly inferred zero TrpML positions within the ancestor, while the remaining 18 % inferred a single Trp residue within the ancestor. Conversely, only 7 % of non-zero-Trp simulations incorrectly inferred zero TrpML sites in the ancestor. Therefore, given these models, it is substantially more likely to observe an absence of TrpML sites as a true negative, rather than a false negative (Bayes’ factors K_H = 4.27, K_NH = 11.71). These simulations further suggest that these models of ancestral reconstruction are more likely to over-estimate the probability and presence of Trp within deep ancestors, rather than under-estimate it.
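The Bayes’ factors reported above are numerically consistent with a simple ratio: the fraction of zero-Trp simulations that yield zero TrpML sites, divided by the fraction of non-zero-Trp simulations that do so. The following minimal Python sketch reproduces that arithmetic from the simulation counts given in the text. The function name is hypothetical, and the assumption that the factors were computed as this plain ratio is ours; it is inferred from the reported numbers rather than stated explicitly.

```python
def bayes_factor(n_true_negative, n_false_negative):
    """Ratio of P(zero TrpML | Trp absent at root) to P(zero TrpML | Trp present at root).

    Both arguments are counts out of the same number of simulations (100 here),
    so the ratio of counts equals the ratio of the corresponding fractions.
    Assumption: the Bayes' factor is this plain likelihood ratio.
    """
    if n_false_negative <= 0:
        raise ValueError("n_false_negative must be positive")
    return n_true_negative / n_false_negative


# Counts of simulations (out of 100) with zero TrpML sites, taken from the text.
k_h = bayes_factor(64, 15)   # homogeneous model
k_nh = bayes_factor(82, 7)   # non-homogeneous model
print(f"K_H = {k_h:.2f}, K_NH = {k_nh:.2f}")  # K_H = 4.27, K_NH = 11.71
```

Values above 1 favor a true absence of Trp at the root, and larger values indicate stronger support.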

Fig. 7 Ancestral reconstruction simulations support a conspicuous absence of Trp within the TyrRS/TrpRS paralog ancestor. Homogeneous (a) and non-homogeneous (b) models were tested. Simulations under each model (n = 100) were performed with a 0 % root frequency of Trp (black) and a 0.8 % root frequency of Trp with at least 1 Trp residue within the simulation root ancestor node (gray), respectively. Actual TrpML and TrpExp values (Observed) were compared to distributions of corresponding values for each set of simulations. The Bayes’ factor (K) for each test shows substantial support for the observed absence of Trp being more likely due to a true absence of Trp within the paralog ancestor sequence.

Since ancestral reconstructions are probabilistic, it is possible that an inferred ancestor with no TrpML sites could still be likely to contain at least one Trp residue, given the per-site likelihoods across the entire sequence, the sum of which equals the expectation value (TrpExp) for the count of Trp within the inferred ancestral sequence. Furthermore, since the probability of an ancestral Trp residue at each site is always non-zero, TrpExp continues to increase with sequence length; in a sufficiently long sequence, TrpExp grows without bound even if TrpML remains zero. Comparison to simulated TrpExp values is therefore a useful additional metric. The homogeneous and non-homogeneous sequence reconstructions of the TyrRS/TrpRS ancestor show TrpExp values of 0.54 and 0.69, respectively. These values were compared to each set of simulations, in order to determine the relative likelihood of observing similarly low TrpExp under each hypothesis. Under the homogeneous model, 14 % of non-zero-Trp simulations showed expectations lower than 0.54, compared to 63 % of zero-Trp simulations. Under the non-homogeneous model, only 8 % of non-zero-Trp simulations showed expectations lower than 0.69, compared with 93 % of zero-Trp simulations (Bayes’ factors K_H = 4.50, K_NH = 11.63; Fig. 7).
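The distinction between the two metrics can be illustrated with a short Python sketch. It takes per-site posterior probabilities for a reconstructed node and returns the count of sites where Trp is the most likely residue (TrpML), the sum of per-site Trp probabilities (TrpExp), and the probability of at least one Trp. The function name, the input layout, and the toy probabilities are hypothetical, and the last quantity assumes that sites are independent; this is not the output format of bppAncestor.

```python
def trp_summary(site_probs, trp="W"):
    """Summarize Trp support at a reconstructed node.

    site_probs: list of dicts, one per alignment site, mapping one-letter
    amino acid codes to posterior probabilities (hypothetical layout).
    Returns (TrpML, TrpExp, P(at least one Trp)), the last under the
    assumption that sites are independent.
    """
    trp_ml = 0
    trp_exp = 0.0
    p_no_trp = 1.0
    for probs in site_probs:
        p_trp = probs.get(trp, 0.0)
        trp_exp += p_trp                          # expectation: sum of per-site probabilities
        if max(probs, key=probs.get) == trp:      # Trp is the most likely residue here
            trp_ml += 1
        p_no_trp *= 1.0 - p_trp
    return trp_ml, trp_exp, 1.0 - p_no_trp


# Toy node: Trp is never the most likely residue, yet TrpExp is non-zero.
toy_sites = [
    {"Y": 0.70, "F": 0.20, "W": 0.10},
    {"G": 0.95, "A": 0.04, "W": 0.01},
    {"F": 0.60, "W": 0.40},
]
trp_ml, trp_exp, p_any = trp_summary(toy_sites)
print(trp_ml, round(trp_exp, 2), round(p_any, 3))
```

In the toy example TrpML is zero while TrpExp is 0.51, which is why both values are compared against the simulated distributions.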

These simulations show that the inferred low probability of a Trp residue being present within the TyrRS/TrpRS paralog ancestor is substantially more likely to occur under a model where Trp is truly absent at the root. For each evolutionary model, K values for TrpML and TrpExp were similar, indicating that both metrics reflect the same signal within the data.

Methods

Sequence Collection, Alignments and Partial HGT

Amino acid sequences of 182 TyrRS and TrpRS proteins were collected from GenBank (Benson et al. 2014), with a representative sampling across all three domains. Protein BLAST searches within each domain were performed using the NCBI non-redundant protein sequences database (Altschul et al. 1990), and the neighbor-joining tree visualization tool was subsequently used to confirm that major clusters of homologs were represented in the sampling. All sequences were aligned in MUSCLE using default parameters (Edgar 2004). Proposed regions of partial HGT involving Opisthokonta and Halobacteriales were identified by visual inspection of amino acid site identities and confirmed with subsequent phylogenetic analysis (see Tree Reconstructions). Proposed partial HGT regions were removed from the alignment before phylogenetic analysis of the remaining sites. An abbreviated sequence name key and aligned sequence files, including partial HGT region files, are available as online resources (see Online Resource files 3–7).

Tree Reconstructions

Phylogenetic trees were generated using PhyML v3.0 (Guindon et al. 2010) with the WAG amino acid substitution model, an estimated proportion of invariable sites, an estimated gamma distribution shape parameter (alpha), 8 rate categories, estimated amino acid frequencies, and an NJ starting tree. For the tree generated from sites not involved in partial HGT (shown in Fig. 2), 100 bootstrap replicates were generated. In the case of partial HGT regions, trees were initially calculated from each proposed recombined region. Differences in topology for halobacterial and opisthokont groups were then identified and used to edit the tree derived from the remainder of non-recombined sites (the majority of sites) to generate phylogenies that reflect the recombined topology, as well as preserve the phylogenetic relationships between unaffected groups, and within the descendant lineages of affected groups. Branch lengths on the edited trees were then re-estimated using TREE-PUZZLE (Schmidt et al. 2002) under the same model parameters. The branch length of the branch leading to the transferred group from each recombined region tree was then used to replace the corresponding value from the edited trees. In this way, each resulting partial HGT tree used for ancestral reconstruction preserves both the topology and branch lengths across all groups using information from the full sequence, as well as reticulated branches associated with each recombined region (Online Resources 8–10). All trees were rooted between paralogs, as depicted in Fig. 2.

Model Parameter Generation and Ancestral Reconstruction

All homogeneous and non-homogeneous model parameters were estimated using the bppML program belonging to the bppSuite of software (Gueguen et al. 2013). In the homogeneous case, the JTT substitution model was used, with equilibrium frequencies fixed at observed usage. A discrete Gamma distribution with an estimated alpha value of 0.890 and four categories was employed to model variation of rates among sites, plus an additional category for invariant sites (estimated invariant site rate of 0.003). In the non-homogeneous case, the COaLA model (Groussin et al. 2013) was used. COaLA permits the variation of global amino acid composition between lineages. To do so, COaLA assigns branch-specific parameters to explore the space of equilibrium frequencies, in a sub-space defined by a correspondence analysis computed with the matrix of observed frequencies. Two parameters corresponding to the positions along the first two axes of the model were assigned to each branch of the tree, with branch-specific parameters being independent between branches. Two parameters were also assigned to the root.

Only positions from the alignment not predicted to be involved in partial HGT were used for estimating model parameters under the majority site phylogeny (the same sites used to calculate the phylogeny), so as to avoid biases resulting from the inclusion of sites likely evolving under differing tree topologies. Model parameters for partial HGT regions were estimated separately under the adjusted phylogenies for each region, as described in the previous section. Analogous to branch length estimation for the affected bipartitions, only the sets of parameters for the reticulate branches were used from these parameter estimates, with other unaffected branches in the partial HGT model remaining identical to those from the majority model. Ancestral reconstructions were subsequently performed on sequence alignments using bppAncestor (Gueguen et al. 2013). Per-site amino acid probabilities for reconstructed nodes are provided (Online Resources 1–2). A phylogenetic tree with internal node numbers mapping to the ancestral reconstructions is also provided (Online Resource 11). Alignment sites with an excessive number of gaps were automatically excluded from the reconstruction results.

Sequence Simulations

Protein sequence evolution was simulated over the majority site TyrRS/TrpRS paralog tree using bppSeqGen (Gueguen et al. 2013) for 100 iterations under both homogeneous and non-homogeneous model parameters, with and without Trp sites in the root sequence. For iterations with Trp, root frequencies were set equal to the observed alignment frequency of 0.008, with at least one Trp site in the root sequence. For iterations without Trp, root frequencies of Trp were set to 0 %. Alignment length was set equal to the number of reconstructed sites within the actual TyrRS/TrpRS alignment (n = 251). Ancestral reconstruction was then performed for each resulting simulated alignment, using the same tree and model parameters as the TyrRS/TrpRS reconstructions using the majority sequence. Due to their short lengths and likely small impact on estimated root frequencies of Trp, the different inferred topologies arising from partial HGT regions were not used in simulations. The site counts of these regions are included in the simulated sequence alignment length.
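The two root scenarios can be sketched in Python as follows. This is only an illustration of how root sequences with and without Trp might be drawn; it is not the bppSeqGen procedure, and it does not evolve sequences along the tree. The function names, the uniform base frequencies, the rescaling of the remaining 19 frequencies, and the use of rejection sampling to guarantee at least one Trp are all assumptions made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def root_frequencies(base_freqs, trp_freq):
    """Fix the Trp frequency and rescale the other 19 so the total is 1 (assumed scheme)."""
    others = {aa: f for aa, f in base_freqs.items() if aa != "W"}
    total = sum(others.values())
    scaled = {aa: f * (1.0 - trp_freq) / total for aa, f in others.items()}
    scaled["W"] = trp_freq
    return scaled


def sample_root(freqs, length, require_trp, rng):
    """Draw a root sequence; optionally redraw until it holds at least one Trp."""
    residues = [aa for aa in freqs if freqs[aa] > 0]   # drop zero-frequency residues
    if require_trp and "W" not in residues:
        raise ValueError("cannot require Trp when its frequency is zero")
    weights = [freqs[aa] for aa in residues]
    while True:
        seq = "".join(rng.choices(residues, weights=weights, k=length))
        if not require_trp or "W" in seq:
            return seq


rng = random.Random(0)
# Placeholder base frequencies; the study used observed alignment frequencies.
base = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
root_with_trp = sample_root(root_frequencies(base, 0.008), 251, True, rng)
root_without_trp = sample_root(root_frequencies(base, 0.0), 251, False, rng)
print(root_with_trp.count("W"), root_without_trp.count("W"))
```

With a Trp frequency of 0.008 and 251 sites, roughly 13 % of unconstrained draws would contain no Trp (0.992^251 ≈ 0.13), which is why the at-least-one-Trp condition matters for the non-zero-Trp scenario.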

Discussion

A Limited Co-evolution of aaRS and the Genetic Code

The time between the origin of life and LUCA is one of dramatic evolutionary change, encompassing the invention of all the core cellular machinery, including the translation system and the genetic code itself. Despite the importance of this interval in establishing fundamental biological processes, the tools of comparative genomics are of limited use in investigating it, save for the relationships between highly conserved paralogous gene families that can be inferred to have diverged pre-LUCA. Due to their ancient origins, high levels of structural, functional, and sequence conservation, and critical role within protein synthesis and the syntax of the genetic code, aaRS proteins are especially informative in this regard.

Combined with our previous results showing the parafunctionalization of the ValRS/IleRS protein families (Fournier et al. 2011), this work points to the evolution of the genetic code being an extended process that continued throughout an increasingly complex and protein-based system. The many stages of code evolution likely included early stages that occurred within and were mediated by an RNA-based (or other) physiology. At later stages, possibly once proteins had increased in specificity and functionality due to this very same genetic code expansion, protein evolution could directly shape the genetic code itself. This stage may have been very late indeed, if the aaRS-mediated addition of Trp to the genetic code was the final stage in the evolution to its current form. The observation that all other amino acids are represented within the TyrRS/TrpRS paralog ancestor supports this hypothesis. However, this does not preclude the possibility of subsequent parafunctionalization events within other aaRS lineages. In fact, the inferred ValRS/IleRS paralog ancestor does contain sites specific for both Tyr and Trp, suggesting that any takeover of aminoacylation activities for Val and/or Ile by these proteins would have occurred after the addition of Trp to the code. Given the relatively short branch lengths separating these pairs of aaRS paralogs in both cases (compared to other sets of aaRS families), both divergences were likely among the most recent within the aaRS families, making this order of events a plausible scenario.

The results of this work also predict that other groups of protein families arising before the addition of Trp to the code should also lack Trp in their ancestral sequences. As such, there should be a clear delineation between pre-Trp and post-Trp proteins, as evidenced by the presence or absence of conserved Trp sites within their ancestral sequences. It may be possible to make such distinctions among other ancient paralogous gene families diverging pre-LUCA.

Partial HGT and Ancestral Reconstruction

Ancestral sequence reconstruction of a protein generally assumes that each site shares the same evolutionary history, as does phylogenetic reconstruction of single genes, even for those undergoing HGT. However, if different parts of a protein alignment evolved under different histories, as is the case following partial HGT, this assumption is invalid, and the ancestral reconstruction may be incorrect for the affected regions. If the recombination(s) contain sites that experienced substitutions along the impacted lineages, specious inferences will propagate to the site probabilities of ancestral nodes. Depending on the size of the recombined region and the phylogenetic depth of the reticulation, these events may be highly disruptive to both model parameter estimation and the accuracy of reconstructed ancestor sequences. In this analysis, we have attempted a maximum-information approach to deal with three predicted partial HGT regions, by reconstructing them individually under their own respective topologies, using as much phylogenetic and model information as possible from the remainder of the alignment. One alternative approach would involve removing the affected clades entirely, sacrificing phylogenetic information in order to prevent false inferences at the affected sites. To our knowledge, previously published reconstructions of ancient protein families have not been impacted by known partial HGT events. While it has been shown that partial HGT events have occurred within the EF-1α sequences of some archaeal lineages (Inagaki et al. 2006), these groups were not included in the published ancestral reconstruction of homologous bacterial EF-Tu proteins (Gaucher et al. 2003).

Future Work

Detailed mechanistic analyses of amino acid recognition and aminoacylation within TyrRS and TrpRS have been performed (Doublie et al. 1995; Hogue et al. 1996; Praetorius-Ibba et al. 2000); applying the results of these models to the inferred sequence of the TyrRS/TrpRS paralog ancestor may further elucidate ancestral function. By the same principle, reconstructed sites involved with tRNA discrimination may also be useful in determining ancestral tRNA specificity, and, by extension, amino acid specificity (Bedouelle et al. 1993; Tsunoda et al. 2007). Direct biochemical specificity assays on resurrected ancestor protein sequences are another possible avenue of investigation (e.g., Gaucher et al. 2003). However, this approach would be challenged by very large evolutionary distances and correspondingly higher levels of uncertainty in pre-LUCA paralog ancestor amino acid site identities. Due to the combinatorics arising from even a subset of sites having an uncertain ancestral identity, it is likely that a very large number of potentially “true” ancestor sequences would need to be synthesized and tested in order to accurately and fully explore the likely phenotypic space of the ancestral function. In the case of an informative reconstruction, though, one would predict that the likelihood of each predicted resurrected ancestor (calculated from per-site amino acid probabilities) would positively correlate with the true ancestral amino acid-binding specificity/aminoacylation activity. Thus, a sparser sampling within the space of possible ancestors may provide a tractable solution.
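The combinatorial argument and the sparse-sampling idea can be made concrete with a small Python sketch. It counts the candidate ancestors implied by ambiguous sites, scores a candidate by the sum of its per-site log probabilities, and draws a sparse sample of candidates in proportion to those probabilities. All names, the 0.2 plausibility threshold, and the toy probabilities are hypothetical. Treating sites as independent is an assumption, and this is one possible sampling scheme, not a procedure used in this work.

```python
import math
import random


def count_candidates(site_probs, threshold=0.2):
    """Number of distinct sequences built from residues with probability >= threshold."""
    total = 1
    for probs in site_probs:
        plausible = sum(1 for p in probs.values() if p >= threshold)
        total *= max(plausible, 1)   # keep at least the single best residue per site
    return total


def log_likelihood(seq, site_probs):
    """Log probability of one candidate ancestor, assuming independent sites."""
    return sum(math.log(probs[aa]) for aa, probs in zip(seq, site_probs))


def sample_ancestors(site_probs, n, rng):
    """Draw n candidates site by site; return unique ones, most likely first."""
    drawn = set()
    for _ in range(n):
        seq = "".join(
            rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
            for probs in site_probs
        )
        drawn.add((log_likelihood(seq, site_probs), seq))
    return sorted(drawn, reverse=True)


# Toy per-site probabilities for a three-site ancestor (illustrative values only).
toy_sites = [
    {"Y": 0.5, "F": 0.3, "W": 0.2},
    {"G": 0.9, "A": 0.1},
    {"D": 0.6, "E": 0.4},
]
n_candidates = count_candidates(toy_sites)
best_score = log_likelihood("YGD", toy_sites)
sampled = sample_ancestors(toy_sites, 5, random.Random(1))
print(n_candidates, round(best_score, 3), len(sampled))
```

Even in this three-site toy case, two ambiguous sites give six candidates; the count multiplies with every additional uncertain site, which is the motivation for testing only a likelihood-weighted subset.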