Introduction

The genetic code is a basic feature of molecular biology. It sets the rules according to which nucleic-acid sequences are translated into amino acid sequences. The genetic code probably arose by a process of gradual evolution from a proto-biological stage, via many intermediary stages, to its present form (see e.g. Crick 1968; Lehman and Jukes 1988; Vetsigian et al. 2006). During this process, error robustness was built into the code (see e.g. Ardell 1998; Caporaso et al. 2005; Crick 1968; Di Giulio 2008; Freeland et al. 2003; Higgs 2009; Ikehara et al. 2002; Massey 2008; Vetsigian et al. 2006; Wolf and Koonin 2007; Wong 2005). Two different kinds of error robustness can be observed (Vetsigian et al. 2006) by even the most superficial inspection of the standard genetic code (SGC). On the one hand, codons assigned to the same amino acid are almost always similar, see Table 1. For example, within a codon box (the four codons sharing the first and second nucleotides), the two codons ending with a pyrimidine (U or C) are without exception assigned to the same amino acid (e.g. UAU and UAC both code for Tyr). On the other hand, similar codons are mostly assigned to similar amino acids; e.g. codons with U in the second position are all assigned to hydrophobic amino acids (Woese 1965; Woese et al. 1966a, b). This is illustrated by the polar-requirement values in Table 1: overall, low polar requirement corresponds to hydrophobic amino acids.

Table 1 The standard genetic code

Three main approaches exist to explain the emergence of this robustness of the code: specific selection for robustness (see e.g. Freeland and Hurst 1998a; Haig and Hurst 1991; Vetsigian et al. 2006), amino acid–RNA interactions leading to assignments (see e.g. Woese 1965; Yarus et al. 2009), and a slow growth process of assignment patterns reflecting the history of amino acid repertoire growth (see e.g. Crick 1968; Di Giulio 2008; Massey 2006; Wong 1975). The view that all three competing hypotheses are important has also been put forward (Knight et al. 1999). In the present study we make three adjustments to earlier mathematical work in this field (see e.g. Buhrman et al. 2011; Freeland and Hurst 1998a; Haig and Hurst 1991), which together integrate the three concepts into a single mathematical model. We will now introduce these adjustments one by one.

Polar Requirement

The polar requirement (Woese et al. 1966a) is not just another measure related to hydrophobicity. Several different measures of hydrophobicity exist, each focusing on different aspects of it. Polar requirement specifically focuses on the nature of the interaction between amino acids and nucleic acids; an example is the stacking interaction between the planar guanidinium group of arginine and the planar purine and pyrimidine ring systems of RNA. Woese chose to model the nucleotide rings chemically by using pyridine as the solvent system in the measurements leading to the polar requirement scale (Woese 1965, 1967, 1973; Woese et al. 1966a, b). This interaction between amino acids and nucleic acids has been stressed as an especially important aspect of early protein chemistry, because one suggested possibility for the very first function of coded peptides (Noller 2004) is the enlargement of the number of conformations accessible to RNA (realized by the binding of small, oligopeptide cofactors). Thus, polar requirement could have been among the most important properties of an amino acid during early stages of genetic code evolution.

The remarkable character of polar requirement as an amino acid measure in connection with the genetic code has been confirmed repeatedly over the years. First, Woese found that distinct amino acids coded by codons differing only in the third position are very close in polar requirement, despite differences in general character (Woese et al. 1966b); the pair cysteine and tryptophan nicely exemplifies this. Second, Haig and Hurst (1991) discovered that polar requirement showed the SGC to be special to a much larger degree than another scale of hydrophobicity [the hydropathy scale of Kyte and Doolittle (1982)]. Third, when Mathew and Luthey-Schulten updated the values of polar requirement (Mathew and Luthey-Schulten 2008) by in silico methods (the most important change was believed to be due to a cellulose–tyrosine interaction artefact in the original experiments), the SGC showed a further factor-10 increase (Butler et al. 2009) in error robustness calculations. In all these developments the expectation that polar requirement would behave in a special way, because the interaction between nucleotides and amino acids is biochemically important, was more than borne out by the results. One of the adjustments we introduce compared to our earlier calculations (Buhrman et al. 2011) is that in the present work we use the new, updated values of polar requirement (see Table 1).

Aptamers

Oligonucleic-acid molecules that bind to a specific target molecule (e.g. a specific amino acid) are called aptamers (Ellington and Szostak 1990). Over the last two decades, many results have been obtained regarding specific binding of amino acids by RNA aptamers, mainly by Yarus and co-workers (Illangasekare and Yarus 2002; Majerfeld and Yarus 1994; Yarus et al. 2009). For several amino acids, codons and anticodons were found in binding sites, in quantities higher than would be expected to occur by chance (Yarus et al. 2009). In Table 2, a list of occurrences of anticodons in binding sites of RNA sequences is given, together with the articles in which these sequences were reported. Please note that the definition of anticodons used in these articles is: triplets complementary to codons. These anticodons are therefore not necessarily identical to the triplets found in tRNA molecules, which are normally meant by the word ‘anticodon’. As an example: the triplet AUG is considered a His anticodon because it is complementary to the His codon CAU. In tRNAs, however, the anticodon recognizing CAU is GUG (see Grosjean et al. 2010; Johansson et al. 2008 for reviews on codon–anticodon interaction). We summarize published details on the aptamers for seven amino acids, and subsequently formulate a conclusion regarding the implications of the existence of these molecules for genetic-code error-robustness calculations. This conclusion is based on reasoning presented by the Yarus group concerning the existence of specific relationships between certain triplets and certain amino acids. These relationships could have led to evolutionarily conserved assignments of these amino acids to these triplets, e.g. by a mechanism as presented in Yarus et al. (2009).

Table 2 The occurrence of anticodons in binding sites of the RNA sequences of amino acid binding aptamers, and the references in which the actual RNA sequences can be found

For Ile, Trp, and His, three binding motifs were described, respectively named the ‘UAUU-motif’ (Lozupone et al. 2003), the ‘CYA-motif’ (Majerfeld et al. 2010, Majerfeld and Yarus 2005), and the ‘histidine-motif’ (Majerfeld et al. 2005). As can be seen from the names, the anticodons UAU for Ile, and CCA for Trp, are characteristic for the motifs (‘CYA’ stands for ‘CUA or CCA’). In the case of His, both GUG and AUG (the anticodons for the two His codons CAC and CAU) are found in quantities higher than would be expected by chance (Majerfeld et al. 2005).

Although binding sites for Phe and Tyr have so far not been studied as extensively as those for Ile, Trp, and His, the analysis of Yarus et al. (2009) shows that the anticodons (GAA and AAA for Phe, and GUA and AUA for Tyr) are present in the binding sites more often than would be expected on a random basis.

Both the CCU anticodon (Janas et al. 2010) and the UCG anticodon (Yarus et al. 2009) are present in Arg binding sites more often than would be expected on a random basis. Thus, a physico-chemical background was observed, compatible with: (1) Arg having more than four codons, and (2) all six Arg codons sharing the same middle nucleotide.

A similar observation can be made for Leu, the other amino acid which is encoded by six codons all having the same middle nucleotide. For this amino acid, however, only a single RNA sequence was found binding the amino acid with specificity (Yarus et al. 2009). Inspection of this sequence shows anticodons UAG, GAG, and CAA to be present in its binding parts.

Taking the combined results of Yarus and co-workers into consideration, we propose to fix assignments of Ile, Trp, His, Phe, Tyr, Arg, and Leu for calculations using random variants of the SGC.

Gradual Growth

In ‘Methods’ section we present our approach in detail. We use Haig and Hurst’s ‘mean square’ measure [as first proposed in Haig and Hurst (1991)] to quantify the error robustness of a given code. With this measure, a relatively error-robust code gets a low value compared to the average value of a large set of codes produced by random allocation of amino acid assignments [see Buhrman et al. (2011) for a more in-depth treatment of the approach]. The space of codes allowed by the allocation procedure can be large [in the original work of Haig and Hurst (1991) the space has a size of exactly 20! codes, which is \(\approx 2.433 \times 10^{18}\) codes]. We call a code optimal if it attains the minimum value of the error function among all possible codes in a particular setting.

In 1975, Wong proposed the coevolution theory of the genetic code (Wong 1975). According to this proposal, SGC codons assigned to an amino acid that is biosynthetically derived from another amino acid were originally assigned to that ‘precursor’ amino acid. As an example: Pro is biosynthetically derived from Glu. According to coevolution theory, the four Pro codons (CCN) would originally have encoded Glu. Without embracing all details of the original coevolution theory, or modern refinements of the theory (Di Giulio 2008; Wong 2007), this way of looking at the SGC reveals something remarkable. Shikimate-derived amino acids (Phe, Tyr, and Trp) all have U in the first position of the codon (Phe: UUY; Tyr: UAY; and Trp: UGG). Glu-derived amino acids (Pro, Gln, and Arg) almost always have C in the first position of the codon (Pro: CCN; Gln: CAR, which stands for ‘CAA or CAG’; and Arg: AGR and CGN, where N stands for all four nucleotides). Asp-derived amino acids (Ile, Met, Thr, Asn, and Lys) all have A in the first position of the codon (Ile: AUY and AUA; Met: AUG; Thr: ACN; Asn: AAY; and Lys: AAR). Codons with G in the first position all code for amino acids produced in Urey–Miller experiments (Val: GUN; Ala: GCN; Asp: GAY; Glu: GAR; and Gly: GGN). This ‘layered structure’ of the SGC was first pointed out explicitly by Taylor and Coates (1989). It may indeed suggest a sequential development of the repertoire of amino acids specified in the developing code, and a possibly sequential introduction of the use of G, A, C, and U as first nucleotide in codons.
The ‘layered structure’ of the SGC is a regularity different from the well-known error-robust distribution of polar requirement (Haig and Hurst 1991), which is pronounced in the first and the third, but not in the second position of the codon (note that when a group of codons shares the same first-position nucleotide, the group character is robust to changes in the second and third positions). As is shown in ‘Appendix: Molecular Structure Matrix’, the presence of the ‘layered structure’ can be demonstrated quantitatively, when the appropriate set of values is developed and used as input.

Freeland and Hurst (1998b) followed the concept of Taylor and Coates, and formally divided the 20 amino acids into four groups of five amino acids each: Gly, Ala, Asp, Glu, and Val in a first group which could be called ‘the prebiotic group’; a second group of amino acids with codons starting with A (Ile, Met, Thr, Asn, and Lys); a third group with codons mainly starting with C (Leu, Pro, His, Gln, and Arg); and, finally, a group with codons mainly starting with U (Phe, Ser, Tyr, Cys, and Trp). Division of the set of twenty into these four subsets was subsequently incorporated in the calculations on code error robustness (Freeland and Hurst 1998b). This approach drastically reduced the size of the space from which codes could be sampled randomly: from about \(2 \times 10^{18}\) codes (see above) to \((5!)^4\) codes (which is exactly \(2.0736 \times 10^{8}\) codes). This space was called the ‘historically reasonable’ set of possible codes (Freeland and Hurst 1998b). By sampling from the historically reasonable set of possible codes, we incorporate in the current study the notion of a chronologically determined, layered structure of the SGC.

Integration of Assumptions

We have found that the SGC is optimal if: (1) the updated values for polar requirement are used as amino acid attributes; (2) the assignments of seven amino acids to codons are fixed following the rationale given above; and (3) the subdivision leading to the historically reasonable set of possible codes is used to define the space of code variations [which is also reduced in size by (2)]. It is important to note that the constraints applied drastically reduce the size of the space: when both (2) and (3) are applied, the ‘realistic space’ has a size of 11,520 codes.
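The space sizes mentioned above follow directly from simple permutation counts. The following Python sketch (our calculations used MATLAB; this fragment, with names of our own choosing, is for illustration only) reproduces the sizes of the unconstrained space, the historically reasonable set, and the realistic space under the seven fixed assignments:

```python
from itertools import permutations, product
from math import factorial, prod

# Unconstrained space of Haig and Hurst (1991): all permutations of the
# 20 amino acids over the 20 amino acid codon blocks.
print(factorial(20))       # 2432902008176640000, i.e. about 2.433e18

# Historically reasonable set (Freeland and Hurst 1998b): amino acids are
# permuted only within the four five-member groups.
print(factorial(5) ** 4)   # (5!)^4 = 207360000

# Fixing Ile, Phe, Tyr, Trp, His, Leu, and Arg (aptamer evidence) leaves
# the following freely permutable members per group:
free = {
    "G (prebiotic)": ["Gly", "Ala", "Asp", "Glu", "Val"],
    "A": ["Met", "Thr", "Asn", "Lys"],   # Ile fixed
    "C": ["Pro", "Gln"],                 # Leu, His, Arg fixed
    "U": ["Ser", "Cys"],                 # Phe, Tyr, Trp fixed
}
print(prod(factorial(len(g)) for g in free.values()))  # 5!*4!*2!*2! = 11520

# The same number, obtained by explicit enumeration of the code variants:
variants = list(product(*(permutations(g) for g in free.values())))
print(len(variants))       # 11520
```

The explicit enumeration shows that the realistic space is small enough to be searched exhaustively rather than sampled.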

Methods

We use the mean-square method developed by Alff-Steinberger (1969), Wong (1980), Di Giulio (1989), and Haig and Hurst (1991). For the mathematical formulation, we follow the approach of Buhrman et al. (2011) and consider the undirected graph \(G = (V, E)\) that has the 61 sense codons as its vertices and an edge between any two codons if they differ in only one position, yielding 263 edges. A code F maps each codon c to exactly one amino acid F(c). We denote by \(r_{F(c)}\) the polar requirement of the amino acid that codon c encodes in the code F, and by \({\bf r}\) the full vector of 20 values. The mean square error function of code F is then given by

$$ MS_0^{\alpha, {\bf r}}(F)= \frac{1}{N} \sum_{\{ c,c'\} \in E} \alpha_{c,c'}(r_{F(c)}-r_{F(c')}) ^2 $$

where the \(\alpha_{c,c'}\) are the weights of the different mutations that can occur (corresponding to edges of the graph) and \(N=\sum_{\{c,c'\} \in E} \alpha_{c,c'}\) is the total weight. Following Haig and Hurst (1991), we use a subscript 0 to indicate the overall measure. If we set all 263 weights \(\alpha_{c,c'}\) to 1, we get the original function described by Haig and Hurst (1991), which we simply denote by \(MS_0(F)\). We also consider the following set of weights introduced by Freeland and Hurst (1998a), which differentiates between transition errors (i.e. U to C, C to U, A to G, G to A) and transversion errors, and the position in the codon where they occur:

  • \(\alpha_{c,c'} = 0.5\) if \((c, c')\) is a transversion in the first position or a transition in the second position,

  • \(\alpha_{c,c'} = 0.1\) if \((c, c')\) is a transversion in the second position,

  • \(\alpha_{c,c'} = 1\) otherwise.
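To make these definitions concrete, the following Python sketch (the published calculations used MATLAB; the function and variable names here are ours) builds the 61-codon graph and evaluates the mean-square measure for an arbitrary code and polar-requirement assignment:

```python
from itertools import product

BASES = "UCAG"
STOPS = {"UAA", "UAG", "UGA"}
TRANSITIONS = {frozenset("UC"), frozenset("AG")}

# Vertices: the 61 sense codons.
codons = ["".join(t) for t in product(BASES, repeat=3)
          if "".join(t) not in STOPS]

def single_diff(c1, c2):
    """Position (0-2) where c1 and c2 differ, or None if not exactly one."""
    diffs = [i for i in range(3) if c1[i] != c2[i]]
    return diffs[0] if len(diffs) == 1 else None

def fh_weight(c1, c2, pos):
    """Freeland-Hurst (1998a) weight for the single-nucleotide change at pos."""
    transition = frozenset((c1[pos], c2[pos])) in TRANSITIONS
    if (pos == 0 and not transition) or (pos == 1 and transition):
        return 0.5
    if pos == 1 and not transition:
        return 0.1
    return 1.0

# Edges: codon pairs differing in exactly one position, with their weights.
edges = []
for i, c1 in enumerate(codons):
    for c2 in codons[i + 1:]:
        pos = single_diff(c1, c2)
        if pos is not None:
            edges.append((c1, c2, fh_weight(c1, c2, pos)))

def mean_square(code, pr, weighted=True):
    """MS_0(F) (all weights 1) or MS_0^FH(F) (Freeland-Hurst weights).

    `code` maps each sense codon to an amino acid name; `pr` maps each
    amino acid name to its polar requirement."""
    total = n = 0.0
    for c1, c2, w in edges:
        a = w if weighted else 1.0
        n += a
        total += a * (pr[code[c1]] - pr[code[c2]]) ** 2
    return total / n

# Sanity checks matching the text: 61 vertices, 263 edges.
print(len(codons), len(edges))   # 61 263
```

As a trivial check, a code assigning the same amino acid (hence the same polar requirement) to every codon yields a mean square of 0, the lower bound of the measure.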

Using weights for different codon positions implies the existence of a tRNA with a triplet anticodon during the process of code evolution. Because we consider a process of gradual expansion of the repertoire of amino acids during the evolution of the SGC (see e.g. Ardell 1998; Crick 1968; Lehman and Jukes 1988), with duplication of tRNA genes and subsequent divergence [cf. Ohno (1970)] of their sequences and functions, as the most likely mechanism, we think this assumption is acceptable. This assumption does not necessarily imply the existence of protein aminoacyl-tRNA synthetases during all or part of the process of code evolution, as there could originally have been ribozymes fulfilling this function. The value of error robustness of a code F using the set of weights introduced above will be denoted by \(MS_0^{FH}(F)\).

In principle, there are at least three ways in which one can improve the model of Haig and Hurst (1991) to reflect biological reality more accurately. The first possibility is to change how the level of error robustness is measured, e.g. by introducing weighting factors as described above. Variations of the weighting factors used in the calculation show an even higher error robustness of the SGC, as noticed by e.g. Butler et al. (2009), Freeland and Hurst (1998a), Gilis et al. (2001). The rationale behind changing weighting factors is improved reflection of natural selection pressures. It is, however, difficult to decide which weighting factors adequately reflect the natural selection pressures operating during the early evolution of the genetic code [see comment 4 of Ardell in Novozhilov et al. (2007) and the exchange of thoughts with respect to ‘column 4’ in Higgs (2009)].

The second way to improve the model is to change the set of values representing amino acid properties used as input in the error-robustness calculation. For instance, one can use the values of hydropathy from Kyte and Doolittle (1982), or the matrix of Gilis et al. (2001) instead of the polar requirement scale. In our paper, we use the values of the 2008 update of polar requirement by in silico methods (Mathew and Luthey-Schulten 2008) given in Table 1. Work concerning the question of what an ‘ideal’ set of 20 values would look like, and work considering different known sets of amino acid properties, is presented in ‘Appendices: Inverse Parametric Optimization and Scan of Other Amino Acid Properties’.

The third way to improve the model is to change the size of the space from which random codes are sampled (Buhrman et al. 2011). The incentive to enlarge that space [as was done in Buhrman et al. (2011)] is the wish to work from a space that encompasses all possible codes, or at least, all known codes. As indicated in Buhrman et al. (2011), larger spaces are increasingly difficult to work with. The frequency distributions obtained by sampling from the larger spaces in Buhrman et al. (2011) largely coincide with the frequency distribution obtained from the original space [as presented in Haig and Hurst (1991)]. From this viewpoint, working in the original space is an acceptable simplification. In the current study, we shrink the size of the space, based on fixed assignments of certain codons, combined with the constraint of the historically reasonable set of possible codes of Freeland and Hurst (1998b), as outlined in ‘Introduction’ section.

MATLAB-programs were used for the error-robustness calculations and visualizations. All software can be found as supplemental information, or downloaded from https://github.com/cschaffner/gcode.

Results

Among all genetic codes (in this particular setting of the problem), the SGC is optimal in terms of error-robustness if:

  1. We use the updated values of polar requirement (Mathew and Luthey-Schulten 2008).

  2. We use fixation for Phe, Tyr, Trp, His, Leu, Ile, and Arg, based on aptamer experiments (Janas et al. 2010; Yarus et al. 2009).

  3. We use the historically reasonable set of possible codes (Freeland and Hurst 1998b).

Figure 1 shows a histogram of the \(MS_0^{FH}(F)\) values resulting from this procedure. When the original error function \(MS_0(F)\) from Haig and Hurst (1991) is used, the result is essentially the same: the SGC is the optimal code. We wondered whether, by fixation of just one or two more assignments, the SGC would be optimal in the space resulting from the combination of these fixations with the random permutations of amino acid assignments according to the method used by Haig and Hurst (1991), without the constraint of the historically reasonable set of possible codes (Freeland and Hurst 1998b). This was not the case (as reported in ‘Appendix: Minimal Number of Fixed Assignments’).

Fig. 1

Histogram of \(MS_0^{FH}\) values when using the historically reasonable set of possible codes, and fixing Phe, Tyr, Trp, His, Leu, Ile, and Arg. The standard genetic code (indicated by dashed red line) is optimal

Discussion

What is the biological relevance of the mathematical result presented, if any? Can we indeed conclude that natural selection steered the translation system toward better and better variants of the assignments (in terms of error-robustness) within realistic boundaries? Stated differently, when making a model, should one respect that seven assignments are fixed, and that the system evolved gradually (as reflected by using the historically reasonable set of possible codes), until the optimal code (within these boundaries) was reached? Or is it rash to arrive at such a conclusion, and could one imagine positive selection for error-robustness to be an illusion?

The space of codes resulting from the constraints imposed on the calculations is a space of very limited size: only 11,520 codes (\(2!\times 2!\times 4!\times 5!\times 5!\)). The fact that the SGC is optimal in this space is impressive, but of a different order of magnitude than the near-optimalities in significantly larger spaces presented in earlier studies (e.g. Buhrman et al. 2011; Butler et al. 2009; Freeland and Hurst 1998a; Freeland et al. 2000; Gilis et al. 2001). The impact of the different fixed assignments varies: for the \(MS_0\) values, it would theoretically suffice to fix the three assignments of Phe, Trp, and Arg (or any set containing them) in order to find the SGC to be optimal in the resulting space. In this way, the SGC can be thought of as the global optimum in a space of \(3! \times 4!\times 5!\times 5!=2{,}073{,}600\) codes. We nevertheless refrain from presenting it thus, because in doing so we would abandon the physico-chemical facts which were the starting point for our calculations with fixed assignments.

It is also possible to increase the number of fixed assignments (and in this way decrease the size of the space of random code variants) even further. A recent article (Johnson and Wang 2010) suggests that more than the seven assignments (listed in Table 2) are fixed.

The logical extreme of fixing assignments is that all assignments of the SGC are fixed, as argued recently by Erives (2011). In his theory, a kind of RNA cage (pacRNA: proto-anti-codon RNA) is presented, in which different amino acids are bound by different kinds of ‘walls’ exposing anticodons to the different amino acids. Although this model combines elegant explanations for several aspects of present-day tRNA functioning, it is very hard to obtain an objective measure of the specificity of amino acid–anticodon interactions in this model. In particular, the different possibilities allowed by ‘breathing’ of the cage cast doubt on interaction specificity. Some objections can also be raised regarding the tRNA activation mechanism. Yarus and co-workers recently reported a very small ribozyme (only five nucleotides in length), which was experimentally shown to aminoacylate certain small RNAs using aminoacyl-NMPs as activated precursors (Turk et al. 2010; Yarus 2011). Such an early activation mechanism, using NTPs as the source of energy, is different from the one in Erives’ model, where the \(5^\prime\) end of the pacRNA performs this role.

Taking all considerations sketched above into account, it is possible to draw a tentative picture of genetic code evolution which is compatible with the indications concerning which aspects of code evolution are important. Code evolution probably followed classical mechanisms of gene duplication and subsequent diversification (here of ‘tRNA’ genes and genes involved in aminoacylation). Evolution would have proceeded mainly by stop-to-sense reassignments (Lehman and Jukes 1988), with occasional reassignments of codons whose uses were new or still developing [cf. Ardell 1998; Vetsigian et al. 2006] and not yet massively present in protein-coding sequences [cf. the frozen accident concept (Crick 1968)]. In a proto-biological stage, RNA would be absent while very small peptides could have been synthesized, e.g. by the salt-induced peptide formation (SIPF) reaction (Rode et al. 1999; Schwendinger and Rode 1989). Under prebiotic conditions especially Ala and Gly would be expected to be present in relatively large amounts (see e.g. Higgs and Pudritz 2009; Philip and Freeland 2011). Asp-containing peptides could possibly have played a role in the origin of RNA, as they could position Mg2+ ions in the correct orientation to help polymerize nucleotides, and, concomitantly, keep these ions from stimulating RNA hydrolysis (Szostak 2012). The Asp content of peptides could be enriched in the presence of carboxyl-group binding montmorillonite surfaces (Rode et al. 1999).

In the first stages of coded peptide synthesis, GCC and GGC probably were the only codons in mRNAs (Eigen and Schuster 1978), and coded peptides would consist of Ala and Gly. The remaining codons would effectively be stop codons (Lehman and Jukes 1988), although functioning without release factors: water would break the bond between tRNA and peptide whenever a codon stayed unoccupied for too long. The ‘single-step biosynthetic distance’ between Ala and pyruvate suggests a carbon storage role for these peptides, with Gly allowing folding of such molecules. An mRNA/tRNA system functioning without a ribosome has been proposed by several authors (Crick et al. 1961; Lehman and Jukes 1988; Woese 1973). The first rRNA could then have functioned in improved termination (see above). At this stage the proposal that coded peptides enlarge the possible range of RNA conformations should be taken into account (Noller 2004).

In the next stage of coded peptide synthesis, Asp and Val could have been added to the repertoire (see e.g. Ardell 1998; Eigen and Schuster 1978; van der Gulik et al. 2009; Higgs 2009; Ikehara 2002). This would have been a crucial step: enabling directed production of the important Asp-containing peptides (van der Gulik et al. 2009; Szostak 2012) as well as formation of something resembling protein structure, characterized by hydrophobic cores (Val) and hydrophilic exteriors (Asp). The emerging polypeptides could have functioned in carbon storage, as mentioned above. Once established, trinucleotide codons were retained: not because four-nucleotide codons are impossible in principle, but because the triplet system allowed a further robust development (cf. Vetsigian et al. 2006). Depletion of prebiotic pools of either Ala, Gly, Asp, or Val (e.g. by excessive storage in coded peptides) could have led to the biosynthetic routes involving Gly, Ser, Val, Asp, Ala, and pyruvate. In this way the lack of one of these amino acids could in principle be resolved by use of the other three (cf. the hypothesized carbon storage function of coded peptides).

In a further stage, Ser and Asp-derived amino acids like Asn and Thr would be added to the repertoire. Asn would be the first amino acid with an entirely biosynthetic origin (it is relatively unstable, and does not accumulate prebiotically). The production of Asn is known to have been originally linked to enzymatic conversion of Asp to Asn on a tRNA (see e.g. Wong 2007). When, instead of two molecules of pyruvate, one molecule of pyruvate and one molecule of alpha-keto-butyrate are fed into the Val biosynthesis pathway, Ile is produced instead. Therefore, when both Thr and Val biosynthesis are present, the evolution of just one enzyme (making alpha-keto-butyrate from Thr) suffices for the emergence of Ile. Aptamers can handle this amino acid, and these two factors (easy development from existing biochemistry and easy manipulation by RNA) could be responsible for the ‘choice’ of Ile (cf. Philip and Freeland 2011).

Larger amino acids like His and Gln would have appeared in a later stage of code development than Asp-derived amino acids like Asn and Thr. The reactions catalyzed by the few enzymes of Leu biosynthesis that are not involved in Val biosynthesis (apart from leucine aminotransferase) are reminiscent of the first three reactions of the citric acid cycle (Voet and Voet 1995). Jensen (1976) hypothesized that enzymes originally had much broader substrate specificity. With the citric acid cycle being ‘old’, as well as important for bio-energetic reasons, and Val biosynthesis being present, the system could have produced an excess of Leu. Again, aptamers would be able to ‘handle’ Leu. Existing biochemistry and aptamer potential would thus answer the question why Ile and Leu are part of the Set of Twenty, and e.g. norleucine and alpha-amino-butyric acid are not (cf. Philip and Freeland 2011). Linked to the citric acid cycle and important in nitrogen management are Glu and Gln. A further expansion of the repertoire with a Glu-derived amino acid is the expansion with Arg. Two of the enzymes of the urea (nitrogen management) cycle are related to pyrimidine synthesis enzymes, two others to purine synthesis enzymes (Berg et al. 2007). The last enzyme in the cycle is arginase. This suggests an ancient accumulation of Arg as a side effect of RNA synthesis, once Glu became a major cell component. Arginase could function in bringing the Arg concentration down to acceptable levels. Aptamers could also have evolved to manipulate Arg levels, allowing Arg to become part of the Set of Twenty. Again, Jensen’s concept of primordial broad substrate specificity (Jensen 1976) is essential to arrive at a possible answer to the ‘Why these 20?’ question: Arg could be part of the set, rather than ornithine and citrulline, because Arg accumulates, and Arg can be manipulated by aptamers.

In an advanced stage of code development aromatic amino acids would be added to the repertoire, and release factors would evolve. Van der Gulik and Hoff (2011) have argued that codons UUA, AUA, UAA, CAA, AAA, GAA, UGA, and AGA could not function unambiguously until the anticodon modification machinery was developed, which is seen by them as the last development leading to the full genetic code. Because archaea and bacteria have different solutions for the ‘AUA problem’ [agmatidinylation vs. lysidinylation (van der Gulik and Hoff 2011)], unambiguous sense assignment of AUA must have been late indeed.

The SGC has probably evolved in a genetic environment characterized by rampant horizontal gene-flow (Vetsigian et al. 2006). The interaction between genetic systems with slightly different, still-evolving codes, is thought to have caused both universality and optimality of the SGC (Vetsigian et al. 2006). Universality, because the genetic code functioned as an innovation sharing protocol (Vetsigian et al. 2006). Optimality, because competition allowed selection for the ability to translate the genetic information accurately (Vetsigian et al. 2006). The work presented in our paper illuminates constraints within which this process of genetic code development took place. Both the step-by-step increasing complexity of biochemistry, and the stereochemical relationship between at least some amino acids and triplets, are factors which have to be taken into account.

In summary, although there are at least two different lines of research suggesting a greater number of fixed assignments than the seven given in Table 2 [based on the work of Yarus and co-workers (Janas et al. 2010; Yarus et al. 2009)], for now it is not clear that more [or even all (Erives 2011)] assignments are fixed. Thus, the observed error-robustness still needs explanation. It is possible that the optimality of the SGC we found results from positive selection for error-robustness, though starting within a more restricted set of possibilities than previously thought.