Introduction

Crick (1968) introduced the notion that the genetic code is simply the result of pure chance or a “frozen accident” and that it therefore does not need any further evolutionary explanation. Later, this view was questioned. Although certain knowledge of the origin and early stages of life is not likely to be obtained, there are some hints of possible evolutionary scenarios of the genetic code. One direction of research (the “top-down approach” [Szathmary 1999]) analyzes patterns in the contemporary code (Knight and Landweber 1998; Szathmary 1999) and tries to infer appropriate chemical and selective forces. The bottom-up approach, on the other hand, is rooted in biochemistry and aims at constructing plausible scenarios for the origin of coding (Topal and Fresco 1976; Maizels and Weiner 1987; Szathmary 1993).

It has been appreciated for a long time that the genetic code assigns similar amino acids to similar codons (Sonneborn 1965; Woese 1965; Zuckerkandl and Pauling 1965; Crick 1968). Two different rationales have been presented: first, mutation (Sonneborn 1965; Zuckerkandl and Pauling 1965) and translation (Woese 1967; Haig and Hurst 1991; Freeland and Hurst 1998) error minimization (or both (Ardell and Sella 2002)) and, second, the tendency of similar amino acids to directly interact with similar RNA sequences (Woese et al. 1966; Yarus 1998, 2000). Landweber and coworkers found further evidence to support both hypotheses. Extending previous work (Haig and Hurst 1991; Freeland and Hurst 1998) by quantifying amino acid similarity, these authors were able to show that “the canonical code is at or very close to a global optimum for error minimization” (Freeland et al. 2000). Based on the earlier work of Yarus (cf. Yarus 1998, 2000), by doing a statistical analysis of RNA aptamers (nucleic acid molecules selected to bind specific ligands), they concluded that there is “the strongest support for an intrinsic affinity between any amino acid and its codons” (Knight and Landweber 1998). It has also been proposed that instead of the actual codons, some derivatives of them, such as the anticodons (Dunnill 1966; Jungck 1978) or codon–anticodon duplexes (Alberti 1997), were the original amino acid binding motifs. It could also be that the original amino acid recognition took place at the tRNA acceptor stem (Hopfield 1978) or that the specificity of aminoacylation is determined by the interaction of the tRNA synthetase with its tRNA (Weiner and Maizels 1987). Szathmary (1999) proposed that amino acid RNA allocation took place even before the appearance of tRNA. He also gave a possible evolutionary scenario for the development of an anticodon hairpin to a longer structure with an operational code at the acceptor stem.

Several patterns of the genetic code have been identified, which can be illustrated within the classical scheme.

The Common Scheme of the Genetic Code

The common scheme of the genetic code (Alberts et al. 2002) contains 43 = 64 codons, a three-dimensional matrix where each dimension represents one of the three positions in the triplet code (Fig. 1). Viewed this way, some patterns emerge: The first codon position seems to be correlated with amino acid biosynthetic pathways (Wong 1975; Taylor and Coates 1989) and with their evolution as evaluated by synthetic “primordial soup” experiments (Eigen 1977; Schwemmler 1994). The second position is correlated with the hydropathic properties of the amino acids (Crick 1968; Wolfenden et al. 1979; Taylor and Coates 1989), and the degeneracy of the third position could be related to the molecular weight or size of the amino acids (Hasegawa and Miyata 1980; Taylor and Coates 1989).

Figure 1
figure 1

The common presentation of the standard (“universal”) genetic code. All deviations from this code (Elzanowski and Ostell 2000) are thought to be the result of later mutations (Osawa et al. 1992; Knight and Landweber 2000b; Knight et al. 2001). Shaded regions show codon families.

Lagerkvist (1978, 1981) divided the common illustration scheme (Fig. 1) into a left part (containing the first and second columns, i.e., U and C in the second position of the codon, respectively) and a right part (the third and fourth columns, i.e., A and G in the second position). He observed that codon families (the amino acid of a codon family is uniquely determined by the first two nucleotides of a codon; cf. shaded regions in Fig. 1) have a much higher probability to appear in the left part. Furthermore, he found that “strong” codons (the first two nucleotides in the codon are G and/or C) always represent codon families, while “weak” codons (A and/or U as the first two nucleotides) never do so. “Mixed” codons in the right part of the scheme never represent codon families, whereas mixed codons in the left part always stand for a codon family. Lagerkvist (1978) speculated “that interactions between mixed codons and their anticodons are stronger in the left half of the codon square.”

However, most amino acid properties show no clear pattern in the common scheme of the genetic code. Instead Jungck (1978) used 15 different quantitative measures of amino acid properties such as polarity and molecular volume to demonstrate that these properties are generally more closely correlated with anticodon than with codon dinucleoside monophosphate properties. This supports the hypothesis that the relationship between amino acids and their anticodon dinucleosides was the basis for the origin of the genetic code.

In this article we follow the “top-down approach” toward understanding the organization of the genetic code. We are thereby led to propose a new classification scheme for the code that helps us to identify new patterns which in turn suggest new speculations about its origin.

Results

A New Classification Scheme of the Genetic Code

Figure 2 shows our new scheme for presenting the genetic code. It is based on a binary classification of nucleic acid bases. The two components of all nucleic acids, purines and pyrimidines, are denoted 1 and 0, respectively. The eight rows in Fig. 2 represent the 23 = 8 possible combinations of three binary digits. Since there are two purines (A, G) and two pyrimidines (U, C) for each row, there again exist eight possibilities.

Figure 2
figure 2

A new classification scheme of the standard genetic code based on a binary representation of purines (1) and pyrimidines (0). The third base is given in parentheses. When there are differences between the standard code and any other code, the number of deviations from the standard code is indicated. This comparison is based on 16 nonstandard codes (Elzanowski and Ostell 2000). For instance, in the UG(G/A) field, 0/9 indicates that UGG encodes for Trp in all codes, but UGA is not the termination codon in 9 of the 16 nonstandard codes: In 8 different mitochondrial codes UGA encodes Trp, and in the euplotid nuclear code it represents Cys. It is interesting that at least in some bacteria the 21st amino acid, selenocysteine, can also be encoded by UGA (Osawa et al. 1992; Thanbichler and Böck 2002). Another example is the CU(G/A) field. In the yeast mitochondrion CUG and CUA encode Thr; in the alternative yeast nuclear code, CUG represents Ser. Shaded region show codon families. The point in the center indicates the perfect point symmetry in this scheme, according to Halitsky’s (2003) family–nonfamily symmetry operation. The thick horizontal line marks the symmetry axis for codon–anticodon symmetry.

Our first observation is that four (and not eight) columns are sufficient to place all 20 amino acids, as well as the termination codons. Each row contains exactly four different amino acids (including the termination codon). In the standard code, exceptions are the second row, with two leucines, and the AU* start codon in the fourth row. Note that here are also the deviations from the standard code. Interestingly, the yeast mitochondrial code shows no exception: Each row contains exactly four different entries in four different columns. In this respect the yeast mitochondrial code is the most regular one. The fact that in our scheme four columns are sufficient reflects the well-known fact that if the third position is important (in exactly half of our table this is not the case), then it is only decisive if there is either a purine (1) or a pyrimidine (0) (Fitch and Upper 1987), i.e., the third position is analyzed in a binary manner (Taylor and Coates 1989). This has been explained by Crick’s (1996) wobble hypothesis, wherein the first two nucleotides of the codon pair with their anticodon bases according to Watson–Crick rules, but the third base pairs according to the wobble rules, which say that G can also pair with U, for instance. The third codon position is exclusively analyzed in a binary manner in the mitochondrial codes of yeast, vertebrates, invertebrates, coelenterates, and flatworms, as well as in the codes of mold, protozoa, and mycoplasma/spiroplasma; for the other codes there are a few exceptions (cf. Elzanowski and Ostell 2000). Note that these few exceptions always have a purine at the third position of the codon (e.g., AUA [Ile] and AUG [Met] in the standard code).

Our scheme provides some support for the “adaptive genetic code” hypothesis (Freeland 2002), which states that the code has evolved to minimize the deleterious effects of mutation and translation error (Haig and Hurst 1991; Freeland and Hurst 1998). The purine–pyrimidine binary coding scheme, shown in Fig. 2, exhibits a much greater regularity than a binary coding according to the base pairs (A,U—1; G,C—0). This corresponds to the known fact that transition mutations (e.g., purine A vs. purine G) occur more frequently than transversion mutations (e.g., purine A vs. purimidine U).

A second observation concerns the order of the columns. In the first column the first two positions are G and C. These always pair with their anticodon base via three hydrogen bonds, i.e., the first two bases together always guarantee six hydrogen bonds. For that reason Lagerkvist (1978) called them strong codons. In the second and third columns, the first two bases guarantee five bonds (mixed codons), and in the fourth column just four bonds (weak codons). This pattern corresponds very well to the importance of the third base in the triplet codon: If the first bases are G and/or C (first column), the third base is never important, and in the second and third columns, the third base is important in exactly half the cases (if there is a purine in the second position—lower half of the table). In the fourth column the third base is always necessary for the determination of the correct amino acid. In Fig. 2, the order of codon families is illustrated by the shaded regions. It seems that for the first column, the first two bases alone guarantee sufficient stability in the codon–anticodon pairing to ensure the correct choice of the amino acid. In the case of mixed codons (second and third columns) a codon family is guaranteed if there is a pyrimidine in the second position. Going beyond Lagerkvist’s counting of hydrogen bonds, others have provided some quantitative information about nucleotide binding strengths (Ornstein and Fresco 1983).

A third observation refers to two perfect symmetries in our scheme. The first is the codon–anticodon symmetry: The thick horizontal line in Fig. 2 marks the symmetry axis. For instance, codon CCC (Pro; first column, first row) has the anticodon GGG (Gly; first column, last row), and codon ACG (Thr; third column, fourth row) has the anticodon UGC (Cys; third column, fifth row). The second perfect symmetry is the point symmetry corresponding to Halitsky’s (2003) family–nonfamily symmetry operation (“E–M bifurcation”), indicated by the point in the center of Fig. 2. Halitsky observed that all 32 “family codons” CC*, CU*, UC* GC*, GU*, AC*, CG*, and GG* can be mapped into the 32 “nonfamily codons” UU*, AU*, CA*, UG*, UA*, GA*, AG*, and AA* by exchanging the two amino bases A and C with one another and the two keto bases U and G with one another. For instance, the family codon GUA (Val) is mapped into the nonfamily codon UGC (Cys). Thus, this point symmetry underlies the family–nonfamily symmetry in our scheme (shaded vs. unshaded regions).

A fourth observation concerns the deviations of nonstandard genetic codes. As can be seen in Fig. 2, nearly all deviations occur in codons with a purine at the third position. The only exception is the yeast mitochondrial code, in which CU* does not code for Leu but, rather, for Thr.

Our fifth observation refers to the number of different tRNAs. The mammalian mitochondrial genomes contain one gene for each tRNA, with the exceptions of tRNA Leu and tRNA Ser for which two genes are present. Our new classification scheme for these mitochondrial codes (slight modification of Fig. 2) makes this number obviously: eight tRNAs for the eight codon families plus 14 tRNAs for the remaining 14 fields (the two termination codons need no tRNA).

Our sixth observation shows hitherto unknown regularities of amino acid properties. Jungck (1978) collected 15 different measures of amino acid properties, as well as 3 measures for dinucleoside monophosphates. For all of these 18 measures we arranged a table with eight rows and four columns corresponding to the scheme in Fig. 2. For AU(G/A) we took the Met values (e.g., vertebrate mitochondrial code); for UA(G/A), the Tyr values (mitochondrial flatworm code). Then we analyzed all row and column sums. The row sums show a strong monotonicity just for the three dinucleoside monophosphate measures and for the hydropohobicity measure of Levitt (1976). However, amazingly, the column sums of nearly all measures are perfectly correlated with the corresponding codon–anticodon binding strength as defined by Lagerkvist (1978, 1981), in the following simply denoted codon strength. This is demonstrated in Table 1. For this table we averaged the column sums of the second and third columns, giving one “mixed codons” column. As can be seen in Table 1 there are just two exceptions. In the polarity measure of Zimmerman et al. (1968), the deviation is very weak, and in contradiction to all other measures, here the values for the amino acids vary by orders of magnitude. A problem only arises for the three hydrophobicity measures: The two monotonic measures “Levitt” and “BullBreese” are anticorrelated, and the “Jones” measure is not monotonic. The anticorrelation was found by Jungck (1978), but he did not comment on this.

Table 1 Correlation of codon strength and amino acid properties

The fact that the order of the second and third columns is not fixed is also underlined by individual consideration of the two mixed codon columns, instead of the averaging done in Table 1. In about half of the cases the order of the second and third columns should be exchanged to guarantee the strong monotonicity of the amino acid measures as a function of the column number.

The strong correlation between amino acid properties and codon strength implies that the first and second position together, and not one of them alone, must have been important for the amino acid–codon assignment in the evolution of the genetic code.

Evolution of the Genetic Code

What do the observed patterns tell us about the evolution of the genetic code? The so-called biosynthetic theory assumes that the genetic code evolved from a simpler form that encoded fewer amino acids (Crick 1968). A special version of this theory has been given by Wong (1975), who proposes that the genetic code coevolved with the invention of biosynthetic pathways for new amino acids. Although it has been shown that his analyses rest on wrong assumptions (Ronneberg et al. 2000), it is generally accepted that one can discriminate evolutionarily old and new amino acids (Alberts et al. 2002). Of course it could be that the binding allocation between nucleic acid molecules (RNAs or even PNAs [Knight and Landweber 2000b]) and amino acids did not start until all 20 amino acids were available, but it seems simpler to assume that as soon as there were amino acids and nucleic acids available, produced abiotically, both began to bind to each other. It now seems clear that “the code probably underwent a process of expansion from relatively few amino acids to the modern complement of 20” (Knight and Landweber 2000b).

Does our scheme yield some hints as to the evolution of the code? We already noted that the third nucleotide is nearly always (two exceptions in the standard code) analyzed just in a binary manner. Taking this for granted, we can reduce our original 8 × 8 scheme to an 8 × 4 scheme (shown in Fig. 2). Looking at this scheme, we observe high redundancy for each second row. Therefore, it is tempting to speculate that there was a period during code evolution when the third position was not needed at all. Assuming this, we can cancel each second row and are left with a pure doublet code that encodes 4 × 4 = 16 amino acids (or 15 plus a termination codon). Perhaps, then, a doublet code preceded the triplet code, as has already been speculated (Jukes 1973; Hayes 1998).

Conceivably, codon expansion from doublet to triplet could have arisen before this or, possibly, not until all 16 amino acids were encoded. If one assumes the latter, then it is interesting to postulate for each doublet the corresponding old amino acid. Met (Wong 1975), Trp, Gln, Asn (Knight and Landweber 2000b), and Tyr (Alberts et al. 2002) seem to be newer amino acids. As mentioned above, Szathmary (1999) proposed an evolutionary mechanism of tRNA formation. In principle, this mechanism could also work starting with doublets instead of triplets. It should be possible to gain experimental evidence for a doublet code by studying amino acid–nucleic acid doublet binding in the same way as has been done for triplets. Knight and Landweber (2000a) showed that Arg triplet codons alone significantly associate with arginine binding sites. Perhaps the doublets show a higher specificity.

However, by proposing a doublet code one faces the frameshifting problem. It seems unthinkable that a sudden transition from a two-letter to a three-letter frame ever occurred. Instead, one can imagine a gradual evolution with an ancient three-letter reading frame where just the first two letters have been analyzed by an ancient translation machinery. However, one then wonders about such inefficient use of coding space. Perhaps the ancient translation machinery, simply for stereochemical reasons, could not analyze a two-letter frame. In this context it is also interesting to note that even our contemporary code is somehow “inefficient”: Already a quaternary doublet code can encode 16 amino acids (or 15 plus a termination codon). For just four (or five) further amino acids a third letter is necessary. Of course, this inefficiency has the advantage of robustness enhancing redundancy.

Szathmary (1992, 2003) proposed a model which yields the result that two different base pairs represent an optimal compromise between the overall copying fidelity and the overall reproduction rate (metabolic efficiency). He assumed that the genetic code was developed before evolution invented proofreading. For higher copying fidelity (due to proofreading, etc.), the model predicts that three different base pairs are better than just two. It is tempting to speculate that in the earliest phases of biological evolution with the lowest copying fidelity, just one base pair could have worked as well. (The copying fidelity is always highest for just one base pair. Nevertheless, Szathmary’s simple model gives no one-base pair optimum, but a more detailed model for the metabolic efficiency could do so.) So, perhaps, nucleic acid–amino acid mapping started with a binary code. This is in accordance with earlier speculations that the first genetic material contained only a single base-pairing unit (Crick 1968; Orgel 1968). An important argument in this context is the chemical instability of cytosine, so that it may be difficult to establish a genetic system with G–C base pairing (Levy and Miller 1998). Wächtershäuser (1988) proposed an all-purine precursor of nucleic acids. However, for the sake of self-replication it is more obvious to assume a two-letter code that can give rise to complementary base pairing. Jimenez-Sanchez (1995) argued for an early (binary) A–U coding. Recently, a ribozyme composed of only two different nucleotides has been found by in vitro evolution that contained the pyrimidine uracil and the purine 2,6-diaminopurine (Reader and Joyce 2002). Note that uracil is the biosynthetic precursor of the pyrimidines cytosine and thymine (the corresponding precursor of the purines adenine and guanine is hypoxanthine).

Of course, a binary encoding also would be the most aesthetic version from a purely mathematical point of view. A binary triplet code would represent just one column in our scheme (Fig. 2). Given the high redundancy between the rows, it is unlikely that this ever happened. However, an even simpler coding, a binary doublet code, seems conceivable. It is tempting to speculate which four amino acids, one per two consecutive rows, were the first encoded ones. In the first two rows (two pyrimidines, i.e., 00) Ser seems to be the oldest amino acid, and in the third and fourth rows (10), Ala (Wong 1975). On the other hand, the 01 rows obviously contain no really old amino acid, while the 11 rows contain more than one: Gly, Asp, and Glu (Wong 1975).

One could speculate that the termination marker was important from the very beginning and resulted in coding by the 01 binary doublet. It has been noted that the five amino acids coded by G** (Ala, Val, Gly, Asp, Glu) are all at or near the head of the amino acid synthesis pathways (Taylor and Coates 1989) and also the most abundantly formed ones in abiotic synthesis experiments (Miller 1953, 1987). Furthermore, it has been shown recently by extensive statistical analyses that the frequencies of all five G** amino acids are significantly higher in evolutionary conserved residues, and it has been concluded that “these amino acids may have been the first introduced into the genetic code” (Brooks and Fresco 2002, 2003; Brooks et al. 2002). This is also consistent with physicochemical arguments proposing that the first sense codons had the form G** (Eigen and Schuster 1978). However, Gly is biochemically built from Ser, so Ser can be assumed to be prior. It could be that in the beginning of nucleic acid–amino acid assignment, Asp and Glu competed for the 11 doublet. Of course, code transfer from one amino acid to another one might also have occurred (Wong 1975).

Another scenario consistent with a binary doublet code has been given by Fitch’s “ambiguity reduction” hypothesis (Fitch and Upper 1987). It states that early in evolution there was an ambiguity in the charging of amino acids to anticodon acceptors: In the first step just *pyrimidine* codons (*0*), coding for hydrophobic amino acids, and *purine* codons (*1*), coding for hydrophilic amino acids, have been distinguished (binary singulet code). In the second step the more refined binary doublet code (00*, 01*, 10*, 11*) evolved.

The idea that the doublet code was just the second state in the evolution of the genetic code, and that this evolution started with just the midbase as coding, has been worked out by others, who termed it the “simplet” code (McClendon 1986; Schwemmler 1994). However, in this hypothesis both old amino acids Ser (UC*) and Ala (GC*), as well as Asp (GA*) and Glu (GA*), cannot be discriminated. We therefore suggest that the first two positions were equally important from the very beginning. Although our suggestion also does not allow discrimination between the related amino acids Asp and Glu, it nevertheless allows discrimination between the functionally divergent amino acids Ser and Ala. A further argument for the evolutionary importance of the first two nucleotides is the strong correlation observed between codon strength and the amino acid properties.

Conclusion

Taylor and Coates (1989) stated that “many parts of the patterns (of the genetic code) have been seen by others but ... it is the synthesis that adds up to the most interesting ... new insights.” In this spirit, we note that in the work presented here different patterns appear more clearly than in the common scheme of the genetic code. An example is Lagerkvist’s (1978) observation that all strong codons represent codon families, while weak codons do not. Mixed codons represent codon families in half of the cases. Our presentation of the code also highlights new patterns, which were not seen before. As summarized in Table 1, nearly all measures of the amino acid properties correlate strongly with the codon strengths. Furthermore, there is perfect codon–anticodon symmetry as well as point symmetry corresponding to the family–nonfamily symmetry operation (Halitsky 2003) in our scheme.

With regard to evolution, we hypothesize that codon assignments started from a binary doublet code (e.g., hypoxanthin and uracil) and developed later to a quaternary doublet code (A, G, C, U); thereafter, expansion to a triplet code took place. Although the third position is needed for correct amino acid recognition, until now it has nearly always been analyzed in a binary manner. The conclusion that code evolution must have started with doublets and not with a single letter is also underlined by the correlation observed here between the properties of amino acids and the strengths of their codons.