Introduction

The genetic code, meaning the assignment of codons to 22 coding signals (20 amino acids, starts and stops), is not fixed, and different taxa have different genetic codes (Elzanowski and Ostell 2019, https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). Some extensive recodings involve stop codons and occur in some individuals: mitochondria of an olive ridley (Lepidochelys olivacea, Seligmann 2012a, b) and of a whitefly (Aleurodicus dispersus, Seligmann 2018a), and HIV inactivation in HIV-resistant patients (Colson et al 2014). In genetic code evolution, the most frequent codon reassignments involve start and stop codons, showing that punctuation codes within genetic codes (El Houmami and Seligmann 2017) evolve the most (Seligmann 2015, 2018b). Stop codon assignment is optimized for regulating frameshifted translation (Seligmann and Pollock 2004; Itzkovitz and Alon 2007; Tse et al 2010; Křížek and Křížek 2012; Seligmann 2012a, b; Abrahams and Hurst 2018). This also involves increased densities of off-frame stop codons in protein-coding genes immediately downstream of shifty codons (homopolymer codons AAA, CCC, GGG, TTT) (Seligmann 2019).

Note that the genetic code seems designed to enable a relative conservation of amino acid properties after frameshifts (Wang et al 2015, 2018; Geyer and Mamlouk 2018; Bartonek et al. 2019). Beyond this, overlapping frames of the same gene potentially code for many random combinations of proteins belonging to very different protein families (Opuu et al 2017).

The Natural Circular Code

Start and stop codons signal translational initiation and termination; hence, they signal frontiers between genes, and between genes and noncoding sequences. In addition, off-frame stops regulate frameshifted ribosomal translation after ribosomal slippage (Seligmann 2007, 2010). Other punctuation signals exist: Within genes, the most frequently observed frontiers between exons and introns are two heptamers, 3′-GGTAAGT-5′ and 5′-TTCA(G)GA-3′ (present, respectively, in the D-loop and Tψ-loop of many tRNAs; Demongeot and Norris 2019). A punctuation system also regulates the ribosomal frame before frameshifts: The natural circular code (Arquès and Michel 1996) enables detecting the ribosomal translational frame (Ahmed et al 2010).

The subsets X = X0, X1 and X2, each comprising 20 trinucleotides, are overrepresented in, respectively, the frames 0 (reading frame), 1 (frame 0 shifted by one nucleotide) and 2 (frame 0 shifted by two nucleotides) in genes of both prokaryotes and eukaryotes. These groups of overrepresented codons are non-overlapping, meaning that no codon belonging to one of these groups is found in either of the two remaining groups. No homopolymer nucleotide triplet is included in any of these three groups of 20 nucleotide triplets. The set X0 contains the following 20 trinucleotides: X0 = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC}.

These X0, X1 and X2 present peculiar mathematical properties. X0, X1 (overrepresented in frame 1) and X2 (overrepresented in frame 2) are each circular codes, meaning that they enable detecting the frame of a sequence (Fimmel and Strüngmann 2016, 2018). There are only 64 nucleotide triplets, and homopolymer triplets cannot be included in circular codes. The 60 remaining triplets form 20 classes of three cyclic permutations each (for example, AAC, ACA and CAA), and a circular code includes at most one triplet from each class. Hence, no circular code based on triplet codons includes more than 20 triplets. Therefore, X0, X1 and X2 are maximal circular codes (Michel and Pirillo 2010; Gonzalez et al 2011, 2017; Michel et al 2016).
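The upper bound of 20 triplets can be checked by counting classes of cyclic permutations. The following is a minimal Python sketch, not part of the original analyses; the function name cyclic_class is introduced here for illustration only.

```python
from itertools import product

def cyclic_class(triplet):
    """Return the set of cyclic permutations of a triplet (hypothetical helper)."""
    return frozenset({triplet, triplet[1:] + triplet[0], triplet[2] + triplet[:2]})

triplets = ["".join(p) for p in product("ACGT", repeat=3)]
classes = {cyclic_class(t) for t in triplets}

# Homopolymers (AAA, CCC, GGG, TTT) are alone in their class.
homopolymer_classes = [c for c in classes if len(c) == 1]
# Every other class holds three distinct triplets; a circular code
# can take at most one triplet from each of these classes.
full_classes = [c for c in classes if len(c) == 3]
max_circular_code_size = len(full_classes)
print(len(triplets), len(homopolymer_classes), max_circular_code_size)
```

The count of full classes is the maximal size a trinucleotide circular code can reach; reaching it is necessary but not sufficient for a set of triplets to be a circular code.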

Each codon from X0 has its reverse-complement codon among the remaining 19 codons belonging to X0; hence, X0 is a maximal self-complementary circular code (Fimmel et al 2018). A relation exists between X0, X1 and X2. Nucleotide triplets in X1 result from the N1N2N3 → N3N1N2 permutation of X0 codons, and those in X2 result from the N1N2N3 → N2N3N1 permutation of X0 codons. Hence, the natural circular code empirically discovered in protein-coding genes is a maximal, self-complementary C3 circular code (Michel 2012). The C3 indicates that the N1N2N3 → N3N1N2 and the N1N2N3 → N2N3N1 permutations of X0 are also maximal circular codes.
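The relations between X0, X1 and X2 described above can be verified directly from the list of X0 codons. This is a minimal Python sketch using the X0 set given earlier; the helper name revcomp and the variable names are illustrative assumptions, not names from the cited works.

```python
X0 = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
         "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(codon):
    """Reverse complement of a DNA triplet (hypothetical helper)."""
    return codon.translate(COMPLEMENT)[::-1]

# N1N2N3 -> N3N1N2 gives X1; N1N2N3 -> N2N3N1 gives X2.
X1 = {c[2] + c[0] + c[1] for c in X0}
X2 = {c[1] + c[2] + c[0] for c in X0}

# X0 is self-complementary if it contains the reverse complement of each codon.
x0_self_complementary = all(revcomp(c) in X0 for c in X0)
# The three sets share no codon and exclude the four homopolymers.
pairwise_disjoint = not (X0 & X1 or X0 & X2 or X1 & X2)
union_size = len(X0 | X1 | X2)
print(x0_self_complementary, pairwise_disjoint, union_size)
```

This only checks set-level properties (self-complementarity, disjointness, coverage of the 60 non-homopolymer triplets); it does not by itself prove circularity of each set.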

Comma-free codes are particularly efficient circular codes, as a single codon belonging to a comma-free code detects the coding frame. Regular circular codes require longer sequences for frame detection (Fimmel et al 2017). Early on, it was believed that the genetic code was a comma-free code (Crick et al 1957), but it seems the genetic code optimizes between frame disambiguation and coding flexibility. Nevertheless, the natural circular code includes a subset of four codons (CAG, CTG, CTC and GAG) that form a comma-free code (Ahmed et al 2010).
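The comma-free property can be tested by concatenating every ordered pair of codons and checking that neither shifted triplet belongs to the code. The sketch below is a minimal Python illustration; is_comma_free is a hypothetical name introduced here.

```python
from itertools import product

X0 = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
         "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())

def is_comma_free(code):
    """True if no triplet read in a shifted frame of any codon pair is in the code."""
    code = set(code)
    for first, second in product(code, repeat=2):
        hexamer = first + second
        # hexamer[1:4] is the frame shifted by one nucleotide,
        # hexamer[2:5] the frame shifted by two nucleotides.
        if hexamer[1:4] in code or hexamer[2:5] in code:
            return False
    return True

subset = {"CAG", "CTG", "CTC", "GAG"}
print(is_comma_free(subset), is_comma_free(X0))
```

The four-codon subset passes the test, whereas the complete X0 does not (for example, GGT followed by ACC contains GTA in a shifted frame), which is why X0 as a whole needs longer windows for frame detection.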

Bijective transformations of X0 also produce maximal circular codes, some of which are self-complementary (Fimmel et al 2015; Michel and Seligmann 2014). The molecular mechanisms by which X0 regulates ribosomal frame detection remain unknown; however, these probably involve conserved nucleotide motifs belonging to X0 found in tRNAs (Michel 2012, 2013) and rRNAs (El Soufi and Michel 2014, 2015). The natural circular code does not seem vestigial and probably still functions in modern translation: X0 codons are more conserved than synonymous codons not belonging to X0 (Dila et al 2018). X0 also regulates frameshifting transcription (El Houmami and Seligmann 2017; Warthi and Seligmann 2019).

Theoretical Minimal RNA Rings

Though some variant natural circular codes occur (Michel 2015, 2017) or are suspected (Arquès and Michel 1997), X0 is a more conserved punctuation system than start and stop codon punctuations and seems near-universal. Some recent peculiar observations shed light on the evolution of the natural circular code (Demongeot and Seligmann 2019a).

A theoretical search for life’s primordial RNAs assumed that these would code over the shortest possible sequence once for each coding signal (a start, a stop and each of the 20 amino acids) and form a stem-loop hairpin preventing degradation (Demongeot 1978; Demongeot and Besson 1983). These constraints define 25 theoretical minimal RNA rings. These seem homologous to a consensual ancestral tRNA (Demongeot and Moreira 2007) as it was defined by Eigen and Winkler-Oswatitsch (1981). Concerning the barycenter (for Hamming and edit distances) of these 25 RNA rings, RNA ring homology with tRNAs results from heptamers GAAUGGU and UUCAAGA frequently found in the D-loop and Tψ-loop of many tRNAs and at frontiers between exon/intron and between intron/exon. The tRNA homology defines anticodons for each RNA ring, and a presumed order of evolution according to the cognate amino acid (Trifonov 2000) defined by that predicted anticodon. This evolution order based on cognate amino acid identity is congruent with independent RNA ring properties derived from their peptide coding properties (Demongeot and Seligmann 2019b, c), their occurrences in tRNA synthetase genes (Demongeot and Seligmann 2019d) and their predicted anticodons functioning as binding sites for replication initiation by polymerases (Demongeot and Seligmann 2019e).

These observations suggest that theoretical minimal RNA rings are realistic primordial proto-life RNA sequences. Indeed, they also include information relevant to the evolution of the natural circular code. Codons belonging to X0 are overrepresented in RNA rings. Their numbers increase from presumed ancient to recent RNA rings [ranked according to the presumed genetic code integration order of their cognate amino acid (Trifonov 2000)]. X1 codon numbers decrease from ancient to recent RNA rings. Hence, prebiotic/early life sequences presumably switched from X1 to X0 and between coding frames, potentially explaining X1 and X2 occurrences in + 1 and − 1 frames of modern genes (Demongeot and Seligmann 2019a). This could explain biases for some specific codon structures in RNA rings (Demongeot and Seligmann 2019c).

Working Hypothesis: Non-redundant Coding and the Natural Circular Code

The most stringent constraint involved in the design of RNA rings is that overlapping frames cannot code for the same amino acid. This property resembles the property of single-frame motifs in dicodons, which defines dicodons where none of the codons occurring in any of the three overlapping frames of the dicodon occurs in any other frame of that dicodon (Michel 2019). Here, non-redundant coding also excludes nonidentical synonymous codons. We explore whether non-redundancy of coded amino acids between overlapping frames of pentamer nucleotide sequences biases populations of short coding nucleotide sequences toward codons belonging to X0, X1 and/or X2: Did the natural circular code evolve through selection for non-redundant coding? This would be in line with observations that the 20 biogenic amino acids are more diverse than would be expected from a random sample of likely available amino acids (Philip and Freeland 2011; Ilardo et al 2015).

Materials and Methods

Sequences

We examine non-redundant coding across frames for all tetra-nucleotide (64 × 4 = 256) and penta-nucleotide (64 × 4 × 4 = 1024) sequences. Pentamers are the shortest sequences for which a complete codon exists for each frame. (Complete codons occur only in two frames of tetramers.) Hence, pentamers are the simplest system in which coding non-redundancy can be examined across all three potential coding frames. Tetra- and pentamer sequences were produced by the following method. We added nucleotide A to the 5′ extremity of each of the 64 codons, producing 64 tetramers. This was repeated adding C, G and T, respectively, producing all 256 tetramers. The same procedure was applied adding A, C, G or T to the 3′ extremity of the 64 codons. The 1024 pentamers were produced by adding A, C, G or T to the 5′ extremity of the latter 256 tetramers. Frames − 1, 0, and + 1 of these sequences were translated according to each of the known genetic codes (Elzanowski and Ostell 2019, https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). (Each tetramer has only two frames: frame 0 and either frame − 1 or frame + 1, depending on the extremity at which the nucleotide was added.)
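The construction of tetramers and pentamers and their translation in overlapping frames can be sketched as follows. This is a minimal Python illustration restricted to the standard genetic code (NCBI translation table 1); the variable and function names are assumptions introduced here, and the other genetic codes would require their own 64-letter amino acid strings.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code in NCBI order (codons TTT, TTC, TTA, ...).
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: AAS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
codons = list(STANDARD_CODE)

# Tetramers: one nucleotide added at the 5' or at the 3' extremity of each codon.
tetramers_5prime = [n + codon for n in "ACGT" for codon in codons]
tetramers_3prime = [codon + n for codon in codons for n in "ACGT"]
# Pentamers: one nucleotide added at the 5' extremity of the 3'-extended tetramers.
pentamers = [n + t for n in "ACGT" for t in tetramers_3prime]

def translate_frames(seq, code=STANDARD_CODE):
    """Translate every complete codon of seq, one per overlapping frame."""
    return [code[seq[i:i + 3]] for i in range(len(seq) - 2)]

# Frames -1, 0 and +1 of a pentamer; the frame-0 codon is nucleotides 2 to 4.
print(translate_frames("AATGC"))
```

In this sketch the second element of the returned list corresponds to frame 0 of a pentamer (nucleotides 2, 3 and 4), and the first and third elements to frames − 1 and + 1.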

Non-redundant Coding

Sequences were classified into two groups, those lacking coding redundancy among frames versus those where the same amino acid is coded by more than one frame. These groups differ according to genetic codes used for translation. Hence, sequence translations and classifications were done separately for each alternative genetic code.
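The classification step can be sketched as below for the standard genetic code. This minimal Python example assumes that stop codons count as one coding signal (written "*"), so that two frames both reading a stop are treated as redundant; this treatment of stops is an assumption of the sketch, and the names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_CODE = {
    a + b + c: AAS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def is_non_redundant(seq, code=STANDARD_CODE):
    """True if no coding signal is shared by two overlapping frames of seq."""
    signals = [code[seq[i:i + 3]] for i in range(len(seq) - 2)]
    # Synonymous but nonidentical codons give the same signal, hence redundancy.
    return len(set(signals)) == len(signals)

pentamers = ["".join(p) for p in product("ACGT", repeat=5)]
non_redundant = [s for s in pentamers if is_non_redundant(s)]
redundant = [s for s in pentamers if not is_non_redundant(s)]
print(len(non_redundant), len(redundant))
```

Passing a different codon-to-signal dictionary as the code argument repeats the classification for another genetic code. For instance, CTCTT is classified as redundant because CTC and CTT both code for leucine.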

Biases

Abundances of all 64 codons were counted considering the nucleotides 2, 3 and 4 of pentamers, among non-redundant and redundant sequences, for all alternative genetic codes. Differences between the abundances of X0 codons and those of the remaining 44 codons were estimated by chi-square tests. This was also done for X1 and X2 codons. For example, pentamers where nucleotides 2, 3 and 4 are X2 codons are non-redundant and coding-redundant in 287 and 33 cases, respectively (89.7% non-redundant). Pentamers where nucleotides 2, 3 and 4 do not belong to X2 are non-redundant and coding-redundant in 590 and 114 cases, respectively (83.8% non-redundant); a chi-square test estimates the statistical difference between these distributions. This procedure was done for the abundances of X0, X1 and X2 codons in non-redundant versus redundant sequences compared in each case to abundances of the 44 remaining codons, for each genetic code.
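The chi-square comparison for the example counts quoted above can be sketched as follows. This minimal Python example assumes a Pearson chi-square test on a 2 × 2 table without continuity correction (the text does not specify which variant was used); the function name is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for the table [[a, b], [c, d]]; returns the statistic and its P value."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper tail of the chi-square distribution
    # equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: circular-code codons vs the 44 remaining codons at positions 2-4.
# Columns: non-redundant vs redundant pentamers.
chi2, p_value = chi_square_2x2(287, 33, 590, 114)
print(round(chi2, 2), p_value)
```

The row totals are 320 (20 codons × 16 flanking nucleotide pairs) and 704 (44 codons × 16), which provides a quick consistency check on any such table.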

Abundances of redundant and non-redundant pentamers for each circular code and the remaining 44 codons can be obtained from data in Table 2. This requires subtracting the numbers of non-redundant and redundant pentamers, respectively, found for a given circular code, from the total numbers of non-redundant and redundant pentamers, respectively.

Results

Table 1 presents the numbers of non-redundant tetra- and pentamers for each of the 64 codons, considering all possible tetramers (4 × 64 = 256, separately for − 1 and + 1 frameshifted tetramers) and pentamers (4 × 4 × 64 = 1024, all three frames considered, and excluding incomplete codons). Results clearly show a bias against homopolymer codons AAA, CCC, GGG and TTT. Hence, under selection for non-redundancy of coding among frames, primitive prebiotic sequences would have had a strong bias against homopolymer codons. Results for the + 1 frame of tetramers indicate positive biases for codons belonging to X0 and X1, in line with previous results from theoretical minimal RNA rings (Demongeot and Seligmann 2019e).

Table 1 Numbers of non-redundant tetra- and pentamers for each standard genetic code codon, and biases for X0, X1 and X2 natural maximal circular codes (Table 1a)

Results for pentamers indicate a stronger bias for X0 codons than observed in tetramers. From here on, analyses focus on pentamers as these have three overlapping frames. (Tetramers have no more than two overlapping coding frames.)

Table 2 presents the results from analyses similar to those presented in Table 1, for all known genetic codes. Overall, statistically significant bias for X0 occurs in all genetic codes. P values are lower (meaning stronger bias) for X0 than X1 and X2 in all but three and five cases, respectively (P = 1.8 × 10−5 and P = 0.0033, respectively, two-tailed sign tests). All exceptions to the main tendency are for mitochondrial genetic codes.

Table 2 Number of non-redundant and redundant pentamers with codons belonging to X0, X1 and X2 at positions 2, 3 and 4 of the pentamer in all known alternative genetic codes (Elzanowski and Ostell, https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi, accessed July 2019, where numbers in the first column follow code numbers at this NCBI site)

We used the Benjamini–Hochberg adjustment to account for multiple tests (Benjamini and Hochberg 1995). After adjustment, biases for X0 and X1 remain significant in 20 among 24 genetic codes, and in 21 among 24 cases for X2. Hence, accounting for multiple tests does not qualitatively alter results: Selection for non-redundancy between overlapping frames of pentamers biases toward codons belonging to X0.
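The Benjamini–Hochberg adjustment can be sketched as follows. This is a minimal Python implementation of the standard step-up procedure, given for illustration; the function name and the example P values are hypothetical and are not data from Table 2.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value, scaling each by m / rank
    # and enforcing monotonicity with a running minimum.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical P values, for illustration only.
example = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(example)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

A test is retained as significant at a false discovery rate of 0.05 when its adjusted P value is below 0.05.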

Reading Frame Retrieval of Circular Code Codons

The codons belonging to X0 vary in terms of their contribution to detect the coding frame. For sequences entirely constructed by codons from X0, some codons, such as CTG, can only belong to the coding frame, which is the “construction” frame. Hence, CTG has a reading frame retrieval (RFR) capacity of “100” (Ahmed et al 2010). The RFR values for X0 are indicated in Table 1. If coding non-redundancy in pentamers determined which codons would belong to X0, one expects correlations between RFR and the number of non-redundant pentamers to which that X0 codon belongs. Non-redundancy of X0 codons correlates negatively with RFR in each of the genetic codes examined. This correlation is statistically significant at P < 0.05 for three genetic codes, which are all the known genetic codes specific to yeast: the yeast mitochondrial genetic code, r = − 0.481, P = 0.032; the alternative yeast nuclear genetic code; and the Pachysolen tannophilus nuclear genetic code, r = − 0.527, P = 0.017 (two-tailed tests).
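The significance of such correlations can be checked from r alone, because each correlation is computed across the 20 codons of X0 (18 degrees of freedom). The following minimal Python sketch computes a Pearson correlation and converts r to a t statistic; the function names are illustrative, and the comparison with the critical value 2.101 (two-tailed, P = 0.05, 18 degrees of freedom) is a standard table lookup rather than part of the original analyses.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic with n - 2 degrees of freedom for a correlation r over n points."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Correlation reported for the yeast mitochondrial genetic code, n = 20 X0 codons.
t_yeast_mito = t_from_r(-0.481, 20)
# |t| above 2.101 corresponds to P < 0.05 (two-tailed, 18 degrees of freedom).
print(round(t_yeast_mito, 3), abs(t_yeast_mito) > 2.101)
```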

The biological meaning of the negative correlation between non-redundancy and RFR is not obvious. It probably suggests an optimization between two partially opposite functions: One is coding frame detection estimated by RFR, and the other is overlap coding in different frames for close, but nonidentical alternative proteins, promoting redundancy between frames. These results suggest that overlap coding should be particularly strong in yeasts with unusual genetic codes. The coevolution of this property between nuclear and mitochondrial alternative yeast genetic codes is particularly remarkable. Overall, the result indicates that pentamer coding non-redundancy (and redundancy) among frames affected the detailed design of the natural circular code.

Genetic Code History and Non-redundancy in Codon Families

Several hypotheses suggest that there exist evolutionary orders of integration of the 20 biogenic amino acids in the genetic code (Trifonov 2000). The mean ranks of amino acids in modern proteins recapitulate these hypothetical evolutionary orders (Seligmann 2018c), recalling speculations about ontogeny recapitulating phylogeny. We added to the genetic code evolutionary hypotheses reviewed by Trifonov (2000) the order deduced from the self-referential hypothesis (Guimarães et al 2008; Guimarães 2017), the order deduced from preference of L- over D-amino acids for interacting with D-RNA (Michel and Seligmann 2014; deduced from data from Han et al 2010), the Rogers hypothesis (Rogers 2019) and the subtraction of preferences for interactions between codons and cognate amino acids in ribosomal structures from preferences for interactions between anticodons and cognate amino acids in ribosomes (Johnson and Wang 2010). Overall, ranks converge among many hypotheses, suggesting that structurally simple and chemically inactive amino acids were integrated into the genetic code first, and that complex, chemically reactive amino acids were integrated last.

The mean non-redundancy of synonymous codons assigned to amino acids overall increases with these ranks of integration for most hypotheses (37 among 44, P = 0.0000027, two-tailed sign test). Two specific hypotheses produce correlations with non-redundancy at P < 0.05 (two-tailed test): the hypothesis developed on the transition from self-replicating oligoribotides to peptide-assisted RNA replication (Ferreira and Coutinho 1993, r = 0.81, P = 0.000007, two-tailed test) and the self-referential hypothesis (r = 0.708, P = 0.00033, two-tailed test). The latter hypothesis is notable because at this point it is the most complete hypothesis, meaning that it integrates elements from several components of the translation system: amino acid properties and interactions with RNA, tRNAs and tRNA synthetases. Overall, results suggest that early codon-amino acid assignments in the genetic code favored redundancy, indicating tolerance to low-accuracy processes, whereas codons assigned to late amino acids show high non-redundancy, suggesting error avoidance for these more complex and overall less mutable residues. It is also noteworthy that mean non-redundancy of pentamers is inversely correlated with Trifonov’s redundancy of tetracodons (r = − 0.803, one-tailed P = 0.000006) (Fig. 1).

Fig. 1

Representation of the genetic code (as drawn by E.N. Trifonov in https://is.muni.cz/el/1431/jaro2011/Bi8960/), with indication (numbers at the left-hand side of codons) of the tetramer redundancy index for each codon of each amino acid synonymy class: This index is obtained by counting the number of times for which the class of synonymy of a codon XYZ is conserved by considering a new codon obtained either by adding a nucleotide on the left- or on the right-hand side of the codon XYZ

Discussion and Conclusion

Homopolymer codons have very high coding redundancies. Hence, selection for non-redundant codons would have biased against these codons before the evolution of complex biomolecular machineries where homopolymers cause polymerase and ribosomal frameshifts in modern protein-coding genes.

The analysis of redundant and non-redundant amino acid coding in pentamers shows biases for non-redundancy in codons belonging to the reading frame circular code X0. Hence, prebiotic selection for maximal coding diversity would have contributed to the origin of the near-universal circular code X0. Biases regarding X1 and X2, the circular codes detecting the two noncoding frames of protein-coding genes, are much weaker and usually not statistically significant. This could suggest transitions from X1 and/or X2 to X0, which could also explain associations between genetic code and codon structures (Seligmann and Warthi 2017).

Non-redundant coding constrained not only the general pool of codons, but also the detail of the codons belonging to X0. RFR, the capacity of specific codons belonging to X0 for reading frame retrieval, correlates negatively with that codon’s non-redundancy in pentamers, in all genetic codes and, in particular, in alternative nuclear and mitochondrial genetic codes specific to yeasts. This suggests an optimization between reading frame retrieval and the capacity for overlap coding for relatively conserved protein variants, especially in yeasts.

Moreover, biases occur for the X1 circular code in some mitochondrial genetic codes. This is in line with previous observations that presumed early theoretical minimal RNA rings, 22-nucleotide-long RNAs designed for non-redundant overlap coding, are biased toward X1, and biases toward X0 increase for presumed more recent RNA rings. We suggest that non-self-complementary circular codes such as X1 are favored in single-stranded RNA and that the self-complementary X0 evolved upon transition to double-stranded RNA (or DNA). Mitochondrial genomes are among the rare exceptions to Chargaff’s second parity rule (same-strand A/T and C/G ratios close to 1) (Nikolaou and Almirantis 2006; Fimmel et al 2019), probably because of their mainly strand-asymmetric, unidirectional replication mode (Xia 2012), hence reproducing or conserving prebiotic conditions that presumably favored non-self-complementary circular codes. Results indicate that circular codes evolved from selection for non-redundant overlap coding in short nucleotide sequences.