1 Introduction

All current natural genetic codes may have evolved from a single ancestral code. According to the hypothesis proposed by Crick in 1968, this ancestral code would have consisted of fixed random codon assignments for each encoded amino acid and the stop signal. This hypothesis is known as "the frozen accident" (Crick 1968). One explanation for the fixation of the original codon assignments is the deleterious effect of genetic changes, which would have become increasingly catastrophic as the number of genes in organisms increased. However, in early evolution, extensive horizontal gene transfer might have been useful and would have favored the survival of a single code, a requirement for the transition to the cellular level of complexity (Vetsigian et al. 2006). Given the above, the origin of LUCA (Last Universal Cellular Ancestor) (Weiss et al. 2018), the most recent common ancestor of all current organisms, but not the first cell, would have been a bottleneck resulting from this horizontal gene transfer, which would have resulted in the selection of a universal code (Koonin 2003, 2017; Vetsigian et al. 2006).

Thus, at an early stage, a completely random universal code could have been possible. However, this code evolved so as to reduce reading and writing failures, diminishing their structural and functional consequences for the encoded proteins (i.e., the error function acted as a cost function) (Freeland and Hurst 1998). Moreover, considering that code errors could sometimes be important for developing new cellular adaptive properties, genetic code evolution may have tended to optimize the balance between stability and adaptability rather than stability alone. Consistent with an evolutionary increase in stability, the errors associated with the standard genetic code have been found to be considerably smaller than those of most random codes, although its stability is not the best possible (Błażej et al. 2018, 2016; Buhrman et al. 2011; Freeland and Hurst 1998; Goldman 1993; Haig and Hurst 1991, 1999; Novozhilov et al. 2007; Salinas et al. 2016; Santos and Monteagudo 2010; Wnętrzak et al. 2018, 2019). The remarkable stability of the standard genetic code, besides acting as a driving force through selection pressure, may have been a consequence of the expansion of the genetic code by means of a similar mechanism of codon assignment to physicochemically similar amino acids (Crick 1968; Koonin 2017). Thus, the hypothetical accidental nature of a selected ancestral genetic code is in agreement with subsequent genetic code extension and optimization mechanisms (Koonin and Novozhilov 2017). In this work, the average and standard deviation of the error functions of random genetic codes with fixed standard stop codons were obtained analytically, assuming that a primitive and completely random version of an ancestral genetic code may have been selected from a large set of random codes. The error functions used differ according to a parameter indicating which codon base (i.e., first, second or third) can be wrong.
As a possible application of these results in future research, the deduced expressions of statistical parameters could be useful to select different sets of natural amino acids and many kinds of amino acid properties, either pure or mathematically combined. This approach, regarding the different statistical behaviors of the error function in a random Crick scenario, could allow a better understanding of the code stability as a selective pressure on the origin and evolution of the genetic code.

For calculations, the following mathematical formalism is introduced:

In the standard genetic code, of the 64 possible codons, 3 are stop codons and 61 are amino acid encoding codons, which encode the 20 standard amino acids. Hence, a genetic code can be described by a function defined on 61 different triplets ijk (with bases i, j, k ∈ B = {A, C, G, U}), termed codons, each one encoding one amino acid. \({E}_{p}\) is the set of pairs of triplets \(\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\) that differ only in position p, with p = 1, 2 or 3, such that only codon pairs \(\left(ijk, {i}^{{\prime}}jk\right)\), \(\left(ijk, i{j}^{{\prime}}k\right)\) and \(\left(ijk, ij{k}^{{\prime}}\right)\) (with \(i\ne {i}^{{\prime}}\), \(j\ne {j}^{{\prime}}\) and \(k\ne {k}^{{\prime}}\), respectively) are considered. For p = 0, \({E}_{0}\) denotes the union \({E}_{1}\cup {E}_{2}\cup {E}_{3}\) (Buhrman et al. 2011).

Let \({r}_{ijk}\) be the numeric value of property \({a}_{u}\) of the standard amino acid \(u\) encoded by codon ijk (that is, in functional notation, \(u=u\left(ijk\right)\)):

$${r}_{ijk}\equiv {a}_{u}={a}_{u(ijk)}$$
(1)

The \({r}_{ijk}\) values of the six amino acid properties used in this work are shown in Table 1. Four of these are real properties taken from Haig and Hurst (1991); the other two are not real, but arbitrary values introduced to increase the number of cases for testing the theoretical results of this study.

Table 1 Values of amino acid properties used in this study

To understand the robustness of the genetic code, the consequences of single-point changes in codons (either mutation or translation errors) have been studied. Hence, genetic code robustness can be inversely estimated by measuring a global error, basically a cost function associated with decoding mistakes. Such an error function (MS) is defined as follows:

$${MS}_{p}\equiv \frac{1}{\left|{E}_{p}\right|}\sum_{\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}{\left({r}_{ijk}- {r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\right)}^{2}$$
(2)

where \(\left|{E}_{p}\right|\) is the cardinality of set \({E}_{p}\).
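
As a minimal sketch of how Eq. 2 can be evaluated, the function below takes a mapping from codons to property values and an explicit list of codon pairs. The function name `ms`, the toy codons and the toy property values are illustrative assumptions; they are not taken from Table 1.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def ms(r: Dict[str, float], pairs: Iterable[Pair]) -> float:
    """Mean squared difference of property values over codon pairs (Eq. 2)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("the pair set must contain at least one codon pair")
    total = sum((r[c1] - r[c2]) ** 2 for c1, c2 in pairs)
    return total / len(pairs)


# Toy illustration: three codons that differ only in the third base.
# The property values are made up for this example (hypothetical).
toy_r = {"GCA": 1.0, "GCC": 1.0, "GCG": 4.0}
toy_pairs = [("GCA", "GCC"), ("GCA", "GCG"), ("GCC", "GCG")]

toy_ms = ms(toy_r, toy_pairs)
print(toy_ms)  # (0 + 9 + 9) / 3 = 6.0
```

Passing the pair set as an argument keeps the same function usable for p = 1, 2, 3 or for the union used when p = 0.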

We find that

$$\left|{E}_{p}\right|=\sum_{\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}1$$
(3)

and verify that

$$\left|{E}_{0}\right|=\left|{E}_{1}\right|+\left|{E}_{2}\right|+\left|{E}_{3}\right|$$
(4)

Considering the 61 amino acid encoding codons, with the 3 standard stop codons (UAG, UAA, and UGA) excluded, and since codon pairs that differ in more than one position are not considered, we have \(\left|{E}_{0}\right|=263\), \(\left|{E}_{1}\right|=87\), \(\left|{E}_{2}\right|=88\), and \(\left|{E}_{3}\right|=88\) (Buhrman et al. 2011).
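
These cardinalities can be checked by direct enumeration. The sketch below assumes that codon pairs are unordered and that any pair involving a stop codon is excluded, which is the reading that reproduces the counts above. The names `SENSE` and `build_E` are hypothetical.

```python
from itertools import combinations, product

BASES = "ACGU"
STOPS = {"UAA", "UAG", "UGA"}

# The 61 sense codons: all 64 triplets minus the three standard stop codons.
SENSE = [c for c in ("".join(t) for t in product(BASES, repeat=3))
         if c not in STOPS]


def build_E(p: int):
    """Unordered pairs of sense codons that differ only at position p (1, 2 or 3)."""
    i = p - 1
    # Two distinct codons that agree outside position i must differ at i.
    return [(a, b) for a, b in combinations(SENSE, 2)
            if a[:i] + a[i + 1:] == b[:i] + b[i + 1:]]


E = {p: build_E(p) for p in (1, 2, 3)}
E[0] = E[1] + E[2] + E[3]  # the three sets are disjoint, so this is their union

sizes = {p: len(E[p]) for p in (0, 1, 2, 3)}
print(sizes)  # {0: 263, 1: 87, 2: 88, 3: 88}
```

Without stop codons each position would give 16 groups of 4 codons and 96 pairs. Removing the pairs that touch UAA, UAG or UGA leaves 87 pairs for the first position and 88 for each of the other two.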

2 Theoretical Framework and Results

Only completely random models of the genetic code with the three standard stop codons fixed (also named the unrestricted structure model (Wnętrzak et al. 2018)) were considered. Genetic codes were built by fixing the three standard stop codons and using the other 61 codons to encode the 20 standard amino acids. Although the number of possible codes is finite (Novozhilov et al. 2007; Schönauer and Clote 1997), codes can be sampled at random from this huge set an unlimited number of times. Hereinafter we will refer to an “infinite number of sampling cycles of random genetic codes” as “infinite random codes”. Thus, let \({\langle \rangle }_{\infty }\) denote the average over infinite random codes. Then we denote the average of \({MS}_{p}\) (Eq. 2) over infinite random genetic codes by \({\langle {MS}_{p}\rangle }_{\infty }\):

$${\langle {MS}_{p}\rangle }_{\infty }={\langle \frac{1}{\left|{E}_{p}\right|}\sum_{\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}{\left({r}_{ijk}- {r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\right)}^{2}\rangle }_{\infty }$$
(5)

Let \({\sigma }_{p}\) be the standard deviation of \({MS}_{p}\) over infinite random genetic codes. Then \({\sigma }_{p}^{2}\) is the variance given by

$${\sigma }_{p}^{2}= {\langle {\left({MS}_{p}-{\langle {MS}_{p}\rangle }_{\infty }\right)}^{2} \rangle }_{\infty }$$
(6)

For the computer calculations, 100,000 randomly sampled genetic codes were obtained using randomly generated non-overlapping blocks of codons with random assignment of the amino acids. In addition, six kinds of amino acid properties (Table 1) were used, and their statistical properties were analyzed with respect to changes in a single position of the codon (p = 1, 2, or 3). The parameters \({\langle {MS}_{p}\rangle }_{\infty }\) and \({\sigma }_{p}\) were first calculated numerically using the Monte Carlo method. Subsequently, these parameters were calculated analytically using general expressions, which were obtained for any amino acid property and any set of encoded amino acids. Comparisons between the values obtained using the numerical and analytical methods are shown in Tables 2 and 3. The analytical results are very similar to the numerical results of the Monte Carlo calculation. In fact, because the analytical results are based entirely on mathematical arguments, a numerical proof of these results is not necessary. However, Tables 2 and 3 are useful to show how the derived statistical expressions can be applied and compared with numerical values.
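
The numerical procedure can be sketched as follows. This is an illustration under stated assumptions, not the sampling code used for Tables 2 and 3: each of the 61 sense codons is assigned an amino acid independently and uniformly at random, which may differ from the block-based sampling described above. The property values are placeholders rather than the values of Table 1, and all function names are hypothetical. A smaller number of codes is used to keep the run short.

```python
import random
from itertools import combinations, product

BASES = "ACGU"
STOPS = {"UAA", "UAG", "UGA"}
SENSE = [c for c in ("".join(t) for t in product(BASES, repeat=3))
         if c not in STOPS]


def pairs_at(p):
    """Unordered sense-codon pairs that differ only at position p (1, 2 or 3)."""
    i = p - 1
    return [(a, b) for a, b in combinations(SENSE, 2)
            if a[:i] + a[i + 1:] == b[:i] + b[i + 1:]]


def sample_ms(values, pairs, rng):
    """Draw one random code and return its error function (Eq. 2)."""
    # Assumption: every sense codon gets an independent, uniform amino acid.
    r = {codon: rng.choice(values) for codon in SENSE}
    return sum((r[a] - r[b]) ** 2 for a, b in pairs) / len(pairs)


def monte_carlo(values, p, n_codes, seed=0):
    """Sample mean and standard deviation of the error function over random codes."""
    rng = random.Random(seed)
    pairs = pairs_at(p)
    draws = [sample_ms(values, pairs, rng) for _ in range(n_codes)]
    mean = sum(draws) / n_codes
    var = sum((x - mean) ** 2 for x in draws) / n_codes
    return mean, var ** 0.5


# Placeholder property: one arbitrary value per amino acid (hypothetical).
values = [float(u) for u in range(1, 21)]

mc_mean, mc_sd = monte_carlo(values, p=1, n_codes=2000)

# Twice the variance of the property over the 20 amino acids, for comparison.
mean_a = sum(values) / len(values)
twice_var = 2 * (sum(v * v for v in values) / len(values) - mean_a ** 2)
print(mc_mean, mc_sd, twice_var)
```

Under the independence assumption, the sample mean should approach twice the variance of the property as the number of sampled codes grows. The sample standard deviation is reported as well, but its value depends on the sampling scheme chosen.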

Table 2 Analytical and numerical calculations of the average of \({MS}_{p}\) over infinite completely random genetic codes with standard stop codons (\({\langle {MS}_{p}\rangle }_{\infty }\))

Table 3 Analytical and numerical calculations of the standard deviation of \({MS}_{p}\) over infinite completely random genetic codes with standard stop codons (\({\sigma }_{p}\))

3 Analytical Calculation of \({\langle {MS}_{p}\rangle }_{\infty }\) (p = 1, 2, or 3) for Infinite Completely Random Genetic Codes with the Standard Stop Codons

From Eq. 5, since summations of \({r}_{ijk}^{2}\) and \({r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}^{2}\) are equal, by exchanging summation and average operators (\(\langle \sum \rangle =\sum \langle \rangle\)), we obtain

$${\langle {MS}_{p}\rangle }_{\infty }=\frac{2}{\left|{E}_{p}\right|}\left(\sum_{\left(ijk,{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}{\langle {r}_{ijk}^{2}\rangle }_{\infty }-\sum_{\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}{\langle {r}_{ijk}{r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\rangle }_{\infty }\right)=\frac{2}{\left|{E}_{p}\right|}\left({\langle {r}_{ijk}^{2}\rangle }_{\infty }\sum_{\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}1-{\langle {r}_{ijk}{r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\rangle }_{\infty }\sum_{\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\in {E}_{p}}1\right)$$
(7)

where \({\langle {r}_{ijk}^{2}\rangle }_{\infty }\) and \({\langle {r}_{ijk}{r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\rangle }_{\infty }\) are constants because they are averages over infinite random codes sampled from all possible codes. That is, the averages \({\langle \rangle }_{\infty }\) do not depend on the subscripts of \(r\) and can therefore be factored out of the summations.

Note that

$${\langle {r}_{ijk}{r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\rangle }_{\infty }={\langle {r}_{ijk}\rangle }_{\infty }^{2}$$
(8)

Using Eqs. 1, 3 and 8 in Eq. 7 results in

$${\langle {MS}_{p}\rangle }_{\infty }=2\left({\langle {a}_{u}^{2}\rangle }_{\infty }-{\langle {a}_{u}\rangle }_{\infty }^{2}\right)$$
(9)
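
In more detail, the intermediate step can be sketched as follows: by Eq. 3 each sum of ones in Eq. 7 equals \(\left|{E}_{p}\right|\), and by Eq. 8 the cross term equals \({\langle {r}_{ijk}\rangle }_{\infty }^{2}\), so that

$${\langle {MS}_{p}\rangle }_{\infty }=\frac{2}{\left|{E}_{p}\right|}\left({\langle {r}_{ijk}^{2}\rangle }_{\infty }\left|{E}_{p}\right|-{\langle {r}_{ijk}\rangle }_{\infty }^{2}\left|{E}_{p}\right|\right)=2\left({\langle {r}_{ijk}^{2}\rangle }_{\infty }-{\langle {r}_{ijk}\rangle }_{\infty }^{2}\right)$$

which becomes Eq. 9 once \({r}_{ijk}\) is written as \({a}_{u}\) (Eq. 1). The factor \(\left|{E}_{p}\right|\) cancels, which is why the result does not depend on p.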

Because every amino acid \(u\) has the same statistical weight in averages over infinite random genetic codes, in Eq. 9 we replace \({\langle \rangle }_{\infty }\) with \({\langle \rangle }_{Aa}\), that is, the average over the 20 standard amino acids (i.e., \({\langle {(...)}_{u}\rangle }_{Aa}\equiv \sum_{u=1}^{20}{(...)}_{u}/20\)). Thus, we obtain

$${\langle {MS}_{p}\rangle }_{\infty }=2\left({\langle {a}_{u}^{2}\rangle }_{Aa}-{\langle {a}_{u}\rangle }_{Aa}^{2}\right)$$
(10)

where p = 1, 2, or 3. The results of applying Eq. 10 are shown in Table 2, and an alternative derivation of Eq. 10 is given in Appendix A.

4 Analytical Calculation of \({\sigma }_{p}\) (p = 1, 2, or 3) for Infinite Completely Random Genetic Codes with Standard Stop Codons

In Appendix B the following expression for the standard deviation is obtained

$${\sigma }_{p}=2{\left[\frac{1}{\left|{E}_{p}\right|}{\langle {\left({a}_{u}-{\langle {a}_{u}\rangle }_{Aa}\right)}^{4}\rangle }_{Aa}\right]}^\frac{1}{2}$$
(11)

where p = 1, 2, or 3. The results of applying Eq. 11 are shown in Table 3.
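
As an illustration of how Eqs. 10 and 11 can be applied, the sketch below evaluates both closed-form expressions for an arbitrary list of property values. The function names `mean_ms` and `sigma_ms` and the placeholder values are assumptions made for this example; the values are not those of Table 1.

```python
import math
from typing import Sequence

# Cardinalities of the codon-pair sets for p = 1, 2, 3, as stated in the text.
E_SIZE = {1: 87, 2: 88, 3: 88}


def mean_ms(a: Sequence[float]) -> float:
    """Eq. 10: twice the variance of the property over the encoded amino acids."""
    n = len(a)
    mean = sum(a) / n
    return 2.0 * (sum(x * x for x in a) / n - mean ** 2)


def sigma_ms(a: Sequence[float], size_E: int) -> float:
    """Eq. 11: 2 * sqrt(fourth central moment of the property / |E_p|)."""
    n = len(a)
    mean = sum(a) / n
    m4 = sum((x - mean) ** 4 for x in a) / n
    return 2.0 * math.sqrt(m4 / size_E)


# Placeholder property: one arbitrary value per amino acid (hypothetical).
values = [float(u) for u in range(1, 21)]

print("average:", mean_ms(values))  # the same for every p
for p, size in E_SIZE.items():
    print("p =", p, "sigma:", sigma_ms(values, size))
```

In these expressions the average does not depend on p, whereas the standard deviation depends on p only through \(\left|{E}_{p}\right|\), so it is slightly larger for p = 1 (87 pairs) than for p = 2 or 3 (88 pairs).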

5 Discussion

In the calculation of the averages over random genetic codes, each code has the same probability of being obtained by the Monte Carlo method. Therefore, the averages for an infinite number of sampling cycles of random genetic codes are equal to the corresponding averages over the finite set of all possible genetic codes. Using the standard stop codons in the completely random model of the genetic code, we analytically obtained the average (Eq. 10) and standard deviation (Eq. 11) of the error function (Eq. 2) over infinite random codes. The formulae in Eqs. 10 and 11 are exact and applicable to any kind of amino acid property, even to new properties resulting from combinations of known ones (e.g., a linear combination of several amino acid properties). Similarly, the set of encoded amino acids can also be redefined in the formulae. In computational experiments using 100,000 random genetic codes, the 20 standard amino acids and 6 kinds of amino acid properties (4 real properties and 2 invented for test purposes only), both statistical parameters (i.e., \({\langle {MS}_{p}\rangle }_{\infty }\) and \({\sigma }_{p}\), with p = 1, 2, or 3) were obtained with values very similar to those predicted by the analytical calculations (see Tables 2 and 3).

It is interesting that the average of the error function of the code is proportional to the mean squared change (Eq. 10) of the encoded amino acid property, whereas the variance is proportional to the mean quartic change (Eq. 49). Such a simple result could avoid a large number of computational calculations and establishes a theoretical framework that could be applied to a random scenario prior to a universal code. For example, it seems plausible that genetic codes with small error function values are more competitive (i.e., genetic codes having a greater tolerance to errors of use) as \({\sigma }_{p}\) decreases and \({\langle {MS}_{p}\rangle }_{\infty }\) increases. That could be achieved by a suitable selection of sets of amino acids and their properties (pure or mathematically combined) to give the appropriate parameters to the error function in primitive systems containing amino acids, such as in some meteorites or in primary organic soups (Burton et al. 2012; Cleaves 2010; Zaia et al. 2008). In this regard, the following question seems interesting: how optimal are the current standard amino acids and their selected properties in terms of the competitiveness of genetic codes within a system with more options of amino acids to be encoded? Therefore, the statistical parameters found here to describe the error in random genetic codes could be applied to the selection of sets of amino acids or to the search for more appropriate amino acid property functions, so that a few codes could be much more efficient (greater tolerance to error) than the rest, something very appropriate for the natural selection of a genetic code.

Despite the optimization patterns of the standard genetic code, Francis Crick’s frozen accident theory still survives when combined with theories of genetic code expansion (Koonin 2017), although it has been said that the emphasis is on the frozen part (Kun and Radvanyi 2018). However, it seems important to consider random events in the earliest stages of the genetic code. Assuming a hypothetical early random scenario for the origin of the genetic code, in this approach the distribution of the error function for the completely random model was mathematically described under very general conditions, which may facilitate subsequent applications.