Abstract
The origin of the genetic code has been attributed in part to an accidental assignment of codons to amino acids. Although several lines of evidence indicate the subsequent expansion and improvement of the genetic code, the hypothesis of Francis Crick concerning a frozen accident occurring at the early stage of genetic code evolution is still widely accepted. Considering Crick’s hypothesis, mathematical descriptions of hypothetical scenarios involving a huge number of possible coexisting random genetic codes could be very important to explain the origin and evolution of a selected genetic code. This work aims to contribute in this regard by providing a theoretical framework in which statistical parameters of error functions are calculated. Given a genetic code and an amino acid property, the functional code robustness is estimated by means of a known error function. In this work, using analytical calculations, general expressions for the average and standard deviation of the error function distributions of completely random codes with standard stop codons were obtained. As a possible biological application of these results, any set of amino acids and any pure or mixed amino acid properties can be used in the calculations, so that, when a set of amino acids must be selected to create a genetic code, the possible advantages for natural selection among genetic codes can be discussed.
1 Introduction
All current natural genetic codes may have evolved from a single ancestral code. According to the Crick hypothesis in 1968, this ancestral code would have consisted of fixed random codon assignments for each encoded amino acid and the stop signal. This approach is known as "the frozen accident" (Crick 1968). An explanation for the original fixed codon assignments could be the deleterious effects of genetic changes. These effects would be increasingly catastrophic as the number of genes in organisms increased. However, in early evolution extensive horizontal gene transfer might have been useful because only one code survived, a requirement for the transition to the cellular level of complexity (Vetsigian et al. 2006). Given the above, the origin of LUCA (Last Universal Cellular Ancestor) (Weiss et al. 2018), the first common ancestor of all current organisms, but not the first cell, would have been a bottleneck resulting from this horizontal gene transfer, which would have resulted in the selection of a universal code (Koonin 2003, 2017; Vetsigian et al. 2006).
Thus, at an early stage a completely random universal code could be possible. However, this code evolved in such a way that there were fewer reading and writing failures, diminishing the structural and functional consequences for the encoded proteins (i.e., the error function as a cost function) (Freeland and Hurst 1998). Moreover, considering that code errors could sometimes be important for developing new cellular adaptive properties, perhaps genetic code evolution, rather than optimizing stability alone, tended to optimize the balance between stability and adaptability. Consistent with an evolutionary increase in stability, it has been found that the errors associated with the standard genetic code are considerably smaller than those of most random codes, although its achieved stability is not the best possible (Błażej et al. 2018, 2016; Buhrman et al. 2011; Freeland and Hurst 1998; Goldman 1993; Haig and Hurst 1991, 1999; Novozhilov et al. 2007; Salinas et al. 2016; Santos and Monteagudo 2010; Wnętrzak et al. 2018, 2019). The remarkable stability of the standard genetic code, besides being a driving force through the selection pressure, may have been a consequence of the expansion of the genetic code by means of a similar mechanism of codon assignments to physicochemically similar amino acids (Crick 1968; Koonin 2017). Thus, the hypothetical accidental nature of a selected ancestral genetic code is in agreement with subsequent genetic code extension and optimization mechanisms (Koonin and Novozhilov 2017). In this work, the average and standard deviation of error functions of random genetic codes with fixed standard stop codons were analytically obtained assuming that a primitive and completely random version of an ancestral genetic code may have been selected from a large set of random codes. The error functions used differ depending on a parameter indicating which codon base (i.e., first, second or third) can be wrong.
As a possible application of these results in future research, the deduced expressions of statistical parameters could be useful to select different sets of natural amino acids and many kinds of amino acid properties, either pure or mathematically combined. This approach, regarding the different statistical behaviors of the error function in a random Crick scenario, could allow a better understanding of the code stability as a selective pressure on the origin and evolution of the genetic code.
For calculations, the following mathematical formalism is introduced:
In the standard genetic code, of the 64 possible codons, 3 are stop codons and 61 are amino acid encoding codons, which encode the 20 standard amino acids. Hence, a genetic code can be described by a function defined on 61 different triplets ijk (with bases i, j, k ∈ B = {A, C, G, U}), termed codons, each one encoding one amino acid. Ep is the set of pairs of triplets \(\left(ijk, {i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}\right)\) that differ only in position p, with p = 1, 2 or 3, so that only the codon pairs \(\left(ijk, {i}^{{\prime}}jk\right)\), \(\left(ijk, {ij}^{{\prime}}k\right)\) and \(\left(ijk, ij{k}^{{\prime}}\right)\) (with \(i\ne {i}^{{\prime}}, j\ne {j}^{{\prime}}\) and \(k\ne {k}^{{\prime}}\), respectively) are considered. For p = 0, E0 denotes the union \({E}_{1}{\cup E}_{2}\cup {E}_{3}\) (Buhrman et al. 2011).
Let rijk be the numeric value of property \({a}_{u}\) of the standard amino acid \(u\) encoded by codon ijk (that is, in functional notation, \(u=u\left(ijk\right)\)).
The rijk values of the six amino acid properties used in this work are shown in Table 1. Four of these properties are real properties taken from Haig and Hurst (1991); the other two are not real but arbitrary values, included only to increase the number of cases for testing the theoretical results of this study.
To understand the robustness of the genetic code, the consequences of single-point changes in codons (either mutation or translation errors) have been studied. Hence, genetic code robustness can be inversely estimated by measuring a global error, basically a cost function associated with decoding mistakes. Such an error function (MS) is defined as follows:
where |Ep| is the cardinality of set Ep.
We find that
and verify that
Considering the 61 amino acid encoding codons and the 3 standard stop codons (UAG, UAA, and UGA), and since codon pairs that differ in more than one position are not considered, we have |E0| = 263, |E1| = 87, |E2| = 88, and |E3| = 88 (Buhrman et al. 2011).
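These cardinalities can be checked by direct enumeration. The following is a minimal Python sketch, assuming that each Ep contains unordered pairs of sense codons (each pair counted once) and that pairs involving a stop codon are excluded; the names `SENSE_CODONS` and `single_position_pairs` are illustrative and do not come from the original work.

```python
from itertools import combinations, product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# The 61 sense codons: all 64 triplets except the three standard stop codons.
SENSE_CODONS = ["".join(t) for t in product(BASES, repeat=3)
                if "".join(t) not in STOP_CODONS]


def single_position_pairs(p):
    """Return E_p: unordered pairs of sense codons differing only at position p (1, 2 or 3)."""
    idx = p - 1
    return [
        (c1, c2)
        for c1, c2 in combinations(SENSE_CODONS, 2)
        if c1[idx] != c2[idx]
        and all(c1[q] == c2[q] for q in range(3) if q != idx)
    ]


E = {p: single_position_pairs(p) for p in (1, 2, 3)}
# E_0 is the union of E_1, E_2 and E_3; the three sets are disjoint by construction.
E[0] = E[1] + E[2] + E[3]

for p in (0, 1, 2, 3):
    print(f"|E{p}| = {len(E[p])}")
```

Under these assumptions the enumeration yields 263, 87, 88 and 88 pairs for p = 0, 1, 2 and 3, in agreement with the values quoted above.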
2 Theoretical Framework and Results
Only completely random models of the genetic code with the fixed three standard stop codons (also named the unrestricted structure model (Wnętrzak et al. 2018)) were considered. Genetic codes were built fixing the three standard stop codons and using the other 61 codons to encode the 20 standard amino acids. Although the number of possible codes is finite (Novozhilov et al. 2007; Schönauer and Clote 1997), the number of randomly selected codes from the huge number of possible codes can be infinite. Hereinafter we will refer to an “infinite number of sampling cycles of random genetic codes” as “infinite random codes”. Thus, let \({\langle \rangle }_{\infty }\) be the average of infinite random codes. Then we denote the average of MSp (Eq. 2) over infinite random genetic codes by \({\langle {MS}_{p}\rangle }_{\infty }\):
Let \({\sigma }_{p}\) be the standard deviation of MSp over infinite random genetic codes. Then \({\sigma }_{p}^{2}\) is the variance given by
For the computer calculations, 100,000 randomly sampled genetic codes were obtained using randomly generated non-overlapping blocks of codons with random assignment of the amino acids. In addition, six kinds of amino acid properties (Table 1) were used and their statistical properties were analyzed with respect to changes in a single position of the codon (p = 1, 2, or 3). The parameters \({\langle {MS}_{p}\rangle }_{\infty }\) and \({\sigma }_{p}\) were numerically calculated using the Monte Carlo method. Subsequently, these parameters were analytically calculated using general expressions, which were obtained for any amino acid property and any set of encoded amino acids. Comparisons between the values obtained using numerical and analytical methods are shown in Tables 2 and 3. The analytical results are very similar to the numerical results of the Monte Carlo calculation. Because the analytical results rest entirely on mathematical arguments, a numerical proof of them is not necessary. However, Tables 2 and 3 are useful to show how the derived statistical expressions can be applied and numerically checked.
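A Monte Carlo estimate of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the error function is taken to be the mean squared difference of the property over the pairs in Ep; random codes are drawn by rejection sampling so that every amino acid is encoded at least once, which may differ in detail from the block-based sampling described above; and `PROPERTY` holds arbitrary placeholder values, not those of Table 1. All function names are hypothetical.

```python
import random
import statistics
from itertools import combinations, product

BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}
SENSE_CODONS = ["".join(t) for t in product(BASES, repeat=3)
                if "".join(t) not in STOP_CODONS]

# Placeholder property values for 20 amino acids (arbitrary, NOT taken from Table 1).
PROPERTY = [float(u) for u in range(1, 21)]


def pairs_at_position(p):
    """Unordered pairs of sense codons that differ only at position p (1, 2 or 3)."""
    idx = p - 1
    return [(c1, c2) for c1, c2 in combinations(SENSE_CODONS, 2)
            if c1[idx] != c2[idx]
            and all(c1[q] == c2[q] for q in range(3) if q != idx)]


def random_code(rng, n_amino_acids=20):
    """Draw a codon -> amino acid index map, uniform over maps that use every amino acid.

    Assumption: independent uniform assignments are redrawn until all amino
    acids appear at least once (rejection sampling).
    """
    while True:
        code = {c: rng.randrange(n_amino_acids) for c in SENSE_CODONS}
        if len(set(code.values())) == n_amino_acids:
            return code


def mean_square_error(code, pairs, prop=PROPERTY):
    """Assumed form of MS_p: mean squared property difference over the codon pairs."""
    return sum((prop[code[c1]] - prop[code[c2]]) ** 2 for c1, c2 in pairs) / len(pairs)


def monte_carlo(p, n_codes, seed=0):
    """Sample mean and (population) standard deviation of MS_p over n_codes random codes."""
    rng = random.Random(seed)
    pairs = pairs_at_position(p)
    values = [mean_square_error(random_code(rng), pairs) for _ in range(n_codes)]
    return statistics.mean(values), statistics.pstdev(values)
```

For example, `monte_carlo(p=3, n_codes=1000)` returns estimates for third-position errors; increasing `n_codes` toward the 100,000 codes used here reduces the sampling noise at the cost of run time.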
3 Analytical Calculation of \({\langle {MS}_{p}\rangle }_{\infty }\) (p = 1, 2, or 3) for Infinite Completely Random Genetic Codes with the Standard Stop Codons
From Eq. 5, since summations of \({r}_{ijk}^{2}\) and \({r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}^{2}\) are equal, by exchanging summation and average operators (\(\langle \sum \rangle =\sum \langle \rangle\)), we obtain
where \({\langle {r}_{ijk}^{2}\rangle }_{\infty }\) and \({\langle {r}_{ijk}{r}_{{i}^{{\prime}}{j}^{{\prime}}{k}^{{\prime}}}\rangle }_{\infty }\) are constants because they are obtained from infinite selected random codes from all possible codes. That is, the averages \({\langle \rangle }_{\infty }\) do not depend on the subscripts for \(r\) and therefore they can be written outside of the summations, as a factor.
Note that
Using Eqs. 1, 3 and 8 in Eq. 7 results in
Because each \(u\)-th amino acid has the same statistical weight in calculations of averages over infinite random genetic codes, in Eq. 9 we replace \({\langle \rangle }_{\infty }\) with \({\langle \rangle }_{Aa}\), that is the average over the 20 standard amino acids (i.e. \({\langle {(...)}_{u}\rangle }_{Aa}\equiv \sum_{u=1}^{20}{(...)}_{u}/20\)). Thus, we obtain
where p = 1, 2, or 3. The application results are shown in Table 2, and another demonstration of Eq. 10 is shown in Appendix A.
4 Analytical Calculation of \({\sigma }_{p}\) (p = 1, 2, or 3) for Infinite Completely Random Genetic Codes with Standard Stop Codons
In Appendix B the following expression for the standard deviation is obtained
where p = 1, 2, or 3. The application results are shown in Table 3.
5 Discussion
In the calculation of the averages over random genetic codes, each code has the same probability of being obtained by the Monte Carlo method. Therefore, the averages for an infinite number of sampling cycles of random genetic codes are equal to the corresponding averages for the finite set of all possible genetic codes. Using standard stop codons in the completely random model of the genetic code, we analytically obtained the average (Eq. 10) and standard deviation (Eq. 11) of the error functions (Eq. 2) of infinite random codes. The formulae in Eqs. 10 and 11 are exact and applicable to any kind of amino acid property, even to new properties resulting from combinations of already known ones (e.g., a linear combination of several amino acid properties). Similarly, the set of encoded amino acids can also be redefined in the formulae. In computational experiments using 100,000 random genetic codes, the 20 standard amino acids and 6 kinds of amino acid properties (4 real properties and 2 invented for test purposes only), both statistical parameters (i.e., \({\langle {MS}_{p}\rangle }_{\infty }\) and \({\upsigma }_{p}\), with p = 1, 2, or 3) were obtained with values very similar to those predicted by the analytical calculations (see Tables 2 and 3).
It is interesting that the average of the error function of the code is proportional to the mean squared change (Eq. 10) of the encoded amino acid property, whereas the variance is proportional to the mean quartic change (Eq. 49). Such a simple result could avoid a large number of computational calculations and establishes a theoretical framework that could be applied to a random scenario prior to a universal code. For example, it seems plausible that genetic codes with small error function values are more competitive (i.e., genetic codes having a greater tolerance to errors of use) as \({\upsigma }_{p}\) decreases and \({\langle {MS}_{p}\rangle }_{\infty }\) increases. That could be achieved by a suitable selection of sets of amino acids and their properties (pure or mathematically combined) to give the appropriate parameters to the error function in primitive systems containing amino acids, such as in some meteorites or in primary organic soups (Burton et al. 2012; Cleaves 2010; Zaia et al. 2008). In this regard, the following question seems interesting: how optimal are the current standard amino acids and their selected properties in terms of the competitiveness of genetic codes within a system with more options of amino acids to be encoded? Therefore, the statistical parameters found here to describe the error in random genetic codes could be applied to the selection of sets of amino acids or to find more appropriate amino acid property functions, so that a few codes could be much more efficient (greater tolerance to error) than the rest, something very appropriate for a natural selection of a genetic code.
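The equality between the sampling average and the average over all possible codes can be illustrated on a toy system small enough to enumerate. The sketch below is an assumption-laden miniature, not the 61-codon model: it uses 4 hypothetical "codons", 3 hypothetical "amino acids" with arbitrary property values, all codon pairs as the error set, and only codes that encode every amino acid. Exact rational arithmetic gives the average over all codes, and uniform sampling approaches it.

```python
import random
from fractions import Fraction
from itertools import combinations, product

N_CODONS = 4                                      # toy "codons" (hypothetical)
PROP = [Fraction(1), Fraction(4), Fraction(6)]    # toy property values (arbitrary)
PAIRS = list(combinations(range(N_CODONS), 2))    # toy stand-in for E_p


def mean_square_error(code):
    """Mean squared property difference over all toy codon pairs."""
    return sum((PROP[code[x]] - PROP[code[y]]) ** 2 for x, y in PAIRS) / len(PAIRS)


# Every assignment of amino acids to codons in which all three amino acids appear.
ALL_CODES = [c for c in product(range(len(PROP)), repeat=N_CODONS)
             if set(c) == set(range(len(PROP)))]

# Exact average of the error function over the finite set of all toy codes.
exact_average = sum(mean_square_error(c) for c in ALL_CODES) / len(ALL_CODES)


def sampled_average(n_samples, seed=0):
    """Average of the error function over codes drawn uniformly with replacement."""
    rng = random.Random(seed)
    total = sum(mean_square_error(rng.choice(ALL_CODES)) for _ in range(n_samples))
    return float(total / n_samples)


print(len(ALL_CODES), exact_average, sampled_average(5000))
```

In this toy case there are 36 admissible codes and the exact average is 95/9; the sampled average fluctuates around that value and tightens as the number of samples grows.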
Despite the optimization patterns of the standard genetic code, Francis Crick’s frozen accident theory still survives when combined with theories of genetic code expansion (Koonin 2017), although it has been said that the emphasis is on the frozen part (Kun and Radvanyi 2018). However, it seems important to consider random events in the earliest stages of the genetic code. Assuming a hypothetical early random scenario for the origin of the genetic code, in this approach the distribution of the error function for the completely random model was mathematically described under very general conditions, which may facilitate subsequent applications.
Availability of data and material
Not applicable.
References
Błażej P, Wnętrzak M, Mackiewicz D, Mackiewicz P (2018) Optimization of the standard genetic code according to three codon positions using an evolutionary algorithm. PLoS ONE 13:e0201715. https://doi.org/10.1371/journal.pone.0201715
Błażej P, Wnętrzak M, Mackiewicz P (2016) The role of crossover operator in evolutionary-based approach to the problem of genetic code optimization. Biosystems 150:61–72. https://doi.org/10.1016/j.biosystems.2016.08.008
Buhrman H, van der Gulik PT, Kelk SM, Koolen WM, Stougie L (2011) Some mathematical refinements concerning error minimization in the genetic code. IEEE/ACM Trans Comput Biol Bioinform 8:1358–1372. https://doi.org/10.1109/tcbb.2011.40
Burton AS, Stern JC, Elsila JE, Glavin DP, Dworkin JP (2012) Understanding prebiotic chemistry through the analysis of extraterrestrial amino acids and nucleobases in meteorites. Chem Soc Rev 41:5459–5472. https://doi.org/10.1039/c2cs35109a
Cleaves HJ 2nd (2010) The origin of the biologically coded amino acids. J Theor Biol 263:490–498. https://doi.org/10.1016/j.jtbi.2009.12.014
Crick FH (1968) The origin of the genetic code. J Mol Biol 38:367–379. https://doi.org/10.1016/0022-2836(68)90392-6
Freeland SJ, Hurst LD (1998) The genetic code is one in a million. J Mol Evol 47:238–248. https://doi.org/10.1007/pl00006381
Goldman N (1993) Further results on error minimization in the genetic code. J Mol Evol 37:662–664. https://doi.org/10.1007/bf00182752
Haig D, Hurst LD (1991) A quantitative measure of error minimization in the genetic code. J Mol Evol 33:412–417. https://doi.org/10.1007/bf02103132
Haig D, Hurst LD (1999) A quantitative measure of error minimization in the genetic code. J Mol Evol 49:708. https://doi.org/10.1007/pl00006591
Koonin EV (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nat Rev Microbiol 1:127–136. https://doi.org/10.1038/nrmicro751
Koonin EV (2017) Frozen accident pushing 50: stereochemistry, expansion, and chance in the evolution of the genetic code. Life (Basel). https://doi.org/10.3390/life7020022
Koonin EV, Novozhilov AS (2017) Origin and evolution of the universal genetic code. Annu Rev Genet 51:45–62. https://doi.org/10.1146/annurev-genet-120116-024713
Kun A, Radvanyi A (2018) The evolution of the genetic code: impasses and challenges. Biosystems 164:217–225. https://doi.org/10.1016/j.biosystems.2017.10.006
Novozhilov AS, Wolf YI, Koonin EV (2007) Evolution of the genetic code: partial optimization of a random code for robustness to translation error in a rugged fitness landscape. Biol Direct 2:24. https://doi.org/10.1186/1745-6150-2-24
Salinas DG, Gallardo MO, Osorio MI (2016) Local conditions for global stability in the space of codons of the genetic code. Biosystems 150:73–77. https://doi.org/10.1016/j.biosystems.2016.08.007
Santos J, Monteagudo Á (2010) Study of the genetic code adaptability by means of a genetic algorithm. J Theor Biol 264:854–865. https://doi.org/10.1016/j.jtbi.2010.02.041
Schönauer S, Clote P (1997) How optimal is the genetic code? In: Frishman D, Mewes HW, eds, Computer science and biology proceedings of the German Conference on Bioinformatics (GCB'97). Sep 21–24, 1997. p. 65–67. http://clavius.bc.edu/~clote/pub/geneticCode.pdf.
Vetsigian K, Woese C, Goldenfeld N (2006) Collective evolution and the genetic code. Proc Natl Acad Sci USA 103:10696–10701. https://doi.org/10.1073/pnas.0603780103
Weiss MC, Preiner M, Xavier JC, Zimorski V, Martin WF (2018) The last universal common ancestor between ancient Earth chemistry and the onset of genetics. PLoS Genet 14:e1007518. https://doi.org/10.1371/journal.pgen.1007518
Wnętrzak M, Błażej P, Mackiewicz D, Mackiewicz P (2018) The optimality of the standard genetic code assessed by an eight-objective evolutionary algorithm. BMC Evol Biol 18:192. https://doi.org/10.1186/s12862-018-1304-0
Wnętrzak M, Błażej P, Mackiewicz P (2019) Optimization of the standard genetic code in terms of two mutation types: point mutations and frameshifts. Biosystems 181:44–50. https://doi.org/10.1016/j.biosystems.2019.04.012
Zaia DA, Zaia CT, De Santana H (2008) Which amino acids should be used in prebiotic chemistry studies? Orig Life Evol Biosph 38:469–488. https://doi.org/10.1007/s11084-008-9150-5
Funding
Not applicable.
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Ethical approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Appendices
Appendix A
Alternative Calculation of \({\langle {MS}_{p}\rangle }_{\infty }\) for Infinite Completely Random Genetic Codes with Standard Stop Codons
Substituting p = 3 into Eq. 5, we obtain
Using the Kronecker delta function \({\delta }_{xy}\) (i.e., for positive integers \(x\) and \(y\), \({\delta }_{xy}\) = 1 if \(x=y\) and \({\delta }_{xy}\) = 0 otherwise), Eq. 12 becomes
whereby
Note that
and
given that \(x \ne z\). Moreover, since Eq. 12 does not depend on the values of \({r}_{UAG}\), \({r}_{UAA}\) and \({r}_{UGA}\), we conveniently choose the values of those parameters as
Applying Eqs. 15–18 in Eq. 14 results in
which is
equivalently
and this can be written as
Considering Eqs. 18 and 22, the summations in Eq. 22 contain, from left to right, 61, 2, 3 and 176 amino acid terms, respectively. Moreover, the averages of those terms satisfy \({\langle {r}_{ijk}^{2}\rangle }_{\infty }= {\langle {r}_{UAk}^{2}\rangle }_{\infty }={\langle {r}_{UGk}^{2}\rangle }_{\infty }\). Then, we obtain
so that
From |E3| = 88 and Eq. 24 we have
As in the step from Eq. 9 to Eq. 10, in Eq. 25 we replace \({r}_{ijk}\) and \({\langle \rangle }_{\infty }\) by \({a}_{u}\) and \({\langle \rangle }_{Aa}\), respectively. Thus, Eq. 25 becomes
Similar demonstrations for p = 1 and 2 can be developed. Thus we obtain
where p = 1, 2, 3, in agreement with Eq. 10.
Appendix B
Analytical Calculation of \({\sigma }_{p}\) (p = 1, 2, 3) for Infinite Completely Random Genetic Codes with Standard Stop Codons
Alternatively, Eq. 6 can be written as
Substituting p = 3 into Eq. 28, we obtain
To calculate the term \({\langle {MS}_{3}^{2} \rangle }_{\infty }\), from p = 3 and Eq. 2, we have
equivalently
Calculating from Eq. 31
which is
On the other hand
And then, reordering
which is, considering Eqs. 3 and 36,
Substituting Eq. 36 into Eq. 33, we obtain
and using constants \(\alpha , \beta ,\) and \(\gamma\) Eq. 37 can be written as
As in the step from Eq. 9 to Eq. 10, in Eq. 38 we replace \({r}_{ijk}\) and \({\langle \rangle }_{\infty }\) by \({a}_{u}\) and \({\langle \rangle }_{Aa}\), respectively. Thus, Eq. 38 becomes
Substituting Eqs. 10 and 39 into Eq. 29, we obtain
Consider the change of variable given by
and
(defining a new amino acid property).
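The reason such a change of variable is harmless can be checked numerically. The sketch below assumes that the new property is the centred one, \({b}_{u}={a}_{u}-{\langle {a}_{u}\rangle }_{Aa}\) (so that its average over the amino acids is zero), and that the error function is a mean of squared differences; the property values, the 12-position toy code and the function names are all hypothetical. Because only differences of property values enter the error function, shifting every value by the same constant leaves it unchanged.

```python
import random
from fractions import Fraction
from itertools import combinations


def mean_square_error(values, pairs):
    """Mean squared difference of the encoded values over the given position pairs."""
    return sum((values[x] - values[y]) ** 2 for x, y in pairs) / len(pairs)


def centre(prop):
    """Shift property values so that their average over the amino acids is zero."""
    mean = sum(prop) / len(prop)
    return [v - mean for v in prop]


# Arbitrary placeholder property a_u for 20 amino acids (not from Table 1).
a = [Fraction(u * u, 7) for u in range(1, 21)]
b = centre(a)  # assumed change of variable: b_u = a_u - <a_u>_Aa

# Toy code: 12 positions, each assigned a random amino acid; all pairs stand in for E_p.
rng = random.Random(0)
assignment = [rng.randrange(20) for _ in range(12)]
pairs = list(combinations(range(12), 2))

ms_a = mean_square_error([a[u] for u in assignment], pairs)
ms_b = mean_square_error([b[u] for u in assignment], pairs)

# Both checks are exact because rational arithmetic is used throughout.
print(ms_a == ms_b, sum(b) == 0)
```

Since the error function takes the same value for every code under both properties, its whole distribution over random codes, and hence its standard deviation, is the same for the original and the centred property.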
Let \({MS^{\prime}_{3}}\) be an error function calculated for the encoded \({b}_{u}\) values of the new amino acid property
Let \({\sigma }_{3}^{ \prime}\) be the standard deviation of \({MS^{\prime}_{3}}\) over infinite random genetic codes
(Similar to Eq. 6 with p = 3)
Then, considering the \({b}_{u}\) values of the new amino acid property, we obtain
(Similar to Eq. 40, but using \({{\sigma }_{3}}^{\prime 2}\) and \({b}_{u}\) instead of \({\sigma }_{3}^{2}\) and \({a}_{u}\), respectively)
Substituting \({\langle {b}_{u}\rangle }_{Aa}=0\) (from Eq. 42) into Eq. 46, we obtain
Finally, from \({\upsigma }_{3}={\upsigma }_{3}^{ \prime}\) (because \({r}_{ijk}-{r}_{ij{{k}^{\prime}}}={{r^\prime}_{ijk}}-{{r}^{\prime}}_{ij{k}^{\prime}}\)), Eqs. 42 and 47, we have
Similar demonstrations for p = 1 and 2 can be developed. Thus, we obtain
where p = 1, 2, 3. Hence, the standard deviation is
Cite this article
Salinas, D.G. Average and Standard Deviation of the Error Function for Random Genetic Codes with Standard Stop Codons. Acta Biotheor 70, 7 (2022). https://doi.org/10.1007/s10441-021-09427-x
DOI: https://doi.org/10.1007/s10441-021-09427-x