Introduction

The standard genetic code (SGC) is one of the universal features of life (Alberts et al. 2008), with only minor variations across the three domains (Knight et al. 2001). The assignment of amino acids to different codons is not random. Even as the genetic code was being deciphered in the 1960s (Nirenberg et al. 1963; Woese 1967), it was recognized that codons differing by a single base are either assigned the same amino acid or amino acids that are biochemically similar in the SGC, as compared to random genetic codes (Woese 1965; Epstein 1966; Goldberg and Wittes 1966; Alff-Steinberger 1969). Recent studies utilizing computer simulations have lent quantitative support to this notion (Haig and Hurst 1991; Freeland and Hurst 1998; Butler et al. 2009). The organization of codon–amino acid assignments in the SGC may have evolved to minimize, on average, the phenotypic effect of genetic mutations, transcription errors, and mistranslations (Cullmann and Labouygues 1983). Some studies have suggested that the “error minimization” property of the SGC may simply be a by-product of code expansion mechanisms (Massey 2008, 2015). Others have suggested that the genetic code co-evolved with pathways for amino acid synthesis, with amino acids having closer precursor–product relationships in biosynthetic pathways being encoded by similar codons (Wong 1975; Taylor and Coates 1989; Freeland et al. 2000; Giulio 2016; Wong et al. 2016). Another view is the stereochemical theory that proposes the codon–amino acid assignments in the SGC are an outcome of the physiochemical affinity between amino acids and the cognate codons or the cognate anti-codons (Woese 1968; Yarus et al. 2009; Johnson and Wang 2010; Polyansky et al. 2013).

There are two prevalent theories of the mechanism via which a universal genetic code may have evolved (Koonin and Novozhilov 2017). The first theory suggests that the universality of the genetic code is an outcome of the fact that all present-day life forms evolved from a universal common ancestor. After the emergence of translation, the genetic code fixed in the population of a single niche and froze. It was only after the code froze that life forms diversified from the single niche initially occupied (Crick 1968; Wong 1976; Harris et al. 2003). This frozen code was then inherited unchanged during the subsequent spread and diversification of life forms. Another possibility in this vertical descent model of evolution of a universal genetic code is that the genetic code froze after life had diversified. As life forms spread, organisms with distinctive characteristics emerged, some even lacking translation. Those organisms with translation machineries would have evolved different genetic codes. However, over an extended period of time, all codes except the SGC were lost, either due to neutral drift or due to selection for optimality properties of the SGC (Novozhilov et al. 2007).

In contrast to the vertical descent model described above is the view that a universal and optimal genetic code emerged from communal evolution amidst extensive horizontal gene transfer during the early stages of life (Vetsigian et al. 2006; Goldenfeld and Woese 2007). With organisms utilizing different genetic codes in multiple communities competing for a single niche, the community wherein members utilize the same genetic code or compatible genetic codes is more likely to succeed. This is because a community-wide code or a set of compatible codes will allow for efficient sharing of newly evolved beneficial proteins among different individuals in the community via horizontal gene transfer, thereby allowing the organisms in the community access to a larger innovation pool. Further, robustness to mistranslations is likely to be selected for in such a community-wide code due to the likely inefficient translation machineries in these ancient life forms. Simulations have suggested that communal evolution amidst extensive horizontal gene transfer allows for the evolution of a code that is more optimized for minimizing the phenotypic effect of translation errors. In the absence of horizontal gene transfer, conversely, evolving genetic codes tend to get stuck in local minima, ending up less optimal for minimizing the phenotypic effect of translation errors (Vetsigian et al. 2006).

In both mechanisms of emergence of a universal code described above, selection acts on the phenotypic features of the organism, and not directly on the system of codon–amino acid assignments. Mutations in the genomic DNA sequence, or the RNA sequence in the case of some viruses, are inherited by the progeny during replication. Since proteins are the molecules that carry out a majority of the cellular functions, it has been argued that they largely determine the phenotype of an organism (Griffiths et al. 2000). Given that the genetic code governs the translation of a transcript of the genetic material into a protein, it plays a fundamental role in steering molecular evolution (Gonnet et al. 1992). Here, we probe this dependence of molecular evolution on the genetic code.

Life forms have evolved over time amidst changing environmental conditions. Different environmental conditions require different phenotypic responses for an organism to survive. Considering the large, yet finite, space of possible amino acid sequences, a species that can access a greater portion of the sequence space is more likely to encounter a protein capable of forging a fitting response to a new environmental condition. However, only those proteins encountered during the exploration of the sequence space that are functional can contribute toward the survival of the organism. Thus, paths through the sequence space should be highly biased toward those containing functional protein variants. Given the evolutionary advantage of exploring a larger fraction of the space of functional protein sequences, and the role of the genetic code in guiding molecular evolution, the SGC may have evolved under a selection pressure to maximize the fraction of the functional sequence space explored. We test this hypothesis using the landscape of functional variants of the PhoQ protein.

Previous studies have suggested that the organization of the SGC constrains the exploration of the space of possible amino acid sequences (Maynard Smith 1970). This is because a single nucleotide change allows access to only 6 of the 19 possible amino acid substitutions on average, and silent mutations are abundant within the SGC. Maeshiro and Kimura suggested that the SGC allows for a balance of robustness and changeability via a balance between the probabilities of synonymous and nonsynonymous single-nucleotide substitutions (Maeshiro and Kimura 1998). Judson and Haydon found that codes computationally evolved under selection for characteristics such as higher amino acid connectedness and shorter path length between different amino acids were closer to the SGC (Judson and Haydon 1999). It has since been proposed that positive selection for increased diversity of proteins and improved protein functionality was a driving force during the evolution of the SGC (Higgs 2009; Francis 2013). Zhu and Freeland, using a population genetic code model and defining fitness as the linear summed difference from an optimal sequence, found that the SGC has properties that enhance the efficacy of adaptive sequence evolution (Zhu and Freeland 2006). Firnberg and Ostermeier confirmed Zhu and Freeland’s hypothesis on a subset of variants of the antibiotic resistance gene TEM-1 β-lactamase and found that the SGC enriches for adaptive mutations even when an experimental fitness function is used (Firnberg and Ostermeier 2013).

For a definitive analysis of the influence of the organization of the genetic code on the exploration of the space of functional variants of a given protein, knowledge of the functional activity of all possible variants of the protein is essential. While recent deep mutational scanning studies have tested the functional activity of double and higher order mutants (Fowler and Fields 2014), lack of comprehensive characterization of all possible variants has impeded the systematic analysis of the space of all functional variants. Podgornaia and Laub generated a library containing all 160000 variants of the Escherichia coli protein PhoQ with amino acid substitutions at four positions and used a two-step selection mechanism coupled to next-generation sequencing to identify 1659 functional variants (Podgornaia and Laub 2015). PhoQ is a sensor histidine kinase that phosphorylates the response regulator PhoP in response to low extracellular magnesium concentrations. In such two-component signaling systems, mutating just three or four interfacial residues is enough to alter specificity. We used the space of variants of the PhoQ protein with amino acid substitutions at the PhoQ–PhoP interface and the information regarding the functional activity of these variants to test the role of genetic code organization in the exploration of the space of functional amino acid sequences.

We considered the exploration of functional variants of the PhoQ protein under translation using the SGC and different randomly generated codes. The SGC explored a larger fraction of the functional protein sequence space at small and intermediate time scales, compared to random genetic codes with the same degeneracy as the standard code and to random genetic codes with degeneracies different from that of the standard code. Upon considering longer time scales, the fraction of random codes of both types that allowed for the exploration of more functional PhoQ variants than the standard code increased. However, less than 5% of the random genetic codes with the same degeneracy as the standard code allowed for the exploration of more functional PhoQ variants than the standard code even at these extended time scales. We also investigated the dependence of the fraction of the functional sequence space explored on the starting nucleotide sequence. Finally, we calculated the correlations of the fraction of the sequence space explored under translation using different genetic codes with different quantitative characteristic measures of the genetic codes.

Materials and Methods

Generation of Random Genetic Codes that Preserved the Degeneracy of the SGC

The 64 codons were divided into 21 classes, 20 classes each consisting of codons coding for the same amino acid in the SGC and 1 class consisting of the 3 stop codons. To generate a random code, an amino acid was randomly assigned to one of the 20 classes of codons, not including the class consisting of stop codons. The set of stop codons was left unaltered. These random genetic codes are hereafter referred to as type \({T_{{\text{DP}}}}\) codes.

Generation of Random Genetic Codes with Degeneracies Different from that of the SGC

The set of stop codons was kept the same as in the SGC. Each amino acid was assigned to one codon chosen randomly from among the 61 codons. Each of the remaining 41 codons was then assigned to a randomly chosen amino acid. These random genetic codes are hereafter referred to as type \({T_{{\text{DNP}}}}\) codes.

Simulation

In a 12-nucleotide sequence encoding the 4 amino acids at positions 284, 285, 288, and 289 of the Escherichia coli protein kinase PhoQ or of the functional variants of this protein (Podgornaia and Laub 2015), one position was chosen randomly, and the nucleotide at that position was mutated to one of the other three possible nucleotides. Since multi-nucleotide substitutions are rare (Terekhanova et al. 2013), we do not consider them here. The mutated sequence was translated into a 4-amino acid sequence using the genetic code being considered, i.e., either the SGC or a randomly generated code. This mutation corresponded to one simulation step. If the new 4-amino acid sequence corresponded to a functional PhoQ variant as determined previously (Podgornaia and Laub 2015), the mutated nucleotide sequence became the start sequence in the next simulation step. Otherwise, the unmutated nucleotide sequence remained the start sequence in the subsequent simulation step. Simulations were run using the SGC, 10000 type \({T_{{\text{DP}}}}\) codes, or 10000 type \({T_{{\text{DNP}}}}\) codes for 100, 1000, 10000, 100000, or 1000000 steps. With the SGC, the starting 12-nucleotide sequence in the first simulation step was the wild-type nucleotide sequence coding for PhoQ in E. coli. With a randomly generated genetic code, the starting sequence in the first simulation step was such that it coded for the wild-type PhoQ amino acid sequence, hereafter referred to as the wild-type PhoQ variant, under the given code, with codons from degenerate sets chosen with probabilities proportional to their frequencies in the E. coli genome. For each code, for each number of simulation steps, the average number of functional PhoQ variants explored in 100 different simulation runs was reported.

Definition and Calculation of Different Characteristic Measures of Genetic Codes

Mean Squared Change in Physio-Chemical Properties

Let \(w\) be the amino acid physio-chemical property being considered. Then, the mean squared change in \(w\) for a given genetic code \({\text{G}}\) is defined as

$${w_{\text{G}}}=\frac{1}{{549}}\mathop \sum \nolimits^{} {\left( {{w_{{\text{new}}}} - {w_{{\text{old}}}}} \right)^2},$$
(1)

where the sum is over all possible single-nucleotide mutations, \({w_{{\text{old}}}}\) is the amino acid encoded by the unmutated codon, and \({w_{{\text{new}}}}\) is the amino acid encoded by the mutated codon. The quantity was calculated for the following amino acid properties: polar requirement (Woese et al. 1966b), hydrophilicity (Weber and Lacey 1978), isoelectric point (Alff-Steinberger 1969), and volume (Zamyatnin 1972). References adjacent to each property indicate the study from which the values for the property were taken.

Code Fragility

Code fragility for a genetic code is defined as the number of codons in the given genetic code for which, out of the 9 possible single-nucleotide substitutions, 8 or more were nonsynonymous (Judson and Haydon 1999). Note that all genetic codes with the same degeneracy have the same value of code fragility.

Code Mutability

Let \({x_i}\) be the number of nonsynonymous single-nucleotide substitutions for codon \(i\) under a given genetic code \({\text{G}}\). Code mutability \(M\) for that genetic code is then defined as (Judson and Haydon 1999)

$${M_{\text{G}}}=\frac{1}{{64}}\mathop \sum \limits_{{i=1}}^{{64}} {x_i}.$$
(2)

Note that all genetic codes with the same degeneracy have the same value of code mutability.

Total Number of Synonymous Point Mutations

Under a genetic code \({\text{G}}\), let \({x_i}\) be the number of single-nucleotide substitutions to codon \(i\) that do not change the amino acid encoded by the codon. Then, the total number of synonymous mutations is defined as

$${N_{{\text{S}},{\text{G}}}}=\mathop \sum \limits_{{i=1}}^{{64}} {x_i}.$$
(3)

Note that all genetic codes with the same degeneracy have the same value for this measure.

Code Changeability

The definition of code changeability was taken from Maeshiro and Kimura (Maeshiro and Kimura 1998). If \(i\) and \(j\) are amino acids such that it is possible to transition from amino acid \(i\) to amino acid \(j\) under the genetic code \({\text{G}}\) via a single-nucleotide substitution, the probability of transition from \(i\) to \(j\), \({\rho _{ij}}\), is defined as

$${\rho _{ij}}=\frac{{{m_{ij}}}}{{9{n_i}}},$$
(4)

where \({m_{ij}}\) is the number of ways of going from a codon that encodes amino acid \(i\) to a codon that encodes amino acid \(j\) in the genetic code \({\text{G}}\), and \({n_i}\) is the number of codons that encode amino acid \(i\) in the genetic code \({\text{G}}\). For amino acids \(i\) and \(j\) connected via 2 transitions, \({\rho _{ij}}\) is defined as

$${\rho _{ij}}=\mathop \sum \limits_{k} \frac{{{m_{ik}}}}{{9{n_i}}} \times \frac{{{m_{kj}}}}{{9{n_k}}},$$
(5)

where \(k\) is the intermediate amino acid via which the transition from \(i\) to \(j\) must take place. The value of \({\rho _{ij}}\) for amino acids connected via more than 2 transitions is defined along similar lines. The transition probability between amino acid \(i\) and amino acid \(j\) is defined by the minimal number of transitions for which \({\rho _{ij}}\) is nonzero, i.e., the path between amino acids \(i\) and \(j\) involving a minimum number of transitions is chosen for calculating \({\rho _{ij}}\). The three stop codons are treated as if coding for a 21st amino acid. However, paths that passed through stop codons are excluded from the calculation of \({\rho _{ij}}\). Code changeability is finally defined as

$${\rho ^{\text{G}}}=\frac{1}{{210}}\mathop \sum \limits_{i} \mathop \sum \limits_{j} {\rho _{ij}},\;~i \ne j.$$
(6)

Note that all genetic codes with the same degeneracy have the same value of code changeability.

Estimate of the Time Scale Corresponding to One Simulation Step

Each simulation step corresponds to the time taken for an individual in a population of effective size \({N_{\text{e}}}\) to acquire one nucleotide substitution in the 12-nucleotide PhoQ sequence considered here. Considering a typical E. coli genome size of 5.44 × 106 base pairs and a mutation rate of 0.003 mutations per genome per generation, one simulation step will be equivalent to

$$\frac{{5.44 \times {{10}^6}}}{{0.003 \times 12 \times {N_{\text{e}}}}}=\frac{{1.5 \times {{10}^8}}}{{{N_{\text{e}}}}}{\text{generations.}}$$
(7)

With an average generation time of 30 min, one simulation step is equivalent to

$$\frac{{1.5 \times {{10}^8}}}{{{N_{\text{e}}}}} \times \left( {30 \times 1.9 \times {{10}^{ - 6}}} \right)=\frac{{8.6 \times {{10}^3}}}{{{N_{\text{e}}}}}{\text{years.}}$$
(8)

Results

The Standard Genetic Code Allows for the Exploration of More Functional Variants of PhoQ as Compared to Random Codes

Figure 1 shows the distribution of the number of unique functional variants of the PhoQ protein explored under translation using different random genetic codes and using the SGC for different numbers of simulation steps. Under translation using the SGC, more unique functional variants of PhoQ were explored via single-nucleotide substitutions as compared to the average number visited under translation using randomly generated codes of both types, \({T_{{\text{DP}}}}\) and \({T_{{\text{DNP}}}}\), for up to 100000 simulation steps. It is only at 1 million simulation steps that the average number of functional PhoQ variants visited under translation using type \({T_{{\text{DNP}}}}\) codes surpassed the number explored under translation using the SGC. The mean number for type \({T_{{\text{DP}}}}\) codes, however, remained small as compared to the standard code even at 1 million simulation steps (Fig. 2a). Among codes of type \({T_{{\text{DP}}}}\), less than 5% allowed for the exploration of more functional variants than the standard code for up to 1 million simulation steps. The fraction was lower at lower numbers of simulation steps as is shown in Fig. 2b. Among codes of type \({T_{{\text{DNP}}}}\), less than 7% permitted exploration of more functional PhoQ variants than the standard code with the number of simulation steps up to 10000. The fraction was larger for higher numbers of simulation steps, reaching 87% at 1 million simulation steps. The results in Fig. 2b indicate that the SGC is more optimized for exploring a large fraction of the space of functional PhoQ variants at intermediate time scales than at very short or long time scales. The fraction of random codes of both types that explored a larger fraction of the PhoQ functional sequence space as compared to the SGC was lower at 1000 and 100000 simulation steps than at 100, 100000, or 1 million simulation steps (Fig. 2b).

Fig. 1
figure 1

Distribution of the number of functional variants of the E. coli kinase PhoQ explored via single-nucleotide substitutions under translation using type \({T_{{\text{DP}}}}\) codes, type \({T_{{\text{DNP}}}}\) codes, and the SGC. All simulations were carried out as described in the “Materials and Methods” section. Distributions are shown for N = 100, 1000, 10000, 100000, and 1000000 simulation steps

Fig. 2
figure 2

a The number of functional variants of the E. coli kinase PhoQ explored via single-nucleotide substitutions under translation using different genetic codes for different numbers of simulation steps. The bars indicate the number of functional PhoQ variants explored, averaged over 10000 different type \({T_{{\text{DP}}}}\) or type \({T_{{\text{DNP}}}}\) codes. The error bars for randomly generated codes represent one standard deviation from the mean. b The fraction of type \({T_{{\text{DP}}}}\) codes and the fraction of type \({T_{{\text{DNP}}}}\) codes that allowed for the exploration of more functional PhoQ variants via single-nucleotide substitutions for different numbers of simulation steps as compared to the standard code. Inset shows the fraction of type \({T_{{\text{DP}}}}\) codes that allowed for the exploration of more functional PhoQ variants as compared to the standard code for 100, 1000, and 10000 simulation steps

Number of Functional PhoQ Variants Explored Varies only Slightly for Different Degenerate Starting Nucleotide Sequences

Since there are only 20 amino acids and 3 stop codons with 64 possible three-nucleotide codons, all genetic codes are degenerate, i.e., some amino acids will be encoded by more than one codon. Therefore, the same amino acid sequence can be encoded by multiple distinct nucleotide sequences. We investigated how the number of functional PhoQ variants explored depends on the nucleotide sequence coding for the wild-type PhoQ variant from which the simulation is started. The results are shown in Fig. 3. For the SGC, the standard deviation of the number of proteins explored over all 384 possible starting nucleotide sequences was only 3 and 2% of the mean for 100 and 1000 simulation steps, respectively. The average standard deviation as a fraction of the mean was 4.6 and 3.8% for type \({T_{{\text{DP}}}}\) codes, and 12.3 and 12.1% for type \({T_{{\text{DNP}}}}\) codes, for 100 and 1000 simulation steps, respectively. The distribution of standard deviations as fractions of means was wider for type \({T_{{\text{DNP}}}}\) codes as compared to type \({T_{{\text{DP}}}}\) codes.

Fig. 3
figure 3

Distribution of the standard deviation, as a fraction of the mean, of the number of functional PhoQ variants explored on starting the simulation from different degenerate 12-nucleotide sequences that code for the wild-type PhoQ amino acid sequence under type \({T_{{\text{DP}}}}\) and type \({T_{{\text{DNP}}}}\) codes. The distributions are over 10000 randomly generated codes. a Distribution for simulations with \(N=100\) simulation steps. b Distribution for simulations with \(N=1000\) simulation steps. Insets show the standard deviation, as a fraction of the mean, of the number of functional PhoQ variants explored on starting the simulation from different degenerate nucleotide sequences for the SGC, type \({T_{{\text{DP}}}}\) codes, and type \({T_{{\text{DNP}}}}\) codes. Inset of a shows the summary statistics for the case of \(N=100\) simulation steps. Inset of b shows the summary statistics for the case of \(N=1000\) simulation steps

The Space of Functional Nucleotide Sequences is Modular for all Genetic Codes

We next investigated the structure of the space of functional nucleotide sequences under different genetic codes. For each genetic code, we constructed a network with nodes as the nucleotide sequences functional under the given code, and edges between sequences differing by one nucleotide substitution. We calculated the Newman modularity (Newman 2004; Newman and Girvan 2004) of this network using the Louvain algorithm (Blondel et al. 2008) for the standard code, for 100 type \({T_{{\text{DP}}}}\) codes, and for 100 type \({T_{{\text{DNP}}}}\) codes. For this calculation, we used the computer code available from http://www.ludowaltman.nl/slm/. As shown in Fig. 4, the modularity of the space of functional nucleotide sequences is very high for the SGC and for the randomly generated codes (> 0.85 in all cases). Further, we observed a larger variation in modularity values for type \({T_{{\text{DNP}}}}\) codes as compared to type \({T_{{\text{DP}}}}\) codes.

Fig. 4
figure 4

Newman modularity of the network of functional 12-nucleotide sequences coding for the PhoQ protein under the SGC, 100 type \({T_{{\text{DP}}}}\) codes, and 100 type \({T_{{\text{DNP}}}}\) codes. For each genetic code, all 12-nucleotide sequences that translated into a functional 4-amino acid sequence were identified. These sequences formed the nodes of the network. If it was possible to go from one nucleotide sequence to another sequence via a single-nucleotide substitution, the nodes corresponding to the two nucleotide sequences were connected via an edge

Codes that Better Preserve the Chemical Properties of Amino Acids Under Point Mutations Explore More Functional PhoQ Variants

To determine the factors with which the numbers of functional PhoQ variants explored are correlated, we calculated different characteristic measures for the SGC and for the random genetic codes. These measures can be classified into two categories, ones that characterize the effects of single-nucleotide substitutions on physio-chemical properties of the amino acids encoded, and ones that quantify different structural features of codon–amino acid assignments in different genetic codes. We discuss measures of the first type here. For each genetic code, we calculated the average over all single-nucleotide substitutions in all codons, except the stop codons, of the squared change in different physical and chemical properties of the amino acids encoded: polar requirement (Woese et al. 1966a), hydrophilicity (Weber and Lacey 1978), isoelectric point (Alff-Steinberger 1969), and amino acid volume (Zamyatnin 1972).

We found weak, yet statistically significant (p value < 10−4) negative correlations between the number of functional PhoQ variants explored and the mean squared change in polar requirement, hydrophilicity, and isoelectric point for both type \({T_{{\text{DP}}}}\) and type \({T_{{\text{DNP}}}}\) codes, at all numbers of simulation steps (Figs. 5 and 6). However, none of the type \({T_{{\text{DP}}}}\) codes and none of the type \({T_{{\text{DNP}}}}\) codes exhibited a lower mean squared change in polarity requirement under point mutations than the SGC. The correlation between the number of functional PhoQ variants explored and the mean squared change in amino acid volume was further weak, with statistical insignificance at certain numbers of simulation steps (Figs. 5 and 6).

Fig. 5
figure 5

Scatter plots representing the dependence of the number of functional PhoQ variants explored under translation using type \({T_{{\text{DP}}}}\) codes (vertical axis) on the mean squared change in different physio-chemical properties of amino acids due to single-nucleotide substitutions: polar requirement, hydrophilicity, isoelectric point, and volume (horizontal axis). The Pearson’s correlation coefficient (r) and the p value of the estimate are indicated under each plot

Fig. 6
figure 6

Scatter plots representing the dependence of the number of functional PhoQ variants explored under translation using type \({T_{{\text{DNP}}}}\) codes (vertical axis) on the mean squared change in different physio-chemical properties of amino acids due to single-nucleotide substitutions: polar requirement, hydrophilicity, isoelectric point, and volume (horizontal axis). The Pearson’s correlation coefficient (r) and the p value of the estimate are indicated under each plot

To further probe the relation between the mean squared change in polar requirement for genetic codes and the number of functional PhoQ variants explored, we compared the mean squared changes in polar requirement for the 100 type \({T_{{\text{DP}}}}\) codes and the 100 type \({T_{{\text{DNP}}}}\) codes that allowed for the exploration of most PhoQ variants to the mean squared changes in polar requirement for all type \({T_{{\text{DP}}}}\) and type \({T_{{\text{DNP}}}}\) codes (Fig. 7a, b). We observed that the mean squared changes in polar requirement for the top-performing type \({T_{{\text{DP}}}}\) and type \({T_{{\text{DNP}}}}\) codes did not differ significantly from other random codes. We also considered the 100 type \({T_{{\text{DP}}}}\) and the 100 type \({T_{{\text{DNP}}}}\) codes with least values of mean squared change in polar requirement and found that the numbers of functional PhoQ variants explored under translation using these codes did not differ significantly from the numbers explored under translation using all random codes of each type (Fig. 7c, d).

Fig. 7
figure 7

a Mean squared change in polar requirement for the 100 type \({T_{{\text{DP}}}}\) codes that allowed for the exploration of the highest numbers of functional PhoQ variants for different numbers of simulation steps (best 100 codes), for all 10000 type \({T_{{\text{DP}}}}\) codes (all codes), and for the SGC. b Mean squared change in polar requirement for the 100 type \({T_{{\text{DNP}}}}\) codes that allowed for the exploration of the highest numbers of functional PhoQ variants for different numbers of simulation steps (best 100 codes), for all 10000 type \({T_{{\text{DNP}}}}\) codes (all codes), and for the SGC. c Number of functional PhoQ variants explored under translation for different numbers of simulation steps using the 100 type \({T_{{\text{DP}}}}\) codes with the lowest mean squared changes in polar requirement (best 100 codes), all 10000 type \({T_{{\text{DP}}}}\) codes (all codes), and the SGC. d Number of functional PhoQ variants explored under translation for different numbers of simulation steps using the 100 type \({T_{{\text{DNP}}}}\) codes with the lowest mean squared changes in polar requirement (best 100 codes), 10000 type \({T_{{\text{DNP}}}}\) codes (all codes), and the SGC. The error bars represent one standard deviation from the mean

Codes with a Larger Number of Nonsynonymous Point Mutations Explore More Functional PhoQ Variants

We evaluated quantitative measures of different structural features of codon–amino acid assignments in randomly generated genetic codes and investigated their correlation with the number of functional PhoQ variants explored under translation using the different codes. We used the measures defined previously by Judson and Haydon (Judson and Haydon 1999): code fragility, defined as the number of codons with eight or nine nonsynonymous point mutations out of the nine possible point mutations; code mutability, defined as the average number of nonsynonymous point mutations per codon; and the total number of synonymous point mutations. Scatter plots of these measures versus the number of functional PhoQ variants explored under translation using type \({T_{{\text{DNP}}}}\) codes at different numbers of simulation steps are shown in Fig. 8. Note that all type \({T_{{\text{DP}}}}\) codes have the same value as the SGC for the measures considered in this section.

Fig. 8
figure 8

Scatter plots representing the dependence of the number of functional PhoQ variants explored under translation using type \({T_{{\text{DNP}}}}\) codes (vertical axis) on quantitative measures characterizing the organization in codon–amino acid assignments: code fragility, code mutability, total number of synonymous point mutations, and code changeability (horizontal axis). Detailed definitions of these properties are given in the “Materials and Methods” section. The Pearson’s correlation coefficient (r) and the p value of the estimate are indicated under each plot

We observed that code fragility and code mutability were positively correlated with the number of functional PhoQ variants visited for different numbers of simulation steps, with p value < 10−4 in each case. The total number of synonymous mutations, which encodes information opposite to that encoded by code fragility and code mutability, was negatively correlated, with p value < 10−4, with the number of functional PhoQ variants explored under translation using type \({T_{{\text{DNP}}}}\) codes for different numbers of simulation steps. None of the type \({T_{{\text{DNP}}}}\) codes allowed for more synonymous mutations than the SGC.

We also calculated the code changeability (Maeshiro and Kimura 1998) for different type \({T_{{\text{DNP}}}}\) codes. Changeability is defined as the sum, over all pairs, of the probabilities of transitions between different amino acids, excluding paths involving stop codons. The number of functional PhoQ variants explored under translation using type \({T_{{\text{DNP}}}}\) codes was negatively correlated with code changeability, with p value < 10−4, at numbers of simulation steps greater than 100.

Preservation of Chemical Properties of Amino Acids Under Point Mutations Facilitates Exploration of the Functional PhoQ Sequence Space at Intermediate Time Scales

We calculated, for each number of simulation steps, the mean squared change in amino acid polar requirement, hydrophilicity, isoelectric point, and volume due to point mutations under translation using those type \({T_{{\text{DP}}}}\) and type \({T_{{\text{DNP}}}}\) codes that explored more functional PhoQ variants as compared to the SGC at the given number of simulation steps. The results are shown in Fig. 9. The mean squared change in these physio-chemical properties of amino acids due to point mutations is lower for codes that explored more functional PhoQ variants than the SGC for 1000 and 10000 simulation steps as compared to the codes that explored more functional PhoQ variants than the SGC for 100, 100000, or 1000000 simulation steps. These results indicate that preservation of chemical properties of amino acids under point mutations promotes the exploration of functional PhoQ variants at intermediate time scales while contributing little toward the exploration of functional PhoQ variants at very short time scales, i.e., 100 simulation steps, or at very long time scales, i.e., 100000 and 1000000 simulation steps.

Fig. 9
figure 9

Mean squared changes in different physio-chemical properties of amino acids due to single-nucleotide substitutions for those type \({T_{{\text{DP}}}}\) codes (top row; solid lines) and type \({T_{{\text{DNP}}}}\) codes (bottom row; solid lines) that explored more functional PhoQ variants as compared to the standard code (dashed lines) for 100, 1000, 10000, 100000, and 1000000 simulation steps. Results are shown for amino acid polar requirement, hydrophilicity, isoelectric point, and amino acid volume. The error bars represent one standard error of the mean

The Deviant Genetic Codes in Different Species Explore More Functional PhoQ Variants as Compared to the SGC

Maeshiro and Kimura (1998) proposed that the reassignment of certain codons in the deviant genetic codes of some species such as Candida spp., Mycoplasma spp., Euplotes spp., and Blepharisma spp. could ease the transitions between amino acids with different polarities, and increase chances of recovery from nonsense mutations by decreasing the number of stop codons, thereby allowing for greater alterability of the phenotypes. Postulating that these properties will facilitate the exploration of the space of functional PhoQ variants, we calculated the number of functional PhoQ variants explored under translation using these deviant genetic codes at different numbers of simulation steps and compared the results to the number of functional PhoQ variants explored under translation using the SGC. The results are shown in Fig. 10 and Table 1. For numbers of simulation steps greater than 1000, translation using deviant genetic codes in Candida spp., Euplotes spp., and Blepharisma spp. allowed for the exploration of a significantly higher (two-sample t test p value < 0.01) fraction of the space of functional PhoQ variants as compared to translation using the SGC. Differences from the SGC are not significant at 100 and 1000 simulation steps for the deviant codes in these species. Under translation using the deviant genetic code in Mycoplasma spp., a significantly larger fraction of the PhoQ functional sequence space as compared to translation using the standard code (two-sample t test p value < 0.05) was explored only at 100000 and 1000000 simulation steps. For each code and each number of simulation steps, 100 simulations were carried out and the results were used to calculate the p values.

Fig. 10
figure 10

Fraction change in the number of functional PhoQ variants explored under translation using deviant genetic codes from different species as compared to the SGC for numbers of simulation steps \(N=10000,\,N=100000\), and \(N=1000000.\) The change is calculated as \(\Delta f=({f_{{\text{code}}}} - {f_{{\text{SGC}}}})/{f_{{\text{SGC}}}}\). Here, \({f_{{\text{SGC}}}}\) is the mean of the number of functional PhoQ variants explored using the SGC over 100 different simulations and \({f_{{\text{code}}}}\) is the mean of the number of functional PhoQ variants explored using the deviant genetic code over 100 different simulations

Table 1 The p values for the number of functional PhoQ variants explored under translation using deviant genetic codes from different species compared to the number explored under translation using the SGC. Here, the null hypothesis is that the number of functional PhoQ variants explored under translation using the SGC is equal to or greater than the number of variants explored under translation using a deviant genetic code

Discussion

Our results indicate that the organization of codon–amino acid assignments in the SGC allows for the exploration of a greater fraction of the space of functional variants of the Escherichia coli protein kinase PhoQ via single-nucleotide substitutions, as compared to most randomly generated codes, particularly at intermediate time scales (Fig. 2b). A role of selection for flexibility in the evolution of the SGC was first proposed by Maeshiro and Kimura (1998) and backed up soon thereafter by Judson and Haydon (1999). Later studies argued in favor of selection for increased protein diversity during the evolution of the SGC (Higgs 2009; Francis 2013). Firnberg and Ostermeier showed for a subset of the functional variants of the antibiotic resistance gene TEM-1 β-lactamase that there is enrichment for adaptive mutations under translation using the SGC (Firnberg and Ostermeier 2013; Firnberg et al. 2014). Our results represent the first direct confirmation of Maeshiro and Kimura’s hypothesis for the space of all possible variants of an amino acid sequence.

Both with randomly generated codes and the standard code, the space of functional nucleotide sequences is partitioned into clusters of sequences with dense connections between nucleotide sequences in the same cluster and sparse connections between sequences in different clusters (Fig. 4). This modular structure of the functional nucleotide sequence space for all genetic codes arises since all genetic codes map 61 codons to 20 amino acids and must therefore be degenerate. Given the small separation between degenerate codons in the standard code and in type \({T_{{\text{DP}}}}\) codes, the probability that a single-nucleotide substitution will change the encoded amino acid is low. Further, amino acid sequences that are closer to a functional sequence are more likely to be functional as compared to the sequences that are far away from it. This is since protein functionality derives from the physio-chemical properties of amino acids. Thus, substitution of an amino acid with another amino acid having similar properties is less likely to alter the functionality of the amino acid sequence. This characteristic results in a modular structure of the functional nucleotide sequence space, even for type \({T_{{\text{DNP}}}}\) codes where degenerate codons may be separated by a larger distance.

The modular structure of the functional nucleotide sequence space is responsible for the comparatively small variation in the number of functional PhoQ variants visited on starting the simulation from different nucleotide sequences coding for the wild-type PhoQ amino acid sequence (Fig. 3). For the standard code and for type \({T_{{\text{DP}}}}\) codes, all nucleotide sequences coding for the wild-type PhoQ are likely to lie within the same cluster, given the small distances between degenerate codons in these codes. Degenerate codons in these codes differ by 1.3 nucleotides on average. Since our simulation is a random walk on the network of functional nucleotide sequences, given the highly modular nature of the network, simulations starting at nodes within the same cluster are likely to explore similar numbers of nodes. For type \({T_{{\text{DNP}}}}\) codes, degenerate codons differ by 2.25 ± 0.07 nucleotides (mean ± standard deviation). Thus, different nucleotide sequences coding for the wild-type PhoQ are less likely to be located within the same cluster, resulting in a larger variation and a wider distribution of variations in the number of functional PhoQ sequences visited as compared to type \({T_{{\text{DP}}}}\) codes on starting the simulation from different degenerate nucleotide sequences.

The property of the SGC to allow for the exploration of a large fraction of the space of functional nucleotide sequence variants is not a direct consequence of the property of the code to minimize changes in different physio-chemical properties of the encoded amino acids under point mutations, a property that has been demonstrated in previous studies (Alff-Steinberger 1969; Zamyatnin 1972; Wolfenden et al. 1979; Haig and Hurst 1991; Freeland and Hurst 1998). That the latter property does not directly lead to the former is evident from the weaker optimization of the SGC for exploration of the space of functional PhoQ variants as compared to its optimization for minimizing the changes in physio-chemical properties of amino acids under point mutations. None of the 10000 type \({T_{{\text{DP}}}}\) or type \({T_{{\text{DNP}}}}\) codes exhibited a smaller mean squared change in amino acid polar requirement under single-nucleotide substitutions than the standard code. The 100 type \({T_{{\text{DP}}}}\) codes and the 100 type \({T_{{\text{DNP}}}}\) codes that allowed for the exploration of highest numbers of functional PhoQ variants did not exhibit significantly lower values of the mean squared change in polar requirement as compared to all random codes (Fig. 7a and b). The 100 type \({T_{{\text{DP}}}}\) codes and the 100 type \({T_{{\text{DNP}}}}\) codes with least values of mean squared changes in polar requirement did not allow for the exploration of significantly higher numbers of functional PhoQ variants as compared to all random codes (Fig. 7c and d). Collectively, these results imply that the property of a genetic code to allow for minimal changes in physio-chemical properties of amino acids under point mutations is neither necessary nor sufficient for exploring a large fraction of the space of functional nucleotide sequences under the given code. Further, while all type \({T_{{\text{DP}}}}\) codes allowed for the same number of synonymous mutations as the standard code, only a few of these codes allowed for the exploration of a larger fraction of the space of functional PhoQ variants than the standard code. None of the type \({T_{{\text{DNP}}}}\) codes allowed for more synonymous mutations than the standard code, but a number of such codes allowed for the exploration of a greater fraction of the space of functional PhoQ variants as compared to the standard code. Taken together, these observations indicate that selection for exploring a large fraction of the space of functional nucleotide sequences is largely independent of the different selection pressures postulated before. In fact, selection for a code that allows for the exploration of a larger fraction of the space of functional protein variants as compared to the SGC may have contributed toward the emergence of codon re-assignments in deviant genetic codes seen in Candida spp., Mycoplasma spp., Euplotes spp., and Blepharisma spp.

The correlations of the number of functional PhoQ variants explored with different characteristic measures of the genetic codes indicate that a balance between physio-chemically conservative amino acid changes under point mutations, i.e., robustness, and alterability of the amino acid sequence under point mutations, i.e., flexibility, is needed for better exploration of the space of functional nucleotide sequences (Maeshiro and Kimura 1998). While conservation, under point mutations, of polar requirement, hydrophilicity, and isoelectric point of amino acids encoded did correlate with the number of functional PhoQ variants explored (Figs. 5 and 6), the number of functional PhoQ variants explored was greater for genetic codes allowing for smaller numbers of synonymous mutations and for codes with higher fragility and mutability (Fig. 8). This result indicates that frequent transitions between different amino acids, unless constrained, are not sufficient for exploring a larger fraction of the functional nucleotide sequence space. High code fragility and mutability allow for visiting, via point mutations, a greater number of nucleotide sequences, while conservation of physio-chemical properties of amino acids under point mutations helps restrict the sequence of PhoQ mutants visited to functional PhoQ mutants, thereby allowing for the exploration of a larger fraction of the space of functional PhoQ variants. We have shown that the SGC embodies an evolutionary advantageous balance of robustness and flexibility, leading to the exploration of a large fraction of the space of functional PhoQ variants. A similar postulate was previously put forward by Firnberg and Ostermeier in the context of TEM-1 β-lactamase (Firnberg and Ostermeier 2013).

While code mutability, defined as the average number of nonsynonymous single-nucleotide substitutions, was positively correlated with the number of functional PhoQ variants explored for type \({T_{{\text{DNP}}}}\) codes, code changeability, defined as the sum of probabilities of transitions between all pairs of amino acids, was negatively correlated (Fig. 8). The difference in trends for metrics that allude to the same property of a genetic code is due to the distribution of functional sequences in the space of all possible amino acid sequences. Code mutability describes the likelihood that the amino acid sequence will change due to a single-nucleotide substitution. Thus, starting from a functional sequence, codes with high mutability will allow for visiting a large number of neighboring sequences differing by one amino acid from the functional sequence. Since sequences differing only slightly from a functional sequence are more likely to be functional, high mutability will facilitate the exploration of functional sequences. Code changeability, on the other hand, describes the probability of an amino acid changing into another amino acid, including changes that must occur via multiple nucleotide substitutions. Transitions involving multiple nucleotide substitutions, however, generally involve multiple intermediate amino acid sequences. Given the sparsity of functional sequences in the space of all possible amino acid sequences, the intermediate sequences in a transition involving multiple nucleotide substitutions are unlikely to be all functional. Thus, while high changeability may promote the exploration of space of all possible amino acid sequences, such a property is more likely to hinder exploration of the space of functional amino acid sequences by promoting transitions to nonfunctional sequences. Due to the rugged nature of the space of functional proteins, searching for functional sequences via single-nucleotide substitutions is likely to be a more successful strategy as compared to taking multiple steps at once, and the organization of codon–amino acid assignments in the SGC is optimized to take advantage of such a strategy.

An organization of codon–amino acid assignments in the genetic code that allows for the exploration of a larger fraction of the space of functional nucleotide sequences via point mutations can provide an evolutionary advantage. Selection for such a property need not invoke a teleological view of evolution. In the vertical descent model of emergence of a universal genetic code (Crick 1968; Wong 1976; Harris et al. 2003), exploration of a larger fraction of the functional nucleotide sequence space will allow the population descended from an individual to access a much larger innovation pool of functional sequences. Thus, such a genetic code will provide a fitness advantage to the population of individuals, particularly amidst changing environmental conditions, and can therefore be selected for. Along similar lines, in the competition between innovation pools model of emergence of a universal genetic code (Vetsigian et al. 2006; Goldenfeld and Woese 2007), a community with a genetic code that allows for the exploration of a larger fraction of the space of functional nucleotide sequences will have access to a larger innovation pool. This property will provide a fitness advantage to the individuals in the community, and such a community is more likely to drive out other communities from a niche, allowing for positive selection for the property to explore a larger fraction of the space of functional nucleotide sequences during the emergence of a universal genetic code.

As described above, the property of a genetic code to allow for the exploration of a greater number of functional nucleotide sequences via single-nucleotide substitutions affords greater standing diversity in the population. Diversity in the population will result in a fitness advantage under changing environmental conditions. Environmental changes are more likely to occur at intermediate time scales as compared to very short or very long time scales. As shown in the “Materials and Methods” section, each step in our simulations roughly corresponds to 8.6 × 103/Ne years, where \({N_{\text{e}}}\) is the effective population size. For \({N_{\text{e}}}={10^6},\) 10000 simulation steps will correspond to a period of around 100 years, an approximate time scale for environmental changes. Our results indicate that the SGC is more optimized for exploring a larger fraction of the functional nucleotide sequence space at these intermediate time scales than at very short or very long time scales (Fig. 2b). Further, the preservation of chemical properties of amino acids under single-nucleotide substitutions for which the SGC is well known to be optimized seems to aid exploration of more functional nucleotide sequences at such intermediate time scales (Fig. 9). Thus, the organization of codon–amino acid assignments in the SGC not only helps minimize the phenotypic effects of mutations and translation errors, but also allows for greater standing genetic diversity in the population at the intermediate time scales of typical environment changes. Together, these benefits afforded by the SGC may have contributed toward its emergence as the universal genetic code.

The present study only considers the space of functional variants of the PhoQ protein with amino acid substitutions at four positions. Thus, the ideas described above are universal to the extent to which our results for the PhoQ protein can be generalized. Given the dependence of protein function on protein structure and chemistry (Berg et al. 2002), both of which derive from the amino acid composition of the protein, we expect the SGC to exhibit a similar characteristic for the functional sequence spaces of other proteins. Comprehensive studies of functionalities of variants of other proteins are needed for strengthening evidence in support of the ideas presented here.