Introduction

Synonymous codon usage in many bacteria is highly nonrandom, and this codon bias can reflect mechanistic or evolutionary constraints that limit the degrees of freedom available for building coding sequences (Ikemura 1982). Codon usage bias has been an active area of genome analysis and has been shown to be related to protein structure in E. coli (Gu et al. 2004) and to the cross-species conservation of nonrandom synonymous codon usage (Grantham et al. 1980). The earliest hypotheses proposed to explain codon usage bias favored natural selection as the mechanism correlating codon usage with the most abundant tRNAs in the cell (Ikemura 1981; Sharp et al. 1993; Duret 2000). This selective force was proposed to contribute to the optimization of translation efficiency and/or level of gene expression (Grosjean and Fiers 1982; Morton 1998; Merkl 2003). Another proposed mechanism for the creation of synonymous codon bias is mutational bias (Wolfe et al. 1989; Sueoka and Kawanishi 2000). It has also been proposed that synonymous codon usage could be relevant to mRNA stability, which could further facilitate translational processes (Mita et al. 1988). Codon usage bias dependent on the positioning of local nucleotides, or codon context, has been reported in the honey bee (Cutler and Chantawannakul 2007) as well as some bacteria (McVean and Hurst 2000). For example, in Escherichia coli, of the two triplets which code for lysine, AAG is preferred when the next three-prime nucleotide is a C or A, but AAA is preferred when G follows (Berg and Silva 1997). Frequencies of occurrence of oligomers such as dinucleotides within organisms could well represent genomic signatures resulting from selective pressures from nucleotide stacking, DNA conformation tendencies, DNA replication, and/or repair mechanisms (Karlin et al. 1998).

Due to the extraordinary temperature and radiation tolerance of species in the class Deinococci, the microorganisms in this class are of great interest for understanding the forces of evolution and the possible mechanisms responsible for survival in extreme environments. The Deinococci are known to survive in the presence of otherwise lethal and mutagenic damaging agents such as heat, starvation, oxidative stress, dehydration, and ionizing radiation (Minton 1994; Bessman et al. 1996; White et al. 1999; Narumi 2003). Deinococcus radiodurans (formerly called M. radiodurans), first discovered in irradiated tins of meat, is a Gram-positive, red-pigmented, nonmotile, and non-spore-forming eubacterium (Anderson et al. 1956). Despite being unable to form spores, D. radiodurans has a remarkable resistance to extreme environments and has been found in such diverse habitats as soil, animal feces, weathered granite from the Antarctic valley, room dust, and irradiated medical instruments (Counsell and Murray 1986; Masters et al. 1991). The extreme radiotolerance property, in particular, has captured the attention of many scientists in areas such as bioremediation (Cleston et al. 1998; Lange et al. 1998; Brim et al. 2000), genetic engineering (Lui 2006), evolution (Woese 1987; Lui 2006), and even exobiology (Makarova et al. 1999). In one landmark study, after cells were exposed to a normally lethal 3000-G dose of γ radiation, which produced over 110 double-strand breaks and massive DNA damage, the double-stranded DNA was completely repaired within only 3 h of cultivation and no reduction in fitness was observed (Battista 1997). Another completely sequenced species of the class Deinococci, closely related to D. radiodurans, called D. geothermalis, was originally isolated from a hot pool at the Termi di Agnano, in Naples, Italy, and is distinct from most other members of this class due to its only moderate thermophilic ability (Ferreira et al. 1997).
The Thermus genus, which also belongs to the class Deinococci, is found to be less resistant to radiation; however, it is well adapted to extreme thermal stress conditions (Omelchenko et al. 2005).

Using the completely sequenced genomes of two Deinococcus species and two Thermus thermophilus strains, we investigate local nucleotide correlations by observing dinucleotide frequencies of nucleotides separated by up to 60 nucleotides to determine any possible relatedness between such correlations and synonymous codon usage bias, high GC content conservation, potential repair mechanisms, and/or the DNA damage-resistant phenotype. The results presented here may provide new perspectives on the processes leading to the development of high tolerance to mutational agents and the conservation of genome-wide properties such as synonymous codon bias resulting from the accumulation of mutation bias due to local base composition effects.

Materials and Methods

Database Search and Dataset

The four genomes, Deinococcus radiodurans, D. geothermalis, Thermus thermophilus HB8, and T. thermophilus HB27, were used as the source of data for the genomic analysis presented here. For Deinococcus radiodurans, the two completely sequenced chromosomes, in addition to the two plasmids (MP1 and CP1) publicly available as GBK files from GenBank (accession numbers NC_000958, NC_000959, NC_001263, and NC_001264), were used. The completely sequenced genome for D. geothermalis was composed of one circular chromosome and one plasmid (accession numbers NC_008025 and NC_008010). Each of the Thermus thermophilus genomes was composed of one large circular chromosome and one plasmid (accession numbers NC_005835, NC_005838, NC_006461, and NC_006462). To facilitate the nucleotide correlation frequency analyses used in this study, multiple FASTA format data files were generated which contained the complete set of extracted coding sequences with a minimum length of 20 amino acids for each species. For example, the FASTA data file for D. radiodurans contains the 3198 currently annotated coding sections (designated CDS regions), each of which was stored as a separate FASTA data entry using Perl routines created by the author. From these CDS regions a total of 960,427 codons were then extracted and used as the basis for the analyses presented here.

Randomly Generated Sequences

To determine which deviations from expected are purely due to codon abundances, a set of randomly generated sequences was constructed using the codon abundances for each respective species in the class Deinococci. For each exon region in each of the four Deinococci genomes, a new “exon” region was randomly generated using the respective codon frequencies. Since the codons in these exon regions are assigned randomly, there should be no deviations from expected on any length scale due to codon–codon correlation effects. In addition, any deviations from the expected nucleotide–nucleotide abundances found in these generated sequences, for nucleotides in two codons separated by a distance x, would be due solely to codon usage.
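The generation step can be sketched as follows. This is a minimal illustration, not the author's Perl routines: it assumes each CDS is already available as a list of codon strings, that each generated region keeps the length of the region it replaces, and that codons are drawn independently in proportion to their genome-wide counts. The function names are hypothetical.

```python
import random
from collections import Counter


def random_cds(length, codon_counts, rng):
    """Draw `length` codons independently, weighted by genome-wide codon counts."""
    codons = list(codon_counts)
    weights = [codon_counts[c] for c in codons]
    return rng.choices(codons, weights=weights, k=length)


def generate_random_genome(cds_list, seed=0):
    """Replace every CDS by a random one of equal length (assumed design).

    cds_list: list of CDS regions, each a list of codon strings.
    """
    rng = random.Random(seed)
    # Codon abundances pooled over all CDS regions of the genome.
    codon_counts = Counter(codon for cds in cds_list for codon in cds)
    return [random_cds(len(cds), codon_counts, rng) for cds in cds_list]


# Toy usage with two very short "CDS regions".
real = [["ATG", "GCC", "GCC", "AAA"], ["ATG", "GGC"]]
fake = generate_random_genome(real, seed=1)
```

Because every codon is drawn independently, any codon–codon correlation in the generated set can only arise by chance, which is the property the comparison relies on.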

Gene Expression Datasets

To determine whether the deviations found in these species are predominantly due to mutational forces or selective forces, six additional codon datasets containing genes with high, medium, and low expression levels were assembled for the D. radiodurans and T. thermophilus HB8 genomes using gene expression data from the Entrez Gene Expression Omnibus (GEO) database. The expression levels for 2159 labeled genes in T. thermophilus HB8 were extracted from the GEO data file GSE7165 and ranked based on the average expression levels for the 30 time points. From these rankings, three new datasets were created, corresponding to the one-sixth highest expressed, the one-sixth lowest expressed, and the remaining genes. For D. radiodurans, expression level data were extracted from GEO dataset GSE9636, which contained eight samples for wild type vs. mutant expression profile differences. For the 3084 genes in this dataset, three data files, of low, medium, and high expression level, were generated using the same partitioning as above.
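The ranking and partitioning described above can be sketched as below. This is an illustration under stated assumptions: expression values are held in a plain dictionary rather than parsed from a GEO file, the ranking uses the arithmetic mean across samples, and the size of each one-sixth group is rounded down. The function name is hypothetical.

```python
def partition_by_expression(expression, parts=6):
    """Split genes into (low, medium, high) expression groups.

    expression: dict mapping gene name -> list of expression values
                (one value per time point or sample).
    The lowest and highest 1/parts of the ranked genes form the low and
    high groups; everything in between is the medium group.
    """
    def mean_level(gene):
        values = expression[gene]
        return sum(values) / len(values)

    ranked = sorted(expression, key=mean_level)  # ascending mean expression
    n = len(ranked) // parts                     # group size, rounded down (assumption)
    low = ranked[:n]
    high = ranked[len(ranked) - n:]
    medium = ranked[n:len(ranked) - n]
    return low, medium, high


# Toy usage: 12 genes, so each extreme group holds 2 genes.
toy = {f"gene{k}": [float(k), float(k) + 0.5] for k in range(12)}
low, medium, high = partition_by_expression(toy)
```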

Nucleotide Correlations Within Coding Regions

Using the CDS regions described above and the codon correlations analysis developed by Cutler and Chantawannakul (2007) summarized below, nucleotide abundances within codons were recorded by counting the total number of times nucleotides, in each of the three positions (sites N 1 , N 2 , and N 3 ) for two codons, occurred at distances of up to 20 codons apart. The actual nucleotide–nucleotide abundances for all codon sites separated by up to x = 20 codons apart designated N x (i a, j b), where a and b refer to the codon site and i and j designate the particular nucleotides, were calculated and the total abundances stored to a file for further analysis. For example, a codon starting with an adenine (A) nucleotide residing two codons before a second codon starting with a cytosine (C) nucleotide is recorded as a single N2 (A 1, C 1) count for the first nucleotide positions within the two codons.
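The counting procedure can be sketched as follows. This is a minimal illustration rather than the original implementation: it assumes each CDS is a list of codon strings, and it stores N x (i a, j b) in a counter keyed by the hypothetical tuple (x, a, b, i, j), with codon sites numbered 1 to 3.

```python
from collections import Counter


def count_nucleotide_pairs(cds_list, max_distance=20):
    """Count nucleotide pairs between codon sites up to max_distance codons apart.

    Returns a Counter where counts[(x, a, b, i, j)] is the number of times
    nucleotide i at site a of one codon is followed, x codons downstream in
    the same CDS, by nucleotide j at site b.
    """
    counts = Counter()
    for codons in cds_list:
        for pos, first in enumerate(codons):
            for x in range(1, max_distance + 1):
                if pos + x >= len(codons):
                    break  # pairs never span two different CDS regions
                second = codons[pos + x]
                for a in range(3):
                    for b in range(3):
                        counts[(x, a + 1, b + 1, first[a], second[b])] += 1
    return counts


# The example from the text: a codon starting with A two codons before a
# codon starting with C contributes one count to N2 (A 1, C 1).
counts = count_nucleotide_pairs([["ACG", "TTT", "CAA"]])
n2_a1_c1 = counts[(2, 1, 1, "A", "C")]
```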

Theoretically, as the distance between codon sites increases, the total number of nucleotide–nucleotide counts decreases linearly, in proportion to the number of CDS regions, since within a single CDS region there is one more codon pair at a triplet distance x than at a distance x + 1. Since each of the three codon sites can contain any of the four nucleotides, in total there are 16 distinct nucleotide pairs. By counting the total number of times each nucleotide pair occurs for two codon sites, the counts partition between the 16 nucleotide pairs at approximately the respective frequencies for each nucleotide at that codon site. In the null model, if there is no relationship between nucleotides at a given distance, the expected background for nucleotides at increasing codon separations should vary around the line defined by the overall number of counts times the nucleotide frequencies. By quantifying the expected background of nucleotide abundances, the deviation from expected can be determined by subtracting the expected number of counts from the actually observed nucleotide–nucleotide abundances. This deviation from expected is defined as D x (i a, j b) in the same way the total abundance is defined as N x (i a, j b). This allows over- and underabundances for particular nucleotide pairs within codons at different distances to be easily visualized and a statistical significance for the deviation from zero to be determined for any given deviation.
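The null-model reasoning in this paragraph can be illustrated with a short sketch. It assumes the expected count is the number of codon pairs at distance x multiplied by the two site-specific nucleotide frequencies, which is one direct reading of the description; the function names and the toy numbers are hypothetical.

```python
def total_codon_pairs(cds_lengths, x):
    """Number of codon pairs exactly x codons apart, summed over all CDS regions.

    A CDS of n codons contributes n - x pairs (or none if it is too short),
    so the total drops by one per sufficiently long CDS for each step in x.
    """
    return sum(max(n - x, 0) for n in cds_lengths)


def expected_count(cds_lengths, x, freq_i_at_a, freq_j_at_b):
    """Null-model expectation for N_x(i_a, j_b) with independent sites."""
    return total_codon_pairs(cds_lengths, x) * freq_i_at_a * freq_j_at_b


def deviation(observed, expected):
    """D_x(i_a, j_b): observed abundance minus the expected background."""
    return observed - expected


# Toy usage: two CDS regions of 30 and 50 codons.
lengths = [30, 50]
pairs_x1 = total_codon_pairs(lengths, 1)
pairs_x2 = total_codon_pairs(lengths, 2)
background = expected_count(lengths, 1, 0.5, 0.25)
```

In this toy case the total falls by exactly two between x = 1 and x = 2, one pair per CDS region, which is the linear decrease the paragraph describes.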

To quantify both the presence of large deviations from the expected background abundances and the over- and underabundance pattern of these deviations, we define a “pattern summary deviation” designated \( {\text{D}}_{\text{x}}^{\text{YR}} \)(I a, J b) where I and J are groups of nucleotides such as N = {A,C,G,T}, S = {C,G}, and Y = {C,T}, a and b are the two codon sites, and x is the distance between the two codons in question. This pattern summary deviation is defined as

$$ {\text{D}}_{\text{x}}^{\text{YR}} \left( I_{\text{a}}, J_{\text{b}} \right) = \sum\limits_{i \in I} \sum\limits_{j \in J} D_{x} \left( i_{\text{a}}, j_{\text{b}} \right) * I_{YR}(i,j) $$

where I YR (i, j) = 1 if i and j share the same purine/pyrimidine status and –1 if not (as shown in the last column in Table 1). If a set of nucleotide pairs all conserve the general pattern of overabundance with purine/pyrimidine identity and underabundance with two different purine/pyrimidine nucleotides, this summation will add the positive deviations for the nucleotide pairs with the same purine/pyrimidine status and subtract the negative deviations for those nucleotides with different purine/pyrimidine status. As an example of this calculation in D. radiodurans, the pattern summary deviations for strong nucleotides NN(GC)|NN(GC) on two adjacent third site codons designated \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 3) are found by adding together the four nucleotide combinations:

$$ \begin{aligned} {\text{D}}_{1}^{\text{YR}} \left( S_{3}, S_{3} \right) & = \sum\limits_{i \in S} \sum\limits_{j \in S} D_{1} \left( i_{3}, j_{3} \right) * I_{YR}(i,j) \\ & = {\text{D}}_{1} \left( C_{3}, C_{3} \right) + {\text{D}}_{1} \left( G_{3}, G_{3} \right) - {\text{D}}_{1} \left( C_{3}, G_{3} \right) - {\text{D}}_{1} \left( G_{3}, C_{3} \right) \\ & = 14{,}180 + 14{,}464 - \left( -14{,}721 \right) - \left( -15{,}015 \right) = 58{,}380 \\ \end{aligned} $$
Table 1 Summary of all deviations D1(i 3, j 3), where statistically significant deviations are in boldface

This pattern summary deviation function shows the total deviation from expected for a set of nucleotide pairs that satisfy the general pattern, which in this case is purely dependent on the purine/pyrimidine status of the two nucleotides in question. If there are no large deviations from expected, then the pattern summary deviation will have no large terms to add constructively; likewise, if there are large deviations from expected but these deviations do not conserve the general pattern, then the deviations will add destructively and not produce an enlarged total deviation from expected. If all of the nucleotide pairs conserve the purine/pyrimidine relationship, but the pattern of over- and underabundances is reversed, the pattern summary deviation function will produce a large negative value. The sign is therefore a measure of the over- and underabundance relationship and not a measure of conservation of the purine/pyrimidine relationship.
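The D. radiodurans calculation for adjacent third-site strong nucleotides can be reproduced with a short sketch. The four deviation values are those quoted in the worked example; the dictionary layout and the function names are illustrative assumptions, not the original code.

```python
PURINES = {"A", "G"}  # R; the pyrimidines (Y) are C and T


def i_yr(i, j):
    """+1 if i and j share purine/pyrimidine status, -1 otherwise."""
    return 1 if (i in PURINES) == (j in PURINES) else -1


def pattern_summary_deviation(deviations, group_i, group_j):
    """Sum D_x(i_a, j_b) * I_YR(i, j) over all i in group_i and j in group_j.

    deviations: dict mapping (i, j) -> D_x(i_a, j_b) for one fixed distance x
                and one fixed pair of codon sites (a, b).
    """
    return sum(deviations[(i, j)] * i_yr(i, j)
               for i in group_i for j in group_j)


# D_1(i_3, j_3) for the strong nucleotides in D. radiodurans, as quoted above.
d1_site3_site3 = {
    ("C", "C"): 14_180,
    ("G", "G"): 14_464,
    ("C", "G"): -14_721,
    ("G", "C"): -15_015,
}
strong = "CG"
total = pattern_summary_deviation(d1_site3_site3, strong, strong)
print(total)  # 58380
```

Flipping the sign of every entry in the dictionary gives −58,380, which illustrates how a reversed over- and underabundance pattern produces a large negative value.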

To be able both to easily compare these results between different species and to have an intuitive feel for the magnitude of deviations present, one last normalization \( \overline{{{\text{D}}_{\text{x}}^{\text{YR}} }} \)(I a, J b) makes these results more easily conceptualized. By dividing the pattern summary deviation value by the total number of codons in that particular dataset, this percentage value quantifies the fraction of codon sites which need to have been modified beyond the expected background to generate the observed relationships.
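As a small worked illustration of this normalization, the sketch below divides the D. radiodurans pattern summary deviation for adjacent third-site strong nucleotides (58,380) by the 960,427 codons in that dataset. The function name is hypothetical.

```python
def normalized_pattern_summary(pattern_summary_deviation, total_codons):
    """Pattern summary deviation as a fraction of all codons in the dataset."""
    return pattern_summary_deviation / total_codons


# Values quoted in the text for D. radiodurans.
fraction = normalized_pattern_summary(58_380, 960_427)
print(f"{100 * fraction:.1f}% of codons")
```

The result is roughly 6%, so about 6 in every 100 codons would need to differ from the null-model background to produce the observed pattern.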

Statistical Significance

Since the expected number of counts for correlations between two nucleotides decreases linearly with increasing codon separation, each set of abundances should be well fit using a standard linear regression. Since statistically significant deviations from the expected background have been shown to occur for adjacent codons (Morton and So 2000; Moura et al. 2005; Cutler and Chantawannakul 2007), each linear regression was run for lengths ≥ 2. The residuals about this linear regression, D x (i a , j b ) for the 20 examined codon separation distances x, provide a measure of the expected variation for that given nucleotide pair and two codon sites. For each linear regression these residuals were used to calculate the standard deviation for these deviations. The t-score for each deviation was then calculated to assess the statistical significance of each correlation. To correct the comparison-wise α level for the 20 different length scales at each of the 144 multiple comparisons for each of the four examined species, the overall 0.05 α level was adjusted down to 4.34 × 10⁻⁶ (t-critical = 4.59) using a Bonferroni correction so that the type I error rate was not artificially inflated.
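The regression and significance steps can be sketched as follows. Several details are assumptions rather than statements of the original procedure: an ordinary least-squares fit over x ≥ 2, a residual standard deviation computed with n − 2 degrees of freedom, and t-scores taken as deviation divided by that standard deviation. The Bonferroni arithmetic uses only the numbers given in the text (20 length scales × 144 comparisons × 4 species).

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviation_t_scores(counts_by_x, fit_from=2):
    """t-scores of D_x for one nucleotide pair at one pair of codon sites.

    counts_by_x: dict mapping codon separation x -> observed count N_x.
    The background line is fitted only on x >= fit_from, then extrapolated
    to every x (including x = 1) to obtain the deviations.
    """
    xs = sorted(x for x in counts_by_x if x >= fit_from)
    slope, intercept = fit_line(xs, [counts_by_x[x] for x in xs])
    deviations = {x: n - (slope * x + intercept) for x, n in counts_by_x.items()}
    residual_ss = sum(deviations[x] ** 2 for x in xs)
    sd = (residual_ss / (len(xs) - 2)) ** 0.5  # assumed estimator
    if sd == 0:
        raise ValueError("perfectly linear background; t-scores undefined")
    return {x: d / sd for x, d in deviations.items()}


# Bonferroni-adjusted alpha: 0.05 / 11,520 comparisons, about 4.34e-6.
BONFERRONI_ALPHA = 0.05 / (20 * 144 * 4)
T_CRITICAL = 4.59  # value quoted in the text

# Toy usage: a noisy linear background with an excess at x = 1.
toy_counts = {x: 1000 - 10 * x + (1 if x % 2 == 0 else -1) for x in range(2, 21)}
toy_counts[1] = 1490
significant = {x for x, t in deviation_t_scores(toy_counts).items()
               if abs(t) > T_CRITICAL}
```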

Results and Discussion

Abundances of Spatially Correlated Weak (AT) Nucleotides on Two Third Site Codons Up to 20 Codons Apart

As an example of the background behavior of nucleotide co-occurrence, Fig. 1 displays the abundances of all nucleotide pairs, in the Deinococcus radiodurans genome, for two weak (AT) nucleotides residing at the third codon sites of two codons separated by 1 to 20 codons. As expected, the total number of counts decreases linearly with increasing separation of the two codons, and no large deviations from expected are observed for nucleotide co-occurrence of the form N x (i 3, j 3), where i and j in this case are weak nucleotides.

Fig. 1

Example of background correlation abundance behavior for site 3–site 3 weak nucleotides NN(AT)|…|NN(AT), where x is the distance between the two codons and runs from 1 to 20. The expected number of counts decreases approximately linearly as the distance between codons increases

Codon Site 3–Codon Site 3 Nucleotide Co-Occurrence for Strong Nucleotides on Length Scales of 1 to 20 Codons

For pairs of strong (GC) nucleotides at the third site of one codon and the third site of a second codon separated by up to 20 codons, the absolute abundances N x (i 3, j 3 ) for the Deinococcus radiodurans genome are shown in Fig. 2a.

Fig. 2

Co-occurrence patterns for third site–third site strong (GC) nucleotides in D. radiodurans. (a) Absolute counts for all third site–third site strong nucleotide pairs for a given x codons apart. (b) Deviation from expected for the total abundances in (a). Note the large deviations for D1 (C 3, C 3), D1 (C 3, G 3), D1 (G 3, C 3), and D1 (G 3, G 3), which have a total magnitude of deviation of 58,380 counts for these four nucleotide pairs. The same pattern is conserved in all four species, with total deviations of 45,972 for T. thermophilus HB8, 46,008 for T. thermophilus HB27, and 41,318 for D. geothermalis

On the x = 1 length scale (i.e., adjacent codons) there are obvious deviations from the expected background where the nucleotide pair abundances N1 (G 3, C 3 ) and N1 (C 3, G 3 ) are underabundant and the nucleotide pairs N1 (G 3, G 3 ) and N1 (C 3, C 3 ) are overabundant. The sizes of these deviations are found by subtracting the background obtained from the linear regression of the abundances in Fig. 2a from the actual abundances. For example, the nucleotide abundances N x (C 3, C 3) shown in Fig. 2a all have a magnitude of about 210,000 counts except for N1 (C 3, C 3 ), which is overabundant by about 15,000 counts. The deviations from expected for these abundances Dx (C 3, C 3) graphed in Fig. 2b show that all of the largest deviations from expected occur for nucleotide co-occurrence on the x = 1 length scale.

To determine whether the large deviations from expected for nucleotide pairs of the form D1 (i 3, j 3), where i and j are strong nucleotides, could be due to codon abundance effects within these species, the same total count of nucleotides Nx (i 3, j 3) as shown in Fig. 2a was determined for the randomly generated sequences, which were based purely on the codon frequencies in each respective species. Figure 3a shows the total counts Nx (i 3, j 3) for the randomly generated sequences based on the D. radiodurans codon abundances. As shown there, the magnitude of the counts for strong–strong nucleotides at the third sites of two codons separated by up to 20 codons is approximately the same number of counts as for the D. radiodurans genome, shows the same expected decrease in abundance with increasing separation (x), and shows no apparent large deviations for adjacent codons as found for the D. radiodurans abundances. The deviations from expected D x (i 3, j 3) shown in Fig. 3b further verify that there are no large deviations from the expected background due solely to the codon abundances in the D. radiodurans genome. The same exact lack of large deviations for pairs of strong nucleotides was found for the generated sequences of all four species. In fact, in looking at all nucleotide pair deviations for any pair of nucleotides at any pair of codon sites for codon separations of up to 20 codons, not a single large deviation from the expected background was found for any of the generated sequences. So the large deviations from expected in Fig. 2b are not due to codon abundances, but are due to an increase in the occurrence of adjacent codons which satisfy the third codon site–third codon site relationship (i.e., identical nucleotides) versus those that do not satisfy this relationship.

Fig. 3

Co-occurrence patterns of third site–third site strong nucleotides for the randomly generated sequences with codon usage identical to that of D. radiodurans. (a) Absolute counts for all third site–third site strong nucleotide pairs with a separation of x codons. (b) Deviation from expected for the total abundances in (a). Note the lack of any large deviations from expected. The generated genomes for all species of the class Deinococci show the same lack of large deviations from expected for all nucleotide pairs at all codon sites at all length scales

One way to quantify the total deviation from expected for a given set of deviations is to add together the magnitudes of the individual deviations. In D. radiodurans this yields a total deviation of 58,380 counts from the expected background for the combination of these four nucleotide pair deviations on the x = 1 length scale. The magnitude of the deviations D1 (i 3, j 3) shown in Fig. 2b quantifies how many times this relationship is satisfied beyond the null-model background. In the D. radiodurans genome, these four strong–strong nucleotide pair deviations correspond to approximately 58,380 additional adjacent codons which have been modified to preserve these relationships. Since there are approximately 960,000 codons in the D. radiodurans genome, another way to view the deviations D1(i 3, j 3) is as a percentage of codons sharing the modified relationship beyond the expected background versus the total number of codons. In this case, the 58,380 total nucleotide counts deviating from the expected background account for approximately 6% of the total number of codons, which means that about 6% of all adjacent codons in the D. radiodurans genome need to have been modified to generate the nucleotide pair patterns shown in Fig. 2a and b.

For these four nucleotide pairs, the same pattern of over- and underabundance is conserved in all four species, with similar total deviations of 45,972 (6.9%) for T. thermophilus HB8, 46,008 (6.9%) for T. thermophilus HB27, and 41,318 (4.6%) for D. geothermalis. Since the magnitudes of each of these four deviations for a given species are not statistically different from each other, the deviations from expected are equal and opposite for the overabundant NNG|NNG and NNC|NNC nucleotide pairs versus the underabundant NNG|NNC and NNC|NNG nucleotide pairs.

For pairs of strong nucleotides at the third sites of two codons (i 3 and j 3) separated by up to 20 codons, the only additional statistically significant deviation in any of the four species other than D1(i 3, j 3 ) is for nucleotides separated by two codons D2(i 3, j 3) in the two Thermus strains. The pattern of over- and underabundances for the D2(i 3, j 3) deviations is exactly the same as for the D1(i 3, j 3) deviations (for example, NNG|NNN|NNG is overabundant), with slightly reduced total deviations of 34,033 counts (5.1%) in T. thermophilus HB8 and 34,777 counts (5.2%) in T. thermophilus HB27.

Conserved Patterns of Context-Dependent Synonymous Codon Usage

Within the class Deinococci, for all nucleotide pairs at distances of up to 20 codons, there is a wide variety of statistically significant over- and underabundant deviations which all occur for nucleotide pairs at most one codon apart. Since no deviations from expected were found for any of the randomly generated sequences, which were based on the codon abundances in each respective genome, these statistically significant over- and underabundant nucleotide pairings are not due to a codon bias effect but, instead, are due to the positioning of adjacent codons in a context-dependent manner. In addition, not a single statistically significant deviation was found for nucleotide pairs separated by more than a single codon in any of the four Deinococci genomes.

As shown above, one abundance relationship conserved in all four Deinococci genomes is of the form D1(i 3, j 3 ), with i and j being strong nucleotides, where NNG|NNG and NNC|NNC combinations are overabundant and NNG|NNC and NNC|NNG combinations are underabundant. To further quantify the general pattern of over- and underabundant deviations from expected, the complete set of all D1(i 3, j 3 ), for i and j being any of the four nucleotides, is shown in Table 1, where those deviations which are statistically significant have a shaded background. As an example of the general pattern of over- and underabundance found pervasively in the Deinococci, Table 1 has been arranged with the nucleotide pairs sharing the same purine/pyrimidine status at the top (NNR|NNR and NNY|NNY), versus the nucleotide pairs containing both a purine and a pyrimidine (NNR|NNY and NNY|NNR), at the bottom. As can be seen there, every statistically significant nucleotide pair deviation D1(i 3, j 3 ) conserves the overabundance pattern for codon pairs NNY|NNY and NNR|NNR and the underabundance pattern for codon pairs NNY|NNR and NNR|NNY. Due to the low number of counts for nucleotide pairs containing a weak nucleotide, it is not surprising that several of these deviations are not significant, particularly for underabundance relationships, since even a zero abundance in some cases would fall within the expected range. Even the nonsignificant deviations, however, share the “correct” abundance relationship in most cases. As found in the honeybee genome (Cutler and Chantawannakul 2007), these four genomes show a pattern of over- and underabundance based on the biochemical nature of the two nucleotides in question, but in these bacterial species, the general pattern depends only on the purine/pyrimidine status of the two nucleotides in question.
The second-to-last row in Table 1 reports the total magnitude of the nucleotide pair deviations \( {\text{D}}_{ 1}^{\text{YR}} \) (N 3, N 3) for adjacent third codon sites and all nucleotide pair combinations in the four species of the class Deinococci.

As shown in the last row in Table 1, for third site–third site adjacent codons the total pattern summary deviations range from 5.3% to 8.4% of the total number of codons in these four species. For the randomly generated sequences with the codon abundances for each of the four species, the values of \( \overline{{{\text{D}}_{\text{x}}^{\text{YR}} }} \)(I a, J b) range from nearly zero to ± 0.1%, with a standard deviation of 0.04%, while for the actual genomes the larger standard deviation of 0.3% was found. The magnitude of the pattern summary deviations in these four species shows that there is some mechanism replacing nucleotides in a concerted fashion based on the biochemical nature of the surrounding nucleotides. For the honeybee genome, we proposed that a similar deviation result was due to a mechanism which most likely functions to repair single point mutations based on the surrounding nucleotide context. Due to the remarkable ability of bacteria in the class Deinococci to survive extreme environments and high levels of mutation, we chose these species to study because, if a similar mechanism were at work here as well, it should have a greatly pronounced effect, which could shed light on changes in amino acid composition, base abundances, and modified synonymous codon usage.

Context Dependence in the T. thermophilus vs. D. radiodurans Species

In the class Deinococci, the largest deviations from the expected nucleotide pair abundances were found to contain at least one strong nucleotide (GC) residing on a site 3 codon. This is most likely due to the fact that the strong nucleotides in all four species are by far the most abundant at the third codon site as shown in Fig. 4. Note that, in addition to the abundances of nucleotides at the three codon sites in each genome, the error bars in Fig. 4 indicate the corresponding nucleotide abundances found for the generated sequences based solely on the codon abundances for each species.

Fig. 4

Nucleotide abundances for all nucleotides at each of the three codon sites for (a) Deinococcus radiodurans, (b) Deinococcus geothermalis, and (c) Thermus thermophilus. The error bars show the nucleotide abundances for the sequences generated using the codon abundances from each species. Since the results for the two strains of T. thermophilus were nearly identical in all respects, the data for these two genomes were grouped to represent the species T. thermophilus

Due to the high percentage of strong nucleotides at the third codon site, the largest over- and underabundant nucleotide pair deviations always contain this site and generally are composed of two strong nucleotides. In the T. thermophilus and D. radiodurans genomes, all of the largest nucleotide pair deviations from expected occur in groups of four, with two overabundant and two underabundant nucleotide pair deviations. Since in every case these sets of four nucleotide pairs consistently satisfy the purine/pyrimidine relationship presented above, the pattern summary deviations for each of these groups displayed in Fig. 5a–d show the total magnitude of deviation for these significant deviations. For the strong third site–strong third site deviations NN(GC)|NN(GC) discussed above, the over- and underabundance relationship in all four species was identical as shown in Fig. 5c, but as shown in Fig. 5a and b the abundance relationships for \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 1) and \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, W 2) are exactly reversed in the T. thermophilus versus the D. radiodurans genomes. If these relationships were due to a biochemical optimization (such as optimizing translational efficiency or stacking energies), one set of total over- or underabundances would be favored, as was found for all the \( {\text{D}}_{ 1}^{\text{YR}} \) (N 3, N 3) deviations. Since both diametrically opposed abundance relationships are present for \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 1) and \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, W 2), these observed deviations cannot be due to a selective force for a biochemical optimization.

Fig. 5

The largest deviations from the expected background in D. radiodurans and T. thermophilus, where significant deviations from expected are circled. Note the reversed abundance relationships for \( {\text{D}}_{\text{x}}^{\text{YR}} \) (S 3, S 1) and \( {\text{D}}_{\text{x}}^{\text{YR}} \) (S 3, W 2)

To test further for the presence of a possible selective force creating the purine/pyrimidine relationship, we partitioned the genes from the T. thermophilus and D. radiodurans genomes into low, medium, and high expression level datasets. For the largest significant deviations from expected shown in Fig. 5a–d, the normalized pattern summary deviations were generated for each of these expression level datasets (Table 2). As mentioned earlier, in T. thermophilus there are large deviations from expected for third site–third site nucleotide pairs on both the x = 1 and x = 2 length scales, but only the x = 1 deviations are significant in D. radiodurans. The normalized pattern summary deviations for both of these sets of deviations are shown in Table 2, with the nonsignificant deviations for D. radiodurans marked with daggers. As these data show, the overall pattern of deviations from expected is conserved in genes with low, medium, and high expression levels, but as the expression levels increase, the strength of the deviations from expected decreases. Since the standard deviation of expected values for these normalized deviations is ~0.3%, a magnitude of deviation >1.4% shows a significant set of deviations from expected. For the largest deviations \( \overline{{{\text{D}}_{ 1}^{\text{YR}} }} \)(S 3, S 1) in T. thermophilus, the difference in deviation between the low and the medium datasets is 2.3%, and that between the medium and the high datasets is 2.8%, so these decreases in the strength of the normalized pattern summary deviations show significant changes with respect to increasing levels of gene expression. This further evidence that the purine/pyrimidine abundance relationships are not due to selection implies that these relationships are being set by context-dependent replacement of mutated nucleotides in these species.
One additional set of patterns conserving the purine/pyrimidine-dependent abundance deviations sheds further light on the probable functioning of this context-dependent nucleotide replacement.
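The significance argument above can be restated as a small calculation. The following is only a sketch using the figures quoted in the text (a standard deviation of about 0.3% and a significance cutoff of 1.4%); the function and variable names are illustrative assumptions, not part of the original analysis.

```python
# Figures quoted in the text, in percent of codons per dataset.
SIGMA_EXPECTED = 0.3   # approximate standard deviation of expected values
THRESHOLD = 1.4        # magnitude the text treats as significant


def is_significant(deviation_pct, threshold=THRESHOLD):
    """Return True when the magnitude of a deviation exceeds the cutoff."""
    return abs(deviation_pct) > threshold


def in_sigma_units(deviation_pct, sigma=SIGMA_EXPECTED):
    """Express a deviation as a multiple of the expected standard deviation."""
    return abs(deviation_pct) / sigma


# Changes in the (S 3, S 1) summary deviation between expression classes
# in T. thermophilus, as quoted in the text.
changes = {"low_to_medium": 2.3, "medium_to_high": 2.8}
report = {
    name: (is_significant(value), round(in_sigma_units(value), 1))
    for name, value in changes.items()
}
```

Under these assumptions both changes exceed the stated cutoff, at roughly eight and nine standard deviations, respectively.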

Table 2 Summary of all large deviations from expected for the D. radiodurans and T. thermophilus species normalized to the total number of codons in each dataset

By far the largest deviations in all four species occur for strong nucleotide pairs residing at adjacent third and first codon sites. For example, in the T. thermophilus HB8 genome, the magnitudes of these four strong–strong nucleotide pair deviations are −50,381 for D1(C 3, G 1), 51,936 for D1(G 3, G 1), 31,659 for D1(C 3, C 1), and −28,025 for D1(G 3, C 1). Another property of all of the large deviations summarized in Fig. 5 is that when all four magnitudes are not the same, the deviations occur in pairs of approximately equal and opposite size, with the two similar-magnitude nucleotide pair deviations sharing the same second nucleotide. Note the approximately equal and opposite magnitudes for nucleotide pairs with a strong nucleotide followed by either a C or a G nucleotide in T. thermophilus. This suggests that nucleotides at the third codon site are being replaced based on the context of the adjacent fourth codon site. The same “forward checking” relationship holds for the strong site 3–weak site 2 NN(YR)|N(YR)N deviations. For example, in T. thermophilus, the magnitudes of the NNC|NTN and NNG|NTN deviations are twice as large as the NNC|NAN and NNG|NAN deviations. Since the deviations for the third site–third site nucleotide pairs have approximately equal magnitude, these deviations do not shed any light on the directionality of the context-dependent nucleotide replacement. In summary, because the purine/pyrimidine-dependent deviations from the expected background nucleotide pair abundances are very large, the expression level data are able to show an increased deviation effect with decreasing levels of selection. In addition, similarities in the magnitudes of pairs of deviations within the sets of four individual nucleotide pairs suggest that the context-dependent nucleotide replacement is based on the downstream nucleotide context. In the next two sections, additional readily observed genomic differences due to this context-dependent nucleotide replacement are presented.
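To make the third site–first site pair deviations concrete, the sketch below counts nucleotide pairs spanning adjacent codons and subtracts an expected count. It assumes that the expected background is the product of the site-specific nucleotide frequencies (independence of the two sites); the exact background model used for the reported D1 values is defined elsewhere in the paper and may differ. The function name and input format are hypothetical.

```python
from collections import Counter


def third_first_pair_deviations(genes):
    """Observed minus expected counts for (third site, next first site) pairs.

    genes: iterable of in-frame coding sequences (strings over A, C, G, T).
    Expected counts assume the two sites are independent, which is an
    assumption of this sketch rather than the paper's stated model.
    """
    pairs, third, first = Counter(), Counter(), Counter()
    for seq in genes:
        seq = seq.upper()
        n_codons = len(seq) // 3
        for i in range(n_codons - 1):
            a = seq[3 * i + 2]      # third site of codon i
            b = seq[3 * (i + 1)]    # first site of codon i + 1
            pairs[(a, b)] += 1
            third[a] += 1
            first[b] += 1
    total = sum(pairs.values())
    deviations = {}
    for a in third:
        for b in first:
            expected = third[a] * first[b] / total
            deviations[(a, b)] = pairs[(a, b)] - expected
    return deviations


# Toy input (not real genome data): codons GCC GAC GCG CAC.
toy = third_first_pair_deviations(["GCCGACGCGCAC"])
```

Because the expected counts are built from the same marginal totals as the observed counts, the deviations sum to zero, which is consistent with the equal-and-opposite pairing of deviations described above.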

Context-Dependent Synonymous Codon Usage

The nucleotide co-occurrence results presented above show that there are several conserved patterns of over- and underabundant nucleotide pairs within the class Deinococci, including \( {\text{D}}_{ 1}^{\text{YR}} \) (N 3, N 3) (which is conserved in all four tested species), \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 1), and \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, W 2) (which have opposite abundance relationships for the Thermus versus the Deinococcus species). The resulting synonymous codon usage changes are independent of the codon bias for a particular codon and correspond to a shift from synonymous codons which do not satisfy the above relationships to ones which do. As an example of these synonymous codon changes, Fig. 6a–d display the counts of alanine codons (GCN) followed by either an aspartic acid (GAY) or a histidine (CAY) codon at distances of up to 20 codons for the T. thermophilus HB8 and D. radiodurans genomes. Since the purine/pyrimidine status of the first codon site of these two codons is exactly reversed, the synonymous codon usage of the adjacent alanine codon is modified depending on whether the alanine codon ends in a purine or a pyrimidine nucleotide. Comparing Fig. 6a versus 6b and Fig. 6c versus 6d shows that alanine codons adjacent to histidine versus aspartic acid codons have a large and opposite synonymous codon shift dependent on the purine/pyrimidine status of the alanine third codon site. In addition, since the \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 1) deviations have an opposite over(under-)-abundance relationship in T. thermophilus HB8 versus D. radiodurans, the change in synonymous codon usage is exactly opposite for alanine codons followed by a histidine or aspartic acid codon in the two species, as seen by comparing Fig. 6a versus 6c and Fig. 6b versus 6d.
In all four cases, there are obvious deviations in the abundances of alanine codons adjacent to histidine or aspartic acid codons, which show a shift from synonymous codons which do not satisfy the preferred purine/pyrimidine nucleotide pairs to ones which do.
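The codon spacing counts described above can be sketched as follows. This is a minimal illustration, assuming in-frame coding sequences as plain strings; the function name, the return layout, and the toy input are hypothetical and are not taken from the original analysis.

```python
PYRIMIDINES = {"C", "T"}
PURINES = {"A", "G"}
ASP_CODONS = {"GAC", "GAT"}   # aspartic acid
HIS_CODONS = {"CAC", "CAT"}   # histidine


def alanine_counts_before(genes, target_codons, max_distance=20):
    """Count alanine codons (GCN) found 1..max_distance codons before a target.

    Returns {distance: {"Y": n, "R": n}}, split by whether the alanine third
    site is a pyrimidine (Y) or a purine (R).
    """
    counts = {d: {"Y": 0, "R": 0} for d in range(1, max_distance + 1)}
    for seq in genes:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        for j, codon in enumerate(codons):
            if codon not in target_codons:
                continue
            for d in range(1, min(max_distance, j) + 1):
                upstream = codons[j - d]
                if not upstream.startswith("GC"):
                    continue
                if upstream[2] in PYRIMIDINES:
                    counts[d]["Y"] += 1
                elif upstream[2] in PURINES:
                    counts[d]["R"] += 1
    return counts


# Toy input (not real genome data): codons GCC GAC GCG CAC.
before_asp = alanine_counts_before(["GCCGACGCGCAC"], ASP_CODONS)
before_his = alanine_counts_before(["GCCGACGCGCAC"], HIS_CODONS)
```

Plotting the "Y" and "R" series against distance for each target set would give curves analogous to the open circles and stars in Fig. 6, with distance 1 corresponding to the adjacent codon.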

Fig. 6

The total count of alanine codons 1 to 20 codons before (a, c) aspartic acid codons (GAC and GAT) or before (b, d) histidine codons (CAC and CAT). The codon–codon abundances are for (a, b) T. thermophilus and (c, d) D. radiodurans. The filled diamonds display the total count of all alanines, open circles display the count of alanine codons with a third site pyrimidine, and stars display the count of alanine codons with a third site purine. Note the opposite over(under-)-abundances for adjacent synonymous alanine codons followed by either a histidine or an aspartic acid due to the opposite purine/pyrimidine status of the first nucleotides in histidine versus aspartic acid. Also note the opposite over(under-)-abundance relationships for the same adjacent synonymous codons in D. radiodurans versus T. thermophilus

When pairs of synonymous codons satisfy multiple purine/pyrimidine relationships simultaneously, the over(under-)-abundance of deviations from expected can be even further strengthened. For example, when a valine codon GTC is paired with a glutamic acid codon GAG, the third site C nucleotide is a pyrimidine and all of the glutamic acid nucleotides are purines, so this combination is repressed even more strongly because of the concurrent differences in purine/pyrimidine status (Fig. 7). As Fig. 7 shows, this combination is almost completely repressed, with only 5 GTC-GAG occurrences in the entire genome. By just changing the C to a G nucleotide at the third site of the valine codon, all of the purine/pyrimidine overabundance relationships are satisfied and the dicodon pair GTG-GAG occurs 4236 times, which is an overabundance of 65% from the expected background.
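A dicodon comparison of this kind can be sketched as below. The expected count here is assumed to be the number of adjacent codon pairs multiplied by the overall frequencies of the two codons; this is a simplifying assumption of the sketch, and the expected background behind the reported 65% figure may have been computed differently. All names are hypothetical.

```python
from collections import Counter


def dicodon_overabundance(genes, first_codon, second_codon):
    """Return (observed, expected, percent_over) for an adjacent codon pair.

    expected assumes independent codon choice (an assumption of this sketch).
    percent_over is None when the expected count is zero.
    """
    codon_counts = Counter()
    observed = 0
    n_pairs = 0
    for seq in genes:
        seq = seq.upper()
        usable = len(seq) - len(seq) % 3
        codons = [seq[i:i + 3] for i in range(0, usable, 3)]
        codon_counts.update(codons)
        for left, right in zip(codons, codons[1:]):
            n_pairs += 1
            if left == first_codon and right == second_codon:
                observed += 1
    n_codons = sum(codon_counts.values())
    if n_codons == 0:
        return observed, 0.0, None
    expected = (
        n_pairs
        * (codon_counts[first_codon] / n_codons)
        * (codon_counts[second_codon] / n_codons)
    )
    percent_over = 100.0 * (observed - expected) / expected if expected else None
    return observed, expected, percent_over


# Toy input (not real genome data): codons GTG GAG GTG GAG.
obs, exp, pct = dicodon_overabundance(["GTGGAGGTGGAG"], "GTG", "GAG")
```

Running the same function for GTC-GAG and GTG-GAG on a full set of coding sequences would give the two quantities contrasted in the paragraph above.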

Fig. 7

The total count of selected codons showing an absolute suppression of co-occurrence for adjacent codons due to a lack of satisfying the purine/pyrimidine relationship at three different codon sites concurrently. With more than 4000 expected occurrences in the genome, these three codon combinations occur only 63 times

Mutation and CpG Methylation Bias

For adjacent CpG nucleotides, it has been proposed that methylation of the cytosine nucleotide, which can lead to a CpG-to-TpG mutation, could be responsible for deviations leading to observed codon biases (Coulondre et al. 1978). This mechanism, which works to decrease adjacent CpG nucleotides, cannot explain why some genomes such as D. radiodurans actually show an increased CpG total abundance. In addition to the D. radiodurans genome, high CpG abundances have been found in some halobacterial sequences, α-proteobacteria, and Neisseria gonorrhoeae (Karlin et al. 1997). In higher organisms, the honey bee (Apis mellifera) (The Honey Bee Genome Sequencing Consortium 2006; Cutler and Chantawannakul 2007) and red alga (Cyanidioschyzon merolae) genomes (Matsuaki et al. 2004) also have overabundant CpG dinucleotides compared with the expected nucleotide frequencies. In contrast to D. radiodurans, the Thermus genus (White et al. 1999) is known to have low dinucleotide CpG abundances.

The fact that the purine/pyrimidine relationships \( {\text{D}}_{ 1}^{\text{YR}} \)(S 3, S 1) in D. radiodurans and T. thermophilus are exactly reversed could account for the relative CpG abundances in these two species if a context-dependent nucleotide replacement mechanism is at work. Although such a mechanism could change dinucleotide deviations (and thereby the total CpG abundances), as shown in Fig. 2, it has no effect on the nucleotide abundances in the respective species since there are essentially no differences in nucleotide abundance between these two species.

At present for CpG dinucleotides residing at adjacent third–first codon sites, there is an overabundance of 15,313 counts in D. radiodurans and an underabundance of 50,381 counts in T. thermophilus. This corresponds to a total change of more than 65,000 dinucleotide pairs between these two species. If these effects were due to a mutational bias due to local nucleotide context, this bias would lead to a similar increase or decrease in a particular nucleotide environment. The action of CpG-to-TpG spontaneous mutations would tend to decrease the overabundance of these deviations in the D. radiodurans species and enhance the underabundance for the Thermus species, which cannot account for the relative overabundance in D. radiodurans.

Context Dependence in the Class Deinococci

In T. thermophilus HB8, statistically significant purine/pyrimidine abundance relationships were found for the five codon site combinations shown in Table 2. These summaries were for the largest deviations in both species, which were mainly for strong–strong nucleotide pairs. The general pattern, however, for adjacent third codon–third codon site nucleotides is purely dependent on purine/pyrimidine status as discussed previously. The significant deviations for T. thermophilus HB8 and T. thermophilus HB27 are nearly identical, most likely due to the fact that these are two strains of the same species which are still very closely related. The total differences in deviation for the largest purine/pyrimidine patterns \( \overline{{{\text{D}}_{ 1}^{\text{YR}} }} \)(N 3, N 3), with 8.4% and 8.3% deviations, and \( \overline{{{\text{D}}_{ 1}^{\text{YR}} }} \)(N 3, N 1), with 32.7% and 32.6% total deviations, show that the total magnitudes of deviations are nearly identical in all cases.

In D. geothermalis the overall magnitudes of the largest deviations were much smaller than those found in D. radiodurans. For example, by far the two strongest deviations in both species occurred for the D1 (C 3, C 1) and D1 (G 3, C 1) nucleotide pair deviations, which in D. radiodurans had deviations of −32,379 and 29,400 counts, whereas in D. geothermalis the deviations were −16,789 and 12,222 counts, respectively. This corresponds to a reduction in the magnitude of these deviations for D. geothermalis by roughly half (about 48% and 58%, respectively) relative to D. radiodurans. If the mechanism responsible for setting these relationships is related to the correction of point mutations, the increased magnitude of deviations in D. radiodurans could be due to environmental differences between these two species.

Although the absolute deviations of the context-dependent synonymous codon deviations in D. geothermalis were smaller than in D. radiodurans, the relative strengths and pattern of over(under-)-abundant deviations were generally conserved with the exception of the two nucleotide pairs D1 (C 3, G 1) and D1 (G 3, G 1), which together did not reach the magnitude required for a significant deviation from expected. Note that these two deviations form one of the strong–strong pairs with identical downstream nucleotides, which previously suggested that the context-dependent sampling was using downstream nucleotides to affect the replacement of the upstream nucleotide from one which does not satisfy the purine/pyrimidine relationship to one which does. The fact that the abundance relationship has been lost for this pair further suggests that context-dependent replacement of nucleotides is based on the downstream nucleotide. Correlating all of the large nucleotide deviations, except these two, in D. radiodurans and D. geothermalis shows a significant correlation of the respective deviations in these two species, with an \( R^{2} \) value of 91.5%.
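For readers who wish to reproduce a correlation of this kind, the sketch below computes the squared Pearson correlation between two equal-length lists of deviations. It assumes that the reported value is a squared Pearson coefficient from a simple linear fit, which the text does not state explicitly; the toy numbers are illustrative only and are not the deviations from either genome.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


# Toy values only: a perfectly linear relationship and a weaker one.
perfect = r_squared([1, 2, 3, 4], [2, 4, 6, 8])
weak = r_squared([1, 2, 3], [1, 3, 2])
```

In practice, xs and ys would hold the matched deviations for the same nucleotide pairs in D. radiodurans and D. geothermalis, with the two excluded pairs left out of both lists.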

Conclusion

In the class Deinococci, statistically significant nucleotide–nucleotide correlations exist within codon regions for codon sites up to six nucleotides apart. The majority of the large deviations from expected occur between strong–strong nucleotides where one of the two codon sites is always the third codon site and the conserved pattern of over- and underabundances relates to whether the two bases share the same purine/pyrimidine status or not. Some strong co-occurrence patterns such as \( {\text{D}}_{ 1}^{\text{YR}} \) (N 3, N 3) are conserved in all tested species, while other patterns such as \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, W 2) show large deviations from the expected background but have completely reversed patterns of over- and underabundances in the Thermus versus the Deinococcus lineages. For datasets with low, medium, and high gene expression levels, significant large deviations satisfying the purine/pyrimidine-dependent deviation effect were observed in all cases with reduced magnitude with increasing expression levels (Berg and Martelius, 1995). These deviations are not due to codon abundances but, instead, are due to the accumulation of synonymous codons which satisfy the respective purine/pyrimidine relationship versus those that do not. In all cases, though, it is the biochemical purine/pyrimidine property which the context-dependent replacement of nucleotides is dependent on.

Since the effect presented here is not due to selective pressure, but rather the replacement of mutated nucleotides in a context-dependent fashion, the decision to study the class Deinococci capitalized on these species’ ability to survive high levels of genomic mutation. Although this effect is clearly and strongly present in the class Deinococci, it is not unique to these species. A similar effect has been found in the honey bee genome (Cutler and Chantawannakul 2007) and is present in a variety of other bacteria as well. For example, in E. coli K12 both the third site–third site NN(GC)|NN(GC) and the third site–first site NN(GC)|(GC)NN abundance patterns are conserved, but with an overall reduced deviation for the strongest third site–first site deviations of only 4.97% (data not shown), compared to the 24.4% total deviation found in T. thermophilus HB8. The larger deviations found in class Deinococci are most likely due to the increased amount of mutation these bacteria have survived, providing a clear and striking example of the purine/pyrimidine context-dependent effect.

In conclusion, we present an example of a form of mutation-biased synonymous codon usage dependent on the surrounding nucleotide context. The conserved pattern of over- and underabundances is solely dependent on the purine/pyrimidine status of local nucleotides. Due to the symmetries of the nucleotide replacement, no changes in overall nucleotide abundances at any of the three codon sites have occurred, but dinucleotide abundances such as CpG abundances have been consistently modified. The mechanism responsible for the patterns described here has dramatically shifted the direction of synonymous codon evolution in these species. This suggests that a small change in a single mutation-replacement abundance relationship, such as the shift from an overabundance of \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 1) in T. thermophilus to an underabundance of \( {\text{D}}_{ 1}^{\text{YR}} \) (S 3, S 1) in D. radiodurans, could lead to genome-wide differences in synonymous codon usage and possibly provide a bias to the process of random drift. As further evidence for a mechanism which may be repairing single point mutations in a context-dependent manner, the class Deinococci shows strong deviations from the expected background nucleotide and codon abundances as expected (which was the reason these species were chosen), possibly due to the remarkable ability of these bacteria to survive extreme environments and highly damaging radiation.