Introduction

CpG deficiency was first observed in vertebrates (Josse et al. 1961; Swartz et al. 1962), then in some species of archaea, bacteria, and fungi, as well as in mitochondria belonging to many organisms (Cardon et al. 1994; Karlin et al. 1998). CpG dinucleotides play an important role in cell differentiation and in the regulation of gene expression in vertebrates (Bestor 1990). CpG deficiency can also influence codon usage bias (De Amicis and Marchetti 2000) and the relative abundance of oligonucleotides, thereby indirectly affecting a variety of cell functions. This triggered many studies aiming at understanding genome base composition biases (Karlin et al. 1998). Several hypotheses have been put forward to explain CpG deficiency, including counter-selection at the translation level (Subak-Sharpe et al. 1966), DNA methylation (Bird 1980), DNA structural constraints (Antri et al. 1993), DNA–protein interaction, and stressful environments (Karlin et al. 1994b). Among them, DNA methylation is the most popular hypothesis.

Cytosine deamination is a major cause of mutation in living organisms, especially in open DNA structures (for recent references and discussion see Lobry and Sueoka 2002). It is, however, readily repaired, because deamination yields uracil, which is recognized as foreign to DNA and removed. Methylated cytosine is widely documented to be even more prone to spontaneous deamination, which produces a transition to thymine, a natural DNA base (Coulondre et al. 1978). Such mutations are hard to repair (Coulondre et al. 1978). Since methylated cytosines were predominantly found within CpG dinucleotides in vertebrates, CpG deficiency was naturally linked to CpG methylation (Bird 1980). The presence of highly methylated CpG dinucleotides in both male and female germ cells provided strong evidence for the relationship between DNA methylation and CpG deficiency in the human genome (El-Maarri et al. 1998). However, cytosine methylation may not be the ultimate or only explanation for CpG deficiency. For example, CpG deficiency in most mitochondrial genomes is unlikely to be related to DNA methylation, because no DNA methylase has yet been discovered in these organelles. One of the few reports on methylation in mitochondria identified RNA methylation by a nucleus-encoded RNA adenine methyltransferase (McCulloch et al. 2002). CpG deficiency was also found in many bacterial species and their phages (Karlin et al. 1994a, 1997), where cytosine methylation is not widespread (see below).

This prompted us to revisit the association between DNA methylation and CpG deficiency in bacterial genomes. In bacteria, DNA methylation is generally associated with restriction-modification systems (RM systems) (Wilson 1988). These elements may prevent the invasion of the cell by bacteriophages. So far, more than 2000 different RM systems have been identified and over 700 methyltransferases are known to recognize at least 300 different DNA sites (http://www.neb.com/rebase) (Roberts and Macelis 2001). Three kinds of DNA methylation are found in bacteria: N6-adenine methylation, N4-cytosine methylation, and C5-cytosine methylation (Bestor 1990). In this report, we focus on C5 cytosine-specific methylation, the same DNA methylation process that is assumed to induce CpG deficiency in eukaryotes. Because the functions and recognition sites of DNA methylation are far more varied in bacteria than in vertebrates, DNA methylation is unlikely to play a common role in all bacterial genomes. This was previously suggested in a study of the Mycoplasma genitalium genome, in which CpG deficiency was suspected to be unrelated to DNA methylation (Goto et al. 2000). The suspicion was based on the finding that the high substitution rate from C to T was not specific to CpG and TpG dinucleotides and on the fact that no methylation activity had been reported in mycoplasmas (Goto et al. 2000). In the present study, we further document that deamination of methylated cytosine is probably not the reason for CpG deficiency in bacteria.

Methods

Sources of Data

First, the fully sequenced bacterial genomes were surveyed, after being retrieved from the NCBI (http://www.ncbi.nlm.nih.gov). We searched for potential C5 methyltransferase genes using the annotation files. When such a cytosine methyltransferase was identified, the bacterial identification was used to search for the corresponding enzyme in the REBASE database (http://rebase.neb.com) (Roberts and Macelis 2001). Almost all the cytosine methyltransferases were C5 methyltransferases, and the one case of N4 cytosine methylation was discarded. When more than one C5 methyltransferase was found in a genome, only the one including a CpG dinucleotide at the restriction site was included.

Cytosine-specific methyltransferase genes are labeled as “putative” in some bacteria. This makes in-depth analysis difficult because the biochemical properties of their products are not substantiated in REBASE. Explicit identification is therefore feasible only for well-studied bacteria in which cytosine methylation has been characterized. To complement explicit identification, we used the BLASTP tool provided by REBASE to ascertain that a CDS putatively coding for a C5 methyltransferase is highly similar to a known C5 methyltransferase CDS.

Second, using REBASE, we also identified C5 methyltransferases in the unfinished genomes of several bacterial species. When such a gene was found, we collected the available DNA sequences from NCBI, extending our study to the corresponding organisms. By exploring REBASE in addition to two other protein databases, Pfam at the Sanger Centre (http://www.sanger.ac.uk/Software/Pfam/) and TIGRFAMs at TIGR (http://www.tigr.org/TIGRFAMs/), we collected DNA sequences from all the bacteria that are likely to express C5 methyltransferases. Finally, only bacterial species for which more than 20 nonredundant sequences (excluding ribosomal DNA) could be retrieved from GenBank were included in the analysis.

Relative Abundance of Dinucleotides

To measure the frequency of dinucleotides in a long genomic sequence, the relative abundance was calculated as the corresponding odds ratio (Burge et al. 1992). For the CpG dinucleotide, the formula is ρCpG = FCpG / (FC × FG), where ρCpG denotes the relative abundance of CpG, FCpG denotes the frequency of the CpG dinucleotide, and FC and FG denote the frequencies of the nucleotides C and G. If ρCpG falls between 0.81 and 1.20, the CpG dinucleotide is considered to be at a normal level. If it is lower than 0.81, the CpG relative abundance is classified as deficient. Deficiency can be further graded as follows: 0.78–0.81 is marginally low, 0.70–0.78 is significantly low, 0.50–0.70 is very low, and ≤0.50 is extremely low (Burge et al. 1992). In this study, bacteria with CpG relative abundances lower than 0.78 were considered to be CpG deficient.
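The odds ratio and the grading scheme above can be illustrated with a minimal Python sketch. It is not the code used in this study. The function names are hypothetical, only the given strand is counted (no strand symmetrization), and the handling of values that fall exactly on a class boundary, as well as the label for values above 1.20, are assumptions, since the text does not specify them.

```python
from collections import Counter


def cpg_relative_abundance(seq: str) -> float:
    """Odds ratio rho_CpG = F_CpG / (F_C * F_G), counted on the given strand only."""
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    base_counts = Counter(seq)
    f_c = base_counts["C"] / n
    f_g = base_counts["G"] / n
    if f_c == 0 or f_g == 0:
        raise ValueError("sequence lacks C or G; rho_CpG is undefined")
    # A sequence of n bases contains n - 1 overlapping dinucleotides.
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    f_cpg = cpg / (n - 1)
    return f_cpg / (f_c * f_g)


def classify_cpg(rho: float) -> str:
    """Grade rho_CpG with the thresholds quoted in the text.

    Boundary handling (<= versus <) is an assumption of this sketch.
    """
    if rho <= 0.50:
        return "extremely low"
    if rho < 0.70:
        return "very low"
    if rho < 0.78:
        return "significantly low"
    if rho < 0.81:
        return "marginally low"
    if rho <= 1.20:
        return "normal"
    return "above normal range"  # label not given in the text


def is_cpg_deficient(rho: float) -> bool:
    """Criterion used in this study: rho_CpG lower than 0.78."""
    return rho < 0.78
```

For a genome, `seq` would be the full nucleotide sequence; the function is only meaningful for long sequences, for the reasons discussed in the next section.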

GC Content and CpG Deficiency at Neutral Positions of CDS

Bacterial CDSs are generally short, so the variance of CpG relative abundance among CDSs with the same GC content is very large. Especially in low-GC CDSs, the values deviate strongly from the trend line when they are plotted against GC content. This deviation could mask the trend in CpG relative abundance. Since ρCpG values calculated for longer sequences deviate less from the actual values than those calculated for shorter sequences, we first sorted all the CDSs by GC content. We then concatenated every 40 CDSs (every 20 CDSs for some small bacterial genomes, such as the C. trachomatis genome) to generate long coding sequences for this study. The third position of a codon is under less selective pressure owing to the redundancy of the genetic code; we therefore chose C3pG1 (C in the third position of a codon; G in the first position of the following codon) to study the mutation pattern of CpG dinucleotides. The relative abundances of C3pG1 and T3pG1 in each long coding sequence were calculated and then plotted against its GC content.
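The pooling procedure can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, the relative abundance of C3pG1 is assumed to be normalized by the position-specific frequencies of C at third positions and G at following first positions (the text does not give this formula), and codon-boundary dinucleotides are counted within each CDS of a group rather than across the junctions created by literal concatenation.

```python
def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def pool_cds_by_gc(cds_list, group_size=40):
    """Sort CDSs by GC content and split them into consecutive groups."""
    ranked = sorted(cds_list, key=gc_content)
    return [ranked[i:i + group_size] for i in range(0, len(ranked), group_size)]


def x3g1_relative_abundance(cds_group, third_base):
    """Relative abundance of (third_base)3pG1 pooled over a group of in-frame CDSs.

    Assumed definition: f(X3G1) / (f(X3) * f(G1)), where all frequencies are
    taken over codon boundaries inside the CDSs of the group.
    """
    pairs = third = first_g = both = 0
    for cds in cds_group:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
        # i is a third codon position; i + 1 is the first position of the next codon.
        for i in range(2, usable - 1, 3):
            is_third = cds[i] == third_base
            is_g = cds[i + 1] == "G"
            pairs += 1
            third += is_third
            first_g += is_g
            both += is_third and is_g
    if third == 0 or first_g == 0:
        return float("nan")  # undefined when either base never occurs
    return both * pairs / (third * first_g)
```

Calling `x3g1_relative_abundance(group, "C")` and `x3g1_relative_abundance(group, "T")` for each group returned by `pool_cds_by_gc` gives the two values that would be plotted against the pooled GC content of that group.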

Results

Classification of Genomes According to C5 Methyltransferase

A total of 47 bacterial species whose genomes contain C5 methyltransferases were analyzed in terms of GC content, CpG relative abundance, C5 methyltransferase, and C5 methylation site. These species were categorized into three groups according to their C5 methyltransferase recognition sites (the length of the recognition sites was in the range of four to seven nucleotides). Some of these sites contain a methylated CpG dinucleotide, while others do not. In our first group, non-CpG dinucleotides are methylated in the recognition sites of the C5 methyltransferases (Table 1). In our second group, the presence of methylated CpG dinucleotides in recognition sites is uncertain (Table 2). In our third group, a methylated CpG dinucleotide can be found in the recognition sites (Table 3). Although we still do not know which cytosine is methylated in the recognition site of CGATCG (for Escherichia coli O157:H7 EDL933) in Table 3, the recognition site must have a methylated CpG dinucleotide because both cytosines in the recognition site are within the CpG dinucleotides. In addition, 20 bacterial species were found to be lacking C5 methyltransferases (their CpG relative abundances are listed in Table 4).
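The three-group classification can be expressed as a small decision rule. The sketch below is a hypothetical illustration, not the procedure actually applied to REBASE entries. It assumes that a recognition site is given in IUPAC notation on one strand, that the methylated base is an unambiguous C, and that the position of the methylated cytosine may be unknown. When the position is unknown, the site is assigned to a group only if every cytosine in it leads to the same answer, which is the reasoning applied to CGATCG above.

```python
# IUPAC ambiguity codes whose base set includes G (the position may be G).
MAY_BE_G = set("RSKBDVN")


def classify_site(site, methylated_index=None):
    """Return 'CpG', 'non-CpG' or 'uncertain' for a C5 recognition site.

    site: recognition sequence (IUPAC codes), 5'->3', one strand only.
    methylated_index: 0-based position of the methylated C, or None if unknown.
    """
    site = site.upper()

    def context(i):
        if site[i] != "C":
            raise ValueError("methylated position must be an unambiguous C")
        if i + 1 >= len(site):
            return "uncertain"  # the 3' neighbour lies outside the site
        nxt = site[i + 1]
        if nxt == "G":
            return "CpG"
        if nxt in MAY_BE_G:
            return "uncertain"  # the neighbour is G only for some matches
        return "non-CpG"

    if methylated_index is not None:
        return context(methylated_index)
    candidates = [i for i, base in enumerate(site) if base == "C"]
    if not candidates:
        raise ValueError("site contains no unambiguous C")
    contexts = {context(i) for i in candidates}
    return contexts.pop() if len(contexts) == 1 else "uncertain"
```

With this rule, `classify_site("CGATCG")` returns `'CpG'` even though the methylated position is not supplied, because both cytosines are followed by G.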

Table 1 The recognition sites of the C5 methyltransferases lacking a methylated CpG dinucleotide
Table 2 The recognition sites of the C5 methyltransferases possibly containing a methylated CpG dinucleotide
Table 3 The recognition sites of the C5 methyltransferases containing a methylated CpG dinucleotide
Table 4 Relative abundance of CpG dinucleotide in bacteria devoid of C5 methyltransferases

Is CpG Deficiency a Result of Horizontal Transfer of RM Systems?

RM systems in free-living bacteria are often horizontally transferred by means of linkage with mobility-related elements such as phages and plasmids (Kobayashi 2001 and references therein). RM systems act like infectious agents, rendering the bacteria dependent on the functioning of the methylase to avoid chromosome degradation by the nuclease. These bacteria thus experience selective pressure to avoid restriction sites (Rocha et al. 2001). Since most of the underrepresented sites are not recognition sites for the known RM systems of a given bacterium, the avoidance of these sites indicates the impact of RM systems in the bacterium’s evolutionary history (Rocha et al. 2001). Therefore, the current status of DNA methylation alone does not allow us to investigate the avoidance of sites that may have been methylated in the past by RM systems that have since been lost. Because free-living bacteria often come into contact with other bacteria in the surrounding environment, they can easily obtain a new RM system through horizontal transfer. Obligate intracellular parasites and symbionts cannot do so because of their secluded living environment. Such bacteria are currently devoid of such systems and are generally thought to lack horizontal transfer. Thus, one may suppose that they have not been in contact with such systems for a long period of their recent evolution. We therefore compared obligate intracellular bacteria with free-living bacteria holding at least one RM system. We observed that only two free-living bacterial species, Streptococcus pneumoniae and Streptococcus pyogenes, are CpG deficient. In contrast, 6 of 12 intracellular pathogens or symbionts show CpG deficiency. Thus, CpG deficiency is significantly more frequent in intracellular pathogens or symbionts than in proteobacteria (χ² test, p < 0.01). This is the opposite of what would be expected if cytosine deamination following the spread of RM systems were responsible for CpG deficiency.

Lack of Association Between Cytosine Methylation and CpG Deficiency

Among the 34 recognition sites identified in bacterial genomes (Tables 1 and 3), only seven methylated CpG dinucleotides were found within the recognition sites. Therefore, cytosine methylation in bacteria is not generally associated with CpG dinucleotide methylation.

Surprisingly, we find CpG deficiency in eight bacterial species (Campylobacter jejuni, Chlamydia muridarum, Chlamydophila pneumoniae, Clostridium perfringens, Fusobacterium nucleatum, Lactococcus lactis IL1403, Mycoplasma genitalium, and Rickettsia prowazekii) that are devoid of C5 methyltransferase (Table 4). By contrast, only five species (Clostridium acetobutylicum, Mycoplasma pulmonis, S. pneumoniae, S. pyogenes, and Synechocystis sp. 6803) that contain C5 methyltransferase are significantly CpG deficient (Tables 1, 2, and 3). This suggests that CpG dinucleotide deficiency is more frequent in bacteria lacking cytosine methylation (χ² test, p < 0.01). We cannot exclude, however, that this is due to a genome sampling effect, since genome sequencing programs did not select bacteria at random.
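As a plausibility check, the comparison can be reproduced from the counts given in the text: 8 CpG-deficient species among the 20 lacking C5 methyltransferases, and 5 among the 47 containing them. The sketch below assumes a Pearson χ² test on this 2 × 2 table without continuity correction; the exact variant of the test used in the study is not stated, and the function name is hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (chi2, p) with one degree of freedom.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# Rows: without / with C5 methyltransferase; columns: deficient / not deficient.
chi2, p = chi_square_2x2(8, 20 - 8, 5, 47 - 5)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```

Under these assumptions the statistic exceeds the 1% critical value for one degree of freedom (6.635), which is consistent with the reported p < 0.01.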

Finally, a t-test shows that the CpG relative abundances in bacteria containing RM systems methylating CpG dinucleotides (Table 3) are not significantly lower than those of other bacteria (Tables 1 and 4; p > 0.1), indicating that the presence of methylated CpG dinucleotides in recognition sites does not give rise to CpG deficiency.

The above analyses do not support the idea that cytosine methylation is responsible for CpG deficiency. Therefore, we have performed a set of analyses to further explore potential reasons behind CpG deficiency in bacteria.

Associations Between CpG Deficiency and Other Dinucleotide Biases

According to the cytosine methylation hypothesis, CpG dinucleotides are depleted through deamination of methylated cytosines, leading to a concurrent increase in the relative abundances of TpG and CpA. In the present study we found that the relative abundances of both TpG and CpA are not significantly higher than those of ApG, GpG, CpT, and CpC (p > 0.1, t-test) among the bacterial species that show CpG deficiency (Table 5). In Chlamydiae and Clostridia, the relative abundances of TpG and CpA are lower than those of ApG, GpG, CpT, and CpC. The reasons for this are presently unknown. The CpG relative abundance of the bacterial species showing CpG deficiency was plotted against the TpG and CpA relative abundances (Fig. 1). The regression of TpG on CpG (Fig. 1A) results in a nearly horizontal line (R² = 0.0002, slope = −0.005, p < 0.001), indicating that the change in CpG relative abundance is not correlated with that of TpG relative abundance. In sharp contrast, a negative correlation of the two values was found in the human genome (addressed below). The regression of CpA on CpG (Fig. 1B) also results in a nearly horizontal line (R² = 0.006, slope = 0.023, p < 0.001). These findings indicate that CpG variation is not significantly negatively correlated with TpG or CpA abundances. As such, it seems unlikely that CpG variation in bacteria can be attributed to different rates of methylated cytosine deamination.
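The slope and R² of such a regression can be obtained with ordinary least squares. The following sketch shows the calculation only; the function name is hypothetical and the commented example values are invented for illustration, not taken from Table 5. Under the methylation hypothesis one would expect a clearly negative slope of TpG (or CpA) on CpG, whereas a slope and R² close to zero correspond to the nearly horizontal lines described above.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 is the squared Pearson correlation for a one-predictor fit.
    r_squared = 0.0 if syy == 0 else sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical usage: one (rho_CpG, rho_TpG) pair per CpG-deficient genome.
# slope, intercept, r2 = linear_regression(rho_cpg_values, rho_tpg_values)
```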

Figure 1 Correlation between CpG relative abundance and TpG (A) and CpA (B) relative abundances in the bacterial genomes that show CpG deficiency.

Table 5 Relative abundances of NpG and CpN in 13 bacterial genomes that show CpG deficiency

Analysis of Covariation Between CpG Relative Abundance and GC Content

It has been pointed out that the negative correlation between CpG and TpG at different GC contents is an artifact ascribed to deamination of methylated cytosine in the human genome (Duret and Galtier 2000). To further test the hypothetical relationship between cytosine methylation and CpG deficiency in bacteria, we analyzed the covariation among the dinucleotides CpG, TpG, and CpA at different GC contents.

In the bacteria studied here, CpG relative abundance is found to be higher in the DNA sequences with a high GC content. No bacterial species showing overall CpG deficiency has more than a 50% GC content (Tables 1, 2, 3, 4). We then analyzed the correlation between CpG relative abundance and GC content at the intragenome level. The GC content within a genome is not uniform, so we might expect CpG relative abundances in different genomic regions to correlate with the GC content. Because a bacterial genome is largely composed of CDSs, the effect of codon usage bias on CpG dinucleotide must not be ignored. For example, a study in plants showed that the negative correlation between C3pG1 and T3pG1 relative abundances was significant (De Amicis and Marchetti 2000). This was considered to be a consequence of heavy DNA methylation in plants. Therefore, we compared the relative abundance of the neutral dinucleotide sites, C3pG1 and T3pG1, in a CDS.

We then analyzed the CDSs of the 13 bacterial species showing CpG deficiency for the covariation of dinucleotide relative abundance with GC content. The relative abundances of C3pG1 and T3pG1 were plotted against the GC content of all the CDSs. The results for C. perfringens and M. pulmonis are shown in Fig. 2, indicating that C3pG1 relative abundance increases somewhat in parallel with T3pG1 relative abundances in different GC contents. In comparison, the relative abundance of C3pG1 is negatively correlated with that of T3pG1 in Homo sapiens (Duret and Galtier 2000). This distinctive correlation pattern in humans probably results from methylated cytosine deamination.

Figure 2 C3pG1 and T3pG1 relative abundances plotted against GC contents of long coding sequences. C3pG1 (C in the third position of a codon, G in the first position of the following codon); T3pG1 (T in the third position of a codon, G in the first position of the following codon). The dots in these patterns represent the data from the long coding sequences. The long coding sequences were generated by listing all CDSs according to their GC contents and then concatenating every 40 CDSs.

The regressions of C3pG1 and T3pG1 relative abundances as a function of GC content are summarized in Table 6. A positive slope indicates a positive correlation between GC content and dinucleotide relative abundance. Except for two cases, the slopes are positive in all the bacterial species. When both slopes for a given species in Table 6 are positive, the relative abundances of C3pG1 and T3pG1 both increase with GC content. This seems to be a general trend, with only two exceptional species, M. genitalium and Synechocystis sp. PCC 6803 (Fig. 2). The negative slope of C3pG1 relative abundance in M. genitalium is small, and we do not know at present how to explain it. With a positive C3pG1 slope and a negative T3pG1 slope, Synechocystis sp. PCC 6803 shows a trend similar to that of H. sapiens, except that the relative abundance of C3pG1 remains quite constant as the GC content increases (Fig. 2). The explanation for this exception probably lies in the relatively higher GC content (47.6%) and larger genome size (3.6 Mb) of Synechocystis sp. PCC 6803. The above results agree with the rule that CpG deficiency is related to lower GC content but do not support the prediction of the cytosine methylation hypothesis.

Table 6 Slope values of the changing pattern of C3pG1 and T3pG1 relative abundances plotted against GC contents of long coding sequences

Discussion

Evaluation of the Potential Effects of RM Systems on CpG Deficiency in Bacteria

In vertebrates, it is widely accepted that CpG deficiency is a consequence of CpG methylation (Bird 1980; Jeltsch 2002). The DNA methylation pattern on CpG dinucleotides is largely maintained by DNA methyltransferase1 (Dnmt1) (Lyko et al. 1999). Some essential differences in the properties of DNA methyltransferases in vertebrates and bacteria may explain the observed differences in CpG deficiency. First, bacteria vary widely in both the content and the size of their C5 methyltransferase recognition sites. Most of the recognition sites do not contain a methylated CpG dinucleotide, suggesting that cytosine methylation is not a determinant of CpG deficiency in bacteria. Although some RM systems have a methylated CpG dinucleotide, the large size of these recognition sites determines that most CpG dinucleotides are not methylated because of the low occurrence of these sites in the genome (i.e., CpG methylation mediated by a single methyltransferase in a rare site such as CGATCG is too weak to induce CpG deficiency).

Second, DNA methylation in bacteria is a kind of de novo methylation (Bestor 1990). This differs from the situation in vertebrates, where Dnmt1 can only function on hemimethylated DNA (Lyko et al. 1999). De novo methylation mediated by Dnmt3a and Dnmt3b does occur in vertebrates, but it is restricted to the very early embryonic stage (Ramsahoye et al. 2000; Gowher and Jeltsch 2001). These differences between bacterial C5 methyltransferases and those of vertebrates reinforce the idea that C5 methylation is not the major source of CpG deficiency in bacteria. It is possible that a more fundamental mechanism than cytosine methylation affects dinucleotide relative abundance and distribution in bacterial genomes.

Third, RM systems are frequently gained and lost by horizontal transfer (Kobayashi 2001). As such, the presence of a C5 methyltransferase is intermittent, and possibly rare, which necessarily implies a much weaker bias from methylated cytosine deamination than in genomes that carry C5 methyltransferases permanently, such as the human genome. Moreover, free-living bacteria, which acquire RM systems most readily, are mostly not CpG deficient, in contrast to intracellular pathogens and symbionts. Therefore, the contribution of RM systems to CpG deficiency in bacteria appears doubtful whether current or historical parameters are considered. Interestingly, it was reported that free-living pathogens had a significantly higher GC content than intracellular pathogens and symbionts (Rocha and Danchin 2002). Here we show that CpG deficiency correlates with GC content and lifestyle.

Association of GC Content and CpG Deficiency

In this study we find that C3pG1 relative abundance and GC content are generally positively correlated in those bacterial species that show CpG deficiency. We obtained qualitatively similar correlations using C1pG2 and C2pG3 in this analysis (results not shown). This strengthens the link between CpG dinucleotide relative abundance and GC content in bacteria. Identical correlations have been found in humans (Aissani and Bernardi 1991; Pesole et al. 1997) and RNA viruses (Rima and McFerran 1997). It was subsequently pointed out that this could be a mathematical artifact caused by the high mutation rate on methylated CpG dinucleotide (Duret and Galtier 2000). As methylated CpG deaminates to TpG or CpA dinucleotides, the number of C and G decreases in this process. This would lead to a lower expected number of CpG dinucleotides in the new sequence compared to the original sequence. This effect is found to be more evident when the GC content increases (Duret and Galtier 2000). However, the mutation process from methylated CpG to TpG dinucleotide is not present in most of the bacteria that show CpG deficiency. This is implied by parallel changing patterns of CpG and TpG in different GC contents in bacteria. As a result, Duret and Galtier’s artifact hypothesis does not explain satisfactorily the association of GC content and CpG deficiency in the bacterial context.

CpG Deficiency in Vertebrates May Be the Cost of a Newly Developed Function of DNA Methylation

Two functions have been suggested for DNA methylation. The primary function is to defend a genome against invasion by bacteriophages or transposable elements; the secondary function, which arose more recently in evolutionary history, is the regulation of gene expression (Yoder et al. 1997). We classify the organisms having DNA methylation into two groups according to these functions: the first group includes bacteria, fungi, and invertebrates, and the second group includes vertebrates and plants. Only in the second group are CpG dinucleotides massively methylated or demethylated in order to regulate gene expression. Consequently, only DNA methylation serving the secondary function, in vertebrates and plants, can be persuasively linked to CpG deficiency.

Within the animal kingdom, the above boundary should actually be moved to include the sea urchin, the only invertebrate species in which a Dnmt1-like methyltransferase has been identified (Aniello et al. 1996, 2003). As such, it should be distinguished from the other invertebrates. Dnmt1 is critical for the secondary function (Ramsahoye et al. 2000), so the presence of a Dnmt1-like protein in the sea urchin probably reflects a strong requirement for developmental regulation. Therefore, the evolution of methyltransferase genes from bacteria to humans reflects the need for specialized functions in more complex organisms, with DNA methylation evolving from a protection mechanism into an epigenetic mechanism. This enables an organism to have an increased life span and to survive under more complex environmental conditions. This benefit comes at a cost. For one, vertebrate genomes face a huge mutation pressure on the recognition sites for DNA methylation. Until now, no study has shown that vertebrates have found a strategy to compensate for the depleted CpG dinucleotides. Theoretically, continued CpG depletion will lead to a vertebrate genome crisis.

Conclusion

We studied the link between C5 methylation and CpG content in bacteria and found no significant correlation. Thus, C5 methylation is probably not the major factor inducing CpG deficiency in bacteria and more effort should be invested in looking for alternative explanations for this phenomenon. Finally, this study indicates that CpG dinucleotide deficiency is related to GC content. This can be taken as a clue in the search for factors that induce CpG deficiency in bacteria.