Abstract
Understanding the extent and causes of biases in codon usage and nucleotide composition is essential to the study of viral evolution, particularly the interplay between viruses and host cells or immune responses. To understand the common features and differences among viruses we analyzed the genomic characteristics of a representative collection of all sequenced vertebrate-infecting DNA viruses. This revealed that patterns of codon usage bias are strongly correlated with overall genomic GC content, suggesting that genome-wide mutational pressure, rather than natural selection for specific coding triplets, is the main determinant of codon usage. Further, we observed a striking difference in CpG content between DNA viruses with large and small genomes. While the majority of large genome viruses show the expected frequency of CpG, most small genome viruses had CpG contents far below expected values. The exceptions to this generalization, the large gammaherpesviruses and iridoviruses and the small dependoviruses, have sufficiently different life-cycle characteristics that they may help reveal some of the factors shaping the evolution of CpG usage in viruses.
Introduction
Even before the genetic code was deciphered, it was proposed that gene sequence evolution is not only influenced by fitness effects at the protein level, but also by the intrinsic nucleotide composition of the genome (Sueoka 1961). Once the redundancy of the genetic code was revealed, it became apparent that different organisms had evolved, along with “classical phenotypes,” unique genomic signatures, or “genomic phenotypes” (Bernardi and Bernardi 1986). Of particular importance was the proposal that each species was subject to specific genomic pressures on base composition, in turn resulting in a distinctive bias in codon choice (Grantham et al. 1980), and that explaining these unique coding strategies “is the heart of the problem of molecular evolution” (Grantham et al. 1986).
More recent studies have revealed that patterns of codon usage bias and nucleotide composition within many cellular genomes are far more complex than previously imagined, and the factors shaping their evolution are still not entirely understood. In principle, biases in nucleotide composition and codon usage can result from natural selection and/or differential mutational pressure. In many organisms, such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans, there is evidence that codons that use abundant tRNAs are selectively favoured, especially in highly expressed genes (Sharp et al. 1986; Powell and Moriyama 1997; Gouy and Gautier 1982; Stenico et al. 1994). Codon selection of this type most likely functions to optimize translational speed and/or translational accuracy, although additional factors, such as transcription efficiency, mRNA secondary structure, and protein structure, can also exert selection pressures (Xia 1996; Zama 1990; Oresic and Shalloway 1998). In contrast, codon usage bias in mammals appears to be more strongly influenced by differential mutation pressure (Sharp et al. 1993), although some evidence for selection has also been observed (Smith and Eyre-Walker 2001; Chamary and Hurst 2004; Duan and Antezana 2003). Under the mutation pressure model, intrinsic differences in the propensity of genome replication to make specific mutational errors, usually depicted as the frequency of GC ↔ AT changes, shape overall patterns of base composition. Further, because the effective population sizes (Ne) of mammalian species are typically small, as are the selection coefficients (s) of most mutations at synonymous sites or in noncoding DNA (i.e., Nes << 1), natural selection is usually unable to control the substitution dynamics of these mutations and they are fixed, or more usually lost, by genetic drift.
Evidence for this “mutation-drift” theory in vertebrate genomes is that the nucleotide and codon bias of genes largely reflects their genomic location; for example, genes in GC-rich regions (so-called GC “isochores”) are biased toward GC-ending codons (Sharp et al. 1993).
There have been few in-depth studies of codon and nucleotide usage biases among families of DNA viruses, although the biases of some eukaryotic viruses can differ substantially from those of their hosts (Strauss et al. 1996). For example, in the mammalian papillomaviruses it has been proposed that a codon usage bias different from the average seen in host genomes strongly influences both viral replication and gene expression (Zhao et al. 2003). Moreover, in the rapidly evolving human RNA viruses, one of the few groups of viruses for which codon bias data have been compiled, mutation pressure seems to be the main force shaping codon usage, accounting for 71–85% of the observed bias (Jenkins and Holmes 2003). Indeed, in one in-depth analysis of the Nidovirales, neither translational selection nor gene length was found to have an effect on codon usage (Gu et al. 2004). Although RNA viruses possess large effective population sizes, it is possible that their mutation rates are so high that they prevent natural selection from working efficiently on codon choice (Jenkins and Holmes 2003).
Codon usage can also be strongly influenced by underlying biases in dinucleotide frequency, which differs greatly among organisms. Specifically, after accounting for dinucleotide biases, the proportion of codon usage bias explained by mutation pressure often increases, as seen in human RNA viruses (Jenkins and Holmes 2003). Dinucleotide biases can be extreme. For example, CpG is present at only 20% of its expected frequency in most vertebrate genomes (Jones et al. 1992) and is depleted in other organisms (Karlin et al. 1994). The most popular explanation for the underrepresentation of CpG in vertebrate genomes is that 60–90% of cytosines in CpG doublets are methylated by cellular methyltransferases (Kress et al. 2001) and methylated cytosine (5-methylcytosine) has a tendency to undergo deamination when unpaired, resulting in the mutation of the cytosine to a thymidine. Methylation can control the expression of particular genes by preventing the binding of transcription factors and modifying chromatin structure and interactions with histones. In vertebrates, methylation is central to genomic imprinting and cell differentiation (Kress et al. 2001). Thus, CpG may be selected for at certain sites and selected against at others. While mutation following methylation probably plays a strong role in the reduced CpG content of many organisms, it is unlikely to be the only factor. In particular, the underrepresentation of CpG is often not accompanied by a correspondingly high level of TpG. Furthermore, similarly low CpG contents are found in vertebrate mitochondrial genomes which are unmethylated (De Amicis and Marchetti 2000).
The dinucleotide TpA is also underrepresented in many genomes (Burge et al. 1992). This is often attributed to the susceptibility of UpA uracils to RNase and that two of the three stop codons in the universal code begin with TpA (Beutler et al. 1989). Additionally, structural factors may have an effect on the frequency of certain dinucleotides. For example, the reduced frequency of TpA may also be explained by its low thermal stability (Breslauer et al. 1986; Beutler et al. 1989), such that TpA-rich sequences in DNA helices may suffer from detrimental levels of unwinding, twisting, and bending (De Amicis and Marchetti 2000).
Eukaryotic DNA viruses can be classified into two or three broad groups based on the size and nature of their genome. The genomes of large double-stranded DNA (dsDNA) viruses are mostly greater than 100 kb in length, while a putative second group, the Adenoviridae, are 28–45 kb in length but biologically similar to the large dsDNA viruses in that they encode their own DNA polymerase and accessory proteins involved in immune response regulation. In contrast, the genomes of small DNA viruses are less than 10 kb in length and comprised of either dsDNA or single-stranded DNA (ssDNA). Large dsDNA viruses encode many different proteins, often exceeding 100, which play intricate roles in virus replication, host-cell regulation, and host immune modulation (Shackelton and Holmes 2004). In contrast, small DNA viruses encode fewer than 10 proteins, some of which provide capsid structural functions while the remainder are primarily involved in supplementing or stimulating replication by host machinery, as these viruses do not encode their own polymerases. For the small double-stranded Polyomaviridae and Papillomaviridae, these supplemental replication proteins typically induce the host cell to enter S phase through a variety of interactions with host cell cycle regulatory components (Cole et al. 2001; Howley et al. 2001). Most of the single-stranded parvoviruses can only replicate in cells that are mitotically active, while the dependo-parvoviruses rely on co-infection of their host cells by a large helper adenovirus or herpesvirus (Muzyczka and Berns 2001). In addition, small DNA viruses are generally unable to modulate either the innate or the adaptive immune response of the host to nearly the extent of the large DNA viruses. 
This, and the fact that small DNA viruses are far more dependent on cellular proteins and other resources for their propagation, may have resulted in differences in the evolution of biases in base composition and codon usage among large and small viruses.
Herein we conducted an analysis of the codon usage and nucleotide biases among all vertebrate-infecting DNA virus families. Such a comparative analysis is central to revealing the interplay between the genetic drift of neutral mutations and the selective fixation of advantageous ones, can help elucidate the evolutionary dynamics of viruses and their interactions with hosts, and can potentially improve the efficiency and effectiveness of experimental or therapeutic procedures in which viral genome sequences are altered.
Materials and Methods
Sequence Data
Reference sequences (as defined by GenBank; http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/10239.html) were obtained for at least one species within all genera of vertebrate DNA viruses for which such data exist. This resulted in data sets of 41 large dsDNA viruses, 21 small dsDNA viruses, and 15 small ssDNA viruses. The hepadnaviruses were excluded, as they replicate via reverse transcription and are therefore likely to be subject to different evolutionary pressures. Known and putative ORFs were concatenated for total codon analyses. Accession numbers are given in Table 1.
Measuring Codon Usage and Nucleotide Biases
The CodonW package (http://www.bioweb.pasteur.fr/docs/softgen.html#CODONW) was employed to measure the effective codon usage statistic, Nc (Wright 1990), codon usage frequency, and nucleotide biases. Nc is calculated on a scale of 20 to 61, with a score of 20 representing maximum bias—the use of only one codon for each of the 20 amino acids—and a score of 61 indicating no bias, such that all codons are used equally for each amino acid.
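As a rough illustration of how Nc behaves at the two ends of its scale, the following Python sketch implements the homozygosity-based estimator of Wright (1990) for a single in-frame coding sequence under the standard genetic code. It is not the CodonW implementation: the function name, the decision to skip amino acids observed fewer than twice, and the cap at 61 are assumptions made here for illustration only.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def effective_number_of_codons(orf):
    """Illustrative Nc (Wright 1990) for one in-frame coding sequence."""
    orf = orf.upper().replace("U", "T")
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    # Count sense codons only; stops and ambiguous triplets are ignored.
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    # Collect the observed counts of every synonymous codon per amino acid.
    by_aa = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa[aa].append(counts[codon])

    # Homozygosity F per amino acid, grouped by degeneracy class (2, 3, 4, 6).
    f_by_class = defaultdict(list)
    for obs in by_aa.values():
        k, n = len(obs), sum(obs)
        if k == 1 or n < 2:
            continue  # Met/Trp are uninformative; F is undefined for n < 2
        sum_p2 = sum((x / n) ** 2 for x in obs)
        f_by_class[k].append((n * sum_p2 - 1) / (n - 1))

    nc = 2.0  # Met and Trp contribute one codon each
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class[k]
        mean_f = sum(values) / len(values) if values else 0.0
        if mean_f <= 0:
            raise ValueError(f"too few codons to estimate the {k}-fold class")
        nc += n_amino_acids / mean_f
    return min(nc, 61.0)  # values above 61 are truncated to the no-bias limit
```

A sequence that uses a single codon for every amino acid returns 20, whereas one that uses all 61 sense codons equally often returns 61, matching the limits described above.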
We measured the total G + C (denoted GC) content and the frequency of all dinucleotides within both the concatenated ORFs and the complete genome sequences. Dinucleotide biases were calculated as the observed frequency of the dinucleotide relative to the product of the frequencies of the individual nucleotides (i.e., the expected dinucleotide frequency). For example, ρTpA = ƒ(TA)/[ƒ(T)ƒ(A)]. For dinucleotides which do not form a reverse complemented pair on the opposite strand, we symmetrized the measure of ρ with the complementary dinucleotide and labeled this ρ* as outlined by Burge et al. (1992). For example, ρ*TpG = 2[ƒ(TG) + ƒ(CA)]/{[ƒ(T) + ƒ(A)][ƒ(G) + ƒ(C)]}.
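The two measures can be sketched in a few lines of Python. This is only a minimal illustration of the ratios defined above, not the code used in the study; the function names are hypothetical, and the sequence is assumed to contain only the unambiguous bases A, C, G, and T.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _frequencies(seq):
    """Return mononucleotide and dinucleotide frequencies of one strand."""
    seq = seq.upper()
    bases = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    f_base = {b: bases[b] / len(seq) for b in "ACGT"}
    f_pair = {x + y: pairs[x + y] / (len(seq) - 1)
              for x in "ACGT" for y in "ACGT"}
    return f_base, f_pair


def rho(seq, dinuc):
    """Observed/expected ratio for a dinucleotide, e.g. rho(seq, "TA")."""
    f_base, f_pair = _frequencies(seq)
    x, y = dinuc
    return f_pair[dinuc] / (f_base[x] * f_base[y])


def rho_star(seq, dinuc):
    """Strand-symmetrized ratio, e.g. rho_star(seq, "TG") pools TG with CA."""
    f_base, f_pair = _frequencies(seq)
    x, y = dinuc
    cx, cy = x.translate(COMPLEMENT), y.translate(COMPLEMENT)
    reverse_complement = cy + cx
    numerator = 2 * (f_pair[dinuc] + f_pair[reverse_complement])
    denominator = (f_base[x] + f_base[cx]) * (f_base[y] + f_base[cy])
    return numerator / denominator
```

By construction, `rho_star(seq, "TG")` and `rho_star(seq, "CA")` return the same value, which is the point of symmetrizing over both strands.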
Besides total GC content, we measured the frequency with which either of these nucleotides appears at the synonymous third codon position (GC3s). GC3s was compared to the GC content at the first and second codon positions (GC1,2) with the Pearson product-moment correlation coefficient (r). To examine the influence of GC content on codon usage we plotted the relationship between Nc and GC3s for each virus. This was compared to the Nc which would result if GC content were solely responsible for the codon biases (denoted Nc*), calculated as Nc* = 2 + GC3s + {29/[(GC3s)² + (1 − GC3s)²]} (Wright 1990).
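For orientation, the expected curve can be evaluated directly. The Python sketch below computes Nc* from the formula of Wright (1990) given above, together with a simple GC3s estimate. The function names are hypothetical, and the GC3s helper assumes the standard genetic code and excludes Met, Trp, and stop codons, which may differ in detail from the CodonW definition.

```python
# Codons with no synonymous alternative in the standard code (assumption):
# Met (ATG), Trp (TGG), and the three stop codons.
NON_SYNONYMOUS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(orf):
    """Fraction of G or C at the third position of synonymous codons."""
    orf = orf.upper().replace("U", "T")
    thirds = [
        orf[i + 2]
        for i in range(0, len(orf) - 2, 3)
        if orf[i:i + 3] not in NON_SYNONYMOUS
    ]
    if not thirds:
        raise ValueError("no synonymous codons found")
    return sum(base in "GC" for base in thirds) / len(thirds)


def expected_nc(gc3s_value):
    """Nc* expected if GC3s alone determined codon usage (Wright 1990)."""
    return 2 + gc3s_value + 29 / (gc3s_value ** 2 + (1 - gc3s_value) ** 2)
```

The curve peaks at 60.5 when GC3s = 0.5 and falls to 31 and 32 at GC3s = 0 and 1, respectively, so an observed Nc well below `expected_nc(gc3s(orf))` points to bias beyond that explained by GC3s.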
To quantify the effects of natural selection on codon usage we also calculated the Nc’ statistic for each data set using the programs SeqCount and ENCprime (Novembre 2002). Nc’ is similar to Nc but takes into account background nucleotide composition. This statistic, which also ranges from 20 to 61, measures the deviation of the actual codon usage pattern from the distribution expected from the background nucleotide composition (Novembre 2002).
To determine codon biases across genomes we measured the Nc and GC3s, with the methods described above, of individual genes from each family, subfamily, or distinct genus. To determine CpG biases across large DNA virus genomes we conducted a sliding window analysis of CpG content with a window size of 300 and a step size of 100 using the SWAAP 1.0.2 program (Pride 2000).
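The sliding window procedure can be illustrated with a short Python sketch. It is not the SWAAP implementation; it simply computes the observed/expected CpG ratio in each window using the window and step sizes stated above, and the function name and the handling of windows lacking C or G (reported as NaN) are assumptions.

```python
def sliding_cpg(seq, window=300, step=100):
    """Return (start, observed/expected CpG) for each full window."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        f_c = chunk.count("C") / window
        f_g = chunk.count("G") / window
        f_cg = chunk.count("CG") / (window - 1)  # window - 1 dinucleotides
        if f_c > 0 and f_g > 0:
            ratio = f_cg / (f_c * f_g)
        else:
            ratio = float("nan")  # expected frequency is zero
        results.append((start, ratio))
    return results
```

Plotting the second element of each tuple against the first gives the per-region CpG profile; trailing bases that do not fill a complete window are ignored in this sketch.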
Results
Codon Usage Biases
The effective codon usage statistic, Nc, was used to measure codon usage bias across the ORFs of large and small DNA viruses (Table 1). While all viruses exhibit some codon bias, none of these biases were unusually strong and the majority of families include members with both low and high codon biases, following no apparent trend. The mean (and range) for the Nc values of the large and small viruses were 48.3 (34.7–59.6) and 50.5 (35.8–58.0), respectively. However, an examination of genome and gene codon usage tables for a number of viruses suggested that the codon biases which do exist are greatly influenced by GC frequency. For example, in canine parvovirus and human BK virus, which have low Nc values, the most frequently employed synonymous codons are those with the fewest Gs and Cs. As a case in point, of the six synonymous codons coding for arginine, AGA, the only codon with a single G or C, is used 83% and 61% of the time in these two viruses, respectively. In contrast, HSV-2, a virus with a low Nc value which probably results from an overrepresentation of Gs and Cs, employs the two arginine codons with only Cs and Gs, CGC and CGG, 82% of the time.
Nucleotide Biases
To determine whether these codon biases are primarily caused by mutation pressure on overall base composition or natural selection for particular triplets, we first explored the extent of underlying nucleotide bias in all viral genomes. We began by examining the overall GC content at the genomic level and in the concatenated ORFs, as well as the GC3s content of the latter (Table 1). Although values varied greatly, the majority of large viruses, including the adenoviruses, alphaherpesviruses, and betaherpesviruses, have GC3s contents above 0.50, averaging 0.52, 0.64, and 0.61, respectively. However, specific families of large DNA viruses, namely, the Poxviridae, the Asfarviridae, and some iridoviruses, have low GC3s contents. Interestingly, these are the only DNA viruses which replicate completely (Poxviridae) or partially in the cytoplasm of the cell rather than entirely in the nucleus (Moss 2001; Rojo et al. 1999; Williams 1996). Almost all small DNA viruses showed low GC3s values. For the Papillomaviridae, Polyomaviridae, anellovirus, Circoviridae, and Parvoviridae (excluding the dependovirus genera of the Parvoviridae, which do not show this trend), the average GC3s values were 0.41, 0.35, 0.43, 0.53, and 0.33, respectively. For all viruses, GC3s values were close to their genomic GC values. There were no obvious trends in single nucleotide strand biases across virus families.
Mutation Pressure Versus Codon Selection
To determine the relative effects of mutation pressure versus natural selection on codon composition, we examined the relationship between GC3s and GC at the first and second codon positions (GC1,2). As shown in Fig. 1, GC3s and GC1,2 were significantly correlated (r = 0.95 for the large DNA viruses and r = 0.94 for the small DNA viruses; p < 0.001). Although this test does not take into account the phylogenetic relationships among the taxa studied, it does show that observed patterns of base composition are present at all codon positions. This suggests that they are most likely the result of mutation pressure, as natural selection would be expected to act differently on different codon positions.
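The comparison can be sketched as follows in Python. This is an illustrative outline only: the function names are hypothetical, and for simplicity the third-position GC content is taken over all codons rather than only synonymous ones (GC3s), so values will differ slightly from those reported here.

```python
from math import sqrt


def gc_by_codon_position(orf):
    """Return (GC at positions 1 and 2, GC at position 3) for an in-frame ORF."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # drop any trailing partial codon
    pos12 = [orf[i] for i in range(usable) if i % 3 != 2]
    pos3 = [orf[i] for i in range(2, usable, 3)]

    def gc_fraction(bases):
        return sum(b in "GC" for b in bases) / len(bases)

    return gc_fraction(pos12), gc_fraction(pos3)


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Applying `gc_by_codon_position` to the concatenated ORFs of each virus and passing the two resulting lists to `pearson_r` would give one coefficient per group of viruses, analogous to the r values quoted above.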
The notion that codon bias is primarily governed by an underlying biased mutation pressure was further supported by examining the correlation between the GC3s and Nc of the coding sequences. We plotted GC3s against Nc for each virus and compared this to the expected Nc (Nc*) that would result if GC content were the sole determinant of codon usage. Remarkably, the actual Nc values for most viruses were on, or just below, the Nc* curve (Fig. 2). This implies that codon bias is mainly explained by uneven base composition and, hence, by mutation pressure rather than natural selection on codon choice. However, the fact that the majority of the actual Nc values were slightly lower than Nc* indicates that there are other factors, with less of an effect, that also influence the codon bias.
Additional evidence that codon biases are predominantly influenced by mutation pressure was obtained by determining the Nc’ statistic for each virus. Unlike Nc, Nc’ takes into account the background nucleotide composition and should therefore reflect the degree of codon usage bias due to factors other than nucleotide composition. As expected, Nc’ was greater than Nc in the vast majority of viruses and the difference between Nc and Nc’ was greatest in those viruses which have GC contents that depart significantly from 50% and minimal in those viruses with GC contents near 50% (Table 1). The graphic relationship between GC content and Nc’–Nc is a V-shaped curve centered at a GC content of 0.5—approximated by y = C(|x − 0.5|) (Fig. 3).
Dinucleotide Biases
Because there have been reports of CpG underrepresentation in RNA and small DNA viruses (Karlin et al. 1994) and because dinucleotide biases can affect codon bias, we next determined the dinucleotide biases for each virus. We measured the actual frequency of each pair of nucleotides relative to the expected frequency (i.e., the product of the frequencies of the individual nucleotides). As previously observed by Karlin et al. (1994), while most large DNA viruses showed no bias against CpG, small DNA viruses were extremely biased against this dinucleotide (Table 2, Fig. 4). This difference in CpG content was highly significant (Student’s t-test: p = 2.3 × 10⁻¹⁵ when outliers were excluded [see below] and p = 2.2 × 10⁻¹¹ when outliers were included) and indicates that in small DNA viruses there is additional mutation pressure away from this doublet which may also impact codon usage. The CpG content of the small papillomaviruses, polyomaviruses, anellovirus, circoviruses, and parvoviruses (excluding the dependoviruses) averaged 47, 22, 67, 62, and 43% of their expected values, respectively. The outliers, the dependoviruses, averaged 82%, which is significantly different from the other small DNA viruses (p = 0.005). In contrast, the large viruses—the adenoviruses, asfarvirus, alphaherpesviruses, betaherpesviruses, unassigned herpesvirus, and poxviruses—had average CpG contents of 78, 87, 98, 115, 108, and 112% of their expected values. However, among the large DNA viruses there are two distinct exceptions (significant at p = 0.0003) to the general correlation between genome size and CpG content: the Gammaherpesvirinae and the Iridoviridae, which have CpG contents 49 and 70% of their expected values, respectively.
In those viruses with CpG depletion there were, generally, also elevated levels of TpG (measured as discussed above in order to symmetrize the value over both strands), which would result from a cytosine-to-thymidine mutation. However, the overrepresentation of this pair was slight compared to the underrepresentation of CpG. The genomic CpG and TpG contents were both approximately equivalent to the ORF contents. Finally, and as seen in most organisms, the TpA doublet had a reduced frequency in both large and small DNA viruses, although this underrepresentation was small compared to the level of CpG suppression in the small viral genomes (Table 2). With the exception of those mentioned above, all other dinucleotides were present at approximately the expected frequency.
Comparisons Along Viral Genomes
To determine the extent to which codon biases vary among genes, we selected one virus from each family, subfamily, or distinct genus and examined every gene separately. Because correct gene prediction is difficult for large viruses and results in some annotations that are only putative, we limited our analysis to those genes with assigned functions in the NCBI database. Strikingly, viral genes located in different genomic regions do not differ dramatically in the biases they display (see Supplementary Table).
Furthermore, because large genomes can have very different nucleotide compositions in different regions, we used a sliding window method to examine CpG content along the genomes of the large viruses. While individual windows varied in their CpG contents, we found no systematic deviation from the average viral bias in any genomic region (see Supplementary Figure).
Discussion
This study revealed a number of trends in the nucleotide and codon composition among families of vertebrate-infecting DNA viruses. First, the strong correlation between codon usage bias and GC composition indicates that codon usage bias in DNA viruses is primarily explained by overall nucleotide content. In addition, not only are the GC frequencies of each virus similar at nonsynonymous and synonymous codon positions, but these frequencies appear in genes with different genomic positions and functions. These observations suggest that genome-wide mutational pressure is the most important factor shaping patterns of codon usage bias in DNA viruses, rather than natural selection for specific codons. Although nucleotide biases can vary even among closely related viruses, those viruses with similar genomes and life-cycle characteristics most frequently show similar CpG frequencies relative to their GC content. Such generalities, especially the marked CpG deficiency in the genomes of small DNA viruses, along with the relatively high levels of this dinucleotide in the large DNA viruses, point to common evolutionary pressures faced by similar viruses. These observations notwithstanding, it is likely that experimental analyses of DNA viruses, such as artificially altering codon usage or dinucleotide frequencies, are required to fully understand the complex mechanistic basis of these different genomic signatures.
Nucleotide Compositions
It has long been known that viruses may take on a much wider range of GC frequencies than other organisms (Wyatt 1952; Bronson and Anderson 1994). Even viruses within the same family, which have similar replication and life-cycle strategies, can show very different GC contents and hence large differences in codon biases (Schachtel et al. 1991). For example, HSV-2 and VZV (HHV-3), closely related alphaherpesviruses, which infect humans, persist latently in the nervous system of the host, and reactivate to cause secondary infections (but which have different routes of infection and cell tropism), show GC contents of 71 and 47% and Nc values of 37 and 56, respectively. Yet despite these extreme differences in nucleotide and codon bias, the viruses exhibit almost-identical frequencies of CpG doublets in proportion to their GC contents. Again, this supports the idea that there are common mechanisms which determine CpG frequencies for viruses with similar life cycles.
Another important observation in this context is that many poxviruses, many iridoviruses, and African swine fever virus have low GC contents. From the perspective of the mutation pressure hypothesis this could be viewed as resulting from their cytoplasmic site of replication, particularly if the nucleotide composition in this location differs greatly from that of the cell nucleus (Moss 2001; Moyer and Henderson 1985; Williams 1996; Rojo et al. 1999). Accordingly, the low GC content in most RNA viruses may similarly reflect both their cytoplasmic site of replication and the fact that they, like small DNA viruses, do not encode enzymes that alter dNTP concentrations.
Comparisons of Host-Viral Genome Compositions
Given the role played by mutation pressure in shaping codon and nucleotide biases in both animals and viruses, and because small DNA viruses are replicated, and possibly repaired, by cellular machinery, whereas large DNA viruses generally encode much of their own replicative machinery, it might be expected that the former will show biases more similar to those of host cellular DNA than the latter. Clearly, viral-host genome comparisons are critical in addressing this question. However, because vertebrate genomes are so large and their base compositions differ dramatically depending on the region of the genome examined, it is difficult to make generalized statements regarding host nucleotide biases. Moreover, unlike the situation in bacteria, there is no strong evidence in mammals of the link between codon usage bias and gene expression levels that would be expected under tRNA-mediated selection (Sharp and Matassi 1994). This was confirmed during the initial analysis of the human genome sequence in which codon usage bias was found to be largely determined by genome location and hence local mutation pressure (Lander et al. 2001). Consequently, studies that attempt to compare viral codon usage with host codon usage or tRNA availability are likely to be unreliable.
CpG Frequency
Considerable data have been compiled on CpG contents within different regions of the human genome which are useful in comparing the base composition of human viruses with those of their hosts. CpG suppression in most small human DNA viruses falls within the range of that in the human genome, in which ρCpG values of 50-kb stretches range from 0.12 to 0.45 and chromosomal averages range from 0.18 to 0.31 (Karlin and Mrázek 1997; Gentles and Karlin 2001). However, unlike the situation in their hosts, CpG is not suppressed in most large human DNA viruses, where the frequency of this dinucleotide is almost four times as high as for human DNA. At first sight this suggests that CpG bias in the small DNA viruses could be the result of biases intrinsic to the host replicative/repair machinery/processes which act on small viral genomes.
CpG depletion in vertebrates is largely attributed to the methylation/deamination/mutation of cytosines in these doublets. This is partially supported by the overrepresentation of TpG in the human genome. However, this overrepresentation is slight in comparison to the great underrepresentation of CpG, indicating that methylation cannot be the only cause of CpG depletion in humans (Gentles and Karlin 2001). While small DNA viruses also have elevated levels of TpG, this elevation is even less than that in the human genome, which makes methylation/deamination an even more questionable cause of CpG suppression in these viruses. Furthermore, available evidence indicates that there may be little or no methylation in many viral genomes when they are actively replicated or packaged (Acken et al. 1979; Karlin and Burge 1995; Lundberg et al. 2003; Kämmer and Doerfler 1995).
While the common biases in human and small DNA virus genomes may still be due to shared replication machinery within the nucleus, a few factors point away from this explanation. First, it does not explain the human polyomaviruses which have ρCpG levels of 0.05 and 0.07—far below human levels. Second, small DNA viruses may not employ the complete set of host replicative machinery, as at least some of these viruses have mutation rates orders of magnitude greater than those of the host (Truyen et al. 1995; Shackelton et al. 2005). Finally, it does not explain why RNA viruses, which encode their own polymerases and replicate in the cytoplasm, should also have many of the same dinucleotide signatures (Karlin et al. 1994).
It is therefore likely that additional factors influence the suppression of CpG in small DNA viruses. The deficiency may be related to the immunostimulatory properties of unmethylated CpGs, which are recognized by the host’s innate immune system as a pathogen signature (Krieg 2003). These sequences bind and activate Toll-like receptor 9 (TLR9) on neutrophils, dendritic cells, and macrophages, inducing a rapid immune response (Lund et al. 2003; Wagner 2004). While a high CpG content may be detrimental to small DNA (and RNA) viruses, large ones may not be similarly affected because they encode a range of proteins that interfere with cellular pathogen-pattern recognition. For example, vaccinia poxvirus encodes antagonists of TLRs (Harte et al. 2003). It has even been suggested that some complex viruses have evolved mechanisms to activate TLRs to induce the accompanying activation of specific cells (Rassa and Ross 2003). As long as the virus can modulate the antiviral response which accompanies TLR activation, it may induce cellular proliferation for its own replicative advantage.
In contrast to most large viruses, gammaherpesviruses and iridoviruses show a CpG deficiency. This may be because these viruses methylate their genomes at specific times in their life cycle (Ambinder et al. 1999; Willis and Granoff 1980; Wagner et al. 1985), perhaps to protect against immune recognition (Tao and Robertson 2003) and control patterns of transcription. As with vertebrate genomes, methylated viral genomes would face a high chance of mutation at CpGs, which would result in a reduction of this dinucleotide (Ambinder et al. 1999).
The most obvious difference between the CpG-suppressed small DNA viruses and the large DNA viruses is gene number. While the compact genomes of the papillomaviruses, polyomaviruses, anellovirus, circoviruses, and parvoviruses encode only a small number of replication and structural genes, most large viruses encode over 100 genes which, among other things, may allow them to evade adaptive or innate immune detection, suppress antiviral immune reactions, regulate and dominate cellular metabolic machinery, and produce, and compensate for, cellular depletions in metabolic resources (Shackelton and Holmes 2004). Small DNA viruses, which do not encode such an extended array of proteins capable of manipulating host cell metabolism or reducing/delaying host immune responses, may be under strong selection to propagate quickly in a cell with limited resources before host responses resolve the infection. In this case CpG may be selected against not only because of its immunostimulatory effects, but because it may extend the amount of time necessary for viral replication and transcription in the infected cell. Specifically, CpG has the highest stacking energy of any dinucleotide, thereby requiring the greatest amount of free energy to disrupt a double helix (Karlin and Burge 1995; Breslauer et al. 1986).
Finally, the fact that CpG introduces structural abnormalities into the DNA helix (El Antri et al. 1993a, b; Grzeskowiak et al. 1991) may also affect the dinucleotide composition of small DNA viruses. These viruses have few epigenetic protection mechanisms against helical distortions or against the DNA-binding molecules that such distortions attract, which may interfere with genome recognition or packaging.
Dependoviruses
Dependoviruses are small viruses of the Parvoviridae, so called because they require a helper virus, most often an adenovirus, or in some cases a herpesvirus, for a productive infection. Curiously, these viruses do not show the genomic biases common to other similarly sized DNA viruses, including the closely related autonomous parvoviruses. Instead of CpG depletion, the genomic signature of dependoviruses resembles that of adenoviruses and other large dsDNA viruses. Unlike other parvoviruses, these viruses integrate into the host genome when the helper virus is absent (Muzyczka and Berns 2001). Although the possibly prolonged integrated stage in their life cycle may affect their base composition, helper virus coinfection may also play a role. By replicating in cells alongside these large viruses, dependoviruses may “benefit” from the adenovirus or herpesvirus modulation of the host immune system, control of the cell cycle, and induction or production of supplementary resources. The additional time and resources available for replication and expression, and the suppression of viral detection and immune responses, could result in a dependovirus composition similar to that of its helper virus.
References
Acken UV, Simon D, Grunert F, Döring H-P, Kröger H (1979) Methylation of viral DNA in vivo and in vitro. Virology 99:152–157
Ambinder RF, Robertson KD, Tao Q (1999) DNA methylation and the Epstein-Barr virus. Semin Cancer Biol 9:369–375
Bernardi G, Bernardi G (1986) Compositional constraints and genome evolution. J Mol Evol 24:1–11
Beutler E, Gelbart T, Han J, Koziol JA, Beutler B (1989) Evolution of the genome and the genetic code: Selection at the dinucleotide level by methylation and polyribonucleotide cleavage. Proc Natl Acad Sci USA 86:192–196
Breslauer KJ, Frank R, Blocker H, Marky LA (1986) Predicting DNA duplex stability from the base sequence. Proc Natl Acad Sci USA 83:3746–3750
Bronson EC, Anderson JN (1994) Nucleotide composition as a driving force in the evolution of retroviruses. J Mol Evol 38:506–532
Burge C, Campbell AM, Karlin S (1992) Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci USA 89:1358–1362
Chamary J-V, Hurst LD (2004) Similar rates but different modes of sequence evolution in introns and at exonic silent sites in rodents, evidence for selectively driven codon usage. Mol Biol Evol 21:1014–1023
Cole CN, Conzen SD (2001) Polyomaviridae: the viruses and their replication. In: Knipe DM, Howley PM (eds) Fundamental virology, vol 4. Lippincott Williams and Wilkins, Philadelphia, PA, pp 985–1018
De Amicis F, Marchetti S (2000) Intercodon dinucleotides affect codon choice in plant genes. Nucleic Acids Res 28:3339–3345
Duan J, Antezana MA (2003) Mammalian mutation pressure, synonymous codon choice, and mRNA degradation. J Mol Evol 57:694–701
El Antri S, Bittoun P, Mauffret O, Monnot M, Lescot E, Convert O, Fermandjian S (1993a) Effect of distortions in the phosphate backbone conformation of six related octanucleotide duplexes on CD and 31P NMR spectra. Biochemistry 32:7079–7088
El Antri S, Mauffret O, Monnot M, Lescot E, Convert O, Fermandjian S (1993b) Structural deviations at CpG provide a plausible explanation for the high frequency of mutation at this site, Phosphorus nuclear magnetic resonance and circular dichroism studies. J Mol Biol 230:373–378
Gentles AJ, Karlin S (2001) Genome-scale compositional comparisons in eukaryotes. Genome Res 11:540–546
Gouy M, Gautier C (1982) Codon usage in bacteria, correlation with gene expressivity. Nucleic Acids Res 10:7055–7074
Grantham R, Gautier C, Gouy M, Mercier R, Pave A (1980) Codon catalogue usage and the genome hypothesis. Nucleic Acids Res 8:49–62
Grantham R, Perrin P, Mouchiroud D (1986) Patterns in codon usage of different kinds of species. Oxford Surv Evol Biol 3:48–81
Grzeskowiak K, Yanagi K, Privé GG, Dickerson RE (1991) The structure of B-helical C-G-A-T-C-G-A-T-C-G and comparison with C-C-A-A-C-G-T-T-G-G. J Biol Chem 266:8861–8883
Gu W, Zhou T, Ma J, Sun X, Lu Z (2004) Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales. Virus Res 101:155–161
Harte MT, Haga IR, Maloney G, Gray P, Reading PC, Bartlett NW, Smith GL, Bowie A, O’Neill AJ (2003) The poxvirus protein A52R targets toll-like receptor signalling complexes to suppress host defense. J Exp Med 197:343–351
Howley PM, Lowy DR (2001) Papillomaviruses and their replication. In: Knipe DM, Howley PM (eds) Fundamental virology, vol 4. Lippincott Williams and Wilkins, Philadelphia, PA, pp 1019–1051
Jenkins GM, Holmes EC (2003) The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res 92:1–7
Jones PA, Rideout WM III, Shen J-C, Spruck CH, Tsai YC (1992) Methylation, mutation and cancer. Bioessays 14:33–36
Kämmer C, Doerfler W (1995) Genomic sequencing reveals absence of DNA methylation in the major late promoter of adenovirus type 2 DNA in the virion and in productively infected cells. FEBS Lett 362:301–305
Karlin S, Burge C (1995) Dinucleotide relative abundance extremes, a genomic signature. Trends Genet 11:283–290
Karlin S, Mrázek J (1997) Compositional differences within and between eukaryotic genomes. Proc Natl Acad Sci USA 94:10227–10232
Karlin S, Doerfler W, Cardon LR (1994) Why is CpG suppressed in the genomes of virtually all small eukaryotic viruses but not in those of large eukaryotic viruses? J Virol 68:2889–2897
Kress C, Thomassin H, Grange T (2001) Local DNA methylation in vertebrates, how could it be performed and targeted? FEBS Lett 494:135–140
Krieg AM (2003) CpG DNA, Trigger of sepsis, mediator of protection, or both? Scand J Infect Dis 35:653–659
Lander ES, Linton LM, Birren B, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
Lund J, Sato A, Medzhitov R, Iwasaki A (2003) Toll-like receptor 9-mediated recognition of Herpes simplex virus-2 by plasmacytoid dendritic cells. J Exp Med 198:513–520
Lundberg P, Welander P, Han X, Cantin E (2003) Herpes simplex virus type 1 DNA is immunostimulatory in vitro and in vivo. J Virol 77:11158–11169
Moss B (2001) Poxviridae: The viruses and their replication. In: Knipe D, Howley P (eds) Fundamental virology, vol 4. Lippincott Williams and Wilkins, Philadelphia, PA, pp 1249–1283
Moyer JD, Henderson JF (1985) Compartmentation of intracellular nucleotides in mammalian cells. CRC Crit Rev Biochem 19:45–61
Muzyczka N, Berns KI (2001) Parvoviridae: the viruses and their replication. In: Knipe DM, Howley PM (eds) Fundamental virology, vol 4. Lippincott Williams and Wilkins, Philadelphia, PA, pp 1089–1121
Novembre JA (2002) Accounting for background nucleotide composition when measuring codon usage bias. Mol Biol Evol 19:1390–1394
Oresic M, Shalloway D (1998) Specific correlations between relative synonymous codon usage and protein secondary structure. J Mol Biol 281:31–48
Powell JR, Moriyama EN (1997) Evolution of codon usage bias in Drosophila. Proc Natl Acad Sci USA 94:7784–7790
Pride DT (2000) SWAAP Version 1.0.0—Sliding windows alignment analysis program: a tool for analyzing patterns of substitutions and similarity in multiple alignments. Distributed by the author
Rassa J, Ross SR (2003) Viruses and toll-like receptors. Microbes Infect 5:961–968
Rojo G, García-Beato R, Viñuela E, Sala MA, Salas J (1999) Replication of African swine fever virus DNA in infected cells. Virology 257:524–536
Schachtel GA, Bucher P, Mocarski ES, Blaisdell BE, Karlin S (1991) Evidence for selective evolution in codon usage in conserved amino acid segments of human alphaherpesvirus proteins. J Mol Evol 33:483–494
Shackelton LA, Holmes EC (2004) The evolution of large DNA viruses, combining genomic information of viruses and their hosts. Trends Microbiol 12:458–465
Shackelton LA, Parrish CR, Truyen U, Holmes EC (2005) High rate of viral evolution associated with the emergence of carnivore parvovirus. Proc Natl Acad Sci USA 102:379–384
Sharp PM, Matassi G (1994) Codon usage and genome evolution. Curr Opin Genet Dev 4:851–860
Sharp PM, Tuohy TM, Mosurski KR (1986) Codon usage in yeast, cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res 14:5125–5143
Sharp PM, Stenico M, Peden JF, Lloyd AT (1993) Codon usage, mutational bias, translational selection, or both? Biochem Soc Trans 21:835–841
Smith NGC, Eyre-Walker A (2001) Synonymous codon bias is not caused by mutation bias in G + C-rich genes in humans. Mol Biol Evol 18:982–986
Stenico M, Lloyd AT, Sharp PM (1994) Codon usage in Caenorhabditis elegans, delineation of translational selection and mutational biases. Nucleic Acids Res 22:2437–2446
Strauss EG, Strauss JH, Levine AJ (1996) Virus evolution. In: Fields BN, Knipe DM, Howley PM (eds) Virology. Lippincott-Raven, Philadelphia, PA, pp 153–171
Sueoka N (1961) Compositional correlation between deoxyribonucleic acid and protein. Cold Spring Harbor Symp Quant Biol 26:35–43
Tao Q, Robertson KD (2003) Stealth technology, how Epstein–Barr virus utilizes DNA methylation to cloak itself from immune detection. Clin Immunol 109:53–63
Truyen U, Gruenberg A, Chang SW, Obermaier B, Veijalainen P, Parrish CR (1995) Evolution of the feline-subgroup parvoviruses and the control of canine host range in vivo. J Virol 69:4702–4710
Wagner H (2004) The immunobiology of the TLR9 subfamily. Trends Immunol 25:381–386
Wagner H, Simon D, Werner E, Gelderblom H, Darai C, Flügel RM (1985) Methylation pattern of fish lymphocystis disease virus DNA. J Virol 53:1005–1007
Williams T (1996) The iridoviruses. Adv Virus Res 46:345–412
Willis DB, Granoff A (1980) Frog virus 3 DNA is heavily methylated at CpG sequences. Virology 107:250–257
Wright F (1990) The “effective number of codons” used in a gene. Gene 87:23–29
Wyatt GR (1952) The nucleic acids of some insect viruses. J Gen Physiol 36:201–205
Xia X (1996) Maximizing transcription efficiency causes codon usage bias. Genetics 144:1309–1320
Zama M (1990) Codon usage and secondary structure of mRNA. Nucleic Acids Symp Ser 22:93–94
Zhao K-N, Liu WJ, Frazer IH (2003) Codon usage bias and A + T content variation in human papillomavirus genomes. Virus Res 98:95–104
Acknowledgments
This work was completed under a Howard Hughes Medical Institute Fellowship to L.A.S. and NIH Grant R01AI028385 to C.R.P.
[Reviewing Editor: Dr. Nicolas Galtier]
Shackelton, L.A., Parrish, C.R. & Holmes, E.C. Evolutionary Basis of Codon Usage and Nucleotide Composition Bias in Vertebrate DNA Viruses. J Mol Evol 62, 551–563 (2006). https://doi.org/10.1007/s00239-005-0221-1