Introduction

Even before the genetic code was deciphered, it was proposed that gene sequence evolution is not only influenced by fitness effects at the protein level, but also by the intrinsic nucleotide composition of the genome (Sueoka 1961). Once the redundancy of the genetic code was revealed, it became apparent that different organisms had evolved, along with “classical phenotypes,” unique genomic signatures, or “genomic phenotypes” (Bernardi and Bernardi 1986). Of particular importance was the proposal that each species was subject to specific genomic pressures on base composition, in turn resulting in a distinctive bias in codon choice (Grantham et al. 1980), and that explaining these unique coding strategies “is the heart of the problem of molecular evolution” (Grantham et al. 1986).

More recent studies have revealed that patterns of codon usage bias and nucleotide composition within many cellular genomes are far more complex than previously imagined, and the factors shaping their evolution are still not entirely understood. In principle, biases in nucleotide composition and codon usage can result from natural selection and/or differential mutational pressure. In many organisms, such as Escherichia coli, Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans, there is evidence that codons that use abundant tRNAs are selectively favoured, especially in highly expressed genes (Sharp et al. 1986; Powell and Moriyama 1997; Gouy and Gautier 1982; Stenico et al. 1994). Codon selection of this type most likely functions to optimize translational speed and/or translational accuracy, although additional factors, such as transcription efficiency, mRNA secondary structure, and protein structure, can also exert selection pressures (Xia 1996; Zama 1990; Oresic and Shalloway 1998). In contrast, codon usage bias in mammals appears to be more strongly influenced by differential mutation pressure (Sharp et al. 1993), although some evidence for selection has also been observed (Smith and Eyre-Walker 2001; Chamary and Hurst 2004; Duan and Antezana 2003). Under the mutation pressure model, intrinsic differences in the propensity of genome replication to make specific mutational errors, usually depicted as the frequency of GC ↔ AT changes, shape overall patterns of base composition. Further, because the effective population sizes (Ne) of mammalian species are typically small, as are the selection coefficients (s) of most mutations at synonymous sites or in noncoding DNA (i.e., Ne s << 1), natural selection is usually unable to control the substitution dynamics of these mutations and they are fixed, or more usually lost, by genetic drift. Evidence for this “mutation-drift” theory in vertebrate genomes is that the nucleotide and codon bias of genes largely reflects their genomic location; for example, genes in GC-rich regions (so-called GC “isochores”) are biased toward GC-ending codons (Sharp et al. 1993).

There have been few in-depth studies of codon and nucleotide usage biases among families of DNA viruses, although the biases of some eukaryotic viruses can differ substantially from those of their hosts (Strauss et al. 1996). For example, in the mammalian papillomaviruses it has been proposed that a codon usage bias different from the average seen in host genomes strongly influences both viral replication and gene expression (Zhao et al. 2003). Moreover, in the rapidly evolving human RNA viruses, one of the few groups of viruses for which codon bias data have been compiled, mutation pressure seems to be the main force shaping codon usage, accounting for 71–85% of the observed bias (Jenkins and Holmes 2003). Indeed, in one in-depth analysis of the nidovirales, neither translational selection nor gene length was found to have an effect on codon usage (Gu et al. 2004). Although RNA viruses possess large effective population sizes, it is possible that their mutation rates are so high that they prevent natural selection from working efficiently on codon choice (Jenkins and Holmes 2003).

Codon usage can also be strongly influenced by underlying biases in dinucleotide frequency, which differs greatly among organisms. Specifically, after accounting for dinucleotide biases, the proportion of codon usage bias explained by mutation pressure often increases, as seen in human RNA viruses (Jenkins and Holmes 2003). Dinucleotide biases can be extreme. For example, CpG is present at only 20% of its expected frequency in most vertebrate genomes (Jones et al. 1992) and is depleted in other organisms (Karlin et al. 1994). The most popular explanation for the underrepresentation of CpG in vertebrate genomes is that 60–90% of cytosines in CpG doublets are methylated by cellular methyltransferases (Kress et al. 2001) and methylated cytosine (5-methylcytosine) has a tendency to undergo deamination when unpaired, resulting in the mutation of the cytosine to a thymidine. Methylation can control the expression of particular genes by preventing the binding of transcription factors and modifying chromatin structure and interactions with histones. In vertebrates, methylation is central to genomic imprinting and cell differentiation (Kress et al. 2001). Thus, CpG may be selected for at certain sites and selected against at others. While mutation following methylation probably plays a strong role in the reduced CpG content of many organisms, it is unlikely to be the only factor. In particular, the underrepresentation of CpG is often not accompanied by a correspondingly high level of TpG. Furthermore, similarly low CpG contents are found in vertebrate mitochondrial genomes which are unmethylated (De Amicis and Marchetti 2000).

The dinucleotide TpA is also underrepresented in many genomes (Burge et al. 1992). This is often attributed to the susceptibility of UpA uracils to RNase and to the fact that two of the three stop codons in the universal code begin with TpA (Beutler et al. 1989). Additionally, structural factors may have an effect on the frequency of certain dinucleotides. For example, the reduced frequency of TpA may also be explained by its low thermal stability (Breslauer et al. 1986; Beutler et al. 1989), such that TpA-rich sequences in DNA helices may suffer from detrimental levels of unwinding, twisting, and bending (De Amicis and Marchetti 2000).

Eukaryotic DNA viruses can be classified into two or three broad groups based on the size and nature of their genome. The genomes of large double-stranded DNA (dsDNA) viruses are mostly greater than 100 kb in length, while a putative second group, the Adenoviridae, are 28–45 kb in length but biologically similar to the large dsDNA viruses in that they encode their own DNA polymerase and accessory proteins involved in immune response regulation. In contrast, the genomes of small DNA viruses are less than 10 kb in length and composed of either dsDNA or single-stranded DNA (ssDNA). Large dsDNA viruses encode many different proteins, often exceeding 100, which play intricate roles in virus replication, host-cell regulation, and host immune modulation (Shackelton and Holmes 2004). In contrast, small DNA viruses encode fewer than 10 proteins, some of which provide capsid structural functions while the remainder are primarily involved in supplementing or stimulating replication by host machinery, as these viruses do not encode their own polymerases. For the small double-stranded Polyomaviridae and Papillomaviridae, these supplemental replication proteins typically induce the host cell to enter S phase through a variety of interactions with host cell cycle regulatory components (Cole et al. 2001; Howley et al. 2001). Most of the single-stranded parvoviruses can only replicate in cells that are mitotically active, while the dependo-parvoviruses rely on co-infection of their host cells by a large helper adenovirus or herpesvirus (Muzyczka and Berns 2001). In addition, small DNA viruses are generally unable to modulate either the innate or the adaptive immune response of the host to nearly the extent of the large DNA viruses. This, and the fact that small DNA viruses are far more dependent on cellular proteins and other resources for their propagation, may have resulted in differences in the evolution of biases in base composition and codon usage among large and small viruses.

Herein we conducted an analysis of the codon usage and nucleotide biases among all vertebrate-infecting DNA virus families. Such a comparative analysis is central to revealing the interplay between the genetic drift of neutral mutations and the selective fixation of advantageous ones, can help elucidate the evolutionary dynamics of viruses and their interactions with hosts, and can potentially improve the efficiency and effectiveness of experimental or therapeutic procedures in which viral genome sequences are altered.

Materials and Methods

Sequence Data

Reference sequences (as defined by GenBank; http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/10239.html) were obtained for at least one species within all genera of vertebrate DNA viruses for which such data exist. This resulted in data sets of 41 large dsDNA viruses, 21 small dsDNA viruses, and 15 small ssDNA viruses. The hepadnaviruses were excluded, as they replicate via reverse transcription and are therefore likely to be subject to different evolutionary pressures. Known and putative ORFs were concatenated for total codon analyses. Accession numbers are given in Table 1.

Table 1 Codon usage bias, as measured by the effective codon usage statistic, not accounting for (Nc) and accounting for (Nc’) background nucleotide composition, and nucleotide bias (measured by ORF GC3s and genomic GC content).

Measuring Codon Usage and Nucleotide Biases

The CodonW package (http://www.bioweb.pasteur.fr/docs/softgen.html#CODONW) was employed to measure the effective codon usage statistic, Nc (Wright 1990), codon usage frequency, and nucleotide biases. Nc is calculated on a scale of 20 to 61, with a score of 20 representing maximum bias (the use of only one codon for each of the 20 amino acids) and a score of 61 indicating no bias, such that all codons are used equally for each amino acid.
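To make the Nc scale concrete, the following Python sketch computes Nc from the published formula of Wright (1990): a homozygosity value F is obtained for each amino acid, averaged within each degeneracy class, and combined as Nc = 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. This is an illustration only and is not the CodonW code used in this study. The function name, the assumption of the standard genetic code, the cap at 61, and the handling of sparsely used amino acids are choices made for this sketch.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Number of amino acids in each degeneracy class (2-, 3-, 4-, 6-fold).
CLASS_SIZES = {2: 9, 3: 1, 4: 5, 6: 3}


def effective_number_of_codons(orf):
    """Wright's (1990) Nc for an in-frame coding sequence (hypothetical helper)."""
    orf = orf.upper().replace("U", "T")
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

    # Group the 61 sense codons by the amino acid they encode.
    families = {}
    for codon, aa in CODE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    homozygosity = {}  # degeneracy class -> list of F values
    for codons in families.values():
        k = len(codons)
        n = sum(counts[c] for c in codons)
        if k == 1 or n < 2:
            continue  # skip Met/Trp and amino acids seen fewer than twice
        s = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * s - 1) / (n - 1)
        if f > 0:
            homozygosity.setdefault(k, []).append(f)

    mean_f = {k: sum(v) / len(v) for k, v in homozygosity.items()}
    # Wright's fallback: if isoleucine is unusable, average the 2- and 4-fold classes.
    if 3 not in mean_f and 2 in mean_f and 4 in mean_f:
        mean_f[3] = (mean_f[2] + mean_f[4]) / 2
    missing = set(CLASS_SIZES) - set(mean_f)
    if missing:
        raise ValueError(f"too few codons in degeneracy classes {sorted(missing)}")

    nc = 2 + sum(size / mean_f[k] for k, size in CLASS_SIZES.items())
    return min(nc, 61.0)  # Nc is conventionally capped at 61
```

A sequence that uses a single codon for every amino acid gives Nc = 20 under this sketch, and one that uses all synonymous codons equally gives the capped value of 61, matching the two ends of the scale described above.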

We measured the total G + C (denoted GC) content and the frequency of all dinucleotides within both the concatenated ORFs and the complete genome sequences. Dinucleotide biases were calculated as the observed frequency of the dinucleotide relative to the product of the frequencies of the individual nucleotides (i.e., the expected dinucleotide frequency). For example, ρTpA = ƒ(TA)/[ƒ(T)ƒ(A)]. For dinucleotides that are not their own reverse complement, we symmetrized the measure of ρ with the complementary dinucleotide on the opposite strand and labeled this ρ*, as outlined by Burge et al. (1992). For example, ρ*TpG = 2[ƒ(TG) + ƒ(CA)]/{[ƒ(T) + ƒ(A)][ƒ(G) + ƒ(C)]}.
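The two ratios described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study. The function names are hypothetical, and the sketch assumes a single-stranded input containing only A, C, G, and T, with dinucleotides counted at every overlapping position.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def frequencies(seq):
    """Return (base frequencies, dinucleotide frequencies) for one strand."""
    seq = seq.upper()
    bases = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_bases, n_pairs = len(seq), len(seq) - 1
    base_freq = {b: bases[b] / n_bases for b in "ACGT"}
    pair_freq = {d: c / n_pairs for d, c in pairs.items()}
    return base_freq, pair_freq


def rho(seq, dinuc):
    """Observed/expected ratio, e.g. rho(seq, "TA") = f(TA) / (f(T) f(A))."""
    base_freq, pair_freq = frequencies(seq)
    x, y = dinuc
    return pair_freq.get(dinuc, 0.0) / (base_freq[x] * base_freq[y])


def rho_star(seq, dinuc):
    """Strand-symmetrized ratio, pooling the dinucleotide with its reverse complement."""
    base_freq, pair_freq = frequencies(seq)
    x, y = dinuc
    rev_comp = dinuc.translate(COMPLEMENT)[::-1]  # e.g. "TG" -> "CA"
    x_c, y_c = x.translate(COMPLEMENT), y.translate(COMPLEMENT)
    numerator = 2 * (pair_freq.get(dinuc, 0.0) + pair_freq.get(rev_comp, 0.0))
    denominator = (base_freq[x] + base_freq[x_c]) * (base_freq[y] + base_freq[y_c])
    return numerator / denominator
```

By construction, rho_star gives the same value for a dinucleotide and for its reverse complement (for example TpG and CpA), which is the purpose of the symmetrization.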

Besides total GC content, we measured the frequency with which either of these nucleotides appears at the synonymous third codon position (GC3s). GC3s was compared to the GC content at the first and second codon positions (GC1,2) with the Pearson product-moment correlation coefficient (r). To examine the influence of GC content on codon usage we plotted the relationship between Nc and GC3s for each virus. This was compared to the Nc which would result if GC content were solely responsible for the codon biases (denoted Nc*), calculated as Nc* = 2 + GC3s + {29/[(GC3s)^2 + (1 − GC3s)^2]} (Wright 1990).
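As an illustration of these two quantities, the sketch below computes GC3s for an in-frame ORF and evaluates the expected value Nc* from the formula above. It is not the CodonW implementation. The function names are hypothetical, and the sketch assumes the standard genetic code and that GC3s excludes the Met and Trp codons and the three stop codons, which have no synonymous alternatives.

```python
# Codons with no synonymous alternative (Met, Trp) plus the three stop codons;
# excluding them from GC3s is an assumption of this sketch.
NON_SYNONYMOUS_CODONS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(orf):
    """Fraction of G or C at the third position of synonymous codons."""
    orf = orf.upper()
    total = gc = 0
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in NON_SYNONYMOUS_CODONS or set(codon) - set("ACGT"):
            continue  # skip uninformative or ambiguous codons
        total += 1
        gc += codon[2] in "GC"
    if total == 0:
        raise ValueError("no synonymous codons found")
    return gc / total


def nc_expected(gc3s_value):
    """Nc* = 2 + s + 29 / (s^2 + (1 - s)^2), with s = GC3s (Wright 1990)."""
    s = gc3s_value
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)
```

Under this formula Nc* peaks at 60.5 when GC3s = 0.5 and falls to 31 and 32 at GC3s = 0 and 1, so an observed Nc lying well below the curve indicates codon bias beyond what third-position GC content alone would produce.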

To quantify the effects of natural selection on codon usage we also calculated the Nc’ statistic for each data set using the programs SeqCount and ENCprime (Novembre 2002). Nc’ is similar to Nc but takes into account background nucleotide composition. This statistic, which also ranges from 20 to 61, measures the deviation of the actual codon usage pattern from the distribution expected from the background nucleotide composition (Novembre 2002).

To determine codon biases across genomes we measured the Nc and GC3s, with the methods described above, of individual genes from each family, subfamily, or distinct genus. To determine CpG biases across large DNA virus genomes we conducted a sliding window analysis of CpG content with a window size of 300 and a step size of 100 using the SWAAP 1.0.2 program (Pride 2000).
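The sliding window scan can be sketched as follows, using the window size of 300 and step size of 100 given above. This is not the SWAAP program. The function name is hypothetical, and the sketch assumes that "CpG content" is summarized within each window as the observed/expected ratio, whereas SWAAP may report a different summary.

```python
import math


def sliding_cpg_ratio(seq, window=300, step=100):
    """Return (window start, observed/expected CpG) for each full window."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        f_c = chunk.count("C") / window
        f_g = chunk.count("G") / window
        # "CG" cannot overlap itself, so str.count finds every occurrence.
        f_cg = chunk.count("CG") / (window - 1)
        if f_c == 0 or f_g == 0:
            ratio = math.nan  # expected frequency is zero, ratio undefined
        else:
            ratio = f_cg / (f_c * f_g)
        results.append((start, ratio))
    return results
```

Plotting the second element of each tuple against the first gives a profile along the genome, in which a region with systematically different CpG usage would appear as a sustained departure from the genome-wide average.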

Results

Codon Usage Biases

The effective codon usage statistic, Nc, was used to measure codon usage bias across the ORFs of large and small DNA viruses (Table 1). While all viruses exhibit some codon bias, none of these biases were unusually strong and the majority of families include members with both low and high codon biases, following no apparent trend. The mean (and range) for the Nc values of the large and small viruses were 48.3 (34.7–59.6) and 50.5 (35.8–58.0), respectively. However, an examination of genome and gene codon usage tables for a number of viruses suggested that the codon biases which do exist are greatly influenced by GC frequency. For example, in canine parvovirus and human BK virus, which have low Nc values, the most frequently employed synonymous codons are those with the fewest Gs and Cs. As a case in point, of the six synonymous codons coding for arginine, AGA, the only codon with a single G or C, is used 83% and 61% of the time in these two viruses, respectively. In contrast, HSV-2, a virus with a low Nc value which probably results from an overrepresentation of Gs and Cs, employs the two arginine codons with only Cs and Gs, CGC and CGG, 82% of the time.

Nucleotide Biases

To determine whether these codon biases are primarily caused by mutation pressure on overall base composition or natural selection for particular triplets, we first explored the extent of underlying nucleotide bias in all viral genomes. We began by examining the overall GC content at the genomic level and in the concatenated ORFs, as well as the GC3s content of the latter (Table 1). Although values varied greatly, the majority of large viruses, including the adenoviruses, alphaherpesviruses, and betaherpesviruses, have GC3s contents above 0.50, averaging 0.52, 0.64, and 0.61, respectively. However, specific families of large DNA viruses, namely, the Poxviridae, the Asfarviridae, and some iridoviruses, have low GC3s contents. Interestingly, these are the only DNA viruses which replicate completely (Poxviridae) or partially in the cytoplasm of the cell rather than entirely in the nucleus (Moss 2001; Rojo et al. 1999; Williams 1996). Almost all small DNA viruses showed low GC3s values. For the Papillomaviridae, Polyomaviridae, anellovirus, Circoviridae, and Parvoviridae (excluding the dependovirus genera of the Parvoviridae, which do not show this trend), the average GC3s values were 0.41, 0.35, 0.43, 0.53, and 0.33, respectively. For all viruses, GC3s values were close to their genomic GC values. There were no obvious trends in single nucleotide strand biases across virus families.

Mutation Pressure Versus Codon Selection

To determine the relative effects of mutation pressure versus natural selection on codon composition, we examined the relationship between GC3s and GC at the first and second codon positions (GC1,2). As shown in Fig. 1, GC3s and GC1,2 were significantly correlated (r = 0.95 for the large DNA viruses and r = 0.94 for the small DNA viruses; p < 0.001). Although this test does not take into account the phylogenetic relationships among the taxa studied, it does show that observed patterns of base composition are present at all codon positions. This suggests that they are most likely the result of mutation pressure, as natural selection would be expected to act differently on different codon positions.
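For readers who wish to repeat this comparison on their own data, the Pearson product-moment correlation coefficient can be computed as in the sketch below. The function name is hypothetical, and the input lists are assumed to hold one GC3s value and one GC1,2 value per virus, in the same order. No values from Table 1 are reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)
```

As noted above, this statistic treats each virus as an independent observation and does not correct for shared ancestry among the taxa.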

Figure 1

Correlation between GC content at the synonymous third codon position (GC3s) and GC content at the nonsynonymous first/second positions (GC1,2) of DNA viruses. Filled squares and open diamonds represent the individual large DNA viruses and small DNA viruses used in this study, respectively.

The notion that codon bias is primarily governed by an underlying biased mutation pressure was further supported by examining the correlation between the GC3s and Nc of the coding sequences. We plotted GC3s against Nc for each virus and compared this to the expected Nc (Nc*) that would result if GC content were the sole determinant of codon usage. Remarkably, the actual Nc values for most viruses were on, or just below, the Nc* curve (Fig. 2). This implies that codon bias is mainly explained by uneven base composition and, hence, by mutation pressure rather than natural selection on codon choice. However, the fact that the majority of the actual Nc values were slightly lower than Nc* indicates that there are other factors, with less of an effect, that also influence the codon bias.

Figure 2

Correlation between the synonymous third codon position (GC3s) and the effective codon usage statistic (Nc) for DNA viruses. Filled squares and open diamonds represent the individual large DNA viruses and small DNA viruses, respectively. As shown, these Nc values form a curve on, or slightly below, the line representing the expected Nc values (Nc*) which would result if GC composition were the only factor influencing codon usage bias.

Additional evidence that codon biases are predominantly influenced by mutation pressure was obtained by determining the Nc’ statistic for each virus. Unlike Nc, Nc’ takes into account the background nucleotide composition and should therefore reflect the degree of codon usage bias due to factors other than nucleotide composition. As expected, Nc’ was greater than Nc in the vast majority of viruses and the difference between Nc and Nc’ was greatest in those viruses which have GC contents that depart significantly from 50% and minimal in those viruses with GC contents near 50% (Table 1). The graphic relationship between GC content and Nc’–Nc is a V-shaped curve centered at a GC content of 0.5—approximated by y = C(|x − 0.5|) (Fig. 3).
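The constant C in the approximation y = C(|x − 0.5|) can be estimated by least squares through the origin, as sketched below. This is one possible way to obtain such a fit and is not necessarily how the curve in Fig. 3 was drawn. The function name is hypothetical, with x taken as the GC content of each virus and y as the corresponding Nc’ − Nc difference.

```python
def fit_v_slope(gc_contents, nc_differences):
    """Least-squares estimate of C in y = C * |x - 0.5| (no intercept)."""
    distances = [abs(x - 0.5) for x in gc_contents]
    denominator = sum(u * u for u in distances)
    if denominator == 0:
        raise ValueError("all GC contents equal 0.5; slope is undefined")
    numerator = sum(u * y for u, y in zip(distances, nc_differences))
    return numerator / denominator
```

Because the model has no intercept, it assumes that Nc’ and Nc coincide at a GC content of exactly 0.5, which is the pattern described above for viruses with GC contents near 50%.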

Figure 3

Graphical representation of the relationship between GC content and Nc’ – Nc. The difference between Nc’ and Nc depicted here should reflect the amount of bias due to background base composition. As shown, this value is greatest in viruses with extreme GC contents. Filled squares and open diamonds represent the individual large DNA viruses and small DNA viruses, respectively.

Dinucleotide Biases

Because there have been reports of CpG underrepresentation in RNA and small DNA viruses (Karlin et al. 1994) and because dinucleotide biases can affect codon bias, we next determined the dinucleotide biases for each virus. We measured the actual frequency of each pair of nucleotides relative to the expected frequency (i.e., the product of the frequencies of the individual nucleotides). As previously observed by Karlin et al. (1994), while most large DNA viruses showed no bias against CpG, small DNA viruses were extremely biased against this dinucleotide (Table 2, Fig. 4). This difference in CpG content was highly significant (Student’s t-test: p = 2.3 × 10^−15 when outliers were excluded [see below] and p = 2.2 × 10^−11 when outliers were included) and indicates that in small DNA viruses there is additional mutation pressure away from this doublet which may also impact codon usage. The CpG content of the small papillomaviruses, polyomaviruses, anellovirus, circoviruses, and parvoviruses (excluding the dependoviruses) averaged 47, 22, 67, 62, and 43% of their expected values, respectively. The outliers, the dependoviruses, averaged 82%, which is significantly different from the other small DNA viruses (p = 0.005). In contrast, the large viruses (the adenoviruses, asfarvirus, alphaherpesviruses, betaherpesviruses, unassigned herpesvirus, and poxviruses) had average CpG contents of 78, 87, 98, 115, 108, and 112% of their expected values, respectively. However, among the large DNA viruses there are two distinct exceptions (significant at p = 0.0003) to the general correlation between genome size and CpG content: the Gammaherpesvirinae and the Iridoviridae, which have CpG contents 49 and 70% of their expected values, respectively.

Table 2 Dinucleotide biases of the viral genomes, measured with ρ, the observed frequency of the dinucleotide relative to the expected frequency.

Figure 4

Genomic ρCpG values (observed CpG frequency/expected CpG frequency) of large and small DNA viruses. Small autonomous viruses are depicted by filled diamonds and small dependoviruses by empty diamonds, while most large viruses are depicted by filled squares and the iridoviruses and gammaherpesviruses by open squares.

In those viruses with CpG depletion there were, generally, also elevated levels of TpG (measured as discussed above in order to symmetrize the value over both strands), which would result from a cytosine-to-thymidine mutation. However, the overrepresentation of this pair was slight compared to the underrepresentation of CpG. The genomic CpG and TpG contents were both approximately equivalent to the ORF contents. Finally, and as seen in most organisms, the TpA doublet had a reduced frequency in both large and small DNA viruses, although this underrepresentation was small compared to the level of CpG suppression in the small viral genomes (Table 2). With the exception of those mentioned above, all other dinucleotides were present at approximately the expected frequency.

Comparisons Along Viral Genomes

To determine the extent to which codon biases vary among genes, we selected one virus from each family, subfamily, or distinct genus and examined every gene separately. Because correct gene prediction is difficult for large viruses and results in some annotations that are only putative, we limited our analysis to those genes with assigned functions in the NCBI database. Strikingly, viral genes located in different genomic regions do not differ dramatically in the biases they display (see Supplementary Table).

Furthermore, because large genomes can have very different nucleotide compositions in different regions, we used a sliding window method to examine CpG content along the genomes of the large viruses. While small windows varied in their CpG contents, we found no systematic deviation from the average viral bias in any genomic region (see Supplementary Figure).

Discussion

This study revealed a number of trends in the nucleotide and codon composition among families of vertebrate-infecting DNA viruses. First, the strong correlation between codon usage bias and GC composition indicates that codon usage bias in DNA viruses is primarily explained by overall nucleotide content. In addition, not only are the GC frequencies of each virus similar at nonsynonymous and synonymous codon positions, but these frequencies appear in genes with different genomic positions and functions. These observations suggest that genome-wide mutational pressure is the most important factor shaping patterns of codon usage bias in DNA viruses, rather than natural selection for specific codons. Although nucleotide biases can vary even among closely related viruses, those viruses with similar genomes and life-cycle characteristics most frequently show similar CpG frequencies relative to their GC content. Such generalities, especially the marked CpG deficiency in the genomes of small DNA viruses, along with the relatively high levels of this dinucleotide in the large DNA viruses, point to common evolutionary pressures faced by similar viruses. These observations notwithstanding, it is likely that experimental analyses of DNA viruses, such as artificially altering codon usage or dinucleotide frequencies, are required to fully understand the complex mechanistic basis of these different genomic signatures.

Nucleotide Compositions

It has long been known that viruses may take on a much wider range of GC frequencies than other organisms (Wyatt 1952; Bronson and Anderson 1994). Even viruses within the same family, which have similar replication and life-cycle strategies, can show very different GC contents and hence large differences in codon biases (Schachtel et al. 1991). For example, HSV-2 and VZV (HHV-3) are closely related alphaherpesviruses that infect humans, persist latently in the nervous system of the host, and reactivate to cause secondary infections, although they have different routes of infection and cell tropism. They show GC contents of 71 and 47% and Nc values of 37 and 56, respectively. Yet despite these extreme differences in nucleotide and codon bias, the viruses exhibit almost-identical frequencies of CpG doublets in proportion to their GC contents. Again, this supports the idea that there are common mechanisms which determine CpG frequencies for viruses with similar life cycles.

Another important observation in this context is that many poxviruses, many iridoviruses, and African swine fever virus have low GC contents. From the perspective of the mutation pressure hypothesis this could be viewed as resulting from their cytoplasmic site of replication, particularly if the nucleotide composition in this location differs greatly from that of the cell nucleus (Moss 2001; Moyer and Henderson 1985; Williams 1996; Rojo et al. 1999). Accordingly, the low GC content in most RNA viruses may similarly reflect both their cytoplasmic site of replication and the fact that they, like small DNA viruses, do not encode enzymes that alter dNTP concentrations.

Comparisons of Host-Viral Genome Compositions

Given the role played by mutation pressure in shaping codon and nucleotide biases in both animals and viruses, and because small DNA viruses are replicated, and possibly repaired, by cellular machinery, whereas large DNA viruses generally encode much of their own replicative machinery, it might be expected that the former will show biases more similar to those of host cellular DNA than the latter. Clearly, viral-host genome comparisons are critical in addressing this question. However, because vertebrate genomes are so large and their base compositions differ dramatically depending on the region of the genome examined, it is difficult to make generalized statements regarding host nucleotide biases. Moreover, unlike the situation in bacteria, there is no strong evidence of a link between codon usage bias and gene expression levels in mammals, as expected under tRNA-mediated selection (Sharp and Matassi 1994). This was confirmed during the initial analysis of the human genome sequence in which codon usage bias was found to be largely determined by genome location and hence local mutation pressure (Lander et al. 2001). Consequently, studies that attempt to compare viral codon usage with host codon usage or tRNA availability are likely to be unreliable.

CpG Frequency

Considerable data have been compiled on CpG contents within different regions of the human genome which are useful in comparing the base composition of human viruses with those of their hosts. CpG suppression in most small human DNA viruses falls within the range of that in the human genome, in which ρCpG values of 50-kb stretches range from 0.12 to 0.45 and chromosomal averages range from 0.18 to 0.31 (Karlin and Mrázek 1997; Gentles and Karlin 2001). However, unlike the situation in their hosts, CpG is not suppressed in most large human DNA viruses, where the frequency of this dinucleotide is almost four times as high as for human DNA. At first sight this suggests that CpG bias in the small DNA viruses could be the result of biases intrinsic to the host replicative/repair machinery/processes which act on small viral genomes.

CpG depletion in vertebrates is largely attributed to the methylation/deamination/mutation of cytosines in these doublets. This is partially supported by the overrepresentation of TpG in the human genome. However, this overrepresentation is slight in comparison to the great underrepresentation of CpG, indicating that methylation cannot be the only cause of CpG depletion in humans (Gentles and Karlin 2001). While small DNA viruses also have elevated levels of TpG, this elevation is even less than that in the human genome, which makes methylation/deamination an even more questionable cause of CpG suppression in these viruses. Furthermore, available evidence indicates that there may be little or no methylation in many viral genomes when they are actively replicated or packaged (Acken et al. 1979; Karlin and Burge 1995; Lundberg et al. 2003; Kämmer and Doerfler 1995).

While the common biases in human and small DNA virus genomes may still be due to shared replication machinery within the nucleus, a few factors point away from this explanation. First, it does not explain the human polyomaviruses which have ρCpG levels of 0.05 and 0.07—far below human levels. Second, small DNA viruses may not employ the complete set of host replicative machinery, as at least some of these viruses have mutation rates orders of magnitude greater than those of the host (Truyen et al. 1995; Shackelton et al. 2005). Finally, it does not explain why RNA viruses, which encode their own polymerases and replicate in the cytoplasm, should also have many of the same dinucleotide signatures (Karlin et al. 1994).

It is therefore likely that additional factors influence the suppression of CpG in small DNA viruses. The deficiency may be related to the immunostimulatory properties of unmethylated CpGs, which are recognized by the host’s innate immune system as a pathogen signature (Krieg 2003). These sequences bind and activate Toll-like receptor 9 (TLR9) on neutrophils, dendritic cells, and macrophages, inducing a rapid immune response (Lund et al. 2003; Wagner 2004). While a high CpG content may be detrimental to small DNA (and RNA) viruses, large ones may not be similarly affected because they encode a range of proteins that interfere with cellular pathogen-pattern recognition. For example, vaccinia poxvirus encodes antagonists of TLRs (Harte et al. 2003). It has even been suggested that some complex viruses have evolved mechanisms to activate TLRs to induce the accompanying activation of specific cells (Rassa and Ross 2003). As long as the virus can modulate the antiviral response which accompanies TLR activation, it may induce cellular proliferation for its own replicative advantage.

In contrast to most large viruses, gammaherpesviruses and iridoviruses show a CpG deficiency. This may be because these viruses methylate their genomes at specific times in their life cycle (Ambinder et al. 1999; Willis and Granoff 1980; Wagner et al. 1985), perhaps to protect against immune recognition (Tao and Robertson 2003) and control patterns of transcription. As with vertebrate genomes, methylated viral genomes would face a high chance of mutation at CpGs, which would result in a reduction of this dinucleotide (Ambinder et al. 1999).

The most obvious difference between the CpG-suppressed small DNA viruses and the large DNA viruses is gene number. While the compact genomes of the papillomaviruses, polyomaviruses, anellovirus, circoviruses, and parvoviruses encode only a small number of replication and structural genes, most large viruses encode over 100 genes which, among other things, may allow them to evade adaptive or innate immune detection, suppress antiviral immune reactions, regulate and dominate cellular metabolic machinery, and produce, and compensate for, cellular depletions in metabolic resources (Shackelton and Holmes 2004). Small DNA viruses, which do not encode such an extended array of proteins capable of manipulating host cell metabolism or reducing/delaying host immune responses, may be under strong selection to propagate quickly in a cell with limited resources before host responses resolve the infection. In this case CpG may be selected against not only because of its immunostimulatory effects, but because it may extend the amount of time necessary for viral replication and transcription in the infected cell. Specifically, CpG has the highest stacking energy of any dinucleotide, thereby requiring the greatest amount of free energy to disrupt a double helix (Karlin and Burge 1995; Breslauer et al. 1986).

Finally, the fact that CpG introduces structural abnormalities into the DNA helix (El Antri et al. 1993a, b; Grzeskowiak et al. 1991) may also affect the dinucleotide composition of small DNA viruses. These viruses have few epigenetic mechanisms to protect against helical distortions and the DNA-binding molecules they attract, either of which may interfere with genome recognition or packaging.

Dependoviruses

Dependoviruses are small viruses of the Parvoviridae, so called because they require a helper virus, most often an adenovirus, or in some cases a herpesvirus, for a productive infection. Curiously, these viruses do not show the genomic biases common to other similarly sized DNA viruses, including the closely related autonomous parvoviruses. Instead of CpG depletion, the genomic signature of dependoviruses resembles that of adenoviruses and other large dsDNA viruses. Unlike other parvoviruses, these viruses integrate into the host genome when the helper virus is absent (Muzyczka and Berns 2001). Although the possibly prolonged integrated stage in their life cycle may affect their base composition, helper virus coinfection may also play a role. By replicating in cells alongside these large viruses, dependoviruses may “benefit” from the adenovirus or herpesvirus modulation of the host immune system, control of the cell cycle, and induction or production of supplementary resources. The additional time and resources available for replication and expression, and the suppression of viral detection and immune responses, could result in a dependovirus composition similar to that of its helper virus.