To study the patterns of neutral substitutions in the human genome, we have recently analyzed a large data set of alignments of orthologous noncoding DNA sequences from human, chimpanzee, and baboon (Meunier and Duret 2004). We observed that the base composition of the human genome is not at equilibrium: substitutions from G or C to A or T (hereafter referred to as GC→AT substitutions) are more numerous than AT→GC substitutions. Antezana (2005) has re-analyzed the genomic alignment data that we had compiled. In contradiction to our results, he found that the GC-content of the human genome is close to equilibrium. The explanation he proposed for this discrepancy is that Meunier and Duret “used a malfunctioning dinucleotide-level simulation procedure out of concern for context-dependent mutation effects.” I show here that, in fact, Antezana (2005) used an erroneous procedure to count substitutions, one that ignored the hypermutability of CpG dinucleotides and therefore systematically overestimated the number of AT→GC substitutions.

Antezana (2005) used parsimony to count substitutions in alignments of orthologous human, chimpanzee, and baboon nongenic DNA sequences: substitutions to human or chimpanzee were retrieved from sites at which the baboon base and the base in one of the two nonbaboon sequences were identical but different from the base in the other nonbaboon sequence. It is well established that, because of multiple substitutions, parsimony can give erroneous results when patterns of substitutions are biased (Eyre-Walker 1998). It is also well known that in mammals, CpG dinucleotides are mutational hot spots: the rate of transition (C→T or G→A) at CpG sites is about 10 times higher than at non-CpG sites (Giannelli et al. 1999). Thus, although the average rate of divergence (excluding indels) between human and chimpanzee is 1.2%, the divergence at CpG sites is about 15.2% (CSAC 2005). Hence, as mentioned in our article (Meunier and Duret 2004), homoplasy is frequent at CpG sites, and parsimony must therefore be used with caution.
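The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the code actually used by Antezana (2005); the function name `count_parsimony_substitutions`, the labels `sp1` and `sp2`, and the toy sequences are hypothetical.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def count_parsimony_substitutions(sp1, sp2, outgroup):
    """Count substitutions in the sp1 and sp2 lineages by simple parsimony.

    A site is used only when the outgroup base matches exactly one of the
    two ingroup bases; the other ingroup base is then taken as derived.
    Returns, for each lineage, a Counter keyed by (ancestral, derived).
    """
    counts = {"sp1": Counter(), "sp2": Counter()}
    for a, b, o in zip(sp1.upper(), sp2.upper(), outgroup.upper()):
        if not {a, b, o} <= VALID_BASES:
            continue  # skip gaps and ambiguous bases
        if a == b:
            continue  # no difference between the two ingroup species
        if o == a:
            counts["sp2"][(o, b)] += 1  # change assigned to the sp2 lineage
        elif o == b:
            counts["sp1"][(o, a)] += 1  # change assigned to the sp1 lineage
        # three different bases: the ancestral state cannot be inferred
    return counts

# Toy alignment (hypothetical sequences, for illustration only).
human = "ATCGTTAC"
chimp = "ATTGTCAC"
baboon = "ATTGTTAC"
result = count_parsimony_substitutions(human, chimp, baboon)
```

In this toy alignment, the third column (C in human, T in chimpanzee and baboon) is followed by a G, so the rule records a T→C change in the human lineage at what may in fact be an ancestral CpG site. The rule itself has no way of taking this context into account.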

To illustrate this problem of homoplasy at CpG sites, let us take a simple example, very similar to the real situation in our human/chimp/baboon alignments: two species (species1 and species2) and an outgroup, such that the evolutionary distance at non-CpG sites is 0.01 substitutions/site between species1 and species2 and 0.05 substitutions/site between the outgroup and the two other species (Fig. 1a), and the rate of substitution at CpG sites is 10 times higher than at non-CpG sites. Let us consider a site that corresponds to a T in species1, a C in species2, a T in the outgroup and that is followed by a C, conserved in the three species (Fig. 1b, c). The scenario proposed by the simple parsimony method predicts that the ancestral sequence was TC and that a single T→C substitution occurred in the species2 lineage. The probability of that scenario is 5 × 10⁻³ (Fig. 1b). The second most likely scenario involves two independent substitutions, and is 40 times less likely than the first one (Fig. 1c). Thus, in that situation, the parsimony approach can be considered reliable. Now consider a site that, as in the first example, corresponds to a T in species1, a C in species2, a T in the outgroup but that is followed by a G, conserved in the three species (Fig. 1d, e). As in the previous example, the scenario proposed by the simple parsimony method predicts that a single T→C substitution occurred in the species2 lineage, and the probability of that scenario is 5 × 10⁻³ (Fig. 1d). The alternative scenario predicts that the ancestral sequence was CG (i.e., a CpG site), and that two independent C→T substitutions occurred in the species1 lineage and in the outgroup. Because the rate of substitution is 10 times higher at CpG sites, this scenario (that involves two C→T substitutions) is 2.5 times more likely than the one predicted by the simple parsimony approach (that predicts a single T→C substitution).
In other words, the parsimony approach used by Antezana (2005) systematically overestimates the number of AT→GC substitutions, because of homoplasy at CpG sites, and this of course leads to overestimation of the equilibrium GC content.
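The arithmetic of this example can be checked with a short Python sketch. It approximates the probability of a substitution on a branch by the branch length, which is reasonable only for short branches. The ingroup branch length of 0.005 is half of the 0.01 distance given in the text. The outgroup branch length of 0.025 is an assumption: it is the value that reproduces the 40-fold ratio quoted above, and Fig. 1a is not reproduced here. All names are hypothetical.

```python
# Branch lengths in substitutions per site at non-CpG positions.
INGROUP_BRANCH = 0.005    # half of the 0.01 species1-species2 distance
OUTGROUP_BRANCH = 0.025   # assumption: reproduces the stated 40-fold ratio
CPG_RATE_FACTOR = 10      # CpG sites evolve 10 times faster (from the text)

def scenario_probabilities(ancestral_is_cpg):
    """Return (single, double) approximate scenario probabilities.

    single: one T->C substitution in the species2 lineage (parsimony).
    double: ancestral C, with independent C->T substitutions in the
            species1 lineage and in the outgroup lineage.
    """
    # A T->C change at a TpC or TpG site is never a CpG substitution.
    single = INGROUP_BRANCH
    # C->T changes are accelerated only if the ancestral site was CpG.
    factor = CPG_RATE_FACTOR if ancestral_is_cpg else 1
    double = (INGROUP_BRANCH * factor) * (OUTGROUP_BRANCH * factor)
    return single, double

# Site followed by C (TC/CC/TC): parsimony is reliable.
single_tc, double_tc = scenario_probabilities(ancestral_is_cpg=False)
ratio_non_cpg = single_tc / double_tc   # parsimony scenario favoured

# Site followed by G (TG/CG/TG): the two-substitution scenario wins.
single_tg, double_tg = scenario_probabilities(ancestral_is_cpg=True)
ratio_cpg = double_tg / single_tg       # alternative scenario favoured
```

Under these assumptions `ratio_non_cpg` is about 40 and `ratio_cpg` is about 2.5, matching the values given for Fig. 1. Because the two-substitution scenario contains two accelerated events, its probability grows with the square of the CpG rate factor, which is why a 10-fold rate difference is enough to reverse the ranking.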

Figure 1
Illustration of the artifact of the maximum parsimony method to count substitutions at CpG sites. The phylogeny of the three species used to infer substitutions is shown in (a). Branch lengths indicate the rate of substitution per site at non-CpG positions. Substitution rates at CpG sites are considered 10 times higher than at non-CpG sites. The first alignment (TC/CC/TC) corresponds to a situation where the parsimony method is reliable: the most parsimonious scenario (one single substitution) (b) is 40 times more likely than the first alternative scenario (c). The second alignment (TG/CG/TG) corresponds to a situation where the parsimony method is not reliable: the most parsimonious scenario (one single substitution) (d) is 2.5 times less likely than the alternative scenario that involves two independent substitutions at CpG sites (e).

This artifact of the parsimony method is a major problem even for very closely related species. Indeed, substitutions at CpG sites constitute 25% of all substitutions observed between human and chimpanzee (CSAC 2005). This is the reason why, as clearly mentioned in our article, we took care to analyze separately CpG and non-CpG sites and to exclude those sites for which the ancestral state was unsure (Meunier and Duret 2004). This analysis showed that GC→AT substitutions clearly outnumber AT→GC substitutions, even if only non-CpG sites are considered (Table 1 in Meunier and Duret 2004). The excess of GC→AT over AT→GC substitutions is more pronounced in GC-rich isochores than in GC-poor isochores. These observations have led to the conclusion that there is an overall decrease of the GC-content of GC-rich isochores in the human genome, which we have called the “erosion” of GC-rich isochores.

Antezana (2005) also used the same parsimony method to analyze the pattern of substitutions in homologous coding regions from human, mouse, and rat. Given the evolutionary distance between primates and rodents, even non-CpG sites are affected by homoplasy (the average synonymous substitution rate between primates and rodents is about 0.6 substitutions per site; Waterston et al. 2002). Hence, the numbers of substitutions inferred by Antezana (2005) in the rodent lineages are clearly unreliable.

There is another problem in the article by Antezana (2005): the method he used to compute the equilibrium GC-content (GC*) assumes that all sites evolve independently (i.e., that the probability of substitution at a given base does not depend on the nature of flanking bases). It is well established that in reality this assumption is not correct, and that the strongest neighboring effect is by far that of CpGs (Hess et al. 1994). Indeed, the frequency of CpGs in the human genome is only about 23% of what would be expected if all sites were evolving independently (Bird 1980). If sequences were at equilibrium, then the procedure used by Antezana would have given the correct estimate of GC*. However, when sequences are not at equilibrium, it is necessary to use more realistic models of DNA sequence evolution with neighbor-dependent mutations, such as the one proposed by Arndt et al. (2003a). This is clearly shown in a recent paper by Arndt and Hwa (2005), who investigated the impact of taking neighbor-dependent nucleotide substitution processes into account on the estimates of substitution rates and of GC*.
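For illustration, a standard way to estimate GC* under the assumption of independent sites is GC* = u / (u + v), where u is the AT→GC substitution rate per AT site and v is the GC→AT rate per GC site. The Python sketch below uses this textbook formula; it is not necessarily the exact procedure of Antezana (2005), and the function name and all counts are hypothetical values chosen only to show the direction of the bias.

```python
def equilibrium_gc(n_at_to_gc, n_gc_to_at, n_at_sites, n_gc_sites):
    """Equilibrium GC content assuming every site evolves independently.

    u = AT->GC substitutions per AT site, v = GC->AT substitutions per
    GC site; at equilibrium the two fluxes balance, so GC* = u / (u + v).
    """
    u = n_at_to_gc / n_at_sites
    v = n_gc_to_at / n_gc_sites
    return u / (u + v)

# Hypothetical counts, for illustration only (not data from any study).
N_AT_SITES = 600
N_GC_SITES = 400

# Counts as they might be if substitutions were correctly polarized.
gc_star_reference = equilibrium_gc(30, 40, N_AT_SITES, N_GC_SITES)

# Same data after 10 GC->AT events are miscounted as AT->GC events,
# mimicking the effect of homoplasy at CpG sites under parsimony.
gc_star_biased = equilibrium_gc(40, 30, N_AT_SITES, N_GC_SITES)
```

With these made-up counts the biased estimate is higher than the reference one: misassigning GC→AT events as AT→GC events raises u and lowers v at the same time. The formula also carries no term for the nature of flanking bases, which is the limitation discussed in this paragraph.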

It should be stressed that the erosion of GC-rich isochores in the genomes of primates and rodents had been previously demonstrated by many independent works. This erosion was first observed by analyzing patterns of substitutions in transposable elements in the human genome (Lander et al. 2001; Arndt et al. 2003b, 2005). It might be argued that the pattern of substitution in repeated sequences does not perfectly reflect the evolution of unique DNA. Indeed, it has been shown in mammals that the rate of substitution at CpG sites is higher in repeated sequences than in unique DNA, most probably because of a higher level of methylation (Kricker et al. 1992; Meunier et al. 2005). However, in those repeated sequences, even non-CpG sites show an excess of GC→AT over AT→GC substitutions (Fig. 1 in Meunier et al. 2005). Moreover, it has been shown that this pattern is not restricted to repetitive DNA, but is also observed at synonymous sites of exons, not only in humans (Duret et al. 2002), but also in rodents (Duret et al. 2002; Smith and Eyre-Walker 2002) and cetartiodactyls (Duret et al. 2002). The latter result was criticized because the cetartiodactyl species that we had analyzed were too distantly related and, therefore, the parsimony approach that we had used was not reliable (Alvarez-Valin et al. 2004). Indeed, the analysis of synonymous substitutions by a maximum likelihood approach confirmed the erosion of GC-rich isochores in primates and in rodents, but not in cetartiodactyls (Belle et al. 2004). The erosion of GC-rich isochores in primates was again confirmed by the analysis of substitutions in introns and intergenic regions (Webster et al. 2003; Meunier and Duret 2004). This erosion of GC-rich isochores has also been noted in carnivores, but not in lagomorphs or perissodactyls (Belle et al. 2004).

In conclusion, there is ample evidence for an erosion of GC-rich isochores in rodents and primates. The assertion made by Antezana (2005) that the GC content of their genomes is close to equilibrium is based on an erroneous count of substitutions and an inappropriate method to estimate the equilibrium GC content. This paper illustrates again the fact that even with very closely related species, parsimony should be used with caution and that it is essential to take into account neighbor-dependent mutations if we want to understand the evolution of genomes.