Introduction

The identification of protein-coding genes evolving by adaptive natural selection can show the fundamental ways in which organisms adapt to their environment. One of the clearest signatures of positive selection in the coding region of genes is an excess of nonsynonymous substitutions per site (d N) relative to synonymous substitutions per site (d S), i.e., d N/d S > 1 (Hill and Hastie 1987; Hughes and Nei 1988). This test for positive selection can be applied either to single-copy orthologs from multiple species or to duplicated paralogs within a species. To address both the role of duplicate genes in organismal adaptation and the role of natural selection in maintaining duplicated genes, it is necessary to test for the signature of positive selection among paralogs (Bielawski and Yang 2003).

Different methods have been developed to estimate d N/d S, from simple counting methods (e.g., Nei and Gojobori 1986) to more complex, and more sensitive, codon-substitution models that rely on likelihood calculations (Muse and Gaut 1994; Nielsen and Yang 1998; Yang et al. 2000). CODEML, which is implemented in the PAML suite of programs, is one of the most popular likelihood tools used to estimate d N/d S (Yang 2007). CODEML allows pairs of nested models with and without positive selection to be tested in a likelihood ratio framework to determine if adaptive evolution has occurred. Furthermore, CODEML implements an empiric Bayes approach to identify individual codons undergoing adaptive evolution (Yang et al. 2005). A growing number of genes evolving under positive selection, including duplicated genes, have been discovered using these methods (Birtle et al. 2005; Des Marais and Rausher 2008; Hahn et al. 2007a, b).

The models implemented in CODEML follow the assumption that branch lengths and the topology of the phylogenetic tree do not vary across the sequences of interest. Recombination causes variation in branch lengths across a sequence, and although the models implemented in CODEML allow for variation in selective constraint across a sequence (i.e., d N/d S), they assume constant synonymous distances (d S). Likewise, CODEML calculates the likelihood of the data over a single prespecified tree topology; however, recombination changes the topology from one base to the next. As a result of violating basic assumptions of the underlying model, analyses of recombining sequences show incorrect signatures of positive selection (Anisimova et al. 2003; Scheffler et al. 2006; Shriner et al. 2003). However, it is not known whether the branch-length or the topology assumption is more sensitive to violation (Anisimova et al. 2003).

Although paralogs do not recombine in the same manner as allelic sequences, “ectopic” gene conversion among paralogs can result in the exchange of sequence among duplicated genes. Gene conversion is the nonreciprocal exchange between a “donor” sequence and an “acceptor” sequence and represents one of the most common outcomes of double-stranded breaks between two homologous sequences (Chen et al. 2007; Li 1997; Slightom et al. 1980). Ectopic gene conversion has been documented in a plethora of organisms, including bacteria, plants, fungi, and metazoans (Drouin et al. 1999; Gerton et al. 2000; Mondragon-Palomino and Gaut 2005; Nielsen et al. 2003; Santoyo and Romero 2005; Semple and Wolfe 1999). Gene conversion can violate some of the same assumptions that cause PAML to incorrectly infer positive selection in the presence of recombination. However, because it is relatively common to analyze only pairs of paralogous sequences, there can be no violations of the assumption of constant tree topology in these cases. It may therefore be true that rates of false-positive inferences of natural selection are much lower when analyzing paralogs. In this study, we carried out extensive simulations to examine the rate and causes of false-positive results when considering gene conversion between paralogous sequences.

Methods

Sequence data sets were generated by Monte Carlo simulations using the EVOLVER program of the PAML 4 package (Yang 2007). All data sets were simulated without positive selection but instead with two site classes (d N/d S = 0 and d N/d S = 1), both with frequency 0.5. A uniform codon frequency of 1/61 was applied and the transition-to-transversion rate ratio was set to κ = 2.

Two groups of data sets were built to examine the effect of the number of sequences included in analyses. The first group consists of data sets formed by simulating 3 coding sequences of 500 codons replicated 1000 times, with 5 different tree lengths. Pairwise distances between ingroup sequences, represented by d S values, were fixed at 0.02, 0.04, 0.1, 0.2, and 0.4; these distances represent common divergence values between paralogs analyzed in the literature (e.g., Han et al. 2009). The third sequence represented the outgroup (from the same genome) and was arbitrarily set to have twice the distance from each ingroup sequence as the distance between ingroup sequences. Artificially converted data sets were built from the first group of replicates as follows: converted tracts of 50, 167, and 250 codons (i.e., 1/10, 1/3, and 1/2 of the total sequence length) were transferred from a donor to an acceptor sequence, starting from the 100th codon. These conversion tract lengths are also representative of lengths seen in nature (Benovoy and Drouin 2009; Chen et al. 2007; Gerton et al. 2000; Semple and Wolfe 1999). Four different experimental conditions were established as described later in the text.

In the second group, each data set was represented by 100 replicates of 10 coding sequences with 500 codons. For these data sets, we used the tree shown in Supplementary Fig. 1. Gene conversion was simulated between genes at different genetic distances using the second sequence as the acceptor and sequences 1, 3, or 6 as donor. Converted tracts of 50, 167, and 250 codons were transferred from each donor to the acceptor sequence, starting from the 100th codon.
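The artificial conversion step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure actually used to build the data sets: the function name `convert_tract`, its parameters, and the toy sequences are hypothetical, and it assumes that "starting from the 100th codon" is a 1-based codon position in an ungapped, in-frame alignment.

```python
def convert_tract(donor: str, acceptor: str,
                  start_codon: int = 100, tract_codons: int = 50) -> str:
    """Return a copy of `acceptor` in which a tract of codons is overwritten by `donor`.

    Assumption: `start_codon` is 1-based and both sequences are aligned,
    ungapped, and in frame (length is a multiple of 3).
    """
    if len(donor) != len(acceptor) or len(donor) % 3 != 0:
        raise ValueError("sequences must have equal length that is a multiple of 3")
    start = (start_codon - 1) * 3        # 1-based codon -> 0-based nucleotide index
    end = start + tract_codons * 3       # exclusive end of the converted tract
    if start < 0 or end > len(donor):
        raise ValueError("conversion tract falls outside the sequence")
    # Nonreciprocal transfer: only the acceptor changes, the donor is untouched.
    return acceptor[:start] + donor[start:end] + acceptor[end:]


# Toy 500-codon sequences (hypothetical data, for illustration only).
donor = "AAA" * 500
acceptor = "CCC" * 500
converted = convert_tract(donor, acceptor, start_codon=100, tract_codons=167)
```

With these toy inputs, codons 100 through 266 of `converted` come from the donor and the remainder is unchanged acceptor sequence.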

Positively selected sequences and codons were detected using the CODEML program of the PAML 4 package (Yang 2007). Two different sets of models that allow d N/d S to vary among sites were compared: (1) the M1a and M2a models and (2) the M7 and M8 models. Model M1a allows the site classes d N/d S = 1 and 0 < d N/d S < 1, whereas model M2a has the same site classes as M1a plus a third class with d N/d S > 1 (Nielsen and Yang 1998; Wong et al. 2004; Yang et al. 2000, 2005). Model M7 includes several site classes with d N/d S ratios following the beta distribution B(p,q), whereas model M8 extends model M7 with a further class with d N/d S > 1. Likelihood ratio tests (LRTs) were carried out between models M1a/M2a and between models M7/M8 as described by Yang (2007).
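As a rough illustration of how a likelihood ratio test between two nested site models is evaluated, the sketch below computes a p-value from a pair of log-likelihoods. It assumes the conventional choice of a chi-square distribution with 2 degrees of freedom for the M1a/M2a and M7/M8 comparisons; the text here defers to Yang (2007) for the exact procedure, so the degrees of freedom, the function name `lrt_pvalue_df2`, and the example log-likelihood values are assumptions rather than details taken from this study.

```python
import math


def lrt_pvalue_df2(lnl_null: float, lnl_alt: float) -> float:
    """P-value of a likelihood ratio test, assuming a chi-square null with 2 df.

    `lnl_null` and `lnl_alt` are the maximized log-likelihoods of the null
    (e.g., M1a or M7) and alternative (e.g., M2a or M8) models.
    """
    # Test statistic: twice the log-likelihood difference. Small negative
    # values caused by optimizer noise are clipped to zero.
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For 2 degrees of freedom the chi-square survival function has the
    # closed form exp(-x / 2), so no statistics library is required.
    return math.exp(-stat / 2.0)


# Hypothetical log-likelihoods, for illustration only.
p_value = lrt_pvalue_df2(lnl_null=-4321.7, lnl_alt=-4318.2)
significant = p_value < 0.05
```

A replicate simulated without positive selection that nevertheless yields `significant == True` would count as one false-positive result.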

For “branch-site” analysis, we used the same data sets produced by the EVOLVER program and ran CODEML with the parameters specified in the PAML 4 manual to perform “test 2” (Yang 2007; Zhang et al. 2005). In the alternative hypothesis, we fixed initial d N/d S = 1.5. As suggested in the PAML 4 documentation, we performed the branch-site analysis with different initial values (analyses performed with an initial value of d N/d S = 5 did not diverge significantly from these outcomes).

Results and Discussion

Rate of False-Positive Results in Site Models with Gene Conversion

To determine the false-positive rate in the presence of gene conversion, we simulated protein-coding sequences evolving without positive selection and introduced conversion tracts. Sequences of length 500 codons were generated using the EVOLVER program in PAML (Yang 2007), with an average d N/d S = 0.5 across sites (see Methods). Gene conversion was simulated by copying fragments of different length (50, 167, and 250 codons) from one sequence to another. Each tree was initially simulated with 3 sequences and then subjected to 1 of 4 main treatments (Fig. 1): (I) conversion occurred between the two ingroup sequences, and only these two sequences were tested for positive selection; (II) conversion occurred between the two ingroup sequences, but all three sequences were included in the test for selection; (III) the outgroup sequence converted one of the ingroup sequences, but only the two ingroup sequences were tested for positive selection; or (IV) the outgroup sequence converted one of the ingroup sequences, and all of the sequences were included in the test for selection. Each treatment was simulated 1000 times for each of 5 different values of d S and each conversion tract length. To estimate the false-positive rate for each experimental condition, we tested each simulated alignment for positive selection using likelihood ratio tests between two different sets of “site” models implemented in CODEML (M1a/M2a and M7/M8). For comparison, we also estimated the false-positive rate in equivalent nonconverted data sets.
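Because every replicate is simulated without positive selection, the false-positive rate for a given condition is simply the fraction of replicates whose likelihood ratio test is significant. A minimal sketch of that tally follows; the function name `false_positive_rate` and the example p-values are hypothetical, and the 0.05 threshold is the significance level referred to in the text.

```python
def false_positive_rate(p_values, alpha: float = 0.05) -> float:
    """Fraction of replicates in which the null model is rejected at level `alpha`.

    Assumption: every replicate was simulated without positive selection,
    so each rejection counts as a false positive.
    """
    p_values = list(p_values)
    if not p_values:
        raise ValueError("need at least one replicate")
    rejected = sum(1 for p in p_values if p < alpha)
    return rejected / len(p_values)


# Hypothetical p-values for 1000 replicates of one condition.
example_p = [0.01] * 50 + [0.50] * 950
rate = false_positive_rate(example_p)
```

In this made-up example 50 of 1000 replicates are significant, which corresponds to the nominal 5% rate expected in the absence of conversion.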

Fig. 1

Scheme of experimental conditions used in this study. All four conditions are shown, with the arrow indicating the direction of gene conversion

Our analysis shows that gene conversion can lead to a moderate increase in the proportion of genes erroneously identified as undergoing adaptive evolution (Fig. 2). Generally, the number of false-positive results increases with the genetic distance (d S) between sequences and with the length of converted tracts, whereas the different sets of models (M1a/M2a or M7/M8) seem to produce similar outcomes (Fig. 2 and Supplementary Fig. 2). Larger conversion tracts and larger distances between paralogs (before conversion) result in higher false-positive rates, possibly because there is greater disparity in branch length among sites when these two values grow larger. Large conversion tracts between distant paralogs are often the easiest to identify (Sawyer 1989), which may make it easier to avoid these false-positive results (see later text).

Fig. 2

Percentage of false-positive results in site models versus the pairwise genetic distance of ingroup sequences. Different experimental conditions using models M1a-M2a are compared (see text for details). Noconv = data sets with no conversion. Note that the y-axis in the four panels is not on the same scale

We found that differences among the experimental conditions (i.e., conditions I through IV) were a major factor in determining false-positive rates. The outgroup-to-ingroup–converted data sets (conditions III and IV) showed at most the expected proportion of false-positive results at p < 0.05 (approximately 5%; Fig. 2c and d). However, ingroup-to-ingroup conversions (conditions I and II) had rates of type I error up to almost 33% (Fig. 2a and b; Supplementary Fig. 2A and B). This result was unexpected because conversion among ingroup sequences will not change the inferred tree topology. Based on previous results (Anisimova et al. 2003), we expected outgroup-to-ingroup conversion to have higher false-positive rates because it produces contrasting relations across different parts of the acceptor (ingroup) sequence.

In addition, experimental conditions in which only the two ingroup sequences were included in tests for positive selection (conditions I and III; Fig. 2a and c) had higher type I error rates compared with conditions that included an outgroup sequence (conditions II and IV; Fig. 2b and d). These outcomes were also unexpected because there is no possible way to change the topology of a tree that includes only two sequences.

One possible explanation for the increased rates of false-positive results mentioned previously is that the accuracy of the likelihood ratio test tends to be low for data sets with few sequences (Anisimova et al. 2001, 2002). Therefore, we performed a similar analysis on a data set with 10 sequences, with 3 possible simulated gene-conversion events between sequences at increasing genetic distance (see Methods and Supplementary Fig. 1). We then compared type I error rates between data sets with 10 sequences and the previously described data sets with 3 sequences using replicates with equal or similar pairwise genetic distances between acceptor and donor sequences. Conversion between the two close paralogs 1 and 2 (1→2) in the tree with 10 sequences generated <5% false-positive results, similar to the rate recovered from conditions I and II (pairwise d S = 0.04 between ingroup sequences) in data sets with only 3 sequences (Fig. 3a; Supplementary Fig. 3B). In the second scenario involving the larger tree, we recreated a transfer from sequence 3 (the closest outgroup to paralogs 1 and 2) to sequence 2. The number of false-positive results in this case is slightly higher than in the corresponding replicates of conditions III and IV (pairwise d S = 0.04 between ingroup sequences) described previously, but it is still not much greater than the expected 5% for either model comparison (Fig. 3b; Supplementary Fig. 3B). The third simulated conversion involved a transfer from sequence 6 to sequence 2 (6→2). Given the pairwise d S = 0.18 between donor and acceptor sequence, we compared these replicates with data sets with pairwise ingroup-to-ingroup distance of d S = 0.1 (conditions III and IV) and d S = 0.2 (conditions I and II; Fig. 3c). In the latter scenario, the number of false-positive results is at least three times higher in replicates with 6→2 conversion than in data sets with only 2 or 3 sequences for conversion tracts of 167 and 250 codons (Fig. 3c; Supplementary Fig. 3C). This is in agreement with the reported results from LRTs between codon models using trees of different size with recombination among sequences (Anisimova et al. 2003).

Fig. 3

Percentage of false-positive results in site models in data sets with 2, 3, and 10 sequences using models M1a-M2a. Sequence 2 in the tree is the fixed acceptor sequence, and donor sequences are sequence 1 (1→2), 3 (3→2), and 6 (6→2). Noconv = data sets with no conversion. Results obtained using different codon models and conversion tract lengths are shown (see text for further details). The pairwise d S value between donor and acceptor sequences in each data set is shown. 2seq = condition I; 3seq = condition II; 10seq = data sets with 10 sequences; 2seq 3→2 = condition III; 3seq 3→2 = condition IV; 2seq 1→2 = condition I; 3seq 1→2 = condition II. Note that the y-axis in the three panels is not on the same scale

Gene conversion between paralogs may occur repeatedly and at different times, producing an acceptor gene that is a mosaic of sequences with different genetic distances from the donor gene(s). Because our simulations thus far have only considered extremely recent conversion events, and only one event per paralog, we further investigated the effects of these processes on the accuracy of CODEML. We again generated data sets with three sequences and simulated either one or two ingroup-to-ingroup conversion events occurring at different times since their split (see Methods and Supplementary Table 1). As observed with the other data sets (Figs. 2 and 3), the number of false-positive results was >5% only for the largest genetic distance between the sequences of the tree (pairwise d S = 0.4 between ingroup sequences) and was higher when the outgroup sequence was removed from the analysis of positive selection, whether there was one (Fig. 4a and b) or multiple conversion events (Fig. 4c and d). Both sets of results also show that there are a larger number of false-positive results the more recently the conversion event occurred, regardless of the models being compared (Fig. 4; Supplementary Fig. 4). Finally, we found little difference in the number of false-positive results between data sets simulated with one or two conversion events, except for the highest divergence between ingroup paralogs, where data sets with two events showed approximately 5% more false-positive results than replicates with only one event (Fig. 4; Supplementary Fig. 4).

Fig. 4

Percentage of false-positive results in site models depending on the age and number of conversion events. Different times of conversion, experimental conditions, and models are compared (see text and Supplementary Table 1 for details). The pairwise d S value between ingroup sequences is shown on the x-axis. a = old conversion; b = recent conversion; c = new conversion; noconv = data sets with no conversion. Note that the y-axis in the two panels is not on the same scale

Rate of False-Positive Results in Branch-Site Models with Gene Conversion

Together with “site” models such as M1a/M2a and M7/M8, CODEML also implements methods to look for positive selection on individual codons along specific branches of a phylogenetic tree (Yang and Nielsen 2002; Zhang et al. 2005). These methods, also referred to as “branch-site” models, require subdivision of the tree into foreground and background lineages. A model allowing positive selection on foreground branches is compared by a likelihood ratio test with a second model that assumes no positive selection on these branches (see Methods). Given this a priori requirement, we could only examine replicates containing the outgroup sequence (conditions II and IV), testing one or the other ingroup sequence as the foreground branch in two independent analyses.

Type I error rates obtained from the LRT of the two branch-site models range from 0% to 13% (Fig. 5; Supplementary Fig. 5). As observed for the site models, these rates increase with the genetic distance and length of converted tracts. The largest difference in the proportion of false-positive results is seen between condition II (ingroup-to-ingroup conversion) and condition IV (outgroup-to-ingroup conversion). We used both ingroup sequences as foreground branches, with the acceptor sequence of the simulated gene conversion always denoted as ingroup branch 2 (“b2”). In condition II, there is little difference in the rate of false-positive results when using either branch 1 (“b1”) or branch 2 as the foreground lineage (Fig. 5a). This makes sense because the two lineages have been at least partly homogenized by gene conversion. In contrast, in condition IV there is a significantly higher rate of false-positive results when using branch 1 as the foreground lineage (Fig. 5b). Because the outgroup branch and branch 2 are homogenized in condition IV, branch 1 (which is unaffected by conversion) will appear to be evolving at a much higher rate. This heterogeneity in branch lengths may cause the higher rate of false-positive results. Overall, changes in the tree topology seem to affect the specificity of branch-site methods in PAML when recombination occurs between background lineages. In general, branch-site models lead to lower type I error rates compared with site models, possibly because of their decreased sensitivity (Zhang et al. 2005).

Fig. 5

Percentage of false-positive results in branch-site models versus the pairwise genetic distance of ingroup sequences for experimental condition II (a) and IV (b). “b1” and “b2” represent foreground ingroup branches 1 and 2, respectively. Results from replicates with simulated conversion tracts of 250 codons are shown. Noconv = data sets with no conversion. Note that the y-axis in the two panels is not on the same scale

Proportion of False-Positive Sites in Paralogs with Gene Conversion

CODEML site models that include parameters allowing positive selection (M2a and M8) also include two Bayesian estimations of codons evolving under positive selection using either the naïve empiric Bayes (NEB) or the Bayes empiric Bayes (BEB) algorithms. NEB does not account for sampling errors and is rather inaccurate, especially for small data sets with highly similar sequences (Yang 2007; Yang et al. 2005); therefore, we used only the results from the BEB method to infer the extent of type I error in detecting sites undergoing adaptive evolution, considering only sites with Bayesian confidence levels ≥95% (Yang et al. 2005).
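The site-level criterion described above can be sketched as follows. The function takes one posterior probability per codon (the probability of belonging to the d N/d S > 1 class) and returns the codons at or above the 95% threshold together with their share of all codons. The function name `beb_false_positive_fraction`, the input format, and the example values are hypothetical; parsing of actual CODEML output is not shown.

```python
def beb_false_positive_fraction(posteriors, threshold: float = 0.95):
    """Return (1-based positions of flagged codons, fraction of codons flagged).

    Assumption: the alignment was simulated without positive selection,
    so every codon with posterior probability >= `threshold` is a false positive.
    """
    posteriors = list(posteriors)
    if not posteriors:
        raise ValueError("need at least one codon")
    flagged = [pos for pos, p in enumerate(posteriors, start=1) if p >= threshold]
    return flagged, len(flagged) / len(posteriors)


# Hypothetical posterior probabilities for a 500-codon alignment.
example_posteriors = [0.10] * 498 + [0.97, 0.95]
sites, fraction = beb_false_positive_fraction(example_posteriors)
```

Splitting the flagged positions by whether they fall inside or outside the conversion tract would give the converted versus nonconverted comparison discussed in this section.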

Overall, the BEB method produces few false-positive results, not exceeding 0.084% of all codons. However, given that the BEB method is conservative (Yang et al. 2005) and that only a few “positive” sites are identified even in the presence of adaptive evolution, we examined which factors most strongly affect the distribution of such false-positive results. In this analysis, experimental condition I is not examined because converted regions between ingroup sequences are perfectly identical and show no false-positive results.

Experimental conditions are one of the most prominent factors shaping the BEB type I error, especially when converted and nonconverted regions are compared. In replicates including the outgroup sequence (conditions II and IV), BEB false-positive results in nonconverted regions increase with d S, but only when model M8 is used, whereas the length of converted tracts seems to have only a minor effect (Supplementary Fig. 6A and 6C). Converted regions show a few BEB false-positive results regardless of genetic distance, conversion tract length, and codon models. In experimental condition III (outgroup-to-ingroup conversion; only ingroup sequences analyzed), higher type I error rates are associated with converted regions, especially for longer converted tracts at d S = 0.02, and using model M8 (Supplementary Fig. 6B).

Unlike the LRT results across whole sequences (i.e., M1a/M2a and M7/M8 comparisons), BEB predictions are based on single-codon estimates of the numbers of synonymous and nonsynonymous substitutions; therefore, they are not significantly affected by changes in the phylogenetic tree topology introduced by recombination. In agreement with this, we noticed that the number of false-positive results, considering all BEB sites, is influenced by the length of converted tracts, codon model, and d S values in different ways than are the LRTs. LRT type I error rates increase with d S and the length of conversion tracts for each experimental condition (Fig. 2; Supplementary Fig. 2). BEB false-positive results tend to be higher at extreme d S values (0.02 and 0.4; see also Arbiza et al. 2006) with model M8 and when the outgroup sequence is included (Supplementary Fig. 6). Although this analysis showed a highly variable number of false-positive results predicted by the BEB method, these numbers are always rather low, as noted by Yang et al. (2005).

Conclusion

Our results demonstrate that inferences of adaptive evolution in duplicated genes by the models implemented in CODEML can have moderately high type I error rates (up to approximately 33%) when conversion occurs between duplicated genes. Our results also suggest that using an outgroup sequence can increase specificity of the analysis when site methods are used, whereas this approach may produce the opposite effect with branch-site methods. In addition, larger gene trees negatively affect the accuracy of site models in predicting adaptive selection in the presence of conversion, especially when donor and acceptor sequences are more distantly related and when conversion tracts are long. Overall, such results imply that erroneous inferences of positive selection between paralogs due to gene conversion can be limited by using one outgroup sequence, even if this sequence is another paralog from the same genome. This approach is likely more effective than using large trees because large trees will also inevitably have more chances to harbor genes that have undergone conversion events. Importantly, the highest rates of false-positive results occur in exactly those conditions where gene conversion is easiest to detect (i.e., long conversion tracts and high d S). This indicates that it will be relatively easy to exclude converted sequences from analyses of positive selection and therefore avoid an unnecessarily high proportion of false-positive results.