Introduction

Gene duplication, which can give rise to new genes encoding proteins with new functions, is believed to have played an important role in the evolutionary diversification of organisms (Nei 1969; Ohno 1970; Li 1982; Hughes 1994). There is evidence that gene duplication occurs continually over evolutionary time (Lynch and Conery 2000; Friedman and Hughes 2003). Only certain duplicate genes are actually retained in the genome, while the others are eventually lost. If a duplicate gene can assume a new function beneficial to the organism, it is more likely that it will be retained (Hughes 1994; Lynch et al. 2001). Thus, positive Darwinian selection may frequently be involved in the fixation of duplicate genes that have undergone beneficial mutations (Hughes 1999a).

It has often been difficult to obtain evidence that positive selection has acted on mutations leading to the functional differentiation of duplicate genes. A useful approach to testing for positive selection involves comparing the number of synonymous nucleotide substitutions per synonymous site (d S ) with the number of nonsynonymous nucleotide substitutions per nonsynonymous site (d N ) (Hughes and Nei 1988). In a number of cases, this approach has provided evidence of positive selection diversifying duplicate genes at the amino acid level (e.g., Hill and Hastie 1987; Tanaka and Nei 1989; Hughes 1999b, 2002; Hughes et al. 2000).

However, it seems unlikely that this approach will be able to detect positive selection in many cases involving multigene families. First, since positive selection is likely to be focused only on certain functional regions of the protein, this approach works best in cases in which structural and functional information is available (Hughes 1999a). Moreover, positive selection favoring specialization of duplicate genes may typically occur over a short time frame. Once the proteins encoded by two duplicate genes have become specialized for distinct functions, new amino acid changes may no longer be favored (Hughes 1999a). If so, purifying selection will again predominate, and eventually d S will overtake d N . There is evidence of such an evolutionary process in plots of d N vs. d S in pairwise comparisons among members of a variety of gene families. In such plots, d N often exceeds d S when d S is low, and thus the genes compared have a recent common ancestor, but d S exceeds d N in more distant comparisons (e.g., Tanaka and Nei 1989; Hughes et al. 2000).

An additional correlate of functional divergence between two duplicated genes might be inequality (asymmetry) of the rates of nonsynonymous substitution in the two genes; such a pattern might indicate that one of the two genes has adopted a new function, whereas the other gene has retained a function closer to that of the ancestral gene. Some recent studies have made use of genomic data to survey for nonsynonymous rate asymmetry between duplicate genes. Kondrashov et al. (2002), in a study of 101 paralogs, found that significant nonsynonymous rate asymmetry occurred in only 5 cases. On the other hand, in analyses of Saccharomyces cerevisiae, Schizosaccharomyces pombe, and Drosophila melanogaster, Conant and Wagner (2003) found significant nonsynonymous rate asymmetry in 22 of 80 duplicate gene pairs. Furthermore, Zhang et al. (2003) found significant rate asymmetry at the amino acid level in 145 of 250 human duplicate gene pairs. Although all of these authors used different methods to test for rate asymmetry, none applied any correction for multiple tests. Thus, the true significance levels in these studies remain unclear.

Here we study pairs of paralogous genes in the human and mouse genomes that have arisen by gene duplication since the most recent common ancestor of the two species (estimated to have occurred about 110 million years ago; Kumar and Hedges 1998). We employ robust approaches to test for rate asymmetry between paralogs, with an emphasis on methods that control for multiple testing. We apply simple regression-based methods that take into account the variability in the entire data set, with the goal of identifying gene pairs likely to have diversified functionally as a result of positive selection.

Methods

Sequence Data

The genomic data for human (version 16.33) and mouse (version 16.30) were obtained from Ensembl (http://www.ensembl.org). The version numbers refer to the database software (Ensembl version 16) and the assembly of the genomic sequences (NCBI versions 33 and 30). Genes predicted by Ensembl have been curated and verified by similarity with homologs discovered experimentally (Clamp et al. 2003). The numbers of annotated protein-coding genes were 32,035 for human and 32,911 for mouse. After removal of genes that were shorter than 100 bases or longer than 300,000 bases, the total gene sets were 29,606 in human and 32,296 in mouse. Further curation to remove overlapping loci resulted in a count of 20,387 genes in human and 23,222 genes in mouse.

Protein families were identified by homology and a single-linkage method employed by the BLASTCLUST software available in the BLAST software package (Altschul et al. 1997). Sequence homology was established by identifying matches using a conservative E-value of 10⁻⁶ with a minimum of 30% sequence identity across at least 50% of the length of two sequences. The single-linkage method assembles larger families by linking shared genes among families, thus ensuring that a given gene will be assigned to only one family. To identify recent duplicates, we chose families with exactly three members and at least one member from each of the two species. From these families, we selected those cases in which the number of synonymous substitutions per synonymous site (d S ) between the two conspecific genes was lower than that for either between-species comparison.
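The selection rule for recent duplicates can be sketched as follows. This is only a minimal illustration under stated assumptions, not the pipeline used in the study: the `Family` record, its field names, and the example d S values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Family:
    """Hypothetical record for one gene family (all names are illustrative)."""
    species: tuple       # species label of each member
    ds_paralogs: float   # dS between the two conspecific genes
    ds_ortho_1: float    # dS between paralog 1 and the gene from the other species
    ds_ortho_2: float    # dS between paralog 2 and the gene from the other species

def is_recent_duplicate(fam: Family) -> bool:
    """Apply the three criteria described in the text to one family."""
    # Exactly three members
    if len(fam.species) != 3:
        return False
    # At least one member from each species (so the split is 2 + 1)
    if set(fam.species) != {"human", "mouse"}:
        return False
    # Conspecific dS must be lower than both between-species comparisons
    return fam.ds_paralogs < min(fam.ds_ortho_1, fam.ds_ortho_2)

# Made-up example values, for illustration only
families = [
    Family(("human", "human", "mouse"), 0.10, 0.55, 0.60),  # duplication after the split
    Family(("mouse", "mouse", "human"), 0.70, 0.50, 0.65),  # paralogs older than the split
    Family(("human", "human", "human"), 0.10, 0.20, 0.30),  # one species only
]
recent = [f for f in families if is_recent_duplicate(f)]
print(len(recent), "of", len(families), "families retained")
```

Only the first toy family passes all three criteria; the second fails the d S comparison and the third lacks a member from the other species.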

There is evidence that even recently duplicated genes can be chimeras as a result of exon shuffling (Katju and Lynch 2003). In order to rule out chimeric genes with marked differences between regions with respect to the extent of sequence divergence, we computed the proportion of amino acid difference in a window of 30 aligned amino acid residues along each pair of paralogs. One pair of paralogs showed a strong difference in sequence similarity between N-terminal and C-terminal regions, and this pair was found to correspond to a known chimeric gene (Paulding et al. 2003). Therefore, this gene family was excluded from the analysis. The resulting data set contained 316 families in which a gene duplication occurred in human (119 families) or in mouse (197 families) after the last common ancestor of human and mouse. The data are available from http://www.biol.sc.edu/~austin/.
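A windowed divergence profile of this kind might be computed as in the sketch below. Several details are assumptions rather than facts from the analysis: the window is advanced one residue at a time, alignment columns containing a gap are skipped, and the function name and toy sequences are hypothetical.

```python
def windowed_p_distance(seq_a: str, seq_b: str, window: int = 30) -> list:
    """Proportion of differing residues in each window of two aligned sequences.

    Columns with a gap ('-') in either sequence are skipped (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences have a residue
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    diffs = [int(a != b) for a, b in columns]
    # Slide the window one residue at a time (step size is an assumption)
    return [
        sum(diffs[i:i + window]) / window
        for i in range(len(diffs) - window + 1)
    ]

# Toy alignment with a short window so the contrast is visible:
# identical N-terminal part, divergent C-terminal part
profile = windowed_p_distance("MKTAYIAKQRQISFV", "MKTAYIAKQRAASWL", window=5)
print(profile[0], profile[-1])
```

A pair whose profile is near zero at one end and high at the other, as in this toy case, is the kind of pattern that would prompt a check for a chimeric origin.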

Statistical Analyses

Homologous sequences were aligned at the amino acid level using the CLUSTAL W program (Thompson et al. 1994), and this alignment was imposed on the DNA sequences. The number of synonymous nucleotide substitutions per synonymous site (d S ) and the number of nonsynonymous nucleotide substitutions per nonsynonymous site (d N ) were estimated by a maximum likelihood method (Yang and Nielsen 2000) using the software package PAML (Yang 1997). We used Tajima’s (1993) method to test the hypothesis that duplicate gene pairs evolved at equal rates at the amino acid level, i.e., to test for rate asymmetry at the amino acid level. This test has an advantage over some other methods that have been used for such relative-rate tests because it is not model-dependent (Tajima 1993). However, the test statistic could not be computed in 32 of 316 families, either because the amino acid sequences were too similar or because they were too divergent. We also used approaches to identifying rate asymmetry based on linear regression, which have the advantage of taking into account stochastic error in the entire data set.
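For readers unfamiliar with the relative rate test, the sketch below shows our understanding of its simplest form: count the sites at which only the first paralog differs from the other two sequences (m1) and the sites at which only the second paralog differs (m2), and compare (m1 − m2)²/(m1 + m2) with a chi-square distribution with one degree of freedom. The function name, the gap handling, and the toy sequences are assumptions made for illustration.

```python
import math

def tajima_relative_rate(seq1, seq2, outgroup):
    """Relative rate test on three aligned sequences (sketch).

    Returns (m1, m2, chi2, p). chi2 and p are None when m1 + m2 == 0,
    i.e. when there are no informative sites and the statistic is undefined.
    """
    m1 = m2 = 0
    for a, b, c in zip(seq1, seq2, outgroup):
        if "-" in (a, b, c):
            continue  # skipping gapped columns is an assumption
        if a != b and b == c:
            m1 += 1   # change unique to sequence 1
        elif a != b and a == c:
            m2 += 1   # change unique to sequence 2
        # sites where all three differ, or where only the outgroup differs, are ignored
    if m1 + m2 == 0:
        return m1, m2, None, None
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper-tail probability of a chi-square variate with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p

# Toy sequences: four changes unique to paralog 1, one unique to paralog 2
m1, m2, chi2, p = tajima_relative_rate("AYDEWGHVKM", "ACNEFGHIKL", "ACDEFGHIKL")
print(m1, m2, round(chi2, 2), round(p, 3))
```

The undefined case (m1 + m2 = 0) corresponds to sequences that are too similar to provide any informative sites.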

Results

Relative Rate Tests

Tajima’s (1993) relative rate test statistic could be computed for 284 duplicate gene pairs. Using a Bonferroni-corrected simultaneous significance level, 102 (35.9%) pairs showed a significant rate asymmetry in amino acid sequence evolution at the 5% level. The proportions of gene pairs showing rate asymmetry were similar for the two species: 41 of 109 (37.6%) in human and 60 of 175 (34.3%) in mouse. At the 1% level (Bonferroni-corrected), 81 of 284 (28.5%) gene pairs showed significant rate asymmetry, 35 of 109 (32.1%) in human and 46 of 175 (26.3%) in mouse. Thus despite the use of a conservative statistical test, our results showed a high frequency of rate asymmetry, comparable to other studies (Conant and Wagner 2003; Zhang et al. 2003).
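As a reminder of what the Bonferroni correction implies here, the sketch below computes the per-test threshold for 284 tests. It assumes the standard form of the correction (per-test level equal to the overall level divided by the number of tests); the function name and the example p-values are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall level at alpha."""
    return alpha / n_tests

# 284 relative rate tests, overall 5% level
threshold = bonferroni_threshold(0.05, 284)
print(f"per-test threshold: {threshold:.2e}")

# Made-up p-values, for illustration only
p_values = [1e-6, 1e-3, 0.04]
flags = [p <= threshold for p in p_values]
print(flags)
```

Under this correction a single test must reach roughly p ≤ 0.00018 to count as significant at the overall 5% level, which is why the test is described as conservative.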

In order to understand the factors contributing to a significant relative rate test, we compared duplicate gene pairs with statistically significant (at the 5% level) evidence of asymmetry in the rate of amino acid evolution with those showing no evidence of asymmetry. We found that gene pairs with significant asymmetry encoded significantly longer polypeptides on average (Table 1). The mean length of pairs with significant asymmetry was 354.5 residues (range, 102–1134), while the mean length of pairs without significant asymmetry was 253.9 residues (range, 52–962). A plausible explanation for the difference in mean length between the two groups is that the power of Tajima’s (1993) test to detect a rate difference increases as the number of sites increases. Pairs with significant asymmetry also had greater mean values of both the number of synonymous substitutions per synonymous site (d S ) and the number of nonsynonymous substitutions per nonsynonymous site (d N ) (Table 1). These results also are best explained as reflecting the power of the test to detect rate differences, which is likely to increase as the number of differences between the sequences increases.

Table 1 Means (±SE) of variables comparing duplicate gene pairs without significant evidence of amino acid evolution rate asymmetry and those with significant evidence of rate asymmetry

Thus, these results suggested that evidence of a significant rate asymmetry was largely a function of the statistical power to detect a rate difference. As a consequence, applying such a test to individual gene pairs may not be the optimal approach if the goal is to identify cases of exceptional rate asymmetry, which may be indicative of functional divergence between duplicated genes. Instead, we used a number of approaches to identify duplicate gene pairs whose divergence at the amino acid level was unusually high in comparison to other gene pairs in the data set.

Regression Methods

The approach we chose involved conducting linear regression analyses and identifying outliers from the linear trend (as evidenced by high standardized residuals). First, we conducted a linear regression of d N between one duplicate and the reference sequence (the other species) against d N between the other duplicate and the reference sequence. We refer to these values as d N1 and d N2 (Fig. 1). Because the order in which the two duplicates were compared to the reference was arbitrary, we used the absolute value of the standardized residual from this regression as an indicator of cases where the absolute difference between d N1 and d N2 was unusually large and thus there was asymmetry in the nonsynonymous rate. The mean absolute value of the standard residuals from the regression of d N1 vs. d N2 was significantly higher in cases where Tajima’s test showed significant rate asymmetry than in cases where Tajima’s test showed no rate asymmetry (Table 1). Thus, gene pairs showing high absolute values of the standard residuals from the regression of d N1 vs. d N2 seemed good candidates for functional divergence between duplicates. There were 14 cases (9 in human, 5 in mouse) with standard residuals ≥2.0 in absolute value (Table 2).
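The outlier screen described above can be sketched as follows. We assume here that a standardized residual is the raw residual divided by the residual standard error of the fit; statistical packages differ on this definition, and the one used in the analysis is not stated. The function name and the d N values are hypothetical.

```python
import math

def standardized_residuals(x, y):
    """Ordinary least-squares fit of y on x; returns slope, intercept, and
    residuals divided by the residual standard error (n - 2 d.f.)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, [r / s for r in residuals]

# Made-up dN values: symmetric pairs except the sixth, where one copy
# has diverged far more from the reference than the other
dn2 = [0.02, 0.04, 0.06, 0.08, 0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24]
dn1 = [0.02, 0.04, 0.06, 0.08, 0.10, 0.30, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24]

slope, intercept, z = standardized_residuals(dn2, dn1)
# Flag pairs whose standardized residual is at least 2.0 in absolute value
flagged = [i for i, zi in enumerate(z) if abs(zi) >= 2.0]
print(flagged)
```

Because the residual standard error is estimated from all pairs, a pair is flagged only when its asymmetry is large relative to the scatter in the whole data set, not merely because its sequences are long.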

Table 2 Duplicate gene pairs with large (≥2.0) absolute values of the standard residual from the regression of d N1 vs. d N2
Figure 1

Plot of d N1 (comparison of duplicate gene 1 vs. reference) vs. d N2 (comparison of duplicate gene 2 vs. reference). The linear regression line was Y = −0.00134 + 0.887X (R² = 0.789, p < 0.001).

In addition, we conducted regression of d N1/d S1 vs. d N2/d S2 and examined the absolute values of the standard residuals. In this analysis also, the mean absolute value of the standard residual was higher in cases where the Tajima test showed significant rate asymmetry than in cases where that test did not show significant rate asymmetry (Table 1).

In order to determine which factors were most associated with significant evidence of rate asymmetry in Tajima’s (1993) test, we computed partial correlations between each of the variables whose means are summarized in Table 1 and the chi-square statistic for Tajima’s test (Table 3). Only two variables showed significant partial correlations with the chi-square statistic when controlling simultaneously for all other variables: the number of codons in the sequence and the absolute value of the standard residual from the regression of d N1 vs. d N2 (Table 3). The significant partial correlation in the case of the latter variable showed that this variable is associated with a significant result of Tajima’s (1993) test independent of the increase in power of that test as a function of increased sequence length and increased sequence divergence.
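A partial correlation that controls simultaneously for all other variables can be obtained from the inverse of the correlation matrix, as in the sketch below. This is a generic illustration of the technique, not the software used for Table 3; the function names, variable names, and data are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length data columns."""
    n = len(columns[0])
    centered = [[v - sum(c) / n for v in c] for c in columns]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(k)] for i in range(k)]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]

def partial_correlation(columns, i, j):
    """Correlation of columns i and j, controlling for all other columns."""
    prec = invert(correlation_matrix(columns))
    return -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])

# Made-up values: test statistic, sequence length, absolute residual
chi2_stat = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0]
codons = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
abs_resid = [0.2, 0.1, 0.4, 0.3, 0.6, 0.5]
r_partial = partial_correlation([chi2_stat, codons, abs_resid], 0, 2)
print(round(r_partial, 3))
```

With three variables this agrees with the familiar first-order formula; with more variables the same function controls for all remaining columns at once.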

Table 3 Partial correlations between selected variables and the chi-square statistic for Tajima’s (1993) relative rate test, simultaneously controlling for all other variables

Synonymous and Nonsynonymous Substitutions

Figure 2 illustrates the number of synonymous substitutions per synonymous site (d S ) and the number of nonsynonymous substitutions per nonsynonymous site (d N ) in comparisons between duplicate pairs. Overall, mean d S (0.139 ± 0.011 SE) was significantly greater than mean d N (0.073 ± 0.006) (paired t-test; p < 0.001). However, in 36 (11.4%) of 316 gene pairs, d N exceeded d S , while in 14 gene pairs (4.4%) d N and d S were equal. Only in two cases did d N exceed d S significantly (at the 5% level) by the widely used z-test. In both of these cases no synonymous substitutions were observed. The two gene pairs involved were the following: (1) a gene pair encoding proteins of unknown function (Ensembl I.D. ENSP00000295653 and ENSP00000303012) with d N = 0.0077 ± 0.0031 and (2) two genes encoding proteins related to pro-melanin-concentrating hormone (Ensembl I.D. ENSP00000323682 and ENSP00000295326) with d N = 0.0306 ± 0.0126. Because of the lack of synonymous substitutions, both of these cases evidently represented very recent duplicates. Also, it is worth pointing out that if the z-test results were corrected for multiple tests, the z-test would no longer be significant in either of these cases.
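The arithmetic behind the last point can be illustrated with the two d N estimates reported above. The sketch makes several assumptions that are not stated in the text: the standard error of d S is taken as zero when no synonymous substitutions are observed, the test is treated as one-tailed, and the Bonferroni correction is applied over all 316 gene pairs. The function name is hypothetical.

```python
import math

def z_test_dn_gt_ds(dn, se_dn, ds, se_ds):
    """z statistic for dN - dS and its one-tailed normal p-value."""
    z = (dn - ds) / math.sqrt(se_dn ** 2 + se_ds ** 2)
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_tailed

# dN and SE as reported in the text; dS = 0 with SE assumed to be 0
pairs = {
    "unknown function": (0.0077, 0.0031),
    "pro-MCH related": (0.0306, 0.0126),
}
n_tests = 316  # assumed number of tests for the Bonferroni correction

results = {}
for label, (dn, se) in pairs.items():
    z, p = z_test_dn_gt_ds(dn, se, ds=0.0, se_ds=0.0)
    results[label] = (z, p)
    print(f"{label}: z = {z:.2f}, p = {p:.4f}, "
          f"passes Bonferroni: {p <= 0.05 / n_tests}")
```

Under these assumptions both z values exceed the uncorrected 5% critical value, but neither p-value approaches the corrected threshold of 0.05/316; the same conclusion holds if the test is treated as two-tailed.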

Figure 2

Plot of d N vs. d S in the comparison of duplicate gene pairs. The line is a 45° line. The linear regression line was Y = 0.0115 + 0.443X (R² = 0.722, p < 0.001).

In order to identify other cases with unusually high d N with respect to d S , we searched for large positive standard residuals from the regression of d N vs. d S (Fig. 2). There were 10 gene pairs (5 from human, 5 from mouse) in which the standard residual was ≥2.0 (Table 4). In two of these pairs, d N was greater than d S although not significantly so by the z-test (Table 4). Furthermore, six of these gene pairs also showed unusually high absolute values of the standard residual from the regression of d N1 vs. d N2 (Tables 2 and 4). These six gene pairs thus show both unusually high d N relative to d S , as indicated by high standard residuals (Table 4), and strong nonsynonymous rate asymmetry. As a consequence, these six gene pairs seem to be good initial candidates for duplicates that have diverged functionally.

Table 4 Duplicate gene pairs with large (≥2.0) values of the standard residual from the regression d N vs. d S

Discussion

Relative rate tests (Wu and Li 1985; Tajima 1993; Takezaki et al. 1995) have been widely used to test the hypothesis of equality of the rate of molecular evolution between sequences (or groups of sequences) by comparison to an outgroup or reference. Such tests typically rely on the assumption that all sites evolve independently. As a consequence, it is expected that, as the number of sites examined increases, the power of the test to detect rate asymmetry will increase, so that even small rate differences can reach statistical significance. However, such small rate differences may actually result from stochastic error and thus may not be biologically meaningful. Consistent with the theoretical prediction that the power of these tests increases as the number of sites examined increases, we found that the number of sites examined was a good predictor of statistical significance in Tajima’s (1993) test, even when a very conservative correction for multiple testing was applied (Tables 1 and 3).

In studies whose goal is to identify cases where duplicate gene pairs may have diverged as a result of adaptation to distinct functions, it seems preferable to use approaches that take into account the variance across gene pairs. In the present paper, we analyzed data on recently duplicated gene pairs in mammals using robust approaches based on linear regression. We show that these approaches can be used to search for gene pairs whose divergence at the amino acid level is unusually high in comparison to others in the data set.

One of these approaches is based on the absolute values of the standard residuals from the regression of d N in the comparison of one duplicate gene with the reference sequence against d N in the comparison of the other duplicate gene with the reference sequence. Partial correlation analysis showed that the absolute value of the standard residuals from this regression was significantly correlated with the chi-square statistic in Tajima’s (1993) test, independent of the effect of sequence length (Table 3). An additional approach was based on the standard residuals from the regression of d N vs. d S between the two duplicated genes. Interestingly, six gene pairs were identified by both of these methods (Tables 2 and 4). These gene pairs seem the best candidates in our data set for functionally diverged duplicate gene pairs.

Computation of d S and d N over the entire coding region of a gene can rarely provide a meaningful test of the hypothesis of positive Darwinian selection, because such selection typically acts only on a limited region involved in the function that is under selection (Hughes 1999a). In the present data set, only two gene pairs showed a significantly greater value of d N than d S for the entire gene by the commonly used z-test. However, in both of these cases, no synonymous substitutions were observed, and d N was quite low. These cases may represent positive selection that occurred soon after gene duplication. On the other hand, the difference between d N and d S may be due to stochastic error. Additional information on the structure of the proteins encoded by these genes will be needed to definitively rule out the latter possibility.

Of the 10 cases in which the standard residuals from the regression of d N against d S were unusually large, d N exceeded d S in only 2, and in neither of these cases was the difference significant by the z-test (Table 4). On the other hand, these 10 cases were identified by a method that takes into account the variance in d S and d N over the entire data set. Such cases may actually be at least as plausible candidates for positive selection as the two cases in which d N exceeded d S significantly by the z-test. In searching for cases of adaptive evolution at the molecular level, it may be preferable to employ a two-step procedure: (1) using regression of d N vs. d S , identify cases with an unusually high d N for a given d S ; (2) using structural information, identify functionally important regions of these molecules and compute d N and d S separately in each region. In the present data set, this approach identified six duplicate gene pairs as good candidates for functional divergence. Detailed structural information was lacking for these six duplicate pairs, but further application of this approach may uncover candidates with known structure and may eventually inspire structural studies on genes whose duplication and functional divergence may have played an important role in the evolution of biological processes.